1. Why Merge Data?
In today’s data-driven world, it’s pretty rare to find all the necessary information in one single file. Most often, data is scattered across multiple tables and files, and to analyze it properly, you need to merge it. For example, you might have sales data for each month in separate files. To analyze overall trends, you need to merge this data into one table.
Additionally, merging data helps to:
- Save time on manually assembling reports.
- Reduce errors caused by manual data transfer.
- Automate the data analysis workflow.
2. Methods of Merging Data
Data Concatenation
Vertical and horizontal merging of data is called concatenation.
Let’s start with a simple example: we’ve got two DataFrame
objects that we want to merge.
import pandas as pd
# Creating two DataFrames with similar structure
data1 = {
'Product': ['Apples', 'Bananas'],
'Sales': [100, 150]
}
data2 = {
'Product': ['Cherries', 'Dates'],
'Sales': [75, 200]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Vertical concatenation (adding rows)
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
As a result, we get a combined DataFrame
containing all rows from both sources.
Notice the ignore_index=True
parameter, which resets the indices in the resulting DataFrame
.
Merging Data Using the merge
Method
If you need to merge data based on common columns (like id
or Product
),
the merge
method is your friend. It’s like a date for your data — they need to have something
in common to meet up.
# Creating sample data
sales_data = {
'Product': ['Apples', 'Bananas', 'Cherries'],
'Sales': [100, 150, 75]
}
price_data = {
'Product': ['Apples', 'Bananas', 'Dates'],
'Price': [1.2, 0.8, 3.0]
}
sales_df = pd.DataFrame(sales_data)
price_df = pd.DataFrame(price_data)
# Merging DataFrames on the 'Product' column
merged_df = pd.merge(sales_df, price_df, on='Product', how='inner')
print(merged_df)
Even though we have information for each product, merge
only returns rows
where there’s a match on the key column Product
.
Let’s break down the types of joins:
inner
(default): returns only rows present in bothDataFrame
objects.outer
: returns all rows, filling in missing values withNaN
.left
: returns all rows from the leftDataFrame
and matching rows from the right.right
: returns all rows from the rightDataFrame
and matching rows from the left.
Merging Data from Multiple Sheets
Now that we’ve refreshed how merging works in pandas, let’s apply that knowledge and combine data from multiple sheets within an Excel file.
# Reading an Excel file with multiple sheets
excel_file = pd.ExcelFile('sales_data.xlsx')
# List of all sheet names
sheet_names = excel_file.sheet_names
# Reading data from each sheet into individual DataFrames and storing them in a list
dataframes = [pd.read_excel(excel_file, sheet_name=sheet_name) for sheet_name in sheet_names]
# Performing vertical concatenation of all DataFrames
all_sales_data = pd.concat(dataframes, ignore_index=True)
print(all_sales_data)
Here we read all the sheets from an Excel file into separate DataFrame
objects, then merge them into one.
Pretty slick, right?
Merging Data from Multiple Files
In real-world scenarios, you might have data spread across multiple files. Let’s see how to automate the process of merging them.
import glob
# Path to files
file_path = r'data/sales_*.xlsx'
# Get a list of all files matching the pattern
files = glob.glob(file_path)
# Reading data from each file into individual DataFrames and storing them in a list
dataframes = [pd.read_excel(file) for file in files]
# Performing vertical concatenation of all DataFrames
all_files_data = pd.concat(dataframes, ignore_index=True)
print(all_files_data)
We use the glob
library to get a list of files matching the specified pattern,
then merge data from each file into a single DataFrame
.
3. Real-Life Use Cases
Data Merging for Business Analysis
Imagine you’re an analyst working with monthly sales data stored in separate files. Using the techniques above, you can easily combine the data to analyze yearly trends and gain valuable insights.
Automating Reports
By automating data merging, you can set up a script to gather information from different sources and generate daily, weekly, or monthly reports automatically.
Processing Data from Various Sources
Your data might come from different systems, with each source contributing a piece of the puzzle. Using pandas, you can bring it all together to get the full picture, like combining customer data with their purchases.
4. Common Errors and Fixes
A common mistake is mismatched column names in the DataFrame
objects being merged.
Make sure the columns you’re merging on actually match, and use the left_on
and right_on
parameters in the merge
method if necessary.
Sometimes, data might contain missing values after merging.
Use methods like fillna
to fill missing values or dropna
to remove them.
If your data contains duplicates, the concat
method might result in a longer DataFrame
than expected.
Use the drop_duplicates
method for cleanup.
Now that we’ve covered the methods and tricks for merging data with pandas, you’re equipped to tackle real-world data automation and analysis tasks. Put this knowledge to good use!
GO TO FULL VERSION