CodeGym /Java Course /Python SELF EN /Merging Data from Multiple Sheets and Excel Files

Merging Data from Multiple Sheets and Excel Files

Python SELF EN
Level 28 , Lesson 1
Available

1. Why Merge Data?

In today’s data-driven world, it’s pretty rare to find all the necessary information in one single file. Most often, data is scattered across multiple tables and files, and to analyze it properly, you need to merge it. For example, you might have sales data for each month in separate files. To analyze overall trends, you need to merge this data into one table.

Additionally, merging data helps to:

  • Save time on manually assembling reports.
  • Reduce errors caused by manual data transfer.
  • Automate the data analysis workflow.

2. Methods of Merging Data

Data Concatenation

Vertical and horizontal merging of data is called concatenation. Let’s start with a simple example: we’ve got two DataFrame objects that we want to merge.

Python

import pandas as pd

# Creating two DataFrames with similar structure
data1 = {
    'Product': ['Apples', 'Bananas'],
    'Sales': [100, 150]
}

data2 = {
    'Product': ['Cherries', 'Dates'],
    'Sales': [75, 200]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Vertical concatenation (adding rows)
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)

As a result, we get a combined DataFrame containing all rows from both sources. Notice the ignore_index=True parameter, which resets the indices in the resulting DataFrame.

Merging Data Using the merge Method

If you need to merge data based on common columns (like id or Product), the merge method is your friend. It’s like a date for your data — they need to have something in common to meet up.

Python

# Creating sample data
sales_data = {
    'Product': ['Apples', 'Bananas', 'Cherries'],
    'Sales': [100, 150, 75]
}

price_data = {
    'Product': ['Apples', 'Bananas', 'Dates'],
    'Price': [1.2, 0.8, 3.0]
}

sales_df = pd.DataFrame(sales_data)
price_df = pd.DataFrame(price_data)

# Merging DataFrames on the 'Product' column
merged_df = pd.merge(sales_df, price_df, on='Product', how='inner')
print(merged_df)

Even though we have information for each product, merge only returns rows where there’s a match on the key column Product.

Let’s break down the types of joins:

  • inner (default): returns only rows present in both DataFrame objects.
  • outer: returns all rows, filling in missing values with NaN.
  • left: returns all rows from the left DataFrame and matching rows from the right.
  • right: returns all rows from the right DataFrame and matching rows from the left.

Merging Data from Multiple Sheets

Now that we’ve refreshed how merging works in pandas, let’s apply that knowledge and combine data from multiple sheets within an Excel file.

Python

# Reading an Excel file with multiple sheets
excel_file = pd.ExcelFile('sales_data.xlsx')

# List of all sheet names
sheet_names = excel_file.sheet_names

# Reading data from each sheet into individual DataFrames and storing them in a list
dataframes = [pd.read_excel(excel_file, sheet_name=sheet_name) for sheet_name in sheet_names]

# Performing vertical concatenation of all DataFrames
all_sales_data = pd.concat(dataframes, ignore_index=True)

print(all_sales_data)

Here we read all the sheets from an Excel file into separate DataFrame objects, then merge them into one. Pretty slick, right?

Merging Data from Multiple Files

In real-world scenarios, you might have data spread across multiple files. Let’s see how to automate the process of merging them.

Python

import glob

# Path to files
file_path = r'data/sales_*.xlsx'

# Get a list of all files matching the pattern
files = glob.glob(file_path)

# Reading data from each file into individual DataFrames and storing them in a list
dataframes = [pd.read_excel(file) for file in files]

# Performing vertical concatenation of all DataFrames
all_files_data = pd.concat(dataframes, ignore_index=True)

print(all_files_data)

We use the glob library to get a list of files matching the specified pattern, then merge data from each file into a single DataFrame.

3. Real-Life Use Cases

Data Merging for Business Analysis

Imagine you’re an analyst working with monthly sales data stored in separate files. Using the techniques above, you can easily combine the data to analyze yearly trends and gain valuable insights.

Automating Reports

By automating data merging, you can set up a script to gather information from different sources and generate daily, weekly, or monthly reports automatically.

Processing Data from Various Sources

Your data might come from different systems, with each source contributing a piece of the puzzle. Using pandas, you can bring it all together to get the full picture, like combining customer data with their purchases.

4. Common Errors and Fixes

A common mistake is mismatched column names in the DataFrame objects being merged. Make sure the columns you’re merging on actually match, and use the left_on and right_on parameters in the merge method if necessary.

Sometimes, data might contain missing values after merging. Use methods like fillna to fill missing values or dropna to remove them.

If your data contains duplicates, the concat method might result in a longer DataFrame than expected. Use the drop_duplicates method for cleanup.

Now that we’ve covered the methods and tricks for merging data with pandas, you’re equipped to tackle real-world data automation and analysis tasks. Put this knowledge to good use!

1
Task
Python SELF EN, level 28, lesson 1
Locked
Basic data merging in pandas
Basic data merging in pandas
2
Task
Python SELF EN, level 28, lesson 1
Locked
Merging Data from Multiple Sheets
Merging Data from Multiple Sheets
3
Task
Python SELF EN, level 28, lesson 1
Locked
Merging using merge
Merging using merge
4
Task
Python SELF EN, level 28, lesson 1
Locked
Automated Data Merging from Multiple Files
Automated Data Merging from Multiple Files
Comments
TO VIEW ALL COMMENTS OR TO MAKE A COMMENT,
GO TO FULL VERSION