CodeGym /Courses /Python SELF EN /Formatting Data and Cleaning Tables of Extra Stuff

Formatting Data and Cleaning Tables of Extra Stuff

Python SELF EN
Level 28 , Lesson 2
Available

1. Why is this important?

Imagine you're a data factory engineer. Your job is to analyze and organize raw material (data) to get a finished product (insights, reports). Without cleaning and formatting, your data will stay like a pile of scattered parts that are hard to put together.

Clean and formatted data isn't just about aesthetics and readability but is essential for proper analysis and model building. Badly processed data can lead to incorrect conclusions. And let's be real, no one wants to end the work day realizing they calculated the average employee age in "tiny data fragments" instead of years.

Main Formatting Tasks

Before diving into the code, let’s break down the problems we aim to avoid and solve with formatting:

  • Removing empty rows and columns: They are like empty pockets—useless but misleading.
  • Removing duplicates: Duplicate entries add noise to our data.
  • Filling missing values: It's like filling holes in cheese; sometimes, you need to fill them with something tasty.
  • Converting data types: It's crucial to ensure that age is a number, not text. Don't confuse real age with "25 years of work experience" written in quotes.

2. Tools for Cleaning and Formatting Data

Now that we know why we need this, let’s look at tools that can help us, like a Swiss army knife on a camping trip:

  • Removing empty values: dropna()
  • Working with duplicates: duplicated() and drop_duplicates()
  • Filling in missing data: fillna()
  • Converting data types: astype()

Removing Empty Values

Empty values—one of those things that always show up at the wrong time. With dropna(), we can get rid of them like old, useless socks. Let’s check out an example:

Python

import pandas as pd

# Creating a DataFrame with missing values
data = {'Name': ['Anna', 'Ivan', None, 'Maria'],
        'Age': [29, None, 34, 23],
        'City': ['Moscow', 'Saint Petersburg', 'Novosibirsk', None]}

df = pd.DataFrame(data)

# Removing rows with any NaN values
cleaned_df = df.dropna()

print(cleaned_df)

This code removes all rows containing at least one empty value.

Working with Duplicates

Sometimes your data has duplicates, like in a "Spot the Difference" game where there are no differences. We can filter out such entries using drop_duplicates().

Python

# Creating a DataFrame with duplicates
data = {'Name': ['Anna', 'Ivan', 'Anna', 'Maria'],
        'Age': [29, 34, 29, 23]}

df = pd.DataFrame(data)

# Removing duplicates
unique_df = df.drop_duplicates()

print(unique_df)

Filling in Missing Data

Filling in missing values can be done in different ways—for instance, with the average value or even a fixed value. The fillna() method will help you with this:

Python

# Filling missing values in 'Age' with the mean of the column
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df)

Converting Data Types

Converting data is sometimes necessary to avoid getting trapped by text values where numbers are expected:

Python

# Converting the 'Age' column to an integer type
df['Age'] = df['Age'].astype(int)

print(df)

3. Practical Application: Cleaning Real Datasets

Let’s try cleaning and formatting a real dataset for reporting. For example, let’s take a hypothetical Excel file with sales information.

Python

# Loading data from Excel
filename = "sales_data.xlsx"
df = pd.read_excel(filename)

# Removing rows with missing data in any column
df = df.dropna()

# Removing duplicate records
df = df.drop_duplicates()

# Filling missing 'Sales' values with the mean of that column
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())

# Converting data types
df['Product Code'] = df['Product Code'].astype(str)

# Saving cleaned data back to Excel
df.to_excel("cleaned_sales_data.xlsx", index=False)

As a result, we get a neat and clean Excel file ready for analysis and reporting.

4. Common Mistakes in Data Cleaning

Like fixing electronics, you can mess up something with data, so here are a few common mistakes to avoid.

Removing useful information while removing duplicates. Sometimes duplicates signal an error in the data and deserve more investigation. Don’t mix up data types. It might seem simple to convert formats, but it can lead to data loss if you don’t check whether the conversion is valid.

By following these tips and taking the suggested steps, you can easily and enjoyably work with data, even if it looks like an analyst's nightmare at first. And remember, like your code, your data should be clean!

1
Task
Python SELF EN, level 28, lesson 2
Locked
Removing empty rows and duplicates
Removing empty rows and duplicates
2
Task
Python SELF EN, level 28, lesson 2
Locked
Filling missing values
Filling missing values
3
Task
Python SELF EN, level 28, lesson 2
Locked
Data Type Conversion
Data Type Conversion
4
Task
Python SELF EN, level 28, lesson 2
Locked
Cleaning and Formatting Data from Excel
Cleaning and Formatting Data from Excel
Comments
TO VIEW ALL COMMENTS OR TO MAKE A COMMENT,
GO TO FULL VERSION