
Building Histograms to Visualize Data Distribution

Python SELF EN
Level 41 , Lesson 3

1. Basics of Building Histograms

If you've ever looked at a buffet table and tried to figure out which type of snacks are most common, you already kind of get data distribution. In programming, we use histograms to uncover patterns in data that might not be obvious at first glance. Histograms help us visually analyze how data is distributed across specific categories or numerical ranges. Let's dive in!

What is a Histogram?

A histogram is a type of chart that visualizes the distribution of data across specific intervals, or "bins" (sometimes called "buckets" or "containers"). For example, if we want to know how often students scored a certain number of points on a test, a histogram shows this at a glance.

Main Parameters of a Histogram

A histogram is made up of bars, one per bin, where the height of each bar shows how many values fall into the corresponding interval. The main parameters of a histogram are:

  • Bins (bins): the number of equal-width intervals into which the data range is divided, or an explicit sequence of bin edges.
  • Color and Edgecolor (color and edgecolor): determine the fill color of the bars and the color of their outlines.
  • Range (range): sets the lower and upper limits of the bins; values outside this range are ignored.
  • Density (density): if set to True, the histogram is normalized so that the total area of the bars equals one.
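
The bins and range parameters can be explored without drawing anything. Here is a minimal sketch, assuming NumPy's np.histogram() function, which computes counts and bin edges from the same kind of bins and range arguments. The scores list is made-up illustration data, not part of the lesson:

```python
import numpy as np

# Hypothetical test scores, made up for illustration
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 80, 84, 88, 91, 97]

# bins=5 splits the interval given by range into 5 equal-width bins;
# range=(50, 100) fixes the lower and upper limits of the binning
counts, edges = np.histogram(scores, bins=5, range=(50, 100))

# Each bin is described by a left edge, a right edge, and a count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} scores")
```

A value that sits exactly on an interior edge (such as 70 here) is counted in the bin to its right; only the last bin includes its right edge.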

Using the hist() Function to Build Histograms

The Matplotlib library has this awesome hist() function that makes creating histograms super easy. Let's check out a simple example:

Python

import matplotlib.pyplot as plt
import numpy as np

# Create a dataset
data = np.random.normal(0, 1, 1000)

# Create a histogram
plt.hist(data, bins=30, alpha=0.7, color='blue')
plt.title('Data Distribution Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

Here we generate a dataset using the function np.random.normal(), which draws 1000 random samples from a normal distribution with mean 0 and standard deviation 1. We split the data into 30 bins and use the alpha parameter to make the bars slightly transparent.
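
It can also help to look at what hist() gives back. The sketch below assumes a seeded generator (np.random.default_rng with an arbitrary seed) instead of np.random.normal(), purely so the run is reproducible; the variable names are illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)

# hist() returns the bin counts, the bin edges, and the drawn bar objects
counts, edges, bars = plt.hist(data, bins=30, alpha=0.7, color='blue')

print(len(counts))        # one count per bin
print(len(edges))         # one more edge than there are bins
print(int(counts.sum()))  # every sample lands in exactly one bin
plt.show()
```

Keeping the returned counts and edges around is handy when you want to annotate the chart or check the numbers behind the bars.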

2. Adjusting Bins and Graph Appearance

Choosing the Number and Size of Bins

The number and size of bins can seriously impact how you interpret a histogram. Bins that are too wide can hide important details, while bins that are too narrow can make the histogram noisy and hard to read.

Practical Example:

Python

# Changing the number and size of bins
plt.hist(data, bins=10, color='green', edgecolor='black')
plt.title('Histogram with 10 bins')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

plt.hist(data, bins=50, color='red', edgecolor='black')
plt.title('Histogram with 50 bins')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

Check out how the histogram changes when using 10 and 50 bins. With 10 bins the overall shape is smooth but coarse; with 50 bins you see more detail, but also more random noise. This is one of those cases where size really matters!
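
If you would rather not guess the number of bins, NumPy can estimate it from the data. This is a small sketch, assuming np.histogram_bin_edges() and three of its named rules ("sturges", "fd", and "auto"); the seed and the bin_counts dictionary are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
data = rng.normal(0, 1, 1000)

# Each name is a rule NumPy uses to estimate bin edges from the data
bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges describe n - 1 bins
    print(f"{rule}: {bin_counts[rule]} bins")
```

The same rule names can be passed straight to the plotting function, for example plt.hist(data, bins='auto'). Treat the result as a starting point and adjust it by eye.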

Practical Examples of Customizing Histogram Parameters

Histograms can be customized not just by the number of bins, but also by the color, transparency, and outline style of the bars. Here's another fun example:

Python

# Other customization parameters
plt.hist(data, bins=30, density=True, color='purple', edgecolor='white', linestyle='dashed')
plt.title('Density Histogram with Custom Style')
plt.xlabel('Values')
plt.ylabel('Density')
plt.grid(True)
plt.show()

In this example, we added the density=True parameter, which normalizes the histogram so that the total area of the bars equals 1. The y-axis then shows density rather than counts, which is useful when you want to estimate the probability density of a distribution.
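
To see what this normalization means in numbers, here is a short sketch using np.histogram() with density=True (an assumption made for illustration: it applies the same normalization without drawing a chart). The seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility
data = rng.normal(0, 1, 1000)

counts, edges = np.histogram(data, bins=30)
density, _ = np.histogram(data, bins=30, density=True)

# Area of a bar = height * width, so the bar areas must sum to 1
widths = np.diff(edges)
area = float(np.sum(density * widths))
print(round(area, 6))

# Each density value is the bin's count divided by (total count * bin width)
manual = counts / (counts.sum() * widths)
print(np.allclose(manual, density))
```

Note that the bar heights themselves are not probabilities and can exceed 1 when the bins are narrow; only the areas add up to 1.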

3. Examples of Using Histograms

Histograms are used in all sorts of fields—from analyzing financial data to physics experiments. Let's see how to use a histogram with some real-world data.

Building a Histogram on Real Data

Imagine we have a dataset representing the average daily temperature for a year. We want to analyze how often the temperature falls within certain ranges.

Python

# Temperature dataset (simplified for example)
temperatures = [15, 16, 15, 14, 19, 22, 24, 25, 17, 18, 15, 16, 23, 24, 21, 19, 18, 20, 22, 25, 26, 27]

# Building the histogram
plt.hist(temperatures, bins=5, color='navy', edgecolor='black')
plt.title('Temperature Histogram')
plt.xlabel('Temperature (°C)')
plt.ylabel('Frequency')
plt.show()

In this example, we used 5 bins to find out how the values are distributed. The two leftmost bars, covering roughly 14 to 19°C, hold 11 of the 22 values, so the cooler temperatures are the most common. There is also a dip around 20 to 21°C.
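
To read the exact numbers behind the bars, the counts can be computed directly. A minimal sketch, assuming np.histogram() with the same bins=5 setting as the plot above:

```python
import numpy as np

temperatures = [15, 16, 15, 14, 19, 22, 24, 25, 17, 18, 15, 16,
                23, 24, 21, 19, 18, 20, 22, 25, 26, 27]

# Same binning as the plot: 5 equal-width bins from min to max
counts, edges = np.histogram(temperatures, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} °C: {count} days")
```

With 5 bins between 14 and 27, each bin is 2.6 degrees wide, which is why the edges do not fall on whole numbers.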

Comparing Distributions on a Single Histogram

Sometimes you need to compare the distributions of several datasets on a single histogram. In Matplotlib, you can overlay multiple histograms using the alpha (transparency) parameter.

Practical Example:

Python

import matplotlib.pyplot as plt

# Generating data
data1 = [5, 10, 10, 15, 15, 20, 25, 30, 30, 35, 40]
data2 = [5, 7, 9, 10, 11, 13, 15, 17, 19, 20, 25]

# Plotting overlapping histograms
plt.hist(data1, bins=5, color="blue", alpha=0.5, label="Dataset 1")
plt.hist(data2, bins=5, color="green", alpha=0.5, label="Dataset 2")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Comparison of Two Distributions")
plt.legend()
plt.show()

In this example, we use the alpha=0.5 parameter for each histogram, making the bars semi-transparent and allowing us to visually compare overlaps between distributions.
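
One thing to watch out for: when bins=5 is passed to each call separately, each histogram derives its bin edges from its own data range (5 to 40 for the first dataset, 5 to 25 for the second), so the bars end up with different widths. One possible way to make the bars line up is to compute a single set of edges first. The sketch below assumes np.histogram_bin_edges() and an illustrative choice of 7 bins:

```python
import matplotlib.pyplot as plt
import numpy as np

data1 = [5, 10, 10, 15, 15, 20, 25, 30, 30, 35, 40]
data2 = [5, 7, 9, 10, 11, 13, 15, 17, 19, 20, 25]

# One set of edges computed from both datasets, so the bars line up
shared_edges = np.histogram_bin_edges(np.concatenate([data1, data2]), bins=7)

counts1, edges1, _ = plt.hist(data1, bins=shared_edges, color="blue",
                              alpha=0.5, label="Dataset 1")
counts2, edges2, _ = plt.hist(data2, bins=shared_edges, color="green",
                              alpha=0.5, label="Dataset 2")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Comparison with Shared Bin Edges")
plt.legend()
plt.show()
```

With shared edges, a taller bar in one dataset always means more values in exactly the same interval as the other dataset.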

Helpful Tips for Working with Histograms

  • Optimal Bin Selection: Pick the number of bins based on the size and nature of your data. Too few or too many bins can distort the distribution.
  • Comparing Distributions: Use transparency (alpha) to overlay multiple histograms and compare distributions.
  • Adding a Grid: A grid makes it easier to read values off the chart. You can add it by calling plt.grid(True).
  • Density Parameter: Use density=True to display data as a probability density, which is particularly useful when comparing datasets of different sizes.
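
Several of these tips can be combined in one chart. The following is a rough sketch, not a recipe: the two samples, their sizes (200 and 2000), the seed, and the choice of 30 bins are all assumptions made for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed, for reproducibility
small_sample = rng.normal(0, 1, 200)   # hypothetical small dataset
large_sample = rng.normal(1, 1, 2000)  # hypothetical dataset, 10x larger

# Shared edges so both histograms use identical bins
edges = np.histogram_bin_edges(
    np.concatenate([small_sample, large_sample]), bins=30)

# density=True puts both samples on the same vertical scale
d_small, _, _ = plt.hist(small_sample, bins=edges, density=True,
                         alpha=0.5, label="200 samples")
d_large, _, _ = plt.hist(large_sample, bins=edges, density=True,
                         alpha=0.5, label="2000 samples")
plt.grid(True)
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.show()
```

Without density=True, the larger sample would tower over the smaller one simply because it has more values, not because its distribution has a different shape.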