
Working with BeautifulSoup and Extracting Info


If you think web pages are just pretty pictures and text, here's some news for you: they’re like onions, multi-layered and able to make you cry (from joy, of course!) when you find out how much data you can extract from them. Today, we’re digging into HTML pages with the BeautifulSoup library. Grab your virtual shovel: it’s digging time!

1. Analyzing HTML Documents

Simple pages

Let’s go over a few simple HTML documents to understand what they’re made of and which elements might be interesting for data extraction.

Example of a news page:

HTML

<html>
  <head>
    <title>News</title>
  </head>
  <body>
    <h1>Main news of the day</h1>
    <p>Something important happened today!</p>
  </body>
</html>

In this example, h1 contains the article headline, and p contains the main text.

The impact of HTML structure on scraping

Before using BeautifulSoup, it’s important to understand how the HTML document you want to parse is structured. This helps you figure out which parts of the page contain the data you need. For example, the page title shown in the browser tab lives in <title>, the main headline is usually in <h1>, and list data sits in <ul> and <li>.
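
To make the link between structure and extraction concrete, here is a minimal sketch. The HTML string is a hypothetical page invented for illustration; the code relies only on the tag names h1, ul, and li.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration
html = """
<html>
  <body>
    <h1>Main news of the day</h1>
    <ul>
      <li>Politics</li>
      <li>Sports</li>
      <li>Weather</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# The main headline lives in <h1>
heading = soup.find("h1").get_text(strip=True)

# Each list entry lives in its own <li> inside the <ul>
topics = [li.get_text(strip=True) for li in soup.find("ul").find_all("li")]

print("Heading:", heading)
print("Topics:", topics)
```

Knowing in advance that the list items sit inside a ul is what lets the code go straight to them instead of scanning the whole page.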

Prepping for scraping

Before starting data extraction, it’s crucial to identify the key tags and attributes. For instance, if the page’s developers have marked up the data, say with class="headline" on each headline, that makes your job much easier. Use your browser’s developer tools to examine the HTML structure: in Google Chrome, right-click an element and select "Inspect".
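
As a rough illustration of why such markup helps, the sketch below assumes a hypothetical fragment in which only the real headlines carry class="headline". Filtering on that class separates them from other h2 elements.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two real headlines and one unrelated <h2>
html = """
<div>
  <h2 class="headline">Rates held steady</h2>
  <h2>Related stories</h2>
  <h2 class="headline">Storm warning issued</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching by tag alone returns every <h2>, including the unrelated one
all_h2 = soup.find_all("h2")

# Adding the class filter keeps only the marked-up headlines
headlines = soup.find_all("h2", class_="headline")
headline_texts = [h.get_text(strip=True) for h in headlines]

print(len(all_h2), "h2 tags in total")
print("Headlines:", headline_texts)
```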

2. Installing and Setting Up Libraries

Installing BeautifulSoup and requests

To work with HTML, we’ll use the BeautifulSoup library. Also, to load HTML pages, we’ll need requests. Installing them is straightforward and only requires a couple of commands in your terminal:

Bash

pip install beautifulsoup4 requests

Teaming up requests and BeautifulSoup

Requests will let us fetch HTML from a web page, and BeautifulSoup will parse it. Let’s see how it works in action:

Python

import requests
from bs4 import BeautifulSoup

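# Fetching the web page (the timeout keeps the script from hanging forever)
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on a 4xx/5xx response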

# Parsing the page with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extracting the page title
title = soup.title.text
print('Title:', title)
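
The snippet above assumes the page has a <title> tag. When it does not, soup.title is None and accessing .text raises an AttributeError. A small sketch of a more defensive version follows; extract_title is a hypothetical helper name, and it works on an HTML string so it can be tried without a network connection.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of <title>, or None if the document has no title.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


print(extract_title("<html><head><title>News</title></head></html>"))
print(extract_title("<html><body><p>No title here</p></body></html>"))
```

In a real script you would pass response.text to the helper after fetching the page with requests.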

3. Navigation and Data Extraction by Tags

Navigation methods

Now that we have a parsed HTML document, we can use BeautifulSoup to navigate through it. Tags are reachable as attributes of the soup object (soup.title, soup.body), and the .select() method lets you extract elements using CSS selectors.
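
Here is a short sketch of moving around the parsed tree, using the news page from section 1 as input. The variable names are chosen only for this example.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>News</title></head>
  <body>
    <h1>Main news of the day</h1>
    <p>Something important happened today!</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute access walks down to the first matching tag
h1 = soup.body.h1

# .parent walks back up the tree
parent_name = h1.parent.name

# find_next_sibling moves sideways, skipping whitespace between tags
next_p = h1.find_next_sibling("p")

# find_all(True, recursive=False) lists only the direct child tags
child_tags = [child.name for child in soup.body.find_all(True, recursive=False)]

# .select() reaches the same paragraph with a CSS selector
selected = soup.select("body > p")

print(parent_name, child_tags)
print(next_p.get_text())
```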

Data extraction by tags

BeautifulSoup provides methods that let you find elements by their tags: find returns the first matching element, and find_all returns every match. They will help you dig out those juicy data bits:

Python

# Finding the first paragraph
paragraph = soup.find('p').text
print('First paragraph:', paragraph)

# Finding all list items
list_items = soup.find_all('li')
for item in list_items:
    print('List item:', item.text)
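
One detail worth knowing: when nothing matches, find returns None, while find_all returns an empty result. The sketch below uses a made-up fragment with no list in it to show how to handle both cases without an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a paragraph but no list items
html = "<div><p>Only one paragraph, no list.</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")
first_li = soup.find("li")      # no match: None
all_li = soup.find_all("li")    # no match: empty result

# Guard against None before reading the text
paragraph_text = first_p.get_text() if first_p is not None else ""

# Looping over an empty result is safe and simply does nothing
item_texts = [li.get_text() for li in all_li]

print("Paragraph:", paragraph_text)
print("Found a list item?", first_li is not None)
print("List items:", item_texts)
```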

Using attributes for filtering

Sometimes you’ll need to extract elements that meet specific conditions, like having a certain class. BeautifulSoup makes this easy:

Python

# Extracting an element with a specific class
headline = soup.find('h1', class_='main-headline').text
print('Headline:', headline)
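
The keyword is spelled class_ because class is a reserved word in Python. The following sketch runs the same kind of filter against a hypothetical fragment and also shows filtering by id and by an arbitrary attribute through attrs. The attribute data-section is invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration
html = """
<body>
  <h1 class="main-headline breaking">Main news of the day</h1>
  <h1 class="sub-headline">Also in the news</h1>
  <p id="summary">Something important happened today!</p>
  <a href="/archive" data-section="archive">Archive</a>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches if the value is any one of the element's classes
headline = soup.find("h1", class_="main-headline")

# Filtering by id
summary = soup.find(id="summary")

# attrs handles attribute names that are not valid Python keywords
archive_link = soup.find("a", attrs={"data-section": "archive"})

print("Headline:", headline.get_text())
print("Summary:", summary.get_text())
print("Archive URL:", archive_link["href"])
```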

4. Using CSS Selectors

What are CSS selectors?

CSS selectors are patterns that match elements by tag name, class, id, attribute, or position in the document tree. They were designed for applying styles, but they are just as useful to a Python developer: a single selector string can describe exactly which elements you want, making scraping more flexible and precise.

Using selectors in BeautifulSoup

BeautifulSoup lets you use CSS selectors through the select method. For example:

Python

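# Selecting all links that actually have an href attribute
# (link['href'] raises KeyError on an <a> tag without one)
links = soup.select('a[href]')
for link in links:
    print('Link:', link['href'])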

You can even combine selectors for more precise targeting. For instance, soup.select('div.article h2') will select every h2 nested at any depth inside a div with the class article.
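
A brief sketch of combined selectors on a hypothetical fragment. It contrasts the descendant form (a space) with the child form (>), and shows select_one, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration
html = """
<div class="article">
  <h2>Direct heading</h2>
  <section>
    <h2>Nested heading</h2>
  </section>
</div>
<div class="sidebar">
  <h2>Sidebar heading</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A space matches h2 at any depth inside div.article
descendants = [h.get_text() for h in soup.select("div.article h2")]

# '>' matches only h2 elements that are direct children of div.article
children = [h.get_text() for h in soup.select("div.article > h2")]

# select_one returns the first match (or None if there is none)
first = soup.select_one("div.article h2")

print(descendants)
print(children)
print(first.get_text())
```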

Examples of searching with CSS selectors

Let’s apply our knowledge in practice. Select all paragraphs with the class highlight and print their text:

Python

# Extracting all paragraphs with the class 'highlight'
highlighted_paragraphs = soup.select('p.highlight')
for para in highlighted_paragraphs:
    print('Highlighted paragraph:', para.text)
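
To try this without fetching a page, here is a self-contained sketch on a made-up fragment. It shows that p.highlight also matches paragraphs carrying additional classes, ignores other tags with the same class, and that get_text(strip=True) trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration
html = """
<p class="highlight">  First key point  </p>
<p class="note highlight">Second key point</p>
<p class="note">Ordinary remark</p>
<div class="highlight">Not a paragraph</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only <p> tags whose class list includes 'highlight' are selected
highlighted = [p.get_text(strip=True) for p in soup.select("p.highlight")]

for text in highlighted:
    print("Highlighted paragraph:", text)
```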

That’s it for now. Don’t forget to practice your scraping skills before next time. Good luck in the exciting world of parsing!
