If you think web pages are just pretty pictures and text, here's some news for you: they're like onions. They're multi-layered, and they can make you cry (from joy, of course!) once you discover how much data you can extract from them. Today we're digging into HTML pages with the BeautifulSoup library. Grab your virtual shovel: it's digging time!
1. Analyzing HTML Documents
Simple pages
Let’s go over a few simple HTML documents to understand what they’re made of and which elements might be interesting for data extraction.
Example of a news page:
<html>
<head>
<title>News</title>
</head>
<body>
<h1>Main news of the day</h1>
<p>Something important happened today!</p>
</body>
</html>
In this example, the h1 tag contains the article title, and the p tag holds the main text.
The impact of HTML structure on scraping
Before using BeautifulSoup, it's important to understand how the HTML document you want to parse is structured. This helps you figure out which parts of the page contain the data you need. For example, if you're looking for the page title, check out <h1>; for extracting a list, look at <ul> and <li>.
Prepping for scraping
Before starting data extraction, it's crucial to identify the key tags and attributes. For instance, if web developers have marked up the data in their page, such as giving the title the attribute class="headline", this will help you a lot. Use your browser's developer tools to examine the HTML structure: right-click an element and select "Inspect" (in Google Chrome).
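As a quick sketch of why such markup helps (the page content here is made up for illustration), a title tagged with class="headline" can be targeted directly:

```python
from bs4 import BeautifulSoup

# Hypothetical markup where the developer labeled the title with class="headline"
html = """
<html><body>
  <h1 class="headline">Breaking: onions have layers</h1>
  <p>Short intro text.</p>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

# The class attribute makes the title easy to target
title = soup.find('h1', class_='headline').text
print('Title:', title)
```

Spotting attributes like this in the developer tools is usually the first step of any scraping job.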
2. Installing and Setting Up Libraries
Installing BeautifulSoup and requests
To work with HTML, we'll use the BeautifulSoup library. We'll also need requests to load HTML pages. Installing them is straightforward and only takes one command in your terminal:
pip install beautifulsoup4 requests
Teaming up requests and BeautifulSoup
requests will let us fetch HTML from a web page, and BeautifulSoup will parse it. Let's see how it works in action:
import requests
from bs4 import BeautifulSoup
# Fetching the web page
url = 'https://example.com'
response = requests.get(url)
# Parsing the page with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Extracting the page title
title = soup.title.text
print('Title:', title)
3. Navigation and Data Extraction by Tags
Navigation methods
Now that we have an HTML document, we can use BeautifulSoup to navigate through it. The awesome .select() method lets you extract data using CSS selectors.
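Besides .select(), you can walk the tree with plain attribute access. A minimal sketch, reusing the news page from earlier:

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>News</title></head>'
        '<body><h1>Main news of the day</h1>'
        '<p>Something important happened today!</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Dot notation descends to the first matching tag
print(soup.title.text)      # News
print(soup.body.h1.text)    # Main news of the day

# Every tag also knows its surroundings
print(soup.h1.parent.name)  # body
```

Dot notation is handy for quick exploration; for anything more selective, use the search methods below.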
Data extraction by tags
BeautifulSoup provides methods that let you find elements by their tags, like find and find_all. They will help you dig out those juicy data bits:
# Finding the first paragraph
paragraph = soup.find('p').text
print('First paragraph:', paragraph)

# Finding all list items
list_items = soup.find_all('li')
for item in list_items:
    print('List item:', item.text)
Using attributes for filtering
Sometimes you'll need to extract elements that meet specific conditions, like having a certain class. BeautifulSoup makes this easy:
# Extracting an element with a specific class
headline = soup.find('h1', class_='main-headline').text
print('Headline:', headline)
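Filtering isn't limited to class. A small sketch (the markup and the data-kind attribute are made up for illustration) that filters by id and by an arbitrary attribute via the attrs dictionary:

```python
from bs4 import BeautifulSoup

# Made-up markup with an id and a custom data-* attribute
html = """
<div id="top" data-kind="news">
  <h1 class="main-headline">Big story</h1>
  <a href="/a">first</a>
  <a href="/b">second</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Filter by id with a keyword argument...
block = soup.find('div', id='top')
# ...or by any attribute through the attrs dictionary
tagged = soup.find('div', attrs={'data-kind': 'news'})

print(block['id'], tagged['data-kind'])
```

Note that if no element matches, find returns None, so calling .text on the result would raise an AttributeError; check for None on pages you don't control.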
4. Using CSS Selectors
What are CSS selectors?
CSS selectors are a powerful tool for a Python developer: they let you extract data based on specific criteria. They can be used to find elements that share common styling, making scraping more flexible and precise.
Using selectors in BeautifulSoup
BeautifulSoup lets you use CSS selectors through the select method. For example:
# Selecting all links
links = soup.select('a')
for link in links:
    print('Link:', link['href'])
You can even combine selectors for more precise targeting. For instance, soup.select('div.article h2') will select all h2 elements inside a div with the class article.
Examples of searching with CSS selectors
Let's apply our knowledge in practice. Select all paragraphs with the class highlight and print their text:
# Extracting all paragraphs with the class 'highlight'
highlighted_paragraphs = soup.select('p.highlight')
for para in highlighted_paragraphs:
    print('Highlighted paragraph:', para.text)
That's it for now. Don't forget to practice your scraping skills until next time. Good luck in the exciting world of parsing!