1. Navigating the HTML Tree
Today, we're diving into the mysterious world of HTML trees and learning how to scrape information from web pages like true coding ninjas. We’re going to keep using the magic BeautifulSoup library to get the data we need and level up our already-awesome scripts with even more functionality. So, fire up those keyboards, and let’s roll!
Before we dig into scraping data, let’s get a refresher on what an HTML tree is. Think of it like a big family tree where every tag is a relative. You've got parents, kids, siblings, and so on. Our job is to find the right "relatives" and neatly grab their valuable family heirlooms (a.k.a. data).
Here's what a snippet of HTML might look like:
<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>
Here we’ve got a div that’s the parent element for h2, p, and a. Each of these children has its own attributes and content.
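To make the family-tree idea concrete, here is a minimal sketch that walks the snippet above from parent to children and between siblings. It assumes the `bs4` package is installed; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find("div")

# Passing True matches any tag; recursive=False keeps only direct children
# and skips the whitespace text nodes that sit between the tags.
child_names = [child.name for child in div.find_all(True, recursive=False)]
print(child_names)  # ['h2', 'p', 'a']

p = soup.find("p")
parent_name = p.parent.name                          # the tag that contains <p>
previous_name = p.find_previous_sibling("h2").name   # sibling before <p>
next_name = p.find_next_sibling("a").name            # sibling after <p>
print(parent_name, previous_name, next_name)  # div h2 a
```

The `.parent` attribute and the `find_previous_sibling()` / `find_next_sibling()` methods are handy when the element you can find easily is next to the one you actually want.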
2. Extracting Data by Tags
BeautifulSoup has handy methods for cruising through the tree and grabbing data. Let’s start with the basic find() method, which returns the first element with a specific tag (or None if nothing matches). Then there’s find_all(), the bulldozer of searching, which digs up every element with the specified tag and hands them back as a list.
from bs4 import BeautifulSoup
html_doc = """<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the first paragraph
first_paragraph = soup.find('p')
print(first_paragraph.text) # Output: This is the article text...
# Find all the links
all_links = soup.find_all('a')
for link in all_links:
    print(link['href'])  # Output: https://example.com
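It also helps to see what these two methods return when nothing matches, since that is where they differ most. The sketch below reuses the same snippet; searching for a `table` tag is just an illustrative choice of something that is not in the HTML.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() gives back None when no element matches
missing = soup.find("table")
print(missing)  # None

# find_all() gives back an empty list-like result instead
tables = soup.find_all("table")
print(len(tables))  # 0

# find_all() also accepts a list of tag names
names = [tag.name for tag in soup.find_all(["h2", "a"])]
print(names)  # ['h2', 'a']
```

Looping over an empty find_all() result is safe, while calling .text on a None from find() raises an AttributeError.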
3. Filtering Elements with Attributes
Now that we’ve got the hang of tag searching, it’s time to learn how to filter elements using attributes like id and class. These attributes are like bookmarks, pointing out exactly what’s where.
<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>
# Find an element with a specific id
title = soup.find(id="title")
print(title.text) # Output: Title
# Find all elements with the "content" class
content_paragraphs = soup.find_all(class_="content")
for p in content_paragraphs:
    print(p.text)  # Output: This is the article text...
Important! class is a reserved keyword in Python, so it can’t be used as an argument name. That’s why BeautifulSoup takes class_ (with a trailing underscore) instead.
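As a rough sketch of a few other ways to filter by attributes, the example below uses the `attrs` dictionary (which sidesteps the `class_` spelling), the `href=True` filter for "has this attribute at all", and several filters combined in one call. It assumes the same snippet as above.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Same search as class_="content", written with an attrs dictionary
by_attrs = soup.find_all(attrs={"class": "content"})
print(by_attrs[0].text)  # This is the article text...

# href=True matches any <a> tag that has an href attribute, whatever its value
links_with_href = soup.find_all("a", href=True)
print(len(links_with_href))  # 1

# Tag name and several attribute filters can be combined
exact = soup.find("a", class_="link", href="https://example.com")
print(exact.text)  # Read more
```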
4. Practicing Conditional Data Extraction
Time to get hands-on! Imagine you need to grab links and titles from a big, repeating HTML block of articles. Here’s a sample and how to handle it:
<div class="articles">
<div class="article">
<h2 class="title">First Article</h2>
<a href="https://example.com/1" class="read-more">Read more</a>
</div>
<div class="article">
<h2 class="title">Second Article</h2>
<a href="https://example.com/2" class="read-more">Read more</a>
</div>
</div>
Here’s how we can grab those titles and links:
html_doc = """<div class="articles">
<div class="article">
<h2 class="title">First Article</h2>
<a href="https://example.com/1" class="read-more">Read more</a>
</div>
<div class="article">
<h2 class="title">Second Article</h2>
<a href="https://example.com/2" class="read-more">Read more</a>
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
articles = soup.find_all('div', class_='article')
for article in articles:
    title = article.find('h2', class_='title').text
    link = article.find('a', class_='read-more')['href']
    print(f"Title: {title}, Link: {link}")
# Output:
# Title: First Article, Link: https://example.com/1
# Title: Second Article, Link: https://example.com/2
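If you want to do something with the data beyond printing it, one option is to collect each article into a dictionary. This is only a sketch of that idea; the `records` list and its keys are names made up for this example.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="articles">
<div class="article">
<h2 class="title">First Article</h2>
<a href="https://example.com/1" class="read-more">Read more</a>
</div>
<div class="article">
<h2 class="title">Second Article</h2>
<a href="https://example.com/2" class="read-more">Read more</a>
</div>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

records = []  # hypothetical container: one dict per article
for article in soup.find_all("div", class_="article"):
    records.append({
        # get_text(strip=True) trims surrounding whitespace from the text
        "title": article.find("h2", class_="title").get_text(strip=True),
        "link": article.find("a", class_="read-more")["href"],
    })

print(records)
```

Note that class_="article" does not match the outer div, because its class is "articles", and BeautifulSoup compares whole class names.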
5. Watch Out for Gotchas
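Now that you’re armed with knowledge, let’s look at some common mistakes. One big one is trying to access an attribute that doesn’t exist: writing link['href'] on a tag with no href makes Python throw a friendly but still annoying KeyError. To avoid this, use the tag’s .get() method, which returns None (or a default value you pass in) when the attribute is missing. A close cousin of this mistake is calling .text on the result of find() when nothing was found: find() returns None in that case, and you get an AttributeError.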
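Here is a small sketch of defensive extraction. The HTML is a made-up variation of the articles block in which one link has no href and one article has no title; the placeholder strings "(no title)" and "(no link)" are arbitrary choices for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical messy input: a link without href, and an article without a title
html_doc = """<div class="article">
<h2 class="title">No Link Here</h2>
<a class="read-more">Read more</a>
</div>
<div class="article">
<a href="https://example.com/2" class="read-more">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

results = []
for article in soup.find_all("div", class_="article"):
    heading = article.find("h2", class_="title")
    anchor = article.find("a", class_="read-more")

    # find() may return None, so check before touching the element
    title = heading.get_text(strip=True) if heading is not None else "(no title)"
    # .get() returns the default instead of raising KeyError
    link = anchor.get("href", "(no link)") if anchor is not None else "(no link)"

    results.append((title, link))

print(results)
# [('No Link Here', '(no link)'), ('(no title)', 'https://example.com/2')]
```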
Also, don’t forget that HTML elements can be nested or have complex structures. Use your browser’s inspect tool to make sure you understand the layout before scraping data with BeautifulSoup.
Next stop on our journey is using CSS selectors to extract data even more precisely. Stick around, because the BeautifulSoup adventure is just getting started!