1. Navigating the HTML Tree
Today, we’re diving into the mysterious world of HTML trees and learning how to scrape information from web pages like true coding ninjas. We’re going to keep using the magic BeautifulSoup library to get the data we need and level up our already-awesome scripts with even more functionality. So, fire up those keyboards, and let’s roll!
Before we dig into scraping data, let’s get a refresher on what an HTML tree is. Think of it like a big family tree where every tag is a relative. You've got parents, kids, siblings, and so on. Our job is to find the specific "relatives" to neatly grab those valuable family heirlooms (a.k.a. data).
Here's what a snippet of HTML might look like:
<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>
Here we’ve got a div that’s the parent element for h2, p, and a. Each of these has attributes and content of their own.
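To make the family-tree idea concrete, here’s a minimal sketch (parsing that same snippet) of hopping between relatives with .parent, .find_next_sibling(), and .children, which are standard BeautifulSoup navigation tools:
from bs4 import BeautifulSoup

html_doc = """<div class="article">
    <h2 id="title">Title</h2>
    <p class="content">This is the article text...</p>
    <a href="https://example.com" class="link">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# The h2 tag's parent is the enclosing div
h2 = soup.find('h2')
print(h2.parent.name)  # Output: div

# The p tag's next tag sibling is the a tag
p = soup.find('p')
print(p.find_next_sibling().name)  # Output: a

# Walk the div's direct children, skipping whitespace-only text nodes
for child in soup.find('div').children:
    if child.name:
        print(child.name)  # Output: h2, p, a (one per line)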
2. Extracting Data by Tags
BeautifulSoup has handy methods for cruising through the tree and grabbing data. Let’s start with the basic find() method, which lets you snag the first element with a specific tag. Then there’s find_all(), the bulldozer of searching, which digs up all elements of the specified tag.
from bs4 import BeautifulSoup
html_doc = """<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the first paragraph
first_paragraph = soup.find('p')
print(first_paragraph.text) # Output: This is the article text...
# Find all the links
all_links = soup.find_all('a')
for link in all_links:
    print(link['href'])  # Output: https://example.com
3. Filtering Elements with Attributes
Now that we’ve got the hang of tag searching, it’s time to learn how to filter elements using attributes like id and class. These attributes are like bookmarks, pointing out exactly what’s where.
<div class="article">
<h2 id="title">Title</h2>
<p class="content">This is the article text...</p>
<a href="https://example.com" class="link">Read more</a>
</div>
# Find an element with a specific id
title = soup.find(id="title")
print(title.text) # Output: Title
# Find all elements with the "content" class
content_paragraphs = soup.find_all(class_="content")
for p in content_paragraphs:
    print(p.text)  # Output: This is the article text...
Important! We use class_ instead of class to avoid messing with Python’s reserved keyword.
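If that trailing underscore feels odd, here’s a quick sketch of the same lookup written with an attrs dictionary, which BeautifulSoup also accepts:
# The same search as above, written with an attrs dictionary
content_paragraphs = soup.find_all(attrs={"class": "content"})
for p in content_paragraphs:
    print(p.text)  # Output: This is the article text...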
4. Practicing Conditional Data Extraction
Time to get hands-on! Imagine you need to grab links and titles from a big, repeating HTML block of articles. Here’s a sample and how to handle it:
<div class="articles">
<div class="article">
<h2 class="title">First Article</h2>
<a href="https://example.com/1" class="read-more">Read more</a>
</div>
<div class="article">
<h2 class="title">Second Article</h2>
<a href="https://example.com/2" class="read-more">Read more</a>
</div>
</div>
Here’s how we can grab those titles and links:
html_doc = """<div class="articles">
<div class="article">
<h2 class="title">First Article</h2>
<a href="https://example.com/1" class="read-more">Read more</a>
</div>
<div class="article">
<h2 class="title">Second Article</h2>
<a href="https://example.com/2" class="read-more">Read more</a>
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
articles = soup.find_all('div', class_='article')
for article in articles:
    title = article.find('h2', class_='title').text
    link = article.find('a', class_='read-more')['href']
    print(f"Title: {title}, Link: {link}")
# Output:
# Title: First Article, Link: https://example.com/1
# Title: Second Article, Link: https://example.com/2
5. Watch Out for Gotchas
Now that you’re armed with knowledge, let’s look at some common mistakes. One big one is trying to access an attribute that doesn’t exist. Python will throw a friendly but still annoying KeyError. To avoid this, you can use the .get() method to fetch attributes with a default value if they’re missing.
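Here’s a minimal sketch of the difference, reusing the articles soup from the previous section (the data-id attribute is made up purely for illustration):
link = soup.find('a', class_='read-more')

# Square-bracket access raises a KeyError if the attribute is missing:
# link['data-id']  # KeyError: 'data-id'

# .get() quietly returns None, or whatever default you pass
print(link.get('data-id'))           # Output: None
print(link.get('data-id', 'no id'))  # Output: no id
print(link.get('href'))              # Output: https://example.com/1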
Also, don’t forget that HTML elements can be nested or have complex structures. Use your browser’s inspect tool to make sure you understand the layout before scraping data with BeautifulSoup.
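A related gotcha: find() returns None when nothing matches, so calling .text on a “relative” that isn’t there raises an AttributeError. A small defensive sketch (the h3 subtitle tag is hypothetical and not in our snippet):
# find() returns None when nothing matches, so check before going deeper
subtitle = soup.find('h3', class_='subtitle')  # hypothetical tag, not in our HTML
if subtitle is not None:
    print(subtitle.text)
else:
    print("No subtitle here")  # Output: No subtitle here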
Next stop on our journey is using CSS selectors to extract data even more precisely. Stick around, because the BeautifulSoup adventure is just getting started!