Extracting Data by HTML Tags and Attributes

Python SELF EN, Level 31, Lesson 3

1. Navigating the HTML Tree

Today, we're diving into the mysterious world of HTML trees and learning how to scrape information from web pages like true coding ninjas. We’re going to keep using the magic BeautifulSoup library to get the data we need and level up our already-awesome scripts with even more functionality. So, fire up those keyboards, and let’s roll!

Before we dig into scraping data, let’s get a refresher on what an HTML tree is. Think of it like a big family tree where every tag is a relative. You've got parents, kids, siblings, and so on. Our job is to find the right "relatives" so we can grab those valuable family heirlooms (a.k.a. data).

Here's what a snippet of HTML might look like:

HTML

<div class="article">
    <h2 id="title">Title</h2>
    <p class="content">This is the article text...</p>
    <a href="https://example.com" class="link">Read more</a>
</div>

Here we’ve got a div that’s the parent element for h2, p, and a. Each of these children has its own attributes and content.
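To make the family-tree idea concrete, here is a minimal sketch of moving up, down, and sideways through the snippet above. It assumes the snippet is stored in a string named html_doc and parsed with the built-in 'html.parser'; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="article">
    <h2 id="title">Title</h2>
    <p class="content">This is the article text...</p>
    <a href="https://example.com" class="link">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

heading = soup.find("h2")

# Walk up: the parent of <h2> is the <div>
parent = heading.parent
print(parent.name)  # div

# Walk down: True matches any tag, recursive=False keeps only direct children
child_names = [child.name for child in parent.find_all(True, recursive=False)]
print(child_names)  # ['h2', 'p', 'a']

# Walk sideways: the next <p> sibling after <h2>
next_tag = heading.find_next_sibling("p")
print(next_tag.name)  # p
```

Asking for the sibling by tag name skips the whitespace text nodes that sit between tags in nicely indented HTML.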

2. Extracting Data by Tags

BeautifulSoup has handy methods for cruising through the tree and grabbing data. Let’s start with the basic find() method, which returns the first element with a specific tag (or None if nothing matches). Then there’s find_all(), the bulldozer of searching, which returns a list of all elements with the specified tag.

Python
                      
from bs4 import BeautifulSoup

html_doc = """<div class="article">
    <h2 id="title">Title</h2>
    <p class="content">This is the article text...</p>
    <a href="https://example.com" class="link">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first paragraph
first_paragraph = soup.find('p')
print(first_paragraph.text)  # Output: This is the article text...

# Find all the links
all_links = soup.find_all('a')
for link in all_links:
    print(link['href'])  # Output: https://example.com
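A few variations on the same two methods can come in handy. The sketch below is just one way to use them, with the same sample HTML and illustrative variable names: passing a list of tag names, capping results with limit, and trimming text with get_text(strip=True).

```python
from bs4 import BeautifulSoup

html_doc = """<div class="article">
    <h2 id="title">Title</h2>
    <p class="content">This is the article text...</p>
    <a href="https://example.com" class="link">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# A list of tag names matches any of them, in document order
mixed = soup.find_all(["h2", "a"])
mixed_names = [tag.name for tag in mixed]
print(mixed_names)  # ['h2', 'a']

# limit caps the number of results; the return value is still a list
first_p_only = soup.find_all("p", limit=1)
print(len(first_p_only))  # 1

# get_text(strip=True) trims surrounding whitespace from the text
heading_text = soup.find("h2").get_text(strip=True)
print(heading_text)  # Title
```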
                      
                    

3. Filtering Elements with Attributes

Now that we’ve got the hang of tag searching, it’s time to learn how to filter elements using attributes like id and class. These attributes are like bookmarks, pointing out exactly what’s where.

HTML

<div class="article">
    <h2 id="title">Title</h2>
    <p class="content">This is the article text...</p>
    <a href="https://example.com" class="link">Read more</a>
</div>

Python

# Find an element with a specific id
title = soup.find(id="title")
print(title.text)  # Output: Title

# Find all elements with the "content" class
content_paragraphs = soup.find_all(class_="content")
for p in content_paragraphs:
    print(p.text)  # Output: This is the article text...

Important! We use class_ instead of class because class is a reserved keyword in Python and can’t be used as an argument name.
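If the trailing underscore feels awkward, the attrs dictionary is another way to filter. Here is a small sketch using the same sample HTML, with illustrative variable names. It also shows combining a tag name with an attribute, and passing True to match any tag that has a given attribute at all.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="article">
    <h2 id="title">Title</h2>
    <p class="content">This is the article text...</p>
    <a href="https://example.com" class="link">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# attrs takes a plain dict, so the key can be spelled "class"
by_attrs = soup.find_all(attrs={"class": "content"})
print(by_attrs[0].text)  # This is the article text...

# Tag name and attribute filter together
link = soup.find("a", class_="link")
print(link["href"])  # https://example.com

# True means "the attribute exists, whatever its value"
with_href = soup.find_all(href=True)
href_tag_names = [tag.name for tag in with_href]
print(href_tag_names)  # ['a']
```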

4. Practicing Conditional Data Extraction

Time to get hands-on! Imagine you need to grab links and titles from a big, repeating HTML block of articles. Here’s a sample and how to handle it:

HTML

<div class="articles">
    <div class="article">
        <h2 class="title">First Article</h2>
        <a href="https://example.com/1" class="read-more">Read more</a>
    </div>
    <div class="article">
        <h2 class="title">Second Article</h2>
        <a href="https://example.com/2" class="read-more">Read more</a>
    </div>
</div>

Here’s how we can grab those titles and links:

Python

from bs4 import BeautifulSoup

html_doc = """<div class="articles">
                <div class="article">
                    <h2 class="title">First Article</h2>
                    <a href="https://example.com/1" class="read-more">Read more</a>
                </div>
                <div class="article">
                    <h2 class="title">Second Article</h2>
                    <a href="https://example.com/2" class="read-more">Read more</a>
                </div>
              </div>"""

soup = BeautifulSoup(html_doc, 'html.parser')

articles = soup.find_all('div', class_='article')
for article in articles:
    title = article.find('h2', class_='title').text
    link = article.find('a', class_='read-more')['href']
    print(f"Title: {title}, Link: {link}")
    
# Output:
# Title: First Article, Link: https://example.com/1
# Title: Second Article, Link: https://example.com/2
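Printing is fine for a quick look, but you’ll usually want the results in a data structure. The following is one possible sketch: the function name extract_articles and the dictionary keys are made up for illustration, and it assumes every article block follows the layout shown above.

```python
from bs4 import BeautifulSoup


def extract_articles(html):
    """Return a list of {"title": ..., "link": ...} dicts, one per article block."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("div", class_="article"):
        heading = article.find("h2", class_="title")
        anchor = article.find("a", class_="read-more")
        results.append({
            # Fall back to None when a piece is missing instead of crashing
            "title": heading.get_text(strip=True) if heading is not None else None,
            "link": anchor.get("href") if anchor is not None else None,
        })
    return results


sample_html = """<div class="articles">
    <div class="article">
        <h2 class="title">First Article</h2>
        <a href="https://example.com/1" class="read-more">Read more</a>
    </div>
    <div class="article">
        <h2 class="title">Second Article</h2>
        <a href="https://example.com/2" class="read-more">Read more</a>
    </div>
</div>"""

rows = extract_articles(sample_html)
for row in rows:
    print(row["title"], row["link"])
```

Note that class_='article' does not match the outer div: its class is "articles", and BeautifulSoup compares whole class names rather than substrings.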

5. Watch Out for Gotchas

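Now that you’re armed with knowledge, let’s look at some common mistakes. One big one is trying to access an attribute that doesn’t exist: tag['href'] raises a friendly but still annoying KeyError when the tag has no href. To avoid this, use the .get() method, which returns None (or a default value you pass in) when the attribute is missing. A close cousin of this mistake: find() returns None when nothing matches, so calling .text on the result raises an AttributeError.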
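Here is a minimal sketch of the defensive version. The HTML is a made-up example with a link that has no href and no paragraph at all, and the "#" default is just a placeholder value.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="article">
    <h2 class="title">No Link Here</h2>
    <a class="read-more">Read more</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

link = soup.find("a")

# link["href"] would raise KeyError here, because the tag has no href
href = link.get("href")
print(href)  # None

# A second argument supplies a default for missing attributes
href_or_default = link.get("href", "#")
print(href_or_default)  # #

# find() returns None when nothing matches, so check before using the result
paragraph = soup.find("p")
summary = paragraph.get_text(strip=True) if paragraph is not None else ""
print(repr(summary))  # ''
```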

Also, don’t forget that HTML elements can be nested or have complex structures. Use your browser’s inspect tool to make sure you understand the layout before scraping data with BeautifulSoup.

Next stop on our journey is using CSS selectors to extract data even more precisely. Stick around, because the BeautifulSoup adventure is just getting started!
