
Handling Errors in Web Scraping

Python SELF EN
Level 32, Lesson 4

If you’ve made it to this lecture, it means you already have a decent understanding of how web scraping works and how to use the awesome BeautifulSoup to extract the data you need from HTML. However, in the world of web scraping, not everything is as smooth as they describe in textbooks. Sometimes, our dream of collecting data turns into a battle with errors, so let’s talk about how to deal with pitfalls and make our scraper as resilient as possible.

1. Common Errors in Web Scraping

Error 404 and Other HTTP Errors

The classic problem: you try to fetch a page, and instead of content, you get the proud "404 Not Found." This might happen because the page has been deleted or moved. Other common HTTP errors include 403 (Forbidden), 500 (Internal Server Error), and 503 (Service Unavailable).
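Before parsing anything, it is worth checking what the server actually returned. A minimal sketch of checking the status code by hand (the URL is just a placeholder):

Python

import requests

url = 'http://example.com/some-page'  # placeholder URL for illustration

response = requests.get(url)
if response.status_code == 200:
    print("Page fetched successfully")
elif response.status_code == 404:
    print("Page not found (404)")
else:
    print("Server returned status", response.status_code)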

HTML Structure Changes

You’ve spent a ton of time writing code to extract data, and the next day the site decides to "show off" a little and changes its HTML structure. Oops, back to rewriting everything!
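One way to notice a structure change early is to check whether the element you expect actually exists before using it. A hedged sketch, where the CSS classes are made up purely for illustration:

Python

from bs4 import BeautifulSoup

html = "<html><body><div class='price'>42</div></body></html>"  # sample HTML for the demo
soup = BeautifulSoup(html, 'html.parser')

# Try the selector we expect; fall back to an alternative if the layout changed
price_tag = soup.find('div', class_='price') or soup.find('span', class_='product-price')

if price_tag is None:
    print("Expected element not found - the page structure may have changed")
else:
    print("Price:", price_tag.get_text(strip=True))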

Request Rate Limits

Some sites get suspicious when a scraper hammers them with requests all day long. In the best case you get banned temporarily, and in the worst case permanently.
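The simplest defense is to slow down: pause between requests and add a bit of random jitter so the traffic looks less robotic. A sketch with arbitrary delay values:

Python

import time
import random

urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder URLs

for url in urls:
    # ... fetch and parse the page here ...
    # Pause for 1-3 seconds so the requests don't arrive as a flood
    time.sleep(random.uniform(1, 3))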

Timeouts and Response Delays

Sometimes pages load slowly, and if the server takes too long to respond, your script can hang or fail unless you set an explicit timeout for the request.

2. Error Handling Methods

Using try-except

Your scripts shouldn’t crash at the first sign of trouble. Adding try-except blocks helps you catch errors and ensure your web scraper keeps going like nothing happened.

Python

    import requests
    from bs4 import BeautifulSoup
    
    url = 'http://example.com/some-nonexistent-page'
    
    try:
        response = requests.get(url)
        response.raise_for_status()  # Triggers HTTPError for bad responses
    except requests.exceptions.HTTPError as errh:
        print("HTTP Error:", errh)
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting:", errc)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)
    except requests.exceptions.RequestException as err:
        print("OOps: Something Else", err)
        

A good script doesn’t just catch exceptions; it has a planned response for every type of error. Banned by IP? Switch to the next proxy. Site is down? Scrape another one in the meantime. If elements that should be on the page are missing, notify the scraper’s owner (say, by email) so the parsing code can be updated.
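For example, here is a very rough sketch of switching to the next proxy after a failed request. The proxy addresses are fake and only stand in for a real pool:

Python

import requests

# Hypothetical proxy list - substitute real working proxies here
proxy_pool = [
    {'http': 'http://10.10.1.10:3128'},
    {'http': 'http://10.10.1.11:3128'},
]

def fetch_with_proxies(url):
    for proxy in proxy_pool:
        try:
            response = requests.get(url, proxies=proxy, timeout=5)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException:
            continue  # this proxy failed, try the next one
    return None  # every proxy failed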

Logging

"Why logs?" you might ask. Because logs are your second pair of eyes. They’ll help you figure out what went wrong and fix the issue ASAP.

Python

import logging

logging.basicConfig(filename='scraper.log', level=logging.INFO)

try:
    # Your scraping code
    pass
except Exception as e:
    logging.error("Exception occurred", exc_info=True)
    
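It also helps to include a timestamp and the severity level in every log line, so you can reconstruct what happened and when. For example:

Python

import logging

logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s'  # timestamp, severity, message
)

logging.info("Scraper started")
logging.warning("Page took longer than expected to respond")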

Using Timeouts and Retries

Sometimes, all you need is to wait a little and try again. Timeouts and retries are perfect for this.

Python

try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("Timeout occurred. Retrying...")
    # Retry the request or perform another action
    
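In practice it is better to limit the number of attempts and increase the pause between them. A sketch of a retry loop with simple exponential backoff (the numbers are arbitrary, and the URL is a placeholder):

Python

import time
import requests

url = 'http://example.com/some-page'  # placeholder URL

response = None
for attempt in range(3):  # give up after 3 attempts
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        break  # success, stop retrying
    except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
        wait = 2 ** attempt  # 1, 2, then 4 seconds
        print(f"Attempt {attempt + 1} failed, retrying in {wait} s...")
        time.sleep(wait)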

3. Examples of Resilient Scraping

A Simple Scraper with Error Handling

Let’s create a small but reliable scraper that can handle some common errors.

Python

import requests
from bs4 import BeautifulSoup
import time
import logging

logging.basicConfig(filename='scraper.log', level=logging.INFO)

url = 'http://example.com/some-nonexistent-page'

def fetch_html(url, retries=3):
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        logging.error("HTTP Error: %s", errh)
    except requests.exceptions.ConnectionError as errc:
        logging.error("Error Connecting: %s", errc)
    except requests.exceptions.Timeout:
        if retries > 0:
            logging.warning("Timeout occurred. Retrying...")
            time.sleep(1)  # Wait a bit and try again
            return fetch_html(url, retries - 1)
        logging.error("Timeout occurred and no retries left.")
    except requests.exceptions.RequestException as err:
        logging.error("Oops: Something Else %s", err)
    return None

html_content = fetch_html(url)
if html_content:
    soup = BeautifulSoup(html_content, 'html.parser')
    # Your data extraction code
else:
    print("Failed to retrieve the page content.")
    

Saving Data in Parts

To avoid losing already extracted data in case of a crash, save it in parts. For example, if you’re extracting information from a list of pages, saving results as you go minimizes the risk of losing data.

Python

import csv

def save_to_csv(data, filename='data.csv'):
    # Append mode: each call adds one row without overwriting earlier results
    with open(filename, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(data)
    

This way, even if your script crashes halfway through, you won’t lose all the data and can continue from the last saved point.
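Combined with the fetch_html function from the scraper example above, the part-by-part approach might look like this. The URLs and the extracted field are placeholders, and save_to_csv is the helper defined just above:

Python

from bs4 import BeautifulSoup

urls = [
    'http://example.com/page1',
    'http://example.com/page2',
]  # placeholder list of pages

for url in urls:
    html_content = fetch_html(url)       # defined in the scraper example above
    if not html_content:
        continue  # skip pages that couldn't be fetched
    soup = BeautifulSoup(html_content, 'html.parser')
    title_tag = soup.find('h1')          # hypothetical field to extract
    title = title_tag.get_text(strip=True) if title_tag else ''
    save_to_csv([url, title])            # write each row immediately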

The page you’re scraping might change only partially, so most of the data will still be accessible. It would be a shame to abort the run and lose valuable data just because a small fraction of it is missing.
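So instead of assuming every field is present, extract what you can and fall back to a default when something is missing. A sketch where the tag names are made up for illustration:

Python

from bs4 import BeautifulSoup

html = "<div><h2>Sample product</h2></div>"  # sample fragment where the price is missing
soup = BeautifulSoup(html, 'html.parser')

name_tag = soup.find('h2')
price_tag = soup.find('span', class_='price')  # hypothetical selector

record = {
    'name': name_tag.get_text(strip=True) if name_tag else None,
    'price': price_tag.get_text(strip=True) if price_tag else None,  # keep None instead of crashing
}
print(record)  # {'name': 'Sample product', 'price': None}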

Tasks (Python SELF EN, level 32, lesson 4):

1. Handling HTTP Errors in Web Scraping
2. Logging Web Scraping Errors to a File
3. Retry on errors and save intermediate data
4. Creating a robust scraper to collect data from multiple pages