If you’ve made it to this lecture, it means you already have
a decent understanding of how web scraping works and how to use
the awesome BeautifulSoup
to extract the data you need
from HTML. However, in the world of web scraping, not everything
is as smooth as they describe in textbooks. Sometimes, our dream
of collecting data turns into a battle with errors, so let’s talk
about how to deal with pitfalls and make our scraper as resilient
as possible.
1. Common Errors in Web Scraping
Error 404 and Other HTTP Errors
The classic problem: you try to fetch a page, and instead of content, you get the proud "404 Not Found." This might happen because the page has been deleted or moved. Other common HTTP errors include 403 (Forbidden), 500 (Internal Server Error), and 503 (Service Unavailable).
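Before doing anything with a response, it's worth looking at the status code. Here's a minimal sketch of how you might branch on it (the URL is just a placeholder):
import requests

response = requests.get('http://example.com/some-page')

if response.status_code == 200:
    print("Page fetched successfully")
elif response.status_code == 404:
    print("Page not found: it may have been deleted or moved")
elif response.status_code in (403, 503):
    print("Access denied or the server is temporarily unavailable")
else:
    print(f"Unexpected status code: {response.status_code}")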
HTML Structure Changes
You’ve spent a ton of time writing code to extract data, and the next day the site decides to "show off" a little and changes its HTML structure. Oops, back to rewriting everything!
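One way to soften the blow is to check that an element actually exists before touching it. A hedged sketch, assuming a hypothetical page with a div with class "price":
import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.com/product-page').text  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')

# find() returns None instead of raising an exception, so check before using the result
price_tag = soup.find('div', class_='price')
if price_tag is not None:
    print("Price:", price_tag.get_text(strip=True))
else:
    print("The 'price' element is missing; the page layout may have changed")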
Request Rate Limits
Some sites get suspicious when a web scraper hammers them with requests all day long. In the best case you'll be banned temporarily; in the worst case, permanently.
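The simplest defense is to slow down and pause between requests. A minimal sketch (the URLs and the delay are arbitrary placeholders):
import time
import requests

urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder list of pages

for url in urls:
    response = requests.get(url)
    # ... process the response here ...
    time.sleep(2)  # pause between requests so the site isn't flooded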
Timeouts and Response Delays
Sometimes, pages load slowly, and your script might crash if the wait exceeds the standard timeout period.
2. Error Handling Methods
Using try-except
Your scripts shouldn’t crash at the first sign of trouble.
Adding try-except
blocks helps you catch errors and ensure
your web scraper keeps going like nothing happened.
import requests
from bs4 import BeautifulSoup

url = 'http://example.com/some-nonexistent-page'

try:
    response = requests.get(url)
    response.raise_for_status()  # Raises HTTPError for bad (4xx/5xx) responses
except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
except requests.exceptions.ConnectionError as errc:
    print("Error Connecting:", errc)
except requests.exceptions.Timeout as errt:
    print("Timeout Error:", errt)
except requests.exceptions.RequestException as err:
    print("Oops, something else went wrong:", err)
A good script doesn't just catch exceptions; it has a response ready for every type of error. Got banned by IP? Switch to the next proxy. Site is down? Scrape another one for now. If elements that should be on the page are missing, notify the scraper's owner (an email alert works well) so the parsing code can be updated.
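What that kind of dispatch might look like in practice, as a rough sketch: the proxy list and the notify_owner function are hypothetical placeholders, not part of any library.
import logging
import requests
from bs4 import BeautifulSoup

# Hypothetical placeholders: swap in your own proxies and alerting mechanism
PROXIES = ['http://proxy1.example.com:8080', 'http://proxy2.example.com:8080']

def notify_owner(message):
    # Placeholder: in a real scraper this might send an email or a chat message
    logging.warning("ALERT: %s", message)

def fetch_with_proxies(url):
    # Walk through the proxy list until one of them works
    for proxy in PROXIES:
        try:
            return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            continue  # this proxy failed, try the next one
    return None

response = fetch_with_proxies('http://example.com/some-page')
if response is not None:
    soup = BeautifulSoup(response.text, 'html.parser')
    if soup.find('div', class_='price') is None:
        notify_owner("Expected 'price' element is missing; the layout may have changed")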
Logging
"Why logs?" you might ask. Because logs are your second pair of eyes. They’ll help you figure out what went wrong and fix the issue ASAP.
import logging

logging.basicConfig(filename='scraper.log', level=logging.INFO)

try:
    # Your scraping code
    pass
except Exception as e:
    logging.error("Exception occurred", exc_info=True)
Using Timeouts and Retries
Sometimes, all you need is to wait a little and try again. Timeouts and retries are perfect for this.
try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("Timeout occurred. Retrying...")
    # Retry the request or perform another action
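To make the "retrying" part concrete, here's a minimal sketch of a bounded retry loop with a growing delay; the number of attempts and the delays are arbitrary choices, not fixed rules.
import time
import requests

def get_with_retries(url, attempts=3, delay=1):
    # Try the request a few times, waiting a bit longer after each failure
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
            return response
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
            print(f"Attempt {attempt + 1} failed. Waiting {delay} seconds...")
            time.sleep(delay)
            delay *= 2  # exponential backoff
    return None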
3. Examples of Resilient Scraping
A Simple Scraper with Error Handling
Let’s create a small but reliable scraper that can handle some common errors.
import requests
from bs4 import BeautifulSoup
import time
import logging

logging.basicConfig(filename='scraper.log', level=logging.INFO)

url = 'http://example.com/some-nonexistent-page'

def fetch_html(url, retries=3):
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        logging.error("HTTP Error: %s", errh)
    except requests.exceptions.ConnectionError as errc:
        logging.error("Error Connecting: %s", errc)
    except requests.exceptions.Timeout:
        if retries > 0:
            logging.warning("Timeout occurred. Retrying...")
            time.sleep(1)  # Wait a bit and try again, but only a limited number of times
            return fetch_html(url, retries - 1)
        logging.error("Timeout occurred. No retries left.")
    except requests.exceptions.RequestException as err:
        logging.error("Oops, something else went wrong: %s", err)
    return None

html_content = fetch_html(url)

if html_content:
    soup = BeautifulSoup(html_content, 'html.parser')
    # Your data extraction code
else:
    print("Failed to retrieve the page content.")
Saving Data in Parts
To avoid losing already extracted data in case of a crash, save it in parts. For example, if you’re extracting information from a list of pages, saving results as you go minimizes the risk of losing data.
import csv

def save_to_csv(data, filename='data.csv'):
    with open(filename, mode='a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(data)
This way, even if your script crashes halfway through, you won’t lose all the data and can continue from the last saved point.
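Here's a rough sketch of how incremental saving might fit into a scraping loop, reusing fetch_html and save_to_csv from the snippets above (the URLs and the extracted field are placeholders):
import requests
from bs4 import BeautifulSoup

page_urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder list

for url in page_urls:
    html = fetch_html(url)  # the resilient fetcher defined earlier
    if not html:
        continue  # skip pages that couldn't be fetched
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.get_text(strip=True) if soup.title else ''
    save_to_csv([url, title])  # one row saved per page, right away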
A page you're scraping might change only partially, leaving most of its data still accessible. It would be a shame to abort the run and lose valuable data just because a small fraction of it is missing.