Bypassing scraping restrictions: setting up user-agent, cookies, and avoidance methods

1. Introduction to web scraping restrictions

Today, we’re diving into a pretty spicy topic—how to bypass web scraping restrictions. Every programmer curious about scraping will eventually face restrictions and bans from websites. It's about time we figured out how we, the good folks, can avoid getting caught up in the traps of site protection systems and keep collecting data without angering servers.

When you send requests to websites, you're kinda invading their personal space to grab data they're fiercely guarding. But why are websites making it hard for us? There could be many reasons: copyright protection, ensuring server reliability and performance, or preventing unauthorized use of data. If you use too many of a site’s resources or break its rules, you might get... banned. And nobody likes bans—except maybe server admins.

2. Setting up user-agent

What is user-agent?

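The User-Agent is an identification string that your browser sends as an HTTP header with every request. It tells the server which browser and operating system you're using. A script built on the requests library announces itself as python-requests followed by a version number unless you say otherwise, which makes it easy to spot. And guess what? The User-Agent can easily be faked so the server thinks you're visiting from, say, the latest iPhone, instead of a Python script you launched during a coffee break.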

Examples of changing user-agent

Changing the user-agent in the requests library is pretty straightforward. Here's a quick example:

Python

import requests

url = "https://example.com"

# Setting user-agent like Chrome browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}

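# A timeout keeps the script from hanging if the server never answers
response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)
print(response.text)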

Faking a user-agent is a bit like showing up to a company party in a panda suit. It’s still you, but you just look different. Servers, seeing a "browser" instead of a "script," might just let you into the party.
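
To see what the server would actually receive, it can help to inspect a request before sending it. The sketch below is one way to do that with requests, without touching the network: it prints the library's default User-Agent and then the one attached to a prepared request. The name `CHROME_UA` is just an illustrative constant, not something the library defines.

```python
import requests

# Illustrative constant: the same Chrome-like string used above
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
)

# What requests sends when no User-Agent is set
default_ua = requests.utils.default_user_agent()
print("Default:", default_ua)

# Build the request without sending it, then inspect its headers
request = requests.Request(
    "GET", "https://example.com", headers={"User-Agent": CHROME_UA}
)
prepared = request.prepare()
print("Custom: ", prepared.headers["User-Agent"])
```

If the second line of output still shows the default string, the headers dict never reached the request.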

3. Working with cookies

What are cookies?

Cookies are small pieces of data websites save in your browser. They can store all sorts of stuff, from site settings to session IDs that keep you logged in.

Using cookies in requests

Working with cookies in requests is also pretty simple. Usually, you grab some cookies with your first request to the site, then use them for subsequent ones:

Python

import requests

url = "https://example.com"

# A Session keeps cookies between requests
session = requests.Session()

# The first request stores any cookies the server sets
session.get(url, timeout=10)

# Later requests send those cookies back automatically
response = session.get(url, timeout=10)
print(response.text)

Sessions are like public transport for cookies. They ride along with you from request to request, keeping your settings and login intact. Each cookie is only sent back to the site that set it.
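
To check which cookies a session is carrying, you can look at its cookie jar. The following is a minimal sketch that runs offline: the `sessionid` cookie and its value are made up and set by hand, where normally a server response would set them.

```python
import requests

session = requests.Session()

# Hypothetical cookie, set by hand here instead of by a real server response
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# Cookies currently held by the session
print(session.cookies.get_dict())

# Prepare (but do not send) a request to see the Cookie header it would carry
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)
print(prepared.headers.get("Cookie"))
```

The same `session.cookies` jar is what gets filled in when a real response includes Set-Cookie headers.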

4. Methods to avoid blocks

Practical tips to reduce the chance of getting blocked

Here are a few tricks:

  • Delays between requests: Add random delays between requests so your bot doesn’t raise suspicion.
  • Changing IP addresses: Use VPNs or proxies to change your IP address so you don’t get blocked based on it.
  • Rotating user-agent: Change your user-agent on every request to look like different browsers.
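
The user-agent rotation from the last tip could look roughly like this. It is only a sketch: the `USER_AGENTS` pool and the `random_headers` helper are illustrative names, and the strings are examples that would need to be kept current in a real project.

```python
import random

# Illustrative pool of browser-like User-Agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0",
]


def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


for _ in range(3):
    # Each dict could be passed as headers=... to requests.get()
    print(random_headers())
```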

Examples of adding delays and changing IP addresses

Use the time and random modules to add random delays:

Python

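import random
import time

import requests

url = "https://example.com"
session = requests.Session()

for _ in range(10):
    response = session.get(url, timeout=10)
    # Pause for a random 1 to 3 seconds before the next request
    time.sleep(random.uniform(1, 3))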

To change your IP address in requests, you can use proxies:

Python

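import requests

url = "https://example.com"

proxies = {
    "http": "http://10.10.10.10:8000",
    # Most HTTP proxies are reached over plain http://, even for HTTPS traffic
    "https": "http://10.10.10.10:8000",
}

response = requests.get(url, proxies=proxies, timeout=10)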
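
A single proxy only swaps one IP address for another. If you have several, you might cycle through them. The sketch below assumes a small pool of made-up proxy addresses; `PROXY_POOL` and `next_proxies` are hypothetical names, and the result is meant to be passed as `proxies=...` to requests.

```python
from itertools import cycle

# Hypothetical proxy addresses; replace with proxies you are allowed to use
PROXY_POOL = cycle([
    "http://10.10.10.10:8000",
    "http://10.10.10.11:8000",
])


def next_proxies():
    """Return a proxies dict that points at the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    return {"http": proxy, "https": proxy}


first = next_proxies()
second = next_proxies()
third = next_proxies()  # the pool wraps around to the first proxy again
print(first, second, third, sep="\n")
```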

5. Additional deception methods

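To avoid looking suspicious, make your requests resemble a real visitor's: send the headers a browser normally sends, such as Accept and Accept-Language alongside User-Agent, and keep cookies between requests with a session. Remember, realism is your secret weapon in the war against bans.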
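
One way to put this together is a session that carries a browser-like set of headers on every request. This is a rough sketch, not a guaranteed disguise: the header values in `BROWSER_HEADERS` are illustrative and modeled on a desktop Chrome browser. It prepares a request without sending it so you can see the final headers.

```python
import requests

# Illustrative header values, modeled on a desktop Chrome browser
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
# Session-level headers are merged into every request the session makes
session.headers.update(BROWSER_HEADERS)

# Prepare (but do not send) a request and list the headers it would carry
prepared = session.prepare_request(requests.Request("GET", "https://example.com"))
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Because the headers live on the session, cookies and headers travel together on every later `session.get()` call.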

That’s a wrap for our lesson today. These techniques will help you stay "under the radar" of websites and keep collecting valuable data without getting blocked. Remember, as with any superhero's powers, these come with responsibility: use them ethically and within legal boundaries. Believe in yourself, and may your code be as elegant as a cat dancing on a keyboard!
