1. Using the "Next" Button
If the site has a "Next" button or link to navigate to the next page, you can set up a loop to click on this button as long as it's available.
Code Example
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

def initialize_driver():
    driver = webdriver.Chrome()
    driver.implicitly_wait(10)
    return driver

def open_page(driver, url):
    driver.get(url)

def collect_data(driver):
    # Example of collecting data from the current page
    items = driver.find_elements(By.CLASS_NAME, "item_class")
    for item in items:
        print(item.text)  # Here you can save or process the data

def click_next_button(driver):
    try:
        next_button = driver.find_element(By.LINK_TEXT, "Next")
        next_button.click()
        return True
    except NoSuchElementException:
        return False  # Button not found, meaning we're on the last page

def main():
    driver = initialize_driver()
    open_page(driver, "https://example.com/page1")
    try:
        while True:
            collect_data(driver)
            if not click_next_button(driver):
                break  # Exit the loop if the "Next" button is absent
            time.sleep(2)  # Delay for loading the next page
    finally:
        driver.quit()

main()
Code Explanation
initialize_driver() — initializes the driver.
open_page() — opens the first page to start working.
collect_data() — a function to collect data from the current page.
click_next_button() — a function that finds and clicks the "Next" button. If the button is missing, it returns False, which means page navigation has ended.
The loop in main() — the main loop for navigating pages. It stops when the "Next" button can no longer be found.
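The loop described above runs until the "Next" button disappears. On some sites the button stays on the last page (for example, in a disabled state), so a page cap can guard against an endless loop. The sketch below is one possible way to do that. The helper name crawl_with_next and the max_pages parameter are hypothetical and not part of the original script. The function takes plain callables, so it can be tried without a browser.

```python
from typing import Callable


def crawl_with_next(
    collect: Callable[[], None],
    click_next: Callable[[], bool],
    max_pages: int = 100,
) -> int:
    """Collect data page by page until click_next() returns False.

    max_pages is a hypothetical safety cap (an assumption, not from the
    original script). Returns the number of pages that were collected.
    """
    visited = 0
    while visited < max_pages:
        collect()                # process the current page
        visited += 1
        if not click_next():     # False means there is no further page
            break
    return visited
```

With the functions from the example above, this could be called as crawl_with_next(lambda: collect_data(driver), lambda: click_next_button(driver)).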
2. Pagination Using Page Numbers
Some sites have numbered page links (e.g., "1", "2", "3", and so on). In such cases, you can gather a list of links and navigate through them in sequence.
Code Example
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

def initialize_driver():
    driver = webdriver.Chrome()
    driver.implicitly_wait(10)
    return driver

def open_page(driver, url):
    driver.get(url)

def collect_data(driver):
    items = driver.find_elements(By.CLASS_NAME, "item_class")
    for item in items:
        print(item.text)

def go_to_page(driver, page_number):
    page_link = driver.find_element(By.LINK_TEXT, str(page_number))
    page_link.click()

def main():
    driver = initialize_driver()
    open_page(driver, "https://example.com/page1")
    try:
        total_pages = 5  # Specify the total number of pages if known
        for page in range(1, total_pages + 1):
            collect_data(driver)
            if page < total_pages:  # Don't navigate further after the last page
                go_to_page(driver, page + 1)
                time.sleep(2)  # Delay for loading the next page
    finally:
        driver.quit()

main()
Code Explanation
go_to_page() — a function that finds the link to the desired page by its number and navigates to it.
The loop in main() — uses the total_pages variable to determine the number of pages. The loop navigates to the next page until it reaches the last one.
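When the total number of pages is not known in advance, it can sometimes be estimated from the numbered links themselves. The sketch below assumes the link texts have already been read from the page, for example with [a.text for a in driver.find_elements(By.TAG_NAME, "a")], and that no unrelated link on the page has a purely numeric text. The helper name last_page_number is hypothetical.

```python
from typing import Iterable


def last_page_number(link_texts: Iterable[str], default: int = 1) -> int:
    """Return the largest purely numeric link text, or default if none exist.

    Assumption: pagination links are labelled with plain ASCII digits
    ("1", "2", "3", ...) and other links ("Next", "Home") are not numeric.
    """
    numbers = []
    for text in link_texts:
        text = text.strip()
        if text.isascii() and text.isdigit():
            numbers.append(int(text))
    return max(numbers, default=default)
```

The result could then be assigned to total_pages instead of hard-coding 5. Sites that collapse long page lists (showing only the first few numbers) may hide the real last page, so the value is a lower bound in that case.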
3. Modifying the URL for Each Page
Some sites have a simple URL structure, where each page is identified by a number in the URL, like https://example.com/page/1, https://example.com/page/2, etc. In this case, you can just modify the URL to load the desired page, avoiding the need to search for elements.
Code Example
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

def initialize_driver():
    driver = webdriver.Chrome()
    driver.implicitly_wait(10)
    return driver

def open_page(driver, url):
    driver.get(url)

def collect_data(driver):
    # find_elements_by_class_name was removed in Selenium 4; use By.CLASS_NAME
    items = driver.find_elements(By.CLASS_NAME, "item_class")
    for item in items:
        print(item.text)

def main():
    driver = initialize_driver()
    try:
        total_pages = 5  # Specify the total number of pages if known
        base_url = "https://example.com/page/"
        for page_number in range(1, total_pages + 1):
            url = f"{base_url}{page_number}"
            open_page(driver, url)
            collect_data(driver)
            time.sleep(2)  # Delay before loading the next page
    finally:
        driver.quit()

main()
Code Explanation
The base_url variable contains the base URL of the page. The loop appends the page number to it and sequentially opens each page.
The loop generates the URL for each page and collects data without clicking on elements, so it cannot fail because a link is missing or not yet clickable.
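If the total number of pages is unknown, one option is to keep building URLs until a page comes back with no items. The sketch below illustrates that idea under stated assumptions: the site returns a page without matching items past the last page (some sites return an error page or redirect to the first page instead), and fetch_items is a hypothetical callable that loads a URL and returns the item texts found there. The name crawl_until_empty and the max_pages cap are also hypothetical.

```python
from typing import Callable, Iterator, List, Tuple


def crawl_until_empty(
    fetch_items: Callable[[str], List[str]],
    base_url: str,
    max_pages: int = 50,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (url, items) for pages 1, 2, 3, ... until a page has no items.

    max_pages is a safety cap in case the site never returns an empty page.
    """
    for page_number in range(1, max_pages + 1):
        url = f"{base_url}{page_number}"
        items = fetch_items(url)
        if not items:            # an empty page is treated as the end
            break
        yield url, items
```

With Selenium, fetch_items could open the URL with driver.get(url) and return [item.text for item in driver.find_elements(By.CLASS_NAME, "item_class")].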
4. Optimization Tips
- Minimize waiting and clicks on dynamic elements: Using links and URLs is more robust than clicking on JavaScript-loaded buttons.
- Use wait timers with minimal delay: When navigating to a new page, use a small delay like time.sleep(2) to ensure elements have time to load, but don't delay longer than needed.
- Collect data after the full page has loaded: Ensure that the data on the page is fully loaded before starting its collection. Use implicitly_wait for reliable element detection.
- Logging: Implement logging to record the current page, errors, and successful transitions. This will simplify the script's troubleshooting during its execution.
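The logging tip can be sketched with Python's standard logging module. The example below is a minimal illustration, not a definitive implementation: the function name visit_pages, the logger name "pagination", and the handle_page callable (which is assumed to process one page and return the number of items collected) are all hypothetical.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("pagination")  # hypothetical logger name


def visit_pages(urls: Iterable[str], handle_page: Callable[[str], int]) -> List[str]:
    """Process each URL, logging progress and errors.

    Returns the list of URLs whose processing raised an exception.
    """
    failed = []
    for url in urls:
        logger.info("Opening %s", url)
        try:
            count = handle_page(url)
        except Exception:
            # Record the traceback and continue with the next page
            logger.exception("Failed on %s", url)
            failed.append(url)
            continue
        logger.info("Collected %d items from %s", count, url)
    return failed
```

Calling logging.basicConfig(level=logging.INFO) before the run makes the progress messages visible. The returned list shows which pages to retry.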