1. Basic CSS Concepts for Web Scraping
For successful web scraping, understanding the structure of HTML and CSS classes on a page is key. Knowing how page elements are styled and structured using CSS allows you to select and extract the desired data more accurately. Let's see how linking CSS to HTML, using selectors, and the style, class, id, and name attributes help you work with the structure of web pages for scraping.
CSS is responsible for styling web pages. However, for web scraping purposes, we can consider CSS as a tool for understanding the structure and selecting elements. Let's look at some key CSS concepts that are important for scraping:
- Selectors: patterns that match specific HTML elements. Using them helps you pinpoint the desired data.
- The class, id, and name attributes: labels that distinguish elements from one another. Only id is meant to be unique within a page, while class and name values can be shared by several elements. For scraping, they are especially useful because they help isolate the necessary elements, simplifying data extraction.
2. Linking CSS to an HTML Document
CSS can be linked to HTML in various ways. Understanding these methods is essential for navigating elements and determining their styles and classes, as this will help isolate target data.
External File
CSS is often linked as an external file, which appears in an HTML document as a <link> tag in the <head> section. External CSS files define styles for the entire page, and their rules reference the identifiers and classes used in the markup, which makes navigation easier when scraping.
<head>
    <link rel="stylesheet" href="styles.css">
</head>
Internal Styles
Sometimes styles are defined within the page itself using the <style> tag. Internal styles can be found in the page's <head> and used as a clue to understand the classes and identifiers used to select necessary elements.
<head>
    <style>
        .price {
            color: red;
        }
    </style>
</head>
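To illustrate how an internal style block can serve as a clue, here is a minimal sketch that lists the class names mentioned in a page's <style> rules. It uses only Python's standard library; the names StyleCollector and class_names_in_styles are hypothetical helpers made up for this example, and the regular expressions are a rough approximation of CSS syntax rather than a real CSS parser.

```python
import re
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Collects the raw CSS text found inside <style> tags."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.css_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.css_chunks.append(data)


def class_names_in_styles(html):
    """Return the sorted class names used in selectors of internal styles."""
    collector = StyleCollector()
    collector.feed(html)
    css = "".join(collector.css_chunks)
    # Keep only the selector part of each rule (the text before "{"),
    # so values such as "1.5em" in declarations are not mistaken for classes.
    selectors = re.findall(r"([^{}]+)\{", css)
    names = set()
    for selector in selectors:
        names.update(re.findall(r"\.([A-Za-z_][\w-]*)", selector))
    return sorted(names)


page = """
<head>
<style>
.price {
color: red;
}
</style>
</head>
"""
print(class_names_in_styles(page))  # ['price']
```

The resulting list is only a starting point: a class that appears in the styles still has to be checked against the actual markup before relying on it for extraction.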
Inline Styles (style attribute)
Inline styles are written directly in the HTML tags and affect only the specific element. The style attribute sometimes contains distinctive properties that can be helpful for identifying target data.
<p style="color: red; font-size: 18px;">Text with inline style</p>
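When the style attribute is used as a filter, it usually helps to split its value into individual properties first. The sketch below does that with plain string operations; parse_inline_style is a hypothetical helper name, and the approach assumes simple declarations (a value that itself contains a semicolon, such as a data URL, would be split incorrectly).

```python
def parse_inline_style(style_value):
    """Split a style attribute value into a {property: value} dictionary."""
    declarations = {}
    for declaration in style_value.split(";"):
        if ":" not in declaration:
            continue  # skip empty pieces, e.g. the one after the last ";"
        prop, value = declaration.split(":", 1)
        declarations[prop.strip().lower()] = value.strip()
    return declarations


style = parse_inline_style("color: red; font-size: 18px;")
print(style)  # {'color': 'red', 'font-size': '18px'}

# Example filter: keep the element only if its text is styled red.
is_highlighted = style.get("color") == "red"
print(is_highlighted)  # True
```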
3. Selectors in CSS
Selectors in CSS are used to apply styles to elements, but for web scraping, their main use is to precisely select elements that contain the data you need. Let's look at the main types of selectors that can be used in web scraping.
Main Types of Selectors
Tag Selector: This selector picks all elements of a certain tag (e.g., <p> or <div>). In web scraping, tag selectors are helpful for extracting information from tags that may contain text, images, and other information.
p {
    color: blue;
}
Class Selector: This selector chooses elements with a specific class attribute value. A class is designated by a period (.) before the name. In web scraping, classes are particularly useful as they can identify elements with the same styles, like a list of products.
.price {
    color: red;
}
<p class="price">Price: $99</p>
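As a rough illustration of what a class selector such as .price does during scraping, the following sketch collects the text of every element whose class list contains a given name. It relies only on Python's built-in html.parser; ClassTextCollector is a hypothetical class written for this example, the note and sale classes in the sample markup are invented, and the code assumes a matching element does not contain another element with the same tag name.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collects the text of elements whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.open_tag = None  # tag name of the matching element being read
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.open_tag is None and self.target_class in classes:
            self.open_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.results.append("".join(self.buffer).strip())
            self.open_tag = None


html = (
    '<p class="price">Price: $99</p>'
    '<p class="note">In stock</p>'
    '<p class="price sale">Price: $89</p>'
)
collector = ClassTextCollector("price")
collector.feed(html)
print(collector.results)  # ['Price: $99', 'Price: $89']
```

Note that the third paragraph matches even though it carries two classes, which mirrors how the .price selector behaves in CSS.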
ID Selector: This selector chooses the element with a given id attribute value, marked by the # symbol. In web scraping, id is especially useful for selecting unique elements, such as a headline or a button on the page.
#product-title {
    font-size: 24px;
}
<h1 id="product-title">Product Name</h1>
Attribute Selectors: These selectors pick elements based on specific attributes like name, type, and more. In web scraping, attribute selectors are useful for selecting form elements or specific fields, for instance, selecting fields with a particular name.
input[name="email"] {
    border: 2px solid blue;
}
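A selector like input[name="email"] can be imitated in a scraper by comparing a tag's attributes against the expected value. The sketch below is one possible way to do this with Python's standard html.parser; AttributeMatcher and the sample form (including the value attribute) are assumptions made for the example.

```python
from html.parser import HTMLParser


class AttributeMatcher(HTMLParser):
    """Collects the attributes of tags that match tag[attr="value"]."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.value = value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == self.tag and attributes.get(self.attr) == self.value:
            self.matches.append(attributes)


form = """
<form>
<input type="text" name="username">
<input type="email" name="email" value="user@example.com">
</form>
"""
matcher = AttributeMatcher("input", "name", "email")
matcher.feed(form)
print(matcher.matches)
# [{'type': 'email', 'name': 'email', 'value': 'user@example.com'}]
```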
Combined Selectors: These selectors allow you to precisely pick elements by combining multiple criteria. For example, .product-list .price will select only elements with the price class that sit inside a container with the product-list class.
You'll learn more about attribute and combined selectors in the upcoming lectures.
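To make the idea of .product-list .price more concrete, here is a small sketch that keeps track of the currently open elements and captures a price only when one of its ancestors has the product-list class. It is an approximation built on Python's standard html.parser, not a full selector engine: DescendantCollector is a hypothetical name, the sample markup is invented, and the code assumes well-nested HTML.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DescendantCollector(HTMLParser):
    """Mimics `.ancestor_class .target_class` for well-nested HTML."""

    def __init__(self, ancestor_class, target_class):
        super().__init__()
        self.ancestor_class = ancestor_class
        self.target_class = target_class
        self.stack = []            # class sets of the currently open elements
        self.capture_depth = None  # stack depth of the element being captured
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        inside_ancestor = any(self.ancestor_class in c for c in self.stack)
        self.stack.append(classes)
        if (self.capture_depth is None and inside_ancestor
                and self.target_class in classes):
            self.capture_depth = len(self.stack)
            self.buffer = []

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        # The captured element is closed once the stack drops below its depth.
        if self.capture_depth is not None and len(self.stack) < self.capture_depth:
            self.results.append("".join(self.buffer).strip())
            self.capture_depth = None


html = """
<div class="product-list">
  <p class="price">$99</p>
  <p class="price">$89</p>
</div>
<p class="price">$5 shipping</p>
"""
collector = DescendantCollector("product-list", "price")
collector.feed(html)
print(collector.results)  # ['$99', '$89']
```

The last paragraph also has the price class, but it is skipped because it lies outside the product-list container.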
4. Attributes style, class, id, and name
Attribute style
The style attribute is used to add inline styles to elements, which can serve as a distinguishing feature for elements that are difficult to identify by other attributes. In web scraping, it can be used as an additional filter to find specific elements on a page.
<p style="color: red; font-size: 18px;">This text is highlighted with inline style</p>
Attribute class
The class attribute labels a group of elements with the same styles, such as products, prices, or descriptions. When scraping, class helps select a group of elements with the same visual structure, making bulk data extraction easier.
<p class="price">Price: $99</p>
<p class="price">Price: $89</p>
.price {
    color: red;
}
Attribute id
The id attribute should be unique within a page, making it valuable for extracting one-of-a-kind data. For example, a product title on a page may have a unique id, and that identifier can be used for precise selection of that title.
<h1 id="main-title">Product Name</h1>
#main-title {
    font-size: 30px;
}
Attribute name
The name attribute is often used in form elements and can be applied for precise selection of input fields, such as fields for email or phone number. For web scraping, name is helpful when extracting data from forms.
<input type="text" name="username" placeholder="Enter your username">
input[name="username"] {
    border: 1px solid #333;
}
5. Example of a Page Using CSS and HTML
Below is an example of an HTML document utilizing various selectors and attributes to highlight and structure the elements that can be useful for web scraping.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Web Scraping Example Page</title>
    <link rel="stylesheet" href="styles.css">
    <style>
        .price {
            color: red;
            font-weight: bold;
        }
    </style>
</head>
<body>
    <h1 id="main-title">Product of the Week</h1>
    <p class="price">Price: $99</p>
    <p class="description">This is a unique product with excellent features.</p>
    <form action="/submit" method="post">
        <label for="username">Username:</label>
        <input type="text" id="username" name="username">
        <label for="email">Email:</label>
        <input type="email" id="email" name="email">
        <button type="submit">Submit</button>
    </form>
</body>
</html>
#main-title {
    font-size: 24px;
    color: green;
}
input[name="username"] {
    border: 1px solid #333;
    padding: 5px;
}
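To tie the example together, the sketch below shows one way a scraper might pull the title (by id), the price (by class), and the form field names (by the name attribute) out of the page body above. It uses only Python's standard library; ProductPageParser is a hypothetical class written for this illustration, and it assumes the targeted elements contain plain text without nested tags.

```python
from html.parser import HTMLParser


class ProductPageParser(HTMLParser):
    """Pulls the title, price, and form field names from the example page."""

    def __init__(self):
        super().__init__()
        self.current_key = None  # which field the current text belongs to
        self.data = {"title": "", "price": "", "fields": []}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if attributes.get("id") == "main-title":          # like #main-title
            self.current_key = "title"
        elif "price" in classes:                           # like .price
            self.current_key = "price"
        elif tag == "input" and attributes.get("name"):    # like input[name]
            self.data["fields"].append(attributes["name"])

    def handle_data(self, data):
        if self.current_key:
            self.data[self.current_key] += data

    def handle_endtag(self, tag):
        # Targeted elements hold only text, so any closing tag ends the capture.
        self.current_key = None


page = """
<body>
<h1 id="main-title">Product of the Week</h1>
<p class="price">Price: $99</p>
<p class="description">This is a unique product with excellent features.</p>
<form action="/submit" method="post">
<label for="username">Username:</label>
<input type="text" id="username" name="username">
<label for="email">Email:</label>
<input type="email" id="email" name="email">
<button type="submit">Submit</button>
</form>
</body>
"""
parser = ProductPageParser()
parser.feed(page)
print(parser.data)
# {'title': 'Product of the Week', 'price': 'Price: $99', 'fields': ['username', 'email']}
```

In practice a dedicated parsing library with CSS selector support would likely replace this hand-written class, but the matching logic stays the same: id for unique elements, class for groups, and name for form fields.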