1. Text Extraction
Pretty much all of us have needed info that is locked inside a PDF document, be it a financial report, a research paper, or even your favorite e-book. But what if you need to extract this data automatically instead of copying it by hand? That’s where Python and the amazing PyPDF2 library come in.
Main Steps in Text Extraction
To extract text from a PDF successfully, follow these easy steps:
- Reading the PDF file.
- Parsing the PDF content.
- Extracting text for further analysis.
2. Reading and Parsing PDF Files
Let’s see how to open and read a PDF document in Python. First, we need to import PyPDF2:
import PyPDF2
Now, let’s open a PDF document. Suppose we have a file called sample.pdf that we want to analyze. Let’s load it and find out how many pages it has.
Loading the PDF File
# Opening the PDF file
with open("sample.pdf", "rb") as pdf_file:
    # Create a PDF Reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Get the total number of pages
    num_pages = len(pdf_reader.pages)
    print(f"Total number of pages in the document: {num_pages}")
Extracting Text
Now that the PDF document is open, let’s extract the text from it. Here’s what we need to do:
- PdfReader opens the PDF file for reading.
- We use a for loop to go through all the pages and call extract_text() to extract text.
- The extracted text is stored in the text variable and can be printed or processed.
Here’s an example:
import PyPDF2

# Opening the PDF file
with open("sample.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ""
    # Extracting text from each page
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        text += page.extract_text() + "\n"

print(text)
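As a variation on the loop above, here is a minimal sketch that wraps the extraction in a reusable function. The helper name `extract_all_text` is hypothetical (it is not part of PyPDF2); it accepts any iterable of objects that have an `extract_text()` method, such as `pdf_reader.pages`. It treats a missing or empty result as an empty string, on the assumption that some pages (scanned images, for example) may yield no text.

```python
def extract_all_text(pages):
    """Join the text of every page, separated by newlines.

    `pages` is any iterable of objects exposing extract_text(),
    e.g. PyPDF2's pdf_reader.pages. Hypothetical helper, not a PyPDF2 API.
    """
    parts = []
    for page in pages:
        # Fall back to "" if a page yields no text (assumed possible
        # for image-only pages).
        parts.append(page.extract_text() or "")
    # Building a list and joining once avoids repeated string concatenation.
    return "\n".join(parts)
```

Inside the `with` block, the call would look like `text = extract_all_text(pdf_reader.pages)`.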
Extracting Text from Specific Pages
What if you only need text from a specific page? For example, let’s say we want to get text only from the third page. Here’s how it’s done:
import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    page = pdf_reader.pages[2]  # Extracting text from the third page (index 2)
    text = page.extract_text()

print(text)
This example extracts text only from the third page, which can be super useful when you’re dealing with large documents and want to limit the processing. Note that PyPDF2 uses 0-based page numbering, so the third page is pages[2].
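Because of the 0-based numbering, it is easy to be off by one when picking pages. Below is a small sketch of a helper that converts human-style page numbers (starting at 1) into indices and rejects out-of-range values. The name `to_page_indices` and its parameters are illustrative assumptions, not part of PyPDF2.

```python
def to_page_indices(page_numbers, num_pages):
    """Convert 1-based page numbers to 0-based indices.

    Raises ValueError for numbers outside 1..num_pages.
    Hypothetical helper for illustration; not a PyPDF2 function.
    """
    indices = []
    for number in page_numbers:
        if not 1 <= number <= num_pages:
            raise ValueError(
                f"page {number} is out of range (document has {num_pages} pages)"
            )
        indices.append(number - 1)
    return indices


# Pages 1 and 3 of a 5-page document map to indices 0 and 2
print(to_page_indices([1, 3], 5))  # [0, 2]
```

Each resulting index can then be used as `pdf_reader.pages[i]`, with `num_pages` taken from `len(pdf_reader.pages)`.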
3. Automating Text Processing from PDFs
After extracting text from a PDF, you can analyze and process it for deeper data insights. PyPDF2 allows you to automate this process, which is especially helpful when working with a large number of documents.
Extracting and Saving Text to a File
For convenience in further analysis, you can save the extracted text to a text file. This makes it easier to process later.
import PyPDF2

# Opening the PDF file
with open("sample.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ""
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        text += page.extract_text() + "\n"

# Saving the extracted text to a file
with open("extracted_text.txt", "w", encoding="utf-8") as text_file:
    text_file.write(text)
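To apply the same steps to a large number of documents, the extraction can be looped over a folder. The following is only a sketch under assumptions: the function names `pdf_to_text` and `convert_folder` and the folder layout are hypothetical; only `PdfReader`, `pages`, and `extract_text()` come from PyPDF2. The PyPDF2 import sits inside `pdf_to_text` so that the folder logic can also be tried with any other extractor function.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract the text of one PDF with PyPDF2 (which must be installed)."""
    import PyPDF2  # imported here so convert_folder works with other extractors

    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        return "\n".join(page.extract_text() or "" for page in pdf_reader.pages)


def convert_folder(source_dir, output_dir, extract=pdf_to_text):
    """Write one .txt file per .pdf found directly in source_dir.

    Returns the list of text files written. Hypothetical helper.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(source_dir).glob("*.pdf")):
        text_path = output_path / (pdf_path.stem + ".txt")
        text_path.write_text(extract(pdf_path), encoding="utf-8")
        written.append(text_path)
    return written
```

A call such as `convert_folder("reports", "reports_txt")` would then produce one text file per PDF; both folder names are placeholders.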
Processing and Analyzing the Extracted Text
After extracting the text, you can analyze it using Python. Common libraries for this include re (regular expressions), nltk (Natural Language Toolkit), or pandas.
Counting Words and Searching for Key Phrases
Let’s say you have a text file called extracted_text.txt, and you want to count the occurrences of specific words or phrases in the document.
import re

# Open the extracted text
with open("extracted_text.txt", "r", encoding="utf-8") as text_file:
    text = text_file.read()

# Search and count keywords (case-insensitive); re.escape keeps keywords
# that contain regex metacharacters from being read as patterns
keywords = ["report", "data", "analysis"]
keyword_counts = {
    keyword: len(re.findall(re.escape(keyword), text, re.IGNORECASE))
    for keyword in keywords
}
print("Keyword frequency:", keyword_counts)
Here’s what’s happening:
- We open the saved text.
- We use regular expressions to count keyword occurrences, ignoring case. Matches are substrings, so “data” is also counted inside “database”.
- We get the number of mentions of each keyword as a dictionary.
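For the word-counting side, here is a minimal sketch using `collections.Counter` from the standard library. The function name `top_words`, the simple `\w+` tokenization, and the sample sentence are illustrative assumptions; real documents may need a proper tokenizer such as the one in nltk.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent words in text, ignoring case."""
    # \w+ grabs runs of letters, digits, and underscores as "words"
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(n)


sample = "Data, data, DATA and a report about the report."
print(top_words(sample, 2))  # [('data', 3), ('report', 2)]
```

Unlike the substring search, this approach counts whole words only, so “database” would not add to the count for “data”.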
4. PyPDF2 Advantages and Limitations
Advantages:
- Super easy to use for basic text extraction and page processing.
- Supports basic operations: reading text, merging, and splitting documents.
- Easily integrates with other Python libraries.
Limitations:
- PyPDF2 doesn’t always extract text correctly from PDFs with complex layouts, such as multi-column pages, tables, or images.
- Lacks support for directly extracting images or tables.
- Can’t read encrypted or password-protected files as-is; if you know the password, you can unlock the document first with PdfReader.decrypt().
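For the password case, here is a minimal sketch of the unlock step. `is_encrypted` and `decrypt()` are members of PyPDF2’s `PdfReader`; the wrapper name `unlock` is hypothetical. The function takes an already-created reader, so it stays independent of how the file was opened.

```python
def unlock(pdf_reader, password):
    """Decrypt pdf_reader in place if it is encrypted.

    Returns True when the document is readable afterwards.
    Hypothetical wrapper around PdfReader.is_encrypted / decrypt().
    """
    if not pdf_reader.is_encrypted:
        return True  # nothing to unlock
    # decrypt() is expected to return a falsy value (0) when the
    # password does not match
    return bool(pdf_reader.decrypt(password))
```

With a reader created as in the earlier examples, you would call `unlock(pdf_reader, "secret")` before touching `pdf_reader.pages`; the password here is a placeholder.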