Extracting Text from PDF Documents for Data Analysis

1. Text Extraction

Pretty much all of us have needed info from a PDF document at some point, whether it’s a financial report, a research paper, or a favorite e-book. But what if you need to extract this data automatically instead of doing it by hand? That’s where Python and the PyPDF2 library come in.

Main Steps in Text Extraction

To extract text from a PDF, follow these steps:

  1. Read the PDF file.
  2. Parse the PDF content.
  3. Extract the text for further analysis.

2. Reading and Parsing PDF Files

Let’s see how to open and read a PDF document in Python. First, we need to import PyPDF2:

Python

import PyPDF2

Now, let’s open a PDF document. Suppose we have a file called sample.pdf that we want to analyze. Let’s load it and find out how many pages it has.

Loading the PDF File

Python

# Opening the PDF file
with open("sample.pdf", "rb") as pdf_file:
    # Create a PDF Reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the total number of pages
    num_pages = len(pdf_reader.pages)
    print(f"Total number of pages in the document: {num_pages}")

Extracting Text

Now that the PDF document is open, let’s extract the text from it. Here’s what we need to do:

  • PdfReader opens the PDF file for reading.
  • We use a for loop to go through all the pages and call extract_text() to extract text.
  • The extracted text is stored in the text variable and can be printed or processed.

Here’s an example:

Python

import PyPDF2

# Opening the PDF file
with open("sample.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ""
    
    # Extracting text from each page
    for page in pdf_reader.pages:
        text += page.extract_text() + "\n"

print(text)

Extracting Text from Specific Pages

What if you only need text from a specific page? For example, let’s say we want to get text only from the third page. Here’s how it’s done:

Python

import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    page = pdf_reader.pages[2]  # Extracting text from the third page (index 2)
    text = page.extract_text()

print(text)

This example extracts text only from the third page, which can be super useful when you’re dealing with large documents and want to limit the processing. Note that pdf_reader.pages is indexed from 0, like a Python list, so the third page has index 2.
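
If you need several pages at once, the same indexing idea can be wrapped in a small helper. The sketch below is only an illustration and not part of PyPDF2: extract_pages is a hypothetical name, and it assumes you want to pass human-friendly page numbers that start at 1. It accepts any sequence of objects that have an extract_text() method, so pdf_reader.pages can be passed in directly.

```python
def extract_pages(pages, page_numbers):
    """Return {page_number: text} for the requested 1-based page numbers.

    `pages` is any sequence of objects with an extract_text() method,
    for example pdf_reader.pages. (Hypothetical helper, not a PyPDF2 API.)
    """
    result = {}
    for number in page_numbers:
        # Reject numbers outside the document instead of letting a
        # negative index silently wrap around to the end of the list.
        if not 1 <= number <= len(pages):
            raise ValueError(f"Page {number} is out of range (1-{len(pages)})")
        # Convert the 1-based page number to a 0-based index.
        text = pages[number - 1].extract_text()
        result[number] = text or ""
    return result
```

With an open reader, extract_pages(pdf_reader.pages, [1, 3]) would return the text of the first and third pages, keyed by page number.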

3. Automating Text Processing from PDFs

After extracting text from a PDF, you can analyze and process it for deeper data insights. Since the extraction is just Python code, the whole process is easy to automate, which is especially helpful when you’re working with a large number of documents.

Extracting and Saving Text to a File

For convenience in further analysis, you can save the extracted text to a text file. This makes it easier to process later.

Python

import PyPDF2

# Opening the PDF file
with open("sample.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ""
    
    for page in pdf_reader.pages:
        text += page.extract_text() + "\n"

# Saving the extracted text to a file
with open("extracted_text.txt", "w", encoding="utf-8") as text_file:
    text_file.write(text)
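
When there are many documents, the same save step can be repeated for every PDF in a folder. Here’s a minimal sketch of that idea. The names convert_folder and extract_text_from_pdf are assumptions made for this example: the second one stands for any function you write yourself that takes a PDF path and returns its text (for instance, the extraction loop above wrapped in a function).

```python
from pathlib import Path


def convert_folder(folder, extract_text_from_pdf, out_suffix=".txt"):
    """Write one text file next to every PDF found in `folder`.

    `extract_text_from_pdf` is a caller-supplied function (an assumption
    of this sketch) that takes a PDF path and returns its text.
    Returns the list of text files that were written.
    """
    written = []
    # sorted() keeps the processing order predictable between runs.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = extract_text_from_pdf(pdf_path)
        # report.pdf -> report.txt, saved in the same folder
        out_path = pdf_path.with_suffix(out_suffix)
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Passing the extraction function as an argument keeps the folder logic separate from the PDF library, so either part can be changed without touching the other.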

Processing and Analyzing the Extracted Text

After extracting the text, you can analyze it using Python. Common tools for this include re (regular expressions, part of the standard library), nltk (Natural Language Toolkit), and pandas.

Counting Words and Searching for Key Phrases

Let’s say you have a text file called extracted_text.txt, and you want to count the occurrences of specific words or phrases in the document.

Python

import re

# Open the extracted text
with open("extracted_text.txt", "r", encoding="utf-8") as text_file:
    text = text_file.read()

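# Search and count keywords (case-insensitive, whole words only).
# re.escape() treats each keyword as literal text, and \b stops
# "data" from also matching inside a longer word like "database".
keywords = ["report", "data", "analysis"]
keyword_counts = {
    keyword: len(re.findall(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE))
    for keyword in keywords
}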

print("Keyword frequency:", keyword_counts)

Here’s what’s happening:

  • We open the saved text.
  • We use regular expressions to count how often each keyword occurs, ignoring case.
  • The result is a dictionary that maps each keyword to its count.
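
If you don’t know the keywords in advance, you can also ask which words occur most often. The sketch below uses collections.Counter from the standard library instead of a fixed keyword list. The function name top_words and its parameters are made up for this example, and the simple [a-z]+ pattern only recognizes unaccented English letters.

```python
import re
from collections import Counter


def top_words(text, n=5, min_length=4):
    """Return the n most common words with at least min_length letters."""
    # Lowercase first so "Report" and "report" are counted together.
    words = re.findall(r"[a-z]+", text.lower())
    # Skipping very short words filters out most filler like "the" or "a".
    counts = Counter(word for word in words if len(word) >= min_length)
    return counts.most_common(n)


sample = "The report covers data. Data analysis follows the report; the report ends."
print(top_words(sample, n=2))  # [('report', 3), ('data', 2)]
```

The min_length filter is a crude stand-in for a real stop-word list, which a library such as nltk can provide.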

4. PyPDF2 Advantages and Limitations

Advantages:

  • Super easy to use for basic text extraction and page processing.
  • Supports basic operations: reading text, merging, and splitting documents.
  • Easily integrates with other Python libraries.

Limitations:

  • PyPDF2 doesn’t always extract text properly from complex PDFs that use multi-column layouts, tables, or images.
  • Lacks support for directly extracting images or tables.
  • Can’t read encrypted or password-protected files as they are. If you know the password, you can unlock the document first with the reader’s decrypt() method.
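
Because of the first limitation, it’s worth checking which pages came back empty. Pages that are scanned images, for example, typically produce no text at all. Here’s a small sketch of such a check. The name pages_without_text is hypothetical and not part of PyPDF2; the function works with any sequence of objects that have an extract_text() method, such as pdf_reader.pages.

```python
def pages_without_text(pages):
    """Return the 1-based numbers of pages whose extracted text is empty.

    `pages` is any sequence of objects with an extract_text() method,
    such as pdf_reader.pages. (Hypothetical helper, not a PyPDF2 API.)
    """
    empty = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""
        # Whitespace-only output counts as "no text" as well.
        if not text.strip():
            empty.append(number)
    return empty
```

An empty result doesn’t guarantee that the extracted text is accurate, but a non-empty one tells you which pages need a different tool.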