1. Merging PDFs with PyPDF2
Why Merge PDF Files
First, let's figure out why you'd want to merge PDF files in the first place. As they say, "One PDF is better than ten!" In a work environment, you might have reports, research results, technical documentation, or presentations provided as separate files. Constantly switching between them is not just inconvenient but also risky — you might miss something. By merging all the files into one document, you make working with this data easier and create a more structured approach to analysis and distribution.
Moreover, merging PDF files is handy for further archiving, creating a single final report, or stitching together multiple versions of a document for tracking changes. Basically, the possibilities are endless!
Basics of Using PyPDF2 to Merge PDFs
Let's start with the basics of working with PyPDF2. We'll create a script to merge several PDF files into one. Of course, the code will include comments so you understand what's happening at each step.
import PyPDF2

# Create a PdfMerger object from the PyPDF2 library
pdf_merger = PyPDF2.PdfMerger()

# List of our PDF documents we want to merge
pdf_files = ['document1.pdf', 'document2.pdf', 'document3.pdf']

# Loop to add each file to the PdfMerger object
for file in pdf_files:
    pdf_merger.append(file)

# Save the result into a new PDF file
output_filename = 'merged_document.pdf'
with open(output_filename, 'wb') as output_file:
    pdf_merger.write(output_file)

# Close the PdfMerger object to free resources
pdf_merger.close()
print(f"Merged PDF created: {output_filename}")
Order and Structure of the Merged Document
Now that we've learned to merge PDF documents, we should think about the page order. PyPDF2 adds pages in the order you pass the files to the .append() method, so the order of the pdf_files list determines the order of the final document.
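Since the list order decides the page order, it can help to sort the file names before merging. The sketch below is only an illustration: natural_key, ordered_files, and merge_in_order are hypothetical helper names, and it assumes the files are numbered in their names (part1.pdf, part2.pdf, part10.pdf). A plain alphabetical sort would put part10.pdf before part2.pdf, so the helper compares the numeric chunks as numbers.

```python
import re


def natural_key(filename):
    """Split a name into text and number chunks so 'part10' sorts after 'part2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", filename)]


def ordered_files(filenames):
    """Return the file names in natural (human) order."""
    return sorted(filenames, key=natural_key)


def merge_in_order(filenames, output_filename):
    """Merge the files in natural order (hypothetical helper, not part of PyPDF2)."""
    # Imported here so the sorting helpers above work without PyPDF2 installed
    import PyPDF2

    merger = PyPDF2.PdfMerger()
    for name in ordered_files(filenames):
        merger.append(name)
    with open(output_filename, "wb") as output_file:
        merger.write(output_file)
    merger.close()
```

Calling merge_in_order(['part10.pdf', 'part2.pdf', 'part1.pdf'], 'merged_document.pdf') would append part1.pdf, then part2.pdf, then part10.pdf.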
2. Merging Individual Pages
If you want to create a final document from parts of different files rather than merging entire files, you'll need to use the PdfWriter class instead of PdfMerger. Here's an example:
import PyPDF2

# List of PDF files to combine
pdf_files = ["file1.pdf", "file2.pdf", "file3.pdf"]

# Create a PdfWriter object to write the combined PDF
pdf_writer = PyPDF2.PdfWriter()

# Iterate over each PDF file
for pdf_file in pdf_files:
    # Passing the path lets PdfReader load the whole file into memory,
    # so its pages are still readable when the output is written
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Add each page to PdfWriter
    for page_num in range(len(pdf_reader.pages)):
        # You can skip pages you don't want to add here
        page = pdf_reader.pages[page_num]
        pdf_writer.add_page(page)

# Save the combined PDF
with open("merged_document.pdf", "wb") as output_file:
    pdf_writer.write(output_file)
How does this code work?
- Create a file list: The pdf_files list contains paths to the PDF documents to be combined.
- Initialize PdfWriter: pdf_writer is used to create a new PDF file.
- Iterate over each file: Each PDF file is opened and read with PdfReader.
- Add pages: All pages of the file are sequentially added to pdf_writer using add_page().
- Save the result: Once all pages are added, the new PDF file is written to merged_document.pdf.
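To actually pick only some pages, you need a way to say which ones. The sketch below is one possible approach, not a PyPDF2 feature: parse_page_ranges and merge_selected_pages are hypothetical helpers, and the '1-3,5' page spec format (1-based, like the page numbers a PDF viewer shows) is an assumption made for this example.

```python
def parse_page_ranges(spec, page_count):
    """Turn a 1-based spec such as '1-3,5' into zero-based page indexes."""
    indexes = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        # PdfReader.pages is zero-based, so shift the start down by one
        indexes.extend(range(start - 1, end))
    return indexes


def merge_selected_pages(selections, output_filename):
    """selections maps a PDF path to a page spec, e.g. {'file1.pdf': '1-2'}."""
    # Imported here so parse_page_ranges works without PyPDF2 installed
    import PyPDF2

    writer = PyPDF2.PdfWriter()
    for path, spec in selections.items():
        reader = PyPDF2.PdfReader(path)
        for index in parse_page_ranges(spec, len(reader.pages)):
            writer.add_page(reader.pages[index])
    with open(output_filename, "wb") as output_file:
        writer.write(output_file)
```

For example, merge_selected_pages({'file1.pdf': '1-2', 'file2.pdf': '4'}, 'merged_document.pdf') would take the first two pages of file1.pdf and the fourth page of file2.pdf.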
3. Styling the New Document
Adding Bookmarks and a Table of Contents
What if your merged document becomes too large and difficult to navigate? In such a case, bookmarks come to the rescue! PyPDF2 allows you to add basic bookmarks to make navigating the document easier. Let's add bookmarks for each document we merge.
import PyPDF2

# pdf_files and output_filename are reused from the first script
pdf_merger = PyPDF2.PdfMerger()

# Page index for bookmarks
page_offset = 0

for file in pdf_files:
    # Read the current document to find out how many pages it has
    pdf_reader = PyPDF2.PdfReader(file)
    # Add the document to PdfMerger
    pdf_merger.append(file)
    # Add a bookmark with the file name
    # (add_outline_item replaces add_bookmark, which PyPDF2 3.x no longer accepts)
    pdf_merger.add_outline_item(file, page_offset)
    # Update the page offset
    page_offset += len(pdf_reader.pages)

with open(output_filename, 'wb') as output_file:
    pdf_merger.write(output_file)
pdf_merger.close()
This little trick will help you stay calm and avoid getting lost in a sea of PDFs.
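The only real logic in the bookmark example is the running page offset: each bookmark points at the first page of its document inside the merged file. Here is a small sketch of that arithmetic on its own. The helper names (bookmark_offsets, bookmark_title, merge_with_bookmarks) are hypothetical, and the merge function assumes a PyPDF2 version that provides add_outline_item, which is the name recent releases use for bookmarks.

```python
from itertools import accumulate
from pathlib import Path


def bookmark_offsets(page_counts):
    """Return the zero-based first page of each document in the merged file."""
    starts = [0, *accumulate(page_counts)]
    return starts[:len(page_counts)]


def bookmark_title(path):
    """Use the file name without folder or extension as the bookmark title."""
    return Path(path).stem


def merge_with_bookmarks(paths, output_filename):
    """Merge the PDFs and add one bookmark per source file (hypothetical helper)."""
    # Imported here so the helpers above work without PyPDF2 installed
    import PyPDF2

    page_counts = [len(PyPDF2.PdfReader(path).pages) for path in paths]
    merger = PyPDF2.PdfMerger()
    for path, start in zip(paths, bookmark_offsets(page_counts)):
        merger.append(path)
        merger.add_outline_item(bookmark_title(path), start)
    with open(output_filename, "wb") as output_file:
        merger.write(output_file)
    merger.close()
```

With documents of 3, 5, and 2 pages, the bookmarks land on pages 0, 3, and 8 of the merged file.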
Updating Metadata of the Merged File
After merging, you can add or update the document's metadata, like author, title, and keywords.
import PyPDF2

pdf_files = ["file1.pdf", "file2.pdf"]
pdf_writer = PyPDF2.PdfWriter()

for pdf_file in pdf_files:
    # Passing the path lets PdfReader load the whole file into memory,
    # so its pages are still readable when the output is written
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        pdf_writer.add_page(page)

# Add metadata
pdf_writer.add_metadata({
    "/Title": "Merged Document",
    "/Author": "Ivan Ivanov",
    "/Subject": "Sales Report"
})

# Save the merged file
with open("merged_with_metadata.pdf", "wb") as output_file:
    pdf_writer.write(output_file)
This code adds metadata to help identify and structure the document.
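The text above also mentions keywords, which the example does not set. As a rough sketch, a small helper can build the metadata dictionary in one place. build_metadata is a hypothetical function, not part of PyPDF2; the keys are the standard PDF document information entries, and the creation date uses the PDF date format D:YYYYMMDDHHmmSS with the optional time zone suffix left out.

```python
from datetime import datetime, timezone


def build_metadata(title, author, subject, keywords=(), created=None):
    """Build a dictionary suitable for PdfWriter.add_metadata()."""
    created = created or datetime.now(timezone.utc)
    metadata = {
        "/Title": title,
        "/Author": author,
        "/Subject": subject,
        # PDF dates look like D:20240115093000
        "/CreationDate": created.strftime("D:%Y%m%d%H%M%S"),
    }
    if keywords:
        # Keywords are stored as a single string
        metadata["/Keywords"] = ", ".join(keywords)
    return metadata
```

You could then call pdf_writer.add_metadata(build_metadata("Merged Document", "Ivan Ivanov", "Sales Report", ["sales", "report"])) before writing the file.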