Combining Multiple PDF Files into One Document

Python SELF EN
Level 43, Lesson 3

1. Merging PDFs with PyPDF2

Why merge PDF files

First, let's figure out why you'd want to merge PDF files in the first place. As they say, "One PDF is better than ten!" At work, reports, research results, technical documentation, and presentations often arrive as separate files. Constantly switching between them is not just inconvenient but also risky: you might miss something. Merging them into one document keeps everything in a single place and in a fixed order, which makes the material easier to review and to share.

Moreover, merging PDF files is handy for archiving, creating a single final report, or stitching together multiple versions of a document to track changes.

Basics of Using PyPDF2 to Merge PDFs

Let's start with the basics of working with PyPDF2. We'll create a script to merge several PDF files into one. Of course, the code will include comments so you understand what's happening at each step.

Python

import PyPDF2

# Create a PdfMerger object from the PyPDF2 library
pdf_merger = PyPDF2.PdfMerger()

# List of our PDF documents we want to merge
pdf_files = ['document1.pdf', 'document2.pdf', 'document3.pdf']

# Loop to add each file to the PdfMerger object
for file in pdf_files:
    pdf_merger.append(file)

# Save the result into a new PDF file
output_filename = 'merged_document.pdf'
with open(output_filename, 'wb') as output_file:
    pdf_merger.write(output_file)

# Close the PdfMerger object to free resources
pdf_merger.close()

print(f"Merged PDF created: {output_filename}")

Order and Structure of the Merged Document

Now that we've learned to merge PDF documents, we should think about the page order. PyPDF2 adds pages in the order in which you pass the files to the .append() method, so the order of the pdf_files list determines the order of the final document.
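Because the list order decides the page order, it can help to sort the file names explicitly before the loop. The sketch below assumes the files carry a number in their names (the names report1.pdf, report2.pdf, report10.pdf are made up for illustration), and the helper natural_key is hypothetical, not part of PyPDF2. A plain sorted() would place report10.pdf before report2.pdf, so the key compares the numeric parts as integers.

```python
import re

def natural_key(filename):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    # re.split with a capturing group keeps the digit runs in the result
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", filename)]

# Hypothetical file names, deliberately out of order
pdf_files = ["report10.pdf", "report2.pdf", "report1.pdf"]
ordered_files = sorted(pdf_files, key=natural_key)
print(ordered_files)  # ['report1.pdf', 'report2.pdf', 'report10.pdf']
```

The sorted list can then be passed to the same .append() loop as before.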

2. Merging Individual Pages

If you want to build the final document from parts of different files rather than from entire files, you can use the PdfWriter class instead of PdfMerger. PdfWriter works page by page, so you decide for each page whether it goes into the result. Here's an example:

Python

import PyPDF2

# List of PDF files to combine
pdf_files = ["file1.pdf", "file2.pdf", "file3.pdf"]

# Create a PdfWriter object to write the combined PDF
pdf_writer = PyPDF2.PdfWriter()

# Iterate over each PDF file
for pdf_file in pdf_files:
    # Passing the path lets PdfReader load the file itself, so the pages
    # stay readable after this loop, when the writer saves them
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Add each page to PdfWriter
    for page_num, page in enumerate(pdf_reader.pages):
        # You can skip pages you don't want to add here,
        # for example by checking page_num
        pdf_writer.add_page(page)

# Save the combined PDF
with open("merged_document.pdf", "wb") as output_file:
    pdf_writer.write(output_file)

How does this code work?

  1. Create a file list: The pdf_files list contains paths to the PDF documents to be combined.
  2. Initialize PdfWriter: pdf_writer is used to create a new PDF file.
  3. Iterate over each file: Each PDF file is opened for reading and parsed by PdfReader.
  4. Add pages: All pages of the file are sequentially added to pdf_writer using add_page().
  5. Save the result: Once all pages are added, the new PDF file is written to merged_document.pdf.
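The "skip pages" comment in the loop can be turned into an explicit filter. This is a minimal sketch, assuming you want every second page (the 2nd, 4th, and so on) of a file. The helper names even_page_indices and add_selected_pages are hypothetical. add_selected_pages relies only on reader.pages and writer.add_page(), the same calls the lesson's code uses, so no extra PyPDF2 API is assumed.

```python
def even_page_indices(page_count):
    """Zero-based indices of the even-numbered pages (2nd, 4th, ...)."""
    # Page 2 has index 1, page 4 has index 3, and so on
    return list(range(1, page_count, 2))

def add_selected_pages(writer, reader, indices):
    """Copy only the pages at `indices` from reader into writer."""
    for index in indices:
        writer.add_page(reader.pages[index])

print(even_page_indices(7))  # [1, 3, 5]
```

Inside the loop over pdf_files, the call would look like add_selected_pages(pdf_writer, pdf_reader, even_page_indices(len(pdf_reader.pages))).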

3. Styling the New Document

Adding Bookmarks and a Table of Contents

What if your merged document becomes too large and difficult to navigate? In such a case, bookmarks come to the rescue! PyPDF2 lets you add basic bookmarks (the library calls them outline items) that jump to a given page. Let's add one bookmark for each document we merge.

Python

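import PyPDF2

pdf_files = ['document1.pdf', 'document2.pdf', 'document3.pdf']
output_filename = 'merged_document.pdf'

pdf_merger = PyPDF2.PdfMerger()

# Page index where the next document starts (used as the bookmark target)
page_offset = 0

for file in pdf_files:
    # Read the current document to find out how many pages it has
    pdf_reader = PyPDF2.PdfReader(file)

    # Add the document to PdfMerger
    pdf_merger.append(file)

    # Add a bookmark with the file name, pointing at the document's first page
    # (add_outline_item replaces add_bookmark, which PyPDF2 3.0 removed)
    pdf_merger.add_outline_item(file, page_offset)

    # Update the page offset
    page_offset += len(pdf_reader.pages)

with open(output_filename, 'wb') as output_file:
    pdf_merger.write(output_file)

pdf_merger.close()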

This little trick will help you stay calm and avoid getting lost in a sea of PDFs.
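The page_offset bookkeeping boils down to a running total: each bookmark points at the sum of the page counts of all files that come before it. Here is a minimal sketch of that arithmetic on its own, with made-up page counts and a hypothetical helper name bookmark_offsets.

```python
from itertools import accumulate

def bookmark_offsets(page_counts):
    """Return the zero-based starting page of each document in the merged file."""
    # initial=0 makes the first document start at page 0; the last running
    # total is the overall page count, not a start position, so it is dropped
    return list(accumulate(page_counts, initial=0))[:-1]

page_counts = [3, 5, 2]  # hypothetical lengths of three source PDFs
print(bookmark_offsets(page_counts))  # [0, 3, 8]
```

Checking the offsets this way before merging is a quick sanity test that every bookmark lands on the first page of its document.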

Updating Metadata of the Merged File

After merging, you can add or update the document's metadata, like author, title, and keywords.

Python

import PyPDF2

pdf_files = ["file1.pdf", "file2.pdf"]
pdf_writer = PyPDF2.PdfWriter()

for pdf_file in pdf_files:
    # PdfReader accepts a path and loads the file itself
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)

# Add metadata
pdf_writer.add_metadata({
    "/Title": "Merged Document",
    "/Author": "Ivan Ivanov",
    "/Subject": "Sales Report"
})

# Save the merged file
with open("merged_with_metadata.pdf", "wb") as output_file:
    pdf_writer.write(output_file)

This code adds metadata to help identify and structure the document.
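The example sets a title, author, and subject but no keywords. Here is a small sketch of how the dictionary could be assembled, assuming the standard PDF Info keys /Title, /Author, /Subject, and /Keywords. The helper build_metadata and the keyword values are hypothetical. Keywords are stored as a single string, so the list is joined with commas, and empty fields are left out.

```python
def build_metadata(title=None, author=None, subject=None, keywords=()):
    """Build a metadata dict with PDF Info keys, skipping empty values."""
    fields = {
        "/Title": title,
        "/Author": author,
        "/Subject": subject,
        "/Keywords": ", ".join(keywords),
    }
    # Keep only the fields that were actually filled in
    return {key: value for key, value in fields.items() if value}

metadata = build_metadata(
    title="Merged Document",
    author="Ivan Ivanov",
    keywords=["sales", "report", "quarterly"],  # illustrative values
)
print(metadata)
```

The resulting dictionary can be passed straight to pdf_writer.add_metadata(metadata).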
