Extract Text from PDF File using Python

Introduction

This example shows how to extract text from a PDF file using the Python programming language.

We will use the PyPDF2 package to work with the PDF file.

PDF files offer a few advantages:

  • PDF format allows professionals to edit, share, collaborate and ensure the security of the content within digital documents.
  • Reports are mostly generated in PDF format because a PDF file is a “read only” document that cannot be altered without leaving an electronic footprint.
  • PDF files are compatible across multiple platforms.

Prerequisites

Python 3.8.3, PyPDF2 (pip install PyPDF2)
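
Before running the example, it can help to confirm that the package is importable. The following is a minimal sketch that uses only the standard library; is_installed is a hypothetical helper name chosen for illustration, not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("PyPDF2"):
    print("PyPDF2 is available")
else:
    print("PyPDF2 is missing; install it with: pip install PyPDF2")
```

The check only looks the module up; it does not import it, so it is cheap to run at the top of a script.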

Extract Text from PDF

First, we import the required library PyPDF2, then open and read the PDF file.

We count the number of pages in the PDF file, then iterate over each page, extract its text, and append it to a list.

Finally, we join the page texts into one string and print it to the console.

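# Import the PDF reader class from PyPDF2
from PyPDF2 import PdfReader

text_list = []

# Open the PDF in binary mode; the with block closes the file when done
with open('simple.pdf', 'rb') as pdf_file:
    # Create the PDF reader object (PdfFileReader in older PyPDF2 releases)
    pdf_reader = PdfReader(pdf_file)
    count = len(pdf_reader.pages)  # number of pages in the PDF

    # Extract the text from each page of the PDF file
    for i in range(count):
        try:
            page = pdf_reader.pages[i]
            text_list.append(page.extract_text())
        except Exception as exc:
            # Report the failing page instead of silently skipping it
            print(f"Skipping page {i + 1}: {exc}")

# Join the text of all pages into one string, separated by spaces
text_string = " ".join(text_list)

print(text_string)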
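
Joining the page texts with a space merges the pages, but any line breaks inside a page remain in the result. If a true single-line string is wanted, one possible approach is to collapse all whitespace afterwards. The sketch below assumes plain Python strings as input; to_single_line and the sample page_texts are hypothetical names and data used only for illustration.

```python
def to_single_line(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Hypothetical page texts, standing in for what a PDF reader might return
page_texts = ["First line\nsecond line", "  Another page\n\nwith a gap  "]

single_line = to_single_line(" ".join(page_texts))
print(single_line)
```

This trades layout for simplicity: paragraph breaks are lost along with the line breaks, so it suits searching or indexing better than display.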

Testing the Program

When you run the above code, you will see the following output:

[Image: extract text from pdf file using python]

You can download the sample PDF file used in this example from the Source Code section below.

Source Code

Download

Thanks for reading.
