Tuesday, 21 September 2021

Extract Text From PDF File Using Python

To extract text from a PDF file, we first need the Python package called PyPDF2.

Installation of PyPDF2

Open your terminal as administrator and type the command below.

pip install PyPDF2
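To confirm that the install worked, we can ask Python whether it can find the module. This is a minimal sketch using only the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the pip command above has succeeded.
print("PyPDF2 installed:", is_installed("PyPDF2"))
```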


Here is a glimpse of the PDF we pass to the code for extraction.



Now let's see the Python code to extract text from the PDF.


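import PyPDF2

# Open the PDF in binary read mode
pdfFileObject = open('mynote.pdf', 'rb')

# Create a reader object from the open file
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

print("Number of pages in the pdf : ",pdfReader.numPages)

print()
print("******************")

# Get the first page (page numbers start at 0)
pageObject = pdfReader.getPage(0)

# Extract and print the text of that page
print(pageObject.extractText())

# Close the file once we are done
pdfFileObject.close()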
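The listing above uses the names PyPDF2 offered when this post was written. PyPDF2 3.0 and later removed PdfFileReader, numPages, getPage() and extractText(), so on a newer install the same steps would look roughly like the sketch below. The function name read_first_page is made up for illustration; PdfReader, pages and extract_text() are the newer names.

```python
def read_first_page(path):
    """Return (page_count, first_page_text) for the PDF at path."""
    # Imported inside the function so it can be defined before PyPDF2 is installed.
    from PyPDF2 import PdfReader

    # The with block closes the file for us, even if an error occurs.
    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        page_count = len(reader.pages)                     # replaces numPages
        first_page_text = reader.pages[0].extract_text()   # replaces getPage(0).extractText()
    return page_count, first_page_text
```

With the same input file, a call such as count, text = read_first_page('mynote.pdf') would give the page count and the text of the first page.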



Below is a snap of the output.




Let's go through the code line by line.


pdfFileObject = open('mynote.pdf', 'rb')

Here we open mynote.pdf in binary read mode ('rb') and store the file object in pdfFileObject.
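Binary mode matters because a PDF is not plain text. One way to see this, and to catch a wrong file early, is to read the first few raw bytes: a PDF file normally begins with the marker %PDF-. This is a small sketch; looks_like_pdf is a hypothetical helper and not part of PyPDF2.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the usual PDF header marker."""
    with open(path, "rb") as f:
        # In binary mode read() returns bytes, so we compare against a bytes literal.
        return f.read(5) == b"%PDF-"
```

Calling looks_like_pdf('mynote.pdf') before creating the reader gives a quick sanity check on the input file.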


pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

Here we create an object of the PdfFileReader class from the PyPDF2 module.
We pass it the PDF file object and get back a PDF reader object.


print("Number of pages in the pdf : ",pdfReader.numPages)

numPages gives the number of pages in the PDF file. In our case, it is 4.


pageObject = pdfReader.getPage(0) 

Here we get an object of the PageObject class from the PyPDF2 module.
The PDF reader object has a getPage() method, which takes a page number starting from index 0
and returns the corresponding page object.
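Because getPage() counts from 0, the page a person calls "page 1" is index 0. A small helper can do that conversion and reject page numbers that are out of range. This is only a sketch; page_index is a made-up name and not part of PyPDF2.

```python
def page_index(page_number, total_pages):
    """Convert a 1-based page number to the 0-based index getPage() expects."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(f"page {page_number} is outside 1..{total_pages}")
    return page_number - 1
```

For our 4-page file, pdfReader.getPage(page_index(4, pdfReader.numPages)) would fetch the last page.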

print(pageObject.extractText())

The page object has an extractText() method, which extracts the text of that page.
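The example only reads the first page. To read every page, we can combine numPages, getPage() and extractText() in a loop. The sketch below assumes the same older PyPDF2 names used in this post; extract_all_text is a made-up helper name.

```python
def extract_all_text(pdf_reader):
    """Return the text of every page, one page after another."""
    page_texts = []
    # numPages tells us how many times to loop; indexes run from 0 to numPages - 1.
    for page_number in range(pdf_reader.numPages):
        page_object = pdf_reader.getPage(page_number)
        page_texts.append(page_object.extractText())
    return "\n".join(page_texts)
```

It would be called as print(extract_all_text(pdfReader)) while the PDF file is still open.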


pdfFileObject.close()

Finally, we close the PDF file object.
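Calling close() by hand is easy to forget, and it is skipped if an error happens earlier in the script. A with block closes the file automatically. The sketch below shows this with a throwaway temporary file instead of mynote.pdf, so it runs anywhere.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on mynote.pdf.
fd, path = tempfile.mkstemp(suffix=".pdf")
os.close(fd)

with open(path, "rb") as pdfFileObject:
    inside = pdfFileObject.closed   # False: the file is open inside the block

after = pdfFileObject.closed        # True: the with block closed it on exit

print(inside, after)
os.remove(path)
```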
