To extract text from a PDF, we first need the Python package PyPDF2.
Installation of PyPDF2
Open your terminal as administrator and type the command below.
pip install PyPDF2
Here is a glance at the PDF we pass to the code for extraction.
import PyPDF2
pdfFileObject = open('mynote.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
print("Number of pages in the pdf : ", pdfReader.numPages)
print()
print("******************")
pageObject = pdfReader.getPage(0)
print(pageObject.extractText())
pdfFileObject.close()
Below is a snap of the output.
pdfFileObject = open('mynote.pdf', 'rb')
Here we open mynote.pdf in binary mode and save the file object as pdfFileObject.
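As a small illustration of why binary mode matters, the sketch below reads the first few bytes of a file and checks for the `%PDF-` marker that PDF files normally begin with. The helper name looks_like_pdf is hypothetical and not part of PyPDF2; it also uses a with block, which closes the file automatically instead of relying on an explicit close() call.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    # 'rb' reads raw bytes; a PDF is a binary format, so text mode is unsuitable.
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(5)
    # The file is closed automatically when the with block ends.
    return header == b"%PDF-"
```

Calling looks_like_pdf('mynote.pdf') before handing the file to PyPDF2 is one cheap way to catch a wrong path or a non-PDF file early.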
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
Here we create an object of the PdfFileReader class from the PyPDF2 module.
We pass it the PDF file object and get back a PDF reader object.
print("Number of pages in the pdf : ",pdfReader.numPages)
numPages gives the number of pages in the PDF file. In our case, it is 4.
pageObject = pdfReader.getPage(0)
Here we get an object of the PageObject class of the PyPDF2 module.
The PDF reader object has a method getPage() that takes a page number, with numbering starting at index 0.
It returns the page object for that page.
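Because getPage() is zero-indexed and numPages gives the page count, the same two calls can be combined to read every page rather than only the first. A minimal sketch, assuming a reader object with the numPages, getPage() and extractText() names used in this post; the helper name extract_all_text is hypothetical:

```python
def extract_all_text(pdf_reader):
    """Collect the text of every page from a reader like the pdfReader above.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    pages_text = []
    # Valid page indices run from 0 to numPages - 1.
    for page_number in range(pdf_reader.numPages):
        page_object = pdf_reader.getPage(page_number)
        pages_text.append(page_object.extractText())
    return pages_text

# Usage, assuming pdfReader was created as shown in this post:
# for number, text in enumerate(extract_all_text(pdfReader), start=1):
#     print("Page", number)
#     print(text)
```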
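print(pageObject.extractText())
extractText() returns the text of the page as a string, which we then print.
pdfFileObject.close()
Finally, we close the file object to release the file.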
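Note that the names used in this post (PdfFileReader, numPages, getPage, extractText) belong to older PyPDF2 releases. In PyPDF2 3.x they were replaced by PdfReader, len(reader.pages), reader.pages[i] and extract_text(). Below is a minimal sketch of the same steps with the newer names, assuming PyPDF2 3.x is installed. The function summarize_pdf and its reader_cls parameter are hypothetical, added here only so the logic can be tried without a real PDF.

```python
def summarize_pdf(path, reader_cls=None):
    """Return (page_count, first_page_text) using the PdfReader-style API.

    Hypothetical helper for illustration; reader_cls defaults to PyPDF2.PdfReader.
    """
    if reader_cls is None:
        # Imported lazily; assumes PyPDF2 3.x, where PdfReader is the reader class.
        from PyPDF2 import PdfReader as reader_cls
    with open(path, "rb") as pdf_file:
        reader = reader_cls(pdf_file)
        page_count = len(reader.pages)
        # reader.pages is zero-indexed, like getPage() in the older API.
        first_text = reader.pages[0].extract_text() if page_count else ""
    return page_count, first_text

# Usage, assuming mynote.pdf exists in the current directory:
# count, text = summarize_pdf("mynote.pdf")
# print("Number of pages in the pdf : ", count)
# print(text)
```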