python 3.x - PyPDF2 returns only empty lines for some files

May 15, 2015

I am working on a script that "reads" PDF files and automatically renames the files it recognizes from a dictionary. PyPDF2 returns only empty lines for some PDFs, while working fine for others. The code for reading the files:

    import PyPDF2

    # file name
    file = 'sample.pdf'

    # open the file
    with open(file, "rb") as f:
        # read in the file
        pdfReader = PyPDF2.PdfFileReader(f)

        # check the number of pages
        number_of_pages = pdfReader.numPages
        print(number_of_pages)

        # get the first page
        pageObj = pdfReader.getPage(0)

        # extract the text from page 1
        text = pageObj.extractText()

        print(text)

It does print the number of pages correctly, so it is able to open the PDF.

If I replace print(text) with repr(text), then for the files it doesn't read I get something like:

"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
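A small check can flag files whose extraction looks like the output above, so they can be separated from the files that read fine. This is only a minimal sketch: the helper name is_blank_extraction is hypothetical, and it assumes the text has already been obtained from extractText().

```python
def is_blank_extraction(text):
    """Return True if the extracted text is missing or whitespace-only."""
    # A page that yields only newlines strips down to an empty string.
    return text is None or text.strip() == ""


# Hypothetical sample values, not taken from real files.
print(is_blank_extraction("\n\n\n\n\n"))
print(is_blank_extraction("Invoice 2015\n"))
```

Calling this on the text of each page would give a list of the PDFs that come back empty, which is easier to inspect than the printed output.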

Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs worse. It recognized 140 out of 800 files before, and only 110 after enhancing.
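To see which files changed between the two runs, the results could be compared per file instead of only by total count. The sketch below is an assumption-laden illustration: it treats "recognized" as "extraction is not whitespace-only", which may differ from the dictionary matching the script does, and the function names and sample data are hypothetical.

```python
def recognized(texts):
    """Return the set of file names whose extracted text is not blank."""
    return {name for name, text in texts.items() if text and text.strip()}


def compare_runs(before, after):
    """Report files that stopped or started yielding text between two runs."""
    ok_before = recognized(before)
    ok_after = recognized(after)
    return {
        "lost": sorted(ok_before - ok_after),
        "gained": sorted(ok_after - ok_before),
    }


# Hypothetical extraction results: file name -> text of the first page.
before = {"a.pdf": "Invoice 42", "b.pdf": "\n\n", "c.pdf": "Report"}
after = {"a.pdf": "\n\n\n", "b.pdf": "Scanned text", "c.pdf": "Report"}
print(compare_runs(before, after))
```

The "lost" list would show the files that only fail after enhancing, which narrows down what the OCR step changes.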

The PDFs are machine readable/searchable, because I am able to copy/paste the text into Notepad. I tested some files with "pdfminer", and it does show some text, but it also throws a lot of errors. If possible, I would like to keep working with PyPDF2.

Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF2: 1.26.0
Adobe version: 17.009.20058

Does anyone have any suggestions? Help would be much appreciated!
