Python 3.x - PyPDF2 returns only empty lines for some files
I am working on a script that "reads" PDF files and automatically renames the files it recognizes from a dictionary. PyPDF2 returns only empty lines for some PDFs, while working fine for others. The code for reading the files:
    import PyPDF2

    # File name
    file = 'sample.pdf'

    # Open the file in binary mode
    with open(file, "rb") as f:
        # Read in the file
        pdfReader = PyPDF2.PdfFileReader(f)

        # Check the number of pages
        number_of_pages = pdfReader.numPages
        print(number_of_pages)

        # Get the first page
        pageObj = pdfReader.getPage(0)

        # Extract the text from page 1
        text = pageObj.extractText()
        print(text)
It reports the number of pages correctly, and I am able to open the PDF.
If I replace print(text) with print(repr(text)), the output for the files it doesn't read looks like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs worse. It recognized 140 out of 800 files before enhancing, and only 110 after.
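For reference, a minimal sketch of how "recognized" versus "empty" could be counted, assuming a result counts as empty when it contains almost no non-whitespace characters. The helper names and the `min_chars` threshold are hypothetical and not part of PyPDF2; the threshold is there because the blank output above still contains a stray character.

```python
def visible_chars(text):
    """Count the characters in text that are not whitespace."""
    return sum(1 for ch in (text or "") if not ch.isspace())


def looks_empty(text, min_chars=5):
    """Treat text as empty if it has fewer than min_chars visible characters.

    min_chars is an assumed threshold, chosen so that output such as
    '\\n\\n\\nn\\n\\n' (newlines plus one stray character) still counts as empty.
    """
    return visible_chars(text) < min_chars


def count_recognized(texts, min_chars=5):
    """Return how many extracted texts contain usable content."""
    return sum(1 for t in texts if not looks_empty(t, min_chars))


if __name__ == "__main__":
    # Stand-in strings; in the real script these would come from extractText().
    samples = ["\n\n\n\nn\n\n\n", "Invoice 2017\nTotal: 100", ""]
    print(count_recognized(samples), "of", len(samples), "look readable")
```

Running the same count before and after the OCR step would show which individual files change from readable to empty.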
The PDFs are machine readable/searchable, because I am able to copy/paste the text into Notepad. I tested some files with "pdfminer" and it does show some text, but it also throws a lot of errors. If possible, I would like to keep working with PyPDF2.
Specifications of the software I am using:
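One way to keep PyPDF2 as the main reader and only fall back to another library for the problem files is sketched below. This is an assumption about how the script could be structured, not something I have working: `first_usable_text` and `min_chars` are hypothetical names, and each extractor is any function that takes a file path and returns a string (for example a wrapper around the PyPDF2 code above, followed by a wrapper around pdfminer).

```python
def first_usable_text(path, extractors, min_chars=5):
    """Try each (name, function) pair in order and return the first usable text.

    Returns (name, text) for the first extractor whose output has at least
    min_chars non-whitespace characters, or (None, "") if none succeeds.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # An extractor that raises is skipped so the next one can be tried.
            continue
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible >= min_chars:
            return name, text
    return None, ""


if __name__ == "__main__":
    # Stand-in extractors for illustration; real ones would open the PDF.
    def blank_reader(path):
        return "\n\n\n\n"

    def working_reader(path):
        return "Invoice 2017"

    print(first_usable_text("sample.pdf", [("first", blank_reader),
                                           ("second", working_reader)]))
```

Putting the PyPDF2 wrapper first in the list means the second library is only used for files where PyPDF2 returns blank output.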
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Does anyone have any suggestions? Much appreciated!