WRieder
asked on
Extract Text from PDF File
I need to import text from PDF files so that the columns can be added to a table.
Columns must be preserved in the text file.
I have tried some tools, but they either do not do the job properly or are too expensive ($2000.00).
One common way I use this type of processing is to convert all my bank + credit card statements into text for processing tax records.
My approach is fairly simple.
1) The PDF file must have a text component. Most do. Some don't. For PDF files with no text component, rescan the statement using a cheap ScanSnap scanner with OCR of all pages enabled. These scanners cost around $100 USD + scan full duplex (both page sides) at around 1 page/second.
2) Extract the text component...
pdftotext -enc ASCII7 -nopgbrk -layout "$in" - 2>/dev/null > "$in.txt"
3) Pull $in.txt into a line editor + make any hand edits required to ensure all column alignment is correct.
Note: If you have 1000s of files, you can likely write scripts to do either 100% or nearly 100% of your fixes.
4) At this point all your .txt files will have their column data normalized (the same), so this data can be processed mechanically by other software.
5) For access to pdftotext, install the Poppler Tools package for your OS.
All Linux Distros provide a Poppler Tools package. On Macs both MacPorts + Brew provide Poppler Tools packaging.
https://poppler.freedesktop.org/ provides a link to an unofficial Windows version.
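For anyone with many files, the steps above can be sketched as a small batch script. This is only a rough sketch, not the poster's own tooling: the function names (extract_text, has_text, columns_to_tsv) are made up for illustration, and treating a run of two or more spaces as a column boundary is an assumption that will not hold for every statement layout, so hand checks per step 3 may still be needed.

```shell
#!/usr/bin/env bash
# Sketch of steps 1, 2 and 4. Function names are hypothetical.

# Step 2: extract the text layer of one PDF to stdout.
#   -enc ASCII7  write 7-bit ASCII output
#   -nopgbrk     do not insert page-break characters between pages
#   -layout      keep the physical layout, so columns stay aligned
#   -            write to stdout instead of a file
extract_text() {
    pdftotext -enc ASCII7 -nopgbrk -layout "$1" - 2>/dev/null
}

# Step 1: succeed only if the file holds at least one non-blank character.
# An empty result usually means the PDF has no text component.
has_text() {
    grep -q '[^[:space:]]' "$1"
}

# Step 4: turn aligned columns into tab-separated fields.
# Assumption: two or more spaces separate columns, one space stays inside a field.
columns_to_tsv() {
    awk '{ sub(/^ +/, ""); sub(/ +$/, ""); gsub(/  +/, "\t"); print }'
}

# Process every PDF named on the command line.
if [ "$#" -gt 0 ]; then
    command -v pdftotext >/dev/null 2>&1 || {
        echo "pdftotext not found; install the Poppler Tools package" >&2
        exit 1
    }
    for pdf in "$@"; do
        extract_text "$pdf" > "$pdf.txt"
        if has_text "$pdf.txt"; then
            columns_to_tsv < "$pdf.txt" > "$pdf.tsv"
            echo "ok: $pdf.txt $pdf.tsv"
        else
            echo "no text component, rescan with OCR: $pdf" >&2
        fi
    done
fi
```

Saved under a name of your choosing and run with a list of PDF files as arguments, it writes a .txt and a .tsv next to each PDF that has a text component, and reports the ones that need rescanning on stderr.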
ASKER
David Favor:
pdftotext -enc ASCII7 -nopgbrk -layout "$in" - 2>/dev/null > "$in.txt"
Can you please let me know where you found this?
ASKER CERTIFIED SOLUTION
ASKER
To Joe Winograd:
Thank you very much, it has solved my problem.
I have marked your answer as solution.
Not sure how the points are assigned, but I will
give you the maximum allowed.
Kind Regards
Wolfgang
You're very welcome, Wolfgang, I'm glad to hear that it solved your problem. And thanks to you for the points...much appreciated! Regards, Joe
https://www.gemboxsoftware.com/pdf