WRieder asked:

Extract Text from PDF File

I need to import text from PDF files, where the columns need to be added to a table.
Columns must be preserved in the text file.

I have tried some tools, but they either do not do the job properly or are too expensive ($2000.00).

Shaun Vermaak

GemBox PDF @ $240.00
using GemBox.Pdf;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        using (var document = new PdfDocument())
        {
            // Add a first empty page.
            document.Pages.Add();

            // Add a second empty page.
            document.Pages.Add();

            document.Save("Hello World.pdf");
        }
    }
}

https://www.gemboxsoftware.com/pdf

David Favor

One common use I have for this type of processing is converting all my bank + credit card statements into text for processing tax records.

My approach is fairly simple.

1) The PDF file must have a text component. Most do. Some don't. For PDF files with no text component, rescan the statement using a cheap ScanSnap scanner with OCR of all pages enabled. These scanners cost around $100 USD + scan full duplex (both sides of the page) at around 1 page/second.

2) Extract the text component...

pdftotext -enc ASCII7 -nopgbrk -layout "$in" - 2>/dev/null > "$in.txt"

3) Pull $in.txt into a line editor + make any hand edits required to ensure all column alignment is correct.

Note: If you have 1000s of files, you can likely write scripts to do either 100% or nearly 100% of your fixes.

4) At this point all your .txt files will have their column data normalized (the same), so this data can be processed mechanically by other software.

5) For access to pdftotext, install the Poppler Tools package for your OS.

All Linux Distros provide a Poppler Tools package. On Macs both MacPorts + Brew provide Poppler Tools packaging.

https://poppler.freedesktop.org/ provides a link to an unofficial Windows version.
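
Steps 2 and 4 above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names convert_pdfs and columns_to_tsv are made up for illustration (they are not part of Poppler), and the column split assumes that columns in the -layout output are separated by two or more spaces, which will not hold for every statement. Files that break that assumption still need the hand edits from step 3.

```shell
#!/bin/sh
# Sketch only: convert_pdfs and columns_to_tsv are hypothetical helper names.

# Step 2: run pdftotext on every PDF in a directory (default: current one).
#   -enc ASCII7  write plain 7-bit ASCII text
#   -nopgbrk     do not insert page-break characters between pages
#   -layout      keep the physical page layout, so columns stay aligned
convert_pdfs() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs.
        [ -e "$pdf" ] || continue
        pdftotext -enc ASCII7 -nopgbrk -layout "$pdf" - 2>/dev/null > "$pdf.txt"
        # An empty result usually means the PDF has no text component (step 1).
        [ -s "$pdf.txt" ] || echo "no text extracted: $pdf" >&2
    done
}

# Step 4: turn aligned columns into tab-separated fields.
# Assumption: columns are separated by two or more spaces.
columns_to_tsv() {
    awk -F'  +' -v OFS='\t' '
        /^ *$/ { next }                      # drop blank lines
        { sub(/^ +/, ""); sub(/ +$/, "") }   # trim leading and trailing spaces
        { $1 = $1; print }                   # rebuild the line using tabs
    ' "$@"
}
```

After sourcing the script, something like convert_pdfs followed by columns_to_tsv on the resulting .txt files should give tab-separated output that a spreadsheet or database import can read. A single space inside a value (such as a payee name) is kept as part of that field, which is why the split is on two or more spaces rather than one.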

WRieder (ASKER)

David Favor:

pdftotext -enc ASCII7 -nopgbrk -layout "$in" - 2>/dev/null > "$in.txt"

Can you please let me know where you found this?

ASKER CERTIFIED SOLUTION
Joe Winograd

This solution is only available to members of Experts Exchange.

WRieder (ASKER)

To Joe Winograd:

Thank you very much, it has solved my problem.
I have marked your answer as the solution.
I am not sure how the points are assigned, but I will
give you the maximum allowed.

Kind Regards
Wolfgang

You're very welcome, Wolfgang, I'm glad to hear that it solved your problem. And thanks to you for the points...much appreciated! Regards, Joe