Extract Text from PDF File

I need to extract text from PDF files so that the columns can be imported into a table.
The column layout must be preserved in the extracted text file.

I have tried some tools, but they either do not do the job properly or are too expensive ($2,000.00).

Shaun Vermaak

GemBox PDF @ $240.00

using GemBox.Pdf;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        using (var document = new PdfDocument())
        {
            // Add a first empty page.
            document.Pages.Add();

            // Add a second empty page.
            document.Pages.Add();

            document.Save("Hello World.pdf");
        }
    }
}

https://www.gemboxsoftware.com/pdf

David Favor

I commonly use this type of processing to convert all my bank + credit card statements into text for processing tax records.

My approach is fairly simple.

1) The PDF file must have a text component. Most do. Some don't. For PDF files with no text component, rescan the statement using a cheap ScanSnap scanner with OCR enabled for all pages. These scanners cost around $100 USD + scan full duplex (both sides of the page) at around 1 page/second.

2) Extract the text component...

pdftotext -enc ASCII7 -nopgbrk -layout "$in" - 2>/dev/null > "$in.txt"

3) Pull $in.txt into a line editor + make any hand edits required to ensure all column alignment is correct.

Note: If you have 1000s of files, you can likely write scripts to do 100%, or nearly 100%, of your fixes.

4) At this point all your .txt files will have their column data normalized (laid out the same way), so this data can be processed mechanically by other software.

5) For access to pdftotext, install the Poppler Tools package for your OS.

All Linux Distros provide a Poppler Tools package. On Macs both MacPorts + Brew provide Poppler Tools packaging.

https://poppler.freedesktop.org/ provides a link to an unofficial Windows version.
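
As a rough illustration of steps 2 and 4 above, here is a minimal shell sketch. It reuses the exact pdftotext flags from step 2; everything else is an assumption made for illustration: the function names (pdf_to_text, convert_all, layout_to_tsv), the tab-separated output, and the rule that columns are separated by two or more spaces. A statement with empty cells or single-space gaps between columns would still need the hand edits from step 3.

```shell
#!/bin/sh
# Sketch only: function names and the TSV output are illustrative assumptions.

# Step 2: convert one PDF to layout-preserving text (same flags as above).
# Returns non-zero if pdftotext fails or the PDF has no usable text component.
pdf_to_text() {
    in=$1
    pdftotext -enc ASCII7 -nopgbrk -layout "$in" - 2>/dev/null > "$in.txt" || return 1
    # A scanned PDF without OCR yields only whitespace, so require some text.
    grep -q '[[:alnum:]]' "$in.txt"
}

# Run step 2 over every PDF in a directory and report files that need a rescan.
convert_all() {
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue          # directory contains no PDFs
        pdf_to_text "$f" || echo "no text component, rescan with OCR: $f" >&2
    done
}

# Step 4: turn aligned columns into tab-separated fields (stdin to stdout).
# Assumption: columns are separated by runs of two or more spaces.
layout_to_tsv() {
    awk '
        NF == 0 { next }                 # skip blank lines
        {
            sub(/^ +/, "")               # trim leading spaces
            sub(/ +$/, "")               # trim trailing spaces
            gsub(/  +/, "\t")            # collapse column gaps into one tab
            print
        }
    '
}
```

Typical use would be convert_all ~/statements, then the hand edits from step 3, then layout_to_tsv < statement.pdf.txt > statement.tsv for each file (the paths here are placeholders).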

WRieder

ASKER

David Favor:

pdftotext -enc ASCII7 -nopgbrk -layout "$in" - 2>/dev/null > "$in.txt"

Can you please let me know where you found this?

ASKER CERTIFIED SOLUTION

Joe Winograd

This solution is only available to members.

WRieder

ASKER

To Joe Winograd:

Thank you very much, it has solved my problem.
I have marked your answer as solution.
I'm not sure how the points are assigned, but I will
give you the maximum allowed.

Kind Regards
Wolfgang

Joe Winograd

You're very welcome, Wolfgang, I'm glad to hear that it solved your problem. And thanks to you for the points...much appreciated! Regards, Joe