Extracting a CSV from a PDF

curiouswebster used Ask the Experts™
I need my website to extract the data from a PDF and generate a CSV file. I hope to do this on the front end, inside the client browser, but if required, I could do this extraction on the back end.

The PDF would be a monthly merchant credit card statement. The data I would extract to a CSV would be the numerous transactions.

What web technology can do this? And without human intervention.

David Favor, Fractional CTO
Distinguished Expert 2018
https://www.experts-exchange.com/questions/29075322/Example-of-how-Perl-can-be-used-to-process-an-array-of-data.html covers my answer about this to another person just this morning.

Scan my answer there + update this question with anything else you require.
David Favor, Fractional CTO
Distinguished Expert 2018
https://www.experts-exchange.com/questions/29075272/Is-WordPress-ALL-I-NEED.html is also required reading, as parsing .pdf documents is far more complex than you might imagine.

If you're working with money + must be correct every time, you will require a human intervention (manual human review) step.
curiouswebster, Software Engineer


Is the human review step because of potential formatting issues? Otherwise, since this does not involve converting images to text, how might an error be introduced?
David Favor, Fractional CTO
Distinguished Expert 2018
Ah... Sounds like this is your first time parsing large numbers of .pdf documents.

The actual formatting of any given .pdf statement from any company can vary wildly.

Formats change periodically. Columns change. The debit + credit columns can even reverse sometimes, which then makes all your income + expenses reverse.
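One way to guard against a debit/credit swap is to read the column order from the statement's own header line instead of hard-coding it. This is only a sketch: the function name `debit_credit_order` and the header wording ("Debit", "Credit") are assumptions, and real statements may label these columns differently.

```python
def debit_credit_order(header_line):
    """Return the left-to-right order of the debit and credit columns.

    Assumes the header line literally contains the words "debit" and
    "credit" (hypothetical labels; adjust for the real statement).
    """
    lowered = header_line.lower()
    debit_at = lowered.find("debit")
    credit_at = lowered.find("credit")
    if debit_at < 0 or credit_at < 0:
        # Refuse to guess: a missing label means the layout changed.
        raise ValueError(f"debit/credit labels not found in: {header_line!r}")
    if debit_at < credit_at:
        return ("debit", "credit")
    return ("credit", "debit")
```

A parser can compare this result with the order it was written for and stop if the two disagree, rather than silently flipping income and expenses.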

In all my .pdf document code the first pass on any document is a set of unit tests for the document itself, to ensure every single line can be parsed. This includes lines I don't use.

In other words, every single line must be matched + either thrown out or passed through to the next layer of parsing.
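A minimal sketch of this first pass, assuming the statement text has already been extracted to plain lines. The patterns below (a "MM/DD description amount" transaction line, a page footer, a column header) are hypothetical; each real statement layout needs its own set.

```python
import re

# Hypothetical patterns for one statement layout; real statements will differ.
TRANSACTION = re.compile(
    r"^(?P<date>\d{2}/\d{2})\s+(?P<desc>.+?)\s+(?P<amount>-?\d[\d,]*\.\d{2})$"
)
IGNORABLE = [
    re.compile(r"^$"),                             # blank lines
    re.compile(r"^Page \d+ of \d+$"),              # page footers
    re.compile(r"^Date\s+Description\s+Amount$"),  # column headers
]


def classify(line):
    """Return ('transaction', fields), ('ignored', None) or ('unmatched', None)."""
    line = line.strip()
    match = TRANSACTION.match(line)
    if match:
        return "transaction", match.groupdict()
    if any(pattern.match(line) for pattern in IGNORABLE):
        return "ignored", None
    # Neither used nor knowingly discarded: this line needs attention.
    return "unmatched", None
```

The point of the third outcome is that nothing is dropped silently: a line is either used, explicitly thrown out, or flagged.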

The next layer matches the statement date + processes the statement based on its date range. In other words, when the document's entire format changes (+ it eventually will), you have to key off the statement date to ensure you're even running the correct version of your parser.
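Keying the parser version off the statement date could look roughly like this. The two parser functions and the cut-over dates are placeholders invented for the illustration, not taken from any real statement history.

```python
from datetime import date


def parse_layout_2016(text):
    # Placeholder for the parser written against the older layout.
    return {"layout": "2016", "text": text}


def parse_layout_2018(text):
    # Placeholder for the parser written against the newer layout.
    return {"layout": "2018", "text": text}


# (first statement date the layout applies to, parser), newest first.
# The dates are made up for the example.
PARSERS = [
    (date(2018, 3, 1), parse_layout_2018),
    (date(2016, 1, 1), parse_layout_2016),
]


def parser_for(statement_date):
    """Pick the parser whose date range covers the statement date."""
    for valid_from, parser in PARSERS:
        if statement_date >= valid_from:
            return parser
    raise ValueError(f"no parser covers statement date {statement_date}")
```

Raising on an uncovered date is deliberate here: a statement older than every known layout should stop the run rather than be parsed with the wrong rules.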

Any record that can't be parsed is collected in an accumulator. At the end of completely processing either one .pdf or a collection of .pdf files, if there's any residue in the accumulator, this residue is kicked out + you'll have to write code to handle these records.
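The accumulator idea can be sketched as below. It assumes each .pdf has already been turned into a list of text lines; the `process` function and the transaction pattern are hypothetical names and formats chosen for the example.

```python
import re

# Hypothetical transaction pattern: "MM/DD  description  amount".
TRANSACTION = re.compile(r"^(\d{2}/\d{2})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")


def process(documents):
    """documents maps a file name to the text lines extracted from it."""
    records = []
    residue = []  # the accumulator: (file name, line number, raw line)
    for name, lines in documents.items():
        for number, line in enumerate(lines, start=1):
            stripped = line.strip()
            if not stripped:
                continue  # blank lines are knowingly skipped
            match = TRANSACTION.match(stripped)
            if match:
                records.append((name, *match.groups()))
            else:
                residue.append((name, number, line))
    return records, residue
```

After the whole batch has run, a non-empty `residue` list tells you exactly which file and line still need a rule.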

Once you get a clean run, I scan every document against the output record data, to ensure a match.
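One possible form of this check, assuming the statement prints a total you can compare against: sum the parsed amounts and reconcile them with the printed figure. This is only one kind of match test and the `reconcile` helper is hypothetical; `Decimal` is used so cents are not lost to floating point.

```python
from decimal import Decimal


def to_decimal(amount_text):
    """Convert a printed amount such as '1,234.56' to a Decimal."""
    return Decimal(amount_text.replace(",", ""))


def reconcile(amount_texts, printed_total_text):
    """Return (matches, difference) for parsed amounts vs. the printed total."""
    parsed_total = sum((to_decimal(a) for a in amount_texts), Decimal("0"))
    difference = parsed_total - to_decimal(printed_total_text)
    return difference == 0, difference
```

A non-zero difference does not say which line is wrong, only that the output cannot yet be trusted.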

Once you get into this process, you will no longer have any questions about "why is a manual human review" required for every line of data. You will know why.

Just start the process. Get into the parsing + you'll find out.

Also, another test I build into my processing is to ensure all mandatory records appear.
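A sketch of the mandatory-record test, assuming each parsed record is a dict with a "kind" field. The record kinds listed here are invented for the example; the real list depends on the statement.

```python
# Hypothetical record kinds that every statement is expected to contain.
REQUIRED_KINDS = {"statement_date", "opening_balance", "closing_balance"}


def missing_kinds(records):
    """Return the required record kinds that never appeared, sorted."""
    seen = {record["kind"] for record in records}
    return sorted(REQUIRED_KINDS - seen)
```

An empty result means every mandatory record showed up; anything else is a reason to stop and look at the document.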

One year, Citicard changed from including a text component in all their statements to including only a single jpeg image, which you can't parse with poppler. When this type of statement change occurs, you'll either have to run a command line preprocessor to convert an image-only statement to text or run these statements through a fast scanner, like a SnapScan (<1 second to scan both sides of a double-sided page).
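Detecting an image-only statement early can be sketched like this. It assumes poppler's `pdftotext` command is installed and that its output separates pages with form feed characters; the helper names are hypothetical.

```python
import subprocess


def extract_text(pdf_path):
    """Run poppler's pdftotext and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def textless_pages(text):
    """Return 1-based numbers of pages that contain no text at all.

    Assumes pages are delimited by form feeds, with one trailing form feed.
    """
    pages = text.split("\f")
    if len(pages) > 1 and pages[-1] == "":
        pages.pop()  # drop the empty piece after the final form feed
    return [number for number, page in enumerate(pages, start=1)
            if not page.strip()]
```

If `textless_pages(extract_text(path))` is non-empty, the statement likely needs an OCR pass before any line parsing is attempted.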

Note: If you're unfamiliar with all the tools for accomplishing this, might be good for you to hire someone who's done this for years, as it takes a massive amount of experience to know all the problems + how to code around most of them.
You can extract data from a PDF form when you have the right PDF tools. Here are two samples.

1. Export form data into Excel (CSV), please see the screenshot:

2. Export data from scanned PDFs, please see the screenshot:

Then you can extract PDF form data from hundreds of identical forms into a single, accessible Excel sheet within seconds.
If your files are scanned PDFs, see the second screenshot. OCR technology can convert piles of paper documents into Office files; you can then apply the same data extraction rules to hundreds of scanned PDFs with an identical layout and export all the data into one single spreadsheet.
Here is the full guide:
extract data
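For anyone scripting the final step rather than using a PDF tool, merging rows from many statements into one CSV can be sketched with Python's standard `csv` module. The field names and the `to_csv` helper are assumptions made for the example; the rows are expected to come from whatever parser you trust.

```python
import csv
import io

# Hypothetical column set for the combined spreadsheet.
FIELDS = ["source", "date", "description", "amount"]


def to_csv(rows_by_file):
    """Merge parsed rows from several files into one CSV string."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for source in sorted(rows_by_file):
        for row in rows_by_file[source]:
            # Record which file each row came from, for later review.
            writer.writerow({"source": source, **row})
    return buffer.getvalue()
```

Keeping the source file name in every row makes it easier to trace a suspicious amount back to the statement it came from.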
