Extracting a CSV from a PDF

I need my website to extract the data from a PDF and generate a CSV file. I hope to do this on the front end, inside the client browser, but if required, I could do this extraction on the back end.

The PDF would be a monthly merchant credit card statement. The data I would extract to a CSV would be the numerous transactions.

What web technology can do this? And without human intervention.

newbieweb (Sr. Software Engineer) Asked:
David Favor (Linux/LXD/WordPress/Hosting Savant) Commented:
Ah... Sounds like this is your first time parsing large numbers of .pdf documents.

The actual formatting of any given .pdf statement from any company can vary wildly.

Formats change periodically. Columns change. The debit + credit columns can even reverse sometimes, which then makes all your income + expenses reverse.
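One way to catch a silent debit/credit reversal is a balance check. This is only a sketch, and it assumes the statement prints an opening and closing balance (not every merchant statement does) and that amounts have already been extracted as signed decimal strings. The function name is made up for illustration.

```python
from decimal import Decimal

def balances_reconcile(opening, closing, amounts):
    """True when opening balance plus all signed amounts equals the closing balance.

    Decimal is used instead of float so cents are compared exactly.
    """
    total = sum((Decimal(a) for a in amounts), Decimal("0"))
    return Decimal(opening) + total == Decimal(closing)
```

If the columns flip, every sign flips with them, so the sum no longer lands on the printed closing balance and the statement can be held back for review.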

In all my .pdf document code the first pass on any document is a set of unit tests for the document itself, to ensure every single line can be parsed. This includes lines I don't use.

In other words, every single line must be matched + either thrown out or passed through to the next layer of parsing.
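A minimal sketch of that first pass, assuming the PDF has already been converted to text lines. The regular expressions describe an invented statement layout (date, description, amount); a real statement will need its own patterns.

```python
import re

# Hypothetical layout: "MM/DD/YYYY  description  amount". Real statements differ.
TRANSACTION = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<description>.+?)\s+(?P<amount>-?\d[\d,]*\.\d{2})$"
)

# Lines that are recognized on purpose and then thrown out.
IGNORABLE = [
    re.compile(r"^\s*$"),                          # blank lines
    re.compile(r"^Page \d+ of \d+$"),              # page footers
    re.compile(r"^Date\s+Description\s+Amount$"),  # column header
]

def classify(line):
    """Return ("transaction", fields), ("ignored", None), or ("unmatched", None)."""
    text = line.strip()
    match = TRANSACTION.match(text)
    if match:
        return "transaction", match.groupdict()
    if any(pattern.match(text) for pattern in IGNORABLE):
        return "ignored", None
    return "unmatched", None
```

A unit test for a document can then simply assert that no line comes back as "unmatched".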

The next layer matches the statement date + processes the statement based on its date range. In other words, when the document's entire format changes + it will eventually, you have to key off the statement date to ensure you're even running the correct version of your parser.
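Keying the parser off the statement date could look roughly like this. The parser functions and the cutover dates are placeholders invented for the example, not anything from a real statement.

```python
from datetime import date

def parse_v1(lines):
    # Placeholder for the parser that understands the older layout.
    return {"layout": "v1", "line_count": len(lines)}

def parse_v2(lines):
    # Placeholder for the parser that understands the newer layout.
    return {"layout": "v2", "line_count": len(lines)}

# (first statement date the layout applies to, parser), newest first.
# The dates are invented for illustration.
PARSERS = [
    (date(2023, 7, 1), parse_v2),
    (date(2019, 1, 1), parse_v1),
]

def parser_for(statement_date):
    """Pick the parser whose layout was in effect on the statement date."""
    for effective_from, parser in PARSERS:
        if statement_date >= effective_from:
            return parser
    raise ValueError(f"no parser covers statements dated {statement_date}")
```

Raising on an uncovered date is deliberate here: a statement older or newer than anything known should stop the run rather than be parsed with the wrong layout.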

Any record that can't be parsed is collected in an accumulator. At the end of completely processing either one .pdf or a collection of .pdf files, if there's any residue in the accumulator, this residue is kicked out + then you'll have to write code to handle these records.
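The accumulator idea, sketched under the assumption that each statement is already a list of text lines and that a line parser returns None when it cannot handle a line. Both function names and the pipe-delimited toy format are hypothetical.

```python
def toy_parse_line(line):
    """Toy parser for an invented "date|description|amount" format."""
    parts = line.split("|")
    if len(parts) != 3:
        return None  # signal: this line could not be parsed
    return {"date": parts[0], "description": parts[1], "amount": parts[2]}

def process_statements(statements, parse_line):
    """statements maps a file name to its list of text lines.

    Returns (records, residue); residue holds (file name, line number, line)
    for every line the parser could not handle.
    """
    records, residue = [], []
    for name, lines in statements.items():
        for number, line in enumerate(lines, start=1):
            record = parse_line(line)
            if record is None:
                residue.append((name, number, line))
            else:
                records.append(record)
    return records, residue
```

Anything left in the residue list after a run points at the exact file and line that needs a new rule.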

Once you get a clean run, I scan every document against the output record data, to ensure a match.
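A rough sketch of checking output against the source, assuming the records are dicts with date, description and amount keys (an invented schema). It writes the CSV with Python's csv module, reads it back, and flags any row with a field that does not appear verbatim in the extracted statement text. This is a weak check; it says nothing about whether the fields came from the same line.

```python
import csv
import io

FIELDS = ["date", "description", "amount"]  # hypothetical record schema

def to_csv(records):
    """Serialize records to CSV text; csv handles quoting of commas in amounts."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def rows_missing_from_source(csv_text, source_text):
    """Return CSV rows having at least one field not found verbatim in the source."""
    missing = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not all(row[field] in source_text for field in FIELDS):
            missing.append(row)
    return missing
```

An empty list from rows_missing_from_source is necessary but not sufficient; it narrows down what still needs a manual look.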

Once you get into this process, you will no longer have any questions about why a manual human review is required for every line of data. You will know why.

Just start the process. Get into the parsing + you'll find out.

Also, another test I build into my processing is to ensure all mandatory records appear.
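A small sketch of a mandatory-record test. The record kinds listed are assumptions about what a statement must contain; substitute whatever your statements always include.

```python
# Hypothetical kinds of record that every statement is expected to contain.
MANDATORY = {"statement_date", "opening_balance", "closing_balance"}

def missing_mandatory(records):
    """Return the sorted list of mandatory record kinds that never appeared."""
    seen = {record["kind"] for record in records}
    return sorted(MANDATORY - seen)
```

A non-empty result usually means the layout changed or a page failed to extract, which is different from a single unparsable line.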

One year, Citicard changed from including a text component in all their statements to including only a single jpeg image, which you can't parse with poppler. When this type of statement change occurs, you'll either have to run a command line preprocessor to convert an image-only statement to text or run these statements through a fast scanner, like a SnapScan (<1 second to scan both sides of a double-sided page).
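Detecting an image-only statement before parsing might be sketched like this. It assumes poppler's pdftotext command is installed and on the PATH; the 20 character threshold is an arbitrary guess, not a tested value.

```python
import subprocess

def extract_text(pdf_path):
    """Run poppler's pdftotext and return the text layer of the PDF.

    The trailing "-" tells pdftotext to write to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

def looks_image_only(text, min_chars=20):
    """Heuristic: almost no visible characters means there is no usable text layer."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars
```

Statements flagged by looks_image_only(extract_text(path)) can be routed to the OCR or rescanning step instead of the normal parser.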

Note: If you're unfamiliar with all the tools for accomplishing this, might be good for you to hire someone who's done this for years, as it takes a massive amount of experience to know all the problems + how to code around most of them.
David Favor (Linux/LXD/WordPress/Hosting Savant) Commented:
https://www.experts-exchange.com/questions/29075322/Example-of-how-Perl-can-be-used-to-process-an-array-of-data.html covers my answer about this to another person just this morning.

Scan my answer there + update this question with anything else you require.
David Favor (Linux/LXD/WordPress/Hosting Savant) Commented:
https://www.experts-exchange.com/questions/29075272/Is-WordPress-ALL-I-NEED.html is also required reading, as parsing .pdf documents is far more complex than you might imagine.

If you're working with money + must be correct every time, you will require a human intervention (manual human review) step.
newbieweb (Sr. Software Engineer) Author Commented:
Is the human review step because of potential formatting issues? Otherwise, since this does not involve converting images to text, how might an error be introduced?
Maggie JY Commented:
You can extract data or information from a PDF form when you have the right PDF tools. Here are two samples.

1. Export form data into Excel (CSV), please see the screenshot:

2. Export data from scanned PDFs, please see the screenshot:

Then you can extract PDF form data from hundreds of identical forms into a single, accessible Excel sheet within seconds.
If your files are scanned PDFs, see the second screenshot. OCR technology can convert piles of paper documents into Office files; you can then apply the same data extraction rules to hundreds of scanned PDFs with an identical layout and export all the data into a single spreadsheet.
Here is the full guide:
extract data
Question has a verified solution.