Software Solution that Systematically & Programmatically Identifies and Compares specific text from flattened PDFs (for automatic notification of any month-to-month changes)

Posted on 2012-09-12
Medium Priority
Last Modified: 2013-01-11
I am provided about 1000 flattened (non-fillable) PDFs per month, each of which contains our customers' information (addresses, move-in dates/move-out dates), and I bear the burden of manually comparing all of them on a monthly basis to identify any changes to their occupancy status, telephone numbers, etc., so that I can take the appropriate action should any of their information change.  If none of their information changes, or we get a new customer, I do not need to take any action, but I still waste time comparing them.

What is the most suitable software solution/development platform to systematically, programmatically and automatically identify the relevant text areas, extract the text from each PDF, log/store/archive it, compare it to the customer's data from last month, and notify me which ones have changes?

It's very time-consuming to go through each PDF unnecessarily when 90% of the information does not change and no action is necessary.  I just need to be notified of the 10%.

Is there an Adobe product that would accommodate my development needs?  ColdFusion?  Anything?  Even if it were a multiple-platform solution, it would be helpful.  The only thing I don't have any control over is the format in which the data is provided to me (the flat PDFs).  Thank you in advance for your time!
Question by:Hwy419
LVL 46

Assisted Solution

aikimark earned 1000 total points
ID: 38424715
I would address this with two folders: the prior month (last known status) and the live, updatable folder.
1. Compare the contents of the two folders on three criteria:

a. What are the new items that we did not have last month?
b. What items are missing from this month that we had last month?
c. For same-named items, are their hash values the same?
I will leave the processing details of items in the 1.a and 1.b groups to you.
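
The three-way folder comparison above could be sketched roughly as follows in Python. This is only an illustration, not code from any existing tool: the function names `file_hash` and `compare_folders` are made up here, SHA-256 is just one possible hash, and the sketch assumes each customer's PDF keeps the same file name from month to month.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_folders(prior_dir, current_dir):
    """Return (new, missing, changed) lists of PDF file names.

    Assumption: files are matched by name across the two folders.
    """
    prior = {p.name: p for p in Path(prior_dir).glob("*.pdf")}
    current = {p.name: p for p in Path(current_dir).glob("*.pdf")}

    new_items = sorted(current.keys() - prior.keys())      # criterion a
    missing_items = sorted(prior.keys() - current.keys())  # criterion b
    changed_items = sorted(                                 # criterion c
        name
        for name in prior.keys() & current.keys()
        if file_hash(prior[name]) != file_hash(current[name])
    )
    return new_items, missing_items, changed_items
```

One caveat: a PDF that is regenerated every month may differ byte for byte (for example in embedded dates) even when the visible text is identical, so unequal hashes only mean "look closer", not necessarily "the customer data changed".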

For all items in 1.c that have unequal hash values, you might do something as simple as extracting the text (I use PDFTEXT for one of my client applications, but there are other utilities).  Once you have the text, you could run it and the prior month's text through some text comparison utility or use Word to do the comparison.
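
As a rough sketch of the extract-and-compare step, the following Python uses the standard `difflib` module for the comparison. The extraction helper is an assumption: it calls Poppler's `pdftotext` command-line tool (not necessarily the utility mentioned above) and only works if the flattened PDFs still contain a text layer rather than scanned images. The names `extract_text` and `changed_lines` are hypothetical.

```python
import difflib
import subprocess


def extract_text(pdf_path: str) -> str:
    """Extract text from a PDF.

    Assumption: Poppler's `pdftotext` is installed and on PATH;
    the trailing "-" asks it to write the text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def changed_lines(old_text: str, new_text: str) -> list:
    """Return only the removed ("-") and added ("+") lines between two texts."""
    diff = list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="prior",
            tofile="current",
            lineterm="",
            n=0,  # no context lines, only the differences
        )
    )
    # The first two entries are the "---"/"+++" file headers; hunk markers
    # start with "@@" and are dropped by the prefix filter.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

An empty result from `changed_lines(extract_text(prior_pdf), extract_text(current_pdf))` would mean the extracted text is the same, so that customer needs no attention this month.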

It would be possible to write a program to extract data directly out of the PDF fields, but that seems like overkill for your problem description.
LVL 35

Accepted Solution

Bembi earned 1000 total points
ID: 38426124
Just some additional thoughts to add to aikimark's...

As you said, 90 percent are equal....
If all documents are stored in something like a staging folder, it should not be too difficult to at least sort them into other folders with a simple hash comparison. A small program can compute a hash of each new file and compare it against a database to decide whether the file already exists or not.

This way you can sort out 90%.
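
A minimal sketch of such a small program, assuming Python with its built-in `sqlite3` module as the database and a stable file name per customer. The table name `seen` and the functions `open_registry` and `classify` are invented for this illustration.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_registry(db_path=":memory:"):
    """Open (or create) the table that stores the last known hash per file name."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "name TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def classify(conn, pdf_path):
    """Return 'new', 'unchanged' or 'changed', then record the latest hash."""
    path = Path(pdf_path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    row = conn.execute(
        "SELECT sha256 FROM seen WHERE name = ?", (path.name,)
    ).fetchone()

    if row is None:
        status = "new"
    elif row[0] == digest:
        status = "unchanged"
    else:
        status = "changed"

    # Remember this month's hash for next month's comparison.
    conn.execute(
        "INSERT OR REPLACE INTO seen (name, sha256) VALUES (?, ?)",
        (path.name, digest),
    )
    conn.commit()
    return status
```

Running `classify` over every file in the staging folder would leave only the "changed" ones for a closer look; passing a file path instead of `":memory:"` to `open_registry` keeps the hashes between monthly runs.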

The other thought, in an entirely different direction, is to move the responsibility from your side to the side that delivers the information, as far as that is possible. That means the source of the information is responsible for keeping it current, instead of bombarding one department with useful or useless information.

So SharePoint or a similar system may be worth considering. You can give people access to specific pages, and they take care of keeping them up to date.

Another option: SharePoint also provides something like a staging folder. If you get the information by email, you can redirect it to an email-enabled staging document library in SharePoint, which has a rule set that automatically moves documents into other libraries, depending on the rule definitions. Or you can use this staging library as a central upload store.
I am not sure whether you can capture enough information from the documents and the source to build a reliable rule set, but it is just an idea. Even if there is no way to really compare the documents, the newest ones are usually the most current.
At least the SharePoint enterprise features are capable of indexing PDF files as well, so there should be a way to recognize particular kinds of documents and know where to move them.

As information flooding is one of the current side effects of information exchange, my suggestion would be to change the procedure in general to limit the maintenance effort, rather than trying to cope with the ever-growing amount of information.

Author Closing Comment

ID: 38769006
Thank you for your recommendations, suggestions and input.  They were very helpful!
