Software Solution that Systematically & Programmatically Identifies and Compares Specific Text from Flattened PDFs (for automatic notification of any month-to-month changes)

Posted on 2012-09-12
Last Modified: 2013-01-11
I am provided with about 1000 flattened (non-fillable) PDFs per month, each containing our customers' information (addresses, move-in/move-out dates), and I bear the burden of manually comparing all of them on a monthly basis to identify any changes to occupancy status, telephone numbers, etc., so that I can take the appropriate action should any of the information change.  If none of the information changes, or we get a new customer, I do not need to take any action, but I still waste time comparing them.

I wanted to ask: what is the most suitable software solution/development platform to systematically, programmatically and automatically identify the relevant text areas, extract the text from each PDF, log/store/archive it, compare it to the customer's data from the previous month, and notify me which ones have changes?

It's very time consuming to go through each PDF unnecessarily when 90% of the information does not change and no action is necessary.  I just need to be notified of the 10%.

Is there Adobe software that would accommodate my development needs?  ColdFusion?  Anything?  Even a multi-platform solution would be helpful.  The only thing I have no control over is the format in which the data is provided to me (the flat PDFs).  Thank you in advance for your time!
Question by:Hwy419
    LVL 44

    Assisted Solution

    I would address this with two folders, prior month (last known status) and the live updatable folder.
    1. Compare the contents of the two folders on three criteria:

    a. What are new items that we did not have last month?
    b. What items are missing from this month that we had last month?
    c. For same named items, are their hash values the same?
    I will leave the processing details of items in the 1.a and 1.b groups to you.
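
    The three-way folder comparison above could be sketched roughly as follows. This is only an illustration: the function names, the choice of SHA-256, and the assumption that each customer's PDF keeps the same file name from month to month are mine, not part of the original suggestion.

```python
import hashlib
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_folders(prior_dir, current_dir):
    """Classify PDFs by file name into new (1.a), missing (1.b), changed (1.c)."""
    prior = {p.name: p for p in Path(prior_dir).glob("*.pdf")}
    current = {p.name: p for p in Path(current_dir).glob("*.pdf")}
    new = sorted(current.keys() - prior.keys())
    missing = sorted(prior.keys() - current.keys())
    # Same-named files whose bytes differ between the two months.
    changed = sorted(
        name for name in current.keys() & prior.keys()
        if file_hash(prior[name]) != file_hash(current[name])
    )
    return new, missing, changed
```

    Note that a hash only says whether two files are byte-for-byte identical. If the PDFs are regenerated every month, embedded metadata may differ even when the visible text is the same, so an unequal hash is best treated as "look closer" rather than "the customer data changed".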

    For all items in 1.c that have unequal hash values, you might do something as simple as extracting the text (I use PDFTEXT for one of my client applications, but there are other utilities).  Once you have the text, you could run it and the prior month's text through some text comparison utility or use Word to do the comparison.
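
    As a rough sketch of that extract-and-compare step, the snippet below shells out to a text extractor and then diffs the two months with Python's standard `difflib`. It assumes the `pdftotext` command-line tool (from Xpdf or Poppler) is installed and on the PATH, which may or may not be the same utility mentioned above, and it assumes the flattened PDFs still contain a text layer rather than scanned images. The function names are hypothetical.

```python
import difflib
import subprocess


def extract_text(pdf_path):
    """Return the text of a PDF via the pdftotext CLI (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],  # "-" writes to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def changed_lines(old_text, new_text):
    """Return only the removed ("- ") and added ("+ ") lines between two texts."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

    A caller might run `changed_lines(extract_text(prior_pdf), extract_text(current_pdf))` for each file flagged in 1.c and only notify someone when the returned list is not empty.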

    It would be possible to write a program to extract data directly out of the PDF fields, but that seems like overkill for your problem description.
    LVL 35

    Accepted Solution

    Just some additional thoughts to akimark...

    As you said, 90 percent are equal....
    If all documents are stored in something like a staging folder, it should not be very difficult to at least sort them into other folders with a simple hash comparison. A small program can build a hash from each new file and compare it against a database to decide whether that file already exists or not.

    This way you can sort out 90%.
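
    One possible shape for such a small program, purely as a sketch: it keeps the last known hash per file name in a SQLite table and reports whether an incoming file is new, unchanged, or changed. The table layout, function names, and the use of SQLite and SHA-256 are my assumptions for illustration.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_store(db_path=":memory:"):
    """Open (or create) the hash store; pass a file path to keep it between runs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen "
        "(name TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def classify(conn, pdf_path):
    """Return 'new', 'unchanged', or 'changed', then record the latest hash."""
    path = Path(pdf_path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    row = conn.execute(
        "SELECT sha256 FROM seen WHERE name = ?", (path.name,)
    ).fetchone()
    if row is None:
        status = "new"
    elif row[0] == digest:
        status = "unchanged"
    else:
        status = "changed"
    conn.execute(
        "INSERT OR REPLACE INTO seen (name, sha256) VALUES (?, ?)",
        (path.name, digest),
    )
    conn.commit()
    return status
```

    Files reported as "unchanged" could be moved straight to an archive folder, leaving only the "new" and "changed" ones for a closer look.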

    The other thought, in a completely different direction, is to move the responsibility from your side to the side that delivers the information, as far as that is possible. That means the source of the information is responsible for keeping it up to date, instead of bombarding one department with useful or useless information.

    So maybe SharePoint or a similar system is worth considering. You can give people access to specific pages, and they take care of keeping them up to date.

    As another option, SharePoint also provides something like a staging folder. If you get the information by email, you can redirect it to an email-enabled staging document library in SharePoint, which has a rule set to automatically move documents into other libraries, depending on the rule definitions. Or you can use this staging library as a central upload store.
    I am not sure whether you can capture enough information from the documents and the source to build a reliable rule set, but it is just an idea. Even if there is no way to really compare the documents, the newest ones are usually the most current.
    At least the SharePoint enterprise features are capable of indexing PDF files as well, so there should be a way to recognize particular kinds of documents and know where to move them.

    As information flooding is one of the current side effects of information exchange, my thought would be to change the procedure in general to limit the maintenance effort, rather than trying to get rid of the ever-growing amount of information.

    Author Closing Comment

    Thank you for your recommendations, suggestions and input.  They were very helpful!
