How to programmatically read or search through a PDF file?

Posted on 2004-11-15
Last Modified: 2012-05-05
Dear pdf experts:

Every week, I receive approximately 575 statistical data reports in PDF format.
I need to file these reports electronically by sender, and the only way I can get the sender's name is to open and read each PDF report. Rather than doing this manually, which is what I do now, I would like to know whether there is a way to read the PDF files programmatically, extract the sender's name, and move each file into the appropriate electronic folder.

Please note: the file names of the PDF reports are not the same from week to week, but inside the reports the format is ALWAYS the same. The name of the sender always appears in the upper left-hand corner (1" margin) and again in the lower right corner (1" margin).

Please HELP!!!!!
In order to read these files programmatically, do I need to have Adobe Writer? If yes, do you have sample code?
If no, then can I use     Please tell me where to look for sample code, or give me some URLs to look into.

Question by:afridays

    Expert Comment

    by:Karl Heinz Kremer
    This is a non-trivial project, and you should be prepared for quite a bit of work...

    First, there is no such program as "Adobe Writer". What you probably mean is "Adobe Acrobat" (of which Adobe announced version 7 today, to be released around the end of the year). You may need Acrobat, but there are other potential solutions.

    It's pretty complicated to get the text at a certain location ("upper left hand corner"), so if you are able to get this information any other way, you should try that first. One option is to extract the textual information from the PDF files. You can use the free pdftotext tool from Xpdf to get a feeling for what you can do with this method. This is a command line tool, and you need to see whether you can find the information you are looking for at a predictable place in its output (e.g. end of first line, end of third line or something similar). If that's possible, just run this tool and filter the output to determine the sender name.
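
    The "run the tool and filter the output" idea could look roughly like the Python sketch below. It assumes pdftotext is installed and on the PATH, and that the sender name sits alone on one predictable non-blank line of the output; that assumption has to be checked against the real reports. The function names and the folder layout are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Run pdftotext on one file and return the extracted text.

    "-layout" asks pdftotext to keep the physical layout of the page;
    the final "-" sends the text to stdout instead of a .txt file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def sender_from_text(text, line_index=0):
    """Return the chosen non-blank line, stripped, or None if it is missing.

    ASSUMPTION: the sender name is alone on a predictable line.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if 0 <= line_index < len(lines):
        return lines[line_index]
    return None


def file_reports(inbox, archive):
    """Move every PDF in `inbox` into `archive/<sender>/` (hypothetical layout)."""
    for pdf in sorted(Path(inbox).glob("*.pdf")):
        sender = sender_from_text(pdf_to_text(pdf))
        if sender is None:
            continue  # leave this report for manual filing
        # Keep only characters that are safe in a folder name.
        folder = "".join(c for c in sender if c.isalnum() or c in " -_").strip()
        if not folder:
            continue
        target = Path(archive) / folder
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(pdf), str(target / pdf.name))
```

    If the name turns out to be on the third line rather than the first, only the line_index argument needs to change.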

    If that's not possible, try to see if the information is e.g. stored in the document properties (File>Document Properties). This information can also be easily extracted.
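
    One way to read the document properties without Acrobat is the pdfinfo tool that ships alongside pdftotext. The Python sketch below assumes pdfinfo is installed and that the sender is stored in a field such as Author, which may well not be true for these reports. The helper names are made up for illustration.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's "Key:   value" lines into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so dates keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def document_properties(pdf_path):
    """Run pdfinfo on one file and return its properties as a dict."""
    result = subprocess.run(
        ["pdfinfo", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    return parse_pdfinfo(result.stdout)
```

    A call like document_properties("report.pdf").get("Author") would then return the stored author, or None when the field is absent.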

    If this is also not working, you need to extract the contents of the PDF file and try to find the sender information. This is not a simple task.

    The Acrobat JavaScript API allows you to get the Nth word of a page (Doc.getPageNthWord()). You need to do this for all words until you find your sender information. You can get the location of a word with Doc.getPageNthWordQuads(). You need to define what "upper left hand corner" means in terms of absolute location.
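
    To make "absolute location" concrete, here is a small Python illustration of the geometry only; it does not call Acrobat. It assumes each word comes with a quad of 8 numbers (four x,y corner pairs), that coordinates are in points (72 per inch) with the origin at the lower left of the page, and that the sender name fits in a box 4" wide and 1" high starting at the 1" margin. The box size and the slack value are guesses to be tuned.

```python
POINTS_PER_INCH = 72.0


def quad_bounds(quad):
    """Return (left, bottom, right, top) for a quad of 4 x,y corner pairs."""
    xs = quad[0::2]
    ys = quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def in_upper_left(quad, page_height, margin_in=1.0,
                  width_in=4.0, height_in=1.0, slack=2.0):
    """True if the quad lies inside the box at the top-left margin.

    `slack` (in points) tolerates glyphs that poke slightly past the margin.
    """
    left, bottom, right, top = quad_bounds(quad)
    box_left = margin_in * POINTS_PER_INCH - slack
    box_right = (margin_in + width_in) * POINTS_PER_INCH + slack
    box_top = page_height - margin_in * POINTS_PER_INCH + slack
    box_bottom = page_height - (margin_in + height_in) * POINTS_PER_INCH - slack
    return (box_left <= left and right <= box_right
            and box_bottom <= bottom and top <= box_top)


def words_in_upper_left(words, page_height):
    """Keep the text of (text, quad) pairs that fall in the upper-left box."""
    return [text for text, quad in words if in_upper_left(quad, page_height)]
```

    For a US Letter page, page_height would be 792 points; the same test applies whether the quads come from Acrobat or from another extraction library.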

    You can do this with just JavaScript in Acrobat. If you run this as a batch process, you could e.g. process all files in a certain directory and save them to a new directory under a new name. If you do not want to use JavaScript directly, you can also use the VB/JavaScript bridge and do the programming in VB. This solution would require the full version of Adobe Acrobat Professional.

    Does this sound like a viable option? If yes, I can provide more information.

    Author Comment

    Currently, I do not have the full version of Adobe Acrobat, but I will check with my manager about purchasing it. He is very, very cheap.

    I tried to capture the sender name through (File>Document Properties) but that is not possible because that data is not 100% accurate.

    There has got to be a way for me to capture the name of the sender.

    Expert Comment

    by:Karl Heinz Kremer
    You can download a 30 day eval version of Acrobat 6.0 Professional from Adobe's web site.

    Have you tried the pdftotext program?

    Author Comment

    To:  khkremer
    For two days now, we have not been able to get pdftotext to work either, because of the way the contents are formatted.
    For example: for one PDF document, all of the extracted text runs together as one very long string. For another, the extracted text has lots of white space between the words. For a third, there are unusual characters between the text and the paragraphs.

    I am going to need to explore other options.

    Accepted Solution

    This may be an indication that the task is even more complicated.

    The text that's in a PDF file is not organized so that it can be easily extracted. An extraction tool has to find all parts of the text, which can be individual characters, and the locations of these parts, and then, based on the location arrange the text again. In some instances, this is not possible (e.g. what you experienced).
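
    The "arrange the text again based on location" step can be pictured with the toy Python sketch below. It assumes the fragments have already been pulled out of the PDF as (x, y, text) tuples, with y growing upward from the bottom of the page; real extractors also have to deal with fonts, rotation and character spacing, which this ignores.

```python
def rebuild_lines(fragments, line_tolerance=3.0):
    """Rebuild reading order from (x, y, text) fragments.

    Fragments whose y values differ by at most `line_tolerance` points are
    treated as one line; lines run top to bottom, words left to right.
    """
    lines = []  # each entry: [y_of_first_fragment, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Joining with one space is a simplification: a real tool must guess
    # from the gaps whether to insert no space, one space, or many.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]
```

    That guess about gaps is where output like one long unbroken string, or text padded with large runs of white space, tends to come from.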

    Is the string you are looking for at the beginning of the extracted text?

    Author Comment


    To: Venabili

    On December 11, I accepted khkremer's answer.

    Thank you for the reminder!
