afridays

How to programmatically read or search through pdf file?

Dear pdf experts:

Every week, I receive approximately 575 statistical data reports in PDF format.
I need to file these reports electronically by sender, and the only way I can get the sender's name is to open the PDF report and read it. Rather than doing this manually, which is what I do now, I would like to know whether there is a way to read each PDF file programmatically, extract the sender's name, and move the file into the appropriate electronic folder.

Please note: the file names of the PDF reports are not the same from week to week, but the report format inside is ALWAYS the same for all of them. The name of the sender is always located in the upper left-hand corner (1" margin) and also in the lower right corner (1" margin).

Please HELP!!!!!
In order to read these files programmatically, do I need Adobe Writer? If yes, do you have sample code?
If no, can I use VB.NET? Please tell me where to look for sample code, or give me some URLs to look into.

Thanks,
Augusta
Karl Heinz Kremer

This is a non-trivial project, and you should be prepared for quite a bit of work...

First, there is no such program as "Adobe Writer". What you probably mean is "Adobe Acrobat" (of which Adobe announced version 7 today, to be released around the end of the year). You may need Acrobat, but there are other potential solutions.

It's pretty complicated to get the text at a certain location ("upper left hand corner"), so if you are able to get this information any other way, you should try that first. One option is to extract the textual information from the PDF files. You can use the free pdftotext tool from Xpdf (http://www.foolabs.com/xpdf) to get a feeling for what you can do with this method. This is a command-line tool, and you need to see whether you can locate the information you are looking for in its output by some simpler rule (e.g. end of first line, end of third line, or something similar). If that's possible, just run this tool and filter the output to determine the sender name.
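
As a rough sketch of the "run the tool and filter the output" idea (Python is used here purely for illustration): the code below calls pdftotext on the first page only and then applies a made-up filtering rule. The function names and the rule itself ("sender is the leftmost chunk of the first non-blank line") are assumptions, not something known about these reports, so the rule would have to be adjusted after looking at real output.

```python
import re
import subprocess


def pdf_first_page_text(pdf_path, exe="pdftotext"):
    """Run pdftotext on page 1 only and return the extracted text.

    -f/-l limit the page range, -layout tries to keep the physical layout,
    -enc selects the output encoding, and "-" as the output file name
    sends the text to stdout instead of a .txt file.
    """
    result = subprocess.run(
        [exe, "-f", "1", "-l", "1", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def guess_sender(text):
    """Hypothetical rule: the sender is the leftmost chunk of the first
    non-blank line, where chunks are separated by two or more spaces."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return re.split(r"\s{2,}", stripped)[0]
    return None
```

Typical use would be something like `guess_sender(pdf_first_page_text("report.pdf"))`, where `report.pdf` is a placeholder file name.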

If that's not possible, check whether the information is stored in the document properties (File>Document Properties). This information can also be extracted easily.
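
For illustration only, one way to read the document properties without Acrobat is the pdfinfo tool that ships alongside pdftotext in Xpdf. The sketch below (Python; the function names are made up) parses its "Key: value" output into a dictionary. Whether the Author or Title field actually holds the sender name in these reports is an assumption that would need checking.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's "Key:   value" lines into a dict.

    Only the first colon splits key from value, so values that contain
    colons themselves (such as dates with times) stay intact.
    """
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def read_pdf_properties(pdf_path, exe="pdfinfo"):
    """Run pdfinfo on one file and return its properties as a dict."""
    result = subprocess.run([exe, pdf_path], capture_output=True, check=True)
    return parse_pdfinfo(result.stdout.decode("utf-8", errors="replace"))
```

A lookup such as `read_pdf_properties("report.pdf").get("Author")` returns `None` when the field is missing, which makes it easy to fall back to another method.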

If this is also not working, you need to extract the contents of the PDF file and try to find the sender information. This is not a simple task.

The Acrobat JavaScript API allows you to get the Nth word on a given page (Doc.getPageNthWord()). You need to do this for all words until you find your sender information. You can get the location of a word with Doc.getPageNthWordQuads(). You need to define what "upper left hand corner" means in terms of absolute location.
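
To make "absolute location" a bit more concrete, here is a small sketch of the geometry only, written in Python rather than Acrobat JavaScript. It assumes a word's quad arrives as eight numbers (four x/y corner pairs) in points, 72 per inch, with the origin at the lower left of the page and y increasing upward. The size of the corner box (4 inches wide, 2 inches tall) is an arbitrary guess, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72.0


def quad_bounds(quad):
    """Return (left, bottom, right, top) for a quad given as eight numbers
    [x1, y1, x2, y2, x3, y3, x4, y4], regardless of corner order."""
    xs = quad[0::2]
    ys = quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def in_upper_left(quad, page_height, width_in=4.0, height_in=2.0):
    """True if the whole quad lies inside a box anchored at the page's
    upper left corner (box size in inches is an assumed tolerance)."""
    left, bottom, right, top = quad_bounds(quad)
    return (
        right <= width_in * POINTS_PER_INCH
        and bottom >= page_height - height_in * POINTS_PER_INCH
    )
```

On a US Letter page (792 points tall) a word starting at the 1" left margin near the top would pass this test, while a word in the lower right corner would not. The same comparison can be written directly in JavaScript against the values returned by getPageNthWordQuads().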

You can do this with just JavaScript in Acrobat. If you run this as a batch process, you could e.g. process all files in a certain directory and save them to a new directory under a new name. If you do not want to use JavaScript directly, you can also use the VB/JavaScript bridge and do the programming in VB. This solution would require the full version of Adobe Acrobat Professional.
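
Whichever way the sender name is obtained, the filing step itself is simple. The following is a minimal sketch in Python, assuming one folder per sender under a common root; the function names, the folder layout, and the character whitelist are all illustrative choices, not requirements.

```python
import re
import shutil
from pathlib import Path


def safe_folder_name(sender):
    """Replace characters that are awkward in folder names with "_"."""
    cleaned = re.sub(r"[^A-Za-z0-9 ._-]+", "_", sender).strip(" ._")
    return cleaned or "unknown_sender"


def file_by_sender(pdf_path, sender, archive_root):
    """Move one PDF into <archive_root>/<sender>/ and return the new path.

    The sender folder is created on first use. Name collisions inside
    the folder are not handled in this sketch.
    """
    target_dir = Path(archive_root) / safe_folder_name(sender)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(pdf_path).name
    shutil.move(str(pdf_path), str(target))
    return target
```

A weekly run would loop over the incoming directory, determine the sender for each file, and call `file_by_sender()` once per file. Files whose sender cannot be determined could be sent to a separate folder for manual review.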

Does this sound like a viable option? If yes, I can provide more information.
afridays

ASKER

Currently, I do not have the full version of Adobe Acrobat, but I will check with my manager about purchasing it. He is very, very cheap.

I tried to capture the sender name through File>Document Properties, but that is not workable because that data is not 100% accurate.

There has got to be a way for me to capture the name of the sender.
You can download a 30-day evaluation version of Acrobat 6.0 Professional from Adobe's web site.

Have you tried the pdftotext program?
To:  khkremer
For two days now, we have not been able to get pdftotext to work, because of the way the contents of the data are formatted.
For example: in one PDF converted to text, all of the text runs together, like one long string. In another, there is a ton of white space between the text. And in another, there are some unusual characters between the text and the paragraphs.
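
For what it is worth, the extra white space and stray characters can sometimes be cleaned up after extraction. The snippet below is only a sketch in Python with a made-up function name: it drops control characters (such as the form feed that often marks a page break), collapses runs of spaces and tabs, and removes blank lines. It cannot restore spaces that are missing when the text runs together.

```python
import re
import unicodedata


def clean_extracted_text(text):
    """Normalize text extracted from a PDF.

    Control and other non-printing characters are dropped (newline and
    tab are kept), runs of spaces/tabs become a single space, and empty
    lines are removed.
    """
    kept = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in kept.splitlines())
    return "\n".join(line for line in lines if line)
```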

I am going to need to explore other options.
ASKER CERTIFIED SOLUTION
Karl Heinz Kremer

To: Venabili

On December 11, I accepted khkremer's answer.

Thank you for the reminder!