How to programmatically read or search through a PDF file?

Dear PDF experts:

Every week, I receive approximately 575 statistical data reports in PDF format.
I need to file these reports electronically by sender, and the only way I can get the name of the sender is to open and read each PDF report. Rather than doing this manually, which is what I am doing now, I would like to know whether there is a way to programmatically read each PDF file, extract the name of the sender, and move the file into the appropriate electronic folder.

Please note: the file names of the PDF reports are not the same from week to week. Inside the PDF reports, however, the format is ALWAYS the same for all of them. The name of the sender is always located in the upper left-hand corner (1" margin) and also in the lower right corner (1" margin).

Please HELP!!!!!
In order to programmatically read these files, do I need Adobe Writer? If yes, do you have sample code?
If no, can I use VB.NET? Please tell me where to look for sample code, or give me some URLs to look into.

Thanks,
Augusta
afridays Asked:

Karl Heinz Kremer Commented:
This is a non-trivial project, and you should be prepared for quite a bit of work...

First, there is no such program as "Adobe Writer". What you probably mean is "Adobe Acrobat" (of which Adobe announced version 7 today, to be released around the end of the year). You may need Acrobat, but there are other potential solutions.

It's pretty complicated to get the text at a certain location ("upper left hand corner"), so if you are able to get this information any other way, you should try that first. One option is to extract the textual information from the PDF files. You can, for example, use the free pdftotext tool from Xpdf (http://www.foolabs.com/xpdf) to get a feeling for what you can do with this method. This is a command-line tool, and you need to see whether you can find the information you are looking for in a more convenient way (e.g. end of first line, end of third line, or something similar). If that's possible, just run this tool and filter the output to determine the sender name.
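A minimal sketch of the "run pdftotext and filter the output" idea, written in Python purely for illustration. It assumes pdftotext is installed and on the PATH, and that the sender name sits on a predictable non-blank line of the first page. The function names and the choice of line are assumptions; you would have to check them against your own reports.

```python
import subprocess
from typing import Optional


def pdf_first_page_text(pdf_path: str, layout: bool = True) -> str:
    """Run pdftotext on page 1 of pdf_path and return the extracted text."""
    cmd = ["pdftotext", "-f", "1", "-l", "1", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    cmd += [pdf_path, "-"]     # "-" sends the text to stdout instead of a file
    result = subprocess.run(
        cmd, capture_output=True, encoding="utf-8", errors="replace", check=True
    )
    return result.stdout


def sender_from_text(text: str, line_index: int = 0) -> Optional[str]:
    """Return the chosen non-blank line with runs of whitespace collapsed.

    line_index counts non-blank lines only; negative values count from the
    end. Returns None when the text has no such line.
    """
    lines = [" ".join(line.split()) for line in text.splitlines()]
    lines = [line for line in lines if line]
    try:
        return lines[line_index]
    except IndexError:
        return None
```

For a single report you would call sender_from_text(pdf_first_page_text("report.pdf")). If the name in the lower right corner turns out to be more reliable, a negative line_index picks lines from the end of the page instead.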

If that's not possible, check whether the information is stored in the document properties (File>Document Properties). This information can also be easily extracted.

If this is also not working, you need to extract the contents of the PDF file and try to find the sender information. This is not a simple task.

The Acrobat JavaScript API allows you to get the Nth word from every page (Doc.getPageNthWord()). You need to do this for all words until you find your sender information. You can get the location of a word with Doc.getPageNthWordQuads(). You need to define what "upper left hand corner" means in terms of absolute location.
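To make "absolute location" concrete, here is a small sketch of the geometry only, in Python. It does not call Acrobat. It assumes the words and their quads have already been exported, that each quad is a list of 8 numbers (four x, y pairs) in points, and that the origin is the lower left corner of the page, which is the PDF default. The region size, tolerance, and function names are illustrative assumptions.

```python
from typing import Iterable, List, Sequence, Tuple

POINTS_PER_INCH = 72.0  # default PDF user-space unit


def quad_bounds(quad: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (left, bottom, right, top) for a quad of four x, y pairs."""
    xs = quad[0::2]
    ys = quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def in_upper_left(
    quad: Sequence[float],
    page_height: float,
    margin_in: float = 1.0,      # distance of the region from the page edges
    box_width_in: float = 3.0,   # assumed width of the sender area
    box_height_in: float = 1.0,  # assumed height of the sender area
    tolerance: float = 2.0,      # slack in points for slightly offset text
) -> bool:
    """True if the word's box lies inside the upper left region of the page."""
    left, bottom, right, top = quad_bounds(quad)
    region_left = margin_in * POINTS_PER_INCH
    region_right = region_left + box_width_in * POINTS_PER_INCH
    region_top = page_height - margin_in * POINTS_PER_INCH
    region_bottom = region_top - box_height_in * POINTS_PER_INCH
    return (
        left >= region_left - tolerance
        and right <= region_right + tolerance
        and bottom >= region_bottom - tolerance
        and top <= region_top + tolerance
    )


def words_in_upper_left(
    words: Iterable[Tuple[str, Sequence[float]]], page_height: float
) -> List[str]:
    """Keep the words whose quads fall in the region, in their input order."""
    return [text for text, quad in words if in_upper_left(quad, page_height)]
```

On a US Letter page (792 points high) the default region runs from x = 72 to 288 and from y = 648 to 720. The same test with the region moved would cover the lower right corner.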

You can do this with just JavaScript in Acrobat. If you run this as a batch process, you could e.g. process all files in a certain directory and save them to a new directory under a new name. If you do not want to use JavaScript directly, you can also use the VB/JavaScript bridge and do the programming in VB. This solution would require the full version of Adobe Acrobat Professional.
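The batch step itself does not depend on how the sender name is found. As a rough illustration outside Acrobat, the following Python sketch moves every PDF from one directory into a per-sender folder. It is not the Acrobat batch feature; get_sender stands in for whatever extraction method ends up working, and all names here (file_reports, safe_folder_name, "unknown_sender") are hypothetical.

```python
import re
import shutil
from pathlib import Path
from typing import Callable, Dict, Optional


def safe_folder_name(sender: str) -> str:
    """Replace characters that Windows does not allow in folder names."""
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", sender).strip(" .")
    return cleaned or "unknown_sender"


def file_reports(
    inbox: Path,
    archive: Path,
    get_sender: Callable[[Path], Optional[str]],
) -> Dict[str, Path]:
    """Move each PDF in inbox to archive/<sender>/ and return what was moved.

    Files whose sender cannot be determined go to archive/unknown_sender/
    so they can be checked by hand. Existing files are never overwritten.
    """
    moved: Dict[str, Path] = {}
    for pdf in sorted(inbox.glob("*.pdf")):
        sender = get_sender(pdf) or "unknown_sender"
        target_dir = archive / safe_folder_name(sender)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / pdf.name
        if target.exists():
            continue  # leave the file in the inbox rather than overwrite
        shutil.move(str(pdf), str(target))
        moved[pdf.name] = target
    return moved
```

Sending unidentified reports to a separate folder keeps the weekly run from stopping on one bad file, which matters with several hundred reports.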

Does this sound like a viable option? If yes, I can provide more information.
afridays (Author) Commented:
Currently, I do not have the full version of Adobe Acrobat, but I will check with my manager about purchasing it. He is very, very cheap.

I tried to capture the sender name through File>Document Properties, but that does not work because that data is not 100% accurate.

There has got to be a way for me to capture the name of the sender.
Karl Heinz Kremer Commented:
You can download a 30 day eval version of Acrobat 6.0 Professional from Adobe's web site.

Have you tried the pdftotext program?

afridays (Author) Commented:
To:  khkremer
For two days now, we have not been able to get pdftotext to work because of the way the contents are formatted.
For example: for one PDF document, all of the converted text runs together like one big long string. Another document converts to text with tons of white space between the words. And a third has some unusual characters between the text and the paragraphs.

I am going to need to explore other options.
Karl Heinz Kremer Commented:
This may be an indication that the task is even more complicated.

The text in a PDF file is not organized so that it can be easily extracted. An extraction tool has to find all parts of the text, which can be individual characters, along with the locations of these parts, and then rearrange the text based on those locations. In some instances this is not possible, which is what you experienced.
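To illustrate what such a tool has to do, here is a simplified Python sketch that rebuilds lines from positioned fragments. It is not how pdftotext is actually implemented. It assumes each fragment is a tuple (x, y, width, text) in points with y growing upward, and the two thresholds are made-up values.

```python
from typing import List, Sequence, Tuple

# (x, y, width, text): left edge, baseline height, advance width, characters
Fragment = Tuple[float, float, float, str]


def rebuild_lines(
    fragments: Sequence[Fragment],
    line_tolerance: float = 2.0,  # max baseline difference within one line
    space_gap: float = 1.5,       # horizontal gap that counts as a space
) -> List[str]:
    """Group fragments into lines by y, order them by x, and insert spaces."""
    lines: List[Tuple[float, List[Fragment]]] = []
    # Walk from the top of the page down, starting a new line whenever the
    # baseline moves further than line_tolerance from the current line.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - frag[1]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[1], [frag]))

    result: List[str] = []
    for _, frags in lines:
        text = ""
        prev_right = None
        for x, _y, width, chunk in sorted(frags, key=lambda f: f[0]):
            if prev_right is not None and x - prev_right > space_gap:
                text += " "  # the gap is wide enough to be a word break
            text += chunk
            prev_right = x + width
        result.append(text)
    return result
```

The thresholds show why the output can go wrong: if space_gap is too large for a given font, the words run together into one long string, and if it is too small, extra spaces appear between characters.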

Is the string you are looking for at the beginning of the extracted text?
afridays (Author) Commented:

To: Venabili

On December 11, I accepted khkremer's answer.

Thank you for the reminder!