  • Status: Solved
  • Priority: Medium

How to programmatically read or search through a PDF file?

Dear pdf experts:

Every week, I receive approximately 575 statistical data reports in PDF format.
I need to file these reports electronically by sender, and the only way I can get the sender's name is to open the PDF report and read it. Rather than doing this manually, which is what I am doing now, I would like to know whether there is a way to read the PDF file programmatically, extract the sender's name, and put the file in the appropriate electronic folder.

Please note: the file names of the PDF reports are not the same from week to week, but the report format inside is ALWAYS the same for all of them. The sender's name is always located in the upper left-hand corner (1" margin) and also in the lower right corner (1" margin).

Please HELP!!!!!
In order to read these files programmatically, do I need to have Adobe Writer? If yes, do you have sample code?
If no, can I use VB.NET? Please tell me where to look for sample code, or give me some URLs to look into.

1 Solution
Karl Heinz Kremer Commented:
This is a non-trivial project, and you should be prepared for quite a bit of work...

First, there is no such program as "Adobe Writer". What you probably mean is "Adobe Acrobat" (of which Adobe announced version 7 today, to be released around the end of the year). You may need Acrobat, but there are other potential solutions.

It's pretty complicated to get the text at a certain location ("upper left hand corner"), so if you are able to get this information any other way, you should try that first. One option is to extract the textual information from the PDF files. You can use the free pdftotext tool from Xpdf (http://www.foolabs.com/xpdf) to get a feeling for what you can do with this method. This is a command line tool, and you need to see whether the information you are looking for shows up at a predictable place in its output (e.g. end of first line, end of third line or something similar). If that's possible, just run this tool and filter the output to determine the sender name.
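
A minimal sketch of that "run the tool and filter the output" step in Python. It assumes pdftotext is on the PATH, and that with the -layout option the sender appears as the first column of the first non-blank line; both the function names and that position are assumptions to adjust after looking at real output.

```python
import re
import subprocess


def pdf_first_page_text(pdf_path):
    """Run pdftotext on page 1 only and return what it prints to stdout."""
    result = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "1", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def sender_from_text(text, line_index=0):
    """Pick the first column of the chosen non-blank line (hypothetical rule).

    Returns None when the text has fewer non-blank lines than expected.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if line_index >= len(lines):
        return None
    # With -layout, columns are usually separated by runs of spaces.
    return re.split(r"\s{2,}", lines[line_index])[0]
```

Typical use would be `sender_from_text(pdf_first_page_text("report.pdf"))`. If the name turns out to sit on the third line instead, only `line_index` needs to change.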

If that's not possible, try to see if the information is e.g. stored in the document properties (File>Document Properties). This information can also be easily extracted.

If this is also not working, you need to extract the contents of the PDF file and try to find the sender information. This is not a simple task.

The Acrobat JavaScript API allows you to get the Nth word on a page (Doc.getPageNthWord()). You need to do this for each word in turn until you find your sender information. You can get the location of a word with Doc.getPageNthWordQuads(). You need to define what "upper left hand corner" means in terms of absolute location.
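
To make "absolute location" concrete, here is a small Python sketch of the geometry only (it does not call Acrobat). It assumes a word's quad is a list of eight numbers (four x, y corner pairs) in points, with the origin at the bottom-left of the page, and it assumes the sender sits in a 3" x 1" box just inside the 1" margin. The box size and the tolerance are guesses, not something the reports guarantee.

```python
POINTS_PER_INCH = 72.0


def quad_bounds(quad):
    """Return (min_x, min_y, max_x, max_y) for a quad of 8 numbers."""
    xs = quad[0::2]
    ys = quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def in_upper_left(quad, page_height, margin_in=1.0,
                  box_w_in=3.0, box_h_in=1.0, tol=2.0):
    """True if the quad lies inside the assumed upper-left box.

    page_height is in points (792 for US Letter). tol is a small slack,
    in points, for words that sit right on the margin.
    """
    left = margin_in * POINTS_PER_INCH - tol
    top = page_height - margin_in * POINTS_PER_INCH + tol
    right = left + box_w_in * POINTS_PER_INCH + 2 * tol
    bottom = top - box_h_in * POINTS_PER_INCH - 2 * tol
    x0, y0, x1, y1 = quad_bounds(quad)
    return x0 >= left and x1 <= right and y0 >= bottom and y1 <= top
```

The same test, mirrored, would cover the copy of the name in the lower right corner.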

You can do this with just JavaScript in Acrobat. If you run this as a batch process, you could e.g. process all files in a certain directory and save them to a new directory under a new name. If you do not want to use JavaScript directly, you can also use the VB/JavaScript bridge and do the programming in VB. This solution would require the full version of Adobe Acrobat Professional.
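
Whichever way the sender name is obtained, the filing step itself is simple. A rough Python sketch, assuming one folder per sender under a common root; the helper names and the rule for cleaning up folder names are made up for illustration, and there is no handling for two reports with the same file name.

```python
import re
import shutil
from pathlib import Path


def folder_name_for(sender):
    """Turn a sender name into a safe folder name (hypothetical rule)."""
    cleaned = re.sub(r"[^A-Za-z0-9 _.-]+", "_", sender).strip(" ._")
    return cleaned or "unknown_sender"


def file_report(pdf_path, sender, dest_root):
    """Move pdf_path into dest_root/<sender folder>/ and return the new path."""
    target_dir = Path(dest_root) / folder_name_for(sender)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(pdf_path).name
    shutil.move(str(pdf_path), str(target))
    return target
```

A weekly run would loop over `Path(inbox).glob("*.pdf")`, work out the sender for each file, and call `file_report` on it.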

Does this sound like a viable option? If yes, I can provide more information.
afridays (Author) Commented:
Currently, I do not have the full version of Adobe Acrobat, but I will check with my manager about purchasing it. He is very, very cheap.

I tried to capture the sender name through File>Document Properties, but that does not work because the data there is not 100% accurate.

There has got to be a way for me to capture the name of the sender.
Karl Heinz Kremer Commented:
You can download a 30 day eval version of Acrobat 6.0 Professional from Adobe's web site.

Have you tried the pdftotext program?
afridays (Author) Commented:
To:  khkremer
For two days now, we have not been able to get pdftotext to work either, because of the way the contents are formatted.
For example: in one PDF document converted to text, all of the text runs together like one big long string. Another has tons of white space in between the text. And another has some unusual characters in between the text and the paragraphs.

I am going to need to explore other options.
Karl Heinz Kremer Commented:
This may be an indication that the task is even more complicated.

The text that's in a PDF file is not organized so that it can be easily extracted. An extraction tool has to find all parts of the text, which can be individual characters, and the locations of these parts, and then, based on the location arrange the text again. In some instances, this is not possible (e.g. what you experienced).
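
To illustrate what "arrange the text again based on the location" means, here is a simplified Python sketch. It assumes the fragments have already been pulled out as (x, y, text) tuples with y increasing upwards, and it groups fragments into lines when their y values are within a small tolerance. Real extraction tools also have to decide from the gaps whether to insert a space at all, which is one reason the output can come out run together or full of white space; this sketch just joins with a single space.

```python
def rebuild_lines(fragments, line_tol=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right.

    line_tol is the largest y difference (in points) for two fragments
    to count as being on the same line; the value is a guess.
    """
    lines = []  # each entry is [y_of_line, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the pieces by their x position.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]
```

For example, `rebuild_lines([(150, 700, "Corp"), (72, 700.5, "Acme")])` puts "Acme" before "Corp" even though the file stored them in the other order.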

Is the string you are looking for at the beginning of the extracted text?
afridays (Author) Commented:

To: Venabili

On December 11, I accepted khkremer's answer.

Thank you for the reminder!
