Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Regular expression search in Acrobat

Posted on 2003-11-22
3
Medium Priority
?
1,697 Views
Last Modified: 2013-11-18
I want to have a plugin developed that will use regular expressions to search PDF files and then return the quad info on the matched search.  I haven't been able to find any reference in the Adobe SDK that indicates this can be done.
Anybody have any input?

Dennis
0
Comment
Question by:dHaserot
  • 2
3 Comments
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 9804185
Almost anything can be done in a plug-in :-)
It's however not a trivial task. You may be able to use Javascript (that's not my field of expertiese) that you could execute from within a plug-in. I'll however only describe how you would accomplish this with just the plug-in interface.
As you've found out, there is no reg ex interface in Acrobat. You have to provide this yourself. The hard part is getting access to the textual content of the file. This can be done by using the word finder interface:

First you create a PDWordFinder object with either PDDocCreateWordFinderEx, PDDocCreateWordFinder or PDDocCreateWordFinderUCS. With this PDWordFinder you can then call either PDWordFinderGetNthWord() to get one word after the other from the list or PDWordFinderAcquireWordList() to get all words on a page. Either way, you end up with a list of PDWord objects. You have to then apply your reg ex search to this information. The PDWord then gives you access to the quad information through PDWordGetCharQuad().

After you are done, you destroy the PDWordFinder again and do whatever else is necessary in cleanup.

As I said, it's not trivial, but it can be done. The problem with this solution however is that you do not get white space information. Mainly because this is not part of the PDF file to start with. If the cursor gets advanced by a tab, or by eight spaces, or by a fixed amount, the "moevto" command that you will find in the PostScript code that in turn gets converted to some PDF commands will always look the same.

The content of a PDF file does not necessarily have a one-to-one relationship with elements in your source file.

You may also run into problems with sub- or superscripts: In PDF it's just a string in a smaller font, that's positioned  above or below the base line. It may not even be rendered at the same time. So Acrobat needs to make some assumptions about how text belongs together.

Is this enough to completely scare yo away from doing reg expression searches in PDF? :-)
0
 

Author Comment

by:dHaserot
ID: 9804524
KhKremer,

This is my first question to the group and your response was far more detailed than I had hoped.  No I am not scared away by the response.

Though you said it is not your field of expertiese, can you estimate a rough range of cost to develope the plugin (not including the regex engine)?

Thanks very much for your response.

dHaserot
0
 
LVL 44

Accepted Solution

by:
Karl Heinz Kremer earned 2000 total points
ID: 9804644
The Javascript is not my field of expertise.

It all depends on who's doing the development. You probably know that it takes quite some time to get familiar with the Acrobat SDK, so if you have somebody who's never written a plug-in, you have to caclulate about 3 months to get familiar enough with the development environment to handle a job like this.

I've never tried to integrate the word finder with another subsystem, so my estimate can be totally off... But I think that if you have the regex engine, it probably takes between four and eight weeks to get the software written and tested.
0

Featured Post

Keep up with what's happening at Experts Exchange!

Sign up to receive Decoded, a new monthly digest with product updates, feature release info, continuing education opportunities, and more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

*Adobe Acrobat 9 was used for this article. Particular steps may vary depending on software versions. 1. Create a framework of your form in Word, leaving space where you’d ultimately like the Adobe fields to appear.  (Note: I use the blank lines …
*Adobe Acrobat 9 was used for this article.  Particular steps may vary depending on software versions. Adobe Acrobat has many, many variables that my be utilized to customize your forms for clarity and ease of use. The Form Editing Tool will be y…
The purpose of this video is to demonstrate how to set up the WordPress backend so that each page automatically generates a Mailchimp signup form in the sidebar. This will be demonstrated using a Windows 8 PC. Tools Used are Photoshop, Awesome…
Sometimes we receive PDF files that are in the wrong orientation. They may be sideways or even upside down. This most commonly happens with scanned or faxed documents. It is possible to rotate the view of these PDFs with the free Adobe Reader produc…
Suggested Courses

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question