Regular expression search in Acrobat

I want to have a plugin developed that will use regular expressions to search PDF files and then return the quad info on the matched search.  I haven't been able to find any reference in the Adobe SDK that indicates this can be done.
Anybody have any input?

Dennis
dHaserot Asked:

Karl Heinz Kremer Commented:
Almost anything can be done in a plug-in :-)
It is, however, not a trivial task. You may be able to use JavaScript executed from within a plug-in (JavaScript is not my field of expertise). I will only describe how you would accomplish this with just the plug-in interface.
As you've found out, there is no reg ex interface in Acrobat. You have to provide this yourself. The hard part is getting access to the textual content of the file. This can be done by using the word finder interface:

First you create a PDWordFinder object with either PDDocCreateWordFinderEx, PDDocCreateWordFinder or PDDocCreateWordFinderUCS. With this PDWordFinder you can then call either PDWordFinderGetNthWord() to get one word after the other from the list or PDWordFinderAcquireWordList() to get all words on a page. Either way, you end up with a list of PDWord objects. You have to then apply your reg ex search to this information. The PDWord then gives you access to the quad information through PDWordGetCharQuad().

After you are done, you destroy the PDWordFinder again and do whatever else is necessary in cleanup.
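To illustrate the matching step only, here is a minimal sketch in C++. It does not call the Acrobat SDK: `Word` and `Quad` are hypothetical stand-ins for the text and quad you would copy out of each PDWord, and `matchWordQuads` is a name made up for this example. The regex engine here is `std::regex`, which is just one possible choice.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-ins for data copied out of the SDK's PDWord objects.
// These are NOT Acrobat SDK types.
struct Point { double x, y; };
struct Quad { Point tl, tr, bl, br; };   // four corners of the word's box
struct Word {
    std::string text;  // the word's text
    Quad quad;         // where the word sits on the page
};

// Return the quad of every word whose text contains a match for `re`.
std::vector<Quad> matchWordQuads(const std::vector<Word>& words,
                                 const std::regex& re) {
    std::vector<Quad> hits;
    for (const Word& w : words) {
        if (std::regex_search(w.text, re)) {
            hits.push_back(w.quad);
        }
    }
    return hits;
}
```

In a real plug-in you would fill the `words` vector from the word finder first, then pass it in together with a pattern such as `std::regex(R"(\d{4}-\d{2}-\d{2})")`. Note that this only finds matches that fall inside a single word.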

As I said, it's not trivial, but it can be done. The problem with this solution, however, is that you do not get white space information, mainly because it is not part of the PDF file to start with. Whether the cursor is advanced by a tab, by eight spaces, or by a fixed amount, the "moveto" command in the PostScript code, which in turn gets converted to PDF commands, will always look the same.
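One possible workaround for patterns that span several words (my assumption, not something the SDK prescribes) is to join the words with a single space, remember which word each character came from, and run the regex over the joined string. The sketch below uses plain strings in place of PDWord text; `findPhrases` and `kSeparator` are hypothetical names. Each result lists the indices of the words a match touches, so you could then look up the quads of those words.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Marks characters in the joined text that are inserted separators.
constexpr std::size_t kSeparator = static_cast<std::size_t>(-1);

// Join `words` with single spaces, search the joined text with `re`,
// and return, per match, the indices of the words the match covers.
std::vector<std::vector<std::size_t>> findPhrases(
        const std::vector<std::string>& words, const std::regex& re) {
    std::string text;
    std::vector<std::size_t> owner;  // owner[i] = word index of text[i]
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i > 0) {
            text += ' ';
            owner.push_back(kSeparator);
        }
        text += words[i];
        owner.insert(owner.end(), words[i].size(), i);
    }

    std::vector<std::vector<std::size_t>> results;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        const std::size_t start = static_cast<std::size_t>(it->position());
        const std::size_t stop = start + static_cast<std::size_t>(it->length());
        std::vector<std::size_t> covered;
        for (std::size_t p = start; p < stop; ++p) {
            // Skip separators and avoid listing the same word twice.
            if (owner[p] != kSeparator &&
                (covered.empty() || covered.back() != owner[p])) {
                covered.push_back(owner[p]);
            }
        }
        if (!covered.empty()) {
            results.push_back(covered);
        }
    }
    return results;
}
```

The single space is an invented separator, so a pattern that depends on the exact amount of white space in the source document still cannot work; writing `\s+` between words in the pattern is the safest choice with this approach.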

The content of a PDF file does not necessarily have a one-to-one relationship with elements in your source file.

You may also run into problems with sub- or superscripts: in PDF it's just a string in a smaller font that's positioned above or below the baseline. It may not even be rendered at the same time, so Acrobat needs to make some assumptions about how text belongs together.

Is this enough to completely scare you away from doing regular expression searches in PDF? :-)
dHaserot (Author) Commented:
KhKremer,

This is my first question to the group, and your response was far more detailed than I had hoped. No, I am not scared away by the response.

Though you said it is not your field of expertise, can you estimate a rough range of cost to develop the plugin (not including the regex engine)?

Thanks very much for your response.

dHaserot
Karl Heinz Kremer Commented:
It's the JavaScript that is not my field of expertise.

It all depends on who's doing the development. You probably know that it takes quite some time to get familiar with the Acrobat SDK, so if you have somebody who's never written a plug-in, you have to allow about three months for them to get familiar enough with the development environment to handle a job like this.

I've never tried to integrate the word finder with another subsystem, so my estimate can be totally off... But I think that if you have the regex engine, it probably takes between four and eight weeks to get the software written and tested.

Question has a verified solution.
