Solved

Regular expression search in Acrobat

Posted on 2003-11-22
Last Modified: 2013-11-18
I want to have a plugin developed that will use regular expressions to search PDF files and then return the quad info on the matched search.  I haven't been able to find any reference in the Adobe SDK that indicates this can be done.
Anybody have any input?

Dennis
Question by:dHaserot

Expert Comment

by:Karl Heinz Kremer
ID: 9804185
Almost anything can be done in a plug-in :-)
It is, however, not a trivial task. You may be able to use JavaScript (that's not my field of expertise), which you could execute from within a plug-in. I'll only describe how you would accomplish this with just the plug-in interface, though.
As you've found out, there is no regex interface in Acrobat. You have to provide this yourself. The hard part is getting access to the textual content of the file. This can be done by using the word finder interface:

First, you create a PDWordFinder object with PDDocCreateWordFinder, PDDocCreateWordFinderEx, or PDDocCreateWordFinderUCS. With this PDWordFinder, you can then call either PDWordFinderGetNthWord() to get one word after another from the list, or PDWordFinderAcquireWordList() to get all words on a page. Either way, you end up with a list of PDWord objects. You then have to apply your regex search to this information. The PDWord gives you access to the quad information through PDWordGetCharQuad().

After you are done, you destroy the PDWordFinder again and do whatever else is necessary in cleanup.
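The regex step described above can be sketched independently of the SDK. The following is only an illustration in Python, not plug-in code: the `Word` class and `find_quads` function are hypothetical stand-ins for a list of PDWord objects, and the quad is simplified to a `(left, bottom, right, top)` tuple. It joins the words with single spaces, runs the pattern over the joined text, and maps each match back to the words it overlaps.

```python
import re
from dataclasses import dataclass


# Hypothetical stand-in for a PDWord: the word's text plus one bounding
# quad, simplified to (left, bottom, right, top). Not an Acrobat SDK type.
@dataclass
class Word:
    text: str
    quad: tuple


def find_quads(words, pattern):
    """Return (matched_text, [quads]) for each regex match over the word list."""
    # Record the character span each word occupies in the joined text.
    spans = []
    pos = 0
    for w in words:
        spans.append((pos, pos + len(w.text)))
        pos += len(w.text) + 1  # +1 for the single separating space
    text = " ".join(w.text for w in words)

    results = []
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore empty matches
        # A word belongs to the match if its span overlaps the match span.
        hit = [w.quad for w, (s, e) in zip(words, spans)
               if s < m.end() and m.start() < e]
        if hit:
            results.append((m.group(0), hit))
    return results


words = [
    Word("Invoice", (72, 700, 120, 712)),
    Word("No.", (124, 700, 140, 712)),
    Word("2003-1122", (144, 700, 200, 712)),
    Word("due", (204, 700, 225, 712)),
]
matches = find_quads(words, r"No\. \d{4}-\d{4}")
```

Because the words are joined with exactly one space, a pattern can span several words, but it only ever sees that single space between them.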

As I said, it's not trivial, but it can be done. The problem with this solution, however, is that you do not get whitespace information, mainly because this is not part of the PDF file to start with. Whether the cursor gets advanced by a tab, by eight spaces, or by a fixed amount, the "moveto" command that you will find in the PostScript code, which in turn gets converted to PDF commands, will always look the same.
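Since only positions survive, any spacing has to be guessed from geometry. The sketch below shows one possible heuristic, not something Acrobat provides: the function name `join_with_gaps`, the `(left, bottom, right, top)` quad layout, and the `wide_factor` threshold are all assumptions made for illustration, and the words are assumed to sit on one line in left-to-right order.

```python
def join_with_gaps(words, wide_factor=2.0):
    """Join (text, quad) pairs, using a tab where the horizontal gap is wide.

    Quads are (left, bottom, right, top). A gap counts as "wide" when it
    exceeds wide_factor times the previous word's average character width.
    """
    parts = []
    prev = None  # (text, left, right) of the previous word
    for text, (left, bottom, right, top) in words:
        if prev is not None:
            prev_text, prev_left, prev_right = prev
            char_width = (prev_right - prev_left) / max(len(prev_text), 1)
            gap = left - prev_right
            parts.append("\t" if gap > wide_factor * char_width else " ")
        parts.append(text)
        prev = (text, left, right)
    return "".join(parts)


line = [
    ("Name", (72, 700, 96, 712)),
    ("Dennis", (100, 700, 136, 712)),
    ("500", (200, 700, 218, 712)),
]
joined = join_with_gaps(line)
```

The tab here is a guess about the layout. It cannot recover whether the source document used a tab, several spaces, or a fixed offset.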

The content of a PDF file does not necessarily have a one-to-one relationship with elements in your source file.

You may also run into problems with sub- or superscripts: in PDF, such text is just a string in a smaller font that's positioned above or below the baseline. It may not even be rendered at the same time as the surrounding text. So Acrobat needs to make some assumptions about how text belongs together.
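To illustrate the kind of assumption involved, here is a rough sketch that flags a word as a likely superscript when it is shorter than the preceding normal word and its lower edge sits noticeably higher. The function `flag_raised`, the `(left, bottom, right, top)` quad layout, and the `min_shift` threshold are hypothetical, and the quad's lower edge is used as an approximation of the baseline.

```python
def flag_raised(words, min_shift=0.25):
    """Return (text, is_raised) pairs for words on one line.

    A word is flagged when it is shorter than the last unflagged word and
    its bottom edge is more than min_shift times that word's height above it.
    """
    flags = []
    prev = None  # (bottom, height) of the last word not flagged as raised
    for text, (left, bottom, right, top) in words:
        height = top - bottom
        raised = False
        if prev is not None:
            prev_bottom, prev_height = prev
            raised = (height < prev_height
                      and (bottom - prev_bottom) > min_shift * prev_height)
        flags.append((text, raised))
        if not raised:
            prev = (bottom, height)
    return flags


line = [
    ("E=mc", (72, 700, 100, 712)),
    ("2", (100, 706, 105, 714)),
    ("holds", (110, 700, 140, 712)),
]
flags = flag_raised(line)
```

A regex over the plain word list would see "E=mc 2 holds" here, so a search for "mc2" would need the flagged word merged into its neighbour first.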

Is this enough to completely scare you away from doing regular expression searches in PDF? :-)

Author Comment

by:dHaserot
ID: 9804524
KhKremer,

This is my first question to the group, and your response was far more detailed than I had hoped. No, I am not scared away by the response.

Though you said it is not your field of expertise, can you estimate a rough cost range for developing the plug-in (not including the regex engine)?

Thanks very much for your response.

dHaserot

Accepted Solution

by:Karl Heinz Kremer (earned 500 total points)
ID: 9804644
It's JavaScript that is not my field of expertise.

It all depends on who's doing the development. You probably know that it takes quite some time to get familiar with the Acrobat SDK, so if you have somebody who's never written a plug-in, you have to allow about three months for them to get familiar enough with the development environment to handle a job like this.

I've never tried to integrate the word finder with another subsystem, so my estimate can be totally off... But I think that if you have the regex engine, it probably takes between four and eight weeks to get the software written and tested.