Solved

Regular expression search in Acrobat

Posted on 2003-11-22
Last Modified: 2013-11-18
I want to have a plugin developed that will use regular expressions to search PDF files and then return the quad info on the matched search.  I haven't been able to find any reference in the Adobe SDK that indicates this can be done.
Anybody have any input?

Dennis
Question by:dHaserot

Expert Comment

by:Karl Heinz Kremer
Almost anything can be done in a plug-in :-)
It is, however, not a trivial task. You may be able to use JavaScript executed from within a plug-in, but that's not my field of expertise, so I'll only describe how you would accomplish this with just the plug-in interface.
As you've found out, there is no reg ex interface in Acrobat. You have to provide this yourself. The hard part is getting access to the textual content of the file. This can be done by using the word finder interface:

First you create a PDWordFinder object with either PDDocCreateWordFinderEx, PDDocCreateWordFinder or PDDocCreateWordFinderUCS. With this PDWordFinder you can then call either PDWordFinderGetNthWord() to get one word after the other from the list, or PDWordFinderAcquireWordList() to get all words on a page. Either way, you end up with a list of PDWord objects. You then have to apply your regex search to this information. The PDWord then gives you access to the quad information through PDWordGetCharQuad().

After you are done, you destroy the PDWordFinder again and do whatever else is necessary in cleanup.
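
The matching step can be sketched independently of the Acrobat SDK. The Python sketch below assumes the word text and quads have already been pulled out of the word finder into plain tuples; the `Word` type, the sample coordinates, and `find_quads` are hypothetical names for illustration, not SDK calls. It joins the words with single spaces, remembers each word's character span in the joined text, runs the regex, and returns the quads of every word a match overlaps.

```python
import re
from typing import List, NamedTuple, Tuple

# A quad is four (x, y) corner points, stored here as a flat tuple of 8 numbers.
Quad = Tuple[float, ...]


class Word(NamedTuple):
    """Hypothetical stand-in for the text and quad of one PDWord."""
    text: str
    quad: Quad


def find_quads(words: List[Word], pattern: str) -> List[Tuple[str, List[Quad]]]:
    """Return (matched text, quads of the words the match overlaps)."""
    # Build one searchable string and record each word's [start, end) span in it.
    spans = []
    parts = []
    pos = 0
    for w in words:
        spans.append((pos, pos + len(w.text)))
        parts.append(w.text)
        pos += len(w.text) + 1  # +1 for the single space used as separator
    text = " ".join(parts)

    results = []
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # skip empty matches
        hit = [w.quad for w, (s, e) in zip(words, spans)
               if s < m.end() and e > m.start()]
        results.append((m.group(0), hit))
    return results


# Made-up sample data: three words on one line.
words = [
    Word("Invoice", (10, 20, 60, 20, 10, 10, 60, 10)),
    Word("No.", (65, 20, 85, 20, 65, 10, 85, 10)),
    Word("12345", (90, 20, 130, 20, 90, 10, 130, 10)),
]
matches = find_quads(words, r"No\.\s+\d+")
```

Here `matches` holds one entry whose quad list covers the second and third words. In a real plug-in the same bookkeeping would sit between the word list and whatever regex engine you supply.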

As I said, it's not trivial, but it can be done. The problem with this solution, however, is that you do not get whitespace information, mainly because it is not part of the PDF file to start with. Whether the cursor gets advanced by a tab, by eight spaces, or by a fixed amount, the "moveto" command that you will find in the PostScript code, which in turn gets converted to some PDF commands, will always look the same.
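
One practical consequence is that a pattern depending on the exact original whitespace (a tab, or a run of eight spaces) cannot match text rebuilt from the word list. A minimal Python sketch, assuming the words were joined with single spaces; the sample string and patterns are made up for illustration:

```python
import re

# Text as rebuilt from a word list: one space between words,
# whatever the source document originally used.
rebuilt = "Total 42.00"

# The source may have had a tab here, but that is not recoverable.
tab_pattern = r"Total\t\d+\.\d{2}"
# Matching "any run of whitespace" works on the rebuilt text.
loose_pattern = r"Total\s+\d+\.\d{2}"

tab_match = re.search(tab_pattern, rebuilt)      # finds nothing
loose_match = re.search(loose_pattern, rebuilt)  # finds the amount
```

So patterns for this kind of search are probably best written with `\s+` between words rather than literal tabs or space counts.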

The content of a PDF file does not necessarily have a one-to-one relationship with elements in your source file.

You may also run into problems with sub- or superscripts: in PDF it's just a string in a smaller font that's positioned above or below the baseline. It may not even be rendered at the same time. So Acrobat needs to make some assumptions about how text belongs together.

Is this enough to completely scare you away from doing regular expression searches in PDF? :-)

Author Comment

by:dHaserot
KhKremer,

This is my first question to the group, and your response was far more detailed than I had hoped.  No, I am not scared away by the response.

Though you said it is not your field of expertise, can you estimate a rough range of cost to develop the plug-in (not including the regex engine)?

Thanks very much for your response.

dHaserot

Accepted Solution

by:
Karl Heinz Kremer earned 500 total points
It's the JavaScript that is not my field of expertise.

It all depends on who's doing the development. You probably know that it takes quite some time to get familiar with the Acrobat SDK, so if you have somebody who's never written a plug-in, you have to allow about three months for them to get familiar enough with the development environment to handle a job like this.

I've never tried to integrate the word finder with another subsystem, so my estimate can be totally off... But I think that if you have the regex engine, it probably takes between four and eight weeks to get the software written and tested.