Solved

Regular expression search in Acrobat

Posted on 2003-11-22
Last Modified: 2013-11-18
I want to have a plug-in developed that uses regular expressions to search PDF files and then returns the quad info for each match. I haven't been able to find any reference in the Adobe SDK that indicates this can be done.
Anybody have any input?

Dennis
Question by:dHaserot

Expert Comment

by:Karl Heinz Kremer
ID: 9804185
Almost anything can be done in a plug-in :-)
It is, however, not a trivial task. You may be able to use JavaScript (that's not my field of expertise), which you could execute from within a plug-in. However, I'll only describe how you would accomplish this with just the plug-in interface.
As you've found out, there is no regex interface in Acrobat; you have to provide this yourself. The hard part is getting access to the textual content of the file. This can be done by using the word finder interface:

First you create a PDWordFinder object with either PDDocCreateWordFinderEx, PDDocCreateWordFinder or PDDocCreateWordFinderUCS. With this PDWordFinder you can then call either PDWordFinderGetNthWord() to get one word after the other from the list, or PDWordFinderAcquireWordList() to get all words on a page. Either way, you end up with a list of PDWord objects. You then have to apply your regex search to this information. The PDWord then gives you access to the quad information through PDWordGetCharQuad().

After you are done, you destroy the PDWordFinder again and do whatever else is necessary in cleanup.
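
The matching step in the middle (words in, quads out) can be sketched independently of the SDK. The following is only an illustration in Python, not plug-in code: the `Word` class and `find_matches` function are hypothetical stand-ins for a list of PDWord objects and whatever regex engine you supply, and it assumes one quad per word and a single space between words.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a PDWord: its text plus one quad."""
    text: str
    quad: tuple  # (x1, y1, x2, y2, x3, y3, x4, y4), assumed layout


def find_matches(words, pattern):
    """Return (matched_text, quads) for every regex match.

    The words are joined with single spaces (an assumption, since the
    real spacing is not available) and each match is mapped back to the
    quads of the words it overlaps.
    """
    spans = []  # (start, end) character offsets of each word in the joined text
    pos = 0
    for w in words:
        spans.append((pos, pos + len(w.text)))
        pos += len(w.text) + 1  # +1 for the joining space
    text = " ".join(w.text for w in words)

    results = []
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore empty matches
        quads = [w.quad for w, (s, e) in zip(words, spans)
                 if s < m.end() and e > m.start()]
        results.append((m.group(0), quads))
    return results


# Made-up sample data; the coordinates mean nothing.
sample_words = [
    Word("Invoice", (10, 20, 60, 20, 10, 10, 60, 10)),
    Word("No.", (65, 20, 85, 20, 65, 10, 85, 10)),
    Word("12345", (90, 20, 130, 20, 90, 10, 130, 10)),
    Word("dated", (135, 20, 170, 20, 135, 10, 170, 10)),
]
hits = find_matches(sample_words, r"No\.\s+\d+")
```

Here `hits` holds one match, "No. 12345", together with the quads of the second and third words. In a real plug-in a match could start or end inside a word, in which case you would want per-character quads rather than whole-word quads.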

As I said, it's not trivial, but it can be done. The problem with this solution, however, is that you do not get white space information, mainly because it is not part of the PDF file to begin with. Whether the cursor gets advanced by a tab, by eight spaces, or by a fixed amount, the "moveto" command that you will find in the PostScript code, which in turn gets converted to PDF commands, will always look the same.
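
One practical consequence: a pattern that insists on a specific kind of white space (a tab, say) will not match text rebuilt from the word list. A possible workaround, sketched here in Python as an assumption rather than anything the SDK provides, is to loosen literal spaces and tabs in the user's pattern before searching. The helper name `loosen_whitespace` is made up for this example, and the substitution is deliberately naive (it would also rewrite spaces inside a character class).

```python
import re


def loosen_whitespace(pattern):
    """Replace every run of literal spaces/tabs in a pattern with \\s+."""
    return re.sub(r"[ \t]+", lambda m: r"\s+", pattern)


# Text as it might be rebuilt from a word list: words joined by one space.
rebuilt = "Total 42"

strict = "Total\t42"                # expects a literal tab between the words
loose = loosen_whitespace(strict)   # becomes Total\s+42

strict_hit = re.search(strict, rebuilt)  # no match: there is no tab in the text
loose_hit = re.search(loose, rebuilt)    # matches "Total 42"
```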

The content of a PDF file does not necessarily have a one-to-one relationship with elements in your source file.

You may also run into problems with sub- or superscripts: in PDF, such text is just a string in a smaller font that's positioned above or below the baseline. It may not even be rendered at the same time. So Acrobat needs to make some assumptions about how text belongs together.

Is this enough to completely scare you away from doing regular expression searches in PDF? :-)

Author Comment

by:dHaserot
ID: 9804524
KhKremer,

This is my first question to the group, and your response was far more detailed than I had hoped. No, I am not scared away by the response.

Though you said it is not your field of expertise, can you estimate a rough range of cost to develop the plug-in (not including the regex engine)?

Thanks very much for your response.

dHaserot

Accepted Solution

by:Karl Heinz Kremer (earned 500 total points)
ID: 9804644
To clarify: it's JavaScript that is not my field of expertise.

It all depends on who's doing the development. You probably know that it takes quite some time to get familiar with the Acrobat SDK, so if you have somebody who's never written a plug-in, you have to allow about three months for them to get familiar enough with the development environment to handle a job like this.

I've never tried to integrate the word finder with another subsystem, so my estimate can be totally off... But I think that if you have the regex engine, it probably takes between four and eight weeks to get the software written and tested.