?
Solved

Find text in PDF and print the pages it is found on

Posted on 2014-04-23
7
Medium Priority
?
1,592 Views
Last Modified: 2014-04-25
I want to create an Action in Adobe X Pro where I use Javascript to find all instances in a PDF document where the word "Total" is found and I'd like to print each page it is found on.  

I am working on Windows 7

Thank you.
0
Comment
Question by:mak345
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 3
7 Comments
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 40020397
How much experience do you have with JavaScript and Acrobat?

Text extraction - or more exactly how successful you will be to extract text reliably - depends highly on the "inner" quality of your PDF files. This does not mean that a file does not look good on screen or when printed, it means that a file may not have all the information that Acrobat needs to extract text. At the end, the characters you see "drawn" on a PDF page are just that, drawings. Acrobat needs a table that allows it to convert these "drawings" back to Unicode characters (usually called the ToUnicode table). If a PDF generator did not include that table in the document for every font used, then you cannot extract text.

Now let's take a look at what support Acrobat's JavaScript has for text extraction. The only way you can get access to the text in a document is by using the "word finder". This is a few API routines in the Doc object that allow you to get the number of words on a page, and then iterate over all words on that page (and potentially get the location for each word).

Take a look at these two API documentation pages for more information about this functionality:

Doc.getPageNumWords(): http://livedocs.adobe.com/acrobat_sdk/11/Acrobat11_HTMLHelp/JS_API_AcroJS.89.494.html

Doc.getPageNthWord(): http://livedocs.adobe.com/acrobat_sdk/11/Acrobat11_HTMLHelp/JS_API_AcroJS.89.492.html

There are some sample scripts in the documentation.

Now to the reasons why this may not work for you: A "word" as far as this word finder is concerned is a collection of letters. As soon as you throw in punctuation marks or numbers, Acrobat will have a problem. So the term "test-abc" will result in two words: test and abc.

The number 100.12 will be reported as 100 and 12 - without any way of knowing if there was anything in between these two words that would connect them to a decimal number.

So, in theory you just have to iterate over all words on a page and see if one of them is "Total" - if you find the first total, just add the current page number to an array of pages that contain the term "Total", and return that array once you've processed all pages.
0
 

Author Comment

by:mak345
ID: 40022804
Thanks for your response!   I will not run into the scenarios you spoke about regarding dashes and decimal points for what I am trying to accomplish.  I am basically just searching for 1 word, "TOTAL" which will only show up in specific places on the reports I will be running against.

However, I do not have much experience at all with Javascript and Acrobat.  Most of my experience is with VBA and SQL.  I am just very unfamiliar with coding in Javascript.  Could you help me with the code I need to add the page numbers to an array, and then to print them?  

I would really appreciate your help; it would save my department a ton of time performing this tedious task.  

Thanks again!
0
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 40023032
Unfortunately, that would take a bit more time than I have right now. If you want to dive into JavaScript yourself, it's not too complicated to pick up if you are already familiar with programming in general.
0
Get real performance insights from real users

Key features:
- Total Pages Views and Load times
- Top Pages Viewed and Load Times
- Real Time Site Page Build Performance
- Users’ Browser and Platform Performance
- Geographic User Breakdown
- And more

 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 40023235
Actually, I did find some time. Take a look here:

http://khkonsulting.com/2014/04/extract-pdf-pages-based-content/
0
 

Author Comment

by:mak345
ID: 40023504
That is a lot of help.  Thank you!  Just my last step is to print that newly created document.  If I add the line to print, it wants to print the original document, not the new one.

Does this need to be 2 seperate actions altogether or can it be combined into 1?

Thanks
0
 
LVL 44

Accepted Solution

by:
Karl Heinz Kremer earned 2000 total points
ID: 40023533
The second document does not exist as far as the action is concerned. You would either need to print via JavaScript by adding a line

    d.print(/* potentially some parameters */);

Open in new window


You can learn more about the print command in Acrobat's JavaScript API documentation:

http://livedocs.adobe.com/acrobat_sdk/11/Acrobat11_HTMLHelp/JS_API_AcroJS.89.517.html
0
 

Author Closing Comment

by:mak345
ID: 40023565
Perfect!  Thanks for all your help.  I really appreciate it!
0

Featured Post

Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Power PDF (http://www.nuance.com/for-business/document-imaging-and-scanning/power-pdf-converter/index.htm) is the newest product from the Document Imaging division of Nuance Communications (http://www.nuance.com/). It is available in two editions — …
This article discusses how to implement server side field validation and display customized error messages to the client.
In this fourth video of the Xpdf series, we discuss and demonstrate the PDFinfo utility, which retrieves the contents of a PDF's Info Dictionary, as well as some other information, including the page count. We show how to isolate the page count in a…
In this fifth video of the Xpdf series, we discuss and demonstrate the PDFdetach utility, which is able to list and, more importantly, extract attachments that are embedded in PDF files. It does this via a command line interface, making it suitable …
Suggested Courses

777 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question