Solved

Can I extract specific pages that contain specific text?

Posted on 2014-09-24
Last Modified: 2014-09-26
I have a large PDF file (3,000 pages) and I would like to build a new file containing a subset of its pages, based upon the OCR text in adobe.
Can I do something like this?
Question by:Scott Johnston
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
I have begun work on a program that does this. It is based on a discussion following this other EE question:
http://www.experts-exchange.com/Software/Misc/Q_28510119.html

As stated in that question, the new program is a follow-up to the one based on yet another EE question:
http://www.experts-exchange.com/Software/Server_Software/Document_Management/Q_28084148.html

As also stated earlier, it is a violation of EE's Terms of Use/Code of Conduct to offer to sell any goods or services for any commercial purpose. However, it is permissible to contact members at the email address in their profiles, and more recently EE has brought back the Hire Me button in profiles, as well as created a new Messages system (click the envelope icon in the upper right). So if you're interested in pursuing the matter, please contact me via one of the permitted EE mechanisms.

In the meantime, I'd like to understand your requirements better. You say that you want to build a new file with a subset of the data in the 3,000-page file. Some questions about that:

(1) How are you going to identify the subset?

(2) If the answer to (1) is via a text search string, then several more questions, starting with - What kind of search? Single word? Entire phrase?

(3) Case sensitive or case insensitive or an option for either?

(4) Partial word or whole word or an option for either?

(5) Do you want Boolean search (AND, OR, NOT)?

(6) Do you want Regular Expression (RegEx) search?

(7) Anything else that will help to specify your requirements for creating the subset?
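To make questions (3) through (6) concrete, here is a minimal Python sketch of how those options might translate into a page-text matcher. It is only an illustration under stated assumptions: the names `build_matcher`, `match_all`, `match_any`, and `match_not` are hypothetical, and the sketch assumes each page's text is already available as a plain string. It is not the program mentioned above.

```python
import re

def build_matcher(query, case_sensitive=False, whole_word=False, is_regex=False):
    """Return a function that reports whether a page's text matches the query.

    All names here are hypothetical; this is a sketch, not an existing tool.
    """
    # Treat the query as literal text unless RegEx search was requested.
    pattern = query if is_regex else re.escape(query)
    if whole_word:
        # \b anchors the match at word boundaries, so "Micro" does not hit "Microsoft".
        pattern = r"\b(?:" + pattern + r")\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    return lambda text: compiled.search(text) is not None

# Boolean search (AND, OR, NOT) expressed as combinations of matchers.
def match_all(*matchers):
    return lambda text: all(m(text) for m in matchers)

def match_any(*matchers):
    return lambda text: any(m(text) for m in matchers)

def match_not(matcher):
    return lambda text: not matcher(text)

if __name__ == "__main__":
    # Pages that mention "Microsoft" but not "draft", ignoring case.
    wanted = match_all(build_matcher("Microsoft"), match_not(build_matcher("draft")))
    print(wanted("Invoice from Microsoft Corp"))
    print(wanted("Microsoft DRAFT agreement"))
```

Each answer to the questions above maps to one argument or one combinator, which is why the answers matter before any search is written.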

Regards, Joe
 

Author Comment

by:Scott Johnston
I've requested that this question be deleted for the following reason:

Thank you, but the answer is that you cannot do that. I was not looking for a solution. I just wanted someone to confirm whether adobe has a feature to retrieve a subset of pages from a large PDF file.
 
LVL 51

Accepted Solution

by:Joe Winograd, EE MVE earned 500 total points
> confirm...adobe has a feature to retrieve a subset of pages from a large PDF file

The answer is dependent on how you want to identify the subset. For example, if you're willing to select each page of the 3,000-page PDF via the standard Windows techniques (Ctrl-left-click and Shift-left-click), then the answer is yes. Select the thumbnails of the pages you want, then right-click on any selected thumbnail, and then left-click Extract Pages from the context menu. However, I was going on the assumption that you don't want to go through the entire 3,000-page file and manually select each page. My assumption is that you wanted to search for an identifying string on each page, such as "Microsoft", and then automatically extract each page with a hit into a new PDF file with that subset of pages. I don't know of a way for Acrobat to do that.
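The search-then-extract workflow described above can be sketched outside Acrobat. The Python example below is a minimal illustration under stated assumptions: it assumes the text of the whole PDF has already been extracted into a single string with a form feed character (`\f`) separating pages, which is the page-break convention used by text extraction utilities such as Xpdf's pdftotext. The function name `pages_with_hits` is hypothetical. The sketch only finds the page numbers; building the new PDF from those pages would still need a separate page-extraction tool.

```python
def pages_with_hits(full_text, needle, case_sensitive=False):
    """Return the 1-based numbers of pages whose text contains `needle`.

    Assumes `full_text` separates pages with form feed characters ("\f").
    """
    pages = full_text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    if not case_sensitive:
        needle = needle.lower()
    hits = []
    for number, text in enumerate(pages, start=1):
        haystack = text if case_sensitive else text.lower()
        if needle in haystack:
            hits.append(number)
    return hits

if __name__ == "__main__":
    # Three tiny stand-in "pages"; real input would be the extracted text of the PDF.
    sample = "Invoice for Microsoft\fNothing relevant here\fMICROSOFT renewal\f"
    print(pages_with_hits(sample, "Microsoft"))
```

The resulting list of page numbers is the "subset" in question. How those pages are then copied into a new file depends on the tool used and is outside this sketch.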

But that's why I asked the questions I did — before providing an answer to a question, it is important to understand the question well, so you'll find that here at Experts Exchange, experts will often ask you questions about your question in order to assist you better.

Btw, you said "adobe" in your initial question and "adobe" in your Delete Request, so to be clear, Adobe Reader cannot Extract Pages — Adobe Acrobat can. Regards, Joe
 

Author Closing Comment

by:Scott Johnston
I will award you the points because in the very last post you confirmed that Adobe does not have this function. That is all I wanted to find out. I appreciate that you may have a solution, but that is not what I was asking for.
That is why I wanted to delete the question.
I am aware of many different solutions to my question, but as for Adobe Acrobat X, it does not have this type of search extraction capabilities.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
Thank you for the points — I really appreciate it! But I want to be absolutely certain that we leave this thread with a correct understanding. So I want to make sure we're on the same wavelength with respect to your comment:
> ...but as for Adobe Acrobat X it does not have this type of search extraction capabilities.
To be clear:

Adobe Acrobat X and XI Standard do have the Extract Pages feature via selection of pages
Adobe Acrobat X and XI Professional do have the Extract Pages feature via selection of pages
Adobe Reader X and XI do not have the Extract Pages feature

But I'm still not sure of exactly what you mean by "this type of search extraction capabilities." If you mean to search for a phrase and then automatically extract all pages where the phrase is found into a new PDF, then I'm not aware of any way to do that in the off-the-shelf Adobe Acrobat X or XI, Standard or Professional.

One other thing — you said:
> I am aware of many different solutions to my question...
I am not. Please tell me some of the many different solutions to your question. I would like to research them. Thanks very much. Regards, Joe