Need help searching scanned PDFs

We have 9,000 PDFs that were scanned when created. Each PDF contains an invoice that was originally created in a word processor of some sort and that has ticket numbers. We have about 80 receipts for which we need to find the corresponding ticket numbers (located in the PDFs). Any suggestions on how to accomplish this task?
bankadmin asked:
 
Joe Winograd (Fellow & MVE, Developer) commented:
> all my scanners set up to be searchable PDFs by default

Excellent!

> Is there a way I can strip the first page on all the files in one action?

I recommend the PDF Toolkit (PDFtk) to do this. It comes in both command line and GUI versions. The command line version is called PDFtk Server and may be downloaded here:
http://www.pdflabs.com/tools/pdftk-server/

Don't be misled by "Server" in the name. I don't know why they called it that, but it's just an executable (pdftk.exe, with a supporting DLL, libiconv2.dll) that runs on XP, Vista, W7, and W8 (i.e., it does not have to run on a "server" OS).

Put all of the PDFs in a folder along with pdftk.exe and libiconv2.dll, and put this line in a BAT file (also in the same folder, otherwise you'll have to specify paths):

for %%I in (*.pdf) do "pdftk.exe" "%%I" cat 1 output "%%~nI-page1.pdf"

That will create 9,000 PDFs each of which has a file name ending in -page1.pdf (the part before the hyphen is the name of each original file). You may then use Acrobat to OCR the 9,000 one-page PDFs, or, if you'd like to see what happens on a single, 9,000-page file, put this statement as the second line in the BAT file:

pdftk *-page1.pdf cat output firstpages.pdf

That will create a file called firstpages.pdf with the first page of all 9,000 PDFs. I have no clue how Acrobat's Recognize Text will perform on that, but it can't hurt to give it a spin. If it croaks, then OCR the 9,000 one-page files. Regards, Joe
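For anyone who would rather script the first-page step than use a BAT file, here is a minimal Python sketch of the same idea. It assumes pdftk is installed and reachable as `pdftk` on the PATH; the function names `first_page_commands` and `extract_first_pages` are made up for illustration and are not part of PDFtk. It builds the same `cat 1 output` command shown above for each PDF and skips any file already ending in -page1, so a second run does not reprocess its own output.

```python
from pathlib import Path
import subprocess


def first_page_commands(folder, pdftk="pdftk"):
    """Build one pdftk command per PDF in `folder` to extract page 1.

    `pdftk` is the executable name or path (assumed to be on PATH).
    """
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Skip outputs from an earlier run so they are not processed again.
        if pdf.stem.endswith("-page1"):
            continue
        out = pdf.with_name(pdf.stem + "-page1.pdf")
        # Same arguments as the BAT line: <input> cat 1 output <output>
        commands.append([pdftk, str(pdf), "cat", "1", "output", str(out)])
    return commands


def extract_first_pages(folder, pdftk="pdftk"):
    """Run each command; raises CalledProcessError if pdftk reports a failure."""
    for cmd in first_page_commands(folder, pdftk):
        subprocess.run(cmd, check=True)
```

Calling `first_page_commands` on its own lets you look over the commands before anything is run; `extract_first_pages` then executes them one file at a time.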
 
Joe Winograd (Fellow & MVE, Developer) commented:
The key issue is whether those PDFs were scanned as image-only PDFs or as PDF Searchable Image files, meaning that they contain both images and the text that was created by the OCR process during scanning. To see if they have text, open one with Adobe Reader (or whatever PDF reader/viewer you use) and try to copy/paste the text into a text editor or Word. If you get text, then you can search for the ticket numbers with any search tool that you want; if you don't get text, then you'll need to OCR them. There are many OCR packages out there that will do it. Let me know if you want recommendations. Regards, Joe
 
bankadmin (Author) commented:
Thanks, Joe. With the original I have, I cannot copy text from it. I do have Adobe Acrobat XI, and I have saved one as a Word document and am testing the OCR feature in Acrobat. I guess I just have to figure out which is more efficient. Do you have any suggestions?
 
Joe Winograd (Fellow & MVE, Developer) commented:
Acrobat XI will do the trick! Of course, with 9000 PDFs, you don't want to do them one at a time. You should do:

Tools
Text Recognition
In Multiple Files
Add Folders...

Here's what the dialog looks like:

[Screenshot: Acrobat OCR Add Folders]
That will run OCR on all of the PDFs in those folders, creating Searchable Image PDFs (make sure you set that in the PDF Output Style drop-down), and I also recommend 300 DPI:

[Screenshot: Acrobat OCR]
The new, OCR'ed PDFs will all be searchable. I recommend storing the OCR'ed PDFs in different folders from the originals. Regards, Joe
 
bankadmin (Author) commented:
Yep, I just did that. However, with 9k files to search all at once, would it make more sense to first combine them into one file and then run the OCR against the new combined file, to make it easier when I actually start searching them?
 
Joe Winograd (Fellow & MVE, Developer) commented:
Even with just one page per file, that's 9,000 pages. I do not recommend having a 9,000-page PDF! OCR could easily croak on a file that big. Leave them in separate files for the OCR process, so you'll have 9,000 new Searchable Image PDFs. Then do Edit>Advanced Search, telling it to look in the folder with the 9,000 new Searchable Image PDFs, such as D:\0tempD in the screenshot below:

[Screenshot: Acrobat Advanced Search]
Regards, Joe
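With about 80 ticket numbers to look up, running them one at a time through Advanced Search could get tedious, so here is a rough Python sketch of a batch lookup. It rests on an assumption that is not part of the advice above: that the text of each OCR'ed PDF has already been exported to a .txt file with the same base name, using whatever tool you prefer. The function name `find_tickets` and the folder layout are hypothetical. It matches each ticket number only as a whole token, so 10045 does not match inside 100456. OCR misreads (a 0 read as an O, for instance) will still cause misses, so treat an empty result as "not found by this search" and not as proof that the ticket is absent.

```python
from pathlib import Path
import re


def find_tickets(text_folder, ticket_numbers):
    """Map each ticket number to the .txt files in `text_folder` that mention it."""
    # Compile one whole-token pattern per ticket: no letter, digit, or
    # underscore may sit directly before or after the number.
    patterns = {
        ticket: re.compile(r"(?<!\w)" + re.escape(ticket) + r"(?!\w)")
        for ticket in ticket_numbers
    }
    hits = {ticket: [] for ticket in ticket_numbers}
    for txt in sorted(Path(text_folder).glob("*.txt")):
        # OCR output can contain odd bytes; ignore anything undecodable.
        content = txt.read_text(encoding="utf-8", errors="ignore")
        for ticket, pattern in patterns.items():
            if pattern.search(content):
                hits[ticket].append(txt.name)
    return hits
```

The result is a dictionary keyed by ticket number, so any ticket whose list is empty stands out as one to check by hand.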
 
bankadmin (Author) commented:
Great advice, thanks. The sample file I have is 17 pages; technically, I'm only concerned with the first page of each file, though.
 
Joe Winograd (Fellow & MVE, Developer) commented:
If they're all 17 pages, that's in the neighborhood of 150,000 pages, which could take quite a while to OCR — and 140,000+ of them are worthless! If the first page is all you care about, I'd be tempted to peel off the first page of each. That way, you'd have to OCR "only" 9,000 pages. This is a good time to say that, in the future, when you scan documents, scan them to PDF Searchable Image files, not PDF image-only files.
 
bankadmin (Author) commented:
Agreed. We did not do the scanning of these documents; I have all my scanners set up to be searchable PDFs by default. Is there a way I can strip the first page on all the files in one action?
 
bankadmin (Author) commented:
Thank you for all the advice
 
Joe Winograd (Fellow & MVE, Developer) commented:
You're very welcome. Happy to help. Good luck on the project. Regards, Joe
Question has a verified solution.