Solved

Need help searching scanned PDFs

Posted on 2015-01-30
Last Modified: 2015-02-05
We have 9,000 PDFs that were created by scanning. Each PDF contains an invoice, originally produced in a word processor of some sort, that includes ticket numbers. We have about 80 receipts, and we need to find the corresponding ticket number for each one (located in the PDFs). Any suggestions on how to accomplish this task?
Question by:bankadmin
LVL 51

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580609
The key issue is whether those PDFs were scanned as image-only PDFs or as PDF Searchable Image files, meaning that they contain both images and the text created by the OCR process during scanning. To see if they have text, open one with Adobe Reader (or whatever PDF reader/viewer you use) and try to copy/paste the text into a text editor or Word. If you get text, then you can search for the ticket numbers with any search tool that you want; if you don't get text, then you'll need to OCR them. There are many OCR packages out there that will do it. Let me know if you want recommendations. Regards, Joe

Author Comment

by:bankadmin
ID: 40580625
Thanks, Joe. I cannot copy text from the original I have. I do have Adobe Acrobat XI; I have saved one file as a Word document, and I'm also testing the OCR feature in Acrobat. I guess I just have to figure out which is more efficient. Do you have any suggestions?
LVL 51

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580679
Acrobat XI will do the trick! Of course, with 9000 PDFs, you don't want to do them one at a time. You should do:

Tools
Text Recognition
In Multiple Files
Add Folders...

Here's what the dialog looks like:

[Screenshot: Acrobat OCR Add Folders]
That will run OCR on all of the PDFs in those folders, creating Searchable Image PDFs (make sure you set that in the PDF Output Style drop-down), and I also recommend 300 DPI:

[Screenshot: Acrobat OCR]
The new, OCR'ed PDFs will all be searchable. I recommend storing the OCR'ed PDFs in different folders from the originals. Regards, Joe

Author Comment

by:bankadmin
ID: 40580688
Yep, I just did that. However, with 9k files to search all at once, would it make more sense to first combine them into one file and then run the OCR against the new combined file, to make it easier when I actually start searching?
LVL 51

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580716
Even with just one page per file, that's 9,000 pages. I do not recommend having a 9,000-page PDF! OCR could easily croak on a file that big. Leave them in separate files for the OCR process, so you'll have 9,000 new Searchable Image PDFs. Then do Edit>Advanced Search, telling it to look in the folder with the 9,000 new Searchable Image PDFs, such as D:\0tempD in the screenshot below:

[Screenshot: Acrobat Advanced Search]
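
With about 80 ticket numbers to look up, a scripted pass is another option once the PDFs contain text. Here is a rough Python sketch, assuming the text of each OCR'ed PDF has already been extracted into a string (how you extract it is outside this sketch). The function name, file names, and ticket numbers are all made up for illustration.

```python
from collections import defaultdict


def find_tickets(page_texts, ticket_numbers):
    """Map each ticket number to the sorted list of files whose text contains it.

    page_texts: dict of file name -> text already extracted from that PDF
                (hypothetical input; the extraction step is not shown here).
    ticket_numbers: list of ticket numbers as strings.
    """
    hits = defaultdict(list)
    for name, text in page_texts.items():
        # OCR output often has odd line breaks, so collapse all whitespace.
        normalized = " ".join(text.split())
        for ticket in ticket_numbers:
            if ticket in normalized:
                hits[ticket].append(name)
    # Keep tickets with no match in the result so missing ones are easy to spot.
    return {ticket: sorted(hits.get(ticket, [])) for ticket in ticket_numbers}


if __name__ == "__main__":
    sample_texts = {
        "inv0001-page1.pdf": "INVOICE\nTicket No.  48213\nTotal 19.50",
        "inv0002-page1.pdf": "INVOICE\nTicket No.  48377\nTotal 7.25",
    }
    for ticket, files in find_tickets(sample_texts, ["48213", "99999"]).items():
        print(ticket, files or "NOT FOUND")
```

This is plain substring matching, so a short ticket number can also match inside a longer number, and OCR misreads (a 0 read as an O, for example) will cause misses. Treat an empty result as "check by hand", not as proof the ticket is absent.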
Regards, Joe

Author Comment

by:bankadmin
ID: 40580721
Great advice, thanks. The sample file I have is 17 pages; technically, I'm only concerned with the first page of each file, though.
LVL 51

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580750
If they're all 17 pages, that's in the neighborhood of 150,000 pages, which could take quite a while to OCR — and 140,000+ of them are worthless! If the first page is all you care about, I'd be tempted to peel off the first page of each. That way, you'd have to OCR "only" 9,000 pages. This is a good time to say that, in the future, when you scan documents, scan them to PDF Searchable Image files, not PDF image-only files.
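
For anyone who wants the exact figures behind those round numbers, the arithmetic can be sketched in a few lines of Python. The 17-pages-per-file figure is an assumption taken from the single sample file; the real files may vary.

```python
files = 9000
pages_per_file = 17  # assumption: every file matches the 17-page sample

total_pages = files * pages_per_file
unneeded_pages = total_pages - files  # everything except each file's first page

print(total_pages)     # 153000
print(unneeded_pages)  # 144000
```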

Author Comment

by:bankadmin
ID: 40580902
Agreed. We did not do the scanning of these documents; I have all my scanners setup to be searchable PDF's by default. Is there a way I can strip the first page on all the files in one action?
LVL 51

Accepted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580923
> all my scanners setup to be searchable PDF's by default

Excellent!

> Is there a way I can strip the first page on all the files in one action?

I recommend the PDF Toolkit (PDFtk) to do this. It comes in both command line and GUI versions. The command line version is called PDFtk Server and may be downloaded here:
http://www.pdflabs.com/tools/pdftk-server/

Don't be misled by "Server" in the name. I don't know why they called it that, but it's just an executable (pdftk.exe, with a supporting DLL, libiconv2.dll) that runs on XP, Vista, W7, and W8 (i.e., it does not have to run on a "server" OS).

Put all of the PDFs in a folder along with pdftk.exe and libiconv2.dll, and put this line in a BAT file (also in the same folder, otherwise you'll have to specify paths):

for %%I in (*.pdf) do "pdftk.exe" "%%I" cat 1 output "%%~nI-page1.pdf"

That will create 9,000 PDFs, each of which has a file name ending in -page1.pdf (the part before the hyphen is the name of each original file). You may then use Acrobat to OCR the 9,000 one-page PDFs, or, if you'd like to see what happens on a single, 9,000-page file, put this statement as the second line in the BAT file:

pdftk *-page1.pdf cat output firstpages.pdf

That will create a file called firstpages.pdf with the first page of all 9,000 PDFs. I have no clue how Acrobat's Recognize Text will perform on that, but it can't hurt to give it a spin. If it croaks, then OCR the 9,000 one-page files. Regards, Joe
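
If a script is easier to manage than a BAT file, the same first-page extraction could be sketched in Python roughly as follows. This is not part of the BAT approach above; it assumes pdftk is on the PATH, and the function names are hypothetical. One deliberate difference: the one-page files go into a separate output folder, so they never sit alongside the originals being looped over.

```python
import subprocess
from pathlib import Path


def page1_command(pdf, out_dir):
    """Build the pdftk command that copies page 1 of `pdf` into `out_dir`."""
    pdf = Path(pdf)
    target = Path(out_dir) / f"{pdf.stem}-page1.pdf"
    # Same arguments as the BAT line: pdftk <in> cat 1 output <out>
    return ["pdftk", str(pdf), "cat", "1", "output", str(target)]


def extract_first_pages(src_dir, out_dir):
    """Run pdftk once per PDF in `src_dir`; return the number of files handled."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        # check=True raises an error at the first file pdftk cannot process.
        subprocess.run(page1_command(pdf, out), check=True)
        count += 1
    return count
```

Calling `extract_first_pages("originals", "first-pages")` (folder names are placeholders) would then leave one `-page1.pdf` file per original in the output folder, ready for Acrobat's batch OCR.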

Author Comment

by:bankadmin
ID: 40591170
Thank you for all the advice
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 40591420
You're very welcome. Happy to help. Good luck on the project. Regards, Joe