Solved

Need help searching scanned PDFs

Posted on 2015-01-30
Last Modified: 2015-02-05
We have 9,000 PDFs that were scanned when created. Each PDF contains an invoice that was originally created by a word processor of some sort and that has ticket numbers. We have about 80 receipts for which we need to find the corresponding ticket numbers (located in the PDFs). Any suggestions on how to accomplish this task?
Question by:bankadmin
11 Comments
 
LVL 54

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580609
The key issue is whether those PDFs were scanned as image-only PDFs or as PDF Searchable Image files, meaning that they contain both the images and the text created by the OCR process during scanning. To see if they have text, open one with Adobe Reader (or whatever PDF reader/viewer you use) and try to copy/paste the text into a text editor or Word. If you get text, then you can search for the ticket numbers with any search tool that you want; if you don't get text, then you'll need to OCR them. There are many OCR packages out there that will do it. Let me know if you want recommendations. Regards, Joe
 

Author Comment

by:bankadmin
ID: 40580625
Thanks, Joe. I cannot copy text from the original I have. I do have Adobe Acrobat XI, and I have saved one as a Word document and am testing the OCR feature in Acrobat. I guess I just have to figure out which is more efficient. Do you have any suggestions?
 
LVL 54

Assisted Solution

by:Joe Winograd, EE MVE
ID: 40580679
Acrobat XI will do the trick! Of course, with 9000 PDFs, you don't want to do them one at a time. You should do:

Tools
Text Recognition
In Multiple Files
Add Folders...

Here's what the dialog looks like:

[Screenshot: Acrobat OCR Add Folders]
That will run OCR on all of the PDFs in those folders, creating Searchable Image PDFs (make sure you set that in the PDF Output Style drop-down), and I also recommend 300 DPI:

[Screenshot: Acrobat OCR]
The new, OCR'ed PDFs will all be searchable. I recommend storing the OCR'ed PDFs in different folders from the originals. Regards, Joe

Author Comment

by:bankadmin
ID: 40580688
Yep, I just did that. However, with 9k files to search all at once, would it make more sense to first combine them into one file and then run the OCR against the combined file, to make it easier when I actually start searching them?
 
LVL 54

Assisted Solution

by:Joe Winograd, EE MVE
ID: 40580716
Even with just one page per file, that's 9,000 pages. I do not recommend having a 9,000-page PDF! OCR could easily croak on a file that big. Leave them in separate files for the OCR process, so you'll have 9,000 new Searchable Image PDFs. Then do Edit>Advanced Search, telling it to look in the folder with the 9,000 new Searchable Image PDFs, such as D:\0tempD in the screenshot below:

[Screenshot: Acrobat Advanced Search]
Regards, Joe
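
Editor's sketch (not part of Joe's answer): with about 80 values to look up, a scripted search may be easier to repeat than running Advanced Search 80 times. The Python below is a swapped-in alternative, not Acrobat's search. It assumes the text of each OCR'ed PDF has already been exported to a .txt file of the same name in one folder (that export step, the function name find_terms, and the sample numbers are all hypothetical), and it only does a plain substring match, so digits misread by OCR will be missed.

```python
from pathlib import Path


def find_terms(folder, terms):
    """Map each search term to the sorted names of the .txt files containing it."""
    hits = {term: [] for term in terms}
    for path in sorted(Path(folder).glob("*.txt")):
        # Ignore undecodable bytes; OCR exports are not always clean UTF-8.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for term in terms:
            if term in text:
                hits[term].append(path.name)
    return hits
```

Calling find_terms on the export folder with the list of numbers from the receipts returns, for each number, the files that mention it; an empty list means that number was not found as typed.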
 

Author Comment

by:bankadmin
ID: 40580721
Great advice, thanks. The sample file I have is 17 pages; technically I'm only concerned with the first page of each file, though.
 
LVL 54

Assisted Solution

by:Joe Winograd, EE MVE
ID: 40580750
If they're all 17 pages, that's in the neighborhood of 150,000 pages, which could take quite a while to OCR — and 140,000+ of them are worthless! If the first page is all you care about, I'd be tempted to peel off the first page of each. That way, you'd have to OCR "only" 9,000 pages. This is a good time to say that, in the future, when you scan documents, scan them to PDF Searchable Image files, not PDF image-only files.
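
Editor's sketch: the page counts above can be checked with a few lines of Python, assuming (as the comment does) that every one of the 9,000 files has 17 pages. The function name ocr_page_counts is made up for illustration.

```python
def ocr_page_counts(files, pages_per_file):
    """Return (total pages, pages actually needed, pages that can be skipped)."""
    total = files * pages_per_file
    needed = files  # only the first page of each file matters
    return total, needed, total - needed


total, needed, skipped = ocr_page_counts(9000, 17)
print(total, needed, skipped)  # 153000 9000 144000
```

So OCR'ing only the first pages would cut the workload to roughly 6% of the full run, under that 17-pages-per-file assumption.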
 

Author Comment

by:bankadmin
ID: 40580902
Agreed. We did not do the scanning of these documents; I have all my scanners setup to be searchable PDF's by default. Is there a way I can strip the first page on all the files in one action?
 
LVL 54

Accepted Solution

by:Joe Winograd, EE MVE
ID: 40580923
> all my scanners setup to be searchable PDF's by default

Excellent!

> Is there a way I can strip the first page on all the files in one action?

I recommend the PDF Toolkit (PDFtk) to do this. It comes in both command line and GUI versions. The command line version is called PDFtk Server and may be downloaded here:
http://www.pdflabs.com/tools/pdftk-server/

Don't be misled by "Server" in the name. I don't know why they called it that, but it's just an executable (pdftk.exe, with a supporting DLL, libiconv2.dll) that runs on XP, Vista, W7, and W8 (i.e., it does not have to run on a "server" OS).

Put all of the PDFs in a folder along with pdftk.exe and libiconv2.dll, and put this line in a BAT file (also in the same folder, otherwise you'll have to specify paths):

for %%I in (*.pdf) do "pdftk.exe" "%%I" cat 1 output "%%~nI-page1.pdf"

That will create 9,000 PDFs each of which has a file name ending in -page1.pdf (the part before the hyphen is the name of each original file). You may then use Acrobat to OCR the 9,000 one-page PDFs, or, if you'd like to see what happens on a single, 9,000-page file, put this statement as the second line in the BAT file:

pdftk *-page1.pdf cat output firstpages.pdf

That will create a file called firstpages.pdf with the first page of all 9,000 PDFs. I have no clue how Acrobat's Recognize Text will perform on that, but it can't hurt to give it a spin. If it croaks, then OCR the 9,000 one-page files. Regards, Joe
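
Editor's sketch: for anyone who prefers a script to a BAT file, the same pdftk call (input file, cat 1, output file) can be generated from Python. This is only an illustration under assumptions: the function name first_page_commands and the idea of a separate output folder are not from the answer above, and pdftk is assumed to be on the PATH. Writing the one-page files to their own folder keeps them apart from the originals, so a second run does not pick them up as inputs.

```python
from pathlib import Path


def first_page_commands(src_folder, out_folder, pdftk="pdftk"):
    """Build one pdftk command per PDF that copies page 1 into out_folder."""
    commands = []
    for pdf in sorted(Path(src_folder).glob("*.pdf")):
        # Same naming scheme as the BAT file: <original name>-page1.pdf
        target = Path(out_folder) / f"{pdf.stem}-page1.pdf"
        commands.append([pdftk, str(pdf), "cat", "1", "output", str(target)])
    return commands
```

The function only builds the command lists; to actually run them, create the output folder first and pass each list to subprocess.run(cmd, check=True).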
 

Author Comment

by:bankadmin
ID: 40591170
Thank you for all the advice
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE
ID: 40591420
You're very welcome. Happy to help. Good luck on the project. Regards, Joe
