Solved

Need help searching scanned PDFs

Posted on 2015-01-30
Last Modified: 2015-02-05
We have 9,000 PDFs that were created by scanning. Each PDF contains an invoice that was originally created in a word processor of some sort and that has ticket numbers. We have about 80 receipts for which we need to find the corresponding ticket numbers (located in the PDFs). Any suggestions on how to accomplish this task?
Question by:bankadmin
 
LVL 52

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580609
The key issue is whether those PDFs were scanned as image-only PDFs or as PDF Searchable Image files, meaning that they contain both images and text that was created by the OCR process during scanning. To see if they have text, open one with Adobe Reader (or whatever PDF reader/viewer you use) and try to copy/paste the text into a text editor or Word. If you get text, then you can search for the ticket numbers with any search tool that you want; if you don't get text, then you'll need to OCR them. There are many OCR packages out there that will do it. Let me know if you want recommendations. Regards, Joe
 

Author Comment

by:bankadmin
ID: 40580625
Thanks, Joe. I cannot copy text from the original I have. I do have Adobe Acrobat XI; I have saved one file as a Word document, and I'm testing the OCR feature in Acrobat. I guess I just have to figure out which is more efficient. Do you have any suggestions?
 
LVL 52

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580679
Acrobat XI will do the trick! Of course, with 9000 PDFs, you don't want to do them one at a time. Instead, use this menu path:

Tools
Text Recognition
In Multiple Files
Add Folders...

Here's what the dialog looks like:

[Screenshot: Acrobat OCR Add Folders]
That will run OCR on all of the PDFs in those folders, creating Searchable Image PDFs (make sure you set that in the PDF Output Style drop-down), and I also recommend 300 DPI:

[Screenshot: Acrobat OCR]
The new, OCR'ed PDFs will all be searchable. I recommend storing the OCR'ed PDFs in different folders from the originals. Regards, Joe
 

Author Comment

by:bankadmin
ID: 40580688
Yep, I just did that. However, with 9k files to search all at once, would it make more sense to first combine them into one file and then run the OCR against the new combined file, to make it easier when I actually start searching them?
 
LVL 52

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580716
Even with just one page per file, that's 9,000 pages. I do not recommend having a 9,000-page PDF! OCR could easily croak on a file that big. Leave them in separate files for the OCR process, so you'll have 9,000 new Searchable Image PDFs. Then do Edit>Advanced Search, telling it to look in the folder with the 9,000 new Searchable Image PDFs, such as D:\0tempD in the screenshot below:

[Screenshot: Acrobat Advanced Search]
Regards, Joe
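
For anyone who would rather script the lookup than search 80 values by hand in Advanced Search, here is a minimal Python sketch. It is not part of the advice above and rests on assumptions: that the text of each OCR'ed PDF has already been exported to a plain-text file (one .txt per PDF, a hypothetical layout), and that the values being looked up are digit strings. The names find_terms and load_texts are made up for illustration.

```python
import re
from pathlib import Path


def find_terms(texts, wanted):
    """Map each wanted value to the sorted names of documents that contain it.

    texts  -- dict of {document name: extracted text}
    wanted -- iterable of values to look for, as strings
    """
    hits = {term: [] for term in wanted}
    for name, text in sorted(texts.items()):
        for term in hits:
            # Require non-digit neighbours so "48213" does not match inside "148213".
            pattern = r"(?<!\d)" + re.escape(term) + r"(?!\d)"
            if re.search(pattern, text):
                hits[term].append(name)
    return hits


def load_texts(folder):
    """Read every .txt file in folder (assumed layout: one text file per PDF)."""
    return {p.name: p.read_text(errors="ignore") for p in Path(folder).glob("*.txt")}


if __name__ == "__main__":
    # Tiny in-memory sample; real use would call load_texts() on a folder.
    sample = {
        "inv-0001.txt": "Invoice 0001  Ticket 48213",
        "inv-0002.txt": "Invoice 0002  Ticket 148213",
    }
    print(find_terms(sample, ["48213", "99999"]))
```

OCR output is rarely perfect, so a value that comes back with no hits may still be worth checking by eye.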

 

Author Comment

by:bankadmin
ID: 40580721
Great advice, thanks. The sample file I have is 17 pages; technically, I'm only concerned with the first page of each file, though.
 
LVL 52

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580750
If they're all 17 pages, that's in the neighborhood of 150,000 pages, which could take quite a while to OCR — and 140,000+ of them are worthless! If the first page is all you care about, I'd be tempted to peel off the first page of each. That way, you'd have to OCR "only" 9,000 pages. This is a good time to say that, in the future, when you scan documents, scan them to PDF Searchable Image files, not PDF image-only files.
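
The page counts in that estimate can be checked with a few lines of Python. This assumes every one of the 9,000 files has 17 pages like the sample, which the thread does not confirm.

```python
files = 9000
pages_per_file = 17  # assumption: every file matches the 17-page sample

total_pages = files * pages_per_file        # pages to OCR if nothing is stripped
first_pages = files                         # one first page per file
unneeded_pages = total_pages - first_pages  # pages that never need OCR

print(total_pages, first_pages, unneeded_pages)
```

Under that assumption, stripping to the first page cuts the OCR workload by a factor of 17.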
 

Author Comment

by:bankadmin
ID: 40580902
Agreed. We did not do the scanning of these documents; I have all my scanners set up to be searchable PDFs by default. Is there a way I can strip the first page on all the files in one action?
 
LVL 52

Accepted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 500 total points
ID: 40580923
> all my scanners setup to be searchable PDF's by default

Excellent!

> Is there a way I can strip the first page on all the files in one action?

I recommend the PDF Toolkit (PDFtk) to do this. It comes in both command line and GUI versions. The command line version is called PDFtk Server and may be downloaded here:
http://www.pdflabs.com/tools/pdftk-server/

Don't be misled by "Server" in the name. I don't know why they called it that, but it's just an executable (pdftk.exe, with a supporting DLL, libiconv2.dll) that runs on XP, Vista, W7, and W8 (i.e., it does not have to run on a "server" OS).

Put all of the PDFs in a folder along with pdftk.exe and libiconv2.dll, and put this line in a BAT file (also in the same folder, otherwise you'll have to specify paths):

for %%I in (*.pdf) do "pdftk.exe" "%%I" cat 1 output "%%~nI-page1.pdf"

That will create 9,000 PDFs each of which has a file name ending in -page1.pdf (the part before the hyphen is the name of each original file). You may then use Acrobat to OCR the 9,000 one-page PDFs, or, if you'd like to see what happens on a single, 9,000-page file, put this statement as the second line in the BAT file:

pdftk *-page1.pdf cat output firstpages.pdf

That will create a file called firstpages.pdf with the first page of all 9,000 PDFs. I have no clue how Acrobat's Recognize Text will perform on that, but it can't hurt to give it a spin. If it croaks, then OCR the 9,000 one-page files. Regards, Joe
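
One thing to watch with the BAT approach: running it a second time in the same folder would also process the earlier -page1.pdf outputs, because they match *.pdf too. Below is a rough Python sketch of the same first-page extraction that skips those outputs. It assumes Python is installed and that pdftk is on the PATH or in the working folder; the function names are made up for illustration, and the pdftk arguments are the same "cat 1 output" form used in the BAT line.

```python
import subprocess
from pathlib import Path

SUFFIX = "-page1"  # same naming scheme as the BAT file


def first_page_commands(pdf_names, pdftk="pdftk"):
    """Build one pdftk command per source PDF, skipping earlier outputs."""
    commands = []
    for name in sorted(pdf_names):
        path = Path(name)
        # Ignore non-PDF files and files already produced by a previous run.
        if path.suffix.lower() != ".pdf" or path.stem.endswith(SUFFIX):
            continue
        output = path.with_name(path.stem + SUFFIX + ".pdf")
        commands.append([pdftk, str(path), "cat", "1", "output", str(output)])
    return commands


def extract_first_pages(folder):
    """Run pdftk on every source PDF in folder; stops at the first failure."""
    names = [str(p) for p in Path(folder).iterdir() if p.is_file()]
    for command in first_page_commands(names):
        subprocess.run(command, check=True)
```

Calling extract_first_pages on the folder that holds the PDFs writes the one-page files next to the originals, the same as the BAT file does.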
 

Author Comment

by:bankadmin
ID: 40591170
Thank you for all the advice
 
LVL 52

Expert Comment

by:Joe Winograd, EE MVE
ID: 40591420
You're very welcome. Happy to help. Good luck on the project. Regards, Joe