Solved

Need help searching scanned PDFs

Posted on 2015-01-30
Last Modified: 2015-02-05
We have 9,000 PDFs that were created by scanning. Each PDF contains an invoice, originally produced in a word processor of some sort, that has ticket numbers on it. We have about 80 receipts for which we need to find the corresponding ticket numbers (located in the PDFs). Any suggestions on how to accomplish this task?
Question by:bankadmin
11 Comments
 
LVL 56

Assisted Solution

by:Joe Winograd, EE MVE 2015&2016
Joe Winograd, EE MVE 2015&2016 earned 2000 total points
ID: 40580609
The key issue is if those PDFs were scanned as image-only PDFs or if they were scanned as PDF Searchable Image files, meaning that they contain both images as well as text that was created by the OCR process during scanning. To see if they have text, open one with Adobe Reader (or whatever PDF reader/viewer you use) and try to copy/paste the text into a text editor or Word. If you get text, then you can search for the ticket numbers with any search tool that you want; if you don't get text, then you'll need to OCR them. There are many OCR packages out there that will do it. Let me know if you want recommendations. Regards, Joe
0
 

Author Comment

by:bankadmin
ID: 40580625
Thanks, Joe. I cannot copy text from the original I have. I do have Adobe Acrobat XI; I have saved one file as a Word document, and I'm also testing the OCR feature in Acrobat. I guess I just have to figure out which is more efficient. Do you have any suggestions?
0
 
LVL 56

Assisted Solution

by:Joe Winograd, EE MVE 2015&2016
ID: 40580679
Acrobat XI will do the trick! Of course, with 9000 PDFs, you don't want to do them one at a time. You should do:

Tools
Text Recognition
In Multiple Files
Add Folders...

Here's what the dialog looks like:

[Screenshot: Acrobat OCR Add Folders]
That will run OCR on all of the PDFs in those folders, creating Searchable Image PDFs (make sure you set that in the PDF Output Style drop-down), and I also recommend 300 DPI:

[Screenshot: Acrobat OCR]
The new, OCR'ed PDFs will all be searchable. I recommend storing the OCR'ed PDFs in different folders from the originals. Regards, Joe
0
 

Author Comment

by:bankadmin
ID: 40580688
Yep, I just did that. However, with 9k files to search all at once, would it make more sense to first combine them into one file and then run the OCR against the new combined file, to make it easier when I actually start searching them?
0
 
LVL 56

Assisted Solution

by:Joe Winograd, EE MVE 2015&2016
ID: 40580716
Even with just one page per file, that's 9,000 pages. I do not recommend having a 9,000-page PDF! OCR could easily croak on a file that big. Leave them in separate files for the OCR process, so you'll have 9,000 new Searchable Image PDFs. Then do Edit>Advanced Search, telling it to look in the folder with the 9,000 new Searchable Image PDFs, such as D:\0tempD in the screenshot below:

[Screenshot: Acrobat Advanced Search]
Regards, Joe
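
With about 80 ticket numbers to look up, the matching step could also be scripted once the OCR text is available outside Acrobat. The following is only a sketch under stated assumptions: it assumes the text of each OCR'ed PDF has already been exported to a plain string by some other means (not shown), and the function name, file names, and ticket numbers are hypothetical.

```python
import re


def find_tickets(ticket_numbers, text_by_file):
    """Map each ticket number to the sorted list of files whose text contains it.

    ticket_numbers: iterable of ticket numbers as strings.
    text_by_file:   dict of file name -> OCR text (assumed already extracted).
    """
    hits = {}
    for ticket in ticket_numbers:
        # Require non-digits on both sides so 40417 does not match inside 140417.
        pattern = re.compile(r"(?<!\d)" + re.escape(ticket) + r"(?!\d)")
        hits[ticket] = sorted(
            name for name, text in text_by_file.items() if pattern.search(text)
        )
    return hits


if __name__ == "__main__":
    # Hypothetical ticket numbers and OCR text, for illustration only.
    sample_text = {
        "inv001-page1.pdf": "Invoice for ticket 40417",
        "inv002-page1.pdf": "Invoice for ticket 140417",
    }
    for ticket, files in find_tickets(["40417", "55555"], sample_text).items():
        print(ticket, files if files else "NOT FOUND")
```

OCR can misread digits, so a ticket number that comes back with no matching file is worth checking by hand rather than treating as absent.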
0
 

Author Comment

by:bankadmin
ID: 40580721
Great advice, thanks. The sample file I have is 17 pages; technically, I'm only concerned with the first page of each file, though.
0
 
LVL 56

Assisted Solution

by:Joe Winograd, EE MVE 2015&2016
ID: 40580750
If they're all 17 pages, that's in the neighborhood of 150,000 pages, which could take quite a while to OCR — and 140,000+ of them are worthless! If the first page is all you care about, I'd be tempted to peel off the first page of each. That way, you'd have to OCR "only" 9,000 pages. This is a good time to say that, in the future, when you scan documents, scan them to PDF Searchable Image files, not PDF image-only files.
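
The page counts above can be checked with a few lines of arithmetic. This is a rough sketch that assumes every one of the 9,000 files has 17 pages, which is known only for the one sample file; the function name is made up for illustration.

```python
def ocr_workload(file_count, pages_per_file):
    """Return (total pages, first pages to keep, pages that can be skipped)."""
    total = file_count * pages_per_file
    first_pages = file_count  # exactly one first page per file
    return total, first_pages, total - first_pages


if __name__ == "__main__":
    # Assumption: all 9,000 files are 17 pages long, like the sample.
    total, needed, skipped = ocr_workload(9000, 17)
    print(f"{total:,} pages in total, {needed:,} needed, {skipped:,} skippable")
    # prints "153,000 pages in total, 9,000 needed, 144,000 skippable"
```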
0
 

Author Comment

by:bankadmin
ID: 40580902
Agreed. We did not do the scanning of these documents; I have all my scanners setup to be searchable PDF's by default. Is there a way I can strip the first page on all the files in one action?
0
 
LVL 56

Accepted Solution

by:Joe Winograd, EE MVE 2015&2016
ID: 40580923
> all my scanners setup to be searchable PDF's by default

Excellent!

> Is there a way I can strip the first page on all the files in one action?

I recommend the PDF Toolkit (PDFtk) to do this. It comes in both command line and GUI versions. The command line version is called PDFtk Server and may be downloaded here:
http://www.pdflabs.com/tools/pdftk-server/

Don't be misled by "Server" in the name. I don't know why they called it that, but it's just an executable (pdftk.exe, with a supporting DLL, libiconv2.dll) that runs on XP, Vista, W7, and W8 (i.e., it does not have to run on a "server" OS).

Put all of the PDFs in a folder along with pdftk.exe and libiconv2.dll, and put this line in a BAT file (also in the same folder, otherwise you'll have to specify paths):

for %%I in (*.pdf) do "pdftk.exe" "%%I" cat 1 output "%%~nI-page1.pdf"

That will create 9,000 PDFs each of which has a file name ending in -page1.pdf (the part before the hyphen is the name of each original file). You may then use Acrobat to OCR the 9,000 one-page PDFs, or, if you'd like to see what happens on a single, 9,000-page file, put this statement as the second line in the BAT file:

pdftk *-page1.pdf cat output firstpages.pdf

That will create a file called firstpages.pdf with the first page of all 9,000 PDFs. I have no clue how Acrobat's Recognize Text will perform on that, but it can't hurt to give it a spin. If it croaks, then OCR the 9,000 one-page files. Regards, Joe
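
One thing to watch with the BAT loop: if it is run a second time in the same folder, the earlier -page1.pdf outputs also match *.pdf and would be processed again. As a hedged alternative, the sketch below builds the same `pdftk <input> cat 1 output <output>` command for each file but skips anything already ending in -page1. The function name and file names are hypothetical, and the code only builds the argument lists; it does not run pdftk.

```python
from pathlib import PureWindowsPath

SUFFIX = "-page1"  # same suffix the BAT file appends to each output name


def first_page_commands(filenames, pdftk="pdftk.exe"):
    """Return one pdftk argument list per source PDF.

    Non-PDF files and files whose name already ends in SUFFIX are skipped,
    so a second run does not create "-page1-page1.pdf" files.
    """
    commands = []
    for name in sorted(filenames):
        path = PureWindowsPath(name)
        if path.suffix.lower() != ".pdf" or path.stem.endswith(SUFFIX):
            continue
        output = path.with_name(path.stem + SUFFIX + ".pdf")
        commands.append([pdftk, str(path), "cat", "1", "output", str(output)])
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in first_page_commands(["inv001.pdf", "inv001-page1.pdf", "notes.txt"]):
        print(" ".join(cmd))
```

On a machine with pdftk installed, each argument list could be handed to Python's `subprocess.run` to do the extraction.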
0
 

Author Comment

by:bankadmin
ID: 40591170
Thank you for all the advice
0
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 40591420
You're very welcome. Happy to help. Good luck on the project. Regards, Joe
0
