
Bulk OCR on a small budget

I am looking for a way to bulk OCR around 50,000 PDFs to make them searchable. There are around 6 million pages in total.

What would be the best software and hardware to do this on, for a reasonable cost? (Acrobat Capture is too expensive.)

The system needs to be robust and automatic, so that in the event of a crash it will automatically restart OCRing (Acrobat Pro, for example, does not seem to do this). It also needs to move each processed document to a new folder.
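
To make the requirement concrete, here is a minimal sketch of a restart-safe batch loop. It is only an illustration under assumptions: the folder layout and the `ocr` callable are hypothetical placeholders for whatever OCR tool ends up being used, not a feature of any product named in this thread. A source file leaves the inbox only after it has been handled, so re-running the loop after a crash picks up whatever is left.

```python
import shutil
from pathlib import Path
from typing import Callable


def process_folder(inbox: Path, out: Path, processed: Path, failed: Path,
                   ocr: Callable[[Path, Path], None]) -> int:
    """OCR every PDF in `inbox` and return how many succeeded.

    `ocr(src, dst)` is a hypothetical hook that writes a searchable
    PDF for `src` to `dst`; plug in the real OCR engine there.
    """
    for folder in (out, processed, failed):
        folder.mkdir(parents=True, exist_ok=True)

    succeeded = 0
    for pdf in sorted(inbox.glob("*.pdf")):
        partial = out / (pdf.name + ".part")
        try:
            ocr(pdf, partial)
            # Rename only once the output is complete, so a crash
            # never leaves a half-written file under the final name.
            partial.replace(out / pdf.name)
            shutil.move(str(pdf), str(processed / pdf.name))
            succeeded += 1
        except Exception:
            # Park the bad file so it cannot block the rest of the run.
            partial.unlink(missing_ok=True)
            shutil.move(str(pdf), str(failed / pdf.name))
    return succeeded
```

If the process dies mid-file, the source PDF is still in the inbox and only a stray `.part` file is left behind, which the next run overwrites.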

I am happy to use either Windows-based or Linux-based software.
Asked by: Rabbit80
1 Solution
 
Amila Hendahewa Commented:
I recommend ABBYY FineReader (www.abbyy.com).
There are also other products, such as ABBYY Recognition Server, but they may be too expensive for this particular project.

For scanners, the best option would be a low-end Fujitsu scanner like the 5120 C2; this depends on your document size.
 
Rabbit80 Author Commented:
ABBYY FineReader is not at all suitable for processing 6 million pages in 50,000 documents. I need batch processing (which the server editions offer) that can restart in the event of an application crash. I have used FineReader before but found it to be unstable with extremely large volumes. Should the program stop overnight or over a weekend, for example, our time constraints may get the better of us.

I am expecting this project to take around 4 months of processing time (averaging around 1.5 to 2 seconds per page).
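
As a quick sanity check on that estimate, the arithmetic can be sketched as below. It assumes continuous, uninterrupted running; the `instances` parameter is a hypothetical extra for the case where several OCR processes run in parallel with perfect scaling, which is an idealisation.

```python
PAGES = 6_000_000
SECONDS_PER_DAY = 86_400


def processing_days(pages: int, seconds_per_page: float, instances: int = 1) -> float:
    """Wall-clock days needed, assuming round-the-clock processing."""
    return pages * seconds_per_page / instances / SECONDS_PER_DAY


for spp in (1.5, 2.0):
    print(f"{spp} s/page: {processing_days(PAGES, spp):.0f} days")
# prints:
# 1.5 s/page: 104 days
# 2.0 s/page: 139 days
```

That is roughly 3.5 to 4.5 months on a single instance, in line with the figure above.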

As for scanners, this is irrelevant since I already have the PDFs; however, I use Bowe Bell & Howell 8080D and Fujitsu M4099D production scanners.

I am currently testing a trial version of Tiff Junction by Aquaforest. It's a little more expensive than I had hoped at £555 (I was looking for something in the £300-£400 region) but so far it looks to be quite stable. Unfortunately, dual/quad core does not appear to be supported; however, I am still averaging just over 2 seconds per page and it is possible to run multiple instances. It's not perfect, but it is much closer to what I am looking for.
 
Rabbit80 Author Commented:
I have also found Omnipage 16 Professional at just £249. So far it is looking good, averaging just under 1.5s per page on a dual-core AMD 5600+. Not too bad really!

It also supports multicore processors properly. The only thing I am unsure about is its stability. I will test for a day or two longer ...
 
Amila Hendahewa Commented:
In that case I cannot think of a solution within that budget. We have used Kofax on such large volumes, but the price is on the higher side.

You could also consider alternatives such as using several PCs.

Let me know the results. Good luck...

 
Rabbit80 Author Commented:
We also use Kofax for scanning; however, these documents are already scanned... In fact they have already been OCR'ed, but they are currently stored as separate text and TIFF files. As far as I can tell there is no way to recombine the existing files into searchable PDFs; however, I can get all the documents as non-searchable, image-only PDFs from the existing DMS.

The purpose is to transfer them into a new DMS. The reason for the tight budget is that the customer will not stand the costs if they run into the £1000s. Timescale is not too much of an issue, but man-hours are!
 
Amila Hendahewa Commented:
Have you found a solution for this?
 
Rabbit80 Author Commented:
The project is currently on hold until our customer gives us the go-ahead. We have run a number of trials with Omnipage which have been reasonably positive, although we may have to write a "crash monitor" to ensure that it continues to run and can be restarted automatically. I will add further comments once the project starts.
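
A "crash monitor" of this kind could look something like the sketch below. It is a generic illustration, not anything specific to Omnipage: it assumes the OCR job can be started from a command line (`cmd` is whatever that command turns out to be) and that a non-zero exit status means it crashed. The `supervise` name and its parameters are made up for the example.

```python
import subprocess
import time
from typing import List, Optional


def supervise(cmd: List[str], max_restarts: Optional[int] = None,
              delay_s: float = 5.0) -> int:
    """Run `cmd` and restart it each time it exits with a non-zero status.

    Returns the number of restarts that were needed before `cmd`
    finally exited cleanly (status 0).
    """
    restarts = 0
    while True:
        status = subprocess.call(cmd)
        if status == 0:
            return restarts
        if max_restarts is not None and restarts >= max_restarts:
            raise RuntimeError(
                f"{cmd[0]} still failing after {restarts} restarts "
                f"(last exit status {status})"
            )
        restarts += 1
        time.sleep(delay_s)  # brief pause before relaunching
```

Note that this only catches a process that exits. A program that hangs without exiting would need an extra check, for example a timeout on how long it has been since the last output file appeared.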
 
joe90kane Commented:
aquaforest.com offers excellent bulk PDF and TIFF OCR functionality.
 
Rabbit80 Author Commented:
Project still on hold; we expect it to go ahead later this year. We found a solution that used a Linux server and cuneiform / hocr2pdf with a bit of creative scripting.
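
The actual scripts are not posted here, so the following is only a guess at what the core per-page step of such a pipeline might look like. It assumes each PDF has already been split into one image per page (that step is not shown), and that the tools accept the options as I recall them: `cuneiform -f hocr -o <file> <image>` to write hOCR, and `hocr2pdf -i <image> -o <pdf>` reading the hOCR on standard input. Check both against the installed versions, including which image formats your cuneiform build can read. The helper names are invented for the example.

```python
import subprocess
from pathlib import Path
from typing import List


def cuneiform_cmd(image: Path, hocr: Path) -> List[str]:
    # Assumed invocation: recognise `image` and write hOCR to `hocr`.
    return ["cuneiform", "-f", "hocr", "-o", str(hocr), str(image)]


def hocr2pdf_cmd(image: Path, pdf: Path) -> List[str]:
    # Assumed invocation: hocr2pdf reads the hOCR text from stdin.
    return ["hocr2pdf", "-i", str(image), "-o", str(pdf)]


def page_to_searchable_pdf(image: Path, pdf: Path) -> None:
    """Turn one page image into a one-page searchable PDF."""
    hocr = pdf.with_suffix(".hocr")
    subprocess.run(cuneiform_cmd(image, hocr), check=True)
    with hocr.open("rb") as fh:
        subprocess.run(hocr2pdf_cmd(image, pdf), stdin=fh, check=True)
    hocr.unlink()  # intermediate file is no longer needed
```

Because `check=True` raises on a non-zero exit status, a failed page surfaces as an exception that the surrounding batch script can log and skip.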