Bulk OCR on a small budget

I am looking for a way to bulk OCR around 50,000 PDFs to make them searchable. There are around 6 million pages in total.

What would be the best software and hardware for this at a reasonable cost? (Acrobat Capture is too expensive.)

The system needs to be robust and automatic, so that after a crash it restarts OCRing automatically (Acrobat Pro, for example, does not seem to do this). It also needs to move each processed document to a new folder.

I am happy to use either Windows-based or Linux-based software.

Amila Hendahewa (Lead Consultant) commented:
I recommend ABBYY FineReader (www.abbyy.com).
There are also other products, such as ABBYY Recognition Server, but they may be expensive for this particular project.

For scanners, the best option would be a low-end Fujitsu scanner such as the 5120 c2, depending on your document size.
Rabbit80 (Author) commented:
ABBYY FineReader is not at all suitable for processing 6 million pages in 50,000 documents. I need batch processing (which the server editions offer) that can restart after an application crash. I have used FineReader before but found it unstable with extremely large volumes. If the program stops overnight or over a weekend, for example, our time constraints may get the better of us.

I expect this project to take around 4 months of processing time (averaging around 1.5 to 2 seconds per page).
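As a rough check on that estimate, here is a minimal Python sketch of the arithmetic. It assumes the machine runs around the clock with no downtime. The function name `processing_days` and the `workers` parameter are made up for illustration, and `workers` assumes throughput scales perfectly with the number of parallel instances, which real OCR software may not achieve.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(pages: int, seconds_per_page: float, workers: int = 1) -> float:
    """Days of continuous processing, assuming perfect scaling across workers."""
    return pages * seconds_per_page / workers / SECONDS_PER_DAY


if __name__ == "__main__":
    pages = 6_000_000
    for seconds_per_page in (1.5, 2.0):
        days = processing_days(pages, seconds_per_page)
        print(f"{seconds_per_page} s/page: {days:.0f} days (~{days / 30:.1f} months)")
```

For 6 million pages this gives roughly 104 days at 1.5 seconds per page and roughly 139 days at 2 seconds per page, which is in line with the 4 month figure.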

As for scanners, this is irrelevant since I already have the PDFs; however, I use Bowe Bell & Howell 8080D and Fujitsu M4099D production scanners.

I am currently testing a trial version of Tiff Junction by Aquaforest. It's a little more expensive than I had hoped at £555 (I was looking for something in the £300-£400 region), but so far it looks quite stable. Unfortunately, dual/quad core does not appear to be supported; however, I am still averaging just over 2 seconds per page, and it is possible to run multiple instances. It's not perfect, but it is much closer to what I am looking for.
Rabbit80 (Author) commented:
I have also found Omnipage 16 Professional at just £249. So far it is looking good, averaging just under 1.5 seconds per page on a dual-core AMD 5600+. Not too bad really!

It also supports multicore processors properly. The only thing I am unsure about is its stability. I will test for a day or two longer ...
Amila Hendahewa (Lead Consultant) commented:
In that case I cannot think of a solution within that budget. We have used Kofax on such large volumes, but the price is on the higher side.

You could also consider alternatives such as spreading the work across several PCs.

Let me know the results. Good luck.

Rabbit80 (Author) commented:
We also use Kofax for scanning; however, these documents are already scanned. In fact they have already been OCR'ed, but they are currently stored as separate text and TIFF files. As far as I can tell there is no way to recombine the existing files into searchable PDFs; however, I can get all the documents as non-searchable, image-only PDFs from the existing DMS.

The purpose is to transfer the documents into a new DMS. The budget is tight because the customer will not stand the costs if they run into the thousands of pounds. Timescale is not too much of an issue, but man-hours are!
Amila Hendahewa (Lead Consultant) commented:
Have you found a solution for this?
Rabbit80 (Author) commented:
The project is currently on hold until our customer gives us the go-ahead. We have run a number of trials with Omnipage which have been reasonably positive, although we may have to write a "crash monitor" to ensure that it continues to run and can be restarted automatically. I will add further comments once the project starts.
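A "crash monitor" of that kind can be quite small. The Python sketch below is one possible shape, not what was actually written for this project. It assumes the OCR job can be started as a command-line process that exits with a non-zero status when it crashes. The names `run_with_restart`, `max_restarts` and `delay_seconds` are made up for illustration.

```python
import subprocess
import sys
import time


def run_with_restart(command, max_restarts=5, delay_seconds=10.0):
    """Run command and restart it whenever it exits with a non-zero status.

    Returns the number of restarts that were needed. Raises RuntimeError
    if the command is still failing after max_restarts restarts.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)  # blocks until the process exits
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"{command!r} failed {restarts + 1} times; giving up")
        restarts += 1
        print(f"exit code {exit_code}; restart {restarts}/{max_restarts}", file=sys.stderr)
        time.sleep(delay_seconds)  # brief pause so a broken job does not spin
```

Note that this only catches a process that exits. A GUI application that hangs without exiting would need something extra, such as a timeout or a check that output files are still being produced.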
aquaforest.com offers excellent bulk PDF & TIFF OCR functionality.
Rabbit80 (Author) commented:
Project still on hold - we expect it to go ahead later this year. We found a solution that used a Linux server and cuneiform / hocr2pdf with a bit of creative scripting.
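The actual scripts were not posted, so the following Python sketch only illustrates how such a pipeline might be put together. Several things here are assumptions: that the input has already been split into single-page `.tif` images, that `cuneiform -f hocr -o OUT IN` writes hOCR, and that `hocr2pdf -i IMAGE -o PDF` reads hOCR on standard input (check both against the installed versions). The function names and the inbox/outbox/done folder layout are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_with_cuneiform(image: Path, pdf: Path) -> None:
    """Assumed CLI usage: cuneiform writes hOCR, hocr2pdf reads it on stdin."""
    hocr = pdf.with_suffix(".hocr")
    subprocess.run(["cuneiform", "-f", "hocr", "-o", str(hocr), str(image)], check=True)
    with hocr.open("rb") as hocr_file:
        subprocess.run(["hocr2pdf", "-i", str(image), "-o", str(pdf)],
                       stdin=hocr_file, check=True)
    hocr.unlink()  # the intermediate hOCR file is no longer needed


def process_inbox(inbox: Path, outbox: Path, done: Path, ocr=ocr_with_cuneiform) -> int:
    """OCR every .tif in inbox, then move the source image to done.

    A file leaves inbox only after its PDF is complete, so re-running
    the function after a crash picks up where the previous run stopped.
    Returns the number of files processed in this run.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    done.mkdir(parents=True, exist_ok=True)
    processed = 0
    for image in sorted(inbox.glob("*.tif")):
        final_pdf = outbox / (image.stem + ".pdf")
        partial_pdf = outbox / (image.stem + ".partial.pdf")
        ocr(image, partial_pdf)         # write under a temporary name first
        partial_pdf.replace(final_pdf)  # rename only once the PDF is complete
        shutil.move(str(image), str(done / image.name))
        processed += 1
    return processed
```

Writing to a temporary name and renaming afterwards means a crash mid-page never leaves a half-written file under the final PDF name. Moving the source image last is what makes a restart safe: anything still in the inbox simply gets processed again.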
