Need a way to batch-remove (or replace) bad OCR in multiple PDFs - can use Acrobat or some other solution

I have a very large set of assorted PDF files. They contain searchable text, but the text is filled with errors. If I save a page from the PDF as an image, and use my OCR software on it, I get a much better result. So, I would like to re-OCR all of the files — but first I need to "flatten" all of the text objects in the PDFs, since none of my OCR tools will overwrite any existing text.

I do have Acrobat X and XI Pro, and I've tried using a batch action to strip the text and rerun OCR, but anytime the program encounters an error, it interrupts the process with a dialog box. I searched for a way to prevent this, but there does not appear to be one.

So, the way I see it, I need one of three things:

1. A way to force Acrobat to skip over errors in batch actions and process the remaining files. I could swear you used to be able to do this.

2. A batch OCR tool, free or paid, that will remove and replace all existing text objects.

3. A tool to batch-flatten all text objects in a large set of PDFs (so I can then run them through OCR). I've found software that looks related, but everything seems to either delete text, which I don't want; or else it flattens form fields, images, etc. but does not mention text.

Any of the three of these would solve my problem. I'm open to other suggestions too, of course -- any advice is greatly appreciated!
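
To make option 1 concrete: the behavior being asked for is a batch loop that records a failure and moves on instead of stopping. Below is a minimal Python sketch of that pattern. It does not drive Acrobat; `run_batch` and `process_one` are hypothetical names, and `process_one` stands in for whatever per-file step (strip text, re-OCR) a given tool exposes.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def run_batch(
    paths: Iterable[Path],
    process_one: Callable[[Path], None],
) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Run process_one on every path, collecting failures instead of stopping.

    process_one is a placeholder for the real per-file work; it is assumed
    to raise an exception when a file cannot be processed.
    """
    succeeded: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    for path in paths:
        try:
            process_one(path)
        except Exception as exc:
            # Record the problem file and keep going with the rest.
            failed.append((path, f"{type(exc).__name__}: {exc}"))
        else:
            succeeded.append(path)
    return succeeded, failed
```

A call such as `run_batch(sorted(Path("pdfs").glob("*.pdf")), my_step)` (where `pdfs` and `my_step` are again placeholders) would return the list of files that went through and a list of (file, error) pairs to review afterwards.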
John Panucci Asked:

Joe Winograd, Fellow & MVE, Developer, Commented:
Hi John,
> use my OCR software on it
What OCR software? Regards, Joe
Joe Winograd, Fellow & MVE, Developer, Commented:
Hi John,
I'm leaving my office now for a few hours. Will check back into the thread as soon as I return. In the meantime, another question for you:

> the text is filled with errors
> my OCR software ... much better result
> would like to re-OCR all of the files

All of that makes good sense, but what I don't get is this comment:

> everything seems to either delete text, which I don't want

If the text is filled with errors, why don't you want it deleted? Seems to me that deleting it and leaving you with an image-only PDF will allow you to use your OCR software, which you say gives a good result. If I'm understanding this right, the only reason that your OCR software won't work on the docs is because they already have text and your OCR software won't overwrite any existing text. So why not delete the (bad!) text? Regards, Joe
John Panucci, Author, Commented:
Joe,

Thanks for the replies! I'm using Acrobat for OCR. And to clarify, you're right; your definition of "delete" IS what I want to do to the old text.

Terminology is one of the main problems I had in searching for a solution: "flattening" a PDF seemed to mostly refer to getting rid of form fields and other interactive elements, while preserving searchable text. Meanwhile, "deleting" text, in the context of batch operations, seemed to refer to getting rid of the text object AND its visual representation, leaving just white space and images.

Please let me know if that makes sense. Correctly wording what I'm trying to do was my biggest problem with both attempting to search and writing this question. :)
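
To illustrate the distinction in PDF terms: in a typical scanned-and-OCRed page, the visible page is an image painted with the `Do` operator, while the searchable layer is a separate text object between `BT` and `ET`, often drawn with text rendering mode 3 (`3 Tr`, invisible). Removing only the text objects leaves the image in place. The sketch below shows this on a made-up, already-decoded content stream. `strip_text_objects` and `SAMPLE` are hypothetical, and the regex is a toy: real streams are usually compressed and a string literal could contain "ET", so a real tool should use a proper PDF library.

```python
import re

# Matches one text object: everything from a BT operator to the next ET.
# Toy pattern only; it does not parse PDF string literals or inline images.
TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bET\b", re.DOTALL)


def strip_text_objects(content_stream: bytes) -> bytes:
    """Remove BT...ET text objects from a decoded page content stream."""
    return TEXT_OBJECT.sub(b"", content_stream)


# Invented sample: a full-page image followed by an invisible OCR text layer.
SAMPLE = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"          # paint the scanned image
    b"BT 3 Tr /F1 12 Tf 72 700 Td (Helo wrld) Tj ET\n"  # hidden (bad) OCR text
)

cleaned = strip_text_objects(SAMPLE)
```

After the call, `cleaned` still contains the image-painting line but no text-showing operators, which is the "image-only PDF" state that OCR tools will accept.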
Joe Winograd, Fellow & MVE, Developer, Commented:
John,
I'm surprised to hear you say that Acrobat (X or XI) is giving you good OCR results, although you do say that it is in comparison to whatever product did the prior OCR, and I'm sure there's OCR out there that's worse than Acrobat. But my experience is that Acrobat OCR comes up short when compared to top-quality OCR packages, such as ABBYY FineReader and Nuance OmniPage.

My advice for your project is your approach #2 ("A batch OCR tool, free or paid, that will remove and replace all existing text objects."), and the product that I recommend is Nuance Power PDF Advanced, due to its Batch Converter feature and because it uses the OmniPage engine under the covers (as does Nuance's PaperPort 14.5).

For your situation, the trick with the Batch Converter feature is to make sure you tell it to OCR all pages, not just the image-only pages. This is not the default, so you need to set it via:

File > Options > Document > Searchable PDF Document

You'll get this:

[Screenshot: PowPDF convert to searchable PDF]

In the "Process pages" section, tick both "All pages" and "Process documents using OCR". By doing that, it will run OCR even on docs that already have text.

The Batch Converter GUI looks like this:

[Screenshot: PowPDFadv Batch Converter]

The dialog is straightforward. Of course, select PDF for the Source File Type and Searchable PDF for the Destination File Type.

Sidebar: Power PDF Advanced has a command line interface (CLI) for its Batch Converter feature that I discuss in this EE article:
Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF with Power PDF Advanced

However, it has a bug such that it will not OCR a PDF that already has text. I've submitted a bug report to Nuance on this. The good news is that the GUI works fine...does not have this bug.

There are some other great features in Power PDF Advanced. Although unrelated to this question, you may find these five-minute EE video Micro Tutorials interesting:
Bates Stamping/Numbering of PDF Files with Power PDF Advanced
Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced
End Sidebar

You may get a free trial of Power PDF Advanced here:
https://www.nuance.com/print-capture-and-pdf-solutions/pdf-and-document-conversion/power-pdf-converter/free-trial.html

If you'd like me to convert one of your "bad OCR" PDFs to a searchable PDF with Power PDF Advanced, post it here or send it via PM if you don't want to expose it to the web. Of course, in either case, make sure that it doesn't have any private/sensitive info in it. Regards, Joe
John Panucci, Author, Commented:
This response is head and shoulders above what I could've asked for in the level of detail and clarity, not to mention the links to supporting resources and documentation. I'll investigate the tools you mention; based on your description I'm quite sure they'll handle what I need. I will also look into alternative OCR solutions; I'm using Acrobat primarily because A) I own it and B) it handles the Unicode language combos I need, but I'm sure you're right that I could get an improvement with a more specialized toolset. Thank you very much for your time and effort!
Joe Winograd, Fellow & MVE, Developer, Commented:
You're welcome, John, and thanks to you for the kind words...I appreciate hearing them! I can certainly understand using Acrobat's OCR when you already own Acrobat, and for many folks it's adequate. But when highly accurate OCR is required, there are better choices, imo. Best of luck on your project! Regards, Joe