steveurich asked:

I am looking for help in doing bulk OCR on a Mac.

I am looking for help in doing bulk OCR.

I have the following software: ABBYY FineReader for ScanSnap, DEVONthink 3, PDFelement, ABBYY FineReader, Evernote, EagleFiler, OCR Wizard, PDF Expert, Keyboard Maestro, Hazel, and MacSparky's excellent book on going paperless.

Part of me thinks that DEVONthink 3 may be the answer, but I have avoided it, as it seems like it will take quite an investment of time and effort to get it working well.

In the past I've tried using the Mac Power Users script along with Hazel to call PDFpen Pro and have it OCR about 500 files.

I've been trying off and on to get this to work for the last six years. So I've been through multiple changes of the operating system as well as new versions of PDFpen Pro. It will do several files and then just hang.

I don't think it's the PDF files themselves, as I have an old version of Adobe Acrobat Pro running on a Windows machine and it's able to OCR the files just fine.

I was really excited when the latest version of PDFpen Pro included "OCR Files" right inside the application. Unfortunately it exhibits the same behavior: it will OCR 10 to 20 files and then just hang.

Computers are complicated, so I'm sure there could be something on my system that is causing these problems. However, I don't have any other applications that behave like this, i.e. work for a while and then hang.

I'm not sure if it makes any difference but my files are stored in iCloud.

I've asked Smile Software whether they could provide an updated script to replace the one that I got from Mac Power Users. I wonder whether adding a few more, or longer, delays so that the script does not overwhelm the application might make it more reliable.

I have sent Smile Software log files, but even after all this time we've not been able to get it working.

Here is the script that I am using.

tell application "PDFpenPro"
      open theFile as alias
      -- does the document need to be OCR'd?
      get the needs ocr of document 1
      if result is true then
            tell document 1
                  ocr
                  repeat while performing ocr
                        delay 1
                  end repeat
                  delay 1
                  close with saving
            end tell
            --In PDFpen, when no documents are open, window 1 is "Preferences"
            --If other documents are open, do not close the App.
            if name of window 1 is "Preferences" then
                  tell application "PDFpenPro"
                        quit
                  end tell
            end if
      else
            -- Scan Doc was previously OCR'd or is already a text type PDF.
            tell document 1
                  close without saving
            end tell
            --In PDFpen, when no documents are open, window 1 is "Preferences"
            --If other documents are open, do not close the App.
            if name of window 1 is "Preferences" then
                  tell application "PDFpenPro"
                        quit
                  end tell
            end if
      end if
end tell

I hate having to go back to Windows to be able to OCR the files, and when those files are on iCloud it takes a while for the changes to be reflected on my Mac.

Is anyone else having the same problem with PDFpen Pro hanging?

Do you see any additions to the above script that might help?

Can you recommend any additional software that might do a better job at bulk OCR?

I purchased ABBYY FineReader Pro, which offers quite a few conversion utilities, but I haven't figured out how to have it do bulk OCR yet.

As you can imagine I'm beyond frustrated as I have spent a lot of money on software to try and solve this as well as a lot of time scanning my documents in only to find that it's difficult to locate what I need.

Scott Fell:

It has been a long time since I've used a Mac. In a case like this, what I did was create a script in Automator where you could drop one or more files into a folder and then, in bulk or one by one, have a program such as Adobe Acrobat Pro OCR each file and move it to a new folder.

The issue you have is that your files are not local; they're in iCloud. You have to either sync the iCloud files locally to do that, or put a service in the cloud or on a web server to run this function for you.

Many PDF files already contain a text component, so if you haven't yet, check for one.

Most PDF files have a text component these days.

Open one of your PDF files + try searching for a text string. If searching works, you already have an... OCR'ed text component, so all you have to do is extract the text component.

To do this..

Install MacPorts + Poppler Tools, then issue this command...

pdftotext -enc ASCII7 -nopgbrk -layout "$file.pdf" "$file.pdf.txt"


Note: Be sure to eyeball the output for all scanned files, checking things like column alignment, as text components are rarely 100% correct across hundreds or thousands of files.
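
The "does it already have a text component?" check can also be scripted rather than done by hand in a viewer. The following is only a minimal sketch, assuming Poppler's pdftotext is installed and on the PATH. The function name has_text_layer and the 20-character threshold are illustrative choices, not part of Poppler.

```shell
#!/usr/bin/env bash
# Sketch: report which PDFs already carry a text layer.
# Assumes Poppler's pdftotext is installed. has_text_layer and the
# min_chars threshold are illustrative names, not part of any tool.

has_text_layer() {
    # $1 = path to a PDF, $2 = minimum number of non-whitespace characters
    local pdf=$1 min_chars=${2:-20} count
    # "-" as the output file sends the extracted text to stdout.
    count=$(pdftotext -q "$pdf" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
    [ "$count" -ge "$min_chars" ]
}

# Print one line per PDF given on the command line.
for pdf in "$@"; do
    if has_text_layer "$pdf"; then
        echo "searchable: $pdf"
    else
        echo "needs OCR:  $pdf"
    fi
done
```

Saved under a name of your choosing and run with a list of PDFs as arguments, it prints one line per file. A scan with only a few stray characters of text is treated as still needing OCR.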

steveurich (ASKER):

I'm not so much interested in getting the text out of the OCR; it's more for searching and finding the file after I scan and OCR it. I have someone come in and do the scanning, and if she has to wait for each scan to be OCR'd, I end up paying her to wait around for it to finish. I want to scan first and then OCR overnight.

I just found a way for ScanSnap Home to OCR files that were already scanned, but that does not help with print-to-PDF files or other information captured to PDF that was first paper.

It does not matter why you want to use OCR; it goes through the same process whether you use it for searching content or for copying the text directly.

For most small documents, you will typically not be able to tell the difference in the time it takes to generate the PDF, searchable or not, until the page count increases. A law office, for instance, can easily have documents in the 100 to 500+ page range.

In any case, if speed is what you are after, outputting the PDF to a folder for later processing is the way to go. It does not matter whether it is Mac or PC; it works in a similar manner. I did this very thing for a law office. I have one version that works from a server and another that uses the cloud. Instead of iCloud it is either Dropbox or Google Drive.

What threw me off is that you said you are using iCloud. But if your workflow is to hire somebody to come in and scan, then the fastest option is to scan to Folder A and set up automation on Folder A that moves each file to Folder B when it is complete. This lets your scan person go about their business without waiting for OCR, while your script picks up each file in Folder A, works at its own pace, and moves the result to Folder B. There are some easy options for moving files to special folders.
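
One detail the two-folder setup depends on is not picking up a file while the scanner is still writing it. The sketch below shows one possible way to check that, by comparing the file size before and after a short pause. The folder paths and the function names (file_size, is_stable, hand_off) are hypothetical, and a size check is only a heuristic.

```shell
#!/usr/bin/env bash
# Sketch: hand off finished scans from an inbox folder to a processing folder.
# Folder names and function names are illustrative assumptions.

file_size() {
    # Portable byte count (avoids the differing stat flags on macOS and Linux).
    wc -c < "$1" | tr -d ' '
}

is_stable() {
    # $1 = file, $2 = seconds to wait between the two size checks
    local before after
    before=$(file_size "$1")
    sleep "${2:-5}"
    after=$(file_size "$1")
    [ "$before" = "$after" ]
}

hand_off() {
    # $1 = inbox folder, $2 = processing folder, $3 = pause in seconds
    local inbox=$1 outbox=$2 pause=${3:-5} pdf
    mkdir -p "$outbox"
    for pdf in "$inbox"/*.pdf; do
        [ -e "$pdf" ] || continue      # the glob matched nothing
        if is_stable "$pdf" "$pause"; then
            mv -n "$pdf" "$outbox"/    # -n: never overwrite an existing file
        fi
    done
}

# Example: hand_off "$HOME/Scans/Inbox" "$HOME/Scans/ToOCR" 5
```

A file that is still growing is simply left in place and picked up on the next run.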

One of the things my program also does is add a sequential Bates stamp, which is a long process if you have ever done it using Adobe. Automation is the way to go.

Creating a script using the Mac's Automator makes this very easy. You can record your actions, such as File -> Save As, etc. Once you have that recorded, you can add automation that listens for a new file arriving in a folder and then moves it when done.

I hope this makes better sense for you.
The big issue that I mentioned is that I need to do bulk OCR.

While the file is technically in iCloud, it also exists physically on my local hard drive, so I’m not sure that copying it to another folder provides any real benefit. I will test that out.

You mention “my program”. Is it an AppleScript that calls an OCR engine of some sort? I didn’t see any sort of attachment.

Do you see any problems with the AppleScript that I posted in my original note?

PDFpen Pro hangs whether I’m doing it via a script from Hazel or using its built-in utility.

I'm not sure how to call PDF Expert or ABBYY FineReader from a script, so that I could try using Hazel to call those other apps to do the OCR.

This shouldn’t be this hard.
ABBYY FineReader has a good OCR function.

As others pointed out, you can scan in, but make sure to use a scanner resolution of at least 400 dpi. The higher the resolution, the larger the file, but at the same time the better the OCR outcome of questionable...

What output format do you want for the result?

"The big issue that I mentioned is that I need to do bulk OCR."
Everything I have mentioned is for bulk OCR.

As far as your pseudo code goes, it looks good. For something like this, I personally like to use two folders. The first is where the person scanning or printing sends the files. Your script uses this folder to grab the next file and process it if it needs OCR. When that completes, save the file to a final destination folder and delete the original file. On saving the file, run a test for a duplicate. The duplicate function can run through a loop and keep adding a number to the name, as in filename (2).pdf, until it is unique.
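
The duplicate-name loop described above could look something like the following. This is a minimal sketch rather than the program mentioned in the post: unique_dest and file_and_remove are hypothetical names, and it assumes file names with an extension such as .pdf.

```shell
#!/usr/bin/env bash
# Sketch: pick a destination name that does not clash with an existing file.
# unique_dest and file_and_remove are illustrative names. Assumes the file
# name contains an extension (for example "scan.pdf").

unique_dest() {
    # $1 = destination folder, $2 = desired file name
    local dir=$1 name=$2 stem ext candidate n=2
    stem=${name%.*}     # "scan.pdf" -> "scan"
    ext=${name##*.}     # "scan.pdf" -> "pdf"
    candidate="$dir/$name"
    # Keep trying "scan (2).pdf", "scan (3).pdf", ... until one is free.
    while [ -e "$candidate" ]; do
        candidate="$dir/$stem ($n).$ext"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

file_and_remove() {
    # $1 = processed file, $2 = final destination folder
    # mv both saves the file under its new name and removes the original.
    local dest
    dest=$(unique_dest "$2" "$(basename "$1")")
    mv "$1" "$dest"
}
```

Because the name check and the move are separate steps, this is only safe when a single script instance is filing documents at a time.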

Make sure your code does not loop through all the files and start processing each one without first confirming that the previous file has completed; otherwise you will end up processing multiple files at the same time, which can hang your system.
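
As an illustration of the one-file-at-a-time rule, here is a small sketch of a loop that runs a given command on each PDF in turn and refuses to start if an earlier run is still active. The name process_serially and the lock directory name are assumptions made for the example, and the OCR command itself is whatever you pass in.

```shell
#!/usr/bin/env bash
# Sketch: process PDFs strictly one after another, one run at a time.
# process_serially and the ".ocr.lock" name are illustrative assumptions.

process_serially() {
    # $1 = folder to process; remaining arguments = command run once per PDF
    local dir=$1 lock pdf
    shift
    lock="$dir/.ocr.lock"
    # mkdir is atomic: if the lock directory exists, another run is active.
    if ! mkdir "$lock" 2>/dev/null; then
        echo "another run is still active; skipping" >&2
        return 1
    fi
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue
        # The command runs in the foreground, so the loop cannot move on
        # to the next file until this one has finished (or failed).
        "$@" "$pdf" || echo "failed: $pdf" >&2
    done
    rmdir "$lock"
}

# Example: process_serially "$HOME/Scans/ToOCR" my_ocr_command
```

If the script is killed partway through, the lock directory stays behind and has to be removed by hand before the next run.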

Another thing to keep in mind is that the ABBYY FineReader that comes with the ScanSnap is a desktop program. It will probably work, but I believe it will take up more resources than server software, the ABBYY SDK, or what David is describing, where you use command-line functions. This is the route I use. I have documents coming from multiple Xerox multifunction devices, multiple personal ScanSnaps, and printing/saving, but I am offloading to either a local server or a web server, depending on whether it is for in-house use or something saved to Dropbox/Google Drive on the web.

Your comment above, "No so much interested in getting the text out of an OCR", is confusing.

The point of OCR is to convert an image into human text.

Maybe clarify what you're asking and what you're trying to accomplish when you say OCR'ing many files.
Also, as Scott mentioned above. If you're trying to OCR large files, this might take a very long time for each file.

If the files already contain an OCR'ed text component, extracting the existing text component takes a few seconds, where a large multi-hundred page document can take... wow... a very long time to OCR...
I purchased ABBYY FineReader Pro, which offers quite a few conversion utilities, but I haven't figured out how to have it do bulk OCR yet.

How would I modify the script to call ABBYY FineReader Pro instead? PDFpen always hangs.

Here is the AppleScript that I am using (the same script as in my original post).

Look to see whether there is a command-line or API-like interface to their ..

Why not use ABBYY's own options and approach it differently?
https://www.abbyy.com/en-us/news/abbyy-finereader-pro-for-mac-now-supports-mac-os-x-native-automation-tools/#sthash.leSyVlo9.dpbs

Any of the software you have, such as Adobe or ABBYY, can be run using Automator. I don't have a Mac anymore to test with, but you can create a macro that records your mouse clicks (File, Save, and so on) and run it in Automator, where you can also create a watch folder. Drop or copy files into that folder for processing.

Both arnold + Scott are correct: with any GUI-type software you'll use Automator.

I use pdftotext (Poppler) + tesseract because they have no GUI + are by design built to handle bulk OCR operations on the command line, so they can be scripted easily doing all sorts of bulk OCR operations.
@David Favor

Can you provide some more details on how to set that up?

My days of programming and setting up Automator scripts are long gone. I am reduced to running Hazel to call programs.

I would like the PDF to have the text contained in the OCR layer.
ASKER CERTIFIED SOLUTION
steveurich

This solution is only available to members of Experts Exchange.

You asked, "Can you provide some more details on how to set that up."

Use either MacPorts or Homebrew to set up all the tools I mentioned.

Hazel... doesn't really fit well here...

You'll use Perl or Bash or some other scripting language.

Even if you use Hazel, you'll still require some simple scripts to automate the process I mentioned above.
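
To give a rough idea of what such a simple script might look like, here is a sketch that turns a scanned PDF into a searchable one using only the tools named in this thread. It assumes Poppler (for pdftoppm) and tesseract are installed through MacPorts or Homebrew. The function name ocr_pdf and the 300 dpi setting are illustrative assumptions, and it has not been tuned for large documents.

```shell
#!/usr/bin/env bash
# Sketch: scanned PDF in, searchable PDF out, using Poppler and tesseract.
# ocr_pdf and the 300 dpi value are illustrative assumptions.

ocr_pdf() {
    # $1 = scanned input PDF, $2 = output path ending in ".pdf"
    local src=$1 dst=$2 work status=0
    work=$(mktemp -d) || return 1
    # 1. Rasterize every page to PNG at 300 dpi (Poppler's pdftoppm).
    #    Page numbers are zero-padded, so the glob below sorts correctly.
    if pdftoppm -r 300 -png "$src" "$work/page" &&
       ls "$work"/page*.png >/dev/null 2>&1; then
        # 2. List the page images, one per line, as input for tesseract.
        printf '%s\n' "$work"/page*.png > "$work/pages.txt"
        # 3. The "pdf" config makes tesseract write <outputbase>.pdf with the
        #    page images plus an invisible, searchable text layer.
        tesseract "$work/pages.txt" "${dst%.pdf}" pdf >/dev/null 2>&1 || status=1
    else
        status=1
    fi
    rm -rf "$work"
    return "$status"
}

# Example: ocr_pdf "scan001.pdf" "scan001-searchable.pdf"
```

The output keeps the page images and adds the recognized text as a hidden layer, which is what makes the file findable by search. If the function is called from a Hazel rule's embedded shell script, the matched file should arrive as $1, but check Hazel's documentation for the details of how it passes files.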