Solved

Paper Capture (OCR) "renderable text" error - PDF not searchable

Posted on 2004-09-24
Last Modified: 2013-12-03
Using Adobe Acrobat Pro 6, I am simply trying to produce a searchable pdf file.

The file is 70 pages - mostly text with some graphics (it's a directory with a few advertisements).

"Document Properties" show 31 fonts (all but 3 say (embedded subset) next to them)

The PDF file was given to me, so I don't have any "originals".

I can use the "TouchUp Text Tool" to select text and change it (although when I "copy" selected text and then try to "paste", only a large blank space is pasted).

When I try to use "Paper Capture", I get the following error on each page:
--------------
     Acrobat could not run Paper Capture on this page because of the following error:
     This page contains renderable text.
-------------
It would seem that the renderable text should be searchable. However, the document appears to be unsearchable, because when I pick out a word from the text and perform a search, it reports back with 0 found.

I read somewhere that if you scan in text (which would be non-searchable) and then add text to it (that would be renderable and searchable), the added text will produce the above error. The suggestion was to "Paper Capture" the imaged text before adding any text, but since I don't have the original text, I can't do that (if that is even the problem - I don't know). Is there a way to identify which text is an image and which text is renderable?

I also saw something that said to convert the entire document to an image and then run Paper Capture again, but I am unsure how to do that or if that will introduce extra OCR mistakes or if I will lose all my links, bookmarks, etc. in my document.

I just want to make the text searchable from Adobe Reader...

Any help would be appreciated.

Thanks,
Dan
Question by:HighTechGeek
 
Expert Comment

by:Karl Heinz Kremer
If you convert the document to raster, you will lose all your links and bookmarks. There is, however, a trick: at the end you will have two documents, the original and the "image only" PDF. Open the original document, select "Replace pages", and then replace all pages with the pages from the image-only document. This will keep the bookmarks (and I think links as well), but will replace all renderable text with the images.

To create the images you can do a "Save As", and then select e.g. TIFF as the output format. Make sure that you select an empty directory (otherwise you will have your 70 TIFF images mixed in with other documents). Then select to create a PDF from multiple files, and select all the TIFF images. Make sure that you bring them into the correct page order.
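
The ordering step is easy to get wrong when the exported files are numbered without leading zeros, because a plain alphabetical sort puts page 10 before page 2. Here is a minimal Python sketch of a number-aware sort. The file names are hypothetical, since the naming pattern Acrobat uses for the exported TIFFs is not shown in this thread.

```python
import re


def natural_key(name):
    """Split a file name into text and number parts so that 2 sorts before 10."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so items at the same position always have the same type.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def in_page_order(file_names):
    """Return the file names sorted in page order."""
    return sorted(file_names, key=natural_key)


# Hypothetical export names, for illustration only.
exported = ["directory_Page_10.tif", "directory_Page_2.tif", "directory_Page_1.tif"]
print(in_page_order(exported))
```

The sorted list can then be used as a checklist when adding the images in the "create PDF from multiple files" dialog.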

Another solution is to not use Paper Capture, but e.g. Abbyy's FineReader, or ScanSoft's OmniPage. If I remember correctly, both will work with mixed (images and text) documents.

However, I don't think that you have images that you can OCR in your document. I suspect that the document does not contain extractable text: The fact that you cannot copy and paste information supports this. It is possible that the document was generated with a "bad" PDF generator that did not add any useful information to map glyphs back to characters.

When you go to "Document Properties" in Acrobat, what application is listed as the creator or producer?

Author Comment

by:HighTechGeek
Thanks khkremer. I will do what you suggest and post back results in a day or 3

Application: CorelDRAW version 12.0
PDF Producer: Corel PDF Engine Version 1.0.0.458
PDF Version: 1.3 (Acrobat 4.x)

It just seems odd to me that I can select text in the "image" with the "TouchUp Text Tool" and even look at the properties of that selected text:
Font: TimesNewRomanPSMT
Permissions: can embed font for print and preview only
Font Size: 12 pt
etc.
yet not copy it or search on it...

Expert Comment

by:Karl Heinz Kremer
There are no characters in a PDF file - all you have are glyphs, the representation of a certain character in a certain font. From the glyph alone, it's impossible to map back to a character. This means that the application that generates the PDF document needs to store this "mapping information", so that you can find out what characters are represented. If this information is missing, you can still render the text, and Acrobat still knows that you have e.g. 5 or 10 characters that form a word, and you can select this text, but when you copy and paste (or extract the text for search purposes), you get either garbage, or like in your case, nothing. All you can do is complain to Corel...
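
As a rough way to check for that mapping information yourself, the sketch below counts font dictionaries, `/ToUnicode` entries, and `/Encoding` entries in the raw bytes of a PDF. It is only a heuristic and rests on some assumptions: it expects the font dictionaries to be stored uncompressed (usual for a PDF 1.3 file like this one, since compressed object streams only arrived with PDF 1.5), and a font without `/ToUnicode` can still be extractable if it uses a standard encoding. The function name and the file name in the usage note are made up for illustration.

```python
import re


def font_mapping_report(pdf_bytes):
    """Count font dictionaries and glyph-to-character mapping hints in raw PDF bytes."""
    return {
        # "/Type /Font" but not "/Type /FontDescriptor" (the \b stops at "Font").
        "font_dicts": len(re.findall(rb"/Type\s*/Font\b", pdf_bytes)),
        # ToUnicode CMaps map character codes back to Unicode text.
        "to_unicode": len(re.findall(rb"/ToUnicode\b", pdf_bytes)),
        # An /Encoding entry can also make text extractable without ToUnicode.
        "encodings": len(re.findall(rb"/Encoding\b", pdf_bytes)),
    }


# Tiny hand-made fragment standing in for a real file.
sample = (b"1 0 obj << /Type /Font /Subtype /TrueType "
          b"/BaseFont /ABCDEF+TimesNewRomanPSMT /FontDescriptor 2 0 R >> endobj "
          b"2 0 obj << /Type /FontDescriptor >> endobj")
print(font_mapping_report(sample))
```

For a real document you would read the bytes with something like `open("directory.pdf", "rb").read()`. Many fonts with neither a `/ToUnicode` nor an `/Encoding` entry would be consistent with the copy and search failures described here, though it would not prove the cause.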

Author Comment

by:HighTechGeek
So far, as a test, I saved the PDF as TIFF. It created 70 TIFF images. Then I created a PDF file from multiple files, selected the 70 TIFF files and had a new document (no bookmarks or links, as expected). I then performed a Paper Capture and ran into 2 problems:

FIRST ISSUE:

15 of the pages would not "capture" and gave the following error:
----------------
 Acrobat could not run Paper Capture on this page because of the following error:

 This page is of an unsupported resolution, so it cannot be captured. Supported resolutions are 200-600 dpi for b/w and 200-400 dpi for gray/color.
----------------

When I save the PDF file as TIFF, I cannot find any options for choosing resolution. Do I need to use a TIFF editor to re-save each of those 15 pages? That seems a bit excessive.
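
To find out which pages fall outside the range quoted in that error, one option is to read the resolution tags straight out of each TIFF file. The sketch below does this with the Python standard library only. It assumes a classic (non-BigTIFF) file, reads only the first image directory, and takes the supported ranges from the error message above. The function names are made up for illustration.

```python
import struct

# Ranges quoted in the Paper Capture error message.
SUPPORTED_DPI = {"bw": (200, 600), "gray": (200, 400), "color": (200, 400)}


def tiff_dpi(data):
    """Return (x_dpi, y_dpi) from the first IFD of a TIFF, or None if unknown."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    resolution = {}
    unit = 2  # TIFF default for ResolutionUnit: inches
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, _count = struct.unpack(order + "HHI", data[start:start + 8])
        value = data[start + 8:start + 12]
        if tag in (282, 283) and field_type == 5:  # X/YResolution, RATIONAL
            (offset,) = struct.unpack(order + "I", value)
            numerator, denominator = struct.unpack(order + "II", data[offset:offset + 8])
            if denominator:
                resolution[tag] = numerator / denominator
        elif tag == 296 and field_type == 3:  # ResolutionUnit, SHORT
            (unit,) = struct.unpack(order + "H", value[:2])
    if unit == 1 or 282 not in resolution or 283 not in resolution:
        return None  # no absolute unit, or resolution tags missing
    scale = 2.54 if unit == 3 else 1.0  # unit 3 means pixels per centimeter
    return resolution[282] * scale, resolution[283] * scale


def paper_capture_accepts(dpi, mode):
    """Check a resolution against the range quoted in the error message."""
    low, high = SUPPORTED_DPI[mode]
    return low <= dpi <= high
```

Running `tiff_dpi(open(path, "rb").read())` over the 70 exported files would show whether the 15 rejected pages really differ in resolution from the rest.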

SECOND ISSUE:

On the pages that did capture, there are some huge areas that Acrobat did not recognize as text. For example, on my table of contents page, it is a list (in Times New Roman) of 41 titles on the left side with page numbers on the right. There are dots between the text and numbers like this:

Introduction..........3
Patrons................4
Welcome..............6

Acrobat converted 5 (seemingly random) of the 41 lines of text to searchable text. The remaining lines apparently remain as "objects". All of the text is a standard font, so why is Acrobat struggling with this OCR? I can see having a few mistakes, but this is 88% failure (on this page). I must be doing something wrong...

I thought maybe the dots are throwing off the OCR, but there are many entries on this page that have multiple words (table of contents........12) and you would think that at least the words "table of" would OCR, but they have not.

In the original document, I could select all of the text with the Text Touch-up tool and I could search none of the text.
In the new test document, I can only select a small portion of text with the Text Touch-up tool, however, this portion of text is searchable.

I still need help...

Accepted Solution

by:Karl Heinz Kremer
Adobe removed the resolution selection for "Save As TIFF" between Acrobat 5 and 6. Regarding the quality of the OCR: this may be related to the resolution issue, but it could also just be a problem with the OCR engine. In my opinion, both Abbyy's FineReader and ScanSoft's OmniPage do a better job.

You could try a different method of creating the TIFF images that gives you more control over e.g. resolution: Ghostscript (http://www.ghostscript.com) can also do this.
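
A minimal sketch of the Ghostscript route, wrapped in Python so the command is easy to adjust. The flags and TIFF device names are standard Ghostscript options. The 300 dpi default, the output file pattern, and the function names are my own assumptions. 300 dpi is chosen only because it sits inside both ranges quoted in the Paper Capture error message. On Windows the executable is usually `gswin32c` or `gswin64c` rather than `gs`.

```python
import subprocess
from pathlib import Path

# Ghostscript TIFF output devices: G4 fax (black/white), 8-bit gray, 24-bit color.
DEVICES = {"bw": "tiffg4", "gray": "tiffgray", "color": "tiff24nc"}


def ghostscript_tiff_command(pdf_path, out_dir, dpi=300, mode="gray", gs="gs"):
    """Build a Ghostscript command that renders each PDF page to its own TIFF."""
    # %03d is expanded by Ghostscript to the page number (001, 002, ...).
    output_pattern = str(Path(out_dir) / "page-%03d.tif")
    return [
        gs, "-dNOPAUSE", "-dBATCH", "-dSAFER",
        f"-sDEVICE={DEVICES[mode]}",
        f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        str(pdf_path),
    ]


def render_pages(pdf_path, out_dir, **options):
    """Create the output directory and run Ghostscript (must be installed)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_tiff_command(pdf_path, out_dir, **options), check=True)


# Show the command without running it; file and directory names are placeholders.
print(" ".join(ghostscript_tiff_command("directory.pdf", "tiff_out", mode="bw")))
```

Zero-padded page numbers also keep the files in the right order when they are selected for "create PDF from multiple files".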

But I guess your best bet is to go back to whoever created the document and ask for a better PDF document. Your PDF file is of low quality. I suspect that if the original document creator had printed to the Distiller printer (or any other PDF creator based on a printer driver), you would not have these problems.

Author Comment

by:HighTechGeek
I hate it when software won't do what you want it to do... especially when it seems so simple :-)

In this case:
   Shame on Corel for a questionable PDF export engine.
   Shame on Adobe for not being able to OCR very well.

(like I could program something better, right?)

I got what I needed, but I had to go back to another PDF file that was "closer" to an original, so I couldn't do what I really wanted to do. However, thanks go to khkremer for giving me the best shot at getting there. I learned some things along the way too!