Solved

Paper Capture (OCR) "renderable text" error - PDF not searchable

Posted on 2004-09-24
Last Modified: 2013-12-03
Using Adobe Acrobat Pro 6, I am simply trying to produce a searchable pdf file.

The file is 70 pages - mostly text with some graphics (it's a directory with a few advertisements).

"Document Properties" show 31 fonts (all but 3 say (embedded subset) next to them)

The PDF file was given to me, so I don't have any "originals".

I can use the "touch up text tool" to select text and change it (although when I "copy" selected text and then try to "paste", all I get is a large blank space).

When I try to use "Paper Capture", I get the following error on each page:
--------------
     Acrobat could not run Paper Capture on this page because of the following error:
     This page contains renderable text.
-------------
It would seem that the renderable text should be searchable. However, the document appears to be unsearchable, because when I pick out a word from the text and perform a search, it reports back with 0 found.

I read somewhere that if you scan in text (which would be non-searchable) and then add text to it (that would be renderable and searchable), the added text will produce the above error. The suggestion was to "Paper Capture" the imaged text before adding any text, but since I don't have the original text, I can't do that (if that is even the problem - I don't know). Is there a way to identify which text is an image and which text is renderable?

I also saw something that said to convert the entire document to an image and then run Paper Capture again, but I am unsure how to do that or if that will introduce extra OCR mistakes or if I will lose all my links, bookmarks, etc. in my document.

I just want to make the text searchable from Adobe Reader...

Any help would be appreciated.

Thanks,
Dan
Question by:HighTechGeek
6 Comments
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 12148430
If you convert the document to raster images, you will lose all your links and bookmarks. There is, however, a trick: at the end you will have two documents, the original and the "image only" PDF. Open the original document, select "Replace Pages", and replace all pages with the pages from the image-only document. This will keep the bookmarks (and I think the links as well), but will replace all renderable text with the images.

To create the images, do a "Save As" and select e.g. TIFF as the output format. Make sure that you select an empty directory (otherwise your 70 TIFF images will be mixed in with other documents). Then select to create a PDF from multiple files, and select all the TIFF images. Make sure that you put them in the correct order.

Another solution is to not use Paper Capture, but e.g. Abbyy's FineReader, or ScanSoft's OmniPage. If I remember correctly, both will work with mixed (images and text) documents.

However, I don't think that you have images that you can OCR in your document. I suspect that the document does not contain extractable text: The fact that you cannot copy and paste information supports this. It is possible that the document was generated with a "bad" PDF generator that did not add any useful information to map glyphs back to characters.

When you go to "Document Properties" in Acrobat, what application is listed as the creator or producer?
 
LVL 5

Author Comment

by:HighTechGeek
ID: 12148650
Thanks khkremer. I will do what you suggest and post back results in a day or 3

Application: CorelDRAW version 12.0
PDF Producer: Corel PDF Engine Version 1.0.0.458
PDF Version: 1.3 (Acrobat 4.x)

It just seems odd to me that I can select text in the "image" with the "Touchup Text Tool" and even look at the properties of that selected text:
Font: TimesNewRomanPSMT
Permissions: can embed font for print and preview only
Font Size: 12 pt
etc.
yet not copy it or search on it...
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 12148783
There are no characters in a PDF file - all you have are glyphs, the representation of a certain character in a certain font. From the glyph alone, it's impossible to map back to a character. This means that the application that generates the PDF document needs to store this "mapping information", so that you can find out what characters are represented. If this information is missing, you can still render the text, and Acrobat still knows that you have e.g. 5 or 10 characters that form a word, and you can select this text, but when you copy and paste (or extract the text for search purposes), you get either garbage, or like in your case, nothing. All you can do is complain to Corel...
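To make the glyph-versus-character point concrete, here is a minimal sketch in Python. It is not how Acrobat works internally; the names `glyph_codes` and `to_unicode` and the code numbers are made up for illustration. In a real PDF the mapping is typically carried by the font's encoding or a ToUnicode CMap, and the sketch only shows what happens to text extraction when that table is present, incomplete, or missing.

```python
# Hypothetical illustration: a page stores glyph codes, not characters.
# All names and code numbers here are invented for this sketch.

def extract_text(glyph_codes, to_unicode=None):
    """Map glyph codes back to characters using an optional mapping table."""
    if to_unicode is None:
        # No mapping stored in the file: the glyphs can still be drawn
        # and selected, but there is nothing to copy or search.
        return ""
    # Codes missing from the table become the Unicode replacement character.
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)


# An embedded subset font might number its glyphs in order of first use.
glyph_codes = [1, 2, 3, 3, 4]               # five glyphs that draw "Hello"
good_mapping = {1: "H", 2: "e", 3: "l", 4: "o"}

print(repr(extract_text(glyph_codes, good_mapping)))  # 'Hello'
print(repr(extract_text(glyph_codes, {1: "H"})))      # mostly garbage
print(repr(extract_text(glyph_codes)))                # '' (nothing to search)
```

The last call corresponds to the situation described in this thread: the text renders and can be selected, but copying or searching returns nothing.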
 
LVL 5

Author Comment

by:HighTechGeek
ID: 12163010
So far, as a test, I saved the PDF as TIFF. It created 70 TIFF images. Then I created a PDF file from multiple files, selected the 70 TIFF files and had a new document (no bookmarks or links, as expected). I then performed a Paper Capture and ran into 2 problems:

FIRST ISSUE:

15 of the pages would not "capture" and gave the following error:
----------------
 Acrobat could not run Paper Capture on this page because of the following error:

 This page is of an unsupported resolution, so it cannot be captured. Supported resolutions are 200-600 dpi for b/w and 200-400 dpi for gray/color.
----------------

When I save the PDF file as TIFF, I cannot find any options for choosing resolution. Do I need to use a TIFF editor to re-save each of those 15 pages? That seems a bit excessive.
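A minimal sketch of the check implied by that error message, in Python. The supported ranges are taken from the message itself; the page width and pixel counts in the example are hypothetical, and the thread does not establish why these particular 15 pages were rejected.

```python
# Supported ranges as quoted in the Paper Capture error message.
SUPPORTED_DPI = {"bw": (200, 600), "gray": (200, 400), "color": (200, 400)}


def effective_dpi(pixels, inches):
    """Resolution of an image that spans `inches` using `pixels` pixels."""
    return pixels / inches


def is_capturable(dpi, mode):
    """True if `dpi` falls inside the supported range for the image mode."""
    low, high = SUPPORTED_DPI[mode]
    return low <= dpi <= high


# Hypothetical example: an 8.5 inch wide page exported at two pixel widths.
for width_px in (612, 2550):
    dpi = effective_dpi(width_px, 8.5)
    print(width_px, dpi, is_capturable(dpi, "color"))
```

Dividing an exported TIFF's pixel width by the page width in inches gives its resolution, which can then be compared against the range for its color mode.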

SECOND ISSUE:

On the pages that did capture, there are some huge areas that Acrobat did not recognize as text. For example, on my table of contents page, it is a list (in Times New Roman) of 41 titles on the left side with page numbers on the right. There are dots between the text and numbers like this:

Introduction..........3
Patrons................4
Welcome..............6

Acrobat converted 5 (seemingly random) of the 41 lines of text to searchable text. The remaining lines apparently remain as "objects". All of the text is a standard font, so why is Acrobat struggling with this OCR? I can see having a few mistakes, but this is 88% failure (on this page). I must be doing something wrong...

I thought maybe the dots are throwing off the OCR, but there are many entries on this page that have multiple words (table of contents........12) and you would think that at least the words "table of" would OCR, but they have not.

In the original document, I could select all of the text with the Text Touch-up tool and I could search none of the text.
In the new test document, I can only select a small portion of text with the Text Touch-up tool, however, this portion of text is searchable.

I still need help...
 
LVL 44

Accepted Solution

by:
Karl Heinz Kremer earned 50 total points
ID: 12163134
Adobe removed the resolution selection for "Save As TIFF" between Acrobat 5 and 6. Regarding the quality of the OCR: this may be related to the resolution issue, but it could also just be a problem with the OCR engine. In my opinion, both Abbyy's FineReader and ScanSoft's OmniPage do a better job.

You could try a different method of creating the TIFF images that gives you more control over e.g. resolution: Ghostscript (http://www.ghostscript.com) can also do this.
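As a rough sketch of the Ghostscript route, the Python helper below only builds the command line; it does not run anything. The helper name, the file name `directory.pdf`, the output pattern, and the choice of 300 dpi with the `tiff24nc` (24-bit color) device are assumptions for illustration. The executable is usually `gs` on Unix-like systems and `gswin32c` or `gswin64c` on Windows.

```python
def ghostscript_tiff_command(pdf_path, out_pattern="page-%03d.tif",
                             dpi=300, device="tiff24nc", gs_executable="gs"):
    """Build a Ghostscript command that renders each PDF page to a TIFF."""
    return [
        gs_executable,
        "-dNOPAUSE",                    # do not pause between pages
        "-dBATCH",                      # exit when the input is processed
        "-dSAFER",                      # restrict file access during rendering
        f"-sDEVICE={device}",           # output device, e.g. tiff24nc or tiffg4
        f"-r{dpi}",                     # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",  # %03d becomes the page number
        pdf_path,
    ]


if __name__ == "__main__":
    cmd = ghostscript_tiff_command("directory.pdf")
    print(" ".join(cmd))
    # To actually render, pass the list to subprocess.run(cmd, check=True).
```

300 dpi sits inside both ranges quoted in the Paper Capture error message (200-600 dpi for b/w, 200-400 dpi for gray/color). For black-and-white pages the `tiffg4` device gives much smaller files.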

But I guess your best bet is to go back to whoever created the document and ask for a better PDF document. Your PDF file is of low quality. I suspect that if the original document creator had printed to the Distiller printer (or any other PDF creator based on a printer driver), you would not have these problems.
 
LVL 5

Author Comment

by:HighTechGeek
ID: 12231358
I hate it when software won't do what you want it to do... especially when it seems so simple :-)

In this case:
   Shame on Corel for a questionable PDF export engine.
   Shame on Adobe for not being able to OCR very well.

(like I could program something better, right?)

I got what I needed, but I had to go back to another PDF file that was "closer" to an original, so I couldn't do what I really wanted to do. However, thanks go to khkremer for giving me the best shot at getting there. I learned some things along the way too!