PaperPort is a popular document imaging/management product from Nuance Communications. It is in widespread use by both individuals and businesses. The current version of PaperPort is 14. The previous version was 12 (yes, Nuance got superstitious and skipped 13). Both of these most recent versions come in two editions, Professional and Standard. All four products (PP12 Standard, PP12 Professional, PP14 Standard, PP14 Professional) can create a searchable PDF file without any other software needing to be installed. PP12 was the first release that could do this (and the capability was carried forward into PP14).
Prior PaperPort releases require Nuance's OmniPage (a separately priced OCR product) to be installed in order to create a searchable PDF file, which PaperPort calls a PDF Searchable Image file (because it contains both the raster image and the text created by OCR). PP12 and PP14 can create a PDF Searchable Image file on their own because they contain the OmniPage OCR engine under the covers, via the OmniPage Capture Software Development Kit (CSDK).
Sidebar on PaperPort Version: If you are running PP12.0, I recommend that you upgrade (free!) to PP12.1. This EE article explains how to do it: PaperPort 12 - Free Upgrade to Version 12.1
If you are running PP14.0, PP14.1, or PP14.2, I recommend that you upgrade (free!) to PP14.5 (there was no public release of either 14.3 or 14.4). This EE article explains how to do it: PaperPort 14 - Free Upgrade to Version 14.5
End of Sidebar
There are three ways to create a PDF Searchable Image file in PP12 and PP14: scanning, converting (via Save As), and printing (to the PaperPort Image Printer).
(1) To scan directly to a PDF Searchable Image file, create a Scanning Profile in the Scan or Get Photo pane, click on the Output tab, and select PDF Searchable Image in the File type drop-down.
(2) To convert a file to a PDF Searchable Image file, right-click the item on the PaperPort Desktop, click Save As... from the context menu, and select PDF Searchable Image in the Save as type drop-down.
(3) To print to a PDF Searchable Image file, print to the PaperPort Image Printer in any Windows program. For example, printing a TIFF file from MS Paint to the PP Image Printer creates a PDF Searchable Image file. However, you must first configure the PaperPort Image Printer to an output type of PDF Searchable Image, not PDF Image. To do this in PP12 and PP14, click the Desktop menu, then the Desktop Options button on the ribbon, then the Item tab, and select PDF Searchable Image in the PaperPort Image Printer file type drop-down.
It is important to note that printing to the PP Image Printer creates a raster image (bitmap/graphic) which then has to go through the OCR process in order to create text.
If your source document already has text, such as a typical web page or Word file, this is generally not the right technique for creating a PDF: there's no reason to go from text to an image and then back to text again via OCR.
The better technique is to print to a PDF print driver that goes from the source text straight to text in the PDF file, creating what's known as a PDF Normal file. PaperPort installs such a driver that has had various names over the years, including DocuCom, Nuance PDF, and ScanSoft PDF Create.
These drivers are in addition to, and different from, the PP Image Printer. They are similar to other PDF print drivers that create a PDF Normal file (straight text-to-text, i.e., no OCR), such as Adobe PDF (Distiller), which is part of an Adobe Acrobat installation, as well as many free ones, including Bullzip, CutePDF Writer, doPDF, Foxit Reader PDF Printer (part of the Foxit Reader install), Nitro PDF Creator (part of the Nitro Reader install), PDFCreator, PDF-XChange Printer (part of the PDF-XChange Editor install), and PrimoPDF.
In summary, when scanning paper, you must scan to an image and have PaperPort invoke OCR to create a PDF Searchable Image file (which it does automatically via a Scanning Profile). Likewise, when converting an image-only file, such as a BMP, JPG, PNG, [image-only] PDF, or TIFF, to a PDF Searchable Image file, you must also have PaperPort invoke OCR to create it (which it does automatically via Save As). But when printing to a PDF file, you should print to the PP Image Printer only if the source document is a raster image (bitmap/graphic); if it isn't, then it's better to print to one of the other PDF print drivers mentioned above.
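The decision logic in this summary can be sketched as a small function. This is only an illustration: the function name, the source labels, and the returned strings are hypothetical, chosen for this sketch, and are not part of PaperPort or any interface it exposes.

```python
def recommend_method(source: str) -> str:
    """Return the article's recommended route to a searchable PDF.

    `source` is a hypothetical label invented for this sketch:
    "paper", "image_file", or "text_document".
    """
    if source == "paper":
        # Paper has to be scanned to an image; the Scanning Profile runs OCR.
        return "Scan with a Scanning Profile set to PDF Searchable Image"
    if source == "image_file":
        # BMP, JPG, PNG, image-only PDF, TIFF: OCR is still required.
        return "Save As PDF Searchable Image (or print to the PaperPort Image Printer)"
    if source == "text_document":
        # The text already exists, so skip the image-and-OCR round trip.
        return "Print to a PDF Normal print driver"
    raise ValueError(f"unknown source type: {source!r}")


print(recommend_method("text_document"))
```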
Two important variables that affect OCR accuracy are Mode (Black&White, Grayscale, Color) and Resolution (DPI - dots per inch). For typical business documents, I recommend B&W (monochrome/1-bit) and 300 DPI. This generally results in reasonable file size and accurate OCR. On rare occasions, I'll use B&W and 400 or 600 DPI, but in many cases, 600 DPI (counterintuitively) results in less accurate OCR. On other rare occasions, I'll use Grayscale (8-bit) and either 200 or 300 DPI, which sometimes results in more accurate OCR. To learn more about Mode and Resolution when scanning, I recommend Wayne Fulton's excellent site, A few scanning tips. In particular, look at the section that discusses OCR, Scanning Line art.
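To see why Mode and Resolution affect file size, here is a rough back-of-the-envelope calculation of the raw raster size of one page. It assumes a US Letter page (8.5 x 11 inches) and ignores both row padding and compression, which PDF files normally apply, so real files will be much smaller. The function name is made up for this sketch; the point is only the relative difference between settings.

```python
def raw_page_bytes(dpi: int, bits_per_pixel: int,
                   width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Approximate uncompressed size in bytes of one scanned page.

    Assumes a US Letter page by default and no row padding.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


for label, dpi, bpp in [
    ("B&W 300 DPI", 300, 1),
    ("B&W 600 DPI", 600, 1),
    ("Grayscale 300 DPI", 300, 8),
    ("Color 300 DPI", 300, 24),
]:
    size_mb = raw_page_bytes(dpi, bpp) / 1_000_000
    print(f"{label}: {size_mb:.1f} MB uncompressed")
```

Under these assumptions, Grayscale at 300 DPI holds eight times the raw data of B&W at 300 DPI, and doubling the resolution quadruples it.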
In PaperPort, you may set the Mode and Resolution in all three methods for creating a PDF Searchable Image: scanning, converting, and printing.
That's it!
Three easy ways to create PDF Searchable Image files in PaperPort 12 and PaperPort 14.
If you find this article to be helpful, please click the thumbs-up icon below. This lets me know what is valuable for EE members and provides direction for future articles. Thanks very much! Regards, Joe
Comments (6)
Author
Commented: What can you tell us about that company? Do you work for it?
I checked the domain at URLVoid, which says that it was created just eight days ago.
Of course, every website had its first day of existence, so that isn't necessarily a bad thing, but I'd like to hear your thoughts on the company/site.
A web search for "free online OCR" turns up several hits of established companies, such as:
http://finereaderonline.com/
The ABBYY FineReader software is excellent OCR and this website from the ABBYY folks uses the same OCR engine. Furthermore, the URLVoid domain report shows that it was first registered nearly eight years ago and has no reported safety issues. It does, however, have a monthly page limitation.
Another example is Online OCR:
http://www.onlineocr.net/
The URLVoid domain report for it shows that it was first registered more than seven years ago and has no reported safety issues.
Those are just two examples. There are many others.
In the interest of informing (and protecting) our EE members, I'm looking forward to hearing back from you about this new company/site. Regards, Joe
Author
Commented: You're very welcome. And thanks to you for joining EE today (welcome aboard!), as well as reading and endorsing my article. I really appreciate it! I'm glad you found it helpful. Regards, Joe
Author
Commented: Thank you for joining Experts Exchange this week and reading my article.
> Any ideas how to make the fonts vectorized in the searchable .pdf?
I do not have great expertise in font technology and am not aware of any way to control the font settings when PaperPort creates PDF Searchable Image files via the methods discussed in this article.
> I am asking this question because I would not like to install a pirated Adobe Acrobat to convert one pdf book into a pdf book with vectorized fonts.
I find that a strange comment — why would you even consider installing pirated software? We do not condone that here at Experts Exchange and, in fact, the Experts Exchange Terms of Use strictly prohibit any posting related to such activities (under Section 6, Code of Conduct). If you know that Adobe Acrobat will solve your font issue, and it is for only one PDF book, then I recommend purchasing just one month of Adobe Acrobat DC. For around 25 bucks, you'll avoid pirating software ($22.99 for one month of Acrobat Standard DC or $24.99 for one month of Acrobat Pro DC).
> What I got from PaperPort did not meet my expectations. the fonts got blurry.
It's likely that the fonts are blurry only when viewing the image layer. If you view just the text layer, the fonts should be fine. For example, I printed the first page of this article with the PaperPort Image Printer in B&W at 300 DPI to a PDF Image (not PDF Searchable Image). The whole page is attached as a PDF, but here's what it looks like:
The fonts, indeed, are blurry, because that's a view of the image (in Adobe Acrobat). I then used Nuance's Power PDF to convert to a searchable PDF, but told it not to keep the images. The whole page for that is also attached as a PDF, but here's the same small sample as shown above:
The fonts look great, because that's a view of the text (in Adobe Acrobat), since there is no image layer in the PDF.
> I expected them to get clean and vectorized, to be able to zoom in without those annoying pixels.
The fonts are fine in the text, as shown above. They get pixelated only when viewing the image layer. Another way to observe this is to Copy the text from the PDF Searchable Image file (created by PaperPort via one of the methods explained in this article) and then Paste it into a text-capable product, such as Notepad or Word — the fonts will, of course, appear fine. Regards, Joe
image-only-PaperPort-PDF-Image.pdf
text-only-Power-PDF-searchable-do-no.pdf