Solved

Converting PDF to searchable text format

Posted on 2004-08-11
Last Modified: 2006-11-17
Hi all,

I have a PDF file that seems to consist only of scanned page images (I can't search for specific words). I would like to convert it to a PDF in which I can search the text. Is there an OCR package that would do this?

Thanks,
Freerider.
Question by:Freerider
LVL 11

Expert Comment

by:lbertacco
ID: 11770899
If you have Office 2003, you can print the file to the "Microsoft Office Document Image Writer" printer, then open the result with "Microsoft Office Document Imaging" and click Tools > Send Text to Word.
 
LVL 44

Accepted Solution

by:
Karl Heinz Kremer earned 100 total points
ID: 11770962
You can use Adobe Acrobat (the full version): it comes with "Paper Capture", which is an OCR engine. If you don't have Acrobat, other options are ScanSoft's OmniPage Pro (http://www.scansoft.com/omnipage/) or ABBYY FineReader (http://www.abbyy.com/finereader/).
You have several options when you convert your image-only PDF. You can convert everything to "real" text and graphics, which may not be your best solution: you will very likely end up with a mix of recognized text and unrecognized text, and the unrecognized parts stay as scanned images. The characters on a page then switch between real characters and scanned images, and this is visible even to the untrained eye. You can avoid this by selecting "image with hidden text", where the original scanned image is used for display and printing, but the recognized text is stored behind the image (in the correct location). This means that you can index and search the document. When you find a term, the correct section of the document is highlighted, but you still have the high-quality scan that you started with when you view or print the document.
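For anyone who would rather script this step, the "image with hidden text" approach can be sketched roughly as follows. This is only a sketch under assumptions: it relies on the open-source OCRmyPDF command-line tool, which is not one of the products mentioned in this thread and must be installed separately, and the function names `build_ocr_command` and `make_searchable` are made up for illustration. By default OCRmyPDF keeps the scanned page image and places the recognized text underneath it.

```python
import shutil
import subprocess


def build_ocr_command(src, dst, language="eng"):
    """Return the argument list for an OCRmyPDF run (hypothetical helper)."""
    return [
        "ocrmypdf",
        "--language", language,  # OCR language, e.g. "eng" or "deu"
        "--skip-text",           # leave pages that already contain text alone
        str(src),                # image-only input PDF
        str(dst),                # searchable output PDF
    ]


def make_searchable(src, dst, language="eng"):
    """Run OCRmyPDF if it is installed; return True only on success."""
    if shutil.which("ocrmypdf") is None:
        return False  # assumption: the tool is on PATH; if not, do nothing
    result = subprocess.run(build_ocr_command(src, dst, language))
    return result.returncode == 0
```

Calling `make_searchable("scan.pdf", "scan_ocr.pdf")` (both file names are placeholders) would write a searchable copy next to the original, leaving the original untouched.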
 

Author Comment

by:Freerider
ID: 11863285
Thanks khkremer,
FineReader does the job. The only problem I have now is that the bookmarks from the original document have been removed. Any idea how to get them back? I've downloaded a few trial programs, but nothing seems to work.

Freerider.
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 11863758
Try this: open the original file (with the bookmarks) in Acrobat, then select Document > Pages > Replace and choose to replace all pages with the pages from your OCR'ed document. Replacing pages swaps only the page content, so the bookmarks in the original file stay in place.