APD Toronto
asked on
Searchable PDF for Scanned documents
Hi Experts,
With a document scanning project, what is "Searchable PDF"?
I am using Brother Control Center, and I believe that when I scan to PDF, the pages are stored as images. But when I use OCR, are they converted to a simple text format?
What type of software or scanner do you use?
I use the scanner on my HP 8610, and it creates searchable PDFs. I just searched for text in a collection of installation PDFs that I scanned, and I found the text.
> what is "Searchable PDF"?
It means that the PDF has text in it, as opposed to being an image-only doc. When scanning, there are three types of PDF that can be created:
(1) Image-only: This is an image, aka bitmap, graphic, picture, photo.
(2) Searchable PDF, but with the image, too (aka PDF Searchable Image). This has the image (bitmap/graphic) from scanning, but also has text in it from an OCR process.
(3) Searchable PDF, but without the image. This has text created via an OCR process, but the scanned image is then discarded and only the OCR'ed text is kept.
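To make the three types concrete, here is a minimal Python sketch of the distinction. It is only an illustration, not how any particular scanning product works: the `PageContent` class, the `PdfKind` names, and both functions are hypothetical, and a real tool would fill in `has_scanned_image` and `extracted_text` by inspecting the file with a PDF library.

```python
from dataclasses import dataclass
from enum import Enum


class PdfKind(Enum):
    IMAGE_ONLY = "image-only"
    SEARCHABLE_IMAGE = "searchable, image kept"
    SEARCHABLE_TEXT_ONLY = "searchable, image discarded"
    EMPTY = "no image and no text"


@dataclass
class PageContent:
    # Hypothetical summary of one page (assumption for illustration):
    # whether it holds a scanned bitmap, and whatever text an OCR
    # process stored for it ("" if there is none).
    has_scanned_image: bool
    extracted_text: str


def classify(pages):
    """Map a document onto the three types listed above."""
    has_image = any(p.has_scanned_image for p in pages)
    has_text = any(p.extracted_text.strip() for p in pages)
    if has_image and has_text:
        return PdfKind.SEARCHABLE_IMAGE
    if has_text:
        return PdfKind.SEARCHABLE_TEXT_ONLY
    if has_image:
        return PdfKind.IMAGE_ONLY
    return PdfKind.EMPTY


def find_pages(pages, term):
    """Return 1-based page numbers whose stored text contains term."""
    needle = term.casefold()
    return [
        number
        for number, page in enumerate(pages, start=1)
        if needle in page.extracted_text.casefold()
    ]


if __name__ == "__main__":
    scan = [PageContent(True, ""), PageContent(True, "")]
    ocr_scan = [PageContent(True, "Install the driver"), PageContent(True, "")]
    print(classify(scan).value)
    print(classify(ocr_scan).value)
    print(find_pages(ocr_scan, "driver"))
```

The point of the sketch is that searching only ever looks at the stored text. An image-only PDF has nothing for the search to match, no matter how readable the picture looks on screen.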
These EE articles and videos will help you to understand more about Searchable PDFs:
Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF with Power PDF Advanced
PaperPort - How To Create Searchable PDF Files
Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced
How to OCR pages in a PDF with free software
Btw, I've had Brother MFC devices for decades (current ones are the MFC-9970CDW and MFC-L8850CDW), but I've never used Control Center with them...have always used PaperPort to scan...currently using PaperPort Pro 14.5 with Patch 1 and the PaperPort 14 Scanner Connection Tool. Regards, Joe
ASKER
I assume hand written text will not be searchable, just typed?
> hand written text
The term that I mentioned above, OCR (Optical Character Recognition), is for typewritten text. But handwriting is a different (and much more difficult) ballgame that requires a process known as Intelligent Character Recognition (ICR) or another one known as Intelligent Word Recognition (IWR). ICR recognizes cursive handwriting a character at a time, while IWR recognizes full words and phrases in cursive handwriting. The accuracy of ICR and IWR is way, way below that of OCR. In most cases, I have found users to be extremely disappointed with the accuracy of ICR/IWR. Regards, Joe
Handwritten text is very difficult to decode.
Best to stick with typed text.
You need exceedingly neat handwriting for it to be recognized with any degree of accuracy.
OCR, as you point out, creates an actual text / Word file.