Detect if PDF files have searchable text or not...

Posted on 2008-06-21
Last Modified: 2012-05-05
As far as I know, a PDF (Portable Document Format, created by Adobe) file can be created in two different ways:

*) From a scanner, digital camera, or screenshot/screen-capture application, in which case it is an image-based PDF with no searchable text...

*) From a document file with text, such as Word or OpenOffice Writer, in which case it has searchable text...

This is very important to me because I am an eBook collector, and I use Windows Desktop Search (WDS) and Google Desktop Search (GDS) a lot! I have a great many PDF files, and I need a way to automatically scan each one and identify whether it has searchable text or not (if not, I will OCR it!). My final goal is to have every PDF contain searchable text (the image-based PDFs will be OCRed once they have been identified and isolated), so that I can trust the WDS/GDS results. I know I cannot trust WDS/GDS if I have a large number of image-based PDFs without text, since these files are ignored...

My question is: how can I automatically detect whether a PDF needs to be OCRed?

Thanks in advance!
Question by:asgarcymed
LVL 44

Expert Comment

ID: 21839866
There is no process to automate it -- Adobe specifically avoids any command line arguments in any of their products.  They assume you have your whole life to spend sitting in front of a computer, pointing at their software, and going through all the clicks they want you to do, wasting your life away (I am serious !!).

Hence, you will have to open every PDF, and look for the binoculars being highlighted.  If there is no searchable text in the document, the binoculars will be grayed out.

Here is Adobe's take on the problem (note, they admit it is a problem, but do nothing about it) --

Author Comment

ID: 21841583
Damn!... I am sad and disappointed... :( :(

However, I must thank you for answering my question!!

I thought about it and tried a method: using VeryPDF PDF2TXT, which has batch support, I get one TXT file containing the text of each PDF file. If a PDF has no text, the TXT file is still created, but it is empty.
Although this method is far from what I really want, it is better than nothing... Do you want to suggest anything else?
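Checking the batch output for empty TXT files can itself be scripted. Here is a minimal sketch in Python, assuming the converter writes one `.txt` per PDF into a single folder and that the file name (minus extension) matches the PDF. The function name `pdfs_without_text` and the `min_chars` parameter are made up for illustration, and decoding as UTF-8 is a guess about the converter's output.

```python
from pathlib import Path


def pdfs_without_text(txt_dir, min_chars=1):
    """Return the base names of TXT files holding fewer than
    min_chars non-whitespace characters (hypothetical helper)."""
    empty = []
    for txt_path in sorted(Path(txt_dir).glob("*.txt")):
        # Assumption: output is UTF-8 or close to it; undecodable bytes are dropped.
        text = txt_path.read_text(encoding="utf-8", errors="ignore")
        # Remove spaces, newlines and form feeds before counting.
        visible = "".join(text.split())
        if len(visible) < min_chars:
            empty.append(txt_path.stem)
    return empty
```

Calling `pdfs_without_text(r"C:\ebooks\txt")` (the path is only an example) would give the list of PDFs to send to OCR. Raising `min_chars` also catches scans that contain only a stray page number or watermark.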

Thank you!
LVL 44

Expert Comment

ID: 21842667
I was going to suggest non-Adobe PDF programs -- there are zillions of them.  Pardon me if I give you a general link, there are way too many to list individually -- check them out, you will find something of interest

Notice it's listing the free software first.  Good luck.

Author Comment

ID: 21847355
After doing some Google searches, I had a new idea (I do not know if it is a good one, so I am posting it here, hoping you will critique it): image-only PDF files usually have a bigger file size than text-only PDF files with the same number of pages...
I know something about AutoIt and VBScript...
There are some ActiveX/COM DLLs that count PDF pages...
So I thought I could write a script that does the following:

1) Get each PDF file's SIZE

2) Get each PDF file's NUMBER OF PAGES

3) Calculate the RATIO = SIZE / NUMBER OF PAGES

4) Condition:

If RATIO is High, Then it is, very probably, an image-only PDF file

Else If RATIO is Low, Then it is, very probably, a text-only PDF file

Else = Doubt

End If
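The condition above could be sketched like this in Python. The two thresholds (`low`, `high`, in bytes per page) are invented placeholders, not measured values, and would need tuning against PDFs whose type is already known. The page count is assumed to come from elsewhere, for example one of the COM page counters mentioned above.

```python
def classify_by_ratio(size_bytes, page_count, low=50_000, high=150_000):
    """Guess the PDF type from bytes per page.

    low and high are hypothetical cut-offs chosen only for illustration.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    ratio = size_bytes / page_count  # step 3: RATIO = SIZE / NUMBER OF PAGES
    if ratio >= high:
        return "probably image-only"
    if ratio <= low:
        return "probably text"
    return "doubt"
```

For a real file, `size_bytes` can be read with `os.path.getsize(path)`. Anything that lands in "doubt" would still need a second check, such as trying to extract its text.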

Does this look silly or reliable? Why?

Thank you!
LVL 44

Accepted Solution

scrathcyboy earned 500 total points
ID: 21849203
That would be workable -- ASSUMING the PDF is "normally" constructed.  Some PDFs could have lots of tiny text (footnotes and indices as well) and therefore be as large as a PDF with, say, just one image per page scattered throughout the document.

Almost all new PDFs have a mixture of images and text, so this still does not solve the main goal of getting out whatever searchable text there is in the PDF.  Of the links listed, there should be several programs you can use (many free) to extract the searchable text.  If those programs have command lines, then you can really go to town -- say  -x = extract, -t = text, -f = file to write to .. then here goes --

for %f in (C:\directory\*.pdf) do PDF-extractor.exe -x -t -f"C:\directory\%~nf.txt" "%f"

There's a hypothetical batch to process them all in one command.  The problem is, programmers are not using command line options like they used to -- so that is the challenge.
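As one concrete variant of that loop, here is a rough Python sketch that assumes the `pdftotext` command-line tool (shipped with Xpdf and Poppler) is installed and on the PATH. It relies on `pdftotext file.pdf -` writing the extracted text to standard output. The function names and the `min_chars` cut-off of 20 are made up for illustration.

```python
import subprocess
from pathlib import Path


def extract_with_pdftotext(pdf_path):
    """Run `pdftotext <file> -` and return whatever text it prints."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=False,  # a failed extraction simply yields empty text
    )
    return result.stdout.decode("utf-8", errors="ignore")


def find_pdfs_needing_ocr(directory, extract=extract_with_pdftotext, min_chars=20):
    """List PDFs whose extracted text has fewer than min_chars
    non-whitespace characters (hypothetical cut-off)."""
    needs_ocr = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        text = extract(pdf_path)
        # Ignore spaces, newlines and the form feeds emitted between pages.
        if len("".join(text.split())) < min_chars:
            needs_ocr.append(pdf_path.name)
    return needs_ocr
```

Because a PDF that fails to extract (damaged or encrypted, for example) also produces empty text, it ends up in the same list and should be looked at by hand. The `extract` parameter lets another extractor be swapped in without touching the loop.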

The other thing perhaps you are overlooking is how Google has already scanned virtually all PDFs it encounters, and made text files from them, which they store on their own servers.  In many Google searches you will see the PDF file listed first, but under it, there is a text file (also linked) with all the extractable text already in a file, cached on their servers.  There may be a way to google for just the cached pages, not the original PDFs.  If you look at enough, you might find they are all on one server.

Also, before you go off writing code with assumptions that could be broken, look at these links --
I think you will find that many people have already done what you want -- don't reinvent.

Author Comment

ID: 21849388
THANK YOU VERY MUCH FOR EVERYTHING! You are really a nice guy! ;)
Best regards.
LVL 44

Expert Comment

ID: 21849734
my pleasure, and good luck.
