Solved

Detect if PDF files have searchable text or not...

Posted on 2008-06-21
Last Modified: 2012-05-05
As far as I know, a PDF (Portable Document Format, created by Adobe) file can be created in 2 different ways:

*) From a scanner, digital camera, or screenshot/screen-capture application, in which case it is an image-based PDF without text (no searchable text)...

*) From a document file with text (has searchable text), such as Word or OpenOffice Writer...

This is very important to me because I am an eBook collector, and I use Windows Desktop Search (WDS) and Google Desktop Search (GDS) a lot! So, I have many, many PDF files and I need a way to automatically scan each PDF file and identify whether it has searchable text or not (if not, I will OCR it!). My final goal is to have all PDFs with searchable text (the original image-based PDFs will be OCRed, after being identified and isolated, in order to get searchable text), so I can run WDS/GDS with confidence... I know I cannot trust WDS/GDS if I have a large number of image-based PDFs without text, since these files are ignored...


My question is: how can I automatically detect whether a PDF needs to be OCRed?

Thanks in advance!
Question by:asgarcymed
LVL 44

Expert Comment

by:scrathcyboy
ID: 21839866
There is no process to automate it -- Adobe specifically avoids any command line arguments in any of their products. They assume you have your whole life to spend sitting in front of a computer, pointing to their software, and going through all the clicks they want you to do, to waste your life away (I am serious!!).

Hence, you will have to open every PDF, and look for the binoculars being highlighted.  If there is no searchable text in the document, the binoculars will be grayed out.

Here is Adobe's take on the problem (note: they admit it is a problem, but do nothing about it) --

http://blogs.adobe.com/acrolaw/2007/02/is_that_pdf_sea.html
 

Author Comment

by:asgarcymed
ID: 21841583
Damn!... I am sad and disappointed... :( :(

However, I must thank you for answering my question!!

I thought about it and tried a method: using VeryPDF PDF2TXT, which has batch support, I get one TXT file containing the text of each PDF file. If a PDF has no text, the TXT file is still created, but it is empty.
Although this method is far from what I really want, it is better than nothing... Do you want to suggest anything else?


Thank you!
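
The empty-TXT check described above can be scripted. Here is a minimal sketch in Python, assuming the batch converter has already written one .txt file next to each PDF with the same base name. That naming convention, the function name and the min_chars threshold are assumptions made for illustration, not documented behaviour of VeryPDF PDF2TXT.

```python
from pathlib import Path


def find_pdfs_needing_ocr(folder, min_chars=1):
    """Return PDFs whose sibling .txt file is missing or has almost no text.

    Assumes the converter wrote <name>.txt next to <name>.pdf (hypothetical
    convention). min_chars is a hypothetical threshold.
    """
    needs_ocr = []
    # Note: "*.pdf" matches case-sensitively on some platforms.
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        if not txt.exists():
            needs_ocr.append(pdf)
            continue
        text = txt.read_text(encoding="utf-8", errors="ignore")
        # Count only non-whitespace characters, since a converter may emit
        # blank lines or form feeds even when a page has no real text.
        if len("".join(text.split())) < min_chars:
            needs_ocr.append(pdf)
    return needs_ocr
```

Raising min_chars (for example to a few dozen characters) may help with scanned PDFs that carry only a short text watermark, but the right value depends on the collection.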
 
LVL 44

Expert Comment

by:scrathcyboy
ID: 21842667
I was going to suggest non-Adobe PDF programs -- there are zillions of them. Pardon me if I give you a general link; there are way too many to list individually -- check them out, you will find something of interest.

http://www.google.com/search?num=30&q=free+PDF+text+convert+software

Notice it's listing the free software first.  Good luck.

 

Author Comment

by:asgarcymed
ID: 21847355
After doing some Google searches, I had a new idea (I do not know if it is a good idea, so I am posting it here, hoping you will critique it): image-only PDF files usually have a bigger file size than text-only PDF files for the same number of pages...
I know something about AutoIt and VBScript...
There are some ActiveX/COM DLLs that count PDF pages...
So, I thought - I can make a script which:

1) Get each PDF file's SIZE

2) Get each PDF file's NUMBER OF PAGES

3) Calculate the RATIO = SIZE / NUMBER OF PAGES

4) Condition:

If RATIO is High, Then it is, very probably, an image-only PDF file

Else If RATIO is Low, Then it is, very probably, a text-only PDF file

Else = Doubt

End If
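
The four steps above can be sketched in Python. This is only an illustration under stated assumptions: the two thresholds are hypothetical numbers that would need tuning on files already checked by hand, and the page counter is a crude byte scan rather than a real PDF parser (it stands in for the COM page-counting DLL mentioned above and will undercount when page objects sit inside compressed object streams).

```python
import re

# Hypothetical thresholds in bytes per page; tune them on known files.
LOW_BYTES_PER_PAGE = 20_000
HIGH_BYTES_PER_PAGE = 100_000


def count_pages_crudely(pdf_bytes):
    """Estimate the page count by counting '/Type /Page' entries.

    The lookahead skips '/Pages', '/PageLabel' and similar names. Pages
    stored in compressed object streams are not seen, so this is an estimate.
    """
    return len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", pdf_bytes))


def classify_by_ratio(size_bytes, page_count,
                      low=LOW_BYTES_PER_PAGE, high=HIGH_BYTES_PER_PAGE):
    """Apply the SIZE / NUMBER OF PAGES rule and return a label."""
    if page_count <= 0:
        return "doubt"  # no usable page count, so no ratio
    ratio = size_bytes / page_count
    if ratio >= high:
        return "probably image-only"
    if ratio <= low:
        return "probably text-only"
    return "doubt"


def classify_file(path):
    """Read one PDF from disk and classify it with the rule above."""
    with open(path, "rb") as handle:
        data = handle.read()
    return classify_by_ratio(len(data), count_pages_crudely(data))
```

Files that land in "doubt" are the ones worth checking by another method, such as trying to extract their text.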


Does this look silly or reliable? Why?

Thank you!
Regards.
 
LVL 44

Accepted Solution

by:
scrathcyboy earned 500 total points
ID: 21849203
That would be workable -- ASSUMING the PDF is "normally" constructed.  Some PDFs could have lots of tiny text (footnotes and indices as well) and therefore be as large as a PDF with, say, just one image per page scattered throughout the document.

Almost all new PDFs have a mixture of images and text, so this still does not solve the main goal of getting out whatever searchable text there is in the PDF.  Of the links listed, there should be several programs you can use (many free) to extract the searchable text.  If those programs have command-line options, then you can really go to town -- say -x = extract, -t = text, -f = file to write to... then here goes --

for %F in ("C:\directory\*.pdf") do PDF-extractor.exe -x -t -f"C:\directory\%~nF.txt" "%F"

There's a hypothetical batch to process them all in one command.  The problem is, programmers are not using command line options like they used to -- so that is the challenge.
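
The same loop can be sketched in Python, which also allows checking each result for emptiness straight away. This is a sketch under assumptions: it calls an external extractor as `<command> <pdf> -` and expects the text on standard output, which is how Poppler's pdftotext behaves; that tool is not mentioned in this thread and would have to be installed separately, and other extractors will need different arguments. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path, command=("pdftotext",)):
    """Run `<command> <pdf> -` and return whatever it printed.

    Assumes the tool writes text to stdout when the output name is "-".
    A failed run yields an empty string here.
    """
    result = subprocess.run(
        [*command, str(pdf_path), "-"],
        capture_output=True,
        check=False,
    )
    return result.stdout.decode("utf-8", errors="ignore")


def split_by_text(folder, command=("pdftotext",)):
    """Return (searchable, needs_ocr) lists of the PDFs in folder."""
    searchable, needs_ocr = [], []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        text = extract_text(pdf, command)
        # strip() also drops form feeds that may be emitted for empty pages.
        if text.strip():
            searchable.append(pdf)
        else:
            needs_ocr.append(pdf)
    return searchable, needs_ocr
```

Because a failed extraction also returns no text, damaged or password-protected files end up in the needs_ocr list too, so that list is worth a quick manual look before starting a long OCR run.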

The other thing you are perhaps overlooking is that Google has already scanned virtually all PDFs it encounters and made text files from them, which it stores on its own servers.  In many Google searches you will see the PDF file listed first, but under it there is a text file (also linked) with all the extractable text already in a file, cached on their servers.  There may be a way to search Google for just the cached pages, not the original PDFs.  If you look at enough of them, you might find they are all on one server.

Also, before you go off writing code with assumptions that could be broken, look at these links --
http://www.google.com/search?num=30&q=batch+extract+text+from+PDF+
I think you will find that many people have already done what you want -- don't reinvent.
 

Author Comment

by:asgarcymed
ID: 21849388
THANK YOU VERY MUCH FOR EVERYTHING! You are really a nice guy! ;)
Best regards.
 
LVL 44

Expert Comment

by:scrathcyboy
ID: 21849734
My pleasure, and good luck.

Question has a verified solution.
