Solved

Detect if PDF files have searchable text or not...

Posted on 2008-06-21
Last Modified: 2012-05-05
As far as I know, a PDF (Portable Document Format, created by Adobe) file can be created in two different ways:

*) From a scanner, digital camera, or screenshot/screen-capture application, which yields an image-based PDF with no searchable text.

*) From a document file that contains text, such as a Word or OpenOffice Writer document, which yields a PDF with searchable text.

This is very important to me because I am an eBook collector, and I use Windows Desktop Search (WDS) and Google Desktop Search (GDS) a lot. I have a great many PDF files, and I need a way to automatically scan each one and identify whether it has searchable text (if not, I will OCR it). My final goal is for every PDF to have searchable text: the image-based PDFs will be identified, isolated, and OCRed, so that I can rely on WDS/GDS. I know I cannot trust WDS/GDS results while I have a large number of image-based PDFs without text, since those files are ignored.


My question is: how can I automatically detect whether a PDF needs to be OCRed?

Thanks in advance!
Question by:asgarcymed
 
LVL 44

Expert Comment

by:scrathcyboy
There is no process to automate it -- Adobe specifically avoids command-line arguments in its products. They assume you have your whole life to spend sitting in front of a computer, pointing at their software, and going through all the clicks they want you to do, wasting your life away (I am serious!).

Hence, you will have to open every PDF and check whether the binoculars icon is highlighted. If there is no searchable text in the document, the binoculars will be grayed out.

Here is Adobe's take on the problem (note that they admit it is a problem but do nothing about it) --

http://blogs.adobe.com/acrolaw/2007/02/is_that_pdf_sea.html
 

Author Comment

by:asgarcymed
Damn! I am sad and disappointed... :(

However, I must thank you for answering my question!

I thought about it and tried a method: using VeryPDF PDF2TXT, which has batch support, I get one TXT file containing the text of each PDF file. If a PDF has no text, the TXT file is still created, but it is empty.
Although this method is far from what I really want, it is better than nothing. Do you have anything else to suggest?


Thank you!
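
The empty-TXT check described in this comment can be automated. The following is only a minimal sketch, assuming the extractor writes one `<name>.txt` per `<name>.pdf` into a known folder (that naming scheme, the function names, and the `min_chars` threshold are assumptions for illustration, not something the thread confirms about VeryPDF PDF2TXT):

```python
from pathlib import Path


def needs_ocr(txt_path, min_chars=1):
    """Return True when the extracted text file is missing or effectively empty."""
    txt_path = Path(txt_path)
    if not txt_path.exists():
        return True
    text = txt_path.read_text(encoding="utf-8", errors="ignore")
    # Count only non-whitespace characters, so blank lines and
    # form feeds left by the extractor do not count as text.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


def list_pdfs_needing_ocr(pdf_dir, txt_dir, min_chars=1):
    """List PDF file names whose matching TXT file holds no usable text."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    flagged = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Assumed naming: book.pdf -> book.txt
        if needs_ocr(txt_dir / (pdf.stem + ".txt"), min_chars):
            flagged.append(pdf.name)
    return flagged
```

Raising `min_chars` (for example to a few dozen characters) may help with scanned PDFs that carry only a stray text stamp or watermark, though the right value would have to be found by trial.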
 
LVL 44

Expert Comment

by:scrathcyboy
I was going to suggest non-Adobe PDF programs -- there are zillions of them. Pardon me for giving you a general link, but there are far too many to list individually. Check them out; you will find something of interest.

http://www.google.com/search?num=30&q=free+PDF+text+convert+software

Notice it's listing the free software first.  Good luck.

 

Author Comment

by:asgarcymed
After some Google searches, I had a new idea (I do not know whether it is a good one, so I am posting it here in the hope that you will critique it): image-only PDF files usually have a larger file size than text-only PDF files with the same number of pages.
I know a bit of AutoIt and VBScript.
There are some ActiveX/COM DLLs that count PDF pages.
So I thought I could write a script that does the following:

1) Get each PDF file's SIZE

2) Get each PDF file's NUMBER OF PAGES

3) Calculate the RATIO = SIZE / NUMBER OF PAGES

4) Condition:

If RATIO is High, Then it is, very probably, an image-only PDF file

Else If RATIO is Low, Then it is, very probably, a text-only PDF file

Else = Doubt

End If


Does this look silly or reliable? Why?

Thank you!
Regards.
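
The size-per-page heuristic proposed in this comment could look roughly like the sketch below. The threshold values `low` and `high` are pure placeholders chosen for illustration and would need tuning against a real collection, and the page count is assumed to come from some other tool (such as the COM page counters mentioned above); nothing here reads the PDF itself.

```python
def bytes_per_page(size_bytes, page_count):
    """RATIO = SIZE / NUMBER OF PAGES, as in the steps above."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return size_bytes / page_count


def classify_by_ratio(size_bytes, page_count, low=20_000, high=60_000):
    """Guess the PDF type from its bytes-per-page ratio.

    low/high are assumed, untuned thresholds (in bytes per page).
    """
    ratio = bytes_per_page(size_bytes, page_count)
    if ratio >= high:
        return "probably image-only"
    if ratio <= low:
        return "probably text"
    return "doubt"


# Example: a 10 MB file with 100 pages is about 100,000 bytes per page.
print(classify_by_ratio(10_000_000, 100))
```

In a real script, `size_bytes` could be obtained with `os.path.getsize(path)`. Files that land in the "doubt" band would still need a text-extraction check.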
 
LVL 44

Accepted Solution

by:
scrathcyboy earned 500 total points
That would be workable -- ASSUMING the PDF is "normally" constructed. Some PDFs could have lots of tiny text (footnotes and indices as well) and therefore be as large as a PDF with, say, just one image per page scattered throughout the document.

Almost all new PDFs have a mixture of images and text, so this still does not achieve the main goal of getting out whatever searchable text there is in the PDF. Among the links listed, there should be several programs (many of them free) that you can use to extract the searchable text. If those programs have command-line options, then you can really go to town. Say -x = extract, -t = text, -f = file to write to; then here goes --

for %f in (C:\directory\*.pdf) do PDF-extractor.exe -x -t -f "C:\directory\%~nf.txt" "%f"

There's a hypothetical batch command to process them all at once. The problem is that programmers are not providing command-line options like they used to, so that is the challenge.
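
The same loop can be sketched in Python. This version assumes a command-line extractor that is invoked as `tool input.pdf output.txt`; the default shown is `pdftotext` (from Xpdf/Poppler), which is not mentioned in this thread and must be installed separately. The function name `extract_all` and the `run` parameter (there only so the loop can be tried without the tool present) are illustrative. Adjust the argument list to whatever syntax your extractor actually uses.

```python
import subprocess
from pathlib import Path


def extract_all(directory, tool="pdftotext", run=subprocess.run):
    """Run a text extractor on every PDF in a folder.

    Assumes the tool accepts: <tool> <input.pdf> <output.txt>.
    Returns the list of TXT paths it asked the tool to write.
    """
    outputs = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")  # book.pdf -> book.txt, same folder
        # check=True raises CalledProcessError if the tool exits with an error.
        run([tool, str(pdf), str(txt)], check=True)
        outputs.append(txt)
    return outputs
```

A call such as `extract_all(r"C:\directory")` would then leave one TXT file next to each PDF, and any TXT file that comes out empty points to a PDF that probably needs OCR.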

The other thing you are perhaps overlooking is that Google has already scanned virtually all the PDFs it encounters and made text versions of them, which it stores on its own servers. In many Google searches you will see the PDF file listed first, but under it there is a link to a text version with all the extractable text, cached on Google's servers. There may be a way to search Google for just the cached pages, not the original PDFs. If you look at enough of them, you might find they are all on one server.

Also, before you go off writing code with assumptions that could be broken, look at these links --
http://www.google.com/search?num=30&q=batch+extract+text+from+PDF+
I think you will find that many people have already done what you want -- don't reinvent.
 

Author Comment

by:asgarcymed
THANK YOU VERY MUCH FOR EVERYTHING! You are really a nice guy! ;)
Best regards.
 
LVL 44

Expert Comment

by:scrathcyboy
My pleasure, and good luck.