Solved

Detect if PDF files have searchable text or not...

Posted on 2008-06-21
Last Modified: 2012-05-05
As far as I know, a PDF (Portable Document Format, created by Adobe) file can be created in 2 different ways:

*) From a scanner, digital camera, or screenshot/screen-capture application, in which case it is an image-based PDF without text (no searchable text)...

*) From a document file with text (has searchable text), such as Word or OpenOffice Writer...

This is very important to me because I am an eBook collector, and I use Windows Desktop Search (WDS) and Google Desktop Search (GDS) a lot! I have a great many PDF files, and I need a way to automatically scan each one and identify whether it has searchable text or not (if not, I will OCR it). My final goal is for every PDF to have searchable text (the image-based PDFs will be OCRed once they have been identified and isolated), so that I can rely on WDS/GDS. I know I cannot trust WDS/GDS results if I have a large number of image-based PDFs without text, since these files are ignored...


My question is: how can I automatically detect whether a PDF needs to be OCRed?

Thanks in advance!
Question by:asgarcymed
7 Comments
 

Expert Comment

by:scrathcyboy
ID: 21839866
There is no process to automate it -- Adobe specifically avoids any command-line arguments in any of their products. They assume you have your whole life to spend sitting in front of a computer, pointing at their software, and going through all the clicks they want you to do, wasting your life away (I am serious!!).

Hence, you will have to open every PDF, and look for the binoculars being highlighted.  If there is no searchable text in the document, the binoculars will be grayed out.

Here is Adobe's take on the problem (note that they admit it is a problem, but do nothing about it) --

http://blogs.adobe.com/acrolaw/2007/02/is_that_pdf_sea.html
 

Author Comment

by:asgarcymed
ID: 21841583
Damn!... I am sad and disappointed... :( :(

However, I must thank you for answering my question!!

I thought about it and tried a method: using VeryPDF PDF2TXT, which has batch support, I get one TXT file containing the text of each PDF file. If a PDF has no text, the TXT file is still created, but it is empty.
Although this method is far from what I really want, it is better than nothing... Do you want to suggest anything else?
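
The empty-TXT check described above could be automated along these lines. This is only a sketch, assuming the extractor has already written one .txt file per PDF, with the same base name, into a separate folder; the function name and folder layout are hypothetical and not part of PDF2TXT.

```python
from pathlib import Path


def pdfs_needing_ocr(pdf_dir, txt_dir):
    """Return the PDFs whose extracted .txt file is missing or holds only whitespace."""
    needs_ocr = []
    # Note: "*.pdf" matches case-sensitively on some platforms.
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt = Path(txt_dir) / (pdf.stem + ".txt")
        # A missing output file is treated like an empty one: no text was extracted.
        text = txt.read_text(errors="ignore") if txt.exists() else ""
        if not text.strip():
            needs_ocr.append(pdf)
    return needs_ocr
```

Calling pdfs_needing_ocr(r"C:\ebooks", r"C:\ebooks_txt") (example paths only) would return the list of files to send to OCR. A PDF with a few stray characters of text would not be flagged, so a minimum-length threshold may be worth adding.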


Thank you!
 

Expert Comment

by:scrathcyboy
ID: 21842667
I was going to suggest non-Adobe PDF programs -- there are zillions of them. Pardon me if I give you a general link; there are way too many to list individually. Check them out, you will find something of interest.

http://www.google.com/search?num=30&q=free+PDF+text+convert+software

Notice it's listing the free software first.  Good luck.

 

Author Comment

by:asgarcymed
ID: 21847355
After doing some Google searches, I had a new idea (I do not know if it is a good one, so I am posting it here, hoping you will critique it): image-only PDF files usually have a bigger file size than text-only PDF files with the same number of pages...
I know something about AutoIt and VBScript...
There are some ActiveX/COM DLLs that count PDF pages...
So I thought I could write a script which does the following:

1) Get each PDF file's SIZE

2) Get each PDF file's NUMBER OF PAGES

3) Calculate the RATIO = SIZE / NUMBER OF PAGES

4) Condition:

If RATIO is High, Then it is, very probably, an image-only PDF file

Else If RATIO is Low, Then it is, very probably, a text-only PDF file

Else = Doubt

End If
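
The four steps above can be sketched as follows. This is an illustration under loud assumptions: the two thresholds are made-up values that would need tuning on real files, and the page counter is a crude byte scan rather than one of the COM DLLs mentioned above, so it undercounts PDFs that keep their page objects in compressed object streams.

```python
import re
from pathlib import Path

# Hypothetical thresholds in bytes per page; tune them on a sample of your own files.
LOW_BYTES_PER_PAGE = 20_000
HIGH_BYTES_PER_PAGE = 100_000


def count_pages_crude(data: bytes) -> int:
    """Rough page count: occurrences of '/Type /Page' (not '/Pages') in the raw bytes."""
    return len(re.findall(rb"/Type\s*/Page(?![a-zA-Z])", data))


def classify(size_bytes, pages, low=LOW_BYTES_PER_PAGE, high=HIGH_BYTES_PER_PAGE):
    """Return 'image', 'text', or 'doubt' from the size-per-page ratio."""
    if pages <= 0:
        return "doubt"  # page count unknown, so the ratio is meaningless
    ratio = size_bytes / pages
    if ratio >= high:
        return "image"
    if ratio <= low:
        return "text"
    return "doubt"


def classify_file(path):
    """Apply the heuristic to one PDF file on disk."""
    data = Path(path).read_bytes()
    return classify(len(data), count_pages_crude(data))
```

The result is only a guess: a text PDF with embedded fonts or illustrations can land in the "image" range, so anything other than a clear "text" verdict is better confirmed by actually extracting the text.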


Does this look silly or reliable? Why?

Thank you!
Regards.
 

Accepted Solution

by:
scrathcyboy earned 2000 total points
ID: 21849203
That would be workable -- ASSUMING the PDF is "normally" constructed. Some PDFs could have lots of tiny text (footnotes and indices as well) and therefore be as large as a PDF with, say, just one image per page scattered throughout the document.

Almost all new PDFs have a mixture of images and text, so this still does not solve the main goal of getting out whatever searchable text there is in the PDF. Among the links listed, there should be several programs you can use (many free) to extract the searchable text. If those programs have command-line options, then you can really go to town -- say -x = extract, -t = text, -f = file to write to. Then here goes --

for %f in (C:\directory\*.pdf) do PDF-extractor.exe -x -t -f "C:\directory\%~nf.txt" "%f"

There's a hypothetical batch to process them all in one command.  The problem is, programmers are not using command line options like they used to -- so that is the challenge.
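
As one possible concrete version of that loop, here is a sketch that assumes the pdftotext command-line tool (shipped with Xpdf and Poppler) is installed and on the PATH; it is called as "pdftotext input.pdf output.txt". The function names and the injectable "run" parameter are illustrative choices, not part of any tool.

```python
import subprocess
from pathlib import Path


def build_command(pdf, out_txt, tool="pdftotext"):
    """Build the argument list: <tool> <input.pdf> <output.txt>."""
    return [tool, str(pdf), str(out_txt)]


def extract_all(directory, tool="pdftotext", run=subprocess.run):
    """Run the extractor once per PDF in `directory`, writing name.txt next to name.pdf."""
    outputs = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        out_txt = pdf.with_suffix(".txt")
        # check=False: one damaged PDF should not abort the whole batch.
        run(build_command(pdf, out_txt, tool), check=False)
        outputs.append(out_txt)
    return outputs
```

After extract_all(r"C:\directory") finishes, any output file that is empty or missing points to a PDF with no extractable text, which is the one that needs OCR.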

The other thing you are perhaps overlooking is that Google has already scanned virtually all the PDFs it encounters and made text files from them, which it stores on its own servers. In many Google searches you will see the PDF file listed first, but under it there is a text file (also linked) with all the extractable text already in a file, cached on their servers. There may be a way to search Google for just the cached pages, not the original PDFs. If you look at enough of them, you might find they are all on one server.

Also, before you go off writing code with assumptions that could be broken, look at these links --
http://www.google.com/search?num=30&q=batch+extract+text+from+PDF+
I think you will find that many people have already done what you want -- don't reinvent.
 

Author Comment

by:asgarcymed
ID: 21849388
THANK YOU VERY MUCH FOR EVERYTHING! You are really a nice guy! ;)
Best regards.
 

Expert Comment

by:scrathcyboy
ID: 21849734
my pleasure, and good luck.