  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 2587

Detect if PDF files have searchable text or not...

As far as I know, a PDF (Portable Document Format, created by Adobe) file can be created in two different ways:

*) From a scanner, digital camera, or screenshot/screen-capture application, in which case it is an image-based PDF without text (no searchable text)...

*) From a document file with text (searchable text), such as Word or OpenOffice Writer...

This is very important to me because I am an eBook collector, and I use Windows Desktop Search (WDS) and Google Desktop Search (GDS) a lot! I have many, many PDF files, and I need a way to automatically scan each one and identify whether it has searchable text or not (if not, I will OCR it!). My final goal is to have every PDF with searchable text (the image-based PDFs will be OCRed once they have been identified and isolated), so that I can trust the WDS/GDS results. I know I cannot trust WDS/GDS while I have a large number of image-based PDFs without text, since these files are ignored...


My question is: how can I automatically detect whether a PDF needs to be OCRed?

Thanks in advance!
Asked by: asgarcymed
1 Solution
 
scrathcyboy Commented:
There is no process to automate it -- Adobe specifically avoids any command-line arguments in any of their products. They assume you have your whole life to spend sitting in front of a computer, pointing at their software, and going through all the clicks they want you to do, wasting your life away (I am serious!!).

Hence, you will have to open every PDF and check whether the binoculars (search) icon is highlighted. If there is no searchable text in the document, the binoculars will be grayed out.

Here is Adobe's take on the problem (note that they admit it is a problem, but do nothing about it) --

http://blogs.adobe.com/acrolaw/2007/02/is_that_pdf_sea.html
 
asgarcymed (Author) Commented:
Damn!... I am sad and disappointed... :( :(

However, I must thank you for answering my question!!

I thought about it and tried a method: using VeryPDF PDF2TXT, which has batch support, I get one TXT file containing the text of each PDF file. If a PDF has no text, the TXT file is still created, but it is empty.
Although this method is a long way from what I really want, it is better than nothing... Do you want to suggest anything else?
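The empty-TXT check described here could be automated with a short script. The following is only a sketch in Python, assuming the converter writes each TXT file with the same base name as its PDF; the function name `pdfs_needing_ocr` and the `min_chars` parameter are hypothetical, not part of any tool mentioned in this thread.

```python
from pathlib import Path


def pdfs_needing_ocr(pdf_dir, txt_dir, min_chars=1):
    """Return the PDFs whose extracted .txt file holds (almost) no text.

    Assumption: each PDF "name.pdf" was converted to "name.txt" in txt_dir.
    """
    flagged = []
    for pdf in sorted(Path(pdf_dir).iterdir()):
        if pdf.suffix.lower() != ".pdf":
            continue
        txt = Path(txt_dir) / (pdf.stem + ".txt")
        if not txt.exists():
            continue  # not converted yet, so no verdict either way
        text = txt.read_text(encoding="utf-8", errors="ignore")
        # Whitespace-only output is treated the same as an empty file.
        if len(text.strip()) < min_chars:
            flagged.append(pdf)
    return flagged
```

Raising `min_chars` above 1 would also flag PDFs that yield only a few stray characters, which may be worth trying if some scanned files carry a small text stamp or page label.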


Thank you!
 
scrathcyboy Commented:
I was going to suggest non-Adobe PDF programs -- there are zillions of them. Pardon me if I give you a general link; there are way too many to list individually. Check them out, you will find something of interest.

http://www.google.com/search?num=30&q=free+PDF+text+convert+software

Notice it's listing the free software first.  Good luck.
 
asgarcymed (Author) Commented:
After some Google searches, I had a new idea (I do not know if it is a good one, so I am posting it here hoping you will critique it): image-only PDF files usually have a bigger file size than text-only PDF files with the same number of pages...
I know something about AutoIt and VBScript...
There are some ActiveX/COM DLLs that count PDF pages...
So I thought I could write a script which does the following:

1) Get each PDF file's SIZE

2) Get each PDF file's NUMBER OF PAGES

3) Calculate the RATIO = SIZE / NUMBER OF PAGES

4) Condition:

If RATIO is High, Then it is, very probably, an image-only PDF file

Else If RATIO is Low, Then it is, very probably, a text-only PDF file

Else = Doubt

End If
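The condition above can be sketched in a few lines of Python. This is only an illustration of the heuristic as stated: the function name `classify_by_ratio` is hypothetical, the page count is assumed to come from some external source (such as the COM page-counting DLLs mentioned above), and the two thresholds are placeholder values that would need calibrating against a sample of known PDFs.

```python
def classify_by_ratio(size_bytes, page_count, low=20_000, high=100_000):
    """Classify a PDF by its bytes-per-page ratio.

    low and high are placeholder thresholds in bytes per page (assumptions,
    not measured values); anything between them is reported as "doubt".
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    ratio = size_bytes / page_count
    if ratio >= high:
        return "probably image-only"
    if ratio <= low:
        return "probably text"
    return "doubt"
```

For example, `classify_by_ratio(3_000_000, 10)` gives a ratio of 300,000 bytes per page and falls in the "probably image-only" branch. Files landing in the "doubt" band would still need a second check, such as a text-extraction pass.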


Does this look silly or reliable? Why?

Thank you!
Regards.
 
scrathcyboy Commented:
That would be workable -- ASSUMING the PDF is "normally" constructed. Some PDFs could have lots of tiny text (footnotes and indices as well) and therefore be as large as a PDF with, say, just one image per page scattered throughout the document.

Almost all new PDFs have a mixture of images and text, so this still does not solve the main goal of getting out whatever searchable text there is in the PDF. Among the links listed, there should be several programs (many free) that you can use to extract the searchable text. If those programs have command lines, then you can really go to town. Say -x = extract, -t = text, -f = file to write to; then here goes --

for %f in (C:\directory\*.pdf) do PDF-extractor.exe -x -t -f "C:\directory\%~nf.txt" "%f"

There's a hypothetical batch to process them all in one command.  The problem is, programmers are not using command line options like they used to -- so that is the challenge.
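The same loop could also be driven from a script. The sketch below is in Python and is only illustrative: `PDF-extractor.exe` and its `-x`, `-t`, `-f` flags are the hypothetical ones from this post, not a real program, and passing the input PDF as the last argument is a further assumption. A real extractor's actual options would have to be substituted.

```python
import subprocess
from pathlib import Path

# Hypothetical tool name and flags, taken from the post above.
EXTRACTOR = "PDF-extractor.exe"


def build_commands(directory):
    """Build one extractor command line per PDF found in directory."""
    commands = []
    for pdf in sorted(Path(directory).iterdir()):
        if pdf.suffix.lower() != ".pdf":
            continue
        out = pdf.with_suffix(".txt")  # write name.txt next to name.pdf
        commands.append([EXTRACTOR, "-x", "-t", "-f", str(out), str(pdf)])
    return commands


def run_all(directory):
    """Run the extractor on every PDF; a failure on one file does not stop the rest."""
    for cmd in build_commands(directory):
        subprocess.run(cmd, check=False)
```

Keeping the command construction separate from the execution makes it possible to print and inspect the command lines before anything is run.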

The other thing you are perhaps overlooking is that Google has already scanned virtually all the PDFs it encounters and made text files from them, which it stores on its own servers. In many Google searches you will see the PDF file listed first, but under it there is a text file (also linked) with all the extractable text already in a file, cached on their servers. There may be a way to search Google for just the cached pages, not the original PDFs. If you look at enough of them, you might find they are all on one server.

Also, before you go off writing code with assumptions that could be broken, look at these links --
http://www.google.com/search?num=30&q=batch+extract+text+from+PDF+
I think you will find that many people have already done what you want -- don't reinvent.
 
asgarcymed (Author) Commented:
THANK YOU VERY MUCH FOR EVERYTHING! You are really a nice guy! ;)
Best regards.
 
scrathcyboy Commented:
My pleasure, and good luck.
Question has a verified solution.
