Solved

How To Determine File Type of a PDF?

Posted on 2008-10-25
Last Modified: 2010-04-29
I have many thousands of PDFs that were OCRed, to be saved in "text under image" format.  Several PCs were working in tandem.  One was accidentally set for "text and image", which does not provide an exact duplicate of the original PDF's appearance.  The files are readable, but not exactly the same as the originals, and we need them to be.  I could redo the OCR on the bad ones, but I don't know which is which, and it is not possible to just open them all and look.  Is there a way to open a PDF file and determine how it has been saved?  If so, I could scan them all, make a list of bad ones, and go get the originals to be re-OCRed.
Question by:Mike Caldwell
9 Comments
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 22823968
Which version of Acrobat did you use to OCR the files? Chances are that you cannot determine which method was used to OCR the files, but I want to double check with the version of Acrobat that you used.
 
LVL 1

Author Comment

by:Mike Caldwell
ID: 22824645
OCR was done using ABBYY8, a stand-alone OCR package.  I know that some files were processed using 'text over image', because it was saved as a standard script that way.  About 10% of our files have been so-OCRed, and we'd like to find a way to electronically examine each file and determine which is which, then make a list and redo them.  Otherwise we have 3 million PDFs to redo just in case.  The originals were PDF images, which we OCR to make the text searchable.  As they are, they are usable, but the images presented are somewhat funky compared to the original.  When we save using "Text under image" the image is identical to the original.
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 22824740
When you do a text under image, the text will not have any color. You can actually select the text with the touchup text tool and change the color, and you will be able to see it (you may have to move the image first). If you have a tool that is able to identify text that actually has a color assigned, you can identify your documents that need to be redone.

You can do this with PitStop (http://www.enfocus.com) - they also have a server version that allows you to check many files. I just ran the test: for a file that contains hidden text, there is no color information in the file; for a file that contains "normal" text, you will find a colorspace of "Gray".

You may want to download the eval version of PitStop and experiment with it to see if the Abbyy files also show this difference (I've only tried Acrobat OCR generated files).

 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 22830134
I just ran a quick test with Abbyy, and it does not look good: I find a reference to a grayscale color space even with the "text under image" option.

That leaves only one option: You need to analyze the contents of the PDF document, and that is not trivial. You need some software that "knows" what the difference between the two file types is, and can then report that it found either one or the other.

In general, you need to look at the sequence of operations in the PDF page content to find out if the text is under or over the large image on the page. Creating such an application requires very detailed knowledge of the PDF structure and of a PDF framework/toolkit that allows you to look at the "guts" of a PDF file.
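To make "the sequence of operations" a bit more concrete, here is a minimal Python sketch. It assumes a PDF toolkit has already handed you one page's content stream, decoded to text; that extraction step is not shown. The function names are made up for illustration. It only recognizes the standard PDF operators `BT` (begin a text object) and `Do` (paint an XObject, which is usually but not always an image), and it ignores inline images, so treat it as a starting point rather than a finished detector.

```python
import re

def strip_strings(content):
    """Drop literal (...) strings so operator-like words inside them are ignored."""
    out, depth, i = [], 0, 0
    while i < len(content):
        ch = content[i]
        if depth and ch == "\\":       # escaped character inside a string
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth:
            depth -= 1
            if depth == 0:
                out.append(" ")        # keep neighbouring tokens separated
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)

def paint_sequence(content):
    """Return one label per painted object, in paint order.

    "text"  = a BT operator (start of a text object)
    "image" = a Do operator (assumed here to paint an image XObject)
    """
    tokens = re.findall(r"/?[^\s\[\]<>(){}/%]+", strip_strings(content))
    labels = {"BT": "text", "Do": "image"}
    return [labels[t] for t in tokens if t in labels]

# Hypothetical page content: one full-page image, then a text object.
page = "q 612 0 0 792 0 0 cm /Im0 Do Q BT /F1 10 Tf (Hello) Tj ET"
print(paint_sequence(page))
```

Objects painted earlier in the stream end up underneath objects painted later, so comparing this list for one known-good and one known-bad file should show what the real difference looks like in the Abbyy output.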
 
LVL 1

Author Comment

by:Mike Caldwell
ID: 22847974
I have a very skilled programmer who could examine each file, but I do need to tell him what to look for.  We are looking for an unknown quantity amongst at least a million PDFs, so automation is definitely needed.
 
LVL 44

Accepted Solution

by:
Karl Heinz Kremer earned 500 total points
ID: 22867687
It's not trivial!
You need to use a PDF library (for example iText in its Java or .NET version - http://www.lowagie.com/iText/) to parse the content stream.

Load both of your document types into Acrobat and bring up the "Content" pane (View > Navigation Panels > Content) and look at the differences between the two page content types. For the "text behind image" type you have one image object followed by the text. The other type of document will have text objects and image objects intermixed. The key is probably that you will find many image objects per page.

Let your skilled programmer take a look at the iText documentation, samples and the Content panel. If you have any questions, please let me know.
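Once your programmer can get an ordered list of painted objects per page out of the library, the rule of thumb above could be turned into a batch check along these lines. This is only a Python sketch of the idea: the function names, the "text"/"image" labels, and the file names are made up for illustration, and the rule itself (exactly one image means "text behind image", several images mixed with text means the other type) is my guess from the Content pane, so verify it against known samples of both kinds before trusting it.

```python
def classify_page(sequence):
    """Classify one page from its ordered list of "text"/"image" labels."""
    images = sequence.count("image")
    if "text" not in sequence:
        return "no-text"               # page was apparently never OCRed
    if images == 1:
        return "text-under-image"      # one page image plus the text layer
    if images > 1:
        return "text-and-image"        # many image pieces mixed with text
    return "unknown"                   # text but no image at all

def files_to_redo(pages_by_file):
    """pages_by_file maps a file name to a list of per-page label lists."""
    return sorted(
        name
        for name, pages in pages_by_file.items()
        if any(classify_page(p) == "text-and-image" for p in pages)
    )

# Hypothetical input, as it might come out of a PDF library:
sample = {
    "good.pdf": [["image", "text"]],
    "bad.pdf": [["text", "image", "text", "image", "text"]],
}
print(files_to_redo(sample))
```

Flagging a file as soon as any single page looks wrong errs on the side of redoing too many files rather than too few.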
 
LVL 1

Author Comment

by:Mike Caldwell
ID: 22915453
We're probably going to purchase the PDFLib for PDF; perhaps it will also let us take a peek inside.