How To Determine File Type of a PDF?

I have many thousands of PDFs that were OCRed and were meant to be saved in "text under image" format. Several PCs were working in tandem. One was accidentally set to "text and image", which does not provide an exact duplicate of the original PDF's appearance. The files are readable, but they are not exactly the same, and we need them to be. I could redo the OCR on the bad ones, but I don't know which is which, and it is not possible to open them all and look. Is there a way to open a PDF file and determine how it was saved? If so, I could scan them all, make a list of the bad ones, and go get the originals to be re-OCRed.
Mike Caldwell, Consultant to IP industry, asked:

Karl Heinz Kremer commented:
Which version of Acrobat did you use to OCR the files? Chances are that you cannot determine which method was used to OCR the files, but I want to double check with the version of Acrobat that you used.
Mike Caldwell, Consultant to IP industry (author), commented:
OCR was done using ABBYY8, a stand-alone OCR package. I know that some files were processed using 'text over image', because it was saved as a standard script that way. About 10% of our files have been OCRed that way, and we'd like to find a way to examine each file electronically and determine which is which, then make a list and redo them. Otherwise we have 3 million PDFs to redo just in case. The originals were PDF images, which we OCR to make the text searchable. As they stand, the files are usable, but the images presented are somewhat funky compared to the original. When we save using "Text under image", the image is identical to the original.
Karl Heinz Kremer commented:
When you do a text under image, the text will not have any color. You can actually select the  text with the touchup text tool and change the color, and you will be able to see it (you may have to move the image first). If you have a tool that is able to identify text that actually has a color assigned, you can identify your documents that need to be redone.

You can do this with PitStop (http://www.enfocus.com). They also have a server version that lets you check many files. I just ran the test: for a file that contains hidden text, there is no color information in the file, while for a file that contains "normal" text, you will find a colorspace of "Gray".

You may want to download the eval version of PitStop and experiment with it to see if the Abbyy files also show this difference (I've only tried Acrobat OCR generated files).

Karl Heinz Kremer commented:
I just ran a quick test with Abbyy, and it does not look good: I find a reference to a grayscale color space even with the "text under image" option.

That leaves only one option: you need to analyze the contents of the PDF document, and that is not trivial. You need some software that "knows" what the difference between the two file types is, and can then report that it found either one or the other.

In general, you need to look at the sequence of operations in the PDF page content to find out if the text is under or over the large image on the page. Creating such an application requires very detailed knowledge of the PDF structure and of a PDF framework/toolkit that allows you to look at the "guts" of a PDF file.
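
As a rough illustration of looking at the "sequence of operations", here is a minimal Python sketch. It assumes the page content stream has already been extracted and decompressed by a PDF library (that step is not shown), that images are painted with the `Do` operator rather than as inline images, and that every `Do` refers to an image (a form XObject would look the same here). The names `strip_strings` and `paint_sequence` are made up for this example.

```python
import re

def strip_strings(content: bytes) -> bytes:
    """Blank out literal (...) strings and % comments so that
    operator-like bytes inside them are not mistaken for operators."""
    out = bytearray()
    depth = 0          # nesting depth of literal-string parentheses
    i, n = 0, len(content)
    while i < n:
        c = content[i:i + 1]
        if depth > 0:
            if c == b"\\":      # escaped character inside a string
                i += 2
                continue
            if c == b"(":
                depth += 1
            elif c == b")":
                depth -= 1
            i += 1
            continue
        if c == b"(":
            depth = 1
            out += b" "         # keep a separator where the string was
        elif c == b"%":
            # a comment runs to the end of the line
            while i < n and content[i:i + 1] not in (b"\n", b"\r"):
                i += 1
            continue
        else:
            out += c
        i += 1
    return bytes(out)

# A token is a run of regular characters, optionally led by "/" (a name).
TOKEN = re.compile(rb"/?[^\s\[\]<>(){}/%]+")

def paint_sequence(content: bytes) -> list:
    """Return the order in which text objects and images are painted."""
    events = []
    for tok in TOKEN.findall(strip_strings(content)):
        if tok == b"BT":        # start of a text object
            events.append("text")
        elif tok == b"Do":      # paints an XObject, assumed to be an image
            events.append("image")
    return events

sample = b"q 612 0 0 792 0 0 cm /Im0 Do Q BT /F1 10 Tf (Do BT hello) Tj ET"
print(paint_sequence(sample))   # ['image', 'text']
```

Comparing the sequences from a known-good and a known-bad file should show whether the two save modes can be told apart this way.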
Mike Caldwell, Consultant to IP industry (author), commented:
I have a very skilled programmer who could examine each file, but I do need to tell him what to look for. We are looking for an unknown quantity amongst at least a million PDFs, so automation is definitely needed.
Karl Heinz Kremer commented:
It's not trivial!
You need to use a PDF library (for example iText in its Java or .NET version - http://www.lowagie.com/iText/) to parse the content stream.

Load both of your document types into Acrobat, bring up the "Content" pane (View > Navigation Panels > Content), and look at the differences between the two page content types. For the "text behind image" type you have one image object followed by the text. The other type of document will have text objects and image objects intermixed. The key is probably that you will find many image objects per page.
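
The heuristic above could be turned into a check along these lines. This is only a sketch under assumptions: each page is taken to be already reduced to an ordered list of "image" and "text" labels (how that list is produced depends on the PDF library), and the `max_images` threshold and the function names `classify_page` and `needs_redo` are hypothetical. The rule itself is just the guess described above, so it should be validated against files whose save mode is known.

```python
def classify_page(events, max_images=1):
    """Return 'single-image', 'intermixed', or 'no-image' for one page.

    events is the ordered list of "image" / "text" labels for the page.
    """
    images = events.count("image")
    if images == 0:
        return "no-image"
    # Count how often the page switches between image and text painting.
    switches = sum(1 for a, b in zip(events, events[1:]) if a != b)
    if images <= max_images and switches <= 1:
        return "single-image"   # one image plus one run of text
    return "intermixed"         # many images, or text and images interleaved

def needs_redo(pages):
    """Flag a document if any of its pages looks intermixed."""
    return any(classify_page(p) == "intermixed" for p in pages)

good = [["image", "text", "text"]]
bad = [["text", "image", "text", "image"]]
print(needs_redo(good), needs_redo(bad))   # False True
```

Running this over the whole collection and writing out the flagged file names would give the list of documents to re-OCR.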

Let your skilled programmer take a look at the iText documentation, samples and the Content panel. If you have any questions, please let me know.
Mike Caldwell, Consultant to IP industry (author), commented:
We're probably going to purchase the PDFLib for PDF; perhaps it will also let us take a peek inside.

Question has a verified solution.