Cmitch


Detecting if PDF file has content

I have a VB.NET application that loops through various Adobe PDF documents, and I need to determine whether any of the documents contain coded content. What would the most efficient approach be, given that a large number of documents will be reviewed?


Cheers
Karl Heinz Kremer

What exactly do you mean by "coded content"?
What exactly do you mean by "has content"?

How would you categorize a document that has only a 1x1 image that only contains transparency? It would never show up when you display the page, so the page would appear empty. Is that content, or not?

Just to state the obvious: Every PDF file does have content - it has to contain at least one page. There is no need for content on that page, but at least an empty page needs to always be there.
Cmitch

ASKER

Sorry. To clarify, I need to detect whether there is any OCR text content within the selected Adobe PDF files. It has been suggested to me that one possible solution is to check whether text fonts are present within the file, but I am unsure how this would be done. Any suggestions?

It sounds like you would need to search the content of the PDF. One tool that can be used is here: http://www.aquaforest.com/en/ocrsdk.asp (Searchable PDFs).
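
For the font check mentioned in the question, a rough first pass might look like the sketch below. It is only a heuristic, not a definitive test: it assumes that a PDF with a text layer declares at least one font, so it looks for the `/Font` name in the raw file bytes and inside any Flate-compressed streams. The function names `pdf_bytes_mention_fonts` and `pdf_file_mentions_fonts` are made up for illustration, and the code is written in Python with the standard library only, so the same logic should port to VB.NET. It will not see into encrypted files, and a hit only shows that fonts are declared (a scanned page with a stamped page number would also match), so a proper PDF library is still needed for a reliable answer.

```python
import re
import zlib

# Captures the raw bytes between the "stream" and "endstream" keywords.
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def pdf_bytes_mention_fonts(data: bytes) -> bool:
    """Heuristic: True if a /Font name appears anywhere in the PDF bytes."""
    # Font and resource dictionaries are often stored uncompressed.
    if b"/Font" in data:
        return True
    # PDF 1.5+ may pack dictionaries into Flate-compressed object streams,
    # so also look inside every stream that zlib is able to inflate.
    for match in _STREAM_RE.finditer(data):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it
        if b"/Font" in inflated:
            return True
    return False


def pdf_file_mentions_fonts(path: str) -> bool:
    """Read a file from disk and apply the same heuristic."""
    with open(path, "rb") as handle:
        return pdf_bytes_mention_fonts(handle.read())
```

Since this reads each file once and never renders a page, it should be cheap enough to use as a filter over a large batch, with only the files that pass handed on to a slower text extraction step.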
ASKER CERTIFIED SOLUTION
Karl Heinz Kremer

Cmitch

ASKER

Thanks for the very thorough response; it is very helpful.
Does dtsearch have the functionality to perform this task by only detecting the content, without indexing it?
What is dtsearch?