curiouswebster

how to programmatically isolate PDF from image scan versus an original PDF?

I have a folder filled with PDFs, most of which are scanned copies. But I need a way to pull out the original versions.

I do not want to deal with OCR software and need originals.

Is there a tool which can do this parsing to find originals?

Thanks
Joe Winograd

Based on keywords in your question — programmatically, PDF, image, scan, OCR, parsing — I think I can help you, as I'm knowledgeable in those topics. However, I don't understand the question, so let's try to clarify:

(1) "isolate PDF from image scan"

When you scan, it is always to an "image" (aka bitmap or graphic). It is often to PDF format, but can be to other formats, such as BMP, JPG, PNG, etc. What do you mean by isolating the PDF from the image? The PDF *is* the image (of course, it could be OCR'ed to create text in the PDF, too).

(2) "versus an original PDF"

What do you mean by "original PDF"? PDF Normal (text, no image)? PDF Searchable Image (text from OCR and image)? Something else?

(3) "I need a way to pull out the original versions"

Probably the same issue as (2) above...what are "original versions"? I don't know what you mean by that.

(4) "I do not want to deal with OCR software and need originals."

One more time...what do you mean by originals? And what does OCR software have to do with it?

(5) "Is there a tool which can do this parsing to find originals?"

What "parsing" do you want done? And, again, what "originals"?

Regards, Joe
curiouswebster

ASKER

Hi Joe,

When a merchant gets his credit card statement from his provider, it's an original PDF. On the other hand, when he scans his PDF and emails the scanned copy as an attachment, it is an image.

I need to find a way to parse through hundreds of PDF files, most of which are scanned, to find which few are actually originals from the service provider.

I then need to write a Windows program that imports data from a PDF, but I want to start with a solid PDF report that has no errors, since it is based upon the original data sent from the service provider. If I cannot find an original PDF, I may not write the program. (We can talk about OCR on another thread...)

Make sense?
ASKER CERTIFIED SOLUTION
Joe Winograd
This solution is only available to members.
Thank you for beginning to enlighten me.

Yes, I want to search the contents of the PDF from the Windows program I hope to write. I guess the term "original" was based on the hope there would be no errors.

You see, the merchant always has possession of the "original." But for my development, I need to find a few needles in the haystack, so I can use them as samples as I write my program.

So, original was the first version of the PDF to be created.

Is there a way using some tool or script to scan for original files?
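
One rough way to approach this without any PDF library is to look at the raw bytes of each file: a scan that was never OCR'ed usually contains image objects but no font resources. The sketch below is only a heuristic under that assumption. The function names are made up for illustration, PDFs that store their objects in compressed object streams will come back as "unknown", and a scan that was OCR'ed into a searchable PDF also contains fonts, so "has-text" does not prove the file came straight from the provider.

```python
import re
from pathlib import Path

# Name tokens as they appear in uncompressed PDF object dictionaries.
FONT_MARKER = re.compile(rb"/Font\b")
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Return 'has-text', 'image-only', or 'unknown' for raw PDF bytes.

    Heuristic only: it checks for font and image markers in the
    uncompressed parts of the file and does not parse the PDF.
    """
    if FONT_MARKER.search(data):
        return "has-text"
    if IMAGE_MARKER.search(data):
        return "image-only"
    return "unknown"


def classify_folder(folder: str) -> dict:
    """Map each *.pdf file name in `folder` to its classification."""
    return {
        path.name: classify_pdf_bytes(path.read_bytes())
        for path in sorted(Path(folder).glob("*.pdf"))
    }
```

Calling `classify_folder` on the folder of statements gives a first-pass list of candidates to open and check by hand.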
SOLUTION
Joe, I may have been getting ahead of myself with my quest to find an "original" PDF.

If you feel I can use pdftotext or another Xpdf utility to convert the PDF files to plain text files, then I am happy. This means I am free to write my app and will have no shortage of report formats to test against.
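
As a minimal sketch of that idea, the script below runs pdftotext over every PDF in a folder and keeps the files that yield a meaningful amount of text. It assumes the `pdftotext` executable is on the PATH. The helper names and the 50-character threshold are illustrative choices, not something from this thread. A scan that was already OCR'ed will also produce text, so this separates image-only files from text-bearing ones rather than proving a file is the provider's original.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path: Path, pdftotext: str = "pdftotext") -> str:
    """Run pdftotext on one file and return the text it writes to stdout."""
    result = subprocess.run(
        [pdftotext, "-enc", "UTF-8", str(pdf_path), "-"],  # "-" means stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text: str, min_chars: int = 50) -> bool:
    """True when the text has at least `min_chars` non-whitespace characters.

    Page breaks (form feeds) and blank lines do not count, so an
    image-only PDF, which yields little more than those, is rejected.
    """
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def find_text_pdfs(folder: str) -> list:
    """Return the paths of PDFs in `folder` from which text can be extracted."""
    return [
        path
        for path in sorted(Path(folder).glob("*.pdf"))
        if has_text_layer(extract_text(path))
    ]
```

The files returned by `find_text_pdfs` are the ones worth using as samples; the rest would need OCR before any data could be imported.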

My worry came when I opened a PDF based on a scanned report inside Adobe Pro CC and found, on clicking the Edit field, that a number of fields were not selectable.

I suspect that Adobe Acrobat failed to convert clearly visible numbers to editable text. Is your OCR reader better than Adobe?

Do I need to provide a tool to enable merchants to edit the result of the scan?

I think it would be pretty cool if I could import the PDF and then perform some validity tests on the scanned numbers. Then, I could flag when an error has occurred.

It would be like the check-bit on a byte of information.
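
One possible validity test of this kind, sketched under the assumption that a statement lists individual amounts alongside a stated total: parse each amount strictly and confirm that the items add up to the total. The function names and the sample layout are hypothetical, since the real statement format is not shown here. A misread digit or a letter substituted for a digit makes the check fail, much like a parity error.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional, Sequence


def parse_amount(raw: str) -> Optional[Decimal]:
    """Parse a currency string such as '$1,200.50'; return None if invalid."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None  # e.g. OCR read the letter 'O' where a zero belongs
    return value if value.is_finite() else None


def totals_agree(line_items: Sequence[str], stated_total: str) -> bool:
    """True only if every amount parses and the items sum to the total."""
    amounts = [parse_amount(item) for item in line_items]
    total = parse_amount(stated_total)
    if total is None or any(amount is None for amount in amounts):
        return False
    return sum(amounts, Decimal("0")) == total
```

A statement that fails `totals_agree` could then be flagged for the merchant to review instead of being imported silently.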

Thanks again for the help.
SOLUTION
Last question...

>Here is my standard default dictionary:

What is the purpose of this dictionary?
SOLUTION
Thanks!
You're welcome, but the "answer" you picked is the post with just my clarification questions in it. I suggest that you re-open the question and re-close it by selecting the post(s) that actually answer the question, which, imo, are:

#a42632942
#a42632978
#a42633416

Thanks, Joe
I already placed a request. Moderator, please re-open this question.
Thanks!
thanks
You're welcome...and thanks to you for re-closing it...much appreciated!
Happy to look at the follow-on question...and thanks for the bonus points on this one...much appreciated! Regards, Joe