curiouswebster

Need a simple program that scans folders for "native" PDF files.

I have thousands of PDF's in a folder, but most were created from a scanner, followed by the automated use of some OCD scanner. But I need to find those few PDF files which originated from an inbound email (or other means) where the PDF was created by the original producer, whether it was a bank or other service provider.

I am very new to what makes up a PDF file and apologize if this question is missing the mark slightly.

I have been told there's a metadata field called PDF Producer. I have also been told it's easy to write a program to retrieve that PDF Producer field and look for values that indicate an original/native PDF. Here are example PDF Producer values...

OpenText Output Transformation Engine - 16.1.12
Ricoh Americas Corporation, AFP2PDF Plus Version: 1.100.01, Linux

My problem is that I do not have a list of such values. So, perhaps the PDF Producer field has other attributes which we could use?

So, I am searching for "native" PDF files which are a needle in a haystack and need some simple automation that can help me find them.

Suggestions for what exactly to scan for? I feel that once that is known, the actual work of writing the program is trivial, as is copying those few PDFs which match.

Thanks
Joe Winograd

A few points:

(1) Based on previous questions here at EE and on private correspondence, I know that by "native" (which you also call "original") PDF, you mean a PDF that has not been scanned. It came from a provider, such as a bank or credit card company, and was likely produced programmatically by some type of ERP system (or maybe just from a Word doc).

(2) There's nothing in the PDF metadata that identifies a doc as being scanned or one that has had OCR performed on it. Thus, in the case of what you call a "native/original" PDF and the case of a scanned/OCR'ed PDF, there's plenty of text, so that is not a distinguishing attribute. As I've mentioned in the past, I can use the Xpdf utility called PDFtoText to detect image-only PDFs, but not to distinguish a native/original PDF from a scanned/OCR'ed one, since they both have text.

(3) As I've mentioned before, I can write a program that retrieves the metadata field called PDF Producer. If you can identify the PDF Producers from the banks/providers that you deal with, I can check for those, but it sounds as if you can't identify all of them (and there are a huge number of them). Conversely, if you can identify the PDF Producers of the scanned/OCR'ed docs that you have, I can check for those (I've mentioned in the past ABBYY FineReader 14, Adobe PDF Scan Library 3.2, Nuance Power PDF, PaperPort 11, PaperPort 12, PaperPort 14, PDF-XChange Core API SDK, and there are many others).

> I have thousands of PDF's in a folder, but most were created from a scanner, followed by the automated use of some OCD [JW: I assume you mean OCR] scanner.

Were they all created from the same scanner or an identifiable set of scanners (and scanning software)? If so, the "conversely" test that I mentioned above will work. The program will toss out all the files with a PDF Producer showing the scanning/imaging software, so the remaining files will be the original/native PDFs. Regards, Joe
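
A minimal sketch of that "conversely" test, in Python. It assumes the PDF Producer strings have already been read by some other means, and it uses only the scanning/OCR product names mentioned in this thread as the exclusion list. The function names are hypothetical, and the matching is a plain case-insensitive substring check.

```python
# Scanning/OCR product names taken from this thread; the list is not complete.
SCANNER_PRODUCERS = [
    "ABBYY FineReader",
    "Adobe PDF Scan Library",
    "Nuance Power PDF",
    "PaperPort",
    "PDF-XChange Core API SDK",
]


def is_scanner_producer(producer, known=SCANNER_PRODUCERS):
    """Return True if the Producer string contains a known scanning/OCR product name."""
    text = (producer or "").lower()
    return any(name.lower() in text for name in known)


def probable_natives(records, known=SCANNER_PRODUCERS):
    """Keep the (path, producer) pairs whose Producer is not on the scanner list.

    A file with an empty Producer is kept too, so it still needs a manual look.
    """
    return [(path, prod) for path, prod in records
            if not is_scanner_producer(prod, known)]
```

Whatever survives the filter is only "probably native": a scanning product that is missing from the list will slip through.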
That should be rather easy. You read the PDF via a PDF library and if you find any text, you can mark it as "native"

You will even be able to use the free version of GemBox Document for this
https://www.gemboxsoftware.com/document
Hi Shaun,
That doesn't do what he wants. As I mentioned in my previous post, it's easy to detect if a PDF has text in it, but a PDF that has been scanned and OCR'ed also has text in it. He doesn't consider a scanned/OCR'ed PDF to be native/original. Regards, Joe
In my experience, scanned PDFs do not have text.
As I mentioned above, it's easy to detect a scanned, image-only PDF. But his scanned PDFs have had OCR performed on them, so they have text from the OCR process (as well as the image/bitmap from scanning). This is known as Searchable PDF (or PDF Searchable Image) and is the default these days with lots of document scanning/document imaging software.
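
To illustrate why a text check only separates image-only PDFs from everything else, here is a rough Python sketch. It assumes the Xpdf command-line tool pdftotext is installed and on the PATH, and that passing "-" as the output file sends the text to stdout. The function names are hypothetical.

```python
import subprocess


def has_text_layer(extracted_text):
    """True if the extracted text holds anything besides whitespace and form feeds."""
    return bool(extracted_text.strip())


def extract_text(pdf_path):
    """Run pdftotext on one file and return whatever it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, errors="replace", check=True,
    )
    return result.stdout


def is_image_only(pdf_path, extractor=extract_text):
    """True for a scanned PDF with no OCR text; False for native AND scanned/OCR'ed PDFs."""
    return not has_text_layer(extractor(pdf_path))
```

A Searchable PDF returns False here just as a native PDF does, which is the limitation described above.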
curiouswebster

ASKER

Sorry I took so long to respond...

I am intrigued by this idea, Joe...

> Conversely, if you can identify the PDF Producers of the scanned/OCR'ed docs that you have, I can check for those (I've mentioned in the past ABBYY FineReader 14, Adobe PDF Scan Library 3.2, Nuance Power PDF, PaperPort 11, PaperPort 12, PaperPort 14, PDF-XChange Core API SDK, and there are many others).

But I have no idea what the potential OCR scanner could have been, and I expect it would be "all of the above" and then some.

So, what percent of the total market could be covered by a complete list of the above, and others we could dig up?

Thanks
> I have no idea what the potential OCR scanner could have been

Can you ask the people who do the scanning?

> what percent of the total market could be covered by a complete list of the above, and others we could dig up?

I don't know if we can get to a high percentage. I found a 2018 industry analysis that lists 15 scanner manufacturers, but each of them could be using different scanning software, especially over time. For example, Brother, which is one of the 15 companies in that industry analysis, has used the three versions of PaperPort that I mentioned in my previous post (11, 12, 14). When I came up with the "conversely" idea, I thought that you would know the folks who are doing the scanning and that you could ask them what scanning software they use. If that's not the case, the "conversely" idea may not fly.

Let's take a step back. Considering whatever it is that your program wants to do with the native/original PDFs, why not let it do that with the scanned/OCR'ed PDFs, too? They both have text. Sure, the OCR'ed text may not always be accurate, which means that your program may not always find what it is looking for (even when it really should be there, but the OCR messed up) — so what?! You get a negative when it should be a positive, but you're willing to discard that file, anyway, and not even consider it. In other words, I don't understand why it's a problem to let your program run on the scanned/OCR'ed PDFs — so what if it fails on one when it shouldn't? Regards, Joe
So, I am thinking about a poor man's solution that gets us closer to those native / original PDF's.

How about we make up a comprehensive list of names of OCR providers, even including obsolete names, when we have them.

We could then write a simple program that checks each PDF Producer for each PDF in an entire folder, and outputs the name of each producer (to text file) along with the filename and filepath. That list is then returned to me.

I can strip out records associated with known OCR Producers and can thus shrink the list down to producers that are lesser known, obsolete and native.

From there, I may be able to find a small collection of native PDF's, so I can provide a list of easy to find files that my partner can zip up and send to me.

This lets me to provide my partner a safe program to execute and to not request a bunch of unneeded confidential information, most (or all) of which will have no value to me. Instead, the few files I need I will get, and they will be native (or original).

Will this work?
Please answer my earlier question:

Why not let your program process scanned/OCR'ed PDFs, too? So what if the OCR is not accurate and your program fails on one when it shouldn't? My understanding is that you want to identify the so-called native/original PDFs due to lack of accuracy in OCR, but my question is: so what?

OK, moving on to your other comments.

> How about we make up a comprehensive list of names of OCR providers, even including obsolete names, when we have them.

Well, I don't know how "comprehensive" the list can be — would be a challenge.

> a simple program that checks each PDF Producer for each PDF in an entire folder, and outputs the name of each producer (to text file) along with the filename and filepath

No problem.

> I can strip out records associated with known OCR Producers

No need for you to do that manually. The program can do it by checking the "comprehensive list".

> provide a list of easy to find files that my partner can zip up and send to me

I'm really confused about the workflow here. If the files are on your system and the program is looking at the PDF Producer for each one, where does your "partner" come into play?

> This lets me to provide my partner a safe program to execute

Maybe this clears up the confusion — are you saying that the program runs on your partner's system, not yours? If so, why not run a program there — just once — that creates a text file with the fully qualified file name and the PDF Producer for every PDF file. By reviewing that, we can identify the likely PDF Producers for native/original files, so that subsequent runs on your partner's files (which, I presume, occur periodically) will create a list of only the probable native/original PDFs.
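
One possible shape for that one-time inventory run, sketched in Python. It assumes the Xpdf tool pdfinfo is installed and on the PATH and that its report includes a "Producer:" line. The function names, the column names, and the tab delimiter are hypothetical choices, not anything fixed in this thread.

```python
import csv
import os
import subprocess


def run_pdfinfo(pdf_path):
    """Return pdfinfo's report for one file, or '' if pdfinfo reports an error."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, errors="replace"
    )
    return result.stdout if result.returncode == 0 else ""


def parse_producer(pdfinfo_output):
    """Pull the value of the 'Producer:' line out of a pdfinfo report."""
    for line in pdfinfo_output.splitlines():
        if line.startswith("Producer:"):
            return line[len("Producer:"):].strip()
    return ""


def write_inventory(root, out_path, get_info=run_pdfinfo):
    """Walk root and write one tab-delimited row (full path, Producer) per PDF."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["path", "producer"])
        for folder, _dirs, files in os.walk(root):
            for name in sorted(files):
                if name.lower().endswith(".pdf"):
                    full = os.path.abspath(os.path.join(folder, name))
                    writer.writerow([full, parse_producer(get_info(full))])
                    count += 1
    return count
```

The output holds only file names and Producer strings, not document contents, so it can be sent back for review without sending the PDFs themselves.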

> Will this work?

Maybe, but I'd really like to understand why you need to determine native/original vs. scanned/OCR'ed PDFs. Regards, Joe
I feel I need assurance of native data for financial calculations. The program I plan to write will run on the desktop of merchants, so they all have the real PDF, from the service provider, likely via inbound email.


Yes, you got it right with this paragraph "Maybe this clears up the confusion"

The files are on my partner's PC, but I do not want any PDFs unless I am pretty sure they are native.

I would like to create a program that scans the folder specified and creates a delimited text file for easy import to Excel.

Once there, I could sort by PDF producer and delete huge chunks of records when I find them on the comprehensive list of producers.

BTW: We do not need to put much effort into making that list comprehensive. Once we see a name that has lots of rows, I can Google it and find if it's OCR or not.
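
The "lots of rows" check could also be done before opening Excel. A small Python sketch, assuming a tab-delimited file with a header row and columns named path and producer (hypothetical names):

```python
import csv
from collections import Counter


def producer_counts(inventory_path):
    """Return (producer, row count) pairs, most frequent producer first."""
    with open(inventory_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        counts = Counter(row["producer"] or "(none)" for row in reader)
    return counts.most_common()
```

The producers at the top of the list are the ones worth looking up first, since ruling out one of them removes the most rows.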

I think we have our answer.

It took a while, but we all hung in there...
ASKER CERTIFIED SOLUTION
Joe Winograd
This solution is only available to members.
> Why would they scan and OCR "the real PDF, from the service provider"?!

They virtually always fax the report to us for further analysis. And the sales people do not complain, since they are happy to have the report and to potentially close the deals.

The Windows program I ultimately write will be intended for merchants to use privately, where they are in possession of the real file. But once I am off the ground with my Windows program, I will be in a stronger position to add in OCR functionality.

Otherwise, I will review your program and descriptions. I am happy to see we have a solution that helps get past this step.

I will close this question.
thanks!