Need help with VB Script to detect bad PDF files

I have an OCR system that is fed files by a VB Script.  The OCR program locks up when it gets a bad PDF.  I have tried to open the PDF, and sure enough I get an error message that it is a bad file.  Adobe Reader says "File is damaged and could not be repaired."  I know how many pages are supposed to be in each file, so I wrote a quick script to count pages, but unfortunately it returned the expected page count.  So I suspect there is something like a missing EOF flag, or something similar.  I don't need to fix the files, just identify them so I can pull them out of the queue to the OCR machine.  Any method that will identify a bad file would be greatly appreciated.  VB Script is the only language I can deal with, but my son could handle a test written in PHP, Java, JavaScript, or Python.
Mike Caldwell, Consultant to IP industry, asked:

Mike Caldwell, Consultant to IP industry (Author) commented:
I opened a bad PDF with Notepad and, sure enough, there is no EOF marker.  What is the best way to test for that?
YZlat commented:
You could parse the content of the file to find the last line and check whether it contains %%EOF:

dim str
    Do Until objFile.AtEndOfStream
      x = objFile.ReadLine
      If objFile.atEndOfStream Then
         last_line = str
      End If
    Loop

if InStr(0, last_line, "%%EOF")=0 then
 ''file corrupt
end if

Joe Winograd, Fellow & MVE, Developer, commented:
If you're sure that the missing %%EOF is the only problem, you could search the file for that string. I have also found that adding %%EOF at the bottom using Notepad++ makes some of the PDFs readable (adding it with Notepad does not work). Regards, Joe
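A minimal sketch of the "search the file for that string" idea, limited to the tail of the file so that a stray %%EOF earlier in the file (PDFs saved with incremental updates can contain several) does not count. The function name HasEofNearEnd and the 1024-character window are illustrative assumptions, not something from this thread, and the sketch assumes a single-byte ANSI code page so that characters and bytes line up.

```vbscript
Option Explicit

' Hypothetical helper: True when "%%EOF" occurs in the last tailSize
' characters of the file. The name and the window size are assumptions.
Function HasEofNearEnd(path, tailSize)
    Dim fso, ts, size, tail
    Set fso = CreateObject("Scripting.FileSystemObject")
    size = fso.GetFile(path).Size
    Set ts = fso.OpenTextFile(path, 1)          ' 1 = ForReading
    ' Skip everything except the tail of the file.
    If size > tailSize Then ts.Skip size - tailSize
    tail = ""
    ' Guard against an empty file, where ReadAll would raise an error.
    If Not ts.AtEndOfStream Then tail = ts.ReadAll
    ts.Close
    HasEofNearEnd = (InStr(1, tail, "%%EOF") > 0)
End Function

If HasEofNearEnd("c:\junk\good.pdf", 1024) Then
    WScript.Echo "File OK"
Else
    WScript.Echo "File Corrupt"
End If
```

This only detects a missing trailer marker; a file that is damaged in some other way would still pass.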
Mike Caldwell, Consultant to IP industry (Author) commented:
Looks like just what I need.  However, I get the error message "Invalid procedure call or argument: InStr".  It looks fine to me.  I tried setting the binary "compare" flag to zero and to one; no change.
Mike Caldwell, Consultant to IP industry (Author) commented:
Joe, I don't want to take a chance of making the file "good" again; I might be cutting off some data I want.  I can go get them again.  So I just need to throw them out.  The system will notice it did not get returned to the server and will refetch it and try again.  It is possible the file copy is bad, but first I need to just keep the OCR process from shutting down, then I can work upstream.
Mike Caldwell, Consultant to IP industry (Author) commented:
YZlat, the problem is not with InStr; it is that the variable it is evaluating is zero length.  I modified the script to this:

      Set fso=CreateObject("Scripting.FileSystemObject")
      Set objFile=fso.OpenTextFile("c:\junk\p0_9082324_0017.pdf",1)

dim str
    Do Until objFile.AtEndOfStream
      x = objFile.ReadLine
      If objFile.atEndOfStream Then
         last_line = str
      End If
    Loop


msgbox "Last line length is " & len(last_line)
Mike Caldwell, Consultant to IP industry (Author) commented:
Just noticed the PDF is being opened as a text file; is that the issue?
YZlat commented:
Close the file before you run the code.
YZlat commented:
Change the first argument of InStr from 0 to 1:

if InStr(1, last_line, "%%EOF")=0 then
 ''file corrupt
end if
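The reason for the change is that the start position in InStr is 1-based, so a start of 0 raises "Invalid procedure call or argument" (runtime error 5). A small sketch that shows both cases; the sample string is made up for illustration.

```vbscript
Option Explicit

Dim s
s = "trailer %%EOF"                  ' made-up sample text

' Start position 1 searches from the first character.
WScript.Echo InStr(1, s, "%%EOF")    ' position of the match, 0 if absent

' Start position 0 is invalid, so trap the runtime error and report it.
On Error Resume Next
WScript.Echo InStr(0, s, "%%EOF")
WScript.Echo "Err.Number = " & Err.Number & ": " & Err.Description
On Error GoTo 0
```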

Mike Caldwell, Consultant to IP industry (Author) commented:
OK, running now.  But both good and bad files test as bad.  Here is my complete code:

Set fso=CreateObject("Scripting.FileSystemObject")
Set objFile=fso.OpenTextFile("c:\junk\good.pdf",1)


dim str
    Do Until objFile.AtEndOfStream
      x = objFile.ReadLine
      If objFile.atEndOfStream Then
         last_line = str
      End If
    Loop

if InStr(1, last_line, "%%EOF")=0 then

msgbox "File Corrupt"

Else

msgbox "File OK"

end if

Mike Caldwell, Consultant to IP industry (Author) commented:
The returned value is still zero length.

Set fso=CreateObject("Scripting.FileSystemObject")
Set objFile=fso.OpenTextFile("c:\junk\good.pdf",1)


dim str
    Do Until objFile.AtEndOfStream
      x = objFile.ReadLine
      If objFile.atEndOfStream Then
         last_line = str
      End If
    Loop


msgbox "Last line length is " & len(last_line)

if InStr(1, last_line, "%%EOF")=0 then

msgbox "File Corrupt"

Else

msgbox "File OK"

end if

YZlat commented:
It looks like fso.OpenTextFile cannot read the PDF; it returns a blank line.
Mike Caldwell, Consultant to IP industry (Author) commented:
Ah, found it: the loop reads each line into x but then copies str, which is never assigned.  The line should be

    str = objFile.ReadLine

Test is now perfect; many thanks.
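For anyone copying the script, here is a sketch of the whole test with both fixes from this thread applied (ReadLine assigned to str, InStr start position 1). Wrapping it in a function named LastLineHasEof is an illustrative choice, not part of the original posts.

```vbscript
Option Explicit

' Hypothetical wrapper: True when the last line of the file contains "%%EOF".
Function LastLineHasEof(path)
    Dim fso, objFile, str, last_line
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set objFile = fso.OpenTextFile(path, 1)     ' 1 = ForReading
    last_line = ""
    Do Until objFile.AtEndOfStream
        str = objFile.ReadLine                  ' the fix: read into str, not x
        If objFile.AtEndOfStream Then
            last_line = str                     ' keep only the final line
        End If
    Loop
    objFile.Close
    ' The start position must be 1; 0 raises "Invalid procedure call or argument".
    LastLineHasEof = (InStr(1, last_line, "%%EOF") > 0)
End Function

If LastLineHasEof("c:\junk\good.pdf") Then
    MsgBox "File OK"
Else
    MsgBox "File Corrupt"
End If
```

One caveat: a PDF that has extra blank lines after %%EOF would have an empty last line and be reported as corrupt by this test.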
Joe Winograd, Fellow & MVE, Developer, commented:
Hi Mike,

It seems that YZlat has you on the path of a VB script, so I won't jump in there (other than to say that opening it as a text file is the right thing to do — that's how you'll find the %%EOF string).

But if you want to try a completely different approach, I recommend an excellent (free!) piece of software called the PDF Toolkit (PDFtk). This article explains where to get it:
How to Combine-Merge PDF Files in Many Subfolders

You may ignore most of the article — read just the section about downloading/installing it. You may call the command line <pdftk.exe> (requires the presence of <libiconv2.dll>) from your VB script and check the return code. The simplest call is to use the dump_data option — if it can open the file, you'll get a return code of 0; if it can't open the file, you'll get a return code of 1. I just tested it with a PDF where I removed the %%EOF from the end and it worked perfectly, giving a return code of 1. Here's the call my script used:

pdftk.exe input.pdf dump_data output dumpdata.txt

Regards, Joe
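A minimal sketch of calling that command from VB Script and checking the return code, assuming pdftk.exe is on the PATH and that it returns 0 or 1 as described above. The function name PdftkCanOpen is hypothetical.

```vbscript
Option Explicit

' Hypothetical helper: True when pdftk exits with return code 0.
Function PdftkCanOpen(pdfPath)
    Dim shell, cmd, rc
    Set shell = CreateObject("WScript.Shell")
    ' Quote the path in case it contains spaces.
    cmd = "pdftk.exe """ & pdfPath & """ dump_data output dumpdata.txt"
    ' 0 = hidden window, True = wait for pdftk to exit and return its exit code.
    rc = shell.Run(cmd, 0, True)
    PdftkCanOpen = (rc = 0)
End Function

If PdftkCanOpen("c:\junk\good.pdf") Then
    WScript.Echo "File OK"
Else
    WScript.Echo "File Corrupt"
End If
```

Because pdftk actually parses the file, this approach may catch damage other than a missing %%EOF, at the cost of launching a process per file.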
Joe Winograd, Fellow & MVE, Developer, commented:
Hi Mike,
Our messages crossed. While I was writing my post and testing the code, I see that you've achieved success — great news! Regards, Joe
YZlat commented:
If it worked for you, why are you closing the question?
Mike Caldwell, Consultant to IP industry (Author) commented:
I'm closing it because it is done.  I awarded you all the points, but I did include my comment so that if someone wants to copy the script they will notice the typo.  I don't understand if there is a problem; please explain.
YZlat commented:
I am asking why you can't just accept multiple solutions.  Why send a close request?
Mike Caldwell, Consultant to IP industry (Author) commented:
I did accept multiples, but one was my own (obviously for no points).  The system then required an explanation, and put it into an approval cycle.  I guess I don't know how "closing" is different from accepting the solutions.  Do you mean that closing negates your points?  Certainly not my intent.  But I don't see that to be the case.  Let me know.
Mike Caldwell, Consultant to IP industry (Author) commented:
YZlat, if you look closely at the fine print on the Close Request Pending, it states that the points are awarded to you, 250 + 250, so I think all is well.  The reason it is Pending is due to my inclusion of my typo fix.  Looks like it will be approved on Saturday.  My thanks for your recommendation and patience.
Mike Caldwell, Consultant to IP industry (Author) commented:
I included my edit, which was a minor correction to the code the Expert provided.  Without it, the code always fails the test.