How to find corrupted files... Needle in a haystack.

Hey Guys,

     Long story short: how do I search for corrupted image files? Is there a program which will attempt to open them and flag the ones it can't?

     I'm hoping there is a simple way to go about this. I'm not sure if any of you have heard of the image management program called Alchemy, but regardless, it stores images in datafiles using its own kind of database/filesystem. A customer had gone with Alchemy in the past and now wants to go a different route, so I extracted the images from the datafiles. Some of the datafiles had CRC errors when I copied them down from CD, and I fear that some of the images are corrupted. There are ~15,000 images to look through, so checking them manually would take a long time. I managed to find some that had a file size of 0 KB, and so sorted by size to find similar ones, but I also found others that are of average size and blend into the mix.
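The size check described above can be scripted rather than done by sorting in a file browser. This is only a rough sketch: the function name and the `min_bytes` threshold are made up for illustration, and it only catches empty or truncated files, not average-sized ones with garbled contents.

```python
from pathlib import Path


def find_suspect_sizes(root, min_bytes=1):
    """Return (size, path) pairs for files under root smaller than min_bytes.

    With the default threshold only zero-byte files are reported.
    Results are sorted smallest first.
    """
    suspects = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size < min_bytes:
                suspects.append((size, path))
    return sorted(suspects)
```

Calling `find_suspect_sizes("extracted_images")` (a hypothetical folder name) lists the empty files; raising `min_bytes` widens the net to anything unusually small for the scan resolution in use.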

    I need to determine which images are corrupted so I can track down the original information and get this project sorted out.

Any pointers?

Brian

   
miglaughAsked:

lherrouCommented:
Brian,

This is an interesting challenge, and I don't know how much effort you want to put into it. A programming solution might be to use ImageMagick (www.imagemagick.org); there are a number of approaches which could be taken (compare all images against a test image and flag the error images, for example).

Since I'm not a programmer, I think the most likely approach I would take would be to produce a series of contact sheets of images with their filenames. It should be easy and quick to visually scan the contact sheets and spot the corrupt images.

Cheers,
LHerrou
miglaughAuthor Commented:
Hi LHerrou,
 
   Thanks for the input. I too think this is an interesting challenge, and I'm willing to put a lot of effort into it. Our company processes thousands upon thousands of images daily, so having a utility to do a preemptive scan for bad files would be nice. One of the things that's been happening lately is clients from the past wanting to upgrade their materials with OCR or something. Usually this means pulling images down from CD instead of rescanning from film (which we would charge extra for). We all know how bad CDs and DVDs can get over time, and when CRC errors are ignored, who knows what demons lie in wait.

    Producing contact sheets isn't going to work, simply because of the sheer volume of images. It'd be quicker to run through the images in a program like ACDSee.

   So now that I'm actually trying to think up check algorithms, I'm noticing that a bad file might not necessarily be corrupt to the point that it can't be opened; its bits may have just been garbled so that what is displayed is all messed up. So I'm starting to think this sort of thing will only work for truly corrupted files, the ones that can't even open and display anything.

   I'll start by trying a simple command-line program, like the tools that come with libtiff, and see if the corrupted images I've found can display their headers. If junk headers are a sign of bad things, I could run a batch on a directory, redirect the output to a text file, and search the text file for "error."
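The batch-and-search idea above might look something like the sketch below. The choice of `tiffinfo` (one of the utilities that ship with libtiff) is an assumption, since the post doesn't name a specific tool, and the function names and the case-insensitive match on "error" are illustrative only.

```python
import subprocess


def tool_reports_error(path, command=("tiffinfo",)):
    """Run a header-dump tool on one file.

    Returns True if the tool exits with a non-zero status or if its combined
    output contains the word "error" (case-insensitive).
    `command` defaults to tiffinfo, which is an assumption; pass another tool.
    """
    result = subprocess.run(
        [*command, str(path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout, like 2>&1
        text=True,
        errors="replace",
    )
    return result.returncode != 0 or "error" in result.stdout.lower()


def write_error_log(paths, log_path, command=("tiffinfo",)):
    """Write the name of every flagged file to log_path and return the list."""
    flagged = [str(p) for p in paths if tool_reports_error(p, command)]
    with open(log_path, "w", encoding="utf-8") as log:
        log.write("\n".join(flagged))
    return flagged
```

If the tool isn't installed, `subprocess.run` raises `FileNotFoundError`, so that is the first thing to rule out when nothing gets flagged or everything fails.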

Let me know if anybody has some ideas,

Brian

   
lherrouCommented:
Well, I would certainly take a look at the ImageMagick "identify" command (ImageMagick is also a command-line program, and has interfaces for various programming languages). It specifically "reports if an image is incomplete or corrupt."

http://www.imagemagick.org/script/identify.php
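For anyone who wants to script this, here is a minimal sketch of driving `identify` over a folder. It assumes the `identify` executable is on the PATH (newer ImageMagick installs may expose it as `magick identify` instead), and it treats a non-zero exit status or anything written to stderr as a sign of trouble, which is a guess at sensible behavior and not something taken from the ImageMagick docs. The function names and suffix list are made up for the example.

```python
import subprocess
from pathlib import Path


def identify_check(path, command=("identify",)):
    """Return (ok, message) for one image, based on exit code and stderr."""
    try:
        result = subprocess.run(
            [*command, str(path)],
            stdout=subprocess.DEVNULL,  # the normal description is not needed
            stderr=subprocess.PIPE,
            text=True,
            errors="replace",
        )
    except FileNotFoundError:
        raise RuntimeError(f"{command[0]} is not installed or not on PATH")
    message = result.stderr.strip()
    ok = result.returncode == 0 and not message
    return ok, message


def scan_directory(root, suffixes=(".tif", ".tiff", ".jpg"), command=("identify",)):
    """Yield (path, message) for every image under root that fails the check."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            ok, message = identify_check(path, command)
            if not ok:
                yield path, message
```

Looping over `scan_directory("extracted_images")` (again a hypothetical folder) and printing each pair gives a list of files worth opening by hand.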


miglaughAuthor Commented:
Awesome find!

  I was skeptical about the program at first because I wasn't sure whether some of the screwed-up images I have, the ones that display partly on screen, would be deemed "corrupt" by it or not. I have found out they are... So with a simple batch script and a free machine I can narrow down the search incredibly. One downside to this method is that it seems to take forever on a per-image basis: more than a second or so, sometimes two. Normally that wouldn't be an issue, but when there are 15,000 images to check that could get pretty bad. So I think I'm going to peel open the source code to see what's necessary for the program to determine if a file is corrupt and what, if anything, can go the way of the condor.

I'll leave this question open for a couple of days in case anybody else has some suggestions; otherwise it's all yours.
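A different way to attack the per-image cost, without touching the ImageMagick source, would be to run several checks at once. The sketch below is not something tried in this thread. `check` stands for any function that takes a path and returns True for a good file (for instance a wrapper around the identify call), and the worker count of 4 is arbitrary. Threads are assumed to be enough because each check would mostly be waiting on an external process; how much it helps depends on the number of cores and on how fast the disk can feed the files.

```python
from concurrent.futures import ThreadPoolExecutor


def check_in_parallel(paths, check, workers=4):
    """Apply check(path) -> bool to every path using a pool of worker threads.

    Returns the paths for which check returned a false value, in input order.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with paths
        results = list(pool.map(check, paths))
    return [path for path, ok in zip(paths, results) if not ok]
```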

later, and thanks,

Brian
miglaughAuthor Commented:

I ran a batch of the identify program on the files I was wondering about... It took 12 hours to scan 15,000 images. Not exactly Speedy Gonzales, but these are huge images. It turns out that there were no other "corrupted" images that it could find, which is good news unless it missed some for whatever reason.

Another weird thing I noticed is that a bunch of the images (around 550) have lost their date information and don't display when they were created or modified. Maybe that is what got corrupted. Something has to have been, because I got a fairly large number of CRC errors, and I've never seen them just disappear like this.
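To pull out the files with missing dates as a list, something along these lines might do. It rests on an assumption that isn't confirmed here: that a "lost" date shows up on disk as an implausibly early modification timestamp. The 1990 cutoff and the function name are arbitrary choices for the sketch.

```python
from datetime import datetime, timezone
from pathlib import Path


def files_with_implausible_dates(root, earliest=datetime(1990, 1, 1, tzinfo=timezone.utc)):
    """Return files under root whose modification time is before `earliest`.

    Assumption: a missing date is stored as a very early timestamp.
    """
    cutoff = earliest.timestamp()
    flagged = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            flagged.append(path)
    return sorted(flagged)
```

Comparing that list against the datafiles that threw CRC errors would show whether the two problems line up.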

Well anyways, thanks for the help,

Brian
lherrouCommented:
Brian,

12 hours for 15,000 large images doesn't sound too bad to me! Anyway, glad I could help out. Thanks for the "A"!

LHerrou