Image duplication finder

Can anyone recommend a really good duplicate image finder that can compare some 100gb of photos by metadata e.g. exact date and time taken of the original photo rather than the name of the photo (as some of my photos might have different names after I changed them but they are the same photos).

Also is there a way of seeing a visual depiction of all the photos taken in a heat map style so I can see where there are large gaps (e.g. 1-2 months) where no photos are?

Also, I am capable of doing a Google search myself, so what I am really looking for are recommendations based on what people have actually used, not opinions based on what they have read on a web site (as I am capable of reading a blurb too).  Thanks.

Paul Sauvé (Retired) Commented:
I have tried different software - none ever give the same results!

Have a look here, perhaps you will find something of interest: Gizmo's Freeware - Best Free Duplicate File Remover
dbrunton (Quid, Me Anxius Sum?  Illegitimi non carborundum.) Commented:

Try exiftool.  A description of it can be found here  and the link is there as well.

Not pretty but it should handle metadata for you.  But some work is involved.
Like Paul Sauvé, I too have tried out a number of duplicate image finder applications in the past.  On one of those occasions I had run a data recovery on a malfunctioning hard drive and ended up recovering tens of thousands of image files with 6 or 8 character alphanumeric file names, as well as many with original file names contained in folders with similar 6 or 8 character alphanumeric names.  I needed to figure out which of the recovered files with original file names had also been recovered as renamed duplicates.  Unfortunately I didn't find the holy grail of applications and, as Paul said, I ended up with mixed results that didn't inspire enough confidence in me to proceed and delete the duplicates.  I ended up using a multi-stage approach, some scripted and some with GUI applications.

In your case, remember that as well as searching for duplicates using the embedded EXIF metadata, you also have the option of using file hashes (CRC32, MD5, SHA1, SHA-256, SHA-384, or SHA-512) to compare and weed out true duplicate files that have just been copied and renamed and not modified with an image editing application.  A file that has been copied and pasted or drag & drop copied to the same folder, a different folder, or to a different drive, and renamed will still have the same MD5/CRC32/SHA1, etc hash as the original.  The hash is unaffected even if the file is copied between FAT32 and NTFS formatted drives and vice versa.  Tinkering with the binary data in the file, such as simply opening and resaving it in an image editing application without modifying anything, will generate a different hash because the files are no longer truly identical.
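To make the idea concrete, here is a minimal Python sketch of hash-based duplicate detection.  It is not how any of the tools mentioned in this thread work internally; the function names, the choice of SHA-256, and the extension list are my own assumptions for illustration.  It only reports groups of byte-for-byte identical files and deletes nothing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so a large photo is never loaded whole."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(base_dir, extensions=(".jpg", ".jpeg")):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Files of different sizes cannot be identical, so group by size first
    # and only hash the files that share a size with at least one other.
    by_size = defaultdict(list)
    for path in Path(base_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in extensions:
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[file_digest(path)].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling find_duplicates on a base folder returns one list per set of identical files, regardless of what the files are named or which sub-folder they sit in.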

There are small applications that allow you to create hashes and compare them.  One such application is HashMyFiles by Nir Sofer.  Using it from the command line you could have it traverse a folder or drive, generate whichever hash types you want for all files with a specified extension, prefix identical files with a symbol of choice, sort the output by column, and output the results to various file formats including text, XML, and CSV.  I have used it in this kind of "dupfinder" mode before and then used the results in a "DOS" batch file to get rid of duplicate files.


u587162 (Author) Commented:
Hmm.. I think I have bitten off more than I can chew.  Not sure I understand all that syntax.  I'm happy to try a multi-layered approach if it is easy to follow.  What is EXIF?
dbrunton (Quid, Me Anxius Sum?  Illegitimi non carborundum.) Commented:
The EXIF contains data from your camera such as the date and time that the photo was taken and the location.  See  for more gory details.

Now if you haven't altered any of the EXIF tags then you don't need the application I linked to.

If you've only altered the external file properties such as copying and then renaming the file then any decent duplicate file finder can find the copies by a simple file comparison.  See BillDL's posting above.

If you need to get the photos using the actual date and time then you'll have to try the application I linked to.
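
For what it's worth, once the capture times have been pulled out of the files by an EXIF tool, grouping on them is simple.  The sketch below assumes you already have a mapping of file path to the EXIF "date/time original" text, which is normally written as YYYY:MM:DD HH:MM:SS.  The function name and the input mapping are hypothetical, and it does not read EXIF data itself.

```python
from collections import defaultdict
from datetime import datetime

# Usual text layout of an EXIF capture timestamp, e.g. "2017:06:01 10:15:30".
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"


def group_by_capture_time(capture_times):
    """Group file paths that share exactly the same capture timestamp.

    capture_times: dict of path -> timestamp text (assumed to have been
    exported beforehand by an EXIF tool).
    """
    groups = defaultdict(list)
    for path, raw in capture_times.items():
        try:
            taken = datetime.strptime(raw.strip(), EXIF_FORMAT)
        except ValueError:
            continue  # missing or malformed timestamp, skip the file
        groups[taken].append(path)
    # Only timestamps shared by two or more files are of interest.
    return {t: sorted(paths) for t, paths in groups.items() if len(paths) > 1}
```

Treat the result as a list of candidates only.  Two different photos from a burst can carry the same timestamp, so a matching time on its own does not prove the files are duplicates.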
Hi u587162

Sorry, that wasn't meant to confuse you.  In the context of this question, a "hash" (otherwise known as a checksum) is just a computational method of inspecting a file and creating a short, fixed-length value (a kind of digital fingerprint) that can be compared against the value created from another file to verify integrity.  You may have seen mention of an "MD5" or "SHA1" hash on sites where you download installer packages.  I'm sure it is mentioned in the small print on the Microsoft site when downloading patches.  The vendor has already computed the hash value for their download file and you are then supposed to download the file and create a hash yourself to check that it has not been corrupted during the download process or deliberately tampered with.  There was recently an incident where the installer package for a very popular utility named CCleaner was maliciously tampered with by 3rd-parties and installed malware.

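There are different methods of creating this hash or checksum.  In terms of accuracy and catching extremely minor differences in files, CRC32 (a 32-bit value) is about the lowest on the totem pole, followed by MD5 (128 bits) and SHA1 (160 bits).  The SHA-256, SHA-384, and SHA-512 methods produce longer values as their numbers get higher, which makes a chance match between two different files even less likely.  While CRC32 and MD5 may not catch a file that has been deliberately, maliciously, and craftily modified, for your purposes MD5 would be accurate enough to determine whether two digital images were the same or different.  CRC32 will usually do as well, although with tens of thousands of files a chance match between two different files is not out of the question.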
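
If you want to see the difference between the methods for yourself, this small Python sketch computes several of them for the same data and reports how long each value is.  The function names are just my own for illustration.

```python
import hashlib
import zlib


def checksums(data):
    """Return several checksums of the same bytes as hexadecimal text."""
    return {
        "CRC32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA1": hashlib.sha1(data).hexdigest(),
        "SHA-256": hashlib.sha256(data).hexdigest(),
    }


def bit_lengths(data):
    """Each hexadecimal digit carries 4 bits of the checksum."""
    return {name: len(value) * 4 for name, value in checksums(data).items()}


if __name__ == "__main__":
    for name, value in checksums(b"example photo bytes").items():
        print(f"{name:8} {value}")
```

The longer the value, the less likely it is that two different files end up with the same one by accident.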

There are loads of programs that allow you to generate a hash of a file using some or all of the methods.  The one I linked to is actually quite easy to use in GUI mode, but its strength kicks in when run from the command line, where you can tailor the results into a suitably formatted text-based file (for example writing the full paths of files) that can then be used to perform a batch mode process, like deleting all the files listed or moving them all to another folder, and other similar things.

I will try to write something more specific later to demonstrate what I mean.  Just getting ready to have dinner, dig my car out of the snow in my driveway, and go out on nightshift.
No, I haven't been digging my car out of the snow since I last posted.  Unfortunately I had some other things needing done.

I mentioned how something like this could be scripted.  I have written an example batch file that calls on two external programs.  One of them, namely FINDSTR.EXE, should already be present on all modern versions of Windows and should run just by calling FINDSTR.  The other is a standalone program that you would have to download.

Download the Zip file from the link near the bottom of the above page.  The 32-bit version is behind the link entitled "Download HashMyFiles" and the 64-bit version is named as such in the link.

Save it to any folder you want and unzip it so you have:
HashMyFiles.exe - the program
HashMyFiles.chm - old style Windows Help that won't work unless you have an add-on installed to view them.  Not needed anyway.
readme.txt - contains the command line options.

Run the EXE and configure it under the View and Options menus.  There are some configuration settings that are used when the program is run from the command line, whilst others can be overridden by certain parameters you specify from the command line.  The column headers that are configured to be visible in GUI mode will always be listed in the output when run from the command line, so to keep things as uncluttered as possible, only show the columns that will be the most important.

NOTES:  Under the View Menu > Choose Columns option you can untick or tick the columns you want to see when run in normal GUI mode and will therefore also appear in the output when run from the command line.
- Important: Move the "Identical" column to the top so that it displays as the first column header.
- I suggest that for this test you only tick the columns named Identical, Full Path, and File Size.

IMPORTANT: Under the Options menu TICK "Mark Identical Hashes".

Under the Options menu DO NOT tick any of the options to integrate the program with the Windows Explorer Context Menu.  If you do, it ceases to be a standalone program and writes more stuff to the registry than you might want.  I suggest that you "Add Header Line to CSV/Tab-Delimited File".

When you close the program it will write its settings to a new configuration file in the same folder named "HashMyFiles.cfg".

Now download the "Find_Dups.cmd" file that I have attached.  This is the batch file.  Save it to the same folder as HashMyFiles.exe.


Open the batch file in a plain text editor like Windows Notepad.  There is definitely one and possibly two edits right at the start of the file that you will need to make within the batch file for your own use.

set EXT=JPG <-----------  Change to GIF or BMP or PNG if you are searching for a file type other than JPG.

set BaseDir=C:\Full_Path_To\Photos <------------------ Don't double-quote path here

Save the edited batch file and double-click on it to run it.

Basically what it does is call HashMyFiles.exe and tell it to:
- Find every JPG file in the folder specified and in all its sub-folders.
- Generate a cryptographic MD5 and SHA1 hash for each JPG found.
- Check for identical hashes even for files in different folders and by different names.
- Apply the same sequential number in the "Identical" column for identical files.
- Sort the results so that the Identical files marked with numbers are grouped and sorted.
- Output a temporary file in the same folder in Comma-Separated (CSV) format that contains the full paths to all the JPG files found.  Those with an "identical" number in the first comma-separated field are grouped to the end.

The process then calls FINDSTR.EXE and tells it to:
- Find all lines that DO NOT HAVE a comma right at the start (ONLY those marked as identical).  The lines for files that are not identical will all begin with a comma because no numbers have been assigned to them.
- Output the results to a final CSV formatted report that can be opened in Microsoft Excel, LibreOffice/OpenOffice Calc, or other spreadsheet application.
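
For anyone who would rather not use FINDSTR, the same filtering step can be sketched in Python.  This assumes the layout described above: a header line, with the "Identical" number as the first comma-separated field and left empty for files that have no duplicate.  The function name, the file name in the usage note, and the UTF-8 encoding are my assumptions, not something taken from the HashMyFiles documentation.

```python
import csv


def identical_rows(csv_path, encoding="utf-8"):
    """Return (header, rows), keeping only rows whose first field is filled in.

    Assumes the first column is the "Identical" group number and that it is
    empty for files without a duplicate, as described above.
    """
    with open(csv_path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, None)
        rows = [row for row in reader if row and row[0].strip()]
    return header, rows
```

With a hypothetical temporary file named hashes.csv, identical_rows("hashes.csv") would give back the header plus only the lines for files that were marked as identical.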

As I said, this is just an example.  The output does not allow you to click on anything to view images, to delete files, or anything similar.  It is just a report.  It would be possible to take the generated output and process it further whereby it COULD be used to delete duplicate files leaving only one, but doing this kind of thing in an automated scripted mode is dangerous.
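
A safer middle ground is a "dry run" that only lists what would be kept and what would be removed, so you can check it by eye first.  Here is a minimal Python sketch.  The function name is hypothetical and the rule of keeping the file with the shortest path is an arbitrary assumption of mine.  Nothing is deleted.

```python
def plan_removals(groups):
    """For each group of identical files, pick one to keep and list the rest.

    groups: a list of lists of full paths, one inner list per set of
    identical files.  Nothing is deleted; the result is only a plan to review.
    """
    plan = []
    for group in groups:
        # Arbitrary rule for illustration: keep the shortest path,
        # breaking ties alphabetically.
        ordered = sorted(group, key=lambda p: (len(p), p))
        plan.append((ordered[0], ordered[1:]))
    return plan


if __name__ == "__main__":
    example = [["C:\\Photos\\copy\\IMG_0001.jpg", "C:\\Photos\\IMG_0001.jpg"]]
    for keep, extras in plan_removals(example):
        print("KEEP  ", keep)
        for path in extras:
            print("REMOVE", path)
```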

You can, of course, use the program in GUI mode, load a base folder to search, and finally sort the results and group identical files by clicking on the "Identical" header button.  In that mode you can select one of the duplicate files and (from the File menu) move it to the recycle bin, delete it, or open the containing folder.  In GUI mode you would probably choose to have more columns displayed, such as file creation and modified dates, file size, and the MD5, SHA1, and other such columns.  It is easier to look at the results if you tick View Gridlines and Mark Odd/Even lines under the View menu.

Remember that because the program stores its settings in its own config file, copies of the EXE will need to be configured separately if run from other folders.

I have found HashMyFiles to be extremely accurate in identifying duplicate files, even though it isn't specifically intended for, and doesn't have features for, visually comparing digital images.
I use the aptly named Easy Duplicate Finder.  It is easy to use, and you can view the files it finds first, which is great.
It explains for itself what to set it to look for.
Images included here for Windows.
There is a video on the home page.  I wonder why more folks don't use it?
dbrunton (Quid, Me Anxius Sum?  Illegitimi non carborundum.) Commented:
Other people have proposed different solutions but this one should do for a first pass over the 100 GB of photos.  Notice the statement 100 GB.

I don't believe anything else would be as time saving as his method.
Thank you u587162