Steve Meyer (United States of America)

asked on

Duplicate File Finder Where File Names are Different...

Looking for a Windows file compare utility, similar to UltraCompare or Beyond Compare, or even Directory Opus, but here's the problem. I have a folder containing several hundred or more files. I believe these files may already exist somewhere else on my computer in another folder, and I need to determine whether the files in the one folder are duplicates. To make things more complicated, while these files may be duplicates, a file may occur in multiple locations under a different file name. For example, I have a file named IMG00535.jpg, but it may exist somewhere else as "SM0110-180906-IMG00535.jpg". In summary, I want to find all files and locations where a portion of the file name matches one of the short names in a specified folder.

Essentially, I have a folder containing photos downloaded from a camera. I believe they have been downloaded before. However, it is customary for me to rename the image files to include a Project ID and the date taken in addition to the original name given by the camera. I'm not interested in searching within the folder containing the list of short names. The bottom line here is essentially to look for duplicate files where the file names do not fully match, but perhaps the date, size, or hash do.
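For anyone who would rather script this than buy a tool, the hash comparison the answers below converge on can be done in a few lines of PowerShell. This is a minimal sketch with placeholder folder paths, not a polished utility:

```powershell
# Minimal sketch: find files elsewhere on disk whose *content* matches
# files in the camera-download folder, regardless of file name.
# All paths are placeholders; the source folder itself is deliberately
# left out of the search roots.
$sourceFolder = 'C:\CameraDownload'
$searchRoots  = @('C:\Photos', 'D:\Archive')

# Hash every file in the source folder (Get-FileHash uses SHA256 by
# default) and index the results by hash string.
$sourceHashes = Get-ChildItem -Path $sourceFolder -File -Recurse |
    Get-FileHash |
    Group-Object -Property Hash -AsHashTable -AsString

# Walk the search roots and report every file whose hash appears in
# the source set.
foreach ($root in $searchRoots) {
    Get-ChildItem -Path $root -File -Recurse |
        Get-FileHash |
        Where-Object { $sourceHashes.ContainsKey($_.Hash) } |
        ForEach-Object {
            [PSCustomObject]@{
                Duplicate   = $_.Path
                MatchesFile = $sourceHashes[$_.Hash][0].Path
            }
        }
}
```

Matching on content hash sidesteps the renaming problem entirely: IMG00535.jpg and SM0110-180906-IMG00535.jpg hash identically if the bytes are the same.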
ASKER CERTIFIED SOLUTION
Thomas U (Switzerland)
Thomas makes a good suggestion. I have used that tool many times, and it works fairly well. The odds of a hash collision are very tiny, so it does an excellent job of finding duplicates with different attributes. I did not know about the symbolic link option, though, Thomas. Thanks for pointing that out.
My users work a lot with "promotional videos." They download them, copy them to another folder, and want to send them to a different user - even on the same network - instead of making a link to them... damn, they even copy them from the network drive to their desktop to have them at hand faster and anytime... so I have a loooot of duplicates to deal with ;)
TreeSize Professional saved my ass once... the first partition on the server where all the data lives was running out of space (2 TB) and I could not extend it because it was an MBR partition... so I thought, well, maybe I can delete old data... bought TreeSize... found duplicates by hash... and I could delete almost 800 GB of duplicate data / create symbolic links... I hate my users ;-)
Steve Meyer

ASKER

Thanks, guys, I'm going to check out TreeSize. Does anyone know of any limitations of the trial version, like a maximum number of files, etc.?

Any other suggestions?
The page says 30-day trial without ANY limitation in functionality whatsoever... so I believe them ;)
Thomas,
wow, that's cool. If you can share the script here, that would be greatly appreciated.

Why not use Windows Server 2012 R2 or above for the file server and its built-in deduplication?
Hi Senior IT System Engineer (please change your name ;)

I used that one and changed some things to suit my needs:
https://stackoverflow.com/questions/44358602/trying-to-compare-hashes-and-delete-files-with-same-hash-in-powershell
It's slow and messy, but it works.
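For reference, the linked script boils down to roughly the following pattern (a sketch of the general approach with a placeholder path, not Thomas's exact version): hash each file while walking the tree, keep the first file seen for each hash, and remove later copies.

```powershell
# Rough sketch (not Thomas's exact script) of the linked approach:
# keep the first file seen for each content hash, delete later copies.
# The root path is a placeholder.
$root = 'D:\PromoVideos'
$seen = @{}   # hash -> path of the first file seen with that content

Get-ChildItem -Path $root -File -Recurse | ForEach-Object {
    $hash = (Get-FileHash -Path $_.FullName).Hash
    if ($seen.ContainsKey($hash)) {
        # Duplicate content under a possibly different name.
        # -WhatIf previews the deletion; drop it to actually delete,
        # or replace the file with a symbolic link to $seen[$hash]
        # (New-Item -ItemType SymbolicLink) instead of removing it.
        Remove-Item -Path $_.FullName -WhatIf
    } else {
        $seen[$hash] = $_.FullName
    }
}
```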

Yes, if I could move my file server to 2012 R2 and use deduplication in no time, I would've done that already ;). But that task is still on my list.
Here's an update. I downloaded TreeSize, but I also found a couple of reviews and an application from KeyMetric called Duplicate File Detective. Very nice. It does it all, with lots of options for selecting file folders and drives and for what to base the comparison of files on: name, date, size, and/or hash, etc. Results can be deleted, archived, or replaced with links. It allows me to determine which folder is to be used for the master file list, then finds all matching files in all other selected folders.

I needed to get rid of a particular cloud service, so I downloaded about 12,000 files from that cloud to a folder on my PC and then deleted them from the cloud. Using DFD, I then selected the downloaded folder as master and scanned the other selected folders on the PC for duplicates. I then locked those folder trees containing duplicates that I wanted to keep, marked the remaining files for deletion, and sent them to a temporary archive folder. I based my comparison on matching file type and file size using one of six available hash protocols (I used SHA256; the other hash types are CRC32, ADLER32, MD5, SHA1, and SHA512). Pretty slick. I am now evaluating TreeSize for comparison. That app appears to be pretty slick also, and they both cost about the same. I will let you know the results. Has anyone ever used Duplicate File Detective?
I tried such software once, and found out one must be very cautious when deleting - I deleted some folders that were not meant to be deleted...
not so easy if you have hundreds of folders
Yes indeed, one must be careful not to accidentally delete the wrong files.  The solution here is to archive your deletions to another drive or memory stick.
Or just do a full backup before deleting anything.
I have been evaluating TreeSize. It looks like this program does everything but shine your shoes. However, I have not been able to configure it to safely search based on a master list or folder of files; that's not to say it can't do this. Duplicate File Detective (DFD) accomplishes this by prioritizing and locking folder trees that are to be searched but left untouched (no files will be deleted in these folders). These are what I call my master set of files. In this way, based on the locked folders, only duplicate files in unlocked folders will be purged. This took no time to configure and run in DFD.

Now that I have learned something, I need to correct my previous explanation. Disregard that confusing post. Let's try this: to avoid permanently deleting cloud files that may not be on my computer, I downloaded all of my cloud files to my computer. Then, using DFD, I selected and locked those folders containing what I call my master files. DFD then found the duplicates in the downloaded cloud folder and archived them, leaving me to determine what to do with the remaining (orphaned) cloud files. This was easily configured in DFD.
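In script form, that workflow looks roughly like this (a hedged sketch with placeholder paths; this is not DFD's actual logic, just the same idea in PowerShell): hash the locked master folders first, then move any downloaded file whose hash matches into an archive folder instead of deleting it.

```powershell
# Sketch of the "locked master folder" workflow described above
# (placeholder paths, not DFD's internals). Files under the master
# folders are never touched; a downloaded file is archived (not
# deleted) only if its content already exists in the master set.
$masterFolders = @('C:\Photos\Projects', 'D:\Archive')   # "locked" folders
$cloudDownload = 'C:\CloudDownload'
$archive       = 'E:\DedupArchive'
New-Item -ItemType Directory -Force -Path $archive | Out-Null

# Hash everything in the master (locked) folders.
$masterHashes = @{}
foreach ($folder in $masterFolders) {
    Get-ChildItem -Path $folder -File -Recurse | Get-FileHash |
        ForEach-Object { $masterHashes[$_.Hash] = $_.Path }
}

# Archive any downloaded file whose content duplicates a master file;
# whatever remains in $cloudDownload afterwards is "orphaned".
# (Name collisions in $archive would need handling in real use.)
Get-ChildItem -Path $cloudDownload -File -Recurse | Get-FileHash |
    Where-Object { $masterHashes.ContainsKey($_.Hash) } |
    ForEach-Object { Move-Item -Path $_.Path -Destination $archive }
```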

I am pretty sure TreeSize can do this too, but there are so many options and settings that I haven't yet been able to reproduce the same task. However, because TreeSize has so many other useful file management features (besides deduplication), I am going to give credit to the experts, even though I found my own solution. Thanks, folks.

Also, I was lucky: when I searched for a coupon, I found DFD for $34 on BitsDuJour (discounts good for a day). Both TreeSize and DFD are normally around $50, although TreeSize does have a $25 version that doesn't include the full set of features.