
Delete duplicate files

veedar asked:
Consider a directory filled with hundreds of .xls files. Many of the files have identical content but have different file names. For example...

% md5sum f320183584.xls f327662272.xls f334737920.xls f67843472.xls f7849312.xls
de9e25d5934e2715b7117a9c7e584c84  f320183584.xls
de9e25d5934e2715b7117a9c7e584c84  f327662272.xls
de9e25d5934e2715b7117a9c7e584c84  f334737920.xls
de9e25d5934e2715b7117a9c7e584c84  f67843472.xls
de9e25d5934e2715b7117a9c7e584c84  f7849312.xls

I'm looking for a script that will keep just one copy of each set of identical files and remove all the duplicates.


Commented:
Here is a link to download free software for this purpose:

http://download.cnet.com/Auslogics-Duplicate-File-Finder/3000-2248_4-10964299.html

HTH

Commented:
You could simply write your own script that checksums each file, compares the checksums, and deletes one of the files (or moves it to another folder) whenever two checksums match.
A simple solution follows.

The example assumes all files are in one flat directory.
If this is not the case, then I can show you how to get all file names below a certain directory.
#!/usr/bin/env python
import os
import hashlib

def md5_for_file(fname):
    """Calculate the md5 checksum of a file, reading it in chunks."""
    block_size = 0x100000  # 1 MiB
    md5 = hashlib.md5()
    with open(fname, "rb") as fhndl:
        while True:
            data = fhndl.read(block_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

def rmv_duplicates(files):
    md5_entries = {}
    for fname in files:
        fname = os.path.join(DIR_TO_CHECK, fname)
        if not os.path.isfile(fname):  # skip dirs / symlinks
            continue
        md5 = md5_for_file(fname)
        if md5 in md5_entries:
            print("%s has same md5 (%s) as %s" %
                  (fname, md5, md5_entries[md5]))
            print("I will delete file %s" % fname)
            # uncomment the next line once you are sure the script
            # does what you want it to do
            #os.unlink(fname)
        else:
            md5_entries[md5] = fname

DIR_TO_CHECK = 'C:/mydir'
files = os.listdir(DIR_TO_CHECK)
rmv_duplicates(files)

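If your .xls files are spread over nested subdirectories rather than one flat folder, a minimal sketch of collecting every file path below a top directory with os.walk (the helper name `files_below` is just an example) might look like:

```python
import os

def files_below(top):
    """Yield the full path of every regular file below 'top'."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skip broken symlinks etc.
                yield path
```

You could then pass the yielded paths straight to the md5 check instead of using os.listdir on a single directory.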

However, you should consider whether you are happy to delete an arbitrary member of each duplicate set, or whether you need a rule for deciding which file name to keep and which ones to delete.
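One simple deterministic rule is to process the names in sorted order, so the alphabetically first file in each duplicate set is the one kept. A sketch (the helper name `keep_first_sorted` is just an example; it reports duplicates rather than deleting them):

```python
import os
import hashlib

def keep_first_sorted(paths):
    """Group files by md5 and return (kept, duplicates) lists.

    Processing the paths in sorted order means the alphabetically
    first name in each duplicate set is the one that is kept.
    """
    seen = {}
    duplicates = []
    for path in sorted(paths):
        with open(path, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        if digest in seen:
            duplicates.append(path)   # candidate for deletion
        else:
            seen[digest] = path       # first (kept) copy
    return list(seen.values()), duplicates
```

Once you have reviewed the duplicates list, removing those files (or moving them aside) is a one-line loop.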
Use a duplicate finder; it's good software for this purpose.

Author

Commented:
Perfect!