Christian Knell asked:

Finding directory duplicates

Hi,

We have a file server (Windows Server 2008) with approx. 800,000 files stored on one share. Sometimes users copy complete directories, including subdirectories, to other locations on this share. We want to find those duplicate directories using a software tool.

If there are two identical top folders with many subfolders, the tool should report only the two identical top folders, not all the subfolders. We don't care about duplicate files (there are too many of them); we are only interested in duplicate directory structures. We already tried some tools for finding duplicate files. Unfortunately, they report all subfolders of identical top folders.

Example:

C:\Folder1
    C:\Folder1\Subfolder1
    C:\Folder1\Subfolder2
C:\SomeOtherFolder\Folder1
    C:\SomeOtherFolder\Folder1\Subfolder1
    C:\SomeOtherFolder\Folder1\Subfolder2

If C:\Folder1 and C:\SomeOtherFolder\Folder1 are identical, all the duplicate finder tools we tested report 6 lines. We would prefer a report of just 2 lines (C:\Folder1 and C:\SomeOtherFolder\Folder1).

Does anybody know of software that can handle this?

Thanks!
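
The reporting rule in the example above (list only the topmost identical folders, not their subfolders) could be sketched roughly as follows. This is a minimal illustration in Python, not an existing tool. The function names `file_digest`, `directory_digests` and `top_level_duplicates` are hypothetical. It assumes two folders count as identical when their files and subfolders have the same names and the same MD5 content digests, and it does not handle unreadable files.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks to bound memory use."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def directory_digests(root):
    """Map every directory under root to a digest of its whole subtree."""
    digests = {}
    # topdown=False visits subdirectories before their parent directory.
    for current, subdirs, files in os.walk(root, topdown=False):
        entries = [("F", name, file_digest(os.path.join(current, name)))
                   for name in files]
        # Directories that were not walked (e.g. symlinks) get a placeholder.
        entries += [("D", name, digests.get(os.path.join(current, name), "?"))
                    for name in subdirs]
        payload = repr(sorted(entries)).encode("utf-8", "replace")
        digests[current] = hashlib.md5(payload).hexdigest()
    return digests


def top_level_duplicates(root):
    """Return groups of identical directories, omitting nested repeats."""
    groups = defaultdict(list)
    for path, digest in directory_digests(root).items():
        groups[digest].append(path)
    duplicated = {path for paths in groups.values() if len(paths) > 1
                  for path in paths}
    report = []
    for paths in groups.values():
        if len(paths) < 2:
            continue
        # Skip a group when every member lies inside a duplicated parent:
        # the parents are reported instead.
        if all(os.path.dirname(path) in duplicated for path in paths):
            continue
        report.append(sorted(paths))
    return sorted(report)
```

For the folder layout in the example, `top_level_duplicates` would return one group containing the two `Folder1` paths and nothing for the subfolders. One limitation of this sketch: all empty directories share the same digest, so they would be reported as duplicates of each other unless filtered out.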

arnold

https://www.experts-exchange.com/questions/24475390/Remove-duplicate-files-on-file-server.html


In a way, you should look only for duplicate files first and address the folder duplication afterwards: catalog the duplicate files and their paths, then decide which path is the correct one and which is the duplicate.

Using a document management system might be an alternative to the file share mechanism.

Christian Knell (ASKER)

We tried tools for finding duplicate files. Due to the vast number of files, analyzing the output of duplicate file finder software is very complex, and that analysis should be done by the software we are looking for. Finding the root of duplicate folder structures is especially demanding.

We have no problem if a user keeps a few copies of single files on our central file share. Duplicate file finder software can report those duplicate files. It cannot, however, detect quickly and on a regular basis when a user copies a complete directory tree.

Any ideas?

The only way is to process the entire tree and record filename, path and md5sum in a database.
Then you can group by the md5sum column. This returns a list of files/paths that are identical.
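
The cataloging step described above might look roughly like this. It is a sketch in Python rather than VBA, and it assumes SQLite as the database. The table name `files` and the functions `md5sum`, `catalog_tree` and `duplicate_groups` are illustrative names, not part of any tool mentioned in this thread.

```python
import hashlib
import os
import sqlite3


def md5sum(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks to bound memory use."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def catalog_tree(conn, root):
    """Record filename, path (containing folder) and md5sum for every file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "filename TEXT, path TEXT, md5sum TEXT, "
        "PRIMARY KEY (path, filename))"
    )
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (name, folder, md5sum(os.path.join(folder, name))),
            )
    conn.commit()


def duplicate_groups(conn):
    """Group full file paths by md5sum, keeping only digests seen twice or more."""
    rows = conn.execute(
        "SELECT md5sum, path, filename FROM files WHERE md5sum IN ("
        "SELECT md5sum FROM files GROUP BY md5sum HAVING COUNT(*) > 1) "
        "ORDER BY md5sum, path, filename"
    )
    groups = {}
    for digest, folder, name in rows:
        groups.setdefault(digest, []).append(os.path.join(folder, name))
    return groups
```

A call such as `catalog_tree(sqlite3.connect("catalog.db"), share_root)` followed by `duplicate_groups(conn)` would give one list of paths per repeated digest; `catalog.db` and `share_root` are placeholders.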

This approach is not fast, but subsequent scans could look only at items that are new or have changed, i.e. you would effectively maintain a database table of files/paths and their md5sum.
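
One possible way to restrict re-hashing to new or changed items is to store each file's size and modification time next to its md5sum and skip files where both are unchanged. The sketch below assumes SQLite and a table named `files`; the function name `refresh_catalog` is hypothetical. Treating "same size and same modification time" as "unchanged" is a heuristic, not a guarantee.

```python
import hashlib
import os
import sqlite3


def md5sum(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks to bound memory use."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def refresh_catalog(conn, root):
    """Re-hash only files that are new or whose size/mtime changed.

    Returns the list of paths that were (re)hashed in this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, md5sum TEXT)"
    )
    rehashed = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            info = os.stat(full)
            known = conn.execute(
                "SELECT size, mtime_ns FROM files WHERE path = ?", (full,)
            ).fetchone()
            if known == (info.st_size, info.st_mtime_ns):
                continue  # unchanged since the last scan, keep the stored md5sum
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (full, info.st_size, info.st_mtime_ns, md5sum(full)),
            )
            rehashed.append(full)
    conn.commit()
    return rehashed
```

This sketch does not remove rows for files that have been deleted from the share; a full version would need a cleanup pass for that.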

Let's say you find them: what is the next step? Will you be deleting them? The pitfall here is why the user copied the directory in the first place.

How familiar are you with the FileSystemObject, with recursively/iteratively going through the share's file structure, and with connecting to a database?

http://www.tek-tips.com/viewthread.cfm?qid=1480537&page=9

Thanks, I'll try that. I'm familiar with databases, recursion and using the FileSystemObject. Just two more questions:

What function returns the md5sum of a file in VBA?

Do you know a way of comparing nearly identical folder structures (e.g. comparing two folders with 999 identical files and 1 different file returns a resemblance of 99.9%)? Computing the md5sum of an entire tree wouldn't help in this case, I think.

Thanks!
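
One way the resemblance figure asked about above could be computed is to describe each tree as a set of (relative path, MD5) pairs and compare the sets. This is a sketch under stated assumptions, not an established metric. The names `tree_signature` and `resemblance` are hypothetical, and dividing by the size of the larger tree is one choice among several (a Jaccard index would be another).

```python
import hashlib
import os


def tree_signature(root):
    """Set of (relative path, MD5) pairs for every file under root."""
    signature = set()
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            # Reads the whole file at once; hash in chunks for large files.
            with open(full, "rb") as handle:
                digest = hashlib.md5(handle.read()).hexdigest()
            signature.add((os.path.relpath(full, root), digest))
    return signature


def resemblance(sig_a, sig_b):
    """Share of entries common to both trees, relative to the larger tree."""
    if not sig_a and not sig_b:
        return 1.0  # two empty trees are treated as identical
    return len(sig_a & sig_b) / max(len(sig_a), len(sig_b))
```

With this definition, two folders of 1,000 files each that differ in a single file share 999 entries, giving 999 / 1000 = 0.999. Comparing every pair of folders on a large share this way would be expensive, so candidates would probably need to be narrowed down first, for example by folder name or file count.
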
ASKER CERTIFIED SOLUTION
arnold
