Finding directory duplicates

Hi,

We have a file server (Windows 2008 Server) with approx. 800,000 files stored on one share. Sometimes users copy complete directories with subdirectories to other locations on this share. We want to find those duplicate directories using a software tool.

If there are two identical top folders with many subfolders, this tool should only report the two identical top folders, not all the subfolders. We don't care about duplicate files (there are too many of them). We are only interested in duplicate directory structures. We already tried some tools for finding duplicate files. Unfortunately, they report all subfolders of identical top folders.

Example:

C:\Folder1
    C:\Folder1\Subfolder1
    C:\Folder1\Subfolder2
C:\SomeOtherFolder\Folder1
    C:\SomeOtherFolder\Folder1\Subfolder1
    C:\SomeOtherFolder\Folder1\Subfolder2

If C:\Folder1 and C:\SomeOtherFolder\Folder1 are identical, all the duplicate finder tools we tested report 6 lines. We would prefer a tool that reports just 2 lines (C:\Folder1 and C:\SomeOtherFolder\Folder1).
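Editorial sketch, not part of the original post: one possible way to get the 2-line report described above is to compute a content signature for every directory bottom-up and to suppress any duplicate group that is already implied by a duplicate parent. The function names (`file_md5`, `dir_signatures`, `top_level_duplicates`) are hypothetical, MD5 is used only because the thread mentions md5sum, and the code assumes every file is readable (no error handling). Empty directories are ignored because they would all match each other.

```python
import hashlib
import os
from collections import defaultdict

EMPTY_SIG = hashlib.md5().hexdigest()  # signature of a directory with no entries


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dir_signatures(root):
    """Map each directory under root to a digest of its names and contents."""
    sigs = {}
    # topdown=False visits subdirectories before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        h = hashlib.md5()
        for name in sorted(filenames):
            entry = "F|%s|%s\n" % (name, file_md5(os.path.join(dirpath, name)))
            h.update(entry.encode("utf-8", "replace"))
        for name in sorted(dirnames):
            sub = os.path.join(dirpath, name)
            if sub in sigs:  # symlinked directories are not walked, so skip them
                entry = "D|%s|%s\n" % (name, sigs[sub])
                h.update(entry.encode("utf-8", "replace"))
        # The directory's own name is not hashed, so renamed copies still match.
        sigs[dirpath] = h.hexdigest()
    return sigs


def top_level_duplicates(root):
    """Return groups of identical directories, omitting implied subfolder groups."""
    sigs = dir_signatures(os.path.abspath(root))
    groups = defaultdict(list)
    for path, sig in sigs.items():
        groups[sig].append(path)
    result = []
    for sig, paths in groups.items():
        if len(paths) < 2 or sig == EMPTY_SIG:
            continue
        # If every member has the same name inside parents that are themselves
        # identical, the parents' group already covers this one.
        keys = {(sigs.get(os.path.dirname(p)), os.path.basename(p)) for p in paths}
        if len(keys) == 1:
            continue
        result.append(sorted(paths))
    return sorted(result)
```

For the example tree this would yield a single group containing the two Folder1 paths. Note that it reads every file once, which will take a while on 800,000 files.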

Does anybody know of software that can handle this?

Thanks!

mbwjk Asked:

arnold Commented:
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_24475390.html


In a way, you should look for duplicate files first and address the folder duplication afterwards, i.e. catalog the duplicate files and their paths, then decide which path is the correct one and which is the duplicate.

Using a document management system might be an alternative to the file share mechanism.

mbwjk Author Commented:
We tried tools for finding duplicate files. Due to the vast number of files, analyzing the output of file duplicate finder software is very complex and should be done by the software we are looking for. Especially finding the root of duplicate folder structures is very demanding.

We have no problem if a user has a few copies of individual files on our central file share. With duplicate file finder software it is possible to report those duplicate files. It is not possible, however, to detect quickly and on a regular basis when a user copies a complete directory tree.

Any ideas?
arnold Commented:
The only way is to process the entire tree and record filename, path, and md5sum in a database. Then you can GROUP BY the md5sum column. This will return a list of files/paths that are identical.
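Editorial sketch of the catalog idea above, in Python with SQLite rather than the VBA/FileSystemObject setup discussed in this thread. The table layout and the names `build_catalog` and `duplicate_files` are assumptions made for illustration only.

```python
import hashlib
import os
import sqlite3


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_catalog(root, db_path=":memory:"):
    """Walk the tree and record filename, path and md5sum for every file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, filename TEXT, md5sum TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (full, name, md5sum(full)),
            )
    conn.commit()
    return conn


def duplicate_files(conn):
    """Return {md5sum: [paths]} for every digest that occurs more than once."""
    rows = conn.execute(
        "SELECT md5sum, path FROM files WHERE md5sum IN ("
        "  SELECT md5sum FROM files GROUP BY md5sum HAVING COUNT(*) > 1"
        ") ORDER BY md5sum, path"
    )
    groups = {}
    for digest, path in rows:
        groups.setdefault(digest, []).append(path)
    return groups
```

The GROUP BY ... HAVING subquery picks the digests that occur more than once; the outer query then lists every path that carries one of them.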

This is not fast, but on later runs you could look only at items that are new or have changed, i.e. you would effectively maintain a database table of files/paths and their md5sums.
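Editorial sketch of such an incremental scan, assuming that an unchanged file size and modification time mean the content is unchanged (a heuristic, not a guarantee). The `files` table layout and the function name `refresh_catalog` are hypothetical, and the catalog is assumed to cover a single root.

```python
import hashlib
import os
import sqlite3


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def refresh_catalog(conn, root):
    """Re-hash only new or changed files; return how many files were hashed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, md5sum TEXT)"
    )
    known = {
        path: (size, mtime_ns)
        for path, size, mtime_ns in conn.execute(
            "SELECT path, size, mtime_ns FROM files"
        )
    }
    seen = set()
    hashed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            seen.add(full)
            if known.get(full) == (st.st_size, st.st_mtime_ns):
                continue  # size and mtime unchanged: keep the stored md5sum
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (full, st.st_size, st.st_mtime_ns, md5sum(full)),
            )
            hashed += 1
    # Forget files that have disappeared since the last scan.
    for gone in set(known) - seen:
        conn.execute("DELETE FROM files WHERE path = ?", (gone,))
    conn.commit()
    return hashed
```

The first run hashes everything; later runs only pay for a directory walk plus the files that actually changed.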

Let's say you find them: what is the next step? Will you be deleting them? The pitfall here is the reason the user copied the directory in the first place.

How familiar are you with the FileSystemObject, with recursively/iteratively walking the share's file structure, and with connecting to a database?

http://www.tek-tips.com/viewthread.cfm?qid=1480537&page=9
mbwjk Author Commented:
Thanks, I'll try that. I'm familiar with databases, recursion and using the FileSystemObject. Just two more questions:

What function returns the md5sum of a file in VBA?

Do you know a way of comparing nearly identical folder structures (e.g. comparing two folders with 999 identical files and 1 different file returns a resemblance of 99.9%)? Computing the md5sum of an entire tree wouldn't help in this case, I think.
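Editorial sketch of one possible resemblance measure, not taken from the thread: hash each file separately, then divide the number of relative paths whose content matches in both folders by the number of distinct relative paths found in either. With 1000 files per folder and one differing file this gives 999/1000. The names `tree_index` and `resemblance` are hypothetical.

```python
import hashlib
import os


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_index(root):
    """Map each file's path relative to root to its MD5 digest."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            index[os.path.relpath(full, root)] = md5sum(full)
    return index


def resemblance(dir_a, dir_b):
    """Fraction of relative paths that exist in both trees with equal content."""
    a, b = tree_index(dir_a), tree_index(dir_b)
    all_paths = set(a) | set(b)
    if not all_paths:
        return 1.0  # two empty trees are treated as identical
    same = sum(1 for p in all_paths if p in a and a[p] == b.get(p))
    return same / len(all_paths)
```

Because files are matched by relative path, a file that was renamed or moved inside the copy counts as different. Matching on digests alone would be an alternative if renames should not count.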

Thanks!
arnold Commented:
http://bytes.com/topic/visual-basic-net/answers/388664-md5sum-vb-net
http://msdn.microsoft.com/en-us/magazine/cc164146.aspx#fig5

You need to import System.Security.Cryptography and call MD5CryptoServiceProvider().ComputeHash(Fs), where Fs is the open file stream.
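Editorial note: the links above cover VB.NET. For comparison, here is a minimal Python equivalent using the standard hashlib module. The function name `file_md5` is hypothetical. The file is read in chunks so that large files do not have to fit in memory.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the file at path."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read 1 MiB at a time until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

MD5 is adequate for spotting accidental copies, but it is not collision resistant, so it should not be relied on where someone might craft colliding files on purpose.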
