Solved

VB.NET program to find duplicate files on hard drive

Posted on 2013-06-17
Last Modified: 2013-06-17
Hi Experts. Looking for ideas.

I want to write a program to find duplicate files on my hard drive. I've researched a lot and see there are two approaches: "Finding by Filename" and "Finding by Content" (file size and hashing). I'm looking for ideas on the best way to achieve this.

Is it better to load all the filenames and sizes into an array and then find duplicates in the array? If so, wouldn't an array with 300,000 elements be a problem? (That is how many files I have on my hard drive.) Or is it better to go file by file, which looks like it would take forever?

I'm looking for ideas on the principle of the best way to proceed.
Question by:PNRT
LVL 27

Expert Comment

by:MacroShadow
ID: 39252977
Check this out. Although it's written in C#, it can easily be converted to VB.NET.
http://www.codeproject.com/Articles/28512/Duplicate-Files-Finder
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 39253034
Define "duplicate files." It sounds like you are considering two files to be duplicates of each other if they share the same filename. What if they had different sizes, though?

"Is it better loading all the filenames and sizes into an array and then finding duplicates in the array..."
Only because you say "hard drive," I would say no. If you want to search the entire hard drive for duplicate files, an array with an entry for every file can get very large, even using filename only, and may not fit comfortably in memory. Even if it did, finding duplicates by comparing every element with every other element is quadratic: for 300,000 entries that is roughly 45 billion comparisons. You would want to use some sort of hashing data structure like a Dictionary or HashMap.
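
As a rough sketch of what a Dictionary buys you here: the snippet below groups full paths by file name in a single pass, so any name that ends up with more than one path is a candidate duplicate. The sample paths and the names DuplicateNameDemo and FindDuplicateNames are made up for illustration, and treating "same name" as "duplicate" is an assumption, not a recommendation.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DuplicateNameDemo
    ' Hypothetical helper: groups paths by file name (case-insensitive, as on Windows)
    ' and returns only the names that occur more than once.
    Function FindDuplicateNames(paths As IEnumerable(Of String)) As Dictionary(Of String, List(Of String))
        Dim byName As New Dictionary(Of String, List(Of String))(StringComparer.OrdinalIgnoreCase)

        For Each fullPath As String In paths
            Dim fileName As String = Path.GetFileName(fullPath)
            Dim matches As List(Of String) = Nothing
            ' One hash lookup per file instead of a scan over everything seen so far.
            If Not byName.TryGetValue(fileName, matches) Then
                matches = New List(Of String)()
                byName.Add(fileName, matches)
            End If
            matches.Add(fullPath)
        Next

        ' Keep only the names that were seen at more than one path.
        Dim duplicates As New Dictionary(Of String, List(Of String))(StringComparer.OrdinalIgnoreCase)
        For Each entry As KeyValuePair(Of String, List(Of String)) In byName
            If entry.Value.Count > 1 Then duplicates.Add(entry.Key, entry.Value)
        Next
        Return duplicates
    End Function

    Sub Main()
        ' Made-up sample paths; in practice these would come from walking the drive.
        Dim sample As String() = {"C:\a\report.txt", "C:\b\Report.TXT", "C:\a\notes.txt"}
        For Each entry As KeyValuePair(Of String, List(Of String)) In FindDuplicateNames(sample)
            Console.WriteLine(entry.Key & ": " & String.Join(", ", entry.Value))
        Next
    End Sub
End Module
```

Each file costs one lookup and one insert, so the work grows roughly in step with the number of files rather than with its square.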
 
LVL 2

Author Comment

by:PNRT
ID: 39253450
Thanks Kaufmed, that's the sort of information I was looking for. Could you please elaborate a bit more on "some sort of hashing data structure like a Dictionary or HashMap"? Perhaps a reference to start me off.
Many thanks for the reply.
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 39253571
If you use a hashing data structure, insertions and lookups are generally much faster than searching linearly through an array. Such a structure runs each key through a "hashing algorithm" (i.e. a mathematical operation that turns the key into a number) and uses the result to jump almost directly to where the item is stored, so a lookup takes roughly the same time no matter how many items are held. If you used a HashSet (sorry, I said "HashMap" earlier; that's a Java term) to hold the names, then you would get the benefit of this fast access. For example:

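Imports System
Imports System.Collections.Generic
Imports System.IO

...

' File names on Windows are case-insensitive, so compare them that way.
Dim files As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

' SearchOption.AllDirectories descends into subfolders; a single folder can never
' hold two files with the same name. Enumerating all of C:\ can throw
' UnauthorizedAccessException on protected folders, so point this at a folder you
' can read, or walk the tree manually and catch the exception.
For Each filepath As String In Directory.EnumerateFiles("C:\", "*", SearchOption.AllDirectories)
    ' Add returns True if the entry does not exist in the HashSet; False if it already does
    If Not files.Add(Path.GetFileName(filepath)) Then
        ' duplicate file name
    End If
Next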
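
A scan of a whole drive has to descend into subfolders, and some of them will refuse access. Below is a minimal sketch of a manual walk that skips such folders instead of aborting, using a Stack rather than recursion. The names SafeWalk and ListFilesSafely are hypothetical, and the folder passed in Main is only an example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module SafeWalk
    ' Hypothetical helper: returns every file under root that can be listed,
    ' skipping folders that throw on access.
    Function ListFilesSafely(root As String) As List(Of String)
        Dim results As New List(Of String)()
        Dim pending As New Stack(Of String)()
        pending.Push(root)

        While pending.Count > 0
            Dim current As String = pending.Pop()
            Try
                results.AddRange(Directory.GetFiles(current))
                For Each subDir As String In Directory.GetDirectories(current)
                    pending.Push(subDir)
                Next
            Catch ex As UnauthorizedAccessException
                ' Protected folder: skip it and carry on.
            Catch ex As IOException
                ' Folder disappeared or the path is too long: skip it as well.
            End Try
        End While
        Return results
    End Function

    Sub Main()
        ' Example root only; any readable folder works.
        Dim found As List(Of String) = ListFilesSafely(Path.GetTempPath())
        Console.WriteLine(found.Count & " files found")
    End Sub
End Module
```

This sketch does not guard against junctions or symbolic links that point back up the tree, so treat it as a starting point rather than a finished scanner.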



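Again, based on your comment of "hard drive": a few hundred thousand file names should fit in memory on most machines, but on a very large drive, or if you store more than the name for each file, you could still run out of memory even using a HashSet. If that happens, you may need to break your logic up to process various directories in chunks to avoid out-of-memory exceptions.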
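
If "duplicate" is taken to mean identical content rather than identical name, one way to keep both memory and run time down is to group files by size first (cheap, no file reads) and hash only the files that share a size with another file. The sketch below does that under those assumptions. ContentDuplicates and FindContentDuplicates are hypothetical names, SHA-256 is an arbitrary choice of hash, and the demo files in Main are throwaway samples.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module ContentDuplicates
    ' Hypothetical helper: returns groups of paths whose contents hash identically.
    Function FindContentDuplicates(paths As IEnumerable(Of String)) As List(Of List(Of String))
        ' Pass 1: bucket by length; a file with a unique size cannot have a duplicate.
        Dim bySize As New Dictionary(Of Long, List(Of String))()
        For Each filePath As String In paths
            Dim size As Long = New FileInfo(filePath).Length
            Dim sizeBucket As List(Of String) = Nothing
            If Not bySize.TryGetValue(size, sizeBucket) Then
                sizeBucket = New List(Of String)()
                bySize.Add(size, sizeBucket)
            End If
            sizeBucket.Add(filePath)
        Next

        ' Pass 2: hash only the files that share a size with at least one other file.
        Dim result As New List(Of List(Of String))()
        Using sha As SHA256 = SHA256.Create()
            For Each candidates As List(Of String) In bySize.Values
                If candidates.Count < 2 Then Continue For
                Dim byHash As New Dictionary(Of String, List(Of String))()
                For Each candidate As String In candidates
                    Dim digest As String
                    Using fs As FileStream = File.OpenRead(candidate)
                        digest = Convert.ToBase64String(sha.ComputeHash(fs))
                    End Using
                    Dim matched As List(Of String) = Nothing
                    If Not byHash.TryGetValue(digest, matched) Then
                        matched = New List(Of String)()
                        byHash.Add(digest, matched)
                    End If
                    matched.Add(candidate)
                Next
                For Each hashGroup As List(Of String) In byHash.Values
                    If hashGroup.Count > 1 Then result.Add(hashGroup)
                Next
            Next
        End Using
        Return result
    End Function

    Sub Main()
        ' Demo with throwaway files in a fresh temp folder: a.txt and b.txt are
        ' identical, c.txt has the same length but different content.
        Dim tempDir As String = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
        Directory.CreateDirectory(tempDir)
        File.WriteAllText(Path.Combine(tempDir, "a.txt"), "same content")
        File.WriteAllText(Path.Combine(tempDir, "b.txt"), "same content")
        File.WriteAllText(Path.Combine(tempDir, "c.txt"), "different!!!")

        For Each duplicateSet As List(Of String) In FindContentDuplicates(Directory.GetFiles(tempDir))
            Console.WriteLine(String.Join(" = ", duplicateSet))
        Next
        Directory.Delete(tempDir, True)
    End Sub
End Module
```

Only the size table lives in memory for the whole run; the per-size hash table is discarded after each bucket. The sketch does not handle files that are locked or unreadable, which a real scan would need to catch.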