Solved

VB.NET program to find duplicate files on hard drive

Posted on 2013-06-17
Last Modified: 2013-06-17
Hi Experts.  Looking for ideas.

I want to write a program to find duplicate files on my hard drive. I've researched a lot and see there is "Finding by Filename" and "Finding by Content" (file size and hashing). I'm looking for ideas on the best way to achieve this.

Is it better loading all the filenames and sizes into an array and then finding duplicates in the array? If so, wouldn't there be a problem with an array of 300,000 elements (which is what I have on my hard drive)? Or is it better to go file by file, which looks like it would take forever?

Looking for ideas on the principle of the best way to proceed.
Question by:PNRT

Expert Comment

by:MacroShadow
ID: 39252977
Check this out. Although it's written in C#, it can easily be converted to VB.NET.
http://www.codeproject.com/Articles/28512/Duplicate-Files-Finder

Expert Comment

by:käµfm³d 👽
ID: 39253034
Define "duplicate files." It sounds like you are considering two files to be duplicates of each other if they share the same filename. What if they had different sizes, though?

"Is it better loading all the filenames and sizes into an array and then finding duplicates in the array..."
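Only because you say "hard drive" I would be cautious. Holding 300,000 names and sizes in an array is feasible on most machines (the names alone come to a few tens of megabytes), but a whole drive can hold far more files than that, and memory use grows with every extra piece of data you keep per file. The bigger problem is speed: finding duplicates by comparing every element of the array against every other would take a really, really long time, because the number of comparisons grows with the square of the number of files. You would want to use some sort of hashing data structure like a Dictionary or HashMap.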

Author Comment

by:PNRT
ID: 39253450
Thanks Kaufmed, that's the sort of information I was looking for. Could you please elaborate a bit more on "some sort of hashing data structure like a Dictionary or HashMap"? Perhaps a reference to start me off.
Many thanks for the reply.

Accepted Solution

by:käµfm³d 👽, earned 2000 total points
ID: 39253571
If you use a hashing data structure, insertions and lookups are generally much faster than linearly searching through an array. A hashing data structure runs each item through a "hashing algorithm" (i.e. a mathematical operation that turns the item into a number) and uses the result to decide where the item is stored, so locating an item takes roughly the same time no matter how many items are already in the collection. If you used a HashSet (sorry, I said "HashMap" earlier; that's a Java term) to hold the names, then you would get the benefit of this fast access. For example:

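Imports System
Imports System.Collections.Generic
Imports System.IO

...

' Windows file names are case-insensitive, so compare names ignoring case.
' (The HashSet constructor that takes an initial capacity only exists in newer framework versions.)
Dim files As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

' Note: GetFiles("C:\") lists only the files directly in C:\, not those in subfolders
For Each filepath As String In Directory.GetFiles("C:\")
    ' Add returns True if the entry was not yet in the HashSet; False if it already was
    If Not files.Add(Path.GetFileName(filepath)) Then
        ' duplicate file name
    End If
Next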
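
To report which files share a name, rather than only detecting that a name has been seen before, a Dictionary can map each name to the list of full paths that use it. What follows is a minimal sketch and not part of the original answer: the module name DuplicateNames, the helper GroupByName and the start folder "C:\Users" are illustrative assumptions. It walks subfolders with an explicit stack and skips folders it is not allowed to read; it makes no attempt to detect junctions or symbolic links that point back up the tree.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DuplicateNames

    ' Hypothetical helper: maps each file name (ignoring case) to every full path that uses it.
    Function GroupByName(root As String) As Dictionary(Of String, List(Of String))
        Dim byName As New Dictionary(Of String, List(Of String))(StringComparer.OrdinalIgnoreCase)
        Dim pending As New Stack(Of String)()
        pending.Push(root)

        While pending.Count > 0
            Dim folder As String = pending.Pop()
            Try
                For Each filePath As String In Directory.GetFiles(folder)
                    Dim fileName As String = Path.GetFileName(filePath)
                    Dim paths As List(Of String) = Nothing
                    If Not byName.TryGetValue(fileName, paths) Then
                        paths = New List(Of String)()
                        byName.Add(fileName, paths)
                    End If
                    paths.Add(filePath)
                Next
                ' Queue the subfolders so they are visited on later iterations.
                For Each subFolder As String In Directory.GetDirectories(folder)
                    pending.Push(subFolder)
                Next
            Catch ex As UnauthorizedAccessException
                ' Skip folders the current user is not allowed to read.
            Catch ex As IOException
                ' Skip folders that disappear or cannot be read during the scan.
            End Try
        End While

        Return byName
    End Function

    Sub Main()
        ' "C:\Users" is only a placeholder start folder.
        For Each entry As KeyValuePair(Of String, List(Of String)) In GroupByName("C:\Users")
            If entry.Value.Count > 1 Then
                Console.WriteLine(entry.Key & ": " & String.Join(", ", entry.Value))
            End If
        Next
    End Sub

End Module
```

Each dictionary lookup takes roughly constant time, so the whole scan grows in proportion to the number of files rather than with its square.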

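Again, based on your comment of "hard drive," memory use is worth keeping an eye on. A HashSet holding a few hundred thousand file names should fit comfortably on most machines, but a much larger drive, or keeping extra data for every file, could still lead to memory exceptions. If that happens, you may need to break your logic up to process various directories in chunks.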
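
For the "Finding by Content" approach mentioned in the question, one common way to keep both memory and run time down is to group files by size first and hash only the files that share a size with at least one other file. The following is a rough sketch under stated assumptions, not something from the original answer: the module name DuplicateContent, the helpers HashFile and FindDuplicates, the choice of SHA-256 and the folder "C:\Temp" are all illustrative.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module DuplicateContent

    ' Hypothetical helper: returns the SHA-256 hash of a file's contents as a hex string.
    Function HashFile(filePath As String) As String
        Using sha As SHA256 = SHA256.Create()
            Using stream As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(sha.ComputeHash(stream))
            End Using
        End Using
    End Function

    ' Hypothetical helper: returns one list per set of files with identical content.
    Function FindDuplicates(filePaths As IEnumerable(Of String)) As List(Of List(Of String))
        ' Pass 1: bucket the paths by file size; no file contents are read here.
        Dim bySize As New Dictionary(Of Long, List(Of String))()
        For Each filePath As String In filePaths
            Dim size As Long = New FileInfo(filePath).Length
            Dim bucket As List(Of String) = Nothing
            If Not bySize.TryGetValue(size, bucket) Then
                bucket = New List(Of String)()
                bySize.Add(size, bucket)
            End If
            bucket.Add(filePath)
        Next

        ' Pass 2: hash only the files whose size matches at least one other file.
        Dim result As New List(Of List(Of String))()
        For Each candidates As List(Of String) In bySize.Values
            If candidates.Count < 2 Then Continue For
            Dim byHash As New Dictionary(Of String, List(Of String))()
            For Each filePath As String In candidates
                Dim hash As String = HashFile(filePath)
                Dim sameHash As List(Of String) = Nothing
                If Not byHash.TryGetValue(hash, sameHash) Then
                    sameHash = New List(Of String)()
                    byHash.Add(hash, sameHash)
                End If
                sameHash.Add(filePath)
            Next
            For Each dupSet As List(Of String) In byHash.Values
                If dupSet.Count > 1 Then result.Add(dupSet)
            Next
        Next
        Return result
    End Function

    Sub Main()
        ' "C:\Temp" is only a placeholder folder.
        Dim folder As String = "C:\Temp"
        If Directory.Exists(folder) Then
            For Each dupSet As List(Of String) In FindDuplicates(Directory.GetFiles(folder))
                Console.WriteLine(String.Join(" = ", dupSet))
            Next
        End If
    End Sub

End Module
```

Only paths and sizes are held in memory during the first pass, and most files are never opened at all because their size is unique. Files that are locked or unreadable would make HashFile throw, so a real scan would need to catch and skip those.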