Solved

VB.NET program to find duplicate files on hard drive

Posted on 2013-06-17
Last Modified: 2013-06-17
Hi Experts.  Looking for ideas.

I want to write a program to find duplicate files on my hard drive. I've researched a lot and see there are two approaches: finding by filename, and finding by content (file size and hashing). I'm looking for ideas on the best way to achieve this.

Is it better to load all the filenames and sizes into an array and then find the duplicates in the array? If so, wouldn't an array with 300,000 elements (which is what I have on my hard drive) be a problem? Or is it better to go file by file, which looks like it would take forever?

I'm looking for ideas on the principle of the best way to proceed.
Question by:PNRT
4 Comments
 
LVL 27

Expert Comment

by:MacroShadow
ID: 39252977
Check this out. Although it's written in C#, it can easily be converted to VB.NET.
http://www.codeproject.com/Articles/28512/Duplicate-Files-Finder
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 39253034
Define "duplicate files." It sounds like you are considering two files to be duplicates of each other if they share the same filename. What if they had different sizes, though?
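
One stricter definition, as a rough sketch only: treat two files as duplicates when they have the same length and the same SHA-256 hash of their contents. The helper names HashFile and HaveSameContent are made up for illustration and are not from any library.

```vb
Imports System
Imports System.IO
Imports System.Security.Cryptography

Module DuplicateDefinition
    ' Hypothetical helper: returns the SHA-256 hash of a file's contents as a hex string.
    Function HashFile(filePath As String) As String
        Using sha As SHA256 = SHA256.Create()
            Using stream As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(sha.ComputeHash(stream))
            End Using
        End Using
    End Function

    ' Hypothetical helper: True if both files have the same size and the same hash.
    Function HaveSameContent(pathA As String, pathB As String) As Boolean
        Dim lengthA As Long = New FileInfo(pathA).Length
        Dim lengthB As Long = New FileInfo(pathB).Length

        ' Cheap check first: files of different length cannot be identical,
        ' so there is no need to read either of them.
        If lengthA <> lengthB Then
            Return False
        End If

        Return HashFile(pathA) = HashFile(pathB)
    End Function
End Module
```

Checking the size first means most files never need to be read at all, which matters when there are hundreds of thousands of them.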

"Is it better loading all the filenames and sizes into an array and then finding duplicates in the array..."
Only because you say "hard drive" I would say no. If you want to search the entire hard drive for duplicate files, holding every file in one array may become a problem, even using filename only. Even if it fits, finding duplicates by comparing every element against every other would take a really long time: with 300,000 entries that is on the order of 45 billion comparisons. You would want to use some sort of hashing data structure like a Dictionary or HashMap.
 
LVL 2

Author Comment

by:PNRT
ID: 39253450
Thanks Kaufmed, that's the sort of information I was looking for. Could you please elaborate a bit more on "some sort of hashing data structure like a Dictionary or HashMap"? Perhaps a reference to start me off.
Many thanks for the reply.
 
LVL 75

Accepted Solution

by:käµfm³d 👽 (earned 500 total points)
ID: 39253571
If you use a hashing data structure, insertions and lookups are generally much faster than searching linearly through an array. A hashing data structure runs each item through a "hashing algorithm" (i.e. a mathematical operation) to decide where the item is stored, so it can locate an item in roughly constant time on average instead of scanning every element. If you used a HashSet (sorry, I said "HashMap" earlier; that's a Java term) to hold the names, then you would get the benefit of this fast access. For example:

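Imports System
Imports System.Collections.Generic
Imports System.IO

...

' Windows file names are case-insensitive, so compare them that way.
Dim files As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

' AllDirectories descends into subfolders; without it only the files directly
' in C:\ are listed, and those can never share a name. It throws
' UnauthorizedAccessException on protected folders, so real code must handle that.
For Each filepath As String In Directory.GetFiles("C:\", "*", SearchOption.AllDirectories)
    ' Add returns True if the name was not yet in the set, False if it already was
    If Not files.Add(Path.GetFileName(filepath)) Then
        ' duplicate file name
    End If
Next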
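
A HashSet only records that a name has been seen before. If you also want to list where each duplicate lives, a Dictionary can map every file name to the full paths that use it. This is only a sketch; the module and function names (NameGrouping, GroupByName) are invented for the example.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module NameGrouping
    ' Hypothetical helper: maps each file name to every full path that uses that name.
    Function GroupByName(paths As IEnumerable(Of String)) As Dictionary(Of String, List(Of String))
        ' Case-insensitive keys, to match how Windows treats file names.
        Dim groups As New Dictionary(Of String, List(Of String))(StringComparer.OrdinalIgnoreCase)

        For Each fullPath As String In paths
            Dim name As String = Path.GetFileName(fullPath)
            Dim bucket As List(Of String) = Nothing

            ' TryGetValue is a hash lookup, not a scan of all entries.
            If Not groups.TryGetValue(name, bucket) Then
                bucket = New List(Of String)()
                groups.Add(name, bucket)
            End If

            bucket.Add(fullPath)
        Next

        Return groups
    End Function
End Module
```

Any entry whose list holds more than one path is a group of same-named files. The same pattern works with the file size as the key (Dictionary(Of Long, List(Of String))) if you would rather group by size.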


Again, based on your comment of "hard drive," memory use is worth watching even with a HashSet, because every name stays in memory for the whole run and the total grows with the number of files. If you do hit memory exceptions, break your logic up so that it processes the directories in chunks.
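
One way to process directories in chunks, sketched under assumptions: walk the tree one folder at a time with an explicit stack, and skip folders that cannot be read. WalkFiles is a hypothetical name, and Iterator functions need Visual Basic 11 (Visual Studio 2012) or later.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module SafeWalk
    ' Hypothetical helper: yields file paths one directory at a time,
    ' skipping folders that cannot be read.
    Iterator Function WalkFiles(root As String) As IEnumerable(Of String)
        Dim pending As New Stack(Of String)()
        pending.Push(root)

        While pending.Count > 0
            Dim current As String = pending.Pop()
            Dim files As String() = Nothing
            Dim subdirs As String() = Nothing

            Try
                files = Directory.GetFiles(current)
                subdirs = Directory.GetDirectories(current)
            Catch ex As UnauthorizedAccessException
                Continue While   ' protected folder: skip it
            Catch ex As IOException
                Continue While   ' folder vanished, path too long, etc.: skip it
            End Try

            For Each f As String In files
                Yield f
            Next

            ' Subfolders are queued and handled on a later pass of the loop.
            For Each d As String In subdirs
                pending.Push(d)
            Next
        End While
    End Function
End Module
```

Only one folder's listing is held at a time, so the caller decides how much to keep (for example by feeding each path into the HashSet as it arrives). The sketch does not guard against directory junctions that point back up the tree, which real code would need to consider.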