Solved

VB.Net program to find duplicate files on hard drive

Posted on 2013-06-17
Last Modified: 2013-06-17
Hi Experts. Looking for ideas.

I want to write a program to find duplicate files on my hard drive. I've researched a lot and see there are two approaches: "Finding by Filename" and "Finding by Content" (file size and hashing). I'm looking for ideas on the best way to achieve this.

Is it better loading all the filenames and sizes into an array and then finding duplicates in the array? If so, wouldn't there be a problem with an array of 300,000 elements (which is how many files I have on my hard drive)? Or is it better to go file by file, which looks like it would take forever?

Looking for ideas on the principle of the best way to proceed.
Question by:PNRT
4 Comments
 
LVL 27

Expert Comment

by:MacroShadow
ID: 39252977
Check this out. Although it's written in C#, it can easily be converted to VB.NET.
http://www.codeproject.com/Articles/28512/Duplicate-Files-Finder
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 39253034
Define "duplicate files." It sounds like you are considering two files to be duplicates of each other if they share the same filename. What if they had different sizes, though?

"Is it better loading all the filenames and sizes into an array and then finding duplicates in the array..."
Only because you say "hard drive," I would say no. If you want to search the entire hard drive for duplicate files, then you will definitely not fit every file into an array, not even using filename only. Even if you did, finding duplicates in that array by comparing every entry against every other entry would take a really, really long time. You would want to use some sort of hashing data structure like a Dictionary or HashMap.
 
LVL 2

Author Comment

by:PNRT
ID: 39253450
Thanks Kaufmed, that's the sort of information I was looking for. Please could you elaborate a bit more on "some sort of hashing data structure like a Dictionary or HashMap"? Perhaps a reference to start me off.
Many thanks for the reply.
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 39253571
If you use a hashing data structure, the insertions and queries are generally faster than linearly searching through an array. A hashing data structure uses some sort of "hashing algorithm" (i.e. a mathematical operation) that makes locating an object very fast. If you used a HashSet (sorry, I said "HashMap" earlier; that's a Java term) to hold the names, then you would get the benefit of this fast access. For example:

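Imports System
Imports System.Collections.Generic
Imports System.IO

...

' Windows file names are case-insensitive, so compare names ignoring case
Dim files As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

' AllDirectories descends into subfolders (two files in the same folder can never share a name).
' On a system drive this can throw UnauthorizedAccessException for protected folders,
' so a real scan needs to handle that.
For Each filepath As String In Directory.EnumerateFiles("C:\", "*", SearchOption.AllDirectories)
    ' Add returns True if the entry does not exist in the HashSet; False if it already does
    If Not files.Add(Path.GetFileName(filepath)) Then
        ' duplicate file name
    End If
Next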
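
The HashSet only tells you that a name has been seen before; it does not remember where the earlier file lives. If you also want the paths, one option is a Dictionary keyed by file name. The following is a minimal sketch, not a finished tool: FindDuplicateNames and the sample paths are made-up names for illustration, and it assumes Windows-style paths.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DuplicateNames
    ' Groups full paths by file name (ignoring case) and returns only the
    ' names that occur more than once. Hypothetical helper for illustration.
    Function FindDuplicateNames(paths As IEnumerable(Of String)) As Dictionary(Of String, List(Of String))
        Dim byName As New Dictionary(Of String, List(Of String))(StringComparer.OrdinalIgnoreCase)

        For Each filepath As String In paths
            Dim name As String = Path.GetFileName(filepath)
            Dim group As List(Of String) = Nothing
            If Not byName.TryGetValue(name, group) Then
                group = New List(Of String)()
                byName.Add(name, group)
            End If
            group.Add(filepath)
        Next

        ' Keep only the names that were seen in more than one place
        Dim duplicates As New Dictionary(Of String, List(Of String))(StringComparer.OrdinalIgnoreCase)
        For Each entry As KeyValuePair(Of String, List(Of String)) In byName
            If entry.Value.Count > 1 Then duplicates.Add(entry.Key, entry.Value)
        Next
        Return duplicates
    End Function

    Sub Main()
        ' Made-up sample paths; a real run would pass the enumerated files instead
        Dim sample As String() = {"C:\a\report.txt", "C:\b\Report.txt", "C:\b\notes.txt"}
        For Each entry As KeyValuePair(Of String, List(Of String)) In FindDuplicateNames(sample)
            Console.WriteLine(entry.Key & ": " & String.Join(", ", entry.Value))
        Next
    End Sub
End Module
```

Lookups and insertions in a Dictionary are hash-based, just like the HashSet, so the cost stays roughly proportional to the number of files rather than to the square of it.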


Again, based on your comment of "hard drive," there's a good chance that you'll still run out of memory even using a HashSet. You may need to break your logic up to process various directories in chunks to avoid encountering memory exceptions.
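
For the "by content" approach mentioned in the question, a rough sketch under some assumptions might look like the code below. It walks one directory at a time, skips folders it cannot read, groups files by size, and hashes only the files whose size matches another file's. The module and function names, the choice of SHA-256, and the "C:\Temp" default are all assumptions made for illustration, not part of the original answer.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module DuplicateContent
    ' Groups files under root by length, visiting one directory at a time
    ' and skipping folders that cannot be read.
    Function GroupBySize(root As String) As Dictionary(Of Long, List(Of String))
        Dim bySize As New Dictionary(Of Long, List(Of String))()
        Dim pending As New Stack(Of String)()
        pending.Push(root)

        While pending.Count > 0
            Dim current As String = pending.Pop()
            Try
                For Each subdir As String In Directory.GetDirectories(current)
                    pending.Push(subdir)
                Next
                For Each filepath As String In Directory.GetFiles(current)
                    Dim size As Long = New FileInfo(filepath).Length
                    Dim group As List(Of String) = Nothing
                    If Not bySize.TryGetValue(size, group) Then
                        group = New List(Of String)()
                        bySize.Add(size, group)
                    End If
                    group.Add(filepath)
                Next
            Catch ex As UnauthorizedAccessException
                ' Protected folder: skip it and carry on
            Catch ex As IOException
                ' Folder or file vanished or is unreadable: skip it
            End Try
        End While
        Return bySize
    End Function

    ' Returns the SHA-256 digest of a file's contents as a hex string.
    Function HashFile(filepath As String) As String
        Using sha As SHA256 = SHA256.Create()
            Using fs As FileStream = File.OpenRead(filepath)
                Return BitConverter.ToString(sha.ComputeHash(fs))
            End Using
        End Using
    End Function

    ' Returns groups of paths whose contents produce the same digest.
    Function FindDuplicates(root As String) As List(Of List(Of String))
        Dim result As New List(Of List(Of String))()
        For Each sameSize As List(Of String) In GroupBySize(root).Values
            ' A file with a unique size cannot have a duplicate, so it is never read
            If sameSize.Count < 2 Then Continue For

            Dim byHash As New Dictionary(Of String, List(Of String))()
            For Each filepath As String In sameSize
                Dim digest As String = Nothing
                Try
                    digest = HashFile(filepath)
                Catch ex As IOException
                    Continue For   ' locked or unreadable file
                Catch ex As UnauthorizedAccessException
                    Continue For
                End Try
                Dim group As List(Of String) = Nothing
                If Not byHash.TryGetValue(digest, group) Then
                    group = New List(Of String)()
                    byHash.Add(digest, group)
                End If
                group.Add(filepath)
            Next
            For Each matches As List(Of String) In byHash.Values
                If matches.Count > 1 Then result.Add(matches)
            Next
        Next
        Return result
    End Function

    Sub Main(args As String())
        ' The default root is an assumption; pass a folder on the command line to override it
        Dim root As String = If(args.Length > 0, args(0), "C:\Temp")
        For Each matches As List(Of String) In FindDuplicates(root)
            Console.WriteLine(String.Join(" | ", matches))
        Next
    End Sub
End Module
```

Grouping by size first means most files are never opened at all, which is where the bulk of the time would otherwise go. If memory does become a concern, the same functions could be called once per top-level folder, at the cost of missing duplicates that sit in different folders.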