Solved

VB.NET program to find duplicate files on hard drive

Posted on 2013-06-17
Last Modified: 2013-06-17
Hi Experts. Looking for ideas.

I want to write a program to find duplicate files on my hard drive. I've done a lot of research and see there are two approaches: "finding by filename" and "finding by content" (file size and hashing). I'm looking for ideas on the best way to achieve this.

Is it better loading all the filenames and sizes into an array and then finding duplicates in the array? If so, wouldn't an array with 300,000 elements (which is what I have on my hard drive) be a problem? Or is it better to go file by file, which looks like it would take forever?

Looking for ideas on the best way to proceed in principle.
Question by:PNRT
4 Comments
 
LVL 27

Expert Comment

by:MacroShadow
ID: 39252977
Check this out. Although it's written in C#, it can easily be converted to VB.NET.
http://www.codeproject.com/Articles/28512/Duplicate-Files-Finder
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 39253034
Define "duplicate files." It sounds like you are considering two files to be duplicates of each other if they share the same filename. What if they had different sizes, though?

"Is it better loading all the filenames and sizes into an array and then finding duplicates in the array..."
Only because you say "hard drive" I would say no. If you want to search the entire hard drive for duplicate files, then you will definitely not fit every file into an array--not even using filename only. Even if you did, trying to find duplicates in that array would take a really, really, really long time. You would want to use some sort of hashing data structure like a Dictionary or HashMap.
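As a rough illustration of the Dictionary idea, here is a minimal sketch that groups full paths by file name so that each duplicate name keeps a list of where it was found. The module name, `FindDuplicateNames`, and the sample paths are hypothetical, and the case-insensitive comparer is an assumption based on how Windows treats file names.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DuplicateNames
    ' Groups full paths by file name (case-insensitive) and returns only
    ' the names that occur more than once. Hypothetical helper for illustration.
    Function FindDuplicateNames(paths As IEnumerable(Of String)) As Dictionary(Of String, List(Of String))
        Dim byName As New Dictionary(Of String, List(Of String))(StringComparer.OrdinalIgnoreCase)

        For Each filePath As String In paths
            Dim name As String = Path.GetFileName(filePath)
            Dim bucket As List(Of String) = Nothing
            ' TryGetValue is a hash lookup, so it does not scan the whole collection.
            If Not byName.TryGetValue(name, bucket) Then
                bucket = New List(Of String)()
                byName.Add(name, bucket)
            End If
            bucket.Add(filePath)
        Next

        ' Keep only the names that were seen at least twice.
        Dim duplicates As New Dictionary(Of String, List(Of String))(StringComparer.OrdinalIgnoreCase)
        For Each pair As KeyValuePair(Of String, List(Of String)) In byName
            If pair.Value.Count > 1 Then duplicates.Add(pair.Key, pair.Value)
        Next
        Return duplicates
    End Function

    Sub Main()
        ' Made-up sample paths; in practice these would come from the file system.
        Dim sample As String() = {"C:\a\report.txt", "C:\b\Report.TXT", "C:\b\notes.txt"}
        For Each pair As KeyValuePair(Of String, List(Of String)) In FindDuplicateNames(sample)
            Console.WriteLine(pair.Key & ": " & String.Join(", ", pair.Value))
        Next
    End Sub
End Module
```

Keeping a list per name, rather than just a yes/no answer, means the program can report every location of a duplicate instead of only the second one it encounters.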
 
LVL 2

Author Comment

by:PNRT
ID: 39253450
Thanks Kaufmed, that's the sort of information I was looking for. Could you please elaborate a bit more on "some sort of hashing data structure like a Dictionary or HashMap"? Perhaps a reference to start me off.
Many thanks for the reply.
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 2000 total points
ID: 39253571
If you use a hashing data structure, the insertions and queries are generally faster than linearly searching through an array. A hashing data structure uses some sort of "hashing algorithm" (i.e. a mathematical operation) that makes locating an object very fast. If you used a HashSet (sorry, I said "HashMap" earlier; that's a Java term) to hold the names, then you would get the benefit of this fast access. For example:

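Imports System
Imports System.Collections.Generic
Imports System.IO

...

' File names on Windows are case-insensitive, so compare them that way.
' (The HashSet constructor that takes an initial capacity only exists in newer
' framework versions, so it is not used here.)
Dim files As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

' Note: GetFiles("C:\") lists only the files directly in C:\, not those in subfolders.
For Each filepath As String In Directory.GetFiles("C:\")
    ' Add returns True if the entry was not yet in the HashSet; False if it already was
    If Not files.Add(Path.GetFileName(filepath)) Then
        ' duplicate file name
    End If
Next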

Again, based on your comment of "hard drive," there's a good chance that you'll still run out of memory even using a HashSet. You may need to break your logic up to process various directories in chunks to avoid encountering memory exceptions.
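For the "finding by content" approach mentioned in the question, one possible sketch is to walk the tree one folder at a time, group files by size, and hash only the files that share a size with at least one other file. Everything named here (`GroupBySize`, `HashFile`, `FindDuplicateContent`, the module itself) is hypothetical, SHA-256 is just one reasonable choice of hash, and the error handling simply skips whatever cannot be read.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module DuplicateContent
    ' Groups every readable file under root by its length in bytes,
    ' visiting one folder at a time instead of listing the whole drive at once.
    Function GroupBySize(root As String) As Dictionary(Of Long, List(Of String))
        Dim bySize As New Dictionary(Of Long, List(Of String))()
        Dim pending As New Stack(Of String)()
        pending.Push(root)

        While pending.Count > 0
            Dim folder As String = pending.Pop()
            Try
                For Each subfolder As String In Directory.GetDirectories(folder)
                    pending.Push(subfolder)
                Next
                For Each filePath As String In Directory.GetFiles(folder)
                    Dim size As Long = New FileInfo(filePath).Length
                    Dim bucket As List(Of String) = Nothing
                    If Not bySize.TryGetValue(size, bucket) Then
                        bucket = New List(Of String)()
                        bySize.Add(size, bucket)
                    End If
                    bucket.Add(filePath)
                Next
            Catch ex As UnauthorizedAccessException
                ' Give up on the rest of this folder if it cannot be read.
            Catch ex As IOException
                ' Give up on the rest of this folder if something in it is unreadable.
            End Try
        End While
        Return bySize
    End Function

    ' Returns the SHA-256 hash of a file's content as a hex string.
    Function HashFile(filePath As String) As String
        Using sha As SHA256 = SHA256.Create()
            Using stream As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(sha.ComputeHash(stream))
            End Using
        End Using
    End Function

    ' Returns groups of paths whose size and content hash both match.
    Function FindDuplicateContent(root As String) As List(Of List(Of String))
        Dim result As New List(Of List(Of String))()
        For Each sizeGroup As List(Of String) In GroupBySize(root).Values
            ' A file with a unique size cannot have a duplicate, so it is never hashed.
            If sizeGroup.Count < 2 Then Continue For

            Dim byHash As New Dictionary(Of String, List(Of String))()
            For Each filePath As String In sizeGroup
                Try
                    Dim key As String = HashFile(filePath)
                    Dim bucket As List(Of String) = Nothing
                    If Not byHash.TryGetValue(key, bucket) Then
                        bucket = New List(Of String)()
                        byHash.Add(key, bucket)
                    End If
                    bucket.Add(filePath)
                Catch ex As IOException
                    ' Skip files that are locked or disappear while scanning.
                Catch ex As UnauthorizedAccessException
                    ' Skip files we are not allowed to open.
                End Try
            Next

            For Each hashGroup As List(Of String) In byHash.Values
                If hashGroup.Count > 1 Then result.Add(hashGroup)
            Next
        Next
        Return result
    End Function

    Sub Main(args As String())
        ' The root folder is taken from the command line; defaults to the current folder.
        Dim root As String = If(args.Length > 0, args(0), Directory.GetCurrentDirectory())
        For Each group As List(Of String) In FindDuplicateContent(root)
            Console.WriteLine(String.Join(" == ", group))
        Next
    End Sub
End Module
```

Only paths and sizes are held in memory during the walk, and most files are never opened at all because their size is unique. If memory is still a concern, the same functions can be run on one top-level directory at a time, at the cost of missing duplicates that live in different directories.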

Question has a verified solution.
