Finding the duplicates in a big collection

Posted on 2007-04-03
Last Modified: 2013-11-07
Hello all,

I have a little problem.
I must find duplicates in a CollectionBase object.
There are three properties that together determine the uniqueness of a record.

I am reading an XML file that gives me the collection of objects in the CollectionBase object.
Then I must report/display which records are duplicated according to the values of some tag nodes.

 "public class XmlEmployesCollection : CollectionBase"

The problem is that sometimes there are more than 15,000 objects in the XmlEmployesCollection.
What I need is some guidelines for completing this task: "Finding the duplicates in a big collection."

I am using .NET Framework v2.0 with C#.

Thanks in advance,
So.



Question by:barbulea
3 Comments
 
Expert Comment

by:raterus
ID: 18845606
Whenever I need to find duplicates, I pull out the trusty Hashtable object and start adding values to it based on a "should be unique" key.  Before you add a value to the Hashtable, use the ContainsKey method to see if you've already put that key there.  If you have, you know you have a duplicate.
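To make that concrete, here is a minimal sketch of the Hashtable approach in C# 2.0-style syntax. The Employee class, its Name/Age/Salary fields, and the HashtableDuplicateFinder helper are all assumptions made up for illustration; substitute your own class and its three key properties. The three values are joined into one string key, which is just one simple way to build a composite key.

```csharp
using System;
using System.Collections;

// Hypothetical record type; the real class and its three key properties will differ.
public class Employee
{
    public string Name;
    public int Age;
    public double Salary;

    public Employee(string name, int age, double salary)
    {
        Name = name;
        Age = age;
        Salary = salary;
    }
}

// Hypothetical helper, not part of the original posts.
public static class HashtableDuplicateFinder
{
    // Joins the three key values into one string. The '|' separator keeps
    // adjacent values from running together (age 1 + salary 23 would
    // otherwise look the same as age 12 + salary 3).
    public static string MakeKey(Employee e)
    {
        return e.Name + "|" + e.Age + "|" + e.Salary.ToString("R");
    }

    // Returns every record whose key was already seen earlier in the sequence.
    // Takes IEnumerable so that a CollectionBase-derived collection can be passed in.
    public static ArrayList FindDuplicates(IEnumerable employees)
    {
        Hashtable seen = new Hashtable();
        ArrayList duplicates = new ArrayList();
        foreach (Employee e in employees)
        {
            string key = MakeKey(e);
            if (seen.ContainsKey(key))
                duplicates.Add(e);   // key already present: this record is a duplicate
            else
                seen.Add(key, e);    // first occurrence of this key
        }
        return duplicates;
    }
}
```

This is a single pass over the collection, so 15,000 records should not be a problem. If your real key properties are all strings, pick a separator that cannot occur inside the values.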
Expert Comment

by:AlexNek
ID: 18845796
With a key made of three properties it is not quite as easy, but it only takes a few additional steps.
For a single key (which can also be a composite key) you have at least two methods:
- Sort the collection by key and remove one of each pair of equal neighbouring items.
- When you build the collection, maintain an additional map by key and don't add items that are already in the map.
A binary-search insertion into a sorted list, refusing items whose key is already present, would work too.
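The first method (sort, then compare neighbours) could look roughly like the sketch below. Again, Employee with Name/Age/Salary and the SortedDuplicateFinder helper are hypothetical names used only for illustration. The records are copied into a List<Employee> so the original collection is left untouched.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical record type; the real class and its three key properties will differ.
public class Employee
{
    public string Name;
    public int Age;
    public double Salary;

    public Employee(string name, int age, double salary)
    {
        Name = name;
        Age = age;
        Salary = salary;
    }
}

// Hypothetical helper, not part of the original posts.
public static class SortedDuplicateFinder
{
    // Orders by Name, then Age, then Salary; returns 0 only when all three match.
    public static int CompareKeys(Employee a, Employee b)
    {
        int c = string.CompareOrdinal(a.Name, b.Name);
        if (c != 0) return c;
        c = a.Age.CompareTo(b.Age);
        if (c != 0) return c;
        return a.Salary.CompareTo(b.Salary);
    }

    public static List<Employee> FindDuplicates(IEnumerable employees)
    {
        // Work on a copy so the source collection keeps its order.
        List<Employee> sorted = new List<Employee>();
        foreach (Employee e in employees)
            sorted.Add(e);
        sorted.Sort(CompareKeys);

        // After sorting, records with equal keys sit next to each other.
        List<Employee> duplicates = new List<Employee>();
        for (int i = 1; i < sorted.Count; i++)
        {
            if (CompareKeys(sorted[i - 1], sorted[i]) == 0)
                duplicates.Add(sorted[i]);
        }
        return duplicates;
    }
}
```

Note that List<T>.Sort is not a stable sort, so among equal records the one treated as the "original" is not necessarily the one that came first in the XML file. Sorting costs O(n log n), which is still cheap for 15,000 items.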
Accepted Solution

by:thuannguy
ID: 18848656
You can use three Dictionary<> objects to store the objects. Let's consider a concrete example in which the three "KEYS" are Age, Salary and Name:
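      // Requires: using System.Collections.Generic;
      // One dictionary per key property; each holds a reference to the first employee seen with that value.
      Dictionary<string, Employee> nameDict = new Dictionary<string, Employee>();
      Dictionary<int, Employee> ageDict = new Dictionary<int, Employee>();
      Dictionary<double, Employee> salaryDict = new Dictionary<double, Employee>();
      List<Employee> duplicateList = new List<Employee>();

      public void Add(Employee employee)
      {
         // Assume a duplicate until one of the three values turns out to be new.
         bool isDuplicate = true;
         if (!nameDict.ContainsKey(employee.Name))
         {
            isDuplicate = false;
            nameDict.Add(employee.Name, employee);
         }

         if (!ageDict.ContainsKey(employee.Age))
         {
            isDuplicate = false;
            ageDict.Add(employee.Age, employee);
         }

         if (!salaryDict.ContainsKey(employee.Salary))
         {
            isDuplicate = false;
            salaryDict.Add(employee.Salary, employee);
         }

         // Caveat: the three values are checked independently, so a record is also flagged
         // when its name, age and salary were each seen on different earlier records.
         if (isDuplicate)
            duplicateList.Add(employee); // this object is a duplicate, add it to the duplicate list
      }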

When you read an object from the XML file, just use the Add method to add it to the container. In the Add method, we check whether the three "KEYS" already exist. Since the three dictionaries store only references to the objects, the extra memory cost is small.
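If the three properties have to match together for a record to count as a duplicate, one alternative to three separate dictionaries is a single dictionary keyed on the combination. The sketch below is only an illustration under that assumption: EmployeeKey and CompositeKeyDuplicateFinder are hypothetical names, and Employee is a stand-in for the real class. It groups the records by key, so you can display each set of duplicated records together.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical record type; the real class and its three key properties will differ.
public class Employee
{
    public string Name;
    public int Age;
    public double Salary;

    public Employee(string name, int age, double salary)
    {
        Name = name;
        Age = age;
        Salary = salary;
    }
}

// Hypothetical composite key: two keys are equal only if all three values match.
public struct EmployeeKey : IEquatable<EmployeeKey>
{
    private readonly string name;
    private readonly int age;
    private readonly double salary;

    public EmployeeKey(Employee e)
    {
        name = e.Name;
        age = e.Age;
        salary = e.Salary;
    }

    public bool Equals(EmployeeKey other)
    {
        return name == other.name && age == other.age && salary.Equals(other.salary);
    }

    public override bool Equals(object obj)
    {
        return obj is EmployeeKey && Equals((EmployeeKey)obj);
    }

    public override int GetHashCode()
    {
        // Combine the three hash codes; overflow is intentionally ignored.
        int h = name == null ? 0 : name.GetHashCode();
        h = unchecked(h * 31 + age);
        h = unchecked(h * 31 + salary.GetHashCode());
        return h;
    }
}

// Hypothetical helper, not part of the original posts.
public static class CompositeKeyDuplicateFinder
{
    // Groups records by composite key and returns only the groups with more than one record.
    public static List<List<Employee>> FindDuplicateGroups(IEnumerable employees)
    {
        Dictionary<EmployeeKey, List<Employee>> groups =
            new Dictionary<EmployeeKey, List<Employee>>();
        foreach (Employee e in employees)
        {
            EmployeeKey key = new EmployeeKey(e);
            List<Employee> group;
            if (!groups.TryGetValue(key, out group))
            {
                group = new List<Employee>();
                groups.Add(key, group);
            }
            group.Add(e);
        }

        List<List<Employee>> result = new List<List<Employee>>();
        foreach (List<Employee> g in groups.Values)
        {
            if (g.Count > 1)
                result.Add(g);
        }
        return result;
    }
}
```

With this key, the records (Ann, 30, 1000), (Bob, 40, 2000) and (Ann, 40, 1000) produce no duplicate groups, because no two of them agree on all three values, even though every value of the third record has appeared before.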