Solved

Best (fastest) way to find duplicate entries in a text file.

Posted on 2008-10-21
7
636 Views
Last Modified: 2012-05-08
I'm parsing a text file that could potentially have tens of thousands of lines (perhaps more!) and I want to check for duplicate lines and if there is a duplicate line replace it with a pointer to the original line. I've tried two ways of doing this. a) using nested for loops to iterate over all the lines after they've been stored in a CStringArray, and b) storing the entire file in one huge CString and using a combination of find and replace while breaking it into a CStringArray. Method b seems to work faster but I'm wondering if there might be an even faster way of accomplishing this task? Any thoughts and suggestions?

Thanks in advance!
0
Comment
Question by:Naylin
  • 3
  • 2
7 Comments
 
LVL 7

Expert Comment

by:lucky_james
ID: 22765383
You should have put this question in algorithms zone as well.......

if you have to look for full duplicate entry i.e. a line, you can take a look at hashing as well.

simplest of the case:
create 26 buckets for alphabets, each line starts with.

now after seeking and fetching new line from the file. check in that individual bucket if that line already exists or not.

Above trick is based on initial alphabet of the line. if you have some other key, you can use that in hashing, so as to distribute data among the buckets as evenly as you can.

if you create string array, this will give you nxn complexity.
while this on-the-fly hashing will give you axb complexity where a,b < n.

Hope it helps.
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 22766492
You better would use a map where you would store all 'first' occurrences of a line as key (and maybe the line number as data).

Then, for any new line you would make a lookup into the map before adding a new entry.

0
 
LVL 7

Expert Comment

by:lucky_james
ID: 22766643
hi itsme,
         if feel, in your approach of map, it should be a multi map. correct me if i am wrong.... or if you use a map then the line numbers should be appended. else, previous entry will delete the new entry.


James
0
Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

 
LVL 39

Accepted Solution

by:
itsmeandnobodyelse earned 50 total points
ID: 22770030
>>>> it should be a multi map. correct me if i am wrong....
Hmmm. That depends onto where you want to write the pointers. If using the map only to check duplicates you would store the line number of the first occurrence in the map and nothing else. Then you would have a text-file which has a mix of text and pointers, e. g. for a file like

Boston
Birmingham
New Delhi
Birmingham
Birmingham
New York
Boston
New York
London
New York

You would turn to

Boston
Birmingham
New Delhi
@2
@2
New York
@1
@6
London
@6

and have a map with

Birmingham    2
Boston           1
London          9
New Delhi      3
New York       6

Alternatively, you could store all in the map (making it a multi-map)

Birmingham    2,4,5
Boston           1,7
London          9
New Delhi      3
New York       6,8,10

The latter would allow you to restore the original file by iterating the map and fill an array of strings by index. But you need to hold all contents in memory cause you badly could write the line numbers sequentially as they were very unsorted in the multi-map.

Note, in STL a multi-map is a very incomfortable container. I would suggest using a map of string to vectors instead. You easily could build up the map when reading the file like that:

    map<string, vector<int> >  mm;
    string line;
    ifstream ifs("input.txt");
    int n = 0;
    while (!getline(ifs, line))
    {
          mm[line].push_back(++n);
    }
    ifs.close();

That's all.

Note, the mm[line] returns a reference to an empty vector for the first occurrence of line. For all further occurrences it returns a reference to the existing vector. In both cases you simply could add a new line number by calling push_back.

 
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 22770052
It should be

>>>> while (getline(ifs, line))
0
 

Author Closing Comment

by:Naylin
ID: 31670671
Havn't really tried this yet but need to close the question... Not really sure about all this, but I'll check it out more.
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

When writing generic code, using template meta-programming techniques, it is sometimes useful to know if a type is convertible to another type. A good example of when this might be is if you are writing diagnostic instrumentation for code to generat…
In days of old, returning something by value from a function in C++ was necessarily avoided because it would, invariably, involve one or even two copies of the object being created and potentially costly calls to a copy-constructor and destructor. A…
The goal of the tutorial is to teach the user how to use functions in C++. The video will cover how to define functions, how to call functions and how to create functions prototypes. Microsoft Visual C++ 2010 Express will be used as a text editor an…
The viewer will learn how to pass data into a function in C++. This is one step further in using functions. Instead of only printing text onto the console, the function will be able to perform calculations with argumentents given by the user.

895 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

17 Experts available now in Live!

Get 1:1 Help Now