Solved

Best (fastest) way to find duplicate entries in a text file.

Posted on 2008-10-21
7
637 Views
Last Modified: 2012-05-08
I'm parsing a text file that could potentially have tens of thousands of lines (perhaps more!) and I want to check for duplicate lines and if there is a duplicate line replace it with a pointer to the original line. I've tried two ways of doing this. a) using nested for loops to iterate over all the lines after they've been stored in a CStringArray, and b) storing the entire file in one huge CString and using a combination of find and replace while breaking it into a CStringArray. Method b seems to work faster but I'm wondering if there might be an even faster way of accomplishing this task? Any thoughts and suggestions?

Thanks in advance!
0
Comment
Question by:Naylin
  • 3
  • 2
7 Comments
 
LVL 7

Expert Comment

by:lucky_james
ID: 22765383
You should have put this question in algorithms zone as well.......

if you have to look for full duplicate entry i.e. a line, you can take a look at hashing as well.

simplest of the case:
create 26 buckets for alphabets, each line starts with.

now after seeking and fetching new line from the file. check in that individual bucket if that line already exists or not.

Above trick is based on initial alphabet of the line. if you have some other key, you can use that in hashing, so as to distribute data among the buckets as evenly as you can.

if you create string array, this will give you nxn complexity.
while this on-the-fly hashing will give you axb complexity where a,b < n.

Hope it helps.
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 22766492
You better would use a map where you would store all 'first' occurrences of a line as key (and maybe the line number as data).

Then, for any new line you would make a lookup into the map before adding a new entry.

0
 
LVL 7

Expert Comment

by:lucky_james
ID: 22766643
hi itsme,
         if feel, in your approach of map, it should be a multi map. correct me if i am wrong.... or if you use a map then the line numbers should be appended. else, previous entry will delete the new entry.


James
0
Courses: Start Training Online With Pros, Today

Brush up on the basics or master the advanced techniques required to earn essential industry certifications, with Courses. Enroll in a course and start learning today. Training topics range from Android App Dev to the Xen Virtualization Platform.

 
LVL 39

Accepted Solution

by:
itsmeandnobodyelse earned 50 total points
ID: 22770030
>>>> it should be a multi map. correct me if i am wrong....
Hmmm. That depends onto where you want to write the pointers. If using the map only to check duplicates you would store the line number of the first occurrence in the map and nothing else. Then you would have a text-file which has a mix of text and pointers, e. g. for a file like

Boston
Birmingham
New Delhi
Birmingham
Birmingham
New York
Boston
New York
London
New York

You would turn to

Boston
Birmingham
New Delhi
@2
@2
New York
@1
@6
London
@6

and have a map with

Birmingham    2
Boston           1
London          9
New Delhi      3
New York       6

Alternatively, you could store all in the map (making it a multi-map)

Birmingham    2,4,5
Boston           1,7
London          9
New Delhi      3
New York       6,8,10

The latter would allow you to restore the original file by iterating the map and fill an array of strings by index. But you need to hold all contents in memory cause you badly could write the line numbers sequentially as they were very unsorted in the multi-map.

Note, in STL a multi-map is a very incomfortable container. I would suggest using a map of string to vectors instead. You easily could build up the map when reading the file like that:

    map<string, vector<int> >  mm;
    string line;
    ifstream ifs("input.txt");
    int n = 0;
    while (!getline(ifs, line))
    {
          mm[line].push_back(++n);
    }
    ifs.close();

That's all.

Note, the mm[line] returns a reference to an empty vector for the first occurrence of line. For all further occurrences it returns a reference to the existing vector. In both cases you simply could add a new line number by calling push_back.

 
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 22770052
It should be

>>>> while (getline(ifs, line))
0
 

Author Closing Comment

by:Naylin
ID: 31670671
Havn't really tried this yet but need to close the question... Not really sure about all this, but I'll check it out more.
0

Featured Post

Gigs: Get Your Project Delivered by an Expert

Select from freelancers specializing in everything from database administration to programming, who have proven themselves as experts in their field. Hire the best, collaborate easily, pay securely and get projects done right.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Errors will happen. It is a fact of life for the programmer. How and when errors are detected have a great impact on quality and cost of a product. It is better to detect errors at compile time, when possible and practical. Errors that make their wa…
Container Orchestration platforms empower organizations to scale their apps at an exceptional rate. This is the reason numerous innovation-driven companies are moving apps to an appropriated datacenter wide platform that empowers them to scale at a …
The viewer will learn how to use the return statement in functions in C++. The video will also teach the user how to pass data to a function and have the function return data back for further processing.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.

776 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question