Solved

Best (fastest) way to find duplicate entries in a text file.

Posted on 2008-10-21
7
635 Views
Last Modified: 2012-05-08
I'm parsing a text file that could potentially have tens of thousands of lines (perhaps more!) and I want to check for duplicate lines and if there is a duplicate line replace it with a pointer to the original line. I've tried two ways of doing this. a) using nested for loops to iterate over all the lines after they've been stored in a CStringArray, and b) storing the entire file in one huge CString and using a combination of find and replace while breaking it into a CStringArray. Method b seems to work faster but I'm wondering if there might be an even faster way of accomplishing this task? Any thoughts and suggestions?

Thanks in advance!
0
Comment
Question by:Naylin
  • 3
  • 2
7 Comments
 
LVL 7

Expert Comment

by:lucky_james
ID: 22765383
You should have put this question in algorithms zone as well.......

if you have to look for full duplicate entry i.e. a line, you can take a look at hashing as well.

simplest of the case:
create 26 buckets for alphabets, each line starts with.

now after seeking and fetching new line from the file. check in that individual bucket if that line already exists or not.

Above trick is based on initial alphabet of the line. if you have some other key, you can use that in hashing, so as to distribute data among the buckets as evenly as you can.

if you create string array, this will give you nxn complexity.
while this on-the-fly hashing will give you axb complexity where a,b < n.

Hope it helps.
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 22766492
You better would use a map where you would store all 'first' occurrences of a line as key (and maybe the line number as data).

Then, for any new line you would make a lookup into the map before adding a new entry.

0
 
LVL 7

Expert Comment

by:lucky_james
ID: 22766643
hi itsme,
         if feel, in your approach of map, it should be a multi map. correct me if i am wrong.... or if you use a map then the line numbers should be appended. else, previous entry will delete the new entry.


James
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 39

Accepted Solution

by:
itsmeandnobodyelse earned 50 total points
ID: 22770030
>>>> it should be a multi map. correct me if i am wrong....
Hmmm. That depends onto where you want to write the pointers. If using the map only to check duplicates you would store the line number of the first occurrence in the map and nothing else. Then you would have a text-file which has a mix of text and pointers, e. g. for a file like

Boston
Birmingham
New Delhi
Birmingham
Birmingham
New York
Boston
New York
London
New York

You would turn to

Boston
Birmingham
New Delhi
@2
@2
New York
@1
@6
London
@6

and have a map with

Birmingham    2
Boston           1
London          9
New Delhi      3
New York       6

Alternatively, you could store all in the map (making it a multi-map)

Birmingham    2,4,5
Boston           1,7
London          9
New Delhi      3
New York       6,8,10

The latter would allow you to restore the original file by iterating the map and fill an array of strings by index. But you need to hold all contents in memory cause you badly could write the line numbers sequentially as they were very unsorted in the multi-map.

Note, in STL a multi-map is a very incomfortable container. I would suggest using a map of string to vectors instead. You easily could build up the map when reading the file like that:

    map<string, vector<int> >  mm;
    string line;
    ifstream ifs("input.txt");
    int n = 0;
    while (!getline(ifs, line))
    {
          mm[line].push_back(++n);
    }
    ifs.close();

That's all.

Note, the mm[line] returns a reference to an empty vector for the first occurrence of line. For all further occurrences it returns a reference to the existing vector. In both cases you simply could add a new line number by calling push_back.

 
0
 
LVL 39

Expert Comment

by:itsmeandnobodyelse
ID: 22770052
It should be

>>>> while (getline(ifs, line))
0
 

Author Closing Comment

by:Naylin
ID: 31670671
Havn't really tried this yet but need to close the question... Not really sure about all this, but I'll check it out more.
0

Featured Post

Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

Join & Write a Comment

Errors will happen. It is a fact of life for the programmer. How and when errors are detected have a great impact on quality and cost of a product. It is better to detect errors at compile time, when possible and practical. Errors that make their wa…
Introduction This article is the first in a series of articles about the C/C++ Visual Studio Express debugger.  It provides a quick start guide in using the debugger. Part 2 focuses on additional topics in breakpoints.  Lastly, Part 3 focuses on th…
The viewer will learn how to clear a vector as well as how to detect empty vectors in C++.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.

757 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now