Go Premium for a chance to win a PS4. Enter to Win

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 652
  • Last Modified:

Best (fastest) way to find duplicate entries in a text file.

I'm parsing a text file that could potentially have tens of thousands of lines (perhaps more!) and I want to check for duplicate lines and if there is a duplicate line replace it with a pointer to the original line. I've tried two ways of doing this. a) using nested for loops to iterate over all the lines after they've been stored in a CStringArray, and b) storing the entire file in one huge CString and using a combination of find and replace while breaking it into a CStringArray. Method b seems to work faster but I'm wondering if there might be an even faster way of accomplishing this task? Any thoughts and suggestions?

Thanks in advance!
0
Naylin
Asked:
Naylin
  • 3
  • 2
1 Solution
 
lucky_jamesCommented:
You should have put this question in algorithms zone as well.......

if you have to look for full duplicate entry i.e. a line, you can take a look at hashing as well.

simplest of the case:
create 26 buckets for alphabets, each line starts with.

now after seeking and fetching new line from the file. check in that individual bucket if that line already exists or not.

Above trick is based on initial alphabet of the line. if you have some other key, you can use that in hashing, so as to distribute data among the buckets as evenly as you can.

if you create string array, this will give you nxn complexity.
while this on-the-fly hashing will give you axb complexity where a,b < n.

Hope it helps.
0
 
itsmeandnobodyelseCommented:
You better would use a map where you would store all 'first' occurrences of a line as key (and maybe the line number as data).

Then, for any new line you would make a lookup into the map before adding a new entry.

0
 
lucky_jamesCommented:
hi itsme,
         if feel, in your approach of map, it should be a multi map. correct me if i am wrong.... or if you use a map then the line numbers should be appended. else, previous entry will delete the new entry.


James
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
itsmeandnobodyelseCommented:
>>>> it should be a multi map. correct me if i am wrong....
Hmmm. That depends onto where you want to write the pointers. If using the map only to check duplicates you would store the line number of the first occurrence in the map and nothing else. Then you would have a text-file which has a mix of text and pointers, e. g. for a file like

Boston
Birmingham
New Delhi
Birmingham
Birmingham
New York
Boston
New York
London
New York

You would turn to

Boston
Birmingham
New Delhi
@2
@2
New York
@1
@6
London
@6

and have a map with

Birmingham    2
Boston           1
London          9
New Delhi      3
New York       6

Alternatively, you could store all in the map (making it a multi-map)

Birmingham    2,4,5
Boston           1,7
London          9
New Delhi      3
New York       6,8,10

The latter would allow you to restore the original file by iterating the map and fill an array of strings by index. But you need to hold all contents in memory cause you badly could write the line numbers sequentially as they were very unsorted in the multi-map.

Note, in STL a multi-map is a very incomfortable container. I would suggest using a map of string to vectors instead. You easily could build up the map when reading the file like that:

    map<string, vector<int> >  mm;
    string line;
    ifstream ifs("input.txt");
    int n = 0;
    while (!getline(ifs, line))
    {
          mm[line].push_back(++n);
    }
    ifs.close();

That's all.

Note, the mm[line] returns a reference to an empty vector for the first occurrence of line. For all further occurrences it returns a reference to the existing vector. In both cases you simply could add a new line number by calling push_back.

 
0
 
itsmeandnobodyelseCommented:
It should be

>>>> while (getline(ifs, line))
0
 
NaylinAuthor Commented:
Havn't really tried this yet but need to close the question... Not really sure about all this, but I'll check it out more.
0

Featured Post

VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

  • 3
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now