Best (fastest) way to find duplicate entries in a text file.

I'm parsing a text file that could potentially have tens of thousands of lines (perhaps more!) and I want to check for duplicate lines and if there is a duplicate line replace it with a pointer to the original line. I've tried two ways of doing this. a) using nested for loops to iterate over all the lines after they've been stored in a CStringArray, and b) storing the entire file in one huge CString and using a combination of find and replace while breaking it into a CStringArray. Method b seems to work faster but I'm wondering if there might be an even faster way of accomplishing this task? Any thoughts and suggestions?

Thanks in advance!
NaylinAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

lucky_jamesCommented:
You should have put this question in algorithms zone as well.......

if you have to look for full duplicate entry i.e. a line, you can take a look at hashing as well.

simplest of the case:
create 26 buckets for alphabets, each line starts with.

now after seeking and fetching new line from the file. check in that individual bucket if that line already exists or not.

Above trick is based on initial alphabet of the line. if you have some other key, you can use that in hashing, so as to distribute data among the buckets as evenly as you can.

if you create string array, this will give you nxn complexity.
while this on-the-fly hashing will give you axb complexity where a,b < n.

Hope it helps.
0
itsmeandnobodyelseCommented:
You better would use a map where you would store all 'first' occurrences of a line as key (and maybe the line number as data).

Then, for any new line you would make a lookup into the map before adding a new entry.

0
lucky_jamesCommented:
hi itsme,
         if feel, in your approach of map, it should be a multi map. correct me if i am wrong.... or if you use a map then the line numbers should be appended. else, previous entry will delete the new entry.


James
0
Upgrade your Question Security!

Your question, your audience. Choose who sees your identity—and your question—with question security.

itsmeandnobodyelseCommented:
>>>> it should be a multi map. correct me if i am wrong....
Hmmm. That depends onto where you want to write the pointers. If using the map only to check duplicates you would store the line number of the first occurrence in the map and nothing else. Then you would have a text-file which has a mix of text and pointers, e. g. for a file like

Boston
Birmingham
New Delhi
Birmingham
Birmingham
New York
Boston
New York
London
New York

You would turn to

Boston
Birmingham
New Delhi
@2
@2
New York
@1
@6
London
@6

and have a map with

Birmingham    2
Boston           1
London          9
New Delhi      3
New York       6

Alternatively, you could store all in the map (making it a multi-map)

Birmingham    2,4,5
Boston           1,7
London          9
New Delhi      3
New York       6,8,10

The latter would allow you to restore the original file by iterating the map and fill an array of strings by index. But you need to hold all contents in memory cause you badly could write the line numbers sequentially as they were very unsorted in the multi-map.

Note, in STL a multi-map is a very incomfortable container. I would suggest using a map of string to vectors instead. You easily could build up the map when reading the file like that:

    map<string, vector<int> >  mm;
    string line;
    ifstream ifs("input.txt");
    int n = 0;
    while (!getline(ifs, line))
    {
          mm[line].push_back(++n);
    }
    ifs.close();

That's all.

Note, the mm[line] returns a reference to an empty vector for the first occurrence of line. For all further occurrences it returns a reference to the existing vector. In both cases you simply could add a new line number by calling push_back.

 
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
itsmeandnobodyelseCommented:
It should be

>>>> while (getline(ifs, line))
0
NaylinAuthor Commented:
Havn't really tried this yet but need to close the question... Not really sure about all this, but I'll check it out more.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
C++

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.