removing duplicates from txt

Im writting a simple c++ program that will remove duplicates from a txt file. Alot like a dictionary list, but there are three fields to each entry. ie:

First Name: Harliquen
Last Name: <none>
Phone Number: 555-6666

The program will search through the list, and remove all _exact_ duplicates of the above but keep any that are similar, yet different (different phone number, etc).

How would I go about this?? using a struct?? Pseudocode and/or the data type to use that will get me going would be great..
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

let the structure be like this
typedef struct duplicate
      int id;
      char chName[100];
      char chLastName[100];
      char phone[20];
      struct duplicate* link;
U need to have functions for enetering values
finding duplicate
deleting duplicates

1. Enter value (its simple)
   Eneter a new value add to list

2. FindDuplicate

  a. take the first node store the name in a buffer
  b. compare name with each node if found, compare last name if same compare phone no if same return the id
  c. the control will come here if no matches found so return 0

3. call delete duplicate with the id returned

4. in delete duplicate search for the id and delete the corresponding node

Hope this helps
It may be easier to find duplicates in a sorted list
HarliquenAuthor Commented:
Didnt really help that much :( .. I will explain further, or you may need to go into more detail for me....

A list has already been created which is a text file (using other programs / obtained from elsewhere, etc) and looks similar to the following:

  First Name: Harliquen
  Last Name: <none>
  Phone Number: 555-6666

  First Name: Harliquen2
  Last Name: HarliquenL2
  Phone Number: 555-2344

  First Name: TestName1
  Last Name: TestnameL1
  Phone Number: 555-8888  


Between each entry there is a single blank line. I want the program to read in this text file, and output it again to another file, but minus the duplicate entries. I used a structure as suggested, and read each line into its respective entry in the structure. Then read the next block into another structure, but this seems to be not really worth it, because you cant compare structures can you?? (eg if struc1==struct2), you still have to compare each element (eg struct1.fname==struct2.fname etc) So is a structure the way to go?

The next problem I have, is the continuation of checking each block of data. Once I have read in the first block, I use that to compare with the rest of blocks of data until the end of the file, and outputting those that dont match to the output file.. eliminating duplicates of the first block, but not the rest. So I end up with a file containing no duplicates of the first block, but the rest are still there.. and I have lost the position of where I was up to in the file (ie I have to re-open the file again, and read it from the start which wouldnt help since I have eliminated the duplicates of the first block anyway :/  ).

If removing duplicates of blocks of data is too complex, help on removing single words would be usefull.. like a list of words (dictionary), and removing the duplicates of words..

Cloud Class® Course: Certified Penetration Testing

This CPTE Certified Penetration Testing Engineer course covers everything you need to know about becoming a Certified Penetration Testing Engineer. Career Path: Professional roles include Ethical Hackers, Security Consultants, System Administrators, and Chief Security Officers.

My suggestion was to create single linked list for entire file
if posssible can U send me the text file and the code U have proceeded upto so that I can help U more clearly
I don't think(At first instance)it is necessary to create a different linked list for each block mantaining one list is enough.
UR code seems to be currect since it is currectly finding duplicates in the first list
U cannot use the same logic for more than one list
If U like to send me the code with the txt file my email id is

Here is an example I built under BC 5.02.
I have hard-coded input and output filenames for simplicity.
It makes heavy use of the STL containers etc.
You may have to modify it for your environment.
I have tested it and it works under the above compiler.


Kevin Frey

#include <stdio>
#include <list>
#include <iterator>
#include <algorithm>
#include <cstring>
#include <fstream>
#include <assert>

#define FIRST_NAME_PFX          "First Name: "
#define LAST_NAME_PFX           "Last Name: "
#define PHONE_PFX               "Phone Number: "

#define FIRST_NAME_PFXLEN       strlen (FIRST_NAME_PFX)
#define LAST_NAME_PFXLEN        strlen (LAST_NAME_PFX)
#define PHONE_PFXLEN            strlen (PHONE_PFX)

    // cNamePhone - repository for a first/last name and telephone number

class cNamePhone {
    string  First;
    string  Last;
    string  Phone;

    bool    ReadFromFile (istream & argInputFile);
    bool    WriteToFile (ostream & argOutputFile);

    int     compare (const cNamePhone & argNamePhone);
    int     operator== (const cNamePhone & argNamePhone);
    int     operator!= (const cNamePhone & argNamePhone);
    int     operator< (const cNamePhone & argNamePhone);

// cNamePhone implementation ---------------------------------------------------

cNamePhone::ReadFromFile (istream & argInputFile)
    string  FirstNameLine;
    string  LastNameLine;
    string  PhoneLine;
    string  BlankLine;
    bool    Success (TRUE);

    FirstNameLine.read_line (argInputFile);
    LastNameLine.read_line (argInputFile);
    PhoneLine.read_line (argInputFile);
    // BlankLine.read_line (argInputFile);

    if (FirstNameLine.is_null () && LastNameLine.is_null () && PhoneLine.is_null ()) Success = FALSE;

    if (Success) {
        // just check we got the right data
        assert (FirstNameLine.substr (0, FIRST_NAME_PFXLEN) == FIRST_NAME_PFX);
        assert (LastNameLine.substr (0, LAST_NAME_PFXLEN) == LAST_NAME_PFX);
        assert (PhoneLine.substr (0, PHONE_PFXLEN) == PHONE_PFX);

        // grab out the bits from each line we need (first/last/phone)
        First = FirstNameLine.substr (FIRST_NAME_PFXLEN);
        First.strip ();
        Last = LastNameLine.substr (LAST_NAME_PFXLEN);
        Last.strip ();
        Phone = PhoneLine.substr (PHONE_PFXLEN);
        Phone.strip ();
    } /* if */

    return (Success);

cNamePhone::WriteToFile (ostream & argOutputFile)
    argOutputFile << FIRST_NAME_PFX << First << endl;
    argOutputFile << LAST_NAME_PFX << Last << endl;
    argOutputFile << PHONE_PFX << Phone << endl << endl;
    return (TRUE);

cNamePhone::compare (const cNamePhone & argNamePhone)
    int Result (0);

    if (First < argNamePhone.First) Result = -1;
    else if (First > argNamePhone.First) Result = 1;
    else if (Last < argNamePhone.Last) Result = -1;
    else if (Last > argNamePhone.Last) Result = 1;
    else if (Phone < argNamePhone.Phone) Result = -1;
    else if (Phone > argNamePhone.Phone) Result = 1;

    return (Result);

cNamePhone::operator== (const cNamePhone & argNamePhone)
    return (compare (argNamePhone) == 0);

cNamePhone::operator!= (const cNamePhone & argNamePhone)
    return (compare (argNamePhone) != 0);

cNamePhone::operator< (const cNamePhone & argNamePhone)
    return (compare (argNamePhone) < 0);

typedef std::list<cNamePhone>           cNamePhoneList;
typedef std::list<cNamePhone>::iterator cNamePhoneListIterator;


First Name: Harliquen
Last Name: <none>
Phone Number: 555-6666

main (void)
    cNamePhoneList          List;
    cNamePhoneListIterator  ListI;
    ifstream                InputFile;
    ofstream                OutputFile;
    cNamePhone              NamePhone;

    // open the file ("TEST.TXT");
    assert (InputFile.good ());

    while (NamePhone.ReadFromFile (InputFile)) {
        // scan through the existing list to see if we have a match
        if (std::find (List.begin (), List.end (), NamePhone) == List.end ()) {
            // no match - add it to the list
            List.insert (List.begin (), NamePhone);
        } /* if */
    } /* while */

    // close our input file
    InputFile.close ();

    // now open our output file ("TEST.OUT");
    assert (OutputFile.good ());

    // write each name/phone that we retained in our list
    for (ListI = List.begin (); ListI != List.end (); ListI++)
        (*ListI).WriteToFile (OutputFile);

    // close our output file
    OutputFile.close ();

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
I think ozo makes a good point.  It is much easier to find duplicates in a sorted list.  The following code is really C, but I think it is pretty easy to follow.
There is a stuct that holds
 First Name
 Last Name
 Phone Number

For this example, I took the liberty of assuming that no line would be over 80 characters in length.  If you think one would be, either increase MAX, of dynamically allocate each character array of each record.

A pass is made through the input file to count the total records, and memory is then allocated.  The data is read in, and qsort is used to sort the data.  The data is sorted by first name, then last name, then phone number, but it really doesn't matter for our purposes.

Once sorted, the data is written to a file called "result.txt" as long as the current record is not the same as the previous record (no duplicates).

Follwing the code is an example input file, followed by the resulting output file.

You may want to trim the strings as they are input (remove leading and trailing whitespace), but that is up to you.

Hope this helps


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX  80

typedef struct record
    char last[MAX];
    char first[MAX];
    char phone[MAX];
    } record;

int mycompare(const void* a, const void* b)
record* r1 = (record*)a;
record* r2 = (record*)b;
int res;

if ( (res = strcmp( r1->first, r2->first)) != 0 )
    return res;
if ( (res = strcmp( r1->last, r2->last)) != 0 )
    return res;
return strcmp( r1->phone, r2->phone );

} /* mycompare */

/* MAIN */
int main(int argc, char* argv[])
 /* Variables */
FILE *in;
unsigned int total = 0;
unsigned int ctr;
char temp[MAX + 1];
record* data;
FILE *out;
record prev;
record cur;

if ( argc != 2 )
    printf("\nUSAGE: %s <filename>\n", argv[0]);

if ( (in = fopen(argv[1], "rt")) == NULL )
    printf("\nERROR: Failure to open file %s\n", argv[1]);

while ( fgets(temp, MAX, in) != NULL )

/* Number of records should be total/4, but we have to make sure */
/*  that there is a newline after the last record */
total = (total / 4) + (( (total % 4) > 0 ) ? 1 : 0);


/* Allocate the array of pointers */
if ( (data = (record*)malloc(total * sizeof(record))) == NULL )

/* Read in data */
for ( ctr = 0; ctr < total; ctr++ )
    fgets((data+ctr)->first, MAX, in);
    fgets((data+ctr)->last, MAX, in);
    fgets((data+ctr)->phone, MAX, in);
    fgets(temp, MAX, in); /* get the blank line */

/* Sort the data */
qsort(data, total, sizeof(record), mycompare);

/* Data is sorted, lets output results */
if ( (out = fopen("result.txt", "wt")) == NULL)
    printf("\nERROR: Failure to open result file\n");

/* Output first record */
fprintf(out, "%s", data->first);
fprintf(out, "%s", data->last);
fprintf(out, "%s", data->phone);
fprintf(out, "\n");
for ( ctr = 1; ctr < total; ctr++ )
    if ( strcmp((data+ctr)->first,(data+ctr-1)->first) == 0
      && strcmp((data+ctr)->last,(data+ctr-1)->last) == 0
      && strcmp((data+ctr)->phone,(data+ctr-1)->phone) == 0
      continue; /* Do not output this record.  It matched the last one. */
    fprintf(out, "%s", (data+ctr)->first);
    fprintf(out, "%s", (data+ctr)->last);
    fprintf(out, "%s", (data+ctr)->phone);
    fprintf(out, "\n");


return 0;
} /* end main */

---- end source code ----
---- begin input file ----




---- end input file ----
---- start result file ----


---- end results ----
HarliquenAuthor Commented:
Sorry for the delay. Been quite busy lately. Thank you both for your answers.

I will test both methods (primarily yours kevinf since you were first to propose an answer) and award points ASAP, could you also explain for me basically how it works. I understand and can follow marcjb's answer, but cant really follow yours, it's a bit over my head, although I will still thoroughly go through it.

It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today

From novice to tech pro — start learning today.