justinY

How can I compare 2 files and output the differences to a new file

Hi experts,

I have 2 files (file_old and file_new) with same formats, same columns, but different rows and contents.
So, is there any function I can use to compare the 2 files and write the differences to a new output file (file_diff) ?

Thanks in advance
stefan73

Hi justinY,
Why don't you use diff?

FILE* f=popen("diff file_old file_new","r");

...then you can parse the output.


Cheers!

Stefan
justinY

ASKER

diff is a Unix tool. I am running on Windows.
So what is there for Windows?

thanks
You could write your own little version of diff... something like this:

#include <fstream>
#include <string>

using namespace std;

int main() {

ifstream file1 ("file_old");
ifstream file2 ("file_new");
ofstream out ("file_diff");

string line1, line2;

// compare line by line until either file runs out of lines
while ( getline (file1, line1) && getline (file2, line2) )
{
   if (line1 != line2)
        out << "OLD: " << line1 << " NEW: " << line2 << endl;
}

return 0;
}



Please specify what you mean by "differences". If, for example, the first line of the first file is present in the second file, but as the fifth line, should it be marked as different? Or do you want only the lines that are present in just one of the source files? In the first case, the answer from gugario is what you need. In the second, you'll need more complicated code, for example using a set of strings, something like this:

#include <fstream>
#include <iostream>
#include <string>
#include <set>

using namespace std;

int main() {

  string file1_name = "file_old";
  string file2_name = "file_new";

  ifstream file1 ( file1_name.c_str() );
  ifstream file2 ( file2_name.c_str() );
  ofstream out ( "file_diff" );

  string line1, line2;
  set<string> set1;

  int cnt = 0;
  // Load contents of first file into set
  while( getline( file1, line1 ) ) {
    cnt++;
    if ( ! set1.insert(line1).second ) {
        cerr << "Insert failed of line no." << cnt << ": \"" << line1 << "\"  - duplicate" << endl ;
    }
  }

  // Read lines from second file and compare them to set1, outputting lines that are present in file2 only
  while( getline( file2, line2 ) ) {
    if ( set1.find( line2 ) == set1.end() ) {
      out << "only in file " << file2_name << ": " << line2 << endl;
    } else {
      set1.erase( line2 );   // remove from set1 the lines found in both files
    }
  }

  // Print the remaining lines from set1 - only lines not found in file2 are left
  set<string>::iterator setItr;
  for( setItr = set1.begin(); setItr != set1.end(); ++setItr ) {
    out << "only in file " << file1_name << ": " << *setItr << endl;
  }

  return 0;
}

This is just an example and will not work correctly if lines repeat - at the very least you should use multiset instead of set.


Why don't you use cygwin's diff?

http://www.cygwin.com

Then you have both worlds: Windows' nice GUI and Unix's power. Or you can have a look at diff's source. It should be programmed in such a generic way that it probably compiles fine on Windows.

ASKER

> first string from first file present in second file, but as fifth string - it should be marked as different?

NO, that's the same.

Thanks guys,
Let me make myself clear here.
I have an old file ( oldfile ) and a new file (newfile). I want to compare oldfile and newfile, delete the same records ( not the same lines ) from newfile, and then output the newfile to a file called oldfile.

How can I start this ?
What I would do is go through the old file once, and use a <set> to store all the unique records in the first file.  Afterwards, go through the second file, and compare each read in record to the ones you already read.

Open the old file again in append  mode (so that you can add the new information), and for each record you find which is not already in the old file, add this in.

You could have multiple sets, or some other technique for parsing the line, if you don't have only one column of records in your file.  I really hope this helps, let me know if you have any questions.

Gustavo

here's the code:

#include <set>
#include <fstream>
#include <string>
using namespace std;

int main()
{
        set <string> oldRecords;  //declare the set (it starts out empty)

        ifstream oldFile ("oldfile.txt");       //open old file for reading
        ifstream newFile ("newfile.txt");       //open new file for reading

        string nextRecord;
        while ( oldFile >> nextRecord )         //while a record could be read
        {
                //add record to set
                oldRecords.insert( nextRecord );
        }

        //close old file and open it again in append write mode
        oldFile.close();
        ofstream writeFile ("oldfile.txt", ios::app);

        while ( newFile >> nextRecord )         //read record
        {
                if (oldRecords.find(nextRecord) == oldRecords.end()) //if not found, add
                {
                        writeFile << nextRecord << endl;
                        oldRecords.insert(nextRecord);
                }
        }

        newFile.close();
        writeFile.close();

        return 0;
}

ASKER

Thanks gugario, here is my code, but it has compile errors. Can you check what's wrong? Thanks

#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <iomanip>

using namespace std;

///////////////////////////////////
// this function can get any fields
//////////////////////////////////
std::string GetField(std::string &aStr, int aFieldNum, char aDelim)
{
    std::istringstream ss(aStr);
    std::string field;
    while (std::getline(ss, field, ',') && aFieldNum > 0 )
    {
        --aFieldNum;
    }
    return field;
}

int main(int argc, char *argv[])
{
    std::ifstream oldfin;
    std::ifstream newfin;
    std::ofstream fout;
    oldfin.open("oldfile.csv");
    newfin.open("newfile.csv");
    fout.open("diff.csv");
    std::string line;

    while ( std::getline(oldfin, line) && std::getline(newfin, line))
    {
            // comparing oldfile and newfile in both the 10th column and the 30th column fields
                                // if they both not same, then write the newfile to output file  ( my code id for this one )
                                // or if they both same, then delete the same fileds from newfile, and write the rest of newfile to output file ( can you give me code to delete
                                // the same fileds)
 
            std::string oldfn9 = GetField(line, 9, ',');
            std::string oldfn29 = GetField(line, 29,',');
            std::string newfn9 = GetField(line, 9,',');
            std::string newfn29 = GetField(line, 29,',');
            if ( (::oldfn9.c_str() != ::newfn9.c_strt()) && (::oldfn29.c_str() != ::newfn29.c_strt()) )
        {
            fout << line.c_str() << std::endl;
        }
    }
    fout.close();
    oldfin.close();
    newfin.close();
}

ASKER

Here are the compile errors:
 if ( (oldfn9.c_str() != newfn9.c_str()) && (oldfn29.c_str() != newfn29.c_str()) )

But this part, ' fout << line.c_str() << std::endl;', doesn't make sense. It produces no results.
What I want to do here is write the lines of newfile to the output file, leaving out the records that are the same when compared with oldfile.
Hey, I almost understand what your program is supposed to do completely.. I'm gonna try to make some simple test files and make sure it runs ok, and then I'll repost the code... A couple of quick question and comments, though:

1.  When you use the "using namespace std;" line in the top of your program, you don't need all the "std::" in the middle of the code, since you already told the compiler you are using std for your namespace.  (that would make it a lot cleaner)

2. Comparing oldfn9.c_str() != newfn9.c_str() compares the two char pointers, not the text they point to, so it does not tell you whether the fields differ. Write oldfn9 != newfn9 instead: the string library has comparison operators, so you don't have to change the strings to char strings before comparing.

I don't mean to be too picky, I'm just saying it because I think it would save you a lot of trouble and make the code cleaner.
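
To illustrate point 2, here is a small sketch of the two kinds of comparison. The function names fieldsDiffer and pointersDiffer are invented for the example. For two separately constructed, non-empty strings the buffers live at different addresses, so the pointer comparison reports "different" even when the text is identical.

```cpp
#include <string>

// Compares the text of the two fields; this is what operator!= on
// std::string does.
bool fieldsDiffer(const std::string& a, const std::string& b)
{
    return a != b;
}

// Compares the addresses of the two internal character buffers,
// not the characters stored in them.
bool pointersDiffer(const std::string& a, const std::string& b)
{
    return a.c_str() != b.c_str();
}
```

If you ever do need to compare C-style strings, strcmp from <cstring> compares the characters; != on char pointers never does.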

Now for my question:
> // or if they both same, then delete the same fileds from newfile, and write the rest of newfile to output file
   Does that mean that when you find a line where fields 10 and 30 are equal you go on to the next one and keep processing the file?  Are you supposed to remove those lines from the newfile?  Are you supposed to write the line to the output file without the 2 fields?  or are you supposed to skip that line and copy the rest of oldfile into the output file?  Please clarify cause I don't understand...

I'll post the code with some fixes (except for that part) as soon as possible

Gustavo.

Here you go... apart from some general cleaning up, I made sure that the GetField function now
returned the appropriate field (so you can call it on 10 and 30 instead of 9 and 29), and your big
problem was that you were trying to store both the line from the old file and the one from the new
file in the same string, that way you ended up comparing one line to the same line, and that's
why you never found them to be different.

Hope it helps,

Gustavo.:

#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <iomanip>

using namespace std;

///////////////////////////////////
// this function can get any field (fields are numbered from 1)
////////////////////////////////////
string GetField(const string &aStr, int aFieldNum, char aDelim)
{
        istringstream ss(aStr);
        string field;
        while ( aFieldNum > 0)
        {
                if ( !getline ( ss, field, aDelim) )
                        return "";      // line has fewer fields than requested
                --aFieldNum;
        }
        return field;
}

int main(int argc, char *argv[])
{
        ifstream oldfin;
        ifstream newfin;
        ofstream fout;
        oldfin.open("oldfile.csv");
        newfin.open("newfile.csv");
        fout.open("diff.csv");

        string lineOld;
        string lineNew;

        // each line gets its own string, so old and new are really compared
        while (getline(oldfin, lineOld) && getline(newfin, lineNew))
        {
                string oldfn10 = GetField(lineOld, 10, ',');
                string oldfn30 = GetField(lineOld, 30, ',');
                string newfn10 = GetField(lineNew, 10, ',');
                string newfn30 = GetField(lineNew, 30, ',');

                if ( (oldfn10 != newfn10) && (oldfn30 != newfn30) )
                {
                        fout << lineNew << endl;
                }
                else if ((oldfn10 == newfn10) && (oldfn30 == newfn30))
                {
                        //not sure what you wanted to do here.. let me know
                }
        }
        oldfin.close();
        newfin.close();
        fout.close();

        return 0;

}

ASKER

> Now for my question:
> // or if they both same, then delete the same fileds from newfile, and write the rest of newfile to output file
> Does that mean that when you find a line where fields 10 and 30 are equal you go on to the next one and keep processing the file?  Are you supposed to remove those lines from the newfile?  Are you supposed to write the line to the output file without the 2 fields?  or are you supposed to skip that line and copy the rest of oldfile into the output file?  Please clarify cause I don't understand...

Yes to all questions, but skip that line and copy the rest of newfile into the output file.

Now back to the code: after running it, I have nothing in my diff.csv file. I don't know why.
We're getting there... The code probably gave no output because I assumed your input file looks different from what it really does.

hmmmm... if you can, post the newfile.csv, oldfile.csv you are using, so that I can see what your input really looks like.. also, it would be good if you could tell me what kind of diff.csv you are expecting... (i don't need a 100 line file, just something that covers the cases where a line is diff in 2 places, equal in 2 places or diff in only one place.)

Gustavo.

ASKER

Hi,

Finally, I got the code working, but it seems the code compares line to line. That means the 1st line of oldfile is compared with the 1st line of newfile, the 2nd line of oldfile with the 2nd line of newfile, and so on. That's not what I want. What I want is: as long as a line of newfile has the same fields as some line of oldfile, no matter the line number, ignore it and go on comparing until the end of newfile is reached, and write the lines whose fields are not the same to the diff file.

ASKER

Hi, Gustavo

this might be a good approach.

1. Get fields 10 and 30 of line 1 of newfile and compare them with every line of oldfile, from line 1 to the end. If both fields are the same as in some line of oldfile, delete the whole line from newfile and go on; otherwise just go on.

2. Get fields 10 and 30 of line 2 of newfile and compare them with every line of oldfile, from line 1 to the end. If both fields are the same as in some line of oldfile, delete the whole line from newfile and go on; otherwise just go on.

3. Keep doing this until the end of newfile is reached.

4. Write the remaining lines of newfile to the output file.

I think this gives us a clear, logical way to do it, don't you think? Then how can I do it?
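
The four steps above could be sketched roughly as follows. This is only an illustration, not the accepted solution (which is not shown in this thread): the names getField and writeUnmatched are made up for the example, and it assumes plain comma-separated lines without quoted fields. Instead of rescanning oldfile for every line of newfile, it loads the pair of key fields from each line of oldfile into a set once and then looks each newfile line up in that set.

```cpp
#include <istream>
#include <ostream>
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Returns field number fieldNum (counted from 1) of a delimited line,
// or an empty string if the line has fewer fields.
std::string getField(const std::string& line, int fieldNum, char delim)
{
    std::istringstream ss(line);
    std::string field;
    for (int i = 0; i < fieldNum; ++i)
        if (!std::getline(ss, field, delim))
            return "";
    return field;
}

// Copies to out every line of newIn whose (keyA, keyB) field pair
// does not occur in any line of oldIn, regardless of line position.
void writeUnmatched(std::istream& oldIn, std::istream& newIn,
                    std::ostream& out, int keyA, int keyB, char delim)
{
    typedef std::pair<std::string, std::string> Key;
    std::set<Key> oldKeys;
    std::string line;

    // steps 1-3, old side: remember the key pair of every old line
    while (std::getline(oldIn, line))
        oldKeys.insert(Key(getField(line, keyA, delim),
                           getField(line, keyB, delim)));

    // steps 1-4, new side: keep only lines whose key pair is unknown
    while (std::getline(newIn, line)) {
        Key k(getField(line, keyA, delim), getField(line, keyB, delim));
        if (oldKeys.find(k) == oldKeys.end())
            out << line << '\n';
    }
}
```

For the files in this thread it would be called with an ifstream on oldfile.csv, an ifstream on newfile.csv and an ofstream on diff.csv, as writeUnmatched(oldfin, newfin, fout, 10, 30, ','). Nothing is actually deleted from newfile; the matching lines are simply not copied to the output.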
ASKER CERTIFIED SOLUTION
gugario

This solution is only available to members.

ASKER

I have an empty output file, why ?
Is there any way you can give me sample input/output?  It seemed to work on my PC...
Hey, never mind, I saw an obvious error in it: after the return field; line in the first function, I forgot to close the bracket of the function (this is definitely a copy/paste error). Did you catch that?  Maybe it'll make a difference. If not, I'd still ask for sample in/out.

ASKER

OK, It works great. ---- Yes, I did catch that error since I had compiling error.

 Thanks, I am going to close this ticket and credit the points to you. I will open another ticket regarding getline(). If you have time, please take a look.