Solved

I/O question from a newbe

Posted on 2004-10-01
20
276 Views
Last Modified: 2010-04-01

Input to my program is 2 huge csv files with over 200 columns each. I need to compare specific fields for
both files and create the following report.

Description,       Account,        CIP,            WWS,      WLL,      WAM,      Column with Break,      Source 1 Cell,      Source 1 Value,      Source 2 Cell,      Source 2 Value
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,         Vol ,                  M2,      -4.44575,            M2,          -4.4976      
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,         i0v10,                  AP2,      -9.92827035,            AK2,          -9.85369319      
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,         i0v-10),                  AQ2,      8.820230435,            AL2,          8.93395677      

to match each row I need to use 6 key fields:
Description,Account,CIP,WWS,WLL,WAM. If there is no match in the field, then I need to display it in a field called 'Column with Break' shown above(that is the title of field). Then I need to display the values for that column(again shown above)
I also need to show the cell.(in excel). For example, here is sample input to this program


Description,       Account,        CIP,            WWS,      WLL,      WAM,       Vol,            i0v10,      i0v-10      
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,  -4.44575,            3434,      3423            
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,  -9.92827035,      454,      4555
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,  8.820230435,      455,      5555

Description,       Account,        CIP,            WWS,      WLL,      WAM,       Vol,            i0v10,      i0v-10      
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,  -6.44575,            3434,      3423            
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,  -9.92827035,      404,      4505
FNMA 4.000 15Y,         M001,      01F0404A2,      4.554,       6,      174,  8.320230435,      455,      5558

I somehow also need to show the cell in excel which I'm not sure how?
Here is the code that I already have. It uses column class because I needed to massage the data,switch columns before.

#pragma warning(disable:4786)

#include "stdafx.h"
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include "process.h"
#include "require.h"
using namespace std;


class Column
{
public:
       Column(const string& ColumnName):m_ColumnName(ColumnName){}
     Column(const string& ColumnName, const vector<string>& ColumnData):m_ColumnName(ColumnName), m_ColumnData(ColumnData){}
     string m_ColumnName;
     vector<string> m_ColumnData;
};
typedef vector<Column> FileColumnVector;
void loadBigFile(const char* fileName, FileColumnVector& FileData);


int main(int argc, char* argv[])
{
      const char* fileName1 = "c:\\sample1.csv";
      const char* fileName2 = "c:\\sample2.csv";
      FileColumnVector EVector1;
      FileColumnVector AVector2;
          loadBigFile(fileName1, EVector1);
      loadBigFile(fileName2, AVector2);
      // I need a routine to compare fields  1) Vol 2) i0v10 3) i0v-10 for each row for vector
      // EVector1 and AVector2 and create report specified above
  return 0;

}


void loadBigFile(const char* fileName, BigFileColumnVector& FileData)
{
     fstream read_file(fileName, ios::in);
     string Line;
     string Field;
     getline(read_file, Line);
     stringstream ss(Line);

       while(getline(ss, Field, ','))
     {
          FileData.push_back(Column(Field));
     }
     
       while(getline(read_file, Line))
     {
          stringstream ss(Line);
          int i = 0;
          while(getline(ss, Field, ','))
          {
              FileData[i++].m_ColumnData.push_back(Field);
          }
     }
     read_file.close();

}

How would I update this code to create the report I need.





0
Comment
Question by:healingtao
  • 11
  • 9
20 Comments
 
LVL 1

Expert Comment

by:gseidman
ID: 12201824
What do you actually need to do with the data? Do you need to remove rows that don't match some expression? Perform some kind of aggregation?
0
 

Author Comment

by:healingtao
ID: 12202435
gseidman,

I can't touch input files. If fields don't match I need to create the report specified on top
which lists 6 key fields I use to match rows, column name that didn't match, values from both files,
where they are located in excel(this is a csv file that client will look at in excel). So for example if column is located in row 2 and column AB for file 1 and row 2 column BS, I need to show 'source 1 cell' AB2 and
'source 2 cell' is BS2.

I hope this is clear.

Thanks
0
 
LVL 1

Expert Comment

by:gseidman
ID: 12202652
Okay, let me know if I get anything wrong here:

1) There are two files with the same number of rows in each. The files are formatted identically, meaning they have the same columns in the same order.

2) The first six fields of a row can be considered its unique key, and no other row in the same file has the same key.

3) For each row in the first file, there is exactly one row in the second file with a matching key.

4) The resulting report lists each difference in each pair of corresponding rows by listing the key, the column title of the field that is different, and the value of the field in each file.

I'm not clear on what the Source 1 Cell and Source 2 Cell refer to, so please explain that further.
0
 

Author Comment

by:healingtao
ID: 12203252
--Okay, let me know if I get anything wrong here:

--1) There are two files with the same number of rows in each. The files are formatted identically, meaning they have the same columns in the same order.
The files do not have the same number of columns. The only guarantee I have is that the six key fields are the same and the columns I need to compare have the same column name.

--2) The first six fields of a row can be considered its unique key, and no other row in the same file has the same key.
These six fields are unique but they are not the first six fields. They are not necessarily ordered, they can be anywhere in the row.

--3) For each row in the first file, there is exactly one row in the second file with a matching key.
Yes, that is correct. Except, the part that I have to also do is create a second report if a row is missing in one or the other file and list the missing row.
--4) The resulting report lists each difference in each pair of corresponding rows by listing the key, the column title of the field that is different, and the value of the field in each file.
Yes, this is correct except you're missing source cell 1 and 2.

--I'm not clear on what the Source 1 Cell and Source 2 Cell refer to, so please explain that further.

The client will load the input files and output report in excel.
As you know in excel the first row is column names
A B C........Z     AA AB AC......AZ            BA...BZ etc...
I need to display where these mismatched columns are located when opened in excel. Here is an example once again:
Description,      Account,       CIP,          WWS,     WLL,     WAM,     Column with Break,     Source 1 Cell,     Source 1 Value,     Source 2 Cell,     Source 2 Value
FNMA 4.000 15Y,        M001,     01F0404A2,     4.554,      6,     174,        Vol ,               M2,     4.44575,          M2,         -4.4976    
FNMA 4.000 15Y,        M001,     01F0404A2,     4.554,      6,     174,        i0v10,               AP2,  9.92827035,          AK2,         -9.85369319    
FNMA 4.000 15Y,        M001,     01F0404A2,     4.554,      6,     174,        i0v-10),               AQ2,     8.820230435,          AL2,         8.93395677    

Let me know if it's clear or if you have more questions.
0
 
LVL 1

Expert Comment

by:gseidman
ID: 12203359
Okay, I now know nearly enough to give a solution. I have a few more questions:

1) What goes into the report if a row in file 1 does not have a match in file 2? How about if a row in file 2 does not have a match in file 1?

2) Is there any required ordering of the report output?

3) If the columns in file 1 do not match up in some order with the columns in file 2, what do you report about columns that do not appear in one file or the other?
0
 

Author Comment

by:healingtao
ID: 12203441

--1) What goes into the report if a row in file 1 does not have a match in file 2? How about if a row in file 2 does not have a match in file 1?

That would be a separate csv file just showing the key (6 fields I mentioned), 7th field will be Source showing 'source1' or 'source2', 8th field will be 'Row Number'- some number

--2) Is there any required ordering of the report output?

no

--3) If the columns in file 1 do not match up in some order with the columns in file 2, what do you report about columns that do not appear in one file or the other?

Ignore those. I don't care about that at all.

Thanks
0
 
LVL 1

Expert Comment

by:gseidman
ID: 12204338
I'm working on this, but I'm going to have to come back to it tomorrow. Sorry I can't get it done tonight.
0
 

Author Comment

by:healingtao
ID: 12205126
Not a problem. Thanks very much
0
 
LVL 1

Expert Comment

by:gseidman
ID: 12207769
This should do it. Testing on the sample data you gave seems to give the desired results. Let me know if it gives you any problems. Note that this code is based on my solution to your previous problem, not on the code you gave in this question. I hope that's okay. Also, there are various things that should be done to this so it isn't all in one monolithic file, but I'm sure you can figure that out.

#include <cassert>
#include <cctype>
#include <algorithm>
#include <iterator>
#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <map>
#include <vector>
using namespace std;

const string keycolnames[] = {
  "Description", "Account", "CIP", "WWS", "WLL", "WAM"
};
const int NUM_KEY_COLUMNS = sizeof(keycolnames)/sizeof(string);

template <char token> class tokenstr : public string { };
typedef tokenstr<'\n'> linestr;
typedef tokenstr<','> fieldstr;

template <char token>
inline istream & operator>>(istream &is, tokenstr<token> &str) {
  getline(is, str, token);
  return is;
}

inline string rowColToAlpha(int row, int col) {
  assert(row>=0);
  assert(col>=0);
  char low = col % 26;
  char high = col / 26;
  assert(high<26);
  ostringstream os;

  if (high) {
    os << (char)('A' + high);
  }
  os << (char)('A' + low) << row;
  return os.str();
}

struct TrimLeading {
  inline string operator()(const string &str) {
    const int size = str.size();
    int i;

    for (i=0;(i<size)&&(isspace(str[i]));++i);
    return (i>0) ? string(str, i, size-i) : str;
  }
};

class Row : public vector<string> {
  public:
    Row(const Row &other, int linenum = -1) : vector<string>(other)
    {
      rownum = ((linenum==-1)&&(other.rownum!=-1)) ? other.rownum : linenum;
    }
    Row(int linenum = -1) { rownum = linenum; }
    bool read(istream &is) {
      transform(istream_iterator<fieldstr>(is),
          istream_iterator<fieldstr>(),
          back_inserter(*this), TrimLeading());
      return true;
    }
    inline bool read(const string &row) {
      istringstream is(row);
      return read(is);
    }
    bool write(ostream &os) const {
      const int ncols = size();
      string sep;

      for (int i=0;i<ncols;++i) {
        os << sep << (*this)[i];
        if (sep.empty()) sep = ",";
      }
      os << endl;
      return (os.good());
    }
    int rownum;
};

typedef vector<Row> Table;

inline ostream &operator<<(ostream &os, const Row &row) {
  row.write(os);
  return os;
}

struct ReadRow {
  Table &table;
  ReadRow(Table &tbl) : table(tbl) { }
  inline void operator()(const string &line) {
    Row row;
    row.read(line);
    table.push_back(row);
  }
};

struct Key {
  string values[NUM_KEY_COLUMNS];
  bool operator<(const Key &rhs) const {
    for (int i=0;i<NUM_KEY_COLUMNS;++i) {
      if (values[i] < rhs.values[i]) {
        return true;
      }
    }
    return false;
  }
};

class DataSet : public map<Key, Row> {
  public:
    typedef map<string, int> HeaderMap;
    DataSet() { }
    DataSet(const Table &table) {
      buildFromTable(table);
    }
    DataSet(istream &is) {
      Table table;
      for_each(istream_iterator<linestr>(is),
          istream_iterator<linestr>(),
          ReadRow(table));
      buildFromTable(table);
    }
    inline const HeaderMap &getHeaderMap() const { return headerMap; }
  private:
    HeaderMap headerMap;
    void buildFromTable(const Table &table) {
      assert(!table.empty());
      Table::const_iterator i = table.begin();
      const Table::const_iterator tEnd = table.end();
      const Row &header = *i;
      const int ncols = header.size();
      int count = 0; //maybe this should start at 1
      int keycols[NUM_KEY_COLUMNS];

      for (int j=0;j<ncols;++j) {
        headerMap[header[j]] = j;
      }
      for (int j=0;j<NUM_KEY_COLUMNS;++j) {
        HeaderMap::iterator found = headerMap.find(keycolnames[j]);
        assert(found!=headerMap.end());
        keycols[j] = found->second;
        headerMap.erase(found);
      }
      for (++i;i!=tEnd;++i) {
        const Row &row = *i;
        Row newrow(row, count++);
        assert(row.size()==ncols);
        Key key;
        for (int j=0;j<NUM_KEY_COLUMNS;++j) {
          key.values[j] = row[keycols[j]];
        }
        (*this)[key] = newrow;
      }
    }
};

class MismatchReport : public Table {
  public:
    MismatchReport() : Table() {
      static const string extracols[] = {
        "Column with Break",
        "Source 1 Cell",
        "Source 1 Value",
        "Source 2 Cell",
        "Source 2 Value" };
      static const int NEXTRA_COLS = sizeof(extracols)/sizeof(string);
      Row newrow;
      copy(keycolnames, keycolnames+NUM_KEY_COLUMNS, back_inserter(newrow));
      copy(extracols, extracols+NEXTRA_COLS, back_inserter(newrow));
      push_back(newrow);
    }
    void append(const Key &key, const string &colname,
        const Row &row1, int col1,
        const Row &row2, int col2) {
      Row newrow;
      copy(key.values, key.values+NUM_KEY_COLUMNS, back_inserter(newrow));
      newrow.push_back(colname);
      newrow.push_back(rowColToAlpha(row1.rownum, col1));
      newrow.push_back(row1[col1]);
      newrow.push_back(rowColToAlpha(row2.rownum, col2));
      newrow.push_back(row2[col2]);
      push_back(newrow);
    }
};

class MissingReport : public Table {
  public:
    MissingReport() : Table() {
      static const string extracols[] = {
        "Source",
        "Row Number" };
      static const int NEXTRA_COLS = sizeof(extracols)/sizeof(string);
      Row newrow;
      copy(keycolnames, keycolnames+NUM_KEY_COLUMNS, back_inserter(newrow));
      copy(extracols, extracols+NEXTRA_COLS, back_inserter(newrow));
      push_back(newrow);
    }
    void append(const Key &key, const Row &row, const string &source) {
      Row newrow;
      copy(key.values, key.values+NUM_KEY_COLUMNS, back_inserter(newrow));
      newrow.push_back(source);
      newrow.push_back(intToStr(row.rownum));
      push_back(newrow);
    }
    using Table::push_back;
    inline void push_back(const DataSet::value_type &pair) {
      append(pair.first, pair.second, cursource);
    }
    inline void setSource(int source) { cursource = intToStr(source); }
  private:
    string cursource;
    static inline string intToStr(int num) {
      ostringstream os;
      os << num;
      return os.str();
    }
};

struct HandleRow {
  DataSet &data;
  MissingReport &missing;
  MismatchReport &mismatch;
  typedef pair<int, string> ColPair;
  typedef map<int, ColPair> ColMap;
  ColMap colMap;
  HandleRow(const DataSet::HeaderMap &HeaderMap1, DataSet &Data, MissingReport &Missing, MismatchReport &Mismatch) : data(Data), missing(Missing), mismatch(Mismatch) {
    const DataSet::HeaderMap &headerMap1 = HeaderMap1;
    const DataSet::HeaderMap &headerMap2 = data.getHeaderMap();
    DataSet::HeaderMap::const_iterator i, end1 = headerMap1.end();
    DataSet::HeaderMap::const_iterator found, end2 = headerMap2.end();
    for (i=headerMap1.begin();i!=end1;++i) {
      found = headerMap2.find(i->first);
      if (found!=end2) {
        colMap[i->second] = ColPair(found->second, i->first);
      }
    }
  }
  void operator()(const DataSet::value_type &pair) {
    const Key &key = pair.first;
    DataSet::iterator found = data.find(key);

    if (found==data.end()) {
      missing.push_back(pair);
    } else {
      const Row &row1 = pair.second;
      const Row &row2 = found->second;
      ColMap::const_iterator i, end = colMap.end();
      for (i=colMap.begin();i!=end;++i) {
        if (row1[i->first]!=row2[i->second.first]) {
          mismatch.append(key, i->second.second,
              row1, i->first,
              row2, i->second.first);
        }
      }
      data.erase(found);
    }
  }
};

struct AppendMissing {
  MissingReport &missing;
  AppendMissing(MissingReport &Missing) : missing(Missing) { }
  inline void operator()(const DataSet::value_type &pair) {
    missing.push_back(pair);
  }
};

int main(int argc, char **argv) {
  ifstream in1("in1");
  ifstream in2("in2");
  ofstream out1("missing");
  ofstream out2("mismatch");
  DataSet f1(in1);
  DataSet f2(in2);
  MissingReport missing;
  MismatchReport mismatch;

  missing.setSource(1);
  for_each(f1.begin(), f1.end(),
      HandleRow(f1.getHeaderMap(), f2, missing, mismatch));
  missing.setSource(2);
  for_each(f2.begin(), f2.end(), AppendMissing(missing));
  copy(missing.begin(), missing.end(), ostream_iterator<Row>(out1, ""));
  copy(mismatch.begin(), mismatch.end(), ostream_iterator<Row>(out2, ""));
  return 0;
}

0
 

Author Comment

by:healingtao
ID: 12208327
ok, thanks. I'll try it now
0
Threat Intelligence Starter Resources

Integrating threat intelligence can be challenging, and not all companies are ready. These resources can help you build awareness and prepare for defense.

 

Author Comment

by:healingtao
ID: 12209978
gseidman,

Initially I decided to put the code into one file, just to test it and try to understand this. I'm currently using VC++ version 7.0.
But I'm getting an error " unexpected end of file while looking for precompiled header directive".
Do you know what is causing this? If you could also provide just a few comments for each function(aside from the obvious getters), this would help a lot.
Thanks
0
 
LVL 1

Expert Comment

by:gseidman
ID: 12211802
I don't know a whole lot about the VC++ compiler, but I'll guess that the #pragma you have at the top of your original file will take care of that: #pragma warning(disable:4786)

If not, well, you'll probably have to delve into the details of VC++. Sorry. Anyhow, here's a commented version of the program (I also simplified the Row::write() method and moved some things around):

#include <cassert>
#include <cctype>
#include <algorithm>
#include <iterator>
#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <map>
#include <vector>
using namespace std;

// A fairly flexible representation of a multi-field key; it might be
// better if keycolnames was a static member of Key, but that would make it
// harder to separate out header and source later.
const string keycolnames[] = {
  "Description", "Account", "CIP", "WWS", "WLL", "WAM"
};
const int NUM_KEY_COLUMNS = sizeof(keycolnames)/sizeof(string);
struct Key {
  string values[NUM_KEY_COLUMNS];
  // comparison operator to make Key work as the key type for a map
  inline bool operator<(const Key &rhs) const {
    for (int i=0;i<NUM_KEY_COLUMNS;++i) {
      if (values[i] < rhs.values[i]) {
        return true;
      }
    }
    return false;
  }
};

// This stuff is used for tokenizing input; the operator>> is templated on
// the separator character, which makes it possible to use istream_iterator
template <char sep> class tokenstr : public string { };
typedef tokenstr<'\n'> linestr;
typedef tokenstr<','> fieldstr;

template <char sep>
inline istream & operator>>(istream &is, tokenstr<sep> &str) {
  getline(is, str, sep);
  return is;
}

// Converts a row and column to an Excel cell identifier,
// e.g. row 6 column 36 (column starts at zero) becomes AK6
inline string rowColToAlpha(int row, int col) {
  assert(row>=0);
  assert(col>=0);
  char low = col % 26;
  char high = col / 26;
  assert(high<26);
  ostringstream os;

  if (high) {
    os << (char)('A' + high);
  }
  os << (char)('A' + low) << row;
  return os.str();
}

// A functor to trim leading space from a string
struct TrimLeading {
  inline string operator()(const string &str) {
    const int size = str.size();
    int i;

    for (i=0;(i<size)&&(isspace(str[i]));++i);
    return (i>0) ? string(str, i, size-i) : str;
  }
};

// The Row class represents a row of numbered (not named) fields
class Row : public vector<string> {
  public:
    // copy constructor, with line numbering; the line/row number is sort
    // of a hack to begin with, but it makes life easier elsewhere
    Row(const Row &other, int linenum = -1) : vector<string>(other) {
      rownum = ((linenum==-1)&&(other.rownum!=-1)) ? other.rownum : linenum;
    }
    // null constructor, with the line number hack
    Row(int linenum = -1) { rownum = linenum; }
    // input routine
    bool read(istream &is) {
      transform(istream_iterator<fieldstr>(is),
          istream_iterator<fieldstr>(),
          back_inserter(*this), TrimLeading());
      return true;
    }
    // convenience version of the input routine which reads a string rather
    // than an istream
    inline bool read(const string &row) {
      istringstream is(row);
      return read(is);
    }
    // output routine
    bool write(ostream &os) const {
      const int ncols = size()-1;
      copy(begin(), end()-1, ostream_iterator<string>(os, ","));
      os << (*this)[ncols] << endl;
      return (os.good());
    }
    // line number hack
    int rownum;
};

// overload of operator<< to make ostream_iterator work
inline ostream &operator<<(ostream &os, const Row &row) {
  row.write(os);
  return os;
}

// convenient typedef; a Table is a bunch of rows, essentially a
// two-dimensional array
typedef vector<Row> Table;

// Functor to read Rows into a Table
struct ReadRow {
  Table &table;
  ReadRow(Table &tbl) : table(tbl) { }
  inline void operator()(const string &line) {
    Row row;
    row.read(line);
    table.push_back(row);
  }
};

// This represents a Table that is indexed by Key; it is created from a
// table or from an istream which is a valid table; the first line of the
// table is assumed to be column headers, and it is assumed that every row
// has the same number of columns
class DataSet : public map<Key, Row> {
  public:
    // typedef for a map of column names to column numbers
    typedef map<string, int> HeaderMap;
    // null constructor
    DataSet() { }
    // construction from a table; the real work is in buildFromTable()
    DataSet(const Table &table) {
      buildFromTable(table);
    }
    // construction from an istream; the real work is in buildFromTable()
    DataSet(istream &is) {
      Table table;
      for_each(istream_iterator<linestr>(is),
          istream_iterator<linestr>(),
          ReadRow(table));
      buildFromTable(table);
    }
    // const getter; no need for a non-const version
    inline const HeaderMap &getHeaderMap() const { return headerMap; }
  private:
    // map of column names to column numbers
    HeaderMap headerMap;
    // this is where the real construction happens
    void buildFromTable(const Table &table) {
      assert(!table.empty());
      Table::const_iterator i = table.begin();
      const Table::const_iterator tEnd = table.end();
      const Row &header = *i;
      const int ncols = header.size();
      int linecount = 0; //maybe this should start at 1
      int keycols[NUM_KEY_COLUMNS];

      // create the headerMap from the first row of the table
      for (int j=0;j<ncols;++j) {
        headerMap[header[j]] = j;
      }
      // find the columns numbers of the key columns, put them in the
      // keycols array, and remove the entries from the headerMap since
      // those columns will never be read from anything but the Key
      for (int j=0;j<NUM_KEY_COLUMNS;++j) {
        HeaderMap::iterator found = headerMap.find(keycolnames[j]);
        assert(found!=headerMap.end());
        keycols[j] = found->second;
        headerMap.erase(found);
      }
      // starting at the second row of the Table, copy the Table's rows
      // into the map, including extracting the key columns into a Key and
      // giving the row a line number
      for (++i;i!=tEnd;++i) {
        const Row &row = *i;
        Key key;
        Row newrow(row, linecount++);
        assert(row.size()==ncols);
        for (int j=0;j<NUM_KEY_COLUMNS;++j) {
          key.values[j] = row[keycols[j]];
        }
        (*this)[key] = newrow;
      }
    }
};

// This is the representation of the output for found data mismatches
class MismatchReport : public Table {
  public:
    MismatchReport() : Table() {
      // these are the columns other than the key columns
      static const string extracols[] = {
        "Column with Break",
        "Source 1 Cell",
        "Source 1 Value",
        "Source 2 Cell",
        "Source 2 Value" };
      static const int NEXTRA_COLS = sizeof(extracols)/sizeof(string);
      Row newrow;
      // copy the key column names into the row
      copy(keycolnames, keycolnames+NUM_KEY_COLUMNS, back_inserter(newrow));
      // copy the extra column names into the row
      copy(extracols, extracols+NEXTRA_COLS, back_inserter(newrow));
      // make the row the first Row of the Table
      push_back(newrow);
    }
    // record a found mismatch
    void append(const Key &key, const string &colname,
        const Row &row1, int col1,
        const Row &row2, int col2) {
      Row newrow;
      // copy the key into the new row
      copy(key.values, key.values+NUM_KEY_COLUMNS, back_inserter(newrow));
      // copy the mismatch data into the new row
      newrow.push_back(colname);
      newrow.push_back(rowColToAlpha(row1.rownum, col1));
      newrow.push_back(row1[col1]);
      newrow.push_back(rowColToAlpha(row2.rownum, col2));
      newrow.push_back(row2[col2]);
      // add the new row to the Table
      push_back(newrow);
    }
};

// This is the representation of the output for rows that don't match
// between the sources
class MissingReport : public Table {
  public:
    MissingReport() : Table() {
      // these are the columns other than the key columns
      static const string extracols[] = {
        "Source",
        "Row Number" };
      static const int NEXTRA_COLS = sizeof(extracols)/sizeof(string);
      Row newrow;
      // copy the key column names into the row
      copy(keycolnames, keycolnames+NUM_KEY_COLUMNS, back_inserter(newrow));
      // copy the extra column names into the row
      copy(extracols, extracols+NEXTRA_COLS, back_inserter(newrow));
      // make the row the first Row of the Table
      push_back(newrow);
    }
    // record a found mismatch
    void append(const Key &key, const Row &row, const string &source) {
      Row newrow;
      // copy the key into the new row
      copy(key.values, key.values+NUM_KEY_COLUMNS, back_inserter(newrow));
      // copy the mismatch data into the new row
      newrow.push_back(source);
      newrow.push_back(intToStr(row.rownum));
      // add the new row to the Table
      push_back(newrow);
    }
    // this directive is needed because we don't want to hide the
    // superclass's push_back method(s)
    using Table::push_back;
    // push_back method to take a const DataSet::value_type; this should
    // make it possible to copy() a DataSet into a MissingReport, but it
    // doesn't seem to work, thus the need for the AppendMissing functor
    // below
    inline void push_back(const DataSet::value_type &pair) {
      append(pair.first, pair.second, cursource);
    }
    // convenience when using push_back(), since we know that we will be
    // copying rows into MissingReport from one source, then another, in
    // large blocks
    inline void setSource(int source) { cursource = intToStr(source); }
  private:
    string cursource;
    // convenience method to convert an int to a string
    static inline string intToStr(int num) {
      ostringstream os;
      os << num;
      return os.str();
    }
};

// functor to compare rows
struct HandleRow {
  // the second dataset
  DataSet &data;
  // reports to be filled
  MissingReport &missing;
  MismatchReport &mismatch;
  // convenience typedefs used below
  typedef pair<int, string> ColPair;
  typedef map<int, ColPair> ColMap;
  // mapping of comparable columns from the incoming DataSet to data
  ColMap colMap;
  HandleRow(const DataSet::HeaderMap &HeaderMap1, DataSet &Data, MissingReport &Missing, MismatchReport &Mismatch) : data(Data), missing(Missing), mismatch(Mismatch) {
    // headerMap of the incoming DataSet
    const DataSet::HeaderMap &headerMap1 = HeaderMap1;
    // headerMap of the data DataSet
    const DataSet::HeaderMap &headerMap2 = data.getHeaderMap();
    DataSet::HeaderMap::const_iterator i, end1 = headerMap1.end();
    DataSet::HeaderMap::const_iterator found, end2 = headerMap2.end();
    // find all matching header names and map columns in the incoming
    // DataSet to columns in data and their names
    for (i=headerMap1.begin();i!=end1;++i) {
      found = headerMap2.find(i->first);
      if (found!=end2) {
        colMap[i->second] = ColPair(found->second, i->first);
      }
    }
  }
  // functor operator
  void operator()(const DataSet::value_type &pair) {
    const Key &key = pair.first;
    DataSet::iterator found = data.find(key);

    if (found==data.end()) {
      // if the key is not present in the second DataSet, it goes in the
      // MissingReport
      missing.push_back(pair);
    } else {
      // else we search for column mismatches using colMap
      const Row &row1 = pair.second;
      const Row &row2 = found->second;
      ColMap::const_iterator i, end = colMap.end();
      for (i=colMap.begin();i!=end;++i) {
        if (row1[i->first]!=row2[i->second.first]) {
          // add to MismatchReport
          mismatch.append(key, i->second.second,
              row1, i->first,
              row2, i->second.first);
        }
      }
      // remove the row from the second DataSet; anything left in it counts
      // as missing
      data.erase(found);
    }
  }
};

// Functor to allow a DataSet to be copied into a MissingReport
struct AppendMissing {
  MissingReport &missing;
  AppendMissing(MissingReport &Missing) : missing(Missing) { }
  inline void operator()(const DataSet::value_type &pair) {
    missing.push_back(pair);
  }
};

int main(int argc, char **argv) {
  // change filenames as appropriate
  ifstream in1("in1");
  ifstream in2("in2");
  ofstream out1("missing");
  ofstream out2("mismatch");
  // the two datasets, created from input
  DataSet f1(in1);
  DataSet f2(in2);
  // the two output reports, initially empty
  MissingReport missing;
  MismatchReport mismatch;

  missing.setSource(1);
  // search for mismatches
  for_each(f1.begin(), f1.end(),
      HandleRow(f1.getHeaderMap(), f2, missing, mismatch));
  missing.setSource(2);
  // copy unmatched lines from the second DataSet into MissingReport
  for_each(f2.begin(), f2.end(), AppendMissing(missing));
  // output reports
  copy(missing.begin(), missing.end(), ostream_iterator<Row>(out1, ""));
  copy(mismatch.begin(), mismatch.end(), ostream_iterator<Row>(out2, ""));
  return 0;
}

0
 

Author Comment

by:healingtao
ID: 12212461
Thanks for the update. I'll try it now. Actually the error
" unexpected end of file while looking for precompiled header directive" was my fault, I needed to include stdafx.h for vc++. So forget about that one.
0
 

Author Comment

by:healingtao
ID: 12212824
The mismatch report seems to work, except 'source cell 1' and 'source cell 2' show incorrect values. The missing report is incorrect. At the moment I have 2 files with no missing rows and it produces report with many missing. Do you mind if I send you 2 of my input files to your email and you can help me debug this.
Thanks
0
 
LVL 1

Expert Comment

by:gseidman
ID: 12212892
That would be fine. Please use experts.exchange.gseidman@spamgourmet.com

How incorrect are the cell numbers? Are they off-by-one, or switched, or completely wrong?
0
 

Author Comment

by:healingtao
ID: 12212979
They are off by 26. You start off from AA instead of A. Also the row is off by 2. So your 1 should be 3. I believe it's because you're using zero based index and not including titles. I sent you the files.
0
 

Author Comment

by:healingtao
ID: 12213962
gseidman,

Can you please confirm that you received my email and input files.

Thanks
0
 
LVL 1

Expert Comment

by:gseidman
ID: 12217338
Got them. I'm on it.
0
 
LVL 1

Accepted Solution

by:
gseidman earned 500 total points
ID: 12222386
Okay, I had a really stupid bug and some extraneous stuff. Here's a fixed, working, tested version. Let me know if you have any trouble with it.

#include <cassert>
#include <cctype>
#include <algorithm>
#include <iterator>
#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <map>
#include <vector>
using namespace std;

// A fairly flexible representation of a multi-field key; it might be
// better if keycolnames was a static member of Key, but that would make it
// harder to separate out header and source later.
const string keycolnames[] = {
  "Description", "Account Name", "Cusip",
  "WAC (Coll)", "WALA (Coll)", "WAM (Coll)"
};
const int NUM_KEY_COLUMNS = sizeof(keycolnames)/sizeof(string);
struct Key {
  string values[NUM_KEY_COLUMNS];
  // comparison operator to make Key work as the key type for a map
  inline bool operator<(const Key &rhs) const {
    for (int i=0;i<NUM_KEY_COLUMNS;++i) {
      if (values[i] < rhs.values[i]) {
        return true;
      } else if (values[i] > rhs.values[i]) {
        return false;
      }
    }
    return false;
  }
};

// This stuff is used for tokenizing input; the operator>> is templated on
// the separator character, which makes it possible to use istream_iterator
template <char sep> class tokenstr : public string { };
typedef tokenstr<'\n'> linestr;
typedef tokenstr<','> fieldstr;

template <char sep>
inline istream & operator>>(istream &is, tokenstr<sep> &str) {
  getline(is, str, sep);
  return is;
}

// Converts a row and column to an Excel cell identifier,
// e.g. row 6 column 36 (column starts at zero) becomes AK6
inline string rowColToAlpha(int row, int col) {
  assert(row>=0);
  assert(col>=0);
  char low = col % 26;
  char high = col / 26;
  assert(high<26);
  ostringstream os;

  if (high>0) {
    os << (char)('A' + high - 1);
  }
  os << (char)('A' + low) << row;
  return os.str();
}

// The Row class represents a row of numbered (not named) fields
class Row : public vector<string> {
  public:
    // copy constructor, with line numbering; the line/row number is sort
    // of a hack to begin with, but it makes life easier elsewhere
    Row(const Row &other, int linenum = -1) : vector<string>(other) {
      rownum = ((linenum==-1)&&(other.rownum!=-1)) ? other.rownum : linenum;
    }
    // null constructor, with the line number hack
    Row(int linenum = -1) { rownum = linenum; }
    // input routine
    bool read(istream &is) {
      copy(istream_iterator<fieldstr>(is),
          istream_iterator<fieldstr>(),
          back_inserter(*this));
      return true;
    }
    // convenience version of the input routine which reads a string rather
    // than an istream
    inline bool read(const string &row) {
      istringstream is(row);
      return read(is);
    }
    // output routine
    bool write(ostream &os) const {
      const int ncols = size()-1;
      copy(begin(), end()-1, ostream_iterator<string>(os, ","));
      os << (*this)[ncols] << endl;
      return (os.good());
    }
    // line number hack
    int rownum;
};

// overload of operator<< to make ostream_iterator work
inline ostream &operator<<(ostream &os, const Row &row) {
  row.write(os);
  return os;
}

// convenient typedef; a Table is a bunch of rows, essentially a
// two-dimensional array
typedef vector<Row> Table;

// Functor to read Rows into a Table
struct ReadRow {
  Table &table;
  ReadRow(Table &tbl) : table(tbl) { }
  inline void operator()(const string &line) {
    Row row;
    row.read(line);
    table.push_back(row);
  }
};

// This represents a Table that is indexed by Key; it is created from a
// table or from an istream which is a valid table; the first line of the
// table is assumed to be column headers, and it is assumed that every row
// has the same number of columns
class DataSet : public map<Key, Row> {
  public:
    // typedef for a map of column names to column numbers
    typedef map<string, int> HeaderMap;
    // null constructor
    DataSet() { }
    // construction from a table; the real work is in buildFromTable()
    DataSet(const Table &table) {
      buildFromTable(table);
    }
    // construction from an istream; the real work is in buildFromTable()
    DataSet(istream &is) {
      Table table;
      for_each(istream_iterator<linestr>(is),
          istream_iterator<linestr>(),
          ReadRow(table));
      buildFromTable(table);
    }
    // const getter; no need for a non-const version
    inline const HeaderMap &getHeaderMap() const { return headerMap; }
  private:
    // map of column names to column numbers
    HeaderMap headerMap;
    // this is where the real construction happens
    void buildFromTable(const Table &table) {
      assert(!table.empty());
      Table::const_iterator i = table.begin();
      const Table::const_iterator tEnd = table.end();
      const Row &header = *i;
      const int ncols = header.size();
      int linecount = 2;
      int keycols[NUM_KEY_COLUMNS];

      // create the headerMap from the first row of the table
      for (int j=0;j<ncols;++j) {
        headerMap[header[j]] = j;
      }
      // find the columns numbers of the key columns, put them in the
      // keycols array, and remove the entries from the headerMap since
      // those columns will never be read from anything but the Key
      for (int j=0;j<NUM_KEY_COLUMNS;++j) {
        HeaderMap::iterator found = headerMap.find(keycolnames[j]);
        assert(found!=headerMap.end());
        keycols[j] = found->second;
        headerMap.erase(found);
      }
      // starting at the second row of the Table, copy the Table's rows
      // into the map, including extracting the key columns into a Key and
      // giving the row a line number
      for (++i;i!=tEnd;++i) {
        const Row &row = *i;
        Key key;
        Row newrow(row, linecount++);
        assert(row.size()==ncols);
        for (int j=0;j<NUM_KEY_COLUMNS;++j) {
          key.values[j] = row[keycols[j]];
        }
        (*this)[key] = newrow;
      }
    }
};

// This is the representation of the output for found data mismatches
class MismatchReport : public Table {
  public:
    MismatchReport() : Table() {
      // these are the columns other than the key columns
      static const string extracols[] = {
        "Column with Break",
        "Source 1 Cell",
        "Source 1 Value",
        "Source 2 Cell",
        "Source 2 Value" };
      static const int NEXTRA_COLS = sizeof(extracols)/sizeof(string);
      Row newrow;
      // copy the key column names into the row
      copy(keycolnames, keycolnames+NUM_KEY_COLUMNS, back_inserter(newrow));
      // copy the extra column names into the row
      copy(extracols, extracols+NEXTRA_COLS, back_inserter(newrow));
      // make the row the first Row of the Table
      push_back(newrow);
    }
    // record a found mismatch
    void append(const Key &key, const string &colname,
        const Row &row1, int col1,
        const Row &row2, int col2) {
      Row newrow;
      // copy the key into the new row
      copy(key.values, key.values+NUM_KEY_COLUMNS, back_inserter(newrow));
      // copy the mismatch data into the new row
      newrow.push_back(colname);
      newrow.push_back(rowColToAlpha(row1.rownum, col1));
      newrow.push_back(row1[col1]);
      newrow.push_back(rowColToAlpha(row2.rownum, col2));
      newrow.push_back(row2[col2]);
      // add the new row to the Table
      push_back(newrow);
    }
};

// This is the representation of the output for rows that don't match
// between the sources
class MissingReport : public Table {
  public:
    MissingReport() : Table() {
      // these are the columns other than the key columns
      static const string extracols[] = {
        "Source",
        "Row Number" };
      static const int NEXTRA_COLS = sizeof(extracols)/sizeof(string);
      Row newrow;
      // copy the key column names into the row
      copy(keycolnames, keycolnames+NUM_KEY_COLUMNS, back_inserter(newrow));
      // copy the extra column names into the row
      copy(extracols, extracols+NEXTRA_COLS, back_inserter(newrow));
      // make the row the first Row of the Table
      push_back(newrow);
    }
    // record a found mismatch
    void append(const Key &key, const Row &row, const string &source) {
      Row newrow;
      cout << "Keys: ";
      copy(key.values, key.values+NUM_KEY_COLUMNS,
          ostream_iterator<string>(cout, ","));
      cout << endl;
      // copy the key into the new row
      copy(key.values, key.values+NUM_KEY_COLUMNS, back_inserter(newrow));
      // copy the mismatch data into the new row
      newrow.push_back(source);
      newrow.push_back(intToStr(row.rownum));
      // add the new row to the Table
      push_back(newrow);
    }
    // this directive is needed because we don't want to hide the
    // superclass's push_back method(s)
    using Table::push_back;
    // push_back method to take a const DataSet::value_type; this should
    // make it possible to copy() a DataSet into a MissingReport, but it
    // doesn't seem to work, thus the need for the AppendMissing functor
    // below
    inline void push_back(const DataSet::value_type &pair) {
      append(pair.first, pair.second, cursource);
    }
    // convenience when using push_back(), since we know that we will be
    // copying rows into MissingReport from one source, then another, in
    // large blocks
    inline void setSource(int source) { cursource = intToStr(source); }
  private:
    string cursource;
    // convenience method to convert an int to a string
    static inline string intToStr(int num) {
      ostringstream os;
      os << num;
      return os.str();
    }
};

// functor to compare rows
struct HandleRow {
  // the second dataset
  DataSet &data;
  // reports to be filled
  MissingReport &missing;
  MismatchReport &mismatch;
  // convenience typedefs used below
  typedef pair<int, string> ColPair;
  typedef map<int, ColPair> ColMap;
  // mapping of comparable columns from the incoming DataSet to data
  ColMap colMap;
  HandleRow(const DataSet::HeaderMap &HeaderMap1, DataSet &Data, MissingReport &Missing, MismatchReport &Mismatch) : data(Data), missing(Missing), mismatch(Mismatch) {
    // headerMap of the incoming DataSet
    const DataSet::HeaderMap &headerMap1 = HeaderMap1;
    // headerMap of the data DataSet
    const DataSet::HeaderMap &headerMap2 = data.getHeaderMap();
    DataSet::HeaderMap::const_iterator i, end1 = headerMap1.end();
    DataSet::HeaderMap::const_iterator found, end2 = headerMap2.end();
    // find all matching header names and map columns in the incoming
    // DataSet to columns in data and their names
    for (i=headerMap1.begin();i!=end1;++i) {
      found = headerMap2.find(i->first);
      if (found!=end2) {
        colMap[i->second] = ColPair(found->second, i->first);
      }
    }
  }
  // functor operator
  void operator()(const DataSet::value_type &pair) {
    const Key &key = pair.first;
    DataSet::iterator found = data.find(key);

    if (found==data.end()) {
      // if the key is not present in the second DataSet, it goes in the
      // MissingReport
      missing.push_back(pair);
    } else {
      // else we search for column mismatches using colMap
      const Row &row1 = pair.second;
      const Row &row2 = found->second;
      ColMap::const_iterator i, end = colMap.end();
      for (i=colMap.begin();i!=end;++i) {
        if (row1[i->first]!=row2[i->second.first]) {
          // add to MismatchReport
          mismatch.append(key, i->second.second,
              row1, i->first,
              row2, i->second.first);
        }
      }
      // remove the row from the second DataSet; anything left in it counts
      // as missing
      data.erase(found);
    }
  }
};

// Functor to allow a DataSet to be copied into a MissingReport
struct AppendMissing {
  MissingReport &missing;
  AppendMissing(MissingReport &Missing) : missing(Missing) { }
  inline void operator()(const DataSet::value_type &pair) {
    missing.push_back(pair);
  }
};

int main(int argc, char **argv) {
  // change filenames as appropriate
  ifstream in1("in1");
  ifstream in2("in2");
  ofstream out1("missing");
  ofstream out2("mismatch");
  // the two datasets, created from input
  DataSet f1(in1);
  DataSet f2(in2);
  // the two output reports, initially empty
  MissingReport missing;
  MismatchReport mismatch;

  missing.setSource(1);
  // search for mismatches
  for_each(f1.begin(), f1.end(),
      HandleRow(f1.getHeaderMap(), f2, missing, mismatch));
  missing.setSource(2);
  // copy unmatched lines from the second DataSet into MissingReport
  for_each(f2.begin(), f2.end(), AppendMissing(missing));
  // output reports
  copy(missing.begin(), missing.end(), ostream_iterator<Row>(out1, ""));
  copy(mismatch.begin(), mismatch.end(), ostream_iterator<Row>(out2, ""));
  return 0;
}

0
 

Author Comment

by:healingtao
ID: 12277756
gseidman,

Sorry for the delay, I was out for a week. Thanks for the solution.
0

Featured Post

Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

Join & Write a Comment

In days of old, returning something by value from a function in C++ was necessarily avoided because it would, invariably, involve one or even two copies of the object being created and potentially costly calls to a copy-constructor and destructor. A…
This article will show you some of the more useful Standard Template Library (STL) algorithms through the use of working examples.  You will learn about how these algorithms fit into the STL architecture, how they work with STL containers, and why t…
The viewer will learn how to clear a vector as well as how to detect empty vectors in C++.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.

758 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

21 Experts available now in Live!

Get 1:1 Help Now