Calculate distinct IDs and occurrences using C / C++

I am dealing with big log files of about 50,000,000 rows per day, and I have to calculate the number of distinct IDs over a single day and over the whole month (from multiple log files). The log files are named by date, e.g.:
2007-07-22.log
2007-07-23.log
2007-07-24.log
2007-07-25.log etc.

Right now I am using a Perl script to do this, and it also uses some system calls (sort, uniq, wc).
Because C / C++ should be much faster than Perl for this, I am asking all C programmers out there whether anyone has a solution. If there is any other, even faster solution on Linux for this kind of problem, I am also interested.

I need two things:

1.) A program that reads log filename(s) as command line argument(s) (there can be multiple files because I also need to calculate this count for the whole month), parses them and then prints the total count of distinct IDs as a single integer.
Log files are plain text, tab separated and unordered, and the ID is in the 6th column of every row,
e.g.:
date<tab>time<tab>some<tab>other<tab>stuff<tab>ID<tab>.......more stuff\n

2.) I also need a program that prints the number of IDs which occur once, twice, three times, etc. This can be the same program if that would make it faster, no problem at all!
sample output:

no. of IDs  |  no. of hits
--------------------------
138491      |  1
3890        |  2
834         |  3
524         |  4
.....etc.

(this means that 138491 distinct IDs are logged only once, 3890 distinct IDs are logged twice, etc.)
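The second requirement can be derived from the per-ID counts once they exist. The following is a minimal sketch, assuming C++11 and a hash map from ID to hit count; the function name `hits_histogram` and the container choices are illustrative and do not come from the thread.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical helper: given the hit count of every distinct ID, return how
// many distinct IDs were seen exactly 1, 2, 3, ... times.
// Key = number of hits, value = number of distinct IDs with that many hits.
std::map<unsigned, std::size_t>
hits_histogram(const std::unordered_map<std::string, unsigned>& counts) {
    std::map<unsigned, std::size_t> histogram;  // std::map keeps hits in ascending order
    for (const auto& entry : counts) {
        ++histogram[entry.second];              // one more ID with this number of hits
    }
    return histogram;
}
```

With such a map, the total number of distinct IDs for the first requirement is just `counts.size()`, so both outputs can come from one pass over the log files.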

Right now the number of distinct IDs is about 20,000,000 per day and about 80,000,000 per month, so I am dealing with a big data set and cannot figure out how to write a program or script that calculates all this quickly. The Perl script takes a few hours and also increases the load so much that the computer is not usable during that time.

I would also be interested in any distributed solution (working across 2 or more computers on a LAN in order to calculate as fast as possible) if anybody has that kind of skill, and I am willing to give extra points for that kind of solution.
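One common way to split this kind of counting across machines is to partition the IDs by hash, so that every occurrence of a given ID is sent to the same worker. The sketch below only shows that partitioning step, assuming C++11; `fnv1a_64` implements the well-known FNV-1a hash by hand so that all machines agree on the result, while `partition_for_id` and `num_workers` are hypothetical names. How rows are actually shipped between machines is left open.

```cpp
#include <cstdint>
#include <string>

// FNV-1a, 64-bit. Written out explicitly so that the result does not depend
// on a particular standard library's std::hash implementation.
std::uint64_t fnv1a_64(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;                 // FNV prime
    }
    return h;
}

// Hypothetical helper: index of the worker (0 .. num_workers-1) responsible
// for this ID. num_workers must be greater than zero.
unsigned partition_for_id(const std::string& id, unsigned num_workers) {
    return static_cast<unsigned>(fnv1a_64(id) % num_workers);
}
```

Because each ID belongs to exactly one worker, the per-worker distinct counts can simply be added up, and the per-worker "number of hits" tables can be summed row by row.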
jakac asked:
 
jkr commented:
Yes, that might cause that issue. Then that should be

#include <sys/types.h>
#include <dirent.h>
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <map>
#include <list>

using namespace std;

void list_all_logfiles ( const string& sStartDir, list<string>& lstFound) {

   cout << "checking " << sStartDir<< endl;

   DIR* pDir = opendir ( sStartDir.c_str ());

   if ( !pDir) return;

   dirent* pEntry;

   while ( pEntry = readdir ( pDir)) {

       cout << "found " << pEntry->d_name << endl;

       string sFound = sStartDir + string ( "/") + string ( pEntry->d_name); // <---

       if (string::npos != sFound.find(".log")) lstFound.push_back ( sFound);
   }

   closedir(pDir);
}

int split_text(const string& strIn, const char cDelim, vector<string>& vResult) {

   string::size_type nPos = 0;
   int nCount = 0; // number of delimiters found

   while(1) {

      // look for the next delimiter, starting after the previous one
      string::size_type nFound = strIn.find(cDelim,nPos);

      if (string::npos == nFound)  {

        // no more delimiters: the rest of the line is the last token
        vResult.push_back(strIn.substr(nPos));
        break;
      }

      vResult.push_back(strIn.substr(nPos,nFound - nPos));

      nPos = nFound + 1;

      ++nCount;
   }

   return nCount;
}

void ProcessFile(const string& strName, map<string,int>& rmapResults) {

    ifstream is(strName.c_str());

    if (!is.is_open()) {

        cout << "Error processing " << strName << endl;

        return;
    }

    string strLine;

    // let getline() control the loop, so a failed read at end of file is never processed
    while(getline(is,strLine)) {

        vector<string> vResult;

        split_text(strLine,'\t',vResult);

        // skip empty or malformed lines; indexing vResult[5] on them would be out of range
        if (vResult.size() < 6) continue;

        const string& sId = vResult[5]; // get ID at col. 6 (indices are zero-based)

        map<string,int>::iterator i = rmapResults.find(sId);

        if (i == rmapResults.end()) { // not found yet

            rmapResults.insert(map<string,int>::value_type(sId,1));

        } else { // already seen that ID

            i->second++; // increment count

        }

    }
}

void DisplayResults(const map<string,int>& rmap) {

    map<string,int>::const_iterator i;

    cout << "ID:\tCount:" << endl; // header

    for (i = rmap.begin(); i != rmap.end(); ++i) {

        cout << i->first << "\t" << i->second << endl;
    }
}

int main () {


    list<string> lstFiles;
    map<string,int> mapIdsToHits;

    list_all_logfiles ( "/home/somedir", lstFiles);

    list<string>::const_iterator i;

    for (i = lstFiles.begin(); i != lstFiles.end(); ++i) {

        ProcessFile(*i, mapIdsToHits);
    }

    DisplayResults(mapIdsToHits);

    return 0;
}
 
jkr commented:
I'd rather pass the base directory and have the program read the log file names than pass them on the command line, e.g.

#include <sys/types.h>
#include <dirent.h>
#include <string>
#include <iostream>
#include <list>
using namespace std;

void list_all_logfiles ( const string& sStartDir, list<string>& lstFound) {

   cout << "checking " << sStartDir.c_str () << endl;

   DIR* pDir = opendir ( sStartDir.c_str ());

   if ( !pDir) return;

   dirent* pEntry;

   while ( pEntry = readdir ( pDir)) {

       cout << "found " << pEntry->d_name << endl;

       string sFound = pEntry->d_name;

       if (string::npos != sFound.find(".log")) lstFound.push_back ( sFound);
   }

   closedir(pDir);
}

and use it like

list<string> lstFiles;

list_all_logfiles ( "/home/somedir", lstFiles);

From there on, you can simply process the ID counting like

#include <sys/types.h>
#include <dirent.h>
#include <string>
#include <vector>
#include <list>
#include <iostream>
#include <fstream>
#include <map>
#include <sstream>

using namespace std;

void list_all_logfiles ( const string& sStartDir, list<string>& lstFound) {

   cout << "checking " << sStartDir.c_str () << endl;

   DIR* pDir = opendir ( sStartDir.c_str ());

   if ( !pDir) return;

   dirent* pEntry;

   while ( pEntry = readdir ( pDir)) {

       cout << "found " << pEntry->d_name << endl;

       string sFound = pEntry->d_name;

       if (string::npos != sFound.find(".log")) lstFound.push_back ( sFound);
   }

   closedir(pDir);
}

int split_text(string strIn, const char cDelim, vector<string>& vResult) {

   int nPos = 0;
   int nCount = 0;
   int nFound;
   string strToken;

   while(1) {

      nFound = strIn.find(cDelim,nPos);

      if (-1 == nFound)  {

        strToken = strIn.substr(nPos,strIn.length() - nPos);
        vResult.push_back(strToken);
        break;
      }

      strToken = strIn.substr(nPos,nFound - nPos);

      nPos = nFound + 1;

      ++nCount;

      vResult.push_back(strToken);

   }

   return nCount;
}

void ProcessFile(const string& strName, map<int,int>& rmapResults) {

    ifstream is(strName.c_str());

    if (!is.is_open()) {

        cout << "Error processing " << strName << endl;

        return;
    }

    while(!is.eof()) {

        vector<string> vResult;

        string strLine;

        getline(is,strLine);

        int n = split_text(strLine,'\t',vResult);

        int nId;

        stringstream ss;

        ss << vResult[5]; // get ID at col. 6 (indices are zero-based)
        ss >> nId ;

        map<int,int>::iterator i = rmapResults.find(nId);

        if (i == rmapResults.end()) { // not found yet

            int nCount = 1;

            rmapResults.insert(map<int,int>::value_type(nId,nCount));

        } else { // already seen that ID

            i->second++; // increment count

        }

    }
}

void DisplayResults(const map<int,int>& rmap) {

    map<int,int>::const_iterator i;

    cout << "ID:\tCount:" << endl; // header
   
    for (i = rmap.begin(); i != rmap.end(); ++i) {

        cout << i->first << "\t" << i->second << endl;
    }
}

int main () {


    list<string> lstFiles;
    map<int,int> mapIdsToHits;

    list_all_logfiles ( "/home/somedir", lstFiles);

    list<string>::iterator i;

    for (i = lstFiles.begin(); i != lstFiles.end(); ++i) {

        ProcessFile(*i, mapIdsToHits);
    }

    DisplayResults(mapIdsToHits);

    return 0;
}
 
jkr commented:
Lil' correction, 'main()' should be

int main () {

    list<string> lstFiles;
    map<int,int> mapIdsToHits;

    list_all_logfiles ( "/home/somedir", lstFiles);

    list<string>::const_iterator i;

    for (i = lstFiles.begin(); i != lstFiles.end(); ++i) {

        ProcessFile(*i, mapIdsToHits);
    }

    DisplayResults(mapIdsToHits);

    return 0;
}
 
jkr commented:
OK, there were a lot of typos in the above; the following compiles:

#include <sys/types.h>
#include <dirent.h>
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <map>
#include <list>
#include <sstream>

using namespace std;

void list_all_logfiles ( const string& sStartDir, list<string>& lstFound) {

   cout << "checking " << sStartDir<< endl;

   DIR* pDir = opendir ( sStartDir.c_str ());

   if ( !pDir) return;

   dirent* pEntry;

   while ( pEntry = readdir ( pDir)) {

       cout << "found " << pEntry->d_name << endl;

       string sFound = pEntry->d_name;

       if (string::npos != sFound.find(".log")) lstFound.push_back ( sFound);
   }

   closedir(pDir);
}

int split_text(string strIn, const char cDelim, vector<string>& vResult) {

   int nPos = 0;
   int nCount = 0;
   int nFound;
   string strToken;

   while(1) {

      nFound = strIn.find(cDelim,nPos);

      if (-1 == nFound)  {

        strToken = strIn.substr(nPos,strIn.length() - nPos);
        vResult.push_back(strToken);
        break;
      }

      strToken = strIn.substr(nPos,nFound - nPos);

      nPos = nFound + 1;

      ++nCount;

      vResult.push_back(strToken);

   }

   return nCount;
}

void ProcessFile(const string& strName, map<int,int>& rmapResults) {

    ifstream is(strName.c_str());

    if (!is.is_open()) {

        cout << "Error processing " << strName << endl;

        return;
    }

    while(!is.eof()) {

        vector<string> vResult;

        string strLine;

        getline(is,strLine);

        int n = split_text(strLine,'\t',vResult);

        int nId;

        stringstream ss;

        ss << vResult[5]; // get ID at col. 6 (indices are zero-based)
        ss >> nId ;

        map<int,int>::iterator i = rmapResults.find(nId);

        if (i == rmapResults.end()) { // not found yet

            int nCount = 1;
            rmapResults.insert(map<int,int>::value_type(nId,nCount));

        } else { // already seen that ID

            i->second++; // increment count

        }

    }
}

void DisplayResults(const map<int,int>& rmap) {

    map<int,int>::const_iterator i;

    cout << "ID:\tCount:" << endl; // header

    for (i = rmap.begin(); i != rmap.end(); ++i) {

        cout << i->first << "\t" << i->second << endl;
    }
}

int main () {


    list<string> lstFiles;
    map<int,int> mapIdsToHits;

    list_all_logfiles ( "/home/somedir", lstFiles);

    list<string>::const_iterator i;

    for (i = lstFiles.begin(); i != lstFiles.end(); ++i) {

        ProcessFile(*i, mapIdsToHits);
    }

    DisplayResults(mapIdsToHits);

    return 0;
}

 
jkr commented:
Oh, and just in case that is unclear: compile the above using

g++ -o genids genids.cpp

(any filename will do, that is just the one I chose)
 
jakac (author) commented:
Thanks for this solution... I will try it ASAP and let you know!
 
jakac (author) commented:

I tried it but it just produced the following output:

found .
found ..
found 2007-06-30.log
Error processing 2007-06-30.log
ID:     Count:

Before I compiled it I also changed the line
    list_all_logfiles ( "/home/somedir", lstFiles);
to my actual directory for testing, which contains this 2007-06-30.log file, but I guess the program doesn't open it or something? Can you please check it out? Thanks!
 
jkr commented:
Sorry, the files aren't stored with their full path - make that

void list_all_logfiles ( const string& sStartDir, list<string>& lstFound) {

   cout << "checking " << sStartDir<< endl;

   DIR* pDir = opendir ( sStartDir.c_str ());

   if ( !pDir) return;

   dirent* pEntry;

   while ( pEntry = readdir ( pDir)) {

       cout << "found " << pEntry->d_name << endl;

       string sFound = sStartDir + string ( "/") + string ( pEntry->d_name); // <---

       if (string::npos != sFound.find(".log")) lstFound.push_back ( sFound);
   }

   closedir(pDir);
}

instead.
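One thing to watch with the full path: `sFound.find(".log")` matches ".log" anywhere in the string, so a file such as `2007-07-22.log.gz`, or any file inside a directory whose name contains ".log", would also be picked up. A stricter check could look like the following sketch; `has_log_suffix` is a hypothetical helper name, not something from the code above.

```cpp
#include <string>

// Hypothetical helper: true only if `name` ends with ".log" and has at least
// one character in front of the suffix.
bool has_log_suffix(const std::string& name) {
    const std::string suffix = ".log";
    return name.size() > suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

It could replace the `string::npos != sFound.find(".log")` condition, applied either to `pEntry->d_name` or to the full path.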
 
jakac (author) commented:
Hello,

Well, the program works now, but after running for about one minute it produces a segmentation fault...

As I said, the program has to be able to handle a massive amount of data: about 20,000,000 distinct IDs... Right now I have only tested it on a stripped-down file with only about 200,000 distinct IDs and it didn't go through... I also tried a smaller file (only 1,000 rows) and it also produced a segmentation fault...

BTW: did I mention that the IDs are 24-character-long strings, not integers? (Maybe that is the problem.)
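If the IDs really are always exactly 24 characters, they can be stored as fixed-size keys, which avoids the separate heap buffer that a `std::string` of that length typically needs for every distinct ID. This is only a sketch under that assumption, using C++11; `IdKey`, `IdKeyHash`, `IdCounts` and `add_id` are hypothetical names, and the hash is a hand-written FNV-1a.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>

const std::size_t kIdLength = 24;               // assumption taken from the post above
typedef std::array<char, kIdLength> IdKey;      // fixed-size key, no heap allocation

// FNV-1a over the 24 bytes of the key.
struct IdKeyHash {
    std::size_t operator()(const IdKey& key) const {
        std::uint64_t h = 0xcbf29ce484222325ULL;
        for (char c : key) {
            h ^= static_cast<unsigned char>(c);
            h *= 0x100000001b3ULL;
        }
        return static_cast<std::size_t>(h);
    }
};

typedef std::unordered_map<IdKey, unsigned, IdKeyHash> IdCounts;

// Count one occurrence of `id`; returns false (and counts nothing) if the
// ID does not have the expected length.
bool add_id(const std::string& id, IdCounts& counts) {
    if (id.size() != kIdLength) return false;
    IdKey key;
    std::memcpy(key.data(), id.data(), kIdLength);
    ++counts[key];
    return true;
}
```

Calling `counts.reserve(...)` with the expected number of distinct IDs before processing should also reduce rehashing, although whether 80,000,000 entries fit in memory depends on the machine.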
 
Computer101 commented:
Forced accept.

Computer101
EE Admin
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.