I am dealing with big log files, about 50,000,000 rows per day, and I have to calculate the number of distinct IDs over a single day and over the whole month (from multiple log files). Log files are named by date.
Right now I am using a Perl script to do this, and it also shells out to some system commands (sort, uniq, wc).
Because I know that C/C++ is much faster than Perl, I am asking all C programmers out there: does anyone have a solution for this? If there is an even faster solution on Linux for this kind of problem, I am interested in that too.
I need two things:
1.) a program that reads log filename(s) as command line argument(s) (there can be multiple files, because I also need the count for the whole month), parses them, and then prints the total count of distinct IDs as a single integer (see the C++ sketch I put after the example output below).
Log files are plain text, tab-separated and unordered, and the ID is in the 6th column of every row.
2.) I also need a program that prints the total number of IDs which occur once, twice, three times, etc. This can be the same program if that makes it faster, no problem at all!
no. of IDs | no. of hits
138491 | 1
3890 | 2
834 | 3
524 | 4
(this means that 138491 distinct IDs were logged only once, 3890 distinct IDs were logged twice, etc.)
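To make both requirements concrete, here is a minimal single-machine C++ sketch of the kind of program I mean (this is not my current Perl script, just the naive approach I would try first; it assumes all distinct IDs fit in memory, which at ~80,000,000 IDs per month plus hash-table overhead could already mean several GB of RAM):

    // count_ids.cpp -- naive sketch: count occurrences of every distinct ID
    // (6th tab-separated column) across all files given on the command line,
    // then print the distinct total and the hits histogram.
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main(int argc, char* argv[]) {
        // One counter per distinct ID, accumulated across all files
        // (one day's file, or a whole month of files).
        std::unordered_map<std::string, std::uint32_t> counts;

        for (int i = 1; i < argc; ++i) {
            std::ifstream in(argv[i]);
            if (!in) {
                std::cerr << "cannot open " << argv[i] << '\n';
                return 1;
            }
            std::string line;
            while (std::getline(in, line)) {
                // The ID is the 6th tab-separated column: skip 5 tabs.
                std::size_t pos = 0;
                bool ok = true;
                for (int t = 0; t < 5; ++t) {
                    pos = line.find('\t', pos);
                    if (pos == std::string::npos) { ok = false; break; }
                    ++pos;
                }
                if (!ok) continue;                      // malformed row
                std::size_t end = line.find('\t', pos); // ID may be last column
                ++counts[line.substr(pos, end == std::string::npos
                                              ? std::string::npos
                                              : end - pos)];
            }
        }

        // Requirement 1: total number of distinct IDs.
        std::cout << counts.size() << '\n';

        // Requirement 2: how many IDs occurred once, twice, three times, ...
        std::map<std::uint32_t, std::uint64_t> histogram;
        for (const auto& kv : counts) ++histogram[kv.second];
        for (const auto& kv : histogram)
            std::cout << kv.second << " | " << kv.first << '\n';

        return 0;
    }

I would compile it with g++ -O2 and run it as ./count_ids file1.log file2.log ...; the question is whether this (or something much smarter) can beat my sort/uniq pipeline by a wide margin on 50,000,000 rows.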
Right now the number of distinct IDs is about 20,000,000 per day and about 80,000,000 per month, so I am dealing with a big data set and cannot figure out how to write a program or script that calculates all of this quickly. The Perl script takes a few hours and also drives the load up so much that the computer is unusable during that time.
I would also be interested in a distributed solution (working over 2 or more computers on a LAN in order to calculate this as fast as possible) if anybody has that kind of skill, and I am willing to give extra points for such a solution.
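For the distributed case, the scheme I am picturing is to shard the rows by a hash of the ID, so that each machine sees a disjoint set of IDs and the per-machine distinct totals can simply be summed (unlike per-day totals, which overlap). A hypothetical splitter sketch; the shardN.log naming is made up:

    // split_by_id.cpp -- hypothetical pre-pass for a distributed run: route
    // every row to one of N shard files by hashing its ID, so each distinct
    // ID lands in exactly one shard and per-shard results can be summed.
    #include <cstddef>
    #include <cstdlib>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char* argv[]) {
        if (argc < 3) {
            std::cerr << "usage: split_by_id NUM_SHARDS LOGFILE...\n";
            return 1;
        }
        const std::size_t shards = std::strtoul(argv[1], nullptr, 10);
        if (shards == 0) return 1;

        // One output file per shard (shard0.log, shard1.log, ...).
        std::vector<std::ofstream> out(shards);
        for (std::size_t s = 0; s < shards; ++s)
            out[s].open("shard" + std::to_string(s) + ".log");

        std::hash<std::string> hasher;
        for (int i = 2; i < argc; ++i) {
            std::ifstream in(argv[i]);
            if (!in) {
                std::cerr << "cannot open " << argv[i] << '\n';
                return 1;
            }
            std::string line;
            while (std::getline(in, line)) {
                // Locate the 6th tab-separated column (the ID), as before.
                std::size_t pos = 0;
                bool ok = true;
                for (int t = 0; t < 5; ++t) {
                    pos = line.find('\t', pos);
                    if (pos == std::string::npos) { ok = false; break; }
                    ++pos;
                }
                if (!ok) continue;                      // malformed row
                std::size_t end = line.find('\t', pos);
                std::string id = line.substr(pos, end == std::string::npos
                                                      ? std::string::npos
                                                      : end - pos);
                out[hasher(id) % shards] << line << '\n';
            }
        }
        return 0;
    }

Each machine would then run the counting program above on its own shard files, and the distinct totals add up because no ID can appear in two shards.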