Fastest way to count distinct strings from plain text log files

We are using Perl to process statistics from plain-text log files (ordered by timestamp, not by ID) which have more than 50 million rows per day. I know that Perl may not be the ideal tool for this job, but it works fine and we don't have any problems with it.
One thing it also does (in cooperation with Linux shell programs) is count distinct occurrences of an "ID", which is a 24-character string. It does that using the following algorithm:

open(F,"logfile.log");
open(TMP,">temp.log");
while (<F>) {
  @data = split(/\t/,$_);   # logfile is tab-separated
  print TMP $data[5]."\n";     # ID is in the 6th column
}
close(F);
close(TMP);

$unique = `sort temp.log | uniq | wc -l`;
chomp($unique);
print "There are ".$unique." user IDs in the log file";

... anyway - it may be a weird solution but it is the fastest I came up with... Any solution using Perl's hashes or similar just produced a memory leak after too much data was inserted (again - I'm talking about 50 million log file lines which contain about 20 million distinct IDs).

So my first question is - does anybody have a solution which is perhaps faster and maybe more elegant than this?

And the next question is - does anybody have any idea how to write a script which would also count the number of IDs with the same number of hits, producing output like:

no. of IDs | no. of hits
-----------+------------
    138491 | 1
      3890 | 2
       834 | 3
       524 | 4
etc.


(this means that 138491 distinct IDs are logged only once, 3890 distinct IDs are logged twice, etc.)

I know how to write a script like this using Perl's hashes, so I am not looking for that kind of solution because it is too slow, produces a memory leak and cannot handle that amount of data. So if anybody has a better solution, please help me! :)


5 Solutions
 
ozo commented:
Do you really mean a memory leak, or do you mean a memory overflow?
If it's a memory leak, you may want to let the Perl developers know about it.


open U,"sort temp.log | uniq -c|";
$n{(split)[0]}++ while <U>;
close U;
for( sort{$a<=>$b} keys %n ){
   printf"%12d| %d\n",$n{$_},$_;
}
 
jakac (Author) commented:
Sorry ... I meant memory overflow...
Thanks for the solution to the second question... I guess it's OK (still have to test it though)... Is there maybe another way to count distinct occurrences of a certain string, or only by doing sort | uniq?
 
hernst42 commented:
You could do it in PHP like this:

$data = array();
if (($fp = fopen('temp.log', 'r'))) {
   while (($line = fgets($fp, 10000)) !== false) {
      ++$data[trim($line)];     // one counter per distinct ID
   }
   fclose($fp);
}
asort($data);
Does not need a lot of memory and you don't need to sort the file.
 
jakac (Author) commented:
hernst42: well, isn't this the same thing as counting in Perl using a hash? If you put about 20,000,000 different keys into the $data array it will also eat up all of your computer's RAM and eventually crash, don't you think? I tried the same thing using Perl on a computer with 2 GB of RAM and the whole thing took forever and the computer almost died while executing the script...

The script you provided is the same as this Perl:

open(F,"temp.log");
while (<F>) {
   chomp;
   $data{$_}++;     # one hash key per distinct ID
}
close(F);
print "Distinct count: " . keys(%data);

This script would be great for a small amount of data, but with a huge data set there is no way...

I wonder how the software which counts, for example, Google's distinct search keywords is written. For the needs of AdWords they also have some kind of statistics that tells them how often a single keyword was used during a whole month, and we are talking about billions of keywords. I understand that there is also a huge cluster of hardware behind all that, but still - how is that software written? I guess not with a PHP array or a Perl hash... I also guess that no other programming language is capable of storing that amount of data in memory without increasing the computer's load dramatically...

If it were possible I would also write a program which would break the big file into pieces so that separate computers could easily process each piece - but this wouldn't be possible without sorting it first, because different computers would count the same ID multiple times...

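One way around that last objection - offered here purely as a hypothetical sketch, with the bucket count and file names as assumptions - is to split by a checksum of the ID itself rather than by line ranges, so every occurrence of a given ID lands in the same piece and the per-piece distinct counts can simply be added:

use strict;
use warnings;

my $buckets = 16;                              # assumed number of pieces
my @fh;
for my $i (0 .. $buckets - 1) {
    open($fh[$i], '>', "ids.$i") or die "ids.$i: $!";
}

open(my $log, '<', 'logfile.log') or die "logfile.log: $!";
while (<$log>) {
    my $id = (split /\t/)[5];                  # 6th column, as in the original script
    my $b  = unpack('%32C*', $id) % $buckets;  # cheap checksum routes each ID to one bucket
    print { $fh[$b] } "$id\n";
}
close $log;
close $_ for @fh;

# Each ids.N file can now go through "sort | uniq | wc -l" on a separate machine,
# and the per-bucket distinct counts simply add up to the total.
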
And the point is that right now I am dealing with about 50 million log file rows per day - two years ago I was working with "only" about 30 million per day, and as usage increases, in a couple of years I will be dealing with 100 million - what about then? The hardware is not a problem because I can get more computers if needed, but how do I write a program that can actually take full advantage of all that hardware?
 
Adam314 commented:
Is there a reason you don't like using sort and uniq?
If you sort the file first, you could then process it in Perl without using uniq.
If you don't want to use sort or uniq, I think you would have to process the file in multiple passes (e.g. first look for IDs 0-1000, save those results, then look for IDs 1001-2000, save those results...)

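A minimal sketch of that sorted-single-pass idea (for illustration only, not code from the thread) - it only ever keeps the previous ID and a small hits histogram in memory:

open(S, "sort temp.log |") or die "sort: $!";
my ($prev, $run, $distinct, %hist);
while (my $id = <S>) {
    chomp $id;
    if (defined $prev && $id eq $prev) {
        $run++;                           # same ID as the previous line
    } else {
        $hist{$run}++ if defined $prev;   # close out the previous ID's run
        ($prev, $run) = ($id, 1);
        $distinct++;
    }
}
$hist{$run}++ if defined $prev;           # don't forget the final run
close S;

print "There are $distinct user IDs in the log file\n";
printf "%12d| %d\n", $hist{$_}, $_ for sort { $a <=> $b } keys %hist;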
 
jakac (Author) commented:
Adam314: well, the reason is that if you sort a 1-2 GB file you have to do it using temporary files, and it takes a lot of time, massive load and some disk space... So first I have to get the 6th column into one temporary file, then sort it using more temporary files, etc. You also have to walk the whole BIG logfile first to extract all the IDs, then sort walks that file and its temporary files many times, and after that you have to walk the sorted file once more to calculate the final number, which is just one single integer - but it takes an hour to get to it...

I am not saying that this solution is bad... It is slow, but still faster than anything else I tried before (like hashes)...

I am just asking whether there is any other, better, more elegant, faster solution? It doesn't need to be in Perl - I guess C / C++ or maybe another language would do it even faster... My programming knowledge is unfortunately limited to Perl & PHP so I wouldn't know, but in my experience there is always a better solution, even when you think you have found the best one...
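
As a hedged aside on that sorting cost, assuming GNU coreutils sort (the buffer size and scratch path below are placeholders): the -S and -T options trade memory for temporary-file I/O, and -u folds the uniq step into the sort itself, e.g.:

$unique = `sort -u -S 1G -T /fast/tmp temp.log | wc -l`;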
 
hernst42 commented:
For such a huge amount you could split up the work:
read e.g. 1,000,000 rows, then discard all hash entries which are below a certain limit (e.g. 5), or keep only the top 100.

Then proceed with the next 1,000,000 rows. At the end combine all the results into one big one. This might not be 100% accurate, but this way you can parallelize the job, you shouldn't need much memory, and you are finished in one pass through the lines.
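
A rough sketch of that chunked idea (for illustration only; the chunk size and the pruning threshold of 5 are assumptions, and as noted above the result is approximate because counts pruned from a chunk are lost):

my $limit      = 5;            # assumed pruning threshold
my $chunk_size = 1_000_000;    # assumed chunk size
my (%count, %kept, $lines);

sub prune {
    while (my ($id, $n) = each %count) {
        $kept{$id} += $n if $n >= $limit;   # keep only this chunk's "heavy" IDs
    }
    %count = ();
}

open(F, '<', 'temp.log') or die "temp.log: $!";
while (my $id = <F>) {
    chomp $id;
    $count{$id}++;
    prune() if ++$lines % $chunk_size == 0;
}
close F;
prune();                       # flush the last partial chunk

print scalar(keys %kept), " IDs were seen at least $limit times within a single chunk\n";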
 
ozo commented:
Unix sort is written to run efficiently with large external files, but it can waste I/O on multiple copies of the same line.
A hash does not keep multiple copies, but is not written to run efficiently with large external files.
So how about something that works like a hash but is written to use a large external file, like a DBI-backed database?
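
Along those lines, a minimal sketch using a tied on-disk hash (this uses the DB_File module rather than DBI, and the file names are placeholders) - the counts live in a Berkeley DB file on disk instead of in RAM:

use strict;
use warnings;
use DB_File;
use Fcntl;

# %count is stored in counts.db on disk, so 20 million keys need not fit in RAM
my %count;
tie %count, 'DB_File', 'counts.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "counts.db: $!";

open(my $f, '<', 'temp.log') or die "temp.log: $!";
while (my $id = <$f>) {
    chomp $id;
    $count{$id}++;
}
close $f;

my $distinct = 0;
while (my ($id, $n) = each %count) {
    $distinct++;
}
print "Distinct count: $distinct\n";
untie %count;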
 
mjcoyne commented:
Can you work in uniq's -c switch?

-c, --count
Print each line once, prefixed by the number of occurrences.

If you output whatever uniq -c finds in the sorted temp.log to a file, you have each user ID preceded by the number of times it appeared, and (as before) the number of lines in that file equals the number of unique IDs found, so you could feed *this* file into wc -l.
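
Put together, one pass over the uniq -c output can give both numbers at once - a small sketch in the spirit of ozo's snippet above, for illustration only:

open(U, "sort temp.log | uniq -c |") or die "sort/uniq: $!";
my ($total, %hist);
while (<U>) {
    my ($hits) = split;    # first field of "uniq -c" output is the count
    $total++;              # each line of output is one distinct ID
    $hist{$hits}++;
}
close U;

print "There are $total user IDs in the log file\n";
printf "%12d| %d\n", $hist{$_}, $_ for sort { $a <=> $b } keys %hist;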
 
Thomas Qvidahl (Developer) commented:
Would it be possible to import the whole dataset into a database like, say, MySQL and do the magic there? Isn't a DB more efficient at these things?
I remember we did a project at my previous job (for fun, we're nerds, right? :)) where we collected about 1 hour's worth of logs from a PBX - *only* 4 million lines, but we did some pretty interesting stuff in seconds, on just an old Compaq Proliant dual Pentium Pro server with Linux and MySQL... Personally I *HATE* large text files and try to db-ify them every chance I get.. ;-)
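
In that spirit, a minimal DBI sketch (purely illustrative - it assumes a hypothetical MySQL table log_ids(id CHAR(24)) already loaded from the extracted 6th column, e.g. via LOAD DATA INFILE, and placeholder connection credentials):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:mysql:database=logs", "user", "password",
                       { RaiseError => 1 });

# Distinct IDs
my ($distinct) = $dbh->selectrow_array("SELECT COUNT(DISTINCT id) FROM log_ids");
print "There are $distinct user IDs in the log file\n";

# How many IDs were seen exactly N times
my $rows = $dbh->selectall_arrayref(
    "SELECT hits, COUNT(*) FROM
       (SELECT id, COUNT(*) AS hits FROM log_ids GROUP BY id) AS t
     GROUP BY hits ORDER BY hits");
printf "%12d| %d\n", $_->[1], $_->[0] for @$rows;

$dbh->disconnect;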

snow
 
Computer101 commented:
Forced accept.

Computer101
EE Admin