Fastest way to count distinct strings from plain text log files

Posted on 2007-07-26
Last Modified: 2013-12-07
We are using Perl to process certain statistics from plain-text log files (ordered by timestamp, not ID) which have more than 50 million rows per day. I know that Perl is maybe not the right tool for this job, but it works fine and we haven't had any problems with it.
One thing it also does (in cooperation with Linux shell programs) is count distinct occurrences of the "ID", which is a 24-character string. It does that using the following algorithm:

open(TMP, '>', 'temp.log') or die "temp.log: $!";
while (<F>) {
  @data = split(/\t/, $_);    # logfile is tab-separated
  print TMP $data[5] . "\n";  # ID is in the 6th column
}
close TMP;

$unique = `sort temp.log | uniq | wc -l`;
print "There are " . $unique . " user IDs in the log file";

... anyway - it may be a weird solution but it is the fastest I came up with... Any solution using Perl's hash or similar just produced a memory leak after too much data was inserted (again - I'm talking about 50 million logfile lines which contain about 20 million distinct IDs).

So my first question is - does anybody have a solution which is perhaps faster and maybe more elegant than this?

And the next question is - does anybody have any idea how to write a script which would also count the number of IDs with the same number of hits, producing output like:

no. of IDs  |  no. of hits
    138491  |  1
      3890  |  2
       834  |  3
       524  |  4

(this means that 138491 distinct IDs are logged only once, 3890 distinct IDs are logged twice, etc.)

I know how to write a script like this using Perl's hash, so I am not looking for that kind of solution because it is too slow, produces a memory leak, and cannot handle that amount of data. So if anybody has a better solution, please help me! :)

Question by:jakac
    LVL 84

    Accepted Solution

    Do you really mean a memory leak, or do you mean a memory overflow?
    If it's a memory leak, you may want to let the Perl developers know about it.

    open U, "sort temp.log | uniq -c |" or die $!;
    $n{ (split)[0] }++ while <U>;   # first field of "uniq -c" output is the count
    close U;
    for ( sort { $a <=> $b } keys %n ) {
       printf "%12d| %d\n", $n{$_}, $_;
    }
    LVL 1

    Author Comment

    Sorry ... I meant memory overflow...
    Thanks for the solution to the second question... I guess it's OK (still have to test it though)... Is there maybe another way to count distinct occurrences of a certain string, or only by doing sort | uniq?
    LVL 48

    Expert Comment

    You could do it in PHP like this:

    if ( ($fp = fopen('temp.log', 'r')) ) {
       while (!feof($fp)) ++$data[trim(fgets($fp, 10000))];
       fclose($fp);
    }
    Does not need a lot of memory and you don't need to sort the file.
    LVL 1

    Author Comment

    Hernst42: well, isn't this the same thing as counting in Perl using a hash? If you put about 20,000,000 different keys into the $data[] array it will also eat up all of your computer's RAM and eventually crash, don't you think? I tried the same thing using Perl on a computer with 2 GB of RAM and the whole thing took forever; the computer almost died while executing the script...

    The script you provided is the same as Perl's:

    while (<F>) {
        $data{ (split /\t/)[5] }++;
    }
    print "Distinct count: " . keys(%data);

    This script would be great for a small amount of data, but with a huge data set there is no way...

    I wonder how the software which counts, for example, Google's number of distinct search keywords is written? For the needs of AdWords they also have statistics that tell them how often a single keyword was used during a whole month, and we are talking about billions of keywords. I understand that there is also a huge cluster of hardware behind all that, but still - how is that software written? I guess not with a PHP array or a Perl hash... I guess that no other programming language is capable of storing that amount of data in memory either, without increasing the computer's load dramatically...

    If it were possible, I would also write a program which would break the big file into pieces so that separate computers could easily calculate each single piece - but this wouldn't be possible without sorting it first, because different computers would count the same ID multiple times...
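    Splitting without a global sort is actually possible if the pieces are cut by a deterministic hash of the ID rather than by position in the file: the same ID then always lands in the same piece, so per-piece distinct counts can simply be summed. A minimal sketch (the file name `logfile.tsv`, the column index, and the bucket count are assumptions for illustration):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Route each ID to one of N bucket files by a deterministic hash of the ID,
    # so every occurrence of a given ID ends up in the same bucket. Each bucket
    # can then be counted independently (even on a different machine).
    my $buckets = 8;
    my @fh;
    for my $i (0 .. $buckets - 1) {
        open $fh[$i], '>', "bucket.$i" or die "bucket.$i: $!";
    }
    open my $in, '<', 'logfile.tsv' or die "logfile.tsv: $!";
    while (<$in>) {
        my $id = (split /\t/)[5];                  # ID is in the 6th column
        my $b  = unpack('%32C*', $id) % $buckets;  # cheap deterministic checksum
        print { $fh[$b] } "$id\n";
    }
    close $_ for $in, @fh;
    # The buckets hold disjoint sets of IDs, so the sum of the per-bucket
    # distinct counts equals the global distinct count.
    ```

    Because the buckets are disjoint, no ID is ever counted twice across machines, and no pre-sorting of the big file is needed.
    
    
    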

    And the point is that right now I am dealing with about 50 million log file rows per day - two years ago I was working on "only" about 30 million per day, and as the usage increases, in a couple of years I will be dealing with 100 million - what about then? The hardware is not a problem because I can get more computers if needed, but how do I write a program that will actually take full advantage of all that hardware?
    LVL 39

    Assisted Solution

    Is there a reason you don't like using sort and uniq?
    If you sort the file first, you could then process it in perl without using uniq.
    If you don't want to use sort or uniq, I think you would have to process the file in multiple passes (eg: first look for IDs 0-1000, save those results, then look for IDs 1001-2000, save those results.....)
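    The multiple-pass idea can be sketched like this: on each pass, only IDs whose first character falls in the current slice are kept in the hash, so memory stays bounded at roughly 1/3 of the full key set per pass. The slice sets and the file name `logfile.tsv` are assumptions for illustration:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count distinct IDs in several passes: each pass only hashes IDs whose
    # first character is in the current slice, trading extra file reads for
    # a much smaller in-memory hash.
    my $total = 0;
    for my $slice ( ['0' .. '9'], ['a' .. 'm'], ['n' .. 'z'] ) {
        my %want = map { $_ => 1 } @$slice;
        my %seen;
        open my $in, '<', 'logfile.tsv' or die "logfile.tsv: $!";
        while (<$in>) {
            my $id = (split /\t/)[5];              # ID is in the 6th column
            $seen{$id} = 1 if $want{ lc substr($id, 0, 1) };
        }
        close $in;
        $total += keys %seen;
    }
    print "Distinct IDs: $total\n";
    ```

    The passes partition the ID space, so the per-pass distinct counts can be added without double-counting; the cost is reading the whole file once per slice.
    
    
    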

    LVL 1

    Author Comment

    Adam314: well, the reason is that if you sort a 1-2 GB file you have to do it using temporary files; it takes a lot of time, massive load, and some disk space. So first I have to extract the 6th column into one temporary file, then sort walks that file and its own temporary files many times, and after that I have to walk the sorted file once again to calculate the final number - which is just one single integer, but it takes an hour to get to it...

    I am not saying that this solution is bad... It is slow but still faster than any other I tried before (like hashes)...

    I am just asking if there is any other, better, more elegant, faster solution? It doesn't need to be in Perl - I guess C / C++ or maybe some other language would do it even faster... My programming knowledge is unfortunately limited to Perl & PHP so I wouldn't know, but in my experience there is always some better solution, even when you think you've found the best one...
    LVL 48

    Assisted Solution

    For such a huge amount, you could split up the work:
    read e.g. 1,000,000 rows, then discard all hash entries which are below a certain limit (e.g. 5), or keep only the top 100.

    Then process the next 1,000,000 rows. At the end, combine all the results into one. This might not be 100% accurate, but it lets you parallelize the job, you shouldn't need much memory, and you are finished with one pass through the lines.
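    The chunk-and-prune idea above could look like this in Perl. The chunk size, the prune threshold, and the file name `logfile.tsv` are assumptions for illustration; note that an ID pruned early loses its partial count, so the result is approximate:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count ID frequencies in one pass, pruning rare IDs every million rows
    # to keep the in-memory hash bounded. Frequent IDs survive the pruning;
    # counts for rare IDs are approximate or lost.
    my (%count, $rows);
    open my $in, '<', 'logfile.tsv' or die "logfile.tsv: $!";
    while (<$in>) {
        $count{ (split /\t/)[5] }++;           # ID is in the 6th column
        if (++$rows % 1_000_000 == 0) {
            # prune: drop every ID seen fewer than 5 times so far
            delete @count{ grep { $count{$_} < 5 } keys %count };
        }
    }
    close $in;
    # %count now holds (approximately) the frequent IDs and their hit counts.
    ```

    This trades exactness for a single pass and bounded memory - fine for "top talkers" statistics, but not for an exact distinct count.
    
    
    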
    LVL 84

    Expert Comment

    unix sort is written to run efficiently with large external files, but can waste I/O on multiple copies of the same line.
    A hash does not keep multiple copies, but is not written to run efficiently with large external files.
    So how about something that works like a hash but is written to use a large external file, like DBI?
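    One way to sketch that disk-backed-hash idea is with DB_File, which ties a Perl hash to a Berkeley DB B-tree on disk, so the working set no longer has to fit in RAM (the file names are assumptions for illustration, and DB_File requires Berkeley DB to be installed):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;   # tie a hash to an on-disk Berkeley DB B-tree
    use Fcntl;     # O_RDWR, O_CREAT

    # %count looks like an ordinary hash, but its keys and values live in
    # counts.db on disk, so 20 million distinct IDs won't exhaust RAM.
    my %count;
    tie %count, 'DB_File', 'counts.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "counts.db: $!";

    open my $in, '<', 'logfile.tsv' or die "logfile.tsv: $!";
    $count{ (split /\t/)[5] }++ while <$in>;   # ID is in the 6th column
    close $in;

    print "Distinct IDs: ", scalar(keys %count), "\n";
    untie %count;
    ```

    It is slower per lookup than an in-memory hash, but it reads the log in a single pass and the memory footprint stays small regardless of how many distinct IDs there are.
    
    
    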
    LVL 17

    Assisted Solution

    Can you work in uniq's -c switch?

    -c, --count
        Prefix each output line with the number of occurrences.

    If you output whatever uniq -c finds in the sorted temp.log to a file, you have each user ID preceded by the number of times it appeared, with (as before) the number of lines in the file equal to the number of unique IDs found - so you could feed *this* file into wc -l.
    LVL 3

    Assisted Solution

    by:Thomas Qvidahl
    Would it be possible to import the whole dataset into a database like, say, MySQL and do the magic there? Isn't a DB more efficient at these matters?
    I remember we did a project at my previous job (for fun, we're nerds, right? :)) where we collected about 1 hour's worth of logs from a PBX - *only* 4 million lines, but we did some pretty interesting stuff in seconds! And that was on just an old Compaq Proliant dual Pentium Pro server with Linux and MySQL... Personally I *HATE* large txt files, and try to db-ify them every chance I get.. ;-)

    LVL 1

    Expert Comment

    Forced accept.

    EE Admin
