radix655

Which is faster? grep pattern awk or cat awk pattern?

The file is very big: 107 GB.

Which is faster?

grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk

(or)

cat $file | awk -f getcount.awk

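# inside getcount.awk

{
    var = $0
    ENDED = 1
    while (ENDED == 1) {    # compare with ==; a single = assigns and loops forever
        if (var ~ /PENDINGACCEPT/ || var ~ /INSERTED/) {
            # DO SOMETHING
        }
        ENDED = getline var    # 1 while lines remain, 0 at end of input
    }
}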

Also how to store unique values in a list in awk?

1) Get the line.
2) Get the value from it.
I need to store the value in a list if it is not already present.

Please suggest.



nicerocko

I don't really know which one is faster, but the time command will let you find out.

time grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk

time cat $file | awk -f getcount.awk

Compare the results.
From an algorithmic point of view,

grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk
  checks the matched lines twice (once in grep and again in awk), so it should take longer.
ozo
why not
awk -f getcount.awk $file
By the way radix655, it may not make much difference to speed, but instead of writing:
    cat $file | awk ...
I would usually write:
    awk ... $file
No need for cat or pipe.  More concise and simpler.

As nicerocko suggests, test the options with the "time" command, but you might like to test it on a smaller file, if running it on the whole file would be too time consuming (or pointless).
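A minimal sketch of that small-sample test, assuming bash (where time is a shell keyword that covers the whole pipeline). The helper names make_sample and compare_timings are hypothetical, and the awk one-liners are only stand-ins for getcount.awk:

```shell
# make_sample FILE N: copy the first N lines of FILE to a temp file and print its path
make_sample() {
    sample=$(mktemp) || return 1
    head -n "$2" "$1" > "$sample"
    printf '%s\n' "$sample"
}

# compare_timings FILE: time both approaches on FILE (timings are written to stderr)
compare_timings() {
    # grep pre-filters, awk only sees the matching lines
    time grep -E 'INSERTED|PENDINGACCEPT' "$1" | awk 'END { print NR }' > /dev/null
    # awk alone reads the file and does the filtering itself
    time awk '/INSERTED|PENDINGACCEPT/ { n++ } END { print n + 0 }' "$1" > /dev/null
}
```

Usage would be something like: sample=$(make_sample "$file" 1000000); compare_timings "$sample"; rm -f "$sample". Run each a few times, since the first run also pays for reading the file from disk.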

I don't know much about awk values, but if you want to create a unique list of values outside of awk, look at UNIX/Linux's "sort -u" or possibly "uniq" commands.  Or within awk, you could possibly use a hash.
ozo beat me to it.
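For the unique-list part, here is a rough sketch of the hash idea using an awk associative array. The input layout (one key=value pair per line) and the function name unique_values are assumptions for illustration, since the real log format is not shown here:

```shell
# unique_values: read "key=value" lines on stdin, print each distinct value once
unique_values() {
    awk -F'=' '
    {
        v = $2
        if (!(v in seen)) {      # first time this value shows up
            seen[v] = 1          # remember it in the hash
            list[++n] = v        # and append it to an ordered list
        }
    }
    END { for (i = 1; i <= n; i++) print list[i] }'
}

printf 'id=a\nid=b\nid=a\nid=c\nid=b\n' | unique_values
```

The seen array answers "is it already present?" and the list array keeps first-seen order. Outside awk, piping the values through sort -u gives the same set, but sorted rather than in first-seen order.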
SOLUTION
Michael Eager
radix655

ASKER

I will post the timings soon. It looks like awk is a little bit faster. I am running it again to verify and will post the timings. Thanks for your insight.
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  ENDED=getline var; } } '

real    0m0.775s
user    0m0.780s
sys     0m0.011s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  ENDED=getline var; } } '

real    0m0.756s
user    0m0.762s
sys     0m0.017s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  ENDED=getline var; } } '

real    0m0.759s
user    0m0.767s
sys     0m0.013s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.803s
user    0m0.797s
sys     0m0.006s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.795s
user    0m0.790s
sys     0m0.004s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.804s
user    0m0.798s
sys     0m0.007s


Do these results mean that grep is faster? The sys time seems to be lower for gawk.

I am currently testing on the big file.
Also, I found out that extracting the variables individually is about twice as fast as extracting them all at once. I have no clue why. Can you tell me why?

bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } '

real    0m0.387s
user    0m0.394s
sys     0m0.007s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); ENDED=getline var; } } '

real    0m0.399s
user    0m0.413s
sys     0m0.011s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); ENDED=getline var; } } '

real    0m0.390s
user    0m0.402s
sys     0m0.013s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.416s
user    0m0.412s
sys     0m0.005s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.416s
user    0m0.409s
sys     0m0.007s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.418s
user    0m0.414s
sys     0m0.004s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.793s
user    0m0.786s
sys     0m0.007s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.798s
user    0m0.793s
sys     0m0.005s
If you are using a small data file, your test results may be skewed by the overhead involved with forking two processes and piping data from grep to awk.  My guess is that with such a short execution time, your sample data set is so much smaller than the real data set that your results are not likely to be applicable.  

The result may depend on your data.  If you are searching a large data file for only a few occurrences of a pattern, grep will select only a small amount of data and awk will only have to process a few lines.  Alternatively, if many lines match the pattern, you will be processing almost the entire data set with both grep and awk, which will be slower.  In the first case, grep piped to awk will be faster; in the second, running awk alone will likely be faster.
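One way to see which case applies is to measure what fraction of lines match on a sample of the file. This is only a sketch; match_ratio is a hypothetical helper name, and the pattern is the one from the question:

```shell
# match_ratio FILE: print the fraction of lines matching the pattern (0.00 to 1.00)
match_ratio() {
    awk '
    /INSERTED|PENDINGACCEPT/ { m++ }      # count matching lines
    END {
        if (NR) printf "%.2f\n", m / NR   # NR is the total number of lines read
        else    print "0.00"              # empty input
    }' "$1"
}
```

A ratio near 0 suggests the grep pre-filter saves awk a lot of work; a ratio near 1 suggests grep mostly adds a second pass over the same data.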
@wesly_chen & @eager:

Thank you. Why is match faster when obtaining the variables individually? I thought that scanning the line only once should be faster?
@eager

I think you are right regarding the RE.


bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([A-Za-z0-9]*)/, src);   match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real    0m0.379s
user    0m0.375s
sys     0m0.018s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([A-Za-z0-9]*)/, src);   match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real    0m0.377s
user    0m0.386s
sys     0m0.014s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([A-Za-z0-9]*)/, src);   match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real    0m0.375s
user    0m0.380s
sys     0m0.010s


The time drops even further if I simplify the RE.
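Taking the "simpler RE" idea one step further, the fixed tags could be located with plain substring search (index and substr) instead of a regex. This swaps in a different technique from the match() calls above, and it assumes each value runs from "tag=" up to the next "]", which is a guess based on the patterns used earlier. extract_fields and field are hypothetical names:

```shell
# extract_fields: for matching lines on stdin, print Source, ParentID and SourceSystem
extract_fields() {
    awk '
    # field(line, tag): text between "tag=" and the next "]" (i, j, rest are locals)
    function field(line, tag,    i, j, rest) {
        i = index(line, tag "=")
        if (i == 0) return ""                        # tag not present on this line
        rest = substr(line, i + length(tag) + 1)     # everything after "tag="
        j = index(rest, "]")
        return (j > 0) ? substr(rest, 1, j - 1) : rest
    }
    /INSERTED|PENDINGACCEPT/ {
        print field($0, "Source<25102>"), field($0, "ParentID<25101>"), field($0, "SourceSystem<5177>")
    }'
}
```

Whether this actually beats the simplified regexes would need to be timed on the real log; it is portable to any POSIX awk because it does not rely on gawk's three-argument match().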

@eager
good catch
Thank you @eager and @wesly_chen.
Thanks, glad the time command helped you.