Which is faster: grep pattern | awk, or cat | awk pattern?

radix655
The file is very large: 107 GB.

Which is faster?

grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk

(or)

cat $file | awk -f getcount.awk

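# inside getcount.awk

{
    var = $0
    ENDED = 1
    while (ENDED == 1) {    # == compares; a single = would assign and loop forever
        if (var ~ /PENDINGACCEPT/ || var ~ /INSERTED/) {
            # DO SOMETHING
        }
        ENDED = (getline var)    # 1 on success, 0 at end of input
    }
}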

Also, how can I store unique values in a list in awk?

1) Get the line.
2) Get the value from it.
3) Store the value in a list, but only if it is not already present.

Please suggest.



I don't really know which one is faster, but the time command will let you find out.

time grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk

time cat $file | awk -f getcount.awk

Compare the results.
Top Expert 2011

Commented:
From an algorithmic point of view,

grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk
  checks the matched lines twice (once in grep and again in awk), so it should take longer.
ozo
Most Valuable Expert 2014
Top Expert 2015

Commented:
why not
awk -f getcount.awk $file
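
Taking this suggestion one step further, the explicit while/getline loop may not be needed at all, because awk already reads its input line by line and can apply the pattern itself. A minimal sketch, assuming the goal is just to count matching lines (the real body of getcount.awk is not shown in the question, and the sample file contents here are made up):

```shell
# Create a tiny stand-in for the real log (hypothetical contents).
sample=$(mktemp)
printf '%s\n' \
  'a INSERTED x' \
  'b IGNORED y' \
  'c PENDINGACCEPT z' > "$sample"

# awk reads the file itself: no cat, no grep, no getline loop.
# The pattern selects lines; the action runs once per matching line.
count=$(awk '/INSERTED|PENDINGACCEPT/ { n++ } END { print n + 0 }' "$sample")
echo "$count"

rm -f "$sample"
```

Whether this is faster than grep piped into awk on the 107 GB file still has to be measured; the sketch only shows the shape of the script.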

By the way radix655, it may not make much difference to speed, but instead of writing:
    cat $file | awk ...
I would usually write:
    awk ... $file
No need for cat or pipe.  More concise and simpler.

As nicerocko suggests, test the options with the "time" command, but you might like to test it on a smaller file, if running it on the whole file would be too time consuming (or pointless).

I don't know much about awk values, but if you want to create a unique list of values outside of awk, look at UNIX/Linux's "sort -u" or possibly "uniq" commands.  Or within awk, you could possibly use a hash.
ozo beat me to it.
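
For the unique-values part of the question, the hash idea mentioned above could look roughly like this. It is only a sketch: the input values are invented, and the array names seen and order are arbitrary choices, not anything from getcount.awk.

```shell
# Hypothetical input: one value per line, with repeats.
unique=$(printf '%s\n' alpha beta alpha gamma beta |
  awk '
    # seen[] is an associative array keyed by the value.
    # !seen[$1]++ is true only the first time a value appears.
    !seen[$1]++ { order[++n] = $1 }
    # Print the unique values in first-seen order.
    END { for (i = 1; i <= n; i++) print order[i] }
  ')
echo "$unique"
```

The separate order[] array is there because awk does not guarantee any particular order when iterating over an associative array with "for (k in seen)".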
Michael Eager, Consultant
Commented:
I would expect that grep followed by running awk on the results should be faster.   Grep is written in C and is optimized for searching for patterns, while awk is more general purpose and uses an interpreter to execute the awk script.  

Author

Commented:
I will post the timings soon. It looks like awk is a little bit faster. I am running it again to verify and will post the timings. Thanks for your insight.

Author

Commented:
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  ENDED=getline var; } } '

real    0m0.775s
user    0m0.780s
sys     0m0.011s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  ENDED=getline var; } } '

real    0m0.756s
user    0m0.762s
sys     0m0.017s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  ENDED=getline var; } } '

real    0m0.759s
user    0m0.767s
sys     0m0.013s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.803s
user    0m0.797s
sys     0m0.006s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.795s
user    0m0.790s
sys     0m0.004s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.804s
user    0m0.798s
sys     0m0.007s


Do these results mean that grep is faster? The sys time seems to be lower for gawk alone.

I am currently testing on the big file.

Author

Commented:
Also, I found that extracting the variables individually is twice as fast as extracting them all at once. I have no clue why. Please tell me why.

bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } '

real    0m0.387s
user    0m0.394s
sys     0m0.007s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); ENDED=getline var; } } '

real    0m0.399s
user    0m0.413s
sys     0m0.011s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); ENDED=getline var; } } '

real    0m0.390s
user    0m0.402s
sys     0m0.013s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.416s
user    0m0.412s
sys     0m0.005s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.416s
user    0m0.409s
sys     0m0.007s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\]/, src);   match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.418s
user    0m0.414s
sys     0m0.004s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.793s
user    0m0.786s
sys     0m0.007s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr);  } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real    0m0.798s
user    0m0.793s
sys     0m0.005s
Michael Eager, Consultant

Commented:
If you are using a small data file, your test results may be skewed by the overhead involved with forking two processes and piping data from grep to awk.  My guess is that with such a short execution time, your sample data set is so much smaller than the real data set that your results are not likely to be applicable.  

The result may depend on your data. If you are searching a large data file for only a few occurrences of a pattern, grep will select only a small amount of data and awk will have to process only a few lines. Alternatively, if many lines match the pattern, you will be processing almost the entire data set with both grep and awk, which will be slower. In the first case, grep piped to awk will be faster; in the second, running awk alone will likely be faster.
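
One rough way to see which of these two cases applies is to check what fraction of lines match before deciding. The sketch below uses a made-up five-line sample; on the real log the same two commands would be run against the actual file (or a slice of it).

```shell
# Tiny stand-in for the real log (hypothetical contents).
sample=$(mktemp)
printf '%s\n' \
  'x INSERTED' 'x OTHER' 'x OTHER' 'x PENDINGACCEPT' 'x OTHER' > "$sample"

# Total number of lines; tr strips the padding some wc versions add.
total=$(wc -l < "$sample" | tr -d ' ')
# Number of lines grep would pass on to awk.
matched=$(grep -c -E 'INSERTED|PENDINGACCEPT' "$sample")
# Integer percentage of matching lines.
percent=$(( matched * 100 / total ))
echo "$matched of $total lines match (${percent}%)"

rm -f "$sample"
```

A low percentage points toward the grep-then-awk case described above; a high one suggests the extra process and pipe buy little.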

Author

Commented:
@wesly_chen & @eager:

Thank you. Why is match faster when obtaining the variables individually? I thought that scanning the line only once should be faster?
Consultant
Commented:
It's going to depend on the internals of the regular expression processor in awk.  It may be that two simpler pattern matches run faster than a single match with a more complex pattern.  Patterns with *'s can result in a large amount of backtracking.  I don't have any real familiarity with the RE processing in awk, so this is just a guess.  
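
To make the "several simpler matches" idea concrete, here is a small sketch that pulls each field out with its own short pattern. Assumptions: the sample line is invented from the field names in the commands above, the helper name field is hypothetical, and it uses POSIX RSTART/RLENGTH rather than gawk's three-argument match() so that it runs on any awk.

```shell
# Hypothetical log line built from the field names used in the thread.
line='[Source<25102>=ABC][ParentID<25101>=P-1][SourceSystem<5177>=SYS]'

fields=$(printf '%s\n' "$line" | awk '
  # Return the text between "name=" and the next "]", or "" if absent.
  function field(text, name,    s) {
    if (match(text, name "=[^]]*") == 0) return ""
    s = substr(text, RSTART, RLENGTH)
    return substr(s, index(s, "=") + 1)
  }
  # One short, anchored-by-name pattern per field, with no ".*" in between.
  { print field($0, "Source<25102>"), field($0, "ParentID<25101>"), field($0, "SourceSystem<5177>") }
')
echo "$fields"
```

Each pattern here has nothing to backtrack over between fields, which is one plausible reason the separate matches timed faster; that explanation remains a guess, as noted above.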
Top Expert 2011
Commented:
It depends on the search algorithms used by awk and grep.
It seems grep uses a better search algorithm.

Logically, scanning through once should be faster. However, the algorithms used to scan and search the file are different and, in this case, that makes a big difference, so grep is faster than awk.

Author

Commented:
@eager

I think you are right regarding the RE.


bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([A-Za-z0-9]*)/, src);   match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real    0m0.379s
user    0m0.375s
sys     0m0.018s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([A-Za-z0-9]*)/, src);   match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real    0m0.377s
user    0m0.386s
sys     0m0.014s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) {  match(var, /Source<25102>=([A-Za-z0-9]*)/, src);   match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real    0m0.375s
user    0m0.380s
sys     0m0.010s


The time drops even further if I simplify the RE.

Top Expert 2011

Commented:
@eager
good catch

Author

Commented:
Thank you @eager and @wesly_chen.
Thanks. Glad the time command helped you.
