
Which is faster? grep pattern awk or cat awk pattern?

The file is very big: 107G.

Which is faster?

grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk

(or)

cat $file | awk -f getcount.awk

# inside getcount.awk

{
    var = $0
    ENDED = 1
    # getline returns 1 while there is another line to read
    while (ENDED == 1) {
        if (var ~ /PENDINGACCEPT/ || var ~ /INSERTED/) {
            # DO SOMETHING
        }
        ENDED = getline var
    }
}

Also, how do I store unique values in a list in awk?

1) Get the line.
2) Get the value.
3) Store the value in a list, if the value is not already present.

Please suggest.

nicerocko

I don't really know which one is faster, but with the time command you will be able to figure it out.

time grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk

time cat $file | awk -f getcount.awk

Compare the results.

wesly_chen

From an algorithmic point of view,

grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk
checks the matched lines twice (once in grep, once in awk), so it should take longer.

ozo

why not
awk -f getcount.awk $file

tel2

By the way radix655, it may not make much difference to speed, but instead of writing:
cat $file | awk ...
I would usually write:
awk ... $file
No need for cat or pipe. More concise and simpler.

As nicerocko suggests, test the options with the "time" command, but you might like to test it on a smaller file, if running it on the whole file would be too time consuming (or pointless).
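
A rough sketch of how such a smaller test could be set up. The file names big.log and sample.log and the generated lines are made up for illustration; on the real data you would point head at the actual log. It also checks that both variants report the same count, so that any later timing comparison is between equivalent commands.

```shell
# big.log is a hypothetical stand-in for the real 107G file.
awk 'BEGIN {
    for (i = 1; i <= 50000; i++) {
        tag = (i % 5 == 0) ? "INSERTED" : "OTHER"
        print tag, i
    }
}' > big.log

# Take the first 10000 lines as the test sample.
head -n 10000 big.log > sample.log

# Both variants should report the same count before their timings are compared.
via_grep=$(grep -E 'INSERTED|PENDINGACCEPT' sample.log | awk 'END { print NR }')
via_awk=$(awk '/INSERTED|PENDINGACCEPT/ { n++ } END { print n + 0 }' sample.log)
echo "grep | awk: $via_grep, awk only: $via_awk"
```

Once the counts agree, each of the two commands can be prefixed with time in bash, as nicerocko showed.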

I don't know much about awk values, but if you want to create a unique list of values outside of awk, look at UNIX/Linux's "sort -u" or possibly "uniq" commands. Or within awk, you could possibly use a hash.
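
For the "hash" idea, here is a minimal sketch using an awk associative array. The sample lines and the array name seen are assumptions for illustration only; the real log format may need a different pattern to pull out the value.

```shell
# Hypothetical sample input standing in for the real log.
printf '%s\n' 'Source<25102>=ABC]' 'Source<25102>=XYZ]' 'Source<25102>=ABC]' > sample.log

# seen[] is an associative array used as a set:
# a value is stored and printed only the first time it appears.
unique=$(awk '
    match($0, /=[^]]*\]/) {
        val = substr($0, RSTART + 1, RLENGTH - 2)   # text between "=" and "]"
        if (!(val in seen)) {
            seen[val] = 1
            print val
        }
    }
' sample.log)
echo "$unique"
```

The test (val in seen) checks membership without creating the entry, so the array only ever holds values that were actually read.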

tel2

ozo beat me to it.

SOLUTION

Michael Eager

radix655

ASKER

I will post the timings soon. It looks like awk is a little bit faster. I am running it again to verify and will post the timings. Thanks for your insight.

radix655

ASKER

bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); ENDED=getline var; } } '

real 0m0.775s
user 0m0.780s
sys 0m0.011s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); ENDED=getline var; } } '

real 0m0.756s
user 0m0.762s
sys 0m0.017s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); ENDED=getline var; } } '

real 0m0.759s
user 0m0.767s
sys 0m0.013s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real 0m0.803s
user 0m0.797s
sys 0m0.006s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real 0m0.795s
user 0m0.790s
sys 0m0.004s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real 0m0.804s
user 0m0.798s
sys 0m0.007s

Do these results mean that grep is faster? The sys time seems to be lower for gawk.

I am currently testing on the big file.

radix655

ASKER

Also, I found out that getting the variables individually is twice as fast as getting them all at once. I have no clue why. Please tell me why.

bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } '

real 0m0.387s
user 0m0.394s
sys 0m0.007s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); ENDED=getline var; } } '

real 0m0.399s
user 0m0.413s
sys 0m0.011s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); ENDED=getline var; } } '

real 0m0.390s
user 0m0.402s
sys 0m0.013s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real 0m0.416s
user 0m0.412s
sys 0m0.005s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real 0m0.416s
user 0m0.409s
sys 0m0.007s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system ); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real 0m0.418s
user 0m0.414s
sys 0m0.004s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real 0m0.793s
user 0m0.786s
sys 0m0.007s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log

real 0m0.798s
user 0m0.793s
sys 0m0.005s

wesly_chen

eager has an answer already.
https://www.experts-exchange.com/questions/27244960/Which-is-faster-grep-pattern-awk-or-cat-awk-pattern.html?cid=1572&anchorAnswerId=36337067#a36337067

Michael Eager

If you are using a small data file, your test results may be skewed by the overhead involved with forking two processes and piping data from grep to awk. My guess is that with such a short execution time, your sample data set is so much smaller than the real data set that your results are not likely to be applicable.

The result may depend on your data. If you are searching through a large data file for only a few occurrences of a pattern, grep will select only a small amount of data and awk will only have to process a few lines. Alternatively, if many lines match the pattern, you will essentially be processing almost the entire data set with both grep and awk, which will be slower. In the first case, grep piped to awk will be faster; in the second, running awk alone will likely be faster.
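
To see which of those two cases applies, one option is to measure what fraction of lines match on a sample. This is only a sketch; sample.log and its contents are made up here, and on the real data the sample would come from the actual log.

```shell
# Hypothetical four-line sample; two of the lines match.
printf '%s\n' 'a INSERTED' 'b OTHER' 'c PENDINGACCEPT' 'd OTHER' > sample.log

# One pass: m counts matching lines, NR is the total line count at END.
ratio=$(awk '
    /INSERTED|PENDINGACCEPT/ { m++ }
    END { printf "%d of %d lines match (%.0f%%)\n", m, NR, 100 * m / NR }
' sample.log)
echo "$ratio"
# prints: 2 of 4 lines match (50%)
```

A low percentage suggests the grep pre-filter discards most of the input before awk sees it; a high one suggests most lines are handled by both tools.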

radix655

ASKER

@wesly_chen & @eager:

Thank you. Why is match faster when obtaining the variables individually? I thought that scanning the line only once should be faster?

ASKER CERTIFIED SOLUTION

Michael Eager

SOLUTION

wesly_chen

radix655

ASKER

@eager

I think you are right regarding the RE.

bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([A-Za-z0-9]*)/, src); match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real 0m0.379s
user 0m0.375s
sys 0m0.018s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([A-Za-z0-9]*)/, src); match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real 0m0.377s
user 0m0.386s
sys 0m0.014s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([A-Za-z0-9]*)/, src); match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system ); } ENDED=getline var; } } '

real 0m0.375s
user 0m0.380s
sys 0m0.010s

The time drops even further if I simplify the RE.
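
Taking the simplification one step further, here is a sketch that pulls out the same three values with index() and substr(), i.e. fixed-string search instead of a regular expression. This is a different technique from the match() calls in the runs above, the helper name field and the sample line are made up, and whether it is actually faster on the real log would need to be timed.

```shell
# Hypothetical log line; the real format may differ.
line='[Source<25102>=CAM][ParentID<25101>=P-42][SourceSystem<5177>=SYS1] INSERTED'

fields=$(printf '%s\n' "$line" | awk '
    # Return the text that follows "key=" up to the next "]" (or "" if key is absent).
    function field(s, key,    start, rest, stop) {
        start = index(s, key "=")
        if (start == 0) return ""
        rest = substr(s, start + length(key) + 1)
        stop = index(rest, "]")
        return stop ? substr(rest, 1, stop - 1) : rest
    }
    /INSERTED|PENDINGACCEPT/ {
        print field($0, "Source<25102>"), field($0, "ParentID<25101>"), field($0, "SourceSystem<5177>")
    }
')
echo "$fields"
# prints: CAM P-42 SYS1
```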

wesly_chen

@eager
good catch

radix655

ASKER

Thank you @eager and @wesly_chen.

nicerocko

Thanks. Glad the time command helped you.