radix655
Which is faster: grep pattern | awk, or cat | awk pattern?
The file is very big (107 GB).
Which is faster?
grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk
(or)
cat $file | awk -f getcount.awk
# inside getcount.awk
{
    var = $0;
    ENDED = 1;
    # getline returns 1 while a line was read, 0 at end of input
    while (ENDED == 1) {
        if (var ~ /PENDINGACCEPT/ || var ~ /INSERTED/) {
            # DO SOMETHING
        }
        ENDED = getline var;
    }
}
Also, how do I store unique values in a list in awk?
1) Get the line.
2) Get the value.
I need to store the value in a list if it is not already present.
Please suggest.
From an algorithmic point of view,
grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk
checks the matched lines twice (once in grep and again in awk), so it should take longer.
Why not:
awk -f getcount.awk $file
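A minimal sketch of what running awk directly on the input could look like, assuming the goal is simply to count matching lines (the thread never shows what getcount.awk actually does). The sample lines and the variable names `sample` and `count` are hypothetical. awk already loops over its input, so a pattern in front of the action does the filtering without an explicit while/getline loop.

```shell
# Hypothetical sample log lines; the real file is far larger.
sample=$(printf '%s\n' \
  'status=INSERTED id=1' \
  'status=REJECTED id=2' \
  'status=PENDINGACCEPT id=3')

# The /regex/ pattern selects lines; the action runs only for those lines.
# "n + 0" forces a numeric 0 when nothing matched.
count=$(printf '%s\n' "$sample" |
  awk '/INSERTED|PENDINGACCEPT/ { n++ } END { print n + 0 }')

echo "$count"
```

With a real file the same program would be run as awk '...' $file, with no cat or grep in front.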
By the way radix655, it may not make much difference to speed, but instead of writing:
cat $file | awk ...
I would usually write:
awk ... $file
No need for cat or pipe. More concise and simpler.
As nicerocko suggests, test the options with the "time" command, but you might like to test it on a smaller file, if running it on the whole file would be too time consuming (or pointless).
I don't know much about awk values, but if you want to create a unique list of values outside of awk, look at the UNIX/Linux "sort -u" command, or possibly "uniq" (which only removes adjacent duplicates, so its input needs to be sorted first). Within awk, you could use an associative array as a hash.
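The hash idea can be sketched as follows, assuming the value of interest is the first field of each line (the thread does not say where the value comes from). The names `values`, `unique`, `seen` and `list` are made up for the illustration.

```shell
# Hypothetical sample input: one value per line, with repeats.
values=$(printf '%s\n' alpha beta alpha gamma beta)

# seen[] is an awk associative array. seen[$1]++ evaluates to 0 the first
# time a value appears, so !seen[$1]++ is true only once per value.
# list[] keeps the unique values in first-seen order.
unique=$(printf '%s\n' "$values" | awk '
!seen[$1]++ { list[++n] = $1 }
END { for (i = 1; i <= n; i++) print list[i] }')

printf '%s\n' "$unique"
```

Keeping a separate numbered list preserves input order; iterating over seen[] with "for (k in seen)" would also give the unique values, but in an unspecified order.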
ozo beat me to it.
ASKER
I will post the timings soon. It looks like awk is a little bit faster. I am running it again to verify and will post the timings. Thanks for your insight.
ASKER
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); ENDED=getline var; } } '
real 0m0.775s
user 0m0.780s
sys 0m0.011s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); ENDED=getline var; } } '
real 0m0.756s
user 0m0.762s
sys 0m0.017s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); ENDED=getline var; } } '
real 0m0.759s
user 0m0.767s
sys 0m0.013s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log
real 0m0.803s
user 0m0.797s
sys 0m0.006s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log
real 0m0.795s
user 0m0.790s
sys 0m0.004s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log
real 0m0.804s
user 0m0.798s
sys 0m0.007s
Do these results mean that grep is faster? The sys time seems to be lower for gawk.
I am currently testing on the big file.
ASKER
Also, I found out that getting the variables individually is twice as fast as getting them all at once. I have no clue why. Please tell me why.
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system); } ENDED=getline var; } } '
real 0m0.387s
user 0m0.394s
sys 0m0.007s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system); ENDED=getline var; } } '
real 0m0.399s
user 0m0.413s
sys 0m0.011s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system); ENDED=getline var; } } '
real 0m0.390s
user 0m0.402s
sys 0m0.013s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log
real 0m0.416s
user 0m0.412s
sys 0m0.005s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log
real 0m0.416s
user 0m0.409s
sys 0m0.007s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\]/, src); match(var, /ParentID<25101>=([^\]]*)\]/, parentId); match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log
real 0m0.418s
user 0m0.414s
sys 0m0.004s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log
real 0m0.793s
user 0m0.786s
sys 0m0.007s
bash-3.2$ time gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\].*SourceSystem<5177>=([^\]]*)\]/, arr); } ENDED=getline var; } } ' cam_verbose.20110803.000.TRIM.log
real 0m0.798s
user 0m0.793s
sys 0m0.005s
If you are using a small data file, your test results may be skewed by the overhead involved with forking two processes and piping data from grep to awk. My guess is that with such a short execution time, your sample data set is so much smaller than the real data set that your results are not likely to be applicable.
The result may depend on your data. If you are searching through a large data file for only a few occurrences of a pattern, grep will select only a small amount of data and awk will only have to process a few lines. Alternatively, if many lines match the pattern, you will essentially be processing almost the entire data set with both grep and awk, which will be slower. In the first case, grep piped to awk will likely be faster; in the second, running awk alone will likely be faster.
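One way to see which case applies is to measure what fraction of lines the pattern selects on a slice of the data. This is only a sketch: the sample lines and the names `sample`, `total`, `matched` and `percent` are hypothetical, and grep -E is used here as a portable spelling of the alternation written as \| elsewhere in the thread.

```shell
# Hypothetical sample; in practice use a slice of the real log,
# for example the output of: head -n 100000 $file
sample=$(printf '%s\n' 'a INSERTED' 'b OTHER' 'c OTHER' 'd PENDINGACCEPT')

# Total number of lines (tr strips the padding some wc versions add).
total=$(printf '%s\n' "$sample" | wc -l | tr -d ' ')

# Number of lines the pattern selects.
matched=$(printf '%s\n' "$sample" | grep -E -c 'INSERTED|PENDINGACCEPT')

# Integer percentage of matching lines.
percent=$(( matched * 100 / total ))
echo "$matched of $total lines match (${percent}%)"
```

A low percentage suggests grep would discard most of the input before awk sees it; a high percentage suggests the pipe mostly adds overhead.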
ASKER
@wesly_chen & @eager:
Thank you. Why is match faster when obtaining the variables individually? I thought that scanning the line only once should be faster?
ASKER
@eager
I think you are right regarding the RE.
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([A-Za-z0-9]*)/, src); match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system); } ENDED=getline var; } } '
real 0m0.379s
user 0m0.375s
sys 0m0.018s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([A-Za-z0-9]*)/, src); match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system); } ENDED=getline var; } } '
real 0m0.377s
user 0m0.386s
sys 0m0.014s
bash-3.2$ time grep 'INSERTED\|PENDINGACCEPT' cam_verbose.20110803.000.TRIM.log | gawk 'BEGIN{var=$0; } {ENDED=1; while(ENDED==1) { if(var ~ /INSERTED/ || var ~ /PENDINGACCEPT/) { match(var, /Source<25102>=([A-Za-z0-9]*)/, src); match(var, /ParentID<25101>=([A-Za-z0-9-]*)/, parentId); match(var, /SourceSystem<5177>=([A-Za-z0-9-]*)/, source_system); } ENDED=getline var; } } '
real 0m0.375s
user 0m0.380s
sys 0m0.010s
The time is reduced even further if I simplify the RE.
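Taking the "simpler is faster" observation one step further, the fields could be pulled out with plain string search instead of regular expressions. The sketch below swaps match() for awk's index() and substr(); it is not what was timed in this thread, so whether it is actually faster would need to be measured. The sample line, the helper function `field` and the variable `fields` are hypothetical, and the log format (each value terminated by "]") is inferred from the expressions used above.

```shell
# Hypothetical log line in the shape the thread's expressions imply.
line='[Source<25102>=FEED1] [ParentID<25101>=ABC-123] [SourceSystem<5177>=SYS9]'

fields=$(printf '%s\n' "$line" | awk '
# Return the text between "key=" and the next "]" (empty if key is absent).
function field(s, key,    start) {
    start = index(s, key "=")
    if (start == 0) return ""
    s = substr(s, start + length(key) + 1)
    return substr(s, 1, index(s, "]") - 1)
}
{
    print field($0, "Source<25102>"), field($0, "ParentID<25101>"), field($0, "SourceSystem<5177>")
}')

echo "$fields"
```

Because index() and substr() are standard awk, this version does not depend on gawk's three-argument match().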
@eager
good catch
ASKER
Thank you @eager and @wesly_chen.
Thanks. Glad the time command helped you.
time grep 'INSERTED\|PENDINGACCEPT' $file | awk -f getcount.awk
time cat $file | awk -f getcount.awk
Compare the results.