Solved

most efficient way to extract info from 2GB log file

Posted on 2005-02-28
Medium Priority
580 Views
Last Modified: 2012-05-05
I have written a perl script that reads a log file (approx 2GB) and combines this with info from a ps -ef command to print out who is currently logged on and when the last utility was executed.  This script takes 1.5 minutes to run.  Is it possible to make it any faster?  I am currently using pipes to dump this info into separate arrays, e.g.,
open EXECUTES, "tail -r txt.log |grep ': Execute: '| awk '{print \$2, \$3, \$4}' |";
@executes = map({$_} <EXECUTES>);
Question by:janeguzzardo
28 Comments
 
LVL 20

Accepted Solution

by:
jmcg earned 2000 total points
ID: 13420902
I imagine you could run it a little faster by doing the work in perl and avoiding unnecessary copying. Something like:

my @executes;
open EXECUTES, "txt.log" or die "open failed: txt.log -- $!";
while( <EXECUTES> ) {
   next unless index $_, ': Execute: ';
   unshift( @executes, join( ', ', (split ' ', $_, 4)[1,2,3] ) );
   }

Because we're using 'unshift', the order of the lines in @executes will be the reverse of their order in the file. I'm not sure what data structure you were trying to build with that {$_} in the 'map'.

But maybe I'm not understanding enough about your task. In general, you want to make just one pass through the file, if possible, since it's too large to fit in cache (on most systems). To the extent possible, you want to give an early place to the filters that cut down on how much of the input goes on to be processed by later steps. Reading a file forward usually works better than reading it backwards, but this may only make a small difference in the overall performance.
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13420904
Why don't you do the grep, awk, etc. inside of Perl?

Manav
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13421044
map() looks to be doing nothing here except supplying list context to <EXECUTES>, which a straightforward assignment would have supplied anyway.

push() and unshift() are complementary: push() appends to the end of the array, whereas unshift() inserts at the start.
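A quick illustration (not from the original script):

my @a = (2, 3);
push    @a, 4;   # appends:  @a is now (2, 3, 4)
unshift @a, 1;   # prepends: @a is now (1, 2, 3, 4)

# and the map is a no-op here -- these two lines are equivalent:
@executes = map({$_} <EXECUTES>);
@executes = <EXECUTES>;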

Manav
 
LVL 8

Expert Comment

by:inq123
ID: 13421102
Hi janeguzzardo,

Why do you need the tail?  You might consider just "'grep ': Execute: ' txt.log | awk '{print \$2, \$3, \$4}' | "

Also you don't need the map {$_} to put input into arrays.  @executes = <EXECUTES>; should do.

The thing is, that probably only helps a little bit in speed, and I doubt doing it in perl makes it run faster.  My guess is that it'd be very similar or even a tad bit slower.

Cheers!
 
LVL 8

Expert Comment

by:inq123
ID: 13421123
janeguzzardo,

I forgot to mention one other advantage of the pipe in your script: if your log file goes over 2GB, perl can't even open the file, so a perl-based solution that opens it directly won't work.  From what you described, there is nothing to stop your log file from going over 2GB, unless your FS does not support files that large.
 

Author Comment

by:janeguzzardo
ID: 13421512
Thank you for all of your responses.
I had to tail the file because I only want to pull off the LAST Execute found for a given logged-in user.  If I start from the beginning, I may have to iterate through 50 or so Executes per user until I hit the last one.  I tried the code given, and I am uncertain how it is supposed to pull off the 3 fields that I need.  Currently it is putting the entire line containing Execute into my array; I just want the 2nd, 3rd, and 4th fields.
I am also concerned about the statement regarding the 2GB log file size.  Are you implying that I cannot use perl with the | to extract information from a log file that is larger than 2GB?
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13421563
try changing
unshift( @executes, join( ', ', (split ' ', $_, 4)[1,2,3] ) );

to

unshift( @executes, join( ', ', (split /\s+/, $_, 4)[1,2,3] ) );

Manav


 
LVL 8

Expert Comment

by:inq123
ID: 13421590
No, I meant that using pipe (or STDIN) is the way to go for files over 2 GB.

I did not give you any new code, I simply got rid of the tail from your command.  I don't know the format of your file (and I'm not quite familiar with awk), so I don't know why your awk wasn't working as you thought it should.

I still could not totally understand why you used tail -r.  I don't know what this '-r' does, since it's certainly not supported by tail on my linux machine.  I also don't understand how this -r would enable you to ignore the lines until the line after a particular user was found.  Can you explain?

BTW, it sounds to me like you don't actually have a fully working script (still some problem with awk), so you would actually like to both fix your script and optimize it, right?
 

Author Comment

by:janeguzzardo
ID: 13421672
Actually, my original script is providing all of the info that I need.  However, it is taking 1.5 minutes to run.  I was looking for ways to make it quicker (it originally took 7.5 minutes when written by another programmer using C shell--my task was to make it faster, so I rewrote it in perl).  As for the tail: I store all of the Executes into an array, and I then grep on the array for each logged-in user, so it encounters the last Execute first for each user.  Perhaps I am not explaining this very well...
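In other words, for each logged-in user the lookup amounts to something like this (a rough sketch; $loginuser comes from the ps -ef output):

# @executes holds the tail -r output, so the first match is the most recent Execute
my ($latest) = grep { /$loginuser/ } @executes;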
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13421700
Can you post your original script so we can understand better?  It doesn't look like it would be a large one...

Manav
 
LVL 20

Expert Comment

by:jmcg
ID: 13421720
I was using the ' ' in the split to cause split to more closely simulate what awk does by default. It seems like something wasn't transcribed right, so the split was not doing what I expected.

=================

I think my line

next unless index $_, ': Execute: ';

should have been written as

next unless index( $_, ': Execute: ') >= 0;

I keep misremembering that index all by itself isn't a good predicate: a return of 0 (a match at the very start of the line) ought to count as "found" but is false in boolean context, while the -1 it returns for "not found" is true.
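A quick demonstration of why the bare call misbehaves as a filter:

print index( 'abc: Execute: cmd', ': Execute: ' );   #  3 -> true:  'next unless' lets the line through (correct)
print index( ': Execute: cmd',    ': Execute: ' );   #  0 -> false: 'next' fires and a matching line is skipped
print index( 'nothing relevant',  ': Execute: ' );   # -1 -> true:  a non-matching line slips through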

===========

Perl can be compiled to handle files larger than 2Gb, but, until fairly recently, only people with 64-bit systems did that routinely.

===========

Is the name of the logged-in user part of the Execute: line? Knowing that we only need to keep the last seen data can reduce the amount of data that must be stored, but I don't see how you can escape reading the entire file and parsing all of the Execute lines. It therefore matters little whether you read the file forward or back.
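For what it's worth, here is a sketch of that "keep only the last seen" idea in one forward pass.  It assumes, purely for illustration, that the user name is the fifth whitespace-separated field of the Execute: line -- adjust the indices to the real log format:

my %last_execute;                     # user => "whatever fields 2-4 hold"
open LOG, "txt.log" or die "open failed: txt.log -- $!";
while( <LOG> ) {
   next unless index( $_, ': Execute: ' ) >= 0;
   my @f = split ' ', $_;
   $last_execute{ $f[4] } = join ', ', @f[1..3];   # later lines simply overwrite earlier ones
   }
close LOG;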


 
LVL 8

Expert Comment

by:inq123
ID: 13421742
I still don't get the '-r' switch for tail.  I also did not understand your comment 'Currently it is putting the entire line containing Execute into my array; I just want the 2nd, 3rd, and 4th fields.' - I simply copied your code, and if your code was working, then the copied code should work too, unless there is some mess-up of whitespace or something.

As for the speed, I suspect your tail did not do the job of cutting down how much of the file gets processed, since grepping a pattern as simple as the one you showed against a 2 GB file should take only half a minute at most.  There is definitely room for improvement, and I suspect this tail step could be improved.
 

Author Comment

by:janeguzzardo
ID: 13421762
Manav,
    Your portion of the script works, but unfortunately doesn't speed things up.  Also, due to my work regulations, I cannot post the entire script on the site.  I am basically doing 3 separate greps on the logfile (logouts, logins, executes) and then extracting information from the ps -ef run and combining it all together.  I was hoping there was some way to do a grep on a filehandle, such as
grep (/$loginuser/, <FILEHANDLE for EXECUTES>)
so that I wouldn't have to dump the file into arrays (which seems to be taking the most time).
Thank you again for your response.  I have never used index or join... that helped.
Jane
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13421786
A grep on a filehandle will still force all lines to be read into memory (I believe so).
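A while loop that reads one line at a time avoids that, e.g. (a minimal sketch using the EXECUTES handle and $loginuser from the earlier comments):

while( <EXECUTES> ) {
   next unless /$loginuser/;
   # handle the matching line here; only the current line is held in memory
   }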

Manav
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13421803
Maybe you can give a step-by-step description of what you are trying to do. Then we can patch up a script for you....

Manav
 
LVL 20

Expert Comment

by:jmcg
ID: 13421809
The "-r" option on 'tail' tells it to give the lines in reverse order. I'm surprised this didn't make it into the GNU version of 'tail', but it apparently never made it into the POSIX standard, either.
 

Author Comment

by:janeguzzardo
ID: 13421833
THANK YOU MANAV!  Once I fixed the index to work as you said, that cut 27 seconds off of the process time.  Let me use that in place of some of the other greps and see how much more I can reduce.  Brilliant!!  Especially since I had a difficult time explaining the problem.

Jane
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13421846
I think the correction in index was posted by jmcg himself.

Manav
 

Author Comment

by:janeguzzardo
ID: 13421855
One more question: if my log file does exceed 2GB, will I have problems with the code?

Thanks again, Jane
 
LVL 8

Expert Comment

by:inq123
ID: 13421864
jmcg, thanks! No wonder I couldn't find it in man tail.  jane, if you're using 3 separate greps, then you really should process the file in perl and not use 3 greps.  It should definitely be faster than running through the same file 3 times.  As long as you read the log file from <STDIN> or a pipe (using cat, for example), the file size won't be a concern (or you could recompile perl as jmcg suggested).
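To sketch that (one piped open, a single pass, all three filters at once -- the ': Login: ' and ': Logout: ' strings are only guesses at what the log lines look like):

my ( @logins, @logouts, @executes );
open LOG, "cat txt.log |" or die "cannot start cat: $!";     # reading from a pipe sidesteps the 2GB open() limit
while( <LOG> ) {
   if    ( index( $_, ': Execute: ' ) >= 0 ) { push @executes, $_ }
   elsif ( index( $_, ': Login: '   ) >= 0 ) { push @logins,   $_ }
   elsif ( index( $_, ': Logout: '  ) >= 0 ) { push @logouts,  $_ }
   }
close LOG;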
 
LVL 8

Expert Comment

by:inq123
ID: 13421872
janeguzzardo,

The open() function in perl would fail if the file exceeds 2 GB.
 

Author Comment

by:janeguzzardo
ID: 13421875
OKAY--MANY MANY THANKS!!

Jane
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13421965
janeguzzardo,

posting in your member feedback will not make your post seen by everyone. In fact, even we only infrequently visit our feedback pages ;)

<quote>
Yes, I am just trying to pull out smaller portions of the log file that I need to search so as not to pull in all at once.  Let me try your code and see if that makes a difference.  I am not familiar with index., but will try it.  Thank you.
</quote>

Please post question-related comments here.

Manav
 

Author Comment

by:janeguzzardo
ID: 13421976
Yes, I figured that out after I wasn't getting any response. :)
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13421997
And you also have to accept a post (or posts) as the final answer. ;)
As you seem to be a newcomer, maybe jmcg will guide you to the right area on the help page regarding this. I still haven't figured out how you put those fancy http://Q3124234 links here.

Manav

 
LVL 20

Expert Comment

by:jmcg
ID: 13423923
Maybe you were a little hasty in clicking on the "Accept" button. Do you want the question re-opened so as to allow a split amongst the various people who helped? That would be fairer.

Also, you accepted before we were done answering!

jmcg
EE Page Editor for Perl
 
LVL 20

Expert Comment

by:jmcg
ID: 13424014
Manav -

The "fancy" links occur whenever you put in something the site's scripts "recognize" as a URL:

  http://anything.at.all/
  www.anything.goes.com
  ftp://some-site/path/path/file

It's a bit of a hack and it turns out that the relative links

  http:Q_21331778.html#13423923

aren't rendered as fully legal links, so some browsers (Safari on Mac is the one I know about) do the wrong thing with them.

But there's no magic.
 
LVL 20

Expert Comment

by:jmcg
ID: 13424162
Jane -

Reading a 2 Gb file three times is bound to be slower than reading it once.

If I understood better what you were doing -- more about the lookups and what is on the Execute: log lines -- I suspect there may be a way to make things even faster: if we indeed read the file backwards from the end, and if there is a way to determine that we've found all the lines we are interested in without reading the whole 2GB, then we could quit processing the rest of the file.

Could the "ps -ef" be run first? Can you learn before processing the 2Gb file which users or process IDs (or whatever it is that you extract from the Login/Logout/Execute lines) you're looking for?

Also, is this an operation that you'll be doing frequently? Would it make sense to keep an extract of the log file that's shared between runs? By keeping track of where the extract left off, you can process the remainder of the log file incrementally.
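A rough sketch of that incremental idea -- the offset.txt file name is purely hypothetical, and note that tell/seek are themselves limited to 2GB unless perl was built with large-file support:

my $offset = 0;
if ( open STATE, "offset.txt" ) {     # position saved by the previous run, if any
   chomp( $offset = <STATE> );
   close STATE;
   }

open LOG, "txt.log" or die "open failed: txt.log -- $!";
seek LOG, $offset, 0;                 # skip straight past what was already processed
while( <LOG> ) {
   # process only the lines added since the previous run
   }
$offset = tell LOG;                   # remember how far this run got
close LOG;

open STATE, ">offset.txt" or die "cannot write offset.txt: $!";
print STATE "$offset\n";
close STATE;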