Bash one liner, start and end time calculation

Rich Rumble
Rich Rumble used Ask the Experts™
on
I've got file full of VPN logs. There are 2 to 3 lines that I'd like to combine into one with a perl/awk/sed what have you script.
I'd like to go from this:
id=firewall time="2016-01-01 09:34:57" pri=6 fw=10.0.16.11 vpn=vpn1-ssl user=jdoe@example.com realm="MPKI" roles="Role_PC" proto= src=15.16.3.132 dst= dstname= type=vpn op= arg="" result= sent= rcvd= agent="" duration= msg="NWC23464: VPN Tunneling: Session started for user with IPv4 address 10.26.115.42, hostname pc-lt-sfo-55321"
id=firewall time="2016-01-01 10:32:07" pri=6 fw=10.0.16.11 vpn=vpn1-ssl user=jdoe@example.com realm="MPKI" roles="Role_PC" proto=auth src=15.16.3.132 dst= dstname= type=vpn op= arg="" result= sent= rcvd= agent="" duration= msg="AUT22673: Logout from 15.16.3.132 (session:ff61620b)"
id=firewall time="2016-01-01 10:32:07" pri=6 fw=10.0.16.11 vpn=vpn1-ssl user=jdoe@example.com realm="MPKI" roles="Role_PC" proto= src=15.16.3.132 dst= dstname= type=vpn op= arg="" result= sent=8360966 rcvd=20520833 agent="" duration=3431 msg="NWC23465: VPN Tunneling: Session ended for user with IPv4 address 10.26.115.42"

Open in new window

to this:
   (Hostname, Username, Start-Time, End-Time, Sent, Received)
   pc-lt-sfo-55321, jdoe@example.com, 2016-01-01 09:34:57, 2016-01-01 10:32:07,  sent=8.36Mb, rcvd=20.52Mb

The line with msg="NWC23464" is the only line with the hostname in the log. Otherwise, the line with  msg="NWC23465" is the only one I care about. I can use Cut to parse the 23465 line into most of it's important parts, but I don't know how to do math on the SENT/RECIEVED data.

Again its a log filled with other entries just like this one, I'd like to "cat" the file and pass it through sed/awk/perl (I'm open to any), extract the hostname from the 23464 msg line, write the start time based on the value in 23465's duration field (duration written in seconds) then write the end time, and do math on the sent and received fields. (subtract duration seconds from the time stamp at the beginning of the 23465 line)
This is what I have working so far, it's close, but it does not output to one line or do the math:
cat 2016-20k.txt |grep NWC2346[4-5] |cut -d" " -f2,3,6,7,11,18,19,21,32,34 |sed 's/"//g;s/ /-/;s/^time=//;s/ vpn=/, /;s/ user=/, /;s/ src=/, /;s/sent= rcvd= duration= //;'|LC_ALL='C' sort -t, -k3 -n
2=date, 3=time, 6=vpn-device, 7=user, 11=ip, 18=sent, 19=recieved (bytes), 21=duration, 32=hostname, 34=ip

Open in new window

With all the other entries, sort is not pairing the begin/end times as well as I'd hoped. I've tried every variation on sort's column order (-k3, -k1 etc). Any ideas on how to do this better, and with a one liner if at all possible?
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Most Valuable Expert 2013
Top Expert 2013

Commented:
1.) Isn't a line containing msg="NWC23464" always followed by a related line containing msg="NWC23465" (immediately or not doesn't matter), so we won't have to do any sorting/matching?

2) Aren't the start and end times already contained in the respective "time=" fields  of the input lines (start time in the 23464 line, end time in the 23465 line) so we won't have to do any math here?

3) Shouldn't we divide by (1024*1024) instead of (1000*1000) in order to get MB from bytes?

Assuming that all of the above is true here is a little awk scriptlet (it's basically a one-liner, but I added line breaks for readability):

awk '/NWC23464/ {H=$NF; sub("\"","",H);
                 ST=$2" "$3; sub("time=","",ST); gsub("\"","",ST) }
     /NWC23465/ {U=$7; sub("user=","",U);
                 ET=$2" "$3; sub("time=","",ET); gsub("\"","",ET);
                 S=$18; S=substr(S,index(S,"sent=")+5)/1024/1024;
                 R=$19; R=substr(R,index(R,"rcvd=")+5)/1024/1024;
                 printf "%s,%s,%s,%s,sent=%.2fMB,rcvd=%.2fMB\n", H,U,ST,ET,S,R}'  file_containing_vpnlogs

Open in new window

Rich RumbleSecurity Samurai
Top Expert 2006

Author

Commented:
This looks pretty good!
awk '/NWC23464/ {H=$NF; sub("\"","",H);ST=$2" "$3; sub("time=","",ST); gsub("\"","",ST) }/NWC23465/ {U=$7; sub("user=","",U);ET=$2" "$3; sub("time=","",ET); gsub("\"","",ET);S=$18; S=substr(S,index(S,"sent=")+5)/1024/1024;R=$19; R=substr(R,index(R,"rcvd=")+5)/1024/1024;printf "%s,%s,%s,%s,sent=%.2fMB,rcvd=%.2fMB\n", H,U,ST,ET,S,R}' < vpnlogs.txt

Open in new window

Most of your statements are true, there is always a 23464 for user-x before there is a 23465 for the same user, but other 23464-5's are in between as users come and go, that doesn't seem to matter :) I'm going to test this out a bit more, but I think you nailed it!
Rich RumbleSecurity Samurai
Top Expert 2006

Author

Commented:
It's not matching the hostnames correctly sometimes, it does for a small sample, I'm trying to figure out what is throwing it off in the larger samples. If I can anonymize the data more I'll give you a sample to work with that is "breaking" it for me. Thanks again for taking a look at this!
-rich
Fundamentals of JavaScript

Learn the fundamentals of the popular programming language JavaScript so that you can explore the realm of web development.

Most Valuable Expert 2013
Top Expert 2013

Commented:
I think I understand.

Your logs could look like this:

NWC23464 userA hostX
NWC23464 userB hostY
NWC23465 userA
NWC23464 userC hostZ
NWC23465 userC
NWC23465 userB

OK, so I really need meaningful data to check this out.
Sorting will not help here.

I think we'll have to create two temporary files, one containing the NWC23464 records, one containing the NWC23465 records.
We can read the NWC23464 file and try to find a matching NWC23465 in the other file,
perhaps based on the userid plus maybe  the start time derived from the end time minus duration as you suggested in the Q.
In this case we'd have to read the NWC23465 file first, obviously.

I fear the above cannot be done using a one-liner ...

Waiting for data ...
Most Valuable Expert 2013
Top Expert 2013

Commented:
This script is a very first approach. As I said, "real" data will be needed to fully check it out.

#!/bin/bash
logf="vpnlogs.txt"
tmp4="/tmp/tmp4.$$"
tmp5="/tmp/tmp5.$$"
grep "NWC23464" $logf > $tmp4
grep "NWC23465" $logf > $tmp5
while read line5
 do
  U=$(cut -d" " -f7 <<<$line5)
  ET=$(cut -d" " -f2-3 <<<$line5 | tr -d '[a-zA-Z"=]')
  DU=$(cut -d" " -f21 <<<$line5 |  tr -d '[a-zA-Z"=]')
  ST1=$(date -d "$ET $((DU-0)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  ST2=$(date -d "$ET $((DU-1)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  ST3=$(date -d "$ET $((DU+1)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  if line4=$(grep $U $tmp4 | egrep "$ST1|$ST2|$ST3"); then
    echo $line4
    echo $line5
  fi
 done < $tmp5 |
awk '/NWC23464/ {H=$NF; sub("\"","",H);
                 ST=$2" "$3; sub("time=","",ST); gsub("\"","",ST) }
     /NWC23465/ {U=$7; sub("user=","",U);
                 ET=$2" "$3; sub("time=","",ET); gsub("\"","",ET);
                 S=$18; S=substr(S,index(S,"sent=")+5)/1024/1024;
                 R=$19; R=substr(R,index(R,"rcvd=")+5)/1024/1024;
                 printf "%s,%s,%s,%s,sent=%.2fMb,rcvd=%.2fMb\n", H,U,ST,ET,S,R}'
rm $tmp4 $tmp5

Open in new window

Rich RumbleSecurity Samurai
Top Expert 2006

Author

Commented:
That's why I wanted to do the math on the duration, there is no direct "session" ID or method to track the session begining/ending very directly. I wanted to convert to a unix timestamp for the times(it's probaly too hard to use yyyy-mm-dd hh:mm:ss to do the math for subtracting the seconds from the duration. Again if not for the hostname, almost all the data I need is in 23465. I'm no good with programming or arrays etc... can we "key" on the username (username can be "user" or "user@example.com" btw) and use the duration to find a line that is near the duration time subtracted from the 23465 timestamp? That's the direction I was going, the grouping/sort'ing actually was working for some instances. When a user had to login/logout a few times a day, it got jacked up. I wanted to sort on user, pairing them all up, then I wanted to sort on time, and I was hoping that would get it setup like this:
23464 start-time, ip, vpn-device, user, hostname
23465 end-time, ip, vpn-device, user, sent, rcvd, ip
But sort can only do so much when it's a whole big list. I was able to sort and pair them up by using an awk script to run grep over and over against the one file using the usernames to search for. If I grep'd for each user through the large list, sort then acted the way I expected. But since everything is in one large file it does sort on column 3 (user), then it sorts on column1 (time) but the times for the whole file, not just the group of those usernames. This is understandable, but also unfortunate.

I'm working on getting you a dataset nonetheless, might be tomorrow before I can post it, it's bed time for me.
-rich
Most Valuable Expert 2013
Top Expert 2013

Commented:
Did you see my script? It does the math on the timestamp using GNU date (lines 12-14).
As you can see, it computes three timestamps (1 second apart), because there can/will be rounding differences.

Of course I could rewrite the script to use just the input file instead of the two temporary files, if you prefer.
Maybe this will slow down things a bit, because the first of the greps on line 15 will then run against the whole file at each iteration,
instead of the extract file containing just the NWC23464 records.
Rich RumbleSecurity Samurai
Top Expert 2006

Author

Commented:
See attached, uncompressed it's 18Mb. I had errors with the script

$ ./script.sh
: No such file or directory
: No such file or directory
./script.sh: line 19: syntax error near unexpected token `done'
'/script.sh: line 19: ` done < $tmp5 |

$ ls /tmp/
tmp4.13476??  tmp4.13572??  tmp5.13476??  tmp5.13572??
There are literal question marks, that is new :) Each time I run the script I get a new (empty) tmp4/5.xxxxx?? file in /tmp.
logs.zip
Most Valuable Expert 2013
Top Expert 2013
Commented:
Hi,

the script works just fine for me using your data.

Did you use the correct input file name in line 2? I named it "vpnlogs.txt" but you must of course pick the "real" name and path of the log file which you want to process.
What is your shell?
It seems that it cannot handle the variable "$$" correctly. This variable should contain the PID of the current process and I use it to make the temp file names unique.
Or did you transfer the script from Windows to Unix? Could it be that there are special characters at the/some line ends?
What do you get with "cat -v script.sh"?

Here is the version which doesn't use any temporary files. It takes twice the time required by the temp file version:

#!/bin/bash
logf="./logs.txt"
grep "NWC23465" $logf | while read line5
 do
  U=$(cut -d" " -f7 <<<$line5)
  ET=$(cut -d" " -f2-3 <<<$line5 | tr -d '[a-zA-Z"=]')
  DU=$(cut -d" " -f21 <<<$line5 |  tr -d '[a-zA-Z"=]')
  ST1=$(date -d "$ET $((DU-0)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  ST2=$(date -d "$ET $((DU-1)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  ST3=$(date -d "$ET $((DU+1)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  if line4=$(grep "NWC23464" $logf | grep -w $U | egrep "$ST1|$ST2|$ST3"); then
    echo $line4
    echo $line5
  fi
done | \
awk '/NWC23464/ {H=$NF; sub("\"","",H);
                 ST=$2" "$3; sub("time=","",ST); gsub("\"","",ST) }
     /NWC23465/ {U=$7; sub("user=","",U);
                 ET=$2" "$3; sub("time=","",ET); gsub("\"","",ET);
                 S=$18; S=substr(S,index(S,"sent=")+5)/1024/1024;
                 R=$19; R=substr(R,index(R,"rcvd=")+5)/1024/1024;
                 printf "%s,%s,%s,%s,sent=%.2fMb,rcvd=%.2fMb\n", H,U,ST,ET,S,R}'

Open in new window


If you still get errors please add "set -xv" just below "#!/bin/bash", rerun the script and post the result.

If you prefer the temp file version please let me know. If the problem is just the "$$" variable
we can easily avoid using it.
Rich RumbleSecurity Samurai
Top Expert 2006

Author

Commented:
Cool, I was using Cygwin when I posted the errors. On linux it's working. Let me evaluate it a bit more, thanks!
-rich
Most Valuable Expert 2013
Top Expert 2013

Commented:
The script discards all records which don't have a counterpart, particularly when a NWC23464 record cannot be found for a given NWC23465 record (line 15 in the first version, line 11 in the second version). Maybe this is the reason for the missing entries.
Rich RumbleSecurity Samurai
Top Expert 2006

Author

Commented:
Yeah, I noticed that the time frame was off, the 264 came after the 265. I think I'm ready to accept, it's much better than I had and I think you did a great job!
-rich
Rich RumbleSecurity Samurai
Top Expert 2006

Author

Commented:
Thanks for putting in the time for a noob!

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial