Rich Rumble
asked on
Bash one liner, start and end time calculation
I've got a file full of VPN logs. There are 2 to 3 lines per session that I'd like to combine into one with a perl/awk/sed (what have you) script.
I'd like to go from this:
id=firewall time="2016-01-01 09:34:57" pri=6 fw=10.0.16.11 vpn=vpn1-ssl user=jdoe@example.com realm="MPKI" roles="Role_PC" proto= src=15.16.3.132 dst= dstname= type=vpn op= arg="" result= sent= rcvd= agent="" duration= msg="NWC23464: VPN Tunneling: Session started for user with IPv4 address 10.26.115.42, hostname pc-lt-sfo-55321"
id=firewall time="2016-01-01 10:32:07" pri=6 fw=10.0.16.11 vpn=vpn1-ssl user=jdoe@example.com realm="MPKI" roles="Role_PC" proto=auth src=15.16.3.132 dst= dstname= type=vpn op= arg="" result= sent= rcvd= agent="" duration= msg="AUT22673: Logout from 15.16.3.132 (session:ff61620b)"
id=firewall time="2016-01-01 10:32:07" pri=6 fw=10.0.16.11 vpn=vpn1-ssl user=jdoe@example.com realm="MPKI" roles="Role_PC" proto= src=15.16.3.132 dst= dstname= type=vpn op= arg="" result= sent=8360966 rcvd=20520833 agent="" duration=3431 msg="NWC23465: VPN Tunneling: Session ended for user with IPv4 address 10.26.115.42"
to this: (Hostname, Username, Start-Time, End-Time, Sent, Received)
pc-lt-sfo-55321, jdoe@example.com, 2016-01-01 09:34:57, 2016-01-01 10:32:07, sent=8.36Mb, rcvd=20.52Mb
The line with msg="NWC23464" is the only line with the hostname in the log. Otherwise, the line with msg="NWC23465" is the only one I care about. I can use cut to parse the 23465 line into most of its important parts, but I don't know how to do math on the SENT/RECEIVED data.
Again, it's a log filled with other entries just like this one. I'd like to "cat" the file and pass it through sed/awk/perl (I'm open to any), extract the hostname from the 23464 msg line, compute the start time from the value in 23465's duration field (duration is written in seconds) by subtracting the duration seconds from the timestamp at the beginning of the 23465 line, write the end time, and do math on the sent and received fields.
This is what I have working so far, it's close, but it does not output to one line or do the math:
cat 2016-20k.txt |grep NWC2346[4-5] |cut -d" " -f2,3,6,7,11,18,19,21,32,34 |sed 's/"//g;s/ /-/;s/^time=//;s/ vpn=/, /;s/ user=/, /;s/ src=/, /;s/sent= rcvd= duration= //;'|LC_ALL='C' sort -t, -k3 -n
Fields: 2=date, 3=time, 6=vpn-device, 7=user, 11=ip, 18=sent, 19=received (bytes), 21=duration, 32=hostname, 34=ip
With all the other entries, sort is not pairing the begin/end times as well as I'd hoped. I've tried every variation on sort's column order (-k3, -k1, etc.). Any ideas on how to do this better, and with a one-liner if at all possible?
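A small aside on the sort itself: a bare -k3 makes sort compare from field 3 through the end of the line, so everything after the user column leaks into the key. Restricting each key and stacking user before time gets closer to the intended grouping; a sketch reusing the pipeline above:

  # group by user (comma field 3), then chronologically (comma field 1)
  grep 'NWC2346[4-5]' 2016-20k.txt | cut -d" " -f2,3,6,7,11,18,19,21,32,34 \
    | sed 's/"//g;s/ /-/;s/^time=//;s/ vpn=/, /;s/ user=/, /;s/ src=/, /;s/sent= rcvd= duration= //;' \
    | LC_ALL='C' sort -t, -k3,3 -k1,1

Even so, sorted adjacency only pairs a user's sessions cleanly when they never overlap, which is where the matching logic in the answers below comes in.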
ASKER
This looks pretty good!
awk '/NWC23464/ {H=$NF; sub("\"","",H);ST=$2" "$3; sub("time=","",ST); gsub("\"","",ST) }/NWC23465/ {U=$7; sub("user=","",U);ET=$2" "$3; sub("time=","",ET); gsub("\"","",ET);S=$18; S=substr(S,index(S,"sent=")+5)/1024/1024;R=$19; R=substr(R,index(R,"rcvd=")+5)/1024/1024;printf "%s,%s,%s,%s,sent=%.2fMB,rcvd=%.2fMB\n", H,U,ST,ET,S,R}' < vpnlogs.txt
Most of your statements are true: there is always a 23464 for user-x before there is a 23465 for the same user, but other 23464/23465 lines fall in between as users come and go; that doesn't seem to matter :) I'm going to test this out a bit more, but I think you nailed it!
ASKER
It's not matching the hostnames correctly sometimes. It does for a small sample; I'm trying to figure out what is throwing it off in the larger samples. If I can anonymize the data more, I'll give you a sample to work with that is "breaking" it for me. Thanks again for taking a look at this!
-rich
I think I understand.
Your logs could look like this:
NWC23464 userA hostX
NWC23464 userB hostY
NWC23465 userA
NWC23464 userC hostZ
NWC23465 userC
NWC23465 userB
OK, so I really need meaningful data to check this out.
Sorting will not help here.
I think we'll have to create two temporary files, one containing the NWC23464 records, one containing the NWC23465 records.
We can read the NWC23464 file and try to find a matching NWC23465 in the other file,
perhaps based on the userid plus maybe the start time derived from the end time minus duration as you suggested in the Q.
In this case we'd have to read the NWC23465 file first, obviously.
I fear the above cannot be done using a one-liner ...
Waiting for data ...
This script is a very first approach. As I said, "real" data will be needed to fully check it out.
#!/bin/bash
logf="vpnlogs.txt"
tmp4="/tmp/tmp4.$$"                                       # NWC23464 (session start) records
tmp5="/tmp/tmp5.$$"                                       # NWC23465 (session end) records
grep "NWC23464" "$logf" > "$tmp4"
grep "NWC23465" "$logf" > "$tmp5"
while read -r line5
do
  U=$(cut -d" " -f7 <<<"$line5")                          # user=... field
  ET=$(cut -d" " -f2-3 <<<"$line5" | tr -d '[a-zA-Z"=]')  # end time
  DU=$(cut -d" " -f21 <<<"$line5" | tr -d '[a-zA-Z"=]')   # duration in seconds
  ST1=$(date -d "$ET $((DU-0)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  ST2=$(date -d "$ET $((DU-1)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  ST3=$(date -d "$ET $((DU+1)) seconds ago" "+%Y-%m-%d %H:%M:%S")
  if line4=$(grep "$U" "$tmp4" | egrep "$ST1|$ST2|$ST3"); then
    echo "$line4"
    echo "$line5"
  fi
done < "$tmp5" |
awk '/NWC23464/ {H=$NF; sub("\"","",H);
  ST=$2" "$3; sub("time=","",ST); gsub("\"","",ST) }
/NWC23465/ {U=$7; sub("user=","",U);
  ET=$2" "$3; sub("time=","",ET); gsub("\"","",ET);
  S=$18; S=substr(S,index(S,"sent=")+5)/1024/1024;
  R=$19; R=substr(R,index(R,"rcvd=")+5)/1024/1024;
  printf "%s,%s,%s,%s,sent=%.2fMb,rcvd=%.2fMb\n", H,U,ST,ET,S,R}'
rm "$tmp4" "$tmp5"
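A quick note on running it (an aside, with assumed file names): the here-strings (<<<) make it bash-only, and the relative "seconds ago" arithmetic needs GNU date, so a typical invocation would be:

  bash ./script.sh > sessions.csv   # sessions.csv is a hypothetical output name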
ASKER
That's why I wanted to do the math on the duration; there is no "session" ID or other method to tie a session's beginning to its ending directly. I wanted to convert the times to unix timestamps (it's probably too hard to subtract the duration seconds from a yyyy-mm-dd hh:mm:ss value directly). Again, if not for the hostname, almost all the data I need is in 23465. I'm no good with programming or arrays etc... can we "key" on the username (the username can be "user" or "user@example.com" btw) and use the duration to find a line that is near the 23465 timestamp minus the duration? That's the direction I was going. The grouping/sorting actually was working for some instances, but when a user had to login/logout a few times a day, it got jacked up. I wanted to sort on user, pairing them all up, then sort on time, and I was hoping that would get it set up like this:
23464 start-time, ip, vpn-device, user, hostname
23465 end-time, ip, vpn-device, user, sent, rcvd, ip
But sort can only do so much when it's one whole big list. I was able to sort and pair them up by using an awk script to run grep over and over against the one file, using the usernames as search terms. If I grep'd for each user through the large list, sort then acted the way I expected. But since everything is in one large file, it sorts on column 3 (user), then on column 1 (time), but using the times for the whole file, not just for the group of lines belonging to that username. This is understandable, but also unfortunate.
I'm working on getting you a dataset nonetheless; it might be tomorrow before I can post it, it's bedtime for me.
-rich
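For what the username-keyed idea could look like in practice, here is a minimal awk sketch (untested, and only an illustration, not either expert script: it assumes a user never has two sessions open at once, and reuses the field positions from the commands above):

  awk '/NWC23464/ { h = $NF; gsub(/"/, "", h)                # hostname is the last field
                    st = $2 " " $3; gsub(/time=|"/, "", st)  # start time, quotes stripped
                    host[$7] = h; start[$7] = st }           # key both on the user=... field
       /NWC23465/ { if ($7 in host) {
                      et = $2 " " $3; gsub(/time=|"/, "", et)
                      s = substr($18, 6) / 1024 / 1024       # sent=BYTES, skip the 5-char prefix
                      r = substr($19, 6) / 1024 / 1024       # rcvd=BYTES likewise
                      printf "%s,%s,%s,%s,sent=%.2fMb,rcvd=%.2fMb\n",
                             host[$7], substr($7, 6), start[$7], et, s, r
                      delete host[$7]
                    } }' vpnlogs.txt

The delete at the end is what keeps a later login by the same user from picking up a stale hostname.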
Did you see my script? It does the math on the timestamp using GNU date (lines 12-14).
As you can see, it computes three timestamps (1 second apart), because there can/will be rounding differences.
Of course I could rewrite the script to use just the input file instead of the two temporary files, if you prefer.
Maybe this will slow things down a bit, because the first of the greps on line 15 will then run against the whole file at each iteration,
instead of the extract file containing just the NWC23464 records.
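To make that concrete with the sample record from the question (end time 2016-01-01 10:32:07, duration=3431): the script computes 09:34:56 (DU-0), 09:34:57 (DU-1) and 09:34:55 (DU+1), and the start time actually logged in the NWC23464 line is 09:34:57, so it is the one-second slack that makes the match succeed. The same derivation at the shell, using the GNU date syntax from lines 12-14:

  date -d "2016-01-01 10:32:07 3430 seconds ago" "+%Y-%m-%d %H:%M:%S"   # prints 2016-01-01 09:34:57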
ASKER
See attached; uncompressed it's 18MB. I had errors with the script:
$ ./script.sh
: No such file or directory
: No such file or directory
./script.sh: line 19: syntax error near unexpected token `done'
'/script.sh: line 19: ` done < $tmp5 |
$ ls /tmp/
tmp4.13476?? tmp4.13572?? tmp5.13476?? tmp5.13572??
There are literal question marks, that is new :) Each time I run the script I get a new (empty) tmp4/5.xxxxx?? file in /tmp.
logs.zip
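Those literal question marks are a telltale sign (a hedged diagnosis, consistent with the Cygwin remark further down): the script file has DOS/Windows line endings, so a carriage return is glued to the end of every line, including the temp-file names and the line feeding `done`. Stripping them should clear all three errors:

  # assumption: GNU sed is available; dos2unix script.sh would do the same job
  sed -i 's/\r$//' script.sh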
ASKER CERTIFIED SOLUTION
ASKER
Cool, I was using Cygwin when I posted the errors. On Linux it's working. Let me evaluate it a bit more, thanks!
-rich
The script discards all records that don't have a counterpart, in particular when an NWC23464 record cannot be found for a given NWC23465 record (line 15 in the first version, line 11 in the second version). Maybe this is the reason for the missing entries.
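If those orphaned records should stay visible, the if on line 15 of the first version could take an else branch; a hypothetical tweak, not part of either posted version:

  if line4=$(grep "$U" "$tmp4" | egrep "$ST1|$ST2|$ST3"); then
    echo "$line4"
    echo "$line5"
  else
    echo "UNMATCHED: $line5" >&2   # report NWC23465 records with no matching start
  fi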
ASKER
Yeah, I noticed that the time frame was off; the 23464 came after the 23465. I think I'm ready to accept. It's much better than what I had, and I think you did a great job!
-rich
ASKER
Thanks for putting in the time for a noob!
2) Aren't the start and end times already contained in the respective "time=" fields of the input lines (start time in the 23464 line, end time in the 23465 line) so we won't have to do any math here?
3) Shouldn't we divide by (1024*1024) instead of (1000*1000) in order to get MB from bytes?
Assuming that all of the above is true, here is a little awk scriptlet (it's basically a one-liner, but I added line breaks for readability); this is the same scriptlet the asker quotes above:
awk '/NWC23464/ {H=$NF; sub("\"","",H);
  ST=$2" "$3; sub("time=","",ST); gsub("\"","",ST) }
/NWC23465/ {U=$7; sub("user=","",U);
  ET=$2" "$3; sub("time=","",ET); gsub("\"","",ET);
  S=$18; S=substr(S,index(S,"sent=")+5)/1024/1024;
  R=$19; R=substr(R,index(R,"rcvd=")+5)/1024/1024;
  printf "%s,%s,%s,%s,sent=%.2fMB,rcvd=%.2fMB\n", H,U,ST,ET,S,R}' < vpnlogs.txt
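On question 3, the sample data itself answers it: sent=8360966 divided by 1000*1000 is the 8.36 that appears in the desired output, while dividing by 1024*1024 gives roughly 7.97. The desired output was evidently computed with decimal megabytes, so the /1024/1024 in the scriptlet will print slightly smaller figures (likewise 20520833 gives 20.52 decimal vs. about 19.57 binary).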