
AWK script parsing mail-log for data. (How to..)

Hey Experts,

Please keep in mind that I am not from an English-speaking country, so my grammar and spelling might not be good.

I have to get specific data out of a mail-log making use of AWK in a script (not a one-liner), however, I’m not able to get the data I want by myself (very limited knowledge).

AWK (WHICH DATA)
The goal of the script is to get the following:
- [DATA] the mail fail-percentage per domain,
- [OUTPUT] which has to be stored in a separate file (e.g. log_faildomain.txt)
- [DATA] the mail deferral-percentage per domain,
- [OUTPUT] which has to be stored in a sep. file (e.g. log_defdomain.txt)

MAIL-LOG STRUCTURE
Now the structure of the “test” mail-log is where the problem starts. I have posted a bit of the “test” mail-log in the code section. Furthermore, I have added the test mail-log itself to the attachments.
The structure (copy pasted, (c) Uni-ulm.de):

Message lines
A message line summarizes the delivery results for a message that has left the queue:
m birth done bytes nk nz nd <sender> qp uid

Here birth and done are timestamps, bytes is the number of bytes in the message, nk is the number of successful deliveries, nz is the number of deferred delivery attempts, nd is the number of failed delivery attempts, sender is the message's return path, qp is the message's long-term queue identifier, and uid is the userid of the user that queued the message. Note that matchup converts sender to lowercase. This can lose information, since a few hosts pay attention to the case in the box part of an address.

Delivery Lines
A delivery line shows the result of a single delivery attempt:
d result birth dstart ddone bytes <sender> chan.recip qp uid reason

Here birth, bytes, sender, qp, and uid are message information as above; chan is the channel for this delivery; recip is the recipient address for this delivery; dstart and ddone are timestamps; result is the letter k for success, z for deferral, d for failure; and reason is a more detailed explanation of the delivery result.
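
A minimal sketch of how the delivery-line layout above maps onto awk's positional fields. It assumes the fields are separated by single spaces, as in the log excerpt in this question, and that chan.recip can be split at its first dot; the file name sample.log and the variable names are only illustrative.

```shell
# Sample lines copied from the log excerpt in this question.
cat > sample.log <<'EOF'
d d 1004274655.308560500 1004274655.382174500 1004274655.417475500 1873 <SINT_PANCRAS@Groenland.com> local.VLAARDINGEN@Nieuwzeeland.com 22061 81
m 1004274655.308560500 1004274655.548948500 1873 0 1 0 <SINT_PANCRAS@Groenland.com> 22061 81
EOF

fields=$(awk '
$1 == "d" {                       # delivery lines only; "m" lines are skipped
    result = $2                   # k = success, z = deferral, d = failure
    sender = $7                   # return path, still wrapped in < >
    dot    = index($8, ".")       # $8 is chan.recip; split at the first dot
    chan   = substr($8, 1, dot - 1)
    recip  = substr($8, dot + 1)
    printf "result=%s chan=%s recip=%s sender=%s\n", result, chan, recip, sender
}' sample.log)

printf '%s\n' "$fields"
```

Only the "d" line produces output here; the "m" line has a different field layout and would need its own rule.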

ADDITIONAL INFORMATION
The following might help (“what I’m going to do with the data..”):
When I have the data, I'm going to put it into a prototype graph (made in Excel), and then I should be able to read the data out in a statistical format. I hope this helps clear things up.

I would be so GRATEFUL if someone would help me with this. Thanks in advance!!

d k 1004274611.163303500 1004274611.220392500 1004274611.672914500 2024 <SCHIEDAM@Gilbert-_en_Ellice-eilanden.com> remote.ALMERE@Hongarije.com 22059 250 
m 1004274611.163303500 1004274611.855431500 2024 1 0 0 <SCHIEDAM@Gilbert-_en_Ellice-eilanden.com> 22059 250 
d d 1004274655.308560500 1004274655.382174500 1004274655.417475500 1873 <SINT_PANCRAS@Groenland.com> local.VLAARDINGEN@Nieuwzeeland.com 22061 81 
m 1004274655.308560500 1004274655.548948500 1873 0 1 0 <SINT_PANCRAS@Groenland.com> 22061 81 
d z 1004274655.590566500 1004274655.665621500 1004274776.72770500 2431 <SINT_PANCRAS@Groenland.com> remote.SINT_PANCRAS@Groenland.com 22064 86 
d k 1004275066.390660500 1004275066.455900500 1004275066.631254500 1951 <MAASSLUIS@Rhodesie.com> local.SCHIPHOL@Frankrijk.com 22126 81 
m 1004275066.390660500 1004275066.639430500 1951 1 0 0 <MAASSLUIS@Rhodesie.com> 22126 81 
d k 1004275066.574064500 1004275066.631035500 1004275067.151378500 2082 <MAASSLUIS@Rhodesie.com> remote.LUCHTHAVEN_SCHIPHOL@Zwitserland.com 22130 512 
m 1004275066.574064500 1004275067.166109500 2082 1 0 0 <MAASSLUIS@Rhodesie.com> 22130 512 
d z 1004274655.590566500 1004275056.212493500 1004275116.242880500 2431 <MAASSLUIS@Rhodesie.com> remote.SINT_PANCRAS@Groenland.com 22064 86 
d k 1004275796.535919500 1004275796.610806500 1004275796.786087500 3959 <HOEK_VAN_HOLLAND@Canada.com> local.AMSTERDAM_ZUIDOOST@Frankrijk.com 22250 81 
m 1004275796.535919500 1004275796.800460500 3959 1 0 0 <HOEK_VAN_HOLLAND@Canada.com> 22250 81


seip2-1-.log

F. Dominicus

Well, awk is quite good at mangling text. I cannot see any code, so I have no idea what you expect.
However, awk loops over a file, reading it line by line; you can then match each line against a pattern and take action accordingly. The example script does the following, checking the first field ($1; $0 is the whole line).

You can see that you can assign to variables, and of course you can introduce other variables as well,
e.g. a counter for the messages and a counter for "failed" messages, etc.

You can introduce them in a BEGIN block, and at the end you can do the calculation; in between you are just "collecting" information.

Now just introduce the variables you are interested in and update them according to what each line contains.

A small remark: arrays in awk are really dictionaries, so you can use strings to index into "arrays".

BEGIN { delivery_count = 0; }
{ if ($1 == "d") {
        printf("It's a delivery line\n");
        result = $2;
        birth = $3;
        delivery_count++;
        # arguments in the same order as the labels in the format string
        printf("birth = %s, result = %s\n", birth, result);
    } else {
        printf("It's a message line\n");
    }
}
END { printf("I found %d delivery lines\n", delivery_count); }

On your log I get:
It's a message line
It's a delivery line
birth = 1004420024.300957500, result = k
It's a message line
It's a delivery line
birth = 1004420024.485636500, result = k
It's a message line
It's a delivery line
birth = 1004420240.508533500, result = k
It's a message line
It's a delivery line
birth = 1004420240.708604500, result = k
It's a message line
It's a delivery line
birth = 1004420740.781613500, result = k
I found 961 delivery lines
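
To illustrate the remark above that awk arrays are really dictionaries, here is a small sketch that uses the result letter as a string key. The array name seen and the file name sample.log are made up for the example; the four lines are copied from the log excerpt in the question.

```shell
cat > sample.log <<'EOF'
d k 1004274611.163303500 1004274611.220392500 1004274611.672914500 2024 <SCHIEDAM@Gilbert-_en_Ellice-eilanden.com> remote.ALMERE@Hongarije.com 22059 250
d d 1004274655.308560500 1004274655.382174500 1004274655.417475500 1873 <SINT_PANCRAS@Groenland.com> local.VLAARDINGEN@Nieuwzeeland.com 22061 81
d z 1004274655.590566500 1004274655.665621500 1004274776.72770500 2431 <SINT_PANCRAS@Groenland.com> remote.SINT_PANCRAS@Groenland.com 22064 86
d k 1004275066.390660500 1004275066.455900500 1004275066.631254500 1951 <MAASSLUIS@Rhodesie.com> local.SCHIPHOL@Frankrijk.com 22126 81
EOF

counts=$(awk '
$1 == "d" { seen[$2]++ }          # the result letter is the array key
END {
    # for (key in seen) visits keys in no guaranteed order,
    # so the three known keys are printed explicitly here.
    printf "k=%d z=%d d=%d\n", seen["k"], seen["z"], seen["d"]
}' sample.log)

printf '%s\n' "$counts"
```

The same idea works with any string as the key, for example a domain name instead of a result letter.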


gheist

If you are combating spam - it is better to install some UNIX which generates proper logs and handles RBL lists properly. MS-SMTP.
Given a message in 10 seconds, you can beef up the mailer with an in-session AV scan using a command-line scanner (AVIRA does well).

tehmario

ASKER

I'm very sorry for being so limited in my explanation and for not providing any code snippets. Generally, I don't know how to get the data I want.
I really am limited in shell scripting, both in knowledge and in experience.
I know it was a long shot; however, soon I'm getting a bunch of old logs with the same structure as the example mail-log: seip2-1.log.
And I will have to get useful data from those mail-logs using AWK and shell scripting:
- the mail fail-percentage per domain,

A kind of pseudo-code:
First I want to get the mails which failed, and then I want to count, per domain, the mails that failed. Then I want to compare the failed mails from that domain with the total of failed mails over all the domains, as a percentage. After that I want to print all the unique domains and show the percentage of failed mails in comparison with all the mails spread over all the domains.

But I just don't know how to translate this into a proper AWK script.
Hopefully you guys can help me translate this into an AWK script.
Thanks in advance,
- Mario

F. Dominicus

Did you understand what I wrote?
$1 ... $N are the fields in which awk stores the tokens of each line. From there on you can simply do your calculation or whatever. What was unclear about the posted code?

Here's what you should do:
Keep a counter for successful mails, one for failing mails, and one which counts them all.
You can then calculate the failing percentage very easily.

If you want to log the failing lines, just use print or printf with output redirection (awk has no fprintf; print > "file" writes to a file).
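
A rough sketch of the counters described above, with the failing lines logged to a separate file through awk's output redirection. The file names sample.log and failed_lines.txt and the variable names are assumptions for the example, and "failed" is taken to mean result letter d, as in the format description in the question.

```shell
cat > sample.log <<'EOF'
d k 1004274611.163303500 1004274611.220392500 1004274611.672914500 2024 <SCHIEDAM@Gilbert-_en_Ellice-eilanden.com> remote.ALMERE@Hongarije.com 22059 250
d d 1004274655.308560500 1004274655.382174500 1004274655.417475500 1873 <SINT_PANCRAS@Groenland.com> local.VLAARDINGEN@Nieuwzeeland.com 22061 81
d z 1004274655.590566500 1004274655.665621500 1004274776.72770500 2431 <SINT_PANCRAS@Groenland.com> remote.SINT_PANCRAS@Groenland.com 22064 86
d k 1004275066.390660500 1004275066.455900500 1004275066.631254500 1951 <MAASSLUIS@Rhodesie.com> local.SCHIPHOL@Frankrijk.com 22126 81
EOF

summary=$(awk -v out="failed_lines.txt" '
$1 == "d" {
    total++                               # every delivery attempt
    if ($2 == "d") {                      # d = failure
        failed++
        print > out                       # copy the whole line to the log file
    }
}
END {
    if (total > 0)                        # avoid dividing by zero on an empty log
        printf "%d of %d deliveries failed (%.1f%%)\n", failed, total, 100 * failed / total
}' sample.log)

printf '%s\n' "$summary"
```

Swapping the letter d for z in the comparison would give the deferral percentage instead.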

Regards
Friedrich

tehmario

ASKER

Hey,
The thing is, I get the code, kind of, but it's not what I am looking for.
I am interested in the following elements:
- The failed mails ($2, fair enough)
- The domain name
- How many failed mails per domain
Then afterwards I want output like this:

OUTPUT EXAMPLE: "Fail_.txt"
Domain1 has XXX amount of failed mails.
Domain2 has XXX amount of failed mails.
----------
Total number of failed mails = XXX.

Then with that text I can do simple calculations to get the failpercentage per domain.

I tried your code out and I tried editing it, but I can't get the above stated information out of it.
I hope this clarifies it a bit.

F. Dominicus

Well, how did you try to get the domain name?
What is the problem with counting the failed attempts in a variable of their own,
like I showed, e.g., for
delivery_count++;

So this counts all the deliveries, be they successful or not. Now you'd keep another variable which counts just the failed attempts, and at the end you can calculate

(failed_count / delivery_count) * 100 = percentage of failed mails.

Now of course you have to keep these counters in, e.g., a dictionary which is indexed by the domain name (what have you tried to extract the domain name?). You keep one counter for the deliveries to that domain and another one which counts the failed attempts. At the end, in an END block, you write out the stuff you are interested in.
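
The per-domain counters described above could look roughly like this. It is only a sketch under some assumptions: the domain is taken from the sender field ($7), "failed" means result letter d, and the output goes to log_faildomain.txt, the example file name from the question. The array names and sample.log are made up.

```shell
cat > sample.log <<'EOF'
d d 1004274655.308560500 1004274655.382174500 1004274655.417475500 1873 <SINT_PANCRAS@Groenland.com> local.VLAARDINGEN@Nieuwzeeland.com 22061 81
d z 1004274655.590566500 1004274655.665621500 1004274776.72770500 2431 <SINT_PANCRAS@Groenland.com> remote.SINT_PANCRAS@Groenland.com 22064 86
d k 1004275066.390660500 1004275066.455900500 1004275066.631254500 1951 <MAASSLUIS@Rhodesie.com> local.SCHIPHOL@Frankrijk.com 22126 81
EOF

awk '
$1 == "d" {
    dom = $7                      # sender, e.g. <SINT_PANCRAS@Groenland.com>
    sub(/^.*@/, "", dom)          # drop everything up to and including the @
    sub(/>$/, "", dom)            # drop the closing >
    total[dom]++                  # every delivery attempt for this domain
    if ($2 == "d") failed[dom]++  # d = failure
}
END {
    for (dom in total)            # unset failed[dom] counts as 0
        printf "%s %d/%d failed (%.1f%%)\n", dom, failed[dom], total[dom], 100 * failed[dom] / total[dom]
}' sample.log | sort > log_faildomain.txt

cat log_faildomain.txt
```

The output is piped through sort because awk does not promise any particular order when looping over an array with for (dom in total).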

So where's your code which shows what you've done and where you have problems? This forum is for giving you hints to help yourself; we are not supposed to present ready-made solutions on "demand".

tehmario

ASKER

Hey,
First off, thanks for sticking with me (I'm a stubborn guy).
Right, based on your code I tried several things and here's the one which is closest to what I want but still not good enough:

#!/bin/awk -f
BEGIN { ov_count = 0; }
{
    if ($1 == "d" && $2 == "z") {
        n = split($7, Domain_name, "@")
        for (i = 1; i <= n; i++) {
            print (Domain_name[1]);
        }
        printf("E-mail failed\n");
        ov_count++
    }
}
END { printf ("Total Failed mails is %d", ov_count); }

CURRENT (WRONG) OUTPUT
Let's say I have the following lines:
d z 1004205671.221525500 1004205671.296391500 1004205731.329531500 1423 <RIJNSBURG@Frankrijk.com> remote.MUIDERBERG@Zwitserland.com 5085 81
d z 1004205671.221525500 1004205671.296391500 1004205731.329531500 1423 <AMSTELHOEK@Ethiopie.com> remote.AMSTELHOEK@Chili.com 5085 81

The output would be:
<RIJNSBURG
Frankrijk.com>
E-mail failed
<AMSTELHOEK
Ethiopie.com>
Total Failed mails is 2

AND this is totally wrong! I've tried to edit it more and more, but I get the same result every time.

Let's say I have 5 Failed mails from the domain "Frankrijk.com", and I only want to show a number behind the domain which tells me how many times a mail has failed coming from that Domain.

PREFERRED OUTPUT
Let's say I have 4 failed mails for a random Domain "Frankrijk.com" and 6 mails for a random Domain called "Ethiopie.com". Then the following output is what I'm striving for:

Domain:
Frankrijk.com> 4 mails failed
Ethiopie.com> 6 mails failed
Total number of failed mails: 10 .

I know I have been stubborn, but it's just that I was very unsure about posting any code which would be a waste of your time.
I hope this helps.
Thanks again,
- Mario

tehmario

ASKER

Edit:
Also tried the following script (which didn't work either but was close):

#!/bin/awk -f
{
if ($1 == "d" && $2 == "z"); then
gsub(/.*@/,"",$7);
gsub(/>/,"",$7);
failmailArr[$7]++
}
END { for (x in failmailArr) print x,failmailArr[x] }

This prints out for example: Domainname.com ## (## meaning a number).
- However, it does not print the failed e-mails ( $1 = d & $2 = z for the mails to be read as a failed mail ).
- Also it does not print the total number of failed mails.
- Mario
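
A possible reason the last script counts every line: "; then" is shell syntax. In awk the semicolon ends the if with an empty body and then is not a keyword, so, as far as I can tell, the gsub and counting lines run for every input line. Below is a hedged sketch of one way to get the preferred output, not the accepted solution. Note that the format description in the question says d means failure and z means deferral, so the result letter is passed in as a parameter. The function name report, its arguments, and sample.log are made up for the example; the domain is taken from the sender field ($7), as in the scripts above.

```shell
cat > sample.log <<'EOF'
d z 1004205671.221525500 1004205671.296391500 1004205731.329531500 1423 <RIJNSBURG@Frankrijk.com> remote.MUIDERBERG@Zwitserland.com 5085 81
d z 1004205671.221525500 1004205671.296391500 1004205731.329531500 1423 <AMSTELHOEK@Ethiopie.com> remote.AMSTELHOEK@Chili.com 5085 81
d d 1004274655.308560500 1004274655.382174500 1004274655.417475500 1873 <SINT_PANCRAS@Groenland.com> local.VLAARDINGEN@Nieuwzeeland.com 22061 81
EOF

# report LETTER LABEL FILE: count delivery lines with result LETTER per sender domain.
report() {
    awk -v want="$1" -v label="$2" '
    $1 == "d" && $2 == want {     # no "; then": the braces after the pattern are the action
        dom = $7
        sub(/^.*@/, "", dom)      # work on a copy so $7 itself stays intact
        sub(/>$/, "", dom)
        count[dom]++
        total++
    }
    END {
        print "Domain:"
        for (dom in count)        # order of domains is not guaranteed
            printf "%s %d mails %s\n", dom, count[dom], label
        printf "Total number of %s mails: %d\n", label, total
    }' "$3"
}

report d failed   sample.log > log_faildomain.txt
report z deferred sample.log > log_defdomain.txt
cat log_faildomain.txt log_defdomain.txt
```

Running the function once per result letter gives the two separate files asked for in the original question; the percentages can then be calculated from the per-domain counts and the total.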

ASKER CERTIFIED SOLUTION

F. Dominicus


tehmario

ASKER

Thank you.