Solved

Perl or Shell script to sift through thousands of mail files & extract addresses & failure reason from failed mails only

Posted on 2011-03-10
Last Modified: 2012-05-11
I have thousands of mail files. Some of the mails in them failed
to be delivered & some are OK (i.e. they reached their destination).

Attached is one of these mail files. Below are the lines of text
that I need a Perl or shell script to search for, extract &
write to a file in the format below:

xia0h3i-@hotmail.com;Subject: Delivery Status Notification (Failure):
tec_kid@hotmail.com;Subject: Delivery Status Notification (Failure)
life.breathe@hotmail.com;Subject: Mail delivery failed: returning message to sender


Briefly, the search algorithm is as follows :

a) Search each file for a line containing both search strings
   "he following" & "failed". The problem email address is 2 lines
   below it: extract this address & write it to the output file as the
   first column, followed by ; (semicolon as column separator).

b) Then search backwards a few lines (the number of lines varies)
   for the string "Subject:", extract that line & add it as the 2nd
   column of the output file.

c) Then repeat step (a), searching forward for the next line containing
   "he following" & "failed", until the end of the file, & then do the
   same for the next mail file (all mail files have 4- or 5-digit
   filenames & all of them are in one directory).


======== key search strings / text: extracted from the attachment =========

Subject: Delivery Status Notification (Failure):
. . . . .

Delivery to the following recipients failed.

       xia0h3i-@hotmail.com

.......


Subject: Delivery Status Notification (Failure)
. . . . .
Delivery to the following recipients failed.

       tec_kid@hotmail.com
..........

Subject: Mail delivery failed: returning message to sender
. . . . .
recipients. This is a permanent error. The following address(es) failed:

  life.breathe@hotmail.com

..........


j.txt
Question by:sunhux
 
LVL 26

Assisted Solution

by:wilcoxon
wilcoxon earned 500 total points
ID: 35096683
This should do what you want.  To call the script:

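    script.pl file1 file2 file3 ... > output_file

#!/usr/bin/perl

use strict;
use warnings;

foreach my $fil (@ARGV) {
    # reset per file so state never leaks from one mail file into the next
    my ($subj, $in_err) = ('', 0);
    open my $in, '<', $fil or die "could not open $fil: $!";
    while (<$in>) {
        chomp;
        if (/^Subject:/) {
            # remember the most recent Subject: line
            $subj = $_;
        } elsif (/he\s+following\b.*\s+failed\b/) {
            # failure marker: the address follows on a later line
            $in_err = 1;
        } elsif ($in_err) {
            if (/^\s*(\S+@\S+)\s*$/) {
                # a line holding only an address: emit address;subject
                print "$1;$subj\n";
            } elsif (not /^\s*$/) {
                # any other non-blank line ends the address block
                $in_err = 0;
            }
        }
    }
    close $in;
}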

 

Author Comment

by:sunhux
ID: 35096691

3 corrections / additional requirements to what I posted above:

1) Remove "failed" from the search string, as I came across the line
   below, which does not contain the string "failed":
   "The following addresses had permanent delivery errors"

2) I came across one example (shown below this list) where the search
   string "he following" & the email address to be extracted are on
   the same line. If this 2nd point can't be handled by the
   same script, feel free to write a separate script.

3) After extraction, I would like to sort the output by the 2nd
   column as primary key & the 1st column as secondary sort key
   (remember the columns are separated by ; / semicolon).

Example for point 2:

Subject: Delivery Status Notification (Failure)
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status; boundary="onJ2O.4dPFG2WYz.1+z937.67NgUs8"
. . . . .

The following message to <kayn73@starhub.net.sg> was undeliverable.
The reason for the problem:
......
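
For the sorting requirement (2nd column as primary key, 1st column as secondary key), one option is to post-process an output file that has already been produced. The following is only a minimal sketch under that assumption; the sub name sort_by_subject_then_address is made up for illustration, and it assumes every line has the form address;subject with the first semicolon as the separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort "address;subject" lines by subject (column 2) first,
# then by address (column 1). Hypothetical helper, not from the thread.
sub sort_by_subject_then_address {
    my @lines = @_;
    return map  { $_->[0] }
           sort { $a->[2] cmp $b->[2] or $a->[1] cmp $b->[1] }
           map  {
               # split on the first semicolon only, so a subject
               # that itself contains ";" stays in one piece
               my ($addr, $subj) = split /;/, $_, 2;
               [ $_, $addr, defined $subj ? $subj : '' ];
           } @lines;
}

# Only runs when file names are given on the command line.
if (@ARGV) {
    chomp(my @lines = <>);
    print "$_\n" for sort_by_subject_then_address(@lines);
}
```

Saved under a name of your choice (say sort_output.pl, again just an example name), it could be run as: sort_output.pl output_file > sorted_file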
 

Author Comment

by:sunhux
ID: 35096726

Wow that's fast Wilcoxon.  However, I've made some corrections to my
earlier requirements, sorry about that, hope you can amend the script.


Lastly, can I run the script by passing the * wildcard character to represent
all files in that directory,  ie :
    script.pl * > output_file
 
LVL 31

Expert Comment

by:farzanj
ID: 35096729
I looked through your example.  There is no "he following"  &  "failed".  I searched the attached file.  Could you give a file that contains the terms you need?

Second, your algorithm is a little confusing.  A simpler description would be appreciated.
 
LVL 26

Accepted Solution

by:wilcoxon
wilcoxon earned 500 total points
ID: 35096794
To take into account your changes....

This modified version should handle everything...
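#!/usr/bin/perl

use strict;
use warnings;

my %bad;    # $bad{subject}{address} = number of times seen
foreach my $fil (@ARGV) {
    # reset per file so state never leaks from one mail file into the next
    my ($subj, $in_err) = ('', 0);
    open my $in, '<', $fil or die "could not open $fil: $!";
    while (<$in>) {
        chomp;
        if (/^Subject:/) {
            $subj = $_;
        } elsif (/he\s+following\b/) {
            # [\w.-] also allows hyphens in the domain part
            if (/\b(\w\S+@[\w.-]+)\b/) {
                # address is on the marker line itself
                $bad{$subj}{$1}++;
            } else {
                # address is expected on a following line
                $in_err = 1;
            }
        } elsif ($in_err) {
            if (/^\s*(\S+@\S+)\s*$/) {
                $bad{$subj}{$1}++;
            } elsif (not /^\s*$/) {
                # any other non-blank line ends the address block
                $in_err = 0;
            }
        }
    }
    close $in;
}

# one line per subject/address pair, sorted by subject, then by address
foreach my $subj (sort keys %bad) {
    foreach my $addr (sort keys %{$bad{$subj}}) {
        print "$addr;$subj\n";
    }
}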

 
LVL 26

Assisted Solution

by:wilcoxon
wilcoxon earned 500 total points
ID: 35096807
Yes, you can run it as script.pl * > output_file.
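
With thousands of files, a bare * can in some environments run into the shell's argument-length limit ("Argument list too long"), and it also picks up anything else that happens to be in the directory. As a hedged alternative, the file list could be built inside Perl, keeping only names made of 4 or 5 digits as described in the question. This is a sketch only; the sub name mail_files_in is made up for illustration and is not part of the scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the plain files in $dir whose names consist of exactly
# 4 or 5 digits, in numeric order, with the directory prefixed.
# Hypothetical helper, not from the thread.
sub mail_files_in {
    my ($dir) = @_;
    opendir my $dh, $dir or die "could not open directory $dir: $!";
    my @names = grep { /\A\d{4,5}\z/ && -f "$dir/$_" } readdir $dh;
    closedir $dh;
    return map { "$dir/$_" } sort { $a <=> $b } @names;
}

# Only runs when a directory is given on the command line:
# prints one matching path per line.
if (@ARGV) {
    print "$_\n" for mail_files_in($ARGV[0]);
}
```

In the extraction script, the loop over @ARGV could then loop over mail_files_in($dir) instead, with $dir set to the mail directory.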
 
LVL 26

Expert Comment

by:wilcoxon
ID: 35096820
farzanj, not sure why you're not finding it - "he following" and "failure" occur 3 times in j.txt (as expected since there are 3 failures in the log).
 

Author Comment

by:sunhux
ID: 35096842


The attachment is the actual file & to re-illustrate , I repost those
lines with the search strings (those underlined by ^^^) below :

Delivery to the following recipients failed
             ^^^^^^^^^^^^

recipients. This is a permanent error. The following address(es) failed:
                                        ^^^^^^^^^^^^

The following message to <kayn73@starhub.net.sg> was undeliverable.
 ^^^^^^^^^^^^


"he" is a substring of both "The" & "the" (but since it can be sometimes capital
T & sometimes small t,  I indicated the search string as "he following" )
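
As an alternative to searching for the substring "he following", the whole phrase can be matched case-insensitively with the /i modifier, which covers both "The" and "the". The sketch below is illustrative only: the sub names has_marker and first_address are made up, and the address pattern is a deliberately loose approximation, not a full email validator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True (1) if the line contains "the following" in any letter case.
# Hypothetical helper, not from the thread.
sub has_marker {
    my ($line) = @_;
    return $line =~ /\bthe\s+following\b/i ? 1 : 0;
}

# Return the first address-like token on the line, or undef if none.
# Loose pattern: local part of word chars, ".", "+", "-";
# domain of word chars, ".", "-", ending in a word char.
sub first_address {
    my ($line) = @_;
    return $line =~ /([\w.+-]+@[\w.-]+\w)/ ? $1 : undef;
}
```

For the same-line case quoted above, first_address would be applied to any line for which has_marker is true, falling back to the following lines when it returns undef.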
 

Author Comment

by:sunhux
ID: 35096856

I'll test it out tomorrow : it's now 1am my time
 

Author Closing Comment

by:sunhux
ID: 35108046
Marvellous