Link to home
Start Free TrialLog in
Avatar of mytilig
mytilig

asked on

Parsing and extracting data from an email through perl

Hello-
I have a need to read all the emails that come to a particular mailbox, parse it out, and retreive some data out and save it into a database.
I am planning to use the Mail::POP3Client module to parse the email. I can insert it into the db as well.
What I need help is with the data extraction.

I want to know a neat way to extract data out. For eg in the sample here, I want to insert Joe into the first name field in the database, Smith as last name, similary Address1, 2 etc. City state zip is more tricky because they appear on the same line. there is comma between city and state and space between state and zip.

In addition, I want to save the complete conclusion part of it as well. This is the last part of the email and appears with the heading "CONCLUSION"

The data I receive will consistenly be in this format all the time. Are there any perl modules that will help this kind of extraction or should I be getting good at my regular expressions?

Sample email that I receive:
--------------------------------------------------
From
Return-Path:
Received:
Date:
Message-ID:
To:  
etc

hello here is the info blah blah...

Name:                 Joe Smith
Address:              Address line1
                      Address line2 (optional)
                      city, state zip
Country of Residence: United States
Primary Phone:        xxx-xxx-xxxx
Secondary Phone:
Email Address:        email

******CONCLUSION******
Thank you for the oppurtunity blah blah

--------------------------------------------------
ASKER CERTIFIED SOLUTION
Avatar of Kim Ryan
Kim Ryan
Flag of Australia image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of mensuck
mensuck

Hi

I have something that is sort of old, that I quickly converted to do what you are wanting.



      use DBI;
      my $files = 'email.txt';
      my $start = 0;
      my $lines = 0;
      my $found = 0;
      my $datas = '';
      my @parts;
      my @ms_db = ( 'localhost', '3306', 'database_name', 'user_name', 'user_password', 'insert_table_name' );

      open FILE, '<' . $files;

      while ( <FILE> )
      {
            tr/\r\n\t//d;

            if ( $start != 1 && m/^$/ )
            {
                  $start = 1;
            }

            elsif ( $start == 1 )
            {
                  s/\:/\: /g;

                  if ( m/^Name\:/ )
                  {
                        s/^.*?\s+|\s+$//g;

                        if ( m/\s/ )
                        {
                              my @name = split(/\s/, $_);

                              push @parts,$name[0];
                              push @parts,$name[1];
                        }
                        else
                        {
                              push @parts,$_;
                              push @parts,'';
                        }
                  }

                  elsif ( m/^Address\:/|| $lines > 0 )
                  {
                        s/^.*?\s+|\s+$//g;

                        if ( $lines == 0 )
                        {
                              push @parts,$_;

                              $lines++;
                        }

                        elsif ( $lines == 1 )
                        {
                              push @parts,$_;

                              $lines++;
                        }
                        else
                        {
                              $lines = 0;

                              foreach my $addy (split(/\s/, $_))
                              {
                                    if ( $addy =~ m/^[0-9]{5}$/ )
                                    {
                                          push @parts,$addy;
                                    }
                                    elsif ( $addy =~ m/^.*?\,$/ )
                                    {
                                          push @parts,substr ( $addy, 0, -1 );
                                    }
                                    else
                                    {
                                          push @parts,$addy;
                                    }
                              }
                        }
                  }

                  elsif ( m/Residence\:|Phone\:|Email/ )
                  {
                        s/^.*?\:\s+|\s+$//g;
                        push @parts,$_;
                  }
                  else
                  {
                        if ( m/^.*?CONCLUSION.*?$/ )
                        {
                              $found = 1;
                        }
                        elsif ( $found )
                        {
                              $datas .= $_ . "\n";
                        }
                  }
            }
      }

      close FILE;

      push @parts,$datas;

      my $obj = &make_connection;

      $obj->do("INSERT INTO " . $ms_db[5] . " (id,first_name,last_name,addy_one,addy_two,city_name,state_name,zip_code,country_name,phone_one,phone_two,user_email,the_conclusion) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)",undef, (NULL,$parts[0],$parts[1],$parts[2],$parts[3],$parts[4],$parts[5],$parts[6],$parts[7],$parts[8],$parts[9],$parts[10],$parts[11]));

      $obj->disconnect();

      sub make_connection
      {
            return DBI->connect("dbi:mysql:" . $ms_db[2] . ":" . $ms_db[0] . ":" . $ms_db[1], $ms_db[3], $ms_db[4]);
      }


Suzanne
Avatar of mytilig

ASKER

Hi Teraplane,

The Lingua::EN::AddressParse did not work correctly for me. Can you tell me what I am doing wrong?
Here is my code:

--------------------------------------------------------------------------------------
my %args = ( country => 'United States',
 auto_clean => 1, force_case => 1,
 abbreviate_subcountry => 0,
 abbreviated_subcountry_only => 1 );
 
my $address = new Lingua::EN::AddressParse(%args);
$error = $address->parse("3430, warburton avenue, Apt. No. xx,Maywood Apts,Vallejo, CA 95000 ");
%my_address = $address->components;
 $suburb = $my_address{suburb};
$correct_casing = $address->case_all;

foreach $comp ( sort keys %my_address)
{
print "$comp\t\t: $my_address{$comp}\n";
}
--------------------------------------------------------------------------------------
Output I receive is
country            :
post_box            :
post_code            :
property_identifier            :
property_name            :
street            :
street_type            :
subcountry            :
suburb            :
Address parsing is a very complex business, as there are a huge number of valid syntax patterns. This module can cope with many, but not all. I reaaranged your data as follows and it parsed OK. Also, it was developed with Australian addresses. I never really got the US style  of  street num, street name, apartment number, suburb working properly.

$error = $address->parse("Apt 2, 3430 warburton avenue, Vallejo, CA 95000 ");

All you can really do is filter out the mis matches for hand processing
Avatar of mytilig

ASKER

I see that you re-arranged just the apt number and street. City, State and Zip is same as the order I had, still I did not get any output in post_code, or suburb(if that is where state should go)
I asked this because Address1 and Address2 are low priority for me. I will be somewhat ok if I can succesfully parse city state and zip.

I do not have control on the input format, that website is owned by someone else and I am just tasked to parsing the fields out.
That would mean I cannot use this module, correct?

I tried Suzanne's code above, but it did not parse correctly for majority of the input either...



with the input of "Apt 2, 3430 warburton avenue, Vallejo, CA 95000 "  the output is

country:
post_box:
post_code: 95000
property_identifier: Apt 2 3430
property_name:
street: warburton
street_type: avenue
subcountry: CA
suburb: Vallejo

But if the parser gets confused with the start of the address, it cannot recover to retrieve city and post code. Is all your data in this format, or can it handle other cases? If the parser cannot handle most of your data, you may be better of writing your own rexexp, or adapting the grammer I have just to parse city state and zip.

Avatar of mytilig

ASKER

Yes all my data is in this format.
---------------------------------------------------
hello here is the info blah blah...

Name:                 Joe Smith
Address:              Address line1
                      Address line2 (optional)
                      city, state zip
Country of Residence: United States
Primary Phone:        xxx-xxx-xxxx
Secondary Phone:
Email Address:        email

******CONCLUSION******
Thank you for the oppurtunity blah blah
---------------------------------------------------
And I have only US addresses.

Thanks so much for your time!
I was referring to the Address line1 and 2 components. Is there always an apartment after the street name? My guess is that some of your address data can be parsed and some will not fit the patterns it can cope with, in which case you could apply another regexp to try and salvage city, post code and state.
 
Avatar of ozo
Avatar of mytilig

ASKER

use Geo::StreetAddress::US;
  my $hashref = Geo::StreetAddress::US->parse_location(
                "1005 Gravenstein Hwy N, Sebastopol CA 95472" );

From this reference to a hash, how do I retreive the address specifiers?
For eg how what is the variable that holds the zip code?
print("Zip is:",$hashref->{zip});
print("City is:",$hashref->{city});

Avatar of mytilig

ASKER

Did try it but does not work.

print $hashref->{'zip'} ;
just prints a blank..
Avatar of mytilig

ASKER

my input is
my $hashref = Geo::StreetAddress::US->parse_location("1005 Gravenstein Hwy N, Sebastopol CA 95472" );
I ran your exact code and it prints 95472 for zip code. What version of perl are you running?
Avatar of mytilig

ASKER

I have perl 5.6.1
Looks like my module might not have got installed correctly.

I re-downloaded the module and installed it once again, I got some errors for make test. I do remember seeing them the  last time I installed them too. Guess I ignored it since my scripts were compiling fine.

Pasted below. Any thoughts on why the module would not install correctly?


C:\Temp\Geo-StreetAddress-US-0.99>perl makefile.PL
Writing Makefile for Geo::StreetAddress::US

C:\Temp\Geo-StreetAddress-US-0.99>c:\perl\bin\nmake

Microsoft (R) Program Maintenance Utility   Version 1.50
Copyright (c) Microsoft Corp 1988-94. All rights reserved.


C:\Temp\Geo-StreetAddress-US-0.99>c:\perl\bin\nmake test

Microsoft (R) Program Maintenance Utility   Version 1.50
Copyright (c) Microsoft Corp 1988-94. All rights reserved.

        C:\Perl\bin\perl.exe -Mblib -IC:\Perl\lib -IC:\Perl\lib -e "use Test::Harness qw(&
runtests $verbose); $verbose=0; runtests @ARGV;" t\01_parser.t
Using C:/Temp/Geo-StreetAddress-US-0.99/blibt\01_parser....Using C:/Temp/Geo-StreetAddr
ess-US-0.99/blib
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 2#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 3#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 4#     Structures begin differing at:
#          $got->{zip} = undef
#     $expected->{zip} = '95472'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 5#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 6#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 7#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 8#     Structures begin differing at:
#          $got->{zip} = undef
#     $expected->{zip} = '95472'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 9#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 10#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'UT'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 11#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'DC'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 12#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'PA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 13#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 14#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 15#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 16#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 17#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 18#     Structures begin differing at:
#          $got->{zip} = undef
#     $expected->{zip} = '95472'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 19#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 20#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 21#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 22#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'DC'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 23#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 24#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 25#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 26#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 27#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'MN'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 28#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 29#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 30#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 31#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 32#     Structures begin differing at:
#          $got->{zip} = undef
#     $expected->{zip} = '95472'
#     Failed test (t\01_parser.t at line 366)
t\01_parser....NOK 33#     Structures begin differing at:
#          $got->{state} = undef
#     $expected->{state} = 'CA'
# Looks like you failed 32 tests of 38.
t\01_parser....dubious
        Test returned status 32 (wstat 8192, 0x2000)
DIED. FAILED tests 2-33
        Failed 32/38 tests, 15.79% okay
Failed Test   Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t\01_parser.t   32  8192    38   32  84.21%  2-33
Failed 1/1 test scripts, 0.00% okay. 32/38 subtests failed, 15.79% okay.
NMAKE : fatal error U1077: 'C:\WINDOWS\system32\cmd.exe' : return code '0x2'
Stop.

C:\Temp\Geo-StreetAddress-US-0.99>c:\perl\bin\nmake install

Microsoft (R) Program Maintenance Utility   Version 1.50
Copyright (c) Microsoft Corp 1988-94. All rights reserved.

Skipping C:\Perl\site\lib\Geo\StreetAddress\US.pm (unchanged)
Writing C:\Perl\site\lib\auto\Geo\StreetAddress\US\.packlist
Appending installation info to C:\Perl\lib/perllocal.pod
Avatar of mytilig

ASKER

use Geo::StreetAddress::US;
my $hashref = Geo::StreetAddress::US->parse_location("1005 Gravenstein Hwy N, Sebastopol CA 95472" );

while ( my ($key, $value) = each(%$hashref) ) {
print "$key => $value\n";
}

prints

state =>
zip =>
suffix =>
city =>
prefix =>
type =>
number =>
street =>

So looks like none of the keys have any values. I did try other address format from the doc for the module , but same result.