Solved

Can you write me a script that extracts email addresses from local files.

Posted on 2003-12-02
Last Modified: 2010-03-04
Hello,
     Can you write me a script that extracts email addresses from local files?

Thanks,
PD GRAVES
Question by:QUESTOMNI
14 Comments
 
LVL 20

Expert Comment

by:jmcg
ID: 9857407
On the face of it, this sounds like an application whose primary use would be harvesting email addresses for spam. Please reassure us as to your intentions.

Author Comment

by:QUESTOMNI
ID: 9857451
No. I have opt-in email I want to process faster so I can make the most efficient use of my time.

Author Comment

by:QUESTOMNI
ID: 9857480
I also get emails from those contacting me. Many are businesses.

Author Comment

by:QUESTOMNI
ID: 9857515
When can I expect a response? Question:
Can you write me a script that extracts email addresses from local files?

Thanks,
PD GRAVES

Author Comment

by:QUESTOMNI
ID: 9857657
Can you write the script so it reads the file from my unix? I can upload the file and the script can read it and extract the email addresses. Is that do-able?
LVL 11

Expert Comment

by:turn123
ID: 9857683
What format is your local file in?

Author Comment

by:QUESTOMNI
ID: 9857954
It's Windows 98.

Author Comment

by:QUESTOMNI
ID: 9857962
Can you write the script so it reads the file from my unix? I can upload the file and the script can read it and extract the email addresses. Is that do-able? It is likely easier.
LVL 18

Expert Comment

by:kandura
ID: 9858188
You could make it as simple as this:

#!/usr/bin/perl

while(<>)
{
      chomp;
      my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
      print join($/, @emails), $/ if @emails;
}

#end script

Use it like this:

emailextractor.pl < your_input_file > your_list_of_emailaddresses

Author Comment

by:QUESTOMNI
ID: 9859505
I am very much a novice. It looks like the while(<>) is reading each line. Then you chomp the \n. Then you give me a variable set to a valid email address.

After that I'm kind of lost. You seem to be joining the special variable \n at the end of @emails. Then you put $/ if @emails; at the end of that. That's where you lose me. Can you tell me what's happening with that?

Can you write it so it only extracts the From: email@address.com email address, one under the other?

I don't understand the following at all. Please explain:

Use it like this:

emailextractor.pl < your_input_file > your_list_of_emailaddresses

Thanks,
PD GRAVES
LVL 18

Accepted Solution

by:
kandura earned 400 total points
ID: 9861176
Ok, here's an attempt at an explanation:

### loop over the input: lines from any files named on the command line, or from STDIN if none are given
while(<>)
{
###  with the current line,
###    remove newline
     chomp;
###    find all email addresses on this line
###    the regexp captures them in the @emails array
     my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;

###    print out the email addresses with a newline character ($/) between them, and one at the end
###    _if_ @emails contains something
     print join($/, @emails), $/ if @emails;
}

The usage example is a command-line statement: you type it into a shell, or use it as a crontab entry, or whatever you like.
It says that the input for the script (i.e. what <> will iterate over) should come from the file your_input_file (its contents are fed into the script
by the < operator of your shell), and the output of the script should be redirected (by the > operator) from STDOUT to the file your_list_of_emailaddresses.
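If the shell redirection is the confusing part, the same job can be done inside the script by opening the files yourself. This is only a sketch under assumptions: the helper names extract_emails and extract_file are made up for illustration, and the two file names are expected as command-line arguments. The pattern is the same one used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every address-like string found in one line,
# using the same pattern as the script above.
sub extract_emails {
    my ($line) = @_;
    return $line =~ /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
}

# Hypothetical helper: reads $in_path and writes one address per line to $out_path.
# This does explicitly what the shell's < and > operators do for the original script.
sub extract_file {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} "$_\n" for extract_emails($line);
    }
    close($out) or die "Cannot close $out_path: $!";
    close($in);
}

# Only run when both file names were given on the command line.
extract_file(@ARGV) if @ARGV == 2;
```

Saved under a name of your choice, it would be run as: perl that_script.pl your_input_file your_list_of_emailaddresses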


If you want to match only email addresses on lines that start with From: , then change the script to:
#!/usr/bin/perl

while(<>)
{
     chomp;
     next unless /^From:/i;                #### <------ new line to check only lines that start with From:
     my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
     print join($/, @emails), $/ if @emails;
}

#end script
LVL 5

Assisted Solution

by:ZiaTioN
ZiaTioN earned 100 total points
ID: 9870436
Just on a side note. Can you step me through the regexp string "my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;" ?

I have wondered about some of these and cannot seem to find the interconnection between them.

Here is what I know about the string:

my = declaration while "use strict" is in use;
@emails = this is of course the array;
// = pattern match indicators or regexp indicators;
[] = character class designators;
\w = matches a word with alphanumeric characters;
&_ = ********************************** # don't know;
() = grouping of items or memory;
?: = *********************************** # don't know;
\. = this appears to be an escaped concatenation character but not sure what the purpose here is;
*\@ = ********************************* # don't know;
/g = looks for multiple iterations of the matched pattern on each line.

I know some regexp but just do not see how you came up with this string to match an email address format.
I would really appreciate it if you could break it down for me. Thanks.
LVL 18

Assisted Solution

by:kandura
kandura earned 400 total points
ID: 9870676
You were on the right track with most of the items. Most everything within a character class is treated as either a shortcut such as \w or \d, or the range indicator (- in things like a-z), or in all other cases as a literal character. So [a&-] means a character class which matches either a or & or -.
?: is one of the special extended pattern indicators; this one means clustering, but not capturing. It allows you to group a piece of a regexp together, so that you can repeat it. But the part of the string that matches is not captured in one of the special $1,$2 etc. variables, and in our case it's not captured into the @emails array.
* means "match the previous expression zero or more times".
\@ is a literal @ character. All characters that may have a special meaning in a regexp can be escaped with a backslash to get a literal character.
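A small sketch of why the ?: matters here (the sample string and variable names are made up for illustration). In list context a match returns one value per capturing group, so a capturing inner group would put extra fragments into the result alongside the full match:

```perl
use strict;
use warnings;

my $text = 'a.b.c';

# Inner group captures: list context returns both groups of the match.
my @capturing     = $text =~ /(\w+(\.\w+)*)/;

# Inner group is (?: ), so only the outer group is returned.
my @non_capturing = $text =~ /(\w+(?:\.\w+)*)/;

print scalar(@capturing), "\n";      # 2 elements: 'a.b.c' and '.c'
print scalar(@non_capturing), "\n";  # 1 element:  'a.b.c'
```

With the capturing version, the @emails array in the script would pick up pieces like '.c' next to each address.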

Hope this helps:

/
  (                                    # start of capture: everything matching the following regexp will be stored
    [\w&_-]                        # character class consisting of a \w (alphanumeric) character, a & or a _ or a -
                                        # this matches stuff like kandura, jack&jill, or i-am-an-_-addict
       +                               # we need one or more of those at the front
    (?:                                # this starts a group, but doesn't capture it in one of the $1,$2 etc. variables
        \.?                             # a literal . (possibly; the ? means we don't need it)
                                         #    note that . in a regexp means "match any character", so we escape it to mean a real .
        [\w&_-]+                   # same character class as above
    )                                   # end of this group
    *                                  # we allow zero or more of this group. so this groups matches .blah or .yes.and.no
    \@                                 # a literal @ sign, the center of an email address ;^)
                                         # the next part matches domain names
    [\w_-]+                          # character class containing alphanumeric, _ and -
    (?:                                 # start of a group; again, no capturing
        \.?                              # possibly a dot
        [\w_-]+                      # one or more of the allowed characters
     )*                                 # this group can occur zero or more times
                                          #   I chose to use * to allow for user\@localhost type addresses, where the domain part consists of only one part
                                          # the optional group catches higher level domain parts (such as .com or .localdomain)
  )                                      # end of capturing group
/gx
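To try the commented form out, here is a sketch that compiles it once with qr//x and reuses it. The names $email_re and find_emails are made up for illustration. Two deliberate differences from the breakdown above: the explicit _ is left out of the character classes because \w already includes it, and the comments avoid / and unescaped @ or $ because Perl looks for the closing delimiter and interpolates variables before it strips /x comments.

```perl
use strict;
use warnings;

# The pattern from the breakdown, compiled once so it can be reused.
my $email_re = qr/
    (
        [\w&-]+              # local part: first run of allowed characters
        (?: \.? [\w&-]+ )*   # further runs, optionally preceded by a dot
        \@                   # literal at-sign
        [\w-]+               # first part of the domain
        (?: \.? [\w-]+ )*    # further parts such as .com
    )
/x;

# Hypothetical helper: returns all address-like strings in $text.
sub find_emails {
    my ($text) = @_;
    return $text =~ /$email_re/g;
}

print "$_\n" for find_emails('Contact jack&jill@example.com or root@localhost today.');
```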
LVL 84

Expert Comment

by:ozo
ID: 9933197
There can be many characters other than [\w&_-] in email addresses
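A rough sketch of a wider pattern, assuming the unquoted local-part characters that RFC 5322 calls atext (letters, digits and !#$%&'*+/=?^_`{|}~-). The names $atext, $atom, $label and find_emails_wide are made up for illustration. It still ignores quoted local parts, comments and domain literals, and because = is a legal character, text such as name=user@host would be captured whole, so treat it as a starting point rather than a validator.

```perl
use strict;
use warnings;

# Punctuation allowed in an unquoted local part (assumption: RFC 5322 atext).
# quotemeta escapes each character so it is safe inside a character class.
my $atext = quotemeta q{!#$%&'*+/=?^_`{|}~-};

my $atom  = qr/[A-Za-z0-9$atext]+/;   # one dot-free run of the local part
my $label = qr/[A-Za-z0-9-]+/;        # one dot-free part of the domain

# Dots are required between runs here, which also avoids ambiguous matching.
my $wide_re = qr/($atom(?:\.$atom)*\@$label(?:\.$label)*)/;

# Hypothetical helper: returns all address-like strings in $text.
sub find_emails_wide {
    my ($text) = @_;
    return $text =~ /$wide_re/g;
}

print "$_\n" for find_emails_wide(q{Write to o'brien+news@example.co.uk.});
```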