• Status: Solved
  • Priority: Medium
  • Security: Public

Can you write me a script that extracts email addresses from local files?

Hello,
     Can you write me a script that extracts email addresses from local files?

Thanks,
PD GRAVES
 
jmcg (Owner) commented:
On the face of it, this sounds like an application whose primary use would be harvesting email addresses for spam. Please reassure us as to your intentions.
 
QUESTOMNI (Author) commented:
No. I have opt-in email I want to process faster so I can make the most efficient use of my time.
 
QUESTOMNI (Author) commented:
I also get emails from those contacting me. Many are businesses.
 
QUESTOMNI (Author) commented:
When can I expect a response?
 
QUESTOMNI (Author) commented:
Can you write the script so it reads the file from my Unix system? I can upload the file and the script can read it and extract the email addresses. Is that doable?
 
turn123 commented:
What format is your local file in?
 
QUESTOMNI (Author) commented:
It's Windows 98.
 
QUESTOMNI (Author) commented:
Can you write the script so it reads the file from my Unix system? I can upload the file and the script can read it and extract the email addresses. Is that doable? It is likely easier.
 
kandura commented:
You could make it as simple as this:

#!/usr/bin/perl

while(<>)
{
      chomp;
      my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
      print join($/, @emails), $/ if @emails;
}

#end script

Use it like this:

emailextractor.pl < your_input_file > your_list_of_emailaddresses
 
QUESTOMNI (Author) commented:
I am very much a novice. It looks like the while(<>) is reading each line. Then you chomped the \n. Then you give me a variable set to a valid email address.

After that I'm kind of lost. You seem to be joining the special variable \n at the end of @emails. Then you put $/ if @emails;
at the end of that. That's where you lose me. Can you tell me what's happening with that?

Can you write it so it only extracts the From: email@address.com
email address, one under the other?

I don't understand the following at all. Please explain:

Use it like this:

emailextractor.pl < your_input_file > your_list_of_emailaddresses

Thanks,
PD GRAVES
 
kandura commented:
Ok, here's an attempt at an explanation:

### loop over the input coming from STDIN or from files named on the command line
while(<>)
{
###  with the current line,
###    remove newline
     chomp;
###    find all email addresses on this line
###    the regexp captures them in the @emails array
     my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;

###    print out the email addresses with a newline character ($/) between them, and one at the end
###    _if_ @emails contains something
     print join($/, @emails), $/ if @emails;
}

The usage example is a command-line statement: you type it into a shell, put it in a crontab entry, or run it however you like.
It says that the input for the script (i.e. what <> will iterate over) should come from the file your_input_file (its contents are fed into the script
by the < operator of your shell), and that the output of the script should be redirected (by the > operator) from STDOUT to the file your_list_of_emailaddresses.
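
To try the same loop without involving the shell at all, here is a minimal sketch that wraps it in a sub taking any filehandle. The sub name extract_emails and the sample text are made up for illustration; the regexp is the one from the script above. An in-memory filehandle stands in for your_input_file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): read lines from
# any filehandle and return every address the regexp finds.
sub extract_emails {
    my ($fh) = @_;
    my @found;
    while (my $line = <$fh>) {
        chomp $line;
        # In list context, a /g match returns all captured addresses on the line.
        push @found, $line =~ /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
    }
    return @found;
}

# Made-up sample input; the in-memory filehandle stands in for a real file.
my $sample = "From: alice\@example.com\n"
           . "no address on this line\n"
           . "cc bob.smith\@mail.example.org, carol\@localhost\n";

open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my @emails = extract_emails($fh);
close $fh;

# One address per line, as in the original script.
print "$_\n" for @emails;
```

Passing \*STDIN instead of $fh should give the same behaviour as the redirected version, assuming the script is started with something like perl emailextractor.pl < your_input_file.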


If you want to match only email addresses on lines that start with From: , then change the script to:
#!/usr/bin/perl

while(<>)
{
     chomp;
     next unless /^From:/i;                #### <------ new line to check only lines that start with From:
     my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
     print join($/, @emails), $/ if @emails;
}

#end script
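
As a quick check of the From: filter, the sketch below runs the same two regexps over a few made-up header lines instead of a real file. The sub name from_addresses and the sample headers are hypothetical; only the patterns come from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the addresses found on "From:" lines only.
sub from_addresses {
    my @lines = @_;
    my @found;
    for my $line (@lines) {
        next unless $line =~ /^From:/i;   # skip To:, Subject:, body text, ...
        push @found, $line =~ /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
    }
    return @found;
}

# Made-up message headers for illustration.
my @headers = (
    'From: Jane Doe <jane.doe@example.com>',
    'To: someone.else@example.net',
    'from: sales@example.org',
    'Subject: mentions nobody@example.com in passing',
);

# Prints the two From: addresses, one under the other.
print "$_\n" for from_addresses(@headers);
```

Because of the /i flag, the lower-case "from:" line is picked up as well; the To: and Subject: lines are skipped even though they contain addresses.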
 
ZiaTioN commented:
Just on a side note. Can you step me through the regexp string "my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;" ?

I have wondered about some of these and cannot seem to find the interconnection between them.

Here is what I know about the string:

my = declaration while "use strict" is in use;
@emails = this is of course the array;
// = pattern match indicators or regexp indicators;
[] = character class designators;
\w = matches a word with alphanumeric characters;
&_ = ********************************** # don't know;
() = grouping of items or memory;
?: = *********************************** # don't know;
\. = this appears to be an escaped concatenation character, but I'm not sure what the purpose is here;
*\@ = ********************************* # don't know;
/g = looks for multiple iterations of the matched pattern on each line.

I know some regexp but just do not see how you came up with this string to match an email address format.
I would really appreciate it if you could break it down for me. Thanks.
 
kandura commented:
You were on the right track with most of the items. Almost everything within a character class is treated as a shortcut such as \w or \d, as the range indicator (the - in things like a-z), or otherwise as a literal character. So [a&-] means a character class that matches either a or & or -. Note that \w matches a single word character (a letter, a digit, or an underscore), not a whole word.
?: is one of the special extended pattern indicators; this one means clustering, but not capturing. It allows you to group a piece of a regexp together so that you can repeat it. The part of the string that matches is not captured in one of the special $1, $2 etc. variables, and in our case it's not captured into the @emails array.
* means "match the previous expression zero or more times".
\@ is a literal @ character. All characters that may have a special meaning in a regexp can be escaped with a backslash to get a literal character.
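
To make the clustering-versus-capturing point concrete, here is a small sketch that runs the pattern twice over a made-up sentence: once as written, and once with the inner (?: ) groups turned into plain ( ) groups. The variable names and the sample text are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'write to john.q.public@example.com today';   # made-up sample

# Non-capturing (?: ) groups: only the outer ( ) contributes to the list.
my @clustered = $text =~ /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;

# Same pattern with plain ( ) inside: every group now returns a value,
# so the list also holds the last repetition of each inner group.
my @captured  = $text =~ /([\w&_-]+(\.?[\w&_-]+)*\@[\w_-]+(\.?[\w_-]+)*)/g;

print "clustered: @clustered\n";
print "captured:  @captured\n";
```

With plain parentheses the list picks up fragments such as .public and .com next to the full address, which is why the script uses (?: ) for the inner groups.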

Hope this helps:

/
  (                       # start of capture: everything matching the following regexp will be stored
    [\w&_-]               # character class: a \w character (letter, digit or underscore), a & or a -
                          #   this matches stuff like kandura, jack&jill, or i-am-an-_-addict
    +                     # we need one or more of those at the front
    (?:                   # this starts a group, but doesn't capture it in a numbered capture variable
      \.?                 # a literal . (possibly; the ? means we don't need it)
                          #   note that . in a regexp means "match any character", so we escape it to mean a real .
      [\w&_-]+            # same character class as above
    )                     # end of this group
    *                     # we allow zero or more of this group, so the group matches .blah or .yes.and.no
    \@                    # a literal @ sign, the center of an email address ;^)
                          # the next part matches domain names
    [\w_-]+               # character class containing alphanumerics, _ and -
    (?:                   # start of a group; again, no capturing
      \.?                 # possibly a dot
      [\w_-]+             # one or more of the allowed characters
    )*                    # this group can occur zero or more times
                          #   I chose to use * to allow for user-at-localhost type addresses, where the domain consists of only one part
                          # the optional group catches higher level domain parts (such as .com or .localdomain)
  )                       # end of capturing group
/gx
 
ozo commented:
There can be many characters other than [\w&_-] in email addresses.
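
As a rough illustration of that point, the sketch below compares the pattern from this thread with a wider one whose local part accepts the characters RFC 5322 lists as "atext" (letters, digits and !#$%&'*+-/=?^_`{|}~). The names $narrow, $atext and $wide and the sample addresses are made up, and restricting the domain to letters, digits and hyphens is an assumption. It is still not a full address parser: quoted local parts and comments, for instance, are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern from the thread.
my $narrow = qr/([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/;

# Wider local part: the RFC 5322 "atext" characters, separated by single dots.
my $atext = qr/[A-Za-z0-9!#\$%&'*+\/=?^_`{|}~-]/;
my $wide  = qr/($atext+(?:\.$atext+)*\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*)/;

# Made-up addresses that use characters outside [\w&_-].
for my $addr ('user+tag@example.com', q{o'brien@example.com}) {
    my ($n) = $addr =~ $narrow;
    my ($w) = $addr =~ $wide;
    print "$addr\n  narrow: $n\n  wide:   $w\n";
}
```

On these samples the narrow pattern silently drops everything up to the + or the apostrophe, while the wide one keeps the whole address.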