Solved

Can you write me a script that extracts email addresses from local files?

Posted on 2003-12-02
Last Modified: 2010-03-04
Hello,
     Can you write me a script that extracts email addresses from local files?

Thanks,
PD GRAVES
Question by:QUESTOMNI
14 Comments
 
LVL 20

Expert Comment

by:jmcg
ID: 9857407
On the face of it, this sounds like an application whose primary use would be harvesting email addresses for spam. Please reassure us as to your intentions.
 

Author Comment

by:QUESTOMNI
ID: 9857451
No. I have opt-in email I want to process faster so I can make the most efficient use of my time.
 

Author Comment

by:QUESTOMNI
ID: 9857480
I also get emails from those contacting me. Many are businesses.
 

Author Comment

by:QUESTOMNI
ID: 9857515
When can I expect a response? Question:
Can you write me a script that extracts email addresses from local files?

Thanks,
PD GRAVES
 

Author Comment

by:QUESTOMNI
ID: 9857657
Can you write the script so it reads the file from my unix? I can upload the file and the script can read it and extract the email addresses. Is that do-able?
 
LVL 11

Expert Comment

by:turn123
ID: 9857683
What format is your local file in?
 

Author Comment

by:QUESTOMNI
ID: 9857954
It's windows 98
 

Author Comment

by:QUESTOMNI
ID: 9857962
Can you write the script so it reads the file from my unix? I can upload the file and the script can read it and extract the email addresses. Is that do-able? It is likely easier.
 
LVL 18

Expert Comment

by:kandura
ID: 9858188
You could make it as simple as this:

#!/usr/bin/perl

while(<>)
{
      chomp;
      my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
      print join($/, @emails), $/ if @emails;
}

#end script

Use it like this:

emailextractor.pl < your_input_file > your_list_of_emailaddresses
 

Author Comment

by:QUESTOMNI
ID: 9859505
I am very much a novice. It looks like the while(<>) is reading each line. Then you chomped the \n. Then you give me a variable set to a valid email address.

After that I'm kind of lost. You seem to be joining the Special variable \n at the end of @emails. Then you put $/ if @emails;
at the end of that. That's where you lose me. Can you tell me what's happening with that?

Can you write it so it only extracts the From: email@address.com
email addresses, one under the other?

I don't understand the following at all. Please explain:

Use it like this:

emailextractor.pl < your_input_file > your_list_of_emailaddresses

Thanks,
PD GRAVES
 
LVL 18

Accepted Solution

by:
kandura earned 400 total points
ID: 9861176
Ok, here's an attempt at an explanation:

### loop over the input coming from STDIN or a file piped into the script
while(<>)
{
###  with the current line,
###    remove newline
     chomp;
###    find all email addresses on this line
###    the regexp captures them in the @emails array
     my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;

###    print out the email addresses with a newline character ($/) between them, and one at the end
###    _if_ @emails contains something
     print join($/, @emails), $/ if @emails;
}

The usage example is a command-line statement: you type it into a shell, put it in a crontab entry, or run it however you like.
It says that the input for the script (i.e. what <> will iterate over) should come from the file your_input_file (its contents are fed into the script by the shell's < operator), and that the output of the script should be redirected (by the > operator) from STDOUT to the file your_list_of_emailaddresses.
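
To make the role of <> concrete, here is a small self-contained sketch (not part of the original script). It assumes a made-up sample file created with File::Temp, and it sets @ARGV by hand, which has the same effect as naming the file on the command line. When @ARGV is empty, <> falls back to STDIN, which is the case the < redirection relies on.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small sample input file (hypothetical contents, for illustration only).
my ($fh, $input_file) = tempfile(UNLINK => 1);
print {$fh} "From: alice\@example.com\n",
            "no address on this line\n",
            "cc bob\@example.org and carol\@example.net\n";
close $fh or die "close failed: $!";

# <> reads the files named in @ARGV, or STDIN when @ARGV is empty.
# Setting @ARGV here stands in for naming the file on the command line.
@ARGV = ($input_file);

my @found;
while (my $line = <>) {
    chomp $line;
    # In list context, a /g match returns every captured address on the line.
    push @found, $line =~ /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
}

# One address per line, as the original script prints them.
print "$_\n" for @found;
```

Assuming the script is saved as emailextractor.pl, running "perl emailextractor.pl your_input_file > your_list_of_emailaddresses" should therefore give the same result as the redirected form.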


If you want to match only email addresses on lines that start with From: , then change the script to:
#!/usr/bin/perl

while(<>)
{
     chomp;
     next unless /^From:/i;                #### <------ new line to check only lines that start with From:
     my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
     print join($/, @emails), $/ if @emails;
}

#end script
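
As a quick check of the From: filter, here is a sketch that runs the same two tests against an in-memory sample instead of a real file. The sample message and all addresses in it are made up, and the in-memory filehandle is only a convenience so the example can run on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample of a saved message; the addresses are made up.
my $sample = <<'END';
From: Jane Doe <jane.doe@example.com>
To: pd@example.org
Subject: hello
from: sales&support@example.net
Reply to me at other@example.com
END

# Open the string as if it were a file, so the loop looks like the real script.
open my $in, '<', \$sample or die "cannot open in-memory file: $!";

my @from_addresses;
while (my $line = <$in>) {
    chomp $line;
    next unless $line =~ /^From:/i;   # skip every line that is not a From: header
    push @from_addresses,
        $line =~ /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
}
close $in;

# One address per line; only the two From: lines contribute.
print "$_\n" for @from_addresses;
```

Because of the /i flag, the lowercase "from:" line is accepted as well, while the To: line and the address in the body text are ignored.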
 
LVL 5

Assisted Solution

by:ZiaTioN
ZiaTioN earned 100 total points
ID: 9870436
Just on a side note. Can you step me through the regexp string "my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;" ?

I have wondered about some of these and cannot seem to find the interconnection between them.

Here is what I know about the string:

my = declaration while "use strict" is in use;
@emails = this is of course the array;
// = pattern match indicators or regexp indicators;
[] = character class designators;
\w = matches a word with alphanumeric characters;
&_ = ********************************** # don't know;
() = grouping of items or memory;
?: = *********************************** # don't know;
\. = this appears to be an escaped concatenation character but not sure what the purpose here is;
*\@ = ********************************* # don't know;
/g = looks for multiple iterations of the matched pattern on each line.

I know some regexp but just do not see how you came up with this string to match an email address format.
I would really appreciate it if you could break it down for me. Thanks.
 
LVL 18

Assisted Solution

by:kandura
kandura earned 400 total points
ID: 9870676
You were on the right track with most of the items. Most everything within a character class is treated as either a shortcut such as \w or \d, or the range indicator (- in things like a-z), or in all other cases as a literal character. So [a&-] means a character class which matches either a or & or -. (The _ in [\w&_-] is redundant, since \w already includes the underscore, but it is harmless.)
?: is one of the special extended pattern indicators; this one means clustering, but not capturing. It allows you to group a piece of a regexp together, so that you can repeat it. But the part of the string that matches is not captured in one of the special $1,$2 etc. variables, and in our case it's not captured into the @emails array.
* means "match the previous expression zero or more times".
\@ is a literal @ character. All characters that may have a special meaning in a regexp can be escaped with a backslash to get a literal character.

Hope this helps:

/
  (                                    # start of capture: everything matching the following regexp will be stored
    [\w&_-]                        # character class consisting of a \w (alphanumeric) character, a & or a _ or a -
                                        # this matches stuff like kandura, jack&jill, or i-am-an-_-addict
       +                               # we need one or more of those at the front
    (?:                                # this starts a group, but doesn't capture it in one of the numbered capture variables
        \.?                             # a literal . (possibly; the ? means we don't need it)
                                         #    note that . in a regexp means "match any character", so we escape it to mean a real .
        [\w&_-]+                   # same character class as above
    )                                   # end of this group
    *                                  # we allow zero or more of this group, so this group matches .blah or .yes.and.no
    \@                                 # a literal @ sign, the center of an email address ;^)
                                         # the next part matches domain names
    [\w_-]+                          # character class containing alphanumeric, _ and -
    (?:                                 # start of a group; again, no capturing
        \.?                              # possibly a dot
        [\w_-]+                      # one or more of the allowed characters
     )*                                 # this group can occur zero or more times
                                          #   I chose to use * to allow for addresses at a bare host name such as localhost, where the domain part consists of only one part
                                          # the optional group catches higher level domain parts (such as .com or .localdomain)
  )                                      # end of capturing group
/gx
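
For anyone who wants to experiment with the breakdown, here is a runnable sketch of the same pattern compiled once with qr//x and tried against a few made-up strings. The variable names and sample strings are my own, not from the thread. One caveat: comments inside a /x pattern are still subject to variable interpolation, so the comments in this sketch deliberately avoid $ and @.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same pattern as in the breakdown, compiled once so it can be reused.
my $email_re = qr/
    (
        [\w&_-]+              # local part: first run of allowed characters
        (?: \.? [\w&_-]+ )*   # further runs, each possibly preceded by a dot
        \@                    # the literal at-sign
        [\w_-]+               # first domain label
        (?: \.? [\w_-]+ )*    # optional further labels such as .com
    )
/x;

# Hypothetical test strings, chosen to exercise each part of the pattern.
my @samples = (
    'kandura@example.com',
    'jack&jill@localhost',
    'mail first.last@mail.example.org, please',
    'no address here',
);

for my $text (@samples) {
    # List-context match with the g flag returns every captured address.
    my @hits = $text =~ /$email_re/g;
    printf "%-45s => %s\n", $text, @hits ? join(', ', @hits) : '(none)';
}
```

The second sample shows why the trailing group uses * rather than +: an address with a single-label domain still matches.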
 
LVL 84

Expert Comment

by:ozo
ID: 9933197
There can be many characters other than [\w&_-] in email addresses
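
To illustrate the point, here is a hedged sketch of a broader pattern next to the one from the accepted answer. It assumes the set of characters that RFC 5322 permits in an unquoted local part. It is still not a full implementation of the RFC (quoted local parts are not handled, for example), and the variable names and sample line are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Characters allowed in an unquoted local part besides letters and digits.
# quotemeta backslash-escapes them so they are safe inside a character class.
my $specials    = quotemeta q[!#$%&'*+/=?^_`{|}~-];
my $local_chars = "A-Za-z0-9$specials";

# Broader sketch: dot-separated runs of those characters, an at-sign,
# then dot-separated domain labels.
my $broad_re = qr/([${local_chars}]+(?:\.[${local_chars}]+)*\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*)/;

# The pattern from the accepted answer, for comparison.
my $orig_re = qr/([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/;

# Hypothetical sample line; the addresses are made up.
my $line = q{Write to o'brien+news@example.com or first.last@example.co.uk today};

my @orig  = $line =~ /$orig_re/g;
my @broad = $line =~ /$broad_re/g;

print "original: @orig\n";
print "broader:  @broad\n";
```

With the original pattern the first address is cut down to the part after the +, because ' and + are not in its character class. The trade-off of the broader class is that it can also swallow punctuation that merely sits next to an address in running text, such as a leading quote mark.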