Can you write me a script that extracts email addresses from local files?

Hello,
     Can you write me a script that extracts email addresses from local files?

Thanks,
PD GRAVES
QUESTOMNI Asked:

jmcg Owner Commented:
On the face of it, this sounds like an application whose primary use would be harvesting email addresses for spam. Please reassure us as to your intentions.

QUESTOMNI Author Commented:
No. I have opt-in email I want to process faster so I can make the most efficient use of my time.

QUESTOMNI Author Commented:
I also get emails from those contacting me. Many are businesses.

QUESTOMNI Author Commented:
When can I expect a response? Question:
Can you write me a script that extracts email addresses from local files?

Thanks,
PD GRAVES

QUESTOMNI Author Commented:
Can you write the script so it reads the file from my Unix system? I can upload the file and the script can read it and extract the email addresses. Is that doable?

turn123 Commented:
What format is your local file in?

QUESTOMNI Author Commented:
It's Windows 98

QUESTOMNI Author Commented:
Can you write the script so it reads the file from my Unix system? I can upload the file and the script can read it and extract the email addresses. Is that doable? It is likely easier.

kandura Commented:
You could make it as simple as this:

#!/usr/bin/perl

while(<>)
{
      chomp;
      my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
      print join($/, @emails), $/ if @emails;
}

#end script

Use it like this:

emailextractor.pl < your_input_file > your_list_of_emailaddresses

QUESTOMNI Author Commented:
I am very much a novice. It looks like the while(<>) is reading each line. Then you chomped the \n. Then you give me a variable set to a valid email address.

After that I'm kind of lost. You seem to be joining the Special variable \n at the end of @emails. Then you put $/ if @emails;
at the end of that. That's where you lose me. Can you tell me what's happening with that?

Can you write it so it only extracts the email addresses from the From: email@address.com lines, one under the other?

I don't understand the following at all. Please explain:

Use it like this:

emailextractor.pl < your_input_file > your_list_of_emailaddresses

Thanks,
PD GRAVES

kandura Commented:
Ok, here's an attempt at an explanation:

### loop over the input coming from STDIN or a file piped into the script
while(<>)
{
###  with the current line,
###    remove newline
     chomp;
###    find all email addresses on this line
###    the regexp captures them in the @emails array
     my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;

###    print out the email addresses with a newline character ($/) between them, and one at the end
###    _if_ @emails contains something
     print join($/, @emails), $/ if @emails;
}
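
To see what that last print line does on its own, here is a small sketch. The sub name format_emails and the example.com addresses are made up for this illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# Format a list of addresses the way the script's print statement does:
# one per line with a trailing newline, or nothing at all for an empty list.
sub format_emails {
    my @emails = @_;
    return '' unless @emails;          # an empty array is false in a condition
    return join($/, @emails) . $/;     # $/ is "\n" unless it has been changed
}

print format_emails('a@example.com', 'b@example.com');   # two lines of output
print format_emails();                                    # prints nothing
```

The "if @emails" at the end of the original statement plays the same role as the early return here: without it, every input line that has no address would still print an empty line.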

The usage example is a command line: you type it into a shell, put it in a crontab entry, or whatever you like.
It says that the input for the script (i.e. what <> will iterate over) should come from the file your_input_file (its contents are fed into the script
by the < operator of your shell), and that the output of the script should be redirected (by the > operator) from STDOUT to the file your_list_of_emailaddresses.
If the script is not executable or not on your PATH, run it as: perl emailextractor.pl < your_input_file > your_list_of_emailaddresses
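
The <> operator also reads from any files named on the command line, so "perl emailextractor.pl your_input_file" should behave the same as the redirected form. The sketch below imitates that by filling @ARGV by hand with a temporary file; read_all_lines and the sample text are assumptions made up for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read every line from the files named in @ARGV (or from STDIN when @ARGV
# is empty), which is what the bare <> operator does.
sub read_all_lines {
    my @lines;
    while (my $line = <>) {
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# Write a small sample file, then pretend it was given on the command line.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "first line\nsecond line\n";
close $fh;

my @lines;
{
    local @ARGV = ($path);    # stands in for: perl script.pl your_input_file
    @lines = read_all_lines();
}
print scalar(@lines), " lines read\n";
```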


If you want to match only email addresses on lines that start with From: , then change the script to:
#!/usr/bin/perl

while(<>)
{
     chomp;
     next unless /^From:/i;                #### <------ new line to check only lines that start with From:
     my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;
     print join($/, @emails), $/ if @emails;
}

#end script
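
A quick way to check the From: filter without a real mail file is to run the same two tests over a few in-memory lines. This is only a sketch: from_addresses, @sample and the example.com addresses are invented for the test, while the pattern is the one from the script.

```perl
use strict;
use warnings;

# Same pattern as in the script, precompiled with qr// so it can be reused.
my $email_re = qr/([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/;

# Return the addresses found on lines that start with "From:" (any case).
sub from_addresses {
    my @lines = @_;
    my @found;
    for my $line (@lines) {
        next unless $line =~ /^From:/i;        # skip every other header or body line
        push @found, $line =~ /$email_re/g;    # list context returns all captures
    }
    return @found;
}

my @sample = (
    'From: Jane Doe <jane.doe@example.com>',
    'To: someone.else@example.org',
    'from: sales@example.net',
    'Contact me at other@example.com',
);
print "$_\n" for from_addresses(@sample);
```

Only the first and third sample lines pass the filter, so the addresses on the To: line and in the body text are left out.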

ZiaTioN Commented:
Just on a side note. Can you step me through the regexp string "my @emails = /([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/g;" ?

I have wondered about some of these and cannot seem to find the interconnection between them.

Here is what I know about the string:

my = declaration while "use strict" is in use;
@emails = this is of course the array;
// = pattern match indicators or regexp indicators;
[] = character class designators;
\w = matches a word with alphanumeric characters;
&_ = ********************************** # don't know;
() = grouping of items or memory;
?: = *********************************** # don't know;
\. = this appears to be an escaped concatenation character but not sure what the purpose here is;
*\@ = ********************************* # don't know;
/g = looks for multiple iterations of the matched pattern on each line.

I know some regexp but just do not see how you came up with this string to match an email address format.
I would really appreciate it if you could break it down for me. Thanks.

kandura Commented:
You were on the right track with most of the items. Note that \w matches a single word character (a letter, digit or underscore), not a whole word. Almost everything within a character class is treated as a shortcut such as \w or \d, as the range indicator (the - in things like a-z), or in all other cases as a literal character. So [a&-] means a character class which matches either a or & or -.
?: is one of the special extended pattern indicators; this one means clustering, but not capturing. It allows you to group a piece of a regexp together, so that you can repeat it. But the part of the string that matches is not captured in one of the special $1,$2 etc. variables, and in our case it's not captured into the @emails array.
* means "match the previous expression zero or more times".
\@ is a literal @ character. All characters that may have a special meaning in a regexp can be escaped with a backslash to get a literal character.
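
The difference between a capturing group and a (?:...) cluster is easiest to see side by side. A small sketch, using a made-up test string:

```perl
use strict;
use warnings;

my $text = 'www.example.com';

# Capturing group: in list context the match returns what the group captured,
# and a repeated group keeps only its last repetition.
my @with_capture = $text =~ /\w+(\.\w+)*/;

# Non-capturing group (?:...): it still groups for the *, but returns nothing
# itself, so the outer parentheses can capture the whole match instead.
my @without_capture = $text =~ /(\w+(?:\.\w+)*)/;

print "capturing:     @with_capture\n";      # .com
print "non-capturing: @without_capture\n";   # www.example.com
```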

Hope this helps:

/
  (                                    # start of capture: everything matching the following regexp will be stored
    [\w&_-]                        # character class consisting of a \w (letter, digit or underscore) character, a & or a _ or a -
                                        # this matches stuff like kandura, jack&jill, or i-am-an-_-addict
       +                               # we need one or more of those at the front
    (?:                                # this starts a group, but doesn't capture it in one of the $1,$2 etc. variables
        \.?                             # a literal . (possibly; the ? means we don't need it)
                                         #    note that . in a regexp means "match any character", so we escape it to mean a real .
        [\w&_-]+                   # same character class as above
    )                                   # end of this group
    *                                  # we allow zero or more of this group, so this group matches .blah or .yes.and.no
    \@                                 # a literal @ sign, the center of an email address ;^)
                                         # the next part matches domain names
    [\w_-]+                          # character class containing alphanumeric, _ and -
    (?:                                 # start of a group; again, no capturing
        \.?                              # possibly a dot
        [\w_-]+                      # one or more of the allowed characters
     )*                                 # this group can occur zero or more times
                                          #   I chose to use * to allow for user@localhost type address, where the domain part consists of only one part
                                          # the optional group catches higher level domain parts (such as .com or .localdomain)
  )                                      # end of capturing group
/gx
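
For anyone who wants to try the annotated pattern, here is a compact runnable sketch of it stored with qr//x. The variable names and the sample line are assumptions for the demo, and the explicit _ is left out of the character classes only because \w already includes it.

```perl
use strict;
use warnings;

# The annotated pattern as a reusable regexp object; /x allows the layout.
my $email_re = qr/
  (                      # capture the whole address
    [\w&-]+              # local part: first run of allowed characters
    (?: \.? [\w&-]+ )*   # optional further runs, each possibly preceded by a dot
    \@                   # the literal at-sign
    [\w-]+               # first part of the domain
    (?: \.? [\w-]+ )*    # optional further parts, such as .com
  )
/x;

my $line   = 'Write to jack&jill@example.com or root@localhost, please.';
my @emails = $line =~ /$email_re/g;    # the g modifier is applied at match time
print "$_\n" for @emails;
```

The match stops at the comma after root@localhost because a comma is not in either character class.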

ozo Commented:
There can be many characters other than [\w&_-] in email addresses
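
As a rough illustration of that point, the sketch below widens the local part to the "atext" characters listed in RFC 5322. This is an assumption about what a wider character set might look like, not a full address validator: quoted local parts, comments and other legal forms are still not handled, and all variable names are made up.

```perl
use strict;
use warnings;

# ASSUMPTION: the local part may use the RFC 5322 "atext" characters
# (letters, digits and ! # $ % & ' * + - / = ? ^ _ ` { | } ~),
# with single dots allowed between runs of them.
my $atext = qr/[A-Za-z0-9!#\$%&'*+\/=?^_`{|}~-]/;

# Domain kept deliberately simple: dot-separated runs of letters, digits, hyphens.
my $wider_re = qr/($atext+(?:\.$atext+)*\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*)/;

# The pattern from the script earlier in the thread, for comparison.
my $orig_re = qr/([\w&_-]+(?:\.?[\w&_-]+)*\@[\w_-]+(?:\.?[\w_-]+)*)/;

my $line = q{Mail o'brien+news@example.co.uk or first.last@example.com today};

my @wider = $line =~ /$wider_re/g;
my @orig  = $line =~ /$orig_re/g;

print "wider:    @wider\n";   # keeps o'brien+news@example.co.uk whole
print "original: @orig\n";    # cuts the first address down to news@example.co.uk
```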

Question has a verified solution.