regex tutorial help

Hi,

I'm just trying to practice regexes. I made a dummy string with an age and address in it. I want to pull out the age and the building number from the string. So I'm really looking for a sequence of 3 digits, then another sequence of 3 digits. Here's the script:


use strict;

my $str = "hello I am 500 years old and my address is 123 Main Street.";


# Try to find the age and the address.
if ($str =~ m/ (\d{3})[\w\s]+(\d{3})/) {
    print("Yeah it matched and the extracted stuff is: $1, $2", "\n");
    print($1, "\n"); # the age
    print($2, "\n"); # the building number
}


I get the first extraction:

    (\d{3})   - look for a sequence of 3 digits.

I don't get this part:

    [\w\s]+

How do you read the [] brackets? I just gave it a shot because I saw it somewhere else, but what I was trying to express is:

    look for a sequence of 3 digits, then some spaces and characters, then another sequence of 3 digits.

So the [\w\s]+ is the (some spaces and characters) part; I just don't understand technically what it is saying.

Thanks
Asked by DJ_AM_Juicebox
 
hielo commented:
>> look for a sequence of 3 digits, then some spaces and characters, then another sequence of 3 digits.
The problem is that \w is a shortcut for [a-zA-Z0-9_], so the digits in your second number can also be matched by the [\w\s] class.

If you take a step back and look at your input string again, another way to look at it is some digits followed by non-digits followed by digits:
if ($str =~ m/ (\d{3})\D+(\d{3})/) {
 
FishMonger commented:
I'd take an additional step and use named vars instead of $1 and $2.

my $str = "hello I am 500 years old and my address is 123 Main Street.";
 
if ( my ($age, $number) = $str =~ /(\d{3})\D+(\d{3})/ ) {
    print "Yeah it matched and the extracted stuff is:\n";
    print "$age\n";
    print "$number\n";
}

 
DJ_AM_Juicebox (author) commented:
Ah ok yeah this makes more sense to me:

    m/ (\d{3})\D+(\d{3})/


so that says look for 3 digits, followed by one or more non-digit characters (so this includes alphabet chars and whitespaces), then 3 digits, right?
 
ozo commented:
Yes, although it would also match the 123 in
hello I am 500 years old and my address is 1234 Main Street
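
A small sketch of that caveat, in case it helps. The stricter pattern is only one possible tightening and assumes the intent is "exactly 3 digits, not part of a longer number"; the lookarounds and the names $exact and $strict_match are illustrative, not from the original script.

```perl
use strict;
use warnings;

my $str = "hello I am 500 years old and my address is 1234 Main Street.";

# Loose pattern from above: \d{3} is satisfied by the first 3 digits of 1234.
my ($age, $number) = $str =~ / (\d{3})\D+(\d{3})/;
print "loose:  $age, $number\n";    # prints "loose:  500, 123"

# Possible tightening (an assumption about the intent): the lookarounds
# reject a 3-digit run that has another digit right before or after it.
my $exact = qr/(?<!\d)(\d{3})(?!\d)\D+(?<!\d)(\d{3})(?!\d)/;
my $strict_match = $str =~ $exact ? "matched" : "no match";
print "strict: $strict_match\n";    # prints "strict: no match"
```

With the original "123 Main Street" string, the stricter pattern would still capture 500 and 123.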
 
Adam314 commented:
Your original regex worked: the [\w\s]+ didn't swallow the second number, because if it had, the overall regex couldn't match, since there would be no digits left for the final \d{3}. Each +, *, or {min,max} takes as many characters as possible while still allowing the overall regex to match. If you use +?, *?, or {min,max}? instead, it takes as few as possible, while still allowing the overall regex to match.
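
Here is a minimal sketch of the difference. The 4-digit building number is a hypothetical input chosen so that the greedy and non-greedy forms give different results; with the original "123" both capture the same thing.

```perl
use strict;
use warnings;

# Hypothetical input: a 4-digit building number makes the two quantifiers differ.
my $str = "hello I am 500 years old and my address is 1234 Main Street.";

# Greedy +: [\w\s]+ grabs as much as it can, then backs off only far enough
# to leave 3 digits for the final group, so it also swallows the "1".
my ($g_age, $g_num) = $str =~ / (\d{3})[\w\s]+(\d{3})/;
print "greedy:     $g_age, $g_num\n";    # prints "greedy:     500, 234"

# Non-greedy +?: [\w\s]+? takes as little as it can, stopping at the first
# position where 3 digits follow.
my ($l_age, $l_num) = $str =~ / (\d{3})[\w\s]+?(\d{3})/;
print "non-greedy: $l_age, $l_num\n";    # prints "non-greedy: 500, 123"
```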
 
hielo commented:
>>so that says look for 3 digits, followed by one or more non-digit characters (so this includes alphabet chars and whitespaces), then 3 digits, right?
Exactly! Wasn't that easy? :)
 
ozo commented:
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr/ (\d{3})[\w\s]+(\d{3})/)->explain"
The regular expression:

(?-imsx: (\d{3})[\w\s]+(\d{3}))

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
                           ' '
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  [\w\s]+                  any character of: word characters (a-z, A-
                           Z, 0-9, _), whitespace (\n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
 
ghostdog74 commented:
You don't need that much regexp.
You can use split to split on non-digits.
my $str = "hello I am 500 years old and my address is 123 Main Street.";
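# The string starts with non-digits, so split returns a leading empty field; grep drops it.
my @array = grep { length } split /\D+/, $str;
print "$_\n" for @array;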

 
ozo commented:
That works only if there is no requirement that the sequences have exactly 3 digits.
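
If the 3-digit requirement does matter, one possible alternative is a global match in list context. This is only a sketch: the lookarounds assume "exactly 3 digits" is the intent, and @numbers is an illustrative name.

```perl
use strict;
use warnings;

my $str = "hello I am 500 years old and my address is 123 Main Street.";

# In list context, m//g returns the captured group from every match.
# The lookarounds (an assumption about the intent) skip longer digit runs.
my @numbers = $str =~ /(?<!\d)(\d{3})(?!\d)/g;
print "$_\n" for @numbers;    # prints 500, then 123
```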
Question has a verified solution.
