Solved

regex tutorial help

Posted on 2008-06-25
Last Modified: 2010-03-05
Hi,

I'm just trying to practice regexes. I made a dummy string with an age and address in it. I want to pull out the age and the building number from the string. So I'm really looking for a sequence of 3 digits, then another sequence of 3 digits. Here's the script:


use strict;

my $str = "hello I am 500 years old and my address is 123 Main Street.";


# Try to find the age and the address.
if ($str =~ m/ (\d{3})[\w\s]+(\d{3})/) {
    print("Yeah it matched and the extracted stuff is: $1, $2", "\n");
    print($1, "\n"); # the age
    print($2, "\n"); # the building number
}


I get the first extraction:

    (\d{3})   - look for a sequence of 3 digits.

I don't get this part:

    [\w\s]+

How do you read the [] brackets? I just gave it a shot because I saw it somewhere else, but what I was trying to express is:

    look for a sequence of 3 digits, then some spaces and characters, then another sequence of 3 digits.

so the [\w\s]+ is the (some spaces and characters) part, I just don't understand technically what it is saying.

Thanks
Question by:DJ_AM_Juicebox
9 Comments
 
LVL 82

Accepted Solution

by:
hielo earned 100 total points
ID: 21867482
>> look for a sequence of 3 digits, then some spaces and characters, then another sequence of 3 digits.
The problem is that \w is shorthand for [a-zA-Z0-9_], so it matches digits as well as letters. That means the digits of your second number can also be consumed by [\w\s]+ instead of being left for the second (\d{3}).

If you take a step back and look at your input string again, another way to look at it is some digits followed by non-digits followed by digits:
if ($str =~ m/ (\d{3})\D+(\d{3})/) {
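
A minimal sketch of the difference between the two patterns. The helper name extract and the second test string are made up for illustration; they are not from the original question. On the original string both patterns capture 500 and 123, so the sketch uses a longer digit run to make the difference visible.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured groups, or an empty list if there is no match.
sub extract {
    my ($re, $str) = @_;
    return $str =~ $re;
}

# Assumed sample input with a 7-digit run, chosen to expose the difference.
my $str = "I am 500 and 1234567 Main";

# [\w\s]+ is allowed to consume digits, so it eats "1234" and leaves "567".
my @loose = extract(qr/ (\d{3})[\w\s]+(\d{3})/, $str);

# \D+ cannot consume digits, so it stops right before "123".
my @strict = extract(qr/ (\d{3})\D+(\d{3})/, $str);

print "with [\\w\\s]+ : @loose\n";   # 500 567
print "with \\D+      : @strict\n";  # 500 123
```

With \D+ the separator can never overlap the numbers, which is why the second capture is always the first three digits after the gap.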
 
LVL 28

Assisted Solution

by:FishMonger
FishMonger earned 100 total points
ID: 21867737
I'd take an additional step and use named vars instead of $1 and $2.

my $str = "hello I am 500 years old and my address 123 is Main Street.";

# A list assignment in scalar context yields the number of captures, so the test is true only on a match.
if ( my ($age, $number) = $str =~ /(\d{3})\D+(\d{3})/ ) {
    print "Yeah it matched and the extracted stuff is:\n";
    print "$age\n";
    print "$number\n";
}

 

Author Comment

by:DJ_AM_Juicebox
ID: 21867989
Ah ok yeah this makes more sense to me:

    m/ (\d{3})\D+(\d{3})/)


so that says look for 3 digits, followed by one or more non-digit characters (so this includes alphabet chars and whitespaces), then 3 digits, right?
 
LVL 84

Assisted Solution

by:ozo
ozo earned 100 total points
ID: 21868071
Yes, although it would also match the 123 in
hello I am 500 years old and my address is 1234 Main Street
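
If the numbers must be exactly three digits long, one possible approach (an assumption on my part, not something proposed in this thread) is to add lookarounds so that no digit may sit directly before or after the run. The sub name three_digit_runs is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of exactly three digits in the string.
sub three_digit_runs {
    my ($str) = @_;
    # (?<!\d) and (?!\d) assert that no digit comes immediately before or after the run.
    return $str =~ /(?<!\d)(\d{3})(?!\d)/g;
}

# "1234" is rejected because a fourth digit follows "123".
print join(", ", three_digit_runs("hello I am 500 years old and my address is 1234 Main Street")), "\n";  # 500

# The original string still yields both numbers.
print join(", ", three_digit_runs("hello I am 500 years old and my address is 123 Main Street.")), "\n";  # 500, 123
```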
LVL 39

Assisted Solution

by:Adam314
Adam314 earned 100 total points
ID: 21868728
Your original regex worked. The [\w\s]+ didn't swallow the second number because if it had, the overall regex wouldn't match: there would be no digits left for the final \d{3}. Each +, *, or {min,max} uses as many characters as possible while still allowing the overall regex to match. If you use +?, *?, or {min,max}?, it will match as few as possible while still allowing the overall regex to match.
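
A small sketch of greedy versus non-greedy matching. The test string is an assumption chosen so that the two behave differently; on the original string the middle part comes out the same either way.

```perl
use strict;
use warnings;

# Assumed sample input with a long digit run.
my $str = "I am 500 and 1234567 Main";

# Greedy: [\w\s]+ grabs as much as it can, then gives back only enough for \d{3} to match.
my ($greedy) = $str =~ / \d{3}([\w\s]+)\d{3}/;

# Non-greedy: [\w\s]+? grabs as little as it can, growing only until \d{3} can match.
my ($lazy) = $str =~ / \d{3}([\w\s]+?)\d{3}/;

print "greedy middle: '$greedy'\n";  # ' and 1234'
print "lazy middle:   '$lazy'\n";    # ' and '
```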
 
LVL 82

Expert Comment

by:hielo
ID: 21869001
>>so that says look for 3 digits, followed by one or more non-digit characters (so this includes alphabet chars and whitespaces), then 3 digits, right?
Exactly! Wasn't that easy? :)
 
LVL 84

Expert Comment

by:ozo
ID: 21870806
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr/ (\d{3})[\w\s]+(\d{3})/(->explain"
syntax error at -e line 1, near "qr/ (\d{3})[\w\s]+(\d{3})/("
Execution of -e aborted due to compilation errors.
PowerMac-G5:~/ee dmi$ perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr/ (\d{3})[\w\s]+(\d{3})/)->explain"
The regular expression:

(?-imsx: (\d{3})[\w\s]+(\d{3}))

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
                           ' '
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  [\w\s]+                  any character of: word characters (a-z, A-
                           Z, 0-9, _), whitespace (\n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
 
LVL 9

Assisted Solution

by:ghostdog74
ghostdog74 earned 100 total points
ID: 21871218
don't need that much regexp.
you can use split to split by non digits.
my $str = "hello I am 500 years old and my address is 123 Main Street.";

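# The string starts with non-digits, so split returns an empty first field; grep drops it.
my @array = grep { length } split( /\D+/, $str );

print "@array\n";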

 
LVL 84

Expert Comment

by:ozo
ID: 21872012
That works if there is no requirement that the sequences have 3 digits.