Solved

regex tutorial help

Posted on 2008-06-25
Last Modified: 2010-03-05
Hi,

I'm just trying to practice regexes. I made a dummy string with an age and address in it. I want to pull out the age and the building number from the string. So I'm really looking for a sequence of 3 numbers, then another sequence of 3 numbers.  Here's the script:


use strict;

my $str = "hello I am 500 years old and my address is 123 Main Street.";


# Try to find the age and the address.
if ($str =~ m/ (\d{3})[\w\s]+(\d{3})/) {
    print("Yeah it matched and the extracted stuff is: $1, $2", "\n");
    print($1, "\n"); # the age
    print($2, "\n"); # the building number
}


I get the first extraction:

    (\d{3})   - look for a sequence of 3 digits.

I don't get this part:

    [\w\s]+

how do you express the [] brackets? I just gave it a shot cause I saw it somewhere else, but what I was trying to express is:

    look for a sequence of 3 digits, then some spaces and characters, then another sequence of 3 digits.

so the [\w\s]+ is the (some spaces and characters) part, I just don't understand technically what it is saying.

Thanks
Question by:DJ_AM_Juicebox
 
LVL 82

Accepted Solution

by:
hielo earned 100 total points
ID: 21867482
>> look for a sequence of 3 digits, then some spaces and characters, then another sequence of 3 digits.
The problem is that \w is a shortcut for [a-zA-Z0-9_], so it matches digits too. When the regex reaches your second set of digits, those digits can also be consumed by the [\w\s]+.

If you take a step back and look at your input string again, another way to look at it is some digits followed by non-digits followed by digits:
if ($str =~ m/ (\d{3})\D+(\d{3})/) {
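On the original question of what the brackets mean: [...] is a character class, which matches exactly one character out of the set listed inside, and the + after it repeats that one-character match one or more times. A minimal sketch of that idea (in_class is a made-up helper name, not something from the posts above):

```perl
use strict;
use warnings;

# [\w\s] matches ONE character that is either a word character
# (\w: letters, digits, underscore) or whitespace (\s).
# in_class() is a hypothetical helper used only for this illustration.
sub in_class {
    my ($char) = @_;
    return $char =~ /\A[\w\s]\z/ ? 1 : 0;
}

for my $char ('a', '7', '_', ' ', '.', '-') {
    printf "'%s' %s [\\w\\s]\n", $char, in_class($char) ? 'is in' : 'is not in';
}
```

Note that '7' is reported as being in the class, which is why [\w\s]+ is able to swallow digits, while \D cannot.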
 
LVL 28

Assisted Solution

by:FishMonger
FishMonger earned 100 total points
ID: 21867737
I'd take an additional step and use named vars instead of $1 and $2.

my $str = "hello I am 500 years old and my address 123 is Main Street.";
 
if ( my ($age, $number) = $str =~ /(\d{3})\D+(\d{3})/ ) {
    print "Yeah it matched and the extracted stuff is:\n";
    print "$age\n";
    print "$number\n";
}

 

Author Comment

by:DJ_AM_Juicebox
ID: 21867989
Ah ok yeah this makes more sense to me:

    m/ (\d{3})\D+(\d{3})/)


so that says look for 3 digits, followed by one or more non-digit characters (so this includes alphabet chars and whitespaces), then 3 digits, right?

 
LVL 84

Assisted Solution

by:ozo
ozo earned 100 total points
ID: 21868071
yes, although it would also match the 123 in
hello I am 500 years old and my address is 1234 Main Street
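If the numbers must be exactly 3 digits long, one possible way to rule out the 1234 case is to add lookarounds that forbid a digit on either side of each group. This is only a sketch; loose_pair and exact_pair are hypothetical helper names:

```perl
use strict;
use warnings;

# Both helpers are hypothetical names used only for this illustration.
# Each returns the two captured numbers, or an empty list on no match.
sub loose_pair {
    my ($text) = @_;
    return $text =~ /(\d{3})\D+(\d{3})/;
}

sub exact_pair {
    my ($text) = @_;
    # (?<!\d) = not preceded by a digit, (?!\d) = not followed by a digit
    return $text =~ /(?<!\d)(\d{3})(?!\d)\D+(\d{3})(?!\d)/;
}

my $four = "hello I am 500 years old and my address is 1234 Main Street";
my @loose = loose_pair($four);
my @exact = exact_pair($four);
print "loose: @loose\n";
print "exact: ", (@exact ? "@exact" : "no match"), "\n";
```

With the 1234 string, the loose pattern still captures 500 and 123, while the guarded pattern finds no match at all. On the original string with 123, both return 500 and 123.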
 
LVL 39

Assisted Solution

by:Adam314
Adam314 earned 100 total points
ID: 21868728
Your original regex worked, and the [\w\s]+ didn't swallow the second number because if it did, the overall regex wouldn't match - there would be no digits left for the second \d{3} to match. The regex will have each +, *, or {min,max} use as many characters as possible, while still allowing the overall regex to match. If you use +?, *?, or {min,max}?, then it will match as few as possible, while still allowing the overall regex to match.
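The difference between + and +? only shows up when there is more than one place the second group could match. A small sketch with a made-up string containing three numbers (the string and the helper names greedy_pair and lazy_pair are assumptions for illustration only):

```perl
use strict;
use warnings;

# Made-up input with three 3-digit numbers, to make the difference visible.
my $text = "I am 500 years old at 123 Main Street unit 456";

# Greedy: [\w\s]+ grabs as much as it can, then backs off just far
# enough for the final (\d{3}) to match.
sub greedy_pair {
    my ($t) = @_;
    return $t =~ / (\d{3})[\w\s]+(\d{3})/;
}

# Non-greedy: [\w\s]+? grabs as little as it can, growing one
# character at a time until the final (\d{3}) can match.
sub lazy_pair {
    my ($t) = @_;
    return $t =~ / (\d{3})[\w\s]+?(\d{3})/;
}

my @greedy = greedy_pair($text);
my @lazy   = lazy_pair($text);
print "greedy: @greedy\n";
print "lazy:   @lazy\n";
```

Here the greedy version ends up with 500 and 456, and the non-greedy version with 500 and 123. On the string from the question there is only one candidate for the second group, so both forms give 500 and 123.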
 
LVL 82

Expert Comment

by:hielo
ID: 21869001
>>so that says look for 3 digits, followed by one or more non-digit characters (so this includes alphabet chars and whitespaces), then 3 digits, right?
Exactly! Wasn't that easy? :)
 
LVL 84

Expert Comment

by:ozo
ID: 21870806
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr/ (\d{3})[\w\s]+(\d{3})/(->explain"
syntax error at -e line 1, near "qr/ (\d{3})[\w\s]+(\d{3})/("
Execution of -e aborted due to compilation errors.
PowerMac-G5:~/ee dmi$ perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr/ (\d{3})[\w\s]+(\d{3})/)->explain"
The regular expression:

(?-imsx: (\d{3})[\w\s]+(\d{3}))

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
                           ' '
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  [\w\s]+                  any character of: word characters (a-z, A-
                           Z, 0-9, _), whitespace (\n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
 
LVL 9

Assisted Solution

by:ghostdog74
ghostdog74 earned 100 total points
ID: 21871218
don't need that much regexp.
you can use split to split by non digits.
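my $str = "hello I am 500 years old and my address is 123 Main Street.";
# split returns an empty leading field because the string starts with non-digits
my @array = grep { length } split( /\D+/, $str );
print "@array\n";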

 
LVL 84

Expert Comment

by:ozo
ID: 21872012
if there is no requirement that the sequences have 3 digits
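If the 3-digit requirement does matter, one alternative to split is a global match in list context, which returns every run of exactly three digits. This is a sketch under that assumption; three_digit_runs is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of exactly three digits.
sub three_digit_runs {
    my ($text) = @_;
    # /g in list context returns all matches; the lookarounds reject
    # runs that are part of a longer number such as 1234.
    my @runs = $text =~ /(?<!\d)\d{3}(?!\d)/g;
    return @runs;
}

my $str = "hello I am 500 years old and my address is 123 Main Street.";
my ($age, $number) = three_digit_runs($str);
print "age: $age, building number: $number\n";
```

Unlike split on /\D+/, this skips numbers of other lengths instead of returning them.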