Solved

Regex help - problem with word boundary

Posted on 2004-04-14
Last Modified: 2008-02-01
Here's the regex: \b\(?0\d{4}\)?\s*\d{6}\b
I want it to pick out phone numbers from a block of text, e.g. "this is a test (01603) 123456 more stuff"
The match I get misses off the starting bracket "01603) 123456"
I want "(01603) 123456"

I don't want a match if the phone number doesn't start at a word boundary, e.g. "this is a test(01603) 123456 more stuff" (no space between the word "test" and the phone number) should not match.

Thanks
Question by:joegass
16 Comments
 
LVL 1

Expert Comment

by:kmcghee
ID: 10821768
hey mate,

basically \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary".

There are four different positions that qualify as word boundaries:

1 - Before the first character in the string, if the first character is a word character.
2 - After the last character in the string, if the last character is a word character.
3 - Between a word character and a non-word character following right after the word character.
4 - Between a non-word character and a word character following right after the non-word character.

* So I think the reason you're losing the ( is because it's a non-word char: the only word boundary near it sits between the ( and the 0, so the match starts at the 0.

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.

I reckon if you use \B it'll match the bracket!
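
To see this concretely, here is a small sketch in Python (its \b and \B follow the rules listed above for plain ASCII text). The helper names boundary_offsets and first_match are made up for illustration. It lists where \b can match, then tries the original pattern and the \B variant against a bracketed and an unbracketed number:

```python
import re

BRACKETED = "this is a test (01603) 123456 more stuff"
PLAIN = "this is a test 01603 123456 more stuff"

def boundary_offsets(text):
    """Return every offset at which \\b matches in text."""
    return [m.start() for m in re.finditer(r"\b", text)]

def first_match(pattern, text):
    """Return the text of the first match of pattern, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

paren = BRACKETED.index("(")
# Space followed by "(": two non-word characters, so no boundary here.
print(paren in boundary_offsets(BRACKETED))
# "(" followed by "0": non-word to word, so this is a boundary.
print(paren + 1 in boundary_offsets(BRACKETED))

WITH_B = r"\b\(?0\d{4}\)?\s*\d{6}\b"      # pattern from the question
WITH_NOT_B = r"\B\(?0\d{4}\)?\s*\d{6}\b"  # same pattern, leading \B instead

for pattern in (WITH_B, WITH_NOT_B):
    print(pattern)
    print("  bracketed:", first_match(pattern, BRACKETED))
    print("  plain:    ", first_match(pattern, PLAIN))
```

With \b the bracketed number loses its opening bracket. With \B the bracket is kept, but the unbracketed number no longer matches, because the position between a space and a digit is a word boundary.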

Regards,

Kevin

0
 
LVL 2

Author Comment

by:joegass
ID: 10821825
That helps, but not quite there

I tried \B\(?0\d{4}\)?\s*\d{6}\b and that matched the number if it started with a bracket, but if it didn't start with a bracket (brackets are optional) it didn't find it.

Cheers for your help so far
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10823796
Note: you have 5 digits within the parentheses, which your regex matches as a literal 0 followed by \d{4}. My examples use \d{5}, which also accepts a first digit other than 0.

Here's a regex that works against your example data.  (The phone number is held in $1)

/\s((\(\d{5}\))?\s+\d{6})/
0
 
LVL 2

Author Comment

by:joegass
ID: 10823897
What about cases where my string starts with a phone number?
Maybe I should have mentioned that before :)

e.g. "01603 123456 dfvgsd gsdfgsdfgdgdg fs"

or "(01603) 123456 dfvgsd gsdfgsdfgdgdg fs"
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10823918
As an additional example of the differences between \b, \B and \s, see the results of my test.

#!/usr/bin/perl

while (<DATA>) {
   print "B: $1\n" if (/\B((\(\d{5}\))?\s+\d{6})/);
   print "b: $1\n" if (/\b((\(\d{5}\))?\s+\d{6})/);
   print "s: $1\n" if (/\s((\(\d{5}\))?\s+\d{6})/);
}

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff


== outputs ==
B:  012345
b: (01604) 012345
B: (01603) 123456
s: (01603) 123456

---

As you can see:
   \b picks up the wrong line
   \B picks up both (but only the " 012345" tail of the wrong line)
   \s picks up the correct line
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10823979
In that case, you need to make the \( and \) optional.

while (<DATA>) {
#   print "B: $1\n" if (/\B((\(?\d{5}\)?)?\s+\d{6})/);
#   print "b: $1\n" if (/\b((\(?\d{5}\)?)?\s+\d{6})/);
   print "s: $1\n" if (/\s((\(?\d{5}\)?)?\s+\d{6})/);
}

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff
this is a test 05304 654321 more stuff


== outputs ==
s: (01603) 123456
s: 05304 654321
0
 
LVL 2

Author Comment

by:joegass
ID: 10824027
Thanks for your quick responses!

If I try "01603 123456 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (01603) 123456"

It won't pick up the 1st phone number
0
 
LVL 2

Author Comment

by:joegass
ID: 10824044
and it also picks up the 2nd number, which is concatenated with another string, which I don't want it to
"fg01603 123456" gets "01603 123456"
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10824062
Oops, I missed something, let me do another test (after I have breakfast) and I'll get back to you with the answer.
0
 
LVL 28

Accepted Solution

by:
FishMonger earned 125 total points
ID: 10825095
Because of the variations in how the data may be formatted, and the exceptions to them, it might be easier to use 2 regexes.


#!/usr/bin/perl -w

while (<DATA>) {
   my @phone;
   push @phone, $1 if (/^((\(?\d{5}\)?)\s\d{6})/);
   push @phone, $1 if (/\s+((\(?\d{5}\)?)\s\d{6})/);
   print "$_\n" foreach @phone;
}

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff
05304 654321 more stuff
06103 345621 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (06203) 234561


== outputs ==
(01603) 123456
05304 654321
06103 345621
(06203) 234561
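
A possible variation on the two-regex script above, sketched in Python rather than Perl. The combined pattern and the find_phones helper are assumptions of mine, not part of the tested script. It folds the two anchors into one alternation, "start of string or a whitespace character", and collects every match on a line, whereas each regex in the Perl script stops at its first match because it has no /g modifier:

```python
import re

# Assumed combined pattern: the number must be preceded by the start of the
# string or by one whitespace character; the number itself is group 1.
PHONE = re.compile(r"(?:^|\s)(\(?\d{5}\)?\s\d{6})")

def find_phones(line):
    """Return every phone-like substring in line, in order of appearance."""
    return [m.group(1) for m in PHONE.finditer(line)]

# Same test lines as the Perl __DATA__ section.
LINES = [
    "this is a test(01604) 012345 more stuff",
    "this is a test (01603) 123456 more stuff",
    "05304 654321 more stuff",
    "06103 345621 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (06203) 234561",
]

for line in LINES:
    for phone in find_phones(line):
        print(phone)
```

Like the Perl version, this does not check what follows the last digit, so a number with letters glued to its end would still be reported.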
0
 
LVL 6

Expert Comment

by:DominicCronin
ID: 10826432
I just wanted to jump in here and have a little bitch about life in general:

Why is it that when people ask questions about regexps, they almost never say which regexp language they are on about? Are we to assume that someone means Perl regexps, or the notionally identical but separately defined XML Schema regexps, or Microsoft Scripting regexps, etc.?

No offence - just felt like a bit of a rant. :-)
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10826858
DominicCronin,

Technically, you should probably say 'regexp engine' rather than 'regexp language'.  There are 4 types of regex engine (DFA, Traditional NFA, POSIX NFA, and hybrid NFA/DFA), so multiple languages use the same type of engine; however, each language may or may not implement all of the features of the engine it uses.  That said, I do agree that for questions in the generic programming topic area, it would help to know exactly which language is being used.  As far as this question is concerned, since joegass didn't object to my Perl examples, I can safely assume that he is either using Perl or another language with a Traditional NFA engine and knows how to drop the regex into his code.
0
 
LVL 2

Author Comment

by:joegass
ID: 10831090
Thanks for all your help FishMonger, perhaps I should have said that I'm using vb.net - I've had no problem with interpreting your perl stuff tho'.

0
 
LVL 2

Author Comment

by:joegass
ID: 10831120
It still isn't quite working as I'd like
Here's my code - basically all I want to do is return a collection of matches from a block of text (includes my first regex)

        Private Function findPhoneNumbers(ByVal textIn As String) As Text.RegularExpressions.MatchCollection
            Dim re As New Text.RegularExpressions.Regex("\b\(?0\d{4}\)?\s*\d{6}\b|\b\(?0\d{3}\)?\s*\d{7}\b", Text.RegularExpressions.RegexOptions.Multiline)
            Return re.Matches(textIn)
        End Function

I'm trying your suggestion

^((\(?0\d{4}\)?)\s*\d{6})|\s+((\(?0\d{4}\)?)\s+\d{6})

and also

^((\(?0\d{4}\)?)\s*\d{6})|\b+((\(?0\d{4}\)?)\s+\d{6})

Both achieve moderate success, but they don't find all numbers in all formats 100% of the time, e.g.
"this is a test 01603 123456xcvc" still matches the phone number.

With the first regex and multi-line input
"is a test
06103 345621 sdf dsfgsdfg"
the t at the end of "test" and the phone number are matched.

With the 2nd regex and multi-line input
"more stuff
(01604) 012345 more stuff this is a test."
"01604) 012345" is matched, missing the first bracket.

Thanks for all your help so far!
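
For what it's worth, the cases above all depend on what sits just outside the number, and lookarounds can test that without consuming any characters. A minimal sketch in Python, assuming the pattern below (it is not from this thread and has not been tried on real data). The (?<!...) and (?!...) syntax is also accepted by .NET and Perl. The function name find_phone_numbers mirrors the VB.NET function but is otherwise hypothetical:

```python
import re

# Assumed pattern: lookarounds take the place of \b so that an optional "("
# can be part of the match.
PHONE = re.compile(
    r"(?<![\w(])"             # not preceded by a word character or by "("
    r"(?:\(?0\d{4}\)?\s*\d{6}"  # 0 + 4 digits, then 6 digits
    r"|\(?0\d{3}\)?\s*\d{7})"   # or 0 + 3 digits, then 7 digits
    r"(?!\w)"                 # not followed by a word character
)

def find_phone_numbers(text_in):
    """Return all phone-like substrings found in text_in."""
    return PHONE.findall(text_in)

samples = [
    "this is a test (01603) 123456 more stuff",
    "this is a test(01603) 123456 more stuff",
    "this is a test 01603 123456xcvc",
    "is a test\n06103 345621 sdf dsfgsdfg",
    "more stuff\n(01604) 012345 more stuff this is a test.",
]
for sample in samples:
    print(repr(sample), "->", find_phone_numbers(sample))
```

The "(" inside the lookbehind class stops the engine from skipping the bracket and starting the match at the 0 when the bracket itself is glued to a word. The pattern does not check that brackets are balanced, so "(01603 123456" would also match.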
0
 
LVL 2

Author Comment

by:joegass
ID: 11040208
I never quite got to the bottom of this, but FishMonger was very helpful and knowledgeable
Thanks!
0
