Solved

Regex help - problem with word boundary

Posted on 2004-04-14
Last Modified: 2008-02-01
Here's the regex: \b\(?0\d{4}\)?\s*\d{6}\b
I want it to pick out phone numbers from a block of text, e.g. "this is a test (01603) 123456 more stuff"
The match I get drops the opening bracket: "01603) 123456"
I want "(01603) 123456"

I also don't want a match when the phone number is not separated from the preceding word, e.g. "this is a test(01603) 123456 more stuff" (no space between the word "test" and the phone number) should not match.

Thanks
0
Question by:joegass
16 Comments
 
LVL 1

Expert Comment

by:kmcghee
ID: 10821768
hey mate,

Basically, \b is an anchor, like the caret and the dollar sign: it matches at a position, called a "word boundary", rather than matching a character.

There are four different positions that qualify as word boundaries:

1 - Before the first character in the string, if the first character is a word character.
2 - After the last character in the string, if the last character is a word character.
3 - Between a word character and a non-word character following right after the word character.
4 - Between a non-word character and a word character following right after the non-word character.

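* So I think the reason you're losing the ( is that it's a non-word character. The space in front of it is a non-word character too, so there is no word boundary before the bracket, and the first place \b can match is between the ( and the 0.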
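The point above can be checked directly. The following is a minimal sketch in Python (an assumption on my part; the thread itself uses Perl and VB.NET, but `\b` behaves the same way for this input). It lists where `\b` can match in the sample text and then runs the original pattern.

```python
import re

text = "this is a test (01603) 123456 more stuff"
pattern = re.compile(r"\b\(?0\d{4}\)?\s*\d{6}\b")

# Collect every index at which \b matches (word/non-word transitions).
boundaries = {m.start() for m in re.finditer(r"\b", text)}

paren = text.index("(")
print(paren in boundaries)      # False: the space and "(" are both non-word characters
print(paren + 1 in boundaries)  # True: "(" followed by "0" is a word boundary

match = pattern.search(text)
print(match.group())            # 01603) 123456
```

Because the optional `\(?` is allowed to match nothing, the engine simply starts the match at the `0`, which is why the bracket goes missing.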

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.

I reckon if you use \B it'll match the bracket!

Regards,

Kevin

0
 
LVL 2

Author Comment

by:joegass
ID: 10821825
That helps, but not quite there

I tried \B\(?0\d{4}\)?\s*\d{6}\b and that matched the number when it started with a bracket, but when it didn't (the brackets are optional) the number wasn't found.
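A short sketch of why that happens, again in Python as a stand-in for the asker's environment (an assumption; the sample strings are the ones from the question). With a bracket, the position between the space and `(` sits between two non-word characters, so `\B` matches there. Without a bracket, the position between the space and `0` is a word boundary, so `\B` fails exactly where the number begins.

```python
import re

pattern = re.compile(r"\B\(?0\d{4}\)?\s*\d{6}\b")

with_bracket = "this is a test (01603) 123456 more stuff"
without_bracket = "this is a test 01603 123456 more stuff"

found = pattern.search(with_bracket)
missed = pattern.search(without_bracket)

print(found.group())  # (01603) 123456
print(missed)         # None
```

So neither `\b` nor `\B` alone covers both the bracketed and the unbracketed form.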

Cheers for your help so far
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10823796
Note: you have 5 digits within the parentheses; your regex matches them as a literal 0 followed by exactly 4 more digits, whereas the regex below accepts any 5 digits.

Here's a regex that works against your example data.  (The phone number is held in $1)

/\s((\(\d{5}\))?\s+\d{6})/
0
 
LVL 2

Author Comment

by:joegass
ID: 10823897
What about instances when my string will start with a phone number
maybe I should have mentioned that before :)

e.g. "01603 123456 dfvgsd gsdfgsdfgdgdg fs"

or "(01603) 123456 dfvgsd gsdfgsdfgdgdg fs"
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10823918
As an additional test, and an example of the differences between \b, \B and \s, see the results of my test.

#!/usr/bin/perl

while (<DATA>) {
   print "B: $1\n" if (/\B((\(\d{5}\))?\s+\d{6})/);
   print "b: $1\n" if (/\b((\(\d{5}\))?\s+\d{6})/);
   print "s: $1\n" if (/\s((\(\d{5}\))?\s+\d{6})/);
}

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff


== outputs ==
B:  012345
b: (01604) 012345
B: (01603) 123456
s: (01603) 123456

---

As you can see:
   \b picks up the wrong line
   \B picks up both lines (but only " 012345" from the first one)
   \s picks up only the correct line
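For readers without Perl to hand, the same comparison can be sketched in Python (an assumption; for these inputs `\b`, `\B` and `\s` behave as they do in Perl). `m.group(1)` plays the role of Perl's `$1`.

```python
import re

lines = [
    "this is a test(01604) 012345 more stuff",
    "this is a test (01603) 123456 more stuff",
]

# The three prefixes being compared, keyed by the label used in the output.
anchors = {"B": r"\B", "b": r"\b", "s": r"\s"}

results = []
for line in lines:
    for label, anchor in anchors.items():
        m = re.search(anchor + r"((\(\d{5}\))?\s+\d{6})", line)
        if m:
            results.append(f"{label}: {m.group(1)}")

print("\n".join(results))
# B:  012345
# b: (01604) 012345
# B: (01603) 123456
# s: (01603) 123456
```

The `\s` variant works here because it consumes the whitespace in front of the number, which also means it cannot match a number at the very start of the text.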
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10823979
In that case, you need to make the \( and \) optional.

while (<DATA>) {
#   print "B: $1\n" if (/\B((\(?\d{5}\)?)?\s+\d{6})/);
#   print "b: $1\n" if (/\b((\(?\d{5}\)?)?\s+\d{6})/);
   print "s: $1\n" if (/\s((\(?\d{5}\)?)?\s+\d{6})/);
}

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff
this is a test 05304 654321 more stuff


== outputs ==
s: (01603) 123456
s: 05304 654321
0
 
LVL 2

Author Comment

by:joegass
ID: 10824027
Thanks for your quick responses!

If I try "01603 123456 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (01603) 123456"

It won't pick up the 1st phone number
0
 
LVL 2

Author Comment

by:joegass
ID: 10824044
It also picks up the 2nd number, which is concatenated with another string, and I don't want it to:
"fg01603 123456" gets "01603 123456"
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10824062
Oops, I missed something, let me do another test (after I have breakfast) and I'll get back to you with the answer.
0
 
LVL 28

Accepted Solution

by:
FishMonger earned 125 total points
ID: 10825095
Because of the variations in how the data may be formatted, and their exceptions, it might be easier to use two regexes.


#!/usr/bin/perl -w

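while (<DATA>) {
   my @phone;
   # A number at the very start of the line.
   push @phone, $1 if (/^((\(?\d{5}\)?)\s\d{6})/);
   # A number preceded by whitespace (without /g, only the first one per line).
   push @phone, $1 if (/\s+((\(?\d{5}\)?)\s\d{6})/);
   print "$_\n" foreach @phone;
}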

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff
05304 654321 more stuff
06103 345621 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (06203) 234561


== outputs ==
(01603) 123456
05304 654321
06103 345621
(06203) 234561
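As a possible alternative to two separate regexes, the "start of line or after whitespace" condition can be written once as a negative lookbehind, `(?<!\S)`, which asserts that the previous character is not a non-space character without consuming anything. The sketch below is in Python rather than Perl (an assumption, chosen only for illustration), and the pattern is not part of the accepted answer. Since nothing in front of the number is consumed, a global search returns every number on a line.

```python
import re

# (?<!\S): the number must be at the start of the text or directly after whitespace.
PHONE = re.compile(r"(?<!\S)\(?\d{5}\)?\s\d{6}")

data = [
    "this is a test(01604) 012345 more stuff",
    "this is a test (01603) 123456 more stuff",
    "05304 654321 more stuff",
    "06103 345621 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (06203) 234561",
]

found = [number for line in data for number in PHONE.findall(line)]
print("\n".join(found))
# (01603) 123456
# 05304 654321
# 06103 345621
# (06203) 234561
```

Like the two-regex version, this sketch does not check what follows the last digit.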
0
 
LVL 6

Expert Comment

by:DominicCronin
ID: 10826432
I just wanted to jump in here and have a little bitch about life in general:

Why is it that when people ask questions about regexps, they almost never say which regexp language they are on about? Are we to assume that someone means Perl regexps, or the notionally identical but separately defined XML Schema regexps, or Microsoft Scripting regexps, etc.?

No offence - just felt like a bit of a rant. :-)
0
 
LVL 28

Expert Comment

by:FishMonger
ID: 10826858
DominicCronin,

Technically, you should probably say which type of 'regexp engine' rather than 'regexp language'.  There are 4 types of regex engine (DFA, Traditional NFA, POSIX NFA, and Hybrid NFA/DFA), so multiple languages use the same type of engine; however, each language may or may not implement all of the features of the engine it uses.  That said, I do agree that for questions in the generic programming topic area it would help to know exactly which language is being used.  As far as this question is concerned, since joegass didn't have anything to say about my use of Perl examples, I assumed that he is using either Perl or one of the other languages with a Traditional NFA engine and knows how to drop the regex into his code.
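One visible difference between engine types is how alternation is resolved. A backtracking (Traditional NFA) engine takes the first alternative that lets the overall match succeed, whereas POSIX rules call for the longest match. A minimal sketch in Python, whose `re` module is a backtracking engine:

```python
import re

# Ordered alternation: "a" is tried first and succeeds, so "ab" is never attempted.
first_wins = re.match(r"a|ab", "ab").group()

# Swapping the alternatives changes the result, which shows that order matters.
reordered = re.match(r"ab|a", "ab").group()

print(first_wins)  # a
print(reordered)   # ab
```

This is why the order of alternatives matters when a pattern is moved between tools.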
0
 
LVL 2

Author Comment

by:joegass
ID: 10831090
Thanks for all your help FishMonger, perhaps I should have said that I'm using VB.NET - I've had no problem with interpreting your Perl stuff tho'.

0
 
LVL 2

Author Comment

by:joegass
ID: 10831120
It still isn't quite working as I'd like.
Here's my code. Basically, all I want to do is return a collection of matches from a block of text (it includes my first regex):

        Private Function findPhoneNumbers(ByVal textIn As String) As Text.RegularExpressions.MatchCollection
            Dim re As New Text.RegularExpressions.Regex("\b\(?0\d{4}\)?\s*\d{6}\b|\b\(?0\d{3}\)?\s*\d{7}\b", Text.RegularExpressions.RegexOptions.Multiline)
            Return re.Matches(textIn)
        End Function

I'm trying your suggestion

^((\(?0\d{4}\)?)\s*\d{6})|\s+((\(?0\d{4}\)?)\s+\d{6})

and also

^((\(?0\d{4}\)?)\s*\d{6})|\b+((\(?0\d{4}\)?)\s+\d{6})

Both achieve moderate success, but they don't find all numbers in all formats 100% of the time, e.g.
"this is a test 01603 123456xcvc" still matches the phone number

With the first regex and a multiline input
"is a test
06103 345621 sdf dsfgsdfg"
the "t" at the end of "test" is matched along with the phone number

With the 2nd regex and a multiline input
"more stuff
(01604) 012345 more stuff this is a test."
"01604) 012345" is matched, missing the first bracket

Thanks for all your help so far!
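One way to address the remaining cases, offered as a sketch rather than a confirmed fix: require whitespace or the start of the text in front of the number with a lookbehind (so nothing extra is included in the match), keep `\b` after the last digit, and spell out the bracketed and unbracketed area codes as separate alternatives so that a lone bracket is not accepted. The pattern and the name `find_phone_numbers` are hypothetical, and the code is Python rather than VB.NET. .NET regular expressions also support `(?<!...)` and `(?:...)`, so the pattern string should carry over, but that has not been tested here.

```python
import re

# Hypothetical pattern, not taken from the thread.
PHONE = re.compile(
    r"(?<!\S)"                          # start of text, or right after whitespace
    r"(?:"
    r"(?:\(0\d{4}\)|0\d{4})\s*\d{6}"    # 5-digit code: both brackets or neither
    r"|"
    r"(?:\(0\d{3}\)|0\d{3})\s*\d{7}"    # 4-digit code followed by 7 digits
    r")"
    r"\b"                               # last digit must not run into a word character
)


def find_phone_numbers(text):
    """Return every phone number that stands apart from the surrounding words."""
    return PHONE.findall(text)


print(find_phone_numbers("this is a test (01603) 123456 more stuff"))
# ['(01603) 123456']
print(find_phone_numbers("this is a test(01603) 123456 more stuff"))
# []
print(find_phone_numbers("this is a test 01603 123456xcvc"))
# []
print(find_phone_numbers("is a test\n06103 345621 sdf dsfgsdfg"))
# ['06103 345621']
print(find_phone_numbers(
    "01603 123456 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (01603) 123456"
))
# ['01603 123456', '(01603) 123456']
```

Because the lookbehind does not consume the newline, no Multiline option is needed and the match never includes the character before the number.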
0
 
LVL 2

Author Comment

by:joegass
ID: 11040208
I never quite got to the bottom of this, but FishMonger was very helpful and knowledgeable
Thanks!
0
