
Regex help - problem with word boundary

Here's the regex: \b\(?0\d{4}\)?\s*\d{6}\b
I want it to pick out phone numbers from a block of text, e.g. "this is a test (01603) 123456 more stuff"
The match I get drops the opening bracket: "01603) 123456"
I want: "(01603) 123456"

I don't want a match if the phone number isn't set off by a word boundary, e.g. "this is a test(01603) 123456 more stuff" (no space between the word "test" and the phone number) should not match.

Thanks
Asked by: joegass
 
kmcghee commented:
Hey mate,

Basically, \b is an anchor, like the caret and the dollar sign. It matches at a position that is called a "word boundary".

There are four different positions that qualify as word boundaries:

1 - Before the first character in the string, if the first character is a word character.
2 - After the last character in the string, if the last character is a word character.
3 - Between a word character and a non-word character following right after the word character.
4 - Between a non-word character and a word character following right after the non-word character.

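* So I think the reason you're losing the ( is that it's a non-word character. The space before it is a non-word character too, so there is no word boundary in front of the bracket. The first place \b can match is between ( and 0, and because \(? is optional, the match simply starts at the 0.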

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
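
To see these rules at work on the sample text, here is a minimal sketch. Python is an assumption made purely for illustration (nobody in the thread uses it); on this ASCII input, \b and \B follow the rules listed above. The names `text`, `original`, `boundaries` and `bracket` are made up for the example.

```python
import re

text = "this is a test (01603) 123456 more stuff"

# The regex from the question, unchanged.
original = re.compile(r"\b\(?0\d{4}\)?\s*\d{6}\b")

# Every index at which \b matches (zero-width matches, so start == end).
boundaries = {m.start() for m in re.finditer(r"\b", text)}

bracket = text.index("(")             # index of the opening bracket
print(bracket in boundaries)          # boundary between " " and "("?
print(bracket + 1 in boundaries)      # boundary between "(" and "0"?
print(original.search(text).group())  # what the original regex returns
```

The first check comes out False and the second True, so the earliest point at which the original regex can start is the 0, and the bracket is left out of the match.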

I reckon if you use \B it'll match the bracket!

Regards,

Kevin

joegass (Author) commented:
That helps, but not quite there

I tried \B\(?0\d{4}\)?\s*\d{6}\b and that matched the number if it started with a bracket, but if it didn't start with a bracket (brackets are optional) it didn't find it.
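
A short sketch of why the \B version behaves this way (Python is used only for illustration, which is an assumption; `with_B`, `bracketed` and `plain` are hypothetical names). \B needs a non-boundary in front of the number. That holds between a space and "(", but a space followed by "0" is a word boundary, so \B fails there.

```python
import re

# The second attempt from the post above: \B in front instead of \b.
with_B = re.compile(r"\B\(?0\d{4}\)?\s*\d{6}\b")

bracketed = "this is a test (01603) 123456 more stuff"
plain = "this is a test 01603 123456 more stuff"

hit = with_B.search(bracketed)
miss = with_B.search(plain)

print(hit.group() if hit else None)    # bracketed number, found in full
print(miss.group() if miss else None)  # unbracketed number, not found
```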

Cheers for your help so far

FishMonger commented:
Note: you have 5 digits within the parentheses, and your regex matches them as a literal 0 followed by exactly 4 more digits (0\d{4}).

Here's a regex that works against your example data.  (The phone number is held in $1)

/\s((\(\d{5}\))?\s+\d{6})/
 
joegass (Author) commented:
What about cases where my string starts with a phone number?
Maybe I should have mentioned that before :)

e.g. "01603 123456 dfvgsd gsdfgsdfgdgdg fs"

or "(01603) 123456 dfvgsd gsdfgsdfgdgdg fs"

FishMonger commented:
As an additional test and example of the differences between \b \B \s see the results of my test.

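#!/usr/bin/perl
use strict;
use warnings;

# Try the same pattern with three different leading anchors on each DATA line.
while (<DATA>) {
   print "B: $1\n" if (/\B((\(\d{5}\))?\s+\d{6})/);   # \B: not at a word boundary
   print "b: $1\n" if (/\b((\(\d{5}\))?\s+\d{6})/);   # \b: at a word boundary
   print "s: $1\n" if (/\s((\(\d{5}\))?\s+\d{6})/);   # \s: preceded by whitespace
}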

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff


== outputs ==
B:  012345
b: (01604) 012345
B: (01603) 123456
s: (01603) 123456

---

As you can see:
   \b picks up the wrong line
   \B picks up both lines (but only the fragment " 012345" from the first)
   \s picks up the correct line

FishMonger commented:
In that case, you need to make the \( and \) optional.

while (<DATA>) {
#   print "B: $1\n" if (/\B((\(?\d{5}\)?)?\s+\d{6})/);
#   print "b: $1\n" if (/\b((\(?\d{5}\)?)?\s+\d{6})/);
   print "s: $1\n" if (/\s((\(?\d{5}\)?)?\s+\d{6})/);
}

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff
this is a test 05304 654321 more stuff


== outputs ==
s: (01603) 123456
s: 05304 654321

joegass (Author) commented:
Thanks for your quick responses!

If I try "01603 123456 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (01603) 123456"

It won't pick up the 1st phone number

joegass (Author) commented:
It also picks up the 2nd number, which is concatenated with another string, and I don't want it to:
"fg01603 123456" gets "01603 123456"

FishMonger commented:
Oops, I missed something, let me do another test (after I have breakfast) and I'll get back to you with the answer.

FishMonger commented:
Because of the variations in how the data may be formatted, and the exceptions to them, it might be easier to use 2 regexes.


#!/usr/bin/perl -w

while (<DATA>) {
   my @phone;
   push @phone, $1 if (/^((\(?\d{5}\)?)\s\d{6})/);
   push @phone, $1 if (/\s+((\(?\d{5}\)?)\s\d{6})/);
   print "$_\n" foreach @phone;
}

__DATA__
this is a test(01604) 012345 more stuff
this is a test (01603) 123456 more stuff
05304 654321 more stuff
06103 345621 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (06203) 234561


== outputs ==
(01603) 123456
05304 654321
06103 345621
(06203) 234561
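
The two patterns can also be folded into one by writing the leading condition as (?:^|\s). The sketch below is in Python purely for illustration (an assumption, since the thread's own code is Perl), and `PHONE`, `find_phones` and `lines` are hypothetical names. Unlike the Perl loop above, which keeps at most one match per pattern per line, it returns every match on a line.

```python
import re

# Start of line OR one whitespace character, then the number in group 1.
PHONE = re.compile(r"(?:^|\s)(\(?\d{5}\)?\s\d{6})")

def find_phones(line):
    """Return every captured phone number in the line."""
    return PHONE.findall(line)

lines = [
    "this is a test(01604) 012345 more stuff",
    "this is a test (01603) 123456 more stuff",
    "05304 654321 more stuff",
    "06103 345621 sdf dsfgsdfg fg fg01603 123456 sdfg fg dfG (06203) 234561",
]
for line in lines:
    for phone in find_phones(line):
        print(phone)
```

Like the Perl version, this does not check what follows the last digit, so a number that runs straight into trailing letters would still be picked up.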

DominicCronin commented:
I just wanted to jump in here and have a little bitch about life in general:

Why is it that when people ask questions about regexps, they almost never say which regexp language they are on about. Are we to assume that someone means Perl regexps, or the notionally identical but separately defined XML Schema regexps, or Microsoft Scripting regexps, etc, etc. ?

No offence - just felt like a bit of a rant. :-)

FishMonger commented:
DominicCronin,

Technically, you should probably say which type of 'regexp engine' rather than 'regexp language'. There are 4 regex engines (DFA, Traditional NFA, POSIX NFA, and Hybrid NFA/DFA), so multiple languages will use the same engine; however, each language may or may not implement all of the features of the engine that it uses. That said, I do agree that for questions in the generic programming topic area, it would help to know exactly which language is being used. As far as this question is concerned, since joegass didn't have anything to say about my use of Perl examples, I can safely assume that he is either using Perl or one of the other languages using the Trad NFA engine and knows how to drop the regex into his code.

joegass (Author) commented:
Thanks for all your help, FishMonger. Perhaps I should have said that I'm using VB.NET, but I've had no problem interpreting your Perl stuff.

joegass (Author) commented:
It still isn't quite working as I'd like.
Here's my code. Basically, all I want to do is return a collection of matches from a block of text (it includes my first regex):

        Private Function findPhoneNumbers(ByVal textIn As String) As Text.RegularExpressions.MatchCollection
            Dim re As New Text.RegularExpressions.Regex("\b\(?0\d{4}\)?\s*\d{6}\b|\b\(?0\d{3}\)?\s*\d{7}\b", Text.RegularExpressions.RegexOptions.Multiline)
            Return re.Matches(textIn)
        End Function

I'm trying your suggestion

^((\(?0\d{4}\)?)\s*\d{6})|\s+((\(?0\d{4}\)?)\s+\d{6})

and also

^((\(?0\d{4}\)?)\s*\d{6})|\b+((\(?0\d{4}\)?)\s+\d{6})

Both achieve moderate success, but they don't seem to find all numbers in all formats 100% of the time, e.g.
"this is a test 01603 123456xcvc" still matches the phone number

with the first regex and a multiline
"is a test
06103 345621 sdf dsfgsdfg"
t at the end of test and the phone number is matched

with the 2nd regex and a multiline
"more stuff
(01604) 012345 more stuff this is a test."
 01604) 012345 is matched missing the first bracket
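
One way to cover the cases above is to replace \b with lookarounds, which test the neighbouring character without consuming it and work whether or not the number starts with a bracket. The sketch below is in Python for illustration only (an assumption, not code from this thread). The (?<!\w) and (?!\w) constructs are also supported by the Perl and .NET regex engines, so the pattern string should carry over to VB.NET, though that has not been tested here. `PHONE` and `find_phone_numbers` are hypothetical names, and requiring the brackets to appear as a pair is an added assumption. Only the 0 + 4 digits, then 6 digits format is handled.

```python
import re

PHONE = re.compile(
    r"(?<!\w)"                # not directly preceded by a word character
    r"(?:\(0\d{4}\)|0\d{4})"  # 0 + 4 digits, with both brackets or none
    r"\s*\d{6}"               # optional whitespace, then 6 digits
    r"(?!\w)"                 # not directly followed by a word character
)

def find_phone_numbers(text_in):
    """Return all phone numbers that stand apart from surrounding words."""
    return PHONE.findall(text_in)

print(find_phone_numbers("this is a test (01603) 123456 more stuff"))
print(find_phone_numbers("this is a test(01603) 123456 more stuff"))
print(find_phone_numbers("this is a test 01603 123456xcvc"))
```

The first call returns the bracketed number in full. The other two return an empty list, because the number is glued to a word on one side.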

Thanks for all your help so far!

joegass (Author) commented:
I didn't ever quite get to the bottom of this, but FishMonger was very helpful and knowledgeable.
Thanks!