Solved

VBScript: Regular Expressions

Posted on 2011-03-18
Last Modified: 2012-05-11
Hi there,

I need to understand the meaning (in words) of every aspect of the following regular expression, dissected in detail:
"((coconut|two)?(?:[^A-Za-z0-9\n\r\t]*))?\(?(\d{3})\)?[- \.]?(\d{3})[- \.]?(\d{4})"

For example:
- ((coconut|two) = look for "coconut" and "two" as literal text due to the parentheses
- ?=Used here for ...
Etc...

If you have a better and clearer way than my example to explain it, please go ahead.

Thanks for your help,
Rene
Question by:ReneGe

Accepted Solution

by:
Daz_1234 earned 500 total points
ID: 35169219
The job of a regular expression is to match or 'capture' one or more patterns of characters in a string. The parts of the regex that are in parentheses are 'groups'. Each group that can capture produces a submatch. Submatches are labelled $1, $2 etc., numbered by the position of their opening parenthesis. In VBScript regex, submatches are zero-indexed, so $1 is oMatch.SubMatches(0), $2 is oMatch.SubMatches(1), etc.
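To make the numbering concrete, here is a minimal sketch. It uses Python's re module as a stand-in for the VBScript RegExp object (an assumption made only so the example is easy to run), and the small pattern and sample text are invented for illustration.

```python
import re

# A deliberately small pattern with two capturing groups.
m = re.search(r"(\d{3})-(\d{4})", "call 555-1234 now")

whole_match = m.group(0)   # the entire matched text
first = m.group(1)         # what the answer calls $1
second = m.group(2)        # what the answer calls $2

# groups() is a zero-indexed tuple, much like oMatch.SubMatches in VBScript:
# groups()[0] holds $1 and groups()[1] holds $2.
submatches = m.groups()
print(whole_match, submatches)
```

The off-by-one is the thing to remember: the first submatch is $1 in the labelling above, but index 0 in the collection.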

Capturing groups in your sample regex:

Group1: ((coconut|two)?(?:[^A-Za-z0-9\n\r\t]*))?
Group2: (coconut|two)?
Group3: (\d{3})
Group4: (\d{3})
Group5: (\d{4})

You may have spotted that I did not specify this one (?:[^A-Za-z0-9\n\r\t]*) as a group. That is because the ?: at the start of the group means match but don't capture (although the way it is here, whatever it matches is still captured as part of group 1; it just does not get its own group).
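The group list can be checked with a short sketch. Again this is Python's re module standing in for VBScript RegExp (an assumption), and the sample string "two - 767.543.6262" is made up for the purpose.

```python
import re

PATTERN = r"((coconut|two)?(?:[^A-Za-z0-9\n\r\t]*))?\(?(\d{3})\)?[- \.]?(\d{3})[- \.]?(\d{4})"
regex = re.compile(PATTERN)

# Number of capturing groups; the (?:...) part is not counted.
group_count = regex.groups

m = regex.search("two - 767.543.6262")
captured = m.groups()

for number, text in enumerate(captured, start=1):
    print("Group", number, "=", repr(text))
```

Group 1 comes back as "two - ", so the text matched by the non-capturing part does end up inside group 1, but there is no separate entry for it.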

So the whole regex goes like this:

Nothing is captured unless the whole regex matches somewhere in the string. Some parts of the pattern, however, are optional.
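As a rough illustration of what "optional" means here, the sketch below runs the pattern against three invented strings. It uses Python's re module rather than VBScript (an assumption); note that engines differ in how they report a group that took no part in the match (Python gives None, VBScript gives an empty submatch).

```python
import re

PATTERN = r"((coconut|two)?(?:[^A-Za-z0-9\n\r\t]*))?\(?(\d{3})\)?[- \.]?(\d{3})[- \.]?(\d{4})"
regex = re.compile(PATTERN)

# The word is present and followed only by a space: it is captured.
with_prefix = regex.search("coconut 123-456-7890")

# No word at all: the number still matches, group 2 is simply unset.
without_prefix = regex.search("123-456-7890")

# A letter sits between the word and the number, so the word cannot be part
# of the match, but the number on its own still matches.
blocked = regex.search("coconut x 123-456-7890")

for m in (with_prefix, without_prefix, blocked):
    print(repr(m.group(0)), repr(m.group(2)))
```

In all three cases the three digit groups are filled, because those parts are compulsory.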

Find a pattern within the test string that:
1. Starts with coconut OR two.  Place the result in submatch $2.  The pipe means OR.  This match is OPTIONAL because there is a question mark directly after the group, so a matching pattern does not have to start with either.  But if it does start with one, we want to capture it.

2. The next part is to match, but not separately capture, zero or more characters that are NOT A-Z or a-z or digits or new line or carriage return or tab. This is done by:
   ?:  means match if there are characters that qualify, but don't give them their own submatch. Why would we want that? Well, if the whole string we want to capture could possibly contain characters like that, we still want to have the whole string, but if there is a letter or digit in here then we don't. Either way we don't want this part as a separate result.
   [^A-Za-z0-9\n\r\t]  the ^ character as the first thing inside the square brackets means the character matched must NOT be any of the ones listed. The \n means new line, \r means carriage return and \t means tab.
   *  the asterisk means match any number of such characters, from zero to many.

SO the result of 1 plus 2 above combined is placed in Group1 or $1.

3. The question mark at the end of the surrounding parenthesis means the match is optional.  The resulting capture does not necessarily have coconut or two or any of the characters not in the square brackets.

4. The \(? means match a left bracket, but the bracket is optional. There is a leading backslash to escape the bracket so that the regex engine does not think we are starting a new group. Because it is not inside parentheses, it is not assigned to a submatch group.

5. (\d{3}) means capture 3 digits together. This is not optional: if there are not three digits together at this point, the attempt fails and the engine tries again further along the string. The digits are placed in submatch $3.

6. The \)? means match a right bracket, but the bracket is optional. It is not assigned to a submatch group.

7. The [- \.]? means after the 3 digits match any *one* of the characters in the square brackets: hyphen, space or dot. Outside square brackets a dot means any character, so it must be escaped with a backslash to be taken literally; inside square brackets it is already literal, so the backslash here is harmless but not required. Because this section is not in parentheses, it is not assigned to a submatch group. The question mark means this character is optional.

8. (\d{3}) as before, this matches any 3 digits. The digits are placed in submatch $4.

9. Another optional character [- \.]? hyphen or space or dot, optional, not placed in a submatch group.

10. The last part of the pattern that must match, (\d{4}), means any 4 digits together. It is not optional, and it is placed in submatch group $5.
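The ten steps above can be written back into the pattern as comments. The sketch below does that with Python's re.VERBOSE flag, which VBScript does not have, so treat it purely as a reading aid (an assumption of this example, as are the sample phone strings). It then checks that the commented version finds the same text as the original.

```python
import re

ORIGINAL = r"((coconut|two)?(?:[^A-Za-z0-9\n\r\t]*))?\(?(\d{3})\)?[- \.]?(\d{3})[- \.]?(\d{4})"

# In verbose mode, whitespace outside square brackets is ignored and
# everything after a hash sign is a comment.
COMMENTED = re.compile(r"""
    (                             # group 1, optional as a whole
        (coconut|two)?            # group 2: optional literal word
        (?:[^A-Za-z0-9\n\r\t]*)   # non-capturing: run of other characters
    )?
    \(?                           # optional left bracket
    (\d{3})                       # group 3: three digits
    \)?                           # optional right bracket
    [- \.]?                       # optional hyphen, space or dot
    (\d{3})                       # group 4: three digits
    [- \.]?                       # optional hyphen, space or dot
    (\d{4})                       # group 5: four digits
""", re.VERBOSE)

samples = ["(767).543-6262", "767 543 6262", "7675436262", "7371-11-2222"]

def matched_text(match):
    """Return the matched text, or None when there was no match."""
    return match.group(0) if match else None

results = [(matched_text(re.search(ORIGINAL, s)), matched_text(COMMENTED.search(s)))
           for s in samples]
for sample, pair in zip(samples, results):
    print(repr(sample), pair)
```

The last sample has its digits grouped 4-2-4, so neither version finds a match in it.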


SO if there was a string like this:
lkfdgi klh lakjfh slakjghdlaskjgh slzkjghlaskjghs lkjhg lazkjdh two £$%7371-11-2222 lasjud
hg lajhdalzudshv lajdhf lakjdshf two£$%^ (767).543-6262 djasdhg 767656.652652 kjhsdg kjsdfyh gaksjf
hy gazkjk gajfsg coconut£$%f(643)-767-1233 kajs gfkauysgf akiuyf gdkahdsf ga



... it would scan the string from left to right. At each position it first tries the optional prefix: a two or coconut followed by any run of characters that are not letters or digits (that run can even swallow the left bracket). Then it needs the first compulsory part, 3 digits together. After that it allows an optional right bracket and an optional hyphen, space or dot, and then it needs another compulsory 3 digits together. Then it knows there may be an optional space, dot or hyphen, and finally it needs a compulsory 4 digits together. If all these factors are true, then the pattern is captured.

In the sample it would capture "two£$%^ (767).543-6262" AND "767656.6526" (strictly, with the space in front of it, because the optional prefix in group 1 also accepts spaces), but NOT "two £$%7371-11-2222", because there are not 3 digits together, then 3 digits together, then 4. And NOT the whole of "coconut£$%f(643)-767-1233", because there is a letter between the coconut and the first set of digits. But even though the letter excludes the coconut part, that part is optional, remember, so the rest of the number, "(643)-767-1233", DOES match! Confused? This is really quite hard to get your head round.
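For anyone who wants to check this without the web tester, here is a small sketch that runs the pattern over the three sample lines. It uses Python's re module in place of VBScript RegExp with Global set to True (an assumption; for this particular pattern I would expect the same matches, but only the Python run is shown here).

```python
import re

PATTERN = r"((coconut|two)?(?:[^A-Za-z0-9\n\r\t]*))?\(?(\d{3})\)?[- \.]?(\d{3})[- \.]?(\d{4})"

# The three sample lines from above, joined with new lines.
TEXT = "\n".join([
    "lkfdgi klh lakjfh slakjghdlaskjgh slzkjghlaskjghs lkjhg lazkjdh two £$%7371-11-2222 lasjud",
    "hg lajhdalzudshv lajdhf lakjdshf two£$%^ (767).543-6262 djasdhg 767656.652652 kjhsdg kjsdfyh gaksjf",
    "hy gazkjk gajfsg coconut£$%f(643)-767-1233 kajs gfkauysgf akiuyf gdkahdsf ga",
])

# finditer walks the whole text, like Execute with Global = True.
matches = [m.group(0) for m in re.finditer(PATTERN, TEXT)]
for text in matches:
    print(repr(text))
```

Printing with repr makes the leading space on the second match visible.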

Here is a list of settings that can be used in VBScript regular expressions.
http://msdn.microsoft.com/en-us/library/f97kw5ka(v=VS.85).aspx

More info:
http://msdn.microsoft.com/en-us/library/yab2dx62(v=VS.85).aspx

Regular expressions tutorial:
http://www.regular-expressions.info/tutorial.html

A brilliant web page to instantly test regex:
http://gskinner.com/RegExr/
.. you paste the string to test in the main window, and paste the regex pattern into the box at the top. If you paste in the regex string from your question, then paste in the test string I created in the code box above, you should see the three matches I described above. See screen shot.

Hope this helps a bit - it took me a very long time to get my head around regular expressions, and there is often more than one correct answer to set one up.
Screenshot---180311---21-57-40.png

Author Closing Comment

by:ReneGe
ID: 35169280
Hey Daz,

I am more than impressed by all the dedication and efforts you put in helping me.

I'll go through it this weekend.

Thanks a lot!

Cheers,
Rene