Solved

Regex help!

Posted on 2011-02-11
Last Modified: 2012-06-27
I'm not terribly good with regular expressions, but I'm trying to use one to pull some text out of a long string.  Here is an example string.

"         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    "

What I really want is 4 matches that would give me "0108   COL-GDI-STAP" in one match, "1206   INDIO" in the next, and so on.

I've got the string expression to get the first three, but I can't figure out the code to get the last set!  Here is the code for the first 3--
"\d{4}.*?(?=\d{4})"

What expression would I use to get the last set?  Is there a better way to grab all 4 using one statement?  Thanks!
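
One way the lookahead attempt above might be extended to catch the last set is to let the lookahead accept the end of the string as well as the next four-digit code. The sketch below is in JavaScript rather than any particular host language, the function name `extractCodes` is made up for illustration, and it assumes the names themselves never contain a run of four digits.

```javascript
// Hypothetical helper (not from the original post).
// Each match starts at a 4-digit code and lazily extends until either
// the next 4-digit code or the end of the string lies ahead.
function extractCodes(line) {
  var pattern = /\d{4}.*?(?=\s+\d{4}|\s*$)/g;
  // With the g flag, match() returns an array of all full matches, or null.
  return line.match(pattern) || [];
}

var exampleLine = "     0108   COL-GDI-STAP      1206   INDIO      5030   HARLINGEN      ";
console.log(extractCodes(exampleLine));
```

Because the lookahead does not consume the whitespace, the matches come back without trailing padding.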
Question by:jacksonm1234
41 Comments
 
LVL 9

Expert Comment

by:user_n
ID: 34874475
[0-9]{4}\s+[-A-Z]+
0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34874557
Close, but I need it to support spaces, commas, etc. in amongst the letters.  For example, these two strings don't work correctly with your expression:

"         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    "

"         0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH                      "


I should have provided a better example initially. Sorry.
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 34874588
Something like this perhaps?
<html>
<body>

<script type="text/javascript">
  var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
  document.write( '"' + data + '"<br/>' );
  data = data.replace( /\s*([0-9]{4})\s+([-A-Z]+)\s*/g, '$1 $2' );
  document.write( '"' + data + '"<br/>' );

</script>

</body>
</html>


0
 
LVL 9

Expert Comment

by:user_n
ID: 34874596
And which symbols does "etc." cover?
0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34874657
HonorGod:  Same problem... I need to take into account other characters in the words.  Sorry for giving a poor example.

user_n:  So we can't just use a universal 'match any character'?  We have to use specifics?
The special symbols I need to deal with are spaces, commas, periods, hyphens, and slashes, and I think that's it.
0
 
LVL 9

Accepted Solution

by:
user_n earned 500 total points
ID: 34874658
[0-9]{4}\s+([-A-Z]|\s|,)+
0
 
LVL 9

Expert Comment

by:user_n
ID: 34874699
Try this
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)+

[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)+
0
 
LVL 31

Expert Comment

by:farzanj
ID: 34874713
Did you want it in Perl?  Here is something you may be interested in:


if ($line =~ /\s+(\d+\s+[-A-Z]+)\s+(\d+\s+[A-Z]+)\s+(\d+\s+[A-Z-]+)\s+(\d+\s+[A-Z-]+).+/)
{
    print $1;
    print $2;
    print $3;
    print $4;

}
else
{
    print 'no match found';
0
 
LVL 31

Expert Comment

by:farzanj
ID: 34874717
Sorry, I left off the closing brace.
0
 
LVL 9

Expert Comment

by:user_n
ID: 34874720
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)
0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34874793
None work.  The last two return only one match consisting of the entire string.  This one looks close, but it doesn't have the slashes, periods, etc. included.
[0-9]{4}\s+([-A-Z]|\s|,)+
0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34874799
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)

Same thing user_n.  It returns a match, but the match is the whole string (not broken up into pieces like I need).
0
 
LVL 9

Expert Comment

by:user_n
ID: 34874830
[0-9] - defines a single digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9
{4} - number of repetitions
[0-9]{4} - four digits
\s - a whitespace character; includes space, \t tab, \r carriage return, \n new line, \v vertical tab, \f form feed
+ - at least one
\s+ - one or more whitespace characters
[-A-Z] - defines the characters "-" or "A" or "B" ... "Z"
| - means "or"
([-A-Z]|\s|,|.|\|/|_) - is meant to match the character "-" or "A" or "B" ... "Z" or a whitespace character or "," or "." or "\" or "/" or "_"
* - zero or more occurrences
([-A-Z]|\s|,|.|\|/|_)* - is meant to match zero or more occurrences of the ([-A-Z]|\s|,|.|\|/|_) characters
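
A quick way to check a breakdown like this is to run the pattern. The JavaScript sketch below (variable and function names are illustrative, not from the thread) compares the pattern as written with a variant in which the dot and the backslash are escaped. In most regex flavors an unescaped `.` matches any character and `\|` is a literal pipe rather than a backslash, which would be consistent with the single whole-string match reported earlier in this thread.

```javascript
// Pattern as posted: "." is unescaped and "\|" is an escaped pipe.
// (The "/" is written as "\/" only because this is a JavaScript regex literal.)
var asPosted = /[0-9]{4}\s+([-A-Z]|\s|,|.|\|\/|_)*([-A-Z]|,|.|\|\/|_)/g;

// Variant with the dot and the backslash escaped.
var escapedVariant = /[0-9]{4}\s+([-A-Z]|\s|,|\.|\\|\/|_)*([-A-Z]|,|\.|\\|\/|_)/g;

// Hypothetical helper: count how many separate matches a pattern finds.
function countMatches(pattern, text) {
  return (text.match(pattern) || []).length;
}

var dotSample = "  0101   COB-FREMONT   1103   CLEARFIELD   2205   WALLA WALLA   4245   OOLOGAH  ";
console.log(countMatches(asPosted, dotSample));       // one match running to the end of the string
console.log(countMatches(escapedVariant, dotSample)); // one match per code/name pair
```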
0
 
LVL 31

Expert Comment

by:farzanj
ID: 34874843
Test the attached file.
test2.txt
0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34874866
Sorry farzan,  I dont know anything about Perl.
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 34874873
simpler...
<html>
<body>

<script type="text/javascript">
  var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
  document.write( '"' + data + '"<br/>' );
  data = data.replace( /\s*([0-9]{4})\s+(\S+)\s*/g, '$1 $2' );
  document.write( '"' + data + '"<br/>' );

</script>

</body>
</html>


0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34874889
Based on question history, I'll assume this is VB.NET--the overall logic should work in different languages though:
Imports System.Text.RegularExpressions

...

Dim src As String = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    "
Dim matches As MatchCollection = Regex.Matches(src.Trim(), "\d+\s+\S+")

matches(0).Value ' 0108   COL-GDI-STAP
matches(1).Value ' 1206   INDIO
matches(2).Value ' 2212   THEDALS-BEND
matches(3).Value ' 5030   HARLINGEN


0
 
LVL 31

Expert Comment

by:farzanj
ID: 34874903
Sorry, which language should I be using??
0
 
LVL 9

Expert Comment

by:user_n
ID: 34874905
What programming language are you using?

[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)
should match for
  0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH
on first pass, with program you can get the rest
0101   COB-FREMONT

([0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)\s+){3}([0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_))
should match
0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34874907
Lines 8 - 10 aren't really code lines, but are meant to demonstrate where to locate the value and what the value would be. I did an inadequate job of commenting  : \
0

 
LVL 2

Author Comment

by:jacksonm1234
ID: 34874915
HonorGod:

Doesn't seem to be working with slashes, commas, etc.
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34874949
Can you guarantee that each alpha string can have internal spaces of exactly one occurrence?

E.g.

This is my string  <--  single spaces

vs.

This   is  my string <-- multi-space


0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34875013
Everyone:
I'm using Excel VBA.   I apologize for not mentioning it before, but I assumed regex would be a standard implementation, no matter the language.

user_n:
Your first regex string returns a match (the whole string).  I want 4 matches that I can iterate through.  What do you mean "with program you can get the rest"?  I don't want to use Excel formulas or something to strip them out; that's why I'm using regex.

Your second regex freezes my Excel for some reason.

kaufmed:
I need it to account for spaces inside of words (see the additional examples I provided in the second post above).

0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34875022
kaufmed:
These could have multiple spaces.  This is the closest answer I've gotten so far, from user_n:
[0-9]{4}\s+([-A-Z]|\s|,)+  
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 34875024
Please explain:

Doesn't seem to be working ...

/\s*([0-9]{4})\s+(\S+)\s*/g

Means:


\s*         == zero or more whitespace characters
([0-9]{4})  == Group #1 (i.e., $1) composed of exactly 4 digits
\s+         == one  or more whitespace characters
(\S+)       == Group #2 (i.e., $2) composed of 1 or more non-Whitespace characters
\s*         == zero or more whitespace characters


0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34875072
HonorGod:

Try it with this string.  "0017   CANAL E/W RP "  returns as "0017  CANAL" by your expression.
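
The cut-off described here is easy to reproduce. A minimal JavaScript check using the sample from this post (the variable name is illustrative):

```javascript
// \S+ matches only up to the first whitespace character, so a name with
// internal spaces is cut off after its first word.
var firstWordOnly = "0017   CANAL E/W RP ".match(/[0-9]{4}\s+\S+/);
console.log(firstWordOnly[0]); // "0017   CANAL"
```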
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34875093
Given the above samples, perhaps this will fit the bill:

(I guess I'll switch to javascript since that appears to be the desired result)
<script type="text/javascript">
	function run() {
		var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";

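		// Trim leading and trailing whitespace
		data = data.replace(/^\s+|\s+$/g, '');
		// Put a placeholder between the end of each name and the next number
		data = data.replace(/([a-z])\s+(\d)/gi, '$1######$2');
		var matches = data.split(/######/);

		for (var i = 0; i < matches.length; i++)
		{
			alert(matches[i]);
		}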
	}
</script>


0
 
LVL 9

Expert Comment

by:user_n
ID: 34875116
Some symbols in a regular expression need to be escaped with \ in some languages. I do not use VB.
This may help with matching not only the first string but the rest too:
http://www.aspfree.com/c/a/Windows-Scripting/Regular-Expressions-in-VBScript/1/
So I escaped the \ in the next expression (I used "\\" for escaping):
[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)*([-A-Z]|,|.|\\|/|_)
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34875136
>>  but I assumed regex would be a standard implementation, no matter the language.

Ah my young padawan, you have much to learn   ; )

Here's the VBA breakdown of the above:
Sub func(data As String)
    Set RegularExpressionObject = New RegExp

    With RegularExpressionObject
        .IgnoreCase = True
        .Global = True
        
        .Pattern = "^\s+|\s+$"
        data = .Replace(data, "")
        
        .Pattern = "([a-z])\s+(\d)"
        data = .Replace(data, "$1#######$2")
    End With

    matches = Split(data, "#######")
    
    For i = 0 To UBound(matches)
        MsgBox matches(i)
    Next
End Sub


0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34875149
You can change line 11 to

    .Pattern = "([a-z])\s+(\d{4})"

to make it a tad more reliable, unless you never receive digits inside of the "strings"
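
As a variation on the replace-then-split idea above, the split can also be done in one step in JavaScript by splitting on whitespace that is immediately followed by four digits. This is a different technique from the placeholder approach (it relies on a lookahead and on `split` accepting a regex, which VBA's `Split` does not), the helper name `splitFields` is hypothetical, and it assumes no name contains whitespace followed by four digits.

```javascript
// Hypothetical helper: split a line into "code   name" fields.
function splitFields(line) {
  // Trim the ends, then split on any run of whitespace that is
  // immediately followed by four digits (the start of the next field).
  return line.trim().split(/\s+(?=\d{4})/);
}

console.log(splitFields("   0017   CANAL E/W RP     0300   DRGW     4230   TULSA, OK   "));
```

Unlike the `([a-z])\s+(\d)` boundary, this does not require a name to end in a letter.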
0
 
LVL 9

Expert Comment

by:user_n
ID: 34875389
[0-9]{4}\s+([A-Z]|\-|\s|\,|\.|\\|\/|\_)*([A-Z]|\,|\.|\\|\/|\_|\-)
0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34875477
This is the one that works.  Perfectly.

[0-9]{4}\s+([-A-Z./\\]|\s|,)+

user_n gave it in post 6ish, but I had to add the slashes and period, I think.  Thanks to all who helped.
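
For anyone wanting to try this pattern outside Excel, here is a hedged JavaScript sketch (the function name `matchStations` is invented for illustration). Because `\s` is among the allowed characters, each match also picks up the padding that follows the name, so the sketch trims every result.

```javascript
// Hypothetical helper built around the pattern from this post.
// The "/" is written as "\/" only because this is a JavaScript regex literal.
function matchStations(line) {
  var pattern = /[0-9]{4}\s+([-A-Z.\/\\]|\s|,)+/g;
  var found = line.match(pattern) || [];
  // Each raw match ends with the whitespace before the next code; trim it.
  return found.map(function (s) { return s.trim(); });
}

console.log(matchStations("   0017   CANAL E/W RP     0300   DRGW     2203   WALLULA     4230   TULSA, OK   "));
```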
0
 
LVL 2

Author Closing Comment

by:jacksonm1234
ID: 34875593
The solution was close but not quite there.
0
 
LVL 9

Expert Comment

by:user_n
ID: 34875664
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>
<body>

<script type="text/javascript">
  var data = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                   ";
  document.write( '"' + data + '"<br/>' );
  // match() with the g flag returns an array of every match, or null if there are none
  var matches = data.match(/[0-9]{4}\s+([A-Z]|\-|\s|\,|\.|\\|\/|\_)*([A-Z]|\,|\.|\\|\/|\_|\-)/g) || [];
  for (var i = 0; i < matches.length; i++)
  {
	document.write(matches[i] + '<br/>');
  }

</script>

</body>
</html>


0
 
LVL 9

Expert Comment

by:user_n
ID: 34875940
[0-9]{4}\\s+([A-Z]|-|\\s|,|\\.|\\\\|/|_)*([A-Z]|,|\\.|\\\\|/|_|-)   (.NET C#, written as a regular string literal)
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34876505
In the future, you can save yourself quite a few posts by giving an accurate representation of your data or the rules surrounding what valid data is comprised of. In the course of splitting a string into fields, it is important to know what comprises a field and what can function as a field separator.
0
 
LVL 2

Author Comment

by:jacksonm1234
ID: 34876561
I know, my initial post was incomplete, and I already admitted my mistake twice above.

However, I noted my more specific requirements in the second post, so giving an accurate representation of my data really didn't save many posts in this case.  Maybe I need to assume that no one reads the posts besides the main one?
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34876956
Just some friendly guidance regarding future pattern questions. You can take it or leave it--it makes no difference to me.

Glad you have a working solution which you understand.  = )
0
 
LVL 9

Expert Comment

by:user_n
ID: 34877460
Sub Macro2()
    Dim RegEx As Object
    Dim strTest As String
    Dim valid As Boolean
    Dim Matches As Object
    Dim i As Integer
   
Worksheets("Sheet1").Activate

strTest = "  0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK "
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True ' return every match, not just the first
' "\." matches a literal period and "\\" a literal backslash
RegEx.Pattern = "[0-9]{4}\s+([A-Z]|-|\s|,|\.|\\|/|_)*([A-Z]|,|\.|\\|/|_|-)"
Set Matches = RegEx.Execute(strTest)
' Write each match to column A of Sheet1, one per row
For i = 1 To Matches.Count
    Worksheets("Sheet1").Cells(i, 1).Value = Matches(i - 1).Value
Next i

    Set RegEx = Nothing
   
End Sub
0
 
LVL 9

Expert Comment

by:user_n
ID: 34877469
The "." needed to be escaped ("\.") to match only a literal "."; otherwise it matches any single character.
0
 
LVL 9

Expert Comment

by:user_n
ID: 34877630

[0-9]{4}\s+([A-Z]|-|\s|,|\.|\\|/|_)*([A-Z]|,|\.|\\|/|_|-)
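
Since every alternative in this pattern is a single character, the alternation groups can also be written as character classes. The JavaScript sketch below (variable names are illustrative) runs both forms on one of the sample lines to check that they find the same matches; it is a comparison under the assumption that the input looks like the samples in this thread, not a claim about every regex engine.

```javascript
// Pattern from this post ("/" escaped only because this is a regex literal).
var alternationForm = /[0-9]{4}\s+([A-Z]|-|\s|,|\.|\\|\/|_)*([A-Z]|,|\.|\\|\/|_|-)/g;

// Same idea with one character class per group. The second class leaves
// out \s, so a match can never end in whitespace.
var classForm = /[0-9]{4}\s+[-A-Z\s,.\\\/_]*[-A-Z,.\\\/_]/g;

var classSample = "   0017   CANAL E/W RP     0300   DRGW     2203   WALLULA     4230   TULSA, OK   ";
var fromAlternation = classSample.match(alternationForm);
var fromClass = classSample.match(classForm);
console.log(fromAlternation);
console.log(fromClass);
```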
0
