Solved

Regex help!

Posted on 2011-02-11
Last Modified: 2012-06-27
I'm not terribly good with regex, but I'm trying to use one to pull some text out of a long string.  Here is an example string.

"         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    "

What I really want is 4 matches that would give me "0108   COL-GDI-STAP" in one match, "1206   INDIO" in the next, and so on.

I've got the string expression to get the first three, but I can't figure out the code to get the last set!  Here is the code for the first 3--
"\d{4}.*?(?=\d{4})"

What expression would I use to get the last set?  Is there a better way to grab all 4 using one statement?  Thanks!
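
Editor's sketch, in JavaScript purely for illustration (the thread has not named a language at this point). The pattern above stops at three matches because the lookahead (?=\d{4}) needs another four-digit code to follow, and nothing follows the last entry. One possible variant, assuming every entry ends either before whitespace plus the next code or before the trailing padding, adds an end-of-string alternative to the lookahead. The name extractEntries is hypothetical, and a name that itself contains a standalone four-digit run would be cut short.

```javascript
// Hypothetical helper: returns every "code   name" entry in a fixed-width line.
function extractEntries(line) {
  // \d{4}    four-digit code
  // .*?      as few characters as possible...
  // (?=...)  ...until whitespace plus the next code, or the end of the string
  const pattern = /\d{4}.*?(?=\s+\d{4}|\s*$)/g;
  return line.match(pattern) || [];
}

const sample = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
console.log(extractEntries(sample));
```

The lazy .*? is kept from the original expression; only the lookahead changes, so the fourth entry is found without a second pattern.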
Question by:jacksonm1234
41 Comments
 
LVL 9

Expert Comment

by:user_n
[0-9]{4}\s+[-A-Z]+
 
LVL 2

Author Comment

by:jacksonm1234
Close, but I need it to support spaces, commas, etc. in amongst the letters.  For example, these two strings don't work correctly with your expression:

"         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    "

"         0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH                      "


I should have provided a better example initially. Sorry.
 
LVL 41

Expert Comment

by:HonorGod
Something like this perhaps?
<html>
<body>

<script type="text/javascript">
  var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
  document.write( '"' + data + '"<br/>' );
  data = data.replace( /\s*([0-9]{4})\s+([-A-Z]+)\s*/g, '$1 $2' );
  document.write( '"' + data + '"<br/>' );

</script>

</body>
</html>

 
LVL 9

Expert Comment

by:user_n
And which symbols does "etc." cover?
 
LVL 2

Author Comment

by:jacksonm1234
HonorGod:  Same problem... I need to take into account other characters in the words.  Sorry for giving a poor example.

user_n:  So we can't just use a universal 'match any character'?  We have to use specifics?
The special symbols I need to deal with are spaces, commas, periods, hyphens, and slashes, and I think that's it.
 
LVL 9

Accepted Solution

by:
user_n earned 500 total points
[0-9]{4}\s+([-A-Z]|\s|,)+
 
LVL 9

Expert Comment

by:user_n
Try this
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)+

[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)+
 
LVL 31

Expert Comment

by:farzanj
Did you want this in Perl?  Here is something you may be interested in:


if ($line =~ /\s+(\d+\s+[-A-Z]+)\s+(\d+\s+[A-Z]+)\s+(\d+\s+[A-Z-]+)\s+(\d+\s+[A-Z-]+).+/)
{
    print $1;
    print $2;
    print $3;
    print $4;

}
else
{
    print 'no match found';
 
LVL 31

Expert Comment

by:farzanj
Sorry, I left off the closing brace.
 
LVL 9

Expert Comment

by:user_n
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)
 
LVL 2

Author Comment

by:jacksonm1234
None work.  The last two return only one match consisting of the entire string.  This one looks close, but it doesn't have the slashes, periods, etc. included.
[0-9]{4}\s+([-A-Z]|\s|,)+
 
LVL 2

Author Comment

by:jacksonm1234
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)

Same thing user_n.  It returns a match, but the match is the whole string (not broken up into pieces like I need).
 
LVL 9

Expert Comment

by:user_n
[0-9] - defines one digit symbol: 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9
{4} - number of repetitions
[0-9]{4} - four digit symbols
\s - a whitespace character (includes space, \t tab, \r carriage return, \n new line, \v vertical tab, \f form feed)
+ - at least one
\s+ - one or more whitespace characters
[-A-Z] - defines chars "-" or "A" or "B" ... "Z"
| - means or
([-A-Z]|\s|,|.|\|/|_) - is meant to match the character "-" or "A" or "B" ... "Z" or a whitespace character or "," or "." or "\" or "/" or "_"
* - 0 or more appearances of the preceding item
([-A-Z]|\s|,|.|\|/|_)* - means 0 or more appearances of the ([-A-Z]|\s|,|.|\|/|_) characters
 
LVL 31

Expert Comment

by:farzanj
Test the attached file.
test2.txt
 
LVL 2

Author Comment

by:jacksonm1234
Sorry farzan, I don't know anything about Perl.
 
LVL 41

Expert Comment

by:HonorGod
simpler...
<html>
<body>

<script type="text/javascript">
  var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
  document.write( '"' + data + '"<br/>' );
  data = data.replace( /\s*([0-9]{4})\s+(\S+)\s*/g, '$1 $2' );
  document.write( '"' + data + '"<br/>' );

</script>

</body>
</html>

 
LVL 74

Expert Comment

by:käµfm³d 👽
Based on question history, I'll assume this is VB.NET--the overall logic should work in different languages though:
Imports System.Text.RegularExpressions

...

Dim src As String = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    "
Dim matches As MatchCollection = Regex.Matches(src.Trim(), "\d+\s+\S+")

matches(0).Value ' 0108   COL-GDI-STAP
matches(1).Value ' 1206   INDIO
matches(2).Value ' 2212   THEDALS-BEND
matches(3).Value ' 5030   HARLINGEN

 
LVL 31

Expert Comment

by:farzanj
Sorry, which language should I be using??
 
LVL 9

Expert Comment

by:user_n
What programming language are you using?

[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)
applied to
  0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH
should match the following on the first pass; your program can then repeat the match to get the rest:
0101   COB-FREMONT

([0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)\s+){3}([0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_))
should match
0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH
 
LVL 74

Expert Comment

by:käµfm³d 👽
Lines 8 - 11 (the matches(n).Value lines) aren't really code lines; they are meant to demonstrate where to locate each value and what the value would be. I did an inadequate job of commenting  : \

 
LVL 2

Author Comment

by:jacksonm1234
HonorGod:

Doesn't seem to be working with slashes, commas, etc.
 
LVL 74

Expert Comment

by:käµfm³d 👽
Can you guarantee that each alpha string can have internal spaces of exactly one occurrence?

E.g.

This is my string  <--  single spaces

vs.

This   is  my string <-- multi-space

 
LVL 2

Author Comment

by:jacksonm1234
Everyone:
I'm using Excel VBA.   I apologize for not mentioning it before, but I assumed regex would be a standard implementation, no matter the language.

user_n:
Your first regex string returns a match (the whole string).  I want 4 matches that I can iterate through.  What do you mean by "with program you can get the rest"?  I don't want to use Excel formulas or something to strip them out; that's why I'm using regex.

Your second regex freezes my Excel for some reason.

kaufmed:
I need it to account for spaces inside of words (see the additional examples I provided in the second post above).
 
LVL 2

Author Comment

by:jacksonm1234
kaufmed:
These could have multiple spaces.  This is the closest answer I've gotten so far, from user_n:
[0-9]{4}\s+([-A-Z]|\s|,)+  
 
LVL 41

Expert Comment

by:HonorGod
Please explain what you mean by "Doesn't seem to be working ...".

/\s*([0-9]{4})\s+(\S+)\s*/g

Means:


\s*         == zero or more whitespace characters
([0-9]{4})  == Group #1 (i.e., $1) composed of exactly 4 digits
\s+         == one  or more whitespace characters
(\S+)       == Group #2 (i.e., $2) composed of 1 or more non-Whitespace characters
\s*         == zero or more whitespace characters

 
LVL 2

Author Comment

by:jacksonm1234
HonorGod:

Try it with this string:  "0017   CANAL E/W RP " is returned as "0017  CANAL" by your expression.
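
Editor's sketch: the truncation reported above follows from how \S+ works. It stops at the first whitespace character, so only the first word of a multi-word name is captured. The JavaScript below shows that behaviour, plus one possible extension that keeps appending whitespace-separated words until the next word is a bare four-digit code. The entryPattern name and the "four digits means a new entry" rule are assumptions drawn from the sample data, not something confirmed in the thread.

```javascript
const sample = "0017   CANAL E/W RP          0300   DRGW   ";

// \S+ matches non-whitespace only, so each name is cut at its first space.
const firstWordOnly = sample.match(/[0-9]{4}\s+\S+/g);
console.log(firstWordOnly); // ["0017   CANAL", "0300   DRGW"]

// Assumed rule: a name is a run of words, none of which is a bare 4-digit code.
// (?:\s+(?!\d{4}\b)\S+)* appends further words unless the next word is a code.
const entryPattern = /[0-9]{4}\s+\S+(?:\s+(?!\d{4}\b)\S+)*/g;
const wholeNames = sample.match(entryPattern);
console.log(wholeNames); // ["0017   CANAL E/W RP", "0300   DRGW"]
```

This variant does not need a list of allowed punctuation, at the cost of treating any standalone four-digit word as the start of a new entry.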
 
LVL 74

Expert Comment

by:käµfm³d 👽
Given the above samples, perhaps this will fit the bill:

(I guess I'll switch to javascript since that appears to be the desired result)
<script type="text/javascript">
	function run() {
		var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";

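		// Trim the leading and trailing padding
		data = data.replace(/^\s+|\s+$/g, '');
		// Mark each gap between the end of a name and the next code
		data = data.replace(/([a-z])\s+(\d)/gi, '$1######$2');
		var matches = data.split(/######/);

		// Index loop rather than for...in, which is not meant for arrays
		for (var i = 0; i < matches.length; i++)
		{
			alert(matches[i]);
		}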
	}
</script>

 
LVL 9

Expert Comment

by:user_n
Some symbols in a regular expression need to be escaped with \ in some languages. I do not use VB, but this may help with matching not only the first string but the rest too:
http://www.aspfree.com/c/a/Windows-Scripting/Regular-Expressions-in-VBScript/1/
I escaped the \ in the next expression (I used "\\" for escaping):
[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)*([-A-Z]|,|.|\\|/|_)
 
LVL 74

Expert Comment

by:käµfm³d 👽
>>  but I assumed regex would be a standard implementation, no matter the language.

Ah my young padawan, you have much to learn   ; )

Here's the VBA breakdown of the above:
Sub func(data As String)
    Set RegularExpressionObject = New RegExp  ' needs a project reference to "Microsoft VBScript Regular Expressions 5.5"

    With RegularExpressionObject
        .IgnoreCase = True
        .Global = True
        
        .Pattern = "^\s+|\s+$"
        data = .Replace(data, "")
        
        .Pattern = "([a-z])\s+(\d)"
        data = .Replace(data, "$1#######$2")
    End With

    matches = Split(data, "#######")
    
    For i = 0 To UBound(matches)
        MsgBox matches(i)
    Next
End Sub

 
LVL 74

Expert Comment

by:käµfm³d 👽
You can change line 11 (the second .Pattern assignment) to

    .Pattern = "([a-z])\s+(\d{4})"

to make it a tad more reliable, unless you never receive digits inside of the "strings"
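
Editor's sketch: the marker-and-split idea above can also be done in one step by splitting directly on the gap between entries. A minimal JavaScript version, assuming a new entry always starts with whitespace followed by a standalone four-digit code (the splitEntries name is hypothetical, and a name that itself contains a standalone four-digit word would be split in the wrong place):

```javascript
function splitEntries(line) {
  // Trim the padding, then split on any whitespace run that is
  // immediately followed by a standalone four-digit code.
  return line.trim().split(/\s+(?=\d{4}\b)/);
}

const sample = "         0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH                      ";
console.log(splitEntries(sample));
```

Because the lookahead does not consume the digits, no placeholder string has to be inserted and removed again.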
 
LVL 9

Expert Comment

by:user_n
[0-9]{4}\s+([A-Z]|\-|\s|\,|\.|\\|\/|\_)*([A-Z]|\,|\.|\\|\/|\_|\-)
 
LVL 2

Author Comment

by:jacksonm1234
This is the one that works.  Perfectly.

[0-9]{4}\s+([-A-Z./\\]|\s|,)+

user_n gave it in post 6ish, but I had to add the slashes and period, I think.  Thanks to all who helped.
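
Editor's sketch: for reference, the pattern above run globally over one of the sample lines. JavaScript is used only for illustration (the thread's target is Excel VBA), and extractWithClass is a hypothetical name. Because the repeated group also accepts \s, each raw match carries the padding that follows the name, so the sketch trims it.

```javascript
function extractWithClass(line) {
  // Same pattern as in the post, built from a raw string so "/" needs no escaping.
  const pattern = new RegExp(String.raw`[0-9]{4}\s+([-A-Z./\\]|\s|,)+`, "g");
  const raw = line.match(pattern) || [];
  // Each raw match ends with the spaces before the next code; strip them.
  return raw.map((entry) => entry.trim());
}

const sample = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    ";
console.log(extractWithClass(sample));
```

The group never accepts a digit, which is what makes each match stop at the next four-digit code.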
 
LVL 2

Author Closing Comment

by:jacksonm1234
The solution was close but not quite there.
 
LVL 9

Expert Comment

by:user_n
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>
<body>

<script type="text/javascript">
  var data = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                   ";
  document.write( '"' + data + '"<br/>' );
  // With the g flag, match() returns an array of every match, or null if there are none
  var matches = data.match(/[0-9]{4}\s+([A-Z]|\-|\s|\,|\.|\\|\/|\_)*([A-Z]|\,|\.|\\|\/|\_|\-)/g) || [];
  for (var i = 0; i < matches.length; i++)
  {
	document.write(matches[i] + '<br/>');
  }

</script>

</body>
</html>

 
LVL 9

Expert Comment

by:user_n
[0-9]{4}\\s+([A-Z]|-|\\s|,|\\.|\\\\|/|_)*([A-Z]|,|\\.|\\\\|/|_|-) .Net C# (as a regular string literal, so every regex backslash is doubled)
 
LVL 74

Expert Comment

by:käµfm³d 👽
In the future, you can save yourself quite a few posts by giving an accurate representation of your data or the rules surrounding what valid data is comprised of. In the course of splitting a string into fields, it is important to know what comprises a field and what can function as a field separator.
 
LVL 2

Author Comment

by:jacksonm1234
I know, my initial post was incomplete, and I already admitted my mistake twice above.

However, I noted my  more specific requirements in the second post, so giving an accurate representation of my data really didn't save many posts in this case.  Maybe I need to assume that no one reads the posts besides the main one?
 
LVL 74

Expert Comment

by:käµfm³d 👽
Just some friendly guidance regarding future pattern questions. You can take it or leave it--it makes no difference to me.

Glad you have a working solution which you understand.  = )
 
LVL 9

Expert Comment

by:user_n
Sub Macro2()
    Dim RegEx As Object
    Dim strTest As String
    Dim valid As Boolean
    Dim Matches As Object
    Dim i As Integer
   
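    Worksheets("Sheet1").Activate

    strTest = "  0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK "
    Set RegEx = CreateObject("VBScript.RegExp")
    ' "\\" matches a literal backslash; a lone "\|" would match a literal pipe instead
    RegEx.Pattern = "[0-9]{4}\s+([A-Z]|-|\s|,|\.|\\|/|_)*([A-Z]|,|\.|\\|/|_|-)"
    i = 1
    ' Global is left at its default (False), so each pass handles only the first remaining match
    valid = RegEx.Test(strTest)
    While valid
        Set Matches = RegEx.Execute(strTest)
        Worksheets("Sheet1").Cells(i, 1).Value = CStr(Matches(0))
        ' Remove the entry just written, then look for the next one
        strTest = RegEx.Replace(strTest, "")
        valid = RegEx.Test(strTest)
        i = i + 1
    Wend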

    Set RegEx = Nothing
   
End Sub
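
Editor's sketch: the macro above extracts one match per pass and deletes it from the string before testing again. With a global search the same entries can be walked in a single pass; in VBScript.RegExp that corresponds to setting the Global property to True and looping over the collection returned by Execute. The JavaScript below shows the single-pass idea for illustration only, with capture groups that separate the code from the name. parseEntries and the code/name field names are hypothetical.

```javascript
function parseEntries(line) {
  // Group 1: four-digit code.
  // Group 2: name, which may contain spaces but must end on a non-space character.
  const pattern = /([0-9]{4})\s+((?:[-A-Z.,\/\\_]|\s)*[-A-Z.,\/\\_])/g;
  const entries = [];
  for (const m of line.matchAll(pattern)) {
    entries.push({ code: m[1], name: m[2] });
  }
  return entries;
}

const sample = "  0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK ";
console.log(parseEntries(sample));
```

Requiring the name to end on a non-space character removes the need to trim each result afterwards.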
 
LVL 9

Expert Comment

by:user_n
. needed to be escaped ("\.") to match only the symbol ".", otherwise it matches any single character
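
Editor's sketch of that difference, in JavaScript for illustration only. Both patterns are ones posted in this thread, built with String.raw so they appear exactly as written. With the bare ".", the repeated group can consume any character, digits included, so a single match runs to the end of the line. With "\." the group stops at the next four-digit code.

```javascript
const sample = "  0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK ";

// Unescaped "." matches any character, so the group swallows the later codes too.
const unescapedDot = new RegExp(String.raw`[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)+`, "g");
// Escaped "\." matches only a literal period.
const escapedDot = new RegExp(String.raw`[0-9]{4}\s+([A-Z]|-|\s|,|\.|\\|/|_)*([A-Z]|,|\.|\\|/|_|-)`, "g");

const one = sample.match(unescapedDot);
const four = sample.match(escapedDot);
console.log(one.length);  // 1
console.log(four.length); // 4
```

This is consistent with the earlier reports in the thread that the unescaped versions returned the whole string as one match.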
 
LVL 9

Expert Comment

by:user_n
[0-9]{4}\s+([A-Z]|-|\s|,|\.|\\|/|_)*([A-Z]|,|\.|\\|/|_|-)