  • Status: Solved
  • Priority: Medium
  • Security: Public

Regex help!

I'm not terribly good with Regex, but I'm trying to use them to strip out some text from a long string.  Here is an example string.

"         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    "

What I really want is 4 matches that would give me "0108   COL-GDI-STAP" in one match, "1206   INDIO" in the next, and so on.

I've got the string expression to get the first three, but I can't figure out the code to get the last set!  Here is the code for the first 3--
"\d{4}.*?(?=\d{4})"

What expression would I use to get the last set?  Is there a better way to grab all 4 using one statement?  Thanks!
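One way to see why the pattern above stops at three matches: the lookahead `(?=\d{4})` requires another four-digit run after every match, and nothing follows the last field. A minimal sketch in JavaScript (no language had been named at this point in the thread), assuming a name never contains four consecutive digits, lets the lookahead also accept the end of the string. The variable names are illustrative only.

```javascript
// Sample row from the question.
var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";

// Original pattern: every match must be followed by another 4-digit run,
// so the last field can never match.
var firstThree = data.match(/\d{4}.*?(?=\d{4})/g);

// Variant: the match may also end at the end of the string.
var allFour = data.match(/\d{4}.*?(?=\d{4}|$)/g).map(function (m) {
  return m.replace(/\s+$/, ''); // drop the padding after each name
});

console.log(firstThree.length); // number of matches from the original pattern
console.log(allFour);
```

The lazy `.*?` is what keeps each match from running into the next field; the trailing padding still has to be trimmed afterwards.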
0
jacksonm1234
Asked:
jacksonm1234
1 Solution
 
user_nCommented:
[0-9]{4}\s+[-A-Z]+
0
 
jacksonm1234Author Commented:
Close, but I need it to support spaces, commas, etc. in amongst the letters.  For example, these two strings don't work correctly with your expression:

"         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    "

"         0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH                      "


I should have provided a better example initially. Sorry.
0
 
HonorGodCommented:
Something like this perhaps?
<html>
<body>

<script type="text/javascript">
  var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
  document.write( '"' + data + '"<br/>' );
  data = data.replace( /\s*([0-9]{4})\s+([-A-Z]+)\s*/g, '$1 $2' );
  document.write( '"' + data + '"<br/>' );

</script>

</body>
</html>

0
 
user_nCommented:
And which symbols does the "etc." cover?
0
 
jacksonm1234Author Commented:
HonorGod:  Same problem... I need to take into account other characters in the words.  Sorry for giving a poor example.

user_n:  So we can't just use a universal 'match any character'?  We have to use specifics?
The special symbols I need to deal with are spaces, commas, periods, hyphens, slashes, and I think that's it.
0
 
user_nCommented:
[0-9]{4}\s+([-A-Z]|\s|,)+
0
 
user_nCommented:
Try this
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)+

[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)+
0
 
farzanjCommented:
Did you want this in Perl?  Here is something you may be interested in:


if ($line =~ /\s+(\d+\s+[-A-Z]+)\s+(\d+\s+[A-Z]+)\s+(\d+\s+[A-Z-]+)\s+(\d+\s+[A-Z-]+).+/)
{
    print $1;
    print $2;
    print $3;
    print $4;

}
else
{
    print 'no match found';
0
 
farzanjCommented:
Sorry, I left off the closing brace.
0
 
user_nCommented:
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)
0
 
jacksonm1234Author Commented:
None work.  The last two return only one match consisting of the entire string.  This one looks close, but it doesn't have the slashes, periods, etc. included.
[0-9]{4}\s+([-A-Z]|\s|,)+
0
 
jacksonm1234Author Commented:
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)

Same thing user_n.  It returns a match, but the match is the whole string (not broken up into pieces like I need).
0
 
user_nCommented:
[0-9] - defines one digit symbol: 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9
{4} - number of repetitions
[0-9]{4} - four digit symbols
\s - whitespace characters: includes \t tab, \r carriage return, \n new line, \v vertical tab, \f form feed
+ - at least one
\s+ - one or more whitespace characters
[-A-Z] - defines chars "-" or "A" or "B" ... "Z"
| - means or
([-A-Z]|\s|,|.|\|/|_) - means character "-" or "A" or "B" ... "Z" or "whitespace character" or "," or "." or "\" or "/" or "_"
* - 0 or more appearances of characters
([-A-Z]|\s|,|.|\|/|_)* - means 0 or more appearances of ([-A-Z]|\s|,|.|\|/|_) characters
0
 
farzanjCommented:
Test the attached file.
test2.txt
0
 
jacksonm1234Author Commented:
Sorry farzan, I don't know anything about Perl.
0
 
HonorGodCommented:
simpler...
<html>
<body>

<script type="text/javascript">
  var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
  document.write( '"' + data + '"<br/>' );
  data = data.replace( /\s*([0-9]{4})\s+(\S+)\s*/g, '$1 $2' );
  document.write( '"' + data + '"<br/>' );

</script>

</body>
</html>

0
 
käµfm³d 👽Commented:
Based on question history, I'll assume this is VB.NET--the overall logic should work in different languages though:
Imports System.Text.RegularExpressions

...

Dim src As String = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    "
Dim matches As MatchCollection = Regex.Matches(src.Trim(), "\d+\s+\S+")

matches(0).Value ' 0108   COL-GDI-STAP
matches(1).Value ' 1206   INDIO
matches(2).Value ' 2212   THEDALS-BEND
matches(3).Value ' 5030   HARLINGEN

0
 
farzanjCommented:
Sorry, which language should I be using??
0
 
user_nCommented:
What programming language are you using?

[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)
should match for
  0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH
on first pass, with program you can get the rest
0101   COB-FREMONT

([0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)\s+){3}([0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_))
should match
0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH
0
 
käµfm³d 👽Commented:
Lines 8 - 10 aren't really code lines, but are meant to demonstrate where to locate the value and what the value would be. I did an inadequate job of commenting  : \
0
 
jacksonm1234Author Commented:
HonorGod:

Doesn't seem to be working with slashes, commas, etc.
0
 
käµfm³d 👽Commented:
Can you guarantee that the internal spaces in each alpha string always occur singly, never two or more in a row?

E.g.

This is my string  <--  single spaces

vs.

This   is  my string <-- multi-space

0
 
jacksonm1234Author Commented:
Everyone:
I'm using Excel VBA.   I apologize for not mentioning it before, but I assumed regex would be a standard implementation, no matter the language.

user_n:
Your first regex string returns a match (the whole string).  I want 4 matches that I can iterate through.  What do you mean "with program you can get the rest"?  I don't want to use excel formulas or something to strip them out, that's why I'm using regex.

Your second regex freezes my Excel for some reason.

kaufmed:
I need it to account for spaces inside of words (see the additional examples I provided in the second post above).

0
 
jacksonm1234Author Commented:
kaufmed:
These could have multiple spaces.  This is the closest answer I've gotten so far, from user_n:
[0-9]{4}\s+([-A-Z]|\s|,)+  
0
 
HonorGodCommented:
Please explain:

Doesn't seem to be working ...

/\s*([0-9]{4})\s+(\S+)\s*/g

Means:


\s*         == zero or more whitespace characters
([0-9]{4})  == Group #1 (i.e., $1) composed of exactly 4 digits
\s+         == one  or more whitespace characters
(\S+)       == Group #2 (i.e., $2) composed of 1 or more non-Whitespace characters
\s*         == zero or more whitespace characters

0
 
jacksonm1234Author Commented:
HonorGod:

Try it with this string.  "0017   CANAL E/W RP "  returns as "0017  CANAL" by your expression.
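The truncation reported here is easy to reproduce: `\S+` stops at the first whitespace character, so a name with internal spaces is cut after its first word. A small JavaScript sketch, using `match` instead of the `replace` call from the earlier post purely for illustration:

```javascript
var sample = "0017   CANAL E/W RP ";

// Group 2 is \S+, which cannot cross the space after "CANAL".
var m = sample.match(/([0-9]{4})\s+(\S+)/);

console.log(m[1]); // the 4-digit code
console.log(m[2]); // only the first word of the name
```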
0
 
käµfm³d 👽Commented:
Given the above samples, perhaps this will fit the bill:

(I guess I'll switch to javascript since that appears to be the desired result)
<script type="text/javascript">
	function run() {
		var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";

		data = data.replace(/^\s+|\s+$/g, '');
		data = data.replace(/([a-z])\s+(\d)/gi, '$1######$2');
		var matches = data.split(/######/);

		// Index loop rather than for...in, which is meant for object keys.
		for (var i = 0; i < matches.length; i++)
		{
			alert(matches[i]);
		}
	}
</script>

0
 
user_nCommented:
Some symbols in regular expressions need to be escaped with \ in some languages. I do not use VB.
This may help
http://www.aspfree.com/c/a/Windows-Scripting/Regular-Expressions-in-VBScript/1/
for matching not only the first string but the rest too.
so I escaped the \ in the next expression (I used "\\" for escaping)
[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)*([-A-Z]|,|.|\\|/|_)
0
 
käµfm³d 👽Commented:
>>  but I assumed regex would be a standard implementation, no matter the language.

Ah my young padawan, you have much to learn   ; )

Here's the VBA breakdown of the above:
Sub func(data As String)
    Set RegularExpressionObject = New RegExp

    With RegularExpressionObject
        .IgnoreCase = True
        .Global = True
        
        .Pattern = "^\s+|\s+$"
        data = .Replace(data, "")
        
        .Pattern = "([a-z])\s+(\d)"
        data = .Replace(data, "$1#######$2")
    End With

    matches = Split(data, "#######")
    
    For i = 0 To UBound(matches)
        MsgBox matches(i)
    Next
End Sub

0
 
käµfm³d 👽Commented:
You can change line 11 to

    .Pattern = "([a-z])\s+(\d{4})"

to make it a tad more reliable, unless you never receive digits inside of the "strings"
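A possible variation on the same idea, sketched in JavaScript: instead of inserting a marker and splitting on it, split directly on any whitespace run that is followed by four digits. This is not the code posted above, and it assumes a name never contains whitespace followed by four digits.

```javascript
var data = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    ";

// Trim the outer padding, then split on whitespace that precedes a 4-digit code.
var parts = data.replace(/^\s+|\s+$/g, '').split(/\s+(?=\d{4})/);

console.log(parts);
```

Spaces, commas and slashes inside a name are untouched because the lookahead only fires right before the next code.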
0
 
user_nCommented:
[0-9]{4}\s+([A-Z]|\-|\s|\,|\.|\\|\/|\_)*([A-Z]|\,|\.|\\|\/|\_|\-)
0
 
jacksonm1234Author Commented:
This is the one that works.  Perfectly.

[0-9]{4}\s+([-A-Z./\\]|\s|,)+

user_n gave it in post 6ish, but I had to add the slashes and period, I think.  Thanks to all who helped.
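A quick way to sanity-check this pattern against the two harder rows posted earlier, sketched in JavaScript rather than Excel VBA (the `fields` helper is a hypothetical name, not something from the thread). Because `\s` is among the allowed characters, each raw match keeps the padding that follows the name, so the sketch trims it.

```javascript
var rows = [
  "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    ",
  "         0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH                      "
];

// Same pattern as above; "/" is escaped only because this is a JS regex literal.
var pattern = /[0-9]{4}\s+([-A-Z.\/\\]|\s|,)+/g;

// Hypothetical helper: return every "code   name" field with trailing padding removed.
function fields(row) {
  var found = row.match(pattern) || [];
  return found.map(function (m) {
    return m.replace(/\s+$/, '');
  });
}

console.log(fields(rows[0]));
console.log(fields(rows[1]));
```

Each match ends where the next code begins, since digits are not in the allowed set. In VBA the equivalent would presumably need the RegExp object's `Global` property set to `True` so that `Execute` returns all four matches.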
0
 
jacksonm1234Author Commented:
solution was close but not quite there.
0
 
user_nCommented:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>
<body>

<script type="text/javascript">
  var data = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                   ";
  var length;
  document.write( '"' + data + '"<br/>' );
  // match() takes only the pattern; with the g flag it returns an array of all matches
  data = data.match(/[0-9]{4}\s+([A-Z]|\-|\s|\,|\.|\\|\/|\_)*([A-Z]|\,|\.|\\|\/|\_|\-)/g);
  length = data.length;
  for (var i = 0; i < length; i++)
  {
	document.write(data[i] + '<br/>');
  }

</script>

</body>
</html>

0
 
user_nCommented:
[0-9]{4}\\s+([A-Z]|-|\\s|,|.|\\|/|_)*([A-Z]|,|.|\\|/|_|-) .Net C#
0
 
käµfm³d 👽Commented:
In the future, you can save yourself quite a few posts by giving an accurate representation of your data or the rules surrounding what valid data is comprised of. In the course of splitting a string into fields, it is important to know what comprises a field and what can function as a field separator.
0
 
jacksonm1234Author Commented:
I know, my initial post was incomplete, and I already admitted my mistake twice above.

However, I noted my  more specific requirements in the second post, so giving an accurate representation of my data really didn't save many posts in this case.  Maybe I need to assume that no one reads the posts besides the main one?
0
 
käµfm³d 👽Commented:
Just some friendly guidance regarding future pattern questions. You can take it or leave it--it makes no difference to me.

Glad you have a working solution which you understand.  = )
0
 
user_nCommented:
Sub Macro2()
    Dim RegEx As Object
    Dim strTest As String
    Dim valid As Boolean
    Dim Matches As Object
    Dim i As Integer

    Worksheets("Sheet1").Activate

    strTest = "  0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK "
    Set RegEx = CreateObject("VBScript.RegExp")
    RegEx.Pattern = "[0-9]{4}\s+([A-Z]|-|\s|,|\.|\|/|_)*([A-Z]|,|\.|\|/|_|-)"
    i = 1
    valid = RegEx.test(strTest)
    While valid = True
        Set Matches = RegEx.Execute(strTest)
        Worksheets("Sheet1").Cells(i, 1).Value = CStr(Matches(0))
        strTest = RegEx.Replace(strTest, "")
        valid = RegEx.test(strTest)
        i = i + 1
    Wend

    Set RegEx = Nothing

End Sub
0
 
user_nCommented:
. needed to be escaped ("\.") to match only the symbol ".", otherwise it matches any single character
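The effect of the unescaped dot can be checked side by side. This is a JavaScript sketch, not the VBA above; the only difference between the two patterns is `.` versus `\.`, and both use `\\` for a literal backslash.

```javascript
var row = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    ";

// Unescaped "." matches any character, digits included, so one match
// runs from the first code to the end of the string.
var unescaped = row.match(/[0-9]{4}\s+([-A-Z]|\s|,|.|\\|\/|_)+/g);

// Escaped "\." matches only a literal period, so each match stops at the next code.
var escaped = row.match(/[0-9]{4}\s+([-A-Z]|\s|,|\.|\\|\/|_)+/g);

console.log(unescaped.length);
console.log(escaped.length);
```

This is the "returns the whole string" behaviour reported earlier in the thread.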
0
 
user_nCommented:

[0-9]{4}\s+([A-Z]|-|\s|,|\.|\\|/|_)*([A-Z]|,|\.|\\|/|_|-)
0
