  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 446

Regex help!

I'm not terribly good with regular expressions, but I'm trying to use them to strip some text out of a long string.  Here is an example string.

"         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    "

What I really want is 4 matches that would give me "0108   COL-GDI-STAP" in one match, "1206   INDIO" in the next, and so on.

I've got the string expression to get the first three, but I can't figure out the code to get the last set!  Here is the code for the first 3--
"\d{4}.*?(?=\d{4})"

What expression would I use to get the last set?  Is there a better way to grab all 4 using one statement?  Thanks!
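
One likely reason the lookahead pattern stops at three matches: the last set has no four digits after it, so `(?=\d{4})` can never succeed there. A minimal sketch, assuming a JavaScript-style regex engine and assuming the names never contain four consecutive digits, lets the lookahead also accept the end of the string. The variable names are made up for the example.

```javascript
// Sample row from the question.
const row = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";

// Original pattern: every match must be followed by another four-digit code,
// so the final set is never matched.
const original = row.match(/\d{4}.*?(?=\d{4})/g);

// Variant: stop before the whitespace that precedes the next code,
// or before the trailing whitespace at the end of the string.
const withEnd = row.match(/\d{4}.*?(?=\s+\d{4}|\s*$)/g);

console.log(original.length); // 3
console.log(withEnd);         // four entries, one per code/name pair
```

Because the lookahead stops before the padding, the matches in `withEnd` carry no trailing spaces, while the matches from the original pattern do.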
Asked by: jacksonm1234
1 Solution
 
user_n commented:
[0-9]{4}\s+[-A-Z]+

jacksonm1234 (Author) commented:
Close, but I need it to support spaces, commas, etc. in among the letters.  For example, these two strings don't work correctly with your expression:

"         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    "

"         0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH                      "


I should have provided a better example initially. Sorry.

HonorGod commented:
Something like this perhaps?
<html>
<body>

<script type="text/javascript">
  var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
  document.write( '"' + data + '"<br/>' );
  data = data.replace( /\s*([0-9]{4})\s+([-A-Z]+)\s*/g, '$1 $2' );
  document.write( '"' + data + '"<br/>' );

</script>

</body>
</html>

user_n commented:
And which symbols does "etc." cover?

jacksonm1234 (Author) commented:
HonorGod:  Same problem... I need to take into account other characters in the words.  Sorry for giving a poor example.

user_n:  So we can't just use a universal 'match any character'?  We have to use specifics?
The special symbols I need to deal with are spaces, commas, periods, hyphens, and slashes, and I think that's it.

user_n commented:
[0-9]{4}\s+([-A-Z]|\s|,)+

user_n commented:
Try this
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)+

[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)+

farzanj commented:
Did you want this in Perl?  Here is something you may be interested in:


if ($line =~ /\s+(\d+\s+[-A-Z]+)\s+(\d+\s+[A-Z]+)\s+(\d+\s+[A-Z-]+)\s+(\d+\s+[A-Z-]+).+/)
{
    print $1;
    print $2;
    print $3;
    print $4;

}
else
{
    print 'no match found';

farzanj commented:
Sorry, I left off the closing brace.

user_n commented:
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)

jacksonm1234 (Author) commented:
None work.  The last two return only one match consisting of the entire string.  This one looks close, but it doesn't include the slashes, periods, etc.
[0-9]{4}\s+([-A-Z]|\s|,)+

jacksonm1234 (Author) commented:
[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)

Same thing user_n.  It returns a match, but the match is the whole string (not broken up into pieces like I need).

user_n commented:
[0-9] - defines one digit symbol: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9
{4} - number of repetitions
[0-9]{4} - four digit symbols
\s - whitespace characters, including \t tab, \r carriage return, \n new line, \v vertical tab, \f form feed
+ - at least one
\s+ - one or more whitespace characters
[-A-Z] - defines chars "-" or "A" or "B" ... "Z"
| - means or
([-A-Z]|\s|,|.|\|/|_) - means character "-" or "A" or "B" ... "Z" or "whitespace character" or "," or "." or "\" or "/" or "_"
* - 0 or more appearances of characters
([-A-Z]|\s|,|.|\|/|_)* - means 0 or more appearances of ([-A-Z]|\s|,|.|\|/|_) characters
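
A quick check of how that group behaves in practice may help. In most engines an unescaped `.` matches any character rather than only a literal period, and `\|` is an escaped pipe rather than a backslash, so the group can consume the digits of the next code as well. The sketch below uses JavaScript as a stand-in engine (an assumption; the asker's engine may differ in details), and the sample string is one of the rows posted earlier.

```javascript
const sample = "         0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH                      ";

// Pattern as posted: "." is unescaped, so the group accepts any character.
const posted = new RegExp("[0-9]{4}\\s+([-A-Z]|\\s|,|.|\\|/|_)*([-A-Z]|,|.|\\|/|_)", "g");

// Same pattern with "." escaped and "\\" used for a literal backslash.
const escaped = new RegExp("[0-9]{4}\\s+([-A-Z]|\\s|,|\\.|\\\\|/|_)*([-A-Z]|,|\\.|\\\\|/|_)", "g");

const postedMatches = sample.match(posted);
const escapedMatches = sample.match(escaped);

console.log(postedMatches.length);  // 1: a single match running to the end of the string
console.log(escapedMatches.length); // 4: one match per code/name pair
```

With the dot escaped, the repeated group stops at the first digit of the next code, and the final group forces each match to end on a non-whitespace character.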

farzanj commented:
Test the attached file.
test2.txt

jacksonm1234 (Author) commented:
Sorry farzan,  I don't know anything about Perl.

HonorGod commented:
simpler...
<html>
<body>

<script type="text/javascript">
  var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";
  document.write( '"' + data + '"<br/>' );
  data = data.replace( /\s*([0-9]{4})\s+(\S+)\s*/g, '$1 $2' );
  document.write( '"' + data + '"<br/>' );

</script>

</body>
</html>

käµfm³d 👽 commented:
Based on question history, I'll assume this is VB.NET--the overall logic should work in different languages though:
Imports System.Text.RegularExpressions

...

Dim src As String = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    "
Dim matches As MatchCollection = Regex.Matches(src.Trim(), "\d+\s+\S+")

matches(0).Value ' 0108   COL-GDI-STAP
matches(1).Value ' 1206   INDIO
matches(2).Value ' 2212   THEDALS-BEND
matches(3).Value ' 5030   HARLINGEN

farzanj commented:
Sorry, which language should I be using??

user_n commented:
What programming language are you using?

[0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)
should match for
  0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH
on first pass, with program you can get the rest
0101   COB-FREMONT

([0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_)\s+){3}([0-9]{4}\s+([-A-Z]|\s|,|.|\|/|_)*([-A-Z]|,|.|\|/|_))
should match
0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH

käµfm³d 👽 commented:
Lines 8 - 10 aren't really code lines, but are meant to demonstrate where to locate the value and what the value would be. I did an inadequate job of commenting  : \

jacksonm1234 (Author) commented:
HonorGod:

Doesn't seem to be working with slashes, commas, etc.

käµfm³d 👽 commented:
Can you guarantee that any internal spaces in each alpha string are always single spaces?

E.g.

This is my string  <--  single spaces

vs.

This   is  my string <-- multi-space
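
The distinction matters because of how the fields are separated. If internal spaces were guaranteed to be single and fields were always separated by two or more spaces (both are assumptions here, not something confirmed for this data), a pattern along these lines might be enough. A JavaScript sketch, with made-up variable names:

```javascript
const line = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    ";

// Four digits, whitespace, then words joined by exactly one space each.
// A run of two or more spaces ends the field.
const singleSpaced = line.match(/\d{4}\s+\S+(?: \S+)*/g);

console.log(singleSpaced.length); // 4
console.log(singleSpaced);
```

If a name could contain a run of two or more spaces, this sketch would cut the name at that gap, so it only fits the single-space case.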

jacksonm1234 (Author) commented:
Everyone:
I'm using Excel VBA.   I apologize for not mentioning it before, but I assumed regex would be a standard implementation, no matter the language.

user_n:
Your first regex string returns a match (the whole string).  I want 4 matches that I can iterate through.  What do you mean "with program you can get the rest"?  I don't want to use Excel formulas or something to strip them out; that's why I'm using regex.

Your second regex freezes my Excel for some reason.

kaufmed:
I need it to account for spaces inside of words (see the additional examples I provided in the second post above).

jacksonm1234 (Author) commented:
kaufmed:
These could have multiple spaces.  This is the closest answer I've gotten so far, from user_n:
[0-9]{4}\s+([-A-Z]|\s|,)+  

HonorGod commented:
Please explain:

Doesn't seem to be working ...

/\s*([0-9]{4})\s+(\S+)\s*/g

Means:


\s*         == zero or more whitespace characters
([0-9]{4})  == Group #1 (i.e., $1) composed of exactly 4 digits
\s+         == one  or more whitespace characters
(\S+)       == Group #2 (i.e., $2) composed of 1 or more non-Whitespace characters
\s*         == zero or more whitespace characters

jacksonm1234 (Author) commented:
HonorGod:

Try it with this string:  "0017   CANAL E/W RP " returns as "0017  CANAL" with your expression.

käµfm³d 👽 commented:
Given the above samples, perhaps this will fit the bill:

(I guess I'll switch to javascript since that appears to be the desired result)
<script type="text/javascript">
	function run() {
		var data = "         0108   COL-GDI-STAP          1206   INDIO                 2212   THEDALS-BEND          5030   HARLINGEN                    ";

		data = data.replace(/^\s+|\s+$/g, '');
		data = data.replace(/([a-z])\s+(\d)/gi, '$1######$2');
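		// Declare the result locally instead of leaking a global.
		var matches = data.split(/######/);

		// Index-based loop; for...in would also visit any extra enumerable properties on the array.
		for (var i = 0; i < matches.length; i++)
		{
			alert(matches[i]);
		}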
	}
</script>

user_n commented:
Some symbols in regular expressions need to be escaped with \ in some languages. I do not use VB.
This may help
http://www.aspfree.com/c/a/Windows-Scripting/Regular-Expressions-in-VBScript/1/
for matching not only the first string but the rest too.
So I escaped the \ in the next expression (I used "\\" for escaping):
[0-9]{4}\s+([-A-Z]|\s|,|.|\\|/|_)*([-A-Z]|,|.|\\|/|_)

käµfm³d 👽 commented:
>>  but I assumed regex would be a standard implementation, no matter the language.

Ah my young padawan, you have much to learn   ; )

Here's the VBA breakdown of the above:
Sub func(data As String)
    Set RegularExpressionObject = New RegExp

    With RegularExpressionObject
        .IgnoreCase = True
        .Global = True
        
        .Pattern = "^\s+|\s+$"
        data = .Replace(data, "")
        
        .Pattern = "([a-z])\s+(\d)"
        data = .Replace(data, "$1#######$2")
    End With

    matches = Split(data, "#######")
    
    For i = 0 To UBound(matches)
        MsgBox matches(i)
    Next
End Sub

käµfm³d 👽 commented:
You can change line 11 to

    .Pattern = "([a-z])\s+(\d{4})"

to make it a tad more reliable, unless you never receive digits inside of the "strings"

user_n commented:
[0-9]{4}\s+([A-Z]|\-|\s|\,|\.|\\|\/|\_)*([A-Z]|\,|\.|\\|\/|\_|\-)

jacksonm1234 (Author) commented:
This is the one that works.  Perfectly.

[0-9]{4}\s+([-A-Z./\\]|\s|,)+

user_n gave it in post 6ish, but I had to add the slashes and period, I think.  Thanks to all who helped.
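
For anyone landing here later, a rough check of that pattern in JavaScript (used only as a convenient regex engine; the original was run in Excel VBA). One thing worth knowing: because `\s` is part of the repeated group, each match also carries the padding up to the next code, so the sketch trims each result. The helper name `splitRow` is made up for this example.

```javascript
// Pattern from the post above, written as a JavaScript literal
// (the forward slash must be escaped inside /.../).
const accepted = /[0-9]{4}\s+([-A-Z.\/\\]|\s|,)+/g;

// Hypothetical helper: return each "code   name" pair without trailing padding.
function splitRow(text) {
  const found = text.match(accepted) || [];
  return found.map(function (m) { return m.trim(); });
}

const rowA = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                    ";
const rowB = "         0101   COB-FREMONT           1103   CLEARFIELD            2205   WALLA WALLA           4245   OOLOGAH                      ";

console.log(splitRow(rowA)); // four entries
console.log(splitRow(rowB)); // four entries
```

The character class only lists uppercase letters and a handful of symbols, so a name containing a digit or a lowercase letter would end the match early.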

jacksonm1234 (Author) commented:
The solution was close but not quite there.

user_n commented:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>
<body>

<script type="text/javascript">
  var data = "         0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK                   ";
  var length;
  document.write( '"' + data + '"<br/>' );
  // match() takes only the pattern; with the g flag it returns every match.
  data = data.match(/[0-9]{4}\s+([A-Z]|\-|\s|\,|\.|\\|\/|\_)*([A-Z]|\,|\.|\\|\/|\_|\-)/g);
  length = data.length;
  for (var i = 0; i < length; i++)
  {
	document.write(data[i] + '<br/>');
  }

</script>

</body>
</html>

user_n commented:
[0-9]{4}\\s+([A-Z]|-|\\s|,|.|\\|/|_)*([A-Z]|,|.|\\|/|_|-) .Net C#

käµfm³d 👽 commented:
In the future, you can save yourself quite a few posts by giving an accurate representation of your data or the rules surrounding what valid data is comprised of. In the course of splitting a string into fields, it is important to know what comprises a field and what can function as a field separator.

jacksonm1234 (Author) commented:
I know, my initial post was incomplete, and I already admitted my mistake twice above.

However, I noted my more specific requirements in the second post, so giving an accurate representation of my data really didn't save many posts in this case.  Maybe I need to assume that no one reads the posts besides the main one?

käµfm³d 👽 commented:
Just some friendly guidance regarding future pattern questions. You can take it or leave it--it makes no difference to me.

Glad you have a working solution which you understand.  = )

user_n commented:
Sub Macro2()
    Dim RegEx As Object
    Dim strTest As String
    Dim valid As Boolean
    Dim Matches As Object
    Dim i As Integer
   
Worksheets("Sheet1").Activate

strTest = "  0017   CANAL E/W RP          0300   DRGW                  2203   WALLULA               4230   TULSA, OK "
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Pattern = "[0-9]{4}\s+([A-Z]|-|\s|,|\.|\|/|_)*([A-Z]|,|\.|\|/|_|-)"
i = 1
        valid = RegEx.test(strTest)
        While valid = True
            Set Matches = RegEx.Execute(strTest)
            Worksheets("Sheet1").Cells(i, 1).Value = CStr(Matches(0))
            strTest = RegEx.Replace(strTest, "")
            valid = RegEx.test(strTest)
            i = i + 1
        Wend

    Set RegEx = Nothing
   
End Sub

user_n commented:
The "." needed to be escaped ("\.") to match only the literal "." character; otherwise it matches any single character.

user_n commented:

[0-9]{4}\s+([A-Z]|-|\s|,|\.|\\|/|_)*([A-Z]|,|\.|\\|/|_|-)