aikimark
asked on
Regexp Replace() of repeated field delimiters
I was trying to clean up a string of delimited text with the (vbscript.regexp) regular expression Replace() method. I had to repeat the method in order to correctly process all the fields. I wondered if there was a pattern that can correctly replace sequential delimiters.
Example:
If I have a string like this:
However, the pattern I'm using only produces the following:
Example:
If I have a string like this:
ABC^DEF^^GHI^^^JKL
and I want to supply missing field values, the resulting string should look like this:ABC^DEF^#MI^GHI^#MI^#MI^JKL
However, the pattern I'm using only produces the following:
ABC^DEF^#MI^GHI^#MI^^JKL
because the delimiters between GHI and JKL get 'paired', preventing the second missing field from getting its value in a single method invocation.
ASKER
@Rey
I know I can do it with multiple replace operations. The question is whether there is a single pattern that would facilitate a correct operation with a single method invocation.
I know I can do it with multiple replace operations. The question is whether there is a single pattern that would facilitate a correct operation with a single method invocation.
Not sure if this is the right method, but see if this works for you.....
Assuming your string is in A2.
Assuming your string is in A2.
Sub ReplaceMissingString()
Dim missStr As String
Dim RE As VBScript_RegExp_10.RegExp
Dim match As VBScript_RegExp_10.match
Dim matches As VBScript_RegExp_10.MatchCollection
Set RE = New VBScript_RegExp_10.RegExp
missStr = "^#MI^"
RE.Global = True
RE.Pattern = "\^\^"
If RE.Test(Range("A2").Value) Then
Set matches = RE.Execute(Range("A2").Value)
For Each match In matches
Range("A2").Value = RE.Replace(Range("A2").Value, missStr)
Next match
End If
End Sub
You can use a look-ahead: the second circumflex in the pattern below needs to match but is not included in the matched value (hence also don't include a second one in the replace value):
Sub TestReplace()
Dim msg As String
msg = "ABC^DEF^^GHI^^^JKL"
Dim re As New RegExp
re.Global = True
re.Pattern = "\^(?=\^)"
msg = re.Replace(msg, "^#MI")
MsgBox msg
End Sub
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
@sktneer
That will have to be repeated in order to produce the correct string when there are three or more consecutive delimiters.
That will have to be repeated in order to produce the correct string when there are three or more consecutive delimiters.
ASKER
@hielo
That's very promising. Let me throw that into my test script.
That's very promising. Let me throw that into my test script.
ASKER
@Robert
I didn't think look-a-heads were supported. Looks like I'm going to learn some good stuff from this question.
I didn't think look-a-heads were supported. Looks like I'm going to learn some good stuff from this question.
I'm not sure about your environment. I used Excel VBA and added a reference to RegEx 5.5
LOOKAHEAD is supported in VBScript.
~bp
~bp
Correct. I ended up with what hielo suggested. That will do the trick.
Did a little testing here for curiosity. The first non-lookahead approach seems to handle a trailing delimiter on the input string, so that seems better to me:
Input = "ABC^"
sktneer = "ABC^#MI"
schutt = "ABC^"
However neither handle a leading delimiter in the input string, for example:
Input = "^ABC"
sktneer = "^ABC"
schutt = "^ABC"
Not sure if that is a requirement.
~bp
Input = "ABC^"
sktneer = "ABC^#MI"
schutt = "ABC^"
However neither handle a leading delimiter in the input string, for example:
Input = "^ABC"
sktneer = "^ABC"
schutt = "^ABC"
Not sure if that is a requirement.
~bp
I do also note a difference with "special characters" in existing values, not sure what your requirement is there.
Input = "ABC^^!^DEF"
sktneer = "ABC^#MI^#MI!^DEF"
schutt = "ABC^#MI^!^DEF"
~bp
Input = "ABC^^!^DEF"
sktneer = "ABC^#MI^#MI!^DEF"
schutt = "ABC^#MI^!^DEF"
~bp
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Test Results
Nothing like exposing solutions to actual data. Even though the actual data is all numeric data (or empty), I included both alphabetic and numeric fields in my tests of these two patterns. In addition to the all-internal delimiters, I included trailing and leading delimiter configurations.Test Data
ABC^DEF^^GHI^^^ALLINTERNAL
1^2^^4^^^7
ABC^DEF^^GHI^^^TRAILING^
1^2^^4^^^7^
^ABC^DEF^^GHI^^^LEADING
^2^3^^5^^^8
^ABC^DEF^^GHI^^^BOTHLEADINGANDTRAILING^
^2^3^^5^^^8^
1. hielo pattern results
ABC^DEF^#MI^GHI^#MI^#MI^ALLINTERNAL
1^2^#MI^4^#MI^#MI^7
ABC^DEF^#MI^GHI^#MI^#MI^TRAILING^#MI
1^2^#MI^4^#MI^#MI^7^#MI
^ABC^DEF^#MI^GHI^#MI^#MI^LEADING
^2^3^#MI^5^#MI^#MI^8
^ABC^DEF^#MI^GHI^#MI^#MI^BOTHLEADINGANDTRAILING^#MI
^2^3^#MI^5^#MI^#MI^8^#MI
2. Robert Schutt pattern results
ABC^DEF^#MI^GHI^#MI^#MI^ALLINTERNAL
1^2^#MI^4^#MI^#MI^7
ABC^DEF^#MI^GHI^#MI^#MI^TRAILING^
1^2^#MI^4^#MI^#MI^7^
^ABC^DEF^#MI^GHI^#MI^#MI^LEADING
^2^3^#MI^5^#MI^#MI^8
^ABC^DEF^#MI^GHI^#MI^#MI^BOTHLEADINGANDTRAILING^
^2^3^#MI^5^#MI^#MI^8^
==========================Based on these results the hielo results also correctly transform a trailing delimiter, which would improve its performance.
I'm going to give you and any additional EE experts a few more hours to see if they can better hielo's pattern.
ASKER
This revised version of the Robert Schutt pattern (http:#a40911944)
\^(?=\^|$|\r\n)
produced the following results:ABC^DEF^#MI^GHI^#MI^#MI^ALLINTERNAL
1^2^#MI^4^#MI^#MI^7
ABC^DEF^#MI^GHI^#MI^#MI^TRAILING^#MI
1^2^#MI^4^#MI^#MI^7^#MI
^ABC^DEF^#MI^GHI^#MI^#MI^LEADING
^2^3^#MI^5^#MI^#MI^8
^ABC^DEF^#MI^GHI^#MI^#MI^BOTHLEADINGANDTRAILING^#MI
^2^3^#MI^5^#MI^#MI^8^#MI
Which takes care of trailing delimiters, like the hielo pattern.
Do you mean hielo instead of sktneer by the way?Yes, apologies hielo!
~bp
Ok, so I guess you don't expect special characters in the fields which would make my solution necessary. As it is, my solution may be significantly slower on larger documents (but haven't tested).
Some more additions to both search and replace patterns for the beginning of line:
I had to remove $ because it added an extra line with a 'missing' field after every line, seems like something to do with the CR/LF combination. So now the only thing that doesn't work is a last line in a file without CR/LF.
output:
Some more additions to both search and replace patterns for the beginning of line:
Sub TestReplace()
Dim msg As String
msg = _
"ABC^DEF^^GHI^^^ALLINTERNAL" & vbCrLf & _
"1^2^^4^^^7" & vbCrLf & _
"ABC^DEF^^GHI^^^TRAILING^" & vbCrLf & _
"1^2^^4^^^7^" & vbCrLf & _
"^ABC^DEF^^GHI^^^LEADING" & vbCrLf & _
"^2^3^^5^^^8" & vbCrLf & _
"^ABC^DEF^^GHI^^^BOTHLEADINGANDTRAILING^" & vbCrLf & _
"^2^3^^5^^^8^" & vbCrLf
Dim re As New RegExp
re.Global = True
re.MultiLine = True
re.Pattern = "(^|\^)(?=\^|\r\n)"
msg = re.Replace(msg, "$1#MI") & vbCrLf
MsgBox msg
End Sub
I had to remove $ because it added an extra line with a 'missing' field after every line, seems like something to do with the CR/LF combination. So now the only thing that doesn't work is a last line in a file without CR/LF.
output:
ABC^DEF^#MI^GHI^#MI^#MI^ALLINTERNAL
1^2^#MI^4^#MI^#MI^7
ABC^DEF^#MI^GHI^#MI^#MI^TRAILING^#MI
1^2^#MI^4^#MI^#MI^7^#MI
#MI^ABC^DEF^#MI^GHI^#MI^#MI^LEADING
#MI^2^3^#MI^5^#MI^#MI^8
#MI^ABC^DEF^#MI^GHI^#MI^#MI^BOTHLEADINGANDTRAILING^#MI
#MI^2^3^#MI^5^#MI^#MI^8^#MI
ASKER
@Robert
I still have to do a performance test of these patterns. I expect that the simpler of the two patterns would have the performance advantage. The performance tests are only operating on a line at a time, so the \r\n version of your pattern won't be necessary.
I still have to do a performance test of these patterns. I expect that the simpler of the two patterns would have the performance advantage. The performance tests are only operating on a line at a time, so the \r\n version of your pattern won't be necessary.
ASKER
Performance test results
Process zero valued key's string: ^^103^^^106^^108^109^^^the number of times there are items in the set.
1.546875 Regexp 1001 #MI^#MI^103^#MI^#MI^106^#MI^108^109^#MI^#MI^#MI
0.5 hielo 1001 #MI^#MI^103^#MI^#MI^106^#MI^108^109^#MI^#MI^#MI
0.5703125 RS 1001 #MI^#MI^103^#MI^#MI^106^#MI^108^109^#MI^#MI^#MI
0.453125 VB 1001 #MI^#MI^103^#MI^#MI^106^#MI^108^109^#MI^#MI^#MI
Process all items in the set. 2.410156 Regexp #MI^102^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI
1.734375 Regexp hielo #MI^102^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI
1.75 Regexp RS #MI^102^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI
1.910156 VB #MI^102^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI
Notes:* The string in the all-items run is the last item in the set.
* The Regexp block iterates a "\^\^" pattern until all repeated delimiters are transformed. Also, leading and trailing delimiters are transformed, if necessary.
* The VB block iterates a VB Replace() function until all repeated delimiters, "^^" are transformed. Also, leading and trailing delimiters are transformed, if necessary.
ASKER
Congratulations to hielo for the faster-performing pattern.
Also, kudos to Robert Schutt for his runner-up pattern.
Thanks to all experts participating in the thread. In addition to getting my question answered, I learned some things about the vbscript.regexp engine and patterns (non-word boundaries and look-ahead patterns)
If you are interested, I did a performance comparison article, comparing Regexp and VB Replace() in this article:
http://rdsrc.us/ipF4nT
Also, kudos to Robert Schutt for his runner-up pattern.
Thanks to all experts participating in the thread. In addition to getting my question answered, I learned some things about the vbscript.regexp engine and patterns (non-word boundaries and look-ahead patterns)
If you are interested, I did a performance comparison article, comparing Regexp and VB Replace() in this article:
http://rdsrc.us/ipF4nT
?Replace(replace("ABC^DEF^
ABC^DEF^#MI^GHI^#MI^#MI^JK