Link to home
Start Free TrialLog in
Avatar of aikimark
aikimarkFlag for United States of America

asked on

Regexp Replace() of repeated field delimiters

I was trying to clean up a string of delimited text with the (vbscript.regexp) regular expression Replace() method.  I had to repeat the method in order to correctly process all the fields.  I wondered if there was a pattern that can correctly replace sequential delimiters.

Example:
If I have a string like this:
ABC^DEF^^GHI^^^JKL

Open in new window

and I want to supply missing field values, the resulting string should look like this:
ABC^DEF^#MI^GHI^#MI^#MI^JKL

Open in new window


However, the pattern I'm using only produces the following:
ABC^DEF^#MI^GHI^#MI^^JKL

Open in new window

because the delimiters between GHI and JKL get 'paired', preventing the second missing field from getting its value in a single method invocation.
Avatar of Rey Obrero (Capricorn1)
Rey Obrero (Capricorn1)
Flag of United States of America image

try using two replace() function

?Replace(replace("ABC^DEF^^GHI^^^JKL^^^^^^MNO","^^","^#MI^"),"^^","^#MI^")
ABC^DEF^#MI^GHI^#MI^#MI^JKL^#MI^#MI^#MI^#MI^#MI^MNO
Avatar of aikimark

ASKER

@Rey

I know I can do it with multiple replace operations.  The question is whether there is a single pattern that would facilitate a correct operation with a single method invocation.
Not sure if this is the right method, but see if this works for you.....
Assuming your string is in A2.

Sub ReplaceMissingString()
Dim missStr As String
Dim RE As VBScript_RegExp_10.RegExp
Dim match As VBScript_RegExp_10.match
Dim matches As VBScript_RegExp_10.MatchCollection

Set RE = New VBScript_RegExp_10.RegExp
missStr = "^#MI^"
RE.Global = True
RE.Pattern = "\^\^"

    If RE.Test(Range("A2").Value) Then
        Set matches = RE.Execute(Range("A2").Value)
        For Each match In matches
            Range("A2").Value = RE.Replace(Range("A2").Value, missStr)
        Next match
    End If
End Sub

Open in new window

You can use a look-ahead: the second circumflex in the pattern below needs to match but is not included in the matched value (hence also don't include a second one in the replace value):
Sub TestReplace()

    Dim msg As String
    msg = "ABC^DEF^^GHI^^^JKL"

    Dim re As New RegExp
    re.Global = True
    re.Pattern = "\^(?=\^)"

    msg = re.Replace(msg, "^#MI")

    MsgBox msg

End Sub

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of hielo
hielo
Flag of Wallis and Futuna image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
@sktneer

That will have to be repeated in order to produce the correct string when there are three or more consecutive delimiters.
@hielo

That's very promising.  Let me throw that into my test script.
@Robert

I didn't think look-a-heads were supported.  Looks like I'm going to learn some good stuff from this question.
I'm not sure about your environment. I used Excel VBA and added a reference to RegEx 5.5
Avatar of Bill Prew
Bill Prew

LOOKAHEAD is supported in VBScript.

~bp
Correct. I ended up with what hielo suggested. That will do the trick.
Did a little testing here for curiosity.  The first non-lookahead approach seems to handle a trailing delimiter on the input string, so that seems better to me:

Input = "ABC^"
sktneer = "ABC^#MI"
schutt = "ABC^"


However neither handle a leading delimiter in the input string, for example:

Input = "^ABC"
sktneer = "^ABC"
schutt = "^ABC"


Not sure if that is a requirement.

~bp
I do also note a difference with "special characters" in existing values, not sure what your requirement is there.

Input = "ABC^^!^DEF"
sktneer = "ABC^#MI^#MI!^DEF"
schutt = "ABC^#MI^!^DEF"


~bp
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial

Test Results

Nothing like exposing solutions to actual data.  Even though the actual data is all numeric data (or empty), I included both alphabetic and numeric fields in my tests of these two patterns.  In addition to the all-internal delimiters, I included trailing and leading delimiter configurations.
Test Data
ABC^DEF^^GHI^^^ALLINTERNAL
1^2^^4^^^7
ABC^DEF^^GHI^^^TRAILING^
1^2^^4^^^7^
^ABC^DEF^^GHI^^^LEADING
^2^3^^5^^^8
^ABC^DEF^^GHI^^^BOTHLEADINGANDTRAILING^
^2^3^^5^^^8^

Open in new window


1. hielo pattern results

ABC^DEF^#MI^GHI^#MI^#MI^ALLINTERNAL
1^2^#MI^4^#MI^#MI^7
ABC^DEF^#MI^GHI^#MI^#MI^TRAILING^#MI
1^2^#MI^4^#MI^#MI^7^#MI
^ABC^DEF^#MI^GHI^#MI^#MI^LEADING
^2^3^#MI^5^#MI^#MI^8
^ABC^DEF^#MI^GHI^#MI^#MI^BOTHLEADINGANDTRAILING^#MI
^2^3^#MI^5^#MI^#MI^8^#MI

Open in new window

2. Robert Schutt pattern results

ABC^DEF^#MI^GHI^#MI^#MI^ALLINTERNAL
1^2^#MI^4^#MI^#MI^7
ABC^DEF^#MI^GHI^#MI^#MI^TRAILING^
1^2^#MI^4^#MI^#MI^7^
^ABC^DEF^#MI^GHI^#MI^#MI^LEADING
^2^3^#MI^5^#MI^#MI^8
^ABC^DEF^#MI^GHI^#MI^#MI^BOTHLEADINGANDTRAILING^
^2^3^#MI^5^#MI^#MI^8^

Open in new window

=============================
Based on these results the hielo results also correctly transform a trailing delimiter, which would improve its performance.

I'm going to give you and any additional EE experts a few more hours to see if they can better hielo's pattern.
This revised version of the Robert Schutt pattern (http:#a40911944)
\^(?=\^|$|\r\n)

Open in new window

produced the following results:
ABC^DEF^#MI^GHI^#MI^#MI^ALLINTERNAL
1^2^#MI^4^#MI^#MI^7
ABC^DEF^#MI^GHI^#MI^#MI^TRAILING^#MI
1^2^#MI^4^#MI^#MI^7^#MI
^ABC^DEF^#MI^GHI^#MI^#MI^LEADING
^2^3^#MI^5^#MI^#MI^8
^ABC^DEF^#MI^GHI^#MI^#MI^BOTHLEADINGANDTRAILING^#MI
^2^3^#MI^5^#MI^#MI^8^#MI

Open in new window

Which takes care of trailing delimiters, like the hielo pattern.
Do you mean hielo instead of sktneer by the way?
Yes, apologies hielo!

~bp
Ok, so I guess you don't expect special characters in the fields which would make my solution necessary. As it is, my solution may be significantly slower on larger documents (but haven't tested).

Some more additions to both search and replace patterns for the beginning of line:
Sub TestReplace()

    Dim msg As String
    msg = _
        "ABC^DEF^^GHI^^^ALLINTERNAL" & vbCrLf & _
        "1^2^^4^^^7" & vbCrLf & _
        "ABC^DEF^^GHI^^^TRAILING^" & vbCrLf & _
        "1^2^^4^^^7^" & vbCrLf & _
        "^ABC^DEF^^GHI^^^LEADING" & vbCrLf & _
        "^2^3^^5^^^8" & vbCrLf & _
        "^ABC^DEF^^GHI^^^BOTHLEADINGANDTRAILING^" & vbCrLf & _
        "^2^3^^5^^^8^" & vbCrLf

    Dim re As New RegExp
    re.Global = True
    re.MultiLine = True
    re.Pattern = "(^|\^)(?=\^|\r\n)"

    msg = re.Replace(msg, "$1#MI") & vbCrLf

    MsgBox msg

End Sub

Open in new window


I had to remove $ because it added an extra line with a 'missing' field after every line, seems like something to do with the CR/LF combination. So now the only thing that doesn't work is a last line in a file without CR/LF.

output:
ABC^DEF^#MI^GHI^#MI^#MI^ALLINTERNAL
1^2^#MI^4^#MI^#MI^7
ABC^DEF^#MI^GHI^#MI^#MI^TRAILING^#MI
1^2^#MI^4^#MI^#MI^7^#MI
#MI^ABC^DEF^#MI^GHI^#MI^#MI^LEADING
#MI^2^3^#MI^5^#MI^#MI^8
#MI^ABC^DEF^#MI^GHI^#MI^#MI^BOTHLEADINGANDTRAILING^#MI
#MI^2^3^#MI^5^#MI^#MI^8^#MI

Open in new window

@Robert

I still have to do a performance test of these patterns.  I expect that the simpler of the two patterns would have the performance advantage.  The performance tests are only operating on a line at a time, so the \r\n version of your pattern won't be necessary.

Performance test results

Process zero valued key's string: ^^103^^^106^^108^109^^^
the number of times there are items in the set.
 1.546875     Regexp 1001   #MI^#MI^103^#MI^#MI^106^#MI^108^109^#MI^#MI^#MI
 0.5          hielo 1001    #MI^#MI^103^#MI^#MI^106^#MI^108^109^#MI^#MI^#MI
 0.5703125    RS 1001       #MI^#MI^103^#MI^#MI^106^#MI^108^109^#MI^#MI^#MI
 0.453125     VB 1001       #MI^#MI^103^#MI^#MI^106^#MI^108^109^#MI^#MI^#MI

Open in new window

Process all items in the set.
 2.410156     Regexp        #MI^102^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI
 1.734375     Regexp hielo  #MI^102^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI
 1.75         Regexp RS     #MI^102^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI
 1.910156     VB            #MI^102^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI^#MI

Open in new window

Notes:
* The string in the all-items run is the last item in the set.
* The Regexp block iterates a "\^\^" pattern until all repeated delimiters are transformed.  Also, leading and trailing delimiters are transformed, if necessary.
* The VB block iterates a VB Replace() function until all repeated delimiters, "^^" are transformed.  Also, leading and trailing delimiters are transformed, if necessary.
Congratulations to hielo for the faster-performing pattern.

Also, kudos to Robert Schutt for his runner-up pattern.

Thanks to all experts participating in the thread.  In addition to getting my question answered, I learned some things about the vbscript.regexp engine and patterns (non-word boundaries and look-ahead patterns)

If you are interested, I did a performance comparison article, comparing Regexp and VB Replace() in this article:
http://rdsrc.us/ipF4nT