asked on

Macro that will remove duplicate strings in a cell using RegEx

This macro removes duplicate strings that occur consecutively. But how do I write it so that it removes all duplicate alphanumerics? In other words in this text it will remove the non-consecutive second occurrence of Text2 as well as both gratuitous occurrences of 33.

"Text1 33 33 Text2 44 Text3 55 Text2 33 "

Thanks,
John

(Dave, if you're reading this, I'm sure you will recognize your macro from quite a while ago!)

Sub LoseDupes()
Dim RegEx      As VBScript_RegExp_55.RegExp
Dim Myrange As Range, C As Range
    Set RegEx = New VBScript_RegExp_55.RegExp
    Set Myrange = ActiveSheet.Range("B3:B6")
    With RegEx
        .MultiLine = False
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b(\w+)\b(\s+\1\b)+"
    End With
    For Each C In Myrange
        C.Offset(0, 1) = RegEx.Replace(C, "$1")
    Next
    Set Myrange = Nothing
    Set RegEx = Nothing
End Sub

Open in new window

Terry Woods

Try this:

Sub LoseDupes()
Dim RegEx      As VBScript_RegExp_55.RegExp
Dim Myrange As Range, C As Range
    Set RegEx = New VBScript_RegExp_55.RegExp
    Set Myrange = ActiveSheet.Range("B3:B6")
    With RegEx
        .MultiLine = False
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b(\w+)\b(.*?)\b(\s+\1\b)+"
    End With
    For Each C In Myrange
        C.Offset(0, 1) = RegEx.Replace(C, "$1$2")
    Next
    Set Myrange = Nothing
    Set RegEx = Nothing
End Sub

Open in new window

Terry Woods

Note that I changed both the pattern, and the replacement.

John Carney

ASKER

Thanks Terry, I had a feeling it would take both changes. The only problem now is that there will often be a comma between the two instances. Sometimes with words and other punctuation as well. I tried a bunch of variations but couldn't get it. How do I wtrite that?

Also, and this is probably much more complicated and might require a separate question, is it possible to to remove duplicates when:

1) one is part of a larger word: as in 7ABCDEF, 7A
2) the number and the letter are in the same "word" but not adjacent: as in 7ABCDEF, 7C

Oh, and it will almost always be alphanumeric duplicates that I want to remove: 7A, 33D, etc.

Please don't spend much time on this. Just let me know if it's possible and then I'll post another question.

BTW, I'm still using a pattern you gave me last year and it still works great!

Thanks,
John

John Carney

ASKER

Just to clarify there may or may not be a comma.

ASKER CERTIFIED SOLUTION

Terry Woods

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

John Carney

ASKER

Thanks, Terry this is great. Here's a link to my followup question: https://www.experts-exchange.com/questions/27247528/Removing-duplicate-strings-when-one-of-the-strings-is-a-whole-word-and-the-other-is-embedded-in-a-longer-string.html

Incidentally, I notice that all three of these work:
.Pattern = "\b(\w+)\b(.*?)\b([\s,]+\1\b)+"
.Pattern = "\b(\w+)\b(.*?)\b([\s,]?\1\b)+"
.Pattern = "\b(\w+)\b(.*?)\b([\s,*]\1\b)+""

Are they really interchangeable? Or are there hidden deficiencies in the second and third?

And if I'm not pushing my luck (and please tell me if I am, I will gladly put this stuff in separate questions)), what is the function of the \1\? I tried other numbers and they either don't work at all or do things like remove all spaces.

And finally, what is the mechanism of "$1$2" in RegEx.Replace(C, "$1$2"). In trying to understand its function, I tried other numbers and that didn't tell me anything helpful.

Thanks,
John

Terry Woods

.Pattern = "\b(\w+)\b(.*?)\b([\s,]?\1\b)+"

Doesn't require a leading space or comma before the 2nd occurrence, so it will match:
A7B #A7B 123
And give result:
A7B # 123

.Pattern = "\b(\w+)\b(.*?)\b([\s,*]\1\b)+""

Will allow a leading * instead of a space or comma, but won't allow more than one of those characters (the + is removed), and still needs a word boundary before it, so it will match:
A7B ABC*A7B 123 or A7B ABC,A7B 123
And give result:
A7B ABC 123

But
A7B ABC, A7B 123
won't match unless you add the + back in. You might have meant to put the * outside the square brackets?

\1 is a backreference to the group captured by the round brackets, (\w+) so it detects a repeat of the word

$1 and $2 insert the first and second captured groups as a replacement, (\w+) and (.*?)