Link to home
Start Free TrialLog in
Avatar of John Carney
John CarneyFlag for United States of America

asked on

Macro that will remove duplicate strings in a cell using RegEx

This macro removes duplicate strings that occur consecutively. But how do I write it so that it removes all duplicate alphanumerics? In other words in this text it will remove the non-consecutive second occurrence of Text2 as well as both gratuitous occurrences of 33.

"Text1 33  33 Text2 44 Text3 55  Text2  33 "

Thanks,
John

(Dave, if you're reading this, I'm sure you will recognize your macro from quite a while ago!)


Sub LoseDupes()
Dim RegEx      As VBScript_RegExp_55.RegExp
Dim Myrange As Range, C As Range
    Set RegEx = New VBScript_RegExp_55.RegExp
    Set Myrange = ActiveSheet.Range("B3:B6")
    With RegEx
        .MultiLine = False
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b(\w+)\b(\s+\1\b)+"
    End With
    For Each C In Myrange
        C.Offset(0, 1) = RegEx.Replace(C, "$1")
    Next
    Set Myrange = Nothing
    Set RegEx = Nothing
End Sub

Open in new window

Avatar of Terry Woods
Terry Woods
Flag of New Zealand image

Try this:
Sub LoseDupes()
Dim RegEx      As VBScript_RegExp_55.RegExp
Dim Myrange As Range, C As Range
    Set RegEx = New VBScript_RegExp_55.RegExp
    Set Myrange = ActiveSheet.Range("B3:B6")
    With RegEx
        .MultiLine = False
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b(\w+)\b(.*?)\b(\s+\1\b)+"
    End With
    For Each C In Myrange
        C.Offset(0, 1) = RegEx.Replace(C, "$1$2")
    Next
    Set Myrange = Nothing
    Set RegEx = Nothing
End Sub

Open in new window

Note that I changed both the pattern, and the replacement.
Avatar of John Carney

ASKER

Thanks Terry, I had a feeling it would take both changes. The only problem now is that there will often be a comma between the two instances. Sometimes with words and other punctuation as well. I tried a bunch of variations but couldn't get it. How do I wtrite that?

Also, and this is probably much more complicated and might require a separate question, is it possible to to remove duplicates when:

1) one is part of a larger word: as in 7ABCDEF, 7A
2) the number and the letter are in the same "word" but not adjacent: as in 7ABCDEF, 7C

Oh, and it will almost always be alphanumeric duplicates that I want to remove: 7A, 33D, etc.

Please don't spend much time on this. Just let me know if it's possible and then I'll post another question.

BTW, I'm still using a pattern you gave me last year and it still works great!

Thanks,
John
Just to clarify there may or may not be a comma.
ASKER CERTIFIED SOLUTION
Avatar of Terry Woods
Terry Woods
Flag of New Zealand image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks, Terry this is great. Here's a link to my followup question: https://www.experts-exchange.com/questions/27247528/Removing-duplicate-strings-when-one-of-the-strings-is-a-whole-word-and-the-other-is-embedded-in-a-longer-string.html

Incidentally, I notice that all three of these work:
   .Pattern = "\b(\w+)\b(.*?)\b([\s,]+\1\b)+"
   .Pattern = "\b(\w+)\b(.*?)\b([\s,]?\1\b)+"
   .Pattern = "\b(\w+)\b(.*?)\b([\s,*]\1\b)+""

Are they really interchangeable? Or are there hidden deficiencies in the second and third?

And if I'm not pushing my luck (and please tell me if I am, I will gladly put this stuff in separate questions)), what is the function of the \1\? I tried other numbers and they either don't work at all or do things like remove all spaces.

And finally, what is the mechanism of "$1$2" in RegEx.Replace(C, "$1$2"). In trying to understand its function, I tried other numbers and that didn't tell me anything helpful.

Thanks,
John
 
  .Pattern = "\b(\w+)\b(.*?)\b([\s,]?\1\b)+"

Doesn't require a leading space or comma before the 2nd occurrence, so it will match:
A7B #A7B 123
And give result:
A7B # 123

   .Pattern = "\b(\w+)\b(.*?)\b([\s,*]\1\b)+""

Will allow a leading * instead of a space or comma, but won't allow more than one of those characters (the + is removed), and still needs a word boundary before it, so it will match:
A7B ABC*A7B 123 or A7B ABC,A7B 123
And give result:
A7B ABC 123

But
A7B ABC, A7B 123
won't match unless you add the + back in. You might have meant to put the * outside the square brackets?

\1 is a backreference to the group captured by the round brackets, (\w+) so it detects a repeat of the word

$1 and $2 insert the first and second captured groups as a replacement, (\w+) and (.*?)