Avatar of John Carney
John Carney
Flag for United States of America asked on

Macro that will remove duplicate strings in a cell using RegEx

This macro removes duplicate strings that occur consecutively. But how do I write it so that it removes all duplicate alphanumerics? In other words in this text it will remove the non-consecutive second occurrence of Text2 as well as both gratuitous occurrences of 33.

"Text1 33  33 Text2 44 Text3 55  Text2  33 "

Thanks,
John

(Dave, if you're reading this, I'm sure you will recognize your macro from quite a while ago!)


Sub LoseDupes()
Dim RegEx      As VBScript_RegExp_55.RegExp
Dim Myrange As Range, C As Range
    Set RegEx = New VBScript_RegExp_55.RegExp
    Set Myrange = ActiveSheet.Range("B3:B6")
    With RegEx
        .MultiLine = False
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b(\w+)\b(\s+\1\b)+"
    End With
    For Each C In Myrange
        C.Offset(0, 1) = RegEx.Replace(C, "$1")
    Next
    Set Myrange = Nothing
    Set RegEx = Nothing
End Sub

Open in new window

Microsoft Excel

Avatar of undefined
Last Comment
Terry Woods

8/22/2022 - Mon
Terry Woods

Try this:
Sub LoseDupes()
Dim RegEx      As VBScript_RegExp_55.RegExp
Dim Myrange As Range, C As Range
    Set RegEx = New VBScript_RegExp_55.RegExp
    Set Myrange = ActiveSheet.Range("B3:B6")
    With RegEx
        .MultiLine = False
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b(\w+)\b(.*?)\b(\s+\1\b)+"
    End With
    For Each C In Myrange
        C.Offset(0, 1) = RegEx.Replace(C, "$1$2")
    Next
    Set Myrange = Nothing
    Set RegEx = Nothing
End Sub

Open in new window

Terry Woods

Note that I changed both the pattern, and the replacement.
John Carney

ASKER
Thanks Terry, I had a feeling it would take both changes. The only problem now is that there will often be a comma between the two instances. Sometimes with words and other punctuation as well. I tried a bunch of variations but couldn't get it. How do I wtrite that?

Also, and this is probably much more complicated and might require a separate question, is it possible to to remove duplicates when:

1) one is part of a larger word: as in 7ABCDEF, 7A
2) the number and the letter are in the same "word" but not adjacent: as in 7ABCDEF, 7C

Oh, and it will almost always be alphanumeric duplicates that I want to remove: 7A, 33D, etc.

Please don't spend much time on this. Just let me know if it's possible and then I'll post another question.

BTW, I'm still using a pattern you gave me last year and it still works great!

Thanks,
John
Experts Exchange is like having an extremely knowledgeable team sitting and waiting for your call. Couldn't do my job half as well as I do without it!
James Murphy
John Carney

ASKER
Just to clarify there may or may not be a comma.
ASKER CERTIFIED SOLUTION
Terry Woods

Log in or sign up to see answer
Become an EE member today7-DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform
Sign up - Free for 7 days
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.
Not exactly the question you had in mind?
Sign up for an EE membership and get your own personalized solution. With an EE membership, you can ask unlimited troubleshooting, research, or opinion questions.
ask a question
John Carney

ASKER
Thanks, Terry this is great. Here's a link to my followup question: https://www.experts-exchange.com/Software/Office_Productivity/Office_Suites/MS_Office/Excel/Q_27247528.html

Incidentally, I notice that all three of these work:
   .Pattern = "\b(\w+)\b(.*?)\b([\s,]+\1\b)+"
   .Pattern = "\b(\w+)\b(.*?)\b([\s,]?\1\b)+"
   .Pattern = "\b(\w+)\b(.*?)\b([\s,*]\1\b)+""

Are they really interchangeable? Or are there hidden deficiencies in the second and third?

And if I'm not pushing my luck (and please tell me if I am, I will gladly put this stuff in separate questions)), what is the function of the \1\? I tried other numbers and they either don't work at all or do things like remove all spaces.

And finally, what is the mechanism of "$1$2" in RegEx.Replace(C, "$1$2"). In trying to understand its function, I tried other numbers and that didn't tell me anything helpful.

Thanks,
John
 
Terry Woods

  .Pattern = "\b(\w+)\b(.*?)\b([\s,]?\1\b)+"

Doesn't require a leading space or comma before the 2nd occurrence, so it will match:
A7B #A7B 123
And give result:
A7B # 123

   .Pattern = "\b(\w+)\b(.*?)\b([\s,*]\1\b)+""

Will allow a leading * instead of a space or comma, but won't allow more than one of those characters (the + is removed), and still needs a word boundary before it, so it will match:
A7B ABC*A7B 123 or A7B ABC,A7B 123
And give result:
A7B ABC 123

But
A7B ABC, A7B 123
won't match unless you add the + back in. You might have meant to put the * outside the square brackets?

\1 is a backreference to the group captured by the round brackets, (\w+) so it detects a repeat of the word

$1 and $2 insert the first and second captured groups as a replacement, (\w+) and (.*?)


Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.