We help IT Professionals succeed at work.

Macro that will remove duplicate strings in a cell using RegEx

John Carney
John Carney used Ask the Experts™
on
This macro removes duplicate strings that occur consecutively. But how do I write it so that it removes all duplicate alphanumerics? In other words in this text it will remove the non-consecutive second occurrence of Text2 as well as both gratuitous occurrences of 33.

"Text1 33  33 Text2 44 Text3 55  Text2  33 "

Thanks,
John

(Dave, if you're reading this, I'm sure you will recognize your macro from quite a while ago!)


Sub LoseDupes()
Dim RegEx      As VBScript_RegExp_55.RegExp
Dim Myrange As Range, C As Range
    Set RegEx = New VBScript_RegExp_55.RegExp
    Set Myrange = ActiveSheet.Range("B3:B6")
    With RegEx
        .MultiLine = False
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b(\w+)\b(\s+\1\b)+"
    End With
    For Each C In Myrange
        C.Offset(0, 1) = RegEx.Replace(C, "$1")
    Next
    Set Myrange = Nothing
    Set RegEx = Nothing
End Sub

Open in new window

Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Terry WoodsIT Guru
Most Valuable Expert 2011

Commented:
Try this:
Sub LoseDupes()
Dim RegEx      As VBScript_RegExp_55.RegExp
Dim Myrange As Range, C As Range
    Set RegEx = New VBScript_RegExp_55.RegExp
    Set Myrange = ActiveSheet.Range("B3:B6")
    With RegEx
        .MultiLine = False
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b(\w+)\b(.*?)\b(\s+\1\b)+"
    End With
    For Each C In Myrange
        C.Offset(0, 1) = RegEx.Replace(C, "$1$2")
    Next
    Set Myrange = Nothing
    Set RegEx = Nothing
End Sub

Open in new window

Terry WoodsIT Guru
Most Valuable Expert 2011

Commented:
Note that I changed both the pattern, and the replacement.
John CarneyReliability Business Tools Analyst II

Author

Commented:
Thanks Terry, I had a feeling it would take both changes. The only problem now is that there will often be a comma between the two instances. Sometimes with words and other punctuation as well. I tried a bunch of variations but couldn't get it. How do I wtrite that?

Also, and this is probably much more complicated and might require a separate question, is it possible to to remove duplicates when:

1) one is part of a larger word: as in 7ABCDEF, 7A
2) the number and the letter are in the same "word" but not adjacent: as in 7ABCDEF, 7C

Oh, and it will almost always be alphanumeric duplicates that I want to remove: 7A, 33D, etc.

Please don't spend much time on this. Just let me know if it's possible and then I'll post another question.

BTW, I'm still using a pattern you gave me last year and it still works great!

Thanks,
John
John CarneyReliability Business Tools Analyst II

Author

Commented:
Just to clarify there may or may not be a comma.
IT Guru
Most Valuable Expert 2011
Commented:
This pattern will allow a comma instead of a space (or a combination of both):

\b(\w+)\b(.*?)\b([\s,]+\1\b)+

And to answer the extra questions:
1) one is part of a larger word: as in 7ABCDEF, 7A
2) the number and the letter are in the same "word" but not adjacent: as in 7ABCDEF, 7C
The first case above should be achievable for the example given.
However, does the order matter, or is
7AB, A7BDF
considered a duplicate too?
And if order matters, is this counted as a duplicate (slightly different, as the 2nd "word" starts with a B in the wrong order, but also has one in the right order):
7AB, B7CABF
If order does matter, then I think a regular expression can manage it. If not, then you'd have to parse out the individual words then programmatically compare them. A separate question would probably be a good idea, as it might get a bit tricky.
John CarneyReliability Business Tools Analyst II

Author

Commented:
Thanks, Terry this is great. Here's a link to my followup question: http://www.experts-exchange.com/Software/Office_Productivity/Office_Suites/MS_Office/Excel/Q_27247528.html

Incidentally, I notice that all three of these work:
   .Pattern = "\b(\w+)\b(.*?)\b([\s,]+\1\b)+"
   .Pattern = "\b(\w+)\b(.*?)\b([\s,]?\1\b)+"
   .Pattern = "\b(\w+)\b(.*?)\b([\s,*]\1\b)+""

Are they really interchangeable? Or are there hidden deficiencies in the second and third?

And if I'm not pushing my luck (and please tell me if I am, I will gladly put this stuff in separate questions)), what is the function of the \1\? I tried other numbers and they either don't work at all or do things like remove all spaces.

And finally, what is the mechanism of "$1$2" in RegEx.Replace(C, "$1$2"). In trying to understand its function, I tried other numbers and that didn't tell me anything helpful.

Thanks,
John
 
Terry WoodsIT Guru
Most Valuable Expert 2011

Commented:
  .Pattern = "\b(\w+)\b(.*?)\b([\s,]?\1\b)+"

Doesn't require a leading space or comma before the 2nd occurrence, so it will match:
A7B #A7B 123
And give result:
A7B # 123

   .Pattern = "\b(\w+)\b(.*?)\b([\s,*]\1\b)+""

Will allow a leading * instead of a space or comma, but won't allow more than one of those characters (the + is removed), and still needs a word boundary before it, so it will match:
A7B ABC*A7B 123 or A7B ABC,A7B 123
And give result:
A7B ABC 123

But
A7B ABC, A7B 123
won't match unless you add the + back in. You might have meant to put the * outside the square brackets?

\1 is a backreference to the group captured by the round brackets, (\w+) so it detects a repeat of the word

$1 and $2 insert the first and second captured groups as a replacement, (\w+) and (.*?)