Eliminating duplicate data with Duplicate Master V2

AID: 2123
  • Status: Published

28683 points

  • Bybrettdj
  • TypeBest Practices
  • Posted on2009-12-17 at 16:59:24
Awards
  • Community Pick
  • Experts Exchange Approved
  • Editor's Choice

Introduction - The Problem

Many people use or inherit spreadsheets built with database functionality.  A frequent problem with such spreadsheets is duplicate data, either of entire records, or in specific fields.  Duplicate data in your spreadsheet can cost your organization in many ways:

  • incorrect financial calculations,

  • incorrect statistics,

  • increased mailing costs,

  • wasted manpower due to CRM inaccuracies


Introduction - The Solution

Not surprisingly, one of the most common requests in the online Excel forums is for solutions to handling the
  • identification,

  • extraction,

  • highlighting,

  • or removal

of duplicate or unique lists, in either single or multi-column fields

Several years ago I created the Duplicate Master addin which packages VBA code to provide this additional functionality to Excel. This is an extension beyond Excel's in-built duplicate handling  which is provided via Advanced Filtering, and now the "Remove Duplicates" in Excel 2007.  The Duplicate Master Version 1.4 had been downloaded over 12,000 times at the time I decommissioned my web-site in May 2008.
 
Duplicate Master Excel add-in V2
In this Article I am now pleased to present Duplicate Master V2.0 - in beta form - which delivers additional functionality in 3 areas:

1) The string comparison options have been significantly beefed up
  • case insensitive searches,

  • ignore all white-space searches (ASCII 9-13,32,160),

  • apply the worksheet CLEAN and/or TRIM functions,

  • and for those who want some serious matching capability - Regular Expression functionality

2) The string handling code has been further optimised to improve code speed, especially on string concatenation
3) I have added an option to exclude any items from a unique list if they occur more than once

The latest xla file is  
** Updated to V2.18 on 13 July 2011**



Assumed Knowledge
This tutorial assumed that users are familiar with general downloading and installation of Excel add-ins.

Tutorial contents
This Article provides a brief overview of applying the string matching options to a basic two field data list where the aim is to highlight the entire row where both fields result in a duplicate match.

Four increasingly more specific matching options are covered:

1 Basic Match
2 Match applying using Excel's CLEAN & TRIM worksheet functions
3 Using Regular Expression matching on the first field
4 As per (3), but case insensitive over both fields


The simple Name - Surname list below has been used for this example.

list.png
  • 5 KB
  • list
list



1 Basic Match
This example runs a basic match over the cells in B2:C11 where the strings must be identical (by character case, and white space) to flag a row match.  For these settings data sets 1 and 9 are identical and are highlighted as duplicate rows.


2 Match applying the CLEAN & TRIM functions
In this example the Excel CLEAN & TRIM functions are used to remove non-printing characters and excess spaces. Data sets 1, 6, 9 and 10 are now flagged as duplicates. Please note that the "Ignore ALL white-space chars (ASCII 9-13, 32 & 160" String Option offers this combined option plus handling for the presence of character 160.


3 Using Regular Expression matching on the first field
I won't go into detail on Regular Expressions here,  for those who want more background I suggest visiting the definitive VBA-RegExp resource by Patrick Matthews here.

In this case we would like to match the Name column if

  • the name is Fred, Freddy or Frederick

  • for Fred spelt as Frad, Fred, Frid, Frod, Frud

The regular expression used will convert all of these versions into "Fred" - the Duplicate Master Replacement text option for data matching


Any other variants will not convert purely into Fred (NB: i.e., Fredericka becomes would become Freda in this example, so this would match other Freda fields. If you want to avoid these matches then use a string such as "||xyz||" for the replacement text)

Running this RegExp option - in combination with "Ignore ALL white-space chars (ASCII 9-13, 32 & 160" results in rows 1, 6, 8, 9 and 10 matching


4 As per (3), but case insensitive over both fields
The regular expression options above were designed to cater for variations of "Fred" only (in either the Name or Surname) fields).  The RegExp IgnoreCase setting applies only to strings where there is a valid RegExp match. If we use the same settings as per step 3, but now use the String Option Case Insensitive then row 2 is added as an additional match.

 

Hopefully this tutorial has been useful in identifying the string matching capabilities of the upgraded addin.

Some readers may have already appreciated the versatility that applying RegExp matches provides. For example, pattern matching for numbers would allow a user to remove all rows in an entire workbook where numbers are contained in any or specific fields.

As always I would appreciate feedback and comments, especially as I would like to move this version from beta to final and make the addin generally available over the web for download.

Regards

Dave (brettdj)

    Asked On
    2009-12-17 at 16:59:24ID2123
    Tags

    excel

    ,

    duplicates

    ,

    vba

    ,

    regular expressions

    ,

    fuzzy match

    ,

    addin

    ,

    unique

    ,

    brettdj

    Topic

    Microsoft Excel Spreadsheet Software

    Views
    6432

    Comments

    Expert Comment

    by: mwvisa1 on 2010-01-14 at 19:41:55ID: 8049

    Thank you very much for continuing to develop this and, further, sharing it here!  Worked beautifully in Excel 2007.
    Voted a big YES above.

    Author Comment

    by: brettdj on 2010-01-22 at 03:17:04ID: 8292

    V2.12

    a) cosmetic changes mainly around de-cluttering the Output Choices textbox by making the descriptions text tooltips
    b) ensured that the RegExp section remains open once the code finishes
    c) fixed some brettdj big fingered typo's.


    I also want to acknowledge the generous assistance provided by Jerry Paladino (ProdOps) in reviewing the upgraded addin design and format. Your valuable input was and is much appreciated Jerry :)

    Regards

    Dave

    Expert Comment

    by: broomee9 on 2010-01-25 at 12:16:21ID: 8430

    I ran this on a workbook and I got the below error message.  After clicking OK about 6 times, it kept running.  After about 40 minutes it still wasn't finished, so I cancelled the process.

    Author Comment

    by: brettdj on 2010-01-26 at 20:12:49ID: 8489

    Thanks broomee9 for flagging this issue,  the new code design for V2 did not properly handle duplicates which occur in merged cells

    V2.13 (attached) addresses this change

    Regards

    Dave

    Expert Comment

    by: mwvisa1 on 2010-01-26 at 20:28:22ID: 8493

    Yes, good catch as I would not have thought to test with merged cells, but could see how that may happen in one of files and would have been annoying to get an error.  

    Thanks for all your continued work on this, Dave!

    Best regards,

    Kevin

    Expert Comment

    by: hc2342uhxx3vw36x96hq on 2010-02-03 at 05:15:56ID: 9049

    I hightlight a block of cells in Excel 2007, and then click on the button "The Duplicate Master V2.0" in the folder "Add-Ins".

    I receive the following "Microsoft Visual Basic" PopUp:

    Run-time error '-2147467259 (80004005)':

    Can't move focus to the control because it is invisible, not enabled, or of a type that does not accept the focus.



    ==> Please assume that I have enabled all macros.

    Author Comment

    by: brettdj on 2010-02-03 at 17:13:08ID: 9089

    Hi hc2342uhxx3vw36x96hq,

    Thanks for your feedback.

    I have not been able to duplicate this issue on my pc with xl2007.  If you send me an email to
    ozbrettdjnospam at yahoo dot com (remove the nospam)
    then I will send you the password for the addin so that you can identify where the code is failing on your machine

    Have you tried the addin on a different file?

    Also, do you have any other versions of the Duplicate Master in memory if you press
    Alt & F11
    to got to the Visual Basic Editor (VBE)?

    An earlier version I published in this Article turned out to be a corrupted file

    Regards
    Dave

    Expert Comment

    by: mwvisa1 on 2010-02-05 at 17:28:51ID: 9180

    Dave:

    I was able to reproduce the issue previously, but just re-tested by totally removing the existing version of Duplicate-Master-V2 that I had installed and then downloaded from here again.  Once I enabled it, the Add-In worked as previously did.

    Best regards,
    Kevin

    Author Comment

    by: brettdj on 2010-02-07 at 14:10:51ID: 9259

    Attached is V2.15

    a) It will search for and remove prior versions of the Duplicate Master from memory, uncheck them as Addins, then delete then from the installed path.

    b) - fixed headers in the reporting of Unique Row lists
        - reduced the amount of time that the progress bar is showing (it currently increases run time by approx 30%)
        - improved the logic that calculates multi column duplicates where users define the key columns

    Regards

    Dave

    *** File Removed - see bottom of thread ***

    Expert Comment

    by: hc2342uhxx3vw36x96hq on 2010-02-07 at 14:15:23ID: 9260

    The bug it not present, neither in the "Duplicate Master V2.14.xls" received from David Brett via email, nor in "Duplicate-Master-V2.15.xla" above.

    Thank you!! ;-)

    It's useful, in Excel!!!!! :-)

    Expert Comment

    by: hc2342uhxx3vw36x96hq on 2010-02-08 at 00:27:45ID: 9282

    Excuse me, another question.  When I open the file "Duplicate-Master-V2.15.xla", authomatically a new folder "Add In" is created, with the macro "Duplicate Mater V2.15".  But when I EXIT from Excel 2007, and when I open Excel 2007 again, the folder "Add In" is disappeared.  Why?  Is there any way to leave the macro in the "Add In" folder permanently?

    Author Comment

    by: brettdj on 2010-02-08 at 03:13:47ID: 9286

    try
    - Pressing the Office Button
    - click 'Excel Options'
    - select "addins" from the menu to the left, then a box at the bottom will come up containing "Excel addins", hit 'Go' to the right of this box
    - then click 'Browse' and select the location of the addin from your machine

    Cheers

    Dave


    *** Update: see http://peltiertech.com/WordPress/installing-an-add-in-in-excel-2007/  for the steps above including diagrams ****

    Expert Comment

    by: hc2342uhxx3vw36x96hq on 2010-02-17 at 06:34:15ID: 9742

    Please help!
    A colleague of mine uses EXCEL 2003, and when running the Macro receives ALWAYS the following error in a PopUp:

    The translation of the error is: RUN TIME ERROR '94' - invalid use of NULL

    How can I fix this??

    Thank you in advance for your kind help.

    Author Comment

    by: brettdj on 2010-02-25 at 20:55:35ID: 10122

    I presume that the error does not occur on the same file if it runs on your machine with an English version of Excel?

    Cheers

    Dave

    Author Comment

    by: brettdj on 2010-04-08 at 19:03:34ID: 12808

    V2.16 added

    The UserForm fonts in V2.15 had become screwy in places

    Cheers

    Dave

    Author Comment

    by: brettdj on 2010-07-10 at 06:45:01ID: 16841

    Added V2.17

    • includes additional RegExp option to replaces all valid string matches with a single replacement - to assist with duplicate matching

    • RegExp tester now flags any parsing errors  in the Output box


    Mr Excel forum question.

    Acceptable numbers are 209, 210, 213. Not acceptable are 000 thru 999.

    For example:
    209:fejoaiajeajefl,210:falkejea;weiajsdkjfeia
    209:alskdjfa;owieja,443:dfj;aowiejaslej,213:ejf;aowiej
    220:sdkajf;elsijalars,213:dkfja;lseiajsl,509:al;skdjalseise
    209:df;alseijalseij,210:d;alseijalseij,213:ldkjfjlsiejlfsej

    Should count 2 as only two strings are valid


    A regular expression pattern of
    (^|[^\d])2(09|10|12|13)([^\d]|$)|[^\d]
    will match any character that is non-numeric, but it will allow 209, 210, 212 and 213  (NB but not these strings contained in longer numeric strings ie 9213)

    ** Note this pattern assumed that any string without an invalid numeric string should be matched **

    Then 1) using "||Bob-Neil||" as a replacement string (on the basis that this will not match an existing string)
            2)  replacing all '||Bob-Neil||' strings with a single '||Bob-Neil||'

    Means that the Duplicate Master can be used to match the strings meeting these conditions as duplicates

    Regards

    Dave

    Expert Comment

    by: justmiracle78 on 2010-08-23 at 10:32:45ID: 18677

    I really like this duplicate master but I wanted to know if the trim can be modified. I have a cell formul "=IF(LEN(C2)=12,C2,IF(LEN(C2)=16,MID(C2,3,6)&RIGHT(C2,6),RIGHT(C2,12)))" and I wanted to know if I can incoporate this with the duplicate master?

    Thanks

    Expert Comment

    by: SOCCSUPPORT on 2011-03-08 at 12:14:19ID: 24424

    Once again this add-on has saved me a ton of AD administrator work! Bonus it works with Office 2010 x64bit! Thank you so much!

    Author Comment

    by: brettdj on 2011-03-08 at 17:30:01ID: 24438

    Thank you for the kind comments and vote :)

    Dave

    Expert Comment

    by: Bright01 on 2012-02-03 at 06:07:54ID: 41732

    Brett,

    I cannot access the Add-In.  I'm running Excel 2010 but the file attached to this post is not accessible.  Can you assist me with this?

    Thank you,

    B1

    Author Comment

    by: brettdj on 2012-02-03 at 19:31:02ID: 41870

    Hi B1

    You should be able to download the zip file, extract the xla file, and then install it with a similar process to this (but click the `File` option rather than button)

    Cheers

    Dave

    Add your Comment

    Please Sign up or Log in to comment on this article.

    Loading Advertisement...

    Top MS Excel Experts

    1. dlmille

      243,245

      Guru

      15,200 points yesterday

      Profile
      Rank: Genius
    2. barryhoudini

      117,920

      Master

      5,600 points yesterday

      Profile
      Rank: Genius
    3. matthewspatrick

      96,772

      Master

      20 points yesterday

      Profile
      Rank: Savant
    4. redmondb

      88,300

      Master

      5,200 points yesterday

      Profile
      Rank: Wizard
    5. teylyn

      83,662

      Master

      1,410 points yesterday

      Profile
      Rank: Genius
    6. rorya

      77,182

      Master

      0 points yesterday

      Profile
      Rank: Savant
    7. nutsch

      74,776

      Master

      1,400 points yesterday

      Profile
      Rank: Genius
    8. gowflow

      71,936

      Master

      1,800 points yesterday

      Profile
      Rank: Sage
    9. imnorie

      63,260

      Master

      3,000 points yesterday

      Profile
      Rank: Sage
    10. zorvek

      53,368

      Master

      0 points yesterday

      Profile
      Rank: Savant
    11. ssaqibh

      52,552

      Master

      2,400 points yesterday

      Profile
      Rank: Genius
    12. robhenson

      49,314

      0 points yesterday

      Profile
      Rank: Sage
    13. StephenJR

      46,768

      0 points yesterday

      Profile
      Rank: Genius
    14. GlennLRay

      37,932

      0 points yesterday

      Profile
      Rank: Master
    15. andrewssd3

      37,622

      0 points yesterday

      Profile
      Rank: Wizard
    16. regmigrant

      27,075

      2,000 points yesterday

      Profile
      Rank: Master
    17. ScriptAddict

      25,900

      0 points yesterday

      Profile
      Rank: Master
    18. Michael74

      24,840

      0 points yesterday

      Profile
      Rank: Wizard
    19. kgerb

      24,064

      0 points yesterday

      Profile
      Rank: Guru
    20. jppinto

      19,588

      1,480 points yesterday

      Profile
      Rank: Genius
    21. TommySzalapski

      19,400

      0 points yesterday

      Profile
      Rank: Genius
    22. akoster

      16,300

      0 points yesterday

      Profile
      Rank: Sage
    23. krishnakrkc

      16,032

      0 points yesterday

      Profile
      Rank: Wizard
    24. chris_bottomley

      12,720

      0 points yesterday

      Profile
      Rank: Genius
    25. Shanan212

      12,634

      2,000 points yesterday

      Profile
      Rank: Master

    Hall Of Fame