<

Go Premium for a chance to win a PS4. Enter to Win

x

Eliminating duplicate data with Duplicate Master V2

Published on
50,775 Points
18,175 Views
26 Endorsements
Last Modified:
Awarded
Introduction - The Problem

Many people use or inherit spreadsheets built with database functionality.  A frequent problem with such spreadsheets is duplicate data, either of entire records, or in specific fields.  Duplicate data in your spreadsheet can cost your organization in many ways:
incorrect financial calculations,
incorrect statistics,
increased mailing costs,
wasted manpower due to CRM inaccuracies
Introduction - The Solution

Not surprisingly, one of the most common requests in the online Excel forums is for solutions to handling the
identification,
extraction,
highlighting,
or removal
of duplicate or unique lists, in either single or multi-column fields

Several years ago I created the Duplicate Master addin which packages VBA code to provide this additional functionality to Excel. This is an extension beyond Excel's in-built duplicate handling  which is provided via Advanced Filtering, and now the "Remove Duplicates" in Excel 2007.  The Duplicate Master Version 1.4 had been downloaded over 12,000 times at the time I decommissioned my web-site in May 2008.
 
Duplicate Master Excel add-in V2
In this Article I am now pleased to present Duplicate Master V2.0 - in beta form - which delivers additional functionality in 3 areas:

1) The string comparison options have been significantly beefed up
case insensitive searches,
ignore all white-space searches (ASCII 9-13,32,160),
apply the worksheet CLEAN and/or TRIM functions,
and for those who want some serious matching capability - Regular Expression functionality
2) The string handling code has been further optimised to improve code speed, especially on string concatenation
3) I have added an option to exclude any items from a unique list if they occur more than once

The latest xla file is  
** Updated to V2.18 on 13 July 2011**
Duplicate-Master-V2.18.zip

Assumed Knowledge
This tutorial assumed that users are familiar with general downloading and installation of Excel add-ins.

Tutorial contents
This Article provides a brief overview of applying the string matching options to a basic two field data list where the aim is to highlight the entire row where both fields result in a duplicate match.

Four increasingly more specific matching options are covered:

1 Basic Match
2 Match applying using Excel's CLEAN & TRIM worksheet functions
3 Using Regular Expression matching on the first field
4 As per (3), but case insensitive over both fields


The simple Name - Surname list below has been used for this example.list1 Basic Match
This example runs a basic match over the cells in B2:C11 where the strings must be identical (by character case, and white space) to flag a row match.  For these settings data sets 1 and 9 are identical and are highlighted as duplicate rows.
12 Match applying the CLEAN & TRIM functions
In this example the Excel CLEAN & TRIM functions are used to remove non-printing characters and excess spaces. Data sets 1, 6, 9 and 10 are now flagged as duplicates. Please note that the "Ignore ALL white-space chars (ASCII 9-13, 32 & 160" String Option offers this combined option plus handling for the presence of character 160.
23 Using Regular Expression matching on the first field
I won't go into detail on Regular Expressions here,  for those who want more background I suggest visiting the definitive VBA-RegExp resource by Patrick Matthews here.

In this case we would like to match the Name column if
the name is Fred, Freddy or Frederick
for Fred spelt as Frad, Fred, Frid, Frod, Frud
The regular expression used will convert all of these versions into "Fred" - the Duplicate Master Replacement text option for data matching


Any other variants will not convert purely into Fred (NB: i.e., Fredericka becomes would become Freda in this example, so this would match other Freda fields. If you want to avoid these matches then use a string such as "||xyz||" for the replacement text)

Running this RegExp option - in combination with "Ignore ALL white-space chars (ASCII 9-13, 32 & 160" results in rows 1, 6, 8, 9 and 10 matching
34 As per (3), but case insensitive over both fields
The regular expression options above were designed to cater for variations of "Fred" only (in either the Name or Surname) fields).  The RegExp IgnoreCase setting applies only to strings where there is a valid RegExp match. If we use the same settings as per step 3, but now use the String Option Case Insensitive then row 2 is added as an additional match.
4  

Hopefully this tutorial has been useful in identifying the string matching capabilities of the upgraded addin.

Some readers may have already appreciated the versatility that applying RegExp matches provides. For example, pattern matching for numbers would allow a user to remove all rows in an entire workbook where numbers are contained in any or specific fields.

As always I would appreciate feedback and comments, especially as I would like to move this version from beta to final and make the addin generally available over the web for download.

Regards

Dave (brettdj)
26
Comment
Author:Dave Brett
  • 12
  • 4
  • 4
  • +6
28 Comments
 
LVL 60

Expert Comment

by:Kevin Cross
Thank you very much for continuing to develop this and, further, sharing it here!  Worked beautifully in Excel 2007.
Voted a big YES above.
0
 
LVL 50

Author Comment

by:Dave Brett
V2.12

a) cosmetic changes mainly around de-cluttering the Output Choices textbox by making the descriptions text tooltips
b) ensured that the RegExp section remains open once the code finishes
c) fixed some brettdj big fingered typo's.


I also want to acknowledge the generous assistance provided by Jerry Paladino (ProdOps) in reviewing the upgraded addin design and format. Your valuable input was and is much appreciated Jerry :)

Regards

Dave
0
 
LVL 24

Expert Comment

by:Tracy
I ran this on a workbook and I got the below error message.  After clicking OK about 6 times, it kept running.  After about 40 minutes it still wasn't finished, so I cancelled the process.
Error.png
Settings.png
0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
LVL 50

Author Comment

by:Dave Brett
Thanks broomee9 for flagging this issue,  the new code design for V2 did not properly handle duplicates which occur in merged cells

V2.13 (attached) addresses this change

Regards

Dave
0
 
LVL 60

Expert Comment

by:Kevin Cross
Yes, good catch as I would not have thought to test with merged cells, but could see how that may happen in one of files and would have been annoying to get an error.  

Thanks for all your continued work on this, Dave!

Best regards,

Kevin
0
 
LVL 1

Expert Comment

by:hc2342uhxx3vw36x96hq
I hightlight a block of cells in Excel 2007, and then click on the button "The Duplicate Master V2.0" in the folder "Add-Ins".

I receive the following "Microsoft Visual Basic" PopUp:

Run-time error '-2147467259 (80004005)':

Can't move focus to the control because it is invisible, not enabled, or of a type that does not accept the focus.



==> Please assume that I have enabled all macros.
0
 
LVL 50

Author Comment

by:Dave Brett
Hi hc2342uhxx3vw36x96hq,

Thanks for your feedback.

I have not been able to duplicate this issue on my pc with xl2007.  If you send me an email to
ozbrettdjnospam at yahoo dot com (remove the nospam)
then I will send you the password for the addin so that you can identify where the code is failing on your machine

Have you tried the addin on a different file?

Also, do you have any other versions of the Duplicate Master in memory if you press
Alt & F11
to got to the Visual Basic Editor (VBE)?

An earlier version I published in this Article turned out to be a corrupted file

Regards
Dave
0
 
LVL 60

Expert Comment

by:Kevin Cross
Dave:

I was able to reproduce the issue previously, but just re-tested by totally removing the existing version of Duplicate-Master-V2 that I had installed and then downloaded from here again.  Once I enabled it, the Add-In worked as previously did.

Best regards,
Kevin
0
 
LVL 50

Author Comment

by:Dave Brett
Attached is V2.15

a) It will search for and remove prior versions of the Duplicate Master from memory, uncheck them as Addins, then delete then from the installed path.

b) - fixed headers in the reporting of Unique Row lists
    - reduced the amount of time that the progress bar is showing (it currently increases run time by approx 30%)
    - improved the logic that calculates multi column duplicates where users define the key columns

Regards

Dave

*** File Removed - see bottom of thread ***
0
 
LVL 1

Expert Comment

by:hc2342uhxx3vw36x96hq
The bug it not present, neither in the "Duplicate Master V2.14.xls" received from David Brett via email, nor in "Duplicate-Master-V2.15.xla" above.

Thank you!! ;-)

It's useful, in Excel!!!!! :-)
0
 
LVL 1

Expert Comment

by:hc2342uhxx3vw36x96hq
Excuse me, another question.  When I open the file "Duplicate-Master-V2.15.xla", authomatically a new folder "Add In" is created, with the macro "Duplicate Mater V2.15".  But when I EXIT from Excel 2007, and when I open Excel 2007 again, the folder "Add In" is disappeared.  Why?  Is there any way to leave the macro in the "Add In" folder permanently?
0
 
LVL 50

Author Comment

by:Dave Brett
try
- Pressing the Office Button
- click 'Excel Options'
- select "addins" from the menu to the left, then a box at the bottom will come up containing "Excel addins", hit 'Go' to the right of this box
- then click 'Browse' and select the location of the addin from your machine

Cheers

Dave


*** Update: see http://peltiertech.com/WordPress/installing-an-add-in-in-excel-2007/  for the steps above including diagrams ****
0
 
LVL 1

Expert Comment

by:hc2342uhxx3vw36x96hq
Please help!
A colleague of mine uses EXCEL 2003, and when running the Macro receives ALWAYS the following error in a PopUp:

The translation of the error is: RUN TIME ERROR '94' - invalid use of NULL

How can I fix this??

Thank you in advance for your kind help.

errore.JPG
0
 
LVL 50

Author Comment

by:Dave Brett
I presume that the error does not occur on the same file if it runs on your machine with an English version of Excel?

Cheers

Dave
0
 
LVL 50

Author Comment

by:Dave Brett
V2.16 added

The UserForm fonts in V2.15 had become screwy in places

Cheers

Dave
0
 
LVL 50

Author Comment

by:Dave Brett
Added V2.17

includes additional RegExp option to replaces all valid string matches with a single replacement - to assist with duplicate matching
RegExp tester now flags any parsing errors  in the Output box

Mr Excel forum question.

Acceptable numbers are 209, 210, 213. Not acceptable are 000 thru 999.

For example:
209:fejoaiajeajefl,210:falkejea;weiajsdkjfeia
209:alskdjfa;owieja,443:dfj;aowiejaslej,213:ejf;aowiej
220:sdkajf;elsijalars,213:dkfja;lseiajsl,509:al;skdjalseise
209:df;alseijalseij,210:d;alseijalseij,213:ldkjfjlsiejlfsej

Should count 2 as only two strings are valid


A regular expression pattern of
(^|[^\d])2(09|10|12|13)([^\d]|$)|[^\d]
will match any character that is non-numeric, but it will allow 209, 210, 212 and 213  (NB but not these strings contained in longer numeric strings ie 9213)

** Note this pattern assumed that any string without an invalid numeric string should be matched **

Then 1) using "||Bob-Neil||" as a replacement string (on the basis that this will not match an existing string)
        2)  replacing all '||Bob-Neil||' strings with a single '||Bob-Neil||'

Means that the Duplicate Master can be used to match the strings meeting these conditions as duplicates

Regards

Dave
Regex.JPG
0
 

Expert Comment

by:justmiracle78
I really like this duplicate master but I wanted to know if the trim can be modified. I have a cell formul "=IF(LEN(C2)=12,C2,IF(LEN(C2)=16,MID(C2,3,6)&RIGHT(C2,6),RIGHT(C2,12)))" and I wanted to know if I can incoporate this with the duplicate master?

Thanks
0
 
 

Administrative Comment

by:Patrick Matthews
brettdj,

Your article has been selected for the Editors Choice designation.  Great work!

matthewspatrick
Page Editor
0
 
LVL 1

Expert Comment

by:SOCCSUPPORT
Once again this add-on has saved me a ton of AD administrator work! Bonus it works with Office 2010 x64bit! Thank you so much!
0
 
LVL 50

Author Comment

by:Dave Brett
Thank you for the kind comments and vote :)

Dave
0
 

Expert Comment

by:Bright01
Brett,

I cannot access the Add-In.  I'm running Excel 2010 but the file attached to this post is not accessible.  Can you assist me with this?

Thank you,

B1
0
 
LVL 50

Author Comment

by:Dave Brett
Hi B1

You should be able to download the zip file, extract the xla file, and then install it with a similar process to this (but click the `File` option rather than button)

Cheers

Dave
0
 

Expert Comment

by:markferg
Brett

I am trying to dedupe by way of "delete entire row" on a 450K row 5 column sheet. I have been giving up after waiting up to 7 hours for completion.

Have I bitten off more than TDM can chew?

TDM ver 2.16
Windows 7
Office 2010 (32 Bit)
8GB mem.
0
 
LVL 50

Author Comment

by:Dave Brett
I just ran a test of 500,000 rows with 5 columns where 100 rows were continually repeated and it completed in under two minutes

Are you able to send me a copy of the file?

Regards

Dave
0
 

Expert Comment

by:markferg
Dave

Have tried many times to upload file with no joy. Its only 24Mb so not sure why it wont upload. Perhaps as I am just community member?

Do you have FTP perhaps?
0
 

Expert Comment

by:markferg
Dave

File attached.
Book3.xlsx
0
 
LVL 50

Author Comment

by:Dave Brett
The issue is the code is hanging on setting the working range - it is "freezing" when it tries to use
.Cells.SpecialCells(xlConstants)`

Open in new window

on the 500k rows

Which is interesting as stepping through manually works - after a time lag of 5-10 seconds

I will keep looking at it

Cheers

Dave
0
 

Expert Comment

by:markferg
Dave

Very greatful for you looking into this.

I have been using TDM for years now and think your a genius!

My dedupes on similiar file type whilst always a tad slow under Excel 97/2003 format (ie 65K rows) always worked well. I always put that down to slow CPU speed.

I have only just started using TDM with Excel 2010 (1m+ rows) and wonder if that has anything to do with it. Anyway you would know that better than I.

Again greatful for any suggestions.
0

Featured Post

How to Use the Help Bell

Need to boost the visibility of your question for solutions? Use the Experts Exchange Help Bell to confirm priority levels and contact subject-matter experts for question attention.  Check out this how-to article for more information.

Join & Write a Comment

The viewer will learn how to use a discrete random variable to simulate the return on an investment over a period of years, create a Monte Carlo simulation using the discrete random variable, and create a graph to represent the possible returns over…
This Micro Tutorial demonstrate the bugs in Microsoft Excel for Mac with Pivot Charts.

Keep in touch with Experts Exchange

Tech news and trends delivered to your inbox every month