How do I do de-duplication in Excel 2013, but have the ability to filter on the first occurrence?

rye004
rye004 used Ask the Experts™
on
I have an excel sheet with approximately 200K of rows in it.  I have a single column that I want to de-duplicate against.  When I use the default “duplicate” conditional filter in Excel 2013, it highlights all rows which are duplicates.

What I am trying to do is filter out all duplicates, however I want to identify the “first occurrence”.  Hopefully this makes sense.

Does anyone know how I can go about this?

Many Thanks
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Glenn RayExcel VBA Developer
Top Expert 2014

Commented:
Do you want to:
* filter out duplicate values after the first occurrence, or
* remove duplicate values after the first occurrence

If the 2nd option, you can use the Remove Duplicates function (Menu:  Data tab, Data Tools section, Remove Duplicates).

If the 1st option, you'll need to add a helper column to help identify the subsequent duplicate records and then filter out those.  For example, if your data is in column A and starts on row 2, add this formula in an adjacent column on row 2 and copy down:
=IF(COUNTIF($A$2:A2,A2)>1,"Duplicate","")

Then filter on blank cells in this new column.  Only the unique or first-occurrence values will display.

There's another option - Advanced Filter - that will let you copy a list of the unique values in that column to another location.


Let us know which method you need and we'll help out.

-Glenn
There is no option under Conditional Formatting to achieve what you require.

An alternative is to add a column that will show the duplicate status of each cell.  In that column, apply the following formula and replicate down:
=IF(COUNTIF(A:A,A1)=1,"Not duplicated","Duplicate number " & COUNTIF(A$1:A1,A1))

Open in new window


This assumes that:
Column A is the one you are searching for duplicates, if not, change all occurrences of A in the formula to the required column letter.
The data starts at row 1, if not, replace all occurrences of 1 in the formula to the row number where the data starts.
Rob HensonFinance Analyst

Commented:
Actually, there is a way that you could use Conditional Formatting.

Apply CF such that duplicates are highlighted in a particular colour. Would probably use the formula option and a count like the suggestions above within the CF.

You could then use AutoFilter and Filter on colour, excluding the duplicate colour.

Thanks
Rob H
Expert Spotlight: Joe Anderson (DatabaseMX)

We’ve posted a new Expert Spotlight!  Joe Anderson (DatabaseMX) has been on Experts Exchange since 2006. Learn more about this database architect, guitar aficionado, and Microsoft MVP.

Author

Commented:
Glenn,

To answer your question, I am looking for your 1st option.

After I put up the posting, I did do something similar to what you suggested.  However, instead of using countif, I used a vlookup.  Below is an example of what I did:

=IF(LEN(IFERROR(VLOOKUP([@[MD5 - Static]], $N$3:N3, 1, FALSE), "")) > 1, "Yes", "No")

It took over an hour to run this in Excel 2013 64 bit.  I am assuming your method would be faster since it uses 2 excel functions instead of the 4 that I used.

Can you please let me know your thoughts?

Many Thanks.
Excel VBA Developer
Top Expert 2014
Commented:
I tried replicating a table with a field titled "MD5 - Static" in column N and the results all returned "Yes", so maybe I'm missing something in your requirement.

If you're looking for the duplicate values in the MD5 - Static column, and the table data begins in row 3 (as it appears from your example) then add this formula in an adjacent column on row 3 also and copy down (if you add this to a new Excel Table column, it should automatically copy it down for you):
=IF(COUNTIF($N$3:N3,N3)>1,"Duplicate","")

I'm only testing on a small table - less than 50 rows - but it's very fast.

EDIT:  I just tried this on a table with randomly-generated values on 20,000 rows and it takes about 4 seconds to process. See attached example.

-Glenn
EE-28552267.xlsx

Author

Commented:
Thank you for your help with this.  Tweaking this to use less excel formulas did improve on performance time.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial