Solved

How do I do de-duplication in Excel 2013, but have the ability to filter on the first occurrence?

Posted on 2014-11-06
6
105 Views
Last Modified: 2014-11-10
I have an excel sheet with approximately 200K of rows in it.  I have a single column that I want to de-duplicate against.  When I use the default “duplicate” conditional filter in Excel 2013, it highlights all rows which are duplicates.

What I am trying to do is filter out all duplicates, however I want to identify the “first occurrence”.  Hopefully this makes sense.

Does anyone know how I can go about this?

Many Thanks
0
Comment
Question by:rye004
6 Comments
 
LVL 27

Expert Comment

by:Glenn Ray
ID: 40426884
Do you want to:
* filter out duplicate values after the first occurrence, or
* remove duplicate values after the first occurrence

If the 2nd option, you can use the Remove Duplicates function (Menu:  Data tab, Data Tools section, Remove Duplicates).

If the 1st option, you'll need to add a helper column to help identify the subsequent duplicate records and then filter out those.  For example, if your data is in column A and starts on row 2, add this formula in an adjacent column on row 2 and copy down:
=IF(COUNTIF($A$2:A2,A2)>1,"Duplicate","")

Then filter on blank cells in this new column.  Only the unique or first-occurrence values will display.

There's another option - Advanced Filter - that will let you copy a list of the unique values in that column to another location.


Let us know which method you need and we'll help out.

-Glenn
0
 
LVL 4

Expert Comment

by:AnthonyHamon
ID: 40426926
There is no option under Conditional Formatting to achieve what you require.

An alternative is to add a column that will show the duplicate status of each cell.  In that column, apply the following formula and replicate down:
=IF(COUNTIF(A:A,A1)=1,"Not duplicated","Duplicate number " & COUNTIF(A$1:A1,A1))

Open in new window


This assumes that:
Column A is the one you are searching for duplicates, if not, change all occurrences of A in the formula to the required column letter.
The data starts at row 1, if not, replace all occurrences of 1 in the formula to the row number where the data starts.
0
 
LVL 31

Expert Comment

by:Rob Henson
ID: 40428206
Actually, there is a way that you could use Conditional Formatting.

Apply CF such that duplicates are highlighted in a particular colour. Would probably use the formula option and a count like the suggestions above within the CF.

You could then use AutoFilter and Filter on colour, excluding the duplicate colour.

Thanks
Rob H
0
Better Security Awareness With Threat Intelligence

See how one of the leading financial services organizations uses Recorded Future as part of a holistic threat intelligence program to promote security awareness and proactively and efficiently identify threats.

 

Author Comment

by:rye004
ID: 40429234
Glenn,

To answer your question, I am looking for your 1st option.

After I put up the posting, I did do something similar to what you suggested.  However, instead of using countif, I used a vlookup.  Below is an example of what I did:

=IF(LEN(IFERROR(VLOOKUP([@[MD5 - Static]], $N$3:N3, 1, FALSE), "")) > 1, "Yes", "No")

It took over an hour to run this in Excel 2013 64 bit.  I am assuming your method would be faster since it uses 2 excel functions instead of the 4 that I used.

Can you please let me know your thoughts?

Many Thanks.
0
 
LVL 27

Accepted Solution

by:
Glenn Ray earned 500 total points
ID: 40429342
I tried replicating a table with a field titled "MD5 - Static" in column N and the results all returned "Yes", so maybe I'm missing something in your requirement.

If you're looking for the duplicate values in the MD5 - Static column, and the table data begins in row 3 (as it appears from your example) then add this formula in an adjacent column on row 3 also and copy down (if you add this to a new Excel Table column, it should automatically copy it down for you):
=IF(COUNTIF($N$3:N3,N3)>1,"Duplicate","")

I'm only testing on a small table - less than 50 rows - but it's very fast.

EDIT:  I just tried this on a table with randomly-generated values on 20,000 rows and it takes about 4 seconds to process. See attached example.

-Glenn
EE-28552267.xlsx
0
 

Author Closing Comment

by:rye004
ID: 40433130
Thank you for your help with this.  Tweaking this to use less excel formulas did improve on performance time.
0

Featured Post

Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

Join & Write a Comment

Convert between Excel file formats (.XLS, .XLSX, .XLSM) with/without macro option David Miller (dlmille) Intro Over this past Fall, I've had the opportunity to see several similar requests and have developed a couple related solutions associate…
How to quickly and accurately populate Word documents with Excel data, charts and images (including Automated Bookmark generation) David Miller (dlmille) Synopsis In this article you’ll learn how to use ExcelToWord! to copy data,charts, shapes …
The viewer will learn how to simulate a series of sales calls dependent on a single skill level and learn how to simulate a series of sales calls dependent on two skill levels. Simulating Independent Sales Calls: Enter .75 into cell C2 – “skill leve…
Excel styles will make formatting consistent and let you apply and change formatting faster. In this tutorial, you'll learn how to use Excel's built-in styles, how to modify styles, and how to create your own. You'll also learn how to use your custo…

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now