Solved

How do I do de-duplication in Excel 2013, but have the ability to filter on the first occurrence?

Posted on 2014-11-06
Last Modified: 2014-11-10
I have an Excel sheet with approximately 200K rows in it.  I have a single column that I want to de-duplicate against.  When I use the default “duplicate” conditional formatting in Excel 2013, it highlights all rows which are duplicates.

What I am trying to do is filter out all duplicates while still being able to identify the “first occurrence” of each value.  Hopefully this makes sense.

Does anyone know how I can go about this?

Many Thanks
Question by:rye004
6 Comments
 
LVL 27

Expert Comment

by:Glenn Ray
ID: 40426884
Do you want to:
* filter out duplicate values after the first occurrence, or
* remove duplicate values after the first occurrence

If the 2nd option, you can use the Remove Duplicates function (Menu:  Data tab, Data Tools section, Remove Duplicates).

If the 1st option, you'll need to add a helper column to help identify the subsequent duplicate records and then filter out those.  For example, if your data is in column A and starts on row 2, add this formula in an adjacent column on row 2 and copy down:
=IF(COUNTIF($A$2:A2,A2)>1,"Duplicate","")

Then filter on blank cells in this new column.  Only the unique or first-occurrence values will display.
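If you want to sanity-check the helper-column logic outside Excel, here is a minimal Python sketch of what the running COUNTIF does. The function name and sample values are made up for illustration, and it does not reproduce every COUNTIF behaviour (wildcards, for example); it only mirrors the "seen before?" test, with case folding because COUNTIF ignores case when comparing text.

```python
def flag_repeats(values):
    """Return "" for the first occurrence of each value and "Duplicate" for later ones."""
    seen = set()
    flags = []
    for value in values:
        # COUNTIF compares text without regard to case, so fold case here too.
        key = value.casefold() if isinstance(value, str) else value
        flags.append("Duplicate" if key in seen else "")
        seen.add(key)
    return flags


rows = ["a1", "b2", "A1", "c3", "b2"]  # hypothetical sample data
flags = flag_repeats(rows)
# Keep only the rows whose flag is blank, like filtering on blanks in Excel.
first_occurrences = [row for row, flag in zip(rows, flags) if flag == ""]
print(first_occurrences)  # ['a1', 'b2', 'c3']
```

The rows that survive are the unique values plus the first occurrence of each duplicated value, which is the same set the blank filter shows.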

There's another option - Advanced Filter - that will let you copy a list of the unique values in that column to another location.


Let us know which method you need and we'll help out.

-Glenn
 
LVL 4

Expert Comment

by:AnthonyHamon
ID: 40426926
There is no option under Conditional Formatting to achieve what you require.

An alternative is to add a column that will show the duplicate status of each cell.  In that column, apply the following formula and replicate down:
=IF(COUNTIF(A:A,A1)=1,"Not duplicated","Duplicate number " & COUNTIF(A$1:A1,A1))


This assumes that:
Column A is the one you are searching for duplicates. If not, change all occurrences of A in the formula to the required column letter.
The data starts at row 1. If not, replace all occurrences of 1 in the formula with the row number where the data starts.
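As a rough illustration of the labels this formula produces, here is a small Python sketch. The function name and sample values are hypothetical, and it compares values exactly (COUNTIF itself ignores case), so treat it as a model of the logic rather than a replacement for the formula.

```python
from collections import Counter


def label_occurrences(values):
    """Label each value the way the two-COUNTIF formula does."""
    totals = Counter(values)   # plays the role of COUNTIF(A:A, A1)
    running = Counter()        # plays the role of COUNTIF(A$1:A1, A1)
    labels = []
    for value in values:
        running[value] += 1
        if totals[value] == 1:
            labels.append("Not duplicated")
        else:
            labels.append(f"Duplicate number {running[value]}")
    return labels


print(label_occurrences(["x", "y", "x", "x"]))
# ['Duplicate number 1', 'Not duplicated', 'Duplicate number 2', 'Duplicate number 3']
```

To show only first occurrences, filter the new column on "Not duplicated" and "Duplicate number 1".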
 
LVL 33

Expert Comment

by:Rob Henson
ID: 40428206
Actually, there is a way that you could use Conditional Formatting.

Apply CF such that duplicates are highlighted in a particular colour. You would probably use the formula option, with a count like those in the suggestions above, as the CF rule.

You could then use AutoFilter and Filter on colour, excluding the duplicate colour.

Thanks
Rob H

Author Comment

by:rye004
ID: 40429234
Glenn,

To answer your question, I am looking for your 1st option.

After I put up the posting, I did do something similar to what you suggested.  However, instead of using countif, I used a vlookup.  Below is an example of what I did:

=IF(LEN(IFERROR(VLOOKUP([@[MD5 - Static]], $N$3:N3, 1, FALSE), "")) > 1, "Yes", "No")

It took over an hour to run this in Excel 2013 64-bit.  I am assuming your method would be faster since it uses 2 Excel functions instead of the 4 that I used.

Can you please let me know your thoughts?

Many Thanks.
 
LVL 27

Accepted Solution

by:
Glenn Ray earned 2000 total points
ID: 40429342
I tried replicating a table with a field titled "MD5 - Static" in column N and the results all returned "Yes", so maybe I'm missing something in your requirement.

If you're looking for the duplicate values in the MD5 - Static column, and the table data begins in row 3 (as it appears from your example) then add this formula in an adjacent column on row 3 also and copy down (if you add this to a new Excel Table column, it should automatically copy it down for you):
=IF(COUNTIF($N$3:N3,N3)>1,"Duplicate","")

I'm only testing on a small table - less than 50 rows - but it's very fast.

EDIT:  I just tried this on a table with randomly-generated values on 20,000 rows and it takes about 4 seconds to process. See attached example.
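One caveat when scaling this up: the range $N$3:N3 grows by one cell on every row, so the total number of cells COUNTIF has to look at grows roughly with the square of the row count. The Python sketch below is only a back-of-the-envelope model (the function name is made up, and Excel's real calculation engine may behave differently), but it gives a feel for the difference between 20,000 and 200,000 rows.

```python
def cells_scanned(row_count):
    """Total cells examined when row k counts over a range of k cells (1 + 2 + ... + n)."""
    return row_count * (row_count + 1) // 2


small = cells_scanned(20_000)    # size of the test table mentioned above
large = cells_scanned(200_000)   # approximate size of the sheet in the question
print(small, large, round(large / small))
```

By this estimate the 200K-row sheet needs about 100 times the work of the 20,000-row test, so expect minutes rather than seconds.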

-Glenn
EE-28552267.xlsx
 

Author Closing Comment

by:rye004
ID: 40433130
Thank you for your help with this.  Tweaking this to use fewer Excel functions did improve performance.

Question has a verified solution.