Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 119
  • Last Modified:

How do I do de-duplication in Excel 2013, but have the ability to filter on the first occurrence?

I have an excel sheet with approximately 200K of rows in it.  I have a single column that I want to de-duplicate against.  When I use the default “duplicate” conditional filter in Excel 2013, it highlights all rows which are duplicates.

What I am trying to do is filter out all duplicates, however I want to identify the “first occurrence”.  Hopefully this makes sense.

Does anyone know how I can go about this?

Many Thanks
0
rye004
Asked:
rye004
1 Solution
 
Glenn RayExcel VBA DeveloperCommented:
Do you want to:
* filter out duplicate values after the first occurrence, or
* remove duplicate values after the first occurrence

If the 2nd option, you can use the Remove Duplicates function (Menu:  Data tab, Data Tools section, Remove Duplicates).

If the 1st option, you'll need to add a helper column to help identify the subsequent duplicate records and then filter out those.  For example, if your data is in column A and starts on row 2, add this formula in an adjacent column on row 2 and copy down:
=IF(COUNTIF($A$2:A2,A2)>1,"Duplicate","")

Then filter on blank cells in this new column.  Only the unique or first-occurrence values will display.

There's another option - Advanced Filter - that will let you copy a list of the unique values in that column to another location.


Let us know which method you need and we'll help out.

-Glenn
0
 
AnthonyHamonCommented:
There is no option under Conditional Formatting to achieve what you require.

An alternative is to add a column that will show the duplicate status of each cell.  In that column, apply the following formula and replicate down:
=IF(COUNTIF(A:A,A1)=1,"Not duplicated","Duplicate number " & COUNTIF(A$1:A1,A1))

Open in new window


This assumes that:
Column A is the one you are searching for duplicates, if not, change all occurrences of A in the formula to the required column letter.
The data starts at row 1, if not, replace all occurrences of 1 in the formula to the row number where the data starts.
0
 
Rob HensonIT & Database AssistantCommented:
Actually, there is a way that you could use Conditional Formatting.

Apply CF such that duplicates are highlighted in a particular colour. Would probably use the formula option and a count like the suggestions above within the CF.

You could then use AutoFilter and Filter on colour, excluding the duplicate colour.

Thanks
Rob H
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
rye004Author Commented:
Glenn,

To answer your question, I am looking for your 1st option.

After I put up the posting, I did do something similar to what you suggested.  However, instead of using countif, I used a vlookup.  Below is an example of what I did:

=IF(LEN(IFERROR(VLOOKUP([@[MD5 - Static]], $N$3:N3, 1, FALSE), "")) > 1, "Yes", "No")

It took over an hour to run this in Excel 2013 64 bit.  I am assuming your method would be faster since it uses 2 excel functions instead of the 4 that I used.

Can you please let me know your thoughts?

Many Thanks.
0
 
Glenn RayExcel VBA DeveloperCommented:
I tried replicating a table with a field titled "MD5 - Static" in column N and the results all returned "Yes", so maybe I'm missing something in your requirement.

If you're looking for the duplicate values in the MD5 - Static column, and the table data begins in row 3 (as it appears from your example) then add this formula in an adjacent column on row 3 also and copy down (if you add this to a new Excel Table column, it should automatically copy it down for you):
=IF(COUNTIF($N$3:N3,N3)>1,"Duplicate","")

I'm only testing on a small table - less than 50 rows - but it's very fast.

EDIT:  I just tried this on a table with randomly-generated values on 20,000 rows and it takes about 4 seconds to process. See attached example.

-Glenn
EE-28552267.xlsx
0
 
rye004Author Commented:
Thank you for your help with this.  Tweaking this to use less excel formulas did improve on performance time.
0

Featured Post

Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now