We help IT Professionals succeed at work.
Troubleshooting Question

Display Duplicate Data with RGB Colors

48 Views
Last Modified: 2020-07-31
I have a large amount of data, >800,000 rows, with duplicates in the combined data from both column C and Column D. If the same data is in both Col C and Col D, the row is considered a duplicate.
Need to visually display the duplicates for research purposes. Only the data in C & D need to be colored.
Given the large number of duplicates, need to use RGB colors to expand the color options...below seems to provide good color scheme results.  Trying to balance speed and accuracy...How do I best do this with a large data set?  

            iR = WorksheetFunction.RandBetween(51, 255)
            iG = WorksheetFunction.RandBetween(102, 255)
            iB = WorksheetFunction.RandBetween(51, 255)
               DuplicateCells.Interior.Color = RGB(iR, iG, iB)  'This is just an example
Comment
Watch Question

Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
How many of the same duplicate might you have?

Are you looking to have unique colors?

Author

Commented:
I have had over 50,000 duplicates...Given this large number, trying to have all unique colors seems impractical, but yes, that was the initial intent.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
I agree, and even if you could you wouldn't be visually able to tell them apart. What I was asking however was if you had any idea of the maximum number of the same duplicate.

Author

Commented:
I have had up to 50 duplicates of the same data.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
That's a problem because if, say, the average number of the same duplicate were 25, you'd need 32,000 colors.

How about sorting the data on columns C and D and placing some unique number in an unused column, so if column F were unused it might look like this.
2020-07-30_16-26-52.png

Author

Commented:
Oh, I'm sorry...all of the same duplicate will be displayed in one color. I know there will be the overlapping colors, due to the large number of duplicates, but put in random order, they will be separated.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
I'm not sure what you are saying. If, say, rows 5 and 10 are duplicates could there be non-duplicates in-between?

Author

Commented:
Yes, there could be.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
OK then if rows 5 and 50,000 were duplicates and were given the same one of my estimate 32,000 needed colors you'll never recognize them as duplicates.

Author

Commented:
I just tried it, and unless I perform a A-Z or Z-A filter, they will be separated. Which leaves doing the sort either at the beginning or at the end. At the beginning, leaves all the duplicates together....
Ryan ChongSoftware Tead Lead / Business Analyst / System Analyst / Data Engineer / Applications Consultant
CERTIFIED EXPERT

Commented:
try read my article: Handling Duplicate Rows In Excel

and look into sections:

2. Using Conditional Formatting
     II ) With Conditional Formatting - Method 2

6. Move the Duplicate or Unique Rows to Another Sheet

see if that's useful for you?

Author

Commented:
Yes, I will look at it.  Conditional formatting though is way to slow for really large datasets...could take hours.  Looking forward to reading it.
Ryan ChongSoftware Tead Lead / Business Analyst / System Analyst / Data Engineer / Applications Consultant
CERTIFIED EXPERT

Commented:
Another possible way is to load your data into a database, and do your manipulation there

Author

Commented:
The data is too dirty for the database...have already tried this.  Would like a visual - in place - solution.  But, will look at the article and see if something can be adapted from these insights.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
In the attached workbook press Ctrl+Shift+D or run the ColorDupes macro you'll find in Module1. You'll need a Reference to Microsoft Scripting Runtime.
29190135.xlsm

Author

Commented:
Martin...I like what I see so far!  Let me go and use it on the 800K.

Author

Commented:
Martin...I did run into one issue.  After it ran for 20 minutes, on 600K, it was stopped on a:
Run-time error, Autofilter method of Range class failed.
This is where the code reads,
For lngEntry = 1 To colDupes.Count
  With rng
   strParts = Split(colDupes(lngEntry),"|")
  .Autofilter    -> This is where the Error occurred. 

The colDupes.count = 14666, the strKey = 9999972292 | 0000000018 , which is at the end of the list at 604645.

Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
Try this for me. Copy the last hundred rows or so into a new sheet and run the code. If you still get the error can you show me columns C and D of those hundred rows?

Author

Commented:
I did the last 1,000 rows and came out with 75 duplicate C & D numbers, or 150 cells. Everything ran smoothly.

Author

Commented:
When I looked at the data, I saw row 62745, then 543249.  Everything in between was 'hidden.'
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
Add this line just prior to the .AutoFilter line
ActiveSheet.AutoFilterMode = False

Open in new window


Or better - attach a workbook that has the data with everything but columns C and D deleted or cleared. Hopefully that will eliminate any problems with you posting sensitive data.

Author

Commented:
I will add the autofilter.  Cannot do the C & D data...got in trouble with risk compliance about an hour ago for attempting to do so.  But, is there something out of the ordinary I can look for?

Author

Commented:
I just did the last 60K and it ran okay. 864 duplicate C & D cells, or 432 duplicates.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
Did you do the whole thing with the new line of code?

Author

Commented:
Not yet...was just in process..

Author

Commented:
It's still running.  If I'm seeing the bottom and side scroll bars flashing, then we're still good...correct?

Author

Commented:
After about 20 minutes, the scroll bars stopped, and I quit the code. The error actually occurs on the line below the autofilter, or strParts (0). In moving up one, the Autofilter also fails.  The error for this reads:
   Runtime Error: -2147417848 (80010108)
   Method Autofilter of Object Range failed.
strParts (0) = "0263163882"  I'm going to search for this.
strKey is the same as before

Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
What scrollbars are you referring to? Is your data in a table?

Do a Find & Select->Find->Find All for 0263163882 in column C. How many hits do you get?

Author

Commented:
Yes, there are six occurrences. I grab those six and about thirty more and ran the code.  All six were paired in duplicates.  3 duplicate C & D numbers for 6 total cells.  It ran fine though.  I really like the code.

Author

Commented:
Could there be just too many?
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
Try it one more time with this code. I added lines 37 and 53.
Sub ColorDupes()
Dim R As Integer
Dim G As Integer
Dim B As Integer
Dim colDupes As New Collection
Dim lngEntry As Long
Dim lngLastRow As Long
Dim lngRow As Long
Dim rng As Range
Dim strParts() As String
' Requires a reference to Microsoft Scripting Runtime
Dim dicDupes As Dictionary
Dim strKey As String

Set rng = Range("C1:D" & Cells(Rows.Count, "C").End(xlUp).Row)
Set dicDupes = CreateObject("Scripting.Dictionary")

With Application
    .EnableEvents = False
    .ScreenUpdating = False
    .Calculation = xlCalculationManual
End With

For lngRow = 1 To Cells(Rows.Count, "C").End(xlUp).Row
    On Error Resume Next
'    colDupes.Add Cells(lngRow, "C") & "|" & Cells(lngRow, "D"), CStr(Cells(lngRow, "C") & "|" & Cells(lngRow, "D"))
    strKey = Cells(lngRow, "C") & "|" & Cells(lngRow, "D")
    With dicDupes
        .Add strKey, strKey
        If Err.Number <> 0 Then
            colDupes.Add strKey, strKey
        End If
    End With
    On Error GoTo 0
Next

On Error Resume Next
For lngEntry = 1 To colDupes.Count
    With rng
        strParts = Split(colDupes(lngEntry), "|")
        ActiveSheet.AutoFilterMode = False
        .AutoFilter
        .AutoFilter field:=1, Criteria1:=strParts(0)
        .AutoFilter field:=2, Criteria1:=strParts(1)
        R = WorksheetFunction.RandBetween(51, 255)
        G = WorksheetFunction.RandBetween(102, 255)
        B = WorksheetFunction.RandBetween(51, 255)
        ' This assumes the data has a one-row heading. If there is no heading then
        ' remove .Offset(1, 0).Resize(.Rows.Count - 1, 1)
        .Offset(1, 0).Resize(.Rows.Count - 1, 2).SpecialCells(xlCellTypeVisible).Interior.Color = RGB(R, G, B)
    End With
Next
On Error GoTo 0

rng.AutoFilter
With Application
    .Calculation = xlCalculationAutomatic
    .EnableEvents = True
    .ScreenUpdating = True
End With
End Sub

Open in new window

Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
Could there be just too many?
The variables involved with the dictionary and collection are Long which can contain more than you have. I'm not aware of size limits on dictionaries or collections but I guess my answer has to be I don't know.

Author

Commented:
Got it...

Author

Commented:
It took about 12-14 minutes to run 100K, and it seemed to be okay. Maybe I stopped it to soon. If there is an error which generates a pop-up, will the vba editor still show "Running?" 

Author

Commented:
For 800K, it's about 105 minutes.  If we combined columns C & D in a helper column, and had the code only look at the contents in that one cell to determine duplicates, would this reduce the time at all?  I really like what we have now, but may need to reduce the time.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
I assume so.

Are there any empty cells in columns C or D?

Author

Commented:
For these first 100K, no, but there will be some blank cells in column D going forward.

Author

Commented:
Martin...I  can accept what you have provided, as it is what I asked you to do...and really like it. It's the best I've seen doing it the way you have.  I can then submit another request for assistance in making it a little faster (combine C & D)

Author

Commented:
If that is what is needed...I'm not sure.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
Give me a few minutes.
Social distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018
Commented:
Unlock this solution and get a sample of our free trial.
(No credit card required)
UNLOCK SOLUTION

Author

Commented:
Appreciate it Martin...I want to read your article as well.  Let me put this in and see how it goes.

Author

Commented:
Martin...I may not have fully understood all of the code. In returning to the data, I had Col G with prefilled data, so I moved the Helper Column to Col K. Made all the changes from G to K, and changed the -4 to -8. But, I'm missing something else, because the cell coloring is all off.  What other change should I have made?
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
I believe that was my fault. When I posted my most recent code the name of the sub was ColorDupe rather than ColorDupes. If you make that change in your code it will probably work. I'ver attached my workbook that works with column K as the helper column.
29190135a.xlsm

Author

Commented:
Got it...let me go test..

Author

Commented:
Martin, I'm having a tough time running it with the workbook you provided.  It runs all the way through, errrors out on rng.Autofilter...but nothing occurred in cols C & D and nothing is added to K...no Fred | Murtz or Ricky | Ricardo...  I looked at my references and I have MS Scripting Runtime activated.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
The workbook I posted already has data (via a formula) in K. I've attached a video of me running it. The red circle flickers at the moment when I press Ctrl+Shift+D.https://www.experts-exchange.com/questions/29190135/Display-Duplicate-Data-with-RGB-Colors.html#
2020-07-31_15-27-20--1-.mp4

Author

Commented:
Do I need to put a formula in K
=C2 & "|" & D2
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
Yes, sorry I wasn't clearer.

Author

Commented:
That's fine...I'm glad it was something simple as this...Yes, it works on a 10K workbook I have...I believe we are good.  This is exactly what I was looking for.  I believe you have an excellent 2 variable / criteria duplicate finder in a multi-color format...even if one of the options is a helper..  Thanks Martin.  I'm looking forward to trying it at work.
Martin LissSocial distance - Don't touch your face - Wash your hands for 20 seconds
CERTIFIED EXPERT
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
Can I ask you how it is that you have 800,000 rows? I ask because if it's done incrementally, it might be faster to check for and possibly delete the duplicates before they are loaded into the workbook.

In any case you’re welcome and I’m glad I was able to help.

If you expand the “Full Biography" section of my profile you’ll find links to some articles I’ve written that may interest you.

Marty - Microsoft MVP 2009 to 2017
              Experts Exchange Most Valuable Expert (MVE) 2015, 2017
              Experts Exchange Top Expert Visual Basic Classic 2012 to 2019
              Experts Exchange Top Expert VBA 2018, 2019
              Experts Exchange Distinguished Expert in Excel 2018

Author

Commented:
Martin, the 800,000+ are from a Mainframe download to a text file which I then import into Excel. At the moment, it is not 100% certain that the duplicates from C & D do signify duplicates. This is the reason for coloring those appearing to be duplicates.  If indeed they are duplicates, I can use the Data->remove duplicates function on the menu. What you wrote today helps to bridge the gap between identification and deletion.
Unlock the solution to this question.
Thanks for using Experts Exchange.

Please provide your email to receive a sample view!

*This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

OR

Please enter a first name

Please enter a last name

8+ characters (letters, numbers, and a symbol)

By clicking, you agree to the Terms of Use and Privacy Policy.