Auto Data Comparison

The goal here is to compare the data in Numbers_1 through Numbers_10 and find matching rows, these matching rows would then be placed in the Matching sheet.

Note: To save time there is no need to check rows in the same sheet and there it would be redundant to check Numbers_1 with Numbers_2 and then when it gets to Numbers_2 to check Numbers_1 since the data will yield the same results.

Also, once a combination is found for instance, 2, 6, 9, 11, 13 match on more than one sheet and thus that combination does not need to be duplicated on the Matching sheet since it is already there.
MatchingCombinationsFrom1-10.xlsb
Pedrov664Asked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

FaustulusCommented:
Hi Pedrov,
Please paste the following code to a standard code module in your project. Then run the procedure 'WriteMatching'.
The enums and constants at the top of the code may be modified. Let me know if you require assistance with that. Note that I tested the code only on a fraction of the data you have. I can't tell you how long it will take to sort through all your sequences but presume that there will be time for a cup of coffee.
If you are interested, I shall be glad to tell you how the code works. One item of information may be useful to you, however: all your sequences are converted into strings. Meaning, the numbers 2, 6, 9, 11, 13 end up as "2 6 9 11 13" in one cell. These creations are used and deleted. I am wondering if they are useful to you in some other way. They could be left behind in Matching!A:A.
The other thing you need to know is that the code doesn't clear the 'Matching' worksheet. If you run the program another time the second result will be appended to the first.
Option Explicit
    
    Private Enum Nwm                    ' WsM: Worksheets navigation
        NwmFirstDataRow = 3
        NwmTest = 1                     ' Column A = 1
        NwmFirstData
    End Enum
    
    Private Enum Nws                    ' WsS: Worksheets navigation
        NwsFirstData = 2                ' Column B = 2
        NwsColCount = 5                 ' number of data columns
    End Enum
    
Sub WriteMatching()

    ' modify as required:-
    Const Temp As String = "Temp"
    Const Matching As String = "Matching"
    Const Numbers As String = "Numbers"
    Const MaxNum As Long = 10       ' count of Numbers sheets

    Dim WsT As Worksheet            ' Temporary sheet
    Dim WsM As Worksheet            ' Matching: Result sheet
    Dim WsS As Worksheet            ' Source sheets: Numbers_?
    Dim Cell As Range
    Dim Idx As Long
    
    Set WsM = Worksheets(Matching)
    WsM.Activate
    SetApplication False
    Set WsT = SetTempSheet(Temp)
    
    For Idx = 1 To MaxNum
        If SetNumbersSheet(Numbers, Idx, WsS) Then
            CreateTempList Idx, WsT, WsS
        End If
    Next Idx
    
    For Each Cell In WsT.UsedRange
        With Cell
            If Len(.Value) Then
                If FindMatch(.Value, WsT) Then AddToMatching .Value, WsM
                .ClearContents
            End If
        End With
    Next Cell
    
    DecypherMatches WsM
    DelSheet WsT.Name
    SetApplication True
End Sub

Private Sub DecypherMatches(WsM As Worksheet)

    Dim PageArr() As Variant
    Dim MatchArr() As String
    Dim Dest As Range
    Dim R As Long
    Dim C As Long
    
    With WsM
        PageArr = Range(.Cells(NwmFirstDataRow, NwmTest), _
                        .Cells(LastRow(NwmTest, WsM), NwmFirstData + NwsColCount - 1))
    End With

    For R = 1 To UBound(PageArr)
        If Len(PageArr(R, 1)) Then
            MatchArr = Split(PageArr(R, 1))
            For C = 0 To UBound(MatchArr)
                PageArr(R, NwmFirstData + C) = MatchArr(C)
                PageArr(R, 1) = vbNullString
            Next C
        End If
    Next R
    
    Set Dest = WsM.Cells(NwmFirstDataRow, NwmTest)
    Set Dest = Dest.Resize(UBound(PageArr), UBound(PageArr, 2))
    Dest.Value = PageArr
End Sub

Private Sub AddToMatching(ByVal NewMatch As Variant, _
                          WsM As Worksheet)
    Dim R As Long
    
    If Not FindMatch(NewMatch, WsM) Then
        R = LastRow(NwmTest, WsM) + 1
        If R < NwmFirstDataRow Then R = NwmFirstDataRow
        WsM.Cells(R, NwmTest).Value = NewMatch
    End If
End Sub

Private Sub CreateTempList(ByVal Idx As Long, _
                           WsT As Worksheet, _
                           WsS As Worksheet)
                           
    Dim PageArr() As Variant
    Dim MatchArr() As Variant
    Dim TempArr() As Variant
    Dim Dest As Range                   ' destination
    Dim R As Long
    Dim Rl As Long
    Dim C As Long
    Dim i As Long
    
    Rl = LastRow(NwsFirstData, WsS)
    R = FirstRow(NwsFirstData, Rl, WsS)
    With WsS
        PageArr = Range(.Cells(R, NwsFirstData), _
                        .Cells(Rl, NwsFirstData + NwsColCount - 1))
    End With
    
    For R = 1 To UBound(PageArr)
        ReDim TempArr(1 To UBound(PageArr, 2))
        For C = 1 To UBound(PageArr, 2)
            TempArr(C) = PageArr(R, C)
        Next C
        i = i + 1
        ReDim Preserve MatchArr(1 To i)
        MatchArr(i) = Join(TempArr)
    Next R
    Set Dest = WsT.Cells(1, Idx)
    Set Dest = Dest.Resize(UBound(MatchArr), 1)
    Dest.Value = Application.Transpose(MatchArr)
End Sub

Private Function SetNumbersSheet(ByVal Numbers As String, _
                                 ByVal Idx As Long, _
                                 WsS As Worksheet) As Boolean

    Dim Sn As String
    
    Sn = Numbers & "_" & CStr(Idx)
    On Error Resume Next
    Set WsS = ThisWorkbook.Worksheets(Sn)
    If Err Then
        MsgBox "Worksheet " & Chr(34) & Sn & Chr(34) & _
               " doesn't exist."
    Else
        SetNumbersSheet = True
    End If
End Function

Private Function SetTempSheet(ByVal Temp As String) As Worksheet

    Dim Ws As Worksheet
    
    Set Ws = ActiveSheet
    With ThisWorkbook
        DelSheet Temp
        Set SetTempSheet = .Worksheets.Add
        ActiveSheet.Name = Temp
    End With
    Ws.Activate
End Function

Private Function FirstRow(ByVal Col As Long, _
                          ByVal Rl As Long, _
                          Ws As Worksheet) As Long
    Dim R As Long
    
    R = 1
    Do While Len(Ws.Cells(R, Col).Value) = 0 _
             And Rl > R
        R = R + 1
    Loop
    FirstRow = R
End Function

Private Function LastRow(Optional ByVal Col As Variant, _
                         Optional Ws As Worksheet) As Long
    ' 0059 V 3.2 Apr 2, 2012

    ' Return the number of the last non-blank row in column Col.
    ' Specify the column as string or number
    ' If no column is specified,
      ' return the last row from column A.
    ' If no worksheet is specified
      ' return the result from the currently active sheet.
    
    Dim R As Long
    
    If Ws Is Nothing Then Set Ws = ActiveSheet
    If VarType(Col) = vbError Then Col = 1
    With Ws
        R = .Cells(.Rows.Count, Col).End(xlUp).Row
        With .Cells(R, Col)
            ' in a blank column the last used row is 0 (= none)
            If R = 1 And .Value = vbNullString Then R = 0
            ' include all rows of a merged range
            LastRow = R + .MergeArea.Rows.Count - 1
        End With
    End With
End Function

Private Sub SetApplication(ByVal Target As Boolean)
    With Application
        .ScreenUpdating = Target
        .Cursor = IIf(Target, xlDefault, xlWait)
    End With
End Sub

Private Function FindMatch(ByVal SearchFor As Variant, _
                           Ws As Worksheet) As Boolean
                           
    FindMatch = Not Ws.UsedRange.Find( _
                       What:=SearchFor, _
                       LookIn:=xlValues, _
                       Lookat:=xlWhole, _
                       Searchorder:=xlByColumns) _
                       Is Nothing
End Function

Private Sub DelSheet(ByVal Sn As String)

        On Error Resume Next
        Application.DisplayAlerts = False
        ThisWorkbook.Worksheets(Sn).Delete
        Application.DisplayAlerts = True
End Sub

Open in new window

0
Pedrov664Author Commented:
Faustulus,

Thank you for the time and effort put into this scripting.

Since I am not doing scripting myself it may not help me much as to how it works. However, I noticed that the code module has divided the module into many subscripts.

How do I handle this when I would like to run other scripts with this one? I think putting the other scripts below this one would be best. What do you think?

P.S. I will run the script when I can spare some computer time and let you know the results.
0
Pedrov664Author Commented:
Ok, I got a run-time error '13:' Type mismatch and it highlights the following:

Dest.Value = Application.Transpose(MatchArr)

It also gives the dreaded hourglass (waiting) and does not go away. Notice it made the "temp" worksheet and then the error. Hope that helps you debug.
0
The Ultimate Tool Kit for Technolgy Solution Provi

Broken down into practical pointers and step-by-step instructions, the IT Service Excellence Tool Kit delivers expert advice for technology solution providers. Get your free copy for valuable how-to assets including sample agreements, checklists, flowcharts, and more!

FaustulusCommented:
The error you get is caused by the size of the array. It seems that Transpose can't take more than 65535 rows. So, I have written code to write the arrays to the sheets in batches.
The dreaded hour glass is turned on by my code and should be turned off by it, too. But if the code crashes before it reaches the end you get stuck with the hour glass. Get rid of it by typing the following line in VBE's Immediate window followed by [Enter].
Application.Cursor = xlDefault.
The best way to manage your code would be to have a separate module for each operation. Drag the entire module 'MatchMan' from the project I will upload here to your project. Create new modules for other tasks.
Note that the new version doesn't require the sheet 'Matching' to be present and deletes and re-creates it if it exists. If you wish to retain the data rename it before you start another run of the code.
I will upload the new version of my workbook later. Right now it is undergoing a full test. I will share the result with you when I have it.
0
Pedrov664Author Commented:
Ok, thanks. Any idea how long it ma be?
0
FaustulusCommented:
I gave it until morning. When I interrupted the run after 18 hours some 34000 items had been checked and 6060 duplicates identified. I then managed to let the search continue and about 3 minutes later another 460 items had been checked and another 53 duplicates found. Unfortunatley, I didn't time the continuation exactly. It might have been considerably less than 3 minutes. Anyway, I couldn't continue the program another time. But based on the estimate of 3 minutes the program should have found the 6060 items in 343 minutes = 5:43 hours - nowhere near the 18 hours it took. The computer must have been sleeping as soundly as I did, working a few minutes whenever I woke it up from time to time.
I am loathe to add a progress indicator because that will consume extra time. However, I imagine that it will have to be done just to keep the CPU awake. I will re-design the search method entirely.
Unfortunately, I am about to embark on some serious travelling and don't know how much time I will be able to devote to this next week, probably none today and tomorrow. Anyway, the progam seems to be working as it is, and if the 3 minutes were an accurate guess it should be able to complete the search within 1000 hours, if you can keep the computer awake. Actually, the search should speed up significantly toward the end because the search area is reduced with each item checked. Perhaps you can employ it on smaller lots, like comparing just 2 sheets.
Please let me know if you have any idea of how to speed up the search based on your knowledge of the numbers. I may want to consider pre-sorting them, but that would lead me to a completely different way of searching. Please tell me anything you know about the sequence of the numbers. You seem to know that there are no duplicates within any one sheet. Can you confirm that the numbers are, in fact, sorted within each sheet?

I am also interested in the number of duplicates you might potentially find. Could it conceivably be more than a million? Also, if the search of real data would return much fewer than the 20%-odd I found so far that would be good to know, too.
EXX-130912-Match-Combinations.xlsb
0
Pedrov664Author Commented:
Due to the vast amount of data I am unable to know how many if any are duplicates. That is one reason why I believe a script would be best. I thought of using excel functions but these prove too cumbersome to use on this type of data set.

Thank you for the considerable time and effort you've put into this. I will reduce the amount of data to process and then perhaps someone else knows how to speed things up.
0
aikimarkCommented:
@Pedrov664

1a. It looks like these cells have values < 100.  Is that correct?
1b. If the value limits differ by column, what are each column's value limits?

2. It looks like these lists/sequences are already sorted.  Is this correct?
0
Pedrov664Author Commented:
1. Yes I believe the cell value limit is 60 max no zeros
2. The data is pre-sorted
0
aikimarkCommented:
Is that 60 for all columns, or just the right-most columns?
0
Pedrov664Author Commented:
After sorting the right most columns contain the highest numbers and the left the lowest. As such numbers would appear in ascending order top to bottom and left to right.
0
aikimarkCommented:
I'm using Excel 2003 and can not see all the values.  I can't see the highest values in the left most columns.
0
Pedrov664Author Commented:
What is the most columns or rows that you can handle?

Is there a way I can format it that would be best for you?
0
aikimarkCommented:
We are only working with 5 columns.  I needed to know the maximum numeric values in these columns.

I'm limited to 64k rows per worksheet.
0
Pedrov664Author Commented:
Yes, only five columns and the range is from 1:60. Thus minimum 1 maximum 60 for this data set.

64,000 rows per worksheet should be fine. If that is all you can see then you can work with that. It should be sufficient data for the script.
0
aikimarkCommented:
What is supposed to happen with the XX DELETE XX rows?
0
aikimarkCommented:
1. Place the code below into your workbook's module.
2. On all of your Numbers_# worksheets, make sure the first row of data is 3.
3a. In cell G3 of all the Numbers_# worksheets, add the following formula
=B3*100^4+C3*100^3+D3*100^2+E3*100+F3

Open in new window

3b. fill the formula down to the last row of your data.
4. Run the CompareAllSheets() routine

Note: On my 800MHz laptop, running Excel2003, this ran about 15 seconds and produced 135000+ unique values for the 64k row limits in my copy of the xlsb workbook you posted (running in Excel2003).  It might be possible to do some performance tweaks on this, but I think 15 seconds is a tolerable run time, considering what you are trying to do.

Option Explicit

Public Sub CompareAllSheets()
    Dim lngLoop1 As Long, lngLoop2 As Long  'for worksheet iteration
    Dim lngFrom As Long, lngTo As Long      'for the copy operation
    Dim vAllMatches() As Variant
    Dim vTheseMatches() As Variant
    Dim vDifferences() As Variant
    Dim vMerge() As Variant
    Dim wksTgt As Worksheet, rngTgt As Range
    Dim strNum As String
    ReDim vAllMatches(0 To 0)
    
    For lngLoop1 = 1 To 9
        For lngLoop2 = lngLoop1 + 1 To 10
            vTheseMatches = CompareSheet(Worksheets("Numbers_" & lngLoop1), Worksheets("Numbers_" & lngLoop2))
            If UBound(vAllMatches) = 0 Then
                vAllMatches = vTheseMatches
            Else
                If UBound(vTheseMatches) = 0 Then
                Else
                    vDifferences = GetDifferences(vAllMatches, vTheseMatches)
                    If UBound(vDifferences) = 0 Then
                    Else
                        ReDim vMerge(1 To UBound(vAllMatches) + UBound(vDifferences))
                        MergeValues vAllMatches, vDifferences, vMerge
                        vAllMatches = vMerge
                    End If
                End If
            End If
        Next
    Next
    Set wksTgt = Worksheets("Matching")
    Set rngTgt = wksTgt.Range("A1")
    Application.ScreenUpdating = False
    For lngLoop1 = 1 To UBound(vAllMatches)
        strNum = Format(vAllMatches(lngLoop1), "0000000000")
        wksTgt.Range(rngTgt, rngTgt.Offset(0, 4)).Value = _
            Array(Mid$(strNum, 1, 2), Mid$(strNum, 3, 2), Mid$(strNum, 5, 2), Mid$(strNum, 7, 2), Mid$(strNum, 9, 2))
        Set rngTgt = rngTgt.Offset(1)
    Next
    Application.ScreenUpdating = True
End Sub

Public Function CompareSheet(parmWks1 As Worksheet, parmWks2 As Worksheet)
    Dim rng1 As Range
    Dim rng2 As Range
    Dim vValues1() As Variant
    Dim vValues2() As Variant
    Dim vMatches() As Variant
    Dim boolListExhausted As Boolean
    Dim lngPosn1 As Long, lngPosn2 As Long, lngMatchCount As Long
    Set rng1 = parmWks1.Range(parmWks1.Range("G3"), parmWks1.Range("G3").End(xlDown))
    Set rng2 = parmWks2.Range(parmWks2.Range("G3"), parmWks2.Range("G3").End(xlDown))
    vValues1() = rng1.Cells.Value
    vValues2() = rng2.Cells.Value
    lngPosn1 = LBound(vValues1)
    lngPosn2 = LBound(vValues2)
    If rng1.Rows.Count > rng2.Rows.Count Then
        ReDim vMatches(1 To UBound(vValues1))
    Else
        ReDim vMatches(1 To UBound(vValues2))
    End If
    Do
        Select Case True
            Case IsError(vValues1(lngPosn1, 1))
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(vValues1) Then boolListExhausted = True
            Case IsError(vValues2(lngPosn2, 1))
                lngPosn2 = lngPosn2 + 1
                If lngPosn2 > UBound(vValues2) Then boolListExhausted = True
            Case vValues1(lngPosn1, 1) < vValues2(lngPosn2, 1)
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(vValues1) Then boolListExhausted = True
            Case vValues1(lngPosn1, 1) > vValues2(lngPosn2, 1)
                lngPosn2 = lngPosn2 + 1
                If lngPosn2 > UBound(vValues2) Then boolListExhausted = True
            Case Else
                lngMatchCount = lngMatchCount + 1
                vMatches(lngMatchCount) = vValues1(lngPosn1, 1)
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(vValues1) Then boolListExhausted = True
                'DoEvents
        End Select
    Loop Until boolListExhausted
    'Debug.Print lngMatchCount
    If lngMatchCount = 0 Then
        ReDim vMatches(0 To 0)
    Else
        ReDim Preserve vMatches(1 To lngMatchCount)
    End If
    CompareSheet = vMatches()
End Function

Public Function GetDifferences(parmAll, parmThis)
    'output parmThis() values that are not in parmAll()
    Dim boolListExhausted As Boolean
    Dim vDifferences() As Variant
    Dim lngPosn1 As Long, lngPosn2 As Long, lngDiffCount As Long
    ReDim vDifferences(1 To UBound(parmThis))
    lngPosn1 = 1
    lngPosn2 = 1
    Do
        Select Case True
            Case parmAll(lngPosn1) = parmThis(lngPosn2)
                lngPosn1 = lngPosn1 + 1
                lngPosn2 = lngPosn2 + 1
                If (lngPosn1 > UBound(parmAll)) Or (lngPosn2 > UBound(parmThis)) Then
                    boolListExhausted = True
                End If
            Case parmAll(lngPosn1) < parmThis(lngPosn2)
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(parmAll) Then boolListExhausted = True
            Case parmAll(lngPosn1) > parmThis(lngPosn2)
                lngDiffCount = lngDiffCount + 1
                vDifferences(lngDiffCount) = parmThis(lngPosn2)
                lngPosn2 = lngPosn2 + 1
                If lngPosn2 > UBound(parmThis) Then boolListExhausted = True
            Case Else
                lngPosn1 = lngPosn1 + 1
                'DoEvents
        End Select
    Loop Until boolListExhausted
    'Debug.Print lngMatchCount
    If lngDiffCount = 0 Then
        ReDim vDifferences(0 To 0)
    Else
        ReDim Preserve vDifferences(1 To lngDiffCount)
    End If
    GetDifferences = vDifferences()
    
End Function


Public Sub MergeValues(parmAll, parmDiffs, parmMerge)
    Dim lngPosn1 As Long, lngPosn2 As Long, lngOut As Long
    Dim boolListExhausted As Boolean
    lngPosn1 = 1
    lngPosn2 = 1
    lngOut = 1
    Do
        Select Case True
            Case parmAll(lngPosn1) < parmDiffs(lngPosn2)
                parmMerge(lngOut) = parmAll(lngPosn1)
                lngOut = lngOut + 1
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(parmAll) Then
                    For lngPosn2 = lngPosn2 To UBound(parmDiffs)
                        parmMerge(lngOut) = parmDiffs(lngPosn1)
                        lngOut = lngOut + 1
                    Next
                    boolListExhausted = True
                End If
            Case parmAll(lngPosn1) > parmDiffs(lngPosn2)
                parmMerge(lngOut) = parmDiffs(lngPosn2)
                lngOut = lngOut + 1
                lngPosn2 = lngPosn2 + 1
                If lngPosn2 > UBound(parmDiffs) Then
                    For lngPosn1 = lngPosn1 To UBound(parmAll)
                        parmMerge(lngOut) = parmAll(lngPosn1)
                        lngOut = lngOut + 1
                    Next
                    boolListExhausted = True
                End If
            Case Else
                Stop        'There should never be identical items
        End Select
    Loop Until boolListExhausted
    
End Sub

Open in new window

0
FaustulusCommented:
@Pedrov664,
This doesn't look like it will end up being fair to me.
A) I'm working on a solution for producing a list of duplicates found in 1,650,000 data.
B) You are encouraging a solution to provide unique data extracted from 640,000 data.
Input and output are both different. Should I continue my effort to create an acceptable solution to the original problem? Is your target a unique list or a list of duplicates?
0
aikimarkCommented:
@Faustulus

I have no idea if my solution will run fast in Pedrov's production environment as in my limited Excel 2003 environment.  I wasn't even going to post a solution in this thread until I read that some of the run times were measured in hours.
0
FaustulusCommented:
@aikimark

As you have discovered yourself, the problem here is in the number of data exceeding 65535. Obviously, my approach was wrong but yours wasn't even an approach. Between the bad news both of us are plastering him with pedrov664 tries to reduce his requirements. That is good, if it is done in a fair and reasonable way. Otherwise it will serve no one.
Fairness requires that concessions are made in such a way that they can benefit all possible approaches to the problem. If the source data have 150,000 rows per sheet it wouldn't be fair to say that they could also be limited to 64,000 rows. They either have one size or the other, unless they have a third.
Unreasonable is the plan that the data which originally have 150K rows could be manipulated somehow to be transformed into sheets of 64K. Such manipulation must be part of the solution, not part of a concession.
In short, if pedrov664 has to manually manipulate his data for 2 hours so that they might then be processed in 13 seconds the overall result isn't reasonable.
0
aikimarkCommented:
The manual steps I asked him to do can be done programmatically.  I thought it more expedient to ask him to add and fill down these formulas just to test the speed and correctness of my code.

It takes about one minute to create the first formula and then to copy/paste it to the other worksheets and to do the fill down on the worksheets manually, not two hours.
0
aikimarkCommented:
Even multiplying the 15 seconds by 2.5 should yield an expected 38 second run time, which isn't terribly bad.
0
FaustulusCommented:
@pedrov664

You haven't replied my question whether or not the number of duplicates might conceivably exceed a million rows. Perhaps I should have explained this better.
1. Your sample data have less than 1,800,000 rows. Therefore a million duplicates are only possible if the data you wish to examine might be larger - at least 2 million.
2. The code I have written is capable of handling more than 10 sheets. Perhaps you might conceivably use it to process 20 sheets with more than 3 million data resulting in correspondingly more duplicates.
It would be nice if you could guarantee that there wouldn't be more than a million duplicates because at that number you would need to write in more than one column.

Your comment to the effect that you couldn't guess at the percentage of duplicates was very enlightening and helpful. I deduced that the data I have are, in fact, real data as opposed to test data which have been enriched artificially to contain more duplicates. The point of the question is that writing duplicates and searching them for repetitive duplicates takes a lot of time. As you can probably guess, at this time I am very much concerning myself with anything that takes a lot of time in this project. Your answer provides no relief.
0
Pedrov664Author Commented:
Faustulus,

First of all, I would like to thank you for you dedication and unfailing devotion to finding a solution. At first I thought that if someone could find a solution that would work for less data then it can be extrapolated to work for more and thus it explains my remark to aikimark.

It appears from your comments that i was incorrect. Therefore, I sincerely apoligize. I did not mean to create more work than necessary.

The data to be examined at any one time would be explicitly limited to ten sheets which I believe is what the current excel file contains, again I apologize if I did not make that explicitly clear.

I do not know if this helps, however, if a duplicate is found then there is not need to search for that duplicate again. For instance, if you find that 1 2 3 4 5 are in two sheets then there is no need to find it in other sheets, the script can then move to the next until it finds it in another sheet. I should point out that duplicates are explicitly not contained within sheets, in othe words, sheet 1 will not duplicate 1 2 3 4 5 within itself.

In the end the sheet containing the duplicate sheets should only hold those that were found in more that one sheet. As such it is also not necessary to check sheet 1 with sheet 2 then sheet 2 with sheet 1 since it would yield the same results and would simply be a waste of time.

I hope I have helped to clarify all of the necessary points. If there is anything else that you feel is necessary please feel free to let me know.
0
FaustulusCommented:
Hi pedrov664,
Thanks for clarifying. The loose end you left behind is whether you really need a list of duplicates or the invers, a list of unique vales. I am working on the former, but the other is never far away in such circumstance, in fact, easier.
After a very late night (aided by jet lag) I am now in the final stages of trouble-shooting the new approach. The way it looks the program will take less than 10 minutes to complete the 1.6 million entries in your current list. I will explain the way it functions when I can post it. Meanwhile please note that the program can handle more than 10 files. 15 should be no problem. There are some items (almost 4000 of them) marked XX DELETE XX in the first column. I am programming to ignore these, but anything that's not a number needs hard-programmed attention. There is also a surprising variation of quantities. Number sequences starting like 2 6 have nearly 40,000 entries, many of them duplicates. Sequences like 50 28 may have a single occurrence only. There are quite a few of these. Of the 3600 possible variations about 1350 actually exist. My suggestion that you might be able to handle 15 number files rests on the target of not exceeding 65535 sequences starting with the same two numbers. So, if 10 lists contain 40,000 sequences starting with 2 6 then 15 lists might come near the maximum. On the other hand, if your data eventually turn out to be more evenly spread over all available variations you could handle much more.
Note that my new approach doesn't search. Not having to search for repetitive duplicates therefore doesn't offer an advantage.
I hope to finish this ater in the day - if all goes well.
0
FaustulusCommented:
@pedrov664,
OK, here we go. I timed the process at 52 seconds the only time I ran it completely. So, as you see, I tested extensively (with smaller samples) :-).
So far, so well. The best way to install the program in your project is to open the attached file as well as your own project and drag the two modules 'MatchMan' and 'TestMan' into your own in the VBE' s Project Explorer window. Because the program creates and deletes thousands of sheets I recommend that you "clean" the project from time to time and whenever you experience crashes that shouldn't be.
1. Delete the two modules from your project, but be sure to export them before you do.
2. Save the workbook without any code.
3. Import the same two modules back into the the newly saved workbook.
This procedure gets rid of a lot of junk that Excel collects and uses to inflate the size of your file which is also slowing it down in every way.

To run the program start the procedure "ListMatches". It is the only one available from the Macros button on the ribbon's Developer tab.
Run the two procedures in the 'TestMan' module to familiarize yourself with them. They are available from the code sheet only (F5) but you might delete the word "Private" from their declaration line if you prefer to run them from the Macro button. One of them analyses the Input data, the other the Output data. I used them to make sure that no data got "lost" on the way. That, in fact, was the last problem I solved on trouble-shooting and, in view of my thorough testing, you may like to put your shoulder to this task yet.

And here is the promised description of method:-
- The program starts by creating two strings joining the first two and the last three of each of your sequences of 5 numbers, like "2 6" and "10 11 12". For each of the first strings a worksheet is created into the first column of which all the second strings are written.
- As the second and further numbers sheets are split the same way the existing sheets will be used for the same first strings. Note that the maximum number of rows (each sequence has its own row) is about 1,048,000.
- All qualifying sheets will be sorted in the above way. It could be 10 or fewer or more. The identifying criterion is the name, like "Numbers_" followed by a number. "Numbers_1" will be processed, as will "Numbers_001" but not "Numbers_1A"
- Once all Numbers sheets have been sorted into the temp sheets each of the Temp sheets is sorted. This is where there might be a problem if you have more than 65535 rows. I don't know and I didn't test, but I suspect.
- Now duplicates are in adjacent rows. The next step of the program simply extracts those which occur more than once and creates a list of unique items from them.
- In the final step the first two numbers from the temp tab names are added back to the last three numbers in the duplicate list and sorted into five columns of the 'Matching' sheet.
- You will have the option of retaining the temp sheets for analysis when the program draws to a close.

It is a good program which will respond friendly to your future wishes. You can easily change the names of the sheets, columns, the number of numbers in each sequence, and the number of Numbers sheets as already mentioned. It is easy to create auxiliary lists like unique values. Since the system of splitting the sequences is already present it could be applied to further split the results of each of the temp sheets.
The program isn't well commented. In case you ever have to let some one else lose on it you may point out that it doesn't require comments because it is assembled from very short and self-explanatory procedures which are easy to follow because they, as well as most of the variables used in them have descriptive names which are, in addition, explained where I felt it might be required.
Meanwhile, if there are things you would like to have modified at this time I shall be glad to help. I regret the false start I had on this project and thank you for your patience. It was quite a learning experience for me.
Regards,
Faustulus
EXX-130917-Match-Combinations.xlsb
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
aikimarkCommented:
@Faustulus

Did you find any duplicate values within a single worksheet?
0
Pedrov664Author Commented:
@aikimark,

There should not be any within the same worksheet.

@ Faustulus,

Once again you have saved the day. The script seems to work without a glitch and I like the fact that you keept track of elapsed time, it is a nice touch.

P.S. How do I get it to sort in ascending instead of descending order? Not sure where in the script I can put that and I do not want to ruin your masterpiece. If your work requires this type of sort then no worries, it is a simple task to sort using excels sort feature.
0
FaustulusCommented:
@aikimark
Due to the method I employ I can't tell where the duplicates are found.

@pedrov664
Thank you for the flowers. It isn't a masterpiece, as you have found so quickly :-)
The current sorting is (a) an unintended byproduct of the method employed for saving time, (b) not uniformly applied and therefore (c) not easy to fix.

I have an idea of how it might be done. Give me a little time on this and I'll get back to you.
0
FaustulusCommented:
Hi,
That wasn't such a big deal after all. Once you figured out how to do it most things aren't. But it does eat a full minute of extra time. Nothing much to worry about, considering where we came from.
Regards,
Faustulus
EXX-130918-Match-Combinations.xlsb
0
Pedrov664Author Commented:
Excellent work! Considering the amount of time and effort dedicated to this project I do believe you deserve more points, but as they say, "That is all she wrote."
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
VB Script

From novice to tech pro — start learning today.