Pedro
asked on
Auto Data Comparison
The goal here is to compare the data in Numbers_1 through Numbers_10 and find matching rows; these matching rows would then be placed in the Matching sheet.
Note: To save time, there is no need to check rows within the same sheet. Likewise, it would be redundant to check Numbers_1 against Numbers_2 and then, on reaching Numbers_2, to check it against Numbers_1 again, since the data will yield the same results.
Also, once a combination is found (for instance, 2, 6, 9, 11, 13) that matches on more than one sheet, that combination does not need to be duplicated on the Matching sheet since it is already there.
MatchingCombinationsFrom1-10.xlsb
ASKER
Faustulus,
Thank you for the time and effort put into this scripting.
Since I am not doing scripting myself, it may not help me much in understanding how it works. However, I noticed that the code module has been divided into many sub-procedures.
How do I handle this when I would like to run other scripts with this one? I think putting the other scripts below this one would be best. What do you think?
P.S. I will run the script when I can spare some computer time and let you know the results.
ASKER
Ok, I got a run-time error '13': Type mismatch, and it highlights the following:
Dest.Value = Application.Transpose(MatchArr)
It also gives the dreaded hourglass (waiting) cursor, which does not go away. Notice it made the "temp" worksheet and then the error occurred. Hope that helps you debug.
The error you get is caused by the size of the array: it seems that Transpose can't take more than 65535 rows. So I have written code to write the arrays to the sheets in batches.
The dreaded hourglass is turned on by my code and should be turned off by it, too. But if the code crashes before it reaches the end, you get stuck with the hourglass. Get rid of it by typing the following line in the VBE's Immediate window, followed by [Enter].
Application.Cursor = xlDefault
The best way to manage your code would be to have a separate module for each operation. Drag the entire module 'MatchMan' from the project I will upload here to your project. Create new modules for other tasks.
Note that the new version doesn't require the sheet 'Matching' to be present and deletes and re-creates it if it exists. If you wish to retain the data rename it before you start another run of the code.
I will upload the new version of my workbook later. Right now it is undergoing a full test. I will share the result with you when I have it.
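The batching idea mentioned above can be illustrated outside VBA. The following is a hypothetical Python sketch (not the workbook's actual code) of splitting a long result list into writes of at most 65,535 rows, the Transpose ceiling mentioned above:

```python
# Hypothetical sketch: split a long list of result rows into batches of at
# most 65,535 items so that no single worksheet write exceeds the limit.
BATCH_LIMIT = 65535

def batches(items, limit=BATCH_LIMIT):
    """Yield consecutive slices of at most `limit` items each."""
    for start in range(0, len(items), limit):
        yield items[start:start + limit]

rows = list(range(150000))          # stand-in for 150,000 matched rows
chunks = list(batches(rows))        # three writes: 65535 + 65535 + 18930
```

Each chunk would then be written to the sheet with a single range assignment, keeping every individual write below the limit.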
ASKER
Ok, thanks. Any idea how long it may be?
I gave it until morning. When I interrupted the run after 18 hours, some 34,000 items had been checked and 6,060 duplicates identified. I then managed to let the search continue, and about 3 minutes later another 460 items had been checked and another 53 duplicates found. Unfortunately, I didn't time the continuation exactly; it might have been considerably less than 3 minutes. Anyway, I couldn't continue the program another time. But based on the estimate of 3 minutes, the program should have found the 6,060 items in 343 minutes = 5:43 hours, nowhere near the 18 hours it took. The computer must have been sleeping as soundly as I did, working a few minutes whenever I woke it up from time to time.
I am loath to add a progress indicator because that will consume extra time. However, I imagine that it will have to be done just to keep the CPU awake. I will re-design the search method entirely.
Unfortunately, I am about to embark on some serious travelling and don't know how much time I will be able to devote to this next week, probably none today and tomorrow. Anyway, the program seems to be working as it is, and if the 3 minutes were an accurate guess it should be able to complete the search within 1,000 hours, if you can keep the computer awake. Actually, the search should speed up significantly toward the end because the search area is reduced with each item checked. Perhaps you can employ it on smaller lots, like comparing just 2 sheets.
Please let me know if you have any idea of how to speed up the search based on your knowledge of the numbers. I may want to consider pre-sorting them, but that would lead me to a completely different way of searching. Please tell me anything you know about the sequence of the numbers. You seem to know that there are no duplicates within any one sheet. Can you confirm that the numbers are, in fact, sorted within each sheet?
I am also interested in the number of duplicates you might potentially find. Could it conceivably be more than a million? Also, if the search of real data would return much fewer than the 20%-odd I found so far that would be good to know, too.
EXX-130912-Match-Combinations.xlsb
ASKER
Due to the vast amount of data I am unable to know how many, if any, are duplicates. That is one reason why I believe a script would be best. I thought of using Excel functions, but these proved too cumbersome for this type of data set.
Thank you for the considerable time and effort you've put into this. I will reduce the amount of data to process and then perhaps someone else knows how to speed things up.
@Pedrov664
1a. It looks like these cells have values < 100. Is that correct?
1b. If the value limits differ by column, what are each column's value limits?
2. It looks like these lists/sequences are already sorted. Is this correct?
ASKER
1. Yes, I believe the cell value limit is 60 max, with no zeros.
2. The data is pre-sorted
Is that 60 for all columns, or just the right-most columns?
ASKER
After sorting, the right-most columns contain the highest numbers and the left-most the lowest. As such, numbers appear in ascending order top to bottom and left to right.
I'm using Excel 2003 and cannot see all the values. I can't see the highest values in the left-most columns.
ASKER
What is the most columns or rows that you can handle?
Is there a way I can format it that would be best for you?
We are only working with 5 columns. I needed to know the maximum numeric values in these columns.
I'm limited to 64k rows per worksheet.
ASKER
Yes, only five columns, and the range is from 1 to 60; thus minimum 1, maximum 60 for this data set.
64,000 rows per worksheet should be fine. If that is all you can see then you can work with that. It should be sufficient data for the script.
What is supposed to happen with the XX DELETE XX rows?
1. Place the code below into your workbook's module.
2. On all of your Numbers_# worksheets, make sure the first row of data is row 3.
3a. In cell G3 of all the Numbers_# worksheets, add the following formula:
=B3*100^4+C3*100^3+D3*100^2+E3*100+F3
3b. Fill the formula down to the last row of your data.
4. Run the CompareAllSheets() routine.
Note: On my 800MHz laptop running Excel 2003, this ran for about 15 seconds and produced 135,000+ unique values for the 64k-row limits in my copy of the xlsb workbook you posted. It might be possible to do some performance tweaks on this, but I think 15 seconds is a tolerable run time, considering what you are trying to do.
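For anyone curious what that formula does: it packs the five numbers of a row (each under 100) into a single integer, so whole rows can be compared as one value. A Python sketch of the same encode/decode round trip (the decode mirrors the Format/Mid$ step in the VBA below):

```python
# Pack five numbers (each < 100) into one integer, exactly like the
# worksheet formula =B3*100^4+C3*100^3+D3*100^2+E3*100+F3.
def encode(nums):
    key = 0
    for n in nums:
        key = key * 100 + n
    return key

# Unpack again: pad to ten digits and read two digits at a time,
# mirroring Format(value, "0000000000") and Mid$ in the VBA.
def decode(key):
    s = str(key).zfill(10)
    return [int(s[i:i + 2]) for i in range(0, 10, 2)]
```

For example, encode([2, 6, 9, 11, 13]) gives 206091113, and decode turns it back into the original five numbers.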
Option Explicit

Public Sub CompareAllSheets()
    Dim lngLoop1 As Long, lngLoop2 As Long      'for worksheet iteration
    Dim lngFrom As Long, lngTo As Long          'for the copy operation
    Dim vAllMatches() As Variant
    Dim vTheseMatches() As Variant
    Dim vDifferences() As Variant
    Dim vMerge() As Variant
    Dim wksTgt As Worksheet, rngTgt As Range
    Dim strNum As String

    ReDim vAllMatches(0 To 0)
    For lngLoop1 = 1 To 9
        For lngLoop2 = lngLoop1 + 1 To 10
            vTheseMatches = CompareSheet(Worksheets("Numbers_" & lngLoop1), Worksheets("Numbers_" & lngLoop2))
            If UBound(vAllMatches) = 0 Then
                vAllMatches = vTheseMatches
            ElseIf UBound(vTheseMatches) > 0 Then
                vDifferences = GetDifferences(vAllMatches, vTheseMatches)
                If UBound(vDifferences) > 0 Then
                    ReDim vMerge(1 To UBound(vAllMatches) + UBound(vDifferences))
                    MergeValues vAllMatches, vDifferences, vMerge
                    vAllMatches = vMerge
                End If
            End If
        Next
    Next

    Set wksTgt = Worksheets("Matching")
    Set rngTgt = wksTgt.Range("A1")
    Application.ScreenUpdating = False
    For lngLoop1 = 1 To UBound(vAllMatches)
        'unpack the ten-digit key back into five two-digit numbers
        strNum = Format(vAllMatches(lngLoop1), "0000000000")
        wksTgt.Range(rngTgt, rngTgt.Offset(0, 4)).Value = _
            Array(Mid$(strNum, 1, 2), Mid$(strNum, 3, 2), Mid$(strNum, 5, 2), Mid$(strNum, 7, 2), Mid$(strNum, 9, 2))
        Set rngTgt = rngTgt.Offset(1)
    Next
    Application.ScreenUpdating = True
End Sub

Public Function CompareSheet(parmWks1 As Worksheet, parmWks2 As Worksheet)
    Dim rng1 As Range
    Dim rng2 As Range
    Dim vValues1() As Variant
    Dim vValues2() As Variant
    Dim vMatches() As Variant
    Dim boolListExhausted As Boolean
    Dim lngPosn1 As Long, lngPosn2 As Long, lngMatchCount As Long

    Set rng1 = parmWks1.Range(parmWks1.Range("G3"), parmWks1.Range("G3").End(xlDown))
    Set rng2 = parmWks2.Range(parmWks2.Range("G3"), parmWks2.Range("G3").End(xlDown))
    vValues1() = rng1.Cells.Value
    vValues2() = rng2.Cells.Value
    lngPosn1 = LBound(vValues1)
    lngPosn2 = LBound(vValues2)
    If rng1.Rows.Count > rng2.Rows.Count Then
        ReDim vMatches(1 To UBound(vValues1))
    Else
        ReDim vMatches(1 To UBound(vValues2))
    End If
    Do
        Select Case True
            Case IsError(vValues1(lngPosn1, 1))     'skip error cells in list 1
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(vValues1) Then boolListExhausted = True
            Case IsError(vValues2(lngPosn2, 1))     'skip error cells in list 2
                lngPosn2 = lngPosn2 + 1
                If lngPosn2 > UBound(vValues2) Then boolListExhausted = True
            Case vValues1(lngPosn1, 1) < vValues2(lngPosn2, 1)
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(vValues1) Then boolListExhausted = True
            Case vValues1(lngPosn1, 1) > vValues2(lngPosn2, 1)
                lngPosn2 = lngPosn2 + 1
                If lngPosn2 > UBound(vValues2) Then boolListExhausted = True
            Case Else                               'equal: record the match
                lngMatchCount = lngMatchCount + 1
                vMatches(lngMatchCount) = vValues1(lngPosn1, 1)
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(vValues1) Then boolListExhausted = True
                'DoEvents
        End Select
    Loop Until boolListExhausted
    'Debug.Print lngMatchCount
    If lngMatchCount = 0 Then
        ReDim vMatches(0 To 0)
    Else
        ReDim Preserve vMatches(1 To lngMatchCount)
    End If
    CompareSheet = vMatches()
End Function

Public Function GetDifferences(parmAll, parmThis)
    'output parmThis() values that are not in parmAll()
    Dim boolListExhausted As Boolean
    Dim vDifferences() As Variant
    Dim lngPosn1 As Long, lngPosn2 As Long, lngDiffCount As Long

    ReDim vDifferences(1 To UBound(parmThis))
    lngPosn1 = 1
    lngPosn2 = 1
    Do
        Select Case True
            Case parmAll(lngPosn1) = parmThis(lngPosn2)
                lngPosn1 = lngPosn1 + 1
                lngPosn2 = lngPosn2 + 1
                If (lngPosn1 > UBound(parmAll)) Or (lngPosn2 > UBound(parmThis)) Then
                    boolListExhausted = True
                End If
            Case parmAll(lngPosn1) < parmThis(lngPosn2)
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(parmAll) Then boolListExhausted = True
            Case parmAll(lngPosn1) > parmThis(lngPosn2)
                lngDiffCount = lngDiffCount + 1
                vDifferences(lngDiffCount) = parmThis(lngPosn2)
                lngPosn2 = lngPosn2 + 1
                If lngPosn2 > UBound(parmThis) Then boolListExhausted = True
            Case Else
                lngPosn1 = lngPosn1 + 1
                'DoEvents
        End Select
    Loop Until boolListExhausted
    'Debug.Print lngDiffCount
    If lngDiffCount = 0 Then
        ReDim vDifferences(0 To 0)
    Else
        ReDim Preserve vDifferences(1 To lngDiffCount)
    End If
    GetDifferences = vDifferences()
End Function

Public Sub MergeValues(parmAll, parmDiffs, parmMerge)
    Dim lngPosn1 As Long, lngPosn2 As Long, lngOut As Long
    Dim boolListExhausted As Boolean

    lngPosn1 = 1
    lngPosn2 = 1
    lngOut = 1
    Do
        Select Case True
            Case parmAll(lngPosn1) < parmDiffs(lngPosn2)
                parmMerge(lngOut) = parmAll(lngPosn1)
                lngOut = lngOut + 1
                lngPosn1 = lngPosn1 + 1
                If lngPosn1 > UBound(parmAll) Then
                    'drain the remaining differences
                    For lngPosn2 = lngPosn2 To UBound(parmDiffs)
                        parmMerge(lngOut) = parmDiffs(lngPosn2)
                        lngOut = lngOut + 1
                    Next
                    boolListExhausted = True
                End If
            Case parmAll(lngPosn1) > parmDiffs(lngPosn2)
                parmMerge(lngOut) = parmDiffs(lngPosn2)
                lngOut = lngOut + 1
                lngPosn2 = lngPosn2 + 1
                If lngPosn2 > UBound(parmDiffs) Then
                    'drain the remaining matches
                    For lngPosn1 = lngPosn1 To UBound(parmAll)
                        parmMerge(lngOut) = parmAll(lngPosn1)
                        lngOut = lngOut + 1
                    Next
                    boolListExhausted = True
                End If
            Case Else
                Stop    'There should never be identical items
        End Select
    Loop Until boolListExhausted
End Sub
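For readers who don't follow VBA: the three helper routines above are classic two-pointer operations on ascending sorted lists. A compact Python sketch of the same logic (edge handling simplified, so this is an illustration rather than a translation):

```python
# Intersection of two sorted lists (what CompareSheet computes per sheet pair).
def intersect(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i]); i += 1; j += 1
    return out

# Values of new_vals not already in all_vals (what GetDifferences computes).
def difference(all_vals, new_vals):
    out, i, j = [], 0, 0
    while i < len(all_vals) and j < len(new_vals):
        if all_vals[i] == new_vals[j]:
            i += 1; j += 1
        elif all_vals[i] < new_vals[j]:
            i += 1
        else:
            out.append(new_vals[j]); j += 1
    return out + new_vals[j:]

# Merge two disjoint sorted lists into one (what MergeValues does).
def merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]
```

Because every list is kept sorted, each pass is a single linear walk rather than a nested search, which is what keeps the run time in seconds.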
@Pedrov664,
This doesn't look like it will end up being fair to me.
A) I'm working on a solution for producing a list of duplicates found in 1,650,000 rows of data.
B) You are encouraging a solution that provides unique data extracted from 640,000 rows.
Input and output are both different. Should I continue my effort to create an acceptable solution to the original problem? Is your target a unique list or a list of duplicates?
@Faustulus
I have no idea if my solution will run as fast in Pedrov's production environment as it did in my limited Excel 2003 environment. I wasn't even going to post a solution in this thread until I read that some of the run times were measured in hours.
@aikimark
As you have discovered yourself, the problem here lies in the number of data rows exceeding 65,535. Obviously, my approach was wrong, but yours wasn't even an approach. Amid the bad news both of us are plastering him with, pedrov664 is trying to reduce his requirements. That is good, if it is done in a fair and reasonable way. Otherwise it will serve no one.
Fairness requires that concessions are made in such a way that they can benefit all possible approaches to the problem. If the source data have 150,000 rows per sheet, it wouldn't be fair to say that they could also be limited to 64,000 rows. They either have one size or the other, unless they have a third.
Unreasonable is the plan that data which originally have 150K rows could somehow be manipulated into sheets of 64K. Such manipulation must be part of the solution, not part of a concession.
In short, if pedrov664 has to manually manipulate his data for 2 hours so that they can then be processed in 13 seconds, the overall result isn't reasonable.
The manual steps I asked him to do can be done programmatically. I thought it more expedient to ask him to add and fill down these formulas just to test the speed and correctness of my code.
It takes about one minute to create the first formula and then to copy/paste it to the other worksheets and to do the fill down on the worksheets manually, not two hours.
Even multiplying the 15 seconds by 2.5 should yield an expected 38 second run time, which isn't terribly bad.
@pedrov664
You haven't replied to my question of whether or not the number of duplicates might conceivably exceed a million rows. Perhaps I should have explained this better.
1. Your sample data have less than 1,800,000 rows. Therefore a million duplicates are only possible if the data you wish to examine might be larger - at least 2 million.
2. The code I have written is capable of handling more than 10 sheets. Perhaps you might conceivably use it to process 20 sheets with more than 3 million data resulting in correspondingly more duplicates.
It would be nice if you could guarantee that there wouldn't be more than a million duplicates because at that number you would need to write in more than one column.
Your comment to the effect that you couldn't guess at the percentage of duplicates was very enlightening and helpful. I deduced that the data I have are, in fact, real data as opposed to test data which have been enriched artificially to contain more duplicates. The point of the question is that writing duplicates and searching them for repetitive duplicates takes a lot of time. As you can probably guess, at this time I am very much concerning myself with anything that takes a lot of time in this project. Your answer provides no relief.
ASKER
Faustulus,
First of all, I would like to thank you for your dedication and unfailing devotion to finding a solution. At first I thought that if someone could find a solution that would work for less data, then it could be extrapolated to work for more, which explains my remark to aikimark.
It appears from your comments that I was incorrect. Therefore, I sincerely apologize. I did not mean to create more work than necessary.
The data to be examined at any one time would be explicitly limited to ten sheets, which I believe is what the current Excel file contains; again, I apologize if I did not make that explicitly clear.
I do not know if this helps; however, if a duplicate is found, there is no need to search for that duplicate again. For instance, if you find that 1 2 3 4 5 is in two sheets, there is no need to find it in other sheets; the script can then move to the next until it finds it in another sheet. I should point out that duplicates are explicitly not contained within sheets; in other words, sheet 1 will not duplicate 1 2 3 4 5 within itself.
In the end, the sheet containing the duplicates should only hold those that were found in more than one sheet. As such, it is also not necessary to check sheet 1 with sheet 2 and then sheet 2 with sheet 1, since it would yield the same results and would simply be a waste of time.
I hope I have helped to clarify all of the necessary points. If there is anything else that you feel is necessary, please feel free to let me know.
Hi pedrov664,
Thanks for clarifying. The loose end you left behind is whether you really need a list of duplicates or the inverse, a list of unique values. I am working on the former, but the other is never far away in such circumstances; in fact, it is easier.
After a very late night (aided by jet lag) I am now in the final stages of trouble-shooting the new approach. The way it looks the program will take less than 10 minutes to complete the 1.6 million entries in your current list. I will explain the way it functions when I can post it. Meanwhile please note that the program can handle more than 10 files. 15 should be no problem. There are some items (almost 4000 of them) marked XX DELETE XX in the first column. I am programming to ignore these, but anything that's not a number needs hard-programmed attention. There is also a surprising variation of quantities. Number sequences starting like 2 6 have nearly 40,000 entries, many of them duplicates. Sequences like 50 28 may have a single occurrence only. There are quite a few of these. Of the 3600 possible variations about 1350 actually exist. My suggestion that you might be able to handle 15 number files rests on the target of not exceeding 65535 sequences starting with the same two numbers. So, if 10 lists contain 40,000 sequences starting with 2 6 then 15 lists might come near the maximum. On the other hand, if your data eventually turn out to be more evenly spread over all available variations you could handle much more.
Note that my new approach doesn't search. Not having to search for repetitive duplicates therefore doesn't offer an advantage.
I hope to finish this later in the day - if all goes well.
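The bucketing idea described above (grouping sequences by their first two numbers so that no single group exceeds 65,535 rows) can be sketched as follows. This is a hypothetical Python illustration with made-up sample rows, not the actual workbook code:

```python
# Group each five-number sequence by its first two numbers. Since the data
# are sorted, sequences sharing a leading pair form one manageable bucket.
from collections import defaultdict

def bucket_by_prefix(sequences):
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[(seq[0], seq[1])].append(seq)
    return buckets

sample = [(2, 6, 9, 11, 13), (2, 6, 10, 12, 14), (50, 28, 30, 40, 55)]
groups = bucket_by_prefix(sample)   # two buckets: (2, 6) and (50, 28)
```

Each bucket can then be examined for duplicates on its own, which is why the per-prefix count (40,000 sequences starting with 2 6, for example) is what matters rather than the total row count.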
ASKER CERTIFIED SOLUTION
@Faustulus
Did you find any duplicate values within a single worksheet?
ASKER
@aikimark,
There should not be any within the same worksheet.
@ Faustulus,
Once again you have saved the day. The script seems to work without a glitch, and I like the fact that you keep track of elapsed time; it is a nice touch.
P.S. How do I get it to sort in ascending instead of descending order? I'm not sure where in the script I can put that, and I do not want to ruin your masterpiece. If your work requires this type of sort then no worries; it is a simple task to sort using Excel's sort feature.
@aikimark
Due to the method I employ I can't tell where the duplicates are found.
@pedrov664
Thank you for the flowers. It isn't a masterpiece, as you have found so quickly :-)
The current sorting is (a) an unintended byproduct of the method employed for saving time, (b) not uniformly applied and therefore (c) not easy to fix.
I have an idea of how it might be done. Give me a little time on this and I'll get back to you.
Hi,
That wasn't such a big deal after all. Once you've figured out how to do it, most things aren't. But it does eat a full minute of extra time. Nothing much to worry about, considering where we came from.
Regards,
Faustulus
EXX-130918-Match-Combinations.xlsb
ASKER
Excellent work! Considering the amount of time and effort dedicated to this project, I do believe you deserve more points, but as they say, "That is all she wrote."
Please paste the following code to a standard code module in your project. Then run the procedure 'WriteMatching'.
The enums and constants at the top of the code may be modified. Let me know if you require assistance with that. Note that I tested the code only on a fraction of the data you have. I can't tell you how long it will take to sort through all your sequences but presume that there will be time for a cup of coffee.
If you are interested, I shall be glad to tell you how the code works. One item of information may be useful to you, however: all your sequences are converted into strings. Meaning, the numbers 2, 6, 9, 11, 13 end up as "2 6 9 11 13" in one cell. These creations are used and deleted. I am wondering if they are useful to you in some other way. They could be left behind in Matching!A:A.
The other thing you need to know is that the code doesn't clear the 'Matching' worksheet. If you run the program another time the second result will be appended to the first.
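The string-key idea lends itself to a straightforward duplicate count. This Python sketch (with made-up sample rows, not the actual data) shows how keys like "2 6 9 11 13" identify rows that occur on more than one sheet:

```python
# Turn each row into a single string key, then count identical keys across
# all sheets; any key seen more than once is a cross-sheet duplicate.
from collections import Counter

def row_key(nums):
    return " ".join(str(n) for n in nums)

sheets = [
    [(2, 6, 9, 11, 13), (3, 7, 10, 20, 30)],   # made-up sample rows
    [(2, 6, 9, 11, 13), (5, 8, 12, 25, 40)],
]
counts = Counter(row_key(row) for sheet in sheets for row in sheet)
duplicates = [key for key, n in counts.items() if n > 1]
```

Because each whole row collapses to one comparable value, the duplicate test becomes a simple equality check rather than a five-column comparison.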