Solved

Excel to find identical words

Posted on 2013-10-22
31
251 Views
Last Modified: 2013-10-30
I have a excel sheet with a column with such data

www.site.com\somename\this-is-the-page
www.site.com\somename\this-is-the-page-thats-is-identical
www.site.com\somename\this-is-the-page-could-be-partial
www.site.com\somename\some-page-like-this
www.site.com\somename\what-is-this-page-about
www.site.com\somename\this-is-about-what-is-this-page-about

First 3 URL's have 4 identical words next to each other
4th = 0
5th and 6th has 5 words identical

Can i have help with a excel macro to find identical url's
0
Comment
Question by:mtthompsons
  • 17
  • 14
31 Comments
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39594024
1) Please further define "identical". In other words if there is even just one word the same, is it then "identical"?

2) Do the words that may make it identical only come from the portion of the url after the last backslash?

3) In your example, all the identical urls follow each other. Is that a requirement? In other words assuming that the data was as follows, would 1 and 3 still be identical?

www.site.com\somename\this-is-the-page
http://www.experts-exchange.com/Software/Office_Productivity
www.site.com\somename\this-is-the-page-thats-is-identical

4) Do you want to know how many identical words there are?

5) How do want to identify the identical urls?
0
 

Author Comment

by:mtthompsons
ID: 39594524
1. Sorry it should be 3+ words that are adjacent to each other

2. Yes, we need to check only after last slash between the hyphens -

3. Yes 1 and 3 would be identical in your example

4. As of now we will target 3 and if i have a easy way to change 3 to 4 or 2 would be great so i can switch accordingly to get my results

5. If i can get the results in column "B" as Identical so i can sort column "A" and "B" and have the results next to each other
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39594568
Thanks. Another question. In your example all the "words" are separated by dashes. Is that how it is in real life, or are they separated by spaces, and/or separated by other characters, and/or might there be punctuation marks?
0
 

Author Comment

by:mtthompsons
ID: 39594698
99% would be - dashes
Some cases would have some other special characters in such cases if code cannot handle rather can skip those url's
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39594996
<deleted>
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39595209
Here's a workbook with a partial solution. The code is shown below. Run the FindIdentical macro. When you do you'll see what I find are the numbers of matching words in each URL. I know that that's not your desired result but I'm not sure what you want me to put in column B instead. Should all the value of all the cells from B1 to B6 with the exception of B4 be "Identical"? Do you want to have the code do the sorting? If so should it be sorted on just column B or on columns A and B? Also please look at line 79 where you would add whatever characters should be treated as separators in addition to the dashes.

Option Explicit

Sub FindIdentical()
Dim lngLastRow As Long
Dim lngRow1 As Long
Dim lngRow2 As Long
Dim strWords1() As String
Dim strWords2() As String
Dim intPos As Integer
Dim intMinMatch As Integer
Dim intMatching As Integer
Dim lngIndex1 As Long
Dim lngIndex2 As Long
Dim colWords1 As Collection
Dim colWords2 As Collection

Range("B1").EntireColumn.ClearContents

intMinMatch = InputBox("Please enter the number of words that have to match in order to make an URL identical", "How many words?", 3)
lngLastRow = Range("A65536").End(xlUp).Row

For lngRow1 = 1 To lngLastRow
    ' Break up the words in the last part of the URL into a collection
    ' of unique words
    intPos = InStrRev(Cells(lngRow1, 1).Value, "\")
    strWords1 = Split(ConvertSeparators(Mid(Cells(lngRow1, 1).Value, intPos + 1)), "-")
    Set colWords1 = New Collection
    On Error Resume Next
    For lngIndex = 0 To UBound(strWords1)
        colWords1.Add strWords1(lngIndex), strWords1(lngIndex)
        If Err.Number = 457 Then
            ' The word is already in the collection
            Err.Clear
        End If
    Next
    On Error GoTo 0
    ' Compare the value in lngRow1 with all the rows below it
    For lngRow2 = lngRow1 + 1 To lngLastRow
        intMatching = 0
        intPos = InStrRev(Cells(lngRow2, 1).Value, "\")
        strWords2 = Split(ConvertSeparators(Mid(Cells(lngRow2, 1).Value, intPos + 1)), "-")
        Set colWords2 = New Collection
        On Error Resume Next
        For lngIndex = 0 To UBound(strWords2)
            colWords2.Add strWords2(lngIndex), strWords2(lngIndex)
            If Err.Number = 457 Then
                Err.Clear
            End If
        Next
        ' Compare the words. (Colletions start at index 1)
        For lngIndex1 = 1 To colWords1.Count
            For lngIndex2 = 1 To colWords2.Count
                If lngIndex1 <= colWords2.Count Then
                    If colWords1(lngIndex1) = colWords2(lngIndex2) Then
                        intMatching = intMatching + 1
                        Exit For
                    End If
                End If
            Next
        Next
        If intMatching >= intMinMatch Then
            If Cells(lngRow1, 2).Value = "" Then
                Cells(lngRow1, 2).Value = "This URL has " & intMatching & " words that match words in " & Cells(lngRow2, 1).Value
            Else
                Cells(lngRow1, 2).Value = Cells(lngRow1, 2).Value & vbCrLf & " AND " & intMatching & " words that match words in " & Cells(lngRow2, 1).Value
            End If
        End If
    Next
Next
End Sub

Private Function ConvertSeparators(strWords As String) As String
' This function changes all listed separators to dashes

Dim varSeps As Variant
Dim lngIndex As Long
' Add any characters in additon to the dash that can be separators. I added
' the # and [ as examples of how you'd add them
varSeps = Array("#", "[")

For lngIndex = 0 To UBound(varSeps)
    strWords = Replace(strWords, varSeps(lngIndex), "-")
Next

ConvertSeparators = strWords
End Function

Open in new window

Q-28274528.xlsm
0
 

Author Comment

by:mtthompsons
ID: 39596162
Thanks
As in the attachment i get an error
The column "B" looks fine but will test it with actual data once we get it running and confirm back
Error.JPG
0
 

Author Comment

by:mtthompsons
ID: 39596165
I get this error with the sample data as well that you attached

on line
    For lngIndex = 0 To UBound(strWords1)

I will have 50K rows of data just if that needs some change in the code
0
 

Author Comment

by:mtthompsons
ID: 39596173
One more part is the actual data slashes would be as

 http://www.Sample.com/path/This-is-a-sample-page

In the example i gave the slashes as \

if that's something embedded in the code
Sorry for that
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39596260
I've attached an updated workbook.
Q-28274528.xlsm
0
 

Author Comment

by:mtthompsons
ID: 39599970
Sorry for the delay i placed 40K URL's which took close to a day to complete and did not want to kill the results by breaking the macro from running
Works great but have an issue with the report

Column "B" can we have numbers if 2 are duplicate to pair those 2 lets place number (1)

Then if next one has a duplicate with 4 URL's then lets place (2) 4 times

So when i sort all 1,2,3,4's are next to each other and i can easy check the column "A" URL's
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39600566
Replace your FindIdentical sub with the code below and tell me if it is faster. It still shows the same things in column B because I'm not sure what you want there.

You say
Column "B" can we have numbers if 2 are duplicate to pair those 2 lets place number (1)

Then if next one has a duplicate with 4 URL's then lets place (2) 4 times
Should I use 3 for the next duplicate, and then 4 and then 5, etc?

Also let's say rows 1, 100, 105 and 106 are duplicates. I assume that those rows all get a 1. But what if I find that row 2 is a duplicate of rows, 6, 105 and 120. I assume then that rows 2, 6
and 120 get a 2, but what about row 105? Does it stay a 1 or do I make it a 2?

Sub FindIdentical()
Dim lngLastRow As Long
Dim lngRow1 As Long
Dim lngRow2 As Long
Dim strWords1() As String
Dim strWords2() As String
Dim intPos As Integer
Dim intMinMatch As Integer
Dim intMatching As Integer
Dim lngIndex1 As Long
Dim lngIndex2 As Long
Dim colWords1 As Collection
Dim colWords2 As Collection
Dim varData As Variant
Dim lngMarker As Long

Range("B1").EntireColumn.ClearContents

intMinMatch = InputBox("Please enter the number of words that have to match in order to make an URL identical", "How many words?", 3)
lngLastRow = Range("A65536").End(xlUp).Row
varData = Range("A1:B" & lngLastRow).Value


For lngRow1 = 1 To lngLastRow
    ' Break up the words in the last part of the URL into a collection
    ' of unique words
    intPos = InStrRev(varData(lngRow1, 1), "/")
    strWords1 = Split(ConvertSeparators(Mid(varData(lngRow1, 1), intPos + 1)), "-")
    Set colWords1 = New Collection
    On Error Resume Next
    For lngIndex1 = 0 To UBound(strWords1)
    Debug.Print UBound(strWords1)
        colWords1.Add strWords1(lngIndex1), strWords1(lngIndex1)
        If Err.Number = 457 Then
            ' The word is already in the collection
            Err.Clear
        End If
    Next
    On Error GoTo 0
    ' Compare the value in lngRow1 with all the rows below it
    For lngRow2 = lngRow1 + 1 To lngLastRow
        intMatching = 0
        intPos = InStrRev(varData(lngRow2, 1), "/")
        strWords2 = Split(ConvertSeparators(Mid(varData(lngRow2, 1), intPos + 1)), "-")
        Set colWords2 = New Collection
        On Error Resume Next
        For lngIndex2 = 0 To UBound(strWords2)
            colWords2.Add strWords2(lngIndex2), strWords2(lngIndex2)
            If Err.Number = 457 Then
                Err.Clear
            End If
        Next
        ' Compare the words. (Colletions start at index 1)
        For lngIndex1 = 1 To colWords1.Count
            For lngIndex2 = 1 To colWords2.Count
                If lngIndex1 <= colWords2.Count Then
                    If colWords1(lngIndex1) = colWords2(lngIndex2) Then
                        intMatching = intMatching + 1
                        Exit For
                    End If
                End If
            Next
        Next
        If intMatching >= intMinMatch Then
            If varData(lngRow1, 2) = "" Then
                varData(lngRow1, 2) = "This URL has " & intMatching & " words that match words in " & varData(lngRow2, 1)
            Else
                varData(lngRow1, 2) = varData(lngRow1, 2) & vbCrLf & " AND " & intMatching & " words that match words in " & varData(lngRow2, 1)
            End If
        End If
    Next
Next

Range("A1:B" & lngLastRow).Value = varData
End Sub

Open in new window

0
 

Author Comment

by:mtthompsons
ID: 39600581
Yes you are right i just want it as you mentioned and if its multiple times then we can have 1,6,8 and so on

So first i can work on those duplicate URL's and then have the sort work perfectly
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39600728
I'm sorry but I don't know what that means and you didn't answer both my questions so let me ask in a different way

1) Does the first set of duplicates get a 1 and the second set a 2 and the third set a 3 and the fourth set a 4, etc?

2) If row 1 matches row 100, both get a 1. What happens if later row 30 also matches row 100? Does row 100 stay a 1 or does it get anther number?
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39600741
I also have another idea to make it faster but before I try it can you tell me what the largest number of words there might be in one URL?
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 

Author Comment

by:mtthompsons
ID: 39600766
Max 15 words and not more
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39600784
Please also answer the questions in post ID ID: 39600728.
0
 

Author Comment

by:mtthompsons
ID: 39600819
Sorry missed that part

1) Does the first set of duplicates get a 1 and the second set a 2 and the third set a 3 and the fourth set a 4, etc?

Yes say row 1,5,10 are same then they get 1
Say 2,3,4 are same they get 2
Say 3,7,8,9 are same then it gets 3

As 3 is mentioned in 2 as well we get 2,3 in column "B"

2) If row 1 matches row 100, both get a 1. What happens if later row 30 also matches row 100? Does row 100 stay a 1 or does it get anther number?

Yes so we have 2 numbers as mentioned in point 1

Please let me know if you have any questions or any other easier ways to get this done

Goal : Identify URL's that are very closely similar to others
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39600827
Thanks.
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39601077
What I have now is still too slow. I will continue to work on it and will keep you posted on any progress I make.
0
 

Author Comment

by:mtthompsons
ID: 39601186
Thanks
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39602086
I'm sorry but this is the best I can do (at least for now but I'll continue to think about it) but it's still very slow.

What this code does is
1) Transfers all the data from the sheet into a variant array
2) Finds all the unique words and puts then in columns 3 to 17 of the array
3) Compares the words in each row with the words in all the rows below it and puts the match count in column 2 of the array
4) Transfers columns 1 and 2 of the array back to the sheet.

Steps 1 and 4 are very fast and step 2 is reasonably fast, but step 3 is very slow since with 40,000 rows an average of 20,000 comparisons are done for each row and that's 800,000,000 comparisons!

Sub FindIdentical()
Dim lngLastRow As Long
Dim lngRow1 As Long
Dim lngRow2 As Long
Dim strWords1() As String
Dim intPos As Integer
Dim intMinMatch As Integer
Dim intMatching As Integer
Dim lngIndex1 As Long
Dim lngIndex2 As Long
Dim colWords1 As Collection
Dim colWords2 As Collection
Dim varData As Variant
Dim lngMarker As Long

intMinMatch = InputBox("Please enter the number of words that have to match in order to make an URL identical", "How many words?", 3)
lngLastRow = Range("A65536").End(xlUp).Row
' Transfer all the data to an array. The data is only
' in columns A and B, but Q is used to leave room
' for 15 additional columns for the words
varData = Range("A1:Q" & lngLastRow).Value
Debug.Print Now

For lngRow1 = 1 To lngLastRow
    ' Split up all the words
    intPos = InStrRev(varData(lngRow1, 1), "/")
    strWords1 = Split(ConvertSeparators(Mid(varData(lngRow1, 1), intPos + 1)), "-")
    Set colWords1 = New Collection
    On Error Resume Next
    For lngIndex1 = 0 To UBound(strWords1)
        colWords1.Add strWords1(lngIndex1), strWords1(lngIndex1)
        If Err.Number = 457 Then
            ' The word is already in the collection
            Err.Clear
        End If
    Next
    
    intPos = 3
    For lngIndex1 = 1 To colWords1.Count
        varData(lngRow1, intPos) = colWords1(lngIndex1)
        intPos = intPos + 1
    Next
    On Error GoTo 0
Next

For lngRow1 = 1 To lngLastRow
    Debug.Print lngRow1 & " - " & Now
    ' Compare the words in lngRow1 with the words in all the rows below it
    For lngRow2 = lngRow1 + 1 To lngLastRow
        intMatching = 0
        For lngIndex1 = 3 To 17
            If varData(lngRow1, lngIndex1) = "" Then
                Exit For
            Else
                If varData(lngRow1, lngIndex1) = varData(lngRow2, 3) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 4) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 5) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 6) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 7) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 8) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 9) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 10) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 11) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 12) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 13) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 14) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 15) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 16) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 17) Then
                    intMatching = intMatching + 1
                End If
            End If
        Next
        
        If intMatching >= intMinMatch Then
            If varData(lngRow1, 2) = "" Then
                varData(lngRow1, 2) = lngRow1 & ", " & lngRow2
            Else
                varData(lngRow1, 2) = varData(lngRow1, 2) & ", " & lngRow2
            End If
        End If
    Next
Next

Range("A1:B" & lngLastRow).Value = varData

Debug.Print Now
End Sub

Open in new window

0
 

Author Comment

by:mtthompsons
ID: 39602779
Thanks but i have one logical issue

I got this as results for one
33, 20133, 25805
20133, 25805

25805 had no message in column "B" now the problem is i need to browse manually to these rows to find its duplicate

Can we color all cells that has a duplicate or mention the numbers on the other one as well

Else too much of manual work would be needed

Sorry to keep asking to improve
I can live with the time taken to execute
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39602823
EDIT:  See the next post first.

Here's a modification that I believe fixes the problem. Try it on a thousand or so rows subset of your data and tell me if it works.

Sub FindIdentical()
Dim lngLastRow As Long
Dim lngRow1 As Long
Dim lngRow2 As Long
Dim strWords1() As String
Dim intPos As Integer
Dim intMinMatch As Integer
Dim intMatching As Integer
Dim lngIndex1 As Long
Dim lngIndex2 As Long
Dim colWords1 As Collection
Dim colWords2 As Collection
Dim varData As Variant
Dim lngLastMatch As Long

intMinMatch = InputBox("Please enter the number of words that have to match in order to make an URL identical", "How many words?", 3)
lngLastRow = Range("A65536").End(xlUp).Row
Columns("B:B").ClearContents

' Transfer all the data to an array. The data is only
' in columns A and B, but Q is used to leave room
' for 15 additional columns for the words
varData = Range("A1:Q" & lngLastRow).Value
'Debug.Print Now
For lngRow1 = 1 To lngLastRow
    ' Split up all the words
    intPos = InStrRev(varData(lngRow1, 1), "/")
    strWords1 = Split(ConvertSeparators(Mid(varData(lngRow1, 1), intPos + 1)), "-")
    Set colWords1 = New Collection
    On Error Resume Next
    For lngIndex1 = 0 To UBound(strWords1)
        colWords1.Add strWords1(lngIndex1), strWords1(lngIndex1)
        If Err.Number = 457 Then
            ' The word is already in the collection
            Err.Clear
        End If
    Next
    
    intPos = 3
    For lngIndex1 = 1 To colWords1.Count
        varData(lngRow1, intPos) = colWords1(lngIndex1)
        intPos = intPos + 1
    Next
    On Error GoTo 0
Next

For lngRow1 = 1 To lngLastRow
    'Debug.Print lngRow1 & " - " & Now
    ' Compare the words in lngRow1 with the words in all the rows below it
    For lngRow2 = lngRow1 + 1 To lngLastRow
        intMatching = 0
        For lngIndex1 = 3 To 17
            If varData(lngRow1, lngIndex1) = "" Then
                Exit For
            Else
                If varData(lngRow1, lngIndex1) = varData(lngRow2, 3) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 4) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 5) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 6) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 7) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 8) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 9) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 10) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 11) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 12) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 13) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 14) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 15) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 16) Or _
                   varData(lngRow1, lngIndex1) = varData(lngRow2, 17) Then
                    intMatching = intMatching + 1
                End If
            End If
        Next
        
        If intMatching >= intMinMatch Then
            lngLastMatch = lngRow2
            If varData(lngRow1, 2) = "" Then
                varData(lngRow1, 2) = lngRow1 & ", " & lngRow2
            Else
                varData(lngRow1, 2) = varData(lngRow1, 2) & ", " & lngRow2
            End If
        End If
    Next
    varData(lngLastMatch, 2) = "'" & lngLastMatch
Next

Range("A1:B" & lngLastRow).Value = varData

'Debug.Print Now
End Sub

Open in new window

0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39603007
I made a change to the above that in a short test (letting it run through the first 50 rows of a 40,000 row sheet) seems to be about 20 percent faster. Unless there's still a problem I think this is the best I can do.

Sub FindIdentical()
Dim lngLastRow As Long
Dim lngRow1 As Long
Dim lngRow2 As Long
Dim strWords1() As String
Dim intPos As Integer
Dim intMinMatch As Integer
Dim intMatching As Integer
Dim lngIndex1 As Long
Dim lngIndex2 As Long
Dim colWords1 As Collection
Dim colWords2 As Collection
Dim varData As Variant
Dim lngMarker As Long

intMinMatch = InputBox("Please enter the number of words that have to match in order to make an URL identical", "How many words?", 3)
lngLastRow = Range("A65536").End(xlUp).Row
Columns("B:B").ClearContents

' Transfer all the data to an array. The data is only
' in columns A and B, but Q is used to leave room
' for 15 additional columns for the words
varData = Range("A1:Q" & lngLastRow).Value
'Debug.Print Now

For lngRow1 = 1 To lngLastRow
    ' Split up all the words
    intPos = InStrRev(varData(lngRow1, 1), "/")
    strWords1 = Split(ConvertSeparators(Mid(varData(lngRow1, 1), intPos + 1)), "-")
    Set colWords1 = New Collection
    On Error Resume Next
    For lngIndex1 = 0 To UBound(strWords1)
        colWords1.Add strWords1(lngIndex1), strWords1(lngIndex1)
        If Err.Number = 457 Then
            ' The word is already in the collection
            Err.Clear
        End If
    Next

    intPos = 3
    For lngIndex1 = 1 To colWords1.Count
        varData(lngRow1, intPos) = colWords1(lngIndex1)
        intPos = intPos + 1
    Next
    On Error GoTo 0
Next

For lngRow1 = 1 To lngLastRow
'If lngRow1 Mod 10 = 0 Then Debug.Print lngRow1 & " - " & Now
    ' Compare the words in lngRow1 with the words in all the rows below it
    For lngRow2 = lngRow1 + 1 To lngLastRow
        intMatching = 0
        For lngIndex1 = 3 To 17
            If varData(lngRow1, lngIndex1) = "" Then
                Exit For
            Else
                Select Case True
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 3)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 4)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 5)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 6)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 7)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 8)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 9)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 10)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 11)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 12)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 13)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 14)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 15)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 16)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 17)
                        intMatching = intMatching + 1
                End Select
            End If
        Next
        If intMatching >= intMinMatch Then
            If varData(lngRow1, 2) = "" Then
                varData(lngRow1, 2) = lngRow1 & ", " & lngRow2
            Else
                varData(lngRow1, 2) = varData(lngRow1, 2) & ", " & lngRow2
            End If
        End If
    Next
Next

Range("A1:B" & lngLastRow).Value = varData

'Debug.Print Now
End Sub

Open in new window

0
 

Author Comment

by:mtthompsons
ID: 39603529
But still dont get the numbers on the other duplicate

8665, 8668

8665 row has the above 2 numbers but 8668 is blank in column "B"
0
 
LVL 45

Accepted Solution

by:
Martin Liss earned 500 total points
ID: 39603746
I'm sorry, I had several versions floating around and I lost the change that corrected the problem in the one I posted. Again try this on a small sample.

Sub FindIdentical()
Dim lngLastRow As Long
Dim lngRow1 As Long
Dim lngRow2 As Long
Dim strWords1() As String
Dim intPos As Integer
Dim intMinMatch As Integer
Dim intMatching As Integer
Dim lngIndex1 As Long
Dim lngIndex2 As Long
Dim colWords1 As Collection
Dim colWords2 As Collection
Dim varData As Variant
Dim lngMarker As Long
Dim lngLastMatch As Long

intMinMatch = InputBox("Please enter the number of words that have to match in order to make an URL identical", "How many words?", 3)
lngLastRow = Range("A65536").End(xlUp).Row
Columns("B:B").ClearContents

' Transfer all the data to an array. The data is only
' in columns A and B, but Q is used to leave room
' for 15 additional columns for the words
varData = Range("A1:Q" & lngLastRow).Value
'Debug.Print Now

For lngRow1 = 1 To lngLastRow
    ' Split up all the words
    intPos = InStrRev(varData(lngRow1, 1), "/")
    strWords1 = Split(ConvertSeparators(Mid(varData(lngRow1, 1), intPos + 1)), "-")
    Set colWords1 = New Collection
    On Error Resume Next
    For lngIndex1 = 0 To UBound(strWords1)
        colWords1.Add strWords1(lngIndex1), strWords1(lngIndex1)
        If Err.Number = 457 Then
            ' The word is already in the collection
            Err.Clear
        End If
    Next

    intPos = 3
    For lngIndex1 = 1 To colWords1.Count
        varData(lngRow1, intPos) = colWords1(lngIndex1)
        intPos = intPos + 1
    Next
    On Error GoTo 0
Next

For lngRow1 = 1 To lngLastRow
'If lngRow1 Mod 10 = 0 Then Debug.Print lngRow1 & " - " & Now
    ' Compare the words in lngRow1 with the words in all the rows below it
    For lngRow2 = lngRow1 + 1 To lngLastRow
        intMatching = 0
        For lngIndex1 = 3 To 17
            If varData(lngRow1, lngIndex1) = "" Then
                Exit For
            Else
                Select Case True
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 3)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 4)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 5)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 6)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 7)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 8)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 9)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 10)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 11)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 12)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 13)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 14)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 15)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 16)
                        intMatching = intMatching + 1
                    Case varData(lngRow1, lngIndex1) = varData(lngRow2, 17)
                        intMatching = intMatching + 1
                End Select
            End If
        Next
        If intMatching >= intMinMatch Then
            lngLastMatch = lngRow2
            If varData(lngRow1, 2) = "" Then
                varData(lngRow1, 2) = lngRow1 & ", " & lngRow2
            Else
                varData(lngRow1, 2) = varData(lngRow1, 2) & ", " & lngRow2
            End If
        End If
    Next
    varData(lngLastMatch, 2) = "'" & lngLastMatch
Next

Range("A1:B" & lngLastRow).Value = varData

'Debug.Print Now
End Sub

Open in new window

0
 

Author Comment

by:mtthompsons
ID: 39605013
I get subscript out of range on line
   varData(lngLastMatch, 2) = "'" & lngLastMatch

I have 40K rows of data

I tested with 500 and seems fine and on the bigger loaded file i got this error
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39605903
I can't imagine why that is happening. Did you modify the code at all? If not then try this for me. On a temporary basis change line 49 to

For lngRow1 = 39500 To lngLastRow

where 39500 is a number about 500 rows less then your total number of rows. If you still get the error
1) Click the 'Debug' button
2) Assuming that the varData(lngLastMatch, 2) = "'" & lngLastMatch line is now highlighted in yellow, place your cursor over either lngLastMatch in that line and tell me what the value is.

Is that value greater than the number of rows and if so by how many?
0
 

Author Comment

by:mtthompsons
ID: 39607783
lngLastMatch=0 is what i get when highlighted

I never changed the code and now when you asked i changed the row 49 and tried
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 39609129
That doesn't make sense. Can you attach your project? If it's too big you can go to MediaFire, upload your workbook and then post the resulting URL here.
0

Featured Post

What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

Join & Write a Comment

This article is the result of a quest to better understand Task Scheduler 2.0 and all the newer objects available in vbscript in this version over  the limited options we had scripting in Task Scheduler 1.0.  As I started my journey of knowledge I f…
This article descibes how to create a connection between Excel and SAP and how to move data from Excel to SAP or the other way around.
The viewer will learn how to use a discrete random variable to simulate the return on an investment over a period of years, create a Monte Carlo simulation using the discrete random variable, and create a graph to represent the possible returns over…
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now