Highlight sentences with similar words!!

Shyamulee Das
Shyamulee Das used Ask the Experts™
on
Hi,

I want to create a Word macro that will check for similar words in two documents and highlight those sentences with more than 6 similar words in 2nd document.
Example: Doc1 contains some sentences and 2nd document contains some sentences, so highlight sentences which contains more than 6 similar words in the 2nd document.

Thank you.
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Dr. KlahnPrincipal Software Engineer

Commented:
Please define "similar words" so that we can determine whether this is possible without using external software.
Shyamulee DasAutomation Analyst

Author

Commented:
For example: (Doc1) " I am going to the market in sometime "
(Doc2) " I will be going to the market in the evening "
So both the sentences have 6 similar words, then highlight the 2nd sentence.
GrahamSkanRetired
Top Expert 2012

Commented:
See if this fits your need:
Sub HighLightMatchingWords()
    Dim docFirst As Document
    Dim docSecond As Document
    
    Dim iTarget As Integer
    Dim iCount As Integer
    Dim rngSent1 As Range
    Dim rngSent2 As Range
    Dim rngWord1 As Range
    Dim rngWord2 As Range
    
    iTarget = 6
    
    Set docFirst = Documents.Open("C:\MyFolder\" & "First.docx")
    Set docSecond = Documents.Open("C:\MyFolder\" & "Second.docx")
    For Each rngSent1 In docFirst.Range.Sentences
        For Each rngSent2 In docSecond.Range.Sentences
            iCount = 0
            For Each rngWord1 In rngSent1.Words
                For Each rngWord2 In rngSent2.Words
                    If StrComp(rngWord1.Text, rngWord2.Text, vbTextCompare) = 0 Then
                        rngWord2.HighlightColorIndex = wdYellow
                        iCount = iCount + 1
                    End If
                Next rngWord2
                
                If iCount < iTarget Then
                    rngSent2.HighlightColorIndex = wdNoHighlight
                End If
            Next rngWord1
        Next rngSent2
    Next rngSent1
End Sub

Open in new window

Introduction to R

R is considered the predominant language for data scientist and statisticians. Learn how to use R for your own data science projects.

GrahamSkanRetired
Top Expert 2012

Commented:
Warnings:

1. The macro compares every sentence in the first document with every sentence in the second document, so that will make it slow with large documents.
2. It uses MS Word's VBA definition of a word. This might not be what you need (e.g. commas are treated as words)
Shyamulee DasAutomation Analyst

Author

Commented:
I want to highlight Sentences in the "DocSecond"  Sentences having more than 6 similar words with compare to the sentences of "DocFirst".


Thank you.
Shyamulee DasAutomation Analyst

Author

Commented:
I have created a macro for this but it is not helping me. It is highlighting everything in the second doc. If you can help me out with the code it will be really helpful.

Sub comparison1()

Dim doc As Document
Dim doc1 As Document
Set doc = ActiveDocument
Dim sen As Range
Dim wrd
Dim wrd1
Dim sen1 As Range
Dim SelectedFileItem As String
Dim afile
Dim item_count As Variant
Dim total_item

Dim FDialog As FileDialog
Set FDialog = Application.FileDialog(msoFileDialogOpen)

Dim c_dict As Scripting.Dictionary
Set c_dict = New Scripting.Dictionary

Dim c_dict1 As Scripting.Dictionary
Set c_dict1 = New Scripting.Dictionary

With FDialog

   .Title = "Select a file"
   .AllowMultiSelect = False
   .InitialFileName = "D:\OneDrive - CACTUS\Documents"
   .Filters.Clear
   .Filters.Add "Word files", "*.docx"
   
    If .Show = -1 Then
        SelectedFileItem = .SelectedItems(1)
        afile = SelectedFileItem
    Else
        MsgBox "You cancelled the operation"
     Exit Sub
    End If
    Set doc1 = Documents.Open(afile)
End With

total_item = 0

    For Each sen In doc1.Range.sentences
        For Each wrd In sen.words
        If c_dict.Exists(LCase(wrd.Text)) = False Then
            With sen
               If sen.words.count > 6 Then
                    c_dict.Add LCase(wrd.Text), LCase(wrd.Text)
               End If
            End With
        End If
        Next
        
    For Each sen1 In doc1.Range.sentences
        For Each wrd1 In sen1.words
        If c_dict1.Exists(LCase(wrd1.Text)) = False Then
            If sen1.words.count >= sen.words.count Then
                c_dict1.Add LCase(wrd1.Text), LCase(wrd1.Text)
            End If
        End If
        Next
        
    For Each item_count In c_dict.Items
        If c_dict1.Exists(LCase(item_count)) = True Then
        total_item = total_item + 1
            If total_item > 6 Then
                sen1.HighlightColorIndex = wdGray25
            End If
        End If
    Next
       
c_dict.RemoveAll
c_dict1.RemoveAll

    Next sen1
    Next sen

For Each coll In doc1.Range.sentences
    If coll.HighlightColorIndex = wdGray25 Then
        highlightCount = highlightCount + 1
    End If
Next

   MsgBox "There are (" & highlightCount & ") similar sentences" & vbCr & "Matched items: " & total_item
End Sub

Open in new window

GrahamSkanRetired
Top Expert 2012

Commented:
I hope this is better for you
Sub HighLightMatchingSentences()
    Dim docFirst As Document
    Dim docSecond As Document
    
    Dim iTarget As Integer
    Dim iCount As Integer
    Dim rngSent1 As Range
    Dim rngSent2 As Range
    Dim rngWord1 As Range
    Dim rngWord2 As Range
    
    iTarget = 6
    
    Set docFirst = Documents.Open("C:\MyFolder\" & "First.docx")
    Set docSecond = Documents.Open("C:\MyFolder\" & "Second.docx")
    For Each rngSent1 In docFirst.Range.Sentences
        For Each rngSent2 In docSecond.Range.Sentences
            iCount = 0
            For Each rngWord1 In rngSent1.Words
                For Each rngWord2 In rngSent2.Words
                    If StrComp(rngWord1.Text, rngWord2.Text, vbTextCompare) = 0 Then
                        iCount = iCount + 1
                    End If
                Next rngWord2
                
                If iCount < iTarget Then
                    rngSent2.HighlightColorIndex = wdYellow
                End If
            Next rngWord1
        Next rngSent2
    Next rngSent1
End Sub

Open in new window

GrahamSkanRetired
Top Expert 2012

Commented:
I haven't analysed your macro in detail, but lines 44 and 55 are both stepping through doc1. I guess that you wanted to use doc in line 44
Shyamulee DasAutomation Analyst

Author

Commented:
rngSent2.HighlightColorIndex = wdYellow

After updating the recent code it is highlighting everything.
GrahamSkanRetired
Top Expert 2012

Commented:
My code or yours?
Shyamulee DasAutomation Analyst

Author

Commented:
Yours.
GrahamSkanRetired
Top Expert 2012

Commented:
Mine
Shyamulee DasAutomation Analyst

Author

Commented:
Yes!!
Sub HighLightMatchingSentences()
    Dim docFirst As Document
    Dim docSecond As Document
    
    Dim iTarget As Integer
    Dim iCount As Integer
    Dim rngSent1 As Range
    Dim rngSent2 As Range
    Dim rngWord1 As Range
    Dim rngWord2 As Range
    
    iTarget = 6
    
    Set docFirst = Documents.Open("C:\MyFolder\" & "First.docx")
    Set docSecond = Documents.Open("C:\MyFolder\" & "Second.docx")
    For Each rngSent1 In docFirst.Range.Sentences
        For Each rngSent2 In docSecond.Range.Sentences
            iCount = 0
            For Each rngWord1 In rngSent1.Words
                For Each rngWord2 In rngSent2.Words
                    If StrComp(rngWord1.Text, rngWord2.Text, vbTextCompare) = 0 Then
                        iCount = iCount + 1
                    End If
                Next rngWord2
                
                If iCount < iTarget Then
                    rngSent2.HighlightColorIndex = wdYellow
                End If
            Next rngWord1
        Next rngSent2
    Next rngSent1
End Sub

Open in new window

GrahamSkanRetired
Top Expert 2012

Commented:
Sorry. Careless mistake. It was line 26
Sub HighLightMatchingSentences()
    Dim docFirst As Document
    Dim docSecond As Document
    
    Dim iTarget As Integer
    Dim iCount As Integer
    Dim rngSent1 As Range
    Dim rngSent2 As Range
    Dim rngWord1 As Range
    Dim rngWord2 As Range
    
    iTarget = 6
    
    Set docFirst = Documents.Open("C:\MyFolder\" & "First.docx")
    Set docSecond = Documents.Open("C:\MyFolder\" & "Second.docx")
    For Each rngSent1 In docFirst.Range.Sentences
        For Each rngSent2 In docSecond.Range.Sentences
            iCount = 0
            For Each rngWord1 In rngSent1.Words
                For Each rngWord2 In rngSent2.Words
                    If StrComp(rngWord1.Text, rngWord2.Text, vbTextCompare) = 0 Then
                        iCount = iCount + 1
                    End If
                Next rngWord2
                
                If iCount >= iTarget Then
                    rngSent2.HighlightColorIndex = wdYellow
                End If
            Next rngWord1
        Next rngSent2
    Next rngSent1
End Sub

Open in new window

Shyamulee DasAutomation Analyst

Author

Commented:
Still it is highlighting the whole thing !! :(
GrahamSkanRetired
Top Expert 2012

Commented:
Perhaps my choice of documents is wrong. Can you post a couple of sample documents, please?
Shyamulee DasAutomation Analyst

Author

Commented:
I have uploaded the two demo files which m working on!! In doc2 New stimulation paradigms with time-varying intensity and frequency were developed to suppress the “onset responses”. you will find this sentence same as the one in doc1. so have to highlight this sentence.

Thank you.
Doc1.docx
Doc2.docx
GrahamSkanRetired
Top Expert 2012

Commented:
It didn't highlight every sentence, but more, I suppose, than you were expecting.

That is partly because of punctuation. Also I suspect that you didn't want to include short common words like "to", "the" and "and".

In this  version, the word length must be greater than 4 characters (including the trailing blank) to be counted. It only finds one sentence.

Sub HighLightMatchingSentences1()
    Dim docFirst As Document
    Dim docSecond As Document
    
    Dim iTarget As Integer
    Dim iCount As Integer
    Dim rngSent1 As Range
    Dim rngSent2 As Range
    Dim rngWord1 As Range
    Dim rngWord2 As Range
    
    iTarget = 6
    
    Set docFirst = Documents.Open("C:\MyFolder\" & "Doc1.docx")
    Set docSecond = Documents.Open("C:\MyFolder\" & "Doc2.docx")
    For Each rngSent1 In docFirst.Range.Sentences
        For Each rngSent2 In docSecond.Range.Sentences
            iCount = 0
            For Each rngWord1 In rngSent1.words
                If Len(rngWord1) > 4 Then
                    For Each rngWord2 In rngSent2.words
                        If StrComp(rngWord1.Text, rngWord2.Text, vbTextCompare) = 0 Then
                            iCount = iCount + 1
                        End If
                    Next rngWord2
                    
                    If iCount >= iTarget Then
                        rngSent2.HighlightColorIndex = wdYellow
                    End If
                End If
            Next rngWord1
        Next rngSent2
    Next rngSent1
End Sub

Open in new window

Shyamulee DasAutomation Analyst

Author

Commented:
This is the output which I am getting.
But I don't want this as my output. It should highlight only those sentences which has more than 6 similar words as compared to the sentences of doc1!!

Thank you.
Output.docx
GrahamSkanRetired
Top Expert 2012

Commented:
That is strange. This is what I get:
Doc2Out.docx
GrahamSkanRetired
Top Expert 2012

Commented:
I'll add some code to capture some debugging information.
Martin LissOlder than dirt
Most Valuable Expert 2017
Distinguished Expert 2018

Commented:
You might be interested in plagiarism detectors such as quetext or Search Engine Reports.
Retired
Top Expert 2012
Commented:
Instead of  the debugging code, I have made the macro more rigorous.

It now steps through the first document and creates an array of unique words for each eligible sentence. It then steps through the sentences  of the second document, using a Find with each of the array words to see if the word appears in the sentence

Sub HighLightMatchingSentences2()
    Dim docFirst As Document
    Dim docSecond As Document
    
    Dim iTarget As Integer
    Dim iMatchCount As Integer
    Dim rngSent1 As Range
    Dim rngSent2 As Range
    Dim rngWord1 As Range
    Dim rngWord2 As Range
    Dim strWords1() As String
    Dim w As Integer
    Dim strWord As String
    Dim bFound As Boolean
    Dim iWords1 As Integer
    Dim i As Integer
    Dim rngSent2Find As Range
    iTarget = 6
    
    Set docFirst = Documents.Open("C:\MyFolder\" & "doc1.docx")
    Set docSecond = Documents.Open("C:\MyFolder\" & "doc2.docx")
    For Each rngSent1 In docFirst.Range.Sentences
        iWords1 = 0
        If rngSent1.words.Count >= iTarget Then 'only long enough sentences. Ignore if too short
            iWords1 = -1
            For Each rngWord1 In rngSent1.words
                strWord = Trim(rngWord1)
                If Len(strWord) > 3 Then 'ignore small words
                    bFound = False
                    For w = 0 To iWords1
                        If StrComp(strWord, strWords1(w), vbTextCompare) = 0 Then
                            bFound = True
                            Exit For
                        End If
                    Next w
                    'collect unique words from sentence in first document
                    If Not bFound Then
                        iWords1 = iWords1 + 1
                        ReDim Preserve strWords1(iWords1)
                        strWords1(iWords1) = strWord
                    End If
                End If
            Next rngWord1
            If iWords1 > 6 Then 'more than six unique words
                For Each rngSent2 In docSecond.Range.Sentences
                   If rngSent2.words.Count >= iTarget Then  'ignore short sentences
                        iMatchCount = 0
                        For i = 0 To UBound(strWords1)
                            strWord = strWords1(i)
                            'look for an occurrence of word in each sentence of the second document
                            Set rngSent2Find = rngSent2.Duplicate
                            With rngSent2Find.Find
                                .Text = strWord
                                .MatchWholeWord = True
                                If .Execute Then
                                    iMatchCount = iMatchCount + 1
                                End If
                            End With
                        Next i
                    End If
                    If iMatchCount >= iTarget Then
                        rngSent2.HighlightColorIndex = wdYellow
                    End If
                Next rngSent2
            End If 'iWords1 > 6
        End If 'rngSent1.words.Count >= iTarget Then 'ignore short sentences
    Next rngSent1
End Su

Open in new window

b
Shyamulee DasAutomation Analyst

Author

Commented:
Thank you so much !! :) This has solved my problem..

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial