Highlight duplicate sentences in word document

ShaileshShinde
Hello Experts,

There is a requirement to highlight duplicate sentences in Word documents using a macro or a Perl script.
The script should segment each paragraph using full stops as delimiters, and if a single word appears on its own, the script should check for duplicates of that single word.

Can you please suggest any references or sample code to achieve this?

Thanks In Advance,
Shail
Top Expert 2014

Commented:
1. do the duplicate sentences have to be adjacent to each other?
2. "single word appears" -- not sure how this pertains to duplicate sentence detection.  Please elaborate.
Try this macro

Sub finddups()
    Dim sen As Range, ssen As Range
    For Each sen In ActiveDocument.Sentences
        For Each ssen In ActiveDocument.Sentences
            If sen.Start < ssen.Start Then
                If ssen.Text = sen.Text Then
                    ssen.HighlightColorIndex = wdYellow
                    sen.HighlightColorIndex = wdYellow
                End If
            End If
        Next ssen
    Next sen
End Sub


Top Expert 2014

Commented:
@ShaileshShinde

3. How big will these documents be?
4. Is it possible that you need to find multiple sentence copies (3 or more)?
5. Once the duplicates have been identified, what do you do next?

=============
@ssaqibh

Nicely done.  

Depending on the OP responses, you might consider:
* Limiting the range for the ssen iteration to just those sentences after the current sen range, rather than iterate all sentences.  Probably not necessary for relatively short documents.

* Assigning the duplicates similar bookmark names. (DS001a, DS001b, DS001c, DS002a, DS002b, DS003.001, DS003.002)  This would facilitate the use of GoTo Bookmark navigation or other processing.
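
The first suggestion above (comparing each sentence only against the sentences after it) could be sketched roughly as follows. This is an untested illustration rather than part of the posted macro: the name FindDupsForwardOnly and the texts array are made up for the example, and it assumes the sentence count does not change while the macro runs.

```vba
Sub FindDupsForwardOnly()
    ' Hypothetical variant of finddups: cache the sentence texts once,
    ' then compare each sentence only with the ones that follow it.
    Dim texts() As String
    Dim n As Long, i As Long, j As Long
    Dim sen As Range

    n = ActiveDocument.Sentences.Count
    If n < 2 Then Exit Sub
    ReDim texts(1 To n)

    ' Read every sentence's text a single time.
    i = 0
    For Each sen In ActiveDocument.Sentences
        i = i + 1
        If i > n Then Exit For
        texts(i) = sen.Text
    Next sen

    ' Each pair (i, j) with i < j is compared exactly once.
    For i = 1 To n - 1
        For j = i + 1 To n
            If texts(i) = texts(j) Then
                ActiveDocument.Sentences(i).HighlightColorIndex = wdYellow
                ActiveDocument.Sentences(j).HighlightColorIndex = wdYellow
            End If
        Next j
    Next i
End Sub
```

The comparison still takes time proportional to the square of the number of sentences, but it works on cached strings and only touches the document when a match is found.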

Author

Commented:
Hello Experts,
1. do the duplicate sentences have to be adjacent to each other?

No, the duplicate sentences can appear anywhere in the document.

2. "single word appears" -- not sure how this pertains to duplicate sentence detection.  Please elaborate.

For example, there are multiple tables with headings such as "Name", "Address", ...
Across all the tables, "Name" is a duplicate single word because it appears in almost every table, so it needs to be highlighted.


3. How big will these documents be?

The documents can be 300 to 600 pages long.

4. Is it possible that you need to find multiple sentence copies (3 or more)?

Yes, a sentence can appear two or more times.

5. Once the duplicates have been identified, what do you do next?

Once the duplicates have been identified, highlight all the duplicate sentences.

Thanks a lot,
Shailesh
Did you try the macro? What was the outcome?

Author

Commented:
Hello Expert,

I just tested this macro and found that it highlights duplicate sentences. However, it does not highlight sentences that appear as shown below:

This is test message. I have to delete this test message.
This is not test message. Information has to be updated.
This is first test message.
This is test message. I will not delete this test message.
Information has to be updated.

Here, the underlined sentences were highlighted. However, the italicized ones were not highlighted even though they appear twice.

Thanks a Lot,
Shail
Ok now try

Sub finddups()
    Dim sen As Range, ssen As Range
    For Each sen In ActiveDocument.Sentences
        For Each ssen In ActiveDocument.Sentences
            If sen.Start < ssen.Start Then
                If clean(sen) = clean(ssen) Then
                    ssen.HighlightColorIndex = wdRed
                    sen.HighlightColorIndex = wdRed
                End If
            End If
        Next ssen
    Next sen
End Sub
                                            
Function clean(x)
clean = x
clean = Replace(clean, vbCr, "")
clean = Replace(clean, vbLf, "")
clean = Replace(clean, vbTab, "")
clean = Replace(clean, vbVerticalTab, "")
clean = Replace(clean, vbBack, "")
clean = Replace(clean, vbFormFeed, "")
End Function


Author

Commented:
Hello Experts,

I just tested the latest code and found that it works correctly as per the requirement. However, regarding the second part of the requirement, quoted below:

2. "single word appears" -- not sure how this pertains to duplicate sentence detection.  Please elaborate.

For example, there are multiple tables with headings such as "Name", "Address", ...
Across all the tables, "Name" is a duplicate single word because it appears in almost every table, so it needs to be highlighted.

Can this be handled in this code?

Thanks a Lot!
Shail
Please upload a sample word file so that the program can be tested on real data.
Top Expert 2014

Commented:
Better yet, upload a zip file of the Word document.

Author

Commented:
Hello Experts,

Please find sample .docx file for this.

Thanks,
Shail
Doc6.docx
Sub finddups()
    Dim sen As Range, ssen As Range
    Dim cel As Cell, ccel As Cell, tbl As Table
    Dim hlc As Integer
    Application.ScreenUpdating = False
    hlc = wdYellow
    For Each sen In ActiveDocument.Sentences
        For Each ssen In ActiveDocument.Sentences
            If sen.Start < ssen.Start Then
                If clean(sen) = clean(ssen) Then
                    ssen.HighlightColorIndex = hlc
                    sen.HighlightColorIndex = hlc
                End If
            End If
        Next ssen
    Next sen
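    ' Compare each pair of cells within a table once, using a
    ' row-major position (row * column count + column) to order them.
    For Each tbl In ActiveDocument.Tables
        For Each cel In tbl.Range.Cells
            For Each ccel In tbl.Range.Cells
                If cel.Row.Index * tbl.Columns.Count + cel.Column.Index < _
                   ccel.Row.Index * tbl.Columns.Count + ccel.Column.Index Then
                    ' Pass the cell text explicitly rather than the Cell object.
                    If clean(cel.Range.Text) = clean(ccel.Range.Text) Then
                        ccel.Range.HighlightColorIndex = hlc
                        cel.Range.HighlightColorIndex = hlc
                    End If
                End If
            Next ccel
        Next cel
    Next tbl
    ' Turn screen updating back on now that the work is done.
    Application.ScreenUpdating = True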
End Sub
                                            
Function clean(x)
clean = x
clean = Replace(clean, vbCr, "")
clean = Replace(clean, vbLf, "")
clean = Replace(clean, vbTab, "")
clean = Replace(clean, vbVerticalTab, "")
clean = Replace(clean, vbBack, "")
clean = Replace(clean, vbFormFeed, "")
End Function


Top Expert 2014

Commented:
@ssaqibh

Add chr(7) to your clean routine.
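
For reference, Chr(7) is the end-of-cell marker that Word appends to the text of a table cell (after a carriage return). A minimal sketch of the clean routine with that one extra line might look like the following; it reuses the function from the macro above, and CleanDemo is a hypothetical helper added only to show the effect.

```vba
Function clean(x)
    ' Strip control characters so that only the visible text is compared.
    clean = x
    clean = Replace(clean, vbCr, "")
    clean = Replace(clean, vbLf, "")
    clean = Replace(clean, vbTab, "")
    clean = Replace(clean, vbVerticalTab, "")
    clean = Replace(clean, vbBack, "")
    clean = Replace(clean, vbFormFeed, "")
    ' Added: remove Word's end-of-cell marker.
    clean = Replace(clean, Chr(7), "")
End Function

Sub CleanDemo()
    ' Hypothetical check: text shaped like a table cell's Range.Text.
    Debug.Print clean("Name" & vbCr & Chr(7)) = "Name"
End Sub
```

With this change, a heading such as "Name" inside a table cell reduces to the same string as "Name" in body text, so the two can be matched.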

Author

Commented:
Hello Experts,

We have received an additional requirement for this. I have tested the latest code and found that it works correctly as per the requirement. However, the new requirement might change the logic. The existing requirement is to highlight all repeated sentences and standalone words. The new requirement is to highlight all repeated sentences and standalone words except for the first occurrence. For example:

Let's say the Word document contains three paragraphs:

This is first test message. This is second test message.
This is second test message. This is third test message.
This is third test message. This is second test message.

Here, "This is second test message" appears three times, so the other two occurrences should be highlighted but not the first one. The same applies to standalone words.

Is this possible with the latest code provided?

Thanks, Shail
This is first test message. This is second test message.
This is second test message. This is third test message.
This is third test message. This is second test message.
This is first test message. This is third test message.
This is first test message.

There are three of all. Which ones do you want highlighted?

Author

Commented:
Yes, will post the new question for this requirement.

This is first test message. This is second test message.
This is second test message. This is third test message.
This is third test message. This is second test message.
This is first test message. This is third test message.
This is first test message.

The bold ones need to be highlighted, leaving the first occurrence of each repeated sentence as it is.

Thanks,
Shail
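
One way the "all but the first occurrence" rule could be approached is to remember each sentence the first time it is seen and highlight only later repeats. The sketch below is untested and was not posted in this thread. The names HighlightRepeatsExceptFirst and StripControlChars are hypothetical, Scripting.Dictionary is assumed to be available (Windows), and the Trim$ call is an extra assumption so that a sentence followed by a space matches the same sentence at the end of a paragraph.

```vba
Sub HighlightRepeatsExceptFirst()
    Dim seen As Object
    Dim sen As Range
    Dim key As String

    ' Assumption: late-bound Scripting.Dictionary (Windows only).
    Set seen = CreateObject("Scripting.Dictionary")

    For Each sen In ActiveDocument.Sentences
        key = Trim$(StripControlChars(sen.Text))
        If Len(key) > 0 Then
            If seen.Exists(key) Then
                ' Second or later occurrence: highlight it.
                sen.HighlightColorIndex = wdYellow
            Else
                ' First occurrence: remember it, leave it unmarked.
                seen.Add key, True
            End If
        End If
    Next sen
End Sub

Private Function StripControlChars(ByVal s As String) As String
    ' Same characters as the clean routine, plus Chr(7).
    Dim ch As Variant
    For Each ch In Array(vbCr, vbLf, vbTab, vbVerticalTab, _
                         vbBack, vbFormFeed, Chr(7))
        s = Replace(s, ch, "")
    Next ch
    StripControlChars = s
End Function
```

Because each sentence is looked up once instead of being compared with every other sentence, this should also scale better on documents of several hundred pages.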

Author

Commented:
Thanks Expert.
Remember to post as a "Related question" so that I get an email which attracts my attention.

Thanks for the grade.
Oops...I thought it was an A grade.

I have tested the latest code and found it's working correctly as per the requirement.
Then why a B grade? Please tell me what was lacking.
Top Expert 2014

Commented:
The Ask A Related Question link has been missing since the release of version 10.
Oh I see....Another improvement.
