Link to home
Start Free TrialLog in
Avatar of Mike
MikeFlag for United Kingdom of Great Britain and Northern Ireland

asked on

MS Word VBA to remove Duplicate Paragraphs Using Scripting.Dictionary

Below is a section of MS Word VBA code which removes duplicate paragraphs from a document. It retains the first instance of each paragraph it finds, and deletes any subsequent instances ("duplicates").  Does anyone know how I might amend the code so that instead:

1) it deletes the first and all subsequent instances of each paragraph except the last instance (the one closest to the bottom of the document) which it retains? PS: since posting I have figured this one out.

2) It deletes all instances of a paragraph which has duplicates, retaining none?

3) In the process of trying to solve (2) I realise that I don't really understand how the 'collect duplicates' section works. If anyone could explain that to me I would be very grateful!

Sub DeleteDuplicateParagraphs()
  Dim p As Paragraph
  Dim d As New Scripting.Dictionary
  Dim t As Variant
  Dim i As Integer
  Dim StartTime As Single

  StartTime = Timer

  ' collect duplicates
  For Each p In ActiveDocument.Paragraphs
    t = p.Range.Text
    If t <> vbCr Then
      If Not d.Exists(t) Then d.Add t, New Scripting.Dictionary
      d(t).Add d(t).Count + 1, p
    End If
  Next

  ' eliminate duplicates
  Application.ScreenUpdating = False
  For Each t In d
    For i = 2 To d(t).Count
      d(t)(i).Range.Delete
    Next
  Next
  Application.ScreenUpdating = True

  MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub

Open in new window

Avatar of aikimark
aikimark
Flag of United States of America image

This is a wrong statements:
If Not d.Exists(t) Then d.Add t, New Scripting.Dictionary
d(t).Add d(t).Count + 1, p

Open in new window

What you should have is a collection object as the value/item.  That way, you add paragraphs to the collection.  Your code is adding ALL the paragraphs, not just the duplicates.
Avatar of Mike

ASKER

Hi and thanks.  Please could you suggest how I would amend the code to do so?
please post a representative document containing duplicate paragraphs.
Avatar of Mike

ASKER

Try this......
Text-Sample.docx
Sub DeleteDuplicateParagraphs()
    Dim p As Paragraph
    Dim d As Object     'New Scripting.Dictionary
    Dim paratext As String
    Dim i As Long       'Integer
    Dim StartTime As Single
    Dim lngPara As Long
    Dim lngDups() As Long
    Dim vitem As Variant
    
    Set d = CreateObject("scripting.dictionary")
    
    StartTime = Timer
    
    ' iterate the paragraphs
    For lngPara = 1 To ActiveDocument.Paragraphs.Count
        'Set p = ActiveDocument.Paragraphs(lngPara)
        paratext = ActiveDocument.Paragraphs(lngPara).Range.Text
        If paratext <> vbCr Then
            If d.Exists(paratext) Then
                d(paratext).Add lngPara
            Else
                d.Add paratext, New Collection
            End If
        End If
    Next
    
    'initialize and populate dup array => bucket sort
    ReDim lngDups(1 To ActiveDocument.Paragraphs.Count)
    For Each vitem In d
        If d(vitem).Count <> 0 Then
            For i = 1 To d(vitem).Count
                lngDups(d(vitem)(i)) = 1
            Next
        End If
    Next
    
    ' eliminate duplicates from end to start
    Application.ScreenUpdating = False
    For lngPara = UBound(lngDups) To LBound(lngDups) Step -1
        If lngDups(lngPara) = 1 Then
            ActiveDocument.Paragraphs(lngPara).Range.Delete
        End If
    Next
    Application.ScreenUpdating = True
    
    MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub

Open in new window

Avatar of Mike

ASKER

Thanks. You have obviously spent a lot of time on this and I am very grateful.

Both the dictionary-based code, and your collection-based code run fine, except neither removes the 1st instance of a duplicate paragraph. Your code takes about twice as long to run, I'm afraid.

My question was whether you could amend either code so that It deletes all instances of a paragraph which originally had duplicates (including the 1st instance), retaining none?
I thought you stated that the posted code was doing this and that you didn't want that behavior.
Your code takes about twice as long to run
Twice as long compared to what?

When running my posted code against your sample document, it took .05 seconds on my laptop.
Avatar of Mike

ASKER

With regard to your penultimate post, I think I was clear that the dictionary-based code was running OK, and that my only complaint with that code was that it left one remaining instance of each duplicate paragraphs, and that instead I wanted all instances removed.  

Regarding your last post, I am running this code against longer documents than the sample I provided and a quick test indicates the dictionary-based method is faster. If we can solve the "last remaining instance" problem with either the dictionary- or collection- based code then I will test the speed in more detail.
Please test this
Sub DeleteDuplicateParagraphs()
    Dim p As Paragraph
    Dim d As Object     'New Scripting.Dictionary
    Dim paratext As String
    Dim i As Long       'Integer
    Dim StartTime As Single
    Dim lngPara As Long
    Dim lngDups() As Long
    Dim vitem As Variant
    
    Set d = CreateObject("scripting.dictionary")
    
    StartTime = Timer
    
    ' iterate the paragraphs
    For lngPara = ActiveDocument.Paragraphs.Count To 1 Step -1
        'Set p = ActiveDocument.Paragraphs(lngPara)
        paratext = ActiveDocument.Paragraphs(lngPara).Range.Text
        If paratext <> vbCr Then
            If d.Exists(paratext) Then
                d(paratext).Add lngPara
            Else
                d.Add paratext, New Collection
            End If
        End If
    Next
    
    'initialize and populate dup array => bucket sort
    ReDim lngDups(1 To ActiveDocument.Paragraphs.Count)
    For Each vitem In d
        If d(vitem).Count <> 0 Then
            For i = 1 To d(vitem).Count
                lngDups(d(vitem)(i)) = 1
            Next
        End If
    Next
    
    ' eliminate duplicates from end to start
    Application.ScreenUpdating = False
    For lngPara = UBound(lngDups) To LBound(lngDups) Step -1
        If lngDups(lngPara) = 1 Then
            ActiveDocument.Paragraphs(lngPara).Range.Delete
        End If
    Next
    Application.ScreenUpdating = True
    
    MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub

Open in new window

Avatar of Mike

ASKER

Thanks. Unfortunately this also leaves one instance each of the paragraphs I have denoted as DUPE 1, DUPE 2 and DUPE 3.
Please test this one.  I think I now understand what you need.
Sub DeleteDuplicateParagraphs()
    Dim p As Paragraph
    Dim d As Object     'New Scripting.Dictionary
    Dim paratext As String
    Dim i As Long       'Integer
    Dim StartTime As Single
    Dim lngPara As Long
    Dim lngDups() As Long
    Dim vItem As Variant
    Dim vDup As Variant
    
    Set d = CreateObject("scripting.dictionary")
    
    StartTime = Timer
    
    ' iterate the paragraphs
    lngPara = 0
    For Each p In ActiveDocument.Paragraphs
        lngPara = lngPara + 1
        paratext = p.Range.Text
        If paratext <> vbCr Then
            If d.Exists(paratext) Then
            Else
                d.Add paratext, New Collection
            End If
            d(paratext).Add lngPara
        End If
    Next
    
    'initialize and populate dup array => bucket sort
    ReDim lngDups(1 To ActiveDocument.Paragraphs.Count)
    For Each vItem In d
        If d(vItem).Count > 1 Then
            For Each vDup In d(vItem)
                lngDups(vDup) = 1
            Next
        End If
    Next
    
    ' eliminate duplicates from end to start
    Application.ScreenUpdating = False
    For lngPara = UBound(lngDups) To LBound(lngDups) Step -1
        If lngDups(lngPara) = 1 Then
            ActiveDocument.Paragraphs(lngPara).Range.Delete
        End If
    Next
    Application.ScreenUpdating = True
    
    MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub

Open in new window

Avatar of Mike

ASKER

OK, this is good. All duplicates are removed.

I tested speed again. On my test document (which is unfortunately not something I can post) the original code takes 3.36secs and your code takes 5.45s. However, in both cases almost all that time is consumed by the actual deletion process, which in your code is:

For lngPara = UBound(lngDups) To LBound(lngDups) Step -1
        If lngDups(lngPara) = 1 Then
            ActiveDocument.Paragraphs(lngPara).Range.Delete
        End If
    Next

Open in new window


Can you think of any way to accelerate that? The delay might be caused by the Word undo stack but I know of no way to disable that. Can you think of any way to delete what needs to be deleted in a faster way?
While performance might be improved, I don't think it worth the effort for a two second difference.
Avatar of Mike

ASKER

For one small document I agree. But I will be using this to process much larger documents, and to batch process multiple documents at a time. So it is worth trying to improve the performance and if you have any suggestions I would be happy to hear them. However, you may prefer not to spend any more time on this as you have already done a lot of work. If that's the case let me know and I will close the question and recognise your solution.
What is the percentage of duplicate paragraphs (avg, min, max) that you expect to encounter?

Is this going to be a one time process against these documents?
Avatar of Mike

ASKER

I estimate av=30%, min=10%, max=50%.

The code would be run once only on each document.

I don't know where you are based but it's getting very late here now so if it's OK I will continue this tomorrow.  I really appreciate your spending time on this.
ASKER CERTIFIED SOLUTION
Avatar of aikimark
aikimark
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Mike

ASKER

This is indeed about 30% faster.  Thanks very much for your help. I am going to post a follow-up question, so it may be that we correspond again. But, if not, have a great weekend and thanks again.