Word count excluding repeated text

Is there any software (shareware or freeware) that can do a word count on text files (doc, txt, rtf) while excluding repeated text, such as titles or legal notices?
Ideally, repetitions would be marked, and certain styles (Heading 1, for example) could be excluded from the count.
Thanks for your hints.

bruintje Commented:
OK, this is not easy because it's pretty specific. I found something that did exactly this except for the filtering, but the guy's mailbox was over its limit, so my mail bounced.....

So we have to come up with another solution, and because it gives you something to work with, I guess it's a good alternative:

the count of unique words and a list with the count of each word

-open a new Word document
-open the VB Editor with ALT+F11
-in the upper-left pane (the Project Explorer), select Normal
-insert a new module from the menu (Insert > Module)
-now paste the code

Option Explicit

'Here's a macro that works pretty well.  It displays an input box asking
'whether you want the word count sorted by frequency or in alphabetic
'order, then creates a new document with a list of all the words.

Sub WordFrequency()
         Dim SingleWord As String           'Raw word pulled from doc
         Const maxwords = 9000              'Maximum unique words allowed
         Dim Words(maxwords) As String      'Array to hold unique words
         Dim Freq(maxwords) As Long         'Frequency counter for unique words
         Dim WordNum As Long                'Number of unique words
         Dim ByFreq As Boolean              'Flag for sorting order
         Dim ttlwds As Long                 'Total words in the document
         Dim Excludes As String             'Words to be excluded
         Dim Found As Boolean               'Temporary flag
         Dim j As Long, k As Long, l As Long 'Loop counters (each needs its own type)
         Dim Temp As Long                   'Swap buffer for frequencies
         Dim tword As String                'Swap buffer for words
         Dim Ans As String                  'User's answer to the sort prompt
         Dim aword As Range                 'Current word in the document
         Dim tmpName As String              'Template of the source document

         ' Set up excluded words
         Excludes = "[the][a][of][is][to][for][this][that][by][be][and][are]"

         ' Find out how to sort
        ByFreq = True
        Ans = InputBox$("Sort by WORD or by FREQ?", "Sort order", "FREQ")
         If Ans = "" Then Exit Sub          'Cancelled: leave this macro only
         If UCase(Ans) = "WORD" Then
             ByFreq = False
         End If

         Selection.HomeKey Unit:=wdStory
         System.Cursor = wdCursorWait
         WordNum = 0
         ttlwds = ActiveDocument.Words.Count

         ' Count each word, skipping punctuation and excluded words
         For Each aword In ActiveDocument.Words
             SingleWord = Trim(LCase(aword))
             If SingleWord < "a" Or SingleWord > "z" Then SingleWord = ""   'Out of range?
             If InStr(Excludes, "[" & SingleWord & "]") Then SingleWord = "" 'On exclude list?
             If Len(SingleWord) > 0 Then
                 Found = False
                 For j = 1 To WordNum
                     If Words(j) = SingleWord Then
                         Freq(j) = Freq(j) + 1
                         Found = True
                         Exit For
                     End If
                 Next j
                 If Not Found Then
                     WordNum = WordNum + 1
                     Words(WordNum) = SingleWord
                     Freq(WordNum) = 1
                 End If
                 If WordNum > maxwords - 1 Then
                     j = MsgBox("The maximum array size has been exceeded. Increase maxwords.", vbOKOnly)
                     Exit For
                 End If
             End If
             ttlwds = ttlwds - 1
             StatusBar = "Remaining: " & ttlwds & "     Unique: " & WordNum
         Next aword

         ' Now sort it, by frequency (descending) or alphabetically (selection sort)
         For j = 1 To WordNum - 1
             k = j
             For l = j + 1 To WordNum
                 If (Not ByFreq And Words(l) < Words(k)) Or (ByFreq And Freq(l) > Freq(k)) Then k = l
             Next l
             If k <> j Then
                 tword = Words(j)
                 Words(j) = Words(k)
                 Words(k) = tword
                 Temp = Freq(j)
                 Freq(j) = Freq(k)
                 Freq(k) = Temp
             End If
             StatusBar = "Sorting: " & WordNum - j
         Next j

         ' Now write out the results
         tmpName = ActiveDocument.AttachedTemplate.FullName
         Documents.Add Template:=tmpName, NewTemplate:=False
         With Selection
             For j = 1 To WordNum
                 .TypeText Text:=Trim(Str(Freq(j))) & vbTab & Words(j) & vbCrLf
             Next j
         End With
         System.Cursor = wdCursorNormal
         j = MsgBox("There were " & Trim(Str(WordNum)) & " different words ", vbOKOnly, "Finished")
         Selection.HomeKey wdStory

End Sub

copyright notice
this code was originally written by Larry > Larry328@att.net
I changed a few declarations but it is his code ;)

-choose save
-close the editor
-back in Word, open the document you want to process
-press ALT+F8 and choose the "WordFrequency" macro
-ahhhhhhhh have fun

Filtering by style is difficult. For the exclusion of words, look at this line of code

Excludes = "[the][a][of][is][to][for][this][that][by][be][and][are]"

and change it to suit

If you want a listbox or something to pick them from, that will be some real work, as it already looks like it is :)

Hi Zorro,

since you're asking this in the Office section
-Word has properties to do word counts
-excluding repetitions: I don't know, I'd have to check
-you could build some code to do your second part
-but it would be huge, since it's not one word but all the words in a document, so you'd have to work with lists in memory etc.... not a pleasant thought

What's the need for this? Is this to be processed manually or something?

Zorro032798 (Author) Commented:
Hi Bruintje,

This function is offered in Trados, a translation support tool. It counts only the words that really need to be translated. If a section or a sentence is repeated, it does not have to be translated again, since that has already been done.
In some documents repetitions can be significant, sometimes up to 20%. In documents of a few hundred pages this can have a strong impact on things like deadlines and workload.
That's why I want to do this count.

If you mean that some VBA code can be written to solve the problem, I'm ready to award more points for this question.


Luckily I know some people here who work with Trados, so I know the software a bit

but now I don't understand
-you said the function is offered in Trados?
-do you need it outside Trados?
-purely for cost accounting / workload purposes, so you can calculate beforehand what's going to be translated?

Zorro032798 (Author) Commented:
Yep, that's the idea.
I don't have Trados myself, and buying the software just for this function seems a little bit stupid :-)
By the way, finding this kind of code is very rare. The guy didn't put anything under it, only an email address........ that's open source
Zorro032798 (Author) Commented:
This works very nicely indeed, but it's not exactly what I'm looking for. I want to find the sections that really repeat in a document, so the application has to look for sentences and text parts that come back. As you know, Trados does this kind of job.
Nevertheless, this is a nice piece of code indeed :-)
David Todd (Senior DBA) Commented:
Hi Zorro,

I imagine that finding repeating sections in a document is fairly heavy work.

I'd suggest putting in a minimum-number-of-words variable that would constitute the smallest section, say 5 or 10, then selecting the next 10 words and doing a find ...
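One way the "minimum number of words" idea could be sketched in VBA is shown below. This is not what Trados does and nobody in this thread posted it; it is only an illustration. Instead of running a Find for every window, it collects every run of PHRASE_LEN consecutive words in a dictionary and lists the runs that occur more than once. The macro name, PHRASE_LEN and the tab-separated output layout are made up for the example, words are split on spaces only (so punctuation stays attached), and the late-bound Scripting.Dictionary assumes Word on Windows.

```vba
Option Explicit

' Sketch only: lists every run of PHRASE_LEN consecutive words that
' occurs more than once in the active document.
Sub FindRepeatedPhrases()
    Const PHRASE_LEN As Long = 5          ' assumed minimum phrase length, in words
    Dim txt As String
    Dim raw() As String, tokens() As String
    Dim counts As Object                  ' Scripting.Dictionary (Windows only)
    Dim i As Long, j As Long, n As Long
    Dim phrase As String
    Dim k As Variant

    ' Flatten the document to lower-case, space-separated text
    txt = LCase(ActiveDocument.Range.Text)
    txt = Replace(txt, vbCr, " ")
    txt = Replace(txt, vbTab, " ")
    raw = Split(txt, " ")

    ' Drop the empty tokens left by consecutive separators
    ReDim tokens(0 To UBound(raw))
    For i = 0 To UBound(raw)
        If Len(raw(i)) > 0 Then
            tokens(n) = raw(i)
            n = n + 1
        End If
    Next i
    If n < PHRASE_LEN Then
        MsgBox "The document is shorter than one phrase."
        Exit Sub
    End If

    ' Count every window of PHRASE_LEN consecutive words
    Set counts = CreateObject("Scripting.Dictionary")
    For i = 0 To n - PHRASE_LEN
        phrase = tokens(i)
        For j = 1 To PHRASE_LEN - 1
            phrase = phrase & " " & tokens(i + j)
        Next j
        counts(phrase) = counts(phrase) + 1   ' a missing key starts at Empty, i.e. 0
    Next i

    ' Write the repeated phrases to a new document: count, tab, phrase
    Documents.Add
    For Each k In counts.Keys
        If counts(k) > 1 Then Selection.TypeText counts(k) & vbTab & k & vbCr
    Next k
End Sub
```

Note that the windows overlap, so a repeated passage of seven words shows up as three repeated five-word phrases. The list is a way to spot repetition, not a word count you can use directly.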

I didn't see anything in the above code to limit the words to, say, body text only, which would eliminate the index and captions and ...
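A minimal sketch of such a style filter, assuming the paragraphs to skip can be recognised by their style name. The names in EXCLUDED are only examples (style names differ between templates and language versions of Word), and the macro name is made up. It walks the paragraphs and adds up Word's own word statistic for everything that is not on the list.

```vba
Option Explicit

' Sketch only: counts words paragraph by paragraph, skipping paragraphs
' whose style name appears in EXCLUDED.
Sub CountWordsExcludingStyles()
    ' Example style names, in the same [bracket] format as the Excludes list above
    Const EXCLUDED As String = "[Heading 1][Caption][TOC 1]"
    Dim para As Paragraph
    Dim styleName As String
    Dim n As Long, counted As Long, skipped As Long

    For Each para In ActiveDocument.Paragraphs
        styleName = para.Style            ' yields the style's local name
        n = para.Range.ComputeStatistics(wdStatisticWords)
        If InStr(1, EXCLUDED, "[" & styleName & "]", vbTextCompare) > 0 Then
            skipped = skipped + n
        Else
            counted = counted + n
        End If
    Next para

    MsgBox "Words counted: " & counted & vbCrLf & _
           "Words skipped (excluded styles): " & skipped, vbOKOnly, "Style filter"
End Sub
```

The same InStr test could be added inside the loop of the WordFrequency macro above, but checking the style once per paragraph is cheaper than checking it once per word.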

Zorro032798 (Author) Commented:
The above code is indeed helpful, but it doesn't really solve our problem.

It would be better if there were an application available like the one "dtodd" describes, which looks for a repeating "phrase" of, e.g., a few words.

Thanks for all the comments!

Z. Orro
I know how Trados works, but it isn't limited to a certain number of words: it looks for sentences, and you can even look for complete paragraphs..........

I got my hands on a few of those templates, but it's a lot of work for someone not into linguistics and text processing theory, so it will take some time to get something done

maybe someone else will come up with an easier solution in the meantime

Zorro032798 (Author) Commented:
Bruintje, that's exactly what I'm looking for :-)
I wondered whether there was any shareware that would do the job. I don't want to keep you at work all night long, but if you come up with a solution, I can award a lot more points for it. I really don't know how much work this will be for you, and I appreciate what you have done so far. You are definitely an expert :-)
Hi Zorro,

I was looking for a description of text processing, and in particular of how Trados does it, and came across this one


Of course it's priced the same as Trados (keeping the market profitable) and does the same, but you can download a 30-day evaluation version, and after that, I don't know, maybe reinstalling?

I'll just keep on looking for a simple way to do this in Word, which would make a nice piece of code

OK, I've seen some texts on this, and what they all say, and that's how Trados et al. work, is that they build a sentence/paragraph database. This database is fed by the user!

What this means is that you've got to build your own database of phrases if you really want the same functionality.

but now I need your help a bit here to get to a workable solution
-build an automatic word counter excluding repeated words (got that one already, above)
-build an automatic sentence counter excluding repeated sentences
-build an automatic paragraph counter excluding repeated paragraphs (this one is the trickiest)
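The sentence counter in that list could look roughly like the sketch below. It is an assumption-laden illustration, not how Trados counts: it relies on Word's own idea of what a sentence is (ActiveDocument.Sentences), treats two sentences as repeats only when their text matches exactly after trimming and lower-casing, and uses a late-bound Scripting.Dictionary, which assumes Word on Windows. The macro name is made up.

```vba
Option Explicit

' Sketch only: counts the words in each distinct sentence once and
' totals the words in repeated sentences separately.
Sub CountNonRepeatedSentenceWords()
    Dim seen As Object                    ' Scripting.Dictionary (Windows only)
    Dim sent As Range
    Dim key As String
    Dim n As Long
    Dim uniqueWords As Long, repeatedWords As Long

    Set seen = CreateObject("Scripting.Dictionary")

    For Each sent In ActiveDocument.Sentences
        ' Normalise so case, trailing spaces and paragraph marks do not matter
        key = LCase(Trim(Replace(sent.Text, vbCr, "")))
        If Len(key) > 0 Then
            n = sent.ComputeStatistics(wdStatisticWords)
            If seen.Exists(key) Then
                repeatedWords = repeatedWords + n   ' already seen: no new work
            Else
                seen.Add key, True
                uniqueWords = uniqueWords + n       ' first occurrence
            End If
        End If
    Next sent

    MsgBox "Words in first occurrences: " & uniqueWords & vbCrLf & _
           "Words in repeated sentences: " & repeatedWords, _
           vbOKOnly, "Sentence repetition count"
End Sub
```

Swapping ActiveDocument.Sentences for ActiveDocument.Paragraphs (with a Paragraph loop variable and para.Range in place of sent) would give a first version of the paragraph counter. Near-matches, which the commercial tools also report, are not handled here.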

that's all we can do automatically
-what can also be built is your own database of phrases, built up by you as the user: you add words, sentences etc. to a db, which Word uses as a reference for counting

is this what you want?

simply counting things that aren't bound to anything is a bit difficult and senseless

what do we count in this little text

<this is a just a simple sample sentence>

do we count
-this is
-this is a
-is a

that would be fun for a programmer or a theorist but it doesn't solve your problem

David Todd (Senior DBA) Commented:
Hi Zorro,

From what I read of the exchange between you and Bruintje and what you are asking for, you need to get a programmer (or Word macro specialist) to write it for you.

With the example above of how many words to match on, you need some real programming in Word.

If you can start to record the sort of macro I outlined above (select the first sentence, do a find ...) then you are on the way to programming the solution yourself. The advantage of starting from a recorded macro is that it will find many of the objects itself.

But the kinds of results that you want are fairly complex and will require tuning the code as it's developed ...

Kind Regards
Hi David,

why can't we do that here, it will be a long thread but it would be fun wouldn't it?

David Todd (Senior DBA) Commented:
Hi Brian,

Well ... maybe a few more points would help ...

Let us know how you get on recording - select the sentence, find the selected text ...

And post the code ...

Like Brian says, it could be fun if we've all got the time ...

Zorro032798 (Author) Commented:
Brian & David, I can see there is a lot of goodwill here, but I don't want to keep you working for me. I can imagine it will take some effort to develop such a function. After all, I was just asking whether there was some utility somewhere on the net to do the job.
I can award you a few hundred points, that's not the problem. It's just that I think developing it will take us very far.
If you feel this is a challenge for you, then we can arrange something, but, again, I don't want to take all of your time.
Thanks, boys
Hi Zorro/David, I have to be honest: it's going to take some time to develop something like this, and that time isn't available for such a grand task

At least make this a PAQ; there's some good info in it, though it only answers the title, not the question

David Todd (Senior DBA) Commented:
Hi Zorro/Brian,

I agree. I don't have the time for this project either.

Zorro032798 (Author) Commented:
Hi boys, thanks for the good help.
David, I hope you can accept that I am giving the points for this question to Brian. But I want to thank you for your input too.
Question has a verified solution.
