Solved

Word count excluding repeated text

Posted on 2002-04-24
Last Modified: 2008-02-01
Is there any software, shareware, freeware to do a word count on textfiles (doc, txt, rtf) that excludes text repetitions, such as titles or legal notices?
The ideal would be that repetitions are marked and that certain styles (such as heading 1 for example) can be excluded from the count result.
Tnx for your hints.
Question by:Zorro032798
20 Comments
 
LVL 44

Expert Comment

by:bruintje
Hi Zorro,

since you're asking this in the Office section
-Word has properties to do word counts
-excluding repetitions: i don't know, i'd have to check
-you could build some code to do your second part
-but it would be huge since it's not one word but all words in a document so you got to work with lists in memory etc....not a pleasant thought

what's the need for this? is this to be processed manually or something?

:O)Bruintje

Author Comment

by:Zorro032798
Hi Bruintje,

This function is offered in Trados, a translators help software. It counts only the words that really need to be translated. If a section or a sentence is repeated it does not have to be translated again, since that is already done.
In some documents repetitions can be significant, sometimes up to 20%. In documents of a few hundred pages this can have a strong impact on deadlines and workload.
That's why I want to do this count.

If you mean that some VBA code can be written to solve the problem, I'm ready to grant this question more points.

CU

LVL 44

Expert Comment

by:bruintje
Glad i know some people here work with Trados, so i know the software a bit

but now i don't understand it
-you said the function is offered in Trados?
-do you need it outside Trados?
-purely for cost accounting / workload purposes so you can calculate beforehand what's going to be translated?

:O)Bruintje

Author Comment

by:Zorro032798
yep, that's the idea.
I don't have Trados myself and buying the software just for this function seems a little bit stupid :-)

LVL 44

Accepted Solution

by:
bruintje earned 80 total points
OK, this is not easy because it's pretty specific. i found something that did exactly this except for the filtering, but the guy's mailbox was over its limit so my mail bounced.....

thus we can come up with another solution and because it gives you something to work with i guess it's a good alternative

the count of unique words and the list with the count of each word

-open a new word document
-open the VB Editor with ALT+F11
-then choose in the upper left pane the normal icon
-then insert a new module from the menu
-now paste the code

Option Explicit

'Here 's a macro that works pretty well.  It displays a message box asking
'whether you want the word count sorted by frequency or in alphabetic
'order, then creates a new document with a list of all the words.

Sub WordFrequency()
         Dim SingleWord As String           'Raw word pulled from doc
         Const maxwords = 9000              'Maximum unique words allowed
         Dim Words(maxwords) As String      'Array to hold unique words
         Dim Freq(maxwords) As Long         'Frequency counter (Long avoids overflow past 32767)
         Dim WordNum As Integer             'Number of unique words
         Dim ByFreq As Boolean              'Flag for sorting order
         Dim ttlwds As Long                 'Total words in the document
         Dim Excludes As String             'Words to be excluded
         Dim Found As Boolean               'Temporary flag
         Dim j As Integer, k As Integer, l As Integer   'Loop counters, each typed explicitly
         Dim Temp As Long                   'Swap buffer for frequencies
         Dim tword As String                'Swap buffer for words
         Dim Ans As String                  'Answer to the sort-order prompt
         Dim aword                          'Current word (Variant, used by For Each)
         Dim tmpName As String              'Template path for the results document

         ' Set up excluded words
         Excludes = "[the][a][of][is][to][for][this][that][by][be][and][are]"

         ' Find out how to sort
        ByFreq = True
        Ans = InputBox$("Sort by WORD or by FREQ?", "Sort order", "FREQ")
         If Ans = "" Then Exit Sub          'Cancelled: leave without ending all running code
         If UCase(Ans) = "WORD" Then
             ByFreq = False
         End If

         Selection.HomeKey Unit:=wdStory
         System.Cursor = wdCursorWait
         WordNum = 0
         ttlwds = ActiveDocument.Words.Count

         ' Control the repeat
         For Each aword In ActiveDocument.Words
             SingleWord = Trim(LCase(aword))
             If SingleWord < "a" Or SingleWord > "z" Then SingleWord = ""   'Out of range?
             If InStr(Excludes, "[" & SingleWord & "]") Then SingleWord = "" 'On exclude list?
             If Len(SingleWord) > 0 Then
                 Found = False
                 For j = 1 To WordNum
                     If Words(j) = SingleWord Then
                         Freq(j) = Freq(j) + 1
                         Found = True
                         Exit For
                     End If
                 Next j
                 If Not Found Then
                     WordNum = WordNum + 1
                     Words(WordNum) = SingleWord
                     Freq(WordNum) = 1
                 End If
                 If WordNum > maxwords - 1 Then
                     j = MsgBox("The maximum array size has been exceeded. Increase maxwords.", vbOKOnly)
                     Exit For
                 End If
             End If
             ttlwds = ttlwds - 1
             StatusBar = "Remaining: " & ttlwds & "     Unique: " & WordNum
         Next aword

         ' Now sort it into word order
         For j = 1 To WordNum - 1
             k = j
             For l = j + 1 To WordNum
                 If (Not ByFreq And Words(l) < Words(k)) Or (ByFreq And Freq(l) > Freq(k)) Then k = l
             Next l
             If k <> j Then
                 tword = Words(j)
                 Words(j) = Words(k)
                 Words(k) = tword
                 Temp = Freq(j)
                 Freq(j) = Freq(k)
                 Freq(k) = Temp
             End If
             StatusBar = "Sorting: " & WordNum - j
         Next j

         ' Now write out the results
         tmpName = ActiveDocument.AttachedTemplate.FullName
         Documents.Add Template:=tmpName, NewTemplate:=False
         Selection.ParagraphFormat.TabStops.ClearAll
         With Selection
             For j = 1 To WordNum
                 .TypeText Text:=Trim(Str(Freq(j))) & vbTab & Words(j) & vbCrLf
             Next j
         End With
         System.Cursor = wdCursorNormal
         j = MsgBox("There were " & Trim(Str(WordNum)) & " different words ", vbOKOnly, "Finished")
         Selection.HomeKey wdStory

End Sub

--------------------------------------------
copyright notice
this code is originally written by Larry > Larry328@att.net
i changed a few declarations but it was his code ;)
--------------------------------------------

-choose save
-close the editor
-back in word again open a document you want to handle
-now choose ALT+F8 and choose the "WordFrequency" macro
-ahhhhhhhh have fun

for your styles filtering that's difficult, and for the exclusion of words look at this line of code

Excludes = "[the][a][of][is][to][for][this][that][by][be][and][are]"

and change it

if you want a listbox or something that will be some real work as it looks like it now already :)
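
For the style filtering mentioned above, one possible starting point is to walk the paragraphs and skip those formatted in the unwanted style. This is only a sketch under assumptions: the macro name CountWordsSkippingStyle is made up for illustration, Heading 1 is assumed to be the style to leave out, and the count comes from Word's own ComputeStatistics rather than from the word-frequency logic in the macro above.

```vba
Option Explicit

' Sketch: total word count of the active document, ignoring every
' paragraph formatted with the built-in Heading 1 style.
Sub CountWordsSkippingStyle()
    Dim para As Paragraph
    Dim skipName As String
    Dim total As Long

    ' Look up the local name so this also works in non-English Word.
    skipName = ActiveDocument.Styles(wdStyleHeading1).NameLocal

    For Each para In ActiveDocument.Paragraphs
        If para.Style <> skipName Then
            ' Word's built-in word count for this paragraph only.
            total = total + para.Range.ComputeStatistics(wdStatisticWords)
        End If
    Next para

    MsgBox "Words outside '" & skipName & "': " & total, vbOKOnly, "Finished"
End Sub
```

To exclude a different style, swap wdStyleHeading1 for another built-in constant or assign the style's name to skipName directly.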

HTH:O)Bruintje

LVL 44

Expert Comment

by:bruintje
btw finding this kind of code is very rare. the guy didn't put anything under it, only an email addy........that's open source

Author Comment

by:Zorro032798
This is working very nice indeed, but not exactly what I'm looking for. I want to find the really repeating sections in a document. So the application has to look for sentences and text parts that come back. As you know Trados does this kind of job.
Nevertheless this is a nice piece of code indeed :-)

LVL 35

Expert Comment

by:David Todd
Hi Zorro,

I imagine that finding repeating sections in a document is fairly heavy work.

I'd suggest using a minimum-number-of-words variable that would constitute the smallest section, say 5 or 10, then select the next 10 words and do a find ...

I didn't see anything in the above code to limit the words to say body-text only which would eliminate the index and captions and ...
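
The minimum-section idea can be sketched without Find by sliding a fixed-size window of words over the document text and remembering which windows have been seen before. Everything here is an assumption made for illustration: the macro name CountRepeatedPhrases, the window size of 5 words, and the use of Scripting.Dictionary (available on Windows only). Because the windows overlap, one long repeated passage shows up as several repeated windows, so the result is a rough indicator rather than a translator's word count.

```vba
Option Explicit

' Sketch: count how many distinct MinWords-word phrases occur more
' than once in the active document.
Sub CountRepeatedPhrases()
    Const MinWords As Long = 5          ' assumed smallest "section" size
    Dim seen As Object
    Dim tokens() As String
    Dim body As String, phrase As String
    Dim i As Long, j As Long, repeats As Long
    Dim k As Variant

    Set seen = CreateObject("Scripting.Dictionary")

    ' Normalise the text: lower case, single spaces, no paragraph marks.
    body = LCase(ActiveDocument.Content.Text)
    body = Replace(body, vbCr, " ")
    body = Replace(body, vbTab, " ")
    Do While InStr(body, "  ") > 0
        body = Replace(body, "  ", " ")
    Loop
    tokens = Split(Trim(body), " ")

    ' Build every window of MinWords consecutive words and tally it.
    For i = 0 To UBound(tokens) - MinWords + 1
        phrase = tokens(i)
        For j = 1 To MinWords - 1
            phrase = phrase & " " & tokens(i + j)
        Next j
        If seen.Exists(phrase) Then
            seen(phrase) = seen(phrase) + 1
        Else
            seen.Add phrase, 1
        End If
    Next i

    ' Count the windows that were seen at least twice.
    For Each k In seen.Keys
        If seen(k) > 1 Then repeats = repeats + 1
    Next k

    MsgBox repeats & " distinct " & MinWords & _
           "-word phrases occur more than once.", vbOKOnly, "Finished"
End Sub
```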

Regards
  David

Author Comment

by:Zorro032798
the above code is indeed helpful, but that doesn't really solve our problem.

It would be better if there was an application available like the one "dtodd" describes, one that looks for a repeating "phrase" of, e.g., a few words.

Thanks for all the comments!

Z. Orro

LVL 44

Expert Comment

by:bruintje
i know how Trados works, but it isn't limited to a certain number of words. it looks for sentences, and you can even look for complete paragraphs..........

got my hands on a few of those templates but it's a lot of work for someone not into linguistics and text processing theory so it will take some time to get something done

maybe someone else will come up with an easier solution in the meantime

:O)Bruintje

Author Comment

by:Zorro032798
Bruintje, that's exactly what I'm looking for :-)
I wondered if there wasn't any shareware that would do the job. I'm not willing to keep you at work all night long, but if you would come up with a solution, I can grant it with a lot more points. I really don't know how much work this will represent for you and I appreciate what you have done so far. Definitely you are an expert :-)
Tnx,
zorrO)

LVL 44

Expert Comment

by:bruintje
Hi Zorro,

Was looking for a description of text processing, and in particular how Trados does it, and came across this one

http://www.atril.com/

of course it's priced the same as Trados (keeping the market profitable) does the same but you can download a 30-day evaluation version, and then i don't know maybe reinstalling?

i'll just keep on looking for a simple way to do this in Word, which would make a nice piece of code

:O)Bruintje

LVL 44

Expert Comment

by:bruintje
OK i've seen some texts on this and what they all say, and that's how Trados et al work, is that they build a sentence/paragraph database. this database is fed by the user!

what this means is that you've got to build your own database of phrases if you really want the same functionality.

but now i need your help a bit here on a workable solution
-build an automatic word counter excluding repeating words (got that one already above)
-build an automatic sentence counter excluding repeating sentences
-build an automatic paragraph counter excluding repeating paragraphs (this one is the trickiest)
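
The sentence counter in that list could look roughly like the sketch below. It is not the Trados method, just an illustration under assumptions: the names CountWords and CountUniqueSentenceWords are invented, sentences are whatever Word's own Sentences collection returns, two sentences count as a repetition only if they match exactly apart from case, and Scripting.Dictionary is used for lookups (Windows only). Repeated sentences are highlighted in yellow, so try it on a copy of the document.

```vba
Option Explicit

' Helper: count whitespace-separated words in a string.
Private Function CountWords(ByVal s As String) As Long
    Dim parts() As String
    Dim i As Long
    s = Replace(s, vbCr, " ")
    s = Replace(s, vbLf, " ")
    s = Replace(s, vbTab, " ")
    parts = Split(s, " ")
    For i = LBound(parts) To UBound(parts)
        If Len(parts(i)) > 0 Then CountWords = CountWords + 1
    Next i
End Function

' Sketch: count words per sentence and report how many of them sit in
' sentences that already appeared earlier in the document.
Sub CountUniqueSentenceWords()
    Dim seen As Object
    Dim sentence As Range
    Dim sentenceKey As String
    Dim n As Long
    Dim totalWords As Long, repeatedWords As Long

    Set seen = CreateObject("Scripting.Dictionary")

    For Each sentence In ActiveDocument.Sentences
        ' Normalise so trailing spaces and paragraph marks do not matter.
        sentenceKey = LCase(Trim(Replace(sentence.Text, vbCr, " ")))
        n = CountWords(sentenceKey)
        If n > 0 Then
            totalWords = totalWords + n
            If seen.Exists(sentenceKey) Then
                repeatedWords = repeatedWords + n
                sentence.HighlightColorIndex = wdYellow   ' mark the repetition
            Else
                seen.Add sentenceKey, 1
            End If
        End If
    Next sentence

    MsgBox "Total words: " & totalWords & vbCrLf & _
           "In repeated sentences: " & repeatedWords & vbCrLf & _
           "Left to translate: " & (totalWords - repeatedWords), _
           vbOKOnly, "Finished"
End Sub
```

The same loop run over ActiveDocument.Paragraphs (using each paragraph's Range) would give the paragraph variant.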

that's all we can do automatically
-what can also be built is your own database of phrases, built up by you as the user, where you add words, sentences etc. to a db which Word uses as a reference for counting

is this what you want?

simply counting things that aren't bounded to anything is a bit difficult and senseless

what do we count in this little text

<this is a just a simple sample sentence>

do we count
-this
-this is
-this is a
-is a
etc.....

that would be fun for a programmer or a theorist but it doesn't solve your problem

:O)Bruintje

LVL 35

Expert Comment

by:David Todd
Hi Zorro,

From what I read of the exchange between you and Bruintje and what you are asking for, you need to get a programmer (or Word Macro specialist) to write it for you.

With the example above of how many words to match on, you need some real programming in Word.

If you can start to record the sort of macro I outlined above - select the first sentence, do a find ... then you are on the way to programming the solution yourself. The advantage of starting from a recorded macro is that it will find many of the objects itself.

But the kinds of results that you want are fairly complex and will require tuning the code as it's developed ...
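
A hand-written equivalent of the "select the first sentence, do a find" recording might look like the sketch below. It is an assumption-laden illustration, not recorded output: the macro name CountFirstSentenceOccurrences is invented, only the first sentence is checked, and the sentence is skipped if it is longer than 255 characters because Find.Text does not accept longer strings. A sentence containing "^" may also misbehave, since Find treats that character as the start of a special code.

```vba
Option Explicit

' Sketch: count how often the first sentence of the active document
' occurs in the whole document (the first sentence itself included).
Sub CountFirstSentenceOccurrences()
    Dim target As String
    Dim rng As Range
    Dim hits As Long

    ' Take the first sentence without its paragraph mark or trailing space.
    target = Trim(Replace(ActiveDocument.Sentences(1).Text, vbCr, ""))
    If Len(target) = 0 Or Len(target) > 255 Then
        MsgBox "First sentence is empty or too long for Find.", vbOKOnly
        Exit Sub
    End If

    Set rng = ActiveDocument.Content
    With rng.Find
        .ClearFormatting
        .Text = target
        .Forward = True
        .Wrap = wdFindStop          ' stop at the end instead of wrapping
        .MatchCase = False
        .MatchWildcards = False
        Do While .Execute
            hits = hits + 1
            rng.Collapse wdCollapseEnd   ' continue after this match
        Loop
    End With

    MsgBox "The first sentence occurs " & hits & " time(s).", _
           vbOKOnly, "Finished"
End Sub
```

Running a separate Find for every sentence gets slow on long documents, which is one reason to tune the code as the thread suggests.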

Kind Regards
  David

LVL 44

Expert Comment

by:bruintje
Hi David,

why can't we do that here, it will be a long thread but it would be fun wouldn't it?

Brian

LVL 35

Expert Comment

by:David Todd
Hi Brian,

Well ... maybe a few more points would help ...

Zorro:
Let us know how you get on recording - select the sentence, find the selected text ...

And post the code ...

Like Brian says, it could be fun if we've all got the time ...

Regards
  David

Author Comment

by:Zorro032798
Brian & David, I can see there is a lot of goodwill here, but I don't want to keep you working for me. I can imagine it will take some effort to develop such a function. After all I was just asking if there wasn't some utility somewhere on the net to do the job.
I can grant you with a few 100 points, that's not the problem. It's just that I think it will take us very far to develop it.
If you feel this a challenge for you, then we can arrange something, but - again - I don't want to take all of your time.
Tnx, boys

LVL 44

Expert Comment

by:bruintje
Hi Zorro/David, i've to be honest it's going to take some time to develop something like this, and that time isn't available for such a grand task

at least make this a PAQ, there's some good info in it though it only answers the title, not the question

Brian

LVL 35

Expert Comment

by:David Todd
Hi Zorro/Brian,

I agree. I don't have the time for this project either.

Regards
  David

Author Comment

by:Zorro032798
Hi boys, tnx for the good help.
David, I hope you can accept that I give the points for this q to Brian. But I want to thank you for your input too.