Solved

Real Word Count Utility

Posted on 1998-10-13
Last Modified: 2006-11-17
Does anyone have (know of) a tool or program that will allow me to count the number of words in an HTML file?  I don't want all the tags counted, just the words that would appear in the browser.  

Our situation here is that we have documents coded in HTML that have to be translated into other languages.  The rate for translations comes down to a cost per word.  It would be beneficial if we had a utility that would examine all of the HTML files in a given directory tree and give us a total word count that applies to the words that need to be translated.

Any help is greatly appreciated.

Lankford
Question by:lankford
14 Comments
 
LVL 4

Expert Comment

by:raoool
ID: 1838703
What platform?

TextPad (Wintel, www.textpad.com) has a word count that could probably be scripted (it has its own macro facility) to run through a batch of files. The tags could be removed by doing a Save As text from a browser (or, likely, from your HTML editor). On a Mac, BBEdit has a 'remove tags' function, and I think it has a word count too.

Hope this helps.
 

Expert Comment

by:talley
ID: 1838704
We use Translation Manager V2.0 from IBM.
It has markup tables that handle the tags and do not include them in word counts.

also...
sed 's/<[^>]*>//g' index.html  will strip out tags.
 
LVL 84

Expert Comment

by:ozo
ID: 1838705
that sed script could be fooled by comments, scripts, and tags spanning more than one line
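
One way around those pitfalls is to use a real HTML parser instead of a regular expression. Below is a minimal sketch in Python, assuming the standard-library html.parser module is acceptable. The names VisibleTextCounter and count_visible_words are made up for illustration, and a "word" here is simply a whitespace-separated token, which may not match how a translation vendor counts. Note that text in <title> is counted too.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>.

    Comments are routed to handle_comment (not overridden here), so they
    never reach handle_data. Tags that span several lines are handled by
    the parser itself.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def word_count(self):
        # Chunks are joined with a space, so a word broken by an inline
        # tag (for example <b>Hel</b>lo) is counted as two tokens.
        return len(" ".join(self._chunks).split())


def count_visible_words(html_text):
    """Return the number of whitespace-separated tokens in the visible text."""
    parser = VisibleTextCounter()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return parser.word_count()
```

Calling count_visible_words on the contents of a file gives a count that ignores tags, comments, and script or style blocks.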

 

Author Comment

by:lankford
ID: 1838706
Let me open this up again.  I have heard that the word count feature in Microsoft Word will count the words correctly in an HTML document.  Can this function be used inside a Word macro and be applied to each of the files in a given directory?

Lankford
 
LVL 28

Expert Comment

by:sybe
ID: 1838707
It could be written in Perl quite easily. I bet that ozo (he's very active in the Perl area) would not need more than half an hour to make it: scan a whole directory at once, and for each file give the number of words, excluding the HTML tags.
It would take me a lot more time, but I would still choose to write such a thing in Perl.
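
For comparison, here is roughly what such a script could look like, sketched in Python rather than Perl. Everything in it is an assumption for illustration: the names count_file and count_tree are hypothetical, the files are assumed to be readable as Latin-1, and a "word" is any whitespace-separated token outside <script> and <style> blocks.

```python
import os
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Counts tokens in text nodes, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self._hidden = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        if not self._hidden:
            self.words += len(data.split())


def count_file(path, encoding="latin-1"):
    """Return the visible word count of one HTML file (encoding is assumed)."""
    with open(path, encoding=encoding, errors="replace") as fh:
        parser = _TextOnly()
        parser.feed(fh.read())
        parser.close()
    return parser.words


def count_tree(root, extensions=(".htm", ".html")):
    """Return ({path: words}, total) for matching files under root."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare in lower case so INDEX.HTM and index.htm both match.
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                counts[path] = count_file(path)
    return counts, sum(counts.values())
```

A caller would unpack the result, for example counts, total = count_tree("C:/SomeDirectory"), then print each path with its count followed by the total to price the translation.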
 

Expert Comment

by:talley
ID: 1838708
In my tests it was never fooled by tags spanning more than one line, as long as they begin with < and end with >.
I do agree that it is not the best answer. The best answer is to use TM2, but most people want things for free and dislike paying for software.
 
LVL 4

Expert Comment

by:mcix
ID: 1838709
You could use  

wordCount = ActiveDocument.ComputeStatistics(Statistic:=wdStatisticWords)

to obtain an accurate word count for an HTML document using Word.

However, this method is terribly inefficient.

There is a better way using C++. Look at this article:

http://support.microsoft.com/support/kb/articles/q186/8/98.asp


 
LVL 4

Expert Comment

by:mcix
ID: 1838710
It looks something like this in Word:

Public Sub ComputeWordCounts(vstrPath As String, vstrSpec As String)
    ' Opens each file in vstrPath that matches vstrSpec and reports its word count.
    Dim mwrdDocument As Word.Document
    Dim mstrCurrentFile As String
    Dim lngWordCount As Long

    mstrCurrentFile = Dir(vstrPath & vstrSpec)
    Do While mstrCurrentFile <> ""
        Set mwrdDocument = Documents.Open(vstrPath & mstrCurrentFile)
        lngWordCount = mwrdDocument.ComputeStatistics(Statistic:=wdStatisticWords)
        MsgBox mwrdDocument.Name & " has " & lngWordCount & " words"
        mwrdDocument.Close wdDoNotSaveChanges
        mstrCurrentFile = Dir   ' next matching file, or "" when done
    Loop
End Sub

' Call it from another procedure or from the Immediate window:
ComputeWordCounts "C:\SomeDirectory\", "*.HTM"

This will be slow, because Word will open each file that meets the criteria.

 
 

Author Comment

by:lankford
ID: 1838711
mcix, that works for me.  Change your comment to an answer and you get the points.

Lankford
 
LVL 4

Accepted Solution

by:
mcix earned 50 total points
ID: 1838712
Glad it works...

Mark
 

Author Comment

by:lankford
ID: 1838713
I liked your response.  

The only significant change I made was to use the Application.FileSearch object instead of the Visual Basic Dir command.  FileSearch is a little more flexible.  I was able to specify files matching *.htm and *.html and have it search subdirectories too.

It is slow, but everyone around here who needs the function already has Word and can let this run at night or whenever.

Thanks again for the response.

Lankford
 
LVL 4

Expert Comment

by:mcix
ID: 1838714
Do you mind sharing your code?
 

Author Comment

by:lankford
ID: 1838715
What is your e-mail address?  I'll mail the entire DOC file to you.

Lankford
 
LVL 4

Expert Comment

by:mcix
ID: 1838716
E-Mail is:

marko_justus@hotmail.com

Thanks,

Mark

Question has a verified solution.