Does anyone have, or know of, a tool or program that will let me count the number of words in an HTML file? I don't want the tags counted, just the words that would appear in the browser.
Our situation here is that we have documents coded in HTML that have to be translated into other languages. The rate for translations comes down to a cost per word. It would be beneficial if we had a utility that would examine all of the HTML files in a given directory tree and give us a total word count that applies to the words that need to be translated.
TextPad (Wintel -- www.textpad.com) has a word count which could probably be scripted (has its own macro) to run through a batch of files. The removing tags thing could be done with a SaveAs text from a browser (or also, likely, your HTML editor). BBEdit on a Mac has a remove tags function.
Hope this helps.
talley commented:
We use Translation Manager V2.0 from IBM.
It has markup tables that handle the tags and do not include them in word counts.
also...
sed 's/<[^>]*>//g' index.html will strip out tags.
That sed script could be fooled by comments, scripts, and tags that span more than one line.
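For what it's worth, here is a minimal sketch of one way around those three cases, using Python's standard html.parser module instead of sed. The names VisibleTextParser and count_visible_words are made up for this example, and it assumes that "words" simply means whitespace-separated runs of text outside script and style elements. Text on either side of a tag is counted separately, so a word split by an inline tag would count as two.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text found outside <script> and <style> elements.

    Hypothetical helper for illustration only. Comments are dropped
    because HTMLParser routes them to handle_comment, which does
    nothing by default.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; tags may span several lines.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside script or style.
        if self.skip_depth == 0:
            self.chunks.append(data)


def count_visible_words(html):
    """Return the number of whitespace-separated words outside tags."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Assumption: chunks are joined with a space, so text separated
    # by any tag is treated as separate words.
    return len(" ".join(parser.chunks).split())
```

Note that this counts the text of the title element as well, which may or may not be wanted for a translation estimate.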
Let me open this up again. I have heard that the word count feature in Microsoft Word will count the words correctly in an HTML document. Can this function be used inside a Word macro and applied to each of the files in a given directory?
It could be written in Perl quite easily. I bet that ozo (he's very active in the Perl area) would not need more than half an hour to write it: scan a whole directory at once and, for each file, give the number of words, excluding the HTML tags.
It would take me a lot more time, but still I would choose to write such a thing in perl.
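To give an idea of the shape of such a script, here is a rough sketch in Python rather than Perl. It is not anyone's tested tool: the names count_words_in_file and count_words_in_tree are invented for this example, and it assumes the files are UTF-8 (undecodable bytes are replaced), that only *.htm and *.html files matter, and that text inside script and style elements should not be counted.

```python
import os
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Counts words in text outside <script> and <style> (illustrative)."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.hidden = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden:
            self.hidden -= 1

    def handle_data(self, data):
        # Each run of text between tags is split on whitespace.
        if not self.hidden:
            self.words += len(data.split())


def count_words_in_file(path):
    """Return the word count for one HTML file (assumes UTF-8)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        collector = _TextCollector()
        collector.feed(handle.read())
        collector.close()
    return collector.words


def count_words_in_tree(root, extensions=(".htm", ".html")):
    """Walk root recursively and map each HTML file path to its count."""
    counts = {}
    for folder, _subdirs, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                path = os.path.join(folder, name)
                counts[path] = count_words_in_file(path)
    return counts
```

Calling count_words_in_tree on the top-level directory returns a dictionary of per-file counts, and summing its values gives the total to multiply by the per-word translation rate.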
talley commented:
In my test it was never fooled by tags spanning more than one line, beginning with < and ending with >.
I do agree that it is not the best answer. The BEST answer is to use TM2, but most people want things for free and dislike paying for software.
The only significant change I made was to use the Application.FileSearch object instead of the Visual Basic Dir command. FileSearch is a little more flexible: I was able to specify files matching *.htm and *.html and have it search subdirectories too.
It is slow, but everyone around here who needs the function already has Word and can let this run at night or whenever.