riverguy

Selection.Words.Count in Microsoft Word

I need a function that returns the count of words in a string, equivalent to Word 2013's Selection.Words.Count property. The Word 2013 property appears to treat non-alphanumeric characters differently when more than one occurs in sequence. The following shows the selection in the first column and Words.Count in the second column:
Selection              Words.Count   Notes
Word                   1
.Word                  2
.Word.                 3             Preceding and following periods count as words
Word/                  2             A following / (or other non-alphanumeric, non-whitespace character) counts
Word/Second            3             / counts as 1 word
Word//Second           3             but 2 /s count as only 1 word
Word  Second           2             2 spaces count as 1 delimiter (0 words)
Word<tab>Second        3             A tab counts as a word
Word<tab><tab>Second   4             2 tabs count as 2 words
Word./Second           3             Period and slash count as 1 word
Word.Second            3             1 period counts as 1 word
Word..                 2             2 periods count as 1 word
Word…                  2             3 periods count as 1 word
..Word                 2             2 preceding periods count as 1 word
Word…Second            3             3 periods between count as 1 word

Is there a simple way to duplicate this behavior with a string without having to use a selection in VBA or VB.Net?
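
For illustration, here is a minimal sketch of a string-only counter that reproduces every row of the sample table. It is not Word's actual algorithm. The function name ApproxWordCount and the three rules are assumptions inferred from the table alone: a run of letters/digits is one word, each tab is one word, and a run of other non-whitespace characters is one word. Cases the table does not cover (apostrophes, hyphens, decimal numbers, paragraph marks, non-ASCII letters) are untested and may differ from Word.

```vb
' Approximates Selection.Words.Count for a plain string.
' Assumption: the rules below are inferred only from the sample table;
' they are not confirmed to be the rules Word itself uses.
Public Function ApproxWordCount(ByVal phrase As String) As Long
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")   ' late-bound, no reference needed
    re.Global = True                           ' find every match, not just the first
    ' run of letters/digits | one tab | run of other non-whitespace characters
    re.Pattern = "[A-Za-z0-9]+|\t|[^A-Za-z0-9\s]+"
    ApproxWordCount = re.Execute(phrase).Count
End Function
```

In the Immediate window, ?ApproxWordCount("Word./Second") gives 3 and ?ApproxWordCount("Word" & vbTab & vbTab & "Second") gives 4, matching the corresponding table rows.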
aikimark

Since there is a Words collection on a range, why not use that instead of duplicating it in a user-defined function? You could have a non-visible document used by your code to instantiate your range object. The Selection would not change.
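
A rough sketch of that idea, assuming the code runs inside Word VBA. The names scratchDoc and WordCountViaHiddenDoc are hypothetical, and the handling of the final paragraph mark is an assumption that should be checked against the sample table before relying on it.

```vb
' Hypothetical module-level cache so the hidden document is created only once
Private scratchDoc As Document

Public Function WordCountViaHiddenDoc(ByVal phrase As String) As Long
    Dim r As Range
    If scratchDoc Is Nothing Then
        ' Visible:=False keeps the new document off screen
        Set scratchDoc = Documents.Add(Visible:=False)
    End If
    scratchDoc.Content.Text = phrase
    ' Assumption: Word keeps a final paragraph mark after the inserted text,
    ' so shrink the range by one character to leave it out of the count
    Set r = scratchDoc.Content
    r.MoveEnd Unit:=wdCharacter, Count:=-1
    WordCountViaHiddenDoc = r.Words.Count
End Function
```

Because this works on a Range in a separate document, the user's Selection in the active document is left alone. An empty phrase is an edge case this sketch does not handle.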
riverguy

ASKER

Yes, I understand that I can do it that way. But I need to call it from a procedure that does a table lookup on "tag" phrases of up to 5 words each in fairly large documents, which already takes some time to run. I was hoping to avoid the extra overhead of doing what you suggest. I was thinking there is probably a function that Word uses itself and hoping to be able to call that function or, failing that, reproduce it.

Probably unnecessary, but the details of what I'm doing are: I locate each ":" in a document, then back up 5 words and look the phrase up in a table; if it is found, I format the phrase (as a tag to highlight/categorize the following content); if not, I check the previous 4 words, then 3, 2, 1. Therefore I would need to call such a function many hundreds of times in a document. Your suggestion is a fallback in case I can't identify or recreate a faster method.
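
The loop described above could be sketched roughly as follows in Word VBA. Everything here is illustrative: MarkTags is a hypothetical name, tags is assumed to be a Scripting.Dictionary keyed by lower-case tag text without the colon (standing in for the recordsets), and yellow highlighting stands in for whatever formatting is actually applied.

```vb
' Sketch only: for each colon, try the preceding 5, 4, 3, 2, 1 words
' against a dictionary of known tags and highlight the first match.
Public Sub MarkTags(ByVal doc As Document, ByVal tags As Object)
    Dim hit As Range, phrase As Range
    Dim n As Long
    Set hit = doc.Content
    With hit.Find
        .ClearFormatting
        .Text = ":"
        .Forward = True
        .Wrap = wdFindStop              ' stop at the end of the document
    End With
    Do While hit.Find.Execute           ' hit now covers the next colon
        For n = 5 To 1 Step -1
            Set phrase = hit.Duplicate
            phrase.Collapse wdCollapseStart     ' position just before the colon
            phrase.MoveStart wdWord, -n         ' back up n of Word's "words"
            If tags.Exists(LCase$(Trim$(phrase.Text))) Then
                phrase.HighlightColorIndex = wdYellow   ' placeholder formatting
                Exit For
            End If
        Next n
        hit.Collapse wdCollapseEnd      ' continue searching after this colon
    Loop
End Sub
```

Because the candidate text is compared directly against the dictionary, this variant does not need a stored word count per tag; it only relies on Word's own notion of a word when backing up.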
Is this an approach to plagiarism detection?

I think you will get much better results if you state your actual problem, not just the obstacles you face with your chosen algorithm or approach.
No, not looking for plagiarism. I'm categorizing content in medical records to identify and categorize information such as reflexes, tenderness, medications, etc., in a way similar to what the Senate Watergate Committee staff used in analyzing voluminous records. (I know, ancient history, but software was developed for that and I have used it.) Tags, as defined in this app, consist of 1 to 5 words (could be more) and are stored in a table. I identify tags in the documents by a colon suffix, since a colon is frequently used for headings such as History:, Diagnosis:, etc., and many of the tags used regularly in medical records overlap the set of tags that I am interested in for a particular case. Rather than going through each of the several hundred tags in the table and determining whether the text preceding each colon contains that text, I have separated the tags using SQL into recordsets by word count. Starting with 5 words, I look up the 5 words preceding each colon in the recordset for a match, then 4, 3, etc. Since Word's Words.Count property treats word delimiters differently depending on whether the delimiter is a space, hyphen, period, slash, etc., I wanted the word count stored in the table to be identical to what Word produces. I already use a blank document for that whenever I add a new tag to the table, and that works. I know that I could instead compare the tag with the rightmost portion of the text preceding a colon, which I think would be less efficient, although frankly I haven't speed tested it. But also, partly from intellectual curiosity, I wanted to find out if anyone knows the algorithm that Word uses or if there is a Windows function that can call it.
Is your text still in the database or have you extracted it to some flat files or Word documents?

It would be helpful to have some sample sentences/paragraphs/documents and the various phrases you are trying to detect.
ASKER CERTIFIED SOLUTION
aikimark
Although the problem is solved I would still like to know if there is a way to call Microsoft's function used in returning the Words.Count property.
...if there is a way to call Microsoft's function used in returning the Words.Count property
I'm not sure I understand.

The title of this question implies that you already know how to get to the Selection.Words.Count property.

Is the question one of run-time environment VBA inside Word vs Word automation (VBScript, Excel, Access, Powershell, etc) or something else?
Something else. I want to call it independently of Microsoft Word, for example from VB.Net, without having to open a Word document, insert the text into a new selection, and then retrieve the word count. I haven't yet tested the regex you supplied, but I need it to exactly replicate the Words.Count property in Word. I suspect that somewhere there is an API call that can be used. I guess it's a matter of curiosity, since I can get the same results with a Word document, but if anyone is familiar with the internal function that Microsoft uses and it is accessible, I'm interested in that. But thanks for all your help.
If you have a reliable source of word parsing, relative to a trailing colon character, why do you need to exactly replicate what Word does?

Have you looked at the OpenXML SDK?
When there are periods, hyphens, and other punctuation characters, Word handles them differently, as I showed in the sample. If the word count I assign to a phrase with my own non-Word function does not exactly match how Word counts when I call, say, .MoveStart wdWord, -5, then a phrase that should match doesn't. My question is now solved by inserting the phrase into a selection in a blank background document and reading Words.Count. So the answer to your question of why I need to exactly replicate what Word does is so that every phrase I have assigned a count of 5 will match when I back up 5 words in the target document selection.
Beyond that I am simply curious to know if I can access Word's function directly.  Bottom line is curiosity!!  Yes, I know that's dumb!
I think this is the Word VBA sequence that selects five "words", including the trailing colon character:
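' Select the next colon, collapse to just after it, then extend left by five "words"
Selection.Find.Execute FindText:=":"
Selection.MoveRight wdCharacter, 1, wdMove
Selection.MoveLeft wdWord, 5, wdExtend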


To test this, I typed the following into the Word document:
Now is the time for all good men: to come to the aid of their country.
After executing the above statements, I checked the results in the immediate window
?"""" & selection.Text & """"
"for all good men:"
?selection.Words.Count
 5 
for w=1 to selection.Words.Count: ?w,"""" & selection.Words(w) & """":next
 1            "for "
 2            "all "
 3            "good "
 4            "men"
 5            ": "


Note: Even though the selected text ended with a colon, the words collection in the selection included the trailing space.

As far as I can tell, you can't directly execute the Word functions without instantiating Word and opening the document.

While I appreciate your curiosity, I think the best solution would be to look for the word sequences ignoring the punctuation.  Forget what Word does.  Parse the text.