Antonio Hernandez

Word VBA - Reliably detect English sentences

I need help in Word VBA creating a function that reliably detects each sentence in a paragraph and stores each one as a separate entry in an array. I have used regular expressions with some success but have still had trouble. This particular application would be used mainly with legal briefs, which often contain abbreviations that end in a period and other punctuation quirks that make it difficult to detect sentences reliably. It would be okay if the code relied in part on a minimum string length to raise accuracy by reducing false hits of two words, etc.

Dr. Klahn

I would take a different approach ...

Pre-parse the text to remove everything extraneous: parenthetical text, section marks, line numbers, and abbreviations such as q.v., op. cit., vs. (there will be a long list). Feed the original text in, strip the extraneous material, and produce the cleaned text as output. The cleaned text is much easier to parse into sentences. Then go back to the original, using those sentences to guide the parse.

e.g.:  "In re Beavis vs. Butthead, legis. cit. supra, the appellant claims (1) having been repeatedly called a fart-knocker (q.v.), and (2) having been pummeled about the body on divers occasions."

Converted:  In re Beavis Butthead, supra, the appellant claims having been repeatedly called a fart-knocker, and having been pummeled about the body on divers occasions.

Parsed:  Word 1 "in", last word "occasions", at least 22 words between beginning and end, start looking for terminating word 22 words from beginning.
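A minimal sketch of this strip-then-map-back idea, written in Python only for brevity (the same logic should port to VBA). The `NOISE` pattern is a deliberately tiny, hypothetical list, and `strip_noise` and `split_sentences` are illustrative names, not anything posted in this thread. The key point is that an index map is kept while stripping, so sentence ends found in the cleaned text can be translated back to offsets in the original.

```python
import re

# Hypothetical, deliberately short noise list: parentheticals plus a few
# abbreviations. A real list for legal briefs would be much longer.
NOISE = re.compile(r"\([^)]*\)|\b(?:q\.v\.|op\. cit\.|legis\. cit\.|vs\.)")

def strip_noise(text):
    """Return (cleaned, index_map), where index_map[i] is the offset in
    `text` of the character that ended up at cleaned[i]."""
    keep = [True] * len(text)
    for m in NOISE.finditer(text):
        for i in range(m.start(), m.end()):
            keep[i] = False
    index_map = [i for i, k in enumerate(keep) if k]
    cleaned = "".join(text[i] for i in index_map)
    return cleaned, index_map

def split_sentences(text):
    """Find sentence ends in the cleaned text, then cut the ORIGINAL text
    at the corresponding offsets."""
    cleaned, index_map = strip_noise(text)
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s|$)", cleaned):
        end = index_map[m.start()] + 1  # offset just past the terminator in the original
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Because the cut is made on the original string, each array entry still contains its "vs." and "(q.v.)"; only the decision about where a sentence ends is made on the cleaned copy.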
Scott Fell

Although I do well in VBScript, I do not do much in VBA. However, I agree with Dr. Klahn about a different approach. I think you are going to find a lot of gotchas trying to do this by detecting abbreviations and other oddities.

I would instead look to an API. Perhaps https://developer.grammarly.com/ or maybe there is something in Google's Natural Language API https://cloud.google.com/natural-language (my warning with Google is that they tend to change things and drop support for legacy versions).

Today, there are a number of AI apps for writing that have APIs. These are not going to be free, but given you are doing this for the law sector, that should not be an issue.

Using an AI type of API to detect sentences and put those in your array will be more accurate in the long run.

Otherwise, your code would need to break down in parts.

  • The start of the document as one starting point of a sentence.
  • The next character followed by a period as the end, e.g. "s."
  • Check if you are at the end. If not, start where you left off, looking for the next non-space character to start.
  • Then again look for a character followed by a period and repeat.
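The steps above can be sketched roughly as follows. This is Python used as runnable pseudocode for the loop, and `naive_split` is a made-up name; it only shows the bare period-to-period scan, with no handling of abbreviations.

```python
def naive_split(text):
    """Cut the text at every period, skipping whitespace between pieces."""
    sentences, start = [], 0
    while start < len(text):
        end = text.find(".", start)
        if end == -1:
            end = len(text) - 1  # no more periods: take the rest of the text
        sentence = text[start:end + 1].strip()
        if sentence:
            sentences.append(sentence)
        start = end + 1  # resume right after the period
    return sentences

# The second sentence is wrongly cut at the abbreviation "St."
print(naive_split("The court agreed. He moved to St. Petersburg."))
# ['The court agreed.', 'He moved to St.', 'Petersburg.']
```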

If everything is clean, that is easy. But we know that is not going to be the case. You could start building a database of the gotchas like:

  • St. Petersburg
  • 111 Maple Ave. anytown 99999
  • A.L.R. 5th

Put all of those in a database. When you look up the end of a sentence, make sure it does not match one of the exception rules. This is why I think you need an AI API.
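As a rough illustration of the exception lookup, here is a sketch in Python (chosen for brevity; the logic is not language-specific). `ABBREVIATIONS` is a hypothetical in-memory stand-in for the database, and `split_with_exceptions` is an illustrative name.

```python
import re

# Hypothetical exception list; in practice it would be loaded from a table or file.
ABBREVIATIONS = {"st.", "ave.", "a.l.r.", "v.", "vs.", "cit.", "op."}

def split_with_exceptions(text, abbreviations=ABBREVIATIONS):
    sentences, start = [], 0
    # Only periods followed by whitespace or end of text are candidates,
    # so the inner periods of "A.L.R." are never considered.
    for m in re.finditer(r"\.(?=\s|$)", text):
        # The whitespace-delimited token ending at this period, e.g. "St."
        token = text[start:m.end()].split()[-1].lower()
        if token in abbreviations:
            continue  # exception rule hit: not a sentence end, keep scanning
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

One known weakness of this sketch: a sentence that genuinely ends with a listed abbreviation is merged with the sentence that follows it, so the list alone does not settle every case.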
Antonio Hernandez

ASKER

Thank you both so far. I am inclined to stay away from cloud solutions to keep things local. While Dr. Klahn's proposed strategy makes a lot of sense theoretically, removing all extraneous characters seems somewhat difficult in practice. Scott, regarding your second suggestion, I had been thinking similarly that it would be the best approach. Since Word reliably detects paragraphs, we know where the first sentence of each paragraph begins. I think this might be somewhat reliable:

  1. Capture the location of every period in the paragraph.
  2. Starting from the beginning of the first sentence, extract the string up to the first period plus two more characters.
  3. Check the string against a regular expression to see if it follows a pattern such as [Capital letter][rest of word][any number of characters][period][space][Capital letter].
  4. (a) If it fits the pattern of a sentence, start over from that period to the next; (b) if it does not fit the pattern, extract the string from the same start position to the next period and repeat.
  5. Repeat until the end of the paragraph and then the end of the document.

Please let me know your thoughts and if you have any code for such an approach. Thank you.
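One way the five steps might look for a single paragraph, sketched in Python rather than VBA so it can be run and tested easily. `SENTENCE` and `split_paragraph` are illustrative names, and treating the paragraph's final period as a sentence end is an added assumption, since nothing follows it to test against. In VBA the same pattern could be checked with the VBScript.RegExp object.

```python
import re

# Step 3 pattern: capital letter, any characters, period, space, capital letter.
SENTENCE = re.compile(r"[A-Z].*\. [A-Z]", re.DOTALL)

def split_paragraph(paragraph):
    # Step 1: location of every period in the paragraph.
    periods = [i for i, ch in enumerate(paragraph) if ch == "."]
    last = len(paragraph.rstrip()) - 1  # index of the final non-space character
    sentences, start = [], 0
    for p in periods:
        # Step 2: from the current start up to the period plus two more characters.
        candidate = paragraph[start:p + 3]
        # Step 3: test the pattern (assumption: a paragraph-final period always ends a sentence).
        if p == last or SENTENCE.fullmatch(candidate):
            sentences.append(paragraph[start:p + 1])  # step 4(a): accept the sentence
            start = p + 2                             # next sentence starts after the space
        # Step 4(b): otherwise keep `start` and extend to the next period.
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)  # text after the last accepted period, if any
    return sentences
```

This handles "A.L.R. 5th" and "Ave. in town" because what follows the period is not a space plus a capital letter. It would still cut at "St. Petersburg", where a capitalized word follows the abbreviation, so an exception list or a minimum-length check would probably still be needed on top.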
ASKER CERTIFIED SOLUTION
Scott Fell
aikimark

Antonio,

It would be helpful if you posted some representative sample text. Perhaps a paragraph or two would be sufficient.

I don't know if you're talking about citations or depositions.
Antonio Hernandez

ASKER

Thanks, Scott, for your help. I've made progress with my strategy. Your responses helped confirm to me that it is the best approach here. Regarding the Word Sentence object, it is pretty basic in that it overcounts based on period locations. I had previously checked it out (and wished it was better), but it is what it is.

Aikimark, it would be text from a legal brief, not a deposition. So, yes, citations get in the way.