Extracting word pairs and triples (n-gram extraction) from a string
Posted on 2013-01-30
Hi all, I'm trying to figure out a way to pull word groups out of a string of variable length, usually natural-language text with some padding. What I want is 2- and 3-word groups with a minimum word length of 3 chars. Take this text, for example:
This here is a short sentence. And this is another sentence. >>>>> A final sentence here.
For bi-grams (2 words) it would produce:
It should stop at ., > or any other non-word character (hyphens and apostrophes are allowed) and start processing a new sentence.
I've tried this with regex but haven't had much luck, and wondered if anyone has a more elegant solution?
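For what it's worth, here is a minimal sketch of one way to do it, written in Python (the post doesn't name a language, so that choice is an assumption). It splits the text into segments at any character that is not a letter, digit, hyphen, apostrophe or whitespace, then builds the word groups inside each segment so they never cross a break. The name `extract_ngrams` and its parameters are made up for illustration. It also assumes that words shorter than 3 chars are simply dropped, so the words on either side of a short word end up paired; if short words should break the sequence instead, the filtering step would need to change.

```python
import re

# Assumption: a "word" is a run of letters/digits that may contain
# internal hyphens or apostrophes (e.g. "well-known", "don't").
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

# Any run of characters that are not letters, digits, hyphens,
# apostrophes or whitespace ends the current segment.
BREAK_RE = re.compile(r"[^A-Za-z0-9\-'\s]+")


def extract_ngrams(text, n, min_len=3):
    """Return n-word groups as tuples, never crossing a segment break."""
    ngrams = []
    for segment in BREAK_RE.split(text):
        # Assumption: words shorter than min_len are dropped before grouping.
        words = [w for w in WORD_RE.findall(segment) if len(w) >= min_len]
        # Slide a window of n words across the remaining words.
        for i in range(len(words) - n + 1):
            ngrams.append(tuple(words[i:i + n]))
    return ngrams


text = ("This here is a short sentence. And this is another sentence. "
        ">>>>> A final sentence here.")

for gram in extract_ngrams(text, 2):
    print(" ".join(gram))
```

Under those assumptions, the first sentence yields "This here", "here short" and "short sentence", and nothing is paired across the full stops or the run of > characters. Calling `extract_ngrams(text, 3)` gives the 3-word groups the same way.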