Solved

Extracting word pairs and triples (n-gram extraction) from a string

Posted on 2013-01-30
Last Modified: 2013-01-30
Hi all, I'm trying to figure out a way to take a string of variable length, usually natural language with some padding, and extract n-grams from it. What I want is two- and three-word sequences in which every word is at least 3 characters long. Take this sentence, for example:

This here is a short sentence. And this is another sentence. >>>>> A final sentence here.

For bi-grams (2 words), this would produce:
This here
short sentence
It stops at the ., > or any other non-word character (hyphens and apostrophes are allowed) and starts processing a new sentence:

And this
another sentence
final sentence
sentence here

I've tried this with regex but haven't had much luck, and wondered if anyone had a more elegant solution?
Question by:Slimshaneey

Accepted Solution

by:
Scott Madeira earned 500 total points
ID: 38835898
I would approach it like this...

1. Create an array of sentences by using preg_split() to split the text wherever it hits one of your disallowed characters. (I'm not a regex expert, so I can't offer much here.)

2. For each sentence in the array, explode() it on a space character to get the words.

3. Step through the array of words and check the length of element x and element x+1 to see if they meet your needs.

There may be a better way to do it but I think this will work.
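
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this thread: the function name extractNgrams(), the $minLen parameter and the choice of split pattern (anything other than ASCII letters, digits, whitespace, hyphens and apostrophes ends a sentence) are all assumptions made for the example.

```php
<?php
// Hypothetical helper illustrating the split / explode / step-through approach.
function extractNgrams(string $text, int $n = 2, int $minLen = 3): array
{
    $ngrams = [];

    // Step 1: split into sentences on any run of disallowed characters.
    // Assumption: letters, digits, whitespace, hyphens and apostrophes are allowed.
    $sentences = preg_split("/[^A-Za-z0-9\\s'-]+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($sentences as $sentence) {
        // Step 2: split the sentence into words on runs of whitespace.
        $words = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);

        // Step 3: look at every window of $n consecutive words.
        for ($i = 0; $i + $n <= count($words); $i++) {
            $window = array_slice($words, $i, $n);

            // Keep the window only if every word meets the minimum length.
            $allLongEnough = true;
            foreach ($window as $word) {
                if (strlen($word) < $minLen) {
                    $allLongEnough = false;
                    break;
                }
            }
            if ($allLongEnough) {
                $ngrams[] = implode(' ', $window);
            }
        }
    }

    return $ngrams;
}

$sample = "This here is a short sentence. And this is another sentence. >>>>> A final sentence here.";
print_r(extractNgrams($sample, 2));
```

With the sample sentence from the question, this should give the six bi-grams listed there; calling it with $n = 3 leaves only "final sentence here". Note that strlen() counts bytes, so multi-byte text would need mb_strlen() and a Unicode-aware pattern instead.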

Author Comment

by:Slimshaneey
ID: 38835956
Thanks, I was thinking the same thing. As is often the case, the very act of writing the problem out here planted a seed in my head, and I ended up with this function:

public function getNgrams($string, $n = 2) {
    $ngrams = array();
    // Split the text into sentence-like parts on punctuation
    $arrParts = preg_split('/[?,.:;\/%$\!]/', $string);

    foreach($arrParts as $parts){
        $wordParts = explode(" ", trim($parts));
        if(count($wordParts) < $n){
            continue;
        }
        foreach($wordParts as $key => $word){
            // Skip short words, words too near the end to start an n-gram,
            // part numbers (anything containing a digit) and stop words
            if(strlen($word) < 3 || $key > count($wordParts) - $n || preg_match('/[0-9]/', $word) || $this->isStopWord($word)){
                continue;
            }
            $ngramTmp = array($word);
            //Get next $n-1 words in sequence
            for($i = 0; $i < ($n - 1); $i++){
                $nextWord = $wordParts[$key + $i + 1];
                //Check it's a valid word and not a part number
                if(strlen($nextWord) >= 3 && !preg_match('/[0-9]/', $nextWord) && !$this->isStopWord($nextWord)){
                    $ngramTmp[] = $nextWord;
                }else{
                    // Invalid word: abandon this n-gram and move on to the next starting word
                    continue 2;
                }
            }
            if(count($ngramTmp) == $n){
                $ngrams[] = implode(' ', $ngramTmp);
            }
        }
    }
    return $ngrams;
}



It's essentially what you mentioned, with a few extras I threw in for my needs.

Thanks again!
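
The function above calls $this->isStopWord(), which is not shown in the thread. One way such a check might look is sketched below as a standalone function, assuming a small hard-coded, case-insensitive list. The list is only a placeholder; which words count as stop words depends on the application.

```php
<?php
// Hypothetical stop-word check; the real isStopWord() is not shown in the thread.
function isStopWord(string $word): bool
{
    // Tiny illustrative list, keyed by word for constant-time lookup.
    static $stopWords = [
        'the'  => true,
        'and'  => true,
        'for'  => true,
        'with' => true,
    ];

    // Compare in lower case so "The" and "the" are treated alike.
    return isset($stopWords[strtolower($word)]);
}

var_dump(isStopWord('The'));      // stop word
var_dump(isStopWord('sentence')); // not a stop word
```

Inside a class, the same body could be used as a method so that the $this->isStopWord($word) calls resolve.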