Solved

Extracting word pairs and triples (n-gram extraction) from a string

Posted on 2013-01-30
Last Modified: 2013-01-30
Hi all, I'm trying to figure out a way to take a string of variable length, usually natural language with some padding, and extract n-grams from it. What I want is two- and three-word groups in which every word is at least 3 characters long. Take this sentence, for example:

This here is a short sentence. And this is another sentence. >>>>> A final sentence here.

For bi-grams (2 words), this would produce:
This here
short sentence
It stops at a ., a > or any other non-word character (hyphens and apostrophes are allowed) and starts processing a new sentence:

And this
another sentence
final sentence
sentence here

I've tried this with regex but haven't had much luck, and wondered if anyone had a more elegant solution?
Question by:Slimshaneey

Accepted Solution

by:
Scott Madeira earned 500 total points
ID: 38835898
I would approach it like this...

1. Create an array of sentences by using preg_split() to split the text wherever it hits one of your disallowed characters. (I'm not a regex expert, so I can't offer much here.)

2. For each sentence in the array, explode() it on the space character to get the words.

3. Step through the array of words and check the length of element x and element x+1 to see whether the pair meets your needs.

There may be a better way to do it but I think this will work.
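
The three steps above can be sketched in PHP roughly as follows. This is only a minimal sketch under assumptions: the function name extractNgrams(), its parameters, and the regular expression (split on anything that is not a word character, whitespace, a hyphen or an apostrophe) are illustrative choices and are not taken from the question. The arrow function needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper illustrating the three steps; not the asker's final code.
function extractNgrams(string $text, int $n = 2, int $minLen = 3): array
{
    $ngrams = [];

    // Step 1: split into sentence fragments on any run of characters that is
    // not a word character, whitespace, a hyphen or an apostrophe.
    $sentences = preg_split("/[^\w'\s-]+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($sentences as $sentence) {
        // Step 2: split the fragment into words on runs of whitespace.
        $words = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);

        // Step 3: slide a window of $n words over the fragment and keep the
        // window only if every word in it is at least $minLen characters long.
        for ($i = 0; $i + $n <= count($words); $i++) {
            $window = array_slice($words, $i, $n);
            $tooShort = array_filter($window, fn($w) => strlen($w) < $minLen);
            if (!$tooShort) {
                $ngrams[] = implode(' ', $window);
            }
        }
    }

    return $ngrams;
}
```

Called with the example sentence from the question and $n = 2, this should return the six pairs listed there; with $n = 3 the only qualifying group is "final sentence here".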

Author Comment

by:Slimshaneey
ID: 38835956
Thanks, I was thinking the same thing. As is often the case, the very act of writing the problem out here planted a seed in my head, and I ended up with this function:

public function getNgrams($string, $n = 2) {
        $ngrams = array();
        // Split the text into sentence fragments on punctuation
        $arrParts = preg_split('/[?,.:;\/%$\!]/', $string);

        foreach($arrParts as $parts){
            // Split on runs of whitespace so repeated spaces do not yield empty words
            $wordParts = preg_split('/\s+/', trim($parts), -1, PREG_SPLIT_NO_EMPTY);
            $wordCount = count($wordParts);
            if($wordCount < $n){
                continue;
            }
            foreach($wordParts as $key => $word){
                // Skip short words, words too close to the end to start an n-gram,
                // part numbers (anything containing a digit) and stop words
                if(strlen($word) < 3 || $key > $wordCount - $n || preg_match('/[0-9]/', $word) || $this->isStopWord($word)){
                    continue;
                }
                $ngramTmp = array($word);
                //Get next $n-1 words in sequence
                for($i = 0; $i < ($n - 1); $i++){
                    $nextWord = $wordParts[$key + $i + 1];
                    //Check it's a valid word and not a part number
                    if(strlen($nextWord) >= 3 && !preg_match('/[0-9]/', $nextWord) && !$this->isStopWord($nextWord)){
                        $ngramTmp[] = $nextWord;
                    }else{
                        // Invalid word in the window: move on to the next starting word
                        continue 2;
                    }
                }
                $ngrams[] = implode(' ', $ngramTmp);
            }
        }
        return $ngrams;
    }



It's essentially what you mentioned, with a few extras I threw in for my needs.

Thanks again!