
Extracting word pairs and triples (n-gram extraction) from a string

Posted on 2013-01-30
Last Modified: 2013-01-30
Hi all, I'm trying to figure out a way to extract n-grams from a string of variable length, usually natural language with some padding. What I want is 2- and 3-word groups in which every word is at least 3 characters long. Take this sentence, for example:

This here is a short sentence. And this is another sentence. >>>>> A final sentence here.

For bi-grams (2 words), this would produce:
This here
short sentence

It stops at ".", ">" or any other non-word character (hyphens and apostrophes are allowed) and starts processing a new sentence:

And this
another sentence
final sentence
sentence here

I've tried this with regex but haven't had much luck, and wondered if anyone had a more elegant solution?
Question by:Slimshaneey

Accepted Solution

by: Scott Madeira (earned 2000 total points)
ID: 38835898
I would approach it like this...

1. Create an array of sentences by using preg_split() to split the text wherever it hits one of your disallowed characters. (I'm not a regex expert, so I can't offer much on the pattern itself.)

2. For each sentence in the array, explode() it on the space character to get the words.

3. Step through the array of words and check the length of element x and element x+1 to see whether the pair meets your needs.

There may be a better way to do it but I think this will work.
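
A minimal sketch of the three steps above, in PHP. The function name extractNgrams() and its parameters are hypothetical and not from this thread. It assumes $n is at least 1 and treats anything other than letters, digits, whitespace, hyphens and apostrophes as a sentence break, which is one reading of the question rather than a confirmed rule.

```php
<?php
// Hypothetical helper (not from the thread): returns all n-grams whose words
// are each at least $minLen characters long, never crossing a sentence break.
// Assumes $n >= 1.
function extractNgrams(string $text, int $n = 2, int $minLen = 3): array
{
    $ngrams = [];

    // Step 1: split wherever a character is NOT a letter, digit, whitespace,
    // hyphen or apostrophe. Empty pieces are dropped.
    $sentences = preg_split("/[^A-Za-z0-9\\s'-]+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($sentences as $sentence) {
        // Step 2: break the sentence into words on runs of whitespace.
        $words = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);

        // Step 3: slide a window of $n words along the sentence and keep it
        // only if every word in the window is long enough.
        for ($i = 0; $i + $n <= count($words); $i++) {
            $window = array_slice($words, $i, $n);
            $longEnough = array_filter($window, function ($word) use ($minLen) {
                return strlen($word) >= $minLen;
            });
            if (count($longEnough) === $n) {
                $ngrams[] = implode(' ', $window);
            }
        }
    }

    return $ngrams;
}

$sample = 'This here is a short sentence. And this is another sentence. >>>>> A final sentence here.';
print_r(extractNgrams($sample));
// Yields: This here, short sentence, And this, another sentence,
//         final sentence, sentence here
```

Passing 3 as the second argument gives tri-grams under the same rules. Splitting words with preg_split('/\s+/') instead of explode(' ') is a design choice made here so that repeated spaces do not produce empty words.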

Author Comment

by:Slimshaneey
ID: 38835956
Thanks, I was thinking the same thing. As is often the case, the very act of writing the problem out here planted a seed in my head, and I ended up with this function:

public function getNgrams($string, $n = 2) {
        // Initialise the result so an array is returned even when nothing matches
        $ngrams = array();
        $ngramTmp = array();

        // Split the text into sentence-like parts on punctuation
        $arrParts = preg_split('/[?,.:;\/%$\!]/', $string);

        foreach($arrParts as $parts){
            $wordParts = explode(" ", trim($parts));
            if(count($wordParts) < $n){
                continue;
            }
            foreach($wordParts as $key => $word){
                // Skip short words, positions too close to the end, part numbers and stop words
                if(strlen($word) < 3 || $key > count($wordParts) - $n || preg_match('/[0-9]/', $word) || $this->isStopWord($word)){
                    continue;
                }
                $ngramTmp[] = $word;
                // Get the next $n-1 words in sequence
                for($i = 0; $i < ($n - 1); $i++){
                    $nextWord = $wordParts[$key + $i + 1];
                    // Check it's a valid word and not a part number
                    if(strlen($nextWord) >= 3 && !preg_match('/[0-9]/', $nextWord) && !$this->isStopWord($nextWord)){
                        $ngramTmp[] = $nextWord;
                    }else{
                        // Invalid word: discard this n-gram and move on to the next starting word
                        $ngramTmp = array();
                        continue 2;
                    }
                }
                if(count($ngramTmp) == $n){
                    $ngrams[] = implode(' ', $ngramTmp);
                }
                $ngramTmp = array();
            }
        }
        return $ngrams;
    }


It's essentially what you mentioned, with a few extras I threw in for my needs.

Thanks again!