Solved

Extracting word pairs and triples (n-gram extraction) from a string

Posted on 2013-01-30
Last Modified: 2013-01-30
Hi all, I'm trying to figure out a way to take a string of variable length, usually natural-language text with some padding, and extract n-grams from it. What I want is 2- and 3-word groups in which every word is at least 3 characters long. Take this sentence, for example:

This here is a short sentence. And this is another sentence. >>>>> A final sentence here.

For bi-grams (2 words) this would produce:
This here
short sentence
And this
another sentence
final sentence
sentence here

Processing stops at a ., > or any other non-word character (hyphens and apostrophes are allowed) and starts again with a new sentence.

I've tried this with regex but haven't had much luck, and wondered if anyone had a more elegant solution?
Question by:Slimshaneey
2 Comments

Accepted Solution

by:
Scott Madeira earned 500 total points
ID: 38835898
I would approach it like this...

1. Create an array of sentences by using preg_split() to split the text wherever it hits one of your disallowed characters. (I'm not a regex expert, so I can't offer much on the pattern itself.)

2. For each sentence in the array, explode() it on the space character to get the words.

3. Step through the array of words and check the length of element x and element x+1 to see whether the pair meets your needs.

There may be a better way to do it but I think this will work.
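
The three steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the function name extractNgrams() and its parameters are made up for the example, the split pattern is one guess at what "non-word character" should mean (anything that is not a letter, digit, whitespace, hyphen or apostrophe), and preg_split() on whitespace stands in for explode() so that repeated spaces do not produce empty words. It does no stop-word or digit filtering.

```php
<?php
// Hypothetical helper illustrating the three steps; the name and parameters
// are assumptions for this example, not something from the thread.
function extractNgrams(string $text, int $n = 2, int $minLen = 3): array
{
    // Step 1: split wherever a run of characters is not a letter, digit,
    // whitespace, hyphen or apostrophe (assumed definition of "non-word").
    $sentences = preg_split("/[^\\p{L}\\p{N}\\s'-]+/u", $text, -1, PREG_SPLIT_NO_EMPTY);

    $ngrams = [];
    foreach ($sentences as $sentence) {
        // Step 2: break the sentence into words on runs of whitespace.
        $words = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);

        // Step 3: slide a window of $n words along the sentence and keep it
        // only if every word in the window is at least $minLen bytes long.
        for ($i = 0; $i + $n <= count($words); $i++) {
            $window = array_slice($words, $i, $n);
            $longEnough = array_filter($window, fn($w) => strlen($w) >= $minLen);
            if (count($longEnough) === $n) {
                $ngrams[] = implode(' ', $window);
            }
        }
    }
    return $ngrams;
}

$text = "This here is a short sentence. And this is another sentence. >>>>> A final sentence here.";
print_r(extractNgrams($text, 2));
```

Run against the sample sentence from the question with $n = 2, this should give the six pairs listed there: This here, short sentence, And this, another sentence, final sentence, sentence here. Note that strlen() counts bytes, so mb_strlen() may be a better fit for non-ASCII text.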

Author Comment

by:Slimshaneey
ID: 38835956
Thanks, I was thinking the same thing. As is often the case, the very act of writing the problem out here planted a seed in my head, and I ended up with this function:

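public function getNgrams($string, $n = 2) {
        // Start with empty arrays so an array is returned even when no n-gram is found
        $ngrams = array();
        $ngramTmp = array();
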
        $arrParts = preg_split('/[?,.:;\/%$\!]/', $string);
       
        foreach($arrParts as $parts){
            $wordParts = explode(" ", trim($parts));
            if(count($wordParts) < $n){
                continue;
            }
            foreach($wordParts as $key => $word){
                // Skip short words, start positions too near the end, words containing digits, and stop words
                if(strlen($word) < 3 || $key > count($wordParts) - $n || preg_match('/[0-9]/', $word) || $this->isStopWord($word)){
                    continue;
                }
                $ngramTmp[] = $word;
                //Get next $n-1 words in sequence
                for($i = 0; $i < ($n - 1); $i++){
                    $nextWord = $wordParts[$key + $i + 1];
                    //Check it's a valid word and not a part number
                    if(strlen($nextWord) >= 3 && !preg_match('/[0-9]/', $nextWord) && !$this->isStopWord($nextWord)){
                        $ngramTmp[] = $nextWord;
                    }else{
                        $ngramTmp = array();
                        continue 2;
                    }
                }
                if(count($ngramTmp) == $n){
                    $ngrams[] = implode(' ', $ngramTmp);
                }
                $ngramTmp = array();
            }
        }
        return $ngrams;
    }

It's essentially what you mentioned, with a few extras I threw in for my needs.

Thanks again!
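
The function above calls $this->isStopWord(), which is not shown in the thread. As a rough idea of what such a method might look like, here is a minimal sketch. The class name NgramHelper and the word list are purely illustrative assumptions; the real implementation could be quite different.

```php
<?php
// Hypothetical class: the thread never shows isStopWord(), so this is only
// a guess at its shape, with a tiny made-up stop-word list.
class NgramHelper
{
    /** @var array<string, bool> illustrative stop words, stored as keys for quick lookup */
    private $stopWords = array('the' => true, 'for' => true, 'with' => true, 'that' => true);

    public function isStopWord($word)
    {
        // Compare case-insensitively so "The" and "the" are treated alike.
        return isset($this->stopWords[strtolower($word)]);
    }
}

$helper = new NgramHelper();
var_dump($helper->isStopWord('The'));      // bool(true)
var_dump($helper->isStopWord('sentence')); // bool(false)
```

Keeping the words as array keys means each check is a single isset() lookup rather than a scan with in_array(), which may matter when the method is called once per word.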