Solved

Extracting word pairs and triples (n-gram extraction) from a string

Posted on 2013-01-30
Last Modified: 2013-01-30
Hi all, I'm trying to figure out a way to extract n-grams from a string of variable length, usually natural-language text with some padding. What I want is 2- and 3-word groups in which every word is at least 3 characters long. Take this sentence, for example:

This here is a short sentence. And this is another sentence. >>>>> A final sentence here.

For bi-grams (2 words), this would produce:
This here
short sentence
It stops at a ".", ">" or any other non-word character (hyphens and apostrophes are allowed) and starts processing a new sentence:

And this
another sentence
final sentence
sentence here

I've tried this with regex but haven't had much luck, and wondered if anyone had a more elegant solution?
Question by:Slimshaneey
2 Comments
 

Accepted Solution

by:
Scott Madeira earned 500 total points
ID: 38835898
I would approach it like this:

1. Create an array of sentences by using preg_split() to split the text wherever it hits one of your disallowed characters. (I'm not a regex expert, so I can't offer much here.)

2. For each sentence in the array, explode() it on the space character to get the words.

3. Step through the array of words and check the length of element x and element x+1 to see whether the pair meets your needs.

There may be a better way to do it, but I think this will work.
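
The three steps above can be sketched roughly as follows. This is only an illustration, not the answerer's own code: the function name extractBigrams, the $minLen parameter and the exact delimiter pattern (anything that is not a word character, whitespace, hyphen or apostrophe) are assumptions chosen to match the question.

```php
<?php
// Hypothetical helper: a sketch of the three steps, under the assumptions above.
function extractBigrams(string $text, int $minLen = 3): array
{
    // Step 1: split into sentence-like fragments on any run of characters
    // that is not a word character, whitespace, hyphen or apostrophe.
    $fragments = preg_split("/[^\\w\\s'-]+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    $bigrams = [];
    foreach ($fragments as $fragment) {
        // Step 2: split the fragment into words on runs of whitespace.
        $words = preg_split('/\s+/', trim($fragment), -1, PREG_SPLIT_NO_EMPTY);

        // Step 3: compare each word with its right-hand neighbour and keep
        // the pair only if both words are long enough.
        for ($i = 0; $i < count($words) - 1; $i++) {
            if (strlen($words[$i]) >= $minLen && strlen($words[$i + 1]) >= $minLen) {
                $bigrams[] = $words[$i] . ' ' . $words[$i + 1];
            }
        }
    }
    return $bigrams;
}
```

Run against the example sentence from the question, this sketch should return the six pairs listed there, in the same order. Splitting words with preg_split('/\s+/') rather than explode(" ") is a small deviation from step 2 so that repeated spaces do not produce empty words.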

Author Comment

by:Slimshaneey
ID: 38835956
Thanks, I was thinking the same thing. As is often the case, the very act of writing the problem out here planted a seed in my head, and I ended up with this function:

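public function getNgrams($string, $n = 2) {
    // Initialise both arrays so the method always returns an array
    $ngrams = array();
    $ngramTmp = array();

    // Split the text into fragments at punctuation characters
    $arrParts = preg_split('/[?,.:;\/%$\!]/', $string);

    foreach ($arrParts as $parts) {
        $wordParts = explode(" ", trim($parts));
        // Skip fragments too short to hold a single n-gram
        if (count($wordParts) < $n) {
            continue;
        }
        foreach ($wordParts as $key => $word) {
            // Skip short words, words too close to the end of the fragment,
            // part numbers (anything containing a digit) and stop words.
            // isStopWord() is a method of the surrounding class (not shown here).
            if (strlen($word) < 3 || $key > count($wordParts) - $n || preg_match('/[0-9]/', $word) || $this->isStopWord($word)) {
                continue;
            }
            $ngramTmp[] = $word;
            // Get the next $n - 1 words in sequence
            for ($i = 0; $i < ($n - 1); $i++) {
                $nextWord = $wordParts[$key + $i + 1];
                // Check it's a valid word and not a part number
                if (strlen($nextWord) >= 3 && !preg_match('/[0-9]/', $nextWord) && !$this->isStopWord($nextWord)) {
                    $ngramTmp[] = $nextWord;
                } else {
                    // Invalid word: discard this n-gram and try the next start word
                    $ngramTmp = array();
                    continue 2;
                }
            }
            if (count($ngramTmp) == $n) {
                $ngrams[] = implode(' ', $ngramTmp);
            }
            $ngramTmp = array();
        }
    }
    return $ngrams;
}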


It's essentially what you mentioned, with a few extras I threw in for my needs.

Thanks again!
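
For readers who want to try the same idea outside a class, the "take a word plus the next n-1 words" step can also be written as a sliding window with array_slice(). The sketch below is a hypothetical standalone variant, not the function posted above: the name slidingNgrams and the $stopWords parameter (standing in for the isStopWord() method, which is not shown in the thread) are assumptions.

```php
<?php
// Hypothetical standalone variant; the stop-word list is passed in as a
// plain array of lowercase words because isStopWord() is not shown.
function slidingNgrams(string $text, int $n = 2, array $stopWords = []): array
{
    $ngrams = [];
    // Same punctuation set as the posted function, used as fragment delimiters.
    $fragments = preg_split('/[?,.:;\/%$!]/', $text);

    foreach ($fragments as $fragment) {
        $words = preg_split('/\s+/', trim($fragment), -1, PREG_SPLIT_NO_EMPTY);

        // Slide a window of $n consecutive words across the fragment.
        for ($i = 0; $i + $n <= count($words); $i++) {
            $window = array_slice($words, $i, $n);

            // Keep the window only if every word passes all three checks:
            // minimum length, no digits, not a stop word.
            $valid = true;
            foreach ($window as $word) {
                if (strlen($word) < 3
                    || preg_match('/[0-9]/', $word)
                    || in_array(strtolower($word), $stopWords, true)) {
                    $valid = false;
                    break;
                }
            }
            if ($valid) {
                $ngrams[] = implode(' ', $window);
            }
        }
    }
    return $ngrams;
}

// Example call: tri-grams with no stop words.
print_r(slidingNgrams("A final sentence here. Part 42a fits well", 3));
```

Because the window is built first and validated afterwards, there is no temporary array to reset and no need for "continue 2". The trade-off is that each word is re-checked once per window it appears in.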