Solved

Splitting text by period using regular expressions in PHP

Posted on 2013-12-23
Last Modified: 2014-01-07
Hi,

I have a requirement where I need to break a long block of text into single sentences, using the full stop as the delimiter.

For example, the text

$text = "Music, movies, games and voice calls in great stereo quality. An improved noise canceling microphone ensures you can be easily heard, even when people are talking around you. Position the fully flexible boom arm exactly where it needs to be for calls, or move it out of the way to game, watch movies or listen to your music.";

should split like this:

Music, movies, games and voice calls in great stereo quality.
An improved noise canceling microphone ensures you can be easily heard, even when people are talking around you.
Position the fully flexible boom arm exactly where it needs to be for calls, or move it out of the way to game, watch movies or listen to your music.

I have tried this:

$seperator = ". ";
$description = array();
$description = explode($seperator,$text);

But when the text contains numeric data such as "1.2 values", or abbreviated words such as ft.4 (feet), the sentence is split like this:
1.
2 values

I want to exclude numbers with decimals and abbreviated words when splitting. I know this can be done using regular expressions. Please suggest how this can be done.
Question by:srikanth_saladi
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39736229
If you explode a string like 1.2 using a delimiter like '. ' it will bypass the 1.2 because the whitespace is missing.  Can you please give us a small test data set that has some more examples?  I'll be glad to work with that to give you something appropriate.  Thanks!
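
A quick way to check this, as a minimal sketch. The sample sentence is made up for illustration and is not taken from the question:

```php
<?php
// Hypothetical sample text containing a decimal number.
$text = 'Use 1.2 values here. Then stop.';

// The delimiter is dot + space, so the dot inside "1.2" never matches.
$parts = explode('. ', $text);

print_r($parts);
// [0] => Use 1.2 values here
// [1] => Then stop.
```

Note that explode() removes the delimiter, so every piece except the last loses its trailing dot.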
0
 
LVL 9

Expert Comment

by:Derek Jensen
ID: 39736275
Try this regex:

/(?<!\d)\.\s*(?!\d)/
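
As a rough sketch of how that pattern might be used with preg_split(), assuming a made-up sample string (the question did not supply test data):

```php
<?php
// Hypothetical sample text: a decimal number and an abbreviation followed by a digit.
$text = 'The cable is 1.2 m long. It fits ft.4 sockets. Done.';

// Split on a dot that is neither preceded nor followed by a digit.
// PREG_SPLIT_NO_EMPTY drops the empty piece left after the final dot.
$parts = preg_split('/(?<!\d)\.\s*(?!\d)/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
// [0] => The cable is 1.2 m long
// [1] => It fits ft.4 sockets
// [2] => Done
```

The dots are consumed by the split, and a sentence that ends in a number (for example "It costs $1.95. Next") would not be split, because the lookbehind rejects any dot preceded by a digit.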

 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39736470
Please see: http://www.laprbass.com/RAY_temp_srikanth_saladi.php

Decimal numbers are no problem at all, whether with explode() or with a whitespace-aware regular expression.  But I think the problem may be a bit bigger than it looks.  Here are some of the things you may need to account for.  You will probably want to include samples like this in your test data to be sure that the algorithm you're developing will work correctly on the range of input you expect.

Dots inside proper names like Robert J. King, Jr., MD
Ellipses... embedded in sentences.
Tabular data.

Executive summary: The period is an "overloaded" punctuation mark with multiple meanings.  When you're parsing natural written language, it's never quite as easy on a computer as it is for a sentient being that reads the language.  One possible rule might include reliance on the traditional typist's rule of two spaces after a period that ends a sentence.  Another might be that any end-of-line character after a dot means that the dot is the end of a sentence.  You might look for very short sentence fragments, such as those that could arise from abbreviations.  It's an interesting problem, to be sure!

<?php // RAY_temp_srikanth_saladi.php
error_reporting(E_ALL);
echo '<pre>';


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28324495.html


$texts = array
( "Music, movies, games and voice calls in great stereo quality. An improved noise canceling microphone ensures you can be easily heard, even when people are talking around you. Position the fully flexible boom arm exactly where it needs to be for calls, or move it out of the way to game, watch movies or listen to your music."
, 'The price of the item is $1.95 plus tax.  We have 3 of them in stock.'
)
;


function get_sentences($str)
{
    $rgx
    = '#'        // REGEX DELIMITER
    . '\.'       // ESCAPED DOT
    . '\s+'      // WHITESPACE, ONE OR MORE
    . '#'        // REGEX DELIMITER
    ;

    $arr = preg_split($rgx, $str, -1, PREG_SPLIT_NO_EMPTY);

    // RESTORE DOTS REMOVED BY THE SPLIT
    foreach ($arr as $key => $val)
    {
        $arr[$key] = trim($val);
        if (substr($val, -1) != '.') $arr[$key] .= '.';
    }
    return $arr;
}


foreach ($texts as $text)
{
    var_dump(get_sentences($text));
    echo PHP_EOL;
}
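
The two rules mentioned earlier (two spaces after a sentence-ending period, or a line break after a dot) could be sketched like this. It is only an illustration under the assumption that the input really follows the two-space convention; the function name split_on_wide_gaps() and the sample text are hypothetical:

```php
<?php
// Hypothetical helper: a dot ends a sentence only when it is followed by
// two or more spaces, or by a line break.
function split_on_wide_gaps(string $str): array
{
    $rgx
    = '/'
    . '(?<=\.)'       // LOOKBEHIND: A DOT, KEPT WITH THE SENTENCE
    . '(?:'
    . '[ ]{2,}'       // TWO OR MORE SPACES
    . '|'
    . '\h*\R\s*'      // OR A LINE BREAK WITH OPTIONAL SURROUNDING WHITESPACE
    . ')'
    . '/'
    ;
    return preg_split($rgx, trim($str), -1, PREG_SPLIT_NO_EMPTY);
}

// Single quotes keep $1.95 literal; the line break is appended separately.
$text = 'Robert J. King, Jr., MD paid $1.95.  He left.' . "\n" . 'Then it rained.';
$sentences = split_on_wide_gaps($text);

print_r($sentences);
// [0] => Robert J. King, Jr., MD paid $1.95.
// [1] => He left.
// [2] => Then it rained.
```

Because the dot is matched by a lookbehind rather than consumed, there is no need to restore it afterwards. Text typed with a single space between sentences would not be split at all by this rule.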

 
LVL 35

Expert Comment

by:Terry Woods
ID: 39736631
Ray, your solution looks good. The \s+ part of your pattern will also handle newlines, keeping the result tidy.
 
LVL 9

Expert Comment

by:Derek Jensen
ID: 39736934
@Ray, you bring up several good points, and although I did consider some of them when constructing my regex, I decided not to implement them, and with good reason:

Your ellipses were handled quite easily, with this regex:

/(?<!\d|\.)\.\s*(?!\d|\.)/


as was the comma immediately following the period:

/(?<!\d|\.)\.\s*(?!\d|\.|,)/
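
A small sketch of what this last pattern does and does not handle, using a made-up sentence built from the examples in this thread:

```php
<?php
// Hypothetical sample text: an ellipsis, an initial, and "Jr.," with a comma.
$text = 'Wait... there is more. Robert J. King, Jr., MD agreed.';

// A dot splits only if it is not next to a digit or another dot,
// and is not immediately followed by a comma.
$parts = preg_split('/(?<!\d|\.)\.\s*(?!\d|\.|,)/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
// [0] => Wait... there is more
// [1] => Robert J
// [2] => King, Jr., MD agreed
```

The ellipsis and the "Jr.," survive, but the initial "J." is still treated as a sentence end.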


However, try as I might, the final offending period could *not* be accounted for without specifying a massive number of specific examples to look for. Were I to attempt a broad-matching search for the . following the J (with something like "\b\w\."), I would no longer match a very specific yet frustratingly elusive sentence end: "I." as in:
Robert J. King Jr., MD am I.
I think this sentence is an extremely good example of a 'worst-case' scenario you might encounter, and so would be terrific for testing against, but the odds against actually encountering a sentence like this in real-life text are astronomical, so it is not prudent to try to handle it with a regex.

Academically, I'm sure I could've matched against it using some complex set of dependent matches using "\b\w\." and "\s{2,}" or some such, but you must also consider that not everyone puts two spaces after their sentence ends (e.g. moi).
Add to that the fact that you're just as likely (if not more so) to encounter other languages in your text, such as Spanish, which has "y" as its own word as well.

Barring all that, you still have the MD to contend with, assuming they chose to put periods in the abbreviation, which could end up looking like M.D., and perhaps they chose not to follow the Jr. with a comma (either of which is a valid alternative punctuation), in which case the above regex fails.

All things considered, the only thing prudent to check against would be ellipses, since they are much more common in everyday language than they used to be...
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39737026
There's a joke that goes around among experienced programmers something like this, "I had a problem so I tried to use a regular expression.  Now I have two problems!"

:-)  and best of luck with the project, ~Ray

 
LVL 9

Expert Comment

by:Derek Jensen
ID: 39737161
You mean this joke? ;-)
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39737176
Love it!  There is also this: http://en.wikipedia.org/wiki/99_Problems
 
LVL 21

Expert Comment

by:Mazdajai
ID: 39737228
Will that parse the following? :)

I have a B.S. degree in regular expression.

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39737938
@Mazdajai: That is exactly the issue I raised here, and it underscores the need for the author to give us the test data set!  It's unlikely that a single regular expression will handle all of the cases, or even most of the cases.  More likely a little extra programming will be needed.  A tangentially related problem is the question of how to capitalize common English language names.  An easy guess is "make the first letter a capital," but when you deconstruct it, it's not as simple as it seems.  You need a lot of rules, and even with those rules, you'll almost certainly not cover 100% of the issues.  With punctuation we have a better chance of success because there are fewer rules.

<?php // RAY_capitalize_names.php
error_reporting(E_ALL);

// SOME TEST NAMES
$names = array
( "o'brien"
, 'MCAFEE'
, "barrett-o'reilly"
, "smith jones"
, "burns"
, "CROWTHER"
, "George w. bush, iiI"
, "RONALD    MCDONALD"
, "RONALD    MCDONALD-o'brien"
, "van De Graaff GeneratoR"
)
;

// TEST EACH CASE
foreach ($names as $name)
{
    echo "<br/>$name ";
    echo fixname($name);
}


// FUNCTION TO HANDLE NAMES
function fixname($name)
{
    // SPECIAL CASES FOR UPPER OR LOWER CASE DISPOSITION
    $uc = array  // UPPERCASE AFTER ANY OF THESE
    ( 'Mc'
    , "'"
    , '-'
    )
    ;

    $lc = array  // ALWAYS LOWER CASE
    ( 'Van De '
    )
    ;

    $mc = array  // ALWAYS UPPER CASE
    ( 'Iii'
    )
    ;

    // REMOVE UNNECESSARY BLANKS
    $name = preg_replace('/\s\s+/', ' ', $name);

    // START WITH LOWER CASE AND UPPER FIRST
    $name = strtolower($name);
    $name = ucwords($name);

    // CHECK FOR KNOWN SPECIAL UPPER-CASES
    foreach ($uc as $dlm)
    {
        // FIX THE Mcdonald EXAMPLE, ETC
        $namex = explode($dlm, $name);
        foreach ($namex as $k => $v)
        {
            $namex[$k] = ucwords($v);
        }
        $name = implode($dlm, $namex);
    }

    // CHECK FOR KNOWN CONSTANT LOWER-CASES
    foreach ($lc as $dlm)
    {
        // FIX THE van de Graaff EXAMPLE
        $name = str_replace($dlm, strtolower($dlm), $name);
    }

    // CHECK FOR KNOWN CONSTANT UPPERCASE
    foreach ($mc as $dlm)
    {
        // FIX THE Bush, III EXAMPLE
        $name = str_replace($dlm, strtoupper($dlm), $name);
    }

    // RETURN THE REPAIRED STRING
    return $name;
}

 
LVL 21

Expert Comment

by:Mazdajai
ID: 39738016
"I want to exclude numbers with decimals and abbreviated words when splitting. I know this can be done using regular expressions."

The purpose of my post is not to complicate the issue, but rather to suggest to the user that the above statement is false.

Depending on the number of records or the size of the data set, it may require a manual inspection, because these are not log files with a consistent format. You can only exclude known abbreviations.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39738039
"... can only exclude known abbreviations."
Exactly, and that's why trying to rely on a single regular expression is not going to be very fruitful.  It is possible to do what the author wants, it's just not possible with a single regular expression.  More likely it will be a combination of several PHP statements that take into account known abbreviations, special considerations, etc.  In the case of capitalizing names, we had to have several rules (and that's still not perfect).  I am sure that once the author revisits the responses here, and provides the test data, we can get a solution that will work with the test data set.  The quality of the solution will be determined by the degree to which the test data looks like the actual data.

Best regards, ~Ray
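
As a rough sketch of that "several PHP statements" idea: split on every dot followed by whitespace, then glue a piece back onto the next one when it ends in a known abbreviation. The abbreviation list, the function name split_sentences() and the sample text are all assumptions for illustration; a real list would have to come from the actual data.

```php
<?php
// Hypothetical list of known abbreviations, stored without the trailing dot.
const KNOWN_ABBREVIATIONS = ['J', 'Jr', 'B.S', 'ft'];

function split_sentences(string $str, array $abbreviations = KNOWN_ABBREVIATIONS): array
{
    // STEP 1: SPLIT AFTER EVERY DOT THAT IS FOLLOWED BY WHITESPACE (DOT IS KEPT)
    $pieces = preg_split('/(?<=\.)\s+/', trim($str), -1, PREG_SPLIT_NO_EMPTY);

    // STEP 2: RE-JOIN PIECES THAT END IN A KNOWN ABBREVIATION
    $sentences = [];
    $buffer    = '';
    foreach ($pieces as $piece)
    {
        $buffer = ($buffer === '') ? $piece : $buffer . ' ' . $piece;

        // LAST WORD OF THE BUFFER, WITHOUT ITS FINAL DOT
        $is_abbreviation
        = preg_match('/(\S+)\.$/', $buffer, $m) === 1
        && in_array($m[1], $abbreviations, true)
        ;
        if ($is_abbreviation) continue; // KEEP ACCUMULATING

        $sentences[] = $buffer;
        $buffer      = '';
    }

    // ANYTHING LEFT OVER (TEXT THAT ENDED ON AN ABBREVIATION)
    if ($buffer !== '') $sentences[] = $buffer;
    return $sentences;
}

$text = 'I have a B.S. degree in regular expression. Robert J. King, Jr., MD owes $1.95. Done.';
$sentences = split_sentences($text);

print_r($sentences);
// [0] => I have a B.S. degree in regular expression.
// [1] => Robert J. King, Jr., MD owes $1.95.
// [2] => Done.
```

This only covers abbreviations that are in the list, and a sentence that genuinely ends with one of them (for example "...signed by Robert King, Jr.") would be merged with the sentence that follows, so it still needs checking against real test data.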
 
LVL 9

Expert Comment

by:Derek Jensen
ID: 39740032
Indeed, @Ray; whenever a solution to a problem I encounter starts to take more than one regular expression (if it's not specifically a regex problem) I like to rework it and see if there's a more programmatic solution (almost always there is), because once you start using more than one regex to solve one problem, you start to lose efficiency, readability, maintainability...hence the 99 problems "joke." ;-)