srikanth saladi
 asked on

Splitting text by period using regular expressions in PHP

Hi,

I have a requirement where I need to break a long block of text into individual sentences, using the full stop as the delimiter.

For example, the text

$text = "Music, movies, games and voice calls in great stereo quality. An improved noise canceling microphone ensures you can be easily heard, even when people are talking around you. Position the fully flexible boom arm exactly where it needs to be for calls, or move it out of the way to game, watch movies or listen to your music.";

should be split like this:

Music, movies, games and voice calls in great stereo quality.
An improved noise canceling microphone ensures you can be easily heard, even when people are talking around you.
Position the fully flexible boom arm exactly where it needs to be for calls, or move it out of the way to game, watch movies or listen to your music.

I have tried this:

$seperator = ". ";
$description = array();
$description = explode($seperator,$text);

But when the text contains numeric data such as "1.2 values" or abbreviated words such as "ft.4" (feet), the sentence is split like this:

1.
2 values

I want to exclude numbers with decimals and abbreviated words when splitting. I know this can be done using regular expressions. Please suggest how this can be done.
PHP, Regular Expressions

Last Comment
Derek Jensen

8/22/2022 - Mon
Ray Paseur

If you explode a string like 1.2 using a glue like '. ' it will bypass the 1.2 because the whitespace is missing.  Can you please give us a small test data set that has some more examples?  I'll be glad to work with that to give you something appropriate.  Thanks!
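The behaviour described above can be checked with a short script. This is a minimal sketch; the sample text is made up for illustration and is not the asker's data. It shows that explode() with '. ' leaves "1.2" alone, but still breaks after an abbreviation that is followed by a space, and drops the period from every piece except the last.

```php
<?php
// Hypothetical sample text (an assumption, not from the question).
$text = "The cable is 1.2 m long. It fits most desks. Height is approx. 4 ft.";

// explode() only breaks where the full delimiter '. ' (period + space) occurs.
$parts = explode('. ', $text);

print_r($parts);
```

The decimal survives because there is no space after its period, while "approx. 4" is cut in two because there is.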
Derek Jensen

Try this regex:

/(?<!\d)\.\s*(?!\d)/
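A minimal sketch of how this pattern might be used with preg_split(). The second sentence of the sample text is invented to include the "ft.4" and "1.2" cases from the question; the PREG_SPLIT_NO_EMPTY flag is an added choice to drop the empty piece after the final period.

```php
<?php
// Pattern quoted from the reply above: a period that is neither
// preceded nor followed by a digit, plus any trailing whitespace.
$pattern = '/(?<!\d)\.\s*(?!\d)/';

// First sentence from the question, second one hypothetical.
$text = "Music, movies, games and voice calls in great stereo quality. "
      . "The boom arm extends ft.4 and weighs 1.2 ounces.";

$sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

Note that preg_split() consumes whatever the pattern matches, so the returned sentences no longer carry their closing period.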


ASKER CERTIFIED SOLUTION
Ray Paseur

Terry Woods

Ray, your solution looks good. The \s+ part of your pattern will also handle newlines, keeping the result tidy.
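The accepted pattern is not reproduced in this thread, so the following is only a stand-in to illustrate the point about \s+. It assumes a pattern that splits on whitespace following a period; because \s matches spaces, tabs, carriage returns and newlines, one split consumes the whole run.

```php
<?php
// Stand-in pattern (an assumption, not the accepted solution):
// one or more whitespace characters that come right after a period.
$pattern = '/(?<=\.)\s+/';

// Sentences separated by a newline and by a mix of spaces and CRLF.
$text = "First sentence.\nSecond sentence.  \r\n  Third sentence.";

$sentences = preg_split($pattern, $text);

print_r($sentences);
```

Since the lookbehind does not consume the period, each piece keeps its full stop and no stray line breaks are left at the edges.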
Derek Jensen

@Ray, you bring up several good points, and although I did consider some of them when constructing my regex, I decided not to implement them, and with good reason:

Your ellipses were handled quite easily, with this regex:

/(?<!\d|\.)\.\s*(?!\d|\.)/


as was the comma immediately following the period:

/(?<!\d|\.)\.\s*(?!\d|\.|,)/


However, try as I might, the final offending period could *not* be accounted for without specifying a massive number of specific examples to look for. Were I to attempt a broad match that skips the period following the J (with something like "\b\w\."), I would no longer match a very specific yet frustratingly elusive sentence end: "I." as in:
Robert J. King Jr., MD am I.
I think this sentence is an extremely good example of a 'worst-case' scenario you might possibly encounter, and so would be terrific for matching against, but the likelihood that you would actually encounter a sentence like this in real-life text is astronomically small, and thus it is not prudent to attempt to match against it using regex.

Academically, I'm sure I could've matched against it using some complex set of dependent matches using "\b\w\." and "\s{2,}" or some such, but you must also consider that not everyone puts two spaces after their sentence ends (e.g. moi).
Add to that the fact that you're just as likely (if not more so) to encounter other languages in your text, such as Spanish, which has "y" as its own word as well.

Barring all that, you still have the MD to contend with, assuming they chose to put periods in the abbreviation, which could end up looking like M.D., and perhaps they chose not to follow the Jr. with a comma (either of which is a valid alternative punctuation), in which case the above regex fails.

All things considered, the only thing prudent to check against would be ellipses, since they are much more common in everyday language than they used to be...
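A minimal sketch of how the last pattern above behaves, using the worst-case sentence from this post and one made-up sentence containing an ellipsis and a decimal. PREG_SPLIT_NO_EMPTY is an added choice to drop the empty trailing piece.

```php
<?php
// Final pattern from this post: skip periods next to digits or other
// periods, and periods that are immediately followed by a comma.
$pattern = '/(?<!\d|\.)\.\s*(?!\d|\.|,)/';

// Hypothetical text with an ellipsis and a decimal.
$ellipsis = "Wait... it costs 1.2 dollars. Done.";
// Worst-case sentence quoted in the post.
$worst = "Robert J. King Jr., MD am I.";

$a = preg_split($pattern, $ellipsis, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($pattern, $worst, -1, PREG_SPLIT_NO_EMPTY);

print_r($a);
print_r($b);
```

The ellipsis and the decimal are left intact, and "Jr.," is protected by the comma check, but the period after the initial "J" is still treated as a sentence end.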
Ray Paseur

There's a joke that goes around among experienced programmers something like this, "I had a problem so I tried to use a regular expression.  Now I have two problems!"

:-)  and best of luck with the project, ~Ray
Derek Jensen

You mean this joke? ;-)
Ray Paseur

Love it!  There is also this: http://en.wikipedia.org/wiki/99_Problems
Mazdajai

Will that parse the following? :)

I have a B.S. degree in regular expression.
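For reference, a quick sketch of what the comma-aware pattern posted earlier in this thread does with that sentence. PREG_SPLIT_NO_EMPTY is an added choice to drop the empty trailing piece.

```php
<?php
// Pattern posted earlier in the thread.
$pattern = '/(?<!\d|\.)\.\s*(?!\d|\.|,)/';

$text = "I have a B.S. degree in regular expression.";

$pieces = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($pieces);
```

Both periods in "B.S." sit between letters, so nothing in the pattern can tell them apart from a sentence end and the abbreviation is cut into fragments.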


Ray Paseur

@Mazdajai: That is exactly the issue I raised here, and it underscores the need for the author to give us the test data set!  It's unlikely that a single regular expression will handle all of the cases, or even most of the cases.  More likely, a little extra programming will be needed.  A tangentially related problem is the question of how to capitalize common English language names.  An easy guess is "make the first letter a capital," but when you deconstruct it, it's not as simple as it seems.  You need a lot of rules, and even with those rules, you'll almost certainly not cover 100% of the issues.  With punctuation we have a better chance of success because there are fewer rules.

<?php // RAY_capitalize_names.php
error_reporting(E_ALL);

// SOME TEST NAMES
$names = array
( "o'brien"
, 'MCAFEE'
, "barrett-o'reilly"
, "smith jones"
, "burns"
, "CROWTHER"
, "George w. bush, iiI"
, "RONALD    MCDONALD"
, "RONALD    MCDONALD-o'brien"
, "van De Graaff GeneratoR"
)
;

// TEST EACH CASE
foreach ($names as $name)
{
    echo "<br/>$name ";
    echo fixname($name);
}


// FUNCTION TO HANDLE NAMES
function fixname($name)
{
    // SPECIAL CASES FOR UPPER OR LOWER CASE DISPOSITION
    $uc = array  // UPPERCASE AFTER ANY OF THESE
    ( 'Mc'
    , "'"
    , '-'
    )
    ;

    $lc = array  // ALWAYS LOWER CASE
    ( 'Van De '
    )
    ;

    $mc = array  // ALWAYS UPPER CASE
    ( 'Iii'
    )
    ;

    // REMOVE UNNECESSARY BLANKS
    $name = preg_replace('/\s\s+/', ' ', $name);

    // START WITH LOWER CASE AND UPPER FIRST
    $name = strtolower($name);
    $name = ucwords($name);

    // CHECK FOR KNOWN SPECIAL UPPER-CASES
    foreach ($uc as $dlm)
    {
        // FIX THE Mcdonald EXAMPLE, ETC
        $namex = explode($dlm, $name);
        foreach ($namex as $k => $v)
        {
            $namex[$k] = ucwords($v);
        }
        $name = implode($dlm, $namex);
    }

    // CHECK FOR KNOWN CONSTANT LOWER-CASES
    foreach ($lc as $dlm)
    {
        // FIX THE van de Graaff EXAMPLE
        $name = str_replace($dlm, strtolower($dlm), $name);
    }

    // CHECK FOR KNOWN CONSTANT UPPERCASE
    foreach ($mc as $dlm)
    {
        // FIX THE Bush, III EXAMPLE
        $name = str_replace($dlm, strtoupper($dlm), $name);
    }

    // RETURN THE REPAIRED STRING
    return $name;
}


Mazdajai

I want to excludes those numbers with decimals and abbreviated words when splitting.I know this can be done using regular expressions.

The purpose of my post is not to complicate the issue, but rather to suggest to the user that the above statement is false.

Depending on the number of records or the size of the data set, it may require a manual inspection, because this is not a log file with a consistent format. You can only exclude known abbreviations.
Ray Paseur

... can only exclude known abbreviation.
Exactly, and that's why trying to rely on a single regular expression is not going to be very fruitful.  It is possible to do what the author wants, it's just not possible with a single regular expression.  More likely it will be a combination of several PHP statements that take into account known abbreviations, special considerations, etc.  In the case of capitalizing names, we had to have several rules (and that's still not perfect).  I am sure that once the author revisits the responses here, and provides the test data, we can get a solution that will work with the test data set.  The quality of the solution will be determined by the degree to which the test data looks like the actual data.

Best regards, ~Ray
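One way the "several PHP statements plus known abbreviations" idea could look is sketched below. This is an illustration under stated assumptions, not the accepted solution: the function name split_sentences, the abbreviation list, and the sample text are all hypothetical. It first splits on whitespace after a period, then glues a piece back onto the next one when it ends in a listed abbreviation.

```php
<?php
// Hypothetical helper: split text into sentences without breaking
// after any abbreviation in a hand-maintained, lowercase list.
function split_sentences(string $text, array $abbreviations): array
{
    // Step 1: candidate breaks are runs of whitespace that follow a period.
    $chunks = preg_split('/(?<=\.)\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $sentences = [];
    $buffer = '';
    foreach ($chunks as $chunk) {
        // Rejoin with a single space when the previous chunk was held back.
        $buffer = ($buffer === '') ? $chunk : $buffer . ' ' . $chunk;

        // Step 2: inspect the last word gathered so far.
        $words = preg_split('/\s+/', $buffer);
        $last = strtolower(end($words));

        // A known abbreviation means this was not a real sentence end.
        if (in_array($last, $abbreviations, true)) {
            continue;
        }
        $sentences[] = $buffer;
        $buffer = '';
    }
    // Flush anything left over, e.g. text that ends in an abbreviation.
    if ($buffer !== '') {
        $sentences[] = $buffer;
    }
    return $sentences;
}

// Hypothetical abbreviation list and sample text.
$abbreviations = ['b.s.', 'jr.', 'ft.', 'approx.'];
$text = "I have a B.S. degree in regular expression. "
      . "The arm is approx. 1.2 ft. long. Done.";

$result = split_sentences($text, $abbreviations);
print_r($result);
```

The trade-off is the one already noted in this thread: only listed abbreviations are protected, and a sentence that genuinely ends with one of them gets merged with the sentence that follows.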
Derek Jensen

Indeed, @Ray; whenever a solution to a problem I encounter starts to take more than one regular expression (if it's not specifically a regex problem) I like to rework it and see if there's a more programmatic solution (almost always there is), because once you start using more than one regex to solve one problem, you start to lose efficiency, readability, maintainability...hence the 99 problems "joke." ;-)