Splitting text by period using regular expressions in php

Hi,

I have a requirement where I need to break a long passage of text into single sentences, using the full stop as the delimiter.

For example, the text

$text = "Music, movies, games and voice calls in great stereo quality. An improved noise canceling microphone ensures you can be easily heard, even when people are talking around you. Position the fully flexible boom arm exactly where it needs to be for calls, or move it out of the way to game, watch movies or listen to your music.";

should be split like this:

Music, movies, games and voice calls in great stereo quality.
An improved noise canceling microphone ensures you can be easily heard, even when people are talking around you.
Position the fully flexible boom arm exactly where it needs to be for calls, or move it out of the way to game, watch movies or listen to your music.

I have tried this:

$separator = ". ";
$description = array();
$description = explode($separator, $text);

But when the text contains numeric data such as "1.2 values", or abbreviations such as "ft.4" (feet), the sentence is split like this:

1.
2 values

I want to exclude numbers with decimals and abbreviations when splitting. I know this can be done using regular expressions. Please suggest how this can be done.

Asked by: srikanth saladi
 
Ray Paseur commented:
If you explode a string like 1.2 using a delimiter like '. ', the 1.2 is left intact because there is no whitespace after the dot. Can you please give us a small test data set that has some more examples? I'll be glad to work with that to give you something appropriate. Thanks!
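
A minimal sketch of the point above, using made-up sample strings (not the asker's data): explode() only splits where the exact delimiter occurs, so a decimal number survives, while an abbreviation followed by a space does not.

```php
<?php
// Hypothetical sample text: one decimal number, two sentences.
$text  = 'The cable is 1.2 metres long. It ships in a small box.';
$parts = explode('. ', $text);
print_r($parts);
// The dot in "1.2" has no space after it, so it is not a split point:
// [0] => The cable is 1.2 metres long
// [1] => It ships in a small box.

// Hypothetical sample with abbreviations that ARE followed by a space.
$text2  = 'It is approx. 4 ft. long.';
$parts2 = explode('. ', $text2);
print_r($parts2);
// Each "dot + space" is treated as a sentence end:
// [0] => It is approx
// [1] => 4 ft
// [2] => long.
```

Note that explode() also consumes the delimiter, so every piece except the last loses its trailing dot.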

Derek Jensen commented:
Try this regex:

/(?<!\d)\.\s*(?!\d)/
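
As a rough illustration of how this pattern could be applied, here is a sketch that feeds it to preg_split(). The sample sentence is invented for the example, and PREG_SPLIT_NO_EMPTY is used only to drop the empty piece left after the final dot.

```php
<?php
// The pattern from the comment above: a dot that is neither preceded
// nor (after optional whitespace) followed by a digit.
$pattern = '/(?<!\d)\.\s*(?!\d)/';

// Hypothetical sample text.
$text = 'The price is 1.95 plus tax. We have it in stock. Order today.';

$sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($sentences);
// [0] => The price is 1.95 plus tax
// [1] => We have it in stock
// [2] => Order today
```

The matched dots are consumed by the split, so they would have to be re-appended if the output needs them. Because `\s*` also allows zero whitespace, dots inside abbreviations such as "e.g." would still be treated as split points.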

Ray Paseur commented:
Please see: http://www.laprbass.com/RAY_temp_srikanth_saladi.php

Decimal numbers are no problem at all, whether with explode() or with a whitespace-aware regular expression.  But I think the problem may be a bit bigger than it looks.  Here are some of the things you may need to account for.  You will probably want to include samples like this in your test data to be sure that the algorithm you're developing will work correctly on the range of input you expect.

Dots inside proper names like Robert J. King, Jr., MD
Ellipses... embedded in sentences.
Tabular data.

Executive summary: The period is an "overloaded" punctuation mark with multiple meanings. When you're parsing natural written language, it's never quite as easy for a computer as it is for a sentient being that reads the language. One possible rule might rely on the traditional typist's rule of two spaces after a period that ends a sentence. Another might be that any end-of-line character after a dot means that the dot is the end of a sentence. You might also look for very short sentence fragments, such as those that could arise from abbreviations. It's an interesting problem, to be sure!

<?php // RAY_temp_srikanth_saladi.php
error_reporting(E_ALL);
echo '<pre>';


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28324495.html


$texts = array
( "Music, movies, games and voice calls in great stereo quality. An improved noise canceling microphone ensures you can be easily heard, even when people are talking around you. Position the fully flexible boom arm exactly where it needs to be for calls, or move it out of the way to game, watch movies or listen to your music."
, 'The price of the item is $1.95 plus tax.  We have 3 of them in stock.'
)
;


function get_sentences($str)
{
    $rgx
    = '#'        // REGEX DELIMITER
    . '\.'       // ESCAPED DOT
    . '\s+'      // WHITESPACE, ONE OR MORE
    . '#'        // REGEX DELIMITER
    ;

    $arr = preg_split($rgx, $str, -1, PREG_SPLIT_NO_EMPTY);

    // RESTORE DOTS REMOVED BY THE SPLIT
    foreach ($arr as $key => $val)
    {
        // TRIM FIRST SO THE LAST-CHARACTER TEST IGNORES STRAY WHITESPACE
        $val = trim($val);
        if (substr($val, -1) != '.') $val .= '.';
        $arr[$key] = $val;
    }
    return $arr;
}


foreach ($texts as $text)
{
    var_dump(get_sentences($text));
    echo PHP_EOL;
}
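
Two of the rules mentioned in the executive summary above (two spaces after a sentence-ending period, and a dot followed by an end-of-line) could be sketched roughly as follows. The function name split_on_typist_rule() and the sample text are hypothetical, and the sketch assumes the input really does follow the two-space convention; text typed with single spaces would not be split at all.

```php
<?php
// Hypothetical helper: split after a dot only when it is followed by
// two or more spaces/tabs, or by a line break.
function split_on_typist_rule($str)
{
    $rgx
    = '/'
    . '(?<=\.)'              // LOOKBEHIND: KEEP THE DOT WITH ITS SENTENCE
    . '(?:'
    . '[ \t]{2,}'            // TWO OR MORE SPACES OR TABS
    . '|'
    . '[ \t]*[\r\n]+\s*'     // OR A LINE BREAK, WITH ANY PADDING AROUND IT
    . ')'
    . '/'
    ;
    return preg_split($rgx, trim($str), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical sample: single spaces after "J." and "Jr.," are not split points.
$sample = "Robert J. King, Jr., MD spoke first.  Sales rose 1.5 percent.\nThe end.";
print_r(split_on_typist_rule($sample));
// [0] => Robert J. King, Jr., MD spoke first.
// [1] => Sales rose 1.5 percent.
// [2] => The end.
```

Because the dot is matched in a lookbehind rather than consumed, there is no need to restore it after the split.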

Terry Woods (IT Guru) commented:
Ray, your solution looks good. The \s+ part of your pattern will also handle newlines, keeping the result tidy.

Derek Jensen commented:
@Ray, you bring up several good points, and although I did consider some of them when constructing my regex, I decided not to implement them, and with good reason:

Your ellipses were handled quite easily, with this regex:

/(?<!\d|\.)\.\s*(?!\d|\.)/

as was the comma immediately following the period:

/(?<!\d|\.)\.\s*(?!\d|\.|,)/

However, try as I might, the final offending period could *not* be accounted for without specifying a massive number of specific examples to look for. If I attempted a broad match for the . following the J (with something like "\b\w\."), I would fail to split on a very specific yet frustratingly elusive sentence end: "I." as in:
Robert J. King Jr., MD am I.
I think this sentence is an extremely good example of a 'worst-case' scenario you might encounter, and so would be terrific for testing against, but the odds against actually encountering a sentence like this in real-life text are astronomical, so it is not prudent to try to handle it with a regex.

Academically, I'm sure I could've matched against it using some complex set of dependent matches using "\b\w\." and "\s{2,}" or some such, but you must also consider that not everyone puts two spaces after their sentence ends (e.g. moi).
Add to that the fact that you're just as likely (if not more so) to encounter other languages in your text, such as Spanish, which has "y" as its own word as well.

Barring all that, you still have the MD to contend with, assuming they chose to put periods in the abbreviation, which could end up looking like M.D., and perhaps they chose not to follow the Jr. with a comma (both are valid alternative punctuations), in which case the above regex fails.

All things considered, the only thing prudent to check against would be ellipses, since they are much more common in everyday language than they used to be...
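
To make the behaviour described above concrete, here is a small sketch that runs the last pattern from this comment through preg_split(). Both sample strings are invented for illustration; the second shows the single-initial case that the pattern does not handle.

```php
<?php
// The ellipsis- and comma-aware pattern from the comment above.
$pattern = '/(?<!\d|\.)\.\s*(?!\d|\.|,)/';

// Hypothetical sample: ellipsis, decimal number, and "Jr.," are all left alone.
$withEllipsis = preg_split(
    $pattern,
    'Wait... there is more, priced at 2.50 each. Ask for King Jr., MD. Thanks.',
    -1,
    PREG_SPLIT_NO_EMPTY
);
print_r($withEllipsis);
// [0] => Wait... there is more, priced at 2.50 each
// [1] => Ask for King Jr., MD
// [2] => Thanks

// Hypothetical sample: a middle initial is still treated as a sentence end.
$withInitial = preg_split($pattern, 'Robert J. King spoke.', -1, PREG_SPLIT_NO_EMPTY);
print_r($withInitial);
// [0] => Robert J
// [1] => King spoke
```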

Ray Paseur commented:
There's a joke that goes around among experienced programmers something like this, "I had a problem so I tried to use a regular expression.  Now I have two problems!"

:-)  and best of luck with the project, ~Ray

Derek Jensen commented:
You mean this joke? ;-)

Ray Paseur commented:
Love it!  There is also this: http://en.wikipedia.org/wiki/99_Problems

Mazdajai commented:
Will that parse the following? :)

I have a B.S. degree in regular expression.

Ray Paseur commented:
@Mazdajai: That is exactly the issue I raised here, and it underscores the need for the author to give us the test data set! It's unlikely that a single regular expression will handle all of the cases, or even most of the cases. More likely, a little extra programming will be needed. A tangentially related problem is the question of how to capitalize common English language names. An easy guess is "make the first letter a capital," but when you deconstruct it, it's not as simple as it seems. You need a lot of rules, and even with those rules, you'll almost certainly not cover 100% of the issues. With punctuation we have a better chance of success because there are fewer rules.

<?php // RAY_capitalize_names.php
error_reporting(E_ALL);

// SOME TEST NAMES
$names = array
( "o'brien"
, 'MCAFEE'
, "barrett-o'reilly"
, "smith jones"
, "burns"
, "CROWTHER"
, "George w. bush, iiI"
, "RONALD    MCDONALD"
, "RONALD    MCDONALD-o'brien"
, "van De Graaff GeneratoR"
)
;

// TEST EACH CASE
foreach ($names as $name)
{
    echo "<br/>$name ";
    echo fixname($name);
}


// FUNCTION TO HANDLE NAMES
function fixname($name)
{
    // SPECIAL CASES FOR UPPER OR LOWER CASE DISPOSITION
    $uc = array  // UPPERCASE AFTER ANY OF THESE
    ( 'Mc'
    , "'"
    , '-'
    )
    ;

    $lc = array  // ALWAYS LOWER CASE
    ( 'Van De '
    )
    ;

    $mc = array  // ALWAYS UPPER CASE
    ( 'Iii'
    )
    ;

    // REMOVE UNNECESSARY BLANKS
    $name = preg_replace('/\s\s+/', ' ', $name);

    // START WITH LOWER CASE AND UPPER FIRST
    $name = strtolower($name);
    $name = ucwords($name);

    // CHECK FOR KNOWN SPECIAL UPPER-CASES
    foreach ($uc as $dlm)
    {
        // FIX THE Mcdonald EXAMPLE, ETC
        $namex = explode($dlm, $name);
        foreach ($namex as $k => $v)
        {
            $namex[$k] = ucwords($v);
        }
        $name = implode($dlm, $namex);
    }

    // CHECK FOR KNOWN CONSTANT LOWER-CASES
    foreach ($lc as $dlm)
    {
        // FIX THE van de Graaff EXAMPLE
        $name = str_replace($dlm, strtolower($dlm), $name);
    }

    // CHECK FOR KNOWN CONSTANT UPPERCASE
    foreach ($mc as $dlm)
    {
        // FIX THE Bush, III EXAMPLE
        $name = str_replace($dlm, strtoupper($dlm), $name);
    }

    // RETURN THE REPAIRED STRING
    return $name;
}

Mazdajai commented:
"I want to exclude those numbers with decimals and abbreviated words when splitting. I know this can be done using regular expressions."

The purpose of my post is not to complicate the issue, but rather to suggest to the user that the above statement is false.

Depending on the number of records or the size of the data set, it may require manual inspection, because this is not a log file with a consistent format. You can only exclude known abbreviations.

Ray Paseur commented:
"... can only exclude known abbreviations."
Exactly, and that's why trying to rely on a single regular expression is not going to be very fruitful. It is possible to do what the author wants; it's just not possible with a single regular expression. More likely it will be a combination of several PHP statements that take into account known abbreviations, special considerations, etc. In the case of capitalizing names, we had to have several rules (and that's still not perfect). I am sure that once the author revisits the responses here, and provides the test data, we can get a solution that will work with the test data set. The quality of the solution will be determined by the degree to which the test data looks like the actual data.
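
One way such a combination of statements might look is sketched below. This is only an illustration under stated assumptions: the function name split_sentences(), the placeholder character, and the abbreviation list are all hypothetical, and the list would have to be built from the real data. The idea is to mask the dots in known abbreviations, split after any remaining dot that is followed by whitespace, and then restore the masked dots.

```php
<?php
// Hypothetical helper; $abbreviations is an assumed, hand-maintained list.
function split_sentences($text, array $abbreviations)
{
    // PLACEHOLDER FOR PROTECTED DOTS (ASSUMED NOT TO OCCUR IN THE INPUT)
    $mask = "\x1F";

    // 1. MASK THE DOTS INSIDE EACH KNOWN ABBREVIATION
    foreach ($abbreviations as $abbr)
    {
        // THE LOOKBEHIND STOPS "ft." FROM MATCHING INSIDE A WORD LIKE "left."
        $rgx  = '/(?<![A-Za-z])' . preg_quote($abbr, '/') . '/';
        $text = preg_replace($rgx, str_replace('.', $mask, $abbr), $text);
    }

    // 2. SPLIT AFTER ANY REMAINING DOT THAT IS FOLLOWED BY WHITESPACE
    //    (DECIMALS LIKE 1.2 HAVE NO WHITESPACE AFTER THE DOT)
    $parts = preg_split('/(?<=\.)\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // 3. RESTORE THE MASKED DOTS
    foreach ($parts as $key => $val)
    {
        $parts[$key] = str_replace($mask, '.', $val);
    }
    return $parts;
}

// Hypothetical test data drawn from the examples in this thread.
$text = 'I have a B.S. degree in regular expression. Robert J. King, Jr., MD wrote 1.2 pages. Done.';
print_r(split_sentences($text, array('B.S.', 'Jr.', 'J.')));
// [0] => I have a B.S. degree in regular expression.
// [1] => Robert J. King, Jr., MD wrote 1.2 pages.
// [2] => Done.
```

The trade-off is the one already discussed: an abbreviation on the list is never treated as a sentence end, so a sentence that genuinely ends with "Jr." would be merged with the next one, and anything missing from the list still splits.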

Best regards, ~Ray

Derek Jensen commented:
Indeed, @Ray; whenever a solution to a problem I encounter starts to take more than one regular expression (if it's not specifically a regex problem) I like to rework it and see if there's a more programmatic solution (almost always there is), because once you start using more than one regex to solve one problem, you start to lose efficiency, readability, maintainability...hence the 99 problems "joke." ;-)