Splitting a paragraph into sentences

I have a bit of code that takes a chunk of text and splits it into individual sentences. It works pretty well, but there are a few cases I would like to cover in the regular expression itself rather than in post-processing cleanup. Those involve titles like Mr., Mrs. or Dr. Right now the code splits the sentence after the title, which is not desirable. Given the enclosed code, can you see how to alter the pattern to prevent this from happening?
import java.util.regex.*;

public class Test {
    public static void main(String[] args) throws Exception {
        String teststring = "This is a simple sentence. This is a sentence about Mr. Smith and Dr. Jones.  This is a rather more complicated (e.g. one that contains a clause) and holds a sentence (2.25). " +
            "And this is another sentence but finishes with a number 12. And this is another (small-sized) sentence. " +
            "Finally, this is the last sentence in this (rather short) paragraph." +
            " And what about this sentence? And of course don't forget this one!  Amen brother." +
            " Here is a bullet list test a.  one bullet; b. two bullets c. three bullets.";
        // Create a pattern to match breaks: split at the position right after
        // a word character, a word character or closing bracket, a terminator (. ? !)
        // and one whitespace character
        Pattern p = Pattern.compile("(?<=\\w[\\w\\)\\]][\\.\\?\\!]\\s)");
        String[] result = p.split(teststring);
        for (int i = 0; i < result.length; i++)
            System.out.println("i->" + result[i]);
    }
}

Asked by: efamilant
 
Gurvinder Pal Singh commented:
Did you consider this?
http://stanfordparser.rubyforge.org/

 
CEHJ commented:
Try something like

Pattern p = Pattern.compile("(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr)[\\.\\?\\!]\\s)");
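
To see what the added negative lookbehind does, here is a minimal, self-contained sketch that runs the pattern above over a shortened version of the test string. The class name TitleAwareSplit and the split helper are made up for this illustration; only the pattern itself comes from the thread.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern string is taken from the thread.
class TitleAwareSplit {
    // Break position: right after <word char><word char or ) or ]><. ? !><whitespace>,
    // unless the text just before the terminator is "Mr", "Mrs" or "Dr".
    static final Pattern BREAK =
        Pattern.compile("(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr)[\\.\\?\\!]\\s)");

    static String[] split(String text) {
        return BREAK.split(text);
    }

    public static void main(String[] args) {
        String text = "This is a simple sentence. "
            + "This is a sentence about Mr. Smith and Dr. Jones. "
            + "And what about this sentence?";
        for (String sentence : split(text)) {
            // The split is zero-width, so each piece keeps its trailing space.
            System.out.println("[" + sentence.trim() + "]");
        }
    }
}
```

The sentence about Mr. Smith and Dr. Jones stays in one piece because the lookbehind rejects any break whose terminator is directly preceded by one of the listed titles. Each piece keeps the whitespace that followed its terminator, so call trim() if that matters.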
 
Gurvinder Pal Singh commented:
This is a hard problem to solve if you really want a complete (rather than heuristic) solution.
You can add more titles like Mr., or abbreviations like M.B.B.S, to the lookbehind in CEHJ's solution.
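
One way to make that extension manageable is to build the lookbehind from a list instead of editing the regex by hand. The following is only a sketch under assumptions: the class name, the helper names and the abbreviation list are invented for illustration, and the \b word boundary is an addition of mine (not part of CEHJ's pattern) so that only whole-word abbreviations are excluded.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not from the thread: builds the exclusion list dynamically.
class AbbreviationSplitter {
    // Assumed sample list; extend it with whatever titles your text uses.
    static final List<String> ABBREVIATIONS =
        Arrays.asList("Mr", "Mrs", "Ms", "Dr", "Prof");

    static Pattern buildPattern(List<String> abbreviations) {
        // Quote each entry so it is always treated as a literal.
        String alternatives = abbreviations.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        // Same shape as CEHJ's pattern, with the alternation generated from the list.
        // The \b (my addition) restricts the exclusion to whole words.
        return Pattern.compile(
            "(?<=\\w[\\w\\)\\]](?<!\\b(?:" + alternatives + "))[\\.\\?\\!]\\s)");
    }

    static String[] split(String text) {
        return buildPattern(ABBREVIATIONS).split(text);
    }

    public static void main(String[] args) {
        for (String s : split("Prof. Lee met Ms. Ray. They talked.")) {
            System.out.println("[" + s.trim() + "]");
        }
    }
}
```

Dotted forms such as M.B.B.S. are a different case. The base pattern requires a word character followed by a word character or closing bracket before the terminator, and the character before the final S is a dot, so no break is produced after them even without a list entry. The flip side is that a sentence that genuinely ends in such an abbreviation will not be split either.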
 
efamilant (author) commented:
Great.   Just what I needed.
 
CEHJ commented:
:-)