tuchfeld asked:
Stanford NLP: enhance splitting a line into sentences

Hello,
http://nlp.stanford.edu/software/corenlp.shtml
The example in the documentation works fine:
\stanford-corenlp-full-2015-04-20> java -cp stanford-corenlp-3.5.2.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file input.txt

input.txt:
I saw Dr. Spock yesterday, he was speaking with Mr. McCoy.  They were walking down Mullholand Dr. talking about www.google.com.  Dr. Spock returns!

output:
Sentence #1 (13 tokens):
I saw Dr. Spock yesterday, he was speaking with Mr. McCoy.
...
Sentence #2 (10 tokens):
They were walking down Mullholand Dr. talking about www.google.com.
...
Sentence #3 (4 tokens):
Dr. Spock returns!
...

But if the input is:
The paper is 7 cm. length. What is you name? the size of the picture is 5 cm. x 8 cm.

the output is:
Sentence #1 (6 tokens):
The paper is 7 cm.
...
Sentence #2 (2 tokens):
length.
...
Sentence #3 (5 tokens):
What is you name?
...
Sentence #4 (9 tokens):
the size of the picture is 5 cm.
...
Sentence #5 (4 tokens):
x 8 cm.
...

Can I add a regex constraint like this somewhere, as the sentence separator:
(?<!cm)\.(?=\s)
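As a sanity check outside CoreNLP, the lookbehind pattern does behave as intended when run through plain java.util.regex (note CoreNLP's ssplit operates on tokens, so the pattern cannot simply be dropped in as-is):

```java
import java.util.regex.Pattern;

public class SplitCheck {
    public static void main(String[] args) {
        String text = "The paper is 7 cm. length. What is you name? "
                    + "the size of the picture is 5 cm. x 8 cm.";
        // Split on a period followed by whitespace, unless preceded by "cm".
        Pattern separator = Pattern.compile("(?<!cm)\\.(?=\\s)");
        for (String piece : separator.split(text)) {
            System.out.println(piece.trim());
        }
    }
}
```

This yields two pieces, splitting only after "length." and leaving "7 cm." and "5 cm." intact. The pattern treats only "." as a boundary, so "What is you name?" is not split off here; CoreNLP's own splitter already handles "?" and "!".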



Thanks, Aryeh.
Bob Learned replied:

I am trying to figure out the benefit of annotators, like RegexNER:

http://nlp.stanford.edu/software/regexner/

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
// Mapping file with one tab-separated pattern/tag pair per line.
props.put("regexner.mapping", "org/foo/resources/jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Another interesting discovery:

Stanford TokensRegex
http://nlp.stanford.edu/software/tokensregex.shtml

List<CoreMap> sentences = ...;
CoreMapExpressionExtractor extractor =
    CoreMapExpressionExtractor.createExtractorFromFiles(
        TokenSequencePattern.getNewEnv(), file1, file2, ...);
for (CoreMap sentence : sentences) {
  List<MatchedExpression> matched = extractor.extractExpressions(sentence);
  ...
}


Stanford CoreNLP FAQ
http://nlp.stanford.edu/software/corenlp-faq.shtml

How do I add constraints to the parser in CoreNLP?
The parser can be instructed to keep certain sets of tokens together as a single constituent. If you do this, it will try to make a parse which contains a subtree where the exact set of tokens in that subtree are the ones specified in the constraint.

For any sentence where you want to add constraints, attach the ParserAnnotations.ConstraintAnnotation to that sentence. This annotation is a List<ParserConstraint>, where ParserConstraint specifies the start (inclusive) and end (exclusive) of the range and a pattern which the enclosing constituent must match. However, there is a bug in the way patterns are handled in the parser, so it is strongly recommended to use .* for the matching pattern.
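The [start, end) range described above is the usual Java half-open convention. As a plain-Java illustration (the helper method here is made up for illustration, not CoreNLP API), a constraint with start=2 and end=4 over the first example sentence covers exactly "Dr." and "Spock":

```java
import java.util.Arrays;
import java.util.List;

public class ConstraintRange {
    // Tokens covered by a ParserConstraint-style range:
    // start is inclusive, end is exclusive, exactly as in List.subList.
    static List<String> covered(List<String> tokens, int start, int end) {
        return tokens.subList(start, end);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I", "saw", "Dr.", "Spock", "yesterday");
        // Constraining tokens [2, 4) asks the parser to keep
        // "Dr. Spock" together as a single constituent.
        System.out.println(covered(tokens, 2, 4));
    }
}
```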
ASKER CERTIFIED SOLUTION (posted by tuchfeld)
There really should be an easier way than having to rebuild the Java code.  What changes did you have to make to the PTBLexer.flex file?
SOLUTION

SOLUTION
Thanks for sharing!!
I wonder if this setting for the PTBTokenizer would have some effect:

Class PTBTokenizer<T extends HasWord>
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html

strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3 WSJ tokenization in two cases; setting this option improves compatibility for those cases. They are: (i) when an acronym is followed by a sentence end, such as "U.K." at the end of a sentence, the PTB3 has tokens of "U.K" and ".", while by default PTBTokenizer duplicates the period, returning tokens of "U.K." and "."; and (ii) PTBTokenizer will return numbers with a whole number and a fractional part, like "5 7/8", as a single token with a non-breaking space in the middle, while the PTB3 separates them into two tokens, "5" and "7/8". (Exception: for "U.S." only, the treebank does have the two tokens "U.S." and ".", like our default; strictTreebank3 now does that too.) The default is false.
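If rebuilding PTBLexer.flex can be avoided, tokenizer options such as strictTreebank3 can be forwarded through the pipeline's tokenize.options property. A sketch (whether strictTreebank3 actually changes the "cm." behaviour is an open question; the pipeline construction itself is left as a comment since it needs the CoreNLP jars on the classpath):

```java
import java.util.Properties;

public class PipelineConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        // Forwarded to PTBTokenizer; strictTreebank3 changes the
        // acronym-final-period handling described above.
        props.setProperty("tokenize.options", "strictTreebank3=true");
        // With stanford-corenlp-3.5.2.jar on the classpath, the pipeline
        // would then be built as:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("tokenize.options"));
    }
}
```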
tuchfeld (asker) replied:
Eventually I found the solution myself. I think it might be useful for others.