Stanford NLP: improving how a line is split into sentences

Hello,
The example in the CoreNLP documentation (http://nlp.stanford.edu/software/corenlp.shtml) works fine:
\stanford-corenlp-full-2015-04-20> java -cp stanford-corenlp-3.5.2.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file input.txt

input.txt:
I saw Dr. Spock yesterday, he was speaking with Mr. McCoy.  They were walking down Mullholand Dr. talking about www.google.com.  Dr. Spock returns!

output:
Sentence #1 (13 tokens):
I saw Dr. Spock yesterday, he was speaking with Mr. McCoy.
...
Sentence #2 (10 tokens):
They were walking down Mullholand Dr. talking about www.google.com.
...
Sentence #3 (4 tokens):
Dr. Spock returns!
...

But if the input is:
The paper is 7 cm. length. What is you name? the size of the picture is 5 cm. x 8 cm.

the output is:
Sentence #1 (6 tokens):
The paper is 7 cm.
...
Sentence #2 (2 tokens):
length.
...
Sentence #3 (5 tokens):
What is you name?
...
Sentence #4 (9 tokens):
the size of the picture is 5 cm.
...
Sentence #5 (4 tokens):
x 8 cm.
...

Can I supply a regex such as the following as the sentence separator somewhere, so that a period after "cm" does not end a sentence?
(?<!cm)\.(?=\s)
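
To make the intent of that pattern concrete, here is a minimal sketch in plain Java (java.util.regex only, no CoreNLP involved). It is not how CoreNLP's ssplit works; it only shows what a "not after cm" boundary rule does to the problem input. The class name RegexSentenceSplit and the slightly extended pattern (which also accepts "?" and "!" as sentence ends) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexSentenceSplit {
    // Split at whitespace that follows '.', '?' or '!', unless the
    // three characters just before that whitespace are "cm.".
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=[.?!])(?<!cm\\.)\\s+");

    static List<String> split(String text) {
        return Arrays.asList(BOUNDARY.split(text.trim()));
    }

    public static void main(String[] args) {
        String text = "The paper is 7 cm. length. What is you name? "
                + "the size of the picture is 5 cm. x 8 cm.";
        // Prints one sentence per line.
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

A bare regex like this knows nothing about "Dr." or "Mr.", so it would break the first example again; it only illustrates the rule being asked for.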


Thanks, Aryeh.
tuchfeld asked:

Bob Learned commented:
I am trying to figure out the benefit of annotators, like RegexNER:

http://nlp.stanford.edu/software/regexner/

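import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
// Enable RegexNER at the end of the annotator list and point it at the mapping file.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.setProperty("regexner.mapping", "org/foo/resources/jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);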
Bob Learned commented:
Another interesting discovery:

Stanford TokensRegex
http://nlp.stanford.edu/software/tokensregex.shtml

 List<CoreMap> sentences = ...;
 CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), file1, file2,...);
 for (CoreMap sentence:sentences) {
   List<MatchedExpression> matched = extractor.extractExpressions(sentence);
   ...
 }

Bob Learned commented:
Stanford CoreNLP FAQ
http://nlp.stanford.edu/software/corenlp-faq.shtml

How do I add constraints to the parser in CoreNLP?
The parser can be instructed to keep certain sets of tokens together as a single constituent. If you do this, it will try to make a parse which contains a subtree where the exact set of tokens in that subtree are the ones specified in the constraint.

For any sentence where you want to add constraints, attach the ParserAnnotations.ConstraintAnnotation to that sentence. This annotation is a List<ParserConstraint>, where ParserConstraint specifies the start (inclusive) and end (exclusive) of the range and a pattern which the enclosing constituent must match. However, there is a bug in the way patterns are handled in the parser, so it is strongly recommended to use .* for the matching pattern.

tuchfeld (Author) commented:
I got a tip:
stackoverflow.com/questions/30550739/how-can-i-teach-the-nlp-splitter
The steps are:
1) Edit PTBLexer.flex.
2) Recompile stanford-corenlp-full-2015-04-20 with ant and create the jar file.
3) Run ikvmc.
4) Replace the DLL in the project.
See snapshot.
P.S. It was not entirely trivial, because care has to be taken with versions.
Snapshot: paragraph split into sentences, taking measurement abbreviations such as inch and centimeter into account.

Bob Learned commented:
There really should be an easier way than having to rebuild the Java code.  What changes did you have to make to the PTBLexer.flex file?
tuchfeld (Author) commented:
ABMEASURE = In|Cm|mm /* <== */

/* ABRREV2 abbreviations are normally followed by an upper case word.
 *  We assume they aren't used sentence finally.
 */
ABBREV4 = [A-Za-z]|{ABTITLE}|vs|Alex|Wm|Jos|Cie|a\.k\.a|cf|TREAS|{ACRO}|{ABCOMP2}|{ABMEASURE}
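
For comparison, a similar effect can be approximated without rebuilding the lexer, by post-processing the sentences the splitter returns. The sketch below is a swapped-in technique, not part of CoreNLP: it glues a sentence onto the previous one whenever the previous one ends in a measurement abbreviation. The class name MeasurementMerge and the abbreviation set (modelled on ABMEASURE above, lower-cased) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class MeasurementMerge {
    // Hypothetical list, modelled on ABMEASURE = In|Cm|mm.
    private static final Set<String> MEASURES =
            new HashSet<>(Arrays.asList("in.", "cm.", "mm."));

    // True if the last whitespace-separated word is a measurement abbreviation.
    static boolean endsWithMeasure(String sentence) {
        String last = sentence.substring(sentence.lastIndexOf(' ') + 1);
        return MEASURES.contains(last.toLowerCase(Locale.ROOT));
    }

    // Append each sentence to its predecessor when the predecessor
    // ends in a measurement abbreviation; otherwise keep it separate.
    static List<String> merge(List<String> sentences) {
        List<String> out = new ArrayList<>();
        for (String s : sentences) {
            int lastIdx = out.size() - 1;
            if (lastIdx >= 0 && endsWithMeasure(out.get(lastIdx))) {
                out.set(lastIdx, out.get(lastIdx) + " " + s);
            } else {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList(
                "The paper is 7 cm.", "length.", "What is you name?",
                "the size of the picture is 5 cm.", "x 8 cm.");
        for (String sentence : merge(split)) {
            System.out.println(sentence);
        }
    }
}
```

Like the lexer change, this assumes a measurement abbreviation never ends a sentence, and "in." is risky because it is also an ordinary word, so the list would need tuning for real text.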

tuchfeld (Author) commented:
I think the Flex grammar is very powerful
and might be useful in the future for additional splitting rules.
I have also created a batch file that runs the whole process:
c:
cd C:\JFLEX\bin
CMD /C jflex.bat PTBLexer.flex
copy /Y PTBLexer.java D:\Dev\NLP\stanford-corenlp\stanford-corenlp-full-2015-04-20\src\edu\stanford\nlp\process\
d:
cd D:\Dev\NLP\stanford-corenlp\stanford-corenlp-full-2015-04-20
CMD /C ant
cd classes
CMD /C jar -cfm ../stanford-corenlp-aryeh_1.1.jar ..\src\META-INF\MANIFEST.MF edu
c:
cd C:\ikvm-8.0.5449.1\bin
CMD /C ikvmc.exe -target:library D:\Dev\NLP\stanford-corenlp\stanford-corenlp-full-2015-04-20\stanford-corenlp-aryeh_1.1.jar
copy /Y stanford-corenlp-aryeh_1.1.dll D:\Dev\NLP\stanford-corenlp\Test_Stanford_NLP\bin\Debug\
echo the dll is ready
pause

Bob Learned commented:
Thanks for sharing!!
Bob Learned commented:
I wonder if this setting for the PTBTokenizer would have some effect:

Class PTBTokenizer<T extends HasWord>
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html

strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3 WSJ tokenization in two cases. Setting this improves compatibility for those cases. They are: (i) When an acronym is followed by a sentence end, such as "U.K." at the end of a sentence, the PTB3 has tokens of "Corp" and ".", while by default PTBTokenizer duplicates the period returning tokens of "Corp." and ".", and (ii) PTBTokenizer will return numbers with a whole number and a fractional part like "5 7/8" as a single token, with a non-breaking space in the middle, while the PTB3 separates them into two tokens "5" and "7/8". (Exception: for only "U.S." the treebank does have the two tokens "U.S." and "." like our default; strictTreebank3 now does that too.) The default is false.
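
If anyone wants to try that option from a pipeline, a minimal sketch of the configuration follows. It relies on the tokenize.options property, which CoreNLP documents as the way to pass PTBTokenizer options through the pipeline. Whether strictTreebank3 changes anything for the "cm." case is untested here. The class name TokenizerOptions is hypothetical, and the pipeline construction is left as a comment so the snippet runs without the CoreNLP jar.

```java
import java.util.Properties;

class TokenizerOptions {
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        // PTBTokenizer options are passed as a comma-separated key=value list.
        props.setProperty("tokenize.options", "strictTreebank3=true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With CoreNLP on the classpath, the next step would be:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        props.list(System.out);
    }
}
```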
tuchfeld (Author) commented:
Eventually I found the solution myself.
I think it might be useful for others.