NLP in .Net for Categorization

Hello,
I'm looking for an environment in .Net to be able to analyze text as in the following example:
Text = "0257 painted in 2013 horse mounted ink and color on silk"
should be recognized as follows:
0257 = id
painted in 2013 = creation date
horse = name of picture
mounted ink and color on silk = medium of the picture
I thought that Natural Language Processing is appropriate for doing that.
I found Stanford NLP for .Net
http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordCoreNLP.html
But when I tried the above sentence, it didn't split it at all, as can be seen in the following screenshot (the code is also attached).
I'd like to "teach" the algorithm MY rules for analyzing it.
Please give me suggestions.
Thanks,
  Aryeh.
 Testing Stanford NLP in Visual Studio
using System;
using System.Windows.Forms;
using System.IO;
using System.Collections.Generic;
using java.io;
using java.util;
using edu.stanford.nlp.pipeline;
using Console = System.Console;
using edu.stanford.nlp.util;

namespace Test_Stanford_NLP
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            string curDir = Environment.CurrentDirectory;
            string jarRoot = Path.GetFullPath(curDir + @"\..\..\..\stanford-corenlp-full-2015-04-20\stanford-corenlp-3.5.2-models");

            // Text for processing
            // var text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";
            var text = "0257 painted in 2013 horse mounted ink and color on silk";

            // Annotation pipeline configuration
            java.util.Properties props = new java.util.Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            props.setProperty("sutime.binders", "0");

            // We should change current directory, so StanfordCoreNLP could find all the model files automatically 
            Directory.SetCurrentDirectory(jarRoot);
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            Directory.SetCurrentDirectory(curDir);

            // Annotation
            Annotation annotation = new Annotation(text);
            pipeline.annotate(annotation);

            // Result - Pretty Print
            ByteArrayOutputStream stream = new ByteArrayOutputStream();
            pipeline.prettyPrint(annotation, new PrintWriter(stream));
            Console.WriteLine(stream.toString());
            stream.close();

            java.util.ArrayList sentences = (java.util.ArrayList)annotation.get(typeof(edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation));
            List<string> sentences_list = new List<string>();
            for (int i = 0; i < sentences.size(); i++)
            {
                sentences_list.Add(((edu.stanford.nlp.pipeline.Annotation)sentences.get(i)).toString());
            }
            string[] sentences_ar = sentences_list.ToArray();
        }
    }
}

tuchfeldAsked:
Bob LearnedCommented:
I tried to catch up to where you are, and I came across this error:

Unrecoverable error while loading a tagger model

I believe that means that I need to find some kind of tagger jar file...
Bob LearnedCommented:
I found this reference:

Stanford NLP Software for .NET
http://sergey-tihon.github.io/Stanford.NLP.NET/index.html

Important notes to self:

1) Do not try to reference several NuGet packages from your solution. They are incompatible with each other. If you need more than one - you should reference Stanford CoreNLP package. All features are packed inside.

2) Unzip *.jar file with models if such one exists.

3) Stanford Core NLP requires Java 1.8 or higher.
Bob LearnedCommented:
New error:

MetaClass couldn't create public edu.stanford.nlp.time.TimeExpressionExtractorImpl(java.lang.String,java.util.Properties) with args [sutime, {}]"}
tuchfeldAuthor Commented:
1) Did you run:
PM> Install-Package Stanford.NLP.CoreNLP
in Visual Studio?
2) You should open:
stanford-postagger-2015-04-20.zip
and reference:
\stanford-postagger-2015-04-20\models\
in your code
as seen in the jarRoot line of the code above.
Bob LearnedCommented:
1) Yes

2) I don't see stanford-postagger-2015-04-20.zip.  Do I have to download something additional?  What folder does it get unzipped into?

    I am using this for the jar root folder:

                    C:\StanfordNLP\stanford-corenlp-3.5.2-models
tuchfeldAuthor Commented:
1) to get the Tagger:
at the bottom of this page:
http://nlp.stanford.edu/software/tagger.shtml
Download basic English Stanford Tagger version 3.5.2 [24 MB]

2) Anyway, I think that I need to use the Classifier:
at the bottom of this page:
http://nlp.stanford.edu/software/classifier.shtml
Download Stanford Classifier version 3.5.2
Note again that you need to unzip the file and reference the folder.

3) I already compiled an example
that is using the data files in the example folder inside this zip.
see my following code.

4) still I do not know how to "train" / "teach" it...

Thanks,
  Aryeh.

public static void ClassifierDemo()
{
    // http://nlp.stanford.edu/software/classifier.shtml

    string curDir = Path.GetDirectoryName(Application.ExecutablePath);
    string classifierDir = Path.GetFullPath(curDir + @"\..\..\..\stanford-classifier-2015-04-20");
    Directory.SetCurrentDirectory(classifierDir);

    ColumnDataClassifier cdc = new ColumnDataClassifier("examples/cheese2007.prop");
    Classifier cl = cdc.makeClassifier(cdc.readTrainingExamples("examples/cheeseDisease.train"));
    ObjectBank iteraror = ObjectBank.getLineIterator("examples/cheeseDisease.test", "utf-8");

    foreach (String line in iteraror)
    {
        // instead of the method in the line below, if you have the individual elements
        // already you can use cdc.makeDatumFromStrings(String[])
        Datum d = cdc.makeDatumFromLine(line);
        Console.WriteLine(line + "  ==>  " + cl.classOf(d));
    }

    Console.WriteLine("Demonstrating working with a serialized classifier");
    cdc = new ColumnDataClassifier("examples/cheese2007.prop");
    cl = cdc.makeClassifier(cdc.readTrainingExamples("examples/cheeseDisease.train"));

    // Exhibit serialization and deserialization working. Serialized to bytes in memory for simplicity
    Console.WriteLine(); Console.WriteLine();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(baos);
    oos.writeObject(cl);
    oos.close();
    byte[] ba = baos.toByteArray();
    ByteArrayInputStream bais = new ByteArrayInputStream(ba);
    ObjectInputStream ois = new ObjectInputStream(bais);
    LinearClassifier lc = (LinearClassifier)ErasureUtils.uncheckedCast(ois.readObject());
    ois.close();
    ColumnDataClassifier cdc2 = new ColumnDataClassifier("examples/cheese2007.prop");

    // We compare the output of the deserialized classifier lc versus the original one cl
    // For both we use a ColumnDataClassifier to convert text lines to examples
    ObjectBank iterator = ObjectBank.getLineIterator("examples/cheeseDisease.test", "utf-8");
    foreach (String line in iterator)
    {
        Datum d = cdc.makeDatumFromLine(line);
        Datum d2 = cdc2.makeDatumFromLine(line);
        Console.WriteLine(line + "  =origi=>  " + cl.classOf(d));
        Console.WriteLine(line + "  =deser=>  " + lc.classOf(d2));
    }
}

Bob LearnedCommented:
I downloaded the tagger zip file, but I still don't see where to unzip it.  I was thinking that I need to unzip it into C:\StanfordNLP, under its own folder, but then would I need to change some configuration?
tuchfeldAuthor Commented:
1) As I said, I think the Classifier is needed (and not the Tagger). See point 2 in my comment above.
2) Anyway, you should reference the location of the folder as in the classifierDir line of the code I added above (it doesn't matter where you put it, just reference it; you can even give a direct path, e.g. C:\StanfordNLP\stanford-classifier-2015-04-20).
3) The question remains: how can I train the code for my language classification case?
Bob LearnedCommented:
1) There is some confusion here for me still, so I am trying to catch up as fast as I can.

2) This is very new for me, but I am willing to learn something new for both your sake and mine.

3) I will reference the Classifier as noted.

4) This link was not helpful at the moment:

http://nlp.stanford.edu/wiki/Software/Classifier#Cheese-Disease:_A_small_textual_example
tuchfeldAuthor Commented:
Thanks.
4) The link from Stanford Wiki can give you background.
The source file is taken from
stanford-classifier-2015-04-20.zip
open it and you will see (at the root) the file:
ClassifierDemo.java
I converted it to C#.
Bob LearnedCommented:
I got the classifier demo working, but I am trying to understand the output:

There are 196 entries like this:

BasicDatum[features=[CLASS, 1-#-is, 1-#-it, 1-#-o, 1-#-co, 1-#-ta, 1-#-s, 1-#-t, 1-#-cosi, 1-#E-sis, 1-#-si, 1-#-acos, 1-#B-P, 1-#-osi, 1-#E-is, 1-Len-11-20, 1-#-ac, 1-#-Ps, 1-#-cos, 1-#-os, 1-#E-osis, 1-#-osis, 1-#-aco, 1-#-Psit, 1-#-P, 1-#B-Psit, 1-#-ttac, 1-#-itta, 1-#B-Psi, 1-#-sitt, 1-#-taco, 1-#-tac, 1-#-sis, 1-#-tta, 1-#-sit, 1-#-a, 1-#E-s, 1-#-c, 1-#B-Ps, 1-#-tt, 1-#-itt, 1-#-Psi, 1-#-i],labels=[2]]
Bob LearnedCommented:
I think it is starting to make a little sense:

Is it a Cheese or a Disease?

1 is cheese and 2 is disease

1    Caerphilly
2    Ivemark Syndrome

In the train file there are categorized entries like this:

2      Back Pain
2      Dissociative Disorders
2      Lipoma
1      Blue Rathgore

...

The test file has a completely different set like this:

2      Psittacosis
2      Cushing Syndrome
2      Esotropia
2      Jaundice, Neonatal
2      Thymoma
1      Caerphilly
2      Teratoma
Bob LearnedCommented:
The features in the datum are n-grams:

http://en.wikipedia.org/wiki/N-gram

An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.
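
The feature names in the datum above look like character n-grams of the entry text, with separate tags for n-grams at the start and at the end. Here is a minimal sketch of that idea in Java (the library's own language; it ports almost line for line to C#). The "#-", "#B-" and "#E-" naming is copied from the printed datum, but the way they are generated here is an assumption, not the Stanford implementation, and CharNGrams is a hypothetical helper. The CLASS and length-bin features are not reproduced.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class CharNGrams {
    // Collects character n-grams of length minLen..maxLen from one column's text.
    // Every n-gram is emitted as "<column>-#-<gram>"; n-grams at the start of the
    // text also get a "#B-" variant and n-grams at the end a "#E-" variant.
    // The naming imitates the datum printed in the thread (an assumption).
    static Set<String> features(int column, String text, int minLen, int maxLen) {
        Set<String> out = new LinkedHashSet<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int start = 0; start + n <= text.length(); start++) {
                String gram = text.substring(start, start + n);
                out.add(column + "-#-" + gram);
                if (start == 0) {
                    out.add(column + "-#B-" + gram);
                }
                if (start + n == text.length()) {
                    out.add(column + "-#E-" + gram);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Lengths 1..4 correspond to 1.minNGramLeng and 1.maxNGramLeng in cheese2007.prop
        System.out.println(features(1, "Psittacosis", 1, 4));
    }
}
```

For "Psittacosis" this yields entries such as 1-#-is, 1-#B-Psit and 1-#E-osis, which also appear in the datum.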

I found this interesting Google project--Google Ngram Viewer

https://books.google.com/ngrams

You can see how a word is used over time.

Example:

http://en.wikipedia.org/wiki/Halitosis   

it only became commonly used in the 1920s when a marketing campaign promoted Listerine as a solution for "chronic halitosis".

N-Gram Graph for Halitosis
Bob LearnedCommented:
Enough of the diversion!!

I believe that you are trying to figure out how to work out the training part for the classification...
Bob LearnedCommented:
I ran this command line:

java -jar stanford-classifier.jar -prop examples/cheese2007.prop

and, got this output:

C:\StanfordNLP\stanford-classifier-2015-04-20>java -jar stanford-classifier.jar -prop examples/cheese2007.prop

ColumnDataClassifier invoked on Tue May 26 11:46:34 EDT 2015 with arguments:
   -prop examples/cheese2007.prop
1.usePrefixSuffixNGrams = true
QNsize = 15
useQN = true
goldAnswerColumn = 0
1.minNGramLeng = 1
trainFile = ./examples/cheeseDisease.train
tolerance = 1e-4
1.maxNGramLeng = 4
testFile = ./examples/cheeseDisease.test
sigma = 3
printClassifierParam = 200
displayedColumn = 1
intern = true
useClassFeature = true
1.binnedLengths = 10,20,30
1.useNGrams = true
Reading dataset from ./examples/cheeseDisease.train ... done [0.3s, 1765 items].
numDatums: 1765
numDatumsPerLabel: {1=531.0, 2=1234.0}
numLabels: 2 [2, 1]
numFeatures (Phi(X) types): 19667 [CLASS, 1-Len-0-10, 1-#-ack , 1-#-k, 1-#- Pai, ...]
..................................
QNMinimizer terminated due to average improvement: | newest_val - previous_val | / |newestVal| < TOL
Built this classifier: LinearClassifier with 19667 features, 2 classes, and 39334 parameters.

Reading dataset from ./examples/cheeseDisease.test ... done [0.1s, 196 items].
Output format: dataColumn1      goldAnswer      classifierAnswer        P(clAnswer)     P(goldAnswer)
Psittacosis     2       2       0.999   0.999
Cushing Syndrome        2       2       1.000   1.000
Esotropia       2       2       0.999   0.999
Jaundice, Neonatal      2       2       0.999   0.999
Thymoma 2       2       0.996   0.996
Caerphilly      1       1       0.544   0.544
Teratoma        2       2       0.945   0.945
Phantom Limb    2       1       0.857   0.143
Iron Overload   2       1       0.862   0.138
Spermatic Cord Torsion  2       2       1.000   1.000
Epistaxis (Nosebleed)   2       2       0.920   0.920
Folded cheese with mint 1       1       0.999   0.999
Maytag Blue     1       1       0.998   0.998
Castelmagno     1       1       0.978   0.978
Monoclonal Gammopathy   2       2       0.996   0.996
NiikawaKuroki Syndrome  2       2       1.000   1.000
Cendre d'Olivet 1       1       0.999   0.999
Pericarditis    2       2       0.999   0.999
Speech Disorders        2       2       0.999   0.999
Maredsous       1       1       0.969   0.969
Thyroid Diseases        2       2       1.000   1.000
Briquette de Brebis     1       1       1.000   1.000
Banon   1       1       0.989   0.989
Taeniasis       2       2       0.999   0.999
Testicular Diseases     2       2       1.000   1.000
Carcinoma       2       2       0.997   0.997
Rubella 2       1       0.999   0.001
Hordeolum       2       2       0.539   0.539
Basing  1       2       0.841   0.159
Apnea   2       2       0.970   0.970
Urea Cycle Disorders    2       2       1.000   1.000
Infertility (Female Genital Diseases ..)        2       2       1.000   1.000
Sardo   1       1       0.999   0.999
Ivemark Syndrome        2       2       1.000   1.000
Ambert  1       1       0.749   0.749
Color Vision Defects    2       2       1.000   1.000
Arrhythmia      2       2       0.997   0.997
Tangier Disease 2       2       1.000   1.000
Herpes Labialis (Fever Blisters)        2       2       1.000   1.000
Cathelain       1       1       0.770   0.770
Genital Warts   2       2       0.999   0.999
Pleurisy        2       2       0.921   0.921
Narcissism      2       2       0.999   0.999
ArnoldChiari Malformation       2       2       1.000   1.000
Crottin du Chavignol    1       1       0.993   0.993
White Stilton   1       1       0.836   0.836
Retinopathy of Prematurity      2       2       1.000   1.000
Monterey Jack Dry       1       1       0.975   0.975
Prostatic Diseases      2       2       1.000   1.000
Psoriasis       2       2       0.999   0.999
Lincolnshire Poacher    1       1       0.997   0.997
Lactose Intolerance     2       2       0.886   0.886
Pregnancy, Ectopic      2       2       0.995   0.995
Sinusitis       2       2       1.000   1.000
Vignotte        1       1       0.925   0.925
Typhoid Fever   2       2       1.000   1.000
Diseases of Marine Mammals      2       2       1.000   1.000
Rhabdoid Tumor  2       2       0.951   0.951
Angioneurotic Edema     2       2       1.000   1.000
Amnesia 2       2       0.997   0.997
Trois Cornes De Vendee  1       1       0.939   0.939
Roncal  1       1       0.997   0.997
Avitaminosis    2       2       1.000   1.000
Sourire Lozerien        1       1       1.000   1.000
CREST   2       1       0.900   0.100
Skin Ulcer      2       2       0.986   0.986
Pave du Berry   1       1       0.999   0.999
Pain    2       1       0.808   0.192
Lymphogranuloma Venereum        2       2       1.000   1.000
Aneurysm        2       2       0.994   0.994
Taste Disorders 2       2       0.990   0.990
Schabzieger     1       1       0.976   0.976
KleineLevin Syndrome    2       2       1.000   1.000
Facial Hemiatrophy (ParryRomberg Disease)       2       2       1.000   1.000
Ardrahan        1       1       0.996   0.996
Meyer Vintage Gouda     1       1       0.998   0.998
Autun   1       1       0.788   0.788
Myositis        2       2       1.000   1.000
Vaginal Diseases        2       2       1.000   1.000
Edam    1       1       0.961   0.961
Wilms' Tumor    2       2       0.825   0.825
Maribo  1       1       0.997   0.997
Blue Rubber Bleb Nevus Syndrome 2       2       0.998   0.998
Red Leicester   1       1       0.964   0.964
Trophoblastic Neoplasms 2       2       1.000   1.000
Eczema  2       2       0.981   0.981
Whipple's Disease       2       2       1.000   1.000
Alexanders Disease      2       2       0.999   0.999
Tomme de Chevre 1       1       1.000   1.000
Gas Gangrene    2       2       0.983   0.983
Purpura, SchoenleinHenoch       2       2       1.000   1.000
Scoliosis       2       2       1.000   1.000
Vomiting        2       2       0.991   0.991
Gouda   1       1       0.971   0.971
Arthropod Diseases      2       2       1.000   1.000
Orbital Cellulitis      2       2       1.000   1.000
Acne Rosacea    2       2       0.890   0.890
Menstruation Disturbances       2       2       1.000   1.000
Turner Syndrome 2       2       1.000   1.000
Stinking Bishop 1       2       0.994   0.006
Adenomatous Polyposis Coli      2       2       1.000   1.000
Encephalomyelitis       2       2       1.000   1.000
Factor XI Deficiency    2       2       1.000   1.000
Shprintzen Syndrome     2       2       1.000   1.000
Palet de Babligny       1       1       1.000   1.000
Lentigo, Malignant      2       2       0.993   0.993
Bronchiolitis   2       2       0.999   0.999
Coulommiers     1       2       0.684   0.316
Queso Majorero  1       1       1.000   1.000
Lipodystrophy   2       2       0.998   0.998
Grana   1       1       0.978   0.978
Frog Diseases   2       2       1.000   1.000
Anneau du Vic-Bilh      1       1       0.948   0.948
Thalamic Diseases       2       2       1.000   1.000
Chest Pain      2       2       0.555   0.555
Ichthyosis      2       2       1.000   1.000
Klosterkaese    1       1       0.997   0.997
Oschtjepka      1       1       0.591   0.591
Bronchopulmonary Dysplasia      2       2       1.000   1.000
Fanconi Anemia  2       2       1.000   1.000
Gruyere 1       1       0.994   0.994
Scleroderma, Systemic   2       2       1.000   1.000
Hypothermia     2       2       1.000   1.000
Ulloa   1       1       0.815   0.815
Gornyaltajski   1       1       0.999   0.999
Beal's Syndrome 2       2       1.000   1.000
Turunmaa        1       2       0.661   0.339
Hereford Hop    1       2       0.826   0.174
Adrenogenital Syndrome  2       2       1.000   1.000
Friesla 1       1       0.973   0.973
Quatre-Vents    1       2       0.636   0.364
Coeur de Chevre 1       1       1.000   1.000
Sialorrhea      2       2       0.960   0.960
Q Fever 2       2       0.995   0.995
Arachnoiditis   2       2       1.000   1.000
Monterey Jack   1       1       0.999   0.999
Encephalocele   2       2       0.985   0.985
Prastost        1       1       0.964   0.964
Desmoid Tumor   2       2       0.996   0.996
Torticollis     2       2       0.975   0.975
Anemia, Hemolytic       2       2       1.000   1.000
Sever's Disease 2       2       1.000   1.000
Arsenic Poisoning       2       2       0.994   0.994
Nantais 1       1       0.959   0.959
Spina Bifida    2       2       0.899   0.899
Zenker Diverticulum     2       2       1.000   1.000
Conjunctivitis  2       2       1.000   1.000
Muscle Spasticity       2       2       1.000   1.000
Kernhem 1       2       0.812   0.188
Caciotta        1       1       0.955   0.955
Abertam 1       1       0.835   0.835
Histoplasmosis  2       2       1.000   1.000
Idaho Goatster  1       1       0.978   0.978
Pecorino in Walnut Leaves       1       2       0.983   0.017
Tommes  1       1       0.925   0.925
Hallux Valgus   2       2       0.989   0.989
Musculoskeletal Abnormalities (Pediatr.)        2       2       1.000   1.000
ChurgStrauss Syndrome   2       2       1.000   1.000
Vasterbottenost 1       1       0.983   0.983
Peripheral Vascular Diseases    2       2       1.000   1.000
Constipation    2       2       0.993   0.993
Pin Worms       2       2       0.509   0.509
Dermoid Cyst    2       2       0.979   0.979
Catscratch Disease      2       2       1.000   1.000
Kabuki MakeUp Syndrome  2       2       1.000   1.000
Inflammation    2       2       1.000   1.000
Gastanberra     1       1       0.932   0.932
Fynbo   1       1       0.986   0.986
Crottin de Chavignol    1       1       0.998   0.998
Lymphadenitis   2       2       1.000   1.000
Pont l'Eveque   1       1       0.961   0.961
Molbo   1       1       0.998   0.998
Caprice des Dieux       1       1       0.974   0.974
Arial Fibrillation      2       2       0.998   0.998
Macconais       1       1       0.950   0.950
Sottocenare al Tartufo  1       1       0.963   0.963
Quartirolo Lombardo     1       1       0.999   0.999
Scurvy  2       2       0.638   0.638
Fetal Alcohol Syndrome  2       2       1.000   1.000
Leukodystrophy, Metachromatic   2       2       1.000   1.000
Optic Neuritis  2       2       1.000   1.000
Herpes Zoster (Shingles)        2       2       0.996   0.996
Rigotte 1       1       0.983   0.983
Trachoma        2       2       0.981   0.981
Dwarfism        2       2       0.959   0.959
Chorioretinitis 2       2       1.000   1.000
Pas de l'Escalette      1       1       1.000   1.000
Hypophosphatasia        2       2       1.000   1.000
Alpha 1Antitrypsin Deficiency   2       2       1.000   1.000
Staphylococcal Infections       2       2       1.000   1.000
Rhinoscleroma   2       2       0.998   0.998
Paronychia      2       2       0.920   0.920
Autistic Disorder       2       2       1.000   1.000
Leprosy 2       2       0.828   0.828
Angina Pectoris 2       2       0.997   0.997
Boulette d'Avesnes      1       1       0.999   0.999

196 examples in test set
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950
Cls 1: TP=60 FN=8 FP=5 TN=123; Acc 0.934 P 0.923 R 0.882 F1 0.902
Accuracy/micro-averaged F1: 0.93367
Macro-averaged F1: 0.92603
Bob LearnedCommented:
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950

Classification 2:  
============
True Positives (TP) = 123
False Negatives (FN) = 5
False Positives (FP) = 8
True Negatives (TN) = 60

Accuracy (ACC) = 93.4%
Precision (P) = .939
Recall (R) = .961
F1 Measure (F1) = .95
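
The figures on the Cls 2 line follow directly from the four counts. A small sketch in Java using the standard definitions (ConfusionMetrics is a hypothetical helper name, not something from the Stanford library):

```java
import java.util.Locale;

class ConfusionMetrics {
    // Share of all examples that were labeled correctly.
    static double accuracy(int tp, int fn, int fp, int tn) {
        return (double) (tp + tn) / (tp + fn + fp + tn);
    }

    // Of the examples predicted as this class, how many really belong to it.
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    // Of the examples that really belong to this class, how many were found.
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int tp = 123, fn = 5, fp = 8, tn = 60; // counts from the Cls 2 line
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.println(String.format(Locale.ROOT,
                "Acc %.3f P %.3f R %.3f F1 %.3f",
                accuracy(tp, fn, fp, tn), p, r, f1(p, r)));
    }
}
```

With the class 2 counts this gives 183/196 for accuracy, 123/131 for precision and 123/128 for recall.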
tuchfeldAuthor Commented:
Great, thanks for the info!
We made progress!
Let's go further.

1) For simplicity, I'd like to have a (float) number that is the probability that a line belongs to the suggested class.
Are these the numbers above?
e.g.
Quatre-Vents    1       2       0.636   0.364

2) What is significant for me to know from the "Classification info"? e.g.
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950
Bob LearnedCommented:
1) The example uses a column data classifier, and the output from the command line is in this format:

Output format: dataColumn1      goldAnswer      classifierAnswer        P(clAnswer)     P(goldAnswer)

dataColumn1 = Quatre-Vents    
goldAnswer = 1      
classifierAnswer = 2      
P(clAnswer) = 0.636  
P(goldAnswer) = 0.364

2) The Cls summary lines show how accurately the classifier assigned the lines in the test file to each class.
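
If the goal is to get those columns into code, one option is to capture the command-line output and split each row. A minimal sketch in Java, assuming the columns are tab-separated in the raw output (they show up as runs of spaces when pasted here). ResultLine is a hypothetical helper, not part of the Stanford library.

```java
class ResultLine {
    final String text;              // dataColumn1
    final String goldAnswer;        // label from the test file
    final String classifierAnswer;  // label the classifier chose
    final double pClassifierAnswer; // P(clAnswer)
    final double pGoldAnswer;       // P(goldAnswer)

    private ResultLine(String text, String goldAnswer, String classifierAnswer,
                       double pClassifierAnswer, double pGoldAnswer) {
        this.text = text;
        this.goldAnswer = goldAnswer;
        this.classifierAnswer = classifierAnswer;
        this.pClassifierAnswer = pClassifierAnswer;
        this.pGoldAnswer = pGoldAnswer;
    }

    // Parses one result row; rejects rows that do not have exactly five columns.
    static ResultLine parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length != 5) {
            throw new IllegalArgumentException(
                    "expected 5 tab-separated columns, got " + cols.length);
        }
        return new ResultLine(cols[0].trim(), cols[1].trim(), cols[2].trim(),
                Double.parseDouble(cols[3].trim()), Double.parseDouble(cols[4].trim()));
    }

    public static void main(String[] args) {
        ResultLine r = parse("Quatre-Vents\t1\t2\t0.636\t0.364");
        System.out.println(r.text + ": predicted " + r.classifierAnswer
                + " with probability " + r.pClassifierAnswer);
    }
}
```

The P(clAnswer) column is then the single float per line that question 1 asks about.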
Bob LearnedCommented:
Stanford NER CRF FAQ
http://nlp.stanford.edu/software/crf-faq.shtml

How can I train my own NER model?

The documentation for training your own classifier is somewhere between bad and non-existent. But nevertheless, everything you need is in the box, and you should look through the Javadoc for at least the classes CRFClassifier and NERFeatureFactory.
Bob LearnedCommented:
Going back to your original question, I noticed that you are using the "ssplit" annotator, which splits text into sentences.

This text has only one sentence:

Text = "0257 painted in 2013 horse mounted ink and color on silk"

If you are looking for parts of speech (POS), then you would need to use a different annotator.
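
For a line whose fields always come in the same order, hand-written rules may get further than a trained model. A minimal sketch in Java (it ports almost line for line to C#), assuming the order is id, "painted in" plus a year, name, medium, and assuming the medium starts with one of a few lead words. The lead-word list and the CatalogLineRules name are made up for illustration; real data would need its own list.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CatalogLineRules {
    // Hypothetical rule: <id> <"painted in" year> <name> <medium>.
    // The medium lead words are an illustrative list, not taken from the thread.
    private static final Pattern LINE = Pattern.compile(
            "^(\\d+)\\s+(painted in \\d{4})\\s+(.+?)\\s+((?:mounted|ink|oil|watercolor)\\b.*)$");

    // Returns the four fields in order, or an empty map when the rule does not match.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LINE.matcher(text.trim());
        if (m.matches()) {
            fields.put("id", m.group(1));
            fields.put("creation date", m.group(2));
            fields.put("name of picture", m.group(3));
            fields.put("medium", m.group(4));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "0257 painted in 2013 horse mounted ink and color on silk";
        parse(text).forEach((field, value) -> System.out.println(value + " = " + field));
    }
}
```

The name is matched lazily, so it stops at the first medium lead word. Lines that do not fit the rule come back as an empty map and could be passed on to a classifier instead.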
tuchfeldAuthor Commented:
Thanks, I'll probably get to POS later..
Now, to the Classifier results.
How can I achieve it in the code?
see my code:
private static void run_classifier()
{
    ColumnDataClassifier cdc = new ColumnDataClassifier("examples/cheese2007.prop");
    Classifier cl = cdc.makeClassifier(cdc.readTrainingExamples("examples/cheeseDisease.train"));

    ObjectBank iteraror = ObjectBank.getLineIterator("examples/cheeseDisease.test", "utf-8");
    foreach (String line in iteraror)
    {
        // instead of the method in the line below, if you have the individual elements
        // already you can use cdc.makeDatumFromStrings(String[])
        Datum d = cdc.makeDatumFromLine(line);
        Console.WriteLine(line + "  ==>  " + cl.classOf(d) + " " + cl.scoresOf(d));
    }
}

and the result for our line is (not similar to any of the P above):
1      Quatre-Vents  ==>  2 {1=-0.28003821865313994, 2=0.28003821865113615}
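
The scoresOf values look like unnormalized log scores rather than probabilities. Assuming the classifier is a maximum-entropy (softmax) model, exponentiating the scores and normalizing them should recover the command-line figures. A minimal sketch in Java (ScoresToProbabilities is a hypothetical helper, not a Stanford class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ScoresToProbabilities {
    // Softmax: exp(score) / sum of exp(scores), shifted by the max for numeric stability.
    static Map<String, Double> softmax(Map<String, Double> scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores.values()) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        for (double s : scores.values()) {
            sum += Math.exp(s - max);
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            probs.put(e.getKey(), Math.exp(e.getValue() - max) / sum);
        }
        return probs;
    }

    public static void main(String[] args) {
        // Scores printed for the Quatre-Vents line
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("1", -0.28003821865313994);
        scores.put("2", 0.28003821865113615);
        System.out.println(softmax(scores));
    }
}
```

For the Quatre-Vents scores this gives about 0.364 for class 1 and 0.636 for class 2, the same pair the command line printed. As far as I know, LinearClassifier also has a probabilityOf(Datum) method that returns these values directly, but that is worth confirming in the Javadoc before relying on it.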
Bob LearnedCommented:
I am not sure what code change you are asking for.

If you are looking for a way to pull out Quatre-Vents from the list of 196 entries, then I have created a data structure (ClassificationEntry), and a model (ClassificationModel) with a list of entries.  Then, I could use a LINQ statement to get that entry.

Debug session
Bob LearnedCommented:
ClassificationModel:

 public class ClassificationModel
    {

        public List<ClassificationEntry> EntryList { get; private set; }

        public ClassificationModel()
        {
            EntryList = new List<ClassificationEntry>();
        }

    }

ClassificationEntry:

    public class ClassificationEntry
    {
        public ClassificationEntry(ColumnDataClassifier columnDataClassifier, Classifier classifier, string line)
        {
            Line = line;
            // Build the datum once so later lookups can reuse it
            Datum = columnDataClassifier.makeDatumFromLine(line);
            Classifier = classifier;
        }

        public string Line { get; set; }
        public Datum Datum { get; set; }
        public Classifier Classifier { get; set; }

        public override string ToString()
        {
            return string.Format("{0} [line={1}, datum={2}]",
                GetType().Name, Line, Datum);
        }
    }

Sample code change:

        public static ClassificationModel ClassifierDemo()
        {
            // http://nlp.stanford.edu/software/classifier.shtml

            var testModel = new ClassificationModel();

            const string CLASSIFIER_ROOT = JAR_ROOT + "stanford-classifier-2015-04-20";

            var currentDirectory = Path.GetDirectoryName(Application.ExecutablePath);
            Directory.SetCurrentDirectory(CLASSIFIER_ROOT);

            var columnDataClassifier = new ColumnDataClassifier("examples/cheese2007.prop");
            var classifier = columnDataClassifier.makeClassifier(columnDataClassifier.readTrainingExamples("examples/cheeseDisease.train"));
            var iterator = ObjectBank.getLineIterator("examples/cheeseDisease.test", "utf-8");

            foreach (string line in iterator)
            {
                // instead of the method in the line below, if you have the individual elements
                // already you can use cdc.makeDatumFromStrings(String[])
                testModel.EntryList.Add(new ClassificationEntry(columnDataClassifier, classifier, line));
            }

            Directory.SetCurrentDirectory(currentDirectory);

            return testModel;
        }

Bob LearnedCommented:
Test harness calls:

            var result = TestHarness.ClassifierDemo();

            var element = result.EntryList
                .Where(x => x.Line.Contains("Quatre"))
                .ToList();

tuchfeldAuthor Commented:
Did you get the same results?
i.e.
Quatre-Vents    1       2       0.636   0.364
I couldn't get the P(clAnswer) and P(goldAnswer) values.
tuchfeldAuthor Commented:
Hey.. I did it from code,
simply:
ColumnDataClassifier.main(new string[] { "-prop", "examples/cheese2007.prop" });

Now I need to work through the code of the main method.
and the documentation is:
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/classify/ColumnDataClassifier.html
Bob LearnedCommented:
If you need that output value from the ColumnDataClassifier, here is the call stack:

main
   testClassifier
     testExamples
       testExample
         writeAnswer
         

In the testExample method, there is this code:

    if ( ! (globalFlags.crossValidationFolds > 0 && ! globalFlags.printCrossValidationDecisions)) {
      if (globalFlags.csvOutput != null) {
        System.out.print(formatCsv(globalFlags.csvOutput, example, answer));
      } else {
        writeAnswer(example, answer, dist, contingency, cl, sim);
      }
    }

In writeAnswer, there is this code:

results = clAnswer + '\t' + nf.format(cntr.probabilityOf(clAnswer)) + '\t' + nf.format(cntr.probabilityOf(goldAnswer));

Bob LearnedCommented:
As I see it, you would need to get the "cntr", which is Distribution<String>, "clAnswer", and "goldAnswer".
tuchfeldAuthor Commented:
Bob, thank you VERY much for joining me in figuring out this issue.
Well, I converted the class ColumnDataClassifier to C# (it took some time and was not trivial for me), and I changed the code for my testing (see the Form example snapshot).
I didn't get to the POS yet, but you already earned your points :-)
Thanks again,
  Aryeh.
Test Stanford Linear Classifier Form with cheeseDisease.train
Bob LearnedCommented:
I am very interested in the topic of natural language processing and other such decision trees, so this was a great question for me.

Do you have the converted source code for the ColumnDataClassifier?
tuchfeldAuthor Commented:
Sure. Here is the Zip file.
ColumnDataClassifier.zip
tuchfeldAuthor Commented:
I uploaded the full Visual Studio Solution to this link.
http://www.websitemanager.co.il/for_Bob/Test_Stanford_NLP.zip
Note that you need to open:
stanford-corenlp-full-2015-04-20.zip
just beside the:
Test_Stanford_NLP
Folder.
Here is how my GUI looks.
A run example with my GUI
Let me know if you successfully got it.
Bob LearnedCommented:
Thank you for the code.  It was a great idea to convert the Java code to C#, since I couldn't see a way to get that information otherwise.  It's a good thing that Java is a lot like C# *BIG GRIN*.
tuchfeldAuthor Commented:
Great assistance from Bob!