?
Solved

Search:- 2 For Mr ozymandias

Posted on 2003-03-01
153
Medium Priority
?
485 Views
Last Modified: 2010-03-31
question:- Nothing yet. will start asking tomorrow.

thanks ozymandias
0
Comment
Question by:cancer_66
153 Comments
 

Author Comment

by:cancer_66
ID: 8048728
1)whenever i want to add a new pattern which consists of two words i should add it in Token4,Token5
otherwise Token1,2,3

correct?
-----------------------------------------------------
correct way of adding pattern?

private static WordList list4 = new WordList("foo,bar,buzz,is",",");

or

private static WordList list4 = new WordList("foo,bar,buzz","is",",");

why is the "," at the end ?
0
 

Author Comment

by:cancer_66
ID: 8048741
ill talk to you tomorrow.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8048888
OK.

If you want to add a new Pattern you have to create a new list for each word in the Pattern. If it's a three-word pattern you need three lists, if it's a two-word pattern you need two lists, etc.

You can get rid of list4 and list5 because they were just for demonstration purposes.

If you want to search for "is the" then you need :


private static WordList list4 = new WordList("is",",");
private static WordList list5 = new WordList("the",",");

If you then wanted to search for "are arranged in" and "is surrounded by", you could either do this :

private static WordList list6 = new WordList("are,is",",");
private static WordList list7 = new WordList("surrounded,arranged",",");
private static WordList list8 = new WordList("in,by",",");

or you could add the required missing words to the existing word lists of list1, list2, list3.

You can assemble the lists in any combination or order into new WordPatterns.



0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8048918
The extra "," at the end is because the constructor for WordList takes both the string e.g. "described,defined,delimited" and the delimiter that should be used to tokenize it into an array e.g. ",".
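To make the two-argument constructor concrete, here is a minimal sketch of how such a class could tokenize its input. The real WordList source is not shown in this thread, so the class name `WordListSketch`, the lower-casing, and the `contains` and `size` methods are assumptions for illustration only.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for WordList; the actual class is not shown in the thread.
class WordListSketch {
    private final Set<String> words = new LinkedHashSet<>();

    // First argument: all the words in one string. Second argument: the delimiter between them.
    WordListSketch(String text, String delimiter) {
        // Pattern.quote treats the delimiter as literal text rather than as a regex.
        for (String piece : text.split(Pattern.quote(delimiter))) {
            String word = piece.trim().toLowerCase();   // lower-casing is an assumption
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
    }

    boolean contains(String word) {
        return words.contains(word.toLowerCase());
    }

    int size() {
        return words.size();
    }

    public static void main(String[] args) {
        WordListSketch list = new WordListSketch("described,defined,delimited", ",");
        System.out.println(list.size() + " words, contains defined: " + list.contains("defined"));
    }
}
```

With a constructor of this shape, the three-argument form asked about above, new WordList("foo,bar,buzz","is",","), would not compile; the words all go into the first string.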
0
 

Author Comment

by:cancer_66
ID: 8051058
hello,

ok i just read your comment regarding the addition of new patterns. iam going to try that right away and let you know if i face any problems.

thanks again.
0
 

Author Comment

by:cancer_66
ID: 8051134
i have created a directory called "Expert Exchange"

a)and extracted the DefinitionChecker.zip there.

b)there fore in the directory "Expert Exchange" i had the following files:-

i)DefinitionChecker.java

ii)a subdirectory called "definitions" (created automatically as i extracted the zip file) which contained the file "PatternMatcher.java"

when i compile "DefinitionChecker.java"

the following classes are created
"DefinitionChecker.class" + in subdirectory "definitions" classes "PatternMatcher.class","WordPattern.class" and "WordList.class"

i just did the following to the code:-

private static WordList list6 = new WordList("are,is",",");
private static WordList list7 = new WordList("surrounded,arranged",",");
private static WordList list8 = new WordList("in,by",",");

                      &

        WordPattern pattern3 = new WordPattern();
        pattern1.addList(list6);
        pattern1.addList(list7);
        pattern1.addList(list8);
        patterns.add(pattern3);
         

and it worked . Just a check am i doing it the right way?

0
 

Author Comment

by:cancer_66
ID: 8051212
1)lets say i want to add another option to the user to choose between

a)Sequential search (completed)
b)Strict Sequential search (completed)
c)Random search

now in random search all the Patterns should be found regardless of their position in the sentence, i.e. is defined as, defined is as, as defined is, is XX defined YY as

for example

computer Graphics is defined as a field in cs
(printed in seq + st.seq +random search)

computer graphics is sometimes defined as a field in cs(printed in seq + Random)

computer graphics defined tt as tt is a field in cs
(printed in Random search ONLY)

intelligent agents are sometimes defined as mobile agents (printed in seq +random)

answer me whenever you can. ill be waiting.thanks
0
 

Author Comment

by:cancer_66
ID: 8051475
guess you are busy. no problem. please answer me as you have the time. iam waiting,
0
 

Author Comment

by:cancer_66
ID: 8051883
still waiting:)
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8052090
OK. Sorry, it's been a busy day.

I have changed the code around a bit and split up some of the files.

There is now a new command-line option -r for random.

I will mail you the code shortly.
0
 

Author Comment

by:cancer_66
ID: 8052702
its fine no problem. take you time !
ok ill just check my email.
thanks
0
 

Author Comment

by:cancer_66
ID: 8052711
whenever you can. just let me know a bit about the changes,
0
 

Author Comment

by:cancer_66
ID: 8052793
hmmmm, i a bit confused. which one is the latest "WordPattern" the one you sent with the "DefinitionChecker.zip" or the one in the separate email ?
0
 

Author Comment

by:cancer_66
ID: 8052926
1)Comments on "DefinitionChecker.zip"

a)Strict mode doesnt work?
b)In normal Mode the following was printed

computer graphics defined t t as t is ?

c)in random mode following was printed

Graphic is as a field in computer science (should not be printed no MainMarker)?






0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8052937
The latest WordPattern.java was the one that I sent on its own. It replaces the one in the zip file.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8052948
All the modes work fine for me.

When I run normal mode I get the following output :

Matches found in abc.txt
                MATCH : computer graphics foo token blah blah
Matches found in xyz.txt
                MATCH : computer graphics buzz ping token blah blah
                MATCH : computer graphics is defined by nonsense
                MATCH : computer Graphics is often delimited by science
                MATCH : Computer Graphics is defined as science
Matches found in other.txt
                MATCH : Computer Graphics are described as pictures

When I run strict mode, I get :

Matches found in abc.txt
                MATCH : computer graphics foo token blah blah
Matches found in xyz.txt
                MATCH : computer graphics is defined by nonsense
                MATCH : Computer Graphics is defined as science
Matches found in other.txt
                MATCH : Computer Graphics are described as pictures

and random mode gives me :

Matches found in abc.txt
                MATCH : Computer Graphics defined t t t is t as
                MATCH : computer graphics foo token blah blah
Matches found in xyz.txt
                MATCH : computer graphics buzz ping token blah blah
                MATCH : computer graphics is defined by nonsense
                MATCH : computer Graphics is often delimited by science
                MATCH : Computer Graphics is defined as science
Matches found in other.txt
                MATCH : Computer Graphics are described as pictures
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8052954
Are you sure you are using the right arguments :

normal = java DefinitionChecker computer graphics

strict = java DefinitionChecker computer graphics -s

random = java DefinitionChecker computer graphics -r
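
As a rough illustration of how those three invocations could be told apart, here is a sketch of argument parsing. It is not the actual main() from DefinitionChecker, which is not shown in this thread; the class `ArgsSketch`, the `Mode` enum and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of command-line parsing; not the thread's actual main() code.
class ArgsSketch {
    enum Mode { NORMAL, STRICT, RANDOM }

    final Mode mode;
    final String searchTerm;

    ArgsSketch(String[] args) {
        Mode parsedMode = Mode.NORMAL;          // no flag means normal (sequential) mode
        List<String> terms = new ArrayList<>();
        for (String arg : args) {
            if (arg.equals("-s")) {
                parsedMode = Mode.STRICT;
            } else if (arg.equals("-r")) {
                parsedMode = Mode.RANDOM;
            } else {
                terms.add(arg);                 // everything else is part of the search term
            }
        }
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("no search term given");
        }
        mode = parsedMode;
        searchTerm = String.join(" ", terms);
    }

    public static void main(String[] args) {
        ArgsSketch parsed = new ArgsSketch(new String[] {"computer", "graphics", "-s"});
        System.out.println(parsed.mode + " search for: " + parsed.searchTerm);
    }
}
```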
0
 

Author Comment

by:cancer_66
ID: 8052959
check ur email ive sent you some test files.

0
 

Author Comment

by:cancer_66
ID: 8052991
1)note ive tried Both "DefinitionChecker.zip"
and got the above errors

+ ive replaced the "WordPattern" which is in the "DefinitionChecker.zip" with the new "WordPattern.java" you have sent in a separate email.

still received the above error?

2)yes iam using the right arguments

java DefinitionChecker computer graphics (seq mode)
java DefinitionChecker computer graphics -s (strict)
java DefinitionChecker computer graphics -r (random)
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8053014
OK. I am now using the same files you sent. I am going to post my output and I will number the lines. Please tell me which lines of output you think are wrong and why.



 1  >java DefinitionChecker computer graphics
 2  Matches found in a.txt
 3                MATCH :  art or designs which are created is defined as computer graphics (printed in both)
 4                MATCH :  computer Graphics is purely delimited by science (printed in seq)
 5                MATCH :  science is described as computer graphics (printed both)
 6  Matches found in b.txt
 7                MATCH : computer graphics is purely defined as a field in cs (printed in seq)
 8                MATCH :  cs is purely described as a field in computer graphics (printed in seq)
 9                MATCH :  computer graphics is delimited by cs (printed in both)
10                MATCH :  cs is delimited by computer graphics (printed in both)
11                MATCH :  computer graphics is X defined RR as cs (printed in seq)
12                MATCH :  computer Graphics XXX was GGGG defined nnnnnn as cs (printed in seq)
13  Matches found in c.txt
14                MATCH :  computer Graphics is purely delimited by science (printed in seq)
15                MATCH :  computer is defined as science (not printed "computer graphics")
16                MATCH :  science is purely described as computer graphics (printed in seq)
17
18  >java DefinitionChecker computer graphics -s
19  Matches found in a.txt
20                MATCH :  science is described as computer graphics (printed both)
21  Matches found in b.txt
22                MATCH :  computer graphics is delimited by cs (printed in both)
23                MATCH :  cs is delimited by computer graphics (printed in both)
24  Matches found in c.txt
25                MATCH :  computer is defined as science (not printed "computer graphics")
26
27  >java DefinitionChecker computer graphics -r
28  Matches found in a.txt
29                MATCH :  art or designs which are created is defined as computer graphics (printed in both)
30                MATCH :  computer Graphics is purely delimited by science (printed in seq)
31                MATCH :  science is described as computer graphics (printed both)
32  Matches found in b.txt
33                MATCH : computer graphics is purely defined as a field in cs (printed in seq)
34                MATCH :  cs is purely described as a field in computer graphics (printed in seq)
35                MATCH :  computer graphics defined t t as t is cs (not printed in seq and s.seq)
36                MATCH :  computer graphics is delimited by cs (printed in both)
37                MATCH :  cs is delimited by computer graphics (printed in both)
38                MATCH :  computer graphics is X defined RR as cs (printed in seq)
39                MATCH :  computer Graphics XXX was GGGG defined nnnnnn as cs (printed in seq)
40  Matches found in c.txt
41                MATCH :  computer Graphics is purely delimited by science (printed in seq)
42                MATCH :  computer is defined as science (not printed "computer graphics")
43                MATCH :  science is purely described as computer graphics (printed in seq)
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8053025
Line 25 should not normally be printed but it is because it contains the words "computer graphics" inside the brackets e.g. :

  (not printed "computer graphics")
0
 

Author Comment

by:cancer_66
ID: 8053030
1)hmmm this is weird i deleted the all the files and extracted "DefinitionChecker.zip" again from scratch in a folder called "Expert Exchange" now it worked properly???

2)i described the way i have added new patterns above. and asked if it is the correct way ?

3)please test the program with the test files ive sent on the email. just so i could feel comfortable plz

thanks

0
 

Author Comment

by:cancer_66
ID: 8053046
ok its looks fine to me..i dont know what went wrong. really iam suprized my self for a second i got very worried;z

0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8053181
1) Bizarre. I can't explain that, but I'm glad it's working now.


2) I think so. Did you read my answer in my first comment above ?

3) I have tested the program with the files. My output is above.
0
 

Author Comment

by:cancer_66
ID: 8053257
ill try testing it more . and let you know,
if there is any problems. but that problem i faced worries me,

thanks
0
 


Author Comment

by:cancer_66
ID: 8053320
1) i did the following in order to add the pattern "is the"

     private static WordList list6 = new WordList("is",",");
     private static WordList list7 = new WordList("the",",");


          // create another WordPattern
          WordPattern pattern3 = new WordPattern();
          // add the appropriate WordLists
          pattern2.addList(list6);
          pattern2.addList(list7);
          // add the WordPattern to the vector
          patterns.add(pattern3);

i also added two sentences
a)mobile agent is the future of XYZ

it was not matched?am i doing something wrong?
0
 

Author Comment

by:cancer_66
ID: 8053367
i even corrected the mistake which is above "pattern2.addlist(list6)" to pattern3.addlist(list6)

still didnt match ?
0
 

Author Comment

by:cancer_66
ID: 8053460
2)following sentence is not printed in strict mode

art or designs which are created is defined as computer graphics (printed in both) # ?

sorry for the trouble
0
 

Author Comment

by:cancer_66
ID: 8053529
2)notice for the question above when i modfied the sentence to be

art or designs which created is defined as computer graphics (printed in both) #

(removed the "are") it was matched !

0
 

Author Comment

by:cancer_66
ID: 8053568
2)in the test which you have done Line 3 should have been also matched with strict sequential.

notice "is defined as"

please look into this problem:(
0
 

Author Comment

by:cancer_66
ID: 8053601
3)ive added the following sentence

Expert Exchange is surrounded by XYZ #

and added the pattern "is surrounded by"


    private static WordList list6 = new WordList("are,is",",");
    private static WordList list7 = new WordList("surrounded,arranged",",");
    private static WordList list8 = new WordList("in,by",",");
     
sentence didnt match?

but however this sentence matched
intelligent agents, are sometimes, defined as mobile agents #





0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8053654
2) there is a problem with this sentence :

art or designs which are created is defined as computer graphics

it has both "are" and "is" which are both in List1. This creates a problem because the first token found starts the strict sequential search and then "created" breaks it.

I will have to have a think about this.

3) I have added the following :

     private static WordList list6 = new WordList("were,is,are",",");
     private static WordList list7 = new WordList("surrounded,encompassed,arranged",",");
     private static WordList list8 = new WordList("in,by",",");

and

     // create another WordPattern
     WordPattern pattern3 = new WordPattern();
     // add the appropriate WordLists
     pattern3.addList(list6);
     pattern3.addList(list7);
     pattern3.addList(list8);
     // add the WordPattern to the vector
     patterns.add(pattern3);


when I run :

        >java DefinitionChecker expert exchange

I get :

Matches found in a.txt
                MATCH : [never] Expert Exchange is surrounded by XYZ
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8053658
The code works for me, except the problem detailed in point 2 above, which I will try to look into.
0
 

Author Comment

by:cancer_66
ID: 8054894
hi. hmm ok. i will test the program more and see..

what do u mean

[never] Expert Exchange is surrounded by XYZ

why the "never"?
0
 

Author Comment

by:cancer_66
ID: 8054946
1)i still havent been sucessfull so far in adding new patterns ? it doesnt work!

i addded "is surrounded by"

and had a sentence

mobile agents is surrounded by xxx (no match)



0
 

Author Comment

by:cancer_66
ID: 8054954
forget the last message. it worked. my mistake.

ill try adding the pattern "is the"

and test it. please look into the problem.

thanks
0
 

Author Comment

by:cancer_66
ID: 8054977
pattern "is the" worked fine.

i think the only problem is sentence

"art or designs which are created is defined as computer graphics"

however we should try and make it work with all sentences since ill be randomly taking definitions from the internet and test it.

thanks alot
0
 

Author Comment

by:cancer_66
ID: 8055732
please answer me whenever u r free
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8055734
The last version of the program I emailed you works with :
"art or designs which are created is defined as computer graphics", just fine. I fixed that problem.

I put [never] in front of any sentence that either did not contain the words "computer graphics" or "mobile agents" (since those were the terms we were testing) or did not have a pattern like "is defined by" at all.


0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8055748
This is my current sample output :

 1  >java DefinitionChecker computer graphics
 2  Matches found in a.txt
 3                MATCH : [-r -n -s] art or designs which are created is defined as computer graphics
 4                MATCH : [-r -n]computer Graphics is purely delimited by science
 5                MATCH : [-r -n -s] science is described as computer graphics
 6  Matches found in b.txt
 7                MATCH : [-r -n] computer graphics is purely defined as a field in cs
 8                MATCH : [-r -n] cs is purely described as a field in computer graphics
 9                MATCH : [-r -n -s] computer graphics is delimited by cs
10                MATCH : [-r -n -s] cs is delimited by computer graphics
11                MATCH : [-r -n] computer graphics is X defined RR as cs
12                MATCH : [-r -n] computer Graphics XXX was GGGG defined nnnnnn as cs
13  Matches found in c.txt
14                MATCH : [-r -n] computer Graphics is purely delimited by science (printed in seq)
15                MATCH : [-r -n] science is purely described as computer graphics (printed in seq)
16  
17  >java DefinitionChecker computer graphics -s
18  Matches found in a.txt
19                MATCH : [-r -n -s] art or designs which are created is defined as computer graphics
20                MATCH : [-r -n -s] science is described as computer graphics
21  Matches found in b.txt
22                MATCH : [-r -n -s] computer graphics is delimited by cs
23                MATCH : [-r -n -s] cs is delimited by computer graphics

Note that the problem sentence on lines 3 and 19 appears correctly.
0
 

Author Comment

by:cancer_66
ID: 8056293
thanks alot ozymandias . ill just test it right aways, been busy writing the Interim report for my project.

anyways ill just do that in a short while.

thanks 4 your help.
0
 

Author Comment

by:cancer_66
ID: 8056535
can you please explain, what was the problem ?
and very breifly how you have fixed it ?please
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8056597
OK. It's a bit hard to explain in writing though.

Imagine that we have a list (array) of words that are based on the sentence "art or designs which are created is defined as computer graphics", so it looks like this :

art
or
designs
which
are
created
is
defined
as
computer
graphics

We are going through this list checking each word against the WordLists in our WordPattern. We are in strict mode, so the matches must take place in consecutive words. When a word matches a list we record that fact and move on to the next list and keep matching the words.

The problem is that the 5th word "are" in the list above matches the first list, so by the time we get to "is", which is part of the real pattern, we have already skipped past the first list. This means that only "defined" and "as" are found in sequence.

I fixed this by adding a rule into the loop that checks whether it is working in strict mode whenever a sequence is broken. If it is, it skips back to the first list and starts checking from there.
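
For readers following along, here is a sketch of a strict-mode check. It is not the code from WordPattern.java, which is not shown in this thread. Instead of tracking a "current list" and resetting it when the sequence breaks, this version tries every starting word in turn, which is a different but simpler formulation of the same rule. The class and method names are hypothetical, and the word lists are the example ones used in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of strict sequential matching; not the thread's WordPattern code.
class StrictMatchSketch {

    // True if some run of consecutive words matches list 0, list 1, ... in order with no gaps.
    static boolean strictMatch(String[] words, List<Set<String>> lists) {
        for (int start = 0; start + lists.size() <= words.length; start++) {
            boolean allMatched = true;
            for (int j = 0; j < lists.size(); j++) {
                if (!lists.get(j).contains(words[start + j].toLowerCase())) {
                    allMatched = false;   // sequence broken: try the next starting word
                    break;
                }
            }
            if (allMatched) {
                return true;
            }
        }
        return false;
    }

    // Helper: builds one word set per comma-separated string.
    static List<Set<String>> lists(String... csvLists) {
        List<Set<String>> out = new ArrayList<>();
        for (String csv : csvLists) {
            out.add(new HashSet<>(Arrays.asList(csv.split(","))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Set<String>> pattern = lists("is,are,be", "defined,described,delimited", "by,as");
        String sentence = "art or designs which are created is defined as computer graphics";
        System.out.println(strictMatch(sentence.split(" "), pattern));
    }
}
```

Because every starting position is tried independently, an early "are" cannot use up the first list before the real "is defined as" is reached.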
0
 

Author Comment

by:cancer_66
ID: 8056699
cool. i guess i did understand something not 100% though, anyways ill try testing with new texts.

1)would it be easy to map the program to a User Interface? ive got the user interface code ready.

2)in terms of Algorithm "random,sequential,strict seq" i need some sort of pseudocode, if possible.






0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8056824
1) Yes. The interface code would probably just replace the DefinitionChecker code.

2)

loop through the words and the lists
    look for each word in each list

    if the word is found
        record the match and whether or not it was found in strict sequence
        if we are in random mode
            start from the first word again
        otherwise
            move on to the next word and the next list

    if the word was not found
        move on to the next word
        if we are in strict mode
            go back to the first list
       
Once we have checked all the words look at the information we have recorded from the matching process

if the number of matches = the number of lists
    then at least one word from each list was matched so random match = true or normal match = true

if the number of sequential matches = the number of lists
    then at least one word from each list was found in strict order so strict match = true
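
The normal and random parts of the pseudocode above can be sketched in Java roughly as follows. This is an illustration under assumptions, not the code that was emailed: it uses two separate methods rather than one shared loop, leaves strict mode out, and the names `ModeSketch`, `normalMatch`, `randomMatch` and `lists` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the normal and random modes; not the thread's actual code.
class ModeSketch {

    // Normal mode: one word from each list, in list order, other words allowed in between.
    static boolean normalMatch(String[] words, List<Set<String>> lists) {
        int next = 0;   // index of the list we are currently looking for
        for (String word : words) {
            if (next < lists.size() && lists.get(next).contains(word.toLowerCase())) {
                next++;
            }
        }
        return next == lists.size();
    }

    // Random mode: one word from each list, anywhere in the sentence, in any order.
    static boolean randomMatch(String[] words, List<Set<String>> lists) {
        for (Set<String> list : lists) {
            boolean found = false;
            for (String word : words) {
                if (list.contains(word.toLowerCase())) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    // Helper: builds one word set per comma-separated string.
    static List<Set<String>> lists(String... csvLists) {
        List<Set<String>> out = new ArrayList<>();
        for (String csv : csvLists) {
            out.add(new HashSet<>(Arrays.asList(csv.split(","))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Set<String>> pattern = lists("is,are,be", "defined,described,delimited", "by,as");
        String[] words = "computer graphics defined t t as t is cs".split(" ");
        System.out.println("normal: " + normalMatch(words, pattern));
        System.out.println("random: " + randomMatch(words, pattern));
    }
}
```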
0
 

Author Comment

by:cancer_66
ID: 8056875
1)ok
2)when you said "loop through the words and the lists"

i know what the list contains "is defined..etc"

words?? you mean the texts ? user input?
0
 

Author Comment

by:cancer_66
ID: 8056922
2)whenever you are free can you just give me a bit more details with the pseudocode. take your time.

please
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8057093
2) No. The words are the sentences found in the files.

I have modified your UI code and I am emailing you a new version of the UI and a new version of DefinitionChecker that works with the UI.
0
 

Author Comment

by:cancer_66
ID: 8057174
ok thanks alots . ill just test it right away,

0
 

Author Comment

by:cancer_66
ID: 8057201
2) so u mean whenever we meet a "#" while reading the file we take the whole sentence and put it in a array. and then start comparing it with the Lists(patterns)
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8057279
2) Yes, sort of. Actually we find a sentence and add it to an array. Then we break each sentence up (tokenize it) into another array so we can compare it word by word.
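
A minimal sketch of that two-step split, assuming sentences in the test files end with "#" as discussed above and that words are separated by whitespace. The class and method names are hypothetical, and the file-reading part is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split text into sentences on "#", then each sentence into words.
class SentenceSplitSketch {

    static List<String[]> tokenize(String text) {
        List<String[]> sentences = new ArrayList<>();
        for (String raw : text.split("#")) {
            String sentence = raw.trim();
            if (sentence.isEmpty()) {
                continue;                           // skip blank pieces between markers
            }
            sentences.add(sentence.split("\\s+"));  // one array of words per sentence
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "computer graphics is defined as cs #\nmobile agents are described as xyz #";
        for (String[] words : tokenize(text)) {
            System.out.println(words.length + " words, first: " + words[0]);
        }
    }
}
```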
0
 

Author Comment

by:cancer_66
ID: 8057334
2)ok thanks now i get the picture. i wasnt at my seat. just came back. ill check my mail.

thanks
0
 

Author Comment

by:cancer_66
ID: 8057391
1)ok i check the email. and iam testing the UI now. god this makes life so easier for testing as well:)
0
 

Author Comment

by:cancer_66
ID: 8057700
1)iam a bit confused WordList,WordPattern
wordList contains the sentences from the file?
wordpattern holds the combinations of patterns?

sorry for this really.
0
 

Author Comment

by:cancer_66
ID: 8057789
2) a)with the user interface when i enter "Computer"    it prints? (recall we are searching for Terms)

0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8057792
1) No.

A WordList contains the tokens like :

    is
    are
    be

or

    defined
    described
    delimited

both of the above would be a WordList.

A WordPattern contains a set of WordLists, a bit like :

    is            defined            by
    are           described          as
    be            delimited      

A PatternMatcher then contains a set of WordPatterns and can check sets of sentences for those patterns.
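
The containment described above can be sketched like this. The real WordList, WordPattern and PatternMatcher classes are not shown in this thread, so the types below are hypothetical stand-ins: a plain Set of strings plays the role of a WordList, and only the structure is modelled, not the matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for WordPattern: an ordered sequence of word lists.
class PatternSketch {
    final List<Set<String>> lists = new ArrayList<>();

    PatternSketch addList(Set<String> list) {
        lists.add(list);
        return this;
    }

    int length() {
        return lists.size();
    }
}

// Hypothetical stand-in for PatternMatcher: a collection of patterns.
class MatcherSketch {
    final List<PatternSketch> patterns = new ArrayList<>();

    void add(PatternSketch pattern) {
        patterns.add(pattern);
    }
}

class StructureDemo {
    static Set<String> words(String... w) {
        return new HashSet<>(Arrays.asList(w));
    }

    static MatcherSketch build() {
        Set<String> list1 = words("is", "are", "be");
        Set<String> list2 = words("defined", "described", "delimited");
        Set<String> list3 = words("by", "as");
        Set<String> list4 = words("the");

        MatcherSketch matcher = new MatcherSketch();
        // Three-word pattern such as "is defined by".
        matcher.add(new PatternSketch().addList(list1).addList(list2).addList(list3));
        // Two-word pattern such as "is the"; the same list1 object is shared with the first pattern.
        matcher.add(new PatternSketch().addList(list1).addList(list4));
        return matcher;
    }

    public static void main(String[] args) {
        MatcherSketch matcher = build();
        System.out.println(matcher.patterns.size() + " patterns");
    }
}
```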
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8057846
>>2) a)with the user interface when i enter "Computer"    it prints? (recall we are searching for Terms)


Yes. If you look at the code, I have not implemented any code to check the number or length of arguments passed by the UI, whereas when you use the command line the main() method does this checking.
0
 

Author Comment

by:cancer_66
ID: 8057885
2)i.c but it would be the same as the one in the main() i mean in terms of code? its better to have the UI do the checking as well. please
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8057906
Sort of, in the main() method you are checking an array of arguments including arguments for the match mode like -s or -r. In the UI you are checking a string. I can produce an equivalent though.
0
 

Author Comment

by:cancer_66
ID: 8057925
2)yes please. that would be it for today, iam not feeling that well myself. thanks for all the help ozymandias i really appricate it.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8057985
OK. I have just mailed you a new version of UI1.java which checks for valid search terms.
0
 

Author Comment

by:cancer_66
ID: 8057987
3)ozymandias since ive got a presentation on saturday let assume i was asked about

a)how efficient the search algorithm is ?
b)complexity? (O notation)
c)how would i validate the system?

i would like to know how would you answer those question and what are the appropriate answers
0
 

Author Comment

by:cancer_66
ID: 8058042
4)i replaced the UI you mailed me with the previous one i got the following error?

symbol  : constructor DefinitionChecker  (java.lang.String,int,boolean)
location: class DefinitionChecker
               DefinitionChecker dc = new DefinitionChecker(search.getText(),mode,false);
0
 

Author Comment

by:cancer_66
ID: 8058177
answer me whenever you can. take your time. ill wait
0
 

Author Comment

by:cancer_66
ID: 8058582
please answer me when u can. ill be waiting
0
 

Author Comment

by:cancer_66
ID: 8058954
5)found an error same problem in strict mode.

computer graphics is are defined as create by art or design #( not matched)?

Mobile agents was can be defined such as intelligent agents # (not matched)

6)add the pattern "can be defined such as"

test if it would match : Mobile agent can be defined such as XYZ

didnt match with me?

0
 

Author Comment

by:cancer_66
ID: 8058967
6)remove the word "be" from the list1 and add "can be" + remove "as" from list3 and add "such as"

doesnt match with all 3 modes ! please look into this



0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8059201
3)

a) how efficient is the search algorithm ?

Compared to what ?

b)  How complex ?

It's relatively simple. It doesn't do fuzzy matching, or word stemming or any of the other clever stuff that most "search engines" do, and it can't really handle punctuation. It just matches words and groups of words.

c)How to validate ?

Tricky, since I'm not sure of what exactly it is supposed to achieve. The test files you have set up validate that it finds what it is supposed to find and doesn't find things that don't match. What else could you do ?

4) Yes, that's because I changed the constructor of DefinitionChecker to take an extra argument so that it would know whether to print out the results to the console when being used on the command line or return a result set when being used by the GUI.

5) problems in strict mode :

computer graphics is are defined as create by art or design #( not matched)?

OK. I will have a look and see why this sentence is not matched in strict mode.

Mobile agents was can be defined such as intelligent agents # (not matched)

The above sentence will not match in strict mode because it has the word "such" between "defined" and "as", so it is not strict.

6) You cannot add two words to a WordList as one word. You are not allowed "can be" or "such as". Actually, you can add them if you like but they will never match. This is because (as you asked) the sentence is tokenised into individual words and compared word-by-word, so nothing will ever match with "can be" because "can be" in a sentence will always be broken up into "can" and "be". If you want to look for "can be defined as" you must create a 4 word WordPattern.
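
A small sketch of why a two-word entry can never match, assuming the sentence is split on whitespace and each word is looked up on its own. The names below are hypothetical and this is not the thread's actual code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demonstration: a list entry containing a space never equals a single token.
class MultiWordTokenSketch {

    // True if any single word of the sentence appears in the list.
    static boolean anyWordInList(String sentence, Set<String> list) {
        for (String word : sentence.trim().split("\\s+")) {
            if (list.contains(word.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    static Set<String> listOf(String csv) {
        return new HashSet<>(Arrays.asList(csv.split(",")));
    }

    public static void main(String[] args) {
        String sentence = "Mobile agent can be defined such as XYZ";
        // "can be" is stored as one entry, but the sentence only ever yields "can" and "be".
        System.out.println(anyWordInList(sentence, listOf("can be,was")));
        // Stored as separate single-word entries, "can" is found.
        System.out.println(anyWordInList(sentence, listOf("can,was")));
    }
}
```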
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8059251
5) OK I have fixed that problem too. I will mail you a new copy of WordPattern.java, you will need to recompile.
0
 

Author Comment

by:cancer_66
ID: 8059278
4)hmm how to make it work? please look into it.
5)ok thanks
6)yeah i guessed that would be the problem. ok lets say i created 4 lists and add list1= "can" ,list2=be list3=defined,list4=such,list5=as

would it match :Mobile agents was can be defined such as intelligent agents
0
 

Author Comment

by:cancer_66
ID: 8059355
6)i just did the follows removed the word "be" from the list1 and add "can be" + removed "as" from list3 and add "such as"

supprisingly :Mobile agents was can be defined such as intelligent agents

matched in random and sequential but not strict?
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8059360
4) Unless there is a good reason always use the latest version of any file I have sent you.

6) Yes. If you did that it would work. You don't need to create 4 lists though. You already have three of the lists you need.

You want to look for "can be defined as", but there is no point. With the current lists any sentence that has "can be defined as" will match because it has "be defined as" anyway.

I think that this is probably not a good idea in general though. As I have said before, the chances of words like "defined" and "described" being used outside the context of sentences like "can be defined as" or "is described by" are very remote, and even if they were, the number of occurrences would be no more than the number of lost occurrences due to grammatical or spelling errors in the documents being searched. My point is that adding pattern words like can, be, such, is, are, by and so on is pretty pointless. If the words "computer graphics" and "described" appear in the same sentence at all then that probably warrants a match 99.9999% of the time.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8059382
>>6)i just did the follows removed the word "be" from the list1 and add "can be" + removed "as" from list3 and add "such as"
>>
>>supprisingly :Mobile agents was can be defined such as intelligent agents
>>
>>matched in random and sequential but not strict?

You CANNOT do that !

Mobile agents was can be defined such as intelligent agents

The above sentence will NEVER match in strict mode because it has "such" in between "defined" and "as". You cannot add "such as" as a pattern. You would have to add "such" and "as" to separate lists or add them to the same list individually, in which case the sentence would match in strict mode because it contained "be defined such".
0
 

Author Comment

by:cancer_66
ID: 8059393
4)thats what i did i used your latest UI file which check for valid user input by overwriting the old one but did not compile for the reason ive give you above?
0
 

Author Comment

by:cancer_66
ID: 8059412
6)ok sorry my mistake. stupid question.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8059455
OH, OK. I need to send you both UI1.java and DefinitionChecker.java. Sorry, I thought I had.

I will mail them to you now.
0
 

Author Comment

by:cancer_66
ID: 8059485
7)ok lets assume i had a pattern which consisted of 4 words. that means i should create 4 lists correct? the reason iam asking this is because the same code wont be just used for definition. i might use different patterns to find synonyms..etc
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8059540
7) Yes. 4 WordLists added to 1 WordPattern in the correct sequence. BTW, you can reuse the WordLists, i.e. you can add them to more than one WordPattern or to the same WordPattern more than once.
0
 

Author Comment

by:cancer_66
ID: 8059608
7)thanks. i got your point
8)for the sake of testing i just added list1="be" list2=defined list3=such

now: Mobile agents was can be defined such as intelligent agents

should match in strict seq as well since "be defined such" are not separated by intermediate tokens.

but it's not?
0
 

Author Comment

by:cancer_66
ID: 8059627
8) Forget the last one. The mistake I made was not recompiling the UI. It worked.

0
 

Author Comment

by:cancer_66
ID: 8059669
9)BTW how did you fix the problem in question5. hope its fixed for good. i thought you have fixed this problem?

ok i think ill call it a day. iam dead tired. i need a break. thanks for your help.



ill talk to you tomorrow. thanks alot
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8059937
9) OK. The first fix was for when the strict match is triggered too soon, i.e. by a word from list1 appearing in the sentence before the real pattern. When I fixed that I did not allow for the fact that it might appear exactly 1 word before the real pattern, like this :

computer graphics is are defined as create by art or design #( not matched)?


In this instance it would never happen...you cannot write "is are" because it is grammatical nonsense, but it could happen in other circumstances so I allow for that eventuality now as well.
0
 

Author Comment

by:cancer_66
ID: 8062816
hello there. i wont be at my seat for few hours. but ill be back with more questions:) thanks for everything
0
 

Author Comment

by:cancer_66
ID: 8063428
hi, ok iam back for some time:)

1) How could we make this search algorithm more efficient?
2) What is a morphological analyser? Can I use it?
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8063481
1) I don't know, there are probably lots of ways, but it depends on your definition of efficiency. Do you mean faster, or more accurate, or the best possible trade-off between speed and accuracy? For instance, I don't think that it is efficient looking for words like "is", "are", "can", "be", "by" etc.

2) I think it is an analyser (or in this case a search tool) that can find inexact but highly likely matches. Common examples would be a "fuzzy logic" kind of word matching that would find obvious misspellings of words like "cimputer", or word stemming, where if you ask for computer graphics it will find variations of those words like compute, computational, computing, computed, computes and graphic, graphical, graphically etc. So, "computer graphics" would match "graphical computing" but it would probably be "ranked" low down the list of matches.
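
As a rough illustration of the "fuzzy" kind of matching, one common technique is edit distance (Levenshtein distance). The sketch below is only an assumption about how misspellings like "cimputer" could be caught; it is not a morphological analyser and not a stemmer, and the class name `FuzzyMatchSketch` and the `maxEdits` threshold are made up for the example.

```java
class FuzzyMatchSketch {

    // Classic Levenshtein edit distance: the minimum number of single-character
    // insertions, deletions or substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // swap the two rows so prev always holds the last completed row
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical rule: words within maxEdits of each other count as a fuzzy match.
    static boolean fuzzyEquals(String a, String b, int maxEdits) {
        return editDistance(a.toLowerCase(), b.toLowerCase()) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("cimputer", "computer"));
        System.out.println(fuzzyEquals("cimputer", "computer", 1));
    }
}
```

Edit distance catches typing mistakes but would not relate "computing" to "computer" cheaply; that is the job of stemming.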
0
 

Author Comment

by:cancer_66
ID: 8063505
1) So what would you suggest? I'll have to talk to my supervisor regarding this. What I mean by efficient is the "trade-off between speed and accuracy".

2) Yeah, the supervisor did talk briefly about this.

>>computer graphics" would match "graphical computing" but it would probably be "ranked" low down the list of matches.

a) is this difficult to achieve?
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8063533
You could use morphological analysis, but you would either have to write your own routines to do fuzzy matching or word stemming (big job!) or get someone else's library code that you could use (probably expensive to buy a good one).

This is a very big field in computer science and there are a lot of ideas about how to do it. Search engines and data mining tools are BIG business.

There are a number of strands in word search and word matching.

Shannon's information theory, for instance, tells us that the less often a word is used (i.e. the rarer it is) the more information it carries, because its rarity generally attests to the uniqueness of its meaning or interpretation. Words that get used a lot like is, this, it, are, be etc. are used so frequently and in so many contexts that searching for them is a) inefficient and b) meaningless, because how can we ever be sure what meaning to attach to them?

Bayesian inference represents another common set of ideas. Bayes states that the outcome of any particular search (for instance) could be better predicted with prior knowledge of the results of searches that have gone before.

For instance, if someone were to type the word "stocks" into a search engine they would get a lot of very mixed results. However, if the search engine knew that their previous searches had been "market values", "bonds" and "share trading" the results could be narrowed down considerably. Similarly if their previous searches had been "recipes", "soups" and "bouillon" then you would get a completely different set of results, and a different one again for "medieval" and "punishment". Basically we have added extra meaning or context to the word "stocks" from an awareness of previous searches or fields of interest.
0
 

Author Comment

by:cancer_66
ID: 8063569
1) OK, I'll speak to my supervisor regarding the morphological analyser. Let's assume he will provide me with the libraries etc. Is using it difficult?

a) I'll send you one file which he has given me. It's a tokenizer plus some rules. Please check it and see if it could be of some help.

2) I see. Again I'll have to discuss it with the supervisor. I think the two most important keywords in the search algorithm I am using are "User Input" and "MainMarker", aren't they?
0
 

Author Comment

by:cancer_66
ID: 8063686
3) Aren't there morphological libraries which I can use?
0
 

LVL 15

Expert Comment

by:ozymandias
ID: 8063742
1) I'm not sure what to suggest in terms of efficiency. This is not really my area of expertise. I suppose the first questions to ask would be :

    i) How is the current program inefficient ?
    ii) What can be done to improve it ?

2)
a) Ranking is not simple, but it's not that complex either. Basically you would assign a score to the words as they were matched. An exact match in the correct sequence would score 3 points, a fuzzy match in the correct sequence or an exact match in an incorrect sequence would score 2 points, and a fuzzy match in an incorrect sequence would score 1 point.

Let's say you are looking for "computer graphics".

"computer graphics" would score 6 points
"computed graphics" would score 5 points
"computed graphically" would score 4 points
"graphically computed" would score 2 points

You could apply the same rules to the patterns too.

If the sentence contained "is defined by" (strict match) it would score 4 points.
If it contained "is often defined by" (sequential match) it would score 3 points.
If it contained "defined is by" (random match) it would score 2 points.
If it just contained "defined" (word match) it would score 1 point.

You could then add up the total score for each matched sentence and display the matched sentences in descending order of score, i.e. a ranking system.
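
The four pattern tiers above could be sketched roughly like this. This is only an illustration under simplifying assumptions: the pattern is a fixed phrase with one word per position (no word lists, no stemming), a "word match" means at least one pattern word is present, and the class `PatternTierSketch` is hypothetical. It also needs a newer JDK than 1.3.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PatternTierSketch {

    // Strict: the pattern words appear consecutively and in order.
    static boolean strict(List<String> tokens, List<String> pattern) {
        return Collections.indexOfSubList(tokens, pattern) >= 0;
    }

    // Sequential: the pattern words appear in order, other words may intervene.
    static boolean sequential(List<String> tokens, List<String> pattern) {
        int p = 0;
        for (String t : tokens) {
            if (p < pattern.size() && t.equals(pattern.get(p))) {
                p++;
            }
        }
        return p == pattern.size();
    }

    // Random: every pattern word appears somewhere, in any order.
    static boolean random(List<String> tokens, List<String> pattern) {
        return tokens.containsAll(pattern);
    }

    // Word: at least one pattern word appears (a simplifying assumption).
    static boolean anyWord(List<String> tokens, List<String> pattern) {
        for (String w : pattern) {
            if (tokens.contains(w)) {
                return true;
            }
        }
        return false;
    }

    // Try the strongest kind of match first and fall back to weaker ones.
    static int score(String sentence, String patternText) {
        List<String> tokens = Arrays.asList(sentence.toLowerCase().split("\\s+"));
        List<String> pattern = Arrays.asList(patternText.toLowerCase().split("\\s+"));
        if (strict(tokens, pattern)) return 4;
        if (sequential(tokens, pattern)) return 3;
        if (random(tokens, pattern)) return 2;
        if (anyWord(tokens, pattern)) return 1;
        return 0;
    }

    public static void main(String[] args) {
        String pattern = "is defined by";
        System.out.println(score("computer graphics is defined by art", pattern));
        System.out.println(score("computer graphics is often defined by art", pattern));
        System.out.println(score("computer graphics defined is by art", pattern));
        System.out.println(score("computer graphics defined", pattern));
    }
}
```

Trying the modes from strictest to loosest means each sentence gets the highest tier it qualifies for.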
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8063751
3) Yes, almost certainly. Unfortunately I do not know of any, and you would have to find one that could be used from within your java program for little or no cost (I assume).
0
 

Author Comment

by:cancer_66
ID: 8063786
2) I think if I could do that the supervisor would be quite impressed, since ranking the sentences and printing them according to the score can be considered a way to validate the results, can't it?

3) I will definitely talk to the supervisor tomorrow regarding any morphological libraries he could provide. However, could you also try to find one which is appropriate to the program, please?

0
 

Author Comment

by:cancer_66
ID: 8063800
4)for the time being without the Morph-libaries can the ranking system be done?
0
 

Author Comment

by:cancer_66
ID: 8063806
4)by applying the rules to the patterns ?
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8063905
2) Ranking would not constitute validation. The validation would have to be by some mechanism external to the program. e.g. some "known good" set of results to which the program's results could be compared.

4) Yes. It could be done, but it is a pretty big job. We would be changing the way the whole program works. Currently we have no "fuzzy match" capability and we do not compare search terms word by word; we use the whole term. Currently the user specifies the matchMode, but to rank we would have to take each sentence and do a strict match, if that failed try a sequential match, and if that failed try a random match, in order to calculate the sentence's score.
0
 

Author Comment

by:cancer_66
ID: 8063963
2)ok
4) If it could be done, it would be a plus point for my project, since in my specification I've stated that a scoring system would be done if time permits. Please.
0
 

Author Comment

by:cancer_66
ID: 8063981
4)however the user should still be able to choose between the 3 modes.
0
 

Author Comment

by:cancer_66
ID: 8064105
5)private static WordList list9 = new WordList("can",",");
    private static WordList list10 = new WordList("be",",");
    private static WordList list11 = new WordList("defined",",");
   private static WordList list12 = new WordList("such",",");
    private static WordList list13 = new WordList("as",",");
     
  // create another WordPattern
        WordPattern pattern4 = new WordPattern();
        // add the appropriate WordLists
        pattern3.addList(list9);
        pattern3.addList(list10);
        pattern3.addList(list11);
        pattern3.addList(list12);
        pattern3.addList(list13);
        // add the WordPattern to the vector
       patterns.add(pattern4);

got the following results:

Matches found in a.txt

2.[-r -n -s] art or designs was are created is defined as computer graphics
2.1[-r -n -s] computer graphics is was can be are defined as create by art or design
3.[never] computer Graphics is as a pictorial computer output produced on a display screen, plotter, or printer
4.[-r -n] computer Graphics is purely delimited by science
6.[-r] computer graphics t t defined t is t as mohammed
8.[-r -n -s] science is described as computer graphics
12.[never] computer graphics is the xydsdgjhgs

Note:- 3. should not be matched since no mainmarker i.e defined, described..etc
0
 

Author Comment

by:cancer_66
ID: 8064129
5)try searching for "Mobile Agents"

13.[never] Mobile agents consists of exectution environment ,,etc (printed) even though i have dont have the pattern "consist of"

when i remove which i added up there it works properly.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8064194
// create another WordPattern
       WordPattern pattern4 = new WordPattern();
       // add the appropriate WordLists
       pattern3.addList(list9);
       pattern3.addList(list10);
       pattern3.addList(list11);
       pattern3.addList(list12);
       pattern3.addList(list13);
       // add the WordPattern to the vector
      patterns.add(pattern4);



The above code is wrong.
You create pattern4, then add the lists to pattern3 and then add pattern4 to the patterns vector.

0
 

Author Comment

by:cancer_66
ID: 8064254
5)it should be this way?
// create another WordPattern
      WordPattern pattern4 = new WordPattern();
      // add the appropriate WordLists
      pattern4.addList(list9);
      pattern4.addList(list10);
      pattern4.addList(list11);
      pattern4.addList(list12);
      pattern4.addList(list13);
      // add the WordPattern to the vector
     patterns.add(pattern4);
0
 

Author Comment

by:cancer_66
ID: 8064267
5)i did the above and it worked!thanks
0
 

Author Comment

by:cancer_66
ID: 8064307
6)Would you help me with ques(4) . i really appricate all ur help.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8064348
I am working on question 4. I have a java implementation of the Porter Stemming Algorithm which I will try to integrate into the code.

Once I have done that I will work on adding a ranking mechanism.

It may take a bit of time though.
What is your deadline for this ?
0
 

Author Comment

by:cancer_66
ID: 8064372
4)take ur time. its not nessary to submit it now i have enough time.even if it would be ready by thursday or friday its fine.
0
 

Author Comment

by:cancer_66
ID: 8064417
7)iam going to the hospital mom not well. will be back soon.

thanks once again.
0
 

Author Comment

by:cancer_66
ID: 8065960
8)hello iam back:) sorry 4 leaving like that
0
 

Author Comment

by:cancer_66
ID: 8066056
iam reading about the Porter Stemming Algorithm. trying to understand what its all about
0
 

Author Comment

by:cancer_66
ID: 8066216
9)tested the current program again. works fine. thank god. no problems. i read bit about the algorithm. got the overall picture.

0
 

Author Comment

by:cancer_66
ID: 8066362
10)keep in mind iam using jdk1.3
0
 

Author Comment

by:cancer_66
ID: 8066506
11)ill talk to you tomorrow. iam currently working on my interim report and presentation.

12)ill be waiting.
0
 

Author Comment

by:cancer_66
ID: 8066509
thanks alot for your help
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8067040
I have incorporated the stemming algorithm and now the results appear in a ranked list. I will mail you a complete copy of the new code and all the new files I am using.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8068112
I have mailed you an update of all the files which includes some bug fixes and a tidied-up UI.
0
 

Author Comment

by:cancer_66
ID: 8069106
13)thanks alot ozymandias you been of great help. i really appricate that. ill check the program in a while.
0
 

Author Comment

by:cancer_66
ID: 8070446
hello. there. ill just check the mail. and let you give you the remarks. thanks
0
 

Author Comment

by:cancer_66
ID: 8070465
1)can i still add new patterns ? the way i used to do in the previous code?
0
 

Author Comment

by:cancer_66
ID: 8070502
2)can you please explain how the ranking is done? + stemmer.

BTW as i was searching i paased by a Porter Stemmer class which can be used? well i think its too late for that. sorry.
0
 

Author Comment

by:cancer_66
ID: 8070613
3)ill be waiting 4 ue answer. in order to test it, i need to understand how it works.

thanks.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8070643
1) Yes. Patterns can be added in exactly the same way.
2) The ranking is done as follows :

First we search for the search term e.g. "computer graphics".
We now search for the search term using the same technique as the patterns, i.e. we put its words in word lists with stemming turned on.
Search terms are always stemmed.
We always search for search terms in "strict sequence".

"computer graphics" is made up of two words.
For each word an exact match will score 2 points and a stemmed match will score 1 point.

So "computer graphics" will score 4, plus 3 for being a strict match, giving 7. "computer graphic" will score 3, plus 3 for being a strict match, giving 6.

Then we score the pattern. The patterns can contain stemmed words too and work in much the same way. They get a score for each word plus 3 for a strict match, 2 for a sequential match and 1 for a random match.

0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8070646
Finally, we add the key score and the pattern score together to give a full ranking value.

Yes, the Porter Stemming class you found is probably the same one I did. I used it (with a couple of small modifications).
0
 

Author Comment

by:cancer_66
ID: 8070739
3)hmm i pretty much got the idea of how it works.

so if the exact term is found i.e computer graphics

computer = 2points
graphics =2 points  total score =4points

plus for being a "strict match" they get an additional 3points! correct?

so the total is 7points for the exact search term.

4)colomns s1=search term points s2=pattern points?

when i search for computer graphics rank starts from "0"..etc its ranks in descending order?

5)lets say iam searching in random. now the patterns "is defined as" if found. they are ranked 7points in total correct? how is the distrubution of points done.

6)if possible can you give me an example of two sentences
and how their coressponding ranking is done.

take your time. ill wait.thanks

0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8070917
3) Yes. Correct.

4)Yes.

5) "is defined as" in a random match  =

is = 2 points
defined = 2 points
as  = 2 points
random match = 1 point
======================
total = 7
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8071004
6)

Example 1
==========

"computer graphics are defined as pictures" matched in strict mode against "computer graphics".

computer : computer = 2 points
graphics : graphics = 2 points
strict match : computer graphics = 3 points

Key Score = 7

are : are = 2 points
defined : defined  = 2 points
as : as = 2 points
strict match : are defined as = 3 points

Pattern Score = 9 points

Total Score = 16 points



Example 2
=========

"as science is often found to delimit computers graphic output" matched in random mode against "computer graphics".

computers : computer = 1 point
graphic : graphics = 1 point
strict match : computers graphic = 3 points

Key Score = 5 points

is : is = 2 points
delimit : delimited = 1 point
as : as = 2 points
random match : is delimit as = 1 point

Pattern Score = 6

Total Score = 11 points
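
The arithmetic of the two examples can be written as a small sketch. The class `RankScoreSketch` and its constants are hypothetical and not part of the posted code; it only encodes the rules stated above (2 points per exactly matched word, 1 per stemmed match, plus a bonus of 3, 2 or 1 for a strict, sequential or random match) and assumes the counts of exact and stemmed words are already known.

```java
class RankScoreSketch {

    // Bonus points for the kind of match, as described in the thread.
    static final int RANDOM_BONUS = 1;
    static final int SEQUENTIAL_BONUS = 2;
    static final int STRICT_BONUS = 3;

    // 2 points per exactly matched word, 1 per stemmed match, plus the mode bonus.
    static int score(int exactWords, int stemmedWords, int modeBonus) {
        return 2 * exactWords + stemmedWords + modeBonus;
    }

    public static void main(String[] args) {
        // Example 1: "computer graphics are defined as pictures"
        int key1 = score(2, 0, STRICT_BONUS);      // computer, graphics
        int pattern1 = score(3, 0, STRICT_BONUS);  // are, defined, as
        System.out.println("Example 1 total = " + (key1 + pattern1));

        // Example 2: "as science is often found to delimit computers graphic output"
        int key2 = score(0, 2, STRICT_BONUS);      // computers, graphic (both stemmed)
        int pattern2 = score(2, 1, RANDOM_BONUS);  // is, as exact; delimit stemmed
        System.out.println("Example 2 total = " + (key2 + pattern2));
    }
}
```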
0
 

Author Comment

by:cancer_66
ID: 8071194
6)take ur time.
7)ok lets say couple of sentences have the same ranking.

lets say
computer graphics is defined as
computer graphics was defined as
computer graphics should be defined as

ranking would start

0 sentence1
1 sentence2
2 sentence3

shouldnt it be

1 sentence1
1 sentence2
1 sentence3

----------------------------------------------------------
sorry for asking so many questions.





0
 

Author Comment

by:cancer_66
ID: 8071218
ok now i understood how it works. ill start testing it.
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8071497
No. The ranking will always be 1,2,3 etc.

If two sentences have the same score then they will be presented in the order in which they were found e.g. the one in a1.txt will come before the one in c1.txt.
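
A minimal sketch of that tie behaviour, assuming the matches are sorted with a stable sort. `StableRankSketch` and `Scored` are hypothetical stand-ins for the real classes, and the scores and file names are made up. `Collections.sort` is documented as stable, so sentences with equal scores keep the order in which they were found.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class StableRankSketch {

    // Hypothetical stand-in for the thread's Sentence class.
    static class Scored {
        final String text;
        final int score;

        Scored(String text, int score) {
            this.text = text;
            this.score = score;
        }
    }

    // Sorts highest score first; ties stay in the order they were found
    // because the sort is stable.
    static List<Scored> rank(List<Scored> found) {
        List<Scored> ranked = new ArrayList<>(found);
        Collections.sort(ranked, new Comparator<Scored>() {
            public int compare(Scored a, Scored b) {
                return Integer.compare(b.score, a.score);
            }
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<Scored> found = new ArrayList<>();
        found.add(new Scored("a1.txt: computer graphics is defined as ...", 16));
        found.add(new Scored("b1.txt: science is often found to delimit ...", 11));
        found.add(new Scored("c1.txt: computer graphics was defined as ...", 16));
        List<Scored> ranked = rank(found);
        for (int i = 0; i < ranked.size(); i++) {
            // rank numbers are simply 1, 2, 3 ... even when scores are equal
            System.out.println((i + 1) + "  " + ranked.get(i).score + "  " + ranked.get(i).text);
        }
    }
}
```

The sortMatches method in the code posted later uses java.util.Arrays.sort on an object array, which is likewise documented as a stable sort.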
0
 

Author Comment

by:cancer_66
ID: 8071517
7)even if they were in the same file i.e a1.txt ? they will be presented in the order they were found first?
0
 

Author Comment

by:cancer_66
ID: 8071652
8)hmm you said the ranking will always be 1,2,3. i just tested it very quickly and it started 0,1,2..?

shouldnt it start from 1,2,..

note:-downloaded the code where you have clearly stated that (bugs fixed..etc)
0
 

Author Comment

by:cancer_66
ID: 8071943
answer me whenever you can ill be waiting .
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8071957
Yes, the numbering starts from 0. It can just as easily start from 1 if you want. When I said "1,2,3" I meant sequential as opposed to "1,1,1,2,2,2,3,3,3" or whatever.

If you want to change it so it starts from 1 then change line 194 of UI1.java so it reads :

     v.add(Integer.toString(i+1));
0
 

Author Comment

by:cancer_66
ID: 8072058
ok thanks. ill do that soon. iam currently working on my Interim presentation which is on sunday.

0
 

Author Comment

by:cancer_66
ID: 8072163
thanks 4 everything
0
 

Author Comment

by:cancer_66
ID: 8072550
ill award you points and open another thread. is it ok ?
ill still didnt test it though?
anything you would like?
0
 

Author Comment

by:cancer_66
ID: 8072582
8)ill start testing it now :)
0
 

Author Comment

by:cancer_66
ID: 8072845
9)i havent really tested it 100%. but it seems working fine.

10)now whats left is just to integrate it in the Aglets. Do you have any idea how thats done? thats my final stage
0
 

Author Comment

by:cancer_66
ID: 8072910
11)i added two sentences in a1.txt

2computer graphics interface are defined as pictures#
2computer graphics interface are defined as pictures#

now rank should be 1,1,..
but it was 1,2..?
0
 

Author Comment

by:cancer_66
ID: 8073001
12) i tried adding the pattern "is the" by doing the following

private static WordList list4 = new WordList("is",",",false);
    private static WordList list5 = new WordList("the",",",false);

 // create a WordPattern
        WordPattern pattern2 = new WordPattern();
        // add the appropriate WordLists
        pattern2.addList(list4);
        pattern2.addList(list5);
        // add the WordPattern to the vector
        pm.addPattern(pattern2);

i only had one sentence which contained "is the"
"computer graphics is the art of blah blah"

a)this was the only result printed ? it didnt match any other sentence ?


0
 

Author Comment

by:cancer_66
ID: 8073201
please answer whenever you are free. ill be gone to the hospital in a while.
take ur time
0
 

Author Comment

by:cancer_66
ID: 8073286
13)ill award points here.
ive opened a new thread called "Search 3 for ozymandias "

please answer the questions there !thanks alot

ill be going to the hospital now!
0
 
LVL 15

Accepted Solution

by:
ozymandias earned 2000 total points
ID: 8073645
11) No. The change I gave you was so that the numbering would start from 1 rather than 0. It will still be 1,2,3 etc. As I said, the ranking is based on the score. Where the scores are equal, the order of precedence is based on the order in which the sentences were found. I would advise against having joint rankings.

12) I will look into this and try see what the problem is.

13) Yes. That's fine. I will post all comments to the new thread.
0
 

Author Comment

by:cancer_66
ID: 8075072
thanks 4 ur help
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8075739
For PAQ value here is the complete code at this stage.

There are 7 files :

/UI1.java
/DefinitionChecker.java
/definitions/PatternMatcher.java
/definitions/WordPattern.java
/definitions/WordList.java
/definitions/Stemmer.java
/definitions/Sentence.java
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8075741
import java.io.*;
import java.util.*;
import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import javax.swing.table.*;

import definitions.Sentence;

public class UI1 extends JFrame implements ActionListener{

      /*
      * UI Components
      */
      TextField search = new TextField(18);
      Label searchlab = new Label("Search for");
      Scrollbar bar = new Scrollbar();
      TextArea results = new TextArea("",15,40,10);
      JTable table;
      JScrollPane scroller;
      Vector columns = new Vector();
      Vector rows = new Vector();
      Button go = new Button("Go...");
      Button send = new Button("Send Clone");
      Button close = new Button("Close");
      Panel Resultpanel = new Panel();
      Panel Buttonpanel = new Panel();
      Panel Inputpanel = new Panel();
      Panel checkbox = new Panel();

      CheckboxGroup cbg1 = new CheckboxGroup();
      Checkbox ran = new Checkbox("Random",cbg1,false);
      Checkbox seq = new Checkbox("Normal",cbg1,true);
      Checkbox sseq = new Checkbox("Strict",cbg1,false);

      CheckboxGroup cbg2 = new CheckboxGroup();
      Checkbox full = new Checkbox("Full",cbg2,false);
      Checkbox fast = new Checkbox("Fast",cbg2,true);

      public UI1(){

            addWindowListener(new WindowAdapter(){
                  public void windowClosing(WindowEvent e){
                        dispose();
                        System.exit(0);
                  }
            });

            GridBagLayout gridbag = new GridBagLayout();
            GridBagConstraints c = new GridBagConstraints();
            Container content = this.getContentPane();
            content.setLayout(gridbag);

            c.insets = new Insets(3,3,3,3);

            c.fill = GridBagConstraints.NONE;
            c.gridx = 0;
            c.gridy = 0;
            c.gridwidth = 1;
            c.weightx = 0.0;
            c.anchor = GridBagConstraints.NORTHWEST;
            gridbag.setConstraints (searchlab, c);
            content.add(searchlab);

            c.fill = GridBagConstraints.HORIZONTAL;
            c.gridx = 1;
            c.gridy = 0;
            c.gridwidth = 1;
            c.weightx = 1.0;
            c.anchor = GridBagConstraints.NORTHEAST;
            gridbag.setConstraints(search, c);
            content.add(search);

            c.fill = GridBagConstraints.BOTH;
            c.gridx = 0;
            c.gridy = 1;
            c.gridwidth = 2;
            c.weightx = 1.0;
            c.weighty = 1.0;
            c.anchor = GridBagConstraints.CENTER;
            columns.add("Rank");
            columns.add("Text");
            columns.add("S1");
            columns.add("S2");
            columns.add("File");
            table = new JTable(rows,columns);
            scroller = new JScrollPane(table);
            gridbag.setConstraints(scroller, c);
            content.add(scroller);

            c.weighty = 0.0;
            c.fill = GridBagConstraints.NONE;
            c.gridx = 0;
            c.gridy = 2;
            c.gridwidth = 2;
            c.weightx = 1.0;
            c.anchor = GridBagConstraints.CENTER;
            checkbox.setLayout(new GridLayout(1,8));
            checkbox.add(new Label("Search : "));
            checkbox.add(ran);
            checkbox.add(seq);
            checkbox.add(sseq);
            checkbox.add(new Label("         "));
            checkbox.add(new Label("Scoring : "));
            checkbox.add(full);
            checkbox.add(fast);
            gridbag.setConstraints(checkbox, c);
            content.add(checkbox);

            c.fill = GridBagConstraints.NONE;
            c.gridx = 0;
            c.gridy = 3;
            c.gridwidth = 2;
            c.weightx = 1.0;
            c.anchor = GridBagConstraints.CENTER;

            Buttonpanel.setLayout(new GridLayout(1,5));
            Buttonpanel.add(go);
            Buttonpanel.add(new Label("    "));
            Buttonpanel.add(close);
            Buttonpanel.add(new Label("    "));
            Buttonpanel.add(send);
            gridbag.setConstraints(Buttonpanel, c);
            content.add(Buttonpanel);

            go.addActionListener(this);
            send.addActionListener(this);
            close.addActionListener(this);

            KeyListener kl = new KeyListener() {
                  public void keyPressed(KeyEvent e) {}

                  public void keyReleased(KeyEvent e) {
                        if (e.getKeyCode() == KeyEvent.VK_ENTER) {
                              System.out.println(search.getText());
                        }
                  }

                  public void keyTyped(KeyEvent e) {}
            };

            search.addKeyListener(kl);

            this.pack();
            // setBounds replaces the deprecated resize()/reshape() calls
            this.setBounds(20,20,600,400);
            setColumnWidths();
      }

      public static void main(String args[]){

            UI1 agletFrame = new UI1();

            agletFrame.setTitle("Aglet Interface Example");
            // setVisible(true) replaces the deprecated show()
            agletFrame.setVisible(true);

      }

      private boolean validSearch(String s){

            // if the search term is less than 7 characters it can't be valid
            if (s.length() < 7){
                  return false;
            }
            // if the search term does not have a space it can't be valid
            if (s.indexOf(" ") == -1){
                  return false;
            }
            // if any of the search terms words are less than 3 characters
            // it can't be valid.
            StringTokenizer st = new StringTokenizer(s);
            while (st.hasMoreTokens()){
                  if (st.nextToken().length() < 3){
                        return false;
                  }
            }
            return true;
      }

      private void showMsg(String msg){
            JOptionPane.showMessageDialog(this,msg);
      }

      public void actionPerformed(ActionEvent event){

            if (event.getSource() == go){
                  results.setText("");
                  if (validSearch(search.getText())){
                        int mode = 2;
                        if (cbg1.getSelectedCheckbox() == ran){
                              mode = 1;
                        }else if(cbg1.getSelectedCheckbox() == seq){
                              mode = 2;
                        }else{
                              mode = 3;
                        }
                        boolean quick = true;
                        if (cbg2.getSelectedCheckbox() == full){
                              quick = false;
                        }
                        DefinitionChecker dc = new DefinitionChecker(search.getText(),mode,quick,false);
                        Sentence[] sentences = dc.getMatchedSentences();
                        rows = new Vector();
                        for (int i = 0; i < sentences.length; i++){
                              Vector v = new Vector();
                              v.add(Integer.toString(i+1));
                              v.add(sentences[i].getSentence());
                              v.add(Integer.toString(sentences[i].getKeyScore()));
                              v.add(Integer.toString(sentences[i].getPatternScore()));
                              v.add(sentences[i].getLocation().toString());
                              rows.add(v);
                        }
                        table.setModel(new DefaultTableModel(rows,columns));
                        setColumnWidths();
                  }else{
                        showMsg("You must provide a valid search term.\n\nA valid search term must have a minimum of two words\neach of which must have at least three characters.");
                  }
            }else if (event.getSource() == close){
                  System.exit(0);
            }else if(event.getSource()==send){

            }
      }

      private void setColumnWidths(){
            table.getColumnModel().getColumn(0).setPreferredWidth(15);
            table.getColumnModel().getColumn(1).setPreferredWidth(300);
            table.getColumnModel().getColumn(2).setPreferredWidth(15);
            table.getColumnModel().getColumn(3).setPreferredWidth(15);
            table.getColumnModel().getColumn(4).setPreferredWidth(100);
      }
}
0
 
LVL 15

Expert Comment

by:ozymandias
ID: 8075746
/*
* DefinitionChecker.java
*
*/

import java.util.Vector;
import java.util.StringTokenizer;
import java.io.*;
import definitions.*;

public class DefinitionChecker{


      /*
      *
      * These are some static WordLists which can be used to create
      * the WordPatterns that this PatternMatcher will use
      *
      */
      private static WordList list1 = new WordList("is,was,are,be",",",false);
      private static WordList list2 = new WordList("described,defined,delimited",",",true);
      private static WordList list3 = new WordList("as,by",",",false);
      private static WordList list4 = new WordList("is",",",false);
      private static WordList list5 = new WordList("the",",",false);

      String keyword;
      String[] files = new String[]{"a1.txt","b1.txt","c1.txt","d1.txt"};
      Sentence[] sentences;
      Vector matches = new Vector();

      /**
      *
      * Constructor for the DefinitionChecker
      *
      */
      public DefinitionChecker(String s, int matchMode, boolean quick, boolean debug){

            // let's build our PatternMatcher
            PatternMatcher pm = new PatternMatcher();

            // create a WordPattern
            WordPattern pattern1 = new WordPattern();
            // add the appropriate WordLists
            pattern1.addList(list1);
            pattern1.addList(list2);
            pattern1.addList(list3);
            // add the WordPattern to the vector
            pm.addPattern(pattern1);

            // create a WordPattern
            WordPattern pattern2 = new WordPattern();
            // add the appropriate WordLists
            pattern2.addList(list4);
            pattern2.addList(list5);
            // add the WordPattern to the vector
            pm.addPattern(pattern2);

            // now let's build a PatternMatcher to hold our keyword pattern,
            // using one stemmed WordList per word of the search term.
            PatternMatcher km = new PatternMatcher();
            WordPattern keyPattern = new WordPattern();
            StringTokenizer st = new StringTokenizer(s);
            while (st.hasMoreTokens()){
                  keyPattern.addList(new WordList(st.nextToken()," ",true));
            }
            km.addPattern(keyPattern);

            // loop through each file in the list of files
            for (int f = 0; f < files.length; f++){
                  File file = null;
                  try{
                        // get all the sentences
                        file = new File(files[f]);
                        sentences = getSentencesFromFile(file);
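                  }catch(IOException ioe){
                        // skip files that cannot be read; otherwise we would re-scan
                        // the previous file's sentences (or hit a null array)
                        System.out.println(ioe);
                        continue;
                  }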
                  keyword = s.toLowerCase();
                  // loop through all the sentences
                  for (int i = 0; i < sentences.length; i++){
                        // if any sentence contains the keyword and matches any of the patterns specified in the PatternMatcher
                        int keyScore = km.scoreSentence(sentences[i],WordPattern.STRICT_MATCH,false,false);
                        int patternScore = pm.scoreSentence(sentences[i],matchMode,quick,false);
                        if (keyScore > 0 && patternScore > 0){
                              sentences[i].setKeyScore(keyScore);
                              sentences[i].setPatternScore(patternScore);
                              // record the sentence as a match
                              matches.add(sentences[i]);
                        }
                        //System.out.println();
                  }
            }
            sortMatches();
            if (debug){
                  for (int m = 0; m < matches.size(); m++){
                        System.out.println("\t\tMATCH : " + matches.elementAt(m).toString());
                  }
            }

      }

      private void sortMatches(){
            Object[] o = matches.toArray();
            java.util.Arrays.sort(o);
            matches = new Vector();
            for (int i = 0; i < o.length; i++){
                  matches.add(o[i]);
            }
      }

      /**
      * getMatches()
      *
      * Returns an array of strings which are all the matched sentences found by the DefinitionChecker.
      *
      */
      public String[] getMatches(){
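            String[] m = new String[matches.size()];
            // matches holds Sentence objects, so convert each one explicitly;
            // matches.toArray(m) would throw an ArrayStoreException here
            for (int i = 0; i < m.length; i++){
                  m[i] = matches.elementAt(i).toString();
            }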
            return m;
      }

      /**
      * getMatchedSentences()
      *
      * Returns an array of sentences which are all the matched sentences found by the DefinitionChecker.
      *
      */
      public Sentence[] getMatchedSentences(){
            Sentence[] s = new Sentence[matches.size()];
            s = (Sentence[])matches.toArray(s);
            return s;
      }

      /**
      *
      * getSentencesFromFile
      *
      * This function reads a specified file and breaks the contents into
      * an array of Sentences using the # character as a delimiter
      *
      */
      private Sentence[] getSentencesFromFile(File f) throws IOException{
            FileReader reader = new FileReader(f);
            Vector sentences = new Vector();
            char[] cbuf = new char[1];
            String delimiter = "#";
            String sentence = "";
            String c = "";
            // read the file character by character
            while (reader.read(cbuf) != -1){
                  c = new String(cbuf);
                  // if the character is a delimiter (#)
                  if (c.equals(delimiter)){
                        // add the sentence to the Vector and start a new blank sentence
                        Sentence s = new Sentence(sentence);
                        s.setLocation(f);
                        sentences.add(s);
                        sentence = "";
                  }else{
                        // otherwise just add the character to the current sentence string
                        sentence += c;
                  }
            }
            reader.close();
            Sentence[] sentenceArray = new Sentence[sentences.size()];
            // convert the Vector to an array and return it
            sentenceArray = (Sentence[])sentences.toArray(sentenceArray);
            return sentenceArray;
      }

      public static void main(String[] args){

            int matchMode = WordPattern.NORMAL_MATCH;

            String s = "";
            int numKeywords = 0;
            // first lets check what the arguments are
            for (int i = 0;i < args.length;i++){
                  //if any of them are -? then we print the usage message
                  if (args[i].equalsIgnoreCase("-?")){
                        printUsage("");
                        System.exit(1);
                  }
                  //if any of them are -s then we are in strict mode
                  if (args[i].equalsIgnoreCase("-s")){
                        matchMode = WordPattern.STRICT_MATCH;
                        continue;
                  }
                  //if any of them are -r then we are in random mode
                  if (args[i].equalsIgnoreCase("-r")){
                        matchMode = WordPattern.RANDOM_MATCH;
                        continue;
                  }
                  //if any of them are -n then we are in normal mode
                  if (args[i].equalsIgnoreCase("-n")){
                        matchMode = WordPattern.NORMAL_MATCH;
                        continue;
                  }
                  // make sure they are all 3 characters or longer
                  if (args[i].length() < 3){
                        printUsage("Input Error : " + args[i] + "\nAll component words of the SearchTerm must be three characters or more.");
                        System.exit(1);
                  }
                  // concatenate the arguments into one search string
                  s = s + args[i] + " ";
                  numKeywords++;
            }
            // now make sure that we have at least two valid keywords
            if (numKeywords < 2){
                  printUsage("");
                  System.exit(1);
            }
            s = s.trim();
            // finally instantiate a DefinitionChecker and pass it the string and tell it which match mode to use
            DefinitionChecker dc = new DefinitionChecker(s,matchMode,false,true);
      }

      private static void printUsage(String msg){
            if (msg.length() > 0){
                  System.out.println("\n" + msg);
            }
            System.out.println("\nUSAGE : DefinitionChecker Mode SearchTerm\n\n\tMode Options :\n\t-r\trandom pattern matching\n\t-n\tnormal pattern matching (default)\n\t-s\tstrict pattern matching\n\n\tSearchTerm : \n\tA minimum of 2 words each consisting of 3 characters\n\tor more must be provided to make a valid SearchTerm.");
      }

}
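For anyone testing the input files: getSentencesFromFile() only closes a sentence when it sees a #, so any text after the last # in a file is silently dropped. Here is a minimal sketch of that splitting rule on an in-memory string. The class name HashSplitSketch and the method split() are made up for illustration and are not part of the posted classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the posted code.
class HashSplitSketch {

    // Split text into sentences on '#', mirroring getSentencesFromFile():
    // a sentence is only stored when a '#' is seen, so text after the
    // last '#' is dropped, just as in the file-reading loop.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '#') {
                sentences.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // The third fragment has no closing '#', so it is not returned.
        System.out.println(split("A stack is defined as a list.#A queue is the opposite.#unterminated"));
    }
}
```

So if the last sentence in a1.txt never shows up as a match, check that it ends with a #.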
LVL 15

Expert Comment

by:ozymandias
ID: 8075752
/*
* PatternMatcher.java
*
*/

package definitions;

import java.util.StringTokenizer;
import java.util.Vector;

public class PatternMatcher{

     private Vector patterns;

     /**
     *
     * Constructor for the PatternMatcher. This creates the empty
     * list of WordPatterns; patterns are added afterwards
     * with addPattern().
     *
     */
     public PatternMatcher(){
          // create the vector to store our WordPatterns
          patterns = new Vector();
     }

     /**
     *
     * This function adds a WordPattern to the
     * PatternMatcher. DefinitionChecker uses it to
     * register each pattern it builds.
     */
     public void addPattern(WordPattern pattern){
          patterns.add(pattern);
     }

     /**
     *
     * This is the key function of the PatternMatcher. It is
     * passed a Sentence and information on "strictness".
     * It then cycles through all its patterns seeing if any of them
     * are found in the sentence, and returns the resulting score.
     *
     */
     public int scoreSentence(Sentence s, int matchMode, boolean quick, boolean all){

          // loop through all the WordPatterns checking to see if
          // any of them match the sentence.
          int hiScore = 0;
          for (int i = 0; i < patterns.size();i++){
               int score = 0;
               WordPattern wp = (WordPattern)patterns.elementAt(i);
               if ((score = wp.containsPattern(s,matchMode,all)) > 0){
                    if (quick){
                         return score;
                    }else{
                         hiScore += score;
                    }
               }
          }
          return hiScore;
     }

}
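One thing worth knowing about scoreSentence(): despite its name, hiScore is a running total. With quick set to false the scores of all matching patterns are added together; with quick set to true the first positive score is returned straight away. A minimal sketch of just that aggregation rule, using plain ints in place of WordPatterns (ScoreAggregationSketch and aggregate() are illustrative names, not part of the posted code):

```java
// Hypothetical illustration of how scoreSentence() combines pattern scores.
class ScoreAggregationSketch {

    // patternScores stands in for the values containsPattern() would return,
    // one per WordPattern, in the order the patterns were added.
    static int aggregate(int[] patternScores, boolean quick) {
        int total = 0;
        for (int score : patternScores) {
            if (score > 0) {
                if (quick) {
                    // quick mode: stop at the first pattern that matches
                    return score;
                }
                // otherwise keep adding up every matching pattern
                total += score;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] scores = {0, 5, 4};
        System.out.println("quick = " + aggregate(scores, true));
        System.out.println("full  = " + aggregate(scores, false));
    }
}
```

This means a sentence that matches several patterns will sort above one that matches a single pattern strongly, which may or may not be what you want.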
LVL 15

Expert Comment

by:ozymandias
ID: 8075782
/**
*
* WordPattern.java
*
* This class contains the core of the "comparison logic". Each WordPattern
* contains one or more word lists which it uses in sequence to do a word by
* word comparison with the sentence provided.
*
*/

package definitions;

import java.util.Vector;
import java.util.StringTokenizer;


public class WordPattern{

      /*
      *
      * Some static integers to denote the various modes
      * available for pattern matching
      */
      public final static int STRICT_MATCH = 3;
      public final static int NORMAL_MATCH = 2;
      public final static int RANDOM_MATCH = 1;

      private Vector lists;

      /**
      *
      * This constructor takes an array of WordLists
      * and uses them to populate its own Vector
      * of WordLists
      */
      public WordPattern(WordList[] wl){
            lists = new Vector();
            for (int i = 0; i < wl.length; i++){
                  lists.add(wl[i]);
            }
      }

      /**
      *
      * This constructor simply initialises a blank Vector
      * to be used to store the WordLists which can be added
      * using the addList() method
      */
      public WordPattern(){
            lists = new Vector();
      }

      /**
      *
      * This function adds a WordList to the Word Pattern
      *
      */
      public void addList(WordList list){
            lists.add(list);
      }

      /**
      *
      * This function does all the real work. It breaks the supplied
      * Sentence into its component words and then compares them, either
      * strictly or not, to the words in the WordLists.
      *
      */
      public int containsPattern(Sentence s, int matchMode, boolean all){
            //System.out.println(s);
            String[] words = s.getWordArray();
            int totalScore = 0;
            int score = 0;
            int stop = 0;
            if (!all){
                  stop = (matchMode - 1);
            }
            // if there are fewer words than lists then the sentence cannot
            // possibly contain a full pattern, so return 0
            if (words.length < lists.size()){
                  return 0;
            }
            for (int m = matchMode; m > stop; m--){
                  totalScore = 0;
                  // this counter will hold the number of words matched
                  int count = 0;
                  // this counter will hold the number of words matched contiguously (i.e. in strict sequence)
                  int sequence = 0;
                  // this value will tell us whether the previous word was a match
                  boolean inSequence = false;
                  // simultaneously loop through the array of words and the Vector
                  // of WordLists, starting by comparing the first word with the first WordList
                  for (int l = 0, w = 0; ((l < lists.size()) && (w < words.length));){
                        WordList wordlist = (WordList)lists.elementAt(l);
                        String word = words[w];
                        // if the wordlist contains the word then we can move to the next wordlist
                        // and to the next word in the word array, unless we are in random mode.
                        // If we are in random mode, we move back to the beginning of the word array
                        // and start checking from the beginning because the words can appear in any order.
                        if ((score = wordlist.containsWord(word)) > 0){
                              totalScore += score;
                              //System.out.println(word + " : scores : " + score + " : total = " + totalScore);
                              l++;
                              if (m == RANDOM_MATCH){
                                    w = 0;
                              }else{
                                    w++;
                              }
                              count++;
                              // if we are in sequence (i.e. the previous word was a match)
                              // then we increment the number of sequential words found
                              if (inSequence || sequence == 0){
                                    sequence++;
                              }
                              // set the value to indicate that this word was matched
                              inSequence = true;
                        }else{
                              // if the wordlist does not contain the word then we can move to the next word
                              // but we do not move to the next wordlist
                              w++;
                              // if we are in strict mode and had started a sequence but not finished it then
                              // we may as well abandon it and start with the  first list again just in case
                              // there is a full sequence later in the sentence.
                              if (m == STRICT_MATCH && inSequence && sequence < lists.size()){
                                    l = 0;
                                    w--;
                                    sequence = 0;
                                    count = 0;
                                    totalScore = 0;
                              }
                              // set the value to indicate that we are no longer in strict sequence
                              inSequence = false;
                        }
                  }

                  // if the number of words matched is the same as the number of lists
                  // then we have a match
                  if (count == lists.size()){
                        switch (m){
                              case STRICT_MATCH:
                                    if(sequence == lists.size()){
                                          //System.out.println("strict : scored " + totalScore);
                                          return totalScore + 3;
                                    }else{
                                          totalScore = 0;
                                    }
                                    break;
                              case NORMAL_MATCH:
                                    //System.out.println("normal : scored " + totalScore);
                                    return (totalScore + 2);
                              case RANDOM_MATCH:
                                    //System.out.println("random : scored " + totalScore);
                                    return (totalScore + 1);
                        }
                  }
            }
            //System.out.println("fail : scored " + totalScore);
            return 0;
      }

      /**
      *
      * This function returns the length of the longest word list.
      * It's not used at the moment but may be useful
      *
      */
      public int maxListLength(){
            int length = 0;
            for (int l = 0; l < lists.size(); l ++){
                  if (((WordList)lists.elementAt(l)).numWords() > length){
                        length = ((WordList)lists.elementAt(l)).numWords();
                  }
            }
            return length;
      }

}
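To make the three match modes easier to picture, here is a simplified restatement of what each one is meant to accept. This is a sketch under assumptions, not a copy of containsPattern(): scores and stemming are left out, each WordList is replaced by a plain Set of lower-case words, and the strict check here tries every start position, whereas the loop above restarts greedily. MatchModeSketch and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified view of STRICT / NORMAL / RANDOM matching.
class MatchModeSketch {

    // STRICT: the pattern words must appear next to each other, in order.
    static boolean strict(String[] words, List<Set<String>> lists) {
        for (int start = 0; start + lists.size() <= words.length; start++) {
            boolean ok = true;
            for (int i = 0; i < lists.size(); i++) {
                if (!lists.get(i).contains(words[start + i])) {
                    ok = false;
                    break;
                }
            }
            if (ok) return true;
        }
        return false;
    }

    // NORMAL: the pattern words must appear in order, gaps are allowed.
    static boolean normal(String[] words, List<Set<String>> lists) {
        int l = 0;
        for (int w = 0; w < words.length && l < lists.size(); w++) {
            if (lists.get(l).contains(words[w])) l++;
        }
        return l == lists.size();
    }

    // RANDOM: every list must be matched somewhere, in any order.
    static boolean random(String[] words, List<Set<String>> lists) {
        Set<String> present = new HashSet<String>(Arrays.asList(words));
        for (Set<String> list : lists) {
            boolean found = false;
            for (String candidate : list) {
                if (present.contains(candidate)) {
                    found = true;
                    break;
                }
            }
            if (!found) return false;
        }
        return true;
    }

    // The same words as list1, list2 and list3 in DefinitionChecker.
    static List<Set<String>> demoPattern() {
        List<Set<String>> lists = new ArrayList<Set<String>>();
        lists.add(new HashSet<String>(Arrays.asList("is", "was", "are", "be")));
        lists.add(new HashSet<String>(Arrays.asList("described", "defined", "delimited")));
        lists.add(new HashSet<String>(Arrays.asList("as", "by")));
        return lists;
    }

    public static void main(String[] args) {
        String[] words = "it is often defined loosely as a list".split(" ");
        List<Set<String>> p = demoPattern();
        System.out.println("strict=" + strict(words, p)
                + " normal=" + normal(words, p)
                + " random=" + random(words, p));
    }
}
```

With the sample sentence in main(), strict fails because "often" and "loosely" break up the sequence, while normal and random both succeed.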
LVL 15

Expert Comment

by:ozymandias
ID: 8075786
/**
*
* WordList.java
*
* This class holds an array of strings (words) which
* can be combined in a WordPattern with other WordLists
*
*/

package definitions;

import java.util.StringTokenizer;
import java.util.Vector;

public class WordList{

     private Vector words;
     private Stemmer stemmer;
     private boolean stemming = false;

     /**
     *
     * This constructor takes a string and a delimiter string
     * and then uses a StringTokenizer to break the string into
     * an array of words
     */
     public WordList(String s, String delimiter, boolean stem){
          if (stem){
               stemmer = new Stemmer();
               stemming = true;
          }
          StringTokenizer st = new StringTokenizer(s,delimiter);
          words = new Vector();
          while (st.hasMoreTokens()){
               words.add(st.nextToken());
          }
     }

     /**
     *
     * This is just an accessor function that lets you get the words
     * held in the list. Not used at the moment, but probably useful
     * for debugging.
     */
     public String[] getWords(){
          String[] wordArray = new String[words.size()];
          wordArray = (String[])words.toArray(wordArray);
          return wordArray;
     }

     /**
     *
     * This is just an accessor function that lets you get the number of
     * words held in the list. It is used by WordPattern.maxListLength()
     * and is also handy for debugging.
     */
     public int numWords(){
          return words.size();
     }

     /**
     *
     * This function takes a string (word) and checks to
     * see if it matches any of the words in its list.
     */
     public int containsWord(String s){
          //System.out.println("Looking for " + s + " in :");
          //this.print();
          String word1 = s.trim();
          for (int i = 0; i < words.size(); i++){
               String word2 = (String)words.elementAt(i);
               if (word1.equalsIgnoreCase(word2)){
                    //System.out.println("match : "+ word1 + " : " + word2);
                    return 2;
               }
          }
          if (stemming){
               // the Porter Stemmer expects lower-case input
               word1 = stemmer.getStem(word1.toLowerCase());
               for (int i = 0; i < words.size(); i++){
                    String word2 = stemmer.getStem(((String)words.elementAt(i)).toLowerCase());
                    if (word1.equalsIgnoreCase(word2)){
                         //System.out.println("stem match : "+ word1 + " : " + word2);
                         return 1;
                    }
               }
          }
          return 0;
     }

     /**
     *
     * This is just an accessor function that prints out the words
     * held in the list. Not used at the moment, but probably useful
     * for debugging.
     */
     public void print(){
          for (int i = 0; i < words.size(); i++){
               System.out.println((String)words.elementAt(i));
          }
     }

     public boolean isStemming(){
          return stemming;
     }

}
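The scoring in containsWord() is worth spelling out: an exact (case-insensitive) match scores 2, a match on the stem only scores 1, and anything else scores 0. Here is a minimal sketch of that rule. To keep it self-contained it uses a toy suffix-stripper instead of the Porter Stemmer, so toyStem() is only a stand-in and will not agree with the real stemmer in general. WordScoreSketch, toyStem() and score() are illustrative names.

```java
// Hypothetical illustration of the 2 / 1 / 0 scoring in containsWord().
class WordScoreSketch {

    // Toy stand-in for the Porter Stemmer (an assumption for this sketch):
    // lower-cases the word and strips one trailing "ing", "ed" or "s".
    static String toyStem(String w) {
        String s = w.toLowerCase();
        if (s.endsWith("ing") && s.length() > 5) return s.substring(0, s.length() - 3);
        if (s.endsWith("ed") && s.length() > 4) return s.substring(0, s.length() - 2);
        if (s.endsWith("s") && s.length() > 3) return s.substring(0, s.length() - 1);
        return s;
    }

    static int score(String word, String[] list, boolean stemming) {
        String w = word.trim();
        // first pass: exact match, ignoring case
        for (String entry : list) {
            if (w.equalsIgnoreCase(entry)) return 2;
        }
        // second pass: compare stems, only if this list stems
        if (stemming) {
            String stem = toyStem(w);
            for (String entry : list) {
                if (stem.equals(toyStem(entry))) return 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[] list2 = {"described", "defined", "delimited"};
        System.out.println(score("Defined", list2, true));
        System.out.println(score("defining", list2, true));
        System.out.println(score("defining", list2, false));
    }
}
```

Because exact matches are checked across the whole list before any stemming happens, a word that appears literally in the list always gets the higher score.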
LVL 15

Expert Comment

by:ozymandias
ID: 8075791
/*

   Porter stemmer in Java. The original paper is in

       Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
       no. 3, pp 130-137,

   See also http://www.tartarus.org/~martin/PorterStemmer

   History:

   Release 1

   Bug 1 (reported by Gonzalo Parra 16/10/99) fixed as marked below.
   The words 'aed', 'eed', 'oed' leave k at 'a' for step 3, and b[k-1]
   is then outside the bounds of b.

   Release 2

   Similarly,

   Bug 2 (reported by Steve Dyrdahl 22/2/00) fixed as marked below.
   'ion' by itself leaves j = -1 in the test for 'ion' in step 5, and
   b[j] is then outside the bounds of b.

   Release 3

   Considerably revised 4/9/00 in the light of many helpful suggestions
   from Brian Goetz of Quiotix Corporation (brian@quiotix.com).

   Release 4

*/

package definitions;

import java.io.*;

/**
  * Stemmer, implementing the Porter Stemming Algorithm
  *
  * The Stemmer class transforms a word into its root form.  The input
  * word can be provided a character at a time (by calling add()), or at once
  * by calling one of the various stem(something) methods.
  */

public class Stemmer
{  private char[] b;
   private int i,     /* offset into b */
               i_end, /* offset to end of stemmed word */
               j, k;
   private static final int INC = 50;
                     /* unit of size whereby b is increased */
   public Stemmer()
   {  b = new char[INC];
      i = 0;
      i_end = 0;
   }

   /**
    * Function to allow the Stemmer to be reused
      *
      * Ozymandias 04/03/03
    */

   public void reset()
   {
         b = new char[INC];
         i = 0;
         i_end = 0;
   }

   /**
    * Add a character to the word being stemmed.  When you are finished
    * adding characters, you can call stem() to stem the word.
    */

   public void add(char ch)
   {  if (i == b.length)
      {  char[] new_b = new char[i+INC];
         for (int c = 0; c < i; c++) new_b[c] = b[c];
         b = new_b;
      }
      b[i++] = ch;
   }


   /** Adds wLen characters to the word being stemmed contained in a portion
    * of a char[] array. This is like repeated calls of add(char ch), but
    * faster.
    */

   public void add(char[] w, int wLen)
   {  if (i+wLen >= b.length)
      {  char[] new_b = new char[i+wLen+INC];
         for (int c = 0; c < i; c++) new_b[c] = b[c];
         b = new_b;
      }
      for (int c = 0; c < wLen; c++) b[i++] = w[c];
   }

   /**
    * Quick and dirty method for adding a word as a string
    *
    * Ozymandias 04/03/03
    */

      public void add(String word){
            int length = word.length();
            char[] buf = new char[length];
            for (int c = 0; c < length; c++){
                  buf[c] = word.charAt(c);
            }
            add(buf,length);
      }

   /**
    * Quick and dirty method for getting the stem of a word
    *
    * Ozymandias 04/03/03
    */

      public String getStem(String s){
            this.reset();
            this.add(s);
            this.stem();
            return this.toString();
      }

   /**
    * After a word has been stemmed, it can be retrieved by toString(),
    * or a reference to the internal buffer can be retrieved by getResultBuffer
    * and getResultLength (which is generally more efficient.)
    */
   public String toString() { return new String(b,0,i_end); }

   /**
    * Returns the length of the word resulting from the stemming process.
    */
   public int getResultLength() { return i_end; }

   /**
    * Returns a reference to a character buffer containing the results of
    * the stemming process.  You also need to consult getResultLength()
    * to determine the length of the result.
    */
   public char[] getResultBuffer() { return b; }

   /* cons(i) is true <=> b[i] is a consonant. */

   private final boolean cons(int i)
   {  switch (b[i])
      {  case 'a': case 'e': case 'i': case 'o': case 'u': return false;
         case 'y': return (i==0) ? true : !cons(i-1);
         default: return true;
      }
   }

   /* m() measures the number of consonant sequences between 0 and j. if c is
      a consonant sequence and v a vowel sequence, and <..> indicates arbitrary
      presence,

         <c><v>       gives 0
         <c>vc<v>     gives 1
         <c>vcvc<v>   gives 2
         <c>vcvcvc<v> gives 3
         ....
   */

   private final int m()
   {  int n = 0;
      int i = 0;
      while(true)
      {  if (i > j) return n;
         if (! cons(i)) break;
         i++;
      }
      i++;
      while(true)
      {  while(true)
         {  if (i > j) return n;
               if (cons(i)) break;
               i++;
         }
         i++;
         n++;
         while(true)
         {  if (i > j) return n;
            if (! cons(i)) break;
            i++;
         }
         i++;
       }
   }

   /* vowelinstem() is true <=> 0,...j contains a vowel */

   private final boolean vowelinstem()
   {  int i; for (i = 0; i <= j; i++) if (! cons(i)) return true;
      return false;
   }

   /* doublec(j) is true <=> j,(j-1) contain a double consonant. */

   private final boolean doublec(int j)
   {  if (j < 1) return false;
      if (b[j] != b[j-1]) return false;
      return cons(j);
   }

   /* cvc(i) is true <=> i-2,i-1,i has the form consonant - vowel - consonant
      and also if the second c is not w,x or y. this is used when trying to
      restore an e at the end of a short word. e.g.

         cav(e), lov(e), hop(e), crim(e), but
         snow, box, tray.

   */

   private final boolean cvc(int i)
   {  if (i < 2 || !cons(i) || cons(i-1) || !cons(i-2)) return false;
      {  int ch = b[i];
         if (ch == 'w' || ch == 'x' || ch == 'y') return false;
      }
      return true;
   }

   private final boolean ends(String s)
   {  int l = s.length();
      int o = k-l+1;
      if (o < 0) return false;
      for (int i = 0; i < l; i++) if (b[o+i] != s.charAt(i)) return false;
      j = k-l;
      return true;
   }

   /* setto(s) sets (j+1),...k to the characters in the string s, readjusting
      k. */

   private final void setto(String s)
   {  int l = s.length();
      int o = j+1;
      for (int i = 0; i < l; i++) b[o+i] = s.charAt(i);
      k = j+l;
   }

   /* r(s) is used further down. */

   private final void r(String s) { if (m() > 0) setto(s); }

   /* step1() gets rid of plurals and -ed or -ing. e.g.

          caresses  ->  caress
          ponies    ->  poni
          ties      ->  ti
          caress    ->  caress
          cats      ->  cat

          feed      ->  feed
          agreed    ->  agree
          disabled  ->  disable

          matting   ->  mat
          mating    ->  mate
          meeting   ->  meet
          milling   ->  mill
          messing   ->  mess

          meetings  ->  meet

   */

   private final void step1()
   {  if (b[k] == 's')
      {  if (ends("sses")) k -= 2; else
         if (ends("ies")) setto("i"); else
         if (b[k-1] != 's') k--;
      }
      if (ends("eed")) { if (m() > 0) k--; } else
      if ((ends("ed") || ends("ing")) && vowelinstem())
      {  k = j;
         if (ends("at")) setto("ate"); else
         if (ends("bl")) setto("ble"); else
         if (ends("iz")) setto("ize"); else
         if (doublec(k))
         {  k--;
            {  int ch = b[k];
               if (ch == 'l' || ch == 's' || ch == 'z') k++;
            }
         }
         else if (m() == 1 && cvc(k)) setto("e");
     }
   }

   /* step2() turns terminal y to i when there is another vowel in the stem. */

   private final void step2() { if (ends("y") && vowelinstem()) b[k] = 'i'; }

   /* step3() maps double suffices to single ones. so -ization ( = -ize plus
      -ation) maps to -ize etc. note that the string before the suffix must give
      m() > 0. */

   private final void step3() { if (k == 0) return; /* For Bug 1 */ switch (b[k-1])
   {
       case 'a': if (ends("ational")) { r("ate"); break; }
                 if (ends("tional")) { r("tion"); break; }
                 break;
       case 'c': if (ends("enci")) { r("ence"); break; }
                 if (ends("anci")) { r("ance"); break; }
                 break;
       case 'e': if (ends("izer")) { r("ize"); break; }
                 break;
       case 'l': if (ends("bli")) { r("ble"); break; }
                 if (ends("alli")) { r("al"); break; }
                 if (ends("entli")) { r("ent"); break; }
                 if (ends("eli")) { r("e"); break; }
                 if (ends("ousli")) { r("ous"); break; }
                 break;
       case 'o': if (ends("ization")) { r("ize"); break; }
                 if (ends("ation")) { r("ate"); break; }
                 if (ends("ator")) { r("ate"); break; }
                 break;
       case 's': if (ends("alism")) { r("al"); break; }
                 if (ends("iveness")) { r("ive"); break; }
                 if (ends("fulness")) { r("ful"); break; }
                 if (ends("ousness")) { r("ous"); break; }
                 break;
       case 't': if (ends("aliti")) { r("al"); break; }
                 if (ends("iviti")) { r("ive"); break; }
                 if (ends("biliti")) { r("ble"); break; }
                 break;
       case 'g': if (ends("logi")) { r("log"); break; }
   } }

   /* step4() deals with -ic-, -full, -ness etc. similar strategy to step3. */

   private final void step4() { switch (b[k])
   {
       case 'e': if (ends("icate")) { r("ic"); break; }
                 if (ends("ative")) { r(""); break; }
                 if (ends("alize")) { r("al"); break; }
                 break;
       case 'i': if (ends("iciti")) { r("ic"); break; }
                 break;
       case 'l': if (ends("ical")) { r("ic"); break; }
                 if (ends("ful")) { r(""); break; }
                 break;
       case 's': if (ends("ness")) { r(""); break; }
                 break;
   } }

   /* step5() takes off -ant, -ence etc., in context <c>vcvc<v>. */

   private final void step5()
   {   if (k == 0) return; /* for Bug 1 */ switch (b[k-1])
       {  case 'a': if (ends("al")) break; return;
          case 'c': if (ends("ance")) break;
                    if (ends("ence")) break; return;
          case 'e': if (ends("er")) break; return;
          case 'i': if (ends("ic")) break; return;
          case 'l': if (ends("able")) break;
                    if (ends("ible")) break; return;
          case 'n': if (ends("ant")) break;
                    if (ends("ement")) break;
                    if (ends("ment")) break;
                    /* element etc. not stripped before the m */
                    if (ends("ent")) break; return;
          case 'o': if (ends("ion") && j >= 0 && (b[j] == 's' || b[j] == 't')) break;
                                    /* j >= 0 fixes Bug 2 */
                    if (ends("ou")) break; return;
                    /* takes care of -ous */
          case 's': if (ends("ism")) break; return;
          case 't': if (ends("ate")) break;
                    if (ends("iti")) break; return;
          case 'u': if (ends("ous")) break; return;
          case 'v': if (ends("ive")) break; return;
          case 'z': if (ends("ize")) break; return;
          default: return;
       }
       if (m() > 1) k = j;
   }

   /* step6() removes a final -e if m() > 1. */

   private final void step6()
   {  j = k;
      if (b[k] == 'e')
      {  int a = m();
         if (a > 1 || a == 1 && !cvc(k-1)) k--;
      }
      if (b[k] == 'l' && doublec(k) && m() > 1) k--;
   }

   /** Stem the word placed into the Stemmer buffer through calls to add().
    * This method returns nothing; you can retrieve the result with
    * getResultLength()/getResultBuffer() or toString().
    */
   public void stem()
   {  k = i - 1;
      if (k > 1) { step1(); step2(); step3(); step4(); step5(); step6(); }
      i_end = k+1; i = 0;
   }

   /** Test program for demonstrating the Stemmer.  It reads text from
    * a list of files, stems each word, and writes the result to standard
    * output. Note that the word stemmed is expected to be in lower case:
    * forcing lower case must be done outside the Stemmer class.
    * Usage: Stemmer file-name file-name ...
    */
   /**
   *
   * Commenting out this main method to add one that is more useful
   * for my immediate needs.
   *
   * Ozymandias 04/03/03
   *
   public static void main(String[] args)
   {
      char[] w = new char[501];
      Stemmer s = new Stemmer();
      for (int i = 0; i < args.length; i++)
      try
      {
         FileInputStream in = new FileInputStream(args[i]);

         try
         { while(true)

           {  int ch = in.read();
              if (Character.isLetter((char) ch))
              {
                 int j = 0;
                 while(true)
                 {  ch = Character.toLowerCase((char) ch);
                    w[j] = (char) ch;
                    if (j < 500) j++;
                    ch = in.read();
                    if (!Character.isLetter((char) ch))
                    {
                       // to test add(char ch)
                       for (int c = 0; c < j; c++) s.add(w[c]);

                       // or, to test add(char[] w, int j)
                       // s.add(w, j);

                       s.stem();
                       {  String u;

                          // and now, to test toString() :
                          u = s.toString();

                          // to test getResultBuffer(), getResultLength() :
                          // u = new String(s.getResultBuffer(), 0, s.getResultLength());

                          System.out.print(u);
                       }
                       break;
                    }
                 }
              }
              if (ch < 0) break;
              System.out.print((char)ch);
           }
         }
         catch (IOException e)
         {  System.out.println("error reading " + args[i]);
            break;
         }
      }
      catch (FileNotFoundException e)
      {  System.out.println("file " + args[i] + " not found");
         break;
      }
   }
   */

      public static void main(String[] args){
            Stemmer stemmer = new Stemmer();

            for (int i = 0; i < args.length; i++){
                  String word = args[i];
                  stemmer.reset();
                  stemmer.add(word);
                  stemmer.stem();
                  System.out.println(word + " was stemmed to " + stemmer.toString());
            }
      }
}
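
The comment in the Stemmer says the word is expected to be in lower case and that forcing lower case has to happen outside the class. The replacement main() above passes args straight through, so here is a rough sketch of a helper you could run each word through first. The class name StemInputSketch and the method normalize() are made up for illustration; they are not part of the Stemmer.

```java
public class StemInputSketch {

    // Hypothetical helper (not part of Stemmer): keeps only letters and
    // lower-cases them, so the result is safe to hand to the stemmer.
    static String normalize(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Punctuation is dropped and case is folded before stemming.
        System.out.println(normalize("Graphics,"));
    }
}
```

In the main() above you would then call stemmer.add(normalize(word)) instead of stemmer.add(word), assuming that is the behaviour you want for punctuation.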

 
LVL 15

Expert Comment

by:ozymandias
ID: 8075798
package definitions;

import java.util.StringTokenizer;
import java.io.File;

public class Sentence implements Comparable{

     private String sentence;
     private int keyScore;
     private int patternScore;
     private File location;

     public Sentence(String s){
          sentence = s;
          keyScore = 0;
          patternScore = 0;
     }

     public String[] getWordArray(){
          int token = 0;
          StringTokenizer st = new StringTokenizer(sentence);
          String[] words = new String[st.countTokens()];
          while (st.hasMoreTokens()){
               words[token++] = st.nextToken();
          }
          return words;
     }

     public int getScore(){
          return (keyScore + patternScore);
     }

     public int getKeyScore(){
          return keyScore;
     }

     public void setKeyScore(int s){
          keyScore = s;
     }

     public void addKeyScore(int s){
          keyScore += s;
     }

     public int getPatternScore(){
          return patternScore;
     }

     public void setPatternScore(int s){
          patternScore = s;
     }

     public void addSPatterncore(int s){
          patternScore += s;
     }

     public void setLocation(File f){
          location = f;
     }

     public File getLocation(){
          return location;
     }

     public int compareTo(Object o){
          // Sorts in descending order of total score (highest first).
          // The subtraction already yields 0 when the scores are equal.
          return ((Sentence)o).getScore() - this.getScore();
     }

     public String getSentence(){
          return sentence;
     }

     public String toString(){
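          // Concatenating location directly avoids a NullPointerException
          // when setLocation() has not been called yet.
          return sentence + "[" + keyScore + "][" + patternScore + "](" + location + ")";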
     }
}
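
Because compareTo() subtracts this score from the other one, sorting a list of Sentence objects puts the highest total score first. Here is a minimal sketch of that ordering. It uses a cut-down stand-in class (ScoredSentence, a hypothetical name) rather than the real Sentence so that it compiles on its own, and it uses generics and Integer.compare, which the code above does not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SentenceRankingSketch {

    // Hypothetical stand-in for Sentence: just the text and the two scores.
    static class ScoredSentence implements Comparable<ScoredSentence> {
        final String text;
        final int keyScore;
        final int patternScore;

        ScoredSentence(String text, int keyScore, int patternScore) {
            this.text = text;
            this.keyScore = keyScore;
            this.patternScore = patternScore;
        }

        int getScore() {
            return keyScore + patternScore;
        }

        // Same ordering as Sentence.compareTo: higher total score sorts first.
        public int compareTo(ScoredSentence other) {
            return Integer.compare(other.getScore(), getScore());
        }
    }

    // Returns the sentence texts ordered from highest to lowest total score.
    static List<String> rank(List<ScoredSentence> sentences) {
        List<ScoredSentence> sorted = new ArrayList<ScoredSentence>(sentences);
        Collections.sort(sorted);
        List<String> texts = new ArrayList<String>();
        for (ScoredSentence s : sorted) {
            texts.add(s.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<ScoredSentence> list = new ArrayList<ScoredSentence>();
        list.add(new ScoredSentence("low", 1, 0));
        list.add(new ScoredSentence("high", 3, 4));
        list.add(new ScoredSentence("mid", 2, 1));
        System.out.println(rank(list)); // prints [high, mid, low]
    }
}
```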
 
LVL 15

Expert Comment

by:ozymandias
ID: 8075823
For completeness, here are the contents of the four test files containing the sample sentences.

a1.txt
======
1 graphics computer is defined as nonsense#2 computer graphics interface are defined as pictures#3 computer graphic is delimited by pictures#4 computer graphics are often delimited by science#5 computer graphics are often delimited by science#6 computer graphic are often delimited by science#7 computer graphics described are as random#8 computer graphic described are as random#

b1.txt
======
10 mobile agents are defined as pictures#11 mobile agent is delimited by pictures#12 mobile agents are often delimited by science#13 mobile agents are often delimited by science#14 mobile agent are often delimited by science#15 mobile agents described are as random#16 mobile agent described are as random#

c1.txt
======
17 computers graphics are defined as pictures#18 computers graphic is delimited by pictures#19 computers graphics are often delimited by science#20 computers graphics are often delimited by science#21 computers graphic are often delimited by science#22 computers graphics descibed are is random#23 computers graphic descibed are is random#

d1.txt
======
24 as science is often found to delimit computers graphic output#25 computer graphics is the display of digital images which is defined by science#
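
In case it helps when reading these files: each one is a single line with the sentences separated by '#'. A rough sketch of splitting such a line back into sentences follows. Treating '#' purely as a delimiter is my assumption from the file layout, and the class and method names (SampleFileSplitSketch, splitSentences) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SampleFileSplitSketch {

    // Hypothetical helper: splits file contents on '#', trims each piece
    // and skips empty pieces. The leading number stays part of the sentence.
    static List<String> splitSentences(String contents) {
        List<String> sentences = new ArrayList<String>();
        for (String part : contents.split("#")) {
            String trimmed = part.trim();
            if (trimmed.length() > 0) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Contents of d1.txt as listed above.
        String d1 = "24 as science is often found to delimit computers graphic output#"
                  + "25 computer graphics is the display of digital images which is defined by science#";
        for (String s : splitSentences(d1)) {
            System.out.println(s);
        }
    }
}
```

Each resulting string could then be passed to new Sentence(s).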


Question has a verified solution.
