Solved: Search:- 2 For Mr ozymandias

ASKER

1)whenever i want to add a new pattern which consists of two words i should add it in Token4,Token5
otherwise Token1,2,3

correct?
-----------------------------------------------------
correct way of adding pattern?

private static WordList list4 = new WordList("foo,bar,buzz,is",",");

or

private static WordList list4 = new WordList("foo,bar,buzz","is",",");

why is the "," at the end ?

cancer_66

ASKER

ill talk to you tomorrow.

ozymandias

OK.

If you want to add a new Pattern you have to create a new list for each word in the Pattern. If it's a three-word pattern you need three lists, if it's a two-word pattern you need two lists, etc.

You can get rid of list4 and list5 because they were just for demonstration purposes.

If you want to search for "is the" then you need :

private static WordList list4 = new WordList("is",",");
private static WordList list5 = new WordList("the",",");

If you then wanted to search for "are arranged in" and "is surrounded by", you could either do this :

private static WordList list6 = new WordList("are,is",",");
private static WordList list7 = new WordList("surrounded,arranged",",");
private static WordList list8 = new WordList("in,by",",");

or you could add the required missing words to the existing word lists of list1, list2, list3.

You can assemble the lists in any combination or order into new WordPatterns.

ozymandias

The extra "," at the end is because the constructor for WordList takes both the string e.g. "described,defined,delimited" and the delimiter that should be used to tokenize it into an array e.g. ",".
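A minimal sketch of what a two-argument constructor like this might look like. It is not the original WordList source, which is not shown in this thread; the class name WordListSketch, the contains and size methods, and the case-insensitive comparison are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for the thread's WordList class (the real source is not shown here).
class WordListSketch {
    private final List<String> words = new ArrayList<>();

    // First argument: the words joined together. Second argument: the delimiter separating them.
    WordListSketch(String joinedWords, String delimiter) {
        StringTokenizer tokens = new StringTokenizer(joinedWords, delimiter);
        while (tokens.hasMoreTokens()) {
            // Lower-casing is an assumption of this sketch, not something the thread confirms.
            words.add(tokens.nextToken().trim().toLowerCase());
        }
    }

    // True if the given word is one of the words in this list (case-insensitive).
    boolean contains(String word) {
        return words.contains(word.toLowerCase());
    }

    // Number of words found after splitting on the delimiter.
    int size() {
        return words.size();
    }

    public static void main(String[] args) {
        WordListSketch list = new WordListSketch("described,defined,delimited", ",");
        System.out.println(list.size());
        System.out.println(list.contains("Defined"));
    }
}
```

With a two-argument constructor of this shape, the second form in the question (three arguments) would presumably not compile, which is why the words have to go into one comma-separated string.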

cancer_66

ASKER

hello,

ok i just read your comment regarding the addition of new patterns. i am going to try that right away and let you know if i face any problems.

thanks again.

cancer_66

ASKER

i have created a directory called "Expert Exchange"

a)and extracted the DefinitionChecker.zip there.

b)therefore in the directory "Expert Exchange" i had the following files:-

i)DefinitionChecker.java

ii)a subdirectory called "definitions" (created automatically as i extracted the zip file) which contained the file "PatternMatcher.java"

when i compile "DefinitionChecker.java"

the following classes are created
"DefinitionChecker.class" + in subdirectory "definitions" classes "PatternMatcher.class","WordPattern.class" and "WordList.class"

i just did the following to the code:-

private static WordList list6 = new WordList("are,is",",");
private static WordList list7 = new WordList("surrounded,arranged",",");
private static WordList list8 = new WordList("in,by",",");

&

WordPattern pattern3 = new WordPattern();
pattern1.addList(list6);
pattern1.addList(list7);
pattern1.addList(list8);
patterns.add(pattern3);

and it worked . Just a check am i doing it the right way?

cancer_66

ASKER

1)lets say i want to add another option to the user to choose between

a)Sequential search (completed)
b)Strict Sequential search (completed)
c)Random search

now in random search all the pattern words should be found regardless of their position in the sentence, i.e. "is defined as", "defined is as", "as defined is", "is XX defined YY as"

for example

computer Graphics is defined as a field in cs
(printed in seq + st.seq +random search)

computer graphics is sometimes defined as a field in cs(printed in seq + Random)

computer graphics defined tt as tt is a field in cs
(printed in Random search ONLY)

intelligent agents are sometimes defined as mobile agents (printed in seq +random)

answer me whenever you can. ill be waiting.thanks

cancer_66

ASKER

guess you are busy. no problem. please answer me as you have the time. iam waiting,

cancer_66

ASKER

still waiting:)

ozymandias

OK. Sorry, it's been a busy day.

I have changed the code around a bit and split up some of the files.

There is now a new command-line option -r for random.

I will mail you the code shortly.

cancer_66

ASKER

its fine no problem. take your time !
ok ill just check my email.
thanks

cancer_66

ASKER

whenever you can. just let me know a bit about the changes,

cancer_66

ASKER

hmmmm, i am a bit confused. which one is the latest "WordPattern", the one you sent with the "DefinitionChecker.zip" or the one in the separate email ?

cancer_66

ASKER

1)Comments on "DefinitionChecker.zip"

a)Strict mode doesnt work?
b)In normal Mode the following was printed

computer graphics defined t t as t is ?

c)in random mode following was printed

Graphic is as a field in computer science (should not be printed no MainMarker)?

ozymandias

The latest WordPattern.java was the one that I sent on its own. It replaces the one in the zip file.

ozymandias

All the modes work fine for me.

When I run normal mode I get the following output :

Matches found in abc.txt
MATCH : computer graphics foo token blah blah
Matches found in xyz.txt
MATCH : computer graphics buzz ping token blah blah
MATCH : computer graphics is defined by nonsense
MATCH : computer Graphics is often delimited by science
MATCH : Computer Graphics is defined as science
Matches found in other.txt
MATCH : Computer Graphics are described as pictures

When I run strict mode, I get :

Matches found in abc.txt
MATCH : computer graphics foo token blah blah
Matches found in xyz.txt
MATCH : computer graphics is defined by nonsense
MATCH : Computer Graphics is defined as science
Matches found in other.txt
MATCH : Computer Graphics are described as pictures

and random mode gives me :

Matches found in abc.txt
MATCH : Computer Graphics defined t t t is t as
MATCH : computer graphics foo token blah blah
Matches found in xyz.txt
MATCH : computer graphics buzz ping token blah blah
MATCH : computer graphics is defined by nonsense
MATCH : computer Graphics is often delimited by science
MATCH : Computer Graphics is defined as science
Matches found in other.txt
MATCH : Computer Graphics are described as pictures

ozymandias

Are you sure you are using the right arguments :

normal = java DefinitionChecker computer graphics

strict = java DefinitionChecker computer graphics -s

random = java DefinitionChecker computer graphics -r
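A rough sketch of how arguments like these could be split into search terms plus a mode. This is not the code from DefinitionChecker.java, which is not shown in this thread; the class name ArgsSketch, the Mode enum and the parse method are made up for illustration, and only the -s and -r switches come from the commands above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical argument handling; the real DefinitionChecker.main() is not shown in this thread.
class ArgsSketch {
    enum Mode { NORMAL, STRICT, RANDOM }

    final List<String> terms = new ArrayList<>();
    Mode mode = Mode.NORMAL;

    // Treats -s and -r as mode switches and every other argument as a search term.
    static ArgsSketch parse(String[] args) {
        ArgsSketch parsed = new ArgsSketch();
        for (String arg : args) {
            if (arg.equals("-s")) {
                parsed.mode = Mode.STRICT;
            } else if (arg.equals("-r")) {
                parsed.mode = Mode.RANDOM;
            } else {
                parsed.terms.add(arg);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        ArgsSketch parsed = parse(new String[] {"computer", "graphics", "-s"});
        System.out.println(parsed.terms + " " + parsed.mode);
    }
}
```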

cancer_66

ASKER

check ur email ive sent you some test files.

cancer_66

ASKER

1)note ive tried Both "DefinitionChecker.zip"
and got the above errors

+ ive replaced the "WordPattern" which is in the "DefinitionChecker.zip" with the new "WordPattern.java" you have sent in a separate email.

still received the above error?

2)yes iam using the right arguments

java DefinitionChecker computer graphics (seq mode)
java DefinitionChecker computer graphics -s (strict)
java DefinitionChecker computer graphics -r (random)

ozymandias

OK. I am now using the same files you sent. I am going to post my output and I will number the lines. Please tell me which lines of output you think are wrong and why.

1 >java DefinitionChecker computer graphics
2 Matches found in a.txt
3 MATCH : art or designs which are created is defined as computer graphics (printed in both)
4 MATCH : computer Graphics is purely delimited by science (printed in seq)
5 MATCH : science is described as computer graphics (printed both)
6 Matches found in b.txt
7 MATCH : computer graphics is purely defined as a field in cs (printed in seq)
8 MATCH : cs is purely described as a field in computer graphics (printed in seq)
9 MATCH : computer graphics is delimited by cs (printed in both)
10 MATCH : cs is delimited by computer graphics (printed in both)
11 MATCH : computer graphics is X defined RR as cs (printed in seq)
12 MATCH : computer Graphics XXX was GGGG defined nnnnnn as cs (printed in seq)
13 Matches found in c.txt
14 MATCH : computer Graphics is purely delimited by science (printed in seq)
15 MATCH : computer is defined as science (not printed "computer graphics")
16 MATCH : science is purely described as computer graphics (printed in seq)
17
18 >java DefinitionChecker computer graphics -s
19 Matches found in a.txt
20 MATCH : science is described as computer graphics (printed both)
21 Matches found in b.txt
22 MATCH : computer graphics is delimited by cs (printed in both)
23 MATCH : cs is delimited by computer graphics (printed in both)
24 Matches found in c.txt
25 MATCH : computer is defined as science (not printed "computer graphics")
26
27 >java DefinitionChecker computer graphics -r
28 Matches found in a.txt
29 MATCH : art or designs which are created is defined as computer graphics (printed in both)
30 MATCH : computer Graphics is purely delimited by science (printed in seq)
31 MATCH : science is described as computer graphics (printed both)
32 Matches found in b.txt
33 MATCH : computer graphics is purely defined as a field in cs (printed in seq)
34 MATCH : cs is purely described as a field in computer graphics (printed in seq)
35 MATCH : computer graphics defined t t as t is cs (not printed in seq and s.seq)
36 MATCH : computer graphics is delimited by cs (printed in both)
37 MATCH : cs is delimited by computer graphics (printed in both)
38 MATCH : computer graphics is X defined RR as cs (printed in seq)
39 MATCH : computer Graphics XXX was GGGG defined nnnnnn as cs (printed in seq)
40 Matches found in c.txt
41 MATCH : computer Graphics is purely delimited by science (printed in seq)
42 MATCH : computer is defined as science (not printed "computer graphics")
43 MATCH : science is purely described as computer graphics (printed in seq)

ozymandias

Line 25 should not normally be printed but it is because it contains the words "computer graphics" inside the brackets e.g. :

(not printed "computer graphics")

cancer_66

ASKER

1)hmmm this is weird i deleted the all the files and extracted "DefinitionChecker.zip" again from scratch in a folder called "Expert Exchange" now it worked properly???

2)i described the way i have added new patterns above. and asked if it is the correct way ?

3)please test the program with the test files ive sent on the email. just so i could feel comfortable plz

thanks

cancer_66

ASKER

ok it looks fine to me..i dont know what went wrong. really i am surprised myself, for a second i got very worried;z

ozymandias

1) Bizarre. I can't explain that, but I'm glad it's working now.

2) I think so. Did you read my answer in my first comment above ?

3) I have tested the program with the files. My output is above.

cancer_66

ASKER

ill try testing it more . and let you know,
if there is any problems. but that problem i faced worries me,

thanks

cancer_66

ASKER

1) i did the following in order to add the pattern "is the"

private static WordList list6 = new WordList("is",",");
private static WordList list7 = new WordList("the",",");

// create another WordPattern
WordPattern pattern3 = new WordPattern();
// add the appropriate WordLists
pattern2.addList(list6);
pattern2.addList(list7);
// add the WordPattern to the vector
patterns.add(pattern3);

i also added two sentences
a)mobile agent is the future of XYZ

it was not matched?am i doing something wrong?

cancer_66

ASKER

i even corrected the mistake which is above "pattern2.addList(list6)" to pattern3.addList(list6)

still didnt match ?

cancer_66

ASKER

2)following sentence is not printed in strict mode

art or designs which are created is defined as computer graphics (printed in both) # ?

sorry for the trouble

cancer_66

ASKER

2)notice for the question above when i modified the sentence to be

art or designs which created is defined as computer graphics (printed in both) #

(removed the "are") it was matched !

cancer_66

ASKER

2)in the test which you have done Line 3 should have been also matched with strict sequential.

notice "is defined as"

please look into this problem:(

cancer_66

ASKER

3)ive added the following sentence

Expert Exchange is surrounded by XYZ #

and added the pattern "is surrounded by"

private static WordList list6 = new WordList("are,is",",");
private static WordList list7 = new WordList("surrounded,arranged",",");
private static WordList list8 = new WordList("in,by",",");

sentence didnt match?

but however this sentence matched
intelligent agents, are sometimes, defined as mobile agents #

ozymandias

2) there is a problem with this sentence :

art or designs which are created is defined as computer graphics

it has both "are" and "is" which are both in List1. This creates a problem because the first token found starts the strict sequential search and then "created" breaks it.

I will have to have a think about this.

3) I have added the following :

private static WordList list6 = new WordList("were,is,are",",");
private static WordList list7 = new WordList("surrounded,encompassed,arranged",",");
private static WordList list8 = new WordList("in,by",",");

and

// create another WordPattern
WordPattern pattern3 = new WordPattern();
// add the appropriate WordLists
pattern3.addList(list6);
pattern3.addList(list7);
pattern3.addList(list8);
// add the WordPattern to the vector
patterns.add(pattern3);

when I run :

>java DefinitionChecker expert exchange

I get :

Matches found in a.txt
MATCH : [never] Expert Exchange is surrounded by XYZ

ozymandias

The code works for me, except the problem detailed in point 2 above, which I will try to look into.

cancer_66

ASKER

hi. hmm ok. i will test the program more and see..

what do u mean

[never] Expert Exchange is surrounded by XYZ

why the "never"?

cancer_66

ASKER

1)i still havent been successful so far in adding new patterns ? it doesnt work!

i added "is surrounded by"

and had a sentence

mobile agents is surrounded by xxx (no match)

cancer_66

ASKER

forget the last message. it worked. my mistake.

ill try adding the pattern "is the"

and test it. please look into the problem.

thanks

cancer_66

ASKER

pattern "is the" worked fine.

i think the only problem is sentence

"art or designs which are created is defined as computer graphics"

however we should try and make it work with all sentences since ill be randomly taking definitions from the internet and test it.

thanks alot

cancer_66

ASKER

please answer me whenever u r free

ozymandias

The last version of the program I emailed you works with :
"art or designs which are created is defined as computer graphics", just fine. I fixed that problem.

I put [never] in front of any sentence that did not contain the words "computer graphics" or "mobile agents" (since those were the terms we were testing) or did not have a pattern like "is defined by" at all.

ozymandias

This is my current sample output :

1 >java DefinitionChecker computer graphics
2 Matches found in a.txt
3 MATCH : [-r -n -s] art or designs which are created is defined as computer graphics
4 MATCH : [-r -n] computer Graphics is purely delimited by science
5 MATCH : [-r -n -s] science is described as computer graphics
6 Matches found in b.txt
7 MATCH : [-r -n] computer graphics is purely defined as a field in cs
8 MATCH : [-r -n] cs is purely described as a field in computer graphics
9 MATCH : [-r -n -s] computer graphics is delimited by cs
10 MATCH : [-r -n -s] cs is delimited by computer graphics
11 MATCH : [-r -n] computer graphics is X defined RR as cs
12 MATCH : [-r -n] computer Graphics XXX was GGGG defined nnnnnn as cs
13 Matches found in c.txt
14 MATCH : [-r -n] computer Graphics is purely delimited by science (printed in seq)
15 MATCH : [-r -n] science is purely described as computer graphics (printed in seq)
16
17 >java DefinitionChecker computer graphics -s
18 Matches found in a.txt
19 MATCH : [-r -n -s] art or designs which are created is defined as computer graphics
20 MATCH : [-r -n -s] science is described as computer graphics
21 Matches found in b.txt
22 MATCH : [-r -n -s] computer graphics is delimited by cs
23 MATCH : [-r -n -s] cs is delimited by computer graphics

Note that the problem sentence on lines 3 and 19 appears correctly.

cancer_66

ASKER

thanks alot ozymandias . ill just test it right aways, been busy writing the Interim report for my project.

anyways ill just do that in a short while.

thanks 4 your help.

cancer_66

ASKER

can you please explain, what was the problem ?
and very briefly how you have fixed it ?please

ozymandias

OK. It's a bit hard to explain in writing though.

Imagine that we have a list (array) of words that are based on the sentence "art or designs which are created is defined as computer graphics", so it looks like this :

art
or
designs
which
are
created
is
defined
as
computer
graphics

We are going through this list checking each word against the WordLists in our WordPattern. We are in strict mode, so the matches must take place in consecutive words. When a word matches a list we record that fact and move on to the next list and keep matching the words.

The problem is that the 5th word "are" in the list above matches the first list, so by the time we get to "is", which is part of the real pattern, we have already skipped past the first list. This means that only "defined" and "as" are found in sequence.

I fixed this by adding a rule into the loop that checks to see if it is working in strict mode whenever a sequence is broken. If it is, it skips back to the first list and starts checking from there.
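The effect of that rule can be sketched as follows. This is not the WordPattern.java that was mailed out; it swaps in a simpler approach that tries the pattern at every starting word, which has the same effect as going back to the first list after a break. The class and method names are invented, each word list is represented as a plain Set<String>, and Set.of needs Java 9 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical strict-mode check: every list must match consecutive words.
class StrictMatchSketch {

    // Returns true if some run of consecutive words matches lists[0], lists[1], ... in order.
    static boolean strictMatch(String[] words, List<Set<String>> lists) {
        for (int start = 0; start + lists.size() <= words.length; start++) {
            int matched = 0;
            while (matched < lists.size()
                    && lists.get(matched).contains(words[start + matched].toLowerCase())) {
                matched++;
            }
            if (matched == lists.size()) {
                return true; // found an unbroken run
            }
            // Otherwise the run was broken: start again from the first list at the next word.
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> pattern = Arrays.asList(
                Set.of("is", "are"),
                Set.of("defined", "described", "delimited"),
                Set.of("as", "by"));
        String sentence = "art or designs which are created is defined as computer graphics";
        System.out.println(strictMatch(sentence.split("\\s+"), pattern)); // prints true
    }
}
```

Here the early "are" starts a run that "created" breaks, and the later attempt starting at "is" finds "is defined as" in three consecutive words.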

cancer_66

ASKER

cool. i guess i did understand something not 100% though, anyways ill try testing with new texts.

1)would it be easy to map the program to a User Interface? ive got the user interface code ready.

2)in terms of Algorithm "random,sequential,strict seq" i need some sort of pseudo code. if possible.

ozymandias

1) Yes. The interface code would probably just replace the DefinitionChecker code.

2)

loop through the words and the lists
    look for each word in each list

    if the word is found
        record the match and whether or not it was found in strict sequence
        if we are in random mode
            start from the first word again
        otherwise
            move on to the next word and the next list

    if the word was not found
        move on to the next word
        if we are in strict mode
            go back to the first list

Once we have checked all the words look at the information we have recorded from the matching process

if the number of matches = the number of lists
    then at least one word from each list was matched so random match = true or normal match = true

if the number of sequential matches = the number of lists
    then at least one word from each list was found in strict order so strict match = true
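As a rough illustration, the normal (sequential) and random checks could look something like this in Java. It is a sketch under assumptions, not the emailed code: the class and method names are invented, word lists are plain Set<String> values, the search-term check is left out, and strict mode is not covered here. Set.of needs Java 9 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of two of the match modes described in the pseudocode.
class MatchModesSketch {

    // Normal (sequential) mode: lists must match in order, other words may sit in between.
    static boolean sequentialMatch(String[] words, List<Set<String>> lists) {
        int next = 0; // index of the list we are currently looking for
        for (String word : words) {
            if (next < lists.size() && lists.get(next).contains(word.toLowerCase())) {
                next++; // this list is satisfied, move on to the next list
            }
        }
        return next == lists.size();
    }

    // Random mode: every list must match some word, in any order.
    static boolean randomMatch(String[] words, List<Set<String>> lists) {
        for (Set<String> list : lists) {
            boolean found = false;
            for (String word : words) { // start from the first word again for each list
                if (list.contains(word.toLowerCase())) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Set<String>> pattern = Arrays.asList(
                Set.of("is", "are"), Set.of("defined", "described"), Set.of("as", "by"));
        String[] inOrder = "computer graphics is sometimes defined as a field in cs".split("\\s+");
        String[] shuffled = "computer graphics defined tt as tt is a field in cs".split("\\s+");
        System.out.println(sequentialMatch(inOrder, pattern) + " " + randomMatch(inOrder, pattern));
        System.out.println(sequentialMatch(shuffled, pattern) + " " + randomMatch(shuffled, pattern));
    }
}
```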

cancer_66

ASKER

1)ok
2)when you said "loop through the words and the lists"

i know what the list contains "is defined..etc"

words?? you mean the texts ? user input?

cancer_66

ASKER

2)whenever you are free can you just give me a bit more detail on the pseudo code. take your time.

please

ozymandias

2) No. The words are the sentences found in the files.

I have modified your UI code and I am emailing you a new version of the UI and a new version of DefinitionChecker that works with the UI.

cancer_66

ASKER

ok thanks alots . ill just test it right away,

cancer_66

ASKER

2) so u mean whenever we meet a "#" while reading the file we take the whole sentence and put it in a array. and then start comparing it with the Lists(patterns)

ozymandias

2) Yes, sort of. Actually we find a sentence and add it to an array. Then we break each sentence up (tokenize it) into another array so we can compare it word by word.
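A minimal sketch of those two steps, assuming the "#" marker from the test files ends each sentence and that words are separated by whitespace. The class and method names are made up for illustration and are not taken from the emailed code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split file text into sentences on '#', then each sentence into words.
class SentenceSplitSketch {

    // Splits the text on the '#' marker and drops empty pieces.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        for (String piece : text.split("#")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Breaks one sentence into individual words on whitespace.
    static String[] words(String sentence) {
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Expert Exchange is surrounded by XYZ #\n"
                + "mobile agent is the future of XYZ #";
        for (String sentence : sentences(text)) {
            System.out.println(sentence + " -> " + words(sentence).length + " words");
        }
    }
}
```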

cancer_66

ASKER

2)ok thanks now i get the picture. i wasnt at my seat. just came back. ill check my mail.

thanks

cancer_66

ASKER

1)ok i check the email. and iam testing the UI now. god this makes life so easier for testing as well:)

cancer_66

ASKER

1)iam a bit confused WordList,WordPattern
wordList contains the sentences from the file?
wordpattern holds the combinations of patterns?

sorry for this really.

cancer_66

ASKER

2) a)with the user interface when i enter "Computer" it prints? (recall we are searching for Terms)

ozymandias

1) No.

A WordList contains the tokens like :

is
are
be

or

defined
described
delimited

both of the above would be a WordList.

A WordPattern contains a set of WordLists, a bit like :

is defined by
are described as
be delimited

A PatternMatcher then contains a set of WordPatterns and can check sets of sentences for those patterns.
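To make the relationship concrete, here is a small sketch that lists every phrase one pattern covers: one word from each list, taken in list order. It is only an illustration; PatternSketch and phrases are invented names, a word list is represented as a plain List<String>, and the real WordPattern class may be organised quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a pattern is an ordered list of word lists.
class PatternSketch {
    private final List<List<String>> lists = new ArrayList<>();

    // Mirrors the addList calls in the thread; here a word list is just a List<String>.
    void addList(List<String> wordList) {
        lists.add(wordList);
    }

    // Builds every phrase the pattern covers: one word from each list, in list order.
    List<String> phrases() {
        List<String> result = new ArrayList<>();
        result.add("");
        for (List<String> wordList : lists) {
            List<String> extended = new ArrayList<>();
            for (String prefix : result) {
                for (String word : wordList) {
                    extended.add(prefix.isEmpty() ? word : prefix + " " + word);
                }
            }
            result = extended;
        }
        return result;
    }

    public static void main(String[] args) {
        PatternSketch pattern = new PatternSketch();
        pattern.addList(Arrays.asList("is", "are"));
        pattern.addList(Arrays.asList("surrounded", "arranged"));
        pattern.addList(Arrays.asList("in", "by"));
        for (String phrase : pattern.phrases()) {
            System.out.println(phrase);
        }
    }
}
```

Three lists of two words each give eight phrases, including "is surrounded by" and "are arranged in", which is why one pattern can serve several wordings.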

ozymandias

>>2) a)with the user interface when i enter "Computer" it prints? (recall we are searching for Terms)

Yes, If you look at the code, I have not implemented any code to check the number or length of arguments passed by the UI, whereas when you use the command line the main() method does this checking.

cancer_66

ASKER

2)i.c but it would be the same as the one in the main() i mean it terms of code?its better to have the UI do the checking as well. please

ozymandias

Sort of, in the main() method you are checking an array of arguments including arguments for the match mode like -s or -r. In the UI you are checking a string. I can produce an equivalent though.

cancer_66

ASKER

2)yes please. that would be it for today, i am not feeling that well myself. thanks for all the help ozymandias i really appreciate it.

ozymandias

OK. I have just mailed you a new version of UI1.java which checks for valid search terms.

cancer_66

ASKER

3)ozymandias since ive got a presentation on saturday let assume i was asked about

a)how efficient the search algorithm is ?
b)complexity? (O notation)
c)how would i validate the system?

i would like to know how would you answer those question and what are the appropriate answers

cancer_66

ASKER

4)i replaced the UI you mailed me with the previous one i got the following error?

symbol : constructor DefinitionChecker (java.lang.String,int,boolean)
location: class DefinitionChecker
DefinitionChecker dc = new DefinitionChecker(search.getText(),mode,false);

cancer_66

ASKER

answer me whenever you can. take your time. ill wait

cancer_66

ASKER

please answer me when u can. ill be waiting

cancer_66

ASKER

5)found an error same problem in strict mode.

computer graphics is are defined as create by art or design #( not matched)?

Mobile agents was can be defined such as intelligent agents # (not matched)

6)add the pattern "can be defined such as"

test if it would match : Mobile agent can be defined such as XYZ

didnt match with me?

cancer_66

ASKER

6)remove the word "be" from the list1 and add "can be" + remove "as" from list3 and add "such as"

doesnt match with all 3 modes ! please look into this

ozymandias

3)

a) how efficient is the search algorithm ?

Compared to what ?

b) How complex ?

It's relatively simple. It doesn't do fuzzy matching, or word stemming or any of the other clever stuff that most "search engines" do, and it can't really handle punctuation. It just matches words and groups of words.

c)How to validate ?

Tricky, since I'm not sure of what exactly it is supposed to achieve. The test files you have set up validate that it finds what it is supposed to find and doesn't find things that don't match. What else could you do ?

4) Yes, that's because I changed the constructor of DefinitionChecker to take an extra argument so that it would know whether to print out the results to the console when being used on the command line or return a result set when being used by the GUI.

5) problems in strict mode :

computer graphics is are defined as create by art or design #( not matched)?

OK. I will have a look and see why this sentence is not matched in strict mode.

Mobile agents was can be defined such as intelligent agents # (not matched)

The above sentence will not match in strict mode because it has the word "such" between "defined" and "as", so it is not strict.

6) You cannot add two words to a WordList as one word. You are not allowed "can be" or "such as". Actually, you can add them if you like but they will never match. This is because (as you asked) the sentence is tokenised into individual words and compared word-by-word, so nothing will ever match with "can be" because "can be" in a sentence will always be broken up into "can" and "be". If you want to look for "can be defined as" you must create a 4 word WordPattern.

ozymandias

5) OK I have fixed that problem too. I will mail you a new copy of WordPattern.java, you will need to recompile.

cancer_66

ASKER

4)hmm how to make it work? please look into it.
5)ok thanks
6)yeah i guessed that would be the problem. ok lets say i created 4 lists and add list1= "can" ,list2=be list3=defined,list4=such,list5=as

would it match :Mobile agents was can be defined such as intelligent agents

cancer_66

ASKER

6)i just did the follows removed the word "be" from the list1 and add "can be" + removed "as" from list3 and add "such as"

supprisingly :Mobile agents was can be defined such as intelligent agents

matched in random and sequential but not strict?

ozymandias

4) Unless there is a good reason always use the latest version of any file I have sent you.

6) Yes. If you did that it would work. You don't need to create 4 lists though. You already have three of the lists you need.

You want to look for "can be defined as", but there is no point. With the current lists any sentence that has "can be defined as" will match because it has "be defined as" anyway.

I think that this is probably not a good idea in general though. As I have said before, the chances of words like "defined" and "described" being used outside the context of sentences like "can be defined as" or "is described by" are very remote, and even if they were, the number of occurrences would be no more than the number of lost occurrences due to grammatical or spelling errors in the documents being searched. My point is that adding pattern words like can, be, such, is, are, by and so on is pretty pointless. If the words "computer graphics" and "described" appear in the same sentence at all then that probably warrants a match 99.9999% of the time.

ozymandias

>>6)i just did the follows removed the word "be" from the list1 and add "can be" + removed "as" from list3 and add "such as"
>>
>>supprisingly :Mobile agents was can be defined such as intelligent agents
>>
>>matched in random and sequential but not strict?

You CANNOT do that !

Mobile agents was can be defined such as intelligent agents

The above sentence will NEVER match in strict mode because it has "such" in between "defined" and "as". You cannot add "such as" as a pattern. You would have to add "such" and "as" to separate lists or add them to the same list individually, in which case the sentence would match in strict mode because it contained "be defined such".

cancer_66

ASKER

4)thats what i did i used your latest UI file which check for valid user input by overwriting the old one but did not compile for the reason ive give you above?

cancer_66

ASKER

6)ok sorry my mistake. stupid question.

ozymandias

OH, OK. I need to send you both UI1.java and DefinitionChecker.java. Sorry, I thought I had.

I will mail them to you now.

cancer_66

ASKER

7)ok lets assume i had a pattern which consisted of 4 words. that means i should create 4 lists correct? the reason iam asking this is because the same code wont be just used for definition. i might use different patterns to find synonyms..etc

ozymandias

7) Yes. 4 WordLists added to 1 WordPattern in the correct sequence. BTW, you can reuse the WordLists, i.e. you can add them to more than one WordPattern or to the same WordPattern more than once.

cancer_66

ASKER

7)thanks. i got your point
8)for the sake of testing i just added list1="be" list2=defined list3=such

now: Mobile agents was can be defined such as intelligent agents

should match in strict seq as well since "be defined such" are not separated by intermediate tokens.

buts its not ?

cancer_66

ASKER

8)forget the last one. the mistake i made was not recompiling the UI. it worked.

cancer_66

ASKER

9)BTW how did you fix the problem in question5. hope its fixed for good. i thought you have fixed this problem?

ok i think ill call it a day. iam dead tired. i need a break. thanks for your help.

ill talk to you tomorrow. thanks alot

ozymandias

9) OK. The first fix was for when the strict match is triggered too soon, i.e. by a word from list1 appearing in the sentence before the real pattern. When I fixed that I did not allow for the fact that it might appear exactly 1 word before the real pattern, like this :

computer graphics is are defined as create by art or design #( not matched)?

In this instance it would never happen...you cannot write "is are" because it is grammatical nonsense, but it could happen in other circumstances so I allow for that eventuality now as well.

cancer_66

ASKER

hello there. i wont be at my seat for few hours. but ill be back with more questions:) thanks for everything

cancer_66

ASKER

hi, ok iam back for some time:)

1)how could we make this search algorithm more efficient?
2)what is a morphological analyser? can i use it ?

ozymandias

1) I don't know, there are probably lots of ways, but it depends on your definition of efficiency. Do you mean faster, or more accurate, or the best possible trade-off between speed and accuracy? For instance, I don't think that it is efficient looking for words like "is", "are", "can", "be", "by" etc.

2) I think it is an analyser (or in this case a search tool) that can find inexact but highly likely matches. Common examples would be a "fuzzy logic" kind of word matching that would find obvious misspellings of words like "cimputer", or word stemming, where if you ask for computer graphics it will find variations of those words like compute, computational, computing, computed, computes and graphic, graphical, graphically etc. So, "computer graphics" would match "graphical computing" but it would probably be "ranked" low down the list of matches.
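One common building block for spotting misspellings like "cimputer" is Levenshtein edit distance. The sketch below is only an illustration of that one technique, chosen here as an example; nothing in this thread says the program uses it, the class and method names are invented, and it does not do word stemming.

```java
// Hypothetical sketch of one fuzzy-matching building block: Levenshtein edit distance.
class EditDistanceSketch {

    // Number of single-character insertions, deletions or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // Treats two words as a fuzzy match if they differ by at most maxEdits edits.
    static boolean fuzzyEquals(String a, String b, int maxEdits) {
        return distance(a.toLowerCase(), b.toLowerCase()) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(distance("cimputer", "computer"));       // prints 1
        System.out.println(fuzzyEquals("cimputer", "computer", 1)); // prints true
    }
}
```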

cancer_66

ASKER

1)so what would you suggest ? ill have to talk to my supervisor regarding this. well what i mean by efficient is the "trade-off between speed and accuracy"

2)yeah the supervisor did talk briefly about this.

>>computer graphics" would match "graphical computing" but it would probably be "ranked" low down the list of matches.

a) is this difficult to achieve?

ozymandias

You could use morphological analysis, but you would either have to write your own routines to do fuzzy matching or word stemming (big job!) or get someone else's library code that you could use (probably expensive to buy a good one).

This is a very big field in computer science and there are a lot of ideas about how to do it. Search engines and data mining tools are BIG business.

There are a number of strands in word search and word matching.

Shannon's information theory, for instance, implies that the less often a word is used (i.e. the rarer it is) the more significant it is, because its rarity generally attests to the uniqueness of its meaning or interpretation. Words that get used a lot like is, this, it, are, be etc are used so frequently and in so many contexts that searching for them is a) inefficient and b) meaningless, because how can we ever be sure what meaning to attach to them.

Bayesian inference represents another common set of ideas. Bayes' theorem implies that the outcome of any particular search (for instance) could be better predicted with prior knowledge of the results of searches that have gone before.

For instance, if someone were to type the word "stocks" into a search engine they would get a lot of very mixed results. However, if the search engine knew that their previous searches had been "market values", "bonds" and "share trading" the results could be narrowed down considerably. Similarly if their previous searches had been "recipes", "soups" and "bouillon" then you would get a completely different set of results, and a different one again for "medieval" and "punishment". Basically we have added extra meaning or context to the word "stocks" from an awareness of previous searches or fields of interest.

cancer_66

ASKER

1)ok ill speak to my supervisor regarding the morphological analyser. lets assume he will provide me with the libraries etc. is using it difficult?

a)ill send you one file which he has given me, its a tokenizer + it has some rules. please check it and see if it could be of some help

2)i.c again ill have to discuss with the supervisor. i think the two most important keywords are "User Input" + "MainMarker" in the search algorithm i am using. isnt it?

cancer_66

ASKER

3) Aren't there any morphological libraries which I can use?

ozymandias

1) I'm not sure what to suggest in terms of efficiency. This is not really my area of expertise. I suppose the first questions to ask would be :

i) How is the current program inefficient ?
ii) What can be done to improve it ?

2)
a) Ranking is not simple, but it's not that complex either. Basically you would assign a score to the words as they were matched. An exact match in the correct sequence would score 3 points, a fuzzy match in the correct sequence or an exact match in an incorrect sequence would score 2 points, and a fuzzy match in an incorrect sequence would score 1 point.

Let's say you are looking for "computer graphics".

"computer graphics" would score 6 points
"computed graphics" would score 5 points
"computed graphically" would score 4 points
"graphically computed" would score 2 points

You could apply the same rules to the patterns too.

If the sentence contained "is defined by" (strict match) it would score 4 points.
If it contained "is often defined by" (sequential match) it would score 3 points.
If it contained "defined is by" (random match) it would score 2 points.
If it just contained "defined" (word match) it would score 1 point.

You could then add up the total score for each matched sentence and display the matched sentences in descending order of score, i.e. a ranking system.
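
The 3/2/1 per-word rule for the search term can be sketched in a few lines of Java. This is only an illustration under two assumptions that are not in the thread: a "fuzzy match" is approximated by two words sharing their first six letters (a crude stand-in for real stemming), and "correct sequence" means the word sits at the same position as in the search term. The class and method names are hypothetical.

```java
// Hypothetical sketch of the 3/2/1 per-word scoring described above.
final class RankSketch {

    // Assumption: a crude stand-in for stemming. Two words "fuzzy match"
    // when they share their first six letters.
    static boolean fuzzy(String a, String b) {
        int n = 6;
        return a.length() >= n && b.length() >= n
                && a.substring(0, n).equals(b.substring(0, n));
    }

    // Scores a candidate phrase against the search term, word by word.
    static int score(String term, String candidate) {
        String[] t = term.trim().toLowerCase().split("\\s+");
        String[] c = candidate.trim().toLowerCase().split("\\s+");
        int total = 0;
        for (int i = 0; i < c.length; i++) {
            int best = 0;
            for (int j = 0; j < t.length; j++) {
                boolean exact = c[i].equals(t[j]);
                if (!exact && !fuzzy(c[i], t[j])) {
                    continue; // this search word does not match at all
                }
                // Assumption: "in sequence" means same position in both phrases.
                boolean inSequence = (i == j);
                int points = exact ? (inSequence ? 3 : 2) : (inSequence ? 2 : 1);
                best = Math.max(best, points);
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        String term = "computer graphics";
        System.out.println(score(term, "computer graphics"));    // 6
        System.out.println(score(term, "computed graphics"));    // 5
        System.out.println(score(term, "computed graphically")); // 4
        System.out.println(score(term, "graphically computed")); // 2
    }
}
```

Sorting the matched sentences by this number in descending order would give the ranked list.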

ozymandias

3) Yes, almost certainly. Unfortunately I do not know of any, and you would have to find one that could be used from within your Java program for little or no cost (I assume).

cancer_66

ASKER

2) I think if I could do that the supervisor would be quite impressed, since ranking the sentences and printing them according to their score can be considered a way to validate the results, can't it?

3) I will definitely talk to the supervisor tomorrow regarding any morphological libraries he could provide. However, could you also try to find one which is appropriate to the program, please?

cancer_66

ASKER

4) For the time being, can the ranking system be done without the morphological libraries?

cancer_66

ASKER

4)by applying the rules to the patterns ?

ozymandias

2) Ranking would not constitute validation. The validation would have to be by some mechanism external to the program. e.g. some "known good" set of results to which the program's results could be compared.

4) Yes. It could be done, but it is a pretty big job. We would be changing the way the whole program works. Currently we have no "fuzzy match" capability, and we do not compare search terms word by word; we use the whole term. Currently the user specifies the matchMode, but to rank we would have to take each sentence and try a strict match, then a sequential match if that failed, and then a random match if that failed too, in order to calculate the sentence's score.
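
The strict, then sequential, then random fallback could look roughly like the sketch below. It is not the program's actual matching code: the class, the method and the constant names are hypothetical, the pattern is a plain fixed phrase rather than a set of WordLists, and no stemming is done.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: try the strictest match first and fall back step by step.
final class CascadeSketch {

    static final int STRICT = 3, SEQUENTIAL = 2, RANDOM = 1, NONE = 0;

    static int bestMode(String sentence, String pattern) {
        List<String> s = Arrays.asList(sentence.trim().toLowerCase().split("\\s+"));
        List<String> p = Arrays.asList(pattern.trim().toLowerCase().split("\\s+"));

        // strict: the pattern words appear next to each other, in order
        if (Collections.indexOfSubList(s, p) >= 0) {
            return STRICT;
        }
        // sequential: the pattern words appear in order, other words may intervene
        int next = 0;
        for (String word : s) {
            if (next < p.size() && word.equals(p.get(next))) {
                next++;
            }
        }
        if (next == p.size()) {
            return SEQUENTIAL;
        }
        // random: every pattern word appears somewhere, in any order
        if (s.containsAll(p)) {
            return RANDOM;
        }
        return NONE;
    }

    public static void main(String[] args) {
        System.out.println(bestMode("graphics is defined by art", "is defined by"));       // 3
        System.out.println(bestMode("graphics is often defined by art", "is defined by")); // 2
        System.out.println(bestMode("graphics defined is by art", "is defined by"));       // 1
        System.out.println(bestMode("graphics defined art", "is defined by"));             // 0
    }
}
```

The value returned doubles as the bonus for the match type, so the user-selected mode could still act as a minimum level below which a sentence is rejected.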

cancer_66

ASKER

2)ok
4) If it could be done, it would be a plus point for my project, since in my specification I've stated that a scoring system would be done if time permits. Please.

cancer_66

ASKER

4)however the user should still be able to choose between the 3 modes.

cancer_66

ASKER

5)private static WordList list9 = new WordList("can",",");
private static WordList list10 = new WordList("be",",");
private static WordList list11 = new WordList("defined",",");
private static WordList list12 = new WordList("such",",");
private static WordList list13 = new WordList("as",",");

// create another WordPattern
WordPattern pattern4 = new WordPattern();
// add the appropriate WordLists
pattern3.addList(list9);
pattern3.addList(list10);
pattern3.addList(list11);
pattern3.addList(list12);
pattern3.addList(list13);
// add the WordPattern to the vector
patterns.add(pattern4);

got the following results:

Matches found in a.txt

2.[-r -n -s] art or designs was are created is defined as computer graphics
2.1[-r -n -s] computer graphics is was can be are defined as create by art or design
3.[never] computer Graphics is as a pictorial computer output produced on a display screen, plotter, or printer
4.[-r -n] computer Graphics is purely delimited by science
6.[-r] computer graphics t t defined t is t as mohammed
8.[-r -n -s] science is described as computer graphics
12.[never] computer graphics is the xydsdgjhgs

Note: 3. should not be matched since it has no main marker, i.e. defined, described etc.

cancer_66

ASKER

5)try searching for "Mobile Agents"

13.[never] Mobile agents consists of exectution environment ,,etc (printed) even though I don't have the pattern "consist of".

When I remove what I added up there it works properly.

ozymandias

// create another WordPattern
WordPattern pattern4 = new WordPattern();
// add the appropriate WordLists
pattern3.addList(list9);
pattern3.addList(list10);
pattern3.addList(list11);
pattern3.addList(list12);
pattern3.addList(list13);
// add the WordPattern to the vector
patterns.add(pattern4);

The above code is wrong.
You create pattern4, then add the lists to pattern3 and then add pattern4 to the patterns vector, so pattern4 goes in with no lists at all and pattern3 picks up five extra ones.

cancer_66

ASKER

5)it should be this way?
// create another WordPattern
WordPattern pattern4 = new WordPattern();
// add the appropriate WordLists
pattern4.addList(list9);
pattern4.addList(list10);
pattern4.addList(list11);
pattern4.addList(list12);
pattern4.addList(list13);
// add the WordPattern to the vector
patterns.add(pattern4);

cancer_66

ASKER

5)i did the above and it worked!thanks

cancer_66

ASKER

6) Would you help me with question 4? I really appreciate all your help.

ozymandias

I am working on question 4. I have a Java implementation of the Porter Stemming Algorithm which I will try to integrate into the code.

Once I have done that I will work on adding a ranking mechanism.

It may take a bit of time though.
What is your deadline for this ?

cancer_66

ASKER

4) Take your time. It's not necessary to submit it now; I have enough time. Even if it is ready by Thursday or Friday, that's fine.

cancer_66

ASKER

7)iam going to the hospital mom not well. will be back soon.

thanks once again.

cancer_66

ASKER

8)hello iam back:) sorry 4 leaving like that

cancer_66

ASKER

iam reading about the Porter Stemming Algorithm. trying to understand what its all about

cancer_66

ASKER

9)tested the current program again. works fine. thank god. no problems. i read bit about the algorithm. got the overall picture.

cancer_66

ASKER

10)keep in mind iam using jdk1.3

cancer_66

ASKER

11)ill talk to you tomorrow. iam currently working on my interim report and presentation.

12)ill be waiting.

cancer_66

ASKER

thanks alot for your help

ozymandias

I have incorporated the stemming algorithm and now the results appear in a ranked list. I will mail you a complete copy of the new code and all the new files I am using.

ozymandias

I have mailed you an update of all the files which includes some bug fixes and a tidied-up UI.

cancer_66

ASKER

13)thanks alot ozymandias you been of great help. i really appricate that. ill check the program in a while.

cancer_66

ASKER

Hello there. I'll just check the mail and give you my remarks. Thanks.

cancer_66

ASKER

1) Can I still add new patterns the way I used to in the previous code?

cancer_66

ASKER

2) Can you please explain how the ranking and the stemmer work?

BTW, as I was searching I came across a Porter Stemmer class which could be used. Well, I think it's too late for that. Sorry.

cancer_66

ASKER

3) I'll be waiting for your answer. In order to test it, I need to understand how it works.

thanks.

ozymandias

1) Yes. Patterns can be added in exactly the same way.
2) The ranking is done as follows :

First we search for the search term e.g. "computer graphics".
We now search for the search term using the same technique as the patterns, i.e. we put its words in word lists with stemming turned on.
Search terms are always stemmed.
We always search for search terms in "strict sequence".

"computer graphics" is made up of two words.
For each word an exact match will score 2 points and a stemmed match will score 1 point.

So "computer graphics" will score 4, plus 3 for being a strict match. "computer graphic" will score 3, plus 3 for being a strict match.

Then we score the pattern. The patterns can contain stemmed words too and work in much the same way. They get a score for each word plus 3 for a strict match, 2 for a sequential match and 1 for a random match.

ozymandias

Finally, we add the key score and the pattern score together to give a full ranking value.

Yes, the Porter Stemming class you found is probably the same one I found. I used it (with a couple of small modifications).

cancer_66

ASKER

3) Hmm, I pretty much got the idea of how it works.

So if the exact term is found, i.e. computer graphics:

computer = 2 points
graphics = 2 points, total score = 4 points

Plus, for being a "strict match", they get an additional 3 points. Correct?

So the total is 7 points for the exact search term.

4) Columns: S1 = search term points, S2 = pattern points?

When I search for computer graphics the rank starts from "0" etc. Is it ranked in descending order?

5) Let's say I am searching in random mode. Now if the pattern "is defined as" is found, it is scored 7 points in total, correct? How is the distribution of points done?

6) If possible, can you give me an example of two sentences
and how their corresponding ranking is done?

Take your time. I'll wait. Thanks.

ozymandias

3) Yes. Correct.

4)Yes.

5) "is defined as" in a random match =

is = 2 points
defined = 2 points
as = 2 points
random match = 1 point
======================
total = 7

ozymandias

6)

Example 1
==========

"computer graphics are defined as pictures" matched in strict mode against "computer graphics".

computer : computer = 2 points
graphics : graphics = 2 points
strict match : computer graphics = 3 points

Key Score = 7

are : are = 2 points
defined : defined = 2 points
as : as = 2 points
strict match : are defined as = 3 points

Pattern Score = 9 points

Total Score = 16 points

Example 2
=========

"as science is often found to delimit computers graphic output" matched in random mode against "computer graphics".

computers : computer = 1 point
graphic : graphics = 1 point
strict match : computers graphic = 3 points

Key Score = 5 points

is : is = 2 points
delimit : delimited = 1 point
as : as = 2 points
random match : is delimit as = 1 point

Pattern Score = 6

Total Score = 11 points
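
The arithmetic in the two examples can be checked with a small Java sketch. It only reproduces the point rules stated in this thread (2 per exact word, 1 per stemmed word, plus 3, 2 or 1 for a strict, sequential or random match). The class and method names are hypothetical, and the matching itself is assumed to have been done already.

```java
// Hypothetical sketch: adds up points once we know how each word matched.
final class ScoreSketch {

    static final int STRICT_BONUS = 3, SEQUENTIAL_BONUS = 2, RANDOM_BONUS = 1;

    // exact[i] is true for an exact word match, false for a stemmed-only match.
    static int score(boolean[] exact, int modeBonus) {
        int total = modeBonus;
        for (int i = 0; i < exact.length; i++) {
            total += exact[i] ? 2 : 1;
        }
        return total;
    }

    public static void main(String[] args) {
        // Example 1: key and pattern both exact, both strict
        int key1 = score(new boolean[]{true, true}, STRICT_BONUS);           // 7
        int pattern1 = score(new boolean[]{true, true, true}, STRICT_BONUS); // 9
        System.out.println("Example 1 total = " + (key1 + pattern1));        // 16

        // Example 2: key words stemmed (strict), pattern matched in random order
        int key2 = score(new boolean[]{false, false}, STRICT_BONUS);         // 5
        int pattern2 = score(new boolean[]{true, false, true}, RANDOM_BONUS);// 6
        System.out.println("Example 2 total = " + (key2 + pattern2));        // 11
    }
}
```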

cancer_66

ASKER

6) Take your time.
7) OK, let's say a couple of sentences have the same score.

lets say
computer graphics is defined as
computer graphics was defined as
computer graphics should be defined as

ranking would start

0 sentence1
1 sentence2
2 sentence3

shouldnt it be

1 sentence1
1 sentence2
1 sentence3

----------------------------------------------------------
sorry for asking so many questions.

cancer_66

ASKER

ok now i understood how it works. ill start testing it.

ozymandias

No. The ranking will always be 1,2,3 etc.

If two sentences have the same score then they will be presented in the order in which they were found e.g. the one in a1.txt will come before the one in c1.txt.
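
Keeping tied sentences in the order they were found is what a stable sort does. Here is a minimal Java sketch of that idea, using made-up labels and scores; the class and method names are hypothetical and this is not the program's own sorting code. Java's Arrays.sort on an object array is documented as stable, so equal scores keep their original relative order.

```java
import java.util.Arrays;

// Hypothetical sketch: sort by descending score, ties stay in "found" order.
final class TieOrderSketch {

    static String[] rank(String[] labels, int[] scores) {
        Integer[] order = new Integer[labels.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i; // position in which the sentence was found
        }
        // stable sort on the score only, highest first
        Arrays.sort(order, (a, b) -> Integer.compare(scores[b], scores[a]));
        String[] ranked = new String[labels.length];
        for (int i = 0; i < order.length; i++) {
            ranked[i] = labels[order[i]];
        }
        return ranked;
    }

    public static void main(String[] args) {
        String[] labels = {"a1.txt first", "a1.txt second", "c1.txt", "b1.txt"};
        int[] scores = {16, 16, 11, 16};
        String[] ranked = rank(labels, scores);
        for (int i = 0; i < ranked.length; i++) {
            System.out.println((i + 1) + " " + ranked[i]);
        }
    }
}
```

The same holds for two sentences from the same file: the one read first comes out first.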

cancer_66

ASKER

7) Even if they were in the same file, i.e. a1.txt? Will they be presented in the order in which they were found?

cancer_66

ASKER

8) Hmm, you said the ranking will always be 1,2,3. I just tested it very quickly and it started 0,1,2...?

Shouldn't it start from 1,2,...?

Note: I downloaded the code where you have clearly stated that bugs were fixed etc.

cancer_66

ASKER

answer me whenever you can ill be waiting .

ozymandias

Yes, the numbering starts from 0. It can just as easily start from 1 if you want. When I said "1,2,3" I meant sequential as opposed to "1,1,1,2,2,2,3,3,3" or whatever.

If you want to change it so it starts from 1, then change line 194 of UI1.java so it reads :

v.add(Integer.toString(i+1));

cancer_66

ASKER

ok thanks. ill do that soon. iam currently working on my Interim presentation which is on sunday.

cancer_66

ASKER

thanks 4 everything

cancer_66

ASKER

I'll award you points and open another thread. Is that OK?
I still haven't tested it though.
Is there anything you would like?

cancer_66

ASKER

8)ill start testing it now :)

cancer_66

ASKER

9) I haven't really tested it 100%, but it seems to be working fine.

10) Now what's left is just to integrate it into the Aglets. Do you have any idea how that's done? That's my final stage.

cancer_66

ASKER

11)i added two sentences in a1.txt

2computer graphics interface are defined as pictures#
2computer graphics interface are defined as pictures#

now rank should be 1,1,..
but it was 1,2..?

cancer_66

ASKER

12) i tried adding the pattern "is the" by doing the following

private static WordList list4 = new WordList("is",",",false);
private static WordList list5 = new WordList("the",",",false);

// create a WordPattern
WordPattern pattern2 = new WordPattern();
// add the appropriate WordLists
pattern2.addList(list4);
pattern2.addList(list5);
// add the WordPattern to the vector
pm.addPattern(pattern2);

i only had one sentence which contained "is the"
"computer graphics is the art of blah blah"

a) This was the only result printed? It didn't match any other sentence?

cancer_66

ASKER

please answer whenever you are free. ill be gone to the hospital in a while.
take ur time

cancer_66

ASKER

13)ill award points here.
ive opened a new thread called "Search 3 for ozymandias "

please answer the questions there !thanks alot

ill be going to the hospital now!

cancer_66

ASKER

thanks 4 ur help

ozymandias

For PAQ value here is the complete code at this stage.

There are 7 files :

/UI1.java
/DefinitionChecker.java
/definitions/PatternMatcher.java
/definitions/WordPattern.java
/definitions/WordList.java
/definitions/Stemmer.java
/definitions/Sentence.java

ozymandias

import java.io.*;
import java.util.*;
import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import javax.swing.table.*;

import definitions.Sentence;

public class UI1 extends JFrame implements ActionListener{

      /*
      * UI Components
      */
      TextField search = new TextField(18);
      Label searchlab = new Label("Search for");
      Scrollbar bar = new Scrollbar();
      TextArea results = new TextArea("",15,40,10);
      JTable table;
      JScrollPane scroller;
      Vector columns = new Vector();
      Vector rows = new Vector();
      Button go = new Button("Go...");
      Button send = new Button("Send Clone");
      Button close = new Button("Close");
      Panel Resultpanel = new Panel();
      Panel Buttonpanel = new Panel();
      Panel Inputpanel = new Panel();
      Panel checkbox = new Panel();

      CheckboxGroup cbg1 = new CheckboxGroup();
      Checkbox ran = new Checkbox("Random",cbg1,false);
      Checkbox seq = new Checkbox("Normal",cbg1,true);
      Checkbox sseq = new Checkbox("Strict",cbg1,false);

      CheckboxGroup cbg2 = new CheckboxGroup();
      Checkbox full = new Checkbox("Full",cbg2,false);
      Checkbox fast = new Checkbox("Fast",cbg2,true);

      public UI1(){

            addWindowListener(new WindowAdapter(){
                  public void windowClosing(WindowEvent e){
                        dispose();
                        System.exit(0);
                  }
            });

            GridBagLayout gridbag = new GridBagLayout();
            GridBagConstraints c = new GridBagConstraints();
            Container content = this.getContentPane();
            content.setLayout(gridbag);

            c.insets = new Insets(3,3,3,3);

            c.fill = GridBagConstraints.NONE;
            c.gridx = 0;
            c.gridy = 0;
            c.gridwidth = 1;
            c.weightx = 0.0;
            c.anchor = GridBagConstraints.NORTHWEST;
            gridbag.setConstraints (searchlab, c);
            content.add(searchlab);

            c.fill = GridBagConstraints.HORIZONTAL;
            c.gridx = 1;
            c.gridy = 0;
            c.gridwidth = 1;
            c.weightx = 1.0;
            c.anchor = GridBagConstraints.NORTHEAST;
            gridbag.setConstraints(search, c);
            content.add(search);

            c.fill = GridBagConstraints.BOTH;
            c.gridx = 0;
            c.gridy = 1;
            c.gridwidth = 2;
            c.weightx = 1.0;
            c.weighty = 1.0;
            c.anchor = GridBagConstraints.CENTER;
            columns.add("Rank");
            columns.add("Text");
            columns.add("S1");
            columns.add("S2");
            columns.add("File");
            table = new JTable(rows,columns);
            scroller = new JScrollPane(table);
            gridbag.setConstraints(scroller, c);
            content.add(scroller);

            c.weighty = 0.0;
            c.fill = GridBagConstraints.NONE;
            c.gridx = 0;
            c.gridy = 2;
            c.gridwidth = 2;
            c.weightx = 1.0;
            c.anchor = GridBagConstraints.CENTER;
            checkbox.setLayout(new GridLayout(1,8));
            checkbox.add(new Label("Search : "));
            checkbox.add(ran);
            checkbox.add(seq);
            checkbox.add(sseq);
            checkbox.add(new Label(" "));
            checkbox.add(new Label("Scoring : "));
            checkbox.add(full);
            checkbox.add(fast);
            gridbag.setConstraints(checkbox, c);
            content.add(checkbox);

            c.fill = GridBagConstraints.NONE;
            c.gridx = 0;
            c.gridy = 3;
            c.gridwidth = 2;
            c.weightx = 1.0;
            c.anchor = GridBagConstraints.CENTER;

            Buttonpanel.setLayout(new GridLayout(1,5));
            Buttonpanel.add(go);
            Buttonpanel.add(new Label(" "));
            Buttonpanel.add(close);
            Buttonpanel.add(new Label(" "));
            Buttonpanel.add(send);
            gridbag.setConstraints(Buttonpanel, c);
            content.add(Buttonpanel);

            go.addActionListener(this);
            send.addActionListener(this);
            close.addActionListener(this);

            KeyListener kl = new KeyListener() {
                  public void keyPressed(KeyEvent e) {}

                  public void keyReleased(KeyEvent e) {
                        if (e.getKeyCode() == KeyEvent.VK_ENTER) {
                              System.out.println(search.getText());
                        }
                  }

                  public void keyTyped(KeyEvent e) {}
            };

            search.addKeyListener(kl);

            this.pack();
            // setSize/setBounds replace the deprecated resize/reshape calls
            this.setSize(this.getPreferredSize());
            this.setBounds(20,20,600,400);
            setColumnWidths();
      }

      public static void main(String args[]){

            UI1 agletFrame = new UI1();

            agletFrame.setTitle("Aglet Interface Example");
            agletFrame.setVisible(true);

      }

      private boolean validSearch(String s){

            // if the search term is less than 7 characters it can't be valid
            if (s.length() < 7){
                  return false;
            }
            // if the search term does not have a space it can't be valid
            if (s.indexOf(" ") == -1){
                  return false;
            }
            // if any of the search terms words are less than 3 characters
            // it can't be valid.
            StringTokenizer st = new StringTokenizer(s);
            while (st.hasMoreTokens()){
                  if (st.nextToken().length() < 3){
                        return false;
                  }
            }
            return true;
      }

      private void showMsg(String msg){
            JOptionPane.showMessageDialog(this,msg);
      }

      public void actionPerformed(ActionEvent event){

            if (event.getSource() == go){
                  results.setText("");
                  if (validSearch(search.getText())){
                        int mode = 2;
                        if (cbg1.getSelectedCheckbox() == ran){
                              mode = 1;
                        }else if(cbg1.getSelectedCheckbox() == seq){
                              mode = 2;
                        }else{
                              mode = 3;
                        }
                        boolean quick = true;
                        if (cbg2.getSelectedCheckbox() == full){
                              quick = false;
                        }
                        DefinitionChecker dc = new DefinitionChecker(search.getText(),mode,quick,false);
                        Sentence[] sentences = dc.getMatchedSentences();
                        rows = new Vector();
                        for (int i = 0; i < sentences.length; i++){
                              Vector v = new Vector();
                              v.add(Integer.toString(i+1));
                              v.add(sentences[i].getSentence());
                              v.add(Integer.toString(sentences[i].getKeyScore()));
                              v.add(Integer.toString(sentences[i].getPatternScore()));
                              v.add(sentences[i].getLocation().toString());
                              rows.add(v);
                        }
                        table.setModel(new DefaultTableModel(rows,columns));
                        setColumnWidths();
                  }else{
                        showMsg("You must provide a valid search term.\n\nA valid search term must have a minimum of two words\neach of which must have at least three characters.");
                  }
            }else if (event.getSource() == close){
                  System.exit(0);
            }else if(event.getSource()==send){

            }
      }

      private void setColumnWidths(){
            table.getColumnModel().getColumn(0).setPreferredWidth(15);
            table.getColumnModel().getColumn(1).setPreferredWidth(300);
            table.getColumnModel().getColumn(2).setPreferredWidth(15);
            table.getColumnModel().getColumn(3).setPreferredWidth(15);
            table.getColumnModel().getColumn(4).setPreferredWidth(100);
      }
}

ozymandias

/*
* DefinitionChecker.java
*
*/

import java.util.Vector;
import java.util.StringTokenizer;
import java.io.*;
import definitions.*;

public class DefinitionChecker{

      /*
      *
      * These are some static WordLists which can be used to create
      * the WordPatterns that this PatternMatcher will use
      *
      */
      private static WordList list1 = new WordList("is,was,are,be",",",false);
      private static WordList list2 = new WordList("described,defined,delimited",",",true);
      private static WordList list3 = new WordList("as,by",",",false);
      private static WordList list4 = new WordList("is",",",false);
      private static WordList list5 = new WordList("the",",",false);

      String keyword;
      String[] files = new String[]{"a1.txt","b1.txt","c1.txt","d1.txt"};
      Sentence[] sentences;
      Vector matches = new Vector();

      /**
      *
      * Constructor for the DefinitionChecker
      *
      */
      public DefinitionChecker(String s, int matchMode, boolean quick, boolean debug){

            // let's build our PatternMatcher
            PatternMatcher pm = new PatternMatcher();

            // create a WordPattern
            WordPattern pattern1 = new WordPattern();
            // add the appropriate WordLists
            pattern1.addList(list1);
            pattern1.addList(list2);
            pattern1.addList(list3);
            // add the WordPattern to the vector
            pm.addPattern(pattern1);

            // create a WordPattern
            WordPattern pattern2 = new WordPattern();
            // add the appropriate WordLists
            pattern2.addList(list4);
            pattern2.addList(list5);
            // add the WordPattern to the vector
            pm.addPattern(pattern2);

            // now let's build a PatternMatcher to hold our keyword pattern,
            // using one stemmed WordList per keyword.
            PatternMatcher km = new PatternMatcher();
            WordPattern keyPattern = new WordPattern();
            StringTokenizer st = new StringTokenizer(s);
            while (st.hasMoreTokens()){
                  keyPattern.addList(new WordList(st.nextToken()," ",true));
            }
            km.addPattern(keyPattern);

            // loop through each file in the list of files
            for (int f = 0; f < files.length; f++){
                  File file = null;
                  try{
                        // get all the sentences
                        file = new File(files[f]);
                        sentences = getSentencesFromFile(file);
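                  }catch(IOException ioe){
                        // skip a file we cannot read, otherwise we would
                        // re-score the previous file's sentences (or hit null)
                        System.out.println(ioe);
                        continue;
                  }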
                  keyword = s.toLowerCase();
                  // loop through all the sentences
                  for (int i = 0; i < sentences.length; i++){
                        // if any sentence contains the keyword and matches any of the patterns specified in the PatternMatcher
                        int keyScore = km.scoreSentence(sentences[i],WordPattern.STRICT_MATCH,false,false);
                        int patternScore = pm.scoreSentence(sentences[i],matchMode,quick,false);
                        if (keyScore > 0 && patternScore > 0){
                              sentences[i].setKeyScore(keyScore);
                              sentences[i].setPatternScore(patternScore);
                              // record the scored sentence as a match
                              matches.add(sentences[i]);
                        }
                        //System.out.println();
                  }
            }
            sortMatches();
            if (debug){
                  for (int m = 0; m < matches.size(); m++){
                        System.out.println("\t\tMATCH : " + matches.elementAt(m).toString());
                  }
            }

      }

      private void sortMatches(){
            Object[] o = matches.toArray();
            java.util.Arrays.sort(o);
            matches = new Vector();
            for (int i = 0; i < o.length; i++){
                  matches.add(o[i]);
            }
      }

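      /**
      * getMatches()
      *
      * Returns an array of strings which are all the matched sentences found by the DefinitionChecker.
      *
      */
      public String[] getMatches(){
            // matches holds Sentence objects, so copy out their text
            // rather than casting the Vector contents to String[]
            String[] m = new String[matches.size()];
            for (int i = 0; i < m.length; i++){
                  m[i] = ((Sentence)matches.elementAt(i)).getSentence();
            }
            return m;
      }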

      /**
      * getMatchedSentences()
      *
      * Returns an array of sentences which are all the matched sentences found by the DefinitionChecker.
      *
      */
      public Sentence[] getMatchedSentences(){
            Sentence[] s = new Sentence[matches.size()];
            s = (Sentence[])matches.toArray(s);
            return s;
      }

      /**
      *
      * getSentencesFromFile
      *
      * This function reads a specified file and breaks the contents into
      * an array of Sentences using the # character as a delimiter
      *
      */
      private Sentence[] getSentencesFromFile(File f) throws IOException{
            FileReader reader = new FileReader(f);
            Vector sentences = new Vector();
            char[] cbuf = new char[1];
            String delimiter = "#";
            String sentence = "";
            String c = "";
            // read the file character by character
            while (reader.read(cbuf) != -1){
                  c = new String(cbuf);
                  // if the character is a delimiter (#)
                  if (c.equals(delimiter)){
                        // add the sentence to the Vector and start a new blank sentence
                        Sentence s = new Sentence(sentence);
                        s.setLocation(f);
                        sentences.add(s);
                        sentence = "";
                  }else{
                        // otherwise just add the character to the current sentence string
                        sentence += c;
                  }
            }
            reader.close();
            Sentence[] sentenceArray = new Sentence[sentences.size()];
            // convert the Vector to an array and return it
            sentenceArray = (Sentence[])sentences.toArray(sentenceArray);
            return sentenceArray;
      }

      public static void main(String[] args){

            int matchMode = WordPattern.NORMAL_MATCH;

            String s = "";
            int numKeywords = 0;
            // first lets check what the arguments are
            for (int i = 0;i < args.length;i++){
                  //if any of them are -? then we print the usage message
                  if (args[i].equalsIgnoreCase("-?")){
                        printUsage("");
                        System.exit(1);
                  }
                  //if any of them are -s then we are in strict mode
                  if (args[i].equalsIgnoreCase("-s")){
                        matchMode = WordPattern.STRICT_MATCH;
                        continue;
                  }
                  //if any of them are -r then we are in random mode
                  if (args[i].equalsIgnoreCase("-r")){
                        matchMode = WordPattern.RANDOM_MATCH;
                        continue;
                  }
                  //if any of them are -n then we are in normal mode
                  if (args[i].equalsIgnoreCase("-n")){
                        matchMode = WordPattern.NORMAL_MATCH;
                        continue;
                  }
                  // make sure they are all 3 characters or longer
                  if (args[i].length() < 3){
                        printUsage("Input Error : " + args[i] + "\nAll component words of the SearchTerm must be three characters or more.");
                        System.exit(1);
                  }
                  // concatenate the arguments into one search string
                  s = s + args[i] + " ";
                  numKeywords++;
            }
            // now make sure that we have at least two valid keywords
            if (numKeywords < 2){
                  printUsage("");
                  System.exit(1);
            }
            s = s.trim();
            // finally instantiate a DefinitionChecker and pass it the string and tell it which match mode to use
            DefinitionChecker dc = new DefinitionChecker(s,matchMode,false,true);
      }

      private static void printUsage(String msg){
            if (msg.length() > 0){
                  System.out.println("\n" + msg);
            }
            System.out.println("\nUSAGE : DefinitionChecker Mode SearchTerm\n\n\tMode Options :\n\t-r\trandom pattern matching\n\t-n\tnormal pattern matching (default)\n\t-s\tstrict pattern matching\n\n\tSearchTerm : \n\tA minimum of 2 words each consisting of 3 characters\n\tor more must be provided to make a valid SearchTerm.");
      }

}

ozymandias

/*
* PatternMatcher.java
*
*/

package definitions;

import java.util.StringTokenizer;
import java.util.Vector;

public class PatternMatcher{

      private Vector patterns;

      /**
      *
      * Constructor for the PatternMatcher. This creates the
      * empty list of WordPatterns; add patterns with
      * addPattern() before matching.
      *
      */
      public PatternMatcher(){
            // create the vector to store our WordPatterns
            patterns = new Vector();
      }

      /**
      *
      * Adds a WordPattern to the PatternMatcher.
      *
      */
      public void addPattern(WordPattern pattern){
            patterns.add(pattern);
      }

      /**
      *
      * This is the key function of the PatternMatcher. It is
      * passed a Sentence and information on "strictness".
      * It then cycles through all its patterns seeing if any of them
      * are found in the sentence. In quick mode it returns the score
      * of the first pattern that matches; otherwise it returns the
      * sum of the scores of every matching pattern.
      *
      */
      public int scoreSentence(Sentence s, int matchMode, boolean quick, boolean all){

            // loop through all the WordPatterns checking to see if
            // any of them match the sentence.
            int hiScore = 0;
            for (int i = 0; i < patterns.size();i++){
                  int score = 0;
                  WordPattern wp = (WordPattern)patterns.elementAt(i);
                  if ((score = wp.containsPattern(s,matchMode,all)) > 0){
                        if (quick){
                              return score;
                        }else{
                              hiScore += score;
                        }
                  }
            }
            return hiScore;
      }

}
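
To make the effect of the quick flag concrete, here is a small stand-alone sketch. It is not part of the classes above: ScoreModesSketch and its score() method are made-up names, and the list of integers simply stands in for the values that each WordPattern.containsPattern() call would return (0 meaning "no match").

```java
import java.util.Arrays;
import java.util.List;

class ScoreModesSketch {

    // Hypothetical stand-in for scoreSentence(): patternScores holds what
    // each pattern would have scored against one sentence (0 = no match).
    static int score(List<Integer> patternScores, boolean quick) {
        int hiScore = 0;
        for (int score : patternScores) {
            if (score > 0) {
                if (quick) {
                    return score;   // quick: the first matching pattern decides
                }
                hiScore += score;   // otherwise keep a running total
            }
        }
        return hiScore;
    }

    public static void main(String[] args) {
        List<Integer> scores = Arrays.asList(0, 7, 5);
        System.out.println(score(scores, true));   // prints 7
        System.out.println(score(scores, false));  // prints 12
    }
}
```

So with quick switched on, a sentence that matches several patterns scores no higher than one that matches a single pattern; with quick off, the scores accumulate.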

ozymandias

/**
*
* WordPattern.java
*
* This class contains the core of the "comparison logic". Each WordPattern
* contains one or more word lists which it uses in sequence to do a word by
* word comparison with the sentence provided.
*
*/

package definitions;

import java.util.Vector;
import java.util.StringTokenizer;

public class WordPattern{

      /*
      *
      * Some static integers to denote the various modes
      * available for pattern matching
      */
      public final static int STRICT_MATCH = 3;
      public final static int NORMAL_MATCH = 2;
      public final static int RANDOM_MATCH = 1;

      private Vector lists;

      /**
      *
      * This constructor takes an array of WordLists
      * and uses them to populate its own Vector
      * of WordLists
      */
      public WordPattern(WordList[] wl){
            lists = new Vector();
            for (int i = 0; i < wl.length; i++){
                  lists.add(wl[i]);
            }
      }

      /**
      *
      * This constructor simply initialises a blank Vector
      * to be used to store the WordLists which can be added
      * using the addList() method
      */
      public WordPattern(){
            lists = new Vector();
      }

      /**
      *
      * This function adds a WordList to the Word Pattern
      *
      */
      public void addList(WordList list){
            lists.add(list);
      }

      /**
      *
      * This function does all the real work. It breaks the supplied
      * Sentence into its component words and then compares them either
      * strictly or not, to the words in the WordLists.
      *
      */
      public int containsPattern(Sentence s, int matchMode, boolean all){
            //System.out.println(s);
            String[] words = s.getWordArray();
            int totalScore = 0;
            int score = 0;
            int stop = 0;
            if (!all){
                  stop = (matchMode - 1);
            }
            // if there are fewer words than lists then the sentence cannot
            // possibly contain a full pattern, so return 0
            if (words.length < lists.size()){
                  return 0;
            }
            for (int m = matchMode; m > stop; m--){
                  totalScore = 0;
                  // this counter will hold the number of words matched
                  int count = 0;
                  // this counter will hold the number of words matched contiguously (i.e. in strict sequence)
                  int sequence = 0;
                  // this value will tell us whether the previous word was a match
                  boolean inSequence = false;
                  // simultaneously loop through the array of words and the Vector
                  // of WordLists, starting by comparing the first word with the first WordList
                  for (int l = 0, w = 0; ((l < lists.size()) && (w < words.length));){
                        WordList wordlist = (WordList)lists.elementAt(l);
                        String word = words[w];
                        // if the wordlist contains the word then we can move to the next wordlist
                        // and to the next word in the word array, unless we are in random mode.
                        // If we are in random mode, we move back to the beginning of the word array
                        // and start checking from the beginning because the words can appear in any order.
                        if ((score = wordlist.containsWord(word)) > 0){
                              totalScore += score;
                              //System.out.println(word + " : scores : " + score + " : total = " + totalScore);
                              l++;
                              if (m == RANDOM_MATCH){
                                    w = 0;
                              }else{
                                    w++;
                              }
                              count++;
                              // if we are in sequence (i.e. the previous word was a match)
                              // then we increment the number of sequential words found
                              if (inSequence || sequence == 0){
                                    sequence++;
                              }
                              // set the value to indicate that this word was matched
                              inSequence = true;
                        }else{
                              // if the wordlist does not contain the word then we can move to the next word
                              // but we do not move to the next wordlist
                              w++;
                              // if we are in strict mode and had started a sequence but not finished it then
                              // we may as well abandon it and start with the first list again just in case
                              // there is a full sequence later in the sentence.
                              if (m == STRICT_MATCH && inSequence && sequence < lists.size()){
                                    l = 0;
                                    w--;
                                    sequence = 0;
                                    count = 0;
                                    totalScore = 0;
                              }
                              // set the value to indicate that we are no longer in strict sequence
                              inSequence = false;
                        }
                  }

                  // if the number of words matched is the same as the number of lists
                  // then we have a match
                  if (count == lists.size()){
                        switch (m){
                              case STRICT_MATCH:
                                    if(sequence == lists.size()){
                                          //System.out.println("strict : scored " + totalScore);
                                          return totalScore + 3;
                                    }else{
                                          totalScore = 0;
                                    }
                                    break;
                              case NORMAL_MATCH:
                                    //System.out.println("normal : scored " + totalScore);
                                    return (totalScore + 2);
                              case RANDOM_MATCH:
                                    //System.out.println("random : scored " + totalScore);
                                    return (totalScore + 1);
                        }
                  }
            }
            //System.out.println("fail : scored " + totalScore);
            return 0;
      }

      /**
      *
      * This function returns the length of the longest word list.
      * It's not used at the moment but may be useful
      *
      */
      public int maxListLength(){
            int length = 0;
            for (int l = 0; l < lists.size(); l ++){
                  if (((WordList)lists.elementAt(l)).numWords() > length){
                        length = ((WordList)lists.elementAt(l)).numWords();
                  }
            }
            return length;
      }

}
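
The three match modes are easier to follow with numbers. The sketch below is a cut-down, stand-alone port of the containsPattern() loop, under some assumptions: MatchModeSketch is a made-up class, plain string arrays replace WordList, the sentence is split on whitespace, and stemming is left out so every matched word scores 2. The word lists used in main() are only examples.

```java
class MatchModeSketch {
    static final int STRICT = 3, NORMAL = 2, RANDOM = 1;

    // 2 for a case-insensitive exact match, 0 otherwise (no stemming here)
    static int wordScore(String[] list, String word) {
        for (String w : list) {
            if (w.equalsIgnoreCase(word)) return 2;
        }
        return 0;
    }

    static int containsPattern(String sentence, String[][] lists, int matchMode, boolean all) {
        String[] words = sentence.trim().split("\\s+");
        int stop = all ? 0 : matchMode - 1;   // all = also try the looser modes
        if (words.length < lists.length) return 0;
        for (int m = matchMode; m > stop; m--) {
            int total = 0, count = 0, sequence = 0;
            boolean inSequence = false;
            for (int l = 0, w = 0; l < lists.length && w < words.length; ) {
                int score = wordScore(lists[l], words[w]);
                if (score > 0) {
                    total += score;
                    l++;
                    w = (m == RANDOM) ? 0 : w + 1;   // random mode rescans from the start
                    count++;
                    if (inSequence || sequence == 0) sequence++;
                    inSequence = true;
                } else {
                    w++;
                    // strict mode: a broken run is abandoned and restarted at this word
                    if (m == STRICT && inSequence && sequence < lists.length) {
                        l = 0; w--; sequence = 0; count = 0; total = 0;
                    }
                    inSequence = false;
                }
            }
            if (count == lists.length) {
                if (m == STRICT) {
                    if (sequence == lists.length) return total + 3;
                } else if (m == NORMAL) {
                    return total + 2;
                } else {
                    return total + 1;
                }
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[][] lists = { {"are", "is"}, {"defined", "delimited", "described"}, {"as", "by"} };
        // contiguous match: 3 words * 2 + 3 bonus
        System.out.println(containsPattern("computer graphics are defined as pictures", lists, STRICT, false));        // 9
        // "often" breaks the run, so strict fails but normal succeeds
        System.out.println(containsPattern("computer graphics are often delimited by science", lists, STRICT, false)); // 0
        System.out.println(containsPattern("computer graphics are often delimited by science", lists, NORMAL, false)); // 8
        // words out of order: only random mode matches
        System.out.println(containsPattern("computer graphics described are as random", lists, RANDOM, false));        // 7
    }
}
```

Passing all = true makes a stricter mode fall through to the looser ones, so the second sentence scores 8 when called with STRICT and all = true.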

ozymandias

/**
*
* WordList.java
*
* This class holds an array of strings (words) which
* can be combined in a WordPattern with other WordLists
*
*/

package definitions;

import java.util.StringTokenizer;
import java.util.Vector;

public class WordList{

      private Vector words;
      private Stemmer stemmer;
      private boolean stemming = false;

      /**
      *
      * This constructor takes a string, a delimiter string and a
      * flag that switches stemming on or off. It uses a StringTokenizer
      * to break the string into a Vector of words.
      */
      public WordList(String s, String delimiter, boolean stem){
            if (stem){
                  stemmer = new Stemmer();
                  stemming = true;
            }
            StringTokenizer st = new StringTokenizer(s,delimiter);
            words = new Vector();
            while (st.hasMoreTokens()){
                  words.add(st.nextToken());
            }
      }

      /**
      *
      * This is just an accessor function that lets you get the words
      * held in the list. Not used at the moment, but probably useful
      * for debugging.
      */
      public String[] getWords(){
            String[] wordArray = new String[words.size()];
            wordArray = (String[])words.toArray(wordArray);
            return wordArray;
      }

      /**
      *
      * This is just an accessor function that lets you get the number of
      * words held in the list. Not used at the moment, but probably useful
      * for debugging.
      */
      public int numWords(){
            return words.size();
      }

      /**
      *
      * This function takes a string (word) and checks to
      * see if it matches any of the words in its list.
      * It returns 2 for an exact (case-insensitive) match,
      * 1 for a match on the stem only (when stemming is on)
      * and 0 for no match.
      */
      public int containsWord(String s){
            //System.out.println("Looking for " + s + " in :");
            //this.print();
            String word1 = s.trim();
            for (int i = 0; i < words.size(); i++){
                  String word2 = (String)words.elementAt(i);
                  if (word1.equalsIgnoreCase(word2)){
                        //System.out.println("match : "+ word1 + " : " + word2);
                        return 2;
                  }
            }
            if (stemming){
                  word1 = stemmer.getStem(word1);
                  for (int i = 0; i < words.size(); i++){
                        String word2 = stemmer.getStem((String)words.elementAt(i));
                        if (word1.equalsIgnoreCase(word2)){
                              //System.out.println("stem match : "+ word1 + " : " + word2);
                              return 1;
                        }
                  }
            }
            return 0;
      }

      /**
      *
      * This is just a helper function that prints out the words
      * held in the list. Not used at the moment, but probably useful
      * for debugging.
      */
      public void print(){
            for (int i = 0; i < words.size(); i++){
                  System.out.println((String)words.elementAt(i));
            }
      }

      public boolean isStemming(){
            return stemming;
      }

}
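
As a rough illustration of the 2 / 1 / 0 scoring in containsWord(), here is a stand-alone sketch. WordListSketch and crudeStem() are made-up names; crudeStem() only strips one trailing "s" and is a deliberately naive placeholder for the Porter Stemmer, so that the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class WordListSketch {
    private final List<String> words = new ArrayList<>();
    private final boolean stemming;

    WordListSketch(String s, String delimiter, boolean stem) {
        StringTokenizer st = new StringTokenizer(s, delimiter);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }
        stemming = stem;
    }

    // Hypothetical placeholder for Stemmer.getStem(): strips a single trailing "s".
    static String crudeStem(String w) {
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // 2 = exact match (ignoring case), 1 = stems match, 0 = no match
    int containsWord(String s) {
        String word = s.trim();
        for (String w : words) {
            if (word.equalsIgnoreCase(w)) return 2;
        }
        if (stemming) {
            String stem = crudeStem(word);
            for (String w : words) {
                if (stem.equalsIgnoreCase(crudeStem(w))) return 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        WordListSketch withStemming = new WordListSketch("agent,graphic", ",", true);
        WordListSketch withoutStemming = new WordListSketch("agent,graphic", ",", false);
        System.out.println(withStemming.containsWord("Agent"));      // 2
        System.out.println(withStemming.containsWord("agents"));     // 1
        System.out.println(withoutStemming.containsWord("agents"));  // 0
    }
}
```

An exact match is worth more than a stem match, which is why "computer graphics" outranks "computers graphic" for the same search term.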

ozymandias

/*

Porter stemmer in Java. The original paper is in

Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
no. 3, pp 130-137,

See also http://www.tartarus.org/~martin/PorterStemmer

History:

Release 1

Bug 1 (reported by Gonzalo Parra 16/10/99) fixed as marked below.
The words 'aed', 'eed', 'oed' leave k at 'a' for step 3, and b[k-1]
is then outside the bounds of b.

Release 2

Similarly,

Bug 2 (reported by Steve Dyrdahl 22/2/00) fixed as marked below.
'ion' by itself leaves j = -1 in the test for 'ion' in step 5, and
b[j] is then outside the bounds of b.

Release 3

Considerably revised 4/9/00 in the light of many helpful suggestions
from Brian Goetz of Quiotix Corporation (brian@quiotix.com).

Release 4

*/

package definitions;

import java.io.*;

/**
* Stemmer, implementing the Porter Stemming Algorithm
*
* The Stemmer class transforms a word into its root form. The input
* word can be provided a character at a time (by calling add()), or at once
* by calling one of the various stem(something) methods.
*/

public class Stemmer
{ private char[] b;
private int i, /* offset into b */
i_end, /* offset to end of stemmed word */
j, k;
private static final int INC = 50;
/* unit of size whereby b is increased */
public Stemmer()
{ b = new char[INC];
i = 0;
i_end = 0;
}

/**
* Function to allow the Stemmer to be reused
      *
      * Ozymandias 04/03/03
*/

public void reset()
{
       b = new char[INC];
       i = 0;
       i_end = 0;
}

/**
* Add a character to the word being stemmed. When you are finished
* adding characters, you can call stem(void) to stem the word.
*/

public void add(char ch)
{ if (i == b.length)
{ char[] new_b = new char[i+INC];
for (int c = 0; c < i; c++) new_b[c] = b[c];
b = new_b;
}
b[i++] = ch;
}

/** Adds wLen characters to the word being stemmed contained in a portion
* of a char[] array. This is like repeated calls of add(char ch), but
* faster.
*/

public void add(char[] w, int wLen)
{ if (i+wLen >= b.length)
{ char[] new_b = new char[i+wLen+INC];
for (int c = 0; c < i; c++) new_b[c] = b[c];
b = new_b;
}
for (int c = 0; c < wLen; c++) b[i++] = w[c];
}

/**
* Quick and dirty method for adding a word as a string
*
* Ozymandias 04/03/03
*/

      public void add(String word){
            int length = word.length();
            char[] buf = new char[length];
            for (int c = 0; c < length; c++){
                  buf[c] = word.charAt(c);
            }
            add(buf,length);
      }

/**
* Quick and dirty method for getting the stem of a word
*
* Ozymandias 04/03/03
*/

      public String getStem(String s){
            this.reset();
            this.add(s);
            this.stem();
            return this.toString();
      }

/**
* After a word has been stemmed, it can be retrieved by toString(),
* or a reference to the internal buffer can be retrieved by getResultBuffer
* and getResultLength (which is generally more efficient.)
*/
public String toString() { return new String(b,0,i_end); }

/**
* Returns the length of the word resulting from the stemming process.
*/
public int getResultLength() { return i_end; }

/**
* Returns a reference to a character buffer containing the results of
* the stemming process. You also need to consult getResultLength()
* to determine the length of the result.
*/
public char[] getResultBuffer() { return b; }

/* cons(i) is true <=> b[i] is a consonant. */

private final boolean cons(int i)
{ switch (b[i])
{ case 'a': case 'e': case 'i': case 'o': case 'u': return false;
case 'y': return (i==0) ? true : !cons(i-1);
default: return true;
}
}

/* m() measures the number of consonant sequences between 0 and j. if c is
a consonant sequence and v a vowel sequence, and <..> indicates arbitrary
presence,

<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3
....
*/

private final int m()
{ int n = 0;
int i = 0;
while(true)
{ if (i > j) return n;
if (! cons(i)) break; i++;
}
i++;
while(true)
{ while(true)
{ if (i > j) return n;
if (cons(i)) break;
i++;
}
i++;
n++;
while(true)
{ if (i > j) return n;
if (! cons(i)) break;
i++;
}
i++;
}
}

/* vowelinstem() is true <=> 0,...j contains a vowel */

private final boolean vowelinstem()
{ int i; for (i = 0; i <= j; i++) if (! cons(i)) return true;
return false;
}

/* doublec(j) is true <=> j,(j-1) contain a double consonant. */

private final boolean doublec(int j)
{ if (j < 1) return false;
if (b[j] != b[j-1]) return false;
return cons(j);
}

/* cvc(i) is true <=> i-2,i-1,i has the form consonant - vowel - consonant
and also if the second c is not w,x or y. this is used when trying to
restore an e at the end of a short word. e.g.

cav(e), lov(e), hop(e), crim(e), but
snow, box, tray.

*/

private final boolean cvc(int i)
{ if (i < 2 || !cons(i) || cons(i-1) || !cons(i-2)) return false;
{ int ch = b[i];
if (ch == 'w' || ch == 'x' || ch == 'y') return false;
}
return true;
}

private final boolean ends(String s)
{ int l = s.length();
int o = k-l+1;
if (o < 0) return false;
for (int i = 0; i < l; i++) if (b[o+i] != s.charAt(i)) return false;
j = k-l;
return true;
}

/* setto(s) sets (j+1),...k to the characters in the string s, readjusting
k. */

private final void setto(String s)
{ int l = s.length();
int o = j+1;
for (int i = 0; i < l; i++) b[o+i] = s.charAt(i);
k = j+l;
}

/* r(s) is used further down. */

private final void r(String s) { if (m() > 0) setto(s); }

/* step1() gets rid of plurals and -ed or -ing. e.g.

caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat

feed -> feed
agreed -> agree
disabled -> disable

matting -> mat
mating -> mate
meeting -> meet
milling -> mill
messing -> mess

meetings -> meet

*/

private final void step1()
{ if (b[k] == 's')
{ if (ends("sses")) k -= 2; else
if (ends("ies")) setto("i"); else
if (b[k-1] != 's') k--;
}
if (ends("eed")) { if (m() > 0) k--; } else
if ((ends("ed") || ends("ing")) && vowelinstem())
{ k = j;
if (ends("at")) setto("ate"); else
if (ends("bl")) setto("ble"); else
if (ends("iz")) setto("ize"); else
if (doublec(k))
{ k--;
{ int ch = b[k];
if (ch == 'l' || ch == 's' || ch == 'z') k++;
}
}
else if (m() == 1 && cvc(k)) setto("e");
}
}

/* step2() turns terminal y to i when there is another vowel in the stem. */

private final void step2() { if (ends("y") && vowelinstem()) b[k] = 'i'; }

/* step3() maps double suffices to single ones. so -ization ( = -ize plus
-ation) maps to -ize etc. note that the string before the suffix must give
m() > 0. */

private final void step3() { if (k == 0) return; /* For Bug 1 */ switch (b[k-1])
{
case 'a': if (ends("ational")) { r("ate"); break; }
if (ends("tional")) { r("tion"); break; }
break;
case 'c': if (ends("enci")) { r("ence"); break; }
if (ends("anci")) { r("ance"); break; }
break;
case 'e': if (ends("izer")) { r("ize"); break; }
break;
case 'l': if (ends("bli")) { r("ble"); break; }
if (ends("alli")) { r("al"); break; }
if (ends("entli")) { r("ent"); break; }
if (ends("eli")) { r("e"); break; }
if (ends("ousli")) { r("ous"); break; }
break;
case 'o': if (ends("ization")) { r("ize"); break; }
if (ends("ation")) { r("ate"); break; }
if (ends("ator")) { r("ate"); break; }
break;
case 's': if (ends("alism")) { r("al"); break; }
if (ends("iveness")) { r("ive"); break; }
if (ends("fulness")) { r("ful"); break; }
if (ends("ousness")) { r("ous"); break; }
break;
case 't': if (ends("aliti")) { r("al"); break; }
if (ends("iviti")) { r("ive"); break; }
if (ends("biliti")) { r("ble"); break; }
break;
case 'g': if (ends("logi")) { r("log"); break; }
} }

/* step4() deals with -ic-, -full, -ness etc. similar strategy to step3. */

private final void step4() { switch (b[k])
{
case 'e': if (ends("icate")) { r("ic"); break; }
if (ends("ative")) { r(""); break; }
if (ends("alize")) { r("al"); break; }
break;
case 'i': if (ends("iciti")) { r("ic"); break; }
break;
case 'l': if (ends("ical")) { r("ic"); break; }
if (ends("ful")) { r(""); break; }
break;
case 's': if (ends("ness")) { r(""); break; }
break;
} }

/* step5() takes off -ant, -ence etc., in context <c>vcvc<v>. */

private final void step5()
{ if (k == 0) return; /* for Bug 1 */ switch (b[k-1])
{ case 'a': if (ends("al")) break; return;
case 'c': if (ends("ance")) break;
if (ends("ence")) break; return;
case 'e': if (ends("er")) break; return;
case 'i': if (ends("ic")) break; return;
case 'l': if (ends("able")) break;
if (ends("ible")) break; return;
case 'n': if (ends("ant")) break;
if (ends("ement")) break;
if (ends("ment")) break;
/* element etc. not stripped before the m */
if (ends("ent")) break; return;
case 'o': if (ends("ion") && j >= 0 && (b[j] == 's' || b[j] == 't')) break;
/* j >= 0 fixes Bug 2 */
if (ends("ou")) break; return;
/* takes care of -ous */
case 's': if (ends("ism")) break; return;
case 't': if (ends("ate")) break;
if (ends("iti")) break; return;
case 'u': if (ends("ous")) break; return;
case 'v': if (ends("ive")) break; return;
case 'z': if (ends("ize")) break; return;
default: return;
}
if (m() > 1) k = j;
}

/* step6() removes a final -e if m() > 1. */

private final void step6()
{ j = k;
if (b[k] == 'e')
{ int a = m();
if (a > 1 || a == 1 && !cvc(k-1)) k--;
}
if (b[k] == 'l' && doublec(k) && m() > 1) k--;
}

/** Stem the word placed into the Stemmer buffer through calls to add().
* This method returns nothing; retrieve the result with
* getResultLength()/getResultBuffer() or toString().
*/
public void stem()
{ k = i - 1;
if (k > 1) { step1(); step2(); step3(); step4(); step5(); step6(); }
i_end = k+1; i = 0;
}

/** Test program for demonstrating the Stemmer. It reads text from
* a list of files, stems each word, and writes the result to standard
* output. Note that the word stemmed is expected to be in lower case:
* forcing lower case must be done outside the Stemmer class.
* Usage: Stemmer file-name file-name ...
*/
/**
*
* Commenting out this main method to add one that is more useful
* for my immediate needs.
*
* Ozymandias 04/03/03
*
public static void main(String[] args)
{
char[] w = new char[501];
Stemmer s = new Stemmer();
for (int i = 0; i < args.length; i++)
try
{
FileInputStream in = new FileInputStream(args[i]);

try
{ while(true)

{ int ch = in.read();
if (Character.isLetter((char) ch))
{
int j = 0;
while(true)
{ ch = Character.toLowerCase((char) ch);
w[j] = (char) ch;
if (j < 500) j++;
ch = in.read();
if (!Character.isLetter((char) ch))
{
// to test add(char ch)
for (int c = 0; c < j; c++) s.add(w[c]);

// or, to test add(char[] w, int j)
// s.add(w, j);

s.stem();
{ String u;

// and now, to test toString() :
u = s.toString();

// to test getResultBuffer(), getResultLength() :
// u = new String(s.getResultBuffer(), 0, s.getResultLength());

System.out.print(u);
}
break;
}
}
}
if (ch < 0) break;
System.out.print((char)ch);
}
}
catch (IOException e)
{ System.out.println("error reading " + args[i]);
break;
}
}
catch (FileNotFoundException e)
{ System.out.println("file " + args[i] + " not found");
break;
}
}
*/

      public static void main(String[] args){
            Stemmer stemmer = new Stemmer();

            for (int i = 0; i < args.length; i++){
                  String word = args[i];
                  stemmer.reset();
                  stemmer.add(word);
                  stemmer.stem();
                  System.out.println(word + " was stemmed to " + stemmer.toString());
            }
      }
}
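
Most of the steps in the Stemmer hinge on the "measure" computed by m(), so here is a stand-alone sketch of just that part. MeasureSketch is a made-up class; it reuses the same consonant rule as cons() but works on a whole String, whereas the real m() only looks at positions 0..j of the internal buffer. Input is assumed to be lower case.

```java
class MeasureSketch {

    // Same rule as cons(): a, e, i, o, u are vowels; y is a consonant
    // at position 0 and otherwise only when it follows a vowel.
    static boolean cons(String b, int i) {
        switch (b.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !cons(b, i - 1);
            default:
                return true;
        }
    }

    // Counts the vowel-run-then-consonant-run (VC) pairs in the word.
    static int measure(String b) {
        int j = b.length() - 1;
        int n = 0;
        int i = 0;
        // skip the optional leading consonant run
        while (true) {
            if (i > j) return n;
            if (!cons(b, i)) break;
            i++;
        }
        i++;
        while (true) {
            // skip the rest of a vowel run
            while (true) {
                if (i > j) return n;
                if (cons(b, i)) break;
                i++;
            }
            i++;
            n++;   // a vowel run followed by a consonant: one more VC pair
            // skip the rest of the consonant run
            while (true) {
                if (i > j) return n;
                if (!cons(b, i)) break;
                i++;
            }
            i++;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"tree", "by", "trouble", "oats", "troubles", "private"};
        for (String s : samples) {
            System.out.println(s + " -> m = " + measure(s));
        }
    }
}
```

With these samples the measure comes out as 0 for "tree" and "by", 1 for "trouble" and "oats", and 2 for "troubles" and "private", which matches the examples given in Porter's paper.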

ozymandias

package definitions;

import java.util.StringTokenizer;
import java.io.File;

public class Sentence implements Comparable{

      private String sentence;
      private int keyScore;
      private int patternScore;
      private File location;

      public Sentence(String s){
            sentence = s;
            keyScore = 0;
            patternScore = 0;
      }

      public String[] getWordArray(){
            int token = 0;
            StringTokenizer st = new StringTokenizer(sentence);
            String[] words = new String[st.countTokens()];
            while (st.hasMoreTokens()){
                  words[token++] = st.nextToken();
            }
            return words;
      }

      public int getScore(){
            return (keyScore + patternScore);
      }

      public int getKeyScore(){
            return keyScore;
      }

      public void setKeyScore(int s){
            keyScore = s;
      }

      public void addKeyScore(int s){
            keyScore += s;
      }

      public int getPatternScore(){
            return patternScore;
      }

      public void setPatternScore(int s){
            patternScore = s;
      }

      public void addSPatterncore(int s){
            patternScore += s;
      }

      public void setLocation(File f){
            location = f;
      }

      public File getLocation(){
            return location;
      }

      // sorts sentences in descending order of total score
      public int compareTo(Object o){
            int c = ((Sentence)o).getScore();
            if (c == this.getScore()){
                  return 0;
            }else{
                  return (c - this.getScore());
            }
      }

      public String getSentence(){
            return sentence;
      }

      public String toString(){
            // concatenating location directly avoids a NullPointerException
            // when setLocation() has not been called yet
            return sentence + "[" + keyScore + "][" + patternScore + "](" + location + ")";
      }
}
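
Note that compareTo() subtracts the other way round from the usual convention, so sorting a collection of Sentences puts the highest score first. A minimal sketch of that ordering, using a made-up Scored class in place of Sentence so that it runs on its own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RankingSketch {

    // Hypothetical stand-in for Sentence: just a text and a total score.
    static class Scored implements Comparable<Scored> {
        final String text;
        final int score;

        Scored(String text, int score) {
            this.text = text;
            this.score = score;
        }

        // same direction as Sentence.compareTo(): other minus this
        public int compareTo(Scored other) {
            return other.score - this.score;
        }
    }

    // Returns the texts ordered from highest to lowest score.
    static List<String> rank(List<Scored> items) {
        List<Scored> copy = new ArrayList<>(items);
        Collections.sort(copy);
        List<String> texts = new ArrayList<>();
        for (Scored s : copy) {
            texts.add(s.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Scored> items = Arrays.asList(
            new Scored("low", 3), new Scored("high", 13), new Scored("middle", 8));
        System.out.println(rank(items));   // [high, middle, low]
    }
}
```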

ozymandias

For completeness, here are the contents of the four test files containing sample sentences.

a1.txt
======
1 graphics computer is defined as nonsense#2 computer graphics interface are defined as pictures#3 computer graphic is delimited by pictures#4 computer graphics are often delimited by science#5 computer graphics are often delimited by science#6 computer graphic are often delimited by science#7 computer graphics described are as random#8 computer graphic described are as random#

b1.txt
======
10 mobile agents are defined as pictures#11 mobile agent is delimited by pictures#12 mobile agents are often delimited by science#13 mobile agents are often delimited by science#14 mobile agent are often delimited by science#15 mobile agents described are as random#16 mobile agent described are as random#

c1.txt
======
17 computers graphics are defined as pictures#18 computers graphic is delimited by pictures#19 computers graphics are often delimited by science#20 computers graphics are often delimited by science#21 computers graphic are often delimited by science#22 computers graphics descibed are is random#23 computers graphic descibed are is random#

d1.txt
======
24 as science is often found to delimit computers graphic output#25 computer graphics is the display of digital images which is defined by science#
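
In case it helps when building your own test files: judging from the samples above, each sentence ends with a "#". The sketch below shows one way to split such a file's content into sentences. It assumes that "#" really is the delimiter and is not taken from the DefinitionChecker code; SentenceSplitSketch and splitSentences() are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class SentenceSplitSketch {

    // Splits raw file content on '#' (assumed delimiter) and drops empty pieces.
    static List<String> splitSentences(String content) {
        List<String> sentences = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(content, "#");
        while (st.hasMoreTokens()) {
            String s = st.nextToken().trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String d1 = "24 as science is often found to delimit computers graphic output#"
                  + "25 computer graphics is the display of digital images which is defined by science#";
        for (String s : splitSentences(d1)) {
            System.out.println(s);
        }
    }
}
```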