Start Free Trial

asked on

improved C codes- Imladris version (Phonex)

Dear Imladris,

from
https://www.experts-exchange.com/jsp/qManageQuestion.jsp?ta=cprog&qid=20116615
I would like to improve the output to be like a metrix which contains 3 colums and the row depends on the the number of names we enter or other word is input names

below is example of requirement
-----

how many basename do u want to compare?----this is the row of metrix
the user enter number of names which he/she wants to compare for example 10
user enter 10 name
save in matched names y or no.?-- and the results which saved in matched.dat should be exactly the same as output

please enter name u want to compare?
a
b
c
...

here is the demonbstrator of the requirement:
save output in matched file?
y
how many basenames u want to compare?
3 ---user enter
please enter names u want to compare
smith
smyth
smit
after enter the program would provide the output like this:
--------------------------------------------------
basenames | n | matched names
smith 3 smith smithe smyth
smyth 3 smith smithe smyth
smit 2 smit smith

...
..
------------------------------------------------
do you want to compare again?
y or n

y continue, n exit

---------------------------------
p.s.
n = number of matched names
matched names output should be sorted in alphabet order....
-----------------------
hope it's not that hard....

many thanks,

pps. if you think it's harder i will add more point later on....any suggestion?
-----------------------

What would happen if you get more matches than can fit on one line?

imladris, a complaint has been sent my way, and after looking at your PAQ's and open Q's it seems there was an alleged comment that you are in collusion with korsila. I want to be sure these connections between the two of you are on the level and not doing this for point accrual.

If it is found there is collusion, all points regarding C or C++ will be removed.

I am not accusing you of wrong doing, but it does have the appearance of such.

For the sake of EE form now on, please do not have korsila ask a question Intended for you only. EE is for open for all to debate, although you may be the only one korsila wishes to deal with, it appears offensive to others in the group.

What compounds this is the fact that you are both Experts.

I will update you if anything else arises.

Thank you,
ComTech
Community Support

ComTech,

there is no collusion here. I have never met korsila nor have I ever asked or encouraged him to direct questions only at me. I have never asked or discouraged any other expert from answering these questions.

It is not clear to me what he does with the results of these questions. I can see that they could appear to be a series of "toy" or "artificial" questions. I can but assume that they have value to the questioner.

As to why he addresses them to me, I would guess it is because I had the patience to stick with a couple of his initial questions, nearly a year ago, over the course of many weeks, and provide him with a working program to his specifications. He tends to add requirements as he thinks of them, and clarify as he goes along. I assume he finds it simpler to deal with someone who is familiar with the whole history, so that he doesn't have to spell everything out from the beginning again.

I am anxious to lay this suspicion to rest and clear my name. Please let me know what else I could do to that end.

imladris.

P.S. Please also note that korsila is an expert only the very broadest sense of the term. He has answered 2 questions.

And also note, that I have twice as many expert points in the Java forum, where korsila has never asked a question.

imladris, you are an excellent Expert, and I do not think personally there has been mischief. But was ask to look into it by a third party with-in EE. Although it is an ongoing programming issue, I understand completely the motivations to help.

This is merely to put some at rest, as others may have seen it a different way.

There was not anything I could see wrong, except maybe korsila may ask the question without directing it at a single person. Even tough you two will still get together, due to the fact it seems to be an ongoing project,

I menat no harm, and felt I needed to inform you.

Thank you,
ComTech
Community Support

ASKER

I have no idea what is going on here between Imladris and Comtech ?

but as far as i can see it's absolutely nonsense for ComTech ..sorry to say that...whenever i need some help with a bit pressure of implementing codes but no-one did it better than Imladris ..he has concerned about my question and had tried to help me sorting the all problems out with his kindness..and I rerely see this kind of expert on the internet... long time ago, I posted some questions here and sometime I got no answer and no help at all until I had an exelent response from Imladris who had helped me and understood well on my requirement...since then he has produced the answers which I need and they are useful to my project...

as you can see I used to post some question in CGI and no-one could help me at all, therefore I would have to deleted the question which are still solved out...actually i prefer doing things in CGI or PHP but when Imladris gave me the excelent answers twice so i had changed my mind to study and implement some codes in C whichg some help from Imladris...it maybe my fault that direct the questions to imladris only ..I have no idea that this sort of things would cuase a lot of trobles here...but as you can see no-one gave the best answer like Imladris did..and i really feel comfortable to ask any question whenever i got into the problems....I would love to make clear that if there are plenty of good expert trying to help me please i would be delighted and very pleased to get any suggestion or even help like Imladris does to me...

another thing I would mention is that ..am learning how name variations can be a big problem of history and that was why I have come up with many question about name matching..and once when someone has heped me with an ecelent answers so I need to carry on with the one who understood what i am asking to know ....is there any wrong for the expert comunity..and i ahve no idea why it cause a lot of troble here..or you guys got paid with my points , please be nmore specific and explain to me..since I have no idea what is going on between the expert...jeallousy being an expert of the week or sort of ..please anyone expalin..?

lastly, i would like to thank whoever have help me posting codes or any suggestion into my questions...

korsila

p.s. next time I won't post any question direct to imladris..but i think he is the one who is motivative about my problem....whoelse could help me with a good respect and answers please ..am stillawaiting..!!

cheers,

ASKER

Imladris and comtech..
can we still carry on about my questions...
is there any experts are trying to help please i need your help and quickly...

cheers,

Korsila

ASKER

Actually I have more requirement than this but I justcut it down a bit and wanna make sure Imladris can sort this out first..and if any expert would love to give me any comment or help me out with a good codes please am still awating....

many thanks,

Korsila

The main thing Comtech has asked is that you not direct the question specifically at me. It does cause an appearance of exclusion. Other than that we will carry on as before.

So, I still have the question outstanding:

What would happen if you get more matches than can fit on one line?

ASKER

**** if you think you could not fit names in 1 line then change the coulums into rows
but when you read the below requirement, maybe it's easy to stick with my above outputs..
anyway it's up to you to come up with good results and ideas...

here to change colums into rows....
basenames: smith smyth ...
n 2 3
matched names: smith smith ...
smyth smyth
smythey

--------------------------------------------------------

here is the whole requirement I need to do...

when you get the resulkts of matched names..I need to cluster the group
of name matching...
like if there is a duplication like this
basenames: smith smyth ...
n 3 3
matched names: smith smith ...
smyth smyth
smythey smythey

just to eliminate and keep just one like this:

basenames: smith
n 3
matched names: smith
smyth
smythey

---------
that is the first elimination

secondly,
to find the overlaping names and then eliminate...

here is a good examples to explain:
assume that there are 5 matched of basenames: a, b, c , d, e

basename: a b c d e
n 4 3 3 3 2
matched a a a a d
b b b d e
c c c e
d

as you can see we can eliminate base name c records and then keep base
name b becus of duplication...
( code might be writen from this algorithm (just an example)
if duplicated eliminated duplication first....then start eliminating from k to kn therefore eliminate
k+1 where k=k+1....... (k= basename a)-k=coulum1=a , k+1 = coulum 2 =b and so on
so now we have the first basename b cluster which contains a,b,c from duplication
se more example from below

next is to find another gruop or cluster by trying to eliminate name from any
basename group and have the maximum matched names in each group
for example now we are trying to eliminated starting from "a" basename first (left to right) so "a" in matched
will be deleted...
basename: a b d e
n 4 3 3 2
matched a a a d
b b d e
c c e
d

as we can see we got another duplication

basename: a b d e
n 4 3 3 2
matched d
b b d e
c c e
d
d and e basename are duplictaed ..therefore keep d and deleted e basename and we got another cluster of d basename containing d, e matched names....
now the process repeats again and start with the same process start eliminating "b"
from the original outputs we got:

basename: a b d e
n 4 3 3 2
matched a a a d
b b d e
c c e
d
so wer trying to eliminate "b" next but we cann't find any cluster from deleting "b" therefore move to next step
to eliminate "c" , "d', "e" respectively

finally we got another cluster from m eliminating "d" which contain a, b, c from basenbame a, b however we have aready got this cluster at the beginning of the process....
----------

this is just an idea how to cluster matched names together in the same group of similarity..

ASKER

finally it returns the cluster which we have from iliminating basenames..
so we have basically

abc
de
----------

ASKER

I think it shopuld return more clusters than that

perhap
abc
de
bc
e

korsila

ASKER

I think 50 points more is a fair deal for this question...I know it's quiet hard to get it done as I would like to..

however, i have done it by hand it's much easier...
forexample you have 5 basenames: a, b, c, d, e, f and each matched names and no. of names which are matched as ordering format: basename, n, matched name1...matched name n (look like a metrix)

a 4 abcd
b 3 abc
c 3 abc
d 3 ade
e 2 de

i want identily clusters where a cluster of names has no overlap

firstly,find all name-overlaping clusters
for every cluster
every i matches every j
(see above metrix which I previously propose the requirement)

secondly, find overlaping cluster (n)
with 1 outline
2 outline
..
(n-1)

----------------------
hereis what i have come up by hand

find duplicate first
a 4 abcd
b 3 abc
c 3 abc
d 3 ade
e 2 de

and eliminate c (row) since it's duplicated with b so we got cluster "abc"

1)start elimiate matched names start from left or right "a or d"
forexample staring from eliminating "a" f from
a 4 abcd
b 3 abc
d 3 ade
e 2 de

2)we got another duplicated row which is "de"
so de can be another cluster

a 4 bcd
b 3 bc
d 3 de
e 2 de
----------program return to start from 1) again and starting eliminating b,c and d
but as you can see eliminating b,c had not affected any ckluster at all..
so now starting eliminating "d" from
a 4 abcd
b 3 abc
d 3 ade
e 2 de
we got "abc" overlaping and can be a cluster but as you can see above we hjave already got cluster "abc"

a 4 abc
b 3 abc
d 3 ae
e 2 e
--------
now star eliminating 2 string matched names from 1) again
if we start eliminating a and then d from:
a 4 abcd
b 3 abc
d 3 ade
e 2 de

we got another 2 clusters which are "bc" and "e"
a 4 bc
b 3 bc
d 3 e
e 2 e

--------
so finally the program return cluster : abc, de, bc, and e

I think it 's not very clear but hope u could understand a bit...

many thanks,
korsila

ASKER

Dear Imladris,
I think this is the main key of name matching to get and classify the cluster of matched names...
I myself have found out it's extremely difficult..and harder than I expect even to make the requirement clearer..

hope another 100 points would encourage you and maybe other experts to pay more attention in this question...

am still awaiting the comment and some good implementation...

regards,
korsila

I think I have some grasp of the requirement now. The last explanation helped a lot. I have been busy with work, but should be able to have a crack at this soon.

ASKER

Imladris,
I will see if you could come up with expectable results...hope to get this done as soon u can... as am trying to finish this part of my interests-name matching..and I just found out that the key to make name matching easy for searching is to classify "Cluster" -this is the idea which I have come across...but to be honest don't know how to implement it in C....

looking forwards to having a test at your codes soon..

Thanx

Korsila

ASKER CERTIFIED SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

ASKER

Imladris,
1 error
"Inconsistent type declaration "strupr"

char *strupr(char *n)-------1 error ***

{ for( ; *n!='\0'; ++n)*n=toupper(*n);
return(n);
}

ASKER

I have fixed the error, but still confused a bit of the outputs...

let me make sure that I do understand what you have come up with...

will get back to you soon..

ASKER

it's really confusing...but it 's a good try..
let me explain more what my requirement was..

for example..when you enter 20 basenames and you would get matched names in match.dat like this

Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
Match name smeath
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------------------------------------
10 more outputs which is exactkly the same ...

all is the same , so the porogram shou;ld return cluster
which contains 20 nams since each base name got 20 name matched...
well, if you could make it like a metrix like this it's easy and simple to see..

Smeath Smeath
Smeeth smeeth
Smeethe smeethe
Smeith smeith
Smeth smeth
Smethe smethe
Smett smett
Smieth smieth
Smit smit
Smite smite
Smith smith
Smithe smithe
Smithee smithee
Smithie smithie
Smithy smithy
Smitt smitt
Smitte smitte ..
Smitth smitth ..
Smity smity ..(18 exactl;y the same matched names)
Smiyth smiyth ...

so finally the program return: there are only 1 cluster which is
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth

-------
here is what i would like to see...

ASKER

maybe this should be clearer..

for example this is the output of matched.dat (datfile)

if you enter 5 basenames :a, b, c, d, e and the following outputs are those basename repectively with alphbet order..
a b c d e ---> basename
-------
a a
b b b
c c c
d d d d
e e e e

or like this left to right
a b c d e
a b c d e
b c d
d e
e
----------
if not use the sort() u cann't see the different how to clasify the cluster...easy to see..and code...

---------
now the program must cluster those outputs and return how many of cluster and each cluster matched names
------
so here is the outputs which was made by hand

program retruns 4 clusters:
a,b,c,d,e
bcd
de,
e
-------
basic idea is to find overlaping of matched name in the output file (matched.dat) and then use the algorithm to clasify clustering like a group...

-------
hope this should be a bit clearer...

as I can suggest you can use the output file from macthed.dat in phonex.c which you previously implement and then implement another algorithm to clasify the cluster from that matched.dat file which might be easy to do...

matched.dat file which was the output when i run phonex.c
is below..
here is only tested with smith 20 basenanmes..
and as u can see it has only 1 cluster which is the name itself containing 0 matched names (see below)

ASKER

if u enter name "smeath, smeeth and so on..each base nbame give 20 matched names..so that it has only 1 cluster which is duplictaed names...
Smeath
Smeeth
Smeethe
Smeith
Smeth
Smethe
Smett
Smieth
Smit
Smite
Smith
Smithe
Smithee
Smithie
Smithy
Smitt
Smitte
Smitth
Smity
Smiyth
----------

ASKER

i think the idea is to find the name which is overlaping and then classify the cluster...

I would prefer to add more mark if u find out it's very tough one..!!

many thaks, for your first try...it's not bad but I need to make sure you got the main idea working first..

the best solution should be implenting a new code and test it with "match.dat" which was the output file from running the phonex.c program -you have implemented it be4...

since in the matched.dat u got the basename and the matched names like this

match name smeath
Smeath
Smeeth
Smeethe
..
...
------------

match name smeeth
Smeath
Smeeth
Smeethe
..
...
------------
match name smeethe
Smeath
Smeeth
Smeethe
..
...
------------

if u could make this matched.dat output file become a metrix with name ordeing by alphbet..it's easy to find overlaping names and then u can easily cluster the group of name togther...

it's not a bad idea but if you find anbother feasible way to do it, I would be delighted to see how it works...

many thanks,

Korsila

ASKER

I have read your comment carefully this time and i must confess that it was not a bad idea at all..
here what i need to mention: **** is my comment

you said:
hese has a nice mathematical flavour to it. However, there is more than 1 cluster set possible. Here for instance you could also choose: a, bc, d, de. For a result set of abc and bcd you could get cluster sets of a, bc, d or abc, bcd.
**** i think the cluster should be only 1 way which
contains only 3 clusters from clustering abc and bcd:
bc
a
d

firdtlky should find dupilcated retrun cluster inn this case no duplicated, therefore, find the maximum overlaping and then return cluster which is "bc"..the program again try to fiond another overlaping but there is no more in this case... so lastly return non -overlaping...which is "a", "d"

-----------

So now the question of which to choose appears. You could answer the smallest (although in the last case it's not obvious that the smallest one is "best") or the one with the longest strings etc. Since this question did not appear to have a good answer, I abandoned this tactic as well.

***** the answer should return al the cluster which i have mentioned above...1. duplicated, 2.maximum overlaping
3, non-overlaping...

What I have settled on is the following: a sequence of names for a base name is a cluster if it contains
at least 2 names and is the largest sequence containing some names that appears for any other base name.
So for the original example the clusters would be abc and de.

****** not the one I expect from the program..as i mentioned it should returns 4 clusters:abc (duplictaed and maximum overlaping), de (another maximum overlaping), d and e (another overlaping) and if there is "f" in basename a it should be returning "f" as a cluster as well since non-overlaping...

For abc, bcd there would be one cluster: bc.

***** not the one I expect
it should be 3 clusters bc (maximum overlaping) a, b non-overlaping...

---------

hope this should be very clear explaination...

also the program should ignore "the too many basename" : when user enter more than 20 basenames:

---------

HOPE It helps you to figure out a bit..!!

I am certainly happy to work at modifying the algorithm. I realized that the notion of "clustering" was not yet well defined, and that this would just provide a starting point.

But before we get to that, we need to sort out why it is not returning what I expected it to return. The algorithm I chose, and the tests I ran were such that a group of basenames that got the same results should return a single cluster that is identical to each of the results.

It would appear that you have made modifications to the program (which I applaud) both because of the problem with strupr, and because it was coded to support 10 base names and you're mentioning entering 20. At present, since a bug may have been introduced, it may be helpful to go back to the original code, and see what it does with simple input (like 2 or 3 base names). The step after that will depend on whether that test works or not.

I want to set thing right please. Someone in your group thought there was collusion between the two of you.

It was my duty to look at this, and I found nothing of the sort.

I am actually impressed by the work I see here, keep it up, I reported back that there was nothing wrong here.

Thanks and good luck,
ComTech

Ps, I do not have anything against either of you, I think you two are very bright and are doing one hell of a job.

I understand completely, and appreciate the influence moderators have exerted. It has facilitated resolving disagreements, and raises the level of confidence we all have in this wonderful community.

Thanx Comtech.

ASKER

first of all, I would like to thank Comtech who has sorted out the misunderstanding things here...It is wonderful to be here and get helped by some nice and cool experts..without the community and experts here I would feel more depressed...and again would like to thank Imladris who has helped and concentrated on my problems...and also other experts who made my work easier..

Korsila

ASKER

ok Imladris i will test your algorithm again with my 100 names of datafile...and will get back to you with the results and comments..are you still on line ? i need to make sure I would get this done as soon as posible ..but it depends on your movtivation..:)

Korsila

Thanks guys. <blush>

ComTech

And hope the test works.

ASKER

Imladris,
here is what I got from your running your algoirthm with 3 basenames: smeath, smeeth and smeethe
-------------------
Please enter number of base names to match:
3
Please enter base name to match:
smeath
Please enter base name to match:
smeeth
Please enter base name to match:
smeethe
SmittheSmeath
SmittheSmite
Smiyth Smity
Number of Matches: 20
SmittheSmeath
SmittheSmite
Smiyth Smity
Number of Matches: 20
Smitthe Smeath
SmittheSmite
Smiyth Smity
Number of Matches: 20

Smiythe Smeath
Smiythe Smeath
Compare again (y or n)?
------------------

could you explain the cluster which the program has resulted in your algorithm..and what are those results repreesented?
please explain as well abbout the results I got above...

it's simple to say that every basenames matched 20 names indatafile and 20 names are exatly the same...so it would returns only 1 cluster which cointains all the 20 names in my 100 names datafile...(you could test it by yourself)

and this should work with your algorithm as wekll..so I would be impress if you could modify your algorithm to meet my requirement...

u can test your algorithm with my 100 names datfile with basename "brown" and u would see what the cluster should be and returns...

-------

still waiting for your quick response and comments...

many thanks,
Korsila

ASKER

Imladris,
I HAVE TESED your algorithm with my 100 names datafile and it seems confusing...

for example I tested with another 3 basenames: williammas, willeans and willems

however, the program returns "no matches found"
see the results below...

Please enter number of base names to match:
3
Please enter base name to match:
williammas
Please enter base name to match:
willeans
Please enter base name to match:
willems
williammas:
No Matches Found.
willeans:
No Matches Found.
willems:
No Matches Found.
-----------------------
as u can see when I tested this 100 names with the previous algorithms-phonex.c
(not a cluster one) the program returns 1 match found: see below..
Save matched names in matched.dat (y or n)?
n
Save unmatched names in unmatched.dat (y or n)?
n
Please enter name to match:
williammas
Wyllym
Number of Matches: 1
Compare again (y or n)?
y
Please enter name to match:
willeans
Wyllym
Number of Matches: 1
Compare again (y or n)?
y
Please enter name to match:
willems
Wyllym
Number of Matches: 1
-----------

obviously your new cluster algorithm here supposed to return 1 cluster which is "wyllym" but it returns "no matches found which is absolutely wrong...

Sorry the response speed is not up there. I am still very busy. We are working around some very deep bugs in a production system.

I am still puzzled by the difference in the results you get from executing the program. When I use the base.dat that you e-mailed a while ago, I get 13 matches for each of "Williammas", "willeans" and "willems". It still seems to me that such discrepancies are most likely caused by oddities in the file format. This has proven out in at least one other case. If you are unable to find a problem with that, please send me the exact file you are using for the "Williammas" test, and I will try it with that.

ASKER

are you on line?
well, I will try to sort out that problem of williams..hope it could be the format file of datafile..
anyway could u explain about the "smith" basename

how could your program classify the cluster as my previous question...

how come I got these follwoing results:
what cluster are they represented and so on?Please enter number of base names to match:
3
Please enter base name to match:
smeath
Please enter base name to match:
smeeth
Please enter base name to match:
smeethe
SmittheSmeath
SmittheSmite
Smiyth Smity
Number of Matches: 20
SmittheSmeath
SmittheSmite
Smiyth Smity
Number of Matches: 20
Smitthe Smeath
SmittheSmite
Smiyth Smity
Number of Matches: 20

Smiythe Smeath
Smiythe Smeath
Compare again (y or n)?

ASKER

the requirement of this question is to find the cluster and results how many cluster and outputname in each cluster
would be? so the program must retrun:
1. the number of cluster which the program has classified.
2. the outputname from each cluster

for example:

enter "smith" basename the program will return:

1 cluster has been classified
smith
smeth
smeeth
...
...
..
---------

or another example:

3 clusters has been classified:
--------
cluster1
a
b
c
--------
cluster2
b
c
--------
cluster3
a
d
---------

u can output as a coulum or row it's up to you
or maybe use a 10 names for each row.

------------

could u update your program..so I can see easiyl if something went wrong...

-------

many thanks,
kORSILA

ASKER

ah ah I just sorted out "No macths found" to return 1 match from deleting the last space of the name in datfile..
however, i have discovered that our results differs becus of the different compiler...
be4 I used cc -Aa filename.c
the program returns "a.out"

and I just changed the compiler to be compile with CC (capital letter)

so now I have used "CC -o clus(outputname) filename.c

now the errors have occurred:

1 error and 3 warning as following line:

unlink("matrix.tmp");---1) error:"undefined function unlink called" (1286)

void main(int argc,char *argv[])-- 2 warning:"argc not used" (117)

char last,code,*p;---1 warning:last may have been used before set (137)

could u fix the error?

ASKER

again this is very crucial... found out that the space after each name in datafile have caused the efficiency of algorithm to mismatched outputnames..
so i added more 20 points for you to implement the algorithm which can ignore space after names indatafile..so I DON'T need to delete space one by one til 1000 names..

this algorithm should ignore the space after name in datafile...

---------------

hope it won't take too long...with this one and the cluster things which I have mentioned above

many thanks,
Korsila

ASKER

I don't know I cann't add more points ..is 300 the maximum point?

ASKER

wow wow wow...
after i got rid of the space in datafile..
I was impressed with your algorithm..it works this time.
wooooopy, unbelivable...:)

let me test with other basenames..

also could you change a bit of saving file code to"matched.dat"
I would like the program to save "output from screen"
to matched.dat

-------
and one last thing a simple algorithm to get rid of space after name in datfile...
---------------------

I need to make more test to make sure you got, non-overlaping names to be classified as a cluster as well..in the mean time could you fix the output saving code out and the ignoring space thing...

many thanks,
Korsila

Now that we are getting similar results, we'll move on to the next step. I will add identifying non-overlapping names as clusters. As of now, I am interpreting this requirement as purely in addition to the algorithm as it currently exists. Is that correct?

Also, I will modify the program to ignore any trailing blanks on names.

And lastly I will add the cluster output to matched.dat.

P.S. as you may have found, you can ignore the warnings. I assume you had to substitute something else for unlink. Just to avoid extra hassle for both of us, what did you use? Then perhaps I can use that on my end as well, and avoid the need for a fixup on your side.

ASKER

well, I have tested it with 3 basename: mooren, moorham and moorhen

but your algorithm returns only 1 cluster which was overlaping:
Moorham
Moorhen
Morham
Morhan
Morhen
Morman
-------
the program suppose to return non-overlaping as one clusyter as well..
so another cluster is non-overlaping which contains 14 basenames:
Moeran
Mooran
Mooren
Morahan
Moran
Morean
Moreen
Moreham
Morehan
Moren
Morine
Morrain
Morran
Morren

-------
to make it clearer

**** is my comment and requirement

1)you said:hese has a nice mathematical flavour to it. However, there is more than 1 cluster set possible. Here
for instance you could also choose: a, bc, d, de. For a result set of abc and bcd you could get cluster sets of a, bc, d or abc, bcd.

**** i think the cluster should be only 1 way which
contains only 3 clusters from clustering abc and bcd:
bc
a
d

firstlky should find dupilcated retrun cluster inn this case no duplicated, therefore, find the maximum
overlaping and then return cluster which is "bc"..the program again try to fiond another overlaping but there is no more in this case... so lastly return non -overlaping...which is "a", "d"

-----------

2)you siad: So now the question of which to choose appears. You could answer the smallest (although in the last
case it's not obvious that the smallest one is "best") or this question did not appear to have a good answer, I abandoned this tactic as well.
***** the answer should return al the cluster which i have mentioned above...1. duplicated, 2.maximum overlaping
3, non-overlaping...

3)you said :What I have settled on is the following: a sequence of names for a base name is a cluster if it contains at least 2 names and is the largest sequence containing some names that appears for any other base name.
So for the original example the clusters would be abc and de.

****** not the one I expect from the program..as i mentioned it should returns 4 clusters:abc (duplictaed
and maximum overlaping), de (another maximum overlaping), d and e (another overlaping) and if there is "f" in basename a it should be returning "f" as a cluster as well since non-overlaping...

----------------

so now the output program should look like this:

please enter number of basename? no limited maximum no. of basename.

return number of clusters: forexample
3 clusters has been clasified from these base names

and then show each basename in each clusters:

cluster1:
a, b, c, etc..

cluster2:
b,d, c...

cluter3:
e,f ...
------------------
then save all output on the screen into matched.dat --you can do it first at the beginning..if it's more easier..

--------

algorithm for ignoring or getting rid of space afer nbame in datafile

--------
and that's it...last try...hope this time would work perfectly ...

p.s. I would like to increase some more points but couldn't do that..is there any way..?

many thanks,
Korsila

ASKER

that sounds nice !!!!

***** is my comment

Now that we are getting similar results, we'll move on to the next step. I will add identifying non-overlapping
names as clusters. As of now, I am interpreting this requirement as purely in addition to the algorithm
as it currently exists. Is that correct?

**** yes, you need to cluster non-overlaping and extra case of requirement which I HAVE MENTIONED ABOVE: maimum overlaping , and another overlaping etc...

Also, I will modify the program to ignore any trailing blanks on names.

***** yes, please

And lastly I will add the cluster output to matched.dat.

**** cluster output and matching output on screen as well..if you could...

seems like we are getting on quiet well...hope to finish this part as soon as possible...

I would like to add more point for the task of explaoining the algorithms..as far as i can see I have no idea of any piece of your codes..how it works and what's it for?

ASKER

I have recived this follwoing confirms on the screen, but it hasn't changed of what it should be...

Question points have been changed from 300 to 320.
Comment added to history.

-----------------

ASKER

I assume you had to substitute something else
for unlink. Just to avoid extra hassle for both of us, what did you use? Then perhaps I can use that
on my end as well, and avoid the need for a fixup on your side.

******** you can suggest me, whatever you use, let me know and i will change to it as the same....

Well, I use unlink, but that doesn't seem to work for you....

P.S. I suggest, as I have in the past, that adding on points for new requirements is not a good idea anyway. If you are will to add points for detailed comments, that would best be dealt with in a separate question.

ASKER

Imladris,
ok I will put A new requirement of explaining the codes separately after you could get this done first...

hope it won't take too long to get the codes done...

best regards,

Korsila

ASKER

unlink, it's ok for me when I COMPILE WITH cc (small letters)...but what it's for? since i have tried to change to "linkage" but error has occurred with the same comment..
anyway, it doesn't matter now becus i can run it through...

OK. Here is a new version. It removes trailing blanks and newlines. It adds the cluster output to matched.dat and it will display non-overlapped clusters and singletons as Unclusters. It also uses unlink (to remove temporary files) which will work, if I have understood correctly, as long as you compile with lowercase cc.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

#define NUMBASE 10

int phonexMatch(char* name1, char* name2);
int phonex(char* name);
int isvowelY(char c);
void getinput(char *prompt,char n[]);
int parseLine(FILE *fp,char **pa,char *line);
int clustermatch(char **src,int ix,int size,char **tgt,int ts);

char name[NUMBASE][150];

void main(int argc,char *argv[])
{ int matched,msave,usave,mc,bnn,i,pos,maxm,j,rs,ts,k,clm,size,x,nos,cs,l;
char line[400],line2[400],**src,**tgt;
long fpos;
FILE *namefile,*mfile,*ufile,*matrix,*mtmp;

printf("Save matched names in matched.dat (y or n)?\n");
gets(line);
msave=(toupper(line[0])=='Y');
printf("Save unmatched names in unmatched.dat (y or n)?\n");
gets(line);
usave=(toupper(line[0])=='Y');
namefile=fopen("surnames.dat","r+");
if(msave)mfile=fopen("matched.dat","w");
if(usave)ufile=fopen("unmatched.dat","w");
matrix=fopen("matrix.tmp","w+");
do
{ printf("Please enter number of base names to match:\n");
       gets(line);
       bnn=atoi(line);
if(bnn>NUMBASE)
{ printf("Too many base names!\n");
exit(1);
}
for(i=0; i<bnn; ++i)
{ printf("Please enter base name to match:\n");
gets(name[i]);
}
maxm=0;
for(i=0; i<bnn; ++i)
{ fseek(namefile,0L,0);
matched=0;
mc=0;
              printf("%s: ",name[i]);
              pos=strlen(name[i])+2;
while(fgets(line,400,namefile)!=NULL)
              { for(x=strlen(line)-1; line[x]==' ' || line[x]=='\n'; --x);
line[x+1]='\0';
if(phonexMatch(name[i],line))
{ if(pos+strlen(line)>80)
{ printf("\n ");
pos=8;
}
printf("%s ",line);
fprintf(matrix," %s",line);
                          pos+=strlen(line)+1;
++mc;
if(msave)
{ if(matched==0)fprintf(mfile,"Match name %s\n",name);
fprintf(mfile,"%s\n",line);
}
matched=1;
}
}
fprintf(matrix,"\n");
if(matched==1)
{ if(msave)fprintf(mfile,"----------------------------------------\n");
printf("\nNumber of Matches: %d\n",mc);
                   if(mc>maxm)maxm=mc;
}
else
{ printf("\nNo Matches Found.\n");
if(usave)
{ fprintf(ufile,"Match name %s\n",name);
fprintf(ufile,"No Matches found\n");
fprintf(ufile,"--------------------\n");
}
}
}
       printf("\n");
fseek(matrix,0L,0);
src=(char **)calloc(sizeof(char *),maxm);
tgt=(char **)calloc(sizeof(char *),maxm);
for(i=0; i<bnn; ++i)
{ rs=parseLine(matrix,src,line);
fpos=ftell(matrix);
nos=0;
for(k=0; k<rs-1; ++k)
{ for(size=2; k+size<=rs; ++size)
{ fseek(matrix,fpos,0);
clm=0;
for(j=i+1; j<bnn && !clm; ++j)
{ ts=parseLine(matrix,tgt,line2);
clm=(clustermatch(src,k,size,tgt,ts)!=-1);
}
if(!clm)break;
}
--size;
if(size>=2)
{ if(k>nos)
{ printf("Unclust: ");
for(j=nos; j<k; ++j)printf("%s ",src[j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nUnclust: ");
for(j=nos; j<k; ++j)fprintf(mfile,"%s ",src[j]);
       fprintf(mfile,"\n");
}
}
nos=k+size;
printf("Cluster: ");
for(j=0; j<size; ++j)printf("%s ",src[k+j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nCluster: ");
pos=9;
for(j=0; j<size; ++j)
{ if(pos+strlen(src[k+j])>80)
{ fprintf(mfile,"\n ");
pos=9;
}
fprintf(mfile,"%s ",src[k+j]);
pos+=strlen(src[k+j])+1;
}
fprintf(mfile,"\n");
}
fseek(matrix,0L,0);
mtmp=fopen("mtmp.tmp","w+");
fgets(line2,400,matrix);
fputs(line2,mtmp);
for(j=i+1; j<bnn; ++j)
{ ts=parseLine(matrix,tgt,line2);
cs=clustermatch(src,k,size,tgt,ts);
for(l=0; l<ts; ++l)
{ if(cs==-1 || l<cs || l>=cs+size)
fprintf(mtmp,"%s ",tgt[l]);
}
fprintf(mtmp,"\n");
}
fclose(matrix);
fclose(mtmp);
unlink("matrix.tmp");
rename("mtmp.tmp","matrix.tmp");
matrix=fopen("matrix.tmp","r+");
fgets(line2,400,matrix);
k+=size-1;
}
}
if(nos==0 && rs>0)
              { printf("Unclust: ");
for(j=nos; j<=k; ++j)printf("%s ",src[j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nUnclust: ");
for(j=nos; j<k; ++j)fprintf(mfile,"%s ",src[j]);
fprintf(mfile,"\n");
}
}
mtmp=fopen("mtmp.tmp","w+");
fseek(matrix,0L,0);
fgets(line,400,matrix);
while(fgets(line,400,matrix)!=NULL)fputs(line,mtmp);
fclose(matrix);
fclose(mtmp);
unlink("matrix.tmp");
rename("mtmp.tmp","matrix.tmp");
matrix=fopen("matrix.tmp","r+");
}
printf("Compare again (y or n)?\n");
gets(line);
} while(toupper(line[0])=='Y');
free(src);
free(tgt);
fclose(namefile);
fclose(matrix);
unlink("matrix.tmp");
unlink("mtmp.tmp");
if(msave)fclose(mfile);
if(usave)fclose(ufile);
}

int parseLine(FILE *fp,char **pa,char *line)
{ int ix;
char *p;

fgets(line,400,fp);
pa[ix=0]=strtok(line," \n");
      if(pa[0]==NULL)return(0);
      while((p=strtok(NULL," \n"))!=NULL)pa[++ix]=p;
      return(ix+1);
}

int clustermatch(char **src,int ix,int size,char **tgt,int ts)
{ int i,j;

for(i=0; i<=ts-size; ++i)
      { for(j=0; j<size && strcmp(src[ix+j],tgt[i+j])==0; ++j);
       if(j==size)return(i);
      }
      return(-1);
}

/* prompt user for input */
/* get input */
/* copy it safely into provided variable */

void getinput(char *prompt,char n[])
{ char ipc[150];

printf("%s:\n",prompt);
gets(ipc);
strncpy(n,ipc,19);
n[19]='\0';
return;
}

#define MAXLINE 150

unsigned long comps;

int phonexMatch(char* nameA, char* nameB)
/* calculates phonex codes for two names */
/* and returns 1 if codes match */
{
int match;
char name1[MAXLINE];
char name2[MAXLINE];

strcpy(name1, nameA);
strcpy(name2, nameB);
strupr(name1);
strupr(name2);
match=0;
/* turn names into phonex code equivalents */
if (!phonex(name1))
{
       printf("Error coding name1.\n");
}
else if (!phonex(name2))
{
       printf("Error coding name2.\n");
}
else if (strcmp(name1, name2) == 0)
{
/* phonex codes are the same */
match = 1;
}

return match;
}

/* The following C function to create Phonex codes is converted from
SDX v1.1 coded by Michael Cooley. Released to the public domain, 1994. */
int phonex(char* name)
{
char last,code,*p;
int y=1;

/*
/ PHONEX PRE-PROCESSING ...
/

/ Deletions effected by replacing with next letter which
/ will be ignored due to duplicate handling of Soundex code.
/ This is faster than 'moving' all subsequent letters.

/ Remove any trailing Ss */
while (name[strlen(name)-1] == 'S')
{
name[strlen(name)-1] = '\0';

comps += 1;
}

/* Phonetic equivalents of first 2 characters */
/* Works since duplicate letters are ignored */
if (strncmp(name, "KN", 2) == 0)
{
*name = 'N'; /* KN.. == N..*/

/* AMENDMENT */
comps += 2;
}
else if (strncmp(name, "PH", 2) == 0)
{
*name = 'F'; /* PH.. == F.. (H ignored anyway) */

/* AMENDMENT */
comps += 2;
}
else if (strncmp(name, "WR", 2) == 0)
{
*name = 'R'; /* WR.. == R.. */

/* AMENDMENT */
comps += 2;
}

/* Special case, ignore H first letter (subsequent Hs ignored anyway) */
/* Works since duplicate letters are ignored */
if (*name == 'H')
{
*name = *(name+1);

/* AMENDMENT */
comps += 1;
}

/* Phonetic equivalents of first character */
switch(*name)
{
case 'E':
case 'I':
case 'O':
case 'U':
case 'Y':
/* vowel equivalence A = E, I, O, U, Y */
*name = 'A';
/* AMENDMENT */
comps += 5;
break;
case 'P':
/* consonant equivalence B = P */
*name = 'B';
/* AMENDMENT */
comps += 6;
break;
case 'V':
/* consonant equivalence F = V */
*name = 'F';
/* AMENDMENT */
comps += 7;
break;
case 'K':
case 'Q':
/* consonant equivalence C = K */
*name = 'C';
/* AMENDMENT */
comps += 9;
break;
case 'J':
/* consonant equivalence G = J */
*name = 'G';
/* AMENDMENT */
comps += 10;
break;
case 'Z':
/* consonant equivalence S = Z */
*name = 'S';
/* AMENDMENT */
comps += 11;
break;
default:
/* no equivalence for letter, no change */
break;
}

/*
/ MODIFIED SOUNDEX CODE
*/

p=name;
for ( ;
*p!='\0' /* not at end of name */
&& *p!=' ' /* terminate if encountered */
&& *p!=',' /* terminate if encountered */
&& y < 4; /* code no more than 4 characters */
p++)
{
switch(*p)
{
case 'B':
case 'P':
case 'F':
case 'V':
code='1';
/* AMENDMENT */
comps += 4;
break;
case 'C':
case 'S':
case 'K':
case 'G':
case 'J':
case 'Q':
case 'X':
case 'Z':
code='2';
/* AMENDMENT */
comps += 12;
break;
case 'D':
case 'T':
if (*(p+1) != 'C')
{
code='3';
}
/* else do not code: TC.. = DC.. = C.. */
/* AMENDMENT */
comps += 14;
break;
case 'L':
if (isvowelY(*(p+1)) || (*(p+1) == '\0'))
{
/* only code if followed by vowel of end of name */
code='4';
}
/* AMENDMENT */
comps += 15;
break;
case 'M':
case 'N':
if ((*(p+1) == 'D') || (*(p+1) == 'G'))
{
/* NG.. = ND.. = N.. */
*(p+1) = *p;
}
code='5';
/* AMENDMENT */
comps += 17;
break;
case 'R':
if (isvowelY(*(p+1)) || (*(p+1) == '\0'))
{
/* only code if followed by vowel of end of name */
code='6';
}
/* AMENDMENT */
comps += 18;
break;
default:
code=0;
break;
}

/* AMENDMENT */
comps += 3;

if(
last!=code /* not the same as previous
position */
&& code!=0 /* is not a vowel, etc. */
&& p!=name) /* is not the first letter
*/
name[y++]=code; /* then code it */

last=name[y-1];
if (y==1) last=code; /* special case for 1st letter */
}

while(y<4) name[y++]='0'; /* fill in with trailing zeros */
name[4]='\0'; /* NULL terminate after 4
characters */
return 1;
}

int isvowelY(char c)
/* returns 1 if c is a vowel or Y, 0 otherwise */
{
/* convert to upper case */
c = toupper(c);

/* AMENDMENT: average no. of comparisons */
comps += 3;

return ( (c == 'A') ||
(c == 'E') ||
(c == 'I') ||
(c == 'O') ||
(c == 'U') ||
(c == 'Y') );
}

char *strupr(char *n)
{ for( ; *n!='\0'; ++n)*n=toupper(*n);
return(n);
}

ASKER

woooooopy..I was starting to be worried that you might have been busy and forgot my question :) ...well, am going to test your algorithm against my datafile now...

many many thanks for your quick response again..i am feling released more or less...will get back to you with the results...

ASKER

I have tested your algorithms..it seems working pretty well..apart from 2 problems which I need you to solve out for me.

1) in the saving file-matched.dat

Match name moeran----*** same output name as below
Moeran
Mooran
Mooren
Morahan
Moran
Morean
Moreen
Moreham
Morehan
Moren
Morine
Morrain
Morran
Morren
----------------------------------------

Match name moeran---->same (it should be moorham not moeran)
Moorham
Moorhen
Morham
Morhan
Morhen
Morman
----------------------------------------
Match name moeran---->same (it should be moorhen not moeran)
Moorham
Moorhen
Morham
Morhan
Morhen
Morman
----------------------------------------

2) this is the big problem and it needed to be solved it out ...

the maximum of the basename input which i can enter to macth is only 10 basenames...more than that the program returns "too many basenames" is it possible if you could change the number of input to be unlimited no. such as 100 names for input basename or more than that...

many thanks,

korsila

p.s. I have not yet tested the condition of another maximum overlaping cluster...

for example: 5 output basenames

a b c d
a b c
a b
d
e

program should returns: 3 clusters

1)abc --maximum overlaping (3 baesnames overlaped)
2)ab ----another maximum overlaping (2 basenames overlaped)
3)d --- another overlaped)1 basename overlaped
4)e --non-overlapping

I hope your algorithm meets my conditions and requirements I have...
---------------
as far as i can see it works pretty well with these following conditions:

for example: 5 output basenames

a b c
a b c
d e
d e
f

your algorithm returns 3 clusters':
1)abc -----overlaping
2)de ------ another overlaping and not overlaping with the previous overlapped names "abc"

3)f -----non-overlaping..

-------------------
anyway so far so good...

I will think about how to open up the number of base names. While I am doing that, and you're putting it through more paces, we could discuss the clustering you're proposing:

a b c d
a b c
a b
d
e

program should returns: 3 clusters

1)abc --maximum overlaping (3 baesnames overlaped)
2)ab ----another maximum overlaping (2 basenames overlaped)
3)d --- another overlaped)1 basename overlaped
4)e --non-overlapping

I don't think the algorithm as it stands is going to do this. I specifically included the phrase "maximum" to indicate that it would pick up abc, but not ab (since abc is bigger, and so it is the "maximum" cluster). The difficulty with this clustering is that it is not clear to me how it is being restricted. If "ab" is a cluster, why isn't "bc" a cluster? And from there, and since "d" is a cluster, why isn't "a" a cluster.
This is not to say that the algorithm can't be modified. But we will have to come to some real specific rules, and sometimes forcing yourself to explain why something doesn't qualify is helpful.

ASKER

Imladris,
many thanks for a quick response
***** my comment

I don't think the algorithm as it stands is going to do this.

***** hmmmm as far as i can see, your algorithm returns "maximum of overlaping" and "non-overlaping" which are the main part of my requirement..and it works pretty well-no doubt.
----------------------
I specifically included the phrase "maximum"
to indicate that it would pick up abc, but not ab (since abc is bigger, and so it is the "maximum" cluster).

**** this is absoluately right , I do agree..since this is the main cluster which I should get
-------------------------------

The difficulty with this clustering is that it is not clear to me how it is being restricted. If "ab" is a cluster, why isn't "bc" a cluster?

****
a b c d
a b c
a b
d
e

well, as i mentioned be4 ..when we get dupilated cluster or maximum overlaping cluster and non-overlaping cluster..now we need to find sub-maximum cluster by eliminating basename in each row.

if we try to eliminate each basenames in each row using "left to right" or right to left" to find another maximum and sub-maximum of cluster respectively..

firstly, eliminate "d" from each row (righ to left)
as u can see you got "abc" cluster as duplicted and maximum overlaping cluster...
a b c
a b c
a b

e
however, from your algorithm we have already got "abc" cluster..

then the program repeat and start again from the firsdt output above..

now start eliminating "c" from each row and u got :
a b d
a b
a b
d
e

as u can see you can get another cluster from elilminating "c" that's "ab" cluster which is a duplicated cluster..we can say that "ab" is another maximum cluster from eliminating basename...

then again repaet the process and start to eiminate another basename which is "b"
a c d
a c
a
d
e

as u can se we can not have any cluster from elimintaing "c" since there is no duplicated basename between the row..

next elimintaing "a" :

b c d
b c
b
d
e

no cluster from eliminating "a" as well...

so next start eliminating by 2 basenames (right to left again) now elimintaing "cd"':
a b
a b
a b
d
e

this process would repaet again and again (eliminating:db, da, cb, ca, ba, bc, bd, dcb, dca, dba, cba, and dcba--last elimination) )

til we deleted all the maimum basenames in row (abcd)

u might wionder hat how come we get "d" cluster..
it's becus we eliminate "cba"(right to left) or abc (left to right) we get "d" cluster by duplicating basename...

and "e" cluster is based on your algorithm...

basic idea is when we eliminate the basename in each row and if we could find a duplicated bastnames we can say that is "cluster" or another maximum of cluster by eliminating the basenames...

so fainally, the algorithm returns 4 clusters:
1)abc ----from your algorithm (maximum overlaping cluster)
2)ab ----duplictaing cluster from eliminating basename in each row
3)d-----duplictaing cluster from eliminating basename in each row
4)e ------ from your algorithm (non -overlaping cluster)

and that's it..
--------------

This is not to say that the algorithm can't be modified. But we will have to come to some real specific rules, and sometimes forcing yourself to explain why something doesn't qualify is helpful.

***** I do understand pefectly that this requirement is really hard and unclear at some point..and as far as i can see you have come up with a great idea at the beginning ..that made my work even more easier...am hoping you have more or less idea in your mind when you read my comment in here...discussing with you here it's worth understanding the whole of my work...:) please comment and tell me if it's not easy to modify the codes since I think you have done the job perfectly so far...however, if you could come up with any efficient algorithm again, it's worth waiting..
and it would be no doubt to accept this one as a briliant answer...
-----------------
korsila..

p.s. it's a bit of my fault that didn't explain clearly at the beginning, but glad that you got the whole idea working til now...

But according to that algorithm:

>this process would repaet again and again (eliminating:db, da, cb, ca, ba, bc, bd, dcb, dca, dba, cba,
> and dcba--last elimination) )

you will (at step 7) be eliminating da, and in that case you would get:

(a) b c (d)
(a) b c
(a) b
(d)
e

and so "bc" would indeed be a cluster. The algorithm you propose appears to find "every" cluster, i.e. a,b,c,d,e,ab,bc,ac,abc.

ASKER

Thank for your discussing and comment...

you will (at step 7) be eliminating da, and in that case you would get:

(a) b c (d)
(a) b c
(a) b
(d)
e

and so "bc" would indeed be a cluster. The algorithm you propose appears to find "every" cluster, i.e.
a,b,c,d,e,ab,bc,ac,abc.

no, it is not that..when you eliminate "da" it means "d and a" not d or a as youm thought..this means that
you got no cluster from eliminating "da" as follwoing metrix can be best to explain:

(a) b c (d)
a b c
a b
d
e

------------
ad will only be eliminated from first low since other rows have got no "d and a" apart from the first row (see above).

hope I make it clear a bit more...

thanks..
Korsila

Ah, so you don't eliminate names from a set of matches unless you they're ALL in there. In that case, I think we could simplify the formulation of the requirement to the following:

A group of names is a cluster if and only if it is the set of matches for a basename, and is also, as a group, represented in the matches for at least one other name.

In believe this produces the same result, but along much simpler lines. For instance, for our canonical example:

a b c d
a b c
a b
d
e

We would check the first row against all others (no result) then the second row (it gets a match against the first row: abc), then the third row (it gets a match against row 1 or 2: ab) then the fourth row (it gets a match against row 1: d).
So the result is: abc, ab, d.

For the matches:

a b c
b c d

there would be no clusters from either algorithm (row 1 is not to be found in row 2 and vica versa, and eliminating a, b, c, ab, bc, ac or abc also doesn't result in matches either).

So, in this proposal, the clusters are a subset of the basename matches. Or, to put it another way, any cluster will be equal to one of the basename matches.

What do you think? If you feel that is too restrictive, please provide a counterexample for illumination.

P.S. Note that the current algorithm (in the program you have) will take:

a b c
b c d

and show "bc" as a cluster.

ASKER

Imladris,
you have really come up with a great idea of setting the rule of clasifying the cluster again...am so impressed..

well, I need to find out what exactly rule of clustering algorithm will meet my requirements and can solve my problems...otherwise we would wait our precious time to figure out the clustering codes and so on...

will get back to you with my specification of clustering rules.

Korsila

p.s. you have given me a lots of ideas how to claasify the cluster..seems like there are plenty of methods and ways to fulfil my requirement especially your idea..I need to think about it first which one suits my work and meets my requirement...

ASKER

Imladris, I have re-read any of your comments here...i just relised that you are really smart..:)I found out every of your coments were very useful to me...and perhap I need to get back and follow your ideas adding a bit of codeand work on it.
you also made me got load of ideas against my own requirement now :) so I need to think carefully what exactly rule of cluster i would like to have...maybe follow yours thoughts and rules...anyway, will get back to you when I come up with the best rules...

korsila

ASKER

well, I have been thinking for the whole day..

I like any of your ideas which posted here and if you could figure out for the codes if would be great...
------------
**Method 1**

OK. I have an algorithm written up and working. First a word on the requirements. I thought about exactly
what should be reported as a cluster for a fair while. Given the result set:

a 4 abcd
b 3 abc
c 3 abc
d 3 ade
e 2 de

I can imagine a number of different answers. The first answer would be: a cluster is any name or group of names that appears for more than one base name. This is nice and general and should be easy to do. The clusters would be:

a, b, c, ab, bc, ac, abc, d, e, de

****this is should be the first method in modifying code

-----------------------------
** Method 2 **

A cluster set is a set of groups of names such that
you can construct the result set using the clusters. For this one answer could be:

abc, de, a, d

It fulfills the requirement in that, if we number the clusters 1 through 4, you could construct the
result set as:

a 1 4
b 1
c 1
d 3 2
e 2

These has a nice mathematical flavour to it. However, there is more than 1 cluster set possible. Here
for instance you could also choose: a, bc, d, de.

***** could you do manage to do this for implementation as a second method.
---------------------------------
**Method 3 **

For a result set of abc and bcd you could get cluster
sets of a, bc, d or abc, bcd. So now the question of which to choose appears.

*** I would choose a, bc and d to be appeared. if you could do the codes for this,

group of names is a cluster if and only if it is the set of matches for a basename, and is also, as a group, represented in the matches for at least one other name.

In believe this produces the same result, but along much simpler lines. For instance, for our canonical example:

a b c d
a b c
a b
d
e

We would check the first row against all others (no result) then the second row (it gets a match against
the first row: abc), then the third row (it gets a match against row 1 or 2: ab) then the fourth row
(it gets a match against row 1: d).
So the result is: abc, ab, d.

******* could you implement this as a thrid method.

----------------------

** to be inclused in Method 3 condition **

For the matches:

a b c
b c d

there would be no clusters from either algorithm (row 1 is not to be found in row 2 and vica versa,
and eliminating a, b, c, ab, bc, ac or abc also doesn't result in matches either).

So, in this proposal, the clusters are a subset of the basename matches. Or, to put it another way,
any cluster will be equal to one of the basename matches.

P.S. Note that the current algorithm (in the program you have) will take:

a b c
b c d

and show "bc" as a cluster.

**** well if you could solve this , please include this into the thrid method.

----------------------------

so until now we have got 3 methods which will be given when
user enter input basenames (unlimited basename) as results.

--------------------
**** Method 4 ***

am going to suggest as a forth metod:

4.1) eliminate duplicated row and return cluster as duplicated cluster:
abc
abc
return abc as a cluster.

4.2) find overlaping cluster and return maximum overlaping:
abcd
abc
return cluster abc

4.3)find all non-overlaping and return as a non-overlaping cluster:

a b
b d
retrun cluster a and d in this case:

--------

4.4) (your answer) a cluster is any name or group of names that appears for more than one basename..

a b
a b c
c d

in this case return cluster a, b, c, ab

------------------

From above, here the interface requirement:

how many basename? unlimited basename to be enter here:
1000

please enter basename:
smith
please enter basename:
smithe

...
...
til 1000 basenames

p.s.*** if you cann't do this u can input basename as a datafile forexample:

please enter input datfile:
basename.dat

** maybe it's hard...

after finish entering basename (1000 in this example)

the program return 4 methods of clustering

method 1: return any cluster which fits this requirement (see above method 1)

method 2: return any cluster which fits this requirement (see above method 2)

method 3: return any cluster which fits this requirement (see above method 3)

method 4: return any cluster which fits this requirement (see above method 4)

p.s. every method (1 -4) should return at least 2 basic results
if it 's likely (your codes have provided these):
1)duplicated cluster
2) non-overlaping cluster

---------------------

ASKER

all of your comments here is the new fantastic idea I jut thought..what a combination :)

hope it's not that tough...please get back to me if you got any better idea than those...

many thanks,
korsila

Well, that makes the project almost 4 times as large. Since I foresee sizeable discussions on the new ideas I propose the following, both to keep this "page" manageable (it's going to start takeing forever to load this discussion soon), and to make some concrete progress.

As an answer to this question, I will provide the unlimited base name entry, and cluster returns according to method 1 and method 3 (both of which are now reasonably well defined, and with which we have had some experience). I then suggest relegating each additional clustering method you wish to add to the program to subsequent questions. What do you think?

ASKER

you are abosolutely right..yes, that would be fiarly excellent agreement...

I will be waiting then..can't wait to see this to be done...!!

many thanks,
Korsila

ASKER

p.s. don't forget the duplicated cluster and non-overlaping cluster should be included as well...:)

thanxxxxx...

OK, here is a new version of the program that supports an unlimited number of basenames and two cluster identification methods.

As I set out to implement method number 1, I realized that it is really the current (maximum overlapping) clustering method, with the addition of all the subclusters (i.e. when abc is found to be a cluster, method 1 will also report on a,b,c,ab,bc and ac). This can be implemented, however the total number of subclusters will be 2**n-1 (2 to the power of n minus 1). This means that for a cluster of 20 (which seems entirely likely with your data) you will get 2**20-1 subclusters for a total of over 1 million clusters. This seemed neither theoritically usefule nor practically feasible (the disk space requirements would be monstrous). So I have left it as the previous algorithm.

Clustering method number 2 is from method number 3 which I originally proposed, in which a group of names that are an answer to a basename are a cluster if and only if they appear for a different base name.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

int phonexMatch(char* name1, char* name2);
int phonex(char* name);
int isvowelY(char c);
void getinput(char *prompt,char n[]);
void maxcluster(int,int);
void subcluster(int,int);
int parseLine(FILE *fp,char **pa,char *line);
int clustermatch(char **src,int ix,int size,char **tgt,int ts);

int msave;
FILE *mfile;
char line[400],line2[400];

void main(int argc,char *argv[])
{ int matched,usave,mc,bnn,i,pos,maxm,j,rs,ts,k,clm,size,x,nos,cs,l,cag;
char name[150];
long fpos;
FILE *namefile,*ufile,*matrix,*mtmp,*bnames;

printf("Save matched names in matched.dat (y or n)?\n");
gets(line);
msave=(toupper(line[0])=='Y');
printf("Save unmatched names in unmatched.dat (y or n)?\n");
gets(line);
usave=(toupper(line[0])=='Y');
namefile=fopen("surnames.dat","r+");
if(msave)mfile=fopen("matched.dat","w");
if(usave)ufile=fopen("unmatched.dat","w");
do
{ printf("Please select clustering algorithm (1-2):\n");
       gets(line);
       cag=atoi(line);
printf("Please enter number of base names to match:\n");
       gets(line);
       bnn=atoi(line);
matrix=fopen("matrix.tmp","w+");
       bnames=fopen("bnames.tmp","w+");
for(i=0; i<bnn; ++i)
{ printf("Please enter base name to match:\n");
gets(name);
              fputs(name,bnames);
              fputs("\n",bnames);
}
       fseek(bnames,0L,0);
maxm=0;
for(i=0; i<bnn; ++i)
{ fseek(namefile,0L,0);
matched=0;
mc=0;
              fgets(name,150,bnames);
              name[strlen(name)-1]='\0';
              printf("%s: ",name);
              pos=strlen(name)+2;
while(fgets(line,400,namefile)!=NULL)
              { for(x=strlen(line)-1; line[x]==' ' || line[x]=='\n'; --x);
line[x+1]='\0';
if(phonexMatch(name,line))
{ if(pos+strlen(line)>80)
{ printf("\n ");
pos=8;
}
printf("%s ",line);
fprintf(matrix," %s",line);
                          pos+=strlen(line)+1;
++mc;
if(msave)
{ if(matched==0)fprintf(mfile,"Match name %s\n",name);
fprintf(mfile,"%s\n",line);
}
matched=1;
}
}
fprintf(matrix,"\n");
if(matched==1)
{ if(msave)fprintf(mfile,"----------------------------------------\n");
printf("\nNumber of Matches: %d\n",mc);
                   if(mc>maxm)maxm=mc;
}
else
{ printf("\nNo Matches Found.\n");
if(usave)
{ fprintf(ufile,"Match name %s\n",name);
fprintf(ufile,"No Matches found\n");
fprintf(ufile,"--------------------\n");
}
}
}
fclose(matrix);
       fclose(bnames);
       printf("\n");
       if(cag==1)maxcluster(bnn,maxm);
       else if(cag==2)subcluster(bnn,maxm);
printf("Compare again (y or n)?\n");
gets(line);
} while(toupper(line[0])=='Y');
fclose(namefile);
unlink("matrix.tmp");
unlink("mtmp.tmp");
if(msave)fclose(mfile);
if(usave)fclose(ufile);
}

void maxcluster(bnn,maxm)
int bnn,maxm;
{ int i,j,k,l,rs,ts,cs,nos,size,pos,clm;
      long fpos;
char **src,**tgt;
      FILE *mtmp,*matrix;

      matrix=fopen("matrix.tmp","r+");
src=(char **)calloc(sizeof(char *),maxm);
tgt=(char **)calloc(sizeof(char *),maxm);
for(i=0; i<bnn; ++i)
{ rs=parseLine(matrix,src,line);
fpos=ftell(matrix);
nos=0;
for(k=0; k<rs-1; ++k)
{ for(size=2; k+size<=rs; ++size)
{ fseek(matrix,fpos,0);
clm=0;
for(j=i+1; j<bnn && !clm; ++j)
{ ts=parseLine(matrix,tgt,line2);
clm=(clustermatch(src,k,size,tgt,ts)!=-1);
}
if(!clm)break;
}
--size;
if(size>=2)
{ if(k>nos)
{ printf("Unclust: ");
for(j=nos; j<k; ++j)printf("%s ",src[j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nUnclust: ");
for(j=nos; j<k; ++j)fprintf(mfile,"%s ",src[j]);
fprintf(mfile,"\n");
}
}
nos=k+size;
printf("Cluster: ");
for(j=0; j<size; ++j)printf("%s ",src[k+j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nCluster: ");
pos=9;
for(j=0; j<size; ++j)
{ if(pos+strlen(src[k+j])>80)
{ fprintf(mfile,"\n ");
pos=9;
}
fprintf(mfile,"%s ",src[k+j]);
pos+=strlen(src[k+j])+1;
}
fprintf(mfile,"\n");
}
fseek(matrix,0L,0);
mtmp=fopen("mtmp.tmp","w+");
fgets(line2,400,matrix);
fputs(line2,mtmp);
for(j=i+1; j<bnn; ++j)
{ ts=parseLine(matrix,tgt,line2);
cs=clustermatch(src,k,size,tgt,ts);
for(l=0; l<ts; ++l)
{ if(cs==-1 || l<cs || l>=cs+size)
fprintf(mtmp,"%s ",tgt[l]);
}
fprintf(mtmp,"\n");
}
fclose(matrix);
fclose(mtmp);
unlink("matrix.tmp");
rename("mtmp.tmp","matrix.tmp");
matrix=fopen("matrix.tmp","r+");
fgets(line2,400,matrix);
k+=size-1;
}
}
if(nos==0 && rs>0)
       { printf("Unclust: ");
for(j=nos; j<=k; ++j)printf("%s ",src[j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nUnclust: ");
for(j=nos; j<k; ++j)fprintf(mfile,"%s ",src[j]);
fprintf(mfile,"\n");
}
}
mtmp=fopen("mtmp.tmp","w+");
fseek(matrix,0L,0);
fgets(line,400,matrix);
while(fgets(line,400,matrix)!=NULL)fputs(line,mtmp);
fclose(matrix);
fclose(mtmp);
unlink("matrix.tmp");
rename("mtmp.tmp","matrix.tmp");
matrix=fopen("matrix.tmp","r+");
}
return;
}

void subcluster(bnn,maxm)
int bnn,maxm;
{ int i,j,rs,ts,pos;
long fpos;
char **src,**tgt;
FILE *matrix,*mtmp;

matrix=fopen("matrix.tmp","r+");
src=(char **)calloc(sizeof(char *),maxm);
tgt=(char **)calloc(sizeof(char *),maxm);
while((rs=parseLine(matrix,src,line))!=-1)
{ fpos=ftell(matrix);
            fseek(matrix,0L,0);
while((ts=parseLine(matrix,tgt,line2))!=-1)
{ if(ftell(matrix)!=fpos && clustermatch(src,0,rs,tgt,ts)!=-1)break;
}
if(ts!=-1)
{ printf("Cluster: ");
for(i=0; i<rs; ++i)printf("%s ",src[i]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nCluster: ");
pos=9;
for(j=0; j<rs; ++j)
{ if(pos+strlen(src[j])>80)
{ fprintf(mfile,"\n ");
pos=9;
}
fprintf(mfile,"%s :",src[j]);
pos+=strlen(src[j])+1;
}
fprintf(mfile,"\n");
}
}
fseek(matrix,0L,0);
            mtmp=fopen("mtmp.tmp","w+");
do
{ fgets(line2,400,matrix);
fputs(line2,mtmp);
} while(ftell(matrix)!=fpos);
while((ts=parseLine(matrix,tgt,line2))!=-1)
{ if(rs!=ts || clustermatch(src,0,rs,tgt,ts)==-1)
{ for(j=0; j<ts; ++j)fprintf(mtmp,"%s ",tgt[j]);
fprintf(mtmp,"\n");
}
}
fclose(matrix);
fclose(mtmp);
unlink("matrix.tmp");
rename("mtmp.tmp","matrix.tmp");
matrix=fopen("matrix.tmp","r+");
            fseek(matrix,fpos,0);
}
return;
}

int parseLine(FILE *fp,char **pa,char *line)
{ int ix;
char *p;

if(fgets(line,400,fp)==NULL)return(-1);
pa[ix=0]=strtok(line," \n");
      if(pa[0]==NULL)return(0);
      while((p=strtok(NULL," \n"))!=NULL)pa[++ix]=p;
      return(ix+1);
}

int clustermatch(char **src,int ix,int size,char **tgt,int ts)
{ int i,j;

for(i=0; i<=ts-size; ++i)
      { for(j=0; j<size && strcmp(src[ix+j],tgt[i+j])==0; ++j);
       if(j==size)return(i);
      }
      return(-1);
}

/* prompt user for input */
/* get input */
/* copy it safely into provided variable */

void getinput(char *prompt,char n[])
{ char ipc[150];

printf("%s:\n",prompt);
gets(ipc);
strncpy(n,ipc,19);
n[19]='\0';
return;
}

#define MAXLINE 150

unsigned long comps;

int phonexMatch(char* nameA, char* nameB)
/* calculates phonex codes for two names */
/* and returns 1 if codes match */
{
int match;
char name1[MAXLINE];
char name2[MAXLINE];

strcpy(name1, nameA);
strcpy(name2, nameB);
strupr(name1);
strupr(name2);
match=0;
/* turn names into phonex code equivalents */
if (!phonex(name1))
{
       printf("Error coding name1.\n");
}
else if (!phonex(name2))
{
       printf("Error coding name2.\n");
}
else if (strcmp(name1, name2) == 0)
{
/* phonex codes are the same */
match = 1;
}

return match;
}

/* The following C function to create Phonex codes is converted from
SDX v1.1 coded by Michael Cooley. Released to the public domain, 1994. */
int phonex(char* name)
{
char last,code,*p;
int y=1;

/*
/ PHONEX PRE-PROCESSING ...
/

/ Deletions effected by replacing with next letter which
/ will be ignored due to duplicate handling of Soundex code.
/ This is faster than 'moving' all subsequent letters.

/ Remove any trailing Ss */
while (name[strlen(name)-1] == 'S')
{
name[strlen(name)-1] = '\0';

comps += 1;
}

/* Phonetic equivalents of first 2 characters */
/* Works since duplicate letters are ignored */
if (strncmp(name, "KN", 2) == 0)
{
*name = 'N'; /* KN.. == N..*/

/* AMENDMENT */
comps += 2;
}
else if (strncmp(name, "PH", 2) == 0)
{
*name = 'F'; /* PH.. == F.. (H ignored anyway) */

/* AMENDMENT */
comps += 2;
}
else if (strncmp(name, "WR", 2) == 0)
{
*name = 'R'; /* WR.. == R.. */

/* AMENDMENT */
comps += 2;
}

/* Special case, ignore H first letter (subsequent Hs ignored anyway) */
/* Works since duplicate letters are ignored */
if (*name == 'H')
{
*name = *(name+1);

/* AMENDMENT */
comps += 1;
}

/* Phonetic equivalents of first character */
switch(*name)
{
case 'E':
case 'I':
case 'O':
case 'U':
case 'Y':
/* vowel equivalence A = E, I, O, U, Y */
*name = 'A';
/* AMENDMENT */
comps += 5;
break;
case 'P':
/* consonant equivalence B = P */
*name = 'B';
/* AMENDMENT */
comps += 6;
break;
case 'V':
/* consonant equivalence F = V */
*name = 'F';
/* AMENDMENT */
comps += 7;
break;
case 'K':
case 'Q':
/* consonant equivalence C = K */
*name = 'C';
/* AMENDMENT */
comps += 9;
break;
case 'J':
/* consonant equivalence G = J */
*name = 'G';
/* AMENDMENT */
comps += 10;
break;
case 'Z':
/* consonant equivalence S = Z */
*name = 'S';
/* AMENDMENT */
comps += 11;
break;
default:
/* no equivalence for letter, no change */
break;
}

/*
/ MODIFIED SOUNDEX CODE
*/

p=name;
for ( ;
*p!='\0' /* not at end of name */
&& *p!=' ' /* terminate if encountered */
&& *p!=',' /* terminate if encountered */
&& y < 4; /* code no more than 4 characters */
p++)
{
switch(*p)
{
case 'B':
case 'P':
case 'F':
case 'V':
code='1';
/* AMENDMENT */
comps += 4;
break;
case 'C':
case 'S':
case 'K':
case 'G':
case 'J':
case 'Q':
case 'X':
case 'Z':
code='2';
/* AMENDMENT */
comps += 12;
break;
case 'D':
case 'T':
if (*(p+1) != 'C')
{
code='3';
}
/* else do not code: TC.. = DC.. = C.. */
/* AMENDMENT */
comps += 14;
break;
case 'L':
if (isvowelY(*(p+1)) || (*(p+1) == '\0'))
{
/* only code if followed by vowel of end of name */
code='4';
}
/* AMENDMENT */
comps += 15;
break;
case 'M':
case 'N':
if ((*(p+1) == 'D') || (*(p+1) == 'G'))
{
/* NG.. = ND.. = N.. */
*(p+1) = *p;
}
code='5';
/* AMENDMENT */
comps += 17;
break;
case 'R':
if (isvowelY(*(p+1)) || (*(p+1) == '\0'))
{
/* only code if followed by vowel of end of name */
code='6';
}
/* AMENDMENT */
comps += 18;
break;
default:
code=0;
break;
}

/* AMENDMENT */
comps += 3;

if(
last!=code /* not the same as previous
position */
&& code!=0 /* is not a vowel, etc. */
&& p!=name) /* is not the first letter
*/
name[y++]=code; /* then code it */

last=name[y-1];
if (y==1) last=code; /* special case for 1st letter */
}

while(y<4) name[y++]='0'; /* fill in with trailing zeros */
name[4]='\0'; /* NULL terminate after 4
characters */
return 1;
}

int isvowelY(char c)
/* returns 1 if c is a vowel or Y, 0 otherwise */
{
/* convert to upper case */
c = toupper(c);

/* AMENDMENT: average no. of comparisons */
comps += 3;

return ( (c == 'A') ||
(c == 'E') ||
(c == 'I') ||
(c == 'O') ||
(c == 'U') ||
(c == 'Y') );
}

char *strupr(char *n)
{ for( ; *n!='\0'; ++n)*n=toupper(*n);
return(n);
}

Let me know how it goes.

ASKER

apparently, you have done enough of work more than I expected...
well, cann't wait to test a new algorithm..
get back to you soon..

ASKER

the program works fine...but before am going to accept your answer...
just a little thing that I would like you to update..
that 's an output in datafile (matched.dat)...

it's not exactly the same as the output on screen..
forexample:
here the output on screen:
cluster gorithm:
2
number of basename match
8
...
...
moeran: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Number of Matches: 14
mooran: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Number of Matches: 14
smeath: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith
Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth
Number of Matches: 20
smeethe: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith

Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth
Number of Matches: 20

Cluster: Braun Braune Brawn Brawne Broun Broune Browen Browm Brown Browne Browne
e Brownen Browney Brownie Brownin Brownne Brownwon Bruan Bruane Bruen
Cluster: Jhones Jhonnes Joanas Joanes Joans Joenes Johnes Johns Johnss Jomes Jon
as Jonees Jones Jonnes Jonns Jons
Cluster: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Cluster: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith
Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth
------------
however, in the matched.dat the format of name list is in a colums like this:

Match name braun
Braun
Braune
Brawn
Brawne
Broun
Broune
Browen
Browm
Brown
Browne
Brownee
Brownen
..
..
----------------------------------------
Match name braune
Braun
Braune
...
etc..

Cluster: Braun :Braune :Brawn :Brawne :Broun :Broune :Browen :Browm :Brown :Browne :
Brownee :Brownen :Browney :Brownie :Brownin :Brownne :Brownwon :Bruan :Bruane :
Bruen :

Cluster: Moeran :Mooran :Mooren :Morahan :Moran :Morean :Moreen :Moreham :Morehan :Moren :
Morine :Morrain :Morran :Morren :

Cluster: Smeath :Smeeth :Smeethe :Smeith :Smeth :Smethe :Smett :Smieth :Smit :Smite :Smith :
Smithe :Smithee :Smithie :Smithy :Smitt :Smitte :Smitth :Smity :Smiyth :

Cluster: Willeams :Willeans :Willems :Willens :Williamas :Williames :Williams :
Williamss :Willians :Willimes :Willims :Wyllyames :Wyllyams :Wyllyms :

Cluster: Jhones :Jhonnes :Joanas :Joanes :Joans :Joenes :Johnes :Johns :Johnss :Jomes :
Jonas :Jonees :Jones :Jonnes :Jonns :Jons :

-------------
the output list in matched.dat (colums format) should be updated like the output on scren (row format)

hope it 's not going to be too much hassle for you...

many thanks and hope it should be my last favour...:)

Korsila

ASKER

to make it clearer,
the output in datafile must eaxctly the same as the one on screen like this for example:(everything on screen should be saved in matched.dat)
cluster gorithm:
2
number of basename match
8
...
...
moeran: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Number of Matches: 14
mooran: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Number of Matches: 14
smeath: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith
Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth
Number of Matches: 20
smeethe: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith

Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth
Number of Matches: 20

Cluster: Braun Braune Brawn Brawne Broun Broune Browen Browm Brown Browne Browne
e Brownen Browney Brownie Brownin Brownne Brownwon Bruan Bruane Bruen
Cluster: Jhones Jhonnes Joanas Joanes Joans Joenes Johnes Johns Johnss Jomes Jon
as Jonees Jones Jonnes Jonns Jons
Cluster: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Cluster: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith
Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth

ASKER

to make it clearer,
the output in datafile must eaxctly the same as the one on screen like this for example:
cluster gorithm:
2
number of basename match
8
...
...
moeran: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Number of Matches: 14
mooran: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Number of Matches: 14
smeath: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith
Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth
Number of Matches: 20
smeethe: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith

Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth
Number of Matches: 20

Cluster: Braun Braune Brawn Brawne Broun Broune Browen Browm Brown Browne Browne
e Brownen Browney Brownie Brownin Brownne Brownwon Bruan Bruane Bruen
Cluster: Jhones Jhonnes Joanas Joanes Joans Joenes Johnes Johns Johnss Jomes Jon
as Jonees Jones Jonnes Jonns Jons
Cluster: Moeran Mooran Mooren Morahan Moran Morean Moreen Moreham Morehan Moren
Morine Morrain Morran Morren
Cluster: Smeath Smeeth Smeethe Smeith Smeth Smethe Smett Smieth Smit Smite Smith
Smithe Smithee Smithie Smithy Smitt Smitte Smitth Smity Smiyth

--------------

p.s. don't forget no. of matches as well as the input basename(see above)...save the same format output on screen into "matched.dat"

This should do it:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

int phonexMatch(char* name1, char* name2);
int phonex(char* name);
int isvowelY(char c);
void getinput(char *prompt,char n[]);
void maxcluster(int,int);
void subcluster(int,int);
int parseLine(FILE *fp,char **pa,char *line);
int clustermatch(char **src,int ix,int size,char **tgt,int ts);

int msave;
FILE *mfile;
char line[400],line2[400];

void main(int argc,char *argv[])
{ int matched,usave,mc,bnn,i,pos,maxm,x,cag,mpos;
char name[150];
FILE *namefile,*ufile,*matrix,*bnames;

printf("Save matched names in matched.dat (y or n)?\n");
gets(line);
msave=(toupper(line[0])=='Y');
printf("Save unmatched names in unmatched.dat (y or n)?\n");
gets(line);
usave=(toupper(line[0])=='Y');
namefile=fopen("surnames.dat","r+");
if(msave)mfile=fopen("matched.dat","w");
if(usave)ufile=fopen("unmatched.dat","w");
do
{ printf("Please select clustering algorithm (1-2):\n");
       gets(line);
       cag=atoi(line);
printf("Please enter number of base names to match:\n");
       gets(line);
       bnn=atoi(line);
matrix=fopen("matrix.tmp","w+");
       bnames=fopen("bnames.tmp","w+");
for(i=0; i<bnn; ++i)
{ printf("Please enter base name to match:\n");
gets(name);
              fputs(name,bnames);
              fputs("\n",bnames);
}
       fseek(bnames,0L,0);
maxm=0;
for(i=0; i<bnn; ++i)
{ fseek(namefile,0L,0);
matched=0;
mc=0;
              fgets(name,150,bnames);
              name[strlen(name)-1]='\0';
              printf("%s: ",name);
              pos=strlen(name)+2;
mpos=0;
while(fgets(line,400,namefile)!=NULL)
              { for(x=strlen(line)-1; line[x]==' ' || line[x]=='\n'; --x);
line[x+1]='\0';
if(phonexMatch(name,line))
{ if(pos+strlen(line)>80)
{ printf("\n ");
pos=8;
}
printf("%s ",line);
fprintf(matrix," %s",line);
                          pos+=strlen(line)+1;
++mc;
if(msave)
{ if(matched==0)fprintf(mfile,"Match name %s\n",name);
if(mpos+strlen(line)>80)
{ fprintf(mfile,"\n");
mpos=0;
                               }
fprintf(mfile,"%s ",line);
mpos+=strlen(line)+1;
}
matched=1;
}
}
fprintf(matrix,"\n");
if(matched==1)
{ if(msave)
fprintf(mfile,"\nNumber of Matches: %d\n----------------------------------------\n",mc);
printf("\nNumber of Matches: %d\n",mc);
                   if(mc>maxm)maxm=mc;
}
else
{ printf("\nNo Matches Found.\n");
if(usave)
{ fprintf(ufile,"Match name %s\n",name);
fprintf(ufile,"No Matches found\n");
fprintf(ufile,"--------------------\n");
}
}
}
fclose(matrix);
       fclose(bnames);
       printf("\n");
       if(cag==1)maxcluster(bnn,maxm);
       else if(cag==2)subcluster(bnn,maxm);
printf("Compare again (y or n)?\n");
gets(line);
} while(toupper(line[0])=='Y');
fclose(namefile);
unlink("matrix.tmp");
unlink("mtmp.tmp");
if(msave)fclose(mfile);
if(usave)fclose(ufile);
}

void maxcluster(int bnn,int maxm)
{ int i,j,k,l,rs,ts,cs,nos,size,pos,clm;
      long fpos;
char **src,**tgt;
      FILE *mtmp,*matrix;

      matrix=fopen("matrix.tmp","r+");
src=(char **)calloc(sizeof(char *),maxm);
tgt=(char **)calloc(sizeof(char *),maxm);
for(i=0; i<bnn; ++i)
{ rs=parseLine(matrix,src,line);
fpos=ftell(matrix);
nos=0;
for(k=0; k<rs-1; ++k)
{ for(size=2; k+size<=rs; ++size)
{ fseek(matrix,fpos,0);
clm=0;
for(j=i+1; j<bnn && !clm; ++j)
{ ts=parseLine(matrix,tgt,line2);
clm=(clustermatch(src,k,size,tgt,ts)!=-1);
}
if(!clm)break;
}
--size;
if(size>=2)
{ if(k>nos)
{ printf("Unclust: ");
for(j=nos; j<k; ++j)printf("%s ",src[j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nUnclust: ");
for(j=nos; j<k; ++j)fprintf(mfile,"%s ",src[j]);
fprintf(mfile,"\n");
}
}
nos=k+size;
printf("Cluster: ");
for(j=0; j<size; ++j)printf("%s ",src[k+j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nCluster: ");
pos=9;
for(j=0; j<size; ++j)
{ if(pos+strlen(src[k+j])>80)
{ fprintf(mfile,"\n ");
pos=9;
}
fprintf(mfile,"%s ",src[k+j]);
pos+=strlen(src[k+j])+1;
}
fprintf(mfile,"\n");
}
fseek(matrix,0L,0);
mtmp=fopen("mtmp.tmp","w+");
fgets(line2,400,matrix);
fputs(line2,mtmp);
for(j=i+1; j<bnn; ++j)
{ ts=parseLine(matrix,tgt,line2);
cs=clustermatch(src,k,size,tgt,ts);
for(l=0; l<ts; ++l)
{ if(cs==-1 || l<cs || l>=cs+size)
fprintf(mtmp,"%s ",tgt[l]);
}
fprintf(mtmp,"\n");
}
fclose(matrix);
fclose(mtmp);
unlink("matrix.tmp");
rename("mtmp.tmp","matrix.tmp");
matrix=fopen("matrix.tmp","r+");
fgets(line2,400,matrix);
k+=size-1;
}
}
if(nos==0 && rs>0)
       { printf("Unclust: ");
for(j=nos; j<=k; ++j)printf("%s ",src[j]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nUnclust: ");
for(j=nos; j<k; ++j)fprintf(mfile,"%s ",src[j]);
fprintf(mfile,"\n");
}
}
mtmp=fopen("mtmp.tmp","w+");
fseek(matrix,0L,0);
fgets(line,400,matrix);
while(fgets(line,400,matrix)!=NULL)fputs(line,mtmp);
fclose(matrix);
fclose(mtmp);
unlink("matrix.tmp");
rename("mtmp.tmp","matrix.tmp");
matrix=fopen("matrix.tmp","r+");
}
return;
}

void subcluster(int bnn,int maxm)
{ int i,j,rs,ts,pos;
long fpos;
char **src,**tgt;
FILE *matrix,*mtmp;

matrix=fopen("matrix.tmp","r+");
src=(char **)calloc(sizeof(char *),maxm);
tgt=(char **)calloc(sizeof(char *),maxm);
while((rs=parseLine(matrix,src,line))!=-1)
{ fpos=ftell(matrix);
            fseek(matrix,0L,0);
while((ts=parseLine(matrix,tgt,line2))!=-1)
{ if(ftell(matrix)!=fpos && clustermatch(src,0,rs,tgt,ts)!=-1)break;
}
if(ts!=-1)
{ printf("Cluster: ");
for(i=0; i<rs; ++i)printf("%s ",src[i]);
printf("\n");
if(msave)
{ fprintf(mfile,"\nCluster: ");
pos=9;
for(j=0; j<rs; ++j)
{ if(pos+strlen(src[j])>80)
{ fprintf(mfile,"\n ");
pos=9;
}
fprintf(mfile,"%s :",src[j]);
pos+=strlen(src[j])+1;
}
fprintf(mfile,"\n");
}
}
fseek(matrix,0L,0);
            mtmp=fopen("mtmp.tmp","w+");
do
{ fgets(line2,400,matrix);
fputs(line2,mtmp);
} while(ftell(matrix)!=fpos);
while((ts=parseLine(matrix,tgt,line2))!=-1)
{ if(rs!=ts || clustermatch(src,0,rs,tgt,ts)==-1)
{ for(j=0; j<ts; ++j)fprintf(mtmp,"%s ",tgt[j]);
fprintf(mtmp,"\n");
}
}
fclose(matrix);
fclose(mtmp);
unlink("matrix.tmp");
rename("mtmp.tmp","matrix.tmp");
matrix=fopen("matrix.tmp","r+");
            fseek(matrix,fpos,0);
}
return;
}

int parseLine(FILE *fp,char **pa,char *line)
{ int ix;
char *p;

if(fgets(line,400,fp)==NULL)return(-1);
pa[ix=0]=strtok(line," \n");
      if(pa[0]==NULL)return(0);
      while((p=strtok(NULL," \n"))!=NULL)pa[++ix]=p;
      return(ix+1);
}

int clustermatch(char **src,int ix,int size,char **tgt,int ts)
{ int i,j;

for(i=0; i<=ts-size; ++i)
      { for(j=0; j<size && strcmp(src[ix+j],tgt[i+j])==0; ++j);
       if(j==size)return(i);
      }
      return(-1);
}

/* prompt user for input */
/* get input */
/* copy it safely into provided variable */

void getinput(char *prompt,char n[])
{ char ipc[150];

printf("%s:\n",prompt);
gets(ipc);
strncpy(n,ipc,19);
n[19]='\0';
return;
}

#define MAXLINE 150

unsigned long comps;

int phonexMatch(char* nameA, char* nameB)
/* calculates phonex codes for two names */
/* and returns 1 if codes match */
{
int match;
char name1[MAXLINE];
char name2[MAXLINE];

strcpy(name1, nameA);
strcpy(name2, nameB);
strupr(name1);
strupr(name2);
match=0;
/* turn names into phonex code equivalents */
if (!phonex(name1))
{
       printf("Error coding name1.\n");
}
else if (!phonex(name2))
{
       printf("Error coding name2.\n");
}
else if (strcmp(name1, name2) == 0)
{
/* phonex codes are the same */
match = 1;
}

return match;
}

/* The following C function to create Phonex codes is converted from
SDX v1.1 coded by Michael Cooley. Released to the public domain, 1994. */
int phonex(char* name)
{
char last,code,*p;
int y=1;

/*
/ PHONEX PRE-PROCESSING ...
/

/ Deletions effected by replacing with next letter which
/ will be ignored due to duplicate handling of Soundex code.
/ This is faster than 'moving' all subsequent letters.

/ Remove any trailing Ss */
while (name[strlen(name)-1] == 'S')
{
name[strlen(name)-1] = '\0';

comps += 1;
}

/* Phonetic equivalents of first 2 characters */
/* Works since duplicate letters are ignored */
if (strncmp(name, "KN", 2) == 0)
{
*name = 'N'; /* KN.. == N..*/

/* AMENDMENT */
comps += 2;
}
else if (strncmp(name, "PH", 2) == 0)
{
*name = 'F'; /* PH.. == F.. (H ignored anyway) */

/* AMENDMENT */
comps += 2;
}
else if (strncmp(name, "WR", 2) == 0)
{
*name = 'R'; /* WR.. == R.. */

/* AMENDMENT */
comps += 2;
}

/* Special case, ignore H first letter (subsequent Hs ignored anyway) */
/* Works since duplicate letters are ignored */
if (*name == 'H')
{
*name = *(name+1);

/* AMENDMENT */
comps += 1;
}

/* Phonetic equivalents of first character */
switch(*name)
{
case 'E':
case 'I':
case 'O':
case 'U':
case 'Y':
/* vowel equivalence A = E, I, O, U, Y */
*name = 'A';
/* AMENDMENT */
comps += 5;
break;
case 'P':
/* consonant equivalence B = P */
*name = 'B';
/* AMENDMENT */
comps += 6;
break;
case 'V':
/* consonant equivalence F = V */
*name = 'F';
/* AMENDMENT */
comps += 7;
break;
case 'K':
case 'Q':
/* consonant equivalence C = K */
*name = 'C';
/* AMENDMENT */
comps += 9;
break;
case 'J':
/* consonant equivalence G = J */
*name = 'G';
/* AMENDMENT */
comps += 10;
break;
case 'Z':
/* consonant equivalence S = Z */
*name = 'S';
/* AMENDMENT */
comps += 11;
break;
default:
/* no equivalence for letter, no change */
break;
}

/*
/ MODIFIED SOUNDEX CODE
*/

p=name;
for ( ;
*p!='\0' /* not at end of name */
&& *p!=' ' /* terminate if encountered */
&& *p!=',' /* terminate if encountered */
&& y < 4; /* code no more than 4 characters */
p++)
{
switch(*p)
{
case 'B':
case 'P':
case 'F':
case 'V':
code='1';
/* AMENDMENT */
comps += 4;
break;
case 'C':
case 'S':
case 'K':
case 'G':
case 'J':
case 'Q':
case 'X':
case 'Z':
code='2';
/* AMENDMENT */
comps += 12;
break;
case 'D':
case 'T':
if (*(p+1) != 'C')
{
code='3';
}
/* else do not code: TC.. = DC.. = C.. */
/* AMENDMENT */
comps += 14;
break;
case 'L':
if (isvowelY(*(p+1)) || (*(p+1) == '\0'))
{
/* only code if followed by vowel of end of name */
code='4';
}
/* AMENDMENT */
comps += 15;
break;
case 'M':
case 'N':
if ((*(p+1) == 'D') || (*(p+1) == 'G'))
{
/* NG.. = ND.. = N.. */
*(p+1) = *p;
}
code='5';
/* AMENDMENT */
comps += 17;
break;
case 'R':
if (isvowelY(*(p+1)) || (*(p+1) == '\0'))
{
/* only code if followed by vowel of end of name */
code='6';
}
/* AMENDMENT */
comps += 18;
break;
default:
code=0;
break;
}

/* AMENDMENT */
comps += 3;

if(
last!=code /* not the same as previous
position */
&& code!=0 /* is not a vowel, etc. */
&& p!=name) /* is not the first letter
*/
name[y++]=code; /* then code it */

last=name[y-1];
if (y==1) last=code; /* special case for 1st letter */
}

while(y<4) name[y++]='0'; /* fill in with trailing zeros */
name[4]='\0'; /* NULL terminate after 4
characters */
return 1;
}

int isvowelY(char c)
/* returns 1 if c is a vowel or Y, 0 otherwise */
{
/* convert to upper case */
c = toupper(c);

/* AMENDMENT: average no. of comparisons */
comps += 3;

return ( (c == 'A') ||
(c == 'E') ||
(c == 'I') ||
(c == 'O') ||
(c == 'U') ||
(c == 'Y') );
}

char *strupr(char *n)
{ for( ; *n!='\0'; ++n)*n=toupper(*n);
return(n);
}

ASKER

Dear Imladris,
absolutely spotted on..!!

a million thanks...

Korsila...

p.s. please check a new question out about the explaining of your algorithms and codes...espectially the cluster things...please write more about how to approach , what strategy has been used, definition and rule of clasifying the cluster etc..you need to make it clearly since I have no idea much about your algorithms apart from our discussion...

ASKER

Dear Imladris,
absolutely spotted on..!!

a million thanks...

Korsila...

p.s. please check a new question out about the explaining of your algorithms and codes...espectially the cluster things...please write more about how to approach , what strategy has been used, definition and rule of clasifying the cluster etc..you need to make it clearly since I have no idea much about your algorithms apart from our discussion...

ASKER

Dear Imladris,

I have just realised that this I did make a mistake in this requirement...I don't need phonex to produce any match..just wanted the program to generate the cluster based on our requirement (method 1 and method2)..so do you have time to solve this out....I will put anothe question and hope you would have time to take a look...

p.s. why we don;t need to use phonex match becus for example,
smith
smythe
smithe

will return the same matches, and it 's no point to do cluster since they are always overlaping (duplicated)..tus the cluster will always return smith, smythe, smithe as cluster...

hope you could help me solve this out as soon as u can..
many thanks,

Korsila