
How can I compare the similarity of name matching between 2 data files?

Hi to all experts who are willing to help and respond quickly...

Here is my requirement:

I have 2 data files that contain names and a group number for each name (names that sound the same get the same number); see the example below.
Please note that the names in the 2 data files are exactly the same, but their numbers differ, because the numbering comes from name matching (for example, the names in datafile 1 might have been run through soundex 1, while the names in datafile 2 might have been run through soundex 2).

datafile 1      datafile 2

arran     1     arran      1
aron      1     aron       2
bary      2     bary       3
berry     3     berry      3
birry     3     birry      3
smath     4     smath      4
smith     5     smith      4
smithe    5     smithe     4
smythe    5     smythe     4
smithey   6     smithey    4
willams   7     willams    5
william   7     william    6
williems  7     willieams  6
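
For reference, one row of such a file could be read in C roughly like this. It is only a minimal sketch: it assumes each data file holds one whitespace-separated "name number" pair per line, and the names `struct entry`, `parse_entry` and `NAME_MAX_LEN` are hypothetical, not taken from the files above.

```c
#include <stdio.h>

#define NAME_MAX_LEN 63  /* assumed upper bound on a name's length */

/* One row of a data file: a name and its group number. */
struct entry {
    char name[NAME_MAX_LEN + 1];
    int group;
};

/* Parses one "name number" line into *out.
   Returns 1 if both fields were read, 0 otherwise. */
int parse_entry(const char *line, struct entry *out)
{
    /* %63s reads at most NAME_MAX_LEN characters, so name cannot overflow. */
    return sscanf(line, "%63s %d", out->name, &out->group) == 2;
}
```

Each line obtained with `fgets` from a data file could be passed to `parse_entry`, skipping lines for which it returns 0.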

As you can see, in datafile 1 arran and aron are in the same group (group 1); in contrast, in datafile 2 they are in different groups (arran is 1 but aron is 2).
So here is my question:
I would like C code that measures how similar the 2 data files are.

It should consider each name and its group, and then output the percentage for each cluster of similar names, like this:

datafile 1      datafile 2

arran     1     arran      1     75%
aron      1     aron       2
----------------------------------------
bary      2     bary       3
berry     3     berry      3     83.33%
birry     3     birry      3
----------------------------------------
smath     4     smath      4
smith     5     smith      4
smithe    5     smithe     4     60%
smythe    5     smythe     4
smithey   6     smithey    4
----------------------------------------
willams   7     willams    5
william   7     william    6     50%
williems  7     willieams  6
----------------------------------------
total % = 75 + 83.33 + 60 + 50 = 268.33%
total no. of percentages calculated = 4
actual % = 268.33/4 = 67.08%

So we can say that these 2 data files have a similarity of 67.08%.
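
The averaging step could look like this in C. This is a minimal sketch that assumes the per-cluster percentages have already been computed; `overall_similarity` is a hypothetical name.

```c
#include <stddef.h>

/* Mean of the per-cluster percentages.
   pct: array of cluster percentages; count: number of clusters.
   Returns 0.0 when there are no clusters. */
double overall_similarity(const double *pct, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += pct[i];
    return count == 0 ? 0.0 : total / (double)count;
}
```

With the four cluster percentages from the example (75, 83.33, 60 and 50) this gives about 67.1.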

The problem is how to calculate the % of similarity.

First, classify the clusters: each cluster is based on the large group, i.e. the largest run of names that share the same number
(a group should have 2 or more names with the same number in a file).
As you can see here:

arran 1 arran 1 75 %
aron 1 aron 2

The large group is 1, and the last 1 ends at "aron".
To calculate the percentage of similarity, we count how many times 1 appears in the cluster across the 2 files
and how many times it does not. The formula is: (number of entries with 1) / (total number of entries in the cluster across the 2 files)
3/4 * 100 = 75%
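
The formula above can be sketched in C as follows. This is only a minimal sketch: the function name `cluster_percentage` and its parameters are hypothetical, and it assumes the rows of the cluster and its large-group number have already been identified.

```c
#include <stddef.h>

/* Percentage of entries, across both files, that carry the cluster's
   large-group number.
   nums1, nums2: group numbers of the cluster's names in datafile 1 and 2.
   n: number of names in the cluster; dominant: the large-group number. */
double cluster_percentage(const int *nums1, const int *nums2,
                          size_t n, int dominant)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (nums1[i] == dominant) hits++;
        if (nums2[i] == dominant) hits++;
    }
    /* Each name contributes two entries, one per file. */
    return n == 0 ? 0.0 : 100.0 * (double)hits / (double)(2 * n);
}
```

For the arran/aron cluster (numbers 1, 1 in datafile 1 and 1, 2 in datafile 2, large group 1) this returns 75.0.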

Similarly for:
bary 2 bary 3
berry 3 berry 3 83.33%
birry 3 birry 3

The large group is 3, and the last 3 ends at birry.
There are 5 entries with the number 3,
and there are 6 entries in total.

5/6 * 100 = 83.33%

again:

smath 4 smath 4
smith 5 smith 4
smithe 5 smithe 4 60%
smythe 5 smythe 4
smithey 6 smithey 4

The large group is 4, and the last 4 ends at smithey (datafile 2).

So 6/10 * 100 = 60%

another example:
------------------------------------------
willams 7 willams 5
william 7 william 6 50%
williems 7 willieams 6

The large group that can be identified is 7,
and the last 7 ends at williems (datafile 1).

So 3/6 * 100 = 50%

-----------------
If you could use a line (--------------------------------) to separate each cluster, it would be very helpful for checking that you got the clusters right...

Please note that each cluster is based on the large group of a number that appears in the 2 data files; see above.
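
One possible reading of the clustering rule, sketched in C. The rule used here is an assumption inferred from the four worked examples, not a confirmed specification: starting at the first unprocessed row, take the run of equal numbers in each file, treat the longer run as the large group, and end the cluster where that run ends (ties go to datafile 1). The sketch also assumes both files list the same names in the same order; `run_length` and `file_similarity` are hypothetical names.

```c
#include <stddef.h>

/* Length of the run of equal numbers that starts at index start. */
static size_t run_length(const int *nums, size_t start, size_t n)
{
    size_t len = 1;
    while (start + len < n && nums[start + len] == nums[start])
        len++;
    return len;
}

/* Splits rows [0, n) into clusters under the assumed rule and returns
   the mean cluster percentage. If cluster_count is not NULL, it
   receives the number of clusters found. */
double file_similarity(const int *nums1, const int *nums2, size_t n,
                       size_t *cluster_count)
{
    double total = 0.0;
    size_t clusters = 0;
    size_t i = 0;

    while (i < n) {
        size_t len1 = run_length(nums1, i, n);
        size_t len2 = run_length(nums2, i, n);
        /* The longer run is the large group; ties go to datafile 1. */
        size_t len = len1 >= len2 ? len1 : len2;
        int dominant = len1 >= len2 ? nums1[i] : nums2[i];

        /* Count entries in both files that carry the large-group number. */
        size_t hits = 0;
        for (size_t k = i; k < i + len; k++) {
            if (nums1[k] == dominant) hits++;
            if (nums2[k] == dominant) hits++;
        }
        total += 100.0 * (double)hits / (double)(2 * len);
        clusters++;
        i += len;  /* a separator line could be printed at this boundary */
    }

    if (cluster_count != NULL)
        *cluster_count = clusters;
    return clusters == 0 ? 0.0 : total / (double)clusters;
}
```

On the 13 rows of the example, this rule produces the same four clusters (75%, 83.33%, 60%, 50%). Real data may contain cases the examples do not cover, such as overlapping runs, so the rule should be checked against the intended definition.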

I think it's hard, so high points will be given as motivation...

Many thanks, and I hope to get a great answer from the experts.
korsila
P.S. Please, no suggestions; I only need code (in C, or any language that you think suits this and makes my life easier :)

jonnin

So you want a char-by-char comparison, where a missed char is worth nothing (add up the matched chars and divide by the length of the longest item)? Just a count and compute?
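
A minimal sketch of that char-by-char idea in C, assuming characters are compared position by position and an empty pair counts as a perfect match; `char_similarity` is a hypothetical name.

```c
#include <string.h>

/* Position-by-position similarity of two names: the number of matching
   characters divided by the length of the longer name.
   Returns a value in [0, 1]; two empty strings count as 1.0. */
double char_similarity(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    size_t longest = la > lb ? la : lb;
    size_t shortest = la < lb ? la : lb;
    size_t matched = 0;

    /* Characters past the end of the shorter name count as misses. */
    for (size_t i = 0; i < shortest; i++) {
        if (a[i] == b[i])
            matched++;
    }
    return longest == 0 ? 1.0 : (double)matched / (double)longest;
}
```

For example, "smith" against "smyth" matches 4 of 5 positions, giving 0.8. A single inserted or deleted character shifts every later position, so this measure is harsh on such differences.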

Would a near-key match (close on the keyboard, i.e. typo checking) be better? Or common spelling errors / reversals (were vs. weer, or were vs. ware) or the like?

Would one file be "correct" and the other "unknown", or are both equal?

The first is easy. Assume the files are treated equally: get all the names with one score together, pick one at random (or whichever spelling occurs most often) as correct, then compute the %'s. (I can do this, but I want a clearer definition before coding it up.)

Also, is clarity of code > speed of code?
Is this really what the data looks like, or is the real thing binary or something?

Finally, this looks moderately like homework. You will have to convince me that it's not, or I will code it so cryptically that any teacher would fail you for handing it in. (It will work, it will just be an ugly mess to read/understand)...