Solved

How to compare the similarity of name matching between 2 data files?

Posted on 2002-04-19
Last Modified: 2010-04-15
Hi to all experts who are willing to help and can respond quickly...

Here is my requirement:

I have 2 data files, each containing names and a group number for each name (names which sound the same get the same number). See the example below.
Please note that the names in the 2 data files are exactly the same, but they have different numbers because each file was produced by a different name-matching method (for example, the names in datafile 1 might have been run through soundex 1 and the names in datafile 2 through soundex 2).

datafile1                 datafile 2

arran 1                   arran 1
aron 1                    aron  2
bary 2                     bary  3
berry 3                    berry 3
birry 3                     birry   3
smath 4                  smath 4
smith 5                   smith  4
smithe 5                 smithe 4
smythe 5                smythe 4
smithey  6              smithey 4
willams 7                willams 5
william 7                 william 6
williems 7               willieams 6
                           
As you can see, in datafile 1 arran and aron are in the same group (group 1). By contrast, in datafile 2 arran and aron are in different groups (arran is 1 but aron is 2).
So here is my question:
I would like C code that compares how similar the 2 data files are.

It should consider each name and its group, and the output should return the percentage for each similar group, like this:

datafile1                 datafile 2

arran 1                   arran 1                          75%          
aron 1                    aron  2
----------------------------------------
bary 2                     bary  3
berry 3                    berry 3                          83.33%
birry 3                     birry   3
----------------------------------------
smath 4                  smath 4
smith 5                   smith  4
smithe 5                 smithe 4                         60%
smythe 5                smythe 4
smithey  6              smithey 4
------------------------------------------
willams 7                willams 5
william 7                 william 6                         50%
williems 7               willieams 6
------------------------------------------
                                                                      total % = 75 + 83.33 + 60 + 50 = 268.33%
                                                                      total no. of percentages calculated = 4
                                                                      actual % = 268.33/4 = 67.08%

So we can say that these 2 data files have a similarity of 67.08%.

The problem is how to calculate the % of similarity.

First, classify the clusters: each cluster is based on the large group of rows that share the same number
(there should be 2 or more names with the same number in a file).
As you can see here:

arran 1                   arran 1                         75 %      
aron 1                    aron  2

The large group is 1, and the last 1 ends at "aron".
To calculate the percentage of similarity, we count how many times 1 appears in the cluster across the 2 files
and how many times it does not. The formula is: number of names with 1 / total names in the cluster across the 2 files
3/4 * 100 = 75 %


similarly to :
bary 2                     bary  3
berry 3                    berry 3                          83.33%
birry 3                     birry   3

The large group is 3, and the last 3 ends at birry.
There are 5 names with no. 3
and the total number of names is 6.

5/6 * 100 = 83.33%

again:

smath 4                  smath 4
smith 5                   smith  4
smithe 5                 smithe 4                         60%
smythe 5                smythe 4
smithey  6              smithey 4

The large group is 4, and the last 4 ends at smithey (datafile 2).

So 6/10*100 = 60%

another example:
------------------------------------------
willams 7                willams 5
william 7                 william 6                         50%
williems 7               willieams 6

The large group which can be identified is 7,
and the last 7 ends at williems (datafile 1).

So 3/6*100 = 50%

-----------------
If you can use a line (--------------------------------) to separate each cluster, it would be very helpful for checking that you got the clusters right.

Please note that each cluster is based on the large group of a number which appears in the 2 data files (see above).


I think it's hard, so high points will be given as motivation.

Many thanks, and I hope to get a great answer from all the experts.
korsila
P.S. Please, no suggestions alone; I need actual code (in C or any language which you think suits and makes my life easier :)
Question by:korsila
 
LVL 2

Expert Comment

by:jonnin
ID: 6954447
So you want a char by char comparison, a missed char is worth nothing (add total of matched chars and divide by length of the longest item)? Just a count and compute?

Would a near key (close on keyboard, i.e type checking) be better?  Or common spelling error / reversals (were weer or were ware) or the like?  

Would one file be "correct" and another "unknown" or are both equal?

The first is easy, assume files are treated equally: get all guys of one score together and pick one at random or whichever spelling happens the most as correct, then compute %'s (I can do this, but want a clearer def before coding up).  

Also is clarity of code > speed of code?
Is this really what the data looks like, or is the real thing binary or something?

Finally, this looks moderately like homework. You will have to convince me that it's not, or I will code it so cryptically that any teacher would fail you for handing it in. (It will work, it will just be an ugly mess to read/understand)...
 

Author Comment

by:korsila
ID: 6955305
Dear Jonnin,
Thanks for the quick response...


So you want a char by char comparison, a missed char is worth nothing (add total of matched chars and
divide by length of the longest item)? Just a count and compute?

*** First, you will have to classify the clusters (see above for how), and then count the names in the large group and the total number of names in the cluster. For example (from above):

smath 4                  smath 4
smith 5                   smith  4
smithe 5                 smithe 4           60%
smythe 5                smythe 4
smithey  6              smithey 4

The large group is 4, and the last 4 ends at smithey (datafile 2).

So count the names which have 4 = 6
then count the total number of names in this cluster (cluster 4) = 10
so
6/10*100 = 60%

This is only 1 cluster, but we have to do this for the whole data files.
Once that is done, you need to calculate the overall % of how similar the 2 files are.


For example, from above:

total % = 268.33%
total no. of percentages calculated (i.e. how many clusters) = 4
actual % = total percentage / no. of clusters
         = 268.33/4 = 67.08%

----------------

Would a near key (close on keyboard, i.e type checking) be better?  Or common spelling error / reversals
(were weer or were ware) or the like?  

**** No. Name matching has already been applied to the names in the data files, and the number after each name is the identity of its match group.


Would one file be "correct" and another "unknown" or are both equal?

***** They are equal, but as I said, each file was produced by a different name-matching method (hence the number after each name in each data file), so the same names get different numbers in the 2 files.


The first is easy, assume files are treated equally: get all guys of one score together and pick one
at random or whichever spelling happens the most as correct, then compute %'s (I can do this, but want
a clearer def before coding up).  

*** Well, please try it then. The real data file is not as small as my example (say about 2000 names), but it's pretty similar, so if it works with these data files it should no doubt work with the real one.


Also is clarity of code > speed of code?
*** Speed of code should be considered as well.


Is this really what the data looks like, or is the real thing binary or something?

*** It's something like that, or it might have a comma (,) separating the name and the number. I will have to run the matching to get that data as well.


Finally, this looks moderately like homework. You will have to convince me that its not, or I will cryptically
code it so that any teacher would fail you for giving it to them. (It will work, just be ugly mess to
read/understand)...

*** It's not homework, but please make the code understandable and readable; that would be very helpful indeed as I am not an expert in C (just a beginner).

Hope this is clear enough to get you started ^_^

Many thanks,
Korsila
 
LVL 22

Expert Comment

by:cookre
ID: 6955439
We first simplify the problem by recognizing that the names have no bearing on the problem, since the calculation of similarity is stated to be based solely on the codes.

Second, we see that the program is just multiple iterations of a piece of code that handles a single group.

Third, the definer of a given group is that file segment whose code does not change first.

Fourth, the end of the group is signaled by a code change of the definer.

Now the calculations can also be simplified by recognizing that the only figures that matter are:
a) total # of codes = twice the size of the definer
b) Number of non-definer codes equal to the definer code.

Does that not make the assignment easier to approach?
 

Author Comment

by:korsila
ID: 6956209
OK, here is a new deal: the right way to calculate how similar the 2 name-matching methods that processed these 2 data files are.

First, classify the clusters of the 2 data files based on the largest group of rows in one file that share the same number (the number after each name). See below.

Secondly, calculate the similarity of each cluster: in each file, count only the number that appears most often in the cluster, add the two counts, and divide by the total number of names in the cluster. The % of similarity of a cluster is (count of the most frequent number in file1 + count of the most frequent number in file2)*100 / total names in the cluster.

Then print out each cluster across the 2 files together with its % of similarity.

Finally, calculate the actual % of similarity between the 2 files, which is (sum of the % of similarity of each cluster) / total number of clusters between the 2 files, and print out the result.

Here is what the output should look like:

            datafile1                 datafile 2

                arran 1                   arran 1                        75%                  
                aron 1                    aron  2
                ----------------------------------------
                bary 2                     bary  3
                berry 3                    berry 3                         83.33%
                birry 3                     birry   3
                ----------------------------------------
                smath 4                  smath 4
                smith 5                   smith  4
                smithe 5                 smithe 4                      80%                        
                smythe 5                smythe 4
                smithey  6              smithey 4
                ------------------------------------------
                willams 7                willams 5
                william 7                 william 6                     83.33%                      
                williems 7               willieams 6
                ------------------------------------------
                total % = 75 + 83.33 + 80 + 83.33 = 321.66%
                total no. of clusters between the 2 files = 4
                actual % = 321.66/4 = 80.42%

                So we can say that these 2 name-matching methods have a similarity of 80.42%.


here is the example step by step:

First, classify the clusters of the 2 data files based on the largest group of rows in one file that share the same number (the number after each name):

                 smath 4                  smath 4
                smith 5                   smith  4
                smithe 5                 smithe 4                        
                smythe 5                smythe 4
                smithey  6              smithey 4

                The large group is 4 (in file 2), and the last 4 ends at smithey (datafile 2).

% of similarity of each cluster = (count of the most frequent number in file1 + count of the most frequent number in file2)*100 / total names in the cluster
As you can see,
the most frequent number in file 1 of this cluster = 5, so count how many 5s are in this cluster (in this case = 3)
the most frequent number in file 2 of this cluster = 4, so count how many 4s are in this cluster (in this case = 5)
total names in the cluster = 10
% = (3+5)*100/10 = 80%



another example:

                 willams 7                willams 5
                william 7                 william 6                                      
                williems 7               willieams 6

    most frequent no. in file 1 = 7    and    no. of 7s = 3
    most frequent no. in file 2 = 6    and    no. of 6s = 2
    number of names = 6

so % = (3+2)*100/6 = 83.33%
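
The formula above can be sketched in C roughly as follows. This is only an illustration, not code from this thread: the names `longest_run` and `cluster_similarity` are hypothetical, and the sketch assumes that equal numbers sit next to each other inside a cluster, so that the count of the most frequent number equals the longest run. The result is in hundredths of a percent so that everything stays in integer arithmetic.

```c
#include <stddef.h>

/* Length of the longest run of equal consecutive codes in codes[0..n-1].
 * Assumption: equal codes are adjacent, as in the sample data files. */
int longest_run(const int *codes, size_t n)
{
    int best = 0, run = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (i > 0 && codes[i] == codes[i - 1])
            ++run;          /* same code as the previous row */
        else
            run = 1;        /* a new code starts a new run */
        if (run > best)
            best = run;
    }
    return best;
}

/* Similarity of one cluster in hundredths of a percent (8000 = 80.00%).
 * file1 and file2 hold the codes of the cluster's rows in each file. */
int cluster_similarity(const int *file1, const int *file2, size_t rows)
{
    long matched;

    if (rows == 0)
        return 0;           /* empty cluster: nothing to compare */
    matched = longest_run(file1, rows) + longest_run(file2, rows);
    /* (matched / (2 * rows)) * 100%, scaled by 100 and rounded */
    return (int)((matched * 10000L + (long)rows) / (2L * (long)rows));
}
```

For the smith cluster (file 1 codes 4,5,5,5,6 and file 2 codes 4,4,4,4,4) this gives 8000, i.e. 80.00%, and for the williams cluster it gives 8333, i.e. 83.33%.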

The important thing is how to classify the clusters; see the previous requirement above.

-------------
Hope it's right this time! I should change the title of the question to "how similar are 2 matching methods?" rather than how similar the names in the data files are.

----------------

Many thanks, hope it's clear enough to get you started.
Hope to see your helpful answer as soon as possible.
Korsila
 

Author Comment

by:korsila
ID: 6961555
Hi there,
Is there any expert who is not busy and could help me sort this question out? I am just wondering, since I usually get an answer quicker than this!

Anyway, I hope I can get some help with this as soon as possible. Sorry to put a lot of pressure on anybody here!

Many thanks in advance,
Korsila
 
LVL 16

Accepted Solution

by:
imladris earned 300 total points
ID: 6963265
Korsila,

I have been completely snowed under here for weeks now. However, I had a quick pass at this this morning. It produces the figures you are after. It assumes the datafiles are named "file1" and "file2".

#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[])
{     FILE *file1,*file2;
     char name1[150],name2[150];
     int nr,id1,id2,idn1,idn2,oc1,c1,oc2,c2,rc,perc,tot,clc,tp,lid1,lid2;

     file1=fopen("file1","r");
     if(file1==NULL)
     {     printf("file1 open error\n");
          exit(1);
     }
     file2=fopen("file2","r");
     if(file2==NULL)
     {     printf("file2 open error\n");
          exit(1);
     }
     nr=fscanf(file1,"%149s %d",name1,&id1);
     fscanf(file2,"%149s %d",name2,&id2);
     printf("%s %d    %s %d\n",name1,id1,name2,id2);
     clc=tot=0;
     while(nr>0)
     {     nr=fscanf(file1,"%149s %d",name1,&idn1);
          fscanf(file2,"%149s %d",name2,&idn2);
          oc1=oc2=-1;
          c1=c2=1;
          rc=1;
          lid1=id1;
          lid2=id2;
          while(nr>0 && (idn1==id1 || idn2==id2))
          {     printf("%s %d   %s %d\n",name1,idn1,name2,idn2);
               ++rc;
               if(idn1==lid1)++c1;
               else
               {     if(c1>oc1)oc1=c1;
                    c1=1;
                    lid1=idn1;
               }
               if(idn2==lid2)++c2;
               else
               {     if(c2>oc2)oc2=c2;
                    c2=1;
                    lid2=idn2;
               }
               nr=fscanf(file1,"%149s %d",name1,&idn1);
               fscanf(file2,"%149s %d",name2,&idn2);
          }
          if(c1>oc1)oc1=c1;
          if(c2>oc2)oc2=c2;
          perc=((oc1+oc2)*10000+rc)/(rc*2);
          printf("%d.%02d percent\n",perc/100,perc%100);
          if(nr>0)printf("%s %d   %s %d\n",name1,idn1,name2,idn2);
          tot+=perc;
          ++clc;
          id1=idn1;
          id2=idn2;
     }
     tp=(tot*10+clc/2)/(clc*10);
     printf("\ntotal similarity %d.%02d\n",tp/100,tp%100);
}
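
The cluster boundary rule that the program above appears to apply can be isolated in a small sketch. This is an illustration under assumptions, not part of the posted solution: the function name `cluster_end` is hypothetical, and it works on arrays already in memory rather than reading with fscanf. A cluster keeps growing while the next row's number equals the number on the cluster's first row in either file.

```c
#include <stddef.h>

/* Return the index one past the last row of the cluster that begins
 * at row `start`. id1 and id2 hold the numbers from the two files,
 * row by row, and n is the number of rows. */
size_t cluster_end(const int *id1, const int *id2, size_t n, size_t start)
{
    size_t k;

    if (start >= n)
        return n;           /* nothing left to cluster */
    k = start + 1;
    /* extend while either file still shows the first row's number */
    while (k < n && (id1[k] == id1[start] || id2[k] == id2[start]))
        ++k;
    return k;
}
```

On the 13-row example from the question, calling it repeatedly from row 0 returns 2, 5, 10 and 13, which corresponds to the four clusters arran/aron, bary/berry/birry, smath..smithey and willams..williems.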
 

Author Comment

by:korsila
ID: 6963343
Imladris,
You are my man, still my favourite expert on the net in this pressured world ^_^ A million thanks: even though you are snowed under with work, you found time for my question.
I have tested your code and it works fine with the small data above. I will get the big data tomorrow and test against it, and will let you know as soon as possible.

Many, many thanks again for your great help!
You are worth it!
Korsila
P.S. Please wait for my results.
 

Author Comment

by:korsila
ID: 6971487
Hi Imladris,

Sorry for getting back to you quite late.
I have just got the real data to test your algorithm.
The clustering works OK, but the calculation at the end was not right. It returns something like
total similarity -4.-13 and I have no idea what that means.

So my suggestion is: could you make your code return
the total % of all clusters,
the total no. of clusters between the 2 files,
and the total similarity (%) at the end?

For example:
total % = 75 + 83.33 + 80 + 83.33 = 321.66%
total no. of clusters between the 2 files = 4
actual % = 321.66/4 = 80.42%

Then I can figure out what went wrong in the last calculation.


Many thanks, hope you have (a bit of) time for me.
 

Author Comment

by:korsila
ID: 6973496
Dear Imladris,
I have got new data and tested with it.
Now it seems to work OK, with no more problem in the calculation of the total percentage (total similarity -4.-13). How come?

Anyway, one thing I doubt: a cluster which has a single name in each file always returns 100% similarity, which is not always right. Could you make the program ignore these in the calculation and put them all into 1 cluster containing the names that share no number with any other name (no duplicated numbers)? You can save these into another file if you like.

Sorry for getting back to you quite late; I only just got correct data to work with.

Hope all is well with you and you are not too busy.

Many thanks,
Korsila
P.S. Could you also explain your code a bit (as I have no idea how the calculation and clustering work)?
That's it; there should be no more requirements here. If there are, I will post another question regarding this answer.
 

Author Comment

by:korsila
ID: 6973531
here is some example:


file1           file2

smith 1          smith 1
smyth 1          smyth 1
willy 2          willy 2
william 3       william 3
williams 3       williams 3
susie 4         susie 4
zuzy 5          zuzy 5

So now the calculation will ignore the clusters which have a single name in each file. From the above:

file1           file2

smith 1          smith 1    100%
smyth 1          smyth 1
-------------------------
willy 2          willy 2 -------> this would be ignored
--------------------------
william 3       william 3     100%
williams 3       williams 3
---------------------------
susie 4         susie 4 ----> this would be ignored
------------------------
zuzy 5          zuzy 5 ------> this would be ignored
-----------------------

And we may put those ignorable names into another file; there is no need to print them on the screen.

--------------
Hope it's OK.
Many thanks, sir.
Korsila
 

Author Comment

by:korsila
ID: 6977194
Imladris,
I still have a problem with your code, which returns
total similarity -19.-29 (this time).
So I tried testing with exactly the same data (file1 = file2), and the program returns
total similarity -8.-29 instead of 100%.

So my suggestion is: could you make your code return
                the total % of all clusters,
                the total no. of clusters between the 2 files,
                and the total similarity (%) at the end?

                For example:
                total % = 75 + 83.33 + 80 + 83.33 = 321.66%
                total no. of clusters between the 2 files = 4
                actual % = 321.66/4 = 80.42%

                Then I can figure out what went wrong in the last calculation.

I think you will have to update this so I can see what is wrong with the code and the output.

Could it be a memory problem, since I have run your program with a load of files?

Anyway, I hope you can get back to me as soon as possible!

Many thanks,
Korsila
 

Author Comment

by:korsila
ID: 6977234
Another test, against 2 identical data files.
The output looks like this (which is wrong):

ZEBEDEE 141311   ZEBEDEE 141311
100.00 percent
ZEBEDY 141312   ZEBEDY 141312
100.00 percent
ZOOK 141313   ZOOK 141313
100.00 percent
ZOUCH 141314   ZOUCH 141314
100.00 percent
ZOUCHE 141315   ZOUCHE 141315
100.00 percent
ZACHARIAS 141316   ZACHARIAS 141316
100.00 percent
ZACHARIAH 141317   ZACHARIAH 141317
100.00 percent
ZACHARY 141318   ZACHARY 141318
100.00 percent
ZACHARYE 141319   ZACHARYE 141319
100.00 percent
ZACKARY 141320   ZACKARY 141320
100.00 percent
ZECHARIAH 141321   ZECHARIAH 141321
100.00 percent
ZUGG 141322   ZUGG 141322
100.00 percent
ZEAGAR 141323   ZEAGAR 141323
100.00 percent
ZEAGER 141324   ZEAGER 141324
100.00 percent
ZEGAR 141325   ZEGAR 141325
100.00 percent
ZEGER 141326   ZEGER 141326
100.00 percent
ZALLY 141327   ZALLY 141327
100.00 percent
ZEALL 141328   ZEALL 141328
100.00 percent
ZEALLE 141329   ZEALLE 141329
100.00 percent
ZEALLY 141330   ZEALLY 141330
100.00 percent
ZELL 141331   ZELL 141331
100.00 percent
ZILLY 141332   ZILLY 141332
100.00 percent
ZEAL 141333   ZEAL 141333
100.00 percent
ZEALE 141334   ZEALE 141334
100.00 percent
ZEALY 141335   ZEALY 141335
100.00 percent
ZELY 141336   ZELY 141336
100.00 percent
ZEALWOOD 141337   ZEALWOOD 141337
100.00 percent
ZELLWOOD 141338   ZELLWOOD 141338
100.00 percent
ZELWOOD 141339   ZELWOOD 141339
100.00 percent
ZELWOODE 141340   ZELWOODE 141340
100.00 percent
ZILLWOOD 141341   ZILLWOOD 141341
100.00 percent
ZILWOOD 141342   ZILWOOD 141342
100.00 percent
ZELLAR 141343   ZELLAR 141343
100.00 percent
ZELLER 141344   ZELLER 141344
100.00 percent
ZEALEY 141345   ZEALEY 141345
100.00 percent
ZEALLEY 141346   ZEALLEY 141346
100.00 percent
ZEELEY 141347   ZEELEY 141347
100.00 percent
ZELLEY 141348   ZELLEY 141348
100.00 percent
ZILLEY 141349   ZILLEY 141349
100.00 percent
ZWILCHENBART 141350   ZWILCHENBART 141350
100.00 percent
ZELMAN 141351   ZELMAN 141351
100.00 percent
ZIEMS 141352   ZIEMS 141352
100.00 percent
ZENN 141353   ZENN 141353
100.00 percent
ZANE 141354   ZANE 141354
100.00 percent
ZEANE 141355   ZEANE 141355
100.00 percent
ZINE 141356   ZINE 141356
100.00 percent
ZANN 141357   ZANN 141357
100.00 percent
ZIMMERMAN 141358   ZIMMERMAN 141358
100.00 percent
ZIMMERMANN 141359   ZIMMERMANN 141359
100.00 percent
ZONCH 141360   ZONCH 141360
100.00 percent
ZEAR 141361   ZEAR 141361
100.00 percent
ZINZAN 141362   ZINZAN 141362
100.00 percent
ZYNZON 141363   ZYNZON 141363
100.00 percent
ZETT 141364   ZETT 141364
100.00 percent

total similarity 8.85
---------------------
It is supposed to return
total similarity 100.00

That is why I need you to output the number of clusters and the total % of the clusters, so that we can then calculate the total similarity (total % / no. of clusters).

Hope you can help me figure it out.

Many thanks,
Korsila

 
LVL 16

Expert Comment

by:imladris
ID: 6977979
Here is a new version. It will show the total number of clusters and the total percentage, as well as the total similarity. Depending on the size of your data file, it may be that this calculation is overflowing. If so, it is probably easy to remedy it by using long's instead of int's.

This version also checks for 1 line clusters. It still shows them, but prints up "ignored" with them, and excludes them from the total similarity calculation.

#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[])
{     FILE *file1,*file2;
     char name1[150],name2[150];
     int nr,id1,id2,idn1,idn2,oc1,c1,oc2,c2,rc,perc,tot,clc,tp,lid1,lid2;

     file1=fopen("file1","r");
     if(file1==NULL)
     {     printf("file1 open error\n");
          exit(1);
     }
     file2=fopen("file2","r");
     if(file2==NULL)
     {     printf("file2 open error\n");
          exit(1);
     }
     nr=fscanf(file1,"%149s %d",name1,&id1);
     fscanf(file2,"%149s %d",name2,&id2);
     printf("%s %d    %s %d\n",name1,id1,name2,id2);
     clc=tot=0;
     while(nr>0)
     {     nr=fscanf(file1,"%149s %d",name1,&idn1);
          fscanf(file2,"%149s %d",name2,&idn2);
          oc1=oc2=-1;
          c1=c2=1;
          rc=1;
          lid1=id1;
          lid2=id2;
          while(nr>0 && (idn1==id1 || idn2==id2))
          {     printf("%s %d   %s %d\n",name1,idn1,name2,idn2);
               ++rc;
               if(idn1==lid1)++c1;
               else
               {     if(c1>oc1)oc1=c1;
                    c1=1;
                    lid1=idn1;
               }
               if(idn2==lid2)++c2;
               else
               {     if(c2>oc2)oc2=c2;
                    c2=1;
                    lid2=idn2;
               }
               nr=fscanf(file1,"%149s %d",name1,&idn1);
               fscanf(file2,"%149s %d",name2,&idn2);
          }
          if(c1>oc1)oc1=c1;
          if(c2>oc2)oc2=c2;
          perc=((oc1+oc2)*10000+rc)/(rc*2);
          printf("%d.%02d percent\n",perc/100,perc%100);
          if(perc==10000 && rc==1)printf("ignored\n");
          else
          {     tot+=perc;
               ++clc;
          }
          if(nr>0)printf("%s %d   %s %d\n",name1,idn1,name2,idn2);
          id1=idn1;
          id2=idn2;
     }
     tp=(tot*10+clc/2)/(clc*10);
     printf("\ntotal clusters %d total percentage %d.%02d\n",clc,tot/100,tot%100);
     printf("total similarity %d.%02d\n",tp/100,tp%100);
}
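
On the possible overflow mentioned above: the final average is computed as (tot*10+clc/2)/(clc*10), and each cluster adds up to 10000 to tot. Assuming a 32-bit int (the thread does not say what the platform uses), tot*10 exceeds INT_MAX once tot passes 214,748,364, which is roughly 21,475 clusters at 100.00 percent each. A minimal sketch of an average that avoids the multiplication and uses a wider type is below; the name `average_hundredths` is hypothetical and `long long` assumes a C99 compiler.

```c
/* Rounded mean of per-cluster percentages.
 * total_hundredths: sum of the cluster percentages in hundredths of a
 *                   percent (what the program above keeps in tot)
 * clusters:         number of clusters counted (clc above)
 * Returns the mean in hundredths of a percent, or 0 if no clusters. */
long long average_hundredths(long long total_hundredths, long clusters)
{
    if (clusters <= 0)
        return 0;           /* avoid dividing by zero */
    /* adding clusters/2 before dividing rounds to the nearest unit */
    return (total_hundredths + clusters / 2) / clusters;
}
```

With the four sample clusters (7500 + 8333 + 8000 + 8333 = 32166) this returns 8042, i.e. 80.42. The result can be printed with something like printf("%lld.%02lld\n", tp/100, tp%100), and tot itself would also need to be declared as a wider type for the fix to help.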
 

Author Comment

by:korsila
ID: 6978870
hmm IMLADRIS,
THAT'S QUICK, MANY THANKS, I will try to run it again with datafiles...hope will get back to you AS SOON AS POSSIBLE...

WISH ME LUCK ^_^

Korsila
 

 

Author Comment

by:korsila
ID: 6980945
I have loads of questions and requirements to ask, but I don't think it's a good idea to keep you busy within this question, so I will post more questions regarding your answer and my new requirements. However, could you do me a favour with one last requirement?

Here it is.
I actually have 4 files from 4 name-matching methods (the ones you implemented). Now I would like to compare all 4 methods using your answer (this time, don't ignore the single-name clusters, but do solve the overflow problem).

Here is what I want for the final output:

datafile1      datafile 2     file3    file4
arran 1         arran 1       arran1    arran 1    75%
aron 1          aron  2       aron 2    aron 1     75%
                75%
------------------------------------------------
bary 2          bary  3       bary 3    bary 2     66.66%
berry 3         berry 3       berry 4   berry 2    83.33%
birry 3         birry   3     birry 4   barry 2    83.33%
                 83%
-----------------------------------------------
smath 4         smath 4       smath 5   smath 3    75%
smith 5         smith  4      smith 5   smith 3    85%
smithe 5        smithe 4      smithe 5  smithe 3   85%
smythe 5        smythe 4      smythe 5  smythe 3   85%
smithey  6      smithey 4     smithey 5 smithey 4  60%
               85%
------------------------------------------
willams 7       willams 5     willams 6  willams 5  66.66%
william 7       william 6     william 7  william 5  75%
williems 7      willieams 6   williems 8  williems 5 75%
               67%
--------------------------------------------------
total similarity of each cluster = (75+83+85+67)/4

total similarity of names = sum (%of each line)/ no. of line

1. clusters are based on the large group of the same number in each file

2. calculation of the similarity of each cluster

3. calculation of the % of similarity of each name
4. total % of similarity of names

5. calculation of the total % of similarity of clusters

Here is an explanation of each step:

datafile1      datafile 2     file3    file4
arran 1         arran 1       arran1    arran 1    75%--3.
aron 1          aron  2       aron 2    aron 1     75% --3.
                75% ---> 2. (similarity of each name)
---------------------------------1. see above (cluster bases...)

1, 2, 4 are the same as before.

But no. 3, the % of similarity of each name, must be calculated with the following formula:

After clustering, consider the name in the first line, which is "arran":
arran 1         arran 1       arran1    arran 1    

Then find out, for each file, how similar this name (arran) is within the cluster by counting the rows in that file which have the same number as this name and dividing by the total number of rows in the cluster.
Therefore, from above:
datafile1 has 2 the same out of 2, so 2/2 = 1
file2 has only 1 the same (itself) out of 2, so 1/2 = 0.5
file3 has only 1 the same (itself) out of 2, so 1/2 = 0.5
file4 has 2 the same out of 2, so 2/2 = 1
total = (1+0.5+0.5+1) = 3
similarity of each name = total / no. of data files (which is always 4)
3/4 * 100 = 75%
Then the program returns 75% similarity for that line.

Next, consider the second row of the cluster, which is "aron":
aron 1          aron  2       aron 2    aron 1    

It is calculated the same way as above.
-----------------------------

cluster2:

------------------------------------------------
bary 2          bary  3       bary 3    bary 2    
berry 3         berry 3       berry 4   berry 2    
birry 3         birry   3     birry 4   barry 2  

Using the same idea as above:

bary 2 (1/3)  bary  3(3/3) bary 3 (1/3) bary 2 (3/3)
berry 3(2/3)  berry 3(3/3) berry 4(2/3) berry 2 (3/3)
birry 3 (2/3) birry 3(3/3) birry 4 (2/3)barry 2  (3/3)  

Consider the first line (which is "bary"):
"bary" has (1/3)+(3/3)+(1/3)+(3/3) = 8/3 = 2.666...

similarity of each name = {2.666.../4 (no. of data files)}*100

= 66.66...%

so
bary 2          bary  3       bary 3    bary 2     66.66%

Now calculate "berry" with the same formula, and do the same for the other clusters (smith and williams).
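
The per-name calculation described above could be sketched in C like this. It is an illustration only: the function name `name_similarity` and the flat array layout (all rows of file 1, then all rows of file 2, and so on) are assumptions made for the example, not something specified in this thread. The result is in hundredths of a percent and is rounded rather than truncated.

```c
#include <stddef.h>

/* Similarity of one row (name) of a cluster across several files.
 * codes:  nfiles * rows numbers; file f, row r is codes[f * rows + r]
 * row:    index of the name inside the cluster
 * For each file, count the rows in the cluster that carry the same
 * number as this row, divide by the cluster size, then average the
 * fractions over the files. Returns hundredths of a percent. */
int name_similarity(const int *codes, size_t nfiles, size_t rows, size_t row)
{
    long same = 0, denom;
    size_t f, r;

    if (nfiles == 0 || rows == 0 || row >= rows)
        return 0;           /* nothing sensible to compute */
    for (f = 0; f < nfiles; ++f) {
        const int *col = codes + f * rows;  /* this file's numbers */
        for (r = 0; r < rows; ++r)
            if (col[r] == col[row])
                ++same;     /* counts the row itself as well */
    }
    denom = (long)(nfiles * rows);
    return (int)((same * 10000L + denom / 2) / denom);
}
```

For the bary/berry/birry cluster with the four files above (2,3,3 / 3,3,3 / 3,4,4 / 2,2,2), row 0 gives 6667 and row 1 gives 8333. The 66.66% shown in the example is the same value truncated instead of rounded.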

------------------------------------

Many thanks,
P.S. Don't ignore single clusters, which contain only 1 name in each file.

This could be the last question I ask. If you find it difficult to understand my requirement, don't hesitate to get back to me. After this one I have some questions to post for a new requirement as well, but this should be done first, basically to see the output. I hope you won't be unhappy with me for keeping you busy with a new requirement again. As you know, whenever I run an experiment I come up with new problems. I hope this is the last one, and I would appreciate a quick response.

P.P.S. If this feels like too much work, I don't mind putting this question in another topic, but I know you always try your best for others.
So basically, the new addition is comparing 4 files instead of 2 files (don't ignore single clusters),

and then the % similarity of names between the 4 methods (each line)
and the total % similarity of names between the 4 methods (total).

I am deeply sorry that I always come up with a new idea that changes the requirement. If you think it's too much, I can put it in another question.



LVL 16

Expert Comment

by:imladris
ID: 6988776
All right. In general, new requirements are fine, as long as they go into new questions. In this particular case, though I had to think about it off and on for a couple of days, I decided that this one would fit within the point allotment for this question. This covers the new requirements: 4 files, don't ignore single-name clusters, and provide a similarity calculation for each line.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc,char *argv[])
{     FILE *nfile[4];
     char fname[150],mname[4][150];
     int i,j,nr,id[4],idn[4],ctr[4],oc[4],rc,perc,tot,clc,tp,lid[4];
     int nc,nsc[4],nid[4],tid,nsm;
     long nfp[4];

     strcpy(fname,"file1");
     for(i=0; i<4; ++i)
     {     fname[4]='1'+i;
          nfile[i]=fopen(fname,"r");
          if(nfile[i]==NULL)
          {     printf("%s open error\n",fname);
               exit(1);
          }
     }
     for(i=0; i<4; ++i)
     {     nr=fscanf(nfile[i],"%s %d",mname[i],id+i);
          nfp[i]=0;
     }
     clc=tot=0;
     while(nr>0)
     {     for(i=0; i<4; ++i)
          {     nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
               oc[i]=-1;
               ctr[i]=1;
               lid[i]=id[i];
          }
          rc=1;
          while(nr>0 && (idn[0]==id[0] || idn[1]==id[1] || idn[2]==id[2] || idn[3]==id[3]))
          {     ++rc;
               for(i=0; i<4; ++i)
               {     if(idn[i]==lid[i])++ctr[i];
                    else
                    {     if(ctr[i]>oc[i])oc[i]=ctr[i];
                         ctr[i]=1;
                         lid[i]=idn[i];
                    }
               }
               for(i=0; i<4; ++i)
                    nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
          }
          for(i=0; i<4; ++i)
          {     if(ctr[i]>oc[i])oc[i]=ctr[i];
          }
          for(nc=0; nc<rc; ++nc)
          {     for(i=0; i<4; ++i)
               {     fseek(nfile[i],nfp[i],SEEK_SET);
                    for(j=0; j<=nc; ++j)fscanf(nfile[i],"%s %d",mname[i],nid+i);
                    nsc[i]=0;
                    fseek(nfile[i],nfp[i],SEEK_SET);
                    for(j=0; j<rc; ++j)
                    {     fscanf(nfile[i],"%s %d",fname,&tid);
                         if(tid==nid[i])++nsc[i];
                    }
               }
               for(nsm=i=0; i<4; ++i)
               {     printf("%s %d   ",mname[i],nid[i]);
                    nsm+=nsc[i];
               }
               nsm=(((nsm*100+rc/2)/rc)*100+2)/4;
               printf("%d.%02d\n",nsm/100,nsm%100);
          }
          for(i=0; i<4; ++i)
          {     nfp[i]=ftell(nfile[i]);
               nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
          }
          for(perc=0,i=0; i<4; ++i)
               perc+=oc[i];
          perc=(perc*10000+rc)/(rc*4);
          printf("%d.%02d percent\n",perc/100,perc%100);
          tot+=perc;
          ++clc;
          for(i=0; i<4; ++i)id[i]=idn[i];
     }
     tp=(tot*10+clc/2)/(clc*10);
     printf("\ntotal clusters %d total percentage %d.%02d\n",clc,tot/100,tot%100);
     printf("total similarity %d.%02d\n",tp/100,tp%100);
}

Author Comment

by:korsila
ID: 6994344
Dear Imladris,

Sorry, it was supposed to be only 2 files.
I have made a mistake again! The 4-file comparison is not part of this requirement; I will need to think about it and post a new question. Really sorry indeed.

However, I tried to update your code to compare (cluster) 2 files and it didn't work. Here is the code I updated from yours (basically changing "4" into "2").

It compiles, but it reports: "fi1e1 open error"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void main(int argc,char *argv[])
{     FILE *nfile[2];
    char fname[150],mname[2][150];
    int i,j,nr,id[2],idn[2],ctr[2],oc[2],rc,perc,tot,clc,tp,lid[2];
    int nc,nsc[2],nid[2],tid,nsm;
    long nfp[2];

    strcpy(fname,"file1");
    for(i=0; i<2; ++i)
    {     fname[2]='1'+i;
         nfile[i]=fopen(fname,"r");
         if(nfile[i]==NULL)
         {     printf("%s open error\n",fname);
              exit(1);
         }
    }
    for(i=0; i<2; ++i)
    {     nr=fscanf(nfile[i],"%s %d",mname[i],id+i);
         nfp[i]=0;
    }
    clc=tot=0;
    while(nr>0)
    {     for(i=0; i<2; ++i)
         {     nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
              oc[i]=-1;
              ctr[i]=1;
              lid[i]=id[i];
         }
         rc=1;
         while(nr>0 && (idn[0]==id[0] || idn[1]==id[1]))
         {     ++rc;
              for(i=0; i<2; ++i)
              {     if(idn[i]==lid[i])++ctr[i];
                   else
                   {     if(ctr[i]>oc[i])oc[i]=ctr[i];
                        ctr[i]=1;
                        lid[i]=idn[i];
                   }
              }
              for(i=0; i<2; ++i)
                   nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
         }
         for(i=0; i<2; ++i)
         {     if(ctr[i]>oc[i])oc[i]=ctr[i];
         }
         for(nc=0; nc<rc; ++nc)
         {     for(i=0; i<2; ++i)
              {     fseek(nfile[i],nfp[i],SEEK_SET);
                   for(j=0; j<=nc; ++j)fscanf(nfile[i],"%s %d",mname[i],nid+i);
                   nsc[i]=0;
                   fseek(nfile[i],nfp[i],SEEK_SET);
                   for(j=0; j<rc; ++j)
                   {     fscanf(nfile[i],"%s %d",fname,&tid);
                        if(tid==nid[i])++nsc[i];
                   }
              }
              for(nsm=i=0; i<2; ++i)
              {     printf("%s %d   ",mname[i],nid[i]);
                   nsm+=nsc[i];
              }
              nsm=(((nsm*100+rc/2)/rc)*100+2)/2;
              printf("%d.%02d\n",nsm/100,nsm%100);
         }
         for(i=0; i<2; ++i)
         {     nfp[i]=ftell(nfile[i]);
              nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
         }
         for(perc=0,i=0; i<2; ++i)
              perc+=oc[i];
         perc=(perc*10000+rc)/(rc*2);
         printf("%d.%02d percent\n",perc/100,perc%100);
         tot+=perc;
         ++clc;
         for(i=0; i<2; ++i)id[i]=idn[i];
    }
    tp=(tot*10+clc/2)/(clc*10);
    printf("\ntotal clusters %d total percentage %d.%02d\n",clc,tot/100,tot%100);
    printf("total similarity %d.%02d\n",tp/100,tp%100);
}



Author Comment

by:korsila
ID: 6994357
Did I make the right changes to your code? I don't think so. Anyway, could you change the code again for me? Please note that it will be comparing only 2 files (not 4 files). I got confused by my own new requirement.

I hope it's not a difficult change!

thank you,
Korsila

Author Comment

by:korsila
ID: 6994425
Dear Steven,
I have tested your code with 4 datafiles (which I made up) and it works OK, as follows:


Gutman 144   Guthrum 144   Joass 108   Joubert 87   58.25
Guthrum 145   Gutman 145   Joss 108   Jubert 87   58.25
Gutsell 146   Gutsell 146   Josse 108   Joce 88   50.00
58.33 percent
Leegood 148   LeGood 148   LeGood 110   LeGood 90   66.75
LeGood 149   Leegood 149   Leegood 110   Leegood 90   75.00
Legood 149   Legood 150   Legood 110   Legood 90   75.00
75.00 percent

total clusters 81 total percentage 6399.34
total similarity 79.00
actual similarity ---**** to be added

However, for the 2-datafile version I would also like an actual similarity, calculated as the total of the % of each line (not of each cluster) divided by the number of lines (6 in this case).

so in this case:
(58.25+58.25+50+66.75+75+75)/6 = actual similarity =?


So could you do this little bit more?

A million thanks, and that really is it. (I don't think I will make any more mistakes with this requirement), so this should really be the last one!

Korsila
 
LVL 16

Expert Comment

by:imladris
ID: 6999129
OK, if I've understood this right, you want, for this question, to go back to 2 files, but add the line similarity calculation and also a total (average) line similarity at the end. This code would do it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int oneequal(int id1[],int id2[]);

#define FALSE    0
#define TRUE     1

#define FILENUM  2


int main(int argc,char *argv[])
{     FILE *nfile[FILENUM];
     char fname[150],mname[FILENUM][150];
     int i,j,nr,id[FILENUM],idn[FILENUM],ctr[FILENUM],oc[FILENUM],rc,perc,tot,clc,tp,lid[FILENUM];
     int nc,nsc[FILENUM],nid[FILENUM],tid,nsm;
     int ltot,lnum;
     long nfp[FILENUM];

     strcpy(fname,"file1");
     for(i=0; i<FILENUM; ++i)
     {     fname[4]='1'+i;
          nfile[i]=fopen(fname,"r");
          if(nfile[i]==NULL)
          {     printf("%s open error\n",fname);
               exit(1);
          }
     }
     for(i=0; i<FILENUM; ++i)
     {     nr=fscanf(nfile[i],"%s %d",mname[i],id+i);
          nfp[i]=0;
     }
     clc=tot=0;
     ltot=lnum=0;
     while(nr>0)
     {     for(i=0; i<FILENUM; ++i)
          {     nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
               oc[i]=-1;
               ctr[i]=1;
               lid[i]=id[i];
          }
          rc=1;
          while(nr>0 && oneequal(idn,id))
          {     ++rc;
               for(i=0; i<FILENUM; ++i)
               {     if(idn[i]==lid[i])++ctr[i];
                    else
                    {     if(ctr[i]>oc[i])oc[i]=ctr[i];
                         ctr[i]=1;
                         lid[i]=idn[i];
                    }
               }
               for(i=0; i<FILENUM; ++i)
                    nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
          }
          for(i=0; i<FILENUM; ++i)
          {     if(ctr[i]>oc[i])oc[i]=ctr[i];
          }
          for(nc=0; nc<rc; ++nc)
          {     for(i=0; i<FILENUM; ++i)
               {     fseek(nfile[i],nfp[i],SEEK_SET);
                    for(j=0; j<=nc; ++j)fscanf(nfile[i],"%s %d",mname[i],nid+i);
                    nsc[i]=0;
                    fseek(nfile[i],nfp[i],SEEK_SET);
                    for(j=0; j<rc; ++j)
                    {     fscanf(nfile[i],"%s %d",fname,&tid);
                         if(tid==nid[i])++nsc[i];
                    }
               }
               for(nsm=i=0; i<FILENUM; ++i)
               {     printf("%s %d   ",mname[i],nid[i]);
                    nsm+=nsc[i];
               }
               nsm=(((nsm*100+rc/2)/rc)*100+FILENUM/2)/FILENUM;
               printf("%d.%02d\n",nsm/100,nsm%100);
               ltot+=nsm;
               ++lnum;
          }
          for(i=0; i<FILENUM; ++i)
          {     nfp[i]=ftell(nfile[i]);
               nr=fscanf(nfile[i],"%s %d",mname[i],idn+i);
          }
          for(perc=0,i=0; i<FILENUM; ++i)
               perc+=oc[i];
          perc=(perc*10000+(rc*FILENUM)/2)/(rc*FILENUM);
          printf("%d.%02d percent\n",perc/100,perc%100);
          tot+=perc;
          ++clc;
          for(i=0; i<FILENUM; ++i)id[i]=idn[i];
     }
     tp=(tot+clc/2)/(clc);
     printf("\ntotal clusters %d total percentage %d.%02d\n",clc,tot/100,tot%100);
     printf("total cluster similarity %d.%02d\n",tp/100,tp%100);
     tp=(ltot+lnum/2)/(lnum);
     printf("\ntotal lines %d total percentage %d.%02d\n",lnum,ltot/100,ltot%100);
     printf("total line similarity %d.%02d\n",tp/100,tp%100);
}

int oneequal(int id1[],int id2[])
{     int i;

     for(i=0; i<FILENUM; ++i)
     {     if(id1[i]==id2[i])return(TRUE);
     }
     return(FALSE);
}

Author Comment

by:korsila
ID: 7001103
Well done indeed,
and many thanks for being patient with me!

I have another question based on this answer, which I will post here soon.


Once again, a million thanks for your great help.
Korsila

Author Comment

by:korsila
ID: 7134896
Dear Imladris,
I have submitted another question. If you have time to take a look, please do; I need your help again!

many thanks,

Korsila
LVL 16

Expert Comment

by:imladris
ID: 7154817
Korsila,

Thanks for this ingenious note (you correctly figured I am still subscribed to old questions).

I found and looked at both of your new questions. As you probably surmised, I have been away. I have been on holidays the last couple of weeks.

So I'll be digging out from my backlog here at work for at least a week, maybe longer. I'll see what I can do, but it may take a couple of weeks.

Author Comment

by:korsila
ID: 7155743
Dear Imladris,

Seriously, you are the best expert on earth (on the net) I have ever had, for such great help. To be honest, I trust no one here could do better. The beauty of your code, for example: it is well formatted, easy to understand, easy to learn from, and meets my requirements. I am not being smarmy, am I? I can't say anything more; I am just being frank.

I can wait that long because I know you will come up with a great idea and help with the code.

Many thanks for your response. At least I now know how busy you will be!

Korsila