• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 427

Calculating the cosine coefficient

I'm trying to calculate the cosine coefficient between two documents, and I'm not getting the correct answer. I believe I'm getting it wrong when calculating the cross product and the sum of the squares between the documents.

Here is my code. It runs, but I keep getting a wrong answer.
#!/usr/bin/perl
#Ennio Bozzetti
#S0547650
 
use DBI;
 
###########################
# Setup the DB connection #
###########################
my $dbh = DBI->connect("dbi:mysqlPP:database=crawler;host=localhost",
		       "root", "ghh8773v", {'RaiseError' => 1});
 
############################################
# Get the first document, store in a hash  #
############################################
my $doc = {};
my $sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 58";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc = $dbh->selectall_hashref($sSQL, 'term');
 
################################################
# Get the second document and store in a hash  #
################################################
my $doc2 = {};
$sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 58";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc2 = $dbh->selectall_hashref($sSQL, 'term');
 
my $cosine_result = 0;
$cosine_result = cosine_sim_1($doc, $doc2);
 
print "The cosine coefficient between doc1 and doc2 is $cosine_result";
 
sub cosine_sim_1 {
   my $vec1 = shift;
   my $vec2 = shift;
 
   my $num = 0;
   my $sum_sq1 = 0;
   my $sum_sq2 = 0;
 
   my @val1 = values %{$vec1};
   my @val2 = values %{$vec2};
 
   #######################################################
   # Get the smallest hash, $vec1 holds the smallest hash #
   #######################################################
   if ((scalar @val1) > (scalar @val2)){
      my $temp = $vec1;
      $vec1 = $vec2;
      $vec2 = $temp;
   }
 
   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
   while (my ($key, $val) = each(%$vec2)){
      $vec2->{$key} = '1', if exists $vec1->{$key};
   }
   ###############################
   # Calculate the cross product #
   ###############################
   while (my ($key, $val) = each(%$vec1)){
      $num += $val * ($vec2->{$key} || 0);
   }
 
   # Calculate the sum of squares #
   foreach my $term (@val1){$sum_sq1 += $term * $term}
 
   foreach my $term (@val2){$sum_sq2 += $term * $term}
 
   return ($num/sqrt($sum_sq1 * $sum_sq2));
}
 
 
############
# Clean up #
############
$sth->finish();
$dbh->disconnect();

Asked by: Trexgreen

ozo commented:
Why are you setting all the values to '1'?

Trexgreen (author) commented:
If the term is in the other vector, don't I need to set it to 1?

Trexgreen (author) commented:
When I remove the code that sets the vector values to 1, the result keeps changing.

Even when I compare two documents that are the same, it gives me 0.999566067943262 on the first run and 0.999622983688223 on the second.
This is the section that I removed:
 
   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
 
   while (my ($key, $val) = each(%$vec2)){
      $vec2->{$key} = '1', if exists $vec1->{$key};
   }

kawas commented:
As an aside, if you are calculating an intersection, don't you only need to iterate through the keys of one of the hashes?

   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
   while (my ($key, $val) = each(%$vec2)){
      $vec2->{$key} = '1', if exists $vec1->{$key};
   }
 
# should only be
   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
   while (my ($key, $val) = each(%$vec2)){
      $vec2->{$key} = '1', if exists $vec1->{$key};
   }

kawas commented:
Doh, at first I deleted too much and pressed 'undo', and all the text came back. The intersection should only be:
   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
   

Trexgreen (author) commented:
kawas, I don't think I need that section of the code. I removed it and the calculation seems to get better: when I compare two documents that are equal, the result is very close to 1.

Here is a sample result:
0.99981428947807

But the problem is that the result keeps changing; if I run it again, I get a different number.

Trexgreen (author) commented:
I just had some thoughts...

The documents that I get from the DB are stored in a hash reference. When I pass them to the sub, @val1 and @val2 should contain the values of the two hashes that I passed, but when I print the values of @val1 I get something like this:


HASH(0x2d0c464)
HASH(0x2d0c404)
......
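
The HASH(0x...) output is consistent with how DBI's selectall_hashref shapes its result: each key maps to a hash reference for the whole row rather than to a number. A minimal sketch, using a hand-built stand-in for the query result (the terms and the $doc layout below are illustrative assumptions, not data from the real database):

```perl
use strict;
use warnings;

# Hypothetical stand-in for what selectall_hashref($sSQL, 'term') returns:
# every term points at a hash reference holding that row, not at a weight.
my $doc = {
    cosine => { term => 'cosine' },
    vector => { term => 'vector' },
};

for my $row (values %$doc) {
    # In string context a reference prints as HASH(0x...).
    print "as string: $row\n";

    # In numeric context the same reference becomes its memory address,
    # so $row * $row squares an address instead of a term weight.
    my $as_number = $row + 0;
    print "as number: $as_number\n";
}
```

If the arithmetic in cosine_sim_1 runs on addresses like these, which differ between the two query results and from run to run, that would plausibly explain a coefficient that is close to 1 but never stable.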

ozo commented:
Are there many values, with a range of small and large terms, and when the results change, are they very close?

Trexgreen (author) commented:
Is this correct?

   my @val1 = values %{$vec1};
   my @val2 = values %{$vec2};

When I print scalar @val1 and scalar @val2, I get the size of the documents that I'm comparing, but when I print the values of @val1 and @val2, I get the seemingly random values that I posted in the previous post.

Trexgreen (author) commented:
ozo, yes. They are terms from a document, both small and large (the text was processed beforehand).

The results are very close, but it should be 1 and I'm getting 0.99981428947807.

ozo commented:
What is in those hashes?
What is in keys %{$val1[0]}?

ozo commented:
If you just want to check for the existence of terms, and not their values,
then maybe you want:
 while (my ($key, $val) = each(%$vec1)){
      $num += $vec2->{$key} && 1;
   }
 
   # Calculate the sum of squares #
   foreach my $term (@val1){$sum_sq1 += 1}  #which would be the same as $sum_sq1 = @val1;
 
   foreach my $term (@val2){$sum_sq2 += 1}
 
   return ($num/sqrt($sum_sq1 * $sum_sq2));
}
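
With every present term counted as 1, the cross product is the number of shared terms and each sum of squares is the number of terms in that document. A self-contained sketch of that idea follows; the sub name binary_cosine and the sample term lists are made up for illustration and are not part of the original script:

```perl
use strict;
use warnings;

# Cosine coefficient for binary (present/absent) term vectors.
# Both arguments are hash references keyed by term; the values are ignored.
sub binary_cosine {
    my ($vec1, $vec2) = @_;

    my $size1 = keys %$vec1;    # number of terms in the first document
    my $size2 = keys %$vec2;    # number of terms in the second document
    return 0 if $size1 == 0 || $size2 == 0;    # avoid dividing by zero

    # grep in scalar context counts the terms that also occur in $vec2.
    my $shared = grep { exists $vec2->{$_} } keys %$vec1;

    return $shared / sqrt($size1 * $size2);
}

# Illustrative term sets, not taken from the real database.
my %doc_a = map { $_ => 1 } qw(cosine vector term hash);
my %doc_b = map { $_ => 1 } qw(cosine vector perl);

printf "same document: %s\n", binary_cosine(\%doc_a, \%doc_a);
printf "different documents: %.4f\n", binary_cosine(\%doc_a, \%doc_b);
```

Comparing a document with itself gives 4 / sqrt(4 * 4) = 1 here, and the two sample documents share 2 terms, giving 2 / sqrt(4 * 3), roughly 0.5774.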

Trexgreen (author) commented:
I get the following term:

termaid

ozo commented:
Or, if what you want is the number of elements of $term,
then maybe you want:
while (my ($key, $val) = each(%$vec1)){
      # look up by $key (not $val) and multiply the two counts for the cross product
      $num += scalar(keys %{$val}) * scalar(keys %{$vec2->{$key}}) if $vec2->{$key};
}
   foreach my $term (@val1){$sum_sq1 += (keys %{$term})**2}
 
   foreach my $term (@val2){$sum_sq2 +=  (keys %{$term})**2}

Trexgreen (author) commented:
I'm going to run some tests here and I'll let you know, but your first solution looks like it will work.

Trexgreen (author) commented:
ozo, your first solution worked. :)

I just have to fix this line

$num += $vec2->{$key} && 1;

so it won't give this error
Use of uninitialized value in addition(+) at cosine.pl line 64


Thank you for the help...
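
One likely cause of that warning: when a key is missing from $vec2, the expression $vec2->{$key} && 1 evaluates to the undefined left operand, and adding undef triggers "Use of uninitialized value". A possible fix, sketched here with made-up sample hashes, is to test with exists so the loop always adds a defined 0 or 1:

```perl
use strict;
use warnings;

# Illustrative term hashes; in the real script these come from the database.
my $vec1 = { apple => 1, pear => 1, plum => 1 };
my $vec2 = { apple => 1, plum => 1, kiwi => 1 };

my $num = 0;
while (my ($key, $val) = each %$vec1) {
    # exists returns a defined true/false value even for a missing key,
    # so a term absent from $vec2 contributes 0 instead of undef.
    $num += (exists $vec2->{$key}) ? 1 : 0;
}

print "shared terms: $num\n";
```

With these sample hashes the loop counts the two shared terms (apple and plum) without emitting a warning for pear.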