  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 424

Calculating the cosine coefficient

I'm trying to calculate the cosine coefficient between two documents, but I'm not getting the correct answer. I believe I'm going wrong when calculating the cross product and the sum of squares for the documents.

Here is my code. It runs, but I keep getting a wrong answer.
#!/usr/bin/perl
#Ennio Bozzetti
#S0547650
 
use DBI;
 
###########################
# Setup the DB connection #
###########################
my $dbh = DBI->connect("dbi:mysqlPP:database=crawler;host=localhost",
		       "root", "ghh8773v", {'RaiseError' => 1});
 
#################################################
# Get the first document and store it in a hash #
#################################################
my $doc = {};
my $sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 58";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc = $dbh->selectall_hashref($sSQL, 'term');
 
###############################################
# Get the second document and store in a hash #
###############################################
my $doc2 = {};
$sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 58";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc2 = $dbh->selectall_hashref($sSQL, 'term');
 
my $cosine_result = 0;
$cosine_result = cosine_sim_1($doc, $doc2);
 
print "The cosine coefficient between doc1 and doc2 is $cosine_result";
 
sub cosine_sim_1 {
   my $vec1 = shift;
   my $vec2 = shift;
 
   my $num = 0;
   my $sum_sq1 = 0;
   my $sum_sq2 = 0;
 
   my @val1 = values %{$vec1};
   my @val2 = values %{$vec2};
 
   #######################################################
   # Get the smallest hash, $vec1 holds the smallest hash #
   #######################################################
   if ((scalar @val1) > (scalar @val2)){
      my $temp = $vec1;
      $vec1 = $vec2;
      $vec2 = $temp;
   }
 
   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
   while (my ($key, $val) = each(%$vec2)){
      $vec2->{$key} = '1', if exists $vec1->{$key};
   }
   ###############################
   # Calculate the cross product #
   ###############################
   while (my ($key, $val) = each(%$vec1)){
      $num += $val * ($vec2->{$key} || 0);
   }
 
   # Calculate the sum of squares #
   foreach my $term (@val1){$sum_sq1 += $term * $term}
 
   foreach my $term (@val2){$sum_sq2 += $term * $term}
 
   return ($num/sqrt($sum_sq1 * $sum_sq2));
}
 
 
############
# Clean up #
############
$sth->finish();
$dbh->disconnect();

Asked by: Trexgreen
 
ozo commented:
Why are you setting all the values to '1'?

Trexgreen (author) commented:
If the term is in the other vector, don't I need to set it to 1?

Trexgreen (author) commented:
When I remove the code that sets the vector values to 1, the result keeps changing.

Even when I compare two documents that are the same, it gives me 0.999566067943262 on the first run and 0.999622983688223 on the second.
This is the section that I removed.
 
   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
 
   while (my ($key, $val) = each(%$vec2)){
      $vec2->{$key} = '1', if exists $vec1->{$key};
   }

 
kawas commented:
As an aside, if you are calculating an intersection, don't you only need to iterate through the keys of one of the hashes?

   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
   while (my ($key, $val) = each(%$vec2)){
      $vec2->{$key} = '1', if exists $vec1->{$key};
   }
 
# should only be
   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
   while (my ($key, $val) = each(%$vec2)){
      $vec2->{$key} = '1', if exists $vec1->{$key};
   }

 
kawas commented:
Doh, I deleted too much at first, pressed 'undo', and all the text came back. It should only be:
   ########################
   # Get the intersection #
   ########################
   while (my ($key, $val) = each(%$vec1)){
      $vec1->{$key} = '1', if exists $vec2->{$key};
   }
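
To illustrate the point about iterating only one hash: a minimal sketch, assuming the goal is just to count the terms two documents share. The name count_shared_terms and the sample terms are hypothetical and do not come from the original script.

```perl
use strict;
use warnings;

# Count shared terms by walking one hash and probing the other.
# (count_shared_terms is a hypothetical helper name.)
sub count_shared_terms {
    my ($vec1, $vec2) = @_;

    # Walk the smaller hash so there are fewer lookups.
    ($vec1, $vec2) = ($vec2, $vec1) if keys(%$vec1) > keys(%$vec2);

    # grep in scalar context returns the number of matching keys.
    return scalar grep { exists $vec2->{$_} } keys %$vec1;
}

# Hypothetical sample data; only the keys matter here.
my %doc1 = map { $_ => 1 } qw(apple banana cherry);
my %doc2 = map { $_ => 1 } qw(banana cherry date fig);

print count_shared_terms(\%doc1, \%doc2), "\n";   # prints 2
```

Because exists only tests for the key, neither hash is modified, unlike the loops above that overwrite the values with '1'.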
   

 
Trexgreen (author) commented:
kawas, I don't think I need that section of the code. I removed it and the calculation seems to get better. When I compare two documents that are equal, I get a result very close to 1.

Here is a sample result:
0.99981428947807

But the problem is that the result keeps changing. If I run it again, I will get a different number.
 
Trexgreen (author) commented:
I just had some thoughts...

The documents that I get from the DB are stored in a hash reference. When I pass them to the sub, @val1 and @val2 should contain the values of the two hashes that I passed, but when I print the values of @val1 I get something like this:


HASH(0x2d0c464)
HASH(0x2d0c404)
......
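
A likely explanation for the HASH(0x...) output, sketched without a database: DBI's selectall_hashref returns a hash keyed by the chosen column, and each value is itself a hash reference holding that row. The $doc structure below is a hand-built stand-in for that result, and the sample terms are hypothetical.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Stand-in for the result of selectall_hashref($sSQL, 'term'):
# one entry per term, each value a hash reference for that row.
my $doc = {
    apple  => { term => 'apple'  },
    banana => { term => 'banana' },
};

my @val1 = values %{$doc};
print "$_\n" for @val1;            # prints HASH(0x...) for each row

# Used as a number, a reference evaluates to its memory address,
# so $term * $term multiplies addresses, not term weights.
my $as_number = $val1[0] + 0;
print "same as address\n" if $as_number == refaddr($val1[0]);

# The terms themselves are the keys of the outer hash.
my @terms = sort keys %{$doc};
print "@terms\n";                  # prints apple banana
```

Since memory addresses generally differ from one run to the next, arithmetic on them would give slightly different results each time, which would be consistent with the changing values reported above.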

 
ozo commented:
Are there many values, with a range of small and large terms? And when the results change, are they very close?
 
Trexgreen (author) commented:
Is this correct?

   my @val1 = values %{$vec1};
   my @val2 = values %{$vec2};

When I print scalar @val1 and scalar @val2, I get the size of the documents that I'm comparing, but when I print the values of @val1 and @val2, I get the random-looking values that I posted in the previous post.
 
Trexgreen (author) commented:
ozo, yes. They are terms from a document, both small and large (but the text was processed beforehand).

The results are very close, but it should be 1 and I'm getting 0.99981428947807.
 
ozo commented:
What is in those hashes?
What is in keys %{$val1[0]}?
 
ozo commented:
If you just want to check for the existence of terms, and not the values,
then maybe you want
 while (my ($key, $val) = each(%$vec1)){
      $num += $vec2->{$key} && 1;
   }
 
   # Calculate the sum of squares #
   foreach my $term (@val1){$sum_sq1 += 1}  #which would be the same as $sum_sq1 = @val1;
 
   foreach my $term (@val2){$sum_sq2 += 1}
 
   return ($num/sqrt($sum_sq1 * $sum_sq2));
}
 
Trexgreen (author) commented:
I get the following term:

termaid

 
ozo commented:
Or, if what you want is the number of elements of $term,
then maybe you want
while (my ($key, $val) = each(%$vec1)){
      # look up $vec2 by $key (not $val) and multiply the two counts for the cross product
      $num += keys(%{$val}) * keys(%{$vec2->{$key}}) if $vec2->{$key};
}
   foreach my $term (@val1){$sum_sq1 += (keys %{$term})**2}

   foreach my $term (@val2){$sum_sq2 +=  (keys %{$term})**2}
 
Trexgreen (author) commented:
Your first solution looks like it will work. I'm running some tests here and I will let you know.
 
Trexgreen (author) commented:
ozo, your first solution worked. :)

I just have to fix this line

$num += $vec2->{$key} && 1;

so that it doesn't give this warning:
Use of uninitialized value in addition(+) at cosine.pl line 64


Thank you for the help.
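
One possible way to avoid that warning, as a sketch rather than the exact fix used in the thread: test the key with exists so that a missing term contributes 0 instead of undef. The name cosine_binary and the sample terms are hypothetical. The sketch treats each document as a set of terms, so the coefficient is the number of shared terms divided by the square root of the product of the two term counts.

```perl
use strict;
use warnings;

# Binary (presence/absence) cosine coefficient:
#   shared terms / sqrt(terms in doc 1 * terms in doc 2)
# (cosine_binary is a hypothetical name.)
sub cosine_binary {
    my ($vec1, $vec2) = @_;

    my $n1 = keys %$vec1;          # number of distinct terms in doc 1
    my $n2 = keys %$vec2;          # number of distinct terms in doc 2
    return 0 unless $n1 && $n2;    # avoid dividing by zero for empty docs

    my $shared = 0;
    for my $key (keys %$vec1) {
        # exists yields true or false, never undef, so no warning is raised
        $shared += exists $vec2->{$key} ? 1 : 0;
    }
    return $shared / sqrt($n1 * $n2);
}

# Hypothetical sample data; only the keys are used.
my %doc1 = map { $_ => 1 } qw(apple banana cherry);
my %doc2 = map { $_ => 1 } qw(banana cherry date fig);

print cosine_binary(\%doc1, \%doc1), "\n";            # prints 1
printf "%.3f\n", cosine_binary(\%doc1, \%doc2);       # prints 0.577
```

Because only the keys are consulted, this also works unchanged on the hash of row hashes that selectall_hashref returns.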