• Status: Solved
Perl Cosine Similarity

I'm writing a small program that calculates the cosine coefficient between two documents, and I would like some help making sure I'm getting the correct data.

The first document has 384 terms (all text processing was done before)
The second document has 52 terms (all text processing was done before)

I'm having a problem calculating the cross product and the sum of squares. This is where I think I'm doing it wrong, so any help would be great.

Here is the code I have so far.
#!/usr/bin/perl
 
use DBI;
 
#######################
# Setup the DB connection #
#######################
my $dbh = DBI->connect("dbi:mysqlPP:database=crawler;host=localhost",
		       "root", "12345679", {'RaiseError' => 1});
 
###################################
# Get the first document and store it in a hash #
###################################
my $doc = {};
my $sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 58";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc = $dbh->selectall_hashref($sSQL, 'term');
 
#########################################
# Get the second document and store it in a hash #
#########################################
my $doc2 = {};
$sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 59";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc2 = $dbh->selectall_hashref($sSQL, 'term');
 
print cosine_sim_1($doc, $doc2);
 
sub cosine_sim_1 {
   my $vec1 = shift;
   my $vec2 = shift;
 
   my $num = 0;
   my $sum_sq1 = 0;
   my $sum_sq2 = 0;
 
   my $val1 = values %{$vec1};
   my $val2 = values %{$vec2};
 
   # This section is ok, the $val1 and $val2 is correct
   print "Debug\nthe value of val1 is $val1\nthe value of val2 is $val2\n";
 
   #############################################
   # Get the smallest hash, $doc holds the smallest hash #
   #############################################
  #  This section is correct too, $vec1 is the small vector
   if ((scalar keys %$vec1) > (scalar keys %$vec2)){
      my $temp = $vec1;
      $vec1 = $vec2;
      $vec2 = $temp;
   }
 
   #########################
   # Calculate the cross product #
   #########################
   # This section I need some help
   # Because $num keeps on changing everytime I run the program
   while (my ($key, $val) = each(%$vec1)){
      $num += $val * ($$vec2{$key} || 0);
   }
 
   print "\nDebug\nthe value of num is $num\n";
 
   # Calculate the sum of squares #
   # I need some help here too, right now it returns blank.
 
   foreach my $term (@val1){$sum_sq1 += $term * $term}
   print "\nDebug\nthe value of term is $term\n";
 
   foreach my $term (@val2){$sum_sq2 += $term * $term}
   print "\nDebug\nthe value of term is $term\n";
 
   return ($num/sqrt($sum_sq1 * $sum_sq2));
}
 
 
############
# Clean up #
############
$sth->finish();
$dbh->disconnect();

Asked by: Trexgreen
 
ozo commented:
Did you mean:
my @val1 = values %{$vec1};
my @val2 = values %{$vec2};
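
For anyone reading along, the difference matters because `values` in scalar context returns the number of elements, not the elements themselves. So `my $val1 = values %{$vec1};` stores a count, and the later `foreach` loops walk an undeclared, empty `@val1`, which leaves the sums of squares at 0. Running under `use strict` would have flagged `@val1` at compile time. A minimal sketch of the two contexts, using a made-up `%tf` hash that is only an assumption for illustration:

```perl
use strict;
use warnings;

# Hypothetical term-frequency hash, for illustration only.
my %tf = (perl => 2, cosine => 3, vector => 1);

my $count  = values %tf;   # scalar context: the number of values
my @values = values %tf;   # list context: the values themselves

# Sum of squares over the actual values: 2*2 + 3*3 + 1*1
my $sum_sq = 0;
$sum_sq += $_ * $_ for @values;

print "count = $count, sum of squares = $sum_sq\n";
```

With this sample data the script prints `count = 3, sum of squares = 14`.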
 
Trexgreen (author) commented:
ozo... thank you, good call.

I will try later today... I guess I must have gone crazy after a long night of programming. :)
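
One more thing that may be worth checking in the posted code: `selectall_hashref($sSQL, 'term')` returns a hash whose values are themselves hash references (one per row), not numbers. Multiplying two references in `$val * ($$vec2{$key} || 0)` uses their memory addresses as numbers, which would likely explain why `$num` changes on every run. The query also selects only `term`, so there is no weight column to multiply. Below is a rough sketch of the calculation on plain term => weight hashes. The `cosine_sim` name and the sample `%doc1`/`%doc2` counts are assumptions for illustration and are not taken from the original database.

```perl
use strict;
use warnings;

# Cosine similarity of two sparse vectors passed as hash references
# mapping term => numeric weight. Returns 0 if either vector has zero length.
sub cosine_sim {
    my ($vec1, $vec2) = @_;

    # Iterate over the smaller hash when computing the dot product.
    if (scalar(keys %$vec1) > scalar(keys %$vec2)) {
        ($vec1, $vec2) = ($vec2, $vec1);
    }

    # Dot product: only terms present in both hashes contribute.
    my $dot = 0;
    for my $term (keys %$vec1) {
        $dot += $vec1->{$term} * ($vec2->{$term} || 0);
    }

    # Sum of squares of each vector's weights.
    my ($sum_sq1, $sum_sq2) = (0, 0);
    $sum_sq1 += $_ * $_ for values %$vec1;
    $sum_sq2 += $_ * $_ for values %$vec2;

    # Guard against division by zero for empty or all-zero vectors.
    my $denom = sqrt($sum_sq1 * $sum_sq2);
    return $denom ? $dot / $denom : 0;
}

# Hypothetical term counts standing in for the two documents.
my %doc1 = (perl => 2, hash => 1, vector => 1);
my %doc2 = (perl => 1, vector => 3, cosine => 1);

printf "%.4f\n", cosine_sim(\%doc1, \%doc2);
```

For the sample counts the dot product is 2*1 + 1*3 = 5 and the sums of squares are 6 and 11, so the result is 5 / sqrt(66). To use this with the database, the hashes would need to be built as term => count (or some other weight) rather than passed straight from `selectall_hashref`.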