  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1435

Perl Cosine Similarity

I'm writing a small program that calculates the cosine coefficient between two documents, and I would like some help making sure I'm getting the correct result.

The first document has 384 terms (all text processing was done before)
The second document has 52 terms (all text processing was done before)

I'm having a problem calculating the cross product and the sum of squares. This is where I think I'm doing it wrong, so any help would be great.
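
For reference, the quantity in question is usually written as follows. The symbols are assumptions for illustration: $a_i$ and $b_i$ stand for the weights of term $i$ in the first and second document. The numerator is what the post calls the cross product (strictly speaking, a dot product), and the denominator combines the two sums of squares.

```latex
\cos(A, B) = \frac{\sum_{i} a_i \, b_i}{\sqrt{\sum_{i} a_i^{2}} \; \sqrt{\sum_{i} b_i^{2}}}
```

Only terms that occur in both documents contribute to the numerator, which is why it is enough to loop over the smaller of the two hashes.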

Here is the code I have so far.
#!/usr/bin/perl
 
use DBI;
 
#######################
# Setup the DB connection #
#######################
my $dbh = DBI->connect("dbi:mysqlPP:database=crawler;host=localhost",
		       "root", "12345679", {'RaiseError' => 1});
 
#################################################
# Get the first document and store it in a hash #
#################################################
my $doc = {};
my $sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 58";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc = $dbh->selectall_hashref($sSQL, 'term');
 
##################################################
# Get the second document and store it in a hash #
##################################################
my $doc2 = {};
$sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 59";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc2 = $dbh->selectall_hashref($sSQL, 'term');
 
print cosine_sim_1($doc, $doc2);
 
sub cosine_sim_1 {
   my $vec1 = shift;
   my $vec2 = shift;
 
   my $num = 0;
   my $sum_sq1 = 0;
   my $sum_sq2 = 0;
 
   my $val1 = values %{$vec1};
   my $val2 = values %{$vec2};
 
   # This section is ok, the $val1 and $val2 is correct
   print "Debug\nthe value of val1 is $val1\nthe value of val2 is $val2\n";
 
   #############################################
   # Get the smallest hash, $doc holds the smallest hash #
   #############################################
  #  This section is correct too, $vec1 is the small vector
   if ((scalar keys %$vec1) > (scalar keys %$vec2)){
      my $temp = $vec1;
      $vec1 = $vec2;
      $vec2 = $temp;
   }
 
   #########################
   # Calculate the cross product #
   #########################
   # I need some help with this section,
   # because $num keeps changing every time I run the program
   while (my ($key, $val) = each(%$vec1)){
      $num += $val * ($$vec2{$key} || 0);
   }
 
   print "\nDebug\nthe value of num is $num\n";
 
   # Calculate the sum of squares #
   # I need some help here too, right now it returns blank.
 
   foreach my $term (@val1){$sum_sq1 += $term * $term}
   print "\nDebug\nthe value of term is $term\n";
 
   foreach my $term (@val2){$sum_sq2 += $term * $term}
   print "\nDebug\nthe value of term is $term\n";
 
   return ($num/sqrt($sum_sq1 * $sum_sq2));
}
 
 
############
# Clean up #
############
$sth->finish();
$dbh->disconnect();

Asked by: Trexgreen
1 Solution
 
ozo Commented:
Did you mean:
my @val1 = values %{$vec1};
my @val2 = values %{$vec2};
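
The difference is one of context: assigned to a scalar, `values` yields the number of values, while assigned to an array it yields the values themselves. In the posted code `@val1` and `@val2` were never filled, so the sum-of-squares loops had nothing to iterate over. A minimal sketch, using a hypothetical hash `%tf` that is not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical term-frequency hash, for illustration only.
my %tf = (perl => 3, cosine => 1, vector => 2);

my $count = values %tf;   # scalar context: the number of values
my @vals  = values %tf;   # list context: the values themselves

# Sum of squares over the actual values.
my $sum_sq = 0;
$sum_sq += $_ * $_ for @vals;

print "count = $count\n";            # count = 3
print "sum of squares = $sum_sq\n";  # sum of squares = 14
```

Running the original script with `use strict;` and `use warnings;` enabled would most likely have flagged the undeclared `@val1` and `@val2` at compile time.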
 
Trexgreen (Author) Commented:
ozo... thank you, good call.

I will try it later today... I guess I must have gone crazy after a long night of programming. :)
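
One more thing that may be worth checking: `selectall_hashref` returns a hash whose values are themselves hash references (one per row), not numbers. Multiplying references in numeric context uses their memory addresses, which would plausibly explain why `$num` changes from run to run. Below is a minimal sketch of the whole calculation on plain term => count hashes. The helper `term_freq` and the sample term lists are hypothetical; in the real script the term list might instead come from something like DBI's `selectcol_arrayref`.

```perl
use strict;
use warnings;

# Hypothetical helper: build a term => count hash from a list of terms.
sub term_freq {
    my %tf;
    $tf{$_}++ for @_;
    return \%tf;
}

# Cosine similarity of two term => weight hash references.
sub cosine_sim {
    my ($vec1, $vec2) = @_;

    # Iterate over the smaller hash when computing the dot product.
    if (scalar(keys %$vec1) > scalar(keys %$vec2)) {
        ($vec1, $vec2) = ($vec2, $vec1);
    }

    # Dot product: only terms present in both hashes contribute.
    my $dot = 0;
    for my $term (keys %$vec1) {
        $dot += $vec1->{$term} * ($vec2->{$term} || 0);
    }

    # Sum of squares of each vector's weights.
    my ($sum_sq1, $sum_sq2) = (0, 0);
    $sum_sq1 += $_ ** 2 for values %$vec1;
    $sum_sq2 += $_ ** 2 for values %$vec2;

    # Avoid dividing by zero when either document is empty.
    return 0 if $sum_sq1 == 0 || $sum_sq2 == 0;
    return $dot / sqrt($sum_sq1 * $sum_sq2);
}

# Made-up sample documents, for illustration only.
my $doc1 = term_freq(qw(perl cosine perl vector));
my $doc2 = term_freq(qw(perl vector matrix));

printf "%.4f\n", cosine_sim($doc1, $doc2);   # prints 0.7071
```

With these sample lists the dot product is 3 and the sums of squares are 6 and 3, so the result is 3 / sqrt(18), roughly 0.7071. Counting occurrences keeps repeated terms as weights, whereas keying a hash on `term` alone collapses duplicates.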