
Perl Cosine Similarity

Trexgreen asked
Last Modified: 2013-11-15
I'm creating a small program that calculates the cosine coefficient between two documents, and I would like some help making sure I'm getting the correct data.

The first document has 384 terms (all text processing was done before)
The second document has 52 terms (all text processing was done before)

I'm having a problem calculating the cross product and the sum of squares. This is where I think I'm doing it wrong, so any help would be great.

Here is the code I have so far.
#!/usr/bin/perl
 
use DBI;
 
############################
# Set up the DB connection #
############################
my $dbh = DBI->connect("dbi:mysqlPP:database=crawler;host=localhost",
		       "root", "12345679", {'RaiseError' => 1});
 
##################################################
# Get the first document and store it in a hash  #
##################################################
my $doc = {};
my $sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 58";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc = $dbh->selectall_hashref($sSQL, 'term');
 
##################################################
# Get the second document and store it in a hash #
##################################################
my $doc2 = {};
$sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID =  tbl_terms.docID AND tbl_doc.docID = 59";
$sth  = $dbh->prepare($sSQL);
$sth->execute;
$doc2 = $dbh->selectall_hashref($sSQL, 'term');
 
print cosine_sim_1($doc, $doc2);
 
sub cosine_sim_1 {
   my $vec1 = shift;
   my $vec2 = shift;
 
   my $num = 0;
   my $sum_sq1 = 0;
   my $sum_sq2 = 0;
 
   my $val1 = values %{$vec1};
   my $val2 = values %{$vec2};
 
   # This section is ok, the $val1 and $val2 is correct
   print "Debug\nthe value of val1 is $val1\nthe value of val2 is $val2\n";
 
   #############################################
   # Get the smallest hash, $doc holds the smallest hash #
   #############################################
  #  This section is correct too, $vec1 is the small vector
   if ((scalar keys %$vec1) > (scalar keys %$vec2)){
      my $temp = $vec1;
      $vec1 = $vec2;
      $vec2 = $temp;
   }
 
   #########################
   # Calculate the cross product #
   #########################
   # I need some help with this section,
   # because $num keeps changing every time I run the program
   while (my ($key, $val) = each(%$vec1)){
      $num += $val * ($$vec2{$key} || 0);
   }
 
   print "\nDebug\nthe value of num is $num\n";
 
   # Calculate the sum of squares #
   # I need some help here too, right now it returns blank.
 
   foreach my $term (@val1){$sum_sq1 += $term * $term}
   print "\nDebug\nthe value of term is $term\n";
 
   foreach my $term (@val2){$sum_sq2 += $term * $term}
   print "\nDebug\nthe value of term is $term\n";
 
   return ($num/sqrt($sum_sq1 * $sum_sq2));
}
 
 
############
# Clean up #
############
$sth->finish();
$dbh->disconnect();


Commented:
Did you mean:
my @val1 = values %{$vec1};
my @val2 = values %{$vec2};
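
The difference comes down to context. Assigning `values` to a scalar yields the number of values (which is why the debug line printed 384 and 52), while assigning it to an array yields the values themselves. Because the script does not `use strict`, the later loops over `@val1` and `@val2` silently iterate over two unrelated, empty arrays, so both sums of squares stay at 0. A minimal sketch of the distinction, using a made-up `%vec` hash purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical term => weight hash, for illustration only.
my %vec = (apple => 2, pear => 3, plum => 6);

# Scalar context: values() returns how many values there are.
my $count = values %vec;

# List context: values() returns the values themselves (in no fixed order).
my @vals = values %vec;

# Sum of squares over the actual values: 4 + 9 + 36.
my $sum_sq = 0;
$sum_sq += $_ * $_ for @vals;

print "count=$count sum_sq=$sum_sq\n";   # prints "count=3 sum_sq=49"
```

With `use strict` enabled, the original `foreach my $term (@val1)` line would fail at compile time instead of looping over nothing. The `$term` debug prints come out blank for a similar reason: `$term` only exists inside each `foreach` loop.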


Author

Commented:
ozo... thank you, good call.

I will try later today... I guess I must have gone crazy after a long night of programming. :)
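
A note on the `$num` value that changes on every run: `selectall_hashref($sSQL, 'term')` returns a hash whose values are hash references (one per row), not numbers. Multiplying two references uses their memory addresses as numbers, and those addresses differ from run to run. Keying on `term` also collapses repeated terms, so term frequencies are lost. The following is only a sketch of one way around this, assuming each document is available as a plain list of terms (for example via `$dbh->selectcol_arrayref($sSQL)`, if the query returns one row per term occurrence). The helper name `term_counts` and the sample terms are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of terms into a term => count hash.
sub term_counts {
    my %count;
    $count{$_}++ for @_;
    return \%count;
}

sub cosine_sim {
    my ($vec1, $vec2) = @_;

    # Walk the smaller hash when accumulating the numerator.
    if (scalar(keys %$vec1) > scalar(keys %$vec2)) {
        ($vec1, $vec2) = ($vec2, $vec1);
    }

    # Numerator: sum of products over terms present in both documents.
    my $num = 0;
    for my $term (keys %$vec1) {
        $num += $vec1->{$term} * ($vec2->{$term} // 0);
    }

    # Denominator: square root of the product of the sums of squares.
    my ($sum_sq1, $sum_sq2) = (0, 0);
    $sum_sq1 += $_ * $_ for values %$vec1;
    $sum_sq2 += $_ * $_ for values %$vec2;

    my $denom = sqrt($sum_sq1 * $sum_sq2);
    return $denom ? $num / $denom : 0;   # return 0 rather than divide by zero
}

# Made-up sample documents.
my $doc1 = term_counts(qw(perl hash perl cosine));
my $doc2 = term_counts(qw(perl cosine vector));

printf "%.4f\n", cosine_sim($doc1, $doc2);   # prints 0.7071
```

Here the numerator is 2*1 + 1*1 = 3 and the denominator is sqrt(6 * 3), giving 1/sqrt(2). Returning 0 for an empty document is a design choice of this sketch, not something the original code specified.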