I'm trying to calculate the cosine coefficient between two documents, and I'm not getting the correct answer. I believe I'm getting it wrong when calculating the cross product and the sum of the squares between the documents.

Here is my code. It runs, but I keep getting a wrong answer.

#!/usr/bin/perl
#Ennio Bozzetti
#S0547650
use DBI;

###########################
# Setup the DB connection #
###########################
my $dbh = DBI->connect("dbi:mysqlPP:database=crawler;host=localhost", "root", "ghh8773v", {'RaiseError' => 1});

############################################
# Get 1 documents and store it to a hash #
############################################
my $doc = {};
my $sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID = tbl_terms.docID AND tbl_doc.docID = 58";
$sth = $dbh->prepare($sSQL);
$sth->execute;
$doc = $dbh->selectall_hashref($sSQL, 'term');

################################################
# Get the second documment and store in a hash #
################################################
my $doc2 = {};
$sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID = tbl_terms.docID AND tbl_doc.docID = 58";
$sth = $dbh->prepare($sSQL);
$sth->execute;
$doc2 = $dbh->selectall_hashref($sSQL, 'term');

my $cosine_result = 0;
$cosine_result = cosine_sim_1($doc, $doc2);
print "The cosine coefficient between doc1 and doc2 is $cosine_result";

sub cosine_sim_1 {
    my $vec1 = shift;
    my $vec2 = shift;
    my $num = 0;
    my $sum_sq1 = 0;
    my $sum_sq2 = 0;
    my @val1 = values %{$vec1};
    my @val2 = values %{$vec2};

    #######################################################
    # Get the smallest hash, $vec1 holds the smallest hash #
    #######################################################
    if ((scalar @val1) > (scalar @val2)){
        my $temp = $vec1;
        $vec1 = $vec2;
        $vec2 = $temp;
    }

    ########################
    # Get the intersection #
    ########################
    while (my ($key, $val) = each(%$vec1)){
        $vec1->{$key} = '1', if exists $vec2->{$key};
    }
    while (my ($key, $val) = each(%$vec2)){
        $vec2->{$key} = '1', if exists $vec1->{$key};
    }

    ###############################
    # Calculate the cross product #
    ###############################
    while (my ($key, $val) = each(%$vec1)){
        $num += $val * ($vec2->{$key} || 0);
    }

    # Calculate the sum of squares #
    foreach my $term (@val1){$sum_sq1 += $term * $term}
    foreach my $term (@val2){$sum_sq2 += $term * $term}

    return ($num/sqrt($sum_sq1 * $sum_sq2));
}

############
# Cleam up #
############
$sth->finish();
$dbh->disconnect();

If the term is in the other vector, don't I need to set it to 1?

Trexgreen (Author) commented:

When I remove the code that adds 1 to the value of the vector, the result keeps changing.

And even when I compare two documents that are the same, it gives me 0.999566067943262 the first time I run it and 0.999622983688223 the second time.

This is the section that I removed.

########################
# Get the intersection #
########################
while (my ($key, $val) = each(%$vec1)){
    $vec1->{$key} = '1', if exists $vec2->{$key};
}
while (my ($key, $val) = each(%$vec2)){
    $vec2->{$key} = '1', if exists $vec1->{$key};
}

As an aside, if you are calculating an intersection, don't you only need to iterate through the keys of one of the hashes?

########################
# Get the intersection #
########################
while (my ($key, $val) = each(%$vec1)){
    $vec1->{$key} = '1', if exists $vec2->{$key};
}
while (my ($key, $val) = each(%$vec2)){
    $vec2->{$key} = '1', if exists $vec1->{$key};
}

# should only be

########################
# Get the intersection #
########################
while (my ($key, $val) = each(%$vec1)){
    $vec1->{$key} = '1', if exists $vec2->{$key};
}
while (my ($key, $val) = each(%$vec2)){
    $vec2->{$key} = '1', if exists $vec1->{$key};
}

Doh, at first I deleted too much and pressed 'undo', and all the text came back ...

########################
# Get the intersection #
########################
while (my ($key, $val) = each(%$vec1)){
    $vec1->{$key} = '1', if exists $vec2->{$key};
}
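To illustrate the point about iterating over only one hash, here is a minimal sketch. The hashes %vec1 and %vec2 and their contents are made up for the example and are not the data from the question.

```perl
use strict;
use warnings;

# Hypothetical term sets, keyed by term (the values are not used here).
my %vec1 = map { $_ => 1 } qw(apple banana cherry);
my %vec2 = map { $_ => 1 } qw(banana cherry date);

# A term is in the intersection when it is a key of both hashes, so it is
# enough to walk the keys of one hash and test membership in the other.
my @common = sort grep { exists $vec2{$_} } keys %vec1;

print "@common\n";   # prints "banana cherry"
```

Walking the smaller of the two hashes keeps the number of lookups down, which is presumably why the original code swaps the vectors first.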

kawas... I don't think I need that section of the code. I removed it and the calculation seems to get better... when I compare two documents that are equal, I get a result very close to 1.

here is a sample result.
0.99981428947807

But the problem is that the result keeps changing; if I run it again I will get a different number.

Trexgreen (Author) commented:

I just had some thoughts...

The documents that I get from the DB are stored in a hash reference. When I pass them to the sub, @val1 and @val2 contain the values of the two hashes that I passed in, but when I print the values of @val1 I get something like this.

Are there many vals, with a range of small and large terms, and when the results change, are they very close?

Trexgreen (Author) commented:

is this correct?

my @val1 = values %{$vec1};
my @val2 = values %{$vec2};

When I print scalar @val1 and scalar @val2 I get the size of the docs that I'm comparing, but when I print the values of @val1 and @val2 I get the random values that I posted in the previous post.
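A likely explanation for the random-looking values: DBI's selectall_hashref returns a hash keyed by the chosen column, and each value is itself a hash reference holding that row. Used as a number, a reference turns into its memory address, which can differ from run to run. The sketch below mimics that shape with a hand-built hash; $doc and its contents are assumptions for illustration, not output from the real database.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what selectall_hashref($sSQL, 'term') returns:
# a hash keyed by term, where each value is a hash reference for that row.
my $doc = {
    perl   => { term => 'perl' },
    cosine => { term => 'cosine' },
};

my @vals = values %{$doc};

# Each value is a reference, not a number.
print ref($vals[0]), "\n";          # prints "HASH"

# In numeric context a reference becomes its memory address, so multiplying
# two of them yields a huge number that can differ between runs.
my $as_number = $vals[0] + 0;
print "numified: $as_number\n";

# The number of terms, on the other hand, is stable.
my $count = scalar @vals;
print "terms: $count\n";            # prints "terms: 2"
```

If this is what is happening, $term * $term in the sum of squares is multiplying addresses, which would explain a result that is close to 1 but never exactly 1 and changes on each run.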

Trexgreen (Author) commented:

ozo... yes... they are terms from a document, small and large terms (but the text was processed beforehand)

the results are very close... but it should be 1 and I'm getting 0.99981428947807

if you just want to check for the existence of terms, and not the values
then maybe you want
while (my ($key, $val) = each(%$vec1)){
$num += $vec2->{$key} && 1;
}

# Calculate the sum of squares #
foreach my $term (@val1){$sum_sq1 += 1} #which would be the same as $sum_sq1 = @val1;

or, if what you want is the number of elements of $term,
then maybe you want
while (my ($key, $val) = each(%$vec1)){
$num += keys(%{$val}) * keys(%{$vec2->{$key}}) if $vec2->{$key};
}
foreach my $term (@val1){$sum_sq1 += (keys %{$term})**2}

foreach my $term (@val2){$sum_sq2 += (keys %{$term})**2}
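Putting the first suggestion together, a binary (presence-only) cosine coefficient could look roughly like the sketch below. The name cosine_sim_binary and the sample hashes are made up for the example; it assumes each document is a hash reference keyed by term and ignores the values entirely.

```perl
use strict;
use warnings;

# Binary cosine similarity: only the presence of a term counts.
# Both arguments are hash references keyed by term; the values are ignored.
sub cosine_sim_binary {
    my ($vec1, $vec2) = @_;

    my $n1 = scalar keys %{$vec1};
    my $n2 = scalar keys %{$vec2};
    return 0 if $n1 == 0 || $n2 == 0;   # avoid dividing by zero

    # Iterate over the smaller hash to count shared terms.
    ($vec1, $vec2) = ($vec2, $vec1) if $n1 > $n2;
    my $shared = 0;
    for my $term (keys %{$vec1}) {
        $shared++ if exists $vec2->{$term};
    }

    # With 0/1 weights the sum of squares is just the number of terms.
    return $shared / sqrt($n1 * $n2);
}

# Hypothetical documents for illustration.
my %doc_a = map { $_ => 1 } qw(perl hash cosine);
my %doc_b = map { $_ => 1 } qw(perl hash vector dot);

print cosine_sim_binary(\%doc_a, \%doc_a), "\n";   # identical documents give 1
print cosine_sim_binary(\%doc_a, \%doc_b), "\n";   # two shared terms out of 3 and 4
```

Because nothing here depends on the hash values, the result for two identical documents is exactly 1 and does not vary between runs.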

Trexgreen (Author) commented:

I'm going to run some tests here and I will let you know... but your first solution looks like it will work.

Trexgreen (Author) commented:

ozo... your first solution worked... :)

I just have to fix this line

$num += $vec2->{$key} && 1;

so it won't give this error
Use of uninitialized value in addition(+) at cosine.pl line 64

Thank you for the help...
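For the "Use of uninitialized value" warning mentioned above, one possible fix is to test with exists instead of reading the value: when a key is missing, $vec2->{$key} is undef, and undef && 1 is still undef, which then gets added to $num. The sketch below uses made-up hashes; only the line inside the loop is the suggested change.

```perl
use strict;
use warnings;

# Hypothetical vectors; 'hash' is missing from $vec2 on purpose.
my $vec1 = { perl => 1, hash => 1, cosine => 1 };
my $vec2 = { perl => 1, cosine => 1 };

my $num = 0;
for my $key (keys %{$vec1}) {
    # exists yields true or false without reading an undefined value,
    # so nothing undefined is ever added to $num.
    $num += exists $vec2->{$key} ? 1 : 0;
}
print "$num\n";   # prints 2
```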
