Solved

Posted on 2009-02-20

I'm trying to calculate the cosine coefficient between two documents, and I'm not getting the correct answer. I believe I'm getting it wrong when calculating the cross product and the sum of the squares between the documents.

Here is my code. It runs, but I keep getting a wrong answer.

```
#!/usr/bin/perl
#Ennio Bozzetti
#S0547650
use DBI;
###########################
# Setup the DB connection #
###########################
my $dbh = DBI->connect("dbi:mysqlPP:database=crawler;host=localhost",
"root", "ghh8773v", {'RaiseError' => 1});
############################################
# Get the first document, store it in a hash #
############################################
my $doc = {};
my $sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID = tbl_terms.docID AND tbl_doc.docID = 58";
$sth = $dbh->prepare($sSQL);
$sth->execute;
$doc = $dbh->selectall_hashref($sSQL, 'term');
################################################
# Get the second document and store in a hash  #
################################################
my $doc2 = {};
$sSQL = "SELECT tbl_terms.term FROM tbl_doc, tbl_terms WHERE tbl_doc.docID = tbl_terms.docID AND tbl_doc.docID = 58";
$sth = $dbh->prepare($sSQL);
$sth->execute;
$doc2 = $dbh->selectall_hashref($sSQL, 'term');
my $cosine_result = 0;
$cosine_result = cosine_sim_1($doc, $doc2);
print "The cosine coefficient between doc1 and doc2 is $cosine_result";
sub cosine_sim_1 {
my $vec1 = shift;
my $vec2 = shift;
my $num = 0;
my $sum_sq1 = 0;
my $sum_sq2 = 0;
my @val1 = values %{$vec1};
my @val2 = values %{$vec2};
#######################################################
# Get the smallest hash, $vec1 holds the smallest hash #
#######################################################
if ((scalar @val1) > (scalar @val2)){
my $temp = $vec1;
$vec1 = $vec2;
$vec2 = $temp;
}
########################
# Get the intersection #
########################
while (my ($key, $val) = each(%$vec1)){
$vec1->{$key} = '1', if exists $vec2->{$key};
}
while (my ($key, $val) = each(%$vec2)){
$vec2->{$key} = '1', if exists $vec1->{$key};
}
###############################
# Calculate the cross product #
###############################
while (my ($key, $val) = each(%$vec1)){
$num += $val * ($vec2->{$key} || 0);
}
# Calculate the sum of squares #
foreach my $term (@val1){$sum_sq1 += $term * $term}
foreach my $term (@val2){$sum_sq2 += $term * $term}
return ($num/sqrt($sum_sq1 * $sum_sq2));
}
############
# Clean up #
############
$sth->finish();
$dbh->disconnect();
```

16 Comments

And even when I compare two documents that are the same, it gives me 0.999566067943262 the first time I run it and 0.999622983688223 the second time.

This is the section that I removed:

```
########################
# Get the intersection #
########################
while (my ($key, $val) = each(%$vec1)){
$vec1->{$key} = '1', if exists $vec2->{$key};
}
while (my ($key, $val) = each(%$vec2)){
$vec2->{$key} = '1', if exists $vec1->{$key};
}
```

```
########################
# Get the intersection #
########################
while (my ($key, $val) = each(%$vec1)){
$vec1->{$key} = '1', if exists $vec2->{$key};
}
while (my ($key, $val) = each(%$vec2)){
$vec2->{$key} = '1', if exists $vec1->{$key};
}
```

should only be

```
########################
# Get the intersection #
########################
while (my ($key, $val) = each(%$vec1)){
$vec1->{$key} = '1', if exists $vec2->{$key};
}
```

Here is a sample result:

0.99981428947807

But the problem is that the result keeps changing: if I run it again, I will get a different number.

The documents that I get from the DB are stored in a hash reference. When I pass them to the sub, @val1 and @val2 should contain the values of the two hashes that I passed, but when I print the values of @val1 I get something like this:

HASH(0x2d0c464)

HASH(0x2d0c404)

......

my @val1 = values %{$vec1};

my @val2 = values %{$vec2};

When I print scalar @val1 and scalar @val2, I get the size of the docs that I'm comparing, but when I print the values of @val1 and @val2 I get the seemingly random values that I posted in the previous post.

The results are very close, but it should be 1 and I'm getting 0.99981428947807.
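The `HASH(0x...)` output is probably the key clue. Here is a minimal sketch, assuming the data has the shape that `selectall_hashref` normally returns (a hash keyed by term whose values are row hashes). The terms below are made up and no database is needed to see the effect. Each value is a reference, and a reference used as a number evaluates to its memory address, so `$term * $term` squares an address rather than a term weight. Those addresses can differ from run to run, which would explain a result that is close to 1 but keeps changing.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what selectall_hashref($sql, 'term') returns:
# a hash keyed by term, where every value is itself a hash ref (the row).
my $doc = {
    apple  => { term => 'apple'  },
    banana => { term => 'banana' },
};

my @vals = values %{$doc};

# Each value is a reference, so printing it shows HASH(0x...).
print "$_\n" for @vals;

# In numeric context a reference becomes its memory address.
my $as_number = $vals[0] + 0;
print "numeric value: $as_number\n";

# Counting the keys gives a stable size that does not depend on addresses.
my $size = keys %{$doc};
print "number of terms: $size\n";
```

The count of keys is the only number here that is the same on every run.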

Then maybe you want:

while (my ($key, $val) = each(%$vec1)){

$num += $vec2->{$key} && 1;

}

# Calculate the sum of squares #

foreach my $term (@val1){$sum_sq1 += 1} #which would be the same as $sum_sq1 = @val1;

foreach my $term (@val2){$sum_sq2 += 1}

return ($num/sqrt($sum_sq1 * $sum_sq2));

}

Then maybe you want:

while (my ($key, $val) = each(%$vec1)){

$num += keys(%{$val}) * keys(%{$vec2->{$key}}) if $vec2->{$key};

}

foreach my $term (@val1){$sum_sq1 += (keys %{$term})**2}

foreach my $term (@val2){$sum_sq2 += (keys %{$term})**2}
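Putting the suggestions above together, here is a self-contained sketch that treats each document as a set of terms with weight 1. It is not the original poster's final code: `cosine_sim_binary` and `make_doc` are hypothetical names, and the sample terms are invented to mimic the shape of a `selectall_hashref` result. With binary weights the dot product is simply the number of shared terms, and each sum of squares is the number of distinct terms.

```perl
use strict;
use warnings;

# Cosine similarity for two documents represented as sets of terms.
# Only the hash keys are used, so the row hash refs stored as values
# never enter the arithmetic.
sub cosine_sim_binary {
    my ($vec1, $vec2) = @_;

    my $n1 = keys %{$vec1};    # sum of squares for doc 1 (all weights are 1)
    my $n2 = keys %{$vec2};    # sum of squares for doc 2
    return 0 if $n1 == 0 || $n2 == 0;    # avoid dividing by zero

    # Dot product: count the terms that appear in both hashes.
    my $shared = grep { exists $vec2->{$_} } keys %{$vec1};

    return $shared / sqrt($n1 * $n2);
}

# Hypothetical helper that builds data shaped like a selectall_hashref result.
sub make_doc {
    my %doc;
    $doc{$_} = { term => $_ } for @_;
    return \%doc;
}

my $doc1 = make_doc(qw(perl hash cosine));
my $doc2 = make_doc(qw(perl hash cosine));
my $doc3 = make_doc(qw(perl regex));

printf "same docs:      %.4f\n", cosine_sim_binary($doc1, $doc2);
printf "different docs: %.4f\n", cosine_sim_binary($doc1, $doc3);
```

Because nothing is written back into the hashes and no reference is ever used as a number, comparing a document with itself should give exactly 1 on every run. If term frequencies are needed later, the query would have to return a count per term and the weights would replace the constant 1.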
