• Status: Solved

# Calculating IDF

I need some help calculating the IDF.

I wrote this code to loop through an array of terms and check whether each term is in a hash that contains some text information.

My hash has only 5 texts, and the terms in the hash do not repeat. But when I print the number of times each term appears in the documents hash, I get some crazy number (like 4000 or 380) and not 5, 1, or 2.

Here is the loop where I look at the terms and search through the hash.

The @doc array contains text information, so each element of the array holds one text. @docTemp is used to split a text from the @doc array into its terms and store each term as a separate value inside the array.

For example:
$doc[1] = "The mouse is black"
```
for ($counter = 0; $counter <= $#terms; $counter++){
    $nDocs = 0;
    for ($count = 0; $count <= $#doc; $count++){
        @docTemp = split(/\s+/, $doc[$count]);
        ###################################
        # STORE THE DOCUMENTS INTO A HASH #
        ###################################
        for my $word (@docTemp){
            $docHash{$word}++;
        }
        for my $key ( keys %docHash ) {
            ################################
            # CHECK IF TERM IS IN THE HASH #
            ################################
            if ($terms[$counter] == $key){
                $nDocs++;
            }
        }
    }
    print $terms[$counter], " ", $nDocs, "\n";
}
```
Ennio

Author Commented:
I made some changes to the code, and now every term comes out as 6.

Here are the changes in the code.
```
for ($counter = 0; $counter <= $#terms; $counter++){
    $nDocs = 0;
    for ($count = 0; $count <= $#doc; $count++){
        @docTemp = split(/\s+/, $doc[$count]);
        ###################################
        # STORE THE DOCUMENTS INTO A HASH #
        ###################################
        for my $word (@docTemp){
            $docHash{$word}++;
        }

        if (exists $docHash{$terms[$counter]}){
            $nDocs++;
        }
    }
    print $terms[$counter], " ", $nDocs, "\n";
}
```
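One likely reason the first version produced counts in the hundreds or thousands is its `==` test. In Perl, `==` compares numerically, and a string that does not look like a number is treated as 0, so any two ordinary words compare as equal; string comparison needs `eq`. Switching to `exists` sidesteps this. A minimal sketch of the difference, using hypothetical sample words and a made-up helper `count_matches` (neither comes from the original post):

```perl
use strict;
use warnings;

# Count how many keys "match" the term, using either == or eq.
# count_matches is a hypothetical helper written for this illustration.
sub count_matches {
    my ($term, $use_numeric, @keys) = @_;
    no warnings 'numeric';    # silence "isn't numeric" warnings for the demo
    my $n = 0;
    for my $key (@keys) {
        if ($use_numeric ? $term == $key : $term eq $key) {
            $n++;
        }
    }
    return $n;
}

# Hypothetical sample data.
my @keys = qw(The mouse is black);

my $numeric = count_matches('cat', 1, @keys);   # words all evaluate to 0
my $string  = count_matches('cat', 0, @keys);   # real string comparison

print "== matched $numeric keys, eq matched $string keys\n";
```

With `==`, the term "cat" matches every word in the list even though it is not there, so the count grows with the size of the hash rather than with the number of documents.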

Commented:
Where did $terms[$counter] come from?

If exists $docHash{$terms[$counter]} is true the first time through the for ($count = 0; $count <= $#doc; $count++) loop,
it will also be true the next time through the loop.
Is that what you want?

Author Commented:
@terms is an array that contains the terms that I'm searching for in the hash.

I only want the if (exists $docHash{$terms[$counter]}) branch to run when $terms[$counter] is in the hash; otherwise I don't want it counted.

Commented:
exists $docHash{$terms[$counter]} is true when $terms[$counter] is in the %docHash hash.
Since you only ever add entries to the hash, once a term is in the hash it will always be in the hash.
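To make the accumulation concrete, here is a small sketch that mirrors the inner loop with one hash shared across all documents. The sample texts and the helper name `count_with_shared_hash` are assumptions for illustration. In the posted code the hash is also never cleared between terms, which would make the effect stronger still.

```perl
use strict;
use warnings;

# Mirrors the thread's inner loop: one hash shared by every document.
# count_with_shared_hash is a hypothetical helper for this illustration.
sub count_with_shared_hash {
    my ($term, @doc) = @_;
    my %docHash;          # never cleared between documents
    my $nDocs = 0;
    for my $text (@doc) {
        $docHash{$_}++ for split /\s+/, $text;
        # Once the term has been seen, this stays true for all later documents.
        $nDocs++ if exists $docHash{$term};
    }
    return $nDocs;
}

# Hypothetical sample documents.
my @sample = ("The mouse is black", "The cat is white", "A dog barks");

# Prints 3, although "mouse" occurs in only the first document.
print count_with_shared_hash('mouse', @sample), "\n";
```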

Author Commented:
What would be the best way to do this? Should I check and delete the key from the hash?
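One possible approach, rather than deleting keys, is to use a fresh hash for each document so that a term is counted at most once per document, and to build the counts once instead of once per term. The sketch below is not from the thread: the helper names `document_frequency` and `idf` and the sample data are assumptions, and it uses the common definition IDF = log(N / df) with the natural logarithm, which may differ from the variant you need.

```perl
use strict;
use warnings;

# For each word, count the number of documents that contain it.
sub document_frequency {
    my (@doc) = @_;
    my %df;
    for my $text (@doc) {
        my %seen;                              # fresh hash for every document
        $seen{$_} = 1 for split ' ', $text;    # ' ' also skips leading whitespace
        $df{$_}++ for keys %seen;              # each document counts at most once
    }
    return %df;
}

# IDF as log(N / df); returns undef for a term that appears in no document.
sub idf {
    my ($term, $nDocs, $df) = @_;
    return undef unless $df->{$term};
    return log($nDocs / $df->{$term});
}

# Hypothetical sample data.
my @doc   = ("The mouse is black", "The cat is white", "A dog barks");
my @terms = qw(mouse The zebra);

my %df = document_frequency(@doc);
for my $term (@terms) {
    my $value = idf($term, scalar @doc, \%df);
    printf "%s %d %s\n", $term, $df{$term} || 0,
        defined $value ? sprintf('%.3f', $value) : 'n/a';
}
```

Matching here is case-sensitive, so "The" and "the" are different terms; lowercasing both the documents and the terms first would be one way to change that.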