Solved

# calculate IDF

Posted on 2009-12-30
Medium Priority
381 Views
Hello everybody,
I have not been programming for long, and I still have many doubts.
I have to calculate the TF-IDF of some documents (three of them), but I don't know exactly how to calculate the IDF.
I know that IDF = total number of documents in the corpus / number of documents containing a given word.
The part of my program related to the IDF is in the code snippet at the end of this post.

I guess there is something wrong, because when I print "countWordDoc $key = $countWordDoc" it always gives me the same number.

Thank you for your attention, and happy new year!
```perl
use strict;

my $countDoc = 0; # total number of processed documents
my $countWordDoc = 0; # total number of documents containing a given word

for ($countWordDoc = 0; $countWordDoc <= $countDoc; ++$countWordDoc)
{
    if (exists $wordHash{$key}) # if a word is contained in a document # increase the counter
    {
        ++$countWordDoc;
        next;
    }
}
```
Question by:ironpony86

Expert Comment

ID: 26144677
# if a word is contained in a document # increase the counter
How are you specifying which document?
How are you specifying which word?

Author Comment

ID: 26144852
Well, the document is an input file (I have already checked that part of the program, and it works),
and I collect the words in this way:

```perl
while (<INFILE>) # read the input file line by line
{
    $line = $_; # the current line
    while ($line =~ /\b(\w+)\b/g) # get each word
    {
        my $word = "\L$1"; # convert each word to lower case
        ++$totalwords; # increase the count of total words
        if (exists $wordHash{$word}) # check whether we already have the word
        {
            ++$wordHash{$word}; # if so, increase the counter
        }
        else
        {
            $wordHash{$word} = 1; # otherwise assign the value 1 to that word
        }
    }
}
```

Accepted Solution

ozo earned 500 total points
ID: 26144965
Where in your
for ($countWordDoc = 0; $countWordDoc <= $countDoc; ++$countWordDoc)
loop do you say which document you are checking when you do
if (exists $wordHash{$key})
Is $key the word you are checking?
Do you reset %wordHash and recreate it as in http:#26144852 for every iteration of that loop? (That would be a rather inefficient way to do it.)
If the code snippet you posted is the complete loop, then nothing in it affects the if condition, so it has the same
value on every iteration, and you might as well take the exists $wordHash{$key} test out of the loop.

Is %wordHash supposed to contain one document, or the entire corpus?
If you want it to contain the entire corpus and keep track of which documents each word was in, you might do something like
++$wordHash{$word}{$document};
Then the count of documents that contain the word would be just
scalar keys %{$wordHash{$word}}

By the way, there is no need to check exists $wordHash{$word} before incrementing:
if you increment an element that doesn't exist, the result will be 1.
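
To make the suggestion above concrete, here is a minimal sketch of the corpus-wide hash of hashes. The in-memory %documents hash, the file names, and the helper names build_word_hash and doc_count are hypothetical stand-ins for illustration; the original program reads its documents from INFILE instead.

```perl
use strict;
use warnings;

# Hypothetical in-memory corpus standing in for the three input files.
my %documents = (
    'doc1.txt' => 'The cat sat on the mat',
    'doc2.txt' => 'The dog chased the cat',
    'doc3.txt' => 'Birds sing at dawn',
);

# Build $wordHash{$word}{$document} = occurrences of $word in $document.
sub build_word_hash {
    my (%docs) = @_;
    my %wordHash;
    for my $document (keys %docs) {
        while ($docs{$document} =~ /\b(\w+)\b/g) {
            my $word = lc $1;
            ++$wordHash{$word}{$document};   # autovivifies; no exists check needed
        }
    }
    return %wordHash;
}

# Number of documents that contain $word (0 if the word was never seen).
sub doc_count {
    my ($wordHash, $word) = @_;
    return exists $wordHash->{$word}
        ? scalar keys %{ $wordHash->{$word} }
        : 0;
}

my %wordHash = build_word_hash(%documents);
for my $key (sort keys %wordHash) {
    printf "countWordDoc %s = %d\n", $key, doc_count(\%wordHash, $key);
}
```

The exists test in doc_count is only there so that looking up an unseen word does not create an empty entry in the hash.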


Author Comment

ID: 26145561
%wordHash is supposed to contain one document at a time, not all of them together.

The first code snippet I posted doesn't belong to the loop of the second code snippet; it's outside it.

What I would like to do is count, for each word, how many documents contain it. The word I work with is $key. How can I do it?

I am really sorry that it takes me so long to understand... this is one of my first experiences...

Thank you so much for your patience!
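
If %wordHash holds only one document at a time, one possible approach is to keep a second hash that outlives each document and is incremented once per distinct word. The sketch below assumes that setup; the @documents list and the names count_words, document_frequency and %docFreq are hypothetical, and the strings stand in for text that the real program would read from its input files. IDF is commonly defined as the logarithm of the ratio given in the question, so the sketch applies Perl's natural log, which is an assumption about the variant wanted.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the three input files; in the real program
# each string would come from reading one file via <INFILE>.
my @documents = (
    'The cat sat on the mat',
    'The dog chased the cat',
    'Birds sing at dawn',
);

# Count how often each word occurs in one document (the per-document %wordHash).
sub count_words {
    my ($text) = @_;
    my %wordHash;
    ++$wordHash{ lc $1 } while $text =~ /\b(\w+)\b/g;
    return %wordHash;
}

# For every word, count the number of documents it appears in.
sub document_frequency {
    my (@docs) = @_;
    my %docFreq;
    for my $text (@docs) {
        my %wordHash = count_words($text);   # rebuilt for each document
        ++$docFreq{$_} for keys %wordHash;   # once per document, not per occurrence
    }
    return %docFreq;
}

my $countDoc = @documents;                   # total number of documents
my %docFreq  = document_frequency(@documents);
for my $key (sort keys %docFreq) {
    # Natural log of (documents in corpus / documents containing the word).
    my $idf = log($countDoc / $docFreq{$key});
    printf "%-7s countWordDoc = %d  IDF = %.3f\n", $key, $docFreq{$key}, $idf;
}
```

Because %docFreq is only touched once per distinct word in each document, a word that occurs many times in one file still counts as a single document.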

