Solved

# calculate IDF

Posted on 2009-12-30
Hello everybody,
I haven't been programming for long, and I still have many doubts.
I have to calculate the TF-IDF of some documents (3), but I don't know exactly how to calculate the IDF.
I know that IDF = total number of documents in a corpus / total number of documents containing a given word.
This is my program for the part related to the IDF:

    use strict;

    my $countDoc = 0;     # total number of processed documents
    my $countWordDoc = 0; # total number of documents containing a given word

    for ($countWordDoc = 0; $countWordDoc <= $countDoc; ++$countWordDoc)
    {
        if (exists $wordHash{$key}) # if a word is contained in a document # increase the counter
        {
            ++$countWordDoc;
            next;
        }
    }

But I guess there is something wrong, because when I print "countWordDoc $key = $countWordDoc" it always gives me the same number.

Thank you for the attention and happy new year!
Question by:ironpony86

Expert Comment

Regarding your comment "# if a word is contained in a document # increase the counter":
How are you specifying which document?
How are you specifying which word?

Author Comment

Well, the document is an input file (I have already checked that part of the program, it works)
and I have specified the word in this way:

```perl
while (<INFILE>) # take every file of the directory
{
    $line = $_; # read lines in a loop
    while ($line =~ /\b(\w+)\b/g) # get each word
    {
        my $word = "\L$1"; # translate each word in lower case
        ++$totalwords; # increase the amount of total words
        if (exists $wordHash{$word}) # check whether we already have the word
        {
            ++$wordHash{$word}; # if so, increase the counter
        }
        else
        {
            $wordHash{$word} = 1; # otherwise assign the value 1 to that word
        }
    }
}
```

Accepted Solution

Where in your

    for ($countWordDoc = 0; $countWordDoc <= $countDoc; ++$countWordDoc)

loop do you say which document you are checking when you do

    if (exists $wordHash{$key})

Is $key the word you are checking?
Do you reset %wordHash and recreate it as in http:#26144852 for every iteration of that loop? (That would be a rather inefficient way to do it.)
If the code snippet you posted was the complete loop, then nothing affects the if condition, so it has the same value on every iteration of the loop, and you might as well take the exists $wordHash{$key} condition out of the loop.

Is %wordHash supposed to contain one document, or the entire corpus?
If you want it to contain the entire corpus and keep track of which documents each word was in, you might do something like

    ++$wordHash{$word}{$document};
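
A minimal sketch of that nested-hash idea, for illustration only. The %docs hash and its file names are made up here to stand in for your input files, so the example runs without any files on disk; the word-splitting regex is the one from your own snippet.

```perl
use strict;
use warnings;

# Hypothetical in-memory corpus standing in for the input files:
# document name => text of that document.
my %docs = (
    'doc1.txt' => "The cat sat on the mat",
    'doc2.txt' => "The dog sat",
    'doc3.txt' => "A bird sang",
);

# $wordHash{$word}{$document} = number of times $word occurs in $document
my %wordHash;

for my $document (sort keys %docs) {
    while ($docs{$document} =~ /\b(\w+)\b/g) {    # get each word
        my $word = lc $1;                          # lower-case it
        ++$wordHash{$word}{$document};             # count per word, per document
    }
}

# Show which documents each word was seen in, with its count there.
for my $word (sort keys %wordHash) {
    my @seen = map { "$_ ($wordHash{$word}{$_})" } sort keys %{ $wordHash{$word} };
    print "$word: @seen\n";
}
```

With this layout a single pass over the corpus is enough, and %wordHash never has to be reset between documents.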

By the way, there is no need to check exists $wordHash{$word} before incrementing:
if you increment an element that doesn't exist, the result will be 1 as well.
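
A small sketch showing the increment without any exists check. The %count hash and the fruit names are just illustrative.

```perl
use strict;
use warnings;

my %count;

# No "exists" test needed: incrementing a missing element
# creates it, and the result of the first increment is 1.
++$count{apple};
++$count{apple};
++$count{pear};

print "$_ = $count{$_}\n" for sort keys %count;
```

This also runs cleanly under "use warnings", since ++ on an undefined value does not raise an "uninitialized" warning.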

With a nested hash like $wordHash{$word}{$document}, the count of documents that contain a word would then be just

    scalar keys %{$wordHash{$word}}
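
A sketch of how that count could feed the IDF. The contents of %wordHash and the value of $countDoc below are made-up sample data, and %idf is a name introduced only for this example. Note that the usual definition of IDF takes the logarithm of the ratio given in the question; Perl's log() is the natural logarithm, which is an arbitrary choice of base here.

```perl
use strict;
use warnings;

# Hypothetical sample data in the $wordHash{$word}{$document} layout.
my %wordHash = (
    the => { 'doc1.txt' => 2, 'doc2.txt' => 1 },
    cat => { 'doc1.txt' => 1 },
    sat => { 'doc1.txt' => 1, 'doc2.txt' => 1, 'doc3.txt' => 1 },
);
my $countDoc = 3;    # total number of processed documents (assumed)

my %idf;
for my $word (sort keys %wordHash) {
    # number of documents containing $word
    my $countWordDoc = scalar keys %{ $wordHash{$word} };

    # IDF = log(total documents / documents containing the word)
    $idf{$word} = log($countDoc / $countWordDoc);

    printf "%-4s df=%d idf=%.3f\n", $word, $countWordDoc, $idf{$word};
}
```

A word that occurs in every document gets an IDF of 0, and $countWordDoc can never be 0 here because a word only gets a key in %wordHash once it has been seen in some document.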


Author Comment

%wordHash is supposed to contain one document at a time, not all of them together.

The first Code Snippet I posted doesn't belong to the loop of the second Code Snippet; it's outside it.

What I would like to do is count, for each word, how many documents contain it. The word I work with is $key. How can I do it?

I am really sorry that I take so long to understand... it's just one of my first experiences...

Thank you so much for your patience!
