Solved

calculate IDF

Posted on 2009-12-30
Last Modified: 2012-05-08
Hello everybody,
I haven't been programming for long, and I still have many doubts.
I have to calculate the TF-IDF of some documents (3), but I don't know exactly how to calculate the IDF.
As far as I know, IDF = log(total number of documents in the corpus / number of documents containing a given word).
This is the part of my program related to the IDF:

use strict;

my $countDoc = 0; # total number of processed documents
my $countWordDoc = 0; # total number of documents containing a given word

for ($countWordDoc = 0; $countWordDoc <= $countDoc; ++$countWordDoc)
{
     if (exists $wordHash{$key}) # if a word is contained in a document # increase the counter
     {
        ++$countWordDoc;
        next;      
     }
}


But I guess there is something wrong, because when I print "countWordDoc $key = $countWordDoc" it always gives me the same number.

Thank you for the attention and happy new year!
Question by:ironpony86
4 Comments
 
LVL 85

Expert Comment

by:ozo
ID: 26144677
# if a word is contained in a document # increase the counter
How are you specifying which document?
How are you specifying which word?
 

Author Comment

by:ironpony86
ID: 26144852
Well, the document is an input file (I have already checked that part of the program, and it works),
and I specify the word in this way:

while (<INFILE>) # read the current input file line by line
{
    $line = $_; # the line just read
    while ($line =~ /\b(\w+)\b/g) # get each word
    {
        my $word = "\L$1"; # convert each word to lower case
        ++$totalwords; # increase the total word count
        if (exists $wordHash{$word}) # check whether we already have the word
        {
            ++$wordHash{$word}; # if so, increase the counter
        }
        else
        {
            $wordHash{$word} = 1; # otherwise assign the value 1 to that word
        }
    }
}

 
LVL 85

Accepted Solution

by:
ozo earned 500 total points
ID: 26144965
Where in your
  for ($countWordDoc = 0; $countWordDoc <= $countDoc; ++$countWordDoc)
loop do you say which document you are checking when you do
  if (exists $wordHash{$key})
Is $key the word you are checking?
Do you reset %wordHash and recreate it as in http:#26144852 for every iteration of that loop? (That would be a rather inefficient way to do it.)
If the Code Snippet you posted is the complete loop, then nothing inside it affects the if condition, so it has the same
value on every iteration, and you might as well take the exists $wordHash{$key} condition out of the loop.
Note also that $countWordDoc is both the loop counter and the value you increment inside the body, so the loop changes the very count it is testing.


Is %wordHash supposed to contain one document, or the entire corpus?
If you want it to contain the entire corpus and keep track of which documents each word was in, you might do something like
  ++$wordHash{$word}{$document};

By the way, there is no need to check exists $wordHash{$word} before incrementing;
if you increment an element that doesn't exist, the result will be 1.
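
A small sketch of that point, for illustration only. The sample words and the %docsForWord hash are made up here and are not part of the asker's program:

```perl
use strict;
use warnings;

my %wordHash;

# No "exists" check is needed: incrementing a missing element creates it
# (undef counts as 0), so the first increment leaves the value at 1.
for my $word (qw(cat dog cat)) {
    ++$wordHash{$word};
}

# The same holds one level down: the inner hash is created on demand.
# %docsForWord and 'doc1.txt' are hypothetical names for this example.
my %docsForWord;
++$docsForWord{cat}{'doc1.txt'};

print "cat: $wordHash{cat}, dog: $wordHash{dog}\n";
print "documents containing cat: ", scalar(keys %{ $docsForWord{cat} }), "\n";
```

This runs cleanly under "use warnings", since ++ on an undefined value does not trigger an "uninitialized" warning.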

With the nested ++$wordHash{$word}{$document} structure, the count of documents that contain a word would be just
scalar keys %{$wordHash{$word}}
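
Putting the pieces together, the approach might look roughly like the sketch below. It is only one possible reading of the suggestion, under several assumptions: the three documents are hypothetical in-memory strings rather than the input files used in the thread, the idf() helper is invented for the example, and IDF is taken as the natural logarithm of (total documents / documents containing the word), which is a common definition but not the only one.

```perl
use strict;
use warnings;

# Hypothetical corpus; in the thread each document is an input file instead.
my %corpus = (
    'doc1.txt' => 'The cat sat on the mat',
    'doc2.txt' => 'The dog chased the cat',
    'doc3.txt' => 'A bird sang',
);

# $wordHash{$word}{$document} = number of times $word occurs in $document
my %wordHash;

for my $document (keys %corpus) {
    my $text = $corpus{$document};
    while ($text =~ /\b(\w+)\b/g) {    # get each word
        my $word = lc $1;              # convert it to lower case
        ++$wordHash{$word}{$document}; # count it for this document
    }
}

my $countDoc = scalar keys %corpus;    # total number of documents

# Hypothetical helper: IDF of a word, or undef if no document contains it.
# The exists check also avoids creating an empty entry for unknown words.
sub idf {
    my ($word) = @_;
    return undef unless exists $wordHash{$word};
    my $countWordDoc = scalar keys %{ $wordHash{$word} };
    return log($countDoc / $countWordDoc);
}

for my $key (sort keys %wordHash) {
    my $countWordDoc = scalar keys %{ $wordHash{$key} };
    printf "%-7s in %d of %d documents, idf = %.3f\n",
        $key, $countWordDoc, $countDoc, idf($key);
}
```

Here "cat" occurs in two of the three documents, so its IDF is log(3/2), about 0.405, while "bird" occurs in only one and gets log(3), about 1.099. Because the per-document counts are kept as well, the term frequency needed for TF-IDF can be read from $wordHash{$word}{$document} without a second pass over the files.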


 

Author Comment

by:ironpony86
ID: 26145561
%wordHash is supposed to contain one document at a time, not all of them together.

The first Code Snippet I posted doesn't belong to the loop of the second Code Snippet; it's outside it.

What I would like to do is count, for each word, how many documents contain it. The word I work with is $key. How can I do it?

I am really sorry that it takes me so long to understand... this is just one of my first experiences...

Thank you so much for your patience!


