Solved

Calculating IDF

Posted on 2008-11-17
Last Modified: 2012-08-13
I need some help calculating the IDF.

I wrote this code to loop through an array of terms and check whether each term is in a hash that contains some text information.

My hash has only 5 texts, and the terms in the hash do not repeat. But when I print the number of times a term appears in the documents hash, I get some crazy number (like 4000 or 380) and not 5, 1, or 2.

Here is the loop where I look at the terms and search through the hash.

The @doc array contains text information, so each element of the array holds one text. @docTemp is used to split a text from the @doc array into terms and store each term as a separate element.

For example:
$doc[1] = "The mouse is black";

for ($counter = 0; $counter <= $#terms; $counter++){
   $nDocs = 0;
   for ($count = 0; $count <= $#doc; $count++){
      @docTemp = split(/\s+/, $doc[$count]);
      ###################################
      # STORE THE DOCUMENTS INTO A HASH #
      ###################################
      for my $word (@docTemp){
         $docHash{$word}++;
      }
      for my $key ( keys %docHash ) {
         ################################
         # CHECK IF TERM IS IN THE HASH #
         ################################
         if ($terms[$counter] == $key){
            $nDocs++;
         }
      }  
   }
   print $terms[$counter], " ", $nDocs, "\n";
}
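
A likely reason for the inflated counts in this first version is the `==` in the term check. In Perl, `==` compares numbers, and a non-numeric string such as a word converts to 0, so any two words compare as equal. That would make `$nDocs` go up once for every key in the hash on every pass, rather than once per matching document. The sketch below illustrates the difference with `eq`; the sample values `$term` and `$key` are made up for the demonstration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample values, not taken from the original post.
my $term = "mouse";
my $key  = "black";

# Numeric comparison: both strings convert to 0, so this is true.
# "no warnings" only silences the "isn't numeric" warning for this demo.
my $numeric_match = do { no warnings 'numeric'; $term == $key } ? 1 : 0;

# String comparison: true only when the two strings are identical.
my $string_match = ($term eq $key) ? 1 : 0;

print "== says: $numeric_match\n";   # prints "== says: 1"
print "eq says: $string_match\n";    # prints "eq says: 0"
```

Running the original script with `use warnings;` enabled would flag this with an "isn't numeric" warning for each comparison.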

Question by:Ennio
5 Comments
 
LVL 1

Author Comment

by:Ennio
ID: 22981355
I made some changes to the code, and now every term comes out as 6.

Here are the changes to the code.
for ($counter = 0; $counter <= $#terms; $counter++){
   $nDocs = 0;
   for ($count = 0; $count <= $#doc; $count++){
      @docTemp = split(/\s+/, $doc[$count]);
      ###################################
      # STORE THE DOCUMENTS INTO A HASH #
      ###################################
      for my $word (@docTemp){
         $docHash{$word}++;
      }
      
      if (exists $docHash{$terms[$counter]}){
         $nDocs++;
      }
   }
   print $terms[$counter], " ", $nDocs, "\n";
}

 
LVL 85

Expert Comment

by:ozo
ID: 22981692
Where did $terms[$counter] come from?

If exists $docHash{$terms[$counter]} is true the first time through the for ($count = 0; $count <= $#doc; $count++) loop,
it will also be true the next time through the loop.
Is that what you want?
 
LVL 1

Author Comment

by:Ennio
ID: 22981717
@terms is an array that contains the terms I'm searching for in the hash.

I only want the if (exists $docHash{$terms[$counter]}) check to succeed when $terms[$counter] is in the hash for that document; if it is not, I don't want it counted.
 
LVL 85

Accepted Solution

by:ozo (earned 2000 total points)
ID: 22981788
exists $docHash{$terms[$counter]} is true when $terms[$counter] is in the %docHash hash.
Since you only accumulate entries in the hash, once a term is in the hash it will always be in the hash.
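
One way to act on this, as a minimal sketch rather than the thread's confirmed fix, is to give every document its own fresh hash, so that words from earlier documents cannot leak into later checks. The sample contents of `@doc` and `@terms`, and the names `@doc_words` and `%doc_freq`, are assumptions made for illustration. The IDF line uses log(N / df), which is one common definition; the thread itself does not say which formula was intended.

```perl
use strict;
use warnings;

# Hypothetical sample data; the original @doc and @terms were not posted in full.
my @doc = (
    "The mouse is black",
    "The cat is white",
    "A black cat chased the mouse",
);
my @terms = ("mouse", "cat", "dog");

# Build one word set per document, once, instead of one shared hash.
my @doc_words;
for my $text (@doc) {
    my %seen;                              # fresh hash for every document
    $seen{$_} = 1 for split ' ', $text;    # matching is case-sensitive
    push @doc_words, \%seen;
}

# Document frequency: in how many documents does each term occur?
my %doc_freq;
for my $term (@terms) {
    # grep assigned to a scalar yields the number of matching documents.
    $doc_freq{$term} = grep { exists $_->{$term} } @doc_words;
}

for my $term (@terms) {
    my $n = $doc_freq{$term};
    # log(N / df) is undefined when the term occurs in no document.
    my $idf = $n ? log(scalar(@doc) / $n) : undef;
    printf "%s %d %s\n", $term, $n,
        defined $idf ? sprintf("%.4f", $idf) : "n/a";
}
```

Because `%seen` is declared inside the document loop, nothing needs to be deleted from a shared hash between documents. With the sample data above, "mouse" and "cat" are each found in 2 documents and "dog" in none.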
 
LVL 1

Author Comment

by:Ennio
ID: 22981824
What would be the best way to do this? Should I check and delete the key from the hash?