Calculating TF

Posted on 2008-11-11
Last Modified: 2012-05-05
I have a question: I'm creating a script that will calculate the term frequency (TF) of the words in a document. I have the document stored in a vector.

When I print the vector, I get the raw number of times each word appears in the text; for example, the word "college" appears 10 times. Should I keep the counts as they are, or should I scale them to a number between 0 and 1?

I ask because everywhere I look, TF values are given as numbers like 0.1 or 0.9.

Here is the code I have to calculate the frequency.
#Get the word frequency from the text.
for my $word (@$words){
    $freq{$word}++;   # assumed body: the post is cut off here; %freq is a hypothetical hash of counts
}

Question by: Ennio
    LVL 39

    Expert Comment

    Keeping the TF between 0 and 1 means it is normalized.  To do this, divide every number by the highest frequency.

    Whether or not you do this will depend on how you are using it.
    LVL 1

    Author Comment

    So should I divide by the highest frequency or by the number of words in the text?

    LVL 39

    Accepted Solution

    By the highest frequency.

    For example, if you had this:
        college: 10
        apple: 6
        letter: 16
    You would divide each by 16, because it is the highest frequency, getting:
        college: .625
        apple: .375
        letter: 1.0
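
The division above can be sketched in Perl as follows. This is only an illustration, not the asker's actual script: the hash names `%freq` and `%tf` are assumptions, and the counts are the three from the example.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Raw counts taken from the example above (hash name %freq is an assumption).
my %freq = (college => 10, apple => 6, letter => 16);

# Highest raw count among all words.
my $max = max(values %freq);

# Divide every count by the highest count so each value lands in (0, 1].
my %tf = map { $_ => $freq{$_} / $max } keys %freq;

for my $word (sort keys %tf) {
    printf "%s: %.3f\n", $word, $tf{$word};
}
```

With this scheme the most frequent word always scores exactly 1, and every other word is expressed relative to it.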
    LVL 1

    Author Comment

    OK, thanks. I asked because I read somewhere that you should divide by the total number of terms in the text. That makes sense now.

    LVL 39

    Expert Comment

    Well, again, it depends on what you are looking for. You could also divide each count by the total number of terms in the text, which gives each word's share of the document.
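
For comparison, here is a rough sketch of the other option, dividing by the total number of terms. The token list is made up for illustration, and `%freq`, `%tf`, and `$total` are assumed names; in the asker's script the tokens would come from `@$words`.

```perl
use strict;
use warnings;

# Hypothetical token list standing in for the document's words.
my $words = [qw(college apple college letter letter letter apple college)];

# Count how many times each word occurs.
my %freq;
$freq{$_}++ for @$words;

# Total number of terms in the document.
my $total = scalar @$words;

# Divide each count by the total, so the values sum to 1 across the document.
my %tf = map { $_ => $freq{$_} / $total } keys %freq;

for my $word (sort keys %tf) {
    printf "%s: %.3f\n", $word, $tf{$word};
}
```

Here the values are each word's fraction of the whole text, so they depend on document length, whereas dividing by the highest frequency only compares words against the most common one.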
