How to count words ... ozo please help...?

Posted on 2002-03-30
Medium Priority
Last Modified: 2010-03-05
Question by:sdesar
LVL 19

Expert Comment

by:Kim Ryan
ID: 6907937
I wrote a CPAN module that will analyze text and report many statistics, including the number of words. You can download it from http://www.cpan.org/modules/by-module/Lingua/KIMRYAN/Lingua-EN-Fathom-1.06.tar.gz

use Lingua::EN::Fathom;

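my $text = Lingua::EN::Fathom->new();
$text->analyze_block($string);   # $string holds the text to analyze; use analyze_file for a file
my $num_words = $text->num_words;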

Author Comment

ID: 6908211
Oops, sorry... the entire question did not get posted.
I am using Lingua::EN::Fathom in my code. Thanks for this module... it works great to find the best words.

here's my question-
I have 3 words ie "navigate, among, most" that are in $key1_splited  
I need to add the $key1_splited to %uniq_words and also display its count.

How can I do that?

Here's the script...

sub dumpKeywords {
    my $self = shift;
    my $dir = shift;   # sort by either alpha, or num
    $dir = "alpha" unless $dir;
    my $len = shift;
    $len = 0 unless $len;

    my %uniq_words = %{$self->{STEMCOUNT}};
    my $word;
    my $ret;

    my $key1_splited = $self->{EXTRAKEY};   # array ref holding the extra words

    my @list = sort keys %uniq_words;

    if($dir eq 'num') {
        @list = sort { $uniq_words{$b} <=> $uniq_words{$a} } keys %uniq_words;
        if($len) {
            splice @list, $len;
        }
    }

    my @key1_splited;
    my $tmp;
    my $var = ref($key1_splited);

    my $size = scalar(@{$key1_splited});
    print "THE the type is $var and size is $size <br>";

    foreach $tmp (@{$key1_splited}) {
        print "Tmp: $tmp <br>";
        push (@list, $tmp);
    }
    print "List pushed: @list <br>";

    $ret = "<TABLE>\n";
    foreach $word ( @list ) {
        # outputs the word and frequency
        $ret .= "<TR><TD ALIGN=right>" . $uniq_words{$word}. "</TD><TD>$word</TD></TR>\n";
        ##  print OUT ("$word\n"); # prints just the words
    }

    $ret .= "</TABLE>\n";
    return $ret;
}

Currently, the output of this script looks like this-
40 user
30 inform
21 access
17 expert
14 individu
14 coher
12 cost
12 weight
11 item
10 docum
9 present
9 brows
8 structur
--Here's 3 additional words entered by the user and their counts
1 navigate
2 among
3 most

I need to add a count to the additional 3 words.
Therefore, how can I count these 3 words in the text document and add them to uniq_words?

here's the site-

Here's the
--- the script has a lot of print statements for
debugging purposes.

The Url that you can enter there for analysis purposes
is -

At present, the code automatically finds the top 10
keywords... it uses FATHOM module.  I need to modify the code so it also finds
the 3 additional keywords that the user enters in the
input box.

Eagerly awaiting a response,
Thanks in advance for your time and efforts.

LVL 84

Accepted Solution

ozo earned 400 total points
ID: 6909102
$count = () = /\b\Q$word\E\b/gi;
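
To unpack the line above: the match runs in list context (against $_ as written), and assigning that list to () in scalar context yields the number of matches. \Q...\E quotes any regex metacharacters in the word, \b restricts it to whole words, and /i ignores case. A minimal sketch of how it could be applied to the three extra words, assuming the document text is available in a variable ($text here is a stand-in) and using a hypothetical helper name count_word that is not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: count whole-word, case-insensitive occurrences
# of $word in $text using the "= () =" counting idiom.
sub count_word {
    my ($text, $word) = @_;
    # The match runs in list context; the list assignment, evaluated in
    # scalar context, returns how many matches were found.
    my $count = () = $text =~ /\b\Q$word\E\b/gi;
    return $count;
}

# Sample data (assumed): stand-ins for the analyzed text and EXTRAKEY.
my $text = "Most users navigate among pages; most pages link among themselves.";
my $key1_splited = [ 'navigate', 'among', 'most' ];

my %extra_count;
for my $word (@$key1_splited) {
    $extra_count{$word} = count_word($text, $word);
}

print "$extra_count{$_} $_\n" for @$key1_splited;
```

With this sample text the sketch prints 1 for navigate, 2 for among and 2 for most. Note that these are counts of the literal words, not of their stems.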

Author Comment

ID: 6909462
The 3 EXTRA  words are in -
my $key1_splited  = $self->{EXTRAKEY};

I need to know if I can replace - $word in -
$count = () = /\b\Q$word\E\b/gi;

Previously, I added $key1_splited  to @list.

This gave me the words but NOT the count.
This outputs

How can I add 3 extra words to uniq_words and display the count ?

Awaiting a response,
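
One possible way to get the counts into the table is sketched below. The table lookup $uniq_words{$word} only finds a count if the extra word is a key of the hash, so pushing the word onto @list alone is not enough. This is a standalone sketch under assumptions: %uniq_words, $key1_splited and $text are stand-ins with made-up sample data, and how the raw text is reached inside the real object is not shown in the thread.

```perl
use strict;
use warnings;

# Stand-ins (assumed) for the script's data.
my %uniq_words   = ( user => 40, inform => 30, access => 21 );   # like STEMCOUNT
my $key1_splited = [ 'navigate', 'among', 'most' ];              # like EXTRAKEY
my $text = "Users navigate among items; most users access information among documents.";

# Top keywords by count, as in dumpKeywords with $dir eq 'num'.
my @list = sort { $uniq_words{$b} <=> $uniq_words{$a} } keys %uniq_words;

for my $word (@$key1_splited) {
    my $count = () = $text =~ /\b\Q$word\E\b/gi;   # count whole-word matches
    $uniq_words{$word} = $count;   # store the count so the lookup below finds it
    push @list, $word;             # keep extra words after the top keywords
}

my $ret = "<TABLE>\n";
for my $word (@list) {
    $ret .= "<TR><TD ALIGN=right>$uniq_words{$word}</TD><TD>$word</TD></TR>\n";
}
$ret .= "</TABLE>\n";
print $ret;
```

One caveat with this approach: if an extra word happens to equal an existing stem key, its stem count is overwritten by the raw word count and the word appears twice in @list, so a real version would probably want to check exists $uniq_words{$word} first.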

Author Comment

ID: 6909489
I think to make it work.. I need to add the EXTRAKEY to this routine-
# Get the top n Stem Keywords.  Also generate the equivalent array
# of real keywords (which will have more than n keys, and display unstemmed)
sub getStemKeywords {
     my $self = shift;
     my $len = shift;
     my $stems = $self->{STEMS};
     my $stemcount = $self->{STEMCOUNT};
     ##   my $key1_splited  = $self->{EXTRAKEY};
     my @list = sort { $stemcount->{$b} <=> $stemcount->{$a} }  keys %$stemcount;
     splice @list, $len;
      ##  my @key1_splited;
     ##   my $tmp;
  ##my $var = ref($key1_splited);

  ##my $size = scalar(@{$key1_splited});
 ## print "THE the type is $var and size is $size <br>";

     # now find all the words in the other list
     my @klist = ();
       ## foreach $tmp (@{$key1_splited}){
       ##    print "Tmp: $tmp <br>";
       ##    push (@klist, $tmp);
       ## }
     for (keys %$stems) {
          my $w = $_;
          for (@list) {
               if($stems->{$w} eq $_) {
                    push @klist, $w;
               }
          }
     }

     return( \@list, \@klist );
}

I tried to do a push (@klist, $tmp);.. but that just adds the keyword to @list, but it does NOT count.

How can I modify the above function so it returns
return( \@list, \@klist, \@extralist );

And then I think I will be able to use it in-
sub txtAnalyze {
# Now, do a re-count based on stemmed words
     my $fathom = $self->{FATHOM};
     my %uniq_words = $fathom->unique_words;
     my %keycount;

     for (keys %uniq_words) {
              my $tmp1 = $uniq_words{$_};
              my $tmp2 = $stemhash{$_};
        ###      print "COUNT: $tmp1  STEMHASH : $tmp2 <br>";
          $keycount{$stemhash{$_}} += $uniq_words{$_};
     }
     $self->{STEMCOUNT} = \%keycount;

     # Now, get the top 10 keywords

     ($self->{STEMKEYWORDS}, $self->{KEYWORDS}) = $self->getStemKeywords(10);
}


Awaiting suggestions...
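
A rough sketch of the three-value return asked about above, written as a plain function on a hash reference so it runs on its own. The sample data is made up, the extra list is simply a copy of EXTRAKEY, and the inner loops are replaced by a lookup hash (a swapped-in technique, not the original code):

```perl
use strict;
use warnings;

# Sketch: getStemKeywords returning a third list with the user's extra words.
sub getStemKeywords {
    my $self = shift;
    my $len  = shift;
    my $stems     = $self->{STEMS};
    my $stemcount = $self->{STEMCOUNT};
    my @extralist = @{ $self->{EXTRAKEY} || [] };   # copy of the extra words

    # Top $len stems by count.
    my @list = sort { $stemcount->{$b} <=> $stemcount->{$a} } keys %$stemcount;
    splice @list, $len if $len < @list;   # guard avoids a splice warning

    # All real words whose stem is among the top stems.
    my %top   = map { ($_ => 1) } @list;
    my @klist = grep { $top{ $stems->{$_} } } sort keys %$stems;

    return ( \@list, \@klist, \@extralist );
}

# Made-up sample data standing in for the object.
my $self = {
    STEMS     => { users => 'user', user => 'user', information => 'inform', costs => 'cost' },
    STEMCOUNT => { user => 40, inform => 30, cost => 12 },
    EXTRAKEY  => [ 'navigate', 'among', 'most' ],
};

my ($stemkeys, $keys, $extra) = getStemKeywords($self, 2);
print "stems: @$stemkeys\nwords: @$keys\nextra: @$extra\n";
```

In txtAnalyze the third value could then be captured with something like ($self->{STEMKEYWORDS}, $self->{KEYWORDS}, $self->{EXTRAKEYWORDS}) = $self->getStemKeywords(10); where EXTRAKEYWORDS is a hypothetical slot name. Returning the list does not count anything by itself; the counts still have to come from a match against the text, as in the accepted answer.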
