sdesar

Given a list of words, how do I count them in a text file?

How can I count the occurrences of each word in a list?

File1 - contains a list of words:
testing
automate
discrete
measure
seem
..
File2 - a text file that contains these words in different paragraphs.

File3 - should contain the following output:
           P1  P2  P3 ... Pn   (paragraphs)

testing    1    0   1
automate   5    0   4
discrete   3    3   3
measure    3    5   5
seem       2    1   0


How can I produce these word counts?
geotiger

I created a text file and a script that do what you want. To load your own word list into @a, you can use:

$fn1 = "/your/file/name";
open FILE, "<$fn1" or die "$!\n";
@a = <FILE>;
close FILE;
 

$ more cnt_words.txt
=head2 How can I count the number of occurrences of a substring within a string?

There are a number of ways, with varying efficiency: If you want a
count of a certain single character (X) within a string, you can use the
C<tr///> function like so:

    $string = "ThisXlineXhasXsomeXx'sXinXit";
    $count = ($string =~ tr/X//);
    print "There are $count X charcters in the string";

This is fine if you are just looking for a single character.  However,
if you are trying to count multiple character substrings within a
larger string, C<tr///> won't work.  What you can do is wrap a while()
loop around a global pattern match.  For example, let's count negative
integers:

    $string = "-9 55 48 -2 23 -76 4 14 -44";
    while ($string =~ /-\d+/g) { $count++ }
    print "There are $count negative numbers in the string";

=head1 Found in /usr/local/lib/perl5/5.00503/pod/perlfaq5.pod

=head2 How do I count the number of lines in a file?

One fairly efficient way is to count newlines in the file. The
following program uses a feature of tr///, as documented in L<perlop>.
If your text file doesn't end with a newline, then it's not really a
proper text file, so this may report one fewer line than you expect.

    $lines = 0;
    open(FILE, $filename) or die "Can't open `$filename': $!";
    while (sysread FILE, $buffer, 4096) {
        $lines += ($buffer =~ tr/\n//);
    }
    close FILE;

This assumes no funny games with newline translations.

$ more cnt_words.pl
#!/usr/local/bin/perl
# file name cnt_words.pl


@a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";
# you can open your first file to get the content into @a

$fn2 = "cnt_words.txt";

open WD, "<$fn2" or die "$!\n";
@b = <WD>;
close WD;

$p = 1;    # paragraph counter
%R =();
foreach $i (@b) {    # loop through each line
    if ($i =~ /^\n$/) {  ++$p; }
    foreach $j (@a) {  
        if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p] += 0; }
    }
}

for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
    printf "%10s %-30s\n", $i, $t;
}

$ ./cnt_words.pl
       are   0 1 1 2 1 0 0 0 0 0 0      
     count   1 1 2 2 2 0 1 1 0 0 0      
       end   0 0 0 0 0 0 0 1 0 0 0      
       for   0 0 0 1 0 0 0 0 0 0 0      
      game   0 0 0 0 0 0 0 0 0 1 0      
    number   1 1 0 0 1 0 1 0 0 0 0      
   pattern   0 0 0 1 0 0 0 0 0 0 0      
      text   0 0 0 0 0 0 0 2 0 0 0      
      with   1 2 0 1 0 0 0 1 0 1 0      
      work   0 0 0 1 0 0 0 0 0 0 0      
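
Note that the script above increments a counter once per line on which the pattern matches, and /$j/ matches substrings, so a keyword such as "text" would also match inside "context". If whole-word occurrence counts are wanted instead, a minimal sketch could look like the following. The helper name count_word and the sample line are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how many times $word occurs as a whole
# word in $text. \b anchors the match at word boundaries and \Q...\E
# quotes any regex metacharacters in the keyword.
sub count_word {
    my ($word, $text) = @_;
    # A /g match in list context returns every match; assigning that
    # list to "()" in scalar context yields the number of matches.
    my $count = () = $text =~ /\b\Q$word\E\b/g;
    return $count;
}

# Made-up sample line for demonstration.
my $line = "count the text, not the context; count again";
for my $word (qw(count text the)) {
    printf "%-6s %d\n", $word, count_word($word, $line);
}
```

In the main loop this would replace the ++$R{$j}[$p] test with something like $R{$j}[$p] += count_word($j, $i).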
sdesar

ASKER

Instead of this line, how can I read the list of words from a text file?

@a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";

And is the file cnt_words.txt the word document (WD) that contains a bunch of text in different paragraphs with the above words (are, end, text, etc.)?


Thanks.

Hope to hear from you soon.
"Instead of this line, how can I read the list of words from a text file?

@a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";
"

Assuming the file has one word per line, here is how:

$fn="/dir/to/my/file/name";

open FILE, "<$fn" or die "Could not open the file - $fn:$!\n";
@a=<FILE>;
close FILE;

"And is the file cnt_words.txt the word document (WD) that contains a bunch of text in different paragraphs with the above words (are, end, text, etc.)?"

That is right. Put your source text in the file named by $fn2 (cnt_words.txt).

sdesar

ASKER

This is what I did, but when I run it from the command prompt:

$perl cnt_words.pl
No such file or directory

That is the message I get, even though I can see the file in my directory.
I also changed the permissions of the file:
chmod 755 cnt_words.pl


Here is the file cnt_words.pl:

#!/usr/bin/perl
rds.pl.swp
 # file name cnt_words.pl
$fn1="P1fileParse1.txt"; // input file that has the text data
open FILE, "<$fn1" or die "\$!\n";
@a=<FILE>;
close FILE;

 #   @a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";
                    # you can open your first file to get the content into @a

                    $fn2 = "cnt_words.txt";

                    open WD, "<$fn2" or die "$!\n";
                    @b = <WD>;
                    close WD;

                    $p = 1;    # paragram counter
                    %R =();
                    foreach $i (@b) {    # loop through each line
 if ($i =~ /^\n$/) {  ++$p; }
                        foreach $j (@a) {
                            if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p]
 += 0; }
                        }
                    }

                    for $i (sort keys %R) {
                        $t = "";
                        for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
                        printf "%10s %-30s\n", $i, $t;
                    }



Could you please help me debug this code? It looks really easy, yet it is a mystery to me.

Awaiting your response.


sdesar

ASKER

I changed the above script to have the following:

$fn1="cnt_words.out";     # output file

$fn2 = "cnt_words.txt";   #input WD word document file


I then ran perl cnt_words.pl, but there is no output in cnt_words.out.


I don't understand why.

Could you please give suggestions?
You need to put "./" in front of the command after you cd to the directory, i.e.,

cd /my/dir/has/cnt_words.pl

./cnt_words.pl

What is "rds.pl.swp" in your code?
 

$fn1 should be your input file, holding the list of keywords to be searched for in $fn2. If you want the output written to a file, you need to add the following code to the end:

$fn3 = "myoutputfile.out";
open OUT, ">$fn3" or die "Could not write to file - $fn3:$!\n";

for $i (sort keys %R) {
  $t = "";
  for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
  printf OUT "%10s %-30s\n", $i, $t;
}
close OUT;
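
A bare "No such file or directory" does not say which open failed. As a small sketch, assuming a reasonably modern Perl (5.6 or later for three-argument open), the file name can be put into the error message. The helper name read_lines is hypothetical and not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the lines of a file with the trailing
# newline removed, or die with a message that names the file.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Could not open '$path' for reading: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

# The file name is the one used in the thread; adjust as needed.
my @b = eval { read_lines("cnt_words.txt") };
if ($@) {
    print "Error: $@";
}
else {
    printf "Read %d lines\n", scalar @b;
}
```

With this form, a missing file is reported together with its name, which makes it clear whether the keyword file or the text file is the one that could not be found.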


sdesar

ASKER

Here are the files and the results. I can't figure out why I am getting all 0s.

#!/usr/bin/perl
# file name cnt_words.pl
$fn1="cnt_keywords.out";                # keywords  file
open FILE, "<$fn1" or die "could not open the file -$!|\n";
@a=<FILE>;
close FILE;

 #   @a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";
                    # you can open your first file to get the content into @a

                    $fn2 = "cnt_words1.txt";   #input word document WD file

                    open WD, "<$fn2" or die "$!\n";
                    @b = <WD>;
                    close WD;


                    $p = 1;    # paragram counter
                    %R =();
 foreach $i (@b) {    # loop through each line
                        if ($i =~ /^\n$/) {  ++$p; }
                        foreach $j (@a) {
                            if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p]
 += 0; }
                        }
                    }

$fn3 = "cnt_words.out";
open OUT, ">$fn3" or die "Could not write to file - $fn3:$!\n";

                    for $i (sort keys %R) {
                        $t = "";
                        for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
                        printf OUT "%10s %-30s\n", $i, $t;
                    }
close OUT;


This is the words document -
cnt_words1.txt
This a a test file check it out and I hope that this works finally and the work,
 for, with, game,count,pattern,number are in it.
and ths has some words, test , sentences..

This is the Keywords file -
cnt_keywords.out
test
game
count
the
is
a
seems
there
check
works
hope
finally


This is the output/result file -
cnt_words.out
a
   0 0
    check
   0 0
    count
   0 0
  finally
   0 0
     game
   0 0
     hope
   0 0
       is
   0 0
    seems
   0 0
     test
   0 0
      the
   0 0
    there
   0 0
    works


This output file has all 0s...
Do you have any suggestions to fix this?

Thanks



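The reason is the "\n" character at the end of each keyword. @a=<FILE> keeps the trailing newline, so a pattern such as /test\n/ matches only when the keyword is the last thing on a line. It also explains why each keyword in your output file is followed by a line break. I rewrote the code to chomp each keyword as it is read into @a, and it now works as expected. Here are the files and results: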

$ more cnt_keys.txt
are
end
text
work
for
with
game
count
pattern
number

$ more cnt_words.pl
#!/usr/local/bin/perl
# file name cnt_words.pl

# @a=split /,/, "are,end,text,work,for,with,game,count,pattern,number";
# you can open your first file to get the content into @a

$fn1 = "cnt_keys.txt";
$fn2 = "cnt_words.txt";
$fn3 = "cnt_out.txt";
open FILE, "<$fn1" or die "$!\n";
while (<FILE>) {
  chomp;
  next if (!$_);
  push @a, $_;
}
close FILE;

open WD, "<$fn2" or die "$!\n";
@b = <WD>;
close WD;

$p = 1;    # paragraph counter
%R =();
foreach $i (@b) {    # loop through each line
    if ($i =~ /^\n$/) {  ++$p; }
    foreach $j (@a) {  
        if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p] += 0; }
    }
}

for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
    printf "%10s %-30s\n", $i, $t;
}

open OUT, ">$fn3" or die "could not write to $fn3:$!\n";
for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
    printf OUT "%10s %-30s\n", $i, $t;
}
close OUT;


$ ./cnt_words.pl
       are   0 1 1 2 1 0 0 0 0 0 0      
     count   1 1 2 2 2 0 1 1 0 0 0      
       end   0 0 0 0 0 0 0 1 0 0 0      
       for   0 0 0 1 0 0 0 0 0 0 0      
      game   0 0 0 0 0 0 0 0 0 1 0      
    number   1 1 0 0 1 0 1 0 0 0 0      
   pattern   0 0 0 1 0 0 0 0 0 0 0      
      text   0 0 0 0 0 0 0 2 0 0 0      
      with   1 2 0 1 0 0 0 1 0 1 0      
      work   0 0 0 1 0 0 0 0 0 0 0      
$ more cnt_out.txt
       are   0 1 1 2 1 0 0 0 0 0 0      
     count   1 1 2 2 2 0 1 1 0 0 0      
       end   0 0 0 0 0 0 0 1 0 0 0      
       for   0 0 0 1 0 0 0 0 0 0 0      
      game   0 0 0 0 0 0 0 0 0 1 0      
    number   1 1 0 0 1 0 1 0 0 0 0      
   pattern   0 0 0 1 0 0 0 0 0 0 0      
      text   0 0 0 0 0 0 0 2 0 0 0      
      with   1 2 0 1 0 0 0 1 0 1 0      
      work   0 0 0 1 0 0 0 0 0 0 0      
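
The effect of the trailing newline can be reproduced in isolation. A short sketch, using a made-up line of text and keyword rather than the actual files:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line    = "this is a test file\n";
my $keyword = "test\n";   # as read by @a = <FILE>, newline still attached

# The pattern is "test" followed by a newline, which does not occur
# in the middle of the line, so the match fails.
my $before = ($line =~ /$keyword/) ? 1 : 0;

chomp $keyword;           # remove the trailing newline
my $after = ($line =~ /$keyword/) ? 1 : 0;

print "before chomp: $before, after chomp: $after\n";
# prints "before chomp: 0, after chomp: 1"
```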
sdesar

ASKER

Thanks, it works. But how can I place the paragraph numbers at the top?

      P1 P2 P3.... Pn
are   0  1  1  2  1  0
count 1  1  2  2  2  0
end   0  0  0  0  0  0

Thanks a million.




Just use the following code for output:

$t = "           ";
for $j (0..$#{$R{$i}}) { $t .= sprintf " P%02d", $j; }
print "$t\n";

for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= sprintf " %3d", $R{$i}[$j]; }
    printf "%10s %-30s\n", $i, $t;
}

open OUT, ">$fn3" or die "could not write to $fn3:$!\n";
$t = "           ";
for $j (0..$#{$R{$i}}) { $t .= sprintf " P%02d", $j; }
print OUT "$t\n";

for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= sprintf " %3d", $R{$i}[$j]; }
    printf OUT "%10s %-30s\n", $i, $t;
}
close OUT;
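
In the header loops above, $R{$i} is read before the for $i loop has assigned $i, so $#{$R{$i}} is likely -1 and no column labels would be produced. One possible sketch derives the number of columns from the longest row instead. It assumes %R holds one array of counts per keyword, with index 0 unused because $p starts at 1. The sample data and the helper name header_line are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data in the same shape as %R in the script:
# keyword => [ unused slot 0, count for P1, count for P2, ... ]
my %R = (
    are   => [ undef, 0, 1, 1 ],
    count => [ undef, 1, 1, 2 ],
);

# Hypothetical helper: build the " P01 P02 ..." header from the
# longest row, skipping index 0.
sub header_line {
    my ($r) = @_;
    my $last = 0;
    for my $row (values %$r) {
        $last = $#$row if $#$row > $last;
    }
    # 11 leading spaces line the labels up with "%10s " in the rows.
    return (" " x 11) . join("", map { sprintf(" P%02d", $_) } 1 .. $last);
}

print header_line(\%R), "\n";
for my $word (sort keys %R) {
    my @counts = @{ $R{$word} };
    my $t = join "", map { sprintf(" %3d", $counts[$_] || 0) } 1 .. $#counts;
    printf "%10s %s\n", $word, $t;
}
```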


sdesar

ASKER

Thanks, geotiger.
Also, the paragraphs in my cnt_words.txt are separated by 2 blank lines. How can I have them treated as a single separator? At the moment the extra line is counted as a paragraph of its own and therefore shows up as all 0s. I also need to display the paragraph numbers:
P1   P2   P3 ..... Pn


         P1  P2    P3   P4   ....
list      0   0    2     3
of        0   4    4     0
keywords  0   6    4     2
in
the


Thanks

sdesar

ASKER

Here is what the text file cnt_words.txt looks like (notice that it has 2 blank lines after each paragraph). How can I make the code ignore one of those lines so that the 0s seen above do not appear, or is there another way to avoid it?

 artificial intelligence   direct application  problems
 immediate  outside   ai community.   example,
 project (skipper)   research group   development
 intelligent agents  web elements, informational needs  tastes   user.


 skipper project  distinct     ways
ongoing research efforts   area  intelligent web-oriented
agents.   user profiles   used  customize  form
 content  -line information   manner  meets
specific informational needs   web-browsing individual.  sets
skipper apart   similarly minded tools   fact  skipper
 sit   background   web-browser  extract user profiles
 manner    unobtrusive, .e., requires minimal explicit
statements    feedback   user. unobtrusive tools


ozo

What line do you want to ignore?
ASKER CERTIFIED SOLUTION
geotiger

This solution is only available to members of Experts Exchange.
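
The accepted solution is not shown in this copy of the thread. One way to keep doubled blank lines from producing empty paragraphs, offered only as a sketch and not as the accepted answer, is Perl's paragraph mode: setting $/ to the empty string makes each read return a whole paragraph and treats any run of blank lines as a single separator. The helper name count_by_paragraph and the sample text are made up for illustration, and the in-memory filehandle assumes Perl 5.8 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given a filehandle and a list of keywords,
# return a hash of keyword => [ count in P1, count in P2, ... ].
sub count_by_paragraph {
    my ($fh, @words) = @_;
    local $/ = "";    # paragraph mode: a run of blank lines ends one record
    my %counts;
    my $p = 0;
    while (my $para = <$fh>) {
        for my $w (@words) {
            # Count whole-word occurrences within this paragraph.
            my $n = () = $para =~ /\b\Q$w\E\b/g;
            $counts{$w}[$p] = $n;
        }
        $p++;
    }
    return %counts;
}

# Made-up sample text with two blank lines between paragraphs, read
# from an in-memory filehandle so the sketch is self-contained.
my $text = "skipper project\nweb agents\n\n\nskipper user\nskipper web\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!\n";
my %R = count_by_paragraph($fh, qw(skipper web));
close $fh;

for my $w (sort keys %R) {
    printf "%10s %s\n", $w, join("", map { sprintf(" %3d", $_) } @{ $R{$w} });
}
```

Because the blank-line run is consumed as one separator, the sample yields two columns per keyword rather than three, with no all-zero column in between.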
sdesar

ASKER

It works!!
Thanks geotiger!!
sdesar

ASKER

Comment accepted as answer
sdesar

ASKER

Thanks Again!!