• Status: Solved

Given a list of words, how do I count them in a text file?

How can I count the occurrences of each word in a list?
File1 - contains a list of words
testing
automate
discrete
measure
seem
..
File2 - a text file that contains these words in different paragraphs.

File3 - should contain the following output:
           P1  P2  P3....Pn ..paragraphs

testing    1    0   1
automate   5    0   4
discrete   3    3   3
measure    3    5   5
seem       2    1   0


How can I produce this word count?
Asked by: sdesar
1 Solution
 
geotiger Commented:
I created a text file and a script that do what you want. To load your own word list into @a, you can read it from a file:

$fn1 = "/your/file/name";
open FILE, "<$fn1" or die "$!\n";
@a = <FILE>;
close FILE;
 

$ more cnt_words.txt
=head2 How can I count the number of occurrences of a substring within a string?

There are a number of ways, with varying efficiency: If you want a
count of a certain single character (X) within a string, you can use the
C<tr///> function like so:

    $string = "ThisXlineXhasXsomeXx'sXinXit";
    $count = ($string =~ tr/X//);
    print "There are $count X charcters in the string";

This is fine if you are just looking for a single character.  However,
if you are trying to count multiple character substrings within a
larger string, C<tr///> won't work.  What you can do is wrap a while()
loop around a global pattern match.  For example, let's count negative
integers:

    $string = "-9 55 48 -2 23 -76 4 14 -44";
    while ($string =~ /-\d+/g) { $count++ }
    print "There are $count negative numbers in the string";

=head1 Found in /usr/local/lib/perl5/5.00503/pod/perlfaq5.pod

=head2 How do I count the number of lines in a file?

One fairly efficient way is to count newlines in the file. The
following program uses a feature of tr///, as documented in L<perlop>.
If your text file doesn't end with a newline, then it's not really a
proper text file, so this may report one fewer line than you expect.

    $lines = 0;
    open(FILE, $filename) or die "Can't open `$filename': $!";
    while (sysread FILE, $buffer, 4096) {
        $lines += ($buffer =~ tr/\n//);
    }
    close FILE;

This assumes no funny games with newline translations.

$ more cnt_words.pl
#!/usr/local/bin/perl
# file name cnt_words.pl


@a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";
# you can open your first file to get the content into @a

$fn2 = "cnt_words.txt";

open WD, "<$fn2" or die "$!\n";
@b = <WD>;
close WD;

$p = 1;    # paragraph counter
%R =();
foreach $i (@b) {    # loop through each line
    if ($i =~ /^\n$/) {  ++$p; }
    foreach $j (@a) {  
        if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p] += 0; }
    }
}

for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
    printf "%10s %-30s\n", $i, $t;
}

$ ./cnt_words.pl
       are   0 1 1 2 1 0 0 0 0 0 0      
     count   1 1 2 2 2 0 1 1 0 0 0      
       end   0 0 0 0 0 0 0 1 0 0 0      
       for   0 0 0 1 0 0 0 0 0 0 0      
      game   0 0 0 0 0 0 0 0 0 1 0      
    number   1 1 0 0 1 0 1 0 0 0 0      
   pattern   0 0 0 1 0 0 0 0 0 0 0      
      text   0 0 0 0 0 0 0 2 0 0 0      
      with   1 2 0 1 0 0 0 1 0 1 0      
      work   0 0 0 1 0 0 0 0 0 0 0      
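
Note that the script above adds one per matching line rather than one per occurrence, and that /$j/ also matches inside longer words (for example "are" inside "share"). A minimal sketch of one way to count whole-word occurrences, in the spirit of the global-match loop from the FAQ text quoted above. The helper name count_word and the sample sentence are made up for illustration and are not part of the posted script:

```perl
use strict;
use warnings;

# Count whole-word occurrences of $word in $text.
# \b keeps "are" from matching inside "share"; \Q...\E makes any
# regex metacharacters in the word match literally.
sub count_word {
    my ($word, $text) = @_;
    # Assigning the global match to an empty list and then to a scalar
    # yields the number of matches.
    my $count = () = $text =~ /\b\Q$word\E\b/g;
    return $count;
}

my $line = "We share what we are; are you sure there are more?";
printf "%s: %d\n", $_, count_word($_, $line) for qw(are share the);
# prints:
# are: 3
# share: 1
# the: 0
```

If substring matches are actually wanted, dropping the two \b anchors gives that behaviour instead.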
 
sdesar (Author) Commented:
Instead of this line, how can I get the list of words from a text file?

@a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";

And is cnt_words.txt the word document (WD) that contains the text, with different paragraphs that include the above words (are, end, text, etc.)?


thanks

hope to hear from you soon....
 
geotiger Commented:
"Instead of this line, how can I get the list of words from a text file?

@a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";
"

Assuming you have one word per line, here is how:

$fn="/dir/to/my/file/name";

open FILE, "<$fn" or die "Could not open the file - $fn: $!\n";
@a=<FILE>;
close FILE;

"And is cnt_words.txt the word document (WD) that contains the text, with different paragraphs that include the above words (are, end, text, etc.)?"

That is right. Put your source text in $fn2 (cnt_words.txt).

 
sdesar (Author) Commented:
This is what I did, but when I run it from the command prompt:

$perl cnt_words.pl
No such file or directory

That is the only message I get, even though I can see the file in my directory.
I also changed the permissions of the file:
chmod 755 cnt_words.pl


Here is the file cnt_words.pl:

#!/usr/bin/perl
rds.pl.swp
 # file name cnt_words.pl
$fn1="P1fileParse1.txt"; // input file that has the text data
open FILE, "<$fn1" or die "\$!\n";
@a=<FILE>;
close FILE;

 #   @a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";
                    # you can open your first file to get the content into @a

                    $fn2 = "cnt_words.txt";

                    open WD, "<$fn2" or die "$!\n";
                    @b = <WD>;
                    close WD;

                    $p = 1;    # paragram counter
                    %R =();
                    foreach $i (@b) {    # loop through each line
 if ($i =~ /^\n$/) {  ++$p; }
                        foreach $j (@a) {
                            if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p]
 += 0; }
                        }
                    }

                    for $i (sort keys %R) {
                        $t = "";
                        for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
                        printf "%10s %-30s\n", $i, $t;
                    }



Could you please help me debug this? The code looks really easy, yet it is a mystery to me.

Awaiting your response.


 
sdesar (Author) Commented:
I changed the above script to have the following:

$fn1="cnt_words.out";     # output file

$fn2 = "cnt_words.txt";   # input WD word document file


Then I typed: perl cnt_words.pl

but there is no output in cnt_words.out.


I don't understand why.

Could you please give some suggestions?
 
geotiger Commented:
You need to put "./" in front of the command after you cd to the directory, i.e.,

cd /my/dir/has/cnt_words.pl

./cnt_words.pl

What is "rds.pl.swp" in your code?
 

$fn1 should be your input file with the list of keywords to search for in $fn2. If you want the output written to a file, add the following code at the end:

$fn3 = "myoutputfile.out";
open OUT, ">$fn3" or die "Could not write to file - $fn3:$!\n";

for $i (sort keys %R) {
  $t = "";
  for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
  printf OUT "%10s %-30s\n", $i, $t;
}
close OUT;


 
sdesar (Author) Commented:
Here are the files and the results... I can't figure out why I am getting 0s....

#!/usr/bin/perl
# file name cnt_words.pl
$fn1="cnt_keywords.out";                # keywords  file
open FILE, "<$fn1" or die "could not open the file -$!|\n";
@a=<FILE>;
close FILE;

 #   @a = split /,/, "are,end,text,work,for,with,game,count,pattern,number";
                    # you can open your first file to get the content into @a

                    $fn2 = "cnt_words1.txt";   #input word document WD file

                    open WD, "<$fn2" or die "$!\n";
                    @b = <WD>;
                    close WD;


                    $p = 1;    # paragram counter
                    %R =();
 foreach $i (@b) {    # loop through each line
                        if ($i =~ /^\n$/) {  ++$p; }
                        foreach $j (@a) {
                            if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p]
 += 0; }
                        }
                    }

$fn3 = "cnt_words.out";
open OUT, ">$fn3" or die "Could not write to file - $fn3:$!\n";

                    for $i (sort keys %R) {
                        $t = "";
                        for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
                        printf OUT "%10s %-30s\n", $i, $t;
                    }
close OUT;


This is the words document -
cnt_words1.txt
This a a test file check it out and I hope that this works finally and the work,
 for, with, game,count,pattern,number are in it.
and ths has some words, test , sentences..

This is the Keywords file -
cnt_keywords.out
test
game
count
the
is
a
seems
there
check
works
hope
finally


This is the output/result file -
cnt_words.out
a
   0 0
    check
   0 0
    count
   0 0
  finally
   0 0
     game
   0 0
     hope
   0 0
       is
   0 0
    seems
   0 0
     test
   0 0
      the
   0 0
    there
   0 0
    works


This output file has all 0s...
Do you have any suggestions to fix this?

Thanks



 
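geotiger Commented:
The cause is the "\n" character at the end of each keyword. Lines read with @a=<FILE> keep their trailing newline, so each pattern only matched where the keyword was immediately followed by a line break. I rewrote the code to chomp the keywords as they are read into @a, and it now works as expected. Here are the files and results: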

$ more cnt_keys.txt
are
end
text
work
for
with
game
count
pattern
number

$ more cnt_words.pl
#!/usr/local/bin/perl
# file name cnt_words.pl

# @a=split /,/, "are,end,text,work,for,with,game,count,pattern,number";
# you can open your first file to get the content into @a

$fn1 = "cnt_keys.txt";
$fn2 = "cnt_words.txt";
$fn3 = "cnt_out.txt";
open FILE, "<$fn1" or die "$!\n";
while (<FILE>) {
  chomp;
  next if (!$_);
  push @a, $_;
}
close FILE;

open WD, "<$fn2" or die "$!\n";
@b = <WD>;
close WD;

$p = 1;    # paragraph counter
%R =();
foreach $i (@b) {    # loop through each line
    if ($i =~ /^\n$/) {  ++$p; }
    foreach $j (@a) {  
        if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p] += 0; }
    }
}

for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
    printf "%10s %-30s\n", $i, $t;
}

open OUT, ">$fn3" or die "could not write to $fn3:$!\n";
for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
    printf OUT "%10s %-30s\n", $i, $t;
}
close OUT;


$ ./cnt_words.pl
       are   0 1 1 2 1 0 0 0 0 0 0      
     count   1 1 2 2 2 0 1 1 0 0 0      
       end   0 0 0 0 0 0 0 1 0 0 0      
       for   0 0 0 1 0 0 0 0 0 0 0      
      game   0 0 0 0 0 0 0 0 0 1 0      
    number   1 1 0 0 1 0 1 0 0 0 0      
   pattern   0 0 0 1 0 0 0 0 0 0 0      
      text   0 0 0 0 0 0 0 2 0 0 0      
      with   1 2 0 1 0 0 0 1 0 1 0      
      work   0 0 0 1 0 0 0 0 0 0 0      
$ more cnt_out.txt
       are   0 1 1 2 1 0 0 0 0 0 0      
     count   1 1 2 2 2 0 1 1 0 0 0      
       end   0 0 0 0 0 0 0 1 0 0 0      
       for   0 0 0 1 0 0 0 0 0 0 0      
      game   0 0 0 0 0 0 0 0 0 1 0      
    number   1 1 0 0 1 0 1 0 0 0 0      
   pattern   0 0 0 1 0 0 0 0 0 0 0      
      text   0 0 0 0 0 0 0 2 0 0 0      
      with   1 2 0 1 0 0 0 1 0 1 0      
      work   0 0 0 1 0 0 0 0 0 0 0      
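
To see the newline problem in isolation, here is a small sketch assuming a keyword file with one word per line. The helper name read_keywords and the in-memory sample data are hypothetical stand-ins, not the files used above:

```perl
use strict;
use warnings;

# Read one keyword per line from a filehandle, dropping the trailing
# newline and skipping blank lines.
sub read_keywords {
    my ($fh) = @_;
    my @words;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        push @words, $line;
    }
    return @words;
}

# In-memory stand-in for the keyword file (made-up contents).
my $keyfile = "test\ngame\n\ncount\n";
open my $fh, '<', \$keyfile or die "Cannot open in-memory file: $!\n";
my @keys = read_keywords($fh);
close $fh;

my $text = "this is a test file\n";
# An unchomped keyword is really "test\n", which only matches at a line end.
print "unchomped matches: ", ($text =~ /test\n/    ? "yes" : "no"), "\n";
print "chomped matches:   ", ($text =~ /$keys[0]/ ? "yes" : "no"), "\n";
# prints:
# unchomped matches: no
# chomped matches:   yes
```

Unlike "next if (!$_)" in the script above, the test against the empty string would keep a keyword that is literally "0"; that is a design choice of this sketch.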
 
sdesar (Author) Commented:
Thanks, it works. But how can I place the paragraph numbers on top?

      P1 P2 P3.... Pn
are   0  1  1  2  1  0
count 1  1  2  2  2  0
end   0  0  0  0  0  0

Thanks a million....




 
geotiger Commented:
Just use the following code for output:

# Header row: one label per paragraph column. Every keyword's row has the
# same number of columns, so the first keyword in @a sets the width
# ($i is not set outside the loops below).
$t = "           ";
for $j (0..$#{$R{$a[0]}}) { $t .= sprintf " P%02d", $j; }
print "$t\n";

for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= sprintf " %3d", $R{$i}[$j]; }
    printf "%10s %-30s\n", $i, $t;
}

# Same table again, written to the output file
open OUT, ">$fn3" or die "could not write to $fn3:$!\n";
$t = "           ";
for $j (0..$#{$R{$a[0]}}) { $t .= sprintf " P%02d", $j; }
print OUT "$t\n";

for $i (sort keys %R) {
    $t = "";
    for $j (0..$#{$R{$i}}) { $t .= sprintf " %3d", $R{$i}[$j]; }
    printf OUT "%10s %-30s\n", $i, $t;
}
close OUT;
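
Since the screen and file versions above differ only in the filehandle, one possible tidy-up is a single sub that takes the handle as a parameter. This is only a sketch: print_table is a hypothetical name, and it assumes the counts are stored with paragraph 1 at index 0, which is not how the script above lays them out.

```perl
use strict;
use warnings;

# Write the header row and one row per keyword to the given filehandle.
# $counts maps keyword => array ref of per-paragraph counts, with
# paragraph 1 stored at index 0 (an assumption of this sketch).
sub print_table {
    my ($fh, $counts) = @_;
    my @words = sort keys %$counts;
    return unless @words;
    my $ncols = @{ $counts->{ $words[0] } };   # all rows share one width
    printf {$fh} "%10s", "";
    printf {$fh} " P%02d", $_ for 1 .. $ncols;
    print  {$fh} "\n";
    for my $w (@words) {
        printf {$fh} "%10s", $w;
        printf {$fh} " %3d", ($_ || 0) for @{ $counts->{$w} };
        print  {$fh} "\n";
    }
}

# Made-up counts, printed to the screen.
my %counts = (are => [0, 1, 2], count => [1, 1, 0]);
print_table(\*STDOUT, \%counts);
```

The same call with a handle opened for writing (for example on cnt_out.txt) would produce the file copy without repeating the loops.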


 
sdesar (Author) Commented:
Thanks, geotiger.
Also, my text file cnt_words.txt has 2 blank lines between paragraphs. How can I treat them as a single separator? The extra blank line is counted as a paragraph of its own, so it shows up as a column of all 0s. I also need to display the paragraph numbers:
P1   P2   P3 ..... Pn


         P1  P2    P3   P4   ....
list      0   0    2     3
of        0   4    4     0
keywords  0   6    4     2
in
the


Thanks

 
sdesar (Author) Commented:
Here is what the text file cnt_words.txt looks like (notice there are 2 blank lines after each paragraph). How can I have the code ignore one of those lines so that the 0s seen above don't appear, or is there another way to avoid them?

 artificial intelligence   direct application  problems
 immediate  outside   ai community.   example,
 project (skipper)   research group   development
 intelligent agents  web elements, informational needs  tastes   user.


 skipper project  distinct     ways
ongoing research efforts   area  intelligent web-oriented
agents.   user profiles   used  customize  form
 content  -line information   manner  meets
specific informational needs   web-browsing individual.  sets
skipper apart   similarly minded tools   fact  skipper
 sit   background   web-browser  extract user profiles
 manner    unobtrusive, .e., requires minimal explicit
statements    feedback   user. unobtrusive tools


 
ozo Commented:
What line do you want to ignore?
 
geotiger Commented:
Use the following code to get rid of the empty lines between paragraphs.


$ cat cnt_words.pl
#!/usr/local/bin/perl
# file name cnt_words.pl

# @a=split /,/, "are,end,text,work,for,with,game,count,pattern,number";
# you can open your first file to get the content into @a

$fn1 = "cnt_keys.txt";
$fn2 = "cnt_words.txt";
$fn3 = "cnt_out.txt";
$fn4 = "cnt_out2.txt";
open FILE, "<$fn1" or die "$!\n";
while (<FILE>) {
  chomp;
  next if (!$_);
  push @a, $_;
}
close FILE;

open WD, "<$fn2" or die "$!\n";
@b = <WD>;
close WD;

$p = 1;    # paragraph counter
%R =();
my $lastline="";
foreach $i (@b) {    # loop through each line
    foreach $j (@a) {  
        if ($i =~  /$j/) { ++$R{$j}[$p]; } else { $R{$j}[$p] += 0; }
    }
    if ($i =~ /^\n$/ && $lastline !~ /^\n$/ ) {  ++$p; }
    $lastline=$i;
}

for $i (sort keys %R) {
    $t = "";
    for $j (1..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
    printf "%10s %-30s\n", $i, $t;
}

open OUT, ">$fn3" or die "could not write to $fn3:$!\n";
for $i (sort keys %R) {
    $t = "";
    for $j (1..$#{$R{$i}}) { $t .= " $R{$i}[$j]"; }
    printf OUT "%10s %-30s\n", $i, $t;
}
close OUT;


$t = "           ";
# every keyword's row has the same number of columns; use the first keyword
for $j (1..$#{$R{$a[0]}}) { $t .= sprintf " P%02d", $j; }
print "$t\n";

for $i (sort keys %R) {
    $t = "";
    for $j (1..$#{$R{$i}}) { $t .= sprintf " %3d", $R{$i}[$j]; }
    printf "%10s %-30s\n", $i, $t;
}

open OUT, ">$fn4" or die "could not write to $fn4:$!\n";
$t = "           ";
# $i is not set outside the loops, so size the header from the first keyword
for $j (1..$#{$R{$a[0]}}) { $t .= sprintf " P%02d", $j; }
print OUT "$t\n";

for $i (sort keys %R) {
    $t = "";
    # start at 1 to match the header; index 0 is never filled
    for $j (1..$#{$R{$i}}) { $t .= sprintf " %3d", $R{$i}[$j]; }
    printf OUT "%10s %-30s\n", $i, $t;
}
close OUT;
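
An alternative to tracking the previous line is Perl's paragraph mode: setting $/ to the empty string makes each read return one whole paragraph and treats any run of empty lines as a single separator. The following is a rough sketch under that approach, not the script that was accepted above. The name count_by_paragraph and the sample text are hypothetical, and it counts whole-word occurrences rather than matching lines.

```perl
use strict;
use warnings;

# Count whole-word occurrences of each keyword in each paragraph.
# Returns keyword => array ref of counts, paragraph 1 at index 0.
sub count_by_paragraph {
    my ($fh, @keywords) = @_;
    my %counts = map { $_ => [] } @keywords;
    local $/ = "";    # paragraph mode: a run of empty lines ends a record
    while (my $para = <$fh>) {
        for my $w (@keywords) {
            my $n = () = $para =~ /\b\Q$w\E\b/g;
            push @{ $counts{$w} }, $n;
        }
    }
    return \%counts;
}

# In-memory stand-in for the text file; note the two blank lines.
my $text = "intelligent agents on the web\nweb agents\n\n\nskipper agents\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!\n";
my $counts = count_by_paragraph($fh, qw(agents web skipper));
close $fh;

print "$_: @{ $counts->{$_} }\n" for sort keys %$counts;
# prints:
# agents: 2 1
# skipper: 0 1
# web: 2 0
```

One caveat: paragraph mode only treats truly empty lines as separators, so a "blank" line that contains spaces would not split paragraphs.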
 
sdesar (Author) Commented:
It works!!
Thanks geotiger!!
 
sdesar (Author) Commented:
Comment accepted as answer
 
sdesar (Author) Commented:
Thanks Again!!