frequency of words

Would anyone know code to do this?

I have a file, story.txt.
In a script I want to enter any number n from 1 to x,
where x is the number of words in the file.

I will then be told the most frequently occurring n-word phrases in the file and how many times each phrase occurred.

for example
story.txt is

"this is a this is a test."

if i enter 1
i get
this (2)
is (2)
a (2)
test (1)
if i enter 2
this is (2)
is a (2)
a this (1)
a test (1)

Note that the numbers in brackets always add up to
number_of_words_in_file - desired_phrase_size + 1

is this too obscure?
thanks
paul
Asked by: paulwhelan
yoric commented:
Here is the solution to your problem:

#----------------------
%PhraseCount = ();
$words_in_phrase = 2;  # Your input number. The length of the phrase to search for.

$story_fname = "story.txt";
open(STORY_FILE, $story_fname) || die "Cannot open $story_fname: $!";  # 'return' is not valid outside a subroutine
my @story_lines = <STORY_FILE>;
close(STORY_FILE);
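chomp @story_lines;  # chomp removes only a trailing newline; chop would eat the last character of an unterminated final line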
$story_text = join " ", @story_lines;

# Create the regular expression you need to look for.
$re = "\\w+";
for ($i=2; $i<=($words_in_phrase); $i++) {
  $re .= "\\W+\\w+";
}
print "For $words_in_phrase word phrases, the regex is $re\n";

# Make a list of all the start positions of words in the file.
@positions = (0);
while ($story_text =~ m/\W+/g) {
  push @positions, pos $story_text;
}

# Do the search. Increment the %PhraseCount hash on every find.
foreach (@positions) {
  pos $story_text = $_;
  if ($story_text =~ m/\G($re)/g) {  # \G anchors the match at pos(), so a phrase is only counted where it actually starts
    $PhraseCount{$1}++;
  }
}

# Print out the results.  
foreach (sort { $PhraseCount{$b} <=> $PhraseCount{$a} } keys %PhraseCount) {
  print "$_ ($PhraseCount{$_})\n";
}
#----------------------

For phrases of various lengths, this script produces:

For 1 word phrases, the regex is \w+
this (2)
a (2)
is (2)
test (1)

For 2 word phrases, the regex is \w+\W+\w+
this is (2)
is a (2)
a test (1)
a this (1)

For 3 word phrases, the regex is \w+\W+\w+\W+\w+
this is a (2)
is a test (1)
a this is (1)
is a this (1)

For 4 word phrases, the regex is \w+\W+\w+\W+\w+\W+\w+
is a this is (1)
this is a test (1)
this is a this (1)
a this is a (1)
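
The two steps the script relies on, collecting word-start offsets with pos() and then matching a phrase at each offset, can be seen in isolation in the short sketch below. It is only an illustration under assumptions: the sample string and the variable names @starts and @phrases are made up here, and the phrase length is fixed at two words.

```perl
use strict;
use warnings;

my $text = "this is a test";

# Record the offset at which each word starts: 0, plus the position
# just after every run of non-word characters.
my @starts = (0);
while ($text =~ /\W+/g) {
    push @starts, pos $text;
}

# At each offset, try to match a two-word phrase beginning exactly there.
# \G pins the match to the current pos(); without it the regex engine may
# skip ahead and report a phrase that starts later in the string.
my @phrases;
for my $start (@starts) {
    pos($text) = $start;
    push @phrases, $1 if $text =~ /\G(\w+\W+\w+)/g;
}

print "starts:  @starts\n";                       # starts:  0 5 8 10
print "phrases: ", join(", ", @phrases), "\n";    # phrases: this is, is a, a test
```

The last offset (10, the start of "test") yields no phrase because only one word remains, which is why the counts for n-word phrases add up to the number of words minus n plus 1.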

------------------------------

yoric commented:
To account for apostrophes in words (i.e. so that "I'll" is seen as one word), change the regular expression generation to:

$re = "[\\w']+";
for ($i=2; $i<=($words_in_phrase); $i++) {
  $re .= "\\W+[\\w']+";
}

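The loop that collects word start positions needs the matching change (m/[^\w']+/g instead of m/\W+/g); otherwise the apostrophe still counts as a word break there and a phrase would also start in the middle of "I'll".

Similar changes can be made to account for other punctuation quirks.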

You may also want to change the line above
    $PhraseCount{$1}++;
to
    $PhraseCount{lc($1)}++;
so that "Hello World" and "hello world" are seen as the same phrase.
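
As a rough sketch of both adjustments together, the snippet below splits text into lowercase words with the apostrophe treated as a word character, then counts two-word phrases over the resulting list. The helper name tokens and the sample sentence are invented for illustration, and the counting uses a plain array slice rather than the position scan from the script above.

```perl
use strict;
use warnings;

# Split text into lowercase words, treating an apostrophe as part of a word.
sub tokens {
    my ($text) = @_;
    return map { lc($_) } $text =~ /[\w']+/g;
}

my @words = tokens(q{I'll say "Hello World", and I'll say hello world.});

# Count every run of two consecutive words.
my $n = 2;
my %count;
for my $i (0 .. $#words - $n + 1) {
    $count{ join ' ', @words[$i .. $i + $n - 1] }++;
}

# Most frequent first; ties are broken alphabetically.
for my $phrase (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$phrase ($count{$phrase})\n";
}
```

With this input, "Hello World" and "hello world" fall under the single key "hello world", and "I'll" stays one word. One caveat: a word wrapped in single quotes would keep those quotes, so further trimming may be wanted.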
 
ozo commented:
%PhraseCount = ();
$words_in_phrase = 2;  # Your input number. The length of the phrase to search for.
$story_fname = "story.txt";
open(STORY_FILE, $story_fname) || die "$!";
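# Slurp the file, lowercase it, and collapse every run of characters that are
# neither word characters nor apostrophes into a single space.
($_ = lc join "", <STORY_FILE>) =~ s/[^\w']+/ /g;
# Do the search. Increment the %PhraseCount hash on every find.
# $1 is the first word of a phrase; the lookahead captures the remaining
# ($words_in_phrase - 1) words in $2 without consuming them, so overlapping phrases are all counted.
$PhraseCount{$1.$2}++ while( /(\S+(?=(( \S+){${\($words_in_phrase-1)}})))/g );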
foreach (sort { $PhraseCount{$b} <=> $PhraseCount{$a} } keys %PhraseCount) {
       print "$_ ($PhraseCount{$_})\n";
}
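
For the original request (enter a phrase length, get back only the most frequent phrase or phrases), the lookahead idea can be wrapped in subroutines roughly as follows. This is a sketch, not a drop-in script: the names count_phrases and most_frequent are invented here, the text is a hard-coded stand-in for the contents of story.txt, and $n is fixed where a real script would take it from @ARGV or STDIN.

```perl
use strict;
use warnings;

# Count every $n-word phrase in $text using a lookahead, so that
# overlapping phrases are all found. Returns a hash of phrase => count.
sub count_phrases {
    my ($text, $n) = @_;
    (my $norm = lc $text) =~ s/[^\w']+/ /g;   # words separated by single spaces
    my $extra = $n - 1;                        # words needed after the first one
    my %count;
    while ($norm =~ /(\S+)(?=((?: \S+){$extra}))/g) {
        $count{"$1$2"}++;
    }
    return %count;
}

# Return the highest count followed by the phrases that reach it.
sub most_frequent {
    my (%count) = @_;
    my ($max) = sort { $b <=> $a } values %count;
    my @top = grep { $count{$_} == $max } keys %count;
    return ($max, sort @top);
}

my $text = "this is a this is a test.";   # stand-in for the contents of story.txt
my $n    = 2;                              # phrase length; assumed fixed for this sketch
my $x    = () = $text =~ /[\w']+/g;        # number of words in the text
die "Enter a number from 1 to $x\n" unless $n =~ /^\d+$/ && $n >= 1 && $n <= $x;

my ($max, @winners) = most_frequent(count_phrases($text, $n));
print "$_ ($max)\n" for @winners;          # prints "is a (2)" then "this is (2)"
```

Ties are all reported, since the sample text has two phrases that share the top count.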
 
paulwhelan (author) commented:
ozo, can you supply the html for your answer?
I will give points to both of you;
I have plenty.
thanks
paul