paulwhelan

frequency of words

would anyone know code to do this?

i have a file story.txt
in a script i want to enter any number n from 1 to x,
where x is the number of words in the file

i will then be told the most frequently occurring n-word phrases in the file and how many times each phrase occurred

for example
story.txt is

"this is a this is a test."

if i enter 1
i get
this (2)
is (2)
a (2)
test (1)
if i enter 2
this is (2)
is a (2)
a this (1)
a test (1)

note the numbers in brackets always add up to
number_of_words_in_file - desired_phrase_size + 1

is this too obscure?
thanks
paul
ASKER CERTIFIED SOLUTION
yoric

yoric

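To account for apostrophes in words (e.g. so that "I'll" is seen as one word), you need to change the regular expression generation to something like:

$re = "[\\w']+";
for ($i=2; $i<=($words_in_phrase); $i++) {
  # The separator excludes apostrophes too; with plain \W+ a lone "I'll"
  # could be split at the apostrophe and matched as the two words "I" and "ll".
  $re .= "[^\\w']+[\\w']+";
}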

Similar changes can be made to account for other punctuation quirks.

You may also want to change the line above
    $PhraseCount{$1}++;
to
    $PhraseCount{lc($1)}++;
so that "Hello World" and "hello world" are seen as the same phrase.
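
For anyone who cannot see the full solution, here is a minimal sketch of how the two changes above might fit together. The sub name count_phrases, the lookahead loop, and the key normalization are assumptions made for this sketch, not the code from the accepted answer. It uses [^\w']+ as the word separator so that a word such as "I'll" is never split at its apostrophe.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the accepted answer): count every
# overlapping $n-word phrase in $text, case-insensitively.
sub count_phrases {
    my ($text, $n) = @_;

    # A word is a run of word characters or apostrophes.
    my $re = "[\\w']+";
    # Each further word is preceded by a run of non-word, non-apostrophe
    # characters, so a word is never split at an apostrophe.
    $re .= "[^\\w']+[\\w']+" for 2 .. $n;

    my %count;
    # (?<![\w']) only lets a match begin at the start of a word. The
    # lookahead captures the phrase without consuming it, so the next
    # match can begin at the following word (overlapping phrases).
    while ($text =~ /(?<![\w'])(?=($re))/g) {
        my $key = lc $1;
        $key =~ s/[^\w']+/ /g;   # assumption: "go. home" and "go home" are the same phrase
        $count{$key}++;
    }
    return %count;
}

my %c = count_phrases("I'll go. i'll go home", 2);
print "$_ ($c{$_})\n"
    for sort { $c{$b} <=> $c{$a} || $a cmp $b } keys %c;
```

Here "I'll go" and "i'll go" end up under the same key because of lc, which is the effect of the $PhraseCount{lc($1)}++ change.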
ozo

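%PhraseCount = ();
$words_in_phrase = 2;  # Your input number. The length of the phrase to search for.
$story_fname = "story.txt";
open(STORY_FILE, "<", $story_fname) || die "Cannot open $story_fname: $!";
# Read the whole file, lowercase it, and collapse each run of characters that
# are neither word characters nor apostrophes into a single space.
($_ = lc join "", <STORY_FILE>) =~ s/[^\w']+/ /g;
close(STORY_FILE);
# Do the search. Increment the %PhraseCount hash on every find.
# $1 is the first word; the lookahead captures the remaining words in $2
# without consuming them, so overlapping phrases are all counted.
$PhraseCount{$1.$2}++ while( /(\S+(?=(( \S+){${\($words_in_phrase-1)}})))/g );
# Print the phrases from most to least frequent.
foreach (sort { $PhraseCount{$b} <=> $PhraseCount{$a} } keys %PhraseCount) {
       print "$_ ($PhraseCount{$_})\n";
}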
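
The same lookahead idea can be tried without a story.txt on disk. The sketch below wraps it in a sub and runs it on the example sentence from the question; the name count_phrases_lookahead and the variables $rest and $norm are made up for this illustration. It also checks the asker's observation that the counts add up to number_of_words_in_file - desired_phrase_size + 1.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the lookahead technique; names are illustrative.
sub count_phrases_lookahead {
    my ($text, $n) = @_;
    my $rest = $n - 1;                       # words that must follow the first one
    (my $norm = lc $text) =~ s/[^\w']+/ /g;  # punctuation runs become one space

    my %count;
    # $1 is one word; the lookahead captures the next $rest words in $2
    # without consuming them, so matching resumes at the very next word.
    while ($norm =~ /(\S+)(?=((?: \S+){$rest}))/g) {
        $count{"$1$2"}++;
    }
    return %count;
}

my $story = "this is a this is a test.";
my $words = () = $story =~ /[\w']+/g;        # number of words in the text

for my $n (1, 2) {
    my %c = count_phrases_lookahead($story, $n);
    my $total = 0;
    $total += $_ for values %c;
    print "n=$n: $total phrases counted, expected ", $words - $n + 1, "\n";
}
```

For the example sentence this reports 7 phrases for n=1 and 6 for n=2, matching the formula in the question.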
paulwhelan

ASKER

ozo, can you supply the html for your answer?
i will give points to both of you
i have plenty
thanks
paul