paulwhelan
frequency of words
would anyone know code to do this?
i have a file story.txt
in a script i want to enter any number n from 1 to x
x is the number of words in the file
i will then be told the most frequently occurring n-word phrases in the file and how many times each phrase occurred
for example
story.txt is
"this is a this is a test."
if i enter 1
i get
this (2)
is (2)
a (2)
test (1)
if i enter 2
this is (2)
is a (2)
a this (1)
a test (1)
note the numbers in brackets always add up to
number_of_words_in_file - desired_phrase_size + 1
(here 7 - 1 + 1 = 7 for single words and 7 - 2 + 1 = 6 for two-word phrases)
is this too obscure?
thanks
paul
ASKER CERTIFIED SOLUTION
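use strict;
use warnings;

my %PhraseCount = ();
my $words_in_phrase = 2; # Your input number. The length of the phrase to search for.
my $story_fname = "story.txt";
open(my $story_file, '<', $story_fname) || die "Cannot open $story_fname: $!";
# Read the whole file, lowercase it, and turn every run of characters other than
# word characters and apostrophes into a single space.
($_ = lc join "", <$story_file>) =~ s/[^\w']+/ /g;
close($story_file);
# Do the search. Increment the %PhraseCount hash on every find.
# $1 is the first word of a phrase; $2 is the rest, captured by a lookahead so
# the next match can start at the following word and phrases may overlap.
$PhraseCount{$1.$2}++ while( /(\S+(?=(( \S+){${\($words_in_phrase - 1)}})))/g );
# Print the phrases, most frequent first.
foreach (sort { $PhraseCount{$b} <=> $PhraseCount{$a} } keys %PhraseCount) {
    print "$_ ($PhraseCount{$_})\n";
}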
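For anyone who finds the one-line regex hard to follow, here is a rough sketch of the same counting done with split and an array slice instead. It is not from the thread: the helper name count_phrases and the tie-break in the sort are my own assumptions, and it only aims to reproduce the example output from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count every run of $n
# consecutive words in $text and return a hash reference of phrase => count.
sub count_phrases {
    my ($text, $n) = @_;
    return {} if $n < 1;    # assumption: phrase length must be at least 1
    # Same normalisation as the regex version: lowercase, then treat anything
    # that is not a word character or an apostrophe as a separator.
    my @words = grep { length } split /[^\w']+/, lc $text;
    my %count;
    # Slide a window of $n words across the list, one word at a time.
    for my $i (0 .. @words - $n) {
        $count{ join ' ', @words[$i .. $i + $n - 1] }++;
    }
    return \%count;
}

my $count = count_phrases("this is a this is a test.", 2);
# Most frequent first; equal counts are ordered alphabetically.
for my $phrase (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$phrase ($count->{$phrase})\n";
}
```

With the sample sentence and a phrase length of 2, the window visits 7 - 2 + 1 = 6 positions, which is where the total in the question comes from. To use a real file, read story.txt into a string first and pass that in place of the literal.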
ASKER
ozo can u supply the html for your answer?
i will supply both of you with points
i have plenty
thanks
paul
# Build a pattern that matches $words_in_phrase words separated by non-word characters.
$re = "[\\w']+";
for ($i=2; $i<=($words_in_phrase); $i++) {
    $re .= "\\W+[\\w']+";
}
Similar changes can be made to account for other punctuation quirks.
You may also want to change the line above
$PhraseCount{$1}++;
to
$PhraseCount{lc($1)}++;
so that "Hello World" and "hello world" are seen as the same phrase.
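The matching loop that this advice refers to is not shown in the thread, so the following is only a guess at how $re and the lc() change might fit together. The sample text, the lookahead trick for overlapping phrases, and the separator clean-up are all assumptions of mine, not the original answer.

```perl
use strict;
use warnings;

my $words_in_phrase = 2;                        # phrase length, as in the snippet above
my $text = "Hello world. hello World, hello!";  # made-up sample text

# Build the pattern the same way as the loop above.
my $re = "[\\w']+";
for (my $i = 2; $i <= $words_in_phrase; $i++) {
    $re .= "\\W+[\\w']+";
}

my %PhraseCount;
# Assumption: the lookahead captures a whole phrase without consuming it, and
# the trailing [\w']+ then steps past one word, so overlapping phrases are seen.
while ($text =~ /(?=($re))[\w']+/g) {
    # Fold case and reduce punctuation between words to a single space, so that
    # "Hello world" and "hello World," end up under the same key.
    (my $phrase = lc $1) =~ s/[^\w']+/ /g;
    $PhraseCount{$phrase}++;
}

print "$_ ($PhraseCount{$_})\n"
    for sort { $PhraseCount{$b} <=> $PhraseCount{$a} || $a cmp $b } keys %PhraseCount;
```

On the sample text this should count "hello world" and "world hello" twice each, regardless of the capitalisation and punctuation in between.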