• Status: Solved

# Counting digraphs and trigraphs in Perl

Hi,
I have a string in Perl, say "hh bbb. rt irr".

I'm looking for a way to go through that string and count how many three-letter words and how many two-letter words it contains. Punctuation should be ignored.

Any takers? :)

petepalmer
1 Solution

Commented:
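$text = "hh bbb. rt irr";
# Start both counters at zero so the output is numeric even when nothing matches
my ($twocharwords, $threecharwords) = (0, 0);
# Split on runs of whitespace and punctuation
@words = split(/[\s.,;:\'\"\\\/\`\~\!\@\#\$\%\^\&\*\(\)\_\-\+\[\]\{\}\?]+/, $text);
foreach (@words) {
    $twocharwords++   if length($_) == 2;
    $threecharwords++ if length($_) == 3;
}
print "two char word count: $twocharwords, three char word count: $threecharwords\n";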

OwnerCommented:
(spelling error in title corrected)

Actually, the words digraph and trigraph are usually used with a different meaning: they refer to two-character and three-character sequences regardless of whether they appear as independent words or as part of larger words. The statistics of digraphs and trigraphs can be used in compression algorithms and in code breaking.
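
For comparison, here is a minimal sketch of counting digraphs and trigraphs in that stricter sense, i.e. overlapping character sequences inside each run of letters. The helper name ngram_counts is hypothetical, and folding everything to lower case is an assumption rather than something the question asked for.

```perl
use strict;
use warnings;

# Count overlapping n-character sequences inside each alphabetic run.
# ngram_counts is a hypothetical helper name, not something from this thread.
sub ngram_counts {
    my ($text, $n) = @_;
    my %count;
    # Treat each run of letters separately so that no sequence spans
    # a space or punctuation mark. Lower-casing is an assumption.
    for my $run (lc($text) =~ /([a-z]+)/g) {
        # The range is empty when the run is shorter than $n.
        for my $i (0 .. length($run) - $n) {
            $count{ substr($run, $i, $n) }++;
        }
    }
    return \%count;
}

my $text      = "hh bbb. rt irr";
my $digraphs  = ngram_counts($text, 2);
my $trigraphs = ngram_counts($text, 3);

print "$_\t$digraphs->{$_}\n"  for sort keys %$digraphs;
print "$_\t$trigraphs->{$_}\n" for sort keys %$trigraphs;
```

With the sample string, "bb" is counted twice because the two occurrences overlap inside "bbb", which is the main difference from counting whole words.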

If we take from your question that you want to count the number of two- and three-character sequences (sequences, that is, containing alphabetic characters only), the approach that rj2 has shown is pretty good. If I were doing it, I'd probably do it more like:

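# Collect every purely alphabetic word
my @words = $text =~ m{\b([a-zA-Z]+)\b}g;
my @histogram;
# Index the histogram by word length
$histogram[length $_]++ foreach @words;
print "length\tcount\n";
# Missing lengths are printed as 0 instead of triggering an undef warning
printf "%5d\t%5d\n", $_, $histogram[$_] || 0 foreach (2..3);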

Or change that last line to print the entire histogram:

printf "%5d\t%5d\n", $_, $histogram[$_] || 0 foreach (1 .. $#histogram);

Question has a verified solution.
