Don't understand output

This code is supposed to take in two files. The first contains a language model in the form:
in 0.021268266230653
er 0.0199459816856101
an 0.0180031452539636
he 0.0167558696165399
on 0.0161781277728501
th 0.0155111263808014
re 0.0127007970702476

The second contains song titles, like:
this is a song title1 (4.30)
this is a song title2 (4.30)

I want to split each word from the second file into bigrams, look up the corresponding frequencies in my language model, and then calculate the probability of that word being in the model. So (4.30) should return a probability of 0, but it doesn't. Does anyone know why?

How is the code? Is there any way I could tidy it up a bit?

Thanks.

Code:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use POSIX qw(log10);
use FileHandle;

#open file handles for languagemodel and test file
my $fh = new FileHandle;
my $fh2 = new FileHandle;
$fh->open("<$ARGV[0]") or die "could not open file\n";
$fh2->open("<$ARGV[1]") or die "could not open file\n";

#open a file to output the results to
open(OUTTRACKS,">trackbigrams") or die "could not open trackbigrams\n";

my $line;
my %bigramfrequency;
my $totalbigram;
my %bititles;
my @titlesbigram;
my @bifrequency;
my @frequency;
my @letterline;
my $result;
my $word;

#subroutine to get bigrams
sub bigram()
{
       
         
         for (my $i=0; $i <= $#letterline-1; $i++)
         {
          my $bigram = $letterline[$i] . $letterline[$i+1];
          $bigramfrequency{$bigram}++;
          $totalbigram++;
         }
    $_ /= $totalbigram foreach values %bigramfrequency;
    return %bigramfrequency;
}



#read in the language model values into a hash table
sub buildlanguage($)
{
   my $filehandle = shift;
   my %languagemodel;

   while (<$filehandle>)
   {
      chomp;
      my ($key, $value) = split /\s/, $_, 2;
      $languagemodel{$key} = $value;
   }
   return %languagemodel;
}

#build the language model from the first file passed in
my %model = buildlanguage($fh);

#subroutine to get the corresponding frequency for the bigrams
sub lookupfreq
{
for (my $i=0; $i<@titlesbigram; $i++) {
   if (exists $model{$titlesbigram[$i]})
      {
       $frequency[$i] = $model{$titlesbigram[$i]};
       }
     else
     {
     $frequency[$i] = 0;
     }
      }
return @frequency;
}

#calculate the probability
sub getProbability
{
my $run_total;
 foreach (@bifrequency)
 {
  $run_total += $_;
  }

my $nth_root = scalar(@bifrequency);
my $log_e = log10($run_total);
my $prob = exp($log_e/$nth_root);
return $prob;
}

#open a file handle for the test file
my $fileHandle2 = $fh2;


while (<$fileHandle2>)
     {
     $line = $_;
     chomp $line;
     $line = lc($line);

     #wordline contains each word of the string
     my @wordline =  split /[^\p{IsL}\d()]+/, $line;
     
     foreach $word (@wordline)
               {
     #letter line contains just a letter
        @letterline = split //, $word;
     %bititles = bigram();
     @titlesbigram = keys (%bititles);
     @bifrequency = lookupfreq();
     $result = getProbability();
     print OUTTRACKS "$word\t $result\n"
          }
     }
$fh->close();
$fh2->close();
close(OUTTRACKS);
Asked by: kilkennp

ozo Commented:
exp(log10(x)) = x**(1/log(10))
exp(log10($run_total)/$nth_root) = $run_total**(1/(log(10)*number of elements in @bifrequency))
I don't know what you would expect such a quantity to represent.
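
As a quick numeric check of the identity above, here is a minimal sketch. The sample values for $x and $n are made up for illustration and do not come from the real data.

```perl
use strict;
use warnings;
use POSIX qw(log10);

# Hypothetical sample values, chosen only for illustration.
my $x = 0.05;    # stands in for $run_total
my $n = 4;       # stands in for the number of elements in @bifrequency

# exp(log10(x)) and x**(1/log(10)) are the same quantity.
my $lhs = exp( log10($x) );
my $rhs = $x**( 1 / log(10) );
printf "exp(log10(x)) = %.12f, x**(1/log(10)) = %.12f\n", $lhs, $rhs;

# What the original getProbability computes, next to a plain nth root.
my $original = exp( log10($x) / $n );
my $nth_root = $x**( 1 / $n );
printf "original = %.12f, nth root = %.12f\n", $original, $nth_root;
```

The two values on the second line differ, which is why mixing log10 with exp does not give an nth root.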

#subroutine to get bigrams
sub bigram()
{


    for (my $i=0; $i <= $#letterline-1; $i++)
    {
        my $bigram = $letterline[$i] . $letterline[$i+1];
        $bigramfrequency{$bigram}++;
        $totalbigram++;
    }
    $_ /= $totalbigram foreach values %bigramfrequency;
    return %bigramfrequency;
}
Once a bigram goes into %bigramfrequency, it stays there forever, so keys(%bititles) is every bigram ever seen in $fh2,
and every entry is divided by $totalbigram every time you call sub bigram, so some entries can get very small.
(Which seems to be pointless, since I don't see the values in %bititles being used anywhere.)
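
One way to avoid the accumulation is to keep the bigrams in a lexical array inside the subroutine, so nothing survives between calls. This is only a sketch: the name word_bigrams is hypothetical, and it takes the word as an argument instead of reading the global @letterline.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bigrams of a single word, in order.
# @bigrams is lexical, so it starts empty on every call.
sub word_bigrams {
    my ($word) = @_;
    my @letters = split //, $word;
    my @bigrams;
    for my $i ( 0 .. $#letters - 1 ) {
        push @bigrams, $letters[$i] . $letters[ $i + 1 ];
    }
    return @bigrams;
}

my @words = qw(hello world);
for my $word (@words) {
    my @bigrams = word_bigrams($word);
    print "$word: @bigrams\n";
}
```

A word of one letter simply yields an empty list, because the range 0 .. -1 is empty.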

What is the formula you are trying to implement in computing $result?
kilkennp (Author) Commented:
I've looked at the problem some more, and it seems that the problem is with %bititles, as you pointed out. Every time sub bigram is called, bigrams from the previous word are still in the hash table, which messes up the results. Is there any way to make sure this hash table is empty for each word it gets the bigrams for, i.e. on each iteration of the 'foreach $word (@wordline)' loop?

Alternatively, I could get rid of this hash table and put the bigrams into an array (@titlesbigram) because, as you rightly pointed out, I never use the values in %bititles. This is because I took this sub from another program I had written.

$result is a decimal value returned by getProbability(). It takes an array of decimal values and calculates the probability.

Thanks for any further help you can give me.
ozo Commented:
foreach ( split /[^\p{IsL}\d()]+/, $line ){
     @bifrequency = map{$model{$_}||0}/(..)/g,/(?<=.)(..)/g;
     $result = getProbability();
     print OUTTRACKS "$_\t $result\n";
}
kilkennp (Author) Commented:
I don't understand this code. What does it do? Where in the original code should I put it?
ozo Commented:
It would replace
     #wordline contains each word of the string
     my @wordline =  split /[^\p{IsL}\d()]+/, $line;
     
     foreach $word (@wordline)
               {
     #letter line contains just a letter
        @letterline = split //, $word;
     %bititles = bigram();
     @titlesbigram = keys (%bititles);
     @bifrequency = lookupfreq();
     $result = getProbability();
     print OUTTRACKS "$word\t $result\n"
          }
to generate @bifrequency.


(Although I'm still not sure why getProbability would be doing exp(log10()).)
kilkennp (Author) Commented:
Sorry, but I still don't understand your piece of code. Where are the bigrams made? Where is the corresponding frequency looked up? Can I not still use the subroutines that I wrote?

getProbability should take an array of the bigram frequencies for a word. It is meant to calculate the nth root of the sum of the frequencies passed in, where n is the number of frequencies in the array.

So if we take the string "hello world", I want to take the first word and get its bigrams, i.e. "he, el, ll, lo". For each of these I want to get the corresponding value from %model, which will be less than 1. I then want to pass these frequencies into getProbability to sum them and take the nth root of the sum. That is what $result should be. I want to do this for every word in the string.
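
One possible reading of that description as code, offered only as a sketch. The name nth_root_of_sum is hypothetical, and returning 0 when there are no bigrams or when every frequency is 0 is an assumption about the desired behaviour, not something the original program does.

```perl
use strict;
use warnings;

# Hypothetical replacement for getProbability: takes the frequencies as
# arguments and returns the nth root of their sum, where n is the count.
sub nth_root_of_sum {
    my @freqs = @_;
    return 0 unless @freqs;    # assumption: no bigrams means probability 0

    my $sum = 0;
    $sum += $_ for @freqs;
    return 0 if $sum == 0;     # assumption: nothing found in the model means 0

    return $sum**( 1 / scalar(@freqs) );
}

# Made-up frequencies, just to exercise the function.
print nth_root_of_sum( 0.04, 0.12 ), "\n";    # square root of 0.16
print nth_root_of_sum( 0, 0, 0 ), "\n";       # every bigram missing from the model
```

Checking for a zero sum up front also avoids taking the logarithm of 0, which the log10-based version would attempt for a word such as "(4" whose bigrams are not in the model.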
ozo Commented:
The bigrams are made here
  /(..)/g,/(?<=.)(..)/g
the corresponding frequencies are looked up here
   map{$model{$_}||0}

You can use the subroutines that you wrote, but they seem to do a lot of unnecessary work, and keep accumulating bigrams forever.
You can clear out the %bigramfrequency hash with
%bigramfrequency = ();
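
To make the two fragments above easier to follow, here is a small self-contained sketch. The name regex_bigrams and the two-entry toy %model are assumptions made for illustration. The real %model is read from the language-model file.

```perl
use strict;
use warnings;

# Toy model with two entries taken from the sample language file.
my %model = ( he => 0.0167558696165399, th => 0.0155111263808014 );

# Hypothetical helper wrapping the two regexes.
# /(..)/g in list context returns the pairs starting at offsets 0, 2, 4, ...
# /(?<=.)(..)/g needs one character before each pair, so it returns the
# pairs starting at offsets 1, 3, 5, ...
sub regex_bigrams {
    my ($word) = @_;
    return ( $word =~ /(..)/g, $word =~ /(?<=.)(..)/g );
}

my @words = qw(the hello);
for my $word (@words) {
    my @bigrams = regex_bigrams($word);
    # Look up each bigram; anything missing from the model counts as 0.
    my @bifrequency = map { $model{$_} || 0 } @bigrams;
    print "$word: @bigrams => @bifrequency\n";
}
```

Note that the bigrams come out grouped by offset ("he ll el lo" for "hello") and not in reading order, which does not matter when the frequencies are only summed.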