StephenMcGowan


DNA Reading Frames

Hi,

I'm just having trouble with a bit of code. I am trying to split a DNA sequence into its three reading frames. For example, take the hypothetical sequence:

AAGAAAATGAAAAAAAAATAACCGCATG

Reading Frame 1 would be: AAG | AAA | ATG | AAA | AAA | AAA | TAA | CCG | CAT | G
Reading Frame 2 would be: AGA | AAA | TGA | AAA | AAA | AAT | AAC | CGC | ATG
Reading Frame 3 would be: GAA | AAT | GAA | AAA | AAA | ATA | ACC | GCA | TG

You will see that for reading frame 2 the first base has been removed, and for reading frame 3 the first two bases, causing a frame shift in the triplet code. The triplets of interest are ATG (start codon) and TAA/TGA/TAG (stop codons).
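To make the frame shift concrete, here is a minimal sketch that splits the test sequence into triplets for each frame. The helper name `codons_in_frame` is made up for illustration and is not part of the script below; it assumes the sequence contains no whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: split $dna into triplets, starting $offset bases in
# (0 for frame 1, 1 for frame 2, 2 for frame 3).
sub codons_in_frame {
    my ($dna, $offset) = @_;
    my $frame = substr($dna, $offset);
    # Greedy {1,3} takes full triplets first, then any 1-2 leftover bases.
    my @codons = $frame =~ /(.{1,3})/g;
    return @codons;
}

my $dna = 'AAGAAAATGAAAAAAAAATAACCGCATG';
for my $offset (0 .. 2) {
    printf "Reading Frame %d: %s\n", $offset + 1,
        join(' | ', codons_in_frame($dna, $offset));
}
```

Run on the sequence above, this should reproduce the three listings, including the incomplete trailing codon in frames 1 and 3.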

So I've been trying to write code where you can input an arbitrary DNA sequence. The script then creates the three reading frames by keeping the sequence as it is (Frame 1), removing the first base (Frame 2), or removing the first two bases (Frame 3).

Having created the three frames, I then wrote a subroutine which uses a while loop to go along each sequence one triplet at a time, looking for a start codon (ATG) and terminating at any stop codon in its triplet frame. Staying in the same triplet frame is essential, as shown above with the test sequences.

So far my script is:


# feed the dna data into open_reading_frame to return the longest ORF

print "\n -------Reading Frame 1-------\n\n";
$longorf1 = open_reading_frame($dna);


print "\n -------Reading Frame 2-------\n\n";
# remove first base from sequence
$dna2 = substr $dna, 1;
$longorf2 = open_reading_frame($dna2);
print $longorf2;

print "\n -------Reading Frame 3-------\n\n";
# remove first base from $dna2
$dna3 = substr $dna2, 1;
$longorf3 = open_reading_frame($dna3);
print $longorf3;

You will see that each reading frame calls the subroutine "open_reading_frame". This is the loop which should go along each reading-frame sequence in triplets, looking for a start codon (ATG) and terminating at a stop codon, before printing out the longest of these open reading frames.

The code I have written so far for this subroutine is shown below:
# A subroutine to find the longest open reading frame (ORF) for a sequence

sub open_reading_frame {

    my($dna) = @_;

    use strict;
    use warnings;

    #Declare and initialise variables
    my $longest_str ='';
    my $longest_len = 0;

    local $_ = $dna;
    s/\s+//g;
#   print $_,"\n";

    # longest of the shortest sequences ending with TAA|TAG|TGA
    while( /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
        if( length $1 >$longest_len ){
             $longest_str=$1;
             $longest_len=length $1;
             print $1, "\n";
          }
      }
    return $longest_str;
}

---My Problem---

OK, now upon running the script with a test DNA sequence, reading frames 1, 2 and 3 all return the same sequences. They shouldn't, as they are in different triplet frames and should detect different ATG start codons depending on the frame they are in. Is there a problem with my loop, or a problem with the call to the subroutine?
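The symptom can be reproduced on the test sequence with a small check, given here as a sketch rather than as part of the original script. It runs the same unanchored pattern on each frame and records where ATG is found. The helper name `atg_starts` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: positions (0-based, within $frame) at which the
# unanchored pattern from the loop above finds an ATG followed by a stop.
sub atg_starts {
    my ($frame) = @_;
    my @starts;
    while ($frame =~ /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig) {
        push @starts, $-[0];    # $-[0] is the offset where the match began
    }
    return @starts;
}

my $dna = 'AAGAAAATGAAAAAAAAATAACCGCATG';
for my $offset (0 .. 2) {
    # Add $offset back so positions refer to the original sequence.
    my @in_original = map { $_ + $offset } atg_starts(substr($dna, $offset));
    print "Frame ", $offset + 1, ": ATG at original position(s) @in_original\n";
}
```

Because nothing in the pattern ties the search for ATG to triplet boundaries counted from the start of the string, all three frames report the same ATG, even though it lies on a triplet boundary only in frame 1.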

Many Thanks

Stephen

 
ASKER CERTIFIED SOLUTION
ozo
StephenMcGowan

ASKER

Hi ozo,

I've entered this instead of the previous loop and it seems to be working fine, but it seems to be printing out all of the frames instead of only printing out the longest one?

I'd have thought:

             $longest_str=$1;
             $longest_len=length $1;
             print $1, "\n";

would have seen to this after the loop?

The function only returns the longest one after finishing the loop, but the print $1, "\n"; is inside the loop.

Thanks again ozo, sorry, I'm kinda new to this.

So you're saying I should have it outside the loop? As in:

    # longest of the shortest sequences ending with TAA|TAG|TGA
#    while( /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
     while( /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
        if( length $1 >$longest_len ){
             $longest_str=$1;
             $longest_len=length $1;
          }
      }
 print $1, "\n";

    return $longest_str;
}

If I try this I receive: Use of uninitialized value in print at ReadingFrameModules.pm
I'm not really sure where the print needs to go in order to print only the longest one :o/

$1 is only defined when the match succeeds, and the loop ends when the match fails.
Did you mean to print $longest_str?
That seems unnecessary, though, if whoever calls the function is responsible for printing the result of the function.

Nevermind! sorted it! (i think!) Thanks ozo. :)
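
For reference, the pieces discussed in this thread can be put together roughly as follows. This is a sketch, not necessarily the exact code the asker ended up with: it uses the `\G`-anchored pattern quoted above, drops the print from inside the subroutine, and leaves printing to the caller. The `(none)` placeholder and the loop over offsets are additions for illustration.

```perl
use strict;
use warnings;

# Longest ORF in the frame that starts at the first base of $dna.
# As in the thread, the returned string is what follows the ATG, up to and
# including the stop codon.
sub open_reading_frame {
    my ($dna) = @_;
    $dna =~ s/\s+//g;    # strip whitespace and newlines

    my $longest_str = '';
    # \G(?:...)*? moves forward only in whole triplets from where the
    # previous match ended, so every ATG found is in frame.
    while ($dna =~ /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig) {
        $longest_str = $1 if length($1) > length($longest_str);
    }
    return $longest_str;
}

my $dna = 'AAGAAAATGAAAAAAAAATAACCGCATG';
for my $offset (0 .. 2) {
    my $orf = open_reading_frame(substr($dna, $offset));
    print "Reading Frame ", $offset + 1, ": ",
        ($orf eq '' ? '(none)' : $orf), "\n";
}
```

On the test sequence from the question, only frame 1 contains an in-frame ATG followed by a stop codon, so frames 2 and 3 should come back empty.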