Why Experts Exchange?

Experts Exchange always has the answer, or at the least points me in the correct direction! It is like having another employee that is extremely experienced.

Jim Murphy
Programmer at Smart IT Solutions

When asked, what has been your best career decision?

Deciding to stick with EE.

Mohamed Asif
Technical Department Head

Being involved with EE helped me to grow personally and professionally.

Carl Webster
CTP, Sr Infrastructure Consultant
Ask ANY Question

Connect with Certified Experts to gain insight and support on specific technology challenges including:

Troubleshooting
Research
Professional Opinions
Ask a Question
Did You Know?

We've partnered with two important charities to provide clean water and computer science education to those who need it most. READ MORE

troubleshooting Question

DNA Reading Frames

Avatar of StephenMcGowan
StephenMcGowan asked on
Perl
6 Comments1 Solution1832 ViewsLast Modified:
Hi,

i'm just having trouble with a bit of code. I am trying to sequence three reading frames of a DNA sequence, so i.e. take the hypothetical sequence:

AAGAAAATGAAAAAAAAATAACCGCATG

Reading Frame1 would be: AAG | AAA | ATG | AAA | AAA | AAA | TAA | CCG | CAT | G
Reading Frame2 would be: AGA | AAA | TGA | AAA | AAA | AAT | AAC | CGC | ATG |
Reading Frame3 would be GAA | AAT | GAA | AAA | AAA | ATA |ACC | GCA | TG

You will see for reading frames 2 and 3 the first base has been removed, causing a frame shift in the triplicate code. Triple codes of interest are ATG (Start codon) and TAA/TGA/TAG (stop codons).

So i've been trying to write a code where you can input a random dna sequence, the script will then create the three reading frames by either keeping it as it is (ReadingFrame1), remove the first base (Frame2) or remove a second base (Frame 3).

Once creating the three frames, i have then created a sub-routine which uses a while loop to go along each sequence in triplicate fashion looking for a start codon (ATG) and terminating at any stop codon in it's triplicate frame. Staying in the same triplicate frame is essential, as shown above with the test sequences.

So far my script is:


# feed the dna data into open_reading_frame to return the longest ORF

print "\n -------Reading Frame 1-------\n\n";
$longorf1 = open_reading_frame($dna);


print "\n -------Reading Frame 2-------\n\n";
# remove first base from sequence
$dna2 = substr $dna, 1;
$longorf2 = open_reading_frame($dna2);
print $longorf2;

print "\n -------Reading Frame 3-------\n\n";
# remove first base from $dna2
$dna3 = substr $dna2, 1;
$longorf3 = open_reading_frame($dna3);
print $longorf3;

you will see each reading frame calls the sub-routine "open_reading_frame" which is the loop which should go along each of these reading frame sequences going along in triplicates looking for a start codon ATG and then terminating at a stop codon before printing out the longest of these reading frames.

The code for what i have done so far for this sub-routine is shown below:
# A subroutine to find the longest open reading frame (ORF) for a sequence

sub open_reading_frame {

    my($dna) = @_;

    use strict;
    use warnings;

    #Declare and initialise variables
    my $longest_str ='';
    my $longest_len = 0;

    local $_ = $dna;
    s/\s+//g;
#   print $_,"\n";

    # longest of the shortest sequences ending with TAA|TAG|TGA
    while( /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
        if( length $1 >$longest_len ){
             $longest_str=$1;
             $longest_len=length $1;
             print $1, "\n";
          }
      }
    return $longest_str;
}

---My Problem---

Ok, now upon running the script with a test DNA sequence, reading frames 1, 2 and 3 are all returning the same sequences when they shouldn't as the are in different triplicate frames and should detect different ATG start codons depending on the triplicate frames they are in. Is there a problem with my loop? or a problem with the call to the sub-routine?

Many Thanks

Stephen