DNA Reading Frames

Posted on 2009-05-10
Last Modified: 2012-05-06
Hi,

I'm having trouble with a bit of code. I am trying to extract the three reading frames of a DNA sequence. For example, take the hypothetical sequence:

AAGAAAATGAAAAAAAAATAACCGCATG

Reading Frame 1 would be: AAG | AAA | ATG | AAA | AAA | AAA | TAA | CCG | CAT | G
Reading Frame 2 would be: AGA | AAA | TGA | AAA | AAA | AAT | AAC | CGC | ATG
Reading Frame 3 would be: GAA | AAT | GAA | AAA | AAA | ATA | ACC | GCA | TG

You will see that for reading frame 2 the first base has been removed, and for reading frame 3 the first two bases, causing a frame shift in the triplet code. The triplets (codons) of interest are ATG (start codon) and TAA/TGA/TAG (stop codons).

So I've been trying to write a script where you can input an arbitrary DNA sequence. The script then creates the three reading frames by keeping the sequence as it is (Frame 1), removing the first base (Frame 2), or removing the first two bases (Frame 3).
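As a rough illustration of the three frames described above (not part of the original script), the sequence can be cut into codons for each offset. The helper name reading_frame_codons is made up for this sketch; it assumes the sequence contains no whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): split a sequence
# into codons for a given frame offset (0, 1 or 2).
sub reading_frame_codons {
    my ($dna, $offset) = @_;
    # "(A3)*" unpacks successive 3-character groups; a shorter
    # trailing group (1 or 2 bases) is returned as the last element.
    return unpack '(A3)*', substr($dna, $offset);
}

my $dna = 'AAGAAAATGAAAAAAAAATAACCGCATG';
for my $offset (0 .. 2) {
    printf "Reading Frame %d: %s\n", $offset + 1,
        join ' | ', reading_frame_codons($dna, $offset);
}
```

Run on the test sequence, this should print the same three codon lists as shown above, which makes it easy to check by eye which ATG and stop codons are actually in frame.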

Having created the three frames, I then wrote a sub-routine that uses a while loop to move along each sequence in triplets, looking for a start codon (ATG) and terminating at the first stop codon in its triplet frame. Staying in the same triplet frame is essential, as shown above with the test sequence.

So far my script is:


# feed the dna data into open_reading_frame to return the longest ORF

print "\n -------Reading Frame 1-------\n\n";
$longorf1 = open_reading_frame($dna);


print "\n -------Reading Frame 2-------\n\n";
# remove first base from sequence
$dna2 = substr $dna, 1;
$longorf2 = open_reading_frame($dna2);
print $longorf2;

print "\n -------Reading Frame 3-------\n\n";
# remove first base from $dna2
$dna3 = substr $dna2, 1;
$longorf3 = open_reading_frame($dna3);
print $longorf3;

You will see that each reading frame calls the sub-routine "open_reading_frame". This contains the loop that should move along the reading frame sequence in triplets, looking for a start codon (ATG) and terminating at a stop codon, so that the longest of these open reading frames can be printed.

The code I have so far for this sub-routine is shown below:
# A subroutine to find the longest open reading frame (ORF) for a sequence

sub open_reading_frame {

    my($dna) = @_;

    use strict;
    use warnings;

    #Declare and initialise variables
    my $longest_str ='';
    my $longest_len = 0;

    local $_ = $dna;
    s/\s+//g;
#   print $_,"\n";

    # longest of the shortest sequences ending with TAA|TAG|TGA
    while( /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
        if( length $1 >$longest_len ){
             $longest_str=$1;
             $longest_len=length $1;
             print $1, "\n";
          }
      }
    return $longest_str;
}

---My Problem---

Now, upon running the script with a test DNA sequence, reading frames 1, 2 and 3 all return the same sequences. They shouldn't, as they are in different triplet frames and should detect different ATG start codons depending on the frame they are in. Is there a problem with my loop, or with the call to the sub-routine?

Many Thanks

Stephen

 
Question by:StephenMcGowan

Accepted Solution

by:ozo
ID: 24347627
while( /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
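A small sketch, not part of the original thread, of why the \G(?:...)*? prefix matters. The unanchored pattern can start matching at any offset, so it finds the same ATG in all three frames (frames 2 and 3 are just the same string with one or two leading bases removed). With \G, each match has to start where the previous one ended (or at the start of the string), and (?:...)*? can only skip whole triplets before ATG. The helper name longest_capture and the variables $unanchored and $in_frame are made up for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: return the longest $1 captured by repeated
# matches of $re against $dna.
sub longest_capture {
    my ($dna, $re) = @_;
    my $longest = '';
    while ($dna =~ /$re/g) {
        $longest = $1 if length($1) > length($longest);
    }
    return $longest;
}

# Pattern from the question: ATG may be found at any offset.
my $unanchored = qr/ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/i;
# Pattern from the accepted answer: ATG must sit on a triplet boundary.
my $in_frame   = qr/\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/i;

my $dna = 'AAGAAAATGAAAAAAAAATAACCGCATG';
for my $offset (0 .. 2) {
    my $frame = substr($dna, $offset);
    printf "Frame %d  unanchored: '%s'  in-frame: '%s'\n", $offset + 1,
        longest_capture($frame, $unanchored),
        longest_capture($frame, $in_frame);
}
```

On the test sequence from the question, the unanchored pattern should report the same ORF for every frame, while the in-frame pattern should report it for frame 1 only, since frames 2 and 3 have no in-frame ATG followed by an in-frame stop codon.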

Author Comment

by:StephenMcGowan
ID: 24347654
Hi ozo,

I've used this in place of the previous loop and it seems to be working fine, but it is printing out all of the frames it finds instead of only the longest one?

I'd have thought:

             $longest_str=$1;
             $longest_len=length $1;
             print $1, "\n";

would have seen to this after the loop?

Expert Comment

by:ozo
ID: 24347672
The function only returns the longest one after finishing the loop,
but the print $1, "\n"; statement is inside the loop, so it runs every time a longer match is found.

Author Comment

by:StephenMcGowan
ID: 24347718
Thanks again ozo, sorry, I'm kinda new to this.

So you're saying I should have it outside the loop? As in:

    # longest of the shortest sequences ending with TAA|TAG|TGA
#    while( /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
     while( /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
        if( length $1 >$longest_len ){
             $longest_str=$1;
             $longest_len=length $1;
          }
      }
 print $1, "\n";

    return $longest_str;
}

If I try this I receive: Use of uninitialized value in print at ReadingFrameModules.pm
I'm not really sure where the print needs to go in order to print only the longest :o/

Expert Comment

by:ozo
ID: 24347733
$1 is only defined when the match succeeds, and the loop ends when the match fails.
Did you mean to print $longest_str?
Even that seems unnecessary if whoever calls the function is responsible for printing its result.
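A minimal sketch of that arrangement, assuming the sub-routine should only return the longest ORF and leave all printing to the caller. The "(no ORF found)" message and the loop over frame numbers are additions for illustration, not from the original script.

```perl
use strict;
use warnings;

# Return the longest in-frame ORF; nothing is printed here.
sub open_reading_frame {
    my ($dna) = @_;
    $dna =~ s/\s+//g;    # strip whitespace from the local copy
    my $longest_str = '';
    while ($dna =~ /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig) {
        $longest_str = $1 if length($1) > length($longest_str);
    }
    return $longest_str;
}

my $dna = 'AAGAAAATGAAAAAAAAATAACCGCATG';
for my $frame (1 .. 3) {
    # Frame 1 keeps the sequence, frames 2 and 3 drop 1 and 2 bases.
    my $orf = open_reading_frame(substr($dna, $frame - 1));
    print "\n -------Reading Frame $frame-------\n\n";
    print length($orf) ? "$orf\n" : "(no ORF found)\n";
}
```

Note that with this pattern $1 holds the bases after the ATG up to and including the stop codon, not the ATG itself. If the start codon should be part of the result, the caller could prepend it to any non-empty result.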

Author Comment

by:StephenMcGowan
ID: 24347791
Nevermind! sorted it! (i think!) Thanks ozo. :)