Solved

DNA Reading Frames

Posted on 2009-05-10
Last Modified: 2012-05-06
Hi,

I'm having trouble with a bit of code. I am trying to read a DNA sequence in each of its three reading frames. For example, take the hypothetical sequence:

AAGAAAATGAAAAAAAAATAACCGCATG

Reading Frame 1 would be: AAG | AAA | ATG | AAA | AAA | AAA | TAA | CCG | CAT | G
Reading Frame 2 would be: AGA | AAA | TGA | AAA | AAA | AAT | AAC | CGC | ATG
Reading Frame 3 would be: GAA | AAT | GAA | AAA | AAA | ATA | ACC | GCA | TG

You will see that for reading frame 2 the first base has been removed, and for reading frame 3 the first two bases, causing a frame shift in the triplet code. Triplets of interest are ATG (start codon) and TAA/TGA/TAG (stop codons).
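To make the frame shift concrete, here is a minimal sketch that splits the example sequence into triplets for each frame. The helper name codons_in_frame is hypothetical (it is not part of my script), and a trailing fragment shorter than three bases is simply kept as the last element.

```perl
use strict;
use warnings;

# Hypothetical helper: split $dna into triplets, starting $offset bases in
# (0, 1 or 2). Any leftover one or two bases become the final element.
sub codons_in_frame {
    my ($dna, $offset) = @_;
    return substr($dna, $offset) =~ /.{1,3}/g;
}

my $dna = 'AAGAAAATGAAAAAAAAATAACCGCATG';
for my $offset (0 .. 2) {
    my @codons = codons_in_frame($dna, $offset);
    print 'Reading Frame ', $offset + 1, ': ', join(' | ', @codons), "\n";
}
```

Run on the sequence above, this should list the same triplets as the three frames written out by hand.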

So I've been trying to write a script where you can input an arbitrary DNA sequence, and the script then creates the three reading frames by either keeping the sequence as it is (Frame 1), removing the first base (Frame 2), or removing the first two bases (Frame 3).

Having created the three frames, I wrote a subroutine that uses a while loop to move along each sequence one triplet at a time, looking for a start codon (ATG) and terminating at the first stop codon in the same triplet frame. Staying in the same triplet frame is essential, as shown above with the test sequence.

So far my script is:


# feed the dna data into open_reading_frame to return the longest ORF

print "\n -------Reading Frame 1-------\n\n";
$longorf1 = open_reading_frame($dna);
print $longorf1;


print "\n -------Reading Frame 2-------\n\n";
# remove first base from sequence
$dna2 = substr $dna, 1;
$longorf2 = open_reading_frame($dna2);
print $longorf2;

print "\n -------Reading Frame 3-------\n\n";
# remove first base from $dna2
$dna3 = substr $dna2, 1;
$longorf3 = open_reading_frame($dna3);
print $longorf3;

You will see that each reading frame calls the subroutine "open_reading_frame". This contains the loop that should move along each reading-frame sequence in triplets, looking for a start codon (ATG) and terminating at a stop codon, before returning the longest open reading frame found.

The code for what i have done so far for this sub-routine is shown below:
# A subroutine to find the longest open reading frame (ORF) for a sequence

sub open_reading_frame {

    my($dna) = @_;

    use strict;
    use warnings;

    #Declare and initialise variables
    my $longest_str ='';
    my $longest_len = 0;

    local $_ = $dna;
    s/\s+//g;
#   print $_,"\n";

    # longest of the shortest sequences ending with TAA|TAG|TGA
    while( /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
        if( length $1 >$longest_len ){
             $longest_str=$1;
             $longest_len=length $1;
             print $1, "\n";
          }
      }
    return $longest_str;
}

---My Problem---

OK, now when I run the script with a test DNA sequence, reading frames 1, 2 and 3 all return the same sequence. They shouldn't, as they are in different triplet frames and should detect different ATG start codons depending on the frame. Is there a problem with my loop, or a problem with the call to the subroutine?

Many Thanks

Stephen

 
Question by:StephenMcGowan

Accepted Solution

by:
ozo earned 500 total points
ID: 24347627
while( /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
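Why this helps, as far as I can tell: the original pattern lets the regex engine find ATG at any offset, so the same start codon is found no matter how many bases were trimmed from the front. With \G(?:...)*? in front, each match has to start where the previous one ended (or at the start of the string) and skip a whole number of triplets before ATG, so only in-frame start codons are seen. A minimal sketch comparing the two on the sequence from the question follows; the two sub names are made up for illustration.

```perl
use strict;
use warnings;

# Original pattern: ATG may be found at any offset in the string.
sub longest_orf_any_offset {
    my ($dna) = @_;
    my $longest = '';
    while ($dna =~ /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig) {
        $longest = $1 if length($1) > length($longest);
    }
    return $longest;
}

# Anchored pattern: \G plus whole triplets keeps every ATG in frame.
sub longest_orf_in_frame {
    my ($dna) = @_;
    my $longest = '';
    while ($dna =~ /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig) {
        $longest = $1 if length($1) > length($longest);
    }
    return $longest;
}

my $dna = 'AAGAAAATGAAAAAAAAATAACCGCATG';
for my $offset (0 .. 2) {
    my $frame = substr $dna, $offset;
    printf "Frame %d  any offset: '%s'  in frame: '%s'\n", $offset + 1,
        longest_orf_any_offset($frame), longest_orf_in_frame($frame);
}
```

On this sequence the unanchored version gives the same result for all three frames, while the anchored one only finds an ORF in frame 1, since the only ATG followed by a stop codon is in that frame. Note that in both versions $1 is the part after ATG, up to and including the stop codon.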

Author Comment

by:StephenMcGowan
ID: 24347654
Hi ozo,

I've entered this in place of the previous loop and it seems to be working fine, but it seems to be printing out every ORF it finds instead of only the longest one?

I'd have thought:

             $longest_str=$1;
             $longest_len=length $1;
             print $1, "\n";

would have seen to this after the loop?

Expert Comment

by:ozo
ID: 24347672
The function only returns the longest one after finishing the loop,
but the  print $1, "\n";  is inside the loop, so it runs every time a longer match is found.

Author Comment

by:StephenMcGowan
ID: 24347718
Thanks again ozo, sorry, I'm kind of new to this.

So you're saying I should have it outside the loop? As in:

    # longest of the shortest sequences ending with TAA|TAG|TGA
#    while( /ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
     while( /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig ){
        if( length $1 >$longest_len ){
             $longest_str=$1;
             $longest_len=length $1;
          }
      }
 print $1, "\n";

    return $longest_str;
}

If I try this I receive: Use of uninitialized value in print at ReadingFrameModules.pm
I'm not really sure where the print needs to go in order to print only the longest one :o/

Expert Comment

by:ozo
ID: 24347733
$1 is only defined when the match succeeds, and the loop ends when the match fails.
Did you mean to print $longest_str?
That seems unnecessary, though, if whoever calls the function is responsible for printing the result of the function.
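Putting that together, a sketch of one way to arrange it: the subroutine only returns the longest ORF and the caller prints once per frame. The helper longest_orf_per_frame and the "(no ORF found)" text are assumptions added for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Longest ORF body (after ATG, through the stop codon) in the frame that
# starts at the first base of $dna. Returns '' if none is found.
sub open_reading_frame {
    my ($dna) = @_;
    $dna =~ s/\s+//g;
    my $longest_str = '';
    while ($dna =~ /\G(?:...)*?ATG(?=((?:...)*?(?:TAA|TAG|TGA)))/ig) {
        $longest_str = $1 if length($1) > length($longest_str);
    }
    return $longest_str;    # nothing is printed in here
}

# Hypothetical helper: longest ORF for each of the three forward frames.
sub longest_orf_per_frame {
    my ($dna) = @_;
    return map { open_reading_frame(substr($dna, $_)) } 0 .. 2;
}

my @orfs = longest_orf_per_frame('AAGAAAATGAAAAAAAAATAACCGCATG');
for my $i (0 .. $#orfs) {
    my $shown = length($orfs[$i]) ? $orfs[$i] : '(no ORF found)';
    print "\n -------Reading Frame ", $i + 1, "-------\n\n";
    print "$shown\n";
}
```

Keeping the print in the caller means the subroutine can be reused where no output is wanted, and $1 is never touched outside the loop.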

Author Comment

by:StephenMcGowan
ID: 24347791
Never mind! Sorted it! (I think!) Thanks ozo. :)