Solved

Miss-cleaves

Posted on 2009-05-18
Last Modified: 2012-05-07
I have a chunk of code which selects an enzyme, and depending on the enzyme cuts a sequence at a specific section:

my $enzyme = $query->param('enzyme');

# Select an enzyme from the radio buttons on form
my $re;
if   ($enzyme eq   'TRYPSIN') { $re=qr/(?<=[KR])(?!P)/; }
elsif($enzyme eq 'ENDOPROTL') { $re=qr/(?<=K)(?!P)/; }
elsif($enzyme eq 'ENDOPROTA') { $re=qr/(?<=R)(?!P)/; }
elsif($enzyme eq    'V8PROT') { $re=qr/(?<=E)(?!P)/; }
else {die "Unknown enzyme selection '$enzyme'\n";}

So in the case above, trypsin cuts after K and R, but not when the next residue is P.
EndoprotL cuts after K, but not when the next residue is P.
And so on.
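
As a rough illustration of how these zero-width patterns behave, the sketch below splits a short test string with the trypsin pattern. The test sequence is an assumption made up for this example; it is not output from the script.

```perl
use strict;
use warnings;

# Trypsin pattern from above: split after K or R unless the next residue is P.
my $re = qr/(?<=[KR])(?!P)/;

# Hypothetical test sequence (not taken from real data).
my $seq = 'SAEVIHQRPVEEALDTDEKEMLR';

# The RP site is skipped, so the sequence is cut only after the K.
my @parts = split $re, $seq;
print "$_\n" for @parts;
```

Because both the lookbehind and the lookahead are zero-width, split removes no residues: each fragment keeps its terminal K or R.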

Anyway, I'm trying to adapt this code to count how many times a miscleavage happens. For example, if trypsin is selected:
how many times does "K" followed by "P" occur?
how many times does "R" followed by "P" occur?
(these two counts will be added together)
If EndoprotL is selected, how many times does "K" followed by "P" occur?
And so on.

These will be known as miss cleaves and stored in the variable $miss_cleave.

I've copy/pasted my script below if any further information is required.

Thanks.
#!/usr/bin/perl -w
use CGI::Carp 'fatalsToBrowser';
# ORFfinder.pl
# Perl programme to read in FastA format to find all possible open
# reading frames (ORFS) beginning with ATG and ending with a stop codon,
# TGA, TAA, TAG)
 
# Analyse all six open reading frames and predict ORFS in all six. Only
# longest ORF will be used.
 
require 'module.pm';
use CGI;
use strict;
use warnings;
use DNALib;
use ReadingFrameModules;
my $query = new CGI;
 
# Initialise variables
my ($dna, $dna1, $dna2, $dna3, $dna5, $dna6, $revcom, $revcom1, $revcom2, $longorf1, $longorf2, $longorf3, $longorf4, $longorf5, $longorf6, 
$dna_filename);
$dna=$dna1=$dna2=$dna3=$dna5=$dna6=$revcom=$revcom1=$revcom2=$longorf1=$longorf2=$longorf3=$longorf4=$longorf5=$longorf6=$dna_filename='';
my $dna_file;
my @file_data;
my $dna_header;
 
   # If a text box provided, take from that
if ($query->param('dna-textbox')) {
   $dna1 = $query->param('dna-textbox');
   # take header and save it as a string $dna_header
   ($dna_header, $dna1) = split(/\n/, $dna1, 2);
 
   $dna = extract_string_sequence_from_fasta_data($dna1);
 }
   # Else see if file upload
elsif($query->param('fileupload'))  {
 
   #  Retrieve the file from the web post instead of the filesystem
  @file_data = get_file_data();
   #Extract the sequence from the contents of the file
   $dna = extract_sequence_from_fasta_data(@file_data);
}
 
 
# Add ACGT Validation, changing all non ACGT code to A
$dna =~ s/[^acgt]/a/g;
 
 
# feed the dna data into open_reading_frame to return the longest ORF
 
$longorf1 = open_reading_frame($dna);
 
# remove first base from sequence
$dna2 = substr $dna, 1;
$longorf2 = open_reading_frame($dna2);
 
# remove first base from $dna2
$dna3 = substr $dna2, 1;
$longorf3 = open_reading_frame($dna3);
 
# Reverse complement the DNA sequence
$revcom = revcom($dna);
$longorf4 = open_reading_frame($revcom);
 
 
#remove first base from sequence
$dna5 = substr $revcom, 1;
$longorf5 = open_reading_frame($dna5);
 
#remove a further base from the sequence
$dna6 = substr $dna5, 1;
$longorf6 = open_reading_frame($dna6);
 
# SECOND HALF OF THE PROGRAM - THIS WAS ORIGINALLY TO BE SENT TO A SECOND SCRIPT
# FOR TASK 2 BUT HAD PROBLEMS WITH THE CGI IMPLEMENTING TWO SCRIPTS ON ONE HTML FORM
 
# my($longorf1,$longorf2,$longorf3,$longorf4,$longorf5,$longorf6)=@ARGV;
 
#Transfer Open Reading Frames over to ProteinDigest
# system './proteindigest.pl', $longorf1,$longorf2,$longorf3,$longorf4,$longorf5,$longorf6;
 
# Initialise second program variables
my $orfprotein1 = '';
my $orfprotein2 = '';
my $orfprotein3 = '';
my $orfprotein4 = '';
my $orfprotein5 = '';
my $orfprotein6 = '';
my $codon;
 
# Convert DNA sequence to Protein sequence - Translate each three base
# codon into an amino acid, and append to the protein
 
for(my $i=0; $i < (length($longorf1) -2) ; $i += 3) {
$codon = substr($longorf1,$i,3);
$orfprotein1 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf2) -2) ; $i += 3) {
$codon = substr($longorf2,$i,3);
$orfprotein2 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf3) -2) ; $i += 3) {
$codon = substr($longorf3,$i,3);
$orfprotein3 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf4) -2) ; $i += 3) {
$codon = substr($longorf4,$i,3);
$orfprotein4 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf5) -2) ; $i += 3) {
$codon = substr($longorf5,$i,3);
$orfprotein5 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf6) -2) ; $i += 3) {
$codon = substr($longorf6,$i,3);
$orfprotein6 .= codon2aa($codon);
}
 
# Add N-terminal to each reading frame
 
$orfprotein1 = "_$orfprotein1";
$orfprotein2 = "_$orfprotein2";
$orfprotein3 = "_$orfprotein3";
$orfprotein4 = "_$orfprotein4";
$orfprotein5 = "_$orfprotein5";
$orfprotein6 = "_$orfprotein6";
 
 
 
 
 
 
 
 
 
my $enzyme = $query->param('enzyme');
 
# Select an enzyme from the radio buttons on form
my $re;
if   ($enzyme eq   'TRYPSIN') { $re=qr/(?<=[KR])(?!P)/; }
elsif($enzyme eq 'ENDOPROTL') { $re=qr/(?<=K)(?!P)/; }
elsif($enzyme eq 'ENDOPROTA') { $re=qr/(?<=R)(?!P)/; }
elsif($enzyme eq    'V8PROT') { $re=qr/(?<=E)(?!P)/; }
else {die "Unknown enzyme selection '$enzyme'\n";}
 
 
# To cleave all proteins, and put them in the same array
my @parts;
foreach my $seq ($orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6) {
    push @parts, split($re, $seq);
}
 
# Now, @parts contains every fragment
# Join all digested protein fragments into one HTML string
# (join returns a single string, so @fragments holds one element)
my @fragments = join("<br>\n", @parts); 
 
print "Content-type:  text/html
 
<html>
<head>
<link href='thrColElsHdr.css' rel='stylesheet' type='text/css' />
</head>
<div class='thrColElsHdr'>
 
<div id='container'>
  <div id='header'>
     
     <img src='dna.png' alt='DNA double helix' />
 
         <h2>Peptide mass/charge analyser</h2>
 
    
  <!-- end #header --></div>
  <div id='sidebar1'>
  
  <!-- end #sidebar1 --></div>
  <div id='sidebar2'>
  
  <!-- end #sidebar2 --></div>
  <div id='mainContent'>
  
<label>
<h2>Protein Digestion Results for $dna_header</h2>
 
 
</label>
<form id='form3' name='form3' method='post' action='mass.pl'>
<label>Please select a Mass to be analysed before continuing to the mass 
analyser:<br />    <br />
    <label>
      <input type='radio' name='mass' value='average' 
id='average' />
      Average</label>
    <label>
      <input type='radio' name='mass' value='mono-isotopic' 
id='mono-isotopic'
/>
      Mono-Isotopic</label>
    <br />
<br />
 
Please click here:
<form method= 'link' action='mass.pl'> <input class='form-button' type='submit' value='M/Z Analyser'>
 
</form>
 
<hr />
 
<p>List of protein cleavage fragments, cleaved with enzyme $enzyme;</p>
<p>@fragments</p>  
 
  
  
  
  
 
 
 
 
 
	<!-- end #mainContent --></div>
	<!-- This clearing element should immediately follow the #mainContent div in order to force the #container div to contain all child floats --><br class='clearfloat' />
   <div id='footer'>
<p><a href='Help.pl#references'>REFERENCES</a> | <a href='Help.pl#about'>ABOUT</a></p>
  <!-- end #footer --></div>
<!-- end #container --></div>
</div>
</html>
 
";

Question by:StephenMcGowan
LVL 39

Expert Comment

by:Adam314
ID: 24413258

my $enzyme = $query->param('enzyme');
 
# Select an enzyme from the radio buttons on form
my ($reS, $reC);
if   ($enzyme eq   'TRYPSIN') { $reS=qr/(?<=[KR])(?!P)/; $reC = qr/[KR]P/;}
elsif($enzyme eq 'ENDOPROTL') { $reS=qr/(?<=K)(?!P)/;    $reC = qr/KP/;}
elsif($enzyme eq 'ENDOPROTA') { $reS=qr/(?<=R)(?!P)/;    $reC = qr/RP/;}
elsif($enzyme eq    'V8PROT') { $reS=qr/(?<=E)(?!P)/;    $reC = qr/EP/;}
else {die "Unknown enzyme selection '$enzyme'\n";}
 
 
# To cleave all proteins, and put them in the same array
my @parts;
my $miss_cleave = 0;
foreach my $seq ($orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6) {
    push @parts, split($reS, $seq);
    # Count the skipped sites without modifying $seq, and add them to the running total
    $miss_cleave += () = $seq =~ /$reC/g;
}
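
For comparison, the enzyme selection could also be written as a lookup table rather than an if/elsif chain. This is only a sketch under assumptions: the %enzymes hash and the count_missed helper are hypothetical names introduced here, not part of the original script, and the test sequences are made up.

```perl
use strict;
use warnings;

# Hypothetical lookup table: enzyme name => [split pattern, missed-cleavage pattern].
my %enzymes = (
    TRYPSIN   => [ qr/(?<=[KR])(?!P)/, qr/[KR]P/ ],
    ENDOPROTL => [ qr/(?<=K)(?!P)/,    qr/KP/    ],
    ENDOPROTA => [ qr/(?<=R)(?!P)/,    qr/RP/    ],
    V8PROT    => [ qr/(?<=E)(?!P)/,    qr/EP/    ],
);

# Return the total number of skipped sites across all given sequences.
sub count_missed {
    my ($enzyme, @seqs) = @_;
    my $pair = $enzymes{$enzyme}
        or die "Unknown enzyme selection '$enzyme'\n";
    my $reC   = $pair->[1];
    my $total = 0;
    for my $seq (@seqs) {
        # "() =" forces list context, so the assignment yields the match count.
        $total += () = $seq =~ /$reC/g;
    }
    return $total;
}

# Made-up test sequences: KP and RP in the first, RP in the second.
print count_missed('TRYPSIN', 'AKPRPK', 'MRPK'), "\n";
```

Counting with a plain match leaves the sequences untouched, whereas counting with s///g would strip the matched residues out of them.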

 

Author Comment

by:StephenMcGowan
ID: 24414760
Hi Adam,

Really sorry about this, but I think I described what I want to achieve incorrectly.

My script currently creates an array called @fragments, which is a list of small peptides that varies depending on which enzyme is doing the cutting. Each enzyme is different:

Trypsin cuts at K and R but not when followed by a P ("KP", "RP")
EndoprotL cuts at K but not when followed by a P ("KP")
and so on, you get the gist.

This all depends on the enzyme selected, so each enzyme has a different type of miss cleave, whether it be (KP + RP), (KP), (RP) or (EP).

What I'm trying to do is find a way, dependent on the enzyme, to scan through every line of @fragments, count the number of the relevant type of miscleave on each line, and return the counts in an array. So:

Enzyme: Trypsin

Peptide                      Miscleaves

SAEVIHQ "RP" VEEALDTDEK      1
EMLR                         0
DVAI "KP" DVVPPNVR           1
DLALVELDILR                  0
ER "KP" R                    1
GK                           0
LSVGDLAELLYR                 0
Thanks
 
LVL 39

Accepted Solution

by:
Adam314 earned 500 total points
ID: 24416376

my $enzyme = $query->param('enzyme');
 
# Select an enzyme from the radio buttons on form
my ($reS, $reC);
if   ($enzyme eq   'TRYPSIN') { $reS=qr/(?<=[KR])(?!P)/; $reC = qr/[KR]P/;}
elsif($enzyme eq 'ENDOPROTL') { $reS=qr/(?<=K)(?!P)/;    $reC = qr/KP/;}
elsif($enzyme eq 'ENDOPROTA') { $reS=qr/(?<=R)(?!P)/;    $reC = qr/RP/;}
elsif($enzyme eq    'V8PROT') { $reS=qr/(?<=E)(?!P)/;    $reC = qr/EP/;}
else {die "Unknown enzyme selection '$enzyme'\n";}
 
 
# To cleave all proteins, and put them in the same array
my @parts;
foreach my $seq ($orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6) {
    push @parts, split($reS, $seq);
}
 
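# One count per fragment, in the same order as @parts
my @miss_cleave;
foreach my $part (@parts) {
    # A list assignment in scalar context yields the number of matches,
    # so fragments with no skipped site get 0 rather than an empty string
    push @miss_cleave, scalar(() = $part =~ /$reC/g);
}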
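
To show how the two arrays line up, here is a small sketch that prints each fragment next to its count. It assumes trypsin and uses the peptides from the example table in the question, with the quotes and spaces removed; the printf layout is just one possible choice.

```perl
use strict;
use warnings;

# Trypsin missed-cleavage pattern, as in the code above.
my $reC = qr/[KR]P/;

# Peptides from the example table in the question, quotes and spaces removed.
my @parts = qw(SAEVIHQRPVEEALDTDEK EMLR DVAIKPDVVPPNVR DLALVELDILR ERKPR GK LSVGDLAELLYR);

# One count per fragment; fragments without a skipped site get 0.
my @miss_cleave = map { scalar(() = /$reC/g) } @parts;

# Print the peptide and its count side by side.
printf "%-22s %d\n", $parts[$_], $miss_cleave[$_] for 0 .. $#parts;
```

Since @miss_cleave is built in the same order as @parts, the same index can be used for both when generating the HTML output.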
