Solved

Miss-cleaves

Posted on 2009-05-18
Last Modified: 2012-05-07
I have a chunk of code that selects an enzyme and, depending on the enzyme, cuts a sequence at specific sites:

my $enzyme = $query->param('enzyme');

# Select an enzyme from the radio buttons on form
my $re;
if   ($enzyme eq   'TRYPSIN') { $re=qr/(?<=[KR])(?!P)/; }
elsif($enzyme eq 'ENDOPROTL') { $re=qr/(?<=K)(?!P)/; }
elsif($enzyme eq 'ENDOPROTA') { $re=qr/(?<=R)(?!P)/; }
elsif($enzyme eq    'V8PROT') { $re=qr/(?<=E)(?!P)/; }
else {die "Unknown enzyme selection '$enzyme'\n";}

So in the case above, Trypsin cuts after K or R, but not when the next residue is P;
EndoprotL cuts after K, but not when the next residue is P;
and so on.
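
To make the cutting rule concrete, here is a minimal sketch of how the Trypsin pattern behaves with split. The pattern is copied from the code above; the peptide string is made up purely for illustration.

```perl
use strict;
use warnings;

# Zero-width cleavage pattern for Trypsin, copied from the question:
# the position just after K or R, unless the next residue is P.
my $trypsin = qr/(?<=[KR])(?!P)/;

# Hypothetical sequence, invented for this example only.
my $protein = 'AAKPGGRCCKDD';

# split() on a zero-width pattern cuts between residues and removes nothing.
my @fragments = split $trypsin, $protein;

# "KP" is left intact, so the first fragment still contains it.
print join(' | ', @fragments), "\n";
```

Because the pattern only uses a lookbehind and a lookahead, it matches a position rather than any characters, which is why no residues are lost at the cut sites.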

Anyway, I'm trying to adapt this code to count how many times a miss cleave happens, i.e. if Trypsin is selected:
how many times does "K" followed by "P" occur?
how many times does "R" followed by "P" occur?
(these two will be added up)
If EndoprotL is selected, how many times does "K" followed by "P" occur?
etc.

These will be known as miss cleaves and stored in the variable $miss_cleave.

I've copy/pasted my script below if any further information is required.

Thanks.
#!/usr/bin/perl -w
use CGI::Carp 'fatalsToBrowser';
# ORFfinder.pl
# Perl programme to read in FastA format to find all possible open
# reading frames (ORFs) beginning with ATG and ending with a stop codon
# (TGA, TAA, TAG)
 
# Analyse all six open reading frames and predict ORFS in all six. Only
# longest ORF will be used.
 
require 'module.pm';
use CGI;
use strict;
use warnings;
use DNALib;
use ReadingFrameModules;
my $query = new CGI;
 
# Initialise variables
my ($dna, $dna1, $dna2, $dna3, $dna5, $dna6, $revcom, $revcom1, $revcom2, $longorf1, $longorf2, $longorf3, $longorf4, $longorf5, $longorf6, 
$dna_filename);
$dna=$dna1=$dna2=$dna3=$dna5=$dna6=$revcom=$revcom1=$revcom2=$longorf1=$longorf2=$longorf3=$longorf4=$longorf5=$longorf6=$dna_filename='';
my $dna_file;
my @file_data;
my $dna_header;
 
   # If a text box provided, take from that
if ($query->param('dna-textbox')) {
   $dna1 = $query->param('dna-textbox');
   # take header and save it as a string $dna_header
   ($dna_header, $dna1) = split(/\n/, $dna1, 2);
 
   $dna = extract_string_sequence_from_fasta_data($dna1);
 }
   # Else see if file upload
elsif($query->param('fileupload'))  {
 
   #  Retrieve the file from the web post instead of the filesystem
  @file_data = get_file_data();
   #Extract the sequence from the contents of the file
   $dna = extract_sequence_from_fasta_data(@file_data);
}
 
 
# Add ACGT Validation, changing all non ACGT code to A
$dna =~ s/[^acgt]/a/g;
 
 
# feed the dna data into open_reading_frame to return the longest ORF
 
$longorf1 = open_reading_frame($dna);
 
# remove first base from sequence
$dna2 = substr $dna, 1;
$longorf2 = open_reading_frame($dna2);
 
# remove first base from $dna2
$dna3 = substr $dna2, 1;
$longorf3 = open_reading_frame($dna3);
 
# Reverse complement the DNA sequence
$revcom = revcom($dna);
$longorf4 = open_reading_frame($revcom);
 
 
#remove first base from sequence
$dna5 = substr $revcom, 1;
$longorf5 = open_reading_frame($dna5);
 
#remove a further base from the sequence
$dna6 = substr $dna5, 1;
$longorf6 = open_reading_frame($dna6);
 
# SECOND HALF OF THE PROGRAM - THIS WAS ORIGINALLY TO BE SENT TO A SECOND SCRIPT
# FOR TASK 2 BUT HAD PROBLEMS WITH THE CGI IMPLEMENTING TWO SCRIPTS ON ONE HTML FORM
 
# my($longorf1,$longorf2,$longorf3,$longorf4,$longorf5,$longorf6)=@ARGV;
 
#Transfer Open Reading Frames over to ProteinDigest
# system './proteindigest.pl', $longorf1,$longorf2,$longorf3,$longorf4,$longorf5,$longorf6;
 
# Initialise second program variables
my $orfprotein1 = '';
my $orfprotein2 = '';
my $orfprotein3 = '';
my $orfprotein4 = '';
my $orfprotein5 = '';
my $orfprotein6 = '';
my $codon;
 
# Convert DNA sequence to Protein sequence - Translate each three base
# codon into an amino acid, and append to the protein
 
for(my $i=0; $i < (length($longorf1) -2) ; $i += 3) {
$codon = substr($longorf1,$i,3);
$orfprotein1 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf2) -2) ; $i += 3) {
$codon = substr($longorf2,$i,3);
$orfprotein2 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf3) -2) ; $i += 3) {
$codon = substr($longorf3,$i,3);
$orfprotein3 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf4) -2) ; $i += 3) {
$codon = substr($longorf4,$i,3);
$orfprotein4 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf5) -2) ; $i += 3) {
$codon = substr($longorf5,$i,3);
$orfprotein5 .= codon2aa($codon);
}
 
for(my $i=0; $i < (length($longorf6) -2) ; $i += 3) {
$codon = substr($longorf6,$i,3);
$orfprotein6 .= codon2aa($codon);
}
 
# Add N-terminal to each reading frame
 
$orfprotein1 = "_$orfprotein1";
$orfprotein2 = "_$orfprotein2";
$orfprotein3 = "_$orfprotein3";
$orfprotein4 = "_$orfprotein4";
$orfprotein5 = "_$orfprotein5";
$orfprotein6 = "_$orfprotein6";
 
 
 
 
 
 
 
 
 
my $enzyme = $query->param('enzyme');
 
# Select an enzyme from the radio buttons on form
my $re;
if   ($enzyme eq   'TRYPSIN') { $re=qr/(?<=[KR])(?!P)/; }
elsif($enzyme eq 'ENDOPROTL') { $re=qr/(?<=K)(?!P)/; }
elsif($enzyme eq 'ENDOPROTA') { $re=qr/(?<=R)(?!P)/; }
elsif($enzyme eq    'V8PROT') { $re=qr/(?<=E)(?!P)/; }
else {die "Unknown enzyme selection '$enzyme'\n";}
 
 
# To cleave all proteins, and put them in the same array
my @parts;
foreach my $seq ($orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6) {
    push @parts, split($re, $seq);
}
 
# Now, @parts contains every digested fragment
# Join the fragments into one HTML string, held as a single-element array
my @fragments = join("<br>\n", @parts); 
 
print "Content-type:  text/html
 
<html>
<head>
<link href='thrColElsHdr.css' rel='stylesheet' type='text/css' />
</head>
<div class='thrColElsHdr'>
 
<div id='container'>
  <div id='header'>
     
     <img src='dna.png' alt='DNA double helix' />
 
         <h2>Peptide mass/charge analyser</h2>
 
    
  <!-- end #header --></div>
  <div id='sidebar1'>
  
  <!-- end #sidebar1 --></div>
  <div id='sidebar2'>
  
  <!-- end #sidebar2 --></div>
  <div id='mainContent'>
  
<label>
<h2>Protein Digestion Results for $dna_header</h2>
 
 
</label>
<form id='form3' name='form3' method='post' action='mass.pl'>
<label>Please select a Mass to be analysed before continuing to the mass 
analyser:<br />    <br />
    <label>
      <input type='radio' name='mass' value='average' 
id='average' />
      Average</label>
    <label>
      <input type='radio' name='mass' value='mono-isotopic' 
id='mono-isotopic'
/>
      Mono-Isotopic</label>
    <br />
<br />
 
Please click here:
<form method= 'link' action='mass.pl'> <input class='form-button' type='submit' value='M/Z Analyser'>
 
</form>
 
<hr />
 
<p>List of protein cleavage fragments, cleaved with enzyme $enzyme;</p>
<p>@fragments</p>  
 
  
  
  
  
 
 
 
 
 
	<!-- end #mainContent --></div>
	<!-- This clearing element should immediately follow the #mainContent div in order to force the #container div to contain all child floats --><br class='clearfloat' />
   <div id='footer'>
<p><a href='Help.pl#references'>REFERENCES</a> | <a href='Help.pl#about'>ABOUT</a></p>
  <!-- end #footer --></div>
<!-- end #container --></div>
</div>
</html>
 
";

Question by:StephenMcGowan

Expert Comment

by:Adam314
ID: 24413258

my $enzyme = $query->param('enzyme');
 
# Select an enzyme from the radio buttons on form
my ($reS, $reC);
if   ($enzyme eq   'TRYPSIN') { $reS=qr/(?<=[KR])(?!P)/; $reC = qr/[KR]P/;}
elsif($enzyme eq 'ENDOPROTL') { $reS=qr/(?<=K)(?!P)/;    $reC = qr/KP/;}
elsif($enzyme eq 'ENDOPROTA') { $reS=qr/(?<=R)(?!P)/;    $reC = qr/RP/;}
elsif($enzyme eq    'V8PROT') { $reS=qr/(?<=E)(?!P)/;    $reC = qr/EP/;}
else {die "Unknown enzyme selection '$enzyme'\n";}
 
 
# To cleave all proteins, and put them in the same array
my @parts;
my $miss_cleave = 0;
foreach my $seq ($orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6) {
    push @parts, split($reS, $seq);
    # Count matches without modifying $seq, and add to the running total
    my $count = () = $seq =~ /$reC/g;
    $miss_cleave += $count;
}


Author Comment

by:StephenMcGowan
ID: 24414760
Hi Adam,

Really sorry about this, but I think I described what I want to achieve wrongly.

My script currently creates an array called @fragments, which is a list of small peptides that varies depending on which enzyme is doing the cutting. Each enzyme is different:

Trypsin cuts at K and R but not when followed by a P ("KP"  "RP")
EndoprotL cuts at K but not when followed by a P ("KP")
etc., you get the gist.

Anyway, this all depends on the enzyme selected, so each enzyme has a different type of miss cleave: (KP + RP), (KP), (RP) or (EP).

What I'm trying to do is find a way, dependent on the enzyme, to scan through every line of @fragments, count the number of miss cleaves of the relevant type on each line, and return the counts in an array, so:

Enzyme: Trypsin

Peptide                     Miscleaves

SAEVIHQ "RP" VEEALDTDEK     1
EMLR                        0
DVAI "KP" DVVPPNVR          1
DLALVELDILR                 0
ER "KP" R                   1
GK                          0
LSVGDLAELLYR                0

Thanks
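
The per-peptide count described above can be sketched as follows, assuming Trypsin is selected. The peptides are copied from the table with the quote marks and spaces removed and are used exactly as listed; count_matches is a hypothetical helper name, not something from the original script.

```perl
use strict;
use warnings;

# Miss-cleave pattern for Trypsin: K or R immediately followed by P.
my $miss_re = qr/[KR]P/;

# Peptides from the example table, quotes and spaces removed.
my @peptides = qw(
    SAEVIHQRPVEEALDTDEK
    EMLR
    DVAIKPDVVPPNVR
    DLALVELDILR
    ERKPR
    GK
    LSVGDLAELLYR
);

# Hypothetical helper: number of times $re matches in $peptide.
sub count_matches {
    my ($re, $peptide) = @_;
    # A global match in list context returns every match; assigning that
    # list to () in scalar context yields how many there were (0 if none).
    my $count = () = $peptide =~ /$re/g;
    return $count;
}

# One count per peptide, in the same order as @peptides.
my @miss_cleave = map { count_matches($miss_re, $_) } @peptides;

printf "%-22s %d\n", $peptides[$_], $miss_cleave[$_] for 0 .. $#peptides;
```

Matching instead of substituting leaves each peptide unchanged, so the same strings can still be printed or passed on afterwards.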

Accepted Solution

by:Adam314
ID: 24416376

my $enzyme = $query->param('enzyme');
 
# Select an enzyme from the radio buttons on form
my ($reS, $reC);
if   ($enzyme eq   'TRYPSIN') { $reS=qr/(?<=[KR])(?!P)/; $reC = qr/[KR]P/;}
elsif($enzyme eq 'ENDOPROTL') { $reS=qr/(?<=K)(?!P)/;    $reC = qr/KP/;}
elsif($enzyme eq 'ENDOPROTA') { $reS=qr/(?<=R)(?!P)/;    $reC = qr/RP/;}
elsif($enzyme eq    'V8PROT') { $reS=qr/(?<=E)(?!P)/;    $reC = qr/EP/;}
else {die "Unknown enzyme selection '$enzyme'\n";}
 
 
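# To cleave all proteins, and put them in the same array
my @parts;
foreach my $seq ($orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6) {
    push @parts, split($reS, $seq);
}
 
# One count per fragment: $miss_cleave[$i] belongs to $parts[$i]
my @miss_cleave;
foreach my $part (@parts) {
    # Count matches without altering $part; gives 0 (not an empty string) when there are none
    my $count = () = $part =~ /$reC/g;
    push @miss_cleave, $count;
}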
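
Since the two arrays line up by index, one possible way to show them side by side in the results page is sketched below. This is only an illustration: the sample values stand in for the real @parts and @miss_cleave, and the table markup is an assumption, not part of the original page.

```perl
use strict;
use warnings;

# Sample data standing in for @parts and @miss_cleave; in the real script
# these come from the two loops above.
my @parts       = qw(EMLR DVAIKPDVVPPNVR GK);
my @miss_cleave = (0, 1, 0);

# Build one HTML row per fragment, pairing the arrays by index.
# "|| 0" turns an empty or undefined count into 0 before formatting.
my @rows = map {
    sprintf '<tr><td>%s</td><td>%d</td></tr>', $parts[$_], $miss_cleave[$_] || 0
} 0 .. $#parts;

my $table = join "\n",
    '<table>',
    '<tr><th>Peptide</th><th>Miscleaves</th></tr>',
    @rows,
    '</table>';

print "$table\n";
```

The resulting string could be interpolated into the page in the same way the fragment list is, for example in place of the plain list of fragments.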
