Link to home
Start Free TrialLog in
Avatar of StephenMcGowan
StephenMcGowan

asked on

A very long, and code-heavy question!...

Ok, before i ask my question, i think it's best to take a look at the picture i've uploaded below...

ok, the picture shows 6 reading frames of a dna sequence at the top and the list below is an array (@fragments) of those reading frames but have been cleaved by an enzyme.

First of all, i am trying to relate each of the cleaved fragments back to its original ReadingFrame... so in the diagram below:

SAEVIHQVEEALDTDEK came from ReadingFrame1 and should be tagged peptide one

so i'm trying to create a table:

ReadingFrame       Peptide            Sequence                             Mass/Charge
ReadingFrame1           1          SAEVIHQVEEALDTDEK
ReadingFrame1           2           EMLR
ReadingFrame1           3           DLALVELDILR
.......                             ....          .......
ReadingFrame2            1           R
ReadingFrame2            2           QSFFETPSLPWAMK

etc etc

On top of this.. each letter has two mass weights assigned to it, an isotopic mass and an average mass. The end user will be able to select in an online form which mass spectrum they wish to look at.
The amino acid weights can be found here: http://www.matrixscience.com/help/aa_help.html

so i'm trying to create a counts loop which totals up the number of each amino acid in each fragment above and then depending on user choice....
IF monotopic weights to be used
       assign each amino acid a monotopic weight add up all weights and divide by charge
ELSIF average masses to be used
       assign each amino acid an average weight add up all weights and divide by charge

this will then fill the Mass/Charge column of the table above.

Any help with the coding on this or any ideas would be hugely appreciated.
output3.jpg
ASKER CERTIFIED SOLUTION
Avatar of Todd Mummert
Todd Mummert

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of StephenMcGowan
StephenMcGowan

ASKER

Hi climbgunks,

Thanks for the feedback and the breakdown of code! i'm just going through this now (see code below).

The six reading frames at the top of the picture, i have in six strings, $orfprotein1-6 etc.

Would i need to put this in an array going off your code?

As a further problem, in the weight calculations sections, i need to calculate to hypothetical average weight if the amino acid is B, Z or X... so     if B, Z or X then total up weights of all the other peptides and assign B, Z and X an average mass.
Any idea how i'd implement this?

Thanks

#!/usr/bin/perl -w
use CGI::Carp 'fatalsToBrowser';
# mass.pl
# Perl programme to find the both average and mono-isotopic M/Z ratios of peptides and return a table of this data
 
require 'module.pm';
use CGI;
use strict;
use warnings;
use DNALib;
use ReadingFrameModules;
my $query = new CGI;
 
 
system './ORFfinder.pl', $orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6;
 
my %iso_weight = (
    A => 71.037114
    R => 156.101111
    N => 114.042927
    D => 115.026943
    C => 103.009185
    E => 103.009185
    Q => 128.058578
    G => 57.021464
    H => 137.058912
    I => 113.084064
    L => 113.084064
    K => 128.094963
    M => 131.040485
    F => 147.068414
    P => 97.052764
    S => 87.032028
    T => 101.047679
    U => 150.95363
    W => 186.079313
    Y => 163.06332
    V => 99.068414
    #If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
    );
 
my %avg_weight = (
    A => 71.0779
    R => 156.1857
    N => 114.1026
    D => 115.0874
    C => 103.1429
    E => 129.114
    Q => 128.1292
    G => 57.0513
    H => 137.1393
    I => 113.1576
    L => 113.1576
    K => 128.1723
    M => 131.1961
    F => 147.1739
    P => 97.1152
    S => 87.0773
    T => 101.1039
    U => 150.0379
    W => 186.2099
    Y => 163.1733
    V => 99.1311
#If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
    );
 
# pick a type
my $mass = $query->param('mass');
 
# save frames
my @frames;
while (<>) {
    chomp;
    last unless /^ReadingFrame(\d+):(\S+)$/;
    # going to assume frames are in order, else handle $1
    push @frames, $2;
}

Open in new window


I was assuming that your input file was what was being shown in your output3.jpg image

so I was parsing all that information.   If you had a file that included all the information in your output jpg, you can then just call   "script.pl  peptide_data_file"

I also didn't notice that each line of the reading frames ended in an _ , but that's easy to fix:
change the following two lines above:
1)  my $frame = "_";
2)     if ($frame eq "_") {


As for your additional problem, you'll have to give me more information.   Can you provide the full set of ReadingFrames and peptide groups that are output.  Not as a jpg, but as a text file?

This is sounding very much like a HW problem now.  So I'll stop providing code, but can provide guidance.  



yeah sure.. sorry, thinking about it it would be best if i provide the output file for the picture above. Didn't think about it at the time, only attached the picture to try to visually show what i was trying to achieve.

the HTML output for the picture above is as follows:

print "Content-type:  text/html

<html>
<form id='form2' name='form2' method='post' action='save.pl'>
<title>ProteinDigester.pl</title>
<body>
<p>ReadingFrame1:$orfprotein1<br>
Reading Frame2:$orfprotein2<br>
Reading Frame3:$orfprotein3<br>
Reading Frame4:$orfprotein4<br>
Reading Frame5:$orfprotein5<br>
Reading Frame6:$orfprotein6<br>
<p>List of protein cleavage fragments, cleaved with enzyme $enzyme;</p>
<p>@fragments</p>
</body>
</form>
</html>

So Reading frames are the strings, and the chopped up list is @fragments

ALSO

At the very beginning of each reading frame is a N-terminus
And at the very end of the reading frame ( the _ you are refering to) is the C-terminus

These both have a mass of 18.0106 for monoisotopic mass and 18.0153 for average mass and should be included.

I will attach my script so you have an idea of what i'm doing... mass.pl is what i'm working on at the moment.


Script1.txt
ORFfinder.txt
mass.txt
DNALib.txt
ReadingFrameModules.txt
So N-terminus and C-terminus have the same monoisotopic and average masses.

so all it would really mean is adding a "_" to the start of each Reading frame and then assigning these its masses in the list:

monoisotopic
_ => 18.0106

average
_=>  18.0153

more of this is clear now...

you can do the following in the script I sent you

in place of the loop that reads the frames, you may be able to do:

my @frames = ($orfprotein1, $orfprotein2,  $orfprotein3,  $orfprotein4,  $orfprotein5,  $orfprotein6);

then the 2nd while loop will look something like

foreach (@fragments) {


To handle BZX, you'll have to modify the weight calculation group.   Split my foreach statemnt into an actual loop... if the input is BZX, make a note of it... wait until the rest of the weights have been calculated (and how many contributed to it), then divide the weight found by the number of contributors and assign it to the unknown.    

Btw, can you attach a text listing of all the reading frames, and all the fragments?

I can't, the reading frames and fragments are all generated by a front end web-form HTML (Script1) i sent you.

Probably the best i can do is give you a test sequence and you visit http://biolinux.smith.man.ac.uk/~campus12sm/Script1.pl and give it a whirl:
Select Trypsin for the output i've generated.

>TESTSEQUENCE
atgtctgctgaagtcatccatcaggttgaagaagcacttgatacagatgagaaggagatg
ctgcgggatgttgctatagatgtggttccacctaatgtcagggaccttgcgctcgtcgag
ctggatattttacgggaaagaggtaagctgtctgtcggggacttggctgaactgctctac
agagtgaggcgatttgacctgctcaaacgtatcttgaagatggacagaaaagctgtggag
acccacctgctcaggaaccctcaccttgtttcggactatagagtgctgatggcagagatt
ggtgaggatttggataaatctgatgtgtcctcattaattttcctcatgaaggattacatg
ggccgaggcaagataagcaaggagaagagtttcttggaccttgtggttgagttggagaaa
ctaaatctggttgccccagatcaactggatttattagaaaaatgcctaaagaacatccac
agaatagacctgaagacaaaaatccagaagtacaagcagtctgttcaaggagcagggaca
agttacaggaatgttctccaagcagcaatccaaaagagtctcaaggatccttcaaataac
ttcaggctccataatgggagaagtaaagaacaaagacttaaggaacagcttggcgctcaa
caagaaccagtgaagaaatccattcaggaatcagaagcttttttgcctcagagcatacct
gaagagagatacaagatgaagagcaagcccctaggaatctgcctgataatcgattgcatt
ggcaatgagacagagcttcttcgagacaccttcacttccctgggctatgaagtccagaaa
ttcttgcatctcagtatgcatggtatatcccagattcttggccaatttgcctgtatgccc
gagcaccgagactacgacagctttgtgtgtgtcctggtgagccgaggaggctcccagagt
gtgtatggtgtggatcagactcactcagggctccccctgcatcacatcaggaggatgttc
atgggagattcatgcccttatctagcagggaagccaaagatgttttttattcagaactat
gtggtgtcagagggccagctggaggacagcagcctcttggaggtggatgggccagcgatg
aagaatgtggaattcaaggctcagaagcgagggctgtgcacagttcaccgagaagctgac
ttcttctggagcctgtgtactgcggacatgtccctgctggagcagtctcacagctcaccg
tccctgtacctgcagtgcctctcccagaaactgagacaagaaagaaaacgcccactcctg
gatcttcacattgaactcaatggctacatgtatgattggaacagcagagtttctgccaag
gagaaatattatgtctggctgcagcacactctgagaaagaaacttatcctctcctacaca
taa

I'll have a go at doing what you've said, but no doubt i'll be asking more coding questions, i'm not too strong at coding which is why i'm here, only really started out at it, so would appreciate any help to try to set up this programme.

Anyway, i'll go and have a look at what you've suggested.

Thanks,

Stephen
quick question, do you know how i'd add "_" to the beginning of each ReadingFrame string?


in general...

$frame = "_$frame";

or $frame = "_" . $frame;

of if you have an array of frames:

@frames = map { "_$frame" } @frames;

so if $proteinorf1 = ACGTATCT_
this will become... $proteinorf1 = _ACGTATCT_?
yes..

$proteinorf1 = "_$proteinorf1"

I think your original question has been answered sufficiently.
Hi.. thanks for the assistance so far climbgunks, not sure if you're still willing to assist me with this or not, as i'm learning this as i go along.

I've saved the six string frames into an array called @frames. This is an array of the six reading frames together. The array @fragments is the list of the shorter peptide fragments

You say:

"then the 2nd while loop will look something like
foreach (@fragments) {"

Would i need to modify the code below to accomodate for @frames and @fragments?

Thanks, Much Appreciated.
# save frames
my @frames = ($orfprotein1, $orfprotein2,  $orfprotein3,  $orfprotein4,  $orfprotein5,  $orfprotein6);
#while (<>) {
#    chomp;
#    last unless /^ReadingFrame(\d+):(\S+)$/;
#    # going to assume frames are in order, else handle $1
#    push @frames, $2;
#}
 
# skip stuff
while (<>) {
    next if /^\s*$/;
    next if /^List/;
    last;
}
 
# now come fragments
my $frame = "";
my $framenum = 0;
my $peptide = 0;
do {
    chomp;
    if ($frame eq "") {
        $frame = shift @frames or die "ran out of frames\n";
        $framenum++;
        $peptide = 0;
    }
 
    die "fragment <$_> not found at in order at frame$framenum\nRest of
frame: $frame\n"
        unless $frame =~ s/^$_//;

Open in new window


get rid of the #skip stuff block now

and the loop that looks like  

do {
....
} while (<>);

you can change to

foreach (@fragments) {
...
}

I'm still  not sure what to do about the _     Is it in your weights hash?  Is it in the fragments list?
If no to both, you'll have to special case the handling of it.  
For "_" i've gone with:
    K => 128.094963
    M => 131.040485
    F => 147.068414
    P => 97.052764
    S => 87.032028
    T => 101.047679
    U => 150.95363
    W => 186.079313
    Y => 163.06332
    V => 99.068414
    _ => 18.0106
 
# If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
    );
 
my %avg_weight = (
    A => 71.0779
    R => 156.1857
    N => 114.1026
    D => 115.0874
    C => 103.1429
    E => 129.114
    Q => 128.1292
    G => 57.0513
    H => 137.1393
    I => 113.1576
    L => 113.1576
    K => 128.1723
    M => 131.1961
    F => 147.1739
    P => 97.1152
    S => 87.0773
    T => 101.1039
    U => 150.0379
    W => 186.2099
    Y => 163.1733
    V => 99.1311
    _ => 18.0153

Open in new window



If i understand the _ thing now...   there is one C group at the beginning of a frame and one N group at the end of a frame.  Both have the same weight, so you can represent them as the same symbol.  

However, they don't show up in the fragment list?   If that's the case...I would not prepend it to the frames,
and use code like below to handle it.

Basically, it adds a C group to the first peptide in a chain, and the last.   If there in only fragment in a frame it gets both a C and an N added.   See changes involving the variable $add_group
my $frame = "_";
my $framenum = 0;
my $peptide = 0;
my $add_group;
foreach (@fragments) {
    chomp;
    if ($frame eq "_") {
        $frame = shift @frames or die "ran out of frames\n";
        $framenum++;
        $peptide = 0;
        $addgroup = 1;
    }
 
    die "fragment <$_> not found at in order at frame$framenum\nRest of fr\
ame: $frame\n"
        unless $frame =~ s/^$_//;
 
    $add_group++ if $frame eq "_";
    my $weight = calculate_weight($_, $type, $add_group);
    $add_group = 0;
 
    $peptide++;
    print "ReadingFrame$framenum $peptide $_ $weight\n";
}
 
exit;
 
sub calculate_weight {
    my ($seq, $type, $add_group) = @_;
    my @seq = split(//, $seq);
    my $weight;
    if ($type eq "mono-isotopic") {
        $weight = $add_group*$iso_weight{"_");
        $weight += $iso_weight{$_} foreach (@seq);
    }
    else {
        $weight = $add_group*$avg_weight{"_");
        $weight += $avg_weight{$_} foreach (@seq);
    }
 
    return $weight;
}

Open in new window

ok i've just been going through the code and trying to understand what it's doing. I think i've got it.
The first section of code below takes each fragment from @fragments and tries to match it up against the reading frames in @frames

# now come fragments
my $frame = "";
my $framenum = 0;
my $peptide = 0;
foreach(@fragments){
    chomp;
    if ($frame eq "") {
        $frame = shift @frames or die "ran out of frames\n";
        $framenum++;
        $peptide = 0;
    }

    die "fragment <$_> not found at in order at frame$framenum\nRest of
frame: $frame\n"
        unless $frame =~ s/^$_//;

now once each fragment is assigned a number ($framenum): of the location it is contained in a reading frame, a weight is calculated in the sub-routine:

    my $weight = calculate_weight($_, $mass);

sub calculate_weight {
    my ($seq, $mass) = @_;
    my @seq = split(//, $seq);
    my $weight;
    if ($mass eq "mono-isotopic") {
        $weight += $iso_weight{$_} foreach (@seq);
    }
    else {
        $weight += $avg_weight{$_} foreach (@seq);
    }

    return $weight;
}

I'm trying to modify the weight calculation group.   I understand what you are saying when you say "Split my foreach statement into an actual loop... if the input is BZX, make a note of it... wait until the rest of the weights have been calculated (and how many contributed to it), then divide the weight found by the number of contributors and assign it to the unknown."
But not too sure how to go about implementing it with code.

Thanks again.  

However, they don't show up in the fragment list?   If that's the case...I would not prepend it to the frames,
and use code like below to handle it.

It's ok.. i have modified my script for "_" to appear in the fragments list for the beginning of reading frames and at the ends, giving _ => (aweight) shouldn't be a problem now.