A very long, and code-heavy question!...

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

ASKER

Hi climbgunks,

Thanks for the feedback and the breakdown of code! i'm just going through this now (see code below).

The six reading frames at the top of the picture, i have in six strings, $orfprotein1-6 etc.

Would i need to put this in an array going off your code?

As a further problem, in the weight calculations sections, i need to calculate to hypothetical average weight if the amino acid is B, Z or X... so if B, Z or X then total up weights of all the other peptides and assign B, Z and X an average mass.
Any idea how i'd implement this?

Thanks

#!/usr/bin/perl -w
use CGI::Carp 'fatalsToBrowser';
# mass.pl
# Perl programme to find the both average and mono-isotopic M/Z ratios of peptides and return a table of this data
 
require 'module.pm';
use CGI;
use strict;
use warnings;
use DNALib;
use ReadingFrameModules;
my $query = new CGI;
 
 
system './ORFfinder.pl', $orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6;
 
my %iso_weight = (
    A => 71.037114
    R => 156.101111
    N => 114.042927
    D => 115.026943
    C => 103.009185
    E => 103.009185
    Q => 128.058578
    G => 57.021464
    H => 137.058912
    I => 113.084064
    L => 113.084064
    K => 128.094963
    M => 131.040485
    F => 147.068414
    P => 97.052764
    S => 87.032028
    T => 101.047679
    U => 150.95363
    W => 186.079313
    Y => 163.06332
    V => 99.068414
    #If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
    );
 
my %avg_weight = (
    A => 71.0779
    R => 156.1857
    N => 114.1026
    D => 115.0874
    C => 103.1429
    E => 129.114
    Q => 128.1292
    G => 57.0513
    H => 137.1393
    I => 113.1576
    L => 113.1576
    K => 128.1723
    M => 131.1961
    F => 147.1739
    P => 97.1152
    S => 87.0773
    T => 101.1039
    U => 150.0379
    W => 186.2099
    Y => 163.1733
    V => 99.1311
#If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
    );
 
# pick a type
my $mass = $query->param('mass');
 
# save frames
my @frames;
while (<>) {
    chomp;
    last unless /^ReadingFrame(\d+):(\S+)$/;
    # going to assume frames are in order, else handle $1
    push @frames, $2;
}

I was assuming that your input file was what was being shown in your output3.jpg image

so I was parsing all that information. If you had a file that included all the information in your output jpg, you can then just call "script.pl peptide_data_file"

I also didn't notice that each line of the reading frames ended in an _ , but that's easy to fix:
change the following two lines above:
1) my $frame = "_";
2) if ($frame eq "_") {

As for your additional problem, you'll have to give me more information. Can you provide the full set of ReadingFrames and peptide groups that are output. Not as a jpg, but as a text file?

This is sounding very much like a HW problem now. So I'll stop providing code, but can provide guidance.

ASKER

yeah sure.. sorry, thinking about it it would be best if i provide the output file for the picture above. Didn't think about it at the time, only attached the picture to try to visually show what i was trying to achieve.

the HTML output for the picture above is as follows:

print "Content-type: text/html

<html>
<form id='form2' name='form2' method='post' action='save.pl'>
<title>ProteinDigester.pl</title>
<body>
ReadingFrame1:$orfprotein1 
Reading Frame2:$orfprotein2 
Reading Frame3:$orfprotein3 
Reading Frame4:$orfprotein4 
Reading Frame5:$orfprotein5 
Reading Frame6:$orfprotein6 
List of protein cleavage fragments, cleaved with enzyme $enzyme;
@fragments
</body>
</form>
</html>

So Reading frames are the strings, and the chopped up list is @fragments

ALSO

At the very beginning of each reading frame is a N-terminus
And at the very end of the reading frame ( the _ you are refering to) is the C-terminus

These both have a mass of 18.0106 for monoisotopic mass and 18.0153 for average mass and should be included.

I will attach my script so you have an idea of what i'm doing... mass.pl is what i'm working on at the moment.

Script1.txt
ORFfinder.txt
mass.txt
DNALib.txt
ReadingFrameModules.txt

ASKER

So N-terminus and C-terminus have the same monoisotopic and average masses.

so all it would really mean is adding a "_" to the start of each Reading frame and then assigning these its masses in the list:

monoisotopic
_ => 18.0106

average
_=> 18.0153

more of this is clear now...

you can do the following in the script I sent you

in place of the loop that reads the frames, you may be able to do:

my @frames = ($orfprotein1, $orfprotein2, $orfprotein3, $orfprotein4, $orfprotein5, $orfprotein6);

then the 2nd while loop will look something like

foreach (@fragments) {

To handle BZX, you'll have to modify the weight calculation group. Split my foreach statemnt into an actual loop... if the input is BZX, make a note of it... wait until the rest of the weights have been calculated (and how many contributed to it), then divide the weight found by the number of contributors and assign it to the unknown.

Btw, can you attach a text listing of all the reading frames, and all the fragments?

ASKER

I can't, the reading frames and fragments are all generated by a front end web-form HTML (Script1) i sent you.

Probably the best i can do is give you a test sequence and you visit http://biolinux.smith.man.ac.uk/~campus12sm/Script1.pl and give it a whirl:
Select Trypsin for the output i've generated.

>TESTSEQUENCE
atgtctgctgaagtcatccatcaggttgaagaagcacttgatacagatgagaaggagatg
ctgcgggatgttgctatagatgtggttccacctaatgtcagggaccttgcgctcgtcgag
ctggatattttacgggaaagaggtaagctgtctgtcggggacttggctgaactgctctac
agagtgaggcgatttgacctgctcaaacgtatcttgaagatggacagaaaagctgtggag
acccacctgctcaggaaccctcaccttgtttcggactatagagtgctgatggcagagatt
ggtgaggatttggataaatctgatgtgtcctcattaattttcctcatgaaggattacatg
ggccgaggcaagataagcaaggagaagagtttcttggaccttgtggttgagttggagaaa
ctaaatctggttgccccagatcaactggatttattagaaaaatgcctaaagaacatccac
agaatagacctgaagacaaaaatccagaagtacaagcagtctgttcaaggagcagggaca
agttacaggaatgttctccaagcagcaatccaaaagagtctcaaggatccttcaaataac
ttcaggctccataatgggagaagtaaagaacaaagacttaaggaacagcttggcgctcaa
caagaaccagtgaagaaatccattcaggaatcagaagcttttttgcctcagagcatacct
gaagagagatacaagatgaagagcaagcccctaggaatctgcctgataatcgattgcatt
ggcaatgagacagagcttcttcgagacaccttcacttccctgggctatgaagtccagaaa
ttcttgcatctcagtatgcatggtatatcccagattcttggccaatttgcctgtatgccc
gagcaccgagactacgacagctttgtgtgtgtcctggtgagccgaggaggctcccagagt
gtgtatggtgtggatcagactcactcagggctccccctgcatcacatcaggaggatgttc
atgggagattcatgcccttatctagcagggaagccaaagatgttttttattcagaactat
gtggtgtcagagggccagctggaggacagcagcctcttggaggtggatgggccagcgatg
aagaatgtggaattcaaggctcagaagcgagggctgtgcacagttcaccgagaagctgac
ttcttctggagcctgtgtactgcggacatgtccctgctggagcagtctcacagctcaccg
tccctgtacctgcagtgcctctcccagaaactgagacaagaaagaaaacgcccactcctg
gatcttcacattgaactcaatggctacatgtatgattggaacagcagagtttctgccaag
gagaaatattatgtctggctgcagcacactctgagaaagaaacttatcctctcctacaca
taa

I'll have a go at doing what you've said, but no doubt i'll be asking more coding questions, i'm not too strong at coding which is why i'm here, only really started out at it, so would appreciate any help to try to set up this programme.

Anyway, i'll go and have a look at what you've suggested.

Thanks,

Stephen

ASKER

quick question, do you know how i'd add "_" to the beginning of each ReadingFrame string?

in general...

$frame = "_$frame";

or $frame = "_" . $frame;

of if you have an array of frames:

@frames = map { "_$frame" } @frames;

ASKER

so if $proteinorf1 = ACGTATCT_
this will become... $proteinorf1 = _ACGTATCT_?

yes..

$proteinorf1 = "_$proteinorf1"

I think your original question has been answered sufficiently.

ASKER

Hi.. thanks for the assistance so far climbgunks, not sure if you're still willing to assist me with this or not, as i'm learning this as i go along.

I've saved the six string frames into an array called @frames. This is an array of the six reading frames together. The array @fragments is the list of the shorter peptide fragments

You say:

"then the 2nd while loop will look something like
foreach (@fragments) {"

Would i need to modify the code below to accomodate for @frames and @fragments?

Thanks, Much Appreciated.

# save frames
my @frames = ($orfprotein1, $orfprotein2,  $orfprotein3,  $orfprotein4,  $orfprotein5,  $orfprotein6);
#while (<>) {
#    chomp;
#    last unless /^ReadingFrame(\d+):(\S+)$/;
#    # going to assume frames are in order, else handle $1
#    push @frames, $2;
#}
 
# skip stuff
while (<>) {
    next if /^\s*$/;
    next if /^List/;
    last;
}
 
# now come fragments
my $frame = "";
my $framenum = 0;
my $peptide = 0;
do {
    chomp;
    if ($frame eq "") {
        $frame = shift @frames or die "ran out of frames\n";
        $framenum++;
        $peptide = 0;
    }
 
    die "fragment <$_> not found at in order at frame$framenum\nRest of
frame: $frame\n"
        unless $frame =~ s/^$_//;

get rid of the #skip stuff block now

and the loop that looks like

do {
....
} while (<>);

you can change to

foreach (@fragments) {
...
}

I'm still not sure what to do about the _ Is it in your weights hash? Is it in the fragments list?
If no to both, you'll have to special case the handling of it.

ASKER

For "_" i've gone with:

    K => 128.094963
    M => 131.040485
    F => 147.068414
    P => 97.052764
    S => 87.032028
    T => 101.047679
    U => 150.95363
    W => 186.079313
    Y => 163.06332
    V => 99.068414
    _ => 18.0106
 
# If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
    );
 
my %avg_weight = (
    A => 71.0779
    R => 156.1857
    N => 114.1026
    D => 115.0874
    C => 103.1429
    E => 129.114
    Q => 128.1292
    G => 57.0513
    H => 137.1393
    I => 113.1576
    L => 113.1576
    K => 128.1723
    M => 131.1961
    F => 147.1739
    P => 97.1152
    S => 87.0773
    T => 101.1039
    U => 150.0379
    W => 186.2099
    Y => 163.1733
    V => 99.1311
    _ => 18.0153

If i understand the _ thing now... there is one C group at the beginning of a frame and one N group at the end of a frame. Both have the same weight, so you can represent them as the same symbol.

However, they don't show up in the fragment list? If that's the case...I would not prepend it to the frames,
and use code like below to handle it.

Basically, it adds a C group to the first peptide in a chain, and the last. If there in only fragment in a frame it gets both a C and an N added. See changes involving the variable $add_group

my $frame = "_";
my $framenum = 0;
my $peptide = 0;
my $add_group;
foreach (@fragments) {
    chomp;
    if ($frame eq "_") {
        $frame = shift @frames or die "ran out of frames\n";
        $framenum++;
        $peptide = 0;
        $addgroup = 1;
    }
 
    die "fragment <$_> not found at in order at frame$framenum\nRest of fr\
ame: $frame\n"
        unless $frame =~ s/^$_//;
 
    $add_group++ if $frame eq "_";
    my $weight = calculate_weight($_, $type, $add_group);
    $add_group = 0;
 
    $peptide++;
    print "ReadingFrame$framenum $peptide $_ $weight\n";
}
 
exit;
 
sub calculate_weight {
    my ($seq, $type, $add_group) = @_;
    my @seq = split(//, $seq);
    my $weight;
    if ($type eq "mono-isotopic") {
        $weight = $add_group*$iso_weight{"_");
        $weight += $iso_weight{$_} foreach (@seq);
    }
    else {
        $weight = $add_group*$avg_weight{"_");
        $weight += $avg_weight{$_} foreach (@seq);
    }
 
    return $weight;
}

ASKER

ok i've just been going through the code and trying to understand what it's doing. I think i've got it.
The first section of code below takes each fragment from @fragments and tries to match it up against the reading frames in @frames

# now come fragments
my $frame = "";
my $framenum = 0;
my $peptide = 0;
foreach(@fragments){
chomp;
if ($frame eq "") {
$frame = shift @frames or die "ran out of frames\n";
$framenum++;
$peptide = 0;
}

die "fragment <$_> not found at in order at frame$framenum\nRest of
frame: $frame\n"
unless $frame =~ s/^$_//;

now once each fragment is assigned a number ($framenum): of the location it is contained in a reading frame, a weight is calculated in the sub-routine:

my $weight = calculate_weight($_, $mass);

sub calculate_weight {
my ($seq, $mass) = @_;
my @seq = split(//, $seq);
my $weight;
if ($mass eq "mono-isotopic") {
$weight += $iso_weight{$_} foreach (@seq);
}
else {
$weight += $avg_weight{$_} foreach (@seq);
}

return $weight;
}

I'm trying to modify the weight calculation group. I understand what you are saying when you say "Split my foreach statement into an actual loop... if the input is BZX, make a note of it... wait until the rest of the weights have been calculated (and how many contributed to it), then divide the weight found by the number of contributors and assign it to the unknown."
But not too sure how to go about implementing it with code.

Thanks again.