Solved

A very long, and code-heavy question!...

Posted on 2009-05-15
17
219 Views
Last Modified: 2012-05-07
Ok, before i ask my question, i think it's best to take a look at the picture i've uploaded below...

ok, the picture shows 6 reading frames of a dna sequence at the top and the list below is an array (@fragments) of those reading frames but have been cleaved by an enzyme.

First of all, i am trying to relate each of the cleaved fragments back to its original ReadingFrame... so in the diagram below:

SAEVIHQVEEALDTDEK came from ReadingFrame1 and should be tagged peptide one

so i'm trying to create a table:

ReadingFrame       Peptide            Sequence                             Mass/Charge
ReadingFrame1           1          SAEVIHQVEEALDTDEK
ReadingFrame1           2           EMLR
ReadingFrame1           3           DLALVELDILR
.......                             ....          .......
ReadingFrame2            1           R
ReadingFrame2            2           QSFFETPSLPWAMK

etc etc

On top of this.. each letter has two mass weights assigned to it, an isotopic mass and an average mass. The end user will be able to select in an online form which mass spectrum they wish to look at.
The amino acid weights can be found here: http://www.matrixscience.com/help/aa_help.html

so i'm trying to create a counts loop which totals up the number of each amino acid in each fragment above and then depending on user choice....
IF monotopic weights to be used
       assign each amino acid a monotopic weight add up all weights and divide by charge
ELSIF average masses to be used
       assign each amino acid an average weight add up all weights and divide by charge

this will then fill the Mass/Charge column of the table above.

Any help with the coding on this or any ideas would be hugely appreciated.
output3.jpg
0
Comment
Question by:StephenMcGowan
  • 10
  • 7
17 Comments
 
LVL 11

Accepted Solution

by:
climbgunks earned 500 total points
ID: 24401473

Thanks..this was an interesting problem... This does most of what you want...you'll have to fill in the real data values, and figure out how to interact w/ the user.

I assumed all frames are used entirely...there are no gaps, or skipped sequences.






#!/usr/bin/perl -w

use strict;
 

my %iso_weight = (

    S => 71.233,

    A => 34.22342,

    # put real values here

    );
 

my %avg_weight = (

    # fill in with average weights as above

    );
 

# pick a type

my $type = "mono-isotopic";
 

# save frames

my @frames;

while (<>) {

    chomp;

    last unless /^ReadingFrame(\d+):(\S+)$/;

    # going to assume frames are in order, else handle $1

    push @frames, $2;

}
 

# skip stuff

while (<>) {

    next if /^\s*$/;

    next if /^List/;

    last;

}
 

# now come fragments

my $frame = "";

my $framenum = 0;

my $peptide = 0;

do {

    chomp;

    if ($frame eq "") {

        $frame = shift @frames or die "ran out of frames\n";

        $framenum++;

        $peptide = 0;

    }
 

    die "fragment <$_> not found at in order at frame$framenum\nRest of frame: $frame\n"

        unless $frame =~ s/^$_//;
 

    my $weight = calculate_weight($_, $type);
 

    $peptide++;

    print "ReadingFrame$framenum $peptide $_ $weight\n";

} while (<>);
 

exit;
 

sub calculate_weight {

    my ($seq, $type) = @_;

    my @seq = split(//, $seq);

    my $weight;

    if ($type eq "mono-isotopic") {

        $weight += $iso_weight{$_} foreach (@seq);

    }

    else {

        $weight += $avg_weight{$_} foreach (@seq);

    }
 

    return $weight;

}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 24401961
Hi climbgunks,

Thanks for the feedback and the breakdown of code! i'm just going through this now (see code below).

The six reading frames at the top of the picture, i have in six strings, $orfprotein1-6 etc.

Would i need to put this in an array going off your code?

As a further problem, in the weight calculations sections, i need to calculate to hypothetical average weight if the amino acid is B, Z or X... so     if B, Z or X then total up weights of all the other peptides and assign B, Z and X an average mass.
Any idea how i'd implement this?

Thanks

#!/usr/bin/perl -w

use CGI::Carp 'fatalsToBrowser';

# mass.pl

# Perl programme to find the both average and mono-isotopic M/Z ratios of peptides and return a table of this data
 

require 'module.pm';

use CGI;

use strict;

use warnings;

use DNALib;

use ReadingFrameModules;

my $query = new CGI;
 
 

system './ORFfinder.pl', $orfprotein1,$orfprotein2,$orfprotein3,$orfprotein4,$orfprotein5,$orfprotein6;
 

my %iso_weight = (

    A => 71.037114

    R => 156.101111

    N => 114.042927

    D => 115.026943

    C => 103.009185

    E => 103.009185

    Q => 128.058578

    G => 57.021464

    H => 137.058912

    I => 113.084064

    L => 113.084064

    K => 128.094963

    M => 131.040485

    F => 147.068414

    P => 97.052764

    S => 87.032028

    T => 101.047679

    U => 150.95363

    W => 186.079313

    Y => 163.06332

    V => 99.068414

    #If B, Z or X -> total up weights of all the other peptides and give

# B, Z and X an average mass

    );
 

my %avg_weight = (

    A => 71.0779

    R => 156.1857

    N => 114.1026

    D => 115.0874

    C => 103.1429

    E => 129.114

    Q => 128.1292

    G => 57.0513

    H => 137.1393

    I => 113.1576

    L => 113.1576

    K => 128.1723

    M => 131.1961

    F => 147.1739

    P => 97.1152

    S => 87.0773

    T => 101.1039

    U => 150.0379

    W => 186.2099

    Y => 163.1733

    V => 99.1311

#If B, Z or X -> total up weights of all the other peptides and give

# B, Z and X an average mass

    );
 

# pick a type

my $mass = $query->param('mass');
 

# save frames

my @frames;

while (<>) {

    chomp;

    last unless /^ReadingFrame(\d+):(\S+)$/;

    # going to assume frames are in order, else handle $1

    push @frames, $2;

}

Open in new window

0
 
LVL 11

Expert Comment

by:climbgunks
ID: 24402757

I was assuming that your input file was what was being shown in your output3.jpg image

so I was parsing all that information.   If you had a file that included all the information in your output jpg, you can then just call   "script.pl  peptide_data_file"

I also didn't notice that each line of the reading frames ended in an _ , but that's easy to fix:
change the following two lines above:
1)  my $frame = "_";
2)     if ($frame eq "_") {


As for your additional problem, you'll have to give me more information.   Can you provide the full set of ReadingFrames and peptide groups that are output.  Not as a jpg, but as a text file?

This is sounding very much like a HW problem now.  So I'll stop providing code, but can provide guidance.  



0
 

Author Comment

by:StephenMcGowan
ID: 24402899
yeah sure.. sorry, thinking about it it would be best if i provide the output file for the picture above. Didn't think about it at the time, only attached the picture to try to visually show what i was trying to achieve.

the HTML output for the picture above is as follows:

print "Content-type:  text/html

<html>
<form id='form2' name='form2' method='post' action='save.pl'>
<title>ProteinDigester.pl</title>
<body>
<p>ReadingFrame1:$orfprotein1<br>
Reading Frame2:$orfprotein2<br>
Reading Frame3:$orfprotein3<br>
Reading Frame4:$orfprotein4<br>
Reading Frame5:$orfprotein5<br>
Reading Frame6:$orfprotein6<br>
<p>List of protein cleavage fragments, cleaved with enzyme $enzyme;</p>
<p>@fragments</p>
</body>
</form>
</html>

So Reading frames are the strings, and the chopped up list is @fragments

ALSO

At the very beginning of each reading frame is a N-terminus
And at the very end of the reading frame ( the _ you are refering to) is the C-terminus

These both have a mass of 18.0106 for monoisotopic mass and 18.0153 for average mass and should be included.

I will attach my script so you have an idea of what i'm doing... mass.pl is what i'm working on at the moment.


Script1.txt
ORFfinder.txt
mass.txt
DNALib.txt
ReadingFrameModules.txt
0
 

Author Comment

by:StephenMcGowan
ID: 24402965
So N-terminus and C-terminus have the same monoisotopic and average masses.

so all it would really mean is adding a "_" to the start of each Reading frame and then assigning these its masses in the list:

monoisotopic
_ => 18.0106

average
_=>  18.0153
0
 
LVL 11

Expert Comment

by:climbgunks
ID: 24402983

more of this is clear now...

you can do the following in the script I sent you

in place of the loop that reads the frames, you may be able to do:

my @frames = ($orfprotein1, $orfprotein2,  $orfprotein3,  $orfprotein4,  $orfprotein5,  $orfprotein6);

then the 2nd while loop will look something like

foreach (@fragments) {


To handle BZX, you'll have to modify the weight calculation group.   Split my foreach statemnt into an actual loop... if the input is BZX, make a note of it... wait until the rest of the weights have been calculated (and how many contributed to it), then divide the weight found by the number of contributors and assign it to the unknown.    

Btw, can you attach a text listing of all the reading frames, and all the fragments?

0
 

Author Comment

by:StephenMcGowan
ID: 24403016
I can't, the reading frames and fragments are all generated by a front end web-form HTML (Script1) i sent you.

Probably the best i can do is give you a test sequence and you visit http://biolinux.smith.man.ac.uk/~campus12sm/Script1.pl and give it a whirl:
Select Trypsin for the output i've generated.

>TESTSEQUENCE
atgtctgctgaagtcatccatcaggttgaagaagcacttgatacagatgagaaggagatg
ctgcgggatgttgctatagatgtggttccacctaatgtcagggaccttgcgctcgtcgag
ctggatattttacgggaaagaggtaagctgtctgtcggggacttggctgaactgctctac
agagtgaggcgatttgacctgctcaaacgtatcttgaagatggacagaaaagctgtggag
acccacctgctcaggaaccctcaccttgtttcggactatagagtgctgatggcagagatt
ggtgaggatttggataaatctgatgtgtcctcattaattttcctcatgaaggattacatg
ggccgaggcaagataagcaaggagaagagtttcttggaccttgtggttgagttggagaaa
ctaaatctggttgccccagatcaactggatttattagaaaaatgcctaaagaacatccac
agaatagacctgaagacaaaaatccagaagtacaagcagtctgttcaaggagcagggaca
agttacaggaatgttctccaagcagcaatccaaaagagtctcaaggatccttcaaataac
ttcaggctccataatgggagaagtaaagaacaaagacttaaggaacagcttggcgctcaa
caagaaccagtgaagaaatccattcaggaatcagaagcttttttgcctcagagcatacct
gaagagagatacaagatgaagagcaagcccctaggaatctgcctgataatcgattgcatt
ggcaatgagacagagcttcttcgagacaccttcacttccctgggctatgaagtccagaaa
ttcttgcatctcagtatgcatggtatatcccagattcttggccaatttgcctgtatgccc
gagcaccgagactacgacagctttgtgtgtgtcctggtgagccgaggaggctcccagagt
gtgtatggtgtggatcagactcactcagggctccccctgcatcacatcaggaggatgttc
atgggagattcatgcccttatctagcagggaagccaaagatgttttttattcagaactat
gtggtgtcagagggccagctggaggacagcagcctcttggaggtggatgggccagcgatg
aagaatgtggaattcaaggctcagaagcgagggctgtgcacagttcaccgagaagctgac
ttcttctggagcctgtgtactgcggacatgtccctgctggagcagtctcacagctcaccg
tccctgtacctgcagtgcctctcccagaaactgagacaagaaagaaaacgcccactcctg
gatcttcacattgaactcaatggctacatgtatgattggaacagcagagtttctgccaag
gagaaatattatgtctggctgcagcacactctgagaaagaaacttatcctctcctacaca
taa

I'll have a go at doing what you've said, but no doubt i'll be asking more coding questions, i'm not too strong at coding which is why i'm here, only really started out at it, so would appreciate any help to try to set up this programme.

Anyway, i'll go and have a look at what you've suggested.

Thanks,

Stephen
0
 

Author Comment

by:StephenMcGowan
ID: 24403096
quick question, do you know how i'd add "_" to the beginning of each ReadingFrame string?
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 11

Expert Comment

by:climbgunks
ID: 24403103


in general...

$frame = "_$frame";

or $frame = "_" . $frame;

of if you have an array of frames:

@frames = map { "_$frame" } @frames;

0
 

Author Comment

by:StephenMcGowan
ID: 24403110
so if $proteinorf1 = ACGTATCT_
this will become... $proteinorf1 = _ACGTATCT_?
0
 
LVL 11

Expert Comment

by:climbgunks
ID: 24403120
yes..

$proteinorf1 = "_$proteinorf1"

I think your original question has been answered sufficiently.
0
 

Author Comment

by:StephenMcGowan
ID: 24403188
Hi.. thanks for the assistance so far climbgunks, not sure if you're still willing to assist me with this or not, as i'm learning this as i go along.

I've saved the six string frames into an array called @frames. This is an array of the six reading frames together. The array @fragments is the list of the shorter peptide fragments

You say:

"then the 2nd while loop will look something like
foreach (@fragments) {"

Would i need to modify the code below to accomodate for @frames and @fragments?

Thanks, Much Appreciated.
# save frames

my @frames = ($orfprotein1, $orfprotein2,  $orfprotein3,  $orfprotein4,  $orfprotein5,  $orfprotein6);

#while (<>) {

#    chomp;

#    last unless /^ReadingFrame(\d+):(\S+)$/;

#    # going to assume frames are in order, else handle $1

#    push @frames, $2;

#}
 

# skip stuff

while (<>) {

    next if /^\s*$/;

    next if /^List/;

    last;

}
 

# now come fragments

my $frame = "";

my $framenum = 0;

my $peptide = 0;

do {

    chomp;

    if ($frame eq "") {

        $frame = shift @frames or die "ran out of frames\n";

        $framenum++;

        $peptide = 0;

    }
 

    die "fragment <$_> not found at in order at frame$framenum\nRest of

frame: $frame\n"

        unless $frame =~ s/^$_//;

Open in new window

0
 
LVL 11

Expert Comment

by:climbgunks
ID: 24403202

get rid of the #skip stuff block now

and the loop that looks like  

do {
....
} while (<>);

you can change to

foreach (@fragments) {
...
}

I'm still  not sure what to do about the _     Is it in your weights hash?  Is it in the fragments list?
If no to both, you'll have to special case the handling of it.  
0
 

Author Comment

by:StephenMcGowan
ID: 24403210
For "_" i've gone with:
    K => 128.094963

    M => 131.040485

    F => 147.068414

    P => 97.052764

    S => 87.032028

    T => 101.047679

    U => 150.95363

    W => 186.079313

    Y => 163.06332

    V => 99.068414

    _ => 18.0106
 

# If B, Z or X -> total up weights of all the other peptides and give

# B, Z and X an average mass

    );
 

my %avg_weight = (

    A => 71.0779

    R => 156.1857

    N => 114.1026

    D => 115.0874

    C => 103.1429

    E => 129.114

    Q => 128.1292

    G => 57.0513

    H => 137.1393

    I => 113.1576

    L => 113.1576

    K => 128.1723

    M => 131.1961

    F => 147.1739

    P => 97.1152

    S => 87.0773

    T => 101.1039

    U => 150.0379

    W => 186.2099

    Y => 163.1733

    V => 99.1311

    _ => 18.0153

Open in new window

0
 
LVL 11

Expert Comment

by:climbgunks
ID: 24403339


If i understand the _ thing now...   there is one C group at the beginning of a frame and one N group at the end of a frame.  Both have the same weight, so you can represent them as the same symbol.  

However, they don't show up in the fragment list?   If that's the case...I would not prepend it to the frames,
and use code like below to handle it.

Basically, it adds a C group to the first peptide in a chain, and the last.   If there in only fragment in a frame it gets both a C and an N added.   See changes involving the variable $add_group
my $frame = "_";

my $framenum = 0;

my $peptide = 0;

my $add_group;

foreach (@fragments) {

    chomp;

    if ($frame eq "_") {

        $frame = shift @frames or die "ran out of frames\n";

        $framenum++;

        $peptide = 0;

        $addgroup = 1;

    }
 

    die "fragment <$_> not found at in order at frame$framenum\nRest of fr\

ame: $frame\n"

        unless $frame =~ s/^$_//;
 

    $add_group++ if $frame eq "_";

    my $weight = calculate_weight($_, $type, $add_group);

    $add_group = 0;
 

    $peptide++;

    print "ReadingFrame$framenum $peptide $_ $weight\n";

}
 

exit;
 

sub calculate_weight {

    my ($seq, $type, $add_group) = @_;

    my @seq = split(//, $seq);

    my $weight;

    if ($type eq "mono-isotopic") {

        $weight = $add_group*$iso_weight{"_");

        $weight += $iso_weight{$_} foreach (@seq);

    }

    else {

        $weight = $add_group*$avg_weight{"_");

        $weight += $avg_weight{$_} foreach (@seq);

    }
 

    return $weight;

}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 24403396
ok i've just been going through the code and trying to understand what it's doing. I think i've got it.
The first section of code below takes each fragment from @fragments and tries to match it up against the reading frames in @frames

# now come fragments
my $frame = "";
my $framenum = 0;
my $peptide = 0;
foreach(@fragments){
    chomp;
    if ($frame eq "") {
        $frame = shift @frames or die "ran out of frames\n";
        $framenum++;
        $peptide = 0;
    }

    die "fragment <$_> not found at in order at frame$framenum\nRest of
frame: $frame\n"
        unless $frame =~ s/^$_//;

now once each fragment is assigned a number ($framenum): of the location it is contained in a reading frame, a weight is calculated in the sub-routine:

    my $weight = calculate_weight($_, $mass);

sub calculate_weight {
    my ($seq, $mass) = @_;
    my @seq = split(//, $seq);
    my $weight;
    if ($mass eq "mono-isotopic") {
        $weight += $iso_weight{$_} foreach (@seq);
    }
    else {
        $weight += $avg_weight{$_} foreach (@seq);
    }

    return $weight;
}

I'm trying to modify the weight calculation group.   I understand what you are saying when you say "Split my foreach statement into an actual loop... if the input is BZX, make a note of it... wait until the rest of the weights have been calculated (and how many contributed to it), then divide the weight found by the number of contributors and assign it to the unknown."
But not too sure how to go about implementing it with code.

Thanks again.  

0
 

Author Comment

by:StephenMcGowan
ID: 24403607
However, they don't show up in the fragment list?   If that's the case...I would not prepend it to the frames,
and use code like below to handle it.

It's ok.. i have modified my script for "_" to appear in the fragments list for the beginning of reading frames and at the ends, giving _ => (aweight) shouldn't be a problem now.
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use m…
There are many situations when we need to display the data in sorted order. For example: Student details by name or by rank or by total marks etc. If you are working on data driven based projects then you will use sorting techniques very frequently.…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

746 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

8 Experts available now in Live!

Get 1:1 Help Now