Go Premium for a chance to win a PS4. Enter to Win

x
Solved

# A very long, and code-heavy question!...

Posted on 2009-05-15
Medium Priority
245 Views
Ok, before i ask my question, i think it's best to take a look at the picture i've uploaded below...

ok, the picture shows 6 reading frames of a dna sequence at the top and the list below is an array (@fragments) of those reading frames but have been cleaved by an enzyme.

First of all, i am trying to relate each of the cleaved fragments back to its original ReadingFrame... so in the diagram below:

SAEVIHQVEEALDTDEK came from ReadingFrame1 and should be tagged peptide one

so i'm trying to create a table:

.......                             ....          .......

etc etc

On top of this.. each letter has two mass weights assigned to it, an isotopic mass and an average mass. The end user will be able to select in an online form which mass spectrum they wish to look at.
The amino acid weights can be found here: http://www.matrixscience.com/help/aa_help.html

so i'm trying to create a counts loop which totals up the number of each amino acid in each fragment above and then depending on user choice....
IF monotopic weights to be used
assign each amino acid a monotopic weight add up all weights and divide by charge
ELSIF average masses to be used
assign each amino acid an average weight add up all weights and divide by charge

this will then fill the Mass/Charge column of the table above.

Any help with the coding on this or any ideas would be hugely appreciated.
output3.jpg
0
Question by:StephenMcGowan
• 10
• 7

LVL 11

Accepted Solution

Todd Mummert earned 2000 total points
ID: 24401473

Thanks..this was an interesting problem... This does most of what you want...you'll have to fill in the real data values, and figure out how to interact w/ the user.

I assumed all frames are used entirely...there are no gaps, or skipped sequences.

``````
#!/usr/bin/perl -w
use strict;

my %iso_weight = (
S => 71.233,
A => 34.22342,
# put real values here
);

my %avg_weight = (
# fill in with average weights as above
);

# pick a type
my \$type = "mono-isotopic";

# save frames
my @frames;
while (<>) {
chomp;
# going to assume frames are in order, else handle \$1
push @frames, \$2;
}

# skip stuff
while (<>) {
next if /^\s*\$/;
next if /^List/;
last;
}

# now come fragments
my \$frame = "";
my \$peptide = 0;
do {
chomp;
if (\$frame eq "") {
\$frame = shift @frames or die "ran out of frames\n";
\$peptide = 0;
}

unless \$frame =~ s/^\$_//;

my \$weight = calculate_weight(\$_, \$type);

\$peptide++;
} while (<>);

exit;

sub calculate_weight {
my (\$seq, \$type) = @_;
my @seq = split(//, \$seq);
my \$weight;
if (\$type eq "mono-isotopic") {
\$weight += \$iso_weight{\$_} foreach (@seq);
}
else {
\$weight += \$avg_weight{\$_} foreach (@seq);
}

return \$weight;
}
``````
0

Author Comment

ID: 24401961
Hi climbgunks,

Thanks for the feedback and the breakdown of code! i'm just going through this now (see code below).

The six reading frames at the top of the picture, i have in six strings, \$orfprotein1-6 etc.

Would i need to put this in an array going off your code?

As a further problem, in the weight calculations sections, i need to calculate to hypothetical average weight if the amino acid is B, Z or X... so     if B, Z or X then total up weights of all the other peptides and assign B, Z and X an average mass.
Any idea how i'd implement this?

Thanks

``````#!/usr/bin/perl -w
use CGI::Carp 'fatalsToBrowser';
# mass.pl
# Perl programme to find the both average and mono-isotopic M/Z ratios of peptides and return a table of this data

require 'module.pm';
use CGI;
use strict;
use warnings;
use DNALib;
my \$query = new CGI;

system './ORFfinder.pl', \$orfprotein1,\$orfprotein2,\$orfprotein3,\$orfprotein4,\$orfprotein5,\$orfprotein6;

my %iso_weight = (
A => 71.037114
R => 156.101111
N => 114.042927
D => 115.026943
C => 103.009185
E => 103.009185
Q => 128.058578
G => 57.021464
H => 137.058912
I => 113.084064
L => 113.084064
K => 128.094963
M => 131.040485
F => 147.068414
P => 97.052764
S => 87.032028
T => 101.047679
U => 150.95363
W => 186.079313
Y => 163.06332
V => 99.068414
#If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
);

my %avg_weight = (
A => 71.0779
R => 156.1857
N => 114.1026
D => 115.0874
C => 103.1429
E => 129.114
Q => 128.1292
G => 57.0513
H => 137.1393
I => 113.1576
L => 113.1576
K => 128.1723
M => 131.1961
F => 147.1739
P => 97.1152
S => 87.0773
T => 101.1039
U => 150.0379
W => 186.2099
Y => 163.1733
V => 99.1311
#If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
);

# pick a type
my \$mass = \$query->param('mass');

# save frames
my @frames;
while (<>) {
chomp;
# going to assume frames are in order, else handle \$1
push @frames, \$2;
}
``````
0

LVL 11

Expert Comment

ID: 24402757

I was assuming that your input file was what was being shown in your output3.jpg image

so I was parsing all that information.   If you had a file that included all the information in your output jpg, you can then just call   "script.pl  peptide_data_file"

I also didn't notice that each line of the reading frames ended in an _ , but that's easy to fix:
change the following two lines above:
1)  my \$frame = "_";
2)     if (\$frame eq "_") {

As for your additional problem, you'll have to give me more information.   Can you provide the full set of ReadingFrames and peptide groups that are output.  Not as a jpg, but as a text file?

This is sounding very much like a HW problem now.  So I'll stop providing code, but can provide guidance.

0

Author Comment

ID: 24402899
yeah sure.. sorry, thinking about it it would be best if i provide the output file for the picture above. Didn't think about it at the time, only attached the picture to try to visually show what i was trying to achieve.

the HTML output for the picture above is as follows:

print "Content-type:  text/html

<html>
<form id='form2' name='form2' method='post' action='save.pl'>
<title>ProteinDigester.pl</title>
<body>
<p>List of protein cleavage fragments, cleaved with enzyme \$enzyme;</p>
<p>@fragments</p>
</body>
</form>
</html>

So Reading frames are the strings, and the chopped up list is @fragments

ALSO

At the very beginning of each reading frame is a N-terminus
And at the very end of the reading frame ( the _ you are refering to) is the C-terminus

These both have a mass of 18.0106 for monoisotopic mass and 18.0153 for average mass and should be included.

I will attach my script so you have an idea of what i'm doing... mass.pl is what i'm working on at the moment.

Script1.txt
ORFfinder.txt
mass.txt
DNALib.txt
0

Author Comment

ID: 24402965
So N-terminus and C-terminus have the same monoisotopic and average masses.

so all it would really mean is adding a "_" to the start of each Reading frame and then assigning these its masses in the list:

monoisotopic
_ => 18.0106

average
_=>  18.0153
0

LVL 11

Expert Comment

ID: 24402983

more of this is clear now...

you can do the following in the script I sent you

in place of the loop that reads the frames, you may be able to do:

my @frames = (\$orfprotein1, \$orfprotein2,  \$orfprotein3,  \$orfprotein4,  \$orfprotein5,  \$orfprotein6);

then the 2nd while loop will look something like

foreach (@fragments) {

To handle BZX, you'll have to modify the weight calculation group.   Split my foreach statemnt into an actual loop... if the input is BZX, make a note of it... wait until the rest of the weights have been calculated (and how many contributed to it), then divide the weight found by the number of contributors and assign it to the unknown.

Btw, can you attach a text listing of all the reading frames, and all the fragments?

0

Author Comment

ID: 24403016
I can't, the reading frames and fragments are all generated by a front end web-form HTML (Script1) i sent you.

Probably the best i can do is give you a test sequence and you visit http://biolinux.smith.man.ac.uk/~campus12sm/Script1.pl and give it a whirl:
Select Trypsin for the output i've generated.

>TESTSEQUENCE
atgtctgctgaagtcatccatcaggttgaagaagcacttgatacagatgagaaggagatg
ctgcgggatgttgctatagatgtggttccacctaatgtcagggaccttgcgctcgtcgag
ctggatattttacgggaaagaggtaagctgtctgtcggggacttggctgaactgctctac
agagtgaggcgatttgacctgctcaaacgtatcttgaagatggacagaaaagctgtggag
acccacctgctcaggaaccctcaccttgtttcggactatagagtgctgatggcagagatt
ggtgaggatttggataaatctgatgtgtcctcattaattttcctcatgaaggattacatg
ggccgaggcaagataagcaaggagaagagtttcttggaccttgtggttgagttggagaaa
ctaaatctggttgccccagatcaactggatttattagaaaaatgcctaaagaacatccac
agaatagacctgaagacaaaaatccagaagtacaagcagtctgttcaaggagcagggaca
agttacaggaatgttctccaagcagcaatccaaaagagtctcaaggatccttcaaataac
ttcaggctccataatgggagaagtaaagaacaaagacttaaggaacagcttggcgctcaa
caagaaccagtgaagaaatccattcaggaatcagaagcttttttgcctcagagcatacct
gaagagagatacaagatgaagagcaagcccctaggaatctgcctgataatcgattgcatt
ggcaatgagacagagcttcttcgagacaccttcacttccctgggctatgaagtccagaaa
ttcttgcatctcagtatgcatggtatatcccagattcttggccaatttgcctgtatgccc
gagcaccgagactacgacagctttgtgtgtgtcctggtgagccgaggaggctcccagagt
gtgtatggtgtggatcagactcactcagggctccccctgcatcacatcaggaggatgttc
atgggagattcatgcccttatctagcagggaagccaaagatgttttttattcagaactat
gtggtgtcagagggccagctggaggacagcagcctcttggaggtggatgggccagcgatg
aagaatgtggaattcaaggctcagaagcgagggctgtgcacagttcaccgagaagctgac
ttcttctggagcctgtgtactgcggacatgtccctgctggagcagtctcacagctcaccg
tccctgtacctgcagtgcctctcccagaaactgagacaagaaagaaaacgcccactcctg
gatcttcacattgaactcaatggctacatgtatgattggaacagcagagtttctgccaag
gagaaatattatgtctggctgcagcacactctgagaaagaaacttatcctctcctacaca
taa

I'll have a go at doing what you've said, but no doubt i'll be asking more coding questions, i'm not too strong at coding which is why i'm here, only really started out at it, so would appreciate any help to try to set up this programme.

Anyway, i'll go and have a look at what you've suggested.

Thanks,

Stephen
0

Author Comment

ID: 24403096
quick question, do you know how i'd add "_" to the beginning of each ReadingFrame string?
0

LVL 11

Expert Comment

ID: 24403103

in general...

\$frame = "_\$frame";

or \$frame = "_" . \$frame;

of if you have an array of frames:

@frames = map { "_\$frame" } @frames;

0

Author Comment

ID: 24403110
so if \$proteinorf1 = ACGTATCT_
this will become... \$proteinorf1 = _ACGTATCT_?
0

LVL 11

Expert Comment

ID: 24403120
yes..

\$proteinorf1 = "_\$proteinorf1"

0

Author Comment

ID: 24403188
Hi.. thanks for the assistance so far climbgunks, not sure if you're still willing to assist me with this or not, as i'm learning this as i go along.

I've saved the six string frames into an array called @frames. This is an array of the six reading frames together. The array @fragments is the list of the shorter peptide fragments

You say:

"then the 2nd while loop will look something like
foreach (@fragments) {"

Would i need to modify the code below to accomodate for @frames and @fragments?

Thanks, Much Appreciated.
``````# save frames
my @frames = (\$orfprotein1, \$orfprotein2,  \$orfprotein3,  \$orfprotein4,  \$orfprotein5,  \$orfprotein6);
#while (<>) {
#    chomp;
#    # going to assume frames are in order, else handle \$1
#    push @frames, \$2;
#}

# skip stuff
while (<>) {
next if /^\s*\$/;
next if /^List/;
last;
}

# now come fragments
my \$frame = "";
my \$peptide = 0;
do {
chomp;
if (\$frame eq "") {
\$frame = shift @frames or die "ran out of frames\n";
\$peptide = 0;
}

frame: \$frame\n"
unless \$frame =~ s/^\$_//;
``````
0

LVL 11

Expert Comment

ID: 24403202

get rid of the #skip stuff block now

and the loop that looks like

do {
....
} while (<>);

you can change to

foreach (@fragments) {
...
}

I'm still  not sure what to do about the _     Is it in your weights hash?  Is it in the fragments list?
If no to both, you'll have to special case the handling of it.
0

Author Comment

ID: 24403210
For "_" i've gone with:
``````    K => 128.094963
M => 131.040485
F => 147.068414
P => 97.052764
S => 87.032028
T => 101.047679
U => 150.95363
W => 186.079313
Y => 163.06332
V => 99.068414
_ => 18.0106

# If B, Z or X -> total up weights of all the other peptides and give
# B, Z and X an average mass
);

my %avg_weight = (
A => 71.0779
R => 156.1857
N => 114.1026
D => 115.0874
C => 103.1429
E => 129.114
Q => 128.1292
G => 57.0513
H => 137.1393
I => 113.1576
L => 113.1576
K => 128.1723
M => 131.1961
F => 147.1739
P => 97.1152
S => 87.0773
T => 101.1039
U => 150.0379
W => 186.2099
Y => 163.1733
V => 99.1311
_ => 18.0153
``````
0

LVL 11

Expert Comment

ID: 24403339

If i understand the _ thing now...   there is one C group at the beginning of a frame and one N group at the end of a frame.  Both have the same weight, so you can represent them as the same symbol.

However, they don't show up in the fragment list?   If that's the case...I would not prepend it to the frames,
and use code like below to handle it.

Basically, it adds a C group to the first peptide in a chain, and the last.   If there in only fragment in a frame it gets both a C and an N added.   See changes involving the variable \$add_group
``````my \$frame = "_";
my \$peptide = 0;
foreach (@fragments) {
chomp;
if (\$frame eq "_") {
\$frame = shift @frames or die "ran out of frames\n";
\$peptide = 0;
}

ame: \$frame\n"
unless \$frame =~ s/^\$_//;

my \$weight = calculate_weight(\$_, \$type, \$add_group);

\$peptide++;
}

exit;

sub calculate_weight {
my (\$seq, \$type, \$add_group) = @_;
my @seq = split(//, \$seq);
my \$weight;
if (\$type eq "mono-isotopic") {
\$weight += \$iso_weight{\$_} foreach (@seq);
}
else {
\$weight += \$avg_weight{\$_} foreach (@seq);
}

return \$weight;
}
``````
0

Author Comment

ID: 24403396
ok i've just been going through the code and trying to understand what it's doing. I think i've got it.
The first section of code below takes each fragment from @fragments and tries to match it up against the reading frames in @frames

# now come fragments
my \$frame = "";
my \$peptide = 0;
foreach(@fragments){
chomp;
if (\$frame eq "") {
\$frame = shift @frames or die "ran out of frames\n";
\$peptide = 0;
}

frame: \$frame\n"
unless \$frame =~ s/^\$_//;

now once each fragment is assigned a number (\$framenum): of the location it is contained in a reading frame, a weight is calculated in the sub-routine:

my \$weight = calculate_weight(\$_, \$mass);

sub calculate_weight {
my (\$seq, \$mass) = @_;
my @seq = split(//, \$seq);
my \$weight;
if (\$mass eq "mono-isotopic") {
\$weight += \$iso_weight{\$_} foreach (@seq);
}
else {
\$weight += \$avg_weight{\$_} foreach (@seq);
}

return \$weight;
}

I'm trying to modify the weight calculation group.   I understand what you are saying when you say "Split my foreach statement into an actual loop... if the input is BZX, make a note of it... wait until the rest of the weights have been calculated (and how many contributed to it), then divide the weight found by the number of contributors and assign it to the unknown."
But not too sure how to go about implementing it with code.

Thanks again.

0

Author Comment

ID: 24403607
However, they don't show up in the fragment list?   If that's the case...I would not prepend it to the frames,
and use code like below to handle it.

It's ok.. i have modified my script for "_" to appear in the fragments list for the beginning of reading frames and at the ends, giving _ => (aweight) shouldn't be a problem now.
0

## Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Email validation in proper way is  very important validation required in any web pages. This code is self explainable except that Regular Expression which I used for pattern matching. I originally published as a thread on my website : http://wwwâ€¦