change the format

Posted on 2009-04-15
Last Modified: 2012-05-06
I have sequences in the format below:

>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA


The sequences are not on one line in my file.
I want a few changes made to these sequences.

I want the sequences in the format below:
 
  2  303
seq1  AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2      AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT



The modifications that take place are:

1) Remove the ">" symbol.
2) Bring each sequence onto one line: the name, then the sequence, all on one line. The same goes for the second sequence.
3) Remove the first three letters (ATG) and the last three letters (TAA) of each sequence. The last three will not be TAA all the time; I just want the last 3 letters of each sequence removed.
4) After that, remove the "-" symbols from the sequences. Before removing them, remove the letters at the corresponding positions in the other sequence.

Here's an example:

seq1    AAATATTGCATG----AATCTAGCTAGCTAGC
seq2    AAATATTCCATGTTTAATCTAGCTAGCTAGC

Here seq1 has four "-" symbols. Before removing them, look at the letters at the same positions in the other sequence; those are TTTA.
So first remove those 4 letters from seq2, then remove the 4 "-" symbols from seq1.

The first line of the output holds the number of sequences and the length of the sequences after all these modifications.

Here it is 2, because there are two sequences, and 303 is the length after all modifications.

Until now I have made all these changes manually. It takes too much time, so I need a program.
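The four steps described in the question can be sketched in plain Perl as follows. This is only a minimal sketch, not code from the thread: the helper names `parse_fasta` and `reformat_alignment` are hypothetical, it assumes every sequence has the same length before the gaps are removed, and it assumes the output layout is the name, two spaces, then the sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA text into ([name, sequence], ...),
# joining sequence lines that are wrapped over several lines.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        $line =~ s/\s+$//;                 # drop trailing whitespace / CR
        next if $line eq '';
        if ($line =~ /^>\s*(\S+)/) {       # header line such as ">seq1"
            push @records, [$1, ''];
        } elsif (@records) {
            $records[-1][1] .= $line;      # append a sequence line
        }
    }
    return @records;
}

# Hypothetical helper: apply steps 1-4 and return the output text.
sub reformat_alignment {
    my ($text) = @_;
    my @records = parse_fasta($text);
    die "no sequences found\n" unless @records;

    # Step 3: drop the first three and the last three letters.
    for my $r (@records) {
        die "sequence $r->[0] is too short\n" if length($r->[1]) < 6;
        $r->[1] = substr($r->[1], 3, length($r->[1]) - 6);
    }

    # Assumption: all sequences are aligned, so they have equal length.
    my $len = length $records[0][1];
    for my $r (@records) {
        die "sequences differ in length\n" if length($r->[1]) != $len;
    }

    # Step 4: keep only the columns where no sequence has a "-".
    my @keep = grep {
        my $col = $_;
        !grep { substr($_->[1], $col, 1) eq '-' } @records;
    } 0 .. $len - 1;

    # Header line: number of sequences and length after all modifications.
    my $out = sprintf "  %d  %d\n", scalar(@records), scalar(@keep);
    for my $r (@records) {
        my $seq = join '', map { substr($r->[1], $_, 1) } @keep;
        $out .= "$r->[0]  $seq\n";
    }
    return $out;
}

# Small made-up input; the second sequence is wrapped over two lines.
print reformat_alignment(">seq1\nATGAAAC--GTTTAA\n>seq2\nATGAAACCGG\nTTTGA\n");
```

For the small made-up input this prints a header of `  2  7` followed by `seq1  AAACGTT` and `seq2  AAACGTT`. To work with files, the text could be read from input.txt into a string and the returned string written to output.txt.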
Question by:shragi
 
LVL 39

Expert Comment

by:Adam314
ID: 24152176
Are there always 2 sequences per file?

If not, and you find "-" in a sequence, do you remove those characters from all sequences, or only from the next sequence?  Do you search all sequences for "-" and remove from the following ones, or only search the first sequence?
 

Author Comment

by:shragi
ID: 24162795
Yes, there are always two sequences in the file.

Let me clarify how to remove the "-" symbols:

seq1:   ATGCTGATCGTAGTCGATG--CTACGT
seq2:   ATGCT---CGTAGTCGATGCGCTACGT


First, seq2 has 3 "-" symbols, so remove the corresponding letters from seq1 (here GAT).
Similarly, seq1 has 2 "-" symbols, so remove the corresponding letters from seq2 (here CG).
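The rule clarified above (a "-" in either sequence removes that position from both) amounts to dropping whole alignment columns. A minimal sketch, assuming both sequences have the same length; the helper name `drop_gap_columns` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: remove every column in which at least one of the
# aligned sequences has a "-". All sequences must have the same length.
sub drop_gap_columns {
    my @seqs = @_;
    my @out  = ('') x @seqs;
    for my $col (0 .. length($seqs[0]) - 1) {
        my @chars = map { substr($_, $col, 1) } @seqs;
        next if grep { $_ eq '-' } @chars;      # gap somewhere: skip the column
        $out[$_] .= $chars[$_] for 0 .. $#seqs;
    }
    return @out;
}

my @clean = drop_gap_columns(
    'ATGCTGATCGTAGTCGATG--CTACGT',
    'ATGCT---CGTAGTCGATGCGCTACGT',
);
print "$_\n" for @clean;
```

For the two example sequences, GAT is dropped from seq1 and CG from seq2, and both lines come out as ATGCTCGTAGTCGATGCTACGT.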

 
LVL 85

Expert Comment

by:ozo
ID: 24166053

{local $/=">"; <DATA>;chomp(@s=<DATA>)}
s/\n...(.*).../\t$1/ for @s;
$\="\n";tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
$m =~ tr/ /a/;

print unpack $m,$_ for @s;
__DATA__
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA
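The snippet above is very compact, so here is a small, spelled-out sketch of the mask-and-unpack idea it appears to rely on: each sequence is turned into a mask with "x" at gap positions and " " elsewhere, the masks are combined with the bitwise string OR, and the result becomes an `unpack` template in which "a" keeps one character and "x" skips one. The helper name `gap_template` and the two short sequences are made up for illustration. It assumes the string form of `|`, which is what Perl uses when both operands are strings and the `bitwise` feature is not enabled.

```perl
use strict;
use warnings;

# Hypothetical helper: build an unpack template from aligned sequences.
# "a" = keep one character, "x" = skip one character (a gap column).
sub gap_template {
    my @seqs = @_;
    my $mask = '';
    for my $s (@seqs) {
        (my $m = $s) =~ tr/-/x/;   # gaps become "x"
        $m =~ tr/x/ /c;            # everything else becomes " "
        $mask |= $m;               # string OR: "x" | " " is "x"
    }
    (my $template = $mask) =~ tr/ /a/;
    return $template;
}

my @seqs     = ('AC--GT', 'ACTTG-');
my $template = gap_template(@seqs);

print "$template\n";                                   # aaxxax
print join('', unpack($template, $_)), "\n" for @seqs; # ACG, twice
```

The template is shared by all sequences, which is why a gap in one sequence removes the same position from the other.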

Author Comment

by:shragi
ID: 24167572
How about taking the sequences from a file and writing the output to another file?
 

Author Comment

by:shragi
ID: 24168064
It's not working for me....
 

Author Comment

by:shragi
ID: 24168148
I mean the code correctly removes the "-" symbols, but I cannot see the number of sequences and the sequence length at the beginning:

 2  303

This is not printed.

2 is the number of sequences and 303 is the length of the sequence.

This should be printed at the top, above both sequences.

It would also be very helpful if the sequences were read from an input file such as input.txt and the result written to another file, output.txt.
 
LVL 85

Expert Comment

by:ozo
ID: 24170655
open INPUT,"<input.txt" or die "input.txt $!";
{local $/=">"; <INPUT>;chomp(@s=<INPUT>)}
s/\n...(.*).../\t$1/ for @s;
tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
open STDOUT,">output.txt" or die "output.txt $!";
print @s."\t",($m =~ tr/ /a/)-length(($s[0]=~/(.*\s+\S)/)[0]),"\n";
print unpack $m,$_ for @s;
#this assumes that the "seq1  " and "seq2  " parts are the same length
 

Author Comment

by:shragi
ID: 24170855
There is a mistake in the count; I mean the reported sequence length is wrong.


Moreover, can each sequence be on one line instead of multiple lines?

I used the example given above as input. Its length is 303, but your code gives the length as 253.

I removed the tabs and used two spaces instead between the sequence name and the sequence:

<space><name><space><space><sequence>   all on one line

not as
<name><space><space><sequence>
<sequence>
<sequence>


Thank you for your code.
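One possible explanation for the wrong length and the multi-line output, offered here only as an assumption about the input file, is that each sequence in input.txt is wrapped over several lines, while the substitution in the code above only looks at the first line after the name. A minimal sketch of joining the wrapped lines of one record before that substitution runs; the helper name `unwrap_record` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: take one record as read with $/ = ">"
# ("name\nline1\nline2\n") and join its sequence lines into one line.
sub unwrap_record {
    my ($record) = @_;
    my ($name, $seq) = split /\n/, $record, 2;
    $seq = '' unless defined $seq;
    $seq  =~ tr/\r\n//d;             # delete line breaks inside the sequence
    $name =~ s/\s+$//;               # drop trailing whitespace after the name
    return "$name\n$seq\n";
}

print unwrap_record("seq1\nATGAAA\nCCGTAA\n");
```

This prints seq1 on one line and ATGAAACCGTAA on the next. Applying it to every element of `@s` right after the records are read would give the later steps a single-line sequence to work with.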

 
LVL 85

Expert Comment

by:ozo
ID: 24170914
I used the input you gave, and got the output you said you wanted.
Can you show me input for which it doesn't work, and the output you would want for it?
 

Author Comment

by:shragi
ID: 24171455
Yup.

The input I gave:
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA

this is the input and the output should be...

  2  303
seq1  AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2      AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT

The length of the sequence is 303.

The modifications that I regularly do manually are:

1) Remove the first 3 and last 3 letters.
 Here they are ATG and TAA. Your code removes only the first 3 letters; it does not remove the last 3 letters of a sequence.
2) Remove the ">" symbol. This is working.
3) Remove the "-" symbols from a sequence and remove the characters at the same positions from the other sequences.

example:

seq1    AAATATTGCATG----AATCTAGCTAGCTAGC
seq2    AAATATTCCATGTTTAATCTAGCTAGCTAGC

Here seq1 has four "-" symbols. Before removing them, look at the letters at the same positions in the other sequence; those are TTTA.
So first remove those 4 letters from seq2, then remove the 4 "-" symbols from seq1.

4) Print the number of sequences and the length of the sequences, separated by a space, at the top.
This is partially working with your code; the length returned is wrong.

 
LVL 85

Accepted Solution

by:
ozo earned 2000 total points
ID: 24191428
#when I run the code at the end of this comment


#I get this output

  2  303
seq1    AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2    AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT

which to me looks a lot like your
"this is the input and the output should be..."

  2  303
seq1  AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2      AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT

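# Read ">"-separated records: skip the text before the first ">", then load the rest.
{local $/=">"; <DATA>;chomp(@s=<DATA>)}
# Turn the newline after the name into a tab; drop the first and last 3 letters of the line.
s/\n...(.*).../\t$1/ for @s;
# Build a mask: "x" where any sequence has "-", " " elsewhere (names must be equally long).
tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
# Header: sequence count and kept length; tr/ /a/ also turns the mask into an unpack template.
print "  ".@s."  ",($m =~ tr/ /a/)-length(($s[0]=~/(.*\s+\S)/)[0]),"\n";
# Unpack each record with the template: "a" keeps a character, "x" skips it.
print unpack $m,$_ for @s;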
__DATA__
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA

 

Author Comment

by:shragi
ID: 24196029
OK, it works fine. Can you make a small modification to the above code?

The sequence length should be a multiple of 3 for me.

Here, luckily, the "-" symbols we remove always come in multiples of 3, but some sequences may look like this:

>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

Here we need to remove the "-" symbols and one extra letter, because the final sequence must be a multiple of 3, so whenever we remove something we remove in multiples of 3.
Here I removed the "-" symbols and the T next to them:


seq1 ATCAAGCGTAGC  

This is correct. Here I removed T, so remove TTT from seq2:

seq2 ATCAAAGTAGCT

When you divide seq1 and seq2 into sets of 3, you get the following:
 seq1   ATC  AAG  CGT  AGC
 seq2   ATC  AAA  GTA  GCT

Here there is no stop codon (TAA, TGA or TAG) in either sequence, so there is no problem.

But what if it had been done the other way, I mean if I remove the A instead of the T, which is also next to the "-" symbols?

>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

to

seq1 ATC  TAG  CGT  AGC

then seq2 is as below..

seq2    ATC  TAA  GTA  GCT

Here seq2 contains TAA, which is a stop codon. If we find such a thing, we remove 6 from seq1.
Why do we want to remove the stop codon?
Because when a stop codon is seen in the sequence, the remaining part of the sequence is ignored.
You can clearly see in the input sequence that the sequence ends with the stop codon TAA, i.e., the sequence stops when a stop codon is seen, so we do not want a stop codon in the middle.

And one more thing:

seq3  ATG TAC GAT AAA TGC ATC GAT CGA TCG   // this is valid; it has TAA, but not as a set
seq4  ATG TAC GAT TAA TGC ATC GAT CGA TCG   // invalid

In seq3 above, if you look at positions 9, 10 and 11, they are TAA, but when the sequence is divided into sets of 3 they are not counted as a stop codon.
In seq4, however, TAA forms a set. I mean, when you divide the sequence into sets of 3, you must not get a stop codon (TAA, TGA or TAG).
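The distinction drawn above between seq3 and seq4 is whether a stop codon appears in frame, i.e. as one of the sets of 3 counted from the first letter. A minimal sketch of such a check, with a hypothetical helper name `in_frame_stops`:

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based indexes of the codons that are
# stop codons (TAA, TAG or TGA), reading in sets of 3 from the first letter.
sub in_frame_stops {
    my ($seq) = @_;
    $seq =~ tr/ //d;                       # allow "ATG TAC ..." spacing
    my @stops;
    for (my $i = 0; $i + 3 <= length($seq); $i += 3) {
        my $codon = uc substr($seq, $i, 3);
        push @stops, $i / 3 if $codon =~ /^(?:TAA|TAG|TGA)$/;
    }
    return @stops;
}

for my $seq ('ATG TAC GAT AAA TGC ATC GAT CGA TCG',
             'ATG TAC GAT TAA TGC ATC GAT CGA TCG') {
    my @stops   = in_frame_stops($seq);
    my $verdict = @stops ? "invalid, stop codon at codon index @stops" : "valid";
    print "$verdict\n";
}
```

For seq3 this prints "valid", because its TAA straddles two sets; for seq4 it prints "invalid, stop codon at codon index 3".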

So now remove 6 from seq1:

>seq1 ATCT--AAGCGTAGC

to
 seq1 ATC TGT AGC    
 // here I removed 3 more from seq1; this can be from the left side or from the right side of the previously removed 3

Now seq2 is as below:
> seq2  ATCTTTAAAGTAGCT
 
to  
seq2 ATC TTA GCT

In this sequence there is no stop codon, so it is correct.


Can you make this small modification to the code?
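The thread does not contain an answer to this request. One possible reading of the multiple-of-3 rule, sketched below under stated assumptions, is to cut the alignment into sets of 3 starting at the first letter and drop every set in which either sequence has a "-". The helper name `drop_gapped_codons` is hypothetical, and the sketch does not implement the extra rule of removing 6 letters when a stop codon would appear; leftover letters beyond the last full set of 3 are simply dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every whole set of 3 aligned positions in which
# any sequence has a "-", so the result stays a multiple of three.
# Assumes all sequences have the same length and share one reading frame.
sub drop_gapped_codons {
    my @seqs = @_;
    my @out  = ('') x @seqs;
    for (my $i = 0; $i + 3 <= length($seqs[0]); $i += 3) {
        my @codons = map { substr($_, $i, 3) } @seqs;
        next if grep { /-/ } @codons;           # gap in this set: skip it
        $out[$_] .= $codons[$_] for 0 .. $#seqs;
    }
    return @out;
}

my @clean = drop_gapped_codons('ATCT--AAGCGTAGC', 'ATCTTTAAAGTAGCT');
print "$_\n" for @clean;
```

With the example pair this gives ATCAAGCGTAGC and ATCAAAGTAGCT, the same result as removing the "-" symbols together with the T next to them.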

Question has a verified solution.
