shragi asked:

change the format

I have sequences in the format below:

>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA


In my files the sequences are not on one line (each one is wrapped over several lines).
I want to make a few changes to these sequences.

I want the sequences in the format below:
 
  2  303
seq1  AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2      AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT



The modifications that take place are:

1) Remove the ">" symbol.
2) Bring each sequence onto one line: the name, then the sequence, all on one line, and the same for the second sequence.

3) Remove the first three letters (ATG) and the last three letters (TAA) of each sequence. It will not always be TAA; I just want to remove the last 3 letters of each sequence.

4) After that, remove the "-" symbols from the sequence, but before removing them, remove the letters at the corresponding positions in the other sequence.
Here's an example:

seq1    AAATATTGCATG----AATCTAGCTAGCTAGC
seq2    AAATATTCCATGTTTAATCTAGCTAGCTAGC

Here there are 4 "-" symbols. Before removing them, look at the letters at the same positions in the other sequence; those are TTTA.
So first remove those 4 letters from seq2, then remove the 4 "-" symbols from seq1.

The first line of the output holds the number of sequences and the length of each sequence after all these modifications.

Here it is 2, since there are two sequences, and 303 is the length after all modifications.

Until now I have made all these changes manually, but it takes too much time, so I need a program.
Adam314

Are there always 2 sequences per file?

If not, and you find "-" in a sequence, do you remove those characters from all sequences, or only from the next sequence?  Do you search all sequences for "-" and remove from the following ones, or only search the first sequence?
shragi (ASKER)

Yes, there are always two sequences in the file.

Let me clarify how to remove the "-" symbols:

seq1:   ATGCTGATCGTAGTCGATG--CTACGT
seq2:   ATGCT---CGTAGTCGATGCGCTACGT


First, seq2 has 3 "-" symbols, so remove the corresponding letters from seq1 (here GAT).
Similarly, there are 2 "-" symbols in seq1, so remove the corresponding letters from seq2 (here CG).
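
The column-wise rule described in this reply can be sketched in Perl as follows. This is only a minimal illustration, not code posted in the thread: the helper name `strip_gap_columns` is hypothetical, and it assumes the sequences are already aligned to the same length.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every column in which at least one
# aligned sequence has a "-". Assumes all sequences have equal length.
sub strip_gap_columns {
    my @seqs = @_;
    my @keep = grep {
        my $i    = $_;
        my $gaps = grep { substr($_, $i, 1) eq '-' } @seqs;
        !$gaps;                      # keep the column only if nobody has a gap
    } 0 .. length($seqs[0]) - 1;
    return map { my $s = $_; join '', map { substr($s, $_, 1) } @keep } @seqs;
}

my @out = strip_gap_columns(
    'ATGCTGATCGTAGTCGATG--CTACGT',
    'ATGCT---CGTAGTCGATGCGCTACGT',
);
print "$_\n" for @out;
```

With the two example sequences, GAT is dropped from seq1 and CG from seq2, so both outputs end up the same length.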

ozo

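# Read ">"-separated records from DATA; the first read discards the empty chunk before the first ">"
{local $/=">"; <DATA>;chomp(@s=<DATA>)}
# Replace the newline after the name with a tab and drop the first and last 3 letters (sequence must be on one line)
s/\n...(.*).../\t$1/ for @s;
# Build a mask on copies of the records: "x" wherever any sequence has "-", a space elsewhere
$\="\n";tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
# Turn the mask into an unpack template: "a" keeps one character, "x" skips one
$m =~ tr/ /a/;

print unpack $m,$_ for @s;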
__DATA__
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA
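
For readers unfamiliar with the trick used in the snippet above, here is a minimal sketch of the same idea on a tiny input. The helper name `gap_template` is hypothetical. It relies on Perl's bitwise string OR, where a space (0x20) ORed with "x" (0x78) gives "x", and on the `unpack` template codes `a` (take one byte) and `x` (skip one byte). It assumes the strings contain no literal "x" and no "-" outside of gaps, which may not hold for sequence names.

```perl
use strict;
use warnings;

# Build an unpack template from aligned strings: "x" (skip one byte) for any
# column holding a gap in at least one string, "a" (keep one byte) otherwise.
sub gap_template {
    my $mask = '';
    for my $s (@_) {
        (my $m = $s) =~ tr/-/x/;   # gaps become "x"
        $m =~ tr/x/ /c;            # every other character becomes a space
        $mask |= $m;               # bitwise string OR: "x" wins over " "
    }
    $mask =~ tr/ /a/;              # remaining spaces mean "keep this column"
    return $mask;
}

my @seqs = ('ACG--T', 'A-GCTT');
my $tpl  = gap_template(@seqs);                    # "axaxxa"
print join('', unpack($tpl, $_)), "\n" for @seqs;  # prints "AGT" twice
```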
shragi (ASKER)

How about reading the sequences from a file and writing the output to another file?
shragi (ASKER)

It's not working for me....
shragi (ASKER)

I mean the code correctly removes the "-" symbols, but I could not see the sequence name and the length of the sequence at the beginning:

 2  303

This is not printed.

2 is the number of sequences and 303 is the length of each sequence.

This should be printed above both sequences.

It would also be very helpful if the sequences were read from an input file such as input.txt and the result written to another file, output.txt.

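open INPUT,"<input.txt" or die "input.txt $!";
# Read ">"-separated records; the first read discards the empty chunk before the first ">"
{local $/=">"; <INPUT>;chomp(@s=<INPUT>)}
# Join name and sequence with a tab and drop the first and last 3 letters (sequence must be on one line)
s/\n...(.*).../\t$1/ for @s;
# Mask: "x" wherever any sequence has "-", a space elsewhere
tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
open STDOUT,">output.txt" or die "output.txt $!";
# Header: number of sequences, then the kept columns minus the name, separator, and trailing newline
print @s."\t",($m =~ tr/ /a/)-length(($s[0]=~/(.*\s+\S)/)[0]),"\n";
print unpack $m,$_ for @s;
#this assumes that the "seq1  " and "seq2  " parts are the same length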
shragi (ASKER)

There is a mistake in the count: the length reported is wrong.


Also, can each sequence be on one line instead of multiple lines?

I used the example given above as input. Its length is 303, but your code gives the length as 253.

I removed the tabs; instead I used two spaces between the name of the sequence and the sequence itself:

<space><name><space><space><sequence>   all on one line

not as
<name><space><space><sequence>
<sequence>
<sequence>


Thank you for your code.
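
One possible reason for the wrong length and the multi-line output is that the posted regex only matches within a single line, while the real files wrap each sequence over several lines. The sketch below is not the accepted solution from this thread; it is a longer, more explicit variant under stated assumptions. The helper names `read_fasta` and `format_alignment` are hypothetical, the header layout follows the "  2  303" example, the row layout follows the `<space><name><space><space><sequence>` request, and every sequence is assumed to be aligned to the same length and longer than 6 letters.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA text whose sequences may wrap over
# several lines; returns a list of [name, sequence] pairs.
sub read_fasta {
    my ($text) = @_;
    my @records;
    for my $chunk (split /^>/m, $text) {
        next unless $chunk =~ /\S/;          # skip the empty chunk before the first ">"
        my ($name, @lines) = split /\n/, $chunk;
        $name =~ s/^\s+|\s+$//g;
        (my $seq = join '', @lines) =~ s/\s+//g;   # join wrapped lines
        push @records, [$name, $seq];
    }
    return @records;
}

# Hypothetical helper: trim 3 letters from each end, drop gap columns,
# and return the header line plus one line per sequence.
sub format_alignment {
    my @records = @_;
    my @seqs = map { substr($_->[1], 3, length($_->[1]) - 6) } @records;
    my @keep = grep {
        my $i    = $_;
        my $gaps = grep { substr($_, $i, 1) eq '-' } @seqs;
        !$gaps;
    } 0 .. length($seqs[0]) - 1;
    my $out = sprintf "  %d  %d\n", scalar @records, scalar @keep;
    for my $r (0 .. $#records) {
        my $clean = join '', map { substr($seqs[$r], $_, 1) } @keep;
        $out .= " $records[$r][0]  $clean\n";
    }
    return $out;
}

# Small wrapped input for demonstration only.
my $demo = ">s1\nATGAAAC--\nGGTTAA\n>s2\nATGAAACTT\nGGTTGA\n";
print format_alignment(read_fasta($demo));
```

To use it with files, the whole of input.txt could be read into one string (for example with `local $/` set to undef) and the returned text printed to a filehandle opened on output.txt.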

I used the input you gave, and got the output you said you wanted.
Can you show me input for which it doesn't work, and the output you would want for it?
shragi (ASKER)

Yes.

The input I gave:
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA

This is the input, and the output should be:

  2  303
seq1  AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2      AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT

The length of the sequence is 303.

The modifications that I regularly do manually are:

1) Remove the first 3 and last 3 letters.
 Here they are ATG and TAA. Your code removes only the first 3 letters; it does not remove the last 3 letters of a sequence.
2) Remove the ">" symbol. This is working.
3) Remove the "-" symbols from a sequence and remove the characters at the same positions from the other sequences.

Example:

seq1    AAATATTGCATG----AATCTAGCTAGCTAGC
seq2    AAATATTCCATGTTTAATCTAGCTAGCTAGC

Here there are 4 "-" symbols. Before removing them, look at the letters at the same positions in the other sequence; those are TTTA.
So first remove those 4 letters from seq2, then remove the 4 "-" symbols from seq1.

4) Print the number of sequences and the length of the sequences, separated by a space, at the top.
This is partially working with your code, but the length returned is wrong.

ASKER CERTIFIED SOLUTION
ozo

shragi (ASKER)

OK, it works fine. Can you make a small modification to the above code?

The sequence length should be a multiple of 3 for me.

Here, luckily, the "-" symbols we want to remove come in multiples of 3, but some sequences may look like this:

>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

Here we need to remove the "-" symbols and one extra letter, because the final sequence length must be a multiple of 3; so whenever we remove, we remove in multiples of 3.
Here I removed the "-" symbols and the T next to them:


seq1 ATCAAGCGTAGC  

This is correct. Here I removed T--, so remove TTT from seq2:

seq2 ATCAAAGTAGCT

Here, when you divide seq1 and seq2 into sets of 3, you get the following:
 seq1   ATC  AAG  CGT  AGC
 seq2   ATC  AAA  GTA  GCT

Here there is no stop codon (TAA, TGA, or TAG) in either sequence, so there is no problem.

But what if it had been done the other way, that is, if I remove the A instead of the T, which is also next to the "-" symbols?

>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

to

seq1 ATC  TAG  CGT  AGC

then seq2 is as below..

seq2    ATC  TAA  GTA  GCT

Here seq2 contains the stop codon TAA.
If we find such a case, we remove 6 from seq1.
Why do we want to remove the stop codon?
Because when a stop codon is reached, the remaining part of the sequence is ignored.
You can clearly see in the input that each sequence ends with the stop codon TAA; that is, the sequence stops where there is a stop codon, so we do not want a stop codon in the middle.

And one more thing:

seq3  ATG TAC GAT AAA TGC ATC GAT CGA TCG   // this is valid: it has TAA, but not as a set
seq4  ATG TAC GAT TAA TGC ATC GAT CGA TCG   // invalid

In seq3 above, if you look at positions 9, 10, and 11, they are TAA, but when the sequence is read in sets of 3 they are not counted as a stop codon.
But in seq4 the TAA forms a set. I mean, if you divide the sequence into sets of 3, you must not get a stop codon (TAA, TGA, or TAG).
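
The distinction between seq3 and seq4 can be checked mechanically by reading the sequence in steps of 3 from the first letter. The following is a minimal sketch; the helper name `in_frame_stops` is hypothetical, and it assumes the sequence contains no spaces or "-" symbols.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based start offsets of stop codons
# (TAA, TAG, TGA) that fall on a codon boundary, reading from offset 0.
sub in_frame_stops {
    my ($seq) = @_;
    my @hits;
    for (my $i = 0; $i + 3 <= length($seq); $i += 3) {
        my $codon = uc substr($seq, $i, 3);
        push @hits, $i if $codon =~ /^(?:TAA|TAG|TGA)$/;
    }
    return @hits;
}

my @seq3 = in_frame_stops('ATGTACGATAAATGCATCGATCGATCG');  # empty: TAA is out of frame
my @seq4 = in_frame_stops('ATGTACGATTAATGCATCGATCGATCG');  # (9): TAA is the fourth codon
print "seq3 stops: @seq3\n";
print "seq4 stops: @seq4\n";
```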

So now remove 6 from seq1:

>seq1 ATCT--AAGCGTAGC

to
 seq1 ATC TGT AGC    
 // Here I removed 3 more from seq1; these can be on the left side or on the right side of the 3 removed previously.

now seq2 is as below...
> seq2  ATCTTTAAAGTAGCT
 
to  
seq2 ATC TTA GCT

There is no stop codon in this sequence, so it is correct.


Can you make this small modification in my code?
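
One way to keep the length a multiple of 3 is sketched below. It swaps in a simpler rule than the one described in this post: instead of choosing which neighbouring letter to drop, it removes the whole set of 3 columns whenever any sequence has a "-" inside it. The helper name `strip_gap_codons` is hypothetical, and the sketch assumes equal-length sequences that are already in frame, with the first 3 and last 3 letters removed. Under that rule every remaining set of 3 is left exactly as it was in the input, so it should not create a stop codon that was not already there, though an existing one would stay.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every block of 3 columns (one codon) in which
# any sequence contains a "-". A trailing partial codon is also dropped.
sub strip_gap_codons {
    my @seqs = @_;
    my @out  = ('') x @seqs;
    for (my $i = 0; $i + 3 <= length($seqs[0]); $i += 3) {
        my @codons = map { substr($_, $i, 3) } @seqs;
        next if grep { /-/ } @codons;          # skip the whole codon column
        $out[$_] .= $codons[$_] for 0 .. $#seqs;
    }
    return @out;
}

my @out = strip_gap_codons('ATCT--AAGCGTAGC', 'ATCTTTAAAGTAGCT');
print "$_\n" for @out;
```

On the example above this gives ATCAAGCGTAGC and ATCAAAGTAGCT, the same result as removing T-- from seq1 and TTT from seq2 by hand.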