change the format

I have sequences in the format below:

>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA

The sequences are not on one line.
I want to make a few changes to these sequences.

I want the output in the format below:

2 303
seq1 AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2 AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT

The modifications are:

1) Remove the symbol >.
2) Bring each sequence onto one line: the name, then the sequence, everything on one line. The same goes for the second sequence.

3) Remove the first three letters (ATG) and the last three letters (TAA) of the sequence. It will not be TAA every time; I just want to remove the last 3 letters of the sequence.

4) After that, remove the "-" symbols from the sequence. Before removing them, remove the letters at the corresponding positions in the other sequence.
Here's an example:

seq1 AAATATTGCATG----AATCTAGCTAGCTAGC
seq2 AAATATTCCATGTTTAATCTAGCTAGCTAGC

Here there are 4 "-" symbols. Before removing them, look at the corresponding letters in the other sequence, I mean at the same positions. Those are TTTA.
So first remove those 4 letters from seq2, then remove the 4 "-" symbols from seq1.

On the first line we write the number of sequences and the length of the sequences after all these modifications.

Here it is 2, as there are two sequences, and 303 is the length after all modifications.

Until now I have made all these changes manually. Now I need a program, as it is taking too much time.

Adam314

Are there always 2 sequences per file?

If not, and you find "-" in a sequence, do you remove those characters from all sequences, or only from the next sequence? Do you search all sequences for "-" and remove from the following ones, or only search the first sequence?

shragi

ASKER

Yes, there are always two sequences in the file.

Let me clarify how to remove the "-" symbols.

seq1: ATGCTGATCGTAGTCGATG--CTACGT
seq2: ATGCT---CGTAGTCGATGCGCTACGT

First, seq2 has 3 "-" symbols, so remove the corresponding letters from seq1 (here, GAT).
Similarly, there are 2 "-" symbols in seq1, so remove the corresponding letters from seq2 (here, CG).
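
The rule described here (drop every position where either sequence has a "-") could be sketched in Perl roughly as follows. This is only a minimal sketch, not code from the thread: the helper name `drop_gap_columns` is hypothetical, and it assumes the sequences are aligned, i.e. have equal length.

```perl
use strict;
use warnings;

# Remove every column in which at least one sequence has a gap ('-').
# Assumption: all sequences have the same length (they are aligned).
sub drop_gap_columns {
    my @seqs = @_;
    my $len  = length $seqs[0];
    my @keep;    # indexes of the columns that are gap-free in every sequence
    for my $i (0 .. $len - 1) {
        push @keep, $i unless grep { substr($_, $i, 1) eq '-' } @seqs;
    }
    # Rebuild each sequence from the kept columns only.
    return map { my $s = $_; join '', map { substr($s, $_, 1) } @keep } @seqs;
}

my ($s1, $s2) = drop_gap_columns(
    'ATGCTGATCGTAGTCGATG--CTACGT',
    'ATGCT---CGTAGTCGATGCGCTACGT',
);
print "$s1\n$s2\n";
```

For the two example sequences above, both lines come out as ATGCTCGTAGTCGATGCTACGT (22 letters): GAT and the two "-" are gone from seq1, and the three "-" and CG are gone from seq2.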

ozo

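# Read ">"-separated records; the first read discards the text before the first ">".
{local $/=">"; <DATA>;chomp(@s=<DATA>)}
# Join name and sequence with a tab, dropping the first and last 3 letters (sequence on one line).
s/\n...(.*).../\t$1/ for @s;
# Build a mask $m with "x" wherever any sequence has "-" and " " elsewhere.
$\="\n";tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
# Turn the mask into an unpack template: "a" keeps a character, "x" skips one.
$m =~ tr/ /a/;

print unpack $m,$_ for @s;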
__DATA__
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA

shragi

ASKER

How about taking the sequences from a file and writing the output to another file?

shragi

ASKER

It's not working for me.

shragi

ASKER

I mean, the code correctly removes the "-" symbols, but I could not see the number of sequences and the length of the sequence at the beginning:

2 303

This is not printed.

2 is the number of sequences and 303 is the length of the sequence.

This should be printed at the top, above both sequences.

Also, it would be very helpful if the sequences were read from an input file such as input.txt

and the output were written to another file, output.txt.
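
A more spelled-out way to do the whole job with an input and an output file might look like the sketch below. It is not the code posted in this thread. The helper names `read_fasta`, `clean` and `write_records` are hypothetical, the sequences are assumed to be aligned (equal length), and the output layout (header line, then name, one space, sequence) is taken from the example in the question. The demo uses in-memory filehandles so it runs on its own.

```perl
use strict;
use warnings;

# Read FASTA-style records from a filehandle and return [name, sequence] pairs.
# Lines of a wrapped sequence are joined into one string.
sub read_fasta {
    my ($in) = @_;
    my @recs;
    while (my $line = <$in>) {
        $line =~ s/\s+\z//;                      # drop the line ending (LF or CRLF)
        next if $line eq '';
        if ($line =~ /^>\s*(\S+)/) { push @recs, [$1, ''] }
        elsif (@recs)              { $recs[-1][1] .= $line }
    }
    return @recs;
}

# Trim the first and last 3 letters, then drop every column that holds '-'
# in any sequence. Assumption: the sequences are aligned (equal length).
sub clean {
    my @recs = map { [ $_->[0], substr($_->[1], 3, -3) ] } @_;
    return () unless @recs;
    my @keep = grep {
        my $i = $_;
        !grep { substr($_->[1], $i, 1) eq '-' } @recs;
    } 0 .. length($recs[0][1]) - 1;
    $_->[1] = join '', (split //, $_->[1])[@keep] for @recs;
    return @recs;
}

# Header "count length", then one "name sequence" line per record.
sub write_records {
    my ($out, @recs) = @_;
    print {$out} scalar(@recs), ' ', length($recs[0][1]), "\n";
    print {$out} "$_->[0] $_->[1]\n" for @recs;
}

# Demo with made-up, wrapped input held in a string.
my $input  = ">seq1\nATGAAAC--\nCGTTAA\n>seq2\nATGAA-CTT\nCGATGA\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot read input: $!";
open my $out, '>', \$output or die "cannot write output: $!";
write_records($out, clean(read_fasta($in)));
close $out;
print $output;
```

With real files, the two `open` calls would name input.txt and output.txt instead of the string references. Because the header length is measured on the cleaned sequence itself, it does not depend on how the name and the sequence are separated.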

ozo

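open INPUT,"<input.txt" or die "input.txt $!";
# Read ">"-separated records; the first read discards the text before the first ">".
{local $/=">"; <INPUT>;chomp(@s=<INPUT>)}
# Join name and sequence with a tab, dropping the first and last 3 letters (sequence on one line).
s/\n...(.*).../\t$1/ for @s;
# Build a mask $m with "x" wherever any sequence has "-" and " " elsewhere.
tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
open STDOUT,">output.txt" or die "output.txt $!";
# Header: number of sequences, then the kept positions minus the name, the tab and the trailing newline.
print @s."\t",($m =~ tr/ /a/)-length(($s[0]=~/(.*\s+\S)/)[0]),"\n";
print unpack $m,$_ for @s;
#this assumes that the "seq1 " and "seq2 " parts are the same length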

shragi

ASKER

There was a mistake in the count; I mean the length that is printed is wrong.

Also, can each sequence be on one line instead of multiple lines?

I used the example given above as input. Its length is 303, but your code gives the length as 253.

I removed the tabs; instead I used two spaces between the name of the sequence and the sequence itself.

<space><name><space><space><sequence> all on one line,

not as
<name><space><space><sequence>
<sequence>
<sequence>

Thank you for your code.

ozo

I used the input you gave, and got the output you said you wanted.
Can you show me input for which it doesn't work, and the output you would want for it?

shragi

ASKER

Yup.

The input I gave:
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA

That is the input, and the output should be:

2 303
seq1 AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2 AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT

The length of the sequence is 303.

The modifications that I regularly do manually are:

1) Remove the first 3 and last 3 letters.
Here they are ATG and TAA. Your code removes only the first 3 letters; it does not remove the last 3 letters from a sequence.
2) Remove the ">" symbol. This is working.
3) Remove the "-" symbols from a sequence and remove the corresponding characters, at the same positions, from the other sequences.

Example:

seq1 AAATATTGCATG----AATCTAGCTAGCTAGC
seq2 AAATATTCCATGTTTAATCTAGCTAGCTAGC

Here there are 4 "-" symbols. Before removing them, look at the corresponding letters in the other sequence, I mean at the same positions. Those are TTTA.
So first remove those 4 letters from seq2, then remove the 4 "-" symbols from seq1.

4) Print the number of sequences and the length of the sequences, separated by a space, at the top of everything.
This is partially working with your code; the length returned is wrong.

ASKER CERTIFIED SOLUTION

ozo

This solution is only available to members.

shragi

ASKER

OK, it works fine. Can you make a small modification to the above code?

The sequence length should be a multiple of 3 for me.

Here, luckily, the "-" symbols come in multiples of 3 whenever we want to remove them, but some sequences may look like this:

>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

Here we need to remove the "-" symbols and one extra letter, because in the end our sequence length must be a multiple of 3, so whenever we remove, we remove in multiples of 3.
Here I removed the "-" symbols and the T which is next to them:

seq1 ATCAAGCGTAGC

This is correct. Here I removed T, so remove TTT from seq2:

seq2 ATCAAAGTAGCT

When you divide seq1 and seq2 into sets of 3, you get:
seq1 ATC AAG CGT AGC
seq2 ATC AAA GTA GCT

There is no stop codon (TAA, TGA or TAG) in either sequence, so there is no problem.

But what if it is done the other way, I mean, if I remove the A instead of the T, which is also next to the "-" symbols?

>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

to

seq1 ATC TAG CGT AGC

Then seq2 is as below:

seq2 ATC TAA GTA GCT

Here seq2 contains TAA, which is a stop codon. If we find one, we remove 6 from seq1.
Why do we want to remove the stop codon?
Because when a stop codon is reached, the remaining part of the sequence is ignored.
You can clearly see in the input that the sequence ends with the stop codon TAA, i.e. the sequence stops when it reaches a stop codon, so we do not want a stop codon in the middle.

One more thing:

seq3 ATG TAC GAT AAA TGC ATC GAT CGA TCG // this is valid: it has TAA, but not as a set
seq4 ATG TAC GAT TAA TGC ATC GAT CGA TCG // invalid

In seq3 above, if you look at positions 9, 10 and 11, they are TAA, but read in sets of 3 they do not count as a stop codon.
In seq4, however, TAA appears as a set. I mean, if you divide the sequence into sets of 3, you must not get a stop codon (TAA, TGA or TAG).
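
The seq3/seq4 distinction, a stop codon only counts when it falls exactly on a set of 3, could be checked with something like the sketch below. The function name `has_inframe_stop` is hypothetical, and the sketch assumes the reading frame starts at the first letter of the sequence.

```perl
use strict;
use warnings;

# True if TAA, TGA or TAG occurs as a complete set of 3 when the sequence is
# read from its first letter. The same letters straddling two sets do not count.
sub has_inframe_stop {
    my ($seq) = @_;
    $seq =~ tr/ //d;          # allow input written as "ATG TAC ..."
    $seq = uc $seq;
    for my $codon ($seq =~ /(...)/g) {
        return 1 if $codon =~ /^(?:TAA|TGA|TAG)$/;
    }
    return 0;
}

print has_inframe_stop('ATG TAC GAT AAA TGC ATC GAT CGA TCG') ? "invalid\n" : "valid\n";
print has_inframe_stop('ATG TAC GAT TAA TGC ATC GAT CGA TCG') ? "invalid\n" : "valid\n";
```

The first call (seq3) prints "valid" and the second (seq4) prints "invalid".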

So now remove 6 from seq1:

>seq1 ATCT--AAGCGTAGC

to
seq1 ATC TGT AGC
// Here I removed 3 more from seq1; these can be on the left or on the right side of the 3 removed previously.

Now seq2 is as below:
> seq2 ATCTTTAAAGTAGCT

to
seq2 ATC TTA GCT

There is no stop codon in this sequence, so it is correct.

Can you make this small modification in my code?
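
One simple reading of the multiple-of-3 rule is to treat the alignment as sets of 3 columns and drop a whole set whenever any sequence has a "-" in it. The sketch below does only that. It is an assumption-laden illustration, not the accepted solution: the name `drop_gap_codons` is hypothetical, the reading frame is assumed to start at the first letter, and a run of "-" that straddles two sets removes both sets. It does not implement the extra "remove 6 if a stop codon appears" step.

```perl
use strict;
use warnings;

# Drop whole sets of 3 columns whenever any sequence has '-' inside the set,
# so every sequence keeps a length that is a multiple of 3.
# Assumptions: aligned sequences of equal length, frame starting at letter 1.
# Any trailing 1 or 2 letters that do not fill a set are discarded.
sub drop_gap_codons {
    my @codons = map { [ /(...)/g ] } @_;    # each sequence as a list of triplets
    my @keep = grep {
        my $i = $_;
        !grep { $_->[$i] =~ /-/ } @codons;
    } 0 .. $#{ $codons[0] };
    return map { join '', @{$_}[@keep] } @codons;
}

my @out = drop_gap_codons('ATCT--AAGCGTAGC', 'ATCTTTAAAGTAGCT');
print "$_\n" for @out;
```

For the example above this prints ATCAAGCGTAGC and ATCAAAGTAGCT, the same result as removing T and the two "-" from seq1 and TTT from seq2. Since the frame never shifts, a stop codon can only appear in the output if it was already a complete set of 3 in the input.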