shragi
asked on
change the format
I have sequences in the format below:
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT ---GATATTT GCTGTTATTA TAGGCCTGTG TGGTTGTGCA CCACCCAAGG CCGAAGAAAC TCAATCTGCT ACGAGTACGA AAGCCGAGTC TTCTAATGCG GGTCAGAGCG GAAATCGATA T--------- CCACCGGTGA AGATGAATTT TGAAAAAGTG TTTACTCCTA GTTTTTGTAA AGGTTTGCAA GATCAGCAAT CAAAAATTGA AGAACTTTCG GCAGACTTGG AGAGGTTTGA GGGTCAGGAA TTGAAGTCAA ATTATGGAAC ATATTCCGAC AAAAAGGACC ATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTT TTTGATATTT GCTGTTATTA TAGGCCTGTG TGGTTGTGCA CCACCCAAGG CCGAAGGAAC TAAATCTGGT ATGGGAACGC AAGCCGAGTC TTCTAATGCG GGTCAGAGAG GAAGTCGAAA CAATGGCATC TCATCGGCGG AGTTGAACTT TGACAGAATT T---CTCCTG GTTTTATTAA AGGTTTGCGT GAAGATCAAT CAGGATATGA AAAAGTTGGA GAGATCTTGA AGAGGGCTCA GGATCAGCAA TTGAAGTCAA ATTATGGAAA ATATTCCGAC AAAAAGGCCC ATAATTAA
The sequences are not on one line.
I want to make a few changes to these sequences.
I want the sequences in the format below:
2 303
seq1 AAACCGTCTCCGTTCATTGTTTTGAT ATTTGCTGTT ATTATAGGCC TGTGTGGTTG TGCACCACCC AAGGCCGAAG AAACTCAATC TGCTACGAGT ACGAAAGCCG AGTCTTCTAA TGCGGGTCAG AGCGGAAATC GATATCCACC GGTGAAGATG AATTTTGAAA AAGTGTCTCC TAGTTTTTGT AAAGGTTTGC AAGATCAGCA ATCAAAAATT GAAGAACTTT CGGCAGACTT GGAGAGGTTT GAGGGTCAGG AATTGAAGTC AAATTATGGA ACATATTCCG ACAAAAAGGA CCATAAA
seq2 AAACTGTCTCTGTTCATTATTTTGAT ATTTGCTGTT ATTATAGGCC TGTGTGGTTG TGCACCACCC AAGGCCGAAG GAACTAAATC TGGTATGGGA ACGCAAGCCG AGTCTTCTAA TGCGGGTCAG AGAGGAAGTC GAAACTCATC GGCGGAGTTG AACTTTGACA GAATTTCTCC TGGTTTTATT AAAGGTTTGC GTGAAGATCA ATCAGGATAT GAAAAAGTTG GAGAGATCTT GAAGAGGGCT CAGGATCAGC AATTGAAGTC AAATTATGGA AAATATTCCG ACAAAAAGGC CCATAAT
The modifications are:
1) Remove the ">" symbol.
2) Bring each sequence onto one line: the name, then the sequence, everything on one line. The same goes for the second sequence.
3) Remove the first three letters (ATG) and the last three letters (TAA) of each sequence. The last three will not always be TAA; I just want the last 3 letters removed, whatever they are.
4) After that, remove the "-" symbols from the sequence, but before removing them, remove the letters at the corresponding positions in the other sequence.
Here's an example:
seq1 AAATATTGCATG----AATCTAGCTAGCTAGC
seq2 AAATATTCCATGTTTAATCTAGCTAGCTAGC
Here seq1 has 4 "-" symbols. Before removing them, look at the letters at the same positions in the other sequence; those are TTTA.
So first remove those 4 letters from seq2, then remove the 4 "-" symbols from seq1.
On the first line we write the number of sequences and the length of the sequences after all these modifications.
Here it is 2, as there are two sequences, and 303 is the length after all modifications.
Until now I have made all these changes manually, but it takes too much time, so I need a program.
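Steps 1) to 3) above could be sketched in Perl roughly as follows. This is only an illustration, not code from the thread: the helper names read_fasta and trim_ends and the short toy sequences are hypothetical, and the sketch assumes that spaces inside a sequence carry no meaning and can be dropped.

```perl
use strict;
use warnings;

# Parse FASTA text into [name, sequence] pairs, joining wrapped lines.
# Assumption: whitespace inside a sequence is not significant.
sub read_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        if ($line =~ /^>\s*(\S+)/) {
            push @records, [$1, ''];          # new record, ">" dropped
        } elsif (@records) {
            $line =~ s/\s+//g;
            $records[-1][1] .= $line;         # append to the current sequence
        }
    }
    return @records;
}

# Drop the first three and the last three letters, whatever they are.
sub trim_ends {
    my ($seq) = @_;
    return length($seq) > 6 ? substr($seq, 3, length($seq) - 6) : '';
}

# Toy input (made up for illustration), wrapped over two lines per sequence.
my $fasta = ">seq1\nATGAAACCG\nTCTTAA\n>seq2\nATGAAACTG\nTCTTGA\n";
for my $rec (read_fasta($fasta)) {
    print "$rec->[0]  ", trim_ends($rec->[1]), "\n";
}
```

Trimming by position rather than by matching ATG or TAA covers the case where the last three letters are not TAA.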
ASKER
Yes, there are always two sequences in the file.
Let me clarify how to remove the "-" symbols:
seq1: ATGCTGATCGTAGTCGATG--CTACGT
seq2: ATGCT---CGTAGTCGATGCGCTACGT
First, seq2 has 3 "-" symbols, so remove the corresponding letters from seq1 (here GAT).
Similarly, there are 2 "-" symbols in seq1, so remove the corresponding letters from seq2 (here CG).
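The rule described here amounts to deleting every column in which either sequence has a "-". A minimal Perl sketch of that idea, using the pair above written without spaces, might look like this. The function name strip_gap_columns is hypothetical, and the sketch assumes the sequences are already aligned to the same length.

```perl
use strict;
use warnings;

# Remove every column that holds "-" in at least one of the sequences.
# Assumption: all sequences passed in have the same length.
sub strip_gap_columns {
    my @seqs = @_;
    my %gap;                                  # 0-based column => 1 if any "-" there
    for my $seq (@seqs) {
        while ($seq =~ /-/g) { $gap{ pos($seq) - 1 } = 1 }
    }
    my @out;
    for my $seq (@seqs) {
        my @chars = split //, $seq;
        push @out, join '', map { $chars[$_] } grep { !$gap{$_} } 0 .. $#chars;
    }
    return @out;
}

my ($s1, $s2) = strip_gap_columns('ATGCTGATCGTAGTCGATG--CTACGT',
                                  'ATGCT---CGTAGTCGATGCGCTACGT');
# Header line: number of sequences and their common length after removal.
print "2 ", length($s1), "\n";
print "seq1  $s1\n";
print "seq2  $s2\n";
```

Because whole columns are dropped, both sequences always come out with the same length, which is the number wanted on the header line.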
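# read ">"-delimited records; the first read discards the empty chunk before the first ">"
{local $/=">"; <DATA>;chomp(@s=<DATA>)}
# join name and sequence with a tab, dropping the first and last 3 letters of the sequence line
s/\n...(.*).../\t$1/ for @s;
# build a mask $m: "x" in every column where any sequence has "-", a space elsewhere
$\="\n";tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
# turn the mask into an unpack template: "a" keeps a character, "x" skips one
$m =~ tr/ /a/;
print unpack $m,$_ for @s;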
__DATA__
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT
>seq2
ATGAAACTGTCTCTGTTCATTATTTT
ASKER
How about taking the sequences from a file and writing the output to another file?
ASKER
It's not working for me....
ASKER
I mean the code correctly removes the "-" symbols, but I could not see the number of sequences and the sequence length at the beginning:
2 303
This is not printed.
2 is the number of sequences and 303 is the length of the sequence.
This should be printed above both sequences.
It would also be very helpful if the sequences were read from an input file such as input.txt
and the output were written to another file, output.txt.
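# read ">"-delimited records from input.txt
open INPUT,"<input.txt" or die "input.txt $!";
{local $/=">"; <INPUT>;chomp(@s=<INPUT>)}
s/\n...(.*).../\t$1/ for @s;
tr/-/x/, tr/x/ /c, $m|=$_ for @m=@s;
# everything printed from here on goes to output.txt
open STDOUT,">output.txt" or die "output.txt $!";
# header: number of sequences, then the count of kept columns minus the name-and-separator prefix
print @s."\t",($m =~ tr/ /a/)-length(($s[0]=~/(.*\s+\S)/)[0]) ,"\n";
print unpack $m,$_ for @s;
#this assumes that the "seq1 " and "seq2 " parts are the same length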
ASKER
There is a mistake in the count; I mean the reported sequence length is wrong.
Also, can each sequence be on one line instead of multiple lines?
I used the example given above as input. Its length is 303, but your code gives the length as 253.
I removed the tabs and used two spaces instead between the sequence name and the sequence itself:
<space><name><space><space><sequence> all on one line,
not as
<name><space><space><sequence>
<sequence>
<sequence>
Thank you for your code.
I used the input you gave, and got the output you said you wanted.
Can you show me input for which it doesn't work, and the output you would want for it?
ASKER
Yup.
The input I gave:
>seq1
ATGAAACCGTCTCCGTTCATTGTTTT ---GATATTT GCTGTTATTA TAGGCCTGTG TGGTTGTGCA CCACCCAAGG CCGAAGAAAC TCAATCTGCT ACGAGTACGA AAGCCGAGTC TTCTAATGCG GGTCAGAGCG GAAATCGATA T--------- CCACCGGTGA AGATGAATTT TGAAAAAGTG TTTACTCCTA GTTTTTGTAA AGGTTTGCAA GATCAGCAAT CAAAAATTGA AGAACTTTCG GCAGACTTGG AGAGGTTTGA GGGTCAGGAA TTGAAGTCAA ATTATGGAAC ATATTCCGAC AAAAAGGACC ATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTT TTTGATATTT GCTGTTATTA TAGGCCTGTG TGGTTGTGCA CCACCCAAGG CCGAAGGAAC TAAATCTGGT ATGGGAACGC AAGCCGAGTC TTCTAATGCG GGTCAGAGAG GAAGTCGAAA CAATGGCATC TCATCGGCGG AGTTGAACTT TGACAGAATT T---CTCCTG GTTTTATTAA AGGTTTGCGT GAAGATCAAT CAGGATATGA AAAAGTTGGA GAGATCTTGA AGAGGGCTCA GGATCAGCAA TTGAAGTCAA ATTATGGAAA ATATTCCGAC AAAAAGGCCC ATAATTAA
This is the input, and the output should be:
2 303
seq1 AAACCGTCTCCGTTCATTGTTTTGAT ATTTGCTGTT ATTATAGGCC TGTGTGGTTG TGCACCACCC AAGGCCGAAG AAACTCAATC TGCTACGAGT ACGAAAGCCG AGTCTTCTAA TGCGGGTCAG AGCGGAAATC GATATCCACC GGTGAAGATG AATTTTGAAA AAGTGTCTCC TAGTTTTTGT AAAGGTTTGC AAGATCAGCA ATCAAAAATT GAAGAACTTT CGGCAGACTT GGAGAGGTTT GAGGGTCAGG AATTGAAGTC AAATTATGGA ACATATTCCG ACAAAAAGGA CCATAAA
seq2 AAACTGTCTCTGTTCATTATTTTGAT ATTTGCTGTT ATTATAGGCC TGTGTGGTTG TGCACCACCC AAGGCCGAAG GAACTAAATC TGGTATGGGA ACGCAAGCCG AGTCTTCTAA TGCGGGTCAG AGAGGAAGTC GAAACTCATC GGCGGAGTTG AACTTTGACA GAATTTCTCC TGGTTTTATT AAAGGTTTGC GTGAAGATCA ATCAGGATAT GAAAAAGTTG GAGAGATCTT GAAGAGGGCT CAGGATCAGC AATTGAAGTC AAATTATGGA AAATATTCCG ACAAAAAGGC CCATAAT
The length of the sequence is 303.
The modifications that I regularly do manually are:
1) Remove the first 3 and last 3 letters.
Here they are ATG and TAA. Your code removes only the first 3 letters; it does not remove the last 3 letters of a sequence.
2) Remove the ">" symbol. This is working.
3) Remove the "-" symbols from a sequence and remove the characters at the same positions in the other sequence.
Example:
seq1 AAATATTGCATG----AATCTAGCTAGCTAGC
seq2 AAATATTCCATGTTTAATCTAGCTAGCTAGC
Here seq1 has 4 "-" symbols. Before removing them, look at the letters at the same positions in the other sequence; those are TTTA.
So first remove those 4 letters from seq2, then remove the 4 "-" symbols from seq1.
4) Print the number of sequences and the length of the sequences, separated by a space, above everything else.
This is partially working with your code, but the length returned is wrong.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
OK, it works fine. Can you make a small modification to the above code?
The sequence length must be a multiple of 3 for me.
Here, luckily, the "-" symbols come in multiples of 3 wherever we want to remove them, but some sequences may look like this:
>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT
Here we need to remove the "-" symbols and one extra letter, because in the end our sequence length must be a multiple of 3, so whenever we remove, we remove in multiples of 3.
Here I removed the "-" symbols and the T next to them:
seq1 ATCAAGCGTAGC
This is correct. Since I removed T-- from seq1, remove TTT from seq2:
seq2 ATCAAAGTAGCT
When you divide seq1 and seq2 into sets of 3, you get:
seq1 ATC AAG CGT AGC
seq2 ATC AAA GTA GCT
There is no stop codon (TAA, TGA or TAG) in either sequence, so no problem.
But what if it had been done the other way, that is, if I removed the A, which is also next to the "-" symbols, instead of the T?
>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT
to
seq1 ATC TAG CGT AGC
Then seq2 is:
seq2 ATC TAA GTA GCT
Here seq2 contains TAA, which is a stop codon. If we find such a thing, we remove 6 from seq1.
Why do we want to remove the stop codon?
Because when a stop codon is encountered, the rest of the sequence is ignored.
You can see in the input that each sequence ends with the stop codon TAA; that is, the sequence stops when it reaches a stop codon, so we do not want a stop codon in the middle.
One more thing:
seq3 ATG TAC GAT AAA TGC ATC GAT CGA TCG // valid: it contains TAA, but not as a set
seq4 ATG TAC GAT TAA TGC ATC GAT CGA TCG // invalid
In seq3, positions 9, 10 and 11 are TAA, but when the sequence is read in sets of 3 they are not counted as a stop codon.
In seq4, however, TAA falls exactly on a set. I mean that if you divide the sequence into sets of 3, you must not get a stop codon (TAA, TGA or TAG).
So now remove 6 from seq1:
>seq1 ATCT--AAGCGTAGC
to
seq1 ATC TGT AGC
// Here I removed 3 more from seq1; these can be on the left side or the right side of the 3 removed previously.
Now seq2 is:
>seq2 ATCTTTAAAGTAGCT
to
seq2 ATC TTA GCT
There is no stop codon in this sequence, so it is correct.
Can you make this small modification to the code?
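The "stop codon as a set" rule in this post can be checked by reading a sequence three letters at a time from its first letter. The Perl sketch below only illustrates that check, using the seq3 and seq4 examples; it does not implement the removal strategy, and the function name first_inframe_stop is hypothetical.

```perl
use strict;
use warnings;

# Return the 1-based number of the first set of 3 that is a stop codon
# (TAA, TGA or TAG), or 0 if no set is a stop codon.
sub first_inframe_stop {
    my ($seq) = @_;
    $seq =~ s/\s+//g;                         # ignore spaces between sets
    my %stop = map { $_ => 1 } qw(TAA TGA TAG);
    my $n = 0;
    for my $codon (unpack '(A3)*', uc $seq) { # split into sets of 3
        $n++;
        return $n if $stop{$codon};
    }
    return 0;
}

for my $s ('ATG TAC GAT AAA TGC ATC GAT CGA TCG',     # seq3
           'ATG TAC GAT TAA TGC ATC GAT CGA TCG') {   # seq4
    my $pos = first_inframe_stop($s);
    print $pos ? "stop codon in set $pos\n" : "no stop codon as a set\n";
}
```

A program that removes extra letters around a run of "-" symbols could call such a check on each candidate result and try a different removal when it reports a stop codon.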
If not, and you find "-" in a sequence, do you remove those characters from all sequences, or only from the next sequence? Do you search all sequences for "-" and remove from the following ones, or only search the first sequence?