Solved

Remove unnecessary characters...

Posted on 2009-04-20
Last Modified: 2013-11-23
I have two sequences in the format below:

>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA


The sequences are not always on one line.
I want a few changes made to these sequences.

I want the sequences in the format below:

2 303
seq1 AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2 AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT



The modifications are:

1) Remove the > symbol.
2) Bring each sequence onto one line: the name, then the sequence. The same goes for the second sequence, so each name and its sequence end up on a single line.

3) Remove the first three letters (ATG) and the last three letters (TAA) of each sequence. It need not be TAA every time; I just want to remove the last three letters of each sequence, which can be any of TAA, TAG or TGA.

4) After that, remove the "-" symbols from the sequence. Before removing them, remove the corresponding positions in the other sequence.
Here's an example:


ex:
>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

We remove "-"s in multiples of three; even if there are only 2 "-"s, we still remove three characters.
Here there are only 2 "-"s, so we remove both of them plus either the letter on the left (T) or the one on the right (A), and we also remove the corresponding positions from the second sequence.

So, removing the A, the output sequences will be:

seq1 ATCTAGCGTAGC
seq2 ATCTAAGTAGCT

But now TAA appears in the second sequence,

so I must delete 6 characters, not 3, around the "-"s:
>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

If we delete 6:
seq1 ATCCGTAGC
seq2 ATCGTAGCT
At the end, both sequences should be of equal length, and that length should be a multiple of 3.
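
Step 3 above (dropping the leading start codon and the trailing stop codon) amounts to slicing three characters off each end. Here is a minimal Java sketch, assuming each sequence has already been joined into one string and does not end in dashes; the class and method names are made up for illustration.

```java
// Hypothetical helper, not taken from any code attached to this thread.
final class CodonTrim {
    /**
     * Returns seq without its first three and last three characters.
     * No check is made that they really are ATG and TAA/TAG/TGA,
     * because the question asks for unconditional removal.
     */
    static String trimEnds(String seq) {
        if (seq.length() < 6) {
            throw new IllegalArgumentException("sequence too short to trim: " + seq);
        }
        return seq.substring(3, seq.length() - 3);
    }

    public static void main(String[] args) {
        System.out.println(trimEnds("ATGAAACCGTAA")); // prints AAACCG
    }
}
```

If a sequence ends in dashes, its last three characters are not the stop codon, and this sketch would need adjusting.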

On the first line we write the number of sequences and the length of a sequence after all these modifications.

Here it is 2, as there are two sequences, and 303 is the length after all modifications.
There are always 2 sequences.

Header format:

<space><numberof sequences><space><length of sequence>

The length of the sequence does not include the name.
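
The header and the name-plus-sequence lines described above could be produced along these lines. This is only a sketch: `AlignmentOutput` and `format` are hypothetical names, and the checks for equal length and a multiple of 3 are taken from the requirements stated in the question.

```java
// Hypothetical formatter for the output layout described in the question.
final class AlignmentOutput {
    /** Builds "<space>count<space>length", then one "name sequence" line per entry. */
    static String format(String[] names, String[] seqs) {
        int length = seqs[0].length();
        for (String s : seqs) {
            if (s.length() != length) {
                throw new IllegalArgumentException("sequences differ in length");
            }
        }
        if (length % 3 != 0) {
            throw new IllegalArgumentException("length is not a multiple of 3: " + length);
        }
        StringBuilder out = new StringBuilder();
        // The length counts sequence letters only, not the name.
        out.append(' ').append(seqs.length).append(' ').append(length).append('\n');
        for (int i = 0; i < seqs.length; i++) {
            out.append(names[i]).append(' ').append(seqs[i]).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(format(
                new String[] { "seq1", "seq2" },
                new String[] { "ATCCGTAGC", "ATCGTAGCT" }));
    }
}
```

For the short example above, the first line of the result is ` 2 9`, followed by one line per sequence.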

Until now I have made all these changes manually; now I need a program, as it is taking too much time.
Question by:shragi
 
LVL 16

Expert Comment

by:imladris
ID: 24189273
This:

header format...

<space><numberof sequences><space><length of sequence>


Appears to correspond to this:

2 303
seq1 AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTCTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2 AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTTCTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT


But the spec starts with <space>, and the example doesn't. Is it that those two things don't correspond, or am I misunderstanding something?

Also, these look like DNA sequences. Shouldn't they go strictly in threes, then? In which case, shouldn't this example simply go from:

>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

to

seq1 ATCAAGCGTAGC

just because the T before the -'s starts a grouping of 3?

If not, I need a better explanation of what gets deleted and why. Trying to delete --A, then finding that this results in TAA in the second sequence and, for that reason, deleting 6 in sequence 1 starting at the T, seems arbitrary and incomplete.

 

Author Comment

by:shragi
ID: 24191032
Header format:
<space><numberof sequences><space><length of sequence>
ex:
 2 303

I think I forgot to include the space in the example I provided.


Yes, these are DNA sequences, and their length should be a multiple of 3 for the other part of my program.



>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

to

seq1 ATCAAGCGTAGC  

is correct. Here you removed T, so remove TTT from seq2:

seq2 ATCAAAGTAGCT

Here, when you divide seq1 and seq2 into sets of 3, you get the following:
 seq1   ATC  AAG  CGT  AGC
 seq2   ATC  AAA  GTA  GCT

There is no stop codon (TAA, TGA or TAG) in either sequence, so there is no problem.

But if you had done it the other way:

>seq1 ATCT--AAGCGTAGC
>seq2 ATCTTTAAAGTAGCT

to

seq1 ATC  TAG  CGT  AGC

then seq2 is as below:

seq2    ATC  TAA  GTA  GCT

Here seq2 contains the stop codon TAA, so remove 6 from seq1.
Why do we want to remove the stop codon?
Because when a stop codon is reached, the rest of the sequence is ignored.
You can clearly see in the input that each sequence ends with the stop codon TAA, i.e. the sequence stops when a stop codon is seen, so we do not want a stop codon in the middle.

And one more thing:

seq3  ATG TAC GAT AAA TGC ATC GAT CGA TCG   // valid: it contains TAA, but not as a set of three
seq4  ATG TAC GAT TAA TGC ATC GAT CGA TCG   // invalid
In seq3, if you look at positions 9, 10 and 11, they are TAA, but when the sequence is read in sets of three they do not count as a stop codon.
In seq4, however, TAA appears as a set. I mean that when you divide a sequence into sets of 3, you must not get any stop codon (TAA, TGA or TAG).
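
The difference between seq3 and seq4 can be checked mechanically by stepping through the sequence three letters at a time. Here is a small Java sketch; `StopCodons` and `firstInFrameStop` are hypothetical names, and the returned index is 0-based, unlike the 1-based positions used above.

```java
// Hypothetical helper for spotting stop codons that fall on a codon boundary.
final class StopCodons {
    static boolean isStop(String codon) {
        return codon.equals("TAA") || codon.equals("TAG") || codon.equals("TGA");
    }

    /** Returns the 0-based index of the first in-frame stop codon, or -1 if there is none. */
    static int firstInFrameStop(String seq) {
        for (int i = 0; i + 3 <= seq.length(); i += 3) {
            if (isStop(seq.substring(i, i + 3))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String seq3 = "ATG TAC GAT AAA TGC ATC GAT CGA TCG".replace(" ", "");
        String seq4 = "ATG TAC GAT TAA TGC ATC GAT CGA TCG".replace(" ", "");
        System.out.println(firstInFrameStop(seq3)); // prints -1
        System.out.println(firstInFrameStop(seq4)); // prints 9
    }
}
```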


>seq1 ATCT--AAGCGTAGC

to
 seq1 ATC TGT AGC    
 // Here I removed 3 more from seq1; these can come from the left or the right of the 3 removed previously.

Now seq2 is as below:
> seq2  ATCTTTAAAGTAGCT
 
to  
seq2 ATC TTA GCT

There is no stop codon in this sequence, so it is correct.

If you have any queries, let me know.

 
LVL 16

Expert Comment

by:imladris
ID: 24206905
Attached is the source to a class that does what you describe.
I have listed the assumptions I have made in the comments. They're mainly about the format of the input file, and how to deal with stop codons.
Let me know what you think.

DNA.txt

 

Author Comment

by:shragi
ID: 24217513
Your code works fine, but I have a few questions or complaints about it.

I ran the code using the above example, i.e.:


>seq1
ATGAAACCGTCTCCGTTCATTGTTTT---GATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATAT---------CCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAATAA
>seq2
ATGAAACTGTCTCTGTTCATTATTTTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACAATGGCATCTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAATTAA


When I run the code on this example, it first removes 6 characters around the "-"s in seq1,
but that is wrong; it should remove only 3. Below is part of the above sequences:

>seq1   ATG AAA CCG TCT CCG TTC ATT GTT  TT- --G  ATA
>seq2   ATG AAA CTG TCT  CTG TTC ATT ATT TTT TTG ATA  

So when I remove the 3 "-"s, the join must not contain any of the 3 stop codons (TAG, TGA, TAA):

>seq1   ATG AAA CCG TCT CCG TTC ATT GTT  TTG  ATA
>seq2   ATG AAA CTG TCT  CTG TTC ATT ATT TTG ATA  

None of the 3 stop codons is present at the join. If there is none at the join, there cannot be one in the part after it, because we are removing in multiples of three.

But here your code removes 6 letters.

Your code worked well, though, when I had only 2 "-"s.

So can you find where the code went wrong?
 
LVL 16

Expert Comment

by:imladris
ID: 24218155
HMMmmm. We're going to have to do this in iterations, I guess. The rules in your first posting:

>remove either left side alphabet(T) or right side one(A)

seemed to imply that there was no fixed correct way of removing -'s. So I simply removed any codons with dashes in them. This could involve one or two codons, even in a 2 dash case. For instance:

AAT TAC T-- GAG CAC

would become

AAT TAC GAG CAC

whereas

AAT TAC TA- -GG CAC

would become

AAT TAC CAC

If that is not acceptable let me know.
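
The DNA.txt attachment is not reproduced in this thread, so the following is not that code. It is a minimal Java sketch of the rule as described above: walk both aligned sequences codon by codon and drop every codon position where either sequence contains a dash. The names `GapCodons` and `dropGappedCodons` are invented for illustration.

```java
// Hypothetical sketch of the "delete every codon that contains a dash" rule.
final class GapCodons {
    /** Returns {cleanedA, cleanedB}; a trailing partial codon, if any, is dropped. */
    static String[] dropGappedCodons(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("aligned sequences must have equal length");
        }
        StringBuilder outA = new StringBuilder();
        StringBuilder outB = new StringBuilder();
        for (int i = 0; i + 3 <= a.length(); i += 3) {
            String ca = a.substring(i, i + 3);
            String cb = b.substring(i, i + 3);
            // Keep the codon only if neither sequence has a dash in it.
            if (ca.indexOf('-') < 0 && cb.indexOf('-') < 0) {
                outA.append(ca);
                outB.append(cb);
            }
        }
        return new String[] { outA.toString(), outB.toString() };
    }

    public static void main(String[] args) {
        String[] r = dropGappedCodons("AATTACTA--GGCAC", "AATTACTACCGGCAC");
        System.out.println(r[0] + " " + r[1]); // prints AATTACCAC AATTACCAC
    }
}
```

Because whole codons are removed from both sequences together, the reading frame is preserved and the two results always have the same length.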


Your latest example appears to indicate that that is true, *unless* there is a multiple of 3 dashes involved. In that case they must be removed exactly. The attached code adds that exception.

DNA.txt
 

Author Comment

by:shragi
ID: 24219129
This time the output got worse: it contains "-"s.
The output that I got is below:

 2 306
seq1 AAACCGTCTCCGTTCATTGTTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGAAACTCAATCTGCTACGAGTACGAAAGCCGAGTCTTCTAATGCGGGTCAGAGCGGAAATCGATATCCACCGGTGAAGATGAATTTTGAAAAAGTGTTTACTCCTAGTTTTTGTAAAGGTTTGCAAGATCAGCAATCAAAAATTGAAGAACTTTCGGCAGACTTGGAGAGGTTTGAGGGTCAGGAATTGAAGTCAAATTATGGAACATATTCCGACAAAAAGGACCATAAA
seq2 AAACTGTCTCTGTTCATTATTTTGATATTTGCTGTTATTATAGGCCTGTGTGGTTGTGCACCACCCAAGGCCGAAGGAACTAAATCTGGTATGGGAACGCAAGCCGAGTCTTCTAATGCGGGTCAGAGAGGAAGTCGAAACTCATCGGCGGAGTTGAACTTTGACAGAATTT---CTCCTGGTTTTATTAAAGGTTTGCGTGAAGATCAATCAGGATATGAAAAAGTTGGAGAGATCTTGAAGAGGGCTCAGGATCAGCAATTGAAGTCAAATTATGGAAAATATTCCGACAAAAAGGCCCATAAT


Your example:

AAT TAC T-- GAG CAC

would become

AAT TAC GAG CAC        // This is correct if you remove the T together with the "-"s, because we must remove in multiples of 3 but have only 2 "-"s, so we remove one more letter from either side. If we remove "--G" instead, the join contains the stop codon TAG, so we remove that as well. So the output is either the one you got or this one: AAT TAC CAC

whereas

AAT TAC TA- -GG CAC

would become

AAT TAC CAC  // This is also correct. Here, if "--G" is removed, the join contains the stop codon TAG, so we remove it too; so your output is correct. But a second output is also possible: if we remove "A--", the join contains TGG, which is not a stop codon, so AAT TAC TGG CAC is also a valid output.

AAT TAC TA- --G CAC

would become

AAT TAC CAC    // Because here we already have 3 "-"s (a multiple of 3), there is no need to remove any letter. First remove all 3 "-"s; after joining we get AAT TAC TAG CAC.
At the join we get a stop codon, so we delete 3 more, and our output is AAT TAC CAC.


Finally: if we have a multiple of 3 "-"s, we remove them and check for stop codons at the join; if one is found, we remove 3 more letters.

If we do not have a multiple of 3 "-"s in our sequence, I mean if we have 2 or 4 "-"s, then since by the rule we can only remove in multiples of 3, we remove enough letters along with these "-"s to make a multiple of 3, and in the same way, if we find a stop codon at the join, we remove 3 more.
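
The rules summarised above leave some freedom (which side to take extra letters from), so a program has to pick one reading. The Java sketch below is one such reading, offered as an assumption rather than a definitive answer: gap runs whose length is a multiple of 3 are removed exactly; any other run is handled by dropping every codon that still contains a dash; and any codon that is a stop codon in either sequence is dropped as well, which is broader than checking only at the join. All names are hypothetical.

```java
// Hypothetical sketch of one deterministic reading of the rules above.
final class GapRules {
    static boolean isStop(String codon) {
        return codon.equals("TAA") || codon.equals("TAG") || codon.equals("TGA");
    }

    static boolean hasGap(String a, String b, int i) {
        return a.charAt(i) == '-' || b.charAt(i) == '-';
    }

    /** Cleans two aligned sequences of equal length; returns {cleanedA, cleanedB}. */
    static String[] clean(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("aligned sequences must have equal length");
        }
        // Pass 1: delete gap runs whose length is a multiple of three, exactly.
        StringBuilder keptA = new StringBuilder();
        StringBuilder keptB = new StringBuilder();
        int i = 0;
        while (i < a.length()) {
            int end = i;
            while (end < a.length() && hasGap(a, b, end)) {
                end++;
            }
            if (end == i) {                      // ordinary column: keep it
                keptA.append(a.charAt(i));
                keptB.append(b.charAt(i));
                i++;
            } else {
                if ((end - i) % 3 != 0) {        // run not divisible by 3: leave it for pass 2
                    keptA.append(a, i, end);
                    keptB.append(b, i, end);
                }
                i = end;
            }
        }
        // Pass 2: walk codon by codon; drop codons that still hold a dash
        // or that are a stop codon in either sequence.
        StringBuilder outA = new StringBuilder();
        StringBuilder outB = new StringBuilder();
        for (int p = 0; p + 3 <= keptA.length(); p += 3) {
            String ca = keptA.substring(p, p + 3);
            String cb = keptB.substring(p, p + 3);
            boolean gapped = ca.indexOf('-') >= 0 || cb.indexOf('-') >= 0;
            if (!gapped && !isStop(ca) && !isStop(cb)) {
                outA.append(ca);
                outB.append(cb);
            }
        }
        return new String[] { outA.toString(), outB.toString() };
    }
}
```

With this reading, `clean("GTTTT---GATA", "ATTTTTTTGATA")` gives GTTTTGATA and ATTTTGATA, and `clean("ATCT--AAGCGTAGC", "ATCTTTAAAGTAGCT")` gives ATCAAGCGTAGC and ATCAAAGTAGCT, matching the examples earlier in the thread.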

 
LVL 16

Accepted Solution

by:
imladris earned 2000 total points
ID: 24220535
Programs, in general, embody a single way of performing a task. So, having multiple correct ways of doing something described may not be a useful way of figuring out how to get a program to do what you want.

What if the sequence is:

AAT TAC TA- -CA CAC

Your algorithm would appear to result in 4 codons: AAT TAC TCA CAC or AAT TAC TAA CAC.
The algorithm I propose simply deletes all codons with -'s, which leads to 3 codons:

AAT TAC CAC

Is that acceptable?

In the accompanying file I have fixed the problem that left the -'s in the second sequence.

DNA.txt
 

Author Comment

by:shragi
ID: 24227926
Hey, the program you gave worked for shorter sequences; it worked well for the sequence above.
But I have hit the real problem: sometimes I have bigger sequences.

The above sequence is only around 330 long, but I have sequences of length around 3000. In that case the whole sequence can't fit on one line, and you wrote the code assuming only four lines, so I get the error below:


Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 1455
        at java.lang.String.substring(Unknown Source)
        at DNA2.main(DNA2.java:49)


I mean your code assumes that line[0] and line[2] are the names of the sequences,
and that line[1] and line[3] contain the sequences themselves.

But my sequences are so big that each takes more than one line. Even when I put each on one line, it shows the above error.
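
One way to avoid depending on a fixed four-line layout is to treat every line starting with > as a new name and to append all following lines to that sequence. This is a sketch under that assumption, not the code from DNA.txt; `FastaLines` and `parse` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are illustrative only.
final class FastaLines {
    /** Maps each ">name" header to its sequence, joining wrapped lines. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> sequences = new LinkedHashMap<String, String>();
        String name = null;
        StringBuilder body = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;                      // skip blank lines
            }
            if (line.startsWith(">")) {
                if (name != null) {
                    sequences.put(name, body.toString());
                }
                name = line.substring(1).trim();
                body.setLength(0);             // start collecting the next sequence
            } else if (name != null) {
                body.append(line);             // continuation of the current sequence
            }
        }
        if (name != null) {
            sequences.put(name, body.toString());
        }
        return sequences;
    }

    public static void main(String[] args) {
        Map<String, String> seqs = parse(Arrays.asList(
                ">seq1", "ATGAAA", "CCGTAA", ">seq2", "ATGAAA", "CTGTAA"));
        System.out.println(seqs); // prints {seq1=ATGAAACCGTAA, seq2=ATGAAACTGTAA}
    }
}
```

The list of lines could come from `java.nio.file.Files.readAllLines` for a real input file; a `LinkedHashMap` is used so the sequences keep their original order.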

 

Author Comment

by:shragi
ID: 24227950
I forgot to mention one more point: does your code work even if there are "-"s at the end of a sequence?

I had 20 "-"s at the end of one of the sequences.

 
LVL 16

Expert Comment

by:imladris
ID: 24228781
You specified:

>I had a sequences that are of below format...

I copied and pasted those into notebook, wrote them to a file and worked from them. If that is not the format, you will need to specify the format of the input the program needs to process.

The code is intended to work regardless of where '-'s happen. However, given the evolving nature of the specification of what the program is intended to do, it would probably be best to rely on thorough testing rather than my assertions.
 

Author Comment

by:shragi
ID: 24243207
Your code worked for shorter sequences, but not when I tried a longer one.
That is, even though the sequences are long, I put each on one line, making four lines overall,
but I am not getting any output. I can't find any limit on the length of the sequence, so why am I not able to get output?

The format is the same:
>seq1
TTGC..............................................................TGCGC
>seq2
TGCTGATCG..................................................TCGTGCT
 
LVL 16

Expert Comment

by:imladris
ID: 24251521
Could you post the file here for me to try?