Solved

# Edit Distance

Posted on 2011-04-30
824 Views
I was asked to write a program from scratch. This program has the following description:

Write a program that finds the “distance” between two different nucleotide (genetic) sequences. The dis- tance is defined as the edit distance, as described in class and the DPV textbook, with the gap penalty and mismatch penalty both equal to 1. You may want to test your code on some small sequences, but it should work on all pairs of the sequences (formatted as plain text) provided to you. Compute the edit distance between each pair of these sequences, leaving out e.coli. Also compute the distance between e. coli and salmonella. Report the normalized edit distance per nucleotide (the edit distance divided by the sum of the lengths of the two sequences) and time taken to run your code on each pair. Which pair is most similar by this measure? You may be interested in looking up the relationships among the pairs afterwards (try Wikipedia).
Note: Be careful about your code’s space and time usage. Inefficient solutions may work on smaller prob- lems, but not scale to cases like e. coli and salmonella. Your code has to eventually work on these cases!

Can anybody please give me some reference codes about edit distance so that I can work on it?

Thanks.
0
Question by:Mr_Lee

Author Comment

Nope. I did not ask for a complete solution. I just ask for a link to a resource on the web where I can see some sample codes, so I can start doing my assignment.

Thanks.
0

LVL 26

Accepted Solution

Unfortunately, the "edit distance" defines a family of algorithms rather than one specific one.  The "edit distance" between two strings (nucleotide sequences in your case) is the number of steps or "edits" needs to transform the first string into the second string.

One common method for computing edit distance is the Levenshtein Distance (http://www.merriampark.com/ld.htm) with a sample implementation here (http://www.merriampark.com/ldjava.htm).

If you understand how this algorithm works it should give you an insight into any of the other edit distance algorithms and you hopefully can adapt it to your specific problem set.

Doug
0

LVL 18

Expert Comment

You may want take a look at the book
Bioinformatics - Sequence and Genoma Analisys, by David W. Mount, Cold Spring Harbor Laboratory Press, New York.

Besides the chapters on the main subject, there is a one devoted to programming with important hints on specific tools like BioPerl, BLAST, EMBOSS and PISE; also instructions on how to use them. The file formats FASTA and GenBank are covered.

As said by dpearson, this is a wide matter which requires a number of algorithms and procedures, so you'll need to look at several sources. The suggested tools short the programming efforts but the algorithmic complexity remains the same, so requiring long processing times.

Jose
0

LVL 37

Expert Comment

I think the other experts may have missed something
From the original question:
The distance is defined as the edit distance, as described in class and the DPV textbook, with the gap penalty and mismatch penalty both equal to 1.

So you have the edit distance definition already. What you need to think about is how to implement it. The assignment mentioned space and time issues. Obviously this means you shouldn't load the whole thing into one array or take n iterations finding gaps.

How easy is it to find gaps using your definitions?
0

Author Comment

I used a 2-d array to store the edit distance. It worked, but a little bit slow for the longest sequence samples.
0

LVL 37

Assisted Solution

If it works on the e coli and salmonella as mentioned, then I assume it would be fine. I don't know what you mean by using a 2D array to store the edit distance, do you mean calculate not store? The edit distance should be a single number but finding the gaps could use a 2D array.
Mismatch could easily be found using O(1) (constant) storage and O(n) (linear) time. I don't know how you're defining gaps, but depending on that, there are probably many good efficient approaches.
0

Author Comment

Actually, the 2-D array is to store the subproblems as you said in finding the gaps. The last cell of this array is the edit distance between the two sequences.
0

LVL 37

Expert Comment

That's what I was expecting. n^2 isn't that bad. If it runs pretty quickly, then I'd stick with it.
0

## Join & Write a Comment Already a member? Login.

This article will show you some of the more useful Standard Template Library (STL) algorithms through the use of working examples.  You will learn about how these algorithms fit into the STL architecture, how they work with STL containers, and why t…
One of Google's most recent algorithm changes affecting local searches is entitled "The Pigeon Update." This update has dramatically enhanced search inquires for the keyword "Yelp." Google searches with the word "Yelp" included will now yield Yelp a…
The viewer will learn how to user default arguments when defining functions. This method of defining functions will be contrasted with the non-default-argument of defining functions.
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.

#### 732 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

#### Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!