Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium


Edit Distance

Posted on 2011-04-30
Medium Priority
Last Modified: 2012-05-11
I was asked to write a program from scratch. This program has the following description:

Write a program that finds the “distance” between two different nucleotide (genetic) sequences. The dis- tance is defined as the edit distance, as described in class and the DPV textbook, with the gap penalty and mismatch penalty both equal to 1. You may want to test your code on some small sequences, but it should work on all pairs of the sequences (formatted as plain text) provided to you. Compute the edit distance between each pair of these sequences, leaving out e.coli. Also compute the distance between e. coli and salmonella. Report the normalized edit distance per nucleotide (the edit distance divided by the sum of the lengths of the two sequences) and time taken to run your code on each pair. Which pair is most similar by this measure? You may be interested in looking up the relationships among the pairs afterwards (try Wikipedia).
Note: Be careful about your code’s space and time usage. Inefficient solutions may work on smaller prob- lems, but not scale to cases like e. coli and salmonella. Your code has to eventually work on these cases!

Can anybody please give me some reference codes about edit distance so that I can work on it?

Question by:Mr_Lee

Author Comment

ID: 35499240
Nope. I did not ask for a complete solution. I just ask for a link to a resource on the web where I can see some sample codes, so I can start doing my assignment.

LVL 28

Accepted Solution

dpearson earned 1000 total points
ID: 35499260
Unfortunately, the "edit distance" defines a family of algorithms rather than one specific one.  The "edit distance" between two strings (nucleotide sequences in your case) is the number of steps or "edits" needs to transform the first string into the second string.

One common method for computing edit distance is the Levenshtein Distance (http://www.merriampark.com/ld.htm) with a sample implementation here (http://www.merriampark.com/ldjava.htm).

If you understand how this algorithm works it should give you an insight into any of the other edit distance algorithms and you hopefully can adapt it to your specific problem set.

LVL 18

Expert Comment

by:Jose Parrot
ID: 35500636
You may want take a look at the book
Bioinformatics - Sequence and Genoma Analisys, by David W. Mount, Cold Spring Harbor Laboratory Press, New York.

Besides the chapters on the main subject, there is a one devoted to programming with important hints on specific tools like BioPerl, BLAST, EMBOSS and PISE; also instructions on how to use them. The file formats FASTA and GenBank are covered.

As said by dpearson, this is a wide matter which requires a number of algorithms and procedures, so you'll need to look at several sources. The suggested tools short the programming efforts but the algorithmic complexity remains the same, so requiring long processing times.

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

LVL 37

Expert Comment

ID: 35507011
I think the other experts may have missed something
From the original question:
The distance is defined as the edit distance, as described in class and the DPV textbook, with the gap penalty and mismatch penalty both equal to 1.

So you have the edit distance definition already. What you need to think about is how to implement it. The assignment mentioned space and time issues. Obviously this means you shouldn't load the whole thing into one array or take n iterations finding gaps.

How easy is it to find gaps using your definitions?

Author Comment

ID: 35507802
I used a 2-d array to store the edit distance. It worked, but a little bit slow for the longest sequence samples.
LVL 37

Assisted Solution

TommySzalapski earned 1000 total points
ID: 35507879
If it works on the e coli and salmonella as mentioned, then I assume it would be fine. I don't know what you mean by using a 2D array to store the edit distance, do you mean calculate not store? The edit distance should be a single number but finding the gaps could use a 2D array.
Mismatch could easily be found using O(1) (constant) storage and O(n) (linear) time. I don't know how you're defining gaps, but depending on that, there are probably many good efficient approaches.

Author Comment

ID: 35508049
Actually, the 2-D array is to store the subproblems as you said in finding the gaps. The last cell of this array is the edit distance between the two sequences.
LVL 37

Expert Comment

ID: 35509770
That's what I was expecting. n^2 isn't that bad. If it runs pretty quickly, then I'd stick with it.

Featured Post

Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

IntroductionThis article is the second in a three part article series on the Visual Studio 2008 Debugger.  It provides tips in setting and using breakpoints. If not familiar with this debugger, you can find a basic introduction in the EE article loc…
Article by: Nadia
Suppose you use Uber application as a rider and you request a ride to go from one place to another. Your driver just arrived at the parking lot of your place. The only thing you know about the ride is the license plate number. How do you find your U…
The goal of the video will be to teach the user the concept of local variables and scope. An example of a locally defined variable will be given as well as an explanation of what scope is in C++. The local variable and concept of scope will be relat…
The viewer will be introduced to the technique of using vectors in C++. The video will cover how to define a vector, store values in the vector and retrieve data from the values stored in the vector.

575 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question