• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 841
  • Last Modified:

Edit Distance

I was asked to write a program from scratch. This program has the following description:

Write a program that finds the “distance” between two different nucleotide (genetic) sequences. The dis- tance is defined as the edit distance, as described in class and the DPV textbook, with the gap penalty and mismatch penalty both equal to 1. You may want to test your code on some small sequences, but it should work on all pairs of the sequences (formatted as plain text) provided to you. Compute the edit distance between each pair of these sequences, leaving out e.coli. Also compute the distance between e. coli and salmonella. Report the normalized edit distance per nucleotide (the edit distance divided by the sum of the lengths of the two sequences) and time taken to run your code on each pair. Which pair is most similar by this measure? You may be interested in looking up the relationships among the pairs afterwards (try Wikipedia).
Note: Be careful about your code’s space and time usage. Inefficient solutions may work on smaller prob- lems, but not scale to cases like e. coli and salmonella. Your code has to eventually work on these cases!

Can anybody please give me some reference codes about edit distance so that I can work on it?

2 Solutions
Mr_LeeAuthor Commented:
Nope. I did not ask for a complete solution. I just ask for a link to a resource on the web where I can see some sample codes, so I can start doing my assignment.

Unfortunately, the "edit distance" defines a family of algorithms rather than one specific one.  The "edit distance" between two strings (nucleotide sequences in your case) is the number of steps or "edits" needs to transform the first string into the second string.

One common method for computing edit distance is the Levenshtein Distance (http://www.merriampark.com/ld.htm) with a sample implementation here (http://www.merriampark.com/ldjava.htm).

If you understand how this algorithm works it should give you an insight into any of the other edit distance algorithms and you hopefully can adapt it to your specific problem set.

Jose ParrotGraphics ExpertCommented:
You may want take a look at the book
Bioinformatics - Sequence and Genoma Analisys, by David W. Mount, Cold Spring Harbor Laboratory Press, New York.

Besides the chapters on the main subject, there is a one devoted to programming with important hints on specific tools like BioPerl, BLAST, EMBOSS and PISE; also instructions on how to use them. The file formats FASTA and GenBank are covered.

As said by dpearson, this is a wide matter which requires a number of algorithms and procedures, so you'll need to look at several sources. The suggested tools short the programming efforts but the algorithmic complexity remains the same, so requiring long processing times.

Cloud Class® Course: C++ 11 Fundamentals

This course will introduce you to C++ 11 and teach you about syntax fundamentals.

I think the other experts may have missed something
From the original question:
The distance is defined as the edit distance, as described in class and the DPV textbook, with the gap penalty and mismatch penalty both equal to 1.

So you have the edit distance definition already. What you need to think about is how to implement it. The assignment mentioned space and time issues. Obviously this means you shouldn't load the whole thing into one array or take n iterations finding gaps.

How easy is it to find gaps using your definitions?
Mr_LeeAuthor Commented:
I used a 2-d array to store the edit distance. It worked, but a little bit slow for the longest sequence samples.
If it works on the e coli and salmonella as mentioned, then I assume it would be fine. I don't know what you mean by using a 2D array to store the edit distance, do you mean calculate not store? The edit distance should be a single number but finding the gaps could use a 2D array.
Mismatch could easily be found using O(1) (constant) storage and O(n) (linear) time. I don't know how you're defining gaps, but depending on that, there are probably many good efficient approaches.
Mr_LeeAuthor Commented:
Actually, the 2-D array is to store the subproblems as you said in finding the gaps. The last cell of this array is the edit distance between the two sequences.
That's what I was expecting. n^2 isn't that bad. If it runs pretty quickly, then I'd stick with it.
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Cloud Class® Course: Microsoft Exchange Server

The MCTS: Microsoft Exchange Server 2010 certification validates your skills in supporting the maintenance and administration of the Exchange servers in an enterprise environment. Learn everything you need to know with this course.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now