asked on

Optimizing stored procedures for solving word chain puzzles

I was recently pondering algorithms for solving word chain puzzles. Just for fun, I decided to implement a couple of these algorithms as SQL Server stored procedures. I know that this is a rather silly idea. I could have easily written something in C that would run much faster. I chose SQL Server for two reasons. First, I wanted a challenge. Second, I wanted to use this as an experiment to learn how to make stored procedures that run efficiently.

I did my best work with these procedures. However, it is quite likely that there is something that I could have done better. Here's what I am asking for. I want you to show me what I could have done better. If you can make these procedures run faster, please teach me how.

For those who are not familiar with word chains, here's an explaination. The puzzle consists of a starting word and an ending word. The solution (which the stored procedures attempt to discover) is a list of words, each differing by a single letter from the previous word, which creates an unbroken chain from the starting word to the ending word. For example, if the chain were to start with "bread" and end with "toast," the chain might be "bread break bleak bleat blest blast boast toast." The procedures rely on the existence of a dictionary. Each word in the chain must belong to the dictionary (i.e. nonsense words are not allowed).

I created procedures for two variations of this game. The first one, named FindChain1, requires that each word in the chain be of uniform length. The second one, FindChain2, is slightly more complicated. It allows three types of chain links. When a new word is added to the chain, it can be formed by substituting one letter for another, deleting a letter, or adding a letter. For example, if the chain were to start with "bread" and end with "toast," the chain might be "bread bead beat boat boast toast."

Oh yeah, I should also mention that the procedures need to find the shortest possible chain that solves the puzzle. If there are multiple chains which are all the shortest length, only one needs to be discovered.

This project has been a fun diversion for me and I hope that it proves equally rewarding for you as well.

I create two scripts. Well, okay, Enterprise Manager create them. The first of the two scripts, Create Tables.sql, should create the tables that the procedures use. The second script, Stored Procedures.sql, should create the stored procedures. The scripts can be found at http://myweb.cableone.net/mrog/ . You will need to supply a dictionary. I downloaded one from wordlist.sourceforge.net. Insert the dictionary words into the dictionary table. Then run the stored procedures PopulateLookupTable1 and PopulateLookupTable2. These will convert the dictionary into the form that FindChain1 and FindChain2 require. The FindChain procedures take three parameters: a starting word, and ending word, and a variable to accept the return value.

Feel free to work on this over the next few days. I am not in a big hurry. Thank you. Have fun!

mvrogers

ASKER

Here's an update: I uploaded my dictionary to http://myweb.cableone.net/mrog/. It's in a text file with one word per line. It should be easy to import it into SQL Server. I also updated FindChain2. I changed a variable name and one of the comments to improve readability. None of the functionality changed.

ASKER CERTIFIED SOLUTION

ispaleny

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

mvrogers

ASKER

Thank you, ispaleny. That's an interesting idea. It makes me think, "I should have thought of that!" I will experiment with it. This is definitely a candidate for the accepted answer.

Did you happen to see anything that could be made more efficient without altering the algorithm?

mvrogers

ASKER

That helped a lot. I created two new tables, two new procedures to populate them, and two new procedures to use them. FindChain3 is an updated version of FindChain1. FindChain4 is an updated version of FindChain2. The new algorithm made an easily measured improvement, especially in FindChain4. FindChain2 took 3 hours and 45 minutes to create a chain from "simple" to "cheese." FindChain4 did the same job in 38 seconds. The new tables didn't take very long to populate, either.

I'm confident that the algorithm has been pretty well optimized now. How about the queries and tables? Does anybody know how to shave a few more seconds off?

Thanks.

The new scripts are at http://myweb.cableone.net/mrog/.

mvrogers

ASKER

I discovered another improvement to the algorithm. Instead of using a single breadth-first search, the latest version uses two simultaneous breadth-first searches. One starts with the starting word and the other starts with the goal. They meet in the middle. The search for a chain from "simple" to "cheese" is down from 38 seconds to about a third of a second. I'll upload the new scripts later today.

In the mean time, I am still hoping that someone can either improve the efficiency of the queries or else verify that they are already as efficient as they are going to get.

Thank you.

ispaleny

Congratulations, you discovered Graph theory :)

mvrogers

ASKER

Why, thank you! You make me appear much smarter than I actually am. :)

I think that "discovered" was a poor choice of words on my part. I intended it as "found" rather than "invented."

Anyway, the new scripts are on my web site. The tables haven't changed since LookupTable3 and LookupTable4 were created.

The thing that surprises me about the new algorithm is the degree of the improvement. I initially guessed that it would cut the running time by about half because it would cut the search tree size by about half. I think that what I am seeing here is SQL Server queries are very expensive in terms of execution time, especially when the tables involved contain a large amount of data. I think that as the number of rows increase, insertion times (and possibly selection times) grow super-linearly. I would appreciate any feedback regarding this.

Thanks.

ispaleny

You have table with primary index, so you must sort records.
Difficulty increases with N*LOG(N), where N is number of rows.
http://www.everything2.com/index.pl?node_id=751105
Of course, there is some disk overhead also.

mvrogers

ASKER

That's assuming that the table has to be completely resorted after every insertion. While I wouldn't put it past Microsoft to do this, it doesn't have to work that way. If the insertion used a binary search to find the correct insertion point and then shifted some of the records to make room for the new one, each new record could be inserted in O(n) time.

I think I should create some performance tests to see what's going on under the hood.

mvrogers

ASKER

My tests really surprised me. When I insert new rows into the queue table variables, the insertion time stays pretty constant regardless of how many rows are already present. The big time drain comes from the select statements.

SET @word1 = (SELECT Word FROM @Queue1 WHERE Position=@queuePos)
SET @word2 = (SELECT Word FROM @Queue2 WHERE Position=@queuePos)

The queues are declared as follows.

DECLARE @Queue1 table (Position int IDENTITY(1,1), Word varchar(30) PRIMARY KEY, Parent int)
DECLARE @Queue2 table (Position int IDENTITY(1,1), Word varchar(30) PRIMARY KEY, Parent int)

If I make Position the primary key, or if Position and Word are both the primary key, then the select statements execute really quickly, but the insertions take a lot longer. The time lost far outweighs the time gained by making this switch.

I tried creating a separate index table variable to speed up the select statements, but it didn't seem to have any significant effect on the overall running time.

mvrogers

ASKER

Last call for comments, everyone. If you have something to say, please speak up today. Thank you.