[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
?
Solved

Optimizing stored procedures for solving word chain puzzles

Posted on 2005-05-02
11
Medium Priority
?
519 Views
Last Modified: 2012-08-13
I was recently pondering algorithms for solving word chain puzzles.  Just for fun, I decided to implement a couple of these algorithms as SQL Server stored procedures.  I know that this is a rather silly idea.  I could have easily written something in C that would run much faster.  I chose SQL Server for two reasons.  First, I wanted a challenge.  Second, I wanted to use this as an experiment to learn how to make stored procedures that run efficiently.

I did my best work with these procedures.  However, it is quite likely that there is something that I could have done better.  Here's what I am asking for.  I want you to show me what I could have done better.  If you can make these procedures run faster, please teach me how.

For those who are not familiar with word chains, here's an explaination.  The puzzle consists of a starting word and an ending word.  The solution (which the stored procedures attempt to discover) is a list of words, each differing by a single letter from the previous word, which creates an unbroken chain from the starting word to the ending word.  For example, if the chain were to start with "bread" and end with "toast," the chain might be "bread break bleak bleat blest blast boast toast."  The procedures rely on the existence of a dictionary.  Each word in the chain must belong to the dictionary (i.e. nonsense words are not allowed).

I created procedures for two variations of this game.  The first one, named FindChain1, requires that each word in the chain be of uniform length.  The second one, FindChain2, is slightly more complicated.  It allows three types of chain links.  When a new word is added to the chain, it can be formed by substituting one letter for another, deleting a letter, or adding a letter.  For example, if the chain were to start with "bread" and end with "toast," the chain might be "bread bead beat boat boast toast."

Oh yeah, I should also mention that the procedures need to find the shortest possible chain that solves the puzzle.  If there are multiple chains which are all the shortest length, only one needs to be discovered.

This project has been a fun diversion for me and I hope that it proves equally rewarding for you as well.

I create two scripts.  Well, okay, Enterprise Manager create them.  The first of the two scripts, Create Tables.sql, should create the tables that the procedures use.  The second script, Stored Procedures.sql, should create the stored procedures.  The scripts can be found at http://myweb.cableone.net/mrog/ .  You will need to supply a dictionary.  I downloaded one from wordlist.sourceforge.net.  Insert the dictionary words into the dictionary table.  Then run the stored procedures PopulateLookupTable1 and PopulateLookupTable2.  These will convert the dictionary into the form that FindChain1 and FindChain2 require.  The FindChain procedures take three parameters: a starting word, and ending word, and a variable to accept the return value.

Feel free to work on this over the next few days.  I am not in a big hurry.  Thank you.  Have fun!
0
Comment
Question by:mvrogers
  • 8
  • 3
11 Comments
 

Author Comment

by:mvrogers
ID: 13910209
Here's an update:  I uploaded my dictionary to http://myweb.cableone.net/mrog/.  It's in a text file with one word per line.  It should be easy to import it into SQL Server.  I also updated FindChain2.  I changed a variable name and one of the comments to improve readability.  None of the functionality changed.
0
 
LVL 13

Accepted Solution

by:
ispaleny earned 1500 total points
ID: 13912612
You can create a lookup table with pairs Word-RelatedWord. This table can be indexed and you get results faster. I did some tests, and I don't expect it would be too big. But it takes about one week to create it on my test computer with unoptimized substring queries.
0
 

Author Comment

by:mvrogers
ID: 13912713
Thank you, ispaleny.  That's an interesting idea.  It makes me think, "I should have thought of that!"  I will experiment with it.  This is definitely a candidate for the accepted answer.

Did you happen to see anything that could be made more efficient without altering the algorithm?
0
Visualize your virtual and backup environments

Create well-organized and polished visualizations of your virtual and backup environments when planning VMware vSphere, Microsoft Hyper-V or Veeam deployments. It helps you to gain better visibility and valuable business insights.

 

Author Comment

by:mvrogers
ID: 13915620
That helped a lot.  I created two new tables, two new procedures to populate them, and two new procedures to use them.  FindChain3 is an updated version of FindChain1.  FindChain4 is an updated version of FindChain2.  The new algorithm made an easily measured improvement, especially in FindChain4.  FindChain2 took 3 hours and 45 minutes to create a chain from "simple" to "cheese."  FindChain4 did the same job in 38 seconds.  The new tables didn't take very long to populate, either.

I'm confident that the algorithm has been pretty well optimized now.  How about the queries and tables?  Does anybody know how to shave a few more seconds off?

Thanks.

The new scripts are at http://myweb.cableone.net/mrog/.
0
 

Author Comment

by:mvrogers
ID: 13919249
I discovered another improvement to the algorithm.  Instead of using a single breadth-first search, the latest version uses two simultaneous breadth-first searches.  One starts with the starting word and the other starts with the goal.  They meet in the middle.  The search for a chain from "simple" to "cheese" is down from 38 seconds to about a third of a second.  I'll upload the new scripts later today.

In the mean time, I am still hoping that someone can either improve the efficiency of the queries or else verify that they are already as efficient as they are going to get.

Thank you.
0
 
LVL 13

Expert Comment

by:ispaleny
ID: 13919371
Congratulations, you discovered Graph theory :)
0
 

Author Comment

by:mvrogers
ID: 13920431
Why, thank you!  You make me appear much smarter than I actually am.  :)

I think that "discovered" was a poor choice of words on my part.  I intended it as "found" rather than "invented."

Anyway, the new scripts are on my web site.  The tables haven't changed since LookupTable3 and LookupTable4 were created.

The thing that surprises me about the new algorithm is the degree of the improvement.  I initially guessed that it would cut the running time by about half because it would cut the search tree size by about half.  I think that what I am seeing here is SQL Server queries are very expensive in terms of execution time, especially when the tables involved contain a large amount of data.  I think that as the number of rows increase, insertion times (and possibly selection times) grow super-linearly.  I would appreciate any feedback regarding this.

Thanks.
0
 
LVL 13

Expert Comment

by:ispaleny
ID: 13920774
You have table with primary index, so you must sort records.
Difficulty increases with N*LOG(N), where N is number of rows.
http://www.everything2.com/index.pl?node_id=751105
Of course, there is some disk overhead also.
0
 

Author Comment

by:mvrogers
ID: 13921696
That's assuming that the table has to be completely resorted after every insertion.  While I wouldn't put it past Microsoft to do this, it doesn't have to work that way.  If the insertion used a binary search to find the correct insertion point and then shifted some of the records to make room for the new one, each new record could be inserted in O(n) time.

I think I should create some performance tests to see what's going on under the hood.
0
 

Author Comment

by:mvrogers
ID: 13922120
My tests really surprised me.  When I insert new rows into the queue table variables, the insertion time stays pretty constant regardless of how many rows are already present.  The big time drain comes from the select statements.

SET @word1 = (SELECT Word FROM @Queue1 WHERE Position=@queuePos)
SET @word2 = (SELECT Word FROM @Queue2 WHERE Position=@queuePos)

The queues are declared as follows.

DECLARE @Queue1 table (Position int IDENTITY(1,1), Word varchar(30) PRIMARY KEY, Parent int)
DECLARE @Queue2 table (Position int IDENTITY(1,1), Word varchar(30) PRIMARY KEY, Parent int)

If I make Position the primary key, or if Position and Word are both the primary key, then the select statements execute really quickly, but the insertions take a lot longer.  The time lost far outweighs the time gained by making this switch.

I tried creating a separate index table variable to speed up the select statements, but it didn't seem to have any significant effect on the overall running time.
0
 

Author Comment

by:mvrogers
ID: 13946158
Last call for comments, everyone.  If you have something to say, please speak up today.  Thank you.
0

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I have a large data set and a SSIS package. How can I load this file in multi threading?
In this article we will learn how to fix  “Cannot install SQL Server 2014 Service Pack 2: Unable to install windows installer msi file” error ?
Via a live example, show how to extract insert data into a SQL Server database table using the Import/Export option and Bulk Insert.
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.
Suggested Courses

873 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question