?
Solved

need help indexing my database

Posted on 2012-08-20
36
Medium Priority
?
680 Views
Last Modified: 2012-08-23
I run this query repeatedly, with different zip codes.

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price, 
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt FROM 
tblProperties WHERE ZIP IN (SELECT h.zip FROM zip g JOIN zip h on g.zip <> h.zip AND 
g.zip = '32244' AND h.zip <> '32244' WHERE g.GeogCol1.STDistance(h.GeogCol1)<=(10 * 
1609.344) ) UNION SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price, 
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt FROM 
tblProperties WHERE ZIP = '32244';

Open in new window


How can I index my table to increase loading time. This query takes 55 seconds to return results, it's insane wait time.

The geogCol1 is geography with spatial index. On tblProperties, vvery column is a varchar and none are indexed yet. I'm not very good with indexing in sql. Can someone help me out?

Thanks.
0
Comment
Question by:poweraddict
  • 12
  • 10
  • 7
  • +2
36 Comments
 
LVL 6

Expert Comment

by:ianmills2002
ID: 38314844
Firstly, put an index on ZIP

Secondly, I don't think you need to use the sub query
...
zip in (SELECT h.zip FROM zip g JOIN zip h on g.zip <> h.zip AND 
g.zip = '32244' AND h.zip <> '32244' WHERE g.GeogCol1.STDistance(h.GeogCol1)<=(10 * 
1609.344)
)

Open in new window



try the following to remove the what looks like an unrequired table join

... zip in (SELECT h.zip FROM zip h where g.GeogCol1.STDistance(h.GeogCol1)<=(10 * 1609.344) and h.zip <> '32244')

Open in new window


Hopefully this helps.

Ian
0
 
LVL 9

Expert Comment

by:Evan Cutler
ID: 38314851
Here is a training on indexes I have found useful:
http://sqlserverplanet.com/indexes/sql-server-indexes-tutorial

I would look into indexes on the ZIP column on those tables.
Try testing your execution plan over there....
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38314860
SHOULD IT BE CLUSTERED OR NON CLUSTERED?

Thanks
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 9

Expert Comment

by:Evan Cutler
ID: 38314873
Here's another article on it.
http://itknowledgeexchange.techtarget.com/sql-server/back-to-basics-clustered-vs-nonclustered-indexes-whats-the-difference/

http://shannonlowder.com/2009/01/sql-102-clustered-vs-nonclustered-indexes/

It depends on the rest of your application and how it interfaces with your SQL Server.  I hope that you find your answer here.
0
 
LVL 6

Expert Comment

by:ianmills2002
ID: 38314896
Non Clustered Index should be fine.
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38316113
Also, you should change UNION to UNION ALL.  There is no need for the SQL Server Engine to sort all the resulting data to remove duplicates, as there should not be any.
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38316137
But actually your biggest bottleneck is the UDF GeogCol1.STDistance().  As it stands now it will execute once for practically all rows in your zip file (50K ?), and you need to find a way around that.
0
 
LVL 70

Accepted Solution

by:
Scott Pletcher earned 668 total points
ID: 38316837
You don't need UNION [ALL] at all, just use an OR in the SELECT, like so:

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price,
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt
FROM tblProperties
WHERE ZIP = '32244' OR ZIP IN (SELECT h.zip FROM zip g JOIN zip h on g.zip <> h.zip AND
g.zip = '32244' AND h.zip <> '32244' WHERE g.GeogCol1.STDistance(h.GeogCol1)<=(10 *
1609.344) )

The zip table should be clustered on zip.

The tblProperties table should have at least a nonclus index on zip; don't have enough info to know if it should be clus on zip or not.

Since
    g.GeogCol1.STDistance(h.GeogCol1)
wouldn't seem to change, pre-compute it and store it in the zip table.  You can use a computed, persisted column or a native SQL column that you manually update/recompute periodically, if needed (? doesn't seem likely but could be needed).

For example, for a computed, persisted column:

ALTER TABLE zip
ADD std_distance AS GeogCol1.STDistance(GeogCol1) PERSISTED;

Then:

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price,
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt
FROM tblProperties
WHERE ZIP = '32244' OR ZIP IN (SELECT h.zip FROM zip g JOIN zip h on g.zip <> h.zip AND
g.zip = '32244' AND h.zip <> '32244' WHERE g.std_distance <=(10 * 1609.344) )
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38318586
Thanks for the replies everyone.

Scott:


ALTER TABLE zip
ADD std_distance AS GeogCol1.STDistance(GeogCol1) PERSISTED;

The zip code is never the same, so I don't think this solution will work. It pulls all the zip codes close to the zip code specified. It's not always 32244...
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38318621
I cluster indexed the zip codes in the zip table, it made a performance difference but still 3 seconds on my local server and 13 seconds on my web server to search 100k rows.

Still slow.
0
 
LVL 6

Assisted Solution

by:ianmills2002
ianmills2002 earned 1332 total points
ID: 38318654
What about an index on the ZIP column in the tblProperties table?
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38318656
but still 3 seconds on my local server and 13 seconds on my web server to search 100k rows.
Right.  As I stated your problem is the UDF and you are right there is no way you can store that.  You could try pre-calculating it in an appropriately indexed temp table and then join against the temp table to see if that makes any difference.  Also UNION ALL should outperform a WHERE clause with an OR, but test it for yourself to see if there is any improvement.
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38318678
ian, I did that.
0
 
LVL 6

Expert Comment

by:ianmills2002
ID: 38318700
Sorry, I read your comment previously that you only added an index to the ZIP table, not to the tblProperties table.
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38318707
I did it to both
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38318974
Using just this takes 24 seconds

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price,
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt
FROM tblProperties WHERE zip in (SELECT zip FROM ZIP WHERE CITY='JACKSONVILLE' AND ST='FL') order by zip

This is insane...
0
 
LVL 6

Assisted Solution

by:ianmills2002
ianmills2002 earned 1332 total points
ID: 38318991
Is there an index that includes CITY and ST columns?

Check where the bottleneck is. Is it in the subquery

SELECT zip FROM ZIP WHERE CITY='JACKSONVILLE' AND ST='FL'

OR in the outer query

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price,
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt
FROM tblProperties WHERE zip in ('32244') order by zip
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38319010
Its the outer query

I've attached the execution plan
execution plan
0
 
LVL 6

Expert Comment

by:ianmills2002
ID: 38319026
Can you try running it without the ORDER BY ZIP

My understanding of a Clustered index is that the data is sorted in the order of the clustered index. If the index is on ZIP, then the order will already be correct.
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38319035
ian, I just did it, same execution time pretty much
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38320172
Using just this takes 24 seconds
How many rows are returned?  If it is thousands of rows, there may not be much you can do about that, at least not with SQL Server.

Assuming that you have an index on City+ST try it this way:
SELECT  Address,
        City,
        State,
        Zip,
        Longitude,
        Latitude,
        Host,
        URL,
        Price,
        ImgSrc,
        PropertyType,
        Bedrooms,
        Bathrooms,
        Squarefeet,
        Acres,
        Yearbuilt
FROM    tblProperties
WHERE   EXISTS ( SELECT 1
                 FROM   ZIP
                 WHERE  CITY = 'JACKSONVILLE'
                        AND ST = 'FL' )
ORDER BY zip

Open in new window


Or this:
SELECT  p.Address,
        p.City,
        p.State,
        p.Zip,
        p.Longitude,
        p.Latitude,
        p.Host,
        p.URL,
        p.Price,
        p.ImgSrc,
        p.PropertyType,
        p.Bedrooms,
        p.Bathrooms,
        p.Squarefeet,
        p.Acres,
        p.Yearbuilt
FROM    tblProperties p
        INNER JOIN (
                    SELECT  zip
                    FROM    ZIP
                    WHERE   CITY = 'JACKSONVILLE'
                            AND ST = 'FL'
                    GROUP BY zip
                   ) z ON p.zip = z.zip
ORDER BY p.zip

Open in new window

0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38320179
My understanding of a Clustered index is that the data is sorted in the order of the clustered index
You cannot and should not rely on that.
0
 
LVL 70

Expert Comment

by:Scott Pletcher
ID: 38320774
>>
ALTER TABLE zip
ADD std_distance AS GeogCol1.STDistance(GeogCol1) PERSISTED;

The zip code is never the same, so I don't think this solution will work. It pulls all the zip codes close to the zip code specified. It's not always 32244...
<<

Sorry, I mis-coded the use of it!

But I'm still confused.

It looks to me like only 32244 is passed to the STDistance function ... is the other zip involved in the function somehow?

Even then, couldn't you create a table with from zip and to zip and the distances precalculated, then lookup from that table?
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38320995
32244 was just used to show the code, in reality its a user entered parameter
0
 
LVL 70

Expert Comment

by:Scott Pletcher
ID: 38321030
Right, but couldn't you create a table with from zip and to zip and the STDistance already computed?

Yes, the table would be large, but when clustered on from zip and to zip, the lookups should still be very fast.
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38321060
well 48000 zip codes compared to 48000 zip codes every every combination would be enormous. Not even sure how to go about that.
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38321186
well 48000 zip codes compared to 48000 zip codes every every combination would be enormous.
DECLARE @ZIPs integer = 48000,
      @CurrZIP integer = 1,
      @ZIPXRef integer = 0

WHILE @CurrZIP < @ZIPs
      SELECT @ZIPXRef += @ZIPs - @CurrZIP,
            @CurrZIP += 1

SELECT @ZIPXRef
0
 
LVL 70

Expert Comment

by:Scott Pletcher
ID: 38321276
Hmm, technically can we only store from the lower zip to the higher zip?

That is, is 32244-->90210 the same value as 90210-->32244?

Ooh, do you use the 9 digit zip or just the 5 digit zips for this??
0
 
LVL 6

Author Comment

by:poweraddict
ID: 38321372
well it's find every zip within so many miles of a specified zip code.

5 digit zips are used.

This finds all zip codes within 10 miles of 32244

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price,
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt
FROM tblProperties
WHERE ZIP = '32244' OR ZIP IN (SELECT h.zip FROM zip g JOIN zip h on g.zip <> h.zip AND
g.zip = '32244' AND h.zip <> '32244' WHERE g.std_distance <=(10 * 1609.344) )
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38321476
That is, is 32244-->90210 the same value as 90210-->32244?
Right it is a Combination.  I only included one pair in my calculation and of course did not include 32244 to 32244.
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38321483
This finds all zip codes within 10 miles of 32244
That is good.  That means you can create a table like Scott is suggesting, but only include those rows that have a distance of 10 miles or less.  I suspect you can reduce the table down quite significantly from the 1 billion+ rows when combining them all.
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38321548
I just read the given definition of a Combination in Wiki and in your case it would be:
48000! / 2! (48000 - 2)!

However, I would not suggest you try and calculate 48000! using a calculator or even T-SQL as you will overflow very quickly!  So the iterative method I suggested earlier would be the best option.

Incidentally, adding those ZIP codes to the table that only are within 10 miles could be an interesting script on its own, for the reason that it could take quite a long time to execute.  But once done should prove very performant if indexed appropriately.  

The other problem you will have is if you only add 32244 - 90210 (and not the reverse) that means that you will have to make sure to search on the first and not the second.  Perhaps it would be easier and not too expensive to include both.

Perhaps it is time to take a second look at those spatial data types...
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 38321596
I guess I should have read the entire article. :)

The formula can be simplified in your case to:
48000 * 47999 / 2

Which is a lot simpler than the T-SQL code I posted earlier!
0
 
LVL 70

Expert Comment

by:Scott Pletcher
ID: 38321740
Exactly -- it's over a billion rows, but when properly indexed, that won't matter.  And each row would not be that large.

I would precompute all distances, so that you don't have to worry about what a missing value means -- does it mean it's less than nn miles or that it wasn't computed to begin with?

Since it's only two values, it's easy to check for the lesser one to make the first key.

If, however, you want to store both, you would still only have to compute one of them -- the rest could be inserted based on existing values.

That is, compute 32244-->90210, then insert the value from that computation into 90210-->32244.

Given that the values won't change, or will change only rarely (you could recompute them every week if you wanted to), I would pre-calc and store them.

Then you can do ALL distance lookups from zip to zip very quickly.
0
 
LVL 6

Expert Comment

by:ianmills2002
ID: 38323170
I have missed so much while I have been asleep.

I go back to what acperkins wrote...
"How many rows are returned?  If it is thousands of rows, there may not be much you can do about that..."

...and back to your last post to me about the performance is with the Select on the tblProperties table.

You got the SQL to run in 3 seconds and on the Web server as 13 seconds.
When other columns might you need to search on? Especially from the Web side of things. Your original query contains many columns that could possible be filtered on by any web search. e.g. Price, PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres.

If any of these others will be used to filter the search, then these columns could be added to either, the index containing ZIP that you have already created, or have other indexes built on the table containing any of the filtered columns. Exactly how many you might need depends on what filters may be required on the data.

Some examples, at my work we have a transaction table containing more than 14M rows. All reports and queries on this table contain at least APPROVED_DATE. This is one index. Each query may also filter on at least one other column. So I other indexes to improve response times for these queries. Other indexes are STATE and APPROVED_DATE, CUSTOMER and APPROVED_DATE and COST_CENTRE, CONTRACT and APPROVED_DATE. All indexes added to the table weren't setup initially, but found over time as the reporting requirements changed.
0
 
LVL 6

Author Closing Comment

by:poweraddict
ID: 38326965
thanks, with all your help I was able to speed it up dramatically
0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I have a large data set and a SSIS package. How can I load this file in multi threading?
A Stored Procedure in Microsoft SQL Server is a powerful feature that it can be used to execute the Data Manipulation Language (DML) or Data Definition Language (DDL). Depending on business requirements, a single Stored Procedure can return differe…
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function
Suggested Courses

840 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question