Link to home
Start Free TrialLog in
Avatar of poweraddict
poweraddictFlag for United States of America

asked on

need help indexing my database

I run this query repeatedly, with different zip codes.

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price, 
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt FROM 
tblProperties WHERE ZIP IN (SELECT h.zip FROM zip g JOIN zip h on g.zip <> h.zip AND 
g.zip = '32244' AND h.zip <> '32244' WHERE g.GeogCol1.STDistance(h.GeogCol1)<=(10 * 
1609.344) ) UNION SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price, 
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt FROM 
tblProperties WHERE ZIP = '32244';

Open in new window


How can I index my table to increase loading time. This query takes 55 seconds to return results, it's insane wait time.

The geogCol1 is geography with spatial index. On tblProperties, vvery column is a varchar and none are indexed yet. I'm not very good with indexing in sql. Can someone help me out?

Thanks.
Avatar of ianmills2002
ianmills2002
Flag of Australia image

Firstly, put an index on ZIP

Secondly, I don't think you need to use the sub query
...
zip in (SELECT h.zip FROM zip g JOIN zip h on g.zip <> h.zip AND 
g.zip = '32244' AND h.zip <> '32244' WHERE g.GeogCol1.STDistance(h.GeogCol1)<=(10 * 
1609.344)
)

Open in new window



try the following to remove the what looks like an unrequired table join

... zip in (SELECT h.zip FROM zip h where g.GeogCol1.STDistance(h.GeogCol1)<=(10 * 1609.344) and h.zip <> '32244')

Open in new window


Hopefully this helps.

Ian
Here is a training on indexes I have found useful:
http://sqlserverplanet.com/indexes/sql-server-indexes-tutorial

I would look into indexes on the ZIP column on those tables.
Try testing your execution plan over there....
Avatar of poweraddict

ASKER

SHOULD IT BE CLUSTERED OR NON CLUSTERED?

Thanks
Here's another article on it.
http://itknowledgeexchange.techtarget.com/sql-server/back-to-basics-clustered-vs-nonclustered-indexes-whats-the-difference/

http://shannonlowder.com/2009/01/sql-102-clustered-vs-nonclustered-indexes/

It depends on the rest of your application and how it interfaces with your SQL Server.  I hope that you find your answer here.
Non Clustered Index should be fine.
Avatar of Anthony Perkins
Also, you should change UNION to UNION ALL.  There is no need for the SQL Server Engine to sort all the resulting data to remove duplicates, as there should not be any.
But actually your biggest bottleneck is the UDF GeogCol1.STDistance().  As it stands now it will execute once for practically all rows in your zip file (50K ?), and you need to find a way around that.
ASKER CERTIFIED SOLUTION
Avatar of Scott Pletcher
Scott Pletcher
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks for the replies everyone.

Scott:


ALTER TABLE zip
ADD std_distance AS GeogCol1.STDistance(GeogCol1) PERSISTED;

The zip code is never the same, so I don't think this solution will work. It pulls all the zip codes close to the zip code specified. It's not always 32244...
I cluster indexed the zip codes in the zip table, it made a performance difference but still 3 seconds on my local server and 13 seconds on my web server to search 100k rows.

Still slow.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
but still 3 seconds on my local server and 13 seconds on my web server to search 100k rows.
Right.  As I stated your problem is the UDF and you are right there is no way you can store that.  You could try pre-calculating it in an appropriately indexed temp table and then join against the temp table to see if that makes any difference.  Also UNION ALL should outperform a WHERE clause with an OR, but test it for yourself to see if there is any improvement.
ian, I did that.
Sorry, I read your comment previously that you only added an index to the ZIP table, not to the tblProperties table.
I did it to both
Using just this takes 24 seconds

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price,
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt
FROM tblProperties WHERE zip in (SELECT zip FROM ZIP WHERE CITY='JACKSONVILLE' AND ST='FL') order by zip

This is insane...
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Its the outer query

I've attached the execution plan
execution plan
Can you try running it without the ORDER BY ZIP

My understanding of a Clustered index is that the data is sorted in the order of the clustered index. If the index is on ZIP, then the order will already be correct.
ian, I just did it, same execution time pretty much
Using just this takes 24 seconds
How many rows are returned?  If it is thousands of rows, there may not be much you can do about that, at least not with SQL Server.

Assuming that you have an index on City+ST try it this way:
SELECT  Address,
        City,
        State,
        Zip,
        Longitude,
        Latitude,
        Host,
        URL,
        Price,
        ImgSrc,
        PropertyType,
        Bedrooms,
        Bathrooms,
        Squarefeet,
        Acres,
        Yearbuilt
FROM    tblProperties
WHERE   EXISTS ( SELECT 1
                 FROM   ZIP
                 WHERE  CITY = 'JACKSONVILLE'
                        AND ST = 'FL' )
ORDER BY zip

Open in new window


Or this:
SELECT  p.Address,
        p.City,
        p.State,
        p.Zip,
        p.Longitude,
        p.Latitude,
        p.Host,
        p.URL,
        p.Price,
        p.ImgSrc,
        p.PropertyType,
        p.Bedrooms,
        p.Bathrooms,
        p.Squarefeet,
        p.Acres,
        p.Yearbuilt
FROM    tblProperties p
        INNER JOIN (
                    SELECT  zip
                    FROM    ZIP
                    WHERE   CITY = 'JACKSONVILLE'
                            AND ST = 'FL'
                    GROUP BY zip
                   ) z ON p.zip = z.zip
ORDER BY p.zip

Open in new window

My understanding of a Clustered index is that the data is sorted in the order of the clustered index
You cannot and should not rely on that.
>>
ALTER TABLE zip
ADD std_distance AS GeogCol1.STDistance(GeogCol1) PERSISTED;

The zip code is never the same, so I don't think this solution will work. It pulls all the zip codes close to the zip code specified. It's not always 32244...
<<

Sorry, I mis-coded the use of it!

But I'm still confused.

It looks to me like only 32244 is passed to the STDistance function ... is the other zip involved in the function somehow?

Even then, couldn't you create a table with from zip and to zip and the distances precalculated, then lookup from that table?
32244 was just used to show the code, in reality its a user entered parameter
Right, but couldn't you create a table with from zip and to zip and the STDistance already computed?

Yes, the table would be large, but when clustered on from zip and to zip, the lookups should still be very fast.
well 48000 zip codes compared to 48000 zip codes every every combination would be enormous. Not even sure how to go about that.
well 48000 zip codes compared to 48000 zip codes every every combination would be enormous.
DECLARE @ZIPs integer = 48000,
      @CurrZIP integer = 1,
      @ZIPXRef integer = 0

WHILE @CurrZIP < @ZIPs
      SELECT @ZIPXRef += @ZIPs - @CurrZIP,
            @CurrZIP += 1

SELECT @ZIPXRef
Hmm, technically can we only store from the lower zip to the higher zip?

That is, is 32244-->90210 the same value as 90210-->32244?

Ooh, do you use the 9 digit zip or just the 5 digit zips for this??
well it's find every zip within so many miles of a specified zip code.

5 digit zips are used.

This finds all zip codes within 10 miles of 32244

SELECT Address, City, State, Zip, Longitude, Latitude, Host, URL, Price,
ImgSrc,PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres, Yearbuilt
FROM tblProperties
WHERE ZIP = '32244' OR ZIP IN (SELECT h.zip FROM zip g JOIN zip h on g.zip <> h.zip AND
g.zip = '32244' AND h.zip <> '32244' WHERE g.std_distance <=(10 * 1609.344) )
That is, is 32244-->90210 the same value as 90210-->32244?
Right it is a Combination.  I only included one pair in my calculation and of course did not include 32244 to 32244.
This finds all zip codes within 10 miles of 32244
That is good.  That means you can create a table like Scott is suggesting, but only include those rows that have a distance of 10 miles or less.  I suspect you can reduce the table down quite significantly from the 1 billion+ rows when combining them all.
I just read the given definition of a Combination in Wiki and in your case it would be:
48000! / 2! (48000 - 2)!

However, I would not suggest you try and calculate 48000! using a calculator or even T-SQL as you will overflow very quickly!  So the iterative method I suggested earlier would be the best option.

Incidentally, adding those ZIP codes to the table that only are within 10 miles could be an interesting script on its own, for the reason that it could take quite a long time to execute.  But once done should prove very performant if indexed appropriately.  

The other problem you will have is if you only add 32244 - 90210 (and not the reverse) that means that you will have to make sure to search on the first and not the second.  Perhaps it would be easier and not too expensive to include both.

Perhaps it is time to take a second look at those spatial data types...
I guess I should have read the entire article. :)

The formula can be simplified in your case to:
48000 * 47999 / 2

Which is a lot simpler than the T-SQL code I posted earlier!
Exactly -- it's over a billion rows, but when properly indexed, that won't matter.  And each row would not be that large.

I would precompute all distances, so that you don't have to worry about what a missing value means -- does it mean it's less than nn miles or that it wasn't computed to begin with?

Since it's only two values, it's easy to check for the lesser one to make the first key.

If, however, you want to store both, you would still only have to compute one of them -- the rest could be inserted based on existing values.

That is, compute 32244-->90210, then insert the value from that computation into 90210-->32244.

Given that the values won't change, or will change only rarely (you could recompute them every week if you wanted to), I would pre-calc and store them.

Then you can do ALL distance lookups from zip to zip very quickly.
I have missed so much while I have been asleep.

I go back to what acperkins wrote...
"How many rows are returned?  If it is thousands of rows, there may not be much you can do about that..."

...and back to your last post to me about the performance is with the Select on the tblProperties table.

You got the SQL to run in 3 seconds and on the Web server as 13 seconds.
When other columns might you need to search on? Especially from the Web side of things. Your original query contains many columns that could possible be filtered on by any web search. e.g. Price, PropertyType, Bedrooms, Bathrooms, Squarefeet, Acres.

If any of these others will be used to filter the search, then these columns could be added to either, the index containing ZIP that you have already created, or have other indexes built on the table containing any of the filtered columns. Exactly how many you might need depends on what filters may be required on the data.

Some examples, at my work we have a transaction table containing more than 14M rows. All reports and queries on this table contain at least APPROVED_DATE. This is one index. Each query may also filter on at least one other column. So I other indexes to improve response times for these queries. Other indexes are STATE and APPROVED_DATE, CUSTOMER and APPROVED_DATE and COST_CENTRE, CONTRACT and APPROVED_DATE. All indexes added to the table weren't setup initially, but found over time as the reporting requirements changed.
thanks, with all your help I was able to speed it up dramatically