Solved

Slow query due to missing statistic - help on whether to rebuild stats

Posted on 2004-04-06
5
728 Views
Last Modified: 2008-01-16
Situation: 8 million row table1 with a 3-column concatenated clustered index (non-unique, not the primary key):

table1 (c1, c2, c3, ...)

select count(*) from table1 where c1=1234
500,000 rows in 1 second

select count(*) from table1 where c1=2345
600,000 rows in 60 seconds

WHY? The plans were identical, but Query Analyzer indicated that the top query would take 98% of the time even though it was 60 times faster. Really weird, I know (though apparently not for SQL Server).

Anyway, there was no statistic on c1, so I created one, and the problem went away: sub-second performance for both queries, with the query optimizer showing 50% of the time for each.

This 8 million row table is a small version of a table with 150 million rows. We are thinking of building the stat on the big table.
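
For reference, the statement I ran on the small table was essentially the one below (the statistic name is just illustrative):

-- Single-column statistic on the lead column of the clustered index
CREATE STATISTICS stat_c1 ON table1 (c1)

-- The open question is whether to run the same thing on the 150M-row table,
-- and whether to pay for a full scan there:
-- CREATE STATISTICS stat_c1 ON <big table> (c1) WITH FULLSCAN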

Can anyone suggest a reason this statistic helped? Do you need a statistic on the lead column of a clustered index? Any reason the performance was so slow on one but not the other?

Not a critical problem but a big puzzle, though the decision to add the statistic has to be made. One more thing - is it possible that the statistics were out of date but were not showing "out of date" in Query Analyzer (all indexes are set to auto update statistics, by the way)?

Thanks

NOTE: I WILL AWARD THE POINTS ON THIS QUESTION BEFORE 1 PM TODAY AS A THANKS FOR ANY DECENT ADVICE I CAN GET!

Question by:AaronAbend
5 Comments
 
LVL 12

Assisted Solution

by:monosodiumg
monosodiumg earned 25 total points
The stats can only affect the query engine's plan, so rebuilding your stats must have changed the query plan. Maybe the pre-rebuild plan happened to be OK for c1=1234 but was poor for c1=2345, which could happen if there had been a lot of change in the c1=2345 data.
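
To check whether that is what happened, something like the following shows when the existing stats were last updated and what the histogram on the lead column looks like (the index name below is a placeholder for whatever your clustered index is called):

-- Histogram and last-update info for the clustered index's statistics
DBCC SHOW_STATISTICS ('table1', 'name_of_clustered_index')
-- Last update date by index id (1 = the clustered index)
SELECT STATS_DATE(OBJECT_ID('table1'), 1)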

 
LVL 7

Accepted Solution

by:ChrisFretwell
ChrisFretwell earned 100 total points
Rebuilding your stats is a good thing if your data has changed or there is no stat available.

When the actual running time of the queries differs that much from the plan's estimated percentages, as yours did, that is a key indicator that statistics would help.
Pre stats: in the plan, SQL possibly assumed the data for the second query might still be cached and so would be faster, but once the first one ran it found that was not the case and the second one ran differently.
Post stats: SQL knew what the distribution of the data was, so it knew more accurately the best way to search for it and was able to do so faster.

You don't need stats on fields you don't search by. But if you want the 'intelligence' that was built into determining the best route, then SQL will perform best with complete information.

Think of it this way. You're planning on driving somewhere new: you look at a map and pick the shortest route, thinking it will be the fastest, then find out it takes you an extra 3 hours because the shortest route has major construction. But if, along with your map, you have a traffic report that says 'major delays on route B due to construction', then you have enough information to truly pick the fastest route. Simple analogy, but it gives you an idea.
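
If you want to see how far off SQL's information is, compare its estimates against the actual rows, for example (substitute your own query; SHOWPLAN_ALL shows the estimate without executing, STATISTICS PROFILE executes the query and shows estimated vs actual rows per operator):

-- Estimates only, the query is not executed
SET SHOWPLAN_ALL ON
GO
SELECT COUNT(*) FROM table1 WHERE c1 = 2345
GO
SET SHOWPLAN_ALL OFF
GO
-- Execute and report estimated vs actual row counts side by side
SET STATISTICS PROFILE ON
SELECT COUNT(*) FROM table1 WHERE c1 = 2345
SET STATISTICS PROFILE OFF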
 
LVL 10

Author Comment

by:AaronAbend
msg - I agree with your theory, and I just double-checked: the plans for this query are exactly the same. It is the plans for the more sophisticated queries that led to this analysis that are actually slower. These queries do perform the same, but a query with a join to another large table does not.

Following up on some ideas in Chris's note, I checked the plan more carefully and noticed that the estimated row count for the good query was 1 row, while the bad query showed an estimated row count of 500,000!

Order of queries had no impact; one particular constraint was always slower.

More info: I built stats using the default sample and NOTHING improved. I "sampled" the entire table and that fixed the problem. Ironically, the query that returned the MOST rows (600K) ran FASTER.

Also, these queries are not inside procedures, so plans are regenerated on a regular basis (data changes constantly).

Bottom line: Whatever process determines estimated row count is seriously wrong. Any additional thoughts would be appreciated! Thanks!
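
For anyone following along, the two rebuilds I tried were roughly these (default sample first, then full scan):

-- Default sampling - did NOT change the estimates for me
UPDATE STATISTICS table1
-- Full scan - fixed the estimated row counts
UPDATE STATISTICS table1 WITH FULLSCAN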

 
LVL 10

Author Comment

by:AaronAbend
I just realized - the query plan that showed the higher estimated row count generated the best plan, because the data behind that plan was essentially correct. The fact that more records were being returned by that query was irrelevant.

I have found postings elsewhere that indicate that the processes that maintain stats are not always effective (http://www.sql-server-performance.com/statistics.asp)
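
For reference, the kind of thing that can be checked on SQL 2000 to see whether auto update has been keeping up (rowmodctr is the number of modifications since the statistics were last updated):

-- Is auto update statistics enabled for this table's indexes and stats?
EXEC sp_autostats 'table1'
-- How stale is each statistic?
SELECT name, rowmodctr, STATS_DATE(id, indid) AS last_updated
FROM sysindexes
WHERE id = OBJECT_ID('table1')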

I have awarded the points as I promised but more comments would be welcome.

Thanks,

Aaron
 
LVL 12

Expert Comment

by:monosodiumg
>Whatever process determines estimated row count is seriously wrong
Not necessarily. As I said, it could be that you have a very uneven distribution of index values in the data, leading, through bad luck, to a very poor sample for stats calculation.
The fact that rebuilding the stats over the full set eliminates the problem supports that idea.
For more details:
http://www.sql-server-performance.com/statistics.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/html/statquery.asp
http://www.winnetmag.com/SQLServer/Article/ArticleID/5660/5660.html
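
A quick (purely illustrative) way to see how skewed the lead column really is:

-- Rows per c1 value; a heavily skewed result would explain why a small
-- sample can produce a misleading histogram
SELECT c1, COUNT(*) AS rows_per_value
FROM table1
GROUP BY c1
ORDER BY rows_per_value DESC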
