AaronAbend asked:

Auto-Update Statistics versus Update Statistics - are they the same?

If I run update statistics (with a sample), it samples a "representative sample" of the data.

If I turn autoupdate statistics on, does that simply run "update statistics" when the thresholds (roughly 20% of rows changed) are reached, OR does it sample only the changed data to determine how the stats should be modified?

The question is important because my client has been running with autostats OFF for some time, and the question is whether we should recalc statistics at 100% before turning autoupdate on.

I will award extra points if someone can tell me whether the 20% factor can be changed.
Qlemo:

AFAIK, autoupdate triggers an update statistics that does "only" a rescan by sampling. If you perform complete scans, you should not use autoupdate statistics.
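For example (just a sketch - the table name is a placeholder), a full scan can be combined with NORECOMPUTE so a later autoupdate does not overwrite it with a sampled one:

-- Full-scan statistics; NORECOMPUTE keeps autoupdate from replacing them with sampled stats
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN, NORECOMPUTE

-- Plain sampled update - roughly what an automatic update does
UPDATE STATISTICS dbo.BigTable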
AaronAbend (Asker):

You mean - if I do a complete scan on a regular basis.  If I do a full scan today but do not plan on doing another one for a month, auto update should be turned on as far as I know.

arbert:

That internal factor that forces the update of the statistics can't be changed....

How much data is in the database?  
Thanks Brett...

The database has a total of 1 billion rows - 3 tables are 200 million rows each (an issue I am trying to address separately). The bulk of the rest is distributed among six 80-million-row tables and a reasonable number of small tables. The DB is a bit overnormalized, but that is a long-term issue. The data has a useful life of 180 days max, so we have to purge on a regular basis.

They were defragging indexes every day. I have suggested they simply update stats on a regular basis rather than defragging (I'd be interested in your take on that).
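For reference, here is roughly what I mean (a sketch only - object names are placeholders, SQL 2000 syntax assumed):

-- Check fragmentation first (look at Scan Density / Logical Scan Fragmentation)
DBCC SHOWCONTIG ('dbo.BigTable')

-- Defrag only if the numbers warrant it
DBCC INDEXDEFRAG (0, 'dbo.BigTable', 'IX_BigTable_StartDate')

-- Otherwise, just refresh the statistics
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN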

dang!  I thought I had some big stuff....So how many updates take place on this data?  Is it pretty static like a data warehouse?
Actually it is a lot of inserts and updates. After 180 days, virtually all of the data in the largest 6 tables turns over.

Most of the records have a startdate and an enddate (necessary given the nature of the business). The enddate goes in as NULL and gets updated when the transaction is completed. Most of the updates are based on the primary key (the prior DBA put clusters on all PKs, all of which are IDENTITY columns - I am recommending clusters that support important queries and nonclustered PK indexes). Deletes/purges are based on dates.
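For what it's worth, the kind of change I am recommending looks roughly like this (sketch only - table and column names are made up):

-- Drop the existing clustered PK, re-add it as nonclustered
ALTER TABLE dbo.Txn DROP CONSTRAINT PK_Txn
ALTER TABLE dbo.Txn ADD CONSTRAINT PK_Txn PRIMARY KEY NONCLUSTERED (TxnID)

-- Cluster on the column the important queries and purges actually use
CREATE CLUSTERED INDEX CIX_Txn_StartDate ON dbo.Txn (StartDate)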

Can you validate (T or F is fine) the following statements (these guys are really nervous - I have only been consulting here about 2 weeks):

1. Turning autoupdate statistics ON at the database level has no immediate impact. The impact will be felt when the 20% thresholds are reached for individual tables, at which time the stats will be updated

2. We can turn off autoupdate and stop that from happening at any time (I know the answer but they keep asking me!)

3. Is there a way to modify an index to turn off autoupdate (for selected tables)? Does MSSQL have an Alter Index... command?  I have not seen any documentation but it would be nice.

Thanks!

Aaron
SOLUTION by arbert (available to members only)
So, the original question - is there any difference between running autoupdate and doing it manually when it is due (at 20%)? There is a feeling here that autoupdate is smart because it samples the changed data, whereas a manual stats update is not smart since it looks at all of the data....

(my last "one more thing..")
ASKER CERTIFIED SOLUTION (available to members only)
Clemens, I think you are correct. Do you have a reference on that statement handy? If so post it. Thanks to both of you.

"because it samples the changed data, whereas a manual stats update is not smart since it looks at all of the data...."

That's totally false--it's a random sampling. SQL Server doesn't know what data changed (it only tracks this at the table level); it just knows the threshold has been hit...
"That's totally false--it's a random sampling"  (look at profiler when both methods run and see how the stats are updated--that will show you the definitive answer ) :)
Excellent. Thanks again.
SQL Server DOES know which records were changed; it maintains a bitmap for this, which is used for replication.
[Ken Henderson: The Guru's Guide to SQL Server Architecture and Internals]

Theoretically this bitmap could be used to update statistics, but this would lead to overemphasis of new data.
For example, if you have one million rows scanned and change 200,000 of them, statistics rebuilt only from the changed rows would give each of them five times (1,000,000 / 200,000 = 5) the influence of the other 800,000...

By the way: often there is no statistically relevant change when updating non-key attributes; in that case an update of statistics will not give you better query plans.
I agree that SQL knows which records are changed, but why can't this information be used to modify the distributions in the statistics? There is absolutely no mathematical reason the new data could not be used to update statistics, assuming the original scan was 100%.

As for "often there is not statistical relevant change while updating non-key attributes" I disagree.  How many databases use current dates? Almost every database I have worked with has them, and they represent the new data, and without updates of the statistics new dates are not included in the statistics, right? (Correct me if I am wrong about that).  And those dates are frequently used as constraints. In my specific situation, the most recent data is 10 times more important than old data - and the stats do not support its retrieval it seems.

On one query I am working on the optimizer consistently chooses a plan that it predicted would be 5 times faster than my forced plan. In fact, my query was 8 times faster - the optimizer was off by a factor of 40. I am not at all sure this is a statistics problem since I generated full statistics for all tables and got the same results. I believe the real problem is in poor indexing (which is a much more common problem than bad stats if you ask me).
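One thing I plan to check (a sketch - object names are placeholders): whether the highest step in the histogram is anywhere near the newest dates.

-- Histogram for the date column's statistics; if the last RANGE_HI_KEY
-- is weeks behind MAX(StartDate), recent rows are invisible to the optimizer
DBCC SHOW_STATISTICS ('dbo.Txn', 'IX_Txn_StartDate')
SELECT MAX(StartDate) FROM dbo.Txn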
" which is used for replication."


Exactly--for replication.....
Statistics on compound indexes (containing more than one column) are stored as a histogram for the first column only. For the remaining columns SQL Server only stores a density value ("how many rows will I get for the same column value?").
From the viewpoint of maintaining statistics it's better to have many single-column indexes than one single, "big" one. It's said that SQL Server can effectively combine index results even if one index alone would not cover the query (it will calculate higher costs, though).
And: if an index would return more than about 10% of the rows for a single key value, that index will most likely not be used.

Regarding update statistics: I suggest switching off autoupdate and maintaining full-scan statistics manually every night or so.
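A minimal sketch of such a nightly job (assuming the undocumented sp_MSforeachtable helper; a cursor over the user tables works just as well):

-- Nightly: full-scan statistics for every user table
EXEC sp_MSforeachtable 'UPDATE STATISTICS ? WITH FULLSCAN'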

"3. Is there a way to modify an index to turn off autoupdate (for selected tables)? Does MSSQL have an Alter Index... command?  I have not seen any documentation but it would be nice. "

Correct answer is: yes - sp_autostats can do this... (syntax is not alter index...but the effect is the same)
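For example (table and index names are placeholders):

-- Turn autoupdate off for all statistics on one table
EXEC sp_autostats 'dbo.Txn', 'OFF'

-- Or for a single index/statistic only
EXEC sp_autostats 'dbo.Txn', 'OFF', 'IX_Txn_StartDate'

-- Called with just the table name, it reports the current settings
EXEC sp_autostats 'dbo.Txn'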

AA
Sorry, I'm totally confused that you said index--the original question was talking about updating column statistics, not index statistics. Why would you want to disable your index statistics? That's a little bit different than disabling "regular" column statistics.
Index statistics versus column statistics? As far as I know, the only difference is that there are some stats that are not built to support indexes - they are usually built ad-hoc when someone hits a column that does not have any indexing and SQL decides it needs to know more before it continues. But I regard them as the same for all practical purposes. Is that incorrect? Are there more subtleties than that? I am going to post this as a  new question by the way.
Index statistics are always maintained by SQL Server--100% sampling (it is the index)... Column statistics are based upon the sampling rate (unless you specify full scan) and aren't maintained (until, of course, the threshold is met and an auto update kicks in).
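If you want to see both kinds side by side (a sketch; SQL 2000 system tables assumed), the auto-created column statistics show up in sysindexes with _WA_Sys_ names:

-- Index statistics and auto-created column statistics for one table
SELECT name, rowmodctr, rowcnt
FROM sysindexes
WHERE id = OBJECT_ID('dbo.Txn')
  AND name IS NOT NULL   -- auto-created column stats are named like _WA_Sys_...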