Link to home
Start Free TrialLog in
Avatar of kyleitvss
kyleitvssFlag for Australia

asked on

Sybase Cache Management and Query Performance

I have a Sybase ASE database - 15GB RAM 6 COREs of IA64, running Red Hat Linux 2.4 (64Bit) Kernel, RAID 1 disk 2GB Fibre Channel Disks (bit old now), Single database approx 30GB, with 10GB of Shared Memory most of that data is not usually access - and should be in a DataMart. I am activey archiving data - reducing to 2 years of history - this will reduce the table sizes by 2/3rds. System is used to create Orders and do basic account management, Names / Address and Order Inquiries.

I have an order table with 3mil rows and order lines with 4mil rows and a few other related tables similar row counts used for inquiry purposes. I have allocated these tables (1GB and 500MB) their own significant caches and separate ones for their indexes (500MB).

Now the issue: we see a small number of  queries that are _very_slow - 20 - 30 seconds accessing data that should be in cache.  If I re-run the query (we have a system to trap slow SQL statements) directly the query is _always_ very fast - have checked for Deadlocks - none.

I assumed it was a cache issue, hence allocated a significant cache for the order related tables, The queries all use indexes - no table scans.

Just wondered if any one has had a similar experience. Trying to get some consistent performance. Could be a DISK IO issue, could be data is rolling out of cache - going to use the Relaxed LRU mode for the caches.
Avatar of Joe Woodhouse
Joe Woodhouse

sp_sysmon has a pretty clear cache management section. It has enough information to guide most cache tuning exercises. See if you can get an sp_sysmon output running during one of these slow queries and post the whole output here.

Many sites specifically "pre-warm" the cache to avoid this sort of thing - perhaps you could pre-run these slow queries yourself?

Still, it'd be better to figure out what's actually happening and if there's a cache issue. Can we also have the output of sp_monitorconfig "all" so we can check some of the "low-hanging fruit"?
Avatar of kyleitvss

ASKER

Hi Joe -  "pre-warm" the cache - that is what I  have been testing - and it works -. Since the post I have done a "reorg rebuild" on all the indexes for the affected tables.

Also have run sysmon for an hour (yesterday) - output attached

What I think I am seeing is the data 'rolls out' of the cache after a short while. I am assuming there is a high turn over - hence I increase the cache sizes for the table and the indexes and gave them exclusive names caches.
sp-sysmon.txt
sp-monitorconfig-all-sales.xls
Here is the monitorconfig-all file as a tab delimited file sp-monitorconfig-all-sales.csv
That's a lot of data caches. Any particular reason you split them up like that? Can we get the output of sp_helpcache and sp_cacheconfig too please?

These days I tend to think multiple named caches is not a great idea. The original motivation for this was to spread spinlock contention since each cache was protected by a single spinlock. However since it's been possible to create cache partitions (each with its own spinlock) the spinlock argument no longer holds as a reason for multiple caches.

Remember any one object can only use one cache, so when you create multiple caches you are making the total cache available to any object smaller...

You also seem to have a data cache called procedure cache which is puzzling me. It also doesn't seem to be used.

It would also be useful to see the sp_configure values for "global async prefetch limit", "open objects", "open indexes" and "procedure cache" please.

Sybase Engineer set up the originals and I added a few more. I'll do some research on your comments though - re the number

Files attached
sp-helpcache.csv
sp-cacheconfig.csv
sp-configure-values.csv
ASKER CERTIFIED SOLUTION
Avatar of Joe Woodhouse
Joe Woodhouse

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Bit confused with some of the numbers

Open Objects is already set to a Run Value of 2000
Open Indexed is set to a  Run Value of  750
   
So they are. Ignore those recommendations. I misread the columns I think. :)
Ok I'll consolidate the number of caches down - should I have

1. Data
2. Index
3. Log

And i have one for tempdb already
Found these:

execute sp_countmetadata "open objects"
execute sp_countmetadata "open indexes"
      
1      There are 2300 user objects in all database(s), requiring 4633 Kbytes of memory. The 'open objects' configuration parameter is currently set to 3000.

2      There are 1512 user indexes in all database(s), requiring 2718 Kbytes of memory. The 'open indexes' configuration parameter is currently set to 2000.
Unless you have specific benchmark results proving performance improves with an index cache vs without, I don't see the point. If it's a page you need often, it will tend to remain in cache anyway.

Log cache is a good idea. Make sure you create it as a log_only cache. It should be small - 100Mb is plenty for most sites unless you have extremely heavy trigger use and replication - and don't forget to bind syslogs in your user databases to it. (Requires single user mode per database binding.) 90% of its memory should be in a "two-page" pool - you seem to have a 2Kb page server, so that means 90Mb of 4Kb pool and 10Mb of 2K.

Tempdb cache is also something I'll only do after benchmarking proves it helps.

You should have a repeatable benchmark you can run before making any changes at all so you can prove this helped. Pretty much the only way to make cache change tests repeatable is to always start with the same cache state - ie a cold cache, ie an ASE reboot prior to each benchmark run.

I am not guaranteeing you these changes will help!! :) We can't know without testing them.
Also metadata caches seem ok, no need for any changes.
Ok I will have a go at this evening
Also Thank you for your help in reviewing my question
No problem, that's why we sign up to EE. :)

Remember you can't drop a cache while objects are bound to it, and you can't bind system tables (like syslogs) unless their database is in single-user mode. Good luck!
Have updated the 1st server and things are a lot better - will do the second server Sunday Night.

I as suspecting now that there is an physical I/O issue to consider too. Over all performance is good - but queries are still taking way to long - and they are fine once the data is actually in the cache.

I'll Run a sysmon for a few hours
FYI a few hours may be too long for any one sp_sysmon as it uses integer counters which overflow with large samples. If it does, try it with 30 or even 20 minute samples.

If you can identify a good sample period when the server was as its busiest and performance was at its worst, you're welcome to post the entire sysmon output here as an attachment. I'll see if anything (not just cache management) leaps out at me as needing attention.

Glad to hear you're seeing good results so far.

Re. I/Os you'll need to ask some questions at the O/S and storage layers to get a good feel for what's going on there. In my experience you have to start looking into obscure things like firmware updates for HBA cards connected to the SAN etc etc etc. Usually way beyond my competence! :)
Attached is full 20 min sysmon output.

I did some reading and saw a comment re  Address Lock Contention and Merge Lock Requests

  Address Lock Contention        4602.7           6.7     8284838     81.9 %

  Merge Lock Requests             per sec      per xact       count  % of total
  -------------------------  ------------  ------------  ----------  ----------
    Network Buffer Merge Locks                                                  
      Granted with no wait         3297.7           4.8     5935828      41.7 %
      Granted after wait           4602.7           6.7     8284801      58.3
        
        

sysmon-28022011---20mins.txt
This information would suggest there are no I/O issues?

  I/Os Delayed by
    Disk I/O Structures               n/a           n/a           0       n/a  
    Server Config Limit               n/a           n/a           0       n/a  
    Engine Config Limit               n/a           n/a           0       n/a  
    Operating System Limit            n/a           n/a           0       n/a  


  Total Requested Disk I/Os         452.1           0.7      813711            

  Completed Disk I/O's
    Engine 0                         45.9           0.1       82564      10.1 %
    Engine 1                         33.7           0.0       60696       7.5 %
    Engine 2                         32.4           0.0       58366       7.2 %
    Engine 3                         50.6           0.1       91152      11.2 %
    Engine 4                         58.7           0.1      105643      13.0 %
    Engine 5                         42.2           0.1       75962       9.3 %
    Engine 6                         61.1           0.1      110040      13.5 %
    Engine 7                         39.6           0.1       71287       8.8 %
    Engine 8                         43.6           0.1       78560       9.7 %
    Engine 9                         44.0           0.1       79207       9.7 %
  -------------------------  ------------  ------------  ----------            
  Total Completed I/Os              451.9           0.7      813477
The main things that stick out for me:

As you say, Address Lock Contention, but also Exclusive Address locks. These are both about locks on index pages blocking things needing to be done to data pages. The standard cure is datapages locking. Run sp_object_stats for 20-30 minutes during a busy period and it will help identify tables. Good candidates are small tables with high contention.

With your data caches, you still have one called "procedure cache" which appears to be doing nothing at all. You can get that memory back by deleting it.

Data_cache1 is looking after 25% of the total cache requests and hitting almost all of them in cache. You could possibly even make this smaller but you don't seem memory bound in your main default data cache so you can probably leave it alone. Maybe try switching it to "relaxed" mode since you're hitting close to 100% hit rate in it.

dwh_data_cache1 is doing almost nothing. I hope it is small. Perhaps it isn't needed at all.

Your index_cache is look after only 3% of the work. I hope it doesn't have more than 5% of the total cache memory. It could also be switched to "relaxed", but I would probably just delete it.

Your default data cache is showing 14% spinlock contention even with these named caches. I would set up cache partitions - try starting with 4. More importantly you have no large I/O capability here and this is likely hurting you. Try converting 25% of the 2Kb pool to 16Kb. Perhaps add 10Mb of 4Kb pool too.

Your log cache doesn't have a 4Kb pool. This is hurting you. Make sure your databases have had their syslogs tables bound to this cache. Switch the cache to both "relaxed" and "log only".

Your tempdb cache has no large I/O. This will also hurt. Switch one third of its 2Kb pool to 16Kb.


Nothing else screams for attention; fix the above and see what that does for you. Good luck!

And this that we are not waiting for the network

Network I/Os Delayed              0.0           0.0           0       0.0 %
I reduced the number of Worker Threads - they where 15 - I reduced them to 5 and the

Max parallel degree from 4 to 2
Max scan parallel degree from 4 to 2

re -ran sysmon and Address Lock Contention reduced to 17.2%

    Address Lock Contention         168.8          11.0      303827      17.2 %

That is a good thing?
I/Os Delayed simply means did you hit any configurable limits on how many outstanding I/Os can be open at once. You need O/S level stats to say for sure there were no queue waits on disk. sp_sysmon can't tell us that. It can imply it - look at Disk I/O Checks, Checks Returning I/O: 26.8% of the time ASE asked if a disk I/O was done yet, the answer was no. This suggests (but does not prove) that ASE thinks the disks are slow.

Network I/Os Delayed unfortunately doesn't mean no network waits, just no events within ASE itself that delayed network activity. Again, you need O/S level stats for this.

Reducing Address Lock Contention is indeed a very good thing, but you may have hurt more than you helped by constraining parallel query. You'd want to benchmark that before and after to see. I still think sp_object_stats is a good thing to try under load. Do not rush immediately to datarows locking! Try datapages cautiously first on the smallest half of the top 20 most-contended tables.
OK - have updated the caches as suggested - other that adding partitions to Default Cache - will need to re-start the server later tonight for that one. I'll re-run sysmon for 30 mins now and then again tomorrow after updating the defaullt cache.

I did bind the syslog to the syslog cache - I deleted the procedure cache - and added the memory to the syslog_cache,  I'll delete the Index Cache tonight and add the memory to data cache 1 ? or Default ?
Most of our locking is already DataPages - few Datarows - Datarows does not seem to add much value.

Checks Returning I/O: 26.8% of the time ASE asked if a disk I/O was done yet, the answer was no. This suggests (but does not prove) that ASE thinks the disks are slow.

I too  think to disks are slow - if I run a dd if=/dev/raw/raw33 of=/dev/null bs=1024kb - the server gets very slow - am looking to use pcp - a Reh Hat performance tool along with my SGI service engineer.
I'd say default, data_cache1 is doing fine for cache misses, default data cache is the one that needs it the most.
Probably questions on anything other than data caches should go into a new question topic, BTW. EE prefer new questions not be added. My bad - I like looking at sysmons. :)
Will run sp_object_stats now...
Probably questions on anything other than data caches should go into a new question topic, BTW. EE prefer new questions not be added. My bad - I like looking at sysmons. :)

Ok - I'll leave that point (DISK I/O) for another question.

I am sure the changes are making a difference - however I am starting to really question the I/O as the root cause.
Caching ultimately is just a hack for dealing with slow I/O. If all your disks were enterprise SSDs you wouldn't need data cache at all. So yes, in a way it is impossible to sensibly discuss caching without talking about disk - good caching tuning can mask disk problems, for instance.
sp_object_stats:

DBCC execution completed. If DBCC printed error messages, contact a user with System Administrator (SA) role.         
1 row(s) affected.         
DBCC execution completed. If DBCC printed error messages, contact a user with System Administrator (SA) role.         
----------------------------         
No contention on any tables!         
Just to complete this - we have checked the storage and found that the Disk Array is operating at Max capacity and needs to be updated.

So thanks for all your help - its was very useful and will be put to good use when we get a new storage system