Avatar of dbaSQL

asked on

RESOURCE_SEMAPHORE waits - unexplained performance problem, BIG

Windows Server 2003, SQL Server 2008 SP4.
max server memory (MB) = 61440
min server memory (MB) = 0
maxdop = 0
cost threshold for parallelism (CTFP) = 50
optimize for adhoc workloads = 1
Lock Pages in Memory / AWE enabled; the server is confirmed to be using more than 2 GB (running values verified as sketched below)
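The running values can be confirmed with something along these lines (a rough sketch; sys.configurations shows them without toggling 'show advanced options'):

-- Configured vs. running values for the settings listed above.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name IN ('max server memory (MB)',
               'min server memory (MB)',
               'max degree of parallelism',
               'cost threshold for parallelism',
               'optimize for ad hoc workloads',
               'awe enabled');

-- Lock Pages in Memory shows up as a 'locked pages' message in the error log at startup;
-- xp_readerrorlog is undocumented but commonly used for this kind of check.
EXEC sys.xp_readerrorlog 0, 1, N'locked pages';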

Long story short, the front end is an OLTP MS Access application; there are 140 clients, each running the app locally with ODBC connections to the SQL Server, and they are exhausting the system.  This has been escalating for the last month, every day worse than the one before.  Until now I saw nothing but repeated async_network_io waits, but as of yesterday I am also seeing a ton of resource_semaphore waits.  The users contact me multiple times a day, saying all performance is dead and nobody can work.  I check, see no blocking at all and a boatload of async_network_io and resource_semaphore waits.  I kill all connections (not my fix, it's theirs), and everything clears up for 15-20 minutes until it happens again.  I've done a lot of digging with sp_whoIsActive and sp_blitz/blitzIndex/blitzFirst, but I am not finding what is causing this.  The statements sitting behind the Access application forms and queries are all very poor in terms of design -- but I can't comfortably say that is the cause of this problem.  If it were, they'd have been having these problems for much longer.  Something is prompting a ton of these waits to happen now.

Are there any Expert suggestions?




Using sp_whoIsActive, there are no blocks and nothing going on in tempdb, just a ton of waits: async_network_io and resource_semaphore
Using sp_blitz, I identified huge occurrences of the resource_semaphore wait, but no clear indication of what specifically may be the cause (the memory grant queue check is sketched below)
Using sp_blitzIndex, I identified and resolved numerous missing indexes -- though I wonder if the Access application will see them by default, or if there is something that needs to be done to recognize/use the new indexes.  This change was made last week, and there was no visible improvement at all.
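Since RESOURCE_SEMAPHORE means a query is queued waiting for its workspace memory grant, the grant queue itself can be inspected directly. A rough sketch, with the column list trimmed to the essentials:

-- Sessions holding or waiting for a query memory grant; a NULL grant_time means the
-- request is still queued, which is exactly what shows up as RESOURCE_SEMAPHORE.
SELECT mg.session_id,
       mg.requested_memory_kb,
       mg.granted_memory_kb,
       mg.grant_time,
       mg.wait_time_ms,
       mg.queue_id,
       mg.wait_order,
       t.text AS sql_text
FROM sys.dm_exec_query_memory_grants AS mg
OUTER APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t
ORDER BY mg.requested_memory_kb DESC;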
Avatar of Megan Brooks

Troubleshooting this kind of issue is not my main thing but I am curious: roughly how large is the database in terms of data & indexes? Is it an ever-growing database that has been running for years? Do you have the usual maintenance jobs in place?

I have seen large databases slow to a crawl when they reach a certain size. It's not necessarily that the whole thing quits working but that, for example, some weak link process develops query plan issues due to who knows what (poor code, poor maintenance, bad luck) and places a huge drag on the overall system.

If it were my server, I would poke and fiddle until I found a culprit, fix that problem, and repeat. I don't have any general advice on how to do that, though. It all depends.

One issue I encountered involved huge growth of several databases over several years that resulted in the auto-update statistics issue addressed by Trace Flag 2371. Things would usually go south after a feature release, for some reason. Execution plans would go bad due to stale statistics on specific tables, and certain reports would stop working because of the performance drag. The workaround was to manually update statistics on those tables (always the same ones), which typically hadn't been auto-updated for several weeks. The fix? Turn on Trace Flag 2371. Apparently nobody knew about it. So that's one example of what can happen. Most of the time it's not that easy, though.
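For anyone who hits the same pattern, enabling the flag is a one-liner, though note it is only honored on SQL Server 2008 R2 SP1 and later builds, so it would not apply to a plain 2008 instance:

-- Lowers the auto-update statistics threshold as tables grow (2008 R2 SP1 and later only).
DBCC TRACEON (2371, -1);   -- global, lasts until the next restart
-- For a permanent setting, add -T2371 as a startup parameter instead.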
Avatar of dbaSQL

ASKER

It's fairly small: 66 GB data, 10 GB log.  Running Ola Hallengren's daily maintenance with index and statistics administration.  I checked indexes first thing; aside from several missing indexes that have since been created, there is no problem there.  I've identified the hottest statements repeatedly using DMVs and sp_whoisactive.  Sometimes I can run them myself with completely acceptable performance, while at other times they do not complete.  Those are the times when the resource_semaphore waits are the highest.  Again, this has been going on for about a month.  Everything is fine, I see normal activity, until someone contacts me about slow performance.  Within only a few minutes, the problem escalates to the point that the server is unusable.  Literally.  Kill all connections and everything frees up until it happens again.  Two minutes later, five or ten minutes later, maybe two hours later...it happens again.
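For reference, the "hottest statements" check is along these lines -- a rough sketch of a plan-cache DMV query, top cached statements by total elapsed time:

-- Top cached statements by total elapsed time, with the individual statement text
-- carved out of the batch via the statement offsets.
SELECT TOP (20)
       qs.execution_count,
       qs.total_elapsed_time / 1000 AS total_elapsed_ms,
       qs.total_worker_time / 1000  AS total_cpu_ms,
       qs.total_logical_reads,
       SUBSTRING(t.text,
                 qs.statement_start_offset / 2 + 1,
                 (CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(t.text)
                       ELSE qs.statement_end_offset
                  END - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
ORDER BY qs.total_elapsed_time DESC;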
Avatar of dbaSQL

ASKER

Incredibly surprising, but it was the removal of one index.  Having re-applied the index, the semaphore waits have subsided completely.
Not surprised. :-)

If it happens again, check for stale statistics before re-creating the index, to possibly give you a better idea of what is happening. I have a script somewhere that I have used, but I have not organized my collection since I retired and I don't know where it (the collection) is at the moment! Here is one that TechNet offers.

In the instance I described above, the statistics were good up until a week or so before the current date (especially after I got a weekly stats update job running -- the company hadn't had one for the prior year or two). The tables involved were gigantic, and turning on TF 2371 seems to have fixed it.
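As a rough sketch of the kind of staleness check described above (this is not the TechNet script, just an approximation that works on SQL Server 2008):

-- Last statistics update per statistic, plus the legacy rowmodctr from sys.sysindexes
-- as a rough count of modifications since that update.
SELECT OBJECT_NAME(s.object_id)            AS table_name,
       s.name                              AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated,
       si.rowmodctr                        AS approx_row_mods
FROM sys.stats AS s
JOIN sys.sysindexes AS si
     ON si.id = s.object_id
    AND si.indid = s.stats_id
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY last_updated;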
Avatar of dbaSQL

ASKER

I spoke too hastily.  The very same problem is back today.  I wonder if there are any Expert suggestions for what may be causing this, or at least what to look at.

One thing I wondered was whether the index changes I made in SQL Server could even be utilized at the Access application level without a corresponding adjustment in Access to the unique indexes used to create each of the ODBC-linked table keysets.  See the 'Adjusting Dynaset Behavior' section in this reference:

https://technet.microsoft.com/en-us/library/bb188204(v=sql.90).aspx
Well, there is the suggestion I posted here yesterday. If you need help with that, I have some free time today. First, however, it might be very helpful to identify the culprit query or queries and the associated tables.

I have done a lot of performance troubleshooting, but it was always hands on, on servers that I administered. Trying to do this in a forum is new to me, and I'm not sure how doable it is, especially in terms of effort expended vs. results obtained when I am not being paid. You might consider some 1:1 consulting time, not necessarily with me, if this is causing you major headaches.

As long as Access is executing its queries against SQL Server, SQL Server is in charge and will make use of whatever statistics and indexes it can.
Avatar of dbaSQL

ASKER

I had already checked the indexes and statistics --- they are being updated daily.  There are no outdated stats at all.  I have also identified the statements being run every time there is a problem. They are not optimally defined, but nothing is blocked, nothing is using tempdb, it is just a tremendous amount of waiting on ASYNC_NETWORK_IO and RESOURCE_SEMAPHORE.
Avatar of dbaSQL

ASKER

I even profiled the statements a couple of days ago and found that a large percentage of them were being cursored from the Access application, with durations as great as 63 to 120 seconds each.  There are many clients running concurrently, all running the same statements.  That strongly suggests the reason for the high async_network_io waits -- but this code has been in place for a very long time.  If that were the problem, we'd have been here long before now.  The only changes that have happened since this problem began escalating are index changes.  The only thing I can think is that this is much less about SQL Server than it is about the Office Access application and the way it is accessing the SQL Server data.
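The cursored requests themselves are visible from the server side; a rough sketch of the kind of check used:

-- Open API cursors per session, with the client host/program and the underlying SQL,
-- to see which Access clients are holding long-running cursored requests.
SELECT c.session_id,
       s.host_name,
       s.program_name,
       c.properties,
       c.creation_time,
       c.is_open,
       c.reads,
       c.writes,
       t.text AS cursor_sql
FROM sys.dm_exec_cursors(0) AS c
JOIN sys.dm_exec_sessions AS s
     ON s.session_id = c.session_id
OUTER APPLY sys.dm_exec_sql_text(c.sql_handle) AS t
ORDER BY c.creation_time;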
ASKER CERTIFIED SOLUTION
Avatar of Megan Brooks

This solution is only available to members of Experts Exchange.
Avatar of dbaSQL

ASKER

That is correct, they are not using stored procedures.  It is ad hoc, raw SQL statements.  Regardless, I must reiterate: this problem began on 9/5 and has seemed to escalate every day since then, except for yesterday, when things were very quiet.

I have captured the statements, run them in SSMS, and produced the execution plans -- many times.  Some statements are painfully lacking, others are fine.  But if these statements were the reason for this problem, it would have been happening long before 9/5.  Something else is going on here.  I believe it is within the Access application or possibly even at the ODBC level, but I cannot find it.

Wednesday evening I recreated ONE index that had been removed, and all of yesterday the system was quiet, no problems at all.  Now today, the problem is back.  Something else must be going on here.
Even though the queries haven't changed, the execution plans may have. This can be due to statistics, or growing database size, or "it depends" factors that are hard to categorize. If there aren't any indexes then I guess you can rule out index statistics. Hopefully you at least have default combined PK/clustered indexes: otherwise no wonder!

If the indexes are missing or not well designed, or if the queries themselves are badly written (usually not the case for generated code, but did you say "cursored?") then as the tables grow in size you may reach a tipping point where there isn't enough memory.

Is the information in this article of any help?
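One rough way to spot a recent plan change without a monitoring history is to look at the plan creation times for the heaviest cached statements (a sketch, assuming the plans are still in cache):

-- Heaviest cached statements with their plan creation time; a very recent creation_time
-- on a long-lived workload can point at a recompile or plan change worth examining.
SELECT TOP (10)
       qs.creation_time,
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count / 1000 AS avg_elapsed_ms,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_elapsed_time DESC;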
Avatar of dbaSQL

ASKER

I have checked the execution plans again and again.  They are the same.  The stats are being updated daily; I checked again today to be sure.  Thank you for your reference on the resource_semaphore waits, but I've already seen it.  Unfortunately, it did not help me to progress further with this problem.  I will say again, I really don't believe it is a problem on the SQL Server.  All of my checks show that SQL Server is fine.  It is not spiking or maxing out at all.
I don't see a problem with SQL Server per se, but it sure seems to be happening on the server - I say that because of the particular waits you are seeing. I'd say there is a problem with the workload having trouble obtaining enough memory to run. That could be query/index efficiency issues, cache bloat, an increase in the number of queries being run, or other running processes that are hogging memory. Or? ______________

"optimize for adhoc workloads = 1" should manage the cache bloat. I am assuming no major increase in demand from users. What about other running processes?

Is the server virtualized?
Avatar of dbaSQL

ASKER

No, the server is not virtualized.  As I mentioned before, this problem is escalating, every day getting worse.  If it were a matter of indexes or the queries being run from the Office Access application to SQL Server, these problems would have been occurring long before now.  The only known change that has occurred during this time was the implementation of new indexes -- which have all been rolled back, to see if that was the cause of the problem.  It was not; the problem persists.
I can believe that the issues you are seeing might not be caused by queries and indexes, but it is a fact that issues with queries and indexes can cause performance degradation over time, and can lead to a "tipping point" scenario where the drop in performance occurs suddenly. I say this because I have seen it happen in more than one large-scale SQL Server database.

So if you don't think that is what it is, OK, but don't rule it out altogether. And I would say that 66 GB might be large scale enough, although the ones where I saw it were into the terabytes. And I am still wondering why re-creating an index appeared to help, temporarily.

You seem to have pretty much ruled out issues external to the server except for one: SAN. Either problems with the SAN itself or with the network connection to it.

There is still potential for internal server problems external to SQL Server. Is the Windows system log free of unusual messages? (Or the application log, for that matter.) Has anyone reviewed pertinent server performance measures outside of SQL Server, such as can be monitored using PerfMon? Do you have a monitoring tool in place that provides an overview of both SQL Server and Windows Server performance?
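Even without a monitoring tool, a few of the relevant PerfMon-style counters can be read from inside SQL Server; a rough sketch, with the memory grant counters being the ones that matter most for RESOURCE_SEMAPHORE:

-- PerfMon-equivalent counters exposed through a DMV; 'Memory Grants Pending' staying
-- above zero for any length of time matches the RESOURCE_SEMAPHORE symptom.
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Memory Grants Pending',
                       'Memory Grants Outstanding',
                       'Page life expectancy',
                       'Target Server Memory (KB)',
                       'Total Server Memory (KB)');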
Avatar of dbaSQL

ASKER

>>You seem to have pretty much ruled out issues external to the server
I have not ruled out any issues, external to the server or otherwise.  My problem is that absolutely none of my diagnostics have yet given me any evidence for the cause of the problem.

>> And I would say that 66 GB might be large scale enough, although the ones where I saw it were into the terabytes. And I am still wondering why re-creating an index appeared to help, temporarily.
66 GB is fairly small, in my opinion.  Was the one index recreation the cause of the quiet day?  I really cannot think that it was.  In fact, I am inclined to think there is something at the API, the ODBC, or somewhere upstream from SQL Server that caused this 'quiet day'.  Before the end of the day the problems resumed, and it has been continual ever since.

>>Either problems with the SAN itself or with the network connection to it.
It's all local disk, but I did ask the IT group to review the hardware, specifically the CPU, disk, and NIC.  The Windows log is free of all messages with the exception of one warning (see below the carets), which has been logged almost every day for a very long time now.  I've asked them to review this and have been assured it is common and not the cause.  I am not a network specialist, so I can only provide them the detail that I have and hope the problem is researched and resolved, or found to be moot.  The little supporting detail here, https://technet.microsoft.com/en-us/library/cc735785(v=ws.10).aspx, suggests that this error will not keep the local instance from using the network, but may hinder other machines from reaching this one.  THAT is why I noted it to the IT group, but again, I was told it was a non-issue.

>>but it is a fact that issues with queries and indexes can cause performance degradation over time,
I have never ruled out indexes or queries.  In fact, I can see full well that they are lacking, but the point I am trying to stress is that the queries and the indexes have always been the way that they are, yet the system began becoming nearly unusable almost three weeks ago.  I am trying to identify what has changed and is causing this very significant behavioral change.



>>>>>>>>>>>>
The system failed to register host (A) resource records (RRs) for network adapter
with settings:

   Adapter Name : {90EE0E13-AF75-4EE1-9B8A-CB4A43DD6FA3}
   Host Name : richline_jts
   Primary Domain Suffix : domainName.local
   DNS server list :
           10.10.13.21, 10.10.13.8
   Sent update to server : 10.10.13.21
   IP Address(es) :
     10.10.13.11
Avatar of dbaSQL

ASKER

Megan, or any other Expert, I really must ask if there is any additional input.  The application starts up, and users are logged in for only a few minutes before the system becomes unusable.  Please understand, there are NO locks or blocking on the server, no queues or visible contention on disk or memory.  Using Glenn Berry's memory queries (https://sqlserverperformance.wordpress.com/2010/10/08/sql-server-memory-related-queries/), I have checked the memory state, and SQL Server seems quite content.  In fact, SQL tells me 'Available physical memory is high'.

And every 3 to 5 minutes throughout the day -- if not more frequently -- the users say the application is dead and they cannot access the server.  Every time I check, I DO see long-waiting cursored requests.  Not because they have coded cursors in the Access application, but because, for whatever reason, that is the way the Access application sends its requests to the SQL Server.  When this happens, it is not one or two or ten client applications.  It is ALL of them.  That means more than 100 client workstations using this Office Access application to hit SQL Server, and they all stop working at the same time.  'Go dead', 'stop working' -- these are the users' descriptions of the problem.

But again, every time I look at SQL Server during one of these times of contention, SQL Server has nothing to report to me except for every single session sitting with async_network_io -- 80 to 120 connections -- and a very small handful of resource_semaphore waits, 3 to 5 connections.  My gut says that these times when everything is 'dead' and the users cannot use the application are just SQL Server AND the client apps waiting for one or more of those cursored requests to complete.  Yet how can I prove this?  And if that is not it, then what else is it?  It appears as a queue on the server; it is a queue or bottleneck that is prompted by the applications, and it is cleared when the application connections are killed.  And every time it happens, I see nothing on SQL Server except for the async_network_io waits, suggesting SQL Server is fine and is just waiting to hear back from the clients.
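For what it is worth, the snapshot being taken at those moments looks roughly like this (a sketch; sp_whoisactive shows much the same thing):

-- Every active user request with its current wait type, wait time, client host and
-- statement text, to see exactly who is sitting in ASYNC_NETWORK_IO or
-- RESOURCE_SEMAPHORE and for how long.
SELECT r.session_id,
       s.host_name,
       s.program_name,
       r.status,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       r.total_elapsed_time,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
     ON s.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE s.is_user_process = 1
ORDER BY r.wait_time DESC;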


Does anyone have any insight?
Avatar of dbaSQL

ASKER

I'm going to go ahead and award the points and close this.  The problem is not gone, but we have quieted things tremendously by removing a few ORDER BYs in some of the uglier statements running at the client.  I would be extremely surprised if that were the cause of all of this terribly degraded performance -- but we know Access to SQL Server is not always the best, and any statement needs to be optimized if we expect any kind of optimal performance.  Thank you for all of your input, Megan.
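As a follow-up note, the statements with ORDER BYs can be found from the plan cache, since a Sort operator is what drives the large sort memory grants behind RESOURCE_SEMAPHORE. A rough sketch (the plan-cache XQuery can be expensive, so best run off-hours):

-- Cached statements whose plans contain a Sort operator, ordered by total elapsed time.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
       qs.execution_count,
       qs.total_elapsed_time / 1000 AS total_elapsed_ms,
       t.text AS sql_text,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
WHERE qp.query_plan.exist('//RelOp[@PhysicalOp="Sort"]') = 1
ORDER BY qs.total_elapsed_time DESC;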