?
Solved

SQL Disk utilization

Posted on 2011-05-02
14
Medium Priority
?
492 Views
Last Modified: 2012-05-11
Hi All, I have a custom application that uses SQL Server 2008 R2.

From the Resource monitor, disk activity usually resides around 50 KB/sec and the application performs as expected.  However, randomly, I will see the disk activity jump up to 10MB/sec and the app will come to a crashing halt with SQL timeouts.

What's the best way to troubleshoot this issue?
0
Comment
Question by:polaris101
  • 5
  • 5
  • 2
  • +1
14 Comments
 
LVL 5

Expert Comment

by:Eduardo Goicovich
ID: 35504945
HI,
 
   Do you know what is the app using from the server?
   To have an idea, create a sql server profiler session to record the activity from the application.
0
 

Author Comment

by:polaris101
ID: 35505051
The App is Microsoft Access 2007
0
 
LVL 5

Expert Comment

by:Eduardo Goicovich
ID: 35505122
Usually the problem comes from poor designed queries.

- Verify what kind of queries is using (use sql server profiler to get an excerpt, filtering by host or any other specific property)
- Once you get the culprit query, run on ssms to see the execution plan
- Check the indexes from the query
0
Configuration Guide and Best Practices

Read the guide to learn how to orchestrate Data ONTAP, create application-consistent backups and enable fast recovery from NetApp storage snapshots. Version 9.5 also contains performance and scalability enhancements to meet the needs of the largest enterprise environments.

 
LVL 13

Expert Comment

by:dwkor
ID: 35505503
Find most IO consuming queries with the script below and optimize them.

Keep in mind that timeouts could be related with the locking rather than IO but at the end query optimization should help with that too
SELECT TOP 50 
	SUBSTRING(qt.TEXT, (qs.statement_start_offset/2)+1,
		((
			CASE qs.statement_end_offset
				WHEN -1 THEN DATALENGTH(qt.TEXT)
				ELSE qs.statement_end_offset
			END - qs.statement_start_offset)/2)+1),
	qs.execution_count,
	(qs.total_logical_reads + qs.total_logical_writes) / qs.execution_count as [Avg IO],
	qs.total_logical_reads, qs.last_logical_reads,
	qs.total_logical_writes, qs.last_logical_writes,
	qs.total_worker_time,
	qs.last_worker_time,
	qs.total_elapsed_time/1000 total_elapsed_time_in_ms,
	qs.last_elapsed_time/1000 last_elapsed_time_in_ms,
	qs.last_execution_time,
	qp.query_plan
FROM 
	sys.dm_exec_query_stats qs
		CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
		CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY 
	[Avg IO] DESC
go

Open in new window

0
 

Author Comment

by:polaris101
ID: 35506913
Thanks Dwkor - maybe I'm looking in the wrong spot then...bc I just ran that script and the 'last execution time' does not display todays date...and the server really hung up a couple times today
0
 
LVL 13

Expert Comment

by:dwkor
ID: 35506972
This script works only on the scope of the plan cache. So if server has been restarted or query removed from the plan cache, it would not be included to results.

Run the attached selects. Post top 5 waits from the first one and result from the second one.
-- Shows TOP waits in the system
SELECT 
	wait_type, wait_time_ms, 
	convert(decimal(7,4), 100.0 * wait_time_ms / SUM(wait_time_ms) OVER()) AS [Percent]
from 
	 sys.dm_os_wait_stats
where 
	wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK'
	,'SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR', 'LOGMGR_QUEUE','CHECKPOINT_QUEUE'
	,'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH','BROKER_TASK_STOP',
	'CLR_MANUAL_EVENT','CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT'
	,'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN')
order by 
	[Percent] Desc
go

-- Shows Signal wait time (how long worker waits in Runnable queue) vs Total wait time
-- High % of signal waits could indicate CPU starvation
select 
	sum(signal_wait_time_ms) as [Signal Wait Time (ms)]
	,convert(decimal(7,4), 100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms)) as [% Signal waits]
	,sum(wait_time_ms - signal_wait_time_ms) as [Resource Wait Time (ms)]
	,convert(decimal(7,4), 100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms)) as [% Resource waits]
from
	sys.dm_os_wait_stats
go

Open in new window

0
 

Author Comment

by:polaris101
ID: 35507086
Great.  Here ya go...see attach. what are your thoughts? perf
0
 
LVL 13

Accepted Solution

by:
dwkor earned 2000 total points
ID: 35507280
Well, first 3 wait types could be excluded from analysis, you can add them to NOT IN list next time. If you do that, obviously it change Percent for the other events.

Async_Network_IO means that the clients don't consume data fast enough. In most part of the cases, it means that client code opens reader and read/process rows one by one. This is not good (client design) but at the end it would not hung the server.

CXPacket - it's parallelism issue. I don't know nature of your system, but if it's OLTP, it could be the sign of non-optimized queries. You can try to reduce MaxDOP a little bit although you have to test it. One other thing - if your server is HARDWARE NUMA - make sure that MaxDOP <= # of CPU per NUMA node. Otherwise parallel plans could easily timeout if it involves CPU from different NUMA node.

PageIOLatch and BackupIO are IO and/or memory related waits - but those are really hard to say.

17% of signal waits is acceptable but very close to CPU bottleneck point. This is the time when query waits for available CPU versus waits for some resources.

So unfortunately it's hard to say based on those waits what exactly is going on. Need to have more info.

Speaking of IO expensive queries - how many of those queries are IO intensive (large # in AVG io)? How often (many times) are they running? How many of the queries have Avg IO > 10,000? Does the time of "hung ups" correspond to backups?

For now I'd suggest to clear wait statistics (would be interesting to see waits again after some time with: DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR). This is safe to do on production system as long as you don't use any 3rd party monitoring tools (it could screws up the analysis). Next, you can run blocking process trace (http://msdn.microsoft.com/en-us/library/ms191168.aspx) - set blocked process threshold to 15 seconds for example. Last, but not least, if you have any blocking during the time when server hung up, try to run the script below and see if there are any blockers.

Obviously we can also try to use extended events to notify us about blocking, but this one is a bit complex - so let's start with those ones.
select
	TL1.resource_type
	,DB_NAME(TL1.resource_database_id) as [DB Name]
	,CASE TL1.resource_type
		WHEN 'OBJECT' THEN OBJECT_NAME(TL1.resource_associated_entity_id, TL1.resource_database_id)
		WHEN 'DATABASE' THEN 'DB'
		ELSE
			CASE 
				WHEN TL1.resource_database_id = DB_ID() 
				THEN
					(
						select OBJECT_NAME(object_id, TL1.resource_database_id)
						from sys.partitions
						where hobt_id = TL1.resource_associated_entity_id
					)
				ELSE
					'(Run under DB context)'
			END
	END as ObjectName
	,TL1.resource_description
	,TL1.request_session_id
	,TL1.request_mode
	,TL1.request_status
	,WT.wait_duration_ms as [Wait Duration (ms)]
	,(
		select
			SUBSTRING(
				S.Text, 
				(ER.statement_start_offset / 2) + 1,
				((
					CASE 
						ER.statement_end_offset
					WHEN -1 
						THEN DATALENGTH(S.text)
						ELSE ER.statement_end_offset
					END - ER.statement_start_offset) / 2) + 1)		
		from 
			sys.dm_exec_requests ER 
				cross apply sys.dm_exec_sql_text(ER.sql_handle) S
		where
			TL1.request_session_id = ER.session_id
	 ) as [Query]
from
	sys.dm_tran_locks as TL1 join sys.dm_tran_locks TL2 on
		TL1.resource_associated_entity_id = TL2.resource_associated_entity_id
	left outer join sys.dm_os_waiting_tasks WT on
		TL1.lock_owner_address = WT.resource_address and TL1.request_status = 'WAIT'

where
	TL1.request_status <> TL2.request_status and
	(
		TL1.resource_description = TL2.resource_description OR
		(TL1.resource_description is null and TL2.resource_description is null)
	)
go

Open in new window

0
 
LVL 13

Expert Comment

by:dwkor
ID: 35507293
> PageIOLatch and BackupIO are IO and/or memory related waits - but those are really hard to say.

Hate that you cannot edit the message - I meant that % of those waits not necessarily indicates the issues at this point.  
0
 

Author Comment

by:polaris101
ID: 35507539
Ok, a lot of that is a little over my head.  I just inherited this issue and only have basic SQL DBA knowledge.  Can you point me towards some good SQL troubleshooting documentation?  Thanks!
0
 
LVL 13

Expert Comment

by:dwkor
ID: 35507627
Try to download those whitepapers: http://download.microsoft.com/download/D/B/D/DBDE7972-1EB9-470A-BA18-58849DB3EB3B/TShootPerfProbs2008.docx

Frankly if I had time, I would start with some reading about internals. My favorite one is Kalen Delaney "Inside Microsoft SQL Server series". Start with this one: http://www.amazon.com/Microsoft®-SQL-Server®-2008-Internals/dp/0735626243/
0
 
LVL 21

Expert Comment

by:mastoo
ID: 35508862
Are you looking at disk io by process in the Resource Monitor, so that you know it is sql server?

While the load is underway, run sp_who2 from two different windows and compare the disk io between them.  This leads to the spid that is causing it.  Run activity monitor to get details on that spid.

Or run Sql Server Profiler using the standard template and you'll see activity flash by.  Look for big disk io numbers and you have your culprit.  Once you identify the sql, it likely leads to sql that needs to be fixed or a table that needs an index.
0
 

Author Comment

by:polaris101
ID: 35509267
Thanks mastoo - what do you mean by running sp_who2 from two different windows?
0
 
LVL 21

Expert Comment

by:mastoo
ID: 35509528
From a query window in Sql Server Management Studio, run this:

sp_who2

It spits out numbers indicating io for the various spid's.  Opening two separate instances of SSMS and running it from each allows you to compare the different results.  io is total so what you really care about is the change during the few seconds between you running instance 1 and instance 2.
0

Featured Post

NFR key for Veeam Agent for Linux

Veeam is happy to provide a free NFR license for one year.  It allows for the non‑production use and valid for five workstations and two servers. Veeam Agent for Linux is a simple backup tool for your Linux installations, both on‑premises and in the public cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article explains how to reset the password of the sa account on a Microsoft SQL Server.  The steps in this article work in SQL 2005, 2008, 2008 R2, 2012, 2014 and 2016.
Restoring deleted objects in Active Directory has been a standard feature in Active Directory for many years, yet some admins may not know what is available.
This tutorial will walk an individual through locating and launching the BEUtility application to properly change the service account username and\or password in situation where it may be necessary or where the password has been inadvertently change…
This tutorial will walk an individual through the steps necessary to install and configure the Windows Server Backup Utility. Directly connect an external storage device such as a USB drive, or CD\DVD burner: If the device is a USB drive, ensure i…
Suggested Courses

864 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question