SQL Disk utilization

Hi All, I have a custom application that uses SQL Server 2008 R2.

From the Resource monitor, disk activity usually resides around 50 KB/sec and the application performs as expected.  However, randomly, I will see the disk activity jump up to 10MB/sec and the app will come to a crashing halt with SQL timeouts.

What's the best way to troubleshoot this issue?
polaris101Asked:
Who is Participating?
 
dwkorConnect With a Mentor Commented:
Well, first 3 wait types could be excluded from analysis, you can add them to NOT IN list next time. If you do that, obviously it change Percent for the other events.

Async_Network_IO means that the clients don't consume data fast enough. In most part of the cases, it means that client code opens reader and read/process rows one by one. This is not good (client design) but at the end it would not hung the server.

CXPacket - it's parallelism issue. I don't know nature of your system, but if it's OLTP, it could be the sign of non-optimized queries. You can try to reduce MaxDOP a little bit although you have to test it. One other thing - if your server is HARDWARE NUMA - make sure that MaxDOP <= # of CPU per NUMA node. Otherwise parallel plans could easily timeout if it involves CPU from different NUMA node.

PageIOLatch and BackupIO are IO and/or memory related waits - but those are really hard to say.

17% of signal waits is acceptable but very close to CPU bottleneck point. This is the time when query waits for available CPU versus waits for some resources.

So unfortunately it's hard to say based on those waits what exactly is going on. Need to have more info.

Speaking of IO expensive queries - how many of those queries are IO intensive (large # in AVG io)? How often (many times) are they running? How many of the queries have Avg IO > 10,000? Does the time of "hung ups" correspond to backups?

For now I'd suggest to clear wait statistics (would be interesting to see waits again after some time with: DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR). This is safe to do on production system as long as you don't use any 3rd party monitoring tools (it could screws up the analysis). Next, you can run blocking process trace (http://msdn.microsoft.com/en-us/library/ms191168.aspx) - set blocked process threshold to 15 seconds for example. Last, but not least, if you have any blocking during the time when server hung up, try to run the script below and see if there are any blockers.

Obviously we can also try to use extended events to notify us about blocking, but this one is a bit complex - so let's start with those ones.
select
	TL1.resource_type
	,DB_NAME(TL1.resource_database_id) as [DB Name]
	,CASE TL1.resource_type
		WHEN 'OBJECT' THEN OBJECT_NAME(TL1.resource_associated_entity_id, TL1.resource_database_id)
		WHEN 'DATABASE' THEN 'DB'
		ELSE
			CASE 
				WHEN TL1.resource_database_id = DB_ID() 
				THEN
					(
						select OBJECT_NAME(object_id, TL1.resource_database_id)
						from sys.partitions
						where hobt_id = TL1.resource_associated_entity_id
					)
				ELSE
					'(Run under DB context)'
			END
	END as ObjectName
	,TL1.resource_description
	,TL1.request_session_id
	,TL1.request_mode
	,TL1.request_status
	,WT.wait_duration_ms as [Wait Duration (ms)]
	,(
		select
			SUBSTRING(
				S.Text, 
				(ER.statement_start_offset / 2) + 1,
				((
					CASE 
						ER.statement_end_offset
					WHEN -1 
						THEN DATALENGTH(S.text)
						ELSE ER.statement_end_offset
					END - ER.statement_start_offset) / 2) + 1)		
		from 
			sys.dm_exec_requests ER 
				cross apply sys.dm_exec_sql_text(ER.sql_handle) S
		where
			TL1.request_session_id = ER.session_id
	 ) as [Query]
from
	sys.dm_tran_locks as TL1 join sys.dm_tran_locks TL2 on
		TL1.resource_associated_entity_id = TL2.resource_associated_entity_id
	left outer join sys.dm_os_waiting_tasks WT on
		TL1.lock_owner_address = WT.resource_address and TL1.request_status = 'WAIT'

where
	TL1.request_status <> TL2.request_status and
	(
		TL1.resource_description = TL2.resource_description OR
		(TL1.resource_description is null and TL2.resource_description is null)
	)
go

Open in new window

0
 
Eduardo GoicovichIT ConsultantCommented:
HI,
 
   Do you know what is the app using from the server?
   To have an idea, create a sql server profiler session to record the activity from the application.
0
 
polaris101Author Commented:
The App is Microsoft Access 2007
0
Making Bulk Changes to Active Directory

Watch this video to see how easy it is to make mass changes to Active Directory from an external text file without using complicated scripts.

 
Eduardo GoicovichIT ConsultantCommented:
Usually the problem comes from poor designed queries.

- Verify what kind of queries is using (use sql server profiler to get an excerpt, filtering by host or any other specific property)
- Once you get the culprit query, run on ssms to see the execution plan
- Check the indexes from the query
0
 
dwkorCommented:
Find most IO consuming queries with the script below and optimize them.

Keep in mind that timeouts could be related with the locking rather than IO but at the end query optimization should help with that too
SELECT TOP 50 
	SUBSTRING(qt.TEXT, (qs.statement_start_offset/2)+1,
		((
			CASE qs.statement_end_offset
				WHEN -1 THEN DATALENGTH(qt.TEXT)
				ELSE qs.statement_end_offset
			END - qs.statement_start_offset)/2)+1),
	qs.execution_count,
	(qs.total_logical_reads + qs.total_logical_writes) / qs.execution_count as [Avg IO],
	qs.total_logical_reads, qs.last_logical_reads,
	qs.total_logical_writes, qs.last_logical_writes,
	qs.total_worker_time,
	qs.last_worker_time,
	qs.total_elapsed_time/1000 total_elapsed_time_in_ms,
	qs.last_elapsed_time/1000 last_elapsed_time_in_ms,
	qs.last_execution_time,
	qp.query_plan
FROM 
	sys.dm_exec_query_stats qs
		CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
		CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY 
	[Avg IO] DESC
go

Open in new window

0
 
polaris101Author Commented:
Thanks Dwkor - maybe I'm looking in the wrong spot then...bc I just ran that script and the 'last execution time' does not display todays date...and the server really hung up a couple times today
0
 
dwkorCommented:
This script works only on the scope of the plan cache. So if server has been restarted or query removed from the plan cache, it would not be included to results.

Run the attached selects. Post top 5 waits from the first one and result from the second one.
-- Shows TOP waits in the system
SELECT 
	wait_type, wait_time_ms, 
	convert(decimal(7,4), 100.0 * wait_time_ms / SUM(wait_time_ms) OVER()) AS [Percent]
from 
	 sys.dm_os_wait_stats
where 
	wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK'
	,'SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR', 'LOGMGR_QUEUE','CHECKPOINT_QUEUE'
	,'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH','BROKER_TASK_STOP',
	'CLR_MANUAL_EVENT','CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT'
	,'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN')
order by 
	[Percent] Desc
go

-- Shows Signal wait time (how long worker waits in Runnable queue) vs Total wait time
-- High % of signal waits could indicate CPU starvation
select 
	sum(signal_wait_time_ms) as [Signal Wait Time (ms)]
	,convert(decimal(7,4), 100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms)) as [% Signal waits]
	,sum(wait_time_ms - signal_wait_time_ms) as [Resource Wait Time (ms)]
	,convert(decimal(7,4), 100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms)) as [% Resource waits]
from
	sys.dm_os_wait_stats
go

Open in new window

0
 
polaris101Author Commented:
Great.  Here ya go...see attach. what are your thoughts? perf
0
 
dwkorCommented:
> PageIOLatch and BackupIO are IO and/or memory related waits - but those are really hard to say.

Hate that you cannot edit the message - I meant that % of those waits not necessarily indicates the issues at this point.  
0
 
polaris101Author Commented:
Ok, a lot of that is a little over my head.  I just inherited this issue and only have basic SQL DBA knowledge.  Can you point me towards some good SQL troubleshooting documentation?  Thanks!
0
 
dwkorCommented:
Try to download those whitepapers: http://download.microsoft.com/download/D/B/D/DBDE7972-1EB9-470A-BA18-58849DB3EB3B/TShootPerfProbs2008.docx

Frankly if I had time, I would start with some reading about internals. My favorite one is Kalen Delaney "Inside Microsoft SQL Server series". Start with this one: http://www.amazon.com/Microsoft®-SQL-Server®-2008-Internals/dp/0735626243/
0
 
mastooCommented:
Are you looking at disk io by process in the Resource Monitor, so that you know it is sql server?

While the load is underway, run sp_who2 from two different windows and compare the disk io between them.  This leads to the spid that is causing it.  Run activity monitor to get details on that spid.

Or run Sql Server Profiler using the standard template and you'll see activity flash by.  Look for big disk io numbers and you have your culprit.  Once you identify the sql, it likely leads to sql that needs to be fixed or a table that needs an index.
0
 
polaris101Author Commented:
Thanks mastoo - what do you mean by running sp_who2 from two different windows?
0
 
mastooCommented:
From a query window in Sql Server Management Studio, run this:

sp_who2

It spits out numbers indicating io for the various spid's.  Opening two separate instances of SSMS and running it from each allows you to compare the different results.  io is total so what you really care about is the change during the few seconds between you running instance 1 and instance 2.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.