Disk Queue Length

Hi Experts,

I am having an issue where our Current Queue Length and Avg Queue Length are fluctuating quite significantly.

Normal:
The Current Queue Length is: 0
Average Queue Length is: 3

High:
The Current Queue Length is: 45
Average Queue Length is: 375.6

It fluctuates every few hours and has only been happening the last few days. I would like to find the root cause of the pressure on the hard disks and eliminate it if possible.

Environment:
SBS2008
File
Mail
DNS
DHCP
Network Security (Trend)
WSUS
Backup
Database server (quickbooks)
Sharepoint

kyodai

Most probably the file services. I guess some user accesses one or several large files via a share. If you wanna know for sure you can use perfmon to collect key values over a longer time.

Duncan Meyers

It's best not to look at Queue Length in isolation (although 375 looks a bit grim...). You also need to take a look at Response Time (Physical Disk -> Avg Disk sec/Read and Write) and IOPS (I/O Per Second) (Physical Disk -> Disk Reads/Sec and Writes/Sec).
If Queue length is high AND Response Time is high AND IOPS is high, then you have a disk performance problem.

The rules of thumb:
- Response time should be less than 20 ms in a perfect world, and once it starts climbing to 40 ms and beyond, you're going to start seeing performance degradation.

- Expect 3 to 4 queued I/Os per disk in a RAID set.

- Expect 180 IOPS per 15K rpm SAS drive, 120 IOPS per 10K rpm SAS drive and 60 - 80 IOPS per 7200rpm SATA drive. This is the steady state performance, and drives will typically hit 2 to 2.5 times that figure per drive before you see a performance impact. So: if you have 4 10K SAS drives in a RAID 5 group, you'd expect between 480 IOPS and 1200 IOPS no worries. Once you got above 1200 IOPS, though, things are going to grind to a halt. If, on the other hand, you've got a mirrored pair of 2TB SATA drives, you're looking at 80 to 200 IOPS. Ouch.
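
The arithmetic in that last rule of thumb can be sketched in a few lines of Python. This is only a rough illustration: the per-drive figures are the ones quoted above, while the function and dictionary names are made up for the example, and the SATA entry assumes the low end of the 60 - 80 range.

```python
# Rough steady-state IOPS per drive, taken from the rules of thumb above.
# SATA is quoted as 60 - 80; the low end is used here as an assumption.
STEADY_IOPS_PER_DRIVE = {
    "15k_sas": 180,
    "10k_sas": 120,
    "7200_sata": 60,
}


def estimate_array_iops(drive_type, drive_count, peak_factor=2.5):
    """Return (steady, peak) IOPS for a group of identical drives.

    peak_factor is the "2 to 2.5 times" burst multiplier from the post.
    RAID write penalty is deliberately ignored in this estimate.
    """
    steady = STEADY_IOPS_PER_DRIVE[drive_type] * drive_count
    return steady, steady * peak_factor


if __name__ == "__main__":
    steady, peak = estimate_array_iops("10k_sas", 4)
    print(f"4 x 10K SAS: about {steady} IOPS steady, {peak:.0f} IOPS peak")
```

For the 4 x 10K SAS example this gives the same 480 and 1200 figures as the text.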

Tracking down the source of the high queue depths is going to be tricky. Anything that's waiting on I/O is likely to have low processor utilisation, so Task Manager won't be a huge amount of help. Process Monitor from Sysinternals might be a good start for debugging (http://technet.microsoft.com/en-us/sysinternals/bb545046), but you may just have to pony up some cash for more disk drives, depending on what you find.

isdd2000

ASKER

Hi kyodai,

I don't think it is someone accessing a big file, as all design files sit on another server, and the queue length was going up and down all weekend.

Meyersd,

Will check the drive speeds tomorrow, I just find it odd that after 3 months of the disks being ok suddenly they start peaking out one weekend!

Duncan Meyers

Disk performance tends to deteriorate slowly where nobody notices, then fall off a cliff. Disk performance goes up to about 2.5 x the base performance. Beyond that, it becomes terrible really fast.

The reason I mention all that is so that you could measure response time and IOPS and make a quick decision as to whether your problem is in hardware (where the fix is to add more drives or go SSD), or you have another issue.
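
As a rough sketch of that quick decision, the three checks can be combined in Python as below. The thresholds come from the rules of thumb earlier in the thread (about 4 queued I/Os per disk, 40 ms response time, 2.5 times the steady-state IOPS); the function name and parameters are hypothetical, and the counter values passed in the usage lines are invented for illustration. Note that perfmon reports Avg Disk sec/Read and sec/Write in seconds, so multiply by 1000 before passing them in.

```python
def looks_like_disk_bottleneck(avg_queue, disk_count, resp_ms, iops, steady_iops,
                               resp_limit_ms=40.0, queue_per_disk=4,
                               peak_factor=2.5):
    """Return True only if queue length, response time AND IOPS are all high.

    avg_queue   -- Avg. Disk Queue Length for the array
    disk_count  -- number of drives in the RAID set
    resp_ms     -- average response time in milliseconds
    iops        -- measured reads/sec + writes/sec
    steady_iops -- estimated steady-state IOPS for the array
    """
    queue_high = avg_queue > disk_count * queue_per_disk
    resp_high = resp_ms >= resp_limit_ms
    iops_high = iops > steady_iops * peak_factor
    return queue_high and resp_high and iops_high


if __name__ == "__main__":
    # Invented sample readings for a 3-drive 15K SAS array (steady ~540 IOPS).
    print(looks_like_disk_bottleneck(375.6, 3, resp_ms=60, iops=1500, steady_iops=540))
    print(looks_like_disk_bottleneck(375.6, 3, resp_ms=5, iops=100, steady_iops=540))
```

A high queue length on its own, as in the second call, does not trip the check.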

isdd2000

ASKER

Hi Guys,

Sorry for the delayed response. We have 3 15K SAS drives in RAID 5. What I find really strange is that it spikes for a very short period of time. By the time I log into the server after receiving the alert, everything is ok.

Because it's not consistent, I don't think it's a performance issue. What do you think?

Nagendra Pratap Singh

I would not call it a performance issue yet.

You can suppress your alerts in your monitoring tool if it does not affect the business. Most tools have an option to raise an alert only if the situation is out of normal for, say, 5 minutes.
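
To illustrate the idea (this is not how any particular monitoring tool implements it), a sustained-threshold check might look like the Python sketch below. The function name and the sample format are assumptions made for the example.

```python
def sustained_breach(samples, threshold, min_seconds=300):
    """Return True if the value stays above threshold for min_seconds or more.

    samples is a list of (timestamp_in_seconds, value) pairs in time order.
    A single sample at or below the threshold resets the timer, so short
    spikes never raise an alert.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= min_seconds:
                return True
        else:
            breach_start = None
    return False


if __name__ == "__main__":
    short_spike = [(0, 1), (60, 400), (120, 2), (180, 1)]
    long_breach = [(t, 50) for t in range(0, 301, 60)]
    print(sustained_breach(short_spike, threshold=10))  # spike lasts under 5 minutes
    print(sustained_breach(long_breach, threshold=10))  # above threshold for 5 minutes
```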

You should still export the logs and check the spike patterns.
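
Once the counters are in a CSV file, a short Python script can pull out the spike samples. This is a minimal sketch that assumes a header row, a timestamp in the first column and a single numeric counter in the second; a real export may have more columns and a different timestamp format. The sample rows are made up.

```python
import csv
import io


def find_spikes(csv_text, threshold):
    """Return (timestamp, value) pairs whose counter value exceeds threshold.

    Assumes a header row, a timestamp in column 0 and one numeric counter
    in column 1. Blank or non-numeric samples are skipped.
    """
    spikes = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue
        try:
            value = float(row[1])
        except ValueError:
            continue
        if value > threshold:
            spikes.append((row[0], value))
    return spikes


# Made-up sample data in the assumed two-column layout.
SAMPLE = '''"Time","Avg. Disk Queue Length"
"01/01/2013 17:08:45","3.0"
"01/01/2013 17:09:00","375.6"
"01/01/2013 17:09:15"," "
"01/01/2013 17:09:30","2.1"
'''

if __name__ == "__main__":
    for timestamp, value in find_spikes(SAMPLE, threshold=100):
        print(timestamp, value)
```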

isdd2000

ASKER

Hi,

These monitoring thresholds have been there for 4 months, and last month it started spiking. I am currently using Windows performance and alerts. How do I export the logs?

Duncan Meyers

On the face of it, you have a disk performance problem. You could either
- find the source of the spikes then move the workload that's being accessed to another drive
- just add more drives.

Have you had any application updates recently?
End of month (or end of financial year) processing for quickbooks?
Any changes to Trend Micro (other than virus signatures)?
More workload on Sharepoint?

isdd2000

ASKER

Hi meyersd,

There have been Windows updates run, but no updates on the Trend side of things. QuickBooks does not get used anymore due to a cloud application we have been using. There were a few issues with WSUS that were fixed recently, and now machines on the network are communicating with WSUS. No one really utilises SharePoint on this network.

Any ideas on how to find out if WSUS is causing the issue?

Duncan Meyers

In Task Manager, under Performance is a Resource Monitor button (in Windows 7 - I presume it's the same in 2008 R2). You can look at what resources are being consumed.

isdd2000

ASKER

Hi Meyersd,

Our two biggest writers to the drive are an APC agent and our network security applications.

Duncan Meyers

Does the activity coincide with the slowdowns?

isdd2000

ASKER

Hard to tell, as by the time I checked, the spike was over. What should I be looking for: I/O writes or I/O reads?

Duncan Meyers

Writes are usually the culprit as there is a RAID write penalty, but huge read workloads will also cause a performance problem.

How many I/O per second are you seeing in the spikes?

isdd2000

ASKER

Will make note next time it spikes, and let you know.

Duncan Meyers

Rule of thumb: the RAID array should produce 600 IOPS give-or-take, peaking at about 1200-1500 IOPS. But, you have to remember write penalty. For every 1 host write, you've got 4 disk transactions, so 500 host writes will generate 2000 IOPS - and that's more than your array can deliver.
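
The write-penalty arithmetic can be written out as a small Python sketch. The penalty of 4 disk transactions per host write is the figure given above; the function name is made up for the example, and reads are assumed to cost one disk I/O each.

```python
RAID5_WRITE_PENALTY = 4  # disk transactions per host write, as stated above


def backend_iops(host_reads, host_writes, write_penalty=RAID5_WRITE_PENALTY):
    """Disk-side IOPS generated by a host workload.

    Each host read is assumed to cost one disk I/O; each host write
    costs write_penalty disk I/Os.
    """
    return host_reads + host_writes * write_penalty


if __name__ == "__main__":
    load = backend_iops(host_reads=0, host_writes=500)
    print(f"500 host writes/sec -> {load} disk IOPS")
    print("Over the 1500 IOPS peak:", load > 1500)
```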

isdd2000

ASKER

Hi Meyersd,

After 3 days of monitoring I have a bit more info to work with.

Split I/O per second reached a max of 72
Current Disk Queue reached max of 245
Average Disk Queue reached max of 144

Over the first 24-hour period it peaks at the following times. I will provide corresponding events if there are any, see below:

5:09PM
7:49PM (BACKUP)
9:09PM
10:39PM
2:19AM (Exchange Maintenance)
4:19AM
8:09AM
10:29AM
2:09PM
2:39PM

Any Ideas?

Duncan Meyers

Always at hh:m9? I wonder why that is. Do you have any scheduled tasks that run? AV tasks perhaps?
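
A quick way to confirm that pattern is to check the last digit of the minute for each peak time listed in the previous post. The Python sketch below does that; the helper name is made up, and the time format string assumes an English locale for AM/PM.

```python
from datetime import datetime

# Peak times copied from the asker's 24-hour monitoring list.
PEAK_TIMES = [
    "5:09PM", "7:49PM", "9:09PM", "10:39PM", "2:19AM",
    "4:19AM", "8:09AM", "10:29AM", "2:09PM", "2:39PM",
]


def minute_last_digits(times):
    """Return the sorted set of last digits of the minute across all times."""
    return sorted({datetime.strptime(t, "%I:%M%p").minute % 10 for t in times})


if __name__ == "__main__":
    print(minute_last_digits(PEAK_TIMES))
```

A single value in the result means every peak falls on the same minute digit, which points at something running on a fixed 10-minute cycle or at the sampling schedule itself.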

Do you have the luxury of being able to stop services to see what changes?

isdd2000

ASKER

It's a live SBS server, so unfortunately I cannot stop any services. I did notice the TT:T9 pattern. I'll have a look at Trend to see if it runs anything then, and keep you updated.

isdd2000

ASKER

AV Scans weekly

Duncan Meyers

Very odd...

isdd2000

ASKER

The spikes are getting much larger, see below. Still can't see anything out of the ordinary in Task Manager.

Write Operations/Second, Bytes Written/Second, Read Operations/Second, Bytes Read/Second, Total Operations/Second and Total Bytes/Second are all within normal values, even during a spike!

Need to get to the bottom of this as I am getting a lot of emails. Any ideas?

1:57AM - Avg Queue Length: 1184
2:27AM - Avg Queue Length: 1196
2:43AM - Avg Queue Length: 1201
5:18AM - Avg Queue Length: 632
6:23AM - Avg Queue Length: 1294
7:03AM - Avg Queue Length: 653
7:33AM - Avg Queue Length: 1318
8:59AM - Avg Queue Length: 676
9:59AM - Avg Queue Length: 687
12:14PM - Avg Queue Length: 1429
1:34PM - Avg Queue Length: 734

Duncan Meyers

Have you got the manufacturer's diagnostic software installed? Dell's Open Manage, IBM Director or HP's whatever-it's-called? It's worth having a look for any odd hardware errors in there. Also check the event log for any activity that's happening at the same time as the peaks.

Also have a look here: http://www.microsoft.com/en-us/download/details.aspx?id=6231. The best practices analyzer may point you in the right direction.

isdd2000

ASKER

Figured out that it wasn't a disk queue issue. Windows was just reporting on a 1000 scale for some odd reason, maybe an update, not too sure why. Since we changed monitoring to a 1 scale, it has been fine.
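
For anyone hitting the same thing, here is a small Python sketch of the effect. It assumes the alerted figures were simply the real queue length multiplied by the scale factor, which is my reading of what happened rather than something I confirmed. The function name is made up; the values are some of the ones from my earlier post.

```python
def unscale(reported_value, scale):
    """Convert a scaled counter reading back to the underlying value."""
    return reported_value / scale


if __name__ == "__main__":
    # A few of the alerted Avg Queue Length values from the earlier post.
    for reported in (1184, 632, 1429):
        print(reported, "->", unscale(reported, 1000))
```

At a scale of 1000, those alerts correspond to queue lengths of roughly 0.6 to 1.4, which is nothing to worry about.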

ASKER CERTIFIED SOLUTION

Duncan Meyers

isdd2000

ASKER

Thanks meyersd.

There were no real performance issues; that was probably a key point.

Duncan Meyers

Thanks! Glad you got it resolved.