Solved

Extremely high average disk queue length

Posted on 2016-08-01
Last Modified: 2016-08-08
I am using N-Able/N-Central monitoring, and on some servers the average disk queue length is randomly reported as an extremely high number. For example: 4875851.

If I monitor the queue length manually via Performance Monitor, or even capture data in Windows perfmon log files, I see no values logged anywhere near that high.

The product vendor (SolarWinds) says that the metric is taken directly from WMI readings:
Win32_PerfRawData_PerfDisk_LogicalDisk get AvgDiskQueueLength

Is it possible that WMI is reporting a value that high?

System is Hyper-V 2008 R2 with multiple VMs.
Four SAS disks: 15K 6Gb/s (Write-Cache Status Write Back )
Raid level: 5
Raid Card: Adaptec 6805, 512MB RAM
Stripe size: 256KB
Write-cache/mode: write through

I have another similar system with 6 SAS drives in RAID 10 that gets the same kind of WMI results.
Question by:jpgillivan
LVL 40

Accepted Solution

by:
Adam Brown earned 250 total points
ID: 41737954
nAble pulls disk queue length as a single snapshot on a scheduled basis, rather than averaging over the period since the last check. I've seen it regularly pull corrupted or incorrect data on that counter, so if it goes into an error state, wait for the next scan to verify whether the value is really bad. It will usually go back to normal on the next poll of the service.
0
 

Author Comment

by:jpgillivan
ID: 41737985
I am going to set the failed upper limit to 10000 and see if the alerts stop.  

However, I am concerned that there may be a disk I/O issue.

Or do you think it's a bug with N-Able?  
I have a case open with them but they won't/don't acknowledge there is an issue.
0
 
LVL 40

Expert Comment

by:Adam Brown
ID: 41738057
I usually shut down alerts on disk queue length because the way it's calculated isn't always accurate and can easily be impacted by a sudden burst of small transactions following a lengthy single transaction (Windows determines disk queue length with Little's Law, or Length = Arrivals x Wait period). You'll want to record it for reporting purposes, but take it out of the alerts (this can be done by disabling the threshold in nAble). Disk Time % is basically the queue length * 100.
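
As a rough illustration of the Little's Law relationship mentioned above, the sketch below computes a queue length from an arrival rate and an average wait. The function name and the sample figures are hypothetical, chosen only to show how a short burst of requests can inflate the result.

```python
def littles_law_queue_length(arrivals_per_sec: float, avg_wait_sec: float) -> float:
    """Little's Law: average queue length = arrival rate x average wait."""
    return arrivals_per_sec * avg_wait_sec


# Hypothetical steady load: 200 requests/s, each waiting about 10 ms.
steady = littles_law_queue_length(200, 0.010)

# Hypothetical burst of small requests queued behind one long transfer:
# 2000 requests/s, each waiting about 50 ms.
burst = littles_law_queue_length(2000, 0.050)

print(f"steady: {steady:.1f}, burst: {burst:.1f}")
```

With these made-up numbers the steady case works out to a queue length of about 2 and the burst to about 100, even though the disk itself has not changed.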

A better counter to monitor is %Idle Time, which is a 0-100% value and represents how much time the drive has spent with no incoming requests. That value is calculated by marking when the disk goes idle, then marking again when it processes an action. The values you want to see depend on the monitoring interval, but I would put a warning in below 20% and failure below 10% (100% means the drive is idle during the entire window).
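
A minimal sketch of the %Idle Time thresholds suggested above (warning below 20%, failure below 10%). The function name and the state labels are hypothetical and are not taken from nAble.

```python
def classify_idle_time(idle_pct: float,
                       warn_below: float = 20.0,
                       fail_below: float = 10.0) -> str:
    """Map a %Idle Time reading (0-100) to a monitoring state."""
    if not 0.0 <= idle_pct <= 100.0:
        return "invalid"   # outside the counter's range, likely a bad sample
    if idle_pct < fail_below:
        return "failed"
    if idle_pct < warn_below:
        return "warning"
    return "normal"


for reading in (90.0, 15.0, 5.0, 150.0):
    print(reading, classify_idle_time(reading))
```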
0

 
LVL 38

Expert Comment

by:Philip Elder
ID: 41738201
What's Disk Queue Length in the VM(s)?

Another indicator that the disk subsystem is limiting things is in latency.

Check the latency within the guest and the host to see if latency is hitting 150ms or higher. 300ms and higher and users start getting impacted. Higher than 450ms-500ms and performance is not good at all.
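
The latency bands above could be expressed roughly as follows. The labels are hypothetical, and the lower edge of the 450ms-500ms range is used as the cut-off for the worst band, which is an assumption.

```python
def classify_latency(latency_ms: float) -> str:
    """Bucket a disk latency reading using the thresholds from this comment."""
    if latency_ms >= 450:
        return "poor"          # performance is not good at all
    if latency_ms >= 300:
        return "user impact"   # users start getting impacted
    if latency_ms >= 150:
        return "investigate"   # disk subsystem may be limiting things
    return "ok"


for sample_ms in (20, 150, 320, 500):
    print(sample_ms, classify_latency(sample_ms))
```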
0
 

Author Comment

by:jpgillivan
ID: 41740501
The purpose of monitoring disk queue length is to get information as to whether the current disk setup is configured / provisioned correctly to handle the work load.  

Disk idle time is typically above 70%, and mostly above 90%, during that time.

I have also been monitoring disk I/O to use with queue length during that time.

One of the VMs is a SQL server.  I am not typically seeing disk I/O or queue alerts from the SQL server at the same time as from the Hyper-V host.

I don't think NAble has the ability to check latency.  I will have to figure out how to do that in windows.
0
 
LVL 38

Expert Comment

by:Philip Elder
ID: 41740655
Start --> type ResMon[ENTER on the result]

Or, on the host set up PerfMon to monitor both the host's and the guest's disk subsystems. This is the most accurate way to get readings from both.

ResMon works well in a pinch.
0
 
LVL 40

Assisted Solution

by:Adam Brown
Adam Brown earned 250 total points
ID: 41740987
"disk idle time is typically above 70% and mostly above 90% during that time."

Then the setup is capable of handling the load. 70% idle time means that the drives are certainly not causing a bottleneck. Disk Queue Length is just not a very accurate counter to use, particularly when examining it through WMI, which doesn't maintain averages over time. Perfmon would give you a better idea of disk queue length averages, and it would be a very good idea to set up a performance log on that counter and other disk performance counters for a while to get better data. The amount of time between scans in nAble makes it less reliable and accurate for determining averages on disk queue length.
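
One possible contributing factor, not confirmed anywhere in this thread: the Win32_PerfRawData classes expose raw, accumulating values rather than finished averages. Assuming AvgDiskQueueLength is a PERF_COUNTER_100NS_QUEUELEN_TYPE counter and that Timestamp_Sys100NS is its time base, an average is obtained from two samples as (N1 - N0) / (D1 - D0). The sketch below applies that formula to made-up sample values; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RawSample:
    """One poll of the raw WMI class (field names mirror the WMI properties)."""
    avg_disk_queue_length: int  # raw AvgDiskQueueLength, an accumulating total
    timestamp_sys100ns: int     # Timestamp_Sys100NS, in 100 ns ticks


def cooked_queue_length(older: RawSample, newer: RawSample) -> float:
    """Average queue length between two polls: (N1 - N0) / (D1 - D0)."""
    elapsed_ticks = newer.timestamp_sys100ns - older.timestamp_sys100ns
    if elapsed_ticks <= 0:
        raise ValueError("samples must be in chronological order")
    return (newer.avg_disk_queue_length - older.avg_disk_queue_length) / elapsed_ticks


# Hypothetical samples taken 5 minutes (3,000,000,000 ticks) apart.
first = RawSample(avg_disk_queue_length=4_875_851, timestamp_sys100ns=10_000_000_000)
second = RawSample(avg_disk_queue_length=6_004_875_851, timestamp_sys100ns=13_000_000_000)
print(cooked_queue_length(first, second))
```

With these invented numbers the raw field on its own is in the millions or billions, while the computed average over the interval is 2.0. Whether nAble reads the raw value or a computed one is not stated here.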

nAble pulls data from WMI about once every 5 minutes by default, so the values it reports for things like Disk and CPU performance aren't very reliable unless there is a very long period of high utilization. Perfmon polls the system counters once or twice per second, so it will give you a much more reliable and useful set of data to work with when benchmarking systems.

nAble's monitors are designed to help identify long-term trends that may be symptomatic of underlying issues before they become serious problems. They aren't designed to help find performance bottlenecks.

Another tool that would be useful for your purposes is SQLIO: https://www.microsoft.com/en-us/download/details.aspx?id=20163 
It's capable of benchmark testing SQL IO speeds, so it will help you determine if the drive array you have will meet the requirements you have.
0
 
LVL 38

Expert Comment

by:Philip Elder
ID: 41741016
When it comes to SQLIO, DiskSPD, IOmeter, Exchange JETStress, and other such performance stress software, keep in mind that running them during production hours may bring production workloads down to a crawl.
0
 

Author Comment

by:jpgillivan
ID: 41741133
Adam,

The client has complained about their SQL application (VM) having slow responsiveness from time to time.  This is why I was looking at queue lengths (disk performance).  I ran perfmon for 24 hours and saw no extremely high numbers that matched NAble, and since NAble was pulling from WMI I was concerned about the discrepancy.

Based on what I have researched, the acceptable average queue length for a RAID 5 with four drives is 6: (number of spindles - 1) * 2.
I understand the general rule is that a disk queue length of over 2 (per spindle) indicates a bottleneck.  I also understand that there may be spikes that go over an acceptable limit, and as long as they are not sustained and there are not many peaks, the issue is probably minor (but worth noting).
Since I am monitoring queue length with perfmon/NAble, should I still be using the value of 2, or 6?

I have also been monitoring total disk I/O operations per second and have seen what I believe to be high numbers for this RAID setup: values over 300.  I based this on a calculation from this site http://www.thecloudcalculator.com/calculators/disk-raid-and-iops.html 
Based on the calculator, a RAID 5 with 15K SAS drives is capable of 312 IOPS.
Occasionally I will see values over 400 and even into the 1000s, but that is typically after hours when backups are running.
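
For reference, a common rule-of-thumb way to estimate usable IOPS from a RAID set is sketched below. It is not necessarily the method used by the linked calculator. The per-disk figure (175 IOPS), the 50/50 read/write mix, and the write penalties (4 for RAID 5, 2 for RAID 10) are assumptions, and the function name is hypothetical. The idea is that each front-end write costs several back-end I/Os, so usable IOPS = raw IOPS / (read fraction + write fraction x write penalty).

```python
def estimated_frontend_iops(disks: int, iops_per_disk: float,
                            read_fraction: float, write_penalty: int) -> float:
    """Estimate the host-visible IOPS a RAID set can sustain."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    raw_iops = disks * iops_per_disk
    write_fraction = 1.0 - read_fraction
    # Back-end cost of one average front-end I/O.
    cost_per_io = read_fraction + write_fraction * write_penalty
    return raw_iops / cost_per_io


# Assumed figures: 15K disks at 175 IOPS each, 50% reads.
raid5_four_disks = estimated_frontend_iops(4, 175, 0.5, write_penalty=4)
raid10_six_disks = estimated_frontend_iops(6, 175, 0.5, write_penalty=2)
print(raid5_four_disks, raid10_six_disks)
```

The result moves a lot with the read/write mix, so a single published figure for an array only holds for the mix it was calculated with.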
0
 
LVL 38

Expert Comment

by:Philip Elder
ID: 41741195
A stripe size of 256KB is great for throughput, that is reading and writing large files.

For IOPS, especially for database driven apps with lots of smaller reads and writes, this will bomb for performance.

And, four 15K disks in RAID 5 won't have much in the way of IOPS anyway ... maybe 200 per disk if they are 3.5" and less with that 256KB stripe size.

As a rule, a stripe size (we build on Storage Spaces so Interleave size for us) of 64KB is the place to start for storage hosting fairly intensive read/write activity like mail and SQL databases.

EDIT: As a point of reference:
Large stripe size = higher throughput (GB/second), lower IOPS (I/Os per second)
Small stripe size = higher IOPS, lower throughput
We have a great blog post that explains things based on our own learning experiences building out high performance storage: MPECS Inc. Blog: Storage Configuration: Know Your Workloads for IOPS or Throughput
0
 

Author Comment

by:jpgillivan
ID: 41741257
Thanks Philip.   I found an interesting article by someone who did stripe size testing, although it only went up to 128KB.  Given the test results comparing 64KB to 128KB, I would think that 128KB to 256KB would be close as well.
http://www.kendalvandyke.com/2009/02/disk-performance-hands-on-part-3-raid-5.html 

The author also indicated that, in raw numbers, 64KB was the best performer.  I will keep that in mind on future builds.

Can anyone confirm whether I am calculating average queue lengths correctly for RAID 5, or do I simply stick with the value of 2 no matter what when pulling (physical disk) data from the OS (perfmon, WMI, etc.)?

If I am pulling logical volume queue lengths, I assume I would use the value of 2.  Is that correct?
0
 
LVL 38

Assisted Solution

by:Philip Elder
Philip Elder earned 250 total points
ID: 41744957
We test _all_ disk subsystems before we deploy. We are not interested in deploying a very expensive storage system that turns out to be a brick. We've seen that happen with some very expensive storage situations belonging to others.

We've tested from 512KB (GBs/Second) right down to 4KB (1.2M IOPS via SAS SSDs 3 nodes and 1 JBOD).

In our experience the best disk queue length = # Disks * 2. 5 Disks would be a DQL of 10 or less. That would be where we would start.
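
The two rules of thumb that came up in this thread can be compared side by side. The function names below are hypothetical; the formulas are (spindles - 1) * 2 from the author's research and disks * 2 from the comment above.

```python
def dql_limit_excluding_parity(disks: int) -> int:
    """Author's researched RAID 5 rule: (number of spindles - 1) * 2."""
    return (disks - 1) * 2


def dql_limit_all_disks(disks: int) -> int:
    """Rule from this comment: number of disks * 2."""
    return disks * 2


for disk_count in (4, 5, 6):
    print(disk_count,
          dql_limit_excluding_parity(disk_count),
          dql_limit_all_disks(disk_count))
```

For the four-disk RAID 5 in the question, the two rules give starting thresholds of 6 and 8 respectively.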
0
