Extremely high average disk queue length

I am using N-Able/N-Central monitoring, and on some servers the average disk queue length is randomly reported as an extremely high number. For example: 4875851.

If I manually monitor the queue length via Performance Monitor, or even capture data in Windows perfmon log files, I see no values logged that high.

The product vendor (SolarWinds) says that the metric is taken directly from WMI readings:
Win32_PerfRawData_PerfDisk_LogicalDisk get AvgDiskQueueLength

Is it possible that WMI is reporting a value that high?

System is Hyper-V 2008 R2. Multiple VMs
Four SAS disks: 15K 6Gb/s (Write-Cache Status Write Back )
Raid level: 5
Raid Card: Adaptec 6805, 512MB RAM
Stripe size: 256KB
Write-cache/mode: write through

I have another similar system with 6 SAS drives in Raid10, getting same WMI type results.

Adam Brown (Sr Solutions Architect) commented:
nAble pulls disk queue length as a single snapshot on a scheduled basis, rather than averaging over the period since the last check. I've seen it regularly pull corrupted or incorrect data on that counter, so if it goes into an error state, wait for the next scan to verify whether it's bad. It will usually go back to normal on the next poll of the service.

jpgillivan (Consultant, Author) commented:
I am going to set the failed upper limit to 10000 and see if the alerts stop.  

However, I am concerned that maybe there is a disk I/O issue.

Or do you think it's a bug with N-Able?  
I have a case open with them but they won't/don't acknowledge there is an issue.
Adam Brown (Sr Solutions Architect) commented:
I usually shut down alerts on disk queue length because the way it's calculated isn't always accurate and can easily be skewed by a sudden burst of small transactions following a lengthy single transaction (Windows determines disk queue length with Little's Law: Length = Arrival rate x Wait time). You'll want to record it for reporting purposes, but take it out of the alerts (this can be done by disabling the threshold in nAble). % Disk Time is basically the queue length * 100.
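
As a rough illustration of the Little's Law relationship mentioned above, here is a minimal Python sketch. The function names and the sample workload numbers are hypothetical, and the "queue length * 100" conversion is only the approximation described in this comment, not an exact description of how Windows computes the counter.

```python
def littles_law_queue_length(arrivals_per_sec: float, avg_wait_sec: float) -> float:
    """Little's Law: average number of requests in the system
    equals the arrival rate multiplied by the average time each request spends there."""
    return arrivals_per_sec * avg_wait_sec


def approx_disk_time_percent(queue_length: float) -> float:
    """Approximation from the comment above: % Disk Time ~ queue length * 100."""
    return queue_length * 100.0


if __name__ == "__main__":
    # Hypothetical workload: 200 I/Os per second, each taking 5 ms on average.
    steady = littles_law_queue_length(200, 0.005)
    # Hypothetical burst: 2000 small I/Os per second behind one slow 50 ms transfer.
    burst = littles_law_queue_length(2000, 0.050)
    print(f"steady queue length ~ {steady:.2f}")
    print(f"burst queue length  ~ {burst:.2f}")
    print(f"steady % Disk Time  ~ {approx_disk_time_percent(steady):.0f}")
```

The point of the second call is that a short burst with a long wait inflates the computed length well beyond the steady-state figure, which is why a single snapshot can look alarming.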

A better counter to monitor is %Idle Time, which is a 0-100% value and represents how much time the drive has spent with no incoming requests. That value is calculated by marking when the disk goes idle, then marking again when it processes an action. The values you want to see depend on the monitoring interval, but I would put a warning in below 20% and failure below 10% (100% means the drive is idle during the entire window).
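
The suggested %Idle Time thresholds can be sketched as a small classifier. This is only an illustration: the function name and state labels are hypothetical, and the 20% / 10% defaults are the starting points suggested above, which should be tuned to the monitoring interval.

```python
def classify_idle_time(idle_percent: float,
                       warn_below: float = 20.0,
                       fail_below: float = 10.0) -> str:
    """Map a %Idle Time reading (0-100) to a monitoring state.

    100 means the drive was idle for the whole sampling window;
    lower values mean the drive was busier.
    """
    if not 0.0 <= idle_percent <= 100.0:
        raise ValueError("idle_percent must be between 0 and 100")
    if idle_percent < fail_below:
        return "failed"
    if idle_percent < warn_below:
        return "warning"
    return "normal"


if __name__ == "__main__":
    for reading in (95.0, 70.0, 15.0, 5.0):
        print(reading, classify_idle_time(reading))
```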
Philip Elder (Technical Architect - HA/Compute/Storage) commented:
What's Disk Queue Length in the VM(s)?

Another indicator that the disk subsystem is limiting things is in latency.

Check the latency within the guest and the host to see if latency is hitting 150ms or higher. 300ms and higher and users start getting impacted. Higher than 450ms-500ms and performance is not good at all.
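
The latency bands above can be written down as a simple lookup. This is a sketch only: the function name and the band labels are hypothetical, and the 150 / 300 / 450 ms cut-offs are the rules of thumb from this comment rather than vendor guidance.

```python
def classify_disk_latency(latency_ms: float) -> str:
    """Bucket a disk latency reading (milliseconds) using the rule-of-thumb bands above."""
    if latency_ms < 0:
        raise ValueError("latency_ms must not be negative")
    if latency_ms >= 450:
        return "poor"            # performance is not good at all
    if latency_ms >= 300:
        return "user impact"     # users start getting impacted
    if latency_ms >= 150:
        return "investigate"     # disk subsystem may be limiting things
    return "ok"


if __name__ == "__main__":
    # Hypothetical readings taken inside a guest and on the host.
    for label, ms in (("guest", 35.0), ("host", 180.0), ("guest peak", 480.0)):
        print(f"{label}: {ms} ms -> {classify_disk_latency(ms)}")
```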
jpgillivan (Consultant, Author) commented:
The purpose of monitoring disk queue length is to find out whether the current disk setup is configured / provisioned correctly to handle the workload.

Disk idle time is typically above 70% and mostly above 90% during that time.

I have also been monitoring disk I/O to use with queue length during that time.

One of the VMs is a SQL server. I am not typically seeing disk I/O or queue alerts from the SQL server at the same time as the Hyper-V host.

I don't think NAble has the ability to check latency. I will have to figure out how to do that in Windows.
Philip Elder (Technical Architect - HA/Compute/Storage) commented:
Start --> type ResMon, then press ENTER on the result.

Or, on the host set up PerfMon to monitor both the host's and the guest's disk subsystems. This is the most accurate way to get readings from both.

ResMon works well in a pinch.
Adam Brown (Sr Solutions Architect) commented:
"disk idle time is typically above 70% and mostly above 90% during that time."

Then the setup is capable of handling the load. 70% idle time means that the drives are certainly not causing a bottleneck. Disk queue length is just not a very accurate counter to use, particularly when examining it through WMI, which doesn't maintain averages over time. Perfmon would give you a better idea of disk queue length averages, and it would be a very good idea to set up a performance log on that counter and other disk performance counters for a while to get better data. The amount of time between scans in nAble makes it less reliable and accurate for determining averages on disk queue length.

nAble pulls data from WMI about once every 5 minutes by default, so the values it reports for things like Disk and CPU performance aren't very reliable unless there is a very long period of high utilization. Perfmon polls the system counters once or twice per second, so it will give you a much more reliable and useful set of data to work with when benchmarking systems.
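
One possible reason a raw WMI reading looks enormous is that the Win32_PerfRawData_* classes expose uncooked, cumulative counter values. To my understanding (worth checking against Microsoft's documentation), AvgDiskQueueLength in the raw class is a PERF_COUNTER_100NS_QUEUELEN_TYPE counter, and a usable average only comes from two samples: the change in the raw value divided by the change in Timestamp_Sys100NS. The sketch below shows that arithmetic only. The class name, function name, and sample values are hypothetical, and it does not query WMI.

```python
from dataclasses import dataclass


@dataclass
class RawSample:
    """One reading from the raw performance class (values here are made up)."""
    avg_disk_queue_length: int   # raw AvgDiskQueueLength, a cumulative value
    timestamp_sys100ns: int      # Timestamp_Sys100NS, in 100-nanosecond units


def cooked_avg_queue_length(first: RawSample, second: RawSample) -> float:
    """Assumed formula for this counter type: (N1 - N0) / (D1 - D0)."""
    elapsed = second.timestamp_sys100ns - first.timestamp_sys100ns
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    delta = second.avg_disk_queue_length - first.avg_disk_queue_length
    return delta / elapsed


if __name__ == "__main__":
    # Hypothetical samples taken one second apart (1 s = 10,000,000 x 100 ns).
    s0 = RawSample(avg_disk_queue_length=4_875_851, timestamp_sys100ns=1_000_000_000)
    s1 = RawSample(avg_disk_queue_length=4_875_851 + 15_000_000,
                   timestamp_sys100ns=1_000_000_000 + 10_000_000)
    print("raw value looks huge:", s1.avg_disk_queue_length)
    print("cooked average queue length:", cooked_avg_queue_length(s0, s1))
```

If that assumption holds, a tool that reports a single raw value rather than the difference between two samples would show numbers in the millions even on a lightly loaded disk.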

nAble's monitors are designed to help identify long-term trends that may be symptomatic of underlying issues before they become serious problems. They aren't designed to help find performance bottlenecks.

Another tool that would be useful for your purposes is SQLIO: https://www.microsoft.com/en-us/download/details.aspx?id=20163 
It's capable of benchmark testing SQL IO speeds, so it will help you determine if the drive array you have will meet the requirements you have.
Philip Elder (Technical Architect - HA/Compute/Storage) commented:
When it comes to SQLIO, DiskSPD, IOmeter, Exchange JETStress, and other such performance stress software, keep in mind that running them during production hours may bring production workloads down to a crawl.
jpgillivan (Consultant, Author) commented:
The client has complained about their SQL application (VM) having slow responsiveness from time to time. This is why I was looking at queue lengths (disk performance). I had run perfmon for 24 hours and saw no extremely high numbers that matched NAble, and since NAble was pulling from WMI I was concerned about the discrepancy.

Based on what I have researched, the (acceptable) average queue length for a RAID 5 with four drives is 6: (number of spindles - 1) * 2.
I understand the general rule is that a disk queue length of over 2 (per spindle) indicates a bottleneck. I also understand that there may be spikes that go over an acceptable limit, and as long as they are not sustained or frequent the issue is probably minor (but worth noting).
Since I am monitoring queue length with perfmon/NAble, should I still be using the value of 2, or 6?

I have also been monitoring total disk I/O operations per second and have seen what I believe to be high numbers for this RAID setup: values over 300. I based this on the calculation from this site: http://www.thecloudcalculator.com/calculators/disk-raid-and-iops.html
Based on the calculator, a RAID 5 with 15K SAS drives is capable of 312 IOPS.
Occasionally I will see values over 400 and even into the 1000s, but that is typically after hours when backups are running.
Philip Elder (Technical Architect - HA/Compute/Storage) commented:
A stripe size of 256KB is great for throughput, that is reading and writing large files.

For IOPS, especially for database driven apps with lots of smaller reads and writes, this will bomb for performance.

And, four 15K disks in RAID 5 won't have much in the way of IOPS anyway ... maybe 200 per disk if they are 3.5" and less with that 256KB stripe size.
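
For a back-of-the-envelope feel for those numbers, the sketch below applies the commonly quoted RAID write-penalty estimate: usable IOPS = raw IOPS / (read fraction + write fraction * penalty). Everything here is an assumption for illustration: the function name, the RAID 5 write penalty of 4, and the read/write mixes. Only the "maybe 200 per disk" figure comes from this comment, and the sketch ignores stripe size, controller cache, and sequential versus random access.

```python
def estimate_array_iops(disks: int,
                        iops_per_disk: float,
                        read_fraction: float,
                        write_penalty: float) -> float:
    """Rule-of-thumb usable IOPS for an array.

    raw IOPS = disks * iops_per_disk; each write costs `write_penalty`
    back-end I/Os, each read costs one.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    raw = disks * iops_per_disk
    write_fraction = 1.0 - read_fraction
    return raw / (read_fraction + write_fraction * write_penalty)


if __name__ == "__main__":
    # Four 15K disks at an assumed 200 IOPS each, RAID 5 (assumed penalty of 4).
    for reads in (1.0, 0.7, 0.5, 0.0):
        usable = estimate_array_iops(4, 200, reads, 4)
        print(f"{reads:.0%} reads -> ~{usable:.0f} IOPS")
```

The more write-heavy the mix, the faster the usable figure falls away from the raw total, which is why a write-heavy database on a small RAID 5 set can run out of headroom quickly.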

As a rule, a stripe size (we build on Storage Spaces so Interleave size for us) of 64KB is the place to start for storage hosting fairly intensive read/write activity like mail and SQL databases.

EDIT: As a point of reference:
Large Stripe Size = >>>> Throughput (GB/Second) <<<<< IOPS (I/Os per second)
Small Stripe Size = >>>> IOPS <<<<<< Throughput
We have a great blog post that explains things based on our own learning experiences building out high performance storage: MPECS Inc. Blog: Storage Configuration: Know Your Workloads for IOPS or Throughput
jpgillivan (Consultant, Author) commented:
Thanks Philip. I found an interesting article by someone who did stripe size testing; however, it only went up to 128KB. Given the results comparing 64KB to 128KB, I would think that 128KB to 256KB would be close also.

The author also indicated that, in raw numbers, 64KB seemed to be the best performer. I will keep that in mind on future builds.

Can anyone confirm if I am calculating avg. queue lengths correctly for raid5 or do I simply stick with the value of 2 no matter what when pulling (physical disk) data from the OS (perfmon, wmi, etc)?

If I am pulling logical volume queue lengths, I assume I would use the value of 2. Is that correct?
Philip Elder (Technical Architect - HA/Compute/Storage) commented:
We test _all_ disk subsystems before we deploy. We are not interested in deploying a very expensive storage system that turns out to be a brick. We've seen that happen with some very expensive storage situations belonging to others.

We've tested from 512KB (GBs/Second) right down to 4KB (1.2M IOPS via SAS SSDs 3 nodes and 1 JBOD).

In our experience the best disk queue length = # Disks * 2. 5 Disks would be a DQL of 10 or less. That would be where we would start.
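
The two rules of thumb raised in this thread can be compared side by side. This is a sketch with hypothetical function names: the first follows the "# Disks * 2" starting point above, the second the "(number of spindles - 1) * 2" RAID 5 figure from the earlier question. Neither is an official limit.

```python
def dql_limit_all_disks(disks: int, per_disk: int = 2) -> int:
    """Starting point from this comment: disk queue length <= number of disks * 2."""
    if disks < 1:
        raise ValueError("disks must be at least 1")
    return disks * per_disk


def dql_limit_raid5_data_spindles(spindles: int, per_spindle: int = 2) -> int:
    """Rule quoted in the question: (spindles - 1) * 2, treating one spindle as parity."""
    if spindles < 3:
        raise ValueError("RAID 5 needs at least 3 spindles")
    return (spindles - 1) * per_spindle


if __name__ == "__main__":
    for n in (4, 5):
        print(f"{n} disks: all-disks rule = {dql_limit_all_disks(n)}, "
              f"RAID 5 data-spindles rule = {dql_limit_raid5_data_spindles(n)}")
```

Either figure applies to the queue length of the whole array as the OS sees it, so it is compared against the sustained average rather than against individual spikes.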