jpgillivan asked:

Extremely high average disk queue length

I am using N-able/N-central monitoring, and on some servers I see the average disk queue length randomly reported as an extremely high number. For example: 4875851.

If I manually monitor the queue length via Performance Monitor, or even capture data in Windows perfmon log files, I see no data logged that high.

The product vendor (SolarWinds) says that the metric is taken directly from WMI readings:
Win32_PerfRawData_PerfDisk_LogicalDisk get AvgDiskQueueLength

Is it possible that WMI is reporting a value that high?
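
For reference, here is a quick check I can run on the box to compare the raw value (the class the vendor referenced) against the cooked value PerfMon shows. This is just my own sketch; it assumes PowerShell 3.0+ for the CIM cmdlets (on older PowerShell, Get-WmiObject with the same class names works the same way). The raw class returns unscaled counter data, so very large numbers there are not directly comparable to PerfMon's display.

# Raw (uncooked) counter data - the class the vendor says N-able reads
Get-CimInstance Win32_PerfRawData_PerfDisk_LogicalDisk |
    Select-Object Name, AvgDiskQueueLength

# Formatted (cooked) counter data - should line up with what PerfMon displays
Get-CimInstance Win32_PerfFormattedData_PerfDisk_LogicalDisk |
    Select-Object Name, AvgDiskQueueLength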

System is Hyper-V 2008 R2 with multiple VMs.
Four SAS disks: 15K, 6Gb/s (write-cache status: write back)
Raid level: 5
Raid Card: Adaptec 6805, 512MB RAM
Stripe size: 256KB
Write-cache/mode: write through

I have another similar system with 6 SAS drives in RAID 10, and I am getting the same WMI-type results.
Windows Server 2008 · Storage Hardware

ASKER CERTIFIED SOLUTION
Adam Brown

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
jpgillivan

ASKER
I am going to set the failed upper limit to 10000 and see if the alerts stop.  

However, I am concerned that maybe there is a disk I/O issue.

Or do you think it's a bug with N-Able?  
I have a case open with them but they won't/don't acknowledge there is an issue.
Adam Brown

I usually shut down alerts on disk queue length because the way it's calculated isn't always accurate and can easily be skewed by a sudden burst of small transactions following a lengthy single transaction (Windows determines disk queue length with Little's Law: Length = Arrivals x Wait period). You'll want to record it for reporting purposes, but take it out of the alerts (this can be done by disabling the threshold in N-able). % Disk Time is basically the queue length * 100.
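
To illustrate with made-up numbers: 500 arrivals per second with an average wait of 10 ms reports a queue length of 500 x 0.010 = 5, while a brief burst of 2,000 arrivals per second stuck behind a 50 ms transaction reports 2,000 x 0.050 = 100 for that sample interval, even though the disk catches up right afterward.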

A better counter to monitor is %Idle Time, which is a 0-100% value and represents how much time the drive has spent with no incoming requests. That value is calculated by marking when the disk goes idle, then marking again when it processes an action. The values you want to see depend on the monitoring interval, but I would put a warning in below 20% and failure below 10% (100% means the drive is idle during the entire window).
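
If you want to spot-check % Idle Time outside of N-able, a rough PowerShell sketch (the counter path is the standard one; the sample interval and count are just my suggestion):

# Sample % Idle Time for every logical disk, 4 samples 15 seconds apart
Get-Counter -Counter '\LogicalDisk(*)\% Idle Time' -SampleInterval 15 -MaxSamples 4 |
    Select-Object -ExpandProperty CounterSamples |
    Select-Object InstanceName, CookedValue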
Philip Elder

What's Disk Queue Length in the VM(s)?

Another indicator that the disk subsystem is limiting things is in latency.

Check the latency within the guest and the host to see if latency is hitting 150ms or higher. At 300ms and higher, users start getting impacted. Above 450-500ms, performance is not good at all.
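
A rough way to spot-check that from PowerShell (my own sketch; run it on the host and again inside the guest; the values are reported in seconds, so 0.150 = 150ms):

# 12 samples, 5 seconds apart, of read and write latency per logical disk
Get-Counter -Counter '\LogicalDisk(*)\Avg. Disk sec/Read','\LogicalDisk(*)\Avg. Disk sec/Write' -SampleInterval 5 -MaxSamples 12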
jpgillivan

ASKER
The purpose of monitoring disk queue length is to get information as to whether the current disk setup is configured/provisioned correctly to handle the workload.

Disk idle time is typically above 70%, and mostly above 90%, during that time.

I have also been monitoring disk I/O to use with queue length during that time.

One of the VMs is a SQL server. I am not typically seeing disk I/O or queue alerts from the SQL server at the same time as from the Hyper-V host.

I don't think N-able has the ability to check latency. I will have to figure out how to do that in Windows.
Philip Elder

Start --> type ResMon [ENTER on the result]

Or, on the host set up PerfMon to monitor both the host's and the guest's disk subsystems. This is the most accurate way to get readings from both.

ResMon works well in a pinch.
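
If you'd rather script the PerfMon log than click through the GUI, a logman sketch along these lines should do it (the collector name, counter list, interval, and output path are my own picks; run it on the host and again inside the guest):

# Create a counter log sampling every 15 seconds, then start/stop it as needed
logman create counter DiskPerf -c "\LogicalDisk(*)\Avg. Disk Queue Length" "\LogicalDisk(*)\Avg. Disk sec/Transfer" "\LogicalDisk(*)\% Idle Time" -si 00:00:15 -o C:\PerfLogs\DiskPerf
logman start DiskPerf
logman stop DiskPerf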
SOLUTION
Adam Brown

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
Philip Elder

When it comes to SQLIO, DiskSPD, IOmeter, Exchange Jetstress, and other such performance stress software, keep in mind that running them during production hours may bring production workloads to a crawl.
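
If you do run DiskSPD after hours, a typical small-block random test looks roughly like this (flags from memory, and the target path, duration, and read/write mix are only examples; check diskspd -? before running, and keep it out of production hours):

# 8KB blocks, 60 seconds, 8 outstanding I/Os per thread, 4 threads, random, 30% writes, 2GB test file
diskspd -b8K -d60 -o8 -t4 -r -w30 -c2G D:\diskspd-test.dat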
jpgillivan

ASKER
Adam,

The client has complained about their SQL application (VM) having slow-responsiveness issues from time to time. This is why I was looking at queue lengths (disk performance). I had run perfmon for 24 hours and saw no extremely high numbers that matched N-able, and since N-able was pulling from WMI I was concerned about the discrepancy.

Based on what I have researched, the (acceptable) average queue length for a RAID 5 with four drives is 6: (number of spindles - 1) * 2.
I understand the general rule is that a disk queue length of over 2 (per spindle) indicates a bottleneck. I also understand that there may be spikes that go over an acceptable limit, and as long as they are not sustained, or there are not many peaks, the issue is probably minor (but worth noting).
Since I am monitoring queue length with perfmon/N-able, should I still be using the value of 2, or 6?

I have also been monitoring disk I/O total operations per second and have seen what I believe to be high numbers for this RAID setup: values over 300. I based this on a calculation from this site: http://www.thecloudcalculator.com/calculators/disk-raid-and-iops.html
Based on the calculator, a RAID 5 with 15K SAS drives is capable of 312 IOPS.
Occasionally I will see values over 400 and even into the 1000s, but that is typically after hours when backups are running.
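
As a rough cross-check on that 312 figure (my assumptions, not necessarily the calculator's): a 15K SAS drive is usually rated around 175-180 IOPS, so four of them give roughly 700-720 raw IOPS. With the commonly quoted RAID 5 write penalty of 4, effective IOPS = (raw x read%) + (raw x write% / 4). At a 25% read / 75% write mix that works out to about 720 x 0.25 + (720 x 0.75) / 4 = 180 + 135 = 315, close to the calculator's 312, while a 50/50 mix would be nearer 360 + 90 = 450.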
Philip Elder

A stripe size of 256KB is great for throughput, that is, reading and writing large files.

For IOPS, especially for database-driven apps with lots of smaller reads and writes, this will bomb for performance.

And, four 15K disks in RAID 5 won't have much in the way of IOPS anyway ... maybe 200 per disk if they are 3.5", and less with that 256KB stripe size.

As a rule, a stripe size (we build on Storage Spaces so Interleave size for us) of 64KB is the place to start for storage hosting fairly intensive read/write activity like mail and SQL databases.

EDIT: As a point of reference:
Large stripe size = high throughput (GB/second), lower IOPS (I/Os per second)
Small stripe size = high IOPS, lower throughput
We have a great blog post that explains things based on our own learning experiences building out high performance storage: MPECS Inc. Blog: Storage Configuration: Know Your Workloads for IOPS or Throughput
jpgillivan

ASKER
Thanks Philip. I found an interesting article by someone who did stripe size testing; however, it only went up to 128KB. Given the results comparing 64KB to 128KB, I would think that 128KB to 256KB would be close also.
http://www.kendalvandyke.com/2009/02/disk-performance-hands-on-part-3-raid-5.html 

The author also indicated that, in raw numbers, 64KB was the best performer. I will keep that in mind on future builds.

Can anyone confirm whether I am calculating avg. queue lengths correctly for RAID 5, or do I simply stick with the value of 2 no matter what when pulling (physical disk) data from the OS (perfmon, WMI, etc.)?

If I am pulling logical volume queue lengths, I would assume I would use the value of 2. Is that correct?
SOLUTION
Philip Elder

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.