Solved

Extremely high average disk queue length

Posted on 2016-08-01
102 Views
Last Modified: 2016-08-08
I am using N-Able/N-Central monitoring, and on some servers I see the average disk queue length randomly reported as an extremely high number. For example: 4875851.

If I manually monitor the queue length via Performance Monitor, or even capture data in Windows perfmon log files, I see no data logged anywhere near that high.

The product vendor (SolarWinds) says that the metric is taken directly from WMI:
wmic path Win32_PerfRawData_PerfDisk_LogicalDisk get AvgDiskQueueLength

Is it possible that WMI is reporting a value that high?
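One thing I noticed while digging: the raw WMI class is uncooked performance data. As I understand the counter types, the raw AvgDiskQueueLength is a running time-weighted accumulator, not an average, so values in the millions are normal for the raw class, and a tool that reads it without "cooking" it against a second sample could report exactly this kind of number. A minimal PowerShell sketch comparing raw vs. formatted (the formatted class may need a second query before rate-based counters cook properly):

# Raw counter class: values are uncooked. AvgDiskQueueLength here is a
# time-weighted accumulator, so huge numbers are expected.
Get-WmiObject -Class Win32_PerfRawData_PerfDisk_LogicalDisk |
    Select-Object Name, AvgDiskQueueLength

# Formatted class: WMI cooks the raw data into the value perfmon shows;
# this is the number that should normally stay in single digits.
Get-WmiObject -Class Win32_PerfFormattedData_PerfDisk_LogicalDisk |
    Select-Object Name, AvgDiskQueueLength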

System is Hyper-V 2008 R2 with multiple VMs.
Four SAS disks: 15K, 6Gb/s (write-cache status: write back)
RAID level: 5
RAID card: Adaptec 6805, 512MB RAM
Stripe size: 256KB
Write-cache mode: write through

I have another similar system with 6 SAS drives in RAID 10, and it gets the same kind of WMI results.
0
Comment
Question by:jpgillivan
12 Comments
 
Accepted Solution
by: Adam Brown (earned 250 total points)
nAble pulls disk queue length as a single snapshot on a scheduled basis, rather than averaging over the period since the last check. I've seen it regularly pull corrupted or incorrect data on that counter, so if it goes into an error state, wait for the next scan to verify whether it's really bad. It will usually go back to normal on the next poll of the service.
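To sanity-check a suspect reading by hand, here's a minimal PowerShell sketch that averages many one-second samples instead of trusting a single snapshot (the counter path assumes the English counter names):

# Average 30 one-second samples instead of trusting a single snapshot.
$samples = Get-Counter -Counter '\LogicalDisk(_Total)\Avg. Disk Queue Length' `
                       -SampleInterval 1 -MaxSamples 30
$values = $samples | ForEach-Object { $_.CounterSamples[0].CookedValue }
$stats  = $values | Measure-Object -Average -Maximum
"Avg: {0:N2}   Max: {1:N2}" -f $stats.Average, $stats.Maximum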
Author Comment
by: jpgillivan
I am going to set the failed upper limit to 10000 and see if the alerts stop.  

However, I am concerned that maybe there is a disk I/O issue.

Or do you think it's a bug with N-Able?  
I have a case open with them but they won't/don't acknowledge there is an issue.
Expert Comment
by: Adam Brown
I usually shut down alerts on disk queue length because the way it's calculated isn't always accurate and can easily be skewed by a sudden burst of small transactions following a lengthy single transaction (Windows estimates disk queue length using Little's Law: queue length = arrival rate × wait time). You'll want to record it for reporting purposes, but take it out of the alerts (this can be done by disabling the threshold in nAble). % Disk Time is basically the queue length × 100.

A better counter to monitor is % Idle Time, which is a 0-100% value representing how much time the drive has spent with no incoming requests. That value is calculated by marking when the disk goes idle, then marking again when it processes an action. The values you want to see depend on the monitoring interval, but I would set a warning below 20% and a failure below 10% (100% means the drive was idle for the entire window).
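As a rough sketch of that alerting logic in PowerShell (the 20%/10% thresholds are just the suggestions above):

# One sample of % Idle Time for all disks combined; flag it using the
# suggested thresholds (warning < 20%, failure < 10%).
$idle = (Get-Counter '\LogicalDisk(_Total)\% Idle Time').CounterSamples[0].CookedValue
if     ($idle -lt 10) { "FAILURE: only {0:N1}% idle" -f $idle }
elseif ($idle -lt 20) { "WARNING: only {0:N1}% idle" -f $idle }
else                  { "OK: {0:N1}% idle" -f $idle }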
Expert Comment
by: Philip Elder
What's Disk Queue Length in the VM(s)?

Another indicator that the disk subsystem is the limiting factor is latency.

Check the latency within the guest and on the host to see if it is hitting 150ms or higher. At 300ms and above, users start getting impacted. Higher than 450-500ms and performance is not good at all.
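A quick way to spot-check latency from PowerShell, on the host and inside the guest (the counters report seconds, hence the ×1000; the 150/300ms bands are the ones above):

# Read/write latency per logical disk, converted to milliseconds.
$paths = '\LogicalDisk(*)\Avg. Disk sec/Read',
         '\LogicalDisk(*)\Avg. Disk sec/Write'
(Get-Counter -Counter $paths).CounterSamples |
    Where-Object { $_.CookedValue -gt 0 } |
    ForEach-Object { "{0} : {1:N1} ms" -f $_.Path, ($_.CookedValue * 1000) }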
Author Comment
by: jpgillivan
The purpose of monitoring disk queue length is to determine whether the current disk setup is configured/provisioned correctly to handle the workload.

Disk idle time is typically above 70%, and mostly above 90%, during that time.

I have also been monitoring disk I/O to use alongside queue length during that time.

One of the VMs is a SQL server. I am not typically seeing disk I/O or queue alerts from the SQL server at the same time as the Hyper-V host.

I don't think N-Able has the ability to check latency. I will have to figure out how to do that in Windows.
Expert Comment
by: Philip Elder
Start --> type ResMon, then press ENTER on the result.

Or, on the host, set up PerfMon to monitor both the host's and the guests' disk subsystems. This is the most accurate way to get readings from both.

ResMon works well in a pinch.
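As a sketch of the PerfMon-log idea driven from PowerShell (the file path and ~24-hour duration are arbitrary choices; run it on the host and inside the guest separately to compare views):

# Log the main disk counters to a .blg you can open in PerfMon later.
# 5760 samples x 15s = ~24 hours.
$paths = '\LogicalDisk(*)\Avg. Disk Queue Length',
         '\LogicalDisk(*)\% Idle Time',
         '\LogicalDisk(*)\Avg. Disk sec/Read',
         '\LogicalDisk(*)\Avg. Disk sec/Write'
Get-Counter -Counter $paths -SampleInterval 15 -MaxSamples 5760 |
    Export-Counter -Path 'C:\PerfLogs\disk-24h.blg' -FileFormat BLG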
Assisted Solution
by: Adam Brown (earned 250 total points)
"disk idle time is typically above 70% and mostly above 90% during that time."

Then the setup is capable of handling the load. 70% idle time means the drives are certainly not causing a bottleneck. Disk queue length is just not a very accurate counter to use, particularly when examined through WMI, which doesn't maintain averages over time. Perfmon would give you a better idea of disk-queue-length averages, and it would be a very good idea to set up a performance log on that counter and the other disk counters for a while to get better data; the amount of time between scans in nAble makes it less reliable for determining averages on disk queue length.

nAble pulls data from WMI about once every 5 minutes by default, so the values it reports for things like Disk and CPU performance aren't very reliable unless there is a very long period of high utilization. Perfmon polls the system counters once or twice per second, so it will give you a much more reliable and useful set of data to work with when benchmarking systems.

nAble's monitors are designed to help identify long-term trends that may be symptomatic of underlying issues before they become serious problems. They aren't designed to help find performance bottlenecks.

Another tool that would be useful for your purposes is SQLIO: https://www.microsoft.com/en-us/download/details.aspx?id=20163
It benchmarks SQL-style I/O, so it will help you determine whether the drive array you have will meet your requirements.
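A hedged example of a run (c:\testfile.dat is a placeholder test file; the flags are from the standard SQLIO distribution, so verify with sqlio -? before relying on them):

# 120 seconds of 8KB random reads, 8 outstanding I/Os, latency stats (-LS),
# against a placeholder test file. Repeat with -kW to test writes.
sqlio -kR -s120 -frandom -o8 -b8 -LS c:\testfile.dat

8KB blocks roughly mimic SQL page I/O; note the caution in the next comment about running this during production hours.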
Expert Comment
by: Philip Elder
When it comes to SQLIO, DiskSPD, IOmeter, Exchange JetStress, and other such performance stress tools, keep in mind that running them during production hours may bring production workloads to a crawl.
Author Comment
by: jpgillivan
Adam,

The client has complained about their SQL application (VM) having slow-responsiveness issues from time to time, which is why I was looking at queue lengths (disk performance). I ran perfmon for 24 hours and saw no extremely high numbers that matched N-Able, and since N-Able was pulling from WMI I was concerned about the discrepancy.

Based on what I have researched, the (acceptable) average queue length for a RAID 5 with four drives is 6: (number of spindles - 1) × 2.
I understand the general rule is that a disk queue length over 2 (per spindle) indicates a bottleneck. I also understand that there may be spikes over an acceptable limit, and as long as they are not sustained or frequent the issue is probably minor (but worth noting).
Since I am monitoring queue length with perfmon/N-Able, should I still be using the value of 2, or 6?
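For concreteness, the two rules of thumb work out as follows on this box: all spindles, 4 × 2 = 8; RAID 5 data spindles, (4 - 1) × 2 = 6. Either way the alert threshold belongs in single digits, nowhere near the millions N-Able was reporting.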

I have also been monitoring disk I/O total operations per second and have seen what I believe to be high numbers for this RAID setup: values over 300. I based this on a calculation from this site: http://www.thecloudcalculator.com/calculators/disk-raid-and-iops.html
Based on the calculator, a RAID 5 with 15K SAS drives is capable of 312 IOPS.
Occasionally I will see values over 400, and even into the 1000s, but that is typically after hours when backups are running.
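For reference, the usual back-of-envelope model behind calculators like that one (the per-disk figure is an assumption): a 15K SAS drive is good for roughly 175-200 IOPS, so four give ~700 raw. RAID 5 carries a write penalty of 4 (read data, read parity, write data, write parity), so functional IOPS ≈ raw × %read + (raw × %write) ÷ 4. A read-heavy mix lands well above 312 and a write-heavy mix well below it, so read-driven bursts over 400 during backups aren't necessarily out of line.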
Expert Comment
by: Philip Elder
A stripe size of 256KB is great for throughput, that is, reading and writing large files.

For IOPS, especially for database-driven apps with lots of smaller reads and writes, it will bomb for performance.

And four 15K disks in RAID 5 won't have much in the way of IOPS anyway ... maybe 200 per disk if they are 3.5", and less with that 256KB stripe size.

As a rule, a stripe size (we build on Storage Spaces, so interleave size for us) of 64KB is the place to start for storage hosting fairly intensive read/write activity like mail and SQL databases.

EDIT: As a point of reference:
Large stripe size: favors throughput (GB/second) at the expense of IOPS (I/Os per second).
Small stripe size: favors IOPS at the expense of throughput.
We have a great blog post that explains things based on our own learning experiences building out high performance storage: MPECS Inc. Blog: Storage Configuration: Know Your Workloads for IOPS or Throughput
Author Comment
by: jpgillivan
Thanks Philip. I found an interesting article by someone who did stripe-size testing; however, it only went up to 128KB. Given the results comparing 64KB to 128KB, I would think that 128KB and 256KB would be close as well.
http://www.kendalvandyke.com/2009/02/disk-performance-hands-on-part-3-raid-5.html

The author also indicated that, in raw numbers, 64KB was the best performer. I will keep that in mind on future builds.

Can anyone confirm whether I am calculating average queue lengths correctly for RAID 5, or do I simply stick with the value of 2 no matter what when pulling (physical disk) data from the OS (perfmon, WMI, etc.)?

If I am pulling logical volume lengths, I would assume I would use the value of 2. Is that correct?
Assisted Solution
by: Philip Elder (earned 250 total points)
We test _all_ disk subsystems before we deploy. We are not interested in deploying a very expensive storage system that turns out to be a brick. We've seen that happen with some very expensive storage setups belonging to others.

We've tested from 512KB stripes (GBs/second) right down to 4KB (1.2M IOPS via SAS SSDs, 3 nodes, and 1 JBOD).

In our experience the best disk queue length limit = # disks × 2. Five disks would be a DQL of 10 or less. That is where we would start.