Solved

Why does my 10Gb iSCSI setup see such high latency, and how can I fix it?

Posted on 2014-07-17
1,100 Views
Last Modified: 2016-11-23
I have an iSCSI server set up with the following configuration:

Dell R510
Perc H700 Raid controller
Windows Server 2012 R2
Intel Ethernet X520 10Gb
12 near line SAS drives
I have tried both StarWind and the built-in Server 2012 iSCSI target software but see similar results. I am currently running the latest version of StarWind's free iSCSI server.

I have connected it to a 10Gb port on an HP 8212 switch, which is also connected via 10Gb to our VMware servers. I have a dedicated VLAN just for iSCSI and have enabled jumbo frames on the VLAN.
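
As a quick sanity check that jumbo frames really pass end to end on the iSCSI VLAN, a do-not-fragment ping sized for a 9000-byte MTU can be run from the Windows target to each ESXi vmkernel address (a rough sketch; the addresses below are placeholders):

import subprocess

# Hypothetical vmkernel iSCSI addresses of the ESXi hosts -- replace with yours.
ESXI_VMK_IPS = ["192.168.50.11", "192.168.50.12"]

for ip in ESXI_VMK_IPS:
    # Windows ping: -f = don't fragment, -l = payload bytes, -n = count.
    # 8972 = 9000-byte MTU minus 20 (IP header) and 8 (ICMP header).
    rc = subprocess.call(["ping", "-f", "-l", "8972", "-n", "2", ip])
    print(ip + (": jumbo OK" if rc == 0 else ": jumbo frames NOT passing end to end"))

# The equivalent check from the ESXi side is: vmkping -d -s 8972 <target IP>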

I frequently see very high latency on my iSCSI storage, so much so that it can time out or hang VMware, and I am not sure why. I can run IOmeter and get some pretty decent results.

I am trying to determine why I see such high latency (100+ ms). It doesn't happen all the time, but several times throughout the day VMware complains about the latency of the datastore. I have a 10Gb iSCSI connection between the servers, and I wouldn't expect the disks to be able to max that out; the highest I could see when running IOmeter was around 5Gb/s. I also don't see much load at all on the iSCSI server when I see the high latency. It seems network related, but I am not sure what settings I could check. As I said, the 10Gb connection should be plenty, and it is nowhere near maxed out.

Any thoughts on configuration changes I could make to my VMware environment or network card settings, or any ideas on where to troubleshoot this? I am not able to find what is causing it. I referenced this document for changes to my iSCSI settings:

http://en.community.dell.com/techcenter/extras/m/white_papers/20403565.aspx

Thank you for your time.
iometer.csv
Question by:gacus
11 Comments
 
LVL 121
ID: 40202720
So your "SAN" is running Starwind Software iSCSI connected to VMware ESXi ?
 
LVL 1

Author Comment

by:gacus
ID: 40202982
yes
 
LVL 121

Assisted Solution

by:Andrew Hancock (VMware vExpert / EE MVE^2)
Andrew Hancock (VMware vExpert / EE MVE^2) earned 167 total points
ID: 40203026
There are some specific, vendor-defined iSCSI settings that we use; I'll dig them out tomorrow and you could try them. They are recommended by HP and NetApp for their hardware SANs.

Have you configured multipathing?

You may also want to check which version of ESXi you are using.
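
In the meantime, a minimal sketch of pulling those details straight from the ESXi shell (ESXi 5.5 ships a Python interpreter and esxcli; "vmhba33" is a placeholder for the software iSCSI adapter name reported by the adapter list):

import subprocess

# Each command prints to stdout; run as root from the ESXi shell.
CMDS = [
    ["esxcli", "system", "version", "get"],                           # exact ESXi version/build
    ["esxcli", "iscsi", "adapter", "list"],                           # find the software iSCSI vmhba
    ["esxcli", "iscsi", "adapter", "param", "get", "-A", "vmhba33"],  # current iSCSI parameters
    ["esxcli", "storage", "nmp", "device", "list"],                   # path selection policy per LUN
]

for cmd in CMDS:
    print("### " + " ".join(cmd))
    subprocess.call(cmd)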
 
LVL 1

Author Comment

by:gacus
ID: 40203036
ESXi 5.5, build 1891313

I only have one 10Gb connection for my iSCSI, so there is no need for multipathing. It should be plenty for this server. It would be nice to have a backup, but the cost of the extra 10Gb connection makes it not an option.
 
LVL 58

Accepted Solution

by:
Cliff Galiher earned 167 total points
ID: 40203442
I know you probably won't like this answer (and, of course, you are welcome to try to find a better one), but I don't think there is much you can do in this situation. Several factors are accumulating to produce the behavior you are seeing.

First is the controller. The H700 is honestly a mid-range controller at best. The 800 series is a bit better, but for "roll your own SAN" solutions, none of the Dell controllers are very good. You really have to start considering a native LSI controller if you want good performance. Dell doesn't expect their servers to be used as SANs, so they heavily optimize their drivers and caching routines for single-application access, or at least on-server access. Because of how iSCSI traffic flows, it can basically negate the entire controller cache, and obviously that comes with a performance hit.

Your second issue is the NL-SAS drives. The distinction between NL-SAS and SAS is a simple one: NL-SAS is a SATA drive with SAS firmware bolted on. Sure, it can "understand" SAS commands, but it doesn't really do the things real SAS drives do, like queue reprioritization. A real SAS drive can take instructions from the controller and find the most optimal way to process them. An NL-SAS drive will usually do minimal or (more often) no optimization and just handle requests in the order it received them, which, during heavy I/O or even moderate random I/O, can add sudden latency.

iSCSI on a server certainly has a place. For archival storage, backup storage, and other such uses, iSCSI on a target server is *great*: single streams of I/O, and if a failure occurs, reasonable downtime is not an issue. But for the usual place where people want a SAN, which is the use case you are describing, the benefits just aren't there. Servers aren't optimized for this use, and of course the whole point of running multiple VMware or Hyper-V nodes is to eliminate single points of failure... but with a "roll your own SAN" server, you've just kicked the can down the road and made the storage the single point of failure. That isn't particularly useful.

The truth is, given the platform you built, I think you'll just have to accept the latency. The bottleneck isn't the 10Gb link; it is the I/O on the target. Because your target is running Windows, you *do* have the benefit of turning on performance monitors and counters to verify this. I think you'll find, when you do, that the actual disk queues on the target are high when you see the latency warnings, while your network utilization is still relatively low.

-Cliff
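
A rough sketch of that verification, run on the Windows target with the built-in typeperf tool: sample disk queue depth, write latency, and NIC throughput while the latency warnings are firing (English counter names and an arbitrary 5-minute window are assumed):

import subprocess

# Windows performance counters (English counter names assumed).
counters = [
    r"\PhysicalDisk(_Total)\Current Disk Queue Length",
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Write",
    r"\Network Interface(*)\Bytes Total/sec",
]

# typeperf ships with Windows: -si = sample interval in seconds, -sc = number of samples.
subprocess.call(["typeperf"] + counters + ["-si", "5", "-sc", "60"])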
 
LVL 1

Author Comment

by:gacus
ID: 40203480
All I am using this SAN for is VMware backup, so I don't need high performance or redundancy. We have high-end enterprise Fibre Channel SANs for our VMs. I understand what performance I should get out of my low-end drives and RAID controller, and I get it sometimes; other times I see high latency.

The issue is that I hardly see any load on my disks during the latency issues.
 
LVL 121
ID: 40203502
If it's just for backup, performance should be fine?

Replace the OS, create a JBOD using an LSI SAS HBA (no RAID), and use ZFS on a Solaris implementation, with a few SSDs for ZIL and ARC cache.
 
LVL 47

Expert Comment

by:David
ID: 40203775
Your problem is most likely due to your RAID config. Don't tell me: reads are OK, but writes crap out after a few seconds sustained. If that is the case, it confirms it is your RAID config. No tweaking other than going to RAID10 and smaller volumes will help.
 
LVL 1

Author Comment

by:gacus
ID: 40204388
Thank you all for your advice!  I really appreciate it!

Very interesting, dlethe. I use RAID10 on all of our other systems, but since this is just for backup and I needed the space, I used RAID6. Can you provide more detail on what you are saying? It does appear that it slows down over time, and reads are indeed noticeably faster.

Andrew: I don't have the budget to add to the configuration, which is why it is such a cheap setup. I would also lose a good chunk of space removing two of the drives. Any thoughts on what I could do without adding anything?
 
LVL 47

Expert Comment

by:David
ID: 40204415
RAID6 is slow as heck at writing. The reason it is fast for a second or so and then drops is that the controller cache buffers the writes; once the cache is full (so it actually has to write to the disks), it cranks down.

If you want speed for writes, don't use RAID6. It is that simple. Google articles about how RAID5 and RAID6 work and see for yourself.
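
A back-of-the-envelope sketch of why, using assumed per-drive numbers rather than anything measured on this particular box:

# Assumed figures (not from the thread): ~75 random IOPS per 7.2k rpm NL-SAS drive,
# write penalty of 6 for RAID6 (read + write of data and both parity blocks)
# and 2 for RAID10 (one extra mirror write).
drives = 12
iops_per_drive = 75
raw_iops = drives * iops_per_drive          # ~900 back-end IOPS

for level, penalty in (("RAID6", 6), ("RAID10", 2)):
    print("%s: ~%d sustained random-write IOPS" % (level, raw_iops // penalty))

# RAID6 lands around 150 IOPS versus roughly 450 for RAID10, which is why writes
# "crank down" once the controller cache can no longer absorb the burst.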

This has nothing to do with TCP/IP or your network.  

The controller and disks don't care what your budget is, by the way ;)

If you went to Solaris and used ZFS with a pair of the smallest SSDs you could find for the ZIL, you would be much better off. You could also enable compression at the filesystem level to get back some space. Use the RAIDZ2 configuration in ZFS, which is like RAID6 but better. [As Andrew suggested]
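
A rough sketch of that layout in ZFS terms, with placeholder device names (the mirrored SSD pair for the SLOG would be an extra purchase on top of the 12 NL-SAS drives):

# Hypothetical device names -- check with "format" / "zpool status" first.
data_disks = ["c1t%dd0" % i for i in range(12)]   # the 12 NL-SAS drives as one RAIDZ2 vdev
slog_ssds = ["c2t0d0", "c2t1d0"]                  # two small SSDs, mirrored, as the ZIL/SLOG

cmds = [
    ["zpool", "create", "backup", "raidz2"] + data_disks + ["log", "mirror"] + slog_ssds,
    ["zfs", "set", "compression=on", "backup"],   # claw back some of the space lost to parity
]

for cmd in cmds:
    print(" ".join(cmd))   # review the commands, then run them with subprocess.call(cmd)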

------------- OR -----------------
A suggestion [only if your software allows this]: buy two of the largest disks you can afford and build them as a RAID1. Modify the backup process so step one is to copy the files you want to back up to that RAID1. Then back up from the RAID1 to your RAID6, and then delete the files from the RAID1. You will still be protected in case of HDD loss and have two levels of protection.

The RAID1 will allow backups to complete much more quickly on the machines being backed up, and the server can then take its sweet time migrating from the RAID1 to the RAID6 internally. Using multiple RAID levels as backup storage pools is the textbook way of solving this problem frugally.
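
A minimal sketch of that migration step, assuming the staged files just need to be swept from the RAID1 volume to the RAID6 volume on a schedule (drive letters are placeholders):

import os
import shutil

STAGING = r"E:\staging"   # the fast RAID1 volume (placeholder drive letters)
ARCHIVE = r"F:\backups"   # the big RAID6 volume

# Backups land in STAGING first; this moves them to ARCHIVE and frees the RAID1.
for name in os.listdir(STAGING):
    shutil.move(os.path.join(STAGING, name), os.path.join(ARCHIVE, name))
    print("migrated " + name)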
 
LVL 47

Assisted Solution

by:David
David earned 166 total points
ID: 40204437
Or use a staging server that has lots of disk space and a RAID10 or RAID1. Back up there, and migrate. Use a second dual-ported NIC, direct-attached between these two systems specifically for the backup pipe. Bond the ports so you get twice the throughput. No need to even go through a switch.

That way the backup completes much faster, and your normal network bandwidth is not affected by the copy from the temporary storage pool to the repository, which is a less expensive and slower tier.
