• Status: Solved

VMware iSCSI Stability issues.


We have 4 ESXi servers in a cluster. About once a week, one of our ESXi 4.1 (latest) servers, seemingly at random, appears to have a problem connecting to the SAN (Dell MD3220i via the software iSCSI adapter) using Round Robin.

All VMs on the server are up but disk access is so slow that they are unusable.

The vSphere client cannot connect directly to the problem ESXi host, but can connect to the other 3. The problem host always appears to be the one hosting the vCenter VM, so we cannot control the server via vCenter either.

We can get access to the problem host via SSH without issue, but anything that involves the datastore just hangs (reboot, navigating to the datastore, etc.).
While the one host is having a very bad time, the other 3, whilst up, begin to show signs of slowdown / latency in their own SAN access, which gradually gets so bad that they have intermittent latency lags (hdparm showed 1MB/sec throughput; ls -al sometimes works, sometimes hangs for 5-10 seconds).

As the reboot command via SSH doesn't work, we power cycle the server. The second the problem ESXi host goes down, all other ESXi hosts begin working normally, but obviously this isn't ideal!

I have a cat of the /var/log/messages from the server the last time this happened and have attached it.

We have an active case open with VMWare about this, but wanted to see if it sounded familiar to anyone in the community also.

Tonight, for the first time, the problem server wasn't the vCenter host server. However, following the problem (and a reboot of the problem host), the vCenter VM became unresponsive and eventually took out its own host as described above, resulting in it needing to be power cycled.

All is back up and 'normal' again now!

The servers are diskless, booting from an SD card, and work 100% fine, until they don't.
They are running software iSCSI initiators to a jumbo-frame enabled redundant Gbit switch setup going to redundant controllers on the Dell iSCSI SAN.
IOPS do seem to peak massively around the same time, then tail off massively.

Under normal load we're running around 400-600 IOPS on a 24x600GB SAS array, so plenty of headroom.

Attachment: cv4.txt
Specifically, what is the hardware config? I care most about the make/model of the controller, memory, and make/model of the disks. I've seen this before but need more info.
WolfofWharfStreet (Author) commented:

It's a Dell MD3220i iSCSI SAN with 24x 600GB 6Gbps SAS drives.

 controller2.txt controller1.txt




Media type: Hard Disk Physical Disk
Interface type: Serial Attached SCSI (SAS)
Physical Disk path redundancy:
Security Capable:
Read/write accessible:
Security key identifier: Not Applicable
10,000 RPM
Current data rate: 6 Gbps
Product ID:
Physical Disk Firmware Version:
Serial number:
Date of manufacture: November 22, 2010

Do let me know if you need anything more.
The dumps don't show a lot of the details, but the most likely scenario is that this is a result of HDD error recovery when you get a few bad blocks. I've not been inside one of these systems, but if it is built on an LSI controller (the PERC H700, for example), then you can tell it to do 24x7 background scanning. This will repair bad blocks during idle time. Get 1 or 2 consecutive bad blocks, which is pretty much the norm, and you very much will get a 5-10 second hang.

If you have an option for a battery backup of the controller, then purchase it.
In any event, it isn't likely an ESX problem; it is with the Dell. If you have a support contract (which you should, since it is new), then get Dell involved and ask them about error recovery delays. Make sure the hardware is set up to run this automatic "patrol" operation during idle time, so that bad or weak blocks are repaired before you need the data.

WolfofWharfStreet (Author) commented:

The systems have no local HDDs or RAID card. They are Dell R610 servers with an embedded SD card for the VMware hypervisor.

The RAID array shows Optimal status and no data is lost. Rebooting the ESXi host cures the issue until it recurs later.

To me it just looks like an iSCSI networking / pathing issue which isn't present or visible under normal conditions, but something triggers it and it gets progressively worse until I reboot the ESXi host that displays the worst symptoms (i.e. the only one I can't connect to with vSphere, either directly or via vCenter).

I don't expect this is a hardware issue, more a software/config issue.
You mean it doesn't have 24 of those disks internally? The part number in the dump, ST9600204SS, is a SAS disk, not an embedded SD card.

But since you have the hardware in front of you, hopefully you have a decent Ethernet switch, one with RMON capability. If it has a web interface, then have the switch monitor traffic, especially during one of these cycles. I would also consider, wherever it is possible, doing a direct attach with a cross-over cable. Why put traffic on the switch if you can do direct-attach? There won't be any Ethernet collisions, duplicate IPs, or anything else to contend with, so you will get higher efficiency and no interruptions.
WolfofWharfStreet (Author) commented:

No, apologies if I wasn't clear.

4x Dell R610 with 96GB RAM, dual hex-core CPUs and NO local storage
The Dells boot from a local SD card
All data is stored on the iSCSI SAN

The iSCSI SAN has 24x600GB SAS drives, 2x controllers and 2x PSUs
Each controller is connected to a different Cisco Gbit switch

Each Dell has 3 Gbit connections to the switches (two to one switch, one to the other)

Connections from the Dells to the iSCSI SAN go over the VMware iSCSI software initiator
The Dells all have 4 onboard Broadcom NICs and either 2 additional PCI-E Broadcom ports or 4 additional Intel ports.

The switches and NICs are all set up to use jumbo frames
The servers use RR pathing.

I use the switch as I don't have enough ports on the SAN to connect all 4 (at present) ESXi host servers directly.
Understood. My answer still stands, but now I need to be specific ;)
 -> Look inside the iSCSI SAN system, i.e., the Dell MD3220i.

Take ESXi and the networking out of the equation and look at it as I/O hanging at the source: the RAID controller stops sending data during the hangs, and then everything goes downhill from there. I do work on sites that have petabytes of data on the SAN, and it almost always comes down to a hold-up at the disk+controller. (But that is anecdotal, so it doesn't mean that this is what is happening to you. It just means that it is more common than you may think.)
WolfofWharfStreet (Author) commented:
I do keep an eye on the SAN during the issue, but the crucial point is that it only hits one ESXi host hard, the others to a lesser degree, and as soon as I power off the problematic box the issue immediately resolves.

Whilst I am open to any possible solutions, this does point more to an ESX-based issue than the SAN, especially as the SAN is showing 100% Optimal.

Given that all servers use the same controllers, the same LUNs and the same switches, I would expect them all to suffer to a similar degree if it was the SAN itself.

I am contemplating removing the multiple paths we have configured on each ESXi host and reverting to a single path per host.
I see where you are coming from, and your argument would make sense ... but ONLY if every host used the same physical devices and block numbers on the MD3220i.

When you run into bad block recovery, remember it is for specific disk(s) and block numbers. But now you gave me an idea on how to confirm this. When the array goes into recovery mode, it only affects LUNs that use that particular disk drive.

So if LUN0 uses disks A,B,C and LUN1 uses disks D,E,F, and the problem is with disk "A", then the system that generated the read request on the affected blocks would hang. Any other systems that use other blocks on disk A could be affected too, but that depends on the specifics of what is in cache and whether or not they are waiting for a read. Any systems using physical disks D,E,F would not be affected.

Does this scenario make sense? When multiple systems ARE affected, do those systems share the same physical disk drives? If yes, then you have something compelling enough that you must investigate. If no, then you can eliminate this (provided you know for a fact that other systems are attempting to read something that requires disk "A").

WolfofWharfStreet (Author) commented:
That makes sense, however the system is set up as follows:

Datastore0 2TB
Datastore1 2TB
Datastore2 2TB
Datastore3 700GB

ESXi hosts 1,2,3,4 all use Datastore0 exclusively; Datastore1,2,3 are unpopulated VMDKs at present.
I suppose I could vMotion all VMs to an alternate datastore to eliminate the slight chance of it being a disk failure on Datastore0. However, as this will in itself generate an awful lot of SAN traffic (the SAN is not VAAI enabled yet), it's something I'd prefer to avoid if possible.

The SAN is also not reporting any bad disks or sector failures, and the VMs are not reporting any bad data or sector failures either.

I would much prefer this to be hardware related, as I could then simply swap it out, so I will happily explore all suggestions. However, I do expect this will be down to a software / configuration oddity.

The SAN will NOT report any bad disks or sector failures. That is why these things are hard to nail down. You would only get unreadable block errors if the LUN was degraded due to a drive failure and there was an unreadable block.

Now if you had a downtime window, you could run some decent diagnostics that would look at the physical SAS drives and give you counts of these bad blocks. In fact, you could even get timestamps (relative to power-on hours) so you know when they happened.

Perhaps you can do something else in an effort to eliminate this possibility ...
What about taking a low-end, stand-alone PC, doing a direct-attach, and letting it see all the iSCSI LUNs? I would also use Linux, as this would make things easier and eliminate any sort of weirdness and Microsoft AD nonsense.

Attach all the iSCSI LUNs read-only. When the problem happens, kick off a simple dd so it does a raw read of the iSCSI LUN at the physical level, not the file system level, directly into the bit bucket. If the dd hangs, then you know it is the MD3220i.

e.g., if the iSCSI LUN is /dev/sdc

dd if=/dev/sdc of=/dev/null bs=64k

This is 100% safe; you don't even actually mount any file system. It is just reading the physical blocks, and if the LUN goes offline, dd will tell you.
If you then open up another window on the Linux machine, you can measure throughput, so you can see MB/sec in real time.

Do this with an Ubuntu live CD, and you can just run this all from a ramdisk session without installing Linux.
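
For the "second window" throughput check, here is a minimal sketch of one way to do it, assuming a Linux box with sysfs. It samples the cumulative sectors-read counter (third field of /sys/block/<dev>/stat, counted in 512-byte units) and prints the delta as MB/sec. The function names and the 2-second default interval are made up for illustration, not something from the MD3220i or ESXi side.

```shell
#!/bin/sh
# Sketch: print the read throughput of a block device every few seconds.
# Assumes Linux sysfs; field 3 of /sys/block/<dev>/stat is the cumulative
# number of 512-byte sectors read from the device.

# sectors_read DEV  (DEV is a kernel name such as sdc, without /dev/)
sectors_read() {
    awk '{ print $3 }' "/sys/block/$1/stat"
}

# mb_per_sec BEFORE AFTER SECONDS: convert a sector-count delta to MB/sec.
mb_per_sec() {
    awk -v b="$1" -v a="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (a - b) * 512 / 1048576 / s }'
}

# watch_throughput DEV [INTERVAL]: loop until interrupted with Ctrl-C.
watch_throughput() {
    wt_dev=$1
    wt_interval=${2:-2}
    wt_prev=$(sectors_read "$wt_dev")
    while sleep "$wt_interval"; do
        wt_cur=$(sectors_read "$wt_dev")
        printf '%s %s MB/sec\n' "$(date '+%H:%M:%S')" \
            "$(mb_per_sec "$wt_prev" "$wt_cur" "$wt_interval")"
        wt_prev=$wt_cur
    done
}

# Example (run while the dd is going in the other window):
#   watch_throughput sdc 2
```

If the sysstat package happens to be on the live CD, iostat gives similar numbers; the sketch above only needs awk and a shell.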
WolfofWharfStreet (Author) commented:
I'll have my techs give this a try. I can do it direct into one of the spare controller ports so iSCSI saturation won't be an issue. It will push the IOPS through the roof, but that's not the end of the world.
Not necessarily ...

Set up a job that kicks off every 30 secs, and put a limit on dd using the "count=" option:

{ time dd if=/dev/sdb of=/dev/null bs=1024k count=100 iflag=direct 2>/dev/null; } 2>> /var/log/iscsiwatchdog.log
This reads just 100MB per run. The iflag=direct option (GNU dd) bypasses the Linux page cache, so repeated runs actually hit the SAN rather than local memory. The braces are needed because the shell's time keyword writes its report to stderr, and that is what gets appended to the log. Then add a sleep and put it in a loop.

The time command will tell you how long each of these took, and you will end up with a file that you can inspect whenever you like to see if you have any issues. Write the output of date to the log before each run (e.g. date >> /var/log/iscsiwatchdog.log) and each entry is preceded by the system date/time, so you know when it happened.

(30 seconds might not be the right number, you can probably stretch it out, just that the shorter the polling interval, the more likely you will catch the event)
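
Put together, the polling job could look something like the sketch below. It assumes GNU dd and GNU date (as on an Ubuntu live CD); the function names, the log line format and the DD_EXTRA variable are all invented for illustration. Each probe does a bounded raw read and appends one line with a timestamp, the device, ok/FAILED and the elapsed seconds.

```shell
#!/bin/sh
# Sketch of an iSCSI read watchdog. Assumes GNU dd and GNU date.
# DD_EXTRA holds extra dd options; it defaults to iflag=direct so that
# reads bypass the local page cache. Set DD_EXTRA="" to disable that.

# probe_once DEVICE LOGFILE [COUNT_MB]
# Reads COUNT_MB megabytes (default 100) from DEVICE into /dev/null and
# appends "<date> <time> <device> <ok|FAILED> <seconds>s" to LOGFILE.
probe_once() {
    po_dev=$1
    po_log=$2
    po_count=${3:-100}
    po_start=$(date +%s.%N)
    # DD_EXTRA is left unquoted on purpose so an empty value adds no argument.
    if dd if="$po_dev" of=/dev/null bs=1024k count="$po_count" \
          ${DD_EXTRA-iflag=direct} 2>/dev/null; then
        po_status=ok
    else
        po_status=FAILED
    fi
    po_end=$(date +%s.%N)
    po_secs=$(awk -v s="$po_start" -v e="$po_end" \
        'BEGIN { printf "%.2f", e - s }')
    printf '%s %s %s %ss\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$po_dev" "$po_status" "$po_secs" >> "$po_log"
}

# watchdog DEVICE LOGFILE [INTERVAL_SECONDS]: probe forever, default 30s.
watchdog() {
    while :; do
        probe_once "$1" "$2"
        sleep "${3:-30}"
    done
}

# Example (as root, device name is whatever the LUN shows up as):
#   watchdog /dev/sdb /var/log/iscsiwatchdog.log 30
```

Note that a probe which hangs will simply never log its line until the read returns, so a gap in the timestamps is itself the signal that the LUN stalled.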
WolfofWharfStreet (Author) commented:
Just to add, in the interests of completeness:

I also have 1x Veeam Windows VM and 1x vCenter Windows VM (both 2008 R2) connected directly to the SAN iSCSI network.

I have actually just removed the vCenter VM, given my earlier comments about it seemingly being at the centre of things.

Veeam needs to stay on to allow Direct SAN backups with CBT, but given it's designed to sit on the SAN directly I assume it's not going to be badly behaved (unlike the vCenter Windows VM, potentially).

I was a little concerned that maybe the Windows VM was shouting NetBIOS-type requests over the iSCSI LAN and upsetting it and its jumbo frames.
That is certainly a possibility ... but you'll go nuts thinking about what it could be. This exercise can be used to eliminate all sorts of things as well. The key is to get this Linux system going and use it as a watchdog timer that is pretty much running from the outside on its own path. Use it to flood jumbo frames as well and measure the results.
Hi, interesting problem. I am more of a network guy.

[Dell Storage] -> [2 x Cisco 1G, model?] -> 4 x ESXi servers with VMs on.

Slowdown of disk access; rebooting a single ESXi host seems to speed up the others.
Eventually all ESXi servers slow down?

Cisco switch model?
The fact that a single ESXi host causes the others to also slow down would indicate the problem is on the network/storage, or that the ESXi host is causing some kind of overload affecting the network/storage.
Do you have traffic monitoring on the Cisco switch?
Have you looked at the traffic normally coming into the switch (Cisco provides this per port) and what happens when a server is problematic?
WolfofWharfStreet (Author) commented:

Dell Storage --> 2x SRW2016-UK --> 4x ESXi servers with VMs on --> 3x SRW2016-UK to different networks.

Problem: almost like a broadcast storm. iSCSI latency (or availability) becomes an issue, primarily on the one problem host, but with a knock-on effect on all other hosts.

On the problem host all VMs stop working. On all other hosts, VM disk access is very slow, so they suffer hangs / slowdowns.

No traffic monitoring.
We do graph traffic per port, but it's graphed from one of the VMs, so it isn't handy at the time. Also, as the issue can't be left for 5+ mins for MRTG to update, it's again not that useful.
WolfofWharfStreet (Author) commented:
Well, it seems to have been an issue with the SAN network not liking having non-jumbo-frame devices on the same broadcast domain.

We removed everything except the SAN uplink connections themselves (we were monitoring using MRTG and had a Windows machine sat on the same LAN chatting away with its discovery protocols) and everything has remained solid.

We then added a non-JF server to the SAN switch to map it to the LUN and noticed the problem recur immediately, so we removed it and the problem went away.

Lesson learnt: no JF and non-JF devices on the same SAN LAN.
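
For anyone checking their own SAN LAN for the same mix, one rough way to spot a device or path that is not passing jumbo frames is a don't-fragment ping sized to the jumbo MTU. The sketch below assumes a Linux box with iputils ping (the -M do option sets the don't-fragment bit); the function names and the example address are placeholders, and 9000 is an assumed MTU that may not match your switches. ESXi has vmkping for the same kind of test from the host side, but check its options on your build.

```shell
#!/bin/sh
# Sketch: test whether a full-size jumbo frame reaches a host unfragmented.
# Assumes Linux iputils ping, where "-M do" forbids fragmentation.

# icmp_payload_for_mtu MTU: largest ICMP echo payload that fits in one
# IPv4 packet of that MTU (minus 20-byte IP and 8-byte ICMP headers).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# check_jumbo HOST [MTU]: send 3 unfragmentable pings sized for MTU
# (default 9000) and report whether they got through.
check_jumbo() {
    cj_size=$(icmp_payload_for_mtu "${2:-9000}")
    if ping -M do -c 3 -s "$cj_size" "$1" >/dev/null 2>&1; then
        echo "$1: ${cj_size}-byte unfragmented ping OK"
    else
        echo "$1: ${cj_size}-byte unfragmented ping FAILED"
        return 1
    fi
}

# Example (placeholder address, run from a jumbo-enabled interface):
#   check_jumbo 192.168.130.101 9000
```

A FAILED result for a host that answers normal-sized pings suggests that something on the path, or the host itself, is still at a standard 1500-byte MTU.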

WolfofWharfStreet (Author) commented:
Diagnosis and Real world testing.