We have four ESXi 4.1 (latest) servers in a cluster. About once a week, one of them (apparently at random) has trouble connecting to the SAN (a Dell 3220i via the software iSCSI adapter) using Round Robin multipathing.
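For anyone wanting to compare notes: when it happens we look at the multipathing state from an SSH session on the host. A rough sketch of the commands we use, assuming 4.1-era syntax (later releases moved the NMP commands under `esxcli storage nmp`); the `naa.*` identifier is a placeholder for the LUN's actual device ID:

```shell
# List all paths known to the host, with their state (active/dead)
esxcfg-mpath -l

# List NMP devices and confirm the path selection policy is
# VMW_PSP_RR (Round Robin) -- ESXi 4.x command layout
esxcli nmp device list

# Show the paths for a single device; the naa.* ID below is a
# placeholder, not our real LUN identifier
esxcli nmp path list -d naa.xxxxxxxxxxxxxxxx
```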
All VMs on the server stay up, but disk access is so slow that they are unusable.
The vSphere Client cannot connect directly to the problem ESXi host (it can connect to the other three), and the problem host always seems to be the one running the vCenter VM, so we cannot control the server that way either.
We can get into the problem host via SSH without issue, but anything that touches the datastore just hangs (reboot, browsing the datastore, etc.).
While that one host is in serious trouble, the other three, though still up, begin to show slowdown and latency in their own SAN access. This gradually worsens into intermittent lag: hdparm showed 1 MB/s throughput, and ls -al sometimes works, sometimes hangs for 5-10 seconds.
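For reference, this is roughly how we watch the storage latency from the affected hosts, assuming the 4.1-era tooling (the CSV filename is just an example):

```shell
# Interactive: press 'u' for the disk-device view, then watch
# DAVG/cmd (device latency) and KAVG/cmd (kernel latency)
esxtop

# Batch mode: capture a few samples of all counters to CSV
# for offline analysis (-b = batch, -n = number of iterations)
esxtop -b -n 5 > esxtop-sample.csv
```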
Since the reboot command via SSH doesn't work, we power-cycle the server, and that resolves everything: the second the problem host goes down, all the other hosts start working normally again. Obviously this isn't ideal!
I have a copy of /var/log/messages from the server from the last time this happened and have attached it.
We have an active case open with VMware about this, but wanted to see if it sounded familiar to anyone in the community as well.
Tonight, for the first time, the problem server wasn't the vCenter host. However, following the problem (and a reboot of the problem host), the vCenter VM became unresponsive and eventually took out its own host as described above, which then also needed to be power-cycled.
All is back up and 'normal' again now!
The servers are diskless and boot from an SD card; they work 100% fine, until they don't.
They run software iSCSI initiators through a jumbo-frame-enabled redundant gigabit switch setup to redundant controllers on the Dell iSCSI SAN.
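In case it helps others with a similar setup: we sanity-check that jumbo frames actually survive end-to-end from the host to the SAN controllers with vmkping. The IP below is a placeholder, not our real SAN controller address:

```shell
# 8972-byte payload + IP/ICMP headers = 9000-byte frame;
# -d sets the "don't fragment" bit, so this fails if any hop
# in the path can't pass jumbo frames intact
vmkping -d -s 8972 192.168.1.10
```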
IOPS do seem to peak massively around the same time, then tail off sharply.
Under normal load we're running around 400-600 IOPS on a 24x600 GB SAS array, so there's plenty of headroom. cv4.txt