ESXi 5.5 losing connection to datastore


We have set up a new VMware ESXi 5.5 server for a client and are having a rather troublesome issue.  The VM box seems to lose connection to the datastore about once a week, and the entire VM host seems to freeze.  This happens in the middle of the night with no warning; the guest machines are all rather idle and do not report any errors in their logs, other than that the previous shutdown was unexpected.

We are still able to ping the guest boxes and the VMware host.
The VMware vSphere Client also becomes mostly unresponsive.
Any commands seem to hang; the last time this happened, it hung on scanning datastores.  The only way to recover is a hard reboot of the box.  This, however, degrades our array and sets us up for a host of other issues.

There is no network-attached storage, just the internal RAID controller with 8 drives: one array is our working or primary datastore, the other array is a mirrored backup array, plus 1 hot spare.

1 VM host, ESXi 5.5
3ware 9650SE 8-port controller w/ battery backup
1 RAID 5 array <- Primary Data Store
1 RAID 1 array <- Backup Data Store

4 Windows Server 2008 R2 machines running:
1 DC/File server
1 Exchange/DC
1 Application Server
1 Remote Desktop Server

We have this setup running on a few servers with the same controller card; the only difference is we went ahead and used ESXi 5.5 instead of ESXi 5.1.  Maybe this was a real bad idea...

Any thoughts would be appreciated.


Stevef316Author Commented:
Possibly, but we could never get it to work with that card.  We had the newest drivers, updated firmware for the card, etc.; none of it made a difference.  After talking some more with some of our suppliers, it turns out they had seen some strange problems with those cards as well.

Solution... Return the RAID card and go to a card with MegaRAID, as that 3ware card is on its way out anyway.  Our issue was random; bottom line, the server would lose the datastore at least once a day, possibly more.  We could not afford to wait on LSI to fix their driver and/or firmware.
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
Is your server on the Hardware Compatibility List?

Is your storage controller on the hardware compatibility list?

You will need to check the logs in /var/log for this issue.  I suspect incompatible firmware or storage controller issues, which could also be a hardware fault.
vmwarun - ArunCommented:
A quick way to check the hardware for any problems is to look under Configuration -> Hardware -> Health Status which should show any problems with the host hardware.

piyushranusriSystem Cloud SpecialistCommented:
1. Log into the ESXi/ESX host and verify that the VMkernel interface (vmk) on the host can vmkping the iSCSI targets.

2. Perform some form of network packet tracing and analysis.

3. Capture virtual switch traffic with tcpdump and other utilities (KB 1000880): troubleshooting network issues by capturing and sniffing network traffic via tcpdump.

4. Check whether any scheduled task is running before the issue happens.

This can happen when backing up your VM clients. In our environment, ESX is taking snapshots initiated by EqualLogic. When that happens, there is a brief hiccup that causes a SQL disconnection. For XenApp this is not too bad, as everything can rely on the LHC and the connection will be automatically re-established. If you are running an app that does not attempt to keep open a SQL connection but instead terminates at the first set of dropped packets, you will get an app error.
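As a starting point for the log-review steps above, here is a small sketch that greps a copied-off vmkernel.log for the failure signatures seen in this thread. This is a hedged example, not an official VMware tool; the patterns are assumptions based on the log lines quoted later in this discussion.

```python
import re

# Signatures seen in this thread's logs (assumed patterns, not exhaustive):
# - "has entered the All Paths Down state"  (datastore APD event)
# - "timed out, resetting card"             (3w-9xxx driver card reset)
# - ScsiDeviceIO "failed H:0x.. D:0x.. P:0x.."  (SCSI command failure)
PATTERNS = {
    "apd": re.compile(r"All Paths Down"),
    "card_reset": re.compile(r"timed out, resetting card"),
    "scsi_fail": re.compile(r"failed H:0x[0-9a-fA-F]+ D:0x[0-9a-fA-F]+ P:0x[0-9a-fA-F]+"),
}

def scan_vmkernel_log(lines):
    """Count how often each signature appears in an iterable of log lines."""
    hits = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                hits[name] += 1
    return hits
```

You would feed it the lines of a vmkernel.log copied off the host; a spike of `card_reset` hits right before an `apd` hit would point at the controller rather than the network.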

please share the output
Stevef316Author Commented:
Sorry for our delayed response; we've been rather swamped.

As an update, the Configuration -> Hardware -> Health Status appears ok, with all green checks.

Here is what is listed under the events:

Device or filesystem with identifier mpx.vmhba32
:C0:T0:L0 has entered the All Paths Down state.
1/2/2014 5:23:38 AM

I believe there was another error stating it lost connection to the datastore; however, when the system is rebooted, the log is cleared.

We checked the HCL for ESXi 5.5 and found that our LSI 9650SE is NOT listed.  It is on the HCL for ESXi 5.1U1.

We also double-checked the Supermicro motherboard (MBD-X9DRI-F) at the manufacturer's website, and they list it as ESXi 5.1U1 compatible; however, we did not find the motherboard explicitly listed on the VMware HCL.  We did, however, find a Supermicro server with this motherboard listed on the VMware HCL (Model: SYS-7047R-TRF).

Here is the ping response:
~ # vmkping
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=64 time=0.060 ms
64 bytes from icmp_seq=1 ttl=64 time=0.048 ms
64 bytes from icmp_seq=2 ttl=64 time=0.049 ms

--- ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.048/0.052/0.060 ms
~ # esxcfg-route -l
VMkernel Routes:
Network          Netmask          Gateway          Interface    Local Subnet     vmk0
default      vmk0
~ # esxcfg-nics -l
Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description
vmnic0  0000:02:00.00 igb         Up   1000Mbps  Full   00:25:90:e0:6a:16 1500   Intel Corporation I350 Gigabit Network Connection
vmnic1  0000:02:00.01 igb         Up   1000Mbps  Full   00:25:90:e0:6a:17 1500   Intel Corporation I350 Gigabit Network Connection

We have not had a chance to capture any packet data yet.

There are no scheduled tasks running; we are only running Mozy Pro on the boxes, and we were getting ready to add Veeam until this problem showed up.

Since our controller is not specifically listed on the HCL, nor is our motherboard (rather embarrassing), I believe we have decided to attempt a snapshot of the virtual machines, blow away ESXi 5.5, reinstall ESXi 5.1U1, configure, and restore the snapshots of the virtual machines.

Since we did not upgrade to 5.5 (it was a clean install), I think this is our only real option.

Any thoughts before we dive in on Saturday would be appreciated.

Thank you for the assistance!!
Stevef316Author Commented:
Our current VM hardware versions are at level 8, so I would assume backing up via a snapshot, reinstalling ESXi 5.1U1, and restoring the snapshots should be pretty straightforward.
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
Snapshots are not backups.

Ensure you have full backups.

You could install ESXi to a supported USB flash drive and just add the VMs back to inventory.

Your VMs will be compatible with 5.1.

Did you ask Supermicro if they have new drivers for ESXi 5.5?
Stevef316Author Commented:
Just an update to this ongoing problem.

We installed a USB drive on the motherboard, installed ESXi 5.1U1, added the guest machines back to inventory, and fired everything back up.

Unfortunately we are still experiencing the same connection issues.

From the vSphere client summary page, the storage section just says loading... during this freeze, and the VM network section also says loading...

From the vmkernel.log, here are a couple of error messages:

2014-01-08T22:16:22.543Z cpu2:8194)ScsiDeviceIO: 2316: Cmd(0x41240249a400) 0x1a, CmdSN 0x26ed9 from world 0 to dev "naa.600050e06ec3580016360000978b0000" failed H:0x0 D:0x4 P:0x0 Possible sense data: 0x5 0x24 0x0.

2014-01-08T23:46:22.621Z cpu3:7490639)ScsiDeviceIO: 2316: Cmd(0x412402506780) 0x4d, CmdSN 0x1e4a08c0 from world 9281 to dev "naa.600050e06ec3580016360000978b0000" failed H:0x0 D:0x4 P:0x0 Possible sense data: 0x5 0x20 0x0.

2014-01-08T23:46:22.620Z cpu0:8200)<4>3w-9xxx: scsi7: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.
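For reference when reading lines like these, here is a hedged sketch of a decoder; the regex simply matches the format shown above, where H is the host/HBA status, D the device status, P the plugin status, and the three "Possible sense data" bytes are the SCSI sense key / ASC / ASCQ. The sense-code names come from the standard SCSI tables (sense key 0x5 = ILLEGAL REQUEST; ASC 0x20 = invalid command operation code, which lines up with the 3w-9xxx "Invalid command opcode" message above).

```python
import re

# Field layout assumed from the log format above:
#   "failed H:0x.. D:0x.. P:0x.. Possible sense data: 0x.. 0x.. 0x.."
LINE_RE = re.compile(
    r"failed H:0x(?P<host>[0-9a-fA-F]+) D:0x(?P<dev>[0-9a-fA-F]+) P:0x(?P<plugin>[0-9a-fA-F]+)"
    r"(?: Possible sense data: 0x(?P<key>[0-9a-fA-F]+) 0x(?P<asc>[0-9a-fA-F]+) 0x(?P<ascq>[0-9a-fA-F]+))?"
)

# Standard SCSI sense keys / additional sense codes relevant to this thread
SENSE_KEYS = {0x0: "NO SENSE", 0x5: "ILLEGAL REQUEST"}
ASC = {0x20: "invalid command operation code", 0x24: "invalid field in CDB"}

def decode(line):
    """Pull the H/D/P status bytes and sense data out of a ScsiDeviceIO failure line."""
    m = LINE_RE.search(line)
    if not m:
        return None
    out = {k: int(v, 16) for k, v in m.groupdict().items() if v is not None}
    out["sense_key_name"] = SENSE_KEYS.get(out.get("key"))
    out["asc_name"] = ASC.get(out.get("asc"))
    return out
```

On the two log lines above this would report sense 0x5/0x24 (invalid field in CDB) and 0x5/0x20 (invalid command operation code): the controller is rejecting commands it does not understand, which fits a driver/firmware mismatch rather than a failing disk.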

The hostd.log:

2014-01-08T22:11:16.526Z [63D3DB90 error 'SoapAdapter.HTTPService'] HTTP Transaction failed on stream TCP(error:Transport endpoint is not connected) with error N7Vmacore15SystemExceptionE(Connection reset by peer)
2014-01-08T22:11:22.639Z [63A64B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
2014-01-08T22:11:44.602Z [63D7EB90 verbose 'ResourcePool ha-root-pool'] Root pool capacity changed from 16776MHz/125648MB to 16776MHz/125647MB
2014-01-08T22:12:53.968Z [63A23B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
2014-01-08T22:13:23.642Z [63A23B90 verbose 'DvsManager'] PersistAllDvsInfo called
2014-01-08T22:13:44.607Z [63CFCB90 verbose 'ResourcePool ha-root-pool'] Root pool capacity changed from 16776MHz/125647MB to 16776MHz/125648MB
2014-01-08T22:14:25.292Z [63A64B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
2014-01-08T22:15:01.841Z [63A64B90 verbose 'SoapAdapter'] Responded to service state request
2014-01-08T22:15:56.619Z [63A23B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
2014-01-08T22:16:16.986Z [63940B90 info 'Vmomi'] Activation [N5Vmomi10ActivationE:0xcd6a7a8] : Invoke done [waitForUpdates] on [vmodl.query.PropertyCollector:ha-property-collector]
2014-01-08T22:16:16.986Z [63940B90 verbose 'Vmomi'] Arg version:
--> "891"
2014-01-08T22:16:16.986Z [63940B90 info 'Vmomi'] Throw vmodl.fault.RequestCanceled
2014-01-08T22:16:16.986Z [63940B90 info 'Vmomi'] Result:
--> (vmodl.fault.RequestCanceled) {
-->    dynamicType = <unset>,
-->    faultCause = (vmodl.MethodFault) null,
-->    msg = "",
--> }

The vpxa.log looks good.

We are planning on replacing the LSI 9650SE RAID controller this weekend; we are thinking we have a bad controller.

Any thoughts on the logs, etc?

Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
Either a hardware fault on the controller, bad firmware, incompatibility, or incorrect drivers.
Stevef316Author Commented:
The drivers for the RAID card came from the LSI and VMware websites; firmware for the card is version 26, and we will try to update it this evening to version 27.

The controller is on the VMware HCL for versions 5.1 and 5.1U1.

So we are still leaning towards bad controller.

Stevef316Author Commented:
An update...

We replaced the LSI 9650SE with an LSI 9750-8i.

And are still experiencing the random datastore disconnects.

The last time this happened, we looked at the host summary tab with the vSphere client, and the datastore and network sections were blank and said loading...

This seems to be the case each time it drops the datastores.  We grabbed another motherboard in case it's the issue, but at this point we are starting to scratch our heads.

any thoughts would be greatly appreciated.
Stevef316Author Commented:
one last thought...

When the system freezes, we can still ping the ESXi host and can connect via SSH.  All commands we issue to the host fail or time out.

The only thing that brings it back is a power cycle.
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
That would suggest an incompatible hardware component. I'll look over the logs this evening and report back tomorrow AM.
Stevef316Author Commented:

That's kind of what we thought, hence why we rolled back to 5.1U1 due to the RAID cards.

Searched the HCL for supermicro systems, and found one listed with our motherboard, posted it above:

"We also double checked the Supermicro motherboard (MBD-X9DRI-F) at the manufacturers website and they list it as ESXI 5.1U1 compatible, however we did not find the motherboard explicitly listed on the vmware HCL.  We did however find a Supermicro server with this motherboard listed on the vmware HCL, (Model:      SYS-7047R-TRF)."
Stevef316Author Commented:
Hello everyone,

We thought we would share our progress on this issue.
After paying for support with VMware, and after several of their engineers looked over our issue for the last three days, it is agreed that we may have a bad backplane in the case, causing the interruption between ESXi and the datastores.

We have replaced the backplane this morning, and are waiting to see if we crash again.  If we do, we are replacing the entire server with another one.

Thanks everyone for their input!!
Hi Steve,

Do you have an update on this situation?  We have the EXACT same issue, to the dot.

9650SE controller, installed ESXi 5.5 without reading the supported hardware.

The datastore disconnects, but the box is still available via ping/SSH, etc.  The logs are full of those same errors you had:

2014-04-28T19:41:56.823Z cpu2:299491)ScsiDeviceIO: 2337: Cmd(0x412e82ef6700) 0x28, CmdSN 0x577cd from world 32841 to dev "naa.600050e057b5eb0083380000065d0000" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-04-28T19:41:58.476Z cpu6:32799)ScsiDeviceIO: 2337: Cmd(0x412e826d5ac0) 0x28, CmdSN 0xdd16 from world 33320 to dev "naa.600050e057b5eb0083380000065d0000" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-04-28T20:30:15.244Z cpu0:33940)HBX: 2658: Waiting for timed out [HB state abcdef02 offset 3620864 gen 73 stampUS 78337633874 uuid 535d97ec-a1cff2c7-2867-002590e77f74 jrnl <FB 3950408> drv 14.60] on vol 'DATASTORE9'
2014-04-28T20:30:15.244Z cpu0:189983)Fil3: 15338: Max timeout retries exceeded for caller Fil3_FileIO (status 'Timeout')
2014-04-28T20:30:15.304Z cpu8:189983)HBX: 2658: Waiting for timed out [HB state abcdef02 offset 3620864 gen 73 stampUS 78337633874 uuid 535d97ec-a1cff2c7-2867-002590e77f74 jrnl <FB 3950408> drv 14.60] on vol 'DATASTORE9'
2014-04-28T20:30:27.224Z cpu12:308928)<4>3w-9xxx:0:0:0:0 :: WARNING: (0x06:0x002c): Command (0x28) timed out, resetting card.
2014-04-28T20:30:27.224Z cpu0:32883)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e8269e380, 32839) to dev "naa.600050e057b5eb0083380000065d0000" on path "vmhba1:C0:T0:L0" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

Our chassis is Supermicro and is brand new, so I want to make sure it really was the backplane before we start replacing the card, motherboard, etc.

Stevef316Author Commented:
Yes, there is a compatibility issue with these RAID cards and newer versions of ESXi.  DO NOT USE a 9650SE/9750, as you will never find a solution.

Our only solution, after numerous RAID cards, backplanes, motherboards, and entire servers, was to completely go away from that card.  You will need MegaRAID; the MegaRAID SAS 9261 is what we ended up using, and all our problems went away immediately.  Unfortunately, you get to rebuild the arrays, wait out the initialization period, and then restore.  I would buy Veeam for this server, try to get a good backup, and then restore to the new server/MegaRAID setup.
Stevef316Author Commented:
I might add, the problem did not show for us until the server went into production under a load.  Everything loaded fine, etc., without issues.  After a case with VMware support, LSI, and Supermicro, the only solution we could come up with was to migrate completely away from that type of card.  VMware had no idea; LSI and Supermicro blamed each other or VMware.
Stevef316Author Commented:
The last successful use of the 3ware/LSI 9650SE for us was in this version of ESXi: 5.1.0, build 799733, on which the server has been running for a year solid, minus two long power outages.
Andrew Hancock (VMware vExpert / EE MVE^2)VMware and Virtualization ConsultantCommented:
I'm afraid it does happen: vendors like you to upgrade your hardware with newer versions and never test older products with newer versions of the OS, and then driver support disappears from the OS for no reason.
Stevef316Author Commented:
Not exactly; the 3ware 9650SE has been around for years, the motherboard was in the same model lineup that had been around for a couple of years, and ALL were on the HCL at VMware.

The motherboard and RAID card had new VMware drivers; the only difference was the version of VMware, and I can honestly say after hours of dealing with support at all the vendors, it boiled down to an issue with VMware and the 3ware 9650/9750 series of cards.  Whether it was a driver issue or some other issue was never determined, as we had wasted enough time on the problem.

We went through several technicians at vmware, supermicro, and lsi, over the course of a month.

Our box is running five copies of Server 2008 R2 without any issues at all, so guest OS driver compatibility is not an issue.

Granted, gtech2014 above could have any number of issues going on, but I could save him a whole lot of time by sharing our experiences and resolution.
I was just about to post this. We have two other servers running ESXI 5.0 with 9650SE-8ML and they are up for >1 year without any issues whatsoever under very heavy loads.

So in that case, we'll try a downgrade of ESXi and not touch the backplane / 3ware card for now. We are quite happy with the performance of the card and ESXi 5.0 as well.

Will keep you updated.
I might also add that after e-mailing LSI support, their first reply was: we don't support 5.5. Go back to 5.1.
Stevef316Author Commented:
That was the first thing we did; however, it did not help.  Check your running version's build number; build 799733 was our last successful attempt with the LSI 9650/9750 cards.  I'd be curious to know what build you are successfully running those cards on.
So wait, even when you went back to ESXi 5.1.0 build 799733, it didn't work for you?  I thought you said you had other machines running that version with 3ware doing just fine.

On our two other machines we run 5.0.0 623860.

It didn't natively support 3ware so we made a custom image that included the 3ware drivers for it.
Stevef316Author Commented:
ESXi build 799733, which is 5.1.0, worked just fine with the LSI 9650/9750 cards.  Newer versions of ESXi with that RAID card is where the problems came in, not just in ESXi 5.5 but also 5.1U1 and a couple of other 5.1 ESXi builds we tried.

Since your build is older than ours running the 9650 card, you should be fine to go up to version 5.1.0 build 799733, as that was our last successful stable ESXi build with that card.

An ESXi build after 799733 is where we ran into problems with the 9650 cards.
Here's what I did. I re-downloaded 5.0U1 and added the LSI drivers to the VMware ISO using ESXi-Customizer. It installed just fine, and as soon as I booted up, the errors came back:

2014-04-29T05:57:48.744Z cpu10:4106)ScsiDeviceIO: 2322: Cmd(0x412400eeb300) 0x1a, CmdSN 0x661 from world 0 to dev "naa.600050e057b5eb0083380000065d0000" failed H:0x0 D:0x4 P:0x0 Possible sense data: 0x5 0x24 0x0.
2014-04-29T05:57:48.745Z cpu10:4106)ScsiDeviceIO: 2322: Cmd(0x412400eeb200) 0x1a, CmdSN 0x666 from world 0 to dev "naa.600050e057b5eb0083380000065d0000" failed H:0x0 D:0x4 P:0x0 Possible sense data: 0x5 0x24 0x0.

Now, just to be thorough, it dawned on me to also check the other servers we have, and sure enough, the same errors appear there as well, except nobody noticed them for over a year since there were no issues with the servers!

Can it really be that they are harmless?
The datastore disconnected 5 minutes after I posted that message. Back to square one, I guess.

I think those actual error messages are harmless, but these ones aren't:

2014-04-29T06:36:56.242Z cpu5:4291)<4>3w-9xxx:6:0:0:0 :: WARNING: (0x06:0x002c): Command (0x2a) timed out, resetting card.
2014-04-29T06:36:56.242Z cpu18:4114)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x8a (0x4124410e4b80, 4200) to dev "naa.600050e057b5eb0083380000065d0000" on path "vmhba1:C0:T0:L0" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2014-04-29T06:36:56.242Z cpu18:4114)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600050e057b5eb0083380000065d0000" state in doubt; requested fast path state update...
2014-04-29T06:36:56.242Z cpu18:4114)ScsiDeviceIO: 2309: Cmd(0x4124410e4b80) 0x8a, CmdSN 0x1363 from world 4200 to dev "naa.600050e057b5eb0083380000065d0000" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-04-29T06:37:03.960Z cpu21:16303)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 3198976 gen 37 stampUS 17397225255 uuid 535f04ab-44ca19d6-153f-002590e77f75 jrnl <FB 3037000> drv 14.54] on vol 'DATASTORE9'
2014-04-29T06:37:22.398Z cpu0:4291)ScsiDeviceIO: 2322: Cmd(0x412400c45500) 0x2a, CmdSN 0x1364 from world 4120 to dev "naa.600050e057b5eb0083380000065d0000" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-04-29T06:37:22.470Z cpu4:4131)<4>3w-9xxx:6:0:0:0 :: WARNING: (0x06:0x002c): Command (0x8a) timed out, resetting card.
2014-04-29T06:37:29.019Z cpu12:16298)VMW_SATP_LOCAL: satp_local_updatePathStates:439: Failed to update path "vmhba1:C0:T0:L0" state. Status=Transient storage condition, suggest retry

VMware states that H:0x0 is NO ERROR, which is what I had above and on all my other servers too, but H:0x8 is a card reset, which apparently is a big deal and causes these issues. I've e-mailed LSI again about this.
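For anyone decoding these later, here is a hedged lookup table for the H: (host status) byte. The names mirror the DID_* host-byte codes commonly documented for the Linux SCSI midlayer, which to my understanding is what the VMkernel reports in this field; treat the labels as assumptions, though 0x0 (no error) and 0x8 (reset) match what was observed in this thread.

```python
# Host (H:) status byte values as commonly documented for the SCSI midlayer.
# Assumption: VMware's H: field uses the same codes (consistent with the
# thread's observation that H:0x0 is "no error" and H:0x8 is a card reset).
HOST_STATUS = {
    0x0: "OK (no host-side error)",
    0x1: "NO_CONNECT (device never responded)",
    0x2: "BUS_BUSY",
    0x3: "TIME_OUT (command timed out)",
    0x4: "BAD_TARGET",
    0x5: "ABORT (command aborted)",
    0x7: "ERROR (internal adapter error)",
    0x8: "RESET (bus/adapter was reset)",
}

def host_status_name(code):
    """Map an H: status byte to a readable name; unknown codes pass through."""
    return HOST_STATUS.get(code, f"unknown (0x{code:x})")
```

By this reading, the H:0x5 and H:0x8 failures in the logs above are aborted commands and adapter resets, i.e. the controller dropping out from under the datastore, not a guest-level problem.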

Edit: it seems the issue might not be related to VMware but rather to the card itself:
Preliminary finding:

It seems we were 3 versions behind as far as the firmware was concerned. We upgraded from FE9X to FE9X, and so far so good. We placed extreme loads on the server, and no crash yet!

We do get the harmless notices, just like on the other servers, but they are only the H0x0 kind, no longer H0x8!

Will update if it breaks down again!
Stevef316Author Commented:
What ESXi version did you end up with?  Something before or after 799733?
5.0U1, which is the latest supported version for both my motherboard (X9-DRD-EF) and also for the 3ware 9650SE card.

Server still going strong, I really hope that was the issue.
The server was going very well until we attempted some high-speed benchmarks/data transfers.
At that time the controller performed a reset, and while the datastore did not detach like we saw before, write performance plunged drastically to 40-50 MB/sec vs. 450 MB/sec before. We tried a lot of things, but it would do this without fail whenever a high load was placed on the server.

Out of ideas, we reverted to the same 3ware firmware version we had on our other servers: Firmware Version = FE9X (about 2 years old). Following this "upgrade" we saw no issues whatsoever with any load. Speeds are fine, there are no weird messages in the logs, and no datastore disconnects.

I guess we will stick with this version! Who would have thought?
Stevef316Author Commented:
Similar to what we found; we then went with a different RAID card so we could go forward with VMware.
After a month and 6 days of constant reboots and changing pretty much everything (motherboard, RAID card (went from 3ware 9650 to 9750), cables, disks, and finally the entire chassis), we are now moving away from 3ware and onto LSI MegaRAID.

Hopefully this insane experience will end! I just can't believe how bad this was.
Stevef316Author Commented:
Sounds like our experience
Seth SimmonsSr. Systems AdministratorCommented:
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
Question has a verified solution.
