VMware ESXi 5

Hi,

We had a very strange issue on Friday whereby all of the virtual machines on one host stopped responding. We only have 4 hosts, and we are running HA and DRS.

The machines were all in a hung state when accessing the console, i.e. you could see Windows running, but you could not interact with it.

I have exported all the logs from the host, but I have no clue where to start looking to build a timeline of events and, possibly, find a reason for the failure.

I was able to put the host into maintenance mode, and HA restarted the machines on the other 3 hosts, but as far as I can gather the host wasn't failed enough to trigger a failover. Has anyone experienced this type of freeze/hang before?

G
AshridgeTechServices asked:

Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
No, we have not seen this issue ourselves.

Could the VMs be contacted via RDP or ping?
AshridgeTechServices (Author) commented:
Hi,

No ping or RDP was available; everything was frozen/hung.

Here are some of the vmkernel logs from the time of the incident, if this helps.
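
I pulled them out of the exported bundle with something roughly like the following (the var/log/vmkernel.log path assumes the standard ESXi 5 log bundle layout, so adjust to wherever your export unpacked):

# Filter the warning, heartbeat (HBX) and multipathing (NMP) entries
# around the incident time window
grep -E "WARNING|HBX|NMP" var/log/vmkernel.log | grep "2011-12-02T17:3"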

2011-12-02T17:36:09.013Z cpu16:4291)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.6000eb3c491b4e1f0000000000000390" state in doubt; requested fast path state update...
2011-12-02T17:36:09.013Z cpu16:4291)ScsiDeviceIO: 2305: Cmd(0x412441f02140) 0x28, CmdSN 0x9764d84 to dev "naa.6000eb3c491b4e1f0000000000000390" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2011-12-02T17:36:09.013Z cpu15:4132)HBX: 524: Reading HB at 3358720 on vol 'iSCSI Pri 2.5TB A' failed: IO was aborted
2011-12-02T17:36:09.013Z cpu15:4132)WARNING: HBX: 2697: Reclaiming timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-161b9dd0-74dc-3c4a92f48ef6 jrnl <FB 1982400> drv 14.54] on vol 'iSCSI Pri 2.5TB A' failed: I[0$
2011-12-02T17:36:09.013Z cpu17:6401)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-161b9dd0-74dc-3c4a92f48ef6 jrnl <FB 1982400> drv 14.54] on vol 'iSCSI Pri 2.5TB A'
2011-12-02T17:36:09.013Z cpu19:4136)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-161b9dd0-74dc-3c4a92f48ef6 jrnl <FB 1982400> drv 14.54] on vol 'iSCSI Pri 2.5TB A'
2011-12-02T17:36:12.013Z cpu1:3893305)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-161b9dd0-74dc-3c4a92f48ef6 jrnl <FB 1982400> drv 14.54] on vol 'iSCSI Pri 2.5TB A'
2011-12-02T17:36:12.013Z cpu2:4164109)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-161b9dd0-74dc-3c4a92f48ef6 jrnl <FB 1982400> drv 14.54] on vol 'iSCSI Pri 2.5TB A'
2011-12-02T17:36:12.013Z cpu0:3838403)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-161b9dd0-74dc-3c4a92f48ef6 jrnl <FB 1982400> drv 14.54] on vol 'iSCSI Pri 2.5TB A'
2011-12-02T17:36:12.013Z cpu14:3875564)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-161b9dd0-74dc-3c4a92f48ef6 jrnl <FB 1982400> drv 14.54] on vol 'iSCSI Pri 2.5TB A'
2011-12-02T17:36:49.014Z cpu12:4291)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x28 (0x412440c60d00) to dev "naa.6000eb3c491b4e1f0000000000000390" on path "vmhba36:C0:T1:L0" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.Act:EVAL
2011-12-02T17:36:49.014Z cpu12:4291)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.6000eb3c491b4e1f0000000000000390" state in doubt; requested fast path state update...
2011-12-02T17:36:49.014Z cpu12:4291)ScsiDeviceIO: 2316: Cmd(0x412440c60d00) 0x28, CmdSN 0x9764d85 to dev "naa.6000eb3c491b4e1f0000000000000390" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
2011-12-02T17:36:49.014Z cpu13:4131)HBX: 524: Reading HB at 3358720 on vol 'iSCSI Pri 2.5TB A' failed: Timeout
2011-12-02T17:36:49.014Z cpu13:4131)WARNING: HBX: 2697: Reclaiming timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-161b9dd0-74dc-3c4a92f48ef6 jrnl <FB 1982400> drv 14.54] on vol 'iSCSI Pri 2.5TB A' failed: T[0$
2011-12-02T17:36:49.014Z cpu0:6401)HBX: 2313: Waiting for timed out [HB state abcdef02 offset 3358720 gen 13 stampUS 8123281633113 uuid 4e5d182a-
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
STORAGE!

Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Check your storage; it was unavailable.

Check that you have a supported storage controller and RAID configuration; if FC, check HBA firmware; if iSCSI, check the networking.

Or did the VMs broadcast-storm the storage? You're not using evil SATA?

Okay, it looks like iSCSI from the datastore label?

If the storage is 2.5TB, that's a large LUN. How many VMs per LUN, and what is the storage?
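
From the ESXi shell, something like this would show the device and path state; the naa ID below is copied from your log, so treat it as an example:

# NMP state of the affected device (naa ID taken from the log above)
esxcli storage nmp device list -d naa.6000eb3c491b4e1f0000000000000390

# All storage paths and whether they are active or dead
esxcli storage core path list

# VMFS datastores and the extents/LUNs backing them
esxcli storage vmfs extent list
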
AshridgeTechServices (Author) commented:

Storage is all iSCSI.
There are about 50 VMs on the primary 2.5TB iSCSI datastore.
It's LeftHand/HP P4300/P4000 storage arrays.

Hmm, I can see from the logs that storage is implicated, but no other machines on that storage failed. Nearly all VMs are on the 2.5TB iSCSI datastore, yet only machines on host 01 failed.

All machines on hosts 2, 4 & 5 are also on the 2.5TB iSCSI datastore - they did not fail.

Maybe host 1 had a network error reading the datastore?

I just found this, which is the first readable bit of data in the log file. This looks like a memory problem?


2011-12-02T17:38:50.005Z cpu7:4164109)WARNING: Swap: vm 4164109: 13189: Read failed: numPages=1 status=Timeout
2011-12-02T17:38:50.005Z cpu7:4164109)WARNING: VmMemIO: vm 4164109: 1130: Unable to read from slot(0x100048f4)
2011-12-02T17:38:50.005Z cpu7:4164109)WARNING: World: vm 4164109: 9218: vmm0:ash-roomhub:vmk: vcpu-0:Unable to read swapped out pgNum(0x3fea2) from swap slot(0x100048f4) for VM(4164109)
2011-12-02T17:38:50.005Z cpu7:4164109)World: 9221: vmm group leader = 4164109, members = 2
2011-12-02T17:38:50.005Z cpu7:4164109)Backtrace for current CPU #7, worldID=4164109, ebp=0x4122283479d8
2011-12-02T17:38:50.006Z cpu7:4164109)0x4122283479d8:[0x41800dce50f5]World_Panic@vmkernel#nover+0x184 stack: 0x313130326d375b1b, 0x412228
2011-12-02T17:38:50.006Z cpu7:4164109)0x412228347bf8:[0x41800dce50f5]World_Panic@vmkernel#nover+0x184 stack: 0x412228347c4c, 0xb91e3d2834
2011-12-02T17:38:50.007Z cpu7:4164109)0x412228347c78:[0x41800dcc218b]VmMemIO_FaultSwappedPage@vmkernel#nover+0xf2 stack: 0x412228347cc8,
2011-12-02T17:38:50.007Z cpu7:4164109)0x412228347d38:[0x41800dccd9e3]VmMemPfSwapped@vmkernel#nover+0x6fe stack: 0x412228347e2f, 0x1, 0x41
2011-12-02T17:38:50.007Z cpu7:4164109)0x412228347df8:[0x41800dccdf8d]VmMemPfInt@vmkernel#nover+0x510 stack: 0x3f, 0xfffffffffc00863c, 0x4
2011-12-02T17:38:50.008Z cpu7:4164109)0x412228347e58:[0x41800dcce688]VmMemPf@vmkernel#nover+0xb3 stack: 0x28347ed8, 0x0, 0x28347eb8, 0x41
2011-12-02T17:38:50.008Z cpu7:4164109)0x412228347f98:[0x41800dccfa2e]VmMemPf_LockPage@vmkernel#nover+0x1179 stack: 0x410026fc32a4, 0x4000
2011-12-02T17:38:50.008Z cpu7:4164109)0x412228347fe8:[0x41800dcd8c17]VMMVMKCall_Call@vmkernel#nover+0x186 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2011-12-02T17:38:50.008Z cpu7:4164109)0x41800dcb550c:[0xfffffffffc2235ca]<unknown> stack: 0x0, 0x0, 0x0, 0x0, 0x0
2011-12-02T17:38:50.008Z cpu7:4164109)WARNING: World: vm 4164109: 9253: Couldn't awake world 4164109
2011-12-02T17:38:50.008Z cpu7:4164109)WARNING: VmMemPf: vm 4164109: 416: VmMemPf failed: pgNum=0x3fea2, mpn=0xffffffff, status=Timeout
2011-12-02T17:38:50.008Z cpu7:4164109)WARNING: VmMemPf: vm 4164109: 505: pgNum=0x3fea2 failed
2011-12-02T17:38:50.008Z cpu14:4149)NetPort: 1426: disabled port 0x10000ce
2011-12-02T17:38:50.008Z cpu14:4149)Net: 2195: disconnected client from port 0x10000ce
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
It's possible there was a network fault; that would cause this issue.

Personally, we would NEVER stack 50 VMs on the same 2.5TB LUN.

Also remember that the VM swap file also resides on the storage, so if there is a storage fault, the VM also cannot read its swap file.
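
You can see this on the datastore itself; by default the .vswp file sits in the VM's own directory (the VM directory name below is a placeholder):

# The VM's swap file (.vswp) lives alongside the VM by default
ls -lh "/vmfs/volumes/iSCSI Pri 2.5TB A/<vm-name>/"*.vswp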

If you are using shared network storage, e.g. iSCSI, I would investigate the networking of this ESXi host.
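
From the suspect host, a quick first check would be something like this (the portal IP is a placeholder for your array's iSCSI target address):

# Can the vmkernel reach the iSCSI portal at all? (IP is a placeholder)
vmkping 192.168.10.10

# Are the iSCSI sessions to the array still logged in?
esxcli iscsi session list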

How did you recover the situation?

It would seem that the storage disappeared from the host. VMs will carry on running until they need to write changes to storage, at which point you observe this fault.

If you set up VM Monitoring in HA, then when a VM misses its internal heartbeat, HA will restart that VM automatically.
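
VM Monitoring keys off the VMware Tools heartbeat, which you can inspect per VM from the host shell (the VM ID 42 below is a placeholder taken from the getallvms output):

# List the VM IDs registered on this host
vim-cmd vmsvc/getallvms

# Tools heartbeat status for one VM; HA VM Monitoring watches this
vim-cmd vmsvc/get.summary 42 | grep -i heartbeat
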
AshridgeTechServices (Author) commented:
We have had 50 VMs on this LUN for about 3 years with no failures; the other LUN actually has 33 on it, so it's a rough split of our VMs.

It was recovered by manually putting host 01 into maintenance mode; this caused HA to restart all the VMs on the other hosts, and they worked fine.

It seems weird that only machines on host 1 couldn't see the storage.

I think we will turn on VM and datastore monitoring in our cluster. Good call, thank you.

G
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Check your networking on this host.

AshridgeTechServices (Author) commented:
Thanks for your support, excellent.
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
No problem. Keep monitoring the network on this suspect host, and check the network switch port status/stats. If you want to investigate and monitor further, see below.
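
From the host side, the basics are quick to pull (these are standard ESXi 5 commands; adjust names for your setup):

# Physical NIC link state and speed on the suspect host
esxcli network nic list

# Standard vSwitch configuration, including uplinks and teaming
esxcli network vswitch standard list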

One of the biggest management holes in vCenter for ESX is that the vSphere Client can show VM network traffic driving a 1 GB Ethernet adapter to 99% utilization, but strangely it doesn't display what kind of traffic is crossing the virtual networks, where it came from, or where it's going.

To learn which traffic is going across a virtual network, there's a free tool for vSphere: Xangati for ESX, a virtual appliance that tracks conversations on the virtual network. It's great for troubleshooting virtual network issues, analyzing virtual desktop infrastructure, and correlating vCenter performance stats with virtual network stats.

It's available as a fantastic FREE download here:

http://xangati.com/try-it-free/