datacomsmt


ESXi 4.1 host crashes when guest OS is close to completing a large file transfer to a physical machine

When I transfer completed torrents from a Windows 7 machine (one of the guest VMs on the ESX box) to another Win7 box (physical), ESX "crashes".

There is nothing to indicate an error on the ESX screen: no PSOD, no kernel panic... it just... dies. I cannot plug in a keyboard to troubleshoot, as it isn't detected. I cannot ping any of the VMs hosted by the box, and cannot ping the box itself. It just shows what it always does on the main screen.

It usually happens just after the file transfer completes, or when it is close to completion (typically with large files).

When ESX dies, vSphere disconnects (I can actually watch the file transfer via console view right up until the crash). Because I cannot ping any of the VMs (by IP or hostname) and vSphere will not reconnect, I cannot RDP to any of the guest OSes either (one of which hosts DHCP, DNS, AD, etc.).

bounce ESX and everything is back to normal, until it comes time to transfer files again... :(

any suggestions? any logs i can look into? - without getting support from vmware?
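
Since the console shows nothing when the box dies, it can help to pin down the moment it stops answering, so that time can be matched against whatever logs survive. A minimal Python sketch, run from another machine on the network: the host name `esx.example` and the choice of TCP port 443 are assumptions for illustration, and the helper names are made up here rather than taken from any VMware tool.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=2.0, connect=socket.create_connection):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        connect((host, port), timeout).close()
        return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def watch(host, port, interval=5.0, probe=is_reachable,
          sleep=time.sleep, now=datetime.now):
    """Poll until the host stops answering; return the time of the first failure."""
    while probe(host, port):
        sleep(interval)
    return now()


if __name__ == "__main__":
    # Hypothetical host name and port; substitute your own.
    print("host stopped answering at", watch("esx.example", 443))
```

The `probe`, `sleep` and `now` parameters exist only so the loop can be exercised without a real network.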
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

You can certainly inspect the logs

VMware KB: Location of ESXi log files

http://kb.vmware.com/kb/1021801 

What server do you have, and is it on the Hardware Compatibility List?

http://www.vmware.com/go/hcl

This suggests it's having issues with the datastore or the storage controller.

Again, is the storage controller on the HCL or on the following list?

http://www.vm-help.com/esx40i/esx40_whitebox_HCL.php#Storage
I have seen this on some servers when the machines are not using the VMXNET3 NIC adapter type in the VM settings. Try changing all your NICs to VMXNET3.
datacomsmt

ASKER

@hanccocka

"NOTE: this list includes a number of SATA controllers that provide RAID functionality via a software component in the drivers supplied with the controller. Examples would be the Intel ICH series and the nVidia MCP series. ESX 4.x and ESXi 4.x do not support that software RAID functionality thus you will only be able to access the individual drives connected to controllers such as these."

Mobo is : Asus P5N32-E SLI Plus

Northbridge: C55 a.k.a. nForce 650i SLI
Southbridge: MCP55P a.k.a. nForce 570 SLI

CPU: some Core 2 DUO

It's just a box I built to run ESXi, not a rack mount or anything specifically built to run VMs, unfortunately :(

I'm not running any form of RAID on the box, hardware or software, just a single disk.


@Neilsr

I... am a pretty maximum noob at ESX, sorry. Not sure where to check this, but I see the NIC (on that specific VM) is labelled under "Adapter type" as "E1000", whatever that means..?

and is installed on the VM as "intel PRO/1000 MT" PCI\VEN_8086&DEV_100F.....


I've just completely rebuilt the machine that I'm having troubles with (that is to say the VM, not the ESX box), so I can say for sure it's definitely not going to be a corrupt vmdk or something.



Thanks for both your help so far :)
Scratch that last, I just found what you mean by VMXNET3 (just googled).

will install that now and give it a whirl
It doesn't like the VMXNET3 driver :(  "This device cannot start. (Code 10)"

Is there somewhere I can get an updated version of the device driver? I searched on VMware's website to no avail.
This is similar to many questions we get on EE from users who have built what we call "White Boxes". They may experience issues because of incompatible hardware.

They may work or they may not work. "Your mileage will vary."

Personally, I've not seen any issues with certified hardware that cause "ESX" to crash when using virtual NICs: E1000, VMXNET2, VMXNET3 or the older AMD Lance.

But I have seen many instances of unstable ESXi platforms, built on non-certified hardware, that cannot be explained or fixed.
Are you using thin-provisioned virtual disks, and is the datastore filling up?
@danm66

nup, disks on each vm are static

@hanccocka
yeah, it's a bugger they don't have as big a list of supported hardware as Windows.. but I s'pose Windows has been around forever, and vendors build drivers specifically for it..

I appreciate the help all the same, and I realise I may not get it sorted, and that's cool, but I wanna try.

I just realised I made a silly mistake: I actually set the NIC to "E1000" when I created the VM... so there was no point trying to put the VMXNET3 driver on E1000 virtual hardware, which is why the driver didn't work. I recreated the NIC as VMXNET3, installed the appropriate driver... and copied a big file..

... and... it.... hasn't.... crashed.... yet..... :D (but there have been occasions when it doesn't, so I will just have to wait and see)

 *crosses fingers*
It's an Enterprise Class Datacentre Server Operating system, the same is true for Windows Datacentre Server.

It's not designed to run on any old PC with an Intel chip!

It's not a Desktop operating system! (designed for the masses!)
hahah, i know! im not fighting/arguing with you about that! - or anything else for that matter! :) .. was just saying

I think Windows has given me unrealistic hardware support expectations :P You're quite right, it isn't designed to run on just anything with a CPU :P I've just had mates tell me "oh I run it on a laptop at home and it's fine", so I'm like "well... I'll give it a go then", since ESXi is free... anyway, I appreciate the help, I'll let you all know if it dies again.
It's easier to get ESXi 4.1 running on a laptop or desktop (with Intel VT) by running it under some version of Windows and VMware Workstation 7.1, because the "virtual hardware" is compatible! This method works, but is slower than installing on bare metal!

I think people have also forgotten that there is also a Windows Hardware Compatibility List!

no luck :( died again

.. IF.. this is fixable... what exactly/roughly should i be looking for in the epic 417kb messages.txt file to help me troubleshoot?
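
One rough way to cut a file that size down to something readable is to filter it for a handful of marker strings. A minimal Python sketch follows; the keyword list (`ALERT`, `failed`, and an `H:0x..` status field) is only an assumption about what is worth a first look in vmkernel output, not an official list, so tune it to whatever your own log contains.

```python
import re

# Hypothetical patterns of interest; adjust to suit your own log.
PATTERNS = [
    re.compile(r"ALERT"),
    re.compile(r"failed", re.IGNORECASE),
    re.compile(r"\bH:0x[0-9a-fA-F]+"),
]


def interesting_lines(lines):
    """Yield (line_number, text) for every line matching any pattern."""
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in PATTERNS):
            yield number, line.rstrip("\n")


def scan_file(path):
    """Return the matching lines of a log file as a list of (number, text)."""
    with open(path, errors="replace") as handle:
        return list(interesting_lines(handle))
```

Calling `scan_file("messages.txt")` returns the matching lines with their line numbers, which makes it easier to jump to the surrounding context in a text editor.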
Is this for production or for lab/home/test learning? If the latter, you may have better luck using VMware Workstation 7.1 (although this would have to be purchased).

This IS to be expected with non-supported hardware, or hardware that has not been confirmed to work reliably by trial and error.

The areas that seem to affect ESXi support are storage controllers and network interface cards. If you get the correct storage controller and network interface cards, that are on the VMware HCL, you may have better luck.

http://www.vmware.com/go/hcl

The White box HCL may be of some assistance, although outdated today, because it's only for ESX 4.0.

http://www.vm-help.com/esx40i/esx40_whitebox_HCL.php

This is the issue with trying to use unsupported "built" hardware. Some people take great pride in stating, "oh, I built X out of a pile of bits, and it now runs ESXi".

It's easier to find working components by searching the forums, which almost gives you some "guarantee" that the system will work if you get the correct components (but you'll spend a lot of time and money on trial and error trying to get it to work).

Personally, I'm in favour of purchasing a very low-cost, refurbished/old server from eBay that works reliably with ESXi 4.1, e.g. an HP DL385/DL585. Although these have not been supported since ESX 3.5 U5 and are not on the HCL, they do work with ESXi 4.0/4.1 (so I wouldn't want to use one in a mission-critical environment).
It's mostly for home with a bit o' learning. Will see what I can find as far as a replacement goes, I guess. Thanks again for the help :)
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
I've found a cheapish DL585.

P.S. In case it's of any relevance to anyone in future, I managed to get a syslog dump from ESXi right before it craps itself, and the last meaningful entries are:

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.362 cpu0:4392)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41027f39ec40) to NMP device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed on physical path "vmhba1:C0:T0:L0" H:0x4

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.338 cpu0:4392)ScsiDeviceIO: 1672: Command 0x28 to device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed H:0x4 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.330 cpu0:4392)ScsiDeviceIO: 1672: Command 0x28 to device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed H:0x4 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

04-26-2011      04:03:56      Local7.Debug      192.168.1.200      D:0x0 P:0x0 Possible sense da

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.315 cpu0:4392)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41027f39ec40) to NMP device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed on physical path "vmhba1:C0:T0:L0" H:0x4

04-26-2011      04:03:56      Local7.Debug      192.168.1.200      D:0x0 P:0x0 Possible sense da

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.295 cpu0:4392)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41027f39ec40) to NMP device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed on physical path "vmhba1:C0:T0:L0" H:0x4

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.272 cpu0:4392)ScsiDeviceIO: 1672: Command 0x28 to device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed H:0x4 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

04-26-2011      04:03:56      Local7.Debug      192.168.1.200      D:0x0 P:0x0 Possible sense da

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.260 cpu0:4392)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41027f39ec40) to NMP device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed on physical path "vmhba1:C0:T0:L0" H:0x4

04-26-2011      04:03:56      Local7.Debug      192.168.1.200      D:0x0 P:0x0 Possible sense da

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.244 cpu0:4392)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41027f39ec40) to NMP device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed on physical path "vmhba1:C0:T0:L0" H:0x4

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.228 cpu0:4392)ScsiDeviceIO: 1672: Command 0x28 to device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed H:0x4 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

04-26-2011      04:03:56      Local7.Debug      192.168.1.200      D:0x0 P:0x0 Possible sense da

04-26-2011      04:03:56      Local6.Notice      192.168.1.200      Apr 25 18:03:51 vmkernel: 0:01:19:06.216 cpu0:4392)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41027f39ec40) to NMP device "t10.ATA_____ST3400620AS_________________________________________5QH0D78A" failed on physical path "vmhba1:C0:T0:L0" H:0x4

04-26-2011      04:03:56      Local7.Debug      192.168.1.200      D:0x0 P:0x0 Possible sense da

04-26-2011      04:01:48      Local6.Error      192.168.1.200      Apr 25 18:01:48 vmkernel: 0:01:17:03.325 cpu0:10083)ALERT: APIC: 1823: APICID 0x00000000 - ESR = 0x40
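
For anyone reading a dump like this, the failures are easier to see when tallied. In these lines `Command 0x28` is the SCSI READ(10) opcode and `H:` is the host-side status reported for the command (`D:` and `P:` are the device and plugin status). Below is a minimal Python sketch that counts failed commands per device, opcode and host status; the regular expression is written against the two line shapes shown above and is an assumption about the format, not a VMware-supplied parser.

```python
import re
from collections import Counter

# Matches both the NMP and the ScsiDeviceIO failure lines shown above.
FAILURE = re.compile(
    r'Command (0x[0-9a-fA-F]+)(?: \(0x[0-9a-fA-F]+\))? to (?:NMP )?device '
    r'"([^"]+)" failed(?: on physical path "([^"]+)")? H:(0x[0-9a-fA-F]+)'
)

# Standard SCSI opcodes; anything else is reported by its raw hex value.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def summarise(lines):
    """Count failed commands, keyed by (device, opcode name, host status)."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match is None:
            continue  # truncated fragments and unrelated lines are skipped
        command, device, _path, host_status = match.groups()
        opcode = OPCODES.get(int(command, 16), command)
        counts[(device, opcode, host_status)] += 1
    return counts
```

Fed the excerpt above, every counted entry lands on the same disk, the same opcode and the same host status, which is consistent with the earlier suggestion to look at the storage controller and datastore.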
Much credit/thanks to hanccocka for his continued and especially FAST help :)
You'll not go far wrong with a DL585 G1/G2. They are fantastic servers and still work with ESXi 4.0/ESX 4.1 U1; we still run them in "production" in our offices!

We use Quads, Dual Core, fully loaded configs, we don't use local disks, because we have SANs for VMs, and we use SSD for VMware View VDI work.