ESXi 4.0 U1 - Management network becomes unstable after a few days

I’ve installed ESXi 4.0 Update 1 on two identical machines that reside in the same network segment. On both servers, I’ve created two virtual machines. One runs RedHat Enterprise Linux 5.4 and one runs a small load balancer appliance (Hercules).

Hardware:
Dell PowerEdge R210
Intel Xeon X3450 2.66GHz HT
8GB RAM
2x500 GB in RAID 1
Using ONE port of the internal Broadcom netxtreme II bcm5716 NIC (this port is shared between the management network and the VM’s).
(all hardware is marked as ‘supported’ by VMware)

We applied all available patches, including the recent April 1st patch; we’re at build 244038 now.

The Problem
After a few days the vSphere client can no longer establish a connection to the ESXi hosts. The virtual machines keep running without any problem, however. Only a full reset (applied through the remote power cycle) restores connectivity to the management network. We experience this issue on both servers: about three days after power-on/reset, the vSphere client cannot connect anymore.

Observations:
• Only the management network suffers from connectivity problems.
• Restarting the management network (agents) via the physical console doesn’t restore service
• The physical console offers some basic diagnostics like ‘testing the management network’. The PING tests intermittently fail: about half of the PINGs to the gateway or DNS servers fail. The hardware and the network config MUST be correct, since the management network works for a few days before failing and the VM’s keep running without any problem.
• Using a packet sniffer, we’ve investigated the network traffic from a remote vSphere client that is trying to connect to the ESXi server. The remote ESXi host resets the connection after initial contact, so there IS packet exchange.
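
One way to pin down when the management interface stops answering is to probe it on a schedule from another machine and log the outcome. The following is a minimal sketch, not a tested diagnostic for ESXi: the function names `probe` and `monitor` are made up for illustration, and it assumes the vSphere client's HTTPS port (TCP 443) is the one of interest. It only checks whether the TCP handshake completes, so a reset that happens later in the session would not show up here.

```python
import socket
import time


def probe(host, port=443, timeout=5.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except ConnectionResetError:
        return "reset"
    except OSError as exc:
        # Anything else (no route, name resolution failure, ...).
        return "error: %s" % exc


def monitor(host, port=443, interval=60, rounds=3):
    """Probe the host `rounds` times and return (timestamp, outcome) tuples."""
    results = []
    for i in range(rounds):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        results.append((stamp, probe(host, port)))
        if i < rounds - 1:
            time.sleep(interval)
    return results
```

Running `monitor()` with a large `rounds` value from a workstation and writing the tuples to a file would give a timestamp for the first failed probe, which can then be compared against the host's own logs.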

Given the above, I strongly suspect a problem in the network driver in ESXi, but I don’t know how to diagnose the issue any further. I’ve exhausted all options on the physical ESXi console. I know how to access the (unsupported) commandline console, but don’t know what to look for. Could it be a problem that the management network shares the same NIC as the VM’s?

I’ve been struggling with this issue for several weeks now. Any help or suggestions are highly appreciated.
thijsschoonbroodAsked:
wolfje_xpCommented:
Hi,
I also use ESXi 4.0, but haven't applied the last update. I also share a nic for mgmt and VM traffic, without problems so far.
Maybe you can have a look in the logfiles /var/log ?
0
bbnp2006Commented:
What do vpxa.log and hostd.log say? If it's an agent issue within each ESX host, those logs should reveal some clues for you.

./var/log/vmware/hostd.log
./var/log/vmware/vpx/vpxa.log

Post back.
0
thijsschoonbroodAuthor Commented:
After a recent reboot, vSphere is able to connect again. However, I see some disturbing lines in these logs:

In /var/log/messages:
Hostd: [2010-04 ..... 29766DC0 warning 'Proxysvc'] Accept on client connection failed: Bad file descriptor

In /var/log/hostd.log:
[2010-04-....  warning 'Proxysvc Req00005'] Error reading from client while waiting for header: N7Vmacore15SystemExceptionE (Connection reset by peer)

I was unable to find '/var/log/vmware/vpx/vpxa.log', the subdir vpx didn't exist either.


Before I rebooted the physical machine, I noticed an even more disturbing message in a log:
[2010-04 -… 20BECCDC panic ‘App’] error: Cannot allocate memory

and several of these (which i suspect are related to an out of memory situation):
vmkernel: 3:19:39:51.629 cpu2:1028056)WARNING: Tcpip_Socket: 1619: socreate(type=2, proto=0) failed with error 55

I had allocated 8GB to one of the virtual machines, while the host only has 8 GB in total. Can this have caused the memory problem? To be on the safe side, I reduced the amount to 6GB.
Also, I put the management network on another vSwitch connected to a dedicated NIC port. So now the VM's share a NIC port and the management network uses a separate one.

Perhaps one of these actions fixes the issue (I'll only know after three days...), but I would very much like to understand what went wrong. Did I misconfigure anything?
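
To see whether the socket failures and the out-of-memory message line up with the three-day pattern, the log lines quoted above could be tallied with a short script. This is a rough sketch under stated assumptions: `summarize` is a hypothetical helper, and it assumes the leading vmkernel stamp (`3:19:39:51.629`) is host uptime in days:hours:minutes:seconds, which the source does not confirm.

```python
import re
from collections import Counter

# Assumption: the vmkernel stamp is uptime as days:hours:minutes:seconds.millis.
SOCREATE_RE = re.compile(
    r"vmkernel: (\d+):(\d+):(\d+):(\d+)\.\d+ .*"
    r"socreate\(type=(\d+), proto=(\d+)\) failed with error (\d+)"
)
OOM_TEXT = "Cannot allocate memory"


def summarize(lines):
    """Count socreate failures per error code and out-of-memory messages."""
    errors = Counter()
    oom_count = 0
    first_failure_hours = None
    for line in lines:
        if OOM_TEXT in line:
            oom_count += 1
        match = SOCREATE_RE.search(line)
        if match:
            days, hours, minutes, seconds = (int(match.group(i)) for i in range(1, 5))
            uptime_hours = days * 24 + hours + minutes / 60 + seconds / 3600
            if first_failure_hours is None:
                first_failure_hours = uptime_hours
            errors[int(match.group(7))] += 1
    return {
        "socreate_errors": dict(errors),
        "oom_messages": oom_count,
        "first_failure_uptime_hours": first_failure_hours,
    }
```

Feeding it the contents of /var/log/messages (for example `summarize(open(path))`) gives the uptime at which socket creation first failed, which can be compared between the two hosts.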
0
wolfje_xpCommented:
Hi. It sure can be a memory issue. It's not a good idea to grant all your physical memory to a VM. (It shouldn't produce those errors, however, but swapping will kill your performance.)
I assume your network settings are correct (domain etc)?
Any chance you have bad memory / hard disk ?

0
thijsschoonbroodAuthor Commented:
I can only assume that all the physical server hardware is OK, since we experience exactly the same issue on two identical (brand new) servers, and only after three days. It's of course an assumption, but I think I can be rather sure the hardware is OK. Network settings are correct I think; I verified them quite a few times. :-)
0
bbnp2006Commented:
Can I assume that you have run a full memtest on the memory of both servers? :)
0
thijsschoonbroodAuthor Commented:
Put the mgt. network on a different NIC port (and vSwitch) and reduced the memory allocated to VM's to 6GB. Still, after three days, vSphere is unable to connect to ESXi again. :-( I'm running out of ideas here. :-(
0
bbnp2006Commented:
Are you sure you have the latest driver for your NICs? It sounds like a NIC problem when you are experiencing "PING tests intermittently fail" problems.
0
thijsschoonbroodAuthor Commented:
Hi bbnp2006,

Well, I've applied all available patches, assuming that all the network drivers would be updated to the new version as well. Is my assumption wrong? How can I determine exactly which network driver version I'm using? While my vSphere client could still connect, I was not able to find any version reference except the build number.
(I'm using the (internal) Broadcom netxtreme II bcm5716 NIC.)
0
bbnp2006Commented:
This is the link to the hardware compatibility list from VMware. If I search for Broadcom network cards, for some reason I cannot find the BCM5716 model in the list. Maybe I am wrong; I am still searching to see if your NIC is really supported. If it turns out it is not supported, that might be why the driver is not functioning well. Maybe you can double-check that?
0
bbnp2006Commented:
According to VMWare's website:
http://www.vmware.com/resources/compatibility/search.php?action=search&deviceCategory=io&productId=1&advancedORbasic=advanced&maxDisplayRows=50&key=bcm5716&release[]=-1&datePosted=-1&partnerId[]=-1&manufacturer[]=-1&vid=&did=&svid=&ssid=&rorre=0

Broadcom      NetXtreme II BCM5716 Gigabit Ethernet      is supported up to      ESX 3.5 U5
Broadcom      NetXtreme II BCM5716S Gigabit Ethernet is supported in ESX / ESXi 4.0 U1

0
thijsschoonbroodAuthor Commented:
Hi BBNP2006,
First, thanks for your help! I did check the compatibility guide, but I guess I overlooked the 'S'... However, DELL indicates that ESX 4.0 U1 is supported on their R210 server hardware. Perhaps it's supported, but not when using the internal NIC? I really don't want to revert to 3.5 though; perhaps it's possible to use a bnx2 version that supports the 5716 (without the S)? How can I determine which bnx2 version is currently being used?
0
thijsschoonbroodAuthor Commented:
I did find a driver for ESXi 4.0 for the BCM 5716 (without the S ;-)).
http://downloads.vmware.com/d/details/esx_esxi_40_broadcom_bnx2_dt/ZCV0YmRqZHRidHdw
I'll check it out asap
0
bbnp2006Commented:
Yes, that's the driver released after the fact for your NIC, which was not originally supported "inbox" when vSphere was released. That will hopefully get rid of the networking issue.
Report back if it is working for you. Good luck!
0
thijsschoonbroodAuthor Commented:
I will report back asap. One question though: are drivers (like the bnx2 driver) also included when i check for updates using the Host Update Utility?
0
bbnp2006Commented:
Only the drivers that are on the hardware compatibility list will be included in updates. So in your situation, you have to look for the driver VMware released for older hardware. I'd suggest replacing your NIC if possible so that all driver updates will be included when you do host updates :)
0
thijsschoonbroodAuthor Commented:
I checked the bnx2 driver version using 'ethtool -i vmnic1'
It revealed that I was still on 1.6.9. The latest release available from VMware (URL a few posts back) is 2.0.7c. I just installed this new driver successfully and rebooted the system. In three days I'll know if that solved the issue. :-)
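
For checking several hosts or NIC ports, the `ethtool -i` output could be parsed and compared against the version on offer. The sketch below is only illustrative: the helper names are hypothetical, it assumes the usual `key: value` layout of `ethtool -i` output, and it assumes that a trailing letter such as the `c` in 2.0.7c sorts after the plain number, which may not match Broadcom's actual versioning scheme.

```python
import re


def parse_ethtool_info(text):
    """Turn `ethtool -i` output ("key: value" per line) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_key(version):
    """Split e.g. '2.0.7c' into [(2, ''), (0, ''), (7, 'c')] for comparison."""
    parts = []
    for part in version.split("."):
        match = re.match(r"(\d+)(.*)", part)
        if match:
            parts.append((int(match.group(1)), match.group(2)))
        else:
            parts.append((0, part))
    return parts


def is_older(installed, available):
    """Return True when the installed version sorts before the available one."""
    return version_key(installed) < version_key(available)
```

With the versions from this thread, `is_older("1.6.9", "2.0.7c")` evaluates to true, matching the conclusion that the host was still on the older bnx2 driver.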
0
bbnp2006Commented:
Great stuff! Fingers crossed and looking forward to your updates in 3 days :)
~bbnp2006
0
thijsschoonbroodAuthor Commented:
Unfortunately, it turned out that the issue was not solved. I also disabled the CIM agents but after a while the servers both crashed again.
0
bbnp2006Commented:
Sorry to hear that, mate. I would contact VMware for support (hopefully that's an option). Let me do some digging to see if I can find anything.
0
thijsschoonbroodAuthor Commented:
Hi bbnp2006,
VMware support definitely is an option if it's up to me. However, I've enabled SSH (to help diagnose the problem) and disabled the CIM agents, which breaks the support option, I'm afraid.
Also, here is another guy having the exact same issue: http://communities.vmware.com/thread/268626.
I've added the observation below to that thread, maybe it helps in generating some new ideas.. ;-) Is there a way the two servers can influence one another?
Thanks for sticking with me. Appreciate
--------
We're running two (identical) servers. It appears that the issue only occurs when both ESXi hosts are up (in the sense that vSphere can connect). This chain of events leads me to believe this:

both servers were inaccessible due to the issue at hand (VM's running fine, but vSphere couldn't connect)
i disabled the CIM agents on one server and rebooted this server
worked for weeks, vSphere was able to connect to one server (and not the other which was still to be fixed)
i became confident that this solved the issue and implemented on the other server as well.
rebooted the second server.
vSphere was able to connect to both servers
a few days later: both servers were inaccessible thru vSphere again... same story with 'cannot allocate memory'......
0
bbnp2006Commented:
no problem bud.
Just another thought: are you 100% confident that there isn't any sort of IP conflict on your network? It's just so strange that it works for a while and then all of a sudden stops working... just trying to cover all the bases.
0
thijsschoonbroodAuthor Commented:
These days i'm never 100% certain of anything anymore. ;-) But yes, i did check that (quite a few times ;-)) Could an IP conflict cause memory problems?
Perhaps some process exhausts all sockets? Is there a way to determine which (and how many) sockets are opened by a particular process under ESXi?
0
bbnp2006Commented:
#ps command should give you lots of options to query all the processes running on the host. but I am not sure if it will show all the ports being open though.
0
bbnp2006Commented:
actually,
#netstat -tulnap
will give you all the ports bud :P
0
thijsschoonbroodAuthor Commented:
Hi bbnp2006
No netstat on the SSH console (using ESXi 4.0 U1)... ;-(
0