Solved

ESXi 4.0 U1 - Management network becomes unstable after a few days

Posted on 2010-04-05
Last Modified: 2012-05-09
I’ve installed ESXi 4.0 Update 1 on two identical machines that reside in the same network segment. On both servers, I’ve created two virtual machines: one runs Red Hat Enterprise Linux 5.4 and the other runs a small load balancer appliance (Hercules).

Hardware:
Dell PowerEdge R210
Intel Xeon X3450 2.66GHz HT
8GB RAM
2x500 GB in RAID 1
Using ONE port of the internal Broadcom NetXtreme II BCM5716 NIC (this port is shared between the management network and the VMs).
(all hardware is marked as ‘supported’ by VMware)

We applied all available patches, including the recent April 1st patch; we’re at build 244038 now.

The Problem
After a few days the vSphere client can no longer establish a connection to the ESXi hosts. The virtual machines, however, keep running without any problem. Only a full reset (applied through the remote power cycle) restores connectivity to the management network. We experience this issue on both servers: about three days after power-on/reset, the vSphere client can no longer connect.

Observations:
• Only the management network suffers from connectivity problems.
• Restarting the management network (agents) via the physical console doesn’t restore service.
• The physical console offers some basic diagnostics like ‘testing the management network’. The PING tests fail intermittently: about half of the PINGs to the gateway or DNS servers fail. The hardware and the network config MUST be correct, since the management network works for a few days before failing and the VMs keep running without any problem.
• We’ve investigated the network traffic from a remote vSphere client that is trying to connect to the ESXi server using a packet sniffer. The remote ESXi host resets the connection after initial contact, so there IS packet interchange.
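
To pin down when the management interface stops answering, a periodic TCP probe from another machine can help. Below is a minimal Python sketch, assuming the host's management service accepts vSphere Client connections on TCP port 443; the function name `probe_tcp` and the address in the comment are hypothetical. It only checks whether the TCP handshake completes, so a host that accepts and then resets the connection (as in the last observation) may still report "open".

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Return 'open' if a TCP connection succeeds, else a short failure reason."""
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        # Covers resets, unreachable networks and similar errors.
        return f"error: {exc}"


# Hypothetical usage (192.0.2.10 is a placeholder address, not from this thread):
# print(probe_tcp("192.0.2.10", 443))
```

Running this from a scheduler every few minutes and logging the result with a timestamp would show whether the failure sets in gradually or all at once.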

Given the above, I strongly suspect a problem in the network driver in ESXi, but I don’t know how to diagnose the issue any further. I’ve exhausted all options on the physical ESXi console. I know how to access the (unsupported) command-line console, but don’t know what to look for. Could it be a problem that the management network shares the same NIC as the VMs?

I’ve been struggling with this issue for several weeks now. Any help or suggestions are highly appreciated.
Question by:thijsschoonbrood
    26 Comments
     
    LVL 5

    Expert Comment

    by:wolfje_xp
    Hi,
    I also use ESXi 4.0, but haven't applied the last update. I also share a NIC for management and VM traffic, without problems so far.
    Maybe you can have a look at the log files in /var/log?
     
    LVL 7

    Expert Comment

    by:bbnp2006
    What do vpxa.log and hostd.log say? If it's an agent issue within each ESX host, those logs should reveal some clues for you.

    ./var/log/vmware/hostd.log
    ./var/log/vmware/vpx/vpxa.log

    Post back.
     

    Author Comment

    by:thijsschoonbrood
    After a recent boot, vSphere is able to connect again. However, I see some disturbing lines in these logs:

    In /var/log/messages:
    Hostd: [2010-04 ..... 29766DC0 warning 'Proxysvc'] Accept on client connection failed: Bad file descriptor

    In /var/log/hostd.log:
    [2010-04-....  warning 'Proxysvc Req00005'] Error reading from client while waiting for header: N7Vmacore15SystemExceptionE (Connection reset by peer)

    I was unable to find '/var/log/vmware/vpx/vpxa.log', the subdir vpx didn't exist either.


    Before I rebooted the physical machine, I noticed an even more disturbing message in a log:
    [2010-04 -… 20BECCDC panic ‘App’] error: Cannot allocate memory

    and several of these (which i suspect are related to an out of memory situation):
    vmkernel: 3:19:39:51.629 cpu2:1028056)WARNING: Tcpip_Socket: 1619: socreate(type=2, proto=0) failed with error 55
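
The vmkernel warnings above can be tallied with a short script to see when they start and which error codes occur. This is a minimal Python sketch, assuming the leading field `3:19:39:51.629` is host uptime as days:hours:minutes:seconds (an assumption, not confirmed in this thread) and that all such lines follow the format of the single line quoted above. The function name `parse_socreate_failures` is hypothetical. On BSD-derived systems errno 55 is ENOBUFS (no buffer space available); whether vmkernel uses that numbering here is also an assumption.

```python
import re
from collections import Counter

# Pattern modelled on the single vmkernel line quoted in this thread.
LINE_RE = re.compile(
    r"vmkernel: (\d+):(\d+):(\d+):(\d+)\.(\d+) .*"
    r"socreate\(type=(\d+), proto=(\d+)\) failed with error (\d+)"
)


def parse_socreate_failures(lines):
    """Return a list of (uptime_seconds, error_code) for each matching line."""
    failures = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        days, hours, minutes, seconds, millis = (int(g) for g in m.groups()[:5])
        # Assumes the fractional part is given in milliseconds (three digits).
        uptime = days * 86400 + hours * 3600 + minutes * 60 + seconds + millis / 1000
        failures.append((uptime, int(m.group(8))))
    return failures


sample = [
    "vmkernel: 3:19:39:51.629 cpu2:1028056)WARNING: Tcpip_Socket: 1619: "
    "socreate(type=2, proto=0) failed with error 55",
]
failures = parse_socreate_failures(sample)
counts = Counter(code for _, code in failures)
print(counts, [round(uptime / 86400, 2) for uptime, _ in failures])
```

If the uptime reading is right, the quoted warning was logged close to four days after boot, which would fit the "about three days" pattern described in the question.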

    I had allocated 8GB to one of the virtual machines, while the host only has 8 GB in total. Could this have caused the memory problem? To be on the safe side, I reduced the amount to 6GB.
    Also, I put the management network on another vSwitch connected to a dedicated NIC port. So now the VMs share a NIC port and the management network uses a separate one.

    Perhaps one of these actions fixes the issue (I will only know after three days...), but I would very much like to understand what went wrong. Did I misconfigure anything?
     
    LVL 5

    Expert Comment

    by:wolfje_xp
    Hi. It sure can be a memory issue. It's not a good idea to grant all your physical memory to a VM. (It shouldn't produce those errors, however, but swapping will kill your performance.)
    I assume your network settings are correct (domain etc.)?
    Any chance you have bad memory or a bad hard disk?

     

    Author Comment

    by:thijsschoonbrood
    I can only assume that all the physical server hardware is OK, since we experience exactly the same issue on two identical (brand new) servers, and only after three days. It is of course an assumption, but I think I can be rather sure the hardware is OK. The network settings are correct, I think; I verified them quite a few times. :-)
     
    LVL 7

    Expert Comment

    by:bbnp2006
    Can I assume that you have run a full memtest on the memory of both servers? :)
     

    Author Comment

    by:thijsschoonbrood
    I put the management network on a different NIC port (and vSwitch) and reduced the memory allocated to the VMs to 6GB. Still, after three days, vSphere is unable to connect to ESXi again. :-( I'm running out of ideas here. :-(
     
    LVL 7

    Expert Comment

    by:bbnp2006
    Are you sure you have the latest driver for your NICs? It sounds like a NIC problem when you are experiencing "PING tests intermittently fail" problems.
     

    Author Comment

    by:thijsschoonbrood
    Hi bbnp2006,

    Well, I've applied all available patches, assuming that all the network drivers would be updated to the new version as well. Is my assumption wrong? How can I determine exactly which network driver version I'm using? While my vSphere client could still connect, I was not able to find any version reference except the build number.
    (I'm using the (internal) Broadcom NetXtreme II BCM5716 NIC.)
     
    LVL 7

    Assisted Solution

    by:bbnp2006
    This is the link to the hardware compatibility list from VMware. If you search for Broadcom network cards, for some reason I cannot find the BCM5716 model in the list. Maybe I am wrong; I am still searching to see whether your NIC is really supported. If it turns out it is not supported, that might be the reason the driver is not functioning well. Maybe you can double-check that?
     
    LVL 7

    Accepted Solution

    by:
    According to VMware's website:
    http://www.vmware.com/resources/compatibility/search.php?action=search&deviceCategory=io&productId=1&advancedORbasic=advanced&maxDisplayRows=50&key=bcm5716&release[]=-1&datePosted=-1&partnerId[]=-1&manufacturer[]=-1&vid=&did=&svid=&ssid=&rorre=0

    Broadcom NetXtreme II BCM5716 Gigabit Ethernet is supported up to ESX 3.5 U5
    Broadcom NetXtreme II BCM5716S Gigabit Ethernet is supported in ESX / ESXi 4.0 U1

     

    Author Comment

    by:thijsschoonbrood
    Hi BBNP2006,
    First, thanks for your help! I did check the compatibility guide, but I guess I overlooked the 'S'... However, Dell indicates that ESX 4.0 U1 is supported on their R210 server hardware. Perhaps it's supported, but not when using the internal NIC? I would really rather not revert to 3.5, though. Perhaps it's possible to use a bnx2 version that supports the 5716 (without the S)? How can I determine which bnx2 version is currently in use?
     

    Author Comment

    by:thijsschoonbrood
    I did find a driver for ESXi 4.0 for the BCM 5716 (without the S ;-)).
    http://downloads.vmware.com/d/details/esx_esxi_40_broadcom_bnx2_dt/ZCV0YmRqZHRidHdw
    I'll check it out ASAP.
     
    LVL 7

    Expert Comment

    by:bbnp2006
    Yes, that's the post-release driver for your NIC, which was not supported "inbox" when vSphere was released. Hopefully it will get rid of the networking issue.
    Report back if it works for you. Good luck!
     

    Author Comment

    by:thijsschoonbrood
    I will report back ASAP. One question though: are drivers (like the bnx2 driver) also included when I check for updates using the Host Update Utility?
     
    LVL 7

    Assisted Solution

    by:bbnp2006
    Only drivers for hardware on the hardware compatibility list are included in updates. So in your situation, you have to look for the driver VMware released for older hardware. I'd suggest replacing your NIC if possible, so that all driver updates are included when you do host updates :)
     

    Author Comment

    by:thijsschoonbrood
    I checked the bnx2 driver version using 'ethtool -i vmnic1'.
    It revealed that I was still on 1.6.9. The latest release available from VMware (URL a few posts back) is 2.0.7c. I just installed this new driver successfully and rebooted the system. In three days I'll know whether that solved the issue. :-)
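
When the same check has to be repeated on several hosts, the `ethtool -i` output can be parsed and the versions compared in a script. This is a minimal Python sketch, assuming the output consists of `key: value` lines; the sample text is reconstructed from the values mentioned in this post rather than copied from a real host, and the helper names `parse_ethtool_info` and `version_key` are hypothetical.

```python
import re


def parse_ethtool_info(text):
    """Turn 'key: value' lines (as printed by 'ethtool -i') into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_key(version):
    """Make versions such as '1.6.9' and '2.0.7c' comparable piece by piece."""
    parts = []
    for piece in version.split("."):
        m = re.match(r"(\d+)(.*)", piece)
        # Each piece becomes (number, trailing suffix), e.g. '7c' -> (7, 'c').
        parts.append((int(m.group(1)), m.group(2)) if m else (0, piece))
    return parts


# Sample reconstructed from the values in this post (not real captured output).
sample = "driver: bnx2\nversion: 1.6.9\n"
installed = parse_ethtool_info(sample)["version"]
latest = "2.0.7c"
print(installed, "is older than", latest, ":", version_key(installed) < version_key(latest))
```

The suffix handling is deliberately simple: a letter suffix only matters when the numeric parts are equal.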
     
    LVL 7

    Expert Comment

    by:bbnp2006
    Great stuff! Fingers crossed and looking forward to your updates in 3 days :)
    ~bbnp2006
     

    Author Comment

    by:thijsschoonbrood
    Unfortunately, it turned out that the issue was not solved. I also disabled the CIM agents but after a while the servers both crashed again.
     
    LVL 7

    Expert Comment

    by:bbnp2006
    Sorry to hear that, mate. I would contact VMware for support (hopefully that's an option). Let me do some digging to see if I can find anything.
     

    Author Comment

    by:thijsschoonbrood
    Hi bbnp2006,
    VMware support definitely is an option if it's up to me. However, I've enabled SSH (to help diagnose the problem) and disabled the CIM agents, which breaks the support option, I'm afraid.
    Also, here is another guy having the exact same issue: http://communities.vmware.com/thread/268626.
    I've added the observation below to that thread; maybe it helps in generating some new ideas. ;-) Is there a way the two servers can influence one another?
    Thanks for sticking with me. I appreciate it.
    --------
    We're running two (identical) servers. It appears that the issue only occurs when both ESXi hosts are up (in the sense that vSphere can connect). This chain of events leads me to believe this:

    Both servers were inaccessible due to the issue at hand (VMs running fine, but vSphere couldn't connect).
    I disabled the CIM agents on one server and rebooted this server.
    This worked for weeks: vSphere was able to connect to one server (and not the other, which was still to be fixed).
    I became confident that this solved the issue and implemented it on the other server as well.
    I rebooted the second server.
    vSphere was able to connect to both servers.
    A few days later, both servers were inaccessible through vSphere again... same story with 'cannot allocate memory'...
     
    LVL 7

    Expert Comment

    by:bbnp2006
    No problem, bud.
    Just another thought: are you 100% confident that there's no IP conflict of any sort on your network? It's just so strange that it works for a while and then all of a sudden stops working... Just trying to cover all the bases.
     

    Author Comment

    by:thijsschoonbrood
    These days I'm never 100% certain of anything anymore. ;-) But yes, I did check that (quite a few times ;-)). Could an IP conflict cause memory problems?
    Perhaps some process exhausts all sockets? Is there a way to determine which (and how many) sockets are opened by a particular process under ESXi?
     
    LVL 7

    Expert Comment

    by:bbnp2006
    The #ps command should give you lots of options to query all the processes running on the host, but I am not sure whether it will show all the open ports.
     
    LVL 7

    Expert Comment

    by:bbnp2006
    actually,
    #netstat -tulnap
    will give you all the ports bud :P
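
On a host where netstat is available, its output can be tallied per process to spot one that keeps opening sockets. This is a minimal Python sketch, assuming the usual Linux `netstat -tulnap` column layout with a trailing "PID/Program name" column; the sample rows, addresses and program names are hypothetical, and the function name `sockets_per_process` is made up for illustration.

```python
import re
from collections import Counter

# Matches the trailing "PID/Program name" column, e.g. "1033/httpd".
PID_PROG_RE = re.compile(r"\s(\d+)/(\S+)")


def sockets_per_process(netstat_output):
    """Count tcp/udp rows per owning process in 'netstat -tulnap' output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        line = line.strip()
        if not line.startswith(("tcp", "udp")):
            continue  # skip headers and other protocols
        m = PID_PROG_RE.search(line)
        # Rows without a visible owner (shown as "-") are grouped as "unknown".
        owner = f"{m.group(2)}[{m.group(1)}]" if m else "unknown"
        counts[owner] += 1
    return counts


# Hypothetical sample output, not taken from any host in this thread.
sample = """\
Proto Recv-Q Send-Q Local Address     Foreign Address   State        PID/Program name
tcp        0      0 0.0.0.0:22        0.0.0.0:*         LISTEN       812/sshd
tcp        0      0 0.0.0.0:443       0.0.0.0:*         LISTEN       1033/httpd
tcp        0      0 192.0.2.5:443     192.0.2.9:51512   ESTABLISHED  1033/httpd
udp        0      0 0.0.0.0:123       0.0.0.0:*                      901/ntpd
tcp        0      0 0.0.0.0:8080      0.0.0.0:*         LISTEN       -
"""
counts = sockets_per_process(sample)
for owner, n in counts.most_common():
    print(owner, n)
```

Comparing such tallies taken a day apart would show whether one process's socket count grows steadily.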
     

    Author Comment

    by:thijsschoonbrood
    Hi bbnp2006
    No netstat on the SSH console (using ESXi 4.0 U1)... ;-(
