Solved

weird ESXi host: some VMs in it unpingable : narrowed to vmnic

Posted on 2013-05-23
Last Modified: 2013-05-24
I have five x3650 M3 servers (let's call them s4, s5, s6, s7, s8), all with the same
hardware specs, in one cluster: all are on ESXi 5.0 Update 1.

Every ESXi host has the same vmnic layout, i.e.:
vmnic0 ==> Management VLAN
vmnic1 ==> vMotion
Quad NIC 1 has vmnic4,5 connected for Prod VLANs
Quad NIC 2 has vmnic8,9 connected for the same Prod VLANs


They had been running fine since the middle of last year until about
1 month ago, when 1-3 VMs on s4 would suddenly become
unpingable (as reported by Tivoli; I would try to RDP into
them on getting the Tivoli alert but couldn't get access).

Troubleshooting done so far:
a) vCenter could still console into all the VMs (affected &
    unaffected) on s4, but the affected VMs can't ping
    their gateway IP address, though "ipconfig" still shows
    the affected VMs' IP addresses & their respective
    gateway IP addresses.  The affected VMs on s4 could ping
    other VMs on s4 in the same VLAN/subnet, but
    not VMs on s4 in different VLANs/subnets.  Affected
    VMs also can't ping VMs in the same VLAN that sit
    on s5-s8

b) while inside the affected VMs' consoles, I noticed
    that under Win 2008 R2 Standard x64 the affected
    NIC shows as "Unidentified network".  For the
    unaffected VMs still on s4, it shows as "corp.local"

c) from vCenter's "Edit Settings", deleted the NIC
    adapter & recreated it: still the same issue

d) rebooted the affected VMs; after they had booted
    up (still on s4), still no joy

e) the moment I vMotioned out the affected VMs
    to another host (s5 - s8), the VMs became
    pingable again
   
f) from vCenter, selected the vSwitch, "Managed Hosts",
    ticked s4 & then disabled vmnic4/5: all VMs on
    s4 (including those that had been pingable so far while on s4)
    immediately became unpingable.  Then we re-enabled
    vmnic4/5 & disabled vmnic8/9 (the other pair of NIC ports,
    on the other Quad NIC): all VMs on s4 became pingable
    again. Got IBM to replace this 'suspected' Quad NIC but
    still no joy.
    On the pair of stacked Cisco C3750 switches that vmnic8/9
    are connected to, the ports showed high packet drops with 0 payload
    (i.e. input rate = output rate = 0 kbps)

g) All LEDs on the pair of switches are green & all LEDs on
    s4's NICs are green

h) I moved vmnic8's cable to a free port, vmnic1
    (the onboard NIC), then used vCenter ("Managed Hosts") to
    disable vmnic4+5+8+9 but enable vmnic1, & all VMs
    on s4 became unpingable.  I swapped this cable
    for a tested working cable & still no joy

Management wants ESXi on s4 to be completely reinstalled.

Any other suggestions?

I'll attach the bundle logs of s4 in a while
Question by:sunhux
 

Author Comment

by:sunhux
ID: 39190297
The log bundle extracted from vCenter is too large, about
50MB when zipped.  If needed, please let me know the specific
log / filename required & I'll attach it here
 

Assisted Solution

by:Andrew Hancock (VMware vExpert / EE MVE)
ID: 39190307
Have you checked the configuration of the switch ports s4 is connected to?

Have you swapped network connections between s4 and other servers?

How are these configured?

Quad NIC 1 has vmnic4,5 connected for Prod VLANs
Quad NIC 2 has vmnic8,9 connected for the same Prod VLANs

What are the teaming policy, load balancing and physical switch config?

Can you check which actual NICs the VMs are using? Run esxtop and press n to switch to the network view.

It will show which VM is using which physical NIC for data transfer. Or have you done this already, and it's all the NICs: 4, 5, 8 and 9?
 

Author Comment

by:sunhux
ID: 39190412
Once my access is granted in about 1 hour's time, I'll get the esxtop output.

Btw, how do I copy the esxtop output out to a USB thumb drive?
We're not allowed to enable the SSH server on our ESX servers for security
reasons.

interface GigabitEthernet2/0/19
 description *** S2 ***
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 48,49,70,129,130,132,133,137-145,161,162,169,171
 switchport trunk allowed vlan add 173,174,185,186,189,191,410-412,421,422,424
 switchport trunk allowed vlan add 425,452,454
 switchport mode trunk
 switchport nonegotiate
 speed 1000
 duplex full
 spanning-tree bpdufilter enable
end

A currently working Cisco switch port looks like the above.

I will post the configs of the two Cisco switch ports that the
suspected vmnic8/9 are connected to in a while
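
Once both configs are to hand, one quick check is whether the suspect ports carry the same allowed VLANs as the working one. The following is only a minimal Python sketch of that comparison: it assumes the configs are pasted in as `show running-config` style text, and the names `allowed_vlans`, `vlan_diff` and `SUSPECT_PORT` are made up for illustration (the suspect config here is a hypothetical example, not the real port).

```python
import re

# Matches both the first "allowed vlan" line and the "allowed vlan add" continuations.
_ALLOWED = re.compile(r"^\s*switchport trunk allowed vlan (?:add )?([\d,\-]+)\s*$")

def allowed_vlans(config: str) -> set[int]:
    """Expand every 'switchport trunk allowed vlan [add] ...' line into VLAN IDs."""
    vlans: set[int] = set()
    for line in config.splitlines():
        match = _ALLOWED.match(line)
        if not match:
            continue
        for part in match.group(1).split(","):
            if "-" in part:                      # a range such as 137-145
                low, high = part.split("-")
                vlans.update(range(int(low), int(high) + 1))
            else:
                vlans.add(int(part))
    return vlans

def vlan_diff(reference: str, suspect: str) -> tuple[set[int], set[int]]:
    """Return (VLANs missing on the suspect port, VLANs only on the suspect port)."""
    ref, sus = allowed_vlans(reference), allowed_vlans(suspect)
    return ref - sus, sus - ref

# The working port from the post above.
WORKING_PORT = """\
interface GigabitEthernet2/0/19
 switchport trunk allowed vlan 48,49,70,129,130,132,133,137-145,161,162,169,171
 switchport trunk allowed vlan add 173,174,185,186,189,191,410-412,421,422,424
 switchport trunk allowed vlan add 425,452,454
 switchport mode trunk
"""

# Hypothetical suspect port: identical except that VLAN 70 was left out.
SUSPECT_PORT = WORKING_PORT.replace("48,49,70,", "48,49,")

if __name__ == "__main__":
    missing, extra = vlan_diff(WORKING_PORT, SUSPECT_PORT)
    print("missing on suspect port:", sorted(missing))
    print("only on suspect port:", sorted(extra))
```

This only compares the allowed-VLAN lists; speed/duplex, spanning-tree and channel-group settings would still need to be compared by eye.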
 

Accepted Solution

by:Andrew Hancock (VMware vExpert / EE MVE)
ID: 39190433
That is difficult; you can only take a picture of the screen, but you should still be able to check which VMs are using which ports.

I would swap working ports for "non-working ports" to check whether the issue is the physical switch or the server.

Also check for errors on the physical switch ports.
 

Author Comment

by:sunhux
ID: 39194961
>I would swap working ports for "non-working ports" as a check if the issue
> is physical switch or server.

I did the swapping & isolated the problem to ports on the Cisco switches
(a pair of C3750X-48 stacked together): got the network engineer
to provision 2 other ports, gi1/0/12 & gi2/0/12, on the same pair of
switches & vmnic8/9 now work ==> verified by disconnecting all ports,
connecting only vmnic8 to gi1/0/12, then disconnecting it & connecting
only vmnic9 to gi2/0/12: all VMs on s4 were pingable in both cases.


Just one last question:
With 2 ports working & another 2 ports not working, shouldn't VMware
reroute all traffic to the 2 working ports (i.e. vmnic4 & vmnic5)?  This
is an LACP dot1q trunk of the four ports vmnic4/5/8/9, so I'm expecting
that with Cisco CDP being used (as shown in vCenter), ESXi should be
smart enough to route all traffic to the 2 remaining usable ports,
shouldn't it?
 

Assisted Solution

by:Andrew Hancock (VMware vExpert / EE MVE)
ID: 39195190
How does the ESXi server know your ports are duff? It doesn't!

So the traffic gets distributed across all four, and any traffic sent up the duff ports could go into a bucket of water! ESXi won't fail over while the port is still up and linked.
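
To illustrate the point, here is a toy Python model of failover that looks only at link state. It is a sketch under stated assumptions, not ESXi's actual implementation: the `Uplink` class, the `forwards` flag and the simple modulo placement are made up for illustration, and the real load-balancing policy on this host is not known from the thread.

```python
from dataclasses import dataclass

@dataclass
class Uplink:
    name: str
    link_up: bool    # what the host can see: carrier is present
    forwards: bool   # whether the switch port really passes traffic (invisible to the host)

def usable_uplinks(uplinks):
    """Link-status-only detection: an uplink counts as healthy whenever its link is up."""
    return [u for u in uplinks if u.link_up]

def assign(vm_port_ids, uplinks):
    """Place each VM port on one 'healthy' uplink (modulo is a stand-in for the real policy)."""
    active = usable_uplinks(uplinks)
    if not active:
        return {port: None for port in vm_port_ids}
    return {port: active[port % len(active)] for port in vm_port_ids}

def unreachable(vm_port_ids, uplinks):
    """VM ports whose traffic lands on an uplink that silently drops it."""
    placement = assign(vm_port_ids, uplinks)
    return sorted(p for p, u in placement.items() if u is None or not u.forwards)

# The situation in this thread: vmnic8/9 are linked (green LEDs) but their switch ports are bad.
team_linked = [
    Uplink("vmnic4", link_up=True, forwards=True),
    Uplink("vmnic5", link_up=True, forwards=True),
    Uplink("vmnic8", link_up=True, forwards=False),
    Uplink("vmnic9", link_up=True, forwards=False),
]
# The same team after the bad ports are shut or unplugged, so the host sees link-down.
team_unplugged = [
    Uplink(u.name, link_up=u.forwards, forwards=u.forwards) for u in team_linked
]

if __name__ == "__main__":
    print(unreachable(range(8), team_linked))     # [2, 3, 6, 7]
    print(unreachable(range(8), team_unplugged))  # []
```

In this model only the VMs that happen to land on vmnic8/9 lose connectivity, which matches the symptom of a few VMs on s4 going unpingable while the rest stay up. Once the bad ports show link-down, everything moves to vmnic4/5.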