  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1554

Weird ESXi host: some VMs on it are unpingable; narrowed down to vmnic

I have 5 x3650 M3 servers (let's call them s4, s5, s6, s7, s8), all with the same
hardware specs, in one cluster; all are on ESXi 5.0 Update 1.

Every ESXi host has the same vmnic layout, i.e.:
vmnic0 ==> Management VLAN
vmnic1 ==> vMotion
Quad NIC 1 has vmnic4,5 connected for Prod VLANs
Quad NIC 2 has vmnic8,9 connected for the same Prod VLANs


They had been running fine since the middle of last year until about
1 month ago, when 1-3 VMs on s4 would suddenly become
unpingable (as reported by Tivoli; I would try to RDP into
them upon getting the Tivoli alert but couldn't get access).

Troubleshooting done so far:
a) vCenter could still console into all the VMs (affected &
    unaffected) on s4, but the affected VMs can't ping
    their gateway IP address, though "ipconfig" still shows
    the IP addresses of the affected VMs & their respective
    gateway IP addresses.  The affected VMs on s4 could ping
    other VMs on s4 that are on the same VLAN/subnet but
    not VMs on s4 that are on different VLANs/subnets.  Affected
    VMs also can't ping VMs on the same VLAN that sit
    on s5-s8

b) while in the affected VMs' consoles, I noticed that
    under Win 2008 R2 Standard x64 the affected
    NIC shows as "Unidentified network".  For the
    unaffected VMs still on s4, it shows as "corp.local"

c) from vCenter's "Edit Settings", deleted the NIC
    adapter & recreated it : still the same issue

d) rebooted the affected VMs & after they had booted
    up (still on s4), still no joy

e) the moment I vMotioned out the affected VMs
    to another host (s5 - s8), the VMs became
    pingable again
   
f) from vCenter, selected the vSwitch, "Managed Hosts",
    ticked s4 & then disabled vmnic4/5: all VMs on
    s4 (including those that had been pingable so far while on s4)
    immediately became unpingable.  Then we re-enabled
    vmnic4/5 & disabled vmnic8/9 (the other pair of NIC ports,
    on the other Quad NIC): all VMs on s4 became pingable
    again. Got IBM to replace this 'suspected' Quad NIC but
    still no joy.
    On the pair of stacked Cisco C3750 switches that vmnic8/9
    are connected to, the ports showed high packet drops with 0 payload
    (i.e. input rate = output rate = 0 kbps)

g) All LEDs on the pair of switches are green & all LEDs on
    s4's NICs are green

h) I moved the cable of vmnic8 to a free port, vmnic1
    (the onboard NIC), then used vCenter ("Managed Hosts") to
    disable vmnic4+5+8+9 but enable vmnic1, & all VMs
    on s4 became unpingable.  I swapped this cable
    for a tested working cable & still no joy

Management wants ESXi on s4 to be reinstalled entirely.

Any other suggestions?

I'll attach the bundle logs of s4 in a while
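
A note on symptom (a): on a standard vSwitch, traffic between two VMs on the same host and the same VLAN is switched inside the host, while everything else has to leave through a physical uplink (vmnic). The Python sketch below only illustrates that reasoning. The `VM` class, the `needs_uplink` helper, the VM names and the VLAN numbers are hypothetical, and it assumes the gateway is an external (physical) router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str   # hypothetical VM name
    host: str   # ESXi host the VM currently runs on, e.g. "s4"
    vlan: int   # VLAN of the port group the VM is attached to


def needs_uplink(src: VM, dst: VM) -> bool:
    """Return True if a ping from src to dst must cross a physical uplink.

    Assumption: standard vSwitch with an external gateway, so only
    same-host, same-VLAN traffic stays inside the vSwitch.
    """
    return not (src.host == dst.host and src.vlan == dst.vlan)


affected = VM("vm-a", "s4", 140)
same_vlan_same_host = VM("vm-b", "s4", 140)
other_vlan_same_host = VM("vm-c", "s4", 141)
same_vlan_other_host = VM("vm-d", "s5", 140)

for peer in (same_vlan_same_host, other_vlan_same_host, same_vlan_other_host):
    path = "physical uplink" if needs_uplink(affected, peer) else "inside the vSwitch"
    print(f"{affected.name} -> {peer.name}: {path}")
```

Only the first case works for the affected VMs, which is consistent with a fault somewhere on the uplink path (vmnic, cable or switch port) rather than inside the guest OS.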
Asked by: sunhux
3 Solutions
 
sunhux (Author) commented:
The bundle logs extracted from vCenter are too large, about
50MB when zipped.  If needed, pls let me know the specific
log / filename required & I'll attach it here
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Have you checked the configuration of the switch ports s4 is connected to?

Have you swapped network connections between s4 and other servers?

How are these configured?

Quad NIC 1 has vmnic4,5 connected for Prod VLANs
Quad NIC 2 has vmnic8,9 connected for the same Prod VLANs

Teaming policy, load balancing, physical switch config?

Can you check which physical NICs the VMs are actually using, with esxtop in network mode (type n)?

It will show which VM is using which physical NIC for data transfer. Or have you done this already, and it's all NICs: 4, 5, 8 and 9?
 
sunhux (Author) commented:
Once my access is granted in about 1 hr's time, I'll get the esxtop output.

Btw, how do I copy the esxtop output out to a USB thumb drive?
We're not allowed to enable the SSH server on our ESX servers for security
reasons.

interface GigabitEthernet2/0/19
 description *** S2 ***
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 48,49,70,129,130,132,133,137-145,161,162,169,171
 switchport trunk allowed vlan add 173,174,185,186,189,191,410-412,421,422,424
 switchport trunk allowed vlan add 425,452,454
 switchport mode trunk
 switchport nonegotiate
 speed 1000
 duplex full
 spanning-tree bpdufilter enable
end

The above is the config of a currently working port on the Cisco switch.

I'll post the configs of the two ports on the Cisco switch that the
suspected vmnic8/9 are connected to in a while
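
Once the configs of the suspect ports are available, one way to compare them against a working port is to expand the "allowed vlan" lines into sets and diff them. This is a minimal Python sketch, not a full IOS parser: it assumes purely numeric lists and ranges (no "all", "none" or "except"). The `allowed_vlans` helper is hypothetical, and `SUSPECT_PORT` is made-up sample data, not a config taken from the switches in this thread.

```python
def allowed_vlans(config):
    """Collect the VLAN IDs from 'switchport trunk allowed vlan [add] ...' lines."""
    prefix = "switchport trunk allowed vlan"
    vlans = set()
    for line in config.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        spec = line[len(prefix):].strip()
        if spec.startswith("add "):
            spec = spec[len("add "):]   # "add" extends the existing list
        else:
            vlans = set()               # without "add" the list is replaced
        for part in spec.split(","):
            if "-" in part:
                low, high = part.split("-")
                vlans.update(range(int(low), int(high) + 1))
            else:
                vlans.add(int(part))
    return vlans


# The allowed-VLAN lines of the working port posted above.
WORKING_PORT = """
 switchport trunk allowed vlan 48,49,70,129,130,132,133,137-145,161,162,169,171
 switchport trunk allowed vlan add 173,174,185,186,189,191,410-412,421,422,424
 switchport trunk allowed vlan add 425,452,454
"""

# Hypothetical suspect port: same as above but with the last "add" line missing.
SUSPECT_PORT = """
 switchport trunk allowed vlan 48,49,70,129,130,132,133,137-145,161,162,169,171
 switchport trunk allowed vlan add 173,174,185,186,189,191,410-412,421,422,424
"""

missing = allowed_vlans(WORKING_PORT) - allowed_vlans(SUSPECT_PORT)
print("VLANs missing on the suspect port:", sorted(missing))
```

An empty difference would mean the allowed VLAN lists match, so the problem would have to be elsewhere (channel-group membership, port errors, or the hardware).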
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
That is difficult; you can only take a picture of the screen, but you should be able to check which VMs are using which ports.

I would swap working ports for "non-working ports" as a check if the issue is physical switch or server.

Also check for errors on the physical switch ports.
 
sunhux (Author) commented:
>I would swap working ports for "non-working ports" as a check if the issue
> is physical switch or server.

I had done the swapping & isolated the problem to ports on the Cisco switches
(a pair of C3750X-48 stacked together): got the network engr
to provision 2 other ports, gi1/0/12 & gi2/0/12, on the same pair of
switches & vmnic8/9 now work ==> verified by disconnecting all ports
& connecting only vmnic8 to gi1/0/12, then disconnecting it & connecting
only vmnic9 to gi2/0/12: in both cases all VMs on s4 were pingable.


Just one last question:
with 2 ports working & another 2 ports not working, shouldn't VMware
reroute all traffic to the 2 working ports (i.e. vmnic4 & vmnic5)?  This
is an LACP dot1q trunk of the four ports vmnic4/5/8/9, so I'm expecting
that with Cisco CDP being used (as shown in vCenter), ESXi should be
smart enough to route all traffic to the 2 remaining usable ports,
shouldn't it?
 
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
How would the ESXi server know your ports are duff? It doesn't!

So the traffic gets distributed across all four, and any traffic sent up the duff ports could go into a bucket of water! ESXi will not reroute it as long as the port is up and linked.
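
To illustrate the answer: when failure detection is based on link status only, an uplink leaves the team only when its link goes down. A port that is up but silently dropping frames keeps receiving its share of the traffic. The Python sketch below is a toy model of that behaviour, not VMware's actual teaming algorithm. The `Uplink` class, the modulo-based `pick_uplink` and the `forwards` flags are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    link_up: bool    # what the host can see: physical link state
    forwards: bool   # whether the switch port really passes traffic (invisible to the host)


def pick_uplink(port_id, team):
    """Toy load balancing: spread virtual ports over the uplinks whose link is up."""
    candidates = [u for u in team if u.link_up]
    return candidates[port_id % len(candidates)]


team = [
    Uplink("vmnic4", link_up=True, forwards=True),
    Uplink("vmnic5", link_up=True, forwards=True),
    Uplink("vmnic8", link_up=True, forwards=False),  # link LED green, traffic lost
    Uplink("vmnic9", link_up=True, forwards=False),
]

# Eight hypothetical VM virtual ports, numbered 0..7.
lost = [p for p in range(8) if not pick_uplink(p, team).forwards]
print("virtual ports that lose connectivity:", lost)

# Only when the link itself drops does the host stop using the bad uplinks.
for uplink in team:
    if not uplink.forwards:
        uplink.link_up = False
lost_after = [p for p in range(8) if not pick_uplink(p, team).forwards]
print("after the bad links go down:", lost_after)
```

In this model, shutting the two bad switch ports (or pulling the cables) is what moves every VM onto vmnic4/5. That matches what was seen in step (f) when vmnic8/9 were disabled.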