Solved

Weird ESXi host: some VMs on it unpingable, narrowed down to a vmnic

Posted on 2013-05-23
Last Modified: 2013-05-24
I have five x3650 M3 servers (let's call them s4, s5, s6, s7, s8), all with
the same hardware specs, in one cluster. All run ESXi 5.0 Update 1.

The vmnic assignments are the same on every ESXi host, i.e.:
vmnic0 ==> Management VLAN
vmnic1 ==> vMotion
Quad NIC 1 has vmnic4,5 connected for Prod VLANs
Quad NIC 2 has vmnic8,9 connected for the same Prod VLANs


They had been running fine since the middle of last year until about
1 month ago, when 1-3 VMs in s4 would suddenly become
unpingable (as reported by Tivoli; I would try to RDP into
them on getting the Tivoli alert, but couldn't get access).

Troubleshooting done so far:
a) vCenter could still open a console to all the VMs in s4 (affected &
    unaffected), but the affected VMs can't ping their gateway IP
    address, even though "ipconfig" still shows the IP addresses of
    the affected VMs & their respective gateway IP addresses.  The
    affected VMs in s4 can ping other VMs in s4 that are on the same
    VLAN/subnet, but not VMs in s4 on different VLANs/subnets.
    Affected VMs also can't ping VMs on the same VLAN that sit
    on s5-s8

b) from the console of the affected VMs (Win 2008 R2
    Standard x64), I noticed that the affected NIC
    shows as "Unidentified network".  For the unaffected
    VMs still in s4, it shows as "corp.local"

c) from vCenter's "Edit Settings", deleted the NIC
    adapter & recreated it : still the same issue

d) rebooted the affected VMs; after they had booted
    up (still on s4), still no joy

e) the moment I vMotioned out the affected VMs
    to another host (s5 - s8), the VMs became
    pingable again
   
f) from vCenter, selected the vSwitch, "Managed Hosts",
    ticked s4 & then disabled vmnic4/5: all VMs in s4
    (including those that had been pingable so far while in s4)
    immediately became unpingable.  Then we re-enabled
    vmnic4/5 & disabled vmnic8/9 (the other pair of NIC ports,
    on the other Quad NIC): all VMs in s4 became pingable
    again. Got IBM to replace this 'suspected' Quad NIC but
    still no joy.
    The pair of stacked Cisco C3750 switches that vmnic8/9
    are connected to showed high packet drops with 0 payload
    on those ports (i.e. input rate = output rate = 0 kbps)

g) All LEDs on the pair of switches are green & all LEDs on
    s4's NICs are green

h) I moved the cable of vmnic8 to a free port, vmnic1
    (the onboard NIC), then used vCenter ("Managed Hosts")
    to disable vmnic4+5+8+9 & enable only vmnic1: all VMs
    in s4 became unpingable.  I swapped this cable
    for a tested working cable & still no joy

Management wants ESXi on s4 to be reinstalled completely.

Any other suggestions?

I'll attach the bundle logs of s4 in a while
Question by:sunhux
6 Comments
 

Author Comment

by:sunhux
ID: 39190297
The bundle logs extracted from vCenter are too large to attach: about
50MB when zipped.  If needed, please let me know the specific
log / filename required & I'll attach it here

Assisted Solution

by:Andrew Hancock (VMware vExpert / EE MVE)
Andrew Hancock (VMware vExpert / EE MVE) earned 490 total points
ID: 39190307
Have you checked the configuration of the switch ports s4 is connected to?

Have you swapped network connections between s4 and the other servers?

How are these configured?

Quad NIC 1 has vmnic4,5 connected for Prod VLANs
Quad NIC 2 has vmnic8,9 connected for the same Prod VLANs

What are the teaming policy, load balancing and physical switch config?

Can you check which physical NICs the VMs are actually using? Run esxtop and press n to switch to the network view.

It will show which VM is using which physical NIC for data transfer. Or have you already done this, and the affected VMs are spread across all the NICs: 4, 5, 8 and 9?

Author Comment

by:sunhux
ID: 39190412
Once my access is granted, in about 1 hour's time, I'll get the esxtop output.

Btw, how do I copy the esxtop output out to a USB thumb drive?
We're not allowed to enable the SSH server on our ESXi hosts for security
reasons.

interface GigabitEthernet2/0/19
 description *** S2 ***
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 48,49,70,129,130,132,133,137-145,161,162,169,171
 switchport trunk allowed vlan add 173,174,185,186,189,191,410-412,421,422,424
 switchport trunk allowed vlan add 425,452,454
 switchport mode trunk
 switchport nonegotiate
 speed 1000
 duplex full
 spanning-tree bpdufilter enable
end

The above is the config of a Cisco switch port that is currently working.

I will post the configs of the two Cisco switch ports that the
suspected vmnic8/9 are connected to in a while
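
Once the configs of the suspect ports are available, one way to compare them with the working port is to expand the "switchport trunk allowed vlan" lines into sets and diff them. The following is only a rough Python sketch of that idea. The working config is the one posted above; the suspect config and its interface name are invented for illustration (the thread never shows the real ones), and allowed_vlans is a hypothetical helper, not part of any Cisco or VMware tool.

```python
import re

def allowed_vlans(config_text):
    """Expand 'switchport trunk allowed vlan [add] ...' lines into a set of VLAN IDs."""
    vlans = set()
    pattern = re.compile(r"\s*switchport trunk allowed vlan (?:add )?([\d,\-]+)\s*$")
    for line in config_text.splitlines():
        match = pattern.match(line)
        if not match:
            continue
        for part in match.group(1).split(","):
            if not part:
                continue
            if "-" in part:
                first, last = part.split("-")
                vlans.update(range(int(first), int(last) + 1))
            else:
                vlans.add(int(part))
    return vlans

# Config of the working port, as posted in the thread.
working = """\
interface GigabitEthernet2/0/19
 switchport trunk allowed vlan 48,49,70,129,130,132,133,137-145,161,162,169,171
 switchport trunk allowed vlan add 173,174,185,186,189,191,410-412,421,422,424
 switchport trunk allowed vlan add 425,452,454
 switchport mode trunk
"""

# HYPOTHETICAL suspect port, made up for this example only.
suspect = """\
interface GigabitEthernet1/0/1
 switchport trunk allowed vlan 48,49,70,129,130,133,137-145,161,162,169,171
 switchport trunk allowed vlan add 173,174,185,186,189,191,410-412,421,422,424
 switchport mode trunk
"""

missing = sorted(allowed_vlans(working) - allowed_vlans(suspect))
extra = sorted(allowed_vlans(suspect) - allowed_vlans(working))
print("VLANs missing on suspect port:", missing)
print("VLANs only on suspect port:", extra)
```

A VLAN mismatch is only one of several things that could be wrong with a port, so an empty diff does not clear the port; it just rules out one cause.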

Accepted Solution

by:
Andrew Hancock (VMware vExpert / EE MVE) earned 490 total points
ID: 39190433
That is difficult without SSH; you can only take a picture of the screen. But you should still be able to check which VMs are using which ports.

I would swap working ports for "non-working ports" to check whether the issue is with the physical switch or the server.

Also check for errors on the physical switch ports.

Author Comment

by:sunhux
ID: 39194961
>I would swap working ports for "non-working ports" as a check if the issue
> is physical switch or server.

I've done the swapping & isolated the problem to ports on the Cisco
switches (a pair of C3750X-48 stacked together): I got the network engineer
to provision 2 other ports, gi1/0/12 & gi2/0/12, on the same pair of
switches & vmnic8/9 now work.  I verified this by disconnecting all ports
& connecting only vmnic8 to gi1/0/12, then disconnecting it & connecting
only vmnic9 to gi2/0/12: in both cases all VMs in s4 were pingable.


Just one last question:
With 2 ports working & another 2 ports not working, shouldn't VMware
reroute all traffic to the 2 working ports (i.e. vmnic4 & vmnic5)?  This
is an LACP dot1q trunk of the four ports vmnic4/5/8/9, so I'm expecting
that with Cisco CDP in use (as shown in vCenter), ESXi should be
smart enough to route all traffic to the 2 remaining usable ports,
shouldn't it?

Assisted Solution

by:Andrew Hancock (VMware vExpert / EE MVE)
Andrew Hancock (VMware vExpert / EE MVE) earned 490 total points
ID: 39195190
How would the ESXi server know your ports are duff? It doesn't!

So the traffic still gets distributed across all four NICs, and any traffic sent up the duff ports could be going into a bucket of water! ESXi will not fail over while the port is up and linked!
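
The point above can be illustrated with a toy model in Python. This is not VMware's actual teaming code; it is a sketch that assumes failure detection based on link state only and a simple "virtual port ID modulo number of active uplinks" placement. The Uplink class, the function names and the VM names vm0..vm7 are all made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Uplink:
    name: str
    link_up: bool    # what the host can see: is the link up?
    forwards: bool   # does the switch port really pass traffic? (invisible to the host)

def active_uplinks(uplinks):
    """Link-status-only detection: any uplink with link up counts as healthy."""
    return [u for u in uplinks if u.link_up]

def place_vms(vm_port_ids, uplinks):
    """Pin each VM to one active uplink based on its virtual port ID."""
    active = active_uplinks(uplinks)
    if not active:
        return {}
    return {vm: active[port_id % len(active)] for vm, port_id in vm_port_ids.items()}

def unreachable(placement):
    """VMs pinned to an uplink whose switch port silently drops traffic."""
    return sorted(vm for vm, uplink in placement.items() if not uplink.forwards)

vms = {f"vm{i}": i for i in range(8)}  # hypothetical VMs and port IDs

# vmnic8/9 have green LEDs (link up) but their switch ports pass no traffic.
uplinks = [
    Uplink("vmnic4", link_up=True, forwards=True),
    Uplink("vmnic5", link_up=True, forwards=True),
    Uplink("vmnic8", link_up=True, forwards=False),
    Uplink("vmnic9", link_up=True, forwards=False),
]
print(unreachable(place_vms(vms, uplinks)))  # some VMs are black-holed

# If the links actually go down, the host notices and avoids those uplinks.
for u in uplinks[2:]:
    u.link_up = False
print(unreachable(place_vms(vms, uplinks)))  # no VM is black-holed
```

In this model only the VMs that happen to be pinned to vmnic8 or vmnic9 lose connectivity, which would be consistent with only some VMs in s4 becoming unpingable while the rest kept working.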