SD-WAN edge connected to a Cisco Small Business stack stops communicating normally

TL;DR: After a period of time, ARP from devices in a layer 2 connected VLAN stops registering on our SD-WAN edge device, which prevents those devices from traversing, or being routed by, that edge.

Hardware:
SD-WAN edge: Velocloud Edge 540 (problem has persisted through numerous firmware revisions)
Cisco stack: 1 Cisco SG500X-48 and 5 Cisco SG500X-48Ps connected in a loop/chain stack configuration using SFP+ fiber connectors (firmware also updated more than once).

Configuration:
The VC is the router/firewall SD-WAN device, with redundant internet connections, that establishes edge-to-edge IPsec tunnels and a tunnel to our internet gateway.
The Cisco stack connects 10 VLANs to the VC but does no routing or firewalling. The Cisco stack has 2 management IP interfaces in those VLANs (1 and 318); the rest are purely layer 2 connected.
Cisco interface to VC is set:
interface gigabitethernet1/1/8
description VC-StackConnection
switchport trunk allowed vlan add (necessary vlans)
switchport default-vlan tagged (default-vlan being 1)

The VC is set:
Mode: Trunk
Drop Untagged

Symptoms:
After an unpredictable amount of time (2 to 6 weeks) at our HQ location, where the equipment is located, most or all of the devices in some of the layer 2 connected VLANs cannot communicate externally. Internal communication (same broadcast domain) works as expected for the most part. Sometimes, if you start a persistent ping out of a device, it will traverse the VC edge after 8-10 seconds. Sometimes, if you ping another device in the same subnet, there is a long delay (8+ seconds) before the first reply, but once replies start, restarting the ping gets an immediate response. Devices with static IPs seem to function (mostly) normally if they were already static when the problem started.

Testing from an affected device shows that it can no longer ping its default gateway (the VC). If you reboot the device (so that it ARPs for its gateway) and then start an offsite ping to that device, it continues to function normally. With ‘keep alive’ ping intervals greater than 2.5 minutes, it stops functioning.
When a device is having problems, ARP table dumps on the VC edge show the IP of the affected device but with MAC address 00:00:00:00:00:00 and status PENDING. The Velocloud engineers tell us this means the edge has sent an ARP request for the owner of the IP but has not received a response. Packet captures (from the edge itself) show the requests going out but never receiving an answer. The captures also show that when a request comes in from a device looking up its default gateway, the VC responds to the ARP request.
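As a way to track which hosts are stuck over time, here is a minimal Python sketch that scans a saved ARP table dump for the symptom described above (an all-zero MAC or a PENDING status). The exact dump layout is not shown in this thread, so the scan does not assume column positions; the function name `pending_entries` and the sample text are hypothetical.

```python
import re

# Loose patterns: find an IPv4 address and a MAC anywhere on a line,
# because the real column layout of the edge's ARP dump is not known here.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")
ZERO_MAC = "00:00:00:00:00:00"


def pending_entries(dump_text):
    """Return the IPs on lines that show an all-zero MAC or the word PENDING."""
    stuck = []
    for line in dump_text.splitlines():
        ip = IPV4_RE.search(line)
        if ip is None:
            continue  # header or blank line
        mac = MAC_RE.search(line)
        zero_mac = mac is not None and mac.group(0) == ZERO_MAC
        if zero_mac or "PENDING" in line.upper():
            stuck.append(ip.group(0))
    return stuck


if __name__ == "__main__":
    # Made-up sample; addresses, MACs and the "ALIVE" state are illustrative only.
    sample = (
        "192.0.2.25  00:00:00:00:00:00  PENDING\n"
        "192.0.2.31  02:00:00:00:00:01  ALIVE\n"
    )
    print(pending_entries(sample))
```

Running this against dumps taken at intervals would show whether the set of PENDING hosts grows gradually or appears all at once.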

A capture of an affected device’s port on the Cisco stack shows the device sending an ARP request for the gateway, but only ‘currently functional devices’ respond to the VC’s ARP requests. This last time, the situation devolved to the point that DHCP served from the edge in one VLAN did not reach the devices in that network. We found that if we bypassed our Cisco stack, the devices on the ‘bypass switch’ would receive IPs and update/respond to ARP normally. However, other VLANs on the Cisco stack that the edge served DHCP to continued to function, as did the devices in those VLANs.
This would lead one to believe it’s the Cisco stack.
A double reboot of the Cisco stack (in conjunction with ARP flush on the VC) doesn’t resolve the problem. Rebooting the VC seems to. We have never rebooted the edge first in this scenario (this is our main site and ideally we don’t want to simply reboot anything).
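To compare the edge-side and switch-side captures described above, a small Python sketch like the following could list ARP requests that never got a matching reply. It assumes the capture has already been exported into simple records; the `ArpRecord` type, the `unanswered_requests` name, and the one-second window are all illustrative choices, not anything taken from the Velocloud or Cisco tooling.

```python
from typing import Iterable, List, NamedTuple


class ArpRecord(NamedTuple):
    time_s: float    # capture timestamp in seconds
    op: str          # "request" or "reply"
    sender_ip: str   # ARP sender protocol address
    target_ip: str   # ARP target protocol address


def unanswered_requests(records: Iterable[ArpRecord],
                        window_s: float = 1.0) -> List[ArpRecord]:
    """Return requests with no matching reply within window_s seconds.

    A reply matches when it comes from the requested IP, is addressed to
    the requester, and arrives at or after the request.
    """
    records = list(records)
    unanswered = []
    for req in records:
        if req.op != "request":
            continue
        answered = any(
            r.op == "reply"
            and r.sender_ip == req.target_ip
            and r.target_ip == req.sender_ip
            and 0 <= r.time_s - req.time_s <= window_s
            for r in records
        )
        if not answered:
            unanswered.append(req)
    return unanswered
```

Running the same check on a capture from the edge and on one from the stack uplink port would show on which side of the link the replies go missing.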

We’re uncertain whether this is a stack/edge configuration issue, a VLAN/PVID issue, or all of the above; we are not even ruling out layer 1.

Craig Beck

Why are you tagging the default VLAN? Why not just use an unused VLAN as default?

Aaron Tomosky

The VC is set to drop untagged; perhaps the Cisco stack is responding to these requests untagged on the default VLAN 1, or something similar?

Craig Beck

You need both sides to tag the default if that's the way you want it.

Member_2_6375190

ASKER

Both sides are tagging VLAN 1: all VC VLANs are tagged, and all Cisco stack VLANs on that interface are tagged (hence the default VLAN being tagged).

I know that omitting the VLAN numbers makes this vague, but that is intentional.

Aaron Tomosky

Did you set the pvid as well?

Craig Beck

What config do you have on your switch? Are you running any L2 security features such as DAI or IPSG?

Member_2_6375190

ASKER

@Aaron - In this config the PVID defaults to VLAN 4095P. I have considered changing this to a General port interface and setting the PVID to VLAN 1, but since the VC describes the interface as a trunk port, I have erred on the side of caution. Also note that I do not have "switchport mode trunk" set on the Cisco stack, as Cisco documentation indicates that this negotiates trunking via DTP on Catalyst switches or, in the case of a small business switch, via GVRP.

Member_2_6375190

ASKER

@Craig - Negative for both ARP inspection and IP source guard. Neither is enabled.

Aaron Tomosky

One thing to explore: something may be sending out untagged packets meant for VLAN 1, but they are being picked up as 4095 since that's the PVID, hence no response. I agree that in your situation everything SHOULD be tagged VLAN 1, but... In fact, I remember some gear didn't allow tagged VLAN 1; I want to say Brocade switches, maybe? Perhaps some of your gear also doesn't like tagged VLAN 1 and is sending it untagged, hence the PVID problem.
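To make that hypothesis concrete, here is a minimal Python sketch of how an 802.1Q-aware port classifies an incoming frame: a tagged frame keeps the VLAN ID from its tag, while an untagged (or priority-tagged, VID 0) frame is assigned the port's PVID, or dropped if the port only admits tagged frames. The names `ingress_vlan` and `drop_untagged` are illustrative; this models the standard behaviour, not the actual firmware of either device.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def ingress_vlan(frame: bytes, pvid: int, drop_untagged: bool = False):
    """Return the VLAN ID a port would assign to frame, or None if dropped."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 12-13 hold the EtherType, or the TPID when a VLAN tag is present.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == TPID_8021Q:
        if len(frame) < 18:
            raise ValueError("truncated 802.1Q tag")
        (tci,) = struct.unpack_from("!H", frame, 14)
        vid = tci & 0x0FFF  # low 12 bits of the tag control field
        if vid != 0:
            return vid
        # VID 0 is a priority tag: the frame carries no VLAN of its own.
    if drop_untagged:
        return None
    return pvid


if __name__ == "__main__":
    dst = b"\xff" * 6
    src = bytes.fromhex("020000000001")
    arp_payload = bytes(28)  # placeholder ARP body, contents irrelevant here
    untagged = dst + src + struct.pack("!H", 0x0806) + arp_payload
    tagged_v1 = dst + src + struct.pack("!HHH", TPID_8021Q, 1, 0x0806) + arp_payload
    print(ingress_vlan(untagged, pvid=4095))    # lands in the PVID
    print(ingress_vlan(tagged_v1, pvid=4095))   # keeps its tagged VLAN
    print(ingress_vlan(untagged, pvid=1, drop_untagged=True))
```

With a PVID of 4095 on the switch side and drop-untagged on the edge side, any device or code path that emits VLAN 1 traffic untagged would lose it in either direction, which would match the symptom.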

Member_2_6375190

ASKER

@Aaron - So the only way I can do that cleanly and keep VLAN 1 tagged is to switch to a General port (802.1Q-style port) and set the PVID to VLAN 1. I have considered the possibility that something is getting sent to 4095, which is a 'dead end' VLAN built into Cisco. I can't even monitor it with a port mirror.

My other option is to change the VC interface to accept untagged traffic on VLAN 1 and remove the Cisco stack config line "switchport default-vlan tagged" on that interface, which would then allow untagged VLAN 1 traffic to traverse.

Most security forums I read say that switch-to-switch or switch-to-router interfaces should be tagged top to bottom to avoid VLAN hopping which is why I've configured it this way in the first place.

Aaron Tomosky

I'm no security expert, just an admin with experience, and I've always done untagged VLAN 1, non-routable, for the management interfaces of my network equipment, with no hosts and no user traffic. Since nothing's routed, I don't see how it's an attack surface, and I only allowed specific admin consoles access.

Again, not a security pro so maybe there is something I'm not considering here.

Member_2_6375190

ASKER

I think we're going to eliminate the 4095p but stay tagged (i.e. convert to General port and PVID 1) and run with that a few weeks to see if it eliminates our issues.

Barring that, we'll go trunked, untagged 1 on both sides of the interface.

ASKER CERTIFIED SOLUTION

Craig Beck


Member_2_6375190

ASKER

This is something to take into account. So far we've been relatively issue free but this will be the first thing we change if/when the issue comes up again.