TL;DR: After a period of time, ARP from devices in a layer 2 connected VLAN stops registering on our SD-WAN edge device, which prevents those devices from traversing that edge or being routed by it.
SD-WAN edge: Velocloud Edge 540 (problem has persisted through numerous firmware revisions)
Cisco Stack: 1 Cisco SG500X-48 and 5 Cisco SG500X-48Ps connected in a loop/chain stack configuration using SFP+ fiber connectors (firmware also updated more than once).
VC is the router/firewall SD-WAN with redundant internet connections that establishes edge-to-edge IPsec tunnels and a tunnel to our internet gateway.
The Cisco stack connects 10 VLANs to the VC but is not doing any routing or firewall activities. The Cisco has 2 management IP interfaces, in VLANs 1 and 318; the rest of the VLANs are purely layer 2 connected.
Cisco interface to VC is set:
switchport trunk allowed vlan add (necessary vlans)
switchport default-vlan tagged (default-vlan being 1)
The VC is set:
After an unpredictable amount of time (2 to 6 weeks) at our HQ location, where the equipment is located, most or all of the devices in some of the layer 2 connected VLANs cannot communicate externally. Internal communication (same broadcast domain) works as expected for the most part. Sometimes if you start a persistent ping from a device, it will traverse the VC edge after 8-10 seconds. Sometimes if you ping another device in the same subnet there is a long delay (8+ seconds) before the first reply, but once replies start, restarting the ping gets an immediate response. Devices with static IPs seem to function normally (mostly) if they were already static when the problem started.
Testing from an affected device shows that it can no longer ping its default gateway (the VC). If you reboot the device (so it will ARP for its gateway) and then start an offsite ping to that device, it continues to function normally. With 'keep alive' ping intervals greater than 2.5 minutes, it ceases to function.
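For anyone wanting to pin down that 2.5 minute threshold more precisely, here is a rough sketch of an idle-gap sweep. The function names and the gap values are made up for illustration, and the ping flags assume Linux iputils (`-c` count, `-W` timeout in seconds); other platforms use different flags.

```python
import subprocess
import time
from typing import Callable, Iterable, Optional


def ping_once(host: str) -> bool:
    """Send a single echo request; True if a reply came back.

    Assumes Linux iputils ping flags (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_idle_threshold(
    probe: Callable[[], bool],
    idle_seconds: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Stay silent for each gap in turn, then probe once.

    Returns the first gap after which the probe failed, or None if
    every probe succeeded. The probe and sleep functions are injectable
    so the logic can be exercised without a network.
    """
    for gap in idle_seconds:
        sleep(gap)
        if not probe():
            return gap
    return None
```

Something like `find_idle_threshold(lambda: ping_once("203.0.113.10"), [60, 120, 150, 180, 240])` would be run from offsite against the affected device (the address here is a placeholder). Comparing the failing gap against the ARP timeout configured on the edge would show whether the two line up.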
ARP table dumps on the VC edge show the IP of the affected device, but with MAC address 00:00:00:00:00:00 and status PENDING while it is having problems. The Velocloud engineers tell us this means the edge has sent an ARP request for the owner of the IP but has not received a response. Packet captures (from the edge itself) show the requests going out but no answer coming back. The PCAP also shows that when a request comes in from a device looking up its default gateway, the VC responds to that ARP request.
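To track which hosts are stuck over time, the ARP dumps can be filtered for the all-zero MAC or the PENDING status. This is only a sketch: the exact column layout of the edge's ARP dump is not shown here, so the code just looks for an IPv4 address and a MAC anywhere on each line, and `unresolved_entries` is a made-up helper name.

```python
import re
from typing import List

# Match an IPv4 address and a colon-separated MAC anywhere in a line.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")

ZERO_MAC = "00:00:00:00:00:00"


def unresolved_entries(dump: str) -> List[str]:
    """Return the IPs whose ARP entry has an all-zero MAC or says PENDING.

    The dump layout is an assumption: one entry per line containing an
    IP and a MAC, in any column order.
    """
    stuck = []
    for line in dump.splitlines():
        ip = IP_RE.search(line)
        mac = MAC_RE.search(line)
        if not ip or not mac:
            continue  # header or unrelated line
        if mac.group(0) == ZERO_MAC or "PENDING" in line.upper():
            stuck.append(ip.group(0))
    return stuck
```

Running this against dumps taken every few minutes and diffing the results would show whether hosts drop out all at once or one by one as their entries age out.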
A capture of an affected device's port on the Cisco stack shows the device sending an ARP request for the gateway, but only 'currently functional devices' respond to the VC's ARP requests. This last time the situation devolved to the point that DHCP served from the edge in one VLAN did not reach the devices in that network. We found that if we bypassed our Cisco stack, the devices on the 'bypass switch' would receive IPs and update/respond to ARP normally. However, other VLANs on the Cisco stack that the edge serves DHCP to continued to function, as did the devices in those VLANs.
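Since a VLAN/PVID mismatch is one of the suspects, it may help to check which 802.1Q tag (if any) each captured ARP frame carries on the trunk. Below is a minimal sketch that decodes a raw Ethernet frame using only the standard library. `ArpInfo` and `decode_arp` are illustrative names, and getting the raw frame bytes out of a capture file is left to whatever tool produced it. Note that some capture points (port mirrors, host NICs) strip the tag before the frame is recorded.

```python
import socket
import struct
from typing import NamedTuple, Optional


class ArpInfo(NamedTuple):
    vlan: Optional[int]  # 802.1Q VLAN ID, or None if the frame was untagged
    op: str              # "request", "reply", or the numeric opcode as text
    sender_mac: str
    sender_ip: str
    target_ip: str


def decode_arp(frame: bytes) -> Optional[ArpInfo]:
    """Decode an Ethernet frame carrying IPv4-over-Ethernet ARP.

    Handles a single optional 802.1Q tag. Returns None for anything
    that is not a well-formed ARP frame.
    """
    if len(frame) < 14:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset = 14
    vlan = None
    if ethertype == 0x8100:  # 802.1Q tag present
        if len(frame) < 18:
            return None
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        vlan = tci & 0x0FFF  # low 12 bits of the TCI are the VLAN ID
        offset = 18
    if ethertype != 0x0806 or len(frame) < offset + 28:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, _tha, tpa = struct.unpack(
        "!HHBBH6s4s6s4s", frame[offset:offset + 28]
    )
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    op = {1: "request", 2: "reply"}.get(oper, str(oper))
    mac = ":".join(f"{b:02x}" for b in sha)
    return ArpInfo(vlan, op, mac, socket.inet_ntoa(spa), socket.inet_ntoa(tpa))
```

If the edge's requests for an affected host show up with a different VLAN ID (or untagged) compared with requests for a working host, that would point at the trunk/PVID configuration rather than the hosts themselves.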
This would lead one to believe it’s the Cisco stack.
A double reboot of the Cisco stack (in conjunction with an ARP flush on the VC) doesn't resolve the problem. Rebooting the VC seems to. We have never rebooted the edge first in this scenario (this is our main site and ideally we don't want to simply reboot anything).
We're uncertain whether this is a stack/edge configuration issue, a VLAN/PVID issue, or all of the above, and we are not even ruling out layer 1.