SD-WAN edge connected to a Cisco small business stack stops communicating normally

TL;DR: After a period of time, ARP from devices in a layer 2 connected VLAN stops registering on our SD-WAN edge device, which prevents those devices from traversing that edge or being routed by it.

Hardware:
SD-WAN edge: Velocloud Edge 540 (the problem has persisted through numerous firmware revisions)
Cisco stack: 1 Cisco SG500X-48 and 5 Cisco SG500X-48Ps connected in a loop/chain stack configuration using SFP+ fiber connectors (firmware also updated more than once).

Configuration:
The VC is the SD-WAN router/firewall, with redundant internet connections, that establishes edge-to-edge IPsec tunnels and a tunnel to our internet gateway.
The Cisco stack connects 10 VLANs to the VC but does no routing or firewalling. The Cisco stack has management IP interfaces in 2 of those VLANs (1 and 318); the rest are purely layer 2 connected.
The Cisco interface to the VC is set:
interface gigabitethernet1/1/8
 description VC-StackConnection
 switchport trunk allowed vlan add (necessary vlans)
 switchport default-vlan tagged (default-vlan being 1)


The VC is set:
Mode: Trunk
Drop Untagged


Symptoms:
After an unspecified amount of time (2 to 6 weeks), at our HQ location where the equipment is located, most or all of the devices in some of the layer 2 connected VLANs cannot communicate externally. Internal communication (same broadcast domain) works as expected for the most part. Sometimes, if you start a persistent ping out from a device, it will begin traversing the VC edge after 8-10 seconds. Sometimes, if you ping another device in the same subnet, there is a long delay (8+ seconds) before the first reply, but once replies start, restarting the ping gets an immediate response. Devices with static IPs seem to function normally (mostly) if they were already static when the problem started.

Testing from an affected device shows that it can no longer ping its default gateway (the VC). If you reboot the device (so that it ARPs for its gateway) and then start a ping to that device from offsite, it continues to function normally. With ‘keep alive’ ping intervals greater than 2.5 minutes, it ceases to function.
ARP table dumps on the VC edge show the IP of the affected device with MAC address 00:00:00:00:00:00 and status PENDING while it is having problems. The Velocloud engineers tell us this means the edge has sent an ARP request for the owner of the IP but has not received a response. Packet captures (from the edge itself) show the requests going out but no answer ever coming back. The captures also show that when a request comes in from a device looking up its default gateway, the VC responds to the ARP request.
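
For anyone who wants to track which hosts are stuck in this state over time, here is a minimal Python sketch that scans a saved ARP table dump for unresolved entries. The dump layout, the sample addresses, and the ALIVE status string are assumptions made up for illustration; the real Velocloud output may look different. The only things taken from the description above are the all-zero MAC and the PENDING status.

```python
import re

# Loose patterns so the scan does not depend on exact column layout.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def unresolved_entries(dump_text):
    """Return the IPs whose ARP entry has an all-zero MAC or a PENDING status."""
    unresolved = []
    for line in dump_text.splitlines():
        ip = IPV4.search(line)
        if not ip:
            continue  # header or blank line
        mac = MAC.search(line)
        zero_mac = mac is not None and mac.group(0) == "00:00:00:00:00:00"
        if zero_mac or "PENDING" in line.upper():
            unresolved.append(ip.group(0))
    return unresolved


# Hypothetical dump: documentation-range addresses, invented column order.
SAMPLE = """\
192.0.2.1    00:00:5e:00:53:01  ALIVE
192.0.2.57   00:00:00:00:00:00  PENDING
192.0.2.80   00:00:5e:00:53:80  ALIVE
"""

print(unresolved_entries(SAMPLE))
```

Running this against dumps taken at intervals would show whether the set of unresolved hosts grows gradually or appears all at once, which may help narrow down when the problem starts.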

A capture of an affected device’s port on the Cisco stack shows the device sending ARP requests for the gateway, but only ‘currently functional devices’ respond to the VC’s ARP requests. This last time, the situation devolved to the point that DHCP served from the edge in one VLAN did not reach the devices in that network. We found that if we bypassed our Cisco stack, the devices on the ‘bypass switch’ would receive IPs and update/respond to ARP normally. However, other VLANs on the Cisco stack that the edge served DHCP to continued to function, as did the devices in those VLANs.
This would lead one to believe it’s the Cisco stack.  
A double reboot of the Cisco stack (in conjunction with ARP flush on the VC) doesn’t resolve the problem.  Rebooting the VC seems to.  We have never rebooted the edge first in this scenario (this is our main site and ideally we don’t want to simply reboot anything).

We’re uncertain whether this is a stack/edge configuration issue, a VLAN/PVID issue, or all of the above, and we are not even ruling out layer 1.
Asked by: Member_2_6375190

Craig Beck commented:
Why are you tagging the default VLAN? Why not just use an unused VLAN as default?

Aaron Tomosky (SD-WAN Simplified) commented:
The VC is set to drop untagged; perhaps the Cisco stack is responding to these requests untagged on the default VLAN 1?

Craig Beck commented:
You need both sides to tag the default if that's the way you want it.

Member_2_6375190 (author) commented:
Both sides are tagging VLAN 1. All VC VLANs are tagged and all interfaced Cisco stack VLANs are tagged (thus the default VLAN tagged).

I know that not listing the VLAN numbers on each side is vague, but that is intentional.

Aaron Tomosky (SD-WAN Simplified) commented:
Did you set the pvid as well?

Craig Beck commented:
What config do you have on your switch? Are you running any L2 security features such as DAI or IPSG?

Member_2_6375190 (author) commented:
@Aaron - In this config the PVID defaults to VLAN 4095P. I have considered changing this to a General port interface and setting the PVID to VLAN1, but since the VC describes the interface as a trunk port, I have erred on the side of caution. Also note that I do not have "switchport mode trunk" set on the Cisco stack, as Cisco documentation indicates that this negotiates trunking via DTP on Catalyst switches or, in the case of a small business switch, via GVRP.

Member_2_6375190 (author) commented:
@Craig - Negative for both ARP inspection and IP source guard. Neither is enabled.

Aaron Tomosky (SD-WAN Simplified) commented:
One thing to explore: something is sending untagged packets intended for VLAN1, but they are being picked up as 4095 since that's the PVID, hence no response. I agree that in your situation everything SHOULD be tagged VLAN1, but... In fact, I remember some gear didn't allow tagged VLAN1; I want to say Brocade switches, maybe? Perhaps some of your gear also doesn't like tagged VLAN 1 and is sending it untagged, hence the PVID problem.
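
To make that hypothesis concrete, here is a rough Python model of 802.1Q ingress classification: a tagged frame keeps its own VLAN ID, while an untagged (or priority-tagged, VID 0) frame is assigned the port's PVID. Treating 4095 as a discard VLAN follows the 'dead end' behavior described in this thread and is an assumption about the SG500X, not something checked against its documentation. The function and variable names are hypothetical.

```python
DISCARD_VLAN = 4095  # assumption: frames classified here go nowhere


def classify_ingress(frame_vid, pvid, member_vlans):
    """Return the VLAN a received frame lands in, or None if it is dropped.

    frame_vid:    VLAN ID from the 802.1Q tag, or None for an untagged frame.
    pvid:         the port's PVID, used for untagged and priority-tagged frames.
    member_vlans: set of VLANs the port is a member of.
    """
    # Untagged and priority-tagged (VID 0) frames take the port's PVID.
    vlan = pvid if frame_vid in (None, 0) else frame_vid
    if vlan == DISCARD_VLAN or vlan not in member_vlans:
        return None
    return vlan


uplink_vlans = {1, 318}  # only the two VLAN numbers named in the question

# Tagged VLAN 1 frame on a port whose PVID is 4095: accepted into VLAN 1.
print(classify_ingress(1, 4095, uplink_vlans))
# Untagged frame on the same port: classified to 4095 and dropped.
print(classify_ingress(None, 4095, uplink_vlans))
# Same untagged frame after changing the PVID to 1: accepted into VLAN 1.
print(classify_ingress(None, 1, uplink_vlans))
```

If the model matches the real behavior, any device or feature that emits an untagged frame meant for VLAN 1 would go unanswered with PVID 4095 but would work with PVID 1.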

Member_2_6375190 (author) commented:
@Aaron - So the only way I can do that cleanly and maintain tagged VLAN1 is to switch to a general port (802.1q style port) and set the PVID to VLAN1.  I have considered the possibility that something is getting sent to 4095 which is a 'dead end' VLAN built into Cisco.  I can't even monitor it with a port mirror.

My other option is to change the VC interface to accept untagged traffic on VLAN1 and remove the Cisco stack config line " switchport default-vlan tagged" on that interface which would then allow untagged VLAN1 to traverse.

Most security forums I read say that switch-to-switch or switch-to-router interfaces should be tagged top to bottom to avoid VLAN hopping which is why I've configured it this way in the first place.

Aaron Tomosky (SD-WAN Simplified) commented:
I'm no security expert, just an admin with experience, and I've always done untagged VLAN 1, non-routable, for the mgmt interfaces of my network equipment, with no hosts and no user traffic. Since nothing's routed I don't see how it's an attack surface, and I only allowed specific admin consoles access.

Again, not a security pro so maybe there is something I'm not considering here.

Member_2_6375190 (author) commented:
I think we're going to eliminate the 4095p but stay tagged (i.e. convert to General port and PVID 1) and run with that a few weeks to see if it eliminates our issues.

Barring that, we'll go trunked, untagged 1 on both sides of the interface.

Craig Beck commented:
If PVID is 4095 you have a default VLAN mismatch. This will have an effect on some network management protocols and can result in all kinds of weird issues.

As I said earlier, I would use an unused VLAN as default at both ends; then VLAN1 would still be tagged.

Member_2_6375190 (author) commented:
This is something to take into account.  So far we've been relatively issue free but this will be the first thing we change if/when the issue comes up again.