Port STP Change on Meraki MS320 Locking Up Entire Network

We have been troubleshooting an ongoing STP issue* (see bottom of post) for three weeks. It is causing disconnects across the network for phones, desktops, and Citrix connections.

All LAN end users across the board are impacted. WAN users have no issue whatsoever.

- Jittery or dropped VoIP calls
- Disconnected Citrix sessions
- Network lag on physical desktops
- Coreswitch1 misses heartbeats so Coreswitch2 is taking over as VRRP master and then it flips back again

This happens every night, starting at about 2200 and lasting until about 1030. It may occur every 6 minutes like clockwork, or intermittently every half hour or every couple of hours. It may start around 2200 or as late as 0030. It definitely stops at or around 1030 and doesn't start again until the evening.

Coreswitch1 is the STP root bridge
Coreswitch2 has 4096 priority
All tertiary switches have a priority of 32768 with RSTP turned on
I drew this to explain the network layout with regard to STP:
[Image: STP topology]
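
To make the priorities above concrete, here is a minimal Python sketch of how STP picks a root: the bridge with the lowest bridge ID (priority first, then MAC address as a tie-breaker) wins. The MAC addresses are made up for illustration, and Coreswitch1's priority of 0 is an assumption, since the post only says it is the root.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str
    priority: int  # configured STP priority (a multiple of 4096)
    mac: str       # hypothetical MAC address, used only as a tie-breaker

    def bridge_id(self):
        # Lower tuples are "better": compare priority first, then MAC.
        return (self.priority, self.mac.lower())


def elect_root(bridges):
    """Return the bridge with the lowest bridge ID."""
    return min(bridges, key=Bridge.bridge_id)


bridges = [
    Bridge("Coreswitch1", 0, "00:18:0a:00:00:01"),          # priority assumed
    Bridge("Coreswitch2", 4096, "00:18:0a:00:00:02"),
    Bridge("Tertiaryswitch1", 32768, "00:1b:54:00:00:03"),
]

print(elect_root(bridges).name)      # root with all bridges present
print(elect_root(bridges[1:]).name)  # root if Coreswitch1 stops being heard
```

With these numbers, a tertiary switch at 32768 should only claim the root role if it stops hearing BPDUs from both cores.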

- Meraki event logs show these events (xlsx attached). The logs don't include much detail. It seems that whatever is causing the STP changes makes Coreswitch1 miss a heartbeat, so Coreswitch2 takes over as VRRP master and then flips back. I think this is a symptom of the STP changes rather than the cause.
[Image: Event Log from Meraki Dashboard]
- On the Meraki switches, I turned on STP port guard on each port that a tertiary switch connects to (e.g., Tertiaryswitch1 plugs into Coreswitch1 on port 1). After doing this, I still observed the "flipping" on Coreswitch2. This makes me think that the issue is on Coreswitch2 or a device connected to it, and is not coming from the tertiary switches. Perhaps I ran that test incorrectly?

- Disconnected all other (known) network devices on the tertiary switches like access points, printers, etc. and still observed the issue.

- Ran a packet capture on the core switches immediately after the triggering event happened. Running it just a tad late, I got the STP changes but not the event that triggered them. I'll try to run another capture tonight and try to get it started just before the issue starts.

- Meraki support asked me to turn on portfast on all of the tertiary switches. I did that today and will wait until the event timeline starts again tonight to see if it helps.
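
One way to check the "every 6 minutes" pattern against an exported event log is to compute the gap between consecutive events and flag which ones fall in the nightly window. This is only a sketch: the timestamp format and the sample timestamps are assumptions for illustration and are not taken from the attached log.

```python
from datetime import datetime, time

FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed export format; adjust to the real log


def gaps_in_minutes(timestamps):
    """Return the minutes between consecutive events, oldest first."""
    parsed = sorted(datetime.strptime(t, FORMAT) for t in timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(parsed, parsed[1:])]


def in_nightly_window(timestamp, start=time(22, 0), end=time(10, 30)):
    """True if the event falls between 22:00 and 10:30 the next morning."""
    t = datetime.strptime(timestamp, FORMAT).time()
    return t >= start or t <= end


# Hypothetical timestamps for illustration only.
events = [
    "2015-05-05 22:04:00",
    "2015-05-05 22:10:00",
    "2015-05-05 22:16:00",
    "2015-05-06 09:40:00",
]

print(gaps_in_minutes(events))
print([in_nightly_window(e) for e in events])
```

A run of near-identical gaps would point toward something scheduled or timer-driven, while irregular gaps would fit a flapping link or an intermittently connected device better.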

I've been talking to Meraki support and they haven't been able to put their finger on the issue. I don't even know where to look. Is it coming from the tertiary switches, the ESXi hosts, a "rogue" device, etc.? Any help would be greatly appreciated.

*Note: I am fairly certain this is an STP issue since that's what the logs point to, and that's what Meraki support says, but am certainly open to hearing if there is another issue causing the problem.
Asked by: Paul Wagner (Friend To Robots and Rocks)

I can't say that I know Meraki in depth, but I am very familiar with Cisco and networking in general.

What type/mode are your tertiary switches?

When I set up networks, I use a template for configuration purposes. One of the items in my template is to enable 'spanning-tree guard root' on all access ports. A port with Root Guard enabled cannot become a spanning tree root port. This matters because you may well be in exactly the scenario that guard is meant for. Based on your events, I'm going to guess that the connection between your two core switches is on port 47. All other ports I'll assume are the links to the tertiary switches. In this scenario, it usually means that ONLY the core switches could or should ever become the root of any VLAN. On the core switches, this would cause me to enable root guard on all ports except the ones that connect to the other core switch. On the tertiary switches, root guard would be enabled on every port except the uplinks. By doing this, you can control which ports become root ports and avoid some of these issues. Here is my access port template that I use:

interface range #ACCESSPORTSPEED#1/0/1 - #ACCESSPORTS#
switchport access vlan #ACCESSVLAN#
switchport voice vlan #VOICEVLAN#
no shut
switchport mode access
switchport nonegotiate
switchport port-security maximum 5
switchport port-security
switchport port-security aging time 5
switchport port-security aging type inactivity
srr-queue bandwidth share 1 30 35 5
auto qos voip cisco-phone
storm-control broadcast level bps 2m 1m
storm-control multicast level bps 50m 25m
storm-control action shutdown
storm-control action trap
spanning-tree guard root
ip dhcp snoop limit rate 15
no sw trunk encap dot
no sw trunk encap isl
no sw trunk native vlan
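
The #NAME# markers in the template are placeholders to fill in per switch. As a rough illustration of how such a template could be rendered, here is a small Python sketch. The `render` helper and the sample values (Gi, 48, VLANs 10 and 20) are hypothetical and are not part of my actual tooling.

```python
import re

# Matches placeholders such as #ACCESSVLAN# (uppercase letters between #s).
TOKEN = re.compile(r"#([A-Z]+)#")


def render(template, values):
    """Replace every #NAME# token; fail loudly if a value is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for #{name}#")
        return str(values[name])

    return TOKEN.sub(substitute, template)


# A shortened copy of the access port template.
template = """interface range #ACCESSPORTSPEED#1/0/1 - #ACCESSPORTS#
switchport access vlan #ACCESSVLAN#
switchport voice vlan #VOICEVLAN#
spanning-tree guard root"""

# Hypothetical values for one 48-port access switch.
values = {
    "ACCESSPORTSPEED": "Gi",
    "ACCESSPORTS": "48",
    "ACCESSVLAN": "10",
    "VOICEVLAN": "20",
}

print(render(template, values))
```

Raising on a missing value avoids pasting a half-filled template into a switch.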

Next, on all access switches I enable some default spanning tree guards.

spanning-tree mode rapid-pvst
spanning-tree portfast default
spanning-tree portfast bpduguard default
spanning-tree portfast bpdufilter default
no spanning-tree optimize bpdu transmission
spanning-tree extend system-id
spanning-tree vlan 1-4094 priority 40960

The key line above is bpduguard (spanning-tree portfast bpduguard default). This causes a port to disable if it is an access port, with portfast enabled, and receives a bpdu from another switch. The only ports that should realistically receive a bpdu are ports connected to another managed switch, and usually those should only be uplinks (or known downlinks to additional access switches). I always put uplink ports in trunk mode. This way, if anyone "accidentally" plugs in a network switch without your knowledge, the port on the access switch will disable to avoid spanning tree loops or changes.
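
To illustrate how the two guards differ, here is a deliberately simplified Python model of what happens when a BPDU arrives on a port. It is a sketch of the behavior described above, not Cisco's implementation, and the `Port` class, `receive_bpdu` function, and port names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    portfast: bool = False
    bpduguard: bool = False
    root_guard: bool = False
    state: str = "forwarding"


def receive_bpdu(port, superior):
    """Update and return port.state when a BPDU arrives.

    `superior` is True when the BPDU advertises a better root bridge
    than the one this switch currently knows about.
    """
    if port.portfast and port.bpduguard:
        # BPDU guard: any BPDU on a portfast access port shuts it down.
        port.state = "err-disabled"
    elif port.root_guard and superior:
        # Root guard: refuse to let this port become the root port.
        port.state = "root-inconsistent"
    return port.state


desk_port = Port("access-port", portfast=True, bpduguard=True, root_guard=True)
downlink = Port("downlink-to-tertiary", root_guard=True)
uplink = Port("uplink-trunk")

print(receive_bpdu(desk_port, superior=False))
print(receive_bpdu(downlink, superior=True))
print(receive_bpdu(uplink, superior=True))
```

In this model a rogue switch under a desk trips BPDU guard regardless of its priority, while root guard only reacts when a downstream device tries to take over as root.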

If your problem is a loop or a rogue switch, enabling these guards and then checking the switches for any port that went into err-disabled mode will help you hunt down the culprit. Based on your log messages, I'd be willing to guess that you have a rogue switch underneath someone's desk, or you have a loop caused by a dumb switch or two wall jacks that are conveniently connected together.

Paul Wagner (Friend To Robots and Rocks), Author, commented:

Great info!

Yes, port 47 connects between each core switch
Yes, ports 1-6 on core switches are tertiary connections
My tertiary switches are SGE2010's and SG500x's.

A few questions to clarify:
"On the tertiary switches, root guard would be enabled on every port except the uplinks"

1 - Can I have root guard turned on for the ports connecting printers, copiers and access points?
2 - The GUI port options are below for each switch. I can telnet into the SGE2010s but they only have a menu selection option. The SG500x's have a traditional Cisco IOS prompt via SSH.
[Images: SG500X port config, SGE2010 port config]
    So, you're saying to enable root guard on every access port on the tertiary switches?

"On the core switches, this would cause me to enable root guard on all ports except the ones that connect to the other core switch"

1 - What about the ports on the core switches that connect ESXi hosts, edge SIP devices, WAN/firewall uplink, etc.?
2 - I had turned root guard on in the core switches to the tertiary switch ports but still saw the problem. Does that mean the issue is coming from another device on the core switches other than the tertiary devices?

"This causes a port to disable if it is an access port, with portfast enabled, and receives a bpdu from another switch"

1 - Turn PortFast on in every access port on the tertiary switches?
Since end devices such as printers, computers, and almost all servers don't participate in spanning tree, enabling root guard should be fine.
All ports in the core can have root guard enabled except the ports that connect the two cores together. Firewalls, SIP devices, and ESX servers shouldn't participate in spanning tree. If root guard caused any port connected to these devices to disable, then I would investigate the device as it could be the root cause of your problem.
Yes, that could mean the problem is coming from somewhere other than the tertiary switches. At the least, root guard should help narrow down your focus.
You could turn on portfast on all access ports, but that alone won't have quite the same effect as the commands in my example. Based on your screenshots, I would enable root guard and bpduguard on all access ports. Any ports that err-disable because of this should be investigated, as they indicate an unknown switch in your network or a loop.
Paul Wagner (Friend To Robots and Rocks), Author, commented:
Per your suggestions, I implemented these changes:

- Root guard enabled on all access ports on tertiary switches
- BPDU Guard enabled on all access ports on tertiary switches
- Port Fast enabled on all access ports on tertiary switches
- Root guard enabled on all ports of core switches except for link on 47 and uplink

I do have a problem as a result. One of the tertiary switches can only have a single cable plugged into the core. If I plug in the redundant trunk cable, the network goes down. It seems that the Cisco switch thinks it is the root bridge and therefore has no Root or Alternate STP port. The ports are configured just like the rest of the tertiary switches. I opened a new question here: http://www.experts-exchange.com/Hardware/Networking_Hardware/Switches/Q_28669867.html