We have been troubleshooting an ongoing STP issue* (see bottom of post) for three weeks. It is causing disconnects across the network for phones, desktops, and Citrix connections.
All LAN end users across the board are impacted. WAN users have no issue whatsoever.
- Jittery or dropped VoIP calls
- Disconnected Citrix sessions
- Network lag on physical desktops
- Coreswitch1 misses heartbeats, so Coreswitch2 takes over as VRRP master and then flips back again
This happens every night, starting at about 2200 and lasting until about 1030. On some nights it recurs every 6 minutes like clockwork; on others it happens intermittently, every half hour or every couple of hours. The start time varies from around 2200 to as late as 0030, but the issue always stops at or around 1030 and doesn't start again until the evening.
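To tell the "every 6 minutes like clockwork" nights apart from the intermittent ones, it may help to compute the gaps between logged events. The following is only a rough sketch: the timestamp format and the sample values are made up for illustration and are not taken from the attached Meraki log, so the parsing would need adjusting to match the real export.

```python
from collections import Counter
from datetime import datetime


def gaps_in_minutes(timestamps):
    """Return the gap, rounded to whole minutes, between consecutive events."""
    # Assumed timestamp format; adjust to whatever the real log export uses.
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    return [round((b - a).total_seconds() / 60) for a, b in zip(times, times[1:])]


# Hypothetical event times, NOT real data from this network.
sample = [
    "2024-01-10 22:04:00",
    "2024-01-10 22:10:00",
    "2024-01-10 22:16:00",
    "2024-01-10 22:46:00",
    "2024-01-11 00:46:00",
]

gaps = gaps_in_minutes(sample)
# Count how often each gap length occurs; a dominant value suggests a timer or scheduled job.
print(Counter(gaps).most_common())
```

A single dominant gap would point toward something periodic (a timer, a scheduled job, a flapping link with a fixed retry), while scattered gaps would fit an event-driven cause better.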
Coreswitch1 is the STP root bridge
Coreswitch2 has 4096 priority
All tertiary switches have a priority of 32768 with RSTP turned on
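For reference, the root bridge is the one with the lowest bridge ID, which compares priority first and MAC address second. The sketch below models that comparison. The priority of 0 for Coreswitch1, all MAC addresses, and the "RogueDevice" entry are assumptions for illustration; only the 4096 and 32768 priorities come from this setup.

```python
def bridge_id(priority, mac):
    """Build a comparable bridge ID: priority first, then MAC address bytes."""
    return (priority, bytes.fromhex(mac.replace(":", "")))


def elect_root(bridges):
    """Return the name of the bridge with the lowest bridge ID."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


bridges = {
    "Coreswitch1": (0, "00:18:0a:00:00:10"),          # priority 0 is assumed
    "Coreswitch2": (4096, "00:18:0a:00:00:20"),       # MACs are hypothetical
    "Tertiaryswitch1": (32768, "00:18:0a:00:00:30"),
}
normal_root = elect_root(bridges)
print(normal_root)

# Hypothetical device advertising an equal priority and a lower MAC address.
bridges["RogueDevice"] = (0, "00:18:0a:00:00:01")
rogue_root = elect_root(bridges)
print(rogue_root)
```

If any device on the LAN ever advertises a lower bridge ID than Coreswitch1, it would win the election and force a topology change, which is one way a "rogue" device could produce this kind of symptom.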
I drew this to explain the network layout with regard to STP
- Meraki event logs show these events (xlsx attached). The logs don't contain much detail. It seems that whatever is causing the STP changes is making Coreswitch1 miss a heartbeat, so Coreswitch2 takes over as VRRP master and then flips back. I think this is more of a symptom of the STP changes than a cause.
- On the Meraki switches, I turned on STP port guard on each port that a tertiary switch connects to (e.g., Tertiaryswitch1 plugs into Coreswitch1 on port 1). After doing this, I still observed the "flipping" on Coreswitch2. This makes me think that the issue is on Coreswitch2 or a device connected to it, rather than coming from the tertiary switches. Perhaps I ran that test incorrectly?
- Disconnected all other (known) network devices on the tertiary switches like access points, printers, etc. and still observed the issue.
- Ran a packet capture on the core switches immediately after the triggering event happened. Because I started it slightly late, I captured the STP changes but not the event that triggered them. I'll run another capture tonight and try to get it started just before the issue starts.
- Meraki support asked me to turn on portfast on all of the tertiary switches. I did that today and will wait until the event timeline starts again tonight to see if it helps.
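When going through the next capture, the most useful thing to look for is probably which port first sends a BPDU with the topology change flag set. As a rough sketch, the function below decodes the header and flags byte of a raw BPDU payload (the bytes after the LLC header), following the standard 802.1D layout. The function name and the sample bytes are made up for illustration, and in practice a Wireshark display filter would likely be quicker.

```python
import struct

# Port role values carried in bits 2-3 of an RST BPDU flags byte.
PORT_ROLES = {0: "unknown", 1: "alternate/backup", 2: "root", 3: "designated"}


def parse_bpdu_flags(bpdu):
    """Decode version, type, and flags from a raw STP/RSTP BPDU payload."""
    if len(bpdu) < 4:
        raise ValueError("BPDU too short")
    _protocol_id, version, bpdu_type = struct.unpack("!HBB", bpdu[:4])
    info = {"version": version, "type": bpdu_type}
    if bpdu_type == 0x80:
        # Legacy TCN BPDU: header only, no flags byte.
        info["tcn"] = True
        return info
    if len(bpdu) < 5:
        raise ValueError("BPDU missing flags byte")
    flags = bpdu[4]
    info["topology_change"] = bool(flags & 0x01)
    info["topology_change_ack"] = bool(flags & 0x80)
    if bpdu_type == 0x02:
        # RST BPDU: the flags byte also carries port role and state bits.
        info["proposal"] = bool(flags & 0x02)
        info["port_role"] = PORT_ROLES[(flags >> 2) & 0x03]
        info["learning"] = bool(flags & 0x10)
        info["forwarding"] = bool(flags & 0x20)
        info["agreement"] = bool(flags & 0x40)
    return info


# Hypothetical RST BPDU start: protocol 0x0000, version 2, type 0x02, flags 0x3D.
sample = bytes([0x00, 0x00, 0x02, 0x02, 0x3D])
result = parse_bpdu_flags(sample)
print(result)
```

Tracing the earliest frame with `topology_change` set back to its source MAC and ingress port should narrow down whether the trigger sits behind a tertiary switch, an ESXi host, or something else.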
I've been talking to Meraki support and they haven't been able to put their finger on the issue. I don't even know where to look. Is it coming from the tertiary switches, the ESXi hosts, a "rogue" device, etc.? Any help would be greatly appreciated.
*Note: I am fairly certain this is an STP issue since that's what the logs point to, and that's what Meraki support says, but I am certainly open to hearing whether another issue is causing the problem.