Cisco - Trunks Going Down

We have four Cisco 3560 PoE switches set up at a site across town, and about once or twice a week they lose connectivity.
If I go into each switch, make a change to the trunks, and save it, they all seem to come back up.

Switch 1 is the switch with the main uplink to our ISP, on interface Gi0/1. One thing I noticed when looking at spanning tree is that Gi0/1 is not the root port for VLAN 1; instead, the root port is Fa0/1. Fa0/1 is just a trunk to another switch in the rack, which in turn trunks to another switch.
Gi0/1 on Switch 1 is the root port for every VLAN except VLAN 1. Since Gi0/1 faces our ISP, which has a switch that is directly connected to the datacenter at my location where the root bridge is, shouldn't Gi0/1 be the root port for VLAN 1 as well?

Correct me if I'm wrong, but if there are updates going out on VLAN 1, wouldn't those cause a routing loop if they were just being sent out Fa0/1 (since this is the root port on Switch 1) to the other switches in the same rack?

I'm not a networking expert by any means, so let me know if you need any other info or if this isn't causing an issue at all.

Attached is a quick diagram that shows the trunked ports and how the switches are connected to each other.

Spanning tree will not cross a layer-3 boundary, so it's likely that you have a root bridge at each location.

Spanning tree is pretty good at preventing forwarding loops, but if you don't properly select your root bridge then you can get some sub-optimal traffic patterns.
Winsoup (Author) commented:
So if Fa0/1 is my root port at this site, does that mean my root bridge is either the top-left switch or the bottom-left switch? If so, should I change the trunk going from Switch 1 (top right) to the bottom-right switch so that it comes from the bottom-left switch instead, or doesn't that matter?
The root bridge will have no root ports, only designated ports.
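A quick way to see this per VLAN is `show spanning-tree vlan 1` on each switch: on the root bridge the Root ID section says the bridge is the root, and none of its ports has the Root role. A minimal sketch (the `Switch1` hostname is illustrative and output is omitted):

```
! Root ID section plus any port whose role is Root for VLAN 1;
! on the root bridge expect "This bridge is the root" and no Root port
Switch1# show spanning-tree vlan 1 | include root|Root
```

Repeating it on each switch at the site shows which one has won the VLAN 1 election.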

Winsoup (Author) commented:
Is that for VLAN 1 only? If so, the only switch that doesn't have root ports for VLAN 1 is the top-left switch.
Winsoup (Author) commented:
Hopefully this will help. I think a loop is being caused in there somewhere, because it takes the network down until I shut/no shut the trunks or restart the switches.
Diagram 2 (attached)
All trunks are configured like this:
switchport trunk encapsulation dot1q
switchport mode trunk
Based on that diagram, you can't have any forwarding loops.
Your SW2 is the root bridge (as asavener already said, on the root bridge all ports are designated ports). Did you check your interfaces for more information? Did you also check the logs?
#sh interface Gi0/1
Check whether there is anything interesting there.
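A couple of exec commands often used for this on a 3560, as a sketch (hostname illustrative, output omitted). Rising input/CRC errors or interface resets on the uplink could point at the physical link rather than at spanning tree:

```
! Error and reset counters for the uplink only
Switch1# show interfaces gi0/1 | include errors|CRC|resets
! Error counters for every port, useful for spotting a flapping trunk
Switch1# show interfaces counters errors
```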

Is your interface going into the err-disabled state?
If so, you can set a timer for automatic errdisable recovery, but you still need to investigate the cause.
Next time, before you shut/no shut the interface, check its status to see whether it was err-disabled:
#sh interface gi0/1 status
Port    Name               Status       Vlan       Duplex  Speed Type
Gi0/1                      err-disabled 100          full   1000 1000BaseSX

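If the port does turn out to be err-disabled, the recovery timer is a global setting. A minimal sketch, assuming you want every err-disable cause re-enabled after 300 seconds (both the `all` cause list and the interval are choices, not requirements; the hostname is illustrative):

```
! List ports that are currently err-disabled and the reason
Switch1# show interfaces status err-disabled
Switch1# configure terminal
! Re-enable err-disabled ports automatically after a timeout
Switch1(config)# errdisable recovery cause all
Switch1(config)# errdisable recovery interval 300
Switch1(config)# end
! Confirm which causes are enabled and the timer value
Switch1# show errdisable recovery
```

Automatic recovery only hides the symptom, so it is worth noting the reason column from the first command before the timer brings the port back.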

How often are the ports becoming unresponsive?
Winsoup (Author) commented:
It happens once or twice a week. The last time was yesterday, and the time before that was exactly a week ago. Before that it was Thursdays or Fridays, so it's not consistent when it happens.

Predrag, I will make sure I check the port status next time it happens. Unfortunately I don't have much time to spend troubleshooting, because it takes down our call center.
What steps do you take for recovery?
Winsoup (Author) commented:
I either make a change to the trunk and write it, or I just do a shut/no shut on the trunks, and it comes back up.
look at "show span sum" on sw 1 and 2 when the problem recurs. any blocked ports in any vlan indicate a spanning tree loop problem. if not, then the problem lies elsewhere.
havsw2#sh spann sum
Switch is in pvst mode
Root bridge for: VLAN0021, VLAN0051-VLAN0052
Extended system ID           is enabled
Portfast Default             is disabled
PortFast BPDU Guard Default  is disabled
Portfast BPDU Filter Default is disabled
Loopguard Default            is disabled
EtherChannel misconfig guard is enabled
UplinkFast                   is disabled
BackboneFast                 is disabled
Configured Pathcost method used is short

Name                   Blocking Listening Learning Forwarding STP Active
---------------------- -------- --------- -------- ---------- ----------
VLAN0001                     0         0        0          3          3
VLAN0010                     1         0        0          7          8
VLAN0014                     0         0        0          6          6
VLAN0015                     0         0        0         28         28
VLAN0016                     0         0        0          3          3
VLAN0020                     0         0        0         32         32
VLAN0021                     0         0        0          3          3
VLAN0030                     0         0        0          5          5
VLAN0035                     0         0        0          3          3
VLAN0036                     0         0        0          3          3
VLAN0050                     0         0        0          4          4
VLAN0051                     0         0        0          4          4
VLAN0052                     0         0        0          3          3
---------------------- -------- --------- -------- ---------- ----------
13 vlans                     1         0        0        104        105


is there any useful info in the logs of sw 1 or 2 at the time of the problem?
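Filtering the log buffer on the relevant message types is a quick way to answer that. A sketch, assuming the interesting messages are the `UPDOWN` and `SPANTREE` families (hostname illustrative). The timestamp line is optional, but it makes it possible to line up events across switches, which the log excerpts posted in this thread do not show:

```
! Only link, line-protocol and spanning-tree messages
Switch1# show logging | include UPDOWN|SPANTREE
! Optional: add date/time with milliseconds to each log message
Switch1# configure terminal
Switch1(config)# service timestamps log datetime msec localtime
Switch1(config)# end
```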

it seems you believe the link to the isp is a trunk. if that's so, the isp site switch should be the root bridge for all vlans. but it should not matter unless you have a layer 2 loop somewhere. if the above output indicates a blocked port, try "show span vlan xxx block", where xxx is the vlan with blocked ports, to work out which interface. then go follow the cabling; "sh cdp neigh" might help if the offending devices are cisco.
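As an illustration using the summary above, where VLAN0010 has one port in the Blocking column: a blocked port means spanning tree is breaking a redundant path in that VLAN. A sketch of finding it on havsw2, where `fa0/1` in the second command is only a placeholder for whatever port the first command reports:

```
! Ports in VLAN 10 whose state is BLK (blocking)
havsw2# show spanning-tree vlan 10 | include BLK
! Neighbor on that port (fa0/1 is a placeholder, substitute the reported port)
havsw2# show cdp neighbors fa0/1 detail
```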
Winsoup (Author) commented:
The link to the ISP is trunked to a switch that is owned by us at the ISP and is directly connected to our core switch at our data center location.
The logs at the time of the incident show line protocol down on the interfaces. I'll take a screenshot next time so I can be certain how many, and which ones, are going down.

All of our devices are Cisco devices so that makes it easier.

Excellent info though and I will try that next time it happens. (Not that I want it to happen again)  :-)
Winsoup (Author) commented:
This happened again, which makes 3 Tuesday mornings in a row.
There weren't any blocked ports when I checked show span sum.

However, this is new in the logs from this morning.

%SPANTREE-5-ROOTCHANGE: Root Changed for vlan 30: New Root Port is FastEthernet0/45. New Root Mac Address is 001a.2ff1.6d80
%SPANTREE-5-ROOTCHANGE: Root Changed for vlan 5: New Root Port is FastEthernet0/45. New Root Mac Address is 001a.2ff1.6d80
%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to up
%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to up
%SPANTREE-5-ROOTCHANGE: Root Changed for vlan 81: New Root Port is GigabitEthernet0/1. New Root Mac Address is 0006.d71b.e617
%SPANTREE-5-ROOTCHANGE: Root Changed for vlan 70: New Root Port is GigabitEthernet0/1. New Root Mac Address is 0006.d71b.e614
%SPANTREE-5-TOPOTRAP: Topology Change Trap for vlan 1

So I'm guessing that after these topology changes it's taking a while to converge?
Why is the root changing every week, and what do I need to look for?
Your topology should not change without a cause, so you should investigate this.
To start, identify the devices 001a.2ff1.6d80 and 0006.d71b.e617.
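One common way to track a bridge MAC down is to follow the root port hop by hop: check which port leads toward the root, look up the CDP neighbor on that port, and repeat on the neighbor until you reach the switch whose base MAC matches. A sketch using VLAN 30 and Fa0/45 from the log above, assuming every hop is a Cisco switch with CDP enabled (`Switch1` and `Neighbor` are illustrative hostnames):

```
! Root ID (address) and root port for VLAN 30 as this switch sees it
Switch1# show spanning-tree vlan 30
! Device connected to the root port named in the log message
Switch1# show cdp neighbors fa0/45 detail
! On that neighbor: compare its base MAC with 001a.2ff1.6d80
Neighbor# show version | include Base ethernet
```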

On the switch that you want to be the root bridge, issue
# spanning-tree vlan XX root primary
for each VLAN (the root does not have to be the same switch for every VLAN).
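For reference, `root primary` is a macro rather than a fixed value: it sets the priority for those VLANs to 24576, or to 4096 below the current root's priority if that root is already lower. A sketch of pinning VLAN 1 and others to one switch with a backup, where the `Core`/`Core2` hostnames and the VLAN list are illustrative:

```
! Preferred root for these VLANs
Core(config)# spanning-tree vlan 1,10,20 root primary
! Backup root (priority 28672) on a second switch
Core2(config)# spanning-tree vlan 1,10,20 root secondary
! Alternative to the macro: an explicit priority, in multiples of 4096
Core(config)# spanning-tree vlan 1 priority 4096
! Check the bridge priority now in effect for VLAN 1
Core# show spanning-tree vlan 1 bridge
```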
There are other mechanisms to protect the topology. For example, you can configure root guard on all ports that you don't want to become root ports (if a superior BPDU arrives on such a port, the port is put into the root-inconsistent, blocking, state until the superior BPDUs stop). One possible cause of topology changes could be your ISP, so you could configure BPDU filter on the WAN port facing the provider.
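A sketch of the two interface-level features, assuming `fa0/2` is a hypothetical downstream port that should never lead toward the root and `gi0/1` faces a device that is genuinely outside your spanning tree. BPDU filter at interface level stops the port from sending or processing BPDUs, so on a link into your own switched network it removes loop protection on that link:

```
Switch1(config)# interface fa0/2
! Put this port in root-inconsistent (blocking) state if a superior BPDU arrives
Switch1(config-if)# spanning-tree guard root
Switch1(config-if)# exit
Switch1(config)# interface gi0/1
! Neither send nor process BPDUs on this port (only toward non-STP devices)
Switch1(config-if)# spanning-tree bpdufilter enable
Switch1(config-if)# end
! Ports currently blocked by root guard
Switch1# show spanning-tree inconsistentports
```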

(See Cisco's documentation on optional STP features.)
Winsoup (Author) commented:
Another switch was put in over the weekend at that location. That's the reason for the topology change, so that one was a false alarm.

Back to the original issue. I've been checking VTP and STP because I'm still leaning toward one of them as the cause. All of the switches there are in VTP client mode; would it be best to just put them in transparent mode?
That's a good one.
But anyway, if you have not manually configured the root bridge until now, you should do it.
It would be better if the switches were in transparent mode (it is more secure), but client mode is also fine as long as no other switch is in server mode (server mode is the default, btw).
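If you do go the transparent route, it is a one-line global change per switch. A sketch (hostname illustrative); note that a transparent switch keeps its own VLAN database, so VLANs must then be created locally on each switch, and it is worth confirming they are all present afterwards:

```
Switch1# configure terminal
! Stop synchronizing the VLAN database from the VTP server
Switch1(config)# vtp mode transparent
Switch1(config)# end
! Operating mode should now read Transparent; are the VLANs still present?
Switch1# show vtp status
Switch1# show vlan brief
```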
Winsoup (Author) commented:
I only have one switch in server mode, and that's the core switch, but I do see that the configuration revision number is 48 on all switches, whether client or server. Should I reset the clients back to 0?
No, that's OK. The revision number will always be the same on all devices that are not in transparent mode.
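The revision number matters mostly when a switch joins the domain: in VTP versions 1 and 2, a switch with the same domain name and a higher revision can overwrite the VLAN database on the others, even if it is in client mode. A common precaution, sketched here for a hypothetical `NewSwitch`, is to reset its revision to 0 by toggling through transparent mode before connecting it:

```
! Compare domain name, operating mode and configuration revision
Switch1# show vtp status
! On a new switch before it is connected: toggling resets the revision to 0
NewSwitch(config)# vtp mode transparent
NewSwitch(config)# vtp mode client
```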
Winsoup (Author) commented:
Another question... I've been trying to find anything different on these switches to narrow things down. I noticed that VLAN 1 is shut down on 4 of the 5 switches. The one where it's not shut down is the switch that has the uplink to the ISP. I've heard this can cause loops if VLAN 1 traffic is carried on some switches but not others. Is this true?
Keep in mind that loops are possible only if you have two different paths to the same location in your network, and from the picture you gave I don't see a way for that to happen.

I suspect that your port went into the err-disabled state, maybe because the interface is flapping, or something else. One way to recover from the err-disabled state is to shut/no shut the port that is err-disabled.

On the other hand, I have seen loops when VLAN 1 (the native VLAN) was removed from switches, but I don't expect that is the problem here.
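It may help to separate two things here: `shutdown` under `interface Vlan1` disables only the layer-3 (SVI) interface for VLAN 1; the VLAN itself keeps switching frames and running spanning tree. A sketch of checking both layers (hostname illustrative, output omitted):

```
! Layer 3: the SVI (administratively down if it has been shut)
Switch1# show interfaces vlan 1
! Layer 2: VLAN 1 status and its access ports
Switch1# show vlan id 1
! Spanning tree for VLAN 1 runs regardless of the SVI state
Switch1# show spanning-tree vlan 1
```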
Winsoup (Author) commented:
We've had some networking consultants working on this issue with us, and we currently have our ISP looking into it. It happened again today, which makes the 4th Tuesday morning in a row.
We did notice that VLAN 1 on SW1 in the diagram above is not receiving any BPDUs, which is what prompted the consultants to have me check with our ISP for any issues.

If it's not receiving BPDUs on VLAN 1, that means it's not receiving any STP updates at all, right?

Does anyone have any other things I can check while I'm waiting to hear back from the ISP?

Port 1 (GigabitEthernet0/1) of VLAN0001 is designated forwarding
   Port path cost 4, Port priority 128, Port Identifier 128.1.
   Designated root has priority 32769, address 001a.2ff1.6d80
   Designated bridge has priority 32769, address 001c.b0d6.2a00
   Designated port id is 128.1, designated path cost 38
   Timers: message age 0, forward delay 0, hold 0
   Number of transitions to forwarding state: 2
   Link type is point-to-point by default
   BPDU: sent 88987, received 0

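The `received 0` counter is easier to interpret when both ends of the link are compared. A sketch, assuming the far end of Gi0/1 is the switch at the ISP building; `IspSwitch` and `gi0/24` are hypothetical names for that switch and its port:

```
! Near end: run twice a few seconds apart; "sent" should climb, "received" stays 0
Switch1# show spanning-tree vlan 1 interface gi0/1 detail | include BPDU
! Far end: is VLAN 1 allowed and forwarding on the trunk?
IspSwitch# show interfaces gi0/24 trunk
! Far end: is it sending (and receiving) VLAN 1 BPDUs on that port?
IspSwitch# show spanning-tree vlan 1 interface gi0/24 detail | include BPDU
```

If the far end has no VLAN 1 spanning-tree instance on that port at all, that alone would explain the zero counter.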

Winsoup (Author) commented:
The switch at our ISP is just another switch in our network, nothing special, so we should be getting BPDUs from it. I should have explained that better: where the diagram says the link goes to the ISP, it really just goes to their building, not to one of their switches.
All of our sites are connected with direct fiber; they just meet at our ISP at the switch we have set up there, and go out to our sites from there for our LAN.

We have a root bridge set, which is our core switch at our datacenter. The problem is that the site that keeps going down is not receiving that update, because it's not getting the traffic it needs on VLAN 1, so it still has its own root bridge assigned when it shouldn't.
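One detail in the VLAN 1 output above fits this: the designated root has priority 32769, which is the default 32768 plus the VLAN ID (1) that the extended system ID adds. That suggests the bridge winning the VLAN 1 election at this site has the default priority, i.e. it won on lowest MAC address rather than by configuration. A sketch of comparing what each side considers the root, per VLAN (hostnames illustrative):

```
! On the core: root ID and root port for every VLAN
Core# show spanning-tree root
! On Switch 1: the same table; a VLAN 1 root ID that differs from the core's
! means the VLAN 1 spanning tree at this site is separate from the core's
Switch1# show spanning-tree root
```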

Also, a switch has been added since the first diagram, between SW2 and SW3, and that is the switch being listed as the root bridge for VLAN 1.

They are daisy-chained that way because we lack the SFPs to uplink them all to Gi ports. This is a new site, so we don't have everything we need yet; we just needed to get it up and running. That isn't important to this issue anyway, as those links will just auto-negotiate at 100.
if there is no vlan1 at the other end of that link then it won't receive bpdus for that vlan, so this is not necessarily a problem. also, if there is no other switch upstream from a port then it won't receive any bpdus. in this case the isp switch should be sending bpdus to sw1 on all the working vlans if things were normal. if vlan1 is unused it can be pruned from the trunks by using "switchport trunk allowed vlan" to only permit the working ones.
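A sketch of pruning VLAN 1 from one trunk (hostname and interface illustrative); the same change would be needed on the far end so both sides agree. The `remove` keyword edits the existing allowed list, whereas `switchport trunk allowed vlan` followed by a bare list replaces it. Even with VLAN 1 removed, Cisco switches keep sending control traffic such as CDP, VTP and DTP in VLAN 1 on the trunk.

```
Switch1(config)# interface gi0/1
! Drop VLAN 1 from the allowed list, keep everything else
Switch1(config-if)# switchport trunk allowed vlan remove 1
Switch1(config-if)# end
! Confirm the allowed and active VLAN lists for the trunk
Switch1# show interfaces gi0/1 trunk
```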

it might be worth verifying that each end of every link has consistent trunk and span tree config, including which vlans are allowed and which, if any, is the native vlan on each trunk. also ensure that the same span tree mode is used everywhere (it looks like it will be pvst, i.e. per vlan).
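A sketch of commands whose output can be compared side by side from both ends of each trunk (hostname illustrative, output omitted):

```
! Mode, encapsulation, native VLAN and allowed VLANs for all trunks
Switch1# show interfaces trunk
! Administrative/operational mode and native VLAN for one port
Switch1# show interfaces gi0/1 switchport | include Mode|Native
! Spanning-tree mode, e.g. "Switch is in pvst mode"
Switch1# show spanning-tree summary | include mode
```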

from the diagram, sw1 might be the more appropriate place to set the root rather than at the isp (the previous comment indicated that the isp switch was set to root).

it is worth considering what the isp might have connected to your network. if there is another switch somewhere which is not yours and which has a vlan 1 on it, that might trigger this problem. any isp switch may have nothing to do with your environment but just be part of the way they deliver services to you. it could come from a trunk or access port at your isp switch.

you indicated that the isp site aggregates fibre services, which suggests the diagram may be incomplete? so maybe your isp switch is the appropriate root after all? and, if there are other sites, do they all see topology changes at the same time, even if they don't result in a disabled interface? if so, why do they not have the same problem as this site, i.e. what's different about your isp switch ports and the first port at each site compared to the problem site?
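Each switch keeps, per VLAN, a count of topology changes, how long ago the last one happened and which port it arrived on, which makes it possible to compare sites. A commonly used filter, shown for VLAN 1 as an example (hostname illustrative):

```
! Per VLAN: number of topology changes, time since the last one, and the port it came from
Switch1# show spanning-tree vlan 1 detail | include ieee|occurr|from
```

If several sites show a change at roughly the same time, with the "from" port pointing back toward the switch at the ISP building, that would suggest the source is upstream of this site.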

Winsoup (Author) commented:
This is finally resolved! Ended up being an issue with our ISP. Still waiting to hear exactly what they did. Will post what they did when I hear more.

Thank you all for the time and replies!