Steve B

asked:

Cisco 4506 switch rebooted itself

I have a Cisco 4506 chassis with four 48-port switch modules in it.  It is on a known-good UPS and has redundant power supplies.  About six weeks ago, the switch restarted itself for no known reason.  I couldn't find anything out of the ordinary; it had already come back online by the time I got to the switch room.

Today it happened again, right at 3:00 PM.  Some users reported that their Cisco PoE phones lost power, while others said their phones kept power but the display showed that the Ethernet connection was lost.  The phones that lost power were all on switch module 3.

I went into the IOS, ran sh hardware, and got this:
Cisco IOS Software, Catalyst 4500 L3 Switch  Software (cat4500e-IPBASEK9-M), Version 15.2(2)E5, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2016 by Cisco Systems, Inc.
Compiled Thu 02-Jun-16 03:28 by prod_rel_team

ROM: 12.2(44r)SG5
ph-4506 uptime is 1 hour, 0 minutes
System returned to ROM by reload
System restarted at 14:58:59 CDT Wed Sep 20 2017
System image file is "bootflash:cat4500e-ipbasek9-mz.152-2.E5.bin"
Darkside Revision 4, Nexu Revision 9, Fortooine Revision 1.40

Last reload reason: reload

My question is, what else can I do from a troubleshooting standpoint?  Is it possible that just switch module 3 in the chassis lost power while the rest of the modules stayed online?  I am having to rely on end-user reports that some Cisco PoE phones lost power and some did not.  No one else has access to the switch to reload it, so I can only assume it lost power for some reason and that "reload" is just a generic reason.  Is there a different "Last reload reason" message if the switch simply loses power?

Any pointers on figuring out what happened?
Sean

What does it show when you do a sh log?

Also, what does it say when you do a show version?
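
For example, something along these lines should pull the relevant pieces out of the local log buffer and the version output (the include patterns are just my guesses at useful keywords, so adjust them to whatever your logs actually contain):

show logging
show logging | include RESTART|RELOAD
show version | include uptime|returned|reason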
Steve B

ASKER

Syslog logging: enabled (0 messages dropped, 7 messages rate-limited, 0 flushes, 0 overruns, xml disabled, filtering disabled)

No Active Message Discriminator.

I know we have it going to SolarWinds, so I will look through there as well.
Weird. According to Cisco's definition of the line "System returned to ROM by reload", the reload had to be initiated by a user, so the switch thinks a user initiated it. Does this switch have more than one supervisor module, by chance? I used to have one with two supervisors (one a warm spare), and both were able to record logs showing remote command events. So if you can't see who (or what) initiated the reload command and it doesn't appear on one supervisor, look in the logs on the other, too.
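
A rough sketch of what I would run to check for a second supervisor and its state, plus the power status since some of your PoE phones dropped (assuming a fairly standard 4500 setup; adjust to your chassis):

show module
show redundancy
show power
show power inline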

Here are some reasons for crashes (and reloads):

https://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-software-releases-121-mainline/7957-crashes-lesscommon.html#anc19

I also seem to remember a switch that had a memory leak in one of the buffers on the primary supervisor, although that may have been on a different core switch; at the time I had 4500s and 6500s and I can't remember which one it was. It would simply run out of memory and reload itself as a failsafe rather than locking up and crashing; in that case a reload was preferable to a lockup.
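
If you want to rule out a slow memory leak, it is worth capturing these periodically and comparing the numbers over time (just standard IOS commands, nothing 4500-specific):

show memory statistics
show processes memory sorted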

Also look at these possible bugs (you'll need to be logged in with a Cisco account) and see if any of them match your conditions; the commands sketched after the list will pull the version and hardware details you'd need to compare:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCsi17158/?
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCuh49736/?
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvd05307/?
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCuu34535/?
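
To compare your setup against those bug conditions you'll want the exact IOS release and the supervisor/line-card hardware identifiers, and it's worth checking whether the supervisor left a crashinfo file behind. Something like this should do it (I believe the 4500 writes crashinfo files to bootflash, but verify that against the documentation for your supervisor):

show version | include Version
show module
dir bootflash: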
Steve B

ASKER

The response I got from Cisco is:
The supervisor engine's "Jawa" ASIC detected a parity error and sent a signal to the central CPU forcing a reload; that ASIC is the component on the board where the parity error originated.  Parity errors indicate that one or more bits in a value in memory have flipped, from 0 to 1 or vice versa, causing a disparity between the expected and actual value.  As a recovery mechanism, the system forces itself to reset.

There are two known causes of parity errors: hardware failure and transient disturbances.  Environmental factors, such as electromagnetic interference, can alter the contents of memory cells, causing what is known as a "soft" parity error.  This is uncontrollable, rare, and non-recurring.  However, it is actually a more common phenomenon than sudden hardware failure, which causes what is often referred to as a "hard" parity error.

More information on these two types of parity errors can be sourced within the following document:

http://www.cisco.com/en/US/products/hw/routers/ps341/products_tech_note09186a0080094793.shtml#softvshard

If this is the first time in the recent past that the switch has crashed, I suggest we monitor it.  If this were a true hardware fault, the system would inevitably attempt to access the corrupt memory cells again, leading to another crash in the very near future.

However, if it continues to operate smoothly for a day or so, you should feel very confident that this was a transient issue that will not be seen again.
Wow, that is fascinating and a little annoying. A single-bit parity error instigates a spontaneous reload, and then the syslog reports that it was ordered by a person? That's really sloppy design for this ASIC. They could at least have referenced in the error what might have happened and what should be done to test for further problems.

What is the environment like in the physical location? Got a cell phone tower casting a shadow over the data center, or an electrical substation next door? Or did anyone set a powerful magnet directly on top of the device? I would guess not, but it can't hurt to consider it. And they're right: if it never happens again, then something really weird happened that isn't likely to recur. They just can't tell you what. LOL! Such a Cisco answer. But thanks for getting back to us.

Has anything happened again since your initial question? Any new reloads that would make you think you have a hardware problem?
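
In the meantime, if you want to keep an eye on it, something like this run once a day should surface any new parity or reset events (the include pattern is only a guess at the keywords your syslog would record, so tune it to what you actually see):

show logging | include Parity|parity|reset|reload
show version | include uptime|reason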
Steve B

ASKER

We do have construction work going on, but nothing that should introduce interference.  It has happened twice in four months.  No one but IT has access to the switch closet this equipment is in, but I guess I can't rule out something environmental.  It is a climate-controlled closet with UPS and circuits on generator.  Given the remodeling and construction work, I suppose that is a possible source.

I have been checking the syslogs daily (logging is set to debug) and nothing unusual is happening.  I see normal things like Cisco Prime logging in to get backup configs and such.  No reloads ... knock on wood.
ASKER CERTIFIED SOLUTION
Jane Updegraff