Packet storm causes a server to restart and be unable to boot

It was Friday and I had no client visits scheduled, so I could get caught up with office work. Unfortunately, it was not to be.

The client’s network administrator called at 0830 saying his entire network was down, as well as the Internet, and that it all started when he heard a server restart (a DL360 G8 running Server 2012 R2). Some initial testing by phone showed that two servers connected to the same switch could not ping each other. Rebooting that switch (an HP 2910) did not resolve the issue. With the client’s network administrator in full panic mode, I agreed to go onsite to investigate.

Here’s what I found.
• The server in question was in a reboot loop: BIOS, attempt to boot into Windows, drop into Windows Recovery Mode, reboot to BIOS.
• Activity lights on all switches in the MDF (there are also 2 IDFs) were on solid.
• Unable to ping anything from anywhere, not even the switch IPs.
• Their other half dozen servers were all up but unable to communicate.


A quick summary of the network

Internet > SonicWall NSA3600 > HP 2910 Core Switch (Default Gateway and Router) >
    > IDF1 via Fiber > HP 2910s > PCs, VoIP Phones, etc.
    > IDF2 via Fiber > HP 2910s > PCs, VoIP Phones
    > HP 2910 (x4) > PCs, VoIP Phones, etc.


There was clearly a packet storm going on, so all uplinks from the core switch to the IDFs and other switches were removed and the core switch was restarted. Activity on the core switch returned to normal and hosts connected to the core switch were able to ping. Switches and hosts were then brought back online one at a time until the port that caused the loop was found (a VoIP phone with 2 cables, BOTH connected to live data ports).

We then turned our attention to the DL360, which had been shut down pending the fix for the network loop. The server was restarted with no network connections and I fully expected to see it have the same boot issues as before, but no, it started completely normally. A look into the logs showed DNS errors trying to connect to the LAN and domain (due to the packet storm) and a restart initiated by client(!). There were two instances of this: 1, the one the network admin heard, and 2, one after I did a hard restart of the system.

So the question is: how can a network loop and the resulting packet storm cause a server to restart and then be unable to boot into Windows? The server has 3 active connections: iLO, the server itself as a Hyper-V host, and one for the Hyper-V guest (2012 R2 DC).
(NOTE: Switch Configuration and STP are being addressed, the issue I’m trying to understand is about the Server OS and the packet storm)
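
The isolation approach described above (pull every uplink, then reconnect them one at a time until the storm comes back) can be sketched roughly as follows. This is only an illustration in Python: `find_looping_uplink`, `reconnect` and `storm_present` are hypothetical names, and in the real event the "check" was a person watching activity lights and running pings, not a script.

```python
from typing import Callable, Iterable, Optional


def find_looping_uplink(
    uplinks: Iterable[str],
    reconnect: Callable[[str], None],
    storm_present: Callable[[], bool],
) -> Optional[str]:
    """Reconnect uplinks one at a time and return the first one
    after which the storm reappears, or None if it never does."""
    for uplink in uplinks:
        reconnect(uplink)      # bring one uplink (or host) back online
        if storm_present():    # e.g. lights solid again, pings failing
            return uplink      # the loop sits behind this uplink
    return None


if __name__ == "__main__":
    # Simulated network: the loop is assumed to be behind "IDF2".
    connected = set()
    culprit = find_looping_uplink(
        ["IDF1", "IDF2", "SW3"],
        reconnect=connected.add,
        storm_present=lambda: "IDF2" in connected,
    )
    print(culprit)
```

The same one-at-a-time search can then be repeated on the ports of the offending switch to narrow the loop down to a single device.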
gwa60060 Asked:

Bembi (CEO) Commented:
So, as I understand from your report, the source has been found and STP port blocking is being dealt with. Now my question:

Is the affected server still not running, or do you just want a possible reason why the server didn't boot while the packet storm was going on?

Windows has its own flood protection. I would also have the physical NIC and the power supply in mind.
One possibility is that the server ran into an undefined state because the flood protection reacted during boot and prevented services from starting properly. If one of the services cannot handle such a situation, it may throw an exception. Such an exception should not usually result in a crash and reboot; it is not impossible, but it is rare. It is possible, though, that an essential device driver could not handle the situation and that this resulted in a reboot. I would have the NIC drivers in mind.

A second option would be the NIC hardware. During a packet storm the NIC has to handle a lot of traffic, so possibly the NIC produced an exception that forced the server to reboot.

And a third option is simply the power supply. During boot a lot of devices are starting, and if the NIC is also being flooded at that moment, the system may draw more power than the power supply can deliver. This possibility is worth considering if the power supply is already near its limit.

So, if the server starts normally after eliminating the source of the packet storm, I would investigate the power supply (check that the voltages are correct and stable). Check the NIC drivers as well; a load simulation test may show whether the server is stable at all. A server may run fine for weeks, but a more extreme situation may push it past its remaining margins.
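
As a rough illustration of the "correct and stable" check, here is a minimal Python sketch that tests whether a series of voltage readings stays within a tolerance band around the nominal value. The function name `rail_is_stable`, the sample readings and the 5% default tolerance are all assumptions made for the example; the actual limits should come from the power supply's specification, and the readings from whatever monitoring the server offers.

```python
from typing import Sequence


def rail_is_stable(
    samples: Sequence[float],
    nominal: float,
    tolerance: float = 0.05,  # assumed band of +/-5%; check the PSU spec
) -> bool:
    """Return True if every reading lies within nominal +/- tolerance."""
    if not samples:
        raise ValueError("need at least one voltage sample")
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    return all(low <= v <= high for v in samples)


if __name__ == "__main__":
    # Hypothetical 12 V readings taken while the server is under load.
    print(rail_is_stable([12.1, 11.9, 12.0], nominal=12.0))
    # One reading sags to 11.2 V, outside the assumed band.
    print(rail_is_stable([12.1, 11.2, 12.0], nominal=12.0))
```

Sampling while a load test is running matters more than sampling at idle, since a marginal supply may only sag under peak demand.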

In the end, it is hard to say what the real reason was, as you cannot look into the hardware. Sometimes the point of failure gives you a hint. During a reboot the event logs are sometimes not really helpful: they help with software causes (which in theory should not happen), but what the hardware is doing is mostly not written to the logs.
gwa60060 (Author) Commented:
Thanks for the comments. Hard to say what the root cause was without actually trying to recreate the event.
Configterm Commented:
If you have support on the unit, I would open a ticket and troubleshoot now, in case this returns after support expires. Most vendors will not cover software failures, so if they can't find a hardware issue they may ask you to reload the OS.

Is the environment cool and low in dust?

I would even swap out the power cord if it has been moved or reused a lot. Test the patch cable to the switch for any wire damage or shorts.

If there were no configuration changes at the time, you are better off looking at the physical level.