It was Friday and I had no client visits scheduled, so I planned to catch up on office work. Unfortunately, it was not to be.
A client’s network administrator called at 0830 saying his entire network was down, Internet included, and that it all started when he heard a server restart (DL360 G8, Server 2012 R2). Initial testing over the phone showed that two servers connected to the same switch could not ping each other. Rebooting that switch (HP 2910) did not resolve the issue. With the client’s network administrator in full panic mode, I agreed to go onsite to investigate.
Here’s what I found.
• The server in question was in a reboot loop: BIOS, tries to boot into Windows, goes into Windows Recovery Mode, reboots to BIOS
• Activity lights on all switches in the MDF (there are also 2 IDFs) were on solid
• Unable to ping anything from anywhere; could not even ping the switch IPs
• Their other half dozen servers were all up but unable to communicate
A quick summary of the network:
Internet > SonicWall NSA3600 > HP2910 Core Switch (Default Gateway and Router) >
> IDF1 via Fiber > HP2910s > PCs, VOIP Phones, etc.
> IDF2 via Fiber > HP2910s > PCs, VOIP Phones, etc.
> HP2910 (x4) > PCs, VOIP Phones, etc.
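As a rough illustration of why this layout should be loop-free as designed, here is a minimal Python sketch that treats the network as a list of cable links and flags any link that closes a loop. The device names (core, idf1, sw-a and so on) and the function name are made up for the example; they are not the client's real hostnames, and this is only a model of the cabling, not anything that talks to the switches.

```python
def find_redundant_links(links):
    """Return the links that close a loop, using a simple union-find.

    `links` is a list of (device_a, device_b) tuples. A link is redundant
    when both ends are already reachable from each other through earlier
    links, which is exactly the condition for a layer-2 loop.
    """
    parent = {}

    def find(node):
        # Follow parent pointers to the root, compressing the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b))  # both ends already connected: loop
        else:
            parent[root_a] = root_b   # merge the two segments
    return redundant


# Hypothetical names standing in for the layout described above.
designed = [
    ("sonicwall", "core"),
    ("core", "idf1"), ("core", "idf2"),
    ("core", "sw-a"), ("core", "sw-b"), ("core", "sw-c"), ("core", "sw-d"),
]

print(find_redundant_links(designed))
# A single device patched into the same switch twice is enough to close a loop.
print(find_redundant_links(designed + [("sw-a", "device-x"), ("device-x", "sw-a")]))
```

With the designed links alone nothing is flagged; the second call flags the second cable from the hypothetical device-x, since that cable gives frames two paths between the same pair of devices.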
There was clearly a packet storm going on, so all uplinks from the core switch to the IDFs and other switches were disconnected and the core switch was restarted. Activity on the core switch returned to normal and hosts connected to the core switch were able to ping. Switches and hosts were then brought back online until the port that caused the loop was found (a VOIP phone with 2 cables, BOTH connected to live data ports).
We then turned our attention to the DL360, which had been shut down pending the fix for the network loop. The server was restarted with no network connections and I fully expected to see it have the same issues as before when booting, but no, it started completely normally. A look at the logs showed DNS errors trying to connect to the LAN and domain (due to the packet storm) and a restart initiated by client(!). There were two instances of this: 1, the one the network admin heard, and 2, one after I did a hard restart of the system.
So the question is: how can a network loop and the resulting packet storm cause a server to restart and then be unable to boot into Windows? The server has 3 active connections: iLO, the server itself as a Hyper-V host, and one for the Hyper-V guest (2012 R2 DC).
(NOTE: Switch configuration and STP are being addressed; the issue I’m trying to understand is about the server OS and the packet storm.)
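For anyone who wants to repeat the isolation step described above, the process boils down to the sketch below. It assumes the uplinks are reconnected one at a time and checked after each one. The function name, the uplink labels, and the `storm_present` check are all hypothetical; in practice the "check" was simply watching the activity lights and trying a ping.

```python
def find_looping_uplink(uplinks, storm_present):
    """Reconnect uplinks one at a time; return the first one that brings
    the storm back, or None if every uplink comes up cleanly.

    `storm_present` is a hypothetical callback that receives the list of
    currently connected uplinks and returns True if the storm has returned.
    """
    connected = []
    for uplink in uplinks:
        connected.append(uplink)       # plug this uplink back in
        if storm_present(connected):   # lights solid / pings failing again?
            return uplink
    return None


# Simulated run: labels are placeholders, and the fake check pretends the
# loop sits behind "SW-C".
uplinks = ["IDF1", "IDF2", "SW-A", "SW-B", "SW-C", "SW-D"]


def simulated_check(connected):
    return "SW-C" in connected


print(find_looping_uplink(uplinks, simulated_check))  # prints SW-C
```

The same approach can then be repeated on the offending switch, port by port, to narrow it down to a single device.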