A coworker and I recently ran into a problem where one of our Windows 2003 servers dropped off the network for no apparent reason. The server was up and running happily, with nothing visibly wrong, except that it was completely unresponsive on the network --
exactly as if a network cable were unplugged. The server has a pair of on-board Broadcom NICs with each one connected to a separate switch in a two-switch stack and teamed for load-balancing and failover.
The server had dropped off the network sometime during the previous day. We poked around in the event logs from around that time, but found nothing related to networking. Of course, the patch cables were fine, and it was highly unlikely that both switches, Cisco 3750s, were faulty. Our network engineering team checked the port configuration on each switch, and not only were the ports configured correctly, but the team's laptop worked fine when configured with the same IP address and connected to the same ports.
It was evidently not a network problem, so our focus turned back to the server itself. We uninstalled the NIC teaming software and configured one of the NICs with a proper IP address. No cigar. Perhaps the TCP/IP stack had somehow become corrupt? So we reset TCP/IP by executing the following command:
This required a reboot, and as we stared impatiently at the POST progress, we grew hopeful that our six-hour ordeal might be over.
Wrong. After all that, we still had made absolutely no progress at identifying the cause. Hmm... these two NICs are on-board, so they likely share a single controller. If the controller went bad, both NICs might be affected, right?
Our next move was to install a brand-new dual-port Intel NIC in an unused PCI Express slot. We disabled the on-board NICs in the BIOS setup menu and fired up the server to reconfigure teaming. Once everything was configured, we cracked open a command prompt to ping the gateway, but were met again with the same, familiar disappointment. Now what? How could this be?
To recap: we'd verified the switch stack was not misconfigured. We'd updated the drivers and teaming software for the on-board Broadcom NICs. We'd removed the teaming and reverted to a single-NIC configuration. We'd reset the TCP/IP stack. We'd even installed a whole new NIC from another manufacturer! And after all that, we still had the exact same problem! We'd also started checking for simple things, like making sure the firewall wasn't enabled (but even if it were, I wouldn't expect it to cause the server to "unplug" itself from the network).
By now, four engineers were scratching their heads. I started grasping at straws and decided to throw Service Pack 2 on there -- a desperate move, but the server needed it anyway. Meanwhile, my coworker turned to every IT professional's best friend: Google. What he searched for, I don't know, but he landed on a Knowledge Base article from Microsoft.
This article describes a problem with the IPSec MMC, so at first glance, it was an unrelated issue. However, in it we would stumble upon our solution.
The resolution section of the article identifies a registry key to delete.
However, the IPSec\Local keys didn't even exist. Okay, so we're definitely on the wrong track, right? Not so fast. We continued on to step two, which offered a command to run to rebuild the local policy store.
We threw this into a command prompt, slapped the [Enter] key, and rebooted... AND EUREKA, WE HAD EMERGED VICTORIOUS!
So what happened to cause the corruption? We cannot be certain. But we did learn a valuable lesson: if everything seems right, but your server acts as if the network cable is unplugged, a missing or corrupt IPSec policy might be the cause. While I'd never run into this before in my nine years of server administration, this is definitely one of those situations that I will remember.
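If I were doing this again, one quick check I'd add early on is whether the IPSec service is actually running. Here is a minimal sketch of that check in Python, not something we ran at the time. It assumes the service's short name is PolicyAgent (displayed as "IPSEC Services" on Windows 2003, as far as I know) and that `sc query` prints a `STATE` line in its usual layout. The function names are my own, purely for illustration.

```python
import subprocess


def parse_service_state(sc_output):
    """Return the state name (e.g. 'RUNNING') from `sc query` output, or None."""
    for line in sc_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "STATE":
            # The value normally looks like "4  RUNNING"; take the last token.
            parts = value.split()
            return parts[-1] if parts else None
    return None


def ipsec_service_state(service="PolicyAgent"):
    """Ask `sc query` for the service state; the service name is an assumption."""
    try:
        proc = subprocess.Popen(
            ["sc", "query", service],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        out, _ = proc.communicate()
    except OSError:
        # `sc` is not available (for example, not running on Windows).
        return None
    return parse_service_state(out)


if __name__ == "__main__":
    state = ipsec_service_state()
    if state == "RUNNING":
        print("IPSec service is running.")
    else:
        print("IPSec service state: {0} -- worth a closer look.".format(state))
```

A stopped or missing service doesn't prove the policy store is corrupt, but it is a cheap thing to rule out before swapping hardware.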