DigiSec

Intermittent loss of network on HP Proliant ML350 G6

I'm baffled by this one - not sure whether it is a hardware, driver, or OS issue.

The server is an ML350 G6 with SBS2011 installed.  This is connected via NIC1 (no NIC teaming) to an HP ProCurve 2510G-24 switch.

This system has been running just fine for nearly 3 years without any major issues.  Suddenly in the last couple of months an intermittent issue has come up where the server will lose all network connectivity for seemingly no reason and require a reboot.  

The server is not locked up or crashed (BSOD), mind you.  You can log in just fine from a console and do a graceful reboot.  During that time the iLO is still responsive and the NIC within Windows shows as online and connected.

Things I've tried/looked at:

1 - Windows Event logs - there is nothing reported in any of the Windows logs around the time of connectivity loss, other than an occasional DNS resolution error (presumably because the network has dropped).  My RMM tool does log the loss of connectivity and status update failures, so I can pin down when the issue happens to within about 30 seconds.

2 - HP IML, the Integrated Management log on the iLO shows nothing - it logs the power event for the reboot and that's it

3 - Switch syslog, there are some excessive broadcasts on the network from a few of the clients that have chatty software installed, but no issues for the port that the server is plugged into (or the other port that I moved it to for testing)

4 - Windows is fully patched

5 - HP SUM (Smart Update Manager) has been run every month to get all critical and recommended system firmware/BIOS/driver updates as needed, so that is also current.

6 - I have scanned for rootkits/malware/viruses etc. numerous times using multiple tools from Sysinternals, GMER, MBAM, Sophos, and ESET, and it always comes up clean.

I want to call HP or Microsoft but I don't even have anything to give them to start debugging.  I cannot reproduce the issue, but it has happened on 11/3, 11/8, 11/15, and 12/3
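One way to get something concrete to hand to support is a small watcher running on the server itself, since the OS stays up during the outage and can keep writing to local disk. The following is only a minimal sketch, assuming Python 3 is available on the server and that some always-on device on the LAN (for example the switch's management interface) accepts TCP connections. The function names, the log file name, and the 30-second interval are all hypothetical choices, not anything from the RMM tool or HP's utilities.

```python
import socket
import time
from datetime import datetime, timedelta


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def outage_windows(samples):
    """Collapse (timestamp, reachable) samples into outage windows.

    Each window is (first_failed_sample, first_ok_sample_after). The end is
    None if the target was still unreachable at the last sample.
    """
    windows = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            windows.append((down_since, ts))
            down_since = None
    if down_since is not None:
        windows.append((down_since, None))
    return windows


def monitor(host, port, interval=30, max_samples=None, log_path="netwatch.log"):
    """Probe host:port every `interval` seconds and log only state changes."""
    last = None
    count = 0
    while max_samples is None or count < max_samples:
        ok = tcp_reachable(host, port)
        if ok != last:
            with open(log_path, "a") as fh:
                fh.write(f"{datetime.now().isoformat()} {'UP' if ok else 'DOWN'}\n")
            last = ok
        count += 1
        time.sleep(interval)
```

Calling `monitor("192.0.2.1", 80)` (a placeholder address and port) would leave a local file of UP/DOWN timestamps that survives the reboot and can be lined up against the switch syslog and the RMM alerts.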
ASKER CERTIFIED SOLUTION
Cliff Galiher
DigiSec

ASKER

That's a possibility.  For that matter I could switch to the unused NIC2 and rebind everything over there.  I believe they are physically separate controllers, not a shared controller with 2 ports.

To be fair, a reboot is the fastest, easiest way I have found to get my client back online - it means talking a non-technical person through logging into and restarting the server cleanly (really wish they would pop for the Advanced iLO license).

I can't reproduce it, so I haven't been onsite to try things like unplugging / disabling the NIC, or using Sysinternals to trace, or even perfmon to look at current network utilization.  I suppose it is possible that something is literally locking the NIC - but I don't think it's a bottleneck issue because it is not momentary - it still requires a reboot.
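Since the person at the console only has time to log in and reboot, a one-step snapshot taken just before the restart might preserve the evidence that is otherwise lost. This is a rough sketch only, assuming Python 3.7+ is installed on the server. It shells out to the standard Windows commands `ipconfig /all`, `arp -a`, `netstat -e`, and `route print` and writes their output to a timestamped file. The function names and the `netsnap` file prefix are made up for illustration.

```python
import subprocess
from datetime import datetime

# Standard Windows networking commands; adjust the list as needed.
COMMANDS = [
    ["ipconfig", "/all"],
    ["arp", "-a"],
    ["netstat", "-e"],
    ["route", "print"],
]


def run_capture(cmd, timeout=30):
    """Run one command and return its combined output, or a failure note."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return proc.stdout + proc.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed to run: {exc}>\n"


def build_report(commands, runner=run_capture, now=None):
    """Concatenate the output of each command under a labelled header."""
    now = now or datetime.now()
    parts = [f"Snapshot taken {now.isoformat()}"]
    for cmd in commands:
        parts.append(f"\n===== {' '.join(cmd)} =====")
        parts.append(runner(cmd))
    return "\n".join(parts)


def save_report(path_prefix="netsnap"):
    """Write a timestamped snapshot file and return its path."""
    now = datetime.now()
    path = f"{path_prefix}-{now:%Y%m%d-%H%M%S}.txt"
    with open(path, "w") as fh:
        fh.write(build_report(COMMANDS, now=now))
    return path
```

Wrapped in a desktop shortcut that calls `save_report()`, this would give the onsite person a single thing to double-click before rebooting. Comparing a snapshot taken during an outage with one taken while the server is healthy may show whether the interface counters, ARP table, or routes look different when the NIC goes quiet.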
SOLUTION
DigiSec

ASKER

Yeah, I think that's the plan for an emergency outage tonight.  Going to have HP diagnose the HW and switch over to the second NIC at the same time.  This will be next to impossible to test though since it is intermittent and not reproducible by me.

I will award partial points - both are good answers.
Has HP solved the problem? I have two servers with the same issue, running two different versions of Windows Server (2003, 2008). Both Proliant DL-308, one is G3, the other G4. Because of the timing of the events, I was thinking it was due to a Microsoft update. It started after an update occurred on both machines pretty close to the same time. I suppose it could be coincidental hardware failures, but it seems unlikely. It's a huge issue as one of them is our internal DNS server and it's going down daily.
DigiSec

ASKER

Interestingly enough - no.  We swapped out the motherboard to replace the NICs per HP - but got the call yesterday that the "server was down."  I could see via the iLO that the system was up, and I was able to reboot it gracefully through the iLO - but it was inaccessible on the network.

I'm re-opening the case with HP now