Server unresponsive at random intervals, must force reboot
Posted on 2014-10-29
I have an HP ProLiant ML350p Gen8 which freezes periodically. It has been known to freeze two days in a row, but it can sometimes go a month without crashing. It is running RDS for around 30 users.
The server is remote and rack mounted without a monitor, so I don't have physical access to it and nobody can tell me what is on the screen. I have tried using iLO 4, but I can never get the remote console to work.
After a forced reboot the server comes up and runs fine until the next hang, 1-30 days later.
Users report that the screen just freezes. When they close their RDP session and try to reconnect, it never reconnects. I do see Event 4005 in the Application log many times between the moment users get kicked and the moment I reboot. I suspect it is logged once for every attempted RDP connection. It says 'The Windows logon process has unexpectedly terminated.' I have TeamViewer on the server and it shows as online, but I cannot connect. I can browse shares on the server from another PC on the LAN, although it is extremely slow. It responds to pings without dropouts. And since it keeps logging the 4005 events, it is clearly not completely dead.
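To check whether the 4005 events really come in bursts that match each hang, the event timestamps could be grouped by time. Below is a minimal sketch in Python, assuming the Event 4005 timestamps have already been exported from the Application log as ISO-format strings. The function name `group_bursts` and the 10-minute gap are my own choices for illustration, not anything taken from the server.

```python
from datetime import datetime, timedelta

def group_bursts(timestamps, max_gap_minutes=10):
    """Group ISO-format timestamps into bursts.

    A new burst starts whenever the gap to the previous event exceeds
    max_gap_minutes (an assumed threshold). Returns a list of
    (first_event, last_event, event_count) tuples, oldest first.
    """
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    max_gap = timedelta(minutes=max_gap_minutes)
    bursts = []
    for t in times:
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)   # still within the current burst
        else:
            bursts.append([t])     # gap too large: start a new burst
    return [(b[0], b[-1], len(b)) for b in bursts]
```

Each returned tuple gives the start, end, and event count of one burst, which can then be compared against the times users reported being kicked and the times of the forced reboots.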
At times the server seems to recover on its own after 15-30 minutes, but not always. When it does recover, it has not rebooted. It just carries on as if nothing happened.
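Since the hangs sometimes clear without a reboot, it might help to record exactly when they start and stop from another machine on the LAN. The following is a rough sketch in Python, assuming the RDP listener is on the default TCP port 3389. The helper name `probe_tcp` and the 5-second timeout are assumptions. Note that a successful TCP connect only shows that the port is accepting connections, not that logons work, so this would complement the event log rather than replace it.

```python
import socket
import time

def probe_tcp(host, port=3389, timeout=5.0):
    """Attempt one TCP connection; return (connected, seconds_elapsed)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            connected = True
    except OSError:
        connected = False
    return connected, time.monotonic() - start
```

Calling this once a minute from a scheduled task and appending the result with a timestamp to a file would give a timeline of slow or failed connects to line up against the 4005 events.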
I have supplied HP with the Active Health System log and they say there are no hardware issues. All the onboard diagnostic tools show no issues.
I have installed the latest ProLiant Support Pack for the server and updated the firmware, drivers, etc. I have not taken the server offline to run a memory test, though.
My instincts tell me it's hardware, but HP say it isn't. I am at my wits' end with this and was hoping someone might be able to direct me on where I should look next. I will monitor this thread daily and supply more info if requested.
Many thanks in advance.
Some spec info:
Microsoft Windows Server 2008 R2 Standard 6.1.7601 Service Pack 1 Build 7601
Smart Array P420i in Embedded Slot (no errors in ACU) with 2x300GB RAID1 and 2x1TB RAID1