Server unresponsive at random intervals, must force reboot

I have an HP ProLiant ML350p Gen8 which freezes periodically. It has been known to do it two days in a row, but it can sometimes go a month without crashing. It is running RDS for around 30 users.

The server is at a remote site and is rack mounted without a monitor, so I don't have physical access to it and nobody can tell me what it says on the screen. I have tried using iLO 4, but I can never get the remote console to work.

After a forced reboot the server comes up and runs fine until the next hang, 1 to 30 days later.

Users report that the screen just freezes. When they close their RDP session and try to reconnect, it never reconnects. I do see Event 4005 in the Application log many times between when the users get kicked and when I reboot. I suspect it is logged once for every time someone tries to connect via RDP. It says 'The Windows logon process has unexpectedly terminated.' I have TeamViewer on it and it shows as online, but I cannot connect. I can browse shares on the server from another PC on the LAN, but it is extremely slow. It responds to pings without dropouts. And of course it is still logging the 4005 events, so it is not completely dead.

At times, the server seems to self recover after 15-30 minutes but not always. When it does self recover, it has not rebooted. It just seems to go on as if nothing happened.

I have supplied HP with the Active Health System log and they say there are no hardware issues. All the onboard diagnostic tools show no issues.

I have installed the latest ProLiant Support Pack for the server and updated firmware, drivers, etc. I have not taken the server offline to run a memory test, though.

My instincts tell me it's hardware, but HP say it isn't. I am at my wits' end with this and was hoping someone might be able to direct me on where I should look next. I will monitor this thread daily and supply more info if requested.

Many thanks in advance.

Some spec info:
Microsoft Windows Server 2008 R2 Standard 6.1.7601 Service Pack 1 Build 7601
32 GB RAM
Smart Array P420i in Embedded Slot (no errors in ACU) with 2 x 300 GB RAID 1 and 2 x 1 TB RAID 1

I have had this *exact* issue with an HP ProLiant ML350 G5p on Windows SBS 2008 (which is Server 2008 under the hood). Certain services like Remote Desktop, TeamViewer and RRAS would fail. Other services like File Sharing were still working.

I never did figure it out. I was convinced it was a hardware issue too, and we eventually replaced the server and got new ones with Server 2008 R2.

That HP ProLiant went on to live a long and healthy life as a Linux backup server without so much as a hiccup, so ultimately it wasn't a hardware issue.

I did have physical access to the server. My experience was that while the keyboard and mouse seemed to be responsive, it would hang forever trying to unlock or log in. Shutdown wasn't possible; I needed to force a reboot for the server to come back.

Sorry I can't be of more help.

nader alkahtani (Consultant) commented:
I'd suggest it's faulty hardware. Do a warranty claim on the hardware if possible.
Dr_Snapid (Author) commented:
Thanks guys. I have already dealt at length with HP and they are saying it's OS and not warranty.

I don't suspect a corrupt registry because the server boots fine afterwards and may run fine for weeks. No system restore is necessary to restore operation.

There is always available RAM and disk space, so I don't expect it's resource related. I could not rule out a process or service that leaks memory under certain circumstances when I'm not looking, although I would expect some events to be logged telling me that the system is low on resources prior to the crash.
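
One way to test the leak theory instead of waiting for a low-resource event is to record committed memory at intervals and check whether it climbs steadily. Below is a minimal sketch in Python, assuming the readings have already been collected somewhere (for example from a performance counter log). The sample values, the 32768 MB limit and the function names are made up for illustration and are not taken from this server.

```python
from statistics import mean


def growth_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    xs = [hours for hours, _ in samples]
    ys = [mb for _, mb in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("need samples taken at two or more distinct times")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def hours_until(samples, limit_mb):
    """Hours from the last sample until limit_mb, or None if not growing."""
    slope = growth_per_hour(samples)
    if slope <= 0:
        return None
    return (limit_mb - samples[-1][1]) / slope


# Hypothetical readings: (hours since boot, committed memory in MB).
samples = [(0, 9000), (6, 9300), (12, 9600), (18, 9900), (24, 10200)]

print(f"growth: {growth_per_hour(samples):.1f} MB/hour")
print(f"hours until 32 GB at this rate: {hours_until(samples, 32768):.0f}")
```

A slope near zero over several days would argue against a slow leak. A steady positive slope would be worth breaking down per process.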
A selection of what I see in the logs around the time of the crash:

Error      30/10/2014 1:57:51 PM      Winlogon      4005      The Windows logon process has unexpectedly terminated.
(Lots of these)

SYSTEM LOG (this is a selection of the first messages I see after the time I suspect the crash begins)

Error      30/10/2014 1:08:01 PM      Service Control Manager      7011      A timeout (30000 milliseconds) was reached while waiting for a transaction response from the WerSvc service.

Error      30/10/2014 1:02:58 PM      GroupPolicy      1007      The processing of Group Policy failed. Windows could not determine the site associated for this computer, which is required for Group Policy processing.

Error      30/10/2014 1:02:09 PM      GroupPolicy      1110      The processing of Group Policy failed. Windows could not determine if the user and computer accounts are in the same forest. Ensure the user domain name matches the name of a trusted domain that resides in the same forest as the computer account.

Error      30/10/2014 12:49:34 PM      GroupPolicy      1065      The processing of Group Policy failed. Windows could not evaluate the Windows Management Instrumentation (WMI) filter for the Group Policy object CN={BCE33240-3E92-4AA9-9AC2-9174AB5D86E0},CN=POLICIES,CN=SYSTEM,DC=[removed],DC=LOCAL. This could be caused by RSOP being disabled  or Windows Management Instrumentation (WMI) service being disabled, stopped, or other WMI errors. Make sure the WMI service is started and the startup type is set to automatic. New Group Policy objects or settings will not process until this event has been resolved.

Error      30/10/2014 12:38:03 PM      DistributedCOM      10010      The server {0006F03A-0000-0000-C000-000000000046} did not register with DCOM within the required timeout.
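
The excerpt above is listed newest first, which makes the order of failure hard to read. As a rough sketch (Python, with the message text trimmed and the column layout assumed to be separated by runs of two or more spaces, as it appears in this post), the entries can be sorted oldest first and shown as offsets from the earliest one:

```python
import re
from datetime import datetime

# Level, timestamp, source and event ID copied from the excerpt above.
LOG = """\
Error      30/10/2014 1:57:51 PM      Winlogon      4005
Error      30/10/2014 1:08:01 PM      Service Control Manager      7011
Error      30/10/2014 1:02:58 PM      GroupPolicy      1007
Error      30/10/2014 1:02:09 PM      GroupPolicy      1110
Error      30/10/2014 12:49:34 PM      GroupPolicy      1065
Error      30/10/2014 12:38:03 PM      DistributedCOM      10010
"""


def parse(line):
    """Split one line on runs of 2+ spaces and parse the dd/mm/yyyy timestamp."""
    _level, stamp, source, event_id = re.split(r"\s{2,}", line.strip())[:4]
    when = datetime.strptime(stamp, "%d/%m/%Y %I:%M:%S %p")
    return when, source, int(event_id)


events = sorted(parse(line) for line in LOG.splitlines() if line.strip())
start = events[0][0]

for when, source, event_id in events:
    print(f"+{when - start}  {source} {event_id}")
```

In this excerpt the earliest entry is the DistributedCOM 10010 timeout at 12:38:03 PM, with the Group Policy and Service Control Manager errors following it. Running the same sort over a full export from each outage would show whether that order repeats.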
nader alkahtani (Consultant) commented:
Try to narrow down the hang as follows:
try disabling RDS. If the hangs stop, research RDS as the cause; for example, limit how much of the server's resources users of this service can consume.
Dr_Snapid (Author) commented:
Nader, I don't understand what you mean.

It's a terminal server; I cannot disable RDS. To do so would effectively take the server offline for the users.
When it freezes, does it crash and require a reboot? Or is it temporary and specific to the terminal server?

Disabling RDS will definitely resolve your issue, because nobody will be able to use the server and thus it cannot crash. lol

I would Google all the errors you're getting in Event Viewer and resolve them. It does appear that there may be some issues with DNS or Active Directory. Check DNS and NetBIOS, and make sure DNS is pointing to the AD server.
Dr_Snapid (Author) commented:
It is only this one computer on the network that has the problem; it is, however, the only terminal server.

As mentioned in the OP, the server sometimes seems to self recover after 15-30 minutes, but otherwise recovers when I use iLO to force a reboot.

The errors are not there when the server is running normally. Those errors only start logging around the time that the problem occurs. I believe they are symptomatic, not the actual cause. The server runs for days or weeks perfectly fine. Then suddenly, uh oh.

I have not isolated a certain program that someone runs or any other event that triggers the outage; it seems completely random. This is partly why I'm so frustrated. There is no way to know if you have fixed it by installing an update or firmware. You just have to wait and see...
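
Since shares on the server become extremely slow during a hang, one way to pin down when each hang starts is to time a directory listing on a share from another PC on the LAN and log the result. This is only a sketch in Python: the function names, the 2-second threshold and the log format are my own assumptions, and the share path in the note below is a placeholder.

```python
import os
import time
from datetime import datetime


def time_listing(path):
    """Seconds taken to list a directory, or None if the listing fails."""
    start = time.monotonic()
    try:
        os.listdir(path)
    except OSError:
        return None
    return time.monotonic() - start


def classify(seconds, slow_after=2.0):
    """Map a timing to a status; slow_after is an assumed threshold."""
    if seconds is None:
        return "FAILED"
    return "SLOW" if seconds > slow_after else "ok"


def probe_once(path, log_path, slow_after=2.0):
    """Time one listing, append a timestamped line to log_path, return the status."""
    seconds = time_listing(path)
    status = classify(seconds, slow_after)
    stamp = datetime.now().isoformat(timespec="seconds")
    took = "n/a" if seconds is None else f"{seconds:.2f}s"
    with open(log_path, "a") as log:
        log.write(f"{stamp} {status} {took} {path}\n")
    return status
```

Calling `probe_once(r"\\server\share", "probe.log")` once a minute from a scheduled task would give a timestamp for the first SLOW or FAILED reading, which can then be lined up against the event logs. Note that a listing against a hung server may block for a long time before it returns, so the log line appears late but the measured duration still reflects the delay.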
Do you have business support on the unit? If so, make them change the main board and power supply.

You can also try moving the VM off and running a different VM on it to see how it behaves.

Personally I would just fight to get it replaced under warranty. (I have had a whole server replaced to rectify an issue.) It's not my job to identify the faulty bit; that's theirs. I want a new working server.

You need to be 100% sure it's not your software. Run it up as a VM on different hardware. If the problem goes away, you know it's hardware.
Dr_Snapid (Author) commented:
"You need to be 100% sure it's not your software. Run it up as a VM on different hardware. If the problem goes away, you know it's hardware."

I agree, running on other hardware would be ideal. Here is the part that worries me. If I run it up on a VM, then I also have to get all the users using that VM in order to test accurately. Essentially: back up the server, restore to a VM, leave the hardware server offline and get the users running on the VM. But that requires decent hardware on which to run the VM. We just don't have hardware like that lying around.

We do have an SBS 2011 on almost identical hardware though... I WONDER? How hard would it be to swap the SBS and TS hardware, given that they are identical? This SBS 2011 is the same age, was bought at the same time, and has always been fine. It has a different RAID config and less RAM, but hardware-wise I think it is identical.

Surely it wouldn't be as simple as pulling the SAS drives and swapping? Where is the RAID configuration stored? On the drives or the RAID controller? I'm not much of an expert on RAID but I could look into it.

If I could essentially swap the hardware from one to the other, and the problem does not change, then I could be almost certain it's a software issue...

Are there any obvious flaws in that logic?
In theory, probably. The config is stored on the SAS drives. But you would be braver than I would (and I'm not recommending it).

P.S. Make sure you have good backups that you have tested before reading further.

In theory, as long as all your drives are installed in the same order and the config is imported onto matching RAID cards with identical hardware, you should be OK. I would not do it; too scared. Maybe someone else has done this and succeeded and can advise. Windows will complain about new hardware once booted.

There is also the option of swapping the whole RAID card and drives into the other server, thus mitigating the config import issue. Again, it's just an idea and I strongly recommend you research this further, as I have not tried it.
Too many unknowns for me to feel comfortable recommending that option.
Dr_Snapid (Author) commented:
Yeah, I'm a bit afraid to try it too...

Thanks for your input. I'll look for other options.
You could always just tell HP you have tested it and it's still crashing after a reinstall, and that you want a new MB and PSU, and fight for it. If they replace them and it still crashes, then you know it's software and can reinstall it.
In my experience what you are describing is generally a MB fault.

I would just insist that an HP tech come out and replace it.
nader alkahtani (Consultant) commented:
Armenio:
Please read my previous comment completely.
I did read it and appreciate your input. However, in this scenario it is not a viable option. The user stated that his issue is intermittent and can take anywhere between 1 and 30 days to show up. Also, this is a production terminal server and the only one available on the network. If he had heeded your advice and disabled RDS, it could take anywhere between 1 and 30 or more days to verify whether it is an RDS issue. In the meantime no user would be able to access the terminal server, which is important to the functioning of the business. So it is my opinion that your recommendation, though sound, does not suit this situation and thus is not a viable option.
Dr_Snapid (Author) commented:
I have opened a support case and will push for MB replacement.

If that does not work, my next step will be to back up the server, install Server 2012, restore the current server into a Hyper-V VM and bring it back online. While it's online I can prep a Server 2012 terminal server in another VM, which I can migrate onto if the original VM crashes, and probably would migrate to in time anyway.

If the host server crashes in the same way with Server 2012 installed with only the Hyper-V role, my emails to HP may draw blood!

I will report back after, hopefully, a new MB is installed.

Thanks guys for your input.
Good luck; let us know the outcome.

Here is an idea if you have storage to spare.
Use VMware Converter to convert the current terminal server to a VM. Test that the VM boots. Once you are happy it works, install VMware on the server and import the VM into it. Test to see if it crashes.

Rinse and repeat on the SBS box. Now your entire platform is virtualised. You can now export and import the VMs onto the corresponding hardware and test.

P.S. You stated that your SBS drive config was different. (If it is not RAID 5, take this opportunity to make it RAID 5.)

This is a lot of work and will have a fair amount of downtime. If you hate your weekends, go for it.

I would personally still pursue the MB as the cause of the crashes. (One question: I am assuming you performed all the updates, like firmware on the BIOS, RAID, etc., on the hardware?)

P.S. HP's bark is much worse than their bite; don't worry about it. (I do it all the time. I'm not going to waste my time or my clients' time troubleshooting their crap. I want it replaced and they can sort out their crap in their own time. That's why we paid for the extended warranty.)
Dr_Snapid (Author) commented:
The SBS has a RAID 1. HP have agreed to replace the motherboard in the TS. Fingers crossed it fixes the problem...
Just a side note for any future servers and SBS servers you build: do not use RAID 1 in a server unless it's SSD (it's crap performance), especially with SBS, as SBS is very I/O-heavy.

Get a hardware RAID card with battery backup, set it to read ahead, and FILL ALL THE BAYS with SAS drives. Remember, the number of spindles is where the performance comes from. Lots of small drives are much faster than one big drive.
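
To put rough numbers on the spindle argument, here is a back-of-envelope sketch in Python. It uses the commonly quoted RAID write penalties (2 for RAID 1 and RAID 10, 4 for RAID 5). The 150 IOPS per drive figure and the 70% read mix are assumptions for illustration only, and controller cache is ignored.

```python
# Back-end I/Os generated per front-end write, as commonly quoted per RAID level.
WRITE_PENALTY = {"RAID 1": 2, "RAID 10": 2, "RAID 5": 4}


def effective_iops(drives, iops_per_drive, level, read_fraction):
    """Front-end IOPS an array can sustain for a given read/write mix.

    Each read costs one back-end I/O and each write costs WRITE_PENALTY[level],
    so front_end * (reads + writes * penalty) = drives * iops_per_drive.
    """
    raw = drives * iops_per_drive
    write_fraction = 1 - read_fraction
    return raw / (read_fraction + write_fraction * WRITE_PENALTY[level])


# Assumed figures: 150 IOPS per SAS drive, 70% reads.
for drives, level in [(2, "RAID 1"), (8, "RAID 5"), (8, "RAID 10")]:
    iops = effective_iops(drives, 150, level, 0.7)
    print(f"{drives} drives, {level}: about {iops:.0f} IOPS")
```

Under these assumptions the eight-drive arrays come out well ahead of the two-drive mirror, mainly because of the extra spindles. The RAID level then decides how much of that raw capacity is lost to writes.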

Good luck :-)
Dr_Snapid (Author) commented:
Thanks Armenio, yes I agree RAID 5 is usually the go.
Dr_Snapid (Author) commented:
Motherboard has been replaced. Fingers crossed.
Good Luck :-)
Dr_Snapid (Author) commented:
Well it didn't take long. Server froze again after motherboard replaced. So frustrated!
Have you been able to find anything in the log files that may point you to a cause?
Have you reinstalled and updated the hypervisor?
Alternatively, you may want to consider a clean install if it's an option. It's a lot of work but may be your best option.
Dr_Snapid (Author) commented:
No, nothing in the logs that helps me. We are going to back up, reload the hardware with Server 2012 and restore the backup into a Hyper-V guest, then see how it goes. If the host crashes in the same way I'll have to keep assuming it's hardware. If the guest crashes it's got to be something in the OS, or perhaps a running program. Either way the virtualised environment should help, especially since I could set up a new TS guest using Server 2012.

Thanks everyone for your support.
Dr_Snapid (Author) commented:
Excellent attempts to help, but unfortunately no solution was found that avoided having to reinstall or move to other hardware. I was hoping to learn some method to gain more information, find the culprit and fix it in situ. Thanks everyone.