Server unresponsive at random intervals, must force reboot

I have an HP ProLiant ML350p Gen8 which freezes periodically. It has been known to do it two days in a row, but can sometimes go a month without crashing. It is running RDS for around 30 users.

Because the server is remote and rack mounted without a monitor, I don't have physical access to it, and nobody can tell me what it says on the screen. I have tried using iLO 4 but I can never get the remote console to work.

After a forced reboot the server comes up and runs fine until the next hang 1-30 days down the track.

Users report the screen just freezes. When they close their RDP session and try to reconnect, it just never reconnects. But I do see Event 4005 in the Application log many times, from when the users get kicked until I reboot. I suspect it is logged once for every time someone tries to connect via RDP. It says 'The Windows logon process has unexpectedly terminated.' I have TeamViewer on it and it shows as online, but I cannot connect. I can browse shares on the server from another PC on the LAN, however it is extremely slow. It responds to pings without dropouts. And of course, it seems to be logging the 4005 events too, so it is not completely dead.
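
Since ping keeps answering while RDP stops reconnecting, one way to timestamp the onset of a hang from another PC on the LAN might be a small TCP probe like the sketch below. It is only an illustration: the host name `ts01` is hypothetical, 3389 is the default RDP port, and a successful TCP connect does not prove that a logon would succeed (Winlogon may fail after the connection is accepted), so the connect time is recorded as well.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Try one TCP connect; return (succeeded, seconds_taken)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:  # refused, unreachable or timed out
        return False, time.monotonic() - start


def watch(host, port=3389, interval=60, rounds=10):
    """Probe repeatedly and print one timestamped line per attempt."""
    for _ in range(rounds):
        ok, elapsed = probe(host, port)
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {host}:{port} ok={ok} connect_time={elapsed:.3f}s")
        time.sleep(interval)


# Example (hypothetical host name): watch("ts01", rounds=1440)
```

Comparing the first slow or failed line against the event logs would at least narrow down when the trouble starts.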

At times, the server seems to self recover after 15-30 minutes but not always. When it does self recover, it has not rebooted. It just seems to go on as if nothing happened.

I have supplied HP with the Active Health System log and they say there are no hardware issues. All the onboard diagnostic tools show no issues.

I have installed the latest ProLiant Support Pack for the server and updated firmware, drivers etc. I have not taken the server offline to run a memory test though.

My instincts tell me it's hardware but HP say it isn't. I am at my wits end with this and was hoping someone might be able to direct me on where I should look next. I will monitor this thread daily and supply more info if requested.

Many thanks in advance.

Some spec info:
Microsoft Windows Server 2008 R2 Standard 6.1.7601 Service Pack 1 Build 7601
32 GB RAM
Smart Array P420i in Embedded Slot (no errors in ACU) with 2x 300 GB RAID 1 and 2x 1 TB RAID 1

ASKER CERTIFIED SOLUTION

Frosty555

nader alkahtani

This may help you

http://technet.microsoft.com/en-us/library/cc734097(v=ws.10).aspx

Armenio

I'd suggest it's faulty hardware. Do a warranty claim on the hardware if possible.

Dr_Snapid

ASKER

Thanks guys. I have already dealt at length with HP and they are saying it's OS and not warranty.

I don't suspect a corrupt registry because the server boots fine afterwards and may run fine for weeks. No system restore is necessary to restore operation.

There is always available RAM and disk space, so I don't expect it's resource-related, although I could not rule out a process or service that leaks memory under certain circumstances when I'm not looking - although I would expect some events logged telling me that the system is low on resources prior to the crash.
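
A rough way to check for a slow leak, assuming per-process memory figures are sampled at regular intervals by some means (for example a performance counter log), is to flag any process whose usage rises almost every sample and ends well above where it started. The sketch below is only an illustration: the process names, the numbers and both thresholds are made up, not measurements from this server.

```python
def growth_suspects(samples, min_ratio=1.5, min_rising=0.9):
    """Flag processes whose memory use keeps climbing.

    samples maps a process name to its memory readings in MB, oldest first.
    A process is flagged when its last reading is at least min_ratio times
    its first and at least min_rising of the steps did not decrease.
    """
    suspects = []
    for name, series in samples.items():
        if len(series) < 3 or series[0] <= 0:
            continue  # too little data to judge
        steps = list(zip(series, series[1:]))
        rising = sum(1 for before, after in steps if after >= before) / len(steps)
        ratio = series[-1] / series[0]
        if ratio >= min_ratio and rising >= min_rising:
            suspects.append((name, round(ratio, 2)))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Hypothetical readings, purely to show the shape of the input.
example = {
    "app_a.exe": [100, 120, 150, 190, 240],  # climbs steadily
    "app_b.exe": [300, 280, 310, 295, 305],  # fluctuates around a level
}
suspects = growth_suspects(example)
print(suspects)
```

With these made-up readings only `app_a.exe` is flagged. Real sampling would need to run long enough to cover at least one hang.
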
A selection of what I see in the logs around the time of the crash:

APPLICATION LOG
Error      30/10/2014 1:57:51 PM      Winlogon      4005      The Windows logon process has unexpectedly terminated.
(Lots of these)

SYSTEM LOG (this is a selection of the first messages I see after the time I suspect the crash begins)

Error      30/10/2014 1:08:01 PM      Service Control Manager      7011      A timeout (30000 milliseconds) was reached while waiting for a transaction response from the WerSvc service.

Error      30/10/2014 1:02:58 PM      GroupPolicy      1007      The processing of Group Policy failed. Windows could not determine the site associated for this computer, which is required for Group Policy processing.

Error      30/10/2014 1:02:09 PM      GroupPolicy      1110      The processing of Group Policy failed. Windows could not determine if the user and computer accounts are in the same forest. Ensure the user domain name matches the name of a trusted domain that resides in the same forest as the computer account.

Error      30/10/2014 12:49:34 PM      GroupPolicy      1065      The processing of Group Policy failed. Windows could not evaluate the Windows Management Instrumentation (WMI) filter for the Group Policy object CN={BCE33240-3E92-4AA9-9AC2-9174AB5D86E0},CN=POLICIES,CN=SYSTEM,DC=[removed],DC=LOCAL. This could be caused by RSOP being disabled or Windows Management Instrumentation (WMI) service being disabled, stopped, or other WMI errors. Make sure the WMI service is started and the startup type is set to automatic. New Group Policy objects or settings will not process until this event has been resolved.

Error      30/10/2014 12:38:03 PM      DistributedCOM      10010      The server {0006F03A-0000-0000-C000-000000000046} did not register with DCOM within the required timeout.
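
Because the entries above are listed newest first and come from two different logs, it can help to merge them into one chronological timeline. The sketch below assumes rows are pasted exactly as shown above (columns separated by two or more spaces, dates as day/month/year, English AM/PM markers); `parse_events` and `SAMPLE` are illustrative names, and the messages are shortened.

```python
import re
from datetime import datetime

# One pasted row: level, timestamp, source, event ID, message,
# separated by runs of two or more spaces.
ROW = re.compile(
    r"^(?P<level>\w+)\s{2,}"
    r"(?P<when>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s{2,}"
    r"(?P<source>.+?)\s{2,}(?P<event_id>\d+)\s{2,}(?P<message>.*)$"
)


def parse_events(text):
    """Return (timestamp, source, event_id, message) tuples, oldest first."""
    events = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # skip blank lines and free-text notes
        # %p assumes an English locale for the AM/PM marker.
        when = datetime.strptime(match["when"], "%d/%m/%Y %I:%M:%S %p")
        events.append((when, match["source"], int(match["event_id"]), match["message"]))
    return sorted(events)


# Rows copied from the post above, messages shortened.
SAMPLE = """
Error      30/10/2014 1:57:51 PM      Winlogon      4005      The Windows logon process has unexpectedly terminated.
Error      30/10/2014 1:08:01 PM      Service Control Manager      7011      A timeout (30000 milliseconds) was reached
Error      30/10/2014 1:02:58 PM      GroupPolicy      1007      The processing of Group Policy failed.
Error      30/10/2014 1:02:09 PM      GroupPolicy      1110      The processing of Group Policy failed.
Error      30/10/2014 12:49:34 PM      GroupPolicy      1065      The processing of Group Policy failed.
Error      30/10/2014 12:38:03 PM      DistributedCOM      10010      The server did not register with DCOM
"""

timeline = parse_events(SAMPLE)
for when, source, event_id, message in timeline:
    print(when.strftime("%H:%M:%S"), source, event_id, message[:50])
```

Sorted this way, the earliest entry in this excerpt is the DistributedCOM 10010 timeout at 12:38:03 PM, roughly 80 minutes before the Winlogon 4005 entry shown.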

nader alkahtani

Try to avoid the hang as follows:
Disable RDS and see whether the results improve. If they do, look into this issue further, for example by letting users use the service with fewer of the server's resources.

Dr_Snapid

ASKER

Nader, I don't understand what you mean.

It's a terminal server; I cannot disable RDS. To do so would effectively take the server offline for the users.

Armenio

When it freezes, does it crash and require a reboot? Is it temporary and specific to the terminal server?

Disabling RDS will definitely resolve your issue, because nobody will be able to use the server and thus it cannot crash. lol

I would Google all the errors you're getting in Event Viewer and resolve them. It does appear that there may be some issues with DNS or Active Directory. Check DNS and NetBIOS, and make sure DNS is pointing to the AD server.

Dr_Snapid

ASKER

It is only this one computer on the network that has the problem; it is, however, the only terminal server.

As mentioned in the OP, the server sometimes seems to self recover after 15-30 minutes, but otherwise recovers when I use iLO to force a reboot.

The errors are not there when the server is running normally. Those errors only start logging around the time that the problem occurs. I believe they are symptomatic, not the actual cause. The server runs for days or weeks perfectly fine. Then suddenly, uh oh.

I have not isolated a certain program that someone runs or any other event that triggers the outage; it seems completely random. This is partly why I'm so frustrated. There is no way to know if you have fixed it by installing an update or firmware. You just have to wait and see...

SOLUTION

Armenio

Dr_Snapid

ASKER

"You need to be 100% sure it's not your software. Run it up as a VM on different hardware. If the problem goes away, you know it's hardware."

I agree, running on other hardware would be ideal. Here is the part that worries me. If I run it up on a VM, then I also have to get all the users using that VM in order to accurately test. Essentially: back up the server, restore to a VM, leave the HW server offline and get the users running on the VM. But that requires decent hardware on which to run the VM. We just don't have hardware like that lying around.

We do have an SBS2011 on almost identical hardware though... I WONDER? How hard would it be to swap the SBS and TS hardware, given that they are identical? This SBS2011 is the same age, bought at the same time, and has always been fine. It has a different RAID config and less RAM, but hardware-wise I think it is identical.

Surely it wouldn't be as simple as pulling the SAS drives and swapping? Where is the RAID configuration stored? On the drives or the RAID controller? I'm not much of an expert on RAID but I could look into it.

If I could essentially swap the hardware from one to the other, and the problem does not change, then I could be almost certain it's a software issue...

Are there any obvious flaws in that logic?

Armenio

In theory, probably. The config is stored on the SAS drives. But you would be braver than I would (and I'm not recommending it).

P.S. Make sure you have good backups that you have tested before reading further.

In theory, as long as all your drives are installed in the same order and the config is imported onto matching RAID cards with identical hardware, you should be OK. I would not do it; too scared. Maybe someone else has done this and succeeded and can advise. Windows will complain about new hardware once booted.

There is also the option of swapping the whole RAID card and drives into the other server, thus mitigating the config import issue. Again, it's just an idea, and I strongly recommend you research more into this as I have not tried it.
Too many unknowns for me to feel comfortable recommending that option.

Dr_Snapid

ASKER

Yeah, I'm a bit afraid to try it too...

Thanks for your input. I'll look for other options.

Armenio

You could always just tell HP you have tested it and it's still crashing after a reinstall, so you want a new MB and PSU, and fight for it. If they replace it and it still crashes, then you know it's software and can reinstall it.

Armenio

In my experience, what you are describing is generally a MB fault.

I would just insist that an HP tech come out and replace it.

nader alkahtani

Armenio:
Please read my previous comment completely.

Armenio

Nader:
I did read it and appreciate your input. However, in this scenario it is not a viable option. The user stated that his issue is intermittent and can take anywhere between 1 and 30 days to show up. Also, this is a production terminal server and the only one available on the network. If he had heeded your advice and disabled RDS, it could take anywhere between 1 and 30 or more days to verify whether it is an RDS issue. In the meantime no user would be able to access the terminal server, which is important to the functioning of the business. So it is my opinion that your recommendation, though sound in itself, does not suit this situation and thus is not a viable option.

SOLUTION

Dr_Snapid

Armenio

Good luck, let us know the outcome.

Here is an idea if you have storage to spare.
Use VMware Converter to convert the current terminal server to a VM. Test that the VM boots. Once you are happy it works, install VMware on the server and import the VM into it. Test to see if it crashes.

Rinse and repeat on the SBS box. Now your entire platform is virtualised. You can now export and import the VMs onto the other server's hardware and test.

P.S. You stated that your SBS drive config was different. (If it is not RAID 5, take this opportunity to make it RAID 5.)

This is a lot of work and will have a fair amount of downtime. If you hate your weekends, go for it.

I would personally still pursue the MB as the cause of the crashes. (One question: I am assuming you performed all the updates, like firmware for the BIOS, RAID, etc. on the hardware?)

P.S. HP's bark is much worse than their bite, don't worry about it. (I do it all the time. I'm not going to waste my time or my clients' time troubleshooting their crap. I want it replaced and they can sort out their crap in their own time. That's why we paid for extended warranty.)

Dr_Snapid

ASKER

The SBS has a RAID 1. HP have agreed to replace the motherboard in the TS. Fingers crossed it fixes the problem...

Armenio

Just a side note for any future servers and SBS servers you build: do not use RAID 1 in a server unless it's SSD (it's crap performance), especially with SBS, as SBS is very high IO.

Get a hardware RAID card with battery backup, set it to read ahead, and FILL ALL THE BAYS up with SAS drives. Remember, the number of spindles is where the performance comes from. Lots of small drives is much faster than one big drive.

Good luck :-)

Dr_Snapid

ASKER

Thanks Armenio, yes I agree RAID 5 is usually the go.

Dr_Snapid

ASKER

Motherboard has been replaced. Fingers crossed.

Armenio

Good Luck :-)

Dr_Snapid

ASKER

Well, it didn't take long. The server froze again after the motherboard was replaced. So frustrated!

Armenio

Have you been able to find anything in the log files that may point you to a cause?
Have you reinstalled and updated the hypervisor?
Alternatively, you may want to consider a clean install if it's an option. It's a lot of work but may be your best option.

Dr_Snapid

ASKER

No, nothing in the logs that helps me. We are going to back up, reload the HW with Server 2012 and restore the backup into a Hyper-V guest, then see how it goes. If the host crashes in the same way, I'll have to keep assuming it's hardware. If the guest crashes, it's got to be something in the OS or perhaps a running program. Either way, the virtualised environment should help, especially since I could set up a new TS guest using Server 2012.

Thanks everyone for your support.

Dr_Snapid

ASKER

Excellent attempts to help, but unfortunately no solution was found that avoided having to reinstall or move to other hardware. I was hoping to learn some method to gain more information, find the culprit and fix it in situ. Thanks everyone.