Solved

Server unresponsive at random intervals, must force reboot

Posted on 2014-10-29
Last Modified: 2014-11-30
I have an HP ProLiant ML350p Gen8 that freezes periodically. It has been known to do it two days in a row, but it can sometimes go a month without crashing. It is running RDS for around 30 users.

Because the server is remote and rack mounted without a monitor, I don't have physical access to it and nobody can tell me what it says on the screen. I have tried using iLO 4, but I can never get the remote console to work.

After forcing reboot the server comes up and runs fine until the next hang 1-30 days down the track.

Users report the screen just freezes. When they close their RDP session and try to reconnect, it never reconnects. I do see Event 4005 in the Application log many times between when the users get kicked and when I reboot; I suspect it is logged once for every attempt to connect via RDP. It says 'The Windows logon process has unexpectedly terminated.' I have TeamViewer on it and it shows as online, but I cannot connect. I can browse shares on the server from another PC on the LAN, although it is extremely slow. It responds to pings without dropouts. And of course it is still logging the 4005 events, so it is not completely dead.

At times, the server seems to self recover after 15-30 minutes but not always. When it does self recover, it has not rebooted. It just seems to go on as if nothing happened.
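One way to pin down exactly when a hang starts and how long a self-recovery takes is to probe the server from another PC on the LAN and log every state change with a timestamp. The following is only a minimal sketch: the host name in the comment is hypothetical, and ports 3389 (RDP) and 445 (SMB file sharing) are the standard defaults, assumed here rather than confirmed for this server. Note that a successful TCP connect only shows that the listener is accepting connections; it does not prove that logon works, so the connect time is logged as well.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=5.0):
    """Return the TCP connect time in seconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def watch(host, ports, interval=60, rounds=None, log=print):
    """Probe each port every `interval` seconds; log only when a port's state changes.

    `rounds=None` means run until interrupted.
    """
    last_up = {}
    done = 0
    while rounds is None or done < rounds:
        for port in ports:
            latency = probe(host, port)
            up = latency is not None
            if last_up.get(port) != up:
                stamp = datetime.now().isoformat(timespec="seconds")
                detail = f"connect ok in {latency:.3f}s" if up else "connect FAILED"
                log(f"{stamp} {host}:{port} {detail}")
                last_up[port] = up
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)


# Example call (hypothetical host name, runs until interrupted):
# watch("ts01.example.local", [3389, 445])
```

Left running on a workstation, the output gives a start and end time for each outage that can be lined up against the event logs.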

I have supplied HP with the Active Health System log and they say there are no hardware issues. All the onboard diagnostic tools show no issues.

I have installed the latest ProLiant Support Pack for the server and updated firmware, drivers, etc. I have not taken the server offline to run a memory test though.

My instincts tell me it's hardware but HP say it isn't. I am at my wits end with this and was hoping someone might be able to direct me on where I should look next. I will monitor this thread daily and supply more info if requested.

Many thanks in advance.

Some spec info:
Microsoft Windows Server 2008 R2 Standard 6.1.7601 Service Pack 1 Build 7601
32 GB RAM
Smart Array P420i in Embedded Slot (no errors in ACU) with 2 x 300 GB RAID 1 and 2 x 1 TB RAID 1
Question by:Dr_Snapid
27 Comments
 
LVL 31

Accepted Solution

by:
Frosty555 earned 150 total points
ID: 40412539
I have had this *exact* issue with an HP ProLiant ML350 G5p on Windows SBS 2008 (which is Server 2008 under the hood). Certain services like Remote Desktop, TeamViewer and RRAS would fail. Other services like File Sharing were still working.

I never did figure it out. I was convinced it was a hardware issue too, and we eventually replaced the server and got new ones with Server 2008 R2.

That HP ProLiant went on to live a long and healthy life as a Linux backup server without so much as a hiccup, so ultimately it wasn't a hardware issue.

I did have physical access to the server. My experience was that while the keyboard and mouse seemed to be responsive, it would hang forever trying to unlock or login. Shutdown wasn't possible, needed to force reboot the server for it to come back.

Sorry I can't be of more help.
0
 
LVL 8

Expert Comment

by:nader alkahtani
ID: 40412552
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40412597
I'd suggest it's faulty hardware. Do a warranty claim on the hardware if possible.
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40412624
Thanks guys. I have already dealt at length with HP and they are saying it's OS and not warranty.

I don't expect a corrupt registry because the server boots fine afterwards and may run fine for weeks. No system restore necessary to restore operation.

There is always available RAM and disk space, so I don't expect it's resource related, although I could not rule out a process or service that leaks memory under certain circumstances when I'm not looking. Even then, I would expect some events to be logged telling me that the system is low on resources prior to the crash.
A selection of what I see in the logs around the time of the crash:
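A slow leak is easier to rule in or out from a trend than from a glance at Task Manager. The sketch below assumes you already have periodic samples of some counter (for example committed memory or a handle count exported from Performance Monitor) as `(seconds_since_start, value)` pairs; how the samples are collected, the function names and the threshold are all assumptions for illustration. It fits a least-squares line and reports the growth per hour.

```python
def slope_per_hour(samples):
    """Least-squares slope of value against time, in units per hour.

    `samples` is a list of (seconds_since_start, value) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den * 3600


def looks_like_leak(samples, min_growth_per_hour):
    """True if the fitted growth rate reaches the (hypothetical) threshold."""
    return slope_per_hour(samples) >= min_growth_per_hour


# Made-up sample data: a value growing by 10 units every hour.
hourly = [(0, 100.0), (3600, 110.0), (7200, 120.0)]
print(round(slope_per_hour(hourly), 3))
```

A counter that climbs steadily for days before each hang and resets after the reboot would point at software; a flat line would make a leak less likely, though it would not exclude other kinds of resource exhaustion.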

APPLICATION LOG
Error      30/10/2014 1:57:51 PM      Winlogon      4005      The Windows logon process has unexpectedly terminated.
(Lots of these)


SYSTEM LOG (this is a selection of the first messages I see after the time I suspect the crash begins)

Error      30/10/2014 1:08:01 PM      Service Control Manager      7011      A timeout (30000 milliseconds) was reached while waiting for a transaction response from the WerSvc service.

Error      30/10/2014 1:02:58 PM      GroupPolicy      1007      The processing of Group Policy failed. Windows could not determine the site associated for this computer, which is required for Group Policy processing.

Error      30/10/2014 1:02:09 PM      GroupPolicy      1110      The processing of Group Policy failed. Windows could not determine if the user and computer accounts are in the same forest. Ensure the user domain name matches the name of a trusted domain that resides in the same forest as the computer account.

Error      30/10/2014 12:49:34 PM      GroupPolicy      1065      The processing of Group Policy failed. Windows could not evaluate the Windows Management Instrumentation (WMI) filter for the Group Policy object CN={BCE33240-3E92-4AA9-9AC2-9174AB5D86E0},CN=POLICIES,CN=SYSTEM,DC=[removed],DC=LOCAL. This could be caused by RSOP being disabled  or Windows Management Instrumentation (WMI) service being disabled, stopped, or other WMI errors. Make sure the WMI service is started and the startup type is set to automatic. New Group Policy objects or settings will not process until this event has been resolved.

Error      30/10/2014 12:38:03 PM      DistributedCOM      10010      The server {0006F03A-0000-0000-C000-000000000046} did not register with DCOM within the required timeout.
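The entries above are listed newest first and come from two different logs, which makes it hard to see what happened first. A small script can merge them into one chronological timeline. This is only a sketch that assumes the lines are pasted in the same "Level, Date Time, Source, ID, Message" layout shown above, with the fields separated by tabs or runs of spaces and day-first dates; the function names are made up for illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    when: datetime
    level: str
    source: str
    event_id: int
    message: str


def parse_line(line):
    """Parse one 'Level  Date Time  Source  ID  Message' line, or return None."""
    # Fields are separated by tabs or by two or more whitespace characters;
    # maxsplit=4 keeps any double spaces inside the message intact.
    parts = re.split(r"\t+|\s{2,}", line.strip(), maxsplit=4)
    if len(parts) != 5:
        return None
    level, stamp, source, event_id, message = parts
    try:
        when = datetime.strptime(stamp, "%d/%m/%Y %I:%M:%S %p")
        return Event(when, level, source, int(event_id), message)
    except ValueError:
        return None


def timeline(lines):
    """Return all parseable events, oldest first."""
    events = [e for e in map(parse_line, lines) if e is not None]
    return sorted(events, key=lambda e: e.when)
```

Applied to the selection above, the oldest entry is the DistributedCOM 10010 at 12:38:03 PM, more than an hour before the Winlogon 4005 at 1:57:51 PM, so whatever happened shortly before 12:38 is the more interesting place to look.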
0
 
LVL 8

Expert Comment

by:nader alkahtani
ID: 40412825
Try to avoid the hang as follows:
Try disabling RDS. If that gives a good result, then investigate further, for example by letting users use the service with only limited resources from your server.
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40414663
Nader, I don't understand what you mean.

It's terminal server, I can not disable RDS, to do so would effectively take the server offline for the users.
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40414811
When it freezes, does it crash and require a reboot? Is it temporary and specific to the terminal server?

Disabling RDS will definitely resolve your issue, because nobody will be able to use the server and thus it cannot crash. lol

I would Google all the errors you're getting in Event Viewer and resolve them. It does appear that there may be some issues with DNS or Active Directory. Check DNS and NetBIOS, and make sure DNS is pointing to the AD server.
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40418722
It is only this one computer on the network that has the problem, it is however the only terminal server.

As mentioned in the OP, the server sometimes seems to self recover after 15-30 minutes, but otherwise recovers when I use iLO to force a reboot.

The errors are not there when the server is running normally. Those errors only start logging around the time that the problem occurs. I believe they are symptomatic, not the actual cause. The server runs for days or weeks perfectly fine. Then suddenly, uh oh.

I have not isolated a certain program that someone runs or any other event that triggers the outage; it seems completely random. This is partly why I'm so frustrated. There is no way to know if you have fixed it by installing an update or firmware. You just have to wait and see...
0
 
LVL 5

Assisted Solution

by:Armenio
Armenio earned 350 total points
ID: 40418767
Do you have business support on the unit? If so, make them change the main board and power supply.

You can also try moving the VM off and running a different VM on it to see how it behaves.

Personally I would just fight to get it replaced under warranty. (I have had a whole server replaced to rectify an issue.) It's not my job to identify the faulty bit; that's theirs. I want a new working server.

You need to be 100% sure it's not your software. Run it up as a VM on different hardware. If the problem goes away, you know it's hardware.
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40418773
"You need to be 100% sure it's not your software. Run it up as a VM on different hardware. If the problem goes away, you know it's hardware."

I agree, running on other hardware would be ideal. Here is the part that worries me. If I run it up on a VM, then I also have to get all the users using that VM in order to accurately test. Essentially: back up the server, restore to a VM, leave the hardware server offline and get the users running on the VM. But that requires decent hardware on which to run the VM. We just don't have hardware like that lying around.

We do have an SBS 2011 on almost identical hardware though... I WONDER? How hard would it be to swap the SBS and TS hardware, given that they are identical? This SBS 2011 is the same age, was bought at the same time, and has always been fine. It has a different RAID config and less RAM, but hardware-wise I think it is identical.

Surely it wouldn't be as simple as pulling the SAS drives and swapping? Where is the RAID configuration stored? On the drives or the RAID controller? I'm not much of an expert on RAID but I could look into it.

If I could essentially swap the hardware from one to the other, and the problem does not change, then I could be almost certain it's a software issue...

Are there any obvious flaws in that logic?
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40418809
In theory, probably. The config is stored on the SAS drives. But you would be braver than I would (and I'm not recommending it).

P.S. Make sure you have good backups that you have tested before reading further.

In theory, as long as all your drives are installed in the same order and the config is imported onto matching RAID cards with identical hardware, you should be OK. I would not do it; too scared. Maybe someone else has done this and succeeded and can advise. Windows will complain about new hardware once booted.

There is also the option of swapping the whole RAID card and drives into the other server, thus mitigating the config import issue. Again, it's just an idea and I strongly recommend you research this more, as I have not tried it.
Too many unknowns for me to feel comfortable recommending that option.
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40418814
Yeah, I'm a bit afraid to try it too...

Thanks for your input. I'll look for other options.
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40418820
You could always just tell HP you have tested it, it's still crashing after a reinstall, and you want a new MB and PSU, and fight for it. If they replace it and it still crashes, then you know it's software and can reinstall it.
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40418823
In my experience what you are describing is generally a MB fault.

I would just insist that an HP tech come out and replace it.
0
 
LVL 8

Expert Comment

by:nader alkahtani
ID: 40419200
Armenio :
please read my previous comment completely
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40420611
Nader:
I did read it and appreciate your input. However, in this scenario it is not a viable option. The user stated that his issue is intermittent and can take between 1 and 30 days to show up. This is also a production terminal server and the only one available on the network. If he had heeded your advice and disabled RDS, it could take anywhere between 1 and 30 or more days to verify whether it is an RDS issue. In the meantime no user would be able to access the terminal server, which is important to the functioning of the business. So in my opinion your recommendation, though sound in itself, does not suit this situation and thus is not a viable option.
0
 
LVL 1

Assisted Solution

by:Dr_Snapid
Dr_Snapid earned 0 total points
ID: 40420948
I have opened a support case and will push for MB replacement.

If that does not work, my next step will be to back up the server, install Server 2012, restore the current server into a Hyper-V VM and bring it back online. While it's online I can prep a Server 2012 terminal server in another VM, which I can migrate onto if the original VM crashes, and probably would migrate to in time anyway.

If the host server crashes in the same way with Server 2012 installed with only the Hyper-V role, my emails to HP may draw blood!

I will report back once, hopefully, a new MB has been installed.

Thanks guys for your input.
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40420979
Good luck, let us know the outcome.

Here is an idea if you have storage to spare.
Use VMware Converter to convert the current terminal server to a VM. Test that the VM boots. Once you are happy it works, install VMware on the server and import the VM into it. Test to see if it crashes.

Rinse and repeat on the SBS box. Now your entire platform is virtualised. You can now export and import the guests onto the corresponding hardware and test.

P.S. You stated that your SBS drive config was different. (If it is not RAID 5, take this opportunity to make it RAID 5.)

This is a lot of work and will involve a fair amount of downtime. If you hate your weekends, go for it.

I would personally still pursue the MB as what is crashing the server. (One question: I am assuming you performed all the updates, like firmware on BIOS, RAID, etc., on the hardware.)

P.S. HP's bark is much worse than their bite, don't worry about it. (I do it all the time. I'm not going to waste my time or my clients' time troubleshooting their crap. I want it replaced and they can sort out their crap in their own time. That's why we paid for extended warranty.)
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40423050
The SBS has a RAID 1. HP have agreed to replace the motherboard in the TS. Fingers crossed it fixes the problem...
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40423206
Just a side note for any future servers and SBS servers you build: do not use RAID 1 in a server unless it's SSD (it's crap performance), especially with SBS, as SBS is very high IO.

Get a hardware RAID card with battery backup, set it to read ahead, and FILL ALL THE BAYS up with SAS drives. Remember, the number of spindles is where the performance comes from. Lots of small drives are much faster than one big drive.

Good luck :-)
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40423209
Thanks Armenio, yes I agree raid5 is usually the go.
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40433794
Motherboard has been replaced. Fingers crossed.
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40433834
Good Luck :-)
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40443749
Well, it didn't take long. The server froze again after the motherboard was replaced. So frustrated!
0
 
LVL 5

Expert Comment

by:Armenio
ID: 40461097
Have you been able to find anything in the log files that may point you to a cause?
Have you reinstalled and updated the hypervisor?
Alternatively, you may want to consider a clean install if it's an option. It's a lot of work but may be your best option.
0
 
LVL 1

Author Comment

by:Dr_Snapid
ID: 40465603
No, nothing in the logs that helps me. We are going to back up, reload the hardware with Server 2012 and restore the backup into a Hyper-V guest, then see how it goes. If the host crashes in the same way, I'll have to keep assuming it's hardware. If the guest crashes, it's got to be something in the OS or perhaps a running program. Either way the virtualised environment should help, especially since I could set up a new TS guest using Server 2012.

Thanks everyone for your support.
0
 
LVL 1

Author Closing Comment

by:Dr_Snapid
ID: 40472360
Excellent attempts to help, but unfortunately no solution was found that avoided having to reinstall or move to other hardware. I was hoping to learn some method to gain more information, find the culprit and fix it in situ. Thanks everyone.
0
