Diagnosing System Hangs on a Remote Server

We have a Windows Server 2003 machine at a remote location, so the only access I have to it is remote, via Remote Desktop (RDP). Every two to three weeks it locks up for some reason and I can't even ping it, let alone connect to it. A reboot of the system clears everything up. I'm at a loss on how to even begin to figure out what went wrong. It just happened again today.
Asked by quizwedge
sfossupport commented:
You should check the event logs for clues. You also need to take some baseline measurements to try to isolate the problem. Here is what we start with. Schedule the counters to run during both peak and off-peak times.

Hard disk counters:
  Logical Disk
      1. Average Disk Queue Length: is the drive keeping up with the demand of running processes?
      2. Disk Bytes/sec: is the drive keeping up with expectations?
      3. Free Megabytes: useful in predicting future needs.

  Memory
      4. Available KBytes: if this counter is greater than 10% of the physical RAM in your machine, you probably have more than enough RAM and don't need to worry.

      5. Pages/sec: tracks the number of virtual memory pages read from or written to disk per second. Multiply the value by the 4 KB page size to get the amount of data moved between memory and disk each second. A consistently high value indicates insufficient RAM.

  Network
      6. Bytes Total/sec: baseline measure for load.
      7. Output Queue Length: if this value averages 2 or greater, the network card isn't keeping up with the traffic the server is trying to send.

  Processor
      8. % Processor Time: measures the total utilization of your processor by all running processes. Note that on a multiprocessor machine, Processor(_Total)\% Processor Time actually measures the average processor utilization of your machine (i.e. utilization averaged over all processors).
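To make the rules of thumb in the counters listed above concrete, here is a minimal sketch that applies them to one averaged sample. It assumes you have already collected the values (for example with Performance Monitor); the CounterSample structure and function names are hypothetical, and treating "at or below 10% free" as a warning is my reading of counter 4, not a documented threshold. No threshold is applied to Pages/sec because none is given above; the sketch only does the 4 KB arithmetic.

```python
from dataclasses import dataclass
from typing import List

PAGE_SIZE_KB = 4  # x86 page size used in the Pages/sec rule of thumb


@dataclass
class CounterSample:
    """One averaged sample of a few counters (hypothetical structure)."""
    physical_ram_kb: int
    available_kbytes: float
    output_queue_length: float


def paging_kb_per_sec(pages_per_sec: float) -> float:
    """Pages/sec x 4 KB = data moved between memory and disk each second."""
    return pages_per_sec * PAGE_SIZE_KB


def check_baseline(sample: CounterSample) -> List[str]:
    """Return a warning for each rule of thumb the sample violates."""
    warnings = []
    # Counter 4: more than 10% of RAM free means memory is probably fine.
    if sample.available_kbytes <= 0.10 * sample.physical_ram_kb:
        warnings.append("Available KBytes at or below 10% of RAM: possible memory shortage")
    # Counter 7: an average output queue of 2 or more means the NIC is falling behind.
    if sample.output_queue_length >= 2:
        warnings.append("Output Queue Length averages 2 or more: NIC may not be keeping up")
    return warnings
```

For example, 250 pages/sec works out to 1,000 KB/s of paging, and a 2 GB server (2,097,152 KB) with only 150,000 KB available would trip the memory rule.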
quizwedge (author) commented:
Should have mentioned that Event Viewer turned up nothing at all. Good thoughts on what to track. I'll see who else responds and what they suggest, but your comment will at least get me started and is worthy of some points.

Thanks,
Dan
greggy86 commented:
A server that is hanging obviously has serious problems; if there is nothing in Event Viewer, the OS may not be able to see whatever is causing the problem. Ask yourself: what can't an OS see natively? The answer is RAID drives. Let me not get myself in trouble here: the OS sees a logical representation of the drives through your SCSI controller card/interface.
Short answer: check for driver issues with your SCSI controller. Also run the RAID diagnostic tools that came with the server to see if they report any errors.
quizwedge (author) commented:
Good guess, but we don't have RAID on that server. Looking at Device Manager, I see one item with an exclamation point under "Other devices". It's an "Other PCI Bridge Device". I'm not sure what a PCI Bridge Device is or how to figure out which drivers I need for it. Could that be what's causing the freeze-up every couple of weeks?
vertsyeux commented:
Intermittent problems like this can be tricky to track down. You need to know what is going on when it stops responding, and that means having a screen, mouse and keyboard connected: someone has to look at the screen while it is offline and report back. You said you can reboot it, so someone has access.

Find out what happens when the server becomes unresponsive. Is it still running and responding to keyboard/mouse input? Is there a BSOD (blue screen)? That would point to a hardware, memory or driver problem.

If it appears to be alive, there may be a problem with the interface portion of the network card: the server could think nothing is wrong, just that nobody is talking to it. If it has a second, unused Ethernet port, try connecting to that. Also try another port on the switch.

If the screen is completely blank, i.e. no desktop and no BSOD, look for a power-supply fault.

If the screen has a desktop but is frozen, no mouse/keyboard response, it is most likely a hardware fault - RAM, power-supply, something overheating.
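The checks above amount to a small decision table that whoever is on site can walk through. A minimal sketch, assuming the observations are reduced to a screen state and whether the machine still responds to keyboard/mouse; the labels "bsod", "blank" and "desktop" and the function name are hypothetical.

```python
def likely_cause(screen: str, responds_to_input: bool) -> str:
    """Map what the on-site person sees to the likely fault class.

    screen is one of the hypothetical labels "bsod", "blank" or "desktop".
    """
    if screen == "bsod":
        return "hardware, memory or driver problem"
    if screen == "blank":
        return "power-supply fault"
    if screen == "desktop" and responds_to_input:
        # Alive locally but unreachable: suspect the network interface.
        return "network interface problem: try the second port or another switch port"
    if screen == "desktop":
        return "hardware fault: RAM, power supply or overheating"
    raise ValueError(f"unknown screen state: {screen!r}")
```

The order of the checks mirrors the paragraphs above: a blue screen or a blank screen settles the question on its own, and only a visible desktop needs the second observation.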

If you have another identical server, swap the hard drives and see what happens on both machines.

Good luck!



I know you can only access it remotely, but someone has to be there.
AnnOminous commented:
Can you confirm whether *ANY* services are functional on the server when it appears to hang? For example, if it is the gateway to the Internet, can you still route traffic through it even if you can't access it with Remote Desktop?

I've seen situations where Windows Server 2008 (all versions) locks up some services while others remain unmolested: no Remote Desktop, no Event Viewer (or other management console) access, but it continues to route NAT traffic, and the file server works locally (but not remotely).

Can you set up a ping from the internal network that traverses this machine? Then when it 'hangs' you will know whether it's the entire machine or only some services.
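One way to set up that ping is a small script on the machine on the private network that pings both the server and a host reached through it, logging a timestamp only when a state changes. A minimal sketch, assuming Python is available on that machine; the function names and the addresses in the usage line are hypothetical.

```python
import platform
import subprocess
import time
from datetime import datetime
from typing import Dict, Optional


def ping_once(host: str, timeout_ms: int = 1000) -> bool:
    """Send one echo request with the system ping command; True if it exits with 0.

    Windows ping can exit with 0 on a "Destination host unreachable" reply
    from a router, so treat the result as approximate.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_ms), host]
    else:
        cmd = ["ping", "-c", "1", host]  # timeout flags differ between Unix pings
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def record(last_state: Dict[str, bool], host: str, up: bool, when: datetime) -> Optional[str]:
    """Return a log line only when a host's state differs from the last one seen."""
    if last_state.get(host) == up:
        return None
    last_state[host] = up
    return f"{when:%Y-%m-%d %H:%M:%S} {host} {'UP' if up else 'DOWN'}"


def monitor(hosts, interval_s: int = 60) -> None:
    """Ping every host forever, printing a timestamped line on each state change."""
    last_state: Dict[str, bool] = {}
    while True:
        for host in hosts:
            line = record(last_state, host, ping_once(host), datetime.now())
            if line:
                print(line, flush=True)
        time.sleep(interval_s)
```

Usage would be something like `monitor(["10.0.0.5", "203.0.113.10"])`, where the first (hypothetical) address is the server's private IP and the second is a host whose traffic passes through the server. Logging only state changes keeps weeks of output short, and the DOWN timestamps can be compared with the counter logs afterwards.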

How do you reboot? Can you do a soft reboot or do you need to do a hard reboot?

If a soft reboot, can you access the system via the console - even to a logon message?
ChiefIT commented:
If running Service Pack 1 on this DC, download and install SP2.

(If this applies, I will explain what's happening with SP1)
quizwedge (author) commented:
vertsyeux:
      The server is a rented server, but I should be able to contact the company and have them tell me what's going on.  It just always seems to happen at the worst of times so I've just had them reboot it.
      
AnnOminous:
      As best I can tell, nothing is connecting. A ping times out. We do have one machine on a private network with that machine. I think I tried pinging it from that machine and got nothing, but I can't remember now. I'm not exactly sure how the reboot works since we rent the server. They provide a link in the web-based admin tools that I click, and then magic happens behind the scenes. I assume it's some kind of software reboot, but they could just as well be talking to some kind of smart power supply and power cycling it. I don't have any access when the server goes down: no ping and no RDP.
      
ChiefIT:
      It's SP2.

I don't know when the server will freeze up again and I don't want to lag with awarding points.  sfossupport and vertsyeux seemed to have the answers that are going to get me the closest to solving this, so I'll split the points between them.  greggy86 pointed me in the direction of drivers which might be the issue, so I've carved out a few points for him/her as well.  Thanks everyone for your help!
AnnOminous commented:
Note that I've seen systems block new connections but allow existing (or UDP) connections to work. So a ping to the machine might fail, but a ping to a machine that traverses the 'failed' machine might work.

If you can traverse the machine, then it's likely a software problem. If you cannot, then hardware is more likely.
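The reasoning in this thread reduces to two observations made while the server is "hung": does the server itself answer a ping, and does a ping to a host beyond it still get through? A minimal sketch of that triage; the function name and result strings are hypothetical, and the mapping is the rule of thumb above, not a definitive diagnosis.

```python
def classify_hang(server_pings: bool, traversal_pings: bool) -> str:
    """Triage a hang from two ping results taken while the server is unresponsive."""
    if server_pings:
        # The machine answers, so only some services (e.g. RDP) are stuck.
        return "partial hang: only some services are affected"
    if traversal_pings:
        # New connections to the server fail, but traffic still flows through it.
        return "software problem likely"
    return "hardware problem more likely"
```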