Josh

asked on

Server seemed to shut down, without shutting down, need help diagnosing the problem.

We've got a Dell PE2900 here running Windows SBS 2003. The server had been working great until 3 weeks ago, when I ran into our first problem. At some point during the night the server "shut down", and when I say shut down I mean it wasn't actually off: the screen was blank but the server was still running, and there seemed to be no way to wake it up.

Fast forward to this morning: I got a call from work early, same problem. There is no hibernation or anything like it set on the server, and there is nothing that I can see in the event log to help me diagnose the problem either.

I am really not sure where to look to try and find out exactly what is happening.
The first time it happened our backup did not run. Our backup usually runs at 10PM, so I know it happened before 10PM the first time. This time the backup ran and completed fine, so it must have happened after 10PM.

If anyone has any suggestions on where I should start looking for answers, it would be greatly appreciated. I want to figure this out before it happens again or happens more frequently. Also, I have not changed anything on the server before or after this happened.

Thanks
mastoo

If there's a warranty, go for tech support as your first and best choice. If not, I'll comment from the school of bitter experience. If it's a server people need, it is going to be trial and error to get to the point where you trust the server again, so use another server while you troubleshoot this. Easier said than done. Check the Dell knowledge base. When it dies, can you ping it or log on remotely? Are there any drive lights on? What expansion cards does it have? Is it on a UPS with software installed, so you can be sure it isn't an over/under-voltage sending it out to lunch? If you've got monitoring software (in our case a programmer wrote something to test the server every 5 minutes), your cellphone can ring as soon as the server dies, which is better than a frantic phone call from a user.
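A rough sketch of the "test the server every 5 minutes" idea, in Python, meant to run on a different machine than the server. It is not the program mentioned above, just an illustration: it tries a TCP connection on a schedule and appends a timestamped line whenever the state flips between UP and DOWN, which also gives an approximate time of death. The function names, the log file, and the address and port in the usage note are all made up for the example, and the actual alerting (making a phone ring) is left out.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name lookup failed
        return False


def monitor(probe, log_path, interval_seconds=300, max_checks=None):
    """Call probe() every interval_seconds; log a line on each UP/DOWN change.

    max_checks=None loops until interrupted; a number stops after that
    many probes (handy for trying the script out).
    """
    previous = None
    checks = 0
    while max_checks is None or checks < max_checks:
        is_up = bool(probe())
        if is_up != previous:
            # Only state changes are written, so the log stays short.
            stamp = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
            with open(log_path, "a") as log:
                log.write("{} {}\n".format(stamp, "UP" if is_up else "DOWN"))
            previous = is_up
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval_seconds)
```

For example, `monitor(lambda: is_reachable("192.168.1.10", 3389), "pe2900_monitor.log")` would probe a hypothetical address on the Remote Desktop port every 5 minutes until interrupted. Any port the server normally listens on would do, and the first DOWN line in the log narrows the lockup to a 5-minute window.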
Josh

ASKER

Thanks for the info. I am going to look into some of the things you suggested. The server is under warranty, although I have had mixed results dealing with Dell; they can be difficult at times. When it goes down I cannot ping it or log in remotely. It's almost as if everything software related is off and the hardware is all still running.
mastoo

Yeah, they aren't perfect, and even after they fix it once or twice you can't trust it until it withstands the test of time. A temperature problem will usually cause a server to buzz or turn on a red light, but you might also check that the fans spin and aren't blocked with dust.

You can try things like guessing at a software problem (this kind of problem is hardware, driver, or power related, probably in that order), or swapping components and then holding your breath.
Josh

ASKER

Well, there is no A/C in this server room and it gets pretty hot in the summer; while I am sure that wasn't good for the server, everything was fine. As for power, we have a new UPS hooked up to the server. However, the PowerChute software that came with the APC unit would not run reliably on the server. I made many calls to APC, and finally they told me to just run the native Windows UPS software, a solution I wasn't entirely happy with.

I will have to take a look in the server and see how clean it is inside but I think it's probably ok.
mastoo

You mean there is no dedicated server room AC but the building has general AC? You might want to check that your building people aren't turning off the AC in the middle of the night: the server gets too hot and dies, but by the time you hear about it the AC is back on.

If you want, check the BIOS for watchdog functionality and turn that on. A watchdog will sometimes detect that a server is "dead" and reboot it. You'd see this as an "unexpected shutdown" in the event log.
Josh

ASKER

I will. Actually, I mean there is no AC in the server room at all, or on this floor for that matter.
It gets very hot in the summer; however, it has not been very hot out lately, so I didn't think overheating would be an issue right now.

I didn't have any of these issues when it was 90 degrees and now it's 70.
I guess I've got a lot of things to take a look at.
Josh

ASKER

Just to keep this up to date: I have yet to figure out what is causing this. It has happened again, for the first time since I asked this question on EE, and it is very frustrating.
I have been in contact with Dell as well, and they have yet to figure out the problem. I am not sure if anyone else here can help, but it would be greatly appreciated if anyone has any more advice.
mastoo

Did you look for drive lights when it was stuck? I don't know of a general rule of thumb, but in my experience the causes for this have been (in decreasing frequency): SCSI (controller, cables, terminator), drive, mobo, power, other. You can run something to exercise the drives and see if failures correlate with drive activity. You could swap out the SCSI controller/cable/terminator and wait and see. You can turn on process auditing and see if any particular process correlates with the failure. Bug Dell some more. Cable the UPS to another computer (anything, even a clunker) that the UPS software will install on, so you can record power events that could affect the server.
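One way to "run something to exercise the drives" is a small script like the Python sketch below. It is only an illustration with made-up names: each pass writes a scratch file of random data, forces it to disk, reads it back, compares hashes, and logs a timestamped START line before the pass and a result line after it. If the log ends on a START with no result, the lockup came during drive activity. One caveat: the read-back may be served from the OS cache, so this mostly stresses writes.

```python
import hashlib
import os
import time
from datetime import datetime


def log_line(log_path, message):
    """Append a timestamped line and force it to disk before continuing."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write("{} {}\n".format(stamp, message))
        log.flush()
        os.fsync(log.fileno())


def exercise_once(directory, size_bytes=16 * 1024 * 1024):
    """Write random data to a scratch file, read it back, compare hashes.

    Returns (ok, elapsed_seconds). The scratch file is removed afterwards.
    """
    data = os.urandom(size_bytes)
    expected = hashlib.sha256(data).hexdigest()
    path = os.path.join(directory, "drive_exercise.tmp")  # hypothetical name
    start = time.time()
    try:
        with open(path, "wb") as scratch:
            scratch.write(data)
            scratch.flush()
            os.fsync(scratch.fileno())
        with open(path, "rb") as scratch:
            actual = hashlib.sha256(scratch.read()).hexdigest()
    finally:
        if os.path.exists(path):
            os.remove(path)
    return actual == expected, time.time() - start


def exercise_loop(directory, log_path, passes, pause_seconds=60,
                  size_bytes=16 * 1024 * 1024):
    """Run a fixed number of passes, logging before and after each one."""
    for number in range(1, passes + 1):
        log_line(log_path, "START pass {}".format(number))
        ok, elapsed = exercise_once(directory, size_bytes)
        result = "OK" if ok else "MISMATCH"
        log_line(log_path, "{} pass {} in {:.1f}s".format(result, number, elapsed))
        if number < passes:
            time.sleep(pause_seconds)
```

Ideally the log file lives on a different drive from the one being exercised, so the last line still gets written if the exercised drive is what hangs.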
Josh

ASKER

The drive lights are on: no activity, but they are on. I will have to get hold of Dell again.
This seems to happen when there is little to no activity on the server. It only happens during the evening, when the only thing that runs is the backup, and I have determined that it happens before the backup even runs.
I am really trying to narrow down the time frame of when this happens and see whether something happening at that particular moment makes it lock up, that is, if it's a software issue at all. It is all under warranty, and I don't care if Dell has to send me a new server as long as the issue gets resolved.
Josh

ASKER

One more thing I want to add: the last entry in the event log was at about 6PM on Friday, after which nothing else was logged over the entire weekend until the server was rebooted this morning.

Is there some other way I could determine the exact time the server locked up, or close to it?
All I know is that there is nothing in the Event Viewer after 6PM and our backup did not run at 11PM as scheduled.
mastoo

Servers keep some kind of heartbeat so they can (usually) tell you when they died. I think you get a popup when logging in at the console, telling you "unexpected shutdown at xx:yy", and it gets logged in the events. But depending on how the server died, you might not get this.
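If the built-in heartbeat doesn't leave a trace, a homemade one is easy to sketch. The Python below is an assumption-laden illustration (the function names and file path are invented, and it assumes a Python 3 interpreter can run on the server, which nothing in this thread confirms): it appends a timestamp to a file at a fixed interval and flushes it to disk, so after a lockup the last line brackets the time of death to within roughly one interval.

```python
import os
import time
from datetime import datetime


def write_heartbeat(path, now=None):
    """Append one timestamp line and ask the OS to flush it to disk."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as heartbeat:
        heartbeat.write(stamp + "\n")
        heartbeat.flush()
        # Without fsync the line could sit in a cache and be lost in a hang.
        os.fsync(heartbeat.fileno())
    return stamp


def last_heartbeat(path):
    """Return the last non-empty line of the file, or None if there is none."""
    last = None
    with open(path) as heartbeat:
        for line in heartbeat:
            line = line.strip()
            if line:
                last = line
    return last


def run_heartbeat(path, interval_seconds=60):
    """Write a heartbeat forever; meant to be started when the server boots."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_seconds)
```

With `run_heartbeat(r"C:\heartbeat.log")` left running (the path is hypothetical), calling `last_heartbeat(r"C:\heartbeat.log")` after the next lockup and reboot gives the last time the server was known to be alive, which is a much tighter window than "somewhere between 6PM and the backup".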
ASKER CERTIFIED SOLUTION
Josh

This solution is only available to members.
mastoo

"Sometimes, you'll get an answer that isn't what you want to hear; this doesn't make it a bad answer. So even if the answer you receive is not what you want to hear, it still may be the correct answer, and you still need to award points to the Expert that gave you that answer." (Referring to my first post, which said to go with Dell support.)
Josh

ASKER

mastoo,

I did not see this from your perspective and apparently did not give it enough thought when I did this. I completely agree with you and I apologize.
mastoo

Not a problem  :-)