Solved

Server seemed to shut down, without shutting down, need help diagnosing the problem.

Posted on 2007-10-19
Last Modified: 2008-03-06
We've got a Dell PE2900 here running Windows SBS 2003. The server had been working great until 3 weeks ago, when we ran into our first problem. At some point during the night the server "shut down". When I say shut down, it isn't actually shut down: the screen is blank, but the server is still powered on and there seems to be no way to wake it up.

Fast forward to this morning: I got a call from work early, same problem. There is no hibernation or anything similar set on the server, and there is nothing that I can see in the event log to help me diagnose the problem either.

I am really not sure where to look to try and find out exactly what is happening.
The first time it happened our backup did not run. Our backup usually runs at 10PM, so I know it happened before 10PM the first time. This time the backup ran and completed fine, so it must have happened after 10PM.

If anyone has any suggestions on where I should start looking for answers, it would be greatly appreciated. I want to figure this out before it happens again or happens more frequently. Also, I did not change anything on the server before or after this happened.

Thanks
Question by:Josh
15 Comments
 
LVL 21

Expert Comment

by:mastoo
ID: 20109274
If there's a warranty, go for tech support as your first best choice.  If not, I'll comment from the school of bitter experience.  If it's a server people need, it is going to be trial and error to get to the point where you trust the server again, so use another server while you troubleshoot this.  Easier said than done.  Check the Dell knowledge base.  When it dies, can you ping it or remote log on?  Are there any drive lights on?  What expansion cards does it have?  Is it on a UPS with software installed so you can be sure it isn't an over- or under-voltage event sending it out to lunch?  If you've got monitoring software, or in our case a programmer wrote something to test the server every 5 minutes, your cellphone can ring as soon as the server dies, which is better than the frantic phone call from a user.
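
The "test the server every 5 minutes" idea could be sketched roughly as below in Python. This is only a minimal sketch under stated assumptions, not the program mentioned above: the function names, the example address 192.0.2.10, and port 3389 (Remote Desktop) are all hypothetical, and the alert is just a callback (print by default) where a real setup would send a text message or email.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def monitor(host, port, interval_seconds=300, on_down=print, checks=None):
    """Probe host:port every interval_seconds and call on_down for each failure.

    checks limits the number of probes; None means run until interrupted.
    """
    done = 0
    while checks is None or done < checks:
        if not is_reachable(host, port):
            stamp = datetime.now().isoformat(timespec="seconds")
            on_down(f"{stamp} {host}:{port} unreachable")
        done += 1
        if checks is None or done < checks:
            time.sleep(interval_seconds)


# Hypothetical usage, run from a different machine than the server:
# monitor("192.0.2.10", 3389)
```

Run from a second machine, the timestamp of the first failed probe also narrows down when the server locked up to within one probe interval.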
 
LVL 1

Author Comment

by:Josh
ID: 20109321
Thanks for the info, I am going to look into some of the things you suggested. The server is under warranty, although I have had mixed results dealing with Dell; they can be difficult at times. When it goes down I cannot ping it or remote log in. It's almost as if everything software related is off and the hardware is all still running.
 
LVL 21

Expert Comment

by:mastoo
ID: 20109528
Yeah, they aren't perfect and even after they fix it once or twice you can't trust it until it withstands the test of time.  A temperature problem will usually cause a server to buzz or turn on a red light but you might also check that fans spin and aren't blocked with dust.

You can try things like guessing at a software problem (this kind of problem is probably hardware, driver, or power related, in that order), or swapping components and then holding your breath.
 
LVL 1

Author Comment

by:Josh
ID: 20109592
Well, there is no A/C in this server room and it gets pretty hot in the summer. While I am sure that wasn't good for the server, everything was fine. As for power, we have a new UPS hooked up to the server; however, the PowerChute software that came with the APC unit would not run reliably on the server. I made many calls to APC and finally they told me to just run the native Windows UPS software, a solution I wasn't entirely happy with.

I will have to take a look in the server and see how clean it is inside but I think it's probably ok.
 
LVL 21

Expert Comment

by:mastoo
ID: 20109705
You mean no dedicated server room AC but the building has general AC?  You might want to check that your building people aren't turning off the AC in the middle of the night:  server gets too hot and dies, but by the time you hear about it the AC is back on.

If you want, check the BIOS for watchdog functionality and turn that on.  Watchdog will sometimes detect a server is "dead" and reboot it - you'd see this as an "unexpected shutdown" in the event log.
 
LVL 1

Author Comment

by:Josh
ID: 20109730
I will. Actually, I mean there is no AC in the server room at all, or on this floor for that matter.
It gets very hot in the summer; however, it has not been very hot out lately, so I didn't think overheating would be an issue right now.

I didn't have any of these issues when it was 90 degrees and now it's 70.
I guess I've got a lot of things to take a look at.
 
LVL 1

Author Comment

by:Josh
ID: 20263950
Just to keep this up to date: I have yet to figure out what is causing this. It has happened again, for the first time since I first asked this question on EE, and it is very frustrating.
I have been in contact with Dell also and they have yet to figure out the problem. I am not sure if anyone else here can help, but if anyone has any more advice it would be greatly appreciated.
 
LVL 21

Expert Comment

by:mastoo
ID: 20264487
Did you look for drive lights when stuck?  I don't know that it holds as a general rule of thumb, but in my experience the causes for this have been (in decreasing frequency): SCSI (controller, cables, terminator), drive, mobo, power, other.  You can run something to exercise the drives and see if failures correlate with drive activity.  You could swap out the SCSI controller/cable/terminator and wait and see.  You can turn on process auditing and see if any particular process correlates with the failure.  Bug Dell some more.  Cable the UPS to another computer (anything, even a clunker) that the UPS software will install on so you can record power events that could affect the server.
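
One way to "exercise the drives" and leave a timestamped trail to compare against the next lock-up is sketched below in Python. It is an illustration only, assuming a writable directory on the volume under test; the function names, the temp file name, and the 16 MB default size are made up for the example. Note that the read-back may be served from the operating system's cache rather than the disk, so this mainly stresses the write path.

```python
import hashlib
import os
import time
from datetime import datetime


def exercise_once(directory, size_mb=16):
    """Write size_mb of random data in directory, read it back, and verify it.

    Returns (ok, seconds), where ok is True if the data read back matches.
    """
    path = os.path.join(directory, "disk_exercise.tmp")  # hypothetical name
    chunk = os.urandom(1024 * 1024)
    start = time.monotonic()
    written = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
            written.update(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            read_back.update(block)
    os.remove(path)
    return written.digest() == read_back.digest(), time.monotonic() - start


def log_pass(directory, log_path, size_mb=16):
    """Run one exercise pass and append a timestamped result line to log_path."""
    ok, seconds = exercise_once(directory, size_mb)
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} ok={ok} seconds={seconds:.2f}\n")
    return ok
```

Scheduling log_pass to run every few minutes overnight would show whether the freeze lines up with drive activity: the log simply stops at the last pass that completed. Keeping the log on a different disk or machine than the one being exercised is probably wise.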
 
LVL 1

Author Comment

by:Josh
ID: 20264709
The drive lights are on, no activity but they are on. I will have to get hold of Dell again.
This seems to happen when there is little to no activity going on with the server. It only happens during the evening; the only thing that runs then is the backup, but I have determined that it happens before the backup even runs.
I am really trying to narrow down the time frame of when this happens and see if there is something happening at that particular moment that makes it lock up, that is, if it's a software issue at all. It is all under warranty and I don't care if Dell has to send me a new server as long as the issue gets resolved.
 
LVL 1

Author Comment

by:Josh
ID: 20266274
Just another thing I want to add: the last entry in the event log was at about 6PM on Friday, after which nothing else was logged over the entire weekend until the server was rebooted this morning.

Is there some other way I could determine the exact time the server locked up, or at least get close?
All I know is that there is nothing in the event viewer after 6PM and our backup did not run at 11PM as scheduled.
 
LVL 21

Expert Comment

by:mastoo
ID: 20266310
Servers keep some kind of heartbeat so they can tell you when they died (usually).  I think you get a popup when logging in at the console, telling you "unexpected shutdown at xx:yy", and it gets logged in events.  But depending on how the server died you might not necessarily get this.
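
If the built-in record isn't there, a home-made heartbeat is one way to bracket the lock-up time. The Python sketch below is an assumption-laden illustration, not a Windows or Dell feature: the function names, the file path, and the 60-second interval are hypothetical. It appends a timestamp to a file at a fixed interval and forces each write to disk, so after a hard reboot the last line shows when the machine was last alive, to within one interval.

```python
import os
import time
from datetime import datetime


def write_heartbeat(path):
    """Append the current timestamp to path and force it out to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(datetime.now().isoformat(timespec="seconds") + "\n")
        f.flush()
        os.fsync(f.fileno())  # so the line survives a sudden freeze


def last_heartbeat(path):
    """Return the last recorded timestamp, or None if there is none yet."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return None
    return datetime.fromisoformat(lines[-1]) if lines else None


def run(path, interval_seconds=60, beats=None):
    """Write a heartbeat every interval_seconds; beats=None runs until stopped."""
    count = 0
    while beats is None or count < beats:
        write_heartbeat(path)
        count += 1
        if beats is None or count < beats:
            time.sleep(interval_seconds)


# Hypothetical usage: start run(r"C:\heartbeat.log") at boot, then after the
# next lock-up and reboot call last_heartbeat(r"C:\heartbeat.log").
```

The file grows by one short line per interval, so it would need trimming occasionally if left running for months.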
 
LVL 1

Accepted Solution

by:Josh
ID: 20976066
Just to update this question: I spoke with Dell again and it seems it was a firmware issue. They were very specific about how some of these updates have to be carried out. After working with the gentleman on the phone for some time and getting everything up to date properly, it seems this issue has been fixed.
 
LVL 21

Expert Comment

by:mastoo
ID: 21041200
"Sometimes, you'll get an answer that isn't what you want to hear; this doesn't make it a bad answer. So even if the answer you receive is not what you want to hear, it still may be the correct answer, and you still need to award points to the Expert that gave you that answer."  Referring to my first post to go with Dell support.
 
LVL 1

Author Comment

by:Josh
ID: 21042757
mastoo,

I did not see this from your perspective and apparently did not give it enough thought when I did this. I completely agree with you and I apologize.
 
LVL 21

Expert Comment

by:mastoo
ID: 21042833
Not a problem  :-)