Solved

Server room Airconditioning failed

Posted on 2007-11-29
Last Modified: 2013-11-10
Hi,

The A/C in our server room failed and we were unaware of it for about two hours. Normally I don't worry about this, because there is a stand-alone unit as well as the central cooling, so if one is off the other still cools the room. This morning the central system was shut down for maintenance (of which I was informed very late), and at the same time (or earlier) the stand-alone unit decided to quit. I don't know what caused it. A couple of hours later I noticed that the server room (3.5 by 1.6 m, with two racks, 6 servers, a 10 kVA UPS and switches; no ventilation except for the A/Cs; glass sliding doors) was very hot.

The room cooled down after I opened the doors, but there seems to be some damage to one of the servers. One HDD in this machine (RAID 5 with 5 SCSI HDDs) is showing a red indicator with a cross. The other 4 HDDs are OK, with green lights. I checked the server and see no problems; everything is working fine. I checked Disk Management and it says the logical drive is healthy. I need to know where I can find out whether something is wrong. Also, what does the red cross on the HDD mean (obviously it is still working)? Was this caused by the temperature rise, or could it be something that was already there and I missed it?

Also, for the other servers: how likely is it that the temperature caused damage? Is it true that temperatures up to 60C are bearable by the equipment? Is there anything I can set up that will monitor the temperature of the servers / server room and notify me?
I also need to know whether the servers will shut down if the temperature rises above its limits (before they get damaged). Is there a way to do this? Would my servers have shut down if it had continued for another two hours in the same condition? The thing is, I do not know what damage the servers took and need some idea of what might have happened.
Question by:ssosw
8 Comments
 
LVL 43

Expert Comment

by:ravenpl
ID: 20372856
On Linux servers I use http://www.lm-sensors.org/ for that purpose. I'm sure there are equivalents for Windows and other systems.
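
For illustration only, here is a minimal Python sketch of reading temperatures on a Linux box. It does not call lm-sensors itself; it assumes the kernel's hwmon sysfs files (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius) are present, which depends on the hardware and loaded drivers. The function names and the limit value are hypothetical.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {sensor_name: degrees_C} for every temp*_input file under base."""
    readings = {}
    for sensor_file in sorted(Path(base).glob("hwmon*/temp*_input")):
        try:
            # The kernel reports these values in millidegrees Celsius.
            millidegrees = int(sensor_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed sensor file: skip it
        name = f"{sensor_file.parent.name}/{sensor_file.name}"
        readings[name] = millidegrees / 1000.0
    return readings


def over_limit(readings, limit_c):
    """Keep only the sensors whose reading is above limit_c (an assumed limit)."""
    return {name: temp for name, temp in readings.items() if temp > limit_c}
```

Running `over_limit(read_hwmon_temps(), 60.0)` from a scheduled job would give the list of sensors to raise an alarm about; the right limit for each sensor has to come from the hardware vendor.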
 
LVL 21

Accepted Solution

by:
robocat earned 500 total points
ID: 20372967

>Is there anything that I can setup that will monitor the temperature on servers / Server room and notify me

There exist many systems for environmental monitoring, e.g. black box:

http://www.blackbox.com/Catalog/Detail.aspx?cid=425,1898,1899&mid=5277

>Is it true that temperatures up to 60C are bearable by the equipment

The internal equipment temperature can reach these levels, but the surrounding air should typically stay below 30C-35C to avoid damage.

>I also need to know if the servers will shut down if the temperatures rise above limits

This depends on the brand of the servers. Check the manufacturer's documentation. Servers that have remote management processors almost always have this.

E.g. on an HP server with an ILO board:

Sensor   Location            Status   Reading   Thresholds
Temp 1   I/O Board Zone      Ok       39C       Caution: 65C; Critical: 70C
Temp 2   Ambient Zone        Ok       18C       Caution: 40C; Critical: 45C
Temp 3   CPU 1               Ok       30C       Caution: 95C; Critical: 100C
Temp 4   CPU 1               Ok       30C       Caution: 95C; Critical: 100C
Temp 5   Power Supply Zone   Ok       23C       Caution: 60C; Critical: 65C
Temp 6   CPU 2               n/a      n/a       Caution: 95C; Critical: 100C
Temp 7   CPU 2               n/a      n/a       Caution: 95C; Critical: 100C
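
To make the threshold columns concrete, here is a small Python sketch that classifies a reading against its Caution and Critical limits, using the values from the listing above. This is not HP's firmware logic; in particular, treating a reading equal to the threshold as already exceeding it is an assumption, and `classify` is a hypothetical helper.

```python
def classify(reading_c, caution_c, critical_c):
    """Map a temperature reading to a status string.

    Assumption: a reading equal to a threshold counts as having reached it.
    """
    if reading_c is None:
        return "n/a"  # sensor not populated, e.g. an empty CPU socket
    if reading_c >= critical_c:
        return "Critical"
    if reading_c >= caution_c:
        return "Caution"
    return "Ok"


# (location, reading in C, caution in C, critical in C), taken from the listing
SENSORS = [
    ("I/O Board Zone", 39, 65, 70),
    ("Ambient Zone", 18, 40, 45),
    ("CPU 1", 30, 95, 100),
    ("Power Supply Zone", 23, 60, 65),
    ("CPU 2", None, 95, 100),
]

for location, reading, caution, critical in SENSORS:
    print(f"{location}: {classify(reading, caution, critical)}")
```

With these thresholds, an ambient reading of 42C would be reported as Caution and 45C as Critical, so the ambient sensor reacts long before the CPU sensors do.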


Can you tell us more about the kind of servers you're using ?


 
LVL 21

Expert Comment

by:robocat
ID: 20373031

>One HDD on this machine (RAID 5 with 5 SCSI HDDs) is showing a red indication with a cross. The other 4 HDDs are ok with a green light. Everything is working fine. I checked the Disk Management and it says the Logical drive is healthy.

The RAID is probably degraded (one failed disk) but, from a logical point of view, still working fine. Windows Disk Management can't see the failed drive, but the RAID diagnostics tool from your vendor will let you confirm that the disk has failed.

You need to replace the failed drive ASAP to avoid server failure should another disk fail.
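
As a rough illustration of why a RAID 5 array keeps working with one dead disk but not with two, here is a simplified Python sketch of XOR parity on a single stripe. Real controllers rotate the parity block across the disks and work on much larger blocks; the block contents and the helper name `xor_blocks` are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across 5 disks: 4 data blocks plus 1 parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)
stripe = data + [parity]

# Lose any single disk (here index 3) and rebuild it from the other four.
lost = 3
survivors = [block for i, block in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors)
assert rebuilt == stripe[lost]
```

With two blocks missing from the same stripe there is only one parity equation for two unknowns, so the data cannot be recovered. That is why a degraded array should be repaired quickly.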


 
LVL 26

Expert Comment

by:lnkevin
ID: 20373076
Yes, replace the failed disk with one of the same make and model, and of the same or a larger size. Allow 4 to 6 hours for the RAID 5 array to rebuild itself. Make sure you replace the drive while the server is up and running so that the array can rebuild.

K
 
LVL 55

Expert Comment

by:andyalder
ID: 20375433
The drive may have its own temperature sensor, in which case it might have turned itself off rather than failed. If so, you may get away with simply reseating it, and it will rebuild into the array. The manufacturer's diagnostics may still be able to interrogate it and tell you whether that is the case.
 

Author Comment

by:ssosw
ID: 20393798
Hi,

The server is an HP Proliant ML570.
The Compaq Integrated Management Log Viewer shows an entry that says:

        Drive Array Device Failure (Slot 1 Bus 1 Bay 4)

The thing is, the entry is dated 17th Nov. The temperature rise was last Thursday.
 
LVL 21

Expert Comment

by:robocat
ID: 20394323

The drive has indeed failed and should be replaced ASAP.

BTW: this server has an integrated ILO management processor. If this isn't already configured, you really should set it up. If you combine it with HP Insight Manager software, you can set up an early warning system that sends e-mail when certain temperature levels are exceeded.
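
The general idea of a temperature e-mail alert can be sketched in a few lines of Python. This is a generic illustration using the standard `smtplib` and `email` modules, not how HP Insight Manager works internally; the host name, addresses and function names are placeholders.

```python
import smtplib
from email.message import EmailMessage


def build_alert(sensor, reading_c, limit_c, sender, recipient):
    """Return an EmailMessage if reading_c is above limit_c, else None."""
    if reading_c <= limit_c:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Temperature alert: {sensor} at {reading_c:.0f}C"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"{sensor} reads {reading_c:.1f}C, above the limit of {limit_c:.1f}C."
    )
    return msg


def send_alert(msg, smtp_host="mail.example.com"):
    """Send the message through an SMTP relay (placeholder host name)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

A scheduled job would call `build_alert` with each fresh reading and pass any non-`None` result to `send_alert`. Note that the mail server should not sit in the same room as the equipment being monitored.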

 
LVL 55

Expert Comment

by:andyalder
ID: 20394356
Either the CMOS clock is incorrect or the disk failed weeks ago and you didn't notice. That's the problem with redundancy: if you do not monitor the system, you may have failures and not notice them.

Question has a verified solution.
