Solved

Testing for heat related damage on servers

Posted on 2008-10-27
Last Modified: 2012-05-05
Greetings Experts.  Our server room has experienced two severe overheating incidents within one weekend.  The AC failed, and since no one was here on Saturday, we weren't aware of the problem until three of the eight servers had failed or become unresponsive.  When we arrived on site, the server room had hit 105 degrees.  We powered down the remaining servers.  On Sunday, the building owner said the AC was fixed, so the servers were turned back on.  This morning the server room was again at 100+ degrees, so we went through the process of shutting everything down again.

I know there are a host of things we need to address (mainly the AC unit), but the question being asked is whether there are any tools you are familiar with that can determine if any hardware was damaged (a stress test, for example).  I did several searches on Google and wanted to check with the experts before I started installing third-party apps on my servers.

Thanks in advance for the help.
Question by:samiam41
8 Comments
 
LVL 31

Accepted Solution

by:
Paranormastic earned 250 total points
ID: 22813097
Here is a hardware monitor:
http://www.download.com/HWMonitor/3000-2094_4-10793486.html?tag=mncol&cdlPid=10820795

For testing the CPU:
http://www.download.com/Hot-CPU-Tester-Pro/3000-2086_4-10671809.html

For stress testing memory, look to your resource kit.  Leakyapp.exe simulates, well, a leaky app and will test your memory pretty well - just give it a while and keep an eye on things, as it will fill up most of your memory and start spilling over into virtual memory.
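As a rough illustration of what a fill-and-verify memory pass does, here is a minimal Python sketch. It is not a substitute for Leakyapp.exe or a dedicated memory tester: the function name `fill_and_verify`, the sizes, and the byte pattern are all hypothetical, and a user-space script like this only exercises whatever memory the OS happens to hand it.

```python
def fill_and_verify(total_mb=64, block_mb=8, pattern=0xAA):
    """Allocate memory in blocks, fill each with a byte pattern, then read it back.

    Returns the number of bytes that did not match the pattern (0 is expected).
    All defaults are illustrative assumptions, not recommended test sizes.
    """
    block_size = block_mb * 1024 * 1024
    expected = bytes([pattern]) * block_size

    # Allocate and fill every block first, so all of the memory is held at once.
    blocks = [bytearray(expected) for _ in range(total_mb // block_mb)]

    mismatches = 0
    for block in blocks:
        if block != expected:
            # Only count byte by byte when the quick whole-block compare fails.
            mismatches += sum(a != b for a, b in zip(block, expected))
    return mismatches


if __name__ == "__main__":
    print("mismatched bytes:", fill_and_verify())
```

Raising `total_mb` toward the amount of installed RAM pushes the machine into paging, so watch the system while it runs.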

General burn-in test:
http://www.download.com/BurnInTest-Professional/3000-2086_4-10892885.html

I would suggest a monitor that can at least alert you to issues:
http://sensorsview-pro.stv-software.qarchive.org/

Here is a decent list of various monitors if you want to browse a little:
http://cpu-temperature-monitor.qarchive.org/

Beyond that, there are a number of benchmark and burn-in testing apps out there, but I think this should get you to "comfortable."
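To show the kind of threshold check an alerting monitor performs, here is a minimal Python sketch. The function name `check_temperatures`, the sensor names, and the 95F/105F thresholds are assumptions chosen for illustration; real limits should come from your hardware manufacturer's specs, and how the readings are collected is left out entirely.

```python
def check_temperatures(readings_f, warn_f=95.0, critical_f=105.0):
    """Classify sensor readings (degrees F) against two hypothetical thresholds.

    readings_f maps a sensor name to its latest temperature.
    Returns a list of (sensor, level, temperature) for readings at or above warn_f.
    """
    alerts = []
    for sensor, temp in sorted(readings_f.items()):
        if temp >= critical_f:
            alerts.append((sensor, "CRITICAL", temp))
        elif temp >= warn_f:
            alerts.append((sensor, "WARNING", temp))
    return alerts


if __name__ == "__main__":
    # Made-up sample readings, not measurements from any real room.
    sample = {"rack1-inlet": 72.0, "rack2-inlet": 98.5, "room": 106.0}
    for sensor, level, temp in check_temperatures(sample):
        print(f"{level}: {sensor} at {temp:.1f}F")
```

In practice the returned list would feed an email or pager notification, so a weekend AC failure is noticed before servers start dropping.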
 
LVL 9

Author Comment

by:samiam41
ID: 22813118
Perfect!!  I will begin working with those now.  The results should help us figure out how bad it got.
 
LVL 32

Assisted Solution

by:aleghart
aleghart earned 250 total points
ID: 22816969
An ambient temperature of 100-105 degrees by itself would not necessarily crash your servers.  They should be able to operate up to 110-120 ambient, as long as there is sufficient airflow to keep the internals below 130-140F (see the specs from your hardware manufacturers).
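Spec sheets often list operating limits in Celsius, so a quick conversion helps when comparing the Fahrenheit figures above with a manufacturer's numbers. The helper names `f_to_c` and `headroom_f` below are hypothetical, and the 110F limit in the example is only the lower end of the range mentioned above, not a value from any particular spec.

```python
def f_to_c(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def headroom_f(ambient_f, max_ambient_f):
    """Degrees F remaining before the rated ambient limit; negative means over the limit."""
    return max_ambient_f - ambient_f


if __name__ == "__main__":
    for temp_f in (105.0, 130.0, 140.0):
        print(f"{temp_f:.0f}F = {f_to_c(temp_f):.1f}C")
    # Example: a 105F room measured against an assumed 110F rated ambient limit.
    print("headroom:", headroom_f(105.0, 110.0), "F")
```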

I can attest that I've had a server room operate repeatedly at 95+F for long weekends, even reaching 105F during the summer or when the building HVAC system was switched over to heat-only.

Those servers and switches are still running.  Most with the original hard drives from 2002-2004.
 
LVL 9

Author Comment

by:samiam41
ID: 22821108
@aleghart: You bring up a great point.  The problem was that this server room was very poorly designed, and since a government agency occupies it, no money has gone into keeping any of the HVAC or other environmental controls even close to standard.  It was the perfect storm (the fourth in the 4 years I have been here) and I would bet that, once again, nothing will be done.

The server room is 30'x45' and the server racks are nowhere near the flow of cool air.  The server rack has an open front, but since the room is so old, dust continues to build up on the front and back of the servers.  The entire building, including the server room, will be undergoing a huge renovation starting next month, which should take care of these problems.  It's like no environment I have ever worked in, and I hope that once it's fixed, I never see it again!

@Paranormastic: The first link you provided won't load for me.  I get a "webpage cannot be found" error once I click on the download now option.  The other links look good, except most require payment (which is probably about right).  I am looking at a couple of others and then I will accept your solution if another one doesn't come along.

Thanks for the suggestions and info.
 
LVL 9

Author Comment

by:samiam41
ID: 22821133
Oh yeah, we sit in with the servers (team of 4) so that doesn't help either.  Factor in the printers, additional computers, monitors and body heat, and the HVAC is almost always working to keep the temp at or near 68 degrees.  I wanted it a little colder, but since we have to sit in here, that was about as low as I could get it without causing strife.
 
LVL 32

Expert Comment

by:aleghart
ID: 22824328
68F is "old school" thought for a computer room.  72-74 is just fine.  Feeding ambient air and exhausting hot air is more important than the 68-degree target.

User comfort, energy requirements (and consumption), and equipment cost and maintenance all lead away from the refrigerator-like rooms.

Extremely cold rooms leave a buffer in case of HVAC failure.  There are some longevity claims, but I don't see them as valid when the typical replacement cycle is far less than 5 years.

It's actually cheaper to use two standard HVAC units at 72-74 degrees.  One can be cycled off duty when temps are OK.  Both on-line when it's hot.  We do this by using house air (free from the building cooling towers) plus an additional unit, which runs only 1/2 the time.

When the supplemental air is running, it is only cold in the path of the blower (<60F).  Just inside the rack boundary stays around 80.  Outside the rack it is 74.
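The "one unit cycles off when temps are OK" arrangement can be sketched as a simple on/off rule with a gap between the two setpoints, so the supplemental unit doesn't flip on and off around a single temperature. This is only an illustration: the function name and the 76F/73F setpoints are assumptions, not the settings used in the room described above.

```python
def supplemental_unit_state(room_temp_f, currently_on, on_at_f=76.0, off_at_f=73.0):
    """Decide whether the supplemental cooling unit should run.

    Turns on at or above on_at_f, turns off at or below off_at_f,
    and otherwise keeps its current state (hysteresis).
    """
    if room_temp_f >= on_at_f:
        return True
    if room_temp_f <= off_at_f:
        return False
    return currently_on


if __name__ == "__main__":
    state = False
    # Made-up room temperatures over time, in degrees F.
    for temp in (72.0, 75.0, 77.0, 75.0, 74.0, 73.0, 72.0):
        state = supplemental_unit_state(temp, state)
        print(f"{temp:.0f}F -> supplemental unit {'ON' if state else 'off'}")
```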

 
LVL 9

Author Comment

by:samiam41
ID: 22870575
Thanks aleghart.

I am going to close this question out as it's been a week since the "inferno" and I haven't seen hardware failures yet.  I am going to monitor them but since I am trying to get this building's renovation and wiring back on track, I won't be able to spend any more time with this question.

I appreciate everyone's help and I am working on the points now.
 
LVL 9

Author Closing Comment

by:samiam41
ID: 31512855
Thanks again for your time and suggestions.  Take care!

Regards,
Aaron