Explanation for outage

Hi,

I would like to pick your brains on something that recently occurred at a customer site, for which I am trying to find a reasonable explanation.
Within four days, three H700 RAID controllers failed in three different Dell R510 servers, each about one year old. Replacing each controller with a new one sent by Dell solved the problem.

One could argue coincidence, but I think that is statistically hard to defend.
I suggested an environmental cause, which the customer doubts.
The servers are located in a server room close to a factory where paper is printed and handled. Paper dust might be an issue.
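
To put a rough number on the coincidence argument, here is a minimal sketch. It assumes a constant, independent failure rate per controller, and the 2% annual failure rate is purely a placeholder (not a Dell figure and not measured at this site); the helper names are made up for illustration.

```python
import math

def p_fail_within(afr: float, days: float) -> float:
    """Probability that one unit fails within `days`, assuming a constant
    failure rate derived from an annual failure rate `afr`."""
    rate_per_year = -math.log(1.0 - afr)
    return 1.0 - math.exp(-rate_per_year * days / 365.0)

def p_all_fail_independently(afr: float, days: float, units: int) -> float:
    """Probability that `units` independent units all fail within the same
    fixed window of `days`."""
    return p_fail_within(afr, days) ** units

# ASSUMPTION: 2% annual failure rate, a placeholder chosen for illustration.
ASSUMED_AFR = 0.02

p_one = p_fail_within(ASSUMED_AFR, 4)
p_three = p_all_fail_independently(ASSUMED_AFR, 4, 3)
print(f"one controller failing in a given 4-day window: {p_one:.2e}")
print(f"three independent controllers failing in it:    {p_three:.2e}")
```

Whatever rate is plugged in, the chance of independent failures shrinks with the cube of the single-unit probability, which is why a shared cause looks more plausible than coincidence.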

The customer is also located next to a high-voltage installation. Maybe something in its configuration has changed and altered the characteristics of the power supplied. Mind you, we do have a UPS connected to the servers, which is supposed to "clean" the power.
Of course, it could be the UPS itself ...

"A virus" has been suggested as well, but I don't see how that could affect a RAID controller, let alone under the Linux operating system in use.

What are your thoughts?

Thank you,

Bart Coninckx

Asked by: bcnx71
 
rindi commented:
I don't think dust (from the paper factory) could cause such issues; maybe others, but not this one. To me the high-voltage installation is a more likely cause. Maybe you don't have good surge protectors, and better earthing of the server room could help. Make sure the server racks are made of metal and properly earthed; that would give you a sort of Faraday cage and could help.
 
speak2ab commented:
If all the controllers that broke down are connected to the same UPS, that could be the culprit. You might want to check that before connecting a new device to it.

The paper dust is a possibility, but three breakdowns in such quick succession make it a less probable cause in this situation.

If voltage influence from the environment is a possibility... I would say anything can happen! If you know of this possibility, why not eliminate it? I wouldn't gamble with voltage issues on any device.

A virus? Well, it's remote, but it is possible.
 
bcnx71 (author) commented:
I tend to agree. Dust seemed a plausible explanation to me after the first two cards broke down, but it went out the window after the third one broke.

The customer is not very aware of the consequences of, for instance, static electricity: I saw him touch one RAID card with his bare hands after having walked on carpet, without discharging himself first.

Other cards survived, though, so I guess it's also safe to say that this type of card is very sensitive.

Thank you,

B.
 
wolfcamel commented:
I agree, most likely power supply related. You may find that this model of card has a design flaw that makes it less tolerant of power supply or bad earth issues.
Typically the cards inside a system run on low voltage from the server's power supply, which is reasonably well regulated and tolerant of input supply variations.
However, the earth is rarely regulated or isolated: the earth of the RAID card will be directly connected to the chassis earth, which in turn will no doubt be connected to the main earth or the rack.

It may pay to get an electrician to check the earthing. The rack should ideally be well grounded to something that actually goes into the ground.

I have often seen large voltage spikes when the printing presses switch on and off.
 
CSorg commented:
What do the UPS log files tell you?
 
rindi commented:
I've just rethought the dust theory and want to throw this in:

Too much dust can cause cooling issues. Servers of the same age will have accumulated about the same amount of dust, so if they get too hot because of it, it is quite possible that the same parts will break on each of them. On the other hand, I had assumed the personnel of that company to be aware of the dust situation and to clean the servers and server room more frequently than you would in a more controlled environment. That's why I first said I wouldn't suspect dust to be the reason.

But of course, if that company didn't take any such precautions... (and as you mentioned, the customer doesn't seem to care about static, which may imply that he didn't care about dust either), I now think there is quite a good chance that this is the issue...
 
bcnx71 (author) commented:
Thank you for your thoughts. I considered the dust more as something that could act as a chemical agent on electrical connections. There is not enough dust to produce these cooling problems.

Something else I forgot to mention: at one point there were renovations involving drywall just outside the server room. During a visit I noticed a lot of white power in the servers.
I advised the customer to clean them urgently, and so we did, with high-pressure air. Loads of white powder came out of the servers. This was a couple of months ago, though, and it would not entirely explain why just these three identical servers were affected.

B.
 
CSorg commented:
"white power in the servers"

Oh, that is bad! :-)
 
bcnx71 (author) commented:
The client insists on the theory that the damage may have been done by a virus affecting the controllers.
What are your thoughts on this?

 
rindi commented:
That's highly unlikely. It would have to be something that can change the controller's firmware. It might be possible if the user updated the firmware using an update file that didn't come from Dell or had been manipulated, but that should be verifiable: if he still has the update files, he could let Dell examine them, or, if there is a checksum for the original update file on the Dell site, you could check whether the checksums of the installed update files match it...
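
The checksum comparison could be sketched roughly like this. It assumes the vendor publishes a SHA-256 digest for the update file (the algorithm actually listed may differ) and that the file used for the update is still around; the function names and the file name in the usage comment are hypothetical.

```python
import hashlib
import hmac

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path: str, published_hex: str) -> bool:
    """Compare a local file's digest with the published one,
    ignoring case and surrounding whitespace in the published value."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)

# Usage (hypothetical file name; the digest is copied from the vendor's page):
# matches_published("h700_firmware_update.bin", "<published sha-256 digest>")
```

A mismatch would only show that the file differs from the published one; it would not by itself prove that the controllers were damaged by it.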
 
CSorg commented:
Chances of that being caused by a virus are very, very slim.

Do all affected controllers have the same revision?
 
speak2ab commented:
Whenever there is a problem it is always easier to blame a virus, and admittedly the possibility of a virus always exists. However, in this specific case, as in most hardware cases, you need to take some basic troubleshooting steps before you leap to the virus conclusion.

So I would suggest that a virus becomes a candidate only if the problem still persists after taking care of:
1. the voltage issues
2. the UPS concerns, and
3. the dust and white powder issues

As a precaution, and to satisfy your client, why not put together a set of antivirus tools for him and run a complete virus scan?
 
rindi commented:
One more thought about the dust: I don't think dust alone, and paper dust in particular (which is basically cellulose or wood based and therefore chemically pretty stable), could cause chemical reactions on electronic boards. You'd probably need at least some moisture to start most reactions, and even then it would affect all electronic boards, not just some identical RAID controllers.
 
bcnx71 (author) commented:
@speak2ab: the OS is Linux, and as far as I know there are not a lot of Linux viruses around. In this case we would more specifically have to find a virus that attacks the firmware of one particular manufacturer. Both seem unlikely to me. I do concur with your order of possibilities. I contacted the high-voltage supplier next door to see if they changed anything.

@rindi: I agree, although these controllers could be more vulnerable to the problem. What if, for instance, this dust can carry static electricity better than regular dust and affects vulnerable components on just this controller ...
 
rindi commented:
Static within dust shouldn't be able to harm electronic components when those components are within the server and things are properly grounded. It might be an added risk when carried outside the server and not properly protected during that time.
Question has a verified solution.
