Solved

Explanation for outage

Posted on 2011-09-14
Last Modified: 2012-05-12
Hi,

I would like to pick your brains on something that recently occurred at a customer site and for which I am trying to find a reasonable explanation.
Within four days, three H700 RAID controllers failed in three different Dell R510 servers, each about one year old. Replacing each failed controller with a new one sent by Dell solved the problem.

One could argue coincidence, but I think that is statistically implausible.
I argued for an environmental cause, which the customer doubts.
The servers are located in a server room close to a factory where paper is printed and handled. Paper dust might be an issue.
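As a rough sanity check on the coincidence argument, here is a minimal sketch of the arithmetic. It assumes the controllers fail independently at a constant rate, and the 2% annual failure rate is purely an illustrative guess, not a figure from Dell or from this site. The function names are made up for the example.

```python
import math


def window_failure_prob(afr: float, days: float) -> float:
    """Probability that one unit fails within `days`, assuming a constant
    hazard rate derived from the annual failure rate `afr` (0 < afr < 1)."""
    rate_per_day = -math.log(1.0 - afr) / 365.0
    return 1.0 - math.exp(-rate_per_day * days)


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial tail: probability that at least k of n independent units fail,
    each with failure probability p."""
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # Illustrative assumptions only: 2% annual failure rate, a fixed 4-day window,
    # and exactly three controllers, all of which fail.
    p = window_failure_prob(afr=0.02, days=4)
    print(f"per-controller chance in the window: {p:.3e}")
    print(f"all three failing in the same window: {prob_at_least(3, 3, p):.3e}")
```

Under these assumptions the result is vanishingly small for any one given window, which is why a shared cause looks more plausible than three independent failures. The numbers shift with the assumed failure rate and the number of servers, but not by enough to change that conclusion.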

The customer is also located next to a high-voltage installation. Maybe something in its configuration has changed and altered the characteristics of the power supplied. Mind you, we do have a UPS connected to the servers, which is supposed to "clean" the power.
Of course it could be the UPS itself ...

"A virus" has been suggested as well, but I don't see how that could affect a RAID controller, nor the Linux operating system in use.

What are your thoughts?

thank you,

Bart Coninckx

Question by:bcnx71
15 Comments
 
LVL 87

Accepted Solution

by:
rindi earned 50 total points
ID: 36534577
I don't think dust (from the paper factory) could cause this particular issue, though it might cause others. To me the high-voltage setup is the more likely cause. Maybe you don't have good surge protectors, and better earthing of the server room could help. Make sure the server racks are metal and properly earthed; that gives you a sort of Faraday cage and could help.
0
 
LVL 5

Assisted Solution

by:speak2ab
speak2ab earned 50 total points
ID: 36534777
If all the controllers that broke down are connected to the same UPS, that could be the culprit. You might want to check that out before installing a new device to it.

The paper dust is a possibility, but the fact that there were 3 breakdowns in such short succession makes it a less probable cause in this situation.

If a voltage influence from the environment is a possibility.... I will say anything can happen! If you know of this possibility, why not avoid it? I won't gamble with voltage issues with any gadget.

A virus? Well yeah it's remote but c'est possible.
0
 

Author Comment

by:bcnx71
ID: 36534839
I tend to agree. To me the dust was a plausible explanation after the first two cards broke down, but it went out the window after the third one broke.

The customer is less aware of the consequences of, for instance, static electricity: I saw him touch one RAID card with his bare hands after having walked on carpet, without discharging himself first.

Other cards survived though, so I guess it's also safe to say that this type of card is very sensitive.

Thank you,

B.
0
 
LVL 20

Assisted Solution

by:wolfcamel
wolfcamel earned 25 total points
ID: 36534855
I agree - most likely power supply related. You may find that this model of card has a design fault/flaw that makes it less tolerant of power supply or bad earth issues.
Typically the cards inside a system run on low voltage from the server's power supply, which is reasonably well regulated and tolerant of input supply variations.
However, the earth is rarely regulated or isolated - the earth of the RAID card will be directly connected to the chassis earth, which in turn will no doubt be connected to the main earth or the rack.

It may pay for you to get an electrician to check the earthing - the rack should ideally be well grounded to something that actually goes into the ground.

I have often seen large voltage spikes when printing presses switch on and off.
0
 
LVL 7

Expert Comment

by:CSorg
ID: 36534930
what do the UPS log files tell you?
0
 
LVL 87

Expert Comment

by:rindi
ID: 36534968
I've just rethought the dust theory and now just want to throw this in:

Too much dust can cause cooling issues, and the servers can then get too hot. Servers of the same age will have accumulated about the same amount of dust, so it is quite possible that the same parts will break on each of them. On the other hand, I had assumed the personnel of that company would be aware of the dust situation and would clean the servers and the server room more frequently than you would in a more controlled environment. That's why I first said I wouldn't suspect dust to be the reason.

But of course if that company didn't take any such precautions... (and as you mentioned, the customer doesn't seem to care about static, which may also imply that he didn't care about dust), I now think there is quite a good chance that this is the issue...
0
 

Author Comment

by:bcnx71
ID: 36534994
Thank you for your thoughts. I was thinking of the dust more as a possible chemical agent acting on electrical connections. There is not enough dust to produce these cooling problems.

Something else I forgot to mention: at one point there were renovations involving drywall just outside the server room. During a visit I noticed a lot of white powder in the servers.
I advised the customer to clean them urgently, and so we did, with high-pressure air. Loads of white powder came out of the servers. This was a couple of months ago and would not entirely explain why just these three identical servers were affected.

B.
0
 
LVL 7

Expert Comment

by:CSorg
ID: 36535017
white power in the servers


oh that is bad! :-)
0
 

Author Comment

by:bcnx71
ID: 36535028
The client insists on the theory that the damage may have been done by a virus affecting the controllers.
What are your thoughts on this?

0
 
LVL 87

Assisted Solution

by:rindi
rindi earned 50 total points
ID: 36535079
That's highly unlikely. It would have to be something that can change the controller's firmware. Maybe if the user updated the firmware using an update file that didn't come from Dell or had been manipulated. But that should be verifiable if he still has the update files and allows Dell to examine them, or if there is a checksum for the original update file on the Dell site and the checksums of the installed updates don't match it...
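The checksum comparison could be sketched roughly like this. It is only an illustration: whether a checksum is published for a given update file, and with which algorithm, is an assumption to check on the download page, and the file name and checksum value in the usage comment are hypothetical placeholders.

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha256",
                  chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of the file at `path`, read in chunks so that
    large firmware images do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str,
                      algorithm: str = "sha256") -> bool:
    """Compare the file's checksum with the vendor-published value,
    ignoring case and surrounding whitespace."""
    return file_checksum(path, algorithm) == published_hex.strip().lower()


# Hypothetical usage (file name and checksum are placeholders, not real values):
#   matches_published("h700_firmware_update.bin", "<checksum from vendor page>")
```

A mismatch only shows that the file differs from the published one. It does not show what was flashed onto the controller, so the vendor would still need to examine the cards themselves.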
0
 
LVL 7

Expert Comment

by:CSorg
ID: 36535485
chances of that being caused by a virus are very very slim.

do all affected controllers have the same revision?
0
 
LVL 5

Assisted Solution

by:speak2ab
speak2ab earned 50 total points
ID: 36535527
Whenever there is a problem, it is always easier to blame a virus, and of course the possibility of a virus always exists. However, in this specific case, as in most hardware cases, you need to take some basic troubleshooting steps before you leap to the virus conclusion.

So I would suggest that a virus only becomes a plausible suspect if the problem still persists after taking care of:
1. the voltage issues
2. the UPS concerns, and
3. the dust and white powder issues

As a precaution, and to satisfy your client, why not put together a set of antivirus tools for him and run a complete virus scan?
0
 
LVL 87

Expert Comment

by:rindi
ID: 36535628
One more thought about the dust: I don't think dust alone, and paper dust in particular (which is basically cellulose or wood based and therefore chemically pretty stable), could cause chemical reactions on electronic boards. You'd probably need at least some sort of moisture to start most reactions, and even then it would affect all electronic boards, not just some identical RAID controllers.
0
 

Author Comment

by:bcnx71
ID: 36535846
@speak2ab: the OS is Linux, and as far as I know there are not a lot of Linux viruses around. In this case we would, more specifically, have to find a virus that attacks the firmware of one particular manufacturer. Both seem unlikely to me. I do concur with your order of possibilities. I have contacted the high-voltage supplier next door to see if they changed anything.

@rindi: I agree, although these controllers could be more vulnerable to the problem. What if, for instance, this dust can carry static electricity better than regular dust and affects vulnerable components on just this controller ...
0
 
LVL 87

Expert Comment

by:rindi
ID: 36535916
Static within dust shouldn't be able to harm electronic components when those components are within the server and things are properly grounded. It might be an added risk when carried outside the server and not properly protected during that time.
0
