bcnx71 (Belgium) asked:

Explanation for outage

Hi,

I would like to pick your brains on something that recently occurred at a customer's site, for which I am trying to find a reasonable explanation.
Within four days, three H700 RAID controllers failed in three different Dell R510 servers, each about one year old. Replacing each controller with a new one sent by Dell solved the problem.

One could argue coincidence, but I think this is statistically challenging.
I argued something environmental, which the customer doubts.
The servers are located in a server room close to a factory where paper is printed and handled. Paper dust might be an issue.
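
As a rough check on the "statistically challenging" intuition, here is a minimal sketch of the odds that independent failures would cluster like this. The 3% annual failure rate is purely an assumed figure for illustration (no real H700 failure data is given in this thread), and the function name is hypothetical. It assumes a constant failure rate per controller and treats the three controllers as the whole population.

```python
import math


def prob_at_least_k_failures(n_units, k, annual_failure_rate, window_days):
    """Probability that at least k of n_units fail within one given window,
    assuming independent units with a constant hazard rate."""
    # Convert the assumed annual failure probability to a per-day hazard rate.
    hazard_per_day = -math.log(1.0 - annual_failure_rate) / 365.0
    # Probability that a single unit fails somewhere inside the window.
    p_window = 1.0 - math.exp(-hazard_per_day * window_days)
    # Binomial tail: k or more failures among n_units independent units.
    return sum(
        math.comb(n_units, i) * p_window**i * (1.0 - p_window) ** (n_units - i)
        for i in range(k, n_units + 1)
    )


if __name__ == "__main__":
    # ASSUMPTION: 3% annual failure rate per controller (illustrative only).
    p = prob_at_least_k_failures(
        n_units=3, k=3, annual_failure_rate=0.03, window_days=4
    )
    print(f"P(all 3 fail in one given 4-day window) = {p:.2e}")
```

This covers one specific four-day window rather than any window in a year, so it understates the chance somewhat. Still, under the independence assumption the result is small enough to make a common cause (environment, power, handling, or a shared hardware revision) the more natural explanation.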

The customer is also located next to a high voltage setup. Maybe something in the configuration has changed that altered the characteristics of the power supplied. Mind you we do have a UPS battery connected to the servers, which is supposed to "clean" the power.
Of course it could be the UPS itself ...

"A virus" has been suggested as well, but I don't see how that could affect a RAID controller, nor the Linux operating system in use.

What are your thoughts?

thank you,

Bart Coninckx

ASKER CERTIFIED SOLUTION
rindi (Switzerland)

bcnx71 (ASKER):

I tend to agree. Dust seemed to me a plausible explanation after the first two cards failed, but that theory went out the window when the third one broke.

The customer is less aware of the consequences of, for instance, static electricity: I saw him touch one RAID card with his bare hands after walking on carpet, without discharging himself first.

Other cards survived though, so I guess it's also safe to say that this type of card is very sensitive.

Thank you,

B.
what do the UPS log files tell you?
I've just rethought the dust theory and now just want to throw this in:

Too much dust can cause cooling problems, and if the servers run too hot as a result, parts will fail. Servers of the same age will have accumulated about the same amount of dust, so it is quite possible that the same part breaks on each of them at around the same time. On the other hand, I had assumed that the company's personnel were aware of the dust situation and cleaned the servers and the server room more frequently than one would in a more controlled environment, which is why I first said I wouldn't suspect dust.

But of course, if that company didn't take any such precautions (and as you mentioned, the customer doesn't seem to care about static, which may imply that he didn't care about dust either), I now think there is quite a good chance that this is the issue.
bcnx71 (ASKER):

Thank you for your thoughts. I was thinking of the dust more as something that could act as a chemical agent on electrical connections. There is not enough dust to cause cooling problems.

Something else I forgot to mention is that at one point there were renovations involving drywall just outside the server room. During a visit I noticed a lot of white powder in the servers.
I advised the customer to clean them urgently, and so we did, with high-pressure air. Loads of white powder came out of the servers. This was a couple of months ago, and it would not entirely explain why just these three identical servers were affected.

B.
white powder in the servers


oh that is bad! :-)
bcnx71 (ASKER):

The client insists on the theory that the damage may have been done by a virus affecting the controllers.
What are your thoughts on this?

chances of that being caused by a virus are very very slim.

do all affected controllers have the same revision?
One more thought about the dust: I don't think dust alone, and paper dust in particular (which is basically cellulose or wood based and therefore chemically pretty stable), could cause chemical reactions on electronic boards. You'd probably need at least some moisture to start most reactions, and even then it would affect all electronic boards, not just some identical RAID controllers.
bcnx71 (ASKER):

@speak2ab: the OS is Linux, and as far as I know there are not a lot of Linux viruses around. In this case we would, more specifically, have to find a virus that attacks the firmware of one particular manufacturer. Both seem unlikely to me. I do concur with your order of possibilities. I contacted the high voltage supplier next door to see if they changed anything.

@rindi: I agree, although these controllers could be more vulnerable to the problem. What if, for instance, this dust carries static electricity better than regular dust and affects vulnerable components on just this controller ...
Static within dust shouldn't be able to harm electronic components when those components are within the server and things are properly grounded. It might be an added risk when carried outside the server and not properly protected during that time.