How do other admins monitor random RAID controllers?

I inherited one serious hodge-podge network with a dozen different brands, models and ages of servers. Many of these have built-in RAID controllers, but there is no central monitoring utility. Most don't even have email notification should a drive fail (kind of defeats the purpose, no?). I was wondering what other people were doing in this scenario. Is there some sort of central SMART monitoring software that could give me a clue? Is it common for a RAID controller to put anything in the SMART data to indicate a drive had failed? Ultimately I'd like to configure Opsview (Nagios/NSClient++ based monitoring software) to be able to report on it, but for now I'd simply settle for the warm fuzzy of knowing an email would be sent to me should I lose a disk somewhere. Constructive thoughts? Comments?

David

RAID controllers all have vendor-unique APIs. Nothing on the planet monitors everything. Also, SMART software won't help here, as SMART is only good for physical devices. Sure, some RAID vendors use it internally, but they don't present it to the operating system without special software.

Because it can't ... by design, the O/S sees just one disk. For example, if you have a RAID5, the controller presents the logical volume, which shows as online even if you have a drive failure.

Now there are some products out there that will drill inside a few 3rd party controllers ... so what do you have (and what O/S)?

Anyway, there are some SNMP packages, but they only kick in once you have a program that knows how to drill into the RAID and can then send off an SNMP alert.

What do people do in the real world? Well, it costs less money and is ultimately better to junk as many controllers as possible, and standardize on a family that has a good mix of support and device types. When you standardize, you pay more up front, but you save money in the long run because you know to buy disks that are qualified, and you have no issues with device management because the vendor has something that works across the board.

That is why people standardize on things like HP or Dell servers ... or they standardize on LSI controllers, which pretty much run on all operating systems and come in a wide range, from RAID1/10-only models under $100 to controllers that can handle hundreds of drives and cost thousands of dollars.

dax_bad

We're monitoring our HP/Dell/IBM hardware with WhatsUp Gold Premium through SNMP. All you need is the MIB packages and the hardware vendor's own monitoring software installed. You can often extract the MIB packages from the vendor monitoring software, then use a MIB walker to identify the OIDs and use WhatsUp Gold (or any other monitoring tool that supports SNMP) to check the state of the hardware. Because you sometimes have a lot of different models with different RAID controllers, we found that it is not possible to create a fully generic monitor template, but you have 2 options:

1) Non-generic - Split the templates by model, with a template for each drive in the server, one for the battery and one for the controller state.
2) Semi-generic - Look at the overall state of the RAID controller (a warning/failed state indicates a drive or battery error). This can be semi-generic, and we found we could cover all our 15 different HP models with around 4 templates (the instance still sometimes varies within the same model; I guess it depends on the firmware level).
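
As a rough illustration of the semi-generic approach above, here is a minimal Python sketch that turns a controller's overall-condition value into a Nagios-style state. It assumes the value has already been fetched over SNMP by whatever tool you use, and that the vendor MIB enumerates the condition as other(1), ok(2), degraded(3), failed(4). That mapping and the function names are assumptions for illustration, so check them against your own vendor MIB.

```python
from enum import IntEnum


class NagiosState(IntEnum):
    """Conventional Nagios plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# ASSUMPTION: the overall-condition OID returns an integer enumerated as
# other(1), ok(2), degraded(3), failed(4). Verify against your vendor MIB.
CONDITION_TO_STATE = {
    1: NagiosState.UNKNOWN,
    2: NagiosState.OK,
    3: NagiosState.WARNING,
    4: NagiosState.CRITICAL,
}

# Severity order used when several controllers are checked together.
SEVERITY = [NagiosState.OK, NagiosState.UNKNOWN,
            NagiosState.WARNING, NagiosState.CRITICAL]


def classify_condition(raw_value):
    """Map the raw (numeric string) value of a condition OID to a state."""
    try:
        code = int(raw_value.strip())
    except ValueError:
        return NagiosState.UNKNOWN
    return CONDITION_TO_STATE.get(code, NagiosState.UNKNOWN)


def worst_state(raw_values):
    """Return the most severe state across all controller instances."""
    states = [classify_condition(v) for v in raw_values]
    if not states:
        return NagiosState.UNKNOWN
    return max(states, key=SEVERITY.index)
```

Because only the overall condition is inspected, the same check can be reused across models; the per-model templates then only need to supply the right OID instance.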

If you need some more details let me know. I can give you some OIDs, at least for HP models, since 95% of our server farm is HP (1000+).

Cheers
Daniel

ASKER CERTIFIED SOLUTION

Member_2_231077

dax_bad

Andyalder: RAID controllers don't write disk accelerator battery failures to the event log, but yes, disk failures usually are written, or at least the controller can be set to do this through the vendor monitoring software. You would, however, still need a central monitoring tool to make any use of it in larger server farms.

Cheers
Daniel

Member_2_231077

LSI and HP ones do, not sure about the other ones out there.

sifugreg

ASKER

Not very sexy but until I can get them all replaced, I've created a custom filter in my monitoring process to alert me of multiple warnings or any critical messages sent to the System Event Log. Don't know why I didn't think about that.
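
For anyone wanting to do something similar, the filter rule can be sketched roughly like this in Python. This is only an illustration of the "multiple warnings or any critical" logic, not the actual filter from my monitoring tool: the `Event` record, the level names and the threshold of 2 are all assumptions, and collecting the events from the System Event Log is left to whatever agent you already run.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical, simplified event-log record."""
    host: str
    level: str      # e.g. "information", "warning", "critical"
    message: str


def hosts_to_alert(events, warning_threshold=2):
    """Return {host: reason} for hosts that should trigger an alert.

    A host alerts on any critical event, or once it has logged at least
    `warning_threshold` warnings (assumed value for "multiple").
    """
    warnings = Counter()
    alerts = {}
    for event in events:
        level = event.level.lower()
        if level == "critical":
            # Keep the first critical message seen for each host.
            alerts.setdefault(event.host, "critical: " + event.message)
        elif level == "warning":
            warnings[event.host] += 1
    for host, count in warnings.items():
        if count >= warning_threshold and host not in alerts:
            alerts[host] = f"{count} warnings"
    return alerts
```

The returned dictionary can then be handed to whatever sends the email, one message per host.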