How do other admins monitor random RAID controllers?

Posted on 2011-02-15
Last Modified: 2012-05-11
I inherited one serious hodge-podge network with a dozen different brands, models and ages of servers. Many of these have built-in RAID controllers but there is no central monitoring utility. Most don't even have email notification should a drive fail (kind of defeats the purpose, no?). I was wondering what other people were doing in this scenario. Is there some sort of central SMART monitoring software that could give me a clue? Is it common for a RAID controller to put anything in the SMART data to indicate a drive had failed? Ultimately I'd like to configure Opsview (Nagios/NSClient++ based monitoring software) to be able to report on it, but for now I'd simply settle for the warm fuzzy of knowing an email would be sent to me should I lose a disk somewhere. Constructive thoughts? Comments?
Question by:sifugreg

Expert Comment

ID: 34903486
RAID controllers all have vendor-unique APIs. Nothing on the planet monitors everything. Also, generic SMART software is pointless here, as SMART is only good for physical devices. Sure, some RAID vendors use it, but they don't present it to the operating system without special software.

They can't: by design, the O/S sees just one disk. For example, if you have a RAID5, the controller presents the logical volume, which shows as online even if you have a drive failure.

Now there are some products out there that will drill inside a few 3rd party controllers ... so what do you have (and what O/S)?

Anyway, there are some SNMP packages, but they only kick in once you have a program that knows how to drill into the RAID and can then send off an SNMP alert.

What do people do in the real world? Well, it costs less money and is ultimately better to junk as many controllers as possible, and standardize on a family that has a good mix of support and device types. When you standardize, you pay more up front, but you save money in the long run because you know to buy disks that are qualified, and you have no issues with device management because the vendor has something that works across the board.

That is why people standardize on things like HP or Dell servers ... or they standardize on LSI controllers, which pretty much run on all operating systems and cover a wide range, from RAID1/10-only models under $100 to controllers that can handle hundreds of drives and cost thousands of dollars.


Expert Comment

ID: 34904330
We're monitoring our HP/Dell/IBM hardware with WhatsUp Gold Premium through SNMP. All you need are the MIB packages and the HW vendor's own monitoring software installed. You can often extract the MIB packages from the vendor monitoring software, then use a MIB walker to identify the OIDs and use WhatsUp Gold (or any other monitoring tool supporting SNMP monitoring) to check the state of the hardware. Because you sometimes have a lot of different models with different RAID controllers, we found that it is not possible to create a fully generic monitor template, but you have 2 options:

1) Non-generic - Split the templates into model-specific ones, with a template for each drive in the server, one for the battery and one for the controller state.
2) Semi-generic - Look at the overall state of the RAID controller (a warning/failed state indicates a drive or battery error). This can be semi-generic, and we found we could cover all our 15 different HP models with around 4 templates (the instance still sometimes varies among servers of the same model; I guess it depends on the firmware level).
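
As a rough illustration of the semi-generic option in 2), the "overall controller state" check could look something like the Python sketch below. Everything named here is an assumption for illustration only: the numeric condition codes, the `State` names and the `classify_controller` helper are hypothetical, the real values have to come from the vendor's MIB, and the polled values are assumed to have been fetched already by whatever SNMP tool is in use.

```python
from enum import IntEnum


class State(IntEnum):
    """Monitor states, ordered so that a higher value is more severe."""
    OK = 0
    UNKNOWN = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical mapping from a raw controller condition code to a monitor
# state. Real codes differ per vendor and must be read from the vendor MIB.
CONDITION_TO_STATE = {
    1: State.UNKNOWN,
    2: State.OK,
    3: State.WARNING,
    4: State.CRITICAL,
}


def classify_controller(polled):
    """Return the worst state across all polled controller instances.

    `polled` maps an instance index (which may differ between firmware
    levels) to the raw integer condition already read over SNMP.
    """
    if not polled:
        # Nothing answered, so do not pretend the controller is healthy.
        return State.UNKNOWN
    states = [CONDITION_TO_STATE.get(value, State.UNKNOWN)
              for value in polled.values()]
    return max(states)
```

Because the function only looks at the worst value across whatever instances responded, the same template can be reused even when the instance numbering varies between servers.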

If you need some more details let me know; I can give you some OIDs, at least for HP models, as 95% of our server farm is HP (1000+).


Accepted Solution

andyalder earned 125 total points
ID: 34905085
Most controller drivers will write something to the OS log if there is a disk problem, so you can monitor that instead of monitoring the hardware.
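
A minimal sketch of that idea in Python, assuming the log has already been exported as plain text lines. The keyword pattern and the `find_disk_events` helper are hypothetical; the exact wording of driver messages differs by vendor, so the pattern should be built from what your own controllers actually log.

```python
import re

# Hypothetical keywords: a storage-related term followed later on the same
# line by a failure-related term. Adjust to the messages your drivers write.
DISK_PATTERN = re.compile(
    r"\b(raid|array|logical drive|physical drive|disk)\b.*"
    r"\b(fail(ed|ure)?|degraded|offline|rebuild(ing)?)\b",
    re.IGNORECASE,
)


def find_disk_events(lines):
    """Return the log lines that look like RAID or disk problems."""
    return [line.rstrip("\n") for line in lines if DISK_PATTERN.search(line)]
```

The matching lines can then be handed to whatever already sends email or feeds the central monitoring tool.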

Expert Comment

ID: 34905106
Andyalder: RAID controllers don't write disk accelerator battery failures to the event log, but yes, disk failures usually are written, or at least the controller can be set to do this through the vendor monitoring software. You would, however, still need a central monitoring tool to make any use of it in larger server farms.


Expert Comment

ID: 34905223
LSI and HP ones do, not sure about the other ones out there.

Author Closing Comment

ID: 34943949
Not very sexy but until I can get them all replaced, I've created a custom filter in my monitoring process to alert me of multiple warnings or any critical messages sent to the System Event Log.  Don't know why I didn't think about that.
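
For anyone wanting to reproduce the rule outside a particular monitoring product, the "multiple warnings or any critical" logic might be sketched in Python like this. The level names, the `(level, message)` event shape, the default threshold of 2 and the `should_alert` helper are all assumptions for illustration, not the actual filter used here.

```python
from collections import Counter


def should_alert(events, warning_threshold=2):
    """Decide whether a batch of System Event Log entries deserves an alert.

    `events` is an iterable of (level, message) pairs. An alert is raised
    for any critical entry, or once warnings reach `warning_threshold`.
    """
    counts = Counter(level.lower() for level, _message in events)
    return counts["critical"] > 0 or counts["warning"] >= warning_threshold
```

A single warning stays quiet, which keeps one-off noise from triggering email, while a lone critical entry still gets through.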
