pmckenna11

asked on

Disk failure notification on Dell RAID

I have a Dell PowerEdge 2800 with a 3-disk RAID 5. It recently failed on reboot, and I discovered that 2 of the 3 disks had failed. I assume that one of the disks in the array had failed some time ago and that the array was running in degraded mode. The second drive probably failed on reboot (after a power outage), and so the machine failed to boot.

I check the logs fairly regularly (bi-monthly?) and don't recall ever seeing any type of disk failure warning in them. Dell OpenManage was installed, but I am not that familiar with it. I had assumed that it would create hardware-related events in the logs, but perhaps not.

Can someone clarify for me how hardware failure notifications can be handled on Dell server hardware? I would like to be automatically notified of critical hardware failures, if possible without having to install any 3rd-party monitoring solutions.
ASKER CERTIFIED SOLUTION
michaelkovac

This solution is only available to members.
joe_massimino

Didn't your server flash a warning light/LED when something went wrong? All of my Dell servers do at least that. The Event Viewer shows almost all hardware failures; nothing that important is ever left for you to hunt for. So your question baffles me.
pmckenna11

ASKER

Thanks for the link to the thread. I will check through it and see if any of the suggested fixes will work for me, but it is probably time to look into something more robust than OpenManage.

The server is at a remote location so I am not able to check lights on the drives. Kind of lame if you have to physically look at a flashing light on the server to know there is a problem.

I am thinking that maybe both drives failed since I last checked the event logs on this server. It seems unlikely, but otherwise how can you explain that there were no hardware failure notifications in the Windows logs? If anyone has a different explanation I would love to hear it. I work hard to catch problems proactively, and to lose a RAID 5 server to disk failure is quite embarrassing!!

To Windows there was no drive failure, because the PERC controller abstracts its RAID configuration from the OS in the form of virtual disks. To the OS a virtual disk looks just like a physical hard drive. If any member of that disk fails, the controller 'takes care' of the issue by keeping the array running in degraded mode and, if a hot spare is available, rebuilding onto it (not so RAID 0, which just fails). You should really always use a hot spare so that you're not stuck in this kind of predicament. RAID 5 arrays are notorious for total failure when, after a single drive has failed, a second drive fails or a read error hits during the rebuild (search Google and you will find some scary stats on that):
http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162 

If you had set up a software RAID in the OS you'd get messages in the Windows system logs, but performance wouldn't be that great.
Interesting article on RAID 5. It doesn't really apply in this case because we are using 146 GB SCSI drives, so the chance of a read error is small, but still interesting. I will keep it in mind on other servers.
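
To put a rough number on "small": a minimal back-of-the-envelope sketch, assuming a rebuild of a 3-disk RAID 5 has to read both surviving 146 GB drives in full and that bit errors are independent. The two error rates below (1 in 10^14 and 1 in 10^15 bits read) are illustrative spec-sheet style figures, not values taken from these particular drives, and `p_read_error` is just a helper name for this example.

```python
import math

def p_read_error(bytes_read, ure_per_bit):
    """Probability of at least one unrecoverable read error (URE)
    while reading `bytes_read` bytes, assuming independent bit errors."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Assumption: a rebuild reads both surviving 146 GB (decimal) drives end to end.
rebuild_bytes = 2 * 146e9

# Assumed error rates per bit read; check the drive's data sheet for the real one.
for rate in (1e-14, 1e-15):
    print(f"URE rate {rate:g} per bit: {p_read_error(rebuild_bytes, rate):.2%}")
```

Under those assumptions the chance of hitting a read error during the rebuild comes out at roughly 2% at the 10^14 rate and roughly 0.2% at the 10^15 rate, so small but not zero, and it grows with drive size.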

But I am still left with the question of how one knows when a drive fails. Even with a hot spare (I am adding one to the rebuilt server as I type this) I still need to know that a piece of hardware has failed. I don't want to have to manually launch OpenManage to constantly check on storage health. I have been poking around in the OpenManage interface but don't see any built-in notification functionality (fixes were pointed to in an earlier message).

Also, I get that Windows thinks the virtual disk is fine, but I thought OpenManage sent a notification of hardware problems that showed up in the logs. I guess not!!!!
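
One way to avoid opening OpenManage by hand would be a small scheduled script that asks the Server Administrator command line for the physical disk status and flags anything that is not healthy. The following is only a sketch under stated assumptions: that `omreport` is installed and on the PATH, that the array sits on controller 0, and that the report is a series of `Key : Value` lines in which each disk starts with an `ID` line and carries a `Status` field reading `Ok` when healthy. The function names are made up for this example; check the real output on your own server before relying on it.

```python
import subprocess

def parse_omreport(text):
    """Turn 'Key : Value' lines into one dict per disk.
    Assumption: every disk record begins with an 'ID' line."""
    records, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # headings and blank lines carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "ID":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records

def failing_disks(records):
    """Disks whose Status is anything other than 'Ok' (assumed healthy value)."""
    return [r for r in records if r.get("Status") != "Ok"]

def check_controller(controller=0):
    """Run omreport for one controller and return the unhealthy disks."""
    result = subprocess.run(
        ["omreport", "storage", "pdisk", f"controller={controller}"],
        capture_output=True, text=True, check=True, timeout=60,
    )
    return failing_disks(parse_omreport(result.stdout))
```

Calling `check_controller()` from a Windows scheduled task and mailing the result whenever the returned list is not empty would give a basic alert without any 3rd-party monitoring product.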
SOLUTION
This solution is only available to members.
Do you have an SNMP server for system management and control? Dell OpenManage has an SNMP agent that a management station can poll. Example:
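
If there is no SNMP management station yet, a polling check can be sketched in a few lines. This assumes the Net-SNMP `snmpget` tool is installed on the monitoring machine and that the OpenManage SNMP agent is enabled on the server. The OID of the status object is deliberately left as a parameter, to be looked up in the MIB files shipped with OpenManage, and the numeric status codes in the table are my understanding of Dell's status enumeration and should be verified against that MIB as well. `classify` and `poll_status` are names invented for this example.

```python
import subprocess

# Assumed Dell status enumeration; verify against the MIB shipped with OpenManage.
DELL_STATUS = {
    1: "other",
    2: "unknown",
    3: "ok",
    4: "nonCritical",
    5: "critical",
    6: "nonRecoverable",
}

def classify(raw_value):
    """Map a numeric status string to (label, needs_alert).
    Anything other than 'ok' is treated as worth an alert."""
    code = int(raw_value.strip())
    label = DELL_STATUS.get(code, "unrecognized")
    return label, code != 3

def poll_status(host, community, oid):
    """Fetch one status OID with snmpget and classify it.
    -Oqv prints the value only; -Oe prints enumerations as numbers."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqve", host, oid],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return classify(result.stdout)
```

Run on a schedule against each server, `poll_status` returns a label plus a flag saying whether someone should be notified, which works for a remote box where nobody can look at the drive LEDs.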
http://www.tools4ever.com/products/monitormagic/policies/dell.asp 
Sajid Shaik M
The first symptom on the server, if you have a physical look at it, is that the LED on the HDD turns orange instead of green (green = healthy).

Yes, I do use SNMP for server management. Our organization uses HP business-class workstations and ProLiant servers, so the SNMP events are caught through HP's Insight Manager, or Lights-Out services if the box fails. The other nice thing about HPs is that in the BIOS there is a place to set up event notifications via SMS/text messaging or email. This type of notification is in addition to SNMP and Windows Event Logging; it's done at the hardware level of the box. Cool stuff. It's saved offices from a complete meltdown.
It appears that OpenManage needs to be used in conjunction with OpenManage Server Administrator Managed Node in order to get the reporting. From what I gather, Server Administrator uses SNMP info from local and remote servers and is capable of generating the desired alerts along with other functionality. I looked briefly at Server Administrator and may set up a box with it installed to monitor all my Dell servers.

It goes without saying that it is absolutely ridiculous that OpenManage does not have its own alert functionality and that you have to go through all this hassle just to get an alert emailed to you (or use a workaround as previously suggested).