pmckenna11
asked on
Disk failure notification on Dell RAID
I have a Dell PowerEdge 2800 with a 3-disk RAID 5. It recently failed on reboot and I discovered that 2 of the 3 disks had failed. I assume that one of the disks in the array had failed some time ago and that the array was running in degraded mode. The second drive probably failed on reboot (after a power outage), and so the machine failed to boot.
I check the logs fairly regularly (bi-monthly?) and don't recall ever seeing any type of disk failure warning in the log. Dell OpenManage was installed but I am not that familiar with it. I had assumed that it would create hardware-related events in the logs, but perhaps not.
Can someone clarify for me how hardware failure notifications can be handled on Dell server hardware? I would like to be automatically notified of critical hardware failures, if possible without having to install any 3rd-party monitoring solutions.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
Didn't your server flash a warning light/LED when something went wrong? All of my Dell servers do at least that. The Event Viewer shows almost all hardware failures; nothing that important is ever left for you to hunt for. So your question baffles me.
ASKER
Thanks for the link to the thread. I will check through it and see if any of the suggested fixes will work for me, but it is probably time to look into something more robust than OpenManage.
The server is at a remote location so I am not able to check lights on the drives. Kind of lame if you have to physically look at a flashing light on the server to know there is a problem.
I am thinking that maybe both drives failed since I last checked the event logs on this server. Seems unlikely, but otherwise how can you explain that there were no hardware failure notifications in the Windows logs? If anyone has a different explanation I would love to hear it. I work hard to catch problems proactively, and to lose a RAID 5 server to disk failure is quite embarrassing!!
To Windows there was no drive failure, because the PERC controller abstracts its RAID configuration from the OS in the form of virtual disks. To the OS, a virtual disk looks just like a physical hard drive. If a member disk fails, the controller 'takes care' of the issue by running the array in degraded mode and, if a hot spare is available, rebuilding onto it (not so with RAID 0, which just fails). You should really always use a hot spare so that you're not stuck in this kind of predicament. RAID 5 arrays are notorious for total failure when a second drive fails, or an unrecoverable read error turns up, while the array is degraded or rebuilding (search Google and you will find some scary stats on that):
http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162
If you had set up a software RAID in the OS you'd get messages in the Windows system logs, but performance wouldn't be that great.
ASKER
Interesting article on RAID 5. Doesn't really apply in this case because we are using 146G SCSI drives so the chance of a read error is small but still interesting. I will keep it in mind on other servers.
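For a rough sense of scale, the rebuild-risk argument in the linked article can be turned into a back-of-the-envelope calculation. This is only a sketch: it assumes independent bit errors, and the error rates used (one unrecoverable read error per 1e14 or 1e15 bits) are commonly quoted spec-sheet figures, not numbers taken from this thread or measured on these drives. The function names are made up for illustration.

```python
import math


def p_read_error(bytes_read, bits_per_error):
    """Chance of at least one unrecoverable read error (URE) while
    reading `bytes_read` bytes, assuming independent bit errors."""
    bits = bytes_read * 8
    # 1 - (1 - 1/rate)**bits, computed stably for tiny per-bit odds
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


def rebuild_risk(disks, disk_bytes, bits_per_error):
    """A RAID 5 rebuild has to read every surviving disk in full."""
    return p_read_error((disks - 1) * disk_bytes, bits_per_error)


if __name__ == "__main__":
    # (label, number of disks, bytes per disk, ASSUMED bits per URE)
    scenarios = [
        ("3 x 146 GB, 1 URE per 1e15 bits", 3, 146e9, 1e15),
        ("3 x 146 GB, 1 URE per 1e14 bits", 3, 146e9, 1e14),
        ("3 x 2 TB,   1 URE per 1e14 bits", 3, 2e12, 1e14),
    ]
    for label, disks, size, rate in scenarios:
        print(f"{label}: {rebuild_risk(disks, size, rate):.2%}")
```

Under those assumptions the 146 GB array comes out at a fraction of a percent to a couple of percent per rebuild, while large drives at the 1e14 rate land in the tens of percent, which is the article's point. It says nothing about the separate risk of a second whole-drive failure, which is what happened here.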
But still I am left with: how does one know when a drive fails? Even with a hot spare (I am adding one to the rebuilt server as I type this) I still need to know that a piece of hardware has failed. I don't want to have to manually launch OpenManage to constantly check on storage health. I have been poking around in the OpenManage interface but don't see any built-in notification functionality (fixes were pointed to in an earlier message).
Also, I get that Windows thinks the virtual disk is fine, but I thought OpenManage sent a notification of hardware problems that showed up in the logs. I guess not!!!!
SOLUTION
This solution is only available to members.
Do you have an SNMP management server for system monitoring and control? Dell OpenManage includes an SNMP agent that can answer SNMP queries and send traps. Example:
http://www.tools4ever.com/products/monitormagic/policies/dell.asp
The first symptom on the server, if you can take a physical look at it, is that the LED on the failed drive turns orange instead of green (green = healthy).
Yes, I do use SNMP for server management. Our organization uses HP business-class workstations and ProLiant servers, so the SNMP events are caught through HP's Insight Manager, or Lights-Out services if the box fails. The other nice thing about HPs is that in the BIOS there is a place to set up event notifications via SMS or text messaging or via email. This type of notification is in addition to SNMP and Windows event logging; it's done at the hardware level of the box. Cool stuff. It's saved offices from a complete meltdown.
ASKER
It appears that OpenManage needs to be used in conjunction with OpenManage Server Administrator Managed Node in order to get the reporting. From what I gather, Server Administrator uses SNMP info from local and remote servers and is capable of generating the desired alerts, along with other functionality. I looked briefly at Server Administrator and may set up a box with it installed to monitor all my Dell servers.
Goes without saying that it is absolutely ridiculous that OpenManage does not have its own alert functionality and that you have to go through all this hassle just to get an alert emailed to you (or use a workaround as previously suggested).
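One shape such a workaround could take, without any third-party monitoring product, is a small scheduled script that asks OpenManage's `omreport` command-line tool for the physical disk status and emails when anything is not healthy. This is a sketch under assumptions, not a tested Dell recipe: it assumes the array is on controller 0, that `omreport storage pdisk` prints `Key : Value` lines with each disk starting at an `ID` line, and that a healthy disk reports `Status : Ok`. The mail host and addresses are placeholders, and the function names are made up for illustration.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def parse_pdisks(report):
    """Turn 'Key : Value' report text into one dict per physical disk.
    ASSUMPTION: each disk's record begins with an 'ID' line."""
    disks, current = [], None
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # headings and blank lines carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "ID":
            current = {}
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks


def unhealthy(disks):
    """Disks whose Status is anything other than 'Ok' (assumed healthy value)."""
    return [d for d in disks if d.get("Status", "").lower() != "ok"]


def send_alert(bad, smtp_host, sender, recipient):
    """Email a one-line summary per unhealthy disk."""
    msg = EmailMessage()
    msg["Subject"] = f"RAID alert: {len(bad)} disk(s) not healthy"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(
        f"{d.get('ID')}: status={d.get('Status')} state={d.get('State')}"
        for d in bad))
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


def check_and_alert():
    """Run omreport once and send mail if any disk looks unhealthy."""
    out = subprocess.run(
        ["omreport", "storage", "pdisk", "controller=0"],  # controller 0 assumed
        capture_output=True, text=True, check=True).stdout
    bad = unhealthy(parse_pdisks(out))
    if bad:
        # Placeholder host and addresses; replace with real ones.
        send_alert(bad, "mail.example.com",
                   "raid@example.com", "admin@example.com")
```

Calling `check_and_alert()` from a Windows scheduled task every few minutes would cover the "degraded for weeks without anyone noticing" case. It is worth feeding `parse_pdisks` a saved copy of the real `omreport` output from the server first, since the exact field names can differ between OpenManage versions.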