Solved

Disk failure notification on Dell RAID

Posted on 2011-02-19
Last Modified: 2012-05-11
I have a Dell PowerEdge 2800 with a 3-disk RAID 5. It recently failed on reboot and I discovered that 2 of the 3 disks had failed. I assume that one of the disks in the array had failed some time ago and that the array was running in degraded mode. The second drive probably failed on reboot (after a power outage) and so the machine failed to boot.

I check the logs fairly regularly (bi-monthly?) and don't recall ever seeing any type of disk failure warning in the log. Dell OpenManage was installed but I am not that familiar with it. I had assumed that it would create hardware-related events in the logs, but perhaps not.

Can someone clarify for me how hardware failure notifications can be handled on Dell server hardware? I would like to be automatically notified of critical hardware failures, if possible without having to install any third-party monitoring solutions.
Question by:pmckenna11
10 Comments
 
LVL 1

Accepted Solution

by:
michaelkovac earned 250 total points
ID: 34933897

Here's a good thread on the subject.
OpenManage will show you in detail what's going on, but its logging is dismal.
http://en.community.dell.com/support-forums/servers/f/177/t/19206983.aspx
0
 
LVL 6

Expert Comment

by:joe_massimino
ID: 34933919
Didn't your server flash a warning light/LED when something went wrong? All of my Dell servers do at least that. The Event Viewer shows almost all hardware failures; nothing that important is ever left for you to hunt for. So, your question baffles me.
0
 
LVL 2

Author Comment

by:pmckenna11
ID: 34934034
Thanks for the link to the thread. I will check through it and see if any of the suggested fixes will work for me, but it is probably time to look into something more robust than OpenManage.

The server is at a remote location so I am not able to check lights on the drives. Kind of lame if you have to physically look at a flashing light on the server to know there is a problem.

I am thinking that maybe both drives failed since I last checked the event logs on this server. Seems unlikely, but otherwise how can you explain that there were no hardware failure notifications in the Windows logs? If anyone has a different explanation I would love to hear it. I work hard to catch problems proactively, and to lose a RAID 5 server from disk failure is quite embarrassing!!
0
 
LVL 1

Expert Comment

by:michaelkovac
ID: 34934266

To Windows there was no drive failure because the PERC controller abstracts its RAID config from the OS in the form of virtual disks. To the OS a virtual disk looks just like a physical hard drive. If a member of that disk fails, the controller 'takes care' of the issue by running the array in degraded mode and, if a hot spare is available, rebuilding onto it (not so RAID 0, which just fails). You should really always use a hot spare just so that you're not stuck in this kind of predicament. RAID 5 arrangements are notorious for total failure once a single drive has failed, because a second failure or an unrecoverable read error during the rebuild takes out the whole array (search Google and you will find some scary stats on that):
http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162 

If you had set up a software RAID in the OS you'd get messages in the Windows system logs, but performance wouldn't be that great.
0
 
LVL 2

Author Comment

by:pmckenna11
ID: 34935678
Interesting article on RAID 5. It doesn't really apply in this case because we are using 146 GB SCSI drives, so the chance of a read error is small, but still interesting. I will keep it in mind on other servers.

But I am still left with the question: how does one know when a drive fails? Even with a hot spare (I am adding one to the rebuilt server as I type this) I still need to know that a piece of hardware has failed. I don't want to have to manually launch OpenManage to constantly check on storage health. I have been poking around in the OpenManage interface but don't see any built-in notification functionality (fixes were pointed to in an earlier message).

Also, I get that Windows thinks the virtual disk is fine, but I thought OpenManage sent a notification of hardware problems that showed up in the logs. I guess not!
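
As a rough check on the read-error point above, here is a minimal sketch of the arithmetic behind the linked RAID 5 article. It assumes an unrecoverable read error (URE) rate of 1 per 10^14 bits read, a figure commonly quoted for consumer SATA drives; that rate is an assumption, not something stated in this thread, and SCSI drives are often specified with a lower rate. Rebuilding a 3-disk RAID 5 means reading both surviving disks in full.

```python
import math


def rebuild_ure_probability(surviving_disks, disk_bytes, ure_per_bit):
    """Chance of at least one unrecoverable read error while reading
    every bit of the surviving disks during a RAID 5 rebuild."""
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


ASSUMED_URE_PER_BIT = 1e-14  # assumption: 1 error per 10^14 bits read

# 3-disk RAID 5 with 146 GB drives: 2 surviving disks are read in full
p_146gb = rebuild_ure_probability(2, 146e9, ASSUMED_URE_PER_BIT)

# Same layout with hypothetical 2 TB drives, for comparison
p_2tb = rebuild_ure_probability(2, 2e12, ASSUMED_URE_PER_BIT)

print(f"146 GB drives: {p_146gb:.1%}")
print(f"2 TB drives:   {p_2tb:.1%}")
```

Under that assumed rate the 146 GB case comes out at a few percent, which is consistent with the point that read errors are a much smaller risk on small drives than on multi-terabyte ones. It does nothing to protect against a second outright drive failure, which is what happened here.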
0
 
LVL 1

Assisted Solution

by:MarkThomasLee
MarkThomasLee earned 250 total points
ID: 34936180
With Dell servers, an event is flagged and is visible via OpenManage. There is also usually a light on the front of the box that will flash amber when there is a hardware alert. Then of course the preboot process would also display an error, which is visible if the Dell preboot splash screen is disabled. Usually though, on RAID or other disk errors, the preboot will hang at the disk controller portion of the preboot process.

These notifications are outside the OS, so there isn't a thread or mechanism other than OpenManage to gain access to these alerts, apart from the flashing amber light on the front of the box.

As far as Windows is concerned, even though it's a 3-disk RAID, Windows sees it as 1 physical drive. That's because the hardware handles the creation of the logical drive spread out over the 3 drives. Long story short, it's a hardware array, which Windows is blind to. The only arrays Windows is aware of are software arrays, the ones created using Disk Management (diskmgmt.msc).

Sorry buddy, that really sucks. I just went through the same thing at a new client's office: a chartless medical office with no backup. They are hurting. Data recovery time. CBL here we go!

I hope this info helps
M
0
 
LVL 1

Expert Comment

by:michaelkovac
ID: 34940094
Do you have an SNMP server for system management and control? Dell OpenManage has an SNMP agent which can respond to SNMP queries. Example:
http://www.tools4ever.com/products/monitormagic/policies/dell.asp 
0
 
LVL 16

Expert Comment

by:Shaik M. Sajid
ID: 34941841
The first symptom, if you have a physical look at the server, is that the LED on the HDD turns orange instead of green (green = healthy).

0
 
LVL 1

Expert Comment

by:MarkThomasLee
ID: 34945137
Yes, I do use SNMP for server management. Our organization uses HP business-class workstations and ProLiant servers, so the SNMP events are caught through HP's Insight Manager, or Lights-Out services if the box fails. The other nice thing about HPs is that in the BIOS there is a place to set up event notifications via SMS/text messaging or via email. This type of notification is in addition to SNMP and Windows event logging; it's done at the hardware level of the box. Cool stuff. It's saved offices from a complete meltdown.
0
 
LVL 2

Author Comment

by:pmckenna11
ID: 35044200
It appears that OpenManage needs to be used in conjunction with OpenManage Server Administrator Managed Node in order to get the reporting. From what I gather, Server Administrator uses SNMP info from local and remote servers and is capable of generating the desired alerts along with other functionality. I looked briefly at Server Administrator and may set up a box with it installed to monitor all my Dell servers.

It goes without saying that it is absolutely ridiculous that OpenManage does not have its own alert functionality and that you have to go through all this hassle just to get an alert emailed to you (or use a workaround as previously suggested).
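
One possible shape for such a workaround, as a hedged sketch rather than a tested solution: a small script, run on a schedule, that reads the physical disk report from the OpenManage `omreport` command line tool and emails anything that is not healthy. The controller ID, the `Key : Value` field names in the sample text, the mail server, and the addresses are all assumptions for illustration; check them against what your installed version of OpenManage actually prints.

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Assumption: a single RAID controller with ID 0; adjust for your server.
OMREPORT_CMD = ["omreport", "storage", "pdisk", "controller=0"]

# Illustrative sample of the "Key : Value" layout, not captured from a real server.
SAMPLE_REPORT = """\
Controller PERC (illustrative)
ID                : 0:0
Status            : Ok
Name              : Physical Disk 0:0
State             : Online
Failure Predicted : No

ID                : 0:1
Status            : Critical
Name              : Physical Disk 0:1
State             : Failed
Failure Predicted : No
"""


def read_report():
    """Run omreport and return its text output (needs OpenManage installed)."""
    result = subprocess.run(OMREPORT_CMD, capture_output=True, text=True, check=True)
    return result.stdout


def parse_pdisks(report_text):
    """Turn 'Key : Value' lines into one dict per disk; each 'ID' line starts a disk."""
    disks, current = [], None
    for line in report_text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            continue  # headings and blank lines carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "ID":
            current = {}
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks


def failing_disks(disks):
    """Disks whose status is not Ok, or that report a predicted failure."""
    return [
        d for d in disks
        if d.get("Status", "").lower() != "ok"
        or d.get("Failure Predicted", "").lower() == "yes"
    ]


def build_alert(disks, host, sender="alerts@example.com", recipient="admin@example.com"):
    """Return an EmailMessage describing unhealthy disks, or None if all are fine."""
    bad = failing_disks(disks)
    if not bad:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[{host}] {len(bad)} disk(s) need attention"
    msg["From"] = sender          # hypothetical address
    msg["To"] = recipient         # hypothetical address
    msg.set_content("\n".join(
        f"{d.get('ID')}: status={d.get('Status')}, state={d.get('State')}" for d in bad
    ))
    return msg


def send_alert(msg, smtp_host="mail.example.com"):
    """Send the alert through a plain SMTP relay (hypothetical host name)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


# Demonstration on the sample text only: no command is run and no mail is sent.
for disk in failing_disks(parse_pdisks(SAMPLE_REPORT)):
    print(disk["ID"], disk["Status"], disk["State"])
```

On a real server the scheduled job would call `build_alert(parse_pdisks(read_report()), host)` and pass any non-`None` result to `send_alert`. A central monitoring box polling over SNMP, as discussed above, scales better across several servers; this only covers the single-server case.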
0

Question has a verified solution.
