ctagle

Replace Failed Drive in PERC6 RAID 5 Array

Hello Experts,

We have a PowerEdge R510 with a PERC 6 and a RAID 5 array configured with five 500 GB SATA hard drives. One of the drives failed a few days ago, so we ordered a replacement for it. Typically when I've replaced failed hard drives in the past, I just pop in the new one and it does a background rebuild; by the morning we have a good array again. In this case, though, at some point during the rebuild it apparently fails and the new drive goes into an offline state. Things I have tried so far (after retrying the rebuild in case of a random error):

1. Moving the drive to another slot and seeing if I could assign it as a hot spare. I tried going through OpenManage, and it says the new disk is foreign when I move it to another slot. I tried clearing the foreign config in OpenManage, but that wouldn't work.
2. Next I booted into the controller BIOS and tried clearing the config from there, which seems to have worked, but I can't use the disk as a hot spare from there either.
3. Next I thought I would try the replace option (still in the controller BIOS), so I put the old drive back in and left the new one in one of the previously unoccupied slots. I then went to the old drive and selected replace, but for some reason it shows the old drive as a good unused drive (not part of the array) and the new one as offline, and that status follows the new drive from slot to slot.

So at this point it's kind of odd: the old drive can be used as a hot spare or as a replacement, but I can't use the new one as a replacement or hot spare, and it constantly goes offline. Any help is greatly appreciated, thank you.
SOLUTION
rindi
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Hi Ctagle
In our experience, moving the drives around in this way can mess up your array further.
Personally, I tend to stick to the slots the drives came out of.

You could consider the following.
The new drive you have is failing or faulty - if it's under warranty, request a different one and swap it.
Or
You MAY have a second hard drive in the array on predicted fail, and hence the array does not rebuild. We have had drives go into predicted fail when the amber light doesn't show - performance is an absolute dog.
Or
The RAID controller itself may be failing.

Is there nothing in the Server Manager hardware or alert logs? Clues can often be found there.
1) What happens when you try to clear the foreign configuration?
2) You can assign a hot-spare in CTRL-R, on either the VD MGMT screen (highlight controller, F2) or the PD MGMT screen (highlight disk, F2).
ctagle

ASKER

Thank you for the replies
@AndyKeen
The first indication of the drive failing was in the Windows event log, where I was getting disk0 errors. I checked OpenManage and saw a drive flagged as failing; all the others are healthy. Those were the only errors, though.

@PowerEdgeTech
When I try to clear the foreign config I can't; the disk shows up as red in the disk manager and says offline.
I tried assigning a hot spare, but the option is greyed out on the new drive.
Hi Ctagle.

I forgot you could check the Windows event log. We have a small piece of VB code that you run against the Dell server in question and configure it to email you with a range of failure messages such as this:

SBS2011 - SERVER1 - Dell Server Alert - Power supply failure - Customer 214
It takes the stress out of constantly monitoring and worrying.

By coincidence, I was reading a Dell article this morning about RAID rebuilds. People are tending to move away from RAID 5 to other RAID configurations, primarily because of the size of disks these days, but at 500 GB per drive I can't say RAID 5 is the worst choice.
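
To put rough numbers on the disk-size concern, here is a back-of-the-envelope sketch. It assumes an unrecoverable read error (URE) rate of 1 per 1e14 bits, a figure commonly quoted on consumer SATA spec sheets and not something taken from this server, and it treats errors as independent, which is a simplification. The function name is made up for illustration.

```python
import math

# Assumption: spec-sheet URE rate often quoted for consumer SATA drives.
ASSUMED_URE_PER_BIT = 1e-14


def prob_read_error(drives_read, drive_bytes, ure_per_bit):
    """Chance of at least one unrecoverable read error while reading
    every bit of the surviving drives during a rebuild."""
    bits = drives_read * drive_bytes * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# A 5-drive RAID 5 rebuild has to read the 4 surviving drives in full.
p_500gb = prob_read_error(4, 500e9, ASSUMED_URE_PER_BIT)
p_4tb = prob_read_error(4, 4e12, ASSUMED_URE_PER_BIT)

print(f"4 x 500 GB survivors: {p_500gb:.2f}")
print(f"4 x 4 TB survivors:   {p_4tb:.2f}")
```

Under those assumptions the 500 GB case comes out at roughly 0.15 and the 4 TB case at roughly 0.72, which is the general shape of the argument against RAID 5 on large disks. Real drives often do better than the spec-sheet rate, so treat this as an illustration rather than a prediction.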

Anyhow - the article mentions that if a second drive has a bad block, the rebuild will keep failing every time it is attempted, even though that second drive won't necessarily show as predicted fail.
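
Why a single bad block elsewhere is enough to stop a RAID 5 rebuild can be seen in a minimal sketch of XOR parity. This is a toy model of one stripe, not the PERC firmware's actual logic; the function names and the two-byte blocks are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Rebuild the one missing block of a stripe from the surviving blocks.

    A block that could not be read is passed as None. RAID 5 keeps only one
    parity block per stripe, so a second missing block cannot be recovered.
    """
    if any(block is None for block in surviving_blocks):
        raise IOError("unreadable block on a surviving drive; stripe cannot be rebuilt")
    return xor_blocks(surviving_blocks)


# One stripe on a hypothetical 5-drive array: 4 data blocks plus parity.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
parity = xor_blocks(data)

# The drive holding data[2] has failed; everything else is readable.
survivors = [data[0], data[1], data[3], parity]
rebuilt = rebuild_missing(survivors)
print(rebuilt == data[2])

# Same stripe, but a second drive returns a bad block during the rebuild.
survivors_with_bad_block = [data[0], None, data[3], parity]
```

Calling `rebuild_missing(survivors_with_bad_block)` raises an error, because with two blocks missing from the stripe there is nothing left to XOR the lost data back from.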

I'm not all about doom and gloom, though - I still reckon that you could be suffering from a bad NEW drive. Are these drives Dell Certified Enterprise drives? If so, they will be warrantied and new - if not, then you may have anything.

Try leaving the drive in the Dell unit for a while and see if it goes to predicted fail, even though you can't clear the config.

If you're trying to rebuild the array, I am unclear why you would set the new drive as a hot spare. As you stated, the RAID configuration should automatically start the rebuild process without intervention - so to me that says there is a problem with one of:

- the new drive
- the remaining RAID array
- the RAID controller

I would certainly only insert the drive into the bay where you took the old one from - but that's probably my OCD :)

I would also start with the easiest thing to rule out - the NEW hard drive. Get that replaced first and then move on. Either way, I would suggest you ensure you have a full backup available in case the array does not rebuild.

Hope this helps
Andy
ctagle

ASKER

@AndyKeen
We have two backups, one file-level in the cloud and nightly bare-metal backups done to an appliance that is synced up to the cloud, so we are covered on backups. Personally, I am leaning more towards either the controller failing or a DOA drive. The PERC will throw false positives, but rarely if ever have I seen it miss a failing drive. That, coupled with the fact that I also saw a few stray "raid0" errors in the Windows logs, tells me that it is most likely one of those two. I am going to pull a known good SATA drive from a desktop computer and see if it will at least rebuild. I of course won't leave that drive in the server, but if it does rebuild we know it's the drive; if it doesn't, then it's most likely the controller.
I think that's a good start - you should see straight away if it's rebuilding.

To me, logic would dictate that if this drive fails during the rebuild, then either the array or the controller is screwed.

As a suggestion - you may want to run the drive you are going to put in through a SMART test. Desktop PCs are never very good at indicating SMART issues and can carry on for a long time with one, but we know a server won't.

Good luck
"@PowerEdgeTech
 when i try to clear the foreign config i can't, the disk shows up as red in the disk manager and says offline.
 I tried assigning a hot spare but the option is greyed out on the new drive "

If it shows OFFLINE, then it's not foreign, so there is no foreign configuration to import/clear. You can't assign an OFFLINE disk as a hot-spare, only a READY disk.
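
The rule above can be written down as a small state check. This is a simplified illustrative model of the disk states mentioned in this thread, not the controller's or OpenManage's actual API; the enum and function names are hypothetical.

```python
from enum import Enum


class DiskState(Enum):
    """Simplified physical-disk states, as discussed in this thread."""
    READY = "Ready"
    ONLINE = "Online"
    OFFLINE = "Offline"
    FOREIGN = "Foreign"
    FAILED = "Failed"


def can_assign_hot_spare(state):
    """Only a READY disk can be made a hot spare."""
    return state is DiskState.READY


def has_foreign_config(state):
    """Only a FOREIGN disk has a foreign configuration to import or clear."""
    return state is DiskState.FOREIGN


new_drive = DiskState.OFFLINE
print(can_assign_hot_spare(new_drive))  # hot-spare option greyed out
print(has_foreign_config(new_drive))    # nothing to clear
```

In this model the new drive, being OFFLINE, fails both checks, which matches the greyed-out hot-spare option and the foreign-config clear that has nothing to act on.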
ctagle

ASKER

@AndyKeen
Bought a brand new drive to eliminate any variables. It's been rebuilding for the past hour now; fingers crossed that it goes online successfully. It will make my life much simpler if it's just a DOA drive.

@PowerEdgeTech
Right, but I can't get it to come online. The only time it will is while it is rebuilding as part of the old array, and then the rebuild fails and the drive goes offline. No matter what slot I move the drive to, it shows up as part of the old array and as offline, and the only option is to rebuild; I can't get that drive to "not" be part of the old array. Either way, if that drive and the RAID controller were both good, the rebuild should have worked. So with the brand new drive: if the RAID controller is bad the rebuild will fail again, and if the original replacement drive was bad the rebuild will succeed. At least that's my logic at this point. I will post back here when I know whether the rebuild fails or succeeds.
ctagle

ASKER

Well, the rebuild failed; the new drive went offline after about two hours, so I'm thinking it's most likely the RAID controller, unless y'all think differently.
A little late to the party, but how exact a match is the replacement drive to the drives already in the array? A small deficiency in capacity could be enough to have the drive fail to rebuild.
Dell controllers won't even allow the rebuild to start where this is the case.
SOLUTION
This solution is only available to members.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ctagle

ASKER

Thank you for your help. I ended up just rebuilding the array and restoring from a backup, and the RAID is OK. I'm having OS troubles, but that's for another question. Thank you.