Replace Failed Drive in PERC6 RAID 5 Array

Hello Experts,

We have a PowerEdge R510 with a PERC 6 and a RAID 5 array configured with five 500 GB SATA hard drives. One of the drives failed a few days ago, so we ordered a replacement for it. Typically when I've replaced failed hard drives in the past, I just pop in the new one and it does a background rebuild; by the morning we have a good array again. In this case, though, at some point during the rebuild it apparently fails and the new drive goes into an offline state. Things I have tried so far (after retrying the rebuild in case of a random error):

1. Moving the drive to another slot and seeing if I could assign it as a hot spare. I tried going through OpenManage, and it says the new disk is foreign when I move it to another slot. I tried clearing the foreign config in OpenManage, but that wouldn't work.
2. Next I booted into the controller BIOS and tried clearing the config from there, which seems to have worked, but I can't use the disk as a hot spare from there either.
3. Next I thought I would try the replace option (still in the controller BIOS), so I put the old drive back in and left the new one in one of the previously unoccupied slots. I then went to the old drive and selected replace, but for some reason it shows the old drive as a good unused drive (not part of the array) and says the new one is offline, and this status follows it around from slot to slot.

So at this point it's kind of odd: the old drive can be used as a hot spare or as a replacement, but I can't use the new one as a replacement or hot spare, and it constantly goes offline. Any help is greatly appreciated, thank you.
ctagleAsked:
rindiCommented:
First of all, move from RAID 5 to something else. RAID 5 is outdated and unreliable. You are very likely to have a full data loss with RAID 5.

Then make sure you only use enterprise class disks that are certified by your Server manufacturer and RAID controller.

Always make sure your backups are 100% OK, particularly before meddling around with the disks.

Since you have that good backup, you can now rebuild your array with something better than RAID 5, restore your backup, and everything should be fine. This is probably a lot faster than trying to fix the array and have it rebuild...
AndyKeenCommented:
Hi Ctagle
In our experience messing with the drives in this way can mess up your array further.
Personally I tend to stick to the slots drives came out from.

You could consider the following.
The new drive you have is failing or faulty. If it's under warranty, request a different one and swap it.
Or
You MAY have a second hard drive in the array on predicted fail, and hence the array does not rebuild. We have had drives go into predicted fail when the amber light doesn't show, and performance is an absolute dog.
Or
The RAID controller may be failing.

Is there nothing in the Server Manager hardware or alert logs? Clues can be in there.
PowerEdgeTechIT ConsultantCommented:
1) What happens when you try to clear the foreign configuration?
2) You can assign a hot-spare in CTRL-R, on either the VD MGMT screen (highlight controller, F2) or the PD MGMT screen (highlight disk, F2).
ctagleAuthor Commented:
Thank you for the replies.
@AndyKeen
The first indication of the drive failing was in the Windows event log, where I was getting disk0 errors. I checked OpenManage and saw a drive flagged as failing; all the others are healthy. Those were the only errors, though.

@PowerEdgeTech
When I try to clear the foreign config I can't; the disk shows up as red in the disk manager and says offline.
I tried assigning a hot spare, but the option is greyed out on the new drive.
AndyKeenCommented:
Hi Ctagle.

I forgot you could check the Windows event log. We have a small piece of VB code that you run against the Dell server in question and configure it to email you with a range of failure messages such as this:

SBS2011 - SERVER1 - Dell Server Alert - Power supply failure - Customer 214
It takes the stress out of constant monitoring and worrying.

By coincidence I was reading a Dell article this morning about RAID rebuilds, as people are tending to move away from RAID 5 to other RAID configs, primarily due to the size of disks these days. At 500 GB per drive, though, I can't say RAID 5 is the worst choice.

Anyhow, the article mentions that if a second drive has a bad block while the array is trying to rebuild, the rebuild will keep failing, and that second drive won't necessarily show as predicted fail.
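
To put rough numbers on that: a RAID 5 rebuild has to read every sector of every surviving member, so one unreadable block anywhere can stop it. The sketch below estimates the chance of hitting at least one unrecoverable read error (URE) during a rebuild of this array. The URE rate of 1 in 10^14 bits is an assumption (a figure commonly quoted on consumer SATA datasheets), not something measured on these drives, and the model treats errors as independent, which real drives do not strictly obey.

```python
import math

def rebuild_ure_probability(surviving_drives: int,
                            drive_bytes: float,
                            ure_per_bit: float = 1e-14) -> float:
    """Chance of at least one unrecoverable read error while reading
    every bit of every surviving drive once (independent-error model)."""
    bits_read = surviving_drives * drive_bytes * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# Five-drive RAID 5 with one failed member: four 500 GB drives are read in full.
# The default ure_per_bit is an assumed datasheet-style value, not a measurement.
p = rebuild_ure_probability(surviving_drives=4, drive_bytes=500e9)
print(f"Estimated chance of a URE during rebuild: {p:.1%}")
```

Under those assumptions the estimate comes out near 15%. With an assumed rate of 1 in 10^15 bits, the same calculation drops to under 2%, which is one reason 500 GB members are a milder case than multi-terabyte ones.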

I'm not all about doom and gloom though, so I still reckon that you could be suffering from a bad NEW drive. Are these drives Dell-certified enterprise drives? If so, they will be warrantied and new; if not, then you may have anything.

Try leaving the drive in the Dell unit for a while and see if it goes to predicted fail, even though you can't clear the config.

If you're trying to rebuild the array, I am unclear why you would set the new drive as a hot spare. As you stated, the RAID configuration should automatically start the rebuild process without intervention, so to me that says there is a problem with one of the following:

The new Drive
The remaining RAID array
the RAID Controller

I would certainly only insert the drive into the bay where you took the old one from - but that's probably my OCD :)

I would also start with the easiest denominator - the NEW Hard Drive - get that replaced first and then move on - either way I would suggest you ensure you have a full backup available should the array not re-build.

Hope this helps
Andy
ctagleAuthor Commented:
@AndyKeen
We have two backups, one file-level in the cloud and nightly bare-metal backups done to an appliance that is synced up to the cloud, so we are covered on backups. Personally I am leaning more towards either the controller failing or a DOA drive; the PERC will throw false positives, but rarely if ever have I seen it miss a failing drive. That, coupled with the fact that I also saw a few stray "raid0" errors in the Windows logs, tells me that it is most likely one of those two. I am going to pull a known-good SATA drive from a desktop computer and see if it will at least rebuild. I of course won't leave that drive in the server, but if it does rebuild we know it's the drive; if it doesn't, then it's most likely the controller.
AndyKeenCommented:
I think that's a good start. You should see straight away if it's rebuilding.

To me, logic would dictate that if this drive fails during the rebuild, then either the array or the controller is screwed.

As a suggestion, you may want to run the drive you are going to put in through a SMART test. Desktop PCs are never very good at indicating SMART issues and indeed can carry on for a long time with one, but we know a server won't.

Good luck
PowerEdgeTechIT ConsultantCommented:
"@PowerEdgeTech
 when i try to clear the foreign config i can't, the disk shows up as red in the disk manager and says offline.
 I tried assigning a hot spare but the option is greyed out on the new drive "

If it shows OFFLINE, then it's not foreign, so there is no foreign configuration to import/clear. You can't assign an OFFLINE disk as a hot-spare, only a READY disk.
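
The rule in this reply can be written as a small lookup, shown here as a sketch only. The state names follow the reply; the action labels and the `can_perform` helper are made up for illustration and are not a controller or OpenManage API. Only the operations discussed in this thread are modeled.

```python
from enum import Enum

class DiskState(Enum):
    READY = "Ready"
    FOREIGN = "Foreign"
    OFFLINE = "Offline"

# Which of the operations discussed here apply in which state.
# Illustrative only: not an exhaustive list of PERC states or actions.
ALLOWED_ACTIONS = {
    DiskState.READY: {"assign_hot_spare"},
    DiskState.FOREIGN: {"import_foreign_config", "clear_foreign_config"},
    DiskState.OFFLINE: set(),  # neither hot spare nor foreign-config operations
}

def can_perform(state: DiskState, action: str) -> bool:
    """True if the (hypothetical) action label applies to a disk in this state."""
    return action in ALLOWED_ACTIONS[state]

print(can_perform(DiskState.OFFLINE, "assign_hot_spare"))
print(can_perform(DiskState.READY, "assign_hot_spare"))
```

This matches what the asker saw: with the new disk reported as OFFLINE, both the hot-spare option and the foreign-config clear were unavailable.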
ctagleAuthor Commented:
@AndyKeen
I bought a brand new drive to eliminate any variables. It's been rebuilding for the past hour now. Fingers crossed that it goes online successfully; it will make my life much simpler if it's just a DOA drive.

@PowerEdgeTech
Right, but I can't get it to come online. The only time it will is while it is rebuilding as part of the old array; then the rebuild fails and the drive goes offline. No matter what slot I move the drive to, it shows up as part of the old array and as offline, and the only option is to rebuild. I can't get that drive to "not" be part of the old array. Either way, if that drive was good and the RAID controller was good, then the rebuild should have worked. If the RAID controller is bad, the rebuild with the new drive will fail; if the original new drive was bad, the rebuild will succeed. At least that's my logic at this point. I will post back here when I know whether the rebuild fails or succeeds.
ctagleAuthor Commented:
Well, the rebuild failed. The new drive went offline after about two hours, so I'm thinking it's most likely the RAID controller, unless y'all think differently.
jmcgOwnerCommented:
A little late to the party, but how exact a match is the replacement drive to the drives already in the array? A small deficiency in capacity could be enough to have the drive fail to rebuild.
PowerEdgeTechIT ConsultantCommented:
"A little late to the party, but how exact a match is the replacement drive to the drives already in the array? A small deficiency in capacity could be enough to have the drive fail to rebuild."

Dell controllers won't even allow the rebuild to start where this is the case.
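
The capacity rule being discussed is simple enough to sketch: a replacement needs at least as many usable sectors as the member it replaces. The function name and the sector counts below are illustrative assumptions, not values read from the drives in this thread, and a real controller may round capacities before comparing them.

```python
def can_replace(member_sectors: int, candidate_sectors: int) -> bool:
    """A replacement must offer at least as many sectors as the member it replaces."""
    return candidate_sectors >= member_sectors

# Illustrative sector counts only (512-byte sectors assumed).
EXISTING_MEMBER = 976_773_168    # a common LBA count for a "500 GB" drive
SLIGHTLY_SMALLER = 976_771_055   # hypothetical drive that is about 1 MB short

print(can_replace(EXISTING_MEMBER, EXISTING_MEMBER))
print(can_replace(EXISTING_MEMBER, SLIGHTLY_SMALLER))
```

Since the rebuilds in this thread did start and ran for hours before failing, a capacity mismatch is an unlikely explanation here.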
PowerEdgeTechIT ConsultantCommented:
"Well the rebuild failed, the new drive went offline after about 2 or so hours, so i'm thinking its most likely the raid controller, unless ya'll think differently"

It is more likely a corrupt array than a bad controller. Check the controller log for "punctures".
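
If the controller log has been exported to a plain-text file, a quick search for puncture entries might look like the sketch below. It assumes nothing about the log format beyond the word "puncture" appearing in the relevant lines; how the log is exported is not covered here. The sample lines are placeholders, not real controller output, and the file name in the comment is hypothetical.

```python
from typing import Iterable, List, Tuple

def find_puncture_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line mentioning a puncture."""
    return [(n, line.rstrip("\n"))
            for n, line in enumerate(lines, start=1)
            if "punctur" in line.lower()]

# Placeholder text, NOT real controller output; substitute the exported log,
# e.g. open("controller_log.txt") with whatever path the export produced.
sample = [
    "placeholder entry about a rebuild starting\n",
    "placeholder entry mentioning a Puncture on some block\n",
    "placeholder entry about a rebuild failing\n",
]
for number, text in find_puncture_lines(sample):
    print(number, text)
```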
AndyKeenCommented:
It could still be a bad block in a second drive in the array that's causing the rebuild to fail.

Depending on availability, I think a RAID controller swap would be the next thing to try and, failing that, a rebuild of the array.

Not ideal but logically the next steps I would say.
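
The elimination logic the thread has followed can be summarized in a short sketch. The function name and suspect labels are made up for illustration; the reasoning is only what the posts above state, and it cannot tell a corrupt array apart from a bad controller.

```python
from typing import List

def remaining_suspects(original_replacement_rebuilt: bool,
                       second_drive_rebuilt: bool) -> List[str]:
    """Narrow down the cause from the outcome of the two rebuild attempts."""
    if original_replacement_rebuilt:
        return []  # nothing left to diagnose
    if second_drive_rebuilt:
        # A different drive rebuilt fine, so the first replacement was the problem.
        return ["original replacement drive"]
    # Two different drives failed the same way: look past the new drives.
    return ["remaining array (bad block or puncture on another member)",
            "RAID controller"]

# The situation reported in this thread: both rebuild attempts failed.
for suspect in remaining_suspects(False, False):
    print(suspect)
```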
ctagleAuthor Commented:
Thank you for your help. I ended up just rebuilding the array and restoring from a backup, and the RAID is OK. I'm having OS troubles, but that's for another question. Thank you.
Question has a verified solution.
