Solved

W2K8 R2 - RAID 5 - Software Array Failed Redundancy

Posted on 2013-01-10
Last Modified: 2013-02-19
Disk-Manager.pdf
CASE BACKGROUND
Have a client with a Win2k8 server and a very "kludgey" setup.

Server has 10 SATA drives in it: 6 are controlled by a controller on the board, and 4 are controlled by a PCI SATA expander card.

System is running, but Disk Manager shows "Failed Redundancy" and identifies Disk 8 as the resource at fault (the drive does not show as failed, but does have a yellow and black triangle; see the attached Disk Manager screenshot).

Also have an error in the event log which shows that Disk 10 is the issue (see the event log screenshot).

Have determined that both disks are on the PCI controller, not the main board controller.

WHAT HAS BEEN DONE
Ordered 2 new drives
Identified Drive 8, replaced it with a new drive, brought server back up.
Now in disk manager the entire array shows failed.
Put the "bad" drive back in, brought server back up, array is working, but with the "Failed Redundancy" alert.

Repeated the process with drive 10, same results.

QUESTION
Anybody have any "in the field" experience with this BS?
Read multiple posts, tried multiple other recommendations, but none of them seem to have a step by step so we can actually determine what is going on.
EventLog.pdf
Question by:tech911
17 Comments
 
LVL 47

Expert Comment

by:dlethe
ID: 38763075
What is the make/model of disk?  Reason I ask, if they are NOT enterprise class devices, then they have the wrong firmware for most RAID controllers, and you are pretty much guaranteed data loss.
 
LVL 3

Author Comment

by:tech911
ID: 38763154
Disks are Seagate Barracuda SATA.

There is no "Hardware RAID Controller". The RAID is created using the Win 2K8 Server software RAID utility (read: make all disks Dynamic Disks, bind them together in one volume, etc.).

Not to worry about data loss, the RAID is running and we have a validated backup of everything on it.

If we have to, we will scrap the whole thing and build a new array, but we would like to try to troubleshoot this to see if we can understand what is wrong and how to fix it... It may not be a "fixable" situation.

Forgive me for saying this, but it sounds like you haven't had any time in the field with the MS Server software RAID utility. If not, you really won't be able to help, because it doesn't behave like a "normal" hardware RAID controller (although I wish it did).

Thank you for the suggestion.
 
LVL 3

Expert Comment

by:zackmccracken
ID: 38763158
It seems that you might have misidentified the faulty drive. If you remove a different drive from a RAID 5 set that already has a failed member, the array will not work.

is there a possibility in the raid config to have the faulty drive identified (via blinking option)?
 
LVL 87

Expert Comment

by:rindi
ID: 38763211
If I'm not mistaken, Windows software RAID is being used here, not a RAID controller. The problem is that it uses RAID 5 and 2 disks have failed in that array. That means the complete array is now broken (unless one of the bad disks happened to be a "hot spare"). When you change a bad disk in a RAID array, always replace only one at a time and wait for the array to finish the rebuild before replacing the other. Also, when changing a bad drive in a Windows software RAID, you have to manually add the new disk to the array in Disk Management. It will only start resyncing after that.

I'm afraid your only option now is to remove all the disks from your array, then create a new one. When that is done you'll have to restore the data from the backups.
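To illustrate why RAID 5 survives one missing member but not two, here is a minimal sketch of XOR parity on a single stripe. The 4-disk layout, the block contents, and the helper names `xor_blocks` and `rebuild` are made up for illustration; this is not how Windows implements its RAID-5 volumes internally, only the general parity idea.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across a hypothetical 4-disk RAID 5 set: 3 data blocks + 1 parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

def rebuild(stripe, missing):
    """Rebuild the one missing member of a stripe from the survivors."""
    if len(missing) != 1:
        # With two members gone there is one parity equation and two unknowns.
        raise ValueError("RAID 5 can only rebuild one missing member per stripe")
    survivors = [blk for i, blk in enumerate(stripe) if i not in missing]
    return xor_blocks(survivors)

recovered = rebuild(stripe, {1})  # pretend disk index 1 was pulled
print(recovered)
```

With one member missing, XOR-ing the survivors gives back the lost block. With two missing, `rebuild` raises, which mirrors the array going from "Failed Redundancy" to fully failed.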
 
LVL 3

Author Comment

by:tech911
ID: 38763213
First... The drives have not failed, but seem to have some underlying logical issue (Bad Block, Bad Sector, etc...)

Second, to have a "Blinking Light" you need a HARDWARE BASED RAID CONTROLLER, not a SOFTWARE BASED RAID CONTROLLER.

Third, the way to figure out which drive is which is by the bus number, which we were able to effectively do and we validated our identification by removing the drive which was indicated as having an issue (see disk mgmt screen shot, notice Disk 8).

The problem is that even though we have identified the correct drive(s) there is no way to effectively replace them. When we do, the array enters a failed state with no way to "Un-fail" it.

I hate to beat a dead horse but if you don't have field experience with the MS RAID Software Array Utility, you won't be able to help.
 
LVL 3

Author Comment

by:tech911
ID: 38763224
Rindi -

You are correct, I'm thinking the same thing, but...

The odd thing is the drives are not Failed they are working, but have been identified as having an issue, bad block, etc...not failed.

We did a replacement on Drive 8, ran the resync but the Disk Manager still showed the array as failed...

Thoughts...
 
LVL 47

Accepted Solution

by:
dlethe earned 500 total points
ID: 38763233
The root cause is the disk drives. The problem is that they are not suitable because they lack TLER (time-limited error recovery). The drives will pass every diagnostic on the planet, but that will not change the fact that they can take up to 60 seconds while in deep recovery to get back an unreadable block. That is too long. Enterprise disks guarantee they will return or give up in just a few seconds. (They also have 10-100X better error recovery.)

In fact, if you had bought the WD equivalent of those disks then WD will say warranty is void if you use them in a RAID5 config.

 
LVL 3

Author Comment

by:tech911
ID: 38764057
dlethe,

Thank you for the comments...

Anyone else have something that can move us forward in the discussion?

Commenting on what the IT guy (2 IT guys ago) should have done, or suspecting that the drives are the problem, really doesn't contribute to any type of solution or a deeper understanding of the problem one way or the other; it's just comments.

What we need is an understanding of WHY things are happening the way they are. We want this for overall community benefit. At this point I'm certain that I am going to have to create a brand new array and restore 10TB of data; I just want to dig deeper and understand WHY, in case someone else encounters this problem, since we have the opportunity:

Think about this...

When we replace a drive, the system recognizes it, and it is displayed in Disk Manager.
Disk Manager allows us to add it as a dynamic disk.
It also allows us to resync the new drive with the array, but when the resync finishes, the array is now in a completely failed state (not a Failed Redundancy state)...

Makes no sense...
 
LVL 47

Assisted Solution

by:dlethe
dlethe earned 500 total points
ID: 38764149
The solution is to use different disk drives.  Google "TLER".  That explains the problem in more detail, but again, the problem is pretty simple. The disks will lock up to recover an unreadable block, for up to a minute. All I/O stops during that time. If two disks are doing recovery, then the array crashes. If one disk does that deep recovery then the software thinks the drive failed.  That is why it passes diagnostics.   That is why the drive is good.   The problem is it just takes too long.

Desktop drives are not designed for such use. They are designed not to lose data, and try to recover. Surely you've seen desktop systems just do nothing for 20-60 seconds and then start back up again with no complaints.  This is what you are seeing.
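The timing argument above can be sketched as a toy model. Everything numeric here is an assumption for illustration: the 60-second figure comes from this thread, while the 7-second enterprise limit and the 10-second host timeout are placeholder values, not documented Windows or Seagate settings. The `Drive` class and `read_bad_block` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Drive:
    name: str
    recovery_limit_s: float  # longest the drive keeps retrying a bad block before answering

HOST_TIMEOUT_S = 10.0  # assumed host-side I/O timeout, chosen only for illustration

def read_bad_block(drive, recovery_needed_s):
    """Classify what the RAID layer sees when a drive hits a hard-to-read block."""
    time_to_answer = min(recovery_needed_s, drive.recovery_limit_s)
    if time_to_answer > HOST_TIMEOUT_S:
        return "drive dropped"  # host gave up waiting, so the whole member looks failed
    if recovery_needed_s > drive.recovery_limit_s:
        return "read error"  # drive gave up quickly, so RAID can rebuild the block from parity
    return "ok"  # drive recovered the block in time

desktop = Drive("desktop", recovery_limit_s=60.0)      # deep recovery, per this thread
enterprise = Drive("enterprise", recovery_limit_s=7.0)  # assumed time-limited recovery

for drive in (desktop, enterprise):
    print(drive.name, read_bad_block(drive, recovery_needed_s=30.0))
```

In this model the same marginal block costs the desktop drive its place in the array, while the enterprise drive reports a single read error that the array can repair from the other members.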

If you don't want to replace the drives, then you can minimize risk by going to separate RAID1 groups.

Sorry but that is your only recourse. You bought the wrong kind of disk drives for this intended use.
 
LVL 3

Expert Comment

by:zackmccracken
ID: 38764160
apologies for missing the software raid remark
 
LVL 3

Author Comment

by:tech911
ID: 38887567
I've requested that this question be deleted for the following reason:

There was no real answer to this. The fix was replace the array with a new one and get a new controller for the old one, then blow out everything on it, and build it fresh.
 
LVL 47

Expert Comment

by:dlethe
ID: 38887568
There is absolutely a good and correct answer to this.  Read #38763233 and #38764149.  The drives are timing out due to TLER issues. They are unsuitable and incompatible for this type of use.
 
LVL 47

Expert Comment

by:dlethe
ID: 38891670
I might add that unless you use an enterprise class drive, or a controller that is designed to work with desktop drives, then you have not addressed root cause, so will continue to have this problem in the future.
 
LVL 3

Author Comment

by:tech911
ID: 38893074
Thank you for your opinion...

It's not the drives... it's the whole configuration. The person who thought of having 6 drives controlled by the motherboard and 4 more controlled by a cheap 4-port SATA expansion card, and then running the whole thing as a Windows software RAID, was a fool.

This was a recipe for disaster waiting to happen. The most important thing about this setup was to have a backup of the data because you were going to need it.

Sorry the answer didn't solve the problem, but I can't be asked to reward points for an answer that says replace everything (which is the answer to the bigger picture) but not the answer to the actual problem (for which there appears not to be an answer that can fix the problem, i.e. there is no answer...for this described problem.)

Thanks again for the participation and time; your efforts are always valued by the Experts Exchange customers.
 
LVL 47

Expert Comment

by:dlethe
ID: 38893195
There is absolutely nothing wrong with windows software RAID that will be fixed with controller-based RAID.  The drives are incompatible with this particular intended use. It is an interoperability issue.  Enterprise drives ARE designed to work in this  situation because they have 100X more ECC tolerance, so the number of recoverable/unrecoverable errors is reduced by orders of magnitude.  The risk of multiple unrecoverable errors on the same stripe is almost zero.
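As a rough, simplified way to see the "orders of magnitude" point, the sketch below estimates the chance of hitting at least one unrecoverable read error while reading about 10TB, the data size mentioned earlier in this thread. The error rates of 1 per 10^14 bits and 1 per 10^15 bits are assumed, typical spec-sheet style figures, not values taken from the drives in this server, and the model assumes independent errors. It looks at a single error over a full read, which is a simpler question than multiple errors landing on the same stripe.

```python
import math

def p_unrecoverable_read(bytes_read, ure_per_bit):
    """Probability of at least one unrecoverable read error while reading bytes_read bytes."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 10**12
rebuild_read = 10 * TB  # roughly the 10TB mentioned in this thread

desktop = p_unrecoverable_read(rebuild_read, 1e-14)     # assumed desktop-class rate
enterprise = p_unrecoverable_read(rebuild_read, 1e-15)  # assumed enterprise-class rate
print(f"desktop: {desktop:.0%}, enterprise: {enterprise:.0%}")
```

Under these assumptions the desktop-class figure comes out above one half and the enterprise-class figure below one tenth, so a full resync on desktop-class disks is far more likely to trip over a bad block.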

You are incorrectly under the impression that desktop class and enterprise drives have the same operating characteristics. They do not. As for hardware vs. software RAID, you probably are not aware that NetApp and even EMC have $100,000+ subsystems that use software, not controller-based, RAID.

I've got cloud customers with over 50 petabytes of SOFTWARE RAID. It works, but only if you have the right kind of disks, with correct firmware settings.