W2K8 R2 - RAID 5 - Software Array Failed Redundancy

Have client with Win2k8 Server and a very "kludgey" setup.

Server has 10 SATA Drives in it, 6 are controlled by a controller on the board, 4 are controlled by a PCI SATA Expander Card.

System is running; Disk Manager shows "Failed Redundancy" and identifies Disk 8 as the resource at fault (the drive does not show as failed, but does have a yellow-and-black triangle; see disk manager screenshot attached).

Also have error in event log which shows that Disk 10 is the issue {see event log screenshot}

Have determined that both disks are on the PCI controller, not the main board controller.

Ordered 2 new drives
Identified Drive 8, replaced it with a new drive, brought server back up.
Now in disk manager the entire array shows failed.
Put the "bad" drive back in, brought the server back up; the array is working, but with the "Failed Redundancy" alert.

Repeated the process with drive 10, same results.

Anybody have any "in the field" experience with this BS?
Read multiple posts and tried multiple other recommendations, but none of them seem to have a step-by-step so we can actually determine what is going on.

What is the make/model of the disks? Reason I ask: if they are NOT enterprise-class devices, then they have the wrong firmware for most RAID controllers, and you are pretty much guaranteed data loss.
tech911Author Commented:
Disks are Seagate Barracuda SATA.

There is no "Hardware RAID Controller" the RAID is created using the Win 2K8 Server software RAID utility (Read = Make all disks Dynamic Disks, bind them together in one volume, etc...)

Not to worry about data loss, the RAID is running and we have a validated backup of everything on it.

If we have to, we will scrap the whole thing and build a new array, but we would like to troubleshoot this to see if we can understand what is wrong and how to fix it... It may not be a "fixable" situation.

Forgive me for saying this, but it sounds like you haven't had any time in the field with the MS Server software RAID utility; if not, you really won't be able to help, because it doesn't behave like a "normal" hardware RAID controller (although I wish it did).

Thank you for the suggestion.
It seems that you might have mislocated the faulty drive; if you remove a working drive from a RAID 5 set that already has a failed member, the array will not work.

Is there an option in the RAID config to identify the faulty drive (via a blink/locate option)?

If I'm not mistaken, Windows software RAID is being used here, not a RAID controller. The problem is that it uses RAID 5 and 2 disks have failed in that array. That means the complete array is now broken (unless one of the bad disks happened to be a hot spare). When you change a bad disk in a RAID array, always replace only one at a time and wait for the array to finish the rebuild before replacing the other. Also, when changing a bad drive in a Windows software RAID, you have to manually add the new disk to the array in Disk Management; it will only start resyncing after that.
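To see why losing a second member kills the set: RAID 5 keeps one parity block per stripe, computed as the XOR of that stripe's data blocks, so one missing block can be reconstructed but two cannot. A minimal sketch (disk count and block contents are made up for illustration):

```python
from functools import reduce

def parity(blocks):
    """XOR the blocks of one stripe together to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe across a 4-disk RAID 5: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Lose ONE member: XOR the survivors with parity and the missing block returns.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]

# Lose TWO members: one XOR equation, two unknowns -- nothing is recoverable,
# which is why pulling another drive from a degraded set fails the whole array.
```

The same arithmetic is why the one-at-a-time replacement rule matters: the rebuild itself consumes the only redundancy the set has.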

I'm afraid your only option now is to remove all the disks from your array, then create a new one. When that is done you'll have to restore the data from the backups.
tech911Author Commented:
First... The drives have not failed, but seem to have some underlying logical issue (Bad Block, Bad Sector, etc...)


Third, the way to figure out which drive is which is by the bus number, which we were able to do effectively; we validated our identification by removing the drive indicated as having an issue (see the Disk Management screenshot; notice Disk 8).

The problem is that even though we have identified the correct drive(s) there is no way to effectively replace them. When we do, the array enters a failed state with no way to "Un-fail" it.

I hate to beat a dead horse but if you don't have field experience with the MS RAID Software Array Utility, you won't be able to help.
tech911Author Commented:
Rindi -

You are correct, I'm thinking the same thing, but...

The odd thing is the drives are not failed; they are working, but have been identified as having an issue (bad block, etc.), not failed.

We did a replacement on Drive 8 and ran the resync, but Disk Manager still showed the array as failed...

Root cause is the disk drives. The problem is that they are not suitable because they lack TLER (Time-Limited Error Recovery). The drives will pass every diagnostic on the planet, but that will not change the fact that they can take up to 60 seconds in deep recovery to get back an unreadable block. That is too long. Enterprise disks guarantee they will return data or give up in just a few seconds. (They also have 10-100X better error recovery.)

In fact, if you had bought the WD equivalent of those disks then WD will say warranty is void if you use them in a RAID5 config.
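The timeout mismatch described above can be sketched in a few lines. The numbers here are illustrative assumptions, not published specs; the RAID layer's patience, the TLER cap, and desktop deep-recovery times all vary by product:

```python
# Illustrative numbers only -- actual timeouts vary by drive and RAID stack.
RAID_TIMEOUT_S = 10      # how long the RAID layer waits before failing a drive
TLER_LIMIT_S = 7         # TLER drive: report the unreadable block and move on
DESKTOP_RECOVERY_S = 60  # desktop drive: deep recovery can run this long

def raid_view_of_read(recovery_time_s, timeout_s=RAID_TIMEOUT_S):
    """How the RAID layer interprets a read that hit an unreadable block."""
    if recovery_time_s <= timeout_s:
        return "read error reported -> block rebuilt from parity, drive stays"
    return "drive unresponsive -> whole drive marked failed"

assert "drive stays" in raid_view_of_read(TLER_LIMIT_S)
assert "marked failed" in raid_view_of_read(DESKTOP_RECOVERY_S)
```

Note that in both branches the sector itself is unreadable; the only difference is whether the drive admits it quickly, which is exactly the behavior TLER buys you.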

tech911Author Commented:

Thank you for the comments...

Anyone else have something that can move us forward in the discussion.  

Commenting on what the IT guy (2 IT guys ago) should have done, or suspecting that the drives are the problem, really doesn't contribute to any type of solution or a deeper understanding of the problem one way or the other; it's just comments.

What we need is an understanding of WHY things are happening the way they are. We want this for an overall community benefit; at this point I'm certain that I am going to have to create a brand-new array and restore 10TB of data. I just want to dig deeper and understand WHY, in case someone else encounters this problem, since we have the opportunity:

Think about this...

When we replace a drive, the system recognizes it, and it is displayed in Disk Manager.
Disk Manager allows us to add it as a dynamic disk.
It also allows us to resync the new drive with the array, but when the resync finishes, the array is now in a completely failed state (not a "Failed Redundancy" state)...

Makes no sense...
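One plausible reading of this sequence (an assumption on my part, since the screenshots aren't visible here): a resync must read every stripe from every surviving member, so a single unreadable sector on the *other* marginal drive counts as a second failure mid-rebuild and drops the set straight from resyncing to Failed. A toy model:

```python
def resync(survivor_count, bad_sectors, stripes=8):
    """Toy RAID 5 rebuild: every stripe must be read from every survivor.

    bad_sectors maps survivor index -> set of unreadable stripe numbers.
    The stripe count is tiny purely for illustration.
    """
    for stripe in range(stripes):
        for disk in range(survivor_count):
            if stripe in bad_sectors.get(disk, set()):
                return "Failed"          # second failure during rebuild
    return "Healthy"

# All survivors clean: the resync completes and redundancy is restored.
assert resync(9, {}) == "Healthy"
# One bad block on another marginal member aborts the rebuild entirely.
assert resync(9, {4: {5}}) == "Failed"
```

That would explain why the array survives in "Failed Redundancy" (nothing forces a full read of every sector) but dies the moment a resync sweeps the whole surface.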
The solution is to use different disk drives. Google "TLER"; that explains the problem in more detail, but again, the problem is pretty simple: the disks will lock up for up to a minute to recover an unreadable block, and all I/O stops during that time. If two disks are doing recovery, the array crashes. If one disk does that deep recovery, the software thinks the drive failed. That is why it passes diagnostics, and that is why the drive looks good; the problem is it just takes too long.

Desktop drives are not designed for such use. They are designed not to lose data, and try to recover. Surely you've seen desktop systems just do nothing for 20-60 seconds and then start back up again with no complaints.  This is what you are seeing.

If you don't want to replace the drives, then you can minimize risk by going to separate RAID1 groups.

Sorry but that is your only recourse. You bought the wrong kind of disk drives for this intended use.
Apologies for missing the software RAID remark.
tech911Author Commented:
I've requested that this question be deleted for the following reason:

There was no real answer to this. The fix was to replace the array with a new one, get a new controller for the old one, blow out everything on it, and build it fresh.
There is absolutely a good and correct answer to this.  Read #38763233 and #38764149.  The drives are timing out due to TLER issues. They are unsuitable and incompatible for this type of use.
I might add that unless you use an enterprise class drive, or a controller that is designed to work with desktop drives, then you have not addressed root cause, so will continue to have this problem in the future.
tech911Author Commented:
Thank you for your opinion...

It's not the drives; it's the whole configuration. The person who thought having 6 drives controlled by the motherboard and 4 more controlled by a cheap 4-port SATA expansion card, and then running the whole thing as a Windows software RAID, was a fool.

This was a recipe for disaster. The most important thing about this setup was to have a backup of the data, because you were going to need it.

Sorry the answer didn't solve the problem, but I can't be expected to award points for an answer that says "replace everything" (which is the answer to the bigger picture) but is not the answer to the actual problem, for which there appears to be no fix; i.e., there is no answer for this described problem.

Thanks again for the participation and time; your efforts are always valued by the Experts Exchange customers.
There is absolutely nothing wrong with Windows software RAID that will be fixed with controller-based RAID. The drives are incompatible with this particular intended use; it is an interoperability issue. Enterprise drives ARE designed to work in this situation because they have 100X more ECC tolerance, so the number of recoverable/unrecoverable errors is reduced by orders of magnitude. The risk of multiple unrecoverable errors on the same stripe is almost zero.

You are incorrectly under the impression that desktop-class and enterprise drives have the same operating characteristics. They do not. As for hardware vs. software RAID, I think you are probably not aware that NetApp and even EMC have $100,000+ subsystems that use software, not controller-based RAID.

I've got cloud customers with over 50 petabytes of SOFTWARE RAID. It works, but only if you have the right kind of disks, with correct firmware settings.