Solved

W2K8 R2 - RAID 5 - Software Array Failed Redundancy

Posted on 2013-01-10
Last Modified: 2013-02-19
Disk-Manager.pdf

CASE BACKGROUND
Have a client with a Win2k8 server and a very "kludgy" setup.

Server has 10 SATA drives in it: 6 are controlled by a controller on the motherboard, and 4 are controlled by a PCI SATA expander card.

System is running. Disk manager shows "Failed Redundancy" and identifies Disk 8 as the resource at fault (the drive does not show as failed, but does have a yellow and black triangle; see the attached disk manager screenshot).

There is also an error in the event log which shows Disk 10 as the issue (see the event log screenshot).

Have determined that both disks are on the PCI controller, not the main board controller.

WHAT HAS BEEN DONE
Ordered 2 new drives
Identified Drive 8, replaced it with a new drive, brought server back up.
Now in disk manager the entire array shows failed.
Put the "bad" drive back in, brought the server back up; the array is working, but with the "failed redundancy" alert.

Repeated the process with drive 10, same results.

QUESTION
Anybody have any "in the field" experience with this BS?
Read multiple posts and tried multiple other recommendations, but none of them has a step-by-step procedure so we can actually determine what is going on.
EventLog.pdf
Question by:tech911
17 Comments
 
LVL 47

Expert Comment

by:David
ID: 38763075
What is the make/model of disk?  Reason I ask, if they are NOT enterprise class devices, then they have the wrong firmware for most RAID controllers, and you are pretty much guaranteed data loss.
 
LVL 3

Author Comment

by:tech911
ID: 38763154
Disks are Seagate Barracuda SATA.

There is no "Hardware RAID Controller"; the RAID is created using the Win 2K8 Server software RAID utility (read: make all disks dynamic disks, bind them together in one volume, etc.).

Not to worry about data loss, the RAID is running and we have a validated backup of everything on it.

If we have to, we will scrap the whole thing and build a new array, but we would like to try to troubleshoot this to see if we can understand what is wrong and how to fix it... It may not be a "fixable" situation.

Forgive me for saying this, but it sounds like you haven't had any time in the field with the MS Server Software RAID utility. If not, you really won't be able to help, because it doesn't behave like a "normal" hardware RAID controller (although I wish it did).

Thank you for the suggestion.
 
LVL 3

Expert Comment

by:zackmccracken
ID: 38763158
It seems that you might have mislocated the faulty drive. If you remove another drive from a RAID 5 set that has already lost redundancy, the array will not work.

Is there a possibility in the RAID config to have the faulty drive identified (via a blinking option)?
 
LVL 88

Expert Comment

by:rindi
ID: 38763211
If I'm not mistaken, Windows Software RAID is being used here, not a RAID controller. The problem is that it uses RAID 5 and 2 disks have failed in that array. That means the complete array is now broken (unless one of the bad disks happened to be a "hot spare"). When you change a bad disk in a RAID array, always replace only one at a time and wait for the array to finish the rebuild before replacing the other. Also, when changing a bad drive in a Windows software RAID, you have to manually add the new disk to the array in Disk Management. It will only start resyncing after that.

I'm afraid your only option now is to remove all the disks from your array, then create a new one. When that is done you'll have to restore the data from the backups.
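
To illustrate why a RAID 5 set survives one lost member but not two, here is a minimal Python sketch of XOR parity. It is a toy model, not how Windows dynamic disks are implemented: the helper names `xor_blocks` and `rebuild_missing` and the two-byte blocks are hypothetical, chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the single missing block (marked None) in one stripe.

    The stripe holds the data blocks plus one parity block. A single parity
    block can only cover a single lost block, so anything other than exactly
    one missing block raises ValueError.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"expected exactly one missing block, found {len(missing)}")
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)


# Three hypothetical data blocks plus their parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)
stripe = data + [parity]

# One member lost: the block is recomputed from the survivors.
one_lost = [stripe[0], None, stripe[2], stripe[3]]
recovered = rebuild_missing(one_lost)

# Two members lost: there is not enough information left to rebuild.
two_lost = [None, None, stripe[2], stripe[3]]
try:
    rebuild_missing(two_lost)
    two_lost_ok = True
except ValueError:
    two_lost_ok = False
```

Here `recovered` equals the original second data block, while the two-member case cannot be rebuilt, which is why the array goes from "Failed Redundancy" to "Failed" once a second member drops out.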
 
LVL 3

Author Comment

by:tech911
ID: 38763213
First... The drives have not failed, but seem to have some underlying logical issue (Bad Block, Bad Sector, etc...)

Second to have a "Blinking Light" you need a HARDWARE BASED RAID CONTROLLER not a SOFTWARE BASED RAID CONTROLLER

Third, the way to figure out which drive is which is by the bus number, which we were able to effectively do and we validated our identification by removing the drive which was indicated as having an issue (see disk mgmt screen shot, notice Disk 8).

The problem is that even though we have identified the correct drive(s) there is no way to effectively replace them. When we do, the array enters a failed state with no way to "Un-fail" it.

I hate to beat a dead horse but if you don't have field experience with the MS RAID Software Array Utility, you won't be able to help.
 
LVL 3

Author Comment

by:tech911
ID: 38763224
Rindi -

You are correct, I'm thinking the same thing, but...

The odd thing is the drives are not failed. They are working, but have been identified as having an issue (bad block, etc.), not as failed.

We did a replacement on Drive 8, ran the resync but the Disk Manager still showed the array as failed...

Thoughts...
 
LVL 47

Accepted Solution

by:David
David earned 2000 total points
ID: 38763233
Root cause is the disk drives. The problem is that they are not suitable because they lack TLER (time-limited error recovery). The drives will pass every diagnostic on the planet, but that will not change the fact that they can take up to 60 seconds in deep recovery trying to get back an unreadable block. That is too long. Enterprise disks guarantee they will return the data or give up in just a few seconds. (They also have 10-100X better error recovery.)

In fact, if you had bought the WD equivalent of those disks then WD will say warranty is void if you use them in a RAID5 config.
 
LVL 3

Author Comment

by:tech911
ID: 38764057
dlethe,

Thank you for the comments...

Anyone else have something that can move us forward in the discussion?

Commenting on what the IT guy (2 IT guys ago) should have done, or suspecting that the drives are the problem, really doesn't contribute to any type of solution or a deeper understanding of the problem one way or the other; it's just comments.

What we need is an understanding of WHY things are happening the way they are. We want this for overall community benefit. At this point I'm certain that I am going to have to create a brand new array and restore 10TB of data; I just want to dig deeper and understand WHY, in case someone else encounters this problem, since we have the opportunity:

Think about this...

When we replace a drive, the system recognizes it, and it is displayed in disk manager.
Disk manager allows us to add it as a dynamic disk.
It also allows us to resync the new drive with the array, but when the resync finishes, the array is in a completely failed state (not a Failed Redundancy state)...

Makes no sense...
 
LVL 47

Assisted Solution

by:David
David earned 2000 total points
ID: 38764149
The solution is to use different disk drives.  Google "TLER".  That explains the problem in more detail, but again, the problem is pretty simple. The disks will lock up to recover an unreadable block, for up to a minute. All I/O stops during that time. If two disks are doing recovery, then the array crashes. If one disk does that deep recovery then the software thinks the drive failed.  That is why it passes diagnostics.   That is why the drive is good.   The problem is it just takes too long.

Desktop drives are not designed for such use. They are designed not to lose data, and try to recover. Surely you've seen desktop systems just do nothing for 20-60 seconds and then start back up again with no complaints.  This is what you are seeing.

If you don't want to replace the drives, then you can minimize risk by going to separate RAID1 groups.

Sorry but that is your only recourse. You bought the wrong kind of disk drives for this intended use.
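
The timing argument above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how Windows actually decides to drop a member: the function `array_state`, the 10-second array timeout and the 7-second enterprise cap are hypothetical values picked for illustration. Only the "up to 60 seconds" deep-recovery figure comes from this thread.

```python
def array_state(recovery_times, array_timeout_seconds):
    """Classify a single-parity array from each member's time to answer a read.

    A member that takes longer than the array-level timeout is treated as
    dropped, even though the drive itself is healthy and would pass diagnostics.
    """
    dropped = sum(1 for seconds in recovery_times if seconds > array_timeout_seconds)
    if dropped == 0:
        return "healthy"
    if dropped == 1:
        return "failed redundancy"
    return "failed"


ARRAY_TIMEOUT = 10.0            # hypothetical timeout at the RAID layer
NORMAL_READ = 0.01              # hypothetical time for an ordinary read
DESKTOP_DEEP_RECOVERY = 60.0    # upper bound quoted in this thread
ENTERPRISE_RECOVERY_CAP = 7.0   # hypothetical "give up in a few seconds" cap

# Ten members, one desktop drive stuck in deep recovery.
one_slow = array_state([NORMAL_READ] * 9 + [DESKTOP_DEEP_RECOVERY], ARRAY_TIMEOUT)

# Ten members, two desktop drives stuck in deep recovery at the same time.
two_slow = array_state([NORMAL_READ] * 8 + [DESKTOP_DEEP_RECOVERY] * 2, ARRAY_TIMEOUT)

# Same situation, but both drives give up within the capped recovery time.
capped = array_state([NORMAL_READ] * 8 + [ENTERPRISE_RECOVERY_CAP] * 2, ARRAY_TIMEOUT)
```

With these assumed numbers, one slow member yields "failed redundancy", two yield "failed", and drives with a capped recovery time keep the array "healthy" because they report the error quickly and let the RAID layer rebuild the block from parity.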
 
LVL 3

Expert Comment

by:zackmccracken
ID: 38764160
apologies for missing the software raid remark
 
LVL 3

Author Comment

by:tech911
ID: 38887567
I've requested that this question be deleted for the following reason:

There was no real answer to this. The fix was to replace the array with a new one and get a new controller for the old one, then blow out everything on it and build it fresh.
 
LVL 47

Expert Comment

by:David
ID: 38887568
There is absolutely a good and correct answer to this.  Read #38763233 and #38764149.  The drives are timing out due to TLER issues. They are unsuitable and incompatible for this type of use.
 
LVL 47

Expert Comment

by:David
ID: 38891670
I might add that unless you use an enterprise class drive, or a controller that is designed to work with desktop drives, then you have not addressed root cause, so will continue to have this problem in the future.
 
LVL 3

Author Comment

by:tech911
ID: 38893074
Thank you for your opinion...

It's not the drives... it's the whole configuration. The person who thought of having 6 drives controlled by the motherboard and 4 more controlled by a cheap 4-port SATA expansion card, and then running the whole thing as a Windows Software RAID, was a fool.

This was a recipe for disaster waiting to happen. The most important thing about this setup was to have a backup of the data because you were going to need it.

Sorry the answer didn't solve the problem, but I can't be asked to reward points for an answer that says replace everything (which is the answer to the bigger picture) but not the answer to the actual problem (for which there appears not to be an answer that can fix the problem, i.e. there is no answer...for this described problem.)

Thanks again for the participation and time. Your efforts are always valued by the Experts Exchange customers.
 
LVL 47

Expert Comment

by:David
ID: 38893195
There is absolutely nothing wrong with windows software RAID that will be fixed with controller-based RAID.  The drives are incompatible with this particular intended use. It is an interoperability issue.  Enterprise drives ARE designed to work in this  situation because they have 100X more ECC tolerance, so the number of recoverable/unrecoverable errors is reduced by orders of magnitude.  The risk of multiple unrecoverable errors on the same stripe is almost zero.

You are incorrectly under the impression that desktop class and enterprise drives have the same operating characteristics. They do not. As for hardware vs. software RAID, you probably are not aware that NetApp and even EMC have $100,000+ subsystems that use software, not controller-based, RAID.

I've got cloud customers with over 50 petabytes of SOFTWARE RAID. It works, but only if you have the right kind of disks, with correct firmware settings.
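
As a rough illustration of the "orders of magnitude" point, the following Python sketch estimates the chance of hitting at least one unrecoverable read error while reading a given amount of data. Everything numeric here is an assumption: the desktop rate of one error per 1e14 bits is a commonly published spec-sheet figure, the enterprise rate simply applies the 100X factor claimed above, and errors are treated as independent, which real drives do not guarantee. The 10 TB size is the figure mentioned earlier in this thread.

```python
import math


def p_unreadable(bytes_read, bits_per_error):
    """Probability of at least one unrecoverable read error over bytes_read.

    Models each bit as failing independently with probability 1/bits_per_error
    and evaluates 1 - (1 - 1/bits_per_error) ** bits in a numerically stable way.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


ARRAY_BYTES = 10 * 10**12                   # 10 TB, the data size mentioned in the thread
DESKTOP_BITS_PER_ERROR = 1e14               # assumed desktop-class spec-sheet rate
ENTERPRISE_BITS_PER_ERROR = 1e14 * 100      # assumed: the claimed 100X improvement

p_desktop = p_unreadable(ARRAY_BYTES, DESKTOP_BITS_PER_ERROR)
p_enterprise = p_unreadable(ARRAY_BYTES, ENTERPRISE_BITS_PER_ERROR)
```

Under these assumptions, `p_desktop` comes out a little over one half while `p_enterprise` is below one percent. The exact values matter less than the gap between them, and a real rebuild depends on how much data each surviving member has to supply.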
Question has a verified solution.
