  • Status: Solved
  • Priority: Medium
  • Security: Public

Bad stripe on RAID-1 primary Domain Controller

Hi experts,

Some weeks ago we had a power failure which caused one of the RAID-1 disks in our primary domain controller (Windows Server 2003) to fail. This server is an IBM System x3550. Once the disk was replaced, the array rebuilt successfully, but since then an error message occurs in the event log:

One or more logical drives contain a bad stripe: controller 1.

Now I understand that the only way to get rid of this error is to break the RAID array and rebuild it - which means wiping all the data off the disk.

What would you recommend I do - should I go ahead and rebuild the array, or could I safely ignore the errors? Would ignoring the errors result in an eventual problem?

If I should rebuild the array, what's the best sequence to do this in? I was thinking the following - please advise if this is correct or if I should change something:

1. dcpromo the primary DC so it is no longer a DC - both primary and secondary DCs are global catalogs, so I assume that doing this will make the secondary DC become primary (as it'll be the only one).

2. Run NTBackup of system state and disk

3. Rebuild array, run a basic reinstall of the O/S, then restore the NTBackup

4. dcpromo the machine to make it a DC - but how do I make it the primary? Also, this machine is the primary DNS and WINS server, as well as DHCP - I assume NTBackup will take care of DHCP and dcpromo will take care of DNS/WINS replication?

Guidance would be appreciated! Thanks

Jon
Asked by: Jon Winterburn
 
DanielGould commented:
If this error is showing, I would verify that your new disk has correctly rebuilt, then pull the other disk in the mirror before reconnecting it and letting it rebuild. I had this same issue on an HP BL25P blade server and that fixed it.
 
Jon Winterburn (author) commented:
So let me get this right:

Drive 0 was the physical drive that failed, drive 1 was fine. So drive 0 was replaced. To ensure I understand you correctly...

Are you saying I should:
power off, eject drive 1
power up and let it rebuild
power off, insert drive 1
power up and let logical drive rebuild

Is that right?
 
DanielGould commented:
Are the drives hot-swap? If they are, just unplug drive 1, wait 30 seconds and then reconnect it, with the server powered on and running. This should cause the controller to rebuild it from the drive 0 data.

If they're not hot-swap, use your array config utility to tell the controller that drive 1 was replaced and force it to rebuild it (same as you would have done to rebuild drive 0 when it was replaced).

You shouldn't lose any data as drive 0 is consistent (check that the array is definitely rebuilt and consistent before you do any of this).
 
Jon Winterburn (author) commented:
Okay, I have marked drive 1 as defunct and rebooted - it is now rebuilding the array in the same way it did when I replaced drive 0. However, it still states that "One or more logical drives contain a bad stripe: controller 1." - I assume this will remain until the array is rebuilt, then (hopefully) go away? Or does that mean that the bad stripe exists on drive 0 and not drive 1?
 
DanielGould commented:
It should go once disk 1 has rebuilt. If it still shows, are you able to do a controller-level consistency check?
 
Jon Winterburn (author) commented:
The error still remains, and now I also get the error below:
This server is the owner of the following FSMO role, but does not consider it valid. For the partition which contains the FSMO, this server has not replicated successfully with any of its partners since this server has been restarted. Replication errors are preventing validation of this role. 
 
Operations which require contacting a FSMO operation master will fail until this condition is corrected. 
 
FSMO Role: CN=Schema,CN=Configuration,DC=dialacab,DC=co,DC=uk 
 
User Action: 
 
1. Initial synchronization is the first early replications done by a system as it is starting. A failure to initially synchronize may explain why a FSMO role cannot be validated. This process is explained in KB article 305476. 
2. This server has one or more replication partners, and replication is failing for all of these partners. Use the command repadmin /showrepl to display the replication errors.  Correct the error in question. For example there maybe problems with IP connectivity, DNS name resolution, or security authentication that are preventing successful replication. 
3. In the rare event that all replication partners being down is an expected occurance, perhaps because of maintenance or a disaster recovery, you can force the role to be validated. This can be done by using NTDSUTIL.EXE to seize the role to the same server. This may be done using the steps provided in KB articles 255504 and 324801 on http://support.microsoft.com. 
 
The following operations may be impacted: 
Schema: You will no longer be able to modify the schema for this forest. 
Domain Naming: You will no longer be able to add or remove domains from this forest. 
PDC: You will no longer be able to perform primary domain controller operations, such as Group Policy updates and password resets for non-Active Directory accounts. 
RID: You will not be able to allocation new security identifiers for new user accounts, computer accounts or security groups. 
Infrastructure: Cross-domain name references, such as universal group memberships, will not be updated properly if their target object is moved or renamed.
 
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

 
Jon Winterburn (author) commented:
If it helps, this is the output of repadmin /showrepl:

Source: Default-First-Site-Name\DC2
******* 1 CONSECUTIVE FAILURES since 2008-10-16 12:56:44
Last error: 1908 (0x774):
            Can't retrieve message string 1908 (0x774), error 1815.
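
For anyone scripting this check, here is a minimal sketch (Python, not part of the original thread) that pulls the failing source and error code out of output shaped like the snippet above. The function name `parse_failures` and the `SAMPLE` constant are made up for illustration, and the parsing assumes exactly the line layout quoted here; real `repadmin /showrepl` output carries more fields. As far as I know, Win32 error 1908 is ERROR_DOMAIN_CONTROLLER_NOT_FOUND, which would be consistent with the DCs being unable to reach each other.

```python
import re

# Sample text copied from the post above; real repadmin output has more fields.
SAMPLE = """\
Source: Default-First-Site-Name\\DC2
******* 1 CONSECUTIVE FAILURES since 2008-10-16 12:56:44
Last error: 1908 (0x774):
            Can't retrieve message string 1908 (0x774), error 1815.
"""


def parse_failures(text):
    """Return one dict per 'Source:' block that reports consecutive failures."""
    results = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Source:\s*(.+)", line)
        if m:
            # Start a new block for this replication partner.
            current = {"source": m.group(1), "failures": 0, "error": None}
            results.append(current)
            continue
        if current is None:
            continue
        m = re.search(r"(\d+) CONSECUTIVE FAILURES since (.+)", line)
        if m:
            current["failures"] = int(m.group(1))
            current["since"] = m.group(2)
            continue
        m = re.match(r"Last error:\s*(\d+)\s*\(0x([0-9a-fA-F]+)\)", line)
        if m:
            code = int(m.group(1))
            # Sanity check: the decimal and hex forms should agree (1908 == 0x774).
            assert code == int(m.group(2), 16)
            current["error"] = code
    return [r for r in results if r["failures"] > 0]
```

Calling `parse_failures(SAMPLE)` gives a single entry for the DC2 partner with its failure count and error code, which is easier to alert on than scanning the raw text by eye.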
 
Jon Winterburn (author) commented:
Having dropped into controller-level (post-boot), I cannot find anything that allows me to check the consistency of the array - only initialize drives, create/delete arrays etc. The only verification available is on the physical drives themselves, which I am loath to do as it takes so long.
 
DanielGould commented:
Looks like you have more issues than just a RAID issue there. Check that your servers can talk to each other. Don't start making lots of AD-level changes (roles, etc.) as that will cause you more problems. Make sure DNS and time sync are happy across your other DCs first. Make sure your logical drive is 100% consistent on that DC too.
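
A quick way to sanity-check name resolution from each DC is sketched below in Python. This is only an illustration under assumptions: `check_resolution` is a hypothetical helper, any host names you pass in are your own (the thread only mentions DC2 and the dialacab.co.uk domain), and it tests plain forward lookups, not the SRV records that AD replication also relies on.

```python
import socket


def check_resolution(names, resolver=socket.gethostbyname):
    """Map each host name to its resolved IPv4 address, or None if lookup fails.

    The resolver is injectable so the function can be exercised without
    touching a real DNS server.
    """
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError.
            results[name] = None
    return results
```

Run it on each DC with the other DC's fully qualified name, for example `check_resolution(["dc2.example.test"])` with your real host name substituted for the placeholder. A `None` value points at a DNS problem worth fixing before touching any FSMO roles.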
 
DanielGould commented:
http://support.microsoft.com/kb/914032

gives more information. It states that this error is not raised in advance; it only appears once an operation that requires the role is executed.
 
Jon Winterburn (author) commented:
Okay, reset all NIC settings and rebooted both DCs, then ran dcdiag and netdiag and repadmin /showrepl and all is well - no more errors, thankfully!

So returning to the RAID issue - I can't figure out how to check the consistency of the logical drive. IBM ServeRAID Manager doesn't offer this option (it just informs me there are bad stripes) and the controller-level manager only provides a tool to verify the physical disks or create/delete the array. So how do I check the consistency of the logical drive?
 
DanielGould commented:
Hmmm, I'm not 100% sure on the IBM ServeRAID (I'm mainly an HP person). If these errors have been showing since the drive replacement, it could also be that the replacement drive you used has faults. Was it a new drive or a recycled one? If you have another drive, I'd suggest swapping out drive 0 again.