Solved

Bad stripe on RAID-1 primary Domain Controller

Posted on 2008-10-16
Last Modified: 2012-08-14
Hi experts,

Some weeks ago we had a power failure which caused one of the RAID-1 disks in our primary domain controller (Windows Server 2003) to fail. This server is an IBM System x3550. Once the disk was replaced, the array rebuilt successfully, but since then an error message occurs in the event log:

One or more logical drives contain a bad stripe: controller 1.

Now I understand that the only way to get rid of this error is to break the RAID array and rebuild it - which means wiping all the data off the disk.

What would you recommend I do - should I go ahead and rebuild the array, or could I safely ignore the errors? Would ignoring the errors result in an eventual problem?

If I should rebuild the array, what's the best sequence to do this in? I was thinking the following - please advise if this is correct or if I should change something:

1. dcpromo the primary DC so it is no longer a DC - both primary and secondary DCs are global catalogs, so I assume that doing this will make the secondary DC become primary (as it'll be the only one).

2. Run NTBackup of system state and disk

3. Rebuild array, run a basic reinstall of the O/S, then restore the NTBackup

4. dcpromo the machine to make it a DC - but how do I make it the primary? Also, this machine is the primary DNS and WINS server, as well as DHCP - I assume NTBackup will take care of DHCP and dcpromo will take care of DNS/WINS replication?
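Before any demotion it may help to record which DC currently holds each FSMO role, since "primary" in a Windows Server 2003 domain effectively means holding those roles. Below is a minimal Python sketch, assuming the output of `netdom query fsmo` has been saved as text. The sample text, the role labels, and the host names are hypothetical, and the exact label wording may differ between Windows versions.

```python
import re

# Hypothetical sample of `netdom query fsmo` output; labels and host names
# are assumptions for illustration, not taken from the thread.
SAMPLE = """\
Schema owner                dc1.example.local
Domain role owner           dc1.example.local
PDC role                    dc1.example.local
RID pool manager            dc1.example.local
Infrastructure owner        dc1.example.local
The command completed successfully.
"""

def parse_fsmo(text):
    """Map each role label to its holder, splitting on runs of 2+ spaces."""
    roles = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) == 2:
            roles[parts[0]] = parts[1].lower()
    return roles

def roles_held_by(text, host):
    """Return the sorted role labels currently held by `host`."""
    host = host.lower()
    return sorted(label for label, holder in parse_fsmo(text).items()
                  if holder == host)

if __name__ == "__main__":
    for label in roles_held_by(SAMPLE, "dc1.example.local"):
        print("still held:", label)
```

If the list for the server about to be rebuilt is not empty, the roles would need to end up on the other DC before the machine is wiped.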

Guidance would be appreciated! Thanks

Jon
Question by:Jon Winterburn
12 Comments
 
LVL 1

Expert Comment

by:DanielGould
ID: 22729160
If this error is showing, I would verify that your new disk has correctly rebuilt, then pull the other disk in the mirror before reconnecting it and letting it rebuild. I had this same issue on an HP BL25P blade server and that fixed it.
 
LVL 11

Author Comment

by:Jon Winterburn
ID: 22729211
So let me get this right:

Drive 0 was the physical drive that failed, drive 1 was fine. So drive 0 was replaced. To ensure I understand you correctly...

Are you saying I should:
power off, eject drive 1
power up and let it rebuild
power off, insert drive 1
power up and let logical drive rebuild

Is that right?
 
LVL 1

Expert Comment

by:DanielGould
ID: 22729328
Are the drives hot-swap? If they are, just unplug drive 1, wait 30 seconds and then reconnect it, with the server powered on and running. This should cause the controller to rebuild drive 1 from the drive 0 data.

If they're not hot-swap, use your array config util to tell the controller that drive 1 was replaced and force it to rebuild it (same as you would have done to rebuild drive 0 when it was replaced)

You shouldn't lose any data as drive 0 is consistent (check that the array is definitely rebuilt and consistent before you do any of this).
 
LVL 11

Author Comment

by:Jon Winterburn
ID: 22730287
Okay, I have marked drive 1 as defunct and rebooted - it is now rebuilding the array in the same way it did when I replaced drive 0. However, it still states that "One or more logical drives contain a bad stripe: controller 1." - I assume this will remain until the array is rebuilt, then (hopefully) go away? Or does that mean that the bad stripe exists on drive 0 and not drive 1?
 
LVL 1

Expert Comment

by:DanielGould
ID: 22730779
It should go once disk 1 has rebuilt. If it still shows, are you able to do a controller-level consistency check?
 
LVL 11

Author Comment

by:Jon Winterburn
ID: 22730884
The error still remains...and now I have the error in the code snippet box below:
This server is the owner of the following FSMO role, but does not consider it valid. For the partition which contains the FSMO, this server has not replicated successfully with any of its partners since this server has been restarted. Replication errors are preventing validation of this role.

Operations which require contacting a FSMO operation master will fail until this condition is corrected.

FSMO Role: CN=Schema,CN=Configuration,DC=dialacab,DC=co,DC=uk

User Action:

1. Initial synchronization is the first early replications done by a system as it is starting. A failure to initially synchronize may explain why a FSMO role cannot be validated. This process is explained in KB article 305476.

2. This server has one or more replication partners, and replication is failing for all of these partners. Use the command repadmin /showrepl to display the replication errors.  Correct the error in question. For example there maybe problems with IP connectivity, DNS name resolution, or security authentication that are preventing successful replication.

3. In the rare event that all replication partners being down is an expected occurance, perhaps because of maintenance or a disaster recovery, you can force the role to be validated. This can be done by using NTDSUTIL.EXE to seize the role to the same server. This may be done using the steps provided in KB articles 255504 and 324801 on http://support.microsoft.com.

The following operations may be impacted:

Schema: You will no longer be able to modify the schema for this forest.
Domain Naming: You will no longer be able to add or remove domains from this forest.
PDC: You will no longer be able to perform primary domain controller operations, such as Group Policy updates and password resets for non-Active Directory accounts.
RID: You will not be able to allocation new security identifiers for new user accounts, computer accounts or security groups.
Infrastructure: Cross-domain name references, such as universal group memberships, will not be updated properly if their target object is moved or renamed.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.


 
LVL 11

Author Comment

by:Jon Winterburn
ID: 22730929
If it helps, this is the output of repadmin /showrepl:

Source: Default-First-Site-Name\DC2
******* 1 CONSECUTIVE FAILURES since 2008-10-16 12:56:44
Last error: 1908 (0x774):
            Can't retrieve message string 1908 (0x774), error 1815.
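For anyone who wants to watch this output over several reboots, here is a minimal Python sketch that pulls the source DC, the failure count, and the Win32 error code out of saved `repadmin /showrepl` text. The regular expressions are assumptions based only on the four lines quoted above, so other layouts of the output may need adjusting. To my knowledge, Win32 error 1908 is ERROR_DOMAIN_CONTROLLER_NOT_FOUND, which would point at name resolution or connectivity rather than at the disk.

```python
import re

# The output quoted in the thread above.
SAMPLE = """\
Source: Default-First-Site-Name\\DC2
******* 1 CONSECUTIVE FAILURES since 2008-10-16 12:56:44
Last error: 1908 (0x774):
            Can't retrieve message string 1908 (0x774), error 1815.
"""

def parse_showrepl_failures(text):
    """Return one dict per source that reports consecutive failures."""
    results = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Source:\s*(.+)", line)
        if m:
            # A new "Source:" line starts a new replication partner entry.
            current = {"source": m.group(1), "failures": 0,
                       "since": None, "error": None}
            results.append(current)
            continue
        if current is None:
            continue
        m = re.match(r"\*+\s*(\d+)\s+CONSECUTIVE FAILURES since\s+(.+)", line)
        if m:
            current["failures"] = int(m.group(1))
            current["since"] = m.group(2)
            continue
        m = re.match(r"Last error:\s*(\d+)", line)
        if m:
            current["error"] = int(m.group(1))
    return [r for r in results if r["failures"] > 0]

if __name__ == "__main__":
    for entry in parse_showrepl_failures(SAMPLE):
        print(entry["source"], entry["failures"], entry["error"])
```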
 
LVL 11

Author Comment

by:Jon Winterburn
ID: 22731153
Having dropped into the controller-level utility (post-boot), I cannot find anything that allows me to check the consistency of the array - only initialize drives, create/delete arrays etc. The only verification available is on the physical drives themselves, which I am loath to do as it takes so long.
 
LVL 1

Expert Comment

by:DanielGould
ID: 22731172
Looks like you have more issues than just a RAID issue there. Check that your servers can talk to each other. Don't start making lots of AD-level changes (roles, etc.) as it will cause you more problems. Make sure DNS and time sync are happy across your other DCs first. Make sure your logical drive is 100% consistent on that DC too.
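The "can the servers talk to each other" check can be sketched in Python as below. It only covers name resolution and TCP reachability, not time sync, and the port list is an assumption about what DCs commonly use rather than a complete list. "DC2" is the partner name from the repadmin output earlier in the thread.

```python
import socket

# Ports commonly used between domain controllers; this list is an
# assumption for illustration and is not exhaustive.
DC_PORTS = {"DNS": 53, "Kerberos": 88, "RPC endpoint mapper": 135,
            "LDAP": 389, "SMB": 445}

def resolves(host):
    """Return True if the name resolves to at least one address."""
    try:
        return bool(socket.getaddrinfo(host, None))
    except socket.gaierror:
        return False

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_dc(host, ports=DC_PORTS):
    """Return {service: reachable}, or {} if the name does not resolve."""
    if not resolves(host):
        return {}
    return {name: port_open(host, port) for name, port in ports.items()}

if __name__ == "__main__":
    for name, ok in check_dc("DC2").items():
        print(f"{name}: {'ok' if ok else 'unreachable'}")
```

An empty result would suggest a DNS problem, while a mix of reachable and unreachable ports would point more towards a firewall or a stopped service.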
 
LVL 1

Expert Comment

by:DanielGould
ID: 22731281
http://support.microsoft.com/kb/914032

gives more information. It states that this error is not reported in advance; it only appears once an operation that requires the role is attempted.
 
LVL 11

Author Comment

by:Jon Winterburn
ID: 22732119
Okay, reset all NIC settings and rebooted both DCs, then ran dcdiag and netdiag and repadmin /showrepl and all is well - no more errors, thankfully!

So returning to the RAID issue - I can't figure out how to check the consistency of the logical drive. IBM ServeRAID Manager doesn't offer this option (it just informs me there are bad stripes) and the controller-level manager only provides a tool to verify the physical disks or create/delete the array. So how do I check the consistency of the logical drive???
 
LVL 1

Accepted Solution

by:DanielGould (earned 500 total points)
ID: 22738695
Hmmm, I'm not 100% sure on the IBM ServeRAID (I'm mainly an HP person). If these errors have been showing since the drive replacement, it could also be that the replacement drive you used has faults. Was it a new drive or a recycled one? If you have another drive, I'd suggest swapping out drive 0 again.
