Solved

Repair Exchange 2003 store on failing RAID 1 array

Posted on 2012-03-26
Last Modified: 2012-08-14
Greetings,

I have an Exchange 2003 server running on Server 2003 R2.  The server has a RAID 1 array on two 146GB 15K SAS hard drives.

About two weeks ago one of the drives failed and was kicked from the array and the other drive began reporting media errors during the nightly backup.  I know some of the errors are affecting the Exchange store because NT Backup can no longer complete the backup of the Exchange store and the System event log reports "bad block" errors occurring at the same time the backup runs.  My last good backup of my Exchange database is from 2/13/12.  Otherwise, the server continues to run without error and Exchange is functioning normally.

I have two replacement hard drives arriving today.  They are identical models of the existing drives.

My question is, how should I approach this repair?

Should I replace the failed drive, let the RAID 1 mirror rebuild, then replace the 2nd defective drive, rebuild again, THEN work on repairing the damage to the Exchange DB?

Or should I attempt to repair the Exchange DB while on the existing drive that's reporting media errors, and after fixing the DB put the replacement drives in place?

Some specific questions I have regarding these two approaches:
- Will the RAID 1 rebuild to a new drive from the current drive with media errors fail, or make things worse for the Exchange database given the known bad blocks in the database? (The RAID controller is a built-in Intel Embedded Server RAID)
- I'm assuming bad blocks in the Exchange DB means I cannot shut down the Exchange services and copy the Exchange DB to another drive before repairing it.  If this is correct, does this mean my best bet is to try to repair the DB in-place?  Or is there a way I can duplicate/back it up before working on it?

Thanks for the help in advance!!
Question by:TWAINdriver
15 Comments
 
LVL 76

Accepted Solution

by:
Alan Hardisty earned 250 total points
I would copy the existing database (with the information store service stopped) to a backup drive / location where you know it is safe.

Once you have a copy, repair the RAID Array replacing one drive at a time until it is happy, then look at repairing the database.

I wouldn't start to mess with the HDD's until you have a copy of the database just in case the other drive fails and you lose the lot.
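The "copy first, touch the disks later" advice can be sketched as a small script. This is only a minimal sketch, assuming the Information Store service is already stopped and Python is available on the machine; the function names and any paths you pass in are hypothetical, not something from this thread. It copies one file, compares hashes, and returns False on a read error instead of aborting, so you can see which files could not be saved.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a multi-GB store does not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, dest_dir: Path) -> bool:
    """Copy one file and confirm the copy reads back identically.

    Returns False instead of raising when the copy or read fails (for
    example an I/O error on a bad block), so the caller can carry on
    with the remaining files and note which ones were not saved.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / source.name
    try:
        shutil.copy2(source, target)
        return sha256_of(source) == sha256_of(target)
    except OSError as error:
        print(f"could not copy {source}: {error}")
        return False
```

Call it once per database file and per log file. Note that the verification re-reads the source, which is one more pass over a failing drive; skipping the hash comparison is a reasonable trade-off if the disk is getting worse.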
 
LVL 16

Assisted Solution

by:R. Andrew Koffron
R. Andrew Koffron earned 250 total points
I agree with alanhardisty  

how old is the last successful backup?

I would also export as much of the mail store as possible to PST files using ExMerge, just in case you end up needing to punt.
Here's a previous EE thread with ExMerge instructions.
http://www.experts-exchange.com/Software/Server_Software/Email_Servers/Exchange/Q_23919688.html

I would attempt to repair a second copy on good media prior to doing anything, though. If you're totally screwed
 
LVL 1

Author Comment

by:TWAINdriver
@alanhardisty:  Do I correctly assume you're implying I should be able to successfully copy the Exchange database to other media, despite the bad block errors...or are you simply suggesting that "if" this can be done, this is the best approach?

What if I get something like a CRC error when trying to copy the store?  Would it be possible to restore the database from my most recent backup (2/13/12) and then replay the log files?  ...which haven't been truncated since 2/13 due to subsequent backups failing.

@Harel66: Exporting to PST sounds like a good idea.
 
LVL 16

Expert Comment

by:R. Andrew Koffron
Personally, I'd probably rebuild this server, since you pretty much know both sides of your mirror are screwed up. Export info, rely on the last good backup initially, then try to get any newly updated work off the compromised drives.

The more I think about this situation, the more it sounds like you should go fully into disaster recovery mode and not use the compromised drives any more than absolutely needed.
 
LVL 76

Expert Comment

by:Alan Hardisty
I'm not suggesting you should be able to - only time will tell on that - I am suggesting you try to take a copy of the database files as a backup / precautionary measure in case you need them to fall back to.

If you can't, then you may just have to fly by the seat of your pants and hope the RAID rebuild goes well.

As long as you have the store and the logs that have been created since the last successful Store backup, you should be able to recover the store, so copy the logs out too as a precaution.
 
LVL 1

Author Comment

by:TWAINdriver
@Harel66: I can appreciate your suggestion to rebuild the server from the ground up, but would like to avoid that if I have a reasonable chance of success without doing so.

So here's what I'm thinking:

1. Copy off store & logs
2. Export mailboxes to PST
3. If store copies successfully, perform repair on other media
4. Rebuild array to new drives
5. If store could not be copied to other location, attempt store repair in-place
6. If store repair fails, restore last full backup then replay log files from last full backup.

Is this a good plan?
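The branching in the six steps above can be written out as a tiny function, which makes it easier to see which steps run in each case. This is just a sketch of the plan as posted; the function name and the step wording are made up for illustration.

```python
from typing import List


def recovery_steps(copy_ok: bool, repair_ok: bool) -> List[str]:
    """Return the ordered steps of the posted plan for a given outcome.

    copy_ok:   the store could be copied to other media
    repair_ok: the store repair (on the copy or in place) succeeded
    """
    steps = [
        "copy off store and logs",
        "export mailboxes to PST",
    ]
    if copy_ok:
        steps.append("repair the copied store on other media")
    steps.append("rebuild array onto new drives")
    if not copy_ok:
        steps.append("attempt store repair in place")
    if not repair_ok:
        steps.append("restore last full backup and replay log files")
    return steps


# Example: the copy fails and the in-place repair fails too.
for number, step in enumerate(recovery_steps(copy_ok=False, repair_ok=False), 1):
    print(number, step)
```

Whatever the outcome, the first two steps always run, and the restore-plus-log-replay path is the fallback of last resort.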
 
LVL 16

Expert Comment

by:R. Andrew Koffron
Looks solid to me. However:

My thinking is that you definitely know there is a problem on both sides of your mirror, so what makes you sure the message store is the only place there is a problem?

Further, once you mess with the drives for another couple of hours/days, will your recoverability of the remaining files be as good?

IF you proceed with your plan and find out there's a GB of random corrupt user files OR the AD is broken in some obscure way that it runs but can't change or save (whatever weird issue "could" show up) but seems ok for a month, how will you ever hope to get it all back together?
LVL 1

Author Comment

by:TWAINdriver
Good questions.  Here's what I know....tell me if you think I'm setting myself up for failure:

- I do actually know of about 8 files besides the Exchange store impacted by the bad blocks.  I don't believe any of these are critical files and I do have access to a good backup from which to recover them.

- There is no other data on the server besides the Exchange store.  It's not a DC.

I'm not sure what negative impact this scenario could have on AD.  If there is something, please give me insight into that possibility.

I would plan to take a full backup of the server after the repair.  My reasoning is that if the full backup succeeds (every file on the disk read in the process), and I'm on two brand new drives in a RAID 1 array...I'm golden.

Again, point out where I may be wrong.  I certainly want to come out on the successful side of this.
 
LVL 16

Expert Comment

by:R. Andrew Koffron
If you're 99.9% sure you know where all the corrupt data is, I don't see any major flaw with your plan.

@alanhardisty what do you think? Any real risk here, or am I being a worrywart?
 
LVL 76

Expert Comment

by:Alan Hardisty
The fact that you have corruption is a worry.  It's a bit like having an infected system: once you have cleaned up the mess, your confidence in its reliability won't be 100%, and you may get errors in the future caused by the corruption.  In my experience, though, disk problems can be overcome quite easily, so I'm not convinced a rebuild is strictly necessary.

If you have a spare server - all well and good, but if not, then to buy one and keep the same OS is mad.   Upgrading to a newer OS makes more sense, or migrating, but you need stable foundations to start from.

I would take things one step at a time and start by backing up what you can, repairing the Array by replacing the disks one by one and letting them mirror, then repair the store and take it from there.

We will be with you throughout, so you have Experts on your side.
 
LVL 16

Expert Comment

by:R. Andrew Koffron
haha considering he's got more than 20x the points I do I'll listen :)
 
LVL 1

Author Comment

by:TWAINdriver
LOL.  Ok, thank you both.  I'll update as things progress.
 
LVL 1

Author Comment

by:TWAINdriver
Well, I'm glad to report that after 13 hours and blessing from the Lord, I've got the server working.  I did end up opening a case with PSS after being unable to copy off the priv1.stm file due to CRC errors.

Long and short of the repair follows:
1. Copied off all logs since last full backup.  Thankfully none of these were damaged.
2. Ran CHKDSK to repair bad blocks.  It did so, but not surprisingly, while this fixed the damage to priv1.stm at the file level, it left logical corruption that still resulted in errors when attempting an online backup of Exchange.
3. Replaced both bad drives in the array
4. Restored mailbox database from last full backup and replayed log files.  This is where PSS really came in handy (at 4:30 AM when my brain was definitely not at its sharpest)
5. Took successful full online backup of the Exchange databases

I think I knew this, but it became clear reality for me that your last full backup plus all subsequent log files of an Exchange 2003 database is absolutely critical for a restore operation like this.  You simply cannot use any other full backup and expect a good outcome (I know I'm telling you guys something you already know, but this experience solidified it for me).
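Since the restore depends on having every log written since the last full backup, it can be worth checking the copied-off logs for gaps before starting. Below is a rough sketch of such a check. It assumes the usual Exchange 2003 naming for the first storage group, i.e. the prefix E00 followed by a five-digit hexadecimal generation number (E0000001.log, E0000002.log, ...); that naming and the function name are assumptions on my part, so adjust the pattern if your logs look different.

```python
import re
from typing import List

# Assumed naming: "E00" prefix + 5 hex digits + ".log".
# The current log "E00.log" and the checkpoint "E00.chk" do not match.
LOG_NAME = re.compile(r"^E00([0-9A-F]{5})\.log$", re.IGNORECASE)


def missing_generations(file_names: List[str]) -> List[int]:
    """Return generation numbers missing between the lowest and highest log found."""
    found = set()
    for name in file_names:
        match = LOG_NAME.match(name)
        if match:
            found.add(int(match.group(1), 16))
    if not found:
        return []
    return [g for g in range(min(found), max(found) + 1) if g not in found]


# Example with one log missing from the sequence.
names = ["E0000001.log", "E0000002.log", "E0000004.log", "E00.log", "E00.chk"]
print(missing_generations(names))
```

An empty result only means the file names are contiguous; it says nothing about whether each log is internally intact, and it does not check that the first log lines up with the backup you are restoring.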

I did also as a precaution export several key mailboxes to PST while all of this was going on.  I did this directly from the Outlook client (which was offline due to the stores being dismounted) rather than using Exmerge.

Thanks both for helping me navigate and plan for this repair.
 
LVL 76

Expert Comment

by:Alan Hardisty
Glad you got it sorted and that all is well.  Hopefully it won't happen again in a hurry, but if it does, you will know what to do next time :)

Alan
 
LVL 1

Author Comment

by:TWAINdriver
I wanted to address a portion of my original question for the benefit of others reading this.

I asked whether bad blocks impacting the Exchange database would prevent my dismounting the stores and performing a file-level (offline) copy of the database files.  In my case, because the damage to the database was at the file level, and I was getting a -1022 error from Exchange during the online backup, I had "physical" damage to the files.  For this reason, when attempting to copy the offline database, I got a CRC error and was not able to copy the file as a result.  (See http://support.microsoft.com/kb/314917 for more about the three levels of damage an Exchange store can suffer, including file-level errors that result in the -1022 error).

Running CHKDSK on the volume corrected the CRC error in the affected database file, and afterward I was able to copy the file to another location.  However, CHKDSK couldn't recover the damaged blocks in the file, resulting in "logical" corruption to the database file.  An attempt to perform an online backup of the store after running CHKDSK started returning page checksum mismatch errors instead of I/O errors.
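For anyone trying to tell these errors apart in the event log, here is a small lookup of the three ESE error codes the KB article covers. The -1022 code is the one from my case; mapping the later "page checksum mismatch" errors to -1018 is my reading of the article rather than something I noted at the time, and the one-line descriptions are my own paraphrase, so treat the KB as the authority.

```python
# ESE (JET) error codes discussed in KB 314917, with paraphrased meanings.
ESE_ERRORS = {
    -1018: ("JET_errReadVerifyFailure",
            "page checksum mismatch: the page was read but its contents are wrong"),
    -1019: ("JET_errPageNotInitialized",
            "a page expected to hold data is empty or uninitialized"),
    -1022: ("JET_errDiskIO",
            "disk I/O error: the page could not be read from disk at all"),
}


def describe(code: int) -> str:
    """Return 'NAME: meaning' for a known code, or a fallback for anything else."""
    if code in ESE_ERRORS:
        name, meaning = ESE_ERRORS[code]
        return f"{name}: {meaning}"
    return f"unknown ESE error {code}"


print(describe(-1022))
print(describe(-1018))
```

Read this way, the sequence in my case was: -1022 while the bad blocks were unreadable, then checksum failures once CHKDSK made the file readable again but could not bring back the original contents.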

I imagine that if I had not had a full backup of my database I would have probably been able to use ESEUTIL and ISINTEG to repair the store, albeit with some likely data loss.  Obviously, for best results, take frequent backups!
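For reference, the commonly documented order for that kind of last-resort offline repair is a hard repair, then an offline defrag, then an integrity fix. The sketch below only builds the command lines as argument lists; it does not run anything. I did not go down this route myself, so this is an assumption about how it would be done, and the database path and server name in the example are placeholders.

```python
from typing import List


def offline_repair_commands(edb_path: str, server_name: str) -> List[List[str]]:
    """Build the last-resort repair sequence as argument lists (nothing is executed).

    Run these only against a dismounted store, ideally a copy, and only
    when no usable backup exists: a hard repair can discard damaged pages.
    """
    return [
        ["eseutil", "/p", edb_path],   # hard repair of the database
        ["eseutil", "/d", edb_path],   # offline defragmentation after the repair
        ["isinteg", "-s", server_name, "-fix", "-test", "alltests"],  # fix store-level inconsistencies
    ]


# Placeholder path and server name, for illustration only.
for command in offline_repair_commands(r"D:\recovery\priv1.edb", "EXCH01"):
    print(" ".join(command))
```

ISINTEG is usually rerun until it stops reporting fixes. Check the Microsoft documentation for your exact Exchange version and service pack before relying on any of these switches.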