Repair Exchange 2003 store on failing RAID 1 array


I have an Exchange 2003 server running on Server 2003 R2.  The server has a RAID 1 array on two 146GB 15K SAS hard drives.

About two weeks ago one of the drives failed and was kicked from the array and the other drive began reporting media errors during the nightly backup.  I know some of the errors are affecting the Exchange store because NT Backup can no longer complete the backup of the Exchange store and the System event log reports "bad block" errors occurring at the same time the backup runs.  My last good backup of my Exchange database is from 2/13/12.  Otherwise, the server continues to run without error and Exchange is functioning normally.

I have two replacement hard drives arriving today.  They are identical models of the existing drives.

My question is, how should I approach this repair?

Should I replace the failed drive, let the RAID 1 mirror rebuild, then replace the 2nd defective drive, rebuild again, THEN work on repairing the damage to the Exchange DB?

Or should I attempt to repair the Exchange DB while on the existing drive that's reporting media errors, and after fixing the DB put the replacement drives in place?

Some specific questions I have regarding these two approaches:
- Will the RAID 1 rebuild to a new drive from the current drive with media errors fail, or make things worse for the Exchange database given the known bad blocks in the database? (The RAID controller is a built-in Intel Embedded Server RAID)
- I'm assuming bad blocks in the Exchange DB mean I cannot shut down the Exchange services and copy the Exchange DB to another drive before repairing it.  If this is correct, does this mean my best bet is to try to repair the DB in-place?  Or is there a way I can duplicate/back it up before working on it?

Thanks for the help in advance!!

Alan HardistyCo-OwnerCommented:
I would copy the existing database (with the information store service stopped) to a backup drive / location where you know it is safe.

Once you have a copy, repair the RAID Array replacing one drive at a time until it is happy, then look at repairing the database.

I wouldn't start to mess with the HDD's until you have a copy of the database just in case the other drive fails and you lose the lot.
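In case it helps, a rough sketch of that copy step from the command line — the drive letters and paths here are assumptions, so adjust for your layout:

```shell
:: Stop the Information Store so the database files are closed
net stop MSExchangeIS

:: Copy the databases and logs to a safe location; /c continues past
:: read errors instead of aborting the whole copy, /h includes
:: hidden/system files, /y suppresses overwrite prompts
xcopy "D:\Exchsrvr\mdbdata\*.edb" "E:\SafeCopy\" /c /h /y
xcopy "D:\Exchsrvr\mdbdata\*.stm" "E:\SafeCopy\" /c /h /y
xcopy "D:\Exchsrvr\mdbdata\*.log" "E:\SafeCopy\" /c /h /y

:: Bring the store back online
net start MSExchangeIS
```

Note that /c keeps xcopy going after an error, but any file that hit a CRC error may still come across incomplete, so check what actually landed in the destination.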

R. Andrew KoffronCommented:
I agree with alanhardisty  

how old is the last successful backup?

I would also export as much of the mail store to PST files using ExMerge, just in case you end up needing to punt.
Here's a previous EE thread with ExMerge instructions.

I would attempt the repair on a second copy on good media prior to doing anything, though, so you still have a fallback if you're totally screwed.
TWAINdriverAuthor Commented:
@alanhardisty:  Do I correctly assume you're implying I should be able to successfully copy the Exchange database to other media, despite the bad block errors...or are you simply suggesting that "if" this can be done, this is the best approach?

What if I get something like a CRC error when trying to copy the store?  Would it be possible to restore the database from my most recent backup (2/13/12) and then replay the log files?  ...which haven't been truncated since 2/13 due to subsequent backups failing.

@Harel66: Exporting to PST sounds like a good idea.
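For reference, the restore-and-replay path you're describing usually looks something like this — the server paths are assumptions and this is only a sketch, not a tested procedure:

```shell
:: 1. Restore the last good online backup with NTBackup, leaving
::    "Last Restore Set" UNCHECKED so the database stays in a
::    restorable state and additional logs can still be replayed.

:: 2. Run hard recovery against the temporary restore folder that
::    holds restore.env; /cc replays the restored logs plus any newer
::    logs still sitting in the running log path
eseutil /cc "C:\TempRestore\First Storage Group"

:: 3. Confirm the database header shows "State: Clean Shutdown"
::    before mounting the store
eseutil /mh "D:\Exchsrvr\mdbdata\priv1.edb"
```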
R. Andrew KoffronCommented:
Personally, I'd probably rebuild this server, since you pretty much know both sides of your mirror are screwed up. Export info, rely on the last good backup initially, then try to get any newly updated work off the compromised drives.

The more I think about this situation, the more it sounds like you should go into disaster recovery mode and not use the compromised drives any more than absolutely needed.
Alan HardistyCo-OwnerCommented:
I'm not suggesting you should be able to - only time will tell on that - I am suggesting you try to take a copy of the database files as a backup / precautionary measure in case you need them to fall back to.

If you can't, then you may just have to fly by the seat of your pants and hope the RAID rebuild goes well.

As long as you have the store and the logs that have been created since the last successful Store backup, you should be able to recover the store, so copy the logs out too as a precaution.
TWAINdriverAuthor Commented:
@Harel66: I can appreciate your suggestion to rebuild the server from the ground up, but would like to avoid that if I have a reasonable chance of success without doing so.

So here's what I'm thinking:

1. Copy off store & logs
2. Export mailboxes to PST
3. If store copies successfully, perform repair on other media
4. Rebuild array to new drives
5. If store could not be copied to other location, attempt store repair in-place
6. If store repair fails, restore last full backup then replay log files from last full backup.

Is this a good plan?
R. Andrew KoffronCommented:
looks solid to me however:

My thinking is that you definitively know there is a problem on both sides of your mirror. so what makes you sure the message store is the only place there is a problem?

further once you mess with the drives for another couple hours/days will your recoverability of remaining files be as good?

IF you proceed with your plan and find out there's a GB of random corrupt user files, OR the AD is broken in some obscure way so that it runs but can't change or save (whatever weird issue "could" show up) yet seems OK for a month, how will you ever hope to get it all back together?
TWAINdriverAuthor Commented:
Good questions.  Here's what I know....tell me if you think I'm setting myself up for failure:

- I do actually know of about 8 files besides the Exchange store impacted by the bad blocks.  I don't believe any of these are critical files and I do have access to a good backup from which to recover them.

- There is no other data on the server besides the Exchange store.  It's not a DC.

I'm not sure what negative impact this scenario could have on AD.  If there is something, please give me insight into that possibility.

I would plan to take a full backup of the server after the repair.  My reasoning is that if the full backup succeeds (every file on the disk read in the process), and I'm on two brand new drives in a RAID 1 array...I'm golden.

Again, point out where I may be wrong.  I certainly want to come out on the successful side of this.
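That "full backup reads every file" check can be scripted, for what it's worth — ntbackup's command line supports a verify pass. The job name, backup file path, and drive letter below are assumptions:

```shell
:: Full backup of the data volume to a .bkf file; /v:yes enables the
:: verify pass, which reads every backed-up file back and compares it
ntbackup backup D:\ /j "Post-repair full" /f "E:\Backups\postrepair.bkf" /v:yes
```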
R. Andrew KoffronCommented:
If you're 99.9% sure you know where all the corrupt data is, I don't see any major flaw with your plan.

@alanhardisty, what do you think? Any real risk here, or am I being a worrywart?
Alan HardistyCo-OwnerCommented:
The fact that you have corruption is a worry.  A bit like having an infected system, once you have cleaned up the mess, your confidence in the reliability won't be 100%, so you may get errors in the future caused by the corruption, but in my experience, disk problems can be overcome quite easily, so not convinced a rebuild is strictly necessary.

If you have a spare server - all well and good, but if not, then to buy one and keep the same OS is mad.   Upgrading to a newer OS makes more sense, or migrating, but you need stable foundations to start from.

I would take things one step at a time and start by backing up what you can, repairing the Array by replacing the disks one by one and letting them mirror, then repair the store and take it from there.

We will be with you throughout, so you have Experts on your side.
R. Andrew KoffronCommented:
haha considering he's got more than 20x the points I do I'll listen :)
TWAINdriverAuthor Commented:
LOL.  Ok, thank you both.  I'll update as things progress.
TWAINdriverAuthor Commented:
Well, I'm glad to report that after 13 hours and a blessing from the Lord, I've got the server working.  I did end up opening a case with PSS after being unable to copy off the priv1.stm file due to CRC errors.

Long and short of the repair follows:
1. Copied off all logs since last full backup.  Thankfully none of these were damaged.
2. Ran CHKDSK to repair bad blocks.  It did so, but not surprisingly, while this fixed the damage to priv1.stm at the file level, it left logical corruption that still caused errors when attempting an online backup of Exchange.
3. Replaced both bad drives in the array
4. Restored the mailbox database from the last full backup and replayed the log files.  This is where PSS really came in handy (at 4:30 AM, when my brain was definitely not at its sharpest)
5. Took successful full online backup of the Exchange databases
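One check worth adding to step 1: eseutil can verify that the copied logs are themselves undamaged before you bet a recovery on them. The log prefix (E00, for the first storage group) and path here are assumptions:

```shell
:: Verify the integrity of the copied transaction log sequence;
:: reports any damaged, missing, or out-of-sequence log files
eseutil /ml "E:\SafeCopy\E00"
```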

I think I knew this already, but it became clear reality for me that your last full backup plus all subsequent log files is absolutely critical for a restore operation like this on an Exchange 2003 database.  You simply cannot use any other full backup and expect a good outcome (I know I'm telling you guys something you already know, but this experience solidified it for me).

I did also as a precaution export several key mailboxes to PST while all of this was going on.  I did this directly from the Outlook client (which was offline due to the stores being dismounted) rather than using Exmerge.

Thanks both for helping me navigate and plan for this repair.
Alan HardistyCo-OwnerCommented:
Glad you got it sorted and that all is well.  Hopefully it won't happen again in a hurry, but if it does, you will know what to do next time :)

TWAINdriverAuthor Commented:
I wanted to address a portion of my original question for the benefit of others reading this.

I asked whether bad blocks impacting the Exchange database would prevent my dismounting the stores and performing a file-level (offline) copy of the database files.  In my case, because the damage to the database was at the file level, and I was getting a -1022 error from Exchange during the online backup, I had "physical" damage to the files.  For this reason, when attempting to copy the offline database, I got a CRC error and was not able to copy the file.  (There are three levels of damage an Exchange store can suffer, including file-level errors that result in the -1022 error.)

Running CHKDSK on the volume corrected the CRC error in the affected database file, and afterward I was able to copy the file to another location.  However, CHKDSK couldn't recover the damaged blocks in the file, which left "logical" corruption in the database file.  An attempt to perform an online backup of the store after running CHKDSK started returning page-checksum-mismatch errors instead of I/O errors.
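For anyone following along, the file-level repair and the page-level check map to two different tools — drive letter and database path below are assumptions:

```shell
:: File-system-level repair: locates bad sectors and fixes CRC/read
:: errors, but unreadable sectors come back as zeroed data
chkdsk D: /r

:: Database-level check: verifies the page checksums inside the .edb
:: itself; this is what surfaces the remaining logical corruption
eseutil /k "D:\Exchsrvr\mdbdata\priv1.edb"
```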

I imagine that if I had not had a full backup of my database I would have probably been able to use ESEUTIL and ISINTEG to repair the store, albeit with some likely data loss.  Obviously, for best results, take frequent backups!
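In case anyone does end up on that last-resort path, the usual repair sequence looks roughly like this — paths and server name are placeholders, and note that /p discards unrecoverable pages, which is where the data loss comes from:

```shell
:: Hard repair: salvages what it can, discards unreadable pages
eseutil /p "D:\Exchsrvr\mdbdata\priv1.edb"

:: Defragment/rebuild the database after a hard repair
eseutil /d "D:\Exchsrvr\mdbdata\priv1.edb"

:: Fix the logical/application-level inconsistencies left behind;
:: rerun until it reports zero errors, then take a fresh full backup
isinteg -s SERVERNAME -fix -test alltests
```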