RAID 5 failure on SBS 2003 - urgently need backup options for Exchange store and WSUS

The server that I mostly administer remotely has both a RAID 1 and a RAID 5 array on an E200i controller card.
The OS (SBS 2003, fully patched) sits on the RAID 1.
User shared drives, user data, the Exchange DB and logs, and the WSUS database sit on the RAID 5 logical drive. In this Array B, we had a drive failure in Bay 4. We got a new drive shipped and replaced the failed one, and the controller said it was ready for rebuilding. I restarted the server and pressed F1 for automatic data recovery; all seemed fine until about 40%, at which point it failed. After I updated the firmware on the controller card, it now gives this error:

   Logical drive 2 status = Ready for recovery
   Array accelerator status:
      Valid data found at reset
      Dirty data detected. Unable to write dirty data to drives
      Hardware Problem detected with Cache board.
      Please replace Cache board.

Support indicated that not only does the cache board need to be replaced, but from the error report they saw that drive 6 now has to be replaced as well. It is a big mess!

Conclusion - back up the data from the RAID 5 and rebuild the array. As the RAID 5 is failing, I need to do this quickly.

I have a large USB drive attached... I can attempt to copy all the user data onto that drive. Any suggestions on doing so are welcome. As I mentioned above, I have to move the user shared folder, user data (regular docs - that would be the easiest part), the Exchange logs and store, and the WSUS database and files - and then reconnect them all once the array is rebuilt.

What would be the most effective way of accomplishing this?
Can I mount the Exchange store from the USB drive? The store is a little under 60 GB.
Would I need to go one by one and move them individually?

thanks for the all the help!

1 - The best way to accomplish this is to perform a full backup of the data, using Robocopy or something similar.

2 - You can use the external disk to attach the Exchange stores, TEMPORARILY, as access will become noticeably slower. You will probably also need to run the Eseutil tool on the database:
ESEUTIL /MH "pathofedb\store.edb" and check for Clean Shutdown or Dirty Shutdown status.
If Clean, good to go - try to mount the store. If Dirty, use eseutil /R if the logs are available, or /P if there are no logs.
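A minimal sketch of that copy-and-verify step (the paths are examples only - D: standing in for the failing array, E: for the USB drive; adjust to the actual install):

```bat
rem Copy the data with robocopy, limiting retries so a dying disk
rem doesn't hang the job for hours on an unreadable file.
robocopy "D:\Exchsrvr\MDBDATA" "E:\ExchBackup\MDBDATA" /E /COPY:DAT /R:2 /W:5 /LOG:E:\robocopy.log

rem Dump the database header on the copy and look for "State: Clean Shutdown".
"C:\Program Files\Exchsrvr\bin\eseutil" /mh "E:\ExchBackup\MDBDATA\priv1.edb"
```

The low /R and /W values matter here: robocopy's defaults retry a failed file a million times, which is the last thing you want on a degraded array.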

3 - You can move them all at once.

Hope this helps.

I would agree with Jon on most things except for the USB. I wouldn't recommend running Exchange from the USB drive - you are asking for trouble. I would instead suggest taking the Exchange server offline, copying/backing up the Exchange database files to USB, repairing the RAID, and then bringing Exchange back up on local disk.
surge1Author Commented:
Well, the parts are en route now. I have copied all the shares and user files from the RAID onto the USB drive; that is working well. Exchange is still running off the RAID. I will attempt a test copy tonight to see if I get any errors or run into problems.

The WSUS database is upwards of 20 GB and, for some reason, is located on the bad drive. I would like to shrink it considerably before moving it using:
wsusutil movecontent DestinationPath LogFile

Any ideas on how to do it? I know I can run the wizard, but that typically takes a loooooong time.
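For reference, wsusutil's movecontent syntax takes a destination content folder (which must already exist) and a full path to a log file. The paths below are examples, and the tool lives under the WSUS install directory (shown here for a default install):

```bat
rem Move the WSUS content files to the new location and log the operation.
cd /d "C:\Program Files\Update Services\Tools"
wsusutil.exe movecontent E:\WSUS\WsusContent\ E:\wsus-move.log
```

Note that movecontent relocates the downloaded update files only; it does not shrink or move the SUSDB database itself.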

I would avoid stressing the disk too much, since everything is on the razor's edge, so IMO it's better to copy it AFTER you protect that Exchange EDB. I would personally take the database offline, which will commit the logs to the DB and put it into a clean state, and then copy it offline so that you have a good recovery point.

WSUS data is much easier to rebuild and is less valuable IMO.
Hi Lucid,

That's why I stated: "You can use the external disk to attach the Exchange stores, TEMPORARILY, as access will become noticeably slower."

You are right about WSUS. Exchange is usually the core business of many enterprises and it cannot fail.

Thanks for bringing the USB question up :)

Hey Jon, I understand your suggestion, however I have to disagree because:

A. USB drives are great for general storage but not for running a DB.
B. The USB bandwidth isn't fast enough to cope, and you could actually cause the DB to time out and crash, which would put you in a much worse place.

You are better off shutting down Exchange, copying the files to the USB drive, cleaning up the disk array mess, and then moving them back and bringing Exchange online.
I have already used it once, and it did OK... nevertheless, it's not safe, recommended, or even supported.

The idea here was to mount the DB; if OK... proceed. If not, repair. (I explained myself poorly :( )

You are absolutely right.

surge1Author Commented:
OK, noted with thanks guys!

I will take care of the Exchange DB first. I plan to follow these instructions:

First I will dismount the store, then run ESEUTIL /MH "pathofedb\store.edb" to make sure it was successful. I would like to check it for errors as well - should I run eseutil /k, or /d to defragment, at the same time while repairing it? Should this be run on the USB drive or still on the array?

Then, if it is clean, use xcopy to copy the store to the USB drive, moving the whole MDBDATA folder.

Sounds good?
1. Again, I wouldn't recommend moving your Exchange database and logs to a USB drive. IF by chance you were able to get it running there, there is a high degree of failure risk, and you would be in a super bad place should that happen.

2. Instead I would;
A. dismount the database gracefully

B. If you want to validate the state, just do a simple eseutil /mh against the database. If it's in a clean state, you are in a happy place. NOTE: if it's in a dirty state, then that database is in danger, and I would still copy it off to the USB for safekeeping, and also to give you a chance of recovery/repair should the whole thing go south.

C. Stop the Information Store service, then set it to disabled. This will ensure that if the system reboots it won't try to start the information store.

D. I would not run /D or /G or /K against it - that is only going to stress the disk and put you at risk. Think of it like taking your car out with bald tires on a windy mountain road, drinking tequila and going as fast as possible: all is well until boom, your tires blow out, and then ahhhhh, into the ravine you go :-(

NOTE: I run a software firm that specializes in Exchange optimization, backup, discovery and recovery software, and we hear these stories all the time; customers pay us well to recover that data. So even though you might beat the odds and be OK, I just wouldn't do it :-)

E. OK, now that the store is stopped, the EDB is validated as clean, and the services are disabled, copy that database (EDB + STM) off to the USB drive for safekeeping. Note: since you dismounted the database and validated it was clean, you don't need the logs - they are already committed into the EDB.

F. Once you have the EDB and STM copied, you can use eseutil /mh to dump the header on the copy you moved to USB; if it is still clean, you have a good backup.

G. Now fix that RAID, and check the event viewer to ensure no other disk errors exist.

H. If the original database files are still on the array (i.e., you didn't have to reformat, etc.), then you can just set the services back to automatic, mount the store, and be on your way.
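Steps C through F could look roughly like this from a command prompt (service name per a default Exchange 2003 install; the drive letters and paths are examples only):

```bat
rem Stop the Information Store and keep it from starting on reboot.
net stop MSExchangeIS
sc config MSExchangeIS start= disabled

rem Copy the database files off, then verify the header on the copy.
copy "D:\Exchsrvr\MDBDATA\priv1.edb" "E:\ExchBackup\"
copy "D:\Exchsrvr\MDBDATA\priv1.stm" "E:\ExchBackup\"
"C:\Program Files\Exchsrvr\bin\eseutil" /mh "E:\ExchBackup\priv1.edb"

rem After the array is repaired, re-enable and start the service.
sc config MSExchangeIS start= auto
net start MSExchangeIS
```

The space after `start=` in the `sc config` commands is required by sc's syntax.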

I realize you are going to have some downtime with this, but better to give your users some planned downtime vs. "umm, sorry, the entire database is toast"... This is the safest way to keep that data safe and minimize stress on the disk subsystem.

Now, if I have missed any of your concerns, let me know and I will do my best to assist ASAP.
BTW, how many mailboxes are we talking about?

If your users are using Outlook in cached mode, you could have them export to PST without loading down the server - but of course save to the local disk, not the network.
surge1Author Commented:
Well, I pretty much did what you outlined, Lucid, using the help from your posts and Jon's:

Dismounted the stores, verified the state - both were clean - and started copying the whole directory with robocopy. At about 82%, while copying priv1.edb, it got stuck; checking the robocopy log, there was no movement for about an hour. I went to the event viewer, which had a bunch of errors as shown below. It appears the array failed while I was performing the copy. My only hope is that by restarting the computer, the RAID controller card can attempt to put it back online. I cannot access the drives from Disk Management either.

All in all, it's a bad situation. The HP technician is scheduled to come tomorrow to attempt to recover the array or, as a final resort, rebuild it. I really hope it comes back online after recovery. I have stopped the information store and marked it as disabled so it does not start the stores during restart, as you mentioned. Also disabled AV and stopped the WSUS service.

I am left now with all the logs written to USB, as well as 82% of priv1.edb. For now I wait until someone gets here in the morning. Any ideas what else I can try until then?

Event Type:      Warning
Event Source:      HpCISSs2
Event Category:      None
Event ID:      129
Date:            6-10-2011
Time:            21:21:31
User:            N/A
Computer:      SERVER
The description for Event ID ( 129 ) in Source ( HpCISSs2 ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: \Device\RaidPort0.
0000: 0f 00 10 00 01 00 6a 00   ......j.
0008: 00 00 00 00 81 00 04 80   ......¿
0010: 04 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........
0028: 00 00 00 00 00 00 00 00   ........
0030: 00 05 00 00 81 00 04 80   ......¿

surge1Author Commented:
Lucid: it is about 10 mailboxes; most are around 2-3 GB, one being 7 GB...
Do you have cache turned on for each user?
Right now, don't do anything yet - no reboots, etc. Will post again when I get to the office.
surge1Author Commented:
I believe it is mixed; I had to turn off cached mode for a couple of users as it was causing problems with the larger mailbox. I know you can convert OST mailboxes to PST.
OK, so:

1. The first thing I would do is get those that are in cached mode backed up to PST. You will of course have the chance to do this later, but sooner is better in your current position, just to be extra safe.

2. I wouldn't reboot the server until you talk to the HP guy. I had a customer in a similar situation the other day, and in short there was some funky deal with the controller where, once a failure happens, you get prompted on reboot to recover (i.e., I think, replace drives and recover), and if you pass up that chance, well then, game over - no recovery for you. So I would talk to HP and see what they recommend in order to improve your chances of recovery.

3. When was your last good backup?
surge1Author Commented:
1. Is there a utility for doing so? Can you link me to instructions? I've seen it mentioned in places but have not attempted it myself.

2. Well, we did press F1 to attempt auto recovery after replacing the 1st bad drive. The recovery starts, but then stops at about 40% and puts the array in a state of "ready for recovery". Did this a couple of times, with the same result.

3. About a week ago, onto a tape, using Symantec Backup Exec. I have yet to verify this, as I can't check the tape physically myself.

thanks for the help.
1. At this point you don't need a utility. Go to each machine and open Outlook; it will try to connect to the Exchange server. It won't find one, and if the cache is available it will give you the option to start offline. Once started you can see all the items, and then, depending on the Outlook version, you can do something like File > Import and Export, select Export to a File, select PST, click the top of the mailbox name, check the "Include sub-folders" option, and follow the instructions on screen. Once the export is done, do a File > Open > Outlook Data File and point to the file location. Validate that all the info is there. Then close that PST file, make a copy to a backup location (maybe the USB), and repeat on the next station.

2. Yeah, OK, I would wait for the HP guy, and you may want to call ahead and tell them you need someone that really knows their stuff - i.e., no cowboys - because you are in a serious position and can't afford anyone guessing, eh?

3. OK, well, hopefully we don't end up needing it, but let's cross that bridge post-array-fix.

Happy to assist and hang in there... :-)
surge1Author Commented:
The HP technician was just here. He changed the cache board, and it seems all the checks will be done after the battery recharges, which should take under 24 hours. I'll wait until then to see what the status is. The array is up and operational, although I did not start the Exchange store, as I do not want to risk it.

I want to attempt copying the stores again. Do you think it is better to use eseutil /y or robocopy to do so?

Thanks for the update, and:

eseutil is better IMO, since you are dealing with an EDB.
surge1Author Commented:
This time around the copy with eseutil /y went through fine, and I was able to put the entire store onto the USB drive.

I had a feeling the array may still have corrupted the DB, so I decided to run eseutil /g on the copy on the USB drive to check its integrity. It gave an error after 40 seconds.

Integrity check completed.
Database is CORRUPTED, the last full backup of this database was on 09/23/2011 23:26:09

Operation terminated with error -1206 (JET_errDatabaseCorrupted, Non database file or corrupted db) after 40.0 seconds.

Currently running eseutil /k to check for checksum errors - I would presume it will find a lot.

Since it's in a clean state already, the next step is to attempt to repair it using the eseutil /p command:
eseutil /p "usb drive\priv1.edb" /t "usb drive\tempedb1.edb"

as I do not want it to run on the C drive (not enough space) or the D drive (bad array). Will the command above ensure it runs only off the USB drive? (I know it may take a while.)


You know, you may want to consider building a fresh EDB. You said you have about 10 mailboxes, but how big is this database?
surge1Author Commented:
50 GB - at one time it was allowed to grow without a limit. I then did a lot of archiving; that is the reason for such a large size.
So you might want to consider exporting everything to PST, dial-toning the database, and reimporting the data. That way you have a new, clean database vs. a repaired one, because once you run /P on a DB, everything is suspect. Anyway, I would look at bringing the DB back up, letting mail catch up, and then doing an export/import.
BTW, if this weren't an SBS server you could just create a new DB and then do mailbox moves into the new DB.
surge1Author Commented:
Error after the checksum test...


Microsoft(R) Exchange Server Database Utilities
Version 6.5
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating CHECKSUM mode...
        Database: I:\EMERGENCY BACKUP\EXCHANGE\priv1.edb
  Temp. Database: TEMPCHKSUM508.EDB


                     Checksum Status (% complete)

          0    10   20   30   40   50   60   70   80   90  100

12466338 pages seen
0 bad checksums
0 correctable checksums
1216857 uninitialized pages
0 wrong page numbers
0x2847adfa highest dbtime (pgno 0x955df9)

779147 reads performed
48696 MB read
1520 seconds taken
32 MB/second
1555669113 milliseconds used
1996 milliseconds per read
2625 milliseconds for the slowest read
0 milliseconds for the fastest read


                     Checksum Status (% complete)

          0    10   20   30   40   50   60   70   80   90  100
          ERROR: checksumming of streaming file "I:\EMERGENCY BACKUP\EXCHANGE\priv1.STM" finishes with error -1019 (0xfffffc05)

0 pages seen
0 bad checksums
0 uninitialized pages

Operation completed successfully in 1525.281 seconds.
Yeah, a 1019 is not a good thing at all; it's second only to a 1018, but both mean it's DB rebuild time.

Can you still mount the EDB and gain access to the mailboxes?
surge1Author Commented:
I will try.

Reading through the above error, it only errored out when it touched the STM file. Could the answer be simply rebuilding the STM?
You could create a new STM, but then you won't get the data within the STM back. How big is the STM?
surge1Author Commented:
The STM is 7 GB, so there is a good amount of data in there. We mostly deal with standard attachments - nothing big, just a lot of them - so I'm not sure why it is so big.

What do you think of this:

1. Attempt an offline defrag with eseutil /d (in case the error is in the white space); if successful, check integrity and checksums. If not, go to the next step.

2. If that fails, recreate the STM and check integrity and so forth.

3. If the above does not work, mount the database and use ExMerge to retrieve what I can.

4. If nothing works, create a new database from scratch... actually, I'm not sure how to do that.


Thanks for the help again, Lucid. It really helps to talk it through and make sure I am on the right path.
1. The /D and /P have a 99.999999% chance of failure, so I wouldn't waste the time.

2. You only have 10 users, correct? If so, then I would recommend:

A. Bring the database back up and let the mail catch up. Since it's been down, I am sure you have inbound mail queued, but that shouldn't take long to clear.

B. Stop the MTA and SMTP services so that no additional mail can enter from the outside (it will queue up, just like when the server is down).

C. Export each mailbox from Outlook 2003 or greater, since the default there is Unicode PSTs, which have a 20 GB limit, whereas ExMerge only supports ANSI PSTs, which have a 2 GB limit - beyond that the PST becomes corrupt, and ExMerge won't tell you; you just find out later when you cannot open it.

D. Then verify each PST: open each one and do a quick browse.

E. Take the DB down gracefully.

F. Rename the current DB and log directories. I.e., if the path is \MDBDATA, rename it to \OLD-MDBDATA, and do the same for the logs.

G. Recreate the old DB and log path directories.

H. Mount the database. Exchange will squawk about the database files being gone, and if you continue, you will create new files. Say yes, and new DB files are created.

I. Connect with Outlook. For clients configured with an OST, it will tell you there is a mismatch; if you connect, you will kill the old OST, or you can connect to the old offline OST. If you verified the backup to PST from above, then connect to the new server.

J. Once connected, the users can send and receive mail.

K. Get a backup of the new DB started right away.

L. Import the data from the PST backups.

M. Watch your event logs as you do the imports to ensure the disk issues are really corrected - i.e., if you start seeing disk errors, stop...

N. Drive home, have a drink, and sleep well!

You can contact me via my profile if you want as well
surge1Author Commented:
Well, it turns out I am no longer able to mount the original failed store. It was most likely corrupted when the RAID 5 went offline the first time I was trying to copy the store.

My only option now is to restore with Symantec Backup Exec 12 from about a week ago. The files are stored on a tape. It seems, however, I am only able to restore to the Exchange server directly, and not to the USB drive. Can anyone verify this is actually the case?

Also, how would I find the tape that I need? The current media in there keeps saying it's from the latest backup date and does not change when I put other tapes in.

The way I see it, I can rebuild the RAID 5 drive now and hope the backup is good, but I would rather test it before doing that. Please help.

1. I would preserve the current database, since you can later attempt to re-create the STM, mount the damaged EDB in an RSG, and then merge the delta contents.

2. If you want to recover to an alternate location, the best way would be to use the RSG; you can set the RSG paths to be on the USB.

3. However, that said, given that the current database is damaged, I think you would almost be better off to:

A: Export the data from OST to PST for all Outlook clients with caching turned on. This will be your most current dataset for those users.
B: This will also create an extra safety net if the current DB is non-recoverable via the RSG or a 3rd-party product that can open offline EDBs.

4. Now recover the information store directly over your current implementation, because you already have a copy of the EDB and STM safely tucked away on the USB drive, correct? If so, then you can:

A: Recover the database to the live system and get it mounted and operational.

B: The queued email will start to flow again, and people will be able to send and receive messages.

C: Once that's done, get a new backup going and check the event logs to ensure you don't have any disk or database related issues.

D: If all is well, you can attempt to mount the EDB in the RSG. However, if you are correct about the STM, then you will want to recreate the STM by following Microsoft KB article 555146.

E: Once you have the EDB with the new STM mounted in the RSG, you can recover any missing delta data to the production server as needed, OR use ExMerge to export a date range of data to PST.

F: However, I would use the PSTs that you created from the cached OST files to get any missing data first, because if you recreate the .stm file you will probably have a number of messages that have subject and header information but can't be opened, are corrupted, or have empty content - obviously not a good thing, so I would go to the PSTs first.

G: Above all, keep checking your event logs to ensure that you don't have any additional events in the system log pointing towards disk issues, or events in the application log regarding database issues.

surge1Author Commented:
The store has been recreated from scratch. I did not have access to the backup, as they took it offsite and the person is out of town.

I was able to recover all emails from the cache on each user's computer, except for one that had the cache turned off. Once I get the tape back, I will try to recover that user's mailbox only, directly into the store.

We may choose not to put the old emails back into Exchange, as I think it's a good idea to start from scratch - I'll see what the users have to say about that.

Thanks for the help, Lucid.
Happy to assist, and glad you are on the path to recovery.
I wanted to make a note about the hardware side of this. It sounds like the software side is finally getting to an OK place, which I am sure was your first priority, so here is a look at the cause.

RAID 5 parity works off of logical XOR (exclusive OR). If the two bits are different, the result is 1; if they match, the result is 0. Without getting into the gritty details of more than three disks, this gives a unique solution that is transferable:
1 XOR 1 = 0
1 XOR 0 = 1
0 XOR 1 = 1
0 XOR 0 = 0

If you remove any one number from any line, you can do the exact same math with the remaining numbers and fill in the missing piece. Take a moment and notice how that works.

But hard drives are not perfect, which is why RAID was developed. Sometimes bits get lost or corrupted. RAID provides the framework to replace those bits, but only if maintenance is being run. Let's say that those four lines are the first four stripes of your array and one bit in line one becomes corrupted. IF this is detected, it is replaced. IF it is not, and now a SECOND bit in that stripe becomes corrupted, there is not enough data to rebuild what is missing. That is what happened to your data. That is why rebuilding the array failed, and it is likely why the copy + data verification was failing.
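The parity math above can be sketched in a few lines (a hypothetical illustration of the principle, not the controller's actual firmware logic):

```python
# Simplified RAID 5 parity demo: one dedicated parity member
# (real controllers rotate parity across the disks).
def parity(*blocks: bytes) -> bytes:
    """XOR the corresponding bytes of equal-length blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Two data blocks and the parity block written to the third disk.
d1 = bytes([0x0F, 0xF0])
d2 = bytes([0x33, 0xCC])
p = parity(d1, d2)

# One disk fails: XOR the survivors and the missing block reappears.
assert parity(d2, p) == d1
assert parity(d1, p) == d2
# A second corrupted member leaves the equation unsolvable - data loss.
```

Because XOR is its own inverse, the same operation both computes the parity and rebuilds any single missing member - which is exactly why losing a second member during a rebuild is fatal.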

RAID controllers have a function usually called a consistency check or something very similar; Patrol Read is a similar maintenance function. Patrol Read looks for unreadable sectors. A consistency check not only looks for missing pieces but also re-runs the parity math to ensure the data is not corrupt. If you have not set up one or both of these to run on a schedule, you are losing the better part of the RAID's ability to protect your data.

If you have not already, I would recommend getting a backup of your data as soon as you feel it is stable and then re-creating your RAID array. You want to be sure that the array is allowed to remap any unreadable sectors, etc., which it will do when you delete the existing array and re-create it.

If you are using software RAID... most of this is inapplicable and your data is not truly "RAID" protected.
surge1Author Commented:
Just wanted to give an update:

1. I did recreate Array B (RAID 5). I used another drive that was shipped from HP by support, so I rebuilt it on 1 old drive and 2 new ones. I have put an extra one on order to use as a spare. The status alert on the logical drive still says the following; I've read that it can take a while to scan the whole surface area, but it has been almost 5 days now... your thoughts?

785 Background parity initialization is currently queued or in progress on Logical Drive 2 (273.4 GB, RAID 5). If background parity initialization is queued, it will start when I/O is performed on the drive. When background parity initialization completes, the performance of the logical drive will improve.  

2. Restoring the user's mailbox proved to be much easier than I originally thought. The server had Symantec Backup Exec 12 installed with GRT turned on, which enabled me to restore that one user's mailbox directly into the live mailbox store. I actually created the Recovery Storage Group and followed the instructions Lucid posted, but it just restored directly into the newly created live store. I had to follow a couple of articles to get it working, but this was really helpful:

3. The only cumbersome thing I encountered: after retrieving all the mail from each individual user's cache and creating a new Exchange mailbox store, once the clients reconnected, the only way I could get the cache working again was to turn it off and disable it, then turn it back on. Not sure why it wouldn't just create a new cache file from the start. All clients are using Outlook 2007, fully patched.

4. I did lose the WSUS database and content files. I took the recommendations here and did not bother to manually copy it to the external HD, relying instead on an ntbackup file I made. Unfortunately, that ended up being corrupt. I tried finding articles about recreating and reinstalling WSUS on SBS 2003 (found something about reinstalling the whole R2), but could not find a straight guide to follow. I may post another question here about that problem in particular.

Thanks for all the help!

1. Glad you are getting back to normal.

2. Taking 5 days doesn't sound right at all; I would contact HP and ask what's up. I assume you are already copying data to it and that the drive is in use?

3. Sorry to hear about WSUS; I just figured the Exchange DB was job one, and clearly that disk was on the edge of complete failure, eh?
Thanks for the points
surge1Author Commented:
Yes, it definitely was. The disks are being very heavily used now. I read somewhere you can force it to scan all of the disk surface by running a disk defrag, but I'm not sure what impact that would have on the Exchange store. I'll give HP a call just to make sure.

WSUS is not a major problem, I agree. I just have to find the right way to fix it. I am tired of seeing all the errors regarding WSUS in the event log.
Yeah, I think a call is best to ensure all is well, because 5 days is a long time overall (even with heavy activity, that stops at the end of the work day), so at this point best to be extra cautious, eh?

Unfortunately I don't know much about WSUS :-(