Reconstruction in RAID 5 fails

I have a "MetaSTOR" external SCSI RAID array.  I believe the controller is from LSI Logic.  This unit has a single LUN with 9 drives.  The 9 drives are a mix of 36GB and 73GB disks, though of course only 36GB is being used on the 73GB disks since they are all in the same RAID 5 LUN.

About a month ago, a 73GB disk failed.  We replaced it with a 36GB disk and reconstruction started automatically and completed in about 2-3 hours.  Earlier this week another drive failed.  This time, when we replaced the drive with a brand new drive, reconstruction started, but then stopped after about 1 hour, with the controller showing that new drive as "failed" and the LUN as degraded.  We tried another drive and the same thing happened.  The status message in the RAID software (SYMplicity Storage Manager) says that a drive experienced a write failure and was failed.

This array happens to hold our Exchange database (71GB), so of course we've been adding to our regular backup schedule to make sure we can recover if another drive fails.  The problem is that for both the tape backups we've tried, the job gets to about 90% of the way through the .edb file (71GB total size) and then fails, saying the Information Store is not available.  Last night I shut down Exchange and tried to copy the database file to another location to serve as a "cold" backup using Robocopy from the Windows Resource Kit.  It got to about 82.3% of the file and then just restarted from 0% again.

So, it seems to me that there's something wrong with the array itself, perhaps with the RAID controller.  It also seems that this may be causing some sort of filesystem problem, since both the backup software and the Robocopy utility are having trouble accessing the file.  What's strange is that the email server is operating normally, as you would expect with a degraded RAID array (albeit with some performance degradation).  I am at a loss as to what to do next.
I thought of perhaps configuring a hot spare in the RAID array in an empty slot so that it would automatically be used for the failed drive.  I only need the system to run for another 5-6 days, as it's being retired and replaced by a completely new server with a SAN (isn't that always the case - the thing fails right before it's about to be put out of service!!!).  I have actually attached a completely new RAID array via Fibre Channel direct attach to move the data off the failing array, but I can't seem to get the single 71GB file to copy over to it.  Restoring from backup would be quite difficult, though not impossible: we'd need to recover hundreds of transaction log files from multiple tapes and then apply them to a full backup that is a couple of weeks old.  I thought of running Chkdsk on the logical drive to see if that might clear up the problem I'm having accessing the file, but I hesitate to do so without a more recent backup.  Catch 22, I guess... ideas would be most appreciated.  If I don't respond to questions, this means my email is not working (RAID array failed completely!!!) and you can send replies to
_Mr_Limo Commented:
It's actually not unusual that you would get your data back - professional data recovery is definitely an option...

It does, however, sound like you have a logical corruption issue.  Some type of data corruption is going on at that 82.3% location.

You seem to have two options at this point - professional data recovery, or trying to rebuild from your backups.

It is definitely NOT a 2-drive failure - if 2 drives are down in a RAID 5, you have NO access at all.  It's a logical corruption issue on the array itself - it may very well have been caused by the controller.
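The "two drives down means no access at all" point follows directly from how RAID 5 parity works: each stripe's parity block is the XOR of that stripe's data blocks, so any single missing block can be recomputed from the survivors, but two missing blocks leave one equation with two unknowns.  A minimal Python sketch of the idea (illustrative only - not the controller's actual stripe layout):

```python
from functools import reduce

def parity(blocks):
    """Parity block = byte-wise XOR of all the given blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe across four drives: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# One drive lost: XOR the surviving blocks (including parity) to rebuild it.
survivors = [data[0], data[2], p]   # drive holding data[1] is gone
rebuilt = parity(survivors)
assert rebuilt == data[1]           # single failure: fully recoverable

# Two drives lost: two unknowns in one XOR relation - no rebuild possible.
```

The same relation is what a rebuild onto a replacement drive computes, stripe by stripe; if a second member returns unreadable data mid-rebuild, the controller has nothing to XOR against and aborts.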
Nopius Commented:
> Earlier this week another drive failed.  This time, when we replace the drive with a brand new drive
What is the size of that new drive?  (If 36GB, then exact CHS values, please.)
emilysamAuthor Commented:
I've tried both a 36GB and a 73GB drive.  Another note: last night I tried copying the data from the failing array to a new array.  The data is a single file of approximately 71GB on an NT4 SP6a NTFS filesystem.  I've tried to copy it to the new array twice, and both times it fails at exactly the same spot (82.3%, the same amount of data, ~58GB).  More info: I wanted to see if the original "failed" drive (a 73GB drive) was physically or logically failed, so I put it into another system and was able to view it under the Adaptec SCSI utility.  However, the Adaptec utility showed the drive capacity as only 4,144MB, even though it's a 73GB drive.  Very strange.
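When a copy fails at exactly the same offset every time, the usual suspect is an unreadable region on the underlying array rather than the copy tool.  One way to get a (damaged but complete-length) copy past such a region is a chunked copy that zero-fills unreadable blocks instead of aborting - the same idea as `dd conv=noerror,sync` on Unix.  A hedged Python sketch; the chunk size is an assumption, and Exchange may well refuse a database with zeroed regions:

```python
import os

CHUNK = 1024 * 1024  # 1 MiB chunks; an assumed, tunable size

def salvage_copy(src_path, dst_path):
    """Copy src to dst chunk by chunk.  Any chunk that raises an I/O
    error is replaced with zero bytes instead of aborting the copy.
    Returns the list of byte offsets that could not be read."""
    bad_offsets = []
    size = os.path.getsize(src_path)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        offset = 0
        while offset < size:
            want = min(CHUNK, size - offset)
            try:
                src.seek(offset)
                chunk = src.read(want)
            except OSError:
                chunk = b"\x00" * want      # unreadable region: zero-fill
                bad_offsets.append(offset)
            dst.write(chunk)
            offset += want
    return bad_offsets
```

On a healthy volume this returns an empty list; on the failing array it would report the offsets clustered around the 82.3% mark - which on a 71GB file works out to roughly 58GB, matching where Robocopy and the tape backups stall.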

This morning I put another new 73GB drive into an empty slot in the array, hoping to assign it as a hot spare.  When I did that, the drive was recognized as unassigned, but it also showed a status of "failed".  I then tried a 36GB drive; it appeared to allow me to create the hot spare, but it threw a warning that since it was smaller than some of the drives in the system, it couldn't be used as a hot spare to back those drives up.  Since the failed drive is a 73GB (but again is only using 36GB), I cancelled the hot spare creation.

I then tried to back up the RAID partition using Acronis True Image.  Acronis found several bad sectors but was able to continue copying the partition anyway.  I have restored that partition to the new array in the system and can see my data, but I don't know whether Exchange Server will see that edb file (71GB) as valid, as there may be data for that file in one or more of the bad sectors that Acronis True Image did not copy.

It seems to me that perhaps the RAID controller is having problems, or the RAID configuration itself has become corrupted.  I don't have enough experience with these matters to know whether the symptoms could be caused by this and/or what to do if that's the case.  As usual, thanks for any insight.  I will try to use a new 36GB drive for reconstruction again and will provide results and CHS info shortly.


emilysamAuthor Commented:
OK, I tried to reconstruct using a brand new 36GB Fujitsu drive (MAP3367NC) and it failed.  It started to reconstruct, but about half an hour into the process, the reconstruction status screen stopped showing the progress and instead just said "No LUNS are reconstructing".  Now, when I look at the LUN, it's degraded and the new disk shows as Failed.  Here is the geometry info as best I can tell:

Original Failed Drive:  Seagate ST173404LC, 73.4 GB formatted capacity.  There is no info on cylinders in the documentation for this drive.  It does show 24 heads, and a default sector size of 512 bytes.

Replacement Drive 1 (failed during reconstruction):  Seagate ST373207LC, 73 GB formatted capacity.  90,774 cylinders, 2 physical heads, default 512 bytes per sector

Replacement Drive 2 (failed during reconstruction):  Fujitsu MAP3367NC, 36.74 GB formatted capacity.  48,104 cylinders, 2 heads, 512 bytes/sector (I think).

The RAID software has an option called "Manual Reconstruction", but I hesitate to use it, as the manual warns that it should only be used if the included Recovery Guru software (or tech support) instructs the user to do so.  They warn of data loss if used incorrectly.  The instructions say to use it only if automatic reconstruction doesn't start when a new drive is placed into the slot where the original drive failed.

It seems that you have a double drive failure in your RAID-5.  That's why the new replacement drive cannot be reconstructed.
All the other symptoms are consistent with this (Acronis also found bad sectors in the RAID partition, and reconstruction fails at the same percentage).  It's hardly a RAID-5 configuration problem (if it were, your RAID would become inaccessible from any software).
Manual reconstruction is really not a good idea.  It's probably better to call in professionals.  But in the case of a double drive failure it's very unusual to get your data back, since it's RAID-5, not RAID 1+0.
Any further actions may make things worse; perhaps someone from the EE community disagrees with me and can give better advice.
emilysamAuthor Commented:
I don't think it's a double drive failure.  When I look at the drive display, it shows only 1 drive in the drive group as failed.  The data is still accessible and the email server is still running.  If two drives had failed, I'd have no access to the LUN on my server.

This will likely all become an exercise in futility anyway, as I've moved 80% of the mailboxes that are stored on this failing array over to a new Exchange Server that has its storage located on our new SAN.  If I can get the rest of the mailboxes migrated over before this array finally gives out completely, I'll be good to go.

emilysamAuthor Commented:
I would agree with Mr. Limo that this is a logical corruption issue.  That makes sense, since I can't rebuild the drive either.  If indeed the array fails before I get all the data off it, I will likely have to go to backups.  Split points for the effort...
Question has a verified solution.
