Reconstruction in RAID 5 fails
Posted on 2006-07-06
I have a "MetaSTOR" external SCSI RAID array. I believe the controller is from LSI Logic. The unit has a single RAID 5 LUN made up of 9 drives, a mix of 36GB and 73GB disks. Since they are all in the same LUN, only 36GB is used on each 73GB disk.

About a month ago, a 73GB disk failed. We replaced it with a 36GB disk; reconstruction started automatically and completed in about 2-3 hours. Earlier this week another drive failed. This time, when we replaced it with a brand new drive, reconstruction started but stopped after about 1 hour, with the controller showing the new drive as "failed" and the LUN as degraded. We tried another drive and the same thing happened. The status message in the RAID software (SYMPlicity Storage Manager) says that a drive experienced a write failure and was failed.

This array happens to hold our Exchange database (71GB), so of course we've added extra backups to our regular schedule to make sure we can recover if another drive fails. The problem is that in both tape backups we've tried, the job gets about 90% of the way through the .edb file and then fails, saying the Information Store is not available. Last night I shut down Exchange and tried to copy the database file to another location as a "cold" backup, using Robocopy from the Windows Resource Kit. It got to about 82.3% of the file and then just restarted from 0%.

So it seems to me that something is wrong with the array itself, perhaps with the RAID controller. It also seems that this may be causing some sort of filesystem problem, since both the backup software and Robocopy have trouble reading the file. What's strange is that the email server is operating normally, as you would expect with a degraded RAID array (albeit with some performance degradation). I am at a loss as to what to do next.
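
One way to find out exactly where the reads fail, instead of having the copy restart from 0%, would be a chunked copy that records the unreadable offsets and keeps going. The following is only a minimal Python sketch of that idea. The function name, the chunk size, and the zero-fill behavior are illustrative assumptions and have nothing to do with how Robocopy or the backup software work. Any region it reports is zero-filled in the destination, so the copy would not be a valid database on its own; the point is to locate the damaged part of the file.

```python
import os


def copy_reporting_bad_regions(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks.

    Returns a list of (offset, length) tuples for regions that could not
    be read. Those regions are written as zeros in dst (an assumption of
    this sketch, chosen so that later offsets stay aligned).
    """
    bad_regions = []
    total = os.path.getsize(src)
    # buffering=0 so each read maps to one OS-level read of the chunk
    with open(src, "rb", buffering=0) as fin, open(dst, "wb") as fout:
        offset = 0
        while offset < total:
            want = min(chunk_size, total - offset)
            try:
                fin.seek(offset)
                data = fin.read(want) or b""
            except OSError:
                data = b""
            if not data:
                # Unreadable chunk: remember where it was and zero-fill it
                bad_regions.append((offset, want))
                data = b"\x00" * want
            fout.write(data)
            # A short read is fine; continue from wherever we got to
            offset += len(data)
    return bad_regions
```

Calling it with the database path and a destination on the new array would return an empty list if every chunk was readable, or the offsets of the chunks that failed. A smaller `chunk_size` narrows down the damaged region at the cost of a slower copy.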
I thought of perhaps configuring a hot spare in an empty slot of the RAID array so that it would automatically be used in place of the failed drive. I only need the system to run for another 5-6 days, as it's being retired and replaced by a completely new server with a SAN (isn't that always the case - the thing fails right before it's about to be put out of service!!!). I have actually attached a completely new RAID array via direct-attach Fibre Channel to move the data off the failing array, but I can't seem to get the single 71GB file stored on the old array to copy over.

Restoring from backup would be quite difficult, though not impossible: we'd need to recover hundreds of transaction log files from multiple tapes and then apply them to a full backup that is a couple of weeks old. I thought of running Chkdsk on the logical drive to see if that might clear up the problem I'm having accessing the file, but I hesitate to do so without a more recent backup. Catch-22, I guess.

Ideas would be most appreciated. If I don't respond to questions, it means my email is not working (RAID array failed completely!!!) and you can send replies to firstname.lastname@example.org.