jhuntii

RAID 5 keeps losing a drive

Hello experts:

  I have a fairly new server (less than 3 months old) that seems to drop a drive out of the RAID 5 configuration every once in a while.  This has happened twice so far - close to once a month.  The server is all Intel: the server board is an SE7520BD2V and the RAID controller is an SRCS28X.  I'm running 5 x 250GB Seagate drives in a RAID 5 with one hot spare (6 drives total in the server).  It came pre-installed with Windows Server 2003 SBS - which I had to re-install - including loading the RAID controller driver from CD.

The first time a drive "fell" out of the array, the hot spare immediately took its place and began a rebuild.  I got on the phone with Intel tech support - drive 5 showed that it was neither in the array nor a hot spare.  With their utility, I identified drive 5 and, on the Intel tech's instruction, pulled it out of the server.  The server blue-screened and shut down.  We went into the BIOS version of the utility and were able to add the drive back into the array (big whew!).  Anyway, I mention this because the drives do not identify where they should be.  Drive 0 identifies as drive 1 and drive 1 as drive 0.  2 and 3 are swapped and 4 and 5 are swapped.  So now they're labeled correctly on the outside of the server.  I haven't had a chance to verify yet, but I'll bet the backplane is mis-wired.  So, do I want or need to fix this?  If I change the cables and also change the drives, do I run the risk of losing the array?  This is a production server in a small company, so it's very bad when it's down.

   The second time a drive "fell" out of the array (about a month later), I just let the spare rebuild.  The dropped drive then showed ready, and I added it back in as the new hot spare.  It did have a media error of 113, but the Intel techs don't know what the media error codes mean.  Right now the server seems to be running fine, but if you watch the drive lights, they all go out every few minutes and come back on one by one.  Would the mis-cabling cause these problems?

   Has anyone ever seen this type of behavior, or have suggestions on how to proceed?

Thanks,
jhuntii

ASKER

Oops - this should have been posted in the Storage section.  Any way to move this question - moderator - anyone?

Sorry.
Jon
"do I run the risk of losing the array? "
YES
You most likely have a bad RAID controller.
You should consider a different type of RAID or a different controller; it sounds like this one is going to die, and you will lose it all.
On second thought, if it is only 3 months old, send it back: the hardware is defective.  If you chose the Seagate drives yourself, there could be some incompatibility with the controller.  If they sent you the system this way, you should be sending it back, the sooner the better.
ASKER CERTIFIED SOLUTION
rindi
This solution is only available to members.
The problem is either a cabling issue, possibly a controller port issue, or, more likely, corrupted parity on the RAID 5 array itself. Most controllers have an option to run a consistency check (in the terminology of LSI-based RAID controllers), which will attempt to check and rebuild any corrupted parity information. The other option you have is simply to migrate the RAID 5 to a RAID 0 to delete the parity, and then migrate the array back to RAID 5 to create a new set of parity. Most LSI/Adaptec controllers allow you to do this, but I'm not sure about your controller.
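To illustrate what a consistency check is verifying, here is a minimal Python sketch of RAID 5 style XOR parity on a single stripe. It is only a simplified model, not the SRCS28X or LSI firmware logic: the function names and the sample byte values are made up for illustration, and a real controller also rotates the parity block across the drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe_is_consistent(data_blocks, parity_block):
    """A stripe is consistent when the stored parity equals the XOR of its data blocks."""
    return xor_blocks(data_blocks) == parity_block

def rebuild_missing(surviving_blocks, parity_block):
    """Recover one lost data block by XOR-ing the survivors with the parity block."""
    return xor_blocks(surviving_blocks + [parity_block])

# Hypothetical stripe on a 5-drive RAID 5: four data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xff\x00"]
parity = xor_blocks(data)

print(stripe_is_consistent(data, parity))                    # parity matches the data
print(stripe_is_consistent(data, b"\x00\x00"))               # stale or corrupted parity
print(rebuild_missing(data[:2] + data[3:], parity) == data[2])  # lost block recovered
```

If the stored parity is wrong, a rebuild onto the hot spare would reconstruct wrong data for that stripe, which is why regenerating the parity is worth doing before the next drive drops out.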
jhuntii

ASKER

OK, thanks very much for your help.  I started a consistency check after the first dropout, but if I remember correctly, I stopped it after about 3 days of running.  I thought the check was hung, although the server was up.  If new cables will fix it, that would be great - I'll try that first.  Otherwise, I'll see if I can get a replacement controller.  If I cannot swap it out through the vendor, what controller would you recommend?  (Yes, it's SATA - Intel SRCS28X - 8 port controller, currently using 6.)  I'll start taking backups of the entire system rather than just the data.  Any other suggestions?

Thanks,
Jon
These are SATA drives, right...

SATA drives, and especially certain brands of SATA drives, have a tendency to occasionally time out on the bus when used in a RAID 5 configuration. This disconnects the drive and kicks in the spare.

Make sure you have spares on your shelf, and do not build RAID sets that are too large with SATA drives. Also keep good backups.

Often the "failed" drives can be inserted again, reinitialized and used as a spare. I have seen this behaviour on many SATA based systems lately.
jhuntii

ASKER

kkrans - This is what appears to be happening.  Although the cables are swapped (port0 -> port1, port1 -> port0, etc.), it really appears to be a timeout issue on the drives.  I've seen some errors flash up indicating a response timeout, but I don't have the exact message.  Also, I noticed this site, which describes a solution for issues with standard grade SATA drives in RAID configurations:  http://www.westerndigital.com/en/products/Products.asp?DriveID=114

Thanks very much.  I think this is my problem.

jon
SOLUTION
This solution is only available to members.
Have you tried other cables? As I've already mentioned, I've had this type of problem, and after I changed the SATA cables it never showed up again!