RAID 5 keeps losing a drive

Hello experts:

I have a fairly new server (less than 3 months old) that drops a drive out of its RAID 5 array every once in a while.  This has happened twice so far, close to once a month.  The server is all Intel, built on the SE7520BD2V server board.  The RAID controller is an SRCS28X, and I'm running 5 x 250GB Seagate drives in a RAID 5 with one hot spare (6 drives total in the server).  It came pre-installed with Windows Server 2003 SBS, which I had to re-install, including loading the RAID controller driver from CD.

The first time a drive "fell" out of the array, the hot spare immediately took its place and began a rebuild.  I got on the phone with Intel tech support. Drive 5 showed as neither in the array nor a hot spare.  With their utility, I identified drive 5 and, following the Intel tech's instructions, pulled it out of the server.  The server blue-screened and shut down.  We went into the BIOS version of the utility and were able to add the drive back into the array (big whew!).  Anyway, I mention this because the drives do not identify in the slots where they should be.  Drive 0 identifies as drive 1 and drive 1 as drive 0.  2 and 3 are swapped, and 4 and 5 are swapped.  So now they're labeled correctly on the outside of the server.  I haven't had a chance to verify yet, but I'll bet the backplane is mis-wired.  So, do I want or need to fix this???  If I change the cables and also change the drives, do I run the risk of losing the array?  This is a production server in a small company, so it's very bad when it's down.

The second time a drive "fell" out of the array (about a month later), I just let the spare rebuild.  The dropped drive then showed ready, and I added it back in as the new hot spare.  It did have a media error of 113, but the Intel techs don't know what the media error codes mean.  Right now the server seems to be running fine, but if you watch the drive lights, they all go out every few minutes and come back on one by one.  Would the mis-cabling cause these problems???

   Has anyone ever seen this type of behavior or have suggestions on how to proceed???

jhuntii (Author) Commented:
Oops - this should have been posted in the Storage section.  Any way to move this question - moderator - anyone???

"do I run the risk of losing the array? "
You most likely have a bad RAID controller.
You should consider a different type of RAID or a different controller; it sounds like this one is going to die and you will lose it all.
On second thought, if the server is only 3 months old, send it back: the hardware is defective.  If you chose the Seagate drives yourself, there could be some incompatibility with the controller.  If they sent you the system this way, you should send it back, the sooner the better.

This is probably a cabling issue. Are the drives SATA? If so, and if it is possible to swap the SATA cables for good-quality ones, do that (in collaboration with the server manufacturer). If you can't do it yourself, ask them to swap the backplane. I recently had a similar problem with an onboard SATA RAID 5 controller, which was also Intel, and Shuttle (the PC manufacturer) sent me a new set of SATA cables. Since then the server has been working flawlessly.

The numbering of your disks has probably changed because the hot spare moved into the array. That will of course change the numbering.

Dushan De Silva, Technology Architect, Commented:
The problem is either a cabling issue or possibly a controller port issue, but more likely corrupted parity on the RAID 5 array itself. Most controllers have an option to run a consistency check (the term LSI-based RAID controllers use), which will attempt to check and rebuild any corrupted parity information. The other option is to migrate the RAID 5 to a RAID 0 to delete the parity, and then migrate the array back to RAID 5 to create a new set of parity. Most LSI/Adaptec controllers allow you to do this, but I'm not sure about your controller.
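
To make the idea of a consistency check concrete, here is a minimal Python sketch of what such a check does conceptually on one RAID 5 stripe: the parity block is the byte-wise XOR of the data blocks, so a check recomputes that XOR and compares it with the stored parity. The function names and the tiny 2-byte blocks are made up for illustration; this is not the SRCS28X firmware or any LSI utility, just the general XOR-parity principle.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe is consistent when the stored parity equals the XOR of its data."""
    return xor_blocks(data_blocks) == parity_block


def rebuild_missing(surviving_blocks, parity_block):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Hypothetical stripe: 4 data blocks + 1 parity block, as on a 5-drive RAID 5.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(data)

print(stripe_is_consistent(data, parity))         # parity matches the data
print(stripe_is_consistent(data, b"\x00\x00"))    # stale or corrupted parity
print(rebuild_missing(data[:2] + data[3:], parity) == data[2])
```

The same XOR that detects stale parity is what lets the hot spare be rebuilt from the surviving drives, which is why a rebuild on top of bad parity can silently reconstruct bad data.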
jhuntii (Author) Commented:
OK, thanks very much for your help.  I started a consistency check after the first drop-out, but if I remember correctly, I stopped it after it had run for about 3 days.  I thought the check was hung, although the server was up.  If new cables will fix it, that would be great - I'll try that first.  Otherwise, I'll see if I can get a replacement controller.  If I cannot swap it out through the vendor, what controller would you recommend??  (Yes, it's SATA - Intel SRCS28X - 8 port controller, currently using 6.)  I'll start taking backups of the entire system rather than just the data.  Any other suggestions??

These are SATA drives, right?

SATA drives, and certain brands of SATA drives in particular, have a tendency to occasionally time out on the bus when used in a RAID 5 configuration. The controller then disconnects the drive and the spare kicks in.

Make sure you have spares on your shelf, and do not build overly large RAID sets with SATA drives. Also keep good backups.

Often the "failed" drives can be inserted again, reinitialized and put as a spare. I have seen this behaviour on many SATA based systems lately.
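
As a rough illustration of the behaviour described above, the following Python sketch models a controller that drops any drive whose response time exceeds a timeout and assigns a hot spare to rebuild in its place. Everything here is hypothetical: the function names, the 7000 ms threshold, and the latency figures are invented for the example and do not come from the SRCS28X or any real firmware.

```python
def drives_to_drop(response_ms, timeout_ms):
    """Return ids of drives whose response time exceeded the timeout."""
    return [drive for drive, ms in sorted(response_ms.items()) if ms > timeout_ms]


def handle_timeouts(response_ms, timeout_ms, hot_spares):
    """Drop timed-out drives and pair each with a hot spare while spares last.

    Returns (dropped, rebuilds), where rebuilds maps a dropped drive id to the
    spare that takes over. A dropped drive with no spare left stays out of the
    mapping, which corresponds to a degraded array.
    """
    dropped = drives_to_drop(response_ms, timeout_ms)
    spares = list(hot_spares)
    rebuilds = {}
    for drive in dropped:
        if spares:
            rebuilds[drive] = spares.pop(0)
    return dropped, rebuilds


# Hypothetical snapshot: drive 2 stalls (for example during internal error
# recovery) while the other four array members answer quickly.
latencies = {0: 12, 1: 9, 2: 8500, 3: 15, 4: 11}
dropped, rebuilds = handle_timeouts(latencies, timeout_ms=7000, hot_spares=[5])
print(dropped, rebuilds)
```

In this model the dropped drive is not necessarily faulty; it only answered too slowly once, which fits the observation that such drives can often be reinitialized and reused as a spare.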
jhuntii (Author) Commented:
kkrans - This is what appears to be happening.  Although the cables are swapped (port0 -> port1, port1 -> port0), etc., it really appears to be a timeout issue on the drives.  I've seen some errors flash up indicating a response timeout, but I don't have the exact message.  Also, I have noticed this site indicating a solution for issues with standard grade SATA drives in RAID configurations.  

Thanks very much.  I think this is my problem.

This is a typical problem with certain SATA hardware. Usually you cannot do much to fix it. Look for firmware upgrades for the controller and the drives. That might help.

Another thing: make sure you have a reliable backup of your data, and keep some spare drives handy.
Have you tried other cables? As I've already mentioned, I've had this type of problem and after changing the SATA cables the problems never showed up again!