  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 649

RAID 5 keeps losing a drive

Hello experts:

  I have a fairly new server (<3 months old) that seems to drop a drive out of the RAID 5 configuration every once in a while.  This has happened twice so far - close to once a month.  The server is all Intel: the server board is an SE7520BD2V and the RAID controller is an SRCS28X.  I'm running 5 x 250GB Seagate drives in a RAID 5 with one hot spare (6 drives total in the server).  It came pre-installed with Windows Server 2003 SBS - which I had to re-install - including loading the RAID controller driver from CD.

The first time a drive "fell" out of the array, the hot spare immediately took its place and began a rebuild.  I got on the phone with Intel tech support - drive 5 showed that it was neither in the array nor a hot spare.  With their utility, I identified drive 5 and, following the Intel tech's instructions, pulled it out of the server.  The server blue-screened and shut down.  We went into the BIOS version of the utility and were able to add the drive back into the array (big whew!).  Anyway, I tell this because the drives do not identify where they should be.  Drive 0 identifies as drive 1 and drive 1 as drive 0.  2 and 3 are swapped, and 4 and 5 are swapped.  So now they're labeled correctly on the outside of the server.  I haven't had a chance to verify yet, but I'll bet the backplane is mis-wired.  So, do I want or need to fix this???  If I change the cables and also change the drives, do I run the risk of losing the array?  This is a production server in a small company, so it's very bad when it's down.

   The second time a drive "fell" out of the array (about a month later), I just let the spare rebuild.  The dropped drive then showed ready, and I added it back in as the new hot spare.  It did have a media error of 113, but the Intel techs don't know what the media error codes mean.  Right now the server seems to be running fine, but if you watch the drive lights, they all go out every few minutes and come back on one by one.  Would the mis-cabling cause these problems???

   Has anyone ever seen this type of behavior or have suggestions on how to proceed???

Thanks,

Asked by: jhuntii
 
jhuntii (Author) Commented:
Oops - this should have been posted in the Storage section.  Any way to move this question - moderator - anyone???

Sorry.
Jon
 
scrathcyboy Commented:
"do I run the risk of losing the array? "
YES
You most likely have a bad RAID controller.
You should consider a different type of RAID or controller; it sounds like this one is going to die and you will lose it all.
 
scrathcyboy Commented:
On second thought, if it is only 3 months old, send it back; the hardware is defective.  If you chose the Seagate drives, there could be some incompatibility with the controller.  If they sent you the system this way, you should send it back, the sooner the better.
 
rindi Commented:
This is probably a cabling issue. Are the drives SATA? If yes, and if it is possible to swap the SATA cables for good-quality ones, do that (in collaboration with the server manufacturer). If you can't do it yourself, ask them to swap the backplane. I recently had a similar problem with an onboard SATA RAID 5 controller, which was also Intel, and Shuttle (the PC manufacturer) sent me a new set of SATA cables. Since then that server has been working flawlessly.

The numbering of your disks has probably changed because the hot spare moved into the array. That will of course change the numbering.
 
Dushan De Silva Commented:
 
Dell100 Commented:
The problem could be a cabling issue or possibly a controller port issue, but more likely it is corrupted "parity" on the RAID 5 array itself. Most controllers have an option to run a consistency check (the terminology used by LSI-based RAID controllers), which will attempt to check and rebuild any corrupted "parity" information. The other option you have is simply to migrate the RAID 5 to a RAID 0 to delete the parity, and then migrate the array back to RAID 5 to create a new set of parity. Most LSI/Adaptec controllers allow you to do this, but I'm not sure about your controller.
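
For anyone unsure what a consistency check actually verifies, here is a minimal Python sketch of the idea, not the controller's firmware. RAID 5 parity is the XOR of the data blocks in a stripe, so a check recomputes that XOR and compares it with the stored parity, and a rebuild recovers a missing block from the survivors. The function names are hypothetical, the sample bytes are made up, and parity rotation across drives is left out. The four data blocks per stripe match a 5-drive RAID 5 like the one in the question.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def consistency_check(stripes):
    """Return the indices of stripes whose stored parity does not match the data.

    Each stripe is a (data_blocks, stored_parity) pair. Real RAID 5 rotates the
    parity block across the drives; that detail is omitted in this sketch.
    """
    return [i for i, (data, parity) in enumerate(stripes)
            if xor_blocks(data) != parity]


def rebuild_block(surviving_data, parity):
    """Recover the one missing data block from the surviving blocks plus parity."""
    return xor_blocks(surviving_data + [parity])


# Made-up sample stripe: 4 data blocks + 1 parity block (a 5-drive RAID 5).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
good_stripe = (data, xor_blocks(data))
bad_stripe = (data, b"\x00\x00")  # deliberately wrong stored parity

print(consistency_check([good_stripe, bad_stripe]))
print(rebuild_block(data[:2] + data[3:], good_stripe[1]) == data[2])
```

A real check walks every stripe on every drive, which is why it can run for a very long time on a large array.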
 
jhuntii (Author) Commented:
OK, thanks very much for your help.  I started a consistency check after the first drop-out, but if I remember correctly, I stopped it after about 3 days of running.  I thought the check was hung, although the server was up.  If new cables will fix it, that would be great - I'll try that first.  Otherwise, I'll see if I can get a replacement controller.  If I cannot swap it out through the vendor, what controller would you recommend??  (Yes, it's SATA - Intel SRCS28X - 8-port controller, currently using 6.)  I'll start taking backups of the entire system rather than just the data.  Any other suggestions??

Thanks,
Jon
 
kkrans Commented:
These are SATA drives, right?

SATA drives, especially certain brands, have a tendency to occasionally time out on the bus when used in a RAID 5 configuration. This disconnects the drive and kicks in the spare.

Make sure you have spares on your shelf and do not build too large RAID sets with SATA drives. Also keep good backups.

Often the "failed" drives can be inserted again, reinitialized and put as a spare. I have seen this behaviour on many SATA based systems lately.
 
jhuntii (Author) Commented:
kkrans - This is what appears to be happening.  Although the cables are swapped (port0 -> port1, port1 -> port0), etc., it really appears to be a timeout issue on the drives.  I've seen some errors flash up indicating a response timeout, but I don't have the exact message.  Also, I have noticed this site indicating a solution for issues with standard grade SATA drives in RAID configurations.  http://www.westerndigital.com/en/products/Products.asp?DriveID=114  

Thanks very much.  I think this is my problem.

jon
 
kkrans Commented:
This is a typical problem with certain SATA hardware. Usually you cannot do much to fix it. Look for firmware upgrades for the controller and the drives. That might help.

Another thing: make sure you have an efficient backup of your data, and keep some spare drives handy.
 
rindi Commented:
Have you tried other cables? As I've already mentioned, I've had this type of problem and after changing the SATA cables the problems never showed up again!