Servers keep failing RAID drives

We have two new servers with identical hardware. On Server 1, random drives in the RAID5 array fail every few days (port 3 twice, port 2 once, port 1 once). On Server 2, the same drive on port 3 has failed 5 times now. Each time, I use Intel Matrix Storage Manager to mark the drive as normal, and it rebuilds after about a day and a half. Then another drive fails within a few days.

I have the vendor working on it and they're working with Intel, but so far no luck. A new motherboard is on its way for Server 1 (the one with random drives failing), and we've tried replacing the constantly failing drive on Server 2, but the replacement still failed within a day. I haven't yet tried to destroy the RAID volume and start from scratch.

Does anybody have any other ideas?  I really need these boxes to be stable before I can deploy them to our remote office.

Server Configurations:
*Intel S3420GP motherboards
*4 x 500GB Seagate Barracuda (ST3500418AS) hard drives running in a RAID5 configuration using on-board Intel Matrix RAID
*I think the drive cage is model number R4E9Q4E4 C4E3R4E10.
*Windows Server 2003 R2 with basic services (File, Printer, DFS, rsync).
dslntadmin Asked:
Justin Owens (ITIL Problem Manager) Commented:
Is this a homemade server or one obtained from a Vendor?
SnowWolf Commented:
Have you installed the latest Intel Matrix driver?
David (President) Commented:
First, the Intel Matrix RAID controller is not server class, and is really just an awful controller for anything other than hobbyist use. The controller is telling you that it is in over its head and needs to be replaced.

You are also using $50 bottom-of-barrel consumer class disk drives.  Totally unacceptable. Not even qualified for this use. Buy enterprise class storage.  I would also examine the process that allowed somebody to authorize using disks designed for 2400 hours annual use in a server.  

In short, Get enterprise class storage and a decent RAID controller, and you will not have such problems.  Also, take frequent backups.
dslntadmin (Author) Commented:
Yes, this is a vendor-configured server and we are using the latest Matrix drivers. I'm starting to wish that I'd built it myself like the last two; they've been running flawlessly for two years now.
David (President) Commented:
(Distributor cost for those drives actually makes them a little under $50.)

... So you are trusting your business to an unreliable $200 worth of disk drives and a $10 chip. Budget $1000 and get a controller with battery backup along with disks that have been qualified by the vendor for 365x24x7 use.
Justin Owens (ITIL Problem Manager) Commented:
Methinks your vendor is trying to make too much of a profit.  Those parts are not designed for heavy, enterprise use.
Justin
SnowWolf Commented:
Some other things to consider:

Try changing the SATA cables.
The backplane where the drives connect may be faulty.
The PSU may not be strong enough.
AnnOminous Commented:
I have seen faulty SATA cables create this type of problem. Given the pattern of drive failure, I would try rearranging SATA cables on the more stable system to see if that changes the behaviour.

As to whether this is a proper server configuration, I think we would all agree that the system has not exceeded even 2400 hours between failures.

I do agree that your vendor appears to be cutting a few cost corners. I've seen systems run quite well with low cost drives and Intel Matrix Storage - even beating some higher cost systems - but the level of support may not be what you want in the long term.
andyalder Commented:
>I think we would all agree that the system has not exceeded even 2400 hours between failures.

It's not so much 2400 hours per year as a 30% duty cycle that they are designed for; that means they're designed to be used for about 8 hours per day, or roughly 20 minutes in every hour.
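For anyone who wants to check the arithmetic, here is a quick sketch in Python. It only converts the three figures mentioned in this thread into fractions of time in use; the function name is mine, not anything from a drive datasheet.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def duty_cycle(active, period):
    """Fraction of the period during which the drive is in use."""
    return active / period


# The three ways the rating has been expressed in this thread.
print(f"2400 h per year : {duty_cycle(2400, HOURS_PER_YEAR):.1%}")
print(f"8 h per day     : {duty_cycle(8, 24):.1%}")
print(f"20 min per hour : {duty_cycle(20, 60):.1%}")
```

All three come out at roughly a quarter to a third of the time, which is a long way from the 100% a server runs at.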

And the drives aren't failing; the controller is dropping them out of the array. When a desktop drive reaches a slightly dodgy block of data, it retries several times until it reads the data correctly, and that takes longer than the controller is willing to wait, so the controller decides the drive has failed. Drives designed to go on RAID controllers report such dodgy blocks as unrecoverable without all those retries that would eventually get the data, so instead of failing the whole disk the controller gets the data from parity and re-writes the dodgy block. Desktop drives are also not designed to sit alongside other disks; they don't like the vibrations.
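To illustrate the "gets the data from parity" part, here is a minimal Python sketch of RAID5-style XOR parity on a single stripe. The block contents and the four-byte block size are made up for the example, and this is only the general technique, not how the Intel Matrix firmware is actually implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a 4-drive array: three data blocks plus one parity block.
# The contents are illustrative placeholders.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose the second drive reports its block as unrecoverable.
# XOR-ing the surviving data blocks with the parity block gives it back.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The controller can then write the rebuilt block back to the drive, which normally remaps the bad sector, and the drive stays in the array.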

You could use these disks on software RAID and they wouldn't keep dropping out because software RAID relies on OS drivers that are used to waiting a while for disks to retry. Wouldn't recommend it though; do what dlethe says and replace with disks designed for 100% duty cycle and re-use these in desktops.
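The difference comes down to who is willing to wait longer. A rough Python sketch of that comparison follows; every number in it is a hypothetical placeholder picked to show the idea, not a measured value for these drives, the Matrix controller, or any Windows driver.

```python
def host_verdict(drive_recovery_s, host_timeout_s):
    """What the host does while a drive spends drive_recovery_s on one bad block."""
    if drive_recovery_s > host_timeout_s:
        return "dropped from array"
    return "kept in array"


# Hypothetical figures, chosen only to illustrate the comparison.
DESKTOP_DRIVE_RECOVERY_S = 60  # desktop drive keeps retrying internally
CONTROLLER_TIMEOUT_S = 10      # RAID controller gives up waiting quickly
OS_DRIVER_TIMEOUT_S = 120      # OS disk driver is more patient

print("RAID controller:", host_verdict(DESKTOP_DRIVE_RECOVERY_S, CONTROLLER_TIMEOUT_S))
print("Software RAID  :", host_verdict(DESKTOP_DRIVE_RECOVERY_S, OS_DRIVER_TIMEOUT_S))
```

With these assumed numbers the same drive is dropped by the controller but tolerated by the OS driver, which is the behaviour described above.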
AnnOminous Commented:
andyalder has a good analysis.
If you want to low ball it, use software RAID.
Otherwise, pay the money and do it right.
David (President) Commented:
Don't lowball with software RAID5. Just google all the hits for data loss on software RAID5 on Win2K3 boot devices. It works just fine until you have a drive failure; then some PCs will crash and others won't boot up. Human intervention may be necessary. Considering the investment that a company has in a server, it is just stupid to cut corners on the storage farm. Do it right, or just outsource.
AnnOminous Commented:
Agree with dlethe on not using RAID5 for the boot drive.

But a RAID1 (mirrored) boot drive with a RAID5 data drive should play nice - even if it's a little slow.

I have a Vostro 420 in exactly that configuration.
andyalder Commented:
LOL, you're both joking right? Windows can't boot from software RAID 5 because it doesn't understand what RAID is until it's loaded ftdisk.sys.
AnnOminous Commented:
If you consider Intel ICH10R as a software RAID, then you can create a RAID5 boot disk.

I wouldn't recommend it, but you can do it.
andyalder Commented:
We generally call that fake raid.
dslntadmin (Author) Commented:
We will be returning the servers. Thanks for all the help.
dslntadmin (Author) Commented:
Didn't really solve the issue, but confirmed what I was already beginning to suspect about the quality of the hardware.
dslntadmin (Author) Commented:
It turns out the Intel Matrix Storage Console 8.9 was causing the issues. They downgraded to 8.8 and that seems to have solved it; no problems so far.

http://communities.intel.com/thread/5036?start=0&tstart=0