fisc

Hard Drive Failures in RAID System

This is sort of a continuation of a previous post I had.

Problems have recurred.  We are running Adaptec's verify media on all of our disks.  So far all of them have FAILED except for one, which is now 98% complete.  That one was our hot spare and was never actually used in the array.  One of the drives that failed had just been put into the array as a replacement.

Motherboard: Asus PU-DLS, firmware build 1006
RAID card: Adaptec 2810SA, latest firmware
Hard drives: Seagate Barracuda 7200 120 GB ST320026AS
Power Supply: Antec 550W
Raid configuration: RAID 5, 6 drives (5 in array, 1 hotspare)
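
For reference, here is a rough sketch of what this layout gives in capacity and fault tolerance. It uses the nominal 120 GB per drive and standard RAID 5 arithmetic (one drive's worth of space goes to parity); the function name is just illustrative and not part of any Adaptec tool.

```python
def raid5_usable_gb(drives_in_array, drive_size_gb):
    """Usable capacity of a RAID 5 array: one drive's worth of space holds parity."""
    if drives_in_array < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drives_in_array - 1) * drive_size_gb


# 5 drives in the array, 120 GB each (nominal); the hot spare adds no capacity.
usable = raid5_usable_gb(5, 120)
print(f"Usable capacity: {usable} GB")

# RAID 5 survives the loss of exactly one member drive. The hot spare only
# restores redundancy once the rebuild onto it has finished.
FAILURES_TOLERATED = 1
print(f"Member drive failures tolerated: {FAILURES_TOLERATED}")
```

So a second drive dropping out before a rebuild completes is enough to take the whole array down, which fits what we are seeing.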

We have a UPS system.  Basically the problems recurred: Adaptec said a hard drive had failed, Windows would completely stop responding for no reason (nothing in the event log), and rebuilds to a new drive failed.  Now we have started verifying media on each drive, and all but this one have failed so far (4 total failures, with two more to test, which includes the two drives we previously pulled and replaced).

Looks like we are going hard drive shopping.  But the question is:  What is causing these drives to fail??? We don't want to replace them only to have it happen again.  Could they all have gotten bad sectors due to a specific event?  Or is something in the hardware causing this?

I realize this is a crazy complex question, but we are going crazy here.  We would appreciate greatly your help.
nobus

I suppose you got errors on the drives and tested them separately (I mean, not in an array). If they fail separately, the trouble most probably comes from the drives themselves. It is weird, though, to have so many drives going faulty.
This leads to the second conclusion: maybe the board is causing this (RAID controller, motherboard).
Not knowing how much troubleshooting you have already done, I suggest replacing the board, if not done already.
I suppose you did the necessary lookup for updates (hardware, software, firmware), incompatibilities, etc.


fisc

ASKER

We have replaced the RAID controller.  Of course, we only replaced this controller the other day, so it is possible that the old one had previously corrupted the drives.  Adaptec says that since the only thing connecting the controller and the drives is a SATA cable, they don't think that is possible.  But of course they don't want it to be on their end.

Yes, we are testing these drives separately.
I would suspect the RAID card, if all of a sudden all the drives are having problems.  The other possibility is a bad power supply.
fisc

ASKER

Our Motherboard firmware is actually one build # behind, and of course we will flash it to update it. However, the notes on the new build don't have any fixes having to do with RAID, this card, or anything in our system.
Maybe you should list the replacements and actions you have already performed, to give us more insight.
I guess the system is installed properly, with good grounding, well away from big electrical consumers like lifts and air conditioning. You can put a monitoring device on the AC line if in doubt.
fisc

ASKER

Mid-August: Drive #3 failed, we had lots of problems and it wasn't booting.  We had to do a trust array, and eventually recreated the array in the same order, booted Windows again, and rebuilt to #5.

This Week: Windows started freezing and the alarm sounded (normally that signals a failed drive, and the rebuild should be automatic; it should not freeze Windows, though).  On manual restart it showed the array as optimal (?).  After this happened a few times, with the freeze coming 20 minutes to an hour after boot, Windows stopped booting and the array was marked as degraded.  Drive #5 was the bad drive according to Adaptec's BIOS.  We replaced drive #5.  Windows still wouldn't boot correctly, and we could only get into it if we did a trust array.  We tried to get a backup, but the system would stop responding (mouse not even moving) before it completed. (We do have a relatively new backup, though.)

So drive 3 was replaced in August; the replacement was never in the array, it was the hot spare.  Drive 5 was just replaced, and Adaptec's verify media utility said the new one failed (along with every other drive so far except 3).

Just ran Seagate's diagnostic test on one of the drives that Adaptec's verify media said failed, and it PASSED Seagate's test.  Now, Seagate's test only took an hour, while Adaptec's took about 5 or 6 hours.... so maybe it's not as thorough???
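
As a rough sanity check on those durations, here is a back-of-the-envelope sketch of how long one full sequential read of a drive should take. The 50 MB/s sustained read rate is an assumption for illustration, not a measured figure for these drives, and the function name is made up.

```python
def full_read_minutes(capacity_gb, read_mb_per_s):
    """Minutes needed to read every sector once at a given sustained rate."""
    capacity_mb = capacity_gb * 1000  # decimal GB, as drive vendors count
    return capacity_mb / read_mb_per_s / 60


# 120 GB drive, ASSUMED 50 MB/s sustained sequential read.
minutes = full_read_minutes(120, 50)
print(f"One full read pass: about {minutes:.0f} minutes")
```

Under that assumption a one-hour test is long enough to have read the whole surface once, so a shorter run time does not by itself mean the test skipped sectors. A 5 to 6 hour run could mean extra passes, writes as well as reads, or simply a slower path through the controller; I can't tell which from the run time alone.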

Adaptec's firmware and drivers are the most recent.  Seagate doesn't have any firmware updates for its SATA drives.  We have a temperature gauge on our case, and it always stays at a good temperature, between 27 and 32 degrees C.
It is still possible that the controller card has failed.
Well, it is not an easy one to diagnose, as you are well aware. I suggest running the test on each faulty drive for a whole night, to get a more dependable result. However, from the way you describe the failures, and certainly if the drives come out of the test OK, I would suspect 1 - the RAID controller, or 2 - the motherboard. If you do find faulty drives, it CAN be that you have run into a batch of bad drives, but that seems very unlikely.
A combination of two factors is also possible: some drive failure AND another faulty part (it can be anything, but you know the most likely ones). As this makes the troubleshooting extremely difficult, proceed with thoroughly tested parts only.
fisc

ASKER

The server is in the same place that the company has had it for years with previous servers.  We started running this one in June.  Haven't had any other problems.
fisc

ASKER

We are definitely going to run diagnostic tests on all NEW drives that we buy (probably two different ones).  We are considering buying a new RAID card or abandoning SATA for SCSI.  We'll probably replace the power supply today. We're even considering just scrapping this server.  Bottom line is the company is down!  And the worst part is, as you noted, we haven't even diagnosed what is causing the problem yet!!  Your suggestions are very much appreciated and we'll definitely look into and test them.
It would be easy for you to just press that red button named 'Exact Diagnostics'...
hehehe

Some Background:
If you want to verify RAID controller problems, you need to know that a RAID controller transfers data in a much simpler manner than you may think. It manages how data is transferred and to which disk in the array.
In other words: a faulty RAID controller would have scrambled the whole RAID structure and would eventually bring about the total loss of the RAID configuration.

Diagnostics:
Well, first install the Adaptec Storage Manager, where you can verify proper system functionality.
Afterwards, you can view the disks via the 'DiskPart' utility within Windows and diagnose those disks. Follow these steps:
1. Run DiskPart utility via: Start => type 'cmd' and press 'Enter' => type 'diskpart' and press 'Enter'.
2. Type 'list disk' to view all the disks installed on your OS.
3. Type 'select disk <Disk #>'.
4. Type 'Detail disk' and press Enter. Verify that the OS sees that drive as it should see it.
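
The steps above can also be run unattended. Below is a minimal sketch that only builds the text of a DiskPart script for the commands listed ('list disk', 'select disk', 'detail disk'); the function name and the disk numbers are illustrative, so substitute the numbers that 'list disk' reports on your system.

```python
def build_diskpart_script(disk_numbers):
    """Return DiskPart script text that lists all disks, then details each one."""
    lines = ["list disk"]
    for n in disk_numbers:
        lines.append(f"select disk {n}")
        lines.append("detail disk")
    return "\n".join(lines) + "\n"


# Example: inspect disks 0 and 1 (hypothetical numbers).
script_text = build_diskpart_script([0, 1])
print(script_text)
```

Save the output to a text file (for example disks.txt) and run it from an administrator command prompt with 'diskpart /s disks.txt'. Keep in mind that with a hardware RAID controller Windows normally sees the array as a single logical disk, so this shows how the OS sees the array rather than the individual member drives.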

Solution:
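You can recover a RAID structure within Windows using the same DiskPart utility and its 'repair' command. Note that 'repair' works on RAID-5 volumes built by Windows on dynamic disks; it does not rebuild an array managed by a hardware controller.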

Links:
More on 'DiskPart':
http://home.earthlink.net/~rlively/MANUALS/COMMANDS/D/DISKPART.HTM

Adaptec Storage Manager:
http://www.adaptec.com/worldwide/support/driverdetail.jsp?sess=no&language=English+US&cat=/Product/AAR-2810SA&filekey=Adaptec_Storage_Manager-Windows.exe

Hope that helped a bit...

Cyber
fisc

ASKER

Thanks, but I guess the bottom line is the drives are failing.  One drive failed the Adaptec test but passed Seagate's... we've been doing more tests, and for all the other drives that Adaptec's verify media test said failed, Seagate's test agreed.

So drives are failing. Why? I don't know. That's the central question.

I think we're going back to SCSI
The simple answer would be:
Sometimes it takes a misconfiguration rather than a malfunction to shut down a site.

See, the drive configuration should be the same for all devices (i.e. block size, drive order, and so forth).

What's my point? Can you use Storage Manager to get more info on each drive?

Cyber
First, check for power issues.  Are there any other drives (CD/DVD) or anything else drawing power?  You need 24 watts per drive on spin-up, per Seagate's spec.  Tally your motherboard's requirements along with those of the cards and other components, and see how the total compares to the Antec supply's rating.
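
Here is a minimal sketch of that tally. The 24 W per drive and the 550 W supply come from this thread; every figure in the 'other' table is a made-up placeholder, so replace them with the numbers from your own component specs.

```python
SPINUP_WATTS_PER_DRIVE = 24  # per-drive spin-up figure quoted above


def spinup_budget(num_drives, other_loads_watts, psu_watts):
    """Sum drive spin-up draw plus other loads and compare to the PSU rating."""
    drive_watts = num_drives * SPINUP_WATTS_PER_DRIVE
    total = drive_watts + sum(other_loads_watts.values())
    return {"drives": drive_watts, "total": total, "headroom": psu_watts - total}


# HYPOTHETICAL placeholder loads in watts; look up the real values.
other = {
    "motherboard_cpus_ram": 250,
    "raid_card": 25,
    "optical_drive": 25,
    "fans": 15,
}

budget = spinup_budget(6, other, 550)  # 6 drives, 550 W Antec supply
print(budget)
```

The six drives alone account for 144 W at spin-up. Also bear in mind that the total rating is only part of the picture: drive motors pull mostly from the 12 V rail, so check that rail's rating on the supply's label as well.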

If you are OK on power, all signs point to the SATA controller being the cause.  The drives are likely fine if they pass the Seagate test.

I have found that in dealing with ATA arrays -- from Promise Cards to Nexsan FC rackmounts -- they were far less stable (and far more flaky on rebuilds!) -- than SCSI/FC RAID setups.

Sure, you may have to bite the bullet on adapter and drive cost, and deal with less drive space, but we seem to be talking about something that is mission critical.  SCSI RAID has been on the block a lot longer, and that makes me inclined to prefer it for critical services over SATA RAID.  Granted, you can't beat the cost/GByte on SATA!
fisc

ASKER

I don't know what the proper etiquette is on EE... whether I should award points for helpful hints although none seemed to directly find the answer.

Here's what we did... we first bought a new SATA RAID card and did a trust array with the drives.  It still said the array was degraded.  Now could the old SATA controller have caused the drives to fail?  Not sure.  Hard to test that, but I don't suspect so.  We decided that it was a total loss at this point so we bought a SCSI controller and drives, built an array and tried to install Windows Server.  The Windows Server would get to a certain point and then stall (at the point of "Setup is Starting Windows")... some things online said this could be hardware issues (yes, we did install the SCSI RAID driver).  Anyway, we sent it off to the place we bought it.  It wouldn't work for them either.  They took out every part, tested it, they all seemed to work individually.  Then they put it back together and it worked.

So ... what does that tell us?  I don't know.  Were the two issues (failing drives and Windows installer freezing) related?  Not sure.  But perhaps there was some loose connection in there, and when they took it apart and rebuilt it that was resolved?  Could that have been causing the hardware failures?  I don't know!  But at least for now our system is working with our SCSI RAID.
fisc

ASKER

I think (hopefully finally the answer this time) the whole situation revolved around the motherboard.  Most recently, the screen was blank... would not respond.  We manually rebooted and all we got was a long beep from the motherboard, a pause, and another long beep.  Nothing would ever come up on the screen.  I don't think there is anything wrong with the memory.

But we replaced the motherboard and memory (we had to replace the memory because the new motherboard didn't support ECC memory), and after some trouble with drivers and getting Windows to like the new motherboard we are sitting pretty good now.  Hopefully no problems will recur... they shouldn't, as we have replaced everything but the floppy drive, fans, and case at this point!  At least it didn't kill our hard drives (brand new SCSIs) this time.
fisc

ASKER

Just an update.... all is running great with the new motherboard and it's been over two months.  So if you have this problem, try a new motherboard!