Solved

Hard Drive Failures in RAID System

Posted on 2004-10-14
Last Modified: 2008-02-26
This is sort of a continuation of a previous post I had.

The problems have recurred.  We are running Adaptec's verify media utility on all of our disks.  So far all of them have FAILED except for one, which is now 98% complete.  That one was our hot spare and was never actually used in the array.  One of the drives that failed had only just been put into the array as a replacement.

Motherboard: Asus PU-DLS, firmware build 1006
RAID card: Adaptec 2810SA, latest firmware
Hard drives: Seagate Barracuda 7200 120 GB ST320026AS
Power supply: Antec 550W
RAID configuration: RAID 5, 6 drives (5 in array, 1 hot spare)

We have a UPS system.  Basically, the problems recurred: Adaptec said a hard drive had failed, Windows would completely stop responding for no apparent reason (nothing in the event log), and rebuilds to a new drive failed.  Now we have started verifying media on each drive, and all but this one have failed so far (4 failures in total, with two more to test, which includes the two drives we previously pulled and replaced).
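
For background on why a rebuild can fail when more than one drive is bad: RAID 5 keeps one parity block per stripe, and rebuilding a replaced drive means reading every surviving drive in full. The Python sketch below only illustrates that XOR parity idea. The block contents and the helper name xor_blocks are made up, and this is not a description of how Adaptec's firmware is actually implemented. It also works out the usable capacity of the 5-drive array listed above.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Usable capacity of the array described above: 5 drives of 120 GB,
# with one drive's worth of space used for parity.
drives_in_array = 5
drive_size_gb = 120
usable_gb = (drives_in_array - 1) * drive_size_gb
print(usable_gb)

# One stripe: four data blocks (made-up contents) plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Lose one drive: its block is recomputed from all the survivors plus parity.
lost = 2
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[lost])
```

Because the rebuild needs every surviving block, an unreadable sector on any of the other drives during the rebuild can be enough to make it fail.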

Looks like we are going hard drive shopping.  But the question is:  What is causing these drives to fail? We don't want to replace them only to have it happen again.  Could they all have gotten bad sectors due to a specific event?  Or is something in the hardware causing this?

I realize this is a crazy complex question, but we are going crazy here.  We would greatly appreciate your help.
Question by:fisc
20 Comments
 
LVL 91

Expert Comment

by:nobus
I suppose you got errors on the drives and tested them separately (I mean, not in an array). If they fail separately, the trouble most probably comes from the drives themselves. It is weird, though, to have so many drives going faulty.
This leads to the second conclusion: maybe it is the board causing this (RAID controller, motherboard).
Not knowing how much troubleshooting you have already done, I suggest replacing the board, if not done already.
I suppose you did the necessary lookup for updates (hardware, software, firmware), incompatibilities, etc.
 

Author Comment

by:fisc
We have replaced the RAID controller.  Of course, we only replaced it the other day, so it is possible that the old one had already corrupted the drives.  Adaptec says that since the only thing connecting the controller and the drives is a SATA cable, they don't think that is possible.  But of course they don't want it to be on their end.

Yes, we are testing these drives separately.
 
LVL 69

Expert Comment

by:Callandor
I would suspect the RAID card, if all of a sudden all the drives are having problems.  The other possibility is a bad power supply.
 

Author Comment

by:fisc
Our motherboard firmware is actually one build behind, and of course we will flash it to update it. However, the notes on the new build don't list any fixes having to do with RAID, this card, or anything else in our system.
 
LVL 91

Expert Comment

by:nobus
Maybe you should list the replacements and actions you have already performed, to give us more insight.
 
LVL 91

Expert Comment

by:nobus
I guess the system is installed properly, with good grounding, well away from big electrical consumers like lifts and air conditioning. You can put a monitoring device on the AC line if in doubt.
 

Author Comment

by:fisc
Mid-August: Drive #3 failed.  We had lots of problems and the server wasn't booting.  We had to do a trust array, and eventually recreated the array in the same order, booted Windows again, and rebuilt to #5.

This week: Windows started freezing and the alarm sounded (normally that signals a failed drive, and the rebuild should be automatic; it should not freeze Windows, though).  On a manual restart it showed the array as optimal (?).  Eventually, after this had happened a few times, freezing after 20 minutes to an hour, Windows stopped booting and the array was marked as degraded.  Drive #5 was the bad drive according to Adaptec's BIOS.  We replaced drive #5.  Windows still wouldn't boot correctly, and we could only get into it if we did a trust array. We tried to get a backup, but the server would stop responding (mouse not even moving) before it completed. (We do have a relatively new backup, though.)

So #3 was replaced in August and was never in the array; it was the hot spare.  #5 was just replaced, and Adaptec's verify media utility said it failed (along with every other drive so far except #3).

Just ran Seagate's diagnostic test on one of the drives that Adaptec's verify media said had failed, and it PASSED Seagate's test.  Now, Seagate's test only took an hour, while Adaptec's took about 5 or 6 hours, so maybe it's not as thorough?

Adaptec's firmware and drivers are the most recent.  Seagate doesn't have any firmware updates for its SATA drives.  We have a temperature gauge on our case, and it always stays at a good temperature, between 27 and 32 degrees C.
 
LVL 69

Expert Comment

by:Callandor
It is still possible that the controller card has failed.
 
LVL 91

Expert Comment

by:nobus
Well, it is not an easy one to diagnose, as you are well aware. I suggest running the faulty drives for a night each, to have a more dependable test. However, from the way you describe the failures, and certainly if the drives come out of the test OK, I would suspect 1) the RAID controller or 2) the motherboard. If you do find faulty drives, it CAN be that you have run into a batch of bad drives, but that seems very unlikely.
A combination of two factors is also possible: some drive failure AND another faulty part (it can be anything, but you know the most likely ones). As this makes the troubleshooting extremely difficult, proceed with thoroughly tested parts only.
 

Author Comment

by:fisc
The server is in the same place that the company has had it for years with previous servers.  We started running this one in June.  Haven't had any other problems.
 

Author Comment

by:fisc
We are definitely going to run diagnostic tests on all NEW drives that we buy (probably two different ones).  We are considering buying a new RAID card or abandoning SATA for SCSI.  We'll probably replace the power supply today. We're even considering just scrapping this server.  Bottom line is the company is down!  And the worst part is, as you noted, we haven't even diagnosed what is causing the problem yet!  Your suggestions are very much appreciated, and we'll definitely look into them and test them.
 
LVL 15

Expert Comment

by:Cyber-Dude
It would be easy for you to just press that red button named 'Exact Diagnostics'...
hehehe

Some Background:
If you want to verify RAID controller problems, you need to know that a RAID controller transfers data in a much simpler manner than you may think. It manages how data is transferred and to which disk in the array.
In other words, a problematic RAID controller would have scrambled the whole RAID structure and eventually caused the total loss of the RAID configuration.

Diagnostics:
Well, first install the Adaptec Storage Manager, where you can verify proper system functionality.
Afterwards, you can view the disks via the 'DiskPart' utility within Windows and diagnose those disks. Follow these steps:
1. Run the DiskPart utility via: Start => type 'cmd' and press 'Enter' => type 'diskpart' and press 'Enter'.
2. Type 'list disk' to view all the disks installed on your OS.
3. Type 'select disk <Disk #>'.
4. Type 'detail disk' and press Enter. Verify that the OS sees that drive as it should see it.
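
Steps 1 to 4 above can also be run non-interactively. A minimal Python sketch, assuming a Windows machine and an elevated prompt: it writes the same commands to a text file and hands that file to diskpart's /s (script) switch. The helper names and the disk number are hypothetical. Keep in mind that behind a hardware RAID card Windows normally sees the whole array as one logical disk, so this shows the array as the OS sees it rather than the individual member drives.

```python
import subprocess
import tempfile

def build_diskpart_script(disk_numbers):
    """Return diskpart commands: 'list disk', then select/detail for each disk."""
    lines = ["list disk"]
    for number in disk_numbers:
        lines.append(f"select disk {number}")
        lines.append("detail disk")
    return "\n".join(lines) + "\n"

def run_diskpart(script_text):
    """Run diskpart against a script file and return its output (Windows only)."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(script_text)
        script_path = handle.name
    result = subprocess.run(
        ["diskpart", "/s", script_path], capture_output=True, text=True
    )
    return result.stdout

# Disk 0 is only an example; use the numbers that 'list disk' reports.
script = build_diskpart_script([0])
print(script)
# On the server itself: print(run_diskpart(script))
```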

Solution:
You can recover the RAID construction within Windows with the same DiskPart utility, using its 'repair' command.

Links:
More on 'DiskPart':
http://home.earthlink.net/~rlively/MANUALS/COMMANDS/D/DISKPART.HTM

Adaptec Storage Manager:
http://www.adaptec.com/worldwide/support/driverdetail.jsp?sess=no&language=English+US&cat=/Product/AAR-2810SA&filekey=Adaptec_Storage_Manager-Windows.exe

Hope that helped a bit...

Cyber
 

Author Comment

by:fisc
Thanks, but I guess the bottom line is the drives are failing.  One drive failed the Adaptec test but passed the Seagate test. We've been doing more tests, and for all the other drives that Adaptec's verify media test said failed, Seagate's test agreed.

So drives are failing. Why? I don't know. That's the central question.

I think we're going back to SCSI.
 
LVL 15

Expert Comment

by:Cyber-Dude
The simple answer would be:
Sometimes it takes a misconfiguration rather than a malfunction to shut down a site.

See, the drive configuration should be the same for all devices (i.e. block size, drive order, and so forth).

What's my point? Can you use Storage Manager to get more info on each drive?

Cyber
 

Expert Comment

by:Cameron888
First, check for power issues.  Are there any other drives (CD/DVD), or anything else drawing power?  You need 24 watts per drive on spin-up, per Seagate's spec.  Tally your motherboard's requirements, along with the requirements for cards etc., and see how the total compares to the Antec supply.
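
A rough sketch of that tally in Python. Only the 24 W per drive, the 6 drives, and the 550 W supply come from this thread. The other loads are made-up placeholders and should be replaced with the figures from the actual spec sheets. Total wattage is also not the whole story, since drive spin-up draws mostly from the 12 V rail, so the supply's 12 V rating is worth checking as well.

```python
SPINUP_WATTS_PER_DRIVE = 24  # per-drive spin-up figure quoted in this thread
DRIVE_COUNT = 6              # 5 drives in the array + 1 hot spare
PSU_WATTS = 550              # Antec 550W supply

# Hypothetical placeholder loads; take real values from each part's spec sheet.
other_loads_watts = {
    "motherboard and CPUs": 200,
    "RAID card": 25,
    "optical drive": 25,
    "fans": 15,
}

def spinup_budget(drive_count, watts_per_drive, other_loads, psu_watts):
    """Return (drive watts, total watts, headroom) with all drives spinning up at once."""
    drive_watts = drive_count * watts_per_drive
    total_watts = drive_watts + sum(other_loads.values())
    return drive_watts, total_watts, psu_watts - total_watts

drive_watts, total_watts, headroom = spinup_budget(
    DRIVE_COUNT, SPINUP_WATTS_PER_DRIVE, other_loads_watts, PSU_WATTS
)
print(f"drives at spin-up: {drive_watts} W")
print(f"estimated total:   {total_watts} W")
print(f"headroom:          {headroom} W")
```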

If you are OK on power, all signs point to the SATA controller being the cause.  The drives are likely fine if they pass the Seagate test.

In dealing with ATA arrays, from Promise cards to Nexsan FC rackmounts, I have found them far less stable (and far more flaky on rebuilds!) than SCSI/FC RAID setups.

Sure, you may have to bite the bullet on adapter and drive cost, and deal with less drive space, but we seem to be talking about something that is mission critical.  SCSI RAID has been on the block a lot longer, and that makes me inclined to prefer it for critical services over SATA RAID.  Granted, you can't beat the cost/GByte on SATA!
 

Author Comment

by:fisc
I don't know what the proper etiquette is on EE: whether I should award points for helpful hints even though none seemed to directly find the answer.

Here's what we did. We first bought a new SATA RAID card and did a trust array with the drives.  It still said the array was degraded.  Now, could the old SATA controller have caused the drives to fail?  Not sure.  That's hard to test, but I don't suspect so.  We decided that it was a total loss at this point, so we bought a SCSI controller and drives, built an array, and tried to install Windows Server.  The installer would get to a certain point and then stall (at "Setup is Starting Windows"). Some things online said this could be a hardware issue (yes, we did install the SCSI RAID driver).  Anyway, we sent it off to the place we bought it.  It wouldn't work for them either.  They took out every part and tested it, and they all seemed to work individually.  Then they put it back together and it worked.

So, what does that tell us?  I don't know.  Were the two issues (failing drives and the Windows installer freezing) related?  Not sure.  But perhaps there was a loose connection in there that was resolved when they took it apart and rebuilt it?  Could that have been causing the hardware failures?  I don't know!  But at least for now our system is working with our SCSI RAID.
 

Author Comment

by:fisc
I think (hopefully finally the answer this time) the whole situation revolved around the motherboard.  Most recently, the screen went blank and the server would not respond.  We manually rebooted and all we got was a long beep from the motherboard, a pause, and another long beep.  Nothing ever came up on the screen.  I don't think there is anything wrong with the memory.

But we replaced the motherboard and memory (we had to replace the memory because the new motherboard didn't support ECC memory), and after some trouble with drivers and getting Windows to like the new motherboard, we are sitting pretty good now.  Hopefully no problems will recur. They shouldn't, as we have replaced everything but the floppy drive, fans, and case at this point!  At least it didn't kill our hard drives (brand new SCSIs) this time.
 

Author Comment

by:fisc
Just an update: all is running great with the new motherboard, and it's been over two months.  So if you have this problem, try a new motherboard!
 

Accepted Solution

by:
modulo earned 0 total points
PAQed with points refunded (500)

modulo
Community Support Moderator