Solved

Hard Drive Failures in RAID System

Posted on 2004-10-14
Last Modified: 2008-02-26
This is sort of a continuation of a previous post I had.

Problems have recurred.  We are running Adaptec's verify media on all of our disks.  So far all of them have FAILED except for one, which is now 98% complete.  That one was our hot spare and was never actually used in the array.  One of the drives that failed had just been replaced into the array.

Motherboard: Asus PU-DLS, firmware build 1006
RAID card: Adaptec 2810SA, latest firmware
Hard drives: Seagate Barracuda 7200 120 GB ST320026AS
Power Supply: Antec 550W
RAID configuration: RAID 5, 6 drives (5 in array, 1 hot spare)

We have a UPS system.  Basically, problems recurred: Adaptec said a hard drive had failed, Windows would completely stop responding for no apparent reason (nothing in the event log), and rebuilds to a new drive failed.  Now we have started verifying media on each drive, and all but this one have failed so far (4 total failures, with two more to test, which includes the two drives we previously pulled and replaced).
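
Some background on why a rebuild can fail when more than one drive has unreadable sectors: in RAID 5, each stripe holds one parity block computed as the XOR of its data blocks, so a missing block can only be recomputed if every surviving block in that stripe is readable. The sketch below illustrates this in Python under simplified assumptions (tiny 4-byte blocks, a single stripe, hypothetical helper names). It is an illustration of the general technique, not a description of how the Adaptec firmware works.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid5_usable_gb(drives_in_array, drive_gb):
    """Usable capacity: one drive's worth of space goes to parity."""
    return (drives_in_array - 1) * drive_gb


def rebuild_block(surviving_blocks):
    """Recompute the missing block of one stripe from the survivors.

    A value of None stands for a block that could not be read (bad sector).
    """
    if any(block is None for block in surviving_blocks):
        raise IOError("unreadable block on a surviving drive; stripe cannot be rebuilt")
    return xor_blocks(surviving_blocks)


# The array in this thread: 5 drives of 120 GB (the hot spare adds no capacity).
usable = raid5_usable_gb(5, 120)

# One stripe across 5 drives: 4 data blocks plus 1 parity block (made-up contents).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = data + [xor_blocks(data)]

# The drive at index 2 fails; the other four blocks are enough to recover it.
survivors = stripe[:2] + stripe[3:]
recovered = rebuild_block(survivors)
print(usable, recovered == stripe[2])
```

This is why verifying every member drive matters before a rebuild: a single unreadable sector on a surviving drive is enough to stop the affected stripe from being reconstructed.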

Looks like we are going hard drive shopping.  But the question is:  What is causing these drives to fail??? We don't want to replace them only to have it happen again.  Could they all have gotten bad sectors due to a specific event?  Or is something in the hardware causing this?

I realize this is a crazy complex question, but we are going crazy here.  We would appreciate greatly your help.
Question by:fisc
20 Comments
 
LVL 92

Expert Comment

by:nobus
ID: 12307365
I suppose you got errors on the drives and tested them separately (I mean, not in an array). If they fail separately, the trouble most probably comes from the drives themselves. It is weird, though, to have so many drives going faulty.
This leads to the second conclusion: maybe it is a board causing this (RAID controller, motherboard).
Not knowing how much troubleshooting you have already done, I suggest replacing the board, if not done already.
I suppose you did the necessary lookup for updates (hardware, software, firmware), incompatibilities, etc.


 

Author Comment

by:fisc
ID: 12307543
We have replaced the RAID controller.  Of course, we only replaced this controller the other day, so it is possible that the old one had previously corrupted the drives.  Adaptec says that since the only thing connecting the controller and the drives is a SATA cable, they don't think that is possible.  But of course they don't want it to be on their end.

Yes, we are testing these drives separately.
 
LVL 69

Expert Comment

by:Callandor
ID: 12307559
I would suspect the RAID card, if all of a sudden all the drives are having problems.  The other possibility is a bad power supply.
 

Author Comment

by:fisc
ID: 12307576
Our Motherboard firmware is actually one build # behind, and of course we will flash it to update it. However, the notes on the new build don't have any fixes having to do with RAID, this card, or anything in our system.
 
LVL 92

Expert Comment

by:nobus
ID: 12307584
Maybe you should list the replacements and actions you performed already, to give us more insight
 
LVL 92

Expert Comment

by:nobus
ID: 12307611
I guess the system is installed properly, with good grounding, well away from big electrical consumers like lifts and air conditioning. You can put a monitoring device on the AC line if in doubt.
 

Author Comment

by:fisc
ID: 12307678
Mid-August: Drive #3 failed, we had lots of problems and it wasn't booting.  We had to do a trust array, and eventually recreated the array in the same order, booted Windows again, and rebuilt to #5.

This Week: Windows started freezing and the alarm sounded (normally this signals a failed drive, and the rebuild should be automatic; it should not freeze Windows, though).  On manual restart it showed the array as optimal (?).  After this happened a few times, with freezes coming after 20 minutes to an hour, Windows stopped booting and the array was marked as degraded.  Drive #5 was the bad drive according to Adaptec's BIOS.  We replaced drive #5.  Windows still wouldn't boot correctly, and we could only get into it if we did a trust array.  We tried to get a backup, but it would stop responding (mouse not even moving) before it completed. (We do have a relatively new backup, though.)

So #3 was replaced in August and was never in the array; it was the hot spare.  #5 was just replaced, and Adaptec's verify media utility said it failed (along with every other drive so far except #3).

Just ran Seagate's diagnostic test on one of the drives that Adaptec's verify media said failed, and it PASSED Seagate's test.  Seagate's test only took an hour, while Adaptec's took about 5 or 6 hours, so maybe it's not as thorough???

Adaptec's firmware and drivers are the most recent.  Seagate doesn't have any firmware updates for its SATA drives.  We have a temperature gauge on our case, and it always stays at a good temperature, between 27 and 32 degrees C.
 
LVL 69

Expert Comment

by:Callandor
ID: 12307709
It is still possible that the controller card has failed.
 
LVL 92

Expert Comment

by:nobus
ID: 12307774
Well, it is not an easy one to diagnose, as you are well aware. I suggest running the test on the faulty drives for a night each, to have a more dependable result. However, given the way you describe the failures, and certainly if the drives come out of the test OK, I would suspect 1 - RAID controller, or 2 - motherboard. If you do find faulty drives, it CAN be that you have run into a batch of bad drives, but that seems very unlikely.
A combination of 2 factors is also possible: some drive failure AND another faulty part (it can be anything, but you know the most likely ones). As this makes the troubleshooting extremely difficult, proceed with thoroughly tested parts only.
 

Author Comment

by:fisc
ID: 12307802
The server is in the same place that the company has had it for years with previous servers.  We started running this one in June.  Haven't had any other problems.
 

Author Comment

by:fisc
ID: 12307838
We are definitely going to run diagnostic tests on all NEW drives that we buy (probably two different ones).  We are considering buying a new RAID card or abandoning SATA for SCSI.  We'll probably replace the power supply today. We're even considering just scrapping this server.  Bottom line is the company is down!  And the worst part is, as you noted, we haven't even diagnosed what is causing the problem yet!!  Your suggestions are very much appreciated and we'll definitely look into and test them.
 
LVL 15

Expert Comment

by:Cyber-Dude
ID: 12308968
It would be easy for you to just press that red button named 'Exact Diagnostics'...
hehehe

Some Background:
If you want to verify RAID controller problems, you need to know that a RAID controller transfers data in a much simpler manner than you may think. It manages how data is transferred and to which disk in the array.
In other words, a faulty RAID controller would have scrambled the whole RAID structure and eventually caused the total loss of the RAID configuration.

Diagnostics:
Well, first install Adaptec Storage Manager, where you can verify proper system functionality.
Afterwards, you can view the disks with the 'DiskPart' utility within Windows and diagnose those disks. Follow these steps:
1. Run DiskPart utility via: Start => type 'cmd' and press 'Enter' => type 'diskpart' and press 'Enter'.
2. Type 'list disk' to view all the disks installed on your OS.
3. Type 'select disk <Disk #>'.
4. Type 'Detail disk' and press Enter. Verify that the OS sees that drive as it should see it.
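
The steps above can also be run non-interactively. The following is a minimal Python sketch, assuming Python is available on the server and the prompt is elevated: it writes the same commands to a script file and passes it to DiskPart with its `/s` option. The function names and the disk numbers (0 and 1) are placeholders, not values from this thread. Keep in mind that behind a hardware RAID controller, Windows normally sees the logical drive the controller exposes rather than the individual member disks.

```python
import subprocess
import sys
import tempfile


def build_diskpart_script(disk_numbers):
    """Return DiskPart script text: list all disks, then show details for each one."""
    lines = ["list disk"]
    for number in disk_numbers:
        lines.append(f"select disk {number}")
        lines.append("detail disk")
    return "\n".join(lines) + "\n"


def run_diskpart(script_text):
    """Run DiskPart with a script file (Windows only, requires administrator rights)."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(script_text)
        script_path = handle.name
    result = subprocess.run(
        ["diskpart", "/s", script_path], capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    script = build_diskpart_script([0, 1])  # example disk numbers only
    print(script)
    if sys.platform == "win32":
        print(run_diskpart(script))
```

Only read-only commands ('list disk', 'select disk', 'detail disk') are generated here, so the script does not modify any disk.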

Solution:
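You can recover the RAID construction within Windows with the same DiskPart utility, using the 'repair' command (note that 'repair' applies to Windows software RAID-5 volumes on dynamic disks, not to an array managed by a hardware controller).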

Links:
More on 'DiskPart':
http://home.earthlink.net/~rlively/MANUALS/COMMANDS/D/DISKPART.HTM

Adaptec Storage Manager:
http://www.adaptec.com/worldwide/support/driverdetail.jsp?sess=no&language=English+US&cat=/Product/AAR-2810SA&filekey=Adaptec_Storage_Manager-Windows.exe

Hope that helped a bit...

Cyber
 

Author Comment

by:fisc
ID: 12309371
Thanks, but I guess the bottom line is the drives are failing.  The one failed the Adaptec test but passed the Seagate... we've been doing more tests and all the rest that Adaptec's verify media test said failed, Seagate agreed.

So drives are failing. Why? I don't know. That's the central question.

I think we're going back to SCSI
 
LVL 15

Expert Comment

by:Cyber-Dude
ID: 12309615
The simple answer would be:
Sometimes it takes a misconfiguration rather than a malfunction to shut down a site.

See, the drive configuration for all devices should be the same (i.e. block size, drive order and so forth).

What's my point? Can you use Storage Manager to get more info on each drive?

Cyber
 

Expert Comment

by:Cameron888
ID: 12321933
First, check for power issues.  Any other drives (CD/DVD), or anything else drawing power?  You need 24 watts per drive on spin-up, per Seagate's spec.  Tally your mobo's requirements, along with the requirements for cards, etc., and see how the total compares to the Antec supply.

If you are OK on power, all signs point to the SATA controller being the cause.  The drives are likely fine if they pass the Seagate test.

I have found that in dealing with ATA arrays -- from Promise Cards to Nexsan FC rackmounts -- they were far less stable (and far more flaky on rebuilds!) -- than SCSI/FC RAID setups.

Sure, you may have to bite the bullet on adapter and drive cost, and deal with less drive space, but we seem to be talking about something that is mission critical.  SCSI RAID has been on the block a lot longer, and that makes me inclined to prefer it for critical services over SATA RAID.  Granted, you can't beat the cost/GByte on SATA!
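
The power tally suggested at the top of this comment can be sketched in a few lines of Python. Only the 24 W per drive spin-up figure, the drive count (6), and the 550 W supply come from this thread; the other wattages below are placeholder assumptions and should be replaced with figures from the actual data sheets.

```python
# Rough spin-up power tally for the system described in this thread.
SPINUP_WATTS_PER_DRIVE = 24  # figure quoted in the comment above
DRIVE_COUNT = 6              # 5 drives in the array + 1 hot spare
PSU_RATED_WATTS = 550        # Antec 550W

# Hypothetical loads: assumptions for illustration, not measured values.
other_loads_watts = {
    "motherboard_and_cpus": 200,
    "raid_card": 25,
    "optical_drive_and_fans": 30,
}


def spinup_budget(drive_count, watts_per_drive, other_loads, psu_watts):
    """Return (total_draw, headroom) in watts if all drives spin up at once."""
    total_draw = drive_count * watts_per_drive + sum(other_loads.values())
    return total_draw, psu_watts - total_draw


total, headroom = spinup_budget(
    DRIVE_COUNT, SPINUP_WATTS_PER_DRIVE, other_loads_watts, PSU_RATED_WATTS
)
print(f"Estimated draw at spin-up: {total} W, headroom: {headroom} W")
```

A positive headroom on the total rating is necessary but not sufficient: drive spin-up current is drawn mostly from the 12 V rail, so the supply's per-rail limits are worth checking as well.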
 

Author Comment

by:fisc
ID: 12447811
I don't know what the proper etiquette is on EE... whether I should award points for helpful hints although none seemed to directly find the answer.

Here's what we did... we first bought a new SATA RAID card and did a trust array with the drives.  It still said the array was degraded.  Now could the old SATA controller have caused the drives to fail?  Not sure.  Hard to test that, but I don't suspect so.  We decided that it was a total loss at this point so we bought a SCSI controller and drives, built an array and tried to install Windows Server.  The Windows Server would get to a certain point and then stall (at the point of "Setup is Starting Windows")... some things online said this could be hardware issues (yes, we did install the SCSI RAID driver).  Anyway, we sent it off to the place we bought it.  It wouldn't work for them either.  They took out every part, tested it, they all seemed to work individually.  Then they put it back together and it worked.

So... what does that tell us?  I don't know.  Were the two issues (failing drives and Windows installer freezing) related?  Not sure.  But perhaps there was some loose connection in there, and taking it apart and rebuilding it resolved that?  Could that have been causing the hardware failures?  I don't know!  But at least for now our system is working with our SCSI RAID.
 

Author Comment

by:fisc
ID: 12614288
I think (hopefully finally the answer this time) the whole situation revolved around the motherboard.  Most recently, the screen was blank... would not respond.  We manually rebooted and all we got was a long beep from the motherboard, a pause, and another long beep.  Nothing would ever come up on the screen.  I don't think there is anything wrong with the memory.

But we replaced the motherboard and memory (had to replace the memory because the new motherboard didn't support ECC memory) and after some trouble with drivers and getting Windows to like the new motherboard we are sitting pretty good now.  Hopefully no problems will reoccur...they shouldn't as we have replaced everything but the floppy drive, fans, and case at this point!  At least it didn't kill our hard drives (brand new SCSIs) this time.
 

Author Comment

by:fisc
ID: 13215296
Just an update... all is running great with the new motherboard and it's been over two months.  So if you have this problem, try a new motherboard!
 

Accepted Solution

by:
modulo earned 0 total points
ID: 13515304
PAQed with points refunded (500)

modulo
Community Support Moderator