• Status: Solved

Are drives in a Failed state a lost cause, or is there hope?

Hello all:

   I just had a server go down.  It has a RAID 10 with a hot spare.  There was a bad storm and the UPS backup didn't appear to do its job - or something.  Anyway, the controller is an LSI MegaRAID 300 SATA 8xLP.  I had 5 x 500 GB Seagate SATA drives, 4 in a RAID 10, with one hot spare.  Currently, I show:

Port 0: A1: online 426837 MB
Port 1: A1: online: not responding
Port 3: A2: failed: not responding
Port 4: A2: failed: 476837 MB

I haven't done anything tonight - I guess I want tech support to hold my hand...  Anyway, would a rescan or something possibly reactivate these drives enough for me to boot the server?  It's interesting that these drives show as failed.  Out of the 15 or so computers here, the only ones to have problems during the storm were the drives in the server.  (I assume port 2 was in use as well and the drive in port 4 jumped in to take its place?)

Also, could this be caused by a bad controller?  When adding batteries to the UPS, the technician shorted out the UPS and the server went down hard.  Ever since then, a cold reboot would pop up a question from the LSI RAID controller asking for configuration information.  Selecting the configuration from the disks allowed the server to reboot just fine.  (A replacement controller has been ordered.)  But now, I have beeping and failed drives.  Any chance these drives can be recovered (or at least one of the drives in A2)?  (We have made some backups, but haven't been that diligent, so I hope we haven't lost it all...)  Comments and suggestions welcome.

1 Solution
The only way to tell is to test the drives. Test them individually, one by one, in another PC without RAID, and after each test put the drive back in its original slot in the server. If possible, leave the server off while you do this. With RAID 10 you have a good chance of getting back up with minimal effort anyway. Use the manufacturer's drive-test utility and run only non-destructive tests. If a drive passes, it is probably still OK; if not, replace the defective drive(s) with new ones. If the array still doesn't rebuild after that, replace the SATA cables in the server, and use only high-quality cables. If that doesn't help either, suspect the controller: a firmware upgrade may fix it, but replace it if necessary.
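The troubleshooting order above (drives first, then cables, then controller) can be sketched as a simple decision function. This is only an illustration of the sequence; the function and parameter names are made up for the example and don't come from any vendor tool:

```python
def next_step(drives_pass, array_rebuilds, cables_replaced):
    """Troubleshooting order from the answer: drives, then cables, then controller."""
    if not drives_pass:
        return "replace the drive(s) that failed the non-destructive test"
    if array_rebuilds:
        return "done - array is healthy"
    if not cables_replaced:
        return "replace the SATA cables with high-quality ones"
    return "suspect the controller: try a firmware upgrade, replace if necessary"
```

The point of the ordering is cost: non-destructive drive tests and cable swaps are cheap and risk-free, so they come before touching the controller.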
jhuntii (Author) commented:
Well, I got tech support on the line and we checked the drives for errors using the LSI MegaRAID BIOS utility.  They showed no errors - so we forced them online and they stayed.  He said that generally you don't want to do that - you want to bring one in and rebuild the other.  But since both had failed, there wasn't any way to tell which to use as the master, I guess.  Anyway, the story still doesn't end that happily.  Windows started to boot, but then died.  It turns out that the data still is not there in the second mirrored set.  My guess is that one drive failed, the hot spare took over and began to rebuild, then either that drive or the original failed, and then the final one fell out of the array as well.  It may have something to do with the drives - Seagate ST3500641AS - which I think are not server-grade drives, and may not report errors or status back to the controller fast enough, so the controller drops them out of the array, triggering a rebuild onto the hot spare.  On second thought, the main problem was probably the power.  A bigger UPS is coming...  Thanks for your help and quick response.
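For anyone else reading: the outcome above follows directly from how RAID 10 works. The four drives form two mirrored pairs (A1 and A2) striped together, and the array survives only as long as every pair keeps at least one working member. A minimal sketch of that rule, with the pair states taken from the port listing in the question (one A1 drive down, both A2 drives down):

```python
def raid10_alive(pairs):
    """RAID 10 survives iff every mirrored pair has at least one healthy drive."""
    return all(any(drive_ok for drive_ok in pair) for pair in pairs)

# (healthy, failed) per pair: A1 lost one drive, A2 lost both
state = [(True, False), (False, False)]
print(raid10_alive(state))  # False - both halves of the A2 mirror are gone
```

This is why losing two drives in RAID 10 is survivable only if they are in different mirror pairs; here both A2 members dropped out, so the stripe set was unrecoverable regardless of the A1 side.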

