Unpredictable RAID behavior
Posted on 2014-04-08
Hello fellow geeks,
I've run the gamut of my own troubleshooting and am appealing to higher minds than my own.
Allow me to lay out the scene....
I built a server back in August with two 500 GB drives in one RAID 1 array and two 2 TB drives in another.
The two 500 GB drives are for C: and the OS, Windows SBS 2011 Standard.
The two 2 TB drives are for all the stores, data, etc... pretty straightforward.
The server motherboard has the integrated LSI SAS 2008 MegaRAID chipset and came with dual ports and cables to support up to 8 HDDs on its RAID platform.
The original installation and move into production went flawlessly, and it's been in service since roughly August.
Well, a couple of weeks ago I got a notice that the RAID had degraded, so I ran a consistency check in the MegaRAID software, and it showed DRIVE 1 in the first RAID array as failed.
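(For anyone wanting to check the same thing outside the GUI: if the MegaCLI command-line utility is installed alongside the MegaRAID Storage Manager (I'm assuming the stock LSI tool here), the array and drive states can be dumped with:

    MegaCli -LDInfo -Lall -aALL    (state of each array: Optimal vs. Degraded)
    MegaCli -PDList -aALL          (per-drive firmware state, media error and predictive failure counts)

The -aALL just means "all adapters"; on this board there's only the one controller.)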
So I ordered an EXACT replacement drive (in this case it's Seagate's enterprise Barracuda drive with the five-year warranty).
The drive arrived, and I went down to the customer.
I shut down the server, swapped out the drive it flagged as bad, rebooted into the SAS utility in the BIOS, and saw that it was rebuilding the new drive into the RAID array....
After it finished rebuilding, I rebooted into the OS and got a buttload of MegaRAID alerts saying the array had degraded. So I ran the consistency check again, and it showed the SAME drive as failed and the array degraded again.
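(Side note: the controller's own event log should say exactly why it keeps failing that drive. With MegaCLI, assuming it's installed, the log can be dumped to a file for reading; the filename here is just an example:

    MegaCli -AdpEventLog -GetEvents -f raid-events.log -aALL

)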
I'm thinking at this point that MAYBE the drive labels are wrong (I've read this happens sometimes with the little slot stickers), so I decided to put the newly removed drive into the slot that had NOT just been replaced.
To clarify... at THIS point, I've replaced drive 1 with a brand-new replacement, and it showed as rebuilt successfully. But then I replaced drive 0 with the recently removed drive... still with me?
Now the RAID utility will not recognize the drive at all! It shows NO DRIVE in the slot. I tried reseating cables, etc... no change. At this point I'm kinda freaking out, because this drive had JUST been in service, working more or less fine.
So I take the drive back out and put the ORIGINAL drive 0 back in place...
This time the utility begins and completes a rebuild of the array without hesitation.
Again, on reboot: massive alerts and a degraded RAID.
So I take the drive that would not show up in the server's RAID utility back to my shop and stick it in my little external desktop dock... boom. It shows up with files and data intact. CrystalDiskInfo says the drive is perfectly fine!
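(As a second opinion alongside CrystalDiskInfo, smartmontools can dump the raw SMART attributes; the device name below is just an example for whatever the drive enumerates as in the dock:

    smartctl -A /dev/sda

The attributes worth eyeballing are 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector) for actual drive health, and 199 (UDMA_CRC_Error_Count), which tends to point at cabling or backplane problems rather than the drive itself.)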
This is my very first time swapping out a bad drive in a RAID 1 array, and the first swap behaved EXACTLY like I expected it to, minus the alerts on reboot.
I did NOT test the new drive before putting it into service. My initial thought is that I'm dealing with another failing drive (likely the one I just purchased was DOA), but I didn't want to pull it out and find out after the nightmare I'd just had...
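(In hindsight, a long SMART self-test would have ruled out a DOA replacement before it went into the array. With smartmontools that's roughly this, device name assumed:

    smartctl -t long /dev/sdb        (starts the extended self-test; takes hours on a big drive)
    smartctl -l selftest /dev/sdb    (shows the result once it completes)

)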
Has anyone had a similar experience?
I'm trying VERY hard not to break my OS so I can keep things running well without having to rely on restores from backups.
I do have backups from the Microsoft utility, but they're full system restores only, so I'd prefer to use them only as a very last resort.
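(For what it's worth, those backup sets can at least be inventoried from an elevated command prompt before anything drastic is attempted; wbadmin is the CLI side of the same Microsoft utility:

    wbadmin get versions    (lists every backup set and what each one can recover)

)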
Any advice is appreciated.