• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 562
  • Last Modified:

Will RAID1 keep running uninterrupted while one HD is broken?

Hi Experts,

We have a DELL PE2900 server box running Win Server 2008 with a hardware RAID1 (two hard drives mirroring). My question to you is: if one of the hard drives is broken -- whether Disk0 or Disk1 -- would the server be interrupted or shut down? (If not, I can wait till off hours to replace the broken hard drive.)

Please help. Thanks.
Asked by: Castlewood

2 Solutions
 
xDUCKx commented:
Yep, you can wait indefinitely until the drive is replaced.  When the drive is replaced you may notice some degradation in performance as the array is rebuilt...but your computer will be happy and chug along without a care in the world.  :-)
 
David commented:
Can you wait? Yes.  But realize you could have another drive failure, or just a bad block, and then you lose data.

You also won't be able to complete a rebuild if there are any bad blocks on the surviving disk.  My advice .. kick off the rebuild ASAP.  (And make a fresh backup copy of your most important files before you begin the rebuild, you never know.)
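For that "backup first" step, even a checksum-verified copy of the critical files is better than nothing. A minimal cross-platform sketch (the file list and destination are placeholders; use whatever backup tool you actually trust for the real job):

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_and_verify(files, dest_dir):
    """Copy each file to dest_dir and confirm the copy's checksum
    matches the original. Returns a list of (source, ok) pairs."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    results = []
    for src in map(Path, files):
        target = dest / src.name
        shutil.copy2(src, target)  # copy2 preserves timestamps too
        results.append((src, sha256_of(src) == sha256_of(target)))
    return results
```

The verify pass matters here: a degraded mirror is exactly the situation where a silently bad copy would go unnoticed until you need it.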
 
Castlewood (Author) commented:
Ok, that is for hardware mirroring.
How about software (OS) mirroring? Would the server still run uninterrupted while one of the hard drives is broken?

 
David commented:
Win2K8 RAID is designed to work the same way.  So the best I can say is that "it is supposed to work the same way".

If the HDDs are SCSI, however, they will be on the same bus, so a hot swap will generate a bus reset.   Some I/O operations will fail, and the O/S is then "supposed" to resend the command.  Such a thing is not infallible.  SATA disks don't share a parallel bus, so they won't bother each other.
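To make the "resend the command" idea concrete, here is a rough sketch of retry-on-transient-failure logic. This is roughly what a driver does internally after a bus reset, not something Windows exposes for you to write; the attempt count and delay are illustrative only:

```python
import time

def with_retries(op, attempts=3, delay=0.1):
    """Re-issue an operation that may fail transiently, the way a
    driver is supposed to resend a command after a bus reset."""
    for attempt in range(attempts):
        try:
            return op()
        except OSError:
            if attempt == attempts - 1:
                raise  # not infallible: give up after the last try
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```

The "not infallible" part is the last branch: if every retry fails, the error propagates up, which is exactly where an application can still see an I/O failure during a hot swap.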
 
andyalder (Saggar makers bottom knocker) commented:
>would the server be interrupted or shut down?

Unfortunately yes, sometimes by a well-meaning but misguided administrator. There are two schools of thought as to what to do in the case of a single disk failure in a mirror: shut it down and call a DR expert to fix it, or leave it on and increase the backup schedule in case the other disk fails before the replacement has been fitted and the mirror rebuilt.

Hard to tell from EE questions, or even from paid-for technical support records, which approach is best, since nobody bothers to report cases of 'disk failed and was replaced successfully a day later'. Certainly no DR expert could tell from real-world experience, because they don't have a list of the people who didn't phone them up.
 
David commented:
Actually, best practice is to check the health of the surviving disk (while it is online) to assess the risk of data loss.  If there is little or no indication of recent unrecoverable errors, or even recoverable errors, and there are plenty of reserved blocks that can be remapped, then the likelihood is good that the disk will survive the stress of a rebuild.

If, however, that disk is not long for this world, best practice is to shut down unessential services and back up the most important files first.  Leave it all on (a power cycle is also stressful), and then the decision becomes more complicated.  Sometimes you can't back up efficiently while the O/S is online, so you have to fall back to full block-level I/O.

A rebuild is high I/O across 100% of the blocks.  It is more stressful than a backup, and it touches every block.  So statistically speaking, unless you know a lot more about the health of the hardware, the safest thing to do first is a backup.
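As an illustration of that health check: if the controller passes SMART data through, smartmontools will show the attributes I mentioned. A rough sketch that parses `smartctl -A` text output; the attribute names are standard SMART, but the risk thresholds here are illustrative, not vendor figures:

```python
# Early-warning attributes discussed above: remapped (reallocated)
# sectors, and sectors pending remap or uncorrectable offline.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable"}

def parse_smart_attributes(smartctl_output: str) -> dict:
    """Pull raw values for the watched attributes out of the
    `smartctl -A /dev/sdX` attribute table (10-column rows)."""
    values = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        if len(parts) >= 10 and parts[1] in WATCHED:
            values[parts[1]] = int(parts[9])
    return values

def rebuild_risk(attrs: dict) -> str:
    """Illustrative heuristic only: any pending or uncorrectable
    sector means a rebuild may hit an unreadable block."""
    if attrs.get("Current_Pending_Sector", 0) or \
       attrs.get("Offline_Uncorrectable", 0):
        return "high"
    if attrs.get("Reallocated_Sector_Ct", 0) > 0:
        return "elevated"
    return "low"
```

On a hardware controller you would normally get this from the controller's own management tool (e.g. Dell OpenManage on a PE2900) rather than raw smartctl, but the attributes to watch are the same.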
 
David commented:
But Andy makes an excellent point. You need a pro to get the full scoop.

When in doubt, get the checkbook out ;)
 
andyalder (Saggar makers bottom knocker) commented:
Since RAID is there to protect you against downtime rather than against data loss, if it does fail to rebuild you just restore the last backup. That shouldn't need a pro if the backup regime is set up right in the first place.
 
David commented:
RAID also protects against data corruption.  Bad blocks are much more likely than a drive failure.
 
andyalder (Saggar makers bottom knocker) commented:
A bad block *is* data loss; it only becomes data corruption if you run a program, such as chkdsk, that re-writes it with zeros and marks it as good. If a bad block causes a read error on a JBOD you have to delete the file and restore it from backup; RAID protects you from the downtime this takes. You should never rely on RAID to protect you from data loss, that's what backups are for.

Since you mention data corruption, the most common source of that (if we ignore misuse of data-repairing tools) is a virus, and RAID will never protect you from that.
 
David commented:
Andy, a bad block is not data loss unless you need that data and there is no redundancy.  If you are rebuilding a RAID and block 123 is unreadable, but that block is unused by the file system, then it is not data loss .. because it isn't data.

The RAID controller is blissfully unaware of what blocks contain "data", and which don't.

Not only that but some filesystems such as ZFS create additional redundant data to protect against data loss even on a single disk.  Or one can create a RAID1 array using a single drive to provide protection against data loss in event of bad blocks.

Data corruption can take many forms that have nothing to do with rogue applications such as viruses.  Vibration, or drives and firmware that aren't set up with proper error-recovery and retry settings, are much more common causes in the real world.

The world of RAID is not as simplistic as one might think.
 
andyalder (Saggar makers bottom knocker) commented:
>Andy, a bad block is not data loss unless you need that data

Nor is it corruption.

>Or one can create a RAID1 array using a single drive to provide protection against data loss in event of bad blocks.

Indeed, though it's a pretty bad idea since it thrashes hell out of the disk with all those long seeks.

>Data corruption can take many forms...
Indeed, that's why one does not rely on RAID but backs up properly.

You may think it is simple semantics, but I still say RAID only protects against downtime, not data loss or corruption. It saves you having to restore backups, which are (or at least should be) the first line of defence against data loss or corruption. If you want to rely on RAID to protect you against data loss or corruption that's your own lookout, but I advise using backup software.
 
David commented:
It is corruption if the two blocks are supposed to contain the same information, but it is not data loss if that particular block is not in use by the O/S.

Andy, I have no desire to get into a debate. I develop and test RAID controller hardware and firmware for a living, and have access to information available under non-disclosure to developers.  You don't.

As for the comment about thrashing, it isn't my fault that you don't know how to do it properly.
 
andyalder (Saggar makers bottom knocker) commented:
I do have access to information under NDAs, thank you, but it's totally irrelevant to the question, which is about downtime while replacing a failed disk. You only post it to prove your todger is bigger than anyone else's.

The server is hot-swap, so the disk should be swapped as soon as possible and the server should not be shut down to do it. If Castlewood is in any doubt about that, they can phone Dell support and ask.
 
David commented:
One does not just swap out a bad drive if one has the ability to determine the health of the surviving disks and weigh the risk of drive failure during the rebuild against the risk of additional loss during the time it takes to do a backup.  A rebuild can push a critical array to one that goes offline.

Blindly kicking off a rebuild no matter what can't possibly be the right answer. Common sense says that if you have mission-critical files that aren't backed up, you back them up while the system is online.

Then you rebuild.  If you have the ability to look at a controller log to assess the health of the array, or have software which can test the disks behind the array, then you base your course of action on the health of the array.

As for thrashing, it is load-dependent and not an absolute. Whether you consider me more qualified and knowledgeable in this area than you, or less, I don't care. My answers speak for themselves.

As for having NDA access, good for you.  I have access to controller / RAID / firmware internals to write test software and RAID code to do my job in this very field of expertise.   Do you?
 
andyalder (Saggar makers bottom knocker) commented:
Well obviously if you can assess the situation it is worth doing so; that wasn't my point, and anyway the health is generally monitored all the time through the controller reading the SMART data from the disks, so one shouldn't wait for a disk failure before checking.

It was that "Unfortunately [the machine is] sometimes [shut down] by a well-meaning but misguided administrator."

Would you not agree with that? I see it all the time, and in some instances it confuses the controller.
