Solved

Replacement drive for failed drive in RAID-5 - shutdown and replace or hot-swap replace? - HP DL 380 G3

Posted on 2010-11-26
Last Modified: 2012-05-10
This morning I discovered that our Windows 2003 Server (which is also our Active Directory controller) had a failed drive. There are five 36.8 GB drives configured in a RAID-5 array. I have a replacement drive arriving within the next four hours.

Windows boots as far as the login screen, then sits at a gray screen, and no users can log in on their workstations or access the Exchange server. A single failed drive should only slow the server down, but it seems to be doing more than that, and I'm concerned.

The HP tech recommended that I replace the drive while the machine is on, then wait for it to rebuild (~9 hours), and then see if it starts working.

This doesn't make sense to me. It makes more sense to me to hard shut down the server, replace the drive, and then boot and let the rebuild run while the Windows services aren't trying to function off the failed drive.

It also makes me think that I will get my AD controller functioning sooner this way.

This morning I restarted the server remotely with a hard reset through the iLO controller (not knowing a drive had failed). In the first few minutes after boot, the AD and file sharing services were working; then they stopped, and I couldn't log in via the login window.

So, my guess is that the drive is reporting partial failure to the array controller and continuing to try to function, but in reality it is totally not working and the system would function better if the drive was out of the equation.

Following my logic or am I nuts and ignorant? :)

Thanks.

P.S. In another worst case scenario, another drive has failed and it isn't being shown on the drive lights. I can't access the Insight Diagnostics remotely.
Question by:sweetseater
9 Comments
 
LVL 76

Accepted Solution

by:
Alan Hardisty earned 500 total points
ID: 34219470
The HP DL380 G3 has hot-swap drives, so you can happily replace a drive with it still running.

As for the grey screen - sounds a little peculiar, but might be disk issues that get resolved once the replacement drive has been installed.
 
LVL 47

Assisted Solution

by:David
David earned 500 total points
ID: 34219473
Your premise is a bit incomplete. The XOR parity in RAID 5 protects against both full drive failure and block failure, but not both at once. With one drive already failed, if any surviving disk has a bad block, then you lose that entire stripe, so you can still have data loss.

That is why you need to run regular data consistency checks/repairs: they clean up lurking bad blocks that result from media failures AFTER the last time you read that stripe of data.

Best practice is to always do replacements hot. The most stressful thing you can do to a disk is power cycle it. Why rock the boat?

The HP person is correct. There is also no reason to guess at what is going wrong. Look at the event log in the controller; it will tell you.
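To make the stripe-loss point above concrete, here is a minimal Python sketch of XOR parity on one stripe. It is a toy model only: the block contents, the `rebuild` helper, and the use of `None` for an unreadable block are all assumptions for illustration, not how the Smart Array firmware actually works.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy stripe: four data blocks plus one parity block (five drives, as in the question).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)
stripe = data + [parity]

def rebuild(stripe, failed):
    """Rebuild the block at index `failed` from all the other blocks.

    Returns None if any surviving block is unreadable (modelled as None),
    because XOR reconstruction needs every other block in the stripe.
    """
    survivors = [b for i, b in enumerate(stripe) if i != failed]
    if any(b is None for b in survivors):
        return None
    return xor_blocks(survivors)

# One failed drive: the missing block can be recomputed.
recovered = rebuild(stripe, failed=2)

# Failed drive plus a bad block on a surviving drive: the stripe is lost.
damaged = list(stripe)
damaged[0] = None
lost = rebuild(damaged, failed=2)

print(recovered, lost)
```

The second case is the situation described above: the array is already degraded, so a single unreadable block on any remaining drive leaves nothing to reconstruct that stripe from.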

 

Author Comment

by:sweetseater
ID: 34219574
Dlethe,

I can't access the event log in the controller. Not until the server is fully functioning.

What is the method you recommend to do regular data consistency check/repairs?

Thanks for the help. I will settle down and trust the hot-swapping process.

 
LVL 47

Expert Comment

by:David
ID: 34219737
HP has several freebies. Certainly get the latest SmartStart / ACU / ADU utilities. I don't know off the top of my head what controller comes bundled with your system, but some of them have a feature to automate the consistency repairs.

Go to the support.hp.com site, download/install the latest firmware, updates, and drivers, and get the add-on monitoring at the same site as well.
 
LVL 76

Expert Comment

by:Alan Hardisty
ID: 34219747
@dlethe - controller is as follows:

Integrated Smart Array 5i Plus Controller with optional Battery-Backed Write Cache (BBWC) Enabler option kit

http://h18000.www1.hp.com/products/quickspecs/11473_div/11473_div.HTML
 
LVL 56

Expert Comment

by:andyalder
ID: 34219816
You should always replace drives on Smart Array controllers hot (you can remove cold but should add hot), but that doesn't explain your grey screen. What the HP tech advises makes no sense to me either. As you say, it should be running, albeit slower, and it ought to boot happily even with the bad drive removed.

Could you boot the SmartStart CD, run the Array Diagnostic Utility, and post the log file **as an attachment**? We might spot something wrong with one of the other disks.

You don't have an option to start a manual parity check, Smart Array controllers do this in background after 15 seconds of inactivity.

At a guess I'd say it's more than a flakey disk; chkdsk might show file system corruption.
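For readers wondering what a consistency check/repair pass does in principle, here is a rough Python sketch. It is only a toy model under stated assumptions: an unreadable block is represented as `None`, `scrub_and_repair` is a hypothetical name, and nothing here reflects the actual Smart Array background check.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by their XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def scrub_and_repair(stripes):
    """Walk every stripe and rewrite any single unreadable block from the rest.

    Returns a list of (stripe_index, block_index) pairs that were repaired.
    Raises ValueError if a stripe has more than one unreadable block,
    since XOR parity cannot recover that case.
    """
    repaired = []
    for s_idx, stripe in enumerate(stripes):
        bad = [i for i, block in enumerate(stripe) if block is None]
        if len(bad) > 1:
            raise ValueError(f"stripe {s_idx} is unrecoverable")
        if len(bad) == 1:
            readable = [block for block in stripe if block is not None]
            stripe[bad[0]] = xor_blocks(readable)
            repaired.append((s_idx, bad[0]))
    return repaired

stripes = [
    make_stripe([b"ab", b"cd", b"ef", b"gh"]),
    make_stripe([b"ij", b"kl", b"mn", b"op"]),
]
stripes[1][2] = None  # a latent bad block nobody has read recently

repaired = scrub_and_repair(stripes)
print(repaired, stripes[1][2])
```

The point of running such a pass while all drives are healthy is that a latent bad block gets fixed from parity before a drive failure turns it into a lost stripe.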
 

Author Comment

by:sweetseater
ID: 34220146
I did a hot swap. After the drive rebuilt, which took only about 2 hours (36 GB), I was able to restart the server and it booted fine. I'm now running chkdsk.

I looked through some of the details the HP Insight Online edition gave, but it showed no errors for this failure and no record of it where I was looking. Attached is the ADU from the machine after the drive has been replaced.

It did show that one of the drives still in there had 6 "unavailable" errors, whereas one other drive had one and all the rest had zero. Dunno if that is pre-emptive notice of an issue, but none of the drives showed read errors, etc.

 Prolaw---Array-Diagnostic-Report.zip
 
LVL 47

Expert Comment

by:David
ID: 34220323
Don't confuse XOR parity errors with file system errors.   They are independent of each other.  
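A small Python sketch of why the two are independent. The `b"FS"` marker is a made-up stand-in for whatever a file system checker such as chkdsk validates, and the stripe layout is a toy assumption.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity_consistent(stripe):
    """True when data blocks and parity block XOR to all zero bytes."""
    return not any(xor_blocks(stripe))

def filesystem_ok(stripe):
    """Hypothetical file system check: first data block must start with b'FS'."""
    return stripe[0].startswith(b"FS")

# Healthy stripe: valid file system marker, parity computed from the data.
healthy = [b"FS01", b"data", b"more", b"end!"]
healthy.append(xor_blocks(healthy))

# Garbage written through the controller: parity is recomputed for the garbage,
# so the array is perfectly consistent even though the contents are corrupt.
corrupt = [b"\x00\x00\x00\x00", b"data", b"more", b"end!"]
corrupt.append(xor_blocks(corrupt))

print(parity_consistent(healthy), filesystem_ok(healthy))
print(parity_consistent(corrupt), filesystem_ok(corrupt))
```

In this model the controller only knows whether the blocks in a stripe agree with each other; whether those blocks hold a sane file system is a separate question that a tool like chkdsk answers.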


Question has a verified solution.
