Replacement of damaged SAS Drive, RAID 5 in Dell PowerEdge R720

Daniel Flores Olmos
Hi,

I have a Dell PowerEdge R720 with three 600 GB SAS hard drives in RAID 5 on a PERC card. This morning I saw the orange alert light, and the display says "there is an error in Disk 0, check drive" (I'm not sure how long the error has been there because I rarely go to the site). I have a spare 500 GB SAS drive and some questions:

- Since the display says "check drive" (it doesn't literally say "replace the disk"), does that mean there is something I can do to fix the disk, or is replacing it inevitable?
- If nothing can be done to fix the disk, can I put the 500 GB drive into the array alongside the two remaining 600 GB drives?
- If yes, can I do that with the server running in a Windows session, or do I have to power off the server?
- If I have to power off the server, do I have to boot into the PERC utility, bring the disk online, and rebuild the array?
- If yes, how long does the rebuild take? (I ask so I can plan how long the server will be offline and warn users.) The total array space with the three 600 GB drives was 1 TB, and 450 GB is used.

Thanks.
Network and Security Consultant
Commented:
As a rule I would just replace the drive. Even if it is only predicting a failure (i.e. SMART tripped), it's not worth the risk.

No. You could add a 1 TB drive and simply not get to use any space above 600 GB, but you can't add a smaller drive and rebuild the array.
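The size rule above can be sketched as simple arithmetic. This is a minimal illustration, assuming every member of a RAID 5 set is truncated to the size of the smallest member; the function name is hypothetical and is not part of any Dell tool.

```python
def raid5_usable_gb(member_sizes_gb):
    """Usable capacity of a RAID 5 set: (n - 1) x smallest member.

    Assumption: each member is truncated to the smallest member's size,
    and one member's worth of space is consumed by parity.
    """
    if len(member_sizes_gb) < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (len(member_sizes_gb) - 1) * min(member_sizes_gb)


original = raid5_usable_gb([600, 600, 600])
with_1tb = raid5_usable_gb([600, 600, 1000])  # extra 400 GB goes unused

print(original, with_1tb)
# Decimal gigabytes expressed in binary TiB, which is roughly what an OS reports
print(round(original * 10**9 / 2**40, 2), "TiB")
```

Three 600 GB drives give 1200 GB (decimal), which is about 1.09 TiB and is consistent with the "1 TB" mentioned in the question. Swapping in a larger drive does not change the result.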

On most of the Dells I've seen, you have to reboot the server and set the new disk as a hot spare, and then the rebuild starts, but the server does not need to stay offline while it runs. Once you've done this in the BIOS utility, you can boot into Windows and monitor the rebuild from OpenManage.

Time-wise it is extremely difficult to say, as it depends on a lot of factors, but since the server can be live during the rebuild it's less of an issue. (It may not run at peak performance, but all of the servers I've done this with have been usable during the process.)

If you know the model of your PERC, you can generally find documentation that runs through the rebuild process in more detail. This is just how it was done on the 2-3 Dell servers I've had to do this with.
Daniel Flores Olmos, Infrastructure and Support Engineer

Author

Commented:
Forgot to mention the error code on the display: "PDR1101 Fault detected on Drive 0. Check drive." It looks like a "bad connection" of the drive. The manual's description says: "The controller detected a drive removal. If unintended, verify drive installation. Remove and reseat the indicated disk. If the problem persists, contact technical support." I'll do that, but I don't want to touch anything until the backup finishes.
Daniel Flores Olmos, Infrastructure and Support Engineer

Author

Commented:
UPDATE: For some reason the current Windows Server session closed and the backup was interrupted, so I took advantage of that, removed the drive, and plugged it back in. For a few minutes the display stopped showing the error and went blue, and the drive's blinking LED went green, but a few minutes later the error came back on the display and the drive's LED went back to orange.

Top Expert 2014
Commented:
In addition to Tyler's comments, you may be able to insert the new drive and have it rebuild without rebooting the server. You should have Dell OpenManage Server Administrator installed. With it you can view the status of the drives, change rebuild rates, assign hot spares, etc. You may also want to use it to run the "check consistency" task.
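For anyone who prefers the command line to the OpenManage GUI, the tasks above map roughly onto the `omreport` and `omconfig` storage commands. The sketch below only builds and prints the command lines; it does not run them. The controller, virtual disk, and physical disk IDs are placeholders, and the exact syntax should be checked against the OMSA CLI guide for the installed version before anything is executed.

```python
import shlex


def omsa_commands(controller=0, vdisk=0, pdisk="0:1:0"):
    """Build (but do not run) OMSA CLI command lines as argv lists.

    All IDs are hypothetical placeholders; read the real ones from the
    'omreport storage pdisk' and 'omreport storage vdisk' output.
    """
    ctrl = f"controller={controller}"
    return {
        "list physical disks": ["omreport", "storage", "pdisk", ctrl],
        "list virtual disks": ["omreport", "storage", "vdisk", ctrl],
        "assign global hot spare": [
            "omconfig", "storage", "pdisk", "action=assignglobalhotspare",
            ctrl, f"pdisk={pdisk}", "assign=yes",
        ],
        "check consistency": [
            "omconfig", "storage", "vdisk", "action=checkconsistency",
            ctrl, f"vdisk={vdisk}",
        ],
    }


for label, argv in omsa_commands().items():
    print(f"{label}: {shlex.join(argv)}")
```

Printing first and running by hand keeps a typo in a disk ID from turning into an unintended change on a degraded array.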

As a matter of course you should make sure that your BIOS and firmware levels are up-to-date, especially for the PERC card, backplane, etc.
PowerEdgeTech, IT Consultant
Top Expert 2010
Commented:
- If it is showing predictive failure, then replace it. It may or may not fail, and it may or may not rebuild successfully, but it is not worth the effort: it needs to be replaced.
- I don't understand your second bullet. Are you asking if you can replace a 600GB disk with a 500GB disk? If so, the answer is no. The replacement must be at least as large as the disk it is replacing. You also CANNOT mix SAS and SATA in the same array, so you can't replace a SAS disk with a SATA disk, etc.
- Do not power down the server to replace or introduce a hard drive. If it is hot-swap, swap it hot.
- Again, you should be doing this in the OS, not the BIOS utility. In either case, be careful with your terminology: "put online the disk" may lead you to the "force online" option, which will destroy your data if you are ever in a position where that option is actually offered. To rebuild the disk, you need to assign it as a hot spare.
- Rebuild time is dependent on size, speed, type, and usage (it will rebuild slower during work hours than during off-peak hours). A 600GB 15K SAS disk shouldn't take long ... maybe less than an hour, maybe a little more.
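A rough way to turn the rebuild-time remark above into numbers is to divide the member size by an assumed sustained rebuild rate. The rates below are illustrative assumptions, not measurements from a PERC, and the function name is hypothetical. The estimate also assumes the controller rebuilds the whole member disk rather than only the 450 GB in use, since a hardware RAID controller is not aware of the file system.

```python
def rebuild_hours(member_size_gb, rebuild_rate_mb_s):
    """Estimate rebuild time as member size divided by sustained rate.

    Assumption: the whole member disk is rewritten at a constant rate;
    real rebuilds slow down under production I/O load.
    """
    if rebuild_rate_mb_s <= 0:
        raise ValueError("rebuild rate must be positive")
    seconds = member_size_gb * 1000 / rebuild_rate_mb_s  # GB -> MB
    return seconds / 3600


# Assumed rates: busy server at the low end, idle server at the high end
for rate in (50, 100, 150, 200):
    print(f"{rate} MB/s -> {rebuild_hours(600, rate):.1f} h")
```

With these assumed rates a 600 GB member lands somewhere between about one and a few hours, which is why scheduling the rebuild for off-peak hours helps.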
Top Expert 2014

Commented:
Just to add that replacing a 600GB disk with a 1TB one would be ill-advised even if they are both SAS, because the 600GB ones will be 10K or 15K whereas the 1TB will be a 7.2K "nearline" disk. You could replace it with a 900GB one if you had one with the same spin speed.
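The replacement rules raised in this thread (at least as large, same interface, preferably the same spindle speed) can be collected into one small check. This is only a sketch of the advice given here, not a Dell compatibility tool; the `Drive` type and `check_replacement` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    size_gb: int
    interface: str  # "SAS" or "SATA"
    rpm: int


def check_replacement(failed, candidate):
    """Return (blockers, warnings) for using candidate in place of failed."""
    blockers, warnings = [], []
    if candidate.size_gb < failed.size_gb:
        blockers.append("candidate is smaller than the drive it replaces")
    if candidate.interface != failed.interface:
        blockers.append("SAS and SATA cannot be mixed in the same array")
    if candidate.rpm < failed.rpm:
        warnings.append("slower spindle speed than the existing members")
    if candidate.size_gb > failed.size_gb:
        warnings.append("capacity above the original size goes unused")
    return blockers, warnings


failed = Drive(600, "SAS", 15000)  # assumed speed of the existing drives
print(check_replacement(failed, Drive(500, "SAS", 15000)))
print(check_replacement(failed, Drive(1000, "SAS", 7200)))
```

The 500 GB spare from the question is blocked on size alone, while a 1 TB 7.2K nearline disk would rebuild but draws both warnings.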
David, President
Top Expert 2010

Commented:
Do NOT replace the drive. It puts you at extreme risk of data loss. (You degrade the RAID, and from then on just ONE unreadable block guarantees data loss.)

So here is the smart move: buy a replacement, then do an in-place migration from RAID 5 to RAID 6. You have redundant data all of the time, and even if the drive eventually fails you still have redundant data.

(Besides, running RAID 5 is just nuts if your system is one where you are concerned about the inconvenience of downtime, data loss, or rebuilding.)
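To see what the RAID 5 to RAID 6 suggestion means in terms of capacity and fault tolerance, the comparison can be sketched as below. It assumes equal-sized members and uses the standard parity overhead of one drive for RAID 5 and two for RAID 6; the function name is hypothetical.

```python
def raid_summary(level, drive_count, drive_size_gb):
    """Usable capacity and tolerated drive failures for RAID 5 or RAID 6."""
    parity_drives = {5: 1, 6: 2}
    minimum_drives = {5: 3, 6: 4}
    if level not in parity_drives:
        raise ValueError("only RAID 5 and RAID 6 are covered here")
    if drive_count < minimum_drives[level]:
        raise ValueError(f"RAID {level} needs at least {minimum_drives[level]} drives")
    return {
        "usable_gb": (drive_count - parity_drives[level]) * drive_size_gb,
        "failures_tolerated": parity_drives[level],
    }


print("current :", raid_summary(5, 3, 600))
print("proposed:", raid_summary(6, 4, 600))
```

Going from three drives in RAID 5 to four in RAID 6 keeps the usable space at 1200 GB while raising the number of drive failures the array survives from one to two.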
Daniel Flores Olmos, Infrastructure and Support Engineer

Author

Commented:
Thank you all,

My doubts are now cleared up, but Dell is quoting 3-4 weeks for delivery. I'll surely be back with you all at that point to rebuild the array; I hope I can keep this ticket open until the new disk arrives.
PowerEdgeTech, IT Consultant
Top Expert 2010

Commented:
If you'd rather not wait that long, there are resellers that could get it to you in a day or two:
http://www.xbyte.com/Items.aspx?key=fr&code=457&cat=P_D_SP_HDD&grp=2&fil5=5%3a106&fil2=2%3a457&incl_m=F
Top Expert 2014

Commented:
3-4 weeks is a long time to be at risk, I'd rather fit a reconditioned one than wait that long.

I like dlethe's idea of migrating to RAID 6, although it's a bit slower, but I'd still replace the predictive-fail disk after the RAID level migration is complete, so that would mean buying two. At least they're 10K or 15K SAS, so there is a low chance of unrecoverable read errors compared to 7.2K disks, which makes them tolerable in RAID 5.
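The unrecoverable-read-error point can be put into rough numbers. The sketch below assumes error rates of 1 in 10^16 bits for 10K/15K enterprise SAS and 1 in 10^15 bits for 7.2K nearline disks; these are typical datasheet figures used here as assumptions, not values for the specific drives in this server, and real errors are not perfectly independent.

```python
import math


def p_unreadable_during_rebuild(surviving_drives, drive_size_gb, ure_per_bit):
    """Probability of at least one unrecoverable read error in a rebuild.

    Assumption: every bit on every surviving drive is read once and bit
    errors are independent, so P = 1 - (1 - ure_per_bit) ** bits_read.
    """
    bits_read = surviving_drives * drive_size_gb * 1e9 * 8
    # expm1/log1p keep precision when the per-bit rate is tiny
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


for label, rate in (("10K/15K SAS (assumed 1e-16)", 1e-16),
                    ("7.2K nearline (assumed 1e-15)", 1e-15)):
    p = p_unreadable_during_rebuild(2, 600, rate)
    print(f"{label}: {p:.4%}")
```

Under these assumptions a rebuild that reads the two surviving 600 GB drives has roughly a 0.1% chance of hitting an unreadable block on enterprise SAS and roughly 1% on nearline-class disks.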
