Replacement of damaged SAS Drive, RAID 5 in Dell PowerEdge R720


I have a Dell PowerEdge R720 with three 600 GB SAS hard drives in RAID 5 on a PERC card. This morning I saw the orange alert light, and the display says "there is an error in Disk 0, check drive". I'm not sure how long the error has been there because I rarely visit the site. I have a spare 500 GB SAS drive and some questions:

- Since the message on the display says "check drive" (it doesn't literally say "replace the disk"), does that mean there is something I can do to fix the disk, or is replacing it inevitable?
- If there's nothing else to do to fix the disk, can I use the 500 GB drive as the replacement in an array whose other two drives are 600 GB?
- If yes, can I do that with the server running in a Windows session, or do I have to power off the server?
- If I have to power off the server, do I have to boot into the PERC utility, bring the disk online, and rebuild the array?
- If yes, does rebuilding the array take a long time? (I ask in order to plan how long the server will be offline and warn users.) The total array space with the three 600 GB drives was 1 TB, of which 450 GB is used.

Daniel Flores Olmos, Infrastructure and Support Engineer, Asked:

Tyler Brooks, Network and Security Consultant, Commented:
As a rule I would just replace the drive. Even if it is only predicting a failure (i.e. SMART tripped), it's not worth the risk.

No. You could add a 1 TB drive and simply not get to use any space above 600 GB, but you can't add a smaller drive and rebuild the array.
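
To make the size rule concrete, here is a minimal Python sketch. The function name is hypothetical and the comparison uses nominal gigabytes for simplicity; a real controller compares the drives' actual capacity, so treat this as an illustration of the rule rather than of what the PERC firmware does.

```python
from typing import Optional


def usable_on_replacement_gb(member_gb: float, replacement_gb: float) -> Optional[float]:
    """Return how much of the replacement the array would use, or None if it is too small.

    Assumption: sizes are nominal GB; a real controller checks actual capacity.
    """
    if replacement_gb < member_gb:
        return None  # smaller drive: the array cannot rebuild onto it
    return member_gb  # any extra space on a larger drive goes unused


print(usable_on_replacement_gb(600, 500))   # the 500 GB spare from the question
print(usable_on_replacement_gb(600, 1000))  # a 1 TB drive in the same slot
```

Under these assumptions the 500 GB spare is rejected, while a 1 TB drive would be accepted with only 600 GB of it in use.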

In most of the Dells I've seen, you have to reboot the server and set the new disk as a hot spare, and then the rebuild will start, but the server does not actually need to be offline while it runs. Once you've accessed the BIOS utility and done this, you can boot into Windows and monitor the rebuild from OpenManage.

Rebuild time is extremely difficult to predict as it is constrained by a lot of factors, but since the server can be live during the rebuild it's less of an issue. (It may not run at peak performance, but all of the servers I've done this with have been usable during the process.)

If you know the model of your PERC, you can generally find documentation that runs through the rebuild process in more detail. This is just how it was done on the 2-3 Dell servers I've had to do this with.

Daniel Flores Olmos, Infrastructure and Support Engineer, Author Commented:
I forgot to mention the error code on the display: "PDR1101 Fault detected on Drive 0. Check drive." It looks like a bad connection to the drive. The manual's description says: "The controller detected a drive removal. If unintended, verify drive installation. Remove and reseat the indicated disk. If the problem persists, contact technical support." I'll do that, but I don't want to touch anything until the backup finishes.
Daniel Flores Olmos, Infrastructure and Support Engineer, Author Commented:
UPDATE: For some reason the Windows Server session closed and the backup was interrupted, so I took advantage of that, removed the drive, and plugged it back in. For a few minutes the display stopped showing the error and went blue, and the drive's blinking LED went green, but a few minutes later the error came back on the display and the drive's LED went back to orange.

In addition to Tyler's comments, you may be able to insert the new drive and have it rebuild without rebooting the server. You should have Dell OpenManage Server Administrator installed; with it you can view the status of drives, change rebuild rates, assign hot spares, etc. You may want to use it to run the "check consistency" task.

As a matter of course you should make sure that your BIOS and firmware levels are up-to-date, especially for the PERC card, backplane, etc.
PowerEdgeTech, IT Consultant, Commented:
- If it is showing predictive failure, then replace it. It may or may not fail, and it may or may not rebuild successfully, but it is not worth the effort of finding out; it needs to be replaced.
- I don't understand your second bullet. Are you asking if you can replace a 600GB disk with a 500GB disk? If so, the answer is no. The replacement must be at least as large as the disk it is replacing. You also CANNOT mix SAS and SATA in the same array, so you can't replace a SAS disk with a SATA disk, etc.
- Do not power down the server to replace or introduce a hard drive. If it is hot-swap, swap it hot.
- Again, you should be doing this in the OS, not the BIOS utility. In either case, be careful with your terminology: "put online the disk" may lead you to the "force online" option, which will destroy your data if you are ever in a position where that is actually an option. To rebuild onto the new disk, you need to assign it as a hot spare.
- Rebuild time is dependent on size, speed, type, and usage (it will rebuild slower during work hours than during off-peak usage). A 600GB 15K SAS disk shouldn't take long ... maybe less than an hour, maybe a little more.
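
For planning purposes, a back-of-the-envelope estimate can be sketched in Python. The function name and both rebuild rates below are assumptions chosen for illustration, not figures measured on any PERC. The sketch also assumes the controller rewrites the whole 600 GB drive, since a hardware controller works at the block level and generally does not know that only 450 GB of the volume is in use.

```python
def rebuild_hours(drive_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to rewrite one whole drive at a steady rate (decimal GB and MB)."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = (drive_gb * 1000) / rate_mb_per_s
    return seconds / 3600


# Assumed sustained rates: a mostly idle server vs. one busy with user I/O.
for label, rate in (("idle", 150.0), ("busy", 50.0)):
    print(f"{label}: {rebuild_hours(600, rate):.1f} h")
```

With these assumed rates the estimate lands at roughly one hour on an idle server and a few hours on a busy one, which is in line with the range given above. The rebuild rate setting in OpenManage shifts where a real system falls between the two.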
andyalder, Haemorrhoids victim, Commented:
Just to add that replacing a 600GB disk with a 1TB one would be ill-advised even if they are both SAS, because the 600GB ones will be 10K or 15K whereas the 1TB will be a 7.2K "nearline" disk. You can replace it with a 900GB one if you have one with the same spin speed.
dlethe Commented:
Do NOT replace the drive. It puts you at extreme risk of data loss (because you degrade the RAID, and just ONE unreadable block guarantees data loss).

So here is the smart move: buy a replacement, then do an in-place migration from RAID 5 to RAID 6. You have redundant data the whole time, and even if the drive eventually fails you still have redundant data.

(Besides, running RAID 5 is just nuts if your system is one where you are concerned about the inconvenience of downtime, data loss, or rebuilding.)
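
The capacity side of that migration can be checked with a short Python sketch. The names are hypothetical, and the arithmetic is the standard parity rule (one drive's worth of space lost in RAID 5, two in RAID 6), not anything specific to the PERC.

```python
from typing import Sequence

PARITY_DRIVES = {"raid5": 1, "raid6": 2}
MIN_DRIVES = {"raid5": 3, "raid6": 4}


def usable_gb(level: str, drive_sizes_gb: Sequence[float]) -> float:
    """Usable space of a parity array; every member counts as the smallest drive."""
    n = len(drive_sizes_gb)
    if n < MIN_DRIVES[level]:
        raise ValueError(f"{level} needs at least {MIN_DRIVES[level]} drives")
    return (n - PARITY_DRIVES[level]) * min(drive_sizes_gb)


def gb_to_tib(gb: float) -> float:
    """Convert decimal gigabytes to binary tebibytes."""
    return gb * 1e9 / 2**40


current = usable_gb("raid5", [600, 600, 600])
migrated = usable_gb("raid6", [600, 600, 600, 600])
print(current, migrated, round(gb_to_tib(current), 2))
```

Three 600 GB drives in RAID 5 and four in RAID 6 both come out at 1200 GB usable, so adding one drive and migrating keeps the same space. That is about 1.09 TiB, which probably explains the "1TB" figure in the question.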
Daniel Flores Olmos, Infrastructure and Support Engineer, Author Commented:
Thank you all,

My doubts are now cleared up, but Dell is quoting a 3-4 week delivery time, so I'll surely be back with you all when it's time to rebuild the array. I hope I can keep this ticket open until the new disk arrives.
PowerEdgeTech, IT Consultant, Commented:
If you'd rather not wait that long, there are resellers that could get it to you in a day or two:
andyalder, Haemorrhoids victim, Commented:
3-4 weeks is a long time to be at risk; I'd rather fit a reconditioned one than wait that long.

I like dlethe's idea of migrating to RAID 6, although it's a bit slower, but I'd still replace the predictive-fail drive after the RAID level migration is complete, so that would mean buying two. At least they're 10K or 15K SAS, so there is a low chance of unrecoverable read errors compared to 7.2K disks, which makes RAID 5 tolerable.
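
The read-error risk can be put into rough numbers with a Python sketch. The error rates below are assumptions based on typical datasheet figures (about one unrecoverable error per 10^16 bits read for 10K/15K SAS, per 10^15 for 7.2K nearline); check the datasheet for the actual drives. The function name is hypothetical, and the model treats bit errors as independent, which is a simplification.

```python
import math


def p_read_error(bytes_read: float, errors_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error over bytes_read.

    Assumes independent bit errors: 1 - (1 - p) ** bits, computed stably.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-errors_per_bit))


# A degraded 3-drive RAID 5 must read both surviving 600 GB drives in full.
surviving_bytes = 2 * 600e9

for label, rate in (("10K/15K SAS (assumed 1e-16)", 1e-16),
                    ("7.2K nearline (assumed 1e-15)", 1e-15)):
    print(f"{label}: {p_read_error(surviving_bytes, rate):.4%}")
```

With these assumed rates the chance of hitting an unreadable block during the rebuild is on the order of 0.1% for the enterprise SAS drives and about ten times that for nearline disks of the same size. That matches the point above: the risk is real but tolerable on this array.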