Dell Perc H700 Virtual Disk Bad Blocks - Unrecoverable Errors

I have a Dell PowerEdge T710 with 16 drives. I'm getting the following errors in OpenManage for the disks:

A block on the physical disk has been punctured by the controller
Physical Disk 0:0:8/9/11/12/13/14

There was an unrecoverable disk media error during the rebuild or recovery operation
Physical Disk 0:0:8/9/11/12

The driver detected a controller error on \Device\RaidPort0

Virtual Disk bad block medium error is detected: Virtual Disk 2


There are tons of these in Event Viewer.

I'm not sure what to make of these errors. There's one for almost every disk in that virtual disk set. We have no problems with that server: backups are completing successfully, and there are no errors other than these disk errors in OpenManage. All the physical disks have green checks and are in the STATUS OK state. This almost looks like a controller error, given that so many disks are reporting bad. How can I confirm whether a disk is bad, and clear these errors from OpenManage?

I'm afraid the Clear Virtual Disk Bad Blocks option will break the server, depending on where the blocks are. Can I run a consistency check while the virtual disk is online and the server is running?

This server is a Hyper-V server with 6 VMs. I have 3 virtual disks in the set. Disk 1 is the C drive. It is simply 2 disks in a mirrored set. Virtual Disks 1 and 2 are RAID 5 sets that are spanned volumes in Computer Management. My spanned volumes say "Healthy (At Risk)". What is listed as Disk 2 in Computer Management has a little yellow triangle and says "errors". If I right-click, I only have three options: "Reactivate", "Offline", or "Properties". Not sure what to make of this.

Any advice is appreciated!
Thanks!
new435Asked:
arnoldCommented:
Check whether the RAID controller has the latest firmware.
The other possibility is that there is an issue with one of the disks in the array, and the controller marked the same block on all disks of the RAID group to prevent write operations to it.
Look at other error logs to see whether a specific disk is flagged with a predictive failure notice. You do not want to continue down this road, where the "failing disk" keeps causing the controller to mark additional blocks as inaccessible. An EE expert/contributor, dlethe I think, mentioned a while back that bad blocks on a RAIDed drive are the worst possible situation compared to a completely failed drive (or it might be my interpretation of his insight into RAID controllers and disks over the years). This is akin to an intermittent failure whose cause cannot be identified and whose circumstances cannot be replicated. If one has the option, a drive failure is preferable to a bad block detection.
0
new435Author Commented:
None of the drives are in predictive failure mode. There is an update to the firmware but I'm not sure if I should run that while the drives are failing.
0
arnoldCommented:
The "drives" are not all failing. One of the drives is having an issue; the difficulty is identifying which one.

Searching for the reported error points to a remedy of backing up the data, reinitializing the drives, and restoring the backup.

Presumably, or hopefully, the issue is on a single drive, but it is a tenuous position. The issue might be related to the media degrading.

Does your environment have redundancy/contingency if this hyper-v server needs to be taken offline for some time?
1
new435Author Commented:
We do, but I really don't want to use it. The on-site failover box wasn't meant to run this many VMs; it was really only meant to pick up if one or two VMs failed. We have a DR solution, but that means I would have to throw all traffic over the WAN to the off-site location, which is great if the building catches fire, but I'd hate to use it to test drives. If it can't be done while the server is live, then I'll have to come up with something.
0
arnoldCommented:
This is a predicament no one wants to be in.

Given that these events are continuing, it suggests that additional blocks, rather than a single block, are the cause.
Time, unfortunately, is not on your side.
Have you ever tested your backups by restoring them in a segregated network (not connected to the LAN) to see whether the restored VMs etc. run as intended?

A failed disk is replaced, and you deal with the slower performance while the volume rebuilds.
0
new435Author Commented:
How could I possibly tell which disk it is? Either way, if the RAID striped the bad blocks, then isn't a full rebuild of the RAID the only option?
0
Philip ElderTechnical Architect - HA/Compute/StorageCommented:
The error indicates that the disk plugged into port 0 on the RAID controller/backplane is the one that is causing the grief:


A block on the physical disk has been punctured by the controller
Physical Disk 0:0:8/9/11/12/13/14

there was an unrecoverable disk media error during the rebuild or recovery operation
Physical Disk 0:0:8/9/11/12

The driver detected a controller error on \Device\RaidPort0

It's best to address this sooner than later. I hope the backups are known good.
1
new435Author Commented:
Yes. The backups are good. What about the last part of that number

0:0:8

for example?

What does that refer to?
0
arnoldCommented:
Fortunately I've not seen such an event, but I am skeptical of attributing the issue to a single drive based on that.

\Device\RaidPort0 is a Windows reference to the controller.

Using OpenManage, look at Storage and the labels on the physical disks.
Check the volumes you have defined and which disks make them up.

The notation is controller:channel:disk
.....
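
As a rough aid for reading the alerts quoted in the question, the shorthand "0:0:8/9/11/12" can be expanded into one ID per disk and tallied. This is only a sketch: the function name `expand_disk_ids` and the `alerts` dictionary are made up for illustration, and the counts show how often each disk is named in the two alerts, not which disk first produced the bad block.

```python
from collections import Counter


def expand_disk_ids(entry):
    """Expand '0:0:8/9/11' into ['0:0:8', '0:0:9', '0:0:11']."""
    prefix, _, last = entry.rpartition(":")
    return [f"{prefix}:{n}" for n in last.split("/")]


# Disk lists copied from the two OpenManage alerts in the question.
alerts = {
    "punctured block": "0:0:8/9/11/12/13/14",
    "unrecoverable media error": "0:0:8/9/11/12",
}

counts = Counter(
    disk for entry in alerts.values() for disk in expand_disk_ids(entry)
)

# Print the disks in slot order with the number of alerts naming each one.
for disk, hits in sorted(counts.items(), key=lambda kv: int(kv[0].rsplit(":", 1)[1])):
    print(disk, hits)
```

Disks 8, 9, 11 and 12 appear in both alerts, while 13 and 14 appear only in the puncture alert.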
0
Philip ElderTechnical Architect - HA/Compute/StorageCommented:
I sit corrected. Thanks for the clarification. :)
0
new435Author Commented:
All the disks that are reporting errors are part of Virtual Disk 2 in Server Manager. But it's pretty much every disk. Not sure how to narrow this down.
0
Philip ElderTechnical Architect - HA/Compute/StorageCommented:
If the server still has warranty I suggest giving Dell a call with the Service Ticket ID. They will narrow it down and provide a replacement under warranty.
0

arnoldCommented:
You can get DSET from the Dell support page. With the information here, I hope DSET will report which drive has the issue, but I suspect that it might not be possible.

Booting into the controller BIOS and looking for the log there to see which disk triggered the bad block ........


On the planning side, do you have resources (server/drives) that you can use to reconstitute what this server is performing, to migrate away from the risk of an ever-increasing number of bad blocks that continually erode the ......... to a point where it may compromise your setup?


Note: good backups are not merely backups that completed successfully. A backup is only good if, when restored, the items in it function as intended.
0
Philip ElderTechnical Architect - HA/Compute/StorageCommented:
+1 on the "Good Backup" definition. For us that is a full bare-metal or hypervisor restore once a quarter (a service we provide our clients).

On the Intel side which also uses LSI the RAID Web Console 2 gives us direct access to the controller's logs. We can download them from there.

OpenManage is a wrapper IIRC? Is there a direct attached RAID management console that will allow access to the controller's logs?
0
arnoldCommented:
It should feed the error through. OpenManage logs hardware events, but not having seen such an error, I do not know whether it is reported there with a clear identification of the drive on which the bad block was detected, and on which the controller acted by marking the block as bad on the remaining disks.
0
new435Author Commented:
It does happen to be in warranty, so I can call Dell for support. Backups are the least of my worries for this client. Worst case, I can fail over to the server they have on-site, or off-site (although, as I said before, it would be awfully slow, but it would get them up and running). I think I'll need to schedule some off-site maintenance for this, but I'm going to call Dell first and see what they say.
1
Jim_NimSenior EngineerCommented:
As a few others have stated, this isn't a good position to be in.

A "puncture" on a virtual disk with a Dell PERC controller refers to a stripe in the RAID set where too many blocks of data were missing or bad to be able to recalculate what is missing. Typically this results from a combination of a single drive failure, and a "URE" event during rebuild to a replacement or hot-spare. When using RAID5, a rebuild to a spare drive requires that all data on the remaining drives is readable. If one single disk encounters an unrecoverable read error (which is not at all uncommon with modern high-capacity disks), this results in an inability to reproduce the data that the failed drive contained for that corresponding block.
You can read more about punctures here
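
To make the puncture idea concrete, here is a minimal sketch of single-parity (XOR) reconstruction on one stripe. It is a toy model, not the PERC firmware's actual logic: the function names and the 4-byte blocks are invented for illustration. One missing block per stripe can be recomputed from the others; two missing blocks cannot, which is the situation described above.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe(stripe):
    """Rebuild one single-parity stripe; None marks an unreadable block.

    Returns the repaired stripe, or None when more than one block is
    missing and parity can no longer recover the data (a "puncture").
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if not missing:
        return list(stripe)
    if len(missing) > 1:
        return None
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# Three data blocks plus their parity block form one stripe.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]

one_lost = [None] + stripe[1:]        # failed drive only
two_lost = [None, None] + stripe[2:]  # failed drive plus a URE on a second drive

print(rebuild_stripe(one_lost) == stripe)
print(rebuild_stripe(two_lost))
```

With one block missing the stripe is rebuilt exactly; with two missing the function gives up, mirroring a stripe the controller has to mark as punctured.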

Speaking from years of experience in Dell's enterprise storage support... you unfortunately only have one real option for a true solution: Identify a validated backup of your data (or get one ASAP from what you still have access to), and delete/recreate/initialize the affected virtual disks. Until you do this, you stand to potentially face corruption of your data (possibly worse than it may already be at the moment). If this seems like an unacceptable action to be required, then be sure to steer away from RAID5 for production data in the future, or make the most use of PERC consistency check scheduling that you can if RAID5 is somehow unavoidable. Consider RAID6 where capacity is the primary need, which has about the same chances of encountering a puncture after a single drive failure as RAID5 does with no failures, or RAID10 where performance is a priority.
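
For a rough sense of scale, the chance of hitting at least one URE during a RAID5 rebuild can be estimated as 1 - (1 - p)^n, where n is the number of bits read from the surviving disks and p is the per-bit error rate. Everything numeric below is an assumption for illustration, not data from this server: 5 surviving disks of 2 TB each, a spec-sheet style rate of 1 error per 10^14 bits, and independent errors at a constant rate. Real drives often behave differently from their spec sheets.

```python
import math


def p_ure_during_rebuild(surviving_disks, disk_bytes, ure_per_bit):
    """Probability of at least one unrecoverable read error while
    reading every bit of every surviving disk exactly once."""
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Hypothetical inputs (assumptions, not measurements from this thread).
p = p_ure_during_rebuild(surviving_disks=5, disk_bytes=2e12, ure_per_bit=1e-14)
print(f"Estimated chance of a URE during rebuild: {p:.0%}")
```

Under those assumptions the estimate comes out at a bit over one in two, which is why regular consistency checks, or a second parity disk, matter on large arrays.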

I would assume you're coming to Experts Exchange because the unit isn't under warranty... but if it does still have hardware support, I would strongly suggest you reach out to Dell for some log review so that any problematic physical disks (if any one drive seems to be the primary contributor to the punctures during rebuilds) can be replaced before the virtual disks are re-initialized.

Good luck!!
0
RojoshoRTCC-III Level-2 SupportCommented:
Hello new435,

Hope it is not too late to join the party.

Jim_Nim's advice appears very sound. I do have one suggestion to make. Most array controllers have a ROM-based setup which can be accessed from the POST screen. We use this to help triage the failed HDD when there are multiple HDDs having issues. I have to be honest that I am not that familiar with the Dell array controller (I mostly work with the HP SmartArray), but array controllers are similar in many areas. I hope this helps.

Rojosho
0
new435Author Commented:
Spent HOURS with Dell trying to figure this out. The final solution? I had to run the Dell tools from a bootable USB to upgrade every piece of firmware on the unit, including the disks. Then I had to run "Clear Bad Blocks" in the controller. Then I had to reboot. That finally fixed it.
0
Philip ElderTechnical Architect - HA/Compute/StorageCommented:
Curiosity Question: If you can tell us, what make/model/manufacturer of disk please?
0

Question has a verified solution.
