SeeDk asked:
PowerEdge R630 - iDRAC8: New HD is showing as Failure Predicted
This is on a RAID 10 array.
The HD in slot 3 was reporting "Failure Predicted: Yes".
All the other HDs are in optimal status.
I purchased a new HD and swapped it in on Friday. The drive started rebuilding automatically. The status still showed failure predicted but I assumed it would clear up after the rebuild was done.
Today, I checked the status and saw the rebuild completed, but that same slot is still reporting "Failure Predicted: Yes".
Should I consider this to be a false positive and find a way to clear the error? Or is this really a bad drive?
ASKER
Yes, I used a server class HD specifically for this server/controller.
Is the server-class HDD running Dell's firmware? (Unfortunately it is more complicated than just getting any server-class HDD. There are dozens of configurable mode page settings Dell makes that are non-standard; most deal with error recovery retry counts, timers, how they deal with the bad block table structures, and so on.)
Some controllers are OK with stock off-the-shelf retail Seagate or HGST disks, as long as you get the right model. Some are not. Exactly what model of disk did you use, and what firmware revision is it running? (If you have not done so, it certainly won't hurt to make sure you have the latest controller firmware & drivers installed as well.)
ASKER
I think it was not a Dell branded drive.
The other disks in this RAID are Seagate drives and have been working fine for years.
The new disk is also a Seagate: ST1800MM0168
Not sure how to check the firmware it is running.
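If the OS can see the disk, smartmontools can report the firmware revision; since the device path and any PERC passthrough flags vary by setup, here is just a sketch that parses sample `smartctl -i` output (the sample text below is illustrative, not captured from this server):

```python
import re

# Hypothetical excerpt of `smartctl -i /dev/sdX` output for a SAS drive.
# On a PERC you would typically have to address the disk through the
# controller; the exact device path and flags depend on OS and smartctl version.
SAMPLE_INFO = """\
Vendor:               SEAGATE
Product:              ST1800MM0168
Revision:             2S23
Serial number:        S3Z09RK7
"""

def firmware_revision(smartctl_info):
    """Return the firmware revision reported in `smartctl -i` output, or None."""
    m = re.search(r"^Revision:\s*(\S+)", smartctl_info, re.MULTILINE)
    return m.group(1) if m else None

print(firmware_revision(SAMPLE_INFO))  # -> 2S23
```

The iDRAC storage page also shows the same value in its "Revision" field, so parsing either source should agree.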
ASKER
Also, I noticed the new disk has not logged a failure predicted event in the iDrac storage events.
The other disk logged an entry every day saying that failure was predicted. This one has not.
So this also makes me wonder if it's just a bug in which the status has not updated from that of the old disk.
Can you check the status using OpenManage Server Administrator?
Generic disks are supported on Dell PERCs, although you can get an "uncertified disk" warning. I don't think they tweak any parameters, but they do write "Dell was here" on a particular code page.
ASKER
Yes, I've seen what you are talking about on other servers. "Certified Disk - No" and then a warning status that means "non-critical". But in these cases, the "Failure Predicted" will show "No" so it is not a cause for concern.
I don't see any 'Certified Disk' field here:
Status Non-critical
Name Physical Disk 0:1:3
Device Description Disk 3 in Backplane 1 of Integrated RAID Controller 1
State Online
Operational State Not Applicable
Slot Number 3
Size 1676.13 GB
Block Size 512 bytes
Security Status Not Capable
Bus Protocol SAS
Media Type HDD
Hot Spare No
Remaining Rated Write Endurance Not Applicable
Failure Predicted Yes
Power Status Spun Up
Progress Not Applicable
Used RAID Disk Space 1676.13 GB
Available RAID Disk Space 0.00 GB
Negotiated Speed 12 Gbps
Capable Speed 12 Gbps
SAS Address 0x5000C500967F4D1D
Part Number CN0VJ7CD7262262R00DCA00
Manufacturer SEAGATE
Product ID ST1800MM0168
Revision 2S23
Serial Number S3Z09RK7
Manufactured Day 2
Manufactured Week 8
Manufactured Year 2016
Form factor 2.5 inch
T10 PI Capability Capable
Controller PERC H330 Mini (Embedded)
Enclosure BP13G+ 0:1
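A listing like this is easy to scan programmatically; a minimal sketch (the dump string is an abbreviated copy of the fields pasted above, and the multi-word keys mean we match on a key prefix rather than splitting on whitespace):

```python
# Abbreviated copy of the iDRAC physical-disk listing above.
DISK_DUMP = """\
Name Physical Disk 0:1:3
State Online
Failure Predicted Yes
Negotiated Speed 12 Gbps
Product ID ST1800MM0168
Revision 2S23
"""

def attr(dump, key):
    """Return the value for a known key in a 'Key Value'-per-line dump."""
    for line in dump.splitlines():
        if line.startswith(key + " "):
            return line[len(key) + 1:].strip()
    return None

# Flag any disk whose dump reports a predicted failure.
if attr(DISK_DUMP, "Failure Predicted") == "Yes":
    print("WARNING:", attr(DISK_DUMP, "Name"), "is predicting failure")
```

This is only a convenience for eyeballing saved dumps; it does not query the controller itself.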
ASKER
I'm still thinking this is some bug in the status display.
I checked the logs, and since 1/1/17, the iDRAC was logging a predictive failure alert every day on disk 3.
Since it was replaced, no alerts have been logged.
I could probably leave it like this, but I would like to clear that buggy status so it doesn't cause confusion in a few months.
ASKER
Noticed something odd now.
The other drives installed on this RAID are all ST1800MM0018.
If I look online for this drive, I see that it is capable of 12Gbps speeds.
However, in the DRAC, this is displayed for those drives:
Negotiated Speed 6 Gbps
Capable Speed 6 Gbps
On the new drive I swapped in, this is reported:
Negotiated Speed 12 Gbps
Capable Speed 12 Gbps
Could this be the cause of the issue? Why didn't the new drive "scale down" to 6Gbps?
Why are these supposedly 12Gbps capable drives running at 6Gbps?
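The mismatch is easy to spot mechanically once the per-slot speeds are written down; a minimal sketch, with slot numbers and speeds transcribed from the iDRAC pages above (illustrative only, not read from the controller):

```python
from collections import Counter

# (slot, negotiated Gbps) transcribed from the iDRAC storage pages.
disks = [
    (0, 6), (1, 6), (2, 6),  # existing ST1800MM0018 members at 6 Gbps
    (3, 12),                 # replacement ST1800MM0168 at 12 Gbps
]

# Flag members whose negotiated link speed differs from the majority.
majority_speed = Counter(speed for _, speed in disks).most_common(1)[0][0]
outliers = [slot for slot, speed in disks if speed != majority_speed]
print(outliers)  # -> [3]
```

On a SAS backplane each drive negotiates its link independently, which is why one member can run at a different rate than the rest of the array.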
Well, lots of things could tell it to negotiate at 6G, or that disk could be programmed for 6G. This is one of the risks of going with retail disks: you have no idea what the mode page settings are. Dell does quite a bit of work on the mode page settings. In particular, they ensure that the log pages related to reporting HDD health, bad blocks, and such are consistent across the manufacturers they OEM.
That is why Dell might have the same P/N of disk but the disk itself could be a Seagate or HGST drive of different models. They specialize the firmware to ensure the CDBs to report all the health (including SMART) are consistent, regardless of make/model/firmware.
That is because many of the parameters are vendor/product specific. So they standardize so their controller knows exactly where to get the health information once it starts seeing unrecoverable I/O errors, or even a high degree of timeouts.
Bottom line, you get what you pay for. There are differences, quite a bit of them, especially for the 12G drives.
I'd put those other disks in personal PCs and get the right drives with the right firmware if you want to have safe data.
ASKER
I got these disks from an IT vendor where we always purchase our server drives. I always tell them what servers these are for, so they should not be sending me drives which won't work. And I have not had issues with the other drives.
I guess the bottom line is: is this a serious issue and should I demand the vendor send me a different disk? Could this disk really potentially fail, or is this just a software quirk?
I still have not seen any predictive failure log entry.
Never in a million years would an IT vendor ensure the disks have the correct mode page settings for any given controller. But I've only been in the biz for over 20 years so haven't seen it all yet ;)
You won't see any predictive messages unless the drive is properly configured, AND the firmware is presenting the correct log pages that the controller is looking for. While one can program mode pages for any given firmware, you have to have the correct firmware to get the log pages. (Mode pages configure operational characteristics like timeouts, whether the write cache is enabled, and hundreds more. Log pages are counters like the number of unrecovered read errors.)
In your case there are also a few mode pages that deal with default and max bus interface speed, a la 6Gbit vs 12Gbit.
I'd just get the HP disks if you are using an HP controller, if this is an important server. My company writes diagnostics and firmware for controllers and disks and such, and that is what we do ourselves, even though we can change the settings I just mentioned. Now if you used an off-the-shelf retail controller like a MegaRAID that is designed to work with non-OEM firmware, then you could use retail generic disks... but that is not what you have.
ASKER
Ok, still learning here :).
This is a Dell server, so I suppose Dell-branded drives would work better, though I have usually not had these issues before.
So regardless of whether the disk will actually fail soon or not... you think this current setup is not good, right?
Did you check the status using OMSA?
Re the speed, I think dlethe is right that Dell set them to 6Gb; almost every website lists 43N12 as being 6Gb even though they are 12Gb Seagate disks.
ASKER
Thanks, I looked up 43N12 too and see what you mean. Looks like the Dell-branded drives are set to 6Gb/s, but the one I bought is not Dell branded and is causing this problem. It does seem I will have to contact the vendor.
OMSA is actually not installed on this server - I guess because it has a DRAC card.
Do you think there would be a difference between what OMSA and DRAC report?
There's certainly a difference, since OMSA shows a nasty warning triangle for disks that aren't certified.
screenshots:
http://en.community.dell.com/support-forums/servers/f/906/t/19993240
http://nmtechno.com/dellraidh200/
Yes, but OMSA labels it as not certified whereas DRAC shows yours as being predictive failure, it may be that OMSA gets uncertified confused with predictive failure so it would be nice to compare the two.
ASKER
It would be nice but installing OMSA requires a reboot and my window for rebooting this box is only during the weekend.
For what it's worth, I have another server with both OMSA and DRAC installed that has two drives in 'Foreign' state.
Both DRAC and OMSA report the same status on those.
Looks like the best option is to get a Dell-certified drive for this server.
Foreign just means they have stale metadata on them, either they went offline and then back online after a reboot or they came from another computer. You can clear the metadata under OMSA, highlight the controller and it's on the configuration tab under controller tasks.
ASKER
I don't see that option on the OMSA but that's fine, I'm not concerned with those disks.
For the R630, I will get a Dell-branded replacement from the vendor and swap it in. Hopefully by Friday. Will update this topic with the result.
ASKER CERTIFIED SOLUTION
ASKER
Thanks for the info.
I actually did get a Dell branded replacement that was rated at 12Gb/s. Popped it in but still the speed would not go down to 6Gb/s!
So the vendor is sending me another one which is rated at 6Gb/s.
Out of curiosity, I am wondering if there is a real cause for concern in having 12Gb/s and 6Gb/s drives running in the same RAID (will the disks fail faster? data loss?) or if it's a warning that can be ignored.
I still do intend to replace it with a 6Gb/s drive to avoid any potential problems.
Nothing wrong with having 12Gb and 6Gb in the same array; it's just OMSA being stupid. It's like fitting a 120MPH-rated tyre to a car with three 70MPH-rated tyres on it: perfectly safe to run at 70MPH. The warning can be ignored as you aren't using the advanced features of the 12Gb disk. One feature is probably self-encryption, which you're not using on the 6Gb ones anyway.
What was the Dell part no on the 12Gb one they sent you?
They sell the ST1800MM0168/6Gb as 43N12 and the ST1800MM0168/12Gb as VJ7CD or FDDG4 by the looks of it. Same HDA, just the code pages tweaked to make one appear to be a 6Gb one. Not 100% sure those part numbers are correct; I just got them from Google, as Dell doesn't have a decent disk part number matrix.
Always worth going to http://www.dell.com/support/home/uk/en/ukbsdt1 and looking up the part no they fitted initially.
ASKER
Shouldn't the 12Gb/s drive usually 'clock down' to 6Gb/s when the other drives in the RAID are 6Gb/s? The problem is this one still claims to be running at 12Gb/s.
This is on a DRAC, not OMSA. The LED warning light on the drives is flashing too.
I think this is VJ7CD since I see this as the part# in the DRAC console: Part # CN0VJ7CD7262263V00UBA00
The RAID was initially fitted with ST1800MM0018 - Part # CN043N12726224BG01LPA00
The vendor I used at first claimed not to have the part rated at 6Gb/s but said 12Gb/s would be fine.
It's not going to clock down to 6Gb because the controller and backplane support 12Gb, and this one has not been manually set to 6Gb by a code page tweak. The 12Gb is fine, but the stupid firmware says all those other 6Gb disks are slowing this 12Gb one down. Assuming it has rebuilt, are all the 6Gb ones giving warnings but the 12Gb one saying it is OK? You say the LED warning is flashing on the drives; is it just the 6Gb ones?
ASKER
No, it is only the 12Gb/s disk warning in the DRAC that it will fail, with its LED light flashing. The 6Gb/s disks are all reporting fine.
So the 12Gb/s has been hardcoded to always run at 12Gb/s?
ASKER
Received the replacement disk from the vendor. Part # CN043N12726225910039A00
Rated at 6Gbps. Placed it in the server and the DRAC is also reporting 6Gbps.
It is currently rebuilding, but the status is already showing 'Failure Predicted - Yes'.
Is this normal or should I start looking into if there is a problem elsewhere? The other 3 disks in this RAID report no issues.
SOLUTION
ASKER
I looked this up as well since it seems to match this case but it doesn't make sense to me.
I thought punctures and faults only occurred if a drive failed and the other disks couldn't supply some of the data?
There was no failure here.
I removed the predicted failure drive from the array.
Replaced with a new drive. It was the wrong speed (12Gb/s), so I replaced that one again with the correct speed (6Gb/s).
I don't think I can see the controller log on DRAC but I can see the lifecycle logs.
The only errors are the reports of the original Disk 3 predictive failure. No other alerts before or after other than my swapping of the disks.
I'm thinking the next step is to run a consistency check on the RAID? That article mentions that could clear up the error if it is not serious.
The DRAC doesn't have that option, so I will need to install OMSA first.
SOLUTION
But it's happening with Dell disks as well as OEM ones. And BTW it was you that pooh-poohed the Dell document.
ASKER
A key difference between this predictive failure and others I've seen is that there is NO alert generated in the logs.
Usually, an alert is generated daily in the storage logs saying "Predictive Failure on Disk x:x in Controller Y."
This has not happened after I replaced the disk the first time. Before that, the logs were alerting a predictive failure daily.
Is this a decent sign that perhaps there is no issue? Doing a consistency check still seems like a good idea, though.
Or do the logs not alert correctly in a RAID puncture situation?
SOLUTION
Google TLER to find out why.