SeeDk
asked:

PowerEdge R630 - iDrac8: New HD is showing as Failure Predicted

This is on a RAID 10 array.
The HD in slot 3 was reporting "Failure Predicted: Yes".
All the other HDs are in optimal status.

I purchased a new HD and swapped it in on Friday. The drive started rebuilding automatically. The status still showed "Failure Predicted", but I assumed it would clear up after the rebuild was done.
Today, I checked the status and saw the rebuild completed, but that same slot is still reporting "Failure Predicted: Yes".

Should I consider this a false positive and find a way to clear the error? Or is this really a bad drive?
David

Did you use server-class HDDs that are qualified for that controller? If you used consumer-class SATA drives, they are totally unsuitable and will just never work right, due to how long it takes them to do bad block recovery.

Google TLER to find out why.
SeeDk

ASKER

Yes, I used a server class HD specifically for this server/controller.
Is the server-class HDD running Dell's firmware? (Unfortunately it is more complicated than just getting any server-class HDD. There are dozens of configurable mode page settings Dell makes that are non-standard; most deal with error recovery retry counts, timers, how they deal with the bad block table structures, and so on.)

Some controllers are OK with stock off-the-shelf retail Seagate or HGST disks, as long as you get the right model. Some are not. Exactly what model of disk did you use, and what firmware revision is it running? (If you have not done so, it certainly won't hurt to make sure you have the latest controller firmware and drivers installed as well.)
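If the host runs Linux and has smartmontools installed, the model and firmware revision of a drive behind a MegaRAID-based PERC can usually be read via passthrough; a sketch (the block device `/dev/sda` and the device ID `3` are assumptions -- on a PERC the device ID may differ from the slot number):

```shell
# Query the drive's identity through the controller (smartmontools):
smartctl -i -d megaraid,3 /dev/sda
# The output's identity section lists Vendor, Product (e.g. ST1800MM0168)
# and the firmware revision -- the same "Revision" field iDRAC displays.
```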
SeeDk

ASKER

I think it was not a Dell branded drive.
The other disks in this RAID are Seagate drives and have been working fine for years.
The new disk is also a Seagate: ST1800MM0168

Not sure how to check the firmware it is running.
SeeDk

ASKER

Also, I noticed the new disk has not logged a failure predicted event in the iDrac storage events.
The other disk logged an entry every day saying that failure was predicted. This one has not.
So this also makes me wonder if it's just a bug in which the status has not updated from that of the old disk.
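The same status can also be pulled on demand from the iDRAC8 RACADM CLI, which helps rule out a stale GUI page; a sketch (the FQDD below follows the disk naming shown elsewhere in this thread and may need adjusting):

```shell
# List every physical disk with its full property set (State, FailPredicted, etc.):
racadm storage get pdisks -o
# Or target the slot-3 disk directly by its FQDD:
racadm storage get pdisks:Disk.Bay.3:Enclosure.Internal.0-1:RAID.Integrated.1-1
```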
Can you check the status using OpenManage Server Administrator?

Generic disks are supported on Dell PERCs, although you can get an "uncertified disk" warning. I don't think they tweak any parameters, but they do write "Dell was here" on a particular code page.
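Where OMSA is installed, its CLI shows the same fields as the GUI, including Failure Predicted and the certification state; a sketch (controller 0 is an assumption -- `omreport storage controller` lists the real IDs):

```shell
# All physical disks on controller 0 (omreport ships with OMSA):
omreport storage pdisk controller=0
# Or just the disk at Connector:Enclosure:Port 0:1:3, as in this thread:
omreport storage pdisk controller=0 pdisk=0:1:3
```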
SeeDk

ASKER

Yes, I've seen what you are talking about on other servers. "Certified Disk - No" and then a warning status that means "non-critical". But in these cases, the "Failure Predicted" will show "No" so it is not a cause for concern.
I don't see any 'Certified Disk' field here:


Status  Non-Critical
Name  Physical Disk 0:1:3
Device Description  Disk 3 in Backplane 1 of Integrated RAID Controller 1
State  Online
Operational State  Not Applicable
Slot Number  3
Size  1676.13 GB
Block Size  512 bytes
Security Status  Not Capable
Bus Protocol  SAS
Media Type  HDD
Hot Spare  No
Remaining Rated Write Endurance  Not Applicable
Failure Predicted  Yes
Power Status  Spun Up
Progress  Not Applicable
Used RAID Disk Space  1676.13 GB
Available RAID Disk Space  0.00 GB
Negotiated Speed  12 Gbps
Capable Speed  12 Gbps
SAS Address  0x5000C500967F4D1D
Part Number  CN0VJ7CD7262262R00DCA00  
Manufacturer  SEAGATE  
Product ID  ST1800MM0168  
Revision  2S23
Serial Number  S3Z09RK7  
Manufactured Day  2
Manufactured Week  8
Manufactured Year  2016
Form factor  2.5 inch
T10 PI Capability  Capable
Controller  PERC H330 Mini (Embedded)
Enclosure  BP13G+ 0:1
SeeDk

ASKER

I'm still thinking this is some bug in the status display.
I checked the logs, and since 1/1/17 the iDRAC was logging a predictive failure alert every day on disk 3.
Since it was replaced, no alerts have been logged.

I could probably leave it like this, but I would like to clear that buggy status so it doesn't cause confusion in a few months.
SeeDk

ASKER

Noticed something odd now.
The other drives installed on this RAID are all ST1800MM0018.

If I look online for this drive, I see that it is capable of 12Gbps speeds.
However, on the DRAC, this is displayed for those drives:
Negotiated Speed  6 Gbps
Capable Speed  6 Gbps

On the new drive I swapped in, this is reported:
Negotiated Speed  12 Gbps
Capable Speed  12 Gbps


Could this be the cause of the issue? Why didn't the new drive "scale down" to 6Gbps?
Why are these supposedly 12Gbps capable drives running at 6Gbps?
Well, lots of things could tell it to negotiate at 6G, or that disk could be programmed for 6G. This is one of the risks of going with retail disks: you have no idea what the mode page settings are. Dell does quite a bit of work on the mode page settings. Of particular interest, they ensure that the log pages related to reporting HDD health, bad blocks, and such are consistent across the manufacturers they OEM.

That is why Dell might have the same P/N of disk but the disk itself could be a Seagate or HGST drive of different models. They specialize the firmware to ensure the CDBs that report all the health data (including SMART) are consistent, regardless of make/model/firmware.

That is because many of the parameters are vendor/product specific. So they standardize, so their controller knows exactly where to get the health information once it starts seeing unrecoverable I/O errors, or even a high degree of timeouts.

Bottom line: you get what you pay for. There are differences, quite a few of them, especially for the 12G drives.

I'd put those other disks in personal PCs and get the right drives with the right firmware if you want to have safe data.
SeeDk

ASKER

I got these disks from an IT vendor where we always purchase our server drives. I always tell them what servers these are for, so they should not be sending me drives which won't work. And I have not had issues with the other drives.

I guess the bottom line is: is this a serious issue, and should I demand the vendor send me a different disk? Could this disk really potentially fail, or is this just a software quirk?
I still have not seen any predictive failure log entry.
Never in a million years would an IT vendor ensure the disks have the correct mode page settings for any given controller. But I've only been in the biz for over 20 years, so I haven't seen it all yet ;)

You won't see any predictive messages unless the drive is properly configured AND the firmware is presenting the correct log pages that the controller is looking for. While one can program mode pages for any given firmware, you have to have the correct firmware to get the log pages. (Mode pages configure operational characteristics, like timeouts and whether the write cache is enabled, and hundreds more. Log pages are counters, like the number of unrecovered read errors.)

In your case there are also a few mode pages that deal with default and max bus interface speed, a la 6Gbit vs 12Gbit.

I'd just get the Dell disks for your Dell controller if this is an important server. My company writes diagnostics and firmware for controllers and disks and such, and that is what we do ourselves, even though we can change the settings I just mentioned. (Now, if you used an off-the-shelf retail controller like a MegaRAID that is designed to work with non-OEM firmware, then you could use retail generic disks... but that is not what you have.)
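For reference, the mode pages being described can be dumped on Linux with sg3_utils, assuming the drive is reachable as an SG device (behind a PERC it often is not without passthrough; `/dev/sg3` is an assumption):

```shell
# Dump all mode pages the drive reports (sg3_utils package):
sg_modes -a /dev/sg3
# Read-Write Error Recovery page (retry counts, recovery timers):
sg_modes -p 0x01 /dev/sg3
# Protocol Specific Port page, where SAS link-rate settings live:
sg_modes -p 0x19 /dev/sg3
```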
SeeDk

ASKER

Ok, still learning here :).
This is a Dell server, so I suppose Dell-branded drives would work better, though I have usually not had these issues before.

So regardless of whether the disk will actually fail soon or not...you think this current setup is not good, right?
Did you check the status using OMSA?

Re the speed, I think dlethe is right: Dell set them to 6Gb. Almost every website lists 43N12 as being 6Gb, even though they are 12Gb Seagate disks.
SeeDk

ASKER

Thanks, I looked up 43N12 too and see what you mean. It looks like the Dell-branded drives are set to 6Gb, but the one I bought is not Dell-branded and is causing this problem. It does seem I will have to contact the vendor.

OMSA is actually not installed on this server - I guess because it has a DRAC card.
Do you think there would be a difference between what OMSA and DRAC report?
There's certainly a difference, since OMSA shows a nasty warning triangle for disks that aren't certified.

screenshots:
http://en.community.dell.com/support-forums/servers/f/906/t/19993240
http://nmtechno.com/dellraidh200/
SeeDk

ASKER

DRAC also shows that nasty warning triangle

[screenshot]
Yes, but OMSA labels it as not certified, whereas DRAC shows yours as being predictive failure. It may be that OMSA gets uncertified confused with predictive failure, so it would be nice to compare the two.
SeeDk

ASKER

It would be nice but installing OMSA requires a reboot and my window for rebooting this box is only during the weekend.
For what it's worth, I have another server with both OMSA and DRAC installed that has two drives in 'Foreign' state.
Both DRAC and OMSA report the same status on those.

Looks like the best option is to get a Dell Certified drive for this server.
Foreign just means they have stale metadata on them; either they went offline and then back online after a reboot, or they came from another computer. You can clear the metadata under OMSA: highlight the controller, and it's on the configuration tab under controller tasks.
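If OMSA's GUI doesn't show the task, the same clear is available from its CLI; a sketch (controller 0 assumed; clearing discards the foreign metadata, so only do it if those disks' old array is not wanted):

```shell
# Discard foreign (stale) RAID metadata on controller 0:
omconfig storage controller action=clearforeignconfig controller=0
# Or import it instead, if the disks came from an array you want to keep:
omconfig storage controller action=importforeignconfig controller=0
```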
SeeDk

ASKER

I don't see that option on the OMSA but that's fine, I'm not concerned with those disks.
For the R630, I will get a Dell-branded replacement from the vendor and swap it in. Hopefully by Friday. Will update this topic with the result.
ASKER CERTIFIED SOLUTION
Member_2_231077
This solution is only available to members.
SeeDk

ASKER

Thanks for the info.
I actually did get a Dell branded replacement that was rated at 12Gb/s. Popped it in but still the speed would not go down to 6Gb/s!
So the vendor is sending me another one which is rated at 6Gb/s.

Out of curiosity, I am wondering if there is a real cause for concern in having the 12Gb/s and 6Gb/s drives running in the same RAID (will the disks fail faster? data loss?) or if it's a warning that can be ignored.

I still do intend to replace it with a 6Gb/s drive to avoid any potential problems.
Nothing wrong with having 12Gb and 6Gb in the same array; it's just OMSA being stupid. It's like fitting a 120MPH-rated tyre to a car with three 70MPH-rated tyres on it: perfectly safe to run at 70MPH. The warning can be ignored, as you aren't using the advanced features of the 12Gb disk. One feature is probably self-encryption, which you're not using on the 6Gb ones anyway.

What was the Dell part no on the 12Gb one they sent you?
They sell the ST1800MM0168/6Gb as 43N12 and the ST1800MM0168/12Gb as VJ7CD or FDDG4, by the looks of it. Same HDA, just the code pages tweaked to make one appear to be a 6Gb one. I'm not 100% sure those part nos. are correct; I just got them from Google, as Dell doesn't have a decent disk part number matrix.

Always worth going to http://www.dell.com/support/home/uk/en/ukbsdt1 and looking up the part no they fitted initially.
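Incidentally, the Dell part number is embedded in the long PPID string iDRAC reports: after the two-character country code and a leading zero, the next five characters are the part. A small sketch, assuming the CN0xxxxx layout seen in the PPIDs quoted in this thread:

```shell
# Extract the 5-character Dell part number from a PPID string:
# skip "CN" + "0" (characters 1-3), keep characters 4-8.
dell_part() {
  printf '%s\n' "$1" | cut -c4-8
}

dell_part CN0VJ7CD7262263V00UBA00   # VJ7CD (the 12Gb drive in this thread)
dell_part CN043N12726224BG01LPA00   # 43N12 (the original 6Gb drives)
```

The five characters it prints are what Dell's support-site part lookup expects.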
SeeDk

ASKER

Shouldn't the 12Gb/s drive usually 'clock down' to 6Gb/s when the other drives in the RAID are 6Gb/s? The problem is this one still claims to be running at 12Gb/s.
This is on a DRAC, not OMSA. The LED warning light on the drives is flashing too.
I think this is VJ7CD, since I see this as the part # in the DRAC console: Part # CN0VJ7CD7262263V00UBA00

The RAID was initially fitted with ST1800MM0018 - Part # CN043N12726224BG01LPA00
The vendor I used at first claimed not to have the part rated at 6Gb/s but said 12Gb/s would be fine.
It's not going to clock down to 6Gb because the controller and backplane support 12Gb, and this one has not been manually set to 6Gb by a code page tweak. The 12Gb is fine, but the stupid firmware says all those other 6Gb disks are slowing this 12Gb one down. Assuming it has rebuilt, are all the 6Gb ones giving warnings but the 12Gb one saying it is OK? You say the LED warning is flashing on the drives; is it just the 6Gb ones?
SeeDk

ASKER

No, it is only the 12Gb/s disk that is warning in the DRAC that it will fail, with its LED light flashing. The 6Gb/s disks are all reporting fine.
So has the 12Gb/s drive been hardcoded to always run at 12Gb/s?
SeeDk

ASKER

Received the replacement disk from the vendor. Part#CN043N12726225910039A00
Rated at 6Gbps. Placed it in the server and the DRAC is also reporting 6Gbps.
It is currently rebuilding, but the status is already showing 'Failure Predicted - Yes'.

Is this normal or should I start looking into if there is a problem elsewhere? The other 3 disks in this RAID report no issues.
SOLUTION
This solution is only available to members.
SeeDk

ASKER

I looked this up as well, since it seems to match this case, but it doesn't make sense to me.
I thought punctures and faults only occurred if a drive failed and the other disks couldn't supply some of the data?
There was no failure here.
I removed the predicted-failure drive from the array.
Replaced it with a new drive. It was the wrong speed (12Gb/s), so I replaced that one again with one of the correct speed (6Gb/s).
I don't think I can see the controller log on the DRAC, but I can see the Lifecycle logs.
The only errors are the reports of the original Disk 3 predictive failure. No other alerts before or after, other than my swapping of the disks.

I'm thinking the next step is to run a consistency check on the RAID? That article mentions that could clear up the error if it is not serious.
The DRAC doesn't have that option, so I will need to install OMSA first.
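Once OMSA is installed, the consistency check itself is a single CLI call; a sketch (controller and vdisk numbers are assumptions -- the first command lists the real IDs):

```shell
# List virtual disks (and their IDs) on controller 0:
omreport storage vdisk controller=0
# Start a consistency check on virtual disk 0:
omconfig storage vdisk action=checkconsistency controller=0 vdisk=0
# Progress afterwards appears in the vdisk's State/Progress fields.
```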
SOLUTION
This solution is only available to members.
But it's happening with Dell disks as well as OEM ones. And BTW it was you that pooh-poohed the Dell document.
SeeDk

ASKER

A key difference between this predictive failure and others I've seen is that there is NO alert generated in the logs.
Usually, an alert is generated daily in the storage logs saying "Predictive Failure on Disk x:x in Controller Y".
This has not happened since I replaced the disk the first time. Before that, the logs were alerting a predictive failure daily.

Is this a decent sign that perhaps there is no issue? Though doing a consistency check is still a good idea.

Or do the logs not alert correctly in a RAID puncture situation?
SOLUTION
This solution is only available to members.