Solved

HP ML350 G6 P410i: RAID SMART detects imminent failure

Posted on 2013-05-18
Last Modified: 2013-06-28
Hi, I have an HP ML350 G6 server that I have been out to. On boot the system tells me that one of the drives is about to fail. I have purchased another drive of the same size to replace it, but after installing it and checking the HP ProLiant Array Configuration Utility, it doesn't say anything about a drive problem.

Also, if I wanted to be safe and just replace the drive, how would I go about doing this with the configuration below? I'm not very familiar with RAID and don't want to screw things up.

The configuration is as follows (not showing the new drive):

Logical Drive # 1, RAID 1+0, 137.0GB, Status OK
Port 1I, Box 1, Bay 1, 147.1GB SAS HDD OK
Port 1I, Box 1, Bay 2, 147.1GB SAS HDD OK

Logical Drive # 2, RAID 1+0, 279.4GB Status OK
Port 1I, Box 1, Bay 3, 300.0GB SAS HDD OK
Port 1I, Box 1, Bay 4, 300.0GB SAS HDD OK

I had an idea that I would get this new drive and make it a spare for the 137GB array, so it would kick in if and when the failing drive died, BUT I can't seem to find anywhere to add it as a spare.

Please see the attached pics, they may shed more light on the config. I can get any information you might need to help me, please ask.

TIA
IMG-1442.PNG
Screen-Shot-2013-05-18-at-19.20..png
Question by:firstnetsupport
29 Comments
 
LVL 10

Expert Comment

by:bigbigpig
ID: 39177538
To assign the new drive as a spare... In the Array Configuration Utility click Array A and click the button for Spare Management.

If you just wanted to replace the one in predictive failure take the drive out of bay 1 and put the new drive in that same spot.
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39177547
S.M.A.R.T. errors can be transient.  One can run offline diagnostics if you have a non-RAID controller and do some read-only testing, but that won't necessarily tell you that the drive is in an imminent failure situation.

If the HDD is under HP support, then request a warranty replacement.  For now, remember that 100% of HDDs fail, so don't fall into the trap of thinking that RAID is a substitute for backup.

So best practice is to make a complete backup of the array that contains the drive. Then, with the system powered on but not booted into your host O/S, replace the questionable HDD with the spare drive. Do this from the BIOS. That way, if the "good" drive fails during the profoundly stressful rebuild, you have a full backup and can restore.

(If you have the opportunity to take care of the issue today, without risk of data loss, during a down time window .. then that is always preferable to a HDD failing & rebuilding when you have risk of data loss due to a 2nd failure and all data has not been backed up)
0
 
LVL 55

Expert Comment

by:andyalder
ID: 39177700
Instead of running the ACU, run the ADU and upload adureport.txt as an attachment. Then we can look at the "read errors hard" count to check whether one of the disks looks bad to us, rather than relying on the disk's internal diagnostics.

Unfortunately there is no way at present to force a hot spare to mirror a working disk with SMART errors on HP's Smart Array controllers; they require you to make the disk fail by pulling it out before replacing it, even though it is complaining that it is unwell.
0
 

Author Comment

by:firstnetsupport
ID: 39177826
Hi, thanks for your replies, they are exactly the type of responses I'm looking for. I would love to be able to just replace the drive, but as I'm a novice with RAID, what I'm worried about is pulling a drive out and killing the system.

I know the O/S is running on the first array, which appears as one drive, but I'm not sure how to go about replacing the failing drive, i.e. do I pull the drive while the server is on and in Windows?

Or do I do it when the server is off? Will I have to do anything at all, or will it do it all itself? If you can imagine doing it for the first time, that's where I am. I don't know the procedure or what to expect.

I have read something about an automatic rebuild, and about putting the drives in while the server is online if they support "hot swap", but I don't know how or what to check.

If you can help with the above that would be great.

Thanks

P.S. How do I get the ADU report? Is this a separate tool to the ACU, and is it on the HP site for download?
0
 

Author Comment

by:firstnetsupport
ID: 39177844
Right, I've just found the ADU report section within the ACU tool and uploaded the ADU report. Please let me know what you think is going on and how to proceed with my previous questions.
ADUReport.txt
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39178016
This is not a false positive, all those errors on the read commands confirm it.  Backup and just yank the bad one and put in the replacement with system turned on ... but back up first, just in case the good drive in the mirror pair decides to die during the rebuild.  (it happens).
0
 

Author Comment

by:firstnetsupport
ID: 39178184
OK, from the info I have supplied, can you confirm:

1. The setup is working in RAID 1 and mirroring the drives.
(It says RAID 1+0. If it were RAID 0 then pulling the drive would screw the system, so I want to be sure.)

2. The drive can definitely be pulled and replaced with the system on and the OS running.

3. I have spotted that the drive size is different: the first 2 drives are 147GB and this new one is 146GB. Should I wait and replace it with a 147GB drive, or could this one work?
0
 
LVL 55

Assisted Solution

by:andyalder
andyalder earned 333 total points
ID: 39178369
I don't see any read errors although "Errors Logged" is far too high on the first disk.

Regarding the size difference: yes, it does matter. The 147GB ones are 3rd-party Fujitsu disks rather than disks with HP's firmware on them. The block count of the 146GB HP disks is lower than that of the 147GB ones, so the replacement is too small. You'll have to buy a 300GB HP one, or a FUJITSU MBD2147RC and swap the caddy from your genuine HP one.

HP report RAID 1 as RAID 1+0 so no problem there, you have two RAID 1 arrays.

You always replace drives hot on HP controllers, although in this case, if it were big enough, you could do it cold since there is no metadata on the 146GB one.
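
To make the size point concrete, here is a minimal Python sketch of the rule being described: a replacement needs at least as many blocks as the disk it replaces, and a mirror can only use the capacity of its smallest member. The block counts are assumed values for illustration, not figures read from these drives; take the real ones from the ADU report.

```python
from typing import Sequence

BLOCK_SIZE = 512  # bytes per block, assumed for these SAS disks

def can_replace(existing_blocks: int, replacement_blocks: int) -> bool:
    """A replacement is usable only if it has at least as many blocks."""
    return replacement_blocks >= existing_blocks

def mirror_usable_bytes(member_blocks: Sequence[int]) -> int:
    """A RAID 1 mirror can only use the capacity of its smallest member."""
    return min(member_blocks) * BLOCK_SIZE

# Assumed block counts, for illustration only
existing_147 = 287_277_984   # "147GB" third-party disk
hp_146 = 286_749_488         # "146GB" HP disk
hp_300 = 585_937_500         # "300GB" HP disk

print(can_replace(existing_147, hp_146))  # too small for the mirror
print(can_replace(existing_147, hp_300))  # large enough
print(mirror_usable_bytes([existing_147, hp_300]))
```

With these assumed numbers the 146GB disk is rejected and the 300GB disk is accepted, but a mirror of one 147GB and one 300GB disk still only offers the 147GB disk's capacity.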
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39178533
I was looking at the hex data dump and decoding the sense data. Look at the lines that use op code 28h.
0
 
LVL 55

Expert Comment

by:andyalder
ID: 39178598
Most likely, as they don't have HP firmware on them, the ACU/ADU doesn't fill in the statistics properly. That would also explain why it gives a POST error but the ACU doesn't show any problems.
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39179322
No the log is quite clear.  You have a large number of errors. The hex dumps reveal that on READ(10) commands, a combination of:
 * Recovered read errors - recovered w/o ECC, data auto reallocated
 * Recovered read errors - recovered with ECC - data rewritten
 * Unrecoverable read errors - recovered via redundant data elsewhere.  (In other words, data loss on the disk).

Granted this is not spelled out in the decoded text, it is revealed by the hex dumps.

Draw your own conclusions as to why HP chose to just reveal the hex codes that RAID developers like me can interpret off the top of their heads, and not decode them as text so people know that their disks lost data.

But bottom line, those 3/11/01s; 17/06s; 18/07s are quite clear and specific to people like me who write RAID & disk diagnostic code.  This is bad news.  Replace the drive.
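
For anyone who wants to follow along, here is a rough Python sketch of how codes like 3/11/01 map to text. It assumes the hex dump holds standard fixed-format SCSI sense data (response code 70h or 71h); the layout of the ADU report itself is not shown in this thread, and the sample buffer below is hand-built for illustration. The names come from the public T10 sense key and ASC/ASCQ tables, and only the three pairs mentioned above are included.

```python
# Sense key and ASC/ASCQ names from the public SCSI (T10) code tables.
# Only the codes mentioned in this thread are listed.
SENSE_KEYS = {0x1: "RECOVERED ERROR", 0x3: "MEDIUM ERROR"}
ASC_ASCQ = {
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
    (0x17, 0x06): "RECOVERED DATA WITHOUT ECC - DATA AUTO-REALLOCATED",
    (0x18, 0x07): "RECOVERED DATA WITH ECC - DATA REWRITTEN",
}

def decode_fixed_sense(sense: bytes) -> str:
    """Decode fixed-format sense data (response code 0x70 or 0x71)."""
    if len(sense) < 14 or (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F              # sense key: low nibble of byte 2
    asc, ascq = sense[12], sense[13]   # additional sense code and qualifier
    key_name = SENSE_KEYS.get(key, "sense key 0x%X" % key)
    detail = ASC_ASCQ.get((asc, ascq), "ASC/ASCQ %02Xh/%02Xh" % (asc, ascq))
    return "%s: %s" % (key_name, detail)

# Hypothetical 18-byte sense buffer carrying 3/11/01
example = bytes([0x70, 0, 0x03, 0, 0, 0, 0, 0x0A,
                 0, 0, 0, 0, 0x11, 0x01, 0, 0, 0, 0])
print(decode_fixed_sense(example))  # MEDIUM ERROR: READ RETRIES EXHAUSTED
```

A MEDIUM ERROR with retries exhausted is the unrecoverable case; the 17/06 and 18/07 pairs are reads that the disk recovered by itself.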
0
 
LVL 55

Expert Comment

by:andyalder
ID: 39179343
I think you misunderstand me, I'm not saying that the hex information isn't there but that it isn't in the summary where it should also be; look at http://filedb.experts-exchange.com/incoming/2013/01_w04/630575/ADUReport.txt for example, the ADU report should have those errors in the summary. Since it doesn't there is either something wrong with the diagnostic or more likely it can't cope with the non-HP firmware. The log isn't clear since it says "Read Errors Hard 0x00000000" which is incorrect.

This is what the summary ought to look like...

Serial Number                        D0A1P9A040WR0942
   Firmware Revision                    HPD5
   Product Revision                     HP      EG0146FARTR    
   Reference Time                       0x0008855f
   Sectors Read                         0x00000005cb187955
   Read Errors Hard                     0x000009ab
   Read Errors Retry Recovered          0x00000000
   Read Errors ECC Corrected            0x0000000000000001
   Sectors Written                      0x00000001be3da688
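
The counters in that summary are hexadecimal. As a small sketch, assuming each summary line is a name followed by two or more spaces and a 0x-prefixed value as in the excerpt above, they can be converted to decimal like this:

```python
import re

# Counter lines copied from the example summary above
SUMMARY = """\
   Sectors Read                         0x00000005cb187955
   Read Errors Hard                     0x000009ab
   Read Errors Retry Recovered          0x00000000
   Read Errors ECC Corrected            0x0000000000000001
   Sectors Written                      0x00000001be3da688
"""

def parse_counters(text: str) -> dict:
    """Return {counter name: integer value} for every 'name   0x...' line."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s{2,}0x([0-9a-fA-F]+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2), 16)
    return counters

counters = parse_counters(SUMMARY)
print(counters["Read Errors Hard"])  # 0x9ab = 2475
```

So the healthy-looking report in that example actually records 2475 hard read errors, which is the kind of figure missing from the report in this question.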
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39179360
The only thing arguably wrong with the diagnostic is that it is written in such a way to reveal to  level-2/3 engineers information that they need to assess the nature and risk of continuing on with a drive ... while keeping non-HP people in the dark.  

I'll go out on a limb here and say IMO that this isn't a bug. It is a feature.  Less info = fewer RMAs.
0
 
LVL 55

Expert Comment

by:andyalder
ID: 39179371
But it *does* list the read errors in the summary when it's working as the above example shows so it's not by design that it has zero for this particular report. That's why I'm assuming it's because it's a non-HP disk that the summary info isn't filled in properly.

HP support don't actually read it themselves BTW, they have a program to interpret it for them.
0
 

Author Comment

by:firstnetsupport
ID: 39179391
A lot of that has gone over my head, but am I right in thinking there hasn't been any data "lost", as it would still be on the good drive?

My plan for tomorrow is to order another 147GB Hypertec drive, the same as the first 2 disks. When it arrives I will attend site and yank the drive in bay 1 out with the server booted up and put the new one in. The server will then automatically re-establish the array with the new disk and sync the data on both drives without me doing anything to trigger it.
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39179393
Fair enough, neither of us have the source code, but the point I am making is that if I can decode the nature of the problems for non-HP disks using a hex dump from a HP program, then HP *could* modify that very same program to do this if they cared to.

In fact, the dump reveals how the disks are programmed; whether or not each disk has HP firmware on them; and even some changes that could be done to improve the situation.

Hmmm, thinking market opportunity here.  A program you run that just fixes all this stuff for non-HP drives, and also gives people full details on all drives, whether HP or not. If anybody interested in discussing, can contact me offline.  ;)
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39179403
Just spend the extra money and buy HP disks with HP firmware. At this point I think it should be obvious that the extra $100 or so for HP disks is a bargain.

Firmware matters.
(Yes, that means buying 2 HP disks and deploying the non-HP disks elsewhere. Native Windows software-based RAID1 and a JBOD controller will do nicely.)

P.S. I looked at the dump further. Neither of the non-HP disks is ever going to work well for you unless the mode pages are changed. It isn't whether one of them is sick, it is that they are not programmed to play nice with the controller. You'll continue to have problems with them moving forward. But they could be just fine under software RAID1. I can't know without having them and running tests, and that's not anything I would do anyway.
0
 
LVL 55

Expert Comment

by:andyalder
ID: 39179511
> it is that they are not programmed to play nice with the controller

That's what I was trying to tell you, it's why the ADU report has missing data.

Nevertheless, what matters is fixing it. Assuming the Hypertec disk is the same size as the current disks, just pull disk 1 out and plug the replacement in live, and it will rebuild.
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39179533
That disk will never work properly with that controller unless the mode pages are set properly. They are flat-out wrong, and the 3rd-party HDD is configured so that it only informs the controller of certain ongoing errors ONCE. After that it will not report again until the next power cycle, where it will only report that error on boot-up.

That disk, as it is configured, is unacceptable.  It goes beyond a health issue with this particular disk.  The firmware is currently configured to suppress certain diagnostic info and perform the incorrect number of retries.

This disk is tuned for stand-alone, non-RAID use.
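
The thread does not say which mode page fields are wrong on these disks. Purely as an illustration of the kind of settings involved, here is a Python sketch that decodes the standard SCSI Read-Write Error Recovery mode page (page code 01h), which holds the error-reporting flags and retry counts. The sample bytes are hypothetical, not taken from the ADU report.

```python
# Byte 2 flags of the Read-Write Error Recovery mode page, bit 7 down to bit 0
FLAG_NAMES = ["AWRE", "ARRE", "TB", "RC", "EER", "PER", "DTE", "DCR"]

def decode_rw_error_recovery(page: bytes) -> dict:
    """Decode the 12-byte Read-Write Error Recovery mode page (page code 01h)."""
    if len(page) < 12 or (page[0] & 0x3F) != 0x01:
        raise ValueError("not a Read-Write Error Recovery mode page")
    flags = {name: bool(page[2] & (0x80 >> i)) for i, name in enumerate(FLAG_NAMES)}
    return {
        "flags": flags,
        "read_retry_count": page[3],
        "write_retry_count": page[8],
        "recovery_time_limit_ms": int.from_bytes(page[10:12], "big"),
    }

# Hypothetical page contents: automatic reallocation on, PER (post error) off
sample = bytes([0x81, 0x0A, 0xC0, 0x3F, 0, 0, 0, 0, 0x3F, 0, 0xFF, 0xFF])
info = decode_rw_error_recovery(sample)
print(info["flags"]["PER"], info["read_retry_count"])
```

In this made-up sample the PER flag is clear, meaning the disk would not post recovered errors to the controller, and the retry counts are values a RAID controller might want set differently.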
0
 

Author Comment

by:firstnetsupport
ID: 39182778
OK, the latest is this: the HDD has apparently been discontinued. I can get the same size drive in 15k speed, but I'm not sure if this is OK?

Or I can get a 300GB drive, but I'm not sure if it's HP or 3rd party. Would this matter?
0
 
LVL 47

Expert Comment

by:dlethe
ID: 39182889
Get a pair of disks that HP supports for this controller and don't worry so much about the RPMs.
0
 
LVL 55

Expert Comment

by:andyalder
ID: 39183852
300GB HP disk will do, you'll only get 147GB of data on it although if you replace the second one afterwards with another 300GB then you can expand the current logical disk to 300GB if you have a cache battery.

Disk is still available though, caddy swap is quite easy, http://www.newegg.com/Product/Product.aspx?Item=N82E16822116176
0
 

Author Comment

by:firstnetsupport
ID: 39186101
Can you explain how the battery cache works, and how to check whether the controller has one and is using it?

I have 2 x 300GB drives already installed in another RAID 1+0 array, which I can use to replace the faulty drive. I can re-use the 147GB drives for something else later on.

Steps to achieve this?

1. Delete the second array, power off, and remove the 2 x 300GB drives from bays 3/4
2. Back up the server (I'm using the built-in SBS backup)
3. Pull the faulty 147GB drive (while it's powered on and in Windows)
4. Put the 300GB HP drive into bay 1

How do I check that all the data is in sync?

5. After checking the data is synced, reboot and check for problems
6. Pull the second 147GB drive in bay 2, put the other 300GB drive in, and allow it to sync with the first new 300GB drive (at 147GB size)
7. Resize the disks somehow
0
 
LVL 55

Assisted Solution

by:andyalder
andyalder earned 333 total points
ID: 39193795
That will work fine. To check rebuild progress just look in the ACU. It will take about 2 hours.

To resize the current logical disk, highlight it in the ACU and the extend option appears if you have a battery. If not, you can still create a second logical disk.
0
 

Author Comment

by:firstnetsupport
ID: 39256163
Hi, sorry I haven't been on for a while, I've been away. I've just come back and we have had a replacement drive sent out that matches the original one that's dying.

Just to ask, how do I check that this server and also the drive support hot swapping?
0
 
LVL 47

Accepted Solution

by:dlethe
dlethe earned 167 total points
ID: 39256182
Hot swapping is always supported in SCSI, SAS, Fibre Channel, and SATA drives. It is part of the design spec of the physical interface and protocol.

Granted if there is no hot-swap bay and it is screwed in, then it is difficult to do, but you can still unplug power/signal cabling at any time without messing it up as long as you take proper precautions against static electricity.
0
 

Author Comment

by:firstnetsupport
ID: 39256242
Cool, I feel better now :-) So I'm going to get to site, take a quick flat file backup on a USB drive, pull the faulty drive, plug in the new replacement, and job done? :-)
0
 

Author Closing Comment

by:firstnetsupport
ID: 39284144
Thanks for the help on this one guys. I attended site, ripped the drive out and popped in the new one, and the array went ahead and rebuilt itself with the new drive.

Really appreciate the advice from dlethe and andyalder.
0
