exadmin2006

HP Dl585 G6 Disk Failure Question

Hey all,

We just rolled out 30+ new DL585 G6s (with the most recent firmware). They are set up as follows:

1. Disk Bays 1-2: RAID 1 (2 x 146 GB disks, 15k)
2. Disk Bays 3-8: RAID 5 (6 x 146 GB disks, 10k)

These all run Server 2003 R2 Enterprise, 64-bit. For some reason, on almost all the servers, the disk in Bay 3 (the first disk of the second array) keeps going bad. We've replaced some disks, moved others around, etc., which sometimes works for a day or two, but then the disk goes bad again. We figure it can't be that we have a bad disk in the same bay of every server. HP doesn't seem to know at this point either. We even tried a different array config (that is, we changed the second array to RAID 10)...no luck.

I figured maybe this was a known issue or something, but no luck. Any ideas?

Thanks.
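
The reasoning in the question (the same bay failing on many servers, and replacement disks failing again in that bay) can be checked with a simple tally. The sketch below is only an illustration: the server names, serial numbers, and records are hypothetical placeholders, not data from this thread. It counts failures per bay and lists the slots where more than one distinct disk has failed, which would suggest the problem follows the bay rather than the disk.

```python
from collections import Counter, defaultdict

# Hypothetical failure records: (server, bay, disk serial). Placeholder data only.
failures = [
    ("srv01", 3, "SN-A"),
    ("srv01", 3, "SN-B"),  # replacement disk failed in the same bay
    ("srv02", 3, "SN-C"),
    ("srv02", 3, "SN-D"),
    ("srv03", 3, "SN-E"),
]

def failures_by_bay(records):
    """Count how many failures were logged against each bay number."""
    return Counter(bay for _, bay, _ in records)

def suspect_slots(records):
    """Return (server, bay) slots where more than one distinct disk has failed."""
    serials = defaultdict(set)
    for server, bay, serial in records:
        serials[(server, bay)].add(serial)
    return sorted(slot for slot, seen in serials.items() if len(seen) > 1)

by_bay = failures_by_bay(failures)
slots = suspect_slots(failures)
print(by_bay)
print(slots)
```

If most failures land on one bay and several different disks have failed in the same slot, the disks themselves become a less likely cause than something tied to that slot.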
David

Specifically what make/model of disk?
exadmin2006 (ASKER)

Good point...the disk makes are:

First array (good array): 146GB 2-port SAS 15k EH0146FARWD
Second array (failing): 146GB 2-port SAS 10K DG146BB976

Not sure of the make (like Seagate, etc.) as I don't have access.
Well, they are HP disks, so at least you aren't dealing with third-party drives, and HP is on the hook. What you can do is:
1) Check whether the firmware is old, and upgrade it. The HP support site will have upgrades and, more importantly, release notes. There are ALWAYS bugs in disk (and, for that matter, controller) firmware, so make sure everything is current.

2) If you have nothing else to do in the interim, you can get yourself a JBOD SAS controller (you can't do this with the HP controllers) and run some extreme diagnostics that will tell you the exact nature of what is going on. That has a cost associated with it, though, especially if you don't have a JBOD controller and a way to hook up the drives. Instead, look at all the event logs in the controller. They won't give you much, but it might be enough. SAS drives present a great deal of reportable information, dozens of fields, and the totals are kept in non-volatile memory within the disks, so you could take a few drives that failed and run the software on a JBOD controller. (Look at http://www.santools.com/smart/unix/manual, and go to the log pages for SAS disks.)

This is from the site, to give you an idea of what the disks will report. I'm just scratching the surface, as you can also run self-tests, get link speeds, and verify data. If you run diagnostics on some of the disks that failed and see the nature of the errors (if any), this will tell you whether you just have bad luck with some disk drives. Or maybe the disks are perfectly fine and pass all diagnostics. If so, blame the controller or backplane.

 Write errors corrected with possible delays: 0 [4]
 Total Write errors: 0 [4]
 Write errors corrected: 0 [4]
 Times correction algorithm processed (on Writes): 0 [4]
 Bytes processed (on Writes): 353948013568 [8]
 Unrecovered errors (on Writes): 0 [4]
 Read errors corrected without substantial delay: 605260 [4]
 Read errors corrected with possible delays: 9 [4]
 Total Read errors: 0 [4]
 Read errors corrected: 605269 [4]
 Times correction algorithm processed (on Reads): 605996 [4]
 Bytes processed (on Reads): 652188835328 [8]
 Unrecovered errors (on Reads): 727 [4]
 Verify errors corrected without substantial delay: 590 [4]
 Verify errors corrected with possible delays: 0 [4]
 Total Verify errors: 0 [4]
 Verify errors corrected: 590 [4]
 Times correction algorithm processed (on Verifys): 590 [4]
 Bytes processed (on Verifys): 0 [8]
 Unrecovered errors (on Verifys): 0 [4]
 Total Non-medium errors: 0 [4]
 Current temperature +/- 3 degrees C: 32
 Reference temperature +/- 3 degrees C: 68
 Background scanning status: 8
 Number of background scans performed: 35
 Background scan percentage completed: 35
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Invalid dwords:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Running disparity errors:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Loss of dword syncs:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Reset problems:  0
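
If you save a report like the one above to text, a few lines of Python can pull out the counters worth a second look. This is a rough sketch, not part of the diagnostic tool: the function names are made up for illustration, and the rule of thumb (non-zero unrecovered errors point at the disk; non-zero SAS phy or non-medium errors point at the link, backplane, or controller) is an assumption rather than an HP or vendor rule.

```python
import re

# A few lines copied from the report above ("Name: value [size]").
SAMPLE = """\
 Total Write errors: 0 [4]
 Unrecovered errors (on Writes): 0 [4]
 Read errors corrected: 605269 [4]
 Unrecovered errors (on Reads): 727 [4]
 Unrecovered errors (on Verifys): 0 [4]
 Total Non-medium errors: 0 [4]
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Invalid dwords:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Loss of dword syncs:  0
"""

# Counter name, a colon, an integer value, and an optional "[size]" suffix.
LINE_RE = re.compile(r"^\s*(?P<name>.+?):\s+(?P<value>\d+)(?:\s+\[\d+\])?\s*$")

def parse_counters(text):
    """Return {counter name: integer value} for every line that matches."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters

def triage(counters):
    """List non-zero counters, tagged by the assumed rule of thumb."""
    findings = []
    for name, value in counters.items():
        if value == 0:
            continue
        if name.startswith("Unrecovered errors"):
            findings.append(f"media: {name} = {value}")
        elif name.startswith("SAS Phy") or name == "Total Non-medium errors":
            findings.append(f"link: {name} = {value}")
    return findings

counters = parse_counters(SAMPLE)
for finding in triage(counters):
    print(finding)
```

On the sample lines this flags only the 727 unrecovered read errors. The phy counters are all zero, so nothing in this particular report suggests a link problem.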