Solved

HP DL585 G6 Disk Failure Question

Posted on 2010-08-13
Last Modified: 2012-05-10
Hey all,

We just rolled out 30+ new DL585 G6s (with the most recent firmware). They are set up as follows:

1. Disk Bays 1-2: RAID 1 (2x146gig disk, 15k)
2. Disk Bays 3-8: RAID 5 (6x146gig disk, 10k)

These all run Server 2003 R2 Enterprise, 64-bit. For some reason, on almost all the servers, the disk in Bay 3 (the first disk of the second array) keeps going bad. We've replaced some disks, moved others around, etc., which sometimes works for a day or two, but then the disk goes bad again. We figure it can't be that we have a bad disk in the same bay of every server. HP doesn't seem to know at this point either. We even tried a different array config (that is, we changed the second array to RAID 10)...no luck.
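As a rough sanity check on that reasoning, here is a minimal Python sketch of the odds that the same bay fails on most servers purely by chance. The 5% per-disk failure probability, the 30 servers, and the threshold of 25 affected servers are illustrative assumptions, not figures from this rollout, and the model assumes failures are independent.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial tail: chance that at least k of n independent disks fail,
    when each one fails with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical inputs, chosen only for illustration.
P_DISK_FAIL = 0.05   # assumed chance a given Bay 3 disk is bad
N_SERVERS = 30       # "30+" servers in the rollout
K_AFFECTED = 25      # stand-in for "almost all the servers"

if __name__ == "__main__":
    chance = prob_at_least(K_AFFECTED, N_SERVERS, P_DISK_FAIL)
    print(f"Chance of coincidence under these assumptions: {chance:.3e}")
```

Under any plausible per-disk failure rate the result is vanishingly small, which supports looking for a shared cause (firmware, controller, or backplane) rather than a run of bad disks.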

I figured maybe this was a known issue or something, but no luck. Any ideas?

Thanks.
Question by:exadmin2006
 
LVL 47

Expert Comment

by:dlethe
ID: 33429105
Specifically what make/model of disk?
 
LVL 47

Accepted Solution

by:
dlethe earned 500 total points
ID: 33429130
If this is all HP kit and under HP warranty, then I would just demand that HP come out and fix it. Geez, you bought, what, $100,000 worth of hardware? Make it their problem: talk to the regional service manager if you have to, and get them to send out a team to make it right, or tell them to send somebody to deinstall it and take it back. This is unacceptable.
 

Author Comment

by:exadmin2006
ID: 33429239
Good point...the disk makes are:

First array (good array): 146GB 2-port SAS 15k EH0146FARWD
Second array (failing): 146GB 2-port SAS 10K DG146BB976

Not sure of the make (like Seagate, etc.) as I don't have access.
 
LVL 47

Expert Comment

by:dlethe
ID: 33429426
Well, they are HP disks, so at least you aren't dealing with third-party drives, and HP is on the hook. What you can do is:
1) Check whether the firmware is old, and upgrade it. The HP support site will have upgrades and, more importantly, release notes. There are ALWAYS bugs in disk (and, for that matter, controller) firmware, so make sure everything is current.

2) If you have nothing else to do in the interim, you can get yourself a JBOD SAS controller (you can't do this with the HP controllers) and run some extreme diagnostics that will tell you the exact nature of what is going on. That has a cost associated with it, though, especially if you don't have a JBOD controller and a way to hook up the drives. Instead, look at all the event logs in the controller. They won't give you much, but it might be enough. SAS drives present a great deal of reportable information, dozens of fields, and the totals are kept in non-volatile memory within the disks, so you could take a few drives that failed and run the software on a JBOD controller. (Look at http://www.santools.com/smart/unix/manual, and go to the log pages for SAS disks.)

This is from the site to give you an idea of what the disks will report, and I'm just scratching the surface, as you can also run self-tests, get link speeds, and verify data. If you run diagnostics on some of the disks that failed and see the nature of the errors (if any), this will tell you whether you just have bad luck with some disk drives. Or maybe the disks are perfectly fine and pass all diagnostics. If so, blame the controller or backplane.

 Write errors corrected with possible delays: 0 [4]
 Total Write errors: 0 [4]
 Write errors corrected: 0 [4]
 Times correction algorithm processed (on Writes): 0 [4]
 Bytes processed (on Writes): 353948013568 [8]
 Unrecovered errors (on Writes): 0 [4]
 Read errors corrected without substantial delay: 605260 [4]
 Read errors corrected with possible delays: 9 [4]
 Total Read errors: 0 [4]
 Read errors corrected: 605269 [4]
 Times correction algorithm processed (on Reads): 605996 [4]
 Bytes processed (on Reads): 652188835328 [8]
 Unrecovered errors (on Reads): 727 [4]
 Verify errors corrected without substantial delay: 590 [4]
 Verify errors corrected with possible delays: 0 [4]
 Total Verify errors: 0 [4]
 Verify errors corrected: 590 [4]
 Times correction algorithm processed (on Verifys): 590 [4]
 Bytes processed (on Verifys): 0 [8]
 Unrecovered errors (on Verifys): 0 [4]
 Total Non-medium errors: 0 [4]
 Current temperature +/- 3 degrees C: 32
 Reference temperature +/- 3 degrees C: 68
 Background scanning status: 8
 Number of background scans performed: 35
 Background scan percentage completed: 35
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Invalid dwords:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Running disparity errors:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Loss of dword syncs:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Reset problems:  0
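If you collect this kind of output from several failed drives, a short script can do the first pass of the triage described above. This is a minimal Python sketch, assuming the report is plain text with lines shaped like the sample ("name: value [width]"). The function names, the "any non-zero count" thresholds, and reading non-zero phy counters as a link-side problem are my assumptions, not part of the diagnostic tool.

```python
import re

# A few lines in the same shape as the report above.
SAMPLE = """\
 Total Write errors: 0 [4]
 Unrecovered errors (on Writes): 0 [4]
 Read errors corrected: 605269 [4]
 Unrecovered errors (on Reads): 727 [4]
 Unrecovered errors (on Verifys): 0 [4]
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Invalid dwords:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Loss of dword syncs:  0
"""

# "counter name: integer", optionally followed by a "[width]" field.
LINE_RE = re.compile(r"^\s*(?P<name>.+?):\s+(?P<value>\d+)(?:\s+\[\d+\])?\s*$")

# Phy counters that point at the link rather than the media (assumed grouping).
PHY_SUFFIXES = (
    "Invalid dwords",
    "Running disparity errors",
    "Loss of dword syncs",
    "Reset problems",
)


def parse_counters(text):
    """Return {counter name: integer value} for every line that matches."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


def triage(counters):
    """Rough classification: 'disk', 'link', 'both', or 'clean'."""
    media = sum(v for k, v in counters.items()
                if k.startswith("Unrecovered errors"))
    link = sum(v for k, v in counters.items()
               if k.startswith("SAS Phy") and k.endswith(PHY_SUFFIXES))
    if media and link:
        return "both"
    if media:
        return "disk"
    if link:
        return "link"
    return "clean"


if __name__ == "__main__":
    print(triage(parse_counters(SAMPLE)))
```

A "clean" result on a drive the controller marked as failed is the case where you would look at the controller or backplane instead of the disk.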