Solved

HP DL585 G6 Disk Failure Question

Posted on 2010-08-13
Last Modified: 2012-05-10
Hey all,

We just rolled out 30+ new DL585 G6's (with the most recent firmware). They are set up as follows:

1. Disk Bays 1-2: RAID 1 (2x146gig disk, 15k)
2. Disk Bays 3-8: RAID 5 (6x146gig disk, 10k)

These all run Server 2003 R2 Enterprise, 64-bit. For some reason, on almost all the servers, the disk in Bay 3 (the first disk of the second array) keeps going bad. We've replaced some disks, moved others around, etc., which sometimes works for a day or two, but then the disk goes bad again. We figure it can't be that we have a bad disk in the same bay of every server. HP doesn't seem to know at this point either. We even tried a different array config (that is, turned the second array into a RAID 10)...no luck.

I figured maybe this was a known issue or something, but no luck. Any ideas?

Thanks.
Question by:exadmin2006
4 Comments
 
LVL 47

Expert Comment

by:dlethe
ID: 33429105
Specifically what make/model of disk?
 
LVL 47

Accepted Solution

by:
dlethe earned 500 total points
ID: 33429130
If this is all HP kit, and so under HP warranty, then I would just demand that HP come out and fix it. Geez, you bought, what, $100,000 worth of hardware? Make it their problem, talk to the regional service manager if you have to, and get them to send out a team to make it right, or tell them to send out somebody to deinstall it and take it back. This is unacceptable.
 

Author Comment

by:exadmin2006
ID: 33429239
Good point...the disk makes are:

First array (good array): 146GB 2-port SAS 15k EH0146FARWD
Second array (failing): 146GB 2-port SAS 10K DG146BB976

Not sure of the make (like Seagate, etc.) as I don't have access.
 
LVL 47

Expert Comment

by:dlethe
ID: 33429426
Well, they are HP disks, so at least you aren't dealing with 3rd-party, so HP is on the hook. What you can do is:
1) Check to see if the firmware is old, and upgrade. The HP support site will have upgrades and, more importantly, release notes. There are ALWAYS bugs in disk (and, for that matter, controller) firmware, so make sure everything is current.

2) If you have nothing else to do in the interim, you can get yourself a JBOD SAS controller (you can't do this with the HP controllers) and run some extreme diagnostics that will tell you the exact nature of what is going on. That has a cost associated with it, though, especially if you don't have a JBOD controller and a way to hook up the drives. Instead, look at all the event logs in the controller. They won't give you much, but it might be enough. SAS drives present a great deal of reportable information, dozens of fields, and the totals are kept in non-volatile memory within the disks, so you could take a few drives that failed and run the software on a JBOD controller. (Look at http://www.santools.com/smart/unix/manual, and go to the log pages for SAS disks.)

This is from the site to give you an idea of what the disks will report, and I'm just scratching the surface, as you can also run self-tests, get link speeds, and verify data. So if you run diagnostics on some of the disks that failed and see the nature of the errors (if any), this will tell you whether you just have bad luck with some disk drives. Or maybe the disks are perfectly fine and pass all diagnostics. If so, blame the controller or backplane.

 Write errors corrected with possible delays: 0 [4]
 Total Write errors: 0 [4]
 Write errors corrected: 0 [4]
 Times correction algorithm processed (on Writes): 0 [4]
 Bytes processed (on Writes): 353948013568 [8]
 Unrecovered errors (on Writes): 0 [4]
 Read errors corrected without substantial delay: 605260 [4]
 Read errors corrected with possible delays: 9 [4]
 Total Read errors: 0 [4]
 Read errors corrected: 605269 [4]
 Times correction algorithm processed (on Reads): 605996 [4]
 Bytes processed (on Reads): 652188835328 [8]
 Unrecovered errors (on Reads): 727 [4]
 Verify errors corrected without substantial delay: 590 [4]
 Verify errors corrected with possible delays: 0 [4]
 Total Verify errors: 0 [4]
 Verify errors corrected: 590 [4]
 Times correction algorithm processed (on Verifys): 590 [4]
 Bytes processed (on Verifys): 0 [8]
 Unrecovered errors (on Verifys): 0 [4]
 Total Non-medium errors: 0 [4]
 Current temperature +/- 3 degrees C: 32
 Reference temperature +/- 3 degrees C: 68
 Background scanning status: 8
 Number of background scans performed: 35
 Background scan percentage completed: 35
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Invalid dwords:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Running disparity errors:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Loss of dword syncs:  0
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Reset problems:  0
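
If you collect dumps like the one above from several failed drives, a small script can pull out the counters worth a second look. This is only a rough sketch: it assumes the plain "name: value [width]" text layout shown in the listing, the function names `parse_counters` and `suspicious` are made up for illustration, and it does not use any santools API. The choice of which counters to flag is my own assumption: unrecovered errors tend to point at the drive itself, while the SAS Phy counters (invalid dwords, disparity errors, loss of sync) usually point at the link, i.e. cabling, backplane or controller.

```python
import re

# Matches "name: value" with an optional trailing "[width]" field,
# which is the layout of the listing above (an assumption about the tool's output).
LINE_RE = re.compile(r"^\s*(?P<name>.+?):\s+(?P<value>\d+)(?:\s+\[\d+\])?\s*$")

# Counter names to flag when nonzero (hypothetical selection, adjust as needed).
KEYWORDS = (
    "Unrecovered errors",
    "Non-medium errors",
    "Invalid dwords",
    "Running disparity errors",
    "Loss of dword syncs",
    "Reset problems",
)


def parse_counters(text):
    """Turn a log-page dump into a {counter name: integer value} dict."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


def suspicious(counters):
    """Return the names of flagged counters whose value is nonzero."""
    return [
        name
        for name, value in counters.items()
        if value and any(key in name for key in KEYWORDS)
    ]


# A few lines copied from the listing above.
SAMPLE = """\
 Unrecovered errors (on Writes): 0 [4]
 Read errors corrected with possible delays: 9 [4]
 Unrecovered errors (on Reads): 727 [4]
 Total Non-medium errors: 0 [4]
 Current temperature +/- 3 degrees C: 32
 SAS Phy #0 (50-00-C5-00-06-94-BF-FD) - Invalid dwords:  0
"""
```

Calling `suspicious(parse_counters(SAMPLE))` on these lines flags only "Unrecovered errors (on Reads)", since the Phy counter is zero. If the drives pulled from Bay 3 all come back with an empty list, that fits the "disks are fine, blame the controller or backplane" case.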