Solved

Suspect RAID Card failing

Posted on 2011-09-28
Last Modified: 2012-09-09
We have an HP DL385 G2 rackmount server running VMware ESX 4.1.
It has two RAID cards. The internal one (P400) seems to be fine, but over the last few days the P800 (512MB BBWC), which is connected to an MSA60, appears to have been halting. A while back, after a reboot, the server hung not long after the VM guests came back online; a hard reset fixed it. I know of a case where another HP RAID card halted under heavy I/O (which seemed similar), but even after a full firmware update today it happened again 4 hours later.

We also have a Seagate ES drive (1TB) in the MSA that has been marked as faulty twice today. I'm not convinced the drive is the real fault, as it only started showing as failed today (tonight's crash was the 5th since Monday night), and both times a remove, reseat and re-sync brought it back OK.

Can anyone offer advice on what it could be? Is it as simple as an actual HDD failing (the one in question is part of a 1TB RAID1 set), or are there issues with the RAID controller?

I've already cut the caching back to 75% read / 25% write and disabled the array accelerator on the RAID1 mentioned above as preventative measures, as well as lowering the load on the server, but I now have to think about moving images to another VM server or making other arrangements!
Screenshot of system post failure: Screenshot of post-RAID halt.
Diagnostics report of HP RAID ACU: report-4e82d645-000065bc-0000000.zip
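
For readers who want to reproduce the two mitigations described in the question (a 75/25 read/write cache ratio and the array accelerator disabled on one logical drive), here is a minimal Python sketch that only builds the corresponding HP Array Configuration Utility CLI (`hpacucli`) command lines and prints them. Nothing is executed. The slot number and logical drive number are hypothetical placeholders that do not come from the thread, and the `cacheratio` and `arrayaccelerator` keywords are written as I understand the CLI syntax, so check them against the tool's own help before running anything.

```python
# Sketch only: builds (does not run) hpacucli command lines for the two
# mitigations described in the question. Verify syntax against the CLI help.

SLOT = 2            # hypothetical: slot of the P800, not stated in the thread
LOGICAL_DRIVE = 1   # hypothetical: logical drive number of the 1TB RAID1 set


def cache_ratio_cmd(slot, read_pct, write_pct):
    """Return the argv list that sets the controller read/write cache ratio."""
    if read_pct + write_pct != 100:
        raise ValueError("read and write percentages must sum to 100")
    return ["hpacucli", "ctrl", f"slot={slot}", "modify",
            f"cacheratio={read_pct}/{write_pct}"]


def disable_accelerator_cmd(slot, logical_drive):
    """Return the argv list that disables the array accelerator on one drive."""
    return ["hpacucli", "ctrl", f"slot={slot}", "ld", str(logical_drive),
            "modify", "arrayaccelerator=disable"]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in (cache_ratio_cmd(SLOT, 75, 25),
                disable_accelerator_cmd(SLOT, LOGICAL_DRIVE)):
        print(" ".join(cmd))
```

Printing the commands rather than running them is deliberate: on a controller that is already suspected of hanging, it seems safer to review each change by hand.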
Question by:kiwistag
13 Comments
 
LVL 9

Expert Comment

by:Lester_Clayton
ID: 36715517
There could be many causes for this problem, ranging from poor cable connections to a faulty controller or a faulty backplane. Check that the P800 controller is properly seated and that the cables to the backplane or chassis are firm.

One thing that has caught me out with a RAID array is the backplane deciding it's getting too hot and switching some of the hard drives off to protect me from catastrophe, causing a RAID failure in the process.

It's unlikely for all these drives to fail at the same time, so I'd look at potentially replacing the cables and/or backplane/chassis. Of course, the controller could still be at fault, but usually when they fail, they just stop working altogether.
 
LVL 6

Author Comment

by:kiwistag
ID: 36715570
Everything is in a controlled environment (including temperature) and the cabinet hasn't been moved. I'll check the backplane seating; however, it definitely seems to happen more often when one VM (which does use that RAID1) is running.

I did consider the P800 seating and will have a look.
 
LVL 55

Expert Comment

by:andyalder
ID: 36715883
It does sound like the card is failing; although the MSA60 has firmware, it's pretty dumb. One thing to try is to make sure the MSA60 is on the port that doesn't go through the P800's expander (although an expander failure shouldn't cause a card lockup). The port furthest away from the LEDs is the one that bypasses the expander.
 
LVL 118
ID: 36716693
1. Back up the contents of the MSA60.
2. Shut down the server and the MSA60.
3. Remove ALL disks from the MSA60 (just unplug them; no need to remove them from the chassis).
4. Power on the server. Does the error still occur?

We've had a similar fault with an MSA60, and the backplane needed replacing.
 
LVL 6

Author Comment

by:kiwistag
ID: 36719737
The error only occurs after a crash. A reboot is fine. (We did a maintenance update on ESX, in Maintenance mode, so it behaved itself, and the reboot message on the controller only mentioned that a rebuild of the RAID1 array was in progress.)
 
LVL 118
ID: 36719866
The error only occurs after a crash.

Your ESX server crashes?

Could just be a hung controller in the MSA60.

Update firmware if applicable.
 
LVL 6

Author Comment

by:kiwistag
ID: 36720050
It's a sort of weird one.
(Almost) all guests become unresponsive in the vSphere client (the only option is to power off, but that stops at 95% in the GUI). Sometimes some guests are still running, but most times I have to reset or power it down because the ESX console can't finish the shutdown procedure (since it still sees the guests as running).

All firmware was updated yesterday, including the ROMPAQ (individually), SAS HDD firmware and the ILO2 controller (via HP's Firmware Update 9.90 DVD). I'm yet to see a firmware update for the MSA.

For now I can only tentatively assume that one array is at fault, as that seems the most logical answer until enough evidence suggests otherwise (but of course I have to look for that specific evidence first). We have an HP authorized engineer coming in a few hours, so we can bounce the ideas around with him too, but it's not really a black & white fault.. :(
 
LVL 55

Expert Comment

by:andyalder
ID: 36813094
Even if the MSA60 caught fire, the controller should not lock up. The same goes for individual disks. It's either an undocumented bug or the controller is going faulty.
 
LVL 6

Author Comment

by:kiwistag
ID: 36813338
The engineer stated that, although it was odd, the disk was not an HP part so wouldn't be supported. The most logical explanation was that the disk had failed (at first without reporting an error, and lately reporting one), which knocked things out.
On the ESX side of things, we can only assume that once ESX saw a datastore as unresponsive it got angsty and started to play up (in simple terms).
With the failed disk removed, we've restarted the VM guest that uses that specific datastore, and so far so good...
 
LVL 118

Accepted Solution

by:
Andrew Hancock (VMware vExpert / EE MVE) earned 250 total points
ID: 36813973
Umm, if you are using non-HP parts with non-HP firmware, weird things could happen.
 
LVL 55

Assisted Solution

by:andyalder
andyalder earned 250 total points
ID: 36814050
Whilst not recommended, HP controllers do support generic disks.
 
LVL 6

Author Comment

by:kiwistag
ID: 36814225
Yeah - it's weird, as the previous HP Firmware DVD actually upgraded the Seagate firmware on those disks as part of its release. They may not be "HP Certified" disks ($400 vs $1,200 at the time), but like for like they are the same, apart from any extra firmware tweaks HP put on.

It's madness in a way, as the 4 other 3.5" SATA HP Certified disks, all the same size (250GB) and all ordered at the same time, are 2x Seagate & 2x Western Digital..

We have desktop/laptop standard 2.5" HDDs in the server itself (in the hot-plug trays) and have had the odd minor fault (a false over-temp issue), but an eject and re-seat has worked fine. One did it twice at the start of the year & nothing since. (That is run off the P400 controller.)

So far so good tonight with it running. Hopefully the server behaves itself once the faulty HDD comes back from RMA.
 
LVL 6

Author Comment

by:kiwistag
ID: 38381712
The actual issue (found at the end of 2011) was that the I/O controller on the MSA itself was faulty.
We found this out after the entire controller failed.
Since replacement we have had no more issues whatsoever.
