Solved

Suspect RAID Card failing

Posted on 2011-09-28
Medium Priority
425 Views
Last Modified: 2012-09-09
We have an HP DL385 G2 rackmount server running VMware ESX 4.1.
It has two RAID cards. The internal one (P400) seems fine, but over the last few days the P800 (512MB BBWC), which is connected to an MSA60, appears to have been halting. A while back, after a reboot, the server hung not long after the VM guests came back online, but a hard reset fixed it. I know of a case where another HP RAID card halted under heavy I/O (which seemed similar); however, even after a full firmware update today, it happened again 4 hours later.

We also have a Seagate ES drive (1TB) in the MSA that has been marked as faulty twice today. I'm not convinced it is a genuine drive error, as only today (tonight's crash being the 5th since Monday night) has it shown as failed, and both times it came back OK after a remove, reseat and re-sync.

Can anyone give me a heads-up or any advice on what it could be? Is it as simple as an actual HDD failing (the one in question is part of a 1TB RAID1 set), or is there an issue with the RAID controller?

As preventative measures I've already changed the cache ratio to 75% read / 25% write, disabled the array accelerator on the RAID1 mentioned above, and lowered the load on the server, but I now have to think about moving images to another VM server or making other arrangements!
Screenshot of system post failure: Screenshot of post-RAID halt.
Diagnostics report of HP RAID ACU: report-4e82d645-000065bc-0000000.zip
Question by:kiwistag
13 Comments
 
LVL 9

Expert Comment

by:Lester_Clayton
ID: 36715517
There could be many causes for this problem, from poor cable connections to a faulty controller or a faulty backplane. Check that the P800 controller is properly seated and that the cables to the backplane or chassis are firmly connected.

One thing that has caught me out with a RAID array is the backplane deciding it was getting too hot and switching some of the hard drives off to protect against catastrophe, causing a RAID failure in the process.

It's unlikely for all these drives to fail at the same time, so I'd look at potentially replacing the cables and/or backplane/chassis. Of course, the controller could still be at fault, but usually when they fail, they just stop working altogether.
 
LVL 6

Author Comment

by:kiwistag
ID: 36715570
Everything is in a controlled environment (incl. temperature) and the cabinet hasn't been moved. I'll check the backplane seating; however, it definitely seems to happen more often when one VM (which does use that RAID1) is running.

I did consider the P800 seating and will have a look.
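
One way to put a number on the "more often when one VM is running" impression is to keep a simple log of crash times and of the periods the VM was powered on, then count the overlap. A minimal Python sketch follows; the function names and all timestamps are placeholders for illustration only, not values from this system.

```python
from datetime import datetime


def parse(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp from a hand-kept log."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")


def crashes_while_running(crashes, running_intervals):
    """Count crashes that fall inside any [start, end) running interval."""
    count = 0
    for crash in crashes:
        if any(start <= crash < end for start, end in running_intervals):
            count += 1
    return count


# Placeholder values for illustration only (not taken from the thread).
crashes = [
    parse("2000-01-01 02:00"),
    parse("2000-01-01 09:30"),
    parse("2000-01-02 14:00"),
]
vm_running = [(parse("2000-01-01 01:00"), parse("2000-01-01 10:00"))]

hits = crashes_while_running(crashes, vm_running)
print(f"{hits} of {len(crashes)} crashes occurred while the VM was running")
```

With only a handful of crashes this proves nothing on its own, but it makes it easier to compare the suspect VM against the others instead of relying on memory.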
 
LVL 56

Expert Comment

by:andyalder
ID: 36715883
It does sound like the card is failing; although the MSA60 has firmware, it's pretty dumb. One thing to try is to make sure the MSA60 is on the port that doesn't go through the P800's expander (although an expander failure shouldn't cause a card lockup). The port furthest away from the LEDs is the one that bypasses the expander.
 
LVL 123
ID: 36716693
1. Back up the contents of the MSA60.
2. Shut down the server and the MSA60.
3. Remove ALL disks from the MSA60 (just unplug them; no need to remove them from the chassis).
4. Power on the server. Does the error still occur?

We've had a similar fault with an MSA60, and the backplane needed replacing.
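
The isolation steps above can be sketched as a small routine. This is only an illustration: re-adding disks one at a time after step 4 is an assumed extension of the list, and `fault_occurs` is a hypothetical stand-in for "run the server under load in this configuration and watch for the lockup".

```python
from typing import Callable, List, Optional


def isolate_fault(
    disks: List[str],
    fault_occurs: Callable[[List[str]], bool],
) -> Optional[str]:
    """Narrow a fault down to the enclosure side or to a single disk.

    Returns 'enclosure-or-controller' if the fault shows up with no disks
    installed, the name of the first disk whose insertion brings the fault
    back, or None if the fault never reappears.
    """
    installed: List[str] = []

    # Step 4 in the list: empty enclosure, does the error still occur?
    if fault_occurs(installed):
        return "enclosure-or-controller"

    # Assumed extension: put the disks back one at a time.
    for disk in disks:
        installed.append(disk)
        if fault_occurs(installed):
            return disk

    return None


# Hypothetical usage with a fake check that blames the disk in bay 2.
suspect = isolate_fault(["bay1", "bay2", "bay3"], lambda d: "bay2" in d)
print(suspect)
```

For an intermittent fault like this one, each configuration would need to run long enough under load for a "no fault" result to mean anything.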
 
LVL 6

Author Comment

by:kiwistag
ID: 36719737
The error only occurs after a crash. A normal reboot is fine. (We did a maintenance update on ESX in Maintenance Mode, so it behaved itself, and the boot message from the controller only mentioned that the RAID1 array rebuild was in progress.)
 
LVL 123
ID: 36719866
"The error only occurs after a crash."

Your ESX server crashes?

Could just be a hung controller in the MSA60.

Update the firmware if applicable.
 
LVL 6

Author Comment

by:kiwistag
ID: 36720050
It's a sort of weird one.
(Almost) all guests become unresponsive in the vSphere Client (the only option offered is power off, and that stops at 95% in the GUI). Sometimes some guests are still running, but most of the time I have to reset or power down the host because the ESX console can't finish the shutdown procedure (it still sees the guests as running).

All firmware was updated yesterday, including the ROMPaq (individually), the SAS HDD firmware and the iLO 2 controller (via HP's Firmware Update 9.90 DVD). I have yet to see a firmware update for the MSA.

I can only tentatively assume that one array is at fault, as that seems the most logical answer at present until enough evidence suggests otherwise (but of course I have to look for that specific evidence first). We have an HP authorized engineer coming in a few hours to bounce ideas around with, but it's not really a black & white fault.. :(
 
LVL 56

Expert Comment

by:andyalder
ID: 36813094
Even if the MSA60 caught fire, the controller should not lock up. The same goes for individual disks; it's either an undocumented bug or the controller is going faulty.
 
LVL 6

Author Comment

by:kiwistag
ID: 36813338
The engineer stated that, although it was odd, the disk was not an HP part so it wouldn't be supported. The most logical explanation was that the disk had failed (earlier without reporting an error, and lately reporting one), which knocked things out.
On the ESX side of things, we can only assume that once ESX saw a datastore as unresponsive it got angsty and started to play up (in simple terms).
With the failed disk removed, we've restarted the VM guest that uses that specific datastore, and so far so good...
 
LVL 123

Accepted Solution

by:
Andrew Hancock (VMware vExpert / EE MVE^2) earned 1000 total points
ID: 36813973
Umm, if you are using non-HP parts with non-HP firmware, weird things could happen.
 
LVL 56

Assisted Solution

by:andyalder
andyalder earned 1000 total points
ID: 36814050
Whilst not recommended, HP controllers do support generic disks.
 
LVL 6

Author Comment

by:kiwistag
ID: 36814225
Yeah - it's weird, as the previous HP Firmware DVD actually upgraded the Seagate firmware on those disks as part of its release. They may not be "HP Certified" disks ($400 vs $1,200 at the time), but like for like they are the same, apart from any extra firmware tweaks HP puts on.

It's madness in a way, as the 4 other 3.5" SATA HP Certified disks, all the same size (250GB) and all ordered at the same time, are 2x Seagate and 2x Western Digital..

We have desktop/laptop-standard 2.5" HDDs in the server itself (in the hot-plug trays) and have had the odd minor fault (a spurious over-temp issue), but an eject and re-seat has worked fine. One did it twice at the start of the year and nothing since. (Those run off the P400 controller.)

So far so good tonight with it running. Hopefully the server behaves itself once the faulty HDD comes back from RMA.
 
LVL 6

Author Comment

by:kiwistag
ID: 38381712
The actual issue (found at the end of 2011) was that the I/O controller on the MSA itself was faulty.
We found this out after the entire controller failed.
Since replacement we have had no more issues whatsoever.