Solved

Urgent : Critical error on an HP P4300 storage

Posted on 2013-01-30
Last Modified: 2013-02-07
Hello experts,

I have a cluster of four HP P4300 storage nodes. Each node uses a RAID 5 configuration, and the cluster is configured with Network RAID 10+2.

One of the storage nodes has critical errors: all of its disks show the status "Off or removed" and the node is unavailable (image below). I need help troubleshooting it.

I am thinking about removing the failed node from the cluster and the management group so that I can rebuild its RAID 5 configuration. I would then add it back to the management group and the cluster, but I don't know whether this will cause any data loss.
Are there other solutions? Please help.


screenshot
log error messages
Event: EID_CACHE_CORRUPT E01020101
Severity: Critical
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Cache
Object Name: Cache 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance.
*****

Event: EID_DISK_OFF E01030105
Severity: Warning
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Disk
Object Name: Drive 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Drive 1' status is 'Off or removed'.


Event: EID_SYSTEM_RAID_OFF E01000104
Severity: Critical
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: System
Object Name: DataRaid
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The RAID status is 'Off'. Note: This is normal if you are replacing a drive in RAID0.


Event: EID_BBU_UNKNOWN E01020209
Severity: Warning
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Cache
Object Name: Cache 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Cache 1' BBU status is 'Unknown'.  Contact technical support for assistance.
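When an export like the one above runs to many events, it can help to split it into records and look at the critical ones first. The following is a minimal Python sketch, assuming the export keeps the `Key: value` layout shown in this post, with each record starting at an `Event:` line. The names `parse_events` and `SAMPLE` are illustrative and not part of any HP tool.

```python
from typing import Dict, List


def parse_events(text: str) -> List[Dict[str, str]]:
    """Split a 'Key: value' event export into one dict per event."""
    events: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and separator lines such as "*****".
        if not line or set(line) == {"*"}:
            continue
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        # Assumption: every record begins with an "Event:" line.
        if key == "Event" and current:
            events.append(current)
            current = {}
        current[key] = value
    if current:
        events.append(current)
    return events


# Shortened copy of three events from this thread.
SAMPLE = """\
Event: EID_CACHE_CORRUPT E01020101
Severity: Critical
Object Name: Cache 1
Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance.
*****

Event: EID_DISK_OFF E01030105
Severity: Warning
Object Name: Drive 1
Message: The 'Drive 1' status is 'Off or removed'.

Event: EID_SYSTEM_RAID_OFF E01000104
Severity: Critical
Object Name: DataRaid
Message: The RAID status is 'Off'. Note: This is normal if you are replacing a drive in RAID0.
"""

events = parse_events(SAMPLE)
critical = [e["Event"] for e in events if e.get("Severity") == "Critical"]
print(critical)
```

Splitting on the first colon only keeps the timestamp and the `Note:` text inside the message intact.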



Diagnostic test

Yes		Cache Status                       		FAIL Cache 1 -> Corrupt
		                                   		
------------------------------------------------------------------------------------------------------------------------
Yes		Cache BBU Status                   		FAIL Cache 1 -> Unknown
		                                   		
------------------------------------------------------------------------------------------------------------------------
Yes		Disk Status Test                   		FAIL Drive 1 -> Off or removed
		                                   		Drive 2 -> Off or removed
		                                   		Drive 3 -> Off or removed
		                                   		Drive 4 -> Off or removed
		                                   		Drive 5 -> Off or removed
		                                   		Drive 6 -> Off or removed
		                                   		Drive 7 -> Off or removed
		                                   		Drive 8 -> Off or removed
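The diagnostic output above can be condensed into a component-to-status map, which makes it easier to see at a glance that every drive and the cache are affected. This is a rough Python sketch, assuming the tab-separated layout and the `name -> status` convention seen in this post; `failed_components` and `REPORT` are hypothetical names used only for illustration.

```python
from typing import Dict


def failed_components(report: str) -> Dict[str, str]:
    """Collect 'component -> status' pairs from a diagnostic report."""
    result: Dict[str, str] = {}
    for line in report.splitlines():
        # Only lines carrying a "name -> status" pair are of interest.
        if "->" not in line:
            continue
        # The pair sits in the last tab-separated column.
        last = line.split("\t")[-1].strip()
        # The first line of each test carries a "FAIL " prefix.
        if last.startswith("FAIL "):
            last = last[len("FAIL "):]
        name, _, status = last.partition("->")
        result[name.strip()] = status.strip()
    return result


# Shortened copy of the report in this thread (tabs written as \t).
REPORT = (
    "Yes\t\tCache Status\t\tFAIL Cache 1 -> Corrupt\n"
    "Yes\t\tDisk Status Test\t\tFAIL Drive 1 -> Off or removed\n"
    "\t\t\t\tDrive 2 -> Off or removed\n"
)

failed = failed_components(REPORT)
print(failed)
```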

Question by:cismoney
8 Comments
 
LVL 19

Assisted Solution

by:strivoli
strivoli earned 100 total points
ID: 38835069
Seriously, if your data is valuable (I'm quite sure it is) and you have any sort of maintenance contract on the storage, contact HP as your first choice. Even a minor mistake could make things worse. Only HP's highly trained technicians will be able to help you with a good chance of success.
 

Author Comment

by:cismoney
ID: 38835194
Thanks @strivoli. Unfortunately, the client told me he doesn't have a support contract with HP, and I am his local support, so I need to find the solution.
This is why I am requesting help from the Experts Exchange storage specialists.
 

Author Comment

by:cismoney
ID: 38838396
ok.

Experts, I am looking for workable solutions or hints on how to resolve the issue, not for comments about the fact that we don't have a service contract, so please help me.

 
LVL 37

Expert Comment

by:ArneLovius
ID: 38838972
As you have the error "Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance", the first step should be to contact HP support.

http://blog.mrpol.nl/2011/09/10/how-to-solve-hp-p4000-cache-status-corrupt-2/
 
LVL 17

Expert Comment

by:Gerald Connolly
ID: 38839027
Has the Cache battery died/failed on the node in question?
 
LVL 10

Accepted Solution

by:millardjk
millardjk earned 400 total points
ID: 38839238
Stop. Don't do anything else until you have a RESTORABLE backup of all the customer's data. The P4000 array is a resilient thing, but if something goes horribly wrong, you could end up losing everything and still needing to contact HP for an expensive support case!

If all the volumes are running in "Network RAID 10+2" as you indicate--and you have the capacity to spare--you should be able to successfully eject the node from the cluster. Once you've normalized the cluster, you can begin troubleshooting the problematic node independently, even to the point of wiping the array and reinstalling the SAN/iQ operating system. Doing that, however, may be a challenge if the array hasn't been kept up-to-date and/or you don't have access to the appropriate installation media.
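The "capacity to spare" condition can be roughed out before touching anything. Below is a back-of-the-envelope Python sketch, assuming equally sized nodes, pooled capacity, and four stored copies of the data for Network RAID 10+2. The 4.0 TB per-node figure and the provisioned sizes are hypothetical, and `usable_tb` and `fits_after_removal` are illustrative names. It ignores snapshots, thin provisioning, and whether the protection level is still permitted on the smaller cluster, so treat the figures the CMC reports as authoritative.

```python
def usable_tb(node_count: int, node_tb: float, copies: int) -> float:
    """Usable capacity if every block is stored `copies` times (assumption)."""
    return node_count * node_tb / copies


def fits_after_removal(node_count: int, node_tb: float,
                       copies: int, provisioned_tb: float) -> bool:
    """True if the provisioned data still fits with one node removed."""
    return provisioned_tb <= usable_tb(node_count - 1, node_tb, copies)


# Hypothetical figures: 4 nodes of 4.0 TB each, 4 copies of the data.
NODES, NODE_TB, COPIES = 4, 4.0, 4

before = usable_tb(NODES, NODE_TB, COPIES)
after = usable_tb(NODES - 1, NODE_TB, COPIES)
print(f"usable before: {before} TB, after removing one node: {after} TB")
print("2.5 TB provisioned fits:", fits_after_removal(NODES, NODE_TB, COPIES, 2.5))
print("3.5 TB provisioned fits:", fits_after_removal(NODES, NODE_TB, COPIES, 3.5))
```

If the second check fails for the real numbers, the node cannot simply be ejected without first freeing space or changing how the volumes are protected.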

Whether you have the free capacity or not, you will want to cold boot (i.e. power cycle) the node to give a full POST a chance to reinitialize the cache. Depending on the nature of the root cause, you could end up with a system that can't even boot (permanent hardware failure in HBA cache; corrupted array; etc.) which will require hands-on repairs--whether that's parts replacement or rebuilding the arrays on the node and reinstalling the OS--or you could end up with the system being fully functional and everything nominal.

It's more likely that you'll end up somewhere in between: the node will boot but will still be throwing errors. At that point, you would see whether you can reconfigure the RAID level to get it to rebuild the storage partitions; you lose the data on the node--which will require a re-sync--but by then you can pretty much expect that to be necessary anyway.

The CMC might not allow you to reconfigure the array, however, because of the volume(s) it "thinks" it's participating in. At that point, you'll have to forcibly remove the node from the management group using the node's console. Once you've done that, you can use the "repair a storage system" option to flag the node within the cluster, and you can set about making your repairs: you can always rebuild the RAID on a node which is not participating in a cluster.

Ultimately, however, you may be looking at this error occurring again if the cache on the controller is going bad; depending on the cache--especially if it's Flash-based--you could be looking at cache parts replacement at a minimum, total replacement of the HBA as a worst-case.

Question has a verified solution.
