Solved

Urgent : Critical error on an HP P4300 storage

Posted on 2013-01-30
Last Modified: 2013-02-07
Hello experts,

I have a cluster of four HP P4300 storage nodes. Each node has a RAID 5 configuration, and the cluster is configured with Network RAID 10+2.

One of the storage nodes has critical errors: all of its disks show the status "Off or removed" and the node is unavailable (image below). I need help troubleshooting it.

I am thinking about removing the failed node from the cluster and the management group so that I can rebuild its RAID configuration (RAID 5), and then adding it back to the management group and cluster, but I don't know whether that will cause any data loss.
Are there other solutions? Please help.


screenshot
log error messages
Event: EID_CACHE_CORRUPT E01020101
Severity: Critical
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Cache
Object Name: Cache 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance.
*****

Event: EID_DISK_OFF E01030105
Severity: Warning
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Disk
Object Name: Drive 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Drive 1' status is 'Off or removed'.


Event: EID_SYSTEM_RAID_OFF E01000104
Severity: Critical
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: System
Object Name: DataRaid
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The RAID status is 'Off'. Note: This is normal if you are replacing a drive in RAID0.


Event: EID_BBU_UNKNOWN E01020209
Severity: Warning
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Cache
Object Name: Cache 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Cache 1' BBU status is 'Unknown'.  Contact technical support for assistance.
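For anyone triaging a similar export, here is a minimal Python sketch that groups events like the ones above by severity. The field names (`Event`, `Severity`, `Object Name`) come from the log itself. The function names `parse_events` and `by_severity`, and the rule that a new record starts at each `Event:` line, are assumptions made for illustration and are not part of any HP tool.

```python
from collections import defaultdict

# Two events copied from the log above, used as sample input.
SAMPLE = """\
Event: EID_CACHE_CORRUPT E01020101
Severity: Critical
Object Type: Cache
Object Name: Cache 1
Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance.
*****

Event: EID_DISK_OFF E01030105
Severity: Warning
Object Type: Disk
Object Name: Drive 1
Message: The 'Drive 1' status is 'Off or removed'.
"""


def parse_events(text):
    """Split an event export into a list of dicts, one per event.

    Assumption: every record starts with an 'Event:' line and every
    field is a single 'Key: value' line.
    """
    events = []
    current = None
    for raw in text.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # blank lines and separators such as '*****'
        key, value = key.strip(), value.strip()
        if key == "Event":
            current = {"Event": value}
            events.append(current)
        elif current is not None:
            current[key] = value
    return events


def by_severity(events):
    """Map each severity to the object names that reported it."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.get("Severity", "Unknown")].append(
            event.get("Object Name", "?")
        )
    return dict(grouped)


if __name__ == "__main__":
    for severity, names in by_severity(parse_events(SAMPLE)).items():
        print(severity, names)
```

Sorting the critical events to the top makes it easier to see that the cache and RAID events, not the individual drive warnings, are the ones to chase first.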



Diagnostic test

Yes		Cache Status                       		FAIL Cache 1 -> Corrupt
		                                   		
------------------------------------------------------------------------------------------------------------------------
Yes		Cache BBU Status                   		FAIL Cache 1 -> Unknown
		                                   		
------------------------------------------------------------------------------------------------------------------------
Yes		Disk Status Test                   		FAIL Drive 1 -> Off or removed
		                                   		Drive 2 -> Off or removed
		                                   		Drive 3 -> Off or removed
		                                   		Drive 4 -> Off or removed
		                                   		Drive 5 -> Off or removed
		                                   		Drive 6 -> Off or removed
		                                   		Drive 7 -> Off or removed
		                                   		Drive 8 -> Off or removed


Question by:cismoney
8 Comments
 
LVL 19

Assisted Solution

by:strivoli
strivoli earned 100 total points
ID: 38835069
Seriously, if your data is valuable (I'm quite sure it is) and you have any sort of maintenance contract on the storage, contact HP as your first choice. Even a minor mistake could make things worse. Only HP's highly trained technicians will be able to help you with a good chance of success.
 

Author Comment

by:cismoney
ID: 38835194
Thanks @strivoli. Unfortunately, the client told me he doesn't have a support contract with HP, and I am his local support, so I need to find the solution.
This is the reason why I am requesting help from experts-exchange storage specialists ....
 

Author Comment

by:cismoney
ID: 38838396
ok.

Experts, I am looking for valuable solutions or hints on how to resolve the issue, not for comments about the fact that we don't have a service contract, so please help me.

 
LVL 37

Expert Comment

by:ArneLovius
ID: 38838972
As you have the error "Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance", the first step should be to contact HP support.

http://blog.mrpol.nl/2011/09/10/how-to-solve-hp-p4000-cache-status-corrupt-2/
 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 38839027
Has the Cache battery died/failed on the node in question?
 
LVL 10

Accepted Solution

by:millardjk
millardjk earned 400 total points
ID: 38839238
Stop. Don't do anything else until you have a RESTORABLE backup of all the customer's data. The P4000 array is a resilient thing, but if something goes horribly wrong, you could end up losing everything and still needing to contact HP for an expensive support case!

If all the volumes are running in Network RAID 10+2 as you indicate, and you have the capacity to spare, you should be able to successfully eject the node from the cluster. Once you've normalized the cluster, you can begin troubleshooting the problematic node independently, even to the point of wiping the array and reinstalling the SAN/iQ operating system. Doing that, however, may be a challenge if the array hasn't been kept up to date and/or you don't have access to the appropriate installation media.
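To get a rough feel for "capacity to spare", a back-of-the-envelope check can be sketched in Python. Everything here is an assumption for illustration: the function name `can_remove_node`, the idea that each mirror copy must sit on a different node, and the rule that usable cluster space is the smallest remaining node multiplied by the number of remaining nodes. It ignores snapshots, thin provisioning and reserved space, so treat what the CMC actually reports as authoritative.

```python
def can_remove_node(node_capacities_gb, provisioned_gb, copies, failed_index):
    """Rough check: can the cluster still hold the data without one node?

    node_capacities_gb: usable capacity of each node (hypothetical figures)
    provisioned_gb:     total size of the volumes before mirroring
    copies:             mirror copies kept per block (e.g. 4 for a 4-way mirror)
    failed_index:       position of the node to be removed
    """
    remaining = [
        capacity
        for index, capacity in enumerate(node_capacities_gb)
        if index != failed_index
    ]
    # Assumption: each copy lives on a different node, so a mirror
    # cannot keep more copies than there are nodes left.
    if not remaining or copies > len(remaining):
        return False
    # Assumption: the smallest node limits what every node can contribute.
    usable_gb = min(remaining) * len(remaining)
    return provisioned_gb * copies <= usable_gb


if __name__ == "__main__":
    nodes = [1000, 1000, 1000, 1000]  # hypothetical capacities in GB
    print(can_remove_node(nodes, 200, copies=4, failed_index=0))
    print(can_remove_node(nodes, 200, copies=2, failed_index=0))
```

Under these assumptions, a four-copy protection level cannot be maintained on three nodes, so the volumes may need a lower protection level before the node can be ejected. Verify that against the CMC before changing anything.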

Whether you have the free capacity or not, you will want to cold boot (i.e., power cycle) the node to give a full POST a chance to reinitialize the cache. Depending on the root cause, you could end up with a system that can't even boot (permanent hardware failure in the HBA cache, a corrupted array, etc.), which will require hands-on repairs, whether that's parts replacement or rebuilding the arrays on the node and reinstalling the OS. Or you could end up with the system fully functional and everything nominal.

It's more likely that you'll end up somewhere in between: the node boots, but it is still throwing errors. At that point, you would try reconfiguring the RAID level to get it to rebuild the storage partitions. You lose the data on the node, which will require a re-sync, but by then you can pretty much expect that to be necessary anyway.

The CMC might not allow you to reconfigure the array, however, because of the volume(s) it "thinks" the node is participating in. In that case, you'll have to forcibly remove the node from the management group using the node's console. Once you've done that, you can use the "repair a storage system" option to flag the node within the cluster and set about making your repairs: you can always rebuild the RAID on a node that is not participating in a cluster.
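The branches described above can be summarized as a small decision function. This is only a sketch of the reasoning in this answer, not an HP procedure: the function name `next_step` and its three boolean inputs are hypothetical, and the returned strings simply paraphrase the steps.

```python
def next_step(boots, errors_persist=False, cmc_allows_reconfigure=True):
    """Suggest the next action after cold booting the faulty node.

    boots:                  the node completes POST and starts the OS
    errors_persist:         cache/RAID errors are still reported after boot
    cmc_allows_reconfigure: the CMC lets you reconfigure RAID on the node
    """
    if not boots:
        # Permanent hardware failure or corrupted array.
        return "hands-on repair: replace parts or rebuild arrays and reinstall the OS"
    if not errors_persist:
        return "node is nominal: no rebuild needed"
    if cmc_allows_reconfigure:
        return "reconfigure RAID on the node, then let it re-sync"
    # The CMC still believes the node is participating in volumes.
    return "force-remove the node from the management group via its console, then rebuild RAID"


if __name__ == "__main__":
    print(next_step(boots=True, errors_persist=True, cmc_allows_reconfigure=False))
```

Whichever branch applies, the restorable backup mentioned at the top comes first.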

Ultimately, however, you may see this error occur again if the cache on the controller is going bad. Depending on the cache, especially if it's Flash-based, you could be looking at replacement of the cache parts at a minimum and total replacement of the HBA in the worst case.