  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 4844

Urgent: Critical error on an HP P4300 storage node

Hello experts,

I have a cluster of four HP P4300 storage nodes. Each node has a RAID 5 configuration, and the cluster is configured with Network RAID 10+2.

One of the storage nodes has critical errors: all of its disks have the status "Off or removed" and the node is unavailable (see the screenshot and log messages below). I need help troubleshooting it.

I am thinking of removing the failed node from the cluster and the management group in order to rebuild its RAID 5 configuration, and then adding it back to the management group and cluster, but I don't know whether that will cause any data loss. Are there other solutions? Please help.
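For context, the data-loss question mostly comes down to replica counts. A rough sketch of that arithmetic follows (a minimal illustration assuming Network RAID 10+2 keeps four copies of every block spread across the nodes; the disk count and disk size are placeholders, not values read from this system):

# Rough redundancy/capacity arithmetic for a 4-node P4300 cluster.
# Assumptions (placeholders, not measured from this system):
#   - RAID 5 inside each node (one disk's worth of parity)
#   - Network RAID 10+2 keeps 4 copies of every block across the nodes
NODES = 4
DISKS_PER_NODE = 8
DISK_TB = 0.45            # placeholder per-disk capacity in TB
NETWORK_RAID_COPIES = 4   # Network RAID 10+2 -> 4-way mirroring

node_usable_tb = (DISKS_PER_NODE - 1) * DISK_TB             # local RAID 5
cluster_usable_tb = NODES * node_usable_tb / NETWORK_RAID_COPIES
tolerated_node_failures = NETWORK_RAID_COPIES - 1            # copies that can vanish

print(f"Per-node usable:          {node_usable_tb:.2f} TB")
print(f"Cluster usable (NR 10+2): {cluster_usable_tb:.2f} TB")
print(f"Node failures tolerated:  {tolerated_node_failures}")

If that assumption holds, taking one node out of a four-node cluster should not by itself lose data, since the remaining nodes still hold copies of every block; the real exposure is the reduced redundancy while the node is out and the resync when it rejoins.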


Screenshot
Error log messages:
Event: EID_CACHE_CORRUPT E01020101
Severity: Critical
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Cache
Object Name: Cache 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance.
*****

Event: EID_DISK_OFF E01030105
Severity: Warning
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Disk
Object Name: Drive 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Drive 1' status is 'Off or removed'.


Event: EID_SYSTEM_RAID_OFF E01000104
Severity: Critical
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: System
Object Name: DataRaid
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The RAID status is 'Off'. Note: This is normal if you are replacing a drive in RAID0.


Event: EID_BBU_UNKNOWN E01020209
Severity: Warning
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Cache
Object Name: Cache 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Cache 1' BBU status is 'Unknown'.  Contact technical support for assistance.



Diagnostic test

Yes    Cache Status        FAIL    Cache 1 -> Corrupt
Yes    Cache BBU Status    FAIL    Cache 1 -> Unknown
Yes    Disk Status Test    FAIL    Drive 1 -> Off or removed
                                   Drive 2 -> Off or removed
                                   Drive 3 -> Off or removed
                                   Drive 4 -> Off or removed
                                   Drive 5 -> Off or removed
                                   Drive 6 -> Off or removed
                                   Drive 7 -> Off or removed
                                   Drive 8 -> Off or removed


2 Solutions
 
strivoliCommented:
Seriously, if your data is valuable (I'm quite sure it is) and you have any kind of maintenance contract on the storage, contact HP as your first choice. Any small misstep could make things worse. Only HP's trained technicians will be able to help you with a good chance of success.
 
cismoneyAuthor Commented:
Thanks @strivoli. Unfortunately the client told me he doesn't have a support contract with HP, and I am his local support, so I need to find a solution myself.
That is why I am requesting help from the Experts Exchange storage specialists.
 
cismoneyAuthor Commented:
OK.

Experts, I am looking for concrete solutions or hints on how to resolve the issue, not for comments about the fact that we don't have a service contract, so please help me.

 
ArneLoviusCommented:
Since you have the error "The 'Cache 1' status is 'Corrupt'. Contact technical support for assistance", the first step should be to contact HP support.

http://blog.mrpol.nl/2011/09/10/how-to-solve-hp-p4000-cache-status-corrupt-2/
 
Gerald ConnollyCommented:
Has the Cache battery died/failed on the node in question?
 
millardjkCommented:
Stop. Don't do anything else until you have a RESTORABLE backup of all the customer's data. The P4000 array is a resilient thing, but if something goes horribly wrong, you could end up losing everything and still needing to contact HP for an expensive support case!

If all the volumes are running in Network RAID 10+2 as you indicate--and you have the capacity to spare--you should be able to successfully eject the node from the cluster. Once you've normalized the cluster, you can begin troubleshooting the problematic node independently, even to the point of wiping the array and reinstalling the SAN/iQ operating system. Doing that, however, may be a challenge if the array hasn't been kept up to date and/or you don't have access to the appropriate installation media.
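One way to sanity-check the "capacity to spare" condition before ejecting the node is a quick back-of-the-envelope calculation; a minimal sketch follows (the function name and every number are made up for illustration and should be replaced with the real figures read from the CMC):

# Quick check: after ejecting one node, can the remaining nodes still hold
# the cluster's data at the configured Network RAID replica count?
# All values are placeholders; read the real ones from the CMC.

def can_eject_one_node(nodes, node_usable_tb, used_tb, copies):
    remaining_nodes = nodes - 1
    # Each replica lives on a different node, so the replica count
    # cannot exceed the number of nodes left in the cluster.
    if copies > remaining_nodes:
        return False
    remaining_raw_tb = remaining_nodes * node_usable_tb
    needed_raw_tb = used_tb * copies      # every block is stored 'copies' times
    return needed_raw_tb <= remaining_raw_tb

# Placeholder example: 4 nodes of ~3.1 TB usable each, 2 TB of volume data,
# 4 copies (Network RAID 10+2) -> prints False, i.e. full compliance is not
# possible with only 3 nodes remaining.
print(can_eject_one_node(nodes=4, node_usable_tb=3.1, used_tb=2.0, copies=4))

In other words, with four-way replication on a four-node cluster the three remaining nodes cannot hold four distinct copies, so expect the volumes to run degraded (or to need a lower Network RAID level) until the repaired node rejoins; that is the kind of situation the "repair a storage system" option mentioned further down is intended for.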

Whether you have the free capacity or not, you will want to cold boot (i.e. power cycle) the node to give a full POST a chance to re-initialize the cache. Depending on the nature of the root cause, you could end up with a system that can't even boot (permanent hardware failure in the HBA cache, a corrupted array, etc.), which will require hands-on repairs--whether that's parts replacement or rebuilding the arrays on the node and reinstalling the OS--or you could end up with the system being fully functional and everything nominal.

It's more likely that you'll end up somewhere in between: the node boots, but it is still throwing errors. At that point, try to reconfigure the RAID level to get it to rebuild the storage partitions; you lose the data on the node--which will require a re-sync--but by then you can pretty much expect that to be necessary anyway.

The CMC might not allow you to reconfigure the array, however, because of the volume(s) it "thinks" it's participating in. At that point, you'll have to forcibly remove the node from the management group using the node's console. Once you've done that, you can use the "repair a storage system" option to flag the node within the cluster, and you can set about making your repairs: you can always rebuild the RAID on a node which is not participating in a cluster.

Ultimately, however, you may be looking at this error occurring again if the cache on the controller is going bad; depending on the cache--especially if it's Flash-based--you could be looking at cache parts replacement at a minimum, total replacement of the HBA as a worst-case.
