Solved

Urgent: Critical error on an HP P4300 storage node

Posted on 2013-01-30
Last Modified: 2013-02-07
Hello experts,

I have a cluster of four HP P4300 storage nodes. Each node has a RAID 5 configuration, and the cluster is configured with Network RAID 10+2.

One of the storage nodes has critical errors: all the disks show the status "Off or removed" and the node is unavailable (image below). I need help troubleshooting it.

I am thinking about removing the failed node from the cluster and management group so that I can rebuild its RAID 5 configuration, then adding it back to the management group and cluster, but I don't know whether that will cause any data loss.
Are there other solutions? Please help.


screenshot
log error messages
Event: EID_CACHE_CORRUPT E01020101
Severity: Critical
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Cache
Object Name: Cache 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance.
*****

Event: EID_DISK_OFF E01030105
Severity: Warning
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Disk
Object Name: Drive 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Drive 1' status is 'Off or removed'.


Event: EID_SYSTEM_RAID_OFF E01000104
Severity: Critical
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: System
Object Name: DataRaid
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The RAID status is 'Off'. Note: This is normal if you are replacing a drive in RAID0.


Event: EID_BBU_UNKNOWN E01020209
Severity: Warning
Date and Time: 29/01/13 14 h 32 GMT
Component: HW
User: System
Object Type: Cache
Object Name: Cache 1
Management Group: BAIEHP 
Cluster: 
IP/Hostname: BAIEHPA
Message: The 'Cache 1' BBU status is 'Unknown'.  Contact technical support for assistance.
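
When there are many such events, it can help to pull out only the critical ones. The following is a minimal sketch, assuming the events are exported as plain text in the same "Key: Value" layout as above; the function names `parse_events` and `critical_events` and the shortened `SAMPLE` text are illustrative only and are not part of any HP tool.

```python
# Hypothetical helper: not an HP utility, just a plain-text parser for
# event dumps laid out as "Key: Value" lines, one event per "Event:" line.

SAMPLE = """\
Event: EID_CACHE_CORRUPT E01020101
Severity: Critical
Object Name: Cache 1
Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance.
*****

Event: EID_DISK_OFF E01030105
Severity: Warning
Object Name: Drive 1
Message: The 'Drive 1' status is 'Off or removed'.
"""


def parse_events(text):
    """Split an event dump into one dict per event."""
    events = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # skip blank lines and separators such as '*****'
        # Split on the first colon only, so colons inside the value survive.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Event":
            current = {}  # an "Event:" line starts a new record
            events.append(current)
        if current is not None:
            current[key] = value
    return events


def critical_events(events):
    """Keep only the events whose Severity field is 'Critical'."""
    return [e for e in events if e.get("Severity") == "Critical"]


if __name__ == "__main__":
    for event in critical_events(parse_events(SAMPLE)):
        print(event["Event"], "-", event["Message"])
```

On the log above, the critical entries are the cache corruption and the DataRaid "Off" status; the drive and BBU entries are warnings.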



Diagnostic test

Yes		Cache Status                       		FAIL Cache 1 -> Corrupt
		                                   		
------------------------------------------------------------------------------------------------------------------------
Yes		Cache BBU Status                   		FAIL Cache 1 -> Unknown
		                                   		
------------------------------------------------------------------------------------------------------------------------
Yes		Disk Status Test                   		FAIL Drive 1 -> Off or removed
		                                   		Drive 2 -> Off or removed
		                                   		Drive 3 -> Off or removed
		                                   		Drive 4 -> Off or removed
		                                   		Drive 5 -> Off or removed
		                                   		Drive 6 -> Off or removed
		                                   		Drive 7 -> Off or removed
		                                   		Drive 8 -> Off or removed

Question by:cismoney
8 Comments
 
LVL 19

Assisted Solution

by:strivoli
strivoli earned 100 total points
ID: 38835069
Seriously, if your data is valuable (I'm quite sure it is) and you have any sort of maintenance contract on the storage, contact HP as your first choice. Even a minor mistake could make things worse. Only HP's trained technicians will be able to help you with a good chance of success.
 

Author Comment

by:cismoney
ID: 38835194
Thanks @strivoli. Unfortunately, the client told me he doesn't have a support contract with HP, and I am his local support, so I need to find the solution.
This is why I am requesting help from the Experts Exchange storage specialists.
 

Author Comment

by:cismoney
ID: 38838396
ok.

Experts, I am looking for valuable solutions or hints on how to resolve the issue, not for comments about the fact that we don't have a service contract, so please help me.

 
LVL 36

Expert Comment

by:ArneLovius
ID: 38838972
As you have the error "Message: The 'Cache 1' status is 'Corrupt'.  Contact technical support for assistance", the first step should be to contact HP support.

http://blog.mrpol.nl/2011/09/10/how-to-solve-hp-p4000-cache-status-corrupt-2/
 
LVL 16

Expert Comment

by:Gerald Connolly
ID: 38839027
Has the Cache battery died/failed on the node in question?
 
LVL 10

Accepted Solution

by:millardjk
millardjk earned 400 total points
ID: 38839238
Stop. Don't do anything else until you have a RESTORABLE backup of all the customer's data. The P4000 array is a resilient thing, but if something goes horribly wrong, you could end up losing everything and still needing to contact HP for an expensive support case!

If all the volumes are running in "Network RAID 10+2" as you indicate--and you have the capacity to spare--you should be able to successfully eject the node from the cluster. Once you've normalized the cluster, you can begin troubleshooting the problematic node independently, even to the point of wiping the array and reinstalling the SAN/iQ operating system. Doing that, however, may be a challenge if the array hasn't been kept up-to-date and/or you don't have access to the appropriate installation media.
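
To make "capacity to spare" a bit more concrete, here is a rough back-of-envelope sketch. It is not HP's actual sizing algorithm: it assumes usable cluster capacity is roughly the smallest node's capacity times the number of nodes, divided by the number of data copies, and that you need at least as many nodes as copies. The function name `can_eject_node` and all of the figures are hypothetical; substitute the real values shown in the CMC.

```python
# Back-of-envelope model only (assumption, not the SAN/iQ sizing logic):
#   usable = smallest remaining node * remaining node count / data copies

def can_eject_node(node_capacities_gb, failed_index, provisioned_gb, copies):
    """Return True if the remaining nodes could still hold the volumes."""
    remaining = [
        cap for i, cap in enumerate(node_capacities_gb) if i != failed_index
    ]
    # N copies on distinct nodes cannot fit on fewer than N nodes.
    if len(remaining) < copies:
        return False
    usable_gb = min(remaining) * len(remaining) / copies
    return usable_gb >= provisioned_gb


if __name__ == "__main__":
    nodes = [3000, 3000, 3000, 3000]  # hypothetical per-node capacity in GB
    provisioned = 2500                # hypothetical total volume size in GB

    # Four copies of the data across only three remaining nodes.
    print(can_eject_node(nodes, 0, provisioned, copies=4))
    # Two copies of the data across three remaining nodes.
    print(can_eject_node(nodes, 0, provisioned, copies=2))
```

Under these assumptions, a four-node cluster holding four copies of every volume cannot simply drop to three nodes, so you may have to lower the data protection level on the volumes before the node can be ejected.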

Whether you have the free capacity or not, you will want to cold boot (i.e., power cycle) the node to give a full POST a chance to reinitialize the cache. Depending on the nature of the root cause, you could end up with a system that can't even boot (permanent hardware failure in HBA cache; corrupted array; etc.), which will require hands-on repairs--whether that's parts replacement or rebuilding the arrays on the node and reinstalling the OS--or you could end up with the system being fully functional and everything nominal.

It's more likely that you'll end up somewhere in between: the node will boot, but it will still be throwing errors. At that point, you would try reconfiguring the RAID level to get it to rebuild the storage partitions; you lose the data on the node--which will require a re-sync--but by then you can pretty much expect that to be necessary anyway.

The CMC might not allow you to reconfigure the array, however, because of the volume(s) it "thinks" it's participating in. In that case, you'll have to forcibly remove the node from the management group using the node's console. Once you've done that, you can use "repair a storage system" to flag the node within the cluster, and you can set about making your repairs: you can always rebuild the RAID on a node which is not participating in a cluster.

Ultimately, however, you may see this error occur again if the cache on the controller is going bad; depending on the cache--especially if it's Flash-based--you could be looking at cache parts replacement at a minimum, or total replacement of the HBA in the worst case.