Dell EqualLogic PS6110 epic fail! Can this be recovered? (Happy to pay for recovery solution via "Live Mentor")

SUMMARY
We have two Dell EqualLogic PS6110XV storage arrays that we purchased second-hand approx. 9 months ago.  As such, they are not covered by Dell's warranty or support.  One of the units has crashed, as detailed below, and I'm not sure if it's recoverable.

I would value your input as to whether this can be restored without data loss, or whether we should cut our losses, re-initialise the unit and start again.

And YES, happy to pay for your time via the "Live Mentor" option if you know how to get this unit back online without losing any data.

OVERVIEW
The storage array has 24 x 600GB, 15K RPM drives.  22 drives are mounted in one big RAID 10 array, with 2 hot spares.
 
Last night, within a 15-minute period, 3 drives failed.  (What are the chances, hey?)  The failed drive numbers are 8, 15 & 23.  We replaced them this afternoon with official Dell drives (firmware EN03).  The system has recognised the drives but has not started the automatic rebuild.

The disk lights on the front of the unit are all solid green, and a red light is flashing at the top left-hand side of the unit.

If we go into the CLI console via PuTTY and a serial cable and run the "su raidtool" command, it displays the following:

 ***************************************************

Driver Status: *Admin Intervention Requested*

RAID LUN 0 Faulted Beyond Recovery.
  12 Drives (0,2,17,7,9,1,4,6,?,?,11,12)
  RAID 10 (64KB sectPerSU)
  Capacity 3,456,711,524,352 bytes

RAID LUN 1 Ok.
raid status dirty.
  10 Drives (13,5,16,10,18,19,20,21,3,14)
  RAID 10 (64KB sectPerSU)
  Capacity 2,880,592,936,960 bytes

Available Drives List: 8,15,22,23

*********************************************************
My understanding of the above output is that the RAID is still recoverable, because LUN 1 is still OK?  (Please enlighten me if I'm delusional.)

My hope is to somehow allocate two of the available spare drives (e.g. #8 & #15) to LUN 0 to replace the missing drives shown as "?".

Once done, then force the RAID to rebuild.

But I have no idea how to do this via the CLI.  The GUI Management Interface works for the second unit; however, it reports the first unit as "offline".

If we try to run a "diag" command via the CLI console, we receive "Error: Bad Command".  Other commands like "member show" are met with "The storage array is still initialising.  Limited commands will be available until the initialisation is complete.  Please try again later."

A "restart" command does not work, and reports an error "Raid is faulted beyond recovery.  The array cannot be rebooted or halted in this state.  There are two options: 1) Fix RAID and retry the shutdown.  2) Power off the array."  We have manually powered off the device and restarted, and are still stuck in the same place.

The EqualLogic unit is used to host approx. 50 SQL databases.  They are backed up and can be recovered; however, we would lose 1 day's data.  (The crash happened about 10 minutes before the daily backups began.)  Not the end of the world, but something I would like to avoid.

Many thanks in anticipation.    And once again, happy to pay for the get-back-online-without-losing-any-data solution.
Asked by: Data Guru

Travis Martinez (Storage Engineer) commented:
I'll do some research; I'm not familiar with the EqualLogic system, but I am with storage in general.  It appears as if the drives you replaced have to be admitted back to the group in order to be used.  Likening it to NetApp, the hot spare takes over for the failed drive and the new ones are admitted back as hot spares.  In this case you failed beyond your spares, but the integrity of the RAID 10 depends on specifically which two drives were faulted.  If it was a primary and its mirror, you may have an issue with recovery.

I'll see what I can find...
andyalder (Saggar maker's framemaker) commented:
RAID LUN 0 Faulted Beyond Recovery.
12 Drives (0,2,17,7,9,1,4,6,?,?,11,12)

You've lost two drives in a mirror, so "Faulted Beyond Recovery" is correct.  0 mirrors 2, 17 mirrors 7, 9 mirrors 1... and ? mirrors ?, so those two were mirroring each other.  Recovery would only be possible if one of the drives you replaced can be repaired without losing the data from it, and that will cost a lot and take time.

You don't have one big RAID 10, you have two RAID 10s, one has 12 drives in it and the other has 10.
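
If you want to sanity-check that reading, here is a rough sketch in plain Python (nothing EqualLogic-specific) that pairs up the drive list as described above (consecutive entries mirroring each other) and flags any pair that has lost both members.  The pair layout is an assumption based on that interpretation, and the drive numbers are taken straight from the raidtool output.

# Sketch: flag RAID 10 mirror pairs that have lost both members.
# Assumes consecutive entries in the raidtool drive list form mirror pairs
# (0+2, 17+7, 9+1, ...), as interpreted above.  None stands in for the
# two drives shown as "?" in the output.
LUN0_DRIVES = [0, 2, 17, 7, 9, 1, 4, 6, None, None, 11, 12]
FAILED = {8, 15, 23}  # drives the array reported as failed

def mirror_pairs(drives):
    # Group the flat drive list into (primary, mirror) pairs.
    return [tuple(drives[i:i + 2]) for i in range(0, len(drives), 2)]

for pair in mirror_pairs(LUN0_DRIVES):
    lost = sum(1 for d in pair if d is None or d in FAILED)
    if lost == 2:
        print(f"Pair {pair}: both members lost -> faulted beyond recovery")
    elif lost == 1:
        print(f"Pair {pair}: degraded but still readable")

The capacities in the output support the two-RAID-10 reading too: 3,456,711,524,352 bytes is exactly 6 x 576,118,587,392 bytes, and 2,880,592,936,960 bytes is exactly 5 x the same figure, i.e. one usable chunk of roughly 576 GB per mirror pair, consistent with a 12-drive set and a 10-drive set rather than one 22-drive RAID 10.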

Travis Martinez (Storage Engineer) commented:
As mentioned above, you do have two different RAID sets.

I've found a CLI manual that has the commands for showing the disks in a member on page 90.

http://people.stern.nyu.edu/nwhite/scrc/ps4000e/110-6028-EN-R1_CLI_REFERENCE_V5.0.pdf

Can you grab a screenshot of the GUI showing the status of the drives?
Data Guru (Author) commented:
Hi Travis and Handy Holder

Thanks for your input and the explanation of how to interpret the RAID mirroring output.

In working through your suggestions, I was trying to figure out which drives were part of the failed LUN 0 array (i.e. which ones made up the "?").

In our case, drive #15 went first, and the array degraded.  About 10 minutes later drive #8 also failed, and then the spare, #23, failed about a minute after that, at which point the array went critical.  I was able to figure out which drives were the hot spares from some photos we had previously taken of the Management Interface.

In an I-am-desperate-and-will-do-anything approach, we tried turning off the unit, inserting the original failed drive #15 and restarting.  Nothing happened.  The CLI command "su raidtool" showed the same status as before.

So we powered off again, re-inserted the original failed drive #8, and restarted.  And the unit came back to life!  Oh, the joy of seeing flashing green lights…

I don't recall the exact CLI messages; however, there was something like "this drive has a history of failures", referring to drive #8.

At this stage, the RAID is rebuilding to the new spare.

My two cents' worth:
1.   Drive #15 failed.  This seemed to be a catastrophic failure and the disk is dead.
2.   Drive #8's failure seems to have been a 'soft' failure; once it had been taken out, cooled down and reinserted, it worked again.  In researching possible solutions, it appears that EqualLogic systems have had issues in the past with marking a good drive as failed.  Although we are using Storage Array Firmware V9.1.2, this seems to have happened in our case.
3.   Drive #23, the hot spare, also had a catastrophic failure and is dead.

My recommendation to anyone who finds themselves in the same situation, with multiple failed drives and a critically degraded array:
1.      Don't replace all the failed drives at once.  Replace them one at a time and see what happens.  You may need to restart the unit between replacements.  If this works, the rebuild will take longer, but any rebuild is better than none at all.

Our second learning is that EqualLogic has a smart load-balancing function that moves data between the physical units within a group.  Although one of our units was "dedicated" to SQL databases, the system had moved some of our regular data volumes onto the SQL unit too.  So when the SQL array went critical, the other volumes went critical too.  We have now turned this load-balancing function off.  This is done through the CLI with the following command: "volume select <volume_name> bind <member_name>".
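
For anyone with more than a handful of volumes to pin, a trivial helper that prints the bind commands to paste into the CLI session saves some typing.  This is only a sketch: the volume and member names below are placeholders, and it assumes the same "volume select <volume_name> bind <member_name>" syntax quoted above.

# Sketch: print "volume select ... bind ..." commands for a list of volumes,
# ready to paste into the EqualLogic group CLI session.
# The volume and member names are hypothetical placeholders.
volumes = ["sql-data-01", "sql-data-02", "sql-logs-01"]
member = "eql-member-01"

for vol in volumes:
    print(f"volume select {vol} bind {member}")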

Finally, when we first went down the path of choosing a SAN, I read a comment along the lines of "if you are using a SAN, you will lose your data at some time in the future".  Yeah, right, I thought, it will never happen to us.  But it almost did.  The probability of 3 drive failures all within a 15-minute period is so infinitesimally low that it should not happen.  But it did.  So back up your critical data onto a different array.
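
To put a rough number on "infinitesimally low": assuming the failures were independent and taking an annual failure rate of about 3% per drive (an assumed figure, not one measured for these disks), a back-of-the-envelope sketch shows just how unlikely three or more of 24 drives failing in the same 15-minute window would be.

from math import comb

# Back-of-the-envelope: probability of >= 3 of 24 drives failing in the same
# 15-minute window, assuming independent failures.
# The 3% annual failure rate (AFR) is an assumption, not a measured figure.
n_drives = 24
afr = 0.03                              # assumed annual failure rate per drive
window_hours = 0.25                     # 15 minutes
p = afr * window_hours / (365.25 * 24)  # per-drive failure probability in one window

p_three_or_more = sum(
    comb(n_drives, k) * (p ** k) * ((1 - p) ** (n_drives - k))
    for k in range(3, n_drives + 1)
)
print(f"P(>=3 failures in one 15-minute window) ~ {p_three_or_more:.3e}")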

Thank you, Travis and Handy Holder, for your effort and assistance.  I'll also PM the two of you shortly.