Dell PE2950 RAID 5 'Fail' drives

Hello Guys,

    I am having a problem with one of our critical database servers. Over the weekend there was an extended power outage, and it seems the UPS did not hold up.

On Monday morning when we came in, the PE2950 had a blinking amber light with the message "W1228 ROMB Batt".

The PE2950 has a PERC 4e/DC RAID controller card which connects to a Dell PowerVault 220S.
The PowerVault has six 300 GB SCSI hard drives configured in a RAID 5 array, with one as a hot spare.
The first four drives are showing up as 'FAIL', the fifth as 'ONLINE', and the sixth as 'HOTSPARE'. Attached is a screenshot of the config utility.

Please advise the best way to bring this RAID back online with the least risk of losing data.

All assistance and input is appreciated.

 Thanks
IMAG0826.jpg
jrj195 asked:

Cliff Galiher commented:
Well, since RAID 5 only tolerates one failure and you seemingly have four, you are probably already out of luck. I'd start planning on restoring a backup.

Not to Monday morning quarterback too much, but there seem to be several unrelated failures here that coalesced into one significant problem, which is why my answer is "restore a backup."

First, there is the UPS failure. If the server is critical, it should be on a UPS that can initiate a controlled shutdown in the event of an extended power outage. Having power drop out from under a server (or worse, suddenly reappear) can cause older controllers, and particularly older drives, to fail. I suspect this is what happened here. The drives are very likely truly failed at this point.
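To illustrate the "controlled shutdown" point: in practice you would let your UPS vendor's agent or NUT's own upsmon handle this, but here is a minimal Python sketch of the idea. It assumes Network UPS Tools (NUT) is installed; the UPS name ("myups") and the charge threshold are placeholders, not anything from this thread.

```python
#!/usr/bin/env python3
"""Sketch: poll a UPS via NUT's `upsc` and trigger a controlled shutdown
once the server is on battery and charge drops below a threshold."""

import subprocess
import time

UPS_NAME = "myups@localhost"   # hypothetical NUT UPS name
MIN_CHARGE = 40                # shut down once charge drops below 40%
POLL_SECONDS = 30

def read_ups_vars(ups: str) -> dict:
    """Return the `key: value` pairs reported by `upsc` as a dict."""
    out = subprocess.run(["upsc", ups], capture_output=True, text=True, check=True)
    vars_ = {}
    for line in out.stdout.splitlines():
        key, _, value = line.partition(":")
        vars_[key.strip()] = value.strip()
    return vars_

def main() -> None:
    while True:
        ups = read_ups_vars(UPS_NAME)
        on_battery = "OB" in ups.get("ups.status", "")
        charge = int(ups.get("battery.charge", "100"))
        if on_battery and charge < MIN_CHARGE:
            # Shut down cleanly while there is still battery left, instead
            # of letting power drop out from under the array.
            subprocess.run(["shutdown", "-h", "now"], check=False)
            return
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```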

While we're discussing UPSes: again, on a critical server, UPSes need to be *tested* on a regular schedule. Like any battery, UPS batteries have a shelf life. They stop providing adequate power and quit holding a charge after a couple of years. I've seen far too many people try to get maximum mileage by running batteries for five years or more, and then be surprised when the UPS causes data corruption. Testing doesn't mean you can stretch the battery life; it catches premature failures. The two are complementary, not mutually exclusive: replace batteries on a schedule AND test regularly (every three months is my recommendation).

From there, you have the RAID controller itself. To adequately protect the array and prevent the RAID from splitting, the controller's cache also needs a battery. Note that the UPS is *not* a replacement for a cache battery, nor is a cache battery a replacement for a UPS; they serve two different purposes. A cache battery keeps the RAID in a consistent state, so it won't report as "failed" in the event of a power loss. If the UPS fails or the server is not on a UPS, the files (particularly database files) may still be corrupted, but the corruption is at the filesystem level, not the RAID level; the RAID itself will be consistent. A failure at the RAID level is much more difficult to recover from, and the cache battery is essential to preventing it. Like a UPS battery, it has a shelf life, and if a server is expected to stay in production beyond that shelf life, the battery needs to be replaced regularly. Sounds like that didn't happen.
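To make the "consistent state" point concrete, here is a toy Python illustration (not what the controller firmware actually does, and the block values are made up) of the problem a cache battery prevents: if power is lost after a data block reaches disk but before its parity is rewritten, the stripe no longer checks out.

```python
"""Toy model of an interrupted RAID 5 stripe update (the "write hole")."""

def parity(blocks):
    """RAID 5 parity is the byte-wise XOR of the data blocks in a stripe."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# A three-disk stripe: two data blocks and their parity.
d0 = b"AAAA"
d1 = b"BBBB"
p  = parity([d0, d1])

# Power fails mid-update: the new data block reaches disk,
# but the matching parity update is lost with the cache.
d0 = b"CCCC"

print(parity([d0, d1]) == p)   # False -> the stripe is inconsistent
```

A battery-backed cache holds that pending parity write across the outage and replays it at power-on, which is why the array stays consistent instead of reporting a RAID-level failure.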

Ultimately though, this boils down to running a critical workload on an old server. The 2950 is indeed very old and hasn't been produced in several years. So the disks are old (unless you've replaced all of them recently) and thus more susceptible to power events. The controller battery is old and obviously (now) was not up to the task of protecting the data. And if the UPS was part of the original install, it was old too, and its battery was not tested and/or remote shutdown was not configured.

The culmination is that the array has failed, and with that many drive failures the data is not recoverable by mere mortals. Because RAID 5 stripes the data across disks, the only way to recover *any* data would be to painstakingly attempt to reassemble it one sector at a time. There is no quick and easy way to do this, and no utilities I know of can do it within a reasonable price range. High-end forensic tools can, but they cost five and six figures.
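For a sense of why a single failure is recoverable and four failures are not, here is a simplified Python sketch of the XOR math behind RAID 5 (the block values are purely illustrative):

```python
"""Why one missing RAID 5 block can be rebuilt but several cannot."""

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# One stripe across a 4-disk RAID 5: three data blocks plus one parity block.
d = [b"1111", b"2222", b"3333"]
p = xor_blocks(d)

# One disk fails: its block is simply the XOR of everything that survived.
rebuilt = xor_blocks([d[1], d[2], p])
assert rebuilt == d[0]

# Two (or four) disks fail: the XOR of the survivors is d[0] ^ d[1], a
# single value that cannot be split back into two unknown blocks.
combined = xor_blocks([d[2], p])
print(combined)   # != d[0] and != d[1]; the original blocks are gone
```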

If you don't have backups, your next best option would be to ship the disks off to a data recovery shop that specializes in such things (they have enough customers to recoup the six-figure cost of such software) and see if the corruption was minimal enough to reassemble the data.

Someone else may have a better answer, but in my experience, given the many factors that came together to cause that many disks to report as failed, this is the only recourse. I provided the background to help you see how it happened (more than just a simple extended power outage came together to cause this), hopefully help you avoid it in the future, and let you integrate it into your DR plan.

-Cliff

pjam commented:
"W1228 ROMB Batt" tells me your RAID battery is dead. I would start there, by replacing it.
You can find your Owner's Manual here, which will tell you how to replace it:
http://downloads.dell.com/Manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-2950_owner%27s%20manual_en-us.pdf?c=us&l=en&cs=04&s=bsd

Page 27 says this about the error:
"Warns predictively that the RAID battery has less than 24 hours of charge left. Replace RAID battery. See 'RAID Battery' on page 74."
jrj195 (Author) commented:
OK, questions:

1) What would happen if I 'rebuild' the drives using the RAID config utility?

2) What would happen if I 'force online' the drives using the RAID config utility?

Thanks

Cliff Galiher commented:
I don't have a PERC 4 anywhere to test anymore, but as I recall, the RAID array itself would be rebuilt, NOT the data on the array. It would build a new (clean) array ready to accept new data.
jrj195 (Author) commented:
I was able to bring the failed drives back online and restored the RAID with everything intact.

Thanks Cgaliher for your detailed answer. You brought to light certain things that must be in place to prevent such an occurrence.

I will go ahead and award you the points.

Can you give me links or info on the best ways to create and configure virtual servers? I want an environment where there is redundancy and as little downtime as possible if a server were to crash.

Thanks
jrj195 (Author) commented:
The response provided was informative and helped in solving my issue.

Thank You.