Is there a risk of losing data when performing a RAID5 rebuild?

RedITAus
Hi all,
I have an IBM server with a ServeRAID 6i controller. The hard drives were set up in a 3-disk RAID5 array plus 1 hot spare. About a week ago one of the disks developed a fault and the hot spare took over; the rebuild was not entirely successful, but it was good enough to get us going and eventually all data was recovered. The faulty drive was removed and a new one ordered. An original IBM drive was 10 days away, and I was worried one of the other disks could fail before then, so I ordered a Seagate with the same parameters (except for a higher RPM). The replacement disk was installed on Wednesday night and the server continued working; Thursday morning the server was not booting! One of the other drives had failed during the rebuild.
Last night I received a refurbished IBM original; IBM does not have any of these drives in the country, even though they are less than 4 years old!
I installed the new drive, but because the controller was detecting 2 defunct drives, it was not a case of simply installing the new drive and expecting the rebuild to start automatically. I had to go into the RAID controller (using the bootable disc supplied) and follow the instructions in IBM document ID MIGR-39144.

The situation now is this: Slot 0 and Slot 3 have the original drives, and Slot 1 has the new drive. When using the IBM boot CD, all of the other slots are empty. Slots 0, 1 and 3 show online and as being members of the array (should Slot 1 be showing this?). Slot 4 shows as defunct, and when I right-click on it I get the option to REBUILD. Slot 4, I think, points to one of my original HDDs (going by the reported serial number).

Prior to receiving the refurbished one, I had tried other second-hand disks, but they were all reporting PFA errors as well, so I tried them in different slots (always making sure not to confuse them with the good drives); hence Slot 2 is empty.

My worry is this: if I select the REBUILD option, is there any risk of losing all of the data? i.e. would this place the existing good drives under more stress and thus give me another failed disk? During the rebuild, is anything changed on the original drives, and if so, and the rebuild fails, have I lost all hope of recovering my data?
I realize that the 2 disks that are left are probably about to fail pretty soon too.
What are my options?  What should I do next?
I have thought of:
a) Set the drive in Slot 1 to defunct and then convert it to a hot spare, then reboot (not sure how, though)
or
b) Just right-click on Slot 4 and click REBUILD.
But these are just educated guesses.

Please do not ask me about the tape backups!

Attached is the log from the disks

Thank you!
Raid1.log

Commented:
My experience with RAID-5 (3 disks on PERC controllers) is that as long as the original 2 disks are running fine, it is safe to rebuild the array. I also had an experience with RAID-5 (3 disks) where 1 disk failed and, while the replacement was on the way, one of the 2 remaining disks failed. To rebuild the array, I issued a "Forced Online" on the second failed disk, rebuilt the array, and it was successful. I then requested a replacement for the second disk and rebuilt the array again.

Commented:
On RAID configs, it is not advisable to replace a disk with one of a different RPM. It is OK to replace it with a much larger disk of the same RPM; a lower-capacity disk is not allowed.

Author

Commented:
Thanks for your quick reply, Powereds.

I am using an identical IBM drive now.
I did the Forced Online and it is showing three disks online, with NONE showing as a hot spare. The fourth is defunct.
If the rebuild does not work, will I have completely stuffed up the 2 original good drives?

I will be back in a couple of hours...

President
Top Expert 2010
Commented:
First, it is a foregone conclusion that you have already lost, at a minimum, one stripe of data. To be specific, 24KB worth of your RAID set is gone forever, just based on the logs. There is no way to know from the logs whether this affected any data files. Hopefully not.

Did you do this all offline at the BIOS? Or have you ever booted the system to an O/S (or even started to) since this happened? Also, have you run regular data consistency checks/repairs?

The absolute most conservative thing to do now is clone all of the disks, make note of any unrecoverable read errors and the blocks, and then do a manual reconstruction with the original disks. If you have problems you can then (painfully, unless you are running a UNIX O/S and can use dd) restore and reconstruct with some other software and have a horrible weekend doing this.
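
For anyone following along, here is a rough Python sketch of what "clone and note the unrecoverable blocks" can look like (a sketch only: the device path, image name and sector size are placeholder assumptions, not details taken from this server):

import os

BLOCK = 512                    # assumed sector size; use the drive's real sector size
SRC = "/dev/sdb"               # hypothetical source disk on a plain non-RAID HBA
DST = "sdb.img"                # image file representing that disk
BADLIST = "sdb.badblocks.txt"  # block numbers that could not be read

def clone_with_badblock_log(src, dst, badlist, block=BLOCK):
    """Copy src to dst one block at a time, zero-filling unreadable
    blocks and recording their block numbers for later analysis."""
    fd = os.open(src, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        with open(dst, "wb") as out, open(badlist, "w") as bad:
            for blk in range(size // block):
                os.lseek(fd, blk * block, os.SEEK_SET)
                try:
                    data = os.read(fd, block)
                except OSError:
                    bad.write(f"{blk}\n")        # note the unrecoverable block
                    data = b"\x00" * block       # placeholder so offsets still line up
                out.write(data)
    finally:
        os.close(fd)

if __name__ == "__main__":
    clone_with_badblock_log(SRC, DST, BADLIST)

In practice a ddrescue-style tool does this much faster (reading one 512-byte block at a time is slow), but the point is the same: every unreadable block number gets written down so it can be dealt with during reconstruction.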

Not 100% sure of the scenario, but IF you ran a rebuild and changed any of the drives and did any more writes, or you had your O/S booted at any time, or even attempted to boot your O/S, then you have a high probability of major damage because you ran in degraded mode. I think you mean that you just tried different replacement disks but never attempted a rebuild, which is fine.

So exactly step-by-step what did you do ?

Commented:
You should not have moved them to other slots if you are not sure that your server/RAID controller supports it. Every disk in the array has something like an ID, because the RAID config is also stored on each disk (PERC). If your server/RAID controller doesn't support the "disk roaming" feature, please put the disks back in their original slots.

Please fill in the table below.

SLOT    MAIN ARRAY MEMBER    DRIVE STATUS       DISK REPLACED
0       Yes or Hot Spare     Running or not?    Yes or No
1
2
3
etc.

Did the hot spare complete the rebuild? If the hot spare didn't completely rebuild, the array is running in degraded mode (only 2 disks are functioning). Does it boot when the hot spare is removed (2 original drives only)? If it boots, it is safe to rebuild using the replacement disk in its original slot.

Author

Commented:
Thanks,
Or have you ever booted the system to an O/S (or even started to) since this happened?
A. The OS is not booting; it does not find an OS.

 Also have you run regular data consistency check/repairs?
A. Not when it failed the second time. The first time, yes, and it successfully booted and gave me access to all of the data and programs.

I think you mean that you just tried different replacement disks but never attempted a rebuild, which is fine.
A. I initially installed the 2nd-hand drive and assigned it as a hot spare, but I think because it was reporting a PFA error it wouldn't even attempt a rebuild. Upon reboot it kept showing 2 drives critical, 2 defunct.

You should not have moved them to other slots if you are not sure that your server/RAID controller supports it.
A. The RAID controller appears to support this feature, as it has been able to 'follow' when I have moved the disks.
It prompts me to say a disk has been moved and asks if I want to accept the changes (I can't remember the exact prompts, but it is something to that effect).

1. At first I tried using a brand new hard drive to replace the failed one; this was when the server was still running. The brand new hard drive is the Seagate that you will see in the logs. Although the RAID controller recognized it, I could not get it to use it for a rebuild (the rebuild option was not available).
2. Then I used second-hand hard drives, same deal. This time I used IBMs.
I started to think maybe those slots were faulty, so I tried installing the HDDs in different slots; it would then tell me it had detected config changes, which I accepted. Still no luck though.
3. The good HDDs always showed either online or critical, and the RAID manager shows that there are two logical drives inside the array as well. This is true even when moving the HDDs to different slots.
4. I was then finally able to get a good refurbished hard drive. But it was still showing offline for some and defunct for the ones I had tried before (the second-hand ones).
5. I found the IBM document I mentioned above and followed its instructions; however, when I try to start it, it tells me there is no OS.
6. The Windows CD (this version has the F6 drivers) that I used last time to boot and recover from the first failure does not see them either.

Author

Commented:
Powereds,
I will fill in your table shortly.

Author

Commented:
SLOT    MAIN ARRAY MEMBER    DRIVE STATUS    DISK REPLACED
0       YES                  ONLINE          NO (original)
1       YES (refurbished)    ONLINE          YES #
2       EMPTY SLOT                           now in Slot 3
3       YES                  ONLINE          NO (originally in Slot 2)
4       EMPTY SLOT           NOT SHOWN       *
5       EMPTY SLOT           DEFUNCT         YES **

* I had tried inserting a 2nd-hand unit in here and had assigned it as a hot spare, but it did not take it or even attempt to rebuild, as far as I can see.

** The disk that was in SLOT 3 was temporarily inserted here (SLOT 5) to ensure it was not a faulty slot. The controller recognized it being moved and re-identified it. It was later removed, so it now shows defunct. It was an original array member.

# Is this correct? Should it be showing as a member of the array? As far as the controller is concerned it is a brand new drive.

Author

Commented:

These are the serial numbers, in case they help you interpret the logs (particularly the defunct drives area):
S/N: AAR9P5B0AT6C
Notes: Last original drive to fail; currently removed from the server. Was in Slot 3.

S/N: AAR9P5B0ATCB
Notes: First original drive to fail; currently removed from the server. Was in Slot 2. The server was booting successfully when this one failed (only after a few adjustments).

S/N: *3LQ4K8A8* (Seagate Cheetah 15K.5 ST373455LC)
Notes: Brand new unit. I tried using it in Slot 2 as a replacement for the failed AAR9P5B0ATCB; currently removed from the server.
S/N: J20H13EK, D215EA1K, AAR9P5402H38, V3V2BD0A, AAR9P5503V82
Notes: 5x 2nd-hand units, ALL reported PFA errors. Tried to use them as hot spares, but it did not work. Not in the server now. Tried in different slots.

 

Author

Commented:
These are the S/N for the drives currently in the server

S/N AR90ATAJ (SLOT 0) original drive (shows online, always been in SLOT 0)

S/N PF9016N2 (SLOT 1) new hard drive (shows online, tried in different slots)

S/N AR90AT9G (SLOT 3) original drive (shows online, was in SLOT1)

According to the logs, drive S/N AR90ATC6, originally in SLOT 3, died on the 21st of October at 7pm. This is 2 hours after the Seagate was inserted. S/N AR90ATCB was moved to SLOT 5 to check whether I had a faulty slot. It prompted me to accept the changes, which I did.

The ServeRAID boot CD utility shows S/N AR90ATCB in SLOT 5 as being defunct. It gives me the option to REBUILD this drive.

The log also shows the stripe order as being 0,1,5,3

I hope this clarifies things a bit.

Commented:
Where is the original location of the hot spare drive? If it has been moved, where is it inserted now?

Commented:
Is SLOT 2 defective? If you move the drive from SLOT 3 to SLOT 2, what happens?

SLOT    MAIN ARRAY MEMBER    DRIVE STATUS    DISK REPLACED
0       YES                  ONLINE          NO (original)
1       YES (refurbished)    ONLINE          YES #
2       EMPTY SLOT                           now in Slot 3
3       YES                  ONLINE          NO (originally in Slot 2)

If SLOT 2 is not defective, please return the drive in SLOT 3 to SLOT 2.
Please run consistency checks on the 2 original drives. If the consistency checks are successful, you can try to rebuild the array/volume using the 2 original disks and the refurbished disk. If one of the 2 original disks fails, force it ONLINE and try rebuilding again.

Note: For a 3-disk RAID-5 config, you need 2 original array-member disks to be able to rebuild successfully, provided that these original disks are untouched (data intact) and the RAID-5 NVRAM config is working.
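
To illustrate why 2 of the 3 members are enough (a toy Python sketch with made-up 4-byte blocks, not the controller's real stripe layout): RAID-5 parity is just the XOR of the data blocks, so any single missing member can be recomputed from the other two, while two missing members cannot.

# Toy RAID-5 stripe: 2 data blocks + 1 parity block (values are invented)
d0 = bytes([0x11, 0x22, 0x33, 0x44])
d1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
parity = bytes(a ^ b for a, b in zip(d0, d1))    # what the third disk stores

# One member lost (say d1): XOR of the survivors rebuilds it exactly
rebuilt_d1 = bytes(a ^ b for a, b in zip(d0, parity))
assert rebuilt_d1 == d1

# Two members lost: parity alone says nothing about d0 or d1,
# which is why two defunct drives in a 3-disk RAID-5 means data loss.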

Hope this helps you.

Author

Commented:
There is no hot spare at present, as I had 2 drives fail. The hot spare became one of the other drives that are now healthy (at least I hope they are). The drive in Slot 1 is the brand new one, which I think should be assigned as a hot spare?

I have done as instructed and moved the drive in Slot 3 to Slot 2.
How do I run a consistency check now?

Should I set the drive in Slot 1 to defunct and then mark it as a hot spare? If so, what is the proper procedure?

Commented:
We ran into a similar issue and were not able to rebuild the array and get any usable data using the array manager with the techniques described above, but our attempts were not destructive to the data.

We were able to send the drives out to a data recovery firm who recovered 98% of the data on the drives, even after several attempts at forcing drives online and rebuilding arrays. Unfortunately this was quite expensive.
David
President
Top Expert 2010

Commented:
See my very early post in the thread. I cannot emphasize enough that you are in a moderate-to-high risk zone. The prudent thing to do is clone each disk at the block level before moving forward. It is irresponsible to do otherwise.

Commented:
I agree with dlethe; that is good advice.

Author

Commented:
gnarlysage,
I have already contacted a data recovery center. I was expecting a callback last night but nothing happened. I am also considering using RAID Reconstructor with BartPE; this is a read-only procedure, but I need a non-RAID card, which I do not have right now.

dlethe,
Correct me if I'm wrong (I will be posting this in a separate thread): to ghost these drives I will need a SCSI card that allows me to see the drives independently. Therefore I think the safest approach (if the data recovery people do not turn up today and I need this running tonight) is to put a SCSI U320 card in a second PC and image the drives one at a time. Then I could reconstruct the RAID using the images. I should not use the server or its card, to prevent deletion of the RAID configuration data.



Commented:
We have used R-Studio to image and create a virtual RAID set.
David
President
Top Expert 2010

Commented:
Yes, you need a card that can see individual disks, like a basic Adaptec non-RAID U160 or U320 controller. Now, when you image the drives, obviously image them to RAID1 if you can, but you could also image them to a SATA disk and create a single file representing each hard drive. The bottleneck will be the I/O transfer rate and your bus, so do this in a system that has PCIe and preferably dual-channel controllers; then you can get all the transfers done in an hour or so.

Now, the important thing is that you will undoubtedly get unrecoverable read errors. It is very important to note the block numbers, as you will have to make an intelligent decision about whether or not an XOR-based reconstruction will give you reasonable results for that stripe. If you have a dead disk and an unrecoverable read error in the same stripe, that stripe is 100% lost, so when you reconstruct from the clones with whatever software you use, you need to manually take care of that stripe. When you run the RAID reconstruction from the cloned disks, it won't see an unrecoverable error (only the original disk has it), so it will think that everything is fine.
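
As a rough illustration of that bookkeeping (assumptions: a bad-block list was saved for each surviving disk while cloning, a 16KB stripe unit, and 512-byte blocks; all of these are placeholders, not values from this controller), a few lines of Python can map the bad blocks to stripes and flag the stripes where XOR cannot help because more than one member is missing:

SECTOR = 512                   # bytes per block number in the bad-block lists (assumed)
STRIPE_UNIT = 16 * 1024        # assumed stripe-unit size per disk
BLOCKS_PER_UNIT = STRIPE_UNIT // SECTOR

def affected_stripes(badblock_file):
    """Return the set of stripe numbers touched by one disk's bad blocks."""
    with open(badblock_file) as f:
        return {int(line) // BLOCKS_PER_UNIT for line in f if line.strip()}

# Placeholder file names for the bad-block lists logged while cloning the
# surviving members; set dead_member_count to 0 if no member is completely gone.
surviving = ["disk0.badblocks.txt", "disk1.badblocks.txt"]
dead_member_count = 1

per_disk = [affected_stripes(f) for f in surviving]

for stripe in sorted(set().union(*per_disk)):
    unreadable = sum(stripe in s for s in per_disk)
    missing = unreadable + dead_member_count
    if missing <= 1:
        print(f"stripe {stripe}: one member missing, XOR reconstruction is possible")
    else:
        print(f"stripe {stripe}: {missing} members missing, data in this stripe is gone")

Reconstruction software working from clean clones will not know about these stripes on its own, which is exactly the problem the next paragraph works around by re-creating an error at the same block.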

When I do this for customers, my trick to get around the problem is to generate an ECC error myself on the same block, so the RAID reconstruction software knows that it has a bad block and can deal with it accordingly. I am not aware of any inexpensive products that will let you pre-enter a list of known bad blocks ahead of time, hence I fool the software by generating ECC errors at the same physical blocks.

You will get better results if you reconstruct from the original disks, but there is no going back, and unless you run full diagnostics and understand the nature of the problem(s) with your disks, you should quickly clone the drives. Reconstruction takes several times longer than cloning, and your disks may have a limited number of hours left in them, so it is best to clone first.

Note, I tend to be very conservative in recovery, as you can tell. Certainly there are shortcuts, but I've recovered several dozen arrays over the years on high-end fibre channel arrays where the data ranged from national security-related "stuff" to databases at US government agencies that had never been backed up since the database was computerized in the 80s (sad but true: no backup, EVER, except for paper and microfiche), so I am very conservative.

Personally, if the data is valuable enough to warrant calling in the pros like you have, then the best thing to do is get a backup policy figured out and in place ASAP, so when/if this happens again your exposure will be limited. There is so much more to the process than I can cover here.

I do not recommend BartPE. Go with the stuff from runtime.org; the reconstructor is around $100. BartPE is free, and you get what you pay for. I use software that I wrote over the years to deal with issues like reconstructing specific stripes based on various failure scenarios where you had degraded I/Os, XOR errors, or unrecoverable read errors on specific stripes, or where you don't know the stripe size and may not be sure about the ordering of the disk drives in the RAID group. The off-the-shelf stuff pretty much figures out the layout and then reconstructs, but it can't deal with the subtleties of stripe-specific problems. If you are in the 'biz, you pretty much have to write that code yourself.

But I still use Runtime's reconstructor at times, quite successfully, though only after I have analyzed the nature of the physical errors so I can go back and handle the exceptions that BartPE, Runtime's software, and the other packages can't deal with.


Author

Commented:
Thanks dlethe,
While I wait for the data recovery people I will prepare for "Plan B" and start sourcing the required hardware and software. I can't get any parts for at least another couple of hours, as the only place I can get them on a Sunday is at the swap meet.

Commented:
dlethe is clearly the expert on this, so I'd be curious what he thinks about using the current server: install Windows on a large SATA or IDE drive, then use the current RAID controller configured without RAID to image the current drives onto the recovery drive, and/or try to create a virtual RAID set using Runtime's RAID Reconstructor.

Author

Commented:
I am planning to build a dedicated workstation for this purpose. I think using the current RAID controller could cause me more trouble by deleting my RAID configuration?

Commented:
Yes, you definitely want to keep the configuration intact if you still have it.
David
President
Top Expert 2010

Commented:
The problem with using the current RAID controller is that there is metadata on it. You would have to tell it that the disks are JBOD. This will generally cause the controller to change the usable capacity so that you have more usable blocks (i.e., if the metadata is X KB, then the total usable capacity is whatever you had before plus X KB). This would also result in the controller losing whatever RAID information it has, along with potentially useful info such as whether or not it has already reconstructed certain blocks that do not need to be reconstructed again (meaning that if they get rebuilt a second time, the data will be lost, because those blocks are no longer considered degraded).

Also, after re-reading your initial post, it is likely that you have your firmware configured so that a SMART (PFA) error will trigger a drive failure/rebuild. BAD BAD BAD idea. There are known issues with a great many popular disk drives running certain firmware revisions where a bad block will trigger a S.M.A.R.T. alert. This will cascade into the exact situation you have. If your system was not under stress at the time, then I would say to disable that particular setting.

Please also add upgrading drive firmware to your post-recovery to-do list. Assuming you are providing replacement disks, take the opportunity now to upgrade the firmware on the replacement drives. You do have vanity firmware on the disks, and the logs do not provide EVPD pages, so I can't tell you a lot, other than that they appear to be Fujitsu. That being said, there have been some major high-priority firmware updates in the last few years that are vital for data integrity.

You CAN use a large SATA disk as a temporary target to reconstruct, but you will have issues that may or may not be important to you, based on time and your desire for the greatest amount of recovery. The big issues are:
1. IOPS (I/Os per second) / throughput constraints. If you create X data files which represent a raw dump on the SATA disk, then the queue depth and sheer number of I/Os required to rebuild could take all weekend.
2. There is no way to propagate a read error, so the recovery software will think a stripe is good when it is bad, and you will not get a notification. Furthermore, if you *really* needed to get all that data back, it may be possible to get information from the original failed disk, which may be able to provide the missing XOR data so you could recover the stripe. It is just a matter of running some XOR calculations and eyeballing the resulting recovered stripe to see if it looks like data and, if so, reconstructing manually.
3. Unless you clone first, it will leave the original disks exposed for much longer than necessary. You may have the impending doom of a head crash, which means really big dollars, more lost data, and a clean room.

If you want to do a SATA-based recovery on the cheap, get 2 SATA disks and use software RAID1. You will not be exposed if a SATA disk crashes, and software RAID1 will provide much better read performance than a single disk. Still, be sure to use a dual-channel controller so the disks don't compete. Don't make the mistake of using external USB.

Author

Commented:
I am out of luck; I could not source an 80-pin SCSI U320 card.
What I do have is an IBM SR-4mx, which is an Ultra160 RAID controller card. It has a connector that goes to the planar.
What I am considering is removing the existing card from the server and installing this card instead.

Install Windows XP to a SATA drive and, if possible, configure this other card to read the disks as individuals. This card is U160 and my drives are U320; would this stop me from accessing the drives?


Author

Commented:
The only cards they had at the swap meet were 68-pin; even in the classifieds that's the best I can find.

Author

Commented:
dlethe,
Can you provide me with the model of a suitable Adaptec card? I'll see if I can find one on eBay.
David
President
Top Expert 2010

Commented:
All you need is the right cable. The disks will just run at U160 interface speed. Any Adaptec X9160 or X9320 card will do the job. Avoid the 2xxx family, as that has embedded RAID.

Get a high-quality cable that is shielded and actually rated for the proper speed. scsi4me.com is a great source, with lots of pictures and information on cabling and interconnectivity.


Author

Commented:
I have an IBM 4mx RAID controller.
Can I use this one to see the disks as individuals, without changing the array configuration?

In the meantime I have rung a few "24x7 data recovery centers"; all they have is a 24x7 answering machine and no one calls you back, except for one that couldn't take the job because he didn't have all the tools.

Author

Commented:
Can I use a 68-pin SCSI card with an adapter at the end so I can see the drives? I think all I can get for now is a 68-pin SCSI U160 card.

Does anyone have any experience using these adapters?
http://cgi.ebay.com.au/5X-SCA-80-Pin-F-to-SCSI-III-68-F-Card-Converter-Adapter_W0QQitemZ150376542345QQcmdZViewItemQQptZAU_Components?hash=item230323f089



David
President
Top Expert 2010

Commented:
I do not know the internals of the IBM 4mx. The link shows a valid connectivity option. The great thing about parallel SCSI is that it just works it all out.

Author

Commented:
Hi all,
I did some further research and found that Adaptec does not like people using those 68-to-80-pin adapters; they can cause a bit of trouble.
Here is what I also did yesterday:
(Since I could not source an 80-pin SCSI card to set up a controller that sees the drives as individuals and ghost them.)
1. Removed the original RAID controller.
2. Labelled the 3 drives from the server and removed them.
3. Installed the spare 4mx card in the server where the original was.
4. Installed 2 old drives (the second-hand ones I had bought on eBay).
5. Upon boot I instructed the replacement controller to copy the configuration from the drives.
6. The boot CD was complaining that the BIOS was not up to date, but I did not accept the offer to upgrade.
7. The boot CD could not see the drives anymore after a reboot.
8. Went for a walk. Planned to sell everything, buy a Winnebago and become a gypsy.
9. Restored the server to its previous configuration (reinstated the original controller card, inserted the drives in their original slots).
10. Turned the server back on; nothing happened, just fans turning and a black screen. I will become a gypsy. Left for the day.
11. This morning I had an idea: I pulled out the drives, turned the server back on, waited for it to tell me there were no drives (and it did), reinserted the drives with the server on, and pressed F4 to retry.
12. Continued booting with the boot CD in there.
13. To my surprise, it now only shows drives 0 and 2 as being online, with the other ones defunct. The drive 1 light is orange (not too concerned, as it was blank).
14. I am shutting the server down now and taking it to the recovery guys.


I got a call last night from the data recovery guys and I will be dropping off the server and drives this morning.
So the status of this server is still the same: 2 original drives show online, and a third, new one shows online.

I will wait for them to recover the data for me.
I will get back to you and assign points accordingly.
Thanks to everyone for all your help!

Author

Commented:
Hi Guys,
I just got an e-mail from the data recovery experts. They are looking at 95% of the data being recovered in the worst-case scenario! They sent me a directory listing of the files found and it looks pretty good.

I still want to try hitting the rebuild button once the files are returned, as they can only give me all of the recovered files on a USB drive, not as an image.

Now I am going to research how to recover AD from files only.

I think I should assign dlethe the points, as I should have followed his advice from day one; however, I am keen to try Powereds' solution once I get the server back.

A word of caution to everyone else who stumbles upon this thread (from what I have learned from this experience):

1. Backup, backup and backup. Do not be complacent; if using portable drives, swap them daily! The cost of an extra portable drive is probably close to 1% of that of an expert data recovery. Using just one portable drive, even as an interim measure, is extremely risky (mine was an interim measure until the new server was installed, and it crashed along with the RAID).

1b. Check your backups; the fact that they show as successful does not mean you can recover anything.

2. If one of the disks in your RAID fails, the other ones will follow very closely (which makes me wonder about the usefulness of a hot spare; maybe the spares should be left unplugged and only plugged in when a member fails, so that they are brand new when you need them).

3. I am not sure about RAID5 anymore. I never had these issues with RAID1; even when those arrays completely crash, it is easier to recover.














David
President
Top Expert 2010

Commented:
Those bozos can't give you an image? I can see how they would make the image a bit smaller due to the metadata, but since they are not even trying to give you an image, it tells me that they aren't very sophisticated in their techniques. I bet they aren't trying to XOR-test whether or not any information from the failed drive can be used to fill parity holes in the event of unreadable blocks.

Hopefully you will get a report of the unreadable block numbers on each disk. Find out also whether they are cracking open the disk drives or using software-only recovery methods. Are you getting the original disks back?

I am sure that many people would be curious to know the pricing structure.

P.S. RAID5 is great, but no matter what the RAID level, I bet that if you had done regular consistency checks/repairs you wouldn't have had any bad blocks. In this day and age, where multi-TB LUNs are commonplace, there really is no reason not to go with the extra redundancy of RAID6.



Author

Commented:
Sorry everyone,
I had been meaning to log back on to assign points. If it is OK with everyone, we'll go with the points as assigned by rindi.
dlethe,
To answer your questions: it doesn't look like they opened the hard drives.
They did not provide an image; they said they use proprietary software, so they can't provide me with an image.
100% of the data was recovered, which after all is the most important thing. Not one file was damaged. (Got to keep the customer!)
Cost was about $6k.
I attempted to rebuild the array as suggested once I knew I had all of the data back, but it didn't work.
Thanks again to everyone.
