• Status: Solved
  • Priority: Medium
  • Security: Public

How to surface scan drives in a RAID 5 array on an Adaptec controller

Good morning!

I have a Windows 2000 server running on SuperMicro hardware with an Adaptec 2010S RAID controller, configured as RAID 5 with a hot spare.

Last week we had a drive failure and a few weird things happened. The replacement drive is in and the array is now optimal. The system seems to be behaving normally. However, one of the last things I did was take an Acronis system image of the array. Almost at the end, Acronis reported a read error from drive 2. In my past experience with Acronis, every time I get a read error, a subsequent surface scan reveals a failing hard drive.

I haven’t found any information from Adaptec about performing individual surface scans on the drives.  Anyone have any suggestions on how I should proceed?

Thank you
You are doing something gravely wrong.
First, if Acronis shows a read error from drive #2, then you aren't backing up the logical RAID array; you are backing up the physical disks, which does you no good if you plan on restoring.

Now IF you are backing up the logical array, and drive #2 refers to a logical device or partition, then that means that you have either
  - a degraded array AND an unrecoverable read error somewhere on the logical device
  - two or more HDDs have an unrecoverable read error in the same chunk.

In either case above, scanning won't fix it. It can't, because the data isn't readable. The only way to repair the block is to write something to it. Think about it: what should go there if the array can't read it?

Run chkdsk with the full repair and scan-all-blocks option (chkdsk /R). The 2010S is such a low-end controller that it doesn't support things like automated background scanning of the physical drives. Late-model Smart Array and LSI controllers do, however.
sattech2000 (Author) Commented:
Not sure how much clearer I can be here.

Acronis CLONED the array to a single disk.  With the array removed and the cloned drive in the server it booted fine and all my data is in place so it did the clone just fine.

The array is currently optimal and working fine.

The QUESTION is... I want to surface TEST each of the FOUR disk drives in the array to scan for failing clusters. I don't need to repair anything. Just identify if any one of the drives is starting to fail.

The RAID controller isn't that smart. But if you run chkdsk from Windows, with the full surface scan and the option to repair bad blocks, it will get 99% of the logical drive (everything except metadata). It won't get the parity blocks on each drive.

If you want 100% of each disk, you have to use a non-RAID controller and do something like boot the system to Linux and just read all the blocks into the bit bucket, like

for i in /dev/sd[a-e]; do
  dd if="$i" of=/dev/null bs=64k &
done
wait

The above reads every disk from /dev/sda through /dev/sde into the bit bucket in parallel. If dd hits a read error, it stops and reports which disk.
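
A slightly more cautious variant of the loop above, as a sketch only. Since dd normally stops at the first read error, this version passes conv=noerror so it keeps reading, and it sends each disk's messages to its own log file. The function name scan_disk, the /tmp/scan log directory, and the /dev/sd[a-e] device names are assumptions for illustration; adjust them to however the drives show up on the non-RAID controller. The grep at the end assumes GNU dd, whose read-error messages contain the word "error".

```shell
#!/bin/sh
# Read every block of one device (or file) and discard the data.
# conv=noerror tells dd to continue after a read error instead of stopping,
# so the log can list every unreadable spot, not just the first one.
# Usage: scan_disk <device> <log-directory>   (names are illustrative)
scan_disk() {
    dev=$1
    log="$2/$(basename "$dev").log"
    dd if="$dev" of=/dev/null bs=64k conv=noerror 2>"$log"
}

# Example, run as root from a Linux boot disk (nothing is written to the drives):
# mkdir -p /tmp/scan
# for d in /dev/sd[a-e]; do scan_disk "$d" /tmp/scan & done
# wait
# grep -il error /tmp/scan/*.log   # lists the logs that recorded a read error
```

Because the scan only reads, it leaves the array members unchanged, which matters if the drives are going back on the 2010S afterwards.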

sattech2000 (Author) Commented:
Thank you :)
That's what I was wondering. Is chkdsk trustworthy for the most part? In the IDE world what I would usually do is boot the system with the drive manufacturer's diagnostic utility and run a complete surface scan. If it scanned clean then I would sleep good at nights. If it had read errors then the drive was replaced.
I'm surprised, with this configuration, that there isn't a way to accomplish a direct surface scan with an Adaptec utility.

Are there any known issues with damaging the array by simply putting the drives in another, non-RAID system for the scan in DOS? I would probably use the Seagate disk utilities. I'm aware of the risk of doing destructive testing, write testing, etc., but it's unclear to me whether the drives would get marked dirty somehow.
chkdsk will read all non-parity blocks used with exception of some metadata.  This is the only option you have unless you get a better controller.    But you can't sleep at nights :(

That is because it will not force checking of the parity blocks. So if you did lose a disk, you would be exposed to any parity block that turns out to be unreadable. At least it will get 75% of each disk. It is your only practical option.
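
The 75% figure follows from how RAID 5 rotates parity: on an array of N member disks, roughly 1/N of each disk holds parity, and normal reads from the operating system don't touch it while the array is optimal. A quick sketch of the arithmetic, assuming four active members (a hot spare holds no data, so it is not counted and is not read by chkdsk at all). The function name is made up for illustration.

```shell
#!/bin/sh
# Percentage of each member disk that holds data (not parity) in RAID 5,
# using integer arithmetic. $1 is the number of active member disks.
raid5_data_pct() {
    n=$1
    echo $(( 100 * (n - 1) / n ))
}

raid5_data_pct 4   # prints 75
raid5_data_pct 3   # prints 66 (three members plus a hot spare)
```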

You won't damage the array by placing them in another system using a non-raid controller and an O/S like linux which doesn't try to mount disks. But if you do get a bad block, then what?  You can't repair it because you don't know what to put there.  

If you want to sleep at nights, then upgrade the disks.  Let's face it ... if the O/S is win2K, and you're using that controller with SCSI disks, then those drives are probably way overdue for failure.   The manufacturer warranty to OEMs on enterprise SCSI from Seagate is only 5 years.  
sattech2000 (Author) Commented:
This is good to know.  Thank you.

If I get a bad block then I fail the drive in the array, let it rebuild to the hot spare and replace the hot spare with a new drive.  Currently have a new hot spare ready to go and a new one on the shelf.

A surface scan of the drives would give me an indication if it's ready to go back into production or if another drive is starting to fail, at which point the drive would be replaced.  

Brings up another question, how accurate is adaptec itself?  I mean it's currently indicating the array is optimal.  I ran a verify to double check and everything checked good.  Is this alone a good indication that everything is working normal or is there a chance that blocks are starting to fail and adaptec hasn't recognized that?

Unfortunately a replacement server is not in the budget this year :(  so I have to make the best with what I have.  
A bad block does not mean you need to fail the drive. The drive has thousands of spare blocks, but since you have no other way to verify 100% of the data, I guess you'll have to resort to that.

Adaptec is #1 for SCSI, and they own the market. It is as good as you can get with parallel SCSI. (For serial SCSI, i.e., SAS, LSI is by far the way to go.)

If you run a verify to double-check everything, it still won't do squat for data integrity, e.g., file system errors, or bit rot, or anything like that. All it does is check the XOR. This does NOT mean your data is all OK, but again it is the best you can expect with what you have.
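
To illustrate why a passing verify says little about the data itself: a RAID 5 verify only confirms that the parity block equals the XOR of the data blocks in each stripe. Below is a toy sketch with one byte standing in for each block. The values and the function name are invented for the example, and this is not a description of how the Adaptec firmware is implemented.

```shell
#!/bin/sh
# Succeeds when parity ($4) equals d1 XOR d2 XOR d3 ($1..$3).
stripe_consistent() {
    [ $(( ($1 ^ $2 ^ $3) == $4 )) -eq 1 ]
}

# Intended data with its parity: 0x41 ^ 0x42 ^ 0x43 = 0x40
stripe_consistent 0x41 0x42 0x43 0x40 && echo "verify passes"

# Garbage in the first block, with parity computed over the garbage:
# 0x00 ^ 0x42 ^ 0x43 = 0x01. The stripe is still consistent.
stripe_consistent 0x00 0x42 0x43 0x01 && echo "verify passes"

# A block that changed after parity was written shows up as a mismatch,
# but the XOR alone cannot say which block is the wrong one.
stripe_consistent 0x41 0x42 0x47 0x40 || echo "verify reports a mismatch"
```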

Qlemo (C++ Developer) Commented:
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.