
Intel 631xESB controller, RAID 1: checking SMART status of drives

Hi All-

I've got a low-end SATA server at a colocation facility that I manage for a non-profit I volunteer with. Physical access to the server is possible but a real pain, so I'd like to avoid it if I can.

It's a Windows 2003 machine with a built-in Intel 631xESB/632xESB RAID controller. Two of the drives are in a RAID 1 configuration for redundancy. Recently, I've seen sporadic slowness, almost unresponsiveness. It lasts a minute or two and then things resume. I've tried disabling the public network interface and connecting to it only via the private LAN, and I still see the problem. I don't believe it has anything to do with system load.

These are Seagate 7200.11 750 GB drives, and I know that model has problems. I'm hoping to replace them, but until the budget is approved, I want to see if I can figure out this problem. I've had these drives fail in workstations before, and SMART reporting showed that they were bad.

However, with this RAID 1 setup, I can't see the SMART status with tools like Active@ Hard Disk Monitor (http://www.disk-monitor.com/).

Does anyone know of a way to check if the drives are bad without physical access to the machine?

Second, if I find that a drive is bad and I replace it (with the system shut down) with a drive of the same size or larger, will the Intel Matrix Storage Manager automatically recognize the new drive and copy the data from the remaining good disk to it, so that the RAID 1 array is redundant again?

Thanks.
Asked:
Berkson Wein
 
David (President) commented:
You are using $50 consumer disk drives on a server in a colo site. These drives are designed for a duty cycle of 2400 hours/year. Get enterprise drives designed for 24x7x365 operation, which also have things like 100X better ECC and perform background bad-block repair.
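The "overclocking" point comes down to simple arithmetic: a server that never powers down runs 24 × 365 = 8760 hours a year. The sketch below, which assumes the 2400-hour/year rating quoted here applies to these drives, computes how far past that rating a 24x7 colo box pushes them. The constant and function names are illustrative only.

```python
# Illustrative only: assumes the 2400 h/year desktop rating quoted in this answer.
HOURS_PER_YEAR_24X7 = 24 * 365   # 8760 h: a server that never powers down
DESKTOP_DUTY_CYCLE_HOURS = 2400  # assumed yearly duty-cycle rating of the drive


def duty_cycle_ratio(rated_hours: float, actual_hours: float = HOURS_PER_YEAR_24X7) -> float:
    """Return how many times over its rated yearly duty cycle a drive is run."""
    return actual_hours / rated_hours


if __name__ == "__main__":
    print(f"{duty_cycle_ratio(DESKTOP_DUTY_CYCLE_HOURS):.2f}x")  # 3.65x
```

In other words, under that assumption the drives run roughly 3.65 times their rated yearly hours.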

This is the root cause. Bottom line, you are "overclocking" your disks in a remote site, so of course you are going to have problems like these. You should read the article linked below for more details (feel free to comment with follow-up questions; I'm going to try to crank out a few more articles like this over the next few weeks).

Probably the only thing you can do is disable the write cache if it is enabled. You won't be able to run any in-band SMART software, because the RAID controller prevents it from seeing individual drives. Intel has a Windows-based utility you can install, but I don't know off the top of my head whether it will look at SMART.

But it is very doubtful that S.M.A.R.T. will show you have an issue. These disks are not architected to run behind RAID controllers.

Another thing you must do is run regular data consistency checks within the RAID. This will ensure that each disk is a 100% match, and if any bad blocks appear on disk "A", the RAID controller will repair them with data from the same block number on disk "B", and vice versa. See if you can set up a script to automate this around midnight. It will probably take all night to run, so don't automate it until you know how long it takes.
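To make the repair logic of a consistency check concrete, here is a conceptual model of a RAID 1 verify pass. It is not Intel's firmware or the Matrix Storage Manager's actual algorithm: each disk is modelled as a Python list indexed by block number, `None` stands for an unreadable sector, and treating disk A as authoritative when both copies are readable but differ is an assumption of this sketch. `verify_mirror` is a hypothetical name.

```python
from typing import Optional

# Conceptual model only: a disk is a list of blocks indexed by block number (LBA),
# and None stands for a sector the drive cannot read.
Block = Optional[bytes]


def verify_mirror(disk_a: list[Block], disk_b: list[Block]) -> dict[str, object]:
    """Compare two RAID 1 members block by block and repair where possible."""
    repaired = 0               # unreadable on one side, rewritten from the other
    mismatched = 0             # both readable but different
    unrecoverable: list[int] = []  # unreadable on both sides: no good copy left
    for lba, (a, b) in enumerate(zip(disk_a, disk_b)):
        if a is None and b is None:
            unrecoverable.append(lba)
        elif a is None:
            disk_a[lba] = b    # a rewrite typically lets the drive remap the sector
            repaired += 1
        elif b is None:
            disk_b[lba] = a
            repaired += 1
        elif a != b:
            # Assumption: disk A is treated as the authoritative copy.
            disk_b[lba] = a
            mismatched += 1
    return {"repaired": repaired, "mismatched": mismatched,
            "unrecoverable": unrecoverable}


if __name__ == "__main__":
    disk_a = [b"x", None, b"z", b"q"]
    disk_b = [b"x", b"y", None, b"r"]
    print(verify_mirror(disk_a, disk_b))
    # {'repaired': 2, 'mismatched': 1, 'unrecoverable': []}
```

The point the model illustrates is that a bad block is only recoverable while the other disk still holds a readable copy, which is why the check should run regularly rather than after a failure.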


http://www.experts-exchange.com/articles/Storage/Misc/Disk-drive-reliability-overview.html
 
Berkson Wein (Tech Freelancer, Author) commented:
The drives weren't $50, they were $49!!
These are the drives I've got to work with.  It's a non-profit, and unfortunately there's not much of a budget for anything, let alone disks.  The system pre-dates my time there.
I'm trying to find a way to work with what I've been given.  
The Intel Matrix Storage Manager software does report on data consistency, and that's fine. It would be nice if it alerted me to a problem without my having to check manually, but again, I've got what I've got. The tool doesn't look at SMART.
I'm hoping to find someone who has experience with this specific controller and the software who can advise on this specific issue - and address my question on replacing drives one at a time and rebuilds.
Thanks again.
 
 
Berkson Wein (Tech Freelancer, Author) commented:
Do you think this drive would make a substantial difference? It's refurbished, and I know the risk there, but it is under $60: Barracuda Enterprise ES.2, ST3750330NS.
http://www.buy.com/prod/seagate-barracuda-enterprise-es-2-750gb-sata-300-7200rpm-32mb-hard/q/listingid/56348131/loc/101/212654841.html
I don't know enough about Seagate's enterprise line to know whether this is one of the models that should be avoided. Thoughts appreciated.
 
David (President) commented:
Refurbished drive? Does that mean it is a Seagate-supplied refurbished drive, or did some dealer buy a used ES.2, dust it off, and spit-shine it with alcohol? The problem with buying used SATA disks is that the total number of bad-block replacements is not obtainable programmatically, so you could end up with a lemon of a drive that has had thousands of bad blocks and has room for only one more.

Don't buy used SATA disks unless they have the specific Seagate part number that indicates a factory-refurbished drive. When Seagate (or any manufacturer) refurbishes a drive, the media passes full tests and the grown defect count is near zero.
 
Berkson Wein (Tech Freelancer, Author) commented:
These do have the ST3750330NS-R part number.
 
David (President) commented:
Then you are OK to buy them. They should work much better for you than what you have.
 
Berkson Wein (Tech Freelancer, Author) commented:
I'll see if I can get budget approval.
In terms of replacing them: I'm OK with some server downtime. If I pull out an existing drive and replace it with one of the new ones, will the server rebuild automatically? Remember, this is RAID 1. If so, I'll let that happen, then swap the other one and let that rebuild.
 
 
David (President) commented:
Yes, it should. But first, kick off a data consistency check (a function of the RAID hardware, not a Windows-based scandisk). The data consistency check makes sure that if there is an unreadable block on disk A, the controller repairs it from the mirrored copy on disk B, and vice versa. If you skip it and the disk that stays in the machine has a bad block, the controller won't be able to read that block during the rebuild, so you will get partial data loss.
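Why a bad block on the surviving disk means data loss can be shown with a simplified model of the rebuild step. This is a sketch under stated assumptions, not how the Intel controller actually behaves (a real controller may abort the rebuild or mark the array degraded instead): each disk is a list of blocks, `None` marks an unreadable sector, and `rebuild_mirror` is a hypothetical name.

```python
from typing import Optional

# Conceptual model only: None marks a sector the surviving drive cannot read.
Block = Optional[bytes]


def rebuild_mirror(survivor: list[Block]) -> tuple[list[Block], list[int]]:
    """Copy every block from the surviving RAID 1 member onto a blank disk."""
    new_disk: list[Block] = [None] * len(survivor)
    lost: list[int] = []
    for lba, block in enumerate(survivor):
        if block is None:
            # The old mirror is gone, so no second copy exists: this block is lost.
            lost.append(lba)
        else:
            new_disk[lba] = block
    return new_disk, lost


if __name__ == "__main__":
    new_disk, lost = rebuild_mirror([b"boot", None, b"data"])
    print(lost)  # [1]
```

Running the consistency check first, while both disks are still present, gives the controller a chance to fix such blocks from the mirror before the second copy is removed.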

And, of course, have a full backup. You never know when a disk is going to fail.

Finally, take the opportunity to make sure you have current firmware and drivers for the RAID controller and motherboard.
 
David (President) commented:
First, #31885378 does address running a data consistency check, which verifies that each disk matches. However, the author is under the mistaken impression that the lockups and other behavior indicate that the drives are bad. The reason for these problems is that the disk drive is not qualified or designed to work behind the controller, and this will absolutely cause the behavior he is seeing. Intel's Windows-based Matrix management software, which can be run remotely, is available on the motherboard's download page, but that software does not have true diagnostics.

The consistency check (called a verify in the Matrix controller) is effectively the only test that can be done, but it is NOT designed to be a diagnostic.  His controller is incapable of performing diagnostics directly on disk drives, so what the author desires is not possible.  The controller also does not have a full API that supports a full pass-through suite of commands to be sent to individual disks, so it is not possible for even a 3rd-party software product to do what he asks.
