johnstaggs asked:

Repeated SCSI drive failure with Windows 2003

Hey Guys,

I'm having problems with drives failing on reboots. I'm using Windows 2003 Enterprise Server, IBM Ultrastar 160 drives, an Adaptec 2010S RAID controller, and a Supermicro motherboard and backplane.

In the past two weeks, we've had three drive failures, all on reboot. We are running a RAID 5 configuration, so we haven't lost any data, and we have caught it every time. Basically, are there any known issues or compatibility problems with this hardware?

Thanks. This is an extremely important question, so it's valued at 500 points.
chicagoan replied:

Did the drives really fail, or are they failing to come ready?
I've seen 160s and 320s in hot-swap cages fail to come ready from a cold start, but work fine if they're inserted into the cage after the system has powered up, or after a warm boot.
Sometimes you see LUN=0 BUS=0 ID=0 Bad SCSI Status – Check Condition messages.
Setting the system BIOS to do a full POST sometimes helps, especially if there is a lot of RAM.
These drives can take up to 20 seconds to spin up.
Check the auto-spin-up jumper settings, though setting all the drives to auto-spin can put a severe load on the PSU.
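The PSU concern above can be put into rough numbers. This is a minimal sketch, not drive-specific data: the per-drive amperage figures below are assumptions typical of SCSI drives of that era, so check the Ultrastar datasheet for the real values.

```python
# Rough 12 V rail load: all drives spinning up at once vs. staggered spin-up.
# The current figures are assumptions -- consult the drive datasheet.
SPINUP_AMPS = 2.0   # per-drive 12 V draw during spin-up (assumed)
IDLE_AMPS = 0.7     # per-drive 12 V draw once settled (assumed)

def peak_12v_amps(n_drives: int, staggered: bool) -> float:
    """Worst-case 12 V current: either every drive spins up together,
    or one drive spins up while the rest are already idling."""
    if staggered:
        return SPINUP_AMPS + (n_drives - 1) * IDLE_AMPS
    return n_drives * SPINUP_AMPS

if __name__ == "__main__":
    for n in (4, 6, 8):
        print(n, "drives:",
              peak_12v_amps(n, staggered=False), "A simultaneous,",
              peak_12v_amps(n, staggered=True), "A staggered")
```

The gap between the two numbers grows with drive count, which is why controllers and backplanes often stagger spin-up rather than jumpering every drive to auto-spin.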

johnstaggs (Asker):

Well, when the system boots, the drives have no lights on whatsoever. Then after Windows boots up, we go into SMOR (Adaptec's software) to look at the config, and the drive with no lights will be marked with a red drive icon and say "failed".
One thing we did notice with the Ultrastar drives is that some are different models than the others. Only one model that we have is on the Windows HCL for Windows 2003 Server. Could this be a potential pitfall?
Are these in a hot-swap cage? Whose?

What have you done with the failed drives?
Yes, SuperMicro.

We've reformatted them and given them another try, and they are able to be used again.
Which lends credence to them being OK and just not ready.
I'd get with SuperMicro and see what their suggestions are about getting the drives initialized before the OS looks at the array.
The drive is initializing when the system boots up, because you can see it during POST. Also, if you go into SMOR (the BIOS utility), it will show the RAID as "degraded". But maybe I'm not quite getting what you're saying.
If the drives test out OK afterward, something's going on at boot time that makes them unavailable to the array.
I'd really be suspicious of the backplane/disk enclosure here, and I'd see what the manufacturer has to say.
dbrunton:
>> Which lends credence to them being OK and just not ready.

I'll add a couple of comments to this statement.

I'd be looking at the SCSI cable and the host adapter in this case. And possibly the power supply; it may not be capable of delivering the power everything requires.
Hey guys, I'm going to get hold of Supermicro today, so bear with me. But the backplane is a good possibility, since I had already called Adaptec and they said it could be the problem. Another thing: I had a single drive on a dual-Xeon box (the same exact type of setup); the machine did not have a RAID controller in it (it wasn't running RAID, hence the single drive). And it died on me in about a week.

Maybe that will lead to something else. I'm in the process of setting up another machine, which has dual backplanes and, I'm sure, a different type of RAID controller.


But all suggestions are welcome, and I really appreciate the time you guys take to help me figure out the problem.
While a drive dying in another box, especially a non-redundant drive, is a pain, I think it's just anecdotal.
You said the drives from the degraded array tested OK outside the system.
Unless these drives are all from one lot and you suspect a manufacturing defect, I'm liking the backplane/bus as the likely suspect.
Indeed, I was able to format the failed drive and put it back into the array, so that shows it's not really the drives. There's a good chance all the drives are from the same lot, though we have two different models of the drive (the drive specs are the same, just different model numbers).

So I should look into the backplane/bus issue, correct? Do you guys have any suggestions on how I could go about testing it?

(BTW, that non-redundant drive I lost was just on a test box I had set up, so it wasn't too important, thank God.) Right now I'm not setting up any machine unless it's using RAID 5 and has two hot spares. You know, I ran these drives for quite a while on a different box with an older motherboard, and never had a single problem. Then I went to these new boxes with a newer motherboard, and have had nothing but problems with the drives.

When I say older motherboard, I mean months older, not years, but they are different models.
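One cheap way to start testing the backplane/bus theory is to check whether the boot-time failures cluster on a single SCSI ID (pointing at that drive or slot) or move around (pointing at the bus/backplane). A minimal sketch, assuming the controller or event log messages have been exported to text; the line format in the regex matches the "Bad SCSI Status" messages quoted earlier in the thread, but your export may differ:

```python
import re
from collections import Counter

# Tally SCSI failure messages per device ID from an exported log.
# The message format is an assumption -- adapt the regex to your export.
FAILURE_RE = re.compile(r"Bad SCSI Status.*?ID=(\d+)|ID=(\d+).*?Bad SCSI Status",
                        re.IGNORECASE)

def failures_by_id(log_lines):
    """Count Check Condition-style failures per SCSI ID.
    Failures spread across many IDs implicate the bus/backplane;
    failures pinned to one ID implicate that drive or slot."""
    counts = Counter()
    for line in log_lines:
        m = FAILURE_RE.search(line)
        if m:
            counts[int(m.group(1) or m.group(2))] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "LUN=0 BUS=0 ID=0 Bad SCSI Status - Check Condition",
        "LUN=0 BUS=0 ID=3 Bad SCSI Status - Check Condition",
        "system booted normally",
    ]
    print(failures_by_id(sample))
```

The classic physical version of the same test: swap a suspect drive into a different slot and see whether the failure follows the drive or stays with the slot.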
That's one of those "zero-channel RAID" setups through a dedicated PCI slot?
yes, that is correct.
ASKER CERTIFIED SOLUTION
chicagoan
I do have the latest 2003 drivers, but I'm going to have to check about the latest BIOS; give me a few minutes and I'll update this and let you know.
The BIOS shows I2O v.001.62, but the date doesn't match the date in the link. So I'm going to update both of these on a new machine.
I had the latest 2003 driver (I tried reinstalling it). The BIOS looked like the same version, but it was updated anyway. So both of those are done.

We've got another very similar machine, and we are setting it up with RAID 5 (it has a split backplane) and two hot spares. We are going to run it for a while and see if we run into any more problems.
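For reference, the usable capacity of that layout works out simply: hot spares sit out of the array, and one active drive's worth of space goes to parity. A minimal sketch, assuming N equal-size drives; the 36 GB figure in the example is hypothetical, not taken from the thread:

```python
def raid5_usable_gb(total_drives: int, hot_spares: int, drive_gb: float) -> float:
    """Usable capacity of a RAID 5 set with hot spares:
    spares are idle, and one active drive's capacity holds parity."""
    active = total_drives - hot_spares
    if active < 3:
        raise ValueError("RAID 5 needs at least 3 active drives")
    return (active - 1) * drive_gb

if __name__ == "__main__":
    # Hypothetical example: 8 x 36 GB drives with two hot spares
    print(raid5_usable_gb(8, 2, 36.0), "GB usable")
```

The trade-off is explicit here: each hot spare costs a full drive of capacity, but buys an immediate automatic rebuild when a member fails, which matters when failures keep happening at reboot.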

I'm going to go ahead and award the points to you, but if you can think of anything else to try down the road, please reply.

Thanks