Dell PowerEdge T300 intermittent boot issues

I have a Dell PowerEdge T300 running SBS 2003.  It is set up as RAID 0 on a SAS 6/iR controller.  It has two hot-swappable 500 GB drives, and the array is partitioned into a 25 GB system drive and a 900+ GB "data" partition.

The server had an issue with the Exchange database maxing out. Instead of calling me, someone decided to "reboot" the server.  I am pretty sure they pulled the plug out of the back when the server hung on shutdown.  When they tried to bring it back up, the server would not boot.  It displayed the "F1 to retry boot, F2 to enter setup" prompt, and then they called me.

I rebooted the server a couple of times. Each time the controller detected Volume (00:00) as failed and I ended up at the F1/F2 screen.  I hit F1 and it would not boot.  So I pulled HD0 out of the cage and shook it, put it back in, booted up, and was able to get into Windows.  To make sure, I shut down the server and it started having the same issues.  I decided to call Dell.

The Dell tech said I needed to reset the BIOS and then, once back in Windows, update the BIOS and controller firmware, and that this would resolve my issue.  Dell had me reset the BIOS, which took it back to the defaults and re-enabled SATA-A, and the server threw an error about that.  So Dell had me disable SATA-A and then the server booted into Windows.  I told the Dell rep that I wanted to shut it down and see if it would come up again without issue.  It didn't.  It threw the Volume (00:00) failed error after the controller initialized and then I got the F1/F2 error.

So Dell had me create a diagnostic CD and boot from that.  While the hard drive test was running, I explained to him that there are two 500 GB drives in the server and only one (HD1) was lit up with green lights.  HD0 had no lights on at all, while HD1 had one solid and one blinking when there was HD activity.  About 20 minutes into the HD test, the completion percentage was stuck at 4% but the byte/sector count was still increasing.  He had me stop the HD test because, he said, HD0 wasn't even being tested: the lights aren't on, so it's dead, and he is sending me a new hard drive.  I did explain to him that when the system does boot OK, both hard drives have two green lights.

So while I was giving him my info I decided to reboot the server.  After 3 or 4 unsuccessful boot attempts I was able to get it to boot up. Each time I rebooted I reseated HD0.

I'm confused because at first the Dell tech said the issue was the BIOS and controller needing updates, and then all of a sudden he had me stop the test and decided that a failing HD0 was the issue.

So basically my questions are:
How do I know or figure out if it's the HD that is the problem and it isn't the controller or backplane?
If the HD is bad how come I can get it into Windows and Windows seems to operate just fine when running?
What's the deal with the green light on HD0 not working and why wouldn't they be amber or red if it really was dead?

It just seems to me like the controller or the MB is bad, and I'm sure Dell would much rather it be an $80 500 GB HD than an expensive MB or RAID controller.

Thanks Experts!
PowerEdgeTech (IT Consultant) Commented:
To answer your questions:

How do I know or figure out if it's the HD that is the problem and it isn't the controller or backplane?
It is sometimes difficult to determine which is the point of failure without other parts to swap out and try with, but the rate of failure of a drive is MUCH higher than the rate of failure of a backplane or controller.  Sure those components can fail, but it is much more likely a bad drive.  So, without a way to determine for sure, Dell has a 99% chance of fixing the issue with a replacement drive.

If the HD is bad how come I can get it into Windows and Windows seems to operate just fine when running?
Intermittent problems can cause a drive to work fine one moment and not the next, so if it is an intermittent problem, you may get lucky and get past the point of failure.

What's the deal with the green light on HD0 not working and why wouldn't they be amber or red if it really was dead?
The light on the hard drive is green when Online, or participating in an array, and amber when Offline, or NOT participating in an array.  It is not an indication of the drive's health (unless blinking alternately green/amber).  A drive can pass diagnostics but be offline and the light will be amber.  Likewise, a drive can fail diagnostics and be online and the light will be green.  It is merely an online/offline status indicator.

Good luck.
jamietoner Commented:
First off, RAID 0 is a very bad idea if this is an important server, as it has no redundancy. If one of the drives completely fails (something reseating it won't temporarily remedy), the system is down and any data that wasn't backed up will be very expensive to recover. Since reseating the drive temporarily remedies the issue, I'd say either the drive or the backplane has a bad connection. The only way to know for sure is to test with a known-good HDD. Because you have RAID 0, when you replace the possibly faulty drive you will have to recreate the array, reinstall (or reimage), and restore data from backup.
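To put a rough number on the redundancy point, here is a minimal sketch in Python. It assumes drives fail independently and that each has the same annual failure probability. The 3% figure and the function names are made up for illustration and are not measurements for these drives. It also ignores rebuild windows and correlated failures, so treat it as a back-of-envelope comparison only.

```python
def raid0_loss_probability(p_drive: float, n_drives: int) -> float:
    """Striped array (RAID 0): the volume is lost if ANY member drive fails."""
    return 1.0 - (1.0 - p_drive) ** n_drives


def raid1_loss_probability(p_drive: float, n_drives: int) -> float:
    """Mirrored array (RAID 1): the volume is lost only if ALL members fail.

    Simplification: no rebuild or replacement is modeled.
    """
    return p_drive ** n_drives


if __name__ == "__main__":
    p = 0.03  # hypothetical annual failure probability per drive (assumption)
    print(f"single drive: {p:.2%}")
    print(f"2-drive RAID 0: {raid0_loss_probability(p, 2):.2%}")
    print(f"2-drive RAID 1: {raid1_loss_probability(p, 2):.2%}")
```

Under those assumptions a two-drive stripe is roughly twice as likely to lose the volume as a single drive, while a two-drive mirror is far less likely to.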
xactdesign (Author) Commented:
Thanks jamietoner, I am aware of this WRT the RAID 0.  Unfortunately I didn't set up the server, so this is what I am stuck with until we upgrade/replace the server.

The server is up now and I am running backups.  I threw the question out there because I don't want to reinstall the OS and do a recovery if it's a backplane issue.  In that case it will throw the same error with a new drive, and it will also waste about 8 hours of my time.
Steve Commented:
"So I pulled HD0 out of the cage and shook it."

Can I just check we understood this? Did you really take the HDD out of your server and shake it like a box of sweets?

No offence mate, but I think you should do what Dell says and stop using your own diagnostic techniques.

Get a new HDD and get this on RAID 1 ASAP. RAID 0 on a server is ridiculous and is highly likely to fail again.
I'd also recommend updating the firmware on the RAID/disk controller, as outdated firmware is a known source of issues on Dells.

xactdesign (Author) Commented:
That is good Qlemo
Qlemo (Batchelor, Developer and EE Topic Advisor) Commented:
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
Question has a verified solution.
