barnesm6

Raid keeps failing on sbs2003

I need urgent help with this one. My client has a "server" using a Gigabyte GA-8I945PL-G motherboard. It has been set up with Raid 5 within SBS2003. However, although this works fine for a short while, it's not long before the raid fails; recently 2 out of the 3 drives failed. I know this board states that it has onboard IDE raid. If this is not disabled in the BIOS, will it cause a conflict with the software raid?

I've had a similar issue with a client using mirroring on the same board, where one of the two drives goes off-line and the server has to be rebooted to get it back, but after resynching the mirror it fails again after a short while (hours/days, not quite sure).
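
For context on why losing 2 of 3 drives takes the whole volume down: RAID 5 keeps one parity block per stripe, so any single missing block can be rebuilt from the others, but two missing blocks cannot. Below is a minimal Python sketch of that XOR parity idea only. It is not how the SBS2003 software raid is implemented, and the block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-disk array: two data blocks plus one parity block.
data_a = b"EXCHANGE"
data_b = b"LOGFILES"
parity = xor_blocks(data_a, data_b)

# One disk lost: the missing block is the XOR of the two survivors.
rebuilt_b = xor_blocks(data_a, parity)
rebuilt_a = xor_blocks(data_b, parity)

# Two disks lost: only one block of the stripe is left, so there is nothing
# to XOR it against and the other two blocks cannot be recovered.
```

With three disks this means the array survives exactly one drive dropping out; a second drive going off-line before the resynch finishes loses the volume.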

500 points if I get an answer quickly.
ASKER CERTIFIED SOLUTION
Jeffrey Kane - TechSoEasy
This solution is only available to members.
hmmm...

I would tend to disagree, as the board also supports JBOD. If RAID is not set up in the BIOS, then the controller will only present the individual drives to the OS.

When you say that two drives fail, do they fail in terms of a physical fault, or do the drives still function? If it is the second, then I would run chkdsk to check for bad blocks; if that is OK, then try TechSoEasy's suggestion to see if it makes things better. If, however, the drives have physically failed, then we need to look at a different area. One often overlooked is the power supply. If you have three or more drives in the system, then a larger power supply may help.
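
As a rough sketch of the bad-block check, here is one way to script it in Python. The drive letter `D:` and both helper names are assumptions for illustration; `/R` is the standard chkdsk switch that locates bad sectors and recovers readable data (it implies `/F`). This has to run on the Windows box itself with administrator rights, and on a volume that is in use chkdsk may ask to schedule the scan for the next reboot.

```python
import subprocess


def build_chkdsk_command(volume, scan_bad_sectors=True):
    """Return the argument list for a chkdsk run on a volume such as "D:"."""
    command = ["chkdsk", volume]
    if scan_bad_sectors:
        # /R: locate bad sectors and recover readable information.
        command.append("/R")
    return command


def run_chkdsk(volume):
    """Run chkdsk on a Windows host and return its exit code (not called here)."""
    completed = subprocess.run(build_chkdsk_command(volume), check=False)
    return completed.returncode
```

Calling `run_chkdsk("D:")` once per raid member volume would be the obvious usage; a surface scan with `/R` can take hours on large drives.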

Mike

barnesm6

ASKER

After disabling the onboard RAID the system lasted for 4 days before the raid failed again. I'm not sure if it's power due to the fact that it will last for several days and then fail and there's no extra load at the time that it goes.

The most recent time it failed I rebooted the server and got a disk boot failure message. I disconnected all 3 raid drives and the os booted fine. Reconnected two of the raid drives and still boots fine - put the third on and now get 'error loading os'.

Reconfigured the way the drives were connected to the board to make it more logical: Boot disk on controller 0, Raid drive 1 on controller 1 and so on. System booted fine with all three connected. Rebooted and half way through resynching the whole raid array became unavailable.

I've since put 2 separate PCI SATA controllers on the board, connected 2 of the raid drives to them and left one on the board. Following this the raid went off line again, but this time I was able to reactivate 2 of the disks from disk management but not the other, which I suspect is the one connected to the main board.

I'm now of the opinion that it's actually a controller problem on the board. I think I'm going to scrap the raid 5 and mirror the 2 drives on the sata cards. Unfortunately Exchange was installed to the raid and the log files and database are now screwed.
Interesting that you mention "power".  With all of those drives... how big is your power supply?  How close are you to being at maximum load?

Jeff
TechSoEasy
Sorry for not responding sooner. The PSU is 500w, although I don't know how reliable it is. I did think that it could possibly be the PSU, so as they're still having problems with the drives just disappearing at random, I've totally disconnected two of the drives, leaving one OS drive and one Data drive. This was set up last Tuesday and so far so good.

I was of the opinion that a 500w should be more than enough for 4 drives, but that said, I suppose if it's a cheap PSU it may not be able to sustain sufficient power.
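
A back-of-the-envelope version of that reasoning, sketched in Python. Every number here is an assumption rather than a measurement from this server: roughly 30w per drive at spin-up, 200w for the rest of the system, and a cheap PSU only sustaining about 70% of its label rating. Adjust them to the real hardware.

```python
def estimated_load_watts(num_drives, watts_per_drive_spinup=30, base_system_watts=200):
    """Worst case: every drive spinning up at once, plus the rest of the system."""
    return base_system_watts + num_drives * watts_per_drive_spinup


def usable_psu_watts(rated_watts, derating_percent=70):
    """Power a PSU can sustain if it only delivers a percentage of its rating."""
    return rated_watts * derating_percent / 100


def headroom_watts(rated_watts, num_drives, derating_percent=70):
    """Spare capacity in watts; a negative value means the PSU is overloaded."""
    return usable_psu_watts(rated_watts, derating_percent) - estimated_load_watts(num_drives)


good_psu_margin = headroom_watts(500, 4, derating_percent=100)
cheap_psu_margin = headroom_watts(500, 4)
```

Under these assumed figures a 500w unit that really delivers its rating has plenty of margin for 4 drives, while one that sags to 70% is left with very little, which is the cheap-PSU worry in a nutshell.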

I have another client with a board by the same manufacturer, just a different model, and he's had similar issues with drives going off line. You can't reactivate them without shutting down, disconnecting, rebooting, shutting down and then reconnecting; only then will the drives pick up again. I've since (Friday gone) done the same for him and disconnected two of the drives. I suspect that even when the OS can't see the drives they may still be drawing power.

Anyway, I'm going to run this for another week and see if the server stays up OK. If it does I think I'll put a better quality PSU in the system and then reconnect the other drives.

Fingers crossed anyway otherwise I'm going to have to pull the server and put a completely new board in thus having to rebuild the whole thing and rejoin everyone to the domain. Joy!!!
Well, actually if you're using SATA drives, then 500w should be adequate for about 8 or 10 of them, so no problems there.  I did a quick search to see if it could be something else... do you have a CD/DVD drive sharing IDE1 with another Drive?  Or just there by itself?  

From a feedback post at newegg.com (http://snipurl.com/sag8):

   "Cons: BE SURE (if using EIDE HDD & CD/DVD drive) to install on IDE 1. BIOS will not recognize (for boot) ANY drives on IDE 2 & 3. SATA works GREAT & is  
   detected by XP when installing. BIOS is very straightforward. Drivers from CD (includes Norton software) loaded without any problems.

   Other Thoughts: Running Pentium D 930 (3.0GHz), and system temp NEVER goes above 25C. Couple the MOBO with an excellent powersupply & case, and
   you're ready to go!"

Jeff
TechSoEasy
This is just too strange...

Server has lasted now since last Tuesday afternoon with no problems. This morning I can't log in remotely and the server is showing 'Disk boot failure'.

Reset server, no drives detected in bios, still same error
Power off and back on (via pwr button), same response
Shut off power disconnect all case fans, power back on, detects drives and boots fine with no data loss
Reconnect fans, power back on, still boots ok.

Had a Winpower 650w PSU by the way, replaced this now with Winpower 500w just to eliminate power problem although I'll probably swap this out for a better brand later this week as the winpower ones are cheap as chips.

The last entry in the system event log was at 15:12 Sunday, no further entries until server booted back up at 9:30 this morning.
Last entry in application log was just after midnight (nothing between 15:11 Sunday and 9:30 Monday except this one entry)
On reboot event log records that server was shut down unexpectedly shortly after 1:00am this morning
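
To narrow down when the box actually died, the timestamps from both logs can be merged and the longest silent stretch picked out. A small Python sketch follows; the calendar dates and the exact "just after midnight" time (00:05) are placeholders, only the 15:12 and 9:30 times come from the logs above, and `largest_gap` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def largest_gap(timestamps):
    """Return (gap, start, end) for the longest silence between consecutive entries."""
    ordered = sorted(timestamps)
    gaps = [(later - earlier, earlier, later)
            for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps) if gaps else None


# Placeholder dates standing in for "Sunday" and "Monday".
system_log = [datetime(2006, 6, 25, 15, 12), datetime(2006, 6, 26, 9, 30)]
application_log = [datetime(2006, 6, 26, 0, 5)]  # "just after midnight" (assumed time)

system_only = largest_gap(system_log)
combined = largest_gap(system_log + application_log)
```

Looking at the system log alone suggests over 18 hours of silence, but merging in the application log shows the server was still writing entries after midnight, which fits the unexpected shutdown recorded for shortly after 1:00am.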

I've already built a new Proliant ML310 in preparation for them now and will put this in over the weekend if I can't get to the bottom of this.
Update, this is getting sooooooooo frustrating.

It's not the power, it's not the drives, it's not the controller, it's not the cables and it's not the OS.

Question??? My in-depth knowledge of the bios is a bit limited. Does anyone think that upgrading the bios will help? Once the bios has done its thing at start up, does it continue to carry out any tasks that affect the drives?

I understand that modern boards are a bit more reliant on the bios than they used to be for things like power management, thermal management etc but I can't find anything that may be responsible for making the hard drives just stop responding.
The rule of thumb is that if there is an update to the Mobo BIOS you should ALWAYS apply it... there can only be good results (or no results) from doing so.

But since you originally called this box a "server" you may just want to replace the thing anyhow... since many folks call those kinds of machines "DUDs" (Dressed up Desktops) http://uksbsguy.com/blogs/doverton/archive/2006/06/07/541.aspx

Jeff
TechSoEasy
OK..... I've updated the bios and the server has gone for 12 days without failing. The only other thing that has changed is that I haven't fully logged in remotely since flashing the bios: I have connected via RDP, got to the logon screen and then disconnected. I wanted to prove something to myself, as the last time the server failed was shortly after I remoted into the server and accessed computer management. This was on a Sunday, and on the Monday morning the customer reported the system was down, so it happened sometime after I logged off my RDP session.

Anyway, the server's been running fine for 12 days from the Friday I flashed the bios. I remoted in fully last night and..... you guessed it, got a call from the customer this morning: can't log in. The customer checked the server and it's reporting a disk boot failure (I think this could be unrelated to the crashing though). If you press reset, the server still has a disk boot failure, but if you remove the power for a few seconds and plug it back in, it then boots fine.

I've done a quick bit of research and I've seen a few people mentioning that Mac OS has had some issues with RDP causing the server to crash, but I can't find anything where server 2003 crashes when an XP machine accesses it remotely.

Does anyone know why RDP could be causing the crashes?? This could be coincidence, but as it's happened at least twice (and I did suspect it before but thought it was a coincidence) I really doubt it is.
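
One way to test the coincidence theory is to note every RDP logon and every crash, then measure how long after the most recent logon each crash happened. A minimal Python sketch, assuming the times are collected by hand; the timestamps below are placeholders and `delay_since_last_logon` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def delay_since_last_logon(logons, crash):
    """Time between a crash and the latest logon at or before it, or None if none."""
    ordered = sorted(logons)
    index = bisect_right(ordered, crash)
    if index == 0:
        return None  # no logon preceded this crash
    return crash - ordered[index - 1]


# Placeholder times: one full RDP logon in the evening, a crash about 2 hours later.
rdp_logons = [datetime(2006, 7, 18, 22, 0)]
crash_time = datetime(2006, 7, 19, 0, 0)

delay = delay_since_last_logon(rdp_logons, crash_time)
```

If the delays cluster within a few hours of a full logon across several crashes, and no crash ever shows up without a preceding logon, that would be much harder to put down to chance.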

Cheers
Mike
If you are logging in at night, and then logging out... can you log back in?  

Also, what is the EXACT error message(s) that's recorded in the System Event Log.  There should be a few for this kind of issue.

Jeff
TechSoEasy
I did try to log straight back in on the Sunday and did get the login screen up, so I just cancelled it. The server crashed approx 2 hours after I logged in, so it may be unrelated, but it seems like too big a coincidence. The problem is that it doesn't log anything in the event logs; I think the drives just stop responding and the server crashes.

However, on this occasion I have had an email report stating that the server restarted unexpectedly but I'm reluctant to access the system remotely at this point and would rather check the logs next time I'm on site.

I did actually get the following critical errors following the reboot:

Critical Errors in Application Log

Source      Event ID   Last Occurrence     Total Occurrences
Perflib     1017       19/07/2006 09:53    2 *
Performance counter data collection from the "DAVEX" service has been disabled due to one or more errors generated by the performance counter library for that service. The error(s) that forced this action have been written to the application event log. The error(s) should be corrected before the performance counters for this service are enabled again.

Source      Event ID   Last Occurrence     Total Occurrences
Perflib     1022       19/07/2006 09:53    2 *
Windows cannot open the 64-bit extensible counter DLL DAVEX in a 32-bit environment. Contact the file vendor to obtain a 32-bit version. Alternatively if you are running a 64-bit native environment, you can open the 64-bit extensible counter DLL by using the 64-bit version of Performance Monitor. To use this tool, open the Windows folder, open the System32 folder, and then start Perfmon.exe.

Source      Event ID   Last Occurrence     Total Occurrences
dsrestor    1005       19/07/2006 09:45    1
The DSRestore Filter failed to connect to local SAM server. Error returned is <id:997>.


The last time it failed prior to this one was also when I remoted in but that time it failed while I was logged on - I right clicked on 'My Computer' to bring up computer management and it crashed. This is why I'm starting to suspect this is related.
 
Given points to techsoeasy for the continued help with this, although the 'server' is still failing and we, MS & Gigabyte can't work out why. Time to throw in the towel and stick another server in. I am going to start a new question about migrating AD to the new server.
Well, sorry you didn't get full resolution.  I generally don't like to accept points for a question that does not have a solution... so I'll offer one to make it easy (on the question that is, maybe not on you):

Get a real server.

To be honest, you are using hardware that is NOT approved as compatible with the Windows Server 2003 family.  You should check out http://www.windowsservercatalog.com/default.aspx to find appropriate hardware for your server.

Sorry that's not really the answer you were looking for, but it truly is a solution to your problems.

Jeff
TechSoEasy