IT2100
Unexpected Server Shutdown Causes RAID Drives to Show as Unconfigured Bad
My server is a Gateway E-9520T. It has an Intel Xeon 5110 processor with 2 GB of RAM, running Windows Server 2008 with McAfee Anti Virus Enterprise installed. The server shut down unexpectedly in the early morning. It knocked off the four hard drives (Western Digital WD1001FALS-00J7B1) that I have on a RAID, configured as two virtual disks, each mirrored as RAID 1. One is the OS and the other is the office data. I was able to reconfigure the RAID on the drives, but the office data drive came up with a corrupt Master File Table, so now I am trying to recover the data through Kernel NTFS. I got the OS up and running, but I would like to figure this problem out because it has happened more than once and is random.
FYI: There were no other errors in the Event Viewer; the only error was ID 6008, whose description says that the system shut down unexpectedly and at what time. The server has two power supplies, one to take over if the other fails. I tested both of them and they worked. I also have an APC battery backup on the unit, which monitors power fluctuations and has reported none.
The RAID controller is an LSI MegaRAID 8308ELP. An event occurs on reboot stating "Fatal - Controller ID: 0 Controller cache discarded due to memory/battery problems", but it still boots up.
I ran BurnInTest on the system and everything passed. I also removed the CMOS battery from the motherboard and checked its voltage; it was giving the right amount.
What would cause this to happen? It happens too frequently, 4 to 5 times within the past two months, and has caused a great amount of downtime. We are a small office with 5 workstations, so we only have one server in the office.
ASKER
Do I have to replace it with the same RAID controller, or do you recommend a specific controller? Will any controller fit my server and work with the drive bays?
ASKER
Wait, you're just saying the RAID controller, not the SATA cards for the drive bays, right? I think I misinterpreted the first time, sorry.
Check the voltage of your server's power supply; if possible, substitute the power supply.
If the voltage is incorrect or failing, the RAID controller can also fail to read data from the disks correctly.
Another thing to check is the external devices connected to it,
e.g. printers, tape devices, etc.
These devices can create a power spike when they operate.
Check it.
ASKER
Using CPUID for the power supplies I got:
Power Supply 1:
CPUVCore - 1.25V
AUX - 1.22V
+3.3V - 1.23V
+5V - 4.97V
+12V - 10.70V
-12V - (-1.01V)
-5V - 0.43V
Power Supply 2:
CPUVCore - 1.25V
AUX - 1.22V
+3.3V - 1.23V
+5V - 4.97V
+12V - 10.70V
-12V - (TRIAL)
-5V - 0.43V
Also, I only have one external drive hooked up for backup, and this problem occurred before the drive was connected.
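For a quick sanity check, the readings above can be compared with nominal rail tolerances. This is a minimal sketch, assuming the usual ATX/EPS limits of ±5% on +3.3V, +5V and +12V and ±10% on -12V; the `ATX_RAILS` table and `check_rail` helper are names made up for illustration. Software sensor readings are often mislabeled or unscaled, so an out-of-range value here is a reason to measure with a multimeter or check the BIOS, not proof of a bad supply.

```python
# Nominal rail voltages and allowed tolerance (fraction of nominal).
# Assumption: standard ATX/EPS limits, +/-5% on positive rails, +/-10% on -12 V.
ATX_RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "-12V": (-12.0, 0.10),
}


def check_rail(name, measured):
    """Return (in_tolerance, low_limit, high_limit) for one rail reading."""
    nominal, tol = ATX_RAILS[name]
    margin = abs(nominal) * tol
    low, high = nominal - margin, nominal + margin
    return low <= measured <= high, low, high


# Readings as posted for Power Supply 1.
readings = {"+3.3V": 1.23, "+5V": 4.97, "+12V": 10.70, "-12V": -1.01}

for rail, volts in readings.items():
    ok, low, high = check_rail(rail, volts)
    status = "ok" if ok else "OUT OF RANGE"
    print(f"{rail:>6}: {volts:6.2f} V (allowed {low:.2f} to {high:.2f}) {status}")
```

With the posted values, only the +5V rail falls inside its window. Readings such as 1.23V on the +3.3V rail would normally keep a machine from booting at all, which suggests the monitoring tool may be mapping the sensors incorrectly on this board.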
You only need to replace the RAID controller, not the drives, backplane, or anything else. Replace it with the same model; this RAID card is fine, yours has just gone bad.
ASKER
Are these values normal or do I have a problem with my power supply?
Check the values in the BIOS; among software readings, that is the most reliable.
Buy one and check. You have an intermittent problem, so the values may look OK while the problem is still there.
ASKER
Ok, I understand the RAID Controller and I ordered a replacement, but what causes the unexpected shutdown? I wouldn't expect that to be a RAID Controller.
As for what is causing the RAID to fail, it is possibly either of the two below:
The RAID controller is faulty, or at least its memory/cache is faulty.
This mostly happens due to a bad or variable power source, dust, moisture, or unstable/high grounding. Get your environment checked, especially for this server rack; see whether the grounding is proper and monitor whether the value is stable.
Shabhi
I have seen this issue many, many times, and it is almost always the RAID cache memory.
Also, the controller itself has told you that it was not able to flush the contents of the cache to disk. When the controller has this error, it almost always causes a full bus reset and will shut down the server.
ASKER
OK, I replaced the RAID controller on Saturday and I no longer get the memory error, but there is a battery backup card that I think I will get for it. The major problem, though, is that I had another unexpected shutdown on Sunday, the day after I replaced the card. It lost the configuration on the hard drives. My data was OK, though. Again, there was no event message other than the time and date the unexpected shutdown occurred (ID 6008). I believe my power supplies should be fine, plus I have a UPS on them. What could be causing this?!
ASKER
FYI the UPS is brand new and the software is set to shut it down with 5 mins battery left.
Did the server throw a stop code or just hard reset?
ASKER
No stop codes in the Event Viewer, just an ID 6008.
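Since each ID 6008 entry records when the shutdown happened, one cheap check is to tally those times by hour of day and see whether they line up with anything scheduled (a backup job, an antivirus scan, and so on). A rough sketch follows; the `shutdowns_by_hour` helper and the sample timestamps are hypothetical and only illustrate the idea.

```python
from collections import Counter
from datetime import datetime


def shutdowns_by_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count unexpected-shutdown events per hour of day (0-23)."""
    hours = (datetime.strptime(ts, fmt).hour for ts in timestamps)
    return Counter(hours)


# Hypothetical timestamps for illustration only; substitute the times
# recorded in the actual ID 6008 entries from the System log.
example = ["2010-03-01 02:14", "2010-03-09 02:40", "2010-03-20 15:05"]

for hour, count in sorted(shutdowns_by_hour(example).items()):
    print(f"{hour:02d}:00  {count} shutdown(s)")
```

If most of the real events land in the same hour, that points toward a scheduled task or a recurring load rather than a purely random hardware fault.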
ASKER
Does anybody have an idea what this might be?
As I said earlier, check whether the power and ground values are stable. Once you replace the battery backup card, and with the server PSU fine right now, you might get stable behavior for a while, but you have to make sure that it will not happen again; that is why I am referring to the power and grounding check.
See whether any minidump files exist in C:\Windows\Minidump\*.dmp
Analyse all of them with the Windows Debugging Tools to look for a possible crash (BSOD).
Regards,
VS
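To see at a glance whether any dump was written around the time of a shutdown, the folder can be listed with modification times. This is a small sketch, assuming the default minidump location mentioned above; `list_minidumps` is an illustrative helper name, not part of any tool.

```python
from datetime import datetime
from pathlib import Path


def list_minidumps(folder=r"C:\Windows\Minidump"):
    """Return (file name, last-modified time) pairs for *.dmp files, newest first."""
    folder = Path(folder)
    if not folder.is_dir():
        # No folder usually means no dump has ever been written.
        return []
    dumps = []
    for path in folder.glob("*.dmp"):
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        dumps.append((path.name, modified))
    return sorted(dumps, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, modified in list_minidumps():
        print(f"{modified:%Y-%m-%d %H:%M}  {name}")
```

If the newest dump is older than the latest ID 6008 event, Windows most likely never got the chance to write one, which is more consistent with a hard power loss or reset than with a blue screen.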
ASKER
The last time a mini dump file was modified was on 2/02/2010.
Are you sure that the power supply is OK?
A RAID controller consumes a lot of power; double-check the power supply.
Also check the earth ground of the electrical installation.
Connect another computer to the same line as your server, and check whether the extra computer also goes down when the server goes down.
Regards,
VS
ASKER
Grounding is ok.
ASKER
It has two power supplies one as a backup.
ASKER
I tested both of them by unplugging one at a time, and they both seem to be working right.
So the next thing to try is hardware...
Is the RAM ECC?
Check that it is seated correctly; it may be making bad contact with the motherboard.
Check that every card is properly seated in its motherboard slot.
Test the server memory with, e.g., Memtest86.
The Ubuntu live CD has Memtest86 on its boot menu.
Regards,
VS
ASKER
I've been doing some looking around. I lost three drives on the bezel, so I looked at the backplane board, and it has cloudy white marks on the back. I took it out and swapped it with the bottom one. I hadn't lost any drives on the bottom. Do you think the backplane could cause it?
ASKER
top bezel*
ASKER CERTIFIED SOLUTION
Either the memory on the RAID controller or the controller itself is failing; both should be replaced.
When cache memory goes bad, so does your data.
Replace the parts and your problem will be solved.