itpro365

Windows Server 2003 BSOD

Hardware is a Dell PowerEdge 2600
Perc 4e D1 SCSI Ultra 320 Drives Maxtor

Windows Server 2003 SP2

Stop Code = 0x000000F4 (0x00000003,0x08590C280,0x8590C3E4,0x8967C6C)

Any help would be great.
PowerEdgeTech

Is the amber light on your system on (the one that is usually blue)?
Does it boot up and you just get this every so often?
Or does it not start and this is the error message you get?
Did you try Last Known Good Configuration and/or Safe Mode?
Do you have your OS CDs handy?
itpro365

ASKER

Yes the amber light is flashing.
Yes it boots, but then eventually crashes.
No to LKGC
Yes to the CD
I have the minidump  Mini030811-01---Copy.txt
Sorry wrong file.
 Mini030811-01.dmp
Amber light = you have a hardware problem.  
Are there any other amber lights on the server - on the drives, power supplies, etc.?
As the system is going through its BIOS/POST screens, look at everything that scrolls on the screen.  What messages do you see?
More Info
On Wed 3/9/2011 7:19:15 AM GMT your computer crashed
crash dump file: C:\Windows\Minidump\Mini030811-01.dmp
This was probably caused by the following module: ntoskrnl.exe (nt+0x7C4A0)
Bugcheck code: 0xF4 (0x3, 0xFFFFFFFF85959D88, 0xFFFFFFFF85959EEC, 0xFFFFFFFF80967CEC)
Error: CRITICAL_OBJECT_TERMINATION
file path: C:\Windows\system32\ntoskrnl.exe
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: NT Kernel & System
Bug check description: This indicates that a process or thread crucial to system operation has unexpectedly exited or been terminated.
This appears to be a typical software driver bug and is not likely to be caused by a hardware problem.
The crash took place in the Windows kernel. Possibly this problem is caused by another driver which cannot be identified at this time.
The WhoCrashed application provided the information in my last post.
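For anyone reading the stop code by hand: the four values in parentheses are the bug check parameters, and for 0xF4 Microsoft's bug check reference documents the first one as the type of object that terminated (0x3 for a process, 0x6 for a thread). Below is a minimal Python sketch that splits the WhoCrashed line above into its parts. The function name `parse_bugcheck` and the lookup table are made up for illustration and are not part of any tool mentioned in this thread.

```python
import re

# First parameter of bug check 0xF4 (CRITICAL_OBJECT_TERMINATION):
# the kind of object that terminated.
F4_OBJECT_TYPES = {0x3: "process", 0x6: "thread"}


def parse_bugcheck(line):
    """Return (code, [parameters]) as integers from a bug check line."""
    values = [int(v, 16) for v in re.findall(r"0x[0-9A-Fa-f]+", line)]
    return values[0], values[1:]


# Line copied from the WhoCrashed output in this thread.
line = ("Bugcheck code: 0xF4 (0x3, 0xFFFFFFFF85959D88, "
        "0xFFFFFFFF85959EEC, 0xFFFFFFFF80967CEC)")
code, params = parse_bugcheck(line)
terminated = F4_OBJECT_TYPES.get(params[0], "unknown")
print(hex(code), terminated)
```

Here the first parameter is 0x3, so a critical process (rather than a thread) exited. That only tells you what died, not why.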
I pulled out all the drives and the power supply. Re-seated them and now the amber light is gone. I will see if we still get a dump.
The PE2600 is kind of stupid, because it lacks the LCD panel that practically every other PE has to tell you what is wrong.  Best thing to do is to run diagnostics to see if that will tell you the faulty part:
http://support.dell.com/support/downloads/download.aspx?c=us&cs=04&l=en&s=bsd&releaseid=R206154&SystemID=PWE_FOS_XEO_2600&servicetag=&os=WNET&osl=en&deviceid=196&devlib=0&typecnt=0&vercnt=17&catid=-1&impid=-1&formatcnt=0&libid=13&typeid=-1&dateid=-1&formatid=-1&source=-1&fileid=288046

You could also try booting to an OMSA Live CD ... if you can, you can check the Hardware Log for the exact error:
http://linux.dell.com/files/openmanage-contributions/omsa-54-live/omsa-54-040308.iso
If you get back into Windows, check the Hardware Logs in OMSA.  If you don't already have it installed, here it is:
ftp://ftp.dell.com/sysman/OM_5.5.0_ManNode_A00.exe

Download and run to extract, then run C:\Openmanage\windows\setup.exe
Well amber light is now gone, but BSOD is still happening. I will try the Open Manage.
Here are some things I'd try:

Test the RAM

http://www.memtest.org/
http://www.ultimatebootcd.com/

Boot a Linux boot disk and torture-test the system components such as RAM, CPU, and disk.

http://fedoraproject.org/en/get-fedora
So here are the results of the first test.


photo.JPG
Well I cleared the log files as instructed. Rebooted. Replaced the memory with RAM from an identical box.  Ran the test again and got the same exact results.
Did you clear the Hardware logs (OMSA, System, Logs, Clear)?  After clearing them, it should not have shown the error from 2007 ... are you saying that it failed DIMM_1A on the actual test (not the Event Log Scan)?
I cleared all the log files in Event Viewer - application, hardware, system, etc. I just re-read the error; it is failing the pre-test only at this point. But I don't see any other logs to clear. I do not have OpenManage installed.
The Hardware Log has nothing to do with Windows Event Logs ... it is kept by the ESM/BMC for hardware-related errors and warnings.  You can view these logs and clear the logs in OMSA (see post above for link and installation instructions).

Or ...

You can clear it with DSET - Run and Clear option:
ftp://ftp.dell.com/sysman/Dell_DSET_1.6.0.131_A01.msi

Well I tried to install Open Manage and during the setup I received multiple errors stating there was a delayed write... And then BSOD
Then try DSET - you can run that option without actually installing, which should put less load on the machine.  If that doesn't work, then you might consider the OMSA Live CD (Linux-based) to run OMSA to view the logs (find out what the errors are) and clear the logs (so you can get a clean test if the hardware log is not enough).
Ok - I will first try DSET and then I will try the OMSA Live CD. Just a note. The drives seem to just go quiet for long periods of time, just before the BSOD. Also, they seem to take forever to spin up during POST. POST takes approximately 9 mins.
Very likely a failed drive - if you're lucky - or a controller/motherboard problem.  If it is a failed drive, it could be causing problems (if it hasn't already).  Any amber lights on your drives?
Ok, ran the DSET Run and Clear option. The ESM file that was created only has the following in it:
Embedded System Management (ESM) Log

Health : Ok

Embedded System Management Log contains...

Severity      : Ok
Date and Time : Fri Mar 11 15:23:35 2011
Description   : Log cleared
Embedded System Management (ESM) Log

Health : Ok

Embedded System Management Log contains...

Severity      : Ok
Date and Time : Wed Jul 26 06:31:40 2006
Description   : Log cleared

Severity      : Critical
Date and Time : Thu Jul 27 05:55:52 2006
Description   : Bezel Intrusion sensor detected an intrusion

Severity      : Ok
Date and Time : Thu Jul 27 05:56:12 2006
Description   : Bezel Intrusion sensor return to normal
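If the ESM log is long, it can help to filter it down to the entries that are not "Ok". Below is a rough Python sketch that assumes the log is saved as plain text in the same three-field layout shown above (Severity, Date and Time, Description). The function name `parse_esm_log` is hypothetical and this is not a DSET or OMSA feature.

```python
FIELDS = ("Severity", "Date and Time", "Description")


def parse_esm_log(text):
    """Collect log entries as dicts; an entry ends at its Description line."""
    entries, current = [], {}
    for raw in text.splitlines():
        # The first colon separates the field name from its value,
        # so colons inside the timestamp are left alone.
        key, sep, value = raw.partition(":")
        key = key.strip()
        if not sep or key not in FIELDS:
            continue  # skip headers, blank lines, "Health : Ok", etc.
        current[key] = value.strip()
        if key == "Description":
            entries.append(current)
            current = {}
    return entries


# Sample text copied from the log posted in this thread.
sample = """Embedded System Management (ESM) Log

Health : Ok

Severity      : Ok
Date and Time : Wed Jul 26 06:31:40 2006
Description   : Log cleared

Severity      : Critical
Date and Time : Thu Jul 27 05:55:52 2006
Description   : Bezel Intrusion sensor detected an intrusion

Severity      : Ok
Date and Time : Thu Jul 27 05:56:12 2006
Description   : Bezel Intrusion sensor return to normal
"""

problems = [e for e in parse_esm_log(sample) if e.get("Severity") != "Ok"]
for entry in problems:
    print(entry["Date and Time"], "-", entry["Description"])
```

On the sample above this leaves only the 2006 bezel intrusion entry, which matches what you would spot by eye.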
Did it create a DSET Report (DSETsomethingsomethingservicetag.zip) on your Desktop?  If so, try to attach it here so we can see the hardware logs.
I ran the 3rd option that just clears the log. I can run the first option to create the report, but won't it be useless now that I cleared the logs?
Right.  Best now to run diagnostics (now that the log contains no errors).
Here is the DSET Report
File is blocked because of OSX and HTA
You can send it to poweredgetech@gmail.com, if you want (that is, if gmail takes password-protected ZIP's).

Another option ... call Dell Tech Support.  They have a "dropbox" you can upload it to so they can review it.  Support is always free, whether in or out of warranty.
I went ahead and ran the tool and did find this:
User generated image
Thanks PowerEdgeTech - I just sent you the zip without a password.
Or if that link doesn't work, then you can use this one:
http://dl.dropbox.com/u/18479552/DSETReport.zip
Well...

The Windows Event Log logged some RAID controller events, but since OpenManage was not installed to tell Windows what it meant, it doesn't contain much information.

Since OMSA was not installed, DSET could not pull the RAID controller log to take a look at the array.

Power Supply 2 is probably unplugged?  If not, it could be bad.

The RAID battery should be addressed but it's probably not responsible.  You can clear the "Recharge Count" in the controller BIOS (CTRL-M, Objects) ... resetting it will put it back under the "error" threshold and your system will stop telling you about it until it reaches that 1100 count threshold again.  At least that way, you can confirm or refute the idea that the amber light is on only because of the battery.

Check the BIOS, under Integrated Devices, and make sure the connector going to the tape drive is set to SCSI rather than RAID (sometimes this is not reported correctly in the DSET, particularly without OMSA installed, but a tape device should not be connected to a RAID controller).

Other than that, based on what we have so far, nothing else really jumped out at me.

I would run the diagnostics to see what comes up.  You can run another DSET should it happen again so we can see the Hardware entry for it.
So is it possible the degraded state of the raid is causing the bald?
Sorry not bald but bsod.
If you are running out of options, why not try the 3rd party memory test?  You say you changed the memory 'sticks/chips', but you still have the error right?
The RAM chips themselves might not be bad, but the RAM I/O controller or some other part could be.   If the memtest86 test runs for a few hours without any problem then you will have learnt something...  at least one part of the system is 'good'.   If you find any errors, test each RAM 'stick' one at a time in the first slot.

Why not take it a step further and label up all drives, connectors etc. and REMOVE any hardware that is not required to perform the memory tests, i.e. unplug the RAID controller from the main board (if it is a separate device) or disconnect the drives at a minimum.  Please be careful and take lots of photos of the system and label everything!  If you're not confident doing this, please call someone that can help.   - Do you have a backup of this server??

I would also advise making a physical check of the boards inside the machine. Can you see any bad capacitors, for example, or seized cooling fans? Take a look here for an example:

http://www.mikerepairscomputers.com/blog/wp-content/uploads/2009/07/Bad_Capacitor_01-200x150.jpg

Apologies if this all seems a bit basic.
It is only the RAID battery that is "degraded" - and only because it has exceeded the pre-defined number of recharges.  Even if the battery charge is too low or the battery dies altogether, the controller simply turns off the cache that the battery is for.  So, in a word: no, I don't believe it is causing the blue screen.

I would start with diagnostics from here to find out what the failed hardware is (if any).  As pc-cyt suggests, memory is a good place to start, and now that we have cleared the log we actually can run a clean test.  I would also test the rest of the machine - not just memory.

Remember, the amber light means there is a hardware error, but the following conditions, all of which apply to your system, will also turn on the amber light:
- Chassis being open and/or the front bezel is removed.
- Second power supply is not plugged in (but still in the machine).
- RAID battery is "degraded".

So, your "hardware" error may simply be one of the above ... easily fixed.  If that is all the light is on for and your system (including memory) passes diagnostics, then we're looking at a deeper OS issue.
Ok,
sorry for the delay.
The memory test ran perfectly fine.
I cleared the battery count.
I was able to actually run several OS updates and driver updates last night. It looked like I was in the clear, but this morning the system is down again.
Now the drives don't seem to want to spin up, even though all lights are green on the drives. I can't even get into the RAID controller by using CTRL-M.
ASKER CERTIFIED SOLUTION
PowerEdgeTech
This solution is only available to members.
Thanks PowerEdge.
I took the drives out and placed them into a spare 2600 and I have been up and running for 2 hours now without any issues.
That works too :)  If it was hardware-related, then it shouldn't BSOD again.  If it is the OS, I would assume it would.