8mathieu8

BSOD : How to identify the source ???

I've been having a lot of Blue Screens of Death on my WinNT 4 SP6 BDC server lately. I'm not very familiar with identifying the source of a crash. I've read in a couple of places that it is usually a driver or a bad RAM problem. I ran a test on the RAM and it reported no errors. Here is what was written on the screen the last time it crashed:

STOP: 0x00000000, 0x0000001C, 0x00000001, 0x801176AA

IRQL:1e    SYSVER    0xf0000565
80100000

I know that it is possible to identify the possibly faulty driver from these addresses. But how do I make the link between the two?

thanks!
wesbird

There's usually a list of drivers on the right-hand side of the blue screen, which may hold clues.

Here's a place to start: http://www.sun.com/desktop/products/sunpci/bsod.pdf

or http://aumha.org/win5/kbestop.php
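One way to make the link the question asks about: the NT 4 blue screen lists each loaded module next to its load (base) address, and the module that owns a faulting address is normally the one with the highest base at or below that address. A minimal sketch in Python, assuming the `80100000` noted under the STOP line is the base of the kernel image. The other bases and the function name are hypothetical placeholders, not values read from this server:

```python
from bisect import bisect_right


def owning_module(address, modules):
    """Return the (base, name) pair with the highest base <= address, or None.

    Heuristic only: the blue screen shows base addresses, not image sizes,
    so an address past the end of the last image would still match it.
    """
    ordered = sorted(modules)
    bases = [base for base, _ in ordered]
    index = bisect_right(bases, address)
    return ordered[index - 1] if index else None


# Assumption: 0x80100000 is the kernel image base (value taken from the question).
# The other two bases are hypothetical and only there for illustration.
modules = [
    (0x80100000, "kernel image (assumed)"),
    (0x80010000, "hal.dll (hypothetical base)"),
    (0xF0D00000, "some_driver.sys (hypothetical)"),
]

base, name = owning_module(0x801176AA, modules)
print(f"0x801176AA -> {name}, offset 0x{0x801176AA - base:X}")
```

With the real list copied from the screen (or taken from a debugger's module list), the same lookup can be run for the last STOP parameter of each crash.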




ASKER

The names and the numbers seem to always be different.

The last time that it crashed, I had these names written on the screen:

hal.dll
diskport.sys
intlf...
floppy.

I didn't have time to write them all down because the computer rebooted, but it seems to be hardware related.

I know that the bugcheck numbers are recorded in eventvwr, but is there a way to get all the info from the blue screen of death (running processes and all)? I've heard about connecting a device to the COM port or something to record that information.

I'm going to read the document that you pointed me to in the meantime.

thanks!
I don't know if it is related, but I've noticed that the server makes a short "beepbeep" sound a couple of minutes after the computer is fully booted. Could this be related? I don't remember hearing that before I started having BSOD problems.

My server is an IBM Netfinity 3500 M10 - M/T 8655-21Y.

I'm going to check IBM's documentation.
Read pages 11-18 of the Sun BSOD primer that wesbird linked to.  What bug code shows up at the Blue Screen?

In my experience it's very hard to determine which driver is failing (if indeed it is a driver problem) by looking at the blue screen.  Try booting into safe mode and check which driver loads before the blue screen occurs.  That may be the culprit.

Also, have you recently loaded any programs, drivers, or security patches?  What other software is running on the server?  
What is the bugcheck code of your BSOD?

The following looks like the bugcheck parameters:
STOP: 0x00000000, 0x0000001C, 0x00000001, 0x801176AA

If the bugcheck code is 0x1000000A or 0x100000D1, the second bugcheck parameter is the IRQL. IRQL 1C is the clock-level interrupt. Code running at that level is unlikely to fail unless there is a hardware error (i.e. RAM, CPU, or motherboard).

Download Windows Memory Diagnostic and stress test your RAM.
http://oca.microsoft.com/en/windiag.asp




>Try booting into safe mode and check which driver loads before the blue screen occurs.

The problem is that it does not occur at boot-up. It happens sporadically, about every 2-3 days, independently of the session status (logged in/out, different applications).

>Also, have you recently loaded any programs, drivers, or security patches?  What other software is running on the server?  

I don't remember installing any programs lately.

I ran a test on the RAM using a program called QuickTech (which is a bootable floppy).
I'll try to run tests on the CPU and motherboard. I think that there is a tool in QuickTech for the CPU, but not for the motherboard. Do you know of one?

I wrote down everything that I had time to write from the BSOD. Where can I find the bugcheck code (in what section of the BSOD is it)? I thought it was the "IRQL:1e    SYSVER    0xf0000565". Is there supposed to be one in NT 4 Server?
When Windows XP or W2K crashes with a blue screen, it writes a system event 1001. Check system event 1001; it holds the content of the blue screen.

Control Panel -> Administrative Tools -> Event Viewer -> System -> Event 1001. Copy the content and paste it back here.

I am not sure whether Windows NT writes system event 1001 when it crashes.
Yes there are 1001 system events recorded. Here is the full text of the description:

The computer has rebooted from a bugcheck.  The bugcheck was: 0x00000024 (0x00190201, 0xf0d3e700, 0xf0d3e53c, 0x80225f63). Microsoft Windows NT [v15.1381]. A dump was saved in: C:\WINNT\MEMORY.DMP.

Here are the bugchecks from some of the other ones:

0x0000000a (0x00000007, 0x0000001c, 0x00000000, 0x801175af)
0x0000000a (0x00000000, 0x0000001c, 0x00000001, 0x801176aa)
0x0000004e (0x00000002, 0x000e1fdf, 0x0000fffc, 0x0000ffff)
0x00000024 (0x00190201, 0xf0cfa584, 0xf0cfa3c0, 0x80133dcd)
0x0000001e (0xc0000005, 0xa005ff37, 0x00000001, 0x01c52657)
0x0000000a (0x00140000, 0x0000001c, 0x00000000, 0x801175af)

They are nearly always different. Does it make sense to say that it could be the motherboard?
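To see at a glance how varied the crashes are, the event 1001 descriptions can be parsed and tallied by bugcheck code. A rough Python sketch, assuming the text has been copied out of Event Viewer by hand. The regular expression and the function name are illustrative, not part of any Windows tool:

```python
import re
from collections import Counter

# Matches "0xCODE (0xP1, 0xP2, 0xP3, 0xP4)" as it appears in the event text.
BUGCHECK_RE = re.compile(
    r"(0x[0-9a-fA-F]+)\s*\(\s*(0x[0-9a-fA-F]+),\s*(0x[0-9a-fA-F]+),"
    r"\s*(0x[0-9a-fA-F]+),\s*(0x[0-9a-fA-F]+)\s*\)"
)


def parse_bugcheck(text):
    """Return (code, [p1, p2, p3, p4]) as integers, or None if nothing matches."""
    match = BUGCHECK_RE.search(text)
    if match is None:
        return None
    values = [int(group, 16) for group in match.groups()]
    return values[0], values[1:]


# The seven bugchecks quoted in this post.
events = [
    "The bugcheck was: 0x00000024 (0x00190201, 0xf0d3e700, 0xf0d3e53c, 0x80225f63).",
    "0x0000000a (0x00000007, 0x0000001c, 0x00000000, 0x801175af)",
    "0x0000000a (0x00000000, 0x0000001c, 0x00000001, 0x801176aa)",
    "0x0000004e (0x00000002, 0x000e1fdf, 0x0000fffc, 0x0000ffff)",
    "0x00000024 (0x00190201, 0xf0cfa584, 0xf0cfa3c0, 0x80133dcd)",
    "0x0000001e (0xc0000005, 0xa005ff37, 0x00000001, 0x01c52657)",
    "0x0000000a (0x00140000, 0x0000001c, 0x00000000, 0x801175af)",
]

tally = Counter(parse_bugcheck(line)[0] for line in events)
for code, count in tally.most_common():
    print(f"0x{code:08x}: {count}")
```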

ASKER CERTIFIED SOLUTION
cpc2004
I've checked my records: there are 3 confirmed cases with IRQL 1c where the culprit was the CPU.
One confirmed case at EE and two cases at www.techspot.com:
https://www.experts-exchange.com/questions/21306911/Spontaneous-rebooting-help.html
I have IRQL 1e and not IRQL 1c. Do you still think that it is related to CPU?

I will try the Windows Memory Diagnostic when possible since it requires a reboot.

Could inserting the RAM in another slot fix the problem?
You may try reinserting the RAM in another memory slot.
BC          BCP1        BCP2        BCP3        BCP4
0x0000000a (0x00000007, 0x0000001c, 0x00000000, 0x801175af)

Bugcheck parameter descriptions:
1  Memory referenced
2  IRQL at time of reference
3  0: Read, 1: Write
4  Address which referenced memory

BCP2 is the IRQL, which is 1C (i.e. decimal 28, the clock-level interrupt). I don't know how you got IRQL 1e (i.e. decimal 30), which is the power level. Are you mixing up IRQ and IRQL?
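The parameter table above can be turned into a small decoder for bugcheck 0x0A. This is only a sketch: the function name is made up, and the IRQL table holds just the two levels named in this thread, not a complete list:

```python
# Only the two IRQL values discussed in this thread; anything else is "unknown".
IRQL_NAMES = {0x1C: "clock level", 0x1E: "power level"}
ACCESS_NAMES = {0: "read", 1: "write"}


def decode_stop_0a(params):
    """Label the four parameters of a 0x0000000A bugcheck."""
    memory, irql, access, address = params
    return {
        "memory_referenced": memory,
        "irql": irql,
        "irql_name": IRQL_NAMES.get(irql, "unknown"),
        "access": ACCESS_NAMES.get(access, "unknown"),
        "faulting_address": address,
    }


# Parameters of the first 0x0000000a bugcheck quoted above.
decoded = decode_stop_0a([0x00000007, 0x0000001C, 0x00000000, 0x801175AF])
print(f"IRQL {decoded['irql']} ({decoded['irql_name']}), "
      f"{decoded['access']} of 0x{decoded['memory_referenced']:08x} "
      f"from 0x{decoded['faulting_address']:08x}")
```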


One confirmed case of IRQL 1c related to a faulty CPU:
https://www.experts-exchange.com/questions/20999245/IRQL-NOT-LESS-OR-EQUAL.html?query=bugcheck+0x0000000a&topics=231

Mini052304-12.dmp
        STOP: 0x0000000A (0x00000003, 0x0000001C, 0x00000000, 0x804f5715)
                                              ^^
I had another one yesterday at 10 pm, while no one was connected to that server.

0x00000024 (0x00190201, 0xf0d3e700, 0xf0d3e53c, 0x80225f63).

Is this memory related as well?

Ok, now I understand your IRQL 1c (thank you for the clear explanation by the way)

From what I understand, I could have both a CPU and RAM problem or maybe it is a motherboard problem that creates those problems altogether.


Bugcheck code 24 may be caused by faulty cache memory in the CPU or M/B. Of my resolved cases for IRQL 1C, 3 were a faulty CPU and one a faulty M/B (i.e. 75% CPU and 25% M/B). In one more case the problem owner did not respond. You can run memtest to stress test the memory; however, memtest does not distinguish between a RAM problem and a cache memory problem.

If you want to find the exact root cause, upload the minidump to any webspace. You can find the minidump in \winnt\minidump
Ok, I've run some tests on the RAM and CPU. RAM is fine in both the Windows Memory Diagnostic and QuickTech tests.

CPU on the other hand gives me a Fail result on the Periodic Interrupt Test, which is part of the Real Time Clock/CMOS test in QuickTech.

What should I do, change the processor?
Yes, replace the CPU.
I installed another (identical) processor in the server and it still gives me a Fail result on the Periodic Interrupt Test (Real Time Clock/CMOS). I did that yesterday and the server hasn't crashed yet. Should I conclude that it is a motherboard problem?
A crash at IRQL 1c is caused by either the CPU or the M/B. I have 4 confirmed cases: 3 for CPU and 1 for M/B. My sample is too small, hence it is not accurate. If it is not the CPU, it must be the motherboard.
The RAM in this server is ECC. I've read in the Windows Memory Diagnostic documentation that the test could be unreliable for that kind of memory.

Did those confirmed cases have only code 1c, or did they also have memory and CPU error codes?
All the confirmed cases have various bugcheck codes. As you know, hardware errors occur randomly. If the CPU hardware error occurs in some other NT routine, NT crashes with a memory error. If it crashes at IRQL 1c (clock interrupt), I'm very sure it is a hardware error. As the clock-level interrupt code is well tested, it does not crash unless there is a hardware error.
The following case was caused by a faulty M/B:
https://www.experts-exchange.com/questions/21222403/Bugcheck-0x1000007f-Random-Full-Reboot.html

The following cases were caused by a faulty CPU:
https://www.experts-exchange.com/questions/21306911/Spontaneous-rebooting-help.html
http://www.techspot.com/vb/showthread.php?p=143574#post143574  see page 8, Kritonas
Kritonas installed the CPU hot tester and found out it was a CPU problem. You can ask him for the URL of the hot tester.

One case is still open and the problem is related to either the CPU or the M/B:
http://www.techspot.com/vb/showthread.php?p=144841#post144841  page Boxer
Ok, the computer still hasn't crashed. It usually crashes every 2-3 days or so. If it still hasn't crashed by the end of this week, I'll conclude that the CPU was the culprit.

Do you mean that you've replaced the CPU?
Yes
:(
It crashed again... with the new CPU. It took longer though... about 11 days.

I got this bugcheck for this crash...

The bugcheck was: 0x00000024 (0x00190201, 0xf402bc88, 0xf402bac4, 0x8010a975).

Your previous bugcheck 0x00000024 (0x00190201, 0xf0d3e700, 0xf0d3e53c, 0x80225f63).
Your new       bugcheck 0x00000024 (0x00190201, 0xf402bc88, 0xf402bac4, 0x8010a975).

The first bugcheck parameter is exactly the same in both cases. RAM and CPU errors usually occur randomly and cause varied failure patterns, but your pattern is regular. You may have more than one culprit, and one of them may be disk corruption (i.e. structural corruption of the file system). Run chkdsk /f /r
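Microsoft's bug check reference for 0x24 (NTFS_FILE_SYSTEM) describes the first parameter as a source-file identifier in the high 16 bits and a source line number in the low 16 bits, which would fit the regular pattern noted above. A small Python sketch under that assumption (the function name is made up for illustration):

```python
def split_ntfs_param1(param1):
    """Split bugcheck 0x24 parameter 1 into (source file id, source line).

    Assumes the layout documented for NTFS_FILE_SYSTEM: high 16 bits are a
    source-file identifier, low 16 bits are a line number in that file.
    """
    return (param1 >> 16) & 0xFFFF, param1 & 0xFFFF


previous = split_ntfs_param1(0x00190201)  # crash before the CPU swap
latest = split_ntfs_param1(0x00190201)    # crash after the CPU swap

file_id, line = latest
print(f"source file id 0x{file_id:04x}, line {line}")
print("same NTFS code location both times:", previous == latest)
```

The remaining three parameters differ between the two crashes, so only the location inside the file system driver repeats, not the data involved.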
You think that I might have bad sectors? Is the chkdsk of NT 4 safe (non-destructive)? From past experience with Norton System Works on client machines, I never had a problem. But this is a very important production server that is used for finance and that cannot be rebuilt easily.
No, bad sectors cannot be recovered. I mean file system structural corruption, which chkdsk can fix.
I thought that the /r option "Locates bad sectors and recovers readable information"?

I can run it, but how long will it take, and is it safe, or could it make things even worse?
Check Event Viewer for error messages from SCSI and FASTFAT (System Log) or Autochk (Application Log) that might help pinpoint the device or driver that is causing the error. You should also run hardware diagnostics supplied by the hard disk manufacturer.
Upload the latest minidump to any webspace. I will study the dumps and find the culprit. I am not sure whether I can process your NT minidumps, as I have XP, W2K and W2K3. You can find the minidumps in the folder \winnt\minidump
It's going to be hard for me to upload the dump file to a webspace because the file is 255 MB! The file is \WINNT\MEMORY.DMP

Nothing from SCSI, FASTFAT and Autochk in Event Viewer.

I'll check for the hardware diagnostics supplied by the hard disk manufacturer.

thanks!
After you zip the dump it will be less than 100 MB. BTW before you replaced the CPU, how often did the NT?
how often did the NT...?... crash? If that is the question: about every 2 to 3 days.

Ok, I'll zip it. Do you have a webspace where I could dump it?
Get public webspace
Go to www.geocities.com - sign up for an account (they are owned by Yahoo - if you have a yahoo account, then you just need to activate the geocities part of it there). Then use their tools to upload the file.
I prefer the minidumps. Can you find the minidumps in the folder \winnt\minidump? They are much smaller (around 64k to 90k).
Sorry, I can't find the minidump folder. Are you sure that it exists in NT4?
Procedure to set whether NT writes a minidump:
Control Panel --> System --> Advanced --> Startup and Recovery --> Write debugging information -->  minidump
I don't see any Advanced tab or button in CP --> System. There is a Recovery section in Startup/Shutdown, but there are no options for a minidump. There is just one for "Write debugging information to..." and this is the memory.dmp file that I've talked about.
NT does not support minidumps. Zip the kernel dump and upload it to any webspace.
I checked geocities and they only give 15 MB free. Do you know any webspace that would give about 100 MB?
Usually a bugcheck is caused by faulty RAM or a disk error (which can usually be fixed by chkdsk). If you are worried about the data, back it up to tape or to another hard disk before you run chkdsk.
Upload the following files to any webspace. I will study your log and dumps.
C:\Documents and Settings\All Users\Documents\DrWatson\user.dmp
C:\Documents and Settings\All Users\Documents\DrWatson\drwtsn32.log
Do you have any update on the problem?
If your NT still gets bugcheck code 24, it may be faulty RAM. Close the question if there are no more blue screens.
Sorry, I didn't have time to put the files in a webspace. I'll try to do it when I have time. Besides that, I've swapped the RAM of that faulty server with that of another one and haven't had any problems since then. I want to wait a little longer before saying that this fixed the problem, since it is sporadic.
If your NT still had a hardware error, it would crash within one week in a production environment. Before you replaced the faulty CPU, your NT crashed every 2-3 days. It is a great improvement. BTW, do you have any update on the problem?
The server hasn't crashed since the memory replacement. Hence I must conclude that it was a faulty memory problem.

thanks for all your help cpc2004!
Your NT had two hardware errors: the faulty CPU and the RAM. Most of the crashes were caused by the faulty CPU. Some faulty RAM can pass memtest. Maybe memtest did not run long enough (i.e. your NT took 11 days to crash). I don't think you can afford to shut down the NT server for 2 weeks to run memtest. You can use Prime95 to stress test your RAM, and it runs within Windows. I am not sure whether Prime95 supports the NT platform.

If the RAM cannot pass memtest, it must be faulty. Even if it passes memtest, that does not mean it is good.
I've run 2 different kinds of tests on the memory, and they both gave a pass result. If I'm not mistaken, the fact that the memory is ECC can give a false positive result.