brian_leighty

CRITICAL!!!! Looking for somebody to help me troubleshoot a stop error (BSOD?) that is happening very often....I get some odd feedback from the dump files.....

I have had several stop errors and need them debugged... It is more than likely software related... I know that because the hardware is brand new and I have replaced a lot of hardware trying to pinpoint the problem, without success...

Looking for serious and knowledgeable experts who will be thorough with a resolution... oh yeah, and lots of points!!

Thanks Guys
Brian
brian_leighty

ASKER

Upon request I will post all information....including minidumps and information....

I have an account to upload all files to online and will do so upon request....
Guy Hengel [angelIII / a3]
Please specify all the software / hardware versions that you think are relevant.
What have you already ruled out?
mainboard: Intel SE7320SP2
Dual 3.0GHz Xeon
memory: 266 ECC 1GB
RAID Controller: Adaptec SATA RAID controller card 1210SA w/ 2 120GB SATA Hard Drives running mirrored
OS Hard Drive is SATA 120 non-RAID running onboard SATA controller
Not using the onboard 100/1000 Network adapter
Using an Intel 100/1000 adapter which is PCI Express

OS: Windows 2003 Small Business
Windows 2003 Server with Service Pack 1
Small Business Server Pack 1

Using mainly Exchange Server and SQL server

The stop error is 4e and says PFN List Corrupt

This is the second install of the OS on this machine and I have had the same problem from day one... I even replaced the memory and motherboard with new ones.

I'm thinking that maybe it's the RAID card or the network card... I thought about switching to the onboard LAN and removing the PCI-X LAN card...
When Windows crashes with a blue screen, it writes system event 1001 and a minidump to the folder \windows\minidump. System event 1001 contains the content of the blue screen.

Control Panel -> Administrative Tools -> Event Viewer -> System -> Event 1001. Copy the content and paste it back here.

Zip 5 to 6 minidumps and upload the zip file to any web space. I will study the dumps and look for the culprit.
Well, when I formatted and rebuilt the server I did not have AV installed and the problem still existed...
It basically had nothing but the OS and Exchange.
I have uploaded all the minidumps since the server rebuild
to www.streamload.com

l:terabyte_junki
p:ih28lgih28lg
I have also added a screenshot of the system event from Event Viewer at streamload.com
The analysis report of Mini063005-02.dmp shows the footprint of naiavf5x.sys.

*** ERROR: Module load completed but symbols could not be loaded for Ntfs.sys
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for fltmgr.sys -
Unable to load image \SystemRoot\system32\drivers\naiavf5x.sys, Win32 error 2
*** WARNING: Unable to verify timestamp for naiavf5x.sys
*** ERROR: Module load completed but symbols could not be loaded for naiavf5x.sys
Probably caused by : fltmgr.sys ( fltmgr!FltGetIrpName+25a )

Followup: MachineOwner
---------

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PFN_LIST_CORRUPT (4e)
Typically caused by drivers passing bad memory descriptor lists (ie: calling
MmUnlockPages twice with the same list, etc).  If a kernel debugger is
available get the stack trace.
Arguments:
Arg1: 00000099, A PTE or PFN is corrupt
Arg2: 00020a55, page frame number
Arg3: 00000006, current page state
Arg4: 00000000, 0

Debugging Details:
------------------

CUSTOMER_CRASH_COUNT:  2
DEFAULT_BUCKET_ID:  DRIVER_FAULT_SERVER_MINIDUMP
BUGCHECK_STR:  0x4E
CURRENT_IRQL:  2
LAST_CONTROL_TRANSFER:  from 80865301 to 8087b6be

STACK_TEXT:  
f549d938 80865301 0000004e 00000099 00020a55 nt!KeBugCheckEx+0x1b
f549d964 80887a31 00000002 ffffffff c038fbac nt!MiRestoreTransitionPte+0x173
f549d97c 8086ae25 00000000 8086772a 00000000 nt!MiRemovePageFromList+0xd1
f549d984 8086772a 00000000 9b48c000 00001000 nt!MiRemoveAnyPage+0x68
f549da38 8084e105 9b48c000 00dbcc90 00000000 nt!MmCopyToCachedPage+0x437
f549dac8 8084df73 84eca328 00dbcc90 f549db0c nt!CcMapAndCopy+0x1b2
f549db58 f71ce6f2 84eb2f90 f549db98 00008000 nt!CcCopyWrite+0x29b
WARNING: Stack unwind information not available. Following frames may be wrong.
f549dbc0 f72fbcc0 84eb2f90 f549dc30 00008000 Ntfs+0x586f2
f549dbf4 f7308a2f 00000004 00000000 f549dc28 fltmgr!FltGetIrpName+0x25a
f549dc48 f4b3ffa7 84eb2f90 f549dcd0 00008000 fltmgr!FltProcessFileLock+0x65d
f549dc8c 8092688f 84eb2f90 f549dcd0 00008000 naiavf5x+0x1fa7  <--
f549dd38 80834d3f 0000030c 00000000 00000000 nt!NtWriteFile+0x317
f549dd38 7c82ed54 0000030c 00000000 00000000 nt!KiFastCallEntry+0xfc
0139fe14 00000000 00000000 00000000 00000000 0x7c82ed54

FOLLOWUP_IP:
fltmgr!FltGetIrpName+25a
f72fbcc0 e942010000       jmp     fltmgr!FltGetIrpName+0x3a1 (f72fbe07)

SYMBOL_STACK_INDEX:  8
FOLLOWUP_NAME:  MachineOwner
SYMBOL_NAME:  fltmgr!FltGetIrpName+25a
MODULE_NAME:  fltmgr
IMAGE_NAME:  fltmgr.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  42435ba1
STACK_COMMAND:  kb
FAILURE_BUCKET_ID:  0x4E_fltmgr!FltGetIrpName+25a
BUCKET_ID:  0x4E_fltmgr!FltGetIrpName+25a

naiavf5x naiavf5x.sys Fri Aug 20 19:42:57 2004 (4125E3C1)
naiavf5x.sys is the McAfee (NAI) antivirus filter driver.
Well, this is the first crash since I installed McAfee yesterday, but what about the previous ones? The other four minidumps were taken without McAfee installed.


How do I fix this problem? Do I run the server without antivirus, or is there an alternative?
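
Picking the non-Windows modules out of a STACK_TEXT listing can be done mechanically. The following is a rough Python sketch, not a debugger feature. It assumes the `kb` frame layout shown above (child EBP, return address, three arguments, then the symbol), and the `WINDOWS_MODULES` set is only an illustrative guess at which module names ship with Windows.

```python
import re

# One STACK_TEXT frame: child EBP, return address, three arguments, then the symbol.
FRAME = re.compile(r"^[0-9a-f]{8}(?: [0-9a-f]{8}){4} (\S+)", re.IGNORECASE)

# Assumption: modules treated as part of Windows; extend this set as needed.
WINDOWS_MODULES = {"nt", "ntfs", "fltmgr", "hal"}


def modules_in_stack(stack_text):
    """Return module names in the order they first appear in the stack."""
    seen = []
    for line in stack_text.splitlines():
        match = FRAME.match(line.strip())
        if not match:
            continue  # skips WARNING lines and blank lines
        symbol = match.group(1)
        if symbol.startswith("0x"):
            continue  # raw address that belongs to no known module
        module = re.split(r"[!+]", symbol)[0]
        if module not in seen:
            seen.append(module)
    return seen


def third_party_modules(stack_text):
    """Return the modules in the stack that are not in WINDOWS_MODULES."""
    return [m for m in modules_in_stack(stack_text) if m.lower() not in WINDOWS_MODULES]


sample = """\
f549db58 f71ce6f2 84eb2f90 f549db98 00008000 nt!CcCopyWrite+0x29b
WARNING: Stack unwind information not available. Following frames may be wrong.
f549dbc0 f72fbcc0 84eb2f90 f549dc30 00008000 Ntfs+0x586f2
f549dbf4 f7308a2f 00000004 00000000 f549dc28 fltmgr!FltGetIrpName+0x25a
f549dc8c 8092688f 84eb2f90 f549dcd0 00008000 naiavf5x+0x1fa7  <--
f549dd38 80834d3f 0000030c 00000000 00000000 nt!NtWriteFile+0x317
0139fe14 00000000 00000000 00000000 00000000 0x7c82ed54
"""
print(third_party_modules(sample))  # prints ['naiavf5x']
```

A module showing up in the stack only means it was on the call path when the corruption was detected. It does not prove that driver caused it.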
According to Microsoft's information on how to diagnose bugcheck 4E:
1. Disable all file system filter drivers, such as backup utilities, virus scanners, or firewall software.
2. Check for hardware errors. Install Windows Memory Diagnostic to check the RAM:
    http://oca.microsoft.com/en/windiag.asp


The debug report of Mini062105-01.dmp shows a memory access fault in the stack.

PFN_LIST_CORRUPT (4e)
Typically caused by drivers passing bad memory descriptor lists (ie: calling
MmUnlockPages twice with the same list, etc).  If a kernel debugger is
available get the stack trace.
Arguments:
Arg1: 00000099, A PTE or PFN is corrupt
Arg2: 0001fca2, page frame number
Arg3: 00000006, current page state
Arg4: 00000000, 0

Debugging Details:
------------------

CUSTOMER_CRASH_COUNT:  1
DEFAULT_BUCKET_ID:  DRIVER_FAULT_SERVER_MINIDUMP
BUGCHECK_STR:  0x4E
CURRENT_IRQL:  2

LAST_CONTROL_TRANSFER:  from 80865301 to 8087b6be

STACK_TEXT:  
f596d950 80865301 0000004e 00000099 0001fca2 nt!KeBugCheckEx+0x1b
f596d97c 80887a31 8599ad68 ffffffff 84e4c118 nt!MiRestoreTransitionPte+0x173
f596d994 8086ae25 e4c1878c 8082f2b2 000fffff nt!MiRemovePageFromList+0xd1
f596d99c 8082f2b2 000fffff bbb104de a1be3000 nt!MiRemoveAnyPage+0x68
f596d9d4 8082f796 e4c1878c 00008000 00000000 nt!MiResolveMappedFileFault+0x508
f596da08 8084a5e8 00000000 a1be3000 c0286f8c nt!MiResolveProtoPteFault+0x1a6
f596daa0 80849ce5 00000001 a1be3000 c0286f8c nt!MiDispatchFault+0x834
f596dafc 8082fd4f 00000000 a1be3000 00000000 nt!MmAccessFault+0x64a    <----- ???
f596db2c 8090f25d a1be3000 00000000 f596dc84 nt!MmCheckCachedPageState+0x48e
f596dbbc f71af431 851a01e0 a89e223e 00008000 nt!CcFastCopyRead+0x159
WARNING: Stack unwind information not available. Following frames may be wrong.
f596dc14 f72fbcc0 851a01e0 f596dc84 00008000 Ntfs+0x39431
f596dc48 f73088b3 00000003 00000000 f596dc7c fltmgr!FltGetIrpName+0x25a    
f596dc9c 80934677 851a01e0 f596dcd8 00008000 fltmgr!FltProcessFileLock+0x4e1
f596dd38 80834d3f 00000690 00000000 00000000 nt!NtReadFile+0x2c5
f596dd38 7c82ed54 00000690 00000000 00000000 nt!KiFastCallEntry+0xfc
0007f164 00000000 00000000 00000000 00000000 0x7c82ed54


FOLLOWUP_IP:
fltmgr!FltGetIrpName+25a
f72fbcc0 e942010000       jmp     fltmgr!FltGetIrpName+0x3a1 (f72fbe07)

SYMBOL_STACK_INDEX:  b

FOLLOWUP_NAME:  MachineOwner
SYMBOL_NAME:  fltmgr!FltGetIrpName+25a
MODULE_NAME:  fltmgr
IMAGE_NAME:  fltmgr.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  42435ba1
STACK_COMMAND:  kb
FAILURE_BUCKET_ID:  0x4E_fltmgr!FltGetIrpName+25a
nt!MmAccessFault is the page fault handler, and seeing it in the stack is normal. I believe the cause is faulty RAM or a faulty CPU. As you know, the information in a minidump is very limited. I need a full W2K3 memory dump if you want me to pursue the problem. Maybe I will find something new in the full dump.
I will need a way to get it to you. I'm not sure the compressed dump file will fit in the existing web space.
cpc2004,
I hope you are online today!!


I found that the PCI Express Intel 100/1000 MT server adapter card I used for the network has a capacitor completely ripped off the circuit board... just one little capacitor... Could this cause the system to have these problems? Since I removed the card there have not been any BSODs. Just a little while ago the system would not even start up without a BSOD first...


??????????????????????????????????????????
Yes, if it affects the stability of the motherboard, Windows will have unpredictable results such as a BSOD.
When I removed the LAN card and set up the same IPs on the onboard LAN, my Exchange worked but nobody could access any files from Network Neighborhood...



????????????????????
I got a new one. It's only a minidump though.
Please download and debug it and let me know what you think.
Mini070605-02.dmp is the filename.
I really need some help here guys, please. My boss is going to fire me if I don't get this fixed.
Your latest minidump is very strange as the load module list is corrupt. The failing module is intelppm and it looks like it is caused by faulty hardware.

PFN_LIST_CORRUPT (4e)
Typically caused by drivers passing bad memory descriptor lists (ie: calling
MmUnlockPages twice with the same list, etc).  If a kernel debugger is
available get the stack trace.
Arguments:
Arg1: 00000099, A PTE or PFN is corrupt
Arg2: 00012c02, page frame number
Arg3: 00000006, current page state
Arg4: 00000000, 0

Debugging Details:
------------------

Unable to load image Unknown_Module_f75f7000, Win32 error 2
*** WARNING: Unable to verify timestamp for Unknown_Module_f75f7000
*** ERROR: Module load completed but symbols could not be loaded for Unknown_Module_f75f7000

CUSTOMER_CRASH_COUNT:  2
DEFAULT_BUCKET_ID:  DRIVER_FAULT_SERVER_MINIDUMP
BUGCHECK_STR:  0x4E
CURRENT_IRQL:  0
LAST_CONTROL_TRANSFER:  from 8083abf2 to f75e9ca2

STACK_TEXT:  
808a3600 8083abf2 00000000 0000000e 00000000 intelppm+0x2ca2

FOLLOWUP_IP:
intelppm+2ca2
f75e9ca2 0000             add     [eax],al

SYMBOL_STACK_INDEX:  0
FOLLOWUP_NAME:  MachineOwner
SYMBOL_NAME:  intelppm+2ca2
MODULE_NAME:  intelppm
IMAGE_NAME:  intelppm.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  0
STACK_COMMAND:  kb
FAILURE_BUCKET_ID:  0x4E_intelppm+2ca2
BUCKET_ID:  0x4E_intelppm+2ca2
Since you are the expert, I will ask your advice...


In my many years of computer work I have seen strange things... chips with their little problems, etc.

I built two identical servers, exact same hardware and everything... same memory (2 x ECC 266MHz 512MB).
The only difference is that one server runs Small Business Server 2003 and the other runs Windows Server 2003 Standard... the same basic OS minus the little changes Microsoft made for Small Business.

Is it possible for a type of memory to work in one machine and not in the other? Basically, I have swapped memory in the machine only to get the same error, but the memory is identical.

Both machines had 1GB dual channel. I took all the memory out of the server with these BSODs (Small Business) and took 512MB out of the server without problems to put in the Small Business server.

I did this yesterday and so far everything has been OK... Using your expertise, what do you think is going on here?
Also, I may just not have waited long enough for the BSOD, but they were getting worse...

I know that's me rambling on. I'm trying not to confuse you too much.
Thanks
The faulty hardware may be the motherboard or the L2 cache memory of the CPU. Can you swap the CPUs between the two servers? If you still get the same problem, swap the motherboard.
The motherboard is new. It's a replacement from when I had the same problem before; replacing it was the first thing I did.

The processor itself had a capacitor (you know, the little squares that are soldered at each end) that had become detached but was still dangling. I soldered it back on and used a magnifying glass to make sure it was connected. It was large enough for me to handle... I'm thinking that if it was missing or didn't work I would have different problems than I do...

I hope


Thanks
 
I have a resolved case with bugcheck 4E where the solution was to reseat the memory in another memory slot:
http://www.techspot.com/vb/topic17691-pg10.html&pp=20 (refer to the post from DAE_DMP_IKA_DTP)
After reseating the memory, his Windows did not crash for three days. Then the BSOD recurred and he replaced the RAM. No more BSODs.
ya I replaced the RAM and so far everything is good...knock on wood...odd thing is that the memory that I took out (the Bad memory) will run in the other server without problems....
Maybe it is bad contact between the memory stick and memory slot.
Still there... replacing memory, cleaning slots, uninstalling antivirus, everything...
Even with the antivirus uninstalled I get the same naiavf5x.sys as you saw before... That is a system driver for McAfee antivirus...
The last couple of dumps have pointed at ntoskrnl.exe and fltmgr.sys.

Should I delete the page file and move it or recreate it?

Would you like to see the most recent memory dumps?


I have to get this problem fixed... Please help me, I'm going to be in deep, deep trouble if I don't get this problem fixed ASAP...
Could this be a corrupt page file or something?


PFN_LIST_CORRUPT (4e)
Typically caused by drivers passing bad memory descriptor lists (ie: calling
MmUnlockPages twice with the same list, etc).  If a kernel debugger is
available get the stack trace.
Arguments:
Arg1: 00000099, A PTE or PFN is corrupt
Arg2: 00009a73, page frame number
Arg3: 00000006, current page state
Arg4: 00000000, 0

Debugging Details:
------------------


CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  DRIVER_FAULT_SERVER_MINIDUMP

BUGCHECK_STR:  0x4E

CURRENT_IRQL:  2

LAST_CONTROL_TRANSFER:  from 8086b019 to 8087b6be

STACK_TEXT:  
f67e6b08 8086b019 0000004e 00000099 00009a73 nt!MmAdvanceMdl+0x137
f67e6b48 808463c2 c01c7248 71c92000 00000001 nt!Magic86400000+0x81d
f67e6c2c 80844f81 71c92000 71c97fff ff3559b8 nt!MiProtectVirtualMemory+0x520
f67e6ce8 80844e3c 81b417f8 81c08480 f67e6d64 nt!MiSwapWslEntries+0xc9
f67e6d38 80933700 00000000 ffb8f808 81b41a20 nt!MiFlushSectionInternal+0x40b
f67e6d54 80834d3f ffffffff 81b417f8 00eff564 nt!CcZeroData+0x1fc
f67e6d64 7c82ed54 badb0d00 00eff460 00000000 nt!ObpTraceDepth+0x3b
WARNING: Frame IP not in any known module. Following frames may be wrong.
00eff564 00000000 00000000 00000000 00000000 0x7c82ed54


FOLLOWUP_IP:
nt!Magic86400000+81d
8086b019 ??               ???

SYMBOL_STACK_INDEX:  1

FOLLOWUP_NAME:  MachineOwner

SYMBOL_NAME:  nt!Magic86400000+81d

MODULE_NAME:  nt

IMAGE_NAME:  ntoskrnl.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  42435e60

STACK_COMMAND:  kb

FAILURE_BUCKET_ID:  0x4E_nt!Magic86400000+81d

BUCKET_ID:  0x4E_nt!Magic86400000+81d

Followup: MachineOwner
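
One thing worth checking across the dumps is whether Arg2 (the page frame number) keeps pointing at the same spot in physical memory. Here is a small Python sketch of that arithmetic, assuming the standard 4 KB page size of 32-bit x86. The helper name `pfn_to_physical` is made up for illustration; the PFN values are the Arg2 values from the four dumps quoted in this thread.

```python
PAGE_SIZE = 0x1000  # 4 KB pages on 32-bit x86 (assumption stated above)


def pfn_to_physical(pfn):
    """Return the physical address of the first byte of page frame `pfn`."""
    return pfn * PAGE_SIZE


# Arg2 values from the four dumps quoted in this thread.
reported_pfns = [0x20A55, 0x1FCA2, 0x12C02, 0x9A73]

for pfn in reported_pfns:
    address = pfn_to_physical(pfn)
    print(f"PFN {pfn:#x} -> physical {address:#010x} (~{address / 2**20:.0f} MB)")
```

The four frames land at roughly 522 MB, 509 MB, 300 MB and 154 MB, so they are spread across the installed 1GB rather than clustered on one page. That is at least consistent with a cause other than a single bad memory cell, though on its own it proves nothing.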
It is a hard problem. Report it to Microsoft; maybe it is a software problem in Windows Server 2003 SP1.
Are you familiar with the boot.ini switch /userva=????   e.g. /userva=3030



Although my second boot disk is W2K3, I always boot from the first boot disk, which is XP. /userva is used together with the /3GB switch to tune the user-mode address space on W2K3. Honestly speaking, I have never used /userva.
What do I do for help? What is next if I cannot find the problem...
ASKER CERTIFIED SOLUTION
cpc2004
I'm on the phone with them right now and I can hardly understand what the technician is saying.
What is the update from Microsoft on your problem?
No, it's a bad processor. Intel is shipping one to me right now... I'm only running on one processor and it sucks.
Do you mean Microsoft has not replied and you think it is a faulty CPU?
Microsoft thinks it is a faulty CPU. I took one out and so far no BSODs.


I think it had a bad CPU and I just got lucky with which one I removed.
Most likely the L2 cache memory of the CPU is faulty.
Well, I hope that fixes the problem, because my company is working on a 1.3 million dollar job and if the servers do not work then it's bad news.
You are not going to believe this...

Intel sent me the wrong stepping of CPU... I removed it and reinstalled the full 1GB of ECC memory, because the server had been running at 512MB for debugging and stuff... When I installed the full 1GB the same error started up again. I removed the 512MB to put the server back where it was when it was working without error on one processor, and it continued to crash... About 3 restarts later, with the debugging setup, it works just fine...

Is it possible that I had to clear out the memory or something?
Do you mean the CPU debugging setup resolved the problem?
No, it did not at first... but after the 3 restarts the server quit restarting... It all started when I put the 1GB of memory in the server. Then I took the extra 512MB back out and the server restarted 3 times, whereas before it worked great... but after that it was fine... It is like the memory was still corrupted, and after it was cleared out it worked just fine...
You suspect that the memory was corrupted, and I can't figure out why it now works without problems. Hardware problems always have strange symptoms and unpredictable results, which makes Windows troubleshooting more difficult. If your W2K3 still crashes, you can try downclocking the RAM or CPU.
I'm not able to downclock in the BIOS settings...

Does ECC registered memory store information that could be carried over when removing and reinstalling it... so that when I put the 512MB DIMM back in the server to make 1GB it still had the memory from the past in its registers?

I'm thinking outside the box... because I'm all out of options...
Do you have any update of your problem?
ya man I appreciate all your help it was a CPU