Solved

Customer gets stop error Event ID 1003

Posted on 2005-12-22
Last Modified: 2012-08-13
Customer describes a Blue Screen stop error she gets. I never saw the error first hand, so I had to depend on something showing up in the Event Logs. An Event ID 1003 error has been showing up once per day. I assume this is the error causing the system halt. The text of the error is: "Error code 0000007f, Parameter 1 00000000, Parameter 2 00000000, Parameter 3 00000000, Parameter 4 00000000." Need to find out what the cause is ASAP.
Question by:071171
21 Comments
 
LVL 12

Expert Comment

by:GinEric
ID: 15540182
That's a strange error; what does EventViewer say?

Once a day sounds like a DNS TTL, though, off the top of my head.

I'll look it up, but you could do the same at http://support.microsoft.com/
0
 
LVL 12

Expert Comment

by:GinEric
ID: 15540199
It appears to be the code for Stack Overflow.  That can be caused by many things, drivers, malware.

I didn't find any with all zeroes in the parameters, which is suspicious.

Do you have more information, from EventViewer?  or a .dmp file for analysis?
0
 
LVL 5

Expert Comment

by:sachin_raorane
ID: 15540204
What's the source program of this Event ID 1003?

You can look up various source reasons for this Event ID at

http://www.eventid.net/display.asp?eventid=1003



0
 
LVL 93

Expert Comment

by:nobus
ID: 15540563
0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15545763
The culprit may be faulty RAM. Run memtest to stress test the RAM.
http://www.experts-exchange.com/Operating_Systems/WinXP/Q_21510623.html
0
 

Author Comment

by:071171
ID: 15553581
Source: System Error; Event ID: 1003; several .dmp files - where do I put them?
0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15553791
Get public webspace
Go to www.geocities.com - sign up for an account (they are owned by Yahoo - if you have a yahoo account, then you just need to activate the geocities part of it there). Then use their tools to upload the file.
0
 

Author Comment

by:071171
ID: 15554041
Go to http://www.geocities.com/alefsky and choose File Manager.  Once there you'll see dump.zip which contains 5 .dmp files. Thanks.
0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15554320
I went to http://www.geocities.com/alefsky but I can't find a File Manager option to download the minidumps.
0
 

Author Comment

by:071171
ID: 15558522
cpc2004 -

How do I upload files on Geocities? Must I create a web page?
0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15560087
Email the minidumps to me; you can find my email address in my profile.
0
 
LVL 20

Accepted Solution

by:
cpc2004 earned 1500 total points
ID: 15569182
I believe that the culprit is faulty RAM. Your minidumps crashed with different symptoms, which is itself a symptom of a hardware error. As you know, hardware problems occur randomly. One minidump crashed at hal!KfRaiseIrql, which is a symptom of a hardware problem.

You can run memtest to stress the RAM. If memtest reports errors, the RAM is bad. However, memtest is not a perfect tool for testing memory, as some faulty RAM can pass memtest.

Suggestion
1. Check the temperature of the CPU and make sure that it is not overheating (i.e. temperature < 60C)
   Make sure that the CPU fan works properly
2. Reseat the memory stick in another memory slot. Reseat the video card as well.
3. Downclock the RAM. If your video card is overclocked, set it back to its default setting.
4. Clean the dust inside the computer case
5. Make sure that the RAM is compatible with the motherboard
6. Check the BIOS setting about memory timing and make sure that it is on
   For example: DIMM1 and DIMM2 do not have the same timing.
   DIMM1: Corsair CMX512-3200C2 512 MB PC3200 DDR SDRAM (2.5-3-3-8 @ 200 MHz) (2.0-3-3-7 @ 166 MHz)
   DIMM2: Corsair CMX512-3200C2 512 MB PC3200 DDR SDRAM (3.0-3-3-8 @ 200 MHz)
   DIMM3: Corsair CMX512-3200C2 512 MB PC3200 DDR SDRAM (3.0-3-3-8 @ 200 MHz)
7. Make sure that your PSU has adequate power to drive all the hardware, including USB devices
8. Run chkdsk /r at command prompt
9. Run 3DMark 2005 to test your video card
10. Upgrade BIOS and make sure that the motherboard has no leaking capacitor
11. Your Norton AV is outdated, upgrade to latest version.
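
To make the timing mismatch in item 6 easier to see, here is a rough Python sketch that compares the three DIMM timings at 200 MHz and converts the CAS latency to nanoseconds. The timing values are copied from the list above; the function names are made up for illustration, and this is only a way to read the numbers, not a diagnostic tool.

```python
from collections import Counter

# Timings at 200 MHz copied from item 6 above (CAS-tRCD-tRP-tRAS).
# The dictionary and helper names are illustrative assumptions.
DIMM_TIMINGS_AT_200MHZ = {
    "DIMM1": (2.5, 3, 3, 8),
    "DIMM2": (3.0, 3, 3, 8),
    "DIMM3": (3.0, 3, 3, 8),
}


def cas_latency_ns(cas_cycles, clock_mhz):
    """Convert a CAS latency given in clock cycles to nanoseconds."""
    return cas_cycles * 1000.0 / clock_mhz


def mismatched_slots(timings):
    """Return the slots whose timings differ from the most common setting."""
    most_common, _count = Counter(timings.values()).most_common(1)[0]
    return sorted(slot for slot, t in timings.items() if t != most_common)


if __name__ == "__main__":
    for slot, t in sorted(DIMM_TIMINGS_AT_200MHZ.items()):
        print(slot, t, "CAS = %.1f ns" % cas_latency_ns(t[0], 200))
    print("Slots that differ:", mismatched_slots(DIMM_TIMINGS_AT_200MHZ))
```

With the values above, DIMM1 is the odd one out: it is programmed for CAS 2.5 while the other two sticks run at CAS 3.0.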

If it still crashes, diagnose which memory stick is faulty
1. Take out one memory stick. If Windows does not crash, the removed memory stick is faulty.

Mini121005-01.dmp
Stack Trace
80550088 806ed02e badb0d00 00000010 823aa020 nt!KiTrap00+0x7e
805500f8 804e38b3 82c4b494 82c4b470 eff02000 hal!KfRaiseIrql+0x2e  <--- crash with division by zero (hardware problem ??)
80550110 f844eec8 82c4b470 00000000 00000000 nt!KeInsertQueueDpc+0x11
80550130 804da779 82969008 01c4b45c 00000000 NDIS!ndisMIsr+0x54
80550148 804da721 00000000 80550164 804da72e nt!KiChainedDispatch2ndLvl+0x39
80550148 ffdff000 00000000 80550164 804da72e nt!KiChainedDispatch+0x1b


Debug Report
Mini120105-02.dmp BugCheck 7F, {0, 0, 0, 0}
Owning Process            826d0658       Image:         svchost.exe
Probably caused by : hardware ( nt!Ki386CheckDivideByZeroTrap+41 )

Mini120305-01.dmp BugCheck 100000D1, {c3d519, ff, 1, c0201009}
Owning Process            80558e80       Image:         Idle
Probably caused by : ati2mtag.sys ( ati2mtag+490fc )

Mini120705-01.dmp BugCheck 100000D1, {0, ff, 0, 0}
Owning Process            826cb6d0       Image:         Rtvscan.exe
Probably caused by : e1000325.sys ( e1000325+3469 )

Mini120805-01.dmp BugCheck 7F, {0, 0, 0, 0}
Owning Process            80558e80       Image:         Idle
Probably caused by : hardware ( e1000325+4097 )

Mini121005-01.dmp BugCheck 7F, {0, 0, 0, 0}
Owning Process            80558e80       Image:         Idle
Probably caused by : hardware ( NDIS!ndisMIsr+54 )

Mini121205-01.dmp BugCheck 7F, {0, 0, 0, 0}
Owning Process            80558e80       Image:         Idle
Probably caused by : hardware ( NDIS!ndisMIsr+54 )
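
The spread of crash locations in the report above can be tallied with a few lines of Python. This is only a sketch: the tuples are typed in by hand from the debug report, and the helper names are hypothetical. It shows how many different bugcheck codes and modules are involved, which is the pattern the faulty-RAM argument relies on.

```python
from collections import Counter

# (dump file, bugcheck code, "Probably caused by" text), copied from the report above.
DUMPS = [
    ("Mini120105-02.dmp", "7F", "hardware ( nt!Ki386CheckDivideByZeroTrap+41 )"),
    ("Mini120305-01.dmp", "100000D1", "ati2mtag.sys ( ati2mtag+490fc )"),
    ("Mini120705-01.dmp", "100000D1", "e1000325.sys ( e1000325+3469 )"),
    ("Mini120805-01.dmp", "7F", "hardware ( e1000325+4097 )"),
    ("Mini121005-01.dmp", "7F", "hardware ( NDIS!ndisMIsr+54 )"),
    ("Mini121205-01.dmp", "7F", "hardware ( NDIS!ndisMIsr+54 )"),
]


def faulting_module(cause):
    """Extract the module name from the location inside the parentheses."""
    location = cause[cause.index("(") + 1:cause.rindex(")")].strip()
    # "NDIS!ndisMIsr+54" -> "NDIS"; "ati2mtag+490fc" -> "ati2mtag"
    return location.split("!")[0].split("+")[0]


def summarize(dumps):
    """Count crashes per bugcheck code and per faulting module."""
    codes = Counter(code for _name, code, _cause in dumps)
    modules = Counter(faulting_module(cause) for _name, _code, cause in dumps)
    return codes, modules


if __name__ == "__main__":
    codes, modules = summarize(DUMPS)
    print("Bugcheck codes:", dict(codes))
    print("Faulting modules:", dict(modules))
```

Six dumps land in four different modules (kernel, video driver, network driver, NDIS), which is hard to pin on any single driver.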
0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15569209
Hi GinEric

This problem is not related to stack overflow
Error code 0000007f, Parameter 1 00000000, Parameter 2 00000000, Parameter 3 00000000, Parameter 4 00000000

The first bugcheck parameter is the trap code. Trap code 8 is a double fault (typically a kernel stack overflow) and trap code 0 is divide by zero. In this case, it is divide by zero.
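
As a rough illustration of how the event text is read, the Python sketch below pulls the error code and parameters out of the Event ID 1003 message and looks up the first parameter as the trap number. The regular expressions assume the message layout quoted in this thread, the function names are made up, and the table only lists the two trap numbers discussed here.

```python
import re

# Trap numbers for bug check 0x7F; only the two mentioned in this thread are listed.
TRAP_NAMES = {
    0x00: "divide by zero",
    0x08: "double fault (often a kernel stack overflow)",
}


def parse_event_1003(text):
    """Return (bugcheck_code, [parameters]) parsed from the event message text."""
    code_match = re.search(r"Error code\s+([0-9a-fA-F]+)", text)
    if code_match is None:
        raise ValueError("no 'Error code' field found")
    params = re.findall(r"Parameter\s*\d+\s+([0-9a-fA-F]+)", text)
    return int(code_match.group(1), 16), [int(p, 16) for p in params]


def describe(text):
    """Give a one-line reading of the event, decoding the trap for code 0x7F."""
    code, params = parse_event_1003(text)
    if code != 0x7F or not params:
        return "bug check 0x%X (not decoded by this sketch)" % code
    trap = params[0]
    return "bug check 0x7F, trap 0x%X: %s" % (trap, TRAP_NAMES.get(trap, "unknown"))


if __name__ == "__main__":
    sample = ("Error code 0000007f, Parameter 1 00000000, Parameter 2 00000000, "
              "Parameter 3 00000000, Parameter 4 00000000.")
    print(describe(sample))
```

For the customer's message, the first parameter is 0, so the sketch reports a divide by zero rather than a double fault.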
0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15569215
Two of the crashes occurred at IRQL x'ff', which is an invalid IRQL. That is a symptom of faulty RAM.

Mini120305-01.dmp BugCheck 100000D1, {c3d519, ff, 1, c0201009}
Probably caused by : ati2mtag.sys ( ati2mtag+490fc )

Mini120705-01.dmp BugCheck 100000D1, {0, ff, 0, 0}
Probably caused by : e1000325.sys ( e1000325+3469 )
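
The IRQL check can be written out as a small Python sketch. It assumes that for bug check D1 the second parameter is the IRQL at the time of the fault, and that on 32-bit x86 Windows valid IRQLs run from 0 (PASSIVE_LEVEL) to 31 (HIGH_LEVEL). The parameter tuples are copied from the two dumps above; the names are illustrative only.

```python
# Assumption: on 32-bit x86 Windows, IRQLs range from 0 (PASSIVE_LEVEL) to 31 (HIGH_LEVEL).
X86_MAX_IRQL = 31

# Bugcheck parameters copied from the two 100000D1 dumps above.
D1_DUMPS = {
    "Mini120305-01.dmp": (0xC3D519, 0xFF, 0x1, 0xC0201009),
    "Mini120705-01.dmp": (0x0, 0xFF, 0x0, 0x0),
}


def irql_of(params):
    """For bug check D1, the second parameter is taken to be the IRQL."""
    return params[1]


def has_invalid_irql(params, max_irql=X86_MAX_IRQL):
    """True when the reported IRQL falls outside the valid range."""
    return not (0 <= irql_of(params) <= max_irql)


if __name__ == "__main__":
    for name, params in sorted(D1_DUMPS.items()):
        verdict = "invalid" if has_invalid_irql(params) else "valid"
        print("%s: IRQL 0x%x is %s" % (name, irql_of(params), verdict))
```

Both dumps report 0xff (255), far above the highest level the assumption allows, so the value itself looks corrupted.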
0
 
LVL 12

Expert Comment

by:GinEric
ID: 15585881
Microsoft said it was stack overflow related.

Everything "appears" to be a symptom of faulty RAM, I just don't buy it that everything is because I've seen people in manufacturing constantly blame faulty RAM only to deliver it to some foreign country, in the field, where I'd have to fly later to find "the real problem."

I could easily show you a stack overflow, or underflow, that produces a divide by zero because below the stack or above is an uninitialized operand.

A fetid power supply will acquiesce to a new RAM by not failing for another time period, but only because the current and voltage regulation is spongy; this appears to be bad RAM, but is, in fact, just aging of the power supply and circuitry, more so than the aging of the RAM circuitry.

There are no absolutes in electrical and electronics troubleshooting until exactly the same error occurs at exactly the same point every time a specific sequence of operators is executed.

Therefore, you cannot say it is positively this thing while Microsoft's debugger itself is saying it was probably caused by something else, because that probability negates your affirmation. Even if following the advice of changing the RAM does work, the problem may not have been truly identified, but merely deferred.

"hal!KfRaiseIrql" is also a sign of faulty drivers that cannot account for the very high speed of modern architecture, coupled with the lack of knowledge of the hardware operation by some programmers.  It can also be caused by a collision, such as USB and others, rampant on two real Interrupt Request Lines 2/9, which effectuates as many IRQs as use Plug and Play all vying for every buss and DMA all at the same time, or SCSI, PCI, or other timings overly dependent on software simulation of hardware interrupts.

Once a day is still the key.  RTVscan sure sounds like a virus scanner running a heavy load.  The chains, I would think, are ipchain rules, instituted by some firewall or other DNS interferon.

And all zeroes in the parameters points to all zeroes in the Vector Indirect Addressing, a term which hardly any programmers are familiar with in microcomputers.

RTVscan could as well be a video signal; you don't know without proper debugging symbols, which means, full word descriptions of what the process is.

NDIS is definitely network card.  So, it's possible the NIC and video are colliding at memory.  The video card should use its own memory, not system RAM.

"hal!KfRaiseIrql" happens every time you access a plug and play device.  So, it can be either a collision between plug and play devices, or a collision between plug and play software timing problems.

That's exactly what happens when you load everything in a system onto a cheap USB serial buss through the Programmable Logic Controller and the Programmable Logic Array.  It's called a bottleneck, which statistically and probabilistically must fail around one billion times more than the hardware.

The hardware is highly parallel while the software is highly serial; this is just cheap design and playing on the ignorance of the public.

It could also be a cheap motherboard, a third-party Asus clone, like some Dells, that is the problem.

Some cheap stuff that comes out of third party vendors in Taiwan and China, copyright clones of bonafide boards, do not perform as well as name brand boards.

But if changing the RAM works, by all means, do it.  RAM, as I've stated many times, uses more current and power than anything else.  This is so because of the millions of transistors all switching in the active region where current is maximum, thus heat is maximum, in Emitter Coupled Logic, the fastest logic there is.

Most vendors cool everything but the critical RAM, and most do not apply proper engineering principles to heat evacuation anyway.

Changing the RAM will work, even when it isn't the RAM, but one should be ready to inform the customer that it's about time they bought a reputable board, got away from USB and bottlenecks, like SATA, and start demanding quality in the other components instead of blaming the RAM all the time.

Perhaps the customer can be sold a better system next time, something with true 64-bit architecture instead of 1-bit architecture, which is what USB and SATA are, and perhaps true 64-bit PCI and other busses and controllers, instead of a one-wire modulator, like USB and SATA, and Plug and Play.

Your reputation for repair rests on how long the repair works, not how fast it works.

0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15585940
Hi GinEric

Why do you think the trap code is stack overflow? Tell me the web page. In fact, trap code 0 is divide by zero.
Error code 0000007f, Parameter 1 00000000, Parameter 2 00000000, Parameter 3 00000000, Parameter 4 00000000

0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15585952
UNEXPECTED_KERNEL_MODE_TRAP (7f)
This means a trap occurred in kernel mode, and it's a trap of a kind
that the kernel isn't allowed to have/catch (bound trap) or that
is always instant death (double fault).  The first number in the
bugcheck params is the number of the trap (8 = double fault, etc)
Consult an Intel x86 family manual to learn more about what these
traps are. Here is a *portion* of those codes:
If kv shows a taskGate
        use .tss on the part before the colon, then kv.
Else if kv shows a trapframe
        use .trap on that value
Else
        .trap on the appropriate frame will show where the trap was taken
        (on x86, this will be the ebp that goes with the procedure KiTrap)
Endif
kb will then show the corrected stack.
Arguments:
Arg1: 00000000, EXCEPTION_DIVIDED_BY_ZERO <-----  trap code 0
Arg2: 00000000
Arg3: 00000000
Arg4: 00000000

0x00000000, or Divide by Zero Error, is caused when a DIV instruction is executed and the divisor is zero. Memory corruption, other hardware problems, or software failures can cause this error.
0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15585964
Hi GinEric,

Your comment
>>>>
RTVscan could as well be a video signal; you don't know without proper debugging symbols, which means, full word descriptions of what the process is.
>>>>>

In fact RTVscan is Norton AV Real Time Virus Scan.
http://www.liutilities.com/products/wintaskspro/processlibrary/rtvscan/
0
 
LVL 12

Expert Comment

by:GinEric
ID: 15586580
Gee, what can I say, I've designed the hardware bit that detects "Divide By Zero?"

Divide by Zero is a soft error, that is, one that can be reported and ignored.  Perhaps apparently not in Intel's architecture, however, in other architectures it is.

Divide by zero is a common error most often found in software, not hardware.  Thus, it behooves the Operating System to handle it first as if it were software related.  A true hardware Divide by Zero can only arise if every error detection bit designed into the hardware fails to detect the related hardware error that preceded the Divide by Zero; such as a down Enable Bit, the statistically improbable combinatorial event that parity, one bit error detection, and two bit error detection failed, and the like, generally, that is, a failure of the error detection logic itself.

Were the data never initialized [voided in C], then the tag field of the three most significant bits should indicate this and a paging sequence should ensue.  That means, if any word or descriptor indicates an uninitialized operand or set of operands, the I/O routine is to go and initialize the set to zero.  This causes the base of stack to increment one level, lexicographical level as it were, so that a "trap" can be provided for the software Divide by Zero error which will first show up as stack underflow to the lexicographical level two levels below the Operating System environment.  Similarly, it can be shown that stack overflow can be detected using this same lexicographical level order, or layering of execution, which for any application outside of the Operating System shall not be allowed below Lexicographical Level 2.  Finally, this concept is known as the Interrupt Request Level.

It is mostly an oddity that it matches the 2/9 PLC soft interrupt handling, however, it is related for historic reasons.

I'm working on the next generation of hardware and software design; I hardly have time to go read each vendor's acronymical identifier references and descriptions, I can only advise them to take advantage of the massive memory in existence today and to begin to start using defines closer to English, or whatever human language they happen to prefer to program in.

I have been receiving Motorola and Intel design manuals since they started.  I don't really have time to correct their documentation as they catch up to design that was implemented over 25 years ago and developed on other systems.  It's just a bit too much to do.

I think trap code may be stack overflow first because it's possible, and second that opinion was backed up by evaluation at Microsoft Corporation.

If the kernel itself encounters a Divide by Zero, there are two possible causes: first, the hardware could be faulty; second, the software failure has been masked because the kernel is somehow accessing the wrong stack.  Consider it a case of either stack underflow of the stack ordering or stack overflow of the stack ordering.  That is, the Matrix of the stacks themselves has a broken pointer somewhere.  We see this a lot in bad buss timing and race conditions where the error detection buss times the operator and data out of synchronization with the address buss.  If the split harmonic clocked race condition is such that error detection and generation on the busses is in sync, while the data or address splits itself into a doubly clocked condition during that interim, then the set of referenced addresses and/or the dataset can be erroneously zero or unpredictable and go completely undetected, except as some other error that is detected later on by a subsequent failure owing to the bad set[s].

Thereafter, it will be reported as the second failure and not the first, even with a stack trace, however, the complete trace of the stack will show where the actual failure occurred, yet was not "fully" detected at that point in time.  An example would be a one bit error that occurs with simultaneous failure of the parity or other error detection and is really a three bit error or any odd number of n-bits error.  The logic thinks it has detected and corrected the error, but the conditions were such that because of perhaps loading and other factors it has not.  Thus the convention of using Control Modes [supervisory modes and special modes] implemented in hardware design for such contingency owing to the fact that a computer can simply make a mistake and it will go undetected either by the hardware and/or the software.

Which contradicts the dictum that the computer never makes a mistake; it most certainly does and this mistake can and has historically gone completely undetected.  Statistically, it is the three sides of the coin and the false assumption that there are only two possible tosses of the coin - heads or tails - when in fact the coin also can land in a third position, on its edge.

So, in any binary system, there are three possibilities, not two; Conditions "0," "1," and "Undefined."

It is not all ones and zeroes, but must include the set "unknown."

Any mathematical system always has n+1 possible solutions where n is the base of the mathematical system.  The (n+1)th solution is "unpredictable" in all systems.

And it doesn't matter if n is a number or an expression [formula].

I don't remember nor did I bookmark the Microsoft page; you can find it by searching their site.

Seven Fox is what value?  127 isn't it?  For 8-bit error encoding with sign, it is considered "negative zero" an imaginary number or vector with special meaning, even in error flags.

It means "I don't know what really went wrong" from the hardware's perspective.  Unfortunately, someone has assigned this to mean Divide by Zero, and that is simply not true in all cases.

Seven Fox appears in the obverse bit of the endian notation, meaning, that it is a direct read of the hardware's error bits which were stored as such state in a register when the error occurred.

Just as five is an Access Error in the obverse bits of the endian, one bit means error, the other bit means Access Denied.  It is an hardware code directly from the hardware error gates.  Five has been Access Denied since the early days of mainframes.  And negative zero has meant "I don't know what the hardware error was" since those days.  Intel's designs are copies of these designs and simply carried forward the codes, as has Microsoft, mostly, and generally because very few people actually know how these codes are generated.

Remember, Divide by Zero is not a hard error per se, but Divide by Zero in Control Mode 2 is an hard error.  Divide by Zero can be corrected in most cases, but it is assumed that if it occurs in the Control Mode above 1, the Operating System or kernel stack, then it is irrecoverable and considered an hardware error, even though it may actually turn out to be a software error.

In Control Mode 2, it causes a system halt immediately after a dump or an attempted dump when the Halt Bit is set.  You can see this as the Blue Screen of Death.

For example, if antivirus somehow wrote to kernel area, especially its stack array of pointers, Stack Array Row Descriptors, then the entire system stack is off and invalid for all stacks, including the system stacks, Base of Stack, Top of Stack, and the associated Control Words, that is, the complete description of all stacks is wrong.

And neither the software nor the hardware knows exactly how this has happened and it must call for human intervention for a cure.

Basically, the coin, the "hardware" and "software" sides, have been negated and the system has landed on the edge of the coin.
0
 
LVL 20

Expert Comment

by:cpc2004
ID: 15623908
Do you have any update?
0
 
LVL 12

Expert Comment

by:GinEric
ID: 15625348
I don't see any dumps uploaded, so no idea about the status of the question.
0

Question has a verified solution.