Customer gets stop error Event ID 1003

The customer describes a Blue Screen stop error that she gets. I never saw the error first hand, so I had to depend on something showing up in the Event Logs. An Event ID 1003 error has been showing up once per day. I assume this is the error causing the system halt. The text of the error is: "Error code 0000007f, Parameter 1 00000000, Parameter 2 00000000, Parameter 3 00000000, Parameter 4 00000000." I need to find out what the cause is ASAP.

GinEric

That's a strange error; what does EventViewer say?

once a day sounds like DNS TTL though, off the top of my head.

I'll look it up, but you could do the same at http://support.microsoft.com/

GinEric

It appears to be the code for Stack Overflow. That can be caused by many things, drivers, malware.

I didn't find any with all zeroes in the parameters, which is suspicious.

Do you have more information, from EventViewer? or a .dmp file for analysis?

sachin_raorane

What's the source program of this Event ID 1003?

You can look up various source reasons for this Event ID at

http://www.eventid.net/display.asp?eventid=1003

nobus

Can this help you?

http://support.microsoft.com/?kbid=870908

cpc2004

The culprit may be faulty RAM. Run memtest to stress test the RAM.
https://www.experts-exchange.com/questions/21510623/3-different-errors-ntdll-dll-explorer-exe-drwtsn32-exe-STOP-0x0000007F.html

071171

ASKER

Source: System Error; Event ID: 1003; several .dmp files - where do I put them?

cpc2004

Get public webspace
Go to www.geocities.com - sign up for an account (they are owned by Yahoo - if you have a yahoo account, then you just need to activate the geocities part of it there). Then use their tools to upload the file.

071171

ASKER

Go to http://www.geocities.com/alefsky and choose File Manager. Once there you'll see dump.zip which contains 5 .dmp files. Thanks.

cpc2004

I went to http://www.geocities.com/alefsky but I can't find a File Manager option to download the minidumps.

071171

ASKER

cpc2004 -

How do I upload files on Geocities? Must I create a web page?

cpc2004

Email the minidumps to me. You can find my email address in my profile.

ASKER CERTIFIED SOLUTION

cpc2004

cpc2004

Hi GinEric

This problem is not related to stack overflow.
Error code 0000007f, Parameter 1 00000000, Parameter 2 00000000, Parameter 3 00000000, Parameter 4 00000000

The first bugcheck parameter is the trap code. Trap code 8 is a double fault (commonly a kernel stack overflow) and trap code 0 is divide by zero. In this case, it is a divide by zero.
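
As a rough illustration of reading the trap code out of the Event ID 1003 text, here is a minimal Python sketch. It is not part of any Microsoft tool: the function name `decode_event_1003` and the table `X86_TRAPS` are hypothetical, the table is only a partial list of x86 exception vectors, and masking off the top hex digit of the bugcheck code is an assumption made so that values such as 1000007F are handled too.

```python
import re

# Partial list of x86 exception vectors (see the Intel x86 manuals).
X86_TRAPS = {
    0x00: "Divide Error",
    0x04: "Overflow",
    0x05: "BOUND Range Exceeded",
    0x06: "Invalid Opcode",
    0x08: "Double Fault",
    0x0D: "General Protection Fault",
    0x0E: "Page Fault",
}


def decode_event_1003(text):
    """Return (bugcheck_code, [four parameters], trap_name_or_None)."""
    code = re.search(r"Error code\s+([0-9a-fA-F]+)", text)
    params = re.findall(r"Parameter\s*\d+\s+([0-9a-fA-F]+)", text)
    if code is None or len(params) < 4:
        raise ValueError("not an Event ID 1003 bugcheck message")

    # Assumption: the top hex digit is a flag, not part of the stop code.
    bugcheck = int(code.group(1), 16) & 0x0FFFFFFF
    args = [int(p, 16) for p in params[:4]]

    # Only for stop 0x7F is the first parameter a trap number.
    trap = None
    if bugcheck == 0x7F:
        trap = X86_TRAPS.get(args[0], "unknown trap")
    return bugcheck, args, trap


if __name__ == "__main__":
    message = ("Error code 0000007f, Parameter 1 00000000, "
               "Parameter 2 00000000, Parameter 3 00000000, "
               "Parameter 4 00000000.")
    print(decode_event_1003(message))
```

Fed the message from this question, the sketch reports stop code 0x7F with trap 0, which the table names as a divide error rather than a double fault.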

cpc2004

Two of the crashes occurred at IRQL x'ff', which is an invalid IRQL. That is a symptom of faulty RAM.

Mini120305-01.dmp BugCheck 100000D1, {c3d519, ff, 1, c0201009}
Probably caused by : ati2mtag.sys ( ati2mtag+490fc )

Mini120705-01.dmp BugCheck 100000D1, {0, ff, 0, 0}
Probably caused by : e1000325.sys ( e1000325+3469 )
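
To make the IRQL check above concrete, here is a small Python sketch that pulls the four arguments out of a debugger summary line and interprets them for stop 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL). The helper names `parse_bugcheck_line` and `describe_d1` are hypothetical. Two assumptions are built in: the leading 1 in 100000D1 is treated as a flag and masked off, and 31 is taken as the highest valid IRQL on 32-bit x86 Windows.

```python
import re

MAX_X86_IRQL = 31  # assumed HIGH_LEVEL on 32-bit x86 Windows


def parse_bugcheck_line(line):
    """Parse 'BugCheck 100000D1, {a, b, c, d}' into (code, [args])."""
    match = re.search(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if match is None:
        raise ValueError("no BugCheck summary found")
    # Assumption: the top hex digit is a flag, not part of the stop code.
    code = int(match.group(1), 16) & 0x0FFFFFFF
    args = [int(a.strip(), 16) for a in match.group(2).split(",")]
    return code, args


def describe_d1(args):
    """Interpret the four arguments of stop 0xD1."""
    address, irql, access, caller = args
    return {
        "address": address,          # memory that was referenced
        "irql": irql,                # IRQL at the time of the reference
        "operation": "write" if access & 1 else "read",
        "caller": caller,            # address of the referencing code
        "irql_in_range": irql <= MAX_X86_IRQL,
    }


if __name__ == "__main__":
    for line in (
        "Mini120305-01.dmp BugCheck 100000D1, {c3d519, ff, 1, c0201009}",
        "Mini120705-01.dmp BugCheck 100000D1, {0, ff, 0, 0}",
    ):
        code, args = parse_bugcheck_line(line)
        print(hex(code), describe_d1(args))
```

For both dumps listed here the reported IRQL is 0xff, which falls outside the assumed 0 to 31 range. That is the point being made about an invalid IRQL.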

GinEric

Microsoft said it was stack overflow related.

Everything "appears" to be a symptom of faulty RAM. I just don't buy that everything is, because I've seen people in manufacturing constantly blame faulty RAM, only to deliver it to the field in some foreign country, where I'd have to fly later to find "the real problem."

I could easily show you a stack overflow, or underflow, that produces a divide by zero because below the stack or above is an uninitialized operand.

A fetid power supply will acquiesce to a new RAM by not failing for another time period, but only because the current and voltage regulation is spongy; this appears to be bad RAM, but is, in fact, just aging of the power supply and circuitry, more so than the aging of the RAM circuitry.

There are no absolutes in electrical and electronics troubleshooting until exactly the same error occurs at exactly the same point every time a specific sequence of operators is executed.

Therefore, you cannot say it is positively this thing while Microsoft's debugger itself is saying it was probably caused by something else; that probability negates your affirmation. Even if following the advice of changing the RAM does work, the problem may not have been truly identified, but merely deferred.

"hal!KfRaiseIrql" is also a sign of faulty drivers that cannot account for the very high speed of modern architecture, coupled with the lack of knowledge of the hardware operation by some programmers. It can also be caused by a collision, such as USB and others, rampant on two real Interrupt Request Lines 2/9 which effectuates as many IRQs as use Plug and Play all vying for every buss and DMA all at the same time, or SCSI, PCI, or other timings overly dependent on software simulation of hardware interrupts.

Once a day is still the key. RTVscan sure sounds like a virus scanner running a heavy load. The chains, I would think, are ipchain rules, instituted by some firewall or other DNS interferon.

And all zeroes in the parameters points to all zeroes in the Vector Indirect Addressing, a term which hardly any programmers are familiar with in microcomputers.

RTVscan could as well be a video signal; you don't know without proper debugging symbols, which means, full word descriptions of what the process is.

NDIS is definitely network card. So, it's possible the NIC and video are colliding at memory. The video card should use its own memory, not system RAM.

"hal!KfRaiseIrql" happens every time you access a plug and play device. So, it can be either a collision between plug and play devices, or a collision between plug and play software timing problems.

That's exactly what happens when you load everything in a system onto a cheap USB serial buss through the Programmable Logic Controller and the Programmable Logic Array. It's called a bottleneck, which statistically and probabilistically must fail around one billion times more than the hardware.

The hardware is highly parallel while the software is highly serial; this is just cheap design and playing on the ignorance of the public.

It could also be a cheap motherboard, a third party Asus clone, like some Dells, that is the problem.

Some cheap stuff that comes out of third party vendors in Taiwan and China, copyright clones of bona fide boards, does not perform as well as name brand boards.

But if changing the RAM works, by all means, do it. RAM, as I've stated many times, uses more current and power than anything else. This is so because of the millions of transistors all switching in the active region where current is maximum, thus heat is maximum, in Emitter Coupled Logic, the fastest logic there is.

Most vendors cool everything but the critical RAM, and most do not apply proper engineering principles to heat evacuation anyway.

Changing the RAM will work, even when it isn't the RAM, but one should be ready to inform the customer that it's about time they bought a reputable board, got away from USB and bottlenecks, like SATA, and start demanding quality in the other components instead of blaming the RAM all the time.

Perhaps the customer can be sold a better system next time, something with true 64-bit architecture instead of 1-bit architecture, which is what USB and SATA are, and perhaps true 64-bit PCI and other busses and controllers, instead of a one-wire modulator, like USB and SATA, and Plug and Play.

Your reputation for repair rests on how long the repair works, not how fast it works.

cpc2004

Hi GinEric

Why do you think the trap code is stack overflow? Tell me the webpage. In fact, trap code 0 is divide by zero.
Error code 0000007f, Parameter 1 00000000, Parameter 2 00000000, Parameter 3 00000000, Parameter 4 00000000

cpc2004

UNEXPECTED_KERNEL_MODE_TRAP (7f)
This means a trap occurred in kernel mode, and it's a trap of a kind
that the kernel isn't allowed to have/catch (bound trap) or that
is always instant death (double fault). The first number in the
bugcheck params is the number of the trap (8 = double fault, etc)
Consult an Intel x86 family manual to learn more about what these
traps are. Here is a *portion* of those codes:
If kv shows a taskGate
use .tss on the part before the colon, then kv.
Else if kv shows a trapframe
use .trap on that value
Else
.trap on the appropriate frame will show where the trap was taken
(on x86, this will be the ebp that goes with the procedure KiTrap)
Endif
kb will then show the corrected stack.
Arguments:
Arg1: 00000000, EXCEPTION_DIVIDED_BY_ZERO <----- trap code 0
Arg2: 00000000
Arg3: 00000000
Arg4: 00000000

0x00000000, or Divide by Zero Error, is caused when a DIV instruction is executed and the divisor is zero. Memory corruption, other hardware problems, or software failures can cause this error.

cpc2004

Hi GinEric,

Your comment
>>>>
RTVscan could as well be a video signal; you don't know without proper debugging symbols, which means, full word descriptions of what the process is.
>>>>>

In fact RTVscan is Norton AV Real Time Virus Scan.
http://www.liutilities.com/products/wintaskspro/processlibrary/rtvscan/

GinEric

Gee, what can I say? I've designed the hardware bit that detects "Divide By Zero."

Divide by Zero is a soft error, that is, one that can be reported and ignored. Perhaps apparently not in Intel's architecture, however, in other architectures it is.

Divide by zero is a common error most often found in software, not hardware. Thus, it behooves the Operating System to handle it first as if it were software related. A true hardware Divide by Zero can only arise if every error detection bit designed into the hardware fails to detect the related hardware error that preceded the Divide by Zero; such as a down Enable Bit, the statistically improbable combinatorial event that parity, one bit error detection, and two bit error detection failed, and the like, generally, that is, a failure of the error detection logic itself.

Were the data never initialized [voided in C], then the tag field of the three most significant bits should indicate this and a paging sequence should ensue. That means, if any word or descriptor indicates an uninitialized operand or set of operands, the I/O routine is to go and initialize the set to zero. This causes the base of stack to increment one level, lexicographical level as it were, so that a "trap" can be provided for the software Divide by Zero error which will first show up as stack underflow to the lexicographical level two levels below the Operating System environment. Similarly, it can be shown that stack overflow can be detected using this same lexicographical level order, or layering of execution, which for any application outside of the Operating System shall not be allowed below Lexicographical Level 2. Finally, this concept is known as the Interrupt Request Level.

It is mostly an oddity that it matches the 2/9 PLC soft interrupt handling, however, it is related for historic reasons.

I'm working on the next generation of hardware and software design; I hardly have time to go read each vendor's acronymical identifier references and descriptions, I can only advise them to take advantage of the massive memory in existence today and to begin to start using defines closer to English, or whatever human language they happen to prefer to program in.

I have been receiving Motorola and Intel design manuals since they started. I don't really have time to correct their documentation as they catch up to design that was implemented over 25 years ago and developed on other systems. It's just a bit too much to do.

I think trap code may be stack overflow first because it's possible, and second that opinion was backed up by evaluation at Microsoft Corporation.

If the kernel itself encounters a Divide by Zero, there are two possible causes: first, the hardware could be faulty; second, the software failure has been masked because the kernel is somehow accessing the wrong stack. Consider it a case of either stack underflow of the stack ordering or stack overflow of the stack ordering. That is, the Matrix of the stacks themselves has a broken pointer somewhere. We see this a lot in bad buss timing and race conditions where the error detection buss times the operator and data out of synchronization with the address buss. If the split harmonic clocked race condition is such that error detection and generation on the busses is in sync, while the data or address splits itself into a doubly clocked condition during that interim, then the set of referenced addresses and/or the dataset can be erroneously zero or unpredictable and go completely undetected, except as some other error that is detected later on by a subsequent failure owing to the bad set[s].

Thereafter, it will be reported as the second failure and not the first, even with a stack trace, however, the complete trace of the stack will show where the actual failure occurred, yet was not "fully" detected at that point in time. An example would be a one bit error that occurs with simultaneous failure of the parity or other error detection and is really a three bit error or any odd number of n-bits error. The logic thinks it has detected and corrected the error, but the conditions were such that because of perhaps loading and other factors it has not. Thus the convention of using Control Modes [supervisory modes and special modes] implemented in hardware design for such contingency owing to the fact that a computer can simply make a mistake and it will go undetected either by the hardware and/or the software.

Which contradicts the dictum that the computer never makes a mistake; it most certainly does and this mistake can and has historically gone completely undetected. Statistically, it is the three sides of the coin and the false assumption that there are only two possible tosses of the coin - heads or tails - when in fact the coin also can land in a third position, on its edge.

So, in any binary system, there are three possibilities, not two; Conditions "0," "1," and "Undefined."

It is not all ones and zeroes, but must include the set "unknown."

Any mathematical system always has n+1 possible solutions where n is the base of the mathematical system. The (n+1)th solution is "unpredictable" in all systems.

And it doesn't matter if n is a number or an expression [formula].

I don't remember nor did I bookmark the Microsoft page; you can find it by searching their site.

Seven Fox is what value? 127 isn't it? For 8-bit error encoding with sign, it is considered "negative zero" an imaginary number or vector with special meaning, even in error flags.

It means "I don't know what really went wrong" from the hardware's perspective. Unfortunately, someone has assigned this to mean Divide by Zero, and that is simply not true in all cases.

Seven Fox appears in the obverse bit of the endian notation, meaning, that it is a direct read of the hardware's error bits which were stored as such state in a register when the error occurred.

Just as five is an Access Error in the obverse bits of the endian, one bit means error, the other bit means Access Denied. It is an hardware code directly from the hardware error gates. Five has been Access Denied since the early days of mainframes. And negative zero has meant "I don't know what the hardware error was" since those days. Intel's designs are copies of these designs and simply carried forward the codes, as has Microsoft, mostly, and generally because very few people actually know how these codes are generated.

Remember, Divide by Zero is not a hard error per se, but Divide by Zero in Control Mode 2 is an hard error. Divide by Zero can be corrected in most cases, but it is assumed that if it occurs in the Control Mode above 1, the Operating System or kernel stack, then it is irrecoverable and considered an hardware error, even though it may actually turn out to be a software error.

In Control Mode 2, it causes a system halt immediately after a dump or an attempted dump when the Halt Bit is set. You can see this as the Blue Screen of Death.

For example, if antivirus somehow wrote to kernel area, especially its stack array of pointers, Stack Array Row Descriptors, then the entire system stack is off and invalid for all stacks, including the system stacks, Base of Stack, Top of Stack, and the associated Control Words, that is, the complete description of all stacks is wrong.

And neither the software nor the hardware knows exactly how this has happened and it must call for human intervention for a cure.

Basically, the coin, the "hardware" and "software" sides, have been negated and the system has landed on the edge of the coin.

cpc2004

Do you have any update?

GinEric

I don't see any dumps uploaded, so no idea about the status of the question.