
bugcheck stop 0x0000000a on windows 2000 web server

Starr Duskk asked
Last Modified: 2010-03-23

This kind of goes back to an issue I was dealing with before:
https://www.experts-exchange.com/Operating_Systems/Q_21531853.html

GinEric was helping out there...

GinEric,

I just got this memory dump error. It is not the same, but the knowledge base indicates it might have to do with a timing problem:

Event Type:      Information
Event Source:      Save Dump
Event Category:      None
Event ID:      1001
Date:            8/19/2005
Time:            11:57:02 PM
User:            N/A
Computer:      SSWEB
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x0000000a (0xc0074ee4, 0x00000002, 0x00000000, 0x80443637). Microsoft Windows 2000 [v15.2195]. A dump was saved in: C:\WINNT\MEMORY.DMP.
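For readers who want to pull the numbers out of an event like the one above, here is a minimal sketch in Python. It assumes the description text has the same shape as this Save Dump event, and it uses the parameter meanings Microsoft documents for stop 0xA (IRQL_NOT_LESS_OR_EQUAL): address referenced, IRQL, read/write flag, and address of the faulting instruction. The function names are made up for illustration. Note that the second parameter is an IRQL (interrupt request level), which is not the same thing as a hardware IRQ line number.

```python
import re

# IRQL names for the low levels on x86; anything else is shown as a number.
IRQL_NAMES = {0: "PASSIVE_LEVEL", 1: "APC_LEVEL", 2: "DISPATCH_LEVEL"}


def parse_bugcheck(description):
    """Return (stop_code, [four parameters]) from a Save Dump description."""
    match = re.search(
        r"bugcheck was:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)", description
    )
    if match is None:
        raise ValueError("no bugcheck found in description")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe_stop_0a(params):
    """Label the four parameters of stop 0xA (hypothetical helper)."""
    address, irql, access, instruction = params
    return {
        "referenced_address": hex(address),
        "irql": IRQL_NAMES.get(irql, str(irql)),
        "access": "read" if access == 0 else "write",
        "faulting_instruction": hex(instruction),
    }


event_text = (
    "The computer has rebooted from a bugcheck.  The bugcheck was: "
    "0x0000000a (0xc0074ee4, 0x00000002, 0x00000000, 0x80443637)."
)
code, params = parse_bugcheck(event_text)
summary = describe_stop_0a(params)
print(hex(code), summary)
```

Read this way, the event says that code running at DISPATCH_LEVEL tried to read address 0xc0074ee4 from the instruction at 0x80443637.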

http://support.microsoft.com/default.aspx?scid=kb;en-us;286362
>>This problem is caused by a small timing problem that can cause a null pointer to be referenced.

It says to fix it, obtain the latest service pack. I have the latest service pack and that is what includes the rollup that maybe caused it in the first place.

Another knowledge base article talks about a virus. But that was for an NT server, and I already checked for those virus files and there are none. I also checked the registry for the virus entries and they weren't there.

I'm thinking this is a timing issue as you already indicated.

I am uploading the file to your ftp site, but it wouldn't grant me permission to create a directory.

It says it will take 1 hour and 40 minutes. The file is much smaller, so it must be the kernel file. Plus it didn't take 15 minutes to reboot as in the past. It was only a few minutes.

If you don't feel this is related to the issue you wanted to look at, don't feel obligated to help. I understand.

thanks!

Bobi



Commented:
You had better install windbg and attach the analysis report here, so that all the experts can help you find the root cause.

Debugging Tools from Microsoft
1) Download and install the debugging tools: http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx
2) Locate your latest memory.dmp file: C:\winnt\memory.dmp
3) Start windbg
4) File --> Open Crash Dump --> C:\winnt\memory.dmp, then enter these commands:

kd> .logopen c:\debuglog.txt
kd> .sympath srv*c:\symbols*http://msdl.microsoft.com/download/symbols
kd> .reload;!analyze -v;r;kv;lmnt;.logclose;q
5) You now have a debuglog.txt in c:\. Open it in Notepad and post the content here.
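If you would rather not click through the GUI, the same analysis can probably be run from the command line with kd.exe from the same Debugging Tools package. The sketch below only builds the command line in Python; it does not run anything. The options used (-z for the dump file, -y for the symbol path, -logo for the log file, -c for the initial commands) are the documented debugger options as far as I know, but check them against your installed version. The function name and the assumption that kd.exe is on the PATH are mine.

```python
def build_kd_command(dump_path, log_path, symbol_cache, kd_exe="kd.exe"):
    """Build a kd.exe argument list that mirrors the manual steps above."""
    symbol_path = (
        "srv*" + symbol_cache + "*http://msdl.microsoft.com/download/symbols"
    )
    # -logo opens the log, so .logopen/.logclose are not needed here.
    commands = ".reload;!analyze -v;r;kv;lmnt;q"
    return [
        kd_exe,
        "-z", dump_path,      # crash dump to open
        "-y", symbol_path,    # symbol server plus local cache
        "-logo", log_path,    # write all output to this log file
        "-c", commands,       # commands to run once the dump is loaded
    ]


cmd = build_kd_command(r"C:\winnt\memory.dmp", r"c:\debuglog.txt", r"c:\symbols")
print(" ".join(cmd))
```

To actually run it, the list could be handed to subprocess.run(cmd) on the machine where the debugging tools are installed.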

Commented:
Interesting, quite a large dump, at 410 MB.

Well, I take it that that is your dump BobCSD

It's been moved to /uploads/BobCSD/

and the analysis is at /uploads/BobCSD/Analysis/

or use this link:

ftp://guest@musics.com/uploads/dumps/BobCSD/Analysis/BobCSD.memory.dmp.analysis.html

While the kernel dump would have sufficed, a production server probably should have full memory dump turned on, just in case it's one of those hairy intermittent problems that require such extensive dumps.

At first extrapolation, it's either a BIOS caching problem in area C0000000, or, the loading was such that memory and resources were simply overtaxed [unlikely in new boards].

If you had a problem creating a directory, let me know. I saw that a "New Folder" was created, but the dump was uploaded to the base guest directory.  I moved both to the directory above.  I also deleted the dump file, 410 MB, because it was no longer necessary after the debugging analysis.

Let me know if disabling the BIOS cache in this area solves this problem.

Commented:
And thank you for finding that lost link to the timing problem; yes, that is the timing problem as stated by Intel and Microsoft, and solved by Linux Programmers!  hehehe

Now I can complete the write up on general "Twilight Zone" crash dumps.

Actually, you did create a directory, "New Folder", which I changed.  If you try again and can't create a directory, let me know; I'll have to change the strict permissions.

This seems more like a different problem, though, one having to do with cache and page files.  Because both backup and antivirus started at 4:00 A.M., it is very likely that you were using up all available memory and running at an unusually high CPU percentage. Perhaps backup collided with antivirus in the storage of the dif files for your backup, or antivirus may have added changed files or locked a changed file, and the two could not get along.

Who knows, they may have both been using the same memory area and one of them went off.

One question: is that board a dual processor?

Commented:
Oh, an addendum: it should have taken about 6 minutes or less to upload the entire 410 MB; can you tell me how long you think it actually took?

This is for reference with our pipe provider; I need some feedback on the actual speed experienced.

It took me 6 minutes to upload it to another server; which I consider a little long, but it depends on the client in most cases, not the server.

The dump was analyzed on a Windows Server.

Commented:
Hi GinEric,

Your analysis report is incomplete. Can you provide the output of the following commands?
lm tn
!thread
!process
u 80443620 l20
r

Thanks
cpc2004
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
GinEric,

>>Let me know if disabling the BIOS cache in this area solves this problem.

So you're saying I should disable BIOS. How do I do that? What are the ramifications of this?

>>Because both backup and antivirus started at 4:00 A.M.,

That was the other day. Since then I had uninstalled the antivirus... trying to determine if it was the culprit. So there was nothing running last night when THIS occurred. It was at 11:57 PMish and wasn't during a backup either.

>>One question: is that board a dual processor?

I'll have to get back to you on that. It's an Abit Fatality AA8XE. I searched the web to find out, but just don't know what to look for. It says "Designed for Intel® 90nm Pentium 4 LGA775 processors."

>>Oh, an addendum: it should have taken about 6 minutes or less to upload the entire 410 MB; can you tell me how long you think it actually took?

It was over an hour... maybe two. I can send you something else sometime to know for certain and actually keep track of the time, if you like. I'm on the equivalent of a T1 line. It's cable though.
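The gap between "6 minutes" and "over an hour" is easier to see as throughput. This is a rough back-of-the-envelope sketch in Python, assuming 1 MB = 1,000,000 bytes and ignoring FTP and TCP overhead. The 100-minute figure is the estimate the upload gave earlier in the thread, and 1.544 Mbit/s is the standard T1 rate; the function names are just for illustration.

```python
def throughput_mbps(size_megabytes, seconds):
    """Average throughput in megabits per second."""
    return size_megabytes * 8 / seconds


def transfer_minutes(size_megabytes, link_mbps):
    """Ideal transfer time in minutes over a link of the given speed."""
    return size_megabytes * 8 / link_mbps / 60


DUMP_MB = 410      # size of the uploaded dump
T1_MBPS = 1.544    # nominal T1 line rate

six_minute_rate = throughput_mbps(DUMP_MB, 6 * 60)
hundred_minute_rate = throughput_mbps(DUMP_MB, 100 * 60)
t1_minutes = transfer_minutes(DUMP_MB, T1_MBPS)

print(round(six_minute_rate, 2), "Mbit/s needed for a 6 minute upload")
print(round(hundred_minute_rate, 2), "Mbit/s implied by a 100 minute upload")
print(round(t1_minutes, 1), "minutes at a full T1 rate")
```

So a 6 minute upload needs roughly 9 Mbit/s of upstream, while even a full T1 rate would take about 35 minutes. An upload of over an hour suggests the upstream side of the cable connection is the limit, not the server.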

I did have the memory set to do a kernel dump, but I hadn't rebooted since changing it. I was waiting for the site to get less busy in the wee hours of the morning. But then the machine decided to reboot itself and save me the trouble, I guess. Nice machine. ;)

Truthfully, if you look at my history, I have had one major problem after another with this server. It is a recent build. I'm seriously considering putting at least the web server (the database is on a different newly built box) back on the old SCSI server. There was nothing wrong with it. We just wanted to build two new boxes and remotely move everything over and get the IPs set up and DNS moved so that, to our users, the site move from our colocator to our house would be invisible and it wouldn't be down for even an hour while hauling the machines across town and setting up the new IP addresses.

But I have had nothing but problems with this box since going live. So I'm thinking a move back is best. Course, my spouse, who built it, thinks I should just fix it, but he's not the one up til 4 AM trying to figure out what is wrong with the thing! (well, truthfully, you guys are figuring it out for me... ;)

So do you think it's salvageable and I should move forward and try to fix it, or just put the data back on the old server and go? It's a few years old, maybe 4. I'd still have to set up the old server with the new IPs (I have tons of unused IPs to use), put the zywall back on and get it set up (it's an extra too from when we were at the colocator). So it wouldn't be free of problems/issues either. But at least I could get that set up and tested without it being live. This box just never seems to get better. I just don't know. Sigh.

Thanks!

BOBi

Commented:
It was complete.  Further analysis wasn't necessary until some things had been tried or eliminated.  But, for your benefit, I expanded it:

ftp://guest@musics.com/uploads/dumps/BobCSD/Analysis/BobCSD2.memory.dmp.analysis.html

I'll wait to see if he's checked the cache settings.

And to think, I could get about $150k a year or more for this level of troubleshooting!

:)

Commented:
Hi Bob,

Can you send the minidump to me as well? I want to analyze the dump, as GinEric has not responded to my post.

cpc2004
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>> Suggest checking BIOS caching; if enabled, try disabling it in the C0000000 area.
>>Let me know if disabling the BIOS cache in this area solves this problem.

So you're saying I should disable BIOS. How do I do that? What are the ramifications of this?

>>The annoying and nagging question is about IRQ 2; what was using this special IRQ?  Was it a sound card driver or a video driver?

We checked in the device manager and:
0) system timer
1) standard keyboard
4) communication port

So you see the 2) and 3) are entirely missing... Unless you're starting count at 0, in which case, the standard keyboard is at two.

>>And to think, I could get about $150k a year or more for this level of troubleshooting!

Where do I send the check?

Thanks!

BOBi
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
cpc2004,

How do I send the minidump to you?

thanks!

BOBi

Commented:
The BIOS setup is what you get to when you hold down something like the Delete key during boot; it should ask you if you want to go to the setup screen.

That's the actual setup for the motherboard.  In there, you'll find all your motherboard's settings; usually somewhere in there you'll find Enable/Disable BIOS Cache for various areas, including the video cache, which I suspect is at C0000000; disable it.

We usually disable all BIOS cache on the motherboard.

As for the board you're using, I don't see that it is its fault; however, for building servers I usually recommend the highest end Tyan you can get for commercial use.

I think you've just hit a flukey problem and with a little understanding you can get it going.  If it is truly "at home," you've got to consider things like residential electricity variations and power outs, heat, environment, and so forth.  Say, you wouldn't put a server in an unairconditioned or unhumidity controlled home environment in places like Florida.

But if it were all that bad, it would be crashing on a regular basis.  Did you have any electrical storms lately?  That'll bring a server down!  Real quick, and it can easily get a BSOD and a dump, before it reboots.

Then again, unless the circuits are conditioned, and if the server happens to be on the same line with other equipment, refrigerators, air conditioners, etc., the glitch when they switch on can bring a server down.

I think you need to study it and monitor it a little while longer though.  Don't just go shut it down to change the BIOS settings, wait and see, if you're there when it goes down, then reboot into BIOS setup and check the cache settings.  Otherwise, send me the newer kernel dumps and we'll get it checked out.  Don't forget to leave contact information, a simple text file, when you upload.
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>>Did you have any electrical storms lately?  That'll bring a server down!  Real quick, and it can easily get a BSOD and a dump, before it reboots.

Uh, last night, we had quite the electrical storm. Really.

But we have them on battery backups, surge protectors, and the house is also on a built-in surge protector device built into the meter, by the electric company for this sole purpose. We have a whole house generator as well, just in case.

It is only the web server that is having problems. The database server, sitting next to it and plugged into the same source, is fine. They do both have different battery backups though. We have had wild electric storms in the past that never caused a reboot as well.

Plus this box has had so many problems lately when there were no storms.

As far as temperature, the front panels have the temperature displayed and they are fine. But no, they're not in a "cold room" but they are air conditioned.

>>if the server happens to be on the same line with other equipment,

This office is on its own circuit breaker and no other equipment is involved.

>>Don't just go shut it down to change the BIOS settings, wait and see, if you're there when it goes down, then reboot into BIOS setup and check the cache settings

Okay. Otherwise, I'll check in the wee hours when the site is not so busy. Saturday nights are very busy.

>>Don't forget to leave contact information, a simple text file, when you upload.

Okay, will do that next time. I couldn't create a folder. It never showed me even the New Folder.

thanks!

BOBi

Commented:
A couple of things.  IRQ 2 is part of the IRQ 9/2 PLC that Windows and other Operating Systems use for Plug and Play.  So, IRQ 2 is missing from the Device Manager as that IRQ number, but every IRQ above 15 uses it!  If the interrupt was IRQ 2 at the time of the failure, the machine was on a Plug and Play device.  Oddly enough, even parts of the motherboard can be assigned interrupts through this controller.  The operation of a Programmable Logic Controller is the subject of about two college level courses in Electrical, Electronics, or Instrumentation and Control.  But basically it allows for a form of relay and switch design for various buses, other memory controllers, and commands to all sorts of things, like the PCI Bus, Memory Bus, ISA Bus, and so on, to be designed on the fly, programmatically, and then redesigned with something called ladder logic if necessary.  It's how a motherboard can be made to accommodate different hardware onboard and in its slots.  It is transparent to Device Manager, so it won't show up in the list.

Upload the minidump here:

ftp://guest@musics.com/uploads/dumps/BobCSD/Analysis/

You should be able to just copy and paste, or whatever, the minidump as a file from one IE window to another, or, do the ftp via a DOS command prompt, it's up to you.  Some people use programs such as FTP Commander and others to make the ftp experience more user friendly.

I fixed the permissions so that you should have no problem copying and pasting a file into that directory now.

cpc2004, the response to your question is in the analysis directory as the BobCSD2 html file.

While it's harder to determine whether the sound card or the video card were using the 9/2 IRQ, it's easier just to know that the video card uses the area of memory requested, C0000000, and it is likely that if caching was enabled the video card overwrote the C0000000 area so that had the Operating System been using that area for cache, the hardware level access of the video card would have simply overwritten it, thus making the entire area invalid.  Notice that modules are being looked for that do not exist, as if they were simply erased.

This has been one of those problems that began around the beginning of time, which is why I suspect it so strongly.  It's actually delegated to that area formerly known as C000, but because of Big Endian notation, this shows up as C0000000.  If you look at my translation for the call, you'll see that:

804856cc  is  actually  cc564880  inside the hardware, which places it in the C0000000 cache area.  Addressing is split across to RAM cards, so that 0000C000 becomes C0000000 when recombined.  

and because of another flukey representation and various designs in logic at the substrate and gating level this is seen as in C0000 area [take the "0" of "0x" and put it at the end, thus, "endian," and 0xC000 becomes address C0000, the video cache area for a 1 megabyte system; add four more places for split card addressing and you arrive at C0000000 with the endian being therefore 0xC0000000].
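The one piece of arithmetic above that can be checked directly is the byte reversal: 804856cc with its four bytes in the opposite order is cc564880. Here is a minimal sketch in Python of that reversal only. It illustrates how a 32-bit value looks when its bytes are read in the opposite order (x86 stores the low byte first in memory); it does not by itself say anything about which physical memory area an address falls in. The function name is made up for illustration.

```python
def swap_bytes32(value):
    """Reverse the byte order of a 32-bit unsigned value."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


original = 0x804856CC
swapped = swap_bytes32(original)

print(f"{original:08x} -> {swapped:08x}")
# Swapping twice gives the original value back.
print(f"{swap_bytes32(swapped):08x}")
```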

And I haven't even touched on Associative Memory and Associative Memory Addressing yet.

These are all design tricks of the computer design engineers who actually design the logic that is a computer.

It's just much simpler for me to suggest checking one area at a time, than to lecture on doctoral thesis computer design and research, which, at this point, add up to numerous volumes.

Basically, to the machine hardware, addresses C000 and 0xC0000000 are the same area because of how the gates are arranged!

Hardly any programmers in the world know this, and thus, there is no software that handles it properly.  Which is why nearly all documentation tells you to disable this BIOS caching feature.

Again, hold down the delete key [or whatever your computer manufacturer tells you to do] to get into the motherboard BIOS setup.

There are only a handful of exceptions.

And I think this thread deserves to be in one of the books now.

cpc2004

I've moved the memory dump back to:

ftp://guest@musics.com/uploads/dumps/BobCSD/

where you can download it.

Please let me know also how long it takes.  I have some concerns about bandwidth that I'd like to resolve.

Commented:
Addendum:

You had previously asked about the serial bus [comm port, IRQ 4], saying that you wondered something about it.

Well, usually, the UPS system is on a serial connection to the comm port [how this works:  the electric company has grid signals way ahead of the substation that can send you a signal that the substation has gone down; this is possible because it takes physical time for the electricity to stop while the tripped substation signal gets to you at the Speed of Light, you have something like a millisecond or so to switch to UPS].  During that time, power and thus current and voltage slowly slew toward zero, meaning, it drops from normal voltage down to a voltage where things don't work over time; it's not instantaneous when power goes out, although it seems so.  The timing is close; you can actually get to the area where things start to flake before the UPS takes over, leading to very intermittent and seemingly unexplainable errors.  In other words, UPS is not a perfectly functioning system.

So that one computer will not be affected, while the one next to it will, in rare circumstances.

You seem to have a very professional setup at home, commendable.  I would, however, consider looking into an Isolation Transformer.  Places in Connecticut, Chicago, and Philadelphia make very good ones.  They're not really very expensive.

Funny, we have had tons of really bad electrical storms here too.  Which is why I asked; we had one of those dips that causes interruption without the servers going down, but which apparently led to a stagnant state.  When I say bad, I mean I saw an 18 wheeler next day with two substation transformers on it, and about 40 block transformers; 13,600 Volt substations, and whatever [4600 or 2300] block transformers.  A sure indication that a lot of stuff went out in the city!

Must have been quite a lightning bolt.

Silly question probably, but are you running a Windows Web Server or Apache Web Server?



Commented:
Hi Bob,

Your version of smtp.exe was built in 1992. Is it still compatible with W2K SP4? You had better install the latest version of smtp.

00400000 00547000   smtp     smtp.exe     Sat Jun 20 06:22:17 1992 (2A425E19)
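The value in parentheses in the module line above is the timestamp from the executable's header, stored as seconds since 1970-01-01 UTC, and the date next to it is that value shown in the debugger machine's local time. A minimal Python sketch to decode it follows. The UTC+8 offset is my assumption, chosen only because it reproduces the date as printed; the function name is made up. Also worth knowing: this exact value, 2A425E19, is widely reported as a fixed placeholder written by Borland Delphi linkers, so it may not reflect when smtp.exe was really built.

```python
from datetime import datetime, timedelta, timezone


def decode_pe_timestamp(hex_stamp, utc_offset_hours=0):
    """Convert a hex header timestamp to a timezone-aware datetime."""
    seconds = int(hex_stamp, 16)
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(seconds, tz=tz)


utc_time = decode_pe_timestamp("2A425E19")
local_time = decode_pe_timestamp("2A425E19", utc_offset_hours=8)  # assumed offset

print(utc_time.isoformat())
print(local_time.isoformat())
```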

Commented:
Bob
After doing some research on Google, I found out that MiDispatchFault is part of page-fault exception processing. It is a stable routine and it crashes only if there is a hardware problem or a corrupted paging file. Reallocating the paging file is free, and there is no harm in allocating new paging space. Maybe it can resolve your problem.
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>>Silly question probably, but are you running a Windows Web Server or Apache Web Server?

Microsoft all the way!

:)
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
cpc,

>>You had better install the latest version of smtp.

Do you know where I can get this? I use the software updates to regularly get the latest stuff.

Is this something provided with the windows 2000 web server, or from my Merak Mail server?

BOBi

Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
cpc2004,

>>Reallocating the paging file is free, and there is no harm in allocating new paging space. Maybe it can resolve your problem.

How does one go about reallocating the paging file?

BOBi

Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
So far there have been two suggestions:

>> Suggest checking BIOS caching; if enabled, try disabling it in the C0000000 area.
>>Let me know if disabling the BIOS cache in this area solves this problem.

and

>>I am very sure that it is a hardware problem. Most likely it is faulty RAM. Based upon my past record, I would rate the likelihood as 70% RAM, 20% CPU and 10% M/B. As hardware errors occur randomly, if W2K keeps crashing with different bugcheck codes, it is a symptom of a hardware error. If W2K always crashes at the same instruction address with the same bugcheck code, it is probably a software error.

It has had 3 different bugcheck codes in the last couple of weeks. Plus other non-bug check problems. Basically, this box has had problems since setup. Certain SQL statements that ran for years on the other box with no problems, ran up Mem Usage on this box and I had to take down those modules until I could figure out what was going on. Just a LOT of intermittent problems and site performance issues that never occurred on the other box. I USED to have time to GARDEN!!!! ;)
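The rule of thumb quoted above can be written down as a small check over the crash history. This is only a sketch of that heuristic in Python, not a diagnosis. The first entry is the stop code and instruction address from the event log in this thread; the other two entries are hypothetical placeholders standing in for the other bugchecks, whose details are not given here. The function name and the result strings are made up.

```python
def classify_crashes(crashes):
    """Apply the rule of thumb to a list of (bugcheck_code, address) pairs."""
    if len(crashes) < 2:
        return "not enough data"
    distinct_codes = {code for code, _address in crashes}
    distinct_sites = set(crashes)
    if len(distinct_sites) == 1:
        # Same code at the same instruction every time.
        return "likely software"
    if len(distinct_codes) > 1:
        # Different bugcheck codes from crash to crash.
        return "likely hardware"
    return "inconclusive"


history = [
    (0x0000000A, 0x80443637),  # from the event log in this thread
    (0x0000001E, 0x80400000),  # hypothetical placeholder
    (0x00000050, 0x80500000),  # hypothetical placeholder
]

print(classify_crashes(history))
print(classify_crashes([(0x0000000A, 0x80443637)] * 3))
```

With three different codes the heuristic points toward hardware, which matches the suggestion to test or swap the RAM before blaming a driver.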

I didn't want to reboot the box last night, due to the busy weekend. But tonight I will reboot and check on the BIOS caching. Considering cpc's suggestion that this might be a hardware problem, would disabling the BIOS in this area cause any problems?

>>I don't think the problem is related to address C0000000 being cache.

Since I am ignorant in all this regard... :( ... I can't make a wise decision. Can the two of you come up with a little plan you can agree on that you think I should do?

I'm thinking now:

I check the BIOS and see what it is.... without changing it...
I change the BIOS....
I replace the RAM.

Which one? All three?

thanks!

BOBi


Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>>My preliminary finding: I am very sure that it is a hardware problem. Most likely it is faulty RAM. Based upon my past record, I would rate the likelihood as 70% RAM, 20% CPU and 10% M/B. As hardware errors occur randomly, if W2K keeps crashing with different bugcheck codes, it is a symptom of a hardware error. If W2K always crashes at the same instruction address with the same bugcheck code, it is probably a software error.

On startup, doesn't the system check the RAM? It doesn't indicate it is bad. Wouldn't it? It's a gigabyte of dual channel RAM. How can we verify whether it's good or bad?

thanks!

BOBi

Commented:
BobCSD

There are only a couple more steps in determining whether it is an actual hardware problem before touching the hardware.  This comes before the next post, which explains why it is not a good idea to touch the hardware until you've eliminated some obvious software causes.  The posts I've made follow a well known and well organized troubleshooting technique developed over more than three decades of doing computer dumps for exactly this purpose.

I will also look up that board, and see what it is capable of; although, I would have never bought a board with the name "Abit Fatality AA8XE" because it shows very poor thinking in marketing, and will probably lead to many jokes in the computer industry.

Tell your computer builder to go here:

http://www.Tyan.com/

the next time a superb motherboard is needed; these are the very best.

As for the RAM, we haven't asked what brand it is yet, so, what brand is it and how much?

I'll also go and see what this statement is supposed to mean: "Designed for Intel® 90nm Pentium 4 LGA775 processors." which is kind of funny, to an engineer, as 90 nanometers is something most people have no intuitive feeling about its meaning.  It's a wavelength, an etching specification for substrate design, and stuff most buyers don't really care about, but which the clean labs that make the salami's are very proud of.

It's kind of overkill from the engineers, while the name of the board is a flat out boo-boo from the marketing end of the company's "fatality" board - I mean, you just don't put this in a sales product's name!

Triton II and III, F21, MIG-29, these are types of boards, and Tiger, Thunder, and Tomcat!  These are names you put on motherboards!  Like the state of the art submarines, supersonic aircraft, etc., not a reference to the casualty list!

I know I'm a little off topic here, but this is why it got that name:

"ABIT is one of the most respected board manufacturer that caters mainly to the gamer, . . ."
"Just recently, ABIT has partnered with the world's number one professional gamer, Johnathan "Fatal1ty" Wendel,"

Game over, you might just lose on this one, Johnathan.

Yes, Abit makes good boards for gamers, but I'd be skeptical for production servers, at best.

Secondly, Abit doesn't even seem to have a site with the specs and data for their boards, at least not one that can be quickly found.

So, maybe there is a serious hardware conflict, but it's not as simple as bad RAM, more like a bad choice for a production server motherboard, if it turns out that there is an actual hardware cause to this problem, which may well be, but all other possibilities should be eliminated before throwing the board in the trash and buying an expensive one that is known to work without intermittent crashes.

Okay, now check the next post.
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>>flat out boo-boo from the marketing end of the company's "fatality" board - I mean, you just don't put this in a sales product's name!

Yes, it's originally designed for gamers. The CPU is black with blue glowing lights all over it and you can see the guts inside. It's designed for teens, but Larry said that was the best CPU they had available at MediaCenter. (Or he simply isn't admitting he liked the lights. I joke with him sometimes that I send him out with the family cow and he comes back with beans. ;)

I'll find out the RAM stuff and get back with you.

BOBi

Commented:
Addendum 3:

This idea of refreshing memory was later applied to planar memory chips, that is, the current set of RAM that uses a refresh cycle to conserve power.  The RAM is not supplied a constant supply voltage and current, but uses the fact that memory is not lost all at once when power is turned off, but will remain intact for a short span of time.

RAS and CAS are terms from the 1970's.  They refer to what is called dynamic refresh RAM.  A switching power supply is used to supply power to RAM for short periods of time, and then simply turned off for another period of time.  Because the charge on the RAM circuitry decays over time, the RAM still holds valid data.  However, the refresh must occur within a specified time period, or that data will be lost.  RAS and CAS refer to the counter mechanisms which control this timing period.  During RAS cycles, the RAM is placed in a wait state and cannot be read or written.  This results in some slight delay, but an acceptable and workable one, because usually the entirety of the memory is divided up into 8 or more banks, where the other 7 banks can still be read or written and thus the time delay is completely masked out, that is, seemingly invisible to the memory control process.

Now, recall that RAM is not always on, and that it is therefore more subject to high load conditions and the consequent dips in power [current dips] in overloaded, borderline, and substandard power systems.  At some point, under load, which strains the power requirements, a dip can be propagated through to the RAM address bus, resulting in intermittent failures.  Such failures are extremely hard to identify, and in 9 out of 10 cases are not the fault of the RAM, but are the fault of the load on the power supply, the integrity of gold plated [or not, very bad not to use real gold on RAM lands and connectors!] lands and connectors, poor solder joints, extreme run lengths, substandard lasering of the gold wires within the RAM chips from substrate to pin leg platform, and so on.  I have seen, under an electron microscope, an unsoldered gold wire to an IC pin just sitting there jumping up and down when vibrated; this is called a "cold solder joint" even if it was supposed to be laser welded from the gold wire to the pin head.  Not to get too technical [which is probably unavoidable at this point], but this was about the only RAM hardware failure I could identify as an actual hardware failure.  Another one was the examination of a production run of RAM IC's wherein the negative used to etch the substrate, or the layer's photomask, had a microscopic fleck of dust that had settled on the emulsion before the lithium wash had stopped the photographic developing process.  This resulted in a run of some millions of useless RAM chips, which all had problems that showed up as intermittent and occurred only under load.  Basically, the resulting etched substrate bus had an anomaly in it and most of the time it worked fine, but every now and then some machine somewhere using the chips would get an intermittent failure.

That was unacceptable to the clients and customers, very large institutions, which demanded that these machines and these networks be up absolutely 100% of the time; even only a microsecond of down time was unacceptable to them.

And the result was that millions of RAM chips were simply thrown out and replaced with new ones for the entire production run, which was recalled from the commercial market.

Today's mass markets for microprocessors do not maintain that level of quality control, so you may get RAM that is substandard, but I haven't seen much of this to date.  Which is why you should only rely on reliable companies for RAM, and used or reseller RAM is always suspect.

RAM lands and connectors should also be cleaned with methyl ethyl alcohol for best results; some rubbing alcohols and other cleaners often contain sugar, something you definitely don't want around electricity!  The average cleaners you buy at a computer shop should be good enough though, and you needn't go hunting down the hard to find pure alcohol.

And then, the lands should absolutely never be touched by human fingers!

Seating should be confident and secure, but never forced.  If you have to force RAM seating, you can bet something has gone wrong or is going to go wrong.

At this point, when you have the time to run the procedures outlined above, you should do so.  Remember, it is going to take some time to do this and you should announce a scheduled "outage."

What we do is have another server ready to take the place of the server going down for scheduled maintenance; which is what all professional network enterprises do.  That way, we don't even have to go offline.

And scheduled maintenance is not an option beyond certain business growth size; it is absolutely necessary.

You might want to consider all of this in your business plan.
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>>a new disk that was not initially exercised enough.

Makes me think I should put it on a leash and take it for a walk! ;)
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>> You must chkdsk, scandisk, and defrag manually, and, you must do this more than one time, preferably at least three times.

How often do you expect this to be done? Daily, weekly, monthly?

>>Because Windows does not rewrite system protected areas, thus, they never get refreshed when only autodefrag is used.  You must chkdsk, scandisk, and defrag manually, and, you must do this more than one time, preferably at least three times.
>>Remember, it is going to take some time do this and you should announce a scheduled "outage."

Okay.

>>What we do is have another server ready to take the place of the server going down for scheduled maintenance; which is what all professional network enterprises do.  That way, we don't even have to go offline

I keep telling Larry we need to set up the old one as a backup for stuff like this!! Maybe he'll help me do this now.

>>Reapplying a service pack may have a similar result.

We've been doing regular service packs.

The machine isn't that old at all. Only a few months at most. Would this be an issue already?

>>repage, chkdsk, scandisk, defrag, as per above.
How does one repage?

>>And scheduled maintenance is not an option beyond certain business growth size; it is absolutely necessary.

We had our old server for about 4 years and never once did maintenance like this on it. :(

Thanks!

BOBi

Starr DuskkASP.NET VB.NET Developer

Author

Commented:
>>As for the RAM, we haven't asked what brand it is yet, so, what brand is it and how much?

Centon 2 GB PC4200 DDR2 dual channel

As far as boards, Larry wanted me to let you know:
The motherboards were over $250 each, they weren't cheap boards, and they have everything the boards you recommended do. They can have up to 4 GB ram, and take a pentium 4 chip. They support Intel hyper-thread technology. They can do RAID.



Commented:
That's fine, I didn't mean they were cheap.  They seem to be very good, which is why I'm leaning on it being problems other than hardware.  So far, I've only seen one problem here that was an actual hardware problem, and it was the motherboard, not the RAM.

There is, however, also a reason why some boards do cost upwards of $700.00, some $1,500.00, and so on.

Centon is an old enough company to be considered extremely reliable!  It wasn't much more than 26 years ago that RAM in an IC package was invented.  Aboriginal planar memory, and then Refresh Memory.

And their location is a dead giveaway, when the company age is considered, to be a sure indicator that they don't make bad memory chips.

So, I think you have good hardware.

You repage by going into the My Computer | Advanced | Performance | Settings | Advanced | Virtual Memory | Change

and set it to custom size and the other settings and make the changes there.  Do not let this go below 2 meg, ever, as it may cause the system to slug, but 32 meg or 64 meg should work for the first step of chkdsk, scandisk, defrag.  Thereafter, you want to get it back up to a much larger size, which is usually one and one-half times the size of current RAM.

Step two is to chkdsk, scandisk, defrag again, with say a 480 meg page file.

Step three is to finalize by setting the pagefile over multiple disks [makes it work faster], and other advanced techniques.
I would not, however, myself, set the page file to 6 gig simply because I had a 4 gig RAM size on board.  It's probably too much; most server people I know keep it around 1 gig maximum.
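
The sizing rule above can be sketched in a few lines. This is only an illustration of the arithmetic, assuming the "one and one-half times RAM" rule of thumb and an optional ceiling like the 1 gig figure mentioned above; the function name and parameters are made up for the example and are not a Windows API.

```python
def recommended_pagefile_mb(ram_mb, factor=1.5, cap_mb=None):
    """Return a page file size in MB using the 1.5 x RAM rule of thumb.

    ram_mb  -- installed RAM in megabytes
    factor  -- multiplier applied to RAM (1.5 in the rule above)
    cap_mb  -- optional ceiling, e.g. 1024 if you keep page files near 1 gig
    """
    if ram_mb <= 0:
        raise ValueError("ram_mb must be positive")
    size = int(ram_mb * factor)
    if cap_mb is not None:
        size = min(size, cap_mb)
    return size


if __name__ == "__main__":
    print(recommended_pagefile_mb(2048))               # 2 gig of RAM
    print(recommended_pagefile_mb(4096))               # 4 gig of RAM, the "6 gig" case
    print(recommended_pagefile_mb(4096, cap_mb=1024))  # same machine, capped at 1 gig
```

The cap is there because the rule of thumb grows without limit, while the advice above is to stop well short of it on machines with a lot of RAM.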

How you get to the page file [Virtual Memory] to repage may be different on your server, but it will be similar, all you're doing is changing the disk cache, which is a better term than Virtual Memory for what the page file actually is.

I have seen at least three intermittent crash problems solved this way here on Experts-Exchange.

I've added a link for you, and others, because the question keeps coming up, and I'm trying to finalize a standard method for approaching this problem.

http://www.musics.com/manhtml/Troubleshooting/

will be an ongoing documentation of Troubleshooting Crash Dumps.

On Preventative Maintenance:

Yes, you can get away without it for years, but sooner or later it is going to catch up to you, and when it does you may be at the threshold of a medium sized business where a crash could be both catastrophic and disastrous for company owners and stockholders.  I would not even like to imagine a runaway crash that is destroying petabytes of critical company data.  {shudder!} It's just too much of a horror picture that turns into real life!

Need I scare anyone, I once witnessed a man lose $200,000,000.00 and everything he owned because he did not see that even he could make a mistake.  True story.

I have servers that have been running for over 10 years here, with little or no change in hardware, except the occasional upgrade of motherboard, disks, power supplies; but these were done as scheduled maintenance with a backup server.  There are other servers I've worked on that have been running for over 30 years, and they definitely have scheduled maintenance, especially the ones at places like the Federal Reserve; none of us would want them to crash, would we?

See if you can find your way through all of this advice and we'll see if we can solve this problem.

It's one for the book.

Commented:
Hi Eric,
It is a great idea to document how to diagnose crash dumps. The document must be simple and easy to understand. I can't understand your document.

>>> I guess I should explain again that "IRQL_NOT_LESS_OR_EQUAL" means IRQ greater than 1; this determines only that it is not the system timer or the keyboard, NMI or non-maskable interrupts. But if the stack dump says the IRQL was "2," it means that it was re-entrant code via the PLC system, in other words, a redirected IRQ, and while not terribly important, it is the context of the failure which thereby eliminated the timer and keyboard. <<<

My comment
You mixed up IRQL and IRQ.

IRQL is the interrupt request level: a priority level that the kernel manages in software.  IRQ is the interrupt request line, and it is hardware.
IRQL 2 is the dispatch level IRQL and it is not related to IRQ 2.

00 PASSIVE_LEVEL  - execute thread
01 APC_LEVEL      - execute special kernel APC; page fault
02 DISPATCH_LEVEL - dispatch (execute DPC)
03..1A            - device interrupts (24 levels)
1B PROFILE_LEVEL  -
1C CLOCK2_LEVEL   - interval-timer execution
1D REQUEST_LEVEL  - interprocessor request
1E POWER_LEVEL    - power failure notification
1F HIGH_LEVEL     - machine checks or bus errors
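
As a quick reference, the table above can be turned into a small lookup. This is a sketch based only on the x86 levels listed here; the label used for levels 03 through 1A is descriptive text rather than an official constant name, and the function names are invented for the example.

```python
# x86 IRQL values taken from the table above.
IRQL_NAMES = {
    0x00: "PASSIVE_LEVEL",
    0x01: "APC_LEVEL",
    0x02: "DISPATCH_LEVEL",
    0x1B: "PROFILE_LEVEL",
    0x1C: "CLOCK2_LEVEL",
    0x1D: "REQUEST_LEVEL",
    0x1E: "POWER_LEVEL",
    0x1F: "HIGH_LEVEL",
}

DISPATCH_LEVEL = 0x02


def irql_name(level):
    """Map a numeric IRQL (0x00..0x1F) to the name used in the table."""
    if not 0 <= level <= 0x1F:
        raise ValueError("IRQL out of range for this table: %r" % level)
    # Levels 0x03..0x1A are the device interrupt levels.
    return IRQL_NAMES.get(level, "device interrupt level (DIRQL)")


def may_take_page_fault(level):
    """Code running below DISPATCH_LEVEL may touch pageable memory."""
    return level < DISPATCH_LEVEL


if __name__ == "__main__":
    level = 0x02  # second bugcheck parameter from the event log entry
    print(irql_name(level), may_take_page_fault(level))
```

The point of the second function is the one that matters for this crash: the IRQL in the bugcheck is a priority level, and at level 2 or above a reference to paged-out memory cannot be serviced.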

Starr DuskkASP.NET VB.NET Developer

Author

Commented:
>>I can't understand your document.

I can't understand much of most of the stuff in this post. Makes my head hurt. ;)

I'm just a poor mother trying to feed my children. But I'm going to figure it out and I appreciate ya'lls help!

BOBi
Starr DuskkASP.NET VB.NET Developer

Author

Commented:
>>32 meg or 64 meg should work for the first step of chkdsk, scandisk, defrag

Okay, I'm going to practice on my old server....then I'll do the new ones at 2 in the morning:

Under virtual memory, My old server settings are:
Initial size: 768
Max size: 1536
minimum: 2 MB
recommended: 1534 MB
currently allocated: 768 MB
Current registry size:  15 MB
max registry size: 54

The new server settings for both web and database are:
Initial size: 2046
Max size: 4092
minimum: 2 MB
recommended: 3070 MB
currently allocated: 2046 MB
Current registry size: 17 MB
max registry size: 114

I guess by the time I'm done, I should set it back to the recommended size. Both of them seem to be really off from their recommended size. That's what the initial size should be, right? The recommended size?

BOBi

Commented:
Hi Bob,

I understand you want to fix the problem. My preliminary finding is faulty RAM or corrupted paging space. If you provide the system events 1001/1003 from the last two months, it will help to find the root cause of the problem.

When Windows crashes with a blue screen, it writes a system event 1001 or 1003. Check system events 1001 and 1003; they have the content of the blue screen.

Event ID: 1001
Source: Save Dump
Description:
The computer has rebooted from a bugcheck.The bugcheck was : 0xc000000a (0xe1270188, 0x00000002, 0x00000000, 0x804032100).
Microsoft Windows..... A dump was saved in: .......


Event Source: System Error
Event Category: (102)
Event ID: 1003
Description:
Error code 1000007f, parameter1 0000000d, parameter2 00000000, parameter3 00000000, parameter4 00000000

Control Panel -> Administrative Tools -> Event Viewer -> System -> Event 1001/1003. Copy the content and paste it back here.
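
Once the event text is copied out, the bugcheck code and its four parameters can be pulled apart mechanically. The sketch below assumes the "The bugcheck was: CODE (P1, P2, P3, P4)" wording shown in event 1001, and uses the documented meaning of the parameters for bugcheck 0x0A (address referenced, IRQL, read or write, address of the instruction). The function names are made up for the example.

```python
import re

# Matches "bugcheck was: 0x0000000a (0x..., 0x..., 0x..., 0x...)",
# with or without a space before the colon.
BUGCHECK_RE = re.compile(r"bugcheck was\s*:?\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)")


def parse_bugcheck(description):
    """Return (code, [p1, p2, p3, p4]) from a Save Dump event description."""
    match = BUGCHECK_RE.search(description)
    if match is None:
        raise ValueError("no bugcheck found in event text")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe_0a(params):
    """Interpret the parameters of bugcheck 0x0A (IRQL_NOT_LESS_OR_EQUAL)."""
    address, irql, access, instruction = params
    return {
        "address_referenced": hex(address),
        "irql": irql,
        "access": "write" if access & 1 else "read",
        "instruction_address": hex(instruction),
    }


if __name__ == "__main__":
    text = ("The computer has rebooted from a bugcheck.  The bugcheck was: "
            "0x0000000a (0xc0074ee4, 0x00000002, 0x00000000, 0x80443637).")
    code, params = parse_bugcheck(text)
    if code == 0x0A:
        print(describe_0a(params))
```

Run against the event from the first message, this reads as a read access at IRQL 2 to address 0xc0074ee4, made by the instruction at 0x80443637.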
Starr DuskkASP.NET VB.NET Developer

Author

Commented:
I had already pasted 1001 in the very first message:

Event Type:     Information
Event Source:     Save Dump
Event Category:     None
Event ID:     1001
Date:          8/19/2005
Time:          11:57:02 PM
User:          N/A
Computer:     SSWEB
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x0000000a (0xc0074ee4, 0x00000002, 0x00000000, 0x80443637). Microsoft Windows 2000 [v15.2195]. A dump was saved in: C:\WINNT\MEMORY.DMP.

There was no 1003.

Thanks!
Starr DuskkASP.NET VB.NET Developer

Author

Commented:
On my old server I ran chkdsk /f, rebooted, finished it.
Then I went to run scandisk, and got an error that it could not find the file.
I did a search on the drive and could not find it.

I guess I'll skip that and do defrag.

BOBi


Commented:
I do not find any memory corruption in the dump. The page fault handling routine is well tested, so it crashes only on a hardware error or corrupted paging space. I suggest you re-allocate a new paging space as a workaround for the problem. After re-allocating the paging space, if the BSOD still occurs, it must be faulty hardware.
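
One way to see why the page fault path is involved is to look at the first bugcheck parameter, 0xc0074ee4. The sketch below is only arithmetic under a stated assumption: the standard 32-bit x86 layout without PAE, in which Windows maps its page table entries as 4-byte entries starting at 0xC0000000. If the server runs with PAE the constants differ, so treat the result as an illustration and not a finding from the dump.

```python
# Assumed layout: 32-bit x86 Windows without PAE.
PTE_BASE = 0xC0000000   # start of the page table entry region
PTE_END = 0xC03FFFFF    # end of the region (4 MB of entries)
PTE_SIZE = 4            # bytes per page table entry
PAGE_SHIFT = 12         # 4 KB pages


def is_pte_address(address):
    """True if the address falls inside the page table entry region."""
    return PTE_BASE <= address <= PTE_END


def pte_to_virtual(address):
    """Return the virtual address of the page that a given PTE describes."""
    if not is_pte_address(address):
        raise ValueError("address is not in the PTE region")
    return ((address - PTE_BASE) // PTE_SIZE) << PAGE_SHIFT


if __name__ == "__main__":
    referenced = 0xC0074EE4  # first parameter of the 0x0A bugcheck
    print(is_pte_address(referenced))
    print(hex(pte_to_virtual(referenced)))  # 0x1d3b9000
```

Under that assumption the faulting reference is to a page table entry, which is the kind of access the memory manager makes while resolving a page fault.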

Commented:
BobCSD you have to complete the scandisks after the chkdsk or the process is ineffective in fixing bad blocks on the disk.  Check Disk finds and marks the bad blocks setting a bit for scandisk to try to reallocate them, and scandisk looks for that information and the try to fix and reallocate bit and does the actual fixing.

You can go to something like My Computer or Administrative Tools and either use Disk Manager then the partition's properties to get to the scandisk tool, or you can simply get the drive's properties, then the tab or button that says check this disk now, or something like that.

chkdsk marks the bad blocks that scandisk will fix; it must be done this way or corrupt blocks will not be fixed and added back into the free memory pool.
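
For anyone scripting this, a small helper can build the command line. It is a sketch that assumes the two chkdsk switches /f (fix file system errors) and /r (locate bad sectors and recover readable data, which implies /f); as far as I know these correspond to the two check boxes in the drive's "check now" dialog on Windows 2000. The helper only builds the argument list and does not run anything; its name is made up for the example.

```python
def chkdsk_command(drive, fix_errors=True, recover_bad_sectors=False):
    """Build the argument list for a chkdsk run on a drive such as "C:"."""
    if len(drive) != 2 or not drive[0].isalpha() or drive[1] != ":":
        raise ValueError("drive must look like 'C:'")
    command = ["chkdsk", drive.upper()]
    if recover_bad_sectors:
        command.append("/r")  # /r implies /f
    elif fix_errors:
        command.append("/f")
    return command


if __name__ == "__main__":
    # First pass: fix file system errors only.
    print(" ".join(chkdsk_command("C:")))
    # Second pass: also scan for and recover bad sectors.
    print(" ".join(chkdsk_command("C:", recover_bad_sectors=True)))
```

On a system drive the check cannot lock the volume while Windows is running, so it is scheduled for the next reboot, which is why the outage has to be planned.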

To clear up some definitions for cpc2004 before continuing the discussion of the problem:

IRQ ::= "Interrupt Request Queue"  # The number of the interrupting device being queued by I/O
IRQL ::=  "Interrupt Request Queue Level" # The Lexicographical Level of the interrupting device
                                          # on modern computers this is implemented in hardware
                                          # via Control Modes; 0, 1, and 2 associated with the
                                          # System Control Mode Errorhandlers which are called
                                          # by the special error flag bits of the Control Mode
                                          # Operations.  Simply put, an error has occurred, so
                                          # the Control Mode Error Handling routines are called.

I will refine that in the troubleshooting document.  But none of that is germane to the problem at hand, not really.  There is a great deal of confusion at both Intel and Microsoft as to what the design engineers are trying to tell them, so I can understand that oftentimes technical documents will not, at first, be understandable.  We write the technical documents first, and then try to reduce them to laymen's terms.  Since it's rather too technical to put here, I will add yet a third document to try and explain why it's in virtual memory, why the handler was called by the Intel Special Operator Set, and why it points to caching memory.

Back to BobCSD:

A bad address is loosely called memory corruption; both by Windows and others.  The dump error report uses the term loosely, but a bad address is a corrupt address, and since it pertains to memory, is just agglomerated under the general heading of memory corruption.

Windows suggested it was probably memory corruption.  And they further explain that that corruption is most likely a bad address.  This can occur for a number of reasons, the least of which is usually hardware.  Although Intel has admitted that there may be timing problems with the P4 when SCSI is used or emulated, and the programmer has not allowed for or written the proper code in their driver when an indirect addressing reference is made and/or the indirect vector addressing reference is used.  The result is that the request to read or write arrives too soon ahead of the address couple, and thus, the indirect reference indexes the wrong pointer in a multidimensional pointer array.  Simplistically put, the operator executes before the arrival of the proper address on the address bus!  Thus, the wrong address is fetched or written.

To add insult to injury, this fault was discovered by Linux engineers and programmers, who put in the fix by telling Intel and the others to add 7 no ops to the beginning of their drivers to allay the iterative memory transfer routine long enough for the memory and pci busses to synchronize to the memory transfer.  And I have to refind the link to that information as it seems to have disappeared from one of my previous answers at Experts-Exchange.

However, it was in Virtual Memory [C0000000 is the Virtual Memory Area].  Virtual Memory is an unusual concept, it includes devices, such as disk partitions, the cache and pagefile, and other peripheral addresses as part of the memory space.  Basically, it's saying that either the Virtual Address was wrong, the device didn't exist [and thus the memory space did not exist], or the device did not respond in a timely manner.

Thinking along those lines, RAM should actually have little to do with Virtual Memory, unless part of it is allocated in the virtual address space, an unusual thing to do.

Okay, so back to BobCSD's troubleshooting.

I see your new server is quite powerful.  For the initial size, you want something slightly over the actual size of RAM, add about 100 meg to the RAM size.  Recommended size can also be the maximum size, this is where the one and one-half rule is used, although, 2046 is not an exact boundary, it should report 2048 [which is 2 gigabytes of memory; some systems use 2 meg for the mmx drivers that emulate missing hardware in some RISC computers], I would therefore suggest making the minimum size 2048, and the maximum size 3072.  However, again, you have to think "What is it actually doing with this allocation?"  It's merely providing an area equal to or greater than the size of RAM from which to cache temporary and interim fetches to and from main memory.  Many administrators question whether any allocation over 1.5 gig is really necessary, and, whether or not there is any improvement in performance beyond the one gigabyte RAM implementation, with most administrators practically stating flat out that there is none beyond the 2 gigabyte memory implementation.  There are gaggles of technical reasons for them making these statements, but they support it with actual observed results; more memory beyond some point does not improve performance and in some circumstances has degraded performance.

Okay, so let's get back to your new server now.  How did you initially set it up?  How many hard drives, what sizes, and how many partitions on each hard drive?

I would like to know all of this to outline a suggestion for initially setting up the physical topology of a modern server.

BOBi, don't worry about the technical jargon, just take it one step at a time.

Before we conclude that you do have something like the timing problem, try to get these done in order [don't forget to set "automatically fix all errors"] :

chkdsk, scandisk, defrag

Even if you have to do it using the disk properties and scheduled for the next time the system reboots.  I do want to know about the number of hard drives and the partitioning because this affects the speed of your server.

I appreciate all ya'lls' help and comments.
Starr DuskkASP.NET VB.NET Developer

Author

Commented:
>>BobCSD you have to complete the scandisks after the chkdsk or the process is ineffective in fixing bad blocks on the disk.  Check Disk finds and marks the bad blocks setting a bit for scandisk to try to reallocate them, and scandisk looks for that information and the try to fix and reallocate bit and does the actual fixing.

So did I break anything by doing a chkdsk and defrag without the scan disk?

>>or you can simply get the drive's properties, then the tab or button that says check this disk now
That's where I found it, not the other.

>>Okay, so let's get back to your new server now.  How did you initially set it up?  How many hard drives, what sizes, and how many partitions on each hard drive?

1 floppy drive (antique eh?)
1 C drive
1 CD/DVD drive

also a USB Maxtor 300 GB is attached, but that is removable.

C drive is 34.4 GB with no partitions.
File System: NTFS
Used: 8.27 GB
Available: 26.1 GB
Location: 0

This is not on a network where someone accesses my drive by putting in J: or something like that.. but in sharing, I do notice that Share this folder is selected: C$, default share. Web sharing is turned off.

check box for allow indexing service to index is checked...
do I need that? does that allow me to do the search and find files via explorer? or is it something else? Maybe I should turn it off.

I do notice that when I select security from on C: ... the owners is Administrators... but Administrators has no permissions set. And Administrator (singular) has only read/execute, list, read. Everyone has everything. And System has read/execute, list, read. CREATOR OWNER has nothing.

>>Before we conclude that you do have something like the timing problem, try to get these done in order

I did this stuff on my old server the other night and to run through it once took more than an hour. I'm thinking of setting it up as a DFS as per:
https://www.experts-exchange.com/Operating_Systems/Win2000/Q_21534761.html

Then that way I can take the box down and do messing around to my heart's content without shutting down my sites. Nothing like adding more potential problems to the mix though. But I just don't see how I can get this fixed without having a backup in place. I am so fearful of ruining the only box I have.

So today, after I get breakfast, I'll be working on that.

BOBi

Commented:
No, you did not break anything by doing a chkdsk and defrag; most probably Windows just skipped over any bad blocks and kept them in its list of bad blocks, no harm done.

I believe in having floppy drives, even if the rest of the world doesn't.  Floppies have saved the day when all of the other super computer devices just sat there with "duh . . " and egg on their faces.  Keep the floppy; it's the smartest thing you can do.  Buy more at flea markets, before the nonces have their way and make them obsolete.

Okay, the new server setup continues; you have a base 34.4 gig drive.  A good idea, well within needed range.  However, I would consider adding another drive for simple redundancy, in addition to the removable or hot swappable drive of 300 gigabytes.

Why I would do this?

Disk caching for the page file system is faster across two disks because both can be accessed simultaneously and/or the access time masked while the disk cache is done in the background.

Secondly, I would have set up the drive for optimum partitioning.  That is, a schema that goes something like this:

Drive 1: two partitions, one 30 gigabytes, and the remaining on an extended partition divided into two caching partitions.
Drive 2: [this is not the hot swappable or removable drive!]  probably 200 to 300 gigabytes partitioned into at least two or three spaces, with two cache spaces.

The partitioning looks something like this:

Disk or Partition     Volume Label    
C                     <any name>
D                     <another name>

and so on.  I've started a much overdue document on setting up a Server at :

http://www.musics.com/manhtml/Windows/Partitioning/PartitioningWindows.html

The document is fairly complete as of its first draft and should be useful.

I completely agree that you should have a backup in place.  Be careful with Distributed File Systems; they have their own nightmares and are more geared toward large systems.

You might consider a mirror for the 30 gig drive though.

The amount of time to defrag is a function of the size of the disk.  Plainly put, 300 gigabytes on a single partition is just asking for trouble!

See the document above.

What basically happens is that people generally do not do much planning for their Server Systems, and discover problems with their configuration, pre-setup, and setup afterwards, when it is often too late to make changes.

I sincerely wish that Microsoft had never got into their Indexing; with Explorer embedded in Internet Explorer, and gazillions of histories and logging of every keystroke, then constantly snooping around with what seems to me to be the world's slowest indexing scheme, the computer is spending entirely too much time doing nonsense, instead of its job, to serve.

I have been looking into that indexing check box, and it seems to be a child of Fast Find, which actually made Windows slower in finding things, than it could in Windows 3.1 days.

You're not doing too bad, but don't overdo it; accept that you'll have a crash here and there until you either exercise the machine or wear it in a little, and until you get that other box to back it up.

Don't simply try everything in the Internet book; remember, if it's working don't try to fix it.  And one crash at 4:00 A.M. means it's working better than one crash every hour.

If you uncheck the indexing, it will ask you if you want to unindex for every file on disk.  This will take a lot of time too.  However, I unchecked that box last night on a Windows Server and now Explorer does seem to respond much faster to browsing folders and so on.

Be careful with the DFS, and yes, you should have been running the Server from an NTFS partition.

Again, read:

http://www.musics.com/manhtml/Windows/Partitioning/PartitioningWindows.html

What my Windows Server does is to provide a custom menu from a DOS partition, which allows me to select which Operating System to start.  I have opted to have three hard drives, and the Production Server Operating System is on the second drive, the D drive, I simply set it to the default boot.

I'll also post a copy of the menu for people.  The server has basically been running for well over ten years without a reinstall, bad crash recovery, virus bring down, etc..

The entirety of the technique requires that the partitioning and setup of the server be well planned, something that's a little late unless you have a backup server while you do it.

One last note:  no need to put any pagefiles on a hot swappable drive.  In fact, it might prove detrimental.  I haven't thought much about partitions, but I can't see them doing any harm, depending on what the hot swap drive is used for.

And I leave a few one gig areas unpartitioned on disks, in case I later need them.  And, the pagefiles I put on a partition leave some bytes, like I only use 1 gig out of a 1.5 gig partition, otherwise, if the cache fills up it will start popping up windows telling me the disk is almost full.  So, the partitions for pagefiles in the example although 1.5 gig in size, are only assigned a 1 gig pagefile each.

I don't care about an unused 1 gig space with hundreds of gigs on multiple disks.  Well, at least I hope it never fills up.

The Microsoft "Restore" function was based on this partitioning scheme.  I use one disk, as you'll see, for a complete Operating System Image from which a crashed system can be recovered.  Dell also uses it for their repair or restore partition, which they now seem to put into every system they ship.

DFS and replication also take time away from your cpu and server, so weigh your decision carefully.  Generally, it is used in multiple server systems in very large networks.  Read more about it online.

And, I have to add this, if the P4 was anything of a problem, it turned out to be that it was too fast!  Secondly, it does require some special heat dissipation.  Many server operators using P4's have opted for cooling towers in the place of just a fan blowing air.  And heat will cause memory related problems.  There are other posts here on EE about the P4.  See if you can find any and if they are of any help to you.

ciao for now.


Starr DuskkASP.NET VB.NET Developer

Author

Commented:
>>However, I would consider adding another drive for simple redundancy

Oh, I forgot, both boxes have a mirrored drive as well. Sorry.
Starr DuskkASP.NET VB.NET Developer

Author

Commented:
>>Secondly, I would have set up the drive for optimum partitioning.

Can you partition a drive after the server is already setup and in use? Or does this require formatting and starting over?

>>I completely agree that you should have a backup in place.  Be careful with Distributed File Systems; they have their own nightmares and are more geared toward large systems.

I don't think the point of the DFS is going to work for me as I have IIS 5 and I think it takes IIS 6 to right click and export... I'll just program on one box and copy over the files I guess....

>>You might consider a mirror for the 30 gig drive though.

Yes, all 4 boxes, old and new, have mirrors. Forgot. Sorry.

>>300 gigabytes on a single partition is just asking for trouble!

That's my USB removable Maxtor, not my server. You think I should partition it? I use it for backing up and like never running out of disk space on the daily, weekly, monthly, etc. backups.

>>no need to put any pagefiles on a hot swappable drive

What is a hot swappable drive? The USB Maxtor?

I am setting up the backup box now and copying over the files. I got my network setup tonight and the test site up and running and the DNS records setup with the additional IP's and the firewall setup to point the remote to the new local box.... this is pretty exciting doing something that is actually working!

BOBi
(not a hardware tech, if you couldn't tell ;)






Commented:
I was asked to formalise the troubleshooting procedure for Twilight Zone Crashes, so I used your dumps as sort of a basis.

http://www.musics.com/manhtml/Windows/TwilightZone/ProcedureCrashFix.html#Procedure

Unfortunately, repartitioning requires that you start over.  Some people have used Partition Magic, but from what I've seen they've all eventually had problems.  My guidelines:

http://www.musics.com/manhtml/Windows/Partitioning/PreSetupServer.html
http://www.musics.com/manhtml/Windows/Partitioning/PartitioningWindows.html

some reading for now.

If your systems are truly mirrored [RAID 1, and not other RAID implementation], you will have to think and take notes, and consider that you must provide mirrors of exactly the same size, that is, two separate partitions on two separate disks for each mirror set.  Stripes, Parity, and the other RAID's are not mirrors!  Mirrors are "exact" copies of a disk on another disk.  When one is corrupted, you fix it by issuing a command to "break the mirror" so that the system reverts to the one disk in the mirror set that it finds uncorrupted.  Thereafter, you fix the corrupted disk and remake the mirror.  You do not approach this with other RAID methods, but by Microsoft Documentation on the Mirror Set.

If you follow the partitioning scheme outlined, you will have to calculate how to fit mirrors into the scheme.

Hot swappable means removable, with the added feature that the machine does not have to be powered down, thus, the swapping of drives can be done while the box is "hot."

In your case I'd consider the USB Maxtor hot swappable.

Since it's swapped in and out, partitioning is entirely up to you.  If 300 gig works, use it; as long as you feel confident with it.  That's an awful lot of data, big database?

In servers and multiple servers, we try to divvy things up as follows:

01.)  Operating System - one disk or partition, stands on its own, no access by apps or db
02.)  Applications - one disk, or partition, stand on their own, user files and docs on DataBase
03.)  DataBase - one disk, or partition, information storage only, no applications.

Now you can vary and plan this as you see fit.  The basic idea is to keep all applications and all database, information, and the like, from writing to the Operating System disk or partition.

To keep applications that are non-database off of the database and operating system disks or partitions.

To keep database operations off of the operating system and applications disk or partitions.

User profiles, of course, remain on the Operating System's disk, but things like documents, accounts, personal pages, and the like, are kept on a different disk.

In fact, the idea of a separate server for web services and email conforms to this concept, keeping them, as well, off of all other areas - operating system, applications, database; unless web and email are integrated into a database, but even in that scenario, we can further delineate separate areas and partitions for web databases and database oriented email.

Which is why "Planning" is number one in networking.

What is best?  Which is most efficient?

With even two servers, maybe six disk drives, the throughput can be increased six times if operations for different services are on different disks because, remember, these modern systems can access multiple disks simultaneously and effectively mask the access time to near zero access time.  Memory operations are near transparent to code execution, therefore, far less cpu time is used and far less wait time is queued.
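
The "six times" figure is best read as an upper bound, and a toy model shows why. The sketch assumes each disk works through its own queue independently and that nothing else (bus, controller, cpu) is the bottleneck; the per-disk times are invented numbers for illustration, not measurements.

```python
def io_times(seconds_per_disk):
    """Return (serial, parallel) elapsed time for per-disk I/O workloads.

    serial   -- all work done one disk after another
    parallel -- all disks working at once; the slowest one sets the pace
    """
    if not seconds_per_disk:
        raise ValueError("need at least one disk")
    return sum(seconds_per_disk), max(seconds_per_disk)


def speedup(seconds_per_disk):
    """Ratio of serial to parallel elapsed time."""
    serial, parallel = io_times(seconds_per_disk)
    if parallel <= 0:
        raise ValueError("workloads must be positive")
    return serial / parallel


if __name__ == "__main__":
    # Work spread evenly over six disks: the ideal case.
    print(speedup([10, 10, 10, 10, 10, 10]))
    # Same total work, but one disk carries most of it.
    print(speedup([50, 2, 2, 2, 2, 2]))
```

With the work spread evenly the model gives the full six-fold gain; with one busy disk the gain nearly disappears, which is the argument for planning what goes on which disk.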

And as a laughable comment, "that's why we designed it that way."  Someone forgot to tell people about it!

Very glad you got your backup server up.

Please take some time to read all of the documentation at :

http://www.Musics.com/manhtml/Windows/

Comments and Feedback appreciated.

Commented:
Hi Eric,

I haven't gone through your whole document, but I find that some of the explanation is incorrect. You still mix up IRQ and IRQL. Bob's machine crashed at IRQL 2, not IRQ 2. A routine executing at IRQL 2 or higher cannot take a page fault, and this is the meaning of "IRQL_NOT_LESS_OR_EQUAL".

>>>And, as another hint about how IRQ>1 is used, from the Registry, Plug and Play devices, somewhere under CurrentControlSet or the others, ControlSet00, etc., IRQ 2 is pnp0200 under ROOT\*PNP0200\PNPBIOS_2

This comment is not related to Bob's problem. If you want to explain IRQ 2, don't use Bob's minidump.

There are a lot of good webpages to guide us on how to use windbg.
http://www.codeproject.com/debug/cdbntsd6.asp.  

I am not intelligent enough to understand your document, as you write it from a hardware point of view. I think Bob is more interested in finding the culprit of his problem.
Starr DuskkASP.NET VB.NET Developer

Author

Commented:
>>If 300 gig works, use it; as long as you feel confident with it.  That's an awful lot of data, big database?

The Maxtor is only used for my backups. No, I have a 300 gb Maxtor on the database server as well, this one is for the web server.

They both are for backups.

I do a daily backup, weekly, monthly and keep old versions around a while, so that allows me plenty of room for backups without it filling up the drive too soon. I do system state, complete backup with file changes, etc.

Plus I swap the backup units on the machines occasionally so that both will have backups of both machines on them in case one back up goes bad. So they both have data from two machines (3 if you count my entire hard drive of my development machine as well.)

It's probably overkill, but doesn't cost that much more for the 300gb, so I figure why not go for the gusto and never run out.

Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>>Unfortunately, repartitioning requires that you start over.

With the server only having 36 gb, is that really necessary? The old one had only 18 gb.

So do I understand this right, with partitioning:

I keep the system stuff working on one partition, while the regular activities of the website are running on the other partition, and it allows them both to work at the same time. Without the partition, they have to take turns working?

So partitioning improves performance?
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>>02.)  Applications - one disk, or partition, stand on their own, user files and docs on DataBase
>>03.)  DataBase - one disk, or partition, information storage only, no applications.

My database is on an entirely different webserver, but I guess if you're talking specifically about my database server, then this would still apply.
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
>>The basic idea is to keep all applications and all database, information, and the like, from writing to the Operating System disk or partition.

Back in the old days, I had a partitioned drive, and the problem with that was that I ran out of space on the programs partition and didn't have room to add any more. So after that I quit partitioning so that my partitions wouldn't run out of room. It was a real mess.

>>In fact, the idea of a separate server for web services and email conform to this concept
I have a web server and a database server on different boxes.

I also have my mail server and chat server on the same box as my web server. I am thinking of moving them to their own server so that if the web server goes down and I have to move to my backup webserver, I don't have to maintain the chat and mail server on my backup box as well. Both of those are not at all utilized very much. The mail typically just does outgoing mail that I send. I don't have thousands of users using my mail server, because that is not my business. It is just for me sending mail through it to my members, etc. The chat server can host thousands, but sadly only has about 20 given members in it at a time, maybe 5-10 on average. But still, they are taking up space and being utilized.

>>With even two servers, maybe six disk drives, the throughput can be increased six times if operations for different services are on different disks because, remember, these modern systems can access multiple disks simultaneously and effectively mask the access time to near zero access time.  Memory operations are near transparent to code execution, therefore, far less cpu time is used and far less wait time is queued.

Good to know. Do you have recommendations as far as the size amount for each partition? I haven't looked at your document yet, does it cover that? I just don't want it to be too small and run out of room, or too much and waste space in it that will never be used. Now I have to talk my husband into formatting and starting over, and so far with a quick check out at the pond where he is knee deep in water, he is not very happy with the idea. ;) I think I might be doing it myself. It's hard to find good help nowadays.

There is so much in this question/answer to do and think about. I think I'm going to close this out. If you want to keep adding to it after it is closed, feel free to, I will read it, but I don't want folks to think I'm still waiting for help and it will certainly take  me forever to go through all this stuff.

Thanks!

BOBi
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
I am accepting this as the accepted answer...

>> Because both backup and antivirus started at 4:00 A.M., it is very likely that you were sucking up all available memory, running at an unusual pace and cpu time percentage, and that perhaps backup collided with antivirus in the storage of the dif files for your backup, or antivirus may have added changed files, locked a changed file, or whatever, and the two could not get along; backup and antivirus.

because....

The ntbackup started at 4:00 am and did complete
The mcafee virus scan started at 4:00 am and didn't complete....
Also, the database server was doing a backup at 4:00 am as well, which was on a different box, but site users would still be accessing the database from the webserver, thus still putting load on it remotely.

The shutdown started at 4:00:57 am.
The first event announcing the shutdown was at 4:11
And the memory dump was announced at 4:22

I haven't had any of this particular dump since then. I do intend to go through this and do various of the tuning suggestions to improve my server and I appreciate everything!

BOBi

Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
GinEric,

I have an idea.... since you're writing a book.... How can I reach you?

BOBi

Commented:
BobCSD,

While finalising the draft on Windows Twilight Zone Crashes, I had to come back to this thread for reference, since it was such a comprehensive effort on all of our parts, you, cpc2004, and I.

One thing I want to clear up for cpc2004, and others, IRQL > 1 is IRQ = 2, and it is the redirected IRQ from the PLC to Virtual Devices [Plug and Play devices] and it includes the Virtual Address space at C0000000 by definition.  At various places, references were made to Microsoft's definitions of IRQL, and they proved that, indeed, the pagefile is accessed above this level.  Since the Virtual Space is on "devices" and the pagefile is a "device" on hard disk, the memory corruption can most assuredly occur in the pagefile and thus a pagefault can be the result of this error event.

It is best to "never say never" with a computer system, computers do, in fact, make mistakes.  I actually had to prove this many years ago to the people who kept chanting "computers never make mistakes."  Yes they do, and they can go unreported as well, that is, they can make a mistake that no one will ever know about.

It has been proven, once and for all time, that computers do make mistakes.

The assessment that the "load" was the cause of the error effervesced from the discussion.

And that is the point of what the Continental Congress called "arguing" as in the realm of a debate.  Arguing per se is not a "bad thing," since, apparently, America is founded on arguing, which continues to this day.  It is a positive thing.

You can reach me, BobCSD, through my website, or simply use James @ Musics.com



Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
GinEric,

I had moved everything over to my backup computer and have been operating on it for several weeks. My other computer, when I log into it, or network to it, or whatever I do to it, typically reboots itself with 0x0000001A MEMORY_MANAGEMENT or 0x0000004E PFN_LIST_CORRUPT errors....

does a code dump... etc. Got one of each of these today.

I have asked Larry to take the memory out and get it tested or replaced. Seems like there should be some warranty on that. Meanwhile, the computer is worthless and I just won't/can't use it for anything! Not even as a backup box. Stupid machine.

BOBi

Commented:
Really quite a shame.  Do you have any of those new dumps?  This question has piqued more than a few people's curiosity.

I have a whole section on crash dumps now,

http://www.Musics.com/manhtml/Windows/TwilightZone/

which refers to two of your questions here.  PFN has something to do with pointers into the pagefile.  Again, it appears as if addressing is bad.  If the processor is a P4, I am more inclined to blame the timing of the buss than the RAM, which came from a reputable company.  That's not to say they don't have bad ones occasionally, but it is rare.  If this is the SCSI driver indirect vector reference, as I suspect, it quite literally boils down to the motherboard being "too fast" for the software.  I'm still trying to relocate that write up by Intel and the Linux people because it is a software problem and not a hardware problem.

I'll keep checking.  Meanwhile, if you have any dumps, send them on.

Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
The thing is, we built two computers, exact same hardware. Both with Windows 2000. But one was for the web server and the other for the database server. We haven't had a lick of problems with the database server (argh, I know I'm jinxed now!)... but all the problems are with the web server. So if the motherboard were too fast on one, it should be too fast on the other.

I saw this in another post, from CrazyOne:
https://www.experts-exchange.com/Operating_Systems/Win2000/Q_20373431.html
DocMemory PC RAM
Diagnostic Software
http://www.simmtester.com/PAGE/products/doc/docinfo.asp

And rather than keep spinning my wheels with dump files, paging, scan disk, chkdsk, and stuff, I'm going to test the RAM with that utility today, now that the box is not in use. See if I can either rule out the RAM or know it is the RAM.

I saw another post on here, which I can't find now, directs to http://support.microsoft.com/?kbid=291806 in regard to PFN_LIST_CORRUPT and it indicates item #4: If you receive this error message randomly, or when you try to start a program, remove extra memory or have the random access memory (RAM) in your computer tested. This behavior may occur if you have bad RAM.

BOBi

Commented:
After reading over 1000 minidumps in several forums, I am confident that it is faulty ram. Some faulty RAM can pass memtest. You can try downclocking the RAM or reseating the memory stick in another memory slot.

My previous post
<<<<
I do not find any memory corruption at the dump. The page fault handling routine is a well-tested routine. It crashes only if there is a hardware error or corrupted paging space. I suggest you re-allocate a new paging space to circumvent the problem. After re-allocating the paging space, if the BSOD still occurs, it must be faulty hardware.
>>>>
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
I downloaded and installed the free doc memory utility on my windows 2000 server. I created a boot floppy and it brings up the screen, but at the bottom has this error:
 
run-time error M6103: MATH
- floating-point error: divide by 0

don't know if that's yet another problem with my machine or their application. I wrote them.

Commented:
I'm not saying it's not bad RAM, however, I disagree completely with Microsoft and other software people who will blame another reputable company rather than explain the exact problem.

Microsoft cannot explain the exact problem because Microsoft does not employ people with hardware design engineering experience, nor education.

There are a number of ways in which a simple technician can blame the RAM, replace it with a different RAM, and see the problem go away, but with one catch, the real problem was never solved because the new RAM is simply masking the real problem.  RAM will be indicated, but will not be the cause when:

01.)  The RAM is faster and requires more power than cheaper RAM [cause: insufficient Power Supply]
02.)  Memory tests will fail if not for the specific configuration [cause: bad memtest]
03.)  One RAM fails while another doesn't [cause:  wrong type RAM]

and many, many more.  Unless Microsoft and software people can state, unequivocally, and in full detail right down to the substrate issue, and show the proving timing study diagram, that they know the reason and can repeat the failure, every time, they have not found the failure, period.  They are simply "passing the buck."  I see this all the time at their support site; like the "small timing problem" statement; that statement is obviously vague and disingenuous because they do not want to explain what they mean by a "small timing problem."

Nor does their partner, Intel.

As an engineer, one who has designed computers, even those upon which all of Intel's microprocessors are based, I know that either what appears to be a hardware problem is actually an "uninformed software" problem, or what appears to be a software problem is an "uninformed hardware" problem.  The approach of the engineer is to exactly identify the cause, not to simply make a replacement that works in some percentage of cases; not even 99% of all cases.  You don't want a levee that works 99% of the time, but fails devastatingly on the 100th time.  It would be the same requirement for something like a Stealth aircraft, or a commercial jetliner; a guess is as good as a mile and can be shown to be catastrophically costly.

We who do such analyses differ only in the level of acceptance of the cause of a failure; my particular training and responsibilities have required that I be absolutely 100% sure of the cause and can identify it and repeat it in any demonstration.  300 lives or 30,000 lives may depend on it.

As professional engineers, we are taught to ignore requests by administrative and business interests for a "quickie" solution or a "patch" and to come up with the real cause.  Personally, I've seen tens of thousands of dumps and have analysed all of them.  Both on mainframes and microprocessors.  Real dumps printed out on 132 column paper, some as much as a foot thick.  I've replicated the failures using various techniques and equipment, including Biomation oscilloscopes in the gigahertz range and Logic Analyzers with multiple traces, 16, 32, whatever.  I have done Timing Analyses Reports thereafter for Time Studies to report to the hardware manufacturer the exact cause of a failure, including many microprocessors and other chips which show more than a 1% failure rate.  That is the general level of considering a recall and a "bad production run."  The same applies for motherboards and other printed circuits.  In 99% of those cases, it is a substandard power configuration and/or a substandard timing specification.

The actual failure of memory is only attributable to that memory hardware in less than one in one million cases; it is nearly always the "skimping" by other manufacturers of motherboards and other devices for the sake of reducing their costs and thus selling a less than quality product.

And they can blame the RAM because there are so few people actually knowledgeable enough to identify the exact cause.  While the RAM manufacturer can afford to replace it for free, even when they know that some of their "technician" customers have not used grounding properly, and home users simply don't know how to do that, and that the RAM was not the original fault, they are more than willing to simply satisfy their customer with a replacement.  There is a reason that the RAM is sent back to them; they have a whole department that will apply engineering techniques to find the real cause and not simply a guess.

The dumps show that there was memory corruption and an invalid address; in fact, this is clearly stated by the dumps themselves!  I have no idea how anyone could miss this, or not understand it in clear, plain, English.

Why would anyone not understand that :

0x0000001e  (0xc0000005, 0xa003ee9f, 0x00000000, 0x00000001)
Exception   (No Access,  Address,    Read,       Index)
"An attempt was made to access a pageable (or completely invalid) address . . ."

means what it says?  The address itself was invalid; the "address" part is gotten from memory, and that memory was corrupted somewhere along the line; "C" or "Charlie" is a device, and "5" is Access Denied, a general protection fault.  It basically says that "You are not allowed to read that address."  Calculate that address in decimal and you will see why:

0xa003ee9f translated is 9FA003EE = 2678064110

which address was a result of the memory in a Virtual Address Space [0xc0000005] which means it came from a device.  Now you can treat RAM as a device, but this is a very, very, bad programming idea.  It will inevitably conflict with an address and/or itself because RAM addresses are real addresses.

The second and major dump provided more detail:

0x0000000A (0xC0074EE4,  0x00000002, 0x00000000, 0x80443637)
IRQL>1     (Virtual Mem, IRQL = 2,   Read,       Address)
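
For readers following along, here is a minimal Python sketch that labels the four parameters of the two bugchecks quoted in this thread. The parameter descriptions follow Microsoft's published bug check reference as best understood here; the table and the `decode` function are hypothetical names for this illustration, and only these two codes are covered.

```python
# Parameter labels for the two bugcheck codes seen in this thread.
# Assumption: labels paraphrase Microsoft's bug check reference.
BUGCHECKS = {
    0x0A: ("IRQL_NOT_LESS_OR_EQUAL", (
        "memory address referenced",
        "IRQL at time of reference",
        "access type (0 = read, 1 = write)",
        "address of the instruction that referenced the memory",
    )),
    0x1E: ("KMODE_EXCEPTION_NOT_HANDLED", (
        "exception code",
        "address where the exception occurred",
        "exception parameter 0",
        "exception parameter 1",
    )),
}

# Only the one status code that appears in the dumps above.
NTSTATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def decode(code, params):
    """Return a readable multi-line description of a bugcheck."""
    name, labels = BUGCHECKS[code]
    lines = [f"0x{code:08X} {name}"]
    for index, (label, value) in enumerate(zip(labels, params), start=1):
        note = ""
        if code == 0x1E and index == 1 and value in NTSTATUS:
            note = f" ({NTSTATUS[value]})"
        lines.append(f"  param {index}: 0x{value:08X}  {label}{note}")
    return "\n".join(lines)


print(decode(0x1E, (0xC0000005, 0xA003EE9F, 0x00000000, 0x00000001)))
print(decode(0x0A, (0xC0074EE4, 0x00000002, 0x00000000, 0x80443637)))
```

Under this reading, 0xC0000005 in the first dump is a status code rather than an address, and the 2 in the second dump is an IRQL value.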

The Microsoft debugger said, and I quote:

"An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses."

What part of "pageable" does one not understand to mean "pagefile?"  The pagefile is on disk, not in RAM, although it may have failed during a transfer to RAM.  Again, "0xC" means, a "device," such as a hard disk; it does not mean RAM.

And:

"BugCheck A, {c0074ee4, 2, 0, 80443637}
Probably caused by : memory_corruption ( nt!MiFreeMdlTracker+cb )"

when coupled with the above "This is usually caused by drivers using improper addresses" suggests that the "driver" may have miscalculated the address in tracking the Free Memory in the disk cache, i.e., the pagefile of pagefile.sys

So, Microsoft says it's in the pagefile and it was probably caused by a driver.  How does that statement translate, in any way, to affirming the statement "I do not find any memory corruption at the dump.(?)"

When Microsoft's first statement on the dump was "Probably caused by : memory_corruption ( nt!MiFreeMdlTracker+cb )[?]"

Perhaps because the author of the debugger considers a bad dword to be the result of a bad pointer in the Plug and Play driver [IRQL and/or IRQ on 9/2, the PLC used for Plug and Play] that resulted in a bad fetch [Access Denied] from a device [0XC0000005] to be, in and of itself, "memory corruption."

Getting through all of that technical jargon, it still boils down to an "Invalid Address."

That can happen for only one of two reasons:

01.)  The driver did not have sufficient permission to access this area of memory
02.)  The value found at this memory address or the pointer to it was invalid.

As of a few years ago, Intel, AMD, and others, incorporated something into their hardware which tells the software if the data at any address is valid, that is, "an initialized operand" with a tag field in the data itself called the "Data Valid Bit."  If this bit is not set, then the data, usually an array, has never been initialized.  That means that the expected values have never been stored there.  If that area happens to be an array of pointers [an index table, such as would exist when any memory, disk cache or RAM, is indexed by an array of pointers] as opposed to actual data or code, then the returned error will be "Invalid Address."  Microsoft, apparently, considers this to be called "memory corruption."

If IRQL>1, that is, 2 or above, this is the same as "Lexicographical Level" or "Control Mode" or "Supervisory Mode" outside of the Operating System whose Layer 0 and Layer 1 Routines "Only" are allowed to access this device, i.e., make a direct call to the disk device.  A driver is not allowed to access the disk directly; this can only be done by the Operating System itself, and that only at the most protected kernel area.

If a driver is written in such a way that it attempts to bypass this protection, it will get a "General Protection Fault."

Now, surrounding all of the error in the dump and debugger analysis are:

bff83000 bff98180   atapi    atapi.sys    Tue Apr 01 13:08:25 2003 (3E89D599)
bff99000 bffba9c0   dmio     dmio.sys     Wed Jan 15 14:47:04 2003 (3E25BAB8)
bffbb000 bffd75a0   ftdisk   ftdisk.sys   Thu Dec 02 22:29:58 2004 (41AFDDB6)
bffd8000 bffffc20   ACPI     ACPI.sys     Wed Jan 15 14:44:22 2003 (3E25BA16)
Note the gaping memory hole here from C0000000 to near F0000000, the usual BIOS cache areas
f6400000 f640e6a0   pci      pci.sys      Wed Jan 15 14:44:07 2003 (3E25BA07)
f6410000 f641b680   isapnp   isapnp.sys   Wed Jan 15 14:43:47 2003 (3E25B9F3)

as I outlined in the analysis.  If you look at the overall picture, it is all about the atapi disk driver [atapi.sys], the DMA I/O Controller [dmio.sys], ftdisk, the ACPI system, the PCI system, and the isapnp Plug and Play driver.
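
As a cross-check, the module listing can be parsed mechanically to see which driver, if any, contains a given address. The following is a rough Python sketch: it assumes each line has the form "start end name image ...", as in the six lines quoted above, and `parse_modules` and `find_module` are hypothetical helper names.

```python
import re

# The six module lines quoted above (an excerpt of the full list).
MODULE_LINES = """\
bff83000 bff98180   atapi    atapi.sys    Tue Apr 01 13:08:25 2003 (3E89D599)
bff99000 bffba9c0   dmio     dmio.sys     Wed Jan 15 14:47:04 2003 (3E25BAB8)
bffbb000 bffd75a0   ftdisk   ftdisk.sys   Thu Dec 02 22:29:58 2004 (41AFDDB6)
bffd8000 bffffc20   ACPI     ACPI.sys     Wed Jan 15 14:44:22 2003 (3E25BA16)
f6400000 f640e6a0   pci      pci.sys      Wed Jan 15 14:44:07 2003 (3E25BA07)
f6410000 f641b680   isapnp   isapnp.sys   Wed Jan 15 14:43:47 2003 (3E25B9F3)
"""

LINE_RE = re.compile(r"^([0-9a-fA-F]{8})\s+([0-9a-fA-F]{8})\s+(\S+)\s+(\S+)")


def parse_modules(text):
    """Return (start, end, name, image) tuples; non-matching lines are skipped."""
    modules = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            start, end, name, image = match.groups()
            modules.append((int(start, 16), int(end, 16), name, image))
    return modules


def find_module(modules, address):
    """Return the image whose [start, end) range contains address, else None."""
    for start, end, _name, image in modules:
        if start <= address < end:
            return image
    return None


modules = parse_modules(MODULE_LINES)
for address in (0xC0074EE4, 0x80443637):
    print(f"0x{address:08X} -> {find_module(modules, address)}")
```

With only these six lines as input, neither address from the 0x0000000A bugcheck falls inside a listed module. The listing is an excerpt, so that says nothing about modules that are not shown.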

Disk, Plug and Play, the PCI buss, the ISA buss, the DMA buss, and the ACPI Virtual Memory Devices area.

The only association with RAM is the DMA Controller driver part of dmio.sys

At the time of the failure, smtp.exe "simple mail transfer protocol" was executing a thread.  You said that the server worked for your database server, but not for your web server.  I am not sure of any relevance of smtp running during this error, however, it is evident that a web server handles email, moreso than a database server, so there is a possibility that smtp.exe is only called when the server is a web server, and not when it is a database server.

PFN_LIST_CORRUPT : what is it?

Page Frame Number

A Descriptor List of pointers to other things, like paged memory areas, some organized in frames.

Again, this type of "memory corruption" is the result of a bad address; usually, in a chaining of index calls such as a pointer that points to another pointer, and so on, that finally results in the access to a data or code segment.

There are two oddities in all of this:

01.)  Compiled drivers often call an index to another index which fetches a Matrix Array Datum
02.)  SCSI, PCI, and DMA busses often misindex one of the two pointer indexes.

The current "too fast" problem of these busses.  You have an operator which does two indexes simultaneously.  For example Vector Index EAX, EBX, [dY, dX], n, m

Which says Index the Vector Matrix XY by Xsub0=m, Ysub0=n.  This is your machine doing Differential Calculus at the hardware level!

The points, in space, are Xm and Yn in two-Space using Sigma Sum in Newton's Method of Approximation.  This is an extremely fast method of rendering, often seen in high resolution motional video graphics.

The problem is, the indices may never be zero simultaneously, a fact both in Physics and in computer software.  Why?  Because f[t]=0,0 is the Origin and the Origin is reserved for Base Descriptors only.

What that means in English:  The mother and father of all pointers is here at 0,0 and you are not allowed to access them and use them as code or data - you get an Invalid Address.  Certainly, no level 2 or above driver may ever even access them.

Why they get accessed by error:

The buss is not initialized, that is, 7 clocks have not been issued in order to get the actual address to the DMA, SCSI, or PCI buss!

The solution:

Add 7 no ops to the driver software at the beginning to delay the driver routine long enough for the buss to initialize.  7 no ops are effectively 7 clocks.

And that's what we mean by "the machine is too fast."

This is the third possibility between your failures and the two machines; maybe the other server is slower in its clock rate, or the buss architecture is different, i.e., it takes longer for the pointers [addresses] to arrive so that when they get there the buss is already initialized.

This happens because the DMA in 64-bit machines does not have to diddle 64-bit addresses down into two 32-bit parts, so, the 64-bit address arrives nearly instantly, while the fake 64 [as a two or more clocked pair of 32-bit address parts] on a 32-bit machine takes longer to arrive.

And that can cause RAM to be read before it has an address, thus, something tries to effectively read address 0 and that is not allowed by any program other than the very basic kernel Operating System software.

This is a "known" Intel and other microprocessor problem, as that "small timing problem" that Microsoft refers to as such.  And it is a software problem, not a hardware problem.  The hardware is simply faster than the authored software that tries to implement the vector indirect addressing without consideration to proper timing and initialization.

With two processors, and up to four DMA busses each, eight fetches on eight addresses can occur simultaneously.  And this can continue in a stream.  If the code in the calling function cannot handle 8 words simultaneously, or four double precision words simultaneously, then invalid or corrupt data can be the result.

I know I have been very technical, but I have tried to be simple and say that it is possible the board you have is "too fast" and all of this is what I meant by that summary statement in layman's terms.

I have conceded that it "may" be RAM, but I will not say that it is "absolutely RAM."  And that, even if changing the RAM fixes it because that may only have changed the characteristics of the machine and still not have solved the problem permanently.

It is more time efficient to just swap out the RAM, agreed, but there will be those cases where this simply changes the failure from once a day to perhaps once a week, or whatever, and therein lies the danger: a system that fails inexplicably from time to time because it was never truly fixed.

A quick fix or patch was simply added to make it appear to perform better.

As I've said before, I have had this "go round" with Intel and Microsoft about their Plug and Play, and warned Intel that it would come back to haunt them.

BOBi, switching the RAM at this point will be a good test.  If it doesn't work, you'll have to consider the other possibilities.

Again, I have taken this time to write at length because this problem is cropping up all over 64-bit machines; it's not just isolated to a few RAM problems or any one motherboard, it is growing, like a cancerous monstrosity.  I am including these writings in my research for a book and posting on my site.  I realise that my responses may be lengthy and technical, but I also hope I've included enough simple terminology to break it all down to the average person.  I hope it makes thoughtful reading.

If you're going to build a 64-bit machine and write 64-bit capable programming, then the whole machine must comply with 64-bit architecture in every minute detail, and that means:

A 64-bit IRQ Buss!

Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
Bottomline, after having numerous errors for two months, including this one, I finally downloaded an app to test the RAM. It was corrupted. Replaced it. Fixed all the numerous errors I had been getting and is running like a dream since.

So even if someone tells you not to mess with the ram and hardware until you've tried a jillion other things, please download a RAM testing app at least and test it. I didn't know it existed until I ran across it.

I accepted answers that weren't the answer simply to clean up the questions that had been around so long. The answer was bad RAM.

BOBi

Commented:
Hi Bob,

Refer to my post on 16 Sept
>>
After reading over 1000 minidumps in several forums, I am confident that it is faulty ram.
<<
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
then I owe you too...

I apologize for not giving you credit on this and will see if I can remedy it.

I wish someone had told me that I could test the RAM. Maybe they did and I hope no one else goes through what I went through.

I'm learning that most of the time the simplest solution is the solution. I don't have to read a book to get to the answer. :)

BOBi

Commented:
Hi Bob,

Refer to my post on 21 Aug.    

>>
The failing routine is the W2K MiDispatchFault, which is the W2K task dispatcher. It is unlikely to fail, as millions of users are using this routine daily. It does not fail unless there is a hardware error.

My preliminary finding: I am very sure that it is a hardware problem. Most likely it is faulty RAM. Based upon my past record, I rate the possibility of the error as 70% RAM, 20% CPU and 10% M/B. As hardware errors occur randomly, if W2K keeps on crashing with different bugcheck codes, it is a symptom of a hardware error. If W2K always crashes at the same instruction address and bugcheck code, it is probably a software error.
<<
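
The rule of thumb in that quote can be written down as a small Python sketch. It is a heuristic only, not a diagnosis: the function name `classify_crashes` and the three result strings are made up for this illustration, and the crash list uses the bugcheck codes mentioned in this thread, with `None` where the faulting address was not posted.

```python
from typing import List, Optional, Tuple

Crash = Tuple[int, Optional[int]]  # (bugcheck code, faulting address or None)


def classify_crashes(crashes: List[Crash]) -> str:
    """Apply the 'same crash vs. varying crashes' rule of thumb."""
    if len(crashes) < 2:
        return "not enough data"
    # Every crash has the same code and the same address: points at software.
    if len(set(crashes)) == 1:
        return "probably software"
    # Several different bugcheck codes: points at hardware.
    if len({code for code, _address in crashes}) > 1:
        return "probably hardware"
    # Same code but different addresses: the rule does not decide.
    return "inconclusive"


# Bugchecks reported on the web server in this thread.
web_server = [
    (0x1E, 0xA003EE9F),
    (0x0A, 0x80443637),
    (0x1A, None),
    (0x4E, None),
]
print(classify_crashes(web_server))
```

Four different bugcheck codes on one machine land in the "probably hardware" branch, which is consistent with the 70% RAM estimate in the quote but does not by itself say which component is at fault.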


Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
I'm unsure why you sent me this last reference? There is nothing there that tells me I can test the RAM or how?

Commented:
You were more interested in GinEric's findings than in mine, and this is why I didn't propose testing the RAM.

Commented:
Another reason I don't suggest using memtest is that memtest is not a reliable tool. I prefer to reseat the memory sticks or take one out at a time to diagnose which memory stick is faulty.  Refer to the comment of the problem owner in the following question:
https://www.experts-exchange.com/Operating_Systems/WinXP/Q_21505124.html
Starr Duskk, ASP.NET VB.NET Developer

Author

Commented:
You are right. I did give credence to his responses. And I do apologize. I didn't want it to be RAM. I wanted it to be something I could fix without taking the machine down and the RAM was brand new so I just didn't believe it.

But I would have loved to learn how to test the RAM if anyone would have told me. I asked on 8/21:

>>On startup, doesn't the system check the RAM? It doesn't indicate it is bad. Wouldn't it? It's a gigabyte of dual channel RAM. How can we verify whether it's good or bad?

But I didn't get an answer. I would have gladly tested the RAM if I'd known how.

On 8/21 I also said:

"I'm thinking now:

I check the BIOS and see what it is.... without changing it...
I change the BIOS....
I replace the RAM."

I was then told:

"There are only a couple of more steps in determining if it is an actual hardware problem before touching the hardware."

and

"So, maybe there is a serious hardware conflict, but it's not as simple as bad RAM, more like a bad choice for a production server motherboard"

and

"06.)  Run chkdsk, scandisk, and defrag one last time

If the crash on memory error recurs, you have narrowed it down to either a bad disk or bad RAM, but have not eliminated the timing problem.
"

I did all this and the problem never went away... so I did finally find something to test the RAM. But I was thrown so many things.... like bad mother board, timing problem...

It wasn't a motherboard, it wasn't a timing problem... it was simply BAD RAM as you said in this post and another guy said previously in another post.

And on and on and on.... as you know.

Again, I apologize and it was RAM. In future perhaps answers could include how to test the RAM as well so newbies know it's possible without taking the entire system down and replacing brand new RAM.

BOBi


Commented:
Not only do some dealers simply take another customer's problem, a bad RAM stick, and put it in other customer's repairs, some people don't know what speed RAM to put in boards, while others don't know not to handle RAM without being grounded.

You replaced the RAM and it works; that's good.  For your case maybe it is RAM, but maybe too you don't know why it is RAM.

No one needs to apologize here, and no one needs to take a superior attitude about quitting a question just because some other expert is involved or someone is following another expert's advice.

Haven't you noticed BOBi that you have been as skillfully turned against a person or persons as adeptly as any retired KGB officer can do so?

There were only a couple of more steps, but you had already jumped into the hardware.  That's good, you got it working.  However, I will not tolerate cliquism, rule by committees, and communistic tactics at any level aimed against me and see right through some people's attitudes toward me; not my advice, but towards me, a person they have never met.

I have tried to be nice to one person, and have tried to handle that person with kid gloves, however, after a bit of competition for points that are meaningless, at least to me, there seems to be no hope of my American manners being reciprocated.  That's about the point when I quit my course and leave such person to their own self-importance.

I get enough attacks against my servers as it is without socializing with people tending to alienate me.  And I don't have time for petty politics and who's better than whom on a free question forum.

Notice that my last rule said it was either disk or RAM, as you restated above, and it turned out to be the second.

I don't care at all about points, since I'm doing this for work on a book.  However, I do care about "my good name" as Shakespeare put it.  I do know when I've been baited, or looked down upon from someone's nose, etc., and I seriously try to avoid reacting to such tomes simply because the man with real class rises above such petty things.  But there is a point at which one must tell one's peers to proceed no further - apologizing to one who seeks one-upmanship or acquiescing to a gestalt opinion of one other is nothing less than antisocial behavior, that is, ganging up on someone.  It seems to begin a lot with someone saying someone else is wrong, or whatever, and escalates from there.  I'm not going to allow any person, and they know who they are, to steer opinion against my character and toward their political agenda of character assassination.  I know my politics and psychology far too well to allow that to happen.

Don't be misled into bowing before a false prophet, tin god, or any other politician; you've no need to apologize to anyone.

Good job with the RAM

And some folks need to get off of GinEric's case with the innuendos and hidden discussions about him; it's just plain unprofessional.

Besides, he's in the music business which makes politics, spying, and IT look like kindergarten; he knows when there are darts aimed at his back.


