Question

bugcheck stop 0x0000000a on windows 2000 web server

Asked by: BobCSD

This kind of goes back to an issue I was dealing with before:
http://www.experts-exchange.com/Operating_Systems/Q_21531853.html

GinEric was helping out there...

GinEric,

I just got this memory dump error. It is not the same, but the knowledge base indicates it might have to do with a timing problem:

Event Type:      Information
Event Source:      Save Dump
Event Category:      None
Event ID:      1001
Date:            8/19/2005
Time:            11:57:02 PM
User:            N/A
Computer:      SSWEB
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x0000000a (0xc0074ee4, 0x00000002, 0x00000000, 0x80443637). Microsoft Windows 2000 [v15.2195]. A dump was saved in: C:\WINNT\MEMORY.DMP.
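The four values in parentheses in the event text are the bugcheck parameters. As a rough sketch, they can be pulled out and labeled like this; the labels follow the debugger's standard description of stop 0xA, and the helper name `parse_bugcheck` is only illustrative.

```python
import re

# Parameter meanings for stop 0x0000000A (IRQL_NOT_LESS_OR_EQUAL), as the
# debugger describes them for this stop code.
STOP_0A_ARGS = (
    "memory referenced",
    "IRQL",
    "0 = read operation, 1 = write operation",
    "address which referenced memory",
)


def parse_bugcheck(description):
    """Return (stop_code, [arg1..arg4]) as integers from a Save Dump event text."""
    match = re.search(
        r"bugcheck was:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)", description
    )
    if match is None:
        raise ValueError("no bugcheck found in description")
    code = int(match.group(1), 16)
    args = [int(part.strip(), 16) for part in match.group(2).split(",")]
    return code, args


event_text = (
    "The computer has rebooted from a bugcheck.  The bugcheck was: "
    "0x0000000a (0xc0074ee4, 0x00000002, 0x00000000, 0x80443637). "
    "Microsoft Windows 2000 [v15.2195]. A dump was saved in: C:\\WINNT\\MEMORY.DMP."
)

code, args = parse_bugcheck(event_text)
print(f"STOP 0x{code:08X}")
for label, value in zip(STOP_0A_ARGS, args):
    print(f"  0x{value:08x}  {label}")
```

Read this way, the event says a read of address c0074ee4 was attempted at IRQL 2 by the instruction at 80443637.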

http://support.microsoft.com/default.aspx?scid=kb;en-us;286362
>>This problem is caused by a small timing problem that can cause a null pointer to be referenced.

It says to fix it, obtain the latest service pack. I have the latest service pack and that is what includes the rollup that maybe caused it in the first place.

Another knowledge base article talks about a virus. But that was an NT server, and I already checked for those virus files and there are none. I also checked the registry for the virus files and they weren't there.

I'm thinking this is a timing issue as you already indicated.

I am uploading the file to your ftp site, but it wouldn't grant me permission to create a directory.

It says it will take 1 hour and 40 minutes. The file is much smaller, so it must be the kernel file. Plus it didn't take 15 minutes to reboot as in the past. It was only a few minutes.

If you don't feel this is related to the issue you wanted to look at, don't feel obligated to help. I understand.

thanks!

Bobi


This question has been solved and verified by the asker.

Asked On: 2005-08-19 at 22:30:03
ID: 21533763
Tags: 0x0000000a, bugcheck
Topic: Operating Systems Miscellaneous
Participating Experts: 2
Points: 500
Comments: 67

Answers

 

by: cpc2004 Posted on 2005-08-19 at 23:40:11 ID: 14714663

You had better install windbg and attach the analysis report here, so that all the experts can help you find the root cause.

Debugging Tools from Microsoft
1) Download and install the debugging tools from http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx
2) Locate your latest memory.dmp file: C:\winnt\memory.dmp
3) Invoke windbg
4) File --> Open Crash Dump --> C:\winnt\memory.dmp

kd> .logopen c:\debuglog.txt
kd> .sympath srv*c:\symbols*http://msdl.microsoft.com/download/symbols
kd> .reload;!analyze -v;r;kv;lmnt;.logclose;q
5) You now have a debuglog.txt in c:\; open it in Notepad and post the content here.
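The steps above can also be run without the GUI by using the console debugger, cdb, which ships in the same Debugging Tools package. The following is only a sketch: the helper name `build_cdb_command` is made up for illustration, and it assumes `cdb.exe` is on the PATH. It builds the command line and does not run anything.

```python
def build_cdb_command(dump_path, symbol_path, log_path, cdb_exe="cdb.exe"):
    """Build an argument list that runs the analysis commands against a dump."""
    # Same command chain as in the kd> session, minus .logclose because the
    # log is opened from the command line here.
    commands = ".reload;!analyze -v;r;kv;lmnt;q"
    return [
        cdb_exe,
        "-z", dump_path,     # open a crash dump file
        "-y", symbol_path,   # symbol path, same value as given to .sympath
        "-logo", log_path,   # write the session to a new log file
        "-c", commands,      # commands to execute once the dump is loaded
    ]


cmd = build_cdb_command(
    r"C:\winnt\memory.dmp",
    r"srv*c:\symbols*http://msdl.microsoft.com/download/symbols",
    r"c:\debuglog.txt",
)
print(cmd)
```

On a machine with the tools installed, the list could be handed to `subprocess.run(cmd, check=True)`; the log file then holds the same report to post here.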

 

by: GinEric Posted on 2005-08-20 at 05:03:56 ID: 14715378

Interesting, quite a large dump, at 410 MB.

Well, I take it that that is your dump, BobCSD.

It's been moved to /uploads/BobCSD/

and the analysis is at /uploads/BobCSD/Analysis/

or use this link:

ftp://guest@musics.com/uploads/dumps/BobCSD/Analysis/BobCSD.memory.dmp.analysis.html

While the kernel dump would have sufficed, a production server probably should have full memory dump turned on, just in case it's one of those hairy intermittent problems that require such extensive dumps.

At first extrapolation, it's either a BIOS caching problem in area C0000000, or, the loading was such that memory and resources were simply overtaxed [unlikely in new boards].

If you had a problem creating a directory, as I saw that "New Folder" was created, while the dump was uploaded to the base guest directory, let me know.  I moved them to the directory above.  I also deleted the dump file, 410 MB, because it was no longer necessary after the debugging analysis.

Let me know if disabling the BIOS cache in this area solves this problem.

 

by: GinEric Posted on 2005-08-20 at 05:13:38 ID: 14715407

And thank you for finding that lost link to the timing problem; yes, that is the timing problem as stated by Intel and Microsoft, and solved by Linux Programmers!  hehehe

Now I can complete the write up on general "Twilight Zone" crash dumps.

Actually, you did create a directory, "New Folder", which I changed.  If you try again and can't create a directory, let me know; I'll have to change the strict permissions.

This problem seems more like a different problem though, one having to do with cache and page files.  Because both backup and antivirus started at 4:00 A.M., it is very likely that you were sucking up all available memory, running at an unusual pace and cpu time percentage, and that perhaps backup collided with antivirus in the storage of the dif files for your backup, or antivirus may have added changed files, locked a changed file, or whatever, and the two could not get along; backup and antivirus.

Who knows, they may have both been using the same memory area and one of them went off.

One question: is that board a dual processor?

 

by: GinEric Posted on 2005-08-20 at 05:18:06 ID: 14715416

Oh, addendum: it should have taken about 6 minutes or less to upload the entire 410 MB; can you tell me how long you think it actually took?

This is for reference with our pipe provider; I need some feedback on the actual speed experienced.

It took me 6 minutes to upload it to another server; which I consider a little long, but it depends on the client in most cases, not the server.

The dump was analyzed on a Windows Server.

 

by: cpc2004 Posted on 2005-08-20 at 05:50:09 ID: 14715473

Hi GinEric,

Your analysis report is incomplete. Can you provide the output of the following commands?
lm tn
!thread
!process
u 80443620 l20
r

Thanks
cpc2004

 

by: BobCSD Posted on 2005-08-20 at 12:16:03 ID: 14716533

GinEric,

>>Let me know if disabling the BIOS cache in this area solves this problem.

So you're saying I should disable BIOS. How do I do that? What are the ramifications of this?

>>Because both backup and antivirus started at 4:00 A.M.,

That was the other day. Since then I had uninstalled the antivirus... trying to determine if it was the culprit. So there was nothing running last night when THIS occurred. It was at 11:57 PMish and wasn't during a backup either.

>>One question: is that board a dual processor?

I'll have to get back to you on that. It's an Abit Fatal1ty AA8XE. I searched the web to find out, but just don't know what to look for. It says "Designed for Intel® 90nm Pentium 4 LGA775 processors."

>>Oh, addendum: it should have taken about 6 minutes or less to upload the entire 410 MB; can you tell me how long you think it actually took?

It was over an hour... maybe two. I can send you something else sometime to know for certain and actually keep track of the time, if you like. I'm on the equivalent of a T1 line. It's cable though.
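For comparing the figures quoted so far (410 MB, 6 minutes, the 1 hour 40 minute estimate, a T1-class line), a quick back-of-the-envelope calculation may help. This is only a sketch: it takes 1 MB as 10**6 bytes, assumes the nominal T1 rate of 1.544 Mbit/s, and ignores protocol overhead.

```python
def throughput_mbit_per_s(size_mb, seconds):
    """Average rate in megabits per second (1 MB taken as 10**6 bytes)."""
    return size_mb * 8 / seconds


def transfer_minutes(size_mb, rate_mbit_per_s):
    """Minutes needed to move size_mb at a steady rate."""
    return size_mb * 8 / rate_mbit_per_s / 60


DUMP_MB = 410
T1_MBIT = 1.544  # nominal T1 line rate; an assumption used only for comparison

print(f"6 min transfer   -> {throughput_mbit_per_s(DUMP_MB, 6 * 60):.1f} Mbit/s")
print(f"100 min transfer -> {throughput_mbit_per_s(DUMP_MB, 100 * 60):.2f} Mbit/s")
print(f"410 MB at T1     -> {transfer_minutes(DUMP_MB, T1_MBIT):.0f} min")
```

A 6 minute transfer needs several times the T1 rate, so an upload over a T1-class uplink would be expected to take well over half an hour even under ideal conditions.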

I did have the memory set to do a kernel dump, but I hadn't rebooted since changing it. I was waiting for the site to get less busy in the wee hours of the morning. But then the machine decided to reboot itself and save me the trouble, I guess. Nice machine. ;)

Truthfully, if you look at my history, I have had one major problem after another with this server. It is a recent build. I'm seriously considering putting at least the web server (the database is on a different newly built box) back on the old SCSI server. There was nothing wrong with it. We just wanted to build two new boxes, remotely move everything over, and get the IPs set up and DNS moved, so that to our users the site move from our colocator to our house would be invisible and the site wouldn't be down for even an hour while we hauled the machines across town and set up the new IP addresses.

But I have had nothing but problems with this box since going live. So I'm thinking a move back is best. Course, my spouse, who built it, thinks I should just fix it, but he's not the one up til 4 AM trying to figure out what is wrong with the thing! (well, truthfully, you guys are figuring it out for me... ;)

So do you think it's salvageable and I should move forward and try to fix it, or just put the data back on the old server and go? It's a few years old, maybe 4. I'd still have to set up the old server with the new IPs (I have tons of unused IPs to use), put the zywall back on and get it set up (it's an extra too, from when we were at the colocator). So it wouldn't be free of problems/issues either. But at least I could get that set up and tested without it being live. This box just never seems to get better. I just don't know. Sigh.

Thanks!

BOBi

 

by: GinEric Posted on 2005-08-20 at 12:27:22 ID: 14716574

It was complete.  Further analysis wasn't necessary until some things had been tried or eliminated.  But, for your benefit, I expanded it:

ftp://guest@musics.com/uploads/dumps/BobCSD/Analysis/BobCSD2.memory.dmp.analysis.html

I'll wait to see if he's checked the cache settings.

And to think, I could get about $150k a year or more for this level of troubleshooting!

:)

 

by: cpc2004 Posted on 2005-08-20 at 12:49:37 ID: 14716632

Hi Bob,

Can you send the minidump to me as well? I want to analyze the dump, as GinEric has not responded to my post.

cpc2004

 

by: BobCSD Posted on 2005-08-20 at 12:56:07 ID: 14716652

>> Suggest checking BIOS caching; if enabled, try disabling it in the C00000000 area.
>>Let me know if disabling the BIOS cache in this area solves this problem.

So you're saying I should disable BIOS. How do I do that? What are the ramifications of this?

>>The annoying and nagging question is about IRQ 2; what was using this special IRQ?  Was it a sound card driver or a video driver?

We checked in the device manager and:
0) system timer
1) standard keyboard
4) communication port

So you see the 2) and 3) are entirely missing... Unless you're starting count at 0, in which case, the standard keyboard is at two.

>>And to think, I could get about $150k a year or more for this level of troubleshooting!

Where do I send the check?

Thanks!

BOBi

 

by: BobCSD Posted on 2005-08-20 at 13:02:01 ID: 14716674

cpc2004,

How do I send the minidump to you?

thanks!

BOBi

 

by: GinEric Posted on 2005-08-20 at 13:04:24 ID: 14716679

The BIOS is what you get to when you hold down something like the Delete key during boot, it should ask you if you want to go to the setup screen.

That's the actual setup for the motherboard.  In there, you'll find all your motherboard's settings; usually somewhere in there you'll find Enable/Disable BIOS Cache for various areas, including the video cache, which I suspect is at C0000000; disable it.

We usually disable all BIOS cache on the motherboard.

As for the board you're using, I don't see that it is its fault; however, for building servers I usually recommend the highest end Tyan you can get for commercial use.

I think you've just hit a flukey problem and with a little understanding you can get it going.  If it is truly "at home," you've got to consider things like residential electricity variations and power outs, heat, environment, and so forth.  Say, you wouldn't put a server in an unairconditioned or unhumidity controlled home environment in places like Florida.

But if it were all that bad, it would be crashing on a regular basis.  Did you have any electrical storms lately?  That'll bring a server down!  Real quick, and it can easily get a BSOD and a dump, before it reboots.

Then again, unless the circuits are conditioned, and if the server happens to be on the same line with other equipment, refrigerators, air conditioners, etc., the glitch when they switch on can bring a server down.

I think you need to study it and monitor it a little while longer though.  Don't just go shut it down to change the BIOS settings, wait and see, if you're there when it goes down, then reboot into BIOS setup and check the cache settings.  Otherwise, send me the newer kernel dumps and we'll get it checked out.  Don't forget to leave contact information, a simple text file, when you upload.

 

by: BobCSD Posted on 2005-08-20 at 13:18:03 ID: 14716714

>>Did you have any electrical storms lately?  That'll bring a server down!  Real quick, and it can easily get a BSOD and a dump, before it reboots.

Uh, last night, we had quite the electrical storm. Really.

But we have them on battery backups, surge protectors, and the house is also on a built-in surge protector device built into the meter, by the electric company for this sole purpose. We have a whole house generator as well, just in case.

It is only the web server that is having problems. The database server, sitting next to it and plugged into the same source, is fine. They do both have different battery backups though. We have had wild electric storms in the past that never caused a reboot as well.

Plus this box has had so many problems lately when there were no storms.

As far as temperature, the front panels have the temperature displayed and they are fine. But no, they're not in a "cold room" but they are air conditioned.

>>if the server happens to be on the same line with other equipment,

This office is on its own circuit breaker and no other equipment is involved.

>>Don't just go shut it down to change the BIOS settings, wait and see, if you're there when it goes down, then reboot into BIOS setup and check the cache settings

Okay. Otherwise, I'll check in the wee hours when the site is not so busy. Saturday nights are very busy.

>>Don't forget to leave contact information, a simple text file, when you upload.

Okay, will do that next time. I couldn't create a folder. It never showed me even the New Folder.

thanks!

BOBi

 

by: GinEric Posted on 2005-08-20 at 16:47:17 ID: 14717174

A couple of things.  IRQ 2 is part of the IRQ 9/2 PLC that Windows and other Operating Systems use for Plug and Play.  So, IRQ 2 is missing from the Device Manager as that IRQ number, but every IRQ above 15 uses it!  If the interrupt was IRQ 2 at the time of the failure, the machine was on a Plug and Play device.  Oddly enough, even parts of the motherboard can be assigned interrupts through this controller.  The operation of a Programmable Logic Controller is the subject of about two college level courses in Electrical, Electronics, or Instrumentation and Control.  But basically it allows for a form of relay and switch design for various buses, other memory controllers, and commands to all sorts of things, like the PCI Bus, Memory Bus, ISA Bus, and so on, to be designed on the fly, programmatically, and then redesigned with something called ladder logic if necessary.  It's how a motherboard can be made to accommodate different hardware onboard and in its slots.  It is transparent to Device Manager, so it won't show up in the list.

Upload the minidump here:

ftp://guest@musics.com/uploads/dumps/BobCSD/Analysis/

You should be able to just copy and paste, or whatever, the minidump as a file from one IE window to another, or, do the ftp via a DOS command prompt, it's up to you.  Some people use programs such as FTP Commander and others to make the ftp experience more user friendly.

I fixed the permissions so that you should have no problem copying and pasting a file into that directory now.

cpc2004, the response to your question is in the analysis directory as the BobCSD2 html file.

While it's harder to determine whether the sound card or the video card were using the 9/2 IRQ, it's easier just to know that the video card uses the area of memory requested, C0000000, and it is likely that if caching was enabled the video card overwrote the C0000000 area so that had the Operating System been using that area for cache, the hardware level access of the video card would have simply overwritten it, thus making the entire area invalid.  Notice that modules are being looked for that do not exist, as if they were simply erased.

This has been one of those problems that began around the beginning of time, which is why I suspect it so strongly.  It's actually delegated to that area formerly known as C000, but because of Big Endian notation, this shows up as C0000000.  If you look at my translation for the call, you'll see that:

804856cc  is  actually  cc564880  inside the hardware, which places it in the C0000000 cache area.  Addressing is split across two RAM cards, so that 0000C000 becomes C0000000 when recombined.

and because of another flukey representation and various designs in logic at the substrate and gating level this is seen as in C0000 area [take the "0" of "0x" and put it at the end, thus, "endian," and 0xC000 becomes address C0000, the video cache area for a 1 megabyte system; add four more places for split card addressing and you arrive at C0000000 with the endian being therefore 0xC0000000].
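For readers who want to see the byte reversal mentioned above, the two orderings of the 32-bit value 804856cc can be reproduced with a few lines of Python. This is only an illustration of how the same number is serialized in little-endian and big-endian order; on its own it says nothing about which memory region is involved.

```python
import struct

value = 0x804856CC

# Little-endian layout: least significant byte first.
little = struct.pack("<I", value)
# Big-endian layout of the same value, for comparison.
big = struct.pack(">I", value)

print(little.hex())
print(big.hex())

# Reading the little-endian bytes back yields the original number.
(roundtrip,) = struct.unpack("<I", little)
print(hex(roundtrip))
```

The byte order changes only how the four bytes are laid out; the number read back is the same 804856cc.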

And I haven't even touched on Associative Memory and Associative Memory Addressing yet.

These are all design tricks of the computer design engineers who actually design the logic that is a computer.

It's just much simpler for me to suggest checking one area at a time, than to lecture on doctoral thesis computer design and research, which, at this point, add up to numerous volumes.

Basically, to the machine hardware, addresses C000 and 0xC0000000 are the same area because of how the gates are arranged!

Hardly any programmers in the world know this, and thus, there is no software that handles it properly.  Which is why nearly all documentation tells you to disable this BIOS caching feature.

Again, hold down the delete key [or whatever your computer manufacturer tells you to do] to get into the motherboard BIOS setup.

There are only a handful of exceptions.

And I think this thread deserves to be in one of the books now.

cpc2004

I've moved the memory dump back to:

ftp://guest@musics.com/uploads/dumps/BobCSD/

where you can download it.

Please let me know also how long it takes.  I have some concerns about bandwidth that I'd like to resolve.

 

by: GinEric Posted on 2005-08-21 at 06:28:45 ID: 14718729

Addendum:

You had previously asked about the serial bus, saying that you [comm port, IRQ 4] wondered something about it.

Well, usually, the UPS system is on a serial connection to the comm port [how this works:  the electric company has grid signals way ahead of the substation that can send you a signal that the substation has gone down; this is possible because it takes physical time for the electricity to stop while the tripped substation signal gets to you at the Speed of Light, you have something like a millisecond or so to switch to UPS].  During that time, power and thus current and voltage slowly slew toward zero, meaning, it drops from normal voltage down to a voltage where things don't work over time; it's not instantaneous when power goes out, although it seems so.  The timing is close; you can actually get to the area where things start to flake before the UPS takes over, leading to very intermittent and seemingly unexplainable errors.  In other words, UPS is not a perfectly functioning system.

So that one computer will not be affected, while the one next to it will, in rare circumstances.

You seem to have a very professional setup at home, commendable.  I would, however, consider looking into an Isolation Transformer.  Places in Connecticut, Chicago, and Philadelphia make very good ones.  They're not really very expensive.

Funny, have had tons of really bad electrical storms here too.  Which is why I asked; we had one of those dips that causes interruption, without the servers going down, but apparently led to a stagnant state.  When I say bad, I mean I saw an 18 wheeler next day with two substation transformers on it, and about 40 block transformers; 13,600 Volt substations, and whatever [4600 or 2300] block transformers.  A sure indication that a lot of stuff went out in the city!

Must have been quite a lightning bolt.

Silly question probably, but are you running a Windows Web Server or Apache Web Server?


 

by: cpc2004 Posted on 2005-08-21 at 06:51:37 ID: 14718834

Hi GinEric.

Thanks for your in-depth analysis. It took almost 5 hours to download the dump, which is much longer than I expected. This is actually the fourth time I have downloaded a dump of more than 400MB; the last three completed within an hour. After analyzing the dump, my analysis report is different from yours. I think you mixed up IRQ and IRQL. IRQL is software and IRQ is hardware. IRQL 2 is the dispatch IRQL and is not related to IRQ 2.

I decoded the instructions backward from the failing one and found that the failure was caused by register esi holding the value 1d3b9440. I don't think the problem is related to caching at address C0000000. Unless I decode the instructions further back, I can't tell why esi was loaded with an invalid value.

0: kd> .trap bb444770
ErrCode = 00000000
eax=00000001 ebx=800656b0 ecx=c0074ee4 edx=00000001 esi=1d3b9440 edi=c03a7720
eip=80443637 esp=bb4447e4 ebp=bb444804 iopl=0         nv up ei ng nz na po cy
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010287
nt!MiDispatchFault+0x53:
80443637 f60101           test    byte ptr [ecx],0x1      ds:0023:c0074ee4=??
nt!MiDispatchFault+0x1c:
80443600 0000             add     [eax],al
80443602 6a02             push    0x2
80443604 59               pop     ecx
80443605 ffd3             call    ebx                     call KeAcquireQueueSpinLock
80443607 81fe00000080     cmp     esi,0x80000000
8044360d 8ad0             mov     dl,al
8044360f 7215             jb      nt!MiDispatchFault+0x42 (80443626)
80443611 81fe000000a0     cmp     esi,0xa0000000
80443617 730d             jnb     nt!MiDispatchFault+0x42 (80443626)
80443619 833dcc56488000   cmp     dword ptr [nt!MmKseg2Frame (804856cc)],0x0
80443620 0f8584000000     jne     nt!MiDispatchFault+0xc6 (804436aa)          
80443626 8bce             mov     ecx,esi
                                                  ecx = 1d3b9440
80443628 c1e90a           shr     ecx,0xa
                                                  ecx = 00074ee5
8044362b 81e1fcff3f00     and     ecx,0x3ffffc
                                                  ecx = 00074ee4
80443631 81e900000040     sub     ecx,0x40000000
                                                  ecx = c0074ee4
80443637 f60101           test    byte ptr [ecx],0x1
                                                   crash because address c0074ee4 does not exist
8044363a 756e             jnz     nt!MiDispatchFault+0xc6 (804436aa)
8044363c 83651400         and     dword ptr [ebp+0x14],0x0
80443640 89750c           mov     [ebp+0xc],esi
80443643 6a02             push    0x2
80443645 8bf1             mov     esi,ecx
80443647 59               pop     ecx
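The register arithmetic in the listing above can be replayed to check how esi = 1d3b9440 turns into the faulting address. This is a minimal sketch that mirrors only the `shr`, `and` and `sub` instructions; the function name `replay_ecx` is illustrative. The pattern looks like the page-table-entry address calculation used on 32-bit x86 Windows, though that reading is an interpretation and is not stated in the thread.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide, so subtraction wraps


def replay_ecx(esi):
    """Return ecx after each of the three instructions that follow mov ecx,esi."""
    after_shr = esi >> 0xA                          # shr ecx,0xa
    after_and = after_shr & 0x3FFFFC                # and ecx,0x3ffffc
    after_sub = (after_and - 0x40000000) & MASK32   # sub ecx,0x40000000
    return after_shr, after_and, after_sub


after_shr, after_and, after_sub = replay_ecx(0x1D3B9440)
print(f"after shr: {after_shr:08x}")
print(f"after and: {after_and:08x}")
print(f"after sub: {after_sub:08x}")
```

The final value matches both the ecx shown in the trap frame and the first bugcheck parameter, so the bad read follows directly from the value in esi.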

I am using the latest windbg version, 6.5.0003.7. My stack trace is much longer than your stack trace. If you install this version, you may have new findings.

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: c0074ee4, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: 80443637, address which referenced memory

Debugging Details:
------------------
*** WARNING: Unable to verify checksum for smtp.exe
*** ERROR: Module load completed but symbols could not be loaded for smtp.exe
READ_ADDRESS:  c0074ee4 Nonpaged pool
CURRENT_IRQL:  2
FAULTING_IP:
nt!MiDispatchFault+53
80443637 f60101           test    byte ptr [ecx],0x1

DEFAULT_BUCKET_ID:  INTEL_CPU_MICROCODE_ZERO
BUGCHECK_STR:  0xA
LAST_CONTROL_TRANSFER:  from 8046dd5f to 804671f6

TRAP_FRAME:  bb444770 -- (.trap ffffffffbb444770)
ErrCode = 00000000
eax=00000001 ebx=800656b0 ecx=c0074ee4 edx=00000001 esi=1d3b9440 edi=c03a7720
eip=80443637 esp=bb4447e4 ebp=bb444804 iopl=0         nv up ei ng nz na po cy
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010287
nt!MiDispatchFault+0x53:
80443637 f60101           test    byte ptr [ecx],0x1      ds:0023:c0074ee4=??
Resetting default scope
STACK_TEXT:  
ChildEBP RetAddr  Args to Child
bb444770 80443637 00000000 00000000 ffdff4fc nt!KiTrap0E+0x210
bb444804 8044c6b6 00000000 e9dc8b28 c03a7720 nt!MiDispatchFault+0x53
bb444854 8046afe3 00000000 00000000 00000000 nt!MmAccessFault+0x704
bb444854 804671f6 00000000 00000000 00000000 nt!KiTrap0E+0xc7
bb4448e4 8046dd5f 83b41cc8 0000001a 8618ee20 nt!ExpInterlockedPopEntrySListFault
bb444910 bfe7ed7a 00000011 00000004 6446744e nt!ExAllocatePoolWithTag+0x21f
bb444b38 bfe7f955 bb444bb4 8618ed68 88f4b0f0 Ntfs!NtfsQueryDirectory+0x218
bb444b6c bfe7fd39 bb444bb4 e231feb0 88f4b020 Ntfs!NtfsCommonDirectoryControl+0xa7
bb444ce4 8041eec9 88f4b020 8618ed68 8618ed68 Ntfs!NtfsFsdDirectoryControl+0xef
bb444cf8 804b3190 bb444d64 02b3bb1c 804ac8f2 nt!IopfCallDriver+0x35
bb444d0c 804ac94d 88f4b020 8618ed68 87c8c968 nt!IopSynchronousServiceTail+0x60
bb444d30 80468309 000005e4 00000000 00000000 nt!NtQueryDirectoryFile+0x5b
bb444d30 77f88847 000005e4 00000000 00000000 nt!KiSystemService+0xc9
02b3bae8 7c585d48 000005e4 00000000 00000000 ntdll!ZwQueryDirectoryFile+0xb
02b3bde0 7c58556f 7ffdbc00 000005e4 02b3be0c kernel32!FindFirstFileExW+0x2c1
02b3c070 00409c34 01950c50 02b3c084 01950c50 kernel32!FindFirstFileA+0x31
02b3c1d0 00409c86 00000000 004f1451 02b3c428 smtp+0x9c34
02b3c414 004f1a21 00000000 40e2d6e0 02b3cc9c smtp+0x9c86
02b3cc9c 00502404 00000001 02b3cdcc 02b3cfec smtp+0xf1a21
02b3cfe0 005057a1 02b3ff40 02b3cff8 00508ab2 smtp+0x102404
02b3ff40 00460fc0 02b3ff54 00460fd6 02b3ff70 smtp+0x1057a1
02b3ff70 0041c433 02b3ff84 0041c43d 02b3ffa0 smtp+0x60fc0
02b3ffa0 0040500a 02b3ffdc 00404b80 02b3ffb4 smtp+0x1c433
02b3ffb4 7c57b388 01950bdc 00000000 00000000 smtp+0x500a
02b3ffec 00000000 00404fe0 01950bdc 00000000 kernel32!BaseThreadStart+0x52

FOLLOWUP_IP:
nt!MiDispatchFault+53
80443637 f60101           test    byte ptr [ecx],0x1

SYMBOL_STACK_INDEX:  0
FOLLOWUP_NAME:  MachineOwner
SYMBOL_NAME:  nt!MiDispatchFault+53
MODULE_NAME:  nt
DEBUG_FLR_IMAGE_TIMESTAMP:  427b58bb
STACK_COMMAND:  .trap ffffffffbb444770 ; kb
IMAGE_NAME:  memory_corruption
FAILURE_BUCKET_ID:  0xA_nt!MiDispatchFault+53
BUCKET_ID:  0xA_nt!MiDispatchFault+53

Followup: MachineOwner
---------
0: kd> lm tn
start    end        module name
00400000 00547000   smtp     smtp.exe     Sat Jun 20 06:22:17 1992 (2A425E19) <--
10000000 10040000   regex    regex.dll    Wed Dec 24 05:10:31 2003 (3FE8AF47)
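
The value in parentheses at the end of each `lm tn` line is the image timestamp in hex, counted in seconds since 1970-01-01 UTC. The sketch below decodes it; the function name is hypothetical. Note that the debugger prints the date in the analyst's local time zone, and that the field only records what the linker wrote, so it is not by itself proof of when the program was built.

```python
from datetime import datetime, timezone


def image_timestamp_to_utc(stamp_hex: str) -> datetime:
    """Convert a hex image timestamp (seconds since the Unix epoch) to UTC."""
    return datetime.fromtimestamp(int(stamp_hex, 16), tz=timezone.utc)


print(image_timestamp_to_utc("2A425E19").isoformat())
# prints 1992-06-19T22:22:17+00:00
```

The listing shows Jun 20 06:22:17 for the same value, which is consistent with a debugger machine running eight hours ahead of UTC.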

The failing routine is the W2K MiDispatchFault, which is the W2K task dispatcher. It is unlikely to fail, as millions of users run this routine daily. It should not fail unless there is a hardware error.

My preliminary finding: I am very sure that it is a hardware problem, most likely faulty RAM. Based on my past experience, I would rate the likelihood as 70% RAM, 20% CPU and 10% M/B. Hardware errors occur randomly, so if W2K keeps crashing with different bugcheck codes, that is a symptom of a hardware error. If W2K always crashes at the same instruction address with the same bugcheck code, it is probably a software error.

 

by: cpc2004 Posted on 2005-08-21 at 07:00:38 ID: 14718876

Hi Bob,

According to its timestamp, your version of smtp.exe was built in 1992. Is it still compatible with W2K SP4? You had better install the latest version of smtp.

00400000 00547000   smtp     smtp.exe     Sat Jun 20 06:22:17 1992 (2A425E19)

 

by: cpc2004 Posted on 2005-08-21 at 07:10:49 ID: 14718913

Bob
After doing some research on Google, I found out that MiDispatchFault is part of page-fault exception processing. It is a stable routine and crashes only if there is a hardware problem or a corrupted paging file. Reallocating the paging file is free, and there is no harm in allocating new paging space. Maybe it can resolve your problem.

 

by: BobCSD Posted on 2005-08-21 at 13:26:01 ID: 14720304

>>Silly question probably, but are you running a Windows Web Server or Apache Web Server?

Microsoft all the way!

:)

 

by: BobCSD Posted on 2005-08-21 at 13:30:02 ID: 14720332

cpc,

>>You had better install the latest version of smtp.

Do you know where I can get this? I use the software updates to regularly get the latest stuff.

Is this something provided with the windows 2000 web server, or from my Merak Mail server?

BOBi

 

by: BobCSD Posted on 2005-08-21 at 13:31:05 ID: 14720334

cpc2004,

>>Reallocating the paging file is free, and there is no harm in allocating new paging space. Maybe it can resolve your problem.

How does one go about reallocating the paging file?

BOBi

 

by: BobCSD Posted on 2005-08-21 at 13:37:24 ID: 14720371

So far there have been two suggestions:

>> Suggest checking BIOS caching; if enabled, try disabling it in the C00000000 area.
>>Let me know if disabling the BIOS cache in this area solves this problem.

and

>>I am very sure that it is a hardware problem, most likely faulty RAM. Based on my past experience, I would rate the likelihood as 70% RAM, 20% CPU and 10% M/B. Hardware errors occur randomly, so if W2K keeps crashing with different bugcheck codes, that is a symptom of a hardware error. If W2K always crashes at the same instruction address with the same bugcheck code, it is probably a software error.

It has had 3 different bugcheck codes in the last couple of weeks. Plus other non-bug check problems. Basically, this box has had problems since setup. Certain SQL statements that ran for years on the other box with no problems, ran up Mem Usage on this box and I had to take down those modules until I could figure out what was going on. Just a LOT of intermittent problems and site performance issues that never occurred on the other box. I USED to have time to GARDEN!!!! ;)

I didn't want to reboot the box last night, due to the busy weekend. But tonight I will reboot and check on the BIOS caching. Considering cpc's suggestion that this might be a hardware problem, would disabling the BIOS in this area cause any problems?

>>I don't think the problem is not related address C000000 is cache.

Since I am ignorant in all this regard... :( ... I can't make a wise decision. Can the two of you come up with a little plan you can agree on that you think I should do?

I'm thinking now:

I check the BIOS and see what it is.... without changing it...
I change the BIOS....
I replace the RAM.

Which one? All three?

thanks!

BOBi


 

by: BobCSD Posted on 2005-08-21 at 14:15:22 ID: 14720527

>>My preliminary finding: I am very sure that it is a hardware problem, most likely faulty RAM. Based on my past experience, I would rate the likelihood as 70% RAM, 20% CPU and 10% M/B. Hardware errors occur randomly, so if W2K keeps crashing with different bugcheck codes, that is a symptom of a hardware error. If W2K always crashes at the same instruction address with the same bugcheck code, it is probably a software error.

On startup, doesn't the system check the RAM? It doesn't indicate it is bad. Wouldn't it? It's a gigabyte of dual channel RAM. How can we verify whether it's good or bad?

thanks!

BOBi

 

by: GinEric Posted on 2005-08-21 at 15:08:06 ID: 14720688

BobCSD

There are only a couple more steps in determining whether it is an actual hardware problem before touching the hardware.  The next post explains why it is not a good idea to touch the hardware until you've eliminated some obvious software causes.  The posts I've made, taken as a whole, follow a well-known and well-organized troubleshooting technique developed over more than three decades of doing computer dumps for exactly this purpose.

I will also look up that board and see what it is capable of, although I would never have bought a board with the name "Abit Fatality AA8XE", because it shows very poor thinking in marketing and will probably lead to many jokes in the computer industry.

Tell your computer builder to go here:

http://www.Tyan.com/

the next time a superb motherboard is needed; these are the very best.

As for the RAM, we haven't asked what brand it is yet, so, what brand is it and how much?

I'll also go and see what this statement is supposed to mean: "Designed for Intel® 90nm Pentium 4 LGA775 processors ." which is kind of funny, to an engineer, as 90 nanometers is something most people have no intuitive feeling about its meaning.  It's a wavelength, an etching specification for substrate design, and stuff most buyers don't really care about, but which the clean labs that make the salami's are very proud of.

It's kind of overkill from the engineers, while the name of the board is a flat out boo-boo from the marketing end of the company's "fatality" board - I mean, you just don't put this in a sales product's name!

Triton II and III, F21, MIG-29, these are types of boards, and Tiger, Thunder, and Tomcat!  These are names you put on motherboards!  Like the state of the art submarines, supersonic aircraft, etc., not a reference to the casualty list!

I know I'm a little off topic here, but this is why it got that name:

"ABIT is one of the most respected board manufacturer that caters mainly to the gamer, . . ."
"Just recently, ABIT has partnered with the world's number one professional gamer, Johnathan "Fatal1ty" Wendel,"

Game over, you might just lose on this one, Johnathan.

Yes, Abit makes good boards for gamers, but I'd be skeptical for production servers, at best.

Secondly, Abit doesn't even seem to have a site with the specs and data for their boards, at least not one that can be quickly found.

So, maybe there is a serious hardware conflict, but it's not as simple as bad RAM, more like a bad choice for a production server motherboard, if it turns out that there is an actual hardware cause to this problem, which may well be, but all other possibilities should be eliminated before throwing the board in the trash and buying an expensive one that is known to work without intermittent crashes.

Okay, now check the next post.

 

by: GinEric Posted on 2005-08-21 at 15:10:15 ID: 14720694

This will be incorporated into a site link that outlines the approach to troubleshooting crash dumps:


Well, it takes me about 20 minutes to download a one gigabyte file, and about 20 minutes to download network service packs in the range of 280 megabytes, so I'll have to consider that the interim connections, or yours, are latent.

I guess I should explain again that "IRQL_NOT_LESS_OR_EQUAL" means IRQ greater than 1; this determines only that it is not the system timer or the keyboard, NMI or non-maskable interrupts. But if the stack dump says the IRQL was "2," it means that it was re-entrant code via the PLC system, in other words, a redirected IRQ, and while not terribly important, it is the context of the failure which thereby eliminated the timer and keyboard.

I did say it was a simple page fault, or, in terms of old windows, a gpf.

The technique for cleaning the corrupted areas on disk, which occur over time not because of hardware but because Windows autodefragging doesn't perform properly, is:

01.)  Resize the page file, usually down to about 32 or 64 meg
02.)  Run chkdsk, scandisk, and defrag with "fix all errors" set
03.)  Change the page file again, usually, this is distributed disks and partitions
04.)  Rerun chkdsk, scandisk, and defrag
05.)  Now set the page file to one and one-half times the size of RAM for manual configuration
06.)  Run chkdsk, scandisk, and defrag one last time

If the crash on memory error recurs, you have narrowed it down to either a bad disk or bad RAM, but have not eliminated the timing problem.

If the crash doesn't recur, you have most likely solved the problem by exercising the disks which lose hysteresis over time, or, on a new disk that was not initially exercised enough.

This is preferable to guessing at faulty hardware, even if it is faulty hardware, because the act of touching things, such as RAM cards, can destroy them.  So, you want to eliminate all other possibilities before doing so, as you may actually cause a problem thereby.

Why does disk become corrupt?

This is way into the field of Electrophysics.  Data is stored on disks by inducing areas with cumulative polarity fields.  You write them by using a very strong current in a nearby conductor, which induces a counter electromotive force sufficient to alter the actual magnetic orientation of the atoms and molecules within the magnetic medium, the disk or coating on the disk.

You read them by passing a much lesser electromagnetic field over the previously written or induced area, then measure the amount of counter electromotive force returned from that action and set a one or a zero according to the strength of the returned counter electromotive force.

You rewrite them by again applying the much stronger electromotive force.

However, if they haven't been written lately, or they have been constantly read with no interim refresh writing, the magnetic polarization of the aggregate atoms and/or molecules begins to decay [the Law of Entropy aided by millions of low voltage reads] and the returned field becomes weaker and weaker until it reaches the grey area or less, where the returned value is either indeterminate or wrong.  If only one bit, this is a one bit error.  If more than one bit, this is considered a fatal error.  The name given to this error is "memory corruption."

Of course, memory corruption can have other causes, however, they are usually constant failures and not the intermittent ones seen from field decay.

So, what you're doing when you move all these page files and rewrite them is to actually refresh the hysteresis of the medium.

So, why is this a problem in Windows?

Because Windows does not rewrite system protected areas, thus, they never get refreshed when only autodefrag is used.  You must chkdsk, scandisk, and defrag manually, and, you must do this more than one time, preferably at least three times.

Windows "format" is also supposed to write, read, and rewrite a sufficient number of times to achieve a known level of hysteresis, but this is doubtful at best.  Low level format by the disk manufacturers should always use this method, especially on a new disk, as well as on an old disk.

But you can circumvent continual failures by forcing the rewrite of certain areas.  Which is what the method above does.

Reapplying a service pack may have a similar result.

About Dispatch Fault and Free Memory Tracker:

both part of the same thing; got a page fault, or was measuring free memory or freeing it up, and neither is the cause of the problem.

How can I say that?

Well, recall that the debugger is telling you that it found an "invalid address":

"An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high"

It says that the "address" was invalid, not the data.  

When you analyze a crash dump, you must see the overall picture because often the failure occurred before the reported point of failure:

"80443637 f60101           test    byte ptr [ecx],0x1
                                                   crash because address c0074ee4 does not exist
8044363a 756e             jnz     nt!MiDispatchFault+0xc6 (804436aa)"

That clearly says that in the call stack, "jump to report error in MiDispatchFault" because the result of a previous test check failed.  That failure was at 80443637, not at the Error Handler portion of dispatching a faulted error report MiDispatchFault.

And, as another hint about how IRQ>1 is used, from the Registry, Plug and Play devices, somewhere under CurrentControlSet or the others, ControlSet00, etc., IRQ 2 is pnp0200 under ROOT\*PNP0200\PNPBIOS_2

And the values of the key are for the DMA Controller, which is the Direct Memory Access Controller, and it is by this IRQ number that the Error Handler knows which device and IRQ are associated with the failure, even if the failure was actually caused by software.

So, the real failure is an invalid address, and this issue is addressed more by the timing problem link, or, a bad address while looking for free memory or a valid pointer to something in memory, than to the dispatch routine.  Of course the dispatch routine is going to fail if it is dependent on a routine that failed previous to its operation.

Which basically means that you cannot assume the tree you have found in the forest is the tree housing the root cause of the problem in the forest, it may just point to the real tree causing the problem, usually, right next to it, as in the call stack.

So, a good firefighter asks "Which tree actually caught fire first?"  And thereby, doesn't have to inspect all of the trees in the forest, but only the ones he suspects.  In a forest of 410 million trees, there is little need to look at every tree when you can see where the smoke is originating.  You go to that area, immediately, and considerably narrow down your search and save valuable time in putting out fires.

To make a point, I didn't need to inspect all 400 million possibilities, nor to spend considerable time doing more tests than I considered necessary [the extra analyze commands], to assess the damage and cause quickly, in fact, it would have confused the attempt to put out the fire and delayed the work unreasonably.

ESI Register can be used for many things, a counter, the Source Index Register, whatever, what it contains was set by some previous routine.

Interpreting these registers is almost useless unless you can single step through the execution, a feature I haven't yet found in these assemblers and disassemblers, which is called the monitor and editor function.

I have single stepped systems to troubleshoot and find the cause of such problems, but I don't know that it even exists on microprocessor systems yet.  It does on certain mainframe systems where you can single step a massively large system or network.

Just as an anecdotal note, I was involved in the design of the predecessors to Intel's registers so I think I know what they do and how they do it.  It's just a shame that they don't provide a definitive table of acronyms to English meanings of those acronyms, or it's almost impossible to find out what the numerous acronyms mean in a confusing world of acronymia-itus.

To return from defending the reasons for using the much more cost effective technique of troubleshooting without overkill, using a warhead to kill a flea that is, being faster, cheaper, and more constructive of the use of my time, and the time of companies and clients who want their machines up "now!" and not after a federally funded study in time-waste management, let the poster try the disk cleaning outlined above before ripping the hardware apart.

I will say that a badly seated RAM card or one with any finger oils on the lands can get similar problems, but it's best to eliminate the others first if they can be done with software only.

repage, chkdsk, scandisk, defrag, as per above.

and cpc2004, you can have the points if that's a concern to you, I'm not trying for them, not really, since this entire post is going into a book, from which I'll benefit.

Nothing like a keen mind to play devil's advocate with one's analysis; it sharpens my wits and makes me answer questions I hadn't thought to ask.

 

by: BobCSD Posted on 2005-08-21 at 15:51:48 ID: 14720821

>>flat out boo-boo from the marketing end of the company's "fatality" board - I mean, you just don't put this in a sales product's name!

Yes, it's originally designed for gamers. The CPU is black with blue glowing lights all over it and you can see the guts inside. It's designed for teens, but Larry said that was the best CPU they had available at MediaCenter. (Or he simply isn't admitting he liked the lights. I joke with him sometimes that I send him out with the family cow and he comes back with beans. ;)

I'll find out the RAM stuff and get back with you.

BOBi

 

by: GinEric Posted on 2005-08-21 at 15:53:23 ID: 14720823

Addendum 3:

This idea of refreshing memory was later applied to planar memory chips, that is, the current set of RAM that uses a refresh cycle to conserve power.  The RAM is not supplied a constant supply voltage and current, but uses the fact that memory is not lost all at once when power is turned off, but will remain intact for a short span of time.

RAS and CAS are terms from the 1970's.  They refer to what is called dynamic refresh RAM.  A switching power supply is used to supply power to RAM for short periods of time, and then simply turned off for another period of time.  Because the charge on the RAM circuitry decays over time, the RAM still holds valid data.  However, the refresh must occur within a specified time period, or that data will be lost.  RAS and CAS refer to the counter mechanisms which control this timing period.  During RAS cycles, the RAM is placed in a wait state and cannot be read or written.  This results in some slight delay, but an acceptable one and a workable one because usually the entirety of the memory is divided up into 8 or more banks, where the other 7 banks can still be read or written and thus the time delay is completely masked out, that is, seemingly invisible to the memory control process.

Now, recall that RAM is not always on, and is therefore more subject to high load conditions and the consequent dips in power [current dips] in overloaded, borderline, and substandard power systems.  At some point, under load, which strains the power requirements, a dip can be propagated through to the RAM address bus resulting in intermittent failures.  Such failures are extremely hard to identify, and, in 9 out of 10 cases are not the fault of the RAM, but are the fault of the load on the power supply, the integrity of gold plated [or not, very bad not to use real gold on RAM lands and connectors!] lands and connectors, poor solder joints, extreme run lengths, substandard lasering of the gold wires within the RAM chips from substrate to pin leg platform, and so on.  I have seen, under an electron microscope, an unsoldered gold wire to an IC pin just sitting there jumping up and down when vibrated; this is called a "cold solder joint" even if it was supposed to be laser welded from the gold wire to the pin head.  Not to get too technical [which is probably unavoidable at this point], but this was about the only RAM hardware failure I could identify as an actual hardware failure.  Another one was the examination of a production run of RAM IC's wherein the negative used to etch the substrate, or the layer's photomask, had a microscopic fleck of dust that had settled on the emulsion before the lithium wash had stopped the photographic developing process.  This resulted in a run of some millions of useless RAM chips, which all had problems that showed up as intermittent and occurred only under load.  Basically, the resulting etched substrate bus had an anomaly in it and most of the time it worked fine, but every now and then some machine somewhere using the chips would get an intermittent failure.

That was unacceptable to the clients and customers, very large institutions, which demanded that these machines and these networks be up absolutely 100% of the time; even only a microsecond of down time was unacceptable to them.

And the result was that millions of RAM chips were simply thrown out and replaced with new ones for the entire production run, which was recalled from the commercial market.

Today's mass markets for microprocessors do not maintain that level of quality control, so you may get RAM that is substandard, but I haven't seen much of this to date.  Which is why you should only rely on reliable companies for RAM, and used or reseller RAM is always suspect.

RAM lands and connectors should also be cleaned with methyl ethyl alcohol for best results; some rubbing alcohols and other cleaners often contain sugar, something you definitely don't want around electricity!  The average cleaners you buy at a computer shop should be good enough though, and you needn't go hunting down the hard to find pure alcohol.

And then, the lands should absolutely never be touched by human fingers!

Seating should be confident and secure, but never forced.  If you have to force RAM seating, you can bet something has gone wrong or is going to go wrong.

At this point, when you have the time to run the procedures outlined above, you should do so.  Remember, it is going to take some time to do this and you should announce a scheduled "outage."

What we do is have another server ready to take the place of the server going down for scheduled maintenance; which is what all professional network enterprises do.  That way, we don't even have to go offline.

And scheduled maintenance is not an option beyond certain business growth size; it is absolutely necessary.

You might want to consider all of this in your business plan.

 

by: BobCSD Posted on 2005-08-21 at 15:54:08 ID: 14720827

>>a new disk that was not initially exercised enough.

Makes me think I should put it on a leash and take it for a walk! ;)

 

by: BobCSD Posted on 2005-08-21 at 16:10:03 ID: 14720868

>> You must chkdsk, scandisk, and defrag manually, and, you must do this more than one time, preferably at least three times.

How often do you expect this to be done? Daily, weekly, monthly?

>>Because Windows does not rewrite system protected areas, thus, they never get refreshed when only autodefrag is used.  You must chkdsk, scandisk, and defrag manually, and, you must do this more than one time, preferably at least three times.
>>Remember, it is going to take some time to do this and you should announce a scheduled "outage."

Okay.

>>What we do is have another server ready to take the place of the server going down for scheduled maintenance; which is what all professional network enterprises do.  That way, we don't even have to go offline

I keep telling Larry we need to set up the old one as a backup for stuff like this!! Maybe he'll help me do this now.

>>Reapplying a service pack may have a similar result.

We've been doing regular service packs.

The machine isn't that old at all. Only a few months at most. Would this be an issue already?

>>repage, chkdsk, scandisk, defrag, as per above.
How does one repage?

>>And scheduled maintenance is not an option beyond certain business growth size; it is absolutely necessary.

We had our old server for about 4 years and never once did maintenance like this on it. :(

Thanks!

BOBi

 

by: BobCSD Posted on 2005-08-21 at 16:56:01 ID: 14721035

>>As for the RAM, we haven't asked what brand it is yet, so, what brand is it and how much?

Centon 2 GB PC4200 DDR2 dual channel

As far as boards, Larry wanted me to let you know:
The motherboards were over $250 each, they weren't cheap boards, and they have everything the boards you recommended do. They can have up to 4 GB ram, and take a pentium 4 chip. They support Intel hyper-thread technology. They can do RAID.



 

by: GinEric Posted on 2005-08-21 at 18:01:03 ID: 14721200

That's fine, I didn't mean they were cheap.  They seem to be very good, which is why I'm leaning on it being problems other than hardware.  So far, I've only seen one problem here that was an actual hardware problem, and it was the motherboard, not the RAM.

There is, however, also a reason why some boards do cost upwards of $700.00, some $1,500.00, and so on.

Centon is an old enough company to be considered extremely reliable!  It wasn't much more than 26 years ago that RAM in an IC package was invented.  Aboriginal planar memory, and then Refresh Memory.

And their location is a dead giveaway, when the company age is considered, to be a sure indicator that they don't make bad memory chips.

So, I think you have good hardware.

You repage by going into the My Computer | Advanced | Performance | Settings | Advanced | Virtual Memory | Change

and set it to custom size and the other settings and make the changes there.  Do not let this go below 2 meg, ever, as it may cause the system to slug, but 32 meg or 64 meg should work for the first step of chkdsk, scandisk, defrag.  Thereafter, you want to get it back up to a much larger size, which is usually one and one-half times the size of current RAM.

Step two is to chkdsk, scandisk, defrag again, with say a 480 meg page file.

Step three is to finalize by setting the pagefile over multiple disks [makes it work faster], and other advanced techniques.
I would not, however, myself, set the page file to 6 gig simply because I had a 4 gig RAM size on board.  It's probably too much; most server people I know keep it around 1 gig maximum.
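
For reference, the "one and one-half times the size of RAM" rule of thumb works out as in the short sketch below. The function name is made up for illustration, and the rule is only the starting point described above, not a required setting.

```python
def rule_of_thumb_pagefile_mb(ram_mb: int) -> int:
    """Page file size in MB under the 1.5 x RAM rule of thumb."""
    return ram_mb * 3 // 2


for ram_mb in (1024, 2048, 4096):
    print(f"{ram_mb} MB RAM -> {rule_of_thumb_pagefile_mb(ram_mb)} MB page file")
```

With 4 gig of RAM this gives the 6 gig figure mentioned above, which is the case where a smaller manual size may be the better choice.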

How you get to the page file [Virtual Memory] to repage may be different on your server, but it will be similar, all you're doing is changing the disk cache, which is a better term than Virtual Memory for what the page file actually is.

I have seen at least three intermittent crash problems solved this way here on Experts-Exchange.

I've added a link for you, and others, because the question keeps coming up, and I'm trying to finalize a standard method for approaching this problem.

http://www.musics.com/manhtml/Troubleshooting/

will be an ongoing documentation of Troubleshooting Crash Dumps.

On Preventative Maintenance:

Yes, you can get away without it for years, but sooner or later it is going to catch up to you, and when it does you may be at the threshold of a medium sized business where a catastrophic crash could be both catastrophic and disastrous for company owners and stockholders.  I would not even like to imagine a runaway crash that is destroying petabytes of critical company data.  {shudder!} It's just too much of a horror picture that turns into real life!

Need I scare anyone, I once witnessed a man lose $200,000,000.00 and everything he owned because he did not see that even he could make a mistake.  True story.

I have servers that have been running for over 10 years here, with little or no change in hardware, except the occasional upgrade of motherboard, disks, power supplies; but these were done as scheduled maintenance with a backup server.  There are other servers I've worked on that have been running for over 30 years, and they definitely have scheduled maintenance, especially the ones at places like the Federal Reserve; none of us would want them to crash, would we?

See if you can find your way through all of this advice and we'll see if we can solve this problem.

It's one for the book.

 

by: cpc2004 Posted on 2005-08-21 at 19:45:19 ID: 14721530

Hi Eric,
It is a great idea to document how to diagnose crash dumps. The document must be simple and easy to understand. I can't understand your document.

>>> I guess I should explain again that "IRQL_NOT_LESS_OR_EQUAL" means IRQ greater than 1; this determines only that it is not the system timer or the keyboard, NMI or non-maskable interrupts. But if the stack dump says the IRQL was "2," it means that it was re-entrant code via the PLC system, in other words, a redirected IRQ, and while not terribly important, it is the context of the failure which thereby eliminated the timer and keyboard. <<<

My comment
You mixed up IRQL and IRQ.

IRQL is the interrupt request level, a software priority level the kernel uses when handling interrupts.  IRQ is the interrupt request line, which is hardware.
IRQL 2 is the dispatch-level IRQL and is not related to IRQ 2.

00 PASSIVE_LEVEL  - execute thread
01 APC_LEVEL      - execute special kernel APC; page fault
02 DISPATCH_LEVEL - dispatch (execute DPC)
03                - 24 device interrupt
..
1A
1B PROFILE_LEVEL  -
1C CLOCK2_LEVEL   - interval-timer execution
1D REQUEST_LEVEL  - interprocessor request
1E POWER_LEVEL    - power failure notification
1F HIGH_LEVEL     - machine checks or bus errors
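
As a quick lookup aid, the table above can be written as a small Python sketch. The level names and ranges are copied from the table as posted, so treat them as this thread's listing and not as an authoritative reference; `irql_name` is a made-up helper name.

```python
# IRQL values as listed in the table above (hex), not hardware IRQ lines.
NAMED_IRQLS = {
    0x00: "PASSIVE_LEVEL",
    0x01: "APC_LEVEL",
    0x02: "DISPATCH_LEVEL",
    0x1B: "PROFILE_LEVEL",
    0x1C: "CLOCK2_LEVEL",
    0x1D: "REQUEST_LEVEL",
    0x1E: "POWER_LEVEL",
    0x1F: "HIGH_LEVEL",
}


def irql_name(level: int) -> str:
    """Return the table's name for an IRQL value."""
    if level in NAMED_IRQLS:
        return NAMED_IRQLS[level]
    if 0x03 <= level <= 0x1A:
        return "device interrupt"
    raise ValueError(f"IRQL out of range: {level:#x}")


# CURRENT_IRQL from the dump is 2.
print(irql_name(0x02))
```

For the dump in this thread, CURRENT_IRQL 2 maps to DISPATCH_LEVEL, which is a kernel priority level and says nothing about which hardware IRQ line was involved.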

 

by: BobCSD Posted on 2005-08-21 at 22:14:55 ID: 14721895

>>I can't understand your document.

I can't understand much of most of the stuff in this post. Makes my head hurt. ;)

I'm just a poor mother trying to feed my children. But I'm going to figure it out and I appreciate ya'lls help!

BOBi

 

by: BobCSD Posted on 2005-08-21 at 22:28:03 ID: 14721934

>>32 meg or 64 meg should work for the first step of chkdsk, scandisk, defrag

Okay, I'm going to practice on my old server....then I'll do the new ones at 2 in the morning:

Under virtual memory, My old server settings are:
Initial size: 768
Max size: 1536
minimum: 2 MB
recommended: 1534 MB
currently allocated: 768 MB
Current registry size:  15 MB
max registry size: 54

The new server settings for both web and database are:
Initial size: 2046
Max size: 4092
minimum: 2 MB
recommended: 3070 MB
currently allocated: 2046 MB
Current registry size: 17 MB
max registry size: 114

I guess by the time I'm done, I should set it back to the recommended size. Both of them seem to be really off from their recommended size. That's what the initial size should be, right? The recommended size?

BOBi

 

by: cpc2004 Posted on 2005-08-21 at 23:16:56 ID: 14722076

Hi Bob,

I understand you want to fix the problem. My preliminary finding is faulty RAM or corrupted paging space. If you provide the system events 1001/1003 from the last two months, it will be useful for finding the root cause of the problem.

When Windows crashes with a blue screen, it writes a system event 1001 or 1003. Check system events 1001 and 1003; they have the content of the blue screen.

Event ID: 1001
Source: Save Dump
Description:
The computer has rebooted from a bugcheck.The bugcheck was : 0xc000000a (0xe1270188, 0x00000002, 0x00000000, 0x804032100).
Microsoft Windows..... A dump was saved in: .......


Event Source: System Error
Event Category: (102)
Event ID: 1003
Description:
Error code 1000007f, parameter1 0000000d, parameter2 00000000, parameter3 00000000, parameter4 00000000

Control Panel -> Administrative Tools -> Event Viewer -> System -> Event 1001/1003. Copy the content and paste it back here.
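
If you have several of these events to go through, a small script can pull the numbers out of the description text. The following is only a sketch in Python: the function names are invented for this example, and the sample string is the event 1001 text from Bob's first message. For stop 0x0000000A the four parameters are the memory address referenced, the IRQL at the time, the access type (0 = read, 1 = write), and the address of the instruction that made the reference.

```python
import re

HEX = r"(0x[0-9a-fA-F]+)"
# Matches "bugcheck was: CODE (P1, P2, P3, P4)" in a Save Dump event description.
BUGCHECK_RE = re.compile(
    r"bugcheck was\s*:?\s*" + HEX
    + r"\s*\(\s*" + HEX + r",\s*" + HEX + r",\s*" + HEX + r",\s*" + HEX + r"\s*\)"
)


def parse_bugcheck(description):
    """Return (stop_code, [p1, p2, p3, p4]) as integers."""
    match = BUGCHECK_RE.search(description)
    if match is None:
        raise ValueError("no bugcheck found in description")
    code, *params = (int(group, 16) for group in match.groups())
    return code, params


def describe_0a(params):
    """Label the four parameters of stop 0x0000000A (IRQL_NOT_LESS_OR_EQUAL)."""
    address, irql, access, instruction = params
    return {
        "memory_referenced": hex(address),
        "irql": irql,
        "access": "write" if access & 1 else "read",
        "referencing_instruction": hex(instruction),
    }


text = ("The computer has rebooted from a bugcheck.  The bugcheck was: "
        "0x0000000a (0xc0074ee4, 0x00000002, 0x00000000, 0x80443637).")
code, params = parse_bugcheck(text)
print(hex(code), describe_0a(params))
```

For Bob's event this reads as: a read of address 0xc0074ee4 at IRQL 2 by the instruction at 0x80443637.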

 

by: BobCSD Posted on 2005-08-21 at 23:56:39 ID: 14722188

I had already pasted 1001 in the very first message:

Event Type:     Information
Event Source:     Save Dump
Event Category:     None
Event ID:     1001
Date:          8/19/2005
Time:          11:57:02 PM
User:          N/A
Computer:     SSWEB
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x0000000a (0xc0074ee4, 0x00000002, 0x00000000, 0x80443637). Microsoft Windows 2000 [v15.2195]. A dump was saved in: C:\WINNT\MEMORY.DMP.

There was no 1003.

Thanks!

 

by: BobCSD Posted on 2005-08-22 at 00:36:24 ID: 14722319

On my old server I ran chkdsk /f, rebooted, finished it.
Then I went to run scandisk, and got an error that it could not find the file.
I did a search on the drive and could not find it.

I guess I'll skip that and do defrag.

BOBi


 

by: cpc2004 Posted on 2005-08-22 at 22:51:13 ID: 14730742

I do not find any memory corruption in the dump. The page fault handling routine is a well-tested routine; it crashes only if there is a hardware error or corrupted paging space. I suggest you re-allocate a new paging space as a workaround for the problem. If the BSOD still occurs after re-allocating the paging space, it must be faulty hardware.

 

by: GinEric Posted on 2005-08-23 at 06:20:44 ID: 14733021

BobCSD you have to complete the scandisks after the chkdsk or the process is ineffective in fixing bad blocks on the disk.  Check Disk finds and marks the bad blocks, setting a bit for scandisk to try to reallocate them; scandisk looks for that "try to fix and reallocate" bit and does the actual fixing.

You can go to something like My Computer or Administrative Tools and either use Disk Manager then the partition's properties to get to the scandisk tool, or you can simply get the drive's properties, then the tab or button that says check this disk now, or something like that.

chkdsk marks the bad blocks that scandisk will fix; it must be done this way or corrupt blocks will not be fixed and added back into the free memory pool.

To clear up some definitions for cpc2004 before continuing the discussion of the problem:

IRQ ::= "Interrupt Request Queue"  # The number of the interrupting device being queued by I/O
IRQL ::=  "Interrupt Request Queue Level" # The Lexicographical Level of the interrupting device
                                          # on modern computers this is implemented in hardware
                                          # via Control Modes; 0, 1, and 2 associated with the
                                          # System Control Mode Errorhandlers which are called
                                          # by the special error flag bits of the Control Mode
                                          # Operations.  Simply put, an error has occurred, so
                                          # the Control Mode Error Handling routines are called.

I will refine that in the troubleshooting document.  But none of that is germane to the problem at hand, not really.  There is a great deal of confusion at both Intel and Microsoft as to what the design engineers are trying to tell them, so I can understand that oftentimes technical documents will not, at first, be understandable.  We write the technical documents first, and then try to reduce them to laymen's terms.  Since it's rather too technical to put here, I will add yet a third document to try and explain why it's in virtual memory, why the handler was called by the Intel Special Operator Set, and why it points to caching memory.

Back to BobCSD:

A bad address is loosely called memory corruption; both by Windows and others.  The dump error report uses the term loosely, but a bad address is a corrupt address, and since it pertains to memory, is just agglomerated under the general heading of memory corruption.

Windows suggested it was probably memory corruption.  And they further explain that that corruption is most likely a bad address.  This can occur for a number of reasons, the least of which is usually hardware, although Intel has admitted that there may be timing problems with the P4 when SCSI is used or emulated and the programmer has not written the proper code in their driver for when an indirect addressing reference is made and/or the indirect vector addressing reference is used.  The result is that the request to read or write arrives too soon, ahead of the address couple, and thus the indirect reference indexes the wrong pointer in a multidimensional pointer array.  Simplistically put, the operator executes before the arrival of the proper address on the address bus!  Thus, the wrong address is fetched or written.

To add insult to injury, this fault was discovered by Linux engineers and programmers, who put in the fix by telling Intel and the others to add 7 no ops to the beginning of their drivers to delay the iterative memory transfer routine long enough for the memory and PCI buses to synchronize to the memory transfer.  And I have to find the link to that information again, as it seems to have disappeared from one of my previous answers at Experts-Exchange.

However, it was in Virtual Memory [C0000000 is the Virtual Memory Area].  Virtual Memory is an unusual concept, it includes devices, such as disk partitions, the cache and pagefile, and other peripheral addresses as part of the memory space.  Basically, it's saying that either the Virtual Address was wrong, the device didn't exist [and thus the memory space did not exist], or the device did not respond in a timely manner.

Thinking along those lines, RAM should actually have little to do with Virtual Memory, unless part of it is allocated in the virtual address space, an unusual thing to do.

Okay, so back to BobCSD's troubleshooting.

I see your new server is quite powerful.  For the initial size, you want something slightly over the actual size of RAM, add about 100 meg to the RAM size.  Recommended size can also be the maximum size, this is where the one and one-half rule is used, although, 2046 is not an exact boundary, it should report 2048 [which is 2 gigabytes of memory; some systems use 2 meg for the mmx drivers that emulate missing hardware in some RISC computers], I would therefore suggest making the minimum size 2048, and the maximum size 3072.  However, again, you have to think "What is it actually doing with this allocation?"  It's merely providing an area equal to or greater than the size of RAM from which to cache temporary and interim fetches to and from main memory.  Many administrators question whether any allocation over 1.5 gig is really necessary, and, whether or not there is any improvement in performance beyond the one gigabyte RAM implementation, with most administrators practically stating flat out that there is none beyond the 2 gigabyte memory implementation.  There are gaggles of technical reasons for them making these statements, but they support it with actual observed results; more memory beyond some point does not improve performance and in some circumstances has degraded performance.
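
To make the arithmetic of that rule of thumb concrete, here is a minimal Python sketch. The function name is made up for this example, and the rule itself (initial size equal to RAM, maximum one and one-half times RAM) is just the suggestion from this post, not an official recommendation.

```python
def suggested_pagefile_mb(ram_mb):
    """Return (initial, maximum) pagefile sizes in MB using the 1.5x rule of thumb."""
    initial = ram_mb              # start at the size of RAM
    maximum = ram_mb * 3 // 2     # one and one-half times RAM
    return initial, maximum


# New server: 2 GB of RAM.
print(suggested_pagefile_mb(2048))
```

With 2048 MB of RAM this gives the 2048 and 3072 values suggested above.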

Okay, so let's get back to your new server now.  How did you initially set it up?  How many hard drives, what sizes, and how many partitions on each hard drive?

I would like to know all of this to outline a suggestion for initially setting up the physical topology of a modern server.

BOBi, don't worry about the technical jargon, just take it one step at a time.

Before we conclude that you do have something like the timing problem, try to get these done in order [don't forget to set "automatically fix all errors"] :

chkdsk, scandisk, defrag

Even if you have to do it using the disk properties and scheduled for the next time the system reboots.  I do want to know about the number of hard drives and the partitioning because this affects the speed of your server.

I appreciate all ya'lls' help and comments.

 

by: BobCSD Posted on 2005-08-23 at 11:10:33 ID: 14736078

>>BobCSD you have to complete the scandisks after the chkdsk or the process is ineffective in fixing bad blocks on the disk.  Check Disk finds and marks the bad blocks setting a bit for scandisk to try to reallocate them, and scandisk looks for that information and the try to fix and reallocate bit and does the actual fixing.

So did I break anything by doing a chkdsk and defrag without the scan disk?

>>...or you can simply get the drive's properties, then the tab or button that says check this disk now
That's where I found it, not the other.

>>Okay, so let's get back to your new server now.  How did you initially set it up?  How many hard drives, what sizes, and how many partitions on each hard drive?

1 floppy drive (antique eh?)
1 C drive
1 CD/DVD drive

also a USB Maxtor 300 GB is attached, but that is removable.

C drive is 34.4 GB with no partitions.
File System: NTFS
Used: 8.27 GB
Available: 26.1 GB
Location: 0

This is not on a network where someone accesses my drive by putting in J: or something like that, but in Sharing I do notice that "Share this folder" is selected: C$, default share. Web sharing is turned off.

check box for allow indexing service to index is checked...
do I need that? does that allow me to do the search and find files via explorer? or is it something else? Maybe I should turn it off.

I do notice that when I select Security on C:, the owner is Administrators, but Administrators has no permissions set. And Administrator (singular) has only read/execute, list, read. Everyone has everything. And System has read/execute, list, read. CREATOR OWNER has nothing.

>>Before we conclude that you do have something like the timing problem, try to get these done in order

I did this stuff on my old server the other night and to run through it once took more than an hour. I'm thinking of setting it up as a DFS as per:
http://www.experts-exchange.com/Operating_Systems/Win2000/Q_21534761.html

Then that way I can take the box down and do messing around to my heart's content without shutting down my sites. Nothing like adding more potential problems to the mix though. But I just don't see how I can get this fixed without having a backup in place. I am so fearful of ruining the only box I have.

So today, after I get breakfast, I'll be working on that.

BOBi

 

by: GinEric Posted on 2005-08-23 at 21:20:46 ID: 14739663

No, you did not break anything by doing a chkdsk and defrag; most probably Windows just skipped over any bad blocks and kept them in its list of bad blocks, no harm done.

I believe in having floppy drives, even if the rest of the world doesn't.  Floppies have saved the day when all of the other super computer devices just sat there with "duh . . " and egg on their faces.  Keep the floppy; it's the smartest thing you can do.  Buy more at flea markets, before the nonces have their way and make them obsolete.

Okay, the new server setup continues; you have a base 34.4 gig drive.  A good idea, well within needed range.  However, I would consider adding another drive for simple redundancy, in addition to the removable or hot swappable drive of 300 gigabytes.

Why I would do this?

Disk caching for the page file system is faster across two disks because both can be accessed simultaneously and/or the access time masked while the disk cache is done in the background.

Secondly, I would have set up the drive for optimum partitioning.  That is, a schema that goes something like this:

Drive 1: two partitions, one 30 gigabytes, and the remaining on an extended partition divided into two caching partitions.
Drive 2: [this is not the hot swappable or removable drive!]  probably 200 to 300 gigabytes partitioned into at least two or three spaces, with two cache spaces.

The partitioning looks something like this:

Disk or Partition     Volume Label    
C                     <any name>
D                     <another name>

and so on.  I've started a much overdue document on setting up a Server at:

http://www.musics.com/manhtml/Windows/Partitioning/PartitioningWindows.html

The document is fairly complete as of its first draft and should be useful.

I completely agree that you should have a backup in place.  Be careful with Distributed File Systems; they have their own nightmares and are more geared toward large systems.

You might consider a mirror for the 30 gig drive though.

The amount of time to defrag is a function of the size of the disk.  Plainly put, 300 gigabytes on a single partition is just asking for trouble!

See the document above.

What basically happens is that people generally do not do much planning for their Server Systems, and discover problems with their configuration, pre-setup, and setup afterwards, when it is often too late to make changes.

I sincerely wish that Microsoft had never got into their Indexing; with Explorer embedded in Internet Explorer, and gazillions of histories and logging of every keystroke, then constantly snooping around with what seems to me to be the world's slowest indexing scheme, the computer is spending entirely too much time doing nonsense, instead of its job, to serve.

I have been looking into that indexing check box, and it seems to be a child of Fast Find, which actually made Windows slower in finding things, than it could in Windows 3.1 days.

You're not doing too bad, but don't overdo it; accept that you'll have a crash here and there until you either exercise the machine or wear it in a little, and until you get that other box to back it up.

Don't simply try everything in the Internet book; remember, if it's working don't try to fix it.  And one crash at 4:00 A.M. means it's working better than one crash every hour.

If you uncheck the indexing, it will ask you if you want to unindex for every file on disk.  This will take a lot of time too.  However, I unchecked that box last night on a Windows Server and now Explorer does seem to respond much faster to browsing folders and so on.

Be careful with the DFS, and yes, you should have been running the Server from an NTFS partition.

Again, read:

http://www.musics.com/manhtml/Windows/Partitioning/PartitioningWindows.html

What my Windows Server does is to provide a custom menu from a DOS partition, which allows me to select which Operating System to start.  I have opted to have three hard drives, and the Production Server Operating System is on the second drive, the D drive; I simply set it as the default boot.

I'll also post a copy of the menu for people.  The server has basically been running for well over ten years without a reinstall, bad crash recovery, virus bring down, etc..

The entirety of the technique requires that the partitioning and setup of the server be well planned, something that's a little late unless you have a backup server while you do it.

One last note:  no need to put any pagefiles on a hot swappable drive.  In fact, it might prove detrimental.  I haven't thought much about partitions, but I can't see them doing any harm, depending on what the hot swap drive is used for.

And I leave a few one gig areas unpartitioned on disks, in case I later need them.  And, the pagefiles I put on a partition leave some bytes, like I only use 1 gig out of a 1.5 gig partition, otherwise, if the cache fills up it will start popping up windows telling me the disk is almost full.  So, the partitions for pagefiles in the example although 1.5 gig in size, are only assigned a 1 gig pagefile each.

I don't care about an unused 1 gig space with hundreds of gigs on multiple disks.  Well, at least I hope it never fills up.

The Microsoft "Restore" function was based on this partitioning scheme.  I use one disk, as you'll see, for a complete Operating System Image from which a crashed system can be recovered.  Dell also uses it for their repair or restore partition, which they now seem to put into every system they ship.

DFS and replication also take time away from your cpu and server, so weigh your decision carefully.  Generally, it is used in multiple server systems in very large networks.  Read more about it online.

And, I have to add this, if the P4 was anything of a problem, it turned out to be that it was too fast!  Secondly, it does require some special heat dissipation.  Many server operators using P4's have opted for cooling towers in the place of just a fan blowing air.  And heat will cause memory related problems.  There are other posts here on EE about the P4.  See if you can find any and if they are of any help to you.

ciao for now.


 

by: BobCSD Posted on 2005-08-23 at 21:54:15 ID: 14739769

>>However, I would consider adding another drive for simple redundancy

Oh, I forgot, both boxes have a mirrored drive as well. Sorry.

 

by: BobCSD Posted on 2005-08-23 at 22:13:11 ID: 14739833

>>Secondly, I would have set up the drive for optimum partitoning.  

Can you partition a drive after the server is already setup and in use? Or does this require formatting and starting over?

>>I completely agree that you should have a backup in place.  Be careful with Distributed File Systems; they have their own nightmares and are more geared toward large systems.

I don't think the point of the DFS is going to work for me as I have IIS 5 and I think it takes IIS 6 to right click and export... I'll just program on one box and copy over the files I guess....

>>You might consider a mirror for the 30 gig drive though.

Yes, all 4 boxes, old and new, have mirrors. Forgot. Sorry.

>>300 gigabytes on a single partition is just asking for trouble!

That's my USB removable Maxtor, not my server. You think I should partition it? I use it for backing up and like never running out of disk space on the daily, weekly, monthly, etc. backups.

>>no need to put any pagefiles on a hot swapable drive

What is a hot swappable drive? The USB Maxtor?

I am setting up the backup box now and copying over the files. I got my network setup tonight and the test site up and running and the DNS records setup with the additional IP's and the firewall setup to point the remote to the new local box.... this is pretty exciting doing something that is actually working!

BOBi
(not a hardware tech, if you couldn't tell ;)






 

by: GinEric Posted on 2005-08-25 at 10:18:22 ID: 14754178

I was asked to formalise the troubleshooting procedure for Twilight Zone Crashes, so I used your dumps as sort of a basis.

http://www.musics.com/manhtml/Windows/TwilightZone/ProcedureCrashFix.html#Procedure

Unfortunately, repartitioning requires that you start over.  Some people have used Partition Magic, but from what I've seen they've all eventually had problems.  My guidelines:

http://www.musics.com/manhtml/Windows/Partitioning/PreSetupServer.html
http://www.musics.com/manhtml/Windows/Partitioning/PartitioningWindows.html

some reading for now.

If your systems are truly mirrored [RAID 1, and not other RAID implementation], you will have to think and take notes, and consider that you must provide mirrors of exactly the same size, that is, two separate partitions on two separate disks for each mirror set.  Stripes, Parity, and the other RAID's are not mirrors!  Mirrors are "exact" copies of a disk on another disk.  When one is corrupted, you fix it by issuing a command to "break the mirror" so that the system reverts to the one disk in the mirror set that it finds uncorrupted.  Thereafter, you fix the corrupted disk and remake the mirror.  You do not approach this with other RAID methods, but by Microsoft Documentation on the Mirror Set.

If you follow the partitioning scheme outlined, you will have to calculate how to fit mirrors into the scheme.

Hot swappable means removable, with the added feature that the machine does not have to be powered down; thus, the swapping of drives can be done while the box is "hot."

In your case I'd consider the USB Maxtor hot swappable.

Since it's swapped in and out, partitioning is entirely up to you.  If 300 gig works, use it; as long as you feel confident with it.  That's an awful lot of data, big database?

In servers and multiple servers, we try to divvy things up as follows:

01.)  Operating System - one disk or partition, stands on its own, no access by apps or db
02.)  Applications - one disk, or partition, stand on their own, user files and docs on DataBase
03.)  DataBase - one disk, or partition, information storage only, no applications.

Now you can vary and plan this as you see fit.  The basic idea is to keep all applications and all database, information, and the like, from writing to the Operating System disk or partition.

To keep applications that are non-database off of the database and operating system disks or partitions.

To keep database operations off of the operating system and applications disk or partitions.

User profiles, of course, remain on the Operating System's disk, but things like documents, accounts, personal pages, and the like, are kept on a different disk.

In fact, the idea of a separate server for web services and email conforms to this concept, keeping them, as well, off of all other areas - operating system, applications, database; unless web and email are integrated into a database, but even in that scenario, we can further delineate separate areas and partitions for web databases and database oriented email.

Which is why "Planning" is number one in networking.

What is best?  Which is most efficient?

With even two servers, maybe six disk drives, the throughput can be increased six times if operations for different services are on different disks because, remember, these modern systems can access multiple disks simultaneously and effectively mask the access time to near zero access time.  Memory operations are near transparent to code execution, therefore, far less cpu time is used and far less wait time is queued.

And as a laughable comment, "that's why we designed it that way."  Someone forgot to tell people about it!

Very glad you got your backup server up.

Please take some time to read all of the documentation at :

http://www.Musics.com/manhtml/Windows/

Comments and Feedback appreciated.

 

by: cpc2004 Posted on 2005-08-25 at 10:49:23 ID: 14754511

Hi Eric,

I haven't gone through your document.  I find that some of the explanation is incorrect. You still mix up IRQ and IRQL. Bob's system crashed at IRQL 2, not IRQ 2. A routine executing at IRQL 2 or higher cannot take a page fault, and this is the meaning of "IRQL_NOT_LESS_OR_EQUAL".

>>>And, as another hint about how IRQ>1 is used, from the Registry, Plug and Play devices, somewhere under CurrentControlSet or the others, ControlSet00, etc., IRQ 2 is pnp0200 under ROOT\*PNP0200\PNPBIOS_2

This comment is not related to Bob's problem. If you want to explain IRQ 2, don't use Bob's minidump.

There are a lot of good webpages to guide us on how to use windbg.
http://www.codeproject.com/debug/cdbntsd6.asp.  

I am not intelligent enough to understand your document, as you write the document from a hardware point of view. I think Bob is more interested in finding the culprit of his problem.

 

by: BobCSD Posted on 2005-08-25 at 11:18:37 ID: 14754857

>>If 300 gig works, use it; as long as you feel confident with it.  That's an aweful lot of data, big database?

The Maxtor is only used for my backups. No, I have a 300 GB Maxtor on the database server as well; this one is for the web server.

They both are for backups.

I do a daily backup, weekly, monthly and keep old versions around a while, so that allows me plenty of room for backups without it filling up the drive too soon. I do system state, complete backup with file changes, etc.

Plus I swap the backup units on the machines occasionally so that both will have backups of both machines on them in case one back up goes bad. So they both have data from two machines (3 if you count my entire hard drive of my development machine as well.)

It's probably overkill, but doesn't cost that much more for the 300gb, so I figure why not go for the gusto and never run out.

 

by: BobCSD Posted on 2005-08-25 at 11:20:13 ID: 14754870

>>Unfortunately, repartitioning requires that you start over.

With the server only having 36 gb, is that really necessary? The old one had only 18 gb.

So do I understand this right, with partitioning:

I keep the system stuff working on one partition, while the regular activities of the website are running on the other partition, and it allows them both to work at the same time. Without the partition, they have to take turns working?

So partitioning improves performance?

 

by: BobCSD Posted on 2005-08-25 at 11:21:31 ID: 14754884

>>02.)  Applications - one disk, or partition, stand on their own, user files and docs on DataBase
>>03.)  DataBase - one disk, or partition, information storage only, no applications.

My database is on an entirely different webserver, but I guess if you're talking specifically about my database server, then this would still apply.

 

by: BobCSD Posted on 2005-08-25 at 11:31:24 ID: 14754993

>>The basic idea is to keep all applications and all database, information, and the like, from writing to the Operating System disk or partition.

Back in the old days, I had a partitioned drive, and the problem with that was that I ran out of space on programs partition and didn't have room to add anymore. So after that I quit partitioning so that my partitions wouldn't run out of room. It was a real mess.

>>In fact, the idea of a separate server for web web services and email conform to this concept
I have a web server and a database server on different boxes.

I also have my mail server and chat server on the same box as my web server. I am thinking of moving them to their own server so that if the web server goes down and I have to move to my backup webserver, I don't have to maintain the chat and mail server on my backup box as well. Both of those are not at all utilized very much. The mail typically just does outgoing mail that I send. I don't have thousands of users using my mail server, because that is not my business. It is just for me sending mail through it to my members, etc. The chat server can host thousands, but sadly only has about 20 given members in it at a time, maybe 5-10 on average. But still, they are taking up space and being utilized.

>>With even two servers, maybe six disk drives, the throughput can be increased six times if operations for different services are on different disks because, remember, these modern systems can access multiple disks simultaneously and effectively mask the access time to near zero access time.  Memory operations are near transparent to code execution, therefore, far less cpu time is used and far less wait time is queued.

Good to know. Do you have recommendations as far as the size amount for each partition? I haven't looked at your document yet, does it cover that? I just don't want it to be too small and run out of room, or too much and waste space in it that will never be used. Now I have to talk my husband into formatting and starting over, and so far with a quick check out at the pond where he is knee deep in water, he is not very happy with the idea. ;) I think I might be doing it myself. It's hard to find good help nowadays.

There is so much in this question/answer to do and think about. I think I'm going to close this out. If you want to keep adding to it after it is closed, feel free to, I will read it, but I don't want folks to think I'm still waiting for help and it will certainly take  me forever to go through all this stuff.

Thanks!

BOBi

 

by: BobCSD Posted on 2005-08-25 at 12:11:40 ID: 14755434

I am accepting this as the accepted answer...

>> Because both backup and antivirus started at 4:00 A.M., it is very likely that you were sucking up all available memory, running at an unusual pace and cpu time percentage, and that perhaps backup collided with antivirus in the storage of the dif files for your backup, or antivirus may have added changed files, locked a changed file, or whatever, and the two could not get along; backup and antivirus.

because....

The ntbackup started at 4:00 am and did complete
The mcafee virus scan started at 4:00 am and didn't complete....
Also, the database server was doing a backup at 4:00 am as well, which was on a different box, but site users would still be accessing the database from the webserver, thus still putting load on it remotely.

The shutdown started at 4:00:57 am.
The first event announcing the shutdown was at 4:11
And the memory dump was announced at 4:22

I haven't had any of this particular dump since then. I do intend to go through this and do various of the tuning suggestions to improve my server and I appreciate everything!

BOBi

 

by: BobCSD Posted on 2005-08-26 at 16:37:04 ID: 14766073

GinEric,

I have an idea.... since you're writing a book.... How can I reach you?

BOBi

 

by: GinEric Posted on 2005-09-15 at 19:25:42 ID: 14895036

BobCSD,

While finalising the draft on Windows Twilight Zone Crashes, I had to come back to this thread for reference, since it was such a comprehensive effort on all of our parts, you, cpc2004, and I.

One thing I want to clear up for cpc2004, and others, IRQL > 1 is IRQ = 2, and it is the redirected IRQ from the PLC to Virtual Devices [Plug and Play devices] and it includes the Virtual Address space at C0000000 by definition.  At various places, references were made to Microsoft's definitions of IRQL, and they proved that, indeed, the pagefile is accessed above this level.  Since the Virtual Space is on "devices" and the pagefile is a "device" on hard disk, the memory corruption can most assuredly occur in the pagefile and thus a pagefault can be the result of this error event.

It is best to "never say never" with a computer system, computers do, in fact, make mistakes.  I actually had to prove this many years ago to the people who kept chanting "computers never make mistakes."  Yes they do, and they can go unreported as well, that is, they can make a mistake that no one will ever know about.

It has been proven, once and for all time, that computers do make mistakes.

The assessment that the "load" was the cause of the error effervesced from the discussion.

And that is the point of what the Continental Congress called "arguing" as in the realm of a debate.  Arguing per se is not a "bad thing," since, apparently, America is founded on arguing, which continues to this day.  It is a positive thing.

You can reach me, BobCSD, through my website, or simply use James @ Musics.com



 

by: BobCSD Posted on 2005-09-16 at 02:14:23 ID: 14896482

GinEric,

I had moved everything over to my backup computer and have been operating on it for several weeks. My other computer, when I log into it, or network to it, or whatever I do to it, typically reboots itself with 0x0000001A MEMORY_MANAGEMENT or 0x0000004E PFN_LIST_CORRUPT errors....

does a code dump... etc. Got one of each of these today.

I have asked Larry to take the memory out and get it tested or replaced. Seems like there should be some warranty on that. Meanwhile, the computer is worthless and I just won't/can't use it for anything! Not even as a backup box. Stupid machine.

BOBi

 

by: GinEric Posted on 2005-09-16 at 07:30:13 ID: 14898306

Really quite a shame.  Do you have any of those new dumps?  This question has piqued more than a few people's curiosity.

I have a whole section on crash dumps now,

http://www.Musics.com/manhtml/Windows/TwilightZone/

which refers to two of your questions here.  PFN has something to do with pointers into the pagefile.  Again, it appears as if addressing is bad.  If the processor is a P4, I am more reliant on the problem being timing of the buss, rather than the RAM which came from a reputable company.  That's not to say they don't have bad ones occasionally, but it is rare.  If this is the SCSI driver indirect vector reference, as I suspect, it quite literally boils down to the motherboard being "too fast" for the software.  I'm still trying to relocate that write-up by Intel and the Linux people because it is a software problem and not a hardware problem.

I'll keep checking.  Meanwhile, if you have any dumps, send them on.

 

by: BobCSD Posted on 2005-09-16 at 09:20:07 ID: 14899360

The thing is, we built two computers, exact same hardware. Both with Windows 2000. But one was for the web server and the other for the database server. We haven't had a lick of problems with the database server (argh, I know I'm jinxed now!)... but all the problems are with the web server. So if the motherboard were too fast on one, it should be too fast on the other.

I saw this in another post, from CrazyOne:
http://www.experts-exchange.com/Operating_Systems/Win2000/Q_20373431.html
DocMemory PC RAM
Diagnostic Software
http://www.simmtester.com/PAGE/products/doc/docinfo.asp

And rather than keep spinning my wheels with dump files, paging, scan disk, chkdsk, and stuff, I'm going to test the RAM with that utility today, now that the box is not in use. See if I can either rule out the RAM or know it is the RAM.

I saw another post on here, which I can't find now, that points to http://support.microsoft.com/?kbid=291806 in regard to PFN_LIST_CORRUPT, and it indicates in item #4: If you receive this error message randomly, or when you try to start a program, remove extra memory or have the random access memory (RAM) in your computer tested. This behavior may occur if you have bad RAM.

BOBi

 

by: cpc2004 Posted on 2005-09-16 at 10:08:13 ID: 14899750

After reading over 1000 minidumps in several forums, I am confident that it is faulty RAM. Some faulty RAM can pass memtest. You can try downclocking the RAM or reseating the memory stick in another memory slot.

My previous post
<<<<
I do not find any memory corruption in the dump. The page fault handling routine is a well-tested routine. It crashes only if there is a hardware error or a corrupted paging space. I suggest you re-allocate a new paging space as a circumvention of the problem. After re-allocation of the paging space, if the BSOD still occurs, it must be faulty hardware.
>>>>

 

by: BobCSD Posted on 2005-09-16 at 12:37:36 ID: 14900787

I downloaded and installed the free doc memory utility on my windows 2000 server. I created a boot floppy and it brings up the screen, but at the bottom has this error:
 
run-time error M6103: MATH
- floating-point error: divide by 0

don't know if that's yet another problem with my machine or their application. I wrote them.

 

by: GinEric Posted on 2005-09-16 at 16:12:59 ID: 14902115

I'm not saying it's not bad RAM, however, I disagree completely with Microsoft and other software people who will blame another reputable company rather than explain the exact problem.

Microsoft cannot explain the exact problem because Microsoft does not employ people with hardware design engineering experience, nor education.

There are a number of ways in which a simple technician can blame the RAM, replace it with a different RAM, and see the problem go away, but with one catch, the real problem was never solved because the new RAM is simply masking the real problem.  RAM will be indicated, but will not be the cause when:

01.)  The RAM is faster and requires more power than cheaper RAM [cause: insufficient Power Supply]
02.)  Memory tests will fail if not for the specific configuration [cause: bad memtest]
03.)  One RAM fails while another doesn't [cause:  wrong type RAM]

and many, many more.  Unless Microsoft and software people can state, unequivocally, and in full detail right down to the substrate issue, and show the proving timing study diagram, that they know the reason and can repeat the failure, every time, they have not found the failure, period.  They are simply "passing the buck."  I see this all the time at their support site; like the "small timing problem" statement; that statement is obviously given as vague and disingenuous because they do not want to explain what they mean by a "small timing problem."

Nor does their partner, Intel.

As an engineer, one who has designed computers, even those upon which all of Intel's microprocessors are based, I know that either what appears to be a hardware problem is actually an "uninformed software" problem, or what appears to be a software problem is an "uninformed hardware" problem.  The approach of the engineer is to exactly identify the cause, not to simply make a replacement that works in some percentage of cases; not even 99% of all cases.  You don't want a levee that works 99% of the time, but fails devastatingly on the 100th time.  It would be the same requirement for something like a Stealth aircraft, or a commercial jetliner; a guess is as good as a mile and can be shown to be catastrophically costly.

We who do such analyses differ only in the level of acceptance of the cause of a failure; my particular training and responsibilities have required that I be absolutely 100% sure of the cause and can identify it and repeat it in any demonstration.  300 lives or 30,000 lives may depend on it.

As professional engineers, we are taught to ignore requests by administrative and business interests for a "quickie" solution or a "patch" and to come up with the real cause.  Personally, I've seen tens of thousands of dumps and have analysed all of them.  Both on mainframes and microprocessors.  Real dumps printed out on 132 column paper, some as much as a foot thick.  I've replicated the failures using various techniques and equipment, including Biomation oscilloscopes in the gigahertz range and Logic Analyzers with multiple traces, 16, 32, whatever.  I have done Timing Analyses Reports thereafter for Time Studies to report to the hardware manufacturer the exact cause of a failure, including many microprocessors and other chips which show more than a 1% failure rate.  That is the general level of considering a recall and a "bad production run."  The same applies for motherboards and other printed circuits.  In 99% of those cases, it is a substandard power configuration and/or a substandard timing specification.

The actual failure of memory is only attributable to that memory hardware in less than one in one million cases; it is nearly always the "skimping" by other manufacturers of motherboards and other devices for the sake of reducing their costs and thus selling a less than quality product.

And they can blame the RAM because there are so few people actually knowledgeable enough to identify the exact cause.  While the RAM manufacturer can afford to replace it for free, even when they know that some of their "technician" customers have not used grounding properly, and home users simply don't know how to do that, and that the RAM was not the original fault, they are more than willing to simply satisfy their customer with a replacement.  There is a reason that the RAM is sent back to them; they have a whole department that will apply engineering techniques to find the real cause and not simply a guess.

The dumps show that there was a memory corruption and an invalid address; in fact, this is clearly stated by the dumps themselves!  I have no idea how anyone could miss this, or not understand it in clear, plain English.

Why would anyone not understand that :

0x0000001e  (0xc0000005, 0xa003ee9f, 0x00000000, 0x00000001)
Exception   (No Access,  Address,    Read,       Index)
"An attempt was made to access a pageable (or completely invalid) address . . ."

means what it says?  The address itself was invalid; the "address" part is gotten from memory, and that memory was corrupted somewhere along the line; "C" or "Charlie" is a device, and "5" is Access Denied, a general protection fault.  It basically says that "You are not allowed to read that address."  Calculate that address in decimal and you will see why:

0xa003ee9f translated is 9FA003EE = 2678064110

which address was a result of the memory in a Virtual Address Space [0xc0000005] which means it came from a device.  Now you can treat RAM as a device, but this is a very, very, bad programming idea.  It will inevitably conflict with an address and/or itself because RAM addresses are real addresses.
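
For anyone who wants to check the arithmetic, here is a minimal Python sketch that converts the hex values quoted above to decimal. It only does number conversion and one range comparison; it does not settle what the address means. The variable names are illustrative, and the 0x80000000 boundary assumes the default 2 GB / 2 GB user/kernel split of 32-bit Windows 2000 (no /3GB switch), which is an assumption about this server.

```python
# Convert the hex values quoted in this post to decimal.
# Variable names are illustrative, not part of any debugger API.
reported = 0xA003EE9F      # second bugcheck parameter, as printed in the dump
rearranged = 0x9FA003EE    # the rearranged form used in the calculation above

print(f"0x{reported:08X} = {reported}")
print(f"0x{rearranged:08X} = {rearranged}")

# Assumption: default 32-bit layout, user mode below 0x80000000,
# kernel mode at or above it.
KERNEL_BASE = 0x80000000
region = "kernel-mode range" if reported >= KERNEL_BASE else "user-mode range"
print(region)
```

Note that the two forms give different decimal values, so it matters which one is used when comparing against the module listing.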

The second and major dump provided more detail:

0x0000000A (0xC0074EE4,  0x00000002, 0x00000000, 0x80443637)
IRQL>1     (Virtual Mem, IRQL = 2,   Read,       Address)
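
The four parameters can also be pulled out of the Save Dump event text mechanically. The sketch below is a rough illustration in Python: the regular expression is written against the wording of the event quoted at the top of this thread, the field labels follow Microsoft's published description of bugcheck 0xA (memory referenced, IRQL, access type, instruction address), and the function and key names are my own, not part of any Windows tool.

```python
import re

# Text in the shape of the Save Dump event quoted at the top of this thread.
EVENT_TEXT = ("The bugcheck was: 0x0000000a "
              "(0xc0074ee4, 0x00000002, 0x00000000, 0x80443637).")

# Labels for bugcheck 0xA parameters; the key names are hypothetical.
FIELDS_0A = ("address_referenced", "irql", "access_type", "instruction_address")

def parse_bugcheck(text):
    """Return (code, [parameters]) as integers, or None if nothing matches."""
    m = re.search(r"bugcheck was:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)", text)
    if m is None:
        return None
    code = int(m.group(1), 16)
    params = [int(p.strip(), 16) for p in m.group(2).split(",")]
    return code, params

code, params = parse_bugcheck(EVENT_TEXT)
decoded = dict(zip(FIELDS_0A, params))
# Simplification: 0 is a read; any other value is reported here as a write.
decoded["access_type"] = "read" if decoded["access_type"] == 0 else "write"

for key, value in decoded.items():
    print(key, hex(value) if isinstance(value, int) else value)
```

Keeping the raw integers around makes it easy to compare the instruction address against a loaded-module listing later.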

The Microsoft debugger said, and I quote:

"An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses."

What part of "pageable" does one not understand to mean "pagefile?"  The pagefile is on disk, not in RAM, although it may have failed during a transfer to RAM.  Again, "0xC" means, a "device," such as a hard disk; it does not mean RAM.

And:

"BugCheck A, {c0074ee4, 2, 0, 80443637}
Probably caused by : memory_corruption ( nt!MiFreeMdlTracker+cb )"

when coupled with the above "This is usually caused by drivers using improper addresses" suggests that the "driver" may have miscalculated the address in tracking the Free Memory in the disk cache, i.e., the pagefile of pagefile.sys

So, Microsoft says it's in the pagefile and it was probably caused by a driver.  How does that statement translate, in any way, to affirming the statement "I do not find any memory corruption at the dump.(?)"

When Microsoft's first statement on the dump was "Probably caused by : memory_corruption ( nt!MiFreeMdlTracker+cb )[?]"

Perhaps because the author of the debugger considers a bad dword to be the result of a bad pointer in the Plug and Play driver [IRQL and/or IRQ on 9/2, the PLC used for Plug and Play] that resulted in a bad fetch [Access Denied] from a device [0XC0000005] to be, in and of itself, "memory corruption."

Getting through all of that technical jargon, it still boils down to an "Invalid Address."

That can happen for only one of two reasons:

01.)  The driver did not have sufficient permission to access this area of memory
02.)  The value found at this memory address or the pointer to it was invalid.

As of a few years ago, Intel, AMD, and others, incorporated something into their hardware which tells the software if the data at any address is valid, that is, "an initialized operand" with a tag field in the data itself called the "Data Valid Bit."  If this bit is not set, then the data, usually an array, has never been initialized.  That means that the expected values have never been stored there.  If that area happens to be an array of pointers [an index table, such as would exist when any memory, disk cache or RAM, is indexed by an array of pointers] as opposed to actual data or code, then the returned error will be "Invalid Address."  Microsoft, apparently, considers this to be called "memory corruption."

If IRQL>1, that is, 2 or above, this is the same as "Lexicographical Level" or "Control Mode" or "Supervisory Mode" outside of the Operating System whose Layer 0 and Layer 1 Routines "Only" are allowed to access this device, i.e, make a direct call to the disk device.  A driver is not allowed to access the disk directly; this can only be done by the Operating System itself, and that only at the most protected kernel area.

If a driver is written in such a way that it attempts to bypass this protection, it will get a "General Protection Fault."

Now, surrounding all of the error in the dump and debugger analysis are:

bff83000 bff98180   atapi    atapi.sys    Tue Apr 01 13:08:25 2003 (3E89D599)
bff99000 bffba9c0   dmio     dmio.sys     Wed Jan 15 14:47:04 2003 (3E25BAB8)
bffbb000 bffd75a0   ftdisk   ftdisk.sys   Thu Dec 02 22:29:58 2004 (41AFDDB6)
bffd8000 bffffc20   ACPI     ACPI.sys     Wed Jan 15 14:44:22 2003 (3E25BA16)
Note the gaping memory hole here from C0000000 to near F0000000, the usual BIOS cache areas
f6400000 f640e6a0   pci      pci.sys      Wed Jan 15 14:44:07 2003 (3E25BA07)
f6410000 f641b680   isapnp   isapnp.sys   Wed Jan 15 14:43:47 2003 (3E25B9F3)

as I outlined in the analysis.  If you look at the overall picture, it is all about the atapi disk driver [atapi.sys], the DMA I/O Controller [dmio.sys], ftdisk, the ACPI system, the PCI system, and the isapnp Plug and Play driver.

Disk, Plug and Play, the PCI buss, the ISA buss, the DMA buss, and the ACPI Virtual Memory Devices area.

The only association with RAM is the DMA Controller driver part of dmio.sys
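
A listing like the one above is easy to tabulate. The following Python sketch parses the six module lines, computes each image's size and the distance to the next entry, and decodes the trailing hex value. It assumes that value is the image timestamp in seconds since 1970 (UTC), which is consistent with the printed dates apart from the time zone; the function and key names are hypothetical. The gap it reports between ACPI.sys and pci.sys is only the distance between two entries in this excerpt and says nothing about what occupies that range.

```python
from datetime import datetime, timezone

# Lines copied from the listing above:
# start, end, module name, file name, printed date, hex timestamp.
LISTING = """\
bff83000 bff98180   atapi    atapi.sys    Tue Apr 01 13:08:25 2003 (3E89D599)
bff99000 bffba9c0   dmio     dmio.sys     Wed Jan 15 14:47:04 2003 (3E25BAB8)
bffbb000 bffd75a0   ftdisk   ftdisk.sys   Thu Dec 02 22:29:58 2004 (41AFDDB6)
bffd8000 bffffc20   ACPI     ACPI.sys     Wed Jan 15 14:44:22 2003 (3E25BA16)
f6400000 f640e6a0   pci      pci.sys      Wed Jan 15 14:44:07 2003 (3E25BA07)
f6410000 f641b680   isapnp   isapnp.sys   Wed Jan 15 14:43:47 2003 (3E25B9F3)
"""

def parse_modules(listing):
    """Turn each listing line into a dict with addresses, size and build time."""
    modules = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        start, end = int(parts[0], 16), int(parts[1], 16)
        stamp = int(parts[-1].strip("()"), 16)  # assumed Unix-epoch seconds
        modules.append({
            "name": parts[2],
            "start": start,
            "end": end,
            "size": end - start,
            "built_utc": datetime.fromtimestamp(stamp, tz=timezone.utc),
        })
    return modules

modules = parse_modules(LISTING)
for prev, cur in zip(modules, modules[1:]):
    gap = cur["start"] - prev["end"]
    print(f'{prev["name"]:>8} -> {cur["name"]:<8} gap 0x{gap:X}')
```

To check whether a faulting address falls inside one of these images, compare it against each module's start and end values.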

At the time of the failure, smtp.exe "simple mail transfer protocol" was executing a thread.  You said that the server worked for your database server, but not for your web server.  I am not sure of any relevance of smtp running during this error; however, it is evident that a web server handles email more so than a database server, so there is a possibility that smtp.exe is only called when the server is a web server, and not when it is a database server.

PFN_LIST_CORRUPT : what is it?

Page Frame Number

A Descriptor List of pointers to other things, like paged memory areas, some organized in frames.

Again, this type of "memory corruption" is the result of a bad address; usually, in a chaining of index calls such as a pointer that points to another pointer, and so on, that finally results in the access to a data or code segment.

There are two oddities in all of this:

01.)  Compiled drivers often call an index to another index which fetches a Matrix Array Datum
02.)  SCSI, PCI, and DMA busses often misindex one of the two pointer indexes.

The current "too fast" problem of these busses.  You have an operator which does two indexes simultaneously.  For example Vector Index EAX, EBX, [dY, dX], n, m

Which says Index the Vector Matrix XY by Xsub0=m, Ysub0=n.  This is your machine doing Differential Calculus at the hardware level!

The points, in space, are Xm and Yn in two-Space using Sigma Sum in Newton's Method of Approximation.  This is an extremely fast method of rendering, often seen in high resolution motional video graphics.

The problem is, the indices may never be zero simultaneously, a fact both in Physics and in computer software.  Why?  Because f[t]=0,0 is the Origin and the Origin is reserved for Base Descriptors only.

What that means in English:  The mother and father of all pointers is here at 0,0 and you are not allowed to access them and use them as code or data - you get an Invalid Address.  Certainly, no level 2 or above driver may ever even access them.

Why they get accessed by error:

The buss is not initialized, that is, 7 clocks have not been issued in order to get the actual address to the DMA, SCSI, or PCI buss!

The solution:

Add 7 no ops to the driver software at the beginning to delay the driver routine long enough for the buss to initialize.  7 no ops are effectively 7 clocks.

And that's what we mean by "the machine is too fast."

This is the third possibility between your failures and the two machines; maybe the other server is slower in its clock rate, or the buss architecture is different, i.e., it takes longer for the pointers [addresses] to arrive so that when they get there the buss is already initialized.

This happens because the DMA in 64-bit machines does not have to diddle 64-bit addresses down into two 32-bit parts, so the 64-bit address arrives nearly instantly, while the fake 64 [as a two or more clocked pair of 32-bit address parts] on a 32-bit machine does not.

And that can cause RAM to be read before it has an address, thus, something tries to effectively read address 0 and that is not allowed by any program other than the very basic kernel Operating System software.

This is a "known" Intel and other microprocessor problem, as that "small timing problem" that Microsoft refers to as such.  And it is a software problem, not a hardware problem.  The hardware is simply faster than the authored software that tries to implement the vector indirect addressing without consideration to proper timing and initialization.

With two processors, and up to four DMA busses each, eight fetches on eight addresses can occur simultaneously.  And this can continue in a stream.  If the code in the calling function cannot handle 8 words simultaneously, or four double precision words simultaneously, then invalid or corrupt data can be the result.

I know I have been very technical, but I have tried to be simple and say that it is possible the board you have is "too fast" and all of this is what I meant by that summary statement in laymen's terms.

I have conceded that it "may" be RAM, but I will not say that it is "absolutely RAM."  That holds even if changing the RAM fixes it, because that may only have changed the characteristics of the machine and still not have solved the problem permanently.

It is more time efficient to just swap out the RAM, agreed, but there will be those cases where this simply changes the failure from once a day to perhaps once a week, or whatever, and therein lies the danger: a system that fails inexplicably from time to time because it was never truly fixed.

A quick fix or patch was simply added to make it appear to perform better.

As I've said before, I have had this "go round" with Intel and Microsoft about their Plug and Play, and warned Intel that it would come back to haunt them.

BOBi, switching the RAM at this point will be a good test.  If it doesn't work, you'll have to consider the other possibilities.

Again, I have taken this time to write at length because this problem is cropping up all over 64-bit machines; it's not just isolated to a few RAM problems or any one motherboard, it is growing, like a cancerous monstrosity.  I am including these writings in my research for a book and posting on my site.  I realise that my responses may be lengthy and technical, but I also hope I've included enough simple terminology to break it all down to the average person.  I hope it makes thoughtful reading.

If you're going to build a 64-bit machine and write 64-bit capable programming, then the whole machine must comply with 64-bit architecture in every minute detail, and that means:

A 64-bit IRQ Buss!

 

by: BobCSD Posted on 2005-10-18 at 09:33:32 ID: 15109019

Bottom line: after having numerous errors for two months, including this one, I finally downloaded an app to test the RAM. It was corrupted. Replaced it. That fixed all the numerous errors I had been getting, and the machine has been running like a dream since.

So even if someone tells you not to mess with the ram and hardware until you've tried a jillion other things, please download a RAM testing app at least and test it. I didn't know it existed until I ran across it.

I accepted answers that weren't the answer simply to clean up the questions that had been around so long. The answer was bad RAM.

BOBi

 

by: cpc2004 Posted on 2005-10-18 at 09:52:51 ID: 15109173

Hi Bob,

Refer to my post on 16 Sept
>>
After reading over 1000 minidumps in several forums, I am confident that it is faulty ram.
<<

 

by: BobCSD Posted on 2005-10-18 at 09:58:14 ID: 15109222

then I owe you too...

I apologize for not giving you credit on this and will see if I can remedy it.

I wish someone had told me that I could test the RAM. Maybe they did and I hope no one else goes through what I went through.

I'm learning that most of the time the simplest solution is the solution. I don't have to read a book to get to the answer. :)

BOBi

 

by: cpc2004 Posted on 2005-10-18 at 10:08:11 ID: 15109298

Hi Bob,

Refer to my post on 21 Aug.    

>>
The failing routine is W2K MiDispatchFault and it is W2K task dispatcher. It is unlikely to fail, as millions of users are using this routine daily. It does not fail unless there is a hardware error.

My preliminary finding: I am very sure that it is a hardware problem. Most likely it is faulty RAM. Based upon my past record, I will rate the possibility of the error as 70% RAM, 20% CPU and 10% M/B. As hardware errors occur randomly, if W2K keeps on crashing with different bugcheck codes, it is a symptom of a hardware error. If W2K always crashes at the same instruction address and bugcheck code, it is probably a software error.
<<
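
The rule of thumb in that quote (varied bugcheck codes point to hardware; the same code at the same instruction address points to software) can be written down as a small Python sketch. This is only an illustration of the heuristic as stated, not a diagnostic tool: the function name, the labels, and the two-crash minimum are my own assumptions, and the sample history uses the bugcheck codes mentioned in this thread with `None` where no faulting address was posted.

```python
def classify_crashes(crashes):
    """crashes: list of (bugcheck_code, faulting_address) tuples.

    Returns a rough label following the rule of thumb quoted above.
    """
    if len(crashes) < 2:
        return "not enough data"
    distinct_codes = {code for code, _ in crashes}
    distinct_sites = set(crashes)
    if len(distinct_sites) == 1:
        # Same code at the same address every time.
        return "suspect software"
    if len(distinct_codes) > 1:
        # Crashes wander between different bugcheck codes.
        return "suspect hardware"
    return "inconclusive"

# Codes reported in this thread; addresses are None where none was posted.
history = [
    (0x1E, 0xA003EE9F),
    (0x0A, 0x80443637),
    (0x1A, None),
    (0x4E, None),
]
print(classify_crashes(history))
```

A label from this sketch is a prompt for what to test next, such as running a RAM test or swapping sticks, rather than a conclusion.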


 

by: BobCSD Posted on 2005-10-18 at 10:21:19 ID: 15109433

I'm unsure why you sent me this last reference? There is nothing there that tells me I can test the RAM or how?

 

by: cpc2004 Posted on 2005-10-18 at 10:26:50 ID: 15109478

You were more interested in GinEric's findings than in mine, and this is why I didn't propose testing the RAM.

 

by: cpc2004 Posted on 2005-10-18 at 10:40:22 ID: 15109593

Another reason I don't suggest using memtest is that memtest is not a reliable tool. I prefer to reseat the memory sticks, or take out one stick at a time, to diagnose which memory stick is faulty.  Refer to the comment by the owner of the following problem
http://www.experts-exchange.com/Operating_Systems/WinXP/Q_21505124.html

 

by: BobCSD Posted on 2005-10-18 at 10:41:05 ID: 15109602

You are right. I did give credence to his responses. And I do apologize. I didn't want it to be RAM. I wanted it to be something I could fix without taking the machine down and the RAM was brand new so I just didn't believe it.

But I would have loved to learn how to test the RAM if anyone would have told me. I asked on 8/21:

>>On startup, doesn't the system check the RAM? It doesn't indicate it is bad. Wouldn't it? It's a gigabyte of dual channel RAM. How can we verify whether it's good or bad?

But I didn't get an answer. I would have gladly tested the RAM if I'd known how.

On 8/21 I also said:

"I'm thinking now:

I check the BIOS and see what it is.... without changing it...
I change the BIOS....
I replace the RAM."

I was then told:

"There are only a couple of more steps in determining if it is an actual hardware problem before touching the hardware."

and

"So, maybe there is a serious hardware conflict, but it's not as simple as bad RAM, more like a bad choice for a production server motherboard"

and

"06.)  Run chkdsk, scandisk, and defrag one last time

If the crash on memory error recurs, you have narrowed it down to either a bad disk or bad RAM, but have not eliminated the timing problem.
"

I did all this and the problem never went away... so I did finally find something to test the RAM. But I was thrown so many things.... like bad mother board, timing problem...

It wasn't a motherboard, it wasn't a timing problem... it was simply BAD RAM as you said in this post and another guy said previously in another post.

And on and on and on.... as you know.

Again, I apologize and it was RAM. In future perhaps answers could include how to test the RAM as well so newbies know it's possible without taking the entire system down and replacing brand new RAM.

BOBi


 

by: GinEric Posted on 2005-10-18 at 21:52:33 ID: 15113496

Not only do some dealers simply take one customer's problem, a bad RAM stick, and put it in other customers' repairs, some people don't know what speed RAM to put in boards, while others don't know not to handle RAM without being grounded.

You replaced the RAM and it works; that's good.  For your case maybe it is RAM, but maybe too you don't know why it is RAM.

No one needs to apologize here, and no one needs to take a superior attitude about quitting a question just because some other expert is involved or someone is following another expert's advice.

Haven't you noticed BOBi that you have been as skillfully turned against a person or persons as adeptly as any retired KGB officer can do so?

There were only a couple of more steps, but you had already jumped into the hardware.  That's good, you got it working.  However, I will not tolerate cliquism, rule by committees, and communistic tactics at any level aimed against me and see right through some people's attitudes toward me; not my advice, but towards me, a person they have never met.

I have tried to be nice to one person, and have tried to handle that person with kid gloves; however, after a bit of competition for points that are meaningless, at least to me, there seems to be no hope of my American manners being reciprocated.  That's about the point when I quit my course and leave such person to their own self-importance.

I get enough attacks against my servers as it is without socializing with people tending to alienate me.  And I don't have time for petty politics and who's better than whom on a free question forum.

Notice that my last rule said it was either disk or RAM, as you restated above, and it turned out to be the second.

I don't care at all about points, since I'm doing this for work on a book.  However, I do care about "my good name" as Shakespeare put it.  I do know when I've been baited, or looked down upon from someone's nose, etc., and I seriously try to avoid reacting to such tomes simply because the man with real class rises above such petty things.  But there is a point at which one must tell one's peers to proceed no further - apologizing to one who seeks one-upmanship or acquiescing to a gestalt opinion of one other is nothing less than antisocial behavior, that is, ganging up on someone.  It seems to begin a lot with someone saying someone else is wrong, or whatever, and escalates from there.  I'm not going to allow any person, and they know who they are, to steer opinion against my character and toward their political agenda of character assassination.  I know my politics and psychology far too well to allow that to happen.

Don't be misled into bowing before a false prophet, tin god, or any other politician; you've no need to apologize to anyone.

Good job with the RAM

And some folks need to get off of GinEric's case with the innuendos and hidden discussions about him; it's just plain unprofessional.

Besides, he's in the music business which makes politics, spying, and IT look like kindergarten; he knows when there are darts aimed at his back.


