System Error Event ID 333 appears numerous times and Terminal Services fail to function properly.

In one of my previous posts I had a similar problem which I had thought to be resolved but that was proven wrong to me last evening.

Here is the link to my previous post: http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/Remote_Desktop-Terminal_Services/Q_23101970.html#discussion

A brief recap: We have a Windows Server 2003 machine with the roles of Web Application Server and Terminal Server. Lately I have been getting calls that the client is having problems such as using the application on the Terminal Server, printing, etc. Each time I got such a call, I found that I needed to reboot the server. On almost every such occasion, when I went through the System Log I would see the following errors in chronological order: first Error Event ID 1103, followed by numerous Error Event ID 333 entries flooding the log.

Based on the recommendations and solutions from my previous post I deleted all the unwanted printer drivers, uninstalled all installed printers (since we never print directly from the server console) and updated the antivirus software. I was almost certain that the problem had been resolved, but I was wrong.

Yesterday evening I got a phone call that there was a problem again. So once again I checked the Event Log and there it was: numerous errors with Event ID 333. But this time it was a little different. Error Event ID 1103 was NOT the first error that appeared. The very first error was Event ID 59, which was repeated about a dozen times. Then Event ID 333 appeared a few times, followed by more Event ID 59. Then there was ONE occurrence of Event ID 1103, followed by more Error Event ID 333. Then came a "Warning" with Event ID 56, followed by more Event ID 333. After about a couple of dozen Event ID 333 entries there was ONE occurrence of Event ID 2019 and, finally, even more Event ID 333, flooding the remainder of the Event Log.
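
A sequence like this is easier to compare between incidents when consecutive repeats are collapsed into runs. The following is only a minimal Python sketch of that idea: it assumes the event IDs have already been pulled out of the exported log in chronological order, and the `summarize_runs` helper and the sample counts are hypothetical, not taken from the real log.

```python
from itertools import groupby


def summarize_runs(event_ids):
    """Collapse consecutive repeats into (event_id, count) pairs, keeping order."""
    return [(eid, sum(1 for _ in group)) for eid, group in groupby(event_ids)]


# Hypothetical sequence loosely following the order described above;
# the counts are illustrative only.
sample = (
    [59] * 12 + [333] * 3 + [59] * 4 + [1103]
    + [333] * 5 + [56] + [333] * 24 + [2019] + [333] * 40
)

for eid, count in summarize_runs(sample):
    print(f"Event ID {eid}: {count} consecutive entries")
```

Keeping the order (rather than just counting totals per ID) preserves which error came first, which is the detail that changed between the two incidents.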

I have attached some screen captures as well as a copy of the Log in text format.
ServerError.EID59-1.PNG
ServerError.EID59-2.PNG
ServerError.EID333.PNG
ServerError.EID1103.PNG
ServerError.EID56.PNG
ServerError.EID2019.PNG
System-Error-Event-Log-as-of-01-.txt
esabetAsked:
PlaceboC6Commented:
It would appear something is still hogging the kernel memory.

See the following article:

http://support.microsoft.com/kb/q177415/

Since this is a 2003 server, the GFLAGS part of the doc shouldn't be necessary.
Next time the server is exhibiting this behavior, launch poolmon.exe and sort it by the largest tags.

Paste a screenshot and we'll see what is using all your kernel memory.

You can display non-paged, paged, or a mix of non-paged and paged.

You want to display both non-paged and paged, and then sort by size so that both types of tags are on the screen, sorted in order of largest tags.

Then we can see what is going on.
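
The sort described above amounts to ordering every tag by bytes, regardless of pool type. Here is a minimal Python sketch of that idea, assuming the poolmon rows have already been copied by hand into records. The `PoolTag` type, the `largest_tags` helper, the tag names, and the numbers are all hypothetical placeholders, not real poolmon output.

```python
from typing import NamedTuple


class PoolTag(NamedTuple):
    tag: str         # pool tag as shown by poolmon (placeholder names here)
    pool: str        # "Nonp" or "Paged"
    bytes_used: int  # bytes currently attributed to the tag


def largest_tags(tags, top=5):
    """Return the `top` tags by bytes, mixing paged and non-paged entries."""
    return sorted(tags, key=lambda t: t.bytes_used, reverse=True)[:top]


# Hypothetical values for illustration only.
snapshot = [
    PoolTag("TagA", "Paged", 18_500_000),
    PoolTag("TagB", "Paged", 9_200_000),
    PoolTag("TagC", "Nonp", 61_000_000),
    PoolTag("TagD", "Nonp", 4_100_000),
]

for entry in largest_tags(snapshot, top=3):
    print(f"{entry.tag:<6}{entry.pool:<7}{entry.bytes_used:>12,}")
```

Whichever tag sits at the top of a list like this at the time of failure is the one worth looking up.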
ChiefITCommented:
I have been working with a few people on event 333:

It seems there is an update with SP1 that may be causing the problem.

Have a gander at this and see if it is relevant to your situation:
http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_23008324.html
esabetAuthor Commented:
ChiefIT, my server does not really act as a DHCP server, but I guess it would not hurt to apply the security update that your link refers me to. I will keep you all posted.
PlaceboC6Commented:
Are you not running Service Pack 2?

Don't forget to look at the poolmon information I gave you.  If you do this at the time of failure,  I will be able to see a snapshot of your pool usage and see which tag is using the memory.

Otherwise we won't know.
esabetAuthor Commented:
Hi PlaceboC,

I am not using Service Pack 2. I am on Service Pack 1, and I have not forgotten about poolmon, but I am trying to understand how to use it. Most probably I will have some questions for you.

Meanwhile here is what happened after I installed the security update ChiefIT recommended.

I applied the security update and peculiar things happened. After the update was installed successfully I restarted the server, and right away at the logon screen I got pop-ups saying that the system cannot write to the event log and something about not being able to write to memory as well. After I clicked OK both errors would reappear right away. Meanwhile I got phone calls from clients trying to access the server, so I immediately rebooted the server by pressing the reset button (which explains why I can't recall the exact error messages).

After reboot everything seemed normal and business was as usual.

This morning I had an email from a client, sent about 3 hours after the supposedly successful reboot, saying that they cannot access the server. When I went to the console I noticed that the screen was blank and I could not log on. So I was forced to reset the server. At POST I noticed that one of the RAID (mirroring) drives had to be rebuilt. (This had happened before and we do not know why it is happening. In the past we have checked the hardware as well as the hard drives and there is nothing wrong with either.) At reboot the RAID automatically started to rebuild the drive. After logon I got an error message on the screen saying the memory could not be written. I captured the screen and then rebooted the server once again. I have attached the screen capture.

I clicked OK and once again restarted the server. After restart I checked the System event log and here is what I discovered. About 2 hours after I supposedly successfully rebooted the server last night, there is an error reported by the Promise RAID driver that the device failed to respond within the timeout period. This is a typical error message we have gotten in the past and it requires a rebuild of the mirrored RAID drive. From the attached screen capture you can see that right after that error, at 8:00 AM, I had to reboot the server. That tells me that after that error the server had stopped working.
Now everything seems to be running once again!


Server-problem-on-01-30-08.PNG
Server-Error-EID9.01-29-08.PNG
PlaceboC6Commented:
#1 I would make sure you have a good full backup with system state.

You can check whether there are firmware or driver updates for the RAID controller to help with weird issues with that controller.

Additionally,  an upgrade to SP2 could be useful.  It upgrades the kernel and most of the primary system drivers such as tcpip.sys, ntfs.sys, etc.

If you have a good backup of the box,  then you are safe in case you experience any issue.
esabetAuthor Commented:
I use Retrospect 7.5 as backup software and have used it successfully in the past. I think I will leave installing Service Pack 2 for Sunday so I am not rushed!

I will also check with regard to the driver update. Thanks.
PlaceboC6Commented:
Always have to be careful when updating firmware and drivers for a raid controller.  :)  
esabetAuthor Commented:
PlaceboC;
This may be the wrong place to ask this but all the problems beg this question and I have been asked to look into it:  what is involved in setting up a fail safe cluster environment using Windows Server 2003?  Basically to have another server in place that mirrors the current server and in the event one server goes down the other server will take over automatically!
PlaceboC6Commented:
If you are talking failsafe for a terminal server,  you want to use a network load balancing cluster:

http://technet2.microsoft.com/WindowsServer/en/library/65319bac-2efe-4764-8752-d091447dddbe1033.mspx?mfr=true
PlaceboC6Commented:
But that is if the server is ONLY a terminal server.  If it is a DC or anything else,  this does not apply.
ChiefITCommented:
1 problem resolved:
It looks like the update wiped out the event error 333.
________________________________________________________________________________
Error 1:
For event ID 9, "can't find xxx device within a specific time period", you should probably look into this article. This gives you a step-by-step on how to track down these errors.
http://support.microsoft.com/kb/314093
Since you troubleshot the hardware, and it appeared to be fine, you might be looking at a bad driver. Just like you are currently thinking. If this is a third party driver, you might want to look and see if there were documented problems with this driver by searching it on the internet.
________________________________________________________________________________
Error 2:
The memory error, "can't write to memory", tells me you might have a memory stick that is incompatible with that motherboard. Look at the motherboard model number and make sure the memory is compatible with that motherboard. You don't want PC100 memory on a PC133 motherboard. If you have multiple sticks, maybe they are mismatched. Example: you may have one PC100 and one PC166 RAM stick.

If you can't access memory, or have incompatible memory, you're going to have problems with the drives. Memory is used to help the CPU make transactions with the drives.

Further, some viruses can attack memory sticks. The Haxdoor virus is one.

This error can explain why you can't write to the drive within a given time out period.
_________________________________________________________________________________
Just Some thoughts:
You have a raid array. What drivers are you using for the raid array? We may want to look and see if there was a bad driver sent out, especially if it is a third-party driver.

Is this a custom build? These errors look like a custom-built machine with drivers and memory sticks that are incompatible with the motherboard. "Can't write to RAM" makes it look like there is an issue with incompatible RAM.

BIOS may need to be upgraded to accommodate these items.

IT IS ODD that you are having both memory and raid drive issues. I think we should look at what may be causing both problems simultaneously.
__________________________________________________________________________________
At this point, It would help us if we had the following information:
System BIOS version
SCSI BIOS version if applicable
Raid configuration
motherboard Manufacturer and model number
RAM model: example 1 PC100 DDR RAM DIMMS, and 3 PC133 DDR RAM DIMMS
Hard drive make and model
Antivirus you are using and date of the last update.
Is this a custom build or manufacturer build? If a manufacturer build, make, model and service tag number.
PlaceboC6Commented:
Chief

It may be too early to know if the 333 has been resolved.  The errors usually don't pop up immediately after reboot.  If the kernel was being exhausted,  and there is a leak....it could take a few days to reappear.

He was getting 2019's showing an exhaustion of non-paged pool resources.   This will not be visible immediately after rebooting.
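
Because a leak shows up as growth over time rather than in a single reading, comparing two pool snapshots taken some hours apart can point at the tag that keeps growing. This is only a rough Python sketch under that assumption: the `growth_by_tag` helper, the tag names, and the byte counts are hypothetical, and the snapshots are assumed to have been transcribed from poolmon by hand.

```python
def growth_by_tag(earlier, later):
    """Return (tag, bytes_grown) pairs, largest growth first.

    Both arguments map a pool tag to the bytes it held at snapshot time.
    Tags missing from the earlier snapshot are treated as starting at 0.
    """
    deltas = {tag: later[tag] - earlier.get(tag, 0) for tag in later}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical non-paged pool readings, e.g. shortly after reboot and a day later.
after_reboot = {"TagA": 2_000_000, "TagB": 5_000_000}
one_day_later = {"TagA": 2_100_000, "TagB": 48_000_000, "TagC": 300_000}

for tag, grown in growth_by_tag(after_reboot, one_day_later):
    print(f"{tag}: {grown:+,} bytes")
```

A tag that is large but stable is less suspicious than a smaller one that grows steadily between snapshots.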
esabetAuthor Commented:
Chief

Not to burst any bubbles but I went and took a closer look at the System Event Log and here is what I further found out:

As I mentioned before, after the security update was installed successfully I rebooted the server even though it did not prompt me to do so. (See screen capture 1).

After reboot I started to get errors at the login screen, at which point I pressed the reset button (See screen capture 2).  That was 6:12 PM.

As I was scrolling up, all of a sudden I noticed that at 6:21 PM there were a bunch of EID 333 entries.  But before that there was an Event ID 10005 (See screen capture 3), which was then followed by Event ID 333 (See screen capture 4).

Then at 6:20 I must have reset the server once again because I see Event ID 6008. (See screen capture 5).

It was after this last one that about 2.5 hours later the RAID driver showed an error I mentioned before which caused the server to go down.

So, since the Event ID 333 occurred right after the security update was installed, I am not sure if it did in fact solve the problem.

As for the hardware, yes it is a custom built machine.  I have all the details and I will put it together and post it with my next reply.






Server.Security-Update-Install.1.PNG
Server.Security-Update-Install.2.PNG
Server.Security-Update-Install.3.PNG
Server.Security-Update-Install.4.PNG
Server.Security-Update-Install.6.PNG
PlaceboC6Commented:
Next time you are getting the 333's,  run poolmon.exe and post a screenshot.

There are two keys you can use to organize the view in poolmon: P and B. P cycles the display between non-paged, paged, and both pool types; B sorts by bytes used. Press P until both Nonp and Paged are displayed, then press B so the biggest are at the top.
esabetAuthor Commented:
Placebo;

The reason I did not run poolmon last time was that I got pop-up error messages on the screen at the logon page, and since I knew people were waiting to get in I simply reset the server. But the next time I get the error I will run poolmon and send you the screen captures.

With regard to the failsafe, I can't use load balancing because this server is being used as more than a terminal server. It is being used to host a Web application as well as running SQL Server.
esabetAuthor Commented:
Ok, here is the system hardware spec:

Motherboard: MSI K8T Master2 FAR (Model No. MS-9130)
System BIOS: Award BIOS v1.35
CPU: Dual AMD Opteron 248 (2.2 GHz)
No SCSI
RAID Controller: Promise FastTrack 2300, BIOS v2.5.0.3122, Driver v2.6.0.318
RAID Setup: RAID Level 1 (Mirroring)
Hard Drives on RAID: 2 - Western Digital Model No. WD5000YS, 7200 RPM w/ 16MB Buffer
System RAM: 4 - Mushkin 991143 - 1GB PRO3200 ECC Registered
Antivirus: McAfee VirusScan Enterprise v8.0i w/Patch 16
This is a CUSTOM built machine.

Here are a couple of other details:

1) I have always noticed that though I have 4 GB of RAM installed on the system, when you go on My Computer properties it only shows 3.5 GB.  I never thought much about it before but now I am questioning everything.

2) I had also noticed in the past that during POST, right after Promise FastTrack 2300 controller info is displayed I see a prompt that says "Not sufficient PCI Rom space."  Again I saw this in the past but did not think of it much!

PlaceboC6Commented:
Often what you are describing, with 3.5 GB of 4.0 GB being shown, is due to a resource address range conflict with the hardware. Try slapping /PAE in your boot.ini statement and rebooting. You can put it after whatever the last switch is. Even if you are running Standard, this will move the memory ranges around and could get you more of your memory back, because it is still going to be under 4 GB. I've seen this particular issue with a specific PCI Express mobo I work with.

I highly suggest researching a mobo and controller firmware update.  Most likely the promise firmware is lumped in with the mobo bios update.    Just make sure that backup is dusted off.  :)

esabetAuthor Commented:
I ran "MSCONFIG" to modify boot.ini but it does not give me the /PAE option.

Also, I have already checked for firmware updates for the mobo as well as the controller, and what I have are the latest! The same goes for the BIOS and drivers.
PlaceboC6Commented:
You can add it yourself.

Control Panel
System
Advanced Tab
Startup and recovery
There will be an edit button.  
It will launch notepad with the boot.ini inside.  For example:

[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect /usepmtimer /NoExecute=OptOut /pae

Just add it to the end of the line on your os instance listed under [operating systems]
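
For anyone who prefers to see the edit spelled out, the manual change above can be sketched in Python as a plain text transformation. This is a hedged illustration only: the `add_switch` helper is hypothetical, the sample entry name is made up, and the sketch works on a string rather than touching the real boot.ini, which should still be edited through the dialog described above.

```python
def add_switch(boot_ini_text, switch="/pae"):
    """Append `switch` to each entry under [operating systems] that lacks it."""
    out = []
    in_os_section = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track which section we are in; only OS entries get the switch.
            in_os_section = stripped.lower() == "[operating systems]"
        elif in_os_section and "=" in stripped:
            if switch.lower() not in stripped.lower().split():
                line = line.rstrip() + " " + switch
        out.append(line)
    return "\n".join(out)


# Hypothetical boot.ini contents (raw string so the backslash is kept as-is).
sample = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect"""

print(add_switch(sample))
```

The `default=` line under `[boot loader]` is left alone on purpose; the switch belongs only on the entry under `[operating systems]`.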
PlaceboC6Commented:
Oh yeah,  and don't forget to save the file in notepad so it overwrites.
esabetAuthor Commented:
I just added the /pae switch and after reboot the RAM size did not change.  It still shows 3.5 GB only!

Do you think this may have something to do with EID 333?
PlaceboC6Commented:
No I am pretty sure it is unrelated.  The 3.5gb thing is most likely related to a memory address range conflict with the hardware.
ChiefITCommented:
Looked up the hardware compatibility: all looks good.

PC3200 is what you want on that MB.

You are using ECC memory: (Error Correction)
So, you may be doing a lot of data checking.

If memory serves me right, some have turned this off. Checking data takes up a lot of resources. But, don't go off my memory, because I don't trust it. Placebo, do you have an opinion on ECC memory, should it be turned off in BIOS?

I also know of a discrepancy with any RAM where it is greater than 3 GB. This should be looked at. There are server configurations to accommodate the 3 GB "limit" on a 32-bit system. This issue, I think, is specific to 32-bit systems and may not apply to you.
PlaceboC6Commented:
ChiefIT,

I don't have an opinion on ECC.  I say if the mobo has it installed,  I don't know what harm it does.  But I have never run into a scenario personally where it has made a difference.
ChiefITCommented:
Here is the Mobo manual for your system:
http://www.merlinux.de/download/E9130v1.2_K8T%20Master%20Series.pdf

It explains the rules and types of memory that are allowed for this computer.
Please see the section called, "Rules for populating memory, also the section above for types of memory authorized":
NOTES:
1) The channels are paired. For dual channel, memory modules 1 and 2 are paired, and 3 and 4 are paired. For single channel, 1 and 3 are paired. (Looks odd, huh?)
2) You must use SDRAM (Synchronous Dynamic RAM). So if these are four sticks, these should be 128-bit dual channel DDR SDRAM DIMMs, PC3200, 2.5 volt memory modules you have in the slots. (Say that ten times real fast.)

If your system RAM follows the rules, I will look into how ECC might cause a problem. I have seen different sources of error checking conflict, but don't remember the overall RAMifications or fix. (no pun intended)
PlaceboC6Commented:
One key thing though..  The 333's are paired with 2019's.  2019 shows an exhaustion of non-paged pool kernel memory which will strangle an x86 server.  The 333's start generating once the kernel is at a critical point.

2019's and such are an exhaustion of a virtual memory address space and won't have anything to do with what type of RAM is in the system.

However,  obviously not having ram that is rated correctly for the mb can cause other obvious issues and is a concern.
ChiefITCommented:
I agree, but isn't virtual memory a direct result of not being able to utilize system RAM? Isn't virtual memory a section on the hard drive that acts as a temporary buffer until system RAM can pick up the ball and run with it?

I am looking for a choke point. Why is the system choking on virtual memory? Shouldn't that amount of system ram pick up the pool and run with it rather than choke on it?

Just a thought.
PlaceboC6Commented:
There are two types of virtual memory address space in Windows:

kernel virtual memory space and user mode virtual memory space. Page files and virtual memory as you are thinking of them are associated with user mode virtual memory space. Kernel virtual address space is limited and is for under-the-hood OS stuff like drivers and other kernel functions.

See this link:

http://blogs.technet.com/askperf/archive/2007/03/07/memory-management-understanding-pool-resources.aspx
esabetAuthor Commented:
If I understand these discussions so far, it is possible that the type of RAM that I have in the system is incorrect, right?

Before I go ahead and count the cash for new RAM, do you guys think I should wait till I get another EID 333, run poolmon, and then take it from there?

I just setup eventtriggers.exe to send me an email notification as soon as an Event ID 333 is recorded in the Event Log.  I hope I never have to see that email but I have a bad feeling.................
PlaceboC6Commented:
I think you should wait for the 333's again and do the poolmon.  Those errors are not caused by a physical issue.

Let's chase that down and keep the ram on the back burner.  If we can pinpoint a cause of the 333's,  then the server may run fine as is.

The thing about the RAM speed rating such as PC3200, PC2700, etc.: that doesn't mean that the RAM is hardwired to run that way. It simply means that the components are rated to work at that speed. It is VERY possible that the same chips used in 3200 can be used in 2700. It wouldn't be the first time they relabel the same thing and sell it at different prices. Makes more sense than physically building 100 different models of something like that.
esabetAuthor Commented:
I do understand that, but for a minute I was questioning whether my RAM is SDRAM or not, it appears they are in fact SDRAM PC3200/DDR400.
PlaceboC6Commented:
It looks like you have DDR PC3200.  I'm sure you are ok.
ChiefITCommented:
OK:

I agree to put the system RAM on the back burner, so I started looking through articles that may pertain to this situation. I think we have all come to the conclusion that the server is constipated and causing some/all of these errors. We just need to find the blockage. I have a newfound knowledge of paged pool and non-paged pool virtual memory. Thanks, Placebo. With that knowledge I decided to research some plausible causes of this issue.

Reflecting back on the original issue you have a number of errors: I think they are all related given the errors I see and the research I just dug up.

Let's review just a little:

Error 2019: Non-paged pool problem
Error 9: Not enough system resources (source ftt52xp)
Error 6008: Unexpected shutdown (source EventLog)
Event 10005: Insufficient resources (source DCOM)
Boot up error: The memory could not be written to
Error 59: Insufficient resources (source SideBySide)
Error 333: I/O problem, can't write to registry (source Application Popup)
Error 56: The driver failed to allocate memory (source viamraid)

After all: Page pooling is a system resource and considered virtual memory.

Here's the theory:

Looking at this article says a lot. It looks like you have a flooded NIC or something that is constantly flooding your non-paged pool resources. The drivers, according to the article that placebo showed me, are usually written to the non-page pooled space. If something is flooding a driver with constant requests to open a non-page pooled driver, then you may run into all of the above errors. Maybe SQL or something is getting ahold of your NIC driver and it is unable to process the request. This floods the non-page pooled environment and knocks down NIC devices causing all of the above virtual events. Maybe that process is SQL trying to open up sockets that don't exist, as this article implies for an NT4 environment.
http://support.microsoft.com/kb/133384

Have you ever heard of the domino effect? Could SQL be trying to open up sockets that don't exist, causing insufficient system resources, printing problems, inability to apply remote registry edits, etc.? I know this article is for NT4, but could the same apply in this case?

I also know of a number of instances where NIC flooding can be caused on a switched network. If the NIC is flooded, it can cause a temporary shut down of the NIC resources and ability for the server to do as it was implied to do.

I am thinking a quick review of the SQL server is a possibility. Then, maybe look into other forms of NIC flooding that can tax the non-page pooled drivers. What do you think of these ideas?
PlaceboC6Commented:
Agreed.

That's what poolmon is going to do for us.  He will run poolmon at time of failure.  When this is done,  I can see the pool tags using the most memory and possibly pinpoint the cause or problem area.  

As soon as it has another problem,  we'll be closer to identifying the cause.  There are tags associated with network drivers both inside and external to the os.  As well as file caching, the ntfs file system, print system, etc.  

Now we wait.
esabetAuthor Commented:
About the 3.5 GB issue: I just remembered that during POST the BIOS reports ONLY 3668616 KB (or something like that, but of the first four digits I am 100% certain). So even the BIOS does not show the entire 4 GB, but according to the mobo manual I can install up to 4 GB on the board!? Isn't that strange?
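
As a quick sanity check, the figure recalled from POST can be compared against the installed 4 GB. This is only a back-of-the-envelope Python sketch; it assumes the recalled number is exact (only the first four digits are stated as certain) and that the BIOS reports binary kilobytes.

```python
KB_PER_MB = 1024
KB_PER_GB = 1024 * 1024

installed_kb = 4 * KB_PER_GB  # four 1 GB modules
reported_kb = 3_668_616       # figure recalled from POST (assumed exact)

missing_kb = installed_kb - reported_kb

print(f"Reported by BIOS: {reported_kb / KB_PER_GB:.2f} GB")
print(f"Not reported:     {missing_kb} KB (about {missing_kb / KB_PER_MB:.0f} MB)")
```

Under those assumptions the BIOS figure works out to about 3.5 GB, matching what Windows shows, with roughly half a gigabyte of the 4 GB not being reported. That is in line with the address-range explanation given earlier in the thread, though the arithmetic alone does not prove it.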
PlaceboC6Commented:
Ohhhhhhhhh

That might be because the mobo is sharing it with integrated video perhaps?
esabetAuthor Commented:
By integrated video do you mean a video card that is integrated into the mobo (similar to integrated sound cards!)?  If so, then this mobo has NO integrated video!
PlaceboC6Commented:
It could also be going to caching.  System bios caching, video caching.  It is very possible a small chunk of your memory is being allocated to something else by your motherboard.   I wouldn't sweat it too much.

That certainly explains why the OS only sees 3.5...because the mobo is only giving it that much.
esabetAuthor Commented:
So do you think I should remove the /pae switch from boot.ini ?
PlaceboC6Commented:
Yeah you can pull it.....but I wouldn't reboot.  Just let it run as is...that way we can reproduce the 333's..  If we reboot again,  we'll reset the kernel levels back to nothing.
esabetAuthor Commented:
Ok, something strange. This morning the server was not responding, not even to the console's keyboard. The screen was blank and there was no response. We could not access it via Remote Desktop and nothing seemed to work. So I was forced to reset the server, and after reboot I looked at the log and there were NO errors!! The last entry logged was an "Information" event at 3:00 AM generated by Retrospect (the backup software) saying that the backup completed successfully. The next event is an error relating to my unexpected reboot (EID 6008) at 8:45 AM.
PlaceboC6Commented:
I'm wondering if the backup exhausted the kernel.

If you are not running sp2,  you really need to install this VSS hotfix at a minimum:

http://support.microsoft.com/kb/940349/en-us
esabetAuthor Commented:
Maybe I should do a backup now and go ahead and install SP2? Would you recommend that?

Also, food for thought: do you think it would be wise, at some point in the future and as a general practice, to have another machine solely dedicated to backup operations?
PlaceboC6Commented:
If resources allow,  I always think it is better to have a dedicated backup server.

Do a full backup to include system state,  and then upgrade to sp2.  That will upgrade a TON of stuff in the OS.

90% of the time people don't have problems with this upgrade,  but at least you will have a backup just in case.
esabetAuthor Commented:
Ok, I just finished installing SP2. Everything looks good but I will cross my fingers for the rest of the day. I am always worried when I install Service Packs!
PlaceboC6Commented:
Lol.   If your users can use their applications and nothing is failing immediately (due to incompatibility),  you should be ok.  SP2 is more stable.
esabetAuthor Commented:
Ok, I just got an email at 9:44 AM from a Terminal Service user that while they were using Terminal Services they got a pop-up saying they lost connection and could not get back on thereafter.  I checked the server at 9:54 AM and I could not log on either.  I tried from Console and still nothing.  So I had no choice but to reboot, hoping to see something in the Event log.  

But here is the bad news: there was NO error message in the Event log and the last event recorded was at 9:28 AM.

I did notice something else as well, but I am not sure if it means anything: when the server was not responding at the console, pressing Num-Lock or Caps-Lock on the console keyboard did not work either (by not working I mean the LED that shows Caps-Lock or Num-Lock is ON would not light up!).
PlaceboC6Commented:
Typically if the mouse isn't moving,  and the keyboard lights don't respond...you are looking at a hardware issue of some form (from my experience).

A memory leak failure will usually respond in some fashion.  Keyboard lights will work,  mouse cursor will move.  Just might be extremely slow to respond.
esabetAuthor Commented:
So how would I go about troubleshooting that? Is it possible that all my problems are coming from hardware to begin with? Could it be that, since this is a custom-made server, it is not fully compatible with the Windows Server 2003 OS, and that is the source of all my grief?
PlaceboC6Commented:
A hardware problem isn't going to cause the 2019's you were seeing.

As far as the hardware goes,  I would ensure that all drivers are up to date:  chipset, network card, controllers, video card...

If there are firmware updates for anything,  take care of that and see what happens.  Additionally if any external devices are attached,  they can be removed for troubleshooting purposes.  

Random lockup problems are hard to find the cause for with diagnostics because they will usually pass most of the time and only when the failure occurs would the diags show anything..but of course by then it is too late.

ChiefITCommented:
This sounds almost indicative of a flooded NIC. Do you have dual NICs? There is some fine tuning of the server and switches needed to perform network load balancing on a switched network.

With clients getting kicked off and your resources on the server failing, it really sounds like one NIC is being overwhelmed. This can happen on a switched network because a switch uses a defined path to a NIC. A network hub uses broadcast messages to communicate to the NICs.

An overloaded NIC can cause all of the symptoms you are talking about:
1) broken communications: A NIC can pause services that are flooding the server
2) lack of dcdiag or event errors that point out the problem: The NIC can appear to be working much of the time and to the server it looks perfectly normal. But the back flood of packets can hose up the server and you will not see errors indicating the problem.
3) Your computer quits responding: Much like requesting a mapped drive that is not there, a mouse can freeze or mouseclicks are unresponsive. If you keep pressing the issue, the window will say, "No Response"

With that said, Let me ask what does your NIC configurations look like? Are you performing Network Load Balancing, using two NICs, and do you have a switched network?
esabetAuthor Commented:
Chief;

The server has only one (1) NIC, which is built into the mobo. According to the mobo specification the NIC is as follows: Broadcom BCM5705 Gigabit LAN controller providing 1000, 100 and 10 Mb/s data rates.

The incoming WAN is a DSL that is routed via Netgear VPN Firewall Model No. FVS338 (8 PORT 10/100 SWITCH).  Devices connected to (routed by) the Netgear are: 1 Desktop PC, The Server, 1 Network HP Printer, 1 NAS

Lastly, we are not doing any Network Load Balancing.

I hope that answers your question.
esabetAuthor Commented:
By the way, I just checked the driver for the built-in NIC and there was an updated driver dated October 10th, 2007, so I just installed the new driver.
PlaceboC6Commented:
Are there any updates for controller or motherboard chipset?
0
esabetAuthor Commented:
I did some research and the mobo website does not offer any updates (I guess this mobo is pretty old by today's standards, LOL).

So next I looked at the individual mobo chipsets and I found updates for the Northbridge Chipset (VIA K8T800) and the Southbridge Chipset (VIA VT8237).

Should I go ahead and install the updates though they are not offered by the mobo manufacturer?
0
ChiefITCommented:
With no visible errors, could this be something as simple as hibernate, screensaver, or power modes that are not responding as they should?

Good call on the chipset. I was thinking about that.
0
PlaceboC6Commented:
The chipset manufacturer will usually have more recent versions of the chipset driver than the motherboard manufacturer.  If it is the same chipset,  then it should be safe to upgrade.  I have an Abit system board on my primary PC,  but download the Nvidia chipset drivers from Nvidia.  Once the motherboard becomes end of life,  they may quit posting updates.

Interesting we haven't seen a repeat of the 333's and 2019s yet.
0
esabetAuthor Commented:
I will install the chipset updates this weekend when traffic to the server is lower.

Meanwhile, when I came in this morning the server was not responding, neither at the console nor via remote desktop.  And just like last time the keyboard Caps Lock and Num Lock would not function either. After the forced reboot, the last log entry before it was at 3:08 AM and there were NO errors!

This is new behavior and I don't recall having it before I installed SP2.

Lastly, while I was searching for new drivers, as for the Video Card (ATI Radeon 9700 Pro) I could not locate any drivers for Win 2003 and when I contacted the manufacturer I got the following reply:

Dear Sir,
thank you for contacting AMD Customer Care for ATI products!
Our cards are not supported in Windows 2003. You can only use the standard driver that came with the system.

I have also noticed another particular thing about the Video Card.  When I go to the Device Manager I see two instances of the Video Card the only difference is that one of them says  "Secondary".


0
PlaceboC6Commented:
95% of the time a hard lock as you are describing it is a hardware/firmware related issue.

Not saying it isn't possible that something in SP2 is causing the issue....but it is highly unlikely.
If you want to uninstall it as a test, you can do so in add/remove programs.


0
esabetAuthor Commented:
I guess what I will do for the time being is install the chipset updates and see what happens next.

Any recommendations as to what I should do with the Video Card?
0
PlaceboC6Commented:
I'd probably leave it as is.  The built in drivers may be ok.
0
esabetAuthor Commented:
Ok, this morning the Server was responding normally. :>)  I also checked the log and the last entry, just like the previous two mornings, was at 3:06 AM.  That tells me that I can't really figure out at what time the server had stopped responding the last two mornings.  But I am still thankful that the server was up and running this morning.

Since I was in a good mood I decided to install the new North Bridge drivers.  After install everything seems to be functioning.  I think I will wait 24 hours before I attempt to install the South Bridge new drivers.
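
Since the event log goes quiet after 3:06 AM and gives no hint of when the box actually froze, one possible workaround is a tiny heartbeat script run once a minute from Scheduled Tasks. This is only a minimal sketch: the file path, the one-minute interval and the function names are assumptions for illustration, not anything used in this thread. After a forced reboot, the last timestamp before the gap shows roughly when the server stopped responding.

```python
import os
from datetime import datetime, timedelta


def write_heartbeat(path):
    """Append the current time to the heartbeat file (schedule once a minute)."""
    with open(path, "a", encoding="ascii") as handle:
        handle.write(datetime.now().isoformat(timespec="seconds") + "\n")
        handle.flush()
        # Push the line to disk so it survives a hard lock shortly afterwards.
        os.fsync(handle.fileno())


def read_heartbeats(path):
    """Load every timestamp previously written by write_heartbeat."""
    with open(path, encoding="ascii") as handle:
        return [datetime.fromisoformat(line.strip()) for line in handle if line.strip()]


def find_gaps(timestamps, expected_interval=timedelta(minutes=1), tolerance=2.0):
    """Return (last_seen, next_seen) pairs where the heartbeat went silent.

    A gap is reported when two consecutive timestamps are further apart
    than expected_interval * tolerance.
    """
    limit = expected_interval * tolerance
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > limit:
            gaps.append((earlier, later))
    return gaps
```

The first element of each reported pair is the last moment the server was known to be alive, and the second is the first heartbeat after the reset.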
0
PlaceboC6Commented:
Cool.   Good to hear.  We'll see what she does.  
0
ChiefITCommented:
As a test, try to do a couple print jobs from a remote computer. I have a hunch.
0
esabetAuthor Commented:
I just did print some documents using different applications through remote desktop and everything seems to be still good!
0
PlaceboC6Commented:
Hopefully the hard lock issue is resolved,  and we can wait and see if the 333/2019's show back up.
0
esabetAuthor Commented:
Ok, I was very hopeful but it happened again.  I just got an email that the Server is down.  When I went to the console, there was no response and as before the Caps Lock and Num Lock would not respond either.  So I had to do a hard reset.

After reboot I checked the Log and there were no errors.  But as I was about to close the log I got a pop-up on the screen that said "Network path not found!".  So I checked the System Log and saw a Warning Event ID 1 that said the following:

The Subscription WSManSelRg could not be activated on target machine localhost due to a reachability error.  Error Code is 0x0.  All retries have been performed before reaching this point and so the subscription will remain inactive on this target until subscription is resubmitted / reset.  

Does this mean anything?  
0
ChiefITCommented:
I had to look that one up. I have never seen that one before. But, what I found looks like the issue:

http://technet2.microsoft.com/windowsserver2008/en/library/3fa8a7b7-ab82-4661-9b7f-c43477f0e6af1033.mspx?mfr=true

Your error says it can't activate the local host. (This is reaching), but maybe you are having a flooded NIC that is knocking down your WSMAN console. The reason I say this is because the WSMAN connection can't contact itself. That seems odd. I also see few Event errors. Anything in DCdiag?

What do you think, Placebo?
0
esabetAuthor Commented:
You know, this issue of the NIC becoming flooded was raised by you guys before, so it has got me thinking.

I know that at the time the Server went down at the minimum two or three people were using the Terminal Services simultaneously.  It is also possible the web services (IIS) were also being used.  I am not sure if that is a lot of traffic or not.

And here is another thing but this one could be very coincidental:  On one other PC that is on the same network that the server is on, I was trying new software called NetworkView.  If you are not familiar with it, the software scans the entire network range and builds a Tree of all the devices on the network.  I was running this software at the same time the Server went down.  Is it possible that this software helped flood the NIC?

I am absolutely no Tech so whatever I am suggesting is purely a theory, nothing more.

As for the Warning Event ID 1, I checked the history of the log and I see that it had come up before and I had not paid attention to it since it was a Warning as opposed to an Error.
0
PlaceboC6Commented:
Sorry guys.  I've been deathly ill due to the flu thing that seems to be getting everyone I know.

Anyway,  if the server is locking up hard...I still suspect some level of hardware involvement (from my experience).

If the mouse was still moving or the keyboard lights worked,  I would suspect a software or resource issue.  But from what you are saying it sounds like it is locking up hard.
0
ChiefITCommented:
esabet:

Can you tell us more about the video setup? You don't have integrated video, but have a "secondary" driver in hardware configurations?

Placebo, what do you think about uninstalling the "Secondary" driver?

0
esabetAuthor Commented:
Placebo, sorry to hear that you were sick, hope you are feeling better.

I went away on Friday and while I was away the Server experienced three more incidents of hard locks! To say the least it was fun trying to find someone to reset the server. At least two of the hard locks appear to have happened after the automated backup had been performed @ 3:00AM.  But I can't be certain if the backup was the cause; I simply don't see any log reports after the successful completion of the backup.  The next log entry is after the system was reset due to hard locks!  Here is a wild card: would shadow copy have anything to do with this?  B.T.W.  I have run the backup software just to see if the server locks up but nothing went wrong.  And also, one of the hard locks on the weekend was long before the scheduled backup.

Chief, to answer your question, there is only ONE video card installed on the server and it is a Radeon Pro 9700.  I never knew that I had a "secondary" driver installed till I was going through the Device Manager looking for anything that I could find newer drivers for.  So why it is there and how it got there I am not certain.  (I have attached a screen capture of the Device Manager.)  And, as I mentioned before, even though it appears the drivers installed for the Video card are from ATI (the manufacturer), the manufacturer claims that they do not have any drivers for Windows 2003!

I have also run a diagnostic test (which I downloaded from Broadcom) on the Broadcom integrated NIC and it passed all the tests.

Lastly, after this morning's reset (due to a Hard Lock) I witnessed one ERROR Event that I had not seen before: Event ID 107.  I have attached a screen capture of the error.

Server.EventID.107.png
Server.Display-Adaptor.png
0
PlaceboC6Commented:
Events 106 and 107 tend to have to do with processor errors.

http://support.microsoft.com/kb/889249

Machine Check Architecture exceptions are classified into the following two categories:
- Fatal errors
- Corrected errors

Fatal errors are errors that cannot be corrected by the computer. When a fatal error occurs, the computer may restart unexpectedly or may stop unexpectedly.

Corrected errors are errors that are corrected by the computer. Corrected errors may also be corrected by the firmware, such as the processor abstraction layer or the system abstraction layer. The firmware is provided by the manufacturer of the computer. Corrected errors do not cause the computer to stop unexpectedly.

Corrected errors are cosmetic at best.  Fatal could suggest a hardware problem with the CPU.  How is the CPU fan on this guy?   Is the room nice and cool?

The "HARD LOCK" tends to suggest a hardware or heat related hardware problem.  Whether you have defective equipment or it is simply overheating obviously requires troubleshooting.  Being a custom built server anything is possible.

Make sure the hard drives aren't too close together in the case...that there is plenty of cooling, etc.

0
esabetAuthor Commented:
Placebo

I just ran Speedfan and it is reporting that CPU 1 is at 73C and CPU 2 is at 76C.  The System Temp is at 41C.

The CPUs are AMD Opteron 248 (2.2 GHz).  According to my research the maximum operating temperature is 70C.

As for the System Temp, according to the motherboard manufacturer it should be under 45C.

So it appears that my CPUs are running too hot while my case temp is OK.

What do you recommend me to do?
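
For anyone following along, the comparison above can be written out as a few lines of Python. This is just a sketch using the Speedfan readings and the limits quoted in this post; the function name and dictionary layout are made up for illustration, and nothing here reads the sensors directly.

```python
def over_limit(readings, limits):
    """Return how far each sensor is above (+) or below (-) its limit, in degrees C."""
    report = {}
    for sensor, value in readings.items():
        report[sensor] = value - limits[sensor]
    return report


# Readings reported by Speedfan and the limits quoted in the post above.
readings = {"CPU 1": 73.0, "CPU 2": 76.0, "System": 41.0}
limits = {"CPU 1": 70.0, "CPU 2": 70.0, "System": 45.0}

for sensor, excess in over_limit(readings, limits).items():
    status = "OVER" if excess > 0 else "ok"
    print(f"{sensor}: {excess:+.1f} C relative to limit ({status})")
```

Both CPUs come out above the 70C figure while the system temperature stays under 45C, which matches the conclusion above.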
0
PlaceboC6Commented:
Well,  first things first..

You need a good cpu fan/heatsink combo installed on your procs.  You also want to make sure when the fan/heatsink is attached that you use that thermal grease stuff  on top of the cpu where it makes contact with the heatsink.

Looks like we might be narrowing down the cause of your problem.  If you're not big into this sort of thing,  you could always check a computer store in town and tell them you need some help with a possible overheating issue and they should have the thermal grease.

I honestly would have to look at my system at the house to see what temp the proc is at to try to get a feel for what is "normal"....but from your research it sounds like it is running hot and that will definitely cause hard locks and machine check errors.
0
esabetAuthor Commented:
Ok, then I will do some research to find a good CPU heatsink and fan.  I will work on it right away but please let me know if you have any other ideas or suggestions.
0
PlaceboC6Commented:
Nothing new to add right now.

I'm convinced of a hardware and/or heat problem,  so we definitely need to make sure those procs aren't over-heating.
0
ChiefITCommented:
Yep, heat can definitely cause your latest issue:

As a side note:
The primary port and secondary port are ok. So, no worries.
I think it is like this:
Primary is for the DVI port on the card
Secondary is for the VGA port on the card.

As for the driver, I didn't see one for 2003, as you pointed out. However, when you load an OS you have a bunch of generic drivers that run most VGA monitors. I also think you would have luck with the XP driver.

I think we have your issue (HEAT).
Usually, freezing up for no apparent reason is due to HEAT or RAM, not video or NIC. Video has some pretty distinct symptoms. A NIC usually throws up a bunch of red flags.
0
esabetAuthor Commented:
Thank you guys.  I have started doing some research on finding a good HeatSink.  If you guys know of any that you would recommend please let me know.

I will keep you posted as soon as I have replaced those HeatSinks!
0
PlaceboC6Commented:
Make sure they have a nice big fan on them.
0
ChiefITCommented:
What do you prefer, paste or pad, Placebo?

I like refrigeration.
0
esabetAuthor Commented:
Update -

I have been doing some research and apparently this mobo has some cooling limitations.  Due to its design not all HeatSinks will fit.  The footprint of the HeatSink cannot be any larger than 80mm square and, actually, when the mobo was shipped out it was shipped with its own HeatSinks.  Meaning the sockets do not accept your standard Socket 940 HeatSinks.

So I am researching away to see what others may have successfully used with this mobo.

B.T.W. Chief, what do you mean by refrigeration?
0
ChiefITCommented:
They now have refrigeration cooled boxes. You can't quite put your beer in there, but they keep the electronics cool inside. A lot of High-Profile gamers are using them.
0
PlaceboC6Commented:
Refrigeration is cool,  but overkill for this application.  Most of those die-hard gamers in question over-clock the heck out of their hardware pushing it beyond its advertised specifications.

In our case we'd be happy with something to keep the processor in the green running at the speed it is supposed to be running at.
0
esabetAuthor Commented:
I have done some further research and people have reported one particular aftermarket HeatSink that works well with this mobo - Thermalright Si-9XV.  I just ordered two of them.  I was also advised to use Shin-Etsu 783 as thermal grease.

I also emailed the mobo manufacturer about the issue and they emailed me back that even system temp at 40 some degrees is too high.  I think I have too many hard drives in this system but at this point I don't think I can get rid of any.  So I have decided to add an additional exhaust fan to top of the case (there was always a place for it but it was never installed).
0
PlaceboC6Commented:
I've got a sweet case that I used to build my home PC.  It's a Thermaltake "Tsunami Dream"

http://www.newegg.com/Product/Product.aspx?Item=N82E16811133133&Tpk=tsunami%2bdream

Has a big fan that blows across the hard drive bay.   A big fan in the back.  And a big fan in the side of the case.  Love it.

0
esabetAuthor Commented:
That case looks GREAT and very reasonably priced.  I hope I don't have to go that far and change the case though, it would be a real project!
0
PlaceboC6Commented:
Yeah it's not bad.   You would be able to use your existing power supply in it.

I just had to swap the mobo, cpu, ram in mine because the motherboard went bad.  Computer started locking up.  Now I'm posting from the new and improved model.  Of course this is just a workstation.  It's just a really fast one.  LOL.
0
ChiefITCommented:
Placebo:

Here's someone with a similar problem.
http://www.experts-exchange.com/Software/Internet_Email/Email/Anti_Spam/Q_23113855.html

Willing to attack this one?
0
esabetAuthor Commented:
Ok, last week I replaced the rear exhaust fan with a 75 CFM fan and added another on top of the case (also a 75 CFM fan).  The System Temp dropped to 35c (from 41c) and the CPU temps also dropped to about 69c to 70c.

On Sunday I then replaced the CPU HeatSinks.  These new Heatsinks have 92mm fans and they are HUGE!  Anyway.  After some modification I managed to make them fit.

Now I get the following readings:  The system temp averages 29c;  CPU1 averages 65c (no load) and 68c (100% load); CPU2 averages 60c (no load) and 62c (100% load).

That looks a lot better now but I don't understand why CPU1 is about 5 to 6 degrees hotter than CPU2.  Any ideas?
0
PlaceboC6Commented:
CPU1 is going to get used more regardless.  A lot of apps are not multi-cpu intelligent.  So cpu1 is going to get a chunk of the load.

Looks like you really made a dent in the temperature!
0
esabetAuthor Commented:
I understand.

As for the temps, yes a major improvement compared to before, but did I mention the NOISE!!! LOL!
0
PlaceboC6Commented:
That's what servers are supposed to sound like.  :)  

If heat was indeed your problem,  I am anxious to see if this thing goes without locking up now.
0
esabetAuthor Commented:
Same here, I am very anxious.  Today is the first day that the system is being really used since the new fans.  I was just checking the temps and under normal use here are the average readings:

System:  34c
CPU1: 69.9c
CPU2: 64c

The readings I was getting before were taken while I was running a CPU testing diagnostic software called Hot CPU Tester v4.3.  Also it seems that on weekdays the building temperature is higher than it is on weekends, which would explain why the System Temp is also higher.  Nonetheless I am still happy with all the temps except for CPU1 since it is still lurking around 70c (the MAX allowed according to the specs for the CPU).  Maybe I need to recheck my Heatsink install on that one!

Lastly, I have come across two additional temp readings in Speedfan called Core0 and Core1. They are reported by the chipset called AMD K8.  Core0 averages 52c and Core1 averages 42c.  Any idea what these are?
0
PlaceboC6Commented:
Has it been behaving?
0
esabetAuthor Commented:
Yes, so far all is good!
0
PlaceboC6Commented:
That sounds promising.  Don't forget, if you see a recurrence of the 333/2019's, to run poolmon.exe like we talked about.
0
PlaceboC6Commented:
I think we're possibly looking at a record with no lock ups.
0
esabetAuthor Commented:
Yes, as you may recall recently almost every morning I had to reset the server.  And so far nothing of the sort!

Is it possible that the 333 and 2019 were caused due to the heat as well - my theory is perhaps the CPU or even the RAM was overheating and was sending garbage out?
0
PlaceboC6Commented:
333's can be caused by a number of things,  but 2019's are very specific and wouldn't be a result of the heat.  I think it is safe to say that the heat was the cause of the hard lock at this point,  so at this point we're just monitoring to make sure the 2019's don't show up again.

If I recall,  you updated to SP2 during this process,  so that combined with any other drivers you updated could have resolved the 333/2019 issue.
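
To make the "just monitoring" part less manual, here is a rough sketch that counts the watched Event IDs in a System log that has been exported to CSV. Big assumption: the export has a header row with a column named `EventID`. That column name and the function name are hypothetical, so adjust them to whatever your export actually contains.

```python
import csv
import io
from collections import Counter


def count_event_ids(csv_file, watched=(333, 2019)):
    """Count how often each watched Event ID appears in an exported log.

    csv_file is any open text file (or file-like object) whose first row
    is a header containing an "EventID" column (assumed column name).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        try:
            event_id = int(row["EventID"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable Event ID
        if event_id in watched:
            counts[event_id] += 1
    return counts


# Tiny made-up sample just to show the call; a real export would be opened with open().
sample = io.StringIO("EventID,Type\n333,Error\n59,Error\n333,Error\n2019,Error\n")
print(count_event_ids(sample))
```

If both counts stay at zero over a few weeks of exports, that supports the idea that SP2 and the driver updates took care of the 333/2019 issue.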
0
PlaceboC6Commented:
I'm starting to think we can call your server fixed.
0
esabetAuthor Commented:
I would have to agree with you BUT something did happen, though I believe it is totally unrelated.  Two days ago I found the server not responding but this time it was not a hard lock (the CAPS Lock and the Num Lock were working).  You could hear a beeping sound from the server.  Though it was not the ordinary hard lock, you still could not logon in any way, even at console.  This had happened long ago too and I knew what it is.  The beeping sound was being generated by the Raid card (Promise FastTrack 2300).  The beep signifies that one of the drives has failed.  So I had to reboot the server.

The log shows an event by Promise that one of the drives has been unplugged, even though that was not the case.  And the next few events showed that the Raid driver has timed out and that Logical Drive 1 has gone critical.

As I mentioned earlier, this had happened several times before and I had called Promise Tech support. First time around they had me change the PCI slot in an attempt to change the IRQ.  The next time it happened they had me change my drives. Then it all seemed good and the problem seemed to have disappeared.  But then we started having the problems that we had addressed in this thread.

Now that it happened again I called them and they now think the card is bad so they have sent me a replacement card, which I will install this weekend.

All this just as I had thought that we resolved everything.  I am hoping after this final step everything is good.
0
PlaceboC6Commented:
I'm sure it is unrelated,  yes.

Hopefully that new card works out for you.
0
esabetAuthor Commented:
This is very frustrating I must say and I wish I was not posting this but I just had another Hard Lock.  The event logs don't show anything either.  Right after reboot I checked the temps and CPU1 is at 69c to 70c.  Is it possible that even at 70c the CPU is running too hot?  As you may recall, 70c is the maximum temp per the CPU specifications.
0
PlaceboC6Commented:
If 70 is max...I would tend to think 69 isn't much different and could probably cause issues.

Where is the server located.  Is it enclosed in any kind of furniture or anything?  Plenty of airspace all around it for breathing and cooling?
0
esabetAuthor Commented:
It is not in a cabinet or anything; it is about 18 inches away from the rear wall and it is elevated about 18 inches from the floor.  The sides are clear for about 6 to 8 inches on either side and the front and the top are well open.
0
esabetAuthor Commented:
This is getting very frustrating.  I had a hard lock yesterday evening, and luckily I was around to reset, and I had another one this morning.  In both cases I am not sure if it was a heat issue since the temps were reading about 67 to 68c.  There are no events in the log either.  If this keeps up I think I should be looking at buying a new box and making sure it is Win Server 2003 "Certified".  What do you guys think?  Any recommendations?
0
PlaceboC6Commented:
It's a tough decision.  A hard lock is always hardware in nature.  It could still be heat.  It could still be that maybe you need a better case that breathes....or....it could be something else and you'll keep spending money until you find the cause.

I'm partial to Dell personally.  They have a broad range of servers to meet different pricing requirements.  Not to mention support personnel and a warranty that won't require you personally to do all the work (you'll have someone you can call).

I'm very against building servers myself.  I like having a hardware combo that is tested over and over again.  Don't have to worry about whether the ram is going to like the mobo...or if the case is good enough for cooling,  because it should have been designed and tested before it was ever shipped.
0
esabetAuthor Commented:
Just so you know, the server is still randomly crashing and the crashes are all HARD LOCKS.  But I have not seen the Event ID 333.  So whatever you guys did fixed that issue and now I have to deal with this new issue.  I think I will close this thread and award you the points you truly deserve and perhaps open a new thread and see if I can fix the hard lock issues while searching to buy a new Server.

Thanks a million for everything.
0
PlaceboC6Commented:
You're welcome.

If the server is still locking up,  and all firmware and drivers are up to date...then we are either dealing with heat, bad hardware, or a configuration issue at the hardware level.  Perhaps a BIOS setting or something.  I can almost guarantee that based off extensive troubleshooting experience on the server end.

I wish you luck with that.  I do recommend you go with a major name brand server moving forward if you are going to replace the box.  Something with HW support/troubleshooting available and a consistent design.  

Have a good one...
0