We have recently started having problems with our Server 2003 DC. The server had been working fine, but over the past week or so we have had to forcibly reset it nearly every day.
The problem first shows up when our Insight Management System reports that the server is no longer responding to pings. We then visit the server and find that the machine appears to be booted, but when we try to unlock the console it reports that it is unable to do so because insufficient resources are available. In the end there are only two ways to reboot the server:
i) Forcibly reset it using the power button.
ii) Use the ILO to connect remotely and force a warm reboot.
After rebooting the server and checking the Event Log, we find that the following error has been logged many times before the machine became non-responsive:
Event ID: 2019
Description: The server was unable to allocate from the system nonpaged pool because the pool was empty.
As an example, the first of these errors was recorded at 02:36 AM this morning. I then received an alert from the Insight Manager at 02:43 AM saying that the server was no longer responding. Looking further through the event log, we then see these errors:
Event ID: 1001
Description: System memory is running very low. Norton AntiVirus Realtime Protection may not be able to function properly.
Source: Application Popup
Event ID: 333
Description: An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in, or write out, or flush, one of the files that contain the system's image of the Registry.
However, I believe both of these errors are caused by the lack of memory on the system, so if we can resolve the underlying problem we shouldn't see these messages again. The message from Srv then repeats every 60 seconds, whilst the one raised by Application Popup occurs every 20 seconds or so.
Clearly something is consuming the nonpaged pool, but the strange thing is that the failure always seems to occur at the same time in the middle of the night. Here are a few of the things we have done so far.
1. Disabled the Volume Shadow Copy and Microsoft Software Shadow Copy Provider Services. This follows a KB article I read about backup software causing problems with the VSC Service, resulting in a memory leak that consumed the nonpaged pool.
2. Disabled the backup and all ArcServe related Services (using BrightStor 11 IIRC).
3. Set up PerfMon to collect statistics from Memory\Pool Nonpaged Allocs and Memory\Pool Nonpaged Bytes. Also set up monitoring of Process\Pool Nonpaged Bytes. When graphed in PerfMon these counters don't show anything of interest, but I'm only monitoring selected processes, so if the process causing the problem is spawned at night then PerfMon could be missing it.
4. Created a scheduled task to run PoolMon and dump the output to a file every 15 minutes. It seems that a Tag called THRE is taking up a significant amount of nonpaged pool space, and it gradually increases in size throughout the day. Just prior to the problem this morning the Tag's byte count equated to 202MB.
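To compare the PoolMon dumps from step 4 without reading them by eye, something like the rough Python sketch below could rank tags by how much their nonpaged bytes grew between two snapshots. It assumes each data row starts with the tag, then Nonp or Paged, then the numeric columns in the order Allocs, Frees, Diff, Bytes, Per Alloc, with any bracketed per-interval deltas ignored, and that tags contain no spaces. That layout is an assumption and may need adjusting for the actual dump files. The function names and the sample figures are made up purely for illustration; they are not taken from our server.

```python
import re

# Assumed row layout: "<tag> <Nonp|Paged> <Allocs> <Frees> <Diff> <Bytes> <Per Alloc>"
ROW = re.compile(r"^\s*(\S{1,4})\s+(Nonp|Paged)\s+(.*)$")
PAREN = re.compile(r"\([^)]*\)")  # bracketed delta columns, if present


def parse_snapshot(text):
    """Return {tag: bytes} for the nonpaged rows of one PoolMon dump."""
    usage = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match or match.group(2) != "Nonp":
            continue  # header, blank line, or paged-pool row
        # Drop any "( n)" groups, then read the remaining integer columns.
        columns = PAREN.sub(" ", match.group(3))
        numbers = [int(n) for n in re.findall(r"-?\d+", columns)]
        if len(numbers) >= 4:
            usage[match.group(1)] = numbers[3]  # fourth column assumed to be Bytes
    return usage


def growth(first, last):
    """Rank tags by byte growth between two snapshots, largest first."""
    deltas = {tag: last[tag] - first.get(tag, 0) for tag in last}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Illustrative figures only, not real output from the server.
early = """
 Tag  Type     Allocs     Frees      Diff     Bytes  Per Alloc
 THRE Nonp      50000     10000     40000  25600000        640
 File Nonp      90000     85000      5000    800000        160
"""
late = """
 Tag  Type     Allocs     Frees      Diff     Bytes  Per Alloc
 THRE Nonp     400000     70000    330000 211200000        640
 File Nonp     190000    184900      5100    816000        160
"""

for tag, delta in growth(parse_snapshot(early), parse_snapshot(late)):
    print(f"{tag:<4} grew by {delta / (1024 * 1024):.1f} MB")
```

Running this over the first and last dump of each day should show whether THRE is the only tag climbing steadily or whether something else jumps shortly before the failure.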
I'm running out of ideas, so any suggestions would be greatly appreciated.