RPCss service stops responding after 49-50 hours of server uptime on Windows Server 2003

We have a problem with the rpcss service running on Windows Server 2003 (Domain Controller).
 
After 49-50 hours of server uptime (it seems to vary a little, but last time was 49 hours 58 minutes) the svchost.exe process running the rpcss service hangs and consumes ~50% of CPU time (i.e. all the time for one of the 2 CPUs).
 
Network access to the server, internet access (via proxy on the server), and all Active Directory functions cease (all require Remote Procedure Calls), and a reboot is necessary to fix the problem.  The same problem occurs 49-50 hours later (this has occurred 7 times so far as of this posting).
 
I have looked into the Event Viewer for the server and the following error is generated in system events:
"The server was unable to allocate from the system nonpaged pool because the pool was empty."
 
I have scanned the system files on the server with latest Symantec Antivirus definitions and found nothing.  I cannot find any obvious updates (specifically to the Server OS) that occurred up to 50 hours prior to the first fault.  We have not made any hardware changes in the past month.
 
The server is installed behind a firewall.
Has anyone experienced something similar or can offer any ideas?
 
Thanks in advance for any advice/thoughts on the matter!
 
Cheers,
Daniel
 
System:
OS:    Microsoft Windows Server 2003 R2 Standard Edition (with all current updates) running as Domain Controller
CPU:   2 x Intel(R) Xeon(TM) CPU 2.40GHz
MB:     Intel Corporation SE7505VB2 Board
Memory:     3072MB
Network Adapter:    Intel 8255x-based PCI Ethernet Adapter (10/100)
Firewall:     Sonicwall TZ 170 Standard
Anti-virus:    Symantec Antivirus Corporate Edition 8.0 client (ver 8.00.9374)
Asked by: dan-rolls
 
Amit Bhatnagar (Technology Consultant - Security) commented:
Yes, a TCP connection stays in the TIME_WAIT state for 2 MSL (Maximum Segment Lifetime) after the application releases it, before its port can be used again. One MSL is equal to 120 seconds, so when an application releases a TCP port, the port waits 240 seconds before it is returned to the pool of available ports.
We can reduce this value from 240 seconds to 120 seconds for busier servers.

http://support.microsoft.com/kb/149532/en-us

Looking at the information that you have pasted, the affected server appears to be a DC (LDAP), and LDAP normally consumes this number of ports. Can you check the System event log and see whether you are getting any Server service errors with event ID 2021 or 2022? They normally mean that there is a memory leak of some sort.  Also, try increasing the maximum ephemeral port number from the usual limit of 5000 to 65534 by setting the "MaxUserPort" registry value to its maximum.

http://support.microsoft.com/kb/319504/en-us
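
To put rough numbers on the port arithmetic above, here is a minimal sketch. It assumes the ephemeral range starts at port 1025 (the classic Windows default, not stated in this thread), that 65534 is the upper bound for MaxUserPort, and that every closed connection holds its port for the full TIME_WAIT period. The function names are made up for illustration.

```python
# Back-of-the-envelope estimate: how many outbound TCP connections per second
# can a host sustain before TIME_WAIT exhausts the ephemeral port range?

FIRST_EPHEMERAL = 1025  # assumption: classic Windows ephemeral range start


def usable_ports(first_port: int, max_user_port: int) -> int:
    """Number of ports in the inclusive range [first_port, max_user_port]."""
    return max_user_port - first_port + 1


def max_sustained_rate(ports: int, time_wait_seconds: int) -> float:
    """Connections per second sustainable if each closed connection
    keeps its port unavailable for time_wait_seconds."""
    return ports / time_wait_seconds


# (MaxUserPort, TIME_WAIT seconds) combinations discussed in the comment above
SCENARIOS = [(5000, 240), (5000, 120), (65534, 240)]

for max_port, wait in SCENARIOS:
    ports = usable_ports(FIRST_EPHEMERAL, max_port)
    rate = max_sustained_rate(ports, wait)
    print(f"MaxUserPort={max_port} TIME_WAIT={wait}s: "
          f"{ports} ports, about {rate:.1f} connections/s")
```

This is a simplification: it ignores ports held by long-lived connections and listeners, so the real ceiling is somewhat lower.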
 
Amit Bhatnagar (Technology Consultant - Security) commented:
I have faced this issue quite a number of times, though caused by various applications. Have you tried checking the number of open TCP sessions at the time of the issue? You can run "Netstat -pano tcp" to see the list of TCP sessions. You can also use TCPView from Sysinternals (now Microsoft) to see the number of active sessions.
The issue normally arises when a faulty application slowly grabs all the TCP ports but does not release them. Since the process takes one port at a time, it takes about three days to cover all the ports. Once you are out of open (ephemeral) ports, you start getting the errors. I have also seen this issue occur within a few hours when the application consumes TCP ports quickly. Make sure to run TCPView or "Netstat -pano TCP" at the time of the issue.
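
As a rough illustration of the check described above, the following sketch tallies TCP rows from saved netstat output by state and by owning PID. It assumes the usual five-column layout (Proto, Local Address, Foreign Address, State, PID); the sample rows and the function name are hypothetical, not taken from the affected server.

```python
from collections import Counter


def summarize_netstat(text: str):
    """Tally TCP rows by state and by (PID, state).

    Expects rows shaped like: TCP <local> <remote> <state> <pid>
    Header lines and anything else are ignored.
    """
    by_state = Counter()
    by_pid_state = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0] == "TCP" and fields[4].isdigit():
            state, pid = fields[3], int(fields[4])
            by_state[state] += 1
            by_pid_state[(pid, state)] += 1
    return by_state, by_pid_state


# Hypothetical sample in the assumed layout (addresses and PIDs are made up)
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       728
  TCP    10.0.0.2:389           10.0.0.2:4859          TIME_WAIT       0
  TCP    10.0.0.2:389           10.0.0.2:4860          TIME_WAIT       0
  TCP    10.0.0.2:1306          10.0.0.9:9100          ESTABLISHED     1432
"""

states, pid_states = summarize_netstat(SAMPLE)
for state, count in states.most_common():
    print(f"{state:12} {count}")
for (pid, state), count in pid_states.most_common():
    print(f"PID {pid:<6} {state:12} {count}")
```

A PID that owns an ever-growing number of connections between snapshots is the first candidate for the faulty application.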
 
dan-rolls (Author) commented:
No, I haven't looked into open TCP sessions, but that sounds like a very good idea.  I was actually looking at Process Explorer by Sysinternals earlier today, and am now downloading the entire suite.  I will have a look at this in the office tomorrow (in about 15 hours).
Thanks!
 
Amit Bhatnagar (Technology Consultant - Security) commented:
Sure, you can download TCPView from the link given below, just in case :)

http://technet.microsoft.com/en-us/sysinternals/bb897437.aspx

 
dan-rolls (Author) commented:
Thanks, TCPView was a good utility.  I took a snapshot of the log just before and just after the problem occurred and can't really see much of a difference between the two.  Certainly not the thousands of open TCP sessions one would expect if all the ports had been snatched up.

However, I have since checked the log on a couple of different occasions and have noticed that occasionally a very large number of TCP connections appear (about 500 or so):

[System Process]:0      TCP      hebe:ldap      hebe:4859      TIME_WAIT      
[System Process]:0      TCP      hebe:ldap      hebe:4860      TIME_WAIT      
[System Process]:0      TCP      hebe:ldap      hebe:4861      TIME_WAIT      
[System Process]:0      TCP      hebe:ldap      hebe:4862      TIME_WAIT      
[System Process]:0      TCP      hebe:ldap      hebe:4863      TIME_WAIT      
[System Process]:0      TCP      hebe:ldap      hebe:1309      TIME_WAIT      
[System Process]:0      TCP      hebe:ldap      hebe:1308      TIME_WAIT      
[System Process]:0      TCP      hebe:ldap      hebe:1307      TIME_WAIT      
[System Process]:0      TCP      hebe:ldap      hebe:1306      TIME_WAIT

They seem to stay open in the TIME_WAIT state for a while and then close (or are killed?).  I am not entirely sure how the OS handles TCP connections, but I am guessing that the System Idle Process only holds these connections while they are in the TIME_WAIT state.  Any idea how to find out which process is creating these hundreds of TCP connections?
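
One way to get a handle on rows like these is to group a saved TCPView log by local endpoint, remote host, and state. The sketch below does that for the rows quoted above. It assumes the saved log keeps the column order shown (process:PID, protocol, local, remote, state); the function names are illustrative. Connections in TIME_WAIT are generally reported under PID 0 because the owning process has already closed its socket, so the creating process would have to be caught while the connection is still ESTABLISHED.

```python
from collections import Counter


def parse_tcpview_line(line: str):
    """Split one saved TCPView row into (process, proto, local, remote, state).

    The process column can contain spaces ("[System Process]:0"), so the
    last four whitespace-separated fields are taken from the right.
    Returns None for rows that do not look like TCP entries.
    """
    parts = line.split()
    if len(parts) < 5 or parts[-4] != "TCP":
        return None
    process = " ".join(parts[:-4])
    proto, local, remote, state = parts[-4:]
    return process, proto, local, remote, state


def tally(lines):
    """Count rows per (local endpoint, remote host, state)."""
    counts = Counter()
    for line in lines:
        row = parse_tcpview_line(line)
        if row is not None:
            _, _, local, remote, state = row
            remote_host = remote.rsplit(":", 1)[0]
            counts[(local, remote_host, state)] += 1
    return counts


# Rows copied from the log excerpt above
ROWS = [
    "[System Process]:0      TCP      hebe:ldap      hebe:4859      TIME_WAIT",
    "[System Process]:0      TCP      hebe:ldap      hebe:4860      TIME_WAIT",
    "[System Process]:0      TCP      hebe:ldap      hebe:1309      TIME_WAIT",
    "[System Process]:0      TCP      hebe:ldap      hebe:1308      TIME_WAIT",
]

for (local, remote_host, state), count in tally(ROWS).most_common():
    print(f"{count:5} x {local} <- {remote_host} [{state}]")
```

In the excerpt both ends are on hebe, which suggests the LDAP clients are local processes on the server itself.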
 
dan-rolls (Author) commented:
Thanks Bammit99 for your help.
I think you have put me on the right track with the notion that this might be caused by a memory leak.  I used Poolmon.exe (available in the Microsoft Support Tools - Microsoft KB article 177415) to try and find the offender, and got the following results:

Memory: 3144648K Avail: 2192492K  PageFlts:  5459   InRam Krnl: 3844K P:106492K
Commit: 519508K Limit:4563972K Peak: 577536K            Pool N:116168K P:107008K
System pool information
Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

AfdB Nonp     173603 (   3)      3248 (   1)   170355 84515600 (  2368)    496
Ntfr Nonp      30759 (   0)      2311 (   0)    28448 1821640 (     0)     64
File Nonp    2298349 (1154)   2276414 (1160)    21935 3337752 (  -912)    152
NtFs Nonp      48331 (   0)     29633 (   0)    18698  753392 (     0)     40
.....

Anyway, I did a search on the AfdB tag, and finally came across this Microsoft article: http://support.microsoft.com:80/kb/933999/en-us
This seems to fit exactly my symptoms and set-up, we had a couple of printers using the HP Standard TCP/IP port monitor, including one printer that had long been removed and had a document sitting waiting to print!  Immediately I could see the Spoolsv.exe process had consumed ~83MB of non-paged memory (something I missed/wasn't sure was correct behavior beforehand).  After making the suggested fixes and restarting the print spooler service it seems to be behaving normally now!
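
For anyone repeating the Poolmon step, here is a small sketch that ranks tags from a saved Poolmon snapshot by bytes held. It assumes the column layout shown in the output above (Tag, Type, Allocs, Frees, Diff, Bytes, Per Alloc, with per-interval deltas in parentheses) and that tags contain no spaces; the function name is hypothetical.

```python
import re


def parse_poolmon(text: str):
    """Return one dict per pool tag row, sorted by bytes held (largest first)."""
    rows = []
    for line in text.splitlines():
        # Drop the parenthesised per-interval deltas, e.g. "(   3)" or "(  -912)"
        fields = re.sub(r"\([^)]*\)", " ", line).split()
        if (len(fields) == 7 and fields[1] in ("Nonp", "Paged")
                and all(f.isdigit() for f in fields[2:])):
            allocs, frees, diff, nbytes, per_alloc = map(int, fields[2:])
            rows.append({"tag": fields[0], "type": fields[1], "allocs": allocs,
                         "frees": frees, "diff": diff, "bytes": nbytes,
                         "per_alloc": per_alloc})
    return sorted(rows, key=lambda r: r["bytes"], reverse=True)


# Rows copied from the Poolmon output above
SNAPSHOT = """
Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

AfdB Nonp     173603 (   3)      3248 (   1)   170355 84515600 (  2368)    496
Ntfr Nonp      30759 (   0)      2311 (   0)    28448 1821640 (     0)     64
File Nonp    2298349 (1154)   2276414 (1160)    21935 3337752 (  -912)    152
NtFs Nonp      48331 (   0)     29633 (   0)    18698  753392 (     0)     40
"""

for row in parse_poolmon(SNAPSHOT):
    mib = row["bytes"] / (1024 * 1024)
    print(f"{row['tag']} {row['type']:5} outstanding={row['diff']:7} {mib:8.1f} MiB")
```

A tag whose Diff (allocations minus frees) keeps climbing between snapshots is the likely leak; here AfdB stands far above the rest.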

We'll see how things go.
 
Amit Bhatnagar (Technology Consultant - Security) commented:
Sure thing, bud :) Also, in case you feel a certain service or DLL is causing the issue, contact Microsoft for a free hotfix. Microsoft does not charge anything for hotfix cases. In fact, get me the version of the service or DLL that is leaking and I might be able to get you the correct hotfix or update number.
 
dan-rolls (Author) commented:
Thanks for your help.  It looks like the problem is fixed now!  spoolsv.exe now seems to be behaving, and only consuming ~10k of non-paged memory, and the system hasn't crashed for over 3 ½ days.