D91Admin (United States of America) asked:

Lsass.exe High CPU Utilization on Windows 2008 R2 Domain Controller

I have a single domain with 3 DCs that I recently upgraded from Windows 2003 SP2 to Windows 2008 R2. Each server is a Global Catalog server. Since the upgrade, as the client load increases, the CPU utilization of the lsass.exe process climbs from 5-15% to 85-95%. When I disable the network connection during high CPU utilization, usage immediately drops back down to the lower percentage stated above.

I need to know the best way to determine what is causing this issue. I have tried to capture packets by mirroring the switch port for the DC, but this is difficult because the DCs are all virtual machines, so I have to capture on ports that are shared by several servers, and I can't capture packets fast enough. The machine I capture with quickly begins to drop packets and locks up after a short period of time.
Renato Montenegro Rustici (Brazil)

The last time I saw this, it was a virus issue (Conficker). You can install the network capture software directly on the virtual machine, since you want to view the traffic directed to that server. There's no need for any additional configuration on your network equipment.

Conficker, for example, tries to guess users' passwords in AD, and that causes the increase in LSASS CPU usage. I saw another problem like that, but it was related to inventory software. So check for third-party software that may be causing this behavior.

Just to check: VMware Tools is installed in the virtual machine, right?
Follow the steps in this article to run the AD perf monitor. It may give you an idea of what is causing the high CPU usage.

http://blogs.technet.com/b/askds/archive/2010/06/08/son-of-spa-ad-data-collector-sets-in-win2008-and-beyond.aspx
D91Admin (ASKER)

Good ideas to look for. An IT friend from another school district mentioned that they saw the same thing with the Conficker worm. I have talked to our virus specialist and she is looking into it. Thus far we don't have much evidence pointing in that direction, but now that I have heard it a second time, I am going to look into it much more closely.

I will also check into any 3rd party software that authenticates against AD.

I do have VMware Tools installed on the machine. Thanks for the suggestions; I will report back what I find in the next couple of days, after I get time to test. Just curious: which capture program do you normally use?
LSASS is the Local Security Authority Subsystem Service. Usually when you see CPU usage climb as a result of that service, it means malware is trying to authenticate, possibly using a dictionary attack. So I agree: scan for malware first.
OK, good suggestions. I have run the AD perf monitor on one of the DCs during high CPU utilization. I am now looking at the report and I'm not sure exactly which items I need to check for a specific diagnosis.

The following is what I see under each section.

Diagnostic Results:
Warning
Severity: Warning
Warning: Process lsass.exe [ProcessId: 476] has a high CPU consumption of 66.4%.
Related: Performance Diagnosis

Performance, Resource Overview:

Component | Status | Utilization | Details
CPU | Busy | 99% | High CPU load. Investigate Top Processes.
Network | Idle | 1% | Busiest network adapter is less than 15%. Nic Intel[R] PRO_1000 MT Network Connection using 7,359,880 bits and has 1,000,000,000 bits capacity.
Disk | Idle | 12/sec | Disk I/O is less than 100 (read/write) per second on disk 0. Reads 0.3/sec + Writes 11.2/sec
Memory | Normal | 52% | 1980 MB Available.

I already know this. So now I need to understand the best indicators to look at in order to dig deeper into this issue. There are so many items in the log that I don't know where to begin. I have looked at the clients with the most CPU usage, hoping it might point me to evidence of a machine with Conficker, but I'm afraid I don't know exactly what kind of information to look for.

I have an idea that the Security Account Manager section may yield some clues. I do see a list of top users for numerous different types of SAM activity, but due to my limited knowledge of how they interact, I'm afraid I need some help with the diagnosis.
 

 
I think Wireshark would be perfect for identifying the worm, but it requires a trained eye; it may not be that clear to you at first glance.

There is an invaluable tool that may help a lot with your issue. In the Windows 2003 universe it was called Server Performance Advisor, or SPA for short. The tool is now part of Windows 2008 R2.

Open Performance Monitor and navigate to "Data Collector Sets" > "System". Right-click "Active Directory Diagnostics" and choose Start. Wait until the diagnostics finish (the set shows a "Running" status while it is collecting).

When it's done, right-click "Active Directory Diagnostics" and choose "Latest Report". Navigate the report to investigate the issue. Particularly interesting is the Network section. There you can see which clients are connecting most frequently (during the analysis, of course) and which PID is involved. Note that you will need to add the PID column in Task Manager to match it to a process.

I don't know for sure, but it may be a simpler path.
ASKER CERTIFIED SOLUTION
Renato Montenegro Rustici (Brazil)

rmrrustice,

I have monitored 6 to 7 captures, and there seem to be only a few clients that are duplicated, and only between 1 or 2 captures. The bytes/sec values top out at about 35K for both inbound and outbound traffic. I don't know what normal is. Here are the top 15 outbound and inbound clients captured during high CPU usage.




The low end of the top 25 clients has only 161 bytes. If these are my top 25 users, I would think that to be a small amount of traffic. We have 3500 computers divided between three domain controllers, so perhaps the cumulative total could have an effect. But again, I never had this problem last year while running our domain with Windows 2003 domain controllers. Could it be that the client load is too large?
[Image: top outbound clients by bytes (OutboundBytes.jpg)]
Oops, I was composing my submission and, while adding the image, inadvertently clicked Submit. Hence the jumbled second paragraph: simply remove "Some of the clients are" from the beginning of the paragraph, and it should read better.

Here is the corresponding inbound:




[Image: top inbound clients by bytes (InboundBytes.jpg)]
Does the PID (4 in this case) map to the LSASS PID? If it maps, then investigate those clients. You may try to turn them off and see if LSASS processor time drops.
I mapped the PID back: 4 maps to System, and 476 maps to lsass.exe. It helps to understand that mapping; I had not tied that together before.

A few things that kind of stick out to me:

The host that seems to use lsass.exe the most is our Exchange server (10.1.10.24). It is one of the hosts that also shows up on the other DCs.

I find it funny that the local address 127.0.0.1 is showing up in these lists as using the lsass process. What would it be doing?

Most of the hosts using lsass.exe are member servers on our domain, including the other 2 domain controllers as well as a new Exchange 2010 server that is being set up. I would suspect that this is normal.
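The PID-to-process mapping discussed above (PID 4 is System, PID 476 is lsass.exe) can also be done from the command line instead of Task Manager. For a one-off check, `tasklist /FI "PID eq 476"` is enough. If you want to match many PIDs from the AD Diagnostics report at once, a minimal sketch like the one below could parse the output of `tasklist /FO CSV /NH` after you save it to text. The function name `pid_to_image` is hypothetical, and the memory values in the sample are made up; only PIDs 4 and 476 come from this thread.

```python
import csv
import io


def pid_to_image(tasklist_csv: str) -> dict[int, str]:
    """Build a PID -> image name map from `tasklist /FO CSV /NH` text.

    Each row looks like: "Image Name","PID","Session Name","Session#","Mem Usage"
    """
    mapping: dict[int, str] = {}
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        image_name, pid = row[0], row[1]
        mapping[int(pid)] = image_name
    return mapping


# Illustrative sample (memory figures invented for the example).
sample = '''"System","4","Services","0","1,204 K"
"lsass.exe","476","Services","0","45,112 K"
'''

pids = pid_to_image(sample)
print(pids[4], pids[476])  # System lsass.exe
```

With the map in hand, each PID listed in the report's Network section can be looked up directly, which is the same manual check described above, just repeatable.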
Are those virtual machines sized correctly for the load? Windows 2008 R2 requires more system resources than Windows 2003. What's the VMs' hardware configuration? How many users do you have in your domain? Exchange 2010 will query your DCs frequently, so make sure you have the correct configuration for the load.
That is a good question.

I have been looking for documentation or some sort of tool that will allow me to size them properly, but as of yet haven't found one. Do you have any suggestions?

I have 4 VMware hosts: 3 HP DL580 G3 servers, each with 4 dual-core 2.666 GHz processors, and a 4th host, an HP DL580 G5 with 4 quad-core 2.93 GHz processors.

DC1 - has access to 2 vCPUs, 11.72 GHz each, and 4 GB of RAM
DC2 - has access to 4 vCPUs, 5.32 GHz each, and 4 GB of RAM
DC3 - has access to 2 vCPUs, 5.32 GHz each, and 4 GB of RAM

Our domain has about 7500 users on 4500 computers. About 950 full-time employees are logged in all the time; the rest log in only during classes that require computer use.

When you say the correct config for the load, I'm not sure exactly what you mean. Are you referring to the schema?
I meant the hardware configuration. It's hard to find sizing information for Active Directory.

I don't know. There are so many variables. It may be more productive to open a case with Microsoft PSS (it's cheap). They will connect to your servers and run diagnostics. That may be faster than us trying from here.

Or you can combine all information we provided here and conduct some investigation.

Let me send you two links that may help in the diagnostic:

Troubleshooting High LSASS CPU Utilization on a Domain Controller (Part 1 of 2)
http://blogs.technet.com/b/askds/archive/2007/08/20/troubleshooting-high-lsass-cpu-utilization-on-a-domain-controller-part-1-of-2.aspx

Troubleshooting High LSASS CPU Utilization on a Domain Controller (Part 2 of 2)
http://blogs.technet.com/b/askds/archive/2007/08/23/troubleshooting-high-lsass-cpu-utilization-on-a-domain-controller-part-2-of-2.aspx
I have gone through the troubleshooting documents and was not able to find a smoking gun. I have, however, opened a case with Microsoft PSS as you suggested, and I am currently working with them. Their rates are reasonable, as you indicated. I decided to sign up for TechNet, which bought me two troubleshooting cases with the yearly subscription, and I am now using one of those cases to find a resolution to the LSASS high CPU issue. I will report back when I get to the bottom of it.
Nice. They are great. I am sure they will identify the source of your problem.

Let us know when you get a solution.
I have been working with MS PSS for the last 3 days, and still haven't found a solution. The case will be escalated on Monday so hopefully we will get some results. I will report the findings when we get them.
Wow, after all that, the problem ended up being a third-party application on certain models of computers. We are an HP shop, and so far the problem has only shown up on HP DC7900 desktops running an older version of "HP Protect Tools". Once this program, which we don't even use, was either removed or upgraded, the problem went away. My domain controllers now have normal CPU utilization of about 20%.

We finally found the problem by doing a lot of captures and looking at the top talkers by the number of packets they were sending to our domain controllers. Once we had this view, it was easy to identify the culprits and remove the offending software.
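Ranking by packet count rather than byte count is what exposed the culprits here. A minimal sketch of why the two rankings can disagree, assuming the capture has already been reduced to (source IP, frame length) pairs. The addresses (documentation range 192.0.2.0/24) and frame sizes are invented for illustration, and `top_talkers` is a hypothetical helper, not part of any capture tool.

```python
from collections import Counter


def top_talkers(packets, n=5):
    """Rank sources by packet count and by byte count.

    packets: iterable of (source_ip, frame_length_in_bytes) tuples.
    Returns (top_by_packets, top_by_bytes), each a list of (ip, total).
    """
    by_packets = Counter()
    by_bytes = Counter()
    for src, length in packets:
        by_packets[src] += 1       # one more packet from this source
        by_bytes[src] += length    # add this frame's size in bytes
    return by_packets.most_common(n), by_bytes.most_common(n)


# Hypothetical traffic: one client sends many tiny frames,
# another sends a few full-size frames.
sample = [("192.0.2.50", 60)] * 200 + [("192.0.2.24", 1500)] * 10
top_pkts, top_bytes = top_talkers(sample)
print(top_pkts[0])   # ('192.0.2.50', 200)
print(top_bytes[0])  # ('192.0.2.24', 15000)
```

The chatty client sends 20 times as many packets but only 12,000 bytes in total, so it would sit below the other host in a bytes/sec view like the one posted earlier in this thread.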
Mike Kline
Great work!!  This post will definitely help others,  network sniffing for the win!
Glad to hear that. HP really invents. :)
The key to uncovering the primary issue was to find the top talkers by the number of packets sent. Because these were such small packets, the byte count was not very large. The packet type we learned to watch for was of protocol type "MSRPC" and would produce an "Unknown" result in the detail field.
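The filter described above (protocol "MSRPC", "Unknown" in the detail field, counted per sender) could be applied to a capture exported to CSV with something like the sketch below. This rests on assumptions: the column names `Source`, `Protocol`, and `Info` are placeholders that should be changed to match whatever your capture tool actually exports, the helper name `suspicious_rpc_sources` is hypothetical, and the sample rows are invented.

```python
import csv
import io
from collections import Counter


def suspicious_rpc_sources(csv_text, protocol="MSRPC", marker="Unknown",
                           src_col="Source", proto_col="Protocol",
                           info_col="Info"):
    """Count packets per source whose protocol equals `protocol` and whose
    detail column contains `marker`; return (source, count) pairs, highest first.
    Column names are assumptions; adjust them to the actual export."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[proto_col] == protocol and marker in row[info_col]:
            counts[row[src_col]] += 1
    return counts.most_common()


# Invented sample export.
sample = """Source,Destination,Protocol,Info
192.0.2.71,192.0.2.10,MSRPC,Unknown
192.0.2.71,192.0.2.10,MSRPC,Unknown
192.0.2.88,192.0.2.10,MSRPC,Unknown
192.0.2.24,192.0.2.10,LDAP,searchRequest
"""

print(suspicious_rpc_sources(sample))
# [('192.0.2.71', 2), ('192.0.2.88', 1)]
```

Counting matching packets rather than summing their sizes mirrors the approach in this thread: the offending clients sent very small packets, so they only stood out by count.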