Exchange does not find global catalog - LSASS might be the reason

Dear experts

I have a problem on our Exchange server which is probably caused by a problem on our DC (JUPITER).

It started with Outlook hanging. Upon inspection of the Exchange server (called SATURN) I found the following entry in the application event log:

[Event Type - Source - Category - ID - Date Time - User - Computer
Description]

***  Error - MSExchangeDSAccess - Topology - 2103 - 15.03.06 13:40:15 - N/A - SATURN
Process MAD.EXE (PID=2384). All Global Catalog Servers in use are not responding:jupiter.xxx.com
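
For anyone who wants to sort or filter entries pasted in this header format, here is a minimal Python sketch that splits one "*** ..." header line into its seven fields. The names EventEntry and parse_event_header are made up for illustration, and the day.month.year reading of the date is an assumption based on the timestamps shown in this thread.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EventEntry:
    event_type: str
    source: str
    category: str
    event_id: int
    timestamp: datetime
    user: str
    computer: str


def parse_event_header(line: str) -> EventEntry:
    """Parse one '*** Type - Source - Category - ID - Date Time - User - Computer' line."""
    # Drop the leading asterisks and spaces, then split on the ' - ' separator.
    fields = [f.strip() for f in line.lstrip("* ").split(" - ")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    event_type, source, category, event_id, stamp, user, computer = fields
    return EventEntry(
        event_type=event_type,
        source=source,
        category=category,
        event_id=int(event_id),
        # Assumption: dates are written day.month.two-digit-year.
        timestamp=datetime.strptime(stamp, "%d.%m.%y %H:%M:%S"),
        user=user,
        computer=computer,
    )


entry = parse_event_header(
    "***  Error - MSExchangeDSAccess - Topology - 2103 - 15.03.06 13:40:15 - N/A - SATURN"
)
print(entry.source, entry.event_id, entry.timestamp.isoformat())
```

Parsing the timestamps this way makes it easier to line up the SATURN and JUPITER entries in time order.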

The DC was locked with the domain admin account. I was unable to unlock the server, it always said invalid password. I'm convinced I typed the correct password.
On another occasion, when I was logged in as domain admin and tried to shut the server down, it said that I had no permission to shut the server down.

I also noticed that LSASS was using 60% to 99% CPU, so even after rebooting the exchange server the information store would not start while LSASS on the DC was running wild. After a while LSASS went down to almost no CPU (I didn't actually DO anything, I was just watching with task manager and process explorer) and then a reboot of the exchange got it back to work.

*** Information - MSExchangeDSAccess - Topology - 2081 - 15.03.06 16:06:02 - N/A - SATURN
Process INETINFO.EXE (PID=728). DSAccess will use the servers from the following list:

Domain Controllers:
jupiter.xxx.com
venus.xxx.com

Global Catalogs:
jupiter.xxx.com

The Configuration Domain Controller is set to jupiter.xxx.com
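
The 2081 entry above lists two domain controllers but only one global catalog. A small Python sketch, assuming the description text is laid out exactly as quoted here, can pull out both lists and flag the single-GC case. The function name parse_dsaccess_topology and the constant EVENT_2081 are hypothetical.

```python
EVENT_2081 = """Process INETINFO.EXE (PID=728). DSAccess will use the servers from the following list:

Domain Controllers:
jupiter.xxx.com
venus.xxx.com

Global Catalogs:
jupiter.xxx.com

The Configuration Domain Controller is set to jupiter.xxx.com
"""


def parse_dsaccess_topology(description: str) -> dict[str, list[str]]:
    """Collect the server names listed under each section heading."""
    sections = {
        "Domain Controllers:": "domain_controllers",
        "Global Catalogs:": "global_catalogs",
    }
    result: dict[str, list[str]] = {name: [] for name in sections.values()}
    current = None
    for raw in description.splitlines():
        line = raw.strip()
        if line in sections:
            current = sections[line]      # start of a new server list
        elif not line:
            current = None                # a blank line ends the current list
        elif current is not None:
            result[current].append(line)  # a server name inside a list
    return result


topology = parse_dsaccess_topology(EVENT_2081)
if len(topology["global_catalogs"]) < 2:
    print("Only one global catalog in use:", topology["global_catalogs"])
```

With a single GC in the list, DSAccess has nothing to fall back to when that server stops responding, which matches the 2103 error above.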


I also found the following entry in the evt log of the DC:

*** Error - Userenv - None - 1000 - 15.03.06 13:39:20 - NT AUTHORITY\SYSTEM - JUPITER
Windows cannot obtain the domain controller name for your computer network. Return value (2146).
Userenv.log: USERENV(e8.39c) 13:33:39:662 ProcessGPOs: DSGetDCName failed with 2146.

*** Warning - w32time - None - 63 - 15.03.06 14:42:53 - N/A - JUPITER
The time service cannot provide secure (signed) time to client 192.168.1.140
because the attempt to validate its computer account failed with error 1723.
Falling back to insecure (unsigned) time for this client.
Data:
0000: 00 00 00 00               ....  
[Note: 192.168.1.140 is a W2k client]

*** Error - Userenv - None - 1000 - 15.03.06 15:04:25 - NT AUTHORITY\SYSTEM - JUPITER
Windows cannot obtain the domain controller name for your computer network. Return value (2146).

The problem has happened 3 times, with two to three days in between.

We have a small W2k domain with about 100 users/mailboxes and 30 desktops/notebooks.

I have collected some more evidence, but not knowing what is relevant, I'll stop here so as not to overwhelm you with too many details.

Thanks

Roger
Asked by: TuliTaivas (server admin)

Jay_Jay70 commented:
how often do you reboot your DC?

sounds and looks like the system went into a pretty crazy state of mind...... :)

LSASS is a process i have seen cause grief many times. are you completely up to date with service packs and updates, and are you certain there is no malware on your DC or exchange server playing around with processes?

lots of different little issues in there that all seem to be linked    hmmmm

TuliTaivas (server admin, author) commented:
Hi Jay

We reboot as little as possible, e.g. after installing patches if it is needed; otherwise it runs for weeks. We have used it that way for 5 years without any major problems so far.

We are up to date with Windows patches on both the DC and the Exchange server. Not sure about the latest Exchange patches, though.
It's Exchange 2000 with SP3 if I recall correctly (can't check right now).

We have Symantec AV installed which scans the server once a week. But I haven't checked with a rootkit revealer like e.g. BlackLight.

I have observed the available memory on jupiter with task manager. After a cold boot LSASS uses about 40 MB and there are about 270 MB available of the 500 MB that are installed. Within say 3 hrs LSASS goes up to 53 MB and available memory starts to go down slowly but steadily to 60 MB and even to 5 MB. What I'm wondering, and what's worrying me a bit, is where the RAM goes when all other numbers stay more or less the same (system cache goes up a bit but does not account for the whole difference). Committed charge is always around 220 MB.
I also checked with Sysinternals' Process Explorer but this didn't reveal any memory leak either.

Once when I logged off and on again, the available memory went up from 60 to about 180 MB so I thought that it was something in the user session that used the memory. But the next time I tried this when the avail mem. was again down to 60MB, it stayed at 60MB after logging in again.
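
To put a number on the "slowly but steadily" decline, one could note the available memory at a few points in time and compute the average rate. The Python sketch below does that. The function name available_memory_trend is hypothetical, and the sample times in the usage example are invented for illustration. Only the 270 MB and 60 MB figures come from the observations above.

```python
def available_memory_trend(samples):
    """Return (MB per hour, hours until zero) from (hours_since_boot, available_mb) samples.

    The samples must be ordered oldest first. The rate is a simple average
    between the first and last sample, not a fitted curve.
    """
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    rate = (m1 - m0) / (t1 - t0)
    # Only project a time-to-exhaustion when memory is actually shrinking.
    hours_left = m1 / -rate if rate < 0 else None
    return rate, hours_left


# Hypothetical timings: 270 MB right after boot, 60 MB ten hours later.
rate, hours_left = available_memory_trend([(0, 270), (10, 60)])
print(f"{rate:.1f} MB/hour, about {hours_left:.1f} hours until exhausted")
```

A roughly constant rate across several reboots would point at a steady leak, while sudden drops would suggest a specific event or task.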

Maybe LSASS running wild is not the primary cause but rather a reaction to something else that goes bad?

As an aside:
We have a second DC which is NOT configured as a global catalog server. It's a rather old Compaq Deskpro: 800 MHz, 392 MB RAM, disks of 4 GB (770 MB free), 14 GB (7 GB free) and 72 GB (28 GB free). Not sure whether it could take the additional load. Setting it up as a second GC would help in keeping the Exchange server up, I assume. What do you think?

Roger

Jay_Jay70 commented:
the problem with trying to troubleshoot LSASS issues is that there are so many different ones out there

i don't see the benefit of setting the machine up as an additional GC, but it may be worth having it set as the only GC temporarily to see if it makes any difference. it will handle the load

if it makes no difference you can always make the original DC a GC again

TuliTaivas (server admin, author) commented:
Hi Jay

I tried to find out what the return value 2146 means in the event log message
"Windows cannot obtain the domain controller name for your computer network. Return value (2146)."
which seems to be the first in the chain of events. No luck so far. Any ideas where I should look?
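
One place to start is the numeric range of the code. Windows reserves 2100 to 2999 for network management (LAN Manager) errors defined in lmerr.h, so 2146 is offset 46 in that range, whereas the 1723 in the w32time warning is an ordinary Win32 error (an RPC "server too busy" code). The Python sketch below only classifies a code by range and suggests the `net helpmsg` command to get the message text. The function name classify_return_value is hypothetical, and the exact symbolic name for 2146 (believed to be NERR_CfgCompNotFound) should be confirmed with `net helpmsg 2146` rather than taken from here.

```python
NERR_BASE = 2100  # first network management error code (lmerr.h)
MAX_NERR = 2999   # last code reserved for that range


def classify_return_value(code: int) -> str:
    """Say which error table a numeric return value most likely belongs to."""
    if NERR_BASE <= code <= MAX_NERR:
        offset = code - NERR_BASE
        return (f"{code}: network management error (NERR_BASE+{offset}); "
                f"try 'net helpmsg {code}'")
    return f"{code}: Win32 system error; try 'net helpmsg {code}'"


for value in (2146, 1723):
    print(classify_return_value(value))
```

Knowing that 2146 comes from the networking layer rather than from Group Policy itself suggests looking at the DC's network and locator configuration first.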

I thought the benefit of having 2 GCs is that if one fails, the other can provide the information to Exchange (or whatever is querying the GC). The logins in the morning could also be quicker. And I also remember having read such recommendations (on EE).

[However, there is a gotcha in that the infrastructure FSMO role should not be placed on a GC under certain circumstances. But apparently in a single domain this is not an issue since the infrastructure master has nothing to do. (http://support.microsoft.com/default.aspx?scid=KB;en-us;q223346)]

Roger

TuliTaivas (server admin, author) commented:
Hi Jay

Since you didn't mention it in your reply, do you think I don't have to worry about the available memory thing?

Roger

Jay_Jay70 commented:
heya mate, the rules for the infrastructure master only come into play with multiple domains... you are correct :)

i still think we need to look at that available memory, i'm just trying to think what it could be...

computertsu commented:
In the Exchange System Manager, go to (your names may vary)
Administrative Groups, First Administrative Group, Servers, SATURN.
Right-click SATURN, go to Properties, and click the tab named Directory Access.
Change the Show drop-down list to Global Catalog Servers, clear the automatic check box, add a different DC, then remove JUPITER.
The change should take effect immediately. See if you still have DSAccess errors.
I believe the AD and/or File Replication Service (NTFRS) may be damaged on your JUPITER DC server.

TuliTaivas (server admin, author) commented:
Hi
The systems have behaved well for more than a week now, whereas when the problems began, the servers threw up every 2 to 4 days. A few days ago I set our second DC (called VENUS) to be a GC as well. Maybe this has helped.

On the properties page computertsu pointed out, I noticed that the configuration server is now venus; before that it was jupiter. I don't know if this is of any significance.

I'm going to split the points among you if this is OK.

If something comes to mind that explains what has happened, please let me know. For the moment I don't have the time to investigate more. I'm glad that it seems OK now. Hoping for the best.

Take care

Roger
Windows 2000

Question has a verified solution.