TuliTaivas


exchange does not find global catalog - LSASS might be the reason

Dear experts

I have a problem on our Exchange server which is probably caused by a problem on our DC (jupiter).

It started with Outlook hanging. On inspecting the Exchange server (called SATURN), I found the following entry in the application event log:

[Event Type - Source - Category - ID - Date Time - User - Computer
Description]

***  Error - MSExchangeDSAccess - Topology - 2103 - 15.03.06 13:40:15 - N/A - SATURN
Process MAD.EXE (PID=2384). All Global Catalog Servers in use are not responding:jupiter.xxx.com

The DC was locked under the domain admin account. I was unable to unlock the server; it always reported an invalid password, although I'm convinced I typed the correct one.
On another occasion, when I was logged in as domain admin and tried to shut the server down, it said that I had no permission to do so.

I also noticed that LSASS on the DC was using 60% to 99% CPU. Even after rebooting the Exchange server, the information store would not start while LSASS on the DC was running wild. After a while LSASS went down to almost no CPU (I didn't actually DO anything, I was just watching with Task Manager and Process Explorer), and then a reboot of the Exchange server got it back to work.

*** Information - MSExchangeDSAccess - Topology - 2081 - 15.03.06 16:06:02 - N/A - SATURN
Process INETINFO.EXE (PID=728). DSAccess will use the servers from the following list:

Domain Controllers:
jupiter.xxx.com
venus.xxx.com

Global Catalogs:
jupiter.xxx.com

The Configuration Domain Controller is set to jupiter.xxx.com
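
The 2081 entry above lists two domain controllers but only one global catalog. As a rough illustration (a sketch, not an Exchange tool), the following Python snippet pulls the two server lists out of a pasted event description and flags the case where only one GC is listed. The function names are made up for this example, and the parsing assumes the description is laid out exactly as pasted above.

```python
from typing import Dict, List

# Section headers as they appear in the pasted 2081 description.
SECTION_HEADERS = ("Domain Controllers", "Global Catalogs")


def parse_dsaccess_list(description: str) -> Dict[str, List[str]]:
    """Collect the host names listed under each known section header."""
    servers: Dict[str, List[str]] = {name: [] for name in SECTION_HEADERS}
    current = None
    for raw_line in description.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines carry no data
        if line.endswith(":") and line[:-1] in servers:
            current = line[:-1]            # entering a known section
        elif current is not None and " " not in line:
            servers[current].append(line)  # a bare host name
        else:
            current = None                 # any other sentence ends the section
    return servers


def has_single_gc(servers: Dict[str, List[str]]) -> bool:
    """True when exactly one global catalog is listed."""
    return len(servers["Global Catalogs"]) == 1


EVENT_2081 = """\
Process INETINFO.EXE (PID=728). DSAccess will use the servers from the following list:

Domain Controllers:
jupiter.xxx.com
venus.xxx.com

Global Catalogs:
jupiter.xxx.com

The Configuration Domain Controller is set to jupiter.xxx.com
"""

topology = parse_dsaccess_list(EVENT_2081)
print(topology)
print("single GC:", has_single_gc(topology))
```

With only jupiter in the GC list, DSAccess has nowhere else to go when that one server stops answering, which matches the 2103 error.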


I also found the following entries in the event log of the DC:

*** Error - Userenv - None - 1000 - 15.03.06 13:39:20 - NT AUTHORITY\SYSTEM - JUPITER
Windows cannot obtain the domain controller name for your computer network. Return value (2146).Userenv.log: USERENV(e8.39c) 13:33:39:662 ProcessGPOs: DSGetDCName failed with 2146.

*** Warning - w32time - None - 63 - 15.03.06 14:42:53 - N/A - JUPITER
The time service cannot provide secure (signed) time to client 192.168.1.140
because the attempt to validate its computer account failed with error 1723.
Falling back to insecure (unsigned) time for this client.
Data:
0000: 00 00 00 00               ....  
[Note: 192.168.1.140 is a W2k client]

*** Error - Userenv - None - 1000 - 15.03.06 15:04:25 - NT AUTHORITY\SYSTEM - JUPITER
Windows cannot obtain the domain controller name for your computer network. Return value (2146).
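
Since the entries come from two machines, it can help to put them on one timeline. The sketch below parses the header lines in the format described at the top of the post (Event Type - Source - Category - ID - Date Time - User - Computer) and sorts them by time. It assumes the date is written day.month.year and that no field contains " - " itself; `EventHeader`, `parse_event_header` and `build_timeline` are names invented for this example.

```python
import re
from datetime import datetime
from typing import List, NamedTuple


class EventHeader(NamedTuple):
    event_type: str
    source: str
    category: str
    event_id: int
    timestamp: datetime
    user: str
    computer: str


# Three asterisks, optional whitespace, then the " - " separated fields.
HEADER_RE = re.compile(r"^\*{3}\s*(.+)$")


def parse_event_header(line: str) -> EventHeader:
    """Split one '*** Type - Source - ...' header line into its seven fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an event header: {line!r}")
    fields = [field.strip() for field in match.group(1).split(" - ")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    event_type, source, category, event_id, stamp, user, computer = fields
    return EventHeader(
        event_type, source, category, int(event_id),
        datetime.strptime(stamp, "%d.%m.%y %H:%M:%S"),  # assumed day.month.year
        user, computer,
    )


def build_timeline(lines: List[str]) -> List[EventHeader]:
    """Parse all header lines and return them in chronological order."""
    return sorted((parse_event_header(line) for line in lines),
                  key=lambda event: event.timestamp)


# Header lines copied from the post (raw strings keep the backslash intact).
HEADERS = [
    r"***  Error - MSExchangeDSAccess - Topology - 2103 - 15.03.06 13:40:15 - N/A - SATURN",
    r"*** Information - MSExchangeDSAccess - Topology - 2081 - 15.03.06 16:06:02 - N/A - SATURN",
    r"*** Error - Userenv - None - 1000 - 15.03.06 13:39:20 - NT AUTHORITY\SYSTEM - JUPITER",
    r"*** Warning - w32time - None - 63 - 15.03.06 14:42:53 - N/A - JUPITER",
    r"*** Error - Userenv - None - 1000 - 15.03.06 15:04:25 - NT AUTHORITY\SYSTEM - JUPITER",
]

for event in build_timeline(HEADERS):
    print(event.timestamp.time(), event.computer, event.source, event.event_id)
```

Sorted this way, the first Userenv error on JUPITER (13:39:20) comes 55 seconds before the 2103 error on SATURN (13:40:15).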

The problem has happened three times, with two to three days in between.

We have a small W2k domain with about 100 users/mailboxes and 30 desktops/notebooks.

I have collected some more evidence, but since I don't know what is relevant, I'll stop here so as not to overwhelm you with too many details.

Thanks

Roger
Jay_Jay70

how often do you reboot your DC?

sounds and looks like the system went into a pretty crazy state of mind...... :)

LSASS is a process I have seen cause grief many times. Are you completely up to date with service packs and updates, and are you certain there is no malware on your DC or Exchange server playing around with processes?

lots of different little issues in there that all seem to be linked    hmmmm
TuliTaivas

ASKER

Hi Jay

We reboot as little as possible, e.g. after installing patches if needed; otherwise it runs for weeks. We have used it that way for 5 years without any major problems so far.

We are up to date with Windows patches on both the DC and the Exchange server. Not sure about the latest Exchange patches, though.
It's Exchange 2000 with SP3, if I recall correctly (can't check right now).

We have Symantec AV installed, which scans the server once a week. But I haven't checked with a rootkit detector such as BlackLight.

I have observed the available memory on jupiter with Task Manager. After a cold boot LSASS uses about 40 MB, and about 270 MB of the 500 MB installed are available. Within, say, 3 hours LSASS goes up to 53 MB and available memory goes down slowly but steadily to 60 MB and even to 5 MB. What puzzles and worries me a bit is where the RAM goes when all other numbers stay more or less the same (system cache goes up a bit but does not account for the whole difference). Commit charge is always around 220 MB.
I also checked with Sysinternals' Process Explorer, but this didn't reveal any memory leak either.
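
To put numbers on the "where does the RAM go" question, here is the arithmetic as a few lines of Python. It uses only the figures quoted above and treats the Task Manager values as directly comparable, which is a simplification; the variable names are just for illustration.

```python
# Figures quoted in the post, all in MB.
available_after_boot = 270
available_later = 60
lsass_after_boot = 40
lsass_later = 53

available_drop = available_after_boot - available_later  # memory that "went away"
lsass_growth = lsass_later - lsass_after_boot            # part explained by LSASS
unexplained = available_drop - lsass_growth              # part LSASS does not explain

print(f"available memory dropped by {available_drop} MB")
print(f"LSASS grew by {lsass_growth} MB")
print(f"not explained by LSASS growth: {unexplained} MB")
```

So the growth of LSASS itself covers only a small part of the drop, which fits the suspicion that LSASS may not be the primary cause.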

Once, when I logged off and on again, the available memory went up from 60 MB to about 180 MB, so I thought that something in the user session was using the memory. But the next time I tried this, when the available memory was again down to 60 MB, it stayed at 60 MB after logging in again.

Maybe LSASS running wild is not the primary cause but rather a reaction to something else that goes bad?

As an aside:
We have a second DC which is NOT configured as a global catalog server. It's a rather old Compaq Deskpro: 800 MHz, 392 MB RAM, and disks of 4 GB (770 MB free), 14 GB (7 GB free) and 72 GB (28 GB free). Not sure whether it could take the additional load. Setting it up as a second GC would help in keeping the Exchange server up, I assume. What do you think?

Roger
ASKER CERTIFIED SOLUTION
Jay_Jay70
Hi Jay

I tried to find out what the return value 2146 means in the event log message
"Windows cannot obtain the domain controller name for your computer network. Return value (2146)."
which seems to be the first in the chain of events. No luck so far. Any ideas where I should look?
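
One place to look, offered as a sketch rather than a definitive answer: codes from 2100 to 2999 fall into the network management range defined in lmerr.h (`NERR_BASE` = 2100), and the built-in `net helpmsg` command prints the message text for such codes as well as for ordinary Win32 error codes. The small Python helper below only sorts a code into one of the two families and builds the lookup command; the function names are hypothetical, and the "probably Win32" branch is a guess for anything outside the NERR range.

```python
NERR_BASE = 2100            # start of the network management range in lmerr.h
MAX_NERR = NERR_BASE + 899  # last code reserved for that range


def error_code_family(code: int) -> str:
    """Say which header file a numeric return value most likely belongs to."""
    if NERR_BASE <= code <= MAX_NERR:
        return "network management (NERR_*, lmerr.h)"
    return "outside the NERR range, probably a general Win32 error (winerror.h)"


def helpmsg_command(code: int) -> str:
    """Build the command line that prints the message text on a Windows box."""
    return f"net helpmsg {code}"


# The two codes that appear in the event log entries of this thread.
for code in (2146, 1723):
    print(code, "->", error_code_family(code), "| try:", helpmsg_command(code))
```

Running `net helpmsg 2146` and `net helpmsg 1723` on either server should print the message text for the Userenv and w32time entries respectively.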

I thought the benefit of having 2 GCs is that if one fails, the other can provide the information to Exchange (or whatever is querying the GC). The logins in the morning could also be quicker. And I remember having read such recommendations (on EE).
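
A quick way to see which GC candidates are reachable at all is to try a TCP connection to port 3268, the port on which a global catalog answers LDAP queries. The Python sketch below does only that. It is an assumption-laden shortcut: an open port does not prove that LSASS is answering queries in time, and `accepts_tcp` and `responding_servers` are names made up for this example.

```python
import socket
from typing import Iterable, List

GC_LDAP_PORT = 3268  # TCP port on which a global catalog answers LDAP queries


def accepts_tcp(host: str, port: int = GC_LDAP_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


def responding_servers(hosts: Iterable[str], port: int = GC_LDAP_PORT,
                       timeout: float = 3.0) -> List[str]:
    """Filter the given hosts down to those that accept a connection."""
    return [host for host in hosts if accepts_tcp(host, port, timeout)]
```

Called as `responding_servers(["jupiter.xxx.com", "venus.xxx.com"])`, it would return only the servers that currently accept connections on the GC port; with a single GC, an empty list corresponds to the situation in the 2103 error.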

[However, there is a gotcha in that the infrastructure FSMO server should not be placed on a GC under certain circumstances. But apparently in a single domain this is not an issue, since the infrastructure server has nothing to do. (http://support.microsoft.com/default.aspx?scid=KB;en-us;q223346)]

Roger
Hi Jay

Since you didn't mention it in your reply, do you think I don't have to worry about the available memory thing?

Roger
heya mate, the rules for the infrastructure master only come into play with multiple domains... you are correct :)


I still think we need to look at that available memory, I'm just trying to think what it could be...
Hi
The systems have behaved well for more than a week now, whereas when the problems began, the servers threw up every 2 to 4 days. A few days ago I set our second DC (called VENUS) to be GC as well. Maybe this has helped.

In the properties page computertsu pointed out, I noticed that the configuration server is now venus; before that it was jupiter. I don't know whether this is of any significance.

I'm going to split the points among you if this is OK.

If something comes to mind that explains what has happened, please let me know. For the moment I don't have the time to investigate more. I'm glad that it seems OK now. Hoping for the best.

Take care

Roger