Solved

Entire System Struggles when Backup DC is Down

Posted on 2010-09-03
Last Modified: 2012-05-10
I'm trying to understand why our entire system struggles when our Backup DC is down temporarily. As soon as the DC goes down users start screaming that programs are crashing or taking forever to load/save files. Here is the setup.

Server 1 (2008 R2) - Primary DC - All FSMO roles, GC, DHCP, DNS, files/printers, databases
Server 2 (2003 R2) - Exchange Server - Exchange only
Server 3 (2003 R2) - Backup DC - GC, DNS, WSUS, AV Server, Network Monitor

I have checked that the Primary server is the first DNS server listed in DHCP and have edited the registry on both DCs so that Server 1 is the preferred logon server for all clients. So, I'm not sure why clients are still having trouble when Server 3 is temporarily offline.

I have checked and nslookup on our local domain (domain.local) will alternate between which server is listed first, but I can't find a way in Windows DNS to control that.

Any thoughts or suggestions?

Thanks in advance.
Question by:Danoklas
31 Comments
 
LVL 53

Expert Comment

by:Will Szymkowski
ID: 33597030
For the users that are having issues when you take the Backup DC down, can you check which DC they authenticated against at their initial login? You can do this by opening a command prompt and typing SET. Look for LOGONSERVER.

What type of programs are failing or giving errors? Have you checked the event viewer on the client machines to see if there is anything that stands out that might be related to this issue?
 

Author Comment

by:Danoklas
ID: 33597054
The only errors in the client were MS office application crashes. Unfortunately, the server is back up now, and we don't allow command-prompt access to our users. However, checking my own machine, it's logged into Server 1 as they all should be.
 
LVL 3

Accepted Solution

by:
tonyszko earned 125 total points
ID: 33597056
What registry changes have you made on the DCs to make sure that Server 1 is the preferred server? In most cases such a configuration isn't a good idea: AD is pretty good at handling things like one DC being down.

Is there a chance that some applications are hardcoded to use this second DC?

When the second DC is down, are you able to verify DNS name resolution and network connectivity to the first DC?
 
LVL 53

Expert Comment

by:Will Szymkowski
ID: 33597171
If you use the PsExec tools you will be able to run remote commands from the command prompt on another computer. From there you will be able to verify which logon server they are authenticating to.

PsExec Tools - http://technet.microsoft.com/en-us/sysinternals/bb897553.aspx
 
LVL 20

Expert Comment

by:Radhakrishnan Rajayyan
ID: 33597223
The issue may be caused by the AV server, since the AV client is always looking for that server. Disable the AV client on one of your workstations while the BDC is down and check again.
 
LVL 3

Expert Comment

by:tonyszko
ID: 33597275
Actually, radhakrishnan2007's advice may be good: check CPU and memory utilization on a client when the AV server is not available. If it spikes because the AV client is looking for a server, this might cause sluggish application behavior.
 

Author Comment

by:Danoklas
ID: 33597499
tonyszko

The specific registry settings are LdapSrvWeight and LdapSrvPriority in HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters, which essentially cause the server to update its DNS records with the proper weight and priority values.

Server 1 (Primary) - unchanged (0)
Server 3 (Backup) - Priority: 1

This should be forcing clients to login using Server 1 and only use Server 3 if Server 1 is unavailable, which is what we want.

Other than AV and WSUS, no applications are hard-coded with anything. However, using PsExec to run SET on the remote machine, I did notice an old entry in the path variable pointing to the new server. The application that set that path entry is definitely no longer using it, so I don't think that entry should be causing any trouble... or could I be wrong about that?

Name resolution and connectivity to the PDC are just fine. Also, the SET command when run from PSexec, did not list the LOGONSERVER.

I will check the AV server, but I don't think that would be causing this problem. I did check that last week when this first occurred and there were no significant errors or signs of trouble from the AV software.
 
LVL 53

Expert Comment

by:Will Szymkowski
ID: 33597603
What I would recommend is going to the user's machine when this happens. Users are known for giving little or vague information. Get on the client machine in question and check Event Viewer, Task Manager, running processes and programs, etc. A simple "My computer is running slow" is a very hard symptom to diagnose. I'm sure if you are on the machine when this is happening you will find your cause.
 

Author Comment

by:Danoklas
ID: 33597638
I have been down there when it happened. The problems are specifically VERY slow opening of files, or accessing Outlook (exchange) with frequent crashing of those applications as they open/close/print or try and create a new email. The only applications they run are simply accessing files on the server shares (no SQL databases or anything like that). The error logs don't show any errors beyond recording the times the office applications crashed on them.... no system errors.
 
LVL 53

Expert Comment

by:Will Szymkowski
ID: 33597756
You said that Outlook is slow. When this happens, hold CTRL and right-click the Outlook icon in the system tray, then select Connection Status. There you will be able to see which DC your client is trying to authenticate to.

If they close and reopen Outlook, does it try to authenticate against the PDC again?
 

Author Comment

by:Danoklas
ID: 33597791
The users did try rebooting, but that didn't fix anything. Unfortunately, right now I can't test as they have some important work to finish, so I'll have to wait until their critical work is finished before I can down the server again and test some of these suggestions. I'll report back as soon as I'm able to get that done.
 
LVL 53

Assisted Solution

by:Will Szymkowski
Will Szymkowski earned 125 total points
ID: 33597857
Also, do they have any network drives or files that they are trying to access on this DC? There has to be something on it that the users are authenticating with if some are running slow immediately after the DC goes down. Check it out and keep us posted.
 
LVL 20

Expert Comment

by:Radhakrishnan Rajayyan
ID: 33597903
This can also happen if folder redirection (Group Policy) is enabled from this server for the users. I am not sure about this, just giving you an idea to check.
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 33597910
Make sure your other DCs are Global Catalog servers as well.

Run dcdiag and post the results.
 
LVL 5

Expert Comment

by:Greg Jacknow
ID: 33597961
Interesting.
It really sounds like either the workstation is getting the wrong logon server or it is having issues talking to some file/printer network resource, which causes similar symptoms as the workstation more or less pauses while it tries to access the resource. Perhaps because of name resolution issues.
I am pretty sure that by design a Windows workstation does get pretty unhappy if the DNS or logon server goes away after it has started up. It is not smart enough to switch over to the other servers in its list. (That has always seemed wrong to me; someone let me know if I am incorrect.)
It looks like you are using LdapSrvPriority correctly, but you should confirm that by checking the LOGONSERVER environment variable on the clients. (You could also have the logon script pipe that info to a file on the network for a day or two; that should show you if it is working correctly.)
The next thing I would look at would be name resolution issues when that second server is down. I have seen weird things happen with the order of DNS servers on a workstation not working the way you expect.
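
For the logon-script idea above, here is a minimal sketch in Python of what "pipe that info to a file" could look like. The function names, the tab-separated layout and the log path are all hypothetical; the only thing taken from the thread is the LOGONSERVER variable. It would need to run inside the user's own logon session, since that is where LOGONSERVER is normally set.

```python
import os
from datetime import datetime


def logon_server_line(env=None, now=None):
    """Build one tab-separated record: timestamp, computer, user, logon server."""
    env = os.environ if env is None else env
    now = datetime.now() if now is None else now
    return "\t".join([
        now.isoformat(timespec="seconds"),
        env.get("COMPUTERNAME", "?"),
        env.get("USERNAME", "?"),
        # LOGONSERVER is absent when the process is not in an interactive logon session.
        env.get("LOGONSERVER", "(not set)"),
    ])


def append_logon_record(log_path, env=None):
    """Append one record to a shared log file (log_path is a hypothetical location)."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(logon_server_line(env) + "\n")
```

After a day or two, sorting the file by the last column should show whether any clients are picking up the backup DC.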
 

Author Comment

by:Danoklas
ID: 33598295
Okay, one thing I'm noticing is that if I do a name lookup on the ldap SRV records, it always alternates between which server is presented first in the list, with no regard for the weight and priority.... i.e. the backup server is listed first half the time. Is there a way to force the DNS server to deliver one address first every time?
 
LVL 5

Expert Comment

by:Greg Jacknow
ID: 33598363
It really isn't important which is returned first by the DNS.  
The registry change you made should be changing the priority listed in the SRV records which determines what the workstation does.
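
To illustrate why the order in the DNS answer should not matter, here is a simplified Python sketch of the selection rule described in RFC 2782: the lowest priority value is tried first, and weight only breaks ties within one priority. The host names and values are illustrative, not read from the real zone, and how closely the Windows DC locator follows this rule is an assumption on my part.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    target: str
    priority: int
    weight: int


def order_srv_records(records, rng=None):
    """Return records in the order a client following RFC 2782 would try them."""
    rng = random.Random() if rng is None else rng
    ordered = []
    # Lower priority values come first, whatever order DNS returned them in.
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        # Within one priority, draw repeatedly at random in proportion to weight.
        while group:
            total = sum(r.weight for r in group)
            if total == 0:
                choice = rng.choice(group)
            else:
                pick = rng.uniform(0, total)
                running = 0
                choice = group[-1]
                for record in group:
                    running += record.weight
                    if pick <= running:
                        choice = record
                        break
            group.remove(choice)
            ordered.append(choice)
    return ordered


# Illustrative values: the backup DC is listed first in the answer but has priority 1.
answer = [
    SrvRecord("server3.domain.local", priority=1, weight=100),
    SrvRecord("server1.domain.local", priority=0, weight=100),
]
print([r.target for r in order_srv_records(answer)])
```

With these values the priority 0 record always sorts ahead of the priority 1 record, so round-robin ordering in nslookup output is not by itself a sign that the registry change failed.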
 

Author Comment

by:Danoklas
ID: 33598365
Also a few other notes:

- Both DCs are GC servers
- We do not use folder redirection
- There were previously printers hosted on the backup server, but they were all moved to the newer Primary DC. However, some users may not have deleted all of their connections to those printers.
- The backup server was previously (2 months ago) our primary file server, before the 2008 R2 server was added. However, all drive mappings were from the login script and were removed when the new server was added.
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 33599635
Post dcdiag
 

Author Comment

by:Danoklas
ID: 33599763
Server 1 (PDC) had the following errors, everything else passed

     Starting test: NCSecDesc
        Error NT AUTHORITY\ENTERPRISE DOMAIN CONTROLLERS doesn't have
           Replicating Directory Changes In Filtered Set
        access rights for the naming context:
        DC=ForestDnsZones,DC=domain,DC=local
        Error NT AUTHORITY\ENTERPRISE DOMAIN CONTROLLERS doesn't have
           Replicating Directory Changes In Filtered Set
        access rights for the naming context:
        DC=DomainDnsZones,DC=domain,DC=local

Server 3 (Backup DC) had the same plus this additional error

      Starting test: Services
            Invalid service type: RpcSs on Server3, current value
            WIN32_OWN_PROCESS, expected value WIN32_SHARE_PROCESS
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 33599968
First error is fine

Second error is not good.
 
LVL 5

Expert Comment

by:Greg Jacknow
ID: 33601865
 

Author Comment

by:Danoklas
ID: 33620172
Thanks guys. Working on two possible sources of the trouble:

1) An application added UNC entries to the Path variable, but didn't clean them up when we moved to the new server. I'm adding an entry to our login script to clear that up

2) Our Exchange Server is listing Server3 first in its GC list and seems to be using it for authentications even though Server1 should be used. Perhaps Exchange just ignores the DNS settings for preference and goes with the server it has used in the past?

I'm going to correct those two conditions and check back. It may be a few days as I have to find time to down the server with least user impact... but when there are a few people here to complain ;-)

Will report back the results.
 

Author Comment

by:Danoklas
ID: 33672191
Checking back in....

I was able to write a script to clean up the UNC entries in the PATH variable and was also able to run the Set-ExchangeServer cmdlet to force Exchange to use the Server1 GC first. Unfortunately, neither seems to have corrected the problem.

I have not had a chance to down Server 3, but in checking the Security logs, I can see that it's continuing to service a lot of user logon/logoff requests throughout the day... far more than I would expect given the DNS setup.
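
For anyone wanting to do the same PATH cleanup, the core of it can be sketched roughly as below. This is not the actual script used here, just a minimal Python illustration; the function name and the server name passed in are hypothetical, and it only builds the cleaned string rather than writing it back to the environment or registry.

```python
def strip_unc_entries(path_value, server):
    """Drop PATH entries that are UNC paths on the given server; keep the rest in order."""
    bare = "\\\\" + server.lower()  # e.g. \\oldserver
    kept = []
    for entry in path_value.split(";"):
        cleaned = entry.strip().strip('"')
        if not cleaned:
            continue  # skip empty segments left by stray semicolons
        lowered = cleaned.lower()
        # Match \\server itself or anything beneath it, but not \\server2.
        if lowered == bare or lowered.startswith(bare + "\\"):
            continue
        kept.append(cleaned)
    return ";".join(kept)
```

Comparing on the lower-cased value matters because server names in UNC paths are not case sensitive, and requiring the trailing backslash avoids removing entries for a different server whose name merely starts the same way.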
 

Author Comment

by:Danoklas
ID: 33737496
Another update...

The problem persists but the two actions taken have helped: (1) Use Set-ExchangeServer to specify the GCs in order and set which one to prefer. (2) Remove the UNC entry from the Path variable.

I've also added a few lines to the login script to clean out any printer connections to the old server that the users themselves haven't deleted already.

That said, I'm pretty convinced that Exchange is still responsible for most of these authentications. I've noticed that one user in particular has more login entries in the log. That individual is rarely on campus connected to the network... that person IS however a very heavy email user and is logged into Outlook most of the day, but usually remotely through Outlook Anywhere.

Since I've already set StaticGlobalCatalogs (Server1), StaticDomainControllers (Server1, Server3, in that order) and StaticConfigDomainController (Server1), why would it still be servicing so many logons through Server 3? Is there something else hard-coded into Exchange?
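
In case it helps anyone reproduce the per-user comparison, here is a rough Python sketch of tallying logons per account from an exported log. The CSV layout and the column names "dc" and "user" are assumptions made purely for illustration; a real Security log export would have to be mapped onto them first.

```python
import csv
import io
from collections import Counter


def count_logons(csv_text, dc_name):
    """Count logon rows per user for one DC, most frequent first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        row["user"]
        for row in reader
        if row["dc"].lower() == dc_name.lower()  # DC names compared case-insensitively
    )
    return counts.most_common()
```

A single account dominating the list for the backup DC would support the idea that one client or service, rather than general workstation logons, is generating the traffic.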
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 33737692
That is an exchange question that should be posted in the Exchange zone
 

Author Comment

by:Danoklas
ID: 33772906
Hey, all.

Something just occurred to me.... when we installed Office 2007 a couple years back, it was installed from a file share on Server2 (the server in question). That share is still active, but used only for storing software installs. Is it possible that this is the source of all the trouble? I wouldn't think that Office would need frequent access to the setup files, but I've been wrong about that kind of thing before.
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 33773487
Office could need access to the software install after the software was installed. Did you do a complete install or typical?
 

Author Comment

by:Danoklas
ID: 33773498
It was a custom install. But it was also about 18 months ago.

DK
 
LVL 59

Expert Comment

by:Darius Ghassem
ID: 33773532
Well, if Office needs a tool or component that is not installed, it can ask for or look for the install files.
 

Author Closing Comment

by:Danoklas
ID: 33790745
Final Update:

Everything is working as it should.

I was able to shut-down the server when only a half-dozen users were on our system - enough to check whether the fixes have worked without impacting workloads if something was still not right.

I suspect the main cause of the trouble was the UNC entry in the Path variables that was pointing to a file share on Server 2 which had long since been deleted. The UNC entry had been added by an application during its install, although I'm at a loss for why they would have done that as only the executables are local and only the data resides on a share.

I'll award the points to tonyszko and spec01 as, between the two of you, it was correctly identified as the workstations attempting to access a file share, though it was caused by an application.

Thanks to everyone for your input.
