Domain Controller authentication not working when one DC is down

We have two W2k3 domain controllers: the FSMO role holder and a GC. When we shut down the FSMO holder, say to apply patches, users cannot authenticate to the domain. Because a GC is still up, there should be no authentication problem, yet users cannot authenticate. The reverse is also true: if I shut down my GC while the FSMO holder is up, I cannot authenticate to the domain in Chicago.

Why? I've been trying to resolve this issue for months and have yet to find any problems with my DNS, stub zones, event logs, replication, anything...

When I use replmon to search the domain controllers for replication errors, none are reported. Both DCs run AD-integrated DNS and are the primary and secondary name servers advertised via DHCP. Replication-wise, the data is consistent across both DCs. So why can't I log on to my workstation if one server is not available?

Furthermore, I have a third DC (a GC) in another state. If both local DCs fail here, I should be able to reach that remote GC. Exchange seems to redirect itself to the GC in Washington correctly if I fail the GC in Chicago, but I still cannot log in to a workstation or any other machine on the local network if either local DC is down.

It really makes no sense to me...
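
Since replication looks clean, one thing worth testing while a DC is down is whether clients can still find the surviving one. Domain members locate DCs and GCs through DNS SRV records, so a minimal sketch like the one below can generate the record names to query by hand. The domain `domain.org` is the placeholder that appears in the GPO path in this thread, `Chicago` is the site name from Sites and Services, and the function names are hypothetical helpers, not part of any Windows tool.

```python
def dc_locator_srv_names(domain, site=None, forest=None):
    """Return DNS SRV names a client may query to locate a DC or GC.

    Site-specific names come first because clients prefer a DC in
    their own site before falling back to the domain-wide records.
    """
    forest = forest or domain  # single forest, single domain: same DNS name
    names = [
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.gc._msdcs.{forest}",
    ]
    if site:
        site_names = [
            f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}",
            f"_ldap._tcp.{site}._sites.gc._msdcs.{forest}",
        ]
        names = site_names + names
    return names


def nslookup_commands(names, server=None):
    """Build nslookup command lines; 'server' optionally pins the DNS server."""
    suffix = f" {server}" if server else ""
    return [f"nslookup -type=SRV {name}{suffix}" for name in names]


if __name__ == "__main__":
    # Assumed values for illustration only.
    for cmd in nslookup_commands(dc_locator_srv_names("domain.org", site="Chicago")):
        print(cmd)
```

Running the printed commands from a workstation against each DNS server in turn, with the other DC powered off, should show whether the remaining DC is still registered and resolvable. If the lookups only succeed against one of the two name servers, the problem is more likely DNS than replication.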

shayneg Commented:
Do you have site links to each site in Active Directory? If so, remove them and let AD sort out replication itself, as long as you have a fully meshed network.
es-it (Author) Commented:
Thanks guys. Here's some more info:
In Sites and Services I have Chicago and Washington.
In the Chicago site I have servers dc1 and dc2 (FSMO and GC).
In the Washington site I have server dc3 (GC).

For dc1 I have an automatically generated connector to dc2, but a manually created connector to dc3.
For dc2 I have an automatically generated connector to dc1, but a manually created connector to dc3.
For dc3 I have two manually created connectors, to dc1 and dc2.

In inter-site transport, I have a single site-link that includes both sites (Chicago and Washington).  IP addresses are also linked correctly to their respective sites.

So, do I -
a) Delete the manually-created connectors and keep the site-link?
b) Delete the site-link and keep the manually created connectors?
c) Delete the site-link AND the manually created connectors?

We are a single forest, single domain, with plenty of bandwidth between Chicago and Washington.
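
To make the three options easier to compare, the connector layout described above can be written down as data. This is only a sketch of the topology as reported in this post; it does not query Active Directory, and the tuple layout and function names are made up for illustration.

```python
# Which site each DC belongs to, as described in Sites and Services.
SITES = {"dc1": "Chicago", "dc2": "Chicago", "dc3": "Washington"}

# (server that owns the connector, replication partner, automatically generated?)
CONNECTIONS = [
    ("dc1", "dc2", True),
    ("dc1", "dc3", False),
    ("dc2", "dc1", True),
    ("dc2", "dc3", False),
    ("dc3", "dc1", False),
    ("dc3", "dc2", False),
]


def manual_connections(connections):
    """Return (server, partner) pairs for manually created connectors."""
    return [(srv, partner) for srv, partner, auto in connections if not auto]


def intersite_connections(connections, sites):
    """Return (server, partner) pairs whose endpoints sit in different sites."""
    return [(srv, partner) for srv, partner, _ in connections
            if sites[srv] != sites[partner]]


if __name__ == "__main__":
    manual = manual_connections(CONNECTIONS)
    print("Manual connectors:", manual)
    # True here means every connector that crosses the site link is manual.
    print("All inter-site connectors are manual:",
          set(manual) == set(intersite_connections(CONNECTIONS, SITES)))
```

Laid out this way, the manual connectors are exactly the ones that cross between Chicago and Washington, while the two connectors inside Chicago are automatic. That does not by itself say which option is right, but it shows that options (a) and (c) touch only inter-site replication.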


Do you get any errors at all when the FSMO holder is off? JRNL_WRAP errors, etc.?
es-it (Author) Commented:
The only errors I see are on the GC in Chicago when coming up from reboot:
Application Error 1030:
Windows cannot query for the list of Group Policy objects. Check the event log for possible messages previously logged by the policy engine that describes the reason for this.
And before that:
Windows cannot access the file gpt.ini for GPO cn={84731FCC-ABA7-4A7A-B201-482C1C83B49C},cn=policies,cn=system,DC=domain,DC=org. The file must be present at the location <\\domain\SysVol\domain\Policies\{84731FCC-ABA7-4A7A-B201-482C1C83B49C}\gpt.ini>. (Access is denied. ). Group Policy processing aborted.

But a few minutes later, GPO processing is fine, and no more errors in the event log: "Security policy in the Group policy objects has been applied successfully."
es-it (Author) Commented:
Sorry, I didn't really answer your question, Jay_Jay. When the FSMO holder is powered off, I don't get errors on the other DCs, but I cannot log in to my workstation. There's nothing in the system or application event log on my workstation either.
Did you turn Logon Caching off for your domain? By default the value is set to 10, so there would have to have been a change in Group Policy to turn it off. With caching on, your users can still log in to their workstations without a DC actually having to authenticate the logon. However, it may not solve problems after they log on, when they need to reach other network resources.

Turn on Logon Caching:
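
A rough way to check the current setting on a workstation, assuming the usual registry location for this policy (the `CachedLogonsCount` string value under the Winlogon key). The function names below are hypothetical, and the registry read only works on Windows; on other platforms the script just interprets the default value of 10 as an example.

```python
import sys

# Registry location of the cached-logon setting under HKEY_LOCAL_MACHINE.
WINLOGON_KEY = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon"
VALUE_NAME = "CachedLogonsCount"


def describe_cached_logons(raw):
    """Interpret the CachedLogonsCount value, which is stored as a string."""
    count = int(str(raw).strip())
    if count == 0:
        return "logon caching is disabled; a DC must be reachable for every domain logon"
    if 1 <= count <= 50:
        return f"up to {count} previous domain logons are cached"
    return f"unexpected value {count}; the documented range is 0 to 50"


def read_cached_logons():
    """Read the raw value from the local registry (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, WINLOGON_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value


if __name__ == "__main__":
    if sys.platform == "win32":
        print(describe_cached_logons(read_cached_logons()))
    else:
        print(describe_cached_logons("10"))  # default value, for illustration
```

Keep in mind that cached credentials only help users who have already logged on to that workstation at least once, and a successful cached logon would hide the underlying problem of the workstation not finding the remaining DC rather than fix it.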
Remove the manually created connections in the NTDS Settings for each office. Make sure you have a link between all offices so you have a properly meshed network. Then, for each server in each site, right-click it and go to Properties. Where it says "This server is the bridgehead server for the following transports", remove all transports so the box is empty. Do this for all the other servers and then leave AD to replicate. I know this because I have just had the same issue and we had to call Microsoft. In 2003 you shouldn't need to use bridgehead servers. Also make sure your DNS is functioning correctly. The process above will also help DNS.
es-it (Author) Commented:
Thanks guys. I'm going to wait until this weekend to remove the manual connectors per shayneg's suggestion. I do not have any servers listed as bridgeheads. I'll let you know how it goes.
Forced accept.

EE Admin