Avatar of HGL
HGL

Active Directory will not start!

All,

We are currently having a problem starting Active Directory following a power down of the domain. When we power up the first domain controller (PDCe) we get the usual delay before logging on. Once logged on, we see a number of event log errors, and Active Directory and DNS will not start. This is a Server 2003 R2 Domain Controller with a Forest and Domain functional level of Server 2003.

When I open AD Users and Computers and try to connect to the domain, I receive the following error: "The specified domain either does not exist or cannot be contacted"

When opening the DNS console I receive the following error: "Cannot contact the DNS server". The DC in question is using 127.0.0.1 (as per the MS best practice) as its primary DNS and the IP of a different domain controller as the secondary DNS which wouldn't be powered on at this stage.

I can see a number of issues in the event logs for the DC, including:

Directory Services
1. Replication errors which I would expect given the other Domain Controllers aren't powered on at this stage

2. Event ID 1126: Active Directory was unable to establish a connection with the global catalogue

DNS Server
1. Event ID 4013: The DNS server was unable to open the Active Directory. This DNS server is configured to use directory service information and can not operate without access to the directory. The DNS server will wait for the directory to start. If the DNS server started but the appropriate event has not been logged, then the DNS server is still waiting for the directory to start.

It appears that DNS will not start without Active Directory and Active Directory will not start without DNS!

We've looked at this for a couple of days with one of our Microsoft Gold Partners with no success. Any ideas or guidance would be greatly appreciated.

Thanks,

Rob.
Avatar of PeteJThomas
PeteJThomas

Can you run a dcdiag on the DC and post back the results please?
Avatar of ARK-DS
ARK-DS

Hi,

I would suggest pointing this DC to the other DC as its DNS server and then rebooting. Sometimes DNS takes a long time to start, or for some reason does not start at all, and in that case AD will also have issues.

So try pointing it to another DNS server and see if it works.

As Pete said, DCDiag would also help in tracing what's going wrong.


Avatar of HGL

ASKER

Hi guys, thanks for the quick responses. I have attached the results of the DCDiag at the point when we experience issues.

@ARK-DS We tried pointing the DC at one of the other Domain Controllers, however, they were experiencing the same issue and therefore DNS wasn't running on them.

TIA,

Rob.

DCDiagSANA.txt
Hmmm interesting - Can you look at the services on this box? Ensure the netlogon service isn't paused, DNS service is running etc?

And also can you run a dcdiag /test:DNS and post the results please?
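For anyone following along, a saved DCDiag report can be scanned for failed tests instead of reading it by eye. This is only a rough sketch in Python: it assumes the report uses summary lines of the form "......................... <server> failed test <name>", and the sample text and the name failed_tests are made up for illustration, not taken from the attached report.

```python
import re

# Matches dcdiag summary lines such as
# "......................... DC1 failed test Connectivity"
RESULT_LINE = re.compile(r"\.{5,}\s+(\S+)\s+(passed|failed)\s+test\s+(\S+)")


def failed_tests(dcdiag_text):
    """Return (target, test_name) pairs for every test reported as failed."""
    failures = []
    for line in dcdiag_text.splitlines():
        match = RESULT_LINE.search(line)
        if match and match.group(2) == "failed":
            failures.append((match.group(1), match.group(3)))
    return failures


if __name__ == "__main__":
    # Hypothetical sample; replace with the contents of a saved report.
    sample = (
        "   ......................... DC1 failed test Connectivity\n"
        "   ......................... DC1 passed test Services\n"
    )
    for target, test in failed_tests(sample):
        print(target, "failed", test)
```

Run it against the text of the report (for example after "dcdiag > report.txt") to get a short list of what to look at first.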
If this server is a DNS server, then change its primary DNS to itself and the secondary DNS to another DNS server on your domain.
Run dcdiag /fix and netdiag /fix; this will recreate the missing DNS records.
Regards,
Jose
Avatar of HGL

ASKER

@PeteJThomas - I don't believe the Netlogon service was paused. I don't have access to provide the DCDiag /test:DNS results at the moment but I will post them as soon as I can.

@Jose - The server is using the loopback address as its primary DNS and an alternate Domain Controller as its secondary DNS as per my original post.

Rob.
Can you look at your DNS snap-in and see if your MSDCS file folders are greyed out?

It looks exactly like this:
https://www.experts-exchange.com/questions/24349599/URGENT-MSDCS-records-registering-directly-under-FWD-lookup-zone-not-under-FQDN-name-space.html
Hi,

Going through the DCDiag report, the connectivity tests and the FSMO check tests are failing. PDC role holder is not contactable, KDC services holder is not contactable.

Can you please run this command on the server and give the output?

"Netdom Query FSMO"
and
"Netdom Query DC"
I am asking this because you said in your first post that the other domain controllers are not powered up at this time.
Also tell us the status of the "Kerberos Key Distribution Center" service, and of the Server and Workstation services.

A general step to refresh required services:
If any of the services mentioned above are stopped or paused, please start them. If all the services are running, run this command (without quotes) and see whether any events are logged:
"Net stop dns & net stop netlogon & net stop ntfrs & ipconfig /flushdns & net start dns & net start netlogon & ipconfig /registerdns & net start ntfrs"

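The refresh sequence above can also be run step by step so that each exit code is visible. The following is a hedged Python sketch rather than a tested procedure: it assumes Python is available on the box, that the DNS Server service name is "dns" (as in the "net start dns" part of the command), and the names REFRESH_STEPS and run_steps are hypothetical.

```python
import subprocess

# Same order as the one-line command in the post.
REFRESH_STEPS = [
    ["net", "stop", "dns"],
    ["net", "stop", "netlogon"],
    ["net", "stop", "ntfrs"],
    ["ipconfig", "/flushdns"],
    ["net", "start", "dns"],
    ["net", "start", "netlogon"],
    ["ipconfig", "/registerdns"],
    ["net", "start", "ntfrs"],
]


def run_steps(steps, runner=subprocess.call):
    """Run every step in order and return a list of (command, exit_code).

    Like "&" in cmd, a failing step does not stop the steps after it.
    """
    results = []
    for step in steps:
        results.append((" ".join(step), runner(step)))
    return results


if __name__ == "__main__":
    for command, code in run_steps(REFRESH_STEPS):
        print(command, "->", code)
```

A non-zero exit code next to "net start dns" or "net start netlogon" points at the service to check in the event log.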
Avatar of HGL

ASKER

@ChiefIT - When I open the DNS snapin it will not enumerate the DNS zones so I can't expand anything beyond the server name.

@ARK-DS - I will have a look when I get back to work and run the commands you requested.

Thanks all,

Rob.
The 40xx warnings/errors about not being able to load DNS data from AD are generated when using AD-integrated DNS zones, as this makes DNS rely on AD, which at the same time relies on DNS. The events are logged when the DC cannot contact another DNS server to query where the AD logon servers are located.
This often happens when rebooting the DC in a single DC/DNS environment. The same thing will happen when the other DC/DNS is turned off, as stated in the question.
Ensure that a live DC/DNS server is listed in the DNS client configuration on the DC that is rebooted.
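One way to check whether any DNS server in the client configuration is alive is to probe each one on the DNS port. This is a minimal Python sketch under stated assumptions: it only tests that TCP port 53 accepts a connection, which does not prove the AD-integrated zones have loaded, and the function names are hypothetical.

```python
import socket


def dns_port_open(server, port=53, timeout=2.0):
    """Return True if a TCP connection to the DNS port succeeds."""
    try:
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_live_dns(servers, probe=dns_port_open):
    """Return the first configured DNS server that answers, or None."""
    for server in servers:
        if probe(server):
            return server
    return None


if __name__ == "__main__":
    # Placeholder list; use the DNS servers shown by "ipconfig /all".
    configured = ["127.0.0.1"]
    print(first_live_dns(configured))
```

If this returns None on a freshly booted DC, the DC has no DNS server to ask for the AD locator records, which matches the symptoms described above.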

When DNS cannot be contacted, it will also cause the events about not being able to contact a GC. With DNS alive, the error about not being able to establish contact with a GC can be minimized by promoting all DCs in the domain to GC using AD Sites and Services
http://support.microsoft.com/kb/296882

The replication errors will be generated when the remote DC isn't reachable as it's powered off.
Henjo:

Looks like a problem with DNS all the way around.

Replication and AD are affected.

DNS will not enumerate the forward/reverse lookup zones. I have not seen the zones blank or unable to be seen. Have you?
Avatar of HGL

ASKER

@henjoh09 - I agree that that would resolve the issue and I did indeed try that, however, the other Domain Controllers also suffered the same problem when they were started, therefore would not load the DNS zones.

Each Domain Controller already hosts the GC so no luck there unfortunately.

Thanks,

Rob.
I re-read the question and saw the note about the error about not being able to contact the DNS server when starting the DNS MMC.
Check if the RPC service is started and listening. The 'netstat -ano|find ":135"' command on the DC should return entries. If a firewall is enabled, you either need to open all ports above 1024 or limit the dynamic port range that needs to be opened, as described in the KB articles below.
Check if the firewall is enabled on the DC. It can be turned on, but in that case it needs some exceptions for the DC to work. Disable it to see if that solves the problem.
http://support.microsoft.com/kb/555381
http://support.microsoft.com/kb/154596

Have you checked that the netlogon, KDC and DNS services are started on the DC? If any one of them fails to start, it can be a dependency issue configured in the registry:
HKLM\System\CurrentControlSet\Services\<servicename>\DependOnService (or DependOnGroup)
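The netstat check suggested above can be made a little stricter than find ":135", which would also match ports such as 1350. Below is a small Python sketch that parses saved "netstat -ano" output. It assumes the usual five columns (Proto, Local Address, Foreign Address, State, PID), and the function name and sample lines are made up for illustration.

```python
def listeners_on_port(netstat_text, port=135):
    """Return (local_address, pid) for TCP sockets listening on the port."""
    found = []
    suffix = f":{port}"
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State, PID
        if len(fields) == 5 and fields[0] == "TCP":
            if fields[1].endswith(suffix) and fields[3] == "LISTENING":
                found.append((fields[1], fields[4]))
    return found


if __name__ == "__main__":
    # Hypothetical sample; replace with the contents of a saved netstat file.
    sample = (
        "  TCP    0.0.0.0:135     0.0.0.0:0     LISTENING     880\n"
        "  TCP    0.0.0.0:1350    0.0.0.0:0     LISTENING     900\n"
    )
    print(listeners_on_port(sample))
```

An empty result would suggest the RPC endpoint mapper is not listening, which is worth ruling out before looking at the firewall.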
Avatar of HGL

ASKER

Hi all,

I am back in the office today and thought I'd take the opportunity to answer your questions. In retrospect, generating a question just before the holidays probably wasn't the ideal time. Anyway, thanks for your patience and help so far.

@PeteJThomas - The netlogon service isn't paused, it is started along with DNS Client and Server & KDC. The DCDiag /Test:DNS is attached to this post.

@ARK-DS - Both commands return the error: "The specified domain either does not exist or could not be contacted."

@henjoh09 - I have run netstat as you requested and the results are attached.

Again, apologies for the delay and I thank you for your help so far.

Rob.
DCDiag-DNS-SANA.txt
NetStatSANA.txt
Avatar of Netman66
Update your Proliant Support Pack versions using the latest PSP ISO from HP.

If that fails, then dissolve the Team and disable the adapter that is NOT at the top of the binding order.  Reboot the server and see if you have things in order.  Then re-enable the second NIC and recreate the Team.

You may also be suffering from this:
http://support.microsoft.com/kb/948496
Avatar of HGL

ASKER

Thanks Netman, I'll give this a go tomorrow. Out of interest, what made you think that this was the issue? Have you seen it before?

Rob.
Not personally, no.  However, according to your logs it appears the teaming software (or team) isn't properly operational - which started me down the path of thinking about the SP2 and RSS issue that almost every Proliant I've dealt with seemed to experience.

It's a start at least.  The newer PSP contains updated NIC drivers and Teaming software which mitigates the SP2 RSS issue, so we could either fix the issue or rule this possibility out.

Let us know how you make out and we'll take it to the next level.


Avatar of HGL

ASKER

Thanks Netman,

I tried installing the latest PSP; unfortunately, this did not resolve my issue.

However, if I dissolve the team and restart the server, I can start Active Directory. Very bizarre. I guess at this point I need to start looking at the Microsoft article you sent me.

Do you think the update provided by Microsoft in that article will resolve the issue?

Thanks for your help so far.

Rob.
Avatar of HGL

ASKER

Netman,

Just to add to this, we've also noticed that if we start the server, disable the team from Control Panel -> Network Connections and then re-enable it, Active Directory will then start along with DNS. I tried applying the patch from the Microsoft article you referenced; however, this has not resolved the issue.

We now have a workaround, which is a bonus (disable/enable the team), but I would like to get to the bottom of this. We also experience a similar issue with our SQL cluster, which is resolved by following the same workaround.

Thanks,

Rob.
You might want to look at how the Team is being configured.  If you're using switch assisted load balancing then the switches must be properly configured for port channeling and support LACP.  You also want to hard select 802.3ad if your switches are port channeled - don't rely on the Automatic feature of the teaming software as it isn't always functional.

Here is a Cisco guide explaining NIC teaming in great detail:  https://learningnetwork.cisco.com/servlet/JiveServlet/downloadBody/3729-102-3-10438/teaming.pdf;jsessionid=8D2C832A5E34340D6A848D33370EC47F.node0

HP Teaming according to Cisco:

http://www.cisco.com/en/US/tech/tk389/tk213/technologies_configuration_example09186a008089a821.shtml
Avatar of HGL

ASKER

Netman,

We are using Cisco Catalyst 4948 switches which are configured correctly. We used a CCIE to design and implement this, we have reviewed it since and the config does appear to be okay.

The servers themselves are configured to use 802.3AD. (802.3ad Dynamic Dual Channel Load Balancing (INP)) with Automatic set as the Transmit Load Balancing Method.

I haven't seen the first document you referenced, but I do believe our network team has reviewed the second one in the past.

Thanks,

Rob.
Try setting the Transmit Load Balancing to Destination IP and see what that does for you.

Avatar of HGL

ASKER

Hi Netman,

I made the changes to the team but the behaviour remains the same unfortunately.

Rob.
Interesting....

We know for sure it's the teaming software, but exactly what escapes me.

I'll see if there is any method to log or debug that software.
Avatar of HGL

ASKER

Agreed, I can't see how it can be anything else... I'm running out of ideas and options fairly rapidly!

The only thing I haven't mentioned is that we don't use the on-board NICs. We use an NC340T and an NC360T for the team.

I really appreciate all your help so far.

Rob.
I had to fly out because of a duty call to a domain that was down. I am back on line with this domain and in the testing stages.

I have reviewed your question and I am ready to assist again.

In reviewing your question, I see this line in your DCdiag report:

""The host dda55228-2856-4fcc-9b9f-2f868f794499._msdcs.<DOMAIN> could
not be resolved to an""

What you have appears to be an outdated Delegation record to DNS.

It looks exactly like this if you look within your DNS snap-in:
https://www.experts-exchange.com/questions/24349599/URGENT-MSDCS-records-registering-directly-under-FWD-lookup-zone-not-under-FQDN-name-space.html

As you can see, I ran into this issue as well and it jacked up AD and DNS. (similar symptoms to what you have).

Furthermore:

Load balancing over a managed switch network can be really tricky.

Spanning tree protocol prevents L2 loops. It can knock down one leg of your multihomed DC. Some managed switches allow spanning tree on "Access Ports" even when spanning tree shouldn't be on "access ports". Spanning tree is best applied to trunk ports, (between switches and routers). All other ports should have spanning tree disabled.

I used this as a guide to overcome some load balancing problems. Please review the comments in this section:  "Distribution of Cluster Traffic"
http://technet.microsoft.com/en-us/library/bb742455.aspx

If a L2 loop is created, Spanning tree will disable one of the two nics, and drop the load balancing. This knocks you down, and it is something your Network Engineer should know about.  
_________________________________________________________________
So, I have to ask "Why?"

If you have a total of 250 nodes or less, it is best to NOT load balance a network. It simply isn't needed.

Any time you multihome a domain controller, you're asking for problems. Do you need network load balancing?
A third problem is Routing and Remote Access Services:

RRAS may not show it, but it does enable Windows Firewall. Windows Firewall knocks down LDAP, the global catalog, and DNS unless it is configured to work on a domain controller.

http://support.microsoft.com/kb/555381

Now it may appear the Windows firewall service is stopped, but with RRAS running, the firewall is too.
Avatar of HGL

ASKER

@ChiefIT - I have had a chat with our Network Team. They tell me that they have checked the issues you have identified with spanning tree and it isn't applicable in our case.

I do know that the ports are set to Spanning Tree PortFast and the guys have run some commands on the switch to look at blocked interfaces to ensure this isn't the problem.

I will have a read of the articles you've referenced to see if there is anything we can try.

The reason we have these servers load balanced is the need for HA. We have in excess of 5,000 nodes.

Rob.
If memory serves me right, we also had to look into multicast versus unicast (but I don't remember how that applied). Let me look that up.

If you ask me, I would look into the DNS delegation records, first. That's quick and I think a problem with your servers. ...No DNS, No AD.
Avatar of HGL

ASKER

@ChiefIT - I had a look at the DNS delegation records, and they don't appear to be the problem. DNS doesn't start. The issue appears to reside around the team/teaming software as discussed with Netman above. We need to try and work out if it is the server or switch config now, I think.

Rob.
Well your configuration looks right.

Your access ports for the server are set for STP.

Microsoft recommends using Multicast (instead of the default of Unicast), for load balancing.

As per technet article: http://technet.microsoft.com/en-us/library/bb742455.aspx

Section: Distribution of Cluster Traffic

"Network Load Balancing uses layer-two broadcast or multicast to simultaneously distribute incoming network traffic to all cluster hosts. In its default unicast mode of operation, Network Load Balancing reassigns the station address ("MAC" address) of the network adapter for which it is enabled (called the cluster adapter), and all cluster hosts are assigned the same MAC address. Incoming packets are thereby received by all cluster hosts and passed up to the Network Load Balancing driver for filtering. To insure uniqueness, the MAC address is derived from the cluster's primary IP address entered in the Network Load Balancing Properties dialog box. For a primary IP address of 1.2.3.4, the unicast MAC address is set to 02-BF-1-2-3-4. Network Load Balancing automatically modifies the cluster adapter's MAC address by setting a registry entry and then reloading the adapter's driver; the operating system does not have to be restarted."

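As a worked example of the rule quoted above, the unicast MAC address can be derived from the primary IP address like this. This is a small Python sketch, not Microsoft's code: the function name nlb_unicast_mac is hypothetical, and writing each octet as two hex digits is an assumption about the usual MAC notation (the article writes the same address as 02-BF-1-2-3-4).

```python
def nlb_unicast_mac(cluster_ip):
    """Build the 02-BF-prefixed unicast MAC from a dotted IPv4 address."""
    octets = [int(part) for part in cluster_ip.split(".")]
    if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
        raise ValueError(f"expected a dotted IPv4 address: {cluster_ip!r}")
    # Prefix 02-BF, then the four IP octets written as two hex digits each.
    return "-".join(["02", "BF"] + [f"{o:02X}" for o in octets])


if __name__ == "__main__":
    print(nlb_unicast_mac("1.2.3.4"))  # prints 02-BF-01-02-03-04
```

This only applies to Network Load Balancing clusters in unicast mode; it says nothing about how NIC teaming software assigns the team's MAC address.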

--When DNS starts prior to AD, you will see 4004 and 4015 DNS errors, but that's upon boot up.

AD is pointed to the wrong place for global catalog.

Can you provide an IPconfig /all on this server?

Avatar of HGL

ASKER

@ChiefIT - We aren't using NLB on this server. We use NIC teaming (802.3ad)...

The IPConfig on the server looks like this:

Windows IP Configuration

   Host Name . . . . . . . . . . . . : <SERVERNAME>
   Primary Dns Suffix  . . . . . . . : <DOMAIN.CO.UK>
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : <DOMAIN.CO.UK>
                                       <CO.UK>

Ethernet adapter Teamed Connection:

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : HP Network Team #1
   Physical Address. . . . . . . . . : 00-12-34-26-78-90
   DHCP Enabled. . . . . . . . . . . : No
   IP Address. . . . . . . . . . . . : XX.XX.XX.XX
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : XX.XX.XX.XX
   DNS Servers . . . . . . . . . . . : 127.0.0.1
                                       XX.XX.XX.XX (Alternate DC)
   Primary WINS Server . . . . . . . : XX.XX.XX.XX
   Secondary WINS Server . . . . . . : XX.XX.XX.XX
Avatar of HGL

ASKER

Guys,

I have exhausted all the options for this so far so I intend to close this problem as a Known Error with a workaround. I thank you for all of your help but the underlying cause remains a mystery.

Rob.
ASKER CERTIFIED SOLUTION
Avatar of Netman66
Netman66
This solution is only available to members.
I agree with Netman.

Unless you have over 250 nodes on a network, you really don't need to team the nics.

Try a one nic solution, and then rebuild the team later.
I had the same problem.   I hope this works for you.

Leave the PDC on, disable the firewall, turn on the SDC, wait.... turn off the PDC and back on... all services should start.