Rob G
asked on
I will offer points even for good suggestions, if not fixes (Server 2k8 x64 Active Directory and possibly desktop authentication issues)
Ok,
So here is the dilemma, which I honestly have been on for the past month and can't seem to track down. It hasn't been just me either: a support case was opened with Microsoft, and they can't seem to figure this out either, so if I get no replies here I understand why. (Open to suggestions.)
The issue is that the domain logon server itself is disappearing. In a high-load environment I would understand that and suspect over-utilization, but the environment here is 34 people, and at any given time there are only 20-25 in the office. For whatever reason, clients are disconnecting from the domain controller. I created a domain policy to force the systems to wait for the network, thinking this would at minimum point out whether this is a network issue or a server issue, but the policy setting doesn't seem to make a difference. (Wait for network, and require a domain controller to log into the PC.)
The DC is a 2k8 x64 server running 64GB of DDR2 registered memory
It has two 8-core Xeon processors
It is a RAID 6 with SSDs that do 6 Gbps
The server is fully patched, all drivers (NIC and all) have been updated, and there is no NIC teaming.
The server is rebooted 4 times a year, once every 3 months.
The network is a Gigabit network using Cat6 and Juniper hardware
There are no internal firewalls and the servers do not run an AV.
The desktops are all Windows 7 x64 and vary in hardware, with the worst being Core 2s and the best being i7s.
Memory varies from 4GB - 16GB
Hard disks are all at least 250GB, SATA 7200RPM
All the Windows 7 machines are updated weekly with WSUS
All the Windows 7 machines have an AV installed, all the same; all have the firewall turned off and network discovery turned on.
They were all clean builds, never using Ghost or any cloning.
I haven't been able to track down where the issue is coming from. Has anyone else seen this issue?
LogonServerNA.png
LogonServer-NA2.png
ASKER
Knife,
It seems to happen on machines sporadically. There are currently about 8 machines with this issue as of today, but yesterday there were 9, so it appears that one corrected itself at logon today. The error logs on the domain controller show nothing out of the ordinary, as in nothing that would make me think the issue is on the DC side. The DC errors that are shown are:
Failed extract of third-party root list from auto update cab at: <http://www.download.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: A required certificate is not within its validity period when verifying against the current system clock or the timestamp in the signed file.
Event ID 11
Source CAPI2
--and--
System
Driver xxx required for printer xxx is unknown contact the administrator to install the driver before you log in again..
Event ID 1111
Source Terminal Service Printers
Both of these I assume were null errors.
The Application one I assumed was due to the system not getting the updated information from Windows Update, since the server is segregated, and the System error comes from people (admins) logging into the server who have the printer option checked in RDS.
The systems (clients) never show "no logon server" at logon. They log in fine, and it only shows in the system information, or if you query the logon server from the client side. Which is confusing, since the policy, again, is set to not let the clients log into the desktop without the DC being available. But at no time does the desktop fail to log into the domain. The policy is set, I can see it on the desktop, but again, it's got me baffled.
My desktop, which does not have the issue, is on the same switch, has the same software plus additional administrator tools, and logs into the same domain. The only difference between my desktop and the others that I can tell is that I log into it with an administrative account. But there are many other machines here without the issue that are logged into with a user account, so I can't seem to figure out the difference. Same policy and all. I added myself to the policy when I noticed the error occurring, figuring that if it were policy based I would eventually have the same issue. Just luck I guess.. LOL
Do you have NTP set up as a GPO? Are all the end devices on the correct time?
ASKER
Greg,
No, it is physical. It was a leftover after we migrated all the main servers to a Hyper-V based virtual server farm. The "overkill" server was just an older Dell server we had to use as the DC, which was upgraded from 2k3 about a year ago. I can't tell if the issue started around then or not, since I was not working here at that time. It is possible that something is jacked from that time.
ASKER
Greg,
No and yes. No internal NTP source is set through group policy, although I have contemplated testing that to see if it would help. But the clocks all look to be on the correct time. They are pretty much spot on, too: a difference of about 0.5 seconds when the minute changes between the desktops and the server. The most I have seen is on one of the old (I mean REALLY old) machines, where the time can get out of sync by as much as 20 seconds, but the threshold should be set for 30 seconds, and ironically that machine doesn't have the issue.
Was the DC a fresh install of 2008R2 then DCPromo'd into the domain?
ASKER
Greg,
No, it was an "upgrade", which as far as I can tell was originally NT4 directory services, migrated to 2000 Advanced Server AD, migrated to 2003 Standard AD, migrated to 2k3 R2 x64 AD, migrated to 2008 x64 Standard AD, which is where it currently sits.
Do you have other DCs running on the network?
ASKER
Greg,
No, just a single domain controller for roughly 25 end users. But they log in at varying times, typically from 7-9AM, with each dept of about 5-10 people.
Hec,
Power management is disabled through policy: there is no hibernation or sleep, and the NIC does not turn off, nor does it allow power saving. The NIC is set to auto-negotiate, but the switch ports are set to full duplex 1Gb. All desktops are set to 1Gb full or 10/100/1000 auto.
Do you have a Hyper-V environment available?
I would start up a new 2012R2 instance, DCPromo it, and transfer all roles to it. Demote the DC that is in place now and disjoin it from the domain. You will also need to adjust the DHCP service to reflect the new DC's DNS.
Then tear down the current server: get rid of the RAID 6 and make it RAID 10. Make the current DC box a Hyper-V server. Start up a new 2012R2 VM and DCPromo it so you have a minimum of two DCs in place (this is best practice).
You can use the extra capacity to run a network monitor.
ASKER
Knife,
Yes, I actually paid Microsoft to connect remotely to it and figure it out. That ticket was opened about 60 days ago, and I have yet to get any kind of usable solution from them; they can't figure it out either. LOL.
I did google the hell out of it, thinking I would find something.
It's not constant. One day you will get no logon server, but if you log off and back on (after clicking Switch User and entering domain\username and password), it will show the logon server as the correct server for about 7-10 days, then it is gone again. The other odd thing is that if you do the same thing, but log out of the user and try to log in as domain\administrator, it will fail with the error "no logon servers available". Which made me think "cached profile", but when I deleted the cache and tried again, it worked for the user and not the admin. So I pulled the machine, moved it to another network port, and tried again: still the same error. But then the next day, both accounts worked fine on that system. No updates, no physical changes anywhere.
Greg,
I actually have another system with the DC cloned on a different VLAN, with a few test VM desktops connected to it. I can't replicate it on that side, which is weird since everything is identical, except the OS is on a 2k8R2 Hyper-V instead of physical, and the switches are ancient Ciscos.
That made me think..
I am pretty sure that on the test side I changed the DNS Domain Name in the DHCP settings.
Which makes me wonder if maybe this is a DNS/DHCP issue?
Does anyone here who knows the scope options know what the 015 scope option will do if it is not correct, or is set to a secondary DNS name rather than the current DNS name of the domain?
For instance, if the domain was originally test1.com
and you change it to Test2.local
but do not update 015 in the scope options of the DHCP list, will that cause issues?
Does anyone have a good link to what that 015 option does, or what it controls?
Cloning DCs is never a good thing....
Scope option 15 should have your Active Directory domain name in it.
The DNS option should point to your two Active Directory servers. This could also be an issue with desktops not finding the logon server: DNS must be the Active Directory servers.
Make sure you do not have two DHCP servers on your network (unless it's 2012R2 DHCP failover); this can cause logon issues too.
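On the question of what option 015 controls: it hands the client its connection-specific DNS suffix. As a rough sketch (not Windows' actual resolver code), assuming the default Windows setting of appending the primary DNS suffix first and the connection-specific suffix second, this is how a short name would be expanded. The host name dc01 is hypothetical; the two suffixes are the test1.com / Test2.local example from the question.

```python
def candidate_fqdns(name, primary_suffix, connection_suffix):
    """Return the fully qualified names a resolver would try for `name`.

    Assumption: the primary DNS suffix (the AD domain the machine joined)
    is appended first, then the connection-specific suffix from DHCP
    option 015. Real resolvers have more settings than this sketch models.
    """
    # A name that already contains a dot is treated here as qualified.
    if "." in name:
        return [name.rstrip(".").lower()]
    candidates = []
    for suffix in (primary_suffix, connection_suffix):
        if suffix:
            fqdn = f"{name}.{suffix}".lower()
            if fqdn not in candidates:  # identical suffixes give one candidate
                candidates.append(fqdn)
    return candidates


if __name__ == "__main__":
    # Stale option 015: clients also try the old domain for short names.
    print(candidate_fqdns("dc01", "Test2.local", "test1.com"))
```

With a stale 015 value the short name gets a second candidate under the old domain, so the visible effect is mostly on short-name lookups (and on which suffix the connection is labelled with), not on the machine's own domain membership.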
ASKER
Greg, Sorry for the delay in getting back..
The DNS is on the current AD server. While there are two servers, there is only one in each VLAN, so only one is visible on each side. There aren't any holes punched through on either VLAN, so there shouldn't be any DHCP duplication. Even if that were the case, they are on two completely different subnets, so I don't think that would be the issue.
The 015 option is set to the old DNS name, which is not the same as the new one, so I am curious if this could be the culprit. I don't know enough about the scope options, or why it was configured with the old DNS name, but I have a feeling that changing it to the correct DNS name would be the solution. Thoughts?
Any idea what changing that could jack up in the current environment?
Also, I found that they once had WINS in the mix, which is long gone, and I removed the traces of that on Saturday.
Thanks so far for the Q&A.
Have you done an overnight ping test from one or more of the problem machines to the DC?
Have you started up a new VM to replace this DC?
Is the 'cloned' DC completely separated from the production network?
(The production DC isn't trying to replicate to it, or vice versa?)
Also try a Message Analyzer capture on one of the problem machines. This app can break down IP traffic by conversation with the DC.
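If it helps, the overnight ping test can be scripted so that only the failures get logged with a timestamp. This is a minimal sketch: the host name dc01, the interval, and the log file name are placeholders, and the ping exit code is only a rough signal (on Windows it can be 0 even for a "destination host unreachable" reply).

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_command(host, system=None):
    """Build a single-echo ping command for the current (or given) OS."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, "1", host]


def ping_once(host):
    """Return True if one echo request to `host` exits with code 0."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def summarize(results):
    """Turn a list of True/False ping results into sent/lost/loss percent."""
    sent = len(results)
    lost = results.count(False)
    loss_pct = (100.0 * lost / sent) if sent else 0.0
    return {"sent": sent, "lost": lost, "loss_pct": loss_pct}


def monitor(host, interval_s=5, duration_s=8 * 3600, logfile="ping_log.txt"):
    """Ping `host` every `interval_s` seconds and log each failure."""
    results = []
    deadline = time.monotonic() + duration_s
    with open(logfile, "a") as log:
        while time.monotonic() < deadline:
            ok = ping_once(host)
            results.append(ok)
            if not ok:
                log.write(f"{datetime.now().isoformat()} no reply from {host}\n")
                log.flush()
            time.sleep(interval_s)
    return summarize(results)


if __name__ == "__main__":
    # "dc01" is a hypothetical DC name; run for one minute as a smoke test.
    print(monitor("dc01", interval_s=5, duration_s=60))
```

Keep in mind that a clean ping log only rules out outright packet loss; it says nothing about link speed or duplex problems.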
ASKER
Ran a ping last night, along with Wireshark, and never saw a single issue: no dropped packets, no problems at all. Also ran a separate Wireshark capture from one of the affected machines; again no issues, never a dropped packet or error.
I don't plan on migrating this to a new VM. If I can't get this to work, I am going to buy a new physical server and start from scratch, as I would rather dump the time into getting this to work correctly than migrate a problem from one piece of hardware to another.
The cloned server is in no way connected to the live network.
There is no replication set up, outside of Veeam sending data to our offsite data center. That is configured through VPN, but it is only a one-way trust, and the DC isn't set up to know about it. It's segregated.
ASKER
I actually figured this out about 30 min ago. Turns out it is not an issue on the DC side at all. The issue appears to be with the switch that these machines are plugged into. There are 2 Juniper gigabit switches and an old 10Mb switch that I was told no one was connected to, but it seems that the machines with this issue are actually connected to that switch. I have since moved them to the other switch and had the users log into the machines, and the issues are gone. Anyone know a good home for a 20-year-old Cisco switch? lol
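For anyone hitting the same thing, one way to spot machines hanging off a slow switch is to compare each NIC's negotiated link speed with what you expect. A minimal sketch, assuming you have saved the text output of `wmic nic where NetEnabled=true get Name,Speed /format:csv` from each client (Speed is in bits per second). The node and adapter names in the sample are made up, and the parser goes by column header rather than column position.

```python
import csv
import io


def slow_links(wmic_csv_text, minimum_bps=1_000_000_000):
    """Return (node, adapter, speed) for adapters linked below `minimum_bps`.

    Expects CSV text with at least `Name` and `Speed` columns; blank lines
    and rows without a numeric speed are skipped.
    """
    lines = [ln for ln in wmic_csv_text.splitlines() if ln.strip()]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    slow = []
    for row in reader:
        speed = (row.get("Speed") or "").strip()
        if speed.isdigit() and int(speed) < minimum_bps:
            node = (row.get("Node") or "").strip()
            name = (row.get("Name") or "").strip()
            slow.append((node, name, int(speed)))
    return slow


if __name__ == "__main__":
    # Hypothetical sample: PC07 negotiated 10 Mb/s, PC08 negotiated 1 Gb/s.
    sample = """
Node,Name,Speed
PC07,Example Gigabit Adapter,10000000
PC08,Example Gigabit Adapter,1000000000
"""
    for node, name, speed in slow_links(sample):
        print(f"{node}: {name} linked at {speed // 1_000_000} Mb/s")
```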
Glad you found the network issue!
That's why I kept going back to Wireshark to troubleshoot!
Enjoy your win!
Also check out this link I found on DC's as VM's:
http://www.sole.dk/how-to-configure-your-virtual-domain-controllers-and-avoid-simple-mistakes-with-resulting-big-problems/
Let's work on the error description a little. I bet it will be very easy to find out.
->At what point in time does it disconnect (= the logon server variable shows no content)? Please set up an alert task that informs you when that happens: query the variable and, if it is not \\dc, send a mail together with echo %time% and info on uptime (which can be read out in a script using psinfo).
->Does it happen at all with clean systems (no software on them, just a completely naked, domain-joined Win7 with no policies applied)?
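The check part of such an alert task could look roughly like this. It is only a sketch: the expected value \\DC01 is a placeholder for your DC's name, and the mailing, scheduling, and psinfo uptime parts are left out. It reads LOGONSERVER from the environment it runs in, so it has to run inside the user's session.

```python
import os
from datetime import datetime


def check_logon_server(env, expected):
    """Return None if LOGONSERVER matches `expected`, else a problem string."""
    actual = env.get("LOGONSERVER", "")
    if actual.strip().lower() == expected.strip().lower():
        return None
    shown = actual or "(not set)"
    return f"LOGONSERVER is {shown}, expected {expected}"


def build_alert(env, expected, now=None):
    """Return a timestamped alert line, or None if everything looks fine."""
    problem = check_logon_server(env, expected)
    if problem is None:
        return None
    now = now or datetime.now()
    computer = env.get("COMPUTERNAME", "unknown")
    return f"{now:%Y-%m-%d %H:%M:%S} {computer}: {problem}"


if __name__ == "__main__":
    EXPECTED_DC = r"\\DC01"  # hypothetical DC name, replace with your own
    alert = build_alert(os.environ, EXPECTED_DC)
    if alert:
        # Hand this line to whatever sends the mail.
        print(alert)
```

Logging the timestamp of the first mismatch per machine should show whether the value changes at logon or somewhere in the middle of a session.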