We have a Windows 2003 Server R2. About a month ago we experienced our first batch of 1030 and 1058 errors on the server and workstations. The domain literally disappeared. Symptoms:
On the DC server:
1. Typing \\domain.local in a RUN window came back with an error of "domain not found" or something to that effect.
2. Typing \\name_of_server in a RUN window would normally show all shares. Instead an error would come back with "computer not found" or something along those lines (sorry about the phrasing - working from memory).
3. Trying to bring up group policies would fail. No such domain found.
4. This one is interesting. Doing a NET VIEW from the afflicted domain controller's DOS prompt would show all the machines on the network. Doing a NET VIEW from a workstation on the same LAN would come back with "machine or computer not found". Browsing had essentially stopped outside the server's NIC.
Tried the easy stuff. Or what I thought was easy:
- Netlogon and DFS services are started (easy enough - they were)
- Domain controllers have the read and apply rights to the Domain Controllers Policy (where do I confirm this?)
- NTFS file system permissions and share permissions are set correctly on the Sysvol share (what should they be?)
- DNS entries are correct for the domain controllers (what should they be?)
http://support.microsoft.com/kb/842804
Then tried eventid.net. Quite a nifty site. Two key pages:
http://eventid.net/display.asp?eventid=1030&eventno=1542&source=Userenv&phase=1
http://eventid.net/display.asp?eventid=1058&eventno=1752&source=Userenv&phase=1
Lots to go thru on both pages. Lots of users down in the meantime. So first time thru, I simply rebooted the servers. Everyone back up and running. Chalked one up for the "great unknown".
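On the DNS item in the checklist above: as far as I understand it, a domain controller registers a set of SRV records that clients use to find it, so one way to check "DNS entries are correct" is to confirm those names resolve. The sketch below only builds the usual record names for a given domain. The function name is made up for illustration, "domain.local" is just the placeholder from this post, the list is a common subset rather than an exhaustive one, and it assumes a single-domain forest unless told otherwise.

```python
# Sketch: list SRV record names a domain controller normally registers in DNS.
# Not exhaustive; "domain.local" is the placeholder domain name from this post.

def expected_dc_srv_records(domain, forest=None):
    """Return common SRV record names used to locate a DC for `domain`."""
    forest = forest or domain  # assumption: single-domain forest
    return [
        f"_ldap._tcp.{domain}",
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.pdc._msdcs.{domain}",
        f"_kerberos._tcp.{domain}",
        f"_kerberos._udp.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
        f"_kpasswd._tcp.{domain}",
        f"_gc._tcp.{forest}",  # global catalog lives under the forest root
    ]


if __name__ == "__main__":
    for name in expected_dc_srv_records("domain.local"):
        # Each name can be checked by hand with: nslookup -type=SRV <name>
        print(name)
```

Running each printed name through nslookup -type=SRV on both the DC and a workstation should show whether the two see the same answers.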
4 days later - same thing. Lots of 1030 and 1058 at client and at server. Shares gone. Users locking up. Interesting part - you can ping the server. A database running on the server (using Interbase) continued to work at the client's desktop. Internet continued to function for the workstations even with the DC as their DNS server. But as soon as someone tried to browse the network, Explorer would hang on them, looking for that now "missing" network drive on the DC that didn't think it was a DC anymore.
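Since ping, DNS and the database kept working while shares did not, it may help to record which DC services still answer at the moment of a hang. Below is a rough sketch of such a check, not something MS or eventid.net suggested. The port list is the standard well-known TCP ports for these services; the probe function and DC_PORTS names are my own, and "name_of_server" is the placeholder from this post. Note that an open port only proves something is listening, so a hung service could still show as open.

```python
import socket

# Well-known TCP ports for services a Windows domain controller exposes.
DC_PORTS = {
    "DNS": 53,
    "Kerberos": 88,
    "RPC endpoint mapper": 135,
    "NetBIOS session": 139,
    "LDAP": 389,
    "SMB": 445,
}


def probe(host, ports=DC_PORTS, timeout=2.0):
    """Try a TCP connect to each port; return {service: True/False}."""
    results = {}
    for service, port in ports.items():
        try:
            # A successful connect means something accepted on that port.
            with socket.create_connection((host, port), timeout=timeout):
                results[service] = True
        except OSError:
            # Refused, timed out, or the name did not resolve.
            results[service] = False
    return results


if __name__ == "__main__":
    # "name_of_server" is the placeholder server name used in this post.
    for service, ok in probe("name_of_server").items():
        print(f"{service:20} {'open' if ok else 'NO ANSWER'}")
```

Run from a workstation once while everything is healthy and again during a hang; comparing the two outputs might show whether SMB and NetBIOS drop out while DNS stays up.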
Called MS on this issue. First tech suggested it was a Kerberos issue on the workstations causing this to happen. Same Kerberos solution offered up by eventid.net for eventid 1030 listed above. That bought us about 12 days of grace before the error came back again. This time MS suggested some registry changes. Rather weak. Gave us about 4 days of uptime before the next dreaded reboot. Each subsequent hang, btw, became a real chore. The last 3 reboots of the server have been manual power downs as the server would NOT restart even after an hour. It would simply sit at a blue desktop with a mouse cursor in the middle.
During the next attempt, I asked the MS engineers if they wouldn't mind helping me thru a GPO rebuild back to default. They obliged. Good for another 6 days, then the domain disappeared again. Reboot brought it back. But the Exchange server was gone! Had to remotely access the Event Viewer of the Exchange server to spot event id 2114 - Topology Discovery Failed. This was turning into a bit of a nightmare. Ran that one thru the Google grinder to find that a group policy entry was missing. Hey! We just reset group policies! Why didn't MS tell me there were gotchas?
This from eventid.net:
David Page (Last update 8/29/2006):
This error, combined with other numerous MU, SA and IS errors may be due to incorrect permissions in the default domain controllers policy either by miss-configuration or use of the dcgpofix command. The Exchange Enterprise Servers group must be defined in the default domain controllers policy under Manage Auditing and Security Log. This can be found in the User Rights Assignment area of the GPO. Once rights are established, restart SA and IS.
Once I set that permission, Exchange fired right up. Thankfully.
The last time, MS suggested (to be honest, they'd suggested it at the 3rd contact as well) that we shut down all 3rd party apps. We did. Backup Exec, QuickBooks Database Manager, Folder Size and LogMeIn. Disabled them. Kept only a single 3rd party piece of software running - Interbase - as shutting that down would have entailed working on 25+ workstations to move their connections to some other spot on the network. This last change was good for 13 days. But alas, all good things must come to an end. And today the domain disappeared again. Reboot was a hard shutdown (ouch).
Okay, so I'm down to two possibilities. Interbase - tho nothing except the size of the database (about 2 GB) has changed there. Or a faulty TCP/IP stack or network card issue.
I'm thinking it's a TCP/IP stack issue, only because it happens when a ton of machines are on the network and pounding the server. Rebuilding the stack appears to be a difficult task. Wouldn't replacing the network card be an easier one?
I've worked most of the easy fixes, but perhaps someone else has some experience with this one?
TIA