Problems with NTFRS (File Replication Service)

I’m having a rather big problem which I can’t figure out. The problem is with the File Replication Service (FRS). PLEASE READ ON!

I’m running a school network domain with over 600 users. There are about 10 servers running Windows Server 2003 Std.; 2 of those are Domain Controllers.

We had a rather unfortunate incident: we lost power (and the UPS failed!!!!). This resulted in the NTDS database being corrupted on BOTH DCs. NETLOGON and SYSVOL disappeared on both domain controllers.

My PDC is called Ares and my other DC is Zeus.

I read an MS article about temporarily stabilizing SYSVOL. I had to edit “HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters” and change the DWORD SYSVOLREADY to 1 or 2, making NETLOGON and SYSVOL reappear, and then copy all the scripts and policies into %systemroot%\SYSVOL.

This attempt did make NETLOGON and SYSVOL reappear, but FRS still reported problems (Ares is having problems with ZEUS and vice versa). I also found several Experts-Exchange links, but none that helped me.

I also used an MS article on how to rebuild the SYSVOL tree and its content and re-create the junctions, but that did not fix my problem either. http://support.microsoft.com/default.aspx?scid=kb;en-us;315457

I then took a backup of my %systemroot%\SYSVOL, demoted ZEUS to a normal member server, and promoted it again. I then got event ID 13508 (http://www.eventid.net/display.asp?eventid=13508&eventno=349&source=NtFrs&phase=1) and event ID 13565.

After promoting ZEUS again, I demoted ARES, but this did not fix anything. I still got the same problems: on both domain controllers the event log says that %servername% can’t become a DC before SYSVOL and NETLOGON have been replicated (event ID 13565), followed by event 13508 (same as above). It told me to wait until it had replicated throughout the domain (which it always says when adding a new DC). I left the server alone for 24 hours, but returned only to find more 13508 events. The NETLOGON and NTFRS services have no problems starting.
I’m using DELL PowerEdge 2600 & 2800 servers, so the hardware is no problem.

The strange part: while having these FRS problems, demoting and promoting the DCs didn’t report any errors. After promoting, the servers did not have any problems adding a replica set in “Sites and Services” under NTDS, and when forcing a replication the message “%servername% has replicated its connections” is returned.

I also tested AD by adding users on Ares and checking on ZEUS. This worked. I then disabled the users and changed their properties on ZEUS and checked on ARES. This worked. No problems transferring FSMOs etc. either.
It seems that I’m only having a problem with replication at the “file level”, if I can call it that. FRS!

I also tried some tools located on the server 2003 CD, NLTEST, DcDiag and NetDiag.

DcDiag returned some DcGetDcName and Advertising problems on Ares, and FRS problems on ZEUS.

After I demoted Ares, Zeus became my PDC, which isn’t a big deal; that will be corrected WHEN or IF I get this up and running again.

There shouldn’t be any DNS problems in my domain. I can ping names and FQDNs without problems. I believe all the correct SRV, A (host) and other records are in the DNS. 99,99999999% sure! :) Checked this a hundred times!
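
A successful ping only shows that a name resolves and the host answers ICMP. FRS talks RPC, which starts with the endpoint mapper on TCP port 135, so a slightly stronger check is to resolve each DC name and try a TCP connection to that port. The following is a minimal Python sketch of such a check, not a diagnostic tool: the function name `check_host` is made up for illustration, and reaching port 135 is necessary but not sufficient, since FRS itself listens on a dynamically assigned RPC port.

```python
import socket


def check_host(name, port=135, timeout=3.0):
    """Resolve `name` and try a TCP connection to `port` on every address.

    Returns a list of (ip_address, reachable) tuples.
    Raises socket.gaierror if the name does not resolve at all.
    """
    results = []
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        try:
            with socket.socket(family, socktype, proto) as conn:
                conn.settimeout(timeout)
                conn.connect(sockaddr)
            reachable = True
        except OSError:
            # Refused, timed out, unreachable, etc.
            reachable = False
        results.append((ip, reachable))
    return results
```

Calling `check_host("zeus.fshf.hoydalar.fo")` from Ares (and the equivalent for Ares from Zeus) would show which IP address DNS actually hands out and whether the endpoint mapper answers there. An unexpected IP address in the result would point at a stale DNS record.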

I’m an educated IT-professional, but I’m down on my knees with this one. I need help. I can’t afford to do a clean install with over 600 users on the domain! No way! :)

Sincerely,
Annfinn Thomsen

Sorry for this long and boring essay! Plz ask if you don’t understand part of the description! :)
farodaneAsked:

colin_harfordCommented:
Any Sysvol backups from right before the outage?  Depending on your backup software you may be able to restore just the AD info...

What are the file permissions in the sysvol like?
farodaneAuthor Commented:
Hey, thx for your answer!

No, no backup just before the outage! :( And the newest backup of my 2000 servers is from just before I upgraded to 2003. At the moment I'm running with the "stabilized SYSVOL", allowing ppl to log on without FRS working.

The file permissions? Like System having full, admins having full, etc.? I haven't changed anything there...

Thx in advance
colin_harfordCommented:
Ouchie, this sounds about as bad as my crash last summer.


What do Sonar or Ultrasound say about FRS?

Do you have a summer break? One solution I can think of is to export the objects to a new domain and start over. Not pretty, but it would work, plus then you get to start clean...


Honestly, this is one where I think it is worth calling Microsoft.  You call in, pay them their $200 US or whatever it is, and they work with you until it is resolved.  Tell them that it is affecting the business operations of your organization, and then you get to bypass the one-hour wait on hold.  At least if they can't resolve it you get your money back.
farodaneAuthor Commented:
I tried Ultrasound, but I couldn't figure it out... or not much anyway. Help welcomed!

I also ran FRSDIAG on Ares. Result (Zeus was very similar):

Checking for errors/warnings in FRS Event Log ....       
NtFrs      10-07-2005 09.56.37      Warning      13508      The File Replication Service is having trouble enabling replication  from ZEUS to ARES for c:\windows\sysvol\domain using the DNS name zeus.fshf.hoydalar.fo. FRS will keep retrying.     Following are some of the reasons you would see this warning.         [1] FRS can not correctly resolve the DNS name zeus.fshf.hoydalar.fo from this computer.     [2] FRS is not running on zeus.fshf.hoydalar.fo.     [3] The topology information in the Active Directory for this replica has not  yet replicated to all the Domain Controllers.         This event log message will appear once per connection, After the problem  is fixed you will see another event log message indicating that the connection  has been established.      
NtFrs      10-07-2005 09.53.49      Warning      13508      The File Replication Service is having trouble enabling replication  from ZEUS to ARES for c:\windows\sysvol\domain using the DNS name zeus.fshf.hoydalar.fo. FRS will keep retrying.     Following are some of the reasons you would see this warning.         [1] FRS can not correctly resolve the DNS name zeus.fshf.hoydalar.fo from this computer.     [2] FRS is not running on zeus.fshf.hoydalar.fo.     [3] The topology information in the Active Directory for this replica has not  yet replicated to all the Domain Controllers.         This event log message will appear once per connection, After the problem  is fixed you will see another event log message indicating that the connection  has been established.
      WARNING: Found Event ID 13508 errors without trailing 13509 ... see above for (up to) the 3 latest entries!

 ......... failed 1
Checking for minimum FRS version requirement ... passed
Checking for errors/warnings in ntfrsutl ds ... passed
Checking for Replica Set configuration triggers... passed
Checking for suspicious file Backlog size...
      WARNING : File Backlog TO server "FSHF\ZEUS$" is : 96  :: Unless this is due to your schedule, this could be a problem waiting to surface!
passed with 1 warning(s)

Checking Overall Disk Space and SYSVOL structure (note: integrity is not checked)... passed
Checking for suspicious inlog entries ... passed
Checking for suspicious outlog entries ...
      ERROR: 50,56% (45 out of 89) of your outlog contains Security ACL events.
      See KB articles below for further information:
            279156 - The Effects of Setting the File System Policy on a Disk Drive or Folder
            284947 - Antivirus Programs May Modify Security Descriptors and Cause Excessive Replication of FRS Data in Sysvol and DFS
 ......... failed
Checking for appropriate staging area size ... passed
Checking for errors in debug logs ...
      ERROR on NtFrs_0005.log : "EPT_S_NOT_REGISTERED(This may indicate that DNS returns the IP address of the wrong computer. Check DNS records being returned, Check if FRS is currently running on the target server. Check if Ntfrs is registered with the End-Point-Mapper on target server!)" : <SndCsMain:                      756:   873: S0: 09:54:57> ++ ERROR - EXCEPTION (000006d9) :  WStatus: EPT_S_NOT_REGISTERED
      ERROR on NtFrs_0005.log : "EPT_S_NOT_REGISTERED(This may indicate that DNS returns the IP address of the wrong computer. Check DNS records being returned, Check if FRS is currently running on the target server. Check if Ntfrs is registered with the End-Point-Mapper on target server!)" : <SndCsMain:                      756:   874: S0: 09:54:57> :SR: Cmd 0026d900, CxtG 89a4cef1, WS EPT_S_NOT_REGISTERED, To   zeus.fshf.hoydalar.fo Len:  (356) [SndFail - rpc exception]
      ERROR on NtFrs_0005.log : "EPT_S_NOT_REGISTERED(This may indicate that DNS returns the IP address of the wrong computer. Check DNS records being returned, Check if FRS is currently running on the target server. Check if Ntfrs is registered with the End-Point-Mapper on target server!)" : <SndCsMain:                      756:   889: S0: 09:54:57> :SR: Cmd 0026d900, CxtG 89a4cef1, WS EPT_S_NOT_REGISTERED, To   zeus.fshf.hoydalar.fo Len:  (356) [SndFail - Send Penalty]
      ERROR on NtFrs_0002.log : "RPC_S_CALL_FAILED_DNE(Indicates RPC Session was established to target, but there was a failure to send RPC call package. Check for Networking problems!)" : <SndCsMain:                     2348:   873: S0: 17:54:38> ++ ERROR - EXCEPTION (000006bf) :  WStatus: RPC_S_CALL_FAILED_DNE
      ERROR on NtFrs_0002.log : "RPC_S_CALL_FAILED_DNE(Indicates RPC Session was established to target, but there was a failure to send RPC call package. Check for Networking problems!)" : <SndCsMain:                     2348:   874: S0: 17:54:38> :SR: Cmd 0026b9e8, CxtG 89a4cef1, WS RPC_S_CALL_FAILED_DNE, To   zeus.fshf.hoydalar.fo Len:  (356) [SndFail - rpc exception]

      Found 3 EPT_S_NOT_REGISTERED error(s)! Latest ones (up to 3) listed above
      Found 2 RPC_S_CALL_FAILED_DNE error(s)! Latest ones (up to 3) listed above

 ......... failed with 5 error entries
Checking NtFrs Service (and dependent services) state...passed
Checking NtFrs related Registry Keys for possible problems...passed
Checking Repadmin Showreps for errors...passed


Final Result = failed with 7 error(s)

I can't seem to find any 'fixes' to EPT_S_NOT_REGISTERED error and RPC_S_CALL_FAILED_DNE error.
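
For anyone reading a longer FRSDIAG report, it can help to tally the debug-log errors and turn the hex exception codes into the decimal Win32 error numbers that search engines and KB articles usually quote. The Python sketch below does that for lines shaped like the ones above. The function name `summarize` and the regular expressions are my own, and the sample lines are abbreviated copies of the report (the long explanatory text inside the parentheses is cut to `...`).

```python
import re
from collections import Counter

# Matches: ERROR on NtFrs_0005.log : "EPT_S_NOT_REGISTERED(explanation...)" : <...>
ERROR_LINE = re.compile(r'ERROR on (\S+) : "([A-Z0-9_]+)\(')
# Matches the hex status in: ++ ERROR - EXCEPTION (000006d9)
EXCEPTION_CODE = re.compile(r'EXCEPTION \(([0-9a-fA-F]{8})\)')


def summarize(lines):
    """Return (Counter of status names, dict of name -> decimal error code)."""
    counts = Counter()
    codes = {}
    for line in lines:
        match = ERROR_LINE.search(line)
        if not match:
            continue
        name = match.group(2)
        counts[name] += 1
        code = EXCEPTION_CODE.search(line)
        if code:
            codes[name] = int(code.group(1), 16)
    return counts, codes


SAMPLE_LINES = [
    'ERROR on NtFrs_0005.log : "EPT_S_NOT_REGISTERED(...)" : <SndCsMain: 756: 873: S0: 09:54:57> ++ ERROR - EXCEPTION (000006d9) :  WStatus: EPT_S_NOT_REGISTERED',
    'ERROR on NtFrs_0005.log : "EPT_S_NOT_REGISTERED(...)" : <SndCsMain: 756: 874: S0: 09:54:57> :SR: Cmd 0026d900, CxtG 89a4cef1, WS EPT_S_NOT_REGISTERED [SndFail - rpc exception]',
    'ERROR on NtFrs_0005.log : "EPT_S_NOT_REGISTERED(...)" : <SndCsMain: 756: 889: S0: 09:54:57> :SR: Cmd 0026d900, CxtG 89a4cef1, WS EPT_S_NOT_REGISTERED [SndFail - Send Penalty]',
    'ERROR on NtFrs_0002.log : "RPC_S_CALL_FAILED_DNE(...)" : <SndCsMain: 2348: 873: S0: 17:54:38> ++ ERROR - EXCEPTION (000006bf) :  WStatus: RPC_S_CALL_FAILED_DNE',
    'ERROR on NtFrs_0002.log : "RPC_S_CALL_FAILED_DNE(...)" : <SndCsMain: 2348: 874: S0: 17:54:38> :SR: Cmd 0026b9e8, CxtG 89a4cef1, WS RPC_S_CALL_FAILED_DNE [SndFail - rpc exception]',
]

counts, codes = summarize(SAMPLE_LINES)
print(dict(counts))  # {'EPT_S_NOT_REGISTERED': 3, 'RPC_S_CALL_FAILED_DNE': 2}
print(codes)         # {'EPT_S_NOT_REGISTERED': 1753, 'RPC_S_CALL_FAILED_DNE': 1727}
```

The counts agree with the "Found 3 ... / Found 2 ..." summary in the report, and 1753 and 1727 are the decimal forms of 0x6d9 and 0x6bf.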

I also did a:

    net stop NetLogon
    net stop Ntfrs
    del %systemroot%\ntfrs\jet\Ntfrs.jdb
    del %systemroot%\ntfrs\jet\Sys\Edb.chk
    del %systemroot%\ntfrs\jet\log\edb.log
    del %systemroot%\ntfrs\jet\log\res1.log
    del %systemroot%\ntfrs\jet\log\res2.log
    net start NetLogon
    net start Ntfrs

on both servers. Now that I think about it, maybe I shouldn't have done that? With both FRS databases deleted, will there be no replication between the two? How do I recover from such an error?

I set up 2 VMware machines, both Win2k3 Server DCs, and ran del %systemroot%\ntfrs\jet\Ntfrs.jdb etc. just like above, and I ran into the same problem I'm facing with my real servers... Help plz.

Thx in advance!

ATH
colin_harfordCommented:
Without a good backup, there isn't much that can be done: there is nothing left to restore the DB from, as it's gone on both servers, and so are the logs.




farodaneAuthor Commented:
Colin,

Yea, I'm afraid so. So my question would be: is a backup from Windows 2000 Server good enough? From just before I upgraded the servers?

Or is there any "easy" way to export (back up) all users and computers (objects) from AD, then re-install the servers and import the data again?

I need 100% guarantee that exporting objects works :/

Thx in advance,

ATH.

PS. I'm puzzled that there is no way back if both Jet databases are corrupted or deleted. Some kind of restore or something...
colin_harfordCommented:
Rolling back to the 2K backup may work, maybe not.  It depends on how old the info is. AD doesn't like jumping backwards in time, and the further back, the less it likes it.

At this point, it would be better to export everything.  There are a few commands used for exporting, you can use the GUI versions, or there is also LDIFDE.exe which is a tool for bulk export/import of AD objects.  

http://support.microsoft.com/kb/q237677/
http://www.microsoft.com/technet/prodtechnol/windowsserver2003/library/ServerHelp/32872283-3722-4d9b-925a-82c516a1ca14.mspx
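
As a rough illustration of what an LDIFDE export call could look like for this domain, here is a small Python sketch that derives the base DN from the DNS domain name seen in the FRSDIAG output (`fshf.hoydalar.fo`) and assembles the argument list. The helper names, the output file name and the LDAP filter are illustrative choices, not something from the thread. The switches used (`-f`, `-s`, `-d`, `-p`, `-r`, `-l`) are standard ldifde export options, but check `ldifde /?` on your own server before relying on them.

```python
def base_dn(dns_domain):
    """Turn 'fshf.hoydalar.fo' into 'DC=fshf,DC=hoydalar,DC=fo'."""
    return ",".join("DC=" + label for label in dns_domain.split("."))


def ldifde_export_args(server, dns_domain, out_file,
                       ldap_filter="(&(objectCategory=person)(objectClass=user))",
                       attributes=None):
    """Build the argument list for an ldifde export of user objects.

    attributes: optional list of attribute names to export (-l).
    Without it, ldifde exports every attribute it can read.
    """
    args = [
        "ldifde",
        "-f", out_file,             # file to write
        "-s", server,               # DC to read from
        "-d", base_dn(dns_domain),  # search root
        "-p", "subtree",            # search the whole tree below the root
        "-r", ldap_filter,          # which objects to export
    ]
    if attributes:
        args += ["-l", ",".join(attributes)]
    return args


print(ldifde_export_args("zeus", "fshf.hoydalar.fo", "users.ldf"))
```

On the DC itself the list could be handed to `subprocess.run(args, check=True)`. Passing a list rather than a single string avoids the `&` in the filter being interpreted by the command shell.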

Please note that LDIFDE cannot export passwords, so the passwords for all user and computer accounts would have to be reset.  You may need to visit the computers to complete this.  You can use commands like dsmod and dsquery to change the passwords after the fact, or you can use ADModify to do it with a GUI.  Or you can use ldifde to set them when the accounts are being imported.
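
To make the last option a bit more concrete: when a password is set through an LDIF file, Active Directory expects the `unicodePwd` attribute to hold the password wrapped in double quotes, encoded as UTF-16LE and then base64-encoded. The Python sketch below builds such a record. The function names and the example DN are hypothetical, and as far as I know the directory only accepts this change over an encrypted connection, so treat it as an illustration of the encoding rather than a ready-made procedure.

```python
import base64


def ldif_unicode_pwd(password):
    """Encode a password the way the unicodePwd attribute expects it."""
    quoted = '"' + password + '"'
    return base64.b64encode(quoted.encode("utf-16-le")).decode("ascii")


def password_change_record(dn, password):
    """Return one LDIF 'modify' record that replaces unicodePwd for `dn`."""
    return "\n".join([
        "dn: " + dn,
        "changetype: modify",
        "replace: unicodePwd",
        "unicodePwd:: " + ldif_unicode_pwd(password),  # '::' marks base64 data
        "-",
        "",
    ])


# Hypothetical account in the domain from this thread.
print(password_change_record(
    "CN=Test User,CN=Users,DC=fshf,DC=hoydalar,DC=fo", "newPassword"))
```

One record per account can be written to a single file and imported in one go. Remember to force a password change at next logon if temporary passwords are used.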

The Group Policy Management Console (GPMC) can back up and export your group policies, including your default domain policy.

I think that's what you need.  You can also back up/export the DNS records, but it would probably be safer to start fresh for DNS if DNS is AD-integrated.

All file permissions on the file servers will have to be reset.

What to do.



1) Choose a computer. I just selected Ares, just because.
2) Export all objects from Zeus and save them elsewhere.
3) Ghost Ares (just to have a backup)
4) Install a new UPS, then format and re-install Windows 2003 on Ares
5) Patch and configure Ares
6) UNPLUG ZEUS (so dcpromo can't find it)
7) Run dcpromo and configure your server with DNS, etc.
8) import all objects
9) set passwords, rejoin a few computers to the new domain if needed
10) test
11) test
12) test
13) when you have tested with a number of accounts, computers, etc that things look right rejoin the rest of the computers to the network
14) Ghost Zeus
15) Format and re-install Windows 2003 on Zeus, patch and configure
16) Dcpromo and set up Zeus, i.e. DNS
17) Test, test, test
18) Invest in a backup system for AD.  This could have been a lot worse...  Even if you just use NTBackup, back up the system state, save it on another server, and burn it to a CD.
19) Test your backups.
20) Go on vacation.   :)
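
For steps 18 and 19, one cheap safeguard is a scheduled check that a recent backup file actually exists. The Python sketch below reports the age of the newest `.bkf` file (the extension NTBackup uses) in a folder. The function name and the idea of alerting on a threshold are assumptions for illustration, and a file-age check is no substitute for a real test restore.

```python
import os
import time


def newest_backup_age_days(folder, suffix=".bkf", now=None):
    """Return the age in days of the newest file in `folder` ending in `suffix`.

    Returns None if the folder contains no such file.
    """
    now = time.time() if now is None else now
    mtimes = [
        os.path.getmtime(os.path.join(folder, name))
        for name in os.listdir(folder)
        if name.lower().endswith(suffix)
    ]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 86400.0
```

A scheduled task could call this on the backup share and send a warning if the result is `None` or larger than, say, two days.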

As for the P.S.: AD is fairly resilient in its design, partly because the whole backend is in LDAP.  It's not often that people have two servers that lose FRS while everything else keeps working, and it's even rarer to have no good backup on top of that.  AD does keep some minor abilities to restore on the machines, but not much.


Last summer, I had a worse disaster than this, I called Microsoft for help, explained the problem and what I wanted to do to correct it.  They laughed at me, told me it was impossible, and then wished me luck.  Three days and 2 long nights later, I had things back up in a working state.

Sorry it's gotta be this way...

GL with the restore.

~CH

farodaneAuthor Commented:
Re-installed ARES. Transferred all users etc. Still testing.

How do I reset the file permissions on the file server (Ares and Zeus are both file servers, with DFS)? The users don't have access now, because the security GUID no longer points to that specific user. What is the easiest way to do this?
colin_harfordCommented:
Sadly, this is the manual part: Windows stores permissions by SID, not by name.
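
To illustrate why the old ACLs stop working, and what would be needed to translate them: every re-created account gets a new SID, so the only link between old and new is the account name. If the old SIDs were recorded before the rebuild (for example in the object export), a translation table can be built by joining the two lists on the logon name. The Python sketch below shows just that join. The function name and the SIDs are made up, and applying the table to the file system would still need a separate ACL tool, which is not covered here.

```python
def build_sid_map(old_accounts, new_accounts):
    """Map old SID -> new SID for accounts that share a logon name.

    Both arguments are dicts of sAMAccountName -> SID string.
    Names are compared case-insensitively.
    Returns (sid_map, sorted list of old names with no match in the new domain).
    """
    new_by_name = {name.lower(): sid for name, sid in new_accounts.items()}
    sid_map = {}
    unmatched = []
    for name, old_sid in old_accounts.items():
        new_sid = new_by_name.get(name.lower())
        if new_sid is None:
            unmatched.append(name)
        else:
            sid_map[old_sid] = new_sid
    return sid_map, sorted(unmatched)


# Made-up example data: two accounts before and after the rebuild.
old = {"anna": "S-1-5-21-1111111111-2222222222-3333333333-1105",
       "jonas": "S-1-5-21-1111111111-2222222222-3333333333-1106"}
new = {"Anna": "S-1-5-21-4444444444-5555555555-6666666666-1110"}

sid_map, unmatched = build_sid_map(old, new)
print(sid_map)    # anna's old SID now maps to her new SID
print(unmatched)  # ['jonas']
```

Accounts that end up in the unmatched list have no counterpart in the new domain and would need to be handled by hand.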
farodaneAuthor Commented:
Yea, figured that much! Oh well. Colin, since you made the list of "what to do", I'll give you the points! :)

One last question:

Now that I'm on a whole "new" domain, the relation to my child domain (the administrative domain) is no longer valid. Do I have to dcpromo the admin domain and make it a new child, or is there any way to adopt the current child domain without dcpromo (losing all security, just like Ares and Zeus)?
colin_harfordCommented:
Was it a child domain in the same forest?
farodaneAuthor Commented:
Yep, it was.

I'm afraid I have to redo that child domain too, but I love good surprises! :)
colin_harfordCommented:
Yup... me too.
colin_harfordCommented:
Sorry, you never mentioned it...


Why did you only give me a B?

