Solved

NT4.0 Server Kicks out users from network!

Posted on 2004-10-27
Last Modified: 2013-12-28
Hi there,

We seem to be experiencing a really nasty problem with one of our customer's NT4 servers.

Site info:
The site has 2 NT4 servers, one a PDC and the other a BDC.  The email system is kept on the PDC and the
data is all kept on the BDC.  The workstations are a mix of Windows 98, XP and 2k.

The network is CAT5 Throughout 10/100  standard except for a gigabit link between the two servers which
are situated in different buildings from each other.

Problem info:
What happens is, at random intervals, the BDC decides to just throw everyone off the network.  Anyone in any network docs or apps which are situated on the BDC is thrown out.  Any mapped network drives to the BDC become unavailable.

Double clicking on the BDC in network neighborhood from a workstation during this time gives the message "the server is not accepting this type of request at this time, please try later"

However, the plot thickens.  The BDC is still pingable, and we can still VNC into the server desktop, so the IP layer of the network on it still works OK.  It just seems to kick the users out; it's like the browser service just decides to stop working or something!

The only way we are resolving this issue is by rebooting the BDC.  However, on reboot the server always hangs at the windows NT4 server splash screen.  It then needs to be rebooted again before it all works again.

NT Event logs show no anomalies whatsoever!

Additional info:
Mostly happens through the night, sometimes in the afternoon though.
Set up replication service between BDC and PDC to run one night, BDC crashed out that night and needed rebooting in the morning (could be co-incidence, could be linked)


Server spec:
Windows NT4.0 Server with SP6a
Intel pentium DUAL XEON 2.8
512MB DDR
3x 36GB SCSI Raid 5 array HOT SWAP
Intel on board Gigabit Lan card
HP 20/40 Tape drive using veritas 8.6 running nightly backup

This is an urgent problem, hence the 500 pints.  I mean points :)


0
Question by:Allan_Shiels
    28 Comments
     
    LVL 7

    Expert Comment

    by:James Rankin
    Have you tried resetting the domain secure channels to the PDC when this happens, e.g. run the command

    nltest /server:bdcname /sc_reset:domain\pdcname

    try reapplying SP6a also
    0
     

    Author Comment

    by:Allan_Shiels
    I have not tried either of these.

    What does resetting the domain secure channels do exactly?

    0
     
    LVL 7

    Expert Comment

    by:James Rankin
    Secure channels are the mode of communication between servers and domain controllers. They can drop out and give you access problems. I would probably expect to see this more on a member server than a DC, but anyway, it's worth a try. Resetting them will re-establish the secure link between the DCs (you will need to get nltest.exe from the NT4 Resource Kit though).

    I would also look at your paged-pool and non-paged pool memory usage, memory handle count, and the status of the Netlogon service, amongst other things. Does the server return an error code (a number) when all the connections drop? Are there a large number of open connections (check the Server applet in Control Panel)?
    0
     
    LVL 7

    Expert Comment

    by:James Rankin
    I would also disable the replication and use robocopy (from the ResKit again) as a scheduled job instead. It may help narrow things down.
    0
     

    Author Comment

    by:Allan_Shiels
    I have disabled the replication service.

    The main problem is that when it occurs throughout the day the customer just goes and restarts the server themselves, thus giving us no chance to connect in remotely and check services and things out!

    I have told them to let us know straight away when it happens so I can try and obtain new information, which I will post back here.  Could be days though.
    0
     
    LVL 2

    Expert Comment

    by:vivekpara
    It sounds like the RPC service (port 493, I think) is taking a hit of some sort.  Have you done a virus sweep recently?  Some viruses will attempt to attack this port...and with the Microsoft patches, it denies access...but still keeps the service responding to the attempts.

    What does the event log say in the System area regarding events during these "blackouts"?  You will normally see errors against the RPC service at this time.

    Besides viruses, if you are using the BDC without a server WINS enabled, you will often get browser elections that can foul things up.  Normally the BDC will become the Master in these elections and when overloaded, it may not be able to service requests.

    The reason I strongly believe it's the RPC is because you can still ping.  Pinging is just at the Network layer while the RPC is at the application layer...you would be able to ping but not get RPC requests.  I have a feeling if you can use a network analyzer when this occurs, you should be able to see "where" the problem is coming from.
    0
     
    LVL 2

    Expert Comment

    by:vivekpara
    Port 135 is RPC endpoint mapper and 139 is NETBIOS.  I think these are the ones to check out.  I don't know HOW I got 493...
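
Since the customer tends to reboot before anyone can look, it may help to log when these two ports stop answering. Below is a minimal Python sketch of that idea, run from any workstation that has Python. The host name `bdcname`, the log file name and the 60-second interval are all assumptions for illustration. It only tests whether a TCP connection can be opened, so it says nothing about why a port stopped responding.

```python
import socket
import time
from datetime import datetime

# Ports named in this thread: 135 (RPC endpoint mapper) and 139 (NetBIOS session).
PORTS = (135, 139)


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_once(host, ports=PORTS, timeout=3.0):
    """Probe each port once and return a {port: reachable} mapping."""
    return {port: probe(host, port, timeout) for port in ports}


def format_line(host, results, now=None):
    """Build one timestamped log line such as '... bdc 135=up, 139=DOWN'."""
    now = now or datetime.now()
    status = ", ".join(
        f"{port}={'up' if ok else 'DOWN'}" for port, ok in sorted(results.items())
    )
    return f"{now:%Y-%m-%d %H:%M:%S} {host} {status}"


def watch(host, interval=60, logfile="bdc_ports.log"):
    """Append one status line per interval until interrupted (Ctrl+C)."""
    while True:
        line = format_line(host, check_once(host))
        with open(logfile, "a") as fh:
            fh.write(line + "\n")
        time.sleep(interval)
```

Calling `watch("bdcname")` (with the real server name substituted) leaves a timestamped record, so the first `DOWN` line shows roughly when the server stopped accepting connections even if it is rebooted before anyone connects in.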
    0
     

    Author Comment

    by:Allan_Shiels
    I will connect in and have a look at the event log from the last time it happened.
    0
     

    Author Comment

    by:Allan_Shiels
    There is just absolutely no logging in the event log whatsoever at the times of this.  The only entries are 'event log is being shut down' (presumably when they shut the server down), then after that it's the good old 'windows NT4 sp6 multiprocessor free' message when it starts up.

    However, a few weeks back there are two red warnings from 'aacdisk' during bootup stating that the file system is corrupt and unusable and to run chkdsk.

    I just ran chkdsk and it said it found some minor inconsistencies, but I just don't think this is causing the problem.  Unless the network card is faulty and it's taking corrupt packets from the network and putting them on the hard disks?
    0
     
    LVL 9

    Expert Comment

    by:TannerMan
    I too had this problem with a BDC on a multi-spoked WAN and never could resolve it, even with the suggestions mentioned above.  It was, for me, most definitely, the loss of the secure channel but I could never restore it. A reboot would sometimes allow the BDC to pull a Browse list from the PDC and it would work OK for a while and then shut down again.

    Even though I don't advise it, it was suggested to me to remove the BDC from the domain and re-add it, but that didn't help either.

    I had to bring up a separate BDC for that network, move the data needed over, and shutdown the original.

    DO NOT do what I did on my say so. Exhaust everything you can find including the very good advice already mentioned by the above posters.
    0
     

    Author Comment

    by:Allan_Shiels
    Tannerman, did you try replacing the network card?

    I am going to try that, then reinstall SP6 after that.

    Also, please can you tell me the EXACT spec of the server that you had this problem with?  Software installed and all.  I know it's a lot to ask but we could compare specs and see if there are any consistent apps/hardware!

    0
     
    LVL 9

    Expert Comment

    by:TannerMan
    I did replace the NIC
    I did re-apply SP 6a

    NT4.0 BDC running SP 6a
    No software outside of OS, just data storage.
    The machine did have exchange 5.5 at one time, but instead of uninstalling (usually leaves a lot of junk behind) I just killed all exchange services.

    I wish you luck on this. The site I had the problem with was part of a 13 node network with BDC's at each node  and this was the only one that gave the problem.
    0
     

    Author Comment

    by:Allan_Shiels
    If you don't mind me asking, what was the hardware spec?

    Were you running Adaptec RAID software?
    0
     
    LVL 9

    Expert Comment

    by:TannerMan
    OH, sorry I didn't include that.
    The box was an old proliant 800 Ppro200mhz 64 meg ram (well overdue for updating anyway), but no raid on the scsi drives other than software based mirroring.
    0
     

    Author Comment

    by:Allan_Shiels
    Well, considering that your rig is substantially older and less powerful than mine, and not even running similar software, this must surely be down to a software problem with NT SERVER.

    It's the only consistency between the two incidents.

    When it happens again I'm going to check the secure channels.

    Tannerman, may I ask you how often it happened with you, and if there was any specific time?
    0
     
    LVL 2

    Expert Comment

    by:vivekpara
    kz2ofl may be right, then.  But normally you lose secure channels when you promote or demote DC incorrectly (or should I say PDC and BDC in NT 4.0 cases).  Here are the articles to back him up:

    http://support.microsoft.com/default.aspx?scid=kb;en-us;150518

    This also resets the NETDOM password.   Expect to spend an hour doing this...but it can take as little as 10 minutes if you get everything to work right.
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;260575

    A more detailed explanation of netdom
    http://support.microsoft.com/default.aspx?scid=kb;en-us;329721

    I had a similar problem in a Mixed-Mode domain...and I think kz2ofl is on the money on this one.  The question is "why is it losing secure channels?"  I'm wondering about the switch or router you have between these two devices.  Is it using cut-through or spanning tree algorithms?  Sometimes, if the router builds a routing table with certain parameters, it will "misdirect" packets to the wrong address.

    You may be better served moving the two servers in to the same area and letting the clients on the other end log in through the fibre channel route.  It would be a simple way to see if that is your problem.  If the problem still persists when the two computers are on the same subnet and "routing space", then you need to dig deeper.

    0
     

    Author Comment

    by:Allan_Shiels
    The servers are on the same subnet, the buildings are only separated by a road and a factory.

    There is a CAT5 cable that runs over the road and through the factory which connects two netgear 10/100/1000 switches together.  The PDC and the BDC each hang off one of these switches, here is a crude diagram -

    BDC <> 2M Patch cable <> Netgear gigabit port on switch <> CAT5 daisy chain <> Netgear gigabit port on switch <> 2M Patch cable <> PDC

    I hope that makes some sense.

    There are several workstations that hang off both the netgear switches utilising the 10/100 ports.

    Before all this happened, the PDC was the only server there, and all the data and email was accessed on that server alone, with no problems.  It's only since we put the BDC in there and got everyone mapping network drives to it that the problem has arisen.
    0
     
    LVL 2

    Expert Comment

    by:vivekpara
    If it is a switch, you could still be experiencing the problem there.  Switches direct traffic at the Data Link layer (layer 2) of the protocol stack.  Hubs are just dumb devices that repeat every packet across all the ports...but the switch actually looks at the header and tries to "interpret" where a packet should go.

    I agree that it may not be LIKELY considering a reboot seems to solve the problem...but I would consider it.

    Is there a WINS server in place?  Are you using P or B node type for resolution in WINS?  Or are you using NETBIOS over TCP/IP?  If NETBIOS, it could be a broadcast storm causing your problems.  WINS reduces this significantly if used in P mode.  All computers will be resolving to the BDC...and if a broadcast storm occurred, it would overwhelm its interface.
    0
     

    Author Comment

    by:Allan_Shiels
    We are using NETBIOS over IP.  I dig what you are saying about broadcast storms - can't faulty network cards cause these?

    If so, could be tricky to track down.

    What I want to do is knock the BDC back onto a 100Mb card instead of a 1000Mb card.  I just don't trust NT4 and gigabit LAN.  Too old an OS with too new a tech, if you know what I mean.  It can't have been rigorously tested enough in all environments.

    I would also like to add at this point that this BDC has already had SATA mirroring disks replaced with RAID5 SCSI because we were getting corrupted data in our accounts package.

    0
     
    LVL 2

    Expert Comment

    by:vivekpara
    Faulty cards on your network can cause broadcast storms.  The most likely candidates are computers that have a tendency to slow down periodically.  The best way that I've found is to wait for a storm and just turn on Network Analyzer immediately and start sifting through packets against the interface.
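
To make the "sifting through packets" step concrete, here is a rough Python sketch of the arithmetic involved. It assumes the capture has already been exported as a list of (source MAC, destination MAC) pairs, which is a hypothetical format chosen for illustration and not something any particular analyzer produces. It reports what share of frames were broadcasts and which sources sent the most of them.

```python
from collections import Counter

# Ethernet broadcast destination address.
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def broadcast_share(frames):
    """Return the fraction of frames addressed to the broadcast MAC.

    frames is an iterable of (src_mac, dst_mac) string pairs.
    """
    frames = list(frames)
    if not frames:
        return 0.0
    hits = sum(1 for _, dst in frames if dst.lower() == BROADCAST_MAC)
    return hits / len(frames)


def top_broadcasters(frames, n=3):
    """Return the n source MACs that sent the most broadcast frames."""
    counts = Counter(
        src.lower() for src, dst in frames if dst.lower() == BROADCAST_MAC
    )
    return counts.most_common(n)
```

A single source dominating the `top_broadcasters` list during a blackout would point at that machine's card; what share counts as "a storm" depends on the network, so no threshold is suggested here.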

    It could be as simple as the card on the server...and the fact that rebooting seems to help does push things in that direction.

    I would also implement a WINS server and have your DHCP assign the node type (P preferably) and WINS information dynamically to your clients.  This can reduce broadcast traffic up to 50-70 percent on some networks (on average, I say about 20-30 percent, though).  I've done a test and seen dramatic improvements in throughput.  With Win2K, this is NOT as much of a problem with name resolution being performed in DNS...but all old Win98 and NT depend on NETBIOS for name resolution if there are no WINS services deployed...and even 2K and XP workstations are dependent on it if in an NT server domain.
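
For reference, the node type is handed out as DHCP option 46, alongside the WINS server list in option 44. The Python sketch below just tabulates the four standard values and the order in which each tries broadcast versus the WINS server. It is a simplified model: real clients also consult the name cache, LMHOSTS and so on, and the function names are made up for illustration.

```python
# DHCP option 46 (NetBIOS over TCP/IP node type) values, per RFC 2132.
# Each entry: option value -> (node name, name resolution methods in order).
NODE_TYPES = {
    1: ("B-node", ["broadcast"]),
    2: ("P-node", ["WINS"]),
    4: ("M-node", ["broadcast", "WINS"]),
    8: ("H-node", ["WINS", "broadcast"]),
}


def resolution_order(option_46):
    """Return (node name, ordered resolution methods) for an option 46 value."""
    name, order = NODE_TYPES[option_46]
    return name, list(order)


def uses_broadcast(option_46):
    """Return True if this node type ever falls back to broadcasts."""
    return "broadcast" in NODE_TYPES[option_46][1]
```

Only P-node (value 2) never broadcasts for name resolution, which is why it gives the biggest reduction; the trade-off is that clients resolve nothing by NetBIOS name if the WINS server is unreachable.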
    0
     
    LVL 16

    Expert Comment

    by:ahmedbahgat
    You may want to check the number of licences using License Manager.


    cheers
    0
     

    Author Comment

    by:Allan_Shiels
    ahmed,

    Initially, I thought it was something to do with the licensing also, as the message you get when you click on the server in network neighborhood is exactly the same as the message you get when you exceed client licences and you try to map a drive!

    When I first set up the server I forgot to put it up to 50 concurrent connection licences, and left it at 5.  This resulted in people not being able to connect to it, with the same message as they get now in network neighborhood.

    We have since upped the concurrent connections to 50.

    I wonder if the licence system is working OK...
    0
     
    LVL 16

    Expert Comment

    by:ahmedbahgat
    If you have a look at License Manager you will know if it has a problem by the yellow mark next to the installed product. It is not just a matter of adding licenses; you will also need to confirm purchasing them using License Manager again until all marks are blue next to the MS products. Not to say that you really need to purchase them to get around that.


    cheers
    0
     

    Author Comment

    by:Allan_Shiels
    I have nice blue boxes beside the server in licence manager.
    0
     
    LVL 2

    Expert Comment

    by:vivekpara
    Have you tried the network monitor to see what the bandwidth looks like?  Just curious.
    0
     

    Author Comment

    by:Allan_Shiels
    Where is the network monitor located?
    0
     
    LVL 2

    Accepted Solution

    by:
    It is normally located on your server if you have the Networking Tools (adminpak.msi) installed.  It's basically a discrete network sniffer which can monitor traffic.  If you want something more robust, Network Probe is one you can buy to evaluate traffic.

    A really neat system monitor, which may be more useful for you, is Visual Server Monitor.  It's by 2morrow software and they have a 30-day trial.  It will be able to monitor ALL your servers using a very visually appealing and easy-to-use interface.  I recommend trying that...it's free and you won't have to deal with trying to filter your results.
    It's not as comprehensive, but I bet it will help eliminate a lot of possibilities.
    0
     

    Author Comment

    by:Allan_Shiels
    I am going to download the 30-day trial then for Visual Server Monitor :)
    0
