drmarston

asked on

help diagnose Network gremlins

I need some help isolating some network gremlins.
Following no pattern I can discern, a handful of clients randomly "lose" the contents of network shares.

Some setup info:
AD domain; 120 clients (all XP SP2); several servers, HP switches, and a SonicWall firewall.
Two Windows 2003 domain controllers, both with DNS, replicating with each other; a separate server for DHCP and WINS; several functional servers; and a file server (referred to henceforth as "\\fileserver").
\\fileserver has multiple shares: Departments and Users.
Under Departments, I have shares for each dept, like accounting and sales, etc.
Under Users, each user has a folder
On all the clients \\server\departments\ is mapped to Y:\
and \\server\users\%username%\ is mapped to Z:\
All clients run Outlook 2003 or 2007 (no Exchange server) with their .pst files located in Z:\\%username%\mail\

The weirdness:
The symptoms almost always start the same way: Outlook throws a message that it can't find the user's .pst file.
When a client "loses it", you open "My Computer" and all the mapped drives show up, but when you double-click the \\server shares they open and show as empty.
(Mapped drives going to other servers are fine; only the ones pointing to \\server show as empty.)
If you type \\fileserver in the address bar, it opens, but shows only the "users" share; under that is the %username% share (only the currently logged-on user, not all of them), and opening that gets you another empty folder.
If you try to type \\fileserver\departments\ you get a "system can't find it" error.
I can ping to/from the client and I can ping to/from the server
Here is where I need help:
Depending on the client, one of the following three will fix it
(oh yeah, I can still VNC into the client when this is happening)
1) nbtstat -RR
OR
2) ipconfig /flushdns
OR
3) netsh interface ip delete arpcache
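
Since a different command fixes it on different clients, it may help to record which one actually restores the share on each incident. The following is only a minimal sketch in Python, assuming it is run on the affected client. The function names, and the idea of testing the share by listing it, are illustrative assumptions and not part of the original setup; the three commands are the ones listed above.

```python
import os
import subprocess

# The three fixes from the post, tried in the order listed.
FIXES = [
    ["nbtstat", "-RR"],
    ["ipconfig", "/flushdns"],
    ["netsh", "interface", "ip", "delete", "arpcache"],
]


def share_is_populated(path):
    """Return True if the path can be listed and contains at least one entry."""
    try:
        return len(os.listdir(path)) > 0
    except OSError:
        return False


def first_working_fix(path, fixes=FIXES, run=subprocess.call, check=share_is_populated):
    """Apply the fixes one at a time and report the first one after which
    the share lists contents again.

    Returns "not broken" if the share was already populated, the command
    line that fixed it, or None if none of the fixes helped.
    """
    if check(path):
        return "not broken"
    for command in fixes:
        run(command)
        if check(path):
            return " ".join(command)
    return None
```

On a client in the broken state, something like `print(first_working_fix("Y:\\"))` would show which step brought the share back. Note that a share that is legitimately empty would be reported as broken by this check.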

On a side note (and it may not be related, but it seems like the same gremlin could be causing both issues): our ERP / accounting system has been having major performance issues lately. It "hangs" on the client side (server proc is running at 5-15%, but the client is "hour-glassing").
ChiefIT

This is very odd. If you were unable to ping the IP address and other services were knocked down, then I would think it is an overloaded NIC on a temporary shutdown cycle.


DNS:
This definitely sounds like a cached DNS entry. Have you looked at the router to see if the list of LAN-side DNS servers has an outside DNS server or a non-existent DNS server in its list? Or maybe one of your DNS servers is not listed on the router's list of LAN-side DNS servers. You may find this odd, but maybe switching server precedence in the IP stack of the server will help.

Further investigation into the matter:
I assume you have already seen clean DCdiag and Netdiag reports, and that is why you are looking into a network issue? Running ipconfig /all on clients that are having issues may shed some light on the subject.

Dumb switches and mode configurations:
Let me see if you also looked into one last thing. Are you using dumb switches, and if so, is Spanning Tree PortFast enabled? Also, is the mode of operation set to auto instead of 100Mb full duplex?

Let me know if you have questions on the above.

I would also suggest, if you can, try the following.

1) Reset the router/switch.  Power it off for at least 60 seconds, if possible.
2) Upgrade the network card drivers on the workstations
3) Upgrade the network card drivers on the server.

I've found that network drivers are often up to four years out of date, and a lot can change in that time.

drmarston

ASKER

Sorry for the delay.. I'm waiting for it to happen again so I can run DCdiag and Netdiag during an "incident"
I will look into the drivers today; the server is only about a year old, but I will update nonetheless.
As for the switches (All managed, but set to "auto" for speeds), I keep them up to date (quarterly), and the clients are on different switches, but I will update the switch the server is on.
OK, we went over a couple of issues that were not relevant.

So, I am reviewing the original message. It appears that when you try to access Outlook, your client loses its cached DNS setting. All three of the temporary fixes reset the cached DNS server's pointer. nbtstat -RR is a repair function for NetBIOS name translation. ipconfig /flushdns flushes cached DNS entries. Deleting the ARP cache will also remove the client's cached DNS record as well as the cached DHCP server's address. Either way, you are resetting the cached DNS server's pointer.

One thing I researched (and the reason for the delay in my response) was the way Outlook uses DNS to connect to an Exchange server. In my research, I found that Outlook tries to create its own network connection, via dial-up, and locate a DNS server that will tell it what the Exchange server's IP address is.

I am thinking your local DNS server may need a DNS record that points it to the "off-site Exchange server". You did mention that every time Outlook tries to find a mail message in the .pst file, and can't find it, things start shutting down. When found, it will synchronize with the server and you will have your new messages.

I think Outlook is trying to create its own DNS connection, can't find the Exchange server's DNS record on your local DNS server, and goes out to the ISP's DNS server for your Exchange server's connection. After it finds the connection, it caches the connection and overwrites the cached DNS records on the client so that they point to a remote DNS server that provides DNS for the Exchange server.

I think the fix for you is to create a DNS record for your "off-site Exchange server". Then Outlook will not have to seek a DNS server past your local DNS server, and your client's network connection's DNS setting will not be overwritten by Outlook.

There must also be settings that prevent Outlook from dynamically changing the network connection every time it connects to the server. Outlook is a shell for email and should have the ability to connect with multiple off-site mail servers without interfering with local DNS cached settings.

The result of going to an outside DNS server is you will not be able to access LAN shares and may not see all computers in "My network places" without WINS.

Let me research Outlook's settings a little more for you and see if I can prevent Outlook from trying to create its own connection to the mail server. The least we can do is force Outlook to connect using the default network connection that is already in use rather than try to create its own.

I hope this makes sense.



Oops, My Network Places is a master browser issue, so you will be able to see computers in My Network Places. But network shares are still affected, pinging by computer name will not work, and other DNS-related issues could appear.

Before you add an off-site DNS server for Exchange, one thing you might check is the DNS entries on the clients, router and servers. Make sure the only DNS servers listed on the clients are local DNS servers. Make sure the only DNS servers on the router are local DNS servers. In fact, the only place you should see an outside DNS server is in forwarders or in the root hints. All network bindings and the router should point to local DNS servers.
Wish I had caught you a little sooner...
We do NOT have an Exchange server, all clients just run POP, and keep their .pst files on a share on the file server (so they get backed up). It is just Outlook that shows the first symptoms of the problem.
As for DNS, all clients are set to DHCP, and DHCP only hands out the 2 domain controllers as DNS servers. The only "outside" DNS servers are the root hints on the DCs
(with the exception of the firewall, which has our telco's DNS listed as well).
My understanding is this:
whenever you type \\server in the address bar, Windows first uses WINS, then the LMHOSTS file, then DNS to resolve it, so it's like something in that chain of events goes wrong, but the error seems to be a moving target (hence it being one of the 3 that fixes it).
I know I'm not explaining this very well, and I apologize if you did a bunch of research for nothing
(Well, you can always say you broadened your own understanding :P )

As for the comment about all 3 clearing the cached DNS, I'm confused why one will work on one client and not on another.
Also, if it is a corrupt DNS cache, what could possibly cause this?
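
To make the "moving target" idea concrete, the lookup chain described above can be modelled as a list of stages tried in order, with a trace of which stage answered. This is only a toy model in Python: the dictionaries are hypothetical stand-ins for WINS, LMHOSTS and DNS, the address is a made-up example, and the order is simply the one stated in this post (the actual order on a given client may differ).

```python
def resolve_with_trace(name, stages):
    """Try each (label, lookup) stage in order; stop at the first answer.

    Returns (address, trace), where trace lists every stage that was
    consulted together with what it returned (None means "no answer").
    """
    trace = []
    for label, lookup in stages:
        address = lookup(name)
        trace.append((label, address))
        if address is not None:
            return address, trace
    return None, trace


# Hypothetical stand-ins for the real lookups: plain dictionaries.
wins = {}                              # WINS has no record for the name
lmhosts = {}                           # nothing in LMHOSTS either
dns = {"fileserver": "192.0.2.10"}     # example address, not a real one

stages = [
    ("WINS", wins.get),
    ("LMHOSTS", lmhosts.get),
    ("DNS", dns.get),
]

address, trace = resolve_with_trace("fileserver", stages)
print(address)
print(trace)
```

If a stale or wrong answer sits in an early stage, later stages are never consulted, which would be one way for a different flush command to be the one that helps on different clients.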

OK, so you have WINS and DNS (root hints). That's good information.

You asked:
"Also, if it is a corupt DNS cache, what could possibly cause this?"

According to an article I read, Outlook goes out and makes its own connection. It was set up that way mostly for dial-up connections. Microsoft wanted to make sure Outlook had the ability to connect. If it makes its own connection, it may bypass the local WINS server and the local DNS server and go straight to the DNS server that hosts your POP mail server. If so, it may have changed the client computer's cached DNS pointer. I think if you go into the settings of Outlook and point Outlook to your regular internet connection, then tell it to always use this connection when Outlook starts, you may fix this problem. That would force Outlook to use your internet connection with WINS and DNS configured.

I also think the reason Outlook appears to be the first sign of the problem is because it may be the problem.
Well, the weirdness just got worse, and we can forget about it being Outlook related, and probably take the servers out of the equation as well.
This morning it happened to a client, but this time it was on one of the functional servers (\\engineer, which is where we house all the CAD drawings and models)
Email was working fine, \\fileserver was responding.
if I tried to go to:
(mapped drive) W:\ = blank
\\engineer\ = blank
\\engineer\share\ = blank
\\engineer\share\drawings = Error (Can't find it)
\\engineer\apps\ = Error (Can't find it).

Here is what I did:
Tried the 3 commands in the first post. This time they had NO effect.
Ping engineer: resolved and responded.
arp -a: the correct MAC for the IP of the server was listed.

I logged the user off and logged on as domain admin
the same symptoms!

I rebooted, problem vanished.

So: I would say it is definitely client side (and spans logons).
It looks like lower-level TCP/IP is working.
As is DNS / WINS (I could resolve the name and ping it, and the file server was responding just fine).
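
A successful ping only shows that ICMP gets through. Windows file sharing runs over TCP (port 445 for direct-hosted SMB, port 139 for the NetBIOS session service), so one extra check during an incident is whether a TCP connection to those ports can be opened from the affected client. A minimal sketch in Python, assuming Python is available on the client; `tcp_port_open` and `smb_reachability` are illustrative names, not existing tools.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def smb_reachability(host, timeout=3.0):
    """Check the two TCP ports used by Windows file sharing."""
    return {port: tcp_port_open(host, port, timeout) for port in (139, 445)}
```

For example, `print(smb_reachability("engineer"))` on the troubled client would show whether the session layer is reachable even though the shares list as blank.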
Next time you have a troubled client, try this at the command prompt:

Net Stop Netlogon
Net Start Netlogon

And see if that is a one-time fix for you.
This really sounds like Spanning Tree. There are no real juicy errors on the DC, you have intermittent comms, and it sounds like the Netlogon service pauses.

This article recommends removing the spanning tree algorithm. (It is a word-for-word copy of a Microsoft KB article.)
http://kbalertz.com/202840/Client-Connected-Ethernet-Switch-Receive-Several-Error-Messages-During-Startup.aspx

Earlier, I recommended enabling spanning tree port fast.

Maybe the best method is to get hold of your switch vendor's customer service and ask for recommendations.

Do you have 5719 errors on the clients or any other errors that can shed some light on the subject?
Spanning tree is disabled on all switches,
Error did show up on the client this time (other clients were clean):
------------------------------------
source: Userenv
EventID: 1054
Windows cannot obtain the domain controller name for your computer network. (A socket operation was attempted to an unreachable host. ). Group Policy processing aborted.
------------------------------------
But why would this affect one server and not the other?
On the server with the problem shares, do you by any chance have some sort of NIC teaming?
OK, this is the information I was looking for. Remember I was telling you it sounds like spanning tree. Well, maybe this link will better describe your NLB problems:

https://www.experts-exchange.com/questions/23037760/Regarding-Windows-network-load-balancing.html
jburgaard:
On the file server, yes; I used Windows to create a network bridge across the dual gigabit Broadcom integrated NICs (it's a Dell PE 2950).
On the Engineering server, No.
ChiefIT:
I read the post, but my switch does see the bridge MAC, so I'm not completely sold (but it is looking like a good candidate).
I bridged the connections for more bandwidth, not necessarily for high availability, so I could disable the bridge and just have one NIC running.
But my question then is: why does this only affect some clients and not all?
Also, why sporadically?
I'm not a networking guru by any stretch, but if it is switch / NIC related, shouldn't this be a constant thing?
Told ya they were gremlins!!

:)
ASKER CERTIFIED SOLUTION
ChiefIT

This solution is only available to members.
Have you checked time synchronisation on the Windows servers and clients? Is the Windows Time service running on the clients, and is the server getting time from a good source?

The symptoms described above are similar to cases where a client or server clock is drifting or getting a bad time signal. This causes time synchronisation to be lost and authenticated sessions to become stale. Attempts to re-establish the sessions will fail because authentication is rejected by the server. When the client is rebooted, the Netlogon service synchronises time with the DC and the sessions are established.

Check "net time" command.
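
As a rough illustration of why drift matters: Kerberos in an Active Directory domain tolerates a clock difference of five minutes by default, and authentication is rejected beyond that. The sketch below, in Python, just compares two timestamps against that tolerance. How the two times are obtained (for example, by reading the output of "net time") is left out, and the function names are illustrative.

```python
from datetime import datetime, timedelta

# Default maximum clock skew tolerated by Kerberos in an AD domain.
KERBEROS_DEFAULT_TOLERANCE = timedelta(minutes=5)


def clock_skew(client_time, server_time):
    """Return the absolute difference between the two clocks."""
    return abs(client_time - server_time)


def skew_exceeds_tolerance(client_time, server_time,
                           tolerance=KERBEROS_DEFAULT_TOLERANCE):
    """Return True if the clocks differ by more than the tolerance."""
    return clock_skew(client_time, server_time) > tolerance


# Example values only: a client running seven minutes behind the DC.
dc_time = datetime(2008, 1, 15, 9, 0, 0)
client_time = datetime(2008, 1, 15, 8, 53, 0)
print(clock_skew(client_time, dc_time))
print(skew_exceeds_tolerance(client_time, dc_time))
```

A client that is only a few seconds off would not trip this check, so a clean result here would point away from time sync as the cause.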
ChiefIT:
I will disable the bridge this weekend and go down to a single NIC, see if that does it.
Then figure out how it should be done.

udp1024:
I have the DC syncing with an outside source, and the clients against the domain. I haven't noticed any errors, but I will make sure to look (if / when) it happens again.

Everyone:
Thanks for all the help and patience!!
With the problem being sporadic, I know this is taking a while to figure out.
What network adapters are installed in the clients and server? Someone commented earlier that the NIC may be getting overloaded and the driver giving up the ghost quietly. I have seen something similar with 3Com NICs a few years back.
Well, I went down to a single NIC, and it appears to have resolved the issue (it hasn't happened all week).
drmarston:

Disabling one NIC will work well for most LANs. Multiple NICs really aren't needed. Because the defined path over a switched network causes common problems for many administrators who use two NICs for NLB, I am trying to work out a fix. Your question caught me in the first stages of discovering this fix.

There is additional information I wish to provide you that I have found during my research, (in case you wish to go back to NLB over a switched network). I see these errors daily since I see about four or five requests for help with no stand-out errors on DCdiag or event viewer reports. I will provide some of the links to some errors that do stand out as a result of this configuration conflict.

The settings I am concentrating on for proper NLB over a switched network are:
Spanning tree portfast, proper configuration of dual NICs, NLB over a switched network, and Multicast/Unicast modes.

NLB over a switched network, Microsoft fix:
http://support.microsoft.com/kb/261957

technet article on Network load balancing on a switched network:
http://technet.microsoft.com/en-us/library/bb742455.aspx

A little explanation of spanning tree and portfast:
http://itt.theintegrity.net/pmwiki.php?n=ITT.Spanning-TreeAndPortfast
(NOTE: Portfast is necessary for XP clients. XP clients will time out otherwise.)

The differences between Unicast and Multicast modes:
http://support.microsoft.com/kb/291786
(The server requires Multicast mode to work with dual NICS)

Event ID 5719, spanning tree portfast:
http://support.microsoft.com/kb/247922

Preventing NIC flooding caused by NLB:
http://technet2.microsoft.com/windowsserver/en/library/bf3a1c95-f960-4ed3-b154-3586631fb0061033.mspx?mfr=true

NIC flooding is the problem. When a NIC is flooded it will shut down services on the NIC and cause various errors. One error I see a lot is Event 1030 for WSUS. Since this is a known problem that is not well documented, many administrators may suggest updating NIC drivers as a fix. Occasionally that works. Most of the time, this problem goes unresolved until one NIC is deemed bad and disabled, as in this case. I hope this additional information helps.

John


The problem has come up again - the remote site disconnects from 1 or more HQ ShoreTel devices. In the past, we would reboot any device (on either side) that was indicated as disconnected.

I will attempt a forced 100 Mb connection (on the ShoreTel server) again to see if that stops it.

Also, this has not happened for at least a week, but it does go down about once or twice every one or two weeks.