vassot

Lan connectivity problems

Hello, we have been experiencing several LAN connectivity problems in my company for about the last two months. More specifically, most users randomly lose their connection to several servers throughout the day. Please also note that each connectivity problem affects only one server at a time. We have Windows 2003 Active Directory with DNS and WINS server, several Windows 2000 and 2003 servers (all domain members), and Windows 2000, Vista and XP clients. As a firewall, we are using Endian Firewall. We have already checked that it is not a network hardware problem (switches, cables). The event logs of the servers contain no suspicious entries. We have also checked each machine for viruses. Please help!
msguru

When a user has connectivity issues to a particular server, can that PC ping that server's IP address, and can that PC also ping the default gateway?  Are the users on the same network as the servers, or is there a layer-3 device between them?  What make/model are the networking devices, and what network cards do the PCs have?  Is there a similarity between the users or the PCs that have the problem (e.g. same model PC, or they are on the same switch/hub)?
Brian Pierce
It would be useful to see the results of an ipconfig /all on a machine that is experiencing an issue.
In addition, did you try to access the servers by IP address rather than via WINS?

If all machines are in one location, I would eliminate WINS totally, as it is not needed in a native Windows 2000 or newer environment.

Also check your DNS and AD via the Server resource kit tools.

I hope this helps !
I would reexamine the assumptions even if you think you have conclusive evidence.
- That only one server is lost at a time is very suspicious.   What if this hypothesis were changed?  Would your diagnosis of the situation be modified to any benefit?  I'll bet it would point more to hardware.
- That hardware is not the problem is very suspicious.  Bad cables are often hard to pin down.  Bad switches or switches in need of a reboot can be really hard to pin down - often only "proven" when a reboot solves the problem!

So, reboot the switches and routers for sure.  Routers and switches in need of a reboot can do very strange things - including route to some addresses and not to others.  Note that switches have a computer inside just as well as routers do.  Consider the effect an intermittent cable might have on the rules that a switch develops....

Check the firmware versions on the routers to see if upgrades are available.  If so, install them unless there is good reason to the contrary.  You will know better than I.

Examine the physical quality of the cabling.  Except for local patch cables (which are throw-away items), are there any "fixed" cables that aren't terminated with a punch-down block?  If so, they are suspect.   Plug-terminations are really only acceptable for short, throw-away, patch cords at the computers and printers and routers, etc.
 
Next, assuming that such termination situations do exist (because they very often do in small offices) what is the workmanship of the non punch-down cable ends?  Is the insulation crimped into the connector providing strain relief?  If not, replace the plugs with proper workmanship - or better yet, with punch-down terminations.  [I have reworked entire facilities with problems like this - and they *did* have intermittencies that were unexplained!]
Replace any implicated patch cables or ones that are showing signs of wear / abuse.

I understand that this is counter to the "given information" but it's just too common a root cause not to mention it.
vassot

ASKER

KCTS, this is what ipconfig /all outputs:

Windows IP Configuration

      Host Name . . . . . . . . . . . . : Tsiartsioni
      Primary DNS Suffix  . . . . . . . : anko.gr
      Node Type . . . . . . . . . . . . : Hybrid
      IP Routing Enabled. . . . . . . . : No
      WINS Proxy Enabled. . . . . . . . : No
      DNS Suffix Search List. . . . . . : anko.gr

Ethernet adapter Local Area Connection:

      Connection-specific DNS Suffix  . :
      Description . . . . . . . . . . . : NVIDIA nForce Networking Controller
      Physical Address. . . . . . . . . : 00-13-D3-12-26-A5
      DHCP Enabled. . . . . . . . . . . : No
      IP Address. . . . . . . . . . . . : 192.168.0.177
      Subnet Mask . . . . . . . . . . . : 255.255.255.0
      Default Gateway . . . . . . . . . : 192.168.0.127
      DNS Servers . . . . . . . . . . . : 192.168.0.192
      Primary WINS Server . . . . . . . : 192.168.0.192

MSGuru, I still don't have all the answers to your questions, so I'll be back to give you all the answers in a while.
vassot

ASKER

For Msguru, The pc that has this temporary disconnection, can't ping the server but can ping the default gateway.
What do you mean by a layer-3 device between the users of the network and the servers?
The networking devices are Hp Procurve 2824, Hp Procurve 1524 and Intel 510T Express, Intel 460T switches. We also have small switches (brand Level1, Compex) of 5 or 8 ports throughout the network. The PCs have various brands and models of network cards either 100 or 1000 Mbps. None of these cards work in auto-save energy mode. There aren't any similarities between the users or the pcs. Almost every pc loses connection to one or more servers regardless of their network location and the switch. The pc models are various.
vassot

ASKER

For SysExpert, WINS was installed only recently (3 days ago). The problem had already existed for almost two months before WINS was installed. Anyway, we will remove it from the machines again. As far as DNS and AD testing is concerned, we have used netdiag and dcdiag and all tests passed. Can you suggest a tool for DNS testing? We have already checked the DNS through the DNS console.
vassot

ASKER

This is an error in the Application event log of a server:

Replication of license information failed because the License Logging Service on server \\ARTEMIS could not be contacted.

Computer ATHENA
Source LicenseService
Category None
Event ID 213

Please note that ARTEMIS is the Domain Controller and that this error appears in the Application event log of another server.

These servers run the Windows 2000 operating system.
ARTEMIS, the domain controller, runs the Windows 2003 operating system.
Hi vassot,

The fact that you could ping the default gateway, but not the server is a big clue!

To answer your question about 'layer-3' devices - these could be a router or a layer-3 switch (which is effectively routing done by an enhanced, 'layer-3' switch).

Now can you cover some presumptions for me:-

P1) If all your workstations and servers are on the same network, and not going through a 'layer-3' device - then we can rule this out.  It looks like they are all on a 192.168.0.x network - just to be sure can you confirm that?

P2) Also, can you confirm the presumption that your workstations don't go through the endian firewall to get to the servers?

P3) I presume the endian firewall is the default gateway 192.168.0.127, is that right?
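
Presumption P1 can be checked directly against the values in the ipconfig output posted above. What follows is only a minimal Python sketch using the standard `ipaddress` module; the helper name `same_subnet` and the 10.0.0.5 address are illustrative assumptions, not anything taken from the actual network.

```python
import ipaddress

def same_subnet(host_ip, netmask, other_ip):
    """Return True if other_ip lies in the subnet defined by host_ip/netmask."""
    network = ipaddress.ip_network(f"{host_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(other_ip) in network

# Workstation values taken from the ipconfig /all output above.
WORKSTATION = "192.168.0.177"
MASK = "255.255.255.0"

checks = [
    ("default gateway", "192.168.0.127"),
    ("DNS/WINS server", "192.168.0.192"),
    ("hypothetical off-subnet host", "10.0.0.5"),  # assumption, for contrast only
]

for label, ip in checks:
    if same_subnet(WORKSTATION, MASK, ip):
        print(f"{label} {ip}: same subnet, reached directly")
    else:
        print(f"{label} {ip}: different subnet, reached via the gateway")
```

Any server that lands in the "different subnet" branch is reached through a layer-3 device, so P1 would not hold for it.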

Now, here are a few things to test/try - this will narrow down the area to look at dramatically:-

T1)
a) When the problem happens on the suspect computer (it loses connectivity to a server), ping the default gateway from each of the servers; if that works OK, ping the other servers from each server.
b) When the problem happens on the suspect computer, again - do the ping to the server that connectivity dropped to, and also a ping to the default gateway.  At the same time, go to a computer that has NO reported problem, and do the same pings.  This may show that the computers that had no reported problem actually had connectivity issues as well (maybe just no comms were being done from those computers at that very specific time).
Depending on the length of the outages (how long does the problem server not respond to a ping?), you may have time to ping other things as well... if you do have the time, ping the other servers, and an external host such as www.novell.com as well.  That would provide valuable information.
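
Because outages may be short, the T1 pings could be scripted so that the same set of targets is tested quickly on both the problem PC and a healthy PC. The following is a rough Python sketch, not a tested tool: the gateway and DNS/AD addresses come from the ipconfig output above, the problem server entry is a placeholder to fill in, and the "ttl=" check assumes the ping output prints a TTL field for real echo replies.

```python
import platform
import subprocess
from datetime import datetime

# The problem server's address is a placeholder; replace it before use.
TARGETS = {
    "default gateway": "192.168.0.127",
    "DNS/AD server": "192.168.0.192",
    "problem server": "<server-ip>",
    "external host": "www.novell.com",
}

def build_ping_command(host, system=None):
    """Build a one-echo ping command for the given OS (default: this one)."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, "1", host]

def reply_received(returncode, output):
    """Count a ping as successful only if a real echo reply came back.

    The exit code alone can mislead (a "Destination host unreachable"
    message may still exit with 0 on Windows), so the output is also
    checked for the TTL field that echo replies carry.
    """
    return returncode == 0 and "ttl=" in output.lower()

def sweep(targets):
    """Ping every target once; return {label: (timestamp, reachable)}."""
    results = {}
    for label, host in targets.items():
        proc = subprocess.run(build_ping_command(host), capture_output=True,
                              text=True, errors="replace")
        stamp = datetime.now().strftime("%H:%M:%S")
        results[label] = (stamp, reply_received(proc.returncode, proc.stdout))
    return results
```

Calling `sweep(TARGETS)` on both machines during an outage and comparing the two results side by side gives the comparison described in b).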

T2)
a) First, look at the path of cables and switches that the problem workstation would have taken to go through to reach the default gateway (I presume this is 192.168.0.127, as you mention in the IPCONFIG above).
b) Now look at the path of cables and switches that the problem workstation would have taken to reach the server (or servers) that didn't respond.
c) Finally - what cables and switches are unique to b), that were not taken in a) ?
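
Step c) is a simple difference between two lists of path elements. A tiny Python sketch of the idea follows; every element name is made up for illustration and should be replaced with the real cables, switch ports and switches.

```python
# Hypothetical path elements, listed in the order the traffic crosses them.
path_to_gateway = ["PC patch cable", "floor switch", "uplink cable",
                   "core switch", "gateway cable"]
path_to_server = ["PC patch cable", "floor switch", "uplink cable",
                  "core switch", "server cable"]

def unique_to_second(first, second):
    """Elements of `second` that do not appear in `first`, in path order."""
    seen = set(first)
    return [element for element in second if element not in seen]

# Elements only on the way to the server are the first suspects when the
# gateway answers pings but the server does not.
print(unique_to_second(path_to_gateway, path_to_server))  # ['server cable']
```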

Please try these and let us know!

Best of luck!
Hi vassot,

I think you can eliminate the error in replicating licenses - it would not be causing network connectivity issues; however, it may be a *result* of network connectivity issues!

I think you should troubleshoot that error after resolving the network connectivity issues.

Cheers!
vassot

ASKER

Dear msguru, sorry for the delay but things are getting worse each day.

First of all I would like to tell you that if I use the command ping -t from a workstation, I see that the connection to the gateway is lost instantly quite often. Also, when we use pathping on the gateway, we see a great deal of loss of packets.

P1) We are not using a 'layer-3' device, just the devices we have mentioned earlier plus a Zyxel P-660H-D1 router. All workstations are on the 192.168.0.x network except two servers that are on a 10.0.0.x network(DMZ).
P2) When we use tracert from a workstation to get to a server, we see that the gateway is not in the routing path (there is a direct connection to the server).
P3) Yes, the Endian firewall is the default gateway.
T1)
a) When a problem happens, the server to which the pc can't connect, can connect successfully to the gateway as well as the other servers.
b) If we exclude the computer with the problem, the other computers that we check randomly are pinging the server. Also, the computer with the problem can ping the gateway.
T2)
a, b, c) The physical path to the server and to the firewall is exactly the same: all of the servers and the gateway are directly connected to the same switch, to which all the other switches (the ones the workstations are connected to) are connected, too. Please note that the cable of the firewall was replaced to rule out such a problem.
vassot

ASKER

We also suspect that we may have Windows licensing or other security issues (Kerberos, Ldap etc). Please give us any ideas.
You say:
"First of all I would like to tell you that if I use the command ping -t from a workstation, I see that the connection to the gateway is lost instantly quite often. Also, when we use pathping on the gateway, we see a great deal of loss of packets."

So, I'm back to my initial observations pointing to hardware - only now more strongly stated...!  I refer to "lost instantly quite often".

I would investigate cables / cable terminations first.
An intermittency in a switch or router could cause this as well.

You might try using Ethereal (a free download) on a laptop to observe the traffic here and there if you've not already done that.  Perhaps insert a hub (not a switch) at a likely spot to observe the traffic reliably.

Unless you're able to pinpoint a problem rather exactly, it is often impossible to reject the notion of hardware failures/intermittencies.  They can act weird and you can use up a lot of time assuming that they don't exist.

It looks like you're on the right track.  I don't fully understand the topology of your network but you should definitely have it drawn out so that you know where all the physical paths really are.  Then when there's a path failure you can see it on the diagram.  Then when there's another path failure you can see that one as well.  Look for a common physical path element - be it wire or a box.
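
Once the diagram exists, the search for a common element can be done mechanically. This is just a Python sketch of the bookkeeping, with entirely hypothetical device and cable names; real entries would come from the drawn topology and the recorded failures.

```python
def common_elements(failed_paths):
    """Return the set of elements that every failed path passes through."""
    if not failed_paths:
        return set()
    return set.intersection(*(set(path) for path in failed_paths))

# Hypothetical failure records: each list is one failed client-to-server path.
failed_paths = [
    ["pc-1", "switch-A", "uplink-A", "core-switch", "server-1"],
    ["pc-7", "switch-B", "uplink-B", "core-switch", "server-1"],
    ["pc-9", "switch-B", "uplink-B", "core-switch", "server-2"],
]

print(sorted(common_elements(failed_paths)))  # ['core-switch']
```

An element that shows up in every failure is not proven guilty, but it is the cheapest place to start swapping ports and cables.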

This doesn't feel like software to me!
vassot

ASKER

Dear fmarshall, thank you for your observations. The cable of the firewall to the switch was replaced. All other servers that do not have any loss of packets are connected to the same switch. Do you suggest a problem in the network card of the firewall because there is nothing else left to check. And could this loss of packets cause trouble to the connection of the workstations with the other servers?
OK- well to respond in any reasonable way I'm going to ask you for the *physical* network topology.

Each client is wired to .... what?
Each hub, switch, ... is wired to what? (in addition to clients)
Where *is* the "gateway": in the Endian or separate?
Each server is wired to .... what?
The AD/WINS/DNS server is wired to .... what?  Is this .192 ??

Then, if we go back through your messages, which paths on this diagram have failed and/or which computers have failures/problems?  Which ones don't?

I can't envision how things are contributing otherwise...  
For example, when you say that you replaced the cable of the firewall to "the switch", I don't know what that means to the problems described without knowing where the switch is in the topology.  I can *guess* but that's dangerous for anyone to do..
Hi Vassot,

I think it's best to resolve your network issues first, then move onto the other issues (Windows licensing or other security issues - Kerberos, Ldap etc) IF THEY STILL EXIST!  Dropped packets will cause these other issues, but these issues should not cause dropped packets.

Based on your feedback, I think it's time to look at your managed switch ports for errors, and also look at ethernet auto-negotiation issues...

First, you have managed switches (at least the HP procurve 2824 is - maybe you can fill me in on the others).  You should be able to log into the management page of the switch (or, less easy - use a console connection) and look at the error statistics on the ports.  The first thing is to 'zero' all the stats, then look at which ports the errors are happening on.

Now focus on these 'high error' ports (note that there will ALWAYS be some errors that happen on an occasional basis - but there shouldn't be errors clocking up every second!).
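
If the counters cannot be zeroed, taking two snapshots some minutes apart and comparing them gives the same information. A small Python sketch of that comparison is below; the port numbers, counter values and the one-error-per-minute threshold are invented for illustration, and how the counters are read from the switch is left open.

```python
def errors_per_minute(before, after, interval_s):
    """Per-port error rate between two counter snapshots taken interval_s apart."""
    return {port: (after[port] - before.get(port, 0)) / interval_s * 60
            for port in after}

def high_error_ports(before, after, interval_s, threshold_per_min=1.0):
    """Ports whose error rate exceeds the (arbitrary) threshold."""
    rates = errors_per_minute(before, after, interval_s)
    return {port: rate for port, rate in rates.items() if rate > threshold_per_min}

# Hypothetical error counters per port, read 10 minutes (600 s) apart.
snapshot_1 = {1: 0, 2: 12, 24: 100}
snapshot_2 = {1: 0, 2: 15, 24: 5500}

print(high_error_ports(snapshot_1, snapshot_2, interval_s=600))  # {24: 540.0}
```

Port 2 in this made-up data gains only 3 errors in 10 minutes and stays under the threshold, which matches the point that occasional errors are normal.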

Look at the speed/duplex on that port - is it fixed or is it auto-negotiate?  Auto-negotiate can be a big trouble-maker, and mis-matches of either negotiated or manually set duplex are so common.  Both sides should either be on auto, or they should be on the exact same fixed speed AND duplex (half or full).
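
The rule in the paragraph above can be written down as a small check. This Python sketch only encodes that rule; the `PortSetting` class and its field names are made up for the example and do not correspond to any switch or driver API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PortSetting:
    auto: bool                        # True if the port auto-negotiates
    speed_mbps: Optional[int] = None  # only meaningful when auto is False
    duplex: Optional[str] = None      # "full" or "half" when auto is False

def settings_consistent(a, b):
    """Both ends on auto, or both fixed to the same speed AND duplex."""
    if a.auto and b.auto:
        return True
    if a.auto != b.auto:
        # One fixed end and one auto end: the auto end cannot learn the
        # duplex from a fixed partner and typically falls back to half.
        return False
    return a.speed_mbps == b.speed_mbps and a.duplex == b.duplex

switch_port = PortSetting(auto=False, speed_mbps=100, duplex="full")
server_nic = PortSetting(auto=True)

print(settings_consistent(switch_port, server_nic))    # False
print(settings_consistent(switch_port, switch_port))   # True
```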

When speed/duplex is sorted out on a particular 'high error' port, check to see if it's still clocking up a lot of errors - if so, then check the cable and the 'health' of the network device on the other end (whether it be another switch or an interface on a router) - consider changing that network device (even if temporarily, just as a test).

You do have the option to 'go all guns blazing' and just fix EVERYTHING to the same speed and full duplex (where you can).

At a minimum, I'd fix the speed/duplex on each server and on the switch port it goes into - making sure you're using the most appropriate switch as the 'core' switch.  At a glance, I would choose that HP Procurve 2824.  By the way, this switch supports 'layer-3 routing' according to the HP spec sheet.

Best of luck!
vassot

ASKER

fmarshall, msguru thank you very much. I'll post you the test results tomorrow.
vassot

ASKER

Fmarshall, each client is wired to a switch and each of these switches is connected to the main switch (which is the HP Procurve 2824). The gateway (Endian Firewall) is a machine connected directly to the main switch. Also, each server is wired directly to the main switch. The AD/WINS/DNS server is also wired directly to the main switch. Yes, the AD/WINS/DNS server is a 192.168.0.x server. The only path that drops packets all the time is between any workstation or server and the firewall (when we use pathping on it, we always see packet loss). By contrast, pathping never shows packet loss on any other path, except when a workstation/server totally loses connection with a server (in which case ping fails anyway).
Vassot,

If you can, get onto the console of the 2824 switch and ping the Endian firewall - OR - get onto the Endian firewall and ping the 2824.  See if there are dropped packets there - if so, look at fixing the speed and duplex on both ends & then re-test.  If it is still a problem, talk to Endian support - it could be an interface problem.

However, the Endian firewall shouldn't affect the traffic between the workstation and the server - if the server is on the 2824.

You will gain a lot of insight by having the web or console access to the 2824 and looking at the port statistics - and also on the other switches (if they are managed).

Also, I'll reiterate the recommendation above, to fix the speed/duplex on each server and on the 2824 switch port it goes into.  It also would be good to make sure the network card drivers on the servers are up to date with the latest stable release from the network card vendor's web site.

A general question/suggestion as well;  are there some workstations that are more troublesome than others?  You could go onto those workstations and update the network card drivers, and then connect them directly into your 'core' switch - the 2824.  Then look at the port statistics on the 2824 ports that the troublesome workstations are plugged into.

You have a powerful 2824 switch with its own ability to show you where errors are happening!

Let me know how my previous suggestion, and this one are going!

Best of luck!
OK - so the common path in the failure is between the firewall and the main switch.  That suggests hardware components:
- port on the firewall
- the cable between the firewall and the main switch (which you replaced)
- port on the main switch.

I would change the physical port on the main switch that goes to the firewall.  See if that changes things for the better.  I've definitely seen individual ports go out or become flaky.  Then, do the same on the firewall if you can.

I should check this:  in your original post you said that the connections to the servers were failing.  In your last post you say that packet losses are between the main switch and the firewall.  So, I addressed the latter.  But, does the former still apply?
Hang on:-  there are still these problems...

"The pc that has this temporary disconnection, can't ping the server but can ping the default gateway."
"Almost every pc loses connection to one or more servers regardless of their network location and the switch"

So, you can't just concentrate on the Endian to 2824 switch connection.

If anything, the common component is the 2824 switch.  Consider swapping the whole thing out!

BUT, please do go through everyone's recommendations above!!!
SOLUTION
hypercube
vassot

ASKER

The firewall shows packet loss even when it is connected through another switch to a single server. Also, please note that the firewall machine was replaced, the firewall was re-installed and all network cards were replaced. The main switch was replaced by another one as a test but nothing changed (we had dropped connections again).
Fix the speed (to the highest supported speed) and duplex (to full-duplex) on both ends (switch port & the firewall network card), then see if there is still packet loss.
vassot

ASKER

We have fixed the speed to 100 Mbps full duplex on the switch port but we can't set this manually on the firewall, which auto-detects the settings of the port. However, when we set full-duplex, the firewall sees it as half-duplex and a large number of errors occur on the switch port.
Vassot,
As described earlier, "Both sides should either be on auto, or they should be on the exact same fixed speed AND duplex (half or full)".  If Endian runs on Linux, then you need to know the Linux commands (or a way to do it through the GUI) to set the speed and duplex of the network card.  I can't help you on the Linux side.  But surely it must be configurable.
You could break the above quoted rule for a short test, and set the port the Endian is plugged into - to 100 megabits and half-duplex, then check the error count.  If the error count goes up, try re-booting the Endian firewall & check the error stats again.
vassot

ASKER

Msguru,  thanks for your advice but we have finally decided to replace the firewall with another one, so we won't do any more tests. But we wonder if this problem could be responsible for the whole problem of our network. What do you think?
Hi Vassot,

Something of interest...

http://kb.endian.com/entry/29/

Q: "Why do i have packet loss with some devices if I ping Endian Firewall?"
A: "Endian Firewall has a DoS attack protection which limits ICMP packets to 1 packet per second if more than 5 packets come in too fast."

So, the packet loss with *multiple* PINGs to the Endian would be normal!  Just ONE device pinging the Endian should be OK (but you have to be sure nothing else is PINGing it).
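
To see why several simultaneous pingers would show loss while a single one would not, the KB statement can be modelled as a token bucket (1 packet per second, burst of 5). That interpretation is an assumption on my part; the KB entry does not say how Endian implements the limit, so treat this Python sketch as an illustration of the effect only.

```python
def simulate_icmp_limit(arrival_times, rate_per_s=1.0, burst=5):
    """Token-bucket model of an ICMP rate limit.

    Returns (accepted, dropped) for echo requests arriving at the given
    times (in seconds). The bucket starts full and refills at rate_per_s.
    """
    tokens = float(burst)
    last = None
    accepted = dropped = 0
    for t in sorted(arrival_times):
        if last is not None:
            tokens = min(float(burst), tokens + (t - last) * rate_per_s)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
            accepted += 1
        else:
            dropped += 1
    return accepted, dropped

# One device pinging once per second for 10 seconds.
one_pinger = [float(second) for second in range(10)]
# Three devices doing the same thing at the same moments.
three_pingers = one_pinger * 3

print(simulate_icmp_limit(one_pinger))     # (10, 0)
print(simulate_icmp_limit(three_pingers))  # (14, 16)
```

Under this model the single pinger never loses a packet, while three pingers together exhaust the burst allowance after two seconds and then lose two out of every three packets.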

If there were no errors clocking up on the switch port that the Endian is connected to BEFORE you changed the speed/duplex - then, leave the settings as they were.
ASKER CERTIFIED SOLUTION
vassot

ASKER

It was the firewall after all!