Solved

Clients intermittently losing connectivity with one server

Posted on 2008-11-02
Last Modified: 2012-05-05
Starting about 4 days ago, admin staff began reporting that various applications were shutting down and that drive letters (typically the H: drive) were no longer responding. Logging out and back in again fixed the problem in all cases, but this proved onerous once it became necessary multiple times every hour.

The drive letters in question were being 'lost' intermittently; no particular pattern was noted. While it was typically the H: drive (data volume), other drives were also affected (P:, the APPS volume, was common too). Symptoms ranged from:
no files displayed on the drive
error message "H:\ refers to a location that is unavailable" and
error message "File handle for H: is invalid"

Using Windows Explorer to 'browse' the volumes available on the server, from a client that has lost its connection to a drive/volume, shows for example (note: H: maps to \\thor\data\home\accountname):
\\thor\apps  -> displays correct folders
\\thor\data  -> SHOULD display home folder but instead displays the contents of the apps volume.

For reference, several new core switches had been installed 3 weeks earlier though the problem in question only appeared a few days ago (or was only reported a few days ago - not exactly the same thing I know).

Investigation showed that the entire campus was affected - not just the people on a single switch.

After I wrote a small script that continually pings various network devices around the school and logs the results, it became apparent that all network devices EXCEPT the primary file server were communicating OK. The primary file server (Thor - Netware 6.0/sp5) was suffering from repeated network interruptions (i.e. ping returned no reply) at varying intervals (minutes apart to hours apart) for variable amounts of time - usually less than 1 minute each. One iteration of a failed log is shown here:
---
The current time is:  7:07:02.64
Reply from 192.168.0.26: bytes=32 time=9ms TTL=255
Reply from 192.168.0.25: bytes=32 time=1ms TTL=255
Reply from 192.168.0.24: bytes=32 time=1ms TTL=255
Request timed out.
Reply from 192.168.10.252: bytes=32 time<1ms TTL=128
Reply from 192.168.0.1: bytes=32 time<1ms TTL=64
---
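For anyone wanting to reproduce this kind of monitoring, here is a minimal Python sketch of a ping sweep that logs one line per device. It is not the script used above (that one is not shown); the function names, the log format and the 5-second interval are all assumptions, and the non-Windows branch assumes a Linux-style ping where -W is a timeout in seconds.

```python
import datetime
import platform
import subprocess
import time

# Addresses that appear in the log above. The file server's own address is
# not visible there, so it would need to be added to this list.
HOSTS = ["192.168.0.26", "192.168.0.25", "192.168.0.24",
         "192.168.10.252", "192.168.0.1"]


def ping_command(host, timeout_s=1):
    """Build a single-echo ping command for the current OS."""
    if platform.system() == "Windows":
        # Windows: -n is the echo count, -w is the timeout in milliseconds.
        return ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    # Assumption: Linux-style ping, where -W is the timeout in seconds.
    return ["ping", "-c", "1", "-W", str(timeout_s), host]


def ping_once(host):
    """Return True if the ping command exited with status 0."""
    try:
        result = subprocess.run(ping_command(host),
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except OSError:
        return False
    return result.returncode == 0


def format_result(when, host, ok):
    """Format one log line: timestamp, host, and reply/TIMEOUT."""
    status = "reply" if ok else "TIMEOUT"
    return f"{when.isoformat(timespec='seconds')} {host} {status}"


def sweep(hosts, ping=ping_once, clock=datetime.datetime.now):
    """Ping every host once and return one formatted line per host."""
    return [format_result(clock(), host, ping(host)) for host in hosts]


def monitor(hosts, log_path, interval_s=5, rounds=None):
    """Append sweep results to log_path; rounds=None means run until stopped."""
    done = 0
    while rounds is None or done < rounds:
        with open(log_path, "a") as log:
            for line in sweep(hosts):
                log.write(line + "\n")
        done += 1
        time.sleep(interval_s)
```

Calling monitor(HOSTS, "pinglog.txt") would start logging. The exit status of ping is only a rough indicator (on Windows some "unreachable" replies still exit with 0), so treat the TIMEOUT lines as a starting point rather than proof.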

Further investigation has proceeded on the assumption that there is a failure in either the switch port connected to the server or the server network card itself. My assumption is that either the network card on the server is failing of its own accord (likely, as the problem only surfaced recently and has been getting worse) or that the network card is responding abnormally to some other device on the network (a virus/worm, or a faulty network device producing 'chatter').

The switch (3Com 5500G) is not reporting any errors on that port.
The server is not reporting any errors in monitor either.

Tried hard-setting the switch port to 1000Mb/Full duplex - server set to auto (no option for 1000/Full).
Server network card is a Broadcom 5700. Research shows a historical record of similar issues with this type of network card.
Updated the network drivers for the Broadcom card to the latest version.
Updated the firmware on the Broadcom NIC.
Tried setting the card and switch port to a lower speed (100Mb/FD).
Tried a different port on the switch at 1000Mb/FD.

One consultant we have contacted has recommended splitting the school network into separate VLANs as a start to diagnosing the problem. We are continuing with that option as it needs to be done at some point anyway.

The last thing I have yet to try is to replace the server network card entirely with an industry-standard Intel gigabit card. More specifically, I would add an additional card and set it to the server IP, as the existing one is on-board.

Now - after all that, my question is:

Does anyone have any suggestions as to what the problem might be or what I can do to continue diagnosing the problem?
Question by:ppofandt
9 Comments
 
LVL 23

Expert Comment

by:Mysidia
ID: 22864259
If you have a large network with just one VLAN...  there are a variety of types of traffic that might disrupt a server.

A bad network card or bad cable seems a very plausible reason.

One of them is ARP traffic. If someone accidentally sets a PC manually to the same IP address as is assigned to the server, the server may periodically lose connectivity for a moment.

Normally this would drop more than one ping.
But it is something to be aware of.
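To check for that kind of address conflict, one option is to collect the sender IP and MAC address of every ARP packet seen and flag any IP that more than one MAC claims. The sketch below assumes you already have those pairs as plain tuples (how you extract them from a capture is up to you); the function name and the sample addresses are made up for illustration.

```python
from collections import defaultdict


def find_ip_conflicts(arp_observations):
    """Return {ip: sorted list of MACs} for each IP claimed by more than one MAC.

    arp_observations is an iterable of (sender_ip, sender_mac) pairs.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in arp_observations:
        # Normalise case so the same MAC written two ways counts once.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample: two different MACs both answer for 192.168.0.10.
sample = [
    ("192.168.0.10", "00:11:22:33:44:55"),
    ("192.168.0.11", "00:11:22:33:44:66"),
    ("192.168.0.10", "00:AA:BB:CC:DD:EE"),
    ("192.168.0.10", "00:11:22:33:44:55"),
]
conflicts = find_ip_conflicts(sample)
```

An empty result does not rule out a conflict; it only means none showed up in the packets that were observed.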

I would suggest installing Wireshark on a laptop during an off-peak time, such as midnight (so you have as little legitimate traffic as possible).
Set up a port on a switch to be a "monitoring" or "SPAN" port.

Plug the laptop into the port, and start capturing packets.

Start pinging the server continuously from another machine.

Once a few packets have dropped, stop the ping.

Save the capture and filter out known OK traffic, then see if there is anything suspicious.


Another test, to perform only during off-peak hours, would be to ISOLATE the server that is misbehaving.

For example, plug a laptop into another port on the same switch, start the continuous ping, and temporarily physically separate the switch it is plugged into from the rest of the network.

Isolating and pinging rules out the possibility of any outside host causing the problem.





 

Author Comment

by:ppofandt
ID: 22864329
>If you have a large network with just one VLAN...  there are a variety of types of traffic that might disrupt a server.
How large would you define large? The consultant that wants to split the network up into multiple VLANs has suggested no more than 50 clients per VLAN if possible.
At the moment we probably have 300-400 machines on a single flat LAN. I understand that that's way into the 'too large' area.

>A bad network card or bad cable seems a very plausible reason.
On the server or on a client?

>Normally this would drop more than one ping.
It may not have been apparent from the abbreviated log I posted, but when packets are being lost, several sequential pings are lost each time.
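To put numbers on that, the length of each outage can be counted from a ping log. This is a rough Python sketch that assumes a log from a continuous ping against a single host in the Windows output format shown in the question ("Reply from ..." and "Request timed out."); the function name is my own.

```python
def timeout_runs(lines):
    """Return the length of each run of consecutive 'Request timed out.' lines."""
    runs = []
    current = 0
    for line in lines:
        text = line.strip()
        if text == "Request timed out.":
            current += 1
        elif text.startswith("Reply from"):
            # A reply ends the current outage, if there was one.
            if current:
                runs.append(current)
            current = 0
        # Any other line (timestamps, separators) is ignored.
    if current:
        runs.append(current)
    return runs


# Hypothetical log excerpt: one outage of three pings, then one of a single ping.
sample_log = [
    "Reply from 192.168.0.1: bytes=32 time<1ms TTL=64",
    "Request timed out.",
    "Request timed out.",
    "Request timed out.",
    "Reply from 192.168.0.1: bytes=32 time<1ms TTL=64",
    "Request timed out.",
    "Reply from 192.168.0.1: bytes=32 time<1ms TTL=64",
]
runs = timeout_runs(sample_log)
```

With one ping per second, a run of N timeouts corresponds to an outage of roughly N seconds, which makes it easier to compare against the "usually less than 1 minute" observation.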

 
LVL 5

Expert Comment

by:rexxus
ID: 22866692
The number of devices per VLAN depends on how "chatty" each client is, which depends on the software involved. Generally, try to limit the number of devices to a single Class C network, so about 250 devices per VLAN. 50 devices seems too few, but again that depends on the software packages used.
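As a quick check of the arithmetic, the number of usable host addresses for a given subnet size can be worked out with Python's standard ipaddress module. The helper name and the example subnets are assumptions for illustration, and the calculation only applies to prefixes shorter than /31.

```python
import ipaddress


def usable_hosts(cidr):
    """Usable host addresses in an IPv4 subnet (network and broadcast excluded).

    Assumes a prefix shorter than /31.
    """
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - 2


class_c = usable_hosts("192.168.0.0/24")   # a single Class C sized subnet
small = usable_hosts("192.168.0.0/26")     # closer to the "50 clients" suggestion
double = usable_hosts("192.168.0.0/23")    # room for a few hundred hosts in one subnet
```

A /24 gives 254 usable addresses, which is where the "about 250 devices" rule of thumb comes from; a /26 gives 62 and a /23 gives 510.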

Have any DNS changes been made recently?
 
LVL 23

Expert Comment

by:Mysidia
ID: 22866840
The size of the VLAN doesn't explain why this one server is having a problem with dropped pings.  Either other servers on its same segment are having the same problem, or the size you choose to make your VLAN is an aesthetic side issue.

300 machines on one VLAN may be suboptimal for a few reasons, but it is not a very large local VLAN on a Fast Ethernet switched network, in terms of "likely to break things", unless your hosts exchange a very heavy amount of broadcast traffic, or this VLAN also spans some smaller WAN links.

For example, a total of several hundred broadcasts per second, or more than a megabit or so of broadcast bandwidth usage per second.
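To compare a capture against those figures, the average broadcast rate can be computed from the timestamp and size of each broadcast frame. The sketch below assumes you have already reduced the capture to (timestamp in seconds, frame size in bytes) pairs for broadcast frames only; the function name and sample values are made up.

```python
def broadcast_load(frames):
    """Return (frames per second, megabits per second) averaged over the capture.

    frames is an iterable of (timestamp_seconds, frame_bytes) pairs covering
    broadcast frames only.
    """
    frames = sorted(frames)
    if len(frames) < 2:
        return 0.0, 0.0
    duration = frames[-1][0] - frames[0][0]
    if duration <= 0:
        return 0.0, 0.0
    total_bytes = sum(size for _, size in frames)
    frames_per_second = len(frames) / duration
    megabits_per_second = total_bytes * 8 / duration / 1_000_000
    return frames_per_second, megabits_per_second


# Hypothetical capture: five 100-byte broadcast frames spread over four seconds.
sample = [(0, 100), (1, 100), (2, 100), (3, 100), (4, 100)]
fps, mbps = broadcast_load(sample)
```

These are averages over the whole capture, so a short burst can hide inside a quiet-looking result; splitting the capture into smaller time windows would show peaks better.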

1000  hosts is a large VLAN.


There is a CPU cost to all hosts on the network that is incurred by moderate broadcast traffic, but they should not break unless the amount of traffic becomes ridiculous.

And broadcast traffic affects all hosts, not just servers.

In any case, running a packet capture tool like Wireshark will readily tell you what kind of broadcast traffic your network is seeing.

You don't even need to set up a SPAN session to see broadcasts, because all ports in a VLAN will receive the broadcasts.
 
LVL 7

Expert Comment

by:dkarpekin
ID: 22867057
The first step would be to see whether a PC connected to the same switch as the server is also losing its connection (you can use Remote Desktop and watch simultaneously from the troubleshooting PC).
Also disable firewalls/AV and see whether it makes a difference.
If you are using NetBIOS, make sure WINS is running on a server, and that all NICs have the WINS IP set (NIC properties > TCP/IP > Advanced > WINS).
You might want to use www.soalrwinds.com or other monitoring tools.
This assumes that the network performance/structure is OK.
See also a similar question at
http://www.experts-exchange.com/Hardware/Networking_Hardware/Q_23632567.html
 

Author Comment

by:ppofandt
ID: 22873000
I'm going to shelve the issue of subnetting our network via VLANs for now. While the idea has merit, it would seem that it is not essential and may introduce additional variables into the current equation. I'll wait till we can resolve the current problem first.

I've replaced the server network card with an Intel Pro/1000 MT server adapter and gotten that working OK. However, the problem of clients dropping off this server persists. Dammit!

The issue is spread campus-wide. HOWEVER, clients connected to, or close (hop-wise) to, the switch the server is on experience it far less frequently than those that are 4 switch hops away.

Broadcast traffic does not seem excessive though I can't quantitatively measure that at the moment.

>have any dns changes been made recently?
No major changes. I've added the names of the new switches. That's about it.

>If you using NetBios,
Not using NetBIOS.

I'm at about the end of what I can do. Right now I'm just collecting information to possibly hand to a specialist consultant. I can't think what it may be now. I'm at a loss.

Any other suggestions?

 
LVL 7

Expert Comment

by:dkarpekin
ID: 22873322
Add NetBIOS on all NICs, and enable WINS, with that server's IP, on any Windows server machine you have.
That should fix it.
 

Accepted Solution

by:ppofandt
ID: 22918695
>Add NetBios on all NIC's, and enable WINS with that IP- server
I think you missed the part of the original post where it was mentioned that this is a Netware server - no NetBIOS or WINS.

I eventually had to bite the bullet and get a specialist Netware/Network engineer in.

He spent most of the day looking at the setup and doing his own tests then said basically "Dunno what the problem is". However he did try a few things.

Newer TCPIP stack (the last update available for NW6.0, i.e. post SP5).
Latest Winsock patch (this only received limited testing on NW6, but we were getting desperate).
He also optimised a few of the configuration parameters that manage SLP and NCP.

That seems to have fixed the problem. 24 hours after the server restart there had not been a single dropped packet - let alone enough to cause client dropouts.

The change most likely to have fixed the issue was the updated TCPIP stack which, in hindsight, I should have looked for myself.
 
LVL 7

Expert Comment

by:dkarpekin
ID: 22918813
Sorry that I missed the Netware part. In Windows, these kinds of problems are related to NetBIOS; in Netware it looks like it comes down to the "Winsock patch". One thing is obvious: this "war" between different OSes will never end.