Solved

Clients intermittently losing connectivity with one server

Posted on 2008-11-02
Last Modified: 2012-05-05
Starting about 4 days ago, admin staff began reporting that various applications were shutting down and that drive letters (typically the H: drive) were no longer responding. Logging out and back in again fixed the problem in all cases, but this became onerous once it was necessary multiple times every hour.

The drive letters in question were being 'lost' intermittently, with no particular pattern. While it was typically the H: drive (DATA volume), other drives were also affected (P:, the APPS volume, was common too). Symptoms included:
no files displayed on the drive
error message "H:\ refers to a location that is unavailable"
error message "File handle for H: is invalid"

Using Windows Explorer to browse the volumes available on the server, from a client that has lost its connection to a drive/volume, shows the following, for example (note: H: maps to \\thor\data\home\accountname):
\\thor\apps  -> displays correct folders
\\thor\data  -> SHOULD display home folder but instead displays the contents of the apps volume.

For reference, several new core switches had been installed 3 weeks earlier though the problem in question only appeared a few days ago (or was only reported a few days ago - not exactly the same thing I know).

Investigation showed that the entire campus was affected - not just the people on a single switch.

After I wrote a small script that continually pings various network devices around the school and logs the results, it became apparent that all network devices EXCEPT the primary file server were communicating OK. The primary file server (Thor, NetWare 6.0 SP5) was suffering from repeated network interruptions (i.e. ping returned no reply) at intervals ranging from minutes to hours, for variable lengths of time, usually less than 1 minute each. One iteration of the log with a failure is shown here:
---
The current time is:  7:07:02.64
Reply from 192.168.0.26: bytes=32 time=9ms TTL=255
Reply from 192.168.0.25: bytes=32 time=1ms TTL=255
Reply from 192.168.0.24: bytes=32 time=1ms TTL=255
Request timed out.
Reply from 192.168.10.252: bytes=32 time<1ms TTL=128
Reply from 192.168.0.1: bytes=32 time<1ms TTL=64
---
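
The original script is not shown, so here is a minimal Python sketch of the same idea rather than the script that was actually used. The five IP addresses come from the log above; the server's address does not appear in the excerpt, so pinging it by the name `thor` is an assumption, as are the function names, the log file name, and the five-second interval. It shells out to the Windows `ping` command (`-n 1` sends one echo request, `-w` sets the timeout in milliseconds); on Linux the equivalent flags are `-c` and `-W`.

```python
import subprocess
import time
from datetime import datetime

# Addresses taken from the log excerpt. "thor" is an assumption: the
# server's IP is not shown in the post, so it is pinged by name here.
HOSTS = ["192.168.0.26", "192.168.0.25", "192.168.0.24",
         "thor", "192.168.10.252", "192.168.0.1"]


def ping_once(host, timeout_ms=1000):
    """Send one echo request with the Windows ping command; True on a reply."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), host],
        capture_output=True, text=True,
    )
    # Windows ping can exit with 0 on "Destination host unreachable", so
    # also require a TTL field, which only appears in a real echo reply.
    return result.returncode == 0 and "TTL=" in result.stdout


def log_round(hosts, ping=ping_once, now=datetime.now):
    """Ping every host once and return timestamped log lines for this round."""
    stamp = now().strftime("%H:%M:%S")
    lines = []
    for host in hosts:
        status = "reply" if ping(host) else "REQUEST TIMED OUT"
        lines.append(f"{stamp} {host} {status}")
    return lines


def monitor(hosts, logfile="pinglog.txt", interval_s=5):
    """Append one round of results to the log file every interval_s seconds."""
    while True:
        with open(logfile, "a") as fh:
            fh.write("\n".join(log_round(hosts)) + "\n")
        time.sleep(interval_s)
```

Calling `monitor(HOSTS)` runs until interrupted with Ctrl+C. Stamping every line with the time (rather than once per round in a separate line) makes it easier to search the log for timeouts later.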

Further investigation has proceeded on the assumption that the failure is in either the switch port connected to the server or the server's network card itself. My guess is that either the network card is failing of its own accord (likely, as the problem only surfaced recently and has been getting worse) or it is responding abnormally to some other device on the network (a virus/worm, or a faulty network device producing 'chatter').

The switch (3Com 5500G) is not reporting any errors on that port.
The server is not reporting any errors in monitor either.

Tried hard-setting the switch port to 1000Mb/Full duplex, with the server set to auto (it has no option for 1000/Full).
The server network card is a Broadcom 5700. Research shows a history of similar issues with this type of network card.
Updated the network drivers for the Broadcom card to the latest version.
Updated the firmware on the Broadcom NIC.
Tried setting the card and switch port to a lower speed (100Mb/FD).
Tried a different port on the switch at 1000Mb/FD.

One consultant we have contacted has recommended splitting the school network into separate VLANs as a start to diagnosing the problem. We are proceeding with that option as it needs to be done at some point anyway.

The last thing I have yet to try is replacing the server network card with an industry-standard Intel gigabit card. More specifically, since the existing card is on-board, I would add an additional card and assign it the server's IP address.

Now - after all that, my question is:

Does anyone have any suggestions as to what the problem might be or what I can do to continue diagnosing the problem?
Question by:ppofandt
9 Comments
 
LVL 23

Expert Comment

by:Mysidia
ID: 22864259
If you have a large network with just one VLAN...  there are a variety of types of traffic that might disrupt a server.

A bad network card or bad cable seems a very plausible reason.

One such type of traffic is ARP. If someone accidentally sets a PC to the same static IP address as the server, the server may periodically lose connectivity for a moment.

Normally this would drop more than one ping.
But it is something to be aware of.

I would suggest installing Wireshark on a laptop during an off-peak time,
such as midnight (so you have as little legitimate traffic as possible).
Set up a port on a switch to be a "monitoring" or "SPAN" port.

Plug the laptop into the port, and start capturing packets.

Start pinging the server continuously from another machine.

Once a few packets have dropped, stop the ping.

Save the capture and filter out known OK traffic, then see if there is anything suspicious.
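
As one example of what "suspicious" can mean here: if the ARP traffic in the capture shows the server's IP address claimed by more than one MAC address, that points to the duplicate-address case described above. A minimal Python sketch, assuming the ARP sender IP/MAC pairs have already been exported from Wireshark into a list of tuples; the function name and the sample addresses are made up for illustration and do not come from a real capture:

```python
from collections import defaultdict


def find_ip_conflicts(arp_pairs):
    """Return {ip: sorted MACs} for every IP claimed by more than one MAC.

    arp_pairs is an iterable of (sender_ip, sender_mac) tuples.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in arp_pairs:
        macs_by_ip[ip].add(mac.lower())  # normalise case before comparing
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample data, not taken from a real capture.
sample = [
    ("192.168.0.10", "00:10:18:AA:BB:01"),
    ("192.168.0.10", "00:10:18:aa:bb:01"),  # same NIC, different letter case
    ("192.168.0.10", "00:0c:29:12:34:56"),  # a second device using the same IP
    ("192.168.0.1", "00:e0:fc:00:00:01"),
]
print(find_ip_conflicts(sample))
```

An IP that shows up with two MACs is not always a fault (a deliberate NIC swap or failover would look the same), so treat a hit as something to investigate rather than as proof.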


Another test, to perform only during off-peak hours, would be to ISOLATE the server that is misbehaving.

For example, plug a laptop into another port on the same switch, start the continuous ping, and temporarily physically separate that switch from the rest of the network.

If pings still drop while the switch is isolated, no outside host can be causing the problem.





 

Author Comment

by:ppofandt
ID: 22864329
>If you have a large network with just one VLAN...  there are a variety of types of traffic that might disrupt a server.
How large would you define large? The consultant who wants to split the network into multiple VLANs has suggested no more than 50 clients per VLAN if possible.
At the moment we probably have 300-400 machines on a single flat LAN. I understand that that's well into 'too large' territory.

>A bad network card or bad cable seems a very plausible reason.
On the server or on a client?

>Normally this would drop more than one ping.
It may not have been apparent from the abbreviated log I posted, but when packets are being lost, several sequential pings are lost.
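
To make that visible from a full log, consecutive failures can be grouped into outages. A minimal Python sketch, assuming the log has been reduced to time-ordered (timestamp, replied) samples for one host; the function name and the sample data are illustrative only, not taken from the real log:

```python
from datetime import datetime, timedelta


def find_outages(samples):
    """Group consecutive failed pings into (first_fail, last_fail, count) tuples.

    samples is a time-ordered list of (datetime, replied) pairs for one host.
    """
    outages = []
    start = last = None
    count = 0
    for when, replied in samples:
        if not replied:
            if start is None:
                start = when  # first miss of a new outage
            last = when
            count += 1
        elif start is not None:
            outages.append((start, last, count))
            start, last, count = None, None, 0
    if start is not None:  # log ended while the host was still down
        outages.append((start, last, count))
    return outages


# Hypothetical samples five seconds apart: up, three misses, up.
t0 = datetime(2008, 11, 2, 7, 7, 0)
samples = [(t0 + timedelta(seconds=5 * i), ok)
           for i, ok in enumerate([True, False, False, False, True])]
for first, last, count in find_outages(samples):
    print(f"{count} lost pings from {first:%H:%M:%S} to {last:%H:%M:%S}")
```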

 
LVL 5

Expert Comment

by:rexxus
ID: 22866692
The number of devices per VLAN depends on how "chatty" each client is, which in turn depends on the software involved. Generally, try to limit each VLAN to a single class C network, so roughly 250 devices per VLAN. 50 devices seems too few, but again that depends on the software packages used.
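
For a rough sense of scale, a /24 ("class C") network has 254 usable host addresses. A minimal Python sketch using the standard `ipaddress` module; the 192.168.0.0/24 network and the 400-machine figure are assumptions based on the numbers mentioned in this thread, and the function names are made up for illustration:

```python
import ipaddress
import math


def usable_hosts(cidr):
    """Usable host addresses in an IPv4 network of /30 or larger.

    Subtracts the network and broadcast addresses.
    """
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


def vlans_needed(machines, per_vlan):
    """Minimum number of VLANs if each holds at most per_vlan machines."""
    return math.ceil(machines / per_vlan)


per_class_c = usable_hosts("192.168.0.0/24")
print(per_class_c)                     # 254
print(vlans_needed(400, per_class_c))  # 2
print(vlans_needed(400, 50))           # 8
```

So the two suggestions differ by a factor of four in the number of VLANs to configure and route between.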

Have any DNS changes been made recently?
 
LVL 23

Expert Comment

by:Mysidia
ID: 22866840
The size of the VLAN doesn't explain why this one server is having a problem with dropped pings. Either other servers on the same segment are having the same problem, or the size you choose to make your VLAN is an aesthetic side issue.

300 machines on one VLAN may be suboptimal for a few reasons, but on a switched Fast Ethernet network it is not a very large local VLAN in terms of "likely to break things", unless your hosts exchange a very heavy amount of broadcast traffic or this VLAN also spans some smaller WAN links.

"Very heavy" would mean, for example, a total of several hundred broadcasts per second, or more than a megabit or so per second of broadcast bandwidth.

1000 hosts is a large VLAN.


There is a CPU cost to all hosts on the network that is incurred by moderate broadcast traffic, but they should not break unless the amount of traffic becomes ridiculous.

And broadcast traffic affects all hosts, not just servers.

In any case, running a packet capture tool like Wireshark will readily tell you
what kind of broadcast traffic your network is seeing.

You don't even need to set up a SPAN session to see broadcasts, because
all ports in a VLAN will receive the broadcasts.
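
To put numbers on it, the broadcast frames in a capture can be reduced to a frames-per-second and bits-per-second figure and compared against the rough thresholds mentioned earlier. A minimal Python sketch, assuming the capture has been exported as (timestamp in seconds, destination MAC, frame length in bytes) tuples; the function name and the sample frames are made up for illustration:

```python
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def broadcast_rate(frames):
    """Return (frames_per_second, bits_per_second) for broadcast frames.

    frames is a list of (timestamp_s, dst_mac, length_bytes) tuples.
    Rates are averaged over the whole capture window.
    """
    if len(frames) < 2:
        return 0.0, 0.0
    times = [t for t, _, _ in frames]
    duration = max(times) - min(times)
    if duration <= 0:
        return 0.0, 0.0
    bcast = [length for _, dst, length in frames if dst.lower() == BROADCAST_MAC]
    return len(bcast) / duration, sum(bcast) * 8 / duration


# Hypothetical 10-second capture: 20 broadcast frames of 64 bytes,
# plus one unicast frame that marks the end of the window.
frames = [(i * 0.5, BROADCAST_MAC, 64) for i in range(20)]
frames.append((10.0, "00:10:18:aa:bb:01", 1500))
pps, bps = broadcast_rate(frames)
print(f"{pps:.1f} broadcasts/s, {bps:.0f} bit/s")
```

An average over the whole capture hides short bursts, so if the dropouts last under a minute it may be worth running this over one-second slices around a known dropout instead.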

 
LVL 7

Expert Comment

by:dkarpekin
ID: 22867057
The first step would be to see whether a PC connected to the same switch as the server is also losing its connection (you can use Remote Desktop and watch it simultaneously from the troubleshooting PC).
Also disable firewalls/AV and see if that makes a difference.
If you are using NetBIOS, make sure WINS is running on a server, and that all NICs have the WINS server IP set (NIC Properties > TCP/IP > Advanced > WINS).
You might want to use www.solarwinds.com or other monitoring tools.
This assumes that the network performance/structure is OK.
See also a similar question at
http://www.experts-exchange.com/Hardware/Networking_Hardware/Q_23632567.html
 

Author Comment

by:ppofandt
ID: 22873000
I'm going to shelve the issue of subnetting our network via VLANs for now. While the idea has merit, it seems it is not essential and may introduce additional variables into the current equation. I'll wait until we have resolved the current problem.

I've replaced the server network card with an Intel PRO/1000 MT Server Adapter and got that working OK. However, the problem of clients dropping off this server persists. Dammit!

The issue is campus-wide. HOWEVER, clients connected to, or close (hop-wise) to, the switch the server is on experience it far less frequently than those that are 4 switch hops away.

Broadcast traffic does not seem excessive though I can't quantitatively measure that at the moment.

>have any dns changes been made recently?
No major changes. I've added the names of the new switches. That's about it.

>If you using NetBios,
Not using netbios.

I'm at about the end of what I can do. Right now I'm just collecting information to possibly hand to a specialist consultant. I can't think what it might be now. I'm at a loss.

Any other suggestions?

 
LVL 7

Expert Comment

by:dkarpekin
ID: 22873322
Add NetBIOS on all NICs, and enable WINS on any Windows server machine you have, pointing the NICs at that server's IP.
That should fix it.
 

Accepted Solution

by:
ppofandt earned 0 total points
ID: 22918695
>Add NetBios on all NIC's, and enable WINS with that IP- server
I think you missed the part of the original post where it was mentioned that this was a Netware server - no netbios or WINS.

I eventually had to bite the bullet and get a specialist Netware/Network engineer in.

He spent most of the day looking at the setup and doing his own tests, then said, basically, "Dunno what the problem is". However, he did try a few things:

Newer TCP/IP stack (the last update available for NW6.0, i.e. post-SP5)
Latest Winsock patch (this had only received limited testing on NW6, but we were getting desperate)
He also optimised a few of the configuration parameters that manage SLP and NCP.

That seems to have fixed the problem: 24 hours after the server restart there had not been a single dropped packet, let alone enough to cause client dropouts.

The change most likely to have fixed the issue was the updated TCP/IP stack which, in hindsight, I should have looked for myself.
 
LVL 7

Expert Comment

by:dkarpekin
ID: 22918813
Sorry that I missed the NetWare part. In Windows, this kind of problem is usually related to NetBIOS; in NetWare it looks like it comes down to the "Winsock patch". One thing is obvious: this "war" between different OSes will never end.