Clients intermittently losing connectivity with one server
Posted on 2008-11-02
Starting about 4 days ago, admin staff began reporting that various applications were shutting down and that drive letters, typically the H: drive, were no longer responding. Logging out and back in again fixed the problem in all cases, but this proved onerous once it became necessary multiple times every hour.
The drive letters in question were being 'lost' intermittently, with no particular pattern noted. While it was typically the H: drive (data volume), other drives were also affected (P:, the APPS volume, was common too). Symptoms ranged from:
no files displayed on the drive
error message "H:\ refers to a location that is unavailable" and
error message "File handle for H: is invalid"
Using Windows Explorer to browse the volumes available on the server, from a client that has lost its connection to a drive/volume, shows for example (note: H: maps to \\thor\data\home\accountname):
\\thor\apps -> displays correct folders
\\thor\data -> SHOULD display home folder but instead displays the contents of the apps volume.
For reference, several new core switches had been installed 3 weeks earlier though the problem in question only appeared a few days ago (or was only reported a few days ago - not exactly the same thing I know).
Investigation showed that the entire campus was affected - not just the people on a single switch.
After I wrote a small script that continually pings various network devices around the school and logs the results, it became apparent that all network devices EXCEPT the primary file server were communicating OK. The primary file server (Thor - NetWare 6.0/SP5) was suffering from repeated network interruptions (i.e. ping returned no reply) at varying intervals (minutes apart to hours apart) and for variable amounts of time, usually less than 1 minute each. One iteration of the log containing a failure is shown here:
The current time is: 7:07:02.64
Reply from 192.168.0.26: bytes=32 time=9ms TTL=255
Reply from 192.168.0.25: bytes=32 time=1ms TTL=255
Reply from 192.168.0.24: bytes=32 time=1ms TTL=255
Request timed out.
Reply from 192.168.10.252: bytes=32 time<1ms TTL=128
Reply from 192.168.0.1: bytes=32 time<1ms TTL=64
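For anyone wanting to reproduce this kind of monitoring, here is a rough sketch in Python of a ping logger that also groups consecutive missed pings into outage windows. It is not the script that produced the log above. The host list is an assumption: the addresses are copied from the log, and "thor" assumes the server name resolves on your network (substitute its IP otherwise). The function names, the 5-second interval and the 1-second timeout are all hypothetical choices.

```python
import platform
import subprocess
import time
from datetime import datetime

# Assumed host list: addresses copied from the log above, plus the server
# by name. "thor" must resolve on your network; otherwise use its IP.
HOSTS = ["192.168.0.26", "192.168.0.25", "192.168.0.24",
         "thor", "192.168.10.252", "192.168.0.1"]


def ping_once(host, timeout_s=1):
    """Send one echo request; return True if the ping command exits with 0.

    Caveat: on Windows, ping can exit with 0 for a "Destination host
    unreachable" reply, so treat this check as approximate.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Linux-style flags: -c count, -W timeout in seconds
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def summarize_outages(samples):
    """Group consecutive failed samples into (start, end, count) tuples.

    samples: iterable of (timestamp, ok) pairs for one host, in time order.
    """
    outages = []
    start = last = None
    count = 0
    for ts, ok in samples:
        if not ok:
            if start is None:      # first miss of a new outage
                start = ts
                count = 0
            last = ts
            count += 1
        elif start is not None:    # a reply ends the current outage
            outages.append((start, last, count))
            start = None
    if start is not None:          # log ended while still failing
        outages.append((start, last, count))
    return outages


def monitor(hosts, interval_s=5, rounds=3):
    """Ping every host once per round and print one log line per result."""
    history = {h: [] for h in hosts}
    for _ in range(rounds):
        for host in hosts:
            now = datetime.now()
            ok = ping_once(host)
            history[host].append((now, ok))
            print(f"{now:%H:%M:%S} {host} {'reply' if ok else 'TIMEOUT'}")
        time.sleep(interval_s)
    return history


if __name__ == "__main__":
    for host, samples in monitor(HOSTS).items():
        for start, end, count in summarize_outages(samples):
            print(f"{host}: {count} missed ping(s) "
                  f"from {start:%H:%M:%S} to {end:%H:%M:%S}")
```

Running something like this from two clients on different switches at the same time would show whether the outages line up, which helps separate a server-side fault from a path-specific one.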
Further investigation has proceeded on the assumption that the failure lies either in the switch port connected to the server or in the server network card itself. My working theory is that either the network card on the server is failing of its own accord (likely, as the problem only surfaced recently and has been getting worse) or the network card is responding abnormally to some other device on the network (a virus/worm, or a faulty network device producing 'chatter').
The switch (3Com 5500G) is not reporting any errors on that port.
The server is not reporting any errors in monitor either.
Tried hard-setting the switch port to 1000Mb/Full duplex, with the server left on auto (it has no option for 1000/Full).
Server network card is a Broadcom 5700. Research shows a historical record of similar issues with this type of network card.
Updated network drivers for the Broadcom card to latest version.
Updated the firmware on the Broadcom NIC.
Tried setting card and switch port to a lower speed (100Mb/FD)
Tried different port on switch at 1000Mb/FD
One consultant we have contacted has recommended splitting the school network into separate VLANs as a start to diagnosing the problem. We are continuing with that option as it needs to be done at some point anyway.
The last thing I have yet to try is replacing the server network card with an industry-standard Intel gigabit card. More specifically, since the existing card is on-board, I would add an additional card and assign the server IP to it.
Now - after all that, my question is:
Does anyone have any suggestions as to what the problem might be or what I can do to continue diagnosing the problem?