Network drop-outs on XP workstation to 2003 Server

thesultanofswine
Last Modified: 2010-03-18
Hello all,

I have a Windows 2003 server with a shared directory that my client application needs.  I have XP client machines on the same network subnet as the server and 10 XP machines on another subnet\VLAN.  All machines are connected through the same Cisco switch.  Every 20 seconds each client checks that the network is still up by making a GetFileAttributes API call against the server’s shared directory, where the clients are also placing data.  If a client’s API call does not return within 10 seconds, the client application assumes the network is down and moves into another state.
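The check described above can be sketched roughly as follows. This is not the poster's application code: it is a minimal Python illustration that uses os.stat as a stand-in for GetFileAttributes, and the names probe_share and monitor are hypothetical. It records whether each check returned, errored, or timed out, and how long it took, which is the kind of detail that helps separate a slow reply from a missing one.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def probe_share(path, timeout_s=10.0, probe=os.stat):
    """Run probe(path) in a worker thread and classify the outcome.

    Returns (status, elapsed_seconds), where status is "ok", "error"
    or "timeout". os.stat is only a stand-in for GetFileAttributes.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    started = time.monotonic()
    future = executor.submit(probe, path)
    try:
        future.result(timeout=timeout_s)
        status = "ok"
    except FutureTimeout:
        status = "timeout"   # the call did not return in time
    except OSError:
        status = "error"     # the call returned, but with a failure
    finally:
        # Do not wait for a hung call; let the worker finish on its own.
        executor.shutdown(wait=False)
    return status, time.monotonic() - started


def monitor(path, interval_s=20.0, timeout_s=10.0, checks=3):
    """Check the share every interval_s seconds and print each result."""
    for _ in range(checks):
        status, elapsed = probe_share(path, timeout_s)
        print(f"{time.strftime('%H:%M:%S')} {status} after {elapsed:.3f}s")
        time.sleep(max(0.0, interval_s - elapsed))
```

Calling monitor with the UNC path of the share would log one line per check. Lining the "timeout" entries up against the ping tests would show whether both fail at the same moments.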

Here is the problem…on occasion my client’s API call fails for unknown reasons.  I’ll explain some of my trouble-shooting:

-      All NIC\port settings have been synced (100\full)
-      The server does not have TCP stack issues; it returned 25 consecutive 65k pings in <1 ms
-      The server’s integrated NIC has been replaced with a PCI NIC
-      Server load is low, so is network load
-      Gateways, DNS\WINS servers, and all other network settings are fine
-      No trace problems; traffic goes from source->switch->destination.  I have run pathping, tracert, and iperf with no issues
-      No errors in the switch logs
-      Here is the kicker!  In repeated tests of 100 pings each at 50k and 3k packet sizes, about 1 out of every 5 sets of pings has a failure, and about 50% of those failures are on the first ping.  The RTT varies quite a bit too: 10 in a row are normal, then a return time goes up to 15 ms, then normal times, then another 15 ms one.  These long times are about 1 out of 10 pings (within the set of 100).  I’ve also never had a normal 32-byte ping fail, only ones with an increased packet size.
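One reason large pings can fail while 32-byte pings never do is IP fragmentation: a ping larger than the MTU is split into several frames, and losing any one of them loses the whole ping. The sketch below works through that arithmetic. It assumes a standard 1500-byte Ethernet MTU, a 20-byte IP header, an 8-byte ICMP header, and that "50k" and "3k" mean 50,000 and 3,000 bytes of ping data. The frame loss rate is a made-up illustrative figure, not a measurement from this network.

```python
import math

ETHERNET_MTU = 1500   # assumed default Ethernet MTU, in bytes
IP_HEADER = 20        # IPv4 header without options
ICMP_HEADER = 8       # ICMP echo header


def fragments_for_ping(payload_bytes, mtu=ETHERNET_MTU):
    """Number of IP fragments needed for one echo request of this size."""
    # Fragment data must be a multiple of 8 bytes (except the last one).
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((payload_bytes + ICMP_HEADER) / per_fragment)


def ping_failure_probability(payload_bytes, frame_loss, mtu=ETHERNET_MTU):
    """Chance that a ping fails if each frame is lost independently.

    The request and the reply are both fragmented, so twice as many
    frames must all arrive for the ping to succeed.
    """
    frames = 2 * fragments_for_ping(payload_bytes, mtu)
    return 1 - (1 - frame_loss) ** frames


if __name__ == "__main__":
    illustrative_loss = 0.001  # hypothetical per-frame loss rate
    for size in (32, 3000, 50000):
        n = fragments_for_ping(size)
        p = ping_failure_probability(size, illustrative_loss)
        print(f"{size:>6} bytes: {n:>2} fragments each way, failure chance {p:.2%}")
```

Under these assumptions a 32-byte ping fits in one frame each way, a 3,000-byte ping needs 3 fragments and a 50,000-byte ping needs 34. Even a very small per-frame loss rate that never shows up on small pings can therefore become visible on large ones.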

I’m not sure if this is the switch trying to read\route things around or what.  Unfortunately I don’t have the luxury of putting a dumb switch in and seeing what happens.  If anyone has any experience\pointers I’d love to know how it worked out.

Thanks

I assume you are pinging by IP address and not by name, so I'll rule out name resolution.  Have you tried running the same test with the workstation and server in the same subnet/VLAN?  That would help narrow it down to a switch problem.  It's possible that your switch is a "store-and-forward" switch and is running out of buffer space, but that's extremely unlikely since it should have ample buffer space for what you are doing, and I'm sure it would log buffer overflows in the switch log.  I would run Network Monitor on the server to determine whether the problem is evident there: is the ping failing before or after it reaches the server?

Are there any IPS or IDS devices or software on your network that might be detecting what you are doing as some sort of DoS attack? A new feature of some switches (and possibly some OSes) does just that. Take a look at this HP ProCurve switch:

http://www.hp.com/rnd/products/switches/ProCurve_Switch_3500yl-5400zl_Series/features.htm

"ICMP throttling: defeats ICMP denial-of-service attacks by enabling any switch port to automatically throttle ICMP traffic  NEW!"

I know you don't have a dumb switch handy, but many times you can set one port to 'Monitor' and have it monitor all traffic in and out of another port.  This might help.

Also, turn off any IPSEC that is running on the server.  That can slow TCP/IP communication down.

Also disable NetBEUI or any other protocols that are not necessary.

You may also want to try updating Windows, device drivers, and switch firmware just to be safe.

Fatal_Exception, Systems Engineer
Top Expert 2005
Commented:
Morning Adam..  

Although I should not think this a problem with your Cisco Switch, you might try pinging with a different packet size, and see if the responses are any different..

ping IP_Address -f -l Packet_Size

ie:  ping 192.168.1.x -f -l 1472 (1472 bytes of data plus 28 bytes of IP and ICMP headers equals 1500, the default Ethernet MTU)

If you receive a message regarding fragmentation, then try lowering the packet size and ping again, until you discover the optimum MTU...  of course, if you do find problems, then you might need a new switch..  

Then again, it is early here, and I might be completely off base!  :)
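The trial-and-error search for the largest unfragmented packet size can be automated with a binary search. The following is a rough Python sketch rather than a tested tool. It relies on the Windows ping options -n (count), -f (Don't Fragment), -l (data size) and -w (timeout in milliseconds). The function names are hypothetical, and treating "TTL=" in the output as the sign of a successful reply is an assumption that only holds for English-language ping output.

```python
import subprocess


def windows_ping_df(host, payload_bytes, timeout_ms=2000):
    """Send one ping with Don't Fragment set, using Windows ping flags."""
    result = subprocess.run(
        ["ping", "-n", "1", "-f", "-l", str(payload_bytes),
         "-w", str(timeout_ms), host],
        capture_output=True,
        text=True,
    )
    # Assumption: a real echo reply line contains "TTL=".
    return "TTL=" in result.stdout


def largest_unfragmented_payload(probe, low=0, high=1472):
    """Binary-search the largest size in [low, high] where probe(size) is True.

    Assumes that if a given size gets through, every smaller size does too.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid          # mid fits, so look for something larger
        else:
            high = mid - 1     # mid is too big, so look below it
    return low
```

For example, largest_unfragmented_payload(lambda size: windows_ping_df("192.168.1.x", size)), with a real address in place of the placeholder, would report the largest data size that passes with Don't Fragment set. Adding 28 bytes for the IP and ICMP headers gives the path MTU.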
Keith Alabaster, Enterprise Architect
CERTIFIED EXPERT
Top Expert 2008

Commented:
You mention that a number of your machines are on a different subnet. How are these machines connecting? Are you using VLANs or is there a router in the mix here?

Are you getting the same issue from users on both VLANs or just one of them?
Keith Alabaster, Enterprise Architect

Commented:
Sorry, I see you are using VLANs. How are you converging these? Still, how are you connecting these together? Are you getting the same issue from users on both VLANs or just one of them?

Author

Commented:
Thanks for the suggestions...I'll try to answer some of your questions with some detail:

- The ping tests were performed with both name and IP; the error rate was similar.
- I also pinged between different workstations (which eliminated the server), and the same error rate occurred.
- About the VLANs, the same error rate is occurring between the machines which are in the same subnet and the 10 machines on the VLAN with a different IP address scheme.  There is no router involved; all the clients are hooked up to the Cisco switch and I believe the switch is doing the routing.  I have to admit I do not know the specifics about the VLAN or how it is converged.  

I'll go try out your above suggestions and get back to everyone with the results.
Thanks again.
Keith Alabaster, Enterprise Architect

Commented:
Can you tell me which Cisco switch it is? If it's a layer 3 switch then fine. If it's only a layer 2 switch then it cannot do the converging and there must be something else in the mix doing the routing. Layer 2 devices cannot route :)

Author

Commented:
Keith,

The switch is a Cisco Catalyst 6509.
Keith Alabaster, Enterprise Architect

Commented:
Wooo. we have four of those; layer 3 it is then lol.

Superb bits of kit. Sup 1A's or using the new 720's?

Sorry, back to the question. How are the subnets/vlans connecting? Boxes directly on switchports or devices at the other end of trunks?
If trunks, what are the access layer boxes at the other ends? If Ciscos, is spanning tree enabled? Could be switching out and taking a few seconds or more to re-converge.

Author

Commented:
Keith,

This is where my knowledge of the network really drops out.  This network is not at my site and I have no access to the information\setup, besides the basics.  This is also getting over my head in the networking department; I'll try to bring up the questions and see what I get back, but unfortunately some of the people I'm dealing with probably also don't know the specifics to this degree...  I'll try to get back with some answers soon.

I do have one question though: in my experience with trouble-shooting similar issues on my application, I've noticed the big expensive Cisco switches cause some issues.  In situations where we can, we've swapped out the Cisco switches with old dumb Bayview switches and the problem has decreased.  Is there some issue with all the logic\work\routing the smart Cisco switches do which is causing the delay?  Also one more question: is there a way to take that functionality off certain ports on the Cisco switch so frames just pass through?

Thanks
Keith Alabaster, Enterprise Architect
Commented:
OK, no sweat.

The 6509 will likely have a blade with x number of 10/100/1000Mb ports and/or a blade with gigabit fibre ports on it.

These ports can be set as trunk ports (no IP address) that connect to switch devices so as to extend the fabric, and you state which VLANs will be allowed over the trunk (uses 802.1q or Cisco's proprietary ISL protocol). Alternatively they can be set as ordinary ports whereby you may have a single server or device plugged directly into the port. We use Cisco 2950s and the older 35xx series access layer switches all on trunk ports but we have no issues (that I am aware of) with timeouts/drop-outs.

The spanning tree or per-VLAN spanning tree (STP or PVST) is simply the process to ensure that only the best route for the traffic to take is left in an operational state. Any second/third routes that your network discovers to get to a device are placed into a hold-down (blocking) condition. If something fails, the topology changes, etc., the algorithm kicks in and the new best route is activated and any others placed into hold-down.

When this change is made (or more pertinently, when the network 'thinks' this change has been made), it can take a small time for the new routes to propagate round, causing a delay.
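To put rough numbers on that delay, the sketch below uses the default timers of classic 802.1D spanning tree (hello 2 s, max age 20 s, forward delay 15 s) and compares the resulting outage with the 10-second timeout of the client application. Whether this particular 6509 runs classic STP with default timers is an assumption, since the thread does not say. The function names are illustrative only.

```python
# Default classic 802.1D spanning tree timers, in seconds.
HELLO_S = 2
MAX_AGE_S = 20
FORWARD_DELAY_S = 15


def stp_outage_seconds(direct_failure,
                       max_age=MAX_AGE_S,
                       forward_delay=FORWARD_DELAY_S):
    """Approximate time before a port forwards again after a topology change.

    A port spends forward_delay in listening and again in learning.
    If the failure is not on a directly attached link, the switch first
    waits max_age for the stale root information to expire.
    """
    transition = 2 * forward_delay
    return transition if direct_failure else max_age + transition


def heartbeat_would_fail(outage_s, timeout_s=10):
    """True if an outage of this length outlasts the client's timeout."""
    return outage_s > timeout_s


if __name__ == "__main__":
    for direct in (True, False):
        outage = stp_outage_seconds(direct)
        kind = "direct" if direct else "indirect"
        print(f"{kind} failure: about {outage}s, "
              f"client check fails: {heartbeat_would_fail(outage)}")
```

With these defaults a reconvergence lasts about 30 to 50 seconds, well beyond the application's 10-second limit, so a single spanning tree event would be enough to trip the client. On access ports, Cisco's PortFast feature is the usual way to skip the listening and learning delay, though whether it applies here depends on details the thread does not give.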

Author

Commented:
All, thank you for the responses.  I appreciate your efforts helping me through my issue.  
Keith Alabaster, Enterprise Architect

Commented:
welcome :)
Fatal_Exception, Systems Engineer

Commented:
Keith..  great explanation of the layer3 switching using vlans!  

and of course, a thanks to sos!

FE
Keith Alabaster, Enterprise Architect

Commented:
:)