Network drop-outs on XP workstation to 2003 Server

Hello all,

I have a Windows 2003 server with a shared directory which my client application needs.  I have XP client machines on the same network subnet and 10 XP machines on another subnet\VLAN.  All machines are connected through the same Cisco switch.  Every 20 seconds each client checks that the network is still up by making a GetFileAttributes API call against the server’s shared directory, which is also where the clients place their data.  If a client’s API call does not return within 10 seconds, the client application assumes the network is down and moves into another state.
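
For readers who want to reproduce the check outside the application, here is a minimal sketch of the heartbeat described above. It is written in Python rather than the client's actual Win32 code: `os.stat` is only a stand-in for `GetFileAttributes`, and the function names are hypothetical. Only the 20-second interval and 10-second timeout come from the post.

```python
import os
import threading

CHECK_INTERVAL_S = 20  # from the post: clients check every 20 seconds
TIMEOUT_S = 10         # from the post: no answer within 10 seconds means "network down"


def probe_share(path):
    """Stand-in for GetFileAttributes (assumption): succeeds only if the path can be queried."""
    os.stat(path)  # raises OSError if the path cannot be reached
    return True


def share_is_up(path, timeout_s=TIMEOUT_S, probe=probe_share):
    """Run the probe in a worker thread and give up after timeout_s seconds."""
    result = {"ok": False}

    def worker():
        try:
            result["ok"] = bool(probe(path))
        except OSError:
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    # A probe that is still blocked after the timeout counts as a failure.
    return (not t.is_alive()) and result["ok"]
```

A caller would invoke `share_is_up` on the share's UNC path once every `CHECK_INTERVAL_S` seconds and log the time of each failure, so the failures can be lined up against switch or server events.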

Here is the problem…on occasion my client’s API call fails for unknown reasons.  I’ll explain some of my trouble-shooting:

-      All NIC cards\port settings have been synched (100\full)
-      The server does not have TCP stack issues, it returned 25 consecutive 65k pings in <1 ms
-      The server’s integrated NIC card has been replaced with a PCI NIC
-      Server load is low, so is network load
-      All gateways, dns\wins servers, etc… all network settings are fine
-      No trace problems, goes from source->switch->destination.  Have run all the pathping, tracert, and iperf commands with no issues
-      No errors in the switch logs
-      Here is the kicker!  On tests of 100 consecutive pings at 50k and 3k, about 1 out of every 5 sets of pings has a failure.  About 50% of those failures are on the first ping.  The RTT varies quite a bit too…10 in a row are normal, then a return time goes up to 15ms, then normal times, then another 15ms one.  These long times are about 1 out of 10 pings (within the set of 100).  I’ve also never had a normal 32-byte ping fail, only ones with an increased packet size.
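
To compare ping runs like these on equal terms, the per-set numbers can be tallied the same way each time. This is only a sketch: the function name, the input format (one RTT per echo request, `None` for a timeout), and the 10 ms "slow" threshold are assumptions, not something taken from the tests above.

```python
def summarize_pings(rtts_ms, slow_threshold_ms=10.0):
    """Summarize one set of pings.

    rtts_ms: one entry per echo request, in milliseconds; None marks a timeout.
    slow_threshold_ms: replies at or above this value are counted as slow (assumed cut-off).
    """
    total = len(rtts_ms)
    lost = sum(1 for r in rtts_ms if r is None)
    answered = [r for r in rtts_ms if r is not None]
    slow = sum(1 for r in answered if r >= slow_threshold_ms)
    return {
        "sent": total,
        "lost": lost,
        "loss_rate": lost / total if total else 0.0,
        "first_lost": bool(rtts_ms) and rtts_ms[0] is None,
        "slow": slow,
        "max_ms": max(answered) if answered else None,
    }
```

Keeping `first_lost` separate from the overall loss count makes it easy to see whether the failures cluster on the first request of a set or are spread through it.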

I’m not sure if this is the switch trying to read\route things around or what.  Unfortunately I don’t have the luxury of putting a dumb switch in to see what happens.  If anyone has any experience\pointers I’d love to know how it worked out.

I assume you are pinging by IP address and not by name, so I'll rule out name resolution.  Have you tried running the same test with the workstation and server in the same subnet/VLAN?  That would help narrow it down to a switch problem.  It's possible that your switch is a "store-and-forward" switch and is running out of buffer space, but that's extremely unlikely since it should have ample buffer space for what you are doing.  And I'm sure it would log buffer overflows in the switch log.  I would run Network Monitor on the server to try to determine whether the problem is evident on the server.  Is it failing before or after it reaches the server?  That sort of thing.

Are there any IPS or IDS devices or software on your network that might be detecting what you are doing as some sort of DoS attack? A new feature of some switches (and possibly some OSes) does just that... take a look at this HP ProCurve Switch:

"ICMP throttling: defeats ICMP denial-of-service attacks by enabling any switch port to automatically throttle ICMP traffic  NEW!"

I know you don't have a dumb switch handy, but many times you can set one port to 'Monitor' and have it monitor all traffic in and out of another port.  This might help.

Also, turn off any IPSEC that is running on the server.  That can slow TCP/IP communication down.

Also disable any NetBEUI or extra protocols that are not necessary.

You may also want to try updating windows, device drivers, and switch firmware just to be safe.
Fatal_Exception (Systems Engineer) commented:
Morning Adam..  

Although I should not think this a problem with your Cisco Switch, you might try pinging with a different packet size, and see if the responses are any different..

ping IP_Address -f -l Packet_Size

ie:  ping 192.168.1.x -f -l 1472 (1472 bytes of data plus 28 bytes of IP and ICMP headers makes 1500, the default Ethernet MTU)

If you receive a message regarding fragmentation, then try lowering the packet size and ping again, until you discover the optimum MTU...  of course, if you do find problems, then you might need a new switch..  
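
The trial-and-error search for the largest packet that passes unfragmented can be sketched as a binary search. This is an illustration only: `fits` is a hypothetical callback that would wrap a real don't-fragment ping (for example Windows `ping -f -l <size>`) and return True when the reply comes back; the search also assumes that if one size fits, every smaller size fits too.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def find_max_payload(fits, low=0, high=max_payload_for_mtu(1500)):
    """Binary-search the largest payload size for which fits(size) is True.

    Returns None if no size in [low, high] gets through.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid       # mid got through; try something larger
            low = mid + 1
        else:
            high = mid - 1   # mid was too big; try something smaller
    return best
```

Adding 28 to the value returned by `find_max_payload` gives the effective path MTU between the two hosts.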

Then again, it is early here, and I might be completely off base!  :)
Keith Alabaster (Enterprise Architect) commented:
You mention that a number of your machines are on a different subnet. How are these machines connecting? Are you using VLANs or is there a router in the mix here?

Are you getting the same issue from users on both VLANs or just one of them?
Keith Alabaster (Enterprise Architect) commented:
Sorry, I see you are using VLANs. How are you converging these? Still, how are you connecting these together? Are you getting the same issue from users on both VLANs or just one of them?
thesultanofswine (Author) commented:
Thanks for the suggestions...I'll try to answer some of your questions with some detail:

- The ping tests were performed with both name and IP, the error rate was similar.
- I also pinged between different workstations (which eliminated the server), the same error rate occurred.
- About the VLANs, the same error rate is occurring between the machines which are in the same subnet and the 10 machines on the VLAN with a different IP address scheme.  There is no router involved, all the clients are hooked up to the Cisco switch and I believe the switch is doing the routing.  I have to admit I do not know the specifics about the VLAN or how it is converged.  

I'll go try out your above suggestions and get back to everyone with the results.
Thanks again.
Keith Alabaster (Enterprise Architect) commented:
Can you tell me which Cisco switch it is? If it's a layer 3 switch then fine. If it's only a layer 2 switch then it cannot do the converging and there must be something else in the mix doing the routing. Layer 2 devices cannot route :)
thesultanofswine (Author) commented:

The switch is a Cisco Catalyst 6509.
Keith Alabaster (Enterprise Architect) commented:
Wooo. we have four of those; layer 3 it is then lol.

Superb bits of kit. Sup 1A's or using the new 720's?

Sorry, back to the question. How are the subnets/vlans connecting? Boxes directly on switchports or devices at the other end of trunks?
If trunks, what are the access layer boxes at the other ends? If Ciscos, is spanning tree enabled? It could be switching out and taking a few seconds or more to reconverge.
thesultanofswine (Author) commented:

This is where my knowledge of the network really drops out.  This network is not at my site and I have no access to the information\setup, besides the basics.  This is also getting over my head in the networking department. I'll try to bring up the questions and see what I get back; unfortunately some of the people I'm dealing with probably also don't know the specifics to this degree...  I'll try to get back with some answers soon.

I do have one question though. In my experience with trouble-shooting similar issues on my application, I've noticed the big expensive Cisco switches cause some issues.  In situations where we can, we've swapped out the Cisco switches with old dumb Bayview switches and the problem has decreased.  Is there some issue with all the logic\work\routing the smart Cisco switches do which is causing the delay?  Also, one more question: is there a way to take that functionality off certain ports on the Cisco switch so frames just pass through?

Keith Alabaster (Enterprise Architect) commented:
OK, no sweat.

The 6509 will likely have a blade with x number of 10/100/1000Mb ports and/or a blade with gigabit fibre ports on it.

These ports can be set as trunk ports (no IP address) that connect to switch devices so as to extend the fabric, and you state which VLANs will be allowed over the trunk (uses 802.1Q or Cisco's proprietary ISL protocol). Alternatively they can be set as ordinary ports whereby you may have a single server or device plugged directly into the port. We use Cisco 2950s and the older 35xx series access layer switches all on trunk ports but we have no issues (that I am aware of) with timeouts/drop outs.

Spanning tree, or per-VLAN spanning tree (STP or PVST), is simply the process that ensures only the best route for the traffic to take is left in an operational state. Any second/third routes that your network discovers to get to a device are placed into a hold-down (blocking) condition. If something fails or the topology changes, the algorithm kicks in, the new best route is activated, and any others are placed into hold-down.

When this change is made (or, more pertinently, when the network 'thinks' this change has been made), it can take a short time for the new routes to propagate round, causing a delay.
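
To put rough numbers on that delay: with the classic IEEE 802.1D default timers (max age 20 seconds, forward delay 15 seconds), a port spends one forward delay in listening and one in learning before it forwards traffic. The sketch below does that arithmetic and compares it with the 10-second client timeout from the original question. It assumes the defaults are in use, which is not confirmed for this network; Rapid Spanning Tree or PortFast on access ports would converge much faster.

```python
MAX_AGE_S = 20        # 802.1D default: how long stale topology information is kept
FORWARD_DELAY_S = 15  # 802.1D default: time spent in each of listening and learning
APP_TIMEOUT_S = 10    # from the original post: client gives up after 10 seconds


def stp_outage_range(max_age=MAX_AGE_S, forward_delay=FORWARD_DELAY_S):
    """Return (direct, indirect) outage estimates in seconds for classic STP.

    direct:   a local link change, so only listening + learning apply
    indirect: a failure elsewhere that must first age out (max age), then listening + learning
    """
    direct = 2 * forward_delay
    indirect = max_age + direct
    return direct, indirect


direct, indirect = stp_outage_range()
print(f"classic STP outage: {direct}-{indirect} s, client timeout: {APP_TIMEOUT_S} s")
```

Under these assumed defaults either figure is well beyond the client's timeout, so a single reconvergence would be enough to push the application into its "network down" state.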
thesultanofswine (Author) commented:
All, thank you for the responses.  I appreciate your efforts helping me through my issue.  
Keith Alabaster (Enterprise Architect) commented:
welcome :)
Fatal_Exception (Systems Engineer) commented:
Keith..  great explanation of the layer 3 switching using vlans!  

and of course, a thanks to sos!
