Solved

Network drop-outs on XP workstations connected to Cisco Layer-2 Switches

Posted on 2006-06-17
Last Modified: 2008-01-09
Hello,

I have a network of six Cisco 2950 switches and one 2960G, all connected by fiber. On the 2950 switches I have 60+ IP cameras, all running significant data rates (approx. 300 Mbps total) back to two Windows 2003 servers that are connected to the 2960G. There are six XP workstations (WS) used for live viewing: three at various locations on the 2950s and three located at the 2960G switch. Everything is running on the same subnet and VLAN.

The good news is all switches, IP Cameras and Windows 2003 servers work perfectly, running continuously for two months without any glitches or issues of any kind.

The problem is with the XP WS used for live viewing. Their network connectivity drops out intermittently at different times. I have observed these boxes for hours at a time, and they drop their live video feed intermittently during the day. The software is designed to reconnect to the Windows 2003 servers, and on a couple of the boxes this works very well most of the time. What makes the client complain is when one or more of the XP WS does not recover and displays ‘no video’ until we fix it. Half of the XP boxes exhibit this network drop-out at least twice a day, while a couple will run for a week or longer. Five of the XP WS are Dell 206/207 and the other one is a comparable white box.

This issue has been going on for several months now, and we have tried many different things to fix it when it occurs. The one troubleshooting technique we have used for the last month is to simply run ping tests to the two servers until the WS starts working again. Typically the first ping returns ‘Request timed out’ and the rest return normally; network connectivity is then restored immediately and the video application starts running again. In some cases, where the drop-out occurred close to our testing, the ping tests return ‘Request timed out’ for 5 to 10 minutes or longer and then, like magic, everything returns to normal until the next time.

Recently we set up FreePing on all machines, each pinging all the other machines. The servers reliably ping each other 100% of the time. The XP WS will not ping anything reliably, clocking in at 93% to 99%.
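
Editor's sketch: the success-rate measurement described above can be approximated without FreePing. The following is a minimal Python sketch, not the tool the poster used. The function names are hypothetical, and it shells out to the operating system's own ping command; on Windows the exit code is only a rough signal, so treat the result as indicative.

```python
import platform
import subprocess
from typing import Sequence


def ping_once(host: str, timeout_ms: int = 1000) -> bool:
    """Send one echo request with the system ping command.

    Returns True when the command exits with status 0. This is an
    approximation: it trusts the exit code rather than parsing replies.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_ms), host]
    else:
        # Linux-style flags; -W takes whole seconds there.
        cmd = ["ping", "-c", "1", "-W", str(max(1, timeout_ms // 1000)), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def success_rate(results: Sequence[bool]) -> float:
    """Percentage of pings that were answered; 0.0 for an empty log."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```

Collecting `ping_once(server)` once per second into a list over 10 to 12 hours and passing it to `success_rate` gives a figure comparable to the percentages quoted in this thread.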

Other troubleshooting items of interest are:

· All NICs are set to Auto1000 and have the latest drivers.
· Server CPU loading is 50%; WS CPU loading is less than 10%.
· IPsec is off/disabled.
· One gateway, no DNS/WINS servers, using IP addressing exclusively.
· No trace problems; traffic goes from source->switch->destination.
· No errors or issues in the switch logs, as verified by Cisco tech support.
· Cisco had us verify/set all host ports to ‘PortFast’, with no change.
· 4 of the XP WS have been tested with dumb 100 Mbps switches, and they will run for weeks at a time with no issues under the same loading conditions.

I have reviewed all my Cisco books; however, the fact that the Windows 2003 servers and the majority of the network (60+ IP cameras) are operating perfectly has me quite baffled.
Question by:mfischer
22 Comments
 
LVL 77

Assisted Solution

by:Rob Williams
Rob Williams earned 250 total points
ID: 16928433
Is there any chance the network adapters have power management enabled? This is off by default on servers but on by default on workstations.
Device Manager | right-click the NIC and choose Properties | Power Management | un-check "Allow the computer to turn off this device to save power"
 

Author Comment

by:mfischer
ID: 16931030
RobWill, the three most problematic machines had this option enabled, as did one machine that had been working OK. I thought this power-save option only kicked in when the NIC had no network traffic? All my machines' NICs have continuous network traffic. I checked this option on all machines and unchecked it wherever it was enabled. I will monitor this over the next 24 hours and post my findings.

An interesting observation I have made is the problematic systems appear to work longer between failures since the FreePing installation (four days ago) and continuous Ping monitoring. Is this coincidental or is there something that causes Ping requests to connect over program socket connection requests?
 
LVL 77

Expert Comment

by:Rob Williams
ID: 16931120
Are you sure you have continuous network activity? If not, they will "go to sleep". The fact that running the Free Ping utility seems to resolve the issue would indicate to me that the power saving option may be the problem, as the Free Ping application ensures that they don't "fall asleep" by sending continuous traffic.

Let us know if there is any improvement.
 

Author Comment

by:mfischer
ID: 16931974
RobWill, yes, there is continuous network traffic to all XP WS, as they are displaying live video streams. Each XP WS has 16 to 30 live camera streams at 1 FPS nominal. The operator can switch selected cameras to 15-30 FPS. The network traffic at any one XP WS ranges from approx. 500 KB/s (about 5 Mbps) to 1.5 MB/s (about 15 Mbps).
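
Editor's sketch: the byte-to-bit conversion behind these figures, as a short Python example. It assumes 1 KB = 1000 bytes. At 8 bits per byte the payload alone works out to 4 and 12 Mbps; the poster's 5 and 15 Mbps match a common rule of thumb of 10 bits per byte to allow for protocol overhead. That rule of thumb is an assumption here, not something stated in the thread.

```python
def bytes_per_sec_to_mbps(bytes_per_sec: float, bits_per_byte: float = 8.0) -> float:
    """Convert a byte rate to megabits per second.

    bits_per_byte=8 gives the raw payload rate; 10 is a rough allowance
    for framing and protocol overhead (an assumption, not a measurement).
    """
    return bytes_per_sec * bits_per_byte / 1_000_000


low, high = 500_000, 1_500_000  # bytes per second, from the figures above

print(bytes_per_sec_to_mbps(low), bytes_per_sec_to_mbps(high))          # raw payload
print(bytes_per_sec_to_mbps(low, 10), bytes_per_sec_to_mbps(high, 10))  # with overhead allowance
```

Either way, the per-workstation load is a small fraction of a 100 Mbps or gigabit port.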
 
LVL 77

Expert Comment

by:Rob Williams
ID: 16933596
Interesting then that Free Ping would make a difference. I have often used utilities similar to that in test environments to keep devices, or VPN tunnels alive. I wouldn't think from that point of view there would be any difference between a ping utility and your video streams. Wonder why it was helping the situation.
 

Author Comment

by:mfischer
ID: 16938870
RobWill, I checked in just now and found everything has been working for over 24 hours. I closed the FreePing utility and will monitor for another 24 hours, which will tell me if your suggestion fixed the problem or if the FreePing utility was keeping things going.

I did notice that the only machines that maintained a perfect 100% ping rate (between them) were the two Windows 2003 servers. These machines are much more heavily loaded than the WS, so I wonder why they are more reliable. Maybe it is because their OS costs more? Whatever the reason, I am sure glad these babies work 100% so I can keep my job. The more I observe this, the more it looks like the switches are giving the Windows 2003 servers a higher priority. Is this possible? I have read recommendations about separating everything into logical VLANs. This keeps broadcast counts down, as broadcasts have to go to each port on all switches in any one VLAN. I am somewhat new to Cisco switches; perhaps this is required with these types of switches? Or perhaps multiple VLANs help things?

By the way, 2 of the 6 WS are Windows 2000 SP4 WS (not XP). Whatever the problem is, it is related to XP and 2000 as one of the 2000 boxes would crap out once or twice a day.

Concerning your last comment about whether FreePing would make any difference: I wondered if the programmers have missed something here. The video streams use TCP/IP socket programming. Perhaps they need to use a 'keep-alive' function, if supported by TCP/IP. I have tested these same boxes on cheap $13.00 10/100 switches and they all run video perfectly for days on end, so it is not code related in the general sense. The only difference here is the Cisco switches.
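
Editor's sketch: TCP does support keep-alive as a per-socket option. The following minimal Python example shows how an application could turn it on. Whether the video software in this thread does so is unknown, and the function name and timing values are illustrative assumptions. The per-socket timing options are only applied where the platform exposes them; otherwise the operating system default applies, which is commonly a two-hour idle time.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 30,
                     interval_s: int = 5, probes: int = 3) -> None:
    """Turn on TCP keep-alive probes for an existing socket."""
    # SO_KEEPALIVE is the portable switch that enables the probes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Timing knobs are platform dependent, so set only those that exist.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keep-alive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

Keep-alive probes only fire on an idle connection, so they would add little to a socket that is already carrying continuous video.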
 
LVL 77

Expert Comment

by:Rob Williams
ID: 16938929
I doubt that the better ping response on the servers would be due to software, but most servers have "server grade" network adapters, which are much higher quality and more dependable.
As for VLANs, they can definitely segment traffic/broadcasts. How many PCs do you have? I haven't set up any VLANs, but as I understand it you wouldn't see any performance enhancement with fewer than 75 PCs. If you have more, you may want to look into it further.
Keep-alive shouldn't really be necessary over a basic network; it is more for slow links such as VPNs.

Curious to see how it goes over the next 24-48 hours.
 

Author Comment

by:mfischer
ID: 16947551
RobWill, I have monitored the installation over the last 24 hours. With the FreePing utility turned off, the problem has returned for all live viewing machines. The guards on duty informed me that every live viewing machine has dropped its network connection at least twice over the last 24 hours. I just remoted in and checked all machines and found 4 of the 6 WS with some kind of network related problem. On two of the machines I simply reloaded the software (their networks were up again); on the other two I started pinging until the networks came up, and then the software would reload OK. I am going to set up FreePing on the most critical systems until I get this figured out, as constant pinging definitely helps in this situation.

Concerning network adapters, all servers and workstations (except one WS) are using the Intel PRO 1000 MT NICs. This NIC is a "server grade" NIC and works very well on the Windows 2003 Servers. Although these NICs are on the Dell WS, I have only ONE Dell that has worked perfectly through this madness. Say, is there any way to compare registry settings between two similar machines? Perhaps there is some kind of obscure setting that is causing issues with these Cisco managed switches, just a thought.
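
Editor's sketch: one way to compare registry settings is to export the same key from both machines with regedit (File > Export) and diff the two text files. The Python sketch below does that for the simple `"Name"=value` lines of a .reg export. The function names and the key and value names in the usage example are hypothetical, and the parser ignores edge cases such as value names that contain an equals sign.

```python
def parse_reg_export(text: str) -> dict:
    """Map (key path, value name) -> raw value text from a .reg export."""
    values = {}
    key = None
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        if line.endswith("\\"):
            # Long hex values continue on the next line after a backslash.
            pending = line[:-1]
            continue
        pending = ""
        if line.startswith("[") and line.endswith("]"):
            key = line[1:-1]
        elif key is not None and "=" in line:
            name, _, data = line.partition("=")
            values[(key, name.strip('"'))] = data
    return values


def diff_reg(a: dict, b: dict) -> dict:
    """Return entries whose value differs, or that exist on one side only."""
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b))
            if a.get(k) != b.get(k)}


if __name__ == "__main__":
    good = parse_reg_export('[HKEY_X\\Nic]\n"SpeedDuplex"="0"\n')
    bad = parse_reg_export('[HKEY_X\\Nic]\n"SpeedDuplex"="6"\n')
    print(diff_reg(good, bad))
```

Regedit exports are typically saved as UTF-16 text, so a real file would usually be read with `open(path, encoding="utf-16")` before parsing.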
 

Author Comment

by:mfischer
ID: 16948344
In order to troubleshoot this problem to the next level, I added an unmanaged Linksys model SR2016 switch (all ports are gigabit) in the server room between the servers, the live viewing stations and the Cisco 2960G. My purpose here is to isolate the Cisco 2960G managed switch. I know that the problematic workstations (WS) operate OK with unmanaged 100 Mbps switches; however, I need to know if they work with unmanaged 1000 Mbps switches. I will provide an update within 12 hours.
 
LVL 77

Expert Comment

by:Rob Williams
ID: 16950956
Thanks for the update. I am very curious as to the resolution, and also wondering why Free Ping seems to resolve it.
As for the NICs, those are great cards; they should be no problem at all. I have not used Cisco switches, so I am no help with their configuration, though short of a bad port, I can't imagine a configuration that would stop communications after a period of time.
 

Author Comment

by:mfischer
ID: 16951201
Update: I have found that when I measure less than 100% ping success over a 10-12 hour period, I experience application network failures of the type that prompted this tech support request. Now I have something that measures my problem very accurately. Using the Cisco 2960G, I only get 100% ping rates between the Windows 2003 servers, while the XP and 2000 systems vary from 90% to 99%, even though these systems are located on the same switch.

Yesterday I inserted a Linksys model SR2016 1000 Mbps unmanaged switch between the server room Cisco 2960G and all 5 machines located in the server room. Today, I find that all machines located in the server room ping each other 100% over a 12 hour period. Obviously there is some kind of issue within the 2960G that requires attention. I have recontacted Cisco to get it corrected. From my perspective, there is something wrong with the 2960G's firmware or there is some internal setting that is not right. Will keep you updated.
 
LVL 77

Expert Comment

by:Rob Williams
ID: 16951254
Certainly starting to sound like the switch. Curious as to what they have to say.
If you would like more help here, with the Cisco issue, you might want to post a 20 point pointer question in the routers & switches topic area, pointing to this one. There are quite a few good Cisco guys there. I am not a "member of that club" <G>
http://www.experts-exchange.com/Hardware/Routers/
Explanation of pointer questions:
http://www.experts-exchange.com/help.jsp#hi262
 

Author Comment

by:mfischer
ID: 16977502
Cisco tech support had me set up Ethereal at one server and one problematic workstation (WS). FreePing was set up at the workstation, pinging the server. By tracing the Ethereal logs we found that the WS ping was reaching the server in all cases and the server was replying in all cases; however, the WS was not always receiving the ping reply. The Cisco tech checked the 2960's internal registers and logs and found nothing wrong inside the 2960. He also had a higher-level guy review things, and they informed me that they could not find anything wrong with the switch. This was all done at 1000 Mbps, and they had me test again over the weekend using forced 100 Mbps full duplex connections. There was no change; the network drop-out problem is still there.
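
Editor's sketch: the two-capture comparison described above amounts to matching echo sequence numbers at each hop. The following minimal Python illustration assumes the sequence numbers have already been extracted from each capture into sets; the function name and the outcome labels are hypothetical.

```python
def classify_pings(ws_sent, server_seen, server_replied, ws_received):
    """Say where each echo request, keyed by sequence number, got lost."""
    outcome = {}
    for seq in sorted(ws_sent):
        if seq not in server_seen:
            outcome[seq] = "request lost before server"
        elif seq not in server_replied:
            outcome[seq] = "server did not reply"
        elif seq not in ws_received:
            outcome[seq] = "reply lost before workstation"
        else:
            outcome[seq] = "ok"
    return outcome


if __name__ == "__main__":
    # Pattern reported in this thread: every request arrives and is answered,
    # but some replies never show up in the workstation capture.
    print(classify_pings({1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 3}))
```

A capture taken on the workstation itself only sees frames that the NIC and its driver hand up to the operating system, so "reply lost before workstation" cannot by itself distinguish a drop in the switch from a drop in the adapter or driver.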

I did notice that the 2960 had to be forced into full duplex, as it could not auto-negotiate a 100 Mbps full duplex connection with the Dell workstation. When I mentioned this to the Cisco tech, his response was that the problem was not with the switch but with the Dell workstation. Man, I thought that the 100 Mbps auto-negotiation stuff was perfected ages ago!

Another interesting thing I was informed of was that I was getting near the end of the line with my problem because they could not find anything wrong from their inspections. When I realized this, I immediately asked for a diagnosis to present to my client. The tech then stated that although he could not explain it, this was an interoperability of network access as it pertains to negotiating at the physical layer, whatever all that means. I am getting the message I am not going to get much further help from Cisco tech support.

RobWill: I will try your suggestion to get a Cisco guy involved, thanks.
 
LVL 77

Expert Comment

by:Rob Williams
ID: 16977510
Thanks for the update. This must be very frustrating for you.
Does it still drop the connection when locked at 100mbps full duplex? Very odd. See what some of the Cisco Experts have to say.
 

Author Comment

by:mfischer
ID: 16977563
Yes, it still drops the connection when locked at 100 Mbps full duplex. And this same machine operated perfectly, giving me 100% pings for 12 hours, when I had everything connected to the Linksys 1000 Mbps switch. During the latest testing, the 2960G offered less than 30% ping returns at 100 Mbps or 1000 Mbps! I do not see how a tech for any company could overlook these dismal results. At the least he should ask to try a different machine or repeat the previous test to ensure that the machine that produced these results didn't die somewhere along the way.

Other observations: I was pinging a server that the workstation did not normally communicate with. I may be wrong; however, it looks to me like this Cisco switch tries to keep machines that normally talk to each other (via TCP) connected, though in the case of the workstations not perfectly. It appears to me in this latest test that this switch ignores a great many ping requests going to machines that are not otherwise communicated with by other traffic or protocols. I know this is strange logic; however, actions are speaking louder than words in this case.
 
LVL 79

Accepted Solution

by:lrmoore
lrmoore earned 250 total points
ID: 16978411
>I have only ONE Dell that has worked perfectly through this madness
Hmmm.... sort of defeats the argument that it is the switch . . .

>I thought that the 100mbs auto-neg stuff was perfected ages ago!
The speed negotiation usually works, but the duplex negotiation does not always. There are so many NIC manufacturers and switch manufacturers that there are sometimes issues between certain NICs and certain switches.

>using the Intel PRO 1000 MT NICs.
Have you tried updating the NIC drivers? I have some Dell workstations on a Cisco 2970 switch and I had to get the latest driver for them to make everything work well. Oddly enough, these are all XP workstations and the MT driver is not on the XP Distribution CD and had to be added after XP loaded.

Are your switches at default config? Do you have any port security, storm control, QoS, VLANs, routing or any other advanced features enabled?
What is the physical link between your workstation switch and the server switch, or are they on the same switch and you still see these anomalies?
One primary difference between the 2960 and other layer 2 switches like the Linksys is the ARP cache. If you're moving machines around on the switch you might have to clear the ARP cache.
Have you tried power cycling the switch?
Have you looked at the individual switchports for error counters? Particularly CRC, frame and collisions.
Have you verified your cable infrastructure and is it all certified CAT5e?
Have you tried different patch cables with one of the trouble workstations?
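
Editor's sketch: checking the per-port error counters can be scripted once the `show interfaces` output has been saved to a text file. The Python sketch below pulls out the CRC, frame and collision counts mentioned above. The sample text in the usage example is illustrative and was not captured from the switches in this thread, and the function name is hypothetical.

```python
import re

# Counter labels as they appear in IOS "show interfaces" output.
COUNTERS = ("input errors", "CRC", "frame", "collisions")


def error_counters(show_interfaces_text: str) -> dict:
    """Extract selected error counters from saved 'show interfaces' text."""
    found = {}
    for name in COUNTERS:
        match = re.search(r"(\d+) " + re.escape(name) + r"\b", show_interfaces_text)
        if match:
            found[name] = int(match.group(1))
    return found


if __name__ == "__main__":
    # Illustrative sample only, not real output from this network.
    sample = (
        "     5 input errors, 3 CRC, 2 frame, 0 overrun, 0 ignored\n"
        "     0 output errors, 12 collisions, 1 interface resets\n"
    )
    counters = error_counters(sample)
    print({k: v for k, v in counters.items() if v})
```

Counters that keep climbing on one port between two readings point at that port's cable, NIC or duplex setting rather than at the switch as a whole.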
 

Author Comment

by:mfischer
ID: 16980078
Hello lrmoore, thanks for the response.

Yes I have one Dell and two high-end Windows 2003 servers that are working just fine. All other workstations are at different levels of operability from network drop-outs every few hours to once a week. When I take the 2960 out of the equation then the reliability factor jumps up at least 10 fold.

The NIC drivers were updated as the first order of troubleshooting; Intel released a new update, v10.3, not long ago that covers most PRO1000 adapters and OSs. It is possible that the newer driver works for my servers and is intermittent on the workstations. What version worked for you? Perhaps I should test that version.

The switches started at their default config. All ports are set to their proper type (Desktop or Switch). All host ports have PortFast enabled. There is only one VLAN, VLAN1. I know for sure QoS is disabled (a default setting, I think). As for the rest (port security, storm control, routing or any other advanced features), I would say they are in a default condition. I can certainly remote in and check anything that you might think important, as I need to get this problem resolved. The workstations under test are on the same 2960 switch as the servers. I have two other workstations on two 2950s across the fiber; one has network issues and the other is the machine that I have not seen fail to date. I do not move machines around, and I have power cycled the 2960 switch a couple of times. The Cisco techs have gone over the 2960 three times now, looking at all error counters (CRC, frame and collisions), each time telling me there is nothing they can find wrong with the switch. All cabling at the 2960 is packaged, certified CAT6 cable, and yes, I swapped cables on my first round of troubleshooting.

I am beginning to view this issue as possibly ARP related from everything I have observed to date. When my TCP socket based video application stops receiving data from the server for a period of time (1 minute to 4 hours), things start working immediately as soon as I ping the 2960 switch and/or the server. This seems to be a big mystery to everyone to date. Researching other related network issues, I have found suggestions to enable PortFast on all host ports, which has fixed this type of issue many times in the past, however not with this problem.

As I read and learn more on the subject of Cisco switches, I have more questions. Spanning tree is obviously used where all the switches connect together over the fiber link. From some recent Cisco reading I've found references to the Cisco switches' auto discovery of each other (as it pertains to the spanning tree), that the switch with the lowest MAC address is used as the master device, and that you can change this if required. Is this something that can affect what is going on with networking on the 2960? Should the 2960 be set as the master device? Also, given my experience with inserting the Linksys switch, I was wondering if the 2960 is designed to have other switches connected to it rather than (what Cisco refers to as) 'desktop' devices?
 

Author Comment

by:mfischer
ID: 17020211
Final Notes, Fixes and Comments:

Shortly after my last post I discovered that the one (and only) working Dell machine did not have its driver updated. I guess we overlooked this because it was working. Interestingly, the NIC driver was dated 2002, and after updating it to the latest 2006 driver, it started acting up in the same manner as the other Dell boxes. I took all the Dell boxes back to the 2002 driver and they started working much better with the Cisco switches. I then tested a couple of machines with the second-to-last driver update (dated 2005) for the Intel PRO1000 MT NICs and encountered issues. I then pulled these two Dell boxes out and configured the white box machine to display all cameras on two displays located in the server room. The other Dell boxes are now in critical areas and are working well with the 2002 Intel driver. I have rarely observed newer drivers (within the last year) causing issues in this manner.

The Cisco tech completed his review of our 2960G and found no problems at all. He mentioned that they commonly observe newer NIC drivers not working with their products. The Cisco tech also told me that it is good practice to use the newest or most powerful switch as the Root Controller for VLAN purposes.

I have learned much from this experience.

1) Migrating from non-managed switches to Cisco Layer 2 managed switches will most likely require additional work on the part of the technician in charge. Because of this and if you plan on using Cisco switches, I would highly recommend involving a Cisco tech at the design stage. At least you can get a working design layout at the starting point. And, if you have problems, you will have someone who is used to getting Cisco TAC started in solving your issue(s).

2) If you utilize Cisco Layer 2 switches, you can expect issues with some PCs as it relates to "Interoperability". My definition of Interoperability as it relates to Cisco is as follows: when you have a machine that networks perfectly on any brand of unmanaged switch (including $13.00 switches), and the same machine no longer works reliably once you connect it to a managed Cisco switch, the term used to describe this problem is "Interoperability".

3) If you need the advanced features that layer 2 switches have to offer then you have little choice but to make the switch and deal with the issues that may present themselves.

4) When using Cisco Layer 2 switches with high-end Intel server chassis and Windows 2003, you can expect 99.999% plus uptime. In my test I have had 100% uptime since the addition of the 2960G switch, now over 2 months.

5) Cisco switches auto-negotiate speed very well; however, they sometimes have problems negotiating "Full-Duplex". For this reason it may be necessary to set the speed and duplex manually on your Cisco switch ports for this to work correctly. The symptom is a sluggish, intermittent or otherwise inoperable network connection.

Hope this information helps someone else.
 
LVL 77

Expert Comment

by:Rob Williams
ID: 17022261
Personally I found your findings helpful and useful mfischer.
Thank you for the update.
Still curious why the "Intel PRO 1000 MT NIC's" worked on the servers and not the PC's, if the same generation drivers.
 

Author Comment

by:mfischer
ID: 17034474
Hello RobWill, I used to believe that if the NICs were the same on two machines AND if you had the same driver, you could expect to have the same experience, as it relates to networking. Now I believe this is no longer a valid assumption because other factors, such as supporting circuitry, manufacturing processes and OS differences make this impossible.
 
LVL 77

Expert Comment

by:Rob Williams
ID: 17034499
Seems you are right, O/S and supporting motherboard appear to have an effect. Good to know.
Thanks,
--Rob
 
LVL 77

Expert Comment

by:Rob Williams
ID: 17055411
Thanks mfischer,
--Rob
