Solved

random bursts of dropped packets

Posted on 2009-04-30
Last Modified: 2012-05-06
My network is a SonicWall TZ 210 UTM connected to a Dell 48-port unmanaged switch. On the switch are a server running Windows SBS 2003 (which provides DHCP and DNS, hosts the Exchange server, and is my print server), 2 network copier/printers, 1 NAS device, and 20 or so workstations running XP Pro.  There is one domain, on a private Class C address range with a /24 subnet.  I use a DHCP scope of 192.168.1.30-192.168.1.100.  The subnet mask is 255.255.255.0.  The SonicWall has a static IP and is the gateway.  The printers have static IPs outside the DHCP range.  Basically, everything is configured correctly.
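
A quick way to sanity-check the addressing described above is sketched below. It is only an illustration: the subnet and DHCP scope come from the post, but the static addresses being tested are hypothetical, since the post does not give the printers' actual IPs.

```python
import ipaddress

# Subnet and DHCP scope as stated in the post.
SUBNET = ipaddress.ip_network("192.168.1.0/24")
DHCP_START = ipaddress.ip_address("192.168.1.30")
DHCP_END = ipaddress.ip_address("192.168.1.100")


def check_static(address: str) -> str:
    """Classify a statically assigned address against the subnet and DHCP scope."""
    ip = ipaddress.ip_address(address)
    if ip not in SUBNET:
        return "outside subnet"
    if DHCP_START <= ip <= DHCP_END:
        return "inside DHCP scope (conflict risk)"
    return "ok"


if __name__ == "__main__":
    # Hypothetical static addresses, for illustration only.
    for addr in ["192.168.1.10", "192.168.1.45", "192.168.2.10"]:
        print(addr, "->", check_static(addr))
```

A static address that lands inside the DHCP scope can be handed out to a second machine, which would produce intermittent conflicts, so it is worth ruling out early.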

Approx. 3-4 of the workstations run billing software that accesses the SQL server.  Lately, these workstations have been getting popup errors about a lost connection to the SQL server.  The configuration of these workstations and the SQL server itself was checked by the company that handles the billing software, and they assure me it is a problem with my network.

I was running ping -t and watching constantly on both servers, the 2 printers (just because I remembered their IP addresses off the top of my head), and a few random workstations.  Once every so often (with no discernible pattern), every computer I was watching with ping -t would drop a packet or 2 at exactly the same time.  Right after that, one of the printers would drop all packets for 30 seconds to a minute.  The other computers I was watching resumed normal ping responses.  I watched this happen 7-8 times over the course of an afternoon and thought I had it narrowed down (after testing the Cat5 to the printer) to the printer's NIC.  Also, it's important to note that this is exactly when the workstations running the billing software throw their errors about losing the connection to the SQL server.

The strange thing is that the last time it happened, it was not the same printer that dropped all packets from the ping -t for 30 seconds to a minute, but rather the OTHER printer.  That blew my bad-NIC idea out of the water.  I tested the cable to the other printer and it checked out fine.  I'm back at square one trying to figure out what's going on.

I did monitor network traffic with a laptop connected to my switch and running Wireshark, but I'm not an expert and really didn't see anything jump out at me when these events happened.

I had a problem with a bad cable before and used a similar method to track it down, but I really have no clue how to track this problem down or even what might be causing it.

Any additional suggestions would be greatly appreciated.
Question by:FASP
14 Comments
 
LVL 20

Expert Comment

by:Mal Osborne
ID: 24276406
Check Duplex.  Having half at one end & full at the other can cause some odd problems.
 

Author Comment

by:FASP
ID: 24279305
checked them, all auto
 
LVL 3

Expert Comment

by:Pierellie
ID: 24283633
How many Client Access Licenses do you have on your SBS server? Could there be too many simultaneous connections to that server? Then again, instead of dropping packets it should just refuse the connection, so maybe not.
 
LVL 23

Expert Comment

by:Mysidia
ID: 24287295
So you were running ping -t    constantly from one location, pinging the IPs of several workstations, servers, and printers?

Are your netmasks and IP address  range the same on all your devices?  (So no traffic between PCs is passing through the router)


Then either (a) that one PC you were pinging everything from lost connectivity to the switch, OR...  (b)  all the nodes you were watching lost connectivity to the switch, for a moment.  Since that's the same time as the SQL connectivity issue, I am thinking (b) is a lot more likely, or it may be both (a),(b).
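
The reasoning above depends on knowing whether the drops really line up in time across hosts. A minimal sketch of that check follows, assuming you have already recorded, per host, the times (in seconds) at which a ping was lost. The host names, the 2-second window, and the data layout are all assumptions for illustration.

```python
from typing import Dict, List


def simultaneous_drops(drops: Dict[str, List[float]], window: float = 2.0,
                       min_hosts: int = 2) -> List[dict]:
    """Group drop timestamps (seconds) into events and keep those hitting several hosts."""
    # Flatten to (timestamp, host) pairs, sorted by time.
    samples = sorted((t, host) for host, times in drops.items() for t in times)
    events = []
    current = []
    for t, host in samples:
        # Start a new event once a sample falls outside the window
        # measured from the first sample of the current event.
        if current and t - current[0][0] > window:
            events.append(current)
            current = []
        current.append((t, host))
    if current:
        events.append(current)

    result = []
    for ev in events:
        hosts = sorted({h for _, h in ev})
        if len(hosts) >= min_hosts:
            result.append({"start": ev[0][0], "hosts": hosts})
    return result
```

If every event involves all monitored hosts, that points at something shared (the switch, or the PC doing the pinging); if only one host ever appears, that host or its cable is the better suspect.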

I'm thinking of a few possible causes for (b).

[i]  One of your computers could have a NIC or cabling problem. It could be spewing out invalid transmissions, and your dumb unmanaged switch may be choking on them.

[ii] The switch or some PC connected to it may not be properly grounded. There may be some sort of noise coming in that gets interpreted as a continuous network signal (errant traffic), resulting in collisions that should not occur on devices connected to a switch.

[iii] Layer 2 issue (most likely).
Your entire switch could be failing,  maybe it's a power loss, maybe it's a defective component.    I would suggest testing your UPS, if the switch has its own UPS.

I would strongly advise replacing the 48 port unmanaged switch with a 48 port MANAGED switch like a used Cisco WS-C3550-48-SMI, or whatever you can find. Get the switch an IP, enable PortFast on all the ports, and turn on logging.

And watch interface stats for amount of traffic and errors.


Well, the key in finding a managed switch is that it should support logging and let you check interface status and counters.

What you are experiencing is an issue that should be troubleshot from the switch, but you can't do that, because your switch lacks basic capabilities that all modern switches should have.


[iv]  There is a possibility of someone creating a temporary loop, or a temporary flood occurring on the network.   It's so severe it stops almost all traffic, but is brief.

This is pretty unlikely, since you're only experiencing a 2-second disruption.
You could troubleshoot by running Wireshark on your PC and looking at what
packets are being captured just after a disruption.
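
One way to look for a brief flood in a capture is to count broadcast frames per second. The sketch below assumes the capture has already been exported as (timestamp, destination MAC) pairs; that export format and the threshold of 100 frames per second are assumptions for illustration, not a known-good cutoff for this network.

```python
from collections import Counter
from typing import Iterable, List, Tuple

BROADCAST = "ff:ff:ff:ff:ff:ff"


def broadcast_spikes(frames: Iterable[Tuple[float, str]],
                     threshold: int = 100) -> List[Tuple[int, int]]:
    """Return (second, count) for each one-second bucket with too many broadcast frames."""
    # Bucket broadcast frames by the whole second in which they were captured.
    per_second = Counter(
        int(ts) for ts, dst in frames if dst.lower() == BROADCAST
    )
    return sorted((sec, n) for sec, n in per_second.items() if n >= threshold)
```

A spike that coincides with the moment the pings drop would support the loop or flood theory; a flat broadcast rate during a disruption would argue against it.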

 
LVL 23

Accepted Solution

by:Mysidia
ID: 24287434
(On second thought, a WS C3550 would be a little extreme here, as you probably don't need multiple virtual LANs and ability to route b/w them..  a decent Layer 2 manageable switch like the 2950 can do all you need for troubleshooting and monitoring.   Dell makes some managed FE products too, but their capabilities at the low end were fairly limited last I checked. )


I consider replacing a switch (even with a more expensive switch, if you're forced to buy brand new) a more efficient alternative than brute-force  re-cabling 20 PCs and swapping NICs in every system in your environment.


An alternative way to troubleshoot (b) would be to get a list of every port in use on the switch.

Figure out how often the problem is occurring.

Go down the ports one by one, systematically: unplug the port, then wait to see whether the "problem" recurs.

If the problem still happens, plug that port back in and move on to the next one.

If the problem seems to go away after disconnecting a device,  then you can look into that port further.


If the problem happens no matter what single port you unplug, then you've ruled out the possibility of any  one PC or cable to a PC.
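
How long to wait after unplugging each port depends on how often the problem occurs. A rough sketch follows, assuming the incidents arrive independently at a steady average rate (a Poisson process); that is a modeling assumption, and real faults may cluster.

```python
import math


def chance_of_quiet(mean_interval_hours: float, observed_hours: float) -> float:
    """Probability of seeing no incident in observed_hours if nothing was fixed."""
    if mean_interval_hours <= 0 or observed_hours < 0:
        raise ValueError("mean interval must be positive, observed time non-negative")
    return math.exp(-observed_hours / mean_interval_hours)


def quiet_time_needed(mean_interval_hours: float, max_chance: float = 0.05) -> float:
    """Hours of quiet needed before a quiet spell is unlikely to be plain luck."""
    if mean_interval_hours <= 0 or not 0 < max_chance < 1:
        raise ValueError("mean interval must be positive and max_chance in (0, 1)")
    return mean_interval_hours * math.log(1.0 / max_chance)
```

For example, under this assumption a problem that shows up every two hours on average needs roughly six quiet hours per unplugged port before the quiet spell has less than a 5% chance of being coincidence, which is why the port-by-port method gets slow when the fault is rare.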




 

Author Comment

by:FASP
ID: 24288365
OK,  I checked the switch.  It is a Dell Powerconnect that was unmanaged but can be switched to managed mode.  The switch itself is only 2 months old.

I did try wireshark and didn't notice anything special right after one of these events happens.  I'm no expert reading it though so I will get another capture Monday and maybe someone with more experience than me can take a peek at it.

Would it make a better capture if I turned the switch to managed mode, mirrored each port's traffic to a specific port on the switch, and ran Wireshark on the PC connected to that port, so it sees all traffic?  Or am I making it more complicated than it has to be?
 
LVL 23

Expert Comment

by:Mysidia
ID: 24288399
Firstly, I would turn on the switch's management capability and run the ping to the switch instead of to other PCs.  Connect a PC's serial port to the switch's console management port and watch it.


The most obvious indication of a problem with the switch would be if you see the switch is rebooting.

However, a managed switch will generally take much longer than 2 seconds to reboot.


Depending on the specific model, there may be error counters you can check, and system logs..
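
To avoid staring at a ping window for hours, the losses can be timestamped automatically. The sketch below filters the text output of Windows `ping -t` and is only an illustration: the matched phrases are the usual English-language messages, and the timestamp is the time the line was read, not the exact send time.

```python
import time
from typing import Iterable, Iterator


def is_loss(line: str) -> bool:
    """Return True if a line of Windows `ping -t` output indicates a lost reply."""
    text = line.strip().lower()
    if not text:
        return False
    # "Reply from x: Destination host unreachable." starts like a normal reply,
    # so a plain "Reply from" check is not enough on its own.
    if text.startswith("reply from") and "unreachable" not in text:
        return False
    return ("timed out" in text or "unreachable" in text
            or "general failure" in text)


def watch(lines: Iterable[str]) -> Iterator[str]:
    """Yield a timestamped copy of every line that reports a lost reply."""
    for line in lines:
        if is_loss(line):
            yield f"{time.strftime('%Y-%m-%d %H:%M:%S')}  {line.strip()}"
```

To use it, one could save this as a script (say, a hypothetical `pingwatch.py`) that ends with `import sys` and `for entry in watch(sys.stdin): print(entry, flush=True)`, then pipe the ping into it, for example `ping -t <switch IP> | python pingwatch.py`. The resulting timestamps can be compared against the switch's own logs and the billing software's error times.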


 
LVL 23

Expert Comment

by:Mysidia
ID: 24288423
By the way, either printer dropping everything for 30 seconds may actually be pointing towards an 'undesired loop in your network' scenario.

What you want to do is look at the 'forwarding' table on your switch. The command line interface or menu should provide you a way of listing all MAC addresses and what ports they're associated with.

See if you can get a capture of what it says during a 2-second disruption, or while a printer is unpingable.

If a loop is temporarily forming, what may happen is that the "looped port" sends some broadcast packets back to the port they came in on for a short time.

The result is your switch could think all the MAC addresses are coming from that one port (and if that one port is getting all your traffic, then the other hosts aren't getting theirs).
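
Comparing two snapshots of the forwarding table makes this kind of MAC movement easy to spot. The sketch below assumes each snapshot has been typed or parsed into a simple MAC-to-port mapping; the data layout and port names are hypothetical, since the switch's actual table format is not shown in this thread.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A snapshot maps MAC address -> switch port, as read from the address table.
Snapshot = Dict[str, str]


def moved_macs(before: Snapshot, after: Snapshot) -> List[Tuple[str, str, str]]:
    """List (mac, old_port, new_port) for every MAC learned on a different port."""
    return sorted(
        (mac, before[mac], port)
        for mac, port in after.items()
        if mac in before and before[mac] != port
    )


def crowded_ports(snapshot: Snapshot, limit: int = 1) -> Dict[str, List[str]]:
    """Return ports that have learned more than `limit` MAC addresses."""
    by_port = defaultdict(list)
    for mac, port in snapshot.items():
        by_port[port].append(mac)
    return {p: sorted(m) for p, m in by_port.items() if len(m) > limit}
```

A port that legitimately uplinks to another switch or an access point will always show several MACs, so only ports that should have a single device behind them are suspicious when they appear in `crowded_ports`.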

The Dell switch, even in unmanaged mode, might be trying to do something "smart" like storm control or loop avoidance (blocking the looped port, or dropping all traffic when it's transmitting at too high a packet rate), but that isn't 100% effective.




Make sure you don't have any possibly misconfigured device plugged into your network that "can be a DHCP server" but isn't supposed to be.

Loops can be accidentally created, for example, if you have a wireless AP and a laptop that is both connected to the AP and plugged into the wired LAN, with the laptop misconfigured to BRIDGE traffic between the wireless card and the wired network card.

Or if someone sets up a 5-port switch at their workstation and mistakenly does something like  Network Drop -> 5-port switch -> IP Phone -> 5-port switch    (phone plugged in twice)


If a printer sees broadcast traffic coming in with its own MAC address as the source, it may detect a conflict and shut down its NIC for 30 seconds; some printers do this.


 

Author Comment

by:FASP
ID: 24299303
I set up a nice little laptop running Wireshark and plugged it into my switch, while I sat at my desk watching command prompt windows with ping -t on about 12 different computers, logged on to the table on the switch that shows MAC addresses matched with ports, ready to get a picture of it while the network disruption was happening... and for 3 hours nothing happened.

I have been systematically disconnecting devices at the switch hoping I could find the culprit, but it's so sporadic now that it makes this very time consuming to do.  I'll keep working on it.

I thought I had narrowed it down to one computer that had its NIC flow control set to generate & respond.  I set it to off and it seemed to help, but I still did get some packets dropped on the network.  I'll keep investigating and report anything I find unusual.

Thanks for the tips
 
LVL 23

Expert Comment

by:Mysidia
ID: 24300666
I would think about bringing in another switch at this point, even a 5-port switch, to plug a 'ping point' and another workstation into, to try and _prove_ the problem is with the Dell switch.

Although it's fairly new, there's still a possibility of a fault there, and it's the most likely point of failure that would disrupt many machines at once for such short, intermittent intervals without something obvious like a flood of traffic occurring.

By the way, if you have a managed switch, you should be able to turn off flow control  on the switch, and it's probably a good idea to do that.



 

Author Comment

by:FASP
ID: 24305033
I captured a wireshark file when this was going on this morning, (towards the end of the capture.)  It can be downloaded from http://faspems.dyndns.org:9000/shares/Web/WScapture.pcap    I could not attach a pcap file to this post and exporting the file to .txt didn't seem very helpful.

The capture was taken from a laptop plugged right into my Dell switch.  Maybe someone can see something in it I cannot.
 

Author Comment

by:FASP
ID: 24305166
I also captured the address table on the switch at the time.   I took a screenshot and attached it.  At the time of this screenshot my printers were dropping all packets.  The rest of the network had returned to normal.
AddressTable.jpg
 

Author Comment

by:FASP
ID: 24318697
I eventually went port by port down my switch one at a time until my network went a long period of time without packet loss.  The problem was only happening once every several hours so this was very time consuming.  Eventually, I had unplugged one of the workstations from the switch and the problem stopped, (I watched my network for over 5 hours.)

I wanted to confirm that the workstation's NIC or its cable was indeed the problem, so I plugged it back in and continued to monitor it, but no packets were dropped.  I'm thinking maybe there is some type of intermittent problem with that NIC or Cat5, and I'll keep an eye on it.  I'm guessing just unplugging its cable and plugging it back in was enough to correct the problem for now (maybe a poor connection?).
 
LVL 23

Expert Comment

by:Mysidia
ID: 24321693
Very possibly a problem with the cable or the NIC.

Unplugging the cable and plugging it back in may have reset some of the circuitry in the network card.

Or it may have affected the connection, if the cable was loose or isn't properly terminated, i.e. possibly an intermittent short at one end of the cable.
I would test that cable thoroughly and re-terminate the ends or replace that cable/drop if necessary.