random bursts of dropped packets

My network is a Sonicwall TZ 210 UTM connected to a Dell 48-port unmanaged switch. A server running Windows SBS 2003 handles DHCP and DNS, hosts the Exchange server, and is my print server. There are also 2 network copier/printers, 1 NAS device, and 20 or so workstations running XP Pro. There is one domain on a private class C address range with a /24 subnet. I use a DHCP scope of 192.168.1.30-192.168.1.100. The subnet mask is 255.255.255.0. The Sonicwall has a static IP and is the gateway. The printers have static IPs outside the DHCP range. Basically, everything is configured correctly.
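
For reference, the addressing described above can be sanity-checked with a short script. This is only a sketch: the scope and subnet come from the post, while the sample static addresses and the `check_static` helper are made up for illustration.

```python
import ipaddress

# Subnet and DHCP scope as stated in the post.
SUBNET = ipaddress.ip_network("192.168.1.0/24")
SCOPE_START = ipaddress.ip_address("192.168.1.30")
SCOPE_END = ipaddress.ip_address("192.168.1.100")


def check_static(ip_text):
    """Classify a statically assigned address.

    Returns 'outside subnet', 'inside DHCP scope', or 'ok'.
    """
    ip = ipaddress.ip_address(ip_text)
    if ip not in SUBNET:
        return "outside subnet"
    if SCOPE_START <= ip <= SCOPE_END:
        return "inside DHCP scope"
    return "ok"


if __name__ == "__main__":
    # Hypothetical static addresses, not taken from the real network.
    for addr in ["192.168.1.10", "192.168.1.50", "192.168.2.10"]:
        print(addr, check_static(addr))
```

A static address that lands inside the scope is a candidate for an IP conflict, which is worth ruling out before chasing cabling.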

Approx. 3-4 of the workstations run billing software that accesses the SQL server. Lately, these workstations have been getting popup errors about a lost connection to the SQL server. The company that handles the billing software checked the configuration of these workstations and of the SQL server itself, and they assure me it is a problem with my network.

I was running ping -t and watching constantly on both servers, the 2 printers (just because I remembered their IP addresses off the top of my head), and a few random workstations. Every so often, with no discernible pattern, every computer I was running ping -t against would drop a packet or 2 at exactly the same time. Right after that, one of the printers would drop all packets for 30 seconds to a minute, while the other computers I was watching resumed normal ping responses. I watched this happen 7-8 times over the course of an afternoon and thought I had it narrowed down (after testing the Cat5 to the printer) to the printer's NIC. It's also important to note that these events are exactly when the workstations running the billing software throw their errors about losing the connection to the SQL server.
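
Watching a dozen ping -t windows by eye is error-prone, so here is a rough sketch of how the "everything drops at the same moment" pattern could be picked out from logged loss times instead. It assumes you have already recorded, per host, the timestamps (in seconds) of each lost ping; the `simultaneous_drops` function, the host names, and the 2-second window are all illustrative choices, not part of any tool.

```python
def simultaneous_drops(drops, window=2.0, min_hosts=3):
    """Find bursts where several hosts lost pings at about the same time.

    drops maps host name -> list of timestamps (seconds) of lost pings.
    Returns a list of (start_time, sorted_host_names), one entry per burst
    in which at least min_hosts distinct hosts lost a ping within `window`
    seconds of the first loss in that burst.
    """
    events = sorted((t, host) for host, times in drops.items() for t in times)
    bursts = []
    i = 0
    while i < len(events):
        start = events[i][0]
        hosts = set()
        j = i
        # Collect every loss that falls inside the window opened at `start`.
        while j < len(events) and events[j][0] - start <= window:
            hosts.add(events[j][1])
            j += 1
        if len(hosts) >= min_hosts:
            bursts.append((start, sorted(hosts)))
        i = j
    return bursts


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = {
        "server": [100.0, 500.0],
        "printer1": [100.5, 101.5, 130.0],
        "ws1": [101.0],
    }
    print(simultaneous_drops(sample))
```

Isolated losses on a single host are ignored, so what is left is the network-wide events, which can then be lined up against the times of the SQL connection errors.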

The strange thing is that the last time it happened, it was not the same printer that dropped all packets from the ping -t for 30 seconds to a minute, but rather the OTHER printer. That blew my bad-NIC idea out of the water. I tested the cable to the other printer and it checked out fine. I'm back at square one trying to figure out what's going on.

I did monitor network traffic with a laptop connected to my switch and running Wireshark, but I'm not an expert and really didn't see anything jump out at me when these events happened.

I had a problem with a bad cable before and used a similar method to track it down, but I really have no clue how to track this problem down or even what might be causing it.

Any additional suggestions would be greatly appreciated.

Mal Osborne

Check duplex. Having half at one end and full at the other can cause some odd problems.

FASP

ASKER

Checked them; all are set to auto.

Pierellie

How many Client Access Licenses do you have on your SBS server? Could it possibly be too many simultaneous connections to that server? Instead of dropping packets, though, it should just refuse the connection... so maybe not.

Mysidia

So you were running ping -t constantly from one location, pinging the IPs of several workstations, servers, and printers?

Are your netmasks and IP address range the same on all your devices? (So no traffic between PCs is passing through the router)
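
One way to answer that question without walking to every machine is to list each device's IP and netmask and compare the resulting networks. The sketch below assumes you have collected those values by hand; the `off_subnet` helper and the device entries are hypothetical.

```python
import ipaddress


def off_subnet(devices, expected="192.168.1.0/24"):
    """Return the names of devices whose IP/netmask is not on `expected`.

    devices maps a device name -> (ip, netmask) as dotted-quad strings.
    """
    want = ipaddress.ip_network(expected)
    bad = []
    for name, (ip, mask) in devices.items():
        # ip_interface accepts "address/netmask" and derives the network.
        net = ipaddress.ip_interface(f"{ip}/{mask}").network
        if net != want:
            bad.append(name)
    return sorted(bad)


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    inventory = {
        "server": ("192.168.1.2", "255.255.255.0"),
        "printer": ("192.168.1.20", "255.255.0.0"),
        "nas": ("192.168.2.5", "255.255.255.0"),
    }
    print(off_subnet(inventory))
```

Anything this flags would be sending at least some of its local traffic via the gateway, or failing to reach some local hosts at all.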

Then either (a) that one PC you were pinging everything from lost connectivity to the switch, OR... (b) all the nodes you were watching lost connectivity to the switch, for a moment. Since that's the same time as the SQL connectivity issue, I am thinking (b) is a lot more likely, or it may be both (a),(b).

I'm thinking of a few possible causes for (b).

[i] One of your computers could have a NIC or cabling problem and be spewing out invalid transmissions, and your dumb unmanaged switch may be choking on them.

[ii] The switch or some PC connected to it may not be properly grounded. Noise coming in could get interpreted as a continuous network signal or errant traffic, resulting in collisions that should not occur on devices connected to a switch.

[iii] Layer 2 issue (most likely).
Your entire switch could be failing, maybe it's a power loss, maybe it's a defective component. I would suggest testing your UPS, if the switch has its own UPS.

I would strongly advise replacing the 48-port unmanaged switch with a 48-port MANAGED switch like a used Cisco WS-C3550-48-SMI, or whatever you can find. Give the switch an IP, enable portfast on all the ports, and turn on logging.

And watch interface stats for amount of traffic and errors.

The key when choosing a managed switch is that it should support logging and let you check interface status and counters.

What you are experiencing is an issue that should be troubleshot from the switch, but you can't do that because your switch lacks basic capabilities that all modern switches should have.

[iv] There is a possibility of someone creating a temporary loop, or of a temporary flood occurring on the network. It would be so severe that it stops almost all traffic, but brief.

This is pretty unlikely, since you're only experiencing a 2-second disruption. You could troubleshoot by running Wireshark on your PC and looking at what packets are captured just after a disruption.
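
If scrolling through the capture by eye doesn't show anything, counting broadcast frames per second can make a brief flood stand out. Below is a minimal sketch that reads a capture with nothing but the standard library. It assumes the file is in the classic pcap format with microsecond timestamps and an Ethernet link type (not pcapng, which newer Wireshark versions write by default); the `broadcasts_per_second` function is made up for this example.

```python
import struct
from collections import Counter

BROADCAST = b"\xff" * 6  # Ethernet broadcast destination MAC


def broadcasts_per_second(data):
    """Count Ethernet broadcast frames per whole second.

    data is the raw bytes of a classic pcap file. Returns a Counter that
    maps each timestamp (whole seconds) to the number of broadcast frames.
    """
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written little-endian
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written big-endian
    else:
        raise ValueError("not a classic microsecond pcap file")

    counts = Counter()
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Per-packet header: ts_sec, ts_usec, captured length, original length.
        ts_sec, _usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        # The destination MAC is the first 6 bytes of an Ethernet frame.
        if frame[:6] == BROADCAST:
            counts[ts_sec] += 1
    return counts
```

To use it, read the capture with `open(path, "rb").read()`, pass the bytes in, and look at `counts.most_common(10)`. A second or two with a broadcast count far above the rest would point towards a loop or storm; a flat profile would make that explanation less likely.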

FASP

ASKER

OK, I checked the switch. It is a Dell Powerconnect that was unmanaged but can be switched to managed mode. The switch itself is only 2 months old.

I did try Wireshark and didn't notice anything special right after one of these events happened. I'm no expert at reading it, though, so I will get another capture Monday and maybe someone with more experience than me can take a peek at it.

Would I get a better capture if I turned the switch to managed mode, mirrored every port's traffic to one specific port on the switch, and ran Wireshark on the PC connected to that port, so it sees all traffic? Or am I making it more complicated than it has to be?

Mysidia

Firstly, I would turn on the switch's management capability and run the ping to the switch instead of to other PCs. Connect a serial cable to that switch's console management port and watch it.

The most obvious indication of a problem with the switch would be if you see the switch is rebooting.

However, a managed switch will generally take much longer than 2 seconds to reboot.

Depending on the specific model, there may be error counters and system logs you can check.

Mysidia

By the way, either printer dropping everything for 30 seconds may actually point towards an 'undesired loop in your network' scenario.

What you want to do is look at the 'forwarding' table on your switch. The command line interface or menu should provide you a way of listing all MAC addresses and what ports they're associated with.

See if you can get a capture of what it says during a 2-second disruption, when a printer becomes unpingable.

If a loop is temporarily forming, what may happen is the "looped port" sends some broadcast packets back to the port they came in on for a short time.

The result is your switch could think all the MAC addresses are coming from that one port (if that one port is getting all your traffic, then well, the other hosts aren't getting their traffic).
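
To make that comparison concrete, here is a rough sketch that diffs two snapshots of the forwarding table, one taken when the network is healthy and one taken during a disruption. It assumes you have typed or exported the table as MAC-to-port pairs; the `suspicious_ports` function and the threshold of 5 are illustrative, not anything the switch provides.

```python
from collections import Counter


def suspicious_ports(before, during, threshold=5):
    """Find ports that suddenly 'own' MAC addresses from other ports.

    before and during each map a MAC address string -> switch port number.
    Returns the sorted list of ports that, in `during`, hold at least
    `threshold` MACs that were learned on a different port in `before`.
    """
    moved = Counter()
    for mac, port in during.items():
        old = before.get(mac)
        if old is not None and old != port:
            moved[port] += 1
    return sorted(p for p, n in moved.items() if n >= threshold)


if __name__ == "__main__":
    # Made-up tables: ten hosts on ports 1-10, six of which "move" to port 24.
    healthy = {f"00:00:00:00:00:{i:02x}": i for i in range(1, 11)}
    disrupted = dict(healthy)
    for i in range(1, 7):
        disrupted[f"00:00:00:00:00:{i:02x}"] = 24
    print(suspicious_ports(healthy, disrupted))
```

A port that turns up here is where I would start unplugging, since whatever hangs off it is reflecting other hosts' traffic back into the switch.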

The Dell switch, even in unmanaged mode, might be trying to do something "smart" like storm control or loop avoidance (blocking the looped port, or dropping all traffic when it's transmitting at too high a packet rate), but that isn't 100% effective.

Make sure you don't have any possibly misconfigured device plugged into your network that "can be a DHCP server" but isn't supposed to be.

Loops can be accidentally created, for example, if you have a wireless AP, and a laptop both plugged into the AP and plugged into the wired LAN, but with the laptop misconfigured to BRIDGE traffic between the wireless card and the wired Network Card.

Or if someone sets up a 5-port switch at their workstation and mistakenly does something like Network Drop -> 5-port switch -> IP Phone -> 5-port switch (phone plugged in twice).

If the printer sees broadcast traffic coming in with its own MAC address as the source, it may detect a conflict and shut down its NIC for 30 seconds; some printers do this.

FASP

ASKER

I set up a nice little laptop running Wireshark plugged into my switch, while I sat at my desk watching command prompt windows with ping -t running against about 12 different computers. I was also logged on to the switch's table that matches MAC addresses to ports, ready to get a picture of it while the network disruption was happening... and for 3 hours nothing happened.

I have been systematically disconnecting devices at the switch hoping I could find the culprit, but it's so sporadic now that it is very time-consuming to do. I'll keep working on it.

I thought I had narrowed it down to one computer that had its NIC's flow control set to generate & respond. I set it to off and it seemed to help, but I still did get some packets dropped on the network. I'll keep investigating and report anything unusual I find.

Thanks for the tips

Mysidia

I would think about bringing in another switch at this point, even a 5-port switch, to plug a 'ping point' and another workstation into, to try and _prove_ the problem is with the Dell switch.

Although it's fairly new, there's still a possibility of a fault there, and it's the most likely point of failure that would be disrupting many machines at once for such short, intermittent intervals without something obvious like a flood of traffic occurring.

By the way, if you have a managed switch, you should be able to turn off flow control on the switch, and it's probably a good idea to do that.

FASP

ASKER

I captured a Wireshark file while this was going on this morning (the event is towards the end of the capture). It can be downloaded from http://faspems.dyndns.org:9000/shares/Web/WScapture.pcap (I could not attach a pcap file to this post, and exporting the file to .txt didn't seem very helpful).

The capture was taken from a laptop plugged right into my Dell switch. Maybe someone can see something in it I cannot.

FASP

ASKER

I also captured the address table on the switch at the time. I took a screenshot and attached it. At the time of this screenshot my printers were dropping all packets. The rest of the network had returned to normal.
AddressTable.jpg

FASP

ASKER

I eventually went down my switch one port at a time until my network went a long period without packet loss. The problem was only happening once every several hours, so this was very time-consuming. Eventually, I unplugged one of the workstations from the switch and the problem stopped (I watched my network for over 5 hours).

I wanted to confirm that the workstation's NIC or its cable was indeed the problem, so I plugged it back in and continued to monitor it, but no packets were dropped. I'm thinking maybe there is some type of intermittent problem with that NIC or Cat5, and I'll keep an eye on it. I'm guessing just unplugging its cable and plugging it back in was enough to correct the problem for now (maybe a poor connection?).

Mysidia

Very possibly a problem with the cable or the NIC.

Unplugging and plugging back in may have reset some of the circuitry in the network card.

Or it may have affected the connection, if the cable was loose or isn't properly terminated, i.e. possibly an intermittent short at one end of the cable. I would test that cable thoroughly and re-terminate the ends or replace that cable/drop if necessary.