Solved

Random, intermittent internet disconnect

Posted on 2011-05-12
Last Modified: 2012-05-11
For about the last month, we have had intermittent internet connection issues.  In a ping test we have 0% loss then suddenly 100% loss for 1-3 minutes (50-60 time outs), then 0% loss again.

The network is in a K-12 Private School.

The network configuration is:

Internet Cable >> Cable Modem >> Sonic Wall NSA-240 >> 24-port switch >> the rest of the network, about 85 computers.  One Windows 2008 server runs the network.

Time Warner has been here twice and replaced the cable modem twice.

Sonic wall tech support has crawled all over the settings and logs for the NSA240 3 times to make sure it is configured correctly, plus it was new in February.

I have had at least 3 different switches in place.

In the early stages of this issue, I removed the sonic wall and replaced it with a router; after a few days the problem returned.

And we have found some things wrong such as a bad cable modem, switch, cable and one computer, but the problem still persists.

The outage may occur many times a day or go 3 or 4 days with nothing.

It seems to go out more often at the beginning of a period or right at the end of the day, when a load is placed on the network.

The internal network continues to function fine.  There is no loss internally and those connected to the server work fine.  ONLY THE INTERNET CONNECTION IS DOWN.  So I have no reason to believe a specific computer is poisoning the network.

During an outage, I cannot ping the sonic wall nor the internet.  I thought resetting the devices helped, but I am not so sure because it usually comes back on its own.

Disconnecting the network a cable at a time from the switch to see if the removal of part of the network corrects the problem quicker did not show anything.  I disconnected all but the laptop doing the ping test and the outage still continued.

Time Warner did upgrade its equipment in our area to accommodate higher data speeds, and our problem started during that period. But Time Warner has repeatedly claimed it is not them after checking their system, and I do think the trigger is internal, not external.

We do have a mixed network in that Teachers often use their own laptops, plus there are 2 private residences (On site parsonages) and a church network on the system.

Unless there is a breakthrough soon, one of my next steps is to prohibit non-school computers from the network and see whether a private computer is messing things up.

I have looked through the server logs and while there are some error messages, nothing coincides with the outages.

I have consulted with two other IT guys and they had no suggestions on what to try next.

So the question is:

What have I missed and what haven't I tried?  I am looking at the antivirus to make sure the Avast is keeping everything updated.  I do see some issues there and it may become a different post.

I'm attaching a bunch of points to this question because I don't think it is going to be easy to resolve.

Thank you.

Jerlo
Question by:Jerry Thompson
21 Comments
 
LVL 15

Accepted Solution

by:
Robert Sutton Jr earned 375 total points
ID: 35748144
I have a few questions. What type of service do you get from your ISP? When an outage occurs, is someone checking your modem to see if it has lost sync with your carrier (TW)? Basically, I'm trying to figure out whether, when you have an outage, someone has verified your broadband connection at that moment. In addition, have they verified the LAN during the same outage?

A small side note: almost EVERY ISP or telco out there will vehemently deny anything is wrong on their "end" until you prove the problem is outside of your network.
0
 
LVL 10

Assisted Solution

by:atlas_shuddered
atlas_shuddered earned 375 total points
ID: 35750113
I agree with the above. The other two items that I would interject are:

1:  Have you actually sniffed your WAN uplink from your switch and looked for spiking traffic there?
2:  Assuming that you are running at least 100Mbps on your LAN, and your Internet path is running at most 10Mbps, it may not be a good idea to dismiss the internal hosts without actual diagnostic proof.  Have you tried evaluating the problem over the weekend, when your users are not there?
0
 
LVL 3

Assisted Solution

by:fritz5150
fritz5150 earned 375 total points
ID: 35750170
Based upon your statement above about not being able to ping either the sonicwall or the internet during an outage, we can rule out that this is an ISP or cable modem issue. You should always be able to ping the internal address of the sonicwall, even during a full internet outage. When an outage happens, can you do a constant ping to the sonicwall IP and then unplug the cable modem from the WAN port? This will rule out a DDoS attack from the outside as being the problem. The only way that you would not be able to ping the sonicwall is if the CPU usage was too high (DDoS attack) or there was a problem with either the sonicwall, the switch it's connected to, or the cable connecting the 2 devices.

Another option you have is to put a port on the switch connected to the sonicwall into monitoring mode, plug into a computer on that port and run an analysis using wireshark to see what is actually going on traffic wise.
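The constant-ping test described above could be scripted roughly as follows. This is only a sketch, assuming Python 3 on the test laptop and the operating system's own ping command; the function names (ping_once, classify, monitor) are made up for illustration, and the ping flags shown are the Windows and Linux ones. It pings the firewall's LAN address and an outside address in turn and prints a timestamped line whenever the picture changes, so an outage can be labeled LAN-side or WAN-side after the fact.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one echo request with the system ping command; True if it exits with 0."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Linux iputils syntax; other systems may interpret -W differently.
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return done.returncode == 0


def classify(firewall_ok: bool, internet_ok: bool) -> str:
    """Turn the two ping results into a short label."""
    if firewall_ok and internet_ok:
        return "ok"
    if firewall_ok:
        return "wan-side"         # firewall answers, Internet does not
    if internet_ok:
        return "firewall-silent"  # traffic still passes, firewall ignores ping
    return "lan-side"             # cannot even reach the firewall


def monitor(firewall_ip: str, internet_ip: str, interval_s: float = 1.0) -> None:
    """Print a timestamped line whenever the label changes. Runs until Ctrl+C."""
    previous = None
    while True:
        state = classify(ping_once(firewall_ip), ping_once(internet_ip))
        if state != previous:
            print(datetime.now().isoformat(timespec="seconds"), state)
            previous = state
        time.sleep(interval_s)
```

To use it, call something like monitor("192.168.1.1", "8.8.8.8"), where both addresses are placeholders for the firewall's real LAN IP and any outside host that answers ping. A "lan-side" label during an outage points at the firewall, the switch, or the cabling between them rather than at the ISP.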

0
 

Author Comment

by:Jerry Thompson
ID: 35750637
Thank you for your responses and I will try my best to address your questions.

Warlock:  

We are a K-12 private school, and a few years ago someone (I've been told Bill Clinton) made it a law for schools to get free, basic internet, which means the lowest of the low from a broadband standpoint.

We get 5 Mbps down and 0.384 Mbps up.  To increase our bandwidth, we'd have to sign up for a regular account and currently we are unable to afford that.
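For a sense of scale, here is a rough back-of-the-envelope calculation of what those rates mean if all 85 computers mentioned in the question were active at once. The even split is a simplifying assumption (real traffic is bursty and never shared equally), and the function name is made up for illustration.

```python
def per_host_kbps(link_mbps: float, hosts: int) -> float:
    """Even share of a link, in kbps, if every host is active at once."""
    return link_mbps * 1000 / hosts


down = per_host_kbps(5, 85)     # downstream share per computer
up = per_host_kbps(0.384, 85)   # upstream share per computer
print(f"down: {down:.1f} kbps, up: {up:.1f} kbps per host")
```

That works out to roughly 58.8 kbps down and 4.5 kbps up per computer, so a single busy machine could plausibly fill the upstream link by itself.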

I am not sure how to test whether our modem has lost sync with the carrier.  We do have a dynamic IP.  I have not checked to see if it is changing with each outage.  That would be annoying, but I'll check the next time I know of an outage.

I have not had a way to verify the connection during an outage other than a ping test.

One thing I discovered today is the sonic wall logs show no interruption of service during an outage.  Example:  Today we lost our connection for a couple of minutes somewhere between 8:15 and 8:25 AM.  The sonic wall log does not show any error nor are there gaps in the time stamp.  Yet I could not ping it for 2 minutes??  Is it possible to have a bad port only on the sonic wall??
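To line outages up against firewall log timestamps, it helps to have exact start and end times rather than "somewhere between 8:15 and 8:25". Below is a small sketch, assuming ping results have already been collected as (timestamp, success) pairs in time order; the function name and the sample data in the usage note are invented for illustration.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

Sample = Tuple[datetime, bool]
Window = Tuple[datetime, datetime, int]


def outage_windows(samples: Iterable[Sample]) -> List[Window]:
    """Group consecutive failed pings into (first_failure, last_failure, count)."""
    windows: List[Window] = []
    start = last = None
    count = 0
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts  # a new outage begins here
                count = 0
            last = ts
            count += 1
        elif start is not None:
            windows.append((start, last, count))  # outage ended on the previous sample
            start = None
    if start is not None:
        windows.append((start, last, count))  # still failing when the data ends
    return windows
```

Feeding it one sample per second would turn a run of 50-60 timeouts into a single window with an exact start and end, which can then be compared with the sonic wall log for the same minutes.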

Atlas:  

I am aware of wireshark and it is installed on my work station, but it baffles me.  It provides a lot of data, but I am ignorant of how to interpret the data and to set up proper monitoring.

While I am a network admin, most of my training has been as needed through situations like this one.

Your question 1.  No, I have not sniffed the WAN uplink and I am not sure how to do that.

2.  Time Warner monitored over a weekend and it showed no excessive spikes and on that weekend, the parsonages on site did not report any outages.

Fritz:

I have not disconnected the  cable modem from the WAN port during an outage and will do so next event.

I like the idea of activating one of the sonic wall ports and putting it into monitoring mode.  I will investigate that tomorrow.

Thank you all for your responses.

jerlo
0
 
LVL 15

Expert Comment

by:Robert Sutton Jr
ID: 35751123
Well, that's where you need to focus your efforts (the WAN side). Since you already stated that it doesn't affect your LAN, I would look at a few different things... Your NSA 240 probably won't report issues, since its WAN input comes from the cable modem's Ethernet handoff, which stays up regardless of whether the cable modem loses signal to the carrier...

At one point above you mentioned that you could not ping the sonicwall or the Internet when it occurs? Is this true?
1)Check DHCP lease times
2) Is the cable modem centrally located or next to the demarc? Is the wiring old or new? Have input levels been checked at the end of the cable that attaches to the cable modem during any tech visit?

You shouldn't lose any abilities on your LAN or your NSA when you lose internet.
0
 
LVL 11

Expert Comment

by:pmasotta
ID: 35751784
I think you might have a duplicate IP problem here; probably a device fighting for the default gateway IP,
e.g. a wireless router with a fixed IP identical to the one the Sonic Wall has.

Everything goes OK until a station performs an ARP on the default gateway IP and receives 2 answers…
If it takes the answer that corresponds to the Sonic Wall, that station temporarily has Internet access and
the Sonic Wall responds to ping, but a station that takes the answer from the offending device will not have Internet access.

Try uploading a wireshark pcap file when the problem occurs and I can tell you if this is your case.
0
 
LVL 10

Expert Comment

by:atlas_shuddered
ID: 35754271
fritz notes above how to sniff your WAN/Internet uplink.

The personal devices you mentioned above are being used by folks in your parsonage correct?  If so, these same individuals are using them through the weekend as well?

Have you performed a device-by-device malware sweep on each of the PCs (with AV other than what is loaded onto the device itself)?
0
 
LVL 11

Expert Comment

by:pmasotta
ID: 35754479
I think chasing malware does not make much sense when you can see the traffic. If there is an offending station, its displayed IP & MAC address solves the ID problem immediately...
0
 
LVL 10

Expert Comment

by:atlas_shuddered
ID: 35754555
I would agree; however, he has already noted that he isn't particularly comfortable with reading the output from Wireshark.
0
 
LVL 3

Expert Comment

by:NetFixr-Dani
ID: 35754691
I think you may have a device sending out Gratuitous ARPs and therefore causing all the other devices to go offline.

This would most typically occur if another device is configured with a duplicate IP (as was mentioned by pmasotta) and is booting up, but can also be caused by a piece of malware that is trying to play man-in-the-middle.

You should consider setting up arpwatch (http://en.wikipedia.org/wiki/Arpwatch) on a box plugged into the LAN and monitoring its output.
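If installing arpwatch is not practical, a crude version of the same check can be sketched in a few lines. This is not arpwatch itself, just an illustration of the idea, assuming you save the text output of "arp -a" from a LAN machine every so often and feed the snapshots to the script. The function names are hypothetical, and the pattern below is only meant to cover the usual Windows and Linux output formats.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# An IPv4 address followed somewhere later on the same line by a MAC address.
ARP_LINE = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3}).*?((?:[0-9a-fA-F]{2}[:-]){5}[0-9a-fA-F]{2})"
)


def parse_arp_table(text: str) -> List[Tuple[str, str]]:
    """Extract (ip, mac) pairs from 'arp -a' style output, one pair per line."""
    pairs = []
    for line in text.splitlines():
        match = ARP_LINE.search(line)
        if match:
            ip, mac = match.groups()
            pairs.append((ip, mac.lower().replace("-", ":")))
    return pairs


def find_conflicts(observations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return every IP that has been seen with more than one MAC address."""
    seen = defaultdict(set)
    for ip, mac in observations:
        seen[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}
```

A single snapshot normally shows one MAC per IP, so the conflicts only show up when pairs from many snapshots taken over time are combined before calling find_conflicts. If the gateway's IP ever appears with a second MAC, that second MAC identifies the device to hunt down.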

- Dani
0
 
LVL 11

Expert Comment

by:pmasotta
ID: 35759171
there are hundreds of approaches and hundreds of products; arpwatchers, anti malware, anti mother in law etc etc...
but all of them (when they work..) are focused on one of the many possible problems that fit your description.
Do yourself a favor, take an hour of your time and learn Wireshark, at least to learn how to get a good capture.
You'll soon see you do not need to be an expert for getting good results with wireshark.
0
 

Author Comment

by:Jerry Thompson
ID: 35759893
Again thank you for your ideas.  I appreciate them.

I could not find any duplicate IP addresses.

But I did discover that my Avast anti-virus is not distributing updates correctly nor could I initiate a remote scan on 99% of the computers.  So a bit of malware or a virus is a possibility.

For this weekend, I shut down about 95% of the computers on the network and am monitoring the activity to see if the outages continue over the weekend.  If they do not, then it might be a specific computer.  If they do, I know what has been left on and will start examining them closer.

Thanks again.

Jerlo
0
 
LVL 39

Assisted Solution

by:ChiefIT
ChiefIT earned 375 total points
ID: 35767022
It may be a bad forwarder within DNS that you are running into. LOTS of things can cause intermittent connectivity to the internet. Duplex settings, DNS forwarders, conflicting IPs, bad service, bad outside plant, weather (if a wireless, satellite, or optical connection), Network maintenance, Network flooding, broadcast floods, bad nic configuration, bad wiring, water in cables, etc....

To troubleshoot, don't take a shotgun approach. Pinpoint the location with PING. Ping by IP address, then by fully qualified domain name. That will help you decide if it's a DNS issue or a routing problem. Ping to the furthest point you can find. You can use TRACERT to trace the IPs of the route your traffic takes. Write down those IPs and then follow that list to pinpoint where the error is.

Once you determine the troubled spot, and what protocol it is, you have a lot more to work with.
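The IP-versus-name comparison above can also be scripted. The sketch below swaps ping for a plain TCP connection attempt and a name lookup, since both are available from Python's standard socket module without special privileges; the function names, the default port, and the addresses in the usage note are assumptions for illustration.

```python
import socket


def can_resolve(name: str) -> bool:
    """True if the system resolver returns at least one address for the name."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def can_connect(ip: str, port: int = 53, timeout_s: float = 2.0) -> bool:
    """True if a TCP connection to ip:port opens within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def diagnose(ip_ok: bool, dns_ok: bool) -> str:
    """Combine the two results into a short label."""
    if ip_ok and dns_ok:
        return "ok"
    if ip_ok:
        return "dns"           # reachable by IP, but names do not resolve
    if dns_ok:
        return "inconclusive"  # name resolved (possibly from cache), IP test failed
    return "connectivity"      # not reachable by IP: look at routing or the WAN link
```

During an outage, something like diagnose(can_connect("8.8.8.8"), can_resolve("www.example.com")) gives a first hint; both arguments are placeholders for an outside host that accepts TCP on the chosen port and a name that normally resolves. A "dns" result points at the forwarders, while "connectivity" points back at the firewall or the link.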
0
 

Author Comment

by:Jerry Thompson
ID: 35797538
Sorry for the lack of response.  

I found two problems.

1.  Main switch for half the network was failing in a similar way to the internet connection.  It would go out at random times and come back in 5 minutes.  It has been replaced.

I hoped the two were related, but not so.

2.  It looks like the sonic wall itself is either corrupted or hardware defective.  I put a switch between the sonic wall and cable modem.  When the sonic wall went out, the cable modem and the internet did not.

Next step is to contact the sonic wall tech support and see if it is fixable or not.  Some of the DNS issues and IP conflicts may still apply, but I don't know yet.

Thank you for your help; I will likely post again after I speak with the sonic wall tech support.
0
 
LVL 39

Expert Comment

by:ChiefIT
ID: 35806533
There are two things that can cause this issue. One is spanning tree; the second is duplex settings.

Cisco has a quirk in its IOS software. The duplex settings of all switches and routers need to be exactly the same. You can either hard-code them all to xxxFull duplex or set them all to auto-negotiate, but all of their neighboring switches/routers have to match exactly.

Yet another quirk is spanning tree. Spanning tree will disable any dual connections to prevent an L2 loop. So, if you decided to connect two switches together for faster throughput, spanning tree will disable one.

Since you are seeing this every 5 minutes then a 15 minute break, I would be willing to bet the duplex settings are incorrect.

Both duplex settings and also spanning tree can cause intermittent communications between switches and routers.
0
 

Author Comment

by:Jerry Thompson
ID: 35807312
Thanks for the input ChiefIT.

Perhaps I was not clear.  While the duration of the disconnect is typically 3-5 minutes, the occurrence is completely random.

Example: yesterday we had 2 disconnects in a half hour, then the rest of the day there was nothing.

All my switches are unmanaged so I am not sure how to alter settings on them.

But I do have a situation where 3 switches are chained together to get enough ports.  I need to check those to make sure there are not 2 cables connecting the same two switches.  I am careful about such things, but you never know.

Thank you for your thoughts.

Jerlo
0
 
LVL 39

Expert Comment

by:ChiefIT
ID: 35807857
If duplex settings are a factor, you should see amber lights that show collisions on your switches. Since these are unmanaged or smart switches, maybe make sure you do NOT have spanning tree enabled on ACCESS ports (meaning computer ports). Spanning tree protocol prevents L2 loops and will need to be enabled between switches and routers.
0
 
LVL 10

Expert Comment

by:atlas_shuddered
ID: 35807860
??

First, spanning tree.  If it is behaving correctly, you will never notice the links going up and down.  It will only shut down the redundant link (which will not come up until the first one goes down).  If it is not behaving correctly, your entire network will slow to a crawl, and that will persist, due to the broadcast storm.  It isn't going to dump for a few minutes and then come back up.

jerlo - keep pursuing the path you have noted above about your sonicwall.  If you aren't going into the switches to tweak them and if you have already gotten some degree of isolation by introducing the switch between the Internet devices, then finish isolating before jumping to something that really is a stretch.

Also, given that you have two devices that are failing at once, have you got these things sitting on UPS/Power protection?  If so, when was the last time it was changed?
0
 
LVL 10

Expert Comment

by:atlas_shuddered
ID: 35807888
Spanning tree won't have an effect on access ports beyond slowing down the initial connection of a device to the network - would be a potential problem if you are running DHCP, but again, you would have noticed this long before now.

Spanning tree protocol does prevent L2 loops; therefore it has nothing to do with your router(s).

If spanning tree is your issue, you would have noticed a long time ago - unless

1.  You are plugging something into your network when the problem occurs (at which point I would assume you would have put that together by now)
2.  One of your users is plugging something in (at which point you would need to find out who).

In either case, your LAN would die, not your Internet connection.

0
 

Author Comment

by:Jerry Thompson
ID: 35826958
I have learned some things since Sunday that I had not experienced nor had I considered possible.

On Sunday I got a call about a switch in a different part of the network going bad.  I had just replaced the switch with a brand new one because I thought the old one was bad.  It was doing something similar to the sonicwall in that it would disconnect from the network and then reconnect.  The switch was not shutting off; the signals were just blocked.

I disconnected every Ethernet cable from the device except the uplink and the laptop and it was still blocking signals.  I unplugged it from the surge protector and plugged it into the wall, and it worked perfectly.

I moved the plug back and forth, and it was bad only when it was plugged into the surge protector.

I changed the surge protector on the sonicwall on Monday (yesterday) and since then there has been no ping loss.

There were 3 devices plugged into that surge protector and none of them has displayed any  bad behavior (except the sonicwall).  So I am going to reserve judgment for a few days and see what happens.

But I never expected an electric outlet could mess up a device without turning it off or smoking.

I think it was warlock that talked about the integrity of the signal or wiring.  He was pretty close.

Thank you all for your input.  I will post the points when I have more data about whether it was the sonicwall or what the sonic wall was plugged into.

Jerlo
0
 

Author Closing Comment

by:Jerry Thompson
ID: 35930014
Thank you all for your help.
0
