Network weirdness/slowness/drops - mostly shows on visiting new websites

davebuhl
We are experiencing weird issues whose major symptom is slow, or no, loading of new websites.  When visiting a page you haven't visited, it will either take a very long time, or not load at all.  Refreshing once or twice will usually bring the page up.  From that point on, clicking new links on that page will be as fast as normal.  This occurs on all systems, with all browsers.  This is a new issue that started on Monday. There were no configuration changes over the weekend.  We have two ISPs, and the problems occur on both ISPs, so it's not an ISP issue.

It seems like a DNS problem, but DNS resolves very quickly.  The DNS logs, when turned on, show very quick responses and no errors. Changing internal and external DNS servers made no difference.  Visiting web pages by IP address made no difference.  Pinging internal and external hosts by name always returns immediately.

While it mostly presents as an internet issue, we do see occasional issues internally.  Some dropped pings (by name and IP), some issues accessing local services (NAS device, internal web interfaces).  However, these are very very sporadic.

We have a Sonicwall and have played with turning things off one at a time, and everything off at the same time (App Control, content filtering, DPISSL, antivirus, antispam...), as well as administrator logins.  This made no difference.

We restarted all our equipment, and nothing seems to have an impact.  The Sonicwall is not stressed, the Cisco switches are not stressed.

Any ideas?  We could wireshark, but our network has 3000 devices on it, so I don't think fishing through Wireshark will be very fruitful.
Yuri Spirin, Systems Integration

Commented:
This behavior looks like an MTU issue, although the problem is present on both ISPs. You might try reducing the MTU on the firewall interfaces that connect to the ISPs. What type are the ISP connections: Ethernet, PPPoE, other?
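One way to test the MTU theory before changing the firewall is to probe the path with don't-fragment pings. This is only a minimal sketch: the function names `ping_fits` and `largest_unfragmented_mtu` are made up for illustration, the probe assumes a Linux host with the iputils `ping` (on Windows the equivalent flags are `-f -l`), and the 576 to 1500 search range is an assumption. ICMP can be filtered or rate limited along the way, so treat the result as a hint rather than proof.

```python
import subprocess

IP_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_fits(host, payload):
    """Send one don't-fragment ping with `payload` data bytes (Linux iputils ping).

    Returns True if the ping command reports a reply.
    """
    cmd = ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host]
    return subprocess.run(cmd, capture_output=True).returncode == 0


def largest_unfragmented_mtu(probe, low=576, high=1500):
    """Binary-search the largest MTU whose matching payload passes `probe`.

    `probe` takes an ICMP payload size and returns True or False.
    Returns None if even the smallest size fails.
    """
    if not probe(low - IP_ICMP_OVERHEAD):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid - IP_ICMP_OVERHEAD):
            low = mid       # mid fits, look for something larger
        else:
            high = mid - 1  # mid is too big
    return low
```

For example, `largest_unfragmented_mtu(lambda n: ping_fits("www.time.com", n))` run once through each ISP would show whether either path tops out below 1500.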

Another guess: are the firewall routes to the ISPs set up as active/passive with different metrics, or as ECMP? In the latter case there might be problems like this, e.g. with HTTPS sites.
Jeremy Weisinger, Senior Network Consultant / Engineer

Commented:
You really shouldn't be getting any dropped packets when pinging on the same subnet.
What model of switches do you have? Can you check the port counters for errors? Is STP turned on? Do you have any little workgroup switches connected to the network?

Also, if you want to optimize your DNS settings, check out DNS Benchmark: https://www.grc.com/dns/benchmark.htm

Author

Commented:
The ping drops were always across subnets.  I haven't actually seen any, but have been told there were some (I am a little skeptical, but can't discount it).  We have a Cisco 3850 as our core switch, and 2960s dispersed.   Looking at utilization on the 3850, and in the logs, there really isn't anything troubling in there.  We get a few messages about MAC flapping from one port to another, but that is due to roaming wifi clients.
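To double-check that the MAC flapping really is just roaming wifi clients, the flap messages could be tallied per host and port pair. A rough sketch follows; the message pattern is an assumption based on the usual Catalyst `MACFLAP_NOTIF` wording and may need adjusting to the actual log output, and `summarize_flaps` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed message shape; adjust the pattern if your switch logs differ.
FLAP_RE = re.compile(
    r"MACFLAP_NOTIF: Host (?P<mac>[0-9a-fA-F.]+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<a>\S+) and port (?P<b>\S+)"
)


def summarize_flaps(lines):
    """Count flap messages per (MAC, VLAN, unordered port pair)."""
    counts = Counter()
    for line in lines:
        m = FLAP_RE.search(line)
        if m:
            # Sort the ports so A<->B and B<->A land in the same bucket.
            ports = tuple(sorted((m["a"], m["b"])))
            counts[(m["mac"].lower(), int(m["vlan"]), ports)] += 1
    return counts
```

If every port pair in the summary is a pair of access point uplinks, roaming is a reasonable explanation. A host flapping between ports that have no access points behind them would point more toward a loop.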

No errors (input and output) on any of the ports.
We do have unmanaged switches, but it doesn't look like a loop.  
STP is turned on.

DNS Benchmark reinforced the idea that this is not a DNS issue.  All green check marks.

Tracking something like this down is always maddening. I'd start (or continue) by eliminating variables as you've been doing. During whatever down hours you have, I'd go right to the gateway and disconnect the whole darn network. Heck, take the SonicWall out of the loop, assign a static IP and Google DNS to a laptop interface, and see if you experience these issues when connected directly to the ISP. Then walk it back: put the firewall back into service and connect the laptop to the LAN interface on the SonicWall with the rest of the network disconnected. If the problem disappears, reintroduce the core switch and perform those same tests you've been doing. Clearly, with a 3000 device network, downtime is hard to come by. My inclination would be a loop, which you've mentioned you've looked for, or possibly faulty cabling or a bad Ethernet transceiver somewhere, even RF interference I suppose. It is an unusual problem, so these odd culprits might be the cause.

Looking forward to hearing more about your plight and investigation!

Matt
I would suspect cabling. CAT5 that is over length, or poorly terminated, or crushed or otherwise not meeting specs can exhibit all sorts of intermittent weirdness.

If you don't know that your cabling is fantastic, it might be worth hiring some equipment and testing it. Simple continuity is NOT enough, you need a proper tester.

https://www.techrentals.com.au/Products_Detail.asp?ID=10031&productcode=FLU%2CDSX%2D5000
other wisevoice network engineer

Commented:
You should track it from layer one (physical layer) to layer seven (application layer), step by step, until you find where the problem is.
I have a Fluke Cable Qualifier and, although it was very expensive, it has been invaluable over the years in tracking down cabling issues or, conversely, proving when there are none!

Author

Commented:
Still no solution for us.  The problem gets worse, then better, which makes it hard to figure out if any changes I make have any effect.

MTUs on both ISPs were set to 1444.  The core switch was set to 1500.  I set the ISPs back to 1500.  No change in performance.

The two ISPs are set as failovers for each other.  One VLAN goes with one ISP, all others to the other ISP.  Should one go over my pre-set limit, it can utilize the other.  We aren't near those limits though.

Cabling: Almost all of the cabling is new in the last few years.  There are a couple of long runs, but none over spec.  Fiber to all the MDFs and ISPs.  This is also a site-wide problem, so one cable going bad should not impact the entire site, unless it was the cable between the core switch and the firewall.  I swapped that one, just in case, with no impact.

From one machine, I closed all browsing sessions/tabs and opened a blank Chrome.  Reset all internet history/cache/cookies.... Started Wireshark, then tried to visit www.time.com.  It spun for a while, then timed out at about 25 seconds.  I refreshed, and it started loading the page, though a bit slowly.  At the same time, I was able to ping www.cisco.com from the command line, with no delay in resolving the IP or in the ping response.  So it does not appear to be a DNS issue.
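A test like the one above can be made more precise by timing name resolution and the TCP handshake separately, since ping only exercises ICMP. This is a minimal sketch using only the Python standard library; `time_connection_phases` is a hypothetical helper name, and it only tries the first address returned for the host.

```python
import socket
import time


def time_connection_phases(host, port=443, timeout=10.0):
    """Return seconds spent on name resolution and on the TCP handshake."""
    t0 = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    t1 = time.perf_counter()

    # Only the first resolved address is tried in this sketch.
    family, socktype, proto, _, addr = infos[0]
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        s.connect(addr)  # raises OSError / socket.timeout on failure
        t2 = time.perf_counter()

    return {"address": addr[0], "dns_s": t1 - t0, "connect_s": t2 - t1}
```

Calling `time_connection_phases("www.time.com")` a few times against sites that have not been visited yet would show where the time goes. A fast `dns_s` with a slow or timed-out connect would back up the idea that DNS is fine and that the delay is in setting up the TCP session.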

I'm attaching the Wireshark log in case it helps anyone.  I'm not great at reading them, but all the black and red in this one does give me pause, and I'm going to do some research.
time.com-wireshark.pcapng
