SonicWALL TZ210 Routing Issue

I have a client with an HQ and four remote sites (branch VPNs) that is experiencing a seemingly random inability to route past the gateway.

The setup:
HQ:    SonicWALL TZ210, network 192.168.0.0/23
Site1: SonicWALL TZ215, network 192.168.56.0/24
Site2: SonicWALL TZ215, network 192.168.75.0/24
Site3: SonicWALL TZ215, network 192.168.100.0/24
Site4: SonicWALL TZ215, network 192.168.125.0/24

The symptoms of the problem are:
Seemingly random systems at HQ in the upper range of IPs (192.168.1.0-253) suddenly and intermittently cannot route past the gateway (192.168.1.254).
No systems in the lower range of the HQ network (192.168.0.1-254) are affected.
No systems on any of the VPNs are affected.

What I've observed is that the problem is always present, but its impact/severity is greatly increased when one particular VPN (Site2) is enabled. When that VPN is up, within a couple of minutes several systems stop being able to route past the gateway. When it's down, fewer systems are affected, and usually only for a few minutes at a time.

What makes it stranger is that it's not always the same systems affected, but they are always within the same IP range. If it's 192.168.1.x, it's open season; anything in the 192.168.0.x range on the LAN is not affected.

I should also add that we have three SonicPoints deployed at HQ and none of them are affected. I also ruled out a switch problem by enabling an interface on the SonicWALL and plugging a 'problem' system into it directly.

I suspect one of two things or a combination thereof:
 
1. An IP range conflict with Site2. Even though Site2 is now 192.168.75.0/24, it was previously (before the VPN) a 192.168.1.0/24 network. I have never been to this site physically, but the users claim there are 'a lot of boxes with blinky lights' connected to the LAN. I'm wondering if having NetBIOS enabled across the VPNs is somehow generating this conflict, and whether the gradual up-and-down (rolling outages) comes from dynamically updating routing tables on the SonicWALL at HQ. I know this site has a wireless access point of some kind (possibly a foreign router), and it could be hooked up to the same LAN.

The only issue with this scenario is that, while the problem is drastically reduced when this VPN is disabled, it's still present. One thought was that the other VPNs could be contributing to the issue (the same idea, with foreign network devices connected), just not to the same extent as this particular endpoint.

2. The SonicWALL TZ210 has memory/CPU issues causing corruption in the routing tables.

The issue I have with this theory is that the remote endpoints that apparently have no impact are actually larger networks with more devices connected. Memory/CPU load and connection counts would conceivably be more affected by turning those VPNs off and on, but they're not.

Anyway, I'm getting to the point of a factory reset and recreating the entire configuration, but I want to avoid that at all costs because it would mean downtime (and a considerable chunk of my time) with no guarantee of success.

Anyone have any ideas?




Asked by WiReDWolf

Aaron Tomosky (Director of Solutions Consulting) commented:
1. Only do /24.
2. Don't use 192.168.x.x

Consumer gear defaults to 192.168.x.x, and anyone connecting to the VPN from home is going to have problems.

If you have a /23 at headquarters and something has a static IP with a /24 mask, things can get weird, like you are experiencing now.

Use one /24 for the LAN and another /24 for the WLAN, or for servers, or printers, or whatever you have more than 253 of.
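
To illustrate the mask mismatch (a minimal sketch using Python's standard ipaddress module; the addresses are only examples):

    import ipaddress

    # The HQ LAN as the router sees it: one /23 spanning 192.168.0.0 - 192.168.1.255
    hq_lan = ipaddress.ip_network("192.168.0.0/23")

    # A host accidentally configured with a /24 mask instead of the /23
    misconfigured = ipaddress.ip_interface("192.168.0.50/24")

    peer = ipaddress.ip_address("192.168.1.10")   # a host in the upper half of the /23

    print(peer in hq_lan)                 # True  - the router treats the peer as on-link
    print(peer in misconfigured.network)  # False - the /24 host thinks it must go via the gateway

A host configured that way will ARP directly for half the /23 and send the other half to the gateway, which is the kind of inconsistency that produces "weird" symptoms.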
WiReDWolf (Author) commented:
Yeah I'm not sure how that's helpful.

1. Using only /24s is not an option in this instance, as the network is too large to fit within a 254-address limit.

2. Changing a pre-existing subnet on a production network with this many systems is a complete network overhaul and really not something we're willing to do.

SonicWALL is not "consumer gear"; it's business-class hardware. While it's not Enterprise-level, this isn't an Enterprise network, so the hardware suits the environment.

I suspect that something was introduced to the network that MAY have a static IP with the wrong subnet, but I cannot see how that would affect the router. The router knows better, as do all of the other pieces of equipment.

It's important to note that this router has been in service for four years, and the most recent VPN tunnel was brought up several weeks ago without incident. The trouble started on Tuesday of last week for no apparent reason. No new hardware (that we are aware of) has been introduced on either the LAN or the VPN endpoints.

It's also important to note that while taking the offending VPN down does alleviate the problem, it doesn't eliminate it entirely. I still have a couple of systems that randomly cannot route past the router.
Aaron Tomosky (Director of Solutions Consulting) commented:
You should check all the VPN group/network settings on both ends of all VPN tunnels. The address groups that belong to the VPN are what get put into the routing table, so if there is a 192.168.1.x network still referenced in a tunnel's group of networks, the SonicWALL will try to route traffic there.

FYI: in a /24 (192.168.0.x) the broadcast address is 192.168.0.255, but in your /23 the broadcast is 192.168.1.255, which makes 192.168.0.255 a perfectly usable host address that /24-minded devices will still treat as broadcast. That is just one way a single device can cause havoc on a network.
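
For what it's worth, the arithmetic can be checked with a minimal sketch using Python's standard ipaddress module (the prefixes are the ones from this thread):

    import ipaddress

    print(ipaddress.ip_network("192.168.0.0/24").broadcast_address)  # 192.168.0.255
    print(ipaddress.ip_network("192.168.0.0/23").broadcast_address)  # 192.168.1.255

    # In the /23, 192.168.0.255 is an ordinary, assignable host address
    print(ipaddress.ip_address("192.168.0.255") in
          ipaddress.ip_network("192.168.0.0/23").hosts())            # True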

WiReDWolf (Author) commented:
Ok, I've been through the router backwards and forwards and can find no possible range conflicts. I found two host servers with the wrong subnet mask applied to them, but I corrected that and, as I expected, it made no difference.

The principal issue is that within the range 192.168.1.0-192.168.1.253 some IPs are affected by this weird problem and others aren't. For instance, 192.168.1.8 (my DC) has never been affected, but 192.168.1.2 has. In fact, we were able to replicate the issue by attempting a TeamViewer session to 192.168.1.2; it instantly becomes unroutable when a connection is attempted.

I rebooted the router several hours later and the server in question came back up without issue. I tested again (later) on a different system and TeamViewer connected fine, but when I made another attempt on the server it became unroutable again. It was almost as though the higher connection rate on the server's IP caused the router to put that IP in a timeout, and a reboot made it forget about it.

The thing is, a reboot only resolves the issue temporarily, and downing a VPN only partially resolves it, also temporarily. Nothing actually fixes the problem. The first time the server was blocked by a TeamViewer connection attempt, connectivity restored itself after about an hour. The second time it stayed off for several hours until the router was rebooted.

It's acting like the IPs in the upper range are subject to connection limiting: go over the limit and the connection becomes unroutable. How it can 'fix itself', which also seems random, is beyond me.
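
For anyone trying to pin down exactly when a host drops, a crude watcher along these lines can timestamp the outage so it can be lined up with TeamViewer attempts or VPN state changes (a rough sketch only; Windows-style ping flags are assumed, and 192.168.1.2 is just the server from the example above):

    import subprocess
    import time

    TARGET = "192.168.1.2"  # the host being watched (example address from this thread)

    # Ping once per second and print a timestamped up/DOWN line for each attempt.
    # On Linux, use "ping -c 1 -W 1" instead of the Windows flags below.
    while True:
        result = subprocess.run(["ping", "-n", "1", "-w", "1000", TARGET],
                                capture_output=True, text=True)
        status = "up" if result.returncode == 0 else "DOWN"
        print(time.strftime("%H:%M:%S"), TARGET, status)
        time.sleep(1)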

Getting pretty frustrated with this router.  I have many SonicWALL routers out there, five with this client alone, and I've never experienced anything like this.  

One last thing, in case it becomes a suggestion: I have already upgraded the firmware to a current build, and I also updated all of the other routers at all of the endpoints. Nothing seems to make any difference.
Aaron Tomosky (Director of Solutions Consulting) commented:
Instead of rebooting the router, does just clearing the ARP cache fix it?
WiReDWolf (Author) commented:
Good question. The router is behaving itself at the moment because we took one of the VPNs down again.

I'll see if I can force another episode and test flushing the cache.
Aaron Tomosky (Director of Solutions Consulting) commented:
Following this line of thinking, if an ARP clear fixes it, that means we have one of at least three problems:
1. Devices don't respond after the ARP timeout (I've had ISP gateways do this); increase the ARP timeout or create a static entry.
2. Duplicate MACs: cloned VMs or maybe workstations.
3. Duplicate IPs: a second DHCP server or static assignments (a quick way to check an ARP table dump for items 2 and 3 is sketched below).

That's all I can think of off the top of my head.
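
A rough way to check an ARP table dump for items 2 and 3 (a sketch only; it assumes Windows-style "arp -a" output with dash-separated MACs, so the pattern would need adjusting elsewhere):

    import re
    import subprocess
    from collections import defaultdict

    # Dump the local ARP table and group IPs by MAC; any MAC that answers
    # for more than one IP is worth a closer look.
    output = subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout

    ips_by_mac = defaultdict(set)
    for line in output.splitlines():
        m = re.search(r"(\d+\.\d+\.\d+\.\d+)\s+([0-9a-fA-F]{2}(?:-[0-9a-fA-F]{2}){5})", line)
        if m:
            ip, mac = m.groups()
            ips_by_mac[mac.lower()].add(ip)

    for mac, ips in ips_by_mac.items():
        if len(ips) > 1:
            print(f"MAC {mac} maps to multiple IPs: {sorted(ips)}")

The same idea applies to the SonicWALL's own ARP cache, which is the view that matters here, although the output format will differ.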

WiReDWolf (Author) commented:
Good line of thinking.

Yes, I was able to trigger an episode (TeamViewer to one of the servers always causes it to become unroutable), and an ARP cache flush instantly restored connectivity.

It's looking like your idea in "1. Devices don't respond after the ARP timeout" is at least partially responsible for the problem.

Yes, we have cloned VMs, but they're floating and their MACs are auto-generated correctly. There are no duplicates there.

There are also no duplicate IPs.

I did find in the ARP cache two identical MAC addresses bound to two different IP addresses. I get three entries in the cache from the ISP, but only two have identical MACs.
WiReDWolf (Author) commented:
I have put in a call to the ISP. We have a bonded DSL solution through a provider that binds two providers to a single public IP. The transparent network that manages the bonding shows two MACs, but one MAC is bound to two IPs, which doesn't make a lot of sense to me. It sounds like it doesn't make a lot of sense to them either. They're investigating and will call back. I'll keep you posted.

Good call on the ARP. I'm surprised I didn't think of it myself, because I've had odd issues with Cisco ASA units due to malfunctioning ARP setups.
WiReDWolf (Author) commented:
Well, other than the fact that flushing it clears up the problem temporarily, there don't appear to be any issues with the ARP cache. A recommendation from the ISP (which provides the bonded DSL) is to "Enable Open ARP Behaviour", which is listed as a security concern, but only against internal attacks (which are unlikely).

The duplicated MAC across multiple ISP IPs is part of the switching between sources (DSL1 and DSL2) and is expected behaviour.

There are two DHCP scopes, but they've been there for a year (since we expanded the network) and have never caused an issue. The same goes for the VMware Horizon View configuration, which has been in effect for years with no issues.

We have a TZ215 on the way. A fresh setup will replace this unit to see whether that resolves the issue. It's starting to look to me more like a hardware failure.
Aaron Tomosky (Director of Solutions Consulting) commented:
There were some firmware bugs in some version of 5.9.somethingidontremember where routes wouldn't re-enable after being auto-disabled due to an interface dropping. Could be related.

Can you still buy the TZ215? The TZ300 is the new generation with 6.x firmware, which I much prefer.
WiReDWolf (Author) commented:
We have a TZ400 on order and an emergency TZ215 being sent from another site.

A new wrinkle presented itself this morning.  I found a Cisco managed switch on the network with an IP conflict with the LAN interface of the router.

There was an onsite 'tech' who was way out of his depth but insisted on working in isolation and I'm finding all kinds of weirdness now that he's gone.  

I knew the Cisco switch existed but it never checked in with FindIT before today.  I'm heading down there to remove this switch entirely.
Aaron Tomosky (Director of Solutions Consulting) commented:
Well well. I'm thinking we found our culprit.
WiReDWolf (Author) commented:
That does indeed seem to have been the culprit. The Cisco SG500X managed switch somehow made itself visible on the network for the first time in months by resetting itself back to its factory-default IP of 192.168.1.254, the same IP as our SonicWALL gateway.

As a Layer 3 switch, it was also set to IPv4 routing mode.

The switch's default IP configuration is 192.168.1.254 with a /24 subnet mask, which explains why only the upper half of the true /23 subnet was affected.
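
Just to spell out the overlap (a small sketch with Python's ipaddress module, using the addresses from this thread):

    import ipaddress

    hq_lan = ipaddress.ip_network("192.168.0.0/23")        # the real HQ subnet
    rogue = ipaddress.ip_interface("192.168.1.254/24")     # SG500X factory default

    # The rogue switch considers exactly the upper half of the /23 to be on-link,
    # and it was squatting on the gateway IP for that half.
    print(rogue.network)                                          # 192.168.1.0/24
    print(ipaddress.ip_address("192.168.1.2") in rogue.network)   # True  - in the affected range
    print(ipaddress.ip_address("192.168.0.10") in rogue.network)  # False - never affected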

The switch has been reconfigured: it is no longer in routing mode, and its IP has been changed to one that will not conflict.

What amazes me is how it could have been set up this way since April without causing significant issues before now. Nothing else has changed; only VPN endpoints have been added recently.

Anyway, awarding you the solution.