Solved

Sonicwall NSA 4600 Suddenly stops passing traffic

Posted on 2014-02-24
Last Modified: 2014-03-17
Here's my situation:

I have 2 SonicWall NSA 4600 units in an active/passive failover configuration.
When I move them into Production, everything ticks along without incident for approximately 24-48 hours.

After 24-48 hours (the interval between failures is inconsistent: the longest uptime was around 2 full days, the shortest about 17 hours), the units simply stop passing traffic.  The web management interface and SSH stop responding as well.

Powering down the primary unit to force a failover does nothing, as the secondary unit is also non-responsive.

Powering off both units and powering them back on also does not appear to solve the problem. The units become responsive again for about 5 minutes and then cease responding again.

The only thing that gets them functional again is to remove the switch they are connected to and take the units completely off the production network for approximately 30 mins to an hour.  Then, when a laptop is plugged into the switch they are on, everything seems fine.  

There are no errors in the SonicWall logs related to the failure.  In fact, it appears, from log entries, that the units never ceased functioning.

I have tried replacing the network infrastructure to which it is attached.  I swapped a smart switch for a 10GB backplane enterprise switch.  The switch also does not report any abnormalities.

I have tried connecting just one of the SonicWall units directly to our Core switches, without HA enabled, and the issue still happened.

Our older Check Point firewall is configured nearly identically and has no issues.

Management refuses to allow the SonicWalls to be placed back into Production without the issue being identified and resolved because it takes down the entire network when it ceases to respond.

I've been over the NATs and Routing 3 times and I see no errors.

There are NATs that put the Sonicwall in "Routed Mode", meaning our internal IPs are also Public IPs... But beyond that, it is a very typical setup.

I am hoping someone has encountered a similar issue and found a way to resolve it.  Even being pointed in the right direction would help.


Thanks in Advance!
Question by:delpt
9 Comments
 

Author Comment

by:delpt
ID: 39884425
It is worth noting that I have had SNMP monitoring both the Switch and the SonicWall for the duration of the outage.

SNMP continues to respond during the outage, as does the Syslog.  The CPU and RAM are not even close to 100% (about 10% and 15%, respectively).


The switch doesn't even break a sweat.  It hovers around 5-15% capacity at any one given time.
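
Since SNMP and Syslog keep answering while the web frontend and SSH go dark, it may help to log exactly when each one stops. Below is a minimal sketch of a TCP probe that could run from a host on the same switch. The labels, addresses, and ports passed to it are hypothetical placeholders, not values from my setup, and it only tests whether a TCP handshake completes, nothing SonicWall-specific.

```python
import socket
from datetime import datetime, timezone


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_once(targets, timeout: float = 3.0):
    """Probe each (label, host, port) tuple once and return {label: bool}."""
    return {label: tcp_reachable(host, port, timeout) for label, host, port in targets}


def format_result(results, now=None) -> str:
    """Render one timestamped log line, e.g. for appending to a file."""
    now = now or datetime.now(timezone.utc)
    states = " ".join(
        f"{label}={'up' if ok else 'DOWN'}" for label, ok in sorted(results.items())
    )
    return f"{now.isoformat(timespec='seconds')} {states}"
```

Calling `format_result(probe_once(targets))` every 30 seconds or so, with one target for the firewall's HTTPS management port, one for SSH, and one for a host on the far side of the firewall, would give a timeline to line up against the SNMP graphs and show whether management access and forwarded traffic fail at the same moment.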
 
LVL 20

Expert Comment

by:carlmd
ID: 39885340
0
 

Author Comment

by:delpt
ID: 39886409
OK... Following the guidelines, I've made the following alterations:

1. Added a 10GB Twinax Cable to X17 (10GB SFP) on both units (X11 is also connected with crossover CAT6) and changed the HA Data Interface from X11 to X17.

2. Checked the "Enable Virtual MAC" checkbox.  I could not find the automatically generated MAC for the WAN, so I went into the Monitoring section and specified it.

3. Made sure DPI-SSL was disabled (it was already)


Seems a little odd, however, that the issue still occurred when we had just one Sonicwall connected directly to our core infrastructure with HA disabled...

I won't be able to move it back into Production for a test until next week.  The users have deadlines this week and won't tolerate another potential outage.


If there is anything else I should be looking for, please let me know... I'd like to make sure all my ducks are in a row before I even attempt putting it back.
 
LVL 20

Expert Comment

by:carlmd
ID: 39886672
Can't you test the new settings in your test environment? At least drop the primary unit and see what the HA does.
 

Author Comment

by:delpt
ID: 39886710
Yes I can.  But the units have never failed in the test environment.
I have been able to failover/failback until my fingers hurt without any issue at all in the isolated environment.  I've even slammed it with traffic until the source interfaces overload in the test environment and the Sonicwalls and switch do not even blink.

The issue only seems to reproduce in our production environment, and then only after about 24-48 hours under normal load (users and SSL VPNs connected).  I have never been able to determine the real root cause.
 
LVL 13

Accepted Solution

by:
Greg Hejl earned 500 total points
ID: 39887529
Have you opened a case with Dell?  Their engineers are quite good at fixing these issues.  You would receive priority escalation with your model.
 

Assisted Solution

by:delpt
delpt earned 0 total points
ID: 39924333
Just finished troubleshooting with Dell.
It is a defect in their product.

Also, SonicWall Support is "restructuring" and even their "Priority" queues are at least 1 hour deep.

We are returning the devices and upgrading our CheckPoint Gateways instead.

Thanks everyone for input!
 

Author Closing Comment

by:delpt
ID: 39933737
The solution was to rid myself of the devices.
 
LVL 13

Expert Comment

by:Greg Hejl
ID: 39935647
Thanks for the points....

Curious, as I am about to deploy an NSA 3600: what was the defect?  Did they offer a solution to resolve your issues?  Was it due to the HA implementation?