SonicWall NSA 4600 suddenly stops passing traffic

Here's my situation:

I have 2 SonicWall NSA 4600 units in an active/passive failover configuration.
When I move them into Production, everything ticks along without incident for approximately 24-48 hours.

After 24-48 hours (the actual interval between failures is inconsistent; the longest uptime was around two full days, the shortest about 17 hours), the units simply stop passing traffic. The web management interface and SSH stop responding as well.

Powering down the primary unit to force a failover does nothing, as the secondary unit is also non-responsive.

Powering off both units and powering them back on does not solve the problem either: the units become responsive for about five minutes and then stop responding again.

The only thing that gets them functional again is to disconnect the switch they are attached to and take the units completely off the production network for roughly 30 to 60 minutes. After that, when a laptop is plugged into the same switch, everything seems fine.

There are no errors in the SonicWall logs related to the failure. In fact, judging by the log entries, the units never stopped functioning at all.

I have tried replacing the network infrastructure to which the units are attached, swapping a smart switch for an enterprise switch with a 10 Gb backplane. The switch also reports no abnormalities.

I have tried connecting just one of the SonicWall units directly to our core switches, with HA disabled, and the issue still occurred.

Our older Check Point firewall is configured nearly identically and has no issues.

Management refuses to allow the SonicWalls back into production until the issue is identified and resolved, because the entire network goes down when they stop responding.

I've been over the NAT policies and routing three times and I see no errors.

There are NAT policies that put the SonicWall in "Routed Mode", meaning our internal IPs are also public IPs. But beyond that, it is a very typical setup.

I am hoping someone has encountered a similar issue and found a way to resolve it. Even being pointed in the right direction would help.

Thanks in Advance!
Greg HejlPrincipal ConsultantCommented:
Have you opened a case with Dell? Their engineers are quite good at fixing these issues, and you would receive priority escalation with your model.
delptAuthor Commented:
It is worth noting that I have had SNMP monitoring both the Switch and the SonicWall for the duration of the outage.

SNMP continues to respond during the outage, as does syslog. CPU and RAM are nowhere near 100% (about 10% and 15%, respectively).

The switch doesn't even break a sweat; it hovers around 5-15% capacity at any given time.
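That failure signature (SNMP and syslog still alive while the web UI and SSH are dead) can be watched for automatically from a separate box. Below is a minimal sketch of such a probe; the host addresses are hypothetical placeholders, and the SNMP result is stubbed out because a real check needs snmpget or an SNMP library, which this sketch doesn't include:

```python
import socket

def tcp_alive(host, port, timeout=2.0):
    """True if a plain TCP connect to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def classify(snmp_ok, web_ok, ssh_ok):
    """Label the failure mode from three independent probes."""
    if web_ok and ssh_ok:
        return "healthy"
    if snmp_ok:
        # The pattern observed here: SNMP/syslog kept answering
        # while the web UI and SSH were dead.
        return "management plane hung, device still alive"
    return "fully unreachable"

# Hypothetical usage (192.0.2.1 is a placeholder management IP):
#   web = tcp_alive("192.0.2.1", 443)
#   ssh = tcp_alive("192.0.2.1", 22)
#   snmp = ...  # real check needs snmpget or an SNMP library (UDP 161)
#   print(classify(snmp, web, ssh))
```

Logging that classification once a minute gives a timestamped record of exactly when the management plane hangs, independent of the device's own (apparently blind) logs.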

delptAuthor Commented:
OK. Following the guidelines, I've made the following alterations:

1. Added a 10 Gb Twinax cable to X17 (10 Gb SFP) on both units (X11 is also connected with crossover CAT6) and changed the HA data interface from X11 to X17.

2. Checked the "Enable Virtual MAC" checkbox.  I could not find the automatically generated MAC for the WAN, so I went into the Monitoring section and specified it.

3. Made sure DPI-SSL was disabled (it was already)

It seems a little odd, however, that the issue still occurred when we had just one SonicWall connected directly to our core infrastructure with HA disabled...

I won't be able to move it back into Production for a test until next week.  The users have deadlines this week and won't tolerate another potential outage.

If there is anything else I should be looking for, please let me know... I'd like to make sure all my ducks are in a row before I even attempt putting it back.
Can't you test the new settings in your test environment? At least drop the primary unit and see what the HA does.
delptAuthor Commented:
Yes, I can, but the units have never failed in the test environment.
I have been able to fail over and fail back until my fingers hurt without any issue at all in the isolated environment. I've even slammed it with traffic until the source interfaces overload, and the SonicWalls and switch do not even blink.
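For reference, that kind of traffic slamming can be approximated with a small script. This is only a crude sketch of a concurrent TCP load generator (not necessarily what was used here); the target host and port would be some machine behind the firewall under test:

```python
import socket
import threading

def hammer(host, port, payload=b"x" * 1024, rounds=100):
    """Open one connection and push payload repeatedly; return bytes sent."""
    sent = 0
    with socket.create_connection((host, port), timeout=5) as s:
        for _ in range(rounds):
            s.sendall(payload)
            sent += len(payload)
    return sent

def slam(host, port, workers=20):
    """Run many hammer() workers in parallel to stress the path."""
    totals = []
    lock = threading.Lock()

    def work():
        n = hammer(host, port)
        with lock:
            totals.append(n)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(totals)
```

Note that this only exercises raw connection and throughput handling; it does not reproduce the production mix of users and SSL VPN sessions, which may be exactly why the lab never failed.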

The issue only replicates in our production environment, then only after about 24-48 hours, and only under normal load (users and SSL VPNs connected). I was never able to determine the real root cause.
delptAuthor Commented:
Just finished troubleshooting with Dell.
It is a defect in their product.

Also, SonicWall Support is "restructuring" and even their "Priority" queues are at least 1 hour deep.

We are returning the devices and upgrading our Check Point gateways instead.

Thanks, everyone, for the input!
delptAuthor Commented:
The solution was to rid myself of the devices.
Greg HejlPrincipal ConsultantCommented:
Thanks for the points....

Curious, as I am about to deploy an NSA 3600: what was the defect? Did they offer a solution to resolve your issues? Was it due to the HA implementation?