File Transfer Over VPN Fails

We are in the process of setting up SQL DB mirroring between a production location (primary) and a data center (DC) (the actual DB mirroring is being handled by a third-party - we're responsible for connectivity between the two locations).

The primary location has a DS3 running through a Netgear GB switch and then to (2) Sonicwall NSA 240 UTMs in a high-availability cluster. The data center has a 20Mbps burstable pipe with the identical NSA240 configuration (Cisco desktop switch instead of Netgear). I'm getting consistent upload speeds from the primary location at just under 40Mbps; the download speed at the DC is 20Mbps (also very consistently). We have configured a VPN tunnel between the two locations:

Policy Type: Site to Site
Authentication Method: IKE using Preshared Key
IKE Proposal
  Exchange: Main Mode
  DH Group: Group2
  Encryption: AES-256
  Authentication: SHA1
  Life Time: 28800
IPSEC Proposal
  Protocol: ESP
  Encryption: AES-256
  Authentication: SHA1
  Lifetime: 28800
There are no errors or warnings being logged in the Sonicwall regarding the IPSEC tunnel and connectivity between the two locations seems pretty solid.

There are (2) main servers that we are concerned with at the primary location. Both are Dell PowerEdge R900 servers that are about 2y old and still under full warranty. They both have dual Intel Xeon processors and 16GB of RAM. Both are running Windows Server 2008 Standard x64 with Service Pack 2 (NOT 2008 R2) and both are fully up-to-date with Windows Updates. We have deployed identically configured Virtual Servers at the Data Center (VMware ESXi 4.1 on Dell PowerEdge R900 host).

The issue is that when we try to transfer large files (>100MB) from DBSERVER1 (primary) to DBSERVER2 (DC), the transfer ALMOST always fails. I say almost because we've been able to transfer a 100MB file with fair success and a 1.7GB file one time but it has never been consistent. At this point, transferring anything over 100MB is pretty sure to meet with failure. The transfer begins just fine and shows that it's moving at about 2.5MB/s; then it locks up for a period of time before finally failing with the following error message:

"Item Not Found
  Could not find this item
   This is no longer located in \\SERVERNAME\SHARENAME\. Verify the item's location and try again."

Sometimes the transfer will resume from lockup but then it will show that it's moving at about 500-900KB/s; it may even look to restart completely from the beginning but it generally doesn't get above 1MB/s transfer rate. There does not appear to be any error/event logged in either the Application or System logs at a time that would correspond to the failure.

Initial thoughts were that there were issues with the Sonicwall not being able to handle the throughput. We put this to the test by choosing other devices at the primary location and moving files FROM these locations. We can consistently (without failure or error) move files >500MB from at least one XP workstation and from a Windows Server 2003 x64 server to the desired destination at the DC. We believe this rules out the Sonicwall as we were able to transfer a 22GB .bak file from the 2003 server in about 3h last night.
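
As a sanity check on those numbers, here is a minimal sketch of the arithmetic, assuming decimal units (1 GB = 1000 MB, 1 Mbps = 1,000,000 bits/s) and ignoring IPsec and SMB overhead. The function names are made up for illustration.

```python
def average_rate_mb_per_s(size_gb, hours):
    """Average transfer rate in MB/s for a file of size_gb moved in `hours`."""
    return size_gb * 1000 / (hours * 3600)


def link_ceiling_mb_per_s(link_mbps):
    """Theoretical ceiling in MB/s for a link rated in megabits per second."""
    return link_mbps / 8


# 22GB .bak file moved from the 2003 server in about 3h
observed = average_rate_mb_per_s(22, 3)
# 20Mbps pipe at the DC
ceiling = link_ceiling_mb_per_s(20)

print(f"observed ~{observed:.2f} MB/s, ceiling {ceiling:.2f} MB/s, "
      f"utilisation ~{observed / ceiling:.0%}")
```

Under those assumptions the 2003 server's transfer averaged roughly 2 MB/s against a 2.5 MB/s ceiling, which is also where the failing 2008 transfers start out before they stall.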

We also considered NIC traffic as the DB server at the primary location was handling what seemed to be a heavy network load. We updated the driver and teamed the (2) Intel PRO 1000 NICs to maximize throughput and we stopped any other non-essential transfers. This did not fix the issue so we used Windows Server 2008 Data Collector Sets for both LAN Diagnostics and System Performance - the NIC load does not appear out-of-the-ordinary. We did see that there were many System Diagnostics entries on the source server under Network>IP "Datagrams Received Address Errors" (over 1000) and "Datagrams Received Discarded" (nearly 500). I couldn't begin to find a correlation here, though.

So from what we can see, the issue is that we can't move large files consistently between (2) Windows Server 2008 x64 Standard servers across a VPN tunnel. Oh yeah, to make it more complicated, we can move the same files around on the LAN - NOT over the VPN.

This is a time sensitive issue as the client is looking to have this done yesterday afternoon. Any assistance would be greatly appreciated. If there's anything I haven't covered or need to clarify, please ask.
wjb313Asked:

dafyreCommented:
A couple of things that I can think of to check...

1) Is the Remote Differential Compression(RDC) enabled or disabled at both the SQL Server *and* the remote Data center?

2) Do you have the Network Card Offload Engines enabled or disabled at both ends of the connection?

3) Can you check the error logs on the switches themselves (if they are managed)?

4) Have you tried to copy a file from the SQL Server by hand?  .... what about using other protocols, such as FTP?

5) Is either end of the connection connected to (or perhaps hosted on) a SAN?
wjb313Author Commented:
1) I have enabled RDC on both DB Servers. No change is apparent. Is any configuration necessary?

2) The Intel PRO teamed NICs both have all available Offloading Options enabled:
*TCP/IP Offloading Options>IPv4 Checksum, TCP Checksum (IPv4), TCP Checksum (IPv6), UDP Checksum (IPv4), and UDP Checksum (IPv6) are all enabled; Large Send Offload v2 for both IPv4 and IPv6 are enabled (***Should IPv6 be disabled since I've disabled IPv6 on the NICs??***)
*I'm less worried about the DC/VMs, as the transfer to this VM works fine from other boxes. The offloading options are enabled there, though.

3) The switches are not managed. The Sonicwalls show no errors of any sort related to this.

4) I've been copying the files by hand all along. Drag&Drop, Copy/Paste, XCOPY. . . I'm trying it all to see if there's any difference. None has been seen yet. I've not tried FTP, though.

5) Neither is connected to a SAN. The VMs are housed locally on the Dell at the DC.
digitapCommented:
it's possible you're dropping packets at the WAN interface due to a misconfiguration of the WAN MTU.  The default is 1500.  I have an article guiding one through calculating the MTU and then setting it on the sonicwall.  you might run through the first part of the calculation and see if it needs to be altered.

http://www.experts-exchange.com/viewArticle.jsp?aid=3110

Also, you might review the Security Services settings.  Make sure the IPS or Gateway AV isn't dropping the connections.  perhaps exclude the VPN Zone from being scanned and/or add the IP addresses of your LAN hosts.
wjb313Author Commented:
The VPN interface has no scanning for any of the Security Services. The highest ping that goes through is 1472. . . 1472+28=1500. Thanks but that's not it. Again, I'm not convinced that the Sonicwall is a problem here as we can move the data from a Windows Server 2003 server across the VPN to the Windows Server 2008 VM.
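
For reference, the 1472+28 arithmetic can be sketched like this. It assumes a plain IPv4 header with no options (20 bytes) plus the 8-byte ICMP echo header, and that the ping was sent with the don't-fragment bit set (on Windows, `ping -f -l 1472 <host>`). The function names are just for illustration.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header
TCP_HEADER_BYTES = 20   # TCP header without options


def mtu_from_ping_payload(largest_payload):
    """MTU implied by the largest unfragmented ICMP payload that gets through."""
    return largest_payload + IP_HEADER_BYTES + ICMP_HEADER_BYTES


def tcp_mss_for_mtu(mtu):
    """Largest TCP segment payload that fits in one packet of the given MTU."""
    return mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES


mtu = mtu_from_ping_payload(1472)
print(mtu, tcp_mss_for_mtu(mtu))
```

If the largest payload that passed had been smaller than 1472, the same sum would give the value to set as the WAN MTU.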
digitapCommented:
with your response above, i'm inclined to agree with you.

So, you have SiteA (Sonicwalls) and SiteB (Netgear), right?  Which is failing SiteA > SiteB or SiteB > SiteA?
wjb313Author Commented:
As we continue to troubleshoot this, we find the recurrence of Event ID 5719 and 5783 (Source: NETLOGON):

 Event Type: Error
 Event Source: NETLOGON
 Event Category: None
 Event ID: 5783  
 Date: 12/17/2009
 Time: 9:53:06 AM
 User: N/A
 Computer: ISA2000  
 Description:
 "The session setup to the Windows NT or Windows 2000 Domain Controller \\DCSERVERNAME.DOMAINNAME.local for the domain DOMAINNAME is not responsive. The current RPC call from Netlogon on \\SERVERNAME to \\DCSERVERNAME.DOMAINNAME.local has been cancelled."


 Event Type: Error
 Event Source: NETLOGON
 Event Category: None
 Event ID: 5719  
 Date: 12/17/2009
 Time: 9:53:25 AM  
 User: N/A
 Computer: ISA2000
 Description:
"This computer was not able to set up a secure session with a domain controller in domain DOMAINNAME due to the following:
The RPC server is unavailable.
This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator."

To the best of our knowledge there isn't even a Windows Server 2000 machine on the premises, let alone set up as a DC. . . let alone as THE DC with the server name that's in error. I feel like this is important as we've noted some strange behavior when trying to connect by name and ping by name - but the issues are quirky and not replicable, so it's hard to pin down. Also, of note (maybe?!?!) is that the DFSREvent test failed in DCDIAG.
digitapCommented:
As a long shot, would you review the link below and relay your thoughts?  I've had this resolve client connectivity issues over a VPN.

http://support.microsoft.com/kb/244474
wjb313Author Commented:
On that last. . . the DFSREvent logged was:

Log Name:      DFS Replication
Source:        DFSR
Date:          10/17/2010 6:32:19 PM
Event ID:      5008
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      CPRAPPSSRV1.CPRAPPS.local
Description:
The DFS Replication service failed to communicate with partner CPRAPPSSRV2 for replication group Domain System Volume. This error can occur if the host is unreachable, or if the DFS Replication service is not running on the server.
 
Partner DNS Address: CPRAPPSSRV2.CPRAPPS.local
 
Optional data if available:
Partner WINS Address: CPRAPPSSRV2
Partner IP Address:  
 
The service will retry the connection periodically.
 
Additional Information:
Error: 1722 (The RPC server is unavailable.)
Connection ID: CF1686F1-F200-4801-B5AB-F507A566AB93
Replication Group ID: CF1FDC52-D546-401A-8891-BAC03FC048C4

This seems to mesh with the NETLOGON error that the RPC call was cancelled. Error 1722 suggests enabling File and Printer Sharing - but it's already enabled. Also possibly DNS. Reviewing DNS configuration.
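
A quick way to test basic reachability across the tunnel before digging into DNS is a plain TCP connect. This is only a rough sketch: port 135 is the RPC endpoint mapper and 445 is SMB, but RPC also negotiates dynamic ports, so a successful connect on 135 does not prove that replication traffic will pass. The helper name is hypothetical.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example use against the replication partner named in the event above:
#   for port in (135, 445):
#       print(port, tcp_port_open("CPRAPPSSRV2.CPRAPPS.local", port))
```

Running it once by name and once by IP address helps separate a name resolution problem from a blocked port.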
wjb313Author Commented:
Sonicwalls on both ends; the Netgear is just an unmanaged switch that splits the DS3 so it can hit both NSA 240s (there are (2) on each end for HA Cluster).

I tried the regedit; not better. In fact, DNS now seems to be broken. We're currently trying to trace this through the DNS and see if there's a problem with the way DNS is routing things between these boxes.
dafyreCommented:
Is the DNS setup on both ends active directory enabled?

Also, does the remote data center connect to the DNS server where your SQL Server lives, or does it connect to its own DNS server?

If it connects to its own DNS server, make sure that the zones are correctly replicated across the local and remote data centers.
wjb313Author Commented:
DNS is set up at both ends. There's a DC at either side and it also runs AD-Integrated DNS. Everything seems to be replicating successfully - if I add an A Record manually to one DNS server, it eventually finds its way to the other one. Also, if I add/del a user from AD on one side, I can replicate that change to the other side. The SQL server is a member of the domain on each side and doesn't run any other roles. We have tried this with the DB Server pointing to the AD/DNS server on the same side of the VPN as itself and with it pointing to the primary location. No change in results. I also noted that the DNS servers were configured for "round robin" and that both of the primary location servers have (2) IP addresses. I disabled round robin on both DNS servers. . . still no consistency.
dafyreCommented:
Are you able to take a large file and copy it by hand from the SQL Server at the local location  across the VPN to the SQL Server at the remote data center?
wjb313Author Commented:
We tried moving a file from DBSERVER1 to DBSERVER2 from both ends - logged into DBSERVER2 and dragging a file to the desktop; logged into DBSERVER1 and dragging a file to a window that we browsed to on DBSERVER2. Both results were the same. Interestingly, since disabling round robin, the file transfers are not failing but are taking significantly longer than they should. We were seeing transfer speeds of between 2-2.5MB/s initially (with consistent failures) and now we're seeing transfer speeds more in the neighborhood of 200-500KB/s (with no failures). We've also disabled SMB v2.0 (waiting on reboot) and disabled automatic adjustment of TCP window size per MS KB932170.
wjb313Author Commented:
So it appears that disabling round robin was the right answer. Once done, things were slow but the transfers worked. I re-enabled SMB 2.0 on both servers and the speeds picked up again. At this point, I've consistently been able to transfer files upwards of 1.0GB from the source server to the destination server. Thanks for all the ideas.
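
For anyone who hits the same thing, a minimal sketch for spotting names that resolve to more than one address (the case where round robin can hand out a different IP from one lookup to the next). It only lists what the resolver returns; whether each address is actually routable over the VPN is a separate check. The helper names are hypothetical, and the `resolver` parameter exists only so the lookup can be stubbed out.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the distinct IPv4 addresses a name resolves to, sorted."""
    results = resolver(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is (address, port).
    return sorted({sockaddr[0] for *_unused, sockaddr in results})


def is_multihomed(hostname, resolver=socket.getaddrinfo):
    """True if the name maps to more than one IPv4 address."""
    return len(resolve_ipv4(hostname, resolver)) > 1


# Example use from either DB server:
#   print(resolve_ipv4("DBSERVER1"), is_multihomed("DBSERVER1"))
```

If a server shows up with two addresses and only one of them is reachable through the tunnel, that would line up with the intermittent failures we saw while round robin was on.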

digitapCommented:
glad you got it...that was quite elusive!