mike2401 asked:

FTP 23 second timeouts!!

Background:

We have a W2K8 FTP server (IIS) in our DMZ (behind a PIX firewall) that accepts thousands of FTP transmissions daily from our mainframe (inside our network) and from our laptops (outside the network).

Typically, a connection is made, credentials successfully authenticate, a transmission begins, then ends when complete.

This is a sample of a typical successful transaction from our ftp server logs:
   14:29:49 10.0.3.1 [10019]USER curtiscc.net\mfdownload 331 0
   14:29:49 10.0.3.1 [10019]PASS - 230 0
   14:29:49 10.0.3.1 [10019]CWD \curtis\Watch_n_launch\in 250 0
   14:29:49 10.0.3.1 [10019]created /curtis/Watch_n_launch/in/PRINT_APGM999_110819102948.zip 226 0
   14:29:51 10.0.3.1 [10019]created /curtis/Watch_n_launch/in/PRINT_APGM999_110819102948.ins 226 0
In this transmission, two files were successfully delivered to the ftp server (a .zip and .ins).

Symptom:  
Intermittently (maybe once every several days), precisely 23 seconds after the connection is made and authentication succeeds, the transmission times out with a “10060” FTP error. The failed transfer terminates, and the same FTP thread continues with the next step successfully, as if nothing was wrong.

This is a sample of a failed transaction from our ftp server logs:
   14:30:01 10.0.3.1 [10020]USER curtiscc.net\mfdownload 331 0
   14:30:01 10.0.3.1 [10020]PASS - 230 0
   14:30:01 10.0.3.1 [10020]CWD \curtis\Watch_n_launch\in 250 0
   14:30:24 10.0.3.1 [10020]created PRINT_APGM999_110819102956.zip 425 10060
   14:32:03 10.0.3.1 [10020]created /curtis/Watch_n_launch/in/PRINT_APGM999_110819102956.ins 226 0
   14:32:03 10.0.3.1 [10020]QUIT - 226 0

In this transmission, the “10060” timeout occurs exactly 23 seconds after the successful logon and directory change. After it fails, the same thread continues to the next step and successfully delivers the next file. It also seems to be thread- or connection-specific (not a pause of the whole FTP service), because we have seen a new thread start and complete within the 23-second inactivity period, before the original thread eventually times out with the “10060” error.
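
For anyone who wants to hunt for these stalls across a large log, here is a minimal Python sketch. It assumes the log layout shown in the samples above (time, client IP, `[session]command`, then FTP status and Win32 status as the last two fields); the function names are made up for illustration. Reply 425 is FTP's "can't open data connection", and 10060 is the Winsock timeout code (WSAETIMEDOUT).

```python
from datetime import datetime

# Two lines copied from the failed session in the post.
LOG = r"""
14:30:01 10.0.3.1 [10020]CWD \curtis\Watch_n_launch\in 250 0
14:30:24 10.0.3.1 [10020]created PRINT_APGM999_110819102956.zip 425 10060
"""


def parse(line):
    """Split one log line into the fields this sketch cares about."""
    parts = line.split()
    return {
        "time": datetime.strptime(parts[0], "%H:%M:%S"),
        "session": parts[2][1:parts[2].index("]")],  # "[10020]CWD" -> "10020"
        "status": int(parts[-2]),                    # FTP reply code
        "win32": int(parts[-1]),                     # Win32/Winsock status
    }


def stalls(lines):
    """Yield (session, seconds since that session's previous log entry)
    for every 425 reply. Does not handle sessions that cross midnight."""
    last_seen = {}
    for rec in map(parse, lines):
        prev = last_seen.get(rec["session"])
        if rec["status"] == 425 and prev is not None:
            yield rec["session"], int((rec["time"] - prev).total_seconds())
        last_seen[rec["session"]] = rec["time"]


if __name__ == "__main__":
    for session, gap in stalls(LOG.strip().splitlines()):
        print(f"session {session}: 425 after {gap} s of silence")
```

If the gap is always the same number of seconds, that points at a fixed timer expiring somewhere rather than at load or disk contention.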

Our IIS FTP timeout is set to 5 minutes. We’ve disabled Kaspersky anti-virus as well as Windows Firewall, so neither seems to be the cause. The disk is defragmented. This happened when the server was physical too, although right now it's virtualized (VMware). It happens at various times (day and night), so it’s not related to our nightly Backup Exec tape backup.

Just on a whim, we tweaked the TCP/IP settings per this (but it didn’t solve the problem):
http://kb.globalscape.com/Print10438.aspx

So, what’s causing this?  Windows, IIS, PIX firewall?

Thanks for any thoughts; this is driving us crazy!

When the FTP fails, our MF operator has to call the programmer (sometimes 1am, 3am) to inspect and rerun the job.  

BTW: I'm the programmer, so PLEASE: help me get my beauty sleep!

Mike
ASKER CERTIFIED SOLUTION
AlexPace
SOLUTION
Randy Downs
You might try switching from Active to Passive, or vice versa.

http://learn.iis.net/page.aspx/309/configuring-ftp-firewall-settings/

It is often challenging to create firewall rules for FTP server to work correctly, and the root cause for this challenge lies in the FTP protocol architecture. Each FTP client requires two connections to be maintained between client and server:

• FTP commands are transferred over a primary connection called the Control Channel, which is typically the well-known FTP port 21.
• FTP data transfers, such as directory listings or file upload/download, require a secondary connection called the Data Channel.

Opening port 21 in a firewall is an easy task, but this means that an FTP client will only be able to send commands, not transfer data. This means that the client will be able to use the Control Channel to successfully authenticate and create or delete directories, but the client will not be able to see directory listings or be able to upload/download files. This is because data connections for the FTP server are not allowed to pass through the firewall until the Data Channel has been allowed through the firewall.

Note: This may appear confusing to an FTP client, because the client will seem to be able to successfully log in to the server, but the connection may appear to timeout or stop responding when attempting to retrieve a directory listing from the server.

The challenges of working with FTP and firewalls don't end with the requirement of a secondary data connection; to complicate things even more, there are actually two different ways to establish a data connection:

• Active Data Connections: In an active data connection, an FTP client sets up a port for data channel listening and the server initiates a connection to the port; this is typically from the server's port 20. Active data connections used to be the default way of connecting to an FTP server; however, active data connections are no longer recommended because they do not work well in Internet scenarios.

• Passive Data Connections: In a passive data connection, an FTP server sets up a port for data channel listening and the client initiates a connection to the port. Passive connections work much better in Internet scenarios and are recommended by RFC 1579 (Firewall-Friendly FTP).

Note: Some FTP clients require explicit action to enable passive connections, and some clients don't even support passive connections. (One such example is the command-line Ftp.exe utility that ships with Windows.) To add to the confusion, some clients attempt to intelligently alternate between the two modes when network errors happen, but unfortunately this does not always work.

Some firewalls try to remedy problems with data connections with built-in filters that scan FTP traffic and dynamically allow data connections through the firewall. These firewall filters are able to detect what ports are going to be used for data transfers and temporarily open them on the firewall so that clients can open data connections. (Some firewalls may enable filtering FTP traffic by default, but this is not always the case.) This type of filtering is known as a type of Stateful Packet Inspection (SPI) or Stateful Inspection, meaning that the firewall is capable of intelligently determining the type of traffic and dynamically choosing how to respond. Many firewalls now employ these features, including the built-in Windows Firewall.
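
As a rough illustration of the active/passive switch, here is a minimal sketch assuming a client scripted with Python's standard `ftplib`. The helper name `make_client` is made up, and the host name, credentials and file name in the usage comment are placeholders, not values from this thread.

```python
from ftplib import FTP


def make_client(passive: bool) -> FTP:
    """Return an unconnected FTP client set to passive or active mode."""
    ftp = FTP()            # no host given, so nothing connects yet
    ftp.set_pasv(passive)  # True: client opens the data channel (PASV)
                           # False: server connects back to the client (PORT)
    return ftp


# Hypothetical usage (placeholder host, credentials and file name):
#
#   ftp = make_client(passive=True)
#   ftp.connect("ftp.example.com", 21)
#   ftp.login("user", "password")
#   with open("report.zip", "rb") as fh:
#       ftp.storbinary("STOR report.zip", fh)
#   ftp.quit()
```

Running the same transfer once in each mode is a cheap way to tell whether the failures follow the direction in which the data channel is opened.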

Thanks everyone.

Our LAN admin is reviewing these suggestions.

Since making the tcp/ip tweaks, oddly, we've seen an increase in failures on the DIR statement which precedes the actual transfer (not sure if it's a coincidence or not).

Since we did get another 10060 error this morning, I'm ready to conclude the tcp/ip tweaks are not the solution.

BTW, the FTP transmissions from our Mainframe are scripted and copied/pasted all over the place, so it's not easy to change all the mainframe jobs.

Our laptops are using a custom VB control in a custom app, so it's not something we can change.

I've never had a problem using FileZilla, but then again, this happens 0.5% of the time, so it's hard to replicate.
 
These FTP failures are SOOO frustrating.

Mike
Our LAN admin found this yesterday (pasted below):

"Yesterday, I found some interesting articles about a component called “Receive-Side Scaling” (RSS) that is enabled by default on Win2008 servers (“netsh int tcp show global”). Briefly, RSS balances receive traffic across multiple CPUs by offloading the data from the NIC to the CPUs. All modern NICs support RSS, and it is enabled by default. This is fine for physical servers. BUT, when your Win2008 server is virtual, with a virtual NIC (VMware) that does not support RSS, it is recommended to disable RSS so the OS does not attempt a function that is unsupported. Leaving it enabled may lead to intermittent failures, slowdowns, dropped connections, etc. To disable RSS, the command is “netsh int tcp set global rss=disabled”.

So I disabled RSS and we have not had an incident in 17 hours. I’m keeping my fingers crossed."
Interesting.  Please post an update if it goes a week without trouble.
Sadly, we went 3 days without any error, but had two FTP 23-second timeouts last night.

Interestingly, since we did the TCP tweaking, it appears we're getting timeout failures on the DIR (list) commands that are included in the job.

I don't remember ever getting errors on those (normally just on the put statements).

Mike

PS:  I opened up a paid incident with microsoft this morning.
*******************SOLVED**************************

(i hope)

It turns out there is a confirmed bug in the Cisco PIX OS version we are running.

Upgrading to v6.3.5 is supposed to fix the FTP bug.

We will upgrade on Monday.

I'll let everyone know.

BTW, what is proper Experts Exchange etiquette?

I want to acknowledge everyone who replied, but if the solution is the PIX upgrade, what am I supposed to do about points?
 
If one or more people helped you find the problem then you should give them the points or share them.

If not, the experts will understand.
So, if the solution was completely unrelated to the kind contributions of the experts, how do I close the call without awarding points to anyone?
I think you can award no points or award them to your own solution.
We upgraded the PIX software last night.  Keeping our fingers crossed.  No failures last night (but that doesn't mean anything).

Mike
I still think it was a failure opening the data channel. The default Windows timeout waiting for a SYN-ACK is 21 seconds. If you give it a second to get ready to open the data channel and a second to time out and start writing to the log, then you're at 23. That said, if your firewall is supposed to open the data channel port on the fly but doesn't, it could be misconfigured, or this could be one of the things fixed by updating the firewall.
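
The 21-second figure can be reproduced with a little arithmetic. This is only a sketch, assuming the commonly cited Windows defaults of a 3-second initial retransmission timeout and two SYN retransmissions, with the wait doubling after each attempt; those two values and the function name are assumptions, not something measured on this server.

```python
def syn_connect_timeout(initial_rto: float, retransmissions: int) -> float:
    """Total seconds spent waiting for a SYN-ACK before the connect fails.

    The first SYN waits initial_rto seconds; every retransmitted SYN
    waits twice as long as the one before it.
    """
    total, wait = 0.0, initial_rto
    for _ in range(retransmissions + 1):  # original SYN plus the retries
        total += wait
        wait *= 2
    return total


if __name__ == "__main__":
    connect = syn_connect_timeout(initial_rto=3, retransmissions=2)  # 3 + 6 + 12
    print(connect)      # seconds until the connect attempt gives up
    print(connect + 2)  # plus roughly a second on either side, as described above
```

With those assumed defaults the connect attempt gives up after 3 + 6 + 12 = 21 seconds, which fits the 23-second gap in the logs once the extra second on each side is added.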
It looks like it was indeed the Cisco PIX. The upgrade to v6.3.5 solved it.

Rajesh and Richard from Microsoft were great, really did a great job of helping us rule out our IIS, and of following up.

So as not to offend any of the experts who took the time to offer suggestions, I'm going to split points.

Thanks to everyone for their suggestions.

Mike
Even though the problem turned out to be our pix firewall, I very much want to thank everyone for their help.

Regards,
Mike