bplant

apache + keepalive + internet explorer + ajax = random 500 internal server error

Background info:

- We have 2 apache (2.2.9) servers (load balanced) on vanilla 2.6.28 linux running as xen VMs using paravirt_ops.
- The keepalive setting in apache is set to 5 seconds and a maximum of 250 requests.
- The application uses PHP (5.2.8) including suhosin (0.9.27) and mod_fcgid (2.2) is used to manage the PHP processes.
- The application includes an ajax component that involves submitting POST requests to the server and getting a response upon user interaction.
- The problem only exists with internet explorer and is version independent. I.e. it occurs with versions 6, 7 and 8.
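
In httpd.conf terms, the keepalive settings described above would look roughly like this. This is a sketch using the standard Apache 2.2 directive names; the actual configuration file was not posted, so treat the layout as an assumption.

```apache
# Sketch only: the values come from the description above,
# the file layout is assumed.
KeepAlive On
MaxKeepAliveRequests 250
KeepAliveTimeout 5
```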

When it all works:

99% of the time, the client browser downloads the elements on the page using 1 or more keepalive connections. The user interacts with the page and the AJAX POST requests are made to the server. If the user interaction and therefore the AJAX request occur within 5 seconds of another communication with the web server, then the AJAX request may use an existing keepalive connection. This is the normal case when everything works correctly.

The problem:

Using wireshark, I have found that when the 5 second keepalive expires, a FIN ACK is sent from the web server to the client and the client immediately responds with an ACK.

Shortly after (~0.5 seconds), the client makes a POST request using the same connection and then closes the connection with a FIN ACK. No HTTP response is ever sent to the client, which makes sense since the server has already closed its side of the connection and cannot send any more data. The request gets logged by apache as a 500 internal server error.

The issue is random and seems to be reliant on interaction occurring ~5 seconds after the last communication with the server so that the client will try and use a recently closed connection.

I have tried turning off keepalive in apache and this seems to fix the issue. Turning off keepalive however is not a solution as we want this left on.

Any advice on how to overcome this problem?
ai_ja_nai

If the problem is only IE-related, it is probably an IE bug that can't be resolved here, due to the well-known "openness" of MS products.
What if you increase the keepalive time? Say, 10 secs? This should cover most of the short-timed interactions, while being short enough to expire properly on long inactivity periods.
bplant

ASKER

Hi ai_ja_nai,

Thank you for your comment.

I had considered increasing the keepalive timeout, however the user interaction could start at any time after the page has loaded. I.e. they might read some of the content on the site before interacting with the AJAX components.

Increasing the keepalive timeout also affects server performance since threads must wait longer before closing connections and therefore handling new requests. Obviously this can be gotten around by increasing the MaxClients setting, but this consumes more memory etc.
I know what you mean. But if the bug really is IE-related, what can we do? Debug IE? :-/

You could add to your webpage a php/javascript/whatever scripted function that calculates the average time spent by a user before clicking again. There are generally 2 types of users: those who find what they want and spend at least 10 secs or more reading content, and those who are looking for something else and are just using the page as a waypoint to reach other linked pages. If you are able to calculate the time a user takes to navigate the page in the second case, you could use that time as the keepalive timeout.
bplant

ASKER

Hi ai_ja_nai,

I'm not saying that it isn't a bug in IE, but it would seem strange that the bug exists in 3 different versions of IE (6, 7 or 8).

Statistics on when people interact with the page aren't going to help a whole lot because the interaction is ongoing. The AJAX calls are made as the user changes filter options and the user may interact with the page for a couple of minutes before moving on to other pages. So increasing the keepalive timeout isn't going to change much.

While increasing the keepalive timeout may change the number of internal server errors reported per day, it doesn't solve the problem. The server logs get used to generate web statistics and I don't want 500 errors showing up in the list. It doesn't look good.

There must be a real solution for this and I'm determined to find it!
>it would seem strange that the bug exists in 3 different versions of IE (6, 7 or 8)
How about the fact that explorer.exe crashed in all Windows versions? ;)
IE has never been really compliant with any standard. If you want things done well, go for Safari or Mozilla-derived browsers.

Otherwise it could be that AJAX multiplexes connections and forces Apache not to close them properly due to some still obscure side effect.
bplant

ASKER

Unfortunately IE accounts for over 70% of users and we have no control over what browser people use so we're stuck with IE for the moment.

So far I have not found any other reports of this occurring to other people. I'm looking for a solution, not an excuse. Failing that, I need to find other people having the same issue as it's currently looking like something unique to my setup.
Can you deactivate ajax for that page? Use something more synchronous that doesn't multiplex connections? Just to bug-test.
bplant

ASKER

Unfortunately not. The website is no longer under development and disabling ajax would render the site ineffective.

Changing how the page works is not a solution. The servers host several sites and many are not under our control.

I'm really looking for a "why" this is happening as well as a true solution. Neither increasing the keepalive timeout, removing the ajax features nor avoiding IE gives a reason for this behaviour or solves the problem.
500 internal server errors usually relate to server-side resources not being available -- it could be Apache MaxClients, threads to the Database backend, multi-thread issues in relation to the mod_fcgid, etc.

When a '500 internal server error' occurs, can you post a copy of your Apache's error_log file so we can help narrow down the issue? Since you state you are also using PHP, a copy of the php.ini would also be beneficial as well.
bplant

ASKER

Hi mweecomputers,

Thank you for your comment.

I have checked the number of apache threads using ps and it's well below the MaxClients directive. I have also already tried doubling MaxClients just in case. I believe requests get queued when the number of apache threads reaches MaxClients.

The database backend is only using about 20% of available resources.

On the above 2 points, the 500 error occurs during both peak and non-peak loads. Obviously it occurs more during peak times since more people are accessing the site and therefore the chance of a user interacting with the AJAX component ~5 seconds after loading the page is more likely. The fact that it occurs during non-peak times suggests that there is no resource issue.

I can't comment on mod_fcgid, but it is the latest version and is in widespread use.

When a 500 error occurs, I get a mix of these errors in the error log:
[Mon Mar 02 17:58:51 2009] [warn] (70007)The timeout specified has expired: mod_fcgid: can't get data from http client
[Mon Mar 02 19:36:02 2009] [warn] (104)Connection reset by peer: mod_fcgid: can't get data from http client

I believe the above error is consistent with one end of the tcp connection having been closed and therefore the data from the client not being able to be read.

The php.ini file is relatively standard apart from an increased memory limit etc. I don't believe this issue has anything to do with php.
How large is the data you are trying to send via the mod_fcgid? And are you compressing the data?

bplant

ASKER

The POST request contains about half a dozen variables. I'd be surprised if it was more than 100 bytes. The response is a few KB of html.
Do you have a fcgid.conf file that can be cut/paste so it can be reviewed?
bplant

ASKER

My fcgid settings are as follows:

      LoadModule fcgid_module modules/mod_fcgid.so
      SocketPath /var/run/fcgidsock
      SharememPath /var/run/fcgid_shm

      IPCConnectTimeout 30
      IPCCommTimeout 300
      ProcessLifeTime 10800
      MaxProcessCount 100
      BusyTimeout 300
      BusyScanInterval 60
      IdleTimeout 900
      IdleScanInterval 60
      ErrorScanInterval 3
      ZombieScanInterval 3
      MaxRequestsPerProcess 500
      DefaultMinClassProcessCount 0
      DefaultMaxClassProcessCount 25
      SpawnScoreUpLimit 10
      SpawnScore 1
      TerminationScore 2

IPCConnectTimeout was previously the default 3 seconds. Increasing it to 30 seconds had no effect.
ASKER CERTIFIED SOLUTION
ahoffmann

This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
This solution is only available to members.
bplant

ASKER

Hi ahoffmann, giltjr,

Thank you for your comments.

I had considered turning off KeepAlive for just IE and I could use this as a last resort. I was hoping someone might have an answer that doesn't involve this however. If no one else can offer anything better, then I guess this is what it'll have to be. I just find it hard to believe that no one else has come across this issue as I cannot find anything similar to this at all.

giltjr, only about 70% of people are using IE. The rest are FF, Safari, etc. and they are not showing the issue.
SOLUTION
This solution is only available to members.
bplant

ASKER

Hi Duncan,

Thank you for your comment. Don't worry about checking the TCP RFC. I don't think it'll tell us anything new.

Unfortunately I can't try and debug apache since it's on a production box. Trying to debug the issue on a test box would be pretty hard too since you need to get the timing right.

As for the benefit of keepalive: the webpage probably has around 50 images/css/js files to download. Using keepalive means that all these can be downloaded using a handful of TCP connections instead of 50, reducing the overhead of waiting for handshakes and thus making the page load faster for the end user. Would they notice or even be aware of the difference though? Most likely not. It's possible that I'm overestimating the value of keepalive.
SOLUTION
This solution is only available to members.
SOLUTION
This solution is only available to members.
bplant

ASKER

Hi Duncan, giltjr,

Thank you for your comments.

Unfortunately advertising a trial process is something I cannot do.

I have already disabled keepalive for all MSIE clients. As much as I don't want to do this, I see it as the only option at this stage.
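
For anyone wanting to try the same workaround: one common way to do this in Apache 2.2 is the special nokeepalive environment variable, set through mod_setenvif. The following is a minimal sketch, assuming mod_setenvif is loaded; it is not necessarily the exact directive used on these servers.

```apache
# Sketch only: assumes mod_setenvif is available.
# Any client whose User-Agent contains "MSIE" gets the special
# "nokeepalive" variable, so Apache closes the connection after
# each response instead of keeping it open.
BrowserMatch "MSIE" nokeepalive
```

The trade-off is that every request from those clients then needs its own TCP connection.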

If no one offers any idea as to "why" this might be happening in the next few days then I'll just have to settle for disabling keepalive on all MSIE clients and close the question.
Are you using SSL or non-SSL?
bplant

ASKER

I am not using SSL. SSL connections have keepalive disabled for MSIE anyway :)
I found a few reports where IE has issues with keepalive on SSL connections. Unfortunately MS has a habit and a history of not following RFCs. They tend to walk to their own beat so there may not be much you can do.
bplant

ASKER

Hi all,

I tried disabling keepalive for MSIE clients only, but I received some feedback saying that it made the page load slower. I've had to enable keepalive for MSIE clients again which means I'm back at square 1.

Thank you for all your comments. I am going to close this question now.
bplant

ASKER

Wish we could have figured this one out. Thanks for your help
We seem to have a similar issue, have you found any way to resolve it?
The problem was with our HTTP server persistent connection setting (KeepAliveTimeout = 5 seconds) and the way that IE handles the situation when a request is sent on an established connection that expires just as the request reaches our server. According to RFC 2616 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.4), only idempotent requests (such as GET) may be automatically retried. A POST is non-idempotent, so it is not retried, and all our extjs ajax requests are POSTs by default.

We changed the KeepAliveTimeout setting on the server side to 65 seconds so that IE 'expires' the connection on the client side first: by default, IE does not re-use a persistent connection once it is 1 minute old.
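
As a sketch of that change, assuming the standard Apache directive names (the actual configuration is not shown in this thread):

```apache
# Sketch only: keep idle connections open slightly longer than
# IE's default 1 minute client-side limit, so the browser gives up
# on the connection before the server closes it.
KeepAlive On
KeepAliveTimeout 65
```

Note that a longer timeout keeps server threads tied up with idle connections for longer, which is the MaxClients and memory trade-off raised earlier in the thread.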