Should I be using Full HTTP monitoring or Simple HTTP monitoring for health checks?

Hi,

We recently upgraded from a Foundry load balancer, which used simple health-script checks to keep nodes in the pool, to a Zeus load balancer. The Zeus uses ping, the Zeus "Full HTTP" monitor, and a health script, which is a .NET page that renders a simple "server up" message. Passive monitoring is turned OFF.

We have multiple issues but it basically comes down to this:
1. The Zeus will drop a node for whatever reason. In this morning's case, it was a timeout to the node it dropped.
2. Our CDN all of a sudden can't get to origin.
3. Our sites go down until we purge our CDN.

This morning Zeus dropped only one node, yet it displayed the regrettable "no suitable nodes available to service your request" message.

There were no errors or problems on the box: no application or system events, no processor pegs, no memory pegs. But the box was unresponsive for whatever reason, and the entire pool failed. Two minutes later, the pool was back online and functioning properly with 2 nodes. We of course had to flush our CDN to get rid of the cached "no suitable nodes" message.

The question is this: is "Full HTTP" too much monitoring for a simple content and registration Web site?
Would Zeus' "Simple HTTP" be a better alternative that creates fewer false positives with the nodes?

Any help would be appreciated!

Thanks
Chris

ASKER CERTIFIED SOLUTION

Pugglewuggle

garriganlyman

ASKER

So they have "started" by raising our connection timeouts on the health file, but I am still not convinced, and I am having trouble getting them to change to Simple HTTP. I think it's the right answer, but I have a bunch of stuff in production that I worry about if something goes wrong. I may build a test pool as a separate VIP and see how that fares with monitoring.

SOLUTION

Pugglewuggle

garriganlyman

ASKER

Pugglewuggle - your answer put us on the right track, I think. Tough to know without "soaking in it" for a while, but as a combination solution, we have set 500 to be an allowed return. This way servers are not dropped from the pool if an application they call is experiencing errors and returns a 500. There was also some output caching that needed to be done to minimize the 500 errors. On some of our more dynamic apps, if a db call failed, .NET would keep going back to the well to try and try again, which would throw it into a tailspin. Setting the output cache to expire once an hour means we only require one successful transaction per hour, which is then cached by the .NET page itself. This will keep the 500s at bay.
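For anyone who wants to reason about the two ideas above, here is a rough Python sketch of the logic only. It is not Zeus configuration and not our .NET code; the names `ALLOWED_STATUSES`, `node_is_healthy` and `HourlyCachedCheck` are made up for illustration, and the one-hour lifetime simply mirrors the output cache setting described above.

```python
import time
from typing import Callable, FrozenSet, Optional

# Assumption for illustration: 200 and 500 both count as "node is up",
# mirroring the "500 is an allowed return" setting described above.
ALLOWED_STATUSES: FrozenSet[int] = frozenset({200, 500})


def node_is_healthy(status: Optional[int],
                    allowed: FrozenSet[int] = ALLOWED_STATUSES) -> bool:
    """Return True if the monitor should keep the node in the pool.

    `status` is None when the request timed out or the connection failed,
    which always counts as a failed check.
    """
    return status is not None and status in allowed


class HourlyCachedCheck:
    """Cache a successful backend check for `ttl_seconds` (hypothetical helper).

    While a cached success is still valid, the backend is not probed again,
    so a flaky db call cannot trigger a retry storm on every health request.
    """

    def __init__(self, probe: Callable[[], bool],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock
        self._good_until: Optional[float] = None

    def ok(self) -> bool:
        now = self._clock()
        # Serve the cached success without touching the backend.
        if self._good_until is not None and now < self._good_until:
            return True
        # Cache expired or never set: require one fresh successful probe.
        if self._probe():
            self._good_until = now + self._ttl
            return True
        return False
```

The trade-off is that a node whose app throws 500s stays in the pool, so only timeouts and refused connections drop it. Whether that is acceptable depends on how much you trust the 500 to be a "false alarm".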

We also set the Zeus to load a custom page, which can be done in the health config section of the pool itself. It shows the same error as the 500 should both servers be dropped from the pool in a catastrophic crash.

Thanks for your input. Your use of the phrase "false alarm" got us thinking differently.

Cheers!
Chris