Server environment performance degradation during load tests

We have a server environment consisting of a load balancer (hardware), 3 web servers and a single database server.  Against the environment we have run 3 load tests, scaling up to 600 concurrent users over a 20 minute period.  On each test the response time for the page load begins to degrade at exactly the same point, when the number of concurrent users hits 100.

Test one was done with two web servers balanced.
Test two was done with three web servers balanced.
Test three was done with three web servers balanced but with +2 CPUs allocated.

All three tests have exactly the same result.  I have had the hosting provider for the infrastructure review the network; none of the hardware elements in the route have restrictions or limits on users, and the network is handling the load with ease.  The individual web servers are also handling the load evenly, even at peak, without maxing out CPU or memory. It seems highly irregular that the drop-off point remains identical despite the increase in resources, and this has led me to question whether there is a configuration or set-up default value within the system which is reaching its limit (100).

This is a theory, but it is posing a major challenge to what should be a robust server set-up, and I would appreciate some expert opinion on what could be causing the issue.  In preparation I have already ensured I have New Relic APM data available for all three sessions as well as general hardware monitoring data.
James Mclean Asked:

Zaheer Iqbal (Technical Assurance & Implementation) Commented:
Under the Application Pool in IIS there is a queue setting. I can't remember its exact name off the top of my head, as I'm using EE mobile to answer.
David Favor (Linux/LXD/WordPress/Hosting Savant) Commented:
Linux provides the ability to nice (CPU) + ionice (Disk) a process to change its CPU/Disk queuing priority.

Check your OS docs to see if there's a similar way to affect CPU/Disk queuing priority.

A good rule of thumb is to avoid load testing on a production machine, unless your entire Technology Stack is tuned very well.

On Ubuntu + LXD + WordPress sites, I run this type of load test every few minutes on all my production servers, to ensure all client sites are running at full speed.

lxd: net11-jasites # time nice -19 h2speed --compact --count=1000
h2load -ph2c -t16 -c16 -m16 -n16000
finished in 1.59s, 10084.08 req/s, 1.18MB/s
requests: 16000 total, 16000 started, 16000 done, 16000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 16000 2xx, 0 3xx, 0 4xx, 0 5xx
Requests per second: 10,084.08
Requests per minute: 605,044.8
Requests per hour  : 36,302,688

real	0m1.673s
user	0m0.596s
sys	0m0.656s

The only way this works is if the entire Tech Stack is caching content correctly.

So it is also vital to ensure that a single request caches correctly, so that subsequent tests return cached data.

Caching here includes database + PHP OPcache + mod_rewrite (or the IIS equivalent).

I normally run load tests on a zero length .txt file first, because if this runs slow all other load testing is abandoned.

Next is a test of a simple PHP file, like hello.php or similar.

Next is a simple database test - open + SELECT LIMIT 1 + close.

Then, if all three previous tests run at the anticipated speed, I run a load test against the actual WordPress sites, as all client sites I host are WordPress.

By taking a stepped approach to load testing, you can ensure (or make a good guess) your load test will run without taking out your production site(s).
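
The stepped approach described above can be sketched roughly as follows. This is only an illustration in Python, not the tooling used in this thread: the step names, the `measure` callables and the minimum requests-per-second thresholds are all hypothetical placeholders that you would replace with real measurements (for example, numbers parsed from your load tool's output).

```python
from typing import Callable, List, Tuple

# Hypothetical step definition:
# (step name, function returning measured requests/sec, minimum acceptable requests/sec)
Step = Tuple[str, Callable[[], float], float]


def run_stepped_tests(steps: List[Step]) -> List[Tuple[str, float, bool]]:
    """Run steps in order and stop at the first one that misses its threshold."""
    results = []
    for name, measure, min_rps in steps:
        rps = measure()
        passed = rps >= min_rps
        results.append((name, rps, passed))
        if not passed:
            # A cheap test is already slow, so the heavier tests are abandoned.
            break
    return results


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    steps = [
        ("zero-length .txt file", lambda: 10000.0, 5000.0),
        ("hello.php", lambda: 900.0, 1000.0),
        ("open + SELECT LIMIT 1 + close", lambda: 800.0, 500.0),
        ("full WordPress page", lambda: 300.0, 200.0),
    ]
    for name, rps, passed in run_stepped_tests(steps):
        print(f"{name}: {rps:.0f} req/s, {'ok' if passed else 'too slow, stopping'}")
```

Ordering the steps from cheapest to most expensive means a slow result early on points at the layer just added (web server, PHP, database) before the full site is ever put under load.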

Sean (System Engineer) Commented:
There isn't exactly a single setting that would say 100 users max but there are a number of things you can do to improve performance:

Also make sure you aren't maxing out disk I/O. You can throw all the CPU and memory you want at it, but if disk I/O is capped then you'll always be limited by that.

Dan McFadden (Systems Engineer) Commented:
A few additional questions:

1.  What is the site built in?  ASP.NET, PHP, etc...
1a.  What version is in use?
1b.  32bit or 64bit?
2.  What version of Windows Server?
3.  What database product?  MS SQL, MySQL, Postgres, Oracle?
4.  What load balancer is in use?
5.  Does the web site/ web app store session data?
6.  How is the http traffic load balanced?  Round robin, load based, latency based, sticky session?
7.  What is the load on the CPU, RAM on the LB?
8.  What is the load on the CPU, RAM, Disk on the IIS servers?
9.  What is the load on the CPU, RAM, Disk on the database server?

As for your testing methodology... have you tried stressing one of the web servers directly to see if the user issue arises?  By taking the LB out of the equation, you may be able to isolate the issue and focus on a smaller area.
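
One way to compare a direct-to-server run with a run through the LB is to look for the concurrency level at which response time departs from its light-load baseline. The Python sketch below is a simple heuristic under stated assumptions: the `find_knee` helper, the 2x threshold and the sample figures are all hypothetical, and real data would come from your load tool or APM export.

```python
from typing import List, Optional, Tuple

# (concurrent users, average response time in ms)
Sample = Tuple[int, float]


def find_knee(samples: List[Sample], factor: float = 2.0) -> Optional[int]:
    """Return the first user count whose response time exceeds
    factor x the response time at the lightest load, or None."""
    ordered = sorted(samples)
    if not ordered:
        return None
    baseline = ordered[0][1]
    for users, response_ms in ordered:
        if response_ms > factor * baseline:
            return users
    return None


if __name__ == "__main__":
    # Made-up figures purely for illustration.
    via_lb = [(25, 200.0), (50, 210.0), (100, 650.0), (200, 1400.0)]
    direct = [(25, 190.0), (50, 200.0), (100, 205.0), (200, 230.0)]
    print("knee via LB:", find_knee(via_lb))
    print("knee direct:", find_knee(direct))
```

If the knee appears at the same user count in both runs, the LB is less likely to be the cause; if it appears only through the LB, that narrows the search considerably.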

In order to get a better feel for a baseline of your setup, testing the individual components of the system is helpful.  If you know the capacity of the web server, you can determine if the system, as a whole, is meeting or exceeding the expected output.
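
As a rough illustration of that comparison, the Python sketch below sets a single server's measured capacity against the throughput of the whole system. The `capacity_report` helper and all figures are hypothetical, and it assumes ideal linear scaling across web servers, which a shared database or LB will rarely allow in practice.

```python
def capacity_report(per_server_rps: float, server_count: int, measured_rps: float) -> dict:
    """Compare measured system throughput with an idealised linear-scaling estimate."""
    # Assumption: each extra web server adds the same capacity (best case only).
    expected = per_server_rps * server_count
    efficiency = measured_rps / expected if expected else 0.0
    return {
        "expected_rps": expected,
        "measured_rps": measured_rps,
        "efficiency": efficiency,
    }


if __name__ == "__main__":
    # Made-up figures: one server sustains 400 req/s alone, three servers together reach 450.
    report = capacity_report(per_server_rps=400.0, server_count=3, measured_rps=450.0)
    print(report)
```

An efficiency far below 1.0 that does not improve when web servers are added suggests the limit sits in something the servers share, rather than in the web servers themselves.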

James Mclean (Author) Commented:
My thanks to everyone who has submitted a solution to date; they are all helpful in our quest to understand this behaviour.
David Favor (Linux/LXD/WordPress/Hosting Savant) Commented:
Poster has stopped posting.

Picked two solutions which provided useful information the poster can use to debug the problem.