Chris Kenward

CentOS Server dying at high load average

I have several CentOS VPS units running cPanel. At regular intervals, the server load goes extremely high (248+) and the server is dead in the water. At this point all customer websites etc. are unavailable. I usually end up resetting the server which is not good for MySQL databases etc. and need to discover what's really going on so I can stop this from happening.

At the time the server is non-responsive, there are hundreds of lines on the console saying kill process ID or sacrifice child. However when this is happening there is not much chance to get into the console as it's too busy going 'round in circles.

I had someone from cPanel support take a look and they say it has nothing to do with cPanel. Here is what the tech wrote:

I was monitoring your server from last 30 minutes and the server load was stable but the site https://www.tgis.co.uk/ is taking time to load.
It indicates that there is an issue with the site scripting/database that is eating resources.
For reference I have checked server old logs and found that the same domain is eating resources

top - 07:40:15 up 13:12,  1 user,  load average: 53.75, 110.05, 170.98
Tasks: 267 total,  49 running, 216 sleeping,   0 stopped,   2 zombie
Cpu(s):  3.8%us,  3.9%sy,  0.0%ni, 30.2%id, 61.3%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:   3925368k total,  3243796k used,   681572k free,    50444k buffers
Swap:  4128764k total,   353920k used,  3774844k free,   662552k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14221 tgis      20   0  331m  18m  636 R  4.9  0.5   0:58.55 /usr/bin/php /home/
14241 tgis      20   0  331m  19m  144 R  4.9  0.5   0:47.39 /usr/bin/php /home/
14255 tgis      20   0  253m  21m 1732 R  4.9  0.6   0:52.20 /usr/bin/php /home/
17441 tgis      20   0  395m 105m 9552 R  4.9  2.8   0:05.24 /usr/bin/php /home/
17466 tgis      20   0  309m  84m 9320 R  4.9  2.2   0:03.90 /usr/bin/php /home/
17576 tgis      20   0  248m 7440 5260 R  4.9  0.2   0:00.22 php -q /home/tgis/p
14111 tgis      20   0  251m  24m 4372 R  4.3  0.6   0:57.59 /usr/bin/php /home/
14602 root      20   0 92712 8924 1508 R  4.3  0.2   0:37.19 /usr/local/cpanel/s
17508 tgis      20   0  340m  51m 9368 R  4.3  1.3   0:01.82 /usr/bin/php /home/
 2892 root      20   0 32832 7764 1928 R  3.7  0.2   0:00.41 /usr/local/cpanel/3
13111 tgis      20   0  253m  18m 2176 R  3.7  0.5   1:16.94 /usr/bin/php /home/
14234 tgis      20   0  335m  18m   76 R  3.7  0.5   0:44.64 /usr/bin/php /home/
14252 tgis      20   0  247m  21m 2692 R  3.7  0.6   0:56.60 /usr/bin/php /home/
17337 tgis      20   0  420m 130m 9596 R  3.7  3.4   0:10.07 /usr/bin/php /home/
17421 tgis      20   0  417m 127m 9596 R  3.7  3.3   0:06.73 /usr/bin/php /home/
17444 tgis      20   0  398m 109m 9596 R  3.7  2.9   0:05.56 /usr/bin/php /home/
17445 root      20   0 74540  17m 2580 R  3.7  0.5   0:03.92 /usr/local/cpanel/s
17465 tgis      20   0  372m  83m 9248 R  3.7  2.2   0:04.15 /usr/bin/php /home/
17496 tgis      20   0  346m  56m 9212 R  3.7  1.5   0:02.19 /usr/bin/php /home/
17500 tgis      20   0  348m  59m 9416 R  3.7  1.5   0:02.08 /usr/bin/php /home/
17503 tgis      20   0  340m  51m 9216 R  3.7  1.3   0:01.92 /usr/bin/php /home/
17504 tgis      20   0  340m  51m 9200 R  3.7  1.3   0:01.94 /usr/bin/php /home/
17506 tgis      20   0  332m  42m 9204 R  3.7  1.1   0:01.86 /usr/bin/php /home/
17525 tgis      20   0  326m  36m 8940 R  3.7  1.0   0:01.19 /usr/bin/php /home/
17549 clamav    20   0  116m  41m 1084 R  3.7  1.1   0:00.70 /usr/local/cpanel/3
14250 tgis      20   0  320m  20m  916 R  3.0  0.5   0:54.08 /usr/bin/php /home/
14277 tgis      20   0  327m  19m  704 R  3.0  0.5   0:45.58 /usr/bin/php /home/
14507 tgis      20   0  330m  19m   96 R  3.0  0.5   0:43.13 /usr/bin/php /home/
17252 tgis      20   0  416m 127m 9596 R  3.0  3.3   0:09.15 /usr/bin/php /home/

Seems a dynamic site with busy database as mysql error logs shows plenty of threads open

209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 244  user: 'tgis_whmcs'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 234  user: 'root'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 230  user: 'tgis_whmcs'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 228  user: 'tgis_oneadmin'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 227  user: 'tgis_oneadmin'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 222  user: 'tgis_oneadmin'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 221  user: 'tgis_oneadmin'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 220  user: 'tgis_oneadmin'

I have done a malware scan just in case and it seemed to come up OK.

Can anyone help diagnose these issues? It's happening not only on this VPS but on others, even those on different hypervisor hosts. We use ESXi licensed version.

Many thanks
Chris
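The "kill process ID or sacrifice child" console lines are what the kernel's out-of-memory killer prints when it runs out of RAM and swap. After a reset, those events can usually still be found in the system log. A minimal sketch, assuming the messages are logged in the usual "Out of memory: Kill process <pid> (<name>) ..." form; the function name oom_summary is made up for illustration:

```shell
# Count OOM-killer victims per process name in a syslog-style file.
# Usage (log path is an assumption): oom_summary /var/log/messages
oom_summary() {
    awk '/Out of memory: Kill process/ {
             # Isolate "Kill process <pid> (<name>)" and keep only <name>.
             if (match($0, /Kill process [0-9]+ \([^)]*\)/)) {
                 s = substr($0, RSTART, RLENGTH)
                 sub(/^Kill process [0-9]+ \(/, "", s)
                 sub(/\)$/, "", s)
                 count[s]++
             }
         }
         END { for (name in count) print count[name], name }' "$1" | sort -rn
}
```

If one process name dominates the list, that is the one to look at first.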
arnold

This sounds like a server that is under-specified for the requests and services it handles, or possibly a DDoS attack,
either of which would pile requests onto the backend MySQL database.

Does the issue correlate with a specific time of day or with peak usage of a site?

In your situation, tuning MySQL might not be enough to cure the resource depletion.

I'm not sure whether you can give the VPS additional resources (memory, processing).

What are the host's performance stats during that time frame?
How many connections is your web server configured to have (min/max servers and server type)? Also, are your PHP scripts establishing persistent connections to the database?
Chris Kenward (ASKER)

@arnold: These are all web servers running PHP and MySQL. They are all fairly small servers and all have around 4Gb RAM. The issue happens at any time during the day - could be DDOS I guess but not sure how to find out. Each server is running CSF.

@gr8gonzo: Could you elaborate on how to check and change these settings? This sounds like a good starting spot. Many thanks.

Cheers
Chris
The server min/max client configs will be set in your web server config files. If you're using Apache (likely) it would be in httpd.conf or maybe httpd-mpm.conf.

To find out about persistent connections, search for "pconnect" or "p:" in all your PHP code files (there's a small chance of some false positives).

Also, do you define a connection limit for mysql on my.ini?
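The search for persistent connections suggested above could look roughly like this. It is only a sketch: find_pconnect is a made-up helper name, --include assumes GNU grep, and as noted the "p:" pattern will produce some false positives.

```shell
# List lines in PHP files that mention pconnect or a quoted "p:" host prefix.
# Usage: find_pconnect /home/tgis/public_html
find_pconnect() {
    # -r recurse, -n show line numbers, -E extended regex, *.php files only
    grep -rnE --include='*.php' "pconnect|['\"]p:" "$1"
}
```

Each hit is printed as file:line:text, so the surrounding code can be checked by hand.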
Hi there gr8gonzo

This is a cPanel server. I haven't set up anything that isn't auto setup by cPanel when it's installed. I installed the CSF (LFD) plugin for security and went through the security checks but apart from that I haven't fiddled with setup of either PHP, MySQL or the O/S of the server.

Where is my.ini located? I cannot find this file on the server at all.

Thanks
Chris
Hi Chris,

I just took another, longer look at your process list - did you edit anything out of that list? I would normally expect a full path to a PHP file at the end of these kinds of lines - running PHP against a folder shouldn't really do anything:

17500 tgis      20   0  348m  59m 9416 R  3.7  1.5   0:02.08 /usr/bin/php /home/
17503 tgis      20   0  340m  51m 9216 R  3.7  1.3   0:01.92 /usr/bin/php /home/
17504 tgis      20   0  340m  51m 9200 R  3.7  1.3   0:01.94 /usr/bin/php /home/
17506 tgis      20   0  332m  42m 9204 R  3.7  1.1   0:01.86 /usr/bin/php /home/

Instead of pulling the contents from top, can you provide the output of "ps auxf" instead?

After my second look, I'm less convinced it has anything to do with connections (apache or mysql) and it looks more like you might have some PHP scripts that are stuck in an infinite loop. In that scenario, if a PHP script has opened a connection to the database, it will keep that connection to itself until the script ends, so if this script is running a lot, then you're basically creating processes and connections that are never closing properly.
Hi gr8gonzo

I think I may have inadvertently cut the lines off. I've attached a file containing latest PS auxf. Hope it's good.

Cheers
Chris
WID_PS_auxf.txt
Okay, I don't see those PHP lines in there anywhere, so I assume it's something that might be kicked off manually.

Is "tgis" a customer or something, or is it just the main system user account for your site (I'm guessing it's the latter based on the URL you provided)?

I would next do this:
1. Run crontab -u tgis -l
If there are any cron jobs listed in there, post the output here.

2. Next, do a grep through your source code in /home/tgis (I'm assuming that's the user's home dir) to look for PHP loops:
grep -riP "\b(while|for|foreach)\s*\(" * | grep "\.php:"

The resulting files that come back from that command will need you to examine those loops to look for scenarios where the loop might never end. For example:
$some_condition = true;
while($some_condition)
{
  if(XYZ happens)
  {
    $some_condition = false;
  }
}

In the above loop, if XYZ never happens, then $some_condition will never change to false, and the loop will never end, which will suck up your CPU and likely cause the problems you're seeing.

Infinite loops can happen in a number of ways, so that's just one example.

You might also want to grep for any instances of "set_time_limit":
grep -ri "set_time_limit" *

Normally PHP scripts should time out by themselves after 30-60 seconds (whatever your PHP config is set to), but set_time_limit() can override that setting, particularly if you see set_time_limit(0) which means "disable the timeout completely".

However, fixing the time limit might only bandaid the problem - you might still end up with scripts that are running in infinite loops for a period of time, which is a huge drain on your resources.
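Since a plain grep for set_time_limit tends to return a lot of noise, it may help to list only the files that disable the timeout completely. A rough sketch, assuming GNU grep; find_unlimited_scripts is a hypothetical name:

```shell
# Print the PHP files under a directory that call set_time_limit(0).
# Usage: find_unlimited_scripts /home/tgis/public_html
find_unlimited_scripts() {
    # -l prints file names only; whitespace around the 0 is tolerated
    grep -rlE --include='*.php' \
        'set_time_limit[[:space:]]*\([[:space:]]*0[[:space:]]*\)' "$1" | sort
}
```

A call with any other value, such as set_time_limit(30), is deliberately not reported.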
Hi there

The site is my main site running WHMCS. That's all it does. There are no customers on this server. I'll need to address similar issues on other servers once we've been through this one. Thanks as always for getting back to me.

The result of the CRON output is:

root@wid [~]# crontab -u tgis -l
MAILTO=""
SHELL="/usr/local/cpanel/bin/jailshell"
0 8 * * * php -q /home/tgis/public_html/whmcs/crons/cron.php

SHELL="/usr/local/cpanel/bin/jailshell"
*/5 * * * * php -q /home/tgis/public_html/whmcs/crons/pop.php

SHELL="/usr/local/cpanel/bin/jailshell"
0 */6 * * * php -q /home/tgis/public_html/whmcs/crons/domainsync.php

These are simply jobs to sync domains etc. for the WHMCS package and to intercept incoming POP mail for open tickets.

The test for loops that never end resulted in literally hundreds of results - if there were bad things in here then WHMCS would not work.. I have a feeling that this is a non-starter. It's a very popular commercial package and there would be a lynch mob out there if it was the cause of bringing the server down.

The set time limit test also brings up literally 100s of lines, many of which have the (0) as the limit.

Usually, when the server goes into a tailspin I think it's MySQL that has managed to drag the server into the red.

Looking at my.cnf, it looks like this:

[mysqld]
default-storage-engine=MyISAM
innodb_file_per_table=1
local-infile=0
innodb_buffer_pool_size=124780544
max_allowed_packet=268435456
open_files_limit=10000

I noticed that it appears we are using MyISAM rather than InnoDB. Is this a good thing? Should we be limiting max_concurrent_connections etc.?

Cheers
Chris
The default engine shown there is only a setting for newly created tables. The engine is chosen per table, so show table status from <databasename> (or show create table <tablename>) will tell you whether the tables in <databasename> use InnoDB or MyISAM.

Note that in the settings your InnoDB buffer pool is set to use about 120MB.

There is MySQL tuning that can help improve MySQL responsiveness if needed.


Note that in your process list there are 25 PHP processes with at least 250MB allocated each,
>6.25GB in allocated memory resources.

Depending on what the PHP scripts do: one runs every five minutes.
If your settings are to retrieve a copy while leaving a copy on the server, then as the amount of email grows, the processing on your end increases to differentiate between messages it has already copied and new ones.
Fetchmail might be a better...
Or have the front end mailserver forward the messages ........
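The 6.25GB estimate above adds up the VIRT column, which is reserved address space; the RES column is what actually sits in RAM. To compare the two for one user, the top lines can be fed through something like the sketch below. It assumes the classic top column layout shown in the question; sum_res_mb is a made-up name.

```shell
# Sum VIRT and RES (in MB) for one user from top-style process lines on stdin.
# Usage: top -b -n 1 | sum_res_mb tgis
sum_res_mb() {
    awk -v user="$1" '
        # top prints plain numbers in KiB and adds an m or g suffix for larger values
        function to_mb(v) {
            if (v ~ /g$/) return substr(v, 1, length(v) - 1) * 1024
            if (v ~ /m$/) return substr(v, 1, length(v) - 1) + 0
            return v / 1024
        }
        $2 == user { virt += to_mb($5); res += to_mb($6); n++ }
        END { printf "%d processes, VIRT %.0f MB, RES %.0f MB\n", n, virt, res }'
}
```

If the RES total is well below the installed RAM while the box is swapping, the memory pressure is coming from somewhere other than these processes.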
SOLUTION
gr8gonzo
Hi gr8gonzo

2. arnold's advice is pretty good. Having a 120MB buffer and a 256MB allowed packet size seems a little high. The buffer is less of a concern than the huge packet size allowance. If you have a legitimate chance of pushing or pulling 256 megabytes into or out of MySQL, then you might consider keeping it there, but for most typical PHP scripts, 16MB is plenty.

Where would I go to change this to limit the scripts to 16MB? Could we start with that?

All the best
Chris
Hi gr8gonzo

That should run the "ps auxf" command every 10 minutes and dump it into a timestamped-file in your /tmp folder, like /tmp/psauxf_0930.txt, /tmp/psauxf_0940.txt, /tmp/psauxf_0950.txt, /tmp/psauxf_1000.txt, etc...

That way you have snapshots every 10 minutes.

I've added the CRON and will let you know as soon as I see the files being created. Many thanks
Chris
Ah - there's a problem with the CRON. I get this in Email:

/bin/sh: -c: line 0: unexpected EOF while looking for matching ``'
/bin/sh: -c: line 1: syntax error: unexpected end of file

Have I included something I shouldn't?
It seems the command you want executed contains an opening backtick with no matching closing one.

Please post the entry you added.
If you are adding a script, please paste the entire script.
The first line should be
#!/bin/sh
Hey Arnold

I added this line to the top of the CRON list.

*/10 * * * * ps auxf > /tmp/psauxf_`date "+%H%M"`.txt

Regards
Chris
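It is best to use a shell script rather than the way you have it. In a crontab entry an unescaped % is turned into a newline,
so the command gets cut off right after the opening backtick. Escape each % with a backslash.

Try


*/10 * * * * ps auxf > "/tmp/psauxf_$(date +\%H\%M).txt"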


It is best to have these in a script
Collectdata.sh
#!/bin/sh

Scheduled_time=$(/bin/date +"%H%M")

/bin/ps auxf > "/tmp/psauxf_${Scheduled_time}.txt"


Make sure the script is executable, with either
chmod 700 ....
or
chmod u+x Collectdata.sh

The cron entry
*/10 * * * * /path/to/Collectdata.sh
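A possible variation on the script above, as a sketch only: it puts the full date in the file name, so snapshots from different days do not overwrite each other, and records the load average next to the process list. The function name take_snapshot and the snapshot_ prefix are made up, and the output directory is whatever you pass in.

```shell
#!/bin/sh
# Write one timestamped snapshot (load average + process tree) into a directory.
# Usage: take_snapshot /root/snapshots
take_snapshot() {
    out_dir="$1"
    mkdir -p "$out_dir" || return 1
    # Full date in the name, e.g. snapshot_20180213_2130.txt
    out="$out_dir/snapshot_$(date +%Y%m%d_%H%M).txt"
    {
        echo "== load =="
        cat /proc/loadavg 2>/dev/null || uptime
        echo "== processes =="
        # Fall back to plain "ps aux" where the tree view is not supported
        ps auxf 2>/dev/null || ps aux 2>/dev/null || echo "ps unavailable"
    } > "$out"
    echo "$out"
}
```

The function prints the path of the file it wrote, which makes it easy to mail or attach afterwards.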
@Arnold: OK - thanks I have just changed the CRON and waiting for the first one to run.

@gr8gonzo: I have modified my.cnf file which now looks like this:

[mysqld]
skip-name-resolve
query_cache_size = 16M

log-slow-queries=/var/log/slowqueries.log
long_query_time = 4
log-queries-not-using-indexes

table_cache = 512
tmp_table_size = 128M
max_heap_table_size = 128M
myisam_sort_buffer_size = 8M
sort_buffer_size = 8M
join_buffer_size = 256K
key_buffer = 128M

default-storage-engine=MyISAM
innodb_file_per_table=1
local-infile=0
innodb_buffer_pool_size=124780544
max_allowed_packet=16M
open_files_limit=10000

Is this OK?

Thanks for pointing me in the direction of the article you wrote. Let me know if there's anything else I should be doing with the MySQL?
Hi Folks - is there any way to get the ps auxf script to Email the result to me? That way I can more easily copy the files into Experts Exchange for you?
Instead of redirecting it into a file you could | Mail -s "process" youremail@yourdomain.com
Is this OK?
That generally looks good, but I emphasize "generally." I have no idea what your database/table structure looks like, so nobody can tell you if it's properly optimized without that kind of information (and that's work that's better handled with a paid gig/live help session if you want additional help).

Just keep an eye on things, particularly any logged slow/unindexed queries.

Also, make sure you restart MySQL after any my.cnf changes.

Has the problem occurred yet?
You can also apply changes such as these dynamically, for the variables that can be changed at runtime, with
set global variable_name=newvalue
This way you retain the operational stats that a restart would wipe.
Folks I put the collectdata script on another of my servers to see if I could capture what's happening there. Although it's running fine on the server WID it's giving me this on the other server and I have no idea why:

root@lune [/tmp]# ./collectdata.sh
-bash: ./collectdata.sh: Permission denied

Help?
in order for any script to run, it must have the exec bit set.
ls -l collectdata.sh
yours has -rw- meaning read and write only for user
chmod u+x collectdata.sh
Hi Arnold - no - it looks like this:

-rwx------. 1 root root 97 Feb 13 21:31 collectdata.sh*

Cheers
Chris
what is the first line?

The other possibility is that SELinux is blocking running scripts from /tmp

The first line
Has to be #!/bin/sh

What happens if you run
sh ./collectdata.sh
Hi Arnold

Yes - the script is the same as before:
#!/bin/sh
Scheduled_time=$(/bin/date +"%H%M")
/bin/ps auxf > "/tmp/psauxf_${Scheduled_time}.txt"

If I run the script manually...

root@lune [~]# /tmp/collectdata.sh
-bash: /tmp/collectdata.sh: Permission denied

This is how it is run on the other server and working fine.

If I run it like this:

root@lune [~]# sh /tmp/collectdata.sh

Then I get the file fine:

psauxf_2029.txt which contains the whole ps auxf output.

Thanks
Chris
Other server settings might be different. /tmp is temporary space, and often there is a cron job that cleans it up.

Put the script somewhere else and see whether it gives you the same error there.
Also double-check which sh you are running:
which sh
Given that you have multiple servers, can you double-check the permissions on /tmp/collectdata.sh on the server that is having trouble?

Your description of the behavior:
root@lune [~]# /tmp/collectdata.sh
-bash: /tmp/collectdata.sh: Permission denied

If I run it like this:
root@lune [~]# sh /tmp/collectdata.sh

Then I get the file fine
...is identical to what you'd expect if the execute bit was missing. I'm not sure if there's a possibility that this line:
-rwx------. 1 root root 97 Feb 13 21:31 collectdata.sh*

...came from one of the other servers by accident?

While I'd strongly doubt that this has to do with SELinux, you could try to disable it temporarily and try to re-run /tmp/collectdata.sh:

root@lune [~]# setenforce 0
root@lune [~]# /tmp/collectdata.sh
root@lune [~]# setenforce 1

The only other thing that comes to mind is if your sh shell isn't at its default location for some reason (which would be REALLY odd for root). You can always do:
root@lune [~]# which sh 

to see if there's a separate sh shell interpreter that's being called within your path.
Hi there, gr8gonzo

This is what the top level directory looks like for /tmp

drwxr-xr-x.  3 root root  4096 Feb 22 04:02 tmp/

Definitely from the server I'm looking at while having this conversation with you.

root@lune [~]# setenforce 0
root@lune [~]# /tmp/collectdata.sh
-bash: /tmp/collectdata.sh: Permission denied
root@lune [~]# setenforce 1
root@lune [~]#

root@lune [~]# which sh
/bin/sh
root@lune [~]#

Does that make any sense?

Cheers
Chris
@arnold: Thanks for your reply. See the results of the tests above?

Cheers
Chris
Hi Folks

If I type

root@lune [~]# sh /tmp/collectdata.sh

It works?

Cheers
Chris
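The combination described above (execute bit set, SELinux ruled out, direct execution refused, "sh /tmp/collectdata.sh" working) is also what a file system mounted with the noexec option produces. Whether /tmp on this server is mounted that way is an assumption; nothing in the thread confirms it. A small sketch for checking, with made-up helper names mount_opts and has_noexec:

```shell
# Print the mount options for a mount point, from /proc/mounts or a file in the same format.
# Usage: mount_opts /tmp
mount_opts() {
    awk -v mp="$1" '$2 == mp { opts = $4 } END { if (opts != "") print opts }' "${2:-/proc/mounts}"
}

# Succeed if the mount point carries the noexec option.
# Usage: has_noexec /tmp && echo "/tmp is mounted noexec"
has_noexec() {
    case ",$(mount_opts "$@")," in
        *,noexec,*) return 0 ;;
        *) return 1 ;;
    esac
}
```

If /tmp does turn out to be noexec, keeping the script in another directory and running it from there avoids the error.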
Back to the server with the high load:

Please see the following results from the PS AUXF

(note this is the original server that resulted in this question being asked)

Cheers
Chris
psauxf_0520.txt
psauxf_0521.txt
psauxf_0530.txt
psauxf_0540.txt
psauxf_0550.txt
psauxf_0600.txt
To list the processes,
run top -n 3

Your issue seems to be related to the number of requests you get,
and to the settings in your web server's httpd.conf that control the min/max number of child processes and how many requests each child can handle.

This sounds like an under-provisioning of resources to handle the incoming requests, which hit MySQL and in turn impact storage (virtual in the VM, physical on the host, depending on resource scheduling on the host).
Hi Arnold

top -n 3 gives us this:

Tasks: 160 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   5993836k total,  3397872k used,  2595964k free,   237644k buffers
Swap:  4128764k total,    10916k used,  4117848k free,  1912932k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
14665 root      20   0 13128 1264  928 R  0.7  0.0   0:00.06 top                
    1 root      20   0 19360 1452 1224 S  0.0  0.0   0:04.07 init              
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kthreadd          
    3 root      RT   0     0    0    0 S  0.0  0.0   0:01.12 migration/0        
    4 root      20   0     0    0    0 S  0.0  0.0   0:01.08 ksoftirqd/0        
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/0          
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.24 watchdog/0        
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.90 migration/1        
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/1          
    9 root      20   0     0    0    0 S  0.0  0.0   0:01.16 ksoftirqd/1        
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.20 watchdog/1        
   11 root      20   0     0    0    0 S  0.0  0.0   0:09.44 events/0          
   12 root      20   0     0    0    0 S  0.0  0.0   0:11.38 events/1          
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events/0          
   14 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events/1          
   15 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events_long/0      
   16 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events_long/1      
   17 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events_power_ef    
   18 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events_power_ef    
   19 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cgroup            
   20 root      20   0     0    0    0 S  0.0  0.0   0:00.02 khelper            
   21 root      20   0     0    0    0 S  0.0  0.0   0:00.00 netns              
   22 root      20   0     0    0    0 S  0.0  0.0   0:00.00 async/mgr          
   23 root      20   0     0    0    0 S  0.0  0.0   0:00.00 pm                
   24 root      20   0     0    0    0 S  0.0  0.0   0:00.64 sync_supers        
   25 root      20   0     0    0    0 S  0.0  0.0   0:00.03 bdi-default
Use top, sorted by CPU, to see the top 10 CPU consumers.
Then look at the top 10 sorted by memory consumption.

The VM reports that it is doing effectively nothing: 99% idle.

Tasks: 160 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   5993836k total,  3397872k used,  2595964k free,   237644k buffers
Swap:  4128764k total,    10916k used,  4117848k free,  1912932k cached

When diagnosing issues like the ones you report, the process list only gives you a snapshot of what was running.

It is more indicative to pull vmstat 5 5, iostat 5 5 and top, all in parallel.
vmstat reports on memory use.
iostat reports on storage access, mainly which device has reads, writes, wait.....
Arnold, the problem is when the load average is high, the reports like vmstat etc. are all blank because the server is too slow to be able to fill the stats document file on request.

Cheers
Chris
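One way around the problem described above is to collect the data early, while the load is climbing but the server still responds. A minimal sketch, assuming the Linux /proc/loadavg format; load_exceeds and the threshold of 20 are made up for illustration:

```shell
# Succeed if the 1-minute load average is above a limit.
# Reads /proc/loadavg by default, or a file in the same format.
# Usage: load_exceeds 20 && ps auxf > "/root/psauxf_$(date +%H%M).txt"
load_exceeds() {
    awk -v limit="$1" '{ exit !($1 + 0 > limit + 0) }' "${2:-/proc/loadavg}"
}
```

Run from cron every minute, this only writes a snapshot when the load has already passed the limit, which is usually some minutes before the server stops answering.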
Do you have host data collection, so you can see what is happening on the host and whether there is a correlation between load on the host and the impact on performance in the VM?
Alright! So one mystery solved:

The full command being executed is:
/usr/bin/php /home/tgis/public_html/index.php

...and it's being executed from within an Apache child worker/process, which means it's almost certainly being triggered by a web request.

Other notable things (just for reference purposes):

1. suphp is being used to switch the user context.

2. Of the 160 active PHP processes:
     130 of them are for the index.php file
     10 are for the whmcs/announcements.php file
     6 are for the whmcs/downloads.php file
     4 are for the whmcs/serverstatus.php file
     3 are for the whmcs/supporttickets.php file
     2 are for the whmcs/knowledgebase.php file
     2 are for the whmcs/index.php file
     1 is for the xmlrpc.php file (NOT a whmcs file)
     1 is for the whmcs/cart.php file
     1 is for the whmcs/clientarea.php file

3. You're running WP version 4.9.4 (good that you're keeping it up to date).
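The per-script tally in item 2 can be reproduced from any saved "ps auxf" snapshot with something along these lines. This is a sketch that assumes the interpreter shows up as "php" or a path ending in "/php"; php_script_counts is a made-up name.

```shell
# Count PHP processes per script in a saved "ps auxf" snapshot.
# Usage: php_script_counts /tmp/psauxf_0521.txt
php_script_counts() {
    awk '{
        # The command starts at field 11; tree-drawing characters may come first.
        for (i = 11; i < NF; i++)
            if ($i == "php" || $i ~ /\/php$/) {
                j = i + 1
                while (j <= NF && $j ~ /^-/) j++   # skip options such as -q
                if (j <= NF) count[$j]++
                break
            }
    }
    END { for (s in count) print count[s], s }' "$1" | sort -rn
}
```

Comparing the counts across consecutive snapshots shows which script is piling up.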

The next things to figure out are:
1. The type of web request (GET or POST) and the returned status code. This information should be in the web server's access log for your site. You should be able to correlate the time frames from the captured ps output to the access logs and see the above-mentioned files being called. If you can copy those lines to a separate file (there will be 160 requests for PHP files, plus requests for dependencies such as images and such), and attach it, that would be helpful. That information can help indicate whether the activity is normal or abnormal or even possibly malicious (e.g. bots).

2. Your logs seemed to indicate that the server didn't hang - there was a bunch of activity in 0521 but it seemed all-clear in 0530. However, I'm not sure if that's because you rebooted or if it just cleared up on its own. If it cleared up on its own, it might indicate that this whole slew of processes is "normal" and happens at different times but occasionally hits some unknown brick wall. Can you confirm whether or not the server hung after 0521?

3. There are a few CVEs out for WHMCS, including a denial-of-service attack on the cart.php file within WHMCS (although that one is in the exploit database, EDB-ID 39091, not a CVE), which would be executed with a little bit of code like this:
<?php
$strURL = "https://www.tgis.co.uk/whmcs/cart.php";
$strData = "ajax=1&a=domainoptions&sld=saddddd&tld=saasssssssssss&checktype=owndomain";
$len = strlen($strData);

$ch = curl_init($strURL);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_POSTFIELDS, $strData);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
  "Accept: * /*",
  "Accept-Language: en-gb",
  "Content-Type: application/x-www-form-urlencoded",
  "Accept-Encoding: gzip, deflate",
  "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
  "Host: www.tgis.co.uk",
  "Content-Length: $len",
  "Connection: Keep-Alive",
  "Cache-Control: no-cache"));
  
$result = curl_exec($ch);
curl_close($ch);

it might be worth trying to run that to see if it triggers the problem. If so, the solution may be to upgrade WHMCS.
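For the access-log correlation in item 1 above, a rough sketch follows. It assumes the Apache combined log format, and the function name requests_in_window is made up. The window is a prefix of the log timestamp, so "09/Feb/2018:05:2" covers 05:20 to 05:29.

```shell
# Count requests per client IP, method, path and status within a time window.
# Usage: requests_in_window "09/Feb/2018:05:2" /path/to/access_log
requests_in_window() {
    awk -v win="$1" '
        index($4, "[" win) == 1 {
            method = $6; sub(/^"/, "", method)   # strip the opening quote
            path = $7;   sub(/\?.*/, "", path)   # drop the query string
            hits[$1 " " method " " path " " $9]++
        }
        END { for (k in hits) print hits[k], k }' "$2" | sort -rn
}
```

A single IP with a large count of POST requests to one file points towards a bot or an attack rather than normal traffic.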
@arnold:
Do you have host data collection
I'm not sure what that is so I'll have to say no to the question. If you could tell me where to look to see whether this is enabled on the ESXi host I'll let you know what the setup is.

@gr8gonzo:
If you can copy those lines to a separate file (there will be 160 requests for PHP files, plus requests for dependencies such as images and such), and attach it, that would be helpful. That information can help indicate whether the activity is normal or abnormal or even possibly malicious (e.g. bots).
Does this look as though it could be DOS attack? That would make sense as WHMCS is a popular accounting package for online host providers and so it could be being attacked on a regular basis. WHMCS on my server is 100% up to date - there are no updates to download.

I'll have a look at the access logs and get back to you.
@arnold:
The system now has 6Gb RAM and the SWAP is set to 4Gb. Increasing the RAM appears to have made the problem occur less often. Will keep an eye on it.

Cheers
Chris
See whether ESX gives you stats on the host's CPU and memory use.

The costliest operation on a server is swapping: a process has to be shifted from memory to disk to make room for another process that was swapped out earlier and now has its data brought back from disk into memory.

Increasing memory, tuning the web server's resource consumption and tuning MySQL
might reduce the time from receiving a request to responding, potentially minimizing the freeze-out until the requests exceed the capacity to respond.

Restricting the number of requests from a source is one option; doing it on the connection side (an external firewall, as a reject) may stem the issue. The difficulty is accounting for proxies and knowing how many requests are normal when your page is accessed.
I.e. if you have 50 objects (images), they might be retrieved in a pipelined operation or requested individually....
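Before choosing a per-source limit, it helps to see how many connections a normal visitor actually holds open. A sketch, assuming the usual "netstat -nt" column layout; conns_per_ip is a made-up name:

```shell
# Count established TCP connections per remote IP from "netstat -nt" output on stdin.
# Usage: netstat -nt | conns_per_ip | head
conns_per_ip() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" {
             ip = $5
             sub(/:[0-9]+$/, "", ip)   # strip the port
             n[ip]++
         }
         END { for (ip in n) print n[ip], ip }' | sort -rn
}
```

Sampling this a few times during normal traffic gives a baseline; a limit set just above that baseline is less likely to lock out legitimate visitors behind a shared proxy.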
Guys thanks and apologies for taking so long to get back to you. I've been watching and we don't appear to have the problem happening any more. Will keep an eye out but your input has been brilliant. Many thanks.