CentOS Server dying at high load average

I have several CentOS VPS units running cPanel. At regular intervals, the server load goes extremely high (248+) and the server is dead in the water. At this point all customer websites etc. are unavailable. I usually end up resetting the server which is not good for MySQL databases etc. and need to discover what's really going on so I can stop this from happening.

At the time the server is non-responsive, there are hundreds of lines on the console saying kill process ID or sacrifice child. However when this is happening there is not much chance to get into the console as it's too busy going 'round in circles.

I had someone from cPanel support take a look and they say it has nothing to do with cPanel. Here is what the tech wrote:

I was monitoring your server from last 30 minutes and the server load was stable but the site https://www.tgis.co.uk/ is taking time to load.
It indicates that there is an issue with the site scripting/database that is eating resources.
For reference I have checked server old logs and found that the same domain is eating resources

top - 07:40:15 up 13:12,  1 user,  load average: 53.75, 110.05, 170.98
Tasks: 267 total,  49 running, 216 sleeping,   0 stopped,   2 zombie
Cpu(s):  3.8%us,  3.9%sy,  0.0%ni, 30.2%id, 61.3%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:   3925368k total,  3243796k used,   681572k free,    50444k buffers
Swap:  4128764k total,   353920k used,  3774844k free,   662552k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14221 tgis      20   0  331m  18m  636 R  4.9  0.5   0:58.55 /usr/bin/php /home/
14241 tgis      20   0  331m  19m  144 R  4.9  0.5   0:47.39 /usr/bin/php /home/
14255 tgis      20   0  253m  21m 1732 R  4.9  0.6   0:52.20 /usr/bin/php /home/
17441 tgis      20   0  395m 105m 9552 R  4.9  2.8   0:05.24 /usr/bin/php /home/
17466 tgis      20   0  309m  84m 9320 R  4.9  2.2   0:03.90 /usr/bin/php /home/
17576 tgis      20   0  248m 7440 5260 R  4.9  0.2   0:00.22 php -q /home/tgis/p
14111 tgis      20   0  251m  24m 4372 R  4.3  0.6   0:57.59 /usr/bin/php /home/
14602 root      20   0 92712 8924 1508 R  4.3  0.2   0:37.19 /usr/local/cpanel/s
17508 tgis      20   0  340m  51m 9368 R  4.3  1.3   0:01.82 /usr/bin/php /home/
 2892 root      20   0 32832 7764 1928 R  3.7  0.2   0:00.41 /usr/local/cpanel/3
13111 tgis      20   0  253m  18m 2176 R  3.7  0.5   1:16.94 /usr/bin/php /home/
14234 tgis      20   0  335m  18m   76 R  3.7  0.5   0:44.64 /usr/bin/php /home/
14252 tgis      20   0  247m  21m 2692 R  3.7  0.6   0:56.60 /usr/bin/php /home/
17337 tgis      20   0  420m 130m 9596 R  3.7  3.4   0:10.07 /usr/bin/php /home/
17421 tgis      20   0  417m 127m 9596 R  3.7  3.3   0:06.73 /usr/bin/php /home/
17444 tgis      20   0  398m 109m 9596 R  3.7  2.9   0:05.56 /usr/bin/php /home/
17445 root      20   0 74540  17m 2580 R  3.7  0.5   0:03.92 /usr/local/cpanel/s
17465 tgis      20   0  372m  83m 9248 R  3.7  2.2   0:04.15 /usr/bin/php /home/
17496 tgis      20   0  346m  56m 9212 R  3.7  1.5   0:02.19 /usr/bin/php /home/
17500 tgis      20   0  348m  59m 9416 R  3.7  1.5   0:02.08 /usr/bin/php /home/
17503 tgis      20   0  340m  51m 9216 R  3.7  1.3   0:01.92 /usr/bin/php /home/
17504 tgis      20   0  340m  51m 9200 R  3.7  1.3   0:01.94 /usr/bin/php /home/
17506 tgis      20   0  332m  42m 9204 R  3.7  1.1   0:01.86 /usr/bin/php /home/
17525 tgis      20   0  326m  36m 8940 R  3.7  1.0   0:01.19 /usr/bin/php /home/
17549 clamav    20   0  116m  41m 1084 R  3.7  1.1   0:00.70 /usr/local/cpanel/3
14250 tgis      20   0  320m  20m  916 R  3.0  0.5   0:54.08 /usr/bin/php /home/
14277 tgis      20   0  327m  19m  704 R  3.0  0.5   0:45.58 /usr/bin/php /home/
14507 tgis      20   0  330m  19m   96 R  3.0  0.5   0:43.13 /usr/bin/php /home/
17252 tgis      20   0  416m 127m 9596 R  3.0  3.3   0:09.15 /usr/bin/php /home/

Seems a dynamic site with busy database as mysql error logs shows plenty of threads open

209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 244  user: 'tgis_whmcs'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 234  user: 'root'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 230  user: 'tgis_whmcs'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 228  user: 'tgis_oneadmin'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 227  user: 'tgis_oneadmin'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 222  user: 'tgis_oneadmin'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 221  user: 'tgis_oneadmin'

180209  5:24:21 [Warning] /usr/sbin/mysqld: Forcing close of thread 220  user: 'tgis_oneadmin'

I have done a malware scan just in case and it seemed to come up OK.

Can anyone help diagnose these issues? It's happening not only on this VPS but on others, even those on different hypervisor hosts. We use ESXi licensed version.

Many thanks
Chris
Chris KenwardDirectorAsked:
arnoldCommented:
It sounds like the VPS is under-specced for the requests and services it handles, or potentially under a DDoS attack, with the impact falling on the requests generated against the backend MySQL db.

Does the issue correlate to a specific time, peak usage of a site...

In your situation, tuning MySQL might not be enough to cure the resource depletion.

I'm not sure whether you can get the VPS additional resources (memory, processing).

What are the host performance stats during that time frame?
gr8gonzoConsultantCommented:
How many connections is your web server configured to have (min/max servers and server type)? Also, are your PHP scripts establishing persistent connections to the database?
Chris KenwardDirectorAuthor Commented:
@arnold: These are all web servers running PHP and MySQL. They are all fairly small servers and all have around 4GB RAM. The issue happens at any time during the day - could be DDoS I guess but not sure how to find out. Each server is running CSF.

@gr8gonzo: Could you elaborate on how to check and change these settings? This sounds like a good starting spot. Many thanks.

Cheers
Chris
gr8gonzoConsultantCommented:
The server min/max client configs will be set in your web server config files. If you're using Apache (likely) it would be in httpd.conf or maybe httpd-mpm.conf.

To find out about persistent connections, search for "pconnect" or "p:" in all your PHP code files (there's a small chance of some false positives).

Also, do you define a connection limit for MySQL in my.ini (my.cnf on Linux, usually /etc/my.cnf)?
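
A minimal sketch of that search, assuming GNU grep and that the PHP code lives under a directory you pass in. The function name is made up for illustration. It matches calls ending in pconnect( and quoted strings starting with p:, so expect a few false positives:

```shell
# Hypothetical helper: list lines in PHP files that look like persistent
# database connections (mysql_pconnect(...), or a mysqli host prefixed with "p:").
find_persistent_connections() {
  grep -rnE --include='*.php' "pconnect[[:space:]]*\(|['\"]p:" "$1"
}

# Example (the path is an assumption):
#   find_persistent_connections /home/tgis/public_html
```

Each hit is printed as file:line:text, so it is quick to open the file and judge whether the match is a real persistent connection.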
Chris KenwardDirectorAuthor Commented:
Hi there gr8gonzo

This is a cPanel server. I haven't set up anything that isn't auto setup by cPanel when it's installed. I installed the CSF (LFD) plugin for security and went through the security checks but apart from that I haven't fiddled with setup of either PHP, MySQL or the O/S of the server.

Where is my.ini located? I cannot find this file on the server at all.

Thanks
Chris
gr8gonzoConsultantCommented:
Hi Chris,

I just took another, longer look at your process list - did you edit anything out of that list? I would normally expect a full path to a PHP file at the end of these kinds of lines - running PHP against a folder shouldn't really do anything:

17500 tgis      20   0  348m  59m 9416 R  3.7  1.5   0:02.08 /usr/bin/php /home/
17503 tgis      20   0  340m  51m 9216 R  3.7  1.3   0:01.92 /usr/bin/php /home/
17504 tgis      20   0  340m  51m 9200 R  3.7  1.3   0:01.94 /usr/bin/php /home/
17506 tgis      20   0  332m  42m 9204 R  3.7  1.1   0:01.86 /usr/bin/php /home/

Instead of pulling the contents from top, can you provide the output of "ps auxf" instead?

After my second look, I'm less convinced it has anything to do with connections (apache or mysql) and it looks more like you might have some PHP scripts that are stuck in an infinite loop. In that scenario, if a PHP script has opened a connection to the database, it will keep that connection to itself until the script ends, so if this script is running a lot, then you're basically creating processes and connections that are never closing properly.
Chris KenwardDirectorAuthor Commented:
Hi gr8gonzo

I think I may have inadvertently cut the lines off. I've attached a file containing latest PS auxf. Hope it's good.

Cheers
Chris
WID_PS_auxf.txt
gr8gonzoConsultantCommented:
Okay, I don't see those PHP lines in there anywhere, so I assume it's something that might be kicked off manually.

Is "tgis" a customer or something, or is it just the main system user account for your site (I'm guessing it's the latter based on the URL you provided)?

I would next do this:
1. Run crontab -u tgis -l
If there are any cron jobs listed in there, post the output here.

2. Next, do a grep through your source code in /home/tgis (I'm assuming that's the user's home dir) to look for PHP loops:
grep -riP "\b(while|for|foreach)\s*\(" * | grep "\.php:"

The resulting files that come back from that command will need you to examine those loops to look for scenarios where the loop might never end. For example:
$some_condition = true;
while($some_condition)
{
  if(XYZ happens)
  {
    $some_condition = false;
  }
}

In the above loop, if XYZ never happens, then $some_condition will never change to false, and the loop will never end, which will suck up your CPU and likely cause the problems you're seeing.

Infinite loops can happen in a number of ways, so that's just one example.

You might also want to grep for any instances of "set_time_limit":
grep -ri "set_time_limit" *

Normally PHP scripts should time out by themselves after 30-60 seconds (whatever your PHP config is set to), but set_time_limit() can override that setting, particularly if you see set_time_limit(0) which means "disable the timeout completely".

However, fixing the time limit might only bandaid the problem - you might still end up with scripts that are running in infinite loops for a period of time, which is a huge drain on your resources.
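
To narrow a long list of hits down to the scripts that switch the timeout off completely, something like the following could help. It is only a sketch, assuming GNU grep and file paths without colons, and the function name is hypothetical:

```shell
# Hypothetical helper: print "file:count" for every PHP file under a directory
# that contains set_time_limit(0), with the highest counts first.
find_unlimited_scripts() {
  grep -rcE --include='*.php' \
    'set_time_limit[[:space:]]*\([[:space:]]*0[[:space:]]*\)' "$1" \
    | grep -v ':0$' \
    | sort -t: -k2,2nr
}

# Example (the path is an assumption):
#   find_unlimited_scripts /home/tgis/public_html/whmcs
```

The count is the number of matching lines per file, which is usually enough to decide which files to read first.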
Chris KenwardDirectorAuthor Commented:
Hi there

The site is my main site running WHMCS. That's all it does. There are no customers on this server. I'll need to address similar issues on other servers once we've been through this one. Thanks as always for getting back to me.

The result of the CRON output is:

root@wid [~]# crontab -u tgis -l
MAILTO=""
SHELL="/usr/local/cpanel/bin/jailshell"
0 8 * * * php -q /home/tgis/public_html/whmcs/crons/cron.php

SHELL="/usr/local/cpanel/bin/jailshell"
*/5 * * * * php -q /home/tgis/public_html/whmcs/crons/pop.php

SHELL="/usr/local/cpanel/bin/jailshell"
0 */6 * * * php -q /home/tgis/public_html/whmcs/crons/domainsync.php

These are simply jobs to sync domains etc. for the WHMCS package and to intercept incoming POP mail for open tickets.

The test for loops that never end resulted in literally hundreds of results - if there were bad things in here then WHMCS would not work. I have a feeling that this is a non-starter. It's a very popular commercial package and there would be a lynch mob out there if it was the cause of bringing the server down.

The set time limit test also brings up literally 100s of lines, many of which have the (0) as the limit.

Usually, when the server goes into a tailspin I think it's MySQL that has managed to drag the server into the red.

Looking at my.cnf, I see this:

[mysqld]
default-storage-engine=MyISAM
innodb_file_per_table=1
local-infile=0
innodb_buffer_pool_size=124780544
max_allowed_packet=268435456
open_files_limit=10000

I noticed that we appear to be using MyISAM rather than InnoDB. Is this a good thing? Should we be limiting max_concurrent_connections etc.?

Cheers
Chris
arnoldCommented:
The default engine shown there is only a setting for newly created tables. The engine is chosen per table, so SHOW TABLE STATUS FROM <databasename> will tell you whether each table in <databasename> uses InnoDB or MyISAM.

Note that in the settings, your InnoDB buffer pool is set to use about 120MB.

There is MySQL tuning that can help improve MySQL responsiveness if needed.


Note that in your process list there are 25 PHP processes with roughly 250MB or more allocated each (the VIRT column),
which is >6.25GB in allocated memory resources.

Depending on what the PHP does: one of the cron jobs runs every five minutes.
If your settings are to retrieve a copy while leaving a copy on the server, then as the amount of email grows, the processing on your end increases to differentiate between messages it has already copied and new ones.
Fetchmail might be a better...
Or have the front end mailserver forward the messages ........
gr8gonzoConsultantCommented:
Okay, that's good to know - you hadn't mentioned WHMCS before - just cPanel. That said:

1. Never assume that just because it's a popular product that it's not just as capable of having defects. I work daily on software that is used by hundreds of millions of people and it's not defect-free. Sometimes you are simply the first person with a specific set of circumstances that leads to the problem. Defects are simply part of the normal lifecycle for any product, and the bigger the product, the greater the chance of a defect.

2. arnold's advice is pretty good. Having a 120MB buffer and a 256MB allowed packet size seems a little high. The buffer is less of a concern than the huge packet size allowance. If you have a legitimate chance of pushing or pulling 256 megabytes into or out of MySQL, then you might consider keeping it there, but for most typical PHP scripts, 16MB is plenty.

3. In your original top output, I only see one line that looks like it matches up to any of your crontabs:
17576 tgis      20   0  248m 7440 5260 R  4.9  0.2   0:00.22 php -q /home/tgis/p

...and that one had only used 0.22 seconds of CPU time when you captured the output (TIME+ is cumulative CPU time, shown as minutes:seconds.hundredths). The other ones that start with "/usr/bin/php /home/" are the more concerning ones. You can see that they have been busy for a LONG time, in one case for over a minute of CPU time:
13111 tgis      20   0  253m  18m 2176 R  3.7  0.5   1:16.94 /usr/bin/php /home/
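
For comparing rows, it can help to turn the TIME+ column into plain seconds. A minimal sketch, assuming the minutes:seconds.hundredths layout that procps top uses for TIME+ (the function name is hypothetical):

```shell
# Hypothetical helper: convert a top TIME+ value (minutes:seconds.hundredths)
# into seconds of CPU time.
top_time_to_seconds() {
  printf '%s\n' "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# Example:
#   top_time_to_seconds 1:16.94
```

Keep in mind this is CPU time consumed, not wall-clock time since the process started, so a process that mostly waits on disk can be much older than its TIME+ suggests.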

I put all your top output into an Excel document and sorted it by the time that the script was running:
[Image: Excel sheet of the top output, sorted by running time]
A few things to notice if you go from the bottom row towards the top:
1. It seems sometimes the scripts are starting at almost exactly the same time. The bottom-most 5 rows all started within 700 milliseconds of each other. Without knowing the rest of the path, it's hard to tell whether they're all the same script being called or 5 different scripts that all started at around the same time.

2. The scripts seem to start out with quite a large amount of data in memory (36 megabytes for the one that has only been running for 1.2 seconds, and almost doubling in the next second). You can see that the longer the scripts run, the more memory they're consuming, which usually indicates a script that is in a long-running process of reading data into memory (e.g. perhaps reading a large CSV file from over a network connection).

3. At a certain point, the script probably "processes" the data and clears it from memory, but doesn't properly exit, which is why the long-running scripts (starting at row 11 and up, where they've all been running for at least 40 seconds) suddenly show a huge drop in memory usage (all of those long-running scripts are under 30 megabytes). There are a few ways where that number could be deceiving, but it's a decent guess.

4. You'll notice that the sum of the CPU % from those PHP scripts is a whopping 93%. Figure in a few additional percentage points by other unrelated processes and you've probably maxed out your CPU and made it virtually impossible to serve anything else until those scripts are killed.

5. Additionally, the scripts are eating up over 1.3 gigs of your "physical" memory.
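
The same totals can be pulled straight from a saved top capture without a spreadsheet. This is a rough sketch, assuming the classic column order shown in the output above (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND) and memory values that are either plain KiB or carry an m/g suffix. The function name is made up:

```shell
# Hypothetical helper: sum %CPU and resident memory (RES) for one user
# from top output read on standard input.
sum_top_by_user() {
  awk -v user="$1" '
    # Convert a top memory field (plain KiB, or with an m/g suffix) to MiB.
    function to_mib(v) {
      if (v ~ /g$/) return substr(v, 1, length(v) - 1) * 1024
      if (v ~ /m$/) return substr(v, 1, length(v) - 1) + 0
      return v / 1024
    }
    # Only process rows: first field is a numeric PID, second is the user.
    $1 ~ /^[0-9]+$/ && $2 == user { cpu += $9; res += to_mib($6); n++ }
    END { printf "%d procs, %.1f %%CPU, %.0f MiB RES\n", n, cpu, res }
  '
}

# Example (the file name is an assumption):
#   sum_top_by_user tgis < top_snapshot.txt
```

Note that this counts every process owned by the user, including the cron-launched "php -q" one, so the CPU total can come out a few points higher than a sum over the "/usr/bin/php" rows alone.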

Now, all of these processes seem to be launched on the command line. If they were launched from Apache (you mentioned you had a pretty standard cPanel setup), then these scripts would be executing from within the httpd child processes. Instead, these processes are calling the command-line /usr/bin/php binary and they're being called by the "tgis" user.

The fact that there seem to be at least 5 that were launched within the span of 1 second indicates that it's less likely to be a manual user launching these scripts and more likely some process that is kicking off these other processes.

If we're able to capture a ps auxf output WHILE the problem is happening, then that would solve the mystery of what the problem scripts are, so it's important that you keep running that command from time to time and try to record the full output when the problem is happening. If you want to automate it via a recurring cron job, you could add a line into the root user's crontab like this:

*/10 * * * * ps auxf > /tmp/psauxf_`date "+%H%M"`.txt

That should run the "ps auxf" command every 10 minutes and dump it into a timestamped-file in your /tmp folder, like /tmp/psauxf_0930.txt, /tmp/psauxf_0940.txt, /tmp/psauxf_0950.txt, /tmp/psauxf_1000.txt, etc...

That way you have snapshots every 10 minutes.

It's also possible that the ps auxf might show us the parent process responsible for all the child processes, which could narrow down the culprit even further.

Finally, since this is WHMCS, you might want to enable error reporting if you haven't done so already:
https://docs.whmcs.com/Enabling_Error_Reporting

...just in case there's something in the logs that can lead to the root cause and resolution.

If you have PHP error logging turned on, it might be worth looking in there, too. You'd have to look for your php.ini file (use the locate command if you have it installed in your shell) and see if "log_errors" is set to "On" and if so, look for the "error_log" setting to find the path to the error log. Then check that file to see if there's anything that throws up a red flag.
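
If it helps, the two settings can be read out of php.ini with a small helper. This is just a sketch with a hypothetical function name. It assumes simple "key = value" lines, skips lines commented out with ";", and returns the last value it finds:

```shell
# Hypothetical helper: print the value of one directive from a php.ini file.
# usage: php_ini_value KEY FILE
php_ini_value() {
  awk -F= -v key="$1" '
    /^[[:space:]]*;/ { next }                       # skip commented-out lines
    { k = $1; gsub(/[[:space:]]/, "", k) }          # directive name without spaces
    k == key { v = $0; sub(/^[^=]*=[[:space:]]*/, "", v); val = v }
    END { print val }
  ' "$2"
}

# Example (the php.ini path is an assumption; "php --ini" shows the loaded file):
#   php_ini_value log_errors /usr/local/lib/php.ini
#   php_ini_value error_log  /usr/local/lib/php.ini
```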

Finally, there's always a chance that all of this is standard WHMCS script behavior but there might be circumstances or configuration that are negatively impacting the scripts' ability to run to completion. For example, if these scripts are trying to contact a host via its hostname and that hostname can't be resolved for some reason (e.g. a DNS or routing issue), then that could cause a problem like this. Generally speaking, it's bad practice to have a set_time_limit(0) in your code for EXACTLY this reason. It's a little interesting that a product like WHMCS would be doing that. Regularly-occurring scripts should rarely take more than a few seconds to run. Long-running scripts (things like big data imports) might have the time limit turned off, but they shouldn't be being run on a recurring basis.

I think that's all I have for the time being - let us know once you've captured the ps auxf output when the problem is happening.

If you want to quickly check to see if there's a captured snapshot of the problem (instead of manually checking each one), just do a grep through those items:

grep "usr/bin/php /home/" /tmp/psauxf_*
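
Once there are many snapshots, it may be easier to rank them by how many of those PHP processes each one contains. A minimal sketch (the function name is hypothetical, the search string is the one from the grep above):

```shell
# Hypothetical helper: print "count filename" for each snapshot file,
# with the snapshot containing the most matching PHP processes first.
rank_snapshots() {
  for f in "$@"; do
    printf '%s %s\n' "$(grep -c '/usr/bin/php /home/' "$f")" "$f"
  done | sort -rn
}

# Example:
#   rank_snapshots /tmp/psauxf_*.txt | head
```

The files at the top of the list are the ones worth posting, since they were most likely captured while the problem was happening.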
Chris KenwardDirectorAuthor Commented:
Hi gr8gonzo

2. arnold's advice is pretty good. Having a 120MB buffer and a 256MB allowed packet size seems a little high. The buffer is less of a concern than the huge packet size allowance. If you have a legitimate chance of pushing or pulling 256 megabytes into or out of MySQL, then you might consider keeping it there, but for most typical PHP scripts, 16MB is plenty.

Where would I go to change this to limit the scripts to 16MB? Could we start with that?

All the best
Chris
Chris KenwardDirectorAuthor Commented:
Hi gr8gonzo

That should run the "ps auxf" command every 10 minutes and dump it into a timestamped-file in your /tmp folder, like /tmp/psauxf_0930.txt, /tmp/psauxf_0940.txt, /tmp/psauxf_0950.txt, /tmp/psauxf_1000.txt, etc...

That way you have snapshots every 10 minutes.

I've added the CRON and will let you know as soon as I see the files being created. Many thanks
Chris
Chris KenwardDirectorAuthor Commented:
Ah - there's a problem with the CRON. I get this in Email:

/bin/sh: -c: line 0: unexpected EOF while looking for matching ``'
/bin/sh: -c: line 1: syntax error: unexpected end of file

Have I included something I shouldn't?
arnoldCommented:
It seems the command you want executed contains an opening backtick with no matching closing one...

Please post the entry you added.
If you are adding a script, please paste the entire script.
The first line should be
#!/bin/sh
Chris KenwardDirectorAuthor Commented:
Hey Arnold

I added this line to the top of the CRON list.

*/10 * * * * ps auxf > /tmp/psauxf_`date "+%H%M"`.txt

Regards
Chris
arnoldCommented:
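It is best to use a shell script rather than an inline command. The problem with the way you have it is the % signs: in a crontab entry, an unescaped % is treated as a newline, and everything after the first % is passed to the command as standard input. The shell therefore only saw the command up to date "+ and complained about the unmatched backtick. Escape each % with a backslash.

Try


*/10 * * * * ps auxf > "/tmp/psauxf_$(date +\%H\%M).txt"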


It is best to have these in a script
Collectdata.sh
#!/bin/sh

Scheduled_time=$(/bin/date +"%H%M")

/bin/ps auxf > "/tmp/psauxf_${Scheduled_time}.txt"


Make sure the script is executable:
chmod 700 ....
chmod u+x Collectdata.sh

The cron entry
*/10 * * * * /path/to/Collectdata.sh
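
For anyone wondering why the one-line crontab entry failed while the script version works: cron cuts the command at the first unescaped % and turns \% into a literal %, and a script file never passes through that step. The function below is only a rough emulation of that rule (it is not cron's own code, and the name is made up), meant to show what the shell actually receives:

```shell
# Rough emulation (not cron itself) of how a crontab command field is cut:
# everything before the first unescaped % is run as the command, and \% is
# turned into a literal %.
cron_command_part() {
  printf '%s\n' "$1" | awk '{
    out = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\\" && substr($0, i + 1, 1) == "%") { out = out "%"; i++ }
      else if (c == "%") break
      else out = out c
    }
    print out
  }'
}

# Example:
#   cron_command_part 'ps auxf > /tmp/psauxf_`date "+%H%M"`.txt'
```

Fed the original entry, the command part ends right after date "+ with the backtick still open, which matches the "unexpected EOF while looking for matching" error in the email.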
gr8gonzoConsultantCommented:
Looks like arnold already addressed the cron tab item, so I'll answer the remaining question:
Where would I go to change this to limit the scripts to 16MB?

I wouldn't focus too much on MySQL overall, since I doubt it's related to the behavior you're seeing, but since we're waiting for those logs...

So there are a few different limits in play here.

1. There's the max_allowed_packet for MySQL, which is found in your my.cnf file. Now, some people set this value really high (like your 256MB value) because MySQL will only use as much as necessary. So if you send in a 4 MB query, it will be accepted just fine. However, if you limited max_allowed_packet to 2 MB and then sent in a 4 MB query, the server would not run the query.

The vast majority of your queries are likely going to be less than 1 or 2 kilobytes. The bigger packets/queries (they're not exactly the same thing but I'll use them interchangeably for this explanation for now) are typically due to less-common database queries, like restoring backups or inserting a big file into a BLOB field.

Changing the max_allowed_packet is more of a preventive measure so that if you were to ever get hacked in a way where a malicious user was able to run their own queries (e.g. SQL injection), they couldn't generate a ton of queries that each took up 256 megabytes and ate up all your memory (and likely disk space, too). Instead, a lower maximum packet size provides a single layer of defense against that kind of attack. It's flimsy, but it's there. Good security is about lots of layers.

So if you go into your my.cnf file and make it have this line:
max_allowed_packet=16M

...then you'll likely still have a high enough limit to run pretty much every query and you'll have just a little more safety.

2. Often times, your queries are generated and executed from a PHP script via the web. If that's the case for you, then PHP has its own limits in the php.ini file, which are often lower than MySQL's limits. For example, you might have PHP limiting POST / upload data to 8 MB, which means that you'll never have more than 8 MB of raw data in a web request, which means you're less likely to end up with a query that has over 8 MB anyway.

If you want to work on optimizing MySQL for performance, you might want to read my article on the topic.
https://www.experts-exchange.com/articles/1250/3-Ways-to-Speed-Up-MySQL.html
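
Since the posted my.cnf gives some sizes in raw bytes and the suggestion above uses the M suffix, a quick conversion makes them easier to compare. A tiny sketch (the function name is hypothetical, and it rounds down to whole MiB):

```shell
# Hypothetical helper: convert a byte count to whole mebibytes (MiB).
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Examples, using the two byte values from the posted my.cnf:
#   bytes_to_mib 268435456    # max_allowed_packet
#   bytes_to_mib 124780544    # innodb_buffer_pool_size
```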
Chris KenwardDirectorAuthor Commented:
@Arnold: OK - thanks I have just changed the CRON and waiting for the first one to run.

@gr8gonzo: I have modified my.cnf file which now looks like this:

[mysqld]
skip-name-resolve
query_cache_size = 16M

log-slow-queries=/var/log/slowqueries.log
long_query_time = 4
log-queries-not-using-indexes

table_cache = 512
tmp_table_size = 128M
max_heap_table_size = 128M
myisam_sort_buffer_size = 8M
sort_buffer_size = 8M
join_buffer_size = 256K
key_buffer = 128M

default-storage-engine=MyISAM
innodb_file_per_table=1
local-infile=0
innodb_buffer_pool_size=124780544
max_allowed_packet=16M
open_files_limit=10000

Is this OK?

Thanks for pointing me in the direction of the article you wrote. Let me know if there's anything else I should be doing with the MySQL?
Chris KenwardDirectorAuthor Commented:
Hi Folks - is there any way to get the ps auxf script to Email the result to me? That way I can more easily copy the files into Experts Exchange for you?
arnoldCommented:
Instead of redirecting it into a file, you could pipe it to mail: | mail -s "process" youremail@yourdomain.com
gr8gonzoConsultantCommented:
Is this OK?
That generally looks good, but I emphasize "generally." I have no idea what your database/table structure looks like, so nobody can tell you if it's properly optimized without that kind of information (and that's work that's better handled with a paid gig/live help session if you want additional help).

Just keep an eye on things, particularly any logged slow/unindexed queries.

Also, make sure you restart MySQL after any my.cnf changes.

Has the problem occurred yet?
arnoldCommented:
You can apply changes such as this dynamically with
SET GLOBAL variable_name = newvalue;
This way you retain the runtime statistics that a restart would discard. The change lasts only until the next restart, so keep the my.cnf entry as well.
Chris KenwardDirectorAuthor Commented:
Folks I put the collectdata script on another of my servers to see if I could capture what's happening there. Although it's running fine on the server WID it's giving me this on the other server and I have no idea why:

root@lune [/tmp]# ./collectdata.sh
-bash: ./collectdata.sh: Permission denied

Help?
arnoldCommented:
In order for any script to be run directly, it must have the execute bit set.
ls -l collectdata.sh
If yours shows -rw-, that means read and write only for the user. In that case:
chmod u+x collectdata.sh
Chris KenwardDirectorAuthor Commented:
Hi Arnold - no - it looks like this:

-rwx------. 1 root root 97 Feb 13 21:31 collectdata.sh*

Cheers
Chris
arnoldCommented:
what is the first line?

The other possibility is that SELinux is blocking running scripts from /tmp

The first line has to be #!/bin/sh

What happens if you run
sh ./collectdata.sh
Chris KenwardDirectorAuthor Commented:
Hi Arnold

Yes - the script is the same as before:
#!/bin/sh
Scheduled_time=$(/bin/date +"%H%M")
/bin/ps auxf > "/tmp/psauxf_${Scheduled_time}.txt"

If I run the script manually...

root@lune [~]# /tmp/collectdata.sh
-bash: /tmp/collectdata.sh: Permission denied

This is how it is run on the other server and working fine.

If I run it like this:

root@lune [~]# sh /tmp/collectdata.sh

Then I get the file fine:

psauxf_2029.txt which contains the whole ps auxf output.

Thanks
Chris
arnoldCommented:
The other server's settings might be different. /tmp is temporary space, and often there is a cron job that cleans it up.

Move the script somewhere else and see whether it still gives you an error there.
Also double-check where your sh is:
which sh
gr8gonzoConsultantCommented:
Given that you have multiple servers, can you double-check the permissions on /tmp/collectdata.sh on the server that is having trouble?

Your description of the behavior:
root@lune [~]# /tmp/collectdata.sh
-bash: /tmp/collectdata.sh: Permission denied

If I run it like this:
root@lune [~]# sh /tmp/collectdata.sh

Then I get the file fine
...is identical to what you'd expect if the execute bit was missing. I'm not sure if there's a possibility that this line:
-rwx------. 1 root root 97 Feb 13 21:31 collectdata.sh*

...came from one of the other servers by accident?

While I'd strongly doubt that this has to do with SELinux, you could try to disable it temporarily and try to re-run /tmp/collectdata.sh:

root@lune [~]# setenforce 0
root@lune [~]# /tmp/collectdata.sh
root@lune [~]# setenforce 1

The only other thing that comes to mind is if your sh shell isn't at its default location for some reason (which would be REALLY odd for root). You can always do:
root@lune [~]# which sh 

to see if there's a separate sh shell interpreter that's being called within your path.
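
One more thing that may be worth ruling out, purely as an assumption since nothing posted so far confirms it: a script that fails with "Permission denied" when run directly but works through sh is also what a filesystem mounted with the noexec option produces. A minimal sketch (the function name is hypothetical) that checks a mount point against a /proc/mounts-style table:

```shell
# Hypothetical helper: succeed (exit 0) if the given mount point is listed
# with the noexec option in a /proc/mounts-style file.
# usage: mount_has_noexec MOUNTPOINT [MOUNTS_FILE]
mount_has_noexec() {
  awk -v mp="$1" '
    $2 == mp { opts = $4 }    # fourth column holds the comma-separated options
    END { exit (index("," opts ",", ",noexec,") ? 0 : 1) }
  ' "${2:-/proc/mounts}"
}

# Example:
#   mount_has_noexec /tmp && echo "/tmp is mounted noexec"
```

If /tmp turns out to be mounted that way, keeping the script somewhere else (or calling it through sh in the cron entry) would sidestep the problem without changing the mount.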
Chris KenwardDirectorAuthor Commented:
Hi there, gr8gonzo

This is what the top level directory looks like for /tmp

drwxr-xr-x.  3 root root  4096 Feb 22 04:02 tmp/

Definitely from the server I'm looking at while having this conversation with you.

root@lune [~]# setenforce 0
root@lune [~]# /tmp/collectdata.sh
-bash: /tmp/collectdata.sh: Permission denied
root@lune [~]# setenforce 1
root@lune [~]#

root@lune [~]# which sh
/bin/sh
root@lune [~]#

Does that make any sense?

Cheers
Chris
Chris KenwardDirectorAuthor Commented:
@arnold: Thanks for your reply. See the results of the tests above?

Cheers
Chris
Chris KenwardDirectorAuthor Commented:
Hi Folks

If I type

root@lune [~]# sh /tmp/collectdata.sh

It works?

Cheers
Chris
Chris KenwardDirectorAuthor Commented:
Back to the server with the high load:

Please see the following results from the PS AUXF

(note this is the original server that resulted in this question being asked)

Cheers
Chris
psauxf_0520.txt
psauxf_0521.txt
psauxf_0530.txt
psauxf_0540.txt
psauxf_0550.txt
psauxf_0600.txt
arnoldCommented:
For listing processes, run top -n 3

Your issue seems to be related to the number of requests you get, and to the settings in your web server's httpd.conf that control the min/max number of child processes and how many requests each child can handle.

This sounds like an under-provisioning of resources to handle the incoming requests, which hit MySQL and thus impact storage (virtual in the VM, physical on the host, depending on resource scheduling on the host).
Chris KenwardDirectorAuthor Commented:
Hi Arnold

top -n 3 gives us this:

Tasks: 160 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   5993836k total,  3397872k used,  2595964k free,   237644k buffers
Swap:  4128764k total,    10916k used,  4117848k free,  1912932k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
14665 root      20   0 13128 1264  928 R  0.7  0.0   0:00.06 top                
    1 root      20   0 19360 1452 1224 S  0.0  0.0   0:04.07 init              
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kthreadd          
    3 root      RT   0     0    0    0 S  0.0  0.0   0:01.12 migration/0        
    4 root      20   0     0    0    0 S  0.0  0.0   0:01.08 ksoftirqd/0        
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/0          
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.24 watchdog/0        
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.90 migration/1        
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/1          
    9 root      20   0     0    0    0 S  0.0  0.0   0:01.16 ksoftirqd/1        
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.20 watchdog/1        
   11 root      20   0     0    0    0 S  0.0  0.0   0:09.44 events/0          
   12 root      20   0     0    0    0 S  0.0  0.0   0:11.38 events/1          
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events/0          
   14 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events/1          
   15 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events_long/0      
   16 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events_long/1      
   17 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events_power_ef    
   18 root      20   0     0    0    0 S  0.0  0.0   0:00.00 events_power_ef    
   19 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cgroup            
   20 root      20   0     0    0    0 S  0.0  0.0   0:00.02 khelper            
   21 root      20   0     0    0    0 S  0.0  0.0   0:00.00 netns              
   22 root      20   0     0    0    0 S  0.0  0.0   0:00.00 async/mgr          
   23 root      20   0     0    0    0 S  0.0  0.0   0:00.00 pm                
   24 root      20   0     0    0    0 S  0.0  0.0   0:00.64 sync_supers        
   25 root      20   0     0    0    0 S  0.0  0.0   0:00.03 bdi-default
arnoldCommented:
Use top, sorted by CPU, to see the top 10 CPU consumers.
Then look at the top 10 sorted by memory consumption.

The VM reports that it is doing effectively nothing: it is 99% idle.

Tasks: 160 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   5993836k total,  3397872k used,  2595964k free,   237644k buffers
Swap:  4128764k total,    10916k used,  4117848k free,  1912932k cached

When diagnosing issues like the one you report, the process list gives you a snapshot of what was running.

More indicative is to run vmstat 5 5, iostat 5 5 and top all in parallel.
vmstat reports on memory use.
iostat reports on storage access, mainly which device has reads, writes, waits and so on.
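
One way to run the three samplers side by side is sketched below. It is a minimal sketch, not a tested monitoring setup: the function name, the SAMPLERS table and the log file naming are assumptions. top is started with -b (batch mode) so that it writes plain text rather than drawing a screen.

```python
import os
import subprocess


def run_in_parallel(commands, log_dir="."):
    """Launch all commands together, each appending to <label>.log.

    commands -- dict mapping a label to an argv list
    Returns a dict mapping each label to the command's exit status.
    """
    running = {}
    for label, argv in commands.items():
        log = open(os.path.join(log_dir, label + ".log"), "a")
        # Writing straight to a file means no command stalls on a full pipe.
        proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
        running[label] = (proc, log)

    status = {}
    for label, (proc, log) in running.items():
        status[label] = proc.wait()
        log.close()
    return status


# Assumed sampling commands: five samples, five seconds apart.
SAMPLERS = {
    "vmstat": ["vmstat", "5", "5"],
    "iostat": ["iostat", "5", "5"],
    "top": ["top", "-b", "-d", "5", "-n", "5"],
}
```

Calling run_in_parallel(SAMPLERS, some_directory) on a schedule, before the load climbs, leaves a record on disk that can be read after the event.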
Chris KenwardDirectorAuthor Commented:
Arnold, the problem is that when the load average is high, reports such as vmstat are all blank because the server is too slow to write the stats to the file on request.

Cheers
Chris
arnoldCommented:
Do you have host data collection, so you can see whether there is a correlation between load on the host and the impact on performance in the VM?
gr8gonzoConsultantCommented:
Alright! So one mystery solved:

The full command being executed is:
/usr/bin/php /home/tgis/public_html/index.php

...and it's being executed from within an Apache child worker/process, which means it's almost certainly being triggered by a web request.

Other notable things (just for reference purposes):

1. suphp is being used to switch the user context.

2. Of the 160 active PHP processes:
     130 of them are for the index.php file
     10 are for the whmcs/announcements.php file
     6 are for the whmcs/downloads.php file
     4 are for the whmcs/serverstatus.php file
     3 are for the whmcs/supporttickets.php file
     2 are for the whmcs/knowledgebase.php file
     2 are for the whmcs/index.php file
     1 is for the xmlrpc.php file (NOT a whmcs file)
     1 is for the whmcs/cart.php file
     1 is for the whmcs/clientarea.php file

3. You're running WP version 4.9.4 (good that you're keeping it up to date).
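
A per-script tally like the one in point 2 can be reproduced from captured ps output with a few lines. This is a sketch under assumptions: the function name is made up, and the input is taken to be one command line per entry, such as the output of ps -eo args.

```python
from collections import Counter


def count_php_scripts(ps_lines):
    """Tally PHP processes by the script path they were started with.

    ps_lines -- iterable of command lines, e.g. from `ps -eo args`
    """
    counts = Counter()
    for line in ps_lines:
        parts = line.split()
        if not parts or not parts[0].endswith("php"):
            continue
        # Skip option flags such as -q; the first other argument is the script.
        script = next((p for p in parts[1:] if not p.startswith("-")), None)
        if script:
            counts[script] += 1
    return counts


# The first path is the one quoted above; the second is hypothetical.
captured = [
    "/usr/bin/php /home/tgis/public_html/index.php",
    "/usr/bin/php /home/tgis/public_html/index.php",
    "php -q /home/example/script.php",
    "/usr/sbin/httpd -k start",
]
for script, n in count_php_scripts(captured).most_common():
    print(n, script)
```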

The next things to figure out are:
1. The type of web request (GET or POST) and the returned status code. This information should be in the web server's access log for your site. You should be able to correlate the time frames from the captured ps output to the access logs and see the above-mentioned files being called. If you can copy those lines to a separate file (there will be 160 requests for PHP files, plus requests for dependencies such as images and such), and attach it, that would be helpful. That information can help indicate whether the activity is normal or abnormal or even possibly malicious (e.g. bots).

2. Your logs seemed to indicate that the server didn't hang - there was a bunch of activity in 0521 but it seemed all-clear in 0530. However, I'm not sure if that's because you rebooted or if it just cleared up on its own. If it cleared up on its own, it might indicate that this whole slew of processes is "normal" and happens at different times but occasionally hits some unknown brick wall. Can you confirm whether or not the server hung after 0521?

3. There are a few CVEs out for WHMCS, as well as a denial-of-service attack on the cart.php file within WHMCS (that one is listed in the Exploit Database as EDB-ID 39091 rather than as a CVE), which would be executed with a little bit of code like this:
<?php
// Proof-of-concept POST to the WHMCS cart with oversized domain parameters.
$strURL = "https://www.tgis.co.uk/whmcs/cart.php";
$strData = "ajax=1&a=domainoptions&sld=saddddd&tld=saasssssssssss&checktype=owndomain";
$len = strlen($strData);

$ch = curl_init($strURL);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_POSTFIELDS, $strData);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the response instead of printing it
curl_setopt($ch, CURLOPT_VERBOSE, true);          // write the request/response trace to stderr
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);  // skip certificate checks (testing only)
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
  "Accept: */*",
  "Accept-Language: en-gb",
  "Content-Type: application/x-www-form-urlencoded",
  "Accept-Encoding: gzip, deflate",
  "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
  "Host: www.tgis.co.uk",
  "Content-Length: $len",
  "Connection: Keep-Alive",
  "Cache-Control: no-cache"));

$result = curl_exec($ch);
curl_close($ch);


It might be worth running that against your own server to see if it triggers the problem. If so, the solution may be to upgrade WHMCS.
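
For the access-log correlation in step 1, a summary by request type, status code and client address for a given time window could be sketched as follows. It assumes the Apache combined log format; the function name, the sample lines, the date and the addresses are all made up for illustration and are not taken from this server's logs.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches the leading fields of an Apache combined-format log line.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def summarize(lines, start, end):
    """Count requests per (method, path, status) and per client IP
    for entries whose timestamp falls within [start, end]."""
    by_request, by_ip = Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if start <= ts <= end:
            key = (m.group("method"), m.group("path"), m.group("status"))
            by_request[key] += 1
            by_ip[m.group("ip")] += 1
    return by_request, by_ip


# Made-up sample lines; the addresses come from documentation ranges.
SAMPLE = [
    '192.0.2.10 - - [10/Mar/2018:05:21:03 +0000] "POST /whmcs/cart.php HTTP/1.1" 200 512 "-" "example-agent"',
    '192.0.2.10 - - [10/Mar/2018:05:21:04 +0000] "POST /whmcs/cart.php HTTP/1.1" 200 512 "-" "example-agent"',
    '203.0.113.5 - - [10/Mar/2018:05:21:09 +0000] "GET /index.php HTTP/1.1" 200 20480 "-" "example-agent"',
    '203.0.113.5 - - [10/Mar/2018:05:40:00 +0000] "GET /index.php HTTP/1.1" 200 20480 "-" "example-agent"',
]
START = datetime(2018, 3, 10, 5, 20, tzinfo=timezone.utc)
END = datetime(2018, 3, 10, 5, 30, tzinfo=timezone.utc)

by_request, by_ip = summarize(SAMPLE, START, END)
for key, n in by_request.most_common():
    print(n, *key)
```

A handful of addresses responsible for most of the POST requests would point towards bots or an attack; a broad spread of addresses fetching index.php looks more like ordinary traffic.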
arnoldCommented:
Oh, reviewed your original question where you posted the output of top:

top - 07:40:15 up 13:12,  1 user,  load average: 53.75, 110.05, 170.98
Tasks: 267 total,  49 running, 216 sleeping,   0 stopped,   2 zombie
Cpu(s):  3.8%us,  3.9%sy,  0.0%ni, 30.2%id, 61.3%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:   3925368k total,  3243796k used,   681572k free,    50444k buffers
Swap:  4128764k total,   353920k used,  3774844k free,   662552k cached

Note that it reports 61.3% of CPU time in a wait state (wa), which commonly means the system is waiting for data and is often related to storage. Also note that your system has ~4 GB of RAM but is using ~350 MB of swap.

This points to an under-resourced system. When the system pages or swaps, data from memory has to be written out to disk and data from disk has to be read back into memory before work can proceed.

If you have spare resources on the host, add some to this VM: for example, raise the VM's memory allocation from just under 4 GB to 8 GB and see whether that resolves the issue or makes it less frequent.
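
The two figures called out above can be pulled from a saved top header automatically, which helps when comparing many captures. This is a sketch that assumes the older procps layout shown in this thread (for example "61.3%wa" and "353920k used"); the function name and the returned keys are made up.

```python
import re


def parse_top_header(text):
    """Extract I/O wait, idle CPU and swap usage from top's summary lines."""
    wa = float(re.search(r"([\d.]+)%wa", text).group(1))
    idle = float(re.search(r"([\d.]+)%id", text).group(1))
    swap = re.search(r"Swap:\s+(\d+)k total,\s+(\d+)k used", text)
    swap_total_kb, swap_used_kb = int(swap.group(1)), int(swap.group(2))
    return {
        "iowait_pct": wa,
        "idle_pct": idle,
        "swap_used_mb": swap_used_kb / 1024,
        "swap_used_pct": 100 * swap_used_kb / swap_total_kb,
    }


# Header lines copied from the top output in the original question.
HEADER = """\
Cpu(s):  3.8%us,  3.9%sy,  0.0%ni, 30.2%id, 61.3%wa,  0.0%hi,  0.8%si,  0.0%st
Mem:   3925368k total,  3243796k used,   681572k free,    50444k buffers
Swap:  4128764k total,   353920k used,  3774844k free,   662552k cached
"""
print(parse_top_header(HEADER))
```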

Chris KenwardDirectorAuthor Commented:
@arnold:
Do you have host data collection
I'm not sure what that is so I'll have to say no to the question. If you could tell me where to look to see whether this is enabled on the ESXi host I'll let you know what the setup is.

@gr8gonzo:
If you can copy those lines to a separate file (there will be 160 requests for PHP files, plus requests for dependencies such as images and such), and attach it, that would be helpful. That information can help indicate whether the activity is normal or abnormal or even possibly malicious (e.g. bots).
Does this look as though it could be a DoS attack? That would make sense as WHMCS is a popular accounting package for online host providers and so it could be being attacked on a regular basis. WHMCS on my server is 100% up to date - there are no updates to download.

I'll have a look at the access logs and get back to you.
Chris KenwardDirectorAuthor Commented:
@arnold:
The system now has 6 GB of RAM and swap is set to 4 GB. Increasing the RAM appears to have made the problem occur less often. Will keep an eye on it.

Cheers
Chris
arnoldCommented:
See if through ESX you have stats on the host's CPU and memory use.

The costliest operation on a server is swapping: a process has to be moved from memory to disk to make room for another process that was swapped out earlier and now has to be brought back from disk into memory.

Increasing memory, tuning the web server's resource consumption and tuning MySQL might reduce the time from receiving a request to responding, potentially postponing the freeze-out until the requests exceed the capacity to respond.

Restricting the number of requests from a single source is another option; doing it on the connection side (for example, as a reject rule on an external firewall) may stem the issue. The difficulty is accounting for proxies and knowing how many requests are normal when your page is accessed.
For example, a page with 50 objects (images) might have them retrieved in one pipelined operation or requested individually.
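
The idea of capping requests per source can be illustrated with a sliding-window counter. This is a conceptual sketch only: in practice the limit would be enforced in a firewall or web server module, and the class name and parameters here are assumptions. As noted above, the limit has to sit above what one normal page view generates, and a proxy can make many visitors look like one source.

```python
from collections import defaultdict, deque


class PerSourceLimiter:
    """Allow at most `limit` requests per source within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.seen = defaultdict(deque)  # source -> times of allowed requests

    def allow(self, source, now):
        recent = self.seen[source]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Hypothetical policy: 3 requests per 10 seconds per source address.
limiter = PerSourceLimiter(limit=3, window=10)
decisions = [limiter.allow("192.0.2.10", t) for t in (0, 1, 2, 3)]
print(decisions)
```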
Chris KenwardDirectorAuthor Commented:
Guys thanks and apologies for taking so long to get back to you. I've been watching and we don't appear to have the problem happening any more. Will keep an eye out but your input has been brilliant. Many thanks.