psimation

asked on

auto reboot at high cpu load

Hi
Is there a way to have the server issue a reboot to itself when it reaches a certain CPU load, hits the maximum number of open files, or otherwise detects impending disaster?
So the question also reads: are there any ways of detecting impending disaster and having the server act intelligently, without me having to sit and monitor top etc. 24 hours a day?
jlevie

Offhand I don't know of any "off the shelf" packages that do that, but it wouldn't be very difficult to cobble up something in perl or a shell script to accomplish the task.
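For example, something along these lines run from cron every few minutes (a sketch only - the thresholds and the drastic reboot action are placeholder assumptions to tune for the box, not recommendations):

  #!/bin/sh
  # Hypothetical watchdog: reboot when the 1-minute load average or the
  # count of allocated file handles crosses a threshold. Run from cron.
  LOAD_MAX=20       # example 1-minute load-average ceiling
  FILE_MAX=7000     # example allocated file-handle ceiling

  # integer part of the 1-minute load average
  load=`cut -d. -f1 /proc/loadavg`
  # first field of /proc/sys/fs/file-nr is the allocated handle count
  files=`awk '{print $1}' /proc/sys/fs/file-nr`

  if [ "$load" -ge "$LOAD_MAX" ] || [ "$files" -ge "$FILE_MAX" ]; then
      logger -t watchdog "load=$load files=$files - rebooting"
      /sbin/shutdown -r now
  fi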

The more interesting issue is why you'd think this necessary. If you are having serious problems with too many open files or the CPU maxing out, it would make more sense to me to figure out why and eliminate the cause. Generally this kind of problem is the result of a flaw in some application that causes it to consume excessive resources. However, it can also be the result of trying to do too much with one system, e.g., running two applications on one box that both need lots of open files during routine operation.
psimation (ASKER)

Hi Jim
Yes, I agree completely, but the problem is 100000000-fold. Let me explain:
My server (the one you helped me to set up) had been running nicely for almost 200 days without so much as a glitch. Then we had a power failure that was long enough to drain the UPS, and it went down. So it basically had a forced reboot to "clean" things up a bit.
Then, all of a sudden, I got a system lockup with "too many open files" as the reason, but I can't find the cause of this in the logs...
I'm not sure if it was simply too many requests for a specific service, a runaway script, etc.
But whatever the cause, it seems like a good ol' reboot usually fixes things (magically).
The only problem is that once the system reaches "too many open files", I can't SSH into it to try and find out what is going on or to reboot, so when it gets to that point, I need to physically get to the server (which is not always possible).

A very strange thing happened a day or so before the lockup: I noticed that mysql was not responding at all. I couldn't log into it from the command line, nor could I restart safe_mysqld (the process just wouldn't die), yet other services were unaffected.
So I rebooted, and things seemed fine, and then two days after that I got the lockup.

Another part of this sad tale is the fact that I built a SCSI drive into the system to hold the /var and /home/www folders, to increase mysql and http performance. But the SCSI controller for some reason won't initialise during boot (I compiled the module into the kernel, and during boot it sees the SCSI controller, it just won't init). Yet directly after boot completes, one simply issues 'insmod a100u2w' and it initialises fine, and I can mount the partitions on /var and /home/www.
However, since the kernel and OS have practically no knowledge of the SCSI disk until AFTER boot time, the boot process uses the old /var folder on the old IDE drive. I'm sure that things like syslog start up with their logs on the old IDE, and then I come along and mount the SCSI disk over it. I've noticed that after the SCSI mounts, the /var folder contains what one would expect to see there (log/messages etc.), yet the messages and specifically the secure log files are empty, probably because I "killed" the log files when I mounted the SCSI over them.
So I'm also rather limited with my log arsenal to see what is happening.
I did do a 'service syslog restart' after the mount, and that adds entries to the log/messages file, but still no secure log etc.
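(For what it's worth, the manual insmod/mount/restart dance described above could be automated at the tail end of boot. A rough sketch of lines appended to /etc/rc.d/rc.local; the device names are pure assumptions and must match the real partitions:)

  # load the SCSI driver that refuses to init during boot
  /sbin/insmod a100u2w
  # mount the real /var and /home/www over the IDE copies
  mount /dev/sda5 /var         # assumed device names - adjust
  mount /dev/sda6 /home/www
  # restart anything that opened its log files on the old /var
  service syslog restart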
I am going to upgrade to 7.3 now (or rather install from clean), as I hope Red Hat 7.3 can initialise my SCSI correctly and I will no longer have these "duplicate" /var folders. I just have to scrape up enough courage to do it, as I need to migrate all my Cyrus users, all the domains, all the name/zone files etc., and then need to find out whether a 7.3 primary name server can cope with my secondary name server, which will still be RH 7.0 with an older BIND etc.
I will have to redo all the Cyrus users by hand (the saslpasswd bit), so that is going to be a nightmare...

Anyway, that is the sorry state of affairs, and now I'm also urgently in need of getting those damn autoresponders up and running...

On that note, thanks for the reply. I followed the link and downloaded the websieve app and will get to it shortly (I just wish they would package ALL the required software together; now I need to go look for the ADMIN::IMAP.blabla.tar.gz as well...)
sigh.
Hope you can help.
I suspect that both the crash and the SCSI nightmare may have bearing on the current problem.

jlevie

Moving things to 7.3 shouldn't be that much of a problem. There will need to be some detail changes to the BIND files, but that won't be very difficult. And I don't think you'll run into problems with a 7.0-based secondary. But even if you do, you could probably install a later version of BIND on the 7.0 box.

Cyrus is pretty easy to move. As long as you use Cyrus-imap-2.0.16 and Cyrus-sasl-1.5.24 you can transfer the mailboxes and saslpasswd file directly to the new server. With the data in place and Cyrus running, you do a 'reconstruct -r user.???' for every mailbox and everything will work. If you'd like to see my process for a current RedHat box and Cyrus 2.0.16/Sendmail 8.12.x, take a look at http://www.entrophy-free.net/mail-server.html
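(A sketch of that reconstruct pass, assuming the stock Cyrus 2.0.x layout - /var/spool/imap/user for the mail spool and /usr/cyrus/bin for the binaries; adjust paths for the actual install:)

  # run once on the new server, after copying the spool and sasldb over
  cd /var/spool/imap/user || exit 1
  for u in *; do
      su cyrus -c "/usr/cyrus/bin/reconstruct -r user.$u"
  done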

As far as the open file problem is concerned, you can use lsof to get a list of all open files. Running that periodically while the system is up should, in short order, tell you what is consuming all the file descriptors.
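For instance, one quick way to make the worst offender float to the top (the exact output columns can vary a little between lsof versions):

  # count open files per command name, biggest consumer first
  lsof | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn | head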
psimation (ASKER)

Hi Jim
I tried to run lsof, but it seems that it is a "looping" process?
However, from watching it go for a while, it seems like most of the listings are the access-log files of all the Apache virtual hosts.

Could this be because there are too many virtual hosts on the system? I think there are about 200 by now.

jlevie

Well, that might be part of the problem if each virtual host has its own log files. With 200 virtual hosts you'd have a minimum of 400 open files (an access log and an error log apiece), even if there were only one visitor to each site.

FYI, lsof isn't a looping process. It will run until it has enumerated each and every open file, and that can take a long time when there are thousands of open files and lots of network connections. There are a number of options to lsof that can filter the output down to just certain things. Look at 'man lsof' for details.
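A few of those filtering options, for illustration:

  lsof -c httpd      # only processes whose command name begins with httpd
  lsof -u apache     # only files opened by a given user (if apache is the httpd user)
  lsof -i TCP:80     # only network files on TCP port 80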
psimation (ASKER)

Yikes, so the sane thing to do would be to increase the number of files that my system can handle (re my other questions about increasing that...).
Oh wait, you made me think now...
Is there a way to NOT have an access-log file for every domain, yet retain the individual domain identities for reporting purposes?
I like to use webalizer for web stats, and each domain has webalizer enabled, so if one could have one "communal" log file for all domains and still have webalizer report only on the specific domain, that should help a lot already, won't it?
jlevie

Well, yes, there's a way to do that. You'd remove all the log definitions from the virtual hosts and make sure that the global Apache configuration includes logging directives that record the virtual host name. Then you'd rotate the main logs nightly and use a script to split yesterday's log into per-site logs for webalizer to use.
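A sketch of that arrangement, with made-up file locations. In httpd.conf the per-vhost CustomLog lines go away in favour of one global log whose format leads with Apache's %v (virtual host name) specifier:

  LogFormat "%v %h %l %u %t \"%r\" %>s %b" vhostcommon
  CustomLog /var/log/httpd/access_log vhostcommon

Then, after the nightly rotation, a small script splits yesterday's log back into per-site files that webalizer can chew on (later Apache releases ship a similar split-logfile helper in their support directory):

  #!/bin/sh
  # split yesterday's combined vhost log into per-site logs; paths assumed
  LOG=/var/log/httpd/access_log.1
  OUT=/var/log/httpd/sites
  mkdir -p "$OUT"
  # first field is the vhost name; strip it so each line is a normal
  # common-format entry, then append it to that site's own log
  awk -v out="$OUT" '{
      site = $1
      sub(/^[^ ]+ /, "")
      print >> (out "/" site ".log")
  }' "$LOG"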
psimation (ASKER)

Hmm, you wouldn't perhaps know where I can read up on or see examples of doing that? (Just out of curiosity, how do these logs know which site a hit was for? Will there be an identifier included when you change the log format?)
Also, do you think this exercise is worthwhile (almost to the point where you could say it is preferred practice anyway on a virtual hosting machine)?

Thanks for the insights.
Please holler if I should rather open a new thread for this... I will gladly accept the answer to the previous issues at this point, and continue with the latest development...
ASKER CERTIFIED SOLUTION (jlevie)

psimation (ASKER)

Thanks Jim
I think/hope/pray that the file-max increase will do the trick anyway...
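(For reference, the system-wide limit in question can be raised on a 2.2/2.4-era kernel along these lines; the value is an arbitrary example, not a recommendation:)

  # system-wide cap on open file handles (example value)
  echo 16384 > /proc/sys/fs/file-max
  # to persist across reboots, add to /etc/sysctl.conf:
  #   fs.file-max = 16384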