
Several critical services suddenly stop with no warning or apparent cause

Several critical services, including, but not limited to, xinetd, ssh, cron, syslog, cups, iscsi and smb suddenly stop without warning. Users are kicked off the system and unable to log back in. Console login still works and these services can be restarted. This has happened several times so far, on our RHEL 5.3 and 5.2 Dell blade servers and our FC5 Dell Optiplex systems.

It has always happened during the day when users are logged in and working. None of my staff with root access report being active on the system immediately before this happens.

Because syslogd also stops, there are no entries in the logs to point to a cause.

Because it is happening on both RHEL 5 and FC 5, and on completely different hardware platforms, I suspect that it is not a distribution-specific, kernel-specific, or hardware issue, but that some of our custom scripting is causing it.

I would like to know what could cause the symptoms we are experiencing. Knowing what could cause them may help me find the code that is doing it.

I am also wondering if there is some way to keep logging running while this happens. Perhaps a different log daemon?

ASKER CERTIFIED SOLUTION

arnold


omarfarid

When the problem happens and you log in from the console, what is the output of the following command?

who -r

If it is s or S, then the system went to single-user mode, where all network services stop, and you need to find out why.
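To make that check repeatable from the console, the run level can be pulled out of the `who -r` output with a small helper. This is only a sketch: the function names `runlevel_from_who` and `is_single_user` are made up for illustration, and treating run level 1 as single-user mode (in addition to s and S) is an assumption based on the usual SysV init convention on Red Hat systems.

```shell
#!/bin/bash
# Hypothetical helper: print the word that follows "run-level"
# in a line of `who -r` output passed as the first argument.
runlevel_from_who() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i < NF; i++)
            if ($i == "run-level") { print $(i + 1); exit }
    }'
}

# Hypothetical helper: succeed if the given run level means single-user mode.
# Assumption: 1 is treated the same as s/S.
is_single_user() {
    case "$1" in
        s|S|1) return 0 ;;
        *)     return 1 ;;
    esac
}
```

On the affected machine this could be used as `level=$(runlevel_from_who "$(who -r)")` followed by `is_single_user "$level" && echo "single-user mode"`.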

akvalentine

ASKER

Thanks for the suggestions.

arnold: I have checked my cron jobs and none of them seem to be the culprit; none fire at the times the systems have crashed. I have already set up a cron job to run sar to capture activity, but I hadn't thought to use SNMP or check for a rootkit, so I'll do those. I included sudo users with root users (in fact, I am the only one who knows the actual root password). I do not have an idle process clean-up script.

omarfarid: I hadn't thought of checking for single user mode. I'll do that when it happens again.

ai_ja_nai

All the services that stop together have in common that they are network related. Could it simply be that your server loses connectivity and then regains it, but all the services have dropped their connections in the meantime?

Julian Parker

I heard that there were several issues with the original RHEL 5.3 release; I think some of them related to networking drivers.

Have you made sure your system is up to date? I believe RH provided updates fairly quickly.

akvalentine

ASKER

Thanks for your replies.

ai_ja_nai: Would having the network connection drop cause the services to actually stop? When this happens I have to log onto the console to restart the services. Cron is also stopped by this event; can it be considered a network related service?

jools: I have updated to the latest patches, minus one kernel update. However, Red Hat tech support doesn't think that what was fixed in that kernel update would make a difference. In any case, this also happens on our older Fedora Core 5 systems.

ai_ja_nai

>Would having the network connection drop cause the services to actually stop
Maybe they are configured to deactivate themselves when there is no connection.

>Cron is also stopped by this event; can it be considered a network related service
cron and syslogd are not actually network related services, but xinetd, ssh, cups, iscsi and smb are.

akvalentine

ASKER

I have discovered the cause of the problem. Installing rsyslog was the key, as it stayed running and logged the offending command. It turned out to be poor programming on my part in a script I'd written years ago. It dynamically generated a list of PIDs to kill, but didn't validate the list before it killed them. I'm awarding points to arnold because his suggestions were the most help to me.
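The original script is not shown in this thread, so the following is only a sketch of the kind of validation that was missing, not the actual fix. The function name `safe_pids` and the idea of checking each PID against an expected command name are assumptions made for illustration. It drops anything that is not a plain number, refuses PID 0, PID 1, the calling shell and its parent, and keeps a PID only if `ps` reports the expected program for it.

```shell
#!/bin/bash
# Hypothetical helper: safe_pids EXPECTED_NAME PID...
# Prints, one per line, only those PIDs that pass every check below.
safe_pids() {
    local expected="$1"; shift
    local pid comm
    for pid in "$@"; do
        # Reject empty strings and anything that is not all digits.
        case "$pid" in
            ''|*[!0-9]*) continue ;;
        esac
        # Never return PID 0, init (PID 1), this shell, or its parent.
        [ "$pid" -le 1 ] && continue
        [ "$pid" -eq "$$" ] && continue
        [ "$pid" -eq "$PPID" ] && continue
        # Keep the PID only if the process exists and is the expected program.
        comm=$(ps -p "$pid" -o comm= 2>/dev/null) || continue
        [ "$comm" = "$expected" ] || continue
        echo "$pid"
    done
}
```

A caller would then loop over the filtered output, for example `for p in $(safe_pids myworker $candidates); do kill "$p"; done`, where `myworker` and `$candidates` are placeholders. With an empty or garbage list the loop simply does nothing, instead of handing unexpected arguments to kill.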