Several critical services suddenly stop with no warning or apparent cause

Several critical services, including xinetd, ssh, cron, syslog, cups, iscsi and smb among others, suddenly stop without warning. Users are kicked off the system and cannot log back in. Console login still works, and the services can be restarted from there. This has happened several times so far, on our RHEL 5.3 and 5.2 Dell blade servers and on our FC5 Dell Optiplex systems.

It has always happened during the day when users are logged in and working.  None of my staff with root access report being active on the system immediately before this happens.  

Because syslogd also stops, there are no entries in the logs to point to a cause.

Because it is happening on both RHEL 5 and FC 5, and on completely different hardware platforms, I suspect that it is neither a distribution/kernel-specific issue nor a hardware issue, and that some of our custom scripting is causing it.

I would like to know what could cause the symptoms we are experiencing. Knowing the possible causes may help me find the code that is responsible.

I am also wondering if there is some way to keep logging running while this happens.  Perhaps a different log daemon?

Check the various cron jobs you may have.
Set up a snapshot cron job on a five-minute interval that captures the running processes and memory usage, and also restarts syslog if it is not running.
Enable SNMP and poll the servers to get the same information.  In short you have something that triggers the termination of these services.
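
The snapshot idea above could be sketched roughly as follows. This is only an illustration, not the script anyone in this thread actually used: the watch list `WATCHED`, the log path and the function names are all hypothetical, and it assumes a Python 3.7+ interpreter, which is newer than what these distributions shipped with. It records the output of `ps` in a plain file (so the record survives even when syslog is down) and reports which watched daemons are missing. The restart step itself is left to the caller.

```python
import subprocess
import time

# Hypothetical watch list; the names must match what `ps` reports locally.
WATCHED = ["syslogd", "crond", "sshd", "xinetd"]


def running_commands(ps_output):
    """Return the set of command names found in `ps -eo pid,rss,comm` output."""
    names = set()
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        parts = line.split(None, 2)
        if len(parts) == 3:
            names.add(parts[2].strip())
    return names


def missing_daemons(ps_output, watched=WATCHED):
    """Return the watched daemons that do not appear in the process list."""
    running = running_commands(ps_output)
    return [name for name in watched if name not in running]


def snapshot(path="/var/tmp/proc-snapshot.log"):
    """Append a timestamped process/memory snapshot to a plain file."""
    ps_output = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    missing = missing_daemons(ps_output)
    # Written directly to a file, not through syslog, since syslog may be down.
    with open(path, "a") as log:
        log.write("=== %s ===\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
        log.write(ps_output)
        log.write("missing: %s\n" % (", ".join(missing) or "none"))
    return missing
```

A cron entry could call `snapshot()` every five minutes and restart whatever it returns as missing; the last snapshot before a failure then shows what was running just before the services died.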

Do you have folks with sudo rights, or were they covered in the "root" access statement? Check the system for a rootkit or for a network connection that is always present.

Do you have an idle process cleanup job? Someone may have made a mistake in the script so that, instead of killing the idle session, it kills the parent process.
When the problem happens and you log in from the console, what is the output of the following command?

who -r

If it is s or S, then the system went to single-user mode, in which all network services stop, and you need to find out why.
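
If the check is to be scripted rather than read by eye, the output of `who -r` can be parsed along these lines. This is a minimal sketch: the function names are hypothetical, and treating runlevel `1` as single-user in addition to `s`/`S` is an assumption added here, not something stated above.

```python
import re


def parse_runlevel(who_r_output):
    """Return (current, previous) runlevels from `who -r` output.

    `previous` is None when the output carries no `last=` field.
    """
    current = re.search(r"run-level\s+(\S+)", who_r_output)
    if current is None:
        return None, None
    previous = re.search(r"last=(\S+)", who_r_output)
    return current.group(1), (previous.group(1) if previous else None)


def is_single_user(who_r_output):
    """True if the current runlevel looks like single-user mode."""
    current, _ = parse_runlevel(who_r_output)
    # "1" is included as an assumption; the advice above names only s and S.
    return current in ("1", "s", "S")
```

Feeding it the text captured from `who -r` on the console right after an incident gives a yes/no answer that can be written to a file alongside other diagnostics.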
akvalentine (Author) commented:
Thanks for the suggestions.  

arnold: I have checked my cron jobs; none of them seems to be the culprit, and none fire at the times the systems have crashed. I have already set up a cron job to run sar to capture activity, but I hadn't thought to use SNMP or check for a rootkit, so I'll do those. I included sudo users with root users (in fact, I am the only one who knows the actual root password). I do not have an idle process cleanup script.

omarfarid: I hadn't thought of checking for single user mode.  I'll do that when it happens again.

All the services that stop together have in common that they are network related. Could it simply be that your server loses connectivity and then regains it, but all the services have dropped their connections in the meantime?
I heard that there were several issues with the original RHEL 5.3 release. I think some of them were related to networking drivers.

Have you made sure your system is up to date? I believe RH provided updates fairly quickly.

akvalentine (Author) commented:
Thanks for your replies.

ai_ja_nai: Would having the network connection drop cause the services to actually stop?  When this happens I have to log onto the console to restart the services.  Cron is also stopped by this event; can it be considered a network related service?  

jools: I have updated to the latest patches, minus one kernel update. However, Red Hat tech support doesn't think that what was fixed in that kernel update would make a difference. In any case, this also happens on our older Fedora Core 5 systems.
>Would having the network connection drop cause the services to actually stop
Maybe they are configured to deactivate themselves when there is no connection.

>Cron is also stopped by this event; can it be considered a network related service
cron and syslogd are not actually network-related services, but xinetd, ssh, cups, iscsi and smb are.
akvalentine (Author) commented:
I have discovered the cause of the problem. Installing rsyslog was the key, as it stayed running and logged the offending command. It turned out to be poor programming on my part in a script I'd written years ago. It dynamically generated a list of PIDs to kill, but didn't validate the list before it killed them. I'm awarding points to arnold because his suggestions were the most help to me.
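
For anyone with a similar script, the kind of validation that was missing might look like the sketch below. It is not the author's script or the actual fix, which was not posted; the function name and the `protected_pids` parameter are hypothetical. The idea is to reject anything that is not a plain positive integer before it reaches a kill call. Empty entries, `0` and negative values matter in particular because a kill call treats them as "a process group" or "every process the caller may signal" rather than as a single process.

```python
import os
import re


def validate_pids(candidates, protected_pids=()):
    """Return the PIDs from `candidates` that are safe to pass to a kill call.

    Drops entries that are empty, non-numeric, zero, negative, PID 1 (init),
    this process or its parent, duplicates, and anything in `protected_pids`.
    """
    own = {os.getpid(), os.getppid()}
    safe = []
    for raw in candidates:
        text = str(raw).strip()
        # Only plain ASCII digits pass; "", "-1" and "abc" are all rejected.
        if not re.fullmatch(r"[0-9]+", text):
            continue
        pid = int(text)
        if pid <= 1:
            continue
        if pid in own or pid in protected_pids:
            continue
        if pid not in safe:
            safe.append(pid)
    return safe
```

Logging the rejected entries as well as the accepted ones would have made a bug like this visible long before it took down syslog and the other services.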
Question has a verified solution.
