Several critical services suddenly stop with no warning or apparent cause

Posted on 2009-02-24
Last Modified: 2013-12-06
Several critical services, including, but not limited to, xinetd, ssh, cron, syslog, cups, iscsi and smb suddenly stop without warning.  Users are kicked off the system and unable to log back in.  Console login still works and these services can be restarted.  This has happened several times so far, on our RHEL 5.3 and 5.2 Dell blade servers and our FC5 Dell Optiplex systems.

It has always happened during the day when users are logged in and working.  None of my staff with root access report being active on the system immediately before this happens.  

Because syslogd also stops, there are no entries in the logs to point to a cause.

Because it is happening on both RHEL 5 and FC 5, and on completely different hardware platforms, I suspect that it is not a distribution-, kernel- or hardware-specific issue but that some of our custom scripting is causing it.

I would like to know what could cause the symptoms we are experiencing.  Knowing the possible causes may help me find the code that is responsible.

I am also wondering if there is some way to keep logging running while this happens.  Perhaps a different log daemon?

Question by:akvalentine

    Accepted Solution

    Check the various cron jobs you may have.
    Set up a snapshot cron job on a five-minute interval that captures the running processes and memory usage, and restarts syslog if it is not running.
    Enable SNMP and poll the servers to get the same information.  In short, you have something that triggers the termination of these services.
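
A minimal sketch of such a snapshot job, assuming a Linux /proc filesystem. The watched daemon names, the log path, and the `/sbin/service syslog start` restart command are assumptions for illustration and should be adjusted to the systems in question. It records only the process count and which watched daemons are missing; memory figures could be added from /proc/meminfo in the same way.

```python
import os
import subprocess
import time

# Assumption: daemon names to watch; adjust to the services that keep dying.
WATCHED = ("syslogd", "crond", "sshd", "xinetd")


def process_names(proc_root="/proc"):
    """Return the set of command names found in <proc_root>/<pid>/stat."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except (IOError, OSError):
            continue  # process exited between listdir() and open()
        # The command name sits between the first "(" and the last ")".
        start, end = stat.find("("), stat.rfind(")")
        if 0 <= start < end:
            names.add(stat[start + 1:end])
    return names


def missing_daemons(running, watched=WATCHED):
    """Return the watched daemon names that are not in the running set."""
    return sorted(name for name in watched if name not in running)


def snapshot(log_path, proc_root="/proc"):
    """Append one timestamped line to log_path and return the missing daemons."""
    running = process_names(proc_root)
    missing = missing_daemons(running)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write("%s running=%d missing=%s\n"
                  % (stamp, len(running), ",".join(missing) or "none"))
    return missing


def main():
    # Hypothetical log path; it is written directly so it survives a dead syslogd.
    missing = snapshot("/var/log/service-snapshot.log")
    if "syslogd" in missing:
        # Assumption: the init script is named "syslog" on this system.
        subprocess.call(["/sbin/service", "syslog", "start"])
```

Saved as a script that ends with a call to `main()`, this could be run from root's crontab every five minutes. Because it writes its own file instead of going through syslog, the record survives even when syslogd is one of the casualties.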

    Do you have folks with sudo rights, or were they covered in the "root" access statement?  Check the system for a rootkit or for a network connection that is always present.

    Do you have an idle process cleanup job? Someone may have made a mistake when writing the script, so that instead of killing the idle session it kills the parent process.

    Expert Comment

    When the problem happens and you log in from the console, what is the output of the command

    who -r

    If it is s or S, then the system went to single-user mode, where all network services stop, and you need to find out why.
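
The check could also be scripted so the runlevel gets recorded automatically. This is a rough sketch that assumes the usual Linux `who -r` output, where the word `run-level` is followed by the runlevel; the function names are made up for illustration. Treating runlevel 1 as single-user as well is an addition here, since on Red Hat style systems runlevel 1 is single-user mode.

```python
import subprocess


def parse_runlevel(who_output):
    """Return the token that follows 'run-level' in `who -r` output, or None."""
    fields = who_output.split()
    if "run-level" in fields:
        idx = fields.index("run-level")
        if idx + 1 < len(fields):
            return fields[idx + 1]
    return None


def is_single_user(runlevel):
    """True for s, S, or 1 (the last one is an assumption added in this sketch)."""
    return runlevel in ("s", "S", "1")


def current_runlevel():
    """Run `who -r` and return the parsed runlevel, or None if it is not reported."""
    out = subprocess.check_output(["who", "-r"])
    return parse_runlevel(out.decode("ascii", "replace"))
```

Logging `current_runlevel()` next to a timestamp from the console right after an incident would show whether something switched the runlevel.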

    Author Comment

    Thanks for the suggestions.  

    arnold: I have checked my cron jobs; none of them seems to be the culprit, and none fire at the times the systems have crashed. I have already set up a cron job to run sar to capture activity, but I hadn't thought to use SNMP or check for a rootkit, so I'll do those.   I included sudo users with root users (in fact, I am the only one who knows the actual root password).  I do not have an idle process cleanup script.

    omarfarid: I hadn't thought of checking for single user mode.  I'll do that when it happens again.


    Expert Comment

    All the services that stop together have in common that they are network related. Could it simply be that your server loses connectivity and then regains it, but in the meantime all the services have dropped their connections?

    Expert Comment

    I heard that there were several issues with the original RHEL 5.3 release; I think some of them were related to networking drivers.

    Have you made sure your system is up to date? I believe RH provided updates fairly quickly.


    Author Comment

    Thanks for your replies.

    ai_ja_nai: Would having the network connection drop cause the services to actually stop?  When this happens I have to log onto the console to restart the services.  Cron is also stopped by this event; can it be considered a network related service?  

    jools: I have updated to the latest patches, minus one kernel update; however, Red Hat tech support doesn't think that what was fixed in that kernel update would make a difference.  In any case, this also happens on our older Fedora Core 5 systems.

    Expert Comment

    >Would having the network connection drop cause the services to actually stop
    Maybe they are configured to deactivate themselves when there is no connection.

    >Cron is also stopped by this event; can it be considered a network related service
    cron and syslogd are actually not network-related services, but xinetd, ssh, cups, iscsi and smb are.

    Author Comment

    I have discovered the cause of the problem.  Installing rsyslog was the key, as it stayed running and logged the offending command.  It turned out to be poor programming on my part in a script I'd written years ago.  It dynamically generated a list of PIDs to kill, but didn't validate the list before it killed them.  I'm awarding points to arnold because his suggestions were the most helpful to me.
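
The original script is not shown in this thread, so the following is only a sketch of the kind of validation that was missing, not the actual fix. It assumes a Linux /proc filesystem, and the names `safe_pids`, `command_name`, `kill_validated` and the example command name are hypothetical. It drops anything that is not an integer greater than 1 (a PID of 0 or -1 passed to kill signals a whole process group or every process the caller may signal), skips the script's own process and its parent, and keeps a PID only if its command name matches what the script expects to kill.

```python
import os
import signal


def command_name(pid, proc_root="/proc"):
    """Return the command name from <proc_root>/<pid>/stat, or None if unreadable."""
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as fh:
            stat = fh.read()
    except (IOError, OSError):
        return None  # no such process, or not readable
    # The command name sits between the first "(" and the last ")".
    start, end = stat.find("("), stat.rfind(")")
    return stat[start + 1:end] if 0 <= start < end else None


def safe_pids(candidates, expected_name, proc_root="/proc"):
    """Filter a generated PID list down to entries that are safe to signal."""
    protected = {os.getpid(), os.getppid()}
    keep = []
    for raw in candidates:
        try:
            pid = int(str(raw).strip())
        except ValueError:
            continue  # empty string or garbage from the generator
        if pid <= 1 or pid in protected:
            continue  # never init, process groups, "everything", or ourselves
        if command_name(pid, proc_root) != expected_name:
            continue  # PID is gone or now belongs to a different program
        keep.append(pid)
    return keep


def kill_validated(candidates, expected_name, sig=signal.SIGTERM):
    """Send sig only to the PIDs that pass safe_pids()."""
    for pid in safe_pids(candidates, expected_name):
        try:
            os.kill(pid, sig)
        except OSError:
            pass  # already exited, or we lack permission
```

For example, `kill_validated(pid_list, "idle_job")` would signal only processes whose command name is `idle_job`, and an empty or malformed list would result in no signals being sent at all.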
