  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 401

Several critical services suddenly stop with no warning or apparent cause

Several critical services, including, but not limited to, xinetd, ssh, cron, syslog, cups, iscsi and smb suddenly stop without warning.  Users are kicked off the system and unable to log back in.  Console login still works and these services can be restarted.  This has happened several times so far, on our RHEL 5.3 and 5.2 Dell blade servers and our FC5 Dell Optiplex systems.

It has always happened during the day when users are logged in and working.  None of my staff with root access report being active on the system immediately before this happens.  

Because syslogd also stops, there are no entries in the logs to point to a cause.

Because it is happening on both RHEL 5 and FC 5, and on completely different hardware platforms, I suspect that it is not a distribution, kernel, or hardware issue but that some of our custom scripting is causing it.

I would like to know what could cause the symptoms we are experiencing.  Knowing what could cause them may help me find the code that is doing it.

I am also wondering if there is some way to keep logging running while this happens.  Perhaps a different log daemon?

akvalentine Asked:
1 Solution
 
arnold Commented:
Check the various cron jobs you may have.
Set up a snapshot cron job on a five-minute interval that captures the process list and memory usage, and that restarts syslog if it is not running.
Enable SNMP and poll the servers to get the same information.  In short, you have something that triggers the termination of these services.

Do you have folks with sudo rights, or were they covered in the "root" access statement?  Check the system for a rootkit or for a network connection that is always present.

Do you have an idle process cleanup job? Someone may have made a mistake when writing the script, so that instead of killing the idle session it kills the parent process.
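
The snapshot check suggested above could look roughly like the following. This is only a minimal Python sketch, not something from the thread: it assumes a Linux /proc filesystem and a Python 3 interpreter, and the function names (`running_process_names`, `missing_services`) are made up for illustration.

```python
import os


def running_process_names(proc_root="/proc"):
    """Return the set of command names found in <proc_root>/<pid>/stat."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        # The command name is the text between the first "(" and the last ")".
        start, end = stat.find("("), stat.rfind(")")
        if start != -1 and end > start:
            names.add(stat[start + 1:end])
    return names


def missing_services(expected, proc_root="/proc"):
    """Return the expected daemon names that are not currently running."""
    return sorted(set(expected) - running_process_names(proc_root))
```

Called from a five-minute cron job with a list such as `["syslogd", "crond", "sshd", "xinetd"]` (assumed daemon names; adjust to what your systems actually run), it returns the names that have disappeared, which the job can then record or restart.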
 
omarfarid Commented:
When the problem happens and you log in from the console, what is the output of this command?

who -r

If the run level is s or S, then the system went to single-user mode, where all network services stop, and you need to find out why.
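
If you want the snapshot job to record this automatically, the check could be scripted along these lines. This is a hedged Python sketch: the helper names are hypothetical, and it assumes the output contains a `run-level` field followed by the level, as printed by the GNU coreutils `who -r`.

```python
import re

# Run levels treated as single-user mode, per the comment above.
SINGLE_USER_LEVELS = {"s", "S"}


def parse_runlevel(who_r_output):
    """Extract the run level from `who -r` output, or None if absent."""
    match = re.search(r"run-level\s+(\S+)", who_r_output)
    return match.group(1) if match else None


def is_single_user(who_r_output):
    """True if the reported run level is a single-user level."""
    return parse_runlevel(who_r_output) in SINGLE_USER_LEVELS
```

The text to parse could come from `subprocess.run(["who", "-r"], capture_output=True, text=True).stdout` on a system with a recent Python 3.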
 
akvalentine (Author) Commented:
Thanks for the suggestions.  

arnold: I have checked my cron jobs; none of them seems to be the culprit, and none fire at the times the systems have crashed. I have already set up a cron job to run sar to capture activity, but I hadn't thought to use SNMP or check for a rootkit, so I'll do those.   I included sudo users with root users (in fact, I am the only one who knows the actual root password).  I do not have an idle process cleanup script.

omarfarid: I hadn't thought of checking for single user mode.  I'll do that when it happens again.


 
ai_ja_nai Commented:
What the services that stop together have in common is that they are network related. Could it simply be that your server loses connectivity and then regains it, but in the meantime all the services have dropped their connections?
 
jools Commented:
I heard that there were several issues with the original RHEL 5.3 release; I think some of them were related to networking drivers.

Have you made sure your system is up to date? I believe RH provided updates fairly quickly.

 
akvalentine (Author) Commented:
Thanks for your replies.

ai_ja_nai: Would having the network connection drop cause the services to actually stop?  When this happens I have to log onto the console to restart the services.  Cron is also stopped by this event; can it be considered a network related service?  

jools: I have updated to the latest patches, minus one kernel update; however, Red Hat tech support doesn't think that what was fixed in that kernel update would make a difference.  In any case, this also happens on our older Fedora Core 5 systems.
 
ai_ja_nai Commented:
>Would having the network connection drop cause the services to actually stop
Maybe they are configured to deactivate themselves when there is no connection.

>Cron is also stopped by this event; can it be considered a network related service
cron and syslogd are not actually network-related services, but xinetd, ssh, cups, iscsi and smb are.
 
akvalentine (Author) Commented:
I have discovered the cause of the problem.  Installing rsyslog was the key, as it stayed running and logged the offending command.  It turned out to be poor programming on my part in a script I'd written years ago.  It dynamically generated a list of PIDs to kill, but didn't validate the list before it killed them.  I'm awarding points to arnold because his suggestions were the most helpful to me.
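
For anyone with a similar script, validating a generated PID list before killing anything might look like the sketch below. It is not the original script, whose language and contents are not shown in this thread; the function name `validate_pids` and the choice of what to reject are assumptions for illustration, written in Python 3.

```python
import os
import re


def validate_pids(candidates):
    """Return only the candidates that are plausible, safe PIDs to signal."""
    # Never signal PID 0 (the caller's whole process group), PID 1 (init),
    # this script itself, or the process that launched it.
    protected = {0, 1, os.getpid(), os.getppid()}
    safe = []
    for raw in candidates:
        text = str(raw).strip()
        # Reject empty strings, words, and negative numbers. A negative
        # argument to kill targets a process group, and -1 targets every
        # process the caller is allowed to signal.
        if not re.fullmatch(r"[0-9]+", text):
            continue
        pid = int(text)
        if pid in protected or pid in safe:
            continue
        safe.append(pid)
    return safe
```

A cleanup script would then loop over `validate_pids(generated_list)` and call `os.kill(pid, signal.SIGTERM)` on each entry, ideally after also checking that the process still belongs to the session it is meant to clean up.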