Server unresponsive issue

I had a Linux server running RHEL4 become unresponsive recently, and it required a reboot. When it came back up I looked through the logs and dmesg but couldn't see anything conclusive as to why it had become unresponsive. I'm posting some of what I saw in /var/log/messages around the time this started, so hopefully someone can tell me what may have caused the server to hang. Thanks!


Oct 5 06:26:00 servername kernel: request_module: runaway loop modprobe net-pf-10
Oct 5 06:26:00  servername hald[3763]: Timed out waiting for hotplug event 626. Rebasing to 631
Oct 5 06:26:00  servername kernel: usb 1-1.1: USB disconnect, address 4
Oct 5 06:26:00  servername kernel: usb 1-1.1: new full speed USB device using address 5
Oct 5 06:26:00  servername kernel: oom-killer: gfp_mask=0x1d2
Oct 5 06:26:00  servername kernel: Mem-info:
Oct 5 06:26:00  servername kernel: Node 0 DMA per-cpu:
Oct 5 06:26:00  servername kernel: cpu 0 hot: low 2, high 6, batch 1
Oct 5 06:26:00  servername kernel: cpu 0 cold: low 0, high 2, batch 1
Oct 5 06:26:00  servername kernel: cpu 1 hot: low 2, high 6, batch 1
Oct 5 06:26:00  servername kernel: cpu 1 cold: low 0, high 2, batch 1
Oct 5 06:26:00  servername kernel: cpu 2 hot: low 2, high 6, batch 1
Oct 5 06:26:00  servername kernel: cpu 2 cold: low 0, high 2, batch 1
Oct 5 06:26:00  servername kernel: cpu 3 hot: low 2, high 6, batch 1
Oct 5 06:26:00  servername kernel: cpu 3 cold: low 0, high 2, batch 1
Oct 5 06:26:00  servername kernel: Node 0 Normal per-cpu:
Oct 5 06:26:00  servername kernel: cpu 0 hot: low 32, high 96, batch 16
Oct 5 06:26:00  servername kernel: cpu 0 cold: low 0, high 32, batch 16
Oct 5 06:26:00  servername kernel: cpu 1 hot: low 32, high 96, batch 16
Oct 5 06:26:00  servername kernel: cpu 1 cold: low 0, high 32, batch 16
Oct 5 06:26:00  servername kernel: cpu 2 hot: low 32, high 96, batch 16
Oct 5 06:26:00 servername kernel: cpu 2 cold: low 0, high 32, batch 16
Oct 5 06:26:00 servername kernel: cpu 3 hot: low 32, high 96, batch 16
Oct 5 06:26:00 servername kernel: cpu 3 cold: low 0, high 32, batch 16
Oct 5 06:26:00 servername kernel: Node 0 HighMem per-cpu: empty
Oct 5 06:26:00 servername kernel:
Oct 5 06:26:00 servername kernel: Free pages:       19348kB (0kB HighMem)
Asked by linuxpig
macker commented:
There's one big clue in the output:

Oct 5 06:26:00  servername kernel: oom-killer: gfp_mask=0x1d2

OOM is an acronym for Out Of Memory.  Generally speaking, this suggests that the server ran out of memory (RAM + swap) and had to resort to killing processes.  The preceding log messages may give clues as to the circumstances that led to this.
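If you want to see which process actually got killed, a quick grep through the logs usually turns it up. The exact message wording varies between kernel versions, but something along these lines should catch it:

```
# Pull OOM-killer activity out of the current and rotated system logs
grep -iE "oom-killer|out of memory|killed process" /var/log/messages*
```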

As Jonas suggested, having sysstat installed (and running; it's off by default) will help you track trends.  The default sampling period is 10 minutes, but you may want to change this to something more frequent, such as every 5 minutes or even every minute.  The primary cost is disk space, which is a small price to pay.
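Roughly, from memory (paths and service names may differ slightly on your box; on x86_64 the collector lives under /usr/lib64/sa instead of /usr/lib/sa):

```
# Enable the sysstat service at boot and start it now
chkconfig sysstat on
service sysstat start

# /etc/cron.d/sysstat -- installed by the package; change */10 to */5
# (or */1) for more frequent samples
*/5 * * * * root /usr/lib/sa/sa1 1 1
```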

Lastly, if you suspect a kernel panic, there are options such as kdump and netdump, depending on your version of RHEL.  (RHEL3 and RHEL4 still use netdump.)  The software lets you capture a detailed kernel dump, including a memory image, to local disk or over the network, and then reboot the server.  This gives you the opportunity for a very detailed post-mortem, though the detail is usually excessive for the average user.  I'd consider it more appropriate for systems running a specialized software stack that is crashing repeatedly, where vendor support exists for troubleshooting the cause.
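For reference, the netdump client side is roughly this simple, if I'm remembering the config variable right; the receiving host needs the netdump-server package, and the address below is just a placeholder:

```
# /etc/sysconfig/netdump -- point the client at the host running netdump-server
NETDUMPADDR=192.168.0.10

# then enable and start the client service
chkconfig netdump on
service netdump start
```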

For the average situation, which seems to describe your usage, sysstat logging every 5 or 10 minutes, and watching memory and swap usage, will be a good start.  Make sure you don't have an excess of swap on an IDE or SATA drive, e.g. 2 GB of RAM and 8 GB of swap.  Use sysstat (sar) to watch trends, and snapshots from tools like top, free (`free -m`), and iostat (`iostat -x 5 5`) for current status.
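A few snapshot commands worth keeping in your back pocket the next time the box starts to bog down (all standard on RHEL4, though column output varies a bit between versions):

```
free -m                    # memory and swap usage in megabytes
vmstat 5 5                 # si/so columns show swap-in/swap-out activity
iostat -x 5 5              # per-device utilization and wait times
top -b -n 1 | head -40     # one-shot process snapshot
```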
cjl7 commented:
Hi,


Have you got sysstat installed? Running 'sar' will give you a hint as to whether the system was busy swapping and the CPU was waiting for disk I/O.
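For example, assuming sysstat's daily files are in the default /var/log/sa location (sa05 would be the file for the 5th; the number matches the day of the month):

```
sar -u -f /var/log/sa/sa05   # CPU usage, including %iowait
sar -W -f /var/log/sa/sa05   # pages swapped in/out per second
sar -r -f /var/log/sa/sa05   # memory and swap utilization
```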

If the server gets really slow, swapping is the most common reason.

//jonas
linuxpig (author) commented:
The response steered me in the right direction.