Server unresponsive issue

I had a Linux server running RHEL4 become unresponsive recently, and it required a reboot. When it came back up, I looked through the logs and dmesg but couldn't find anything conclusive about why it had become unresponsive. I'm posting below some of what I saw in /var/log/messages around the time this started, in the hope that someone can tell me what may have caused the server to hang. Thanks!


Oct 5 06:26:00 servername kernel: request_module: runaway loop modprobe net-pf-10
Oct 5 06:26:00 servername hald[3763]: Timed out waiting for hotplug event 626. Rebasing to 631
Oct 5 06:26:00 servername kernel: usb 1-1.1: USB disconnect, address 4
Oct 5 06:26:00 servername kernel: usb 1-1.1: new full speed USB device using address 5
Oct 5 06:26:00 servername kernel: oom-killer: gfp_mask=0x1d2
Oct 5 06:26:00 servername kernel: Mem-info:
Oct 5 06:26:00 servername kernel: Node 0 DMA per-cpu:
Oct 5 06:26:00 servername kernel: cpu 0 hot: low 2, high 6, batch 1
Oct 5 06:26:00 servername kernel: cpu 0 cold: low 0, high 2, batch 1
Oct 5 06:26:00 servername kernel: cpu 1 hot: low 2, high 6, batch 1
Oct 5 06:26:00 servername kernel: cpu 1 cold: low 0, high 2, batch 1
Oct 5 06:26:00 servername kernel: cpu 2 hot: low 2, high 6, batch 1
Oct 5 06:26:00 servername kernel: cpu 2 cold: low 0, high 2, batch 1
Oct 5 06:26:00 servername kernel: cpu 3 hot: low 2, high 6, batch 1
Oct 5 06:26:00 servername kernel: cpu 3 cold: low 0, high 2, batch 1
Oct 5 06:26:00 servername kernel: Node 0 Normal per-cpu:
Oct 5 06:26:00 servername kernel: cpu 0 hot: low 32, high 96, batch 16
Oct 5 06:26:00 servername kernel: cpu 0 cold: low 0, high 32, batch 16
Oct 5 06:26:00 servername kernel: cpu 1 hot: low 32, high 96, batch 16
Oct 5 06:26:00 servername kernel: cpu 1 cold: low 0, high 32, batch 16
Oct 5 06:26:00 servername kernel: cpu 2 hot: low 32, high 96, batch 16
Oct 5 06:26:00 servername kernel: cpu 2 cold: low 0, high 32, batch 16
Oct 5 06:26:00 servername kernel: cpu 3 hot: low 32, high 96, batch 16
Oct 5 06:26:00 servername kernel: cpu 3 cold: low 0, high 32, batch 16
Oct 5 06:26:00 servername kernel: Node 0 HighMem per-cpu: empty
Oct 5 06:26:00 servername kernel:
Oct 5 06:26:00 servername kernel: Free pages:       19348kB (0kB HighMem)
Asked by: linuxpig

cjl7 (freelance for hire) commented:
Hi,


Do you have sysstat installed? Running 'sar' will give you a hint as to whether the system was busy swapping and the CPU was waiting on disk I/O.

If the server is really slow, that is the most common reason.
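
For example, something like this (assuming sysstat has already been collecting data):

sar -u        # CPU utilization history; a high %iowait means the box was waiting on disk
sar -W        # swapping activity (pages swapped in and out per second)
sar -r        # memory and swap usage over time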

//jonas

macker- commented:
There's one big clue in the output:

Oct 5 06:26:00 servername kernel: oom-killer: gfp_mask=0x1d2

OOM is an acronym for Out Of Memory. Generally speaking, this suggests that the server ran out of memory (RAM + swap) and had to resort to killing processes. The preceding log messages may give clues as to the circumstances that led to this.
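
To see which processes were actually killed, it helps to search the logs around that timestamp, e.g.:

grep -i oom /var/log/messages
grep -i 'out of memory' /var/log/messages   # RHEL4 kernels log something like "Out of Memory: Killed process 1234 (name)"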

As Jonas suggested, having sysstat installed (and running; it's off by default) will help you track trends. The default sampling period is 10 minutes, but you may want to change this to something more frequent, such as every 5 minutes or every minute. The primary cost is disk space, which is a small price to pay.
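
If memory serves, enabling it on RHEL4 looks roughly like this (the collection schedule lives in a cron file, so that is where you change the interval; exact paths may differ on your system):

chkconfig sysstat on     # collect at every boot
service sysstat start    # start collecting now
# then in /etc/cron.d/sysstat, change the sa1 line from every 10 minutes to every 5:
# */5 * * * * root /usr/lib/sa/sa1 1 1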

Lastly, if you suspect a kernel panic, there are options such as kdump and netdump, depending on your version of RHEL. (RHEL3 and RHEL4 still use netdump.) The software lets you capture a detailed kernel dump, including a memory image, to local disk or over the network, and then reboot the server. This gives you the opportunity for a very detailed post-mortem, though the detail is usually excessive for the average user. I'd consider it more appropriate for systems running a specialized software stack that crashes repeatedly, where vendor support exists for troubleshooting the cause.
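
As a rough sketch, from memory (check the Red Hat docs for specifics; this assumes a separate machine is already running netdump-server):

# on the crashing client, set NETDUMPADDR=<ip of the netdump server> in /etc/sysconfig/netdump, then:
chkconfig netdump on
service netdump start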

For the average situation, which seems to describe yours, sysstat logging every 5 or 10 minutes and watching memory and swap usage will be a good start.  Make sure you don't have an excess of swap on an IDE or SATA drive, e.g. 2G of RAM and 8G of swap.  Use sysstat (sar) to watch trends, and snapshots from tools like top, free (`free -m`), and iostat (`iostat -x 5 5`) for current status.
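
Once collection is running, daily history files accumulate under /var/log/sa, so you can go back and inspect the window around an incident, e.g.:

sar -r -f /var/log/sa/sa05                           # memory/swap figures for the 5th of the month
sar -u -f /var/log/sa/sa05 -s 06:00:00 -e 06:40:00   # CPU (including %iowait) around the time of the hang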

linuxpig (author) commented:
The response steered me in the right direction.