zlygis
asked on
Need someone to "decode" a kernel crash.
Hi,
Every few days the server crashes. Until now I could not find anything in the log files, and the daily server load is very low. Only today, after the most recent crash, did I find some information in /var/log/messages (see the attached code snippet). I would like the Experts Exchange experts to look into this and see whether there is any way to determine the culprit of my troubles, or at least where to start looking.
Thank you in advance.
P.S. Sorry for my English.
Apr 12 15:39:53 ger1 kernel: list_del corruption. next->prev should be c23ebfb8, but was 00200200
Apr 12 15:39:53 ger1 kernel: ------------[ cut here ]------------
Apr 12 15:39:53 ger1 kernel: kernel BUG at lib/list_debug.c:70!
Apr 12 15:39:53 ger1 kernel: invalid opcode: 0000 [#1]
Apr 12 15:39:53 ger1 kernel: SMP
Apr 12 15:39:53 ger1 kernel: last sysfs file: /block/ram0/range
Apr 12 15:39:53 ger1 kernel: Modules linked in: ipt_owner ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables ipv6 xfrm_na$
Apr 12 15:39:53 ger1 kernel: CPU: 0
Apr 12 15:39:53 ger1 kernel: EIP: 0060:[<c04ea6e4>] Not tainted VLI
Apr 12 15:39:53 ger1 kernel: EFLAGS: 00010046 (2.6.18-128.7.1.el5PAE #1)
Apr 12 15:39:53 ger1 kernel: EIP is at list_del+0x38/0x5c
Apr 12 15:39:53 ger1 kernel: eax: 00000048 ebx: c23ebfb8 ecx: 00000092 edx: 00000000
Apr 12 15:39:53 ger1 kernel: esi: 00000256 edi: c0684780 ebp: c23ebfa0 esp: daa84e44
Apr 12 15:39:53 ger1 kernel: ds: 007b es: 007b ss: 0068
Apr 12 15:39:53 ger1 kernel: Process php5 (pid: 31552, ti=daa84000 task=c3531550 task.ti=daa84000)
Apr 12 15:39:53 ger1 kernel: Stack: c063e008 c23ebfb8 00200200 c0684800 c0458d5f fffb4000 00000003 00000000
Apr 12 15:39:53 ger1 kernel: 000280d2 c0685a28 00000000 00000001 00000000 00000001 00000000 c0685a28
Apr 12 15:39:53 ger1 kernel: 000280d2 c0685a28 c3531550 c0458fa7 00000044 00000000 000280d2 00000010
Apr 12 15:39:53 ger1 kernel: Call Trace:
Apr 12 15:39:53 ger1 kernel: [<c0458d5f>] get_page_from_freelist+0x142/0x333
Apr 12 15:39:53 ger1 kernel: [<c0458fa7>] __alloc_pages+0x57/0x297
Apr 12 15:39:53 ger1 kernel: [<c046643c>] anon_vma_prepare+0x11/0xa5
Apr 12 15:39:53 ger1 kernel: [<c046111d>] __handle_mm_fault+0x4f6/0xb7b
Apr 12 15:39:53 ger1 kernel: [<c061083b>] do_page_fault+0x2d2/0x600
Apr 12 15:39:53 ger1 kernel: [<c0610569>] do_page_fault+0x0/0x600
Apr 12 15:39:53 ger1 kernel: [<c0405a89>] error_code+0x39/0x40
Apr 12 15:39:53 ger1 kernel: =======================
Apr 12 15:39:53 ger1 kernel: Code: 53 68 ba df 63 c0 e8 c1 a7 f3 ff 0f 0b 41 00 f7 df 63 c0 83 c4 0c 8b 03 8b 40 04 39 d8 74 $
Apr 12 15:39:53 ger1 kernel: EIP: [<c04ea6e4>] list_del+0x38/0x5c SS:ESP 0068:daa84e44
Apr 12 15:39:53 ger1 kernel: <0>Kernel panic - not syncing: Fatal exception
Apr 12 15:39:53 ger1 kernel: BUG: warning at arch/i386/kernel/smp.c:550/smp_call_function() (Not tainted)
Apr 12 15:39:53 ger1 kernel: [<c0415ae0>] stop_this_cpu+0x0/0x33
Apr 12 15:39:53 ger1 kernel: [<c04158cf>] smp_call_function+0x57/0xc3
Apr 12 15:39:53 ger1 kernel: [<c0424e9d>] printk+0x18/0x8e
Apr 12 15:39:53 ger1 kernel: [<c041594e>] smp_send_stop+0x13/0x1c
Apr 12 15:39:53 ger1 kernel: [<c0424437>] panic+0x4c/0x16d
Apr 12 15:39:53 ger1 kernel: [<c04064eb>] die+0x25d/0x291
Apr 12 15:39:53 ger1 kernel: [<c0406b85>] do_invalid_op+0x0/0x9d
Apr 12 15:39:53 ger1 kernel: [<c0406c16>] do_invalid_op+0x91/0x9d
Apr 12 15:39:53 ger1 kernel: [<c04ea6e4>] list_del+0x38/0x5c
Apr 12 15:39:53 ger1 kernel: [<c04248b2>] release_console_sem+0x1b0/0x1b8
Apr 12 15:39:53 ger1 kernel: [<c045a51b>] blockable_page_cache_readahead+0x46/0x99
Apr 12 15:39:53 ger1 kernel: [<c0405a89>] error_code+0x39/0x40
Apr 12 15:39:53 ger1 kernel: [<c04ea6e4>] list_del+0x38/0x5c
Apr 12 15:39:53 ger1 kernel: [<c0458d5f>] get_page_from_freelist+0x142/0x333
Apr 12 15:39:53 ger1 kernel: [<c0458fa7>] __alloc_pages+0x57/0x297
Apr 12 15:39:53 ger1 kernel: [<c046643c>] anon_vma_prepare+0x11/0xa5
Apr 12 15:39:53 ger1 kernel: [<c046111d>] __handle_mm_fault+0x4f6/0xb7b
Apr 12 15:39:53 ger1 kernel: [<c061083b>] do_page_fault+0x2d2/0x600
Apr 12 15:39:53 ger1 kernel: [<c0610569>] do_page_fault+0x0/0x600
Apr 12 15:39:53 ger1 kernel: [<c0405a89>] error_code+0x39/0x40
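One detail in the first line of the log can be decoded without any tools. The value 00200200 is LIST_POISON2, the marker the 32-bit kernel writes into the `prev` pointer of a list entry once it has been removed with list_del(). So the neighbouring entry on the page free list had already been deleted when the allocator reached it. That is consistent with a kernel bug or with memory corruption such as bad RAM, and does not by itself say which. Below is a minimal Python sketch of that check; the helper name `explain_list_del` and the regular expression are illustrative assumptions, not part of any kernel tooling.

```python
import re

# Poison values the 32-bit kernel stores in a list entry after list_del()
# (defined in include/linux/poison.h).
LIST_POISON1 = 0x00100100  # written to entry->next
LIST_POISON2 = 0x00200200  # written to entry->prev

# Hypothetical pattern matching the message printed by lib/list_debug.c.
CORRUPTION_RE = re.compile(
    r"list_del corruption\. (next->prev|prev->next) should be "
    r"([0-9a-f]+), but was ([0-9a-f]+)"
)


def explain_list_del(line):
    """Return a short explanation for a list_del corruption line, or None."""
    m = CORRUPTION_RE.search(line)
    if m is None:
        return None
    field = m.group(1)
    expected = int(m.group(2), 16)
    found = int(m.group(3), 16)
    if found == LIST_POISON2:
        note = "LIST_POISON2: the neighbouring entry was already removed from a list"
    elif found == LIST_POISON1:
        note = "LIST_POISON1: the neighbouring entry was already removed from a list"
    else:
        note = "not a poison value: the pointer was overwritten with something else"
    return "%s expected %#010x, found %#010x (%s)" % (field, expected, found, note)


line = ("Apr 12 15:39:53 ger1 kernel: list_del corruption. "
        "next->prev should be c23ebfb8, but was 00200200")
print(explain_list_del(line))
```

If later crashes show a different, random-looking value in place of 00200200, that would point more towards hardware than towards one specific kernel bug, but this is only a rule of thumb.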
If that does not work, try testing the RAM. Bad RAM can cause random kernel panics.
If you have no luck with yum update, try running "yum clean all" first, then yum update.
ASKER
Well, "yum update" showed that no packages need to be updated. As for the kernel update, running "yum update kernel" gives me this strange error: "Package(s) kernel available, but not installed."
In /boot/grub/menu.lst I can see the new kernel, but when I try to switch to it, the server won't boot.
The /boot/grub/menu.lst file contents:
timeout 5
default 1
title CentOS (2.6.18-164.15.1.el5PAE)
root (hd0,1)
kernel /vmlinuz-2.6.18-164.15.1.el5PAE ro root=/dev/sda3 vga=0x317
initrd /initrd-2.6.18-164.15.1.el5PAE.img
title CentOS Linux (2.6.18-128.7.1.el5PAE)
root (hd0,1)
kernel /boot/vmlinuz-2.6.18-128.7.1.el5PAE ro root=/dev/sda3 vga=0x317
initrd /boot/initrd-2.6.18-128.7.1.el5PAE.img
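For anyone reading the file above: GRUB legacy numbers the `title` entries from 0, so `default 1` selects the second entry. The following is a rough Python sketch that parses a menu.lst and marks the entry that would boot. It assumes the stray spaces in the pasted file names are paste artifacts; the function name `parse_menu_lst` and the embedded sample are illustrative only.

```python
import re

# Sample taken from the post above, with the file names joined back together.
MENU_LST = """\
timeout 5
default 1
title CentOS (2.6.18-164.15.1.el5PAE)
root (hd0,1)
kernel /vmlinuz-2.6.18-164.15.1.el5PAE ro root=/dev/sda3 vga=0x317
initrd /initrd-2.6.18-164.15.1.el5PAE.img
title CentOS Linux (2.6.18-128.7.1.el5PAE)
root (hd0,1)
kernel /boot/vmlinuz-2.6.18-128.7.1.el5PAE ro root=/dev/sda3 vga=0x317
initrd /boot/initrd-2.6.18-128.7.1.el5PAE.img
"""


def parse_menu_lst(text):
    """Parse a GRUB legacy menu.lst into (default_index, entries)."""
    default_index = 0  # entry 0 boots when no numeric default is given
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # A keyword is separated from its argument by whitespace or "=".
        m = re.match(r"(\w+)[\s=]+(.*)", line)
        if m is None:
            continue
        key, value = m.group(1), m.group(2)
        if key == "default" and value.isdigit():
            default_index = int(value)
        elif key == "title":
            entries.append({"title": value})
        elif key in ("root", "kernel", "initrd") and entries:
            entries[-1][key] = value
    return default_index, entries


default_index, entries = parse_menu_lst(MENU_LST)
for i, entry in enumerate(entries):
    marker = "*" if i == default_index else " "
    kernel_path = entry["kernel"].split()[0]
    print("%s %d: %s -> %s" % (marker, i, entry["title"], kernel_path))
```

With this file the marked entry is the old 2.6.18-128.7.1 kernel; booting the new one would need `default 0`. Note also that the two entries use different path prefixes (`/vmlinuz-...` versus `/boot/vmlinuz-...`). Which prefix is right depends on whether /boot is a separate partition, which the thread does not say, so treat that only as something worth checking.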
ASKER
OK, I've managed to update the kernel. Now I will wait and see whether the old kernel was the buggy one.
ASKER
OK, this is not the kernel's fault. What's the best way to check the RAM? Please note that I have only remote access to the server.
Sorry, I didn't finish my post. The link that I added shows you how to install memtest as a bootable option if you don't have the Linux rescue disk or if it's not already installed. I'll paste the link again in this post for future reference.
http://kbase.redhat.com/faq/docs/DOC-16424
The best way to test the RAM is to boot from a bootable utility such as memtest86, noted above. You could also run the manufacturer's memory diagnostic for whatever model of server you have. For example, Dell has its programs "mpmemory" and "Dell Diagnostics" for its servers. Anything that runs inside the OS will be less effective, because it cannot check the portion of RAM that the OS itself is using.
If you do not have physical access, you can get whoever is hosting the server to do it. Most data centers are familiar with this, and do it often (I work in a data center).
For Red Hat or Fedora Linux, run "yum update". This will update any installed packages on your system.
If that doesn't work, try upgrading your kernel. Below is a link to a FAQ sheet that will give you detailed instructions on how to do that.
http://fedoraproject.org/wiki/YumUpgradeFaq