zlygis

asked on

Need someone to "decode" a kernel crash.

Hi,

Every few days my server crashes. Until now I could not find anything in the log files, and the daily server load is very low. Only today, after the most recent crash, was I able to find some information in /var/log/messages (see the attached code snippet). I would like the Experts Exchange experts to look into it and see whether there is any way to determine the culprit of my troubles, or at least where to start looking.

Thank you in advance.

P.S. Sorry for my English.


Apr 12 15:39:53 ger1 kernel: list_del corruption. next->prev should be c23ebfb8, but was 00200200
Apr 12 15:39:53 ger1 kernel: ------------[ cut here ]------------
Apr 12 15:39:53 ger1 kernel: kernel BUG at lib/list_debug.c:70!
Apr 12 15:39:53 ger1 kernel: invalid opcode: 0000 [#1]
Apr 12 15:39:53 ger1 kernel: SMP
Apr 12 15:39:53 ger1 kernel: last sysfs file: /block/ram0/range
Apr 12 15:39:53 ger1 kernel: Modules linked in: ipt_owner ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables ipv6 xfrm_na$
Apr 12 15:39:53 ger1 kernel: CPU:    0
Apr 12 15:39:53 ger1 kernel: EIP:    0060:[<c04ea6e4>]    Not tainted VLI
Apr 12 15:39:53 ger1 kernel: EFLAGS: 00010046   (2.6.18-128.7.1.el5PAE #1)
Apr 12 15:39:53 ger1 kernel: EIP is at list_del+0x38/0x5c
Apr 12 15:39:53 ger1 kernel: eax: 00000048   ebx: c23ebfb8   ecx: 00000092   edx: 00000000
Apr 12 15:39:53 ger1 kernel: esi: 00000256   edi: c0684780   ebp: c23ebfa0   esp: daa84e44
Apr 12 15:39:53 ger1 kernel: ds: 007b   es: 007b   ss: 0068
Apr 12 15:39:53 ger1 kernel: Process php5 (pid: 31552, ti=daa84000 task=c3531550 task.ti=daa84000)
Apr 12 15:39:53 ger1 kernel: Stack: c063e008 c23ebfb8 00200200 c0684800 c0458d5f fffb4000 00000003 00000000
Apr 12 15:39:53 ger1 kernel:        000280d2 c0685a28 00000000 00000001 00000000 00000001 00000000 c0685a28
Apr 12 15:39:53 ger1 kernel:        000280d2 c0685a28 c3531550 c0458fa7 00000044 00000000 000280d2 00000010
Apr 12 15:39:53 ger1 kernel: Call Trace:
Apr 12 15:39:53 ger1 kernel:  [<c0458d5f>] get_page_from_freelist+0x142/0x333
Apr 12 15:39:53 ger1 kernel:  [<c0458fa7>] __alloc_pages+0x57/0x297
Apr 12 15:39:53 ger1 kernel:  [<c046643c>] anon_vma_prepare+0x11/0xa5
Apr 12 15:39:53 ger1 kernel:  [<c046111d>] __handle_mm_fault+0x4f6/0xb7b
Apr 12 15:39:53 ger1 kernel:  [<c061083b>] do_page_fault+0x2d2/0x600
Apr 12 15:39:53 ger1 kernel:  [<c0610569>] do_page_fault+0x0/0x600
Apr 12 15:39:53 ger1 kernel:  [<c0405a89>] error_code+0x39/0x40
Apr 12 15:39:53 ger1 kernel:  =======================
Apr 12 15:39:53 ger1 kernel: Code: 53 68 ba df 63 c0 e8 c1 a7 f3 ff 0f 0b 41 00 f7 df 63 c0 83 c4 0c 8b 03 8b 40 04 39 d8 74 $
Apr 12 15:39:53 ger1 kernel: EIP: [<c04ea6e4>] list_del+0x38/0x5c SS:ESP 0068:daa84e44
Apr 12 15:39:53 ger1 kernel:  <0>Kernel panic - not syncing: Fatal exception
Apr 12 15:39:53 ger1 kernel:  BUG: warning at arch/i386/kernel/smp.c:550/smp_call_function() (Not tainted)
Apr 12 15:39:53 ger1 kernel:  [<c0415ae0>] stop_this_cpu+0x0/0x33
Apr 12 15:39:53 ger1 kernel:  [<c04158cf>] smp_call_function+0x57/0xc3
Apr 12 15:39:53 ger1 kernel:  [<c0424e9d>] printk+0x18/0x8e
Apr 12 15:39:53 ger1 kernel:  [<c041594e>] smp_send_stop+0x13/0x1c
Apr 12 15:39:53 ger1 kernel:  [<c0424437>] panic+0x4c/0x16d
Apr 12 15:39:53 ger1 kernel:  [<c04064eb>] die+0x25d/0x291
Apr 12 15:39:53 ger1 kernel:  [<c0406b85>] do_invalid_op+0x0/0x9d
Apr 12 15:39:53 ger1 kernel:  [<c0406c16>] do_invalid_op+0x91/0x9d
Apr 12 15:39:53 ger1 kernel:  [<c04ea6e4>] list_del+0x38/0x5c
Apr 12 15:39:53 ger1 kernel:  [<c04248b2>] release_console_sem+0x1b0/0x1b8
Apr 12 15:39:53 ger1 kernel:  [<c045a51b>] blockable_page_cache_readahead+0x46/0x99
Apr 12 15:39:53 ger1 kernel:  [<c0405a89>] error_code+0x39/0x40
Apr 12 15:39:53 ger1 kernel:  [<c04ea6e4>] list_del+0x38/0x5c
Apr 12 15:39:53 ger1 kernel:  [<c0458d5f>] get_page_from_freelist+0x142/0x333
Apr 12 15:39:53 ger1 kernel:  [<c0458fa7>] __alloc_pages+0x57/0x297
Apr 12 15:39:53 ger1 kernel:  [<c046643c>] anon_vma_prepare+0x11/0xa5
Apr 12 15:39:53 ger1 kernel:  [<c046111d>] __handle_mm_fault+0x4f6/0xb7b
Apr 12 15:39:53 ger1 kernel:  [<c061083b>] do_page_fault+0x2d2/0x600
Apr 12 15:39:53 ger1 kernel:  [<c0610569>] do_page_fault+0x0/0x600
Apr 12 15:39:53 ger1 kernel:  [<c0405a89>] error_code+0x39/0x40
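
As a starting point for reading a dump like the one above, here is a minimal Python sketch that pulls out the headline facts: where the BUG fired, which process was running, and the first call trace. The function name `summarize_oops` and the regular expressions are illustrative assumptions, not an existing tool. The one piece of outside knowledge it relies on is that 2.6-era kernels write the poison values 0x00100100 and 0x00200200 into a list entry once `list_del` has removed it, so the "but was 00200200" in the first line suggests the page allocator found a free-list neighbour that looked already removed. That is consistent with memory corruption (bad RAM or a misbehaving driver) but does not say which.

```python
import re

# Values that list_del() in 2.6-era kernels stores in a removed entry
# (include/linux/poison.h, 32-bit). Assumed to apply to the kernel in the log.
LIST_POISON = {
    0x00100100: "LIST_POISON1 (next pointer of an already-removed entry)",
    0x00200200: "LIST_POISON2 (prev pointer of an already-removed entry)",
}

SYSLOG_PREFIX = re.compile(r"^\w{3}\s+\d+ [\d:]{8} \S+ kernel: ?")
FRAME = re.compile(r"\[<[0-9a-f]+>\] (\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(text):
    """Collect the headline facts from a syslog-formatted kernel oops."""
    summary = {"bug_at": None, "process": None, "poison": None, "trace": []}
    in_trace = False
    for raw in text.splitlines():
        line = SYSLOG_PREFIX.sub("", raw.strip()).strip()
        bug = re.search(r"kernel BUG at (\S+?)!", line)
        proc = re.search(r"Process (\S+) \(pid: (\d+)", line)
        bad = re.search(r"but was ([0-9a-f]{8})", line)
        if bug:
            summary["bug_at"] = bug.group(1)
        elif proc:
            summary["process"] = (proc.group(1), int(proc.group(2)))
        elif bad:
            summary["poison"] = LIST_POISON.get(int(bad.group(1), 16))
        elif line == "Call Trace:":
            in_trace = True
        elif line.startswith("====="):
            in_trace = False  # divider that closes the first trace
        elif in_trace:
            frame = FRAME.search(line)
            if frame:
                summary["trace"].append(frame.group(1))
    return summary


# A few lines copied from the log in this thread.
SAMPLE = """\
Apr 12 15:39:53 ger1 kernel: list_del corruption. next->prev should be c23ebfb8, but was 00200200
Apr 12 15:39:53 ger1 kernel: kernel BUG at lib/list_debug.c:70!
Apr 12 15:39:53 ger1 kernel: Process php5 (pid: 31552, ti=daa84000 task=c3531550 task.ti=daa84000)
Apr 12 15:39:53 ger1 kernel: Call Trace:
Apr 12 15:39:53 ger1 kernel:  [<c0458d5f>] get_page_from_freelist+0x142/0x333
Apr 12 15:39:53 ger1 kernel:  [<c0458fa7>] __alloc_pages+0x57/0x297
Apr 12 15:39:53 ger1 kernel:  =======================
"""

summary = summarize_oops(SAMPLE)
for key, value in summary.items():
    print(key, "->", value)
```

Only the frames between "Call Trace:" and the "=====" divider are collected; the frames printed afterwards belong to the panic handler itself and mostly repeat the first trace. Note that php5 is simply the process that happened to request a page when the corruption was detected, so it is not necessarily the cause.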

bman21

Try updating your system.

For Red Hat or Fedora Linux, run "yum update". This will update all installed packages on your system.

If that doesn't work, try upgrading your kernel. Below is a link to an FAQ that gives detailed instructions on how to do that.

http://fedoraproject.org/wiki/YumUpgradeFaq

If that does not work, try testing the RAM. Bad RAM can cause random kernel panics.
If you have no luck with "yum update", try running "yum clean all" first, then "yum update" again.

zlygis

ASKER

Well, "yum update" reported that no packages need to be updated. As for the kernel update, when I run "yum update kernel" I get this strange error: "Package(s) kernel available, but not installed."
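
One possible explanation for that yum message is a package-name mismatch: the crash log shows the running kernel release ends in "PAE", and on RHEL/CentOS 5 the kernel variants are, as far as I know, packaged under separate names such as kernel-PAE, so a plain "kernel" package would indeed be available but not installed. The sketch below only illustrates that mapping; `kernel_package_for` is a hypothetical helper and the variant list is an assumption about el5 naming, not something confirmed in this thread.

```python
def kernel_package_for(release):
    """Guess the RPM package name from a `uname -r` style release string."""
    # Assumed el5 variants; a release without a suffix maps to plain "kernel".
    for variant in ("PAE", "xen", "debug"):
        if release.endswith(variant):
            return "kernel-" + variant
    return "kernel"


# Release string taken from the EFLAGS line of the crash log.
package = kernel_package_for("2.6.18-128.7.1.el5PAE")
print("yum update", package)
```

On a live system the release string could come from `platform.release()` in the standard library instead of being typed in by hand.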

In /boot/grub/menu.lst I can see the new kernel, but when I try to switch to it, the server won't boot.

The /boot/grub/menu.lst file contents:

timeout 5
default 1

title CentOS (2.6.18-164.15.1.el5PAE)
root (hd0,1)
kernel /vmlinuz-2.6.18-164.15.1.el5PAE ro root=/dev/sda3 vga=0x317
initrd /initrd-2.6.18-164.15.1.el5PAE.img

title CentOS Linux (2.6.18-128.7.1.el5PAE)
root (hd0,1)
kernel /boot/vmlinuz-2.6.18-128.7.1.el5PAE ro root=/dev/sda3 vga=0x317
initrd /boot/initrd-2.6.18-128.7.1.el5PAE.img
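
Two details of the file above may be worth checking, and the Python sketch below makes them visible. First, GRUB legacy numbers its entries from zero, so "default 1" selects the second stanza, which is the old 2.6.18-128.7.1 kernel. Second, the two stanzas use different path prefixes for the kernel and initrd ("/vmlinuz..." versus "/boot/vmlinuz...") on the same "root (hd0,1)". Whether that difference matters depends on the partition layout, which the thread does not show, so treat it as something to verify rather than a diagnosis. The parser is a hypothetical helper that handles only the whitespace-separated form used in the posted file.

```python
MENU_LST = """\
timeout 5
default 1

title CentOS (2.6.18-164.15.1.el5PAE)
root (hd0,1)
kernel /vmlinuz-2.6.18-164.15.1.el5PAE ro root=/dev/sda3 vga=0x317
initrd /initrd-2.6.18-164.15.1.el5PAE.img

title CentOS Linux (2.6.18-128.7.1.el5PAE)
root (hd0,1)
kernel /boot/vmlinuz-2.6.18-128.7.1.el5PAE ro root=/dev/sda3 vga=0x317
initrd /boot/initrd-2.6.18-128.7.1.el5PAE.img
"""


def parse_menu_lst(text):
    """Return (default_index, entries) for a GRUB legacy menu.lst."""
    default = 0
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2 or parts[0].startswith("#"):
            continue  # blank line, comment, or keyword without a value
        key, value = parts[0], parts[1].strip()
        if key == "default" and value.isdigit():
            default = int(value)
        elif key == "title":
            entries.append({"title": value})
        elif entries and key in ("root", "kernel", "initrd"):
            entries[-1][key] = value
    return default, entries


default, entries = parse_menu_lst(MENU_LST)
for index, entry in enumerate(entries):
    marker = "*" if index == default else " "  # "*" marks the default entry
    kernel_path = (entry.get("kernel") or "?").split()[0]
    print(marker, index, entry["title"], kernel_path)
```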

zlygis

ASKER

OK. I've managed to update the kernel. Now I will wait and see whether the old kernel was the buggy one.

zlygis

ASKER

OK, this is not the kernel's fault. What's the best way to check the RAM? Please note that I have only remote access to the server.
ASKER CERTIFIED SOLUTION
bman21

Sorry, I didn't finish my post. The link that I added shows you how to install memtest as a bootable option if you don't have the Linux rescue disk or if it's not already installed. I'll paste the link again in this post for future reference.

http://kbase.redhat.com/faq/docs/DOC-16424
The best way to test the RAM is to boot from a standalone utility such as memtest86, noted above. You could also run the manufacturer's memory diagnostic for whatever model of server you have. For example, Dell has "mpmemory" and "Dell Diagnostics" for its servers. Anything that runs inside the OS will be less effective, because it cannot check the portion of RAM being used by the OS itself.

If you do not have physical access, you can get whoever is hosting the server to do it.  Most data centers are familiar with this, and do it often (I work in a data center).
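
To illustrate the point about in-OS testing, here is a minimal Python sketch of the write-a-pattern-and-read-it-back idea that memory testers are built on. It is not a substitute for memtest86: the function name `pattern_test` and the pattern set are illustrative assumptions, the process only ever touches the memory the OS chooses to hand it, and CPU caches may satisfy the read-back without the data ever reaching the suspect cells.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def pattern_test(size_bytes, patterns=PATTERNS):
    """Fill a buffer with each pattern and verify it reads back intact.

    Returns a list of (offset, expected, actual) tuples, one per mismatch.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected  # write the pattern over the whole buffer
        if buf != expected:  # fast whole-buffer check before scanning
            for offset, actual in enumerate(buf):
                if actual != pattern:
                    errors.append((offset, pattern, actual))
    return errors


errors = pattern_test(16 * 1024 * 1024)  # 16 MiB, chosen arbitrarily
print("mismatches:", len(errors))
```

A clean run here proves very little, which is exactly why a boot-time tester that owns the whole address range is the better option.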