[Last Call] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1262
  • Last Modified:

Server hangs.

Hello,

One of our servers hangs randomly once a week or two and needs hard reboot via KVM.

Last logs:

kernel: INFO: task httpd:4851 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Open in new window


2.6.18-274.3.1.el5 #1 SMP Tue Sep 6 20:13:52 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5.7 (Final)
Dell PowerEdge R410
Xeon(R) CPU    2x   E5520  @ 2.27GHz , 24 GB RAM
RAID10 4*300GB SAS 15000 rpm

Please help to resolve this issue.
0
tanel
Asked:
tanel
1 Solution
 
omarfaridCommented:
do you get any errors in log files?
did you try to do update of patches on the system?
0
 
tanelAuthor Commented:
No errors, just
kernel: INFO: task httpd:4851 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

These hangs occured before and after yum update when 5.7 (Final) came out. But the kernel is still the same, should we try to replace it ?
0
 
tanelAuthor Commented:
recent log when server froze:

Oct 24 11:46:50 v1 kernel: INFO: task irqbalance:10395 blocked for more than 300 seconds.
Oct 24 11:46:50 v1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 11:46:50 v1 kernel: irqbalance    D ffff81032cc90260     0 10395      1         10426  9215 (NOTLB)
Oct 24 11:46:50 v1 kernel:  ffff81032106dd78 0000000000000086 3738313633303839 000000000000006c
Oct 24 11:46:50 v1 kernel:  ffff81032cc90260 ffff8100418d3660 00018a790d47804d 0002762c03a4248c
Oct 24 11:46:50 v1 kernel:  ffff81032cc90468 000000002711fa28 ffff8102ea36f680 ffff81031187e000
Oct 24 11:46:50 v1 kernel: Call Trace:
Oct 24 11:46:50 v1 kernel:  [<ffffffff8006520d>] __mutex_lock_slowpath+0x60/0x9b
Oct 24 11:46:50 v1 kernel:  [<ffffffff80231e8f>] dev_name_hash+0x1e/0x64
Oct 24 11:46:50 v1 kernel:  [<ffffffff80065257>] .text.lock.mutex+0xf/0x14
Oct 24 11:46:50 v1 kernel:  [<ffffffff802322c9>] dev_load+0x18/0x46
Oct 24 11:46:50 v1 kernel:  [<ffffffff802329e1>] dev_ioctl+0x317/0x497
Oct 24 11:46:50 v1 kernel:  [<ffffffff80060e39>] error_exit+0x0/0x84
Oct 24 11:46:50 v1 kernel:  [<ffffffff80227730>] sock_ioctl+0x1d4/0x1e5
Oct 24 11:46:50 v1 kernel:  [<ffffffff80043f27>] do_ioctl+0x21/0x6b
Oct 24 11:46:50 v1 kernel:  [<ffffffff80031550>] vfs_ioctl+0x457/0x4b9
Oct 24 11:46:50 v1 kernel:  [<ffffffff800c3842>] audit_syscall_entry+0x1a8/0x1d3
Oct 24 11:46:50 v1 kernel:  [<ffffffff8004ec0e>] sys_ioctl+0x3c/0x5c
Oct 24 11:46:50 v1 kernel:  [<ffffffff800602dd>] tracesys+0xd5/0xe0
Oct 24 11:46:50 v1 kernel:
Oct 24 11:46:50 v1 kernel: INFO: task ntpd:17253 blocked for more than 300 seconds.
Oct 24 11:46:50 v1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 11:46:50 v1 kernel: ntpd          D ffff810316328960     0 17253  16562         18452 17104 (NOTLB)
Oct 24 11:46:50 v1 kernel:  ffff8102f26b1cb8 0000000000200086 0000000000000000 0000000000000000
Oct 24 11:46:51 v1 kernel:  ffff810316328960 ffff81032f3f2220 00018a7b5d8c2b7c 0002762fb5cac0a4
Oct 24 11:46:51 v1 kernel:  ffff810316328b68 00000007173f2e40 0000000000000000 ffff81032f3ee000
Oct 24 11:46:51 v1 kernel: Call Trace:
Oct 24 11:46:51 v1 kernel:  [<ffffffff8006520d>] __mutex_lock_slowpath+0x60/0x9b
Oct 24 11:46:51 v1 kernel:  [<ffffffff80065257>] .text.lock.mutex+0xf/0x14
Oct 24 11:46:51 v1 kernel:  [<ffffffff8001abaf>] vsnprintf+0x5df/0x627
Oct 24 11:46:51 v1 kernel:  [<ffffffff802326ee>] dev_ioctl+0x24/0x497
Oct 24 11:46:51 v1 kernel:  [<ffffffff80227730>] sock_ioctl+0x1d4/0x1e5
Oct 24 11:46:51 v1 kernel:  [<ffffffff80043f27>] do_ioctl+0x21/0x6b
Oct 24 11:46:51 v1 kernel:  [<ffffffff80031550>] vfs_ioctl+0x457/0x4b9
Oct 24 11:46:51 v1 kernel:  [<ffffffff80109026>] inotify_d_instantiate+0x12/0x3c
Oct 24 11:46:51 v1 kernel:  [<ffffffff8004ec0e>] sys_ioctl+0x3c/0x5c
Oct 24 11:46:51 v1 kernel:  [<ffffffff8010f267>] dev_ifconf+0xe5/0x1ab
Oct 24 11:46:51 v1 kernel:  [<ffffffff8010d20a>] compat_sys_ioctl+0x24e/0x292
Oct 24 11:46:51 v1 kernel:  [<ffffffff80062766>] ia32_sysret+0x0/0x5
Oct 24 11:46:51 v1 kernel:
Oct 24 11:46:51 v1 kernel: INFO: task nmbd:18732 blocked for more than 300 seconds.
Oct 24 11:46:51 v1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 11:46:51 v1 kernel: nmbd          D ffff810315fccae0     0 18732  18550        938512 18727 (NOTLB)
Oct 24 11:46:51 v1 kernel:  ffff8102eb0edb58 0000000000000082 ffff8102eb0eddd8 ffff8102eb0edde0
Oct 24 11:46:51 v1 kernel:  ffff810315fccae0 ffff81032f39a1e0 00018a797e82e2b6 0002762cb8879978
Oct 24 11:46:51 v1 kernel:  ffff810315fccce8 0000000600000000 0000000000000000 ffff81032f396000
Oct 24 11:46:51 v1 kernel: Call Trace:
Oct 24 11:46:51 v1 kernel:  [<ffffffff8006520d>] __mutex_lock_slowpath+0x60/0x9b
Oct 24 11:46:51 v1 kernel:  [<ffffffff80065257>] .text.lock.mutex+0xf/0x14
Oct 24 11:46:51 v1 kernel:  [<ffffffff800ab824>] ub_sock_getwres_other+0x94/0xc3
Oct 24 11:46:51 v1 kernel:  [<ffffffff80239afd>] rtnetlink_rcv+0x15/0x3b
Oct 24 11:46:51 v1 kernel:  [<ffffffff80248109>] netlink_data_ready+0x12/0x50
Oct 24 11:46:51 v1 kernel:  [<ffffffff80248c18>] netlink_sendmsg+0x4a4/0x4d8
Oct 24 11:46:51 v1 kernel:  [<ffffffff8008b636>] default_wake_function+0x0/0xe
Oct 24 11:46:51 v1 kernel:  [<ffffffff80057ce7>] sock_sendmsg+0xe5/0x137
Oct 24 11:46:51 v1 kernel:  [<ffffffff800a3453>] autoremove_wake_function+0x0/0x2e
Oct 24 11:46:51 v1 kernel:  [<ffffffff8001abaf>] vsnprintf+0x5df/0x627
Oct 24 11:46:51 v1 kernel:  [<ffffffff800f890d>] core_sys_select+0x21d/0x27f
Oct 24 11:46:51 v1 kernel:  [<ffffffff8002f0c8>] __wake_up+0x38/0x4f
Oct 24 11:46:51 v1 kernel:  [<ffffffff80228a4f>] sys_sendto+0x131/0x164
Oct 24 11:46:51 v1 kernel:  [<ffffffff802281a9>] move_addr_to_user+0x5d/0x78
Oct 24 11:46:51 v1 kernel:  [<ffffffff802286b4>] sys_getsockname+0x72/0xa2
Oct 24 11:46:51 v1 kernel:  [<ffffffff800602dd>] tracesys+0xd5/0xe0
Oct 24 11:46:51 v1 kernel:
Oct 24 11:46:51 v1 kernel: INFO: task nmbd:26759 blocked for more than 300 seconds.
Oct 24 11:46:51 v1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 11:46:51 v1 kernel: nmbd          D ffff8102b38d65e0     0 26759  25068         26762 26750 (NOTLB)
Oct 24 11:46:51 v1 kernel:  ffff810296d7db58 0000000000000082 ffff8102b38d67e8 0000000680030d42
Oct 24 11:46:51 v1 kernel:  ffff8102b38d65e0 ffff81032f30c160 00018a73ded436d0 00027623bca1af74
Oct 24 11:46:51 v1 kernel:  ffff8102b38d67e8 000000048003f7bc 0000000000000000 ffff81032f308000
Oct 24 11:46:51 v1 kernel: Call Trace:
Oct 24 11:46:51 v1 kernel:  [<ffffffff8004d560>] try_to_del_timer_sync+0x7f/0x88
Oct 24 11:46:51 v1 kernel:  [<ffffffff8006520d>] __mutex_lock_slowpath+0x60/0x9b
Oct 24 11:46:51 v1 kernel:  [<ffffffff80065257>] .text.lock.mutex+0xf/0x14
Oct 24 11:46:51 v1 kernel:  [<ffffffff800ab824>] ub_sock_getwres_other+0x94/0xc3
Oct 24 11:46:51 v1 kernel:  [<ffffffff80239afd>] rtnetlink_rcv+0x15/0x3b
Oct 24 11:46:51 v1 kernel:  [<ffffffff80248109>] netlink_data_ready+0x12/0x50
Oct 24 11:46:51 v1 kernel:  [<ffffffff80248c18>] netlink_sendmsg+0x4a4/0x4d8
Oct 24 11:46:51 v1 kernel:  [<ffffffff8008b636>] default_wake_function+0x0/0xe
Oct 24 11:46:51 v1 kernel:  [<ffffffff80057ce7>] sock_sendmsg+0xe5/0x137
Oct 24 11:46:51 v1 kernel:  [<ffffffff8005ab1a>] __ip_route_output_key+0x17a/0x97f
Oct 24 11:46:51 v1 kernel:  [<ffffffff800a3453>] autoremove_wake_function+0x0/0x2e
Oct 24 11:46:51 v1 kernel:  [<ffffffff8001abaf>] vsnprintf+0x5df/0x627
Oct 24 11:46:51 v1 kernel:  [<ffffffff8002f0c8>] __wake_up+0x38/0x4f
Oct 24 11:46:51 v1 kernel:  [<ffffffff80228a4f>] sys_sendto+0x131/0x164
Oct 24 11:46:51 v1 kernel:  [<ffffffff802281a9>] move_addr_to_user+0x5d/0x78
Oct 24 11:46:51 v1 kernel:  [<ffffffff802286b4>] sys_getsockname+0x72/0xa2
Oct 24 11:46:51 v1 kernel:  [<ffffffff800602dd>] tracesys+0xd5/0xe0
Oct 24 11:46:51 v1 kernel:
Oct 24 11:46:52 v1 kernel: INFO: task dsm_om_connsvcd:43862 blocked for more than 300 seconds.
Oct 24 11:46:52 v1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 11:46:52 v1 kernel: dsm_om_connsv D ffff81022509cb60     0 43862  43703         43918 43861 (NOTLB)
Oct 24 11:46:52 v1 kernel:  ffff810220ba3b58 0000000000000082 ffff810324cad440 ffff810058bc08c8
Oct 24 11:46:52 v1 kernel:  ffff81022509cb60 ffff81032f2460a0 00018a82fbe5f51c 0002763be17c22b6
Oct 24 11:46:52 v1 kernel:  ffff81022509cd68 0000000180089baa 0000000000000000 ffff81032f242000
Oct 24 11:46:52 v1 kernel: Call Trace:
Oct 24 11:46:52 v1 kernel:  [<ffffffff8008ae62>] find_busiest_group+0x177/0x638
Oct 24 11:46:52 v1 kernel:  [<ffffffff8006520d>] __mutex_lock_slowpath+0x60/0x9b
Oct 24 11:46:52 v1 kernel:  [<ffffffff80065257>] .text.lock.mutex+0xf/0x14
Oct 24 11:46:52 v1 kernel:  [<ffffffff800ab824>] ub_sock_getwres_other+0x94/0xc3
Oct 24 11:46:52 v1 kernel:  [<ffffffff80239afd>] rtnetlink_rcv+0x15/0x3b
Oct 24 11:46:52 v1 kernel:  [<ffffffff80248109>] netlink_data_ready+0x12/0x50
Oct 24 11:46:52 v1 kernel:  [<ffffffff80248c18>] netlink_sendmsg+0x4a4/0x4d8
Oct 24 11:46:52 v1 kernel:  [<ffffffff8008b636>] default_wake_function+0x0/0xe
Oct 24 11:46:52 v1 kernel:  [<ffffffff80057ce7>] sock_sendmsg+0xe5/0x137
Oct 24 11:46:52 v1 kernel:  [<ffffffff800a3453>] autoremove_wake_function+0x0/0x2e
Oct 24 11:46:52 v1 kernel:  [<ffffffff8001abaf>] vsnprintf+0x5df/0x627
Oct 24 11:46:52 v1 kernel:  [<ffffffff8002f0c8>] __wake_up+0x38/0x4f
Oct 24 11:46:52 v1 kernel:  [<ffffffff80228a4f>] sys_sendto+0x131/0x164
Oct 24 11:46:52 v1 kernel:  [<ffffffff802281a9>] move_addr_to_user+0x5d/0x78
Oct 24 11:46:52 v1 kernel:  [<ffffffff802286ce>] sys_getsockname+0x8c/0xa2
Oct 24 11:46:52 v1 kernel:  [<ffffffff800602dd>] tracesys+0xd5/0xe0

Open in new window


Any ideas ? Server is running SolusVM and OpenVZ virtualization software.

0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
jlevieCommented:
Does the box have the latest bios on it? The error messages look suspiciously like something is hanging the bus.
0
 
tanelAuthor Commented:
Dell openmanage reports, that all firmware is up to date:

OK - System: 'PowerEdge 2950', SN: 'JL3703J', hardware working fine, 1 logical drives, 6 physical drives

And the server was working fine before, mby some vps is overloading it, but I can't get the "top" picture when it hangs. I have now ssh session active and keeping a look on it during the day.
0
 
tanelAuthor Commented:
Also i would like to mention, that before the hang server's CPU load gets CRITICAL - load average: 121.90 with more than 1200 processes. Usually there are running 600 - 750 processes.

0
 
joolsCommented:
Use atop to log the system processes before it fails then run atop and replay the logs.

Is there anything in the logs before the messages you posted earlier?
0
 
tanelAuthor Commented:
Yes, there is some kernel fault:

Oct 24 11:45:02 v1 kernel: Fatal resource shortage: privvmpages, UB 137.

Oct 24 11:36:25 v1 xinetd[10793]: EXIT: nrpe status=0 pid=35539 duration=0(sec)
Oct 24 11:36:36 v1 xinetd[10793]: START: nrpe pid=37588 from=x.x.x.x
Oct 24 11:36:36 v1 xinetd[10793]: EXIT: nrpe status=0 pid=37588 duration=0(sec)
Oct 24 11:36:47 v1 xinetd[10793]: START: nrpe pid=39737 from=x.x.x.x
Oct 24 11:36:47 v1 xinetd[10793]: EXIT: nrpe status=0 pid=39737 duration=0(sec)
Oct 24 11:37:11 v1 xinetd[10793]: START: nrpe pid=43764 from=x.x.x.x
Oct 24 11:37:11 v1 xinetd[10793]: EXIT: nrpe status=0 pid=43764 duration=0(sec)
Oct 24 11:37:40 v1 xinetd[10793]: START: nrpe pid=45771 from=x.x.x.x
Oct 24 11:37:40 v1 xinetd[10793]: EXIT: nrpe status=0 pid=45771 duration=0(sec)
Oct 24 11:37:47 v1 xinetd[10793]: START: nrpe pid=47184 from=x.x.x.x
Oct 24 11:37:53 v1 xinetd[10793]: EXIT: nrpe status=0 pid=47184 duration=6(sec)
Oct 24 11:38:53 v1 xinetd[10793]: START: nrpe pid=56242 from=x.x.x.x
Oct 24 11:38:53 v1 xinetd[10793]: EXIT: nrpe status=0 pid=56242 duration=0(sec)
Oct 24 11:38:57 v1 xinetd[10793]: START: nrpe pid=58223 from=x.x.x.x
Oct 24 11:38:58 v1 xinetd[10793]: EXIT: nrpe status=0 pid=58223 duration=1(sec)
Oct 24 11:41:25 v1 xinetd[10793]: START: nrpe pid=72152 from=x.x.x.x
Oct 24 11:41:25 v1 xinetd[10793]: START: nrpe pid=72153 from=x.x.x.x
Oct 24 11:41:25 v1 xinetd[10793]: EXIT: nrpe status=0 pid=72152 duration=0(sec)
Oct 24 11:41:25 v1 xinetd[10793]: EXIT: nrpe status=0 pid=72153 duration=0(sec)
Oct 24 11:41:36 v1 xinetd[10793]: START: nrpe pid=72223 from=x.x.x.x
Oct 24 11:41:36 v1 xinetd[10793]: EXIT: nrpe status=0 pid=72223 duration=0(sec)
Oct 24 11:41:47 v1 xinetd[10793]: START: nrpe pid=73887 from=x.x.x.x
Oct 24 11:41:47 v1 xinetd[10793]: EXIT: nrpe status=0 pid=73887 duration=0(sec)
Oct 24 11:42:11 v1 xinetd[10793]: START: nrpe pid=76767 from=x.x.x.x
Oct 24 11:42:11 v1 xinetd[10793]: EXIT: nrpe status=0 pid=76767 duration=0(sec)
Oct 24 11:42:40 v1 xinetd[10793]: START: nrpe pid=80103 from=x.x.x.x
Oct 24 11:42:47 v1 xinetd[10793]: START: nrpe pid=80573 from=x.x.x.x
Oct 24 11:42:50 v1 xinetd[10793]: EXIT: nrpe status=0 pid=80103 duration=10(sec)
Oct 24 11:43:47 v1 xinetd[10793]: EXIT: nrpe status=0 pid=80573 duration=60(sec)
Oct 24 11:43:53 v1 xinetd[10793]: START: nrpe pid=94766 from=x.x.x.x
Oct 24 11:43:53 v1 xinetd[10793]: EXIT: nrpe status=0 pid=94766 duration=0(sec)
Oct 24 11:43:57 v1 xinetd[10793]: START: nrpe pid=96484 from=x.x.x.x
Oct 24 11:43:57 v1 xinetd[10793]: EXIT: nrpe status=0 pid=96484 duration=0(sec)
Oct 24 11:44:40 v1 xinetd[10793]: START: nrpe pid=105460 from=x.x.x.x
Oct 24 11:44:47 v1 xinetd[10793]: START: nrpe pid=109139 from=x.x.x.x
Oct 24 11:44:50 v1 xinetd[10793]: EXIT: nrpe status=0 pid=105460 duration=10(sec)
Oct 24 11:45:02 v1 kernel: Fatal resource shortage: privvmpages, UB 137.
Oct 24 11:45:47 v1 xinetd[10793]: EXIT: nrpe status=0 pid=109139 duration=60(sec)
Oct 24 11:45:53 v1 xinetd[10793]: START: nrpe pid=118811 from=x.x.x.x
Oct 24 11:45:53 v1 xinetd[10793]: EXIT: nrpe status=0 pid=118811 duration=0(sec)
Oct 24 11:46:25 v1 xinetd[10793]: START: nrpe pid=120903 from=x.x.x.x
Oct 24 11:46:25 v1 xinetd[10793]: START: nrpe pid=120904 from=x.x.x.x
Oct 24 11:46:25 v1 xinetd[10793]: EXIT: nrpe status=0 pid=120903 duration=0(sec)
Oct 24 11:46:25 v1 xinetd[10793]: EXIT: nrpe status=0 pid=120904 duration=0(sec)
Oct 24 11:46:36 v1 xinetd[10793]: START: nrpe pid=121335 from=x.x.x.x
Oct 24 11:46:36 v1 xinetd[10793]: EXIT: nrpe status=0 pid=121335 duration=0(sec)
Oct 24 11:46:40 v1 xinetd[10793]: START: nrpe pid=121739 from=x.x.x.x
Oct 24 11:46:47 v1 xinetd[10793]: START: nrpe pid=121784 from=x.x.x.x
Oct 24 11:46:47 v1 xinetd[10793]: START: nrpe pid=121785 from=x.x.x.x
Oct 24 11:46:47 v1 xinetd[10793]: EXIT: nrpe status=0 pid=121784 duration=0(sec)
Oct 24 11:46:50 v1 kernel: INFO: task events/0:18 blocked for more than 300 seconds.
Oct 24 11:46:50 v1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 24 11:46:50 v1 kernel: events/0      D ffff81032ebff360     0    18      1            19    17 (L-TLB)
Oct 24 11:46:50 v1 kernel:  ffff81032ee01dc0 0000000000000046 0000000000000000 0000000100000002
Oct 24 11:46:50 v1 kernel:  ffff81032ebff360 ffff81032f30c160 00018a72fa95cb4c 000276225002dd32
Oct 24 11:46:50 v1 kernel:  ffff81032ebff568 0000000480499d00 0000000000000000 ffff81032f308000
Oct 24 11:46:50 v1 kernel: Call Trace:

Open in new window

0
 
tanelAuthor Commented:
Use atop to log the system processes before it fails then run atop and replay the logs.

How to do that ? By running cronjob (every 2 min e.g) with making output of atop command to the log file ?
0
 
joolsCommented:
looks like it might be a kernel setting as described here;
http://forum.openvz.org/index.php?t=msg&goto=8946&#msg_8946

It might be useful to update the kernel as well if you can.
0
 
joolsCommented:
in answer to your other point, atop is run as a package under chkconfig.

as in/
   yum install atop
   chkconfig atop on
   service atop start

You can replay collected data but remember you have to collect the data first, the default is to collect every ten minutes. Leave it running for a couple of days and have a play about with the options, it's all in the man pages.
0

Featured Post

Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now