oldcar53 asked:

Damage from UPS failure?

The UPS associated with our server (at our hosting provider) developed a problem and failed, during an electrical storm. We are running CentOS 5.x.
After coming back online, our server began crashing at random and fairly often. There seemed to be a problem with interrupt 169, which pertains to the RAID controller. The controller was replaced, but another crash occurred. The kernel was then rebooted with the irqpoll option. Since then there has been one additional 'soft lockup', but no crash.

Question:
Could a UPS failure cause this sort of problem?

(I'm new at this posting-questions thing, so have probably left a lot out.)

Roger Ide
multimac

Hello Roger,

Do you have screenshots or log files of your kernel crash?

Have you already forced a check of the filesystems?
pclinuxguru (ASKER CERTIFIED SOLUTION; the text of this answer is available to members only)
oldcar53 (ASKER)

Hi-
There was no check of the filesystems, and the last event was Monday at 3 am.
The nature of the problem prevented any log-writing. I was able to run dmesg on one crash, as I caught it on the way down:

irq 169: nobody cared (try booting with the "irqpoll" option)
 [<c044ea52>] __report_bad_irq+0x2b/0x69
 [<c044ec49>] note_interrupt+0x1b9/0x1f0
 [<c044e215>] handle_IRQ_event+0x45/0x8c
 [<c044e339>] __do_IRQ+0xdd/0x118
 [<c044e25c>] __do_IRQ+0x0/0x118
 [<c04074c4>] do_IRQ+0x9b/0xc3
 [<c040597a>] common_interrupt+0x1a/0x20
 [<c05339f3>] acpi_processor_idle_simple+0x174/0x297
 [<c040597a>] common_interrupt+0x1a/0x20
 [<c053387f>] acpi_processor_idle_simple+0x0/0x297
 [<c0403d14>] cpu_idle+0x9f/0xb9
 =======================
handlers:
[<c058e26d>] (usb_hcd_irq+0x0/0x50)
[<f88db346>] (aac_rx_intr_message+0x0/0x55 [aacraid])
Disabling IRQ #169
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter reset request. SCSI hang ?
INFO: task kjournald:490 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D 00007CBB  2788   490     19           515   469 (L-TLB)
       dff96ed4 00000046 701fe859 00007cbb 00000005 00000000 1834632e 0000000a
       dff69550 701ff33a 00007cbb 00000ae1 00000002 dff6965c c37f0788 c39ac040
       105d6eda dff4a4c4 c37f1128 c37f75cc 00000020 00000001 dff4a4bc 105d6eda
Call Trace:
 [<c0621468>] io_schedule+0x36/0x59
 [<c04790db>] sync_buffer+0x30/0x33
 [<c062163f>] __wait_on_bit+0x33/0x58
 [<c04790ab>] sync_buffer+0x0/0x33
 [<c04790ab>] sync_buffer+0x0/0x33
 [<c06216c6>] out_of_line_wait_on_bit+0x62/0x6a
 [<c043737c>] wake_bit_function+0x0/0x3c
 [<c0479058>] __wait_on_buffer+0x1c/0x1f
 [<f88684b3>] journal_commit_transaction+0x4cf/0xf3c [jbd]
 [<c042e621>] lock_timer_base+0x15/0x2f
 [<c042e6a0>] try_to_del_timer_sync+0x65/0x6c
 [<f886bd08>] kjournald+0xa1/0x1c2 [jbd]
 [<c043734f>] autoremove_wake_function+0x0/0x2d
 [<f886bc67>] kjournald+0x0/0x1c2 [jbd]
 [<c043728a>] kthread+0xc0/0xee
 [<c04371ca>] kthread+0x0/0xee
 [<c0405c87>] kernel_thread_helper+0x7/0x10
 =======================
INFO: task syslogd:2386 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D 00007CBA  2340  2386      1          2389  2242 (NOTLB)
       f5c0ced0 00000086 3e0e3044 00007cba 00000070 00000080 030a9588 00000007
       f5c03550 3e0e3868 00007cba 00000824 00000001 f5c0365c c37e9944 f62ea200
       e02e8e68 c37ea2e4 00000001 f5c0cecc c041f0c8 00000000 00000000 ffffffff
Call Trace:
 [<c041f0c8>] __wake_up+0x2a/0x3d
 [<f886b2c1>] log_wait_commit+0x80/0xc7 [jbd]
 [<c043734f>] autoremove_wake_function+0x0/0x2d
 [<f8866679>] journal_stop+0x196/0x1bb [jbd]
 [<c0495846>] __writeback_single_inode+0x199/0x2a5
 [<c045d334>] do_writepages+0x2b/0x32
 [<c0458e37>] __filemap_fdatawrite_range+0x66/0x72
 [<c0495ee4>] sync_inode+0x19/0x24
 [<f889e019>] ext3_sync_file+0xb1/0xdc [ext3]
 [<c0478c15>] do_fsync+0x41/0x83
 [<c0478c74>] __do_fsync+0x1d/0x2b
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
INFO: task miva:2927 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
miva          D 00007CBA  2572  2927   2757                     (NOTLB)
       e3d1cf44 00000082 a989794c 00007cba f88ad1e0 e3d1cefc 00000000 00000001
       f20be000 aa52abe7 00007cba 00c9329b 00000003 f20be10c c37f75cc f60e2040
       c044e25c c37f7f6c e749e380 e3d1cf30 00000000 e3d1c000 c048af22 ffffffff
Call Trace:
 [<c044e25c>] __do_IRQ+0x0/0x118
 [<c048af22>] locks_remove_posix+0x7d/0x97
 [<c062183f>] __mutex_lock_slowpath+0x4d/0x7c
 [<c062187d>] .text.lock.mutex+0xf/0x14
 [<c0476edc>] generic_file_llseek+0x2a/0xd2
 [<c0476eb2>] generic_file_llseek+0x0/0xd2
 [<c04761f5>] vfs_llseek+0x30/0x34
 [<c0477077>] sys_lseek+0x38/0x63
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
INFO: task miva:2928 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
miva          D 00007CBA  2524  2928   2895                     (NOTLB)
       d9ee9b2c 00000086 98dd3236 00007cba c1351cc0 00000000 dcdc0990 00000008
       f63d1aa0 99c34660 00007cba 00e6142a 00000000 f63d1bac c37e2b00 f61463c0
       00001000 c37e34a0 f65043e0 00000bef 105d52ba c042d7c7 e030160c ffffffff
Call Trace:
 [<c042d7c7>] getnstimeofday+0x30/0xb6
 [<c0621468>] io_schedule+0x36/0x59
 [<c04790ab>] sync_buffer+0x0/0x33
 [<c04790db>] sync_buffer+0x30/0x33
 [<c062157a>] __wait_on_bit_lock+0x2a/0x52
 [<c04790ab>] sync_buffer+0x0/0x33
 [<c0621604>] out_of_line_wait_on_bit_lock+0x62/0x6a
 [<c043737c>] wake_bit_function+0x0/0x3c
 [<c0479205>] __lock_buffer+0x21/0x24
 [<f88666eb>] do_get_write_access+0x4d/0x462 [jbd]
 [<f8866b18>] journal_get_write_access+0x18/0x26 [jbd]
 [<f88a01f3>] ext3_get_blocks_handle+0x688/0x8d3 [ext3]
 [<f88a0711>] ext3_get_block+0xa2/0xd6 [ext3]
 [<c0479436>] __block_prepare_write+0x19b/0x37e
 [<c045c636>] get_page_from_freelist+0x96/0x378
 [<c04796c4>] block_write_begin+0x88/0xe6
 [<f88a066f>] ext3_get_block+0x0/0xd6 [ext3]
 [<f88a1ad8>] ext3_write_begin+0xc2/0x1a0 [ext3]
 [<f88a066f>] ext3_get_block+0x0/0xd6 [ext3]
 [<c04595af>] generic_file_buffered_write+0x101/0x58b
 [<c042a626>] current_fs_time+0x4a/0x54
 [<c0459edf>] __generic_file_aio_write_nolock+0x4a6/0x52a
 [<c0459431>] __generic_file_aio_read+0x16a/0x1a3
 [<c0457ef3>] file_read_actor+0x0/0xd5
 [<c0459fbc>] generic_file_aio_write+0x59/0xac
 [<f889dea1>] ext3_file_write+0x19/0x83 [ext3]
 [<c0476312>] do_sync_write+0xb6/0xf1
 [<c043734f>] autoremove_wake_function+0x0/0x2d
 [<c044ae8f>] audit_syscall_entry+0x193/0x1bd
 [<c0476f78>] generic_file_llseek+0xc6/0xd2
 [<c047625c>] do_sync_write+0x0/0xf1
 [<c0476b9b>] vfs_write+0xa1/0x143
 [<c04771c5>] sys_write+0x3c/0x63
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
aacraid: SCSI bus appears hung
INFO: task pdflush:235 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pdflush       D 00007CCA  2664   235     19           236   234 (L-TLB)
       dff3ff34 00000046 e00b2607 00007cca 00000000 00000100 00000000 0000000a
       dffa4550 e00b3239 00007cca 00000c32 00000003 dffa465c c37f75cc c39ac200
       00000000 c37f7f6c 00000000 dffa4550 c38eec50 c37f44cc c39ac200 ffffffff
Call Trace:
 [<c062183f>] __mutex_lock_slowpath+0x4d/0x7c
 [<c062187d>] .text.lock.mutex+0xf/0x14
 [<c0439d75>] down_read+0x8/0x11
 [<c047cc52>] sync_supers+0x47/0xb8
 [<c045d7c1>] wb_kupdate+0x36/0x130
 [<c045dc77>] pdflush+0x0/0x1a1
 [<c045dd82>] pdflush+0x10b/0x1a1
 [<c045d78b>] wb_kupdate+0x0/0x130
 [<c043728a>] kthread+0xc0/0xee
 [<c04371ca>] kthread+0x0/0xee
 [<c0405c87>] kernel_thread_helper+0x7/0x10
 =======================
INFO: task kjournald:490 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D 00007CBB  2788   490     19           515   469 (L-TLB)
       dff96ed4 00000046 701fe859 00007cbb 00000005 00000000 1834632e 0000000a
       dff69550 701ff33a 00007cbb 00000ae1 00000002 dff6965c c37f0788 c39ac040
       105d6eda dff4a4c4 c37f1128 c37f75cc 00000020 00000001 dff4a4bc 105d6eda
Call Trace:
 [<c0621468>] io_schedule+0x36/0x59
 [<c04790db>] sync_buffer+0x30/0x33
 [<c062163f>] __wait_on_bit+0x33/0x58
 [<c04790ab>] sync_buffer+0x0/0x33
 [<c04790ab>] sync_buffer+0x0/0x33
 [<c06216c6>] out_of_line_wait_on_bit+0x62/0x6a
 [<c043737c>] wake_bit_function+0x0/0x3c
 [<c0479058>] __wait_on_buffer+0x1c/0x1f
 [<f88684b3>] journal_commit_transaction+0x4cf/0xf3c [jbd]
 [<c042e621>] lock_timer_base+0x15/0x2f
 [<c042e6a0>] try_to_del_timer_sync+0x65/0x6c
 [<f886bd08>] kjournald+0xa1/0x1c2 [jbd]
 [<c043734f>] autoremove_wake_function+0x0/0x2d
 [<f886bc67>] kjournald+0x0/0x1c2 [jbd]
 [<c043728a>] kthread+0xc0/0xee
 [<c04371ca>] kthread+0x0/0xee
 [<c0405c87>] kernel_thread_helper+0x7/0x10
 =======================
INFO: task syslogd:2386 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D 00007CBA  2340  2386      1          2389  2242 (NOTLB)
       f5c0ced0 00000086 3e0e3044 00007cba 00000070 00000080 030a9588 00000007
       f5c03550 3e0e3868 00007cba 00000824 00000001 f5c0365c c37e9944 f62ea200
       e02e8e68 c37ea2e4 00000001 f5c0cecc c041f0c8 00000000 00000000 ffffffff
Call Trace:
 [<c041f0c8>] __wake_up+0x2a/0x3d
 [<f886b2c1>] log_wait_commit+0x80/0xc7 [jbd]
 [<c043734f>] autoremove_wake_function+0x0/0x2d
 [<f8866679>] journal_stop+0x196/0x1bb [jbd]
 [<c0495846>] __writeback_single_inode+0x199/0x2a5
 [<c045d334>] do_writepages+0x2b/0x32
 [<c0458e37>] __filemap_fdatawrite_range+0x66/0x72
 [<c0495ee4>] sync_inode+0x19/0x24
 [<f889e019>] ext3_sync_file+0xb1/0xdc [ext3]
 [<c0478c15>] do_fsync+0x41/0x83
 [<c0478c74>] __do_fsync+0x1d/0x2b
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
INFO: task hald-addon-stor:2624 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
hald-addon-st D 00007CC8  2744  2624   2606                2615 (NOTLB)
       f6137e98 00000082 5e52e576 00007cc8 c048c272 e02265bc c3944c8c 0000000a
       f634f000 5e538047 00007cc8 00009ad1 00000002 f634f10c c37f0788 c39e5040
       00000800 c37f1128 0028bcfd 00000003 dca86005 dfc0fe40 e0235888 ffffffff
Call Trace:
 [<c048c272>] dput+0x22/0xed
 [<f8879d7b>] scsi_block_when_processing_errors+0x7a/0xbf [scsi_mod]
 [<c043734f>] autoremove_wake_function+0x0/0x2d
 [<f8854dfc>] sd_open+0x69/0x10f [sd_mod]
 [<c047dce0>] do_open+0x1de/0x2ce
 [<c047df3c>] blkdev_open+0x0/0x44
 [<c047df58>] blkdev_open+0x1c/0x44
 [<c0474f91>] __dentry_open+0xc7/0x1ab
 [<c04750d9>] nameidata_to_filp+0x19/0x28
 [<c0475113>] do_filp_open+0x2b/0x31
 [<c0475157>] do_sys_open+0x3e/0xae
 [<c04751f4>] sys_open+0x16/0x18
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
INFO: task cpanellogd:4172 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cpanellogd    D 00007CCC  2572  4172      1          4234  4137 (NOTLB)
       f7c5bd6c 00000082 fe647b4e 00007ccc 00000000 00000001 f7c5bd34 0000000a
       f6379550 fe65d004 00007ccc 000154b6 00000001 f637965c c37e9944 f6135740
       005e000a c37ea2e4 e02071b0 00000000 105fbc80 c042d7c7 dfd4972c ffffffff
Call Trace:
 [<c042d7c7>] getnstimeofday+0x30/0xb6
 [<c0621468>] io_schedule+0x36/0x59
 [<c04790ab>] sync_buffer+0x0/0x33
 [<c04790db>] sync_buffer+0x30/0x33
 [<c062157a>] __wait_on_bit_lock+0x2a/0x52
 [<c04790ab>] sync_buffer+0x0/0x33
 [<c0621604>] out_of_line_wait_on_bit_lock+0x62/0x6a
 [<c043737c>] wake_bit_function+0x0/0x3c
 [<c0479205>] __lock_buffer+0x21/0x24
 [<f88666eb>] do_get_write_access+0x4d/0x462 [jbd]
 [<f886627c>] __journal_file_buffer+0x116/0x1ed [jbd]
 [<f8866b18>] journal_get_write_access+0x18/0x26 [jbd]
 [<f889e80b>] ext3_new_inode+0x591/0x971 [ext3]
 [<f88acb40>] ext3_permission+0x0/0xa [ext3]
 [<c0482dba>] permission+0xa2/0xb5
 [<c0484dff>] __link_path_walk+0xcd4/0xdc3
 [<f8866ee5>] journal_start+0xae/0xdd [jbd]
 [<f88a4c0a>] ext3_create+0x75/0xdc [ext3]
 [<c04833cc>] vfs_create+0xca/0x131
 [<c0485e3a>] open_namei+0x16a/0x631
 [<c04397a9>] lock_hrtimer_base+0x19/0x35
 [<c0475104>] do_filp_open+0x1c/0x31
 [<c0475157>] do_sys_open+0x3e/0xae
 [<c04751f4>] sys_open+0x16/0x18
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
INFO: task tailwatchd:21557 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tailwatchd    D 00007CCA  2396 21557      1  3073    5334 21526 (NOTLB)
       f7968ecc 00200082 c5a70071 00007cca 80000001 00000000 00000001 0000000a
       f6139aa0 c5ace8f9 00007cca 0005e888 00000001 f6139bac c37e9944 c3b19e40
       f7968f3c c37ea2e4 e7bff000 00000310 f7968f3c ffffffe9 f7968f3c ffffffff
Call Trace:
 [<c062183f>] __mutex_lock_slowpath+0x4d/0x7c
 [<c062187d>] .text.lock.mutex+0xf/0x14
 [<c0485dad>] open_namei+0xdd/0x631
 [<c0475104>] do_filp_open+0x1c/0x31
 [<c0475157>] do_sys_open+0x3e/0xae
 [<c04751f4>] sys_open+0x16/0x18
 [<c0404f4b>] syscall_call+0x7/0xb
 =======================
aacraid: aac_fib_send: first asynchronous command timed out.
Usually a result of a PCI interrupt routing problem;
update mother board BIOS or consider utilizing one of
the SAFE mode kernel options (acpi, apic etc)
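
The repeated lines in a capture like the one above make the sequence hard to see at a glance. Below is a minimal Python sketch for condensing it, assuming the dmesg text is available as a string. The names summarize_dmesg and SAMPLE are hypothetical, and the regular expressions are written only against the line formats visible in this post, so they may need adjusting for other kernels or drivers.

```python
import re
from collections import Counter

# Patterns based only on the line formats seen in the pasted output (assumption).
IRQ_RE = re.compile(r"irq (\d+): nobody cared")
HANDLER_RE = re.compile(
    r"^\[<[0-9a-f]+>\] \((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?\)"
)
BLOCKED_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def summarize_dmesg(text):
    """Condense a dmesg capture into IRQ, handler, abort, and blocked-task counts."""
    summary = {"unhandled_irqs": [], "handlers": [], "aborts": 0, "blocked": Counter()}
    in_handlers = False
    for raw in text.splitlines():
        line = raw.strip()
        m = IRQ_RE.search(line)
        if m:
            summary["unhandled_irqs"].append(int(m.group(1)))
            continue
        if line == "handlers:":
            in_handlers = True
            continue
        if in_handlers:
            m = HANDLER_RE.match(line)
            if m:
                # (function name, module name or None when built into the kernel)
                summary["handlers"].append((m.group(1), m.group(2)))
                continue
            in_handlers = False  # first non-handler line ends the list
        if "Host adapter abort request" in line:
            summary["aborts"] += 1
        m = BLOCKED_RE.search(line)
        if m:
            summary["blocked"][m.group(1)] += 1
    return summary


# Shortened excerpt of the output posted above, used only for illustration.
SAMPLE = """irq 169: nobody cared (try booting with the "irqpoll" option)
handlers:
[<c058e26d>] (usb_hcd_irq+0x0/0x50)
[<f88db346>] (aac_rx_intr_message+0x0/0x55 [aacraid])
Disabling IRQ #169
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
INFO: task kjournald:490 blocked for more than 120 seconds.
INFO: task syslogd:2386 blocked for more than 120 seconds.
INFO: task kjournald:490 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    result = summarize_dmesg(SAMPLE)
    print("Unhandled IRQs:", result["unhandled_irqs"])
    print("Handlers sharing the line:", result["handlers"])
    print("Abort requests:", result["aborts"])
    print("Blocked tasks:", dict(result["blocked"]))
```

Run against the full capture, a summary like this puts the two handlers registered on IRQ 169 (usb_hcd_irq and the aacraid handler) next to the count of abort requests and the list of tasks stuck waiting on disk I/O. It only reorganizes what the log already says and does not diagnose the cause.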
David (SOLUTION; the text of this answer is available to members only)
This is good. Thank you for adding to my perspective.