System and database crashed, please help

Hi all,

My server crashed due to high CPU, RAM, and I/O usage. Based on the logs below, please advise why it happened and how to resolve it:

dmesg | grep dracut
dracut: dracut-004-256.el6_2.1
dracut: rd_NO_LUKS: removing cryptoluks activation
dracut: rd_NO_DM: removing DM RAID activation
dracut: rd_NO_MDIMSM: no MD RAID for imsm/isw raids
dracut: Scanning devices sda2 sda3 sda4 sdb1 sdc1 sdd sde sdf sdg sdh sdi sdj sdk sdl sdm  for LVM logical volumes rootvg/swap_lv rootvg/root_lv
dracut: inactive '/dev/vg_p10_sap6/lv' [100.00 GiB] inherit
dracut: inactive '/dev/vg_p10_sap7/lv' [100.00 GiB] inherit
dracut: inactive '/dev/vg_p10_log/lv_sap' [80.00 GiB] inherit
dracut: inactive '/dev/vg_p10_bin/lv_db2' [10.00 GiB] inherit


Sep 18 16:04:17 xyz kernel: dracut: inactive '/dev/vg_p10_bin/lv_db2' [5.00 GiB] inherit
Sep 18 16:04:21 xyz kdump: kexec: loaded kdump kernel
Sep 18 16:04:21 xyz kdump: started up
Sep 18 17:45:33 xyz DB2[13100]: Open of log file "/db2/P10/db2/db2diag.log" failed with rc 0x870F0016#012Instance name: db2p10#012Node number: 0#012Process ID: 13100#012Process name: ipclean#012EDU ID: 0#012EDU name: #012Thread ID: 139794697541408#012Database name: #012Probe number: 500#012Product Name: DB2 UDB#012Component Name: oper system services#012Function Name: sqloRemovePosixIPCResources
Sep 18 17:45:40 xyz DB2[13306]: Open of log file "/db2/P10/db2/db2diag.log" failed with rc 0x870F0016#012Instance name: db2p10#012Node number: 0#012Process ID: 13306#012Process name: ipclean#012EDU ID: 0#012EDU name: #012Thread ID: 47675312169376#012Database name: #012Probe number: 864#012Product Name: DB2 UDB#012Component Name: oper system services#012Function Name: ipclean

Sep 18 18:01:59 xyz DB2[16380]: Open of log file "/db2/P10/db2/db2diag.log" failed with rc 0x870F0016#012Instance name: db2p10#012Node number: 0#012Process ID: 16380#012Process name: db2stop#012EDU ID: 0#012EDU name: #012Thread ID: 140360564000544#012Database name: #012Probe number: 10138#012Product Name: DB2 UDB#012Component Name: base sys utilities#012Function Name: sqlePrintFinalMessage
Sep 18 18:03:49 xyz DB2[16380]: Open of log file "/db2/P10/db2/db2diag.log" failed with rc 0x870F0016#012Instance name: db2p10#012Node number: 0#012Process ID: 16380#012Process name: db2stop#012EDU ID: 0#012EDU name: #012Thread ID: 140360564000544#012Database name: #012Probe number: 12538#012Product Name: DB2 UDB#012Component Name: base sys utilities#012Function Name: sqleReleaseStStLockFile
Sep 18 18:47:34 xyz kernel: dracut: inactive '/dev/vg_p10_bin/lv_db2' [5.00 GiB] inherit
Sep 18 18:47:37 xyz kdump: kexec: loaded kdump kernel
Sep 18 18:47:37 xyz kdump: started up
Sep 18 22:16:30 xyz kernel: dracut: inactive '/dev/vg_p10_bin/lv_db2' [5.00 GiB] inherit
Sep 18 22:16:33 xyz kdump: kexec: loaded kdump kernel
Sep 18 22:16:33 xyz kdump: started up

Please help ASAP.
apunkabollywood Asked:

arnold Commented:
I'm not sure how you had it configured, but it seems /dev/vg_p10_bin/lv_db2 lost 5 GB: the first listing shows it at 10.00 GiB, the later entries at 5.00 GiB.

apunkabollywood Author Commented:
For more information: the server runs SAP and DB2, and no multipathing is configured. Please help, it's urgent.

arnold Commented:
Depending on which HBA cards are in use (Emulex, Brocade, QLogic), you need to use the respective vendor tool to see where you lost the LUN.
cat /proc/scsi/scsi

pvdisplay
vgdisplay
lvdisplay
lvmdiskscan

Depending on the SAN in use, it sometimes comes with a GUI tool for viewing the LUNs.

The first thing you have to do is identify what you lost, then determine the cause of the loss, e.g. a storage engineer removed your host from a LUN, or a change made to the FC switch zoning for one storage processor removed your system's access to some LUNs.
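
The checks listed above can be gathered in one pass so the results are easy to post. This is only a minimal sketch, assuming a bash shell and root privileges for the LVM tools; the function name `collect_storage_info` and the output directory are illustrative and not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: save the output of each storage check to its own file
# so the results can be posted or compared after the next incident.
collect_storage_info() {
    local outdir="$1"
    mkdir -p "$outdir" || return 1

    # SCSI devices the kernel currently sees (the file may be absent on some kernels).
    if [ -r /proc/scsi/scsi ]; then
        cat /proc/scsi/scsi > "$outdir/proc_scsi.txt" 2>&1
    else
        echo "/proc/scsi/scsi not readable" > "$outdir/proc_scsi.txt"
    fi

    # LVM views: physical volumes, volume groups, logical volumes, device scan.
    local cmd
    for cmd in pvdisplay vgdisplay lvdisplay lvmdiskscan; do
        if command -v "$cmd" >/dev/null 2>&1; then
            # Keep stderr too, and record a non-zero exit status instead of hiding it.
            "$cmd" > "$outdir/$cmd.txt" 2>&1 || echo "$cmd exited with status $?" >> "$outdir/$cmd.txt"
        else
            echo "$cmd not found in PATH" > "$outdir/$cmd.txt"
        fi
    done
}
```

Calling `collect_storage_info /tmp/storage-check` (the path is an arbitrary example) leaves one text file per command, which makes it easier to spot a missing physical volume or a logical volume whose size differs from what was expected.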

apunkabollywood Author Commented:
Here are the strange log entries, in chronological order:

kernel: pci 0000:00:15.3: BAR 13: can't allocate I/O resource [0x1000-0x1fff]
Sep 18 16:04:17
hostname kernel: pci 0000:00:15.4: BAR 13: can't allocate I/O resource [0x1000-0x1fff]

cthats[2865]: (Recorded using libct_ffdc.a cv 2):::Error ID: 822....oCPCG/t9q.8g3w/....................:::Reference ID: :::Template ID: 0:::Details File:  :::Location: rsct,nim_control.C,1.39.1.41,7916             :::TS_NIM_ERROR_STUCK_ER#012NIM thread blocked#012Thread which was blocked#012send thread#012Interval in seconds during which process was blocked#0126#012Interface name#012eth1
Sep 18 16:07:48
cthats[2865]: (Recorded using libct_ffdc.a cv 2):::Error ID: 822....oCPCG/V1t08g3w/....................:::Reference ID: :::Template ID: 0:::Details File:  :::Location: rsct,nim_control.C,1.39.1.41,7916             :::TS_NIM_ERROR_STUCK_ER#012NIM thread blocked#012Thread which was blocked#012send thread#012Interval in seconds during which process was blocked#0126#012Interface name#012eth0
Sep 18 16:07:49
hostname  cthats[2865]: (Recorded using libct_ffdc.a cv 2):::Error ID: 822....pCPCG/YcD.8g3w/....................:::Reference ID: :::Template ID: 0:::Details File:  :::Location: rsct,nim_control.C,1.39.1.41,7916             :::TS_NIM_ERROR_STUCK_ER#012NIM thread blocked#012Thread which was blocked#012receive thread#012Interval in seconds during which process was blocked#0126#012Interface name#012eth2

kernel: INFO: task db2sysc:21333 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: db2sysc       D 0000000000000000     0 21333  20749 0x00000000
kernel: ffff8805e8e59cc8 0000000000000082 0000000000000000 ffffffffa033a240
kernel: ffff8805e8e59c38 ffffffff81012b59 ffff8805e8e59c78 ffffffff8109b849
kernel: ffff8803137530b8 ffff8805e8e59fd8 000000000000f4e8 ffff8803137530b8
kernel: Call Trace:
kernel: [<ffffffff81012b59>] ? read_tsc+0x9/0x20
kernel: [<ffffffff8109b849>] ? ktime_get_ts+0xa9/0xe0


samtb_net[16198]: op=release ip=10.91.40.1 rc=0 log=1 count=9
Sep 18 17:53:55 lterhdqeccpd1 samtb_net[16199]: op=release ip=10.91.40.1 rc=0 log=1 count=9
RecoveryRM[4411]: (Recorded using libct_ffdc.a cv 2):::Error ID: 825....MmQCG/KSS18g3w/....................:::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,Protocol.C,1.54.1.60,2722                :::RECOVERYRM_INFO_3_ST#012A new member has joined.#012Node number = #0122
 /usr/sbin/rsct/sapolicies/db2/hadrV10_monitor.ksh: Entering : db2p10 db2p10 P10
 hadrV10_monitor.ksh[16275]: hadrV10_monitor.ksh[16275]  is offline, hadr monitor must return offline

 hadrV10_monitor.ksh[16339]: hadrV10_monitor.ksh[16339]  is offline, hadr monitor must return offline
hadrV10_monitor.ksh[16339]: Returning 2 : db2p10 db2p10 P10
kernel: nfs: server 10.1.2.3 not responding, timed out
kernel: nfs: server 10.1.2.3 not responding, timed out
kernel: nfs: server 10.1.2.3 not responding, timed out


DB2[16380]: Open of log file "/db2/P10/db2dump/db2diag.log" failed with rc 0x870F0016#012Instance name: db2p10#012Node number: 0#012Process ID: 16380#012Process name: db2stop#012EDU ID: 0#012EDU name: #012Thread ID: 140360564000544#012Database name: #012Probe number: 1130#012Product Name: DB2 UDB#012Component Name: base sys utilities#012Function Name: sqleStartStopSingleNode
DB2[15942]: Open of log file "/db2/P10/db2dump/db2diag.log" failed with rc 0x870F0016#012Instance name: db2p10#012Node number: 0#012Process ID: 15942#012Process name: ipclean#012EDU ID: 0#012EDU name: #012Thread ID: 47992190878112#012Database name: #012Probe number: 500#012Product Name: DB2 UDB#012Component Name: oper system services#012Function Name: sqloRemovePosixIPCResources

RMCdaemon[2606]: (Recorded using libct_ffdc.a cv 2):::Error ID: 824....5XRCG/uHh/8g3w/....................:::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,rmcd.c,1.81,912                          :::RMCD_INFO_1_ST#012The daemon is stopped.#012Number of command that stopped the daemon#0123
Sep 18 18:45:59 hostname ConfigRM[2641]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,ConfigRMDaemon.C,1.18,222                :::CONFIGRM_STOPPED_ST#012IBM.ConfigRM daemon has been stopped.
Sep 18 18:45:59 hostname cthags[3125]: (Recorded using libct_ffdc.a cv 2):::Error ID: 825....5XRCG/APh/8g3w/....................:::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,SRCSocket.C,1.79.1.1,492                 :::GS_STOP_ST#012Group Services daemon stopped#012DIAGNOSTIC EXPLANATION#012Received signal[SIGTERM]. Converted to normal stop
Sep 18 18:45:59 hostname ctcasd[2660]: (Recorded using libct_ffdc.a cv 2):::Error ID: 824....5XRCG//Uh/8g3w/....................:::Reference ID:  :::Template ID: cffb2385:::Details File:  :::Location: rsct.core.sec,ctcas_main.c,1.30,399           :::ctcasd Daemon Stopped
Sep 18 18:45:59 hostname init: tty (/dev/tty1) main process (3971) killed by TERM signal
Sep 18 18:45:59 hostname init: tty (/dev/tty2) main process (3975) killed by TERM signal
Sep 18 18:45:59 hostname StorageRM[4353]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,StorageRMDaemon.C,1.57,362               :::STORAGERM_STOPPED_ST#012IBM.StorageRM daemon has been stopped.
Sep 18 18:45:59 hostname init: tty (/dev/tty3) main process (3979) killed by TERM signal
Sep 18 18:45:59 hostname init: tty (/dev/tty4) main process (3983) killed by TERM signal
Sep 18 18:45:59 hostname init: tty (/dev/tty5) main process (3989) killed by TERM signal
Sep 18 18:45:59 hostname init: tty (/dev/tty6) main process (3993) killed by TERM signal
Sep 18 18:45:59 hostname kernel: Removing vmci device
Sep 18 18:45:59 hostname kernel: Resetting vmci device
Sep 18 18:45:59 hostname kernel: Unregistered vmci device.
Sep 18 18:45:59 hostname kernel: vmci 0000:00:07.7: PCI INT A disabled
Sep 18 18:46:00 hostname xinetd[3543]: Exiting...

I hope this helps you understand the problem.

apunkabollywood Author Commented:
Please have a look.

arnold Commented:
Have a look?
You need to make sure that all the hardware is ok.

lspci
You may have lost a device/controller.
Could you make sure you did not lose a drive? Run the four LVM commands (pvdisplay, vgdisplay, lvdisplay, lvmdiskscan) and post their output.

Is the server IBM or is the storage IBM?

apunkabollywood Author Commented:
Here are some facts:

The storage and VMware teams confirm there are no errors or log entries regarding disk LUNs or failover.

The log entries written just before the reboot are:

Sep 18 18:45:59 hostname init: tty (/dev/tty1) main process (3971) killed by TERM signal
Sep 18 18:45:59 hostname init: tty (/dev/tty2) main process (3975) killed by TERM signal
Sep 18 18:45:59 hostname StorageRM[4353]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,StorageRMDaemon.C,1.57,362               :::STORAGERM_STOPPED_ST#012IBM.StorageRM daemon has been stopped.

At the same time, DB2 also logged errors saying it was not able to run db2sys ...

The storage is Hitachi.

The machine has around 10 SAP LUNs of 100 GB each, plus 2 LUNs, one for bin and one for log. The OS is also on a datastore.

It's a VM.

The machine crashed twice, and the logs above were found only for the second crash. The other reboot happened all of a sudden, with no strange logs or errors on the system side.

arnold Commented:
Could you make sure this crashed system did not run out of space?
df -k

Is this the host or a VM?
If it is a VM, run a disk integrity scan.
When the VM is powered off, make sure there are no remnant lock files.
If the VM accesses the storage via FC/iSCSI, make sure nothing else is preventing that access.

Are all the resources allocated on the Hitachi to this system/VM actually seen by this system/VM?
I am unfamiliar with DB2, but the error might point to it crashing because of a missing resource.
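
For the free-space check, a small filter can flag nearly full filesystems instead of reading the whole listing by eye. This is a rough sketch, not an established tool: the function name `full_filesystems` and the 90% default threshold are assumptions, and it expects POSIX-format output (`df -P`) with mount points that contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print every mount point
# whose usage is at or above the given percentage (default 90, an arbitrary choice).
full_filesystems() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5                  # "Capacity" column, e.g. "97%"
            sub(/%$/, "", use)        # strip the trailing percent sign
            if (use + 0 >= limit + 0) print $6, use "%"
        }'
}

# Usage: df -P | full_filesystems 90
```

With GNU df, feeding it `df -Pi` instead should apply the same check to inode usage, since a filesystem can run out of inodes while still showing free blocks.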

apunkabollywood Author Commented:
Here are the vmcore logs:

    KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.13.1.el6.x86_64/vmlinux
    DUMPFILE: /var/127.0.0.1-2013-09-20-14:02:09/vmcore  [PARTIAL DUMP]
        CPUS: 10
        DATE: Fri Sep 19 17:01:00 2013
      UPTIME: 1 days, 15:45:00
LOAD AVERAGE: 5.99, 5.93, 3.56
       TASKS: 953
    NODENAME: Hostname
     RELEASE: 2.6.32-220.13.1.el6.x86_64
     VERSION: #1 SMP Thu Mar 29 11:46:40 EDT 2012
     MACHINE: x86_64  (2500 Mhz)
      MEMORY: 40 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff81a8d020  (1 of 10)  [THREAD_INFO: ffffffff81a00000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)


PID: 0      TASK: ffffffff81a8d020  CPU: 0   COMMAND: "swapper"
 #0 [ffff88003f403590] machine_kexec at ffffffff8103214b
 #1 [ffff88003f4035f0] crash_kexec at ffffffff810b90c2
 #2 [ffff88003f4036c0] oops_end at ffffffff814f09b0
 #3 [ffff88003f4036f0] no_context at ffffffff8104234b
 #4 [ffff88003f403740] __bad_area_nosemaphore at ffffffff810425d5
 #5 [ffff88003f403790] bad_area_nosemaphore at ffffffff810426a3
 #6 [ffff88003f4037a0] __do_page_fault at ffffffff81042d5d
 #7 [ffff88003f4038c0] do_page_fault at ffffffff814f298e
 #8 [ffff88003f4038f0] page_fault at ffffffff814efd45
    [exception RIP: tcp_mark_head_lost+185]
    RIP: ffffffff81473f19  RSP: ffff88003f4039a0  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: ffff88066d701800  RDI: ffff8804187e2440
    RBP: ffff88003f4039d0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000015  R11: 0000000000000000  R12: ffff8804187e2440
    R13: ffff8804eb6de8b8  R14: 0000000000000001  R15: ffff8804187e2508
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88003f4039d8] tcp_ack at ffffffff8147a39f
#10 [ffff88003f403aa8] tcp_rcv_established at ffffffff8147b10d
#11 [ffff88003f403b08] tcp_v4_do_rcv at ffffffff81483163
#12 [ffff88003f403ba8] tcp_v4_rcv at ffffffff81484951
#13 [ffff88003f403c28] ip_local_deliver_finish at ffffffff814626bd
#14 [ffff88003f403c58] ip_local_deliver at ffffffff81462948
#15 [ffff88003f403c88] ip_rcv_finish at ffffffff81461e0d
#16 [ffff88003f403cc8] ip_rcv at ffffffff81462395
#17 [ffff88003f403d08] __netif_receive_skb at ffffffff8142c34b
#18 [ffff88003f403d68] netif_receive_skb at ffffffff8142e408
#19 [ffff88003f403da8] vmxnet3_rq_rx_complete at ffffffffa0062a9d [vmxnet3]
#20 [ffff88003f403e28] vmxnet3_poll_rx_only at ffffffffa0063203 [vmxnet3]
#21 [ffff88003f403e68] net_rx_action at ffffffff81430cb3
#22 [ffff88003f403ec8] __do_softirq at ffffffff81072191
#23 [ffff88003f403f38] call_softirq at ffffffff8100c24c
#24 [ffff88003f403f50] do_softirq at ffffffff8100de85
#25 [ffff88003f403f70] irq_exit at ffffffff81071f75
#26 [ffff88003f403f80] do_IRQ at ffffffff814f5215
--- <IRQ stack> ---
#27 [ffffffff81a01e18] ret_from_intr at ffffffff8100ba53
    [exception RIP: native_safe_halt+11]
    RIP: ffffffff810375eb  RSP: ffffffff81a01ec8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffffffff81a01ec8  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff81dd5228
    RBP: ffffffff8100ba4e   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffffff814ecf50
    R13: ffffffff81a01ee8  R14: ffff880353cf8100  R15: ffffffff8160b3a0

apunkabollywood Author Commented:
INFO: task db2sysc:12659 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
db2sysc       D 0000000000000002     0 12659  11588 0x00000000
 ffff88065990fcc8 0000000000000082 0000000000000000 ffff88003f4528b8
 ffff88065990fc38 ffffffff81012b59 ffff88065990fc78 ffffffff8109b849
 ffff880208939038 ffff88065990ffd8 000000000000f4e8 ffff880208939038
Call Trace:
 [<ffffffff81012b59>] ? read_tsc+0x9/0x20
 [<ffffffff8109b849>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff81110c60>] ? sync_page+0x0/0x50
 [<ffffffff814ed6e3>] io_schedule+0x73/0xc0
 [<ffffffff81110c9d>] sync_page+0x3d/0x50
 [<ffffffff814ee09f>] __wait_on_bit+0x5f/0x90
 [<ffffffff81110e53>] wait_on_page_bit+0x73/0x80
 [<ffffffff81090c70>] ? wake_bit_function+0x0/0x50
 [<ffffffff811272f5>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111126b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff81111438>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811a567e>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811a574d>] vfs_fsync+0x1d/0x20
 [<ffffffff811a578e>] do_fsync+0x3e/0x60
 [<ffffffff811a57e0>] sys_fsync+0x10/0x20
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task db2sysc:12659 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
db2sysc       D 0000000000000003     0 12659  11588 0x00000000

arnold Commented:
Instead of repeatedly posting the crash references from the log, could you please answer the questions about which components, if any, failed?

In your question, a logical volume went from 10 GB to 5 GB:
/dev/vg_p10_bin/lv_db2

Please stop posting log entries and respond to what is asked. Post the output of:
lvmdiskscan
pvdisplay
vgdisplay
lvdisplay

The loss of a 5 GB portion of /dev/vg_p10_bin/lv_db2 could mean that whatever was there is required for DB2 to start, and that DB2 crashes when it fails to access that data,
e.g. non-committed transactions. I have not dealt with DB2, so I am not sure how it operates, but most database servers rely on transaction logs and the like to speed up processing without having to write to the database until a transaction is committed or checkpointed.