sun hongda

asked on

Red Hat Linux unexpected reboot: how to find out why

A customer's Linux VM rebooted unexpectedly at Mar 21 19:51:39, but the messages log did not record anything except some disk errors like those below:

Mar 21 19:48:49 sc1pxn nbd-client: Kernel call returned: Invalid request descriptor
Mar 21 19:48:49 sc1pxn nbd-client: Begin Closing
Mar 21 19:48:49 sc1pxn nbd-client: Closing Complete
Mar 21 19:48:49 sc1pxn kernel: md/raid1:md1: Disk failure on nbd8, disabling device.
Mar 21 19:48:49 sc1pxn kernel: md/raid1:md1: Operation continuing on 1 devices.
Mar 21 19:48:49 sc1pxn kernel: nbd8: Attempted send on closed socket (in do)
Mar 21 19:48:49 sc1pxn kernel: nbd8: Unexpected reply (ffff88025db9fbd0)
Mar 21 19:48:49 sc1pxn kernel: nbd8: queue cleared
Mar 21 19:48:49 sc1pxn rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="4562" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mar 21 19:48:49 sc1pxn rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="4562" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mar 21 19:48:49 sc1pxn kernel: nbd (pid 76736: nbd24) got signal 9
Mar 21 19:48:49 sc1pxn kernel: nbd24: shutting down socket
Mar 21 19:48:49 sc1pxn kernel: nbd24: Send data failed (result -4)
Mar 21 19:48:49 sc1pxn kernel: nbd24: Receive control failed (result -32)
Mar 21 19:48:49 sc1pxn kernel: nbd24: Request send failed
Mar 21 19:48:49 sc1pxn kernel: end_request: I/O error, dev nbd24, sector 12407408640
Mar 21 19:48:49 sc1pxn kernel: nbd24: Attempted send on closed socket (in do)
Mar 21 19:48:49 sc1pxn kernel: nbd24: queue cleared
Mar 21 19:48:49 sc1pxn kernel: md/raid1:md3: Disk failure on nbd24, disabling device.
Mar 21 19:48:49 sc1pxn kernel: md/raid1:md3: Operation continuing on 1 devices.
Mar 21 19:48:49 sc1pxn nbd-client: Kernel call returned: Broken pipe
Mar 21 19:48:49 sc1pxn nbd-client: Begin Closing
Mar 21 19:48:49 sc1pxn nbd-client: Closing Complete
Mar 21 19:48:50 sc1pxn kernel: md: md3: recovery interrupted.
Mar 21 19:51:39 sc1pxn kernel: imklog 5.8.10, log source = /proc/kmsg started.
Mar 21 19:51:39 sc1pxn rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="4578" x-info="http://www.rsyslog.com"] start
Mar 21 19:51:39 sc1pxn kernel: Initializing cgroup subsys cpuset
Mar 21 19:51:39 sc1pxn kernel: Initializing cgroup subsys cpu
Mar 21 19:51:39 sc1pxn kernel: Linux version 2.6.32-696.16.1.el6.x86_64 (mockbuild@x86-031.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-18) (GCC) ) #1 SMP Sun Oct 8 09:45:56 EDT 2017
Mar 21 19:51:39 sc1pxn kernel: Command line: ro root=/dev/mapper/vg_root-lv_root rd_NO_LUKS rd_LVM_LV=vg_root/lv_root LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_swap/lv_swap  KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto rd_NO_DM console=ttyS0 clocksource=tsc
Mar 21 19:51:39 sc1pxn kernel: KERNEL supported cpus:
Mar 21 19:51:39 sc1pxn kernel:  Intel GenuineIntel
Mar 21 19:51:39 sc1pxn kernel:  AMD AuthenticAMD
Mar 21 19:51:39 sc1pxn kernel:  Centaur CentaurHauls

==========================

We also have an sos report. How can we analyze the sosreport to find out why the system rebooted unexpectedly?

Thank you very much.


sosreport-sc1pxn-20230323143459.zip

arnold

It starts with a failed device at 19:48
md1 lost a device, dropping to 1 active member,
but there was also a socket issue; the queue was cleared.
Then you have md3 also with a failed device, but md3 ran into an I/O issue and that seems to be what precipitated the reboot.

md3 seems to have entered an unknown state.

Is this a VM using software RAID volumes internally?
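If the arrays are still assembled, something like the following should confirm what the kernel saw (a sketch; /dev/md1, /dev/md3 and the nbd member names are taken from the log above and may differ on the live system):

# state of all md arrays and which members are missing or failed
cat /proc/mdstat
# per-array detail: degraded state, failed devices, event counts
mdadm --detail /dev/md1
mdadm --detail /dev/md3
# per-member metadata on the nbd devices the kernel kicked out
mdadm --examine /dev/nbd8
mdadm --examine /dev/nbd24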
sun hongda (ASKER)

Yes, it seems to be using Veritas NetBackup and RAID volumes.
Is this one of several nodes in a cluster? Or does it rely on devices presented through NBD, and losing several of them triggered the cascade and reboot?
md3 specifically is the one where the kernel got a broken pipe when it tried to recover md3 from the single-device failure.

Does the customer (or you, if you are the reseller) have Red Hat support to whom you can provide the sosreport for analysis?
host resource exhaustion?

how many VMs on the host?
The VM issues might only be a symptom of what went on.
Or a mere glitch on the network....
The customer has one VM with 17 disks, and they do not seem to have Red Hat support for Linux, so we cannot have the sosreport analyzed that way.
Also, the sosreport seems to have been taken after the reboot.
I have asked a hardware expert to investigate the hardware; no hardware fault was found.
Can I still find clues in an sosreport taken after the reboot?
Or, if I ask the customer to share the AMI, what could I do to analyze it, if that could possibly turn up a clue?
Thank you.
What are the customer VM's dependencies? What is the function of this VM; is it the backup server for their other VMs?

is this a DB server?

Redhat 6?

Host/Environment network saturation?
It is a file server, I guess.
Red Hat 6, yes.
I reinvestigated and found the errors below; any clue in those?
FS: Registering the id_resolver key type
Key type id_resolver registered
FS-Cache: Netfs 'nfs' registered for caching
nbd80: Attempted send on closed socket (in do)
nbd82: Attempted send on closed socket (in do)
nbd83: Attempted send on closed socket (in do)
nbd84: Attempted send on closed socket (in do)
nbd85: Attempted send on closed socket (in do)
nbd86: Attempted send on closed socket (in do)
nbd87: Attempted send on closed socket (in do)
nbd88: Attempted send on closed socket (in do)
nbd90: Attempted send on closed socket (in do)
nbd91: Attempted send on closed socket (in do)
nbd92: Attempted send on closed socket (in do)
nbd93: Attempted send on closed socket (in do)
nbd94: Attempted send on closed socket (in do)
nbd95: Attempted send on closed socket (in do)
nbd96: Attempted send on closed socket (in do)
nbd98: Attempted send on closed socket (in do)
nbd99: Attempted send on closed socket (in do)
nbd100: Attempted send on closed socket (in do)
nbd101: Attempted send on closed socket (in do)
nbd102: Attempted send on closed socket (in do)
nbd103: Attempted send on closed socket (in do)
nbd104: Attempted send on closed socket (in do)
nbd106: Attempted send on closed socket (in do)
nbd107: Attempted send on closed socket (in do)
nbd108: Attempted send on closed socket (in do)
nbd109: Attempted send on closed socket (in do)
nbd110: Attempted send on closed socket (in do)
nbd111: Attempted send on closed socket (in do)
nbd112: Attempted send on closed socket (in do)
nbd114: Attempted send on closed socket (in do)
nbd115: Attempted send on closed socket (in do)
nbd116: Attempted send on closed socket (in do)
nbd117: Attempted send on closed socket (in do)
nbd118: Attempted send on closed socket (in do)
nbd119: Attempted send on closed socket (in do)
nbd122: Attempted send on closed socket (in do)
nbd123: Attempted send on closed socket (in do)
nbd124: Attempted send on closed socket (in do)
nbd125: Attempted send on closed socket (in do)
nbd126: Attempted send on closed socket (in do)
nbd127: Attempted send on closed socket (in do)
nbd128: Attempted send on closed socket (in do)
nbd130: Attempted send on closed socket (in do)
nbd131: Attempted send on closed socket (in do)
nbd132: Attempted send on closed socket (in do)
nbd133: Attempted send on closed socket (in do)
nbd134: Attempted send on closed socket (in do)
nbd135: Attempted send on closed socket (in do)
nbd136: Attempted send on closed socket (in do)
nbd138: Attempted send on closed socket (in do)
nbd139: Attempted send on closed socket (in do)
nbd140: Attempted send on closed socket (in do)
nbd141: Attempted send on closed socket (in do)
nbd142: Attempted send on closed socket (in do)
nbd143: Attempted send on closed socket (in do)
nbd144: Attempted send on closed socket (in do)
nbd146: Attempted send on closed socket (in do)
nbd147: Attempted send on closed socket (in do)
nbd148: Attempted send on closed socket (in do)
nbd149: Attempted send on closed socket (in do)
nbd150: Attempted send on closed socket (in do)
nbd151: Attempted send on closed socket (in do)
nbd152: Attempted send on closed socket (in do)
nbd154: Attempted send on closed socket (in do)
nbd155: Attempted send on closed socket (in do)
nbd156: Attempted send on closed socket (in do)
nbd157: Attempted send on closed socket (in do)
nbd158: Attempted send on closed socket (in do)
nbd159: Attempted send on closed socket (in do)
Stopping certmonger: [  OK  ]

Can't connect to default. Skipping.
Shutting down Cluster Module - cluster monitor: [  OK  ]
stopping the NetBackup Service Monitor
stopping the NetBackup CloudStore Service Container
stopping the NetBackup Service Layer
xenbus_dev_shutdown: device/pci/0: Initialising != Connected, skipping
Restarting system.
machine restart
A file server; the disks it has issues with are not local, so what is supposed to present these disks?
Veritas NetBackup: where are the resources?
How are the drives defined on the VM?
Is this VM part of a geographic cluster?
Certificate expired, lapsed?
i.e. if access to the disk resources uses certificate-based authentication and this system's certificate renewal did not go through, the cert has expired and the system is now being denied access to resources it needs.
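If you want to rule that out, the expiry of whatever certificate the NBD/NetBackup access uses can be checked with openssl; a minimal sketch (the path is a placeholder, the real certificate location depends on how the customer's setup is configured):

# print the expiry date of a PEM certificate (placeholder path)
openssl x509 -in /path/to/client-cert.pem -noout -enddate
# exit status 1 if the certificate has already expired
openssl x509 -in /path/to/client-cert.pem -noout -checkend 0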



They are defined as RAID, so the RAID information can be read.
It is not part of a geographic cluster.
I think there are no certificate issues; the sosreport seems to confirm that.
noci

nbd is a network block device, were there network issues when the mirrorset broke?

any chance you can show the output of:
cat /proc/mdstat
lsblk
pvs
vgs
lvs

is there any info on the nbd-server available?
is there any info on related network events?
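Much of that can also be pulled from the sosreport that was already collected, without access to the live machine. Roughly like this (this is the usual sos layout, but exact file names vary between sos versions, so treat the paths as assumptions):

# unpack the report (assuming the zip extracts straight into the report directory)
unzip sosreport-sc1pxn-20230323143459.zip && cd sosreport-sc1pxn-*/
cat proc/mdstat                              # md array state at capture time
cat sos_commands/block/lsblk                 # block device tree
cat sos_commands/lvm2/pvs* sos_commands/lvm2/vgs* sos_commands/lvm2/lvs*
grep -iE 'nbd|md/raid|panic|watchdog' var/log/messages* | less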
Thank you for the reply. Does it mean a kernel crash, like the error below?
Mar 21 19:48:28 sc1pxn nbd_server[32321]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:29 sc1pxn nbd_server[34150]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:31 sc1pxn nbd_server[35108]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:32 sc1pxn nbd_server[36188]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:33 sc1pxn nbd_server[37201]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:34 sc1pxn nbd_server[38288]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:36 sc1pxn nbd_server[39741]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:37 sc1pxn nbd_server[40971]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:38 sc1pxn nbd_server[42222]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:39 sc1pxn nbd_server[43232]: sockread fail: Success (0/0/0/0)
Mar 21 19:48:49 sc1pxn kernel: nbd (pid 72653: nbd8) got signal 9
Mar 21 19:48:49 sc1pxn kernel: nbd8: shutting down socket
Mar 21 19:48:49 sc1pxn kernel: nbd8: Send data failed (result -4)
Mar 21 19:48:49 sc1pxn kernel: nbd8: Request send failed
Mar 21 19:48:49 sc1pxn kernel: end_request: I/O error, dev nbd8, sector 761698104
Mar 21 19:48:49 sc1pxn nbd-client: Kernel call returned: Invalid request descriptor
Mar 21 19:48:49 sc1pxn nbd-client: Begin Closing
Mar 21 19:48:49 sc1pxn nbd-client: Closing Complete
Mar 21 19:48:49 sc1pxn kernel: md/raid1:md1: Disk failure on nbd8, disabling device.
Mar 21 19:48:49 sc1pxn kernel: md/raid1:md1: Operation continuing on 1 devices.
Mar 21 19:48:49 sc1pxn kernel: nbd8: Attempted send on closed socket (in do)
Mar 21 19:48:49 sc1pxn kernel: nbd8: Unexpected reply (ffff88025db9fbd0)
Mar 21 19:48:49 sc1pxn kernel: nbd8: queue cleared
Mar 21 19:48:49 sc1pxn rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="4562" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mar 21 19:48:49 sc1pxn rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="4562" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mar 21 19:48:49 sc1pxn kernel: nbd (pid 76736: nbd24) got signal 9
Mar 21 19:48:49 sc1pxn kernel: nbd24: shutting down socket
Mar 21 19:48:49 sc1pxn kernel: nbd24: Send data failed (result -4)
Mar 21 19:48:49 sc1pxn kernel: nbd24: Receive control failed (result -32)
Mar 21 19:48:49 sc1pxn kernel: nbd24: Request send failed
Mar 21 19:48:49 sc1pxn kernel: end_request: I/O error, dev nbd24, sector 12407408640
Mar 21 19:48:49 sc1pxn kernel: nbd24: Attempted send on closed socket (in do)
Mar 21 19:48:49 sc1pxn kernel: nbd24: queue cleared
Mar 21 19:48:49 sc1pxn kernel: md/raid1:md3: Disk failure on nbd24, disabling device.
Mar 21 19:48:49 sc1pxn kernel: md/raid1:md3: Operation continuing on 1 devices.
Mar 21 19:48:49 sc1pxn nbd-client: Kernel call returned: Broken pipe
Mar 21 19:48:49 sc1pxn nbd-client: Begin Closing
Mar 21 19:48:49 sc1pxn nbd-client: Closing Complete
Mar 21 19:48:50 sc1pxn kernel: md: md3: recovery interrupted.
Mar 21 19:51:39 sc1pxn kernel: imklog 5.8.10, log source = /proc/kmsg started.
It is unable to access the resources and triggers a reboot.
Does the system boot from local drives and then access the others,
or does it rely on external resources to boot,
i.e. a minimal-disk VM that boots and then accesses network-bound resources?

Answering noci's questions would be helpful. Though as soon as the system boots it starts lagging, and it seems the setup is to restart in an effort to recover.

You could try booting into single-user mode and then look through the configuration for the source of those external disks. Check that host and see what state it is in.

i.e. you are on a system that depends on resources from elsewhere. This is the system everyone accesses and when it misbehaves everyone notices it, but you might be looking for a solution on the wrong system.

How many other VMs or Physical Systems/devices does the customer have on your premises?
Knowing that may help you identify what to check.
i.e. does the customer have a storage device? SAN, NAS, vmware vsan, etc.
The crash most probably involves the root filesystem... The commands whose output I asked for are meant to determine the structure of your I/O system... Or it happened during a critical update of the I/O system...
Anyway, from the crash log as such, only the order of events can be determined.
And it hints at causes...

btw... NBD devices are not file shares; they are block shares.
The server opens a device/file and serves it over the network as if it were a disk.
The client presents all data as if it were a local disk.
Network Block Devices dish out blocks, not files.
So if it is a server, it will be a disk image server of some sort.
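As a minimal illustration of that distinction (generic nbd-tools usage, not the NetBackup-specific setup on this host; newer nbd releases use named exports instead of bare port numbers):

# server side: export a file or device as a block device over the network
nbd-server 10809 /srv/exports/disk0.img
# client side: attach the export as a local block device
nbd-client nbd-server-host 10809 /dev/nbd0
# from here the client sees raw blocks, not files; it still runs its own
# filesystem or md raid on top of /dev/nbd0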
Hello experts, I found the entry below in the last -x output. Is there any method to dig further into who or which application did that,
for this entry:
reboot   system boot  2.6.32-696.16.1. Tue Mar 21 19:51 - 09:43 (15+13:52)

[root@ip-172-31-51-78 log]# last -x|more
shutdown system down  2.6.32-754.35.1. Thu Apr  6 09:45 - 08:00 (-19453+-1:-
runlevel (to lvl 0)   2.6.32-754.35.1. Thu Apr  6 09:43 - 09:45  (00:01)
runlevel (to lvl 3)   2.6.32-754.35.1. Thu Apr  6 09:32 - 09:43  (00:11)
reboot   system boot  2.6.32-754.35.1. Thu Apr  6 09:32 - 09:43  (00:11)
wangxin  pts/1        192.1.51.43      Mon Apr  3 16:00 - 16:19  (00:19)
alan     pts/0        172.24.104.48    Mon Apr  3 13:39 - crash (2+19:53)
wangxin  pts/0        192.1.51.43      Thu Mar 23 14:34 - 14:46  (00:11)
siosadmi pts/0        172.30.250.20    Wed Mar 22 10:15 - 11:07  (00:51)
wangxin  pts/0        192.1.51.43      Tue Mar 21 22:39 - 09:23  (10:44)
runlevel (to lvl 3)   2.6.32-696.16.1. Tue Mar 21 19:51 - 09:32 (15+13:41)
reboot   system boot  2.6.32-696.16.1. Tue Mar 21 19:51 - 09:43 (15+13:52)
Not sure what you are expecting; according to the log,
alan was logged in when it crashed.

alan     pts/0        172.24.104.48    Mon Apr  3 13:39 - crash (2+19:53)
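If you want to dig further into who or what initiated that reboot, a few more places worth checking (on the box or inside the sosreport) are, for example (adjust the date to the crash window; the audit trail only helps if auditd was running):

# audit trail of shutdown/boot events
ausearch -m SYSTEM_SHUTDOWN,SYSTEM_BOOT -ts 03/21/2023
# authentication, su/sudo and shutdown messages around the crash window
grep -iE 'shutdown|reboot' /var/log/secure*
# shell history of the users who were logged in (alan, wangxin)
grep -nE 'shutdown|reboot|init 6' /home/*/.bash_history /root/.bash_history 2>/dev/null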
Hello, thank you for the reply.
I found the error below. Is there any command to see what dm-0 is and how it is mounted?
EXT4-fs (dm-0): INFO: recovery required on readonly filesystem
EXT4-fs (dm-0): write access will be enabled during recovery
EXT4-fs (dm-0): orphan cleanup on readonly fs
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 308188
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 268900
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 266860
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 297088
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 311691
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 291825
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 277943
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 334527
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 276665
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 268514
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 298269
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 392238
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 266546
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 296674
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 306384
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 310375
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 312311
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 312981
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 265655
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 268509
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 280432
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 274028
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 269568
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 306302
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 274890
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 308208
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 273740
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 280228
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 269364
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 278214
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 278411
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 268349
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 334525
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 7697
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 287478
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 13900
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 334476
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 149124
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 309956
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 278301
EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 791619
EXT4-fs (dm-0): 41 orphan inodes deleted
EXT4-fs (dm-0): recovery complete
EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts:
dracut: Mounted root filesystem /dev/mapper/vg_root-lv_root
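On dm-0: the dracut line above already indicates that dm-0 is /dev/mapper/vg_root-lv_root, i.e. the root LV. On a running system the mapping and mount can be confirmed with standard device-mapper/LVM tools, for instance:

dmsetup info -c                 # table mapping dm names to major:minor numbers
ls -l /dev/mapper/              # vg-lv names, usually symlinks to the dm-N nodes
lsblk                           # full block device tree with mountpoints
mount | grep mapper             # mounted device-mapper volumes
lvs -a -o +devices              # which physical devices back each LV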
This looks like the result of an fsck, a filesystem check that repaired the volume by discarding references to locations in storage that had no relevance.
A bookkeeping process.
At times this can happen after a kernel panic, or when storage access is lost... not a clean reboot. Changes to data could not be flushed to storage.

Did you get to the bottom of where the storage resources were coming from?

Is the system operating, or did you just resolve the boot issue and the rest of the storage resources are a work in progress?
During boot the filesystems are checked (a quick check) and, if needed, simple repairs are made: for example, freeing space that was pre-allocated but never committed to files, or finishing off files that were deleted but not yet finalized in the bitmaps.
If more elaborate repairs are needed, the system stops booting and someone needs to log on with the root password during this extended check and decide whether files can be removed or not.

So this process will happen if disks are not cleanly unmounted (a regular shutdown does unmount all disks and commit all pending updates). After a crash, expect recovery of parts of the filesystem.
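If you want to verify when that recovery ran on the root filesystem, the ext4 superblock records it; a small sketch (read-only, but on a mounted filesystem the fsck results are only indicative):

# last mount time, mount count and filesystem state flags
tune2fs -l /dev/mapper/vg_root-lv_root | grep -iE 'state|mount|check'
# read-only consistency check; never run a repairing fsck on a mounted filesystem
fsck.ext4 -n /dev/mapper/vg_root-lv_root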
ASKER CERTIFIED SOLUTION
sun hongda
