
Auto-reboot without any signal and logging

We have a SunFire v3800 machine (Solaris 9) that recently rebooted on its own without any warning. After it came back up, there was nothing about the reboot in /var/adm/messages or the other log files.

Nothing done to the OS...

Has anyone experienced something like this?
Are there any commands or tools available to enhance the logging?
1 Solution
There is a feature in Solaris called a deadman's kernel.  Sun had me install it earlier this year when trying to discover what was causing something similar.
Here's the info they gave me:


SYNOPSIS: KERNEL: HowTo: Enabling deadman kernel code

If you have one of the following configurations, you can use the
following procedure to enable the deadman timer and to try to catch
core information from a hard hang:

        Solaris[TM] 7 Operating Environment or higher;
        Solaris[TM] 2.6   with patch 105181-06 or higher;
        Solaris[TM] 2.5.1 with patch 103640-21 or higher.

Due to bug 4080160, it is likely unsafe to use deadman on versions of
Solaris previous to 2.5.1, as it has not yet been backported to any
such release.

#1      Enable savecore.
On Solaris 2.6 and previous releases, edit the file
"/etc/init.d/sysetup".  Uncomment the last six lines as below:

        ## Default is to not do a savecore
        if [ ! -d /var/crash/`uname -n` ]
        then mkdir -m 0700 -p /var/crash/`uname -n`
        fi
                        echo 'checking for crash dump...\c '
        savecore /var/crash/`uname -n`
                        echo ''

This will save the core files to a directory named after
your system in the directory "/var/crash".

On Solaris 7 and newer releases, use the dumpadm command
to verify that savecore is enabled.  It should show
something like this:

# dumpadm
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/tela
  Savecore enabled: yes

If it indicates that savecore is not enabled, see the
man page for dumpadm to find out how to enable it.
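
For anyone who wants to script that check, here is a minimal sketch. The function name savecore_enabled is made up for illustration, and it assumes the "Savecore enabled: yes" line format shown in the dumpadm output above. It reads from stdin so it can be tried against the sample text before pointing it at a live system.

```shell
#!/bin/sh
# Hypothetical helper: reads dumpadm-style output on stdin and succeeds
# only if the "Savecore enabled" line says "yes".
savecore_enabled() {
    awk -F': *' '/Savecore enabled/ { found = ($2 == "yes") }
                 END { exit (found ? 0 : 1) }'
}

# Sample input copied from the output shown above. On a real system
# you would pipe the live command instead:  dumpadm | savecore_enabled
if savecore_enabled <<'EOF'
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/tela
  Savecore enabled: yes
EOF
then
    echo "savecore is enabled"
else
    echo "savecore is NOT enabled"
fi
```

If the check fails, the dumpadm man page describes the -y option for turning savecore on.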

#2      Enable the deadman timer kernel parameter.
In /etc/system, add the following lines:

        set snooping=1
        set snoop_interval=9000000

The snooping=1 entry enables the deadman code.
The snoop_interval=9000000 entry will enable the deadman after 90
seconds (against the default of 500 seconds) of system inactivity (no clock
interrupts).

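
A quick way to confirm the setting is really active after editing is sketched below. The function name deadman_enabled is hypothetical, and the demonstration runs against a scratch file rather than the real /etc/system. It assumes a grep that understands -E and a mktemp command; on older Solaris releases you may need /usr/xpg4/bin/grep instead of the default one.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given /etc/system-style file has an
# active (uncommented) "set snooping=1" line. Comment lines in
# /etc/system start with "*", so they do not match the pattern.
deadman_enabled() {
    grep -E '^[[:space:]]*set[[:space:]]+snooping[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1" >/dev/null
}

# Demonstration against a scratch copy, not the real /etc/system.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
* deadman timer settings
set snooping=1
set snoop_interval=9000000
EOF
if deadman_enabled "$tmp"; then
    echo "deadman timer is enabled in $tmp"
else
    echo "deadman timer is NOT enabled in $tmp"
fi
rm -f "$tmp"
```

Remember that /etc/system is only read at boot, so the change takes effect after the next reboot.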
#3      When the next hang occurs, hopefully the deadman
timer will be triggered, and the system will drop to the ok prompt.

At this point, any specific debugger commands can be run to examine
the current state of the system.  Of particular interest are:

        .registers         dump the registers
        ctrace             dump the current stack backtrace

As of Solaris 8 the system will no longer drop to the ok prompt; instead
a panic is initiated, creating a corefile with a panic string of
"deadman: timed out after %d seconds of clock inactivity".

#4      When data collection is complete (make sure to
write down the results, since they will not be recorded
on the system), attempt to take a core dump by doing:

        ok sync

As of Solaris 8, step 4 is no longer needed, as the system will go
through the panic process of creating a core image and rebooting
the system.
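
On Solaris 8 and later, then, the thing to look for after an unexplained reboot is that panic string. A minimal sketch follows; the function name find_deadman_panics is made up, and whether the string actually shows up in /var/adm/messages after the reboot is an assumption, so check the savecore directory as well.

```shell
#!/bin/sh
# Hypothetical helper: print any lines in the given log files that
# contain the start of the deadman panic string quoted above.
find_deadman_panics() {
    grep 'deadman: timed out after' "$@"
}

# Typical use (assumption: the panic string was logged after the reboot):
#   find_deadman_panics /var/adm/messages*
# And look for a saved dump in the savecore directory reported by dumpadm:
#   ls -l /var/crash/`uname -n`
```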

If the system hangs with the deadman kernel, you can try
again.  Although this is not necessarily a hardware problem,
you also should consider a hardware solution if you are certain
there is not a configuration or patch problem.

These are the instructions for turning on the deadman timer for sun4m and
sun4d architectures prior to bug fix #1249985.

Note: this needs to be done for each boot, since the changes are made
      only in memory and are not saved (and can't be saved) to disk.

(Parts that you type are underlined.)

1) Boot kadb w/o starting the kernel:

        ok boot kadb -d
        Rebooting with command: kadb -d
        Boot device: ...
   Just type a <Return> at this point
        kadb: kernel/unix
        ... loaded - ... bytes used

2) Next, initialize the deadman timer interrupt vector:
        kadb[0]: _int_vector+38/W _deadman
        _int_vector+0x38:       0xf0044094      =       0xf0041f14

   As a double check, the last number above (0xf0041f14) should match
   the one derived from:
        kadb[0]: _deadman=X
   It is important that you say "_deadman" and NOT "deadman" (without
   the leading '_': both symbols exist in the kernel and they are
   different routines)

3) The following is optional.  It acts as a double check to make sure
   the deadman timer is working.  

   Set a debugger breakpoint at the "deadman" (no leading "_") routine:
        kadb[0]: deadman:b

   We will see the effect of this later.

4) Start the kernel, and let the system boot up:
        kadb[0]: :c
        SunOS Release ...

5) Once the system is up, run the enabler as root:
        # ./enable_profiler

   (The source code for enable_profiler can be found below.)

6) If you set the breakpoint at step 3, you will see:
        breakpoint      deadman:        save %sp, -0x68, %sp

   This is the double check.  It means the timer is working, and the
   kernel is running the deadman routine.

   Remove the breakpoint, and continue:
        kadb[0]: .:d
        kadb[0]: :c

7) You should now see on the console something like:
        NOTICE: Profiling kernel, textsize = 713484 [f0040000..f00ee30c]
        NOTICE: Profiling modules, size = 6116444 [fc01aab8..fc5eff14] (1140097
        NOTICE: need 162964 bytes per cpu for clock sampling

8) When the next hang occurs, hopefully the deadman timer will be
   triggered, and the system will drop into kadb:

        # ~stopped      at      0xfbd01028:     ta      0x7d

   At this point, any specific debugger commands can be run to examine
   the current state of the system.  Of particular interest are:
        $r              dump the registers
        $c              dump the current stack backtrace
        freemem/D       see how much memory is free

9) When kadb debugging is complete, attempt to take a core dump by doing:
        kadb[0]: $q
        ok sync

   Of course, "savecore" should have been enabled in /etc/init.d/sysetup
   for the dump to be saved.

The code for enable_profile.c follows:

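#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IOC_OUT         0x40000000      /* copy out parameters */
#define IOCPARM_MASK    0xff            /* parameters must be < 256 bytes */
#define _IOR(x, y, t)   (IOC_OUT|((((int)sizeof(t))&IOCPARM_MASK)<<16)|((x)<<8)|(y))

#define INIT_PROFILING          _IOR('p', 1, int)

int
main(void)
{
        int fd;
        int ret;

        /* open the kernel profiling device */
        fd = open("/dev/profile", O_RDWR);
        if (fd == -1) {
                perror("open /dev/profile");
                exit(1);
        }
        /* starting the profiler is what arms the timer interrupt */
        ret = ioctl(fd, INIT_PROFILING, 0);
        if (ret == -1) {
                perror("ioctl INIT_PROFILING");
                exit(1);
        }
        return (0);
}
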
Keywords: dead, man, enable, deadman, timer, solaris, kernel
APPLIES TO: Hardware, Operating Systems/Solaris/Solaris 2.x, AFO Vertical Team Docs, AFO Vertical Team Docs/Kernel
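
As a sanity check on the ioctl number used by the enabler, the arithmetic can be worked through in the shell. This is only a sketch: it assumes the conventional BSD-style _IOR encoding (the IOC_OUT direction bit, the parameter size masked with IOCPARM_MASK and shifted left 16, the group character shifted left 8, then the command number) and a 4-byte int.

```shell
#!/bin/sh
# Work out INIT_PROFILING = _IOR('p', 1, int) by hand.
ioc_out=$((0x40000000))        # IOC_OUT from the listing above
iocparm_mask=$((0xff))         # IOCPARM_MASK from the listing above
size_of_int=4                  # assumption: sizeof(int) is 4 bytes
group=$(printf '%d' "'p")      # ASCII code of the group character 'p'
cmd=1                          # command number within the group

init_profiling=$(( ioc_out | ((size_of_int & iocparm_mask) << 16) | (group << 8) | cmd ))
printf 'INIT_PROFILING = 0x%x\n' "$init_profiling"
# prints: INIT_PROFILING = 0x40047001
```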

After I did this, the next crash gave me a crash dump, while the previous half dozen did not. (In my case it was a bad CPU)
Unfortunately, it's nearly impossible to determine retroactively what caused the crash without one of the following:
   - a core file
   - error message(s)
   - duplicating the crash by performing a task similar to the one that was running when it occurred (e.g. a backup)

Try to determine when the system crashed, using the 'last' command and figure out what the system was doing at that time.
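
Something along these lines can pull just the reboot records out of the 'last' output. It is a rough sketch: the function name reboot_entries is made up, and it assumes the boot records appear under the pseudo-user name "reboot" in the first column.

```shell
#!/bin/sh
# Hypothetical helper: keep only the "reboot" pseudo-user entries from
# last(1) output supplied on stdin.
reboot_entries() {
    awk '$1 == "reboot"'
}

# Typical use on a live system (last lists the most recent entries first):
#   last | reboot_entries | head -5
```

Compare the reboot times with cron schedules, backup windows and application logs to see what was running just before each one.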

Otherwise, make sure you are properly patched, esp. kernel levels.

Good luck...
Can you post a copy of your /etc/system? Also, do you use Veritas Volume Manager?

No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Points split liddler & austinwmatthews

Please leave any comments here within the next four days.


EE Cleanup Volunteer
tomtomthecat (Author) commented:
Many thanks, Liddler, for the info.
