Solved

Auto-reboot without any signal and logging

Posted on 2003-03-28
Last Modified: 2013-12-27
We have a SunFire v3800 machine (Solaris 9) that recently rebooted on its own without any warning. After it came back up, nothing was recorded in /var/adm/messages or the other log files.

Nothing was changed on the OS.

Has anyone experienced this?
Are there any commands or tools available to enhance the logging?
Question by:tomtomthecat
5 Comments
 
LVL 18

Accepted Solution

by:
liddler earned 750 total points
ID: 8226089
There is a feature in Solaris called a deadman's kernel.  Sun had me install it earlier this year when trying to discover what was causing something similar.
Here's the info they gave me:

INFODOC ID: 13258

SYNOPSIS: KERNEL: HowTo: Enabling deadman kernel code    
DETAIL DESCRIPTION:
If you have one of the following configurations, you can use the
following procedure to enable the deadman timer and to try to catch
core information from a hard hang:

        Solaris[TM] 7 Operating Environment or higher;
        Solaris[TM] 2.6   with patch 105181-06 or higher;
        Solaris[TM] 2.5.1 with patch 103640-21 or higher;

Due to bug 4080160, it is likely unsafe to use deadman on versions of Solaris
previous to 2.5.1, as it has not yet been backported to any such release.
                     
#1      Enable savecore

On Solaris 2.6 and previous releases, edit the file "/etc/init.d/sysetup".
Uncomment the last six lines as below:
-------------------------------------------------------------
##
## Default is to not do a savecore
##
if [ ! -d /var/crash/`uname -n` ]
then mkdir -m 0700 -p /var/crash/`uname -n`
fi
echo 'checking for crash dump...\c '
savecore /var/crash/`uname -n`
echo ''
-------------------------------------------------------------
This will save the core files to a directory named after
your system in the directory "/var/crash".

On Solaris 7 and newer releases, use the dumpadm command
to verify that savecore is enabled.  It should show
something like this:

# dumpadm
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/tela
  Savecore enabled: yes

If it indicates that savecore is not enabled, see the
man page for dumpadm to find out how to enable it.
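
For reference, a minimal shell sketch of that check. The helper name savecore_enabled is hypothetical and assumes a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris). The -y and -s options shown in the comments are the ones the dumpadm man page describes for enabling savecore and setting its directory; confirm them against the man page on your release before relying on them.

```shell
# Hypothetical helper: read `dumpadm` output on stdin and print the value
# of the "Savecore enabled" field ("yes" or "no"), or "unknown" if absent.
savecore_enabled() {
    awk -F': *' '
        /Savecore enabled/ { print $2; found = 1 }
        END { if (!found) print "unknown" }
    '
}

# Typical use on Solaris 7 or later, as root (not run here):
#   dumpadm | savecore_enabled
#   dumpadm -y                          # run savecore automatically on reboot
#   dumpadm -s /var/crash/`uname -n`    # choose the savecore directory
```

If the helper prints "no", running dumpadm -y and then re-running dumpadm should show "Savecore enabled: yes".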

#2      Enable the deadman timer kernel parameter.

In the /etc/system file, add the following lines:

        set snooping=1
        set snoop_interval=9000000

The snooping=1 entry enables the deadman code.

The snoop_interval=9000000 entry will enable the deadman after 90
seconds (against the default of 500 seconds) of system inactivity (no clock
interrupts).
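
A quick way to confirm the entry is in place is sketched below. The function name deadman_configured is hypothetical, and the pattern only tolerates spaces (not tabs) around the tokens. Lines beginning with "*" are comments in /etc/system and are skipped because the pattern is anchored at the start of the line. Note that /etc/system is read at boot, so the setting takes effect only after the next reboot.

```shell
# Hypothetical helper: succeed if the given file (default /etc/system)
# contains an active, uncommented "set snooping=1" line.
deadman_configured() {
    grep '^ *set  *snooping *= *1 *$' "${1:-/etc/system}" >/dev/null 2>&1
}

# Example (not run here):
#   deadman_configured && echo "deadman timer will be enabled at next boot"
```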

#3      When the next hang occurs, hopefully the deadman
timer will be triggered, and the system will drop to the ok prompt:

        ok

At this point, any specific debugger commands can be run to examine
the current state of the system.  Of particular interest are:
        .registers         dump the registers
        ctrace             dump the current stack backtrace                

As of Solaris 8 the system no longer drops to the ok prompt. Instead, a
panic is initiated, creating a corefile with a panic string of
"deadman: timed out after %d seconds of clock inactivity".

#4      When data collection is complete (make sure to
write down the results, since they will not be recorded
on the system), attempt to take a core dump by doing:

        ok sync

As of Solaris 8 step 4 is no longer needed, as the system will go
through the panic process of creating a core image and rebooting the system.
                 ----

If the system hangs with the deadman kernel, you can try
again.  Although this is not necessarily a hardware problem,
you also should consider a hardware solution if you are certain
there is not a configuration or patch problem.                                
 
These are the instructions for turning on the deadman timer for sun4m and
sun4d architectures prior to bug fix #1249985.  

Note: this needs to be done for each boot since the changes are only
      incorporated into memory, and are not saved (and can't be saved) to disk.


(Parts that you type are underlined.)

1) Boot kadb w/o starting the kernel:

        ok boot kadb -d
           ------------
        Resetting...
        <banner>
        Rebooting with command: kadb -d
        Boot device: ...
        kadb:
   Just type a <Return> at this point
        kadb: kernel/unix
        ... loaded - ... bytes used
        kadb[0]:

2) Next, initialize the deadman timer interrupt vector:
        kadb[0]: _int_vector+38/W _deadman
                 -------------------------
        _int_vector+0x38:       0xf0044094      =       0xf0041f14

   As a double check, the last number above (0xf0041f14) should match
   the one derived from:
        kadb[0]: _deadman=X
                 ----------    
                f0041f14
       
   It is important that you say "_deadman" and NOT "deadman" (without
   the leading '_': both symbols exist in the kernel and they are
   different routines)

3) The following is optional.  It acts as a double check to make sure
   the deadman timer is working.  

   Set a debugger breakpoint at the "deadman" (no leading "_") routine:
        kadb[0]: deadman:b
                 ---------

   We will see the effect of this later.

4) Start the kernel, and let the system boot up:
        kadb[0]: :c
                 --
        SunOS Release ...

5) Once the system is up, run the enabler as root:
        # ./enable_profiler
          -----------------

   (The source code for enable_profiler can be found below.)

6) If you set the breakpoint at step 3, you will see:
        breakpoint      deadman:        save %sp, -0x68, %sp
        kadb[0]:

   This is the double check.  It means the timer is working, and the
   kernel is running the deadman routine.

   Remove the breakpoint, and continue:
        kadb[0]: .:d
                 ---
        kadb[0]: :c
                 --

7) You should now see on the console something like:
        NOTICE: Profiling kernel, textsize = 713484 [f0040000..f00ee30c]
        NOTICE: Profiling modules, size = 6116444 [fc01aab8..fc5eff14] (1140097 used)
        NOTICE: need 162964 bytes per cpu for clock sampling


8) When the next hang occurs, hopefully the deadman timer will be
   triggered, and the system will drop into kadb:

        # ~stopped      at      0xfbd01028:     ta      0x7d
        kadb[0]:

   At this point, any specific debugger commands can be run to examine
   the current state of the system.  Of particular interest are:
        $r              dump the registers
        $c              dump the current stack backtrace
        freemem/D       see how much memory is free

9) When kadb debugging is complete, attempt to take a core dump by
   doing:
        kadb[0]: $q
                 --
        ok sync
           ----

   Of course, "savecore" should have been enabled in /etc/init.d/sysetup
   for the dump to be saved.

The code for enable_profiler.c follows:

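#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>      /* perror */
#include <stdlib.h>     /* exit */
#include <unistd.h>     /* ioctl (declared here on Solaris) */

#define IOC_OUT         0x40000000      /* copy out parameters */
#define IOCPARM_MASK    0xff            /* parameters must be < 256 bytes */
#define _IOR(x, y, t)   (IOC_OUT|((((int)sizeof(t))&IOCPARM_MASK)<<16)|((x)<<8)|(y))

/* ioctl request sent to the profile driver to start kernel profiling */
#define INIT_PROFILING          _IOR('p', 1, int)

int
main(void)
{
        int fd;
        int ret;

        /* open the kernel profiling device */
        fd = open("/dev/profile", O_RDWR);
        if (fd == -1) {
                perror("open /dev/profile");
                exit(1);
        }
        ret = ioctl(fd, INIT_PROFILING, 0);

        if (ret == -1) {
                perror("ioctl INIT_PROFILING");
                exit(1);
        }
        return 0;
}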

Keywords: dead, man, enable, deadman, timer, solaris, kernel

APPLIES TO: Hardware, Operating Systems/Solaris/Solaris 2.x, AFO Vertical Team Docs, AFO Vertical Team Docs/Kernel
ATTACHMENTS:

After I did this, the next crash gave me a crash dump, while the previous half dozen did not. (In my case it was a bad CPU)
HTH
 

Expert Comment

by:austinwmatthews
ID: 8226100
Unfortunately, it's nearly impossible to determine retroactively what caused the crash without one of the following:
   - a core file
   - error message(s)
   - reproducing the crash by performing a task similar to the one that was running when it occurred (e.g. a backup)
   

Try to determine when the system crashed using the 'last' command, and figure out what the system was doing at that time.
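
As a rough illustration, 'last reboot' lists the reboot records from the login accounting data, and 'who -b' prints the time of the most recent boot. The small filter below, whose name reboot_times is made up for this sketch, pulls the reboot lines out of 'last' output so the times can be compared with cron jobs, backups, or other scheduled activity.

```shell
# Hypothetical helper: keep only the "reboot" records from `last` output on stdin.
reboot_times() {
    grep '^reboot'
}

# On the affected machine (not run here):
#   last | reboot_times      # or simply: last reboot
#   who -b                   # time of the most recent boot
```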


Otherwise, make sure you are properly patched, especially at the kernel level.

Good luck...
 

Expert Comment

by:SimB
ID: 8260884
Can you post a copy of your /etc/system? Also, do you use Veritas Volume Manager?
 
LVL 18

Expert Comment

by:liddler
ID: 10476849

No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Points split liddler & austinwmatthews

Please leave any comments here within the next four days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

liddler
EE Cleanup Volunteer
 

Author Comment

by:tomtomthecat
ID: 10488077
Many thanks, liddler, for the info.