Server reboot when opening console

Hi all,

I had this problem on a server running Solaris 8.
The "who -r" command showed the server in run level 6 (instead of 3, as I would normally expect).
Trying to find out why it was in run level 6, I ran ps and grepped for rc. I saw that an application's kill script had been hanging there for a week. So I suppose someone tried a reboot a week ago, but for whatever reason it did not complete. (Someone actually logged in at that time.)
This was all checked via a ssh connection.
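
For anyone who wants to repeat the same check, here is a minimal sketch of it. The function names are made up for illustration, and the exact column layout of "who -r" output may differ on your system, so treat the parsing as an assumption.

```shell
# Extract the run level from one line of `who -r` output (read from stdin).
# Assumes the line contains the word "run-level" followed by the level.
run_level_from_who() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "run-level") { print $(i + 1); exit } }'
}

# List processes whose command line mentions an rc script (e.g. /sbin/rc6).
# The bracket expression keeps the grep process itself out of the result.
rc_processes() {
    ps -ef | grep '[/ ]rc[0-6S]'
}
```

Typical use would be `who -r | run_level_from_who` followed by `rc_processes`; anything printed by the second command while the box is supposedly up and running is worth a closer look.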
Next, I wanted to open a console connection to the server (via an Avocent console). When I did this, at the moment the console screen popped up (just a telnet window), the server simply continued the reboot!
Unfortunately it's a production server, but that's not your problem :-).

Now I'm wondering:
could someone have started a reboot and cancelled at some point or just left it hanging at the (malfunctioning?) kill script and then,
how is it possible this reboot continues on opening a console window?

Any suggestions are appreciated (I have some explaining to do but don't really have a clue...).

Thanks.

wesly_chen

> could someone have started a reboot and cancelled at some point or just left it hanging at the (malfunctioning?)
> kill script and then, how is it possible this reboot continues on opening a console window?
Or the shutdown process hung while trying to kill some process, so the Solaris box stayed in run level 6.
When you connect to the console, it requests login authentication, which triggers a chain of processes that finally
times out the hung "kill" process, so the system continues rebooting.

Regards,

Wesly

PsiCop

Run Level 6 = Whatever the system's default Run Level is. If the system's Run Level is normally Run Level 3, then booting to Run Level 6 will take it to Run Level 3. If it's normally Run Level 2, then booting to Run Level 6 will take it to Run Level 2.

wesly_chen

> Run Level 6 = Whatever the system's default Run Level is
As I understand it, run level 6 in Solaris means "reboot" the system to the default run level.
So "init 6" will first go to run level 0 (or kill all the processes, as run level 0 does),
then restart the system and go to run level 2 or 3.

In this case, the system is hanging while killing a certain process, before it gets to the restart.

Wesly

yuzh

init 6 = reboot

to learn more about Solaris run level:
man init

also have a look at "Booting process in Solaris":
http://www.adminschoice.com/docs/booting_process_in_solaris.htm

eszet

ASKER

init 6 will go to runlevel 0 and then restart (to runlevel 3 in our case). That's correct and not the problem.

What I don't understand is how opening a console continues the reboot that was hanging at the kill script.
Wesly, I don't think your answer explains this behaviour. Can you clarify this a bit more?
>>>
"While you login to the console and request a login authentication which triggers some chain processes and finally
timeout the hang "kill" process so the system continues rebooting."
<<<

Thanks.

Nukfror

If a kill script is just sitting there, the problem isn't the reboot -- the problem is the kill script. Is the kill script a system standard kill script or is it something you guys added for some application of yours ?

You need to do some research into what this kill script is trying to do. Maybe it's hanging on a stale NFS mount ...

Maybe it's stuck in an infinite loop. Until the kill script returns, the system isn't going to proceed to the next script.
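
To illustrate why a script that never returns blocks everything after it, and one way to guard against that, here is a rough sketch of a timeout wrapper in plain sh. This is only an assumption about how you might harden a custom kill script, not something the stock Solaris rc scripts do, and the function name is invented.

```shell
# Run a command, but send it SIGTERM if it has not finished after
# <limit> seconds. Returns the command's exit status (non-zero if killed).
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: sleeps, then terminates the command if it is still running.
    ( sleep "$limit"; kill -TERM "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!
    # Block here until the command ends, either on its own or via the watchdog.
    wait "$cmd_pid" && status=0 || status=$?
    # The command is done, so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true
    return "$status"
}
```

Something like `run_with_timeout 60 /etc/init.d/myapp stop` (the path is hypothetical) would let the shutdown move on after a minute instead of waiting forever.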

Nukfror

Oh wait ... sorry ... I see the spot you are in now. Sounds like you're stuck in a reboot with no command-line access.

Maybe someone wrote a wrapper around the /etc/rc# scripts ? This is really weird.

wesly_chen

> I see that an application's kill script is hanging there for 1 week
Could you tell us which application?

When the system is going down for a halt or reboot (run level 0 or run level 6), the shutdown process runs the K* scripts in
/etc/rc0.d. Each /etc/rc0.d/K<number><daemon> script (a link to a script in /etc/init.d) stops the processes it is responsible for.
The lowest number runs first. The scripts are executed sequentially, so one script that doesn't finish will cause the whole shutdown
process to hang there.
If you have a customized kill script, or have modified the scripts in /etc/init.d, then you may want to check them first.
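
Roughly, the sequential behaviour described above can be sketched like this. It is a simplified illustration, not the actual /sbin/rc0 or /sbin/rc6 code, and the function name is made up.

```shell
# Run every K* script in a directory, one after another, with "stop".
# The shell expands K* in sorted order, so K10... runs before K20...
run_kill_scripts() {
    rc_dir=$1
    for script in "$rc_dir"/K*; do
        # Skip the unexpanded pattern if the directory has no K* files.
        [ -f "$script" ] || continue
        echo "stopping: $script"
        # Runs in the foreground: the loop cannot advance until this
        # script returns, so one hung script stalls the whole shutdown.
        /bin/sh "$script" stop
    done
}
```

Called as `run_kill_scripts /etc/rc0.d`, the loop would sit on whichever script hangs, which matches a kill script showing up in ps for a week.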

In my experience, most of the time it is the NFS mountd daemon that hangs, because of a design flaw in NFS.

However, without knowing which application you mean, I can't give you more details.

Wesly

ASKER CERTIFIED SOLUTION

Gugro

This solution is only available to members.


eszet

ASKER

Wesly,
I'm aware of the Solaris reboot process. I know that if a kill script doesn't finish, the whole shutdown process will hang. But that's not really the problem. I don't even think there's a problem with the script: it worked fine before, and also when I verified it during the last reboot on Monday. I rather think someone may have cancelled the shutdown process somehow, though I'm not sure how. I see some apps were started manually after that moment, which is why we thought the server was running fine, until we noticed it was in run level 6 with the kill script hanging. (By the way, the application is Pega from Pegasystems Inc.)

What I'm looking for is how it's possible that opening a console connection continued the reboot. I've no console logging, so unfortunately I can't see at what point it continued.

Gugro,
that's more like what I'm looking for. But I don't think it explains this, because we never have a console connected to the servers unless needed. So according to your explanation, that would cause problems all the time by filling the buffer, which is not the case?

Thanks.

SOLUTION

wesly_chen
