Pre-testing ssh connection

I have to administer many RHEL4 Linux nodes from my desktop. I do this by passing commands from the desktop to the remote nodes through a trusted ssh channel. It works fine except when a remote node is in a semi-dead state: the network is alive and the remote ssh server accepts the connection but does nothing beyond that. As a result the ssh connection neither fails nor completes, i.e. my command hangs. I can interrupt it with Ctrl-C and go on to the next node, but that works only if I am running the script interactively. How can I skip such a node in a cron job?

I tried pre-testing the ssh connection with "ssh -o BatchMode=yes -o ConnectTimeout=2 nodexx /bin/true" but this does not time out after 2 seconds.

Bibliophage

There are perhaps a few ways to get around it.

In your ssh_config file (/etc/ssh/ssh_config in Slackware), you have the option for a couple of variables that might help.

-----
ConnectionAttempts
Specifies the number of tries (one per second) to make before exiting. The argument must be an integer. This may be useful
in scripts if the connection sometimes fails. The default is 1.

ConnectTimeout
Specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout.
This value is used only when the target is down or really unreachable, not when it refuses the connection.
-----

The 'ConnectTimeout' might help you out here - if it doesn't get a full connection in X amount of time, it should disconnect the session and drop back to shell. In Slackware, the default ConnectTimeout is 0 - or disabled. (Actually, 0 is even commented out. So it's probably 0 by default). I haven't logged into one of my RH boxes to check this one.

Another workaround would be to have a process 'watch' the SSH stream, and keep an eye out - if it doesn't see the shell prompt within X seconds, terminate and go to the next.

Hopefully the ConnectTimeout fixes the problem.
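The "watch it and terminate after X seconds" idea above can be sketched in plain shell. This is only an illustration: run_with_timeout is a made-up helper name, and the sketch watches for the process finishing rather than for a shell prompt appearing.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of ssh): runs COMMAND in the background
# and kills it if it has not finished within SECONDS.
# Returns COMMAND's exit status, or 137 (128 + SIGKILL) if it was killed.
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: sleep, then kill the command if it is still around.
    ( sleep "$limit"; kill -9 "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    dog_pid=$!
    status=0
    wait "$cmd_pid" 2>/dev/null || status=$?
    # The command is done (or dead); the watchdog is no longer needed.
    kill "$dog_pid" 2>/dev/null
    wait "$dog_pid" 2>/dev/null
    return "$status"
}

# Usage (nodexx is a placeholder host name):
#   run_with_timeout 5 ssh -n -o BatchMode=yes nodexx /bin/true \
#       || echo "nodexx skipped"
```

Because nothing here needs a terminal, the same function should work from a cron job as well as interactively.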

vinod

ASKER

As I said in my original posting:

I tried pre-testing the ssh connection with "ssh -o BatchMode=yes -o ConnectTimeout=2 nodexx /bin/true" but this does not time out after 2 seconds.

ConnectTimeout given on the command line should override the default, but it does not work :(
'watch' also works only interactively. I need something that works in batch mode.

Bibliophage

If you need it working in batch mode, I'd try the ConnectTimeout in the main configuration file, rather than command line. It may be that it won't work correctly from the command line.
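One way to try this without touching the system-wide file is a private config passed with ssh -F. A minimal sketch follows; the helper name, file path and values are assumptions. Per the manual text quoted earlier, ConnectTimeout may cover only the connection phase, so this may still not catch a server that accepts the connection and then goes silent.

```shell
#!/bin/sh
# write_probe_config FILE
# Hypothetical helper: writes a small ssh client config used only for
# the pre-test, leaving /etc/ssh/ssh_config untouched.
write_probe_config() {
    cat >"$1" <<'EOF'
# Assumed values; tune them for your network.
Host *
    BatchMode yes
    ConnectTimeout 2
    ConnectionAttempts 1
EOF
}

# Usage (path and host name are placeholders):
#   write_probe_config "$HOME/.ssh/probe_config"
#   ssh -F "$HOME/.ssh/probe_config" nodexx /bin/true
```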

You _could_ run a loop that first attempts a telnet session to the ssh port - if the SSH isn't responding correctly, it won't give the right answer.

You might test it with the next 'hung' server - do a telnet to SSH.

It should give you something like the following
-----
Connected to localhost.
Escape character is '^]'.
SSH-2.0-OpenSSH_4.4
-----

Run the telnet, then break the connection (run telnet, redirect its output to a file, capture the PID, wait five seconds, kill the PID). Parse the output from telnet and pass a boolean that tells the script either to run SSH or to skip to the next machine.
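The telnet recipe above might look roughly like this in shell. It is a sketch under assumptions: the function names are invented, port 22 is assumed, and telnet's exact behaviour on a closed stdin differs between implementations, which is why a sleep is piped into it.

```shell
#!/bin/sh
# has_ssh_banner: succeeds if stdin contains an SSH identification
# line such as "SSH-2.0-OpenSSH_4.4".
has_ssh_banner() {
    grep -q '^SSH-[0-9.]*-'
}

# sshd_answers HOST [SECONDS]
# Hypothetical helper: run telnet in the background, wait, kill it,
# then check whether the captured output shows an SSH banner.
sshd_answers() {
    probe_out=$(mktemp) || return 1
    # The piped sleep keeps telnet's stdin open so it does not quit at once.
    sleep "${2:-5}" | telnet "$1" 22 >"$probe_out" 2>&1 &
    probe_pid=$!
    sleep "${2:-5}"
    kill "$probe_pid" 2>/dev/null
    wait "$probe_pid" 2>/dev/null
    probe_status=0
    has_ssh_banner <"$probe_out" || probe_status=$?
    rm -f "$probe_out"
    return "$probe_status"
}

# Usage (nodexx is a placeholder host name):
#   sshd_answers nodexx 5 && ssh -n nodexx my-command
```

Note that this simple version always waits the full number of seconds, even for healthy hosts, so it adds that delay to every node in the loop.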

Additional options that may or may not help:

ServerAliveCountMax
ServerAliveInterval
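For illustration, these two options can be passed on the command line like any other -o setting. The wrapper name and the values below are assumptions. As far as I know the keepalive messages travel inside the encrypted channel, so they would mainly help when a node goes silent after login, not when sshd never answers in the first place.

```shell
#!/bin/sh
# remote_run HOST COMMAND [ARGS...]
# Hypothetical wrapper: with these assumed values ssh should give up
# after roughly 10 seconds (2 unanswered probes, 5 seconds apart).
remote_run() {
    rr_host=$1
    shift
    ssh -n \
        -o BatchMode=yes \
        -o ConnectTimeout=2 \
        -o ServerAliveInterval=5 \
        -o ServerAliveCountMax=2 \
        "$rr_host" "$@"
}

# Usage (nodexx is a placeholder host name):
#   remote_run nodexx uptime || echo "nodexx failed or timed out"
```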

Also, I don't know if it helps, but here's a link to a suggestion to another person with the same issue.

http://www.ibm.com/developerworks/forums/dw_thread.jsp?message=13889433&cat=5&thread=142538&treeDisplayType=threadmode1&forum=160#13889433

Bibliophage

As I don't know if my suggestion could help, I have no objections to either having it finalized, or simply removed. I could see the information assisting someone else, but as it's incomplete, the assistance would be minor.

vinod

ASKER

I got around this problem by adding a host_alive shell function that tests the ssh connection in the background. It returns 0 on success; otherwise it cleans up the hung ssh and returns 1. The main loop passes the command via ssh only if host_alive tests OK.

Since this solution was inspired by suggestions from Bibliophage, moderator may award the points to him/her.

#!/bin/sh

# Run a command on a remote host via ssh only if the remote sshd
# is actually responding to ssh connections. ssh keys are assumed
# to be already set up.

host_alive ()
{
    host=$1
    if ! ping -c 1 -q -w 5 "$host" >/dev/null 2>&1; then
        echo "$host does not ping"
        return 1
    fi
    # Run the test in the background with a timeout of 5 secs
    # (50 polls, 0.1 sec apart).
    ssh -n "root@$host" /bin/true >/dev/null 2>&1 &
    pid=$!      # PID of the background ssh; no need to grep ps output
    timeo=50
    while [ $timeo -gt 0 ] && kill -0 $pid 2>/dev/null; do
        usleep 100000       # Red Hat utility; elsewhere try "sleep 0.1"
        timeo=$((timeo-1))
    done
    if kill -0 $pid 2>/dev/null; then
        # Still running after the timeout: clean up the hung ssh.
        kill -9 $pid >/dev/null 2>&1
        echo "$host pings but does not ssh"
        return 1
    fi
    return 0
}

# The main loop

while read h; do
    # -n stops ssh from reading stdin and eating the rest of hosts.lis.
    host_alive "$h" && ssh -n "$h" my-command
done < hosts.lis

Bibliophage

No real objection. I may have pointed him the right way, but he came up with his own solution.
