• Status: Solved

Bash Shell Script to Echo and Write to a Text File

Hi, I am working with a ROCKS cluster on RHEL 5.9. It is running PBS as the grid. I want to make sure that all of the nodes of the cluster are responsive, so I need a script that will grab the name and a date and time stamp from each node and write that info back to a text file on the front end node.

Can someone provide a quick example? Google is not of much help on the subject!
3 Solutions
You can run a cron job on each node, say every 5 minutes, that runs:

myhost=$(hostname)
hostname > /tmp/$myhost
date >> /tmp/$myhost

You could then schedule an ftp or sftp job to copy the files to the front end server.
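A minimal sketch of the per-node cron script described above, in case it helps. The HEARTBEAT_DIR variable and the write_heartbeat function name are made up for illustration, and the fallback to uname -n is only there for systems that lack the hostname command.

```shell
#!/bin/bash
# Heartbeat sketch: each node writes its name and a timestamp to one file.
# HEARTBEAT_DIR is an assumed location; point it wherever suits your setup.
HEARTBEAT_DIR=${HEARTBEAT_DIR:-/tmp}

write_heartbeat() {
    local myhost
    # Prefer hostname; fall back to uname -n if hostname is not installed.
    myhost=$(hostname 2>/dev/null || uname -n)
    # First line: node name. Second line: current date and time.
    {
        echo "$myhost"
        date
    } > "$HEARTBEAT_DIR/$myhost"
}

write_heartbeat
```

Saved under a path of your choosing (for example the hypothetical /root/heartbeat.sh), a crontab line such as `*/5 * * * * /root/heartbeat.sh` would run it every 5 minutes.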
Or, if you have password-less logins to the nodes in the cluster (with ssh keys), you could run this as root on your central server.  Have a file with just a list of node names in it called hosts.lst, then:
while read host; do
    # stdin comes from /dev/null so ssh does not swallow the rest of hosts.lst
    rsp=$(ssh $host 'echo $(hostname): $(date)' 2>/dev/null </dev/null)
    if [ "$rsp" != "" ]; then
        echo "$rsp"
    else
        echo "Could not connect to $host at $(date)"
    fi
done < hosts.lst > hosts.log
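Once hosts.log exists, you will probably only care about the nodes that failed. Here is a small sketch that pulls them out; it assumes the failure lines look exactly like the "Could not connect to <host> at <date>" message the loop writes, and the function name unreachable_hosts is just an illustrative choice.

```shell
#!/bin/bash
# List the nodes that the polling loop could not reach.
# Assumes failure lines look like: "Could not connect to <host> at <date>".
unreachable_hosts() {
    local log=$1
    # The host name is the fifth whitespace-separated word on such lines.
    awk '/^Could not connect to / { print $5 }' "$log"
}
```

Calling `unreachable_hosts hosts.log` prints one unreachable node name per line, and prints nothing if every node answered.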


capperdog13 (Author) commented:
Great! Let me work with this and I will respond later today. The nodes do require a password, so I don't think the ssh script will apply.

Many thanks! Will get back with you.

Do you have the ganglia roll installed and enabled?  You can just use that to track all the compute nodes.

Rocks also includes the tentakel command to query all the hosts more quickly, since it forks all the calls at once.  It should be set up if you loaded all the compute nodes with the rocks installer.  If you want the results to come back in order, you can sort the results afterwards.  The while loop could take quite a while if you have a lot of compute nodes.

It's much simpler to run this line to query all the hosts simultaneously.  Your results will likely come back out of order, but it'll be much faster than running the while loop and waiting for each node's network to respond.

tentakel "hostname; date" >> compute_nodes.txt

If I remember correctly, I think you actually just need

tentakel date >> compute_nodes.txt

since tentakel already outputs the hostname of the system with the command.
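If you do want the results in node order, as mentioned above, here is a rough sketch of the sorting step. tentakel's exact output layout isn't shown in this thread, so this assumes you have one line per node that starts with a Rocks-style name such as compute-0-12 (the "host: date" lines from the ssh loop have that shape). The function name sort_by_node is made up.

```shell
#!/bin/bash
# Sort per-node lines into rack/rank order.
# Assumes each line starts with a Rocks-style name such as compute-0-12.
sort_by_node() {
    # Fields split on "-": name, rack, rank. Sort by rack, then rank, numerically
    # so that compute-0-2 comes before compute-0-10.
    sort -t- -k2,2n -k3,3n "$@"
}
```

For example, `sort_by_node hosts.log > hosts.sorted.log` would write an ordered copy. A plain sort would put compute-0-10 ahead of compute-0-2, which is why the numeric keys are used.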

The head node should have an ssh key automatically installed on each of the compute nodes already.  You shouldn't need a password when you run tentakel or ssh to the compute nodes, unless the installer messed up somehow or the system was corrupted by a user's code crashing.  That does happen frequently enough when you have hundreds of systems, but the compute nodes should be easy and quick to reinstall.

http://www.rocksclusters.org/roll-documentation/base/5.5/index.html

You can install other Linux distros with Rocks.  Rocks 6 is out, and it supports Red Hat 6.
capperdog13 (Author) commented:
Hi, yes, we do have Ganglia installed, and from the web front end all looks fine. Thanks for the tentakel date >> compute_nodes.txt command. It says all is fine as well.

I was just handed this old POS, so is it safe to say, from a high level, that this cluster is functioning as it should, going by the tentakel command and the Ganglia front end?
capperdog13 (Author) commented:
Also, I do notice one problem you may be able to help with. The nodes are not reloading when I tell them to on a hard reboot. PXE is enabled on the nodes and they do make contact with the front end, but the front end never sends a packet for the reload, a timeout occurs, and the node boots back up to the old image.

Any suggestions here?
From a high level, if Ganglia shows the system as functional and tentakel returns ok, then you should be good to go.

Your system is set to boot instead of install.  You need to change the setting on the head node with the rocks command:

rocks set host boot compute-0-0 action=install

Once the system is up, the action should revert.  If not, you can set the action back to boot.  You can list the settings with:

rocks list host boot
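If you need to flag several nodes at once, a loop around the two commands above is enough. This is only a sketch: the flag_for_install function and the ROCKS variable are made up for illustration, and ROCKS exists so you can dry-run the loop before touching the cluster.

```shell
#!/bin/bash
# Flag nodes for reinstall on their next PXE boot, then show the settings.
# ROCKS is overridable so the sketch can be dry-run, e.g. ROCKS="echo rocks".
ROCKS=${ROCKS:-rocks}

flag_for_install() {
    local node
    for node in "$@"; do
        # Same command as above, applied to each node named on the command line.
        $ROCKS set host boot "$node" action=install
    done
    # List the boot action for every host so the change can be confirmed.
    $ROCKS list host boot
}
```

With `ROCKS="echo rocks"` set, `flag_for_install compute-0-0 compute-0-1` only prints the commands it would run. Leave ROCKS unset on the head node to run them for real.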


Some helpful hints:

The best place to ask rocks questions is through the rocks mailing list.  They have more experienced users as well as the developers checking the list.  You can sign up here.  https://lists.sdsc.edu/mailman/listinfo.cgi/npaci-rocks-discussion

It's been a year since I touched a Rocks cluster.  It depends on the error message.  Problems happen frequently with rocks when users run their computations on the head node.  Keep your users from running their processes on the head node.  

The other thing to check is that the rocks kickstart directories are working.  They sometimes get corrupted and don't show up properly.  You need to check that Apache is started correctly on the head node and that it's sharing the rocks directories for the compute nodes to connect to.  They need them to install.

Sometimes the compute nodes get corrupted and you just need to boot a live distro on them to completely wipe the partition.  Unfortunately, the old kickstart on Red Hat 5.x doesn't work on a GUID partition, so anything 2 TB or larger needs to be tweaked.  It's simplest, and quickest, to put a smaller drive in the system as the primary boot drive and configure kickstart to mount the secondary drive for processing space.

If all else fails, sometimes you just have to redo the head node installation.  This will take some time, but once set up, the compute nodes are quick to install.  They will install very quickly, but out of order on the rack if you turn them on all at once.  If you want them installed in order, you'll have to turn them on one at a time starting with the first one.  You'll need to watch until DHCP accepts them.  They'll automatically be numbered starting with rack 0, computer 0.
capperdog13 (Author) commented:
Hey, thanks a bunch for all the info! I come from a Windows background and was literally tossed into the sea of Linux and told to fix that cluster...

I ran the commands on the head node and forced an install on one of the nodes. I checked it with rocks list host boot before I hard rebooted the node, but it still did not reload. The nodes are not getting the info back from the server to reload, like I mentioned earlier.

Anyway I am going to post this to the ROCKS site you gave me. You've been a big help!
Many thanks and have a happy holiday!!
capperdog13 (Author) commented:
The original question was about a script to help me check a ROCKS cluster. Simon supplied me with a couple of great examples. Thanks, Simon! I did get the most help from serial, who has ROCKS experience and went over and above with tips and links to help out.
Question has a verified solution.
