someITGuy

Nagios (Return code of 255 is out of bounds) using SSH

My Nagios server gets a "(Return code of 255 is out of bounds)" error when it tries to monitor the /tmp directory on an HP-UX box. The check runs over SSH.

Here is the entry from the config file under sshperf:

disk-usedpct    80      90      /tmp
 
Here is the entry from the config file under hosts:

 define service{
         use                             passiveonly-service             ; Name of service template to use

         host_name                       epcpdb01
         service_description             disk-usedpct-/tmp
         is_volatile                     0
         active_checks_enabled           0
         check_period                    24x7
         contact_groups                  coa-team,it-management,on-call,unix-group
         notification_interval           0
         notification_period             24x7
         notification_options            w,c,r
         check_command                   check_sshperf!/usr/local/nagios/etc/sshperf/epcpdb01.cfg
 }

Not sure what is causing the error. Any ideas?
Deepak Kosaraju

I haven't monitored an HP-UX box over SSH, but I have monitored filesystems on HP-UX using Nagios with NRPE agents, following the attached procedure for installing Nagios and NRPE on the HP-UX box.

In general:
You get "Return code of 255 is out of bounds" if the plugin directory is incorrect, if the plugin is not present, or if the nagios user doesn't have permission to run the plugin.
Nagios-NRPE-for-HPUX.pdf
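
The three causes listed above (wrong directory, missing plugin, no execute permission) can be checked quickly from a shell on the machine that runs the plugin. Below is a minimal Python sketch of such a check. The function name `diagnose_plugin` is made up for illustration, and the example path is the check_sshperf location mentioned elsewhere in this thread; adjust it for your installation. Run it as the account Nagios uses, since the permission test applies to the current user.

```python
import os


def diagnose_plugin(path):
    """Return a list of human-readable problems found with a plugin path."""
    problems = []
    if not os.path.exists(path):
        problems.append(f"{path}: file does not exist")
        return problems
    if not os.path.isfile(path):
        problems.append(f"{path}: not a regular file")
    # os.access tests the permissions of the *current* user, so this only
    # answers the question for the account the script is run under.
    if not os.access(path, os.X_OK):
        problems.append(f"{path}: not executable by the current user")
    return problems


if __name__ == "__main__":
    # Path taken from this thread; it is an assumption for any other setup.
    for problem in diagnose_plugin("/usr/local/nagios/libexec/check_sshperf"):
        print(problem)
```

An empty result only rules out these three causes; it says nothing about what the plugin does once it starts.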
someITGuy

ASKER

The plugin only fails for a couple of filesystems; the rest of the filesystems on the HP-UX box do not get this error.

When you say "if the plugin directory is incorrect", do you mean on the Nagios server or on the remote host?

TIA.

Normally, if the service is monitored by an NRPE agent on the remote machine and the plugin fails to execute, the error is:
NRPE: Unable to read output.
If the plugin is executed by Nagios as an active check and it doesn't return a valid return code, or the plugin is not available at the specified location, the error is:
Return code of 255 is out of bounds

So it depends on how the setup works. I haven't personally run plugins over SSH, but I recommend referring to the attached PDF for more help.
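
To make "valid return code" concrete: Nagios plugins are expected to exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), and anything else is reported as out of bounds. The sketch below mimics that mapping in Python. The function names are made up for illustration, and the CRITICAL label for out-of-bounds codes is taken from the SERVICE ALERT lines in this thread's log rather than from the Nagios source. It may also be relevant for SSH-based checks that the OpenSSH client itself exits with 255 when the connection fails, so a 255 does not necessarily come from the plugin.

```python
import subprocess
import sys

NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def interpret_return_code(code):
    """Map a plugin exit status to the state Nagios would report."""
    if code in NAGIOS_STATES:
        return NAGIOS_STATES[code]
    # In the log shown in this thread, such services are flagged CRITICAL.
    return f"CRITICAL (Return code of {code} is out of bounds)"


def run_check(argv):
    """Run a check command and return (state, first line of its output)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return interpret_return_code(result.returncode), (lines[0] if lines else "")


if __name__ == "__main__":
    # Stand-in for a failing plugin: exits with 255 and prints nothing.
    failing = [sys.executable, "-c", "import sys; sys.exit(255)"]
    print(run_check(failing))
```

Running the real check command by hand as the nagios user and printing its exit status in the same way is usually the quickest way to see which side produces the 255.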

check-using-ssh.pdf
I think I may have found the issue:

The clock on the monitored server was out of sync with the Nagios server. I am having my HP-UX admin fix it, and I will monitor to see whether this resolves the issue.

I decided to check the nagios.log file and saw:

[1293149233] Warning: The results of service 'mem-freepct' on host 'xxxxxxx' are stale by 0d 0h 0m 55s (threshold=0d 0h 35m 0s).  I'm forcing an immediate check of the service.
Well, I got the 255 out of bounds error again. I verified that the time is in sync with the HP-UX box. Here is what I see in the nagios.log file:

[1293133455] Warning: The results of service 'load-5minave' on host 'xxxxxxxx' are stale by 0d 0h 0m 36s (threshold=0d 0h 35m 0s).  I'm forcing an immediate check of the service.

[1293133455] Warning: The results of service 'mem-freepct' on host 'xxxxxxxx' are stale by 0d 0h 0m 36s (threshold=0d 0h 35m 0s).  I'm forcing an immediate check of the service.

[1293133455] Warning: The results of service 'swap-freepct' on host 'xxxxxxxx' are stale by 0d 0h 0m 36s (threshold=0d 0h 35m 0s).  I'm forcing an immediate check of the service.

[1293133465] Warning: Return code of 255 for check of service 'disk-usedpct-/pice/prdmisc' on host 'xxxxxxxx' was out of bounds.

[1293133465] SERVICE ALERT: xxxxxxxx;disk-usedpct-/pice/prdmisc;CRITICAL;HARD;1;(Return code of 255 is out of bounds)

[1293133465] SERVICE NOTIFICATION: coa-email;xxxxxxxx;disk-usedpct-/pice/prdmisc;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133465] SERVICE NOTIFICATION: support-email;xxxxxxxx;disk-usedpct-/pice/prdmisc;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133465] SERVICE NOTIFICATION: windows-email;xxxxxxxx;disk-usedpct-/pice/prdmisc;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133466] SERVICE NOTIFICATION: unix-email;xxxxxxxx;disk-usedpct-/pice/prdmisc;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133466] Warning: Return code of 255 for check of service 'disk-usedpct-/home' on host 'xxxxxxxx' was out of bounds.

[1293133466] SERVICE ALERT: xxxxxxxx;disk-usedpct-/home;CRITICAL;HARD;1;(Return code of 255 is out of bounds)

[1293133466] SERVICE NOTIFICATION: coa-email;xxxxxxxx;disk-usedpct-/home;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133466] SERVICE NOTIFICATION: support-email;xxxxxxxx;disk-usedpct-/home;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133467] SERVICE NOTIFICATION: windows-email;xxxxxxxx;disk-usedpct-/home;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133467] SERVICE NOTIFICATION: unix-email;xxxxxxxx;disk-usedpct-/home;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133468] Warning: Return code of 255 for check of service 'disk-usedpct-/opt' on host 'xxxxxxxx' was out of bounds.

[1293133468] SERVICE ALERT: xxxxxxxx;disk-usedpct-/opt;CRITICAL;HARD;1;(Return code of 255 is out of bounds)

[1293133468] SERVICE NOTIFICATION: coa-email;xxxxxxxx;disk-usedpct-/opt;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133468] SERVICE NOTIFICATION: support-email;xxxxxxxx;disk-usedpct-/opt;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133469] SERVICE NOTIFICATION: windows-email;xxxxxxxx;disk-usedpct-/opt;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133469] SERVICE NOTIFICATION: unix-email;xxxxxxxx;disk-usedpct-/opt;CRITICAL;notify-service-by-email;(Return code of 255 is out of bounds)

[1293133470] SERVICE ALERT: xxxxxxxx;disk-usedpct-/pice/prdmisc;OK;HARD;1;disk-usedpct-/pice/prdmisc OK - Space used on filesystem /pice/prdmisc: 00%

[1293133470] SERVICE NOTIFICATION: coa-email;xxxxxxxx;disk-usedpct-/pice/prdmisc;OK;notify-service-by-email;disk-usedpct-/pice/prdmisc OK - Space used on filesystem /pice/prdmisc: 00%

[1293133470] SERVICE NOTIFICATION: support-email;xxxxxxxx;disk-usedpct-/pice/prdmisc;OK;notify-service-by-email;disk-usedpct-/pice/prdmisc OK - Space used on filesystem /pice/prdmisc: 00%

[1293133470] SERVICE NOTIFICATION: windows-email;xxxxxxxx;disk-usedpct-/pice/prdmisc;OK;notify-service-by-email;disk-usedpct-/pice/prdmisc OK - Space used on filesystem /pice/prdmisc: 00%

[1293133470] SERVICE NOTIFICATION: unix-email;xxxxxxxx;disk-usedpct-/pice/prdmisc;OK;notify-service-by-email;disk-usedpct-/pice/prdmisc OK - Space used on filesystem /pice/prdmisc: 00%
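
Since only some filesystems are affected, it can help to pull the out-of-bounds warnings out of nagios.log and count them per service. The following Python sketch does that for log lines in the format shown above and converts the epoch timestamps to UTC so they can be compared with the clocks on both machines. The function names are made up for illustration, and the regular expression is written against the lines in this thread only; other Nagios versions may word the message differently.

```python
import re
from collections import Counter
from datetime import datetime, timezone

OUT_OF_BOUNDS = re.compile(
    r"^\[(\d+)\] Warning: Return code of (\d+) for check of service "
    r"'([^']+)' on host '([^']+)' was out of bounds\."
)


def out_of_bounds_events(lines):
    """Yield (utc_time, return_code, service, host) for each matching line."""
    for line in lines:
        match = OUT_OF_BOUNDS.match(line.strip())
        if match:
            epoch, code, service, host = match.groups()
            when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
            yield when, int(code), service, host


def count_by_service(lines):
    """Count out-of-bounds warnings per service description."""
    return Counter(service for _, _, service, _ in out_of_bounds_events(lines))


if __name__ == "__main__":
    # Log file name as used in this thread; the full path is not given there.
    with open("nagios.log", encoding="utf-8", errors="replace") as log:
        for service, hits in count_by_service(log).most_common():
            print(hits, service)
```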
ASKER CERTIFIED SOLUTION
Deepak Kosaraju
This solution is only available to members.
I do have:

active_checks_enabled           0


Will check the doc next week when I get back in the office.
I found the resolution to the problem.

I took a look at the configuration of this host in Nagios, and I think I know what the problem is. The Unix servers are checked using a plugin called check_sshperf. The full path on the Nagios server is /usr/local/nagios/libexec/check_sshperf. Each remote server has its own sshperf config file in /usr/local/nagios/etc/sshperf, which determines which services will get checked.

The plugin establishes an SSH connection using parameters specified in the sshperf config file (e.g., /usr/local/nagios/etc/sshperf/xxxxxx01.cfg), and then runs a sequence of commands (uptime, df, vmstat, etc.). The output is parsed and returned to Nagios as passive service check results.
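
As a rough illustration of that last step, here is a Python sketch that takes one line of df-style output, compares the used percentage against the warning and critical thresholds from the sshperf entry earlier in the thread (80 and 90 for /tmp), and formats a passive result using the Nagios external command PROCESS_SERVICE_CHECK_RESULT. This is not the actual check_sshperf code, which is not shown in this thread: the function names, the sample df line, and the choice of "greater than or equal" for the thresholds are all assumptions.

```python
import time


def parse_used_pct(df_line):
    """Pull the first 'NN%' field out of one line of df-style output."""
    for field in df_line.split():
        if field.endswith("%") and field[:-1].isdigit():
            return int(field[:-1])
    raise ValueError(f"no percentage field in: {df_line!r}")


def evaluate_used_pct(used_pct, warn, crit):
    """Return a Nagios return code for a 'percent used' value."""
    if used_pct >= crit:  # assumption: thresholds are inclusive
        return 2  # CRITICAL
    if used_pct >= warn:
        return 1  # WARNING
    return 0  # OK


def passive_result(host, service, code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{code};{output}"
    )


if __name__ == "__main__":
    # Hypothetical df-style line; real HP-UX output may be laid out differently.
    sample = "/dev/vg00/lvol4 2097152 1782579 314573 85% /tmp"
    used = parse_used_pct(sample)
    code = evaluate_used_pct(used, warn=80, crit=90)
    message = f"Space used on filesystem /tmp: {used}%"
    print(passive_result("epcpdb01", "disk-usedpct-/tmp", code, message))
```

If the remote command fails or prints something the parser does not expect, a script like this raises an error instead of returning 0 to 3, which is one plausible way a check ends up with an out-of-bounds return code.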

In this case, it looks like someone changed the SSH service on xxxxxx01 from an active service to a passive service. That means it's never actively run. After a certain interval, Nagios notices that it has no recent data for the host, and tries to initiate a check. That check succeeds *most* of the time, and you never receive any alerts. But occasionally, it fails and you receive a notification. I took the following steps to resolve this issue:

1. Log in to Nagios, navigate to the SSH service on xxxxxx01. Click on the "Stop accepting passive checks of this service" link, and then click on "Enable active checks of this service" instead.

2. Log in to Nagios, navigate to the Event Log. Look at all the entries with messages like "Passive check result was received for host 'xxxxc5', but the host could not be found!". In each case, verify that the server should no longer be monitored, and then log on to the remote host and either disable the Wincheck service, or uninstall it completely.

There are a plethora of hosts that have been removed from Nagios but are still running the Wincheck agent. Wincheck is connecting to the Nagios server every 5 minutes and attempting to submit data, but Nagios doesn't know what to do with that data. This causes an unnecessary load on Apache and on Nagios, and may be a contributing factor when you receive notifications about "Return code 255" errors.
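
The staleness behavior described in this post (a passive-only service whose last result ages past a threshold, after which Nagios forces a check) matches the "stale by ... (threshold=...)" warnings in the log earlier in the thread. The Python sketch below parses such a warning and shows the comparison involved. The function names are made up for illustration, and reading "stale by" as the time elapsed beyond the threshold is my interpretation of the message, not something confirmed in this thread.

```python
import re

DURATION = re.compile(r"(\d+)d (\d+)h (\d+)m (\d+)s")
STALE = re.compile(
    r"The results of service '([^']+)' on host '([^']+)' are stale by "
    r"(\d+d \d+h \d+m \d+s) \(threshold=(\d+d \d+h \d+m \d+s)\)"
)


def to_seconds(text):
    """Convert a duration such as '0d 0h 35m 0s' to seconds."""
    days, hours, minutes, seconds = map(int, DURATION.fullmatch(text).groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_stale_warning(line):
    """Return (service, host, overdue_s, threshold_s), or None if no match."""
    match = STALE.search(line)
    if not match:
        return None
    service, host, overdue, threshold = match.groups()
    return service, host, to_seconds(overdue), to_seconds(threshold)


def is_stale(last_result, now, threshold):
    """Assumed rule: stale once the last result is older than the threshold."""
    return now - last_result > threshold


if __name__ == "__main__":
    line = (
        "[1293149233] Warning: The results of service 'mem-freepct' on host "
        "'xxxxxxx' are stale by 0d 0h 0m 55s (threshold=0d 0h 35m 0s).  "
        "I'm forcing an immediate check of the service."
    )
    print(parse_stale_warning(line))
```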
You should audit your Nagios setup:
-> How many services are defined?
-> How many hosts are being monitored?
-> Which services trigger "host not found" or "service not found" messages? Remove those services from the Nagios agent or the Windows agent if they are forwarding unnecessary data to Nagios.
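
A rough way to start on the audit points above is to count the object definitions in the config files and list the hosts that Nagios rejects passive results for. The Python sketch below does both. The function names are made up for illustration, the count includes templates as well as real hosts and services, and the log pattern is copied from the message quoted earlier in this thread, so it may need adjusting for other Nagios versions.

```python
import re
from collections import Counter

DEFINE = re.compile(r"^\s*define\s+(\w+)\s*\{")
UNKNOWN_HOST = re.compile(
    r"Passive check result was received for host '([^']+)', "
    r"but the host could not be found!"
)


def count_definitions(config_text):
    """Count 'define <type>{' blocks by object type (templates included)."""
    counts = Counter()
    for line in config_text.splitlines():
        match = DEFINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def unknown_hosts(log_lines):
    """Count passive results submitted for hosts Nagios does not know."""
    counts = Counter()
    for line in log_lines:
        match = UNKNOWN_HOST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample_config = "define host{\n}\n define service{\n}\n define service{\n}\n"
    print(dict(count_definitions(sample_config)))
    sample_log = [
        "Passive check result was received for host 'xxxxc5', "
        "but the host could not be found!"
    ]
    print(dict(unknown_hosts(sample_log)))
```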

I believe the 255 error normally occurs only when the path to the plugin is wrong or the plugin returns unexpected output.