kapshure asked:

Nagios "No Route to Host" error on CentOS

I've got a Nagios server (on CentOS 5) and a monitored node (also on CentOS 5). I initially had a problem with SSH key exchange, but that has been solved, and I'm still receiving a "No route to host" error.

Nagios server: 10.0.100.130
monitored node: 10.0.100.143

Yet, I can do the following from Nagios Server:

/usr/local/nagios/libexec/check_tcp -H 10.0.100.143 -p 5666
TCP OK - 0.000 second response time on port 5666|time=0.000361s;0.000000;0.000000;0.000000;10.000000


I can also do this from the Nagios server:

ssh 10.0.100.143 /usr/local/nagios/libexec/check_procs 
PROCS OK: 603 processes


I can successfully ping 10.0.100.143 from Nagios server as well.

grep for the monitored node in /var/log/messages pulls this up:

Nov 10 00:00:00 nagiosbox nagios: CURRENT HOST STATE: monitorednode;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.21 ms 

Nov 10 00:00:00 nagiosbox nagios: CURRENT SERVICE STATE: monitorednode;Home Page;CRITICAL;HARD;1;No route to host
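
The two log lines above report different checks: the host state comes from the host check (a ping, judging by the "PING OK" output), while "Home Page" is a separate service check with its own plugin output. A small Python sketch of how such lines can be split into fields follows. The function name `parse_state_line` is made up for this example, and the field order is simply read off the log lines quoted above.

```python
import re

# Matches the "CURRENT HOST STATE" / "CURRENT SERVICE STATE" lines quoted above.
STATE_RE = re.compile(r"CURRENT (HOST|SERVICE) STATE: (.*)$")


def parse_state_line(line):
    """Split one state line into a dict; return None if the line does not match."""
    match = STATE_RE.search(line.strip())
    if match is None:
        return None
    kind, payload = match.groups()
    if kind == "HOST":
        # Field order: host;state;state_type;attempt;plugin_output
        host, state, state_type, attempt, output = payload.split(";", 4)
        service = None
    else:
        # Field order: host;service;state;state_type;attempt;plugin_output
        host, service, state, state_type, attempt, output = payload.split(";", 5)
    return {
        "kind": kind, "host": host, "service": service, "state": state,
        "state_type": state_type, "attempt": int(attempt), "output": output,
    }


LINES = [
    "Nov 10 00:00:00 nagiosbox nagios: CURRENT HOST STATE: "
    "monitorednode;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.21 ms",
    "Nov 10 00:00:00 nagiosbox nagios: CURRENT SERVICE STATE: "
    "monitorednode;Home Page;CRITICAL;HARD;1;No route to host",
]
for entry in map(parse_state_line, LINES):
    print(entry["kind"], entry["service"], entry["state"], entry["output"])
```

Seen this way, a host that is UP and a service that is CRITICAL are not contradictory; the failing part is whatever command the "Home Page" service runs.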


I am a bit confused here. Any help is much appreciated.
edster9999:

Can you ping back in the other direction?
Can you run ifconfig on both machines and show the output?
And maybe include 'route' output for both machines too, so we can see the route setup.
kapshure (ASKER):

from monitored node:

ping 10.0.100.130
PING 10.0.100.130 (10.0.100.130) 56(84) bytes of data.
64 bytes from 10.0.100.130: icmp_seq=1 ttl=64 time=0.897 ms

monitored node ifconfig:

ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1D:09:2C:C3:2A  
          inet addr:10.0.100.143  Bcast:10.0.100.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:9ff:fe2c:c32a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:151840310 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20026487 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:145578488128 (135.5 GiB)  TX bytes:2364444581 (2.2 GiB)
          Interrupt:169 Memory:f8000000-f8012800

eth0:1    Link encap:Ethernet  HWaddr 00:1D:09:2C:C3:2A  
          inet addr:10.0.100.144  Bcast:10.0.100.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:f8000000-f8012800

"route" from monitored node:

 route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.100.0      *               255.255.255.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
default         10.0.100.1      0.0.0.0         UG    0      0        0 eth0



from Nagios box, ifconfig:

/sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1C:23:C8:96:AE  
          inet addr:10.0.100.130  Bcast:10.0.100.255  Mask:255.255.255.0
          inet6 addr: fe80::21c:23ff:fec8:96ae/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1968825668 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2112609296 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:708043528943 (659.4 GiB)  TX bytes:995965269105 (927.5 GiB)
          Interrupt:169 Memory:f8000000-f8011100

"route" from nagios box:

 /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.101.0      *               255.255.255.0   U     0      0        0 eth1
10.0.100.0      *               255.255.255.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
default         10.0.100.1      0.0.0.0         UG    0      0        0 eth0
Well that all looks fine to me.

In the Nagios server setup, are you calling the remote server by IP or by name?

Sanga Collins:
Where exactly are you seeing this error message?
@edster9999:

I have a per-host config file:

[code]/usr/local/nagios/etc/servers/monitorednode.cfg:


define host{
      use linux-server ; Inherit default values from a template
        host_name monitorednode ; The name we're giving to this server
        alias monitorednode ; A longer name for the server
        address 10.0.100.143 ; IP address of the server
}
define service{
        use generic-service
        host_name                       monitorednode
        service_description             Home Page
        check_command                   check_http!ww2
}[/code]

is that what you mean?



@sangamc:

If you click on Tactical Overview, then under the Services section you see Critical, Warning, Unknown, OK, Pending.

Under Critical, that's where it is. You can also see Service Status Totals from Service Detail; it's there under the status information, which says: No Route to Host.

On the Host status details main page, it shows the system as UP.

Question though:

I have active checks disabled right now. Is this error message because of that?
Kerem ERSOY:

Hi,

The thing is, I guess "monitorednode" is not resolving to 10.0.100.143. Please try this from the Nagios server:

ping monitorednode

I guess it resolves to another address.

If this is the case, edit your DNS if you have one, or edit your /etc/hosts. Please make sure that:
- Your host name is not assigned to 127.0.0.1. If it is, correct it and assign your hostname to your IP.

- Then add an entry for the monitored host such as:

10.0.100.143   monitorednode.domain.com  monitorednode

- Check your /etc/resolv.conf for your search domain (appended after monitorednode to create an FQDN), such as:

nameserver  x.x.x.x
search domain.com

Save and exit, then confirm that you can now ping the host with these commands:

ping monitorednode
ping monitorednode.domain.com


Please replace domain.com with your domain.

Cheers,
K.
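
To see what /etc/hosts alone would map a name to, a short Python sketch can help. `hosts_lookup` is a hypothetical helper written for this example, and `SAMPLE` just mirrors the kind of entry suggested above. It only reads hosts-file text; the real answer on a given box also depends on DNS and on the lookup order in /etc/nsswitch.conf, which `socket.gethostbyname` goes through.

```python
import socket


def hosts_lookup(hosts_text, name):
    """Return every IP address that the given hosts-file text maps to `name`."""
    addresses = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if name.lower() in (n.lower() for n in names):
            addresses.append(ip)
    return addresses


def resolver_lookup(name):
    """Ask the system resolver (hosts file, DNS, ...) what the name maps to."""
    return socket.gethostbyname(name)


# Sample text only; on a real box you would read /etc/hosts instead.
SAMPLE = """\
127.0.0.1      localhost.localdomain localhost
10.0.100.143   monitorednode.domain.com  monitorednode
"""
print(hosts_lookup(SAMPLE, "monitorednode"))
```

If the hosts-file answer and `resolver_lookup("monitorednode")` disagree, DNS or a stale entry is a likely place to look.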




@KeremE

If you look above at the ifconfig output for monitorednode, you can see 10.0.100.143 on eth0 and 10.0.100.144 on eth0:1. I know this is an alias on the interface, but I am not sure whether or how it affects this scenario.

I changed the .cfg file, on the nagios server,  for monitorednode, to both, .143, and then to .144 & tested.

I also tested /etc/hosts entry with .144, and .143

if I ping monitorednode(domain.com), I can get successful ICMP replies back for both IP addresses.

If I do a ./check_http -H 10.0.100.143, I get a connection refused, Unable to open TCP socket. I can't telnet to 80 on that box either.

If I do a ./check_http -H 10.0.100.144, I get:

OK - HTTP/1.1 301 Moved Permanently - 0.003 second response time |time=0.002535s;;;0.000000 size=434B;;;0

I can telnet successfully to 80 on .144
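
The results above suggest (though they do not prove) that the web server only accepts connections on the .144 alias. The two messages seen in this thread are also different socket errors: "Connection refused" (ECONNREFUSED) means the host answered but nothing accepted the connection on that port, while "No route to host" (EHOSTUNREACH) means the connection attempt got an unreachable answer, which on Linux can also come from a firewall rule that rejects with an ICMP unreachable message. A minimal Python sketch for telling them apart; `probe` and `classify_errno` are names invented for this example.

```python
import errno
import socket


def classify_errno(code):
    """Map a connect() errno to the wording seen in the plugin output."""
    if code == errno.ECONNREFUSED:
        return "connection refused"
    if code == errno.EHOSTUNREACH:
        return "no route to host"
    return "other error (%s)" % errno.errorcode.get(code, "unknown")


def probe(host, port, timeout=3.0):
    """Try one TCP connection and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        return classify_errno(exc.errno)


# Example calls (not run here, they need the network from this thread):
#   probe("10.0.100.143", 80)
#   probe("10.0.100.144", 80)
```

Running both probes from the Nagios server, as the same user the daemon runs as, would show which address and which error the service check is really hitting.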

Someone mentioned that this error isn't Nagios, but with the OS. specifically stating that the "Home Page" check isn't looking at a valid host name or address vs the check_ping plugin. Problem is... I can't find any reference to "Home Page" anywhere.


I got these from /usr/local/nagios/etc/objects/commands.cfg

# 'check-host-alive' command definition
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

# 'check_ping' command definition
define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
        }

# 'check_http' command definition
define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
        }
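
With the check_http definition above, a service line such as `check_command check_http!ww2` is split on `!`: the first part names the command, and the remaining parts fill `$ARG1$`, `$ARG2$`, and so on. The following Python sketch shows that substitution so the resulting command line can be run by hand. `expand_check_command` is a hypothetical helper, real Nagios handles many more macros, and `$USER1$` is assumed to be /usr/local/nagios/libexec because that is where the plugins are run from earlier in the thread.

```python
import re


def expand_check_command(check_command, command_lines, host_address, user1):
    """Expand 'name!arg1!arg2' against a dict of command_line templates."""
    name, *args = check_command.split("!")
    template = command_lines[name]
    macros = {"USER1": user1, "HOSTADDRESS": host_address}
    for index, value in enumerate(args, start=1):
        macros["ARG%d" % index] = value
    # In this sketch, macros with no value expand to an empty string.
    expanded = re.sub(r"\$([A-Z0-9_]+)\$",
                      lambda m: macros.get(m.group(1), ""), template)
    return expanded.strip()


COMMAND_LINES = {
    # Copied from the commands.cfg excerpt above.
    "check_http": "$USER1$/check_http -I $HOSTADDRESS$ $ARG1$",
}
USER1 = "/usr/local/nagios/libexec"  # assumption, see the note above

print(expand_check_command("check_http!ww2", COMMAND_LINES, "10.0.100.143", USER1))
print(expand_check_command("check_http", COMMAND_LINES, "10.0.100.143", USER1))
```

Note that in this definition `ww2` is passed as a bare argument rather than through an explicit option, so what the plugin does with it depends on the command definition the running daemon actually loaded.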

Under /etc/rc.d/init.d/nagios I can see that I've got the paths right:

prefix="/usr/local/nagios"
exec_prefix="/usr/local/nagios"
exec="/usr/local/nagios/bin/nagios"
config="/usr/local/nagios/etc/nagios.cfg"


Thoughts?
Can I get a bump on this? I raised the points to 500. I'm really struggling with this. I can supplement with this from the log:
      
Nov  9 00:00:00 nagiosbox nagios: CURRENT SERVICE STATE: monitorednode;Home Page;CRITICAL;HARD;1;No route to host 
Nov 10 00:00:00 nagiosbox nagios: CURRENT HOST STATE: monitorednode;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.21 ms 
Nov 10 00:00:00 nagiosbox nagios: CURRENT SERVICE STATE: monitorednode;Home Page;CRITICAL;HARD;1;No route to host


So it's the Home Page alert?

See Home Page in the relevant .cfg file below:
      
define host{
	use linux-server ; Inherit default values from a template
        host_name monitorednode ; The name we're giving to this server
        alias monitorednode ; A longer name for the server
        address 10.0.100.143 ; IP address of the server
}
define service{
        use generic-service
        host_name                       monitorednode
        service_description             Home Page
        check_command                   check_http!ww2
}



the IP listed above is correct for the host. But again, no reference to an IP defined in the command file found here: /usr/local/nagios/etc/objects/commands.cfg

Thoughts????
what happens if you take away the !ww2 arg?
I edited the .cfg file, so should it just refresh now, or do I need to do a nagios reload?

I actually tried that anyway, but I can't get it to execute:

/etc/rc.d/init.d/nagios reload
nagios dead but subsys locked


Nagios is still running though, and monitoring.

"Subsys locked" usually indicates the lock file still exists. Reboot your server, see if you still get the host not reachable error message, and let us know.

P.S. If you are on CentOS, you should be able to use "service nagios reload" and "service nagios restart" to reload or restart the Nagios service.

I can't reboot this box. It's our primary monitoring solution for the datacenter.

/sbin/service nagios status
nagios (pid 20266) is running...

Then I tried reload:
bash-3.1# /sbin/service nagios reload
nagios (pid 20266) is running...
Reloading nagios:                                          [FAILED]

But it's still running.
Try this instead from: http://nagios.sourceforge.net/docs/2_0/stoprestart.html

ps axu | grep nagios

The output should look something like this:

nagios  6808  0.0  0.7   840   352  p3 S    13:44   0:00 grep nagios
nagios 11149  0.2  1.0   868   488  ?  S   Feb 27   6:33 /usr/local/nagios/bin/nagios nagios.cfg

From the program output, you will notice that Nagios was started by user nagios and is running as process id 11149.

Manually Stopping Nagios

In order to stop Nagios, use the kill command as follows...

kill 11149

Then do service nagios start
I got nagios to reload. Still see the service alert though. I need to roll out a cleaned up box, but for now, having 2 sets of binaries on here is throwing me off.

Not sure what to do now.
ASKER CERTIFIED SOLUTION: Sanga Collins

Well, the previous admin did an upgrade, and I can see two different versions and two different program paths. He didn't install anything from RPMs, so I suspect something was done incorrectly during the upgrade.

/usr/bin/nagios -v
3.2.1
usage: /usr/bin/nagios

/usr/local/nagios/bin/nagios -v
3.0b7
usage: /usr/local/nagios/bin/nagios


So I killed the process this time, and restarted Nagios:
service nagios restart

I tailed /var/log/messages, and you can see that Nagios restarted, but look at the version number. I need to find out how to make 3.2.1 start instead, but that may not be the issue.

Nov 18 10:03:46 sacdcdev01 nagios: Successfully shutdown... (PID=20266)
Nov 18 10:03:54 sacdcdev01 nagios: Nagios 3.0b7 starting... (PID=10255)
forgot to add this:

nagios   10256     1  0 10:03 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


3.0b7 definitely running. Maybe when I get better at Nagios config/setup, I will just deploy another Nagios roll-out. This is ridics! :)
Bueller? anyone? Bueller?
OK, I believe I have found a possible lead on this.

I changed the monitorednode.cfg to this:


define service{
        use generic-service
        host_name                       sacdcweb03
        service_description             HTTP
        check_command                   check_http
}


took out the "Home Page" and the "check_http!ww2"

service_description             Home Page
        check_command                   check_http!ww2


so what I get now in /var/log/messages is:

nagios: CURRENT SERVICE STATE: sacdcweb03;HTTP;CRITICAL;HARD;3;Connection refused

So now the "Connection refused" troubleshooting advice talks about checking for version differences between the Nagios server and the monitored node where the NRPE daemon is running. I found that the monitored node has NRPE 2.12, and the Nagios server has 2.8.

I ran a "make clean" in the original directory on the monitored node, but I can still execute the check_nrpe plugin and see v2.12 returned.

How do I correctly remove NRPE v2.12 from the monitored node? I suspect that re-installing the NRPE daemon with 2.8 will possibly clean this up!

Anyone?
OK, so I went back and I have now installed NRPE 2.8 on the monitored node, and I can verify that 2.8 is replying

/usr/local/nagios/libexec/check_nrpe -H localhost
NRPE v2.8


but I am still getting connection refused in Nagios.  Can anyone shed any light on this?
I went back and changed monitorednode.cfg on the Nagios server to point back to the "Home Page" check, even though this seemed to be incorrect previously:


define service{
        use generic-service
        host_name                       monitorednode
        service_description             Home Page
        check_command                   check_http!ww1
}


Then I bounced Nagios, and now the critical error message has cleared.

The only thing I get now that I'm not quite sure about:

OK - HTTP/1.1 301 Moved Permanently

"HTTP/1.1 301 Moved Permanently" is a problem with the config on the web server; that's where you need to troubleshoot the message.
Alrighty! I'll look into it. But for all intents and purposes, would you say that Nagios can at least be relied on to monitor this host, even though this message is popping up? It's not being classified as Warning or Critical.

Thanks again for your help on this, sangamc.

Yes it is. I had a similar situation with a third-party web server. The site designer didn't think it a priority to fix the 301 redirect, so we monitored from Nagios and took that into account. When the site went down due to a network outage, Nagios would show the site as down, and when it came back up, the status would return to OK, which is what we were looking for.
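
As far as I know, check_http reports a 3xx response according to its `-f`/`--onredirect` option, which defaults to `ok`; that would explain an OK state that still carries the "301 Moved Permanently" text. A simplified Python sketch of that status-to-state mapping is below. `state_for_status` is illustrative only and is not the plugin's code; the 4xx to WARNING and 5xx to CRITICAL rules reflect my understanding of the plugin's usual behaviour, and the `follow` style options are left out because they need a second request.

```python
def state_for_status(status, onredirect="ok"):
    """Rough model of how an HTTP status code becomes a Nagios state."""
    if status < 100 or status >= 600:
        return "CRITICAL"  # not a valid HTTP status
    if status >= 500:
        return "CRITICAL"  # server errors
    if status >= 400:
        return "WARNING"   # client errors
    if status >= 300:
        if onredirect not in ("ok", "warning", "critical"):
            raise ValueError("this sketch only models ok/warning/critical")
        return onredirect.upper()  # redirects follow the configured policy
    return "OK"


for code in (200, 301, 404, 503):
    print(code, state_for_status(code))
```

If the redirect should count as a problem, setting the redirect policy to `warning` or `critical` in the command definition would be the place to change it, assuming the installed plugin version supports that option.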
SOLUTION
I found the binary difference in NRPE between the two systems. These guys helped make the problem and resolution manifest.