Solved

Nagios "No Route to Host" error on CentOS

Posted on 2010-11-10
Last Modified: 2012-05-10
I've got a Nagios server (on CentOS 5) and a monitored node (also on CentOS 5). I initially had a problem with SSH key exchange, but that has been solved, and I'm still receiving a "No route to host" error.

Nagios server: 10.0.100.130
monitored node: 10.0.100.143

Yet, I can do the following from Nagios Server:

/usr/local/nagios/libexec/check_tcp -H 10.0.100.143 -p 5666
TCP OK - 0.000 second response time on port 5666|time=0.000361s;0.000000;0.000000;0.000000;10.000000

I can also do this from the Nagios server:

ssh 10.0.100.143 /usr/local/nagios/libexec/check_procs 
PROCS OK: 603 processes

I can successfully ping 10.0.100.143 from Nagios server as well.

grep for the monitored node in /var/log/messages pulls this up:

Nov 10 00:00:00 nagiosbox nagios: CURRENT HOST STATE: monitorednode;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.21 ms 

Nov 10 00:00:00 nagiosbox nagios: CURRENT SERVICE STATE: monitorednode;Home Page;CRITICAL;HARD;1;No route to host

I'm a bit confused here. Any help is much appreciated.
Question by:kapshure
26 Comments
 
LVL 20

Expert Comment

by:edster9999
Can you ping back in the other direction?
Can you run ifconfig on both machines and post the output?
Please also include the output of 'route' from both machines so we can see the routing setup.
 

Author Comment

by:kapshure
from monitored node:

ping 10.0.100.130
PING 10.0.100.130 (10.0.100.130) 56(84) bytes of data.
64 bytes from 10.0.100.130: icmp_seq=1 ttl=64 time=0.897 ms

monitored node ifconfig:

ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1D:09:2C:C3:2A  
          inet addr:10.0.100.143  Bcast:10.0.100.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:9ff:fe2c:c32a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:151840310 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20026487 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:145578488128 (135.5 GiB)  TX bytes:2364444581 (2.2 GiB)
          Interrupt:169 Memory:f8000000-f8012800

eth0:1    Link encap:Ethernet  HWaddr 00:1D:09:2C:C3:2A  
          inet addr:10.0.100.144  Bcast:10.0.100.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169 Memory:f8000000-f8012800

"route" from monitored node:

 route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.100.0      *               255.255.255.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
default         10.0.100.1      0.0.0.0         UG    0      0        0 eth0



from Nagios box, ifconfig:

/sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1C:23:C8:96:AE  
          inet addr:10.0.100.130  Bcast:10.0.100.255  Mask:255.255.255.0
          inet6 addr: fe80::21c:23ff:fec8:96ae/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1968825668 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2112609296 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:708043528943 (659.4 GiB)  TX bytes:995965269105 (927.5 GiB)
          Interrupt:169 Memory:f8000000-f8011100

"route" from nagios box:

 /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.101.0      *               255.255.255.0   U     0      0        0 eth1
10.0.100.0      *               255.255.255.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
default         10.0.100.1      0.0.0.0         UG    0      0        0 eth0
 
LVL 20

Expert Comment

by:edster9999
Well, that all looks fine to me.

In the Nagios server setup, are you referring to the remote server by IP or by name?
 
LVL 18

Expert Comment

by:Sanga Collins
Where exactly are you seeing this error message?
 

Author Comment

by:kapshure
@edster9999:

I have a bucket container, /usr/local/nagios/etc/servers/monitorednode.cfg:

define host{
        use linux-server ; Inherit default values from a template
        host_name monitorednode ; The name we're giving to this server
        alias monitorednode ; A longer name for the server
        address 10.0.100.143 ; IP address of the server
}
define service{
        use generic-service
        host_name                       monitorednode
        service_description             Home Page
        check_command                   check_http!ww2
}

Is that what you mean?



@sangamc:

If you click on Tactical Overview, then under the Services section you see Critical, Warning, Unknown, OK, and Pending.

It's under Critical. You can also get there from Service Status Totals on the Service Detail page; the status information there says "No route to host".

On the Host Status Details main page, it shows the system as UP.

One question, though:

I have active checks disabled right now. Is this error message because of that?
 
LVL 30

Expert Comment

by:Kerem ERSOY
Hi,

The thing is, I guess "monitorednode" is not resolving to 10.0.100.143. Please try this on the Nagios server:

ping monitorednode

I guess it resolves to another address.

If this is the case, edit your DNS if you have one, or edit your /etc/hosts. Please make sure that:
- Your host name is not assigned to 127.0.0.1. If it is, correct it and assign your hostname to your IP.

- Then add an entry for the monitored host, such as:

10.0.100.143   monitorednode.domain.com  monitorednode

- Check your /etc/resolv.conf for your search domain (appended after monitorednode to create an FQDN), such as:

nameserver  x.x.x.x
search domain.com

Save and exit, and make sure that you can now ping the host with these commands:

ping monitorednode
ping monitorednode.domain.com


Please replace domain.com with your domain.
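
The /etc/hosts part of this can be checked quickly with something like the following. It is only a sketch: lookup_hosts is a made-up helper name, and monitorednode stands in for whatever host name is being checked. It prints every address that a hosts-format file maps a name to, so a stray 127.0.0.1 entry or a second address shows up immediately.

```shell
# lookup_hosts NAME [FILE]
# Print the address(es) that a hosts-format file maps NAME to.
# lookup_hosts is a hypothetical helper, not part of any package.
lookup_hosts() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        { sub(/#.*/, "") }                                    # drop comments
        { for (i = 2; i <= NF; i++) if ($i == n) print $1 }   # names start at field 2
    ' "$file"
}
```

Running lookup_hosts monitorednode reads /etc/hosts by default. For comparison, getent hosts monitorednode shows what the system resolver actually returns, DNS included.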

Cheers,
K.




 

Author Comment

by:kapshure
@KeremE

If you look above at the ifconfig output for the monitored node, you can see there is 10.0.100.143 on eth0, and then 10.0.100.144 on eth0:1. I know this is an alias on the interface, but I am not sure whether or how it is affecting this scenario.

I changed the address in the .cfg file for monitorednode on the Nagios server, first to .143 and then to .144, and tested each.

I also tested the /etc/hosts entry with .144 and with .143.

If I ping monitorednode (monitorednode.domain.com), I get successful ICMP replies back for both IP addresses.

If I run ./check_http -H 10.0.100.143, I get "Connection refused" and "Unable to open TCP socket". I can't telnet to port 80 on that address either.

If I run ./check_http -H 10.0.100.144, I get:

OK - HTTP/1.1 301 Moved Permanently - 0.003 second response time |time=0.002535s;;;0.000000 size=434B;;;0

I can telnet successfully to port 80 on .144.
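
The different failure messages seen in this thread usually point at different layers, which helps narrow things down. A rough mapping is sketched below; explain_connect_error is a made-up helper, and the causes listed are the common ones rather than an exhaustive list.

```shell
# explain_connect_error MESSAGE
# Print a likely cause for a plugin's connection error text.
# Hypothetical helper; the mapping covers common cases only.
explain_connect_error() {
    case $1 in
        *"No route to host"*)
            echo "ICMP unreachable/prohibited came back or ARP failed: check firewall REJECT rules and routing" ;;
        *"Connection refused"*)
            echo "host answered with a TCP reset: probably nothing is listening on that address and port" ;;
        *"timed out"*)
            echo "no answer at all: packets are probably being dropped on the way" ;;
        *)
            echo "unrecognized error" ;;
    esac
}
```

On the monitored node, netstat -tln | grep ':80 ' shows which address the web server is bound to. A listener on 10.0.100.144:80 only would be consistent with .144 answering and .143 refusing, though that is a guess from the symptoms.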

Someone mentioned that this error isn't from Nagios but from the OS, specifically that the "Home Page" check isn't pointing at a valid host name or address, unlike the check_ping plugin. Problem is, I can't find any reference to "Home Page" anywhere.


I got these from /usr/local/nagios/etc/objects/commands.cfg

# 'check-host-alive' command definition
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

# 'check_ping' command definition
define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
        }

# 'check_http' command definition
define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
        }
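
To see what a service check such as check_http!ww2 actually runs, the macros in command_line can be expanded by hand. The sketch below does that for the check_http definition above. It assumes $USER1$ is /usr/local/nagios/libexec (the plugin directory used elsewhere in this thread; the real value lives in resource.cfg), and expand_check_http is a made-up helper name.

```shell
#!/bin/sh
# Assumption: $USER1$ points at the plugin directory seen in this thread.
USER1="/usr/local/nagios/libexec"

# expand_check_http HOSTADDRESS CHECK_COMMAND
# CHECK_COMMAND is the service's check_command value, e.g. "check_http!ww2".
# Text after the first "!" becomes $ARG1$.
expand_check_http() {
    hostaddress=$1
    check_command=$2
    case $check_command in
        *'!'*) arg1=${check_command#*!}; arg1=${arg1%%!*} ;;
        *)     arg1= ;;
    esac
    # Mirrors: $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
    if [ -n "$arg1" ]; then
        printf '%s/check_http -I %s %s\n' "$USER1" "$hostaddress" "$arg1"
    else
        printf '%s/check_http -I %s\n' "$USER1" "$hostaddress"
    fi
}

expand_check_http 10.0.100.143 'check_http!ww2'
# prints: /usr/local/nagios/libexec/check_http -I 10.0.100.143 ww2
```

Running the printed command by hand as the nagios user should reproduce whatever the service check reports. Note that with this definition the ww2 argument is passed bare, without -H in front of it.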

Under /etc/rc.d/init.d/nagios I can see that I've got the paths right:

prefix="/usr/local/nagios"
exec_prefix="/usr/local/nagios"
exec="/usr/local/nagios/bin/nagios"
config="/usr/local/nagios/etc/nagios.cfg"


Thoughts?
 

Author Comment

by:kapshure
Can I get a bump on this? I raised the points to 500. I'm really struggling with this. I can supplement with this:

Nov  9 00:00:00 nagiosbox nagios: CURRENT SERVICE STATE: monitorednode;Home Page;CRITICAL;HARD;1;No route to host 
Nov 10 00:00:00 nagiosbox nagios: CURRENT HOST STATE: monitorednode;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.21 ms 
Nov 10 00:00:00 nagiosbox nagios: CURRENT SERVICE STATE: monitorednode;Home Page;CRITICAL;HARD;1;No route to host

So it's the Home Page alert?

See Home Page in the relevant .cfg file below:

define host{
        use linux-server ; Inherit default values from a template
        host_name monitorednode ; The name we're giving to this server
        alias monitorednode ; A longer name for the server
        address 10.0.100.143 ; IP address of the server
}
define service{
        use generic-service
        host_name                       monitorednode
        service_description             Home Page
        check_command                   check_http!ww2
}

The IP listed above is correct for the host. But again, there is no reference to an IP address in the command file at /usr/local/nagios/etc/objects/commands.cfg.

Thoughts?
 
LVL 18

Expert Comment

by:Sanga Collins
What happens if you take away the !ww2 arg?
 

Author Comment

by:kapshure
I edited the .cfg file. Should it just refresh now, or do I need to do a Nagios reload?

I actually tried a reload anyway, but I can't get it to execute:

/etc/rc.d/init.d/nagios reload
nagios dead but subsys locked

Nagios is still running, though, and monitoring.
 
LVL 18

Expert Comment

by:Sanga Collins
"subsys locked" usually indicates the lock file still exists. Reboot your server, see if you still get the host not reachable error message, and let us know.

P.S. If you are on CentOS, you should be able to use "service nagios reload" and "service nagios restart" to reload or restart the Nagios service.
 

Author Comment

by:kapshure
I can't reboot this box. It's our primary monitoring solution for the datacenter.

/sbin/service nagios status
nagios (pid 20266) is running...

Then I tried reload:
bash-3.1# /sbin/service nagios reload
nagios (pid 20266) is running...
Reloading nagios:                                          [FAILED]

But it's still running.
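
One thing worth ruling out when a reload fails right after editing a .cfg file is a configuration error, since the init script may refuse to reload a config that does not validate. The Nagios binary has a built-in pre-flight check (-v followed by the main config file). The wrapper below is only a sketch; verify_config is a made-up name, and the paths in the usage note are the ones quoted in this thread.

```shell
# verify_config NAGIOS_BIN NAGIOS_CFG
# Run the binary's pre-flight check and succeed only if it passes.
# Hypothetical wrapper around "nagios -v <main config file>".
verify_config() {
    bin=$1
    cfg=$2
    if "$bin" -v "$cfg" >/dev/null 2>&1; then
        echo "config OK: $cfg"
    else
        echo "config check failed: run '$bin -v $cfg' to see the errors" >&2
        return 1
    fi
}
```

Usage would look like: verify_config /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg && /sbin/service nagios reload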
 
LVL 18

Expert Comment

by:Sanga Collins
Try this instead from: http://nagios.sourceforge.net/docs/2_0/stoprestart.html

ps axu | grep nagios

The output should look something like this:

nagios  6808  0.0  0.7   840   352  p3 S    13:44   0:00 grep nagios
nagios 11149  0.2  1.0   868   488  ?  S   Feb 27   6:33 /usr/local/nagios/bin/nagios nagios.cfg

From the program output, you will notice that Nagios was started by user nagios and is running as process id 11149.

Manually Stopping Nagios

In order to stop Nagios, use the kill command as follows...

kill 11149

Then run "service nagios start".

Author Comment

by:kapshure
I got Nagios to reload. I still see the service alert, though. I need to roll out a cleaned-up box, but for now, having 2 sets of binaries on here is throwing me off.

Not sure what to do now.
 
LVL 18

Accepted Solution

by:
Sanga Collins earned 250 total points
Why do you have 2 sets of binaries?

I must admit I am as perplexed as you are by this problem. I even set up a similar scenario with my Nagios server and got successful results. I am not sure what else may be the issue.
 

Author Comment

by:kapshure
Well, the previous admin did an upgrade, and I can see two different versions and two different program paths. He didn't install anything from RPM, so I suspect something was done incorrectly during the upgrade.

/usr/bin/nagios -v
3.2.1
usage: /usr/bin/nagios

/usr/local/nagios/bin/nagios -v
3.0b7
usage: /usr/local/nagios/bin/nagios


So I killed the process this time and restarted Nagios:
service nagios restart

I tailed /var/log/messages, and you can see that Nagios restarted, but look at the version number. I need to find out how to make 3.2.1 start instead, but that may not be the issue.

Nov 18 10:03:46 sacdcdev01 nagios: Successfully shutdown... (PID=20266)
Nov 18 10:03:54 sacdcdev01 nagios: Nagios 3.0b7 starting... (PID=10255)
 

Author Comment

by:kapshure
Forgot to add this:

nagios   10256     1  0 10:03 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


3.0b7 definitely running. Maybe when I get better at Nagios config/setup, I will just deploy another Nagios roll-out. This is ridics! :)
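
Which of the two binaries gets started is decided by the init script, so reading its variables is a quick way to confirm it. A small sketch follows, assuming the script uses plain NAME="value" assignments like the prefix=, exec= and config= lines quoted earlier in this thread; init_var is a made-up helper name.

```shell
# init_var FILE NAME
# Print the value assigned to NAME in a shell-style init script.
# Hypothetical helper; assumes simple NAME="value" lines.
init_var() {
    awk -F= -v n="$2" '
        $1 == n {
            v = $0
            sub(/^[^=]*=/, "", v)   # strip "NAME="
            gsub(/"/, "", v)        # strip surrounding quotes
            print v
            exit
        }
    ' "$1"
}
```

For example, init_var /etc/rc.d/init.d/nagios exec should print the binary the service starts, which can then be compared with the running process via ps -o args= -p 10256 (the PID shown above). Pointing exec= and config= at the 3.2.1 install would be one way to switch versions, but only after checking that its config validates.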
 

Author Comment

by:kapshure
Bueller? anyone? Bueller?
 

Author Comment

by:kapshure
OK, I believe I have found a possible lead on this.

I changed the monitorednode.cfg to this:


define service{
        use generic-service
        host_name                       sacdcweb03
        service_description             HTTP
        check_command                   check_http
}

I took out the "Home Page" description and the "check_http!ww2" command:

        service_description             Home Page
        check_command                   check_http!ww2

So what I get now in /var/log/messages is:

nagios: CURRENT SERVICE STATE: sacdcweb03;HTTP;CRITICAL;HARD;3;Connection refused

The troubleshooting advice for "Connection refused" talks about checking for version differences between the Nagios server and the monitored node where the NRPE daemon is running. So I checked, and found that the monitored node has 2.12 and the Nagios server has 2.8.

I ran "make clean" in the original source directory on the monitored node, but I can still execute the check_nrpe plugin and see v2.12 returned.

How do I correctly remove NRPE v2.12 from the monitored node? I suspect that reinstalling the NRPE daemon with 2.8 will possibly clean this up.

Anyone?
 

Author Comment

by:kapshure
OK, so I went back and installed NRPE 2.8 on the monitored node, and I can verify that 2.8 is replying:

/usr/local/nagios/libexec/check_nrpe -H localhost
NRPE v2.8

But I am still getting "Connection refused" in Nagios. Can anyone shed any light on this?
 

Author Comment

by:kapshure
I went back and changed monitorednode.cfg on the Nagios server to point back to the "Home Page" check (this time with ww1 as the argument), even though this seemed to be incorrect previously:


define service{
        use generic-service
        host_name                       monitorednode
        service_description             Home Page
        check_command                   check_http!ww1
}

Then I bounced Nagios, and now the critical error message has cleared.

The only thing I get now that I'm not quite sure about:

OK - HTTP/1.1 301 Moved Permanently
 
LVL 18

Expert Comment

by:Sanga Collins
HTTP/1.1 301 Moved Permanently

is a problem with the config on the web server. That's where you need to troubleshoot the message.
 

Author Comment

by:kapshure
Alrighty! I'll look into it. But for all intents and purposes, would you say that Nagios can at least be relied on to monitor this host, even though this message is popping up? It's not being classified as Warning or Critical.

Thanks again for your help on this, sangamc.
 
LVL 18

Expert Comment

by:Sanga Collins
Yes, it is. I had a similar situation with a third-party web server. The site designer didn't think it a priority to fix the 301 redirect, so we monitored it from Nagios and took that into account. When the site went down due to a network outage, Nagios would show the site as down, and when it came back up, the status would return to OK, which is what we were looking for.
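
For anyone wondering why a 301 shows up as OK: as far as I know, check_http maps the HTTP status class to a Nagios state, and redirects are governed by its -f/--onredirect option, which defaults to ok. The function below is a rough model of that mapping, not the plugin's code. http_state is a made-up name, and codes outside 2xx to 5xx are deliberately not modelled.

```shell
# http_state CODE [ONREDIRECT]
# Print the Nagios state check_http would be expected to report.
# Hypothetical model; ONREDIRECT mirrors the plugin's -f/--onredirect value.
http_state() {
    code=$1
    onredirect=${2:-ok}     # plugin default for redirects is "ok"
    case $code in
        2??) echo OK ;;
        3??) case $onredirect in
                 ok)       echo OK ;;
                 warning)  echo WARNING ;;
                 critical) echo CRITICAL ;;
                 *)        echo "depends on the redirect target" ;;
             esac ;;
        4??) echo WARNING ;;
        5??) echo CRITICAL ;;
        *)   echo "not modelled" ;;
    esac
}
```

So http_state 301 gives OK, which matches what the thread shows. Passing -f follow to check_http makes it request the redirect target instead, so the result reflects the page users actually land on.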
 
LVL 30

Assisted Solution

by:Kerem ERSOY
Kerem ERSOY earned 250 total points
As I said earlier, this is not an issue with Nagios. This seems to be an iptables issue. You say:
- you can get a ping result for both 143 and 144
- you can telnet to 144 port 80 and get a response
- you get a "no route to host" from 143.

The response you get from 143 is an indication that your target host is running the iptables firewall at 143. Just add port 80 to the ports list and restart iptables:
- edit the /etc/sysconfig/iptables file
- locate this line:
-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
and add this line after it:
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
- issue this command to restart iptables:
service iptables restart
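
Before editing, it can help to check whether the saved ruleset already accepts the port. As background, the stock RH-Firewall-1-INPUT chain normally ends with a REJECT --reject-with icmp-host-prohibited rule, and a client hitting that rule typically reports "No route to host". The function below is a sketch that only greps the saved rules file; port_is_accepted is a made-up name, and it does not inspect the live ruleset.

```shell
# port_is_accepted PORT [RULES_FILE]
# Succeed if the saved rules contain an ACCEPT rule for TCP --dport PORT.
# Hypothetical helper; reads the file only, not the running firewall.
port_is_accepted() {
    port=$1
    rules=${2:-/etc/sysconfig/iptables}
    grep -E -- "^-A .*-p tcp .*--dport ${port}( |\$).*-j ACCEPT" "$rules" >/dev/null
}
```

Run as root on the monitored node, for example: port_is_accepted 80 || echo "port 80 is not opened in /etc/sysconfig/iptables". The live rules can be listed with iptables -L -n.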

Cheers,
K.

 

Author Closing Comment

by:kapshure
I found the binary difference in NRPE between the 2 systems. These guys just helped make the problem and resolution manifest.