
Nagios alerts and network issues

Our AWS AMI instance is getting the errors below in its logs.

This seems to be tied to the intermittent Nagios alerts.

Subject: PROBLEM jenkins/Memory is CRITICAL

Notification Type: PROBLEM

Service: Memory
Host: jenkins
Address: jenkins
State: CRITICAL

Date/Time: Fri Dec 15 09:36:12 UTC 2017

Additional Info:

... has gone stale.


129260ms.
Dec 15 11:13:47 ip-172-31-16-141 dhclient[2100]: DHCPREQUEST on eth0 to 172.31.16.1 port 67 (xid=0x7a75a61f)
Dec 15 11:13:47 ip-172-31-16-141 dhclient[2100]: DHCPACK from 172.31.16.1 (xid=0x7a75a61f)
Dec 15 11:13:47 ip-172-31-16-141 dhclient[2100]: bound to 172.31.16.141 -- renewal in 1434 seconds.
Dec 15 11:13:47 ip-172-31-16-141 ec2net: [get_meta] Trying to get http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:e8:9e:54:f4:81/local-ipv4s
Dec 15 11:13:47 ip-172-31-16-141 ec2net: [rewrite_aliases] Rewriting aliases of eth0
Dec 15 11:15:29 ip-172-31-16-141 dhclient[2184]: XMT: Solicit on eth0, interval 117400ms.


Mark Bill

What is the issue here? Is the alert working? Is the server just performing poorly and you are not sure why? Can you give more information?

lhrslsshahi

ASKER

The issue is that we keep getting excessive Nagios alerts saying the following, and we have been unable to identify why we are receiving them. It is only this server.

Additional Info:

... has gone stale.

David Favor

Look at your Nagios config related to Jenkins.

The message seems to suggest that the Nagios discovery process related to Jenkins requires fixing.

Looks like maybe Nagios discovers the Jenkins process to track + then somehow the Jenkins process ID changes + Nagios keeps using a cached PID, rather than discovering the new PID.

Likely changing how your Nagios config for Jenkins discovery works, or changing your Nagios config to fire rediscovery at least once prior to erroring out, might be the approach to take.

Or if this is impossible, then write your own discovery script + integrate into Nagios.
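As a rough illustration of the "own discovery script" idea, here is a minimal Python sketch of a check that looks up the Jenkins PID afresh on every run instead of relying on a cached one. It follows the usual Nagios plugin convention of exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The `jenkins.war` search string and all function names are assumptions for illustration only; nothing in this thread shows how Jenkins is actually started on the host or how the existing check is wired up.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_cmdlines(proc_root="/proc"):
    """Map each live PID to its command line, read fresh on every call."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited between listdir() and open()
        # Arguments in /proc/<pid>/cmdline are NUL-separated.
        result[int(entry)] = raw.replace(b"\0", b" ").decode(errors="replace").strip()
    return result


def find_pids(cmdlines, needle):
    """Return the sorted PIDs whose command line contains `needle`."""
    return sorted(pid for pid, cmd in cmdlines.items() if needle in cmd)


def check(cmdlines, needle="jenkins.war"):
    """Return (exit_code, status_line); `needle` is a hypothetical marker."""
    pids = find_pids(cmdlines, needle)
    if not pids:
        return CRITICAL, f"CRITICAL - no process matching {needle!r}"
    return OK, f"OK - {needle!r} running as pid {', '.join(map(str, pids))}"
```

A wrapper script would call `check(read_cmdlines())`, print the status line, and exit with the returned code. Because the process table is re-read each time, a Jenkins restart that changes the PID would not leave the check pointing at a dead process.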

lhrslsshahi

ASKER

We don't have any issues with other AWS Jenkins instances; however, this is the only instance whose /var/log/messages contains noise like the below:

29260ms.
Dec 15 11:13:47 ip-172-31-16-141 dhclient[2100]: DHCPREQUEST on eth0 to 172.31.16.1 port 67 (xid=0x7a75a61f)
Dec 15 11:13:47 ip-172-31-16-141 dhclient[2100]: DHCPACK from 172.31.16.1 (xid=0x7a75a61f)
Dec 15 11:13:47 ip-172-31-16-141 dhclient[2100]: bound to 172.31.16.141 -- renewal in 1434 seconds.
Dec 15 11:13:47 ip-172-31-16-141 ec2net: [get_meta] Trying to get http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:e8:9e:54:f4:81/local-ipv4s
Dec 15 11:13:47 ip-172-31-16-141 ec2net: [rewrite_aliases] Rewriting aliases of eth0
Dec 15 11:15:29 ip-172-31-16-141 dhclient[2184]: XMT: Solicit on eth0, interval 117400ms.

David Favor

Likely you'll have to resolve this problem first...

The following command ends in a timeout, so the link isn't responding correctly....

imac> curl -I -L --connect-timeout 5 http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:e8:9e:54:f4:81/local-ipv4s


lhrslsshahi

ASKER

I knew there was an issue in the logs. What could be causing this timeout?

Phil Phillips

If you have any additional graphs related to memory usage, you might want to comb over those as well. It could be that the process is actually running low on memory.

As for the timeout related to http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:e8:9e:54:f4:81/local-ipv4s: that link will only work within an AWS instance. The logs don't indicate an error there, so it is probably working fine for your instance. As a sanity check, you can always run the command yourself from the instance:

curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:e8:9e:54:f4:81/local-ipv4s
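The same sanity check can be scripted so that it fails fast instead of hanging when run from a machine outside AWS. The following is a minimal Python sketch using only the standard library; the function names are made up for illustration, and it assumes the instance still allows unauthenticated (IMDSv1-style) metadata requests. On an instance that requires session tokens the request would be rejected and the function would simply return `None`.

```python
import urllib.error
import urllib.request

METADATA_BASE = "http://169.254.169.254/latest/meta-data"


def local_ipv4s_url(mac):
    """Build the metadata URL for the private IPv4 addresses of one interface."""
    return f"{METADATA_BASE}/network/interfaces/macs/{mac}/local-ipv4s"


def fetch_metadata(url, timeout=5.0):
    """Return the response body, or None if the endpoint cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip()
    except (urllib.error.URLError, OSError):
        # Covers timeouts, refused connections, and HTTP error statuses.
        return None
```

Calling `fetch_metadata(local_ipv4s_url("0a:e8:9e:54:f4:81"))` on the instance itself should return its private address list, while the same call from a laptop or desktop is expected to give `None` after the timeout, since the link-local address is only served inside the instance.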


lhrslsshahi

ASKER

The curl command works. We have moved to a larger instance but still see the same issue: intermittent Nagios stale messages.

Please see the logs below from the time this happens.

Dec 19 12:17:34 ip-172-31-16-141 kernel: [ 2318.600506] veth39a945b: renamed from eth0
Dec 19 12:17:34 ip-172-31-16-141 kernel: [ 2318.608144] docker0: port 3(veth0ff3c0a) entered disabled state
Dec 19 12:17:34 ip-172-31-16-141 kernel: [ 2318.660295] eth0: renamed from veth39a945b
Dec 19 12:17:34 ip-172-31-16-141 kernel: [ 2318.676520] docker0: port 3(veth0ff3c0a) entered forwarding state
Dec 19 12:17:34 ip-172-31-16-141 kernel: [ 2318.693945] docker0: port 3(veth0ff3c0a) entered forwarding state
Dec 19 12:17:49 ip-172-31-16-141 kernel: [ 2333.728069] docker0: port 3(veth0ff3c0a) entered forwarding state
Dec 19 12:18:18 ip-172-31-16-141 dhclient[2249]: XMT: Solicit on eth0, interval 108430ms.
Dec 19 12:20:07 ip-172-31-16-141 dhclient[2249]: XMT: Solicit on eth0, interval 110140ms.
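To line these log entries up against the times of the Nagios stale alerts, it may help to pull the dhclient Solicit events out of /var/log/messages with their timestamps. Below is a minimal Python sketch of that; the regular expression is written against the exact line shape shown above, and the function name is hypothetical. Classic syslog timestamps carry no year, so the caller has to supply one.

```python
import re
from datetime import datetime

# Matches lines such as:
# Dec 19 12:18:18 ip-172-31-16-141 dhclient[2249]: XMT: Solicit on eth0, interval 108430ms.
SOLICIT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ dhclient\[\d+\]: "
    r"XMT: Solicit on (?P<iface>\S+), interval (?P<ms>\d+)ms\.$"
)


def parse_solicits(lines, year):
    """Yield (timestamp, interface, interval_ms) for each dhclient Solicit line."""
    for line in lines:
        m = SOLICIT_RE.match(line.strip())
        if m is None:
            continue  # kernel, ec2net, and other dhclient lines are skipped
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        yield ts, m["iface"], int(m["ms"])


sample = [
    "Dec 19 12:17:49 ip-172-31-16-141 kernel: [ 2333.728069] docker0: port 3(veth0ff3c0a) entered forwarding state",
    "Dec 19 12:18:18 ip-172-31-16-141 dhclient[2249]: XMT: Solicit on eth0, interval 108430ms.",
    "Dec 19 12:20:07 ip-172-31-16-141 dhclient[2249]: XMT: Solicit on eth0, interval 110140ms.",
]
events = list(parse_solicits(sample, year=2017))
```

With the real log, `open("/var/log/messages")` can be passed in place of `sample`. Comparing the resulting timestamps with the alert times from the Nagios notifications would show whether the two actually coincide or whether the Solicit lines are just regular background noise.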
