Nagios stopped email notifications for service and host alerts

Respected Experts,

Until last week all was well with Nagios, and I was getting email notifications for service and host alerts. Suddenly they stopped. I was doing maintenance last night and didn't get a single email notification from Nagios, even though I rebooted the servers. I restarted the Nagios server, to no avail. I can't figure out what is causing this and need your help.

Here are my configuration and logs.


Host name: nagios (host name changed)

commands.cfg

# 'notify-host-by-email' command definition
define command{
     command_name     notify-host-by-email
     command_line     /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
     }

# 'notify-service-by-email' command definition
define command{
     command_name     notify-service-by-email
     command_line     /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
     }

contacts.cfg

###############################################################################
###############################################################################
#
# CONTACTS
#
###############################################################################
###############################################################################

# Just one contact defined by default - the Nagios admin (that's you)
# This contact definition inherits a lot of default values from the 'generic-contact'
# template which is defined elsewhere.

define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user
        service_notifications_enabled   1
        host_notifications_enabled      1
        service_notification_period     24x7
        service_notification_options    w,u,c,r
        host_notification_period        24x7
        host_notification_options       d,u,r
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           istalert@abc.com        ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }




define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }

nagios.cfg

#Enable notifications
enable_notifications=1

/var/log/maillog


Oct 11 08:35:24 nagios sendmail[22529]: r9BCZ1lt022529: from=root, size=9845, class=0, nrcpts=1, msgid=<201310111235.r9BCZ1lt022529@nagios.abc.com>, relay=root@localhost
Oct 11 08:35:24 nagios sendmail[22528]: r9BCZ1SS022528: from=root, size=31808, class=0, nrcpts=1, msgid=<201310111235.r9BCZ1SS022528@nagios.abc.com>, relay=root@localhost
Oct 11 08:35:24 nagios sendmail[22786]: r9BCZOuI022786: from=<root@nagios.abc.com>, size=10184, class=0, nrcpts=1, msgid=<201310111235.r9BCZ1lt022529@nagios.abc.com>, proto=ESMTP, daemon=MTA, relay=localhost.localdomain [127.0.0.1]
Oct 11 08:35:24 nagios sendmail[22529]: r9BCZ1lt022529: to=root, ctladdr=root (0/0), delay=00:00:23, xdelay=00:00:00, mailer=relay, pri=39845, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (r9BCZOuI022786 Message accepted for delivery)
Oct 11 08:35:24 nagios sendmail[22787]: r9BCZOu8022787: from=<root@nagios.abc.com>, size=32147, class=0, nrcpts=1, msgid=<201310111235.r9BCZ1SS022528@nagios.abc.com>, proto=ESMTP, daemon=MTA, relay=localhost.localdomain [127.0.0.1]
Oct 11 08:35:24 nagios sendmail[22528]: r9BCZ1SS022528: to=root, ctladdr=root (0/0), delay=00:00:23, xdelay=00:00:00, mailer=relay, pri=61808, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (r9BCZOu8022787 Message accepted for delivery)
Oct 11 08:35:24 nagios sendmail[22788]: r9BCZOuI022786: to=<root@nagios.abc.com>, ctladdr=<root@nagios.abc.com> (0/0), delay=00:00:00, xdelay=00:00:00, mailer=local, pri=40459, dsn=2.0.0, stat=Sent
Oct 11 08:35:24 nagios sendmail[22789]: r9BCZOu8022787: to=<root@nagios.abc.com>, ctladdr=<root@nagios.abc.com> (0/0), delay=00:00:00, xdelay=00:00:00, mailer=local, pri=62422, dsn=2.0.0, stat=Sent

I get email when I run the following:

cat file | mail -s "Test Mail" istalert@abc.com

[root@nagios ~]# tail -f /usr/local/nagios/var/nagios.log

[1381478110] SERVICE ALERT: Network-ISP;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 6.49 ms
[1381481230] SERVICE ALERT: abc-dc2;CPU Load;UNKNOWN;SOFT;1;NSClient - ERROR: Could not get data for 5 please check log for details
[1381481350] SERVICE ALERT: abc-dc2;CPU Load;OK;SOFT;2;CPU Load 0% (5 min average)
[1381481420] Auto-save of retention data completed successfully.
[1381482910] SERVICE ALERT: Network_CORE;PING;WARNING;SOFT;1;PING WARNING - Packet loss = 0%, RTA = 213.39 ms
[1381482970] SERVICE ALERT: Network_CORE;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.34 ms
[1381485020] Auto-save of retention data completed successfully.
[1381488620] Auto-save of retention data completed successfully.
[1381492220] Auto-save of retention data completed successfully.
[1381495520] SERVICE ALERT: Network_Voice;PING;WARNING;SOFT;1;PING WARNING - Packet loss = 0%, RTA = 223.87 ms


Thanks,

Deorali

Seth Simmons (Sr. Systems Administrator) commented:
if you look at the nagios event log at the time you were doing maintenance, do you see any entries that a notification was sent for the server being down?  if sendmail is working then i would look at the event log to make sure nagios was actually sending the message
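A rough way to run that check from the command line is sketched below. The log path is the one used elsewhere in this thread; the function name `list_notifications` is only an illustrative helper, not something Nagios ships. It relies on Nagios logging sent notifications as `HOST NOTIFICATION:` or `SERVICE NOTIFICATION:` entries, separate from the `HOST ALERT:` / `SERVICE ALERT:` state-change entries.

```shell
# Hypothetical helper: print every notification Nagios logged as sent.
# Notification entries contain "HOST NOTIFICATION:" or "SERVICE NOTIFICATION:",
# while plain state changes are logged as "HOST ALERT:" / "SERVICE ALERT:".
list_notifications() {
    log_file=${1:-/usr/local/nagios/var/nagios.log}
    grep -E '(HOST|SERVICE) NOTIFICATION:' "$log_file"
}

# Usage (path assumed from this thread):
#   list_notifications /usr/local/nagios/var/nagios.log
```

If this prints nothing for the maintenance window, Nagios never tried to send mail, so the problem is on the Nagios side rather than with sendmail.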

Deorali (author) commented:
Some of the entries from the Event log: I do not see service or host notifications.

Host Down [2013-10-10 18:38:27] HOST ALERT: abc-sql;DOWN;SOFT;5;CRITICAL - Host Unreachable (10.1.2.104)
Service Critical [2013-10-10 18:38:17] SERVICE ALERT: abc-sql;Uptime;CRITICAL;HARD;1;No route to host
Service Critical [2013-10-10 18:37:57] SERVICE ALERT: abc-sql;CPU Load;CRITICAL;HARD;2;No route to host
Host Down [2013-10-10 18:37:47] HOST ALERT: abc-sql;DOWN;SOFT;4;CRITICAL - Host Unreachable (10.1.2.104)
Host Down [2013-10-10 18:37:17] HOST ALERT: abc-sql;DOWN;SOFT;3;CRITICAL - Host Unreachable (10.1.2.104)
Service Critical [2013-10-10 18:36:37] SERVICE ALERT: abc-sql;C:\ Drive Space;CRITICAL;HARD;2;No route to host
Host Down [2013-10-10 18:36:37] HOST ALERT: abc-sql;DOWN;SOFT;2;CRITICAL - Host Unreachable (10.1.2.104)
Host Down [2013-10-10 18:36:07] HOST ALERT: abc-sql;DOWN;SOFT;1;CRITICAL - Host Unreachable (10.1.2.104)
Service Critical [2013-10-10 18:35:57] SERVICE ALERT: abc-sql;CPU Load;CRITICAL;SOFT;1;No route to host
Host Up [2013-10-10 18:34:47] HOST ALERT: abc-fileshare;UP;SOFT;4;PING OK - Packet loss = 0%, RTA = 1.10 ms
Service Unknown [2013-10-10 18:34:37] SERVICE ALERT: abc-sql;C:\ Drive Space;UNKNOWN;SOFT;1;Free disk space : Invalid drive

Seth Simmons (Sr. Systems Administrator) commented:
the most recent event is that the host is down; if you go further, are there entries after that saying that notification was sent?  are notifications enabled for that specific host?

Deorali (author) commented:
There are no entries logged for SERVICE or HOST NOTIFICATION being sent on host being down for the recent events.

Seth Simmons (Sr. Systems Administrator) commented:
are notifications enabled for that host or service?
are notifications enabled globally?

if the nagios event log has nothing saying it sent a notification, something somewhere has it disabled either individually or globally

Deorali (author) commented:
Thanks Seth2740. Here are the places I have looked. Please let me know if I am looking in the wrong places.

nagios.cfg:
#Enable notifications
enable_notifications=1

windows.cfg and linux.cfg
Hosts and services are defined for all the Windows and Linux servers.

contacts.cfg

define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user
        service_notifications_enabled   1
        host_notifications_enabled      1
        service_notification_period     24x7
        service_notification_options    w,u,c,r
        host_notification_period        24x7
        host_notification_options       d,u,r
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           istalert@abc.com        ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

When I took over, nagios was already running.

Seth Simmons (Sr. Systems Administrator) commented:
or another thing i thought of is that the host or service might have notifications enabled but the notification period might be different

dipopo commented:
Have you run a check on your config? It would be nice to see the output.

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

dipopo commented:
Also, are there entries for Nagios in SELinux?

Try switching SELinux to permissive mode temporarily and then triggering a failure to generate notifications.

setenforce 0

Deorali (author) commented:
timeperiods.cfg
define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }
 check




# 'workhours' timeperiod definition
define timeperiod{
        timeperiod_name workhours
        alias           Normal Work Hours
        monday          09:00-17:00
        tuesday         09:00-17:00
        wednesday       09:00-17:00
        thursday        09:00-17:00
        friday          09:00-17:00
        }


# 'none' timeperiod definition
define timeperiod{
        timeperiod_name none
        alias           No Time Is A Good Time
        }


# Some U.S. holidays
# Note: The timeranges for each holiday are meant to *exclude* the holidays from being
# treated as a valid time for notifications, etc.  You probably don't want your pager
# going off on New Year's.  Although your employer might... :-)
define timeperiod{
        name                    us-holidays
        timeperiod_name         us-holidays
        alias                   U.S. Holidays

        january 1               00:00-00:00     ; New Years
        monday -1 may           00:00-00:00     ; Memorial Day (last Monday in May)
        july 4                  00:00-00:00     ; Independence Day
        monday 1 september      00:00-00:00     ; Labor Day (first Monday in September)
        thursday 4 november     00:00-00:00     ; Thanksgiving (4th Thursday in November)
        december 25             00:00-00:00     ; Christmas
        }


# This defines a modified "24x7" timeperiod that covers every day of the
# year, except for U.S. holidays (defined in the timeperiod above).
define timeperiod{
        timeperiod_name 24x7_sans_holidays
        alias           24x7 Sans Holidays

        use             us-holidays             ; Get holiday exceptions from other timeperiod

        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }


[root@nagios objects]# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Nagios Core 3.4.1
Copyright (c) 2009-2011 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 05-11-2012
License: GPL

Website: http://www.nagios.org
Reading configuration data...
   Read main config file okay...
Processing object config file '/usr/local/nagios/etc/objects/commands.cfg'...
Processing object config file '/usr/local/nagios/etc/objects/contacts.cfg'...
Processing object config file '/usr/local/nagios/etc/objects/timeperiods.cfg'...
Processing object config file '/usr/local/nagios/etc/objects/templates.cfg'...
Processing object config file '/usr/local/nagios/etc/objects/linux.cfg'...
Processing object config file '/usr/local/nagios/etc/objects/windows.cfg'...
Processing object config file '/usr/local/nagios/etc/objects/switch.cfg'...
Processing object config file '/usr/local/nagios/etc/objects/printer.cfg'...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking services...
        Checked 207 services.
Checking hosts...
        Checked 59 hosts.
Checking host groups...
        Checked 4 host groups.
Checking service groups...
        Checked 0 service groups.
Checking contacts...
        Checked 1 contacts.
Checking contact groups...
        Checked 1 contact groups.
Checking service escalations...
        Checked 0 service escalations.
Checking service dependencies...
        Checked 0 service dependencies.
Checking host escalations...
        Checked 0 host escalations.
Checking host dependencies...
        Checked 0 host dependencies.
Checking commands...
        Checked 25 commands.
Checking time periods...
        Checked 5 time periods.
Checking for circular paths between hosts...
Checking for circular host and service dependencies...
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

dipopo commented:
Hmmm and the SELinux settings?

Seth Simmons (Sr. Systems Administrator) commented:
for the services/hosts that didn't send a notification, what time period is defined for it?

in the nagios interface, at the bottom left select view config then on the right select either host or service (let's start with host) and look at the list of hosts.

find the one that you did maintenance on last night and look at the columns for the notification period.  if it's in workhours and you did a reboot late in the evening, then that would explain it since it's configured to notify only between 9am and 5pm.

i would examine how those hosts and services are configured

if you had errors in the nagios config then it wouldn't have even started after you rebooted the nagios server

Deorali (author) commented:
dipopo: Red Hat

seth2740:

Configuration -> Hosts -> Notification Period for the host is 24x7, Notification Interval is 30 minutes

define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }
 check

Seth Simmons (Sr. Systems Administrator) commented:
have you checked if notifications are enabled both for the host and globally?

when you look at the status screen it would show in red font if notifications are disabled globally

for the host, you would have to select hosts on the left and find it on the right then click on it to see the details; it will say on the right if notifications are enabled

Deorali (author) commented:
Seth2740,
Notifications are globally enabled.

nagios.cfg
#Enable notifications
enable_notifications=1


Notifications enabled confirmed from host state information:

Notifications: ENABLED

I have also attached a screen shot.

Deorali (author) commented:
I do get an email when I send a custom host notification for any host from the Nagios interface.

Seth Simmons (Sr. Systems Administrator) commented:
is that server in the screenshot the one you rebooted last night?  it shows uptime of 18+ days and no state change since 9/23

Deorali (author) commented:
This is the server I rebooted last evening.

Seth Simmons (Sr. Systems Administrator) commented:
looks like the host check count is 10
once it hits that level then it will send an email

what probably happened was, the server went down but came back up before that 10th check which is why it didn't send an email

you can reduce the check count and/or reduce the check interval

Deorali (author) commented:
Where do I go to change the host check count? I appreciate your help.

Seth Simmons (Sr. Systems Administrator) commented:
now that i look more closely at the event log you pasted earlier, it makes sense

you can see the host is listed as down, but those are soft alerts (attempts 3, 4, 5). if it had stayed down and continued to attempt 10, it would have become a hard alert and sent the email
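To see how far a host got through its check attempts, something like the following sketch can help. The helper name `host_alert_states` is made up for illustration, and the default log path is the one from this thread. It reads `HOST ALERT:` entries, whose fields are host, state, state type, attempt number and plugin output, and prints the state, state type and attempt for one host.

```shell
# Hypothetical helper: show how far a host got through its check attempts.
# Fields after "HOST ALERT: " are host;state;state_type;attempt;plugin output.
host_alert_states() {
    host=$1
    log_file=${2:-/usr/local/nagios/var/nagios.log}
    grep -F "HOST ALERT: ${host};" "$log_file" |
        sed 's/.*HOST ALERT: //' |
        awk -F';' '{ print $2, $3, "attempt " $4 }'
}

# Usage (host name taken from the log excerpt in this thread):
#   host_alert_states abc-sql
```

If every line for the outage shows SOFT, the host recovered before reaching a HARD state, and no problem notification would be expected.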

Seth Simmons (Sr. Systems Administrator) commented:
it could be in a couple places - either the host definition or in a template
if you look at the configuration file, if you see a line in the host definition starting with 'use' then it means it will also include values from a template

if not, then everything is defined there
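A quick way to locate those settings is sketched below. The helper name `show_check_settings` is hypothetical, and the file paths in the usage comment are the ones reported by the pre-flight check earlier in this thread. It prints, with line numbers, the `use`, `name` and `host_name` lines together with the check-timing directives, including the older `normal_check_interval` and `retry_check_interval` spellings.

```shell
# Hypothetical helper: print the template reference and check-timing
# directives found in an object config file, with line numbers.
show_check_settings() {
    cfg_file=$1
    grep -nE '^[[:space:]]*(use|name|host_name|max_check_attempts|check_interval|normal_check_interval|retry_interval|retry_check_interval)[[:space:]]' "$cfg_file"
}

# Usage (paths assumed from the pre-flight check output):
#   show_check_settings /usr/local/nagios/etc/objects/windows.cfg
#   show_check_settings /usr/local/nagios/etc/objects/templates.cfg
```

A value set directly in the host definition takes precedence over the same directive inherited from the template named on its `use` line.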

Deorali (author) commented:
Yes, the definition starts with 'use'. I made the following changes in templates.cfg:

max_check_attempts 5
normal_check_interval 5

and did service nagios restart

I will wait and see.

Thanks,

Seth Simmons (Sr. Systems Administrator) commented:
that would mean up to 25 minutes before it sends an email: 5 check attempts, one every 5 minutes

changing the template would change everything that uses it so keep that in mind

Deorali (author) commented:
seth2740,

The default value of max_check_attempts was 10 and I changed it to 5. I restarted one server and got no email notification. Here is what I see in the event log:

Service Ok [2013-10-11 13:19:34] SERVICE ALERT: rds;Uptime;OK;HARD;1;System Uptime - 0 day(s) 0 hour(s) 5 minute(s)
Service Ok [2013-10-11 13:17:54] SERVICE ALERT: rds;W3SVC;OK;SOFT;2;W3SVC: Started
Service Ok [2013-10-11 13:17:54] SERVICE ALERT: rds;C:\ Drive Space;OK;SOFT;2;c:\ - total: 59.90 Gb - used: 16.91 Gb (28%) - free 42.99 Gb (72%)
Service Ok [2013-10-11 13:17:24] SERVICE ALERT: rds;NSClient++ Version;OK;SOFT;3;NSClient++ 0,4,1,101 2013-05-18
Service Critical [2013-10-11 13:15:54] SERVICE ALERT: rds;W3SVC;CRITICAL;SOFT;1;Connection refused
Service Critical [2013-10-11 13:15:54] SERVICE ALERT: rds;C:\ Drive Space;CRITICAL;SOFT;1;Connection refused
Service Critical [2013-10-11 13:15:24] SERVICE ALERT: rds;NSClient++ Version;CRITICAL;SOFT;2;Connection refused
Host Up [2013-10-11 13:15:14] HOST ALERT: rds;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 24.00 ms
Service Critical [2013-10-11 13:14:34] SERVICE ALERT: rds;Uptime;CRITICAL;HARD;1;Connection refused
Host Down [2013-10-11 13:14:04] HOST ALERT: rds;DOWN;SOFT;1;(Host Check Timed Out)
Service Critical [2013-10-11 13:13:24] SERVICE ALERT: rds;NSClient++ Version;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds

In order to generate an email alert notification as soon as the host or service is not available what value should I set it to?

Thanks,

Seth Simmons (Sr. Systems Administrator) commented:
i normally set the max check attempts to 3 and the check interval to 2
i forgot about the retry interval... that is the interval used for rechecks while it's in a soft state
you could set that to 1 or 2
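As a back-of-the-envelope check on these values, the sketch below estimates how long a problem has to persist after the first failed check before the state goes hard and a notification goes out. It is only an approximation: it assumes each remaining attempt is scheduled one retry interval after the previous one, ignores plugin run time and timeouts, and takes `interval_length` (seconds per time unit in nagios.cfg) to be the default of 60. The function name `notify_delay_seconds` is made up for this example.

```shell
# Rough estimate of the time between the first failed check and the HARD
# state that triggers a notification. Assumes the remaining attempts are
# each scheduled retry_interval apart and ignores plugin run time/timeouts.
# interval_length is the nagios.cfg setting (seconds per time unit, default 60).
notify_delay_seconds() {
    max_check_attempts=$1
    retry_interval=$2
    interval_length=${3:-60}
    echo $(( (max_check_attempts - 1) * retry_interval * interval_length ))
}

notify_delay_seconds 3 1    # max check attempts 3, retry interval 1 -> 120
notify_delay_seconds 3 2    # max check attempts 3, retry interval 2 -> 240
```

A server that reboots faster than this estimate can come back up while still in a soft state, in which case no email is sent.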

Deorali (author) commented:
Thank you very much Seth2740.

I changed it to:

max_check_attempts 2
check_interval 2
retry_interval 2

I rebooted server rds and got an email notification.

Event log now shows:
Service Critical [2013-10-11 13:54:59] SERVICE ALERT: rds;NSClient++ Version;CRITICAL;HARD;2;CRITICAL - Socket timeout after 10 seconds
Service Critical [2013-10-11 13:54:49] SERVICE ALERT: rds;Memory Usage;CRITICAL;HARD;2;CRITICAL - Socket timeout after 10 seconds
Host Notification [2013-10-11 13:54:49] HOST NOTIFICATION: nagiosadmin;rds;DOWN;notify-host-by-email;(Host Check Timed Out)
Host Down [2013-10-11 13:54:49] HOST ALERT: rds;DOWN;HARD;2;(Host Check Timed Out)
Service Critical [2013-10-11 13:54:19] SERVICE ALERT: rds;W3SVC;CRITICAL;HARD;2;CRITICAL - Socket timeout after 10 seconds
Service Critical [2013-10-11 13:54:19] SERVICE ALERT: rds;C:\ Drive Space;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
Service Critical [2013-10-11 13:53:39] SERVICE ALERT: rds;Uptime;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
Host Down [2013-10-11 13:53:09] HOST ALERT: rds;DOWN;SOFT;1;(Host Check Timed Out)
Service Critical [2013-10-11 13:52:59] SERVICE ALERT: rds;NSClient++ Version;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
Service Critical [2013-10-11 13:52:49] SERVICE ALERT: rds;Memory Usage;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
Service Warning [2013-10-11 13:52:09] SERVICE ALERT: rds;W3SVC;WARNING;SOFT;1;W3SVC: Error

I will wait until Monday morning and see how it goes.

Thank you once again seth2740.