Solved

Large number of Nagios alerts after a host comes up from being down.

Posted on 2014-11-07
Last Modified: 2014-11-22
Hi,

We are running Nagios 4.0.7 and whenever a host goes down (ping results time out) we get an alert that the host is down and nothing else, which is great. However, when the host comes back up, all of the other service checks immediately time out and start sending a massive number of alerts, one for each service. Then, as soon as the services come back up, we get another flood of alerts stating that the services have recovered.

Is there a way to delay service alerts after a host goes down and comes back up? For instance, a host goes down and we get an alert about the downed host. The host comes up and we get an alert that the host is up. If the services aren't okay after the host has been recovered for, say, 5 minutes, THEN we start to get service alerts. Is this possible?

Thank you
Question by:OAC Technology
11 Comments
 
LVL 35

Expert Comment

by:Seth Simmons
ID: 40429095
you could increase your service check interval or the number of soft-state retries before a hard alert is triggered (max_check_attempts)
if either of those is too small, the hard state is reached sooner and the notifications go out before the services have had time to recover
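
To illustrate the soft/hard point, here is a rough Python sketch. It is a simplified model written for this thread, not Nagios code: the function name and the list of check results are made up, and it assumes a problem only becomes "hard" (and notifies) after max_check_attempts consecutive failed checks.

```python
def first_hard_problem(results, max_check_attempts):
    """Return the 1-based index of the check at which a problem turns HARD.

    results: list of booleans, True meaning the check came back OK.
    Returns None if the service recovers before the retries run out.
    Simplified model only; real Nagios scheduling has more detail.
    """
    consecutive_failures = 0
    for index, ok in enumerate(results, start=1):
        if ok:
            # Any OK result resets the soft-state counter.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= max_check_attempts:
            return index
    return None


# Hypothetical service that fails twice right after the host returns.
after_host_recovery = [False, False, True, True]

print(first_hard_problem(after_host_recovery, max_check_attempts=1))
print(first_hard_problem(after_host_recovery, max_check_attempts=3))
```

With a single attempt the very first failed check would alert; with three attempts the service recovers while still in a soft state, so in this model no alert is sent at all.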
 
LVL 18

Expert Comment

by:Sanga Collins
ID: 40429116
You can also use host and service dependencies, so that when a host goes down, any hosts or services that depend on it will suppress their alerts. When the host comes back up, the dependent objects follow the same process. If a service goes down on its own, it will alert you as configured.
 
LVL 2

Author Comment

by:OAC Technology
ID: 40429163
Seth, this increase would delay service alerts across the board and not just if a host went down and came back up, correct? My hope was that there was a way to tell service alerts to hold off for a while only if the host went down and came back up. Otherwise we'll be waiting 5 minutes to be alerted if a service just decides to die.

Sanga, that's how we have it set now. If a host goes down, the services don't report that they are down, but the problem is that when the host comes back up, all of the services are still marked as down, so we get a flood of alerts.


Thanks for the help

 
LVL 35

Expert Comment

by:Seth Simmons
ID: 40429170
depends how you have things configured
you might have templates that all hosts follow or some might be customized
is it that important that you need to be notified that soon?  do you need service check intervals that short?

as far as dependencies go, services associated with a host are automatically dependent on that host
using dependencies is more for something in between to prevent false positives
for example, a remote site goes down, a system there could be reported down when it isn't.  having the gateway/router as its parent will make that system 'unreachable', because the parent is down and not the system itself
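
A small Python sketch of that gateway/router example, as a toy model only. The host names and the `classify` helper are hypothetical; the assumption is that a host failing its check is reported DOWN if it has no parents or at least one parent is UP, and UNREACHABLE if every parent is itself DOWN or UNREACHABLE.

```python
def classify(host, parents, failing):
    """Classify a host as UP, DOWN or UNREACHABLE in a toy parent model.

    parents: dict mapping a host name to a list of its parent host names.
    failing: set of host names whose own check is currently failing.
    """
    if host not in failing:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # Nothing sits between the monitor and this host, so it is DOWN.
        return "DOWN"
    if any(classify(p, parents, failing) == "UP" for p in host_parents):
        # A path to the host still works, so the host itself is the problem.
        return "DOWN"
    return "UNREACHABLE"


# Hypothetical remote site: the server sits behind the site router.
parents = {"remote-server": ["remote-router"]}

site_outage = {"remote-router", "remote-server"}
print(classify("remote-router", parents, site_outage))
print(classify("remote-server", parents, site_outage))
```

In the site outage case only the router counts as DOWN, so that is the one problem worth alerting on; the server behind it is merely UNREACHABLE.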
 
LVL 29

Expert Comment

by:Jan Springer
ID: 40429171
What Sanga said.  Do you have dependencies configured?
 
LVL 2

Author Comment

by:OAC Technology
ID: 40429286
How do I check to make sure I have dependencies set up/configured?
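
One rough way to check is to search the object configuration files for `define hostdependency` and `define servicedependency` blocks and for `parents` lines in host definitions. The Python sketch below does that on a config string; the helper name and the sample definitions are made up for illustration, and it does not follow templates or include directives.

```python
import re
from collections import Counter

# Matches the start of an object definition such as "define host {".
DEFINE_RE = re.compile(r"^\s*define\s+(\w+)\s*\{", re.MULTILINE)
# Matches a "parents" directive that has a value.
PARENTS_RE = re.compile(r"^\s*parents\s+\S", re.MULTILINE)


def summarize_dependencies(config_text):
    """Count dependency-related definitions in Nagios object config text."""
    kinds = Counter(DEFINE_RE.findall(config_text))
    return {
        "hostdependency": kinds.get("hostdependency", 0),
        "servicedependency": kinds.get("servicedependency", 0),
        "hosts_with_parents": len(PARENTS_RE.findall(config_text)),
    }


# Hypothetical sample; in practice, read the text of your own .cfg files.
SAMPLE = """
define host {
    host_name   web01
    parents     core-router
}
define servicedependency{
    host_name                       web01
    service_description             PING
    dependent_host_name             web01
    dependent_service_description   HTTP
    notification_failure_criteria   w,u,c
}
"""

print(summarize_dependencies(SAMPLE))
```

If all three counts come back as zero for your files, no explicit dependencies or parent relationships are defined there.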
 
LVL 35

Expert Comment

by:Seth Simmons
ID: 40429294
could you post your configuration file(s) to review?
 
LVL 2

Author Comment

by:OAC Technology
ID: 40429375
I've posted the configuration file for one of the servers I am monitoring (with details scrubbed). Are there any other files you need me to upload?
 
LVL 35

Expert Comment

by:Seth Simmons
ID: 40439876
there is nothing attached
 
LVL 2

Accepted Solution

by:
OAC Technology earned 0 total points
ID: 40447997
Not sure why the attachment didn't show up, but I was able to find a solution that works for us. We are using NAN (https://www.monitoringexchange.org/inventory/Utilities/AddOn-Projects/Notifications/NAN---Nagios-Notification-Daemon) to consolidate all of our alerts and it has been working great. It takes our flood of 200 messages within a 3-minute period and consolidates them into 1 message for alerts and 1 for recoveries.

Thank you
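
For anyone curious what consolidation means in practice, here is a minimal Python sketch of the general idea. It is not NAN's code or configuration: the function, the tuple layout and the fixed 180-second window are all assumptions made for illustration.

```python
from collections import defaultdict


def consolidate(notifications, window_seconds=180):
    """Collapse individual notifications into one digest line per window and kind.

    notifications: iterable of (timestamp, kind, host, service) tuples,
    where kind is "PROBLEM" or "RECOVERY" and timestamp is in seconds.
    """
    buckets = defaultdict(list)
    for timestamp, kind, host, service in sorted(notifications):
        # Fixed windows: everything inside the same window is grouped.
        window = int(timestamp // window_seconds)
        buckets[(window, kind)].append(f"{host}/{service}")
    return [
        f"{kind}: {len(items)} service(s): {', '.join(items)}"
        for (window, kind), items in sorted(buckets.items())
    ]


# Hypothetical burst after a host comes back up.
events = [
    (0, "PROBLEM", "web01", "HTTP"),
    (10, "PROBLEM", "web01", "SSH"),
    (60, "RECOVERY", "web01", "HTTP"),
    (70, "RECOVERY", "web01", "SSH"),
]

for line in consolidate(events):
    print(line)
```

Four separate notifications become two digest lines here, one for the problems and one for the recoveries.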
 
LVL 2

Author Closing Comment

by:OAC Technology
ID: 40459161
Found solution


Question has a verified solution.
