Large number of Nagios alerts after a host comes up from being down.

Hi,

We are running Nagios 4.0.7 and whenever a host goes down (ping results time out) we get an alert that the host is down and nothing else, which is great. However, when the host comes back up, all of the other service checks immediately time out and start sending a massive amount of alerts about each service. Then, as soon as the services come back up, we get another massive amount of alerts stating that the services are recovered.

Is there a way to delay service alerts after a host goes down and comes back up? For instance, a host goes down, and we get an alert regarding the downed host. The host comes up, and we get an alert that the host is up. If the services aren't okay after the host has been recovered for, say, 5 minutes, THEN we start to get service alerts. Is this possible?

Thank you
OAC Technology (Professional Nerds) Asked:
Seth Simmons (Sr. Systems Administrator) Commented:
you could increase your service check interval or the number of soft (retry) attempts before a hard alert is triggered
if either of those is too small, the hard alert is triggered sooner, which would cause what you are seeing
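
For reference, the settings described here correspond to the `check_interval`, `retry_interval`, and `max_check_attempts` directives of a Nagios service definition. The following is a minimal sketch, not the poster's actual configuration: the host name `web01`, the template name `generic-service`, and all values are assumptions chosen for illustration.

```
define service {
    use                   generic-service   ; assumed template name
    host_name             web01             ; hypothetical host
    service_description   HTTP
    check_command         check_http
    check_interval        5    ; time units between checks while the state is hard
    retry_interval        1    ; time units between re-checks while the state is soft
    max_check_attempts    5    ; consecutive non-OK results needed before the state turns hard
}
```

With the default `interval_length` of 60 seconds, the time units are minutes. Raising these values delays every alert for the service, not only alerts that follow a host recovery.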
Sanga Collins (Systems Admin) Commented:
You can also use host and service dependencies, so that when a host goes down, any hosts or services that depend on it will suppress their alerts. When the host comes back up, the dependencies will follow the same process. If a service goes down on its own, it will alert you as configured.
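
As an illustration of the dependency approach, a service dependency in Nagios object configuration might look like the sketch below. It assumes a hypothetical host `web01` that already has services named `PING` and `HTTP` defined; none of these names come from the thread.

```
define servicedependency {
    host_name                       web01   ; hypothetical host running the master service
    service_description             PING    ; master service (assumed to exist)
    dependent_host_name             web01
    dependent_service_description   HTTP    ; service whose notifications get suppressed
    execution_failure_criteria      n       ; never skip checks of the dependent service
    notification_failure_criteria   w,u,c   ; suppress notifications while the master is WARNING, UNKNOWN, or CRITICAL
}
```

Here the `HTTP` service keeps being checked, but its notifications are held back while `PING` is in a non-OK state.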
OAC Technology (Professional Nerds) Author Commented:
Seth, this increase would delay service alerts across the board and not just if a host went down and came back up, correct? My hope was that there was a way to tell service alerts to hold off for a while only if the host went down and came back up. Otherwise we'll be waiting 5 minutes to be alerted if a service just decides to die.

Sanga, that's how we have it set now. If a host goes down, the services don't report that they are down, but the problem is that when the host comes back up, all of the services are still marked as down, so we get a flood of alerts.

Thanks for the help
Seth Simmons (Sr. Systems Administrator) Commented:
depends how you have things configured
you might have templates that all hosts follow, or some might be customized
is it that important that you need to be notified that soon? do you need service check intervals that short?

as far as dependencies go, services associated with a host are automatically dependent on that host
using dependencies is more for something in between, to prevent false positives
for example, if a remote site goes down, a system there could be reported down when it isn't. having the gateway/router as its parent will make that system 'unreachable', because the parent is down and not the system itself
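
The gateway/router example can be sketched with the `parents` directive of a host definition. The host names, the `generic-host` template name, and the addresses below are hypothetical placeholders, not taken from the thread.

```
define host {
    use         generic-host    ; assumed template name
    host_name   remote-gw       ; hypothetical gateway/router at the remote site
    address     192.0.2.1
}

define host {
    use         generic-host
    host_name   remote-server   ; hypothetical system behind the gateway
    address     192.0.2.10
    parents     remote-gw       ; if remote-gw is DOWN, this host is reported UNREACHABLE rather than DOWN
}
```

Whether UNREACHABLE states generate notifications depends on the host's `notification_options` setting.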
Jan Springer Commented:
What Sanga said.  Do you have dependencies configured?
OAC Technology (Professional Nerds) Author Commented:
How do I check to make sure I have dependencies set up/configured?
Seth Simmons (Sr. Systems Administrator) Commented:
could you post your configuration file(s) to review?
OAC Technology (Professional Nerds) Author Commented:
I've posted the configuration file for one of the servers I am monitoring (with details scrubbed). Are there any other files you need me to upload?
Seth Simmons (Sr. Systems Administrator) Commented:
there is nothing attached
OAC Technology (Professional Nerds) Author Commented:
Not sure why the attachment didn't show up, but I was able to find a solution that works for us. We are using NAN (https://www.monitoringexchange.org/inventory/Utilities/AddOn-Projects/Notifications/NAN---Nagios-Notification-Daemon) to consolidate all of our alerts, and it has been working great. It takes our flood of 200 messages within a 3-minute period and consolidates them into one message for alerts and one for recoveries.

Thank you
OAC Technology (Professional Nerds) Author Commented:
Found solution