We were getting a lot of "Heartbeat Alerts" in our SCOM environment. For those of you who might not be familiar with SCOM, a "heartbeat" is a packet of data sent from the agent to the management server on a regular basis, basically letting the management server know, “Hey, I am still OK and here”. The default interval is 60 seconds, but this is customizable. When the guys checked these agents, they were online and functioning with no apparent issues, which made me think the only thing it could be was stale alerts or something wrong with the agents.
Now, I love automation and I am always looking for an excuse to use Orchestrator, so when the guys asked if I could make it easier for them by automatically finding and repairing these agents, I jumped at it, literally.
System Center Orchestrator is Microsoft’s workflow management solution that allows you to automate the creation, deployment and monitoring of resources in your environment. You will notice that I mention the word “Runbooks”. Runbooks contain the individual instructions for your automation process; each step is called an activity, and each activity has configurable settings.
While creating mine I found this article from Nathan Olmstead: http://blogs.technet.com/b/systemcenterramblings/archive/2014/03/22/runbook-for-persisting-stale-heartbeat-alerts-in-scom.aspx
With mine, I needed to check DNS, ping the server, update my alerts and send notifications to the BackOffice team about any failures during the remediation process so that they could act accordingly, so you will see a few more activities added.
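To give an idea of how such a chain of checks hangs together outside of Orchestrator, here is a minimal Python sketch. It is not the runbook itself: the names `CheckResult`, `run_checks` and `needs_notification` are made up for illustration, the individual checks are passed in as plain functions, and stopping at the first failure is an assumption of this sketch rather than something the runbook is known to do.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    """Outcome of one remediation check (hypothetical structure)."""
    name: str
    passed: bool


def run_checks(server: str,
               checks: List[Tuple[str, Callable[[str], bool]]]) -> List[CheckResult]:
    """Run each (name, check) pair in order and stop at the first failure."""
    results = []
    for name, check in checks:
        passed = bool(check(server))
        results.append(CheckResult(name, passed))
        if not passed:
            # Assumption: later steps are skipped once one check fails.
            break
    return results


def needs_notification(results: List[CheckResult]) -> bool:
    """True if any check failed, meaning the team should be notified."""
    return any(not r.passed for r in results)
```

Each check (DNS, ping and so on) would be a small function that takes the server name and returns True or False, so a failed step can be named in the notification.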
I am also in the process of adding HP Service Manager Integration, allowing us to have Orchestrator log an incident (Service Ticket) automatically instead of needing our Helpdesk to log it for us, saving us time and giving us reporting. This would also give us an additional notification channel, making sure that nothing is missed.
Here is a view of the runbook:
I have attached an Activity Reference to give a little more info on each activity.
Here is a quick view of one or two of the individual activities and their configuration just to give an idea of what they look like.
This activity gets the alert from SCOM; you will see the filters below.
This activity, as stated above, runs the "nslookup" command. It receives the server name from the previous activity, "Monitor Alert".
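For anyone who wants to try the same DNS check outside of Orchestrator, a rough Python equivalent is sketched below. The runbook itself runs "nslookup"; this sketch swaps in the operating system resolver through `socket.getaddrinfo` instead, and the function name `resolve_host` is made up for illustration.

```python
import socket
from typing import List


def resolve_host(server_name: str) -> List[str]:
    """Return the sorted unique IP addresses for server_name, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(server_name, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the resolver could not find the name.
        # UnicodeError: the name is malformed (for example an empty label).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

An empty list plays the same role as a failed "nslookup" here: the next step would be to notify someone rather than to carry on with the ping.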
Here you will add the recipients, subject and body of the mail to be sent.
Mail Activity cont.: the mail server settings, where you add the mail server to use for the SMTP connection as well as the sender address.
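The same pieces (recipients, subject, body, mail server and sender address) show up if you send the notification from a script instead. Below is a minimal Python sketch using the standard `smtplib` and `email` modules. The function names, the wording of the subject and body, and port 25 are assumptions for illustration, not the values used in the runbook.

```python
import smtplib
from email.message import EmailMessage
from typing import List


def build_failure_mail(sender: str, recipients: List[str],
                       server_name: str, failed_step: str) -> EmailMessage:
    """Build the notification mail for a failed remediation step."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"Agent remediation failed on {server_name}"
    msg.set_content(
        f"Step '{failed_step}' failed for {server_name}. Please investigate."
    )
    return msg


def send_mail(msg: EmailMessage, smtp_host: str, smtp_port: int = 25) -> None:
    """Send the mail through the given SMTP server (port 25 is an assumption)."""
    with smtplib.SMTP(smtp_host, smtp_port) as smtp:
        smtp.send_message(msg)
```

Keeping the message building separate from the sending makes it easy to check the subject and body without needing a mail server.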
Each activity in Orchestrator can pass relevant data on to the next activity where this is required and configured. The activities are connected by the "Link" lines you see between them.
You could also disable the default "Heartbeat Alerts" monitor and create your own custom monitor. The reason I say that is that, instead of having a "Monitor Alert" activity, you could then use SCOJobrunner and have the runbook triggered from the SCOM side as soon as your custom alert is raised. SCOJobrunner is a command-line tool you can use to trigger Orchestrator runbooks, so you could create a diagnostic and recovery command within your custom monitor. This would also mean your runbook does not need to be running all the time, which reduces overhead.
Yes, you could clean it up by using child runbooks, and I could also add more failure checks, like a leg for the "Start Health Service" activity that notifies or remediates when starting the agent health service fails. But it is working perfectly for us, and the automation of the agent repair is helping our guys a great deal, allowing them to focus on other things.
I hope this has been useful. If there are any questions, please don’t hesitate to contact me.