How to make sure IT on-call works for you
I spent a bit of time on Reddit the other day and was struck by how many posts focused on IT on-call and on-call scheduling. Some posts were rants about horrible customers – who hasn’t had some of those? Some described positive interactions from being on-call, though those posts were rare. But many engineers in DevOps and IT posted about their trepidation over being on-call. They wondered:
- What is the best way for my team to create an IT on-call schedule?
- How do I ensure I wake up if I am alerted?
- Should my growing on-call team use an on-call cell phone and hand it off between rotations?
- How do I manage being on-call and then having to show up at 8 a.m. the next morning?
- Is it reasonable to expect on-call duty 24/7?
The answers to these questions don’t need to cause trepidation, though. While on-call can be anxiety-producing, having the right tools and management goes a long way towards creating reasonable expectations and outcomes.
Why IT on-call is necessary for all
If I were to ask you why on-call is necessary, you might think me a bit of a dunce – go ahead, I’ve been called worse.
Isn’t it obvious that on-call is needed to answer customer questions about the product? Duh?!
But the truth is that answering customer product questions is not the only reason IT on-call exists. In product development, on-call is a necessity. You cannot develop a product effectively without testing its resilience. And you cannot know the product’s resilience unless you put it in front of your customers, allow them to test it, and let them call you when it breaks.
Additionally, on-call rotations allow Dev, Ops and the rest of your IT team to see how well the product or setup they have created is working. Many people I have spoken to in the DevOps world call this ‘eating your own dogfood’. Yuck. The phrase is meant to illustrate that no one in the IT family can simply create their perceived technical masterpiece and walk away. Instead, they need to take responsibility for their creation. Being part of the on-call family helps ensure this level of responsibility.
Traditional problems with IT alerting
Beyond the burden of being on-call itself, the alerting process brings its own problems. Often, issues come in after hours and lack context. These problems come in many flavors. For example:
- A call comes in but the engineer cannot escalate the issue if they need to
- There’s a hand-off of a customer problem from regular hours to after-hours on-call and the issue gets muddled because there’s no audit trail on the alert
- For overnight on-call, alerts are not sufficiently persistent to get engineers out of bed
- Poor management of IT on-call and alerting causes engineer burnout
A much better idea than improvised alerting is to create an actual IT on-call schedule with a dedicated tool designed to handle effective alerting, auditing and messaging. A tool like OnPage can address these on-call issues as well as many of the worries engineers have about being on-call.
Improving life on-call
Effective management of after-hours on-call needs to be premeditated. That is, the process needs to be thought through and cannot be ad hoc.
While most DevOps teams and IT teams have a schedule, they haven’t thought through the whole process. Instead, teams should create on-call schedules that:
- Enable escalation. You cannot expect one person to be on-call 24/7 without an escalation procedure. Everyone needs a back-up in case they cannot attend to a call. People have lives and stuff happens. So, make sure there’s an escalation procedure. OnPage’s tool has strong escalation capabilities for issues in this realm.
- Provide time off after being on-call overnight. When a team member has been actively on-call overnight, it is only fair to give that person a reasonable amount of time off before they show up to work again.
- Rotate fairly. Make sure all your team members have a chance to be on-call. Create scheduling that rotates through the team members equitably.
- Include run books, that is, defined procedures. When your on-call engineer is alerted in the middle of the night, help them out by having run books available that provide solutions to problems that have cropped up in the past. This is really helpful when the engineer is woken up at 2 a.m. and their thinking is somewhat clouded.
- Include prominent and persistent alerts. OnPage provides persistent alerting that will continue for up to 8 hours until answered. The OnPage alerts are also designed to wake you up, so you are unlikely to sleep through them.
- Ensure audit trails to help with hand-offs. Provide an audit trail for alerts so it is clear who on the team is working on an existing issue. Audit trails also provide context for MTTR (mean time to resolution) and help your team keep track of metrics.
- Rely on a shared app. Ensure your team has an alerting app on their smartphones so there is no need to physically hand off pagers. With a smartphone application like OnPage, scheduling is much easier, as is ensuring a response from the right person every time.
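To make the rotation and escalation points above a bit more concrete, here is a minimal Python sketch of a fair weekly rotation in which every shift has a back-up. It is only an illustration and not how OnPage or any particular tool works: the function names, the team names, the weekly shift length and the 15-minute escalation threshold are all assumptions made for the example.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Give each week a primary and a back-up, cycling evenly through the team."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers so every shift has a back-up")
    schedule = []
    for week in range(weeks):
        schedule.append({
            "week_start": start + timedelta(weeks=week),
            "primary": engineers[week % len(engineers)],
            # The next person in the list backs up the current primary.
            "backup": engineers[(week + 1) % len(engineers)],
        })
    return schedule


def who_to_page(shift, minutes_unacknowledged, escalate_after=15):
    """Page the primary first; escalate to the back-up once the alert
    has gone unacknowledged for `escalate_after` minutes (assumed value)."""
    if minutes_unacknowledged < escalate_after:
        return shift["primary"]
    return shift["backup"]


team = ["Ana", "Ben", "Chloe"]  # hypothetical team members
rotation = build_rotation(team, date(2024, 1, 1), weeks=6)
for shift in rotation:
    print(shift["week_start"], shift["primary"], "back-up:", shift["backup"])
```

Over six weeks with three people, each engineer is primary twice and back-up twice, which is the kind of equitable split the list describes. A real tool would add things this sketch leaves out, such as time zones, holidays, shift swaps and time off after an overnight call.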
While IT on-call might cause trepidation initially, the time spent planning will pay dividends. Again, use a scheduling tool that allows your team to work together effectively and more like a, well… team.
OnPage is an excellent tool for managing and improving life on-call. Learn how OnPage can help you and your team better manage IT on-call alerting. Schedule a demo with OnPage today.