A Guide to Being a Great On-call.
Being a good operator builds a wealth of skills. But there are a few tricky aspects you'll want to think about.
I'm a fan of tech employees being good operators.
What does being a good operator mean? I'm referring to the owner-operator model. The idea is that you, the technical worker, don't simply build systems, but you also operate them.
You (and your team) are not just responsible for writing software, but also for operating it. You handle customer issues. You investigate bugs. You figure out why it seems to take 5x as much hardware as you'd think it should.
Read more here about the choice to use an owner-operator model.
The most familiar aspect of the owner-operator model is the on-call. The on-call is the designated person who catches operational events (from alarms, customer contacts, escalations) and owns them.
I'm going to discuss some operational recommendations, and then walk you through a story of a time we handled an event the wrong way because we didn't follow these recommendations.
Why would I explain to an on-call how on-call should work?
Let me immediately address what I think some people will say.
"But my company is stupid. We don't (do the good thing), we do the (really stupid thing) instead!"
I believe strongly in ownership. Ownership includes putting your foot down when something should be done better. It means demonstrating that you're valuable, and then making it clear the company will lose that value if they don't listen a little. It means not just complaining about something being done wrong, but suggesting fixes. And then trying to implement those fixes, even if you weren't given permission.
And I'm sure people will still complain that I'm offering suggestions which are impossible to implement. You can't make everyone happy.
Anyway. To start with, the trickiest aspect of the owner-operator model is that it usually isn't a follow-the-sun model.
What is a 'follow the sun' model?
A follow-the-sun model is one where everyone works during their local daytime. You'd have a team in India on-call during their day, a team in California on-call during their day, etc. You expect teams to hand off their operational work at the end of their day.
That is your usual system when you have dedicated operators. You hire them in appropriate locations, and they work (hopefully) reasonable hours in the daytime.
That means while your main owners (builders) are sleeping, you have dedicated operators handling issues.
If you're going with an owner-operator model though, and you still need 24/7 support, it means you need dedicated on-calls, and occasionally, you need to wake people up.
What does 24/7 support look like for owner-operators?
24/7 support by owner-operators usually means you have a device that notifies the "owner" (usually a software engineer) when something goes wrong. A relatively common model is that each person is on-call for 7 days straight. During that time, every major event alerts them. Some teams rotate more or less often, but a week works as an example timeframe.
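As a rough illustration of that rotation, here's a minimal Python sketch that assigns a weekly on-call from a roster. The names, start date, and shift length are placeholders; in practice this usually lives in your paging tool rather than a script.

```python
from datetime import date

# Hypothetical roster and start date; real rotations usually live in your paging tool.
ENGINEERS = ["avery", "blake", "casey", "devon"]
ROTATION_START = date(2024, 1, 1)  # assumed start of the first shift
SHIFT_DAYS = 7                     # one week per on-call, as in the example above

def on_call_for(day: date) -> str:
    """Return who holds the pager on a given calendar day."""
    shifts_elapsed = (day - ROTATION_START).days // SHIFT_DAYS
    return ENGINEERS[shifts_elapsed % len(ENGINEERS)]

print(on_call_for(date(2024, 1, 10)))  # second week of the rotation -> "blake"
```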
Now, clearly working 24/7 for a week sounds nightmarish. Which it can be, if it's done wrong.
It shouldn't be 24/7 work.
If you have to do work every night, you have something seriously wrong.
The 24/7 support model for owner-operators is supposed to catch disasters, not work. If you're doing routine maintenance off normal work hours, something is wrong.
If you're catching disasters consistently every night, something is wrong. Health-wise, no one can consistently have their sleep disturbed. Operations-wise, no system should have disasters every single night.
Disasters every night – Short-Term Reaction.
Some teams do have disasters every night. Yuck.
Get more on-calls, and rotate them more frequently.
If engineer-1 is woken up at 2am for an emergency, I expect them to sleep in, and show up to work late.
If they're woken up again at 4am, and then at 6am, I need to hand on-call to someone else for at least a few days, so that engineer-1 can recover.
There's no exact rule to follow, but here's how I'd think about it.
"The cost of on-call should be carried by the company, not by the individual." - Dave
If the on-call feels like they're carrying a massive weight by going on-call, you have a larger problem. The way you shift more weight to the company is by having more on-calls, having the on-calls rest if they're woken up, and by immediately proceeding into your longer-term solutions.
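To make the "rest the on-call after a rough night" idea concrete, here's a sketch of the kind of check a team could run each morning. The 10pm-6am window and the two-page threshold are assumptions for illustration, not rules from this article.

```python
from datetime import datetime, time

# Assumed thresholds; the point is the policy, not the exact numbers.
NIGHT_START = time(22, 0)  # 10pm
NIGHT_END = time(6, 0)     # 6am
MAX_NIGHT_PAGES = 2        # more than this and the on-call needs relief

def is_overnight(ts: datetime) -> bool:
    return ts.time() >= NIGHT_START or ts.time() < NIGHT_END

def needs_handoff(page_times: list[datetime]) -> bool:
    """True if the current on-call was paged too many times overnight."""
    overnight_pages = [ts for ts in page_times if is_overnight(ts)]
    return len(overnight_pages) > MAX_NIGHT_PAGES

pages = [datetime(2024, 5, 7, 2, 5), datetime(2024, 5, 7, 4, 10), datetime(2024, 5, 7, 5, 45)]
print(needs_handoff(pages))  # True: hand the pager to someone else for a few days
```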
As a side note, when you have the company carry the weight of on-call responsibilities, it makes it more likely they'll invest resources in making on-call less costly. Which is a good thing.
Disaster mitigation to let people sleep.
There are a few standard mitigations you can set up to keep the on-call job from being a horrific hassle.
Update your alarms properly
Do you ever jump onto an event ticket, only to realize that it's yet another false alarm?
You want to know that all of your alarms are serious. It's your job as an engineer to tune your alarm system so it only alerts you when there's a real event. How do you do that?
A few ideas: require multiple data points before alarming (e.g., not a single missed ping, but 3 in a row), and set your latency alarms at realistic thresholds.
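As a sketch of the "multiple data points" idea, here's a small Python example that only fires after three consecutive failed health checks. The threshold and the in-memory state are illustrative; most monitoring systems express this as alarm configuration rather than code.

```python
class ConsecutiveFailureAlarm:
    """Only fire after several bad data points in a row, not on a single blip."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; return True if the alarm should fire."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold

alarm = ConsecutiveFailureAlarm(threshold=3)
checks = [True, False, False, True, False, False, False]
print([alarm.record(ok) for ok in checks])
# [False, False, False, False, False, False, True] -- only the third miss in a row alarms
```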
Automatic mitigation
An example of an automatic mitigation would be restarting a server if it runs out of memory. It's a horrible way to fix bad code. It's a good way to let someone sleep.
The proper way to set up automatic mitigation is to make certain it doesn't quietly mitigate. Instead of waking up the on-call, it might log a lower-severity ticket to investigate. The idea is to turn a potential 2am emergency into a 9am investigation.
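Here's a minimal sketch of that pattern, assuming a systemd-managed service; `file_ticket` is a hypothetical stand-in for whatever ticketing API your team uses.

```python
import subprocess

LOW_SEVERITY = 4  # assumed scale where sev-1 pages someone and sev-4 waits until morning

def file_ticket(title: str, severity: int) -> None:
    # Placeholder: a real version would call your team's ticketing API.
    print(f"[sev-{severity}] {title}")

def mitigate_oom(service: str, memory_used_mb: int, memory_limit_mb: int) -> None:
    """Restart a service that has run out of memory, and leave a paper trail."""
    if memory_used_mb < memory_limit_mb:
        return
    # Mitigate automatically so nobody gets paged at 2am...
    subprocess.run(["systemctl", "restart", service], check=True)
    # ...but never mitigate quietly: file a ticket the team will see at 9am.
    file_ticket(f"{service} restarted after hitting its memory limit", LOW_SEVERITY)
```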
Deployment windows and safety
I've often advocated for deployment windows that line up with easy support. If you assume 90% of the issues from a deployment will show up within 3 hours, do you want your code change rolled out at 10pm on a Friday, or at 9am on a Tuesday?
If you're a bit careful with your deployment scheduling, you can greatly decrease the human impact.
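For example, a deployment pipeline could refuse to roll out code outside an agreed window. The Monday-Thursday, 9am-3pm policy below is an assumption to illustrate the idea, not a universal recommendation.

```python
from datetime import datetime

# Assumed policy: deploy only Monday-Thursday, 9am-3pm local time,
# so most of the fallout lands while the whole team is awake and at work.
ALLOWED_WEEKDAYS = {0, 1, 2, 3}  # Monday=0 ... Thursday=3
WINDOW_START_HOUR = 9
WINDOW_END_HOUR = 15

def inside_deployment_window(now: datetime) -> bool:
    return now.weekday() in ALLOWED_WEEKDAYS and WINDOW_START_HOUR <= now.hour < WINDOW_END_HOUR

print(inside_deployment_window(datetime(2024, 5, 7, 9, 30)))   # Tuesday 9:30am -> True
print(inside_deployment_window(datetime(2024, 5, 10, 22, 0)))  # Friday 10pm -> False
```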
Changes = operational cost.
Every year, as we headed towards peak holiday traffic at Amazon, we'd lock down the code of most major systems. Only minimal code changes were allowed to be pushed to production.
What happened as a result? An incredible drop in outages and events. Less than 10% of the normal operational issues showed up.
Why is that? Because computer systems rarely fail when you're not changing something. Simple as that. Every change has a chance of causing failure. No change = no failure. Now clearly, you can't operate your software without changes, but it's good to recognize that most issues come from change.
I assume you would rather not permanently decrease the number of code changes at your company. So, how do you decrease the operational pain from changes? By having a proper change management process.
Always describe the worst-case scenario.
I find the worst-case scenario discussion both fun and useful.
The idea is that you're describing the blast radius of any change. What systems does your change touch? When you deploy, what's the biggest impact you could imagine, if everything goes wrong?
Take a look at this story I told once about worst-case scenarios.
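If it helps, here's a sketch of the kind of change-plan template I mean, with made-up example values. The exact fields are up to your team.

```python
from dataclasses import dataclass

@dataclass
class ChangePlan:
    """The fields I'd want filled in before a change is approved."""
    summary: str
    systems_touched: list[str]
    worst_case: str            # the biggest plausible impact if everything goes wrong
    detection: str             # how we'd notice the worst case actually happening
    rollback_steps: list[str]  # see the next section

plan = ChangePlan(
    summary="Raise the cache TTL for product pages from 60s to 300s",
    systems_touched=["product-page service", "edge cache"],
    worst_case="Stale prices shown to every customer for up to 5 minutes",
    detection="Price-mismatch alarm on the checkout service",
    rollback_steps=["Redeploy the previous config", "Flush the edge cache"],
)
```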
Always have a rollback plan – never a roll-forward plan.
Never accept a change plan which doesn't include a full (simple and documented) rollback plan. A proper rollback plan describes the short, foolproof steps that take the system back to the state it was in before the change was made.
In perhaps 99% of the cases where people wrote poor rollback plans, or claimed a rollback was impossible, they were mistaken. Yes, it can make the change harder. Yes, it's worth it.
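A rollback plan can be as small as a pinned previous version and one command. In this sketch, `deployctl` is a hypothetical stand-in for whatever deploy tooling your team actually uses.

```python
import subprocess

# "deployctl" is a hypothetical stand-in for your team's real deploy tooling.
def deploy(service: str, version: str) -> None:
    subprocess.run(["deployctl", "deploy", service, "--version", version], check=True)

def rollback(service: str, previous_version: str) -> None:
    """The entire rollback plan: redeploy the exact version that was running before."""
    deploy(service, previous_version)

# Recorded in the change plan before the change ships, not figured out at 2am:
# rollback("product-page", "2024-05-07-build-142")
```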
The primary issue is that people are lazy. An engineer will look at a change they need to make. They can make it an irreversible change and be done in a few hours.
Or they can make it reversible, but it'll take them four days. It's difficult to convince an engineer that four days spent reducing risk are worth it. But get bitten once by an irreversible change, and you'll always insist on rollback steps.
Without this valuable hindsight, people attempt to create change plans which require roll-forward fixes instead. In other words, "If it breaks, we fix the bug."
Here's a safe assumption. Your on-call is a smart software engineer or other technical person. In a perfect world, they could diagnose and repair the issue.
Except that it's possible they're trying to fix it at 2am, when the rest of the team is asleep. So, they're flying solo on this one.
And they're working on 3 hours of sleep because they were woken up in the middle of the night.
And their phone keeps buzzing with alarms, distracting them and making it hard to think.
Expecting the on-call to fix a bug (rather than undo the changes) makes it significantly more likely that things will go from bad to worse.
Making things worse – A story.
I'm going to chat about a time when the on-call made things worse. Not on purpose, but because they were trying to do the right thing. And I'll point out a few specific examples of how operations weren't managed properly.