How do you explain on-call to people who don’t know what it is?
I usually go broad and say something like: when something goes wrong with the product my company sells, someone has to fix it. For stuff that I work on, my team has to fix it, no matter when it breaks. There has to be someone available 24/7 to respond. The availability piece is the thing that’s hard to explain to people.
I remember there was a time when I was at the gym and my coach was asking everybody how we were doing and I was like, “I’m so tired because I was on call this week and I didn’t get any sleep last night.” And she was like, “Oh, are you a doctor?” Which is a very common response. And then I have to say no, I’m not a doctor. I’m on call for computers. And then people are usually pretty puzzled and ask if there’s really something so important that it would warrant waking people up in the middle of the night. The answer is yes, but it’s hard to explain why.
For every hour of every day, there is a person assigned to respond if something goes down.
When something in a system breaks, you need a 24/7 ability to respond. There’s a lot more detail you could get into, like, how do you decide what’s worth paging about? But that’s the high-level summary.
So how do you decide what’s worth paging about?
I’m an engineer and I work on a team that builds products that other engineering teams at my company use. The first question that we ask is, “How do we know when our products are working?” That’s more complicated to figure out than it might seem.
We set up monitoring systems to examine different metrics. For an API, you might look at latency, which is how long it takes for a web request to be fulfilled. Or you might look at the error rate: in a perfect world, there would be zero errors when somebody tries to make a correct request to an API. But if there are errors, that could be because of a code change, or because other pieces of infrastructure aren’t working.
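An error rate like the one described is just the fraction of requests that came back as server errors. A minimal sketch, with a hypothetical `error_rate` helper standing in for what a real monitoring system would compute over a stream of requests:

```python
def error_rate(responses):
    """Fraction of requests that returned a server error (5xx).

    `responses` is a list of HTTP status codes -- a simplified
    stand-in for the metrics a monitoring system would aggregate
    over a time window.
    """
    if not responses:
        return 0.0
    errors = sum(1 for status in responses if status >= 500)
    return errors / len(responses)
```

In a perfect world this would stay at zero; in practice, teams alert when it crosses a threshold they've chosen.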
It’s not just about figuring out if something is broken, however. It’s also about figuring out if it’s broken enough to warrant human intervention. There’s a general philosophy that human intervention should not be the first thing that happens. If an application tries to make an API request to an external API that has nothing to do with us or our infrastructure, and that API happens to be down—it could be GitHub, npm, or any number of services—our products should be able to retry the request. If it’s the kind of thing where the request didn’t work at first because GitHub was down, but the retry worked because GitHub is back up, that is something that our system should be able to just do on its own.
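The retry-before-paging idea can be sketched as a loop with exponential backoff. This is illustrative only: the `fetch_with_retries` name and its parameters are made up, and `fetch` stands in for any outbound call, like a request to GitHub or npm:

```python
import random
import time

def fetch_with_retries(fetch, max_attempts=4, base_delay=0.5):
    """Retry a flaky call with exponential backoff and jitter.

    `fetch` is any zero-argument callable that raises on failure.
    If the service recovers during the retries, no human is needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                # The system couldn't heal itself; time to page a human.
                raise
            # Wait 0.5s, 1s, 2s, ... with jitter, then try again.
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, delay))
```

If the external service is back up by the second or third attempt, the request succeeds and nobody gets woken up; only exhausting all attempts surfaces the failure.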
But if the system can’t fix itself, then we need somebody to intervene to assess how serious the problem is, and to see if there’s anything we can do to mitigate the impact that it’s having—the fact that this thing is broken and our customers can’t use it—and then fix the thing itself.
What are the kinds of things you personally get paged for?
Broadly, they fall into two categories. One, we made a change, it didn’t go as planned, and it’s breaking things; or, two, something external to our team is broken or unexpected, so our system doesn’t work. Those are both tricky in different ways, but both of those potential failure situations inform how we build our systems and how we handle on-call.
What do you mean that it would affect how you build systems?
We’re on call for systems we’ve built, which is a very particular on-call philosophy. There are some places where these people over here create the thing, and those people over there are on call for when the thing breaks, and those are totally different teams. But we’re on call for systems that we’ve built ourselves, so we have to expect that components of our systems will fail, and we have to integrate that anticipation of failure into what we promise our customers. And we have to think about how we architect and monitor for failure.
Do you have a sense of how many people are impacted by an outage that you’d get paged for? Is there a way to measure impact?
Absolutely. One of the first things we measure is customer impact, and that determines the severity of the incident we’re dealing with. On one end of the spectrum, the least impactful end, our team will have a conversation about whether we should even be getting paged for something like this. Maybe the answer is no and we change how we’re alerted, or we make the system more robust so it doesn’t experience that failure anymore.
On the other end, I’ve responded to pages for downtime, which means that external customers cannot use our product. That’s typically measured as a percentage. So we’ll say, “This outage impacted 5 percent of our customers globally” or, “10 percent of our customers in this particular region couldn’t use our product for fifteen minutes.”
The scariest failure I can think of that my team would be on call for is if our content delivery network (CDN) went down. That is the point of entry for customers who use our service and it handles billions of requests every day. So even if everything behind the CDN is working correctly, if there’s an outage at the point of entry, that would impact a lot of people. Like, potentially all of our customers.
I have this image of one person being woken up in the middle of the night because a million people can’t access the app. It sounds like that’s not how this works, though.
If a million of our users are affected and only one person is waking up to deal with it, that’s wrong. A company with a million users has hopefully put enough thought into how they do on-call that an outage of that size wouldn’t happen that way. All of that said, if something like that did happen, the one person that gets paged would then page a bunch of other people once they realized that something was very broken. When there’s an incident of that scale, whole teams are brought in to help and someone is the “incident commander” who coordinates the response.
Still, this doesn’t happen that often. A lot of people think of outages as all or nothing. But it’s not usually the case that a huge number of our customers can’t use any of our services all at once. The more likely scenario is that one of our services goes down and it’s part of another company’s checkout system, so their customers can’t pay. Or maybe their app doesn’t load properly on their customers’ phones if one of our services is broken because of how the two are tied together with code. That company’s customer has no idea that the problem lies with us. But they get impacted by our outage nonetheless.
Even with those smaller incidents, however, a lot of money can presumably be at stake.
Yes, and if it’s our fault, the companies that rely on our services can come to us and say, “We signed a legal contract where you promised 99.999 percent availability,” or whatever percentage we promised them. There’s this concept of the number of nines of availability a service has. This indicates how available you expect a service to be, because it can never be 100 percent. For instance, a service that is 99.99 percent available has four nines, while a service that is 99.9999999 percent has nine nines. You build your expectations around how close or far from 100 percent availability a service is. The more available a service is, the more other companies rely on it in building their own products.
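The nines translate directly into a downtime budget: each extra nine cuts the allowed downtime by a factor of ten. A small illustrative calculation (the helper name is made up):

```python
def downtime_budget_minutes(nines, period_minutes=365 * 24 * 60):
    """Minutes of downtime per year permitted by an N-nines target."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.1% unavailable
    return unavailability * period_minutes
```

Three nines (99.9 percent) allows roughly 526 minutes of downtime a year; five nines (99.999 percent) allows only about 5 minutes, which is why those contracts are expensive to honor.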
So if we breach our obligation around availability, a company might ask us for a refund or make a decision to not use our product anymore. We might do the same thing if another product causes us downtime. When there’s an outage of something we rely on, we’ll go to the company and say, “We want a root cause analysis, we want to know what the fix was, and we want a refund.” They can’t just respond and say, “I dunno, something broke but it’s good now!”
It gets trickier when you’re locked into a specific vendor. In some cases, we’ve decided to be locked in, in part, because they promise a lot of nines and we pay a ton of money, so that when their failure becomes our failure, we get details as soon as possible, we get information under NDA to understand what happened, and we have more leverage in how the relationship works.
Waking Up Is Hard to Do
If you’re on call outside of a workday, what does that mean for your personal life? How does it affect your evenings and weekends?
Well, that’s definitely when I notice on-call the most. I may try to go to sleep earlier on nights when I’m on call because I can’t guarantee that I’m going to sleep through the night. I can’t make spontaneous plans when I’m on call unless I carry my laptop around with me. So those weeks require a lot more planning.
Because being on call means you literally have to open your laptop and debug as soon as you get paged. We’ve been talking conceptually about what it is and the philosophy behind it, but that’s what it literally looks like.
Right. Let’s say I get paged when I don’t have my laptop or I’m out without my charger. I would escalate immediately to make sure that somebody else responds. But typically, we respond within a few minutes. We respond as if everything is urgent. If I get woken up in the middle of the night, my sleepy brain is like, “You can look at it later,” but I’ve trained myself to not look later, to look now because it could be really bad. But yeah, it can really impact my life, my ability to do errands, my sleep. Bad weeks are bad.
What do you do when you get paged?
Almost everyone I know who’s on call, regardless of the company, uses the same app to configure pager alerts. And you can configure the app to send different kinds of information each time it pages someone. We have a link to whatever metric is passing the threshold that’s causing the page. Those metrics are also things that we configure.
Throughout your code?
Right. All the services that make up what looks to customers like one cohesive product are owned by different teams and the teams set up the thresholds they want, and people get paged based on those thresholds. A threshold might be: “If this function fails five times in an hour, page someone.”
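A threshold like “if this function fails five times in an hour, page someone” amounts to counting failures inside a sliding window. A toy sketch, assuming a hypothetical `FailureAlarm` class; real monitoring systems evaluate these rules server-side:

```python
from collections import deque
import time

class FailureAlarm:
    """Signal a page when failures in a sliding window cross a threshold."""

    def __init__(self, threshold=5, window_seconds=3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.failures = deque()  # timestamps of recent failures

    def record_failure(self, now=None):
        """Record one failure; return True if it's time to page someone."""
        now = time.time() if now is None else now
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and self.failures[0] <= now - self.window_seconds:
            self.failures.popleft()
        return len(self.failures) >= self.threshold
```

The interesting design work is in picking `threshold` and `window_seconds`: too sensitive and people get paged for noise, too lax and real outages go unnoticed.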
There must be dozens of those? Thousands?
A lot. That’s why it’s important to be thoughtful about what you want to be alerted to and what the threshold for an alert is.
So coming back to your question about what I do: I first look at the metric that has passed the threshold. Then, I look for documentation about that alarm. When we create an alarm, we try to write documentation on what it is and what it’s measuring. If the docs are good, they also include context about why this piece of code or infrastructure exists and its various potential failure states.
But you don’t want to get into too much detail on each failure state. Sometimes, when people deal with failure, their instinct is to say, “If I document every single piece of information about this situation, I will know exactly how to respond when this happens again.” But if you have something that fails regularly, for a very predictable reason, you should fix the problem in the product and stop paging everyone all the time. Although sometimes that’s easier said than done. It is pretty easy to document all the ways in which something can break—it is usually much harder to build something that breaks less often.
Sometimes, I get paged for something I’ve never worked on before. That’s when I really lean on this process. I see a metric. I see some docs. Something is broken that’s potentially impacting people. How do I use these pieces to get to an understanding of what happened and how to fix it?
Does everybody really wake up and deal with a page they get at two in the morning? Surely, people sleep through alarms. What happens then?
Yes, everybody really wakes up. There may be some rare case where, you know, someone got a new phone and didn’t set up their notifications correctly. Or, people accidentally sleep through middle-of-the-night alarms. I certainly have. But if someone were to repeatedly refuse to respond at night, they just wouldn’t last in that job. What happens when someone misses a page is that the next person gets paged. The app is configured to page a certain number of times in a certain number of ways—it’ll text you, then call you, then email you—but if you don’t acknowledge the page, it tries the next person.
In the same way that computers are automated to fail over to the next system, the app will fail over to the next human if one of them is down.
Yeah. If I sleep through an alarm, our escalation is set up to try my team members first. Then my boss, then my boss’s boss, all the way up to the executives. If all of us sleep through all the alarms, the CEO would get paged. I’ve never seen that happen before, though.
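That escalation chain is essentially a loop: try each person in order until someone acknowledges. A minimal sketch, with made-up names and a hypothetical `page` helper standing in for what the paging app does:

```python
def page(escalation_chain, acknowledges):
    """Walk an escalation chain until someone acknowledges the page.

    `acknowledges` is a callable reporting whether a given person
    responded; both it and the chain below are illustrative.
    """
    for person in escalation_chain:
        if acknowledges(person):
            return person  # this person owns the incident now
    raise RuntimeError("nobody acknowledged the page")

chain = ["on-call engineer", "teammate", "manager", "director", "CEO"]
```

Like the failover between redundant systems, the chain only exists for the rare case where the first responder is unreachable; in practice it almost never gets past the first one or two entries.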
What has happened is that I’ve been paged for something, didn’t know how to deal with it, and then paged someone else to wake them up to help.
How does it feel to do that?
I mean, I wish I never had to do that. It sucks, because I know how garbage I feel after I’ve been woken up at that time. But this is actually a place where team culture is important. If someone else wakes me up, I try to respond without resentment and without making somebody feel bad for needing help. We don’t page each other frivolously, but if someone doesn’t know what to do and I’m second in line, it is my job to respond and help that person out. It can create a really toxic culture if you’re like, “Ugh, why did you wake me up for this?” And if somebody stops asking for help, that is a big potential failure scenario. That’s why, when we onboard someone, we really play up the “It’s totally super fine! Don’t worry about it, page me anytime!” They won’t actually page me anytime, but it’s important for them to know that they can if they’re in trouble.
The company probably benefits from people being kind and showing up for each other in cases like that.
Definitely. I mean, on-call can go lots of ways. What I’m describing, even if I don’t love on-call, is being on a team with people I trust, knowing that I won’t get yelled at or fired for unintentionally doing something that causes damage, and knowing that there’s a genuine spirit of reflection around how to fail better. The thing that motivates me during on-call, much more than fixing the tech, is my teammates. There are things that are beyond our control: there’s a lot of failure on the internet and we don’t pick the days when a critical service goes down. If everyone is always exhausted and grumpy when they show up, that sucks for them and it sucks for me. So, almost always, if someone gets woken up in the middle of the night, another person on the team will offer to take over their shift the next night so they can get a full night’s sleep. Because waking up one night sucks, but waking up two nights in a row? You’re toast.
We also encourage one another to ask for help and to offer help. If we’ve identified something that is really disruptive to each other’s day-to-day lives, we take that seriously and make changes so that that thing doesn’t happen anymore. That matters when you think about the fact that we are on call for holidays and weekends. There’s a lot of motivation for us to make on-call not terrible. So we are caring for infrastructure, but ultimately we’re taking care of each other.