The Servers Are Burning

One of the first things I learned after learning how to write software was how to write tests for that software. First you wrote code to perform a certain task—say, find the square root of a number. Then you wrote more code to test whether the first piece of code did what you wanted. Does the function return the correct value? Is two the square root of four?

I thought this type of testing was ridiculous. If you wrote buggy software, why would the software you wrote to check that software be any less buggy? That happened to me a lot: I’d spend twenty minutes trying to figure out why my tests said my programs were broken only to realize that the tests themselves were broken.

Yet what I found even more troubling was that to write effective tests, a programmer had to know all of the ways that a piece of software could fail. If she forgot that the square root of -1 was undefined, she’d never write a test for it. But, obviously, if you knew where your bugs were, you’d just fix them. Errors typically hide in the places we forget to look.
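
To make the square-root example concrete, here is a minimal sketch in Python. The names `square_root`, `test_square_root_of_four`, and `test_square_root_of_negative_one` are invented for illustration. The second test is the one a programmer only writes if she remembers that negative inputs exist.

```python
import math


def square_root(x: float) -> float:
    """Return the non-negative square root of x (hypothetical example)."""
    if x < 0:
        # Real-valued square roots are undefined for negative numbers.
        raise ValueError("square root is undefined for negative numbers")
    return math.sqrt(x)


def test_square_root_of_four():
    # The obvious case: is two the square root of four?
    assert square_root(4) == 2


def test_square_root_of_negative_one():
    # The easy-to-forget case: a negative input should be rejected.
    try:
        square_root(-1)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative input")
```

Neither test says anything about inputs nobody thought of, which is the point of the complaint above.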

My first employer, the online dating site OkCupid, didn’t harp on testing either. In part, this was because our company was so small. Only seven fellow engineers and I maintained all the code running on our servers—and making tests work was time-consuming and error-prone. “We can’t sacrifice forward momentum for technical debt,” then-CEO Mike Maxim told me, referring to the cost of engineers building behind-the-scenes tech instead of user-facing features. “Users don’t care.” He thought of testing frameworks as somewhat academic, more lofty than practical.

Instead, the way we kept from breaking the site was to push updates to small subsets of users and watch what happened. Did anything crash? Did lag increase? Were users reporting new problems? The idea was that the best way to discover errors was to expose software to real site traffic and respond quickly with a patch or rollback if necessary.
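
One common way to push an update to a small subset of users is to hash each user ID into a stable bucket and turn the new code on only for buckets below a threshold. The sketch below assumes that approach. It is not OkCupid's actual mechanism, and the function name `in_rollout` and its parameters are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: float) -> bool:
    """Decide whether a user sees the new code (illustrative sketch only).

    The same user always lands in the same bucket, so raising `percent`
    only ever adds users to the rollout.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return bucket < percent
```

Starting at a small percentage and raising it while watching crash and latency numbers is the code-level version of "watch what happened."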

Perhaps this was a sensible methodology for a small engineering team. But when I first started at OkCupid, I found it terrifying. About four months in, I was assigned to build a feature that would highlight the interests that members listed on their profiles—things like “Infinite Jest,” “meditation,” “The Dirty Projectors”—for OkCupid staff to see. Because this was an internal feature that users wouldn’t be able to access, the stakes were low.

Still, I was paranoid about breaking the site, and procrastinated. My boss, then-Director of Engineering David Koh, started to notice. “I know it’s intimidating,” he told me at the end of work one day, “but you just have to pull the trigger and launch the code. Soon you’ll do it without even thinking.” He told me to push my update the next day, when he’d be out of the office.

I was nervous to make a change without David there to save me if anything went wrong. But, admittedly, my update was pretty dinky. The next morning, just to cover my tail, I asked Mike the CEO, who was also OkCupid’s best engineer, to take a look at my code. “You only added a few functions,” he said, reading through my lines on his monitor. “Looks fine.” I felt silly taking up his time for something so insignificant.

So I launched my new code for a fraction of our users and watched the statistics. All seemed well. I pushed my changes out to the rest of the site and went for a snack. When I came back, everything was most definitely not fine. The site had slowed to a crawl and then become completely non-responsive. From the back corner of our single-floor office, the head of operations yelled, “The servers are on fire. What the hell’s going on?”

“It must’ve been me,” Mike shouted back. Mike and I had both deployed code at almost exactly the same time, a development no-no since when something broke (like now), you didn’t know who to blame. But when Mike reversed his change to no effect, he was stuck. It wouldn’t be long before the OkCupid servers became so unresponsive we wouldn’t even be able to connect to them to push our fixes.

#Panic #Freakout

By lunch, we still hadn’t saved the site. OkCupid users were starting to notice on Twitter:

“@okcupid how am I supposed to get my daily dose of crushing rejection and emotional humiliation if your site is down????”

“Okcupid stops working over lunch hour, NYC wonders if we’re committed to the people in our phones from now on, panic in the streets”

“@okcupid How can I continue to be ignored by the females of the world if they don’t know where I am to ignore me?! #panic #freakout”

The more time that passed, the more confident I became, by process of elimination, that I’d taken down the site. I read through every file I’d changed—the lines of code I’d written and even the ones I hadn’t. Then, finally, I found the error. It looked something like this:

If (the database throws an error) { do nothing }

This was a bug, and I hadn’t even written it. I had just triggered it. But it was a bad bug, one that, under the right circumstances—circumstances that I’d unfortunately created—would not only crash the OkCupid servers but also spew radioactive garbage into our caches and database, making recovery especially difficult. I was horrified.

How could such a tiny change have such an outsized impact on the site? “That same story happened so many different times,” my old boss David told me. “Someone launched a small, relatively innocuous change that did one of the millions of unexpected things it could have done, which then happened to break some part of the site, and then bring it all down—sometimes bring it down to the point where we couldn’t recover it for hours.” When I saw him in the office the morning after I’d broken the site, totally mortified, he consoled me by saying that site outages were just the cost of relying on such a small engineering team.

But even large companies are susceptible to meltdowns caused by seemingly innocuous actions. In February 2017, Amazon accidentally brought down swaths of the internet when one of its employees, trying to debug the company’s S3 service, entered a command incorrectly. Because hundreds of thousands of companies use S3 to store data, the error took down tons of sites, including Quora, Giphy, and Slack. Ironically, Amazon’s own S3 status indicator relied on S3, which is why it incorrectly reported that the service was working just fine during the outage.

Tangled Webs

For most businesses, however, a software crash is not a death knell. If you’re not building self-driving cars, storing sensitive information, or supporting the data backbone of the internet, it may not matter if an error interrupts your service. It’s okay, for example, if a free online dating site goes down for an hour or half a day. In fact, it might even be better for business to trade off bugginess for forward momentum—the ethos behind Facebook’s old mantra “move fast and break things.”

When you allow yourself to build imperfect systems, you start to work differently—faster, more ambitiously. You know that sometimes your system will go down and you’ll have to repair it, but that’s okay. “The fact that it’s easy to fix things means you end up with this methodology where you think, ‘Let’s get a broken thing out there as fast as possible that does sort of what we want, and then we’ll just fix it up,’” says David. That’s not necessarily a bad thing, since preventing errors is inherently difficult. “Even if you spend a whole bunch of time trying to make something that’s perfect, you won’t necessarily succeed,” he explains.

OkCupid was a complex site. Had we tried to make it perfect, it might not have come to exist in the first place.

But software is built on top of other software. You’re working not just with your own code but with code from your coworkers and from third-party software libraries. If these dependencies are buggy or complicated or behave in non-intuitive ways, errors may seep into the software that relies on them.

Take the Equifax data breach, which leaked some 147.9 million Americans’ private information in September 2017. Equifax was vulnerable to attack not because of a bug introduced by one of its own engineers, but rather because of a bug in the code that Equifax’s software was built on top of—a popular open-source framework called Apache Struts.

That bug had existed in Struts since 2008 but had gone unnoticed until 2017, when Apache finally released a patch. Until they installed the patch, the tens of thousands of companies that used Struts—including banks and government agencies—were hackable.

Yet Equifax didn’t install the patch until two months after Apache released it. Why the delay? According to testimony by Equifax executives before the House Energy and Commerce Committee, the problem was twofold. First, the employee whose job it was to report that a patch was necessary didn’t. Second, a software script meant to flag known vulnerabilities in Equifax’s software stack failed because of an error.

Yet even if Equifax had tried to install the patch, it wouldn’t have been an easy update. As Ars Technica reported, the patch was difficult to install and could break existing systems, possibly introducing new bugs. That helps explain why, according to the software vendor Sonatype, 46,557 companies downloaded vulnerable versions of Struts even after the patch came out.

Perhaps Equifax’s security procedures were sloppy and its breach completely preventable. Yet it’s easy for me to imagine a future bug with similarly catastrophic consequences that’s not so easy to blame on incompetence. Most complex software today is composite, consisting of interwoven modules written by many different programmers and maybe even many different organizations. The higher these code towers grow, the more difficult it becomes to understand how their various components interact.

Pessimism of the Intellect

It’s for this reason that my own bug at OkCupid was so bad. I had written a function that asked the database to give me data that didn’t exist. It should’ve returned an error, stopping the program. But because of a bug in one of my dependencies—the bug I found reading through the code on the day of the crash—my program didn’t error out, but instead kept chugging along as if nothing were wrong. It kept consuming more and more memory, which eventually ate into almost every program running on our servers. That’s why the OkCupid site slowed down to the point of becoming unusable.
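
A rough sketch of that pattern, written in Python rather than whatever the real code used, might look like the following. Every name here (`RecordNotFound`, `fetch_interest`, `fetch_swallowing`, `fetch_propagating`) is hypothetical, and a plain dictionary stands in for the database.

```python
class RecordNotFound(Exception):
    """Raised when the requested data does not exist (hypothetical)."""


def fetch_interest(db: dict, key: str) -> str:
    # Stand-in for a real database lookup.
    if key not in db:
        raise RecordNotFound(key)
    return db[key]


def fetch_swallowing(db: dict, key: str):
    # The buggy shape: if the database throws an error, do nothing.
    # The caller receives None and keeps running as if all were well.
    try:
        return fetch_interest(db, key)
    except RecordNotFound:
        pass


def fetch_propagating(db: dict, key: str) -> str:
    # The safer shape: let the error reach the caller, which stops the
    # program instead of letting it carry on with missing data.
    return fetch_interest(db, key)
```

With the first version, a request for missing data fails silently and the damage shows up somewhere else later. With the second, it fails loudly at the point where something actually went wrong.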

Perhaps the worst part of figuring out how I’d crashed OkCupid was that I was still too green to know how to fix it. All I could do was sit at my desk twiddling my thumbs in solidarity while my coworkers rebuilt what I’d destroyed. We stayed in the office until 9 PM. From then on, I came to appreciate how unexpectedly far fractures in a system can travel.

“I heard you had an exciting day yesterday,” my boss David said the next morning. Mortified, I was sure I’d be fired. But I wasn’t, and instead I spent the next three months anxiously trying to prove to my coworkers that I wasn’t totally incompetent. I never brought the site down again, perhaps because my code-launching senses were heightened by a PTSD-induced adrenaline rush.

Maybe I’m still scarred or just generally pessimistic, but on the whole, I tend to think that all sufficiently large, complex software is prone to this kind of meltdown. It’s almost certainly impossible to prevent all bugs, and since code is composite, it’s even harder to reason about the consequences of those bugs. A successful developer today needs to be as good at writing code as she is at extinguishing that code when it explodes.