At one of my early software engineering jobs, there was no on-call rotation for production incidents. We relied upon crossed fingers and optimism as our first line of defense. In the predictable moments when that strategy failed us, someone on the customer success team would text our head of engineering, who would activate our single escalation mechanism: notifying every engineer at the company in a group text.
Operating in the serene communication desert that existed before Slack but after private IRC channels, folks would mill around trying to find a shared video chat to debug the incident, at which point whoever happened to join would begin incident response. Among the many challenges of responding to production incidents this way, I fondly remember one engineer replying to the group text with a concise “UNSUBSCRIBE.”
Although I never had the courage to send my own “UNSUBSCRIBE,” I was anxious to move beyond that chaotic approach to incidents. Fortunately, subsequent employers took very different approaches to fostering reliability, and the industry at large has come a long way.
In this guide you will learn:
About the current standard for incident response and analysis
Where some teams get themselves in trouble with the current standard
How to find your own path through the innovation and dogma of leading a company’s approach to reliability
The industry standard: incident response
While the details vary, incident response at most companies today follows a similar process:
An alert is triggered, ideally by monitoring software, but in practice sometimes by a human.
The responsible person in the on-call rotation is paged, and begins responding to the incoming alert.
If the on-call responder decides this is an incident, they begin incident response. If not, they resolve the alert or chat with the human who raised the issue.
Following their incident response process, they’ll begin incident mitigation. This generally starts with creating a chat channel for the incident, pulling in others who have context, and working until the issue is no longer substantially impacting users. Larger companies often split these activities into multiple roles, with the on-call responders leading mitigation and an incident commander handling coordination and communication.
The on-call responder writes an incident report describing the incident’s timeline, contributing factors, and recommended next steps.
A broader group reviews the incident report, adding their perspective and experience.
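To make the later steps more concrete, here’s a minimal sketch of the kind of record this process produces. It’s written in Python, and the field names are illustrative assumptions rather than any standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical incident record; the fields mirror the process above, not a standard schema.
@dataclass
class IncidentReport:
    title: str
    severity: str                         # e.g. "sev1", "sev2"
    started_at: datetime                  # when the issue began affecting users
    detected_at: datetime                 # when the alert fired
    mitigated_at: datetime | None = None  # when user impact was contained
    timeline: list[str] = field(default_factory=list)              # notable events, in order
    contributing_factors: list[str] = field(default_factory=list)  # causes identified in review
    next_steps: list[str] = field(default_factory=list)            # recommended remediations
```

Even a spreadsheet with these columns is enough to support the metrics and analysis discussed below.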
Much of the magic in this setup lies in selecting the right tools to support the process, architecting your system to fail in predictable ways, and resisting the never-ending urge to add complexity. Most companies never evolve their incident response beyond this stage, and while it’s a very useful starting point, it does tend to have one major flaw: The incidents keep happening.
Moving one step forward: incident analysis
As incidents continue to occur, teams generally respond by starting to track more metrics, such as “Mean Time To Detect” (MTTD)—the gap in time between the issue beginning and an alert getting triggered—and “Mean Time To Mitigation” (MTTM)—the time between that first alert and when you’ve contained the user impact.
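Both measures fall out of a few timestamps per incident. Here’s a rough sketch of computing them, assuming each incident records when the issue began, when the alert fired, and when impact was contained (the sample times are invented):

```python
from datetime import datetime, timedelta
from statistics import quantiles

def p90(durations: list[timedelta]) -> timedelta:
    """90th percentile of a list of durations."""
    seconds = [d.total_seconds() for d in durations]
    return timedelta(seconds=quantiles(seconds, n=10)[-1])

# (started_at, detected_at, mitigated_at) for last quarter's incidents; the times are made up.
incidents = [
    (datetime(2024, 1, 3, 9, 0),   datetime(2024, 1, 3, 9, 12),  datetime(2024, 1, 3, 9, 40)),
    (datetime(2024, 2, 8, 14, 0),  datetime(2024, 2, 8, 14, 5),  datetime(2024, 2, 8, 14, 50)),
    (datetime(2024, 3, 21, 2, 30), datetime(2024, 3, 21, 2, 45), datetime(2024, 3, 21, 3, 5)),
]

time_to_detect = p90([detected - started for started, detected, _ in incidents])
time_to_mitigate = p90([mitigated - detected for _, detected, mitigated in incidents])
print(f"p90 time to detect: {time_to_detect}, p90 time to mitigate: {time_to_mitigate}")
```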
Measures like MTTD and MTTM efficiently evaluate your incident response effectiveness. These measures struggle, however, to tell you where to go next. Let’s say your 90th percentile MTTM was 30 minutes last quarter, up from 20 minutes the quarter before. That does seem like a problem, but what should you do to address it?
The answer is extending your incident response program to also include incident analysis. Get started with three steps:
Continue responding to incidents as you did before, including mitigating incidents’ impact as they occur.
Record metadata about incidents in a centralized store (this can be a queryable wiki, a spreadsheet, or something more sophisticated), with a focus on incident impact and contributing causes.
Introduce a new kind of incident review meeting that, instead of reviewing individual incidents, focuses on batches of related incidents that share contributing causes, such as “all incidents caused when a new host becomes the primary Redis node.” This meeting should propose remediations that would prevent the entire category of incidents from recurring. In the previous example, that might be standardizing on a Redis client that recovers gracefully when a new Redis primary is selected.
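As a rough sketch of the batching step, assuming each incident’s metadata carries a list of contributing-cause tags (the incidents and tags below are invented):

```python
from collections import defaultdict

# Minimal incident metadata: (title, contributing-cause tags). Everything here is invented.
incidents = [
    ("Checkout latency spike",    ["redis-failover", "missing-timeout"]),
    ("Session store outage",      ["redis-failover"]),
    ("Search results empty",      ["bad-deploy"]),
    ("Payment retries exhausted", ["missing-timeout"]),
]

# Group incidents by shared contributing cause, then review the largest batches together.
batches: dict[str, list[str]] = defaultdict(list)
for title, causes in incidents:
    for cause in causes:
        batches[cause].append(title)

for cause, titles in sorted(batches.items(), key=lambda item: len(item[1]), reverse=True):
    print(f"{cause}: {len(titles)} incidents -> {titles}")
```

A review that starts from the “redis-failover” batch is far more likely to land a categorical remediation, such as the failover-aware client mentioned above, than a review of any one of those incidents on its own.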
Compared to an ad-hoc process, these steps do take more time, and you’ll hear grumbles about that from those generating the incident metadata. You’ll also hear grumbles from those who have to follow up with the ones who should be generating that metadata but aren’t. However, the grumbles will be easy to ignore when you bring the right group of engineers into a room to discuss a batch of incidents and walk out a few hours later with a precise, actionable set of remediations. You will identify significantly more valuable remediations when looking at clusters of related incidents than when looking at incidents in isolation.
Then you’ll do it a second time, and it will work well. A third time, and it will generate somewhat fewer new ideas, but still some. Over time, you’ll run into an interesting problem in your incident analysis program: you’re investing more time than ever, but the ideas you’re generating aren’t getting prioritized. How do you ensure your work culminates in more reliable software?
Beware the trappings of incident legalism
When teams leading reliability efforts are asked why their incident analysis isn’t producing more reliable software, I’ve often seen them respond by doubling down on metadata collection and heavily structured processes. I think of this as incident legalism: an incident response and analysis program that, in trying to drive reliability improvements, becomes focused on compliance and loses empathy for the engineers and teams operating within its processes.
If you want a self-diagnosis kit for incident legalism, ask yourself these questions:
Do incident reviews anchor on the same question every time? Bonus points if that question is about adding another alert!
Do you spend a lot of time debating whether an incident should be a “Severity 1” or a “Severity 2” incident? Bonus points if you continue to expand the definitions of each severity type, even though folks are already struggling to remember the definitions!
Does discussion around incidents spend a significant amount of time on whether metadata has been collected? Bonus points if the metadata doesn’t contribute to the discussion at hand!
Do today’s proposed remediations sound a lot like the remediations you’ve heard in the last couple of incident reviews? Bonus points if no one in the discussion is responsible for prioritizing those remediations!
If a couple of these hit a little too close to home, then you’re probably in the throes of incident legalism. None of these points are particularly bad when they happen once or twice, but when they become routine, it’s clear that something has spoiled within your reliability efforts. Reliability programs rarely fail because someone isn’t working hard enough, yet incident legalism is all about working harder: collecting more incident tags, setting stricter deadlines for filing incident metadata, scheduling more incident review sessions, and so on. The result is just more work, not more success.
Escape this trap by building a holistic mental model for driving reliability.
An expanded model for reliability
Earlier, I mentioned that the standard metrics for understanding incidents, like MTTD and MTTM, are very effective at evaluating whether your response is going well, but not very effective at determining how to improve your reliability. I’ve found systems modeling, on the other hand, to be a very helpful tool for designing and debugging reliability programs.
For example, a common model to begin with is:
As changes occur within your system (new code, etc.), some fraction of those changes include behaviors that could cause incidents in the future; call these latent incidents. Latent incidents aren’t necessarily bugs. For example, using offsets rather than a pagination cursor within search results works just fine when you have, at most, thousands of results, but becomes very slow once the dataset grows larger and you allow a variety of sorting options (see the sketch after this list).
As your system runs into different scenarios over time, some of those latent incidents are discovered and become incidents, at some discovery rate. Discovery is rarely a deliberate event, but rather the culmination of unnoticed growth. In the earlier example of using offsets to paginate through search results, an increase in potential results and a new feature that encourages more users to navigate deeper into the results might be what finally surfaces the incident.
As incidents occur, you mitigate them at some rate, transforming them from incidents into mitigated incidents. Each mitigation is itself a change to your system, which might introduce a new latent incident.
Finally, you study the mitigated incidents, determining how to prevent them from recurring, and they become remediated incidents. Once again, each of these remediations is a change that might come back to you later as an incident!
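To make the pagination example concrete, here’s a sketch of the two approaches using Python’s built-in sqlite3 module; the table and column names are assumptions:

```python
import sqlite3

# Hypothetical search-results table with an invented schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO results (title) VALUES (?)",
                 [(f"result {i}",) for i in range(10_000)])

# Offset pagination: correct, and fine while the table is small, but the database must walk
# past `page * page_size` rows on every request, so deep pages slow down as the data grows.
# That slow growth is the latent incident.
def offset_page(query: str, page: int, page_size: int = 50):
    return conn.execute(
        "SELECT id, title FROM results WHERE title LIKE ? ORDER BY id LIMIT ? OFFSET ?",
        (f"%{query}%", page_size, page * page_size),
    ).fetchall()

# Cursor (keyset) pagination: the client passes back the last id it saw, so each page is an
# index seek no matter how deep the user navigates.
def cursor_page(query: str, after_id: int = 0, page_size: int = 50):
    return conn.execute(
        "SELECT id, title FROM results WHERE title LIKE ? AND id > ? ORDER BY id LIMIT ?",
        (f"%{query}%", after_id, page_size),
    ).fetchall()

first_page = cursor_page("result")
print(first_page[-1])  # pass this row's id back as after_id to fetch the next page
```

The keyset version assumes a single stable sort key; supporting several sorting options requires a composite cursor, which is exactly the kind of complication that lets this class of latent incident slip in.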
The model is simple, but by measuring the number of items in each of these buckets over time, you can get a clear understanding of how things are, or are not, working. In the case of incident legalism, you’ll generally observe that the number of incidents and mitigated incidents remains high, but there are very few remediated incidents. For some reason the remediation rate is simply insufficient, and you need to spend time focused there! (This is often a sign that you’re missing an executive sponsor for reliability who can get remediations prioritized.)
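Here’s a sketch of what measuring those buckets can look like, assuming you log each incident’s state transitions by week (the states mirror the model above; the data is invented):

```python
from collections import Counter, defaultdict

# (ISO week, state reached) for each tracked incident; entirely invented data.
events = [
    ("2024-W10", "incident"), ("2024-W10", "mitigated"),
    ("2024-W11", "incident"), ("2024-W11", "incident"), ("2024-W11", "mitigated"),
    ("2024-W12", "incident"), ("2024-W12", "mitigated"), ("2024-W12", "mitigated"),
]

# Count how many incidents entered each bucket per week.
per_week: dict[str, Counter] = defaultdict(Counter)
for week, state in events:
    per_week[week][state] += 1

for week in sorted(per_week):
    counts = per_week[week]
    print(week, {state: counts[state] for state in ("incident", "mitigated", "remediated")})
```

In this invented data, incidents keep getting mitigated but nothing ever reaches the remediated bucket, which is the incident-legalism signature described above.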
Another common scenario is that your defect rate is simply too high, with too many changes becoming latent incidents. If you notice this, you can invest in developer productivity or infrastructure tools, like better tests, static analysis, and gradual rollouts, to reduce your defect rate.
You can even use this model to reason about which solutions make sense for your current problem! A common response to a high defect rate is to reduce deployment frequency to daily or weekly deployments. Most companies discover that this increases their discovery rate without increasing their mitigation or remediation rates. Once they can concretely describe the problem at hand, though, they can rule out solutions that appear to work yet don’t address their specific problem.
Summary
While I hope that you add modeling to your toolkit for running a reliability engineering organization, you can get most of the value by taking away three simple rules:
You should invest some reliability energy into response, analysis, and remediation.
If you’re investing much energy, a significant majority should be going towards remediation.
Any problem can be solved by investing more heavily into it, but there’s always a more efficient solution. If you’re convinced that you must surge investment, check if your mental model is leading you astray.
There are certainly more tools and rules you can use to shape your approach to reliability, but I’ve found those to be a remarkably effective starting point.