Unixorn - 90% Snark by weight (@unixorn.hachyderm.io.ap.brid.gy)

The incident metrics mirage

Most companies that invest in improving their incident management see something counterintuitive in the first few months: their incident count goes up. And someone in the leadership chain, looking at the dashboard, asks with concern: “Why are things getting worse?”

The good news is, they’re probably not. What you’re really seeing is evidence that your incident culture is getting better and stronger. Incident count doesn’t measure system health; it measures how willing people are to declare an incident.

When you invest in your incident management program by giving people better training, tools, and processes for handling incidents, they put those things to work, and more situations get treated as incidents. Problems that used to be handled informally (a “spicy bug” that someone handled without declaring an incident, a degradation that the on-call engineer white-knuckled through without telling anyone) now enter your incident process. The company is getting visibility into problems that used to go unnoticed, and using incident management practices and tools to address problems it would have struggled with before.

And it’s a self-reinforcing cycle; the more people use these skills and tools, the more comfortable they get with them, and the more they’re inclined to use them. That’s a good thing, as the way to build the skills and confidence to handle big incidents is by handling lots of little ones; the small incidents are an invaluable training ground.

In most companies, you’re dealing with increases in multiple dimensions simultaneously: number of users, number of products, number of features in those products, usage of those features, number of engineers, level of training and experience of those engineers, and many more. With all those factors increasing, why would you expect incident count to decrease? In a very real sense, rising incident counts can be a sign of success, not failure.
The question to ask isn’t “why are we having more incidents?”, it’s “why aren’t we?” And from there, other good questions follow: are we handling our incidents well, do we have the tools and training that we need, are we learning everything we can from every incident, are we preventing future incidents, are we better prepared to handle those we can’t prevent?

But what happens if leadership decides that rising incident count is a problem and sets a target to bring it down? People get the message: fewer incidents is better. So marginal incidents stop getting declared. The “spicy bugs” go back to being handled quietly. The degradations get white-knuckled through again. The number on the dashboard goes down, but the company has lost both the visibility its incident process was providing and the benefits of handling those situations with proper coordination, communication, and prioritization. The problems didn’t go away; you just stopped applying your best tools to them.

# The Goodhart’s Law problem

This pattern has a name: Goodhart’s Law. When a metric becomes a target, it ceases to be a good metric. And it’s not just an incident count problem; it recurs across every incident metric companies reach for.

The logic is straightforward. You focus on a particular metric because you think it captures something you care about. You set a target for the metric because you want to improve. And then smart, well-intentioned people find ways to hit the target. Some of those ways involve actually improving the thing you care about. But some involve optimizing the number without improving the underlying reality, and over time, the second category tends to dominate.

Make incident count a target for reduction, and marginal incidents stop getting declared. Make MTTR a target, and people close incidents prematurely. Make action item completion rate a target, and people write easy action items instead of hard ones.
In every case, the metric looks better while the thing you actually care about (learning, reliability, preparedness) stays the same or gets worse.

This is human nature. Smart people optimize for what gets measured; that’s how incentives work. You’re not going to prevent it by writing sternly worded memos about gaming the system. You can only manage it by choosing metrics carefully and then paying attention to the behaviors driven by your focus on those particular metrics. Adjust when those behaviors aren’t what you intended, and be willing to retire the metric when it’s doing more harm than good.

# MTTR: the metric everybody loves and nobody should trust

Mean Time to Recovery is the most misleading incident metric in the industry. Leadership loves it because it’s a single number that appears to capture “how fast we fix things.” But it’s deeply flawed, both mathematically and in the incentives it creates.

Incident durations follow a power-law distribution: most incidents resolve quickly, while a small number take much longer. When you average power-law data, you get a number that describes nobody’s actual experience. Let’s imagine that last month you had ten incidents, nine of which resolved in about ten minutes each, and one that took six hours. That gives you an MTTR of 45 minutes, but that’s nowhere close to what any of those incidents actually took; it’s far from both ten minutes and six hours.

Google SRE Štěpán Davidovič, in _Incident Metrics in SRE: Critically Evaluating MTTR and Friends_ (O’Reilly, 2021), used Monte Carlo simulations to demonstrate that even with a substantial dataset, MTTR can’t reliably tell you whether your incident response is actually improving. The math doesn’t just give you a misleading number; it can’t even detect real improvement when it’s happening.

There’s also a flattening problem: MTTR treats all incidents as interchangeable, as if the only thing that matters about an incident is how long it took.
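
To make the averaging problem concrete, here is a minimal sketch of the ten-incident month described above. The durations are the illustrative ones from that example, not real data, and the 50% tolerance used to decide what counts as "near the mean" is an arbitrary assumption.

```python
from statistics import mean, median

# Durations in minutes for the hypothetical month: nine incidents of
# about ten minutes each, plus one six-hour incident.
durations_min = [10] * 9 + [6 * 60]

mttr = mean(durations_min)        # the arithmetic mean is what MTTR reports
typical = median(durations_min)   # the middle of the distribution
longest = max(durations_min)

# Which incidents took anywhere near the mean?
# "Near" here means within 50% of the mean, an arbitrary illustrative cutoff.
near_mean = [d for d in durations_min if abs(d - mttr) <= 0.5 * mttr]

print(f"MTTR (mean): {mttr:.0f} min")
print(f"Median:      {typical:.0f} min")
print(f"Longest:     {longest} min")
print(f"Incidents within 50% of the mean: {len(near_mean)} of {len(durations_min)}")
```

The mean lands at 45 minutes even though no incident in the list took anything like 45 minutes. Reporting the median together with the longest incident describes this month more faithfully, though that pairing is only one possible choice.
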
Under MTTR, a six-hour incident where page load times were degraded but the service was still usable somehow scores worse than a one-hour total outage.

The incentive problems are even worse than the mathematical ones. MTTR incentivizes speed over understanding. Thorough incident response sometimes means deliberately slowing down: carefully analyzing symptoms, verifying that a fix actually works, understanding the contributing factors well enough to prevent recurrence. MTTR punishes all of that. A team that’s genuinely improving (catching issues earlier, preventing cascades, tackling more complex problems) can see its MTTR stay flat or even go up. The metric undermines morale and leadership confidence even as the team does better work.

# What fire departments get right about metrics

Here’s a lesson that most software companies could learn from fire departments. Well-managed fire departments decompose their response timeline into segments and set targets only on the segments that they can actually control and that they expect to be fairly consistent across incidents. Dispatch time (how long from answering the 911 call to notifying the fire crew) gets a target. Turnout time (how long from notification to crews leaving the station) gets a target. Drive time (from the crews leaving the station to arrival at the incident scene) gets a target. These are process steps that are largely consistent from one call to the next, and if they’re too slow, you can do something about it: hire more dispatchers, change how crews stage at the firehouse, build more stations.

What fire departments don’t set targets for is how long it takes to put out the fire. That depends on the fire. A dumpster fire and a fully involved warehouse fire are different problems with different durations, and no fixed target could meaningfully apply to both.
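
The decomposition described above can be sketched in a few lines. This is only an illustration: the `Segment` type, the `evaluate` helper, and every target and duration below are hypothetical values invented for the example, not real fire-service standards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    name: str
    target_s: Optional[int]  # None means "track it, but set no target"


# Hypothetical targets in seconds, chosen only for illustration.
SEGMENTS = [
    Segment("dispatch", 60),
    Segment("turnout", 80),
    Segment("drive", 240),
    Segment("suppression", None),  # depends on the fire, so no target
]


def evaluate(durations_s: dict[str, int]) -> dict[str, str]:
    """Label each segment of one call as met, missed, or tracked only."""
    result = {}
    for seg in SEGMENTS:
        actual = durations_s[seg.name]
        if seg.target_s is None:
            result[seg.name] = "tracked only"
        elif actual <= seg.target_s:
            result[seg.name] = "met"
        else:
            result[seg.name] = "missed"
    return result


# Two made-up calls: a dumpster fire and a warehouse fire.
dumpster = evaluate({"dispatch": 45, "turnout": 95, "drive": 200, "suppression": 600})
warehouse = evaluate({"dispatch": 50, "turnout": 70, "drive": 230, "suppression": 14400})

print(dumpster)
print(warehouse)
```

The suppression time is still recorded for both calls, so trends and outliers stay visible, but it is never compared against a target.
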

The same split applies to software incidents. How long it takes to set up an incident channel, how long it takes responders to acknowledge pages, how long it takes the responders to join the channel and get started: these are your equivalent of the fire department’s dispatch chain. They’re largely consistent from one incident to the next, and you can set targets for them. If you’re not meeting the targets, there are obvious adjustments you can make: auto-create incident channels, set and enforce clearer on-call expectations, and so forth.

On the other hand, the time to find the contributing factors, the time to implement a durable fix, and the total time for the incident: those depend on the details of the particular incident. A misconfigured feature flag and a cascading database failure are different problems. Track trends, investigate outliers, learn from reviews. But don’t set targets. Targets on metrics you can’t control produce gaming, demoralization, or both.

# Start with questions, not metrics

The most useful thing you can do is to stop asking “what should we measure?” and start asking “what questions are we trying to answer?” Are incidents being handled well? Are we learning from them? Is the incident management program serving the business? Is our on-call workload sustainable? Each of these questions leads you to seek different evidence, some quantitative, some qualitative, and the answers are more useful than any single number on a dashboard. These aren’t easy questions to answer, but they’re better than an easy number that misleads you (like MTTR).

Your dashboards should make you curious, not confident. When you see a trend, the right response isn’t “we know what’s happening.” It’s “we should dig in and find out why.”

And if your incident count went up this quarter? Before you panic, ask why. You might find that your increased focus on incident management is doing exactly what it’s supposed to do.

* * *

_This is one of the topics I cover in depth in my upcoming book, Incident Management for DevOps and SRE.
If you’d like to hear when it’s available, you can sign up at im4ds.com._

_If your company needs help with incident management right now, my consulting practice is GreatCircle.com/im._

https://greatcircle.com/blog/2026/05/26/incident-metrics-mirage/