PERSPECTIVES · 02 · 9 minute read · 2025

The four-hundred-billion-dollar downtime problem AIOps hasn't solved.

We were promised self-healing infrastructure. We got dashboards that detect outages slightly faster. The gap between promise and outcome is a fortune, and it is widening.

Somewhere in your organisation, there is a service that goes down for between forty seconds and forty minutes, several times a year, and nobody can fully explain why. The first responder will tell you it was a deployment. The second responder will tell you it was a downstream dependency. The post-incident review will land on a phrase like "contributing factors include", which is engineering Latin for "we are not entirely sure".

For the last decade, AIOps was the answer. Machine learning would watch the telemetry. Patterns would emerge. Anomalies would be caught before customers noticed. Incidents would self-heal, or at least self-route to the correct on-call. The category attracted billions in investment and produced an enormous number of dashboards. What it did not produce, in most enterprises I have walked into, is meaningfully less downtime.

§ 01 Detection is not resolution.

This is the central confusion of the field. AIOps tools are very good at the first half of the problem — observing that something is wrong. They are less good at the second half — knowing what to do about it. And the second half is where the money lives. A platform that detects an outage four minutes earlier but still requires the same human chain to resolve it has reduced your mean time to detection. It has not reduced your mean time to recovery in any way the income statement can see.
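To make the distinction concrete, here is a minimal sketch that separates time to detection from the time the human chain takes after detection. The `Incident` record, the `summarise` helper and every number in it are hypothetical, chosen only to illustrate the shape of the argument: detection moves four minutes earlier while the resolution chain stays the same.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when an alert fired or a human noticed
    resolved: datetime  # when service was restored


def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def summarise(incidents: list[Incident]) -> dict[str, float]:
    """Mean minutes to detect, to resolve after detection, and in total."""
    return {
        "mttd": mean(minutes(i.detected - i.started) for i in incidents),
        "resolution": mean(minutes(i.resolved - i.detected) for i in incidents),
        "outage": mean(minutes(i.resolved - i.started) for i in incidents),
    }


def make(detect_after: int, resolve_after_detect: int) -> Incident:
    """Build a hypothetical incident from two durations in minutes."""
    t0 = datetime(2025, 1, 1, 9, 0)
    detected = t0 + timedelta(minutes=detect_after)
    return Incident(t0, detected, detected + timedelta(minutes=resolve_after_detect))


# Hypothetical: detection improves from 10 to 6 minutes,
# the human resolution chain stays at 80 minutes.
before = summarise([make(10, 80)])
after = summarise([make(6, 80)])

mttd_gain = (before["mttd"] - after["mttd"]) / before["mttd"]
outage_gain = (before["outage"] - after["outage"]) / before["outage"]
print(f"MTTD improved by {mttd_gain:.1%}")      # 40.0%
print(f"Outage shortened by {outage_gain:.1%}")  # 4.4%
```

Under these assumed numbers, a forty percent improvement in detection shortens the outage itself by under five percent. The ratio depends entirely on how long the resolution chain is, which is the point.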

I have sat in operations centres where the AIOps platform was producing twelve thousand alerts a week. The on-call engineer was treating ninety-eight percent of them as noise. The two percent that mattered would, in retrospect, have been visible from a basic threshold alert in any monitoring tool from the previous decade. We had spent considerable money to build a system that drowned its own signal.
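The arithmetic behind that anecdote is worth spelling out. The sketch below uses only the two figures quoted above (twelve thousand alerts a week, ninety-eight percent treated as noise) and assumes the engineer's triage was correct; the function name `triage_load` is hypothetical.

```python
def triage_load(alerts_per_week: int, noise_percent: int) -> dict[str, float]:
    """Split a weekly alert volume into noise and alerts that matter."""
    actionable = alerts_per_week * (100 - noise_percent) // 100
    noise = alerts_per_week - actionable
    return {
        "actionable": actionable,
        "noise": noise,
        # Share of alerts that were worth reading at all.
        "precision": actionable / alerts_per_week,
        # Noisy alerts the engineer wades through per useful one.
        "noise_per_actionable": noise / actionable,
    }


load = triage_load(alerts_per_week=12_000, noise_percent=98)
print(load["actionable"])            # 240
print(load["noise_per_actionable"])  # 49.0
```

On those figures, every alert that mattered arrived behind forty-nine that did not.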

If your AIOps platform produces more alerts than your team can read, it is not an operations tool. It is a liability that bills monthly.

§ 02 The integration estate is the actual problem.

Most outages in mature enterprises are not caused by the failure of a single component. They are caused by the unexpected interaction of two components that were each behaving correctly in isolation. AIOps tools, the way they are usually deployed, are excellent at watching individual components. They are weak at watching the spaces between components — the API contracts, the message queues, the third-party dependencies, the rate limits, the certificate rotations, the DNS edges.

Until you instrument the spaces between things, your model is learning from the wrong data. It will get very good at predicting the failures that are already easy to predict, and it will be silent about the failures that actually take you down. This is not an algorithm problem. It is a topology problem. No model fixes a missing input.
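As a rough illustration of what instrumenting the space between two components can mean, the sketch below records latency and outcome per caller-to-callee edge rather than per component. It uses only the Python standard library; `observe_edge`, `edge_error_rate`, the in-memory store and the service names are all hypothetical, and a real deployment would ship these observations to whatever telemetry backend is already in place.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# (caller, callee) -> list of (duration_seconds, succeeded)
edge_observations: dict[tuple[str, str], list[tuple[float, bool]]] = defaultdict(list)


@contextmanager
def observe_edge(caller: str, callee: str):
    """Record how long a cross-component call took and whether it raised."""
    start = time.perf_counter()
    ok = True
    try:
        yield
    except Exception:
        ok = False
        raise  # never swallow the caller's error; only observe it
    finally:
        elapsed = time.perf_counter() - start
        edge_observations[(caller, callee)].append((elapsed, ok))


def edge_error_rate(caller: str, callee: str) -> float:
    """Fraction of observed calls on this edge that raised."""
    obs = edge_observations.get((caller, callee), [])
    if not obs:
        return 0.0
    return sum(1 for _, ok in obs if not ok) / len(obs)


def flaky_call(fail: bool) -> str:
    # Stand-in for a real dependency call (hypothetical).
    if fail:
        raise TimeoutError("upstream timed out")
    return "ok"


for fail in (False, True):
    try:
        with observe_edge("checkout", "payments-api"):
            flaky_call(fail)
    except TimeoutError:
        pass

print(edge_error_rate("checkout", "payments-api"))  # 0.5
```

The design choice that matters is the key: data is indexed by the pair of components, so a model trained on it sees the interaction and not just two healthy-looking endpoints.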

§ 03 Where the four hundred billion goes.

$400B is the order of magnitude of unplanned downtime cost across the Global 2000, by the modelling that has now become consensus across the major industry analysts. The figure is not an outlier. It has been moving up for five consecutive years.
Source: composite of analyst estimates, 2024–2025

The number is large because the failure modes have become more expensive, not because they have become more frequent. A modern bank loses roughly the same number of minutes of trading capacity it lost five years ago. The cost of each of those minutes has climbed sharply, because the volume that runs through those minutes has climbed sharply. Every percentage point of throughput that moved from human to automated systems has raised the price of every second of downtime that remains.

This is the part of the equation that AIOps vendors do not advertise. The promise was that systems would become so reliable that downtime would functionally disappear. The reality is that reliability gains have been roughly cancelled, in dollar terms, by the increased fragility of the systems that depend on the reliability. We are running faster to stay in the same financial place.

§ 04 What actually moves the number.

I have, in the last five years, watched two organisations cut their downtime cost meaningfully. Neither of them did it with a new platform. Both of them did the same three things, in the same order:

1. They ran a brutal exercise to identify the five integration points that produced eighty percent of their incident volume, and they invested disproportionately in instrumenting and hardening those five.
2. They restructured the on-call hierarchy so that the person who could actually resolve an incident was paged first, not the person whose name happened to be at the top of an org chart.
3. They wrote a runbook for every recurring incident class, automated the first three steps of every runbook, and made it a release-blocking requirement that any new service ship with its runbook on day one.
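The first of those three steps is, mechanically, a Pareto count over incident records. A minimal sketch follows, assuming each past incident has already been tagged with the integration point judged responsible; the function `top_contributors` and the sample tags are hypothetical, and the tagging itself is the hard, human part of the exercise.

```python
from collections import Counter


def top_contributors(
    incident_points: list[str], target_share: float = 0.8
) -> list[tuple[str, int, float]]:
    """Return the smallest set of most frequent integration points whose
    cumulative share of incidents reaches target_share.

    Each entry is (integration_point, incident_count, cumulative_share).
    """
    counts = Counter(incident_points)
    total = sum(counts.values())
    result = []
    running = 0
    for point, count in counts.most_common():
        running += count
        result.append((point, count, running / total))
        if running / total >= target_share:
            break
    return result


# Hypothetical incident log: one tag per past incident.
log = (
    ["payments-api"] * 5
    + ["auth-token-refresh"] * 3
    + ["dns"]
    + ["order-queue"]
)

for point, count, cumulative in top_contributors(log):
    print(f"{point}: {count} incidents, {cumulative:.0%} cumulative")
# payments-api: 5 incidents, 50% cumulative
# auth-token-refresh: 3 incidents, 80% cumulative
```

In this toy log two integration points account for eighty percent of the volume. A real estate will have a longer tail, but the output is the same kind of artefact: a short, ranked list of where to spend the hardening budget.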

These are not glamorous moves. They are not what an analyst report describes as transformational. They are, however, what the income statement responds to. AIOps tooling sat on top of these changes and amplified them. AIOps tooling, in the absence of these changes, was decoration.

§ 05 The honest framing.

The category will get there. Models are getting better at causal reasoning. Telemetry is getting cheaper. The integration of large language models into incident response — properly bounded — is a genuine step forward, and I am cautiously optimistic about what the next two product cycles will deliver. But none of that absolves the buyer of the upstream work.

You cannot buy your way out of a downtime problem. You can only invest your way out of it. The tooling helps. It does not substitute.

If you are evaluating an AIOps purchase right now, the question to put to the vendor is not "what does your platform detect?" It is "what does your platform resolve, without a human in the middle, and what is your contractual stance on the ones it gets wrong?" Most vendors will struggle to answer the second half of that question, which is the answer to the first.

Jayashankar Attupurathu · Bengaluru
