From Impact to Fragility: Why Resilient Services Still Surprise Us
Over recent years organizations have made significant progress in response to operational resilience regulation. Many can now identify their business services, define impact tolerances, and map the people, processes, technology, and third parties that underpin delivery. From a compliance perspective, the foundations are largely in place.
Yet disruptive incidents continue to expose weaknesses that were not fully anticipated. Services that appear resilient on paper still fail in unexpected ways, and post-incident reviews often identify issues that were already known in isolation. The challenge is that their cumulative effect was not well understood in advance.
One reason for this is the way resilience is commonly discussed. Much of the focus is on impact: what would happen if a service were disrupted, how severe that disruption would be, and how quickly unacceptable harm would arise. This perspective underpins business impact assessments (BIAs), impact tolerances, and scenario testing, and it remains essential. However, it does not always explain how close a service may already be to disruption.
A different lens
How close a service already is to disruption is better addressed through the lens of fragility. Fragility is concerned with the condition of a service before disruption occurs, and with how vulnerable it is given the state of its underlying dependencies. A service may have a clear impact tolerance and documented recovery plans, yet still be fragile in practice if key dependencies are weak, untested, or overly concentrated.
Crucially, fragility does not always arise from a single point of failure. More often, it emerges from the interaction of several conditions that each appear acceptable in isolation. Together, they can materially reduce a service's ability to absorb disruption. That reduction often becomes visible only during an incident, when teams discover too late that the same people, systems or suppliers are required to recover multiple dependencies at the same time.
An example
Consider a simplified example of a customer-facing service. Impact tolerances have been defined, and recovery plans exist. However, recovery relies on a legacy application that has not undergone disaster recovery (DR) testing recently, a small number of staff with deep system knowledge, and coordination with an external provider. Individually, none of these dependencies would necessarily be considered unacceptable. Viewed collectively, they create a service that may struggle to recover under real pressure.
A shift in focus can help
For practitioners, this perspective suggests a small but important shift in focus. In addition to identifying critical services and single points of failure, it can be useful to ask:
- Where do multiple marginal dependencies cluster within the same service?
- Which recovery actions rely on the same people, systems or third parties?
- Where have resilience assumptions not been tested together?
These questions do not replace existing resilience practices, but they can help organizations prioritize attention, testing, and investment toward areas where fragility may be accumulating quietly.
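To make the first two questions concrete, the sketch below shows one way a small dependency inventory might be scanned for clustering and for shared recovery resources. It is illustrative only: the data, field names (`margin`, `recovery_resources`), and thresholds are assumptions for the example, not part of any framework or regulatory standard.

```python
from collections import defaultdict

# Hypothetical dependency inventory for two business services.
# Ratings and field names are illustrative assumptions.
dependencies = [
    {"service": "payments", "name": "legacy core app", "margin": "weak",
     "recovery_resources": {"ops-team-a", "dr-site-1"}},
    {"service": "payments", "name": "key-person knowledge", "margin": "weak",
     "recovery_resources": {"ops-team-a"}},
    {"service": "payments", "name": "external provider", "margin": "marginal",
     "recovery_resources": {"vendor-x"}},
    {"service": "onboarding", "name": "document store", "margin": "strong",
     "recovery_resources": {"ops-team-b"}},
]

MARGINAL = {"weak", "marginal"}  # individually acceptable, but thin

# 1. Where do multiple marginal dependencies cluster within the same service?
clusters = defaultdict(list)
for dep in dependencies:
    if dep["margin"] in MARGINAL:
        clusters[dep["service"]].append(dep["name"])
for service, names in clusters.items():
    if len(names) >= 2:
        print(f"{service}: {len(names)} marginal dependencies cluster here: {names}")

# 2. Which recovery actions rely on the same people, systems or third parties?
shared = defaultdict(set)
for dep in dependencies:
    for resource in dep["recovery_resources"]:
        shared[resource].add(dep["name"])
for resource, names in shared.items():
    if len(names) >= 2:
        print(f"{resource} is needed to recover {len(names)} dependencies: {sorted(names)}")
```

Even a rough check of this kind can surface services where several thin dependencies, or a single shared recovery resource, quietly concentrate fragility.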
As operational resilience continues to mature, understanding not just the impact of disruption but also the fragility that precedes it may be key to reducing unwelcome surprises.
