Why Passing Tests Doesn’t Reduce Surprise
Organizations are testing more than ever. Most firms can now point to mapped business services, defined recovery objectives, structured exercises and regular reporting to boards. From a governance perspective, that represents meaningful progress.
And yet disruptive incidents continue to unfold in ways that feel out of proportion to what was rehearsed.
That is not because exercises are poorly run. In many organizations, they are thoughtful, well facilitated and supported by senior stakeholders. The more uncomfortable observation is that testing often provides confidence within the frame we design, but less visibility beyond it.
Most exercises are built around validating recovery strategies. A disruption is defined, systems are assumed unavailable, teams follow documented responses and performance is measured against agreed tolerances. If recovery is achieved within the expected timeframe, the outcome is recorded as a success.
There is nothing wrong with that. Boards need evidence. Regulators expect demonstration. Structured exercises provide clarity and accountability.
Working assumptions
But success in that setting usually tells us that recovery works under the assumptions embedded in the scenario. It tells us less about how those assumptions behave when conditions begin to drift.
Every scenario needs boundaries. It has to define what has failed, what remains available and how dependencies are expected to respond. Without those constraints, exercises become unfocused and difficult to manage.
The trade-off is that those same boundaries shape what can be discovered.
If a recovery plan assumes that a key supplier responds within a defined timeframe, and the scenario models that response as expected, the exercise confirms coordination. It does not explore what happens when that supplier is dealing with concurrent demand across multiple clients. If manual processes are tested at steady volumes, the exercise demonstrates capability at that level. It does not necessarily reveal how human load accumulates under escalation.
In that sense, passing an exercise often confirms that a design works in controlled conditions.
Real disruption rarely stays controlled.
An example
Consider a critical SaaS platform that meets its recovery objective during testing. Failover is simulated, data integrity is confirmed and vendor communications follow the expected script. The exercise concludes with confidence that the contractual recovery timeline is achievable.
Months later, a live incident occurs. The same platform experiences disruption, but this time several major clients invoke recovery simultaneously. Vendor support queues lengthen. Status updates become less precise. Internal teams escalate while awaiting clarity, and key individuals find themselves dividing attention between recovery, stakeholder updates and parallel operational demands. None of this violates the contract. But coordination slows.
The service does not collapse, but it stretches. Resolution takes longer than rehearsed. Stakeholders begin to ask why the exercise did not surface these dynamics.
The explanation is rarely that the exercise was flawed. It is that the exercise isolated one stressor, whereas the real event layered several.
Over time, exercise outputs accumulate into assurance artefacts. Reports are presented. Metrics are tracked. Governance forums see structured evidence of preparedness. Confidence grows, and in many respects rightly so.
The difficulty is that confidence can stabilise faster than underlying conditions evolve. Dependencies change. Volumes fluctuate. Staff move roles. Suppliers adjust operating models. Documentation may remain current, but the interaction between components shifts subtly.
When exercises consistently confirm performance within expected parameters, attention tends to focus on maintaining that performance rather than probing where it might thin.
Exploring how recovery behaves
Things begin to change when we look not only at whether recovery works, but at how it behaves when conditions are less tidy than the model assumed.
Structured scenario testing performs an essential role. It creates repeatability, comparability and governance discipline, and it is well suited to demonstrating preparedness. Discovery, however, often requires something slightly different.
Exploratory exercises are rarely as tidy. They may introduce ambiguity rather than clarity. They may layer moderate stresses instead of modelling a single defined event. They may deliberately remove certain assumptions and observe how teams respond when information is incomplete.
For example, instead of fixing the duration of a supplier outage in advance, an exercise might allow uncertainty to remain and adjust conditions as the scenario unfolds. Instead of assuming full staff availability, it may introduce partial absence mid-discussion. Instead of isolating one disruption, it may combine a technical failure with elevated demand.
These variations are not about engineering catastrophe. They are about observing where coordination begins to stretch, where manual effort accumulates and where dependencies compete for the same capacity.
They are harder to score and less comfortable to report. That is why they are less common.
The risk is not that organizations fail to test. It is that they expect one format to provide both governance reassurance and insight into systemic limits.
Passing a test confirms that recovery functions within defined conditions. Reducing surprise requires understanding how that recovery behaves when conditions are less controlled, less isolated and more layered than the scenario assumed.
In complex operational environments, disruption is rarely singular. It is concurrent, uneven and cumulative. The question is not simply whether plans exist, but how they respond when several moderate pressures combine.
Surprise rarely appears because nothing was tested.
More often, it appears because the conditions that mattered most were never tested together.
