Operational Resilience
Last reviewed April 2026
A bank's AI-powered fraud detection system goes down for six hours. During those hours, every payment is either delayed (while manual review is arranged) or released without fraud screening (creating unacceptable risk). The bank has invested millions in the AI system but has never tested what happens when it fails. Operational resilience for AI systems means designing for failure, not just for performance, and in financial services, the PRA now requires it.
What is operational resilience?
Operational resilience is the ability of a firm to prevent, adapt to, respond to, recover from, and learn from operational disruptions. The PRA, FCA, and Bank of England jointly introduced operational resilience requirements that took full effect in March 2025, requiring firms to identify their important business services, set impact tolerances for disruption, and demonstrate they can remain within those tolerances through severe but plausible scenarios.
For AI systems, operational resilience means ensuring that critical AI-dependent services can continue to function when the AI system is unavailable, degraded, or producing incorrect outputs. This requires fallback mechanisms: manual processes, simpler rule-based systems, or previous model versions that can be activated when the primary AI system fails. It also requires testing: regularly verifying that fallback mechanisms work and that the transition from AI to fallback is operationally smooth.
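A minimal sketch of one such fallback, reverting to a pinned previous model version when the primary fails. The model objects, names, and scores here are illustrative stand-ins, not a specific serving API:

```python
# Sketch: revert to a last-known-good model version on primary failure.
# Assumes duck-typed model objects with a predict() method (hypothetical).

import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("fallback")

class VersionedScorer:
    """Scores with the current model; reverts to a pinned prior version on failure."""

    def __init__(self, primary, previous):
        self.primary = primary    # current production model
        self.previous = previous  # last known-good version, kept warm
        self.on_fallback = False

    def score(self, features: dict) -> float:
        try:
            score = self.primary.predict(features)
            self.on_fallback = False
            return score
        except Exception:
            # Record the transition once, then serve from the prior version.
            if not self.on_fallback:
                logger.warning("primary model failed; reverting to previous version")
                self.on_fallback = True
            return self.previous.predict(features)

# Stub models standing in for real serving clients.
class Broken:
    def predict(self, features): raise RuntimeError("serving endpoint down")

class KnownGood:
    def predict(self, features): return 0.12

print(VersionedScorer(Broken(), KnownGood()).score({"amount": 100}))  # -> 0.12
```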
The dependency chain for AI systems is often longer than for traditional systems. An AI model depends on its serving infrastructure, its data pipeline, its feature store, its monitoring systems, and potentially on third-party APIs for real-time data. A failure at any point in this chain can disable the AI system. Understanding these dependencies, and building resilience at each point, is an engineering challenge that many firms have not fully addressed.
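As a sketch of what mapping that chain can look like in practice, the snippet below walks a hypothetical dependency list in order and reports the first broken link; the dependency names and check functions are placeholders for real health probes:

```python
# Sketch: walk an AI system's dependency chain and find the first failure.
# Dependency names and checks are hypothetical stand-ins.

from typing import Callable

DEPENDENCIES: list[tuple[str, Callable[[], bool]]] = [
    ("serving-infrastructure", lambda: True),
    ("data-pipeline",          lambda: True),
    ("feature-store",          lambda: False),   # simulate a failed link
    ("third-party-data-api",   lambda: True),
]

def first_broken_link() -> str | None:
    """Return the first failing dependency, or None if the chain is healthy."""
    for name, check in DEPENDENCIES:
        if not check():
            return name
    return None

broken = first_broken_link()
print("chain healthy" if broken is None else f"chain broken at: {broken}")
```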
The landscape
The PRA's operational resilience framework (PS6/21 and SS1/21) requires firms to map their important business services: those whose disruption would cause intolerable harm to consumers, market integrity, or the firm's safety and soundness. For many firms, AI-dependent services are now among the most important: real-time fraud screening, automated credit decisioning, and algorithmic trading systems. These services must have defined impact tolerances and tested recovery capabilities.
The FCA's parallel operational resilience requirements focus on consumer harm. If an AI system that processes insurance claims goes down and customers cannot file or track claims, that is a consumer harm that the FCA expects the firm to prevent, or at least mitigate within defined tolerances. The FCA assesses whether the firm's contingency arrangements are sufficient to maintain an acceptable level of service during disruption.
The EU AI Act's robustness requirements (Article 15) complement operational resilience by requiring that high-risk AI systems are resilient to errors and inconsistencies. This is a system-level requirement (the AI itself must be robust) that sits alongside the firm-level operational resilience requirement (the business must function when the AI fails). Both must be addressed.
How AI changes this
AI system health monitoring provides real-time visibility into the operational status of AI systems: inference latency, error rates, data pipeline health, and model performance metrics. When any metric breaches a defined threshold, the monitoring system alerts operations teams and can trigger automatic failover to fallback systems. This monitoring is distinct from model performance monitoring (which tracks prediction quality) and addresses the operational availability dimension.
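A minimal illustration of threshold-based health checks driving failover. The metric readers are stubs and the limits are placeholder figures, not recommendations:

```python
# Sketch: alert and fail over when any operational metric breaches its limit.
# Metric sources and the failover hook are hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Threshold:
    metric: str
    limit: float
    read: Callable[[], float]   # returns the current metric value

def evaluate(thresholds: list[Threshold],
             trigger_failover: Callable[[str], None]) -> None:
    """Alert and fail over on the first metric that breaches its limit."""
    for t in thresholds:
        value = t.read()
        if value > t.limit:
            print(f"ALERT: {t.metric}={value:.3f} breaches limit {t.limit}")
            trigger_failover(t.metric)
            return

# Example wiring with stubbed readers (values are illustrative).
thresholds = [
    Threshold("inference_latency_p99_s", 0.5,  lambda: 0.82),
    Threshold("error_rate",              0.01, lambda: 0.002),
    Threshold("pipeline_lag_s",          300,  lambda: 45.0),
]
evaluate(thresholds, trigger_failover=lambda m: print(f"failing over (cause: {m})"))
```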
Graceful degradation architecture ensures that when an AI component fails, the broader service continues to function at reduced capability rather than failing entirely. A fraud detection system might fall back from ML-based scoring to rule-based screening, accepting a higher false positive rate but maintaining coverage. A credit decisioning system might queue applications for manual review rather than declining all applications. The degradation mode must be designed and tested, not improvised during an incident.
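The fraud example might look something like the degradation chain below, with hypothetical stub scorers. The point is that each fallback step is explicit and ordered, with manual review as the fail-safe:

```python
# Sketch: degradation chain for fraud screening. Try the ML scorer, fall
# back to rules, and queue for manual review as a last resort. All
# components are illustrative stubs.

def ml_score(txn: dict) -> float:
    raise RuntimeError("model serving unavailable")   # simulate an outage

def rule_score(txn: dict) -> float:
    # Crude rule-based screen: flag large transactions. Higher false
    # positive rate than the model, but coverage is maintained.
    return 0.9 if txn["amount"] > 5_000 else 0.1

def screen(txn: dict) -> tuple[str, float]:
    for name, scorer in (("ml", ml_score), ("rules", rule_score)):
        try:
            return name, scorer(txn)
        except Exception:
            continue
    return "manual-review-queue", 1.0   # fail safe, not silent

print(screen({"amount": 12_000}))   # -> ('rules', 0.9)
```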
Scenario testing for AI failures goes beyond traditional disaster recovery testing. Scenarios should include: complete AI system outage, degraded performance (the model runs but produces low-confidence outputs), data pipeline failure (the model runs but on stale data), and model corruption (the model produces systematically incorrect outputs without triggering error conditions). Each scenario requires a tested response plan.
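One way to make those scenarios repeatable is a parametrised test scaffold. The sketch below uses pytest, and the service_under_test harness is a hypothetical stand-in for whatever drives your real stack:

```python
# Sketch: pytest scaffold covering the four AI failure scenarios above.
# The harness is a stub; a real one would drive the production service.

from types import SimpleNamespace
import pytest

SCENARIOS = [
    "complete_outage",        # AI system entirely unavailable
    "degraded_performance",   # runs, but low-confidence outputs
    "stale_data",             # runs, but the data pipeline has stalled
    "silent_corruption",      # systematically wrong outputs, no errors raised
]

@pytest.fixture
def service_under_test():
    # Stand-in harness with the two hooks the test needs.
    svc = SimpleNamespace()
    svc.inject_failure = lambda scenario: None
    svc.run_representative_load = lambda: SimpleNamespace(
        within_impact_tolerance=True
    )
    return svc

@pytest.mark.parametrize("scenario", SCENARIOS)
def test_service_stays_within_impact_tolerance(scenario, service_under_test):
    service_under_test.inject_failure(scenario)
    outcome = service_under_test.run_representative_load()
    assert outcome.within_impact_tolerance, (
        f"response plan failed under scenario: {scenario}"
    )
```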
Third-party dependency mapping identifies every external service that AI systems depend on: cloud infrastructure, data providers, ML platform services, and API endpoints. For each dependency, the firm must understand the service level commitment, the failure mode, and the impact on the AI system. Concentration risk, where multiple AI systems depend on the same third party, must be assessed and managed.
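A simple register along these lines can make concentration risk queryable. The entries, providers, and SLA figures below are invented examples:

```python
# Sketch: third-party dependency register with a basic concentration check
# that flags providers on which more than one AI system depends.

from collections import defaultdict

REGISTER = [
    # (ai_system, provider, sla_availability, failure_mode)
    ("fraud-scoring",      "cloud-provider-a", 0.9995, "region outage"),
    ("credit-decisioning", "cloud-provider-a", 0.9995, "region outage"),
    ("fraud-scoring",      "data-vendor-b",    0.999,  "stale feed"),
]

def concentration_risk(register) -> dict[str, list[str]]:
    """Providers on which more than one AI system depends."""
    by_provider: dict[str, list[str]] = defaultdict(list)
    for system, provider, _sla, _failure_mode in register:
        by_provider[provider].append(system)
    return {p: systems for p, systems in by_provider.items() if len(systems) > 1}

print(concentration_risk(REGISTER))
# -> {'cloud-provider-a': ['fraud-scoring', 'credit-decisioning']}
```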
What to know before you start
Identify which of your important business services depend on AI. For each, define what "within impact tolerance" looks like when the AI is unavailable. Can the service continue manually? At what capacity? For how long? The answers determine the resilience requirements for the AI system and the investment needed in fallback capabilities.
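Writing the tolerance down as data makes it checkable. The sketch below is one possible shape; the figures are placeholders, not regulatory guidance:

```python
# Sketch: an impact tolerance expressed as data, with a breach check.
# Service name, figures, and fallback mode are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class ImpactTolerance:
    service: str
    max_outage_hours: float       # how long the AI can be unavailable
    min_fallback_capacity: float  # fraction of normal throughput the fallback must sustain
    fallback_mode: str            # manual, rule-based, or previous model version

FRAUD_SCREENING = ImpactTolerance(
    service="real-time fraud screening",
    max_outage_hours=2.0,
    min_fallback_capacity=0.6,
    fallback_mode="rule-based",
)

def breach(outage_hours: float, capacity: float, tol: ImpactTolerance) -> bool:
    return outage_hours > tol.max_outage_hours or capacity < tol.min_fallback_capacity

print(breach(outage_hours=6.0, capacity=0.4, tol=FRAUD_SCREENING))  # -> True
```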
Test the fallback, not just the AI system. Most firms test their AI systems regularly. Few test the transition to fallback mode, the operation of the fallback, and the transition back to normal operation. A fallback that has never been tested is an assumption, not a control. Include AI failure scenarios in your operational resilience scenario testing programme.
The kill switch must be operational and tested. The ability to disable an AI system rapidly, reverting to fallback processing, is a critical control. It must be technically functional (can it actually disable the system in minutes, not hours?), operationally defined (who has authority to activate it?), and regularly tested (does it work when exercised?). A kill switch that requires a change request and a deployment cycle is not a kill switch.
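A runtime flag read on every request is one way to meet all three tests. This sketch uses an in-memory stand-in for what would, in practice, be a shared configuration service with audited, authorised writes:

```python
# Sketch: an operational kill switch checked at request time, so disabling
# the AI needs no change request or redeploy. The flag store is an
# in-memory stand-in; a real one would be a shared, audited config service.

class KillSwitch:
    def __init__(self):
        self._disabled = False

    def activate(self, authorised_by: str) -> None:
        # Authority to flip the switch should be pre-defined, not improvised.
        print(f"AI disabled by {authorised_by}; reverting to fallback")
        self._disabled = True

    def ai_enabled(self) -> bool:
        return not self._disabled

switch = KillSwitch()

def handle(txn: dict) -> str:
    return "ml-score" if switch.ai_enabled() else "fallback-score"

print(handle({}))                       # -> ml-score
switch.activate("head-of-operations")
print(handle({}))                       # -> fallback-score
```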
Start by mapping AI dependencies in your important business services. For each AI system in the dependency chain, document the fallback mechanism, the transition process, and the impact tolerance. Test the most critical fallback mechanisms. Build AI resilience into your next round of operational resilience scenario testing. The regulatory expectation is clear: if your business depends on AI, your resilience framework must account for AI failure.