Model Monitoring
Last reviewed April 2026
A fraud detection model deployed in January detected 92 per cent of known fraud patterns. By October, detection had dropped to 71 per cent. Nobody noticed until quarterly losses were reviewed in December. Model monitoring is the discipline that catches degradation before it reaches the P&L, and most institutions do not have it.
What is model monitoring?
Model monitoring is the continuous observation of a machine learning model's behaviour and performance after deployment to production. It tracks whether the model is performing as expected, whether the data feeding it has changed, and whether its outputs remain fair, accurate, and reliable. Without monitoring, a model deployed today could be silently failing within months, and the institution would not know until the consequences became visible in financial results or regulatory findings.
Three categories of monitoring cover the essential ground. Performance monitoring tracks accuracy, precision, recall, and other metrics against defined thresholds. Data monitoring tracks the distribution of input features, detecting when the data the model receives in production diverges from the data it was trained on (data drift). Fairness monitoring tracks whether the model's outputs differ systematically across protected groups, detecting bias that may emerge as the population changes.
The gap between what regulators expect and what institutions deliver is significant. The PRA's SS1/23 requires ongoing monitoring of all models in production. In practice, many institutions rely on periodic manual reviews, quarterly at best, that check a snapshot of performance metrics. By the time a quarterly review identifies degradation, the model may have been underperforming for months.
The landscape
SS1/23 is the primary regulatory driver for model monitoring in UK banking. It requires that institutions define monitoring metrics, set thresholds, and establish escalation procedures for when thresholds are breached. The expectation is continuous or near-continuous monitoring, not periodic reviews. Institutions that rely on manual, quarterly monitoring are increasingly challenged during supervisory assessments.
The EU AI Act extends monitoring requirements to all high-risk AI systems, not just traditional models. This includes monitoring for accuracy degradation, bias emergence, and compliance with the system's intended purpose. For institutions operating under both UK and EU regulatory regimes, the monitoring requirements are converging toward continuous, automated surveillance of all production AI systems.
The MLOps tooling ecosystem now includes dedicated monitoring platforms (Evidently, WhyLabs, Arize, Fiddler) alongside monitoring capabilities embedded in cloud ML platforms. The tooling is mature. The organisational challenge (defining who owns monitoring, who responds to alerts, and how monitoring feeds into governance workflows) is where most institutions struggle.
How AI changes this
Automated drift detection identifies when input data distributions shift before the shift affects model performance. Statistical tests (Kolmogorov-Smirnov, Population Stability Index, Jensen-Shannon divergence) run continuously on incoming data, comparing it to the training distribution. When drift exceeds a threshold, the system alerts the model owner and optionally triggers retraining. This is the most widely deployed monitoring capability and the most immediately valuable.
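To make the mechanics concrete, here is a minimal sketch of the Population Stability Index mentioned above, using only NumPy. The binning scheme (deciles of the training distribution) and the alert threshold of 0.2 are common rules of thumb, not a standard mandated by any regulator; the data is synthetic.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature sample and a production sample.

    Bins are quantiles of the training (expected) distribution, so each
    bin holds roughly the same share of training observations.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip production values into the training range so every value lands in a bin
    actual = np.clip(actual, edges[0], edges[-1])
    exp_pct = np.histogram(expected, edges)[0] / len(expected)
    act_pct = np.histogram(actual, edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)           # training-time feature values
prod_stable = rng.normal(0, 1, 10_000)     # same distribution: no drift
prod_shifted = rng.normal(0.5, 1, 10_000)  # mean shifted by half a standard deviation

print(population_stability_index(train, prod_stable))   # near zero
print(population_stability_index(train, prod_shifted))  # materially higher
```

A PSI below roughly 0.1 is usually read as stable, 0.1 to 0.2 as worth watching, and above 0.2 as drift requiring investigation; in production this check would run on every monitored feature as new data arrives.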
Prediction monitoring tracks the distribution of model outputs over time. A credit scoring model that suddenly approves 30 per cent more applications than its historical average may be reflecting a genuine population shift or may be drifting. Output distribution monitoring flags the change for investigation, regardless of the cause.
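A simple way to operationalise the approval-rate check described above is a one-sample proportion z-test against the historical baseline. The function name, the figures, and the three-standard-error alert threshold are illustrative choices, not taken from any particular platform.

```python
import math

def approval_rate_alert(baseline_rate, window_approvals, window_total, z_crit=3.0):
    """Flag when the recent window's approval rate deviates from the
    historical baseline by more than z_crit standard errors."""
    rate = window_approvals / window_total
    se = math.sqrt(baseline_rate * (1 - baseline_rate) / window_total)
    z = (rate - baseline_rate) / se
    return rate, z, abs(z) > z_crit

# Hypothetical figures: historical approval rate 42%, this week's window of 1,000 decisions
rate, z, alert = approval_rate_alert(0.42, 540, 1000)
print(f"window rate {rate:.1%}, z = {z:.1f}, alert = {alert}")
```

Note that the alert deliberately fires on deviation in either direction: a sudden drop in approvals is as worthy of investigation as a sudden rise, and the test cannot distinguish genuine population shift from drift; a human still has to.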
Automated fairness monitoring tests model outputs across protected characteristics continuously. A model that was fair at deployment can become unfair as the underlying population changes. If approval rates diverge between demographic groups beyond a defined threshold, the monitoring system flags the model for review. This converts fairness compliance from a point-in-time validation into an ongoing control.
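The divergence check can be as simple as comparing approval rates across groups on a rolling window of decisions. The sketch below uses a ratio test inspired by the "four-fifths" rule of thumb; the 0.8 threshold, group labels, and data are all illustrative assumptions, and a real deployment would choose metrics and thresholds with legal and compliance input.

```python
from collections import defaultdict

def approval_rates_by_group(records):
    """records: iterable of (group, approved) pairs from the decision log."""
    counts = defaultdict(lambda: [0, 0])  # group -> [approved, total]
    for group, approved in records:
        counts[group][0] += int(approved)
        counts[group][1] += 1
    return {g: a / n for g, (a, n) in counts.items()}

def parity_alert(rates, min_ratio=0.8):
    """Flag if the lowest group's approval rate falls below min_ratio
    of the highest group's rate."""
    lo, hi = min(rates.values()), max(rates.values())
    return lo / hi < min_ratio

# Synthetic window: group A approved 80/100, group B approved 55/100
decisions = ([("A", True)] * 80 + [("A", False)] * 20
             + [("B", True)] * 55 + [("B", False)] * 45)
rates = approval_rates_by_group(decisions)
print(rates, parity_alert(rates))  # B's rate is 69% of A's, below 0.8: flagged
```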
Feedback loop monitoring tracks the quality of the ground truth labels used to assess model performance. If the labelling process changes (a new fraud investigation team with different thresholds for what constitutes fraud, for example), the model's measured performance may shift even if the model itself has not changed. Monitoring the label distribution alongside the model's predictions catches this confound.
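One way to catch the labelling confound described above is a two-proportion z-test on the positive-label rate between a reference period and the current one. The figures below are hypothetical, and the three-standard-error threshold is an illustrative choice.

```python
import math

def label_rate_shift(old_pos, old_n, new_pos, new_n, z_crit=3.0):
    """Two-proportion z-test on the positive-label rate between a reference
    labelling period and the current one. A significant shift suggests the
    labelling process, not the model, may have changed."""
    p_old, p_new = old_pos / old_n, new_pos / new_n
    pooled = (old_pos + new_pos) / (old_n + new_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / old_n + 1 / new_n))
    z = (p_new - p_old) / se
    return z, abs(z) > z_crit

# Hypothetical: confirmed-fraud rate rises from 2.0% to 3.1% after a team change
z, shifted = label_rate_shift(200, 10_000, 310, 10_000)
print(f"z = {z:.1f}, labelling shift flagged = {shifted}")
```

A flag here does not say which changed, the population or the labellers; it says measured model performance in this period should not be compared naively against the last one.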
What to know before you start
Define monitoring metrics and thresholds during model development, not after deployment. The model development team understands what performance level is acceptable and what degradation is meaningful. If this knowledge is not captured in monitoring configuration before deployment, the operations team inherits a model with no defined operating parameters.
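Capturing those operating parameters can be as lightweight as a structured handover artefact checked in alongside the model. The sketch below is one possible shape, not a standard schema; every field name and threshold is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringConfig:
    """Operating parameters the development team hands over with the model.
    All names and thresholds here are illustrative, not a standard schema."""
    model_id: str
    performance_metric: str          # e.g. "recall" on confirmed outcomes
    performance_floor: float         # alert if the metric falls below this
    drift_metric: str                # e.g. "psi" per input feature
    drift_threshold: float           # alert if drift exceeds this
    fairness_max_divergence: float   # max approval-rate gap between groups
    escalation_owner: str            # who receives and must act on alerts

cfg = MonitoringConfig(
    model_id="fraud-scoring-v3",
    performance_metric="recall",
    performance_floor=0.85,
    drift_metric="psi",
    drift_threshold=0.2,
    fairness_max_divergence=0.05,
    escalation_owner="model-risk-team",
)
print(cfg.model_id, cfg.performance_floor)
```

The point is less the data structure than the handover: because the config names an escalation owner and explicit thresholds, the operations team inherits a model with its acceptable operating range already defined.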
Monitoring generates alerts. Alerts require responses. Define the escalation process before turning on monitoring: who receives the alert, what is the expected response time, what investigation steps are required, and what governance approval is needed for retraining or retirement. A monitoring system that generates alerts nobody acts on is worse than no monitoring, because it creates a false sense of security.
Ground truth latency affects what you can monitor in real time. Fraud detection ground truth (confirmed fraud) may lag predictions by weeks or months. Credit default ground truth lags by years. In the absence of immediate ground truth, proxy metrics (prediction stability, feature drift, output distribution) provide early warning signals. They are not substitutes for eventual ground truth evaluation, but they fill the gap.
Start by monitoring your highest-risk model. Identify the model whose failure would cause the most regulatory, financial, or reputational damage. Build monitoring for that model first: data drift detection, output distribution tracking, and performance evaluation where ground truth is available. The infrastructure you build for one model extends to the rest of the model inventory incrementally.