Fairness Testing

Last reviewed April 2026

An insurer's pricing model does not use gender as an input. On its face, it complies with the law. But it uses vehicle type, annual mileage, and occupation, all of which correlate with gender. The model's outputs show a 12 per cent price differential between male and female policyholders. Is that fair? Fairness testing is the discipline that answers this question with data rather than opinion, and in financial services, the question is being asked by regulators with increasing frequency.

What is fairness testing?

Fairness testing is the systematic evaluation of an AI system's outputs across demographic groups defined by protected characteristics: ethnicity, gender, age, disability, religion, and other attributes protected under equality law. The goal is to identify disparities in outcomes that may constitute direct or indirect discrimination, and to assess whether those disparities are justified by legitimate business objectives. It is the technical implementation of the legal and ethical obligations that responsible AI frameworks describe.

The testing process involves defining the protected groups, selecting appropriate fairness metrics, computing those metrics against the model's outputs, and interpreting the results. The interpretation is the hard part. A disparity in outcomes does not automatically mean the model is biased. Different groups may have genuinely different risk profiles. The test is whether the disparity is proportionate to the legitimate aim (accurate risk assessment) and whether less discriminatory alternatives could achieve the same aim.
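
To make the mechanics concrete, the sketch below runs the first three steps on a toy decisions table using one simple metric, the gap in outcome rates. The column names ('approved', 'gender') are invented for the example, not a standard schema.

```python
import pandas as pd

# Minimal sketch of steps one to three on a toy decisions table; the
# column names and data are invented for illustration.
decisions = pd.DataFrame({
    "approved": [1, 0, 1, 1, 0, 1, 0, 1],
    "gender":   ["F", "F", "F", "M", "M", "M", "M", "F"],
})

rates = decisions.groupby("gender")["approved"].mean()  # outcome rate per group
disparity = float(rates.max() - rates.min())

print(rates)
print(f"Approval-rate disparity: {disparity:.1%}")
# Step four, interpretation, is not automatable: a non-zero disparity is a
# prompt for investigation, not proof of discrimination.
```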

Multiple fairness metrics exist, and they measure different things. Demographic parity measures whether outcomes are equal across groups. Equalised odds measures whether error rates are equal. Calibration measures whether the model's predicted probabilities are equally accurate across groups. These metrics can conflict: satisfying one may violate another. When base rates differ between groups, for instance, a calibrated model cannot also equalise false positive and false negative rates unless its predictions are perfect. The choice of metric is a policy decision with legal implications, not a technical one.
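
The following sketch computes all three metric families on synthetic data, to show how each asks a different question of the same outputs. Every column name and number is invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic scores and decisions; an imperfect classifier by construction.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "group":  rng.choice(["A", "B"], n),
    "y_true": rng.integers(0, 2, n),
})
df["y_score"] = np.clip(0.3 * df["y_true"] + rng.random(n) * 0.7, 0, 1)
df["y_pred"] = (df["y_score"] >= 0.5).astype(int)

# Demographic parity: are positive-decision rates equal across groups?
dp = df.groupby("group")["y_pred"].mean()

# Equalised odds: are true-positive and false-positive rates equal?
tpr = df[df["y_true"] == 1].groupby("group")["y_pred"].mean()
fpr = df[df["y_true"] == 0].groupby("group")["y_pred"].mean()

# Calibration: among cases scored around 0.7, do roughly 70 per cent turn
# out positive in *each* group?
df["bucket"] = (df["y_score"] // 0.1) * 0.1
calibration = df.groupby(["group", "bucket"])["y_true"].mean()

print("Demographic parity gap:", float(dp.max() - dp.min()))
print("TPR gap:", float(tpr.max() - tpr.min()),
      "FPR gap:", float(fpr.max() - fpr.min()))
print(calibration)
```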

The landscape

The FCA's Consumer Duty requires firms to monitor and evidence that their products and services deliver good outcomes across the customer base. For AI-driven decisions, this means demonstrating that outcomes do not systematically disadvantage any group. The FCA has not prescribed specific fairness metrics, but it expects firms to have a methodology, to apply it consistently, and to act when disparities are identified.

The EU AI Act's requirements for high-risk AI systems include testing for bias during development and monitoring for bias after deployment. Article 9 requires a risk management process that identifies and mitigates foreseeable risks, including discrimination. For UK firms with EU operations, this creates a parallel testing obligation that must be integrated with domestic requirements.

The Equality and Human Rights Commission's guidance on AI and equality clarifies that the Equality Act applies to automated decisions. Indirect discrimination, where a provision, criterion, or practice puts persons sharing a protected characteristic at a particular disadvantage, is unlawful unless it can be objectively justified. This means that a model's use of a feature that correlates with a protected characteristic is potentially unlawful, even if the protected characteristic itself is not an input. Bias in AI is fundamentally a legal risk, not just a reputational one.

How AI changes this

Automated fairness testing platforms compute multiple fairness metrics simultaneously across all protected characteristics, generating a comprehensive fairness profile for each model. These platforms integrate with the model development workflow, providing fairness results alongside accuracy metrics during development. This shifts fairness testing from a post-hoc compliance check to a development-time design constraint.
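
As an indication of what such tooling looks like, the open-source fairlearn library (one option among several, not necessarily what any given platform uses) can compute several metrics per group in one pass, putting fairness results next to accuracy. The synthetic arrays below stand in for real model outputs.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, selection_rate

# Synthetic stand-ins for real labels, predictions, and a protected attribute.
rng = np.random.default_rng(1)
n = 1_000
y_true = rng.integers(0, 2, n)
y_pred = rng.integers(0, 2, n)
gender = rng.choice(["F", "M"], n)

mf = MetricFrame(
    metrics={"selection_rate": selection_rate, "accuracy": accuracy_score},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=gender,
)
print(mf.by_group)      # each metric per group: fairness alongside accuracy
print(mf.difference())  # largest between-group gap per metric
```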

Intersectional fairness testing evaluates outcomes across combinations of protected characteristics, not just individual ones. A model may be fair across gender and fair across ethnicity but unfair for a specific intersection, for example, older women from a particular ethnic group. Intersectional analysis is computationally more demanding and requires larger datasets, but it reveals disparities that single-axis testing misses.
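
The grouping itself is simple: group by the combination of attributes rather than each one alone. The invented data below is constructed so that each single axis looks balanced while the intersections diverge completely.

```python
import pandas as pd

# Invented data: balanced on each axis, unfair at the intersections.
df = pd.DataFrame({
    "approved": [1, 1, 0, 0, 0, 0, 1, 1],
    "gender":   ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age_band": ["<40", "<40", "40+", "40+", "<40", "<40", "40+", "40+"],
})

print(df.groupby("gender")["approved"].mean())                # F: 0.5, M: 0.5
print(df.groupby("age_band")["approved"].mean())              # <40: 0.5, 40+: 0.5
print(df.groupby(["gender", "age_band"])["approved"].mean())  # cells: 1.0 or 0.0

# The dataset-size caveat from the text: always report cell counts, since
# intersectional cells shrink quickly and sparse cells are unreliable.
print(df.groupby(["gender", "age_band"])["approved"].agg(["mean", "count"]))
```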

Causal fairness methods move beyond statistical correlation to assess whether a model's decisions are causally influenced by protected characteristics. Using techniques from causal inference, these methods test whether changing a person's protected characteristic (counterfactually) would change the model's decision. This is a stronger test than statistical parity and aligns more closely with the legal concept of discrimination, which is fundamentally about causation.
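
A full causal test requires a causal graph describing how the protected characteristic influences other features, which is beyond a short example. The deliberately naive probe below shows the counterfactual idea in its simplest form; the function name is invented, and 'model' is assumed to expose a scikit-learn-style predict_proba.

```python
import numpy as np
import pandas as pd

def naive_counterfactual_gap(model, X: pd.DataFrame, attr: str,
                             a: str = "F", b: str = "M") -> float:
    # Flip the protected attribute and measure how much predictions move.
    # This only detects *direct* use of the attribute; a genuine causal test
    # must also propagate the flip through downstream proxy features.
    X_cf = X.copy()
    X_cf[attr] = X_cf[attr].map({a: b, b: a}).fillna(X_cf[attr])
    p = model.predict_proba(X)[:, 1]        # assumes sklearn-style interface
    p_cf = model.predict_proba(X_cf)[:, 1]
    return float(np.abs(p - p_cf).mean())   # 0.0 means no direct sensitivity
```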

Continuous fairness monitoring tracks fairness metrics in production over time. A model that passed fairness testing at deployment can become unfair as the population it serves changes. Monitoring dashboards alert the model risk management function when fairness metrics breach defined thresholds, triggering review before the impact accumulates.
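
A monitor need not be elaborate. The sketch below shows the shape of one: recompute a metric over each monitoring window and escalate on breach. Both the threshold and the escalation hook are invented for illustration; a real firm would set them through its governance process.

```python
import pandas as pd

GAP_THRESHOLD = 0.05  # five percentage points, purely illustrative

def alert_model_risk(attr: str, gap: float) -> None:
    # Stand-in for a real escalation path (ticket, dashboard, email).
    print(f"FAIRNESS ALERT: '{attr}' outcome gap {gap:.1%} "
          f"breaches {GAP_THRESHOLD:.0%}")

def check_recent_decisions(decisions: pd.DataFrame,
                           attr: str = "gender") -> None:
    # Run on each monitoring window (e.g. the last month of decisions).
    rates = decisions.groupby(attr)["approved"].mean()
    gap = float(rates.max() - rates.min())
    if gap > GAP_THRESHOLD:
        alert_model_risk(attr, gap)
```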

What to know before you start

Legal advice is essential. The choice of fairness metric, the threshold for acceptable disparity, and the justification framework for any residual disparity are all legal questions with equality law implications. Involve your legal team from the outset. A fairness testing framework designed without legal input may test the wrong things or draw the wrong conclusions.

Data availability is the practical constraint. Testing for bias across ethnicity requires ethnicity data, which many financial institutions do not collect. The FCA has indicated that firms should consider collecting this data for fairness purposes. Where direct data is unavailable, proxy methods such as surname analysis, geographic inference, or Bayesian Improved Surname Geocoding (BISG), which combines the two, can provide approximate breakdowns, but these introduce their own biases and uncertainties.
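
For a sense of how BISG-style inference works, the sketch below combines a surname-based prior with a geography likelihood under a conditional-independence assumption and normalises. Every number in it is made up; real implementations draw both tables from census data.

```python
import pandas as pd

groups = ["A", "B", "C"]  # placeholder ethnicity categories

# Invented probability tables standing in for census-derived ones.
p_group_given_surname = pd.Series([0.6, 0.3, 0.1], index=groups)
p_geo_given_group     = pd.Series([0.1, 0.4, 0.2], index=groups)

# Bayes under the (strong) assumption that surname and geography are
# conditionally independent given group membership.
posterior = p_group_given_surname * p_geo_given_group
posterior /= posterior.sum()
print(posterior)  # approximate, and inherits the biases of both source tables
```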

Define pass/fail criteria before running the tests. A 2 per cent disparity in approval rates might be acceptable; a 15 per cent disparity is not. But where is the line? Defining thresholds requires input from legal, compliance, risk, and the business. Without predefined criteria, fairness test results become discussion documents rather than decision inputs.
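
Predefined criteria can be as simple as a table of limits that results are checked against. The metric names and numbers below are placeholders for whatever legal, compliance, risk, and the business agree, not recommendations.

```python
# Illustrative pass/fail criteria agreed before testing begins.
CRITERIA = {
    "demographic_parity_gap": 0.02,  # max approval-rate gap, as a proportion
    "tpr_gap": 0.05,                 # max true-positive-rate gap
}

def evaluate(results: dict[str, float]) -> dict[str, bool]:
    # True means the metric is within its agreed limit.
    return {metric: results[metric] <= limit
            for metric, limit in CRITERIA.items()}

print(evaluate({"demographic_parity_gap": 0.012, "tpr_gap": 0.08}))
# -> {'demographic_parity_gap': True, 'tpr_gap': False}
```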

Start with credit and pricing models, where the consumer impact of unfair outcomes is highest and the regulatory scrutiny is sharpest. Run a baseline fairness assessment on your existing models. The results will inform both your fairness testing methodology and your remediation priorities. Build fairness testing into the model validation process for all new models, making it a standard part of the development lifecycle rather than an optional extra.
