Training Data

Last reviewed April 2026

A machine learning model is a compressed representation of its training data. If the data is biased, the model is biased. If the data is incomplete, the model is blind to what it has never seen. Most failures of financial services AI systems, from discriminatory lending to missed fraud, trace back to the same root cause: the training data.

What is training data?

Training data is the dataset used to teach a machine learning model the patterns it needs to make predictions or decisions. In financial services, training data typically consists of historical records: past loan applications and their outcomes, past transactions and whether they were fraudulent, past insurance claims and their settlement amounts. The model learns from these examples and applies the patterns to new, unseen cases.

The quality of training data determines the ceiling of model performance. No algorithm, however sophisticated, can extract signal from data that does not contain it. A credit scoring model trained on a dataset that excludes applicants who were denied credit in the past (survivor bias) will underestimate default risk for the population it has never observed. A fraud detection model trained on data where only 0.1 per cent of transactions are labelled as fraud will struggle to learn fraud patterns without careful resampling or augmentation.

Three properties define good training data: it must be representative of the population the model will serve, it must be accurately labelled, and it must be sufficiently large for the model to learn the patterns that matter. In financial services, achieving all three simultaneously is harder than it sounds. Historical data reflects historical decisions, which may have been discriminatory. Labels are sometimes subjective or delayed. And rare events (defaults, fraud, catastrophic claims) are by definition underrepresented.
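The imbalance problem is easy to surface before training begins. The sketch below profiles label frequencies for a hypothetical fraud dataset (the label names and the 1-in-1,000 ratio are illustrative, not taken from any real portfolio):

```python
from collections import Counter

def profile_labels(labels):
    """Return the frequency of each class label as a fraction of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical fraud dataset: one fraud case per 1,000 transactions.
labels = ["fraud"] * 1 + ["genuine"] * 999
profile = profile_labels(labels)
print(profile["fraud"])  # 0.001 -- far too rare to learn from without resampling
```

A check this simple, run at the start of every project, makes the resampling or augmentation decision explicit rather than discovering the imbalance after a model underperforms.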

The landscape

The EU AI Act Article 10 establishes specific requirements for training data used in high-risk AI systems. Data must be relevant, representative, free of errors to the extent possible, and subject to appropriate data governance. For credit scoring models, this means demonstrating that the training data does not systematically disadvantage any protected group. For fraud detection, it means ensuring label accuracy and consistency across the dataset.

Data privacy regulations constrain what training data can be used. UK GDPR requires a lawful basis for processing personal data, including for model training. Legitimate interest is commonly relied upon, but the balancing test must be genuine and documented. The ICO has indicated increased scrutiny of AI training data practices, particularly where special category data (health, ethnicity, political opinions) is involved. Separately, the PRA's SS1/23 requires that training data choices are documented and justified as part of model development governance.

Synthetic data is emerging as a partial solution to data scarcity and privacy constraints. Rather than training on real customer records, models train on artificially generated data that preserves the statistical properties of the original dataset without containing any real individual's information. The technique is promising for augmenting rare event classes and enabling data sharing between institutions, but the quality of synthetic data varies significantly and must be validated rigorously.
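A deliberately minimal illustration of the idea: fit the mean and covariance of a real (here, invented) numeric dataset and sample new rows from that distribution. Production synthetic-data tools model far richer structure (marginals, categorical fields, tail behaviour) and, as the paragraph above notes, their output must be validated before use:

```python
import numpy as np

def synthesize(real, n_samples, seed=0):
    """Sample synthetic rows that preserve the mean and covariance of `real`.

    A toy sketch: it captures only second-order statistics, whereas real
    synthetic-data generators model much richer structure.
    """
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Invented records: (income, debt-to-income ratio).
real = np.array([[30_000, 0.2], [55_000, 0.4], [80_000, 0.1], [42_000, 0.6]])
synthetic = synthesize(real, n_samples=1000)
# The synthetic sample's means and covariances approximate the real data's,
# but no synthetic row is a copy of any real record.
```

Validation would compare distributions (and downstream model performance) between real and synthetic data, not just eyeball the summary statistics.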

How AI changes this

Automated data quality assessment identifies issues in training datasets before they corrupt the model. AI systems profile the data for missing values, outliers, distribution shifts, label inconsistencies, and potential biases. Issues that a data scientist might miss during manual exploration (a subtle correlation between a feature and a protected characteristic, for example) are detected systematically.
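Two of these checks are small enough to sketch directly: a per-field missing-value rate, and a crude proxy screen that correlates a candidate feature with a protected-group indicator. All field names and values below are illustrative:

```python
import math

def missing_rates(rows, fields):
    """Fraction of records with a missing value, per field."""
    return {f: sum(1 for r in rows if r.get(f) is None) / len(rows)
            for f in fields}

def pearson(xs, ys):
    """Pearson correlation -- a crude first screen for proxy features.

    Real proxy detection also needs non-linear and multivariate checks.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical loan-application records (field names are illustrative).
rows = [
    {"income": 42_000, "postcode_score": 1, "group": 0},
    {"income": None,   "postcode_score": 2, "group": 0},
    {"income": 63_000, "postcode_score": 3, "group": 1},
    {"income": 58_000, "postcode_score": 4, "group": 1},
]
print(missing_rates(rows, ["income", "postcode_score"]))
# {'income': 0.25, 'postcode_score': 0.0}
r = pearson([row["postcode_score"] for row in rows],
            [row["group"] for row in rows])
print(round(r, 2))  # 0.89 -- postcode_score looks like a proxy for group
```

A high correlation does not prove a feature is a proxy, but it flags it for the human review the paragraph describes.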

Data augmentation techniques expand training datasets where genuine examples are scarce. For fraud detection, generative models create realistic synthetic fraud scenarios that the model can learn from. For credit scoring, oversampling techniques create balanced training sets from imbalanced data. The key constraint is that augmented data must reflect real-world distributions, not introduce artefacts that the model mistakes for genuine patterns.
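The simplest rebalancing technique, random oversampling, can be sketched in a few lines. (Methods such as SMOTE interpolate new minority examples rather than duplicating existing ones; the data here is invented.)

```python
import random

def oversample(rows, labels, minority, seed=0):
    """Randomly duplicate minority-class rows until the classes balance.

    A sketch of the simplest rebalancing approach; SMOTE-style methods
    synthesise new minority examples instead of duplicating them.
    """
    rng = random.Random(seed)
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    majority_count = sum(1 for l in labels if l != minority)
    deficit = majority_count - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    return rows + extra, labels + [minority] * deficit

rows = [[0.1], [0.2], [0.3], [0.9]]
labels = ["genuine", "genuine", "genuine", "fraud"]
b_rows, b_labels = oversample(rows, labels, minority="fraud")
print(b_labels.count("fraud"), b_labels.count("genuine"))  # 3 3
```

Duplication is exactly the kind of artefact the paragraph warns about: the model sees the same fraud cases repeatedly, so validation must confirm it learned the pattern, not the individual records.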

Active learning reduces the labelling burden. Rather than labelling thousands of examples manually, the model identifies the examples it is most uncertain about and requests labels only for those. This focuses human labelling effort where it has the greatest impact on model performance, and in typical financial services applications it can cut total labelling cost by more than half.
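The core selection step, uncertainty sampling for a binary classifier, is short: pick the unlabelled examples whose predicted probability sits closest to 0.5. The scores below are invented:

```python
def most_uncertain(probabilities, k):
    """Return indices of the k examples whose predicted probability is
    closest to 0.5 -- the ones the model is least sure about."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Hypothetical model scores on an unlabelled pool of transactions.
scores = [0.02, 0.48, 0.97, 0.55, 0.10, 0.51]
print(most_uncertain(scores, k=3))  # [5, 1, 3]
```

Those three examples go to a human labeller; the confident predictions at 0.02 and 0.97 are left alone, which is where the labelling savings come from.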

Federated learning allows models to train across datasets held by different institutions without centralising the data. Each institution trains locally and shares model updates, not raw data. For AML applications, where cross-institutional patterns are valuable but data sharing is restricted, federated learning offers a path to better-trained models without compromising data sovereignty.
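The aggregation step at the heart of this, federated averaging (FedAvg), is a size-weighted average of locally trained weights. The bank names, weights, and dataset sizes below are purely illustrative:

```python
def federated_average(local_weights, sizes):
    """Combine locally trained weight vectors by a size-weighted average
    (the FedAvg aggregation step). Only model updates cross institutional
    boundaries; raw customer records never leave each institution."""
    total = sum(sizes)
    dim = len(local_weights[0])
    return [sum(w[j] * n for w, n in zip(local_weights, sizes)) / total
            for j in range(dim)]

# Three hypothetical institutions train the same model on local data.
bank_weights = [[0.2, 1.0], [0.4, 0.8], [0.6, 0.6]]
dataset_sizes = [1000, 2000, 1000]
print(federated_average(bank_weights, dataset_sizes))  # approximately [0.4, 0.8]
```

Real deployments add secure aggregation and differential privacy on top, since even model updates can leak information about the underlying data.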

What to know before you start

Audit your training data for bias before training the model. If historical lending decisions were influenced by protected characteristics (directly or through proxies), training on that data will reproduce the discrimination. Remove proxies, rebalance the dataset, and test the trained model for disparate impact across demographic groups. This is not optional under the EU AI Act or the Equality Act.
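One widely used screening heuristic for disparate impact is the selection-rate ratio between groups, with ratios below 0.8 (the "four-fifths rule") flagged for investigation. The counts below are invented, and a ratio alone is never a full fairness analysis:

```python
def disparate_impact_ratio(approved_a, total_a, approved_b, total_b):
    """Selection-rate ratio between two groups. The 'four-fifths rule'
    heuristic flags ratios below 0.8 for further investigation."""
    rate_a = approved_a / total_a
    rate_b = approved_b / total_b
    return min(rate_a, rate_b) / max(rate_a, rate_b)

# Hypothetical approval counts for two demographic groups.
ratio = disparate_impact_ratio(approved_a=300, total_a=500,
                               approved_b=180, total_b=400)
print(round(ratio, 2))  # 0.75 -- below 0.8, warrants investigation
```

This test belongs both in the pre-training data audit and in post-training model validation, since rebalancing the data does not guarantee the trained model passes.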

Label quality is more important than dataset size. A model trained on 10,000 accurately labelled examples will outperform one trained on 100,000 noisy labels. Invest in labelling consistency: clear guidelines, inter-annotator agreement checks, and regular calibration sessions for human labellers. For financial services, this often means having domain experts (underwriters, fraud analysts, compliance officers) involved in the labelling process, not just data annotation contractors.
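Inter-annotator agreement is commonly measured with Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The labels below are an invented example of two fraud analysts reviewing the same ten transactions:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same examples:
    observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Two hypothetical fraud analysts labelling the same ten transactions.
a = ["fraud", "ok", "ok", "fraud", "ok", "ok", "ok", "fraud", "ok", "ok"]
b = ["fraud", "ok", "ok", "ok",    "ok", "ok", "ok", "fraud", "ok", "ok"]
print(round(cohens_kappa(a, b), 2))  # 0.74
```

A kappa well below the team's agreed threshold signals that the labelling guidelines need tightening before more data is annotated.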

Version your training data with the same discipline as your code. When a model is retrained, you must be able to reproduce the exact dataset used for the previous version. Model training documentation under SS1/23 requires traceability from model to training data to source systems. Data versioning tools (DVC, LakeFS, Delta Lake) provide this capability but require integration into your MLOps pipeline.
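The traceability requirement reduces, at minimum, to recording a deterministic fingerprint of the exact dataset each model version was trained on. A toy sketch (dedicated tools such as DVC, LakeFS, and Delta Lake do this at scale, with storage and lineage on top):

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Deterministic content hash of a dataset, recorded alongside the
    model version so a retrain can prove exactly which data it used."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Two hypothetical dataset versions differing in a single value.
v1 = [{"id": 1, "income": 42_000}, {"id": 2, "income": 63_000}]
v2 = [{"id": 1, "income": 42_000}, {"id": 2, "income": 63_500}]
print(dataset_fingerprint(v1) == dataset_fingerprint(v1))  # True
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))  # False
```

Storing the fingerprint in the model's training documentation gives the model-to-data link SS1/23 traceability expects; the source-system lineage still has to be captured separately.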

Start by documenting what training data you have today for your target use case. Map the source systems, assess the label quality, profile the demographic distribution, and identify the gaps. This assessment takes days, not months, and it determines whether you can proceed to model training or need to invest in data preparation first. Most institutions discover the data preparation work is larger than expected. Better to discover that early than after six months of model development.
