Data Quality

Last reviewed April 2026

A Tier 2 bank discovered that 12 per cent of its customer addresses were outdated, 8 per cent of transaction categorisations were wrong, and 3 per cent of counterparty identifiers were duplicated. Its AI programme had been training models on this data for two years. Data quality is not merely a data management concern. It is the single largest risk factor for every AI initiative in financial services.

What is data quality?

Data quality measures how fit data is for its intended use across six dimensions: accuracy (does it reflect reality), completeness (are required fields populated), consistency (do related records agree), timeliness (is it current), validity (does it conform to expected formats), and uniqueness (are there duplicates). In financial services, poor data quality does not just reduce efficiency. It produces wrong decisions: incorrect risk scores, inaccurate regulatory reports, and AI models that learn from flawed examples.
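Several of these dimensions reduce to simple ratios over a dataset. A minimal sketch, using hypothetical customer records and an assumed review date of 1 April 2026 (field names, values, and the one-year timeliness window are all illustrative):

```python
from datetime import date, timedelta

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"id": "C001", "email": "a@example.com", "updated": date(2026, 3, 1)},
    {"id": "C002", "email": None,            "updated": date(2024, 1, 5)},
    {"id": "C001", "email": "a@example.com", "updated": date(2026, 3, 1)},
]

def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(1 for r in rows if r[field]) / len(rows)

def uniqueness(rows, key):
    """Share of rows carrying a distinct key value."""
    return len({r[key] for r in rows}) / len(rows)

def timeliness(rows, field, max_age_days=365):
    """Share of rows refreshed within the allowed window (as of an assumed date)."""
    cutoff = date(2026, 4, 1) - timedelta(days=max_age_days)
    return sum(1 for r in rows if r[field] >= cutoff) / len(rows)

print(round(completeness(records, "email"), 2))   # 0.67 (one missing email)
print(round(uniqueness(records, "id"), 2))        # 0.67 (one duplicate id)
print(round(timeliness(records, "updated"), 2))   # 0.67 (one stale record)
```

Accuracy, consistency, and validity need a reference to compare against (reality, related records, a format specification), so they cannot be computed from the dataset alone.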

The cost of poor data quality compounds through the processing chain. An incorrect customer address at the point of entry causes a failed KYC check, which triggers a manual review, which delays onboarding, which requires a customer service intervention. Each downstream process inherits and amplifies the original error. IBM's frequently cited estimate that poor data quality costs organisations 15 to 25 per cent of revenue may be debatable in its precision, but the direction is not: bad data is expensive, and it gets more expensive the later it is caught.

Data governance defines the policies and roles. Data quality is the measurable outcome. You can have a data governance framework without data quality if the framework is not enforced. You cannot have sustained data quality without governance. The two are distinct but inseparable.

The landscape

The EU AI Act Article 10 establishes explicit data quality requirements for high-risk AI systems. Training, validation, and testing datasets must be relevant, representative, free from errors, and complete relative to their intended purpose. For financial institutions deploying AI in credit scoring or fraud detection, this transforms data quality from a best practice into a legal obligation with enforcement powers.

BCBS 239, the Basel Committee's principles for effective risk data aggregation and reporting, continues to drive investment in data quality infrastructure at banks. Supervisors assess whether institutions can produce accurate, complete risk reports in a timely manner, including under stress conditions. Many institutions are still closing gaps identified in supervisory assessments years after BCBS 239's original publication. The expected standard has risen, but many banks' data quality has not kept pace.

The Bank of England's Transforming Data Collection programme is pushing toward standardised, machine-readable data submissions. This raises the bar for data quality at the source: if data feeds directly into the regulator's systems without manual intervention, errors that were previously caught and corrected by reporting teams will arrive at the regulator unfiltered.

How AI changes this

Automated data quality monitoring replaces periodic manual reviews with continuous surveillance. AI systems profile data streams in real time, detecting anomalies in completeness, distribution, and consistency. A sudden drop in populated email addresses, an unexpected shift in transaction amount distributions, or a spike in duplicate records triggers an alert before the bad data propagates into downstream systems. This moves data quality management from reactive (fix problems after they cause harm) to proactive (catch problems at the point of entry).
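The "sudden drop in populated email addresses" case can be sketched as a simple batch check against a historical baseline. This is a minimal illustration, not a production monitor: the 5 percentage-point alert threshold and the field name are assumptions.

```python
# Minimal completeness monitor: alert when a new batch falls materially
# below the historical baseline. Threshold and data are illustrative.

def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

def check_batch(batch, baseline, field, max_drop=0.05):
    """Return an alert string when completeness drops more than max_drop."""
    rate = completeness(batch, field)
    if baseline - rate > max_drop:
        return f"ALERT: {field} completeness {rate:.0%} vs baseline {baseline:.0%}"
    return None

history  = [{"email": "x@example.com"}] * 95 + [{"email": None}] * 5   # 95% complete
incoming = [{"email": "y@example.com"}] * 80 + [{"email": None}] * 20  # 80% complete

baseline = completeness(history, "email")
print(check_batch(incoming, baseline, "email"))
```

A real deployment would run the same comparison per field and per distribution statistic, with the baseline rolling forward as clean batches arrive.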

Entity resolution uses machine learning to identify and merge duplicate records across systems. A customer who appears as "J. Smith" in the core banking system, "John Smith" in the CRM, and "Jonathan Smith" in the AML platform is the same person. ML-based entity resolution matches these records with higher accuracy than deterministic rules, reducing duplication rates by 70 to 90 per cent in typical deployments. This is foundational for KYC and AML, where a unified customer view is a regulatory expectation.
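As a stand-in for an ML matcher, the J. Smith example can be illustrated with a string-similarity score plus a shared attribute. This is a deliberately simplified sketch: the 0.6 threshold, the date-of-birth blocking rule, and the records are all assumptions, not a production rule set.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_entity(rec_a, rec_b, threshold=0.6):
    """Candidate match: identical date of birth and sufficiently similar names."""
    return (rec_a["dob"] == rec_b["dob"]
            and name_similarity(rec_a["name"], rec_b["name"]) >= threshold)

core = {"name": "J. Smith",       "dob": "1980-02-14"}  # core banking system
crm  = {"name": "John Smith",     "dob": "1980-02-14"}  # CRM
aml  = {"name": "Jonathan Smith", "dob": "1980-02-14"}  # AML platform

print(same_entity(core, crm), same_entity(crm, aml))
```

ML-based resolution improves on this pattern by learning which attribute combinations predict a true match, rather than relying on a hand-tuned threshold.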

Root cause analysis identifies where and why data quality problems originate. Rather than fixing errors at the point of discovery, AI traces them back to the source system, the process, and sometimes the specific user or integration that introduced the error. Fixing the root cause prevents recurrence. Fixing the symptom guarantees repetition.
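The simplest form of this tracing is attributing each failed record to the system that created it and counting. A minimal sketch, with hypothetical source-system names and failure records:

```python
from collections import Counter

# Hypothetical validation failures, each tagged with its originating system.
failures = [
    {"field": "postcode", "source": "branch_app"},
    {"field": "postcode", "source": "branch_app"},
    {"field": "email",    "source": "web_form"},
    {"field": "postcode", "source": "branch_app"},
]

# Count failures by source: the concentration points at the root cause.
by_source = Counter(f["source"] for f in failures)
worst, count = by_source.most_common(1)[0]
print(f"{worst} introduced {count} of {len(failures)} errors")
```

Real lineage tracing works the same way but over richer metadata: the pipeline stage, the integration, and sometimes the individual user that introduced each error.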

What to know before you start

Measure data quality before investing in AI. Run a quality assessment on the datasets that feed your target use case. Measure the six dimensions against defined thresholds. If completeness is below 90 per cent or accuracy is below 95 per cent for critical fields, the AI model will inherit these deficiencies. The assessment takes days. Skipping it costs months of wasted model development.
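The 90 per cent completeness and 95 per cent accuracy thresholds from the text can be applied as a simple go/no-go gate. The measured values below are illustrative, not real assessment results:

```python
# Thresholds from the text; metric values are hypothetical examples.
THRESHOLDS = {"completeness": 0.90, "accuracy": 0.95}

def assess(metrics):
    """Return every (field, dimension, value) that misses its threshold."""
    return [
        (field, dim, value)
        for field, dims in metrics.items()
        for dim, value in dims.items()
        if value < THRESHOLDS[dim]
    ]

measured = {
    "customer_address": {"completeness": 0.88, "accuracy": 0.97},
    "transaction_type": {"completeness": 0.99, "accuracy": 0.92},
}

failures = assess(measured)
for field, dim, value in failures:
    print(f"BLOCK: {field} {dim} {value:.0%} below {THRESHOLDS[dim]:.0%}")
```

Any non-empty result means model development should pause until the failing fields are remediated.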

Fix data quality at the point of capture, not downstream. Input validation, standardised forms, real-time verification against reference data, and clear data entry guidance prevent errors from entering the system. Every pound spent on upstream prevention saves ten on downstream correction. This is a process design investment, not a technology investment.
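Point-of-capture validation can be as simple as rejecting a form before it is stored. A sketch, noting that the UK postcode pattern below is a simplified approximation of the real format, not the full specification:

```python
import re

# Simplified approximation of the UK postcode format, for illustration only.
POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")
# Loose email shape check: something@something.something.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_customer(form):
    """Reject a record at entry rather than correcting it downstream."""
    errors = []
    if not POSTCODE.match(form.get("postcode", "").upper().strip()):
        errors.append("postcode")
    if not EMAIL.match(form.get("email", "")):
        errors.append("email")
    return errors

print(validate_customer({"postcode": "EC1A 1BB", "email": "a@example.com"}))  # []
print(validate_customer({"postcode": "12345", "email": "not-an-email"}))      # ['postcode', 'email']
```

The same checks applied at the form field, with immediate feedback to the user, are what turn validation from error detection into error prevention.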

Assign business owners to data quality metrics. The technology team can build monitoring dashboards. But if nobody is accountable for the numbers on those dashboards, the numbers will not improve. Data quality ownership must sit in the business function that creates and uses the data, not in a central data office that has visibility but no authority.

Start with the data domain that feeds your highest-priority AI use case. Customer master data is the most common starting point because it underpins fraud detection, KYC, credit scoring, and customer analytics. Clean customer data once, and every downstream application benefits. This focused approach delivers measurable improvement faster than a broad data quality programme that tries to fix everything simultaneously.
