Data Lineage

Last reviewed April 2026

A regulator asks how a number in your capital adequacy report was derived. The answer involves seven source systems, four transformations, two manual adjustments, and a spreadsheet that a contractor built three years ago. If you cannot trace this path within hours, you have a data lineage problem, and it grows worse with every AI model you deploy.

What is data lineage?

Data lineage is the record of where data originates, how it moves through systems, and what transformations it undergoes along the way. In financial services, it answers three questions that regulators, auditors, and model validators ask constantly: where did this number come from, what happened to it between source and consumption, and can you prove it? Data lineage sits at the intersection of data governance and data quality: governance defines the policy, quality measures the outcome, and lineage provides the audit trail that connects them.

The practical challenge is that most financial institutions' data landscapes were not designed for traceability. Data flows through ETL pipelines, data warehouses, reporting tools, and analytical models, each of which transforms the data in ways that may or may not be documented. Manual steps (a spreadsheet adjustment here, an email-based override there) break the automated lineage chain. The result is that tracing a single reported figure back to its source can take days of investigative work.

For AI systems, data lineage becomes even more critical. A model's predictions are only as trustworthy as the data that trained it and the data that feeds it in production. If you cannot trace the lineage of your training data back to its source, you cannot demonstrate to a regulator or validator that the data meets quality and representativeness requirements.

The landscape

BCBS 239 requires banks to demonstrate that reported risk data is traceable from source to report. The PRA assesses this capability during supervisory reviews, and many institutions still fall short. The gap is operational rather than conceptual. Everyone agrees lineage is important, but the infrastructure to capture it automatically, across all data flows, at all times, is expensive to build and maintain.

The EU AI Act introduces lineage requirements for high-risk AI systems. Institutions must document the provenance and processing of training data, demonstrate that it meets quality criteria, and maintain this documentation throughout the model's lifecycle. For a credit scoring model, this means tracing every training example back to its source system and demonstrating that the data was collected, processed, and labelled in compliance with applicable regulations.

Cloud data platforms (Snowflake, Databricks, BigQuery) now provide native lineage capabilities for data processed within their environments. This addresses part of the problem but not all of it. Data that originates in core banking systems, passes through on-premises ETL tools, enters the cloud platform, and is consumed by an AI model has a lineage chain that spans multiple technology layers. No single tool captures the complete picture.

How AI changes this

Automated lineage discovery uses AI to map data flows across systems without manual documentation. By analysing query logs, ETL job definitions, API calls, and database schemas, AI systems can reconstruct the lineage of data as it moves from source to consumption. This is typically faster than manual lineage mapping and less prone to omissions, particularly in complex environments with hundreds of data pipelines. The output is a lineage graph that shows each transformation a data element undergoes.
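To make the idea of reconstructing lineage from query logs concrete, here is a minimal sketch in Python. It is a simplified stand-in for the log-parsing step only. The log entries and table names are hypothetical, and the regular expressions handle only plain INSERT INTO and CREATE TABLE ... AS statements. A real tool would more likely use a full SQL parser and combine several evidence sources.

```python
import re
from collections import defaultdict

# Hypothetical query-log entries; real ones would come from a warehouse's audit log.
QUERY_LOG = [
    "INSERT INTO staging.exposures SELECT * FROM core.loans l "
    "JOIN core.counterparties c ON l.cp_id = c.id",
    "INSERT INTO risk.rwa SELECT * FROM staging.exposures e "
    "JOIN ref.risk_weights w ON e.asset_class = w.asset_class",
    "CREATE TABLE reports.capital_adequacy AS SELECT * FROM risk.rwa",
    "SELECT COUNT(*) FROM core.loans",  # read-only, creates no lineage edge
]

# The table being written to, and the tables being read from.
TARGET_RE = re.compile(r"(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def build_lineage(log):
    """Return a mapping of target table -> set of source tables it is built from."""
    upstream = defaultdict(set)
    for statement in log:
        target = TARGET_RE.search(statement)
        if target is None:
            continue  # statement does not write a table
        for source in SOURCE_RE.findall(statement):
            upstream[target.group(1)].add(source)
    return dict(upstream)


lineage = build_lineage(QUERY_LOG)
for table, sources in lineage.items():
    print(table, "<-", sorted(sources))
```

Each entry in the resulting mapping is one set of edges in the lineage graph. Schema inspection and ETL job definitions would add further edges in the same shape.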

Impact analysis becomes automated. When a source system changes its schema, an AI-powered lineage system can identify every downstream report, model, and dashboard that will be affected. This prevents the common scenario where a source system change breaks a regulatory report three weeks later because nobody realised the dependency existed. For regulatory reporting, this capability reduces the risk of reporting errors caused by upstream changes.
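Once a lineage graph exists, impact analysis reduces to a graph traversal. The sketch below assumes a hypothetical set of downstream edges and uses a breadth-first walk to list everything reachable from a changed source. The asset names are illustrative only.

```python
from collections import deque

# Hypothetical edges: each key feeds the assets listed against it.
DOWNSTREAM = {
    "core.loans": ["staging.exposures"],
    "staging.exposures": ["risk.rwa", "dash.portfolio_overview"],
    "risk.rwa": ["reports.capital_adequacy", "model.credit_score_v3"],
}


def impacted_assets(changed, downstream):
    """Breadth-first walk returning every asset reachable from `changed`."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:  # guard against revisiting shared dependencies
                seen.add(child)
                queue.append(child)
    return sorted(seen)


affected = impacted_assets("core.loans", DOWNSTREAM)
print(affected)
```

Running the same function against "risk.rwa" would return only the report and the model, which is the list an owner of that table would need to notify before changing its schema.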

Lineage-aware model governance links AI models to the specific data versions they were trained on. When a model validator asks "what data was this model trained on," the answer is not a narrative description but a precise, machine-readable link to the exact dataset, its source lineage, and its quality metrics at the time of training. This is the standard that mature MLOps environments are moving toward.
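One way to picture such a machine-readable link is a small training record stored alongside the model. The following is a sketch under stated assumptions: the field names, the model identifier, the dataset URI, and the quality metrics are all hypothetical, and the fingerprint hashes an in-memory sample where a real pipeline would hash the stored dataset files.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingRecord:
    """Hypothetical record tying one model version to one exact dataset."""
    model_id: str
    dataset_uri: str
    dataset_sha256: str
    source_tables: tuple
    quality_metrics: dict


def fingerprint(rows):
    """Hash the canonical JSON form of the rows so any change alters the digest."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative training rows; a real dataset would be far larger.
rows = [{"loan_id": 1, "defaulted": 0}, {"loan_id": 2, "defaulted": 1}]

record = TrainingRecord(
    model_id="credit_score_v3",
    dataset_uri="s3://example-bucket/training/2026-03-31/",
    dataset_sha256=fingerprint(rows),
    source_tables=("core.loans", "core.counterparties"),
    quality_metrics={"null_rate": 0.0, "row_count": len(rows)},
)
print(json.dumps(asdict(record), indent=2))
```

A validator can later recompute the fingerprint from the stored dataset and compare it with the recorded value to confirm that the data has not changed since training.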

What to know before you start

Automated lineage tools work best on modern, well-structured data platforms. If your data flows through legacy systems with undocumented transformations, automated discovery will capture only part of the picture. Budget for manual lineage documentation of legacy flows alongside automated capture of modern ones. The hybrid approach is realistic. Pure automation is aspirational.

Lineage granularity matters. Column-level lineage (tracing individual data fields through transformations) is more useful for regulatory and model governance than table-level lineage (knowing which tables feed which reports). But column-level lineage is more expensive to capture and maintain. Match the granularity to the use case: column-level for regulatory reports and AI training data, table-level for general data management.
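The difference between the two granularities can be illustrated with a small example. The sketch below uses hypothetical column-level edges, collapses them to table level, and traces one reported field upstream. The column names and transformation labels are assumptions made for illustration.

```python
# Hypothetical column-level edges: (source column, target column, transformation).
COLUMN_EDGES = [
    ("core.loans.balance", "risk.rwa.exposure", "currency conversion"),
    ("ref.risk_weights.weight", "risk.rwa.rwa_amount", "lookup"),
    ("risk.rwa.exposure", "risk.rwa.rwa_amount", "exposure * weight"),
    ("risk.rwa.rwa_amount", "reports.capital_adequacy.total_rwa", "sum"),
]


def table_of(column):
    """Drop the final name component: 'schema.table.column' -> 'schema.table'."""
    return column.rsplit(".", 1)[0]


def to_table_level(column_edges):
    """Collapse column edges into distinct table-to-table edges."""
    edges = set()
    for src, dst, _ in column_edges:
        if table_of(src) != table_of(dst):  # intra-table steps vanish at this level
            edges.add((table_of(src), table_of(dst)))
    return sorted(edges)


def upstream_columns(column, column_edges):
    """Return every column that contributes, directly or indirectly, to `column`."""
    found = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for src, dst, _ in column_edges:
            if dst == current and src not in found:
                found.add(src)
                stack.append(src)
    return sorted(found)


print(to_table_level(COLUMN_EDGES))
print(upstream_columns("reports.capital_adequacy.total_rwa", COLUMN_EDGES))
```

The table-level view keeps three edges and loses both the transformation labels and the step inside risk.rwa, which is the detail a regulator or validator is most likely to ask about.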

Manual steps break lineage. Every spreadsheet download, manual adjustment, and email-based data transfer creates a gap in the automated lineage chain. The most valuable short-term investment is often eliminating these manual steps rather than building sophisticated lineage tooling around them. Automate the data flow first, then capture the lineage.

Start with the lineage of your most critical regulatory report. Map it end to end, identify the gaps, and build automated lineage capture for the segments where it is feasible. This produces immediate regulatory value, demonstrates the approach, and reveals the data architecture challenges that will apply to broader lineage initiatives.
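Mapping one report end to end and listing its gaps could be sketched as follows. The example assumes each hop has already been labelled by how its lineage was captured. The hop list, the spreadsheet name, and the "automated" and "manual" labels are hypothetical.

```python
# Hypothetical hops feeding one report: (source, target, capture method).
HOPS = [
    ("core.loans", "staging.exposures", "automated"),
    ("staging.exposures", "adjustments.xlsx", "manual"),
    ("adjustments.xlsx", "risk.rwa", "manual"),
    ("ref.risk_weights", "risk.rwa", "automated"),
    ("risk.rwa", "reports.capital_adequacy", "automated"),
]


def lineage_gaps(report, hops):
    """Walk upstream from `report` and return the hops not captured automatically."""
    gaps = []
    seen = set()
    stack = [report]
    while stack:
        target = stack.pop()
        for src, dst, method in hops:
            if dst != target:
                continue
            if method != "automated":
                gaps.append((src, dst))
            if src not in seen:  # visit each upstream node once
                seen.add(src)
                stack.append(src)
    return sorted(gaps)


gaps = lineage_gaps("reports.capital_adequacy", HOPS)
print(gaps)
```

Here the two hops through the spreadsheet are the gaps, so they are the candidates for automation before any further lineage tooling is built around them.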

Last updated May 2026
