Data Leakage
Last reviewed April 2026
An employee pastes a customer complaint into ChatGPT to draft a response. The complaint contains the customer's name, account number, and details of a disputed transaction. That data is now outside the firm's control. Data leakage in the context of AI is not just the traditional cybersecurity concern of data leaving the organisation. It is the new and specific risk of sensitive information flowing into AI models, training datasets, and third-party inference APIs where it should never go.
What is data leakage?
Data governance frameworks traditionally focus on where data is stored and who can access it. Data leakage in the AI context is the unintended exposure of sensitive, confidential, or personal data through AI systems, tools, or workflows. It occurs in three primary ways. First, data is sent to external AI services (cloud APIs, SaaS tools, consumer AI applications) where the firm has no control over how the data is processed, retained, or used for model training. Second, data is embedded in AI models through training, where the model memorises specific data points and can be prompted to reproduce them. Third, data is exposed through AI outputs, where the system reveals information it should not, whether through prompt injection, misconfigured access controls, or model behaviour that surfaces training data in responses.
For financial services, every instance of data leakage is potentially a data protection breach. Customer data shared with an unapproved third party triggers UK GDPR obligations: assessment, possible notification to the Information Commissioner's Office (ICO), and potential notification to affected individuals. The reputational and regulatory consequences are significant, and the "I didn't realise the AI was sending data externally" defence does not satisfy the accountability principle.
The scale of the risk is new. Before generative AI, data leakage was primarily a cybersecurity problem: attackers exfiltrating data, or employees emailing sensitive files to personal accounts. Now, thousands of employees are using AI tools daily, each interaction potentially involving sensitive data. The leakage surface has expanded from a security perimeter breach to every employee's browser.
The landscape
Most large UK financial institutions have issued policies restricting the use of external AI tools with firm data. Samsung, Apple, JPMorgan, and Goldman Sachs all banned or restricted ChatGPT use by employees in 2023. But policies without technical controls are aspirational. Employees use AI tools because they are useful, and usage continues regardless of policy when the tools are accessible and the productivity benefit is compelling.
The UK GDPR's requirements on data processing, data minimisation, and international transfers apply fully to AI. Sending personal data to a US-based AI API is an international data transfer that requires adequate safeguards. Using personal data to fine-tune a model requires a lawful basis. Retaining personal data within a model's weights indefinitely may violate storage limitation principles. The ICO has published specific guidance on generative AI and data protection that financial services firms should treat as required reading.
Third-party AI providers offer varying levels of data protection. Some provide contractual commitments that customer data will not be used for model training. Others offer data processing agreements that meet GDPR requirements. Some offer on-premises or private cloud deployment that keeps data within the firm's environment. The terms vary significantly, and the default settings of many AI services are not suitable for financial services use without modification.
How AI changes this
Data loss prevention (DLP) tools are evolving to detect sensitive data in AI interactions. Traditional DLP monitors email attachments and file transfers. AI-aware DLP extends this to clipboard activity, browser-based AI tools, and API calls to AI services. When an employee pastes a customer record into an AI prompt, the DLP system can detect the sensitive content and block the interaction, log it for review, or redact the sensitive fields before submission.
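To make this concrete, the sketch below shows the kind of pattern-matching check an AI-aware DLP layer might run on a prompt before it leaves the firm's boundary. The patterns, the block/redact policy, and the `screen_prompt` helper are illustrative assumptions rather than any particular vendor's product; a real deployment would combine regexes with ML-based classifiers and named-entity detection.

```python
import re

# Illustrative patterns for UK financial data; a production DLP policy
# would be far broader (names, addresses, ML-based detection, etc.).
PATTERNS = {
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "uk_sort_code": re.compile(r"\b\d{2}-\d{2}-\d{2}\b"),
    "account_number": re.compile(r"\b\d{8}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def screen_prompt(prompt: str, mode: str = "redact") -> tuple[str, list[str]]:
    """Check a prompt before it is sent to an external AI service.

    Returns the (possibly redacted) prompt and the list of pattern names
    that matched, so the interaction can be logged or blocked for review.
    """
    hits = [name for name, rx in PATTERNS.items() if rx.search(prompt)]
    if not hits:
        return prompt, []
    if mode == "block":
        raise ValueError(f"Prompt blocked: sensitive data detected ({hits})")
    redacted = prompt
    for name in hits:
        redacted = PATTERNS[name].sub(f"[REDACTED {name.upper()}]", redacted)
    return redacted, hits

# Example: the customer complaint from the opening scenario.
safe_prompt, findings = screen_prompt(
    "Draft a reply to Jane Smith, account 12345678, sort code 12-34-56, "
    "disputing a £250 card payment."
)
```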
Privacy-preserving AI techniques reduce the need to expose raw data. Differential privacy adds calibrated noise during training or to released outputs, limiting how much any individual record can influence what the model memorises or reveals. Federated learning trains models on data that remains within the firm's environment, sharing only model updates rather than raw data. Synthetic data generation creates realistic but non-personal datasets for development and testing. Each technique has trade-offs between privacy protection and model performance, but for financial services, the privacy requirement is non-negotiable. Red teaming exercises should also specifically test whether models leak training data when prompted adversarially.
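As a toy illustration of the differential privacy idea, the sketch below applies the Laplace mechanism to a single aggregate query. The dataset, clipping bound, and epsilon value are invented for illustration; applying differential privacy during model training (for example via DP-SGD) is considerably more involved, but the principle of calibrated noise bounding any one record's influence is the same.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release an aggregate with calibrated Laplace noise.

    The noise scale sensitivity/epsilon bounds how much any single
    customer's record can shift the published result, which is what
    prevents the output from revealing individual records.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Toy example: average disputed-transaction amount across customers.
amounts = np.array([120.0, 250.0, 80.0, 310.0])   # illustrative data
true_mean = amounts.mean()
# If amounts are clipped to [0, 500], one record can move the mean of
# n records by at most 500 / n, so that is the query's sensitivity.
sensitivity = 500.0 / len(amounts)
private_mean = laplace_mechanism(true_mean, sensitivity, epsilon=1.0)
```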
Retrieval-augmented generation (RAG) architectures reduce leakage risk by keeping sensitive data in a controlled retrieval layer rather than embedding it in the model's weights. The model queries the firm's knowledge management system at inference time, and access controls on the retrieval layer determine what information the model can access for each user. This is architecturally safer than fine-tuning models on sensitive data, because the data remains in systems the firm controls.
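A minimal sketch of that access-control check at the retrieval layer is shown below. The `Document` structure, role-based entitlements, and keyword scoring are hypothetical simplifications; a production RAG system would use embedding search and the firm's actual entitlement model, but the key point holds: documents a user is not entitled to see are filtered out before anything reaches the model's context window.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    classification: str        # e.g. "public", "internal", "customer_pii"
    allowed_roles: set[str]    # roles entitled to see this document

def retrieve_for_user(query: str, user_roles: set[str],
                      corpus: list[Document], top_k: int = 3) -> list[Document]:
    """Filter by entitlement *before* ranking, so documents the user
    cannot see never reach the model's context window."""
    visible = [d for d in corpus if d.allowed_roles & user_roles]
    # Placeholder relevance score; a real system would use embeddings.
    scored = sorted(
        visible,
        key=lambda d: -sum(w in d.text.lower() for w in query.lower().split()),
    )
    return scored[:top_k]

def build_prompt(query: str, docs: list[Document]) -> str:
    context = "\n\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```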
What to know before you start
Provide approved tools, not just prohibitions. Banning external AI use without providing a sanctioned alternative guarantees shadow AI. Employees will use consumer AI tools on personal devices, which is worse than using them on corporate devices where you at least have logging. Deploy approved AI tools with appropriate data protection controls, and make them easy enough to use that employees prefer them over the uncontrolled alternatives.
Classify your data before deploying AI. Not all data carries the same leakage risk. Public information, internal analysis, and customer personal data require different controls. A classification framework that maps data sensitivity to permitted AI use cases prevents both over-restriction (blocking productive use of non-sensitive data) and under-restriction (allowing sensitive data into unsuitable tools). Align this classification with your broader data governance framework.
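One way to make such a framework enforceable rather than purely documentary is to express the mapping in a form that tooling can check. The tiers, destination names, and rules below are invented for illustration and would need to reflect your own classification scheme and approved tool list.

```python
from enum import Enum

class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    CUSTOMER_PII = "customer_pii"

# Illustrative mapping of classification tier to the AI destinations
# where that data may be used. Destination names are placeholders.
PERMITTED_AI_USE = {
    DataClass.PUBLIC:       {"external_llm_api", "approved_internal_assistant"},
    DataClass.INTERNAL:     {"approved_internal_assistant"},
    DataClass.CONFIDENTIAL: {"on_prem_model"},
    DataClass.CUSTOMER_PII: set(),   # no AI use without a specific assessment
}

def is_permitted(data_class: DataClass, destination: str) -> bool:
    return destination in PERMITTED_AI_USE[data_class]

assert is_permitted(DataClass.INTERNAL, "approved_internal_assistant")
assert not is_permitted(DataClass.CUSTOMER_PII, "external_llm_api")
```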
Contractual review of AI vendors must cover data handling explicitly. Where is inference data processed? Is it retained? For how long? Is it used for model improvement? Can it be accessed by the vendor's staff? What happens to the data when the contract ends? These questions must have satisfactory answers before any financial services data enters the system. The procurement team needs AI-specific due diligence criteria.
Start with a usage audit. Before building controls, understand how AI is actually being used across the organisation. Survey staff, review network logs for AI service access, and interview team leaders about productivity tools their teams use. The gap between official policy and actual practice is usually larger than leadership assumes. Build your controls based on actual usage, not assumed usage.
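As one input to that audit, web proxy or firewall logs can be mined for traffic to known AI services. The domain list and log format below are assumptions about a typical CSV proxy export; adapt them to whatever your network tooling actually produces.

```python
import csv
from collections import Counter

# Illustrative list of consumer and SaaS AI endpoints to look for;
# extend it with whatever your web proxy actually records.
AI_DOMAINS = {
    "chat.openai.com", "api.openai.com", "claude.ai",
    "gemini.google.com", "copilot.microsoft.com",
}

def audit_proxy_log(path: str) -> Counter:
    """Count requests per (department, AI domain) from a CSV proxy export
    assumed to contain 'department' and 'host' columns."""
    usage = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            host = row["host"].lower()
            if any(host == d or host.endswith("." + d) for d in AI_DOMAINS):
                usage[(row["department"], host)] += 1
    return usage

# Example use:
# for (dept, host), count in audit_proxy_log("proxy_export.csv").most_common(20):
#     print(f"{dept:20} {host:30} {count}")
```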