LLMOps

Last reviewed April 2026

Your data science team deployed a large language model for document summarisation. It worked brilliantly in testing. In production, it hallucinated a clause that did not exist in the contract, and nobody caught it until the client queried the summary. LLMOps is the emerging discipline of operating large language models reliably in environments where getting it wrong has consequences.

What is LLMOps?

LLMOps (large language model operations) adapts MLOps practices for the specific challenges of deploying, monitoring, and governing large language models. Traditional ML models produce numerical predictions that can be evaluated against known correct answers. LLMs produce text that must be evaluated for accuracy, relevance, safety, and alignment with the intended use case. This difference changes nearly every aspect of the operational lifecycle.

The core challenges are distinct from traditional ML. Prompt management replaces feature engineering: the way you phrase the instruction to the model determines the quality of the output. Evaluation is subjective: there is no single "correct" summary of a document. Hallucination, where the model generates plausible but factually incorrect text, is an inherent property of the architecture, not a bug to be fixed. And cost management matters because LLM inference is orders of magnitude more expensive per prediction than traditional ML models.
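If prompt management is the new feature engineering, prompts need the same versioning discipline as model artefacts. A minimal sketch of that idea in Python, using illustrative names rather than any particular tool's API:

```python
# A minimal sketch of prompt versioning: the prompt template is treated as a
# versioned artefact, stored and referenced alongside the model and the
# governance sign-off it was evaluated with. All names are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class PromptVersion:
    prompt_id: str      # stable identifier, e.g. "contract-summary"
    version: str        # bumped whenever the wording changes
    template: str       # the instruction sent to the model
    model: str          # the model this version was evaluated against
    approved_on: date   # governance sign-off date

SUMMARY_PROMPT = PromptVersion(
    prompt_id="contract-summary",
    version="2.1",
    template=(
        "Summarise the contract below. Only state terms that appear verbatim "
        "in the source text. If a term is not present, say so explicitly.\n\n"
        "Contract:\n{document}"
    ),
    model="example-model-2024-06",
    approved_on=date(2024, 6, 1),
)

def build_prompt(document_text: str) -> str:
    """Render the approved template; the version id travels with the request log."""
    return SUMMARY_PROMPT.template.format(document=document_text)
```

Logging the prompt version with every request is what makes later output quality changes explainable: you can tie a degradation to a specific wording change rather than guessing.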

In financial services, these challenges are amplified by regulatory expectations. A natural language processing system that generates customer communications must be accurate. A system that summarises regulatory documents must not invent requirements. A system that drafts compliance reports must be verifiable against source data. LLMOps must ensure these properties continuously, not just at deployment.

The landscape

The tooling for LLMOps is immature compared to traditional MLOps. Prompt versioning, LLM evaluation frameworks, hallucination detection, and retrieval-augmented generation (RAG) orchestration are all active areas of development with rapidly changing best practices. Institutions deploying LLMs today are building custom tooling for capabilities that will likely be commoditised within two years. This creates a build-versus-wait tension that each institution resolves differently.

The FCA and PRA have not yet issued LLM-specific guidance, but existing model risk management expectations apply. The PRA's SS1/23 requirements for model validation, monitoring, and governance do not exempt LLMs. The challenge is that the validation techniques designed for traditional statistical models (backtesting, sensitivity analysis, benchmark comparison) do not translate directly to generative models. Institutions are developing new validation approaches, but no industry standard has emerged.

Data residency and vendor concentration are strategic concerns. Most LLM capabilities depend on a small number of foundation model providers (OpenAI, Anthropic, Google, Meta). Sending regulated data to third-party APIs raises data protection questions. Hosting open-source models on-premises provides data control but requires significant infrastructure investment and accepts reduced capability. The architecture decision has regulatory, commercial, and technical dimensions that must be evaluated together.

How AI changes this

Retrieval-augmented generation (RAG) is the primary pattern for deploying LLMs in financial services. Rather than relying on the model's parametric knowledge (which may be outdated or incorrect), RAG retrieves relevant source documents and instructs the model to generate its response based on those documents. For regulatory document summarisation, the model reads the actual regulation rather than recalling it from training data. This reduces hallucination and creates a verifiable link between output and source.
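A minimal sketch of the pattern, assuming a hypothetical search_index and llm client rather than any specific vendor's API:

```python
# Minimal RAG sketch: retrieve relevant passages, then instruct the model to
# answer only from those passages. `search_index` and `llm` are assumed
# interfaces with .search() and .generate() methods, not a real library's API.

def answer_with_rag(question: str, search_index, llm, top_k: int = 5) -> dict:
    # 1. Retrieve the passages most relevant to the question.
    passages = search_index.search(question, limit=top_k)

    # 2. Ground the prompt in the retrieved text, not the model's memory.
    context = "\n\n".join(f"[{i + 1}] {p.text}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the numbered sources below. "
        "Cite the source number for every claim. If the sources do not "
        "contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

    # 3. Return the answer together with the sources, so every claim can be
    #    traced back to the document it came from.
    return {
        "answer": llm.generate(prompt),
        "sources": [p.reference for p in passages],
    }
```

Returning the retrieved sources alongside the answer is what makes the output auditable: a reviewer or a downstream check can confirm each claim against the cited passage.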

Automated evaluation frameworks assess LLM outputs against defined quality criteria at scale. Rather than manual review of every output, evaluation models score outputs for accuracy, relevance, completeness, and safety. Human reviewers focus on edge cases and failures flagged by the automated system. For document intelligence applications, evaluation compares extracted data against known correct values. For generative applications, evaluation is more subjective but can still be partially automated.
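One way such an evaluation pass might be structured, with exact comparison for extraction tasks and a scoring function standing in for a judge model on generative tasks (all names illustrative):

```python
# Sketch of an automated evaluation pass over a batch of outputs. For
# extraction tasks the check is comparison against a known correct value; for
# generative tasks `score_generation` stands in for a judge model returning a
# 0-1 score. Record fields and thresholds are illustrative assumptions.

def evaluate_batch(records, score_generation, accuracy_threshold=0.9):
    flagged = []   # records routed to human review
    correct = 0

    for record in records:
        if record.task == "extraction":
            ok = record.output == record.expected   # known correct value
        else:
            ok = score_generation(record) >= accuracy_threshold

        if ok:
            correct += 1
        else:
            flagged.append(record)

    return {
        "accuracy": correct / len(records),
        "for_human_review": flagged,
    }
```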

Guardrail systems enforce output constraints. Filters prevent the model from generating content that violates policies: personal data in outputs, discriminatory language, factual claims that contradict source documents, or responses outside the defined scope. These guardrails operate at inference time, checking every output before it reaches the user. They are the LLMOps equivalent of input validation in traditional software.
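A simplified sketch of an inference-time guardrail chain; the individual checks here are crude placeholders, and a real deployment would use proper personal-data detection and a factual-consistency check against the source document:

```python
import re

def looks_like_phone_number(text: str) -> bool:
    # Very crude placeholder for PII detection: flag any run of 10+ digits.
    return re.search(r"\d{10,}", text) is not None

def apply_guardrails(output: str, source_text: str) -> tuple[bool, list[str]]:
    """Run every output through a chain of checks before it reaches the user."""
    violations = []

    if looks_like_phone_number(output):
        violations.append("possible personal data in output")

    if len(output) > 4000:
        violations.append("output exceeds permitted length")

    # Placeholder for a factual-consistency check against source_text, e.g. an
    # entailment model scoring each output sentence against the source document.

    return (len(violations) == 0, violations)
```

The calling code blocks or escalates any output that fails a check, which is why the function returns the list of violations rather than silently rewriting the text.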

What to know before you start

Define evaluation criteria before deployment, not after. What does a "good" summary look like? What does an "accurate" extraction look like? Create a test set of inputs with human-generated gold-standard outputs and measure your LLM's performance against it. Without this baseline, you cannot know whether the system is improving or degrading, and you cannot demonstrate its quality to a regulator.
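A sketch of how that baseline might run as a regression gate, assuming a JSON test set of inputs and gold-standard outputs and a task-appropriate score function (the structure, not the metric, is the point):

```python
import json

def run_baseline(test_set_path: str, generate, score) -> float:
    """Score the system against a gold-standard test set and return the mean."""
    with open(test_set_path) as f:
        cases = json.load(f)   # assumed shape: [{"input": ..., "gold": ...}, ...]

    scores = [score(generate(case["input"]), case["gold"]) for case in cases]
    return sum(scores) / len(scores)

# Example gate in a release pipeline: refuse to ship a prompt or model change
# that scores below the current production baseline.
# new_score = run_baseline("gold_summaries.json", new_system, score_summary)
# assert new_score >= production_score, "quality regression against gold set"
```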

Hallucination is a feature of the architecture, not a deficiency of a particular model. All LLMs hallucinate. The question is whether your system design prevents hallucinated content from reaching users or decisions. RAG, output verification against source documents, and human-in-the-loop review are complementary controls. Relying on any single control is insufficient for high-stakes financial services applications.

Cost management requires attention from day one. LLM inference costs scale with usage in ways that traditional ML does not. A credit scoring model costs fractions of a penny per prediction. An LLM-based document summariser can cost pennies to pounds per document depending on model size and document length. Costs at production scale, hundreds of thousands of documents per month, must be part of the business case.
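A back-of-envelope cost model makes the point concrete; the per-token prices below are illustrative placeholders, not any vendor's actual pricing:

```python
# Back-of-envelope monthly cost model. Prices are illustrative assumptions.

def monthly_llm_cost(
    docs_per_month: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    price_per_1k_input: float,    # GBP per 1,000 input tokens (assumed)
    price_per_1k_output: float,   # GBP per 1,000 output tokens (assumed)
) -> float:
    per_doc = (
        avg_input_tokens / 1000 * price_per_1k_input
        + avg_output_tokens / 1000 * price_per_1k_output
    )
    return docs_per_month * per_doc

# e.g. 200,000 documents a month, ~6,000 input and ~500 output tokens each,
# at placeholder prices of £0.002 / £0.008 per 1,000 tokens:
# 200_000 * (6 * 0.002 + 0.5 * 0.008) = £3,200 per month
```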

Start with an internal-facing, low-risk application. Summarising internal meeting notes, classifying incoming emails for routing, or drafting first versions of internal reports are all valuable applications where errors are caught by the user before reaching an external audience. Build the audit trail, the evaluation framework, and the operational confidence on internal use cases before deploying customer-facing or regulatory applications.
