
Claude Report - State of Enterprise LLM Customization, Fine-Tuning, and Serving

The State of Enterprise LLM Customization, Fine-Tuning, and Serving: A Landscape Report


Prepared for startup due diligence purposes — February 2026


The market for enterprise LLM customization tooling is large, growing, and increasingly contested. The global LLM market was valued at roughly $6.4 billion in 2024 and is projected to reach $36 billion by 2030, with the small/specialised model sub-segment alone expected to grow from $6.5 billion to over $20 billion in that period. The share of enterprises running AI use cases in production roughly doubled to 31% in 2025, up from about 15% in 2024.

However, this is a market undergoing rapid structural change. The performance gap between open-source and proprietary frontier models has narrowed to under 5% on many benchmarks. Techniques that were cutting-edge 18 months ago (basic LoRA fine-tuning, naive RAG) are commoditising fast. Meanwhile, new methods like reinforcement fine-tuning (RFT) and model distillation are opening fresh opportunities. The key strategic question for any startup in this space is whether their differentiation sits in a part of the stack that frontier model progress and cloud provider expansion will erode, or in a part that becomes more valuable as the ecosystem matures.


It’s important to understand that “fine-tuning” is just one layer in a spectrum of customisation approaches. Enterprises typically progress through these in order of increasing cost and complexity:

Prompt engineering is the starting point — crafting instructions, few-shot examples, and chain-of-thought patterns. Modern models with large context windows have made this surprisingly powerful. Many use cases that would have required fine-tuning in 2023 can now be handled by a well-structured prompt.

Retrieval-Augmented Generation (RAG) augments model responses with retrieved external knowledge at inference time. It addresses the model’s knowledge gaps without modifying weights, and remains the standard approach for grounding outputs in proprietary or frequently-changing data.

Fine-tuning modifies the model’s weights using a task-specific dataset. This is where the bulk of the tooling market sits. The key sub-categories are supervised fine-tuning (SFT) using input-output pairs, preference-based alignment (RLHF, DPO), and the newer reinforcement fine-tuning (RFT/RLVR) using reward functions. Parameter-efficient methods like LoRA and QLoRA have made this dramatically more accessible — you can fine-tune a 70B parameter model on a single high-end GPU.

Continued pre-training exposes the model to large volumes of unlabelled domain text, teaching it new vocabulary and concepts before fine-tuning on specific tasks. This is more expensive but important for highly specialised domains (medical, legal, scientific).

Distillation transfers capabilities from a large “teacher” model to a smaller “student” model. This has become strategically critical as enterprises seek to reduce inference costs while retaining performance. Companies like AT&T and others have achieved significant cost reductions by distilling frontier model outputs into smaller open-source models.

Training from scratch is generally only viable for the frontier labs and the largest enterprises. For most organisations, it’s neither practical nor necessary.

The practical reality for most enterprises is that they use a combination of these approaches. A common pattern involves fine-tuning for tone, style, and output consistency, while using RAG for factual grounding with current data. The tooling ecosystem must support this hybrid workflow.


3.1 Frontier Lab Proprietary Fine-Tuning Services


The major model providers all offer fine-tuning of their own models, typically via API:

OpenAI provides fine-tuning for GPT-3.5 and GPT-4 class models through their API. They’ve also previewed a managed reinforcement fine-tuning capability, though it remains in research preview and is limited to their proprietary models. Pricing is per-token, typically a few dollars per million tokens for standard runs.

Google offers fine-tuning of Gemini and PaLM models through Vertex AI, with the broadest suite of customisation options among the frontier labs — including prompt tuning, adapter tuning (LoRA), and full fine-tuning.

Anthropic partners with AWS, offering fine-tuning of Claude models through Amazon Bedrock rather than providing a direct fine-tuning API.

Cohere provides fine-tuning specifically positioned for enterprise RAG and search use cases, and offers a free prototyping tier.

The key limitation of these services is vendor lock-in. Your fine-tuned model only runs on that provider’s infrastructure, and you can’t inspect or export the weights. This is acceptable for many use cases but a dealbreaker for organisations with strict data sovereignty or portability requirements.

The hyperscalers have built comprehensive managed platforms that sit a layer above the frontier labs:

AWS Bedrock and SageMaker offer the broadest multi-vendor model access — Anthropic Claude, Meta Llama, Mistral, AI21, Cohere, Stability AI, and Amazon’s own Titan models, all through a unified API. Bedrock provides fine-tuning capabilities, built-in RAG via Knowledge Bases, and agent orchestration through the recently launched AgentCore. SageMaker provides deeper MLOps capabilities for teams that need more control. The strength is the serverless architecture and deep integration with the AWS ecosystem.

Azure AI Foundry (formerly Azure OpenAI Service) centres on direct access to OpenAI’s models within Azure’s enterprise compliance framework. Its strength is the integration with Microsoft 365, Active Directory, and the broader Microsoft enterprise stack. Fine-tuning has historically been limited to GPT-3.5 with GPT-4 tuning expected. The “OpenAI on Your Data” feature grounds models in private datasets without retraining. Azure’s usage in EMEA tripled between 2023 and 2025.

Google Vertex AI stands out for data-intensive, analytics-driven enterprises. Native integration with BigQuery, Dataflow, and Looker means organisations can train and query models without moving data across systems. It offers the full spectrum of customisation from prompt tuning to full fine-tuning of Gemini models, plus a Model Garden for third-party and open-source models.

The cloud providers represent the most significant competitive threat to independent tooling startups. They’re aggressively building out fine-tuning, evaluation, and deployment capabilities that were previously the domain of startups, and they have the advantage of existing enterprise relationships and infrastructure lock-in.

The independent fine-tuning and serving startups are where the most dynamic competition is playing out:

Together AI has emerged as a leading inference and fine-tuning platform for open-source models. They offer full fine-tuning and LoRA, continued fine-tuning from checkpoints, DPO alignment, and both serverless and dedicated endpoint deployment. They support models like DeepSeek R1 and the Llama family.

Predibase (now acquired by Rubrik for a reported $100M-$500M on ~$28M in funding) pioneered reinforcement fine-tuning as a managed service and the LoRAX open-source serving engine for efficiently serving many LoRA adapters from a single GPU. Their acquisition by a data management company signals that the fine-tuning tooling layer may be more valuable as a feature of a larger platform than as a standalone business.
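The economics behind multi-adapter serving engines like LoRAX are easy to see in a toy sketch. This is pure Python with illustrative numbers, not the LoRAX API: the point is that a LoRA adapter is a tiny fraction of the base model's size, so serving a hundred fine-tuned variants from one shared base costs a small fraction of hosting a hundred merged models.

```python
# Toy sketch of multi-LoRA serving economics (illustrative figures,
# not the actual LoRAX API or exact parameter counts).

BASE_PARAMS = 7_000_000_000      # one 7B base model, loaded once
ADAPTER_PARAMS = 4_000_000       # a low-rank adapter: ~0.06% of the base

def memory_for(n_tenants: int, shared_base: bool) -> int:
    """Parameters held in memory for n_tenants fine-tuned variants."""
    if shared_base:
        # LoRAX-style: one base model plus one small adapter per tenant
        return BASE_PARAMS + n_tenants * ADAPTER_PARAMS
    # Naive alternative: a fully merged copy of the model per tenant
    return n_tenants * (BASE_PARAMS + ADAPTER_PARAMS)

ratio = memory_for(100, shared_base=True) / memory_for(100, shared_base=False)
print(f"shared-base memory is {ratio:.1%} of per-tenant copies")
```

Under these assumptions, a hundred tenants on a shared base need roughly 1% of the memory that a hundred merged copies would, which is why packing many adapters onto a single GPU changes the unit economics of serving fine-tuned models.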

Anyscale provides the commercial platform for Ray, the open-source distributed computing framework. They position for teams that need to scale fine-tuning across many GPUs with full control, and serve both fine-tuning and inference use cases.

Modal offers a developer-focused cloud compute platform that has become popular for fine-tuning workloads due to its pay-per-second GPU pricing and simple Python-native API.

Fireworks AI focuses on inference speed and cost optimisation, with fine-tuning as part of the pipeline — particularly strong for serving fine-tuned models at low latency.

H2O.ai launched Enterprise LLM Studio in 2025, running on Dell infrastructure, providing fine-tuning-as-a-service specifically for on-premises deployment. Customers include AT&T and major financial institutions.

Snorkel AI differentiates on the data preparation side — helping enterprises curate high-quality training data before fine-tuning, with integration into Databricks, SageMaker, Vertex, and Azure ML.

SiliconFlow positions as a cost-efficient all-in-one AI cloud covering inference, fine-tuning, and deployment.

A rich ecosystem of open-source tools has made fine-tuning accessible to teams with ML engineering capability:

Hugging Face remains the centre of gravity — the model hub, Transformers library, Trainer API, and PEFT library together form the de facto standard stack for fine-tuning open-source models, with well over a million models hosted on the hub.

LLaMA-Factory provides a no-code/low-code UI for fine-tuning, with integrated Unsloth optimisation that reduces memory usage by ~62% and speeds training by ~2.2x. It has become a popular choice for teams that want to fine-tune without deep infrastructure expertise.

Axolotl is an open-source fine-tuning orchestrator particularly popular for LLaMA and Mistral models.

Unsloth focuses specifically on memory and speed optimisation for fine-tuning, and integrates with LLaMA-Factory and other tools.

DeepSpeed (Microsoft) provides distributed training optimisation for large models across multi-GPU and multi-node setups.

vLLM has become the standard for efficient inference serving of fine-tuned models, with optimisations like continuous batching and PagedAttention.

Red Hat / InstructLab takes an interesting approach with the LAB (Large-scale Alignment for chatBots) method, allowing domain experts to contribute knowledge to models without traditional fine-tuning. This is packaged in RHEL AI and OpenShift AI for enterprise Kubernetes environments.

Several broader platforms now include fine-tuning capabilities:

Databricks (with Mosaic AI, acquired in 2023) integrates model training, fine-tuning, and MLflow-based experiment tracking within its Lakehouse platform. For organisations already on Databricks, this eliminates the need for a separate fine-tuning tool.

IBM watsonx provides end-to-end lifecycle management including fine-tuning with enterprise governance — positioned for regulated industries with strong audit and compliance requirements.

Weights & Biases serves as the system of record for training experiments, tracking runs, logging metrics, and versioning model checkpoints. Not a fine-tuning platform per se, but increasingly integral to the workflow.


4. Key Techniques and How They’re Evolving


The fine-tuning technique landscape is shifting rapidly. Understanding which methods are ascendant matters for assessing startup positioning.

LoRA/QLoRA (Low-Rank Adaptation / Quantized LoRA) has become the baseline. These parameter-efficient methods add small trainable matrices to existing model layers rather than modifying all weights, cutting compute and memory requirements by an order of magnitude. LoRA is now a commodity — every platform supports it. QLoRA adds quantisation to reduce memory further. This is table stakes, not a differentiator.
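The mechanics behind LoRA's efficiency can be shown in a minimal sketch (pure Python with tiny dimensions, not a real training loop): instead of updating the full d×d weight matrix W, LoRA trains two low-rank factors B (d×r) and A (r×d) and adds their scaled product to the frozen W.

```python
# Minimal LoRA update sketch: W_adapted = W + (alpha / r) * B @ A.
# Dimensions are tiny for illustration; in a real model d is in the
# thousands and the savings grow accordingly.

d, r, alpha = 6, 2, 4   # hidden size, LoRA rank, scaling factor

W = [[0.0] * d for _ in range(d)]   # frozen pretrained weights (d x d)
B = [[0.1] * r for _ in range(d)]   # trainable low-rank factor (d x r)
A = [[0.1] * d for _ in range(r)]   # trainable low-rank factor (r x d)

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

delta = matmul(B, A)                # low-rank update, d x d
W_adapted = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
             for i in range(d)]

full_params = d * d                 # what full fine-tuning would train
lora_params = d * r + r * d         # what LoRA actually trains
print(full_params, lora_params)
```

At these toy sizes the saving is modest (36 vs 24 parameters), but at realistic scale — say d = 4096 and r = 8 — it is roughly 16.8M frozen parameters against about 65K trainable ones per layer, which is where the order-of-magnitude compute and memory reduction comes from.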

Supervised Fine-Tuning (SFT) on labelled input-output pairs remains the workhorse method. Quality of training data matters far more than quantity — research has shown that 1,000 carefully curated examples can outperform 10,000 mediocre ones. The DeepSeek R1 model famously achieved strong performance with 100x less data by using expert-crafted chain-of-thought examples.
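The "quality over quantity" point implies a curation pass before SFT. A hedged sketch of what the simplest version looks like — deduplicating exact repeats and dropping degenerate completions; the thresholds and field names here are illustrative, not a recommendation from the report:

```python
# Tiny SFT data-curation pass: drop exact duplicates and trivially
# short completions before training. Field names and the length
# threshold are illustrative.

def curate(examples: list[dict]) -> list[dict]:
    """Keep unique, non-trivial (prompt, completion) pairs."""
    seen = set()
    kept = []
    for ex in examples:
        key = (ex["prompt"].strip().lower(), ex["completion"].strip().lower())
        if key in seen:
            continue                                # exact duplicate
        if len(ex["completion"].split()) < 3:
            continue                                # degenerate completion
        seen.add(key)
        kept.append(ex)
    return kept

raw = [
    {"prompt": "Summarise the filing", "completion": "Revenue rose 4% on strong ad sales."},
    {"prompt": "Summarise the filing", "completion": "Revenue rose 4% on strong ad sales."},
    {"prompt": "Classify the ticket", "completion": "ok"},
]
print(len(curate(raw)))
```

Real curation pipelines go much further (near-duplicate detection, model-based quality scoring, diversity sampling), which is precisely the layer where data-centric vendors like Snorkel differentiate.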

RLHF and DPO (Reinforcement Learning from Human Feedback / Direct Preference Optimisation) align model outputs with human preferences. DPO has gained ground because it eliminates the need for a separate reward model, simplifying the pipeline. These are essential for chatbot and customer-facing applications but require human preference data, which is expensive to collect.

Reinforcement Fine-Tuning (RFT) / RLVR is the most significant new development. Inspired by DeepSeek R1’s use of GRPO (Group Relative Policy Optimisation), this approach trains models using reward functions rather than labelled examples. It works particularly well for tasks with verifiable outcomes (code generation, maths, structured output) and can deliver meaningful gains with as few as 10 training examples. Predibase was the first to offer this as a managed service. OpenAI has previewed it for their models. This is currently the most differentiated technique and is expected to expand into more domains (chemistry, biology, etc.) through 2026.
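The core idea — supervision via a programmatic check rather than a labelled completion — can be sketched with a reward function for a verifiable task. The GRPO training loop itself is out of scope here; the function below is an illustrative example, not any vendor's API:

```python
# Sketch of an RFT/RLVR reward function for a verifiable task:
# score a model completion against a known numeric answer. The
# grading scheme (1.0 / 0.1 / 0.0) is illustrative.
import re

def reward(completion: str, expected: float) -> float:
    """1.0 if the final number in the completion matches the
    verifiable answer; partial credit for any parseable number."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0                     # no parseable answer at all
    if abs(float(numbers[-1]) - expected) < 1e-6:
        return 1.0                     # verifiably correct
    return 0.1                         # well-formed but wrong

print(reward("The total is 42", 42.0))
print(reward("Cannot compute", 42.0))
```

Because the reward function is just code, it can encode proprietary business logic (does the generated SQL run? does the output pass the compliance linter?), which is why RFT works with so few examples and is harder to commoditise than example-based SFT.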

Model distillation is becoming strategically critical. The pattern of using a large frontier model to generate training data for a smaller, cheaper model is now standard practice. AT&T, for example, has publicly discussed distilling large models into economical small language models for agentic workflow automation. Tools that make distillation workflows easy and reproducible have clear value.
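The distillation pattern described above reduces to a simple pipeline: have the teacher label prompts, filter the outputs, and hand the pairs to an SFT trainer. A sketch under stated assumptions — the teacher here is a stub standing in for a frontier-model API call, and the function names are illustrative:

```python
# Skeleton of teacher->student distillation data generation. In
# practice `teacher` would wrap a frontier-model API call and the
# resulting pairs would feed a standard SFT trainer.
from typing import Callable

def build_distillation_set(
    teacher: Callable[[str], str],
    prompts: list[str],
    min_len: int = 3,
) -> list[tuple[str, str]]:
    """Generate and lightly filter teacher completions as SFT data."""
    dataset = []
    for p in prompts:
        completion = teacher(p)
        if len(completion.split()) >= min_len:    # drop degenerate outputs
            dataset.append((p, completion))
    return dataset

# Stub teacher standing in for a frontier-model call.
stub_teacher = lambda p: f"Detailed answer to: {p}"

data = build_distillation_set(stub_teacher, ["What is churn?", "Define LoRA."])
print(len(data))
```

The hard parts in production — deduplication, quality scoring of teacher outputs, evaluation of the student against the teacher — are exactly the workflow glue that distillation tooling sells.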

Continuous learning and model merging are emerging areas. Techniques that allow models to learn from production feedback without full retraining cycles, and methods to combine the strengths of multiple fine-tuned models, are active research areas but not yet widely productionised.


5. Industry Verticals Where Fine-Tuning Adds Most Value


Fine-tuning delivers the most value in industries where domain-specific language, regulatory compliance, and proprietary data create a meaningful gap between what general-purpose models can do and what the business needs:

Financial services is a prime market. Models fine-tuned on financial terminology, regulatory filings, risk frameworks, and institution-specific policies consistently outperform general models on tasks like document analysis, compliance checking, and financial summarisation. The Bloomberg story is instructive here: they spent approximately $10M training BloombergGPT, only to have GPT-4 outperform it on most tasks shortly after. This argues against full pre-training but in favour of lightweight fine-tuning for output format, compliance language, and house style.

Healthcare and life sciences benefits from fine-tuning on clinical notes, medical terminology, imaging data, and drug interaction databases. HIPAA compliance requirements also make on-premises or private cloud fine-tuning attractive. Models fine-tuned on medical data show significantly improved accuracy on diagnostic support, patient report generation, and clinical trial analysis.

Legal is a strong use case for fine-tuning around contract analysis, case law interpretation, regulatory compliance, and legal drafting. The highly specialised vocabulary and reasoning patterns of legal work favour domain-adapted models.

Telecommunications has emerged as an active sector, with companies like AT&T and Singtel publicly discussing their use of fine-tuned and distilled models for network operations, customer service automation, and agentic workflow automation.

Manufacturing and supply chain benefits from fine-tuning models that combine maintenance logs, sensor data, and operational procedures for predictive analytics and quality control.

Non-English language markets represent a particularly durable use case. Most frontier model companies under-invest in non-English language performance. Fine-tuning on domain-specific non-English data can deliver dramatic improvements that frontier models may not match for years.

The common thread is that these verticals have proprietary data, specialised vocabulary, regulatory requirements, or language needs that general models handle poorly. Where a domain is well-represented in public training data (general customer support, code generation for popular languages, content marketing), the case for fine-tuning is weaker and more vulnerable to erosion.


This is the most critical area for assessing startup risk: several categories of fine-tuning use cases are being eroded by frontier model improvements, while others are becoming more durable.

Naive RAG pipelines. The simple pattern of “chunk documents → embed → retrieve → stuff into context” is under genuine pressure from long-context models. Gemini processes 2 million tokens; GPT-4.1 handles 1 million. For static document sets with repetitive queries, many practitioners now start with long-context prompting before building RAG infrastructure. A Salesforce research study found that long-context models could match or exceed basic RAG on straightforward retrieval tasks. One practitioner’s advice from late 2025 was blunt: your baseline should be to put everything in the context window, and only build RAG infrastructure when that stops working.

However — and this is important — naive RAG is not the same as all RAG. Complex RAG systems with query rewriting, multi-step retrieval, agentic retrieval, graph-based knowledge representations, and reranking are becoming more sophisticated, not less relevant. RAG also retains hard advantages in cost (8-82x cheaper than long-context approaches for typical workloads), latency, data freshness, access control, and auditability. The consensus emerging from 2025 is that “RAG didn’t die — it matured.” Enterprises committed to building core AI capabilities deepened their RAG investments through the year. The tooling that’s at risk is basic vector-search-and-stuff-context; the pattern of intelligent, conditional retrieval within agentic systems is expanding.

Basic domain knowledge fine-tuning. As frontier models absorb more training data and become better at in-context learning, fine-tuning purely to inject domain knowledge is getting harder to justify. If your domain is well-represented on the public internet, the next frontier model release may simply know what you spent months teaching your fine-tuned model. This is the “treadmill” problem: fine-tuned models can be made obsolete by new base model releases.

Simple instruction following and output formatting. Modern frontier models are remarkably good at following detailed instructions. Many formatting and style requirements that previously required fine-tuning can now be handled via prompting.

Basic text classification and NER (named entity recognition). While fine-tuned models still outperform prompted models on traditional NLP tasks, the gap is narrowing. For simple classification (sentiment, topic, intent), prompt engineering with a capable frontier model is often sufficient.

Turning to the durable use cases: Output consistency at scale. Fine-tuning produces reliable, deterministic-feeling behaviour that prompting cannot fully match. When an enterprise needs thousands of outputs per day in an exact format, fine-tuning a smaller model to produce consistent structure eliminates the variability of prompt-based approaches.

Cost and latency optimisation. This is arguably the most durable driver. A fine-tuned 7B or 14B parameter model that matches a frontier model’s performance on a specific task costs a fraction to run — and responds much faster. As AI moves from experimentation to production, margin pressure will drive enterprises toward smaller, cheaper, fine-tuned models. Cursor, for instance, uses many small fine-tuned models for different parts of its code editing workflow.
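The margin argument is straightforward back-of-envelope arithmetic. The prices below are hypothetical placeholders, not quotes from any provider — the point is the ratio, which holds across a wide range of realistic pricing:

```python
# Back-of-envelope cost comparison at production volume. Prices are
# hypothetical placeholders, not vendor quotes.

FRONTIER_PRICE = 10.00   # $ per million tokens, hypothetical
SMALL_PRICE = 0.30       # $ per million tokens, hypothetical

def monthly_cost(price_per_m: float, tokens_per_day: int, days: int = 30) -> float:
    """Monthly serving cost in dollars for a given daily token volume."""
    return price_per_m * tokens_per_day * days / 1_000_000

daily_tokens = 50_000_000   # 50M tokens/day of production traffic
frontier = monthly_cost(FRONTIER_PRICE, daily_tokens)
small = monthly_cost(SMALL_PRICE, daily_tokens)
print(f"${frontier:,.0f}/mo vs ${small:,.0f}/mo")
```

At a roughly 30x price gap per token, a workload costing $15,000/month on a frontier model runs for about $450/month on a fine-tuned small model — the kind of delta that only matters once pilots become sustained production traffic, which is why this driver strengthens as the market matures.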

Proprietary data and competitive differentiation. When your competitive advantage comes from data that isn’t on the public internet — proprietary processes, customer interaction logs, internal knowledge bases — fine-tuning on that data creates genuine moats. A general model cannot replicate this.

Regulatory compliance and safety alignment. Industries with strict regulatory requirements need models that reliably refuse certain outputs, follow specific disclosure rules, or adhere to compliance frameworks. Fine-tuning for these constraints is more reliable than prompting.

Non-English language performance. This remains a major gap in frontier models and a durable opportunity for fine-tuning.

Reinforcement fine-tuning for reasoning. The newest frontier — teaching models to reason about domain-specific problems using reward functions rather than examples — is inherently resistant to commoditisation because the reward functions encode proprietary business logic.

Distillation for deployment efficiency. The need to compress frontier-quality capabilities into deployable, cost-effective models will persist and likely grow as enterprises scale AI from pilots to production.

The macro picture is that fine-tuning is shifting from “necessary to make models work at all” toward “necessary to make models work efficiently, consistently, and with proprietary advantage.” This is a more nuanced value proposition — harder to sell but more defensible when executed well. The argument that fine-tuning is becoming obsolete is wrong, but the argument that what you fine-tune for is changing is correct.


Platform risk from cloud providers. AWS, Azure, and Google are all building fine-tuning, evaluation, and deployment into their managed platforms. For enterprises already committed to a cloud provider, the switching cost of adopting a separate fine-tuning tool is hard to justify unless it offers meaningfully superior results. This is the single largest structural risk.

Commoditisation of core techniques. LoRA, QLoRA, and basic SFT are now commodity capabilities available in open-source libraries and every cloud platform. A startup whose primary value is “we make LoRA fine-tuning easier” faces rapid margin compression.

The treadmill problem. Every new base model release can obsolete the fine-tuned models customers built on the previous generation. This creates churn risk — customers may question whether their fine-tuning investment was worthwhile when GPT-5 or Llama 4 arrives. Startups must either help customers re-fine-tune quickly and cheaply on new bases (making the treadmill manageable) or offer value that transcends any single base model.

Market fragmentation and consolidation risk. The Predibase acquisition by Rubrik (for $100M-$500M on ~$28M of funding) is a positive exit signal but also suggests that the standalone fine-tuning platform market may not sustain many independent companies. Larger platform companies may acquire the best technology rather than compete with it, potentially limiting the scale of standalone outcomes.

Safety and misalignment concerns. Research published in Nature in January 2026 demonstrated that fine-tuning can introduce broad misalignment — a model fine-tuned to misbehave in one domain showed errant behaviour in unrelated areas. This creates regulatory and liability risk for fine-tuning platforms, particularly in sensitive industries.

Reinforcement fine-tuning (RFT) as a new wave. RFT/RLVR is still early and technically demanding. Startups that can offer managed RFT with good developer experience have a window of differentiation before it commoditises. The ability to fine-tune with as few as 10 examples using reward functions is a compelling value proposition.

Distillation-as-a-service. Helping enterprises compress frontier model capabilities into small, cheap, deployable models is a growing need. This combines fine-tuning, evaluation, and inference optimisation into a workflow that’s hard to DIY.

Vertical specialisation. A horizontal fine-tuning platform competes with AWS. A platform purpose-built for, say, fine-tuning medical models with HIPAA compliance, clinical evaluation benchmarks, and healthcare data pipelines competes in a narrower space where cloud providers are weaker.

Inference cost optimisation. As enterprises move to production, the cost of serving models matters more. Platforms that combine fine-tuning with efficient serving (multi-LoRA deployment, speculative decoding, quantisation) capture more of the value chain.

The “model operations” layer. Continuous evaluation, monitoring for drift, automated re-training on new base models, A/B testing of fine-tuned vs. base models — this operational layer is underdeveloped and increasingly needed.

Open-source as distribution strategy. Together AI, Predibase (with LoRAX), and others have used open-source tools to build community and funnel users to paid managed services. This remains an effective go-to-market approach.

The Predibase/Rubrik acquisition is the most notable exit in this space — validating that the technology has value while raising questions about standalone viability. Predibase’s pivot from general-purpose fine-tuning toward agent governance (their current positioning on predibase.com: “monitor, govern, and rewind every agent action”) suggests even they saw the fine-tuning platform narrowing as a standalone play.

The broader trend of frontier model convergence — where open models like DeepSeek V3 and Qwen3 achieve within 3-5% of proprietary systems — favours the fine-tuning ecosystem by giving enterprises high-quality base models they can actually customise and own. One analyst’s thesis is that 2026 is the year fine-tuned small models become mainstream, as companies seek differentiation and margin in a world where switching frontier models no longer provides competitive advantage.


8. Conclusions and Key Due Diligence Questions


The enterprise LLM customisation market is real, growing, and increasingly important. However, it’s also a market where the ground is shifting fast — techniques commoditise, cloud providers expand, and frontier model improvements can erode use cases overnight.

For your due diligence, I’d focus on these questions:

  1. Where in the stack does the startup sit, and is that layer defensible? Raw fine-tuning infrastructure is commoditising. Data preparation, evaluation, RFT, distillation, and model operations are more defensible.

  2. Does the startup compete with cloud providers or complement them? Startups that integrate with AWS/Azure/GCP (like Snorkel does) are better positioned than those that try to replace them.

  3. What’s the re-platforming story? When a new base model drops, how quickly and cheaply can customers re-fine-tune? Startups that make the treadmill manageable turn a risk into a retention mechanism.

  4. Is the value proposition durable against frontier model improvements? If the startup’s core pitch is “make models know your domain better,” that’s vulnerable. If it’s “make models cheaper, faster, more consistent, and compliant for your specific deployment,” that’s more durable.

  5. What’s the go-to-market? Horizontal platforms face brutal competition. Vertical-specific or workflow-specific plays can build defensible positions.

  6. What does the exit landscape look like? Predibase’s acquisition by Rubrik suggests strategic acquirers may be the primary exit path. Is the startup building technology that a larger platform company would want to own?

  7. How does the team think about the RAG/fine-tuning convergence? The best companies in this space understand that the future is hybrid systems — fine-tuning for behaviour, RAG for knowledge, agents for orchestration — not any single technique in isolation.

The bottom line: there is a real and growing market here, but the winning position is likely in the operational layer (making fine-tuning and model customisation reliable, fast, and repeatable at enterprise scale) rather than in the raw technique layer (which is commoditising). Startups that combine technical depth with strong vertical positioning and cloud provider integration have the best risk/reward profile.