We’ve looked at how much RLHF (reinforcement learning from human feedback) can destabilize and even stress models out. New research goes further. A report from the Center for Democracy & Technology and MIT CSAIL should be required reading for every enterprise AI team relying on fine-tuned foundation models. The report, Out of Tune: Fine-Tuning Foundation Models Leads to Unpredictable Safety Drift, examines what happens when general-purpose AI models are adapted for specialized use in high-stakes fields such as medicine and law. The conclusion is direct: safety behaviour does not reliably survive fine-tuning.
That finding matters because fine-tuning has become one of the default paths to enterprise AI adoption. A company starts with a foundation model, adds domain-specific data, tunes the model for internal workflows, and assumes it has created a more useful version of the same system. In theory, the organization keeps the underlying model’s safety profile while improving relevance, accuracy, tone, or domain performance. The CDT/MIT research challenges that assumption. Fine-tuning can make a model better on one domain-specific evaluation while making it worse on broader safety measures. It can strengthen some safety behaviours, weaken others, and produce changes that are difficult to predict in advance.
The authors call this phenomenon “safety drift.” In practical terms, safety drift means that a model’s refusal behaviour, harmful-content boundaries, legal caution, medical caution, or response discipline may shift after further training. A base model that refuses dangerous instructions, avoids unsupported medical advice, or handles sensitive legal prompts carefully may respond differently once it has been fine-tuned. The drift does not require malicious data or adversarial intent. The report focuses on benign, ordinary fine-tuning in real-world development pipelines, which makes the finding more important for business users.
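A rough way to make that shift measurable is to run the same fixed set of sensitive prompts through the base model and the tuned variant and compare refusal rates. The sketch below is illustrative, not drawn from the report: the `ask` callable and the string-matching refusal markers are assumptions you would replace with your own inference client and a proper refusal classifier.

```python
from typing import Callable, List

# Hypothetical refusal markers; in practice a trained classifier or
# human review is far more reliable than string matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able to")

def refusal_rate(ask: Callable[[str], str], prompts: List[str]) -> float:
    """Fraction of prompts the model declines. `ask` wraps your inference API."""
    refusals = sum(
        1 for p in prompts
        if any(marker in ask(p).lower() for marker in REFUSAL_MARKERS)
    )
    return refusals / len(prompts)

# Compare base vs. tuned on the same fixed prompt set:
#   drift = refusal_rate(ask_tuned, prompts) - refusal_rate(ask_base, prompts)
# A negative drift on prompts that *should* be refused is a safety regression.
```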
For enterprise leaders, the most important implication is that a fine-tuned model is not simply a customized copy of a foundation model. It is a new risk object. The safety evaluation of the base model is no longer enough. A vendor’s documentation, model card, benchmark score, or assurance process may describe the parent model accurately while saying little about the behaviour of the tuned derivative deployed inside a business process.
This creates a direct governance problem. Many AI risk frameworks still lean heavily on the distinction between foundation model providers and downstream deployers. The upstream provider builds and tests the base model. The enterprise customer or application developer adapts it to a specific use case. Responsibility is then often treated as a layered supply-chain issue: the model provider is responsible for the foundation model, while the deployer is responsible for the application context. Safety drift makes that division less comfortable. The act of downstream adaptation can materially change the system’s behaviour.
The research also complicates the idea that risk can be governed through simple modification thresholds. In software, a minor configuration change is usually understood as less significant than a major architectural change. With foundation models, that intuition may be misleading. Parameter-efficient methods such as LoRA (low-rank adaptation) and its quantized variant, QLoRA, are attractive because they allow organizations to adapt models without retraining the whole system. The CDT/MIT study found that the choice between full fine-tuning, LoRA, and QLoRA did not provide a reliable guarantee against safety drift. Common engineering choices had limited predictive power over the direction or magnitude of safety changes.
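For readers unfamiliar with what a parameter-efficient tune looks like in practice, here is a minimal sketch using the Hugging Face transformers and peft libraries; the model name and adapter hyperparameters are illustrative, not a recommendation. Note how little ceremony is involved, and how nothing in the configuration itself constrains which behaviours will move during training.

```python
# Minimal LoRA setup sketch (Hugging Face transformers + peft).
# Model name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of weights

# Touching <1% of parameters does not mean <1% change in safety behaviour:
# the method choice offered no reliable guarantee against drift in the study.
```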
This matters for procurement. Enterprises increasingly ask vendors whether their models have been evaluated for safety, bias, hallucination, data leakage, and harmful outputs. Those questions remain necessary, but they are incomplete. The next procurement question should be: what happens to those safety properties after fine-tuning? Vendors offering fine-tuning-as-a-service, private model customization, or domain-specific model variants should be expected to provide evidence about safety resilience under common tuning scenarios.
The report also points to a second problem: benchmark instability. A model may improve on one safety benchmark while degrading on another. In medicine, a model may perform better on a medical safety evaluation while becoming worse on a general-purpose safety benchmark. In law, different evaluative tools may disagree about whether the model became safer or less safe. This means enterprises cannot rely on a single post-tuning benchmark and assume the matter is settled. Safety testing has to reflect the actual deployment context, the foreseeable misuse cases, and the surrounding business consequences.
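To make the instability concrete, consider comparing pre- and post-tuning scores across several safety evaluations at once rather than trusting any single number. The benchmark names and scores below are invented for illustration; the point is the shape of the comparison, not the values.

```python
# Illustrative only: benchmark names and scores are invented.
# The point: a single improved number can mask a regression elsewhere.
pre_tuning  = {"medical_safety": 0.81, "general_refusals": 0.93, "legal_caution": 0.88}
post_tuning = {"medical_safety": 0.89, "general_refusals": 0.74, "legal_caution": 0.90}

for benchmark, before in pre_tuning.items():
    after = post_tuning[benchmark]
    delta = after - before
    flag = "REGRESSION" if delta < 0 else "ok"
    print(f"{benchmark:18s} {before:.2f} -> {after:.2f}  ({delta:+.2f})  {flag}")

# medical_safety improved, but general_refusals dropped sharply:
# "passed the domain benchmark" is not the same as "still safe".
```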
The business risk is not abstract. A fine-tuned customer service model could become more compliant with user requests and less likely to refuse inappropriate disclosures. A legal assistant tuned on internal memos could become more fluent in legal style while becoming less cautious about jurisdiction, privilege, or unauthorized advice. A healthcare support model could become more confident in medical phrasing while weakening its boundaries around diagnosis, emergency guidance, or medication advice. In each case, the model may look more useful precisely because it has become more willing to answer.
This is why the CDT/MIT report should be read as a warning against treating customization as a purely performance-enhancing step. Fine-tuning is not just optimization. It is behavioural modification. The organization is not merely teaching the model new vocabulary or internal preferences. It may also be shifting the model’s underlying response tendencies in ways that standard application testing does not catch.
The governance response should be straightforward. Enterprises should require pre- and post-tuning evaluations. They should test tuned models against domain-specific risks and general safety risks. They should document training data, tuning method, evaluation results, known regressions, and deployment boundaries. They should retest after each meaningful update, including data refreshes and prompt architecture changes. They should avoid treating the foundation-model vendor’s safety claims as inherited guarantees.
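One way to operationalize that checklist is to treat each tuned variant as a release artifact with its own record and a regression gate. A hedged sketch follows, assuming a simple in-house process; the field names and tolerance value are placeholders, not a standard.

```python
# Sketch of a per-variant release record and regression gate.
# Field names and the tolerance are placeholders for an in-house process.
from dataclasses import dataclass, field

@dataclass
class TunedModelRecord:
    base_model: str
    tuning_method: str                       # e.g. "full", "LoRA", "QLoRA"
    training_data_ref: str                   # pointer to a versioned dataset
    pre_eval: dict                           # benchmark -> score, before tuning
    post_eval: dict                          # benchmark -> score, after tuning
    known_regressions: list = field(default_factory=list)
    deployment_boundaries: str = ""          # where this variant may be used

def release_gate(record: TunedModelRecord, tolerance: float = 0.02) -> bool:
    """Block release if any safety metric regressed beyond tolerance."""
    for benchmark, before in record.pre_eval.items():
        after = record.post_eval.get(benchmark, 0.0)
        if before - after > tolerance:
            record.known_regressions.append(benchmark)
    return not record.known_regressions

# Re-run the gate after every meaningful update: a data refresh,
# a new adapter, or even a prompt-architecture change.
```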
For boards and executives, the larger message is that AI risk now lives across the lifecycle. The risk is not only in model selection. It is in adaptation, integration, retrieval, workflow design, evaluation, monitoring, and update management. Fine-tuning may still be the right choice for many enterprise systems, especially where specialized terminology, internal process knowledge, or domain-specific performance matter. The CDT/MIT research does not argue against fine-tuning. It argues against assuming that safety automatically transfers.
That distinction is crucial. Enterprises do not need to abandon fine-tuning. They need to govern it as a material change to model behaviour. In AI systems, customization is not cosmetic. It can alter safety, reliability, and accountability. This also has implications for sovereignty. Countries that do not own their AI models must now anticipate safety drift that they cannot control.
The safest enterprise assumption is now this: once a foundation model is fine-tuned, it should be evaluated as a distinct system.

