By Jen Evans, founder of Pattern Pulse AI and B2BNN, co-founder of Tech Reset Canada
When people talk about “training” an AI model, the word is doing a lot of work. It can mean several different things at five different moments in the model’s life, with different cost profiles, different governance and privacy implications, and different levels of permanence. The recent CDT/MIT research on safety drift makes the cost of that imprecision concrete. A model can be customized in ways that look like simple optimization and turn out to be behavioural modification. Understanding which layer of training is doing what, and how much of it actually holds, is an executive-level question, but also a consumer awareness issue.
This is a guide to the layers, in the order they happen in a model’s life, and a working sense of which ones are stable.
Pre-training is the foundation. A frontier lab takes trillions of tokens of text from the public internet, books, code repositories, and licensed corpora, and runs them through a model with hundreds of billions of parameters. The model learns statistical patterns in language. It learns that certain words follow certain other words, that certain reasoning patterns produce certain outcomes, that certain forms of writing match certain contexts. Pre-training costs tens to hundreds of millions of dollars. It happens once. Everything that follows is layered on top of what pre-training produced.
Pre-training is the most stable layer of all. The representations it builds are deep, distributed, and entangled across the network. They are also the layer that recent research on subliminal learning has shown to carry traits invisibly through any downstream training. Whatever the foundation model learned at pre-training time travels with it, and travels with anything distilled from it.
Post-training, sometimes called alignment training, is what turns a pre-trained model into something usable. A pre-trained model is a pattern-completion engine. It will continue any text in the most statistically likely way, including in ways that are unhelpful, dangerous, or rude. Post-training shapes the model into something that responds to questions, follows instructions, refuses harmful requests, and behaves in a recognizable way across conversations. This is where most of what users experience as the model’s personality and safety behaviour lives.
Post-training has several subtypes. Supervised fine-tuning trains the model on curated examples of good behaviour. Reinforcement learning from human feedback, or RLHF, has human reviewers rank pairs of model responses, and the model learns to produce responses that score well on the resulting preference model. Reinforcement learning from AI feedback, or RLAIF, replaces the human reviewers with another AI model evaluating against a written set of principles. Anthropic’s Constitutional AI is a version of this. Direct preference optimization, or DPO, is a newer method that achieves similar goals with less computational overhead.
Post-training is moderately stable. The behaviours it produces are real and consistent under normal use. They are also the layer most exposed to drift under further training. RLHF in particular has been shown to produce behaviours that can be partially undone by subsequent fine-tuning, and there is growing evidence that aggressive human feedback regimes can degrade some underlying capabilities while improving others. My earlier work on this is here and a fuller treatment is on Zenodo.
Enterprise fine-tuning is what happens when an organization takes a post-trained model and adapts it to internal data, vocabulary, or workflows. This is where the CDT/MIT finding lands. Three methods dominate. Full fine-tuning retrains all the model’s parameters on new data. LoRA and QLoRA are parameter-efficient methods that train small adapter layers while leaving the original weights frozen, which is cheaper and faster. The CDT/MIT research found that none of these methods reliably predicts whether safety behaviour will hold. A model can become more useful in a target domain and less safe in unrelated areas at the same time. The drift can move in either direction. Common engineering choices have limited predictive power over the result.
This is the layer where enterprises most often confuse customization with optimization. Fine-tuning is not a tuning knob. It is a behavioural modification, and the modification reaches further than the use case it was applied to.
System prompts are the next layer down, and the first that does not touch the model’s weights at all. A system prompt is a block of text inserted before every user input. It can give the model a persona, a set of constraints, a tone of voice, a list of permitted and forbidden behaviours, and a set of organizational facts. System prompts are how most enterprise customization actually happens, even when teams describe what they are doing as “training.” They are extremely flexible. They can be updated without retraining. They are also extremely thin. A determined user can often work around them, and they vanish the moment the session ends. They do not change what the model is. They change what the model is told to do.
Retrieval-augmented generation, or RAG, is the runtime layer most enterprises have actually deployed. A RAG system stores documents in a searchable database, retrieves the relevant ones at the moment a question is asked, and injects them into the prompt before the model responds. RAG lets a model answer questions about content it was never trained on, including current information and proprietary knowledge. RAG does not change the model. It changes the context the model is operating in. The model’s biases, refusal patterns, alignment properties, and base capabilities all remain. If the foundation model is unreliable in some domain, RAG will inherit the unreliability while sounding more authoritative because it is now citing internal sources.
In-context learning and few-shot examples sit in the prompt itself. They are the lightest layer. Showing the model two or three examples of how to handle a task can dramatically shift its output for that conversation. Nothing about the model changes. The next session begins from the same starting point as the last one.
User feedback loops are the layer most enterprise systems have but rarely use rigorously. Thumbs-up and thumbs-down signals, edits, regeneration choices, and explicit ratings can feed into future training cycles, either through periodic RLHF refreshes or through ongoing fine-tuning. The governance question is whether the feedback is being used, by whom, with what oversight, and against what evaluation.
Pre-training defines the substrate and rarely changes. Post-training alignment shapes behaviour and can be partially undone. Enterprise fine-tuning creates a new system whose safety properties are not predictable from the parent. System prompts are flexible. RAG and in-context examples adjust the runtime environment without changing the model at all.
Chat Data
A common point of confusion across all of this is whether the model in front of you is learning from this conversation in real time. Every commercially deployed frontier model has fixed weights between training cycles, and continual learning, where a model updates its parameters from live use, is still a research topic rather than a shipped feature. Whether your conversations are used in future training cycles is a different question, and the answer changed materially in late 2025. As of September 2025, all six major US developers, including Amazon, Anthropic, Google, Meta, Microsoft, and OpenAI, now train on consumer chat data by default, with opt-out options that vary by provider. Anthropic’s switch from opt-in to opt-out in August 2025 was the last consumer holdout.
Google’s position on training is not one position. It is four, sorted by product surface: enterprise contracts say no, consumer chat says yes-by-default, AI Mode in Search says yes on prompts and responses, AI Overviews says yes on publisher content even when opted out. The same company, the same week, the same brand, four different answers. The reader experiencing ‘contradiction’ is reading the architecture correctly.
The privacy implications of this architecture are larger than the consent UI suggests. Search queries are often more revealing than chat, capturing health worries, legal questions, financial stress, and relationship signals that users would never volunteer in a structured dialogue. Most people still approach search with a mental model of ephemeral queries and approach chat with a mental model of saved conversation; AI Mode collapses that distinction in the back end while leaving it intact in the user’s head.
Those queries are now training data by default, retained for years, with consent obtained through opt-out toggles in settings most users have never opened. The U.S. v. Heppner ruling in February 2026 confirmed that AI conversations carry no attorney-client privilege and do not constitute work product, which removes the legal-confidentiality fallback some enterprise users were quietly relying on. The hardest enterprise exposure is the employee running work-adjacent queries through a personal Google account on a personal device, feeding sensitive context into training pipelines that no procurement framework or DLP system was designed to see.
Enterprise
Enterprise and API customers are typically excluded by default, which is why business AI risk frameworks should not assume that consumer data policies apply to commercial deployments. Social media is its own pipeline. xAI trains Grok on X content, Reddit has licensed its data to Google and OpenAI under multi-year agreements, and Meta uses its own platforms to train Llama. None of this is real-time either. A post made online today may surface in a pre-training corpus next year, or never. The lag between a public statement and a model that can discuss it runs through a batched training cycle, typically months long, not a live feed.
On the enterprise side, the picture is genuinely different: Google’s Workspace and Cloud documentation states explicitly that customer prompts and outputs are not used to train Gemini models, and human reviewers do not see customer data without organizational consent. OpenAI’s API and Team contracts carry the same protection. Anthropic excludes Claude for Work, Education, Government, and API.
Search
When using AI Mode in Google Search, your prompts and the model’s responses ARE used for training. Google’s own documentation states it trains on “specific prompts in AI Mode,” “the model’s responses,” and “summaries, excerpts, and inferences used to help answer your prompts.” This sits under the Web & App Activity setting in your Google account, which is on by default.
The thing AI Mode does NOT train on directly is the underlying Gmail or Google Photos content when you opt into Personal Intelligence. It accesses that data to personalize answers, then trains on the derived summaries and inferences rather than the raw archive. That’s the distinction Google is making in its messaging, and it’s a real one, but it’s narrower than “AI Search doesn’t train on your conversations.”
AI Overviews adds another wrinkle. In May 2025 antitrust testimony, Google’s DeepMind VP Eli Collins acknowledged that Google’s Search division can use website content to train AI Overviews even when publishers have used the DeepMind opt-out controls. Different team, different rules, same content. The publisher community is still litigating around this.
Workspace, Cloud, Education, and enterprise accounts are excluded from AI Mode’s personalization experiments entirely, which holds the consumer-versus-enterprise pattern from the chatbot side.
The practical implication for enterprise leaders is that the durability of any AI behaviour you care about depends on which layer it lives in, and the layer you applied may not be the layer your behaviour came from. A safety property baked in at post-training can be eroded at fine-tuning. A factual constraint enforced through RAG vanishes the moment retrieval fails. A persona established in a system prompt evaporates under a sufficiently determined user.
The CDT/MIT finding is a particular case of a more general truth. AI systems are layered, and the layers do not stack neatly. Each one has its own permanence and its own failure mode. Procurement, evaluation, and risk management all need to operate at every layer the enterprise actually uses, not only at the layer the vendor talks about.

