Deterministic compliance frameworks assume systems behave consistently. AI systems do not. The math proves it.
“We’re building a certification framework around our AI deployment. ISO 42001. The board wants confidence that what we’re deploying is auditable, repeatable, and governed.”
It’s not an uncommon sentiment. That was a senior AI executive at a Fortune 50 telco, describing a compliance initiative that will cost millions of dollars, take over a year to complete, and certify something that cannot, by the mathematics of the systems involved, be certified. He did not know this. Most enterprises do not.
The Category Error at the Centre of Enterprise AI Governance
Every major AI governance framework currently in use or under development, from ISO 42001 and ISO 23894 to NIST AI RMF and the EU AI Act’s conformity assessments, shares a foundational assumption: that the system being governed behaves consistently enough to be described, bounded, and verified. This is the assumption on which all certification depends. You cannot certify a system’s behaviour if you cannot predict its behaviour.
For traditional software, this assumption holds. A database query returns the same result for the same input. A compiled binary executes the same instructions every time. Certification frameworks were designed for this world, a world of deterministic systems where the relationship between input and output is stable, reproducible, and auditable.
Large language models are not that world. They are probabilistic systems. The same prompt, submitted to the same model, at different times, will produce different outputs. This is not a bug. It is the fundamental operating principle of the technology. The stochastic nature of token generation (the mechanism by which these models produce language) means that output variability is not an edge case to be managed. It is the system working as designed.
This creates a category error that the entire enterprise AI governance industry has not yet reckoned with. You are applying deterministic verification to probabilistic systems. The frameworks are not wrong in what they measure. They are wrong in what they assume they are measuring.
The Mathematics of Unpredictable Degradation
The problem is worse than simple output variability. Research in AI Conversational Phenomenology (the systematic study of how AI systems behave under sustained real-world use) has demonstrated that large language models do not merely vary. They degrade. And they degrade in ways that are predictable in aggregate but unpredictable in specifics.
A failure threshold we observed and documented, L ≈ 1969.8 × M^0.74, predicts when AI systems experience what is known as coherence collapse, the point at which a model’s outputs begin to lose structural and logical consistency during extended interactions. The law (Evans’ Law: the longer a session continues, the higher the likelihood the model produces an incorrect answer, until the likelihood of an incorrect answer exceeds the likelihood of an accurate one) establishes a power-law relationship between model capability and the conversational length at which degradation becomes operationally significant.
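To make the power law concrete, here is a minimal worked sketch, assuming M is the model’s MMLU score (the capability measure used in the reliability-surface work below) and L is the context load, in tokens, at which degradation becomes operationally significant:

```python
# Toy illustration of the Evans' Law power-law threshold stated above:
# L ≈ 1969.8 × M^0.74, where M is treated here as the model's MMLU score
# and L is the context load (in tokens) at which coherence collapse begins.

def evans_law_threshold(mmlu_score: float) -> float:
    """Approximate context load at which degradation becomes operationally significant."""
    return 1969.8 * mmlu_score ** 0.74

for m in (60, 70, 86):
    print(f"MMLU {m}: onset of significant degradation near {evans_law_threshold(m):,.0f} tokens")

# A frontier model scoring 86 on MMLU lands at roughly 53,000 tokens,
# matching the threshold discussed in the reliability-surface analysis below.
```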

But the original formulation understates the complexity. Recent work has extended Evans’ Law into a full reliability surface — R(L, M, T) — which maps system reliability across three dimensions simultaneously: context load (L), model capability (M), and a new variable, task-type rigidity (T), which captures how much structural precision a given task demands. The resulting surface, visualised for a frontier model scoring 86 on MMLU, reveals two distinct failure regimes. In Regime 1, context load dominates: reliability drops as conversations extend, with the Evans’ Law threshold at approximately 53,000 tokens marking the onset of significant degradation. In Regime 2, task-type rigidity takes over: even at manageable context lengths, tasks requiring high structural precision (legal drafting, medical protocols, financial compliance) push reliability toward zero. The implication for certification is stark. A framework that tests a system at low context loads on flexible tasks will observe high reliability and certify accordingly. That same system, deployed into the long, structurally rigid workflows that enterprise adoption inevitably produces, will occupy an entirely different region of the surface — one where the certification was never conducted and the reliability it promised does not exist.
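The shape of that surface can be sketched with a toy model. The functional form below is illustrative only, not the published formulation: it simply combines a context-load penalty that sharpens past the Evans’ Law threshold with a rigidity penalty, to show how the two regimes described above emerge.

```python
# Illustrative-only sketch of a reliability surface R(L, M, T).
# The functional form is an assumption made for visualisation: reliability
# decays as context load L approaches the Evans' Law threshold for a model
# of capability M, and decays again as task-type rigidity T
# (0 = fully flexible, 1 = structurally rigid) rises.

import math

def evans_law_threshold(mmlu_score: float) -> float:
    return 1969.8 * mmlu_score ** 0.74   # tokens, from the formula above

def reliability(context_tokens: float, mmlu_score: float, rigidity: float) -> float:
    """Toy R(L, M, T) in [0, 1]; higher is more reliable."""
    threshold = evans_law_threshold(mmlu_score)
    context_penalty = 1 / (1 + math.exp(4 * (context_tokens / threshold - 1)))  # Regime 1
    rigidity_penalty = (1 - rigidity) ** 2                                      # Regime 2
    return context_penalty * (0.3 + 0.7 * rigidity_penalty)

# Regime 1: a flexible task degrades as the conversation grows.
for tokens in (5_000, 30_000, 53_000, 80_000):
    print(f"L={tokens:>6,}  T=0.1  R={reliability(tokens, 86, 0.1):.2f}")

# Regime 2: even a short conversation struggles as rigidity rises.
for rigidity in (0.1, 0.5, 0.9):
    print(f"L=10,000  T={rigidity}  R={reliability(10_000, 86, rigidity):.2f}")
```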
What this means for certification is devastating: a system that passes every test on Tuesday may fail the same test on Thursday, not because it has been updated or modified, but because the stochastic processes that govern its outputs have produced a different path through the probability space. Worse, the system’s reliability is not static. It degrades over the course of a single interaction, meaning that the longer an enterprise workflow runs, the less reliable the system becomes. No certification framework currently accounts for this. None even attempts to.
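The arithmetic behind the Tuesday/Thursday problem is simple. Suppose, purely for illustration, a certification suite of 200 test prompts and a system that answers each prompt correctly 99.9 per cent of the time on any given run:

```python
# Toy arithmetic for the Tuesday/Thursday problem. The suite size and the
# per-item pass rate are assumptions chosen for illustration.

n_tests = 200     # prompts in the certification suite
p_item = 0.999    # probability each prompt is answered correctly on a given run

p_full_pass = p_item ** n_tests
print(f"P(pass every test on any given run)   ≈ {p_full_pass:.2f}")      # ≈ 0.82
print(f"P(at least one failure on a re-run)   ≈ {1 - p_full_pass:.2f}")  # ≈ 0.18

# Nothing about the system changes between runs. The same stochastic process
# simply takes a different path through the probability space.
```

Even a system that is right 99.9 per cent of the time per prompt has nearly a one-in-five chance of failing an identical 200-prompt suite on any given re-run.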
Vendor-Specific Drift: The Certification Multiplier Problem
If the degradation were at least consistent across providers, a sufficiently clever framework might attempt to model it. It is not. Research across multiple frontier AI systems has identified what are called drift signatures, vendor-specific patterns of behavioural degradation that differ not just in degree but in kind.
One model may maintain factual accuracy while losing coherent structure. Another may preserve structure while introducing subtle factual errors. A third may remain coherent on single-turn interactions but degrade catastrophically in multi-turn workflows. Multimodal systems (those processing text, images, and other data types simultaneously) degrade 60 to 80 per cent faster than text-only systems.
This means that an enterprise deploying multiple AI models, which most enterprises now do, faces not one certification problem but several, each with different failure characteristics, different degradation timelines, and different risk profiles. A certification that validates Model A’s behaviour tells you nothing about Model B’s behaviour, even if both are performing the same task. And a certification that validates Model A’s behaviour at the start of an interaction tells you nothing about its behaviour thirty minutes into one.
The compliance frameworks treat this as a testing problem: test more, test at different times, test under different conditions. But this misunderstands the nature of probabilistic systems. More testing narrows your estimate of the system’s behaviour; it does not narrow the behaviour itself. You cannot close the uncertainty gap with more samples. You can only characterise it. And characterisation is a fundamentally different activity from certification.
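The distinction can be made precise. Repeated testing lets you state an interval for the underlying pass rate with increasing precision; it never converts that interval into a guarantee. A minimal sketch, assuming an observed pass rate of roughly 97 per cent and a simple beta-posterior interval (normal approximation):

```python
# Characterisation vs certification, as a sketch. More samples tighten the
# *estimate* of the pass rate; they do not change the pass rate itself.
# The interval below uses a Beta(1,1) prior with a normal approximation.

from statistics import NormalDist

def characterise(passes: int, trials: int, confidence: float = 0.95):
    """Approximate credible interval for the underlying pass rate."""
    a, b = passes + 1, (trials - passes) + 1
    mean = a / (a + b)
    sd = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return max(0.0, mean - z * sd), min(1.0, mean + z * sd)

for trials in (100, 1_000, 10_000):
    passes = round(trials * 0.97)   # assume ~97% observed pass rate
    low, high = characterise(passes, trials)
    print(f"{trials:>6} trials: pass rate characterised as [{low:.3f}, {high:.3f}]")

# The interval narrows, but it is always an interval. No number of samples
# turns "97 per cent, with uncertainty" into "behaves within parameters, guaranteed".
```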
What Certification Actually Certifies
None of this means that ISO 42001, NIST AI RMF, or the EU AI Act’s requirements are useless. They are not. But it is essential to be clear about what they actually certify and what they do not.
What these frameworks can credibly certify is process: that an organisation has identified the AI systems it deploys, that it has documented their intended use cases, that it has established governance structures and assigned accountability, that it has implemented monitoring, that it has conducted risk assessments, and that it has policies for incident response. This is genuinely valuable. Process governance is the floor of responsible AI deployment.
What these frameworks cannot credibly certify is outcome: that a given AI system will behave within specified parameters at any given moment, that it will not produce harmful or inaccurate outputs under conditions indistinguishable from those in which it performed correctly, or that its behaviour today predicts its behaviour tomorrow.
The danger is that enterprises (and their boards, their regulators, their customers) treat a process certification as an outcome guarantee. This is not a hypothetical concern. It is already happening. ‘We’re ISO 42001 certified’ is becoming the new ‘we’re SOC 2 compliant,’ a badge that conveys the appearance of rigour without addressing the actual risk. In traditional software, SOC 2 compliance genuinely narrows the risk aperture. In probabilistic AI, the equivalent certification leaves the risk aperture essentially unchanged while making everyone feel better about it.
The False Confidence Problem
False confidence is worse than no confidence. An enterprise with no certification knows it is operating in uncertainty and behaves accordingly: it hedges, it monitors, it maintains human oversight. An enterprise with a certification it believes to be comprehensive relaxes those hedges. It automates more. It reduces human review. It scales deployment based on the assumption that the certified behaviour will hold.
This is precisely the pattern Evans’ Law predicts will produce the most damaging outcomes. As enterprises extend AI into longer, more complex workflows (the natural direction of adoption), the probability of coherence collapse increases along a power-law curve. The systems that enterprises trust most, because they have been ‘certified,’ are the systems they will push hardest into exactly the conditions under which they are most likely to fail.
The healthcare sector offers the clearest illustration. An AI system certified for clinical decision support that degrades over the course of an extended patient interaction is not merely an IT risk. It is a patient safety risk. The certification did not reduce this risk. It obscured it.
What Would Honest AI Governance Look Like?
If deterministic certification cannot govern probabilistic systems, what can?
The answer is a shift from certification to characterisation. Instead of certifying that a system behaves within defined parameters, honest governance would characterise how a system behaves across the range of conditions it will encounter — including degradation over time, variance across identical inputs, and vendor-specific drift patterns. This is not a new idea in engineering. Structural engineers do not certify that a bridge will never experience stress. They characterise the conditions under which it will fail and design accordingly.
Applied to enterprise AI, this would mean several things. First, replacing point-in-time testing with continuous behavioural monitoring that tracks drift in real time. Second, publishing degradation profiles for each model deployed, not as a failure but as a specification, the way a battery manufacturer publishes discharge curves. Third, establishing operational boundaries based on Evans’ Law predictions: maximum interaction lengths, mandatory human-in-the-loop checkpoints at predicted coherence thresholds, automated fallback triggers. Fourth, requiring vendors to disclose their systems’ drift signatures as a condition of enterprise deployment, the same way pharmaceutical companies disclose side effect profiles.
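A minimal sketch of what the third element might look like in practice. The threshold function comes from the formula above; the trigger fractions are assumptions chosen for illustration, not recommended values:

```python
# Illustrative runtime guard: operational boundaries derived from the
# Evans' Law threshold. The trigger fractions (0.6 for a human checkpoint,
# 0.85 for fallback) are assumptions for the sketch, not a standard.

from dataclasses import dataclass

def evans_law_threshold(mmlu_score: float) -> float:
    return 1969.8 * mmlu_score ** 0.74   # tokens

@dataclass
class InteractionGuard:
    mmlu_score: float
    checkpoint_fraction: float = 0.60    # require human review past this point
    fallback_fraction: float = 0.85      # hand off or end the session past this point

    def action(self, context_tokens: int) -> str:
        threshold = evans_law_threshold(self.mmlu_score)
        if context_tokens >= self.fallback_fraction * threshold:
            return "fallback"            # route to a human or a fresh session
        if context_tokens >= self.checkpoint_fraction * threshold:
            return "human_checkpoint"    # mandatory review before continuing
        return "continue"

guard = InteractionGuard(mmlu_score=86)  # threshold ≈ 53,000 tokens
for tokens in (10_000, 35_000, 48_000):
    print(tokens, guard.action(tokens))
```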
None of this is technically difficult. The mathematical tools exist. The monitoring infrastructure exists. What does not yet exist is the institutional willingness to admit that the current frameworks are insufficient, because admitting that means admitting that every enterprise AI deployment currently operating under an ISO certification is operating under a compliance fiction.
Who Benefits from the Fiction
It is worth asking who benefits from maintaining the current certification paradigm. The compliance industry (auditors, consultants, certification bodies) benefits directly. AI certification is one of the fastest-growing segments of the compliance market. Enterprises benefit in the short term, because certification provides legal cover and board-level reassurance. AI vendors benefit enormously, because a certification framework that validates process without constraining output places no meaningful burden on their products while providing their customers with a reason to buy.
The people who do not benefit are the ones exposed to the outputs: patients receiving AI-assisted diagnoses, citizens processed by AI-driven government services, employees evaluated by AI performance systems, and consumers whose financial products are priced by AI models that nobody can guarantee will produce the same result twice.
This is not a conspiracy. It is partly inexperience and partly an incentive structure. And it is the same incentive structure that has historically delayed real governance in every industry where the cost of admitting uncertainty exceeded the cost of maintaining a reassuring fiction, until the fiction failed publicly enough to force a reckoning.
The Reckoning Will Come from the Math
The conversation with that Fortune 50 AI executive was not unusual. It was representative. Across every major enterprise sector, teams are spending significant resources certifying systems whose behaviour they cannot predict, using frameworks designed for a category of technology that AI is not. They are doing this because the frameworks exist, because the auditors are available, and because the alternative, admitting that we do not yet have adequate tools for governing probabilistic systems, is institutionally intolerable.
But the math does not care about institutional comfort. Evans’ Law will continue to describe the degradation curves. Drift signatures will continue to differentiate vendor failure modes. Coherence collapse in LLMs will continue to occur at thresholds the research predicts, whether or not a certification body has declared the system compliant.
Enterprises should not ask whether to pursue AI governance. Of course they should. The question is whether they are willing to pursue governance that is realistic about what it can and cannot verify, or whether they will continue to purchase the expensive comfort of a framework that tells them what they want to hear.
You cannot certify uncertainty. And probabilistic systems are, by definition, uncertain. Uncertainty is part of the architecture. When systems are no longer deterministic, everything needs to adjust.