UPDATE February 27, 2026: This article has been updated to reflect the Evans’ Law 7.0 reformulation, which introduces the reliability surface R(L, M, T), an operational task-type rigidity scale, and the two-regime model. The original convergence findings and institutional analysis remain unchanged. New sections are marked.
When I published the original Evans’ Law framework in 2025, I was documenting something I’d been observing systematically through sustained work with frontier AI models: that coherence collapse in large language models isn’t random. It’s predictable, threshold-driven, and follows a mathematical relationship I expressed as L ≈ 1969.8 × M^0.74, where L is the coherence collapse threshold in tokens and M is model capability expressed as MMLU score.
At the time, the response from the research community was what you’d expect for independent, field-based research arriving outside traditional academic channels: cautious interest, some validation, and the reasonable question of whether the pattern I’d documented would hold under more controlled conditions.
Five independent research teams (at Microsoft/Salesforce, Anthropic, Apple, Google, and Caltech/Stanford) have since documented degradation regimes that cluster within the same order-of-magnitude band my framework predicts. None of them set out to validate Evans’ Law. None of their datasets were designed for threshold extraction. The convergence is meaningful, but it requires precision about what it actually confirms and what remains open.
This piece is a synthesis of what they found, where it maps to my framework, where it extends it — and what has emerged from my own February 2026 testing that required fundamentally reformulating how the law works.
What Evans’ Law Actually Says, And What It Now Predicts
Evans’ Law predicts a coherence collapse threshold: a point in a conversation beyond which model performance degrades in ways that are structural rather than incidental. The degradation isn’t caused by asking harder questions. It’s caused by accumulated context load. As message count and token load grow, the model’s ability to maintain semantic coherence (tracking entities, sustaining consistent reasoning, and producing outputs grounded in current input rather than contextual noise) declines predictably.
To make this concrete: a frontier model scoring 86 on MMLU (representative of current GPT-5-class and Claude 4.5-class systems) has a predicted coherence collapse threshold of approximately 53,000 tokens. That’s the point where structural degradation becomes likely, not where the model stops working entirely. These same models advertise context windows of 128,000 to over 1,000,000 tokens. The predicted functional coherence window is between 5% and 42% of advertised capacity.
Increase the model’s MMLU score to 90, a meaningful improvement, and the predicted threshold rises to approximately 55,000 tokens. A gain of roughly 3.4% despite a 4.7% increase in benchmark capability. That’s sublinear scaling in practice: capability improvements produce diminishing coherence returns.
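The threshold arithmetic can be reproduced directly from the formula as stated. A minimal sketch (the function name is mine; the constants are from the law itself):

```python
def collapse_threshold(mmlu: float) -> float:
    """Evans' Law (original form): predicted coherence collapse
    threshold in tokens, given model capability M as MMLU score."""
    return 1969.8 * mmlu ** 0.74

t86 = collapse_threshold(86)  # ~53,200 tokens for an MMLU-86 model
t90 = collapse_threshold(90)  # ~55,000 tokens for an MMLU-90 model
gain = t90 / t86 - 1          # ~3.4% threshold gain for a 4.7% capability gain
```

With a 0.74 exponent, the threshold always grows more slowly than capability, which is the sublinear-scaling claim in quantitative form.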
Subsequent research extended the framework to document the cross-modal degradation tax, proper noun drift as a specific semantic governance failure, and memory leakage as a RAG-induced hallucination trigger. Together, these form what I’ve been building as AI Conversational Phenomenology: the study of how AI systems behave and degrade in the conditions under which they’re actually used.
1. Microsoft/Salesforce: The Degradation Is Measured at Scale
LLMs Get Lost in Multi-Turn Conversation — Laban, Hayashi, Zhou & Neville (Microsoft Research/Salesforce Research, May 2025)
This paper predates my published law formulation. It is not a response to my work; it is independent convergence, which is what makes it meaningful.
Across 200,000+ simulated conversations and 15 frontier models, the team documented an average 39% performance drop in multi-turn settings versus single-turn. They decomposed this into a minor loss in aptitude and a massive increase in unreliability, measured as the gap between best-case and worst-case performance. The aptitude drop was modest. The reliability collapse was the story.
Their core finding (that when LLMs take a wrong turn in a conversation, they get lost and do not recover) is consistent with what Evans’ Law predicts. The threshold I document is the point at which that wrong turn becomes structurally inevitable. Their finding that models commit to early assumptions and overly rely on their own outputs, even after those outputs have proven incorrect, is a precise empirical description of the fracture-repair propagation mechanism I identified in my earlier hallucination framework.
2. Anthropic: Incoherence Scales with Complexity and Reasoning Length
The Hot Mess of AI — Hägele, Gema, Sleight, Perez & Sohl-Dickstein (Anthropic/EPFL, January 2026)
Anthropic decomposed AI failures into bias (systematic, coherent errors) and variance (incoherent, unpredictable errors) and found that as reasoning extends and task complexity increases, failures become dominated by incoherence rather than systematic misalignment. Larger, more capable models are not reliably more coherent. Scaling alone does not resolve the problem.
What this adds to the picture is mechanism. Evans’ Law predicts that coherence collapses at a threshold. Anthropic’s bias-variance decomposition describes what is happening inside that collapse: the model’s errors are transitioning from structured (wrong but consistent) to random (wrong and unpredictable). Their finding that incoherence spikes when models engage in spontaneous extended reasoning runs parallel to the message-count accumulation effect in my framework.
3. Apple: Reasoning Models Collapse at Complexity Thresholds
The Illusion of Thinking — Shojaee, Mirzadeh, Alizadeh, Horton, Bengio & Farajtabar (Apple, 2026)
Using controllable puzzle environments, the Apple team found that Large Reasoning Models demonstrate three distinct performance regimes: at low complexity, standard models outperform reasoning models; at medium complexity, extended thinking provides advantage; at high complexity, both collapse entirely.
The counterintuitive scaling limit they document (reasoning effort increases with complexity up to a point, then declines despite an adequate token budget) provides the clearest external measurement of the collapse boundary Evans’ Law predicts. The model doesn’t run out of capacity. It runs out of coherence. This holds whether the context is filled by conversation turns or reasoning tokens.
4. Google: Depth of Reasoning, Not Length, Predicts Accuracy
Think Deep, Not Just Long — Chen, Peng, Tan et al. (Google/University of Virginia, February 2026)
The Google paper requires precision. Raw token count correlates with accuracy at r = -0.544 on average — negative. Longer outputs are associated with worse performance, with the strongest negative relationship (r = -0.783) on their hardest benchmark. Their “deep-thinking ratio” — tokens where internal predictions undergo significant revision — correlates with accuracy at r = 0.828.
The relationship to Evans’ Law is real but structurally distinct. Evans’ Law is a macro-level degradation curve across cumulative session load. The deep-thinking ratio is a micro-level signal within a single response. They are orthogonal axes, not competing claims.
Google’s metric has a specific blind spot: it would correlate positively with math-reasoning accuracy while remaining blind to entity-binding failure, the failure mode that is most dangerous in research, legal, and executive decision-making contexts precisely because it presents as confident and coherent. A model that commits early to an incorrect semantic anchor and propagates from it smoothly, without deep revision, may show a low deep-thinking ratio while producing confidently incorrect output. This is not a criticism of Google’s framework. It is a named gap, and it points directly to what my own extended work has been documenting.
5. Caltech/Stanford: A Taxonomy of Structural Failure
Large Language Model Reasoning Failures — Song, Han & Goodman (Caltech/Stanford/Carleton, Transactions on Machine Learning Research, February 2026)
This paper is the most academically comprehensive of the group, a systematic survey that distinguishes fundamental failures (intrinsic to LLM architectures) from application-specific and robustness-related types. Its core finding: significant reasoning failures persist even in seemingly simple scenarios, following structural patterns rather than random distribution.
Song, Han and Goodman have mapped the terrain of failure modes that Evans’ Law’s threshold sits within. The law predicts when you’ll cross into that terrain. Their taxonomy describes what you find there.
The Cumulative Picture: What the Convergence Does and Doesn’t Confirm
Five research groups. Different institutions. Different methodologies. Different entry points. Each documents threshold-like degradation regimes under cumulative load or compositional complexity. The degradation inflection points they observe cluster within the order-of-magnitude band Evans’ Law predicts.
What this is not: final regression validation of the formula’s constants. None of these studies were designed to extract token-equivalent collapse thresholds. The claim is clustering within scale band, not precise numerical confirmation. That distinction matters, and I’ve been explicit about it in the companion academic paper. The convergence is real and meaningful. It is preliminary calibration, not proof.
What it does confirm: coherence degradation in LLMs is structural, threshold-driven, and not resolved by scaling. That claim now has five independent institutional data points oriented in the same direction.
But all five of these findings operate within a single failure regime: degradation that develops as context, complexity, or reasoning depth increases. What my February 2026 testing uncovered operates in a different regime entirely.
The February 2026 Finding: Two Regimes, Not One
The institutional papers above document load-driven degradation, failure that accumulates with context. My February 2026 testing, which extends a prior body of empirical work on proper noun and entity-binding failures (Evans, December 2025), produced a finding that required not just modifying but fundamentally reformulating the original framework.
The December 2025 work established through controlled experiments and production case studies (including 9,500 words of documented business journalism workflows) that models systematically avoid, substitute, and mishandle proper nouns as a learned uncertainty-management strategy. Proper noun avoidance was identified as an architectural behavior, not a training gap. That work built the evidential foundation for the February testing.
In February 2026, I conducted controlled multimodal verification testing across six frontier models: Anthropic’s Sonnet 4.6, Opus 4.6, and Haiku 4.5; GPT-5.2; Grok; and Gemini. Each model was given an identical image-based entity-verification task containing two embedded errors: a misspelling of “Dublin” and a geographic misplacement of Dublin in central Europe. Zero models achieved a complete pass. Each error was caught by exactly one model — and they were different models.
The significance of this finding is not the sample size (this is a diagnostic test, not a large-N study). The significance is what it demonstrates about the original Evans’ Law formulation. The original framework assumed baseline single-turn performance was stable and located degradation onset along the context-length axis. These results demonstrate that for task classes requiring rigid referential binding, functional capacity may be degraded independent of context accumulation. At sub-10,000 tokens, at baseline.
A threshold model indexed to token accumulation cannot, by definition, predict failures that occur before tokens accumulate. That’s not an edge case. It’s a category of failure the original formula structurally cannot reach.
The Reliability Surface: Evans’ Law 7.0
The correct response is not to force-fit baseline instability into a token-length framework, nor to treat Evans’ Law as invalidated. The correct response is to recognize that the original equation describes one regime of a two-regime system, and that what appeared to be a threshold line was always a cross-section through a higher-dimensional structure.
Regime 1 (Accumulation, L-dominated): Degradation develops as cumulative context, reasoning depth, or conversational complexity increases. This is the regime described by Evans’ Law as originally formulated and supported by all five institutional studies. For tasks with low referential rigidity, the original equation remains predictive: coherence collapse occurs at L ≈ 1969.8 × M^0.74.
Regime 2 (Baseline Instability, T-dominated): Degradation is present at first turn, independent of context accumulation. This regime manifests for task classes requiring rigid referential binding — proper noun identification, geographic verification, entity disambiguation, protocol identifiers.
The full reliability model is a three-dimensional surface: R = f(L, M, T), where R is reliability, L is cumulative context load, M is model capability, and T is a task-type rigidity coefficient on [0, 1]. The original Evans’ Law equation is the cross-section of this surface at T = 0.
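The paper specifies the axes of the surface and its T = 0 cross-section, but not the functional form of f, which remains an open calibration question. Purely as an illustrative sketch (the Regime 2 penalty coefficient and the quartic falloff here are hypothetical placeholders of mine, not published constants):

```python
def collapse_threshold(mmlu: float) -> float:
    # Original Evans' Law: the T = 0 cross-section of the surface.
    return 1969.8 * mmlu ** 0.74

def reliability(L: float, M: float, T: float) -> float:
    """Hypothetical reliability surface R = f(L, M, T), in [0, 1].
    Baseline reliability falls with task rigidity T (Regime 2), and
    degrades further as context load L approaches the capability-
    indexed threshold (Regime 1). Placeholder functional form."""
    baseline = 1.0 - 0.6 * T                      # assumed Regime 2 penalty
    load = min(1.0, L / collapse_threshold(M))    # normalized context load
    return baseline * (1.0 - load ** 4)           # sharp falloff near threshold
```

Under these placeholder choices the two regimes fall out qualitatively: a T ≈ 0.9 task is already below 0.5 reliability at first turn regardless of context, while a T = 0 task stays near 1.0 until load approaches the threshold.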

The T-Scale: What Kind of Task Are You Asking the Model to Do?
This is where it becomes operational. T is not an abstraction. It’s a classification system:
T = 0.0 (Generative semantic): No single correct referent required. Creative writing, brainstorming, open summarization.
T = 0.2-0.3 (Soft factual): Approximate correctness sufficient. General Q&A, thematic analysis, trend description.
T = 0.4-0.5 (Structured factual): Specific facts must be correct but format is flexible. Report generation, data interpretation, research synthesis.
T = 0.6-0.7 (Rigid referential): Named entities must bind correctly. Proper noun usage, citation, business journalism.
T = 0.8-0.9 (Strict identity binding): Exact entity resolution required. Legal names, geographic coordinates, protocol identifiers.
T = 1.0 (Zero-tolerance referential): Any substitution is failure. Medical dosages, financial figures, API endpoints.
The February 2026 Dublin test sits at approximately T = 0.8-0.9. All six frontier models failed it at first turn. Creative writing tasks at T ≈ 0 routinely succeed. The T-axis is functionally real.
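The rigidity scale lends itself directly to an operational lookup. A minimal sketch, using the midpoint of each band as given above (the category keys, function name, and 0.6 cutoff are my own illustrative choices):

```python
# Midpoint T values for the task-type rigidity scale.
T_SCALE = {
    "generative_semantic": 0.0,   # creative writing, brainstorming
    "soft_factual": 0.25,         # general Q&A, trend description
    "structured_factual": 0.45,   # report generation, research synthesis
    "rigid_referential": 0.65,    # proper nouns, citation, journalism
    "strict_identity": 0.85,      # legal names, geographic coordinates
    "zero_tolerance": 1.0,        # medical dosages, financial figures, API endpoints
}

def requires_entity_verification(task_type: str, cutoff: float = 0.6) -> bool:
    """Flag task classes whose rigidity meets or exceeds the cutoff
    above which baseline (first-turn) reliability cannot be assumed."""
    return T_SCALE[task_type] >= cutoff
```

On this classification, the Dublin test falls in the strict-identity band and would be flagged for verification before the first turn is even sent.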
The two regimes interact. A task at T = 0.4-0.5 that would be stable at low context may become unstable as tokens accumulate. A task at T = 0.8+ that is already unreliable at baseline will degrade faster under load than a low-T task at the same context length. The cross-modal degradation tax, established in Evans’ Law 5.0, amplifies T, making multimodal entity-binding tasks particularly vulnerable.
Why This Happens: The Connection to the S-Vector
The reliability surface is not a standalone mathematical construct. It has an architectural explanation, and it connects to the broader research program I’ve been building.
Evans’ Law establishes where collapse occurs. The fracture-repair framework explains how it manifests. The S-vector and significance weighting research addresses why: transformer attention mechanisms lack a significance primitive. All tokens receive positional encoding and contextual attention, but none receive significance encoding. The model has no way to register that some tokens matter more than others for the integrity of the output.
This absence is what produces the T-axis. Proper nouns sit on weak semantic axes with no gradient to anchor them. Without a significance primitive, the model treats the token representing “Dublin” with the same architectural weight as any other token. When the task requires that token to bind rigidly, and it doesn’t, you get baseline instability. Not because context has overwhelmed capacity, but because capacity was never structurally allocated.
The L-axis has the same root cause. As context accumulates, the effective significance of any individual token diminishes. Without significance weighting to protect high-stakes bindings, accumulated context acts as noise. Both axes of the reliability surface are consequences of the same architectural absence.
Confabulation Cascades: The Dangerous Failure Mode
Separately, GPT-5.2 exhibited a specific cascade pattern across three independent incidents within 72 hours: denial of entity existence, substitution with a related entity, re-substitution with a more familiar entity, surface confidence maintained throughout. No uncertainty signaling. This confabulation cascade pattern is dangerous precisely because it evades non-expert detection: the model sounds authoritative while systematically substituting incorrect referents.
Within the reliability surface framework, this is a T-axis phenomenon. The model commits to an incorrect anchor on a weak semantic axis, and because no significance mechanism flags the binding as high-stakes, it propagates the error smoothly. It’s not a series of independent errors. It’s a single mis-anchoring event followed by internally consistent but externally incorrect elaboration.
This is also the specific blind spot in Google’s deep-thinking ratio: a model that anchors incorrectly and propagates smoothly shows low deep-thinking ratio while producing confidently wrong output.
Gemini exhibited a categorically different failure: complete processing stall on proper-noun-containing images, occurring twice in two days. Detection failure and processing failure require different operational responses. Both must be instrumented independently in enterprise deployments.
Implications
For hallucination governance: the convergence confirms that hallucination is not primarily a training data problem. It is structural, manifesting predictably above load thresholds and – for certain task types – at baseline. Memory leakage, proper noun drift, and reasoning incoherence share a common architecture: fracture at ambiguity, repair using available context. They share a common governance need.
For enterprise deployment: the two-regime model has direct operational consequences. Context windows require governance; the accumulation regime means advertised context windows overstate functional capacity. But context governance alone is insufficient. The baseline instability regime means that certain task classes (T ≥ 0.6) are unreliable even at first turn, and no amount of context management will address this.
Enterprise deployments must govern along both axes. For the L-axis: context budgets, provenance hierarchy for persistent memory, session management. For the T-axis: task-type classification using the rigidity scale, entity-binding verification protocols, and epistemic stopping permissions for tasks exceeding the model’s reliable referential capacity. Threshold-aware deployment along one axis alone will miss failures originating on the other.
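A two-axis deployment guard could be sketched as follows. This is a hedged illustration, not a published protocol: the 80% safety margin, the 0.6 rigidity cutoff, and the function names are all assumptions of mine.

```python
def collapse_threshold(mmlu: float) -> float:
    # Evans' Law threshold, used here as the L-axis context budget.
    return 1969.8 * mmlu ** 0.74

def governance_checks(context_tokens: int, mmlu: float, t_rigidity: float,
                      safety_margin: float = 0.8) -> list[str]:
    """Return required interventions for a request, checking both axes
    independently: L-axis context budget and T-axis task rigidity."""
    actions = []
    # L-axis: flag sessions approaching the predicted collapse threshold.
    if context_tokens > safety_margin * collapse_threshold(mmlu):
        actions.append("context_budget_exceeded: summarize or reset session")
    # T-axis: rigid referential tasks need verification even at first turn.
    if t_rigidity >= 0.6:
        actions.append("entity_binding_verification_required")
    return actions
```

Note that a fresh session carrying a strict-identity task still triggers the T-axis check at 2,000 tokens, which is exactly the failure class that context governance alone would miss.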
For the research agenda: the open questions are empirical and tractable. Calibrating the T-axis through systematic variation of referential rigidity. Mapping the L-T interaction surface to determine whether the two axes compound additively, multiplicatively, or via threshold interaction. And testing whether significance-aware architectures, as specified in the S-vector research, can reshape the reliability surface itself.
The institutional validation has arrived. The framework has been reformulated. The reliability surface is the destination. The next phase is precision.
Jennifer Evans is an independent AI researcher based in Siem Reap, Cambodia, and founder of Pattern Pulse AI. Her research on AI Conversational Phenomenology is available through Zenodo and at patternpulse.ai.
Referenced papers:
Evans’ Law 7.0: From Threshold to Reliability Surface (Zenodo, February 2026)
Evans’ Law (zenodo.org/records/17550556)
Evans’ Law 5.0 (doi.org/10.5281/zenodo.17593410)
The S-Vector (zenodo.org/records/17841935)
Proper Noun Semantic Governance v2 (doi.org/10.5281/zenodo.18067714)
Proper Noun Failure: An Empirical Update (doi.org/10.5281/zenodo.18707438)
LLMs Get Lost in Multi-Turn Conversation (arxiv.org/abs/2505.06120)
The Hot Mess of AI (arxiv.org/abs/2601.23045)
The Illusion of Thinking (Apple, 2026)
Think Deep, Not Just Long (arxiv.org/abs/2602.13517)
Large Language Model Reasoning Failures (arxiv.org/abs/2602.06176)

