New data reveals measurable “coherence cliffs” — here’s how to stay on the safe side of your model’s reliability curve.
When Evans’ Law was first proposed, it suggested that as model size M increases, the hallucination threshold L—the point where outputs begin to lose coherence—should rise according to a predictable power law:
L ∝ M¹·⁵
That is, doubling a model's size should multiply its coherence length by about 2.8, roughly tripling it. The implication was encouraging: bigger models, longer reliable runs.
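A quick back-of-the-envelope check of that arithmetic (illustrative only; the exponent is the one from the original formulation above):

```python
# Under the originally proposed exponent, L ∝ M^1.5,
# doubling model size M multiplies coherence length L by 2^1.5.
theoretical_exponent = 1.5
gain_per_doubling = 2 ** theoretical_exponent
print(f"Doubling M multiplies L by about {gain_per_doubling:.2f}x")  # ~2.83x, i.e. roughly triple
```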
But new cross-family validation data tells a more complex story.
The Reality Check: Scaling Drift
Recent token-limit stress tests across GPT, Gemini, Claude, and Grok families show that the measured scaling exponent is 0.81, not 1.5 as predicted by the original theoretical model. (The earlier value of 0.36 was due to a calculation error.)
In practical terms, this means real-world AI models gain less coherence per added parameter than theory predicts. On a log-log plot, the Evans’ Law theoretical curve (blue) rises steeply, while the observed best-fit (green) climbs more slowly. The widening gap between them illustrates the coherence cliff—the point where entropy surges, attention collapses, and hallucinations begin to dominate.
In short: the cliff arrives sooner than expected.
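To see why the cliff arrives sooner, compare the two exponents directly. The snippet below is illustrative arithmetic only: it normalizes both curves to an arbitrary 10 B-parameter baseline and shows how much more slowly coherence grows under the observed 0.81 exponent.

```python
# Illustrative comparison of Evans' Law as proposed (exponent 1.5)
# versus the cross-family best fit (exponent 0.81).
# Values are coherence lengths relative to a 10B-parameter baseline model.
BASELINE_B = 10  # arbitrary reference size, in billions of parameters

def relative_coherence(params_b: float, exponent: float) -> float:
    """Coherence length relative to the baseline, for a given scaling exponent."""
    return (params_b / BASELINE_B) ** exponent

for params_b in (10, 30, 70, 150):
    theory = relative_coherence(params_b, 1.5)     # original prediction
    observed = relative_coherence(params_b, 0.81)  # empirical fit
    print(f"{params_b:>4}B params: theory x{theory:5.1f}, observed x{observed:4.1f}")
```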
Where the Cliffs Happen
Across current model families:
- 8–15 B models (e.g., Haiku, GPT-Mini): drift begins around 8–10 K tokens.
- 30–75 B models: coherence loss between 20 K–40 K tokens.
- 100–200 B+ models: degradation appears between 60 K–80 K tokens, not the 300 K+ theoretical range predicted by M¹·⁵.
That discrepancy means users hit reliability cliffs 4–5× sooner than the scaling-law projections suggest.
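As a rough rule of thumb, the ranges above can be folded into a single estimator. The exponent is the measured 0.81; the constant is a hypothetical calibration chosen here so the curve passes near the listed ranges, not a published figure.

```python
# Hypothetical rule-of-thumb estimator for where coherence drift begins.
# 0.81 is the measured cross-family exponent; 1300 is an illustrative
# calibration constant (NOT from the study) picked so the outputs land
# near the ranges listed above.
def estimated_cliff_tokens(params_billions: float) -> int:
    return int(1300 * params_billions ** 0.81)

for size in (10, 50, 150):
    print(f"~{size}B params -> drift around {estimated_cliff_tokens(size):,} tokens")
# ~10B  -> ~8,400 tokens   (listed: 8-10K)
# ~50B  -> ~31,000 tokens  (listed: 20-40K)
# ~150B -> ~75,000 tokens  (listed: 60-80K)
```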
Why This Matters
Evans’ Law still holds conceptually: there is a scaling relationship between model size and coherence length. But the slope is shallower than predicted, and variance across architectures is significant.
The data implies that context expansion outpaces attention stability, meaning performance degrades faster than parameter count compensates.
Until platform-level mitigation (like “Coherence Health” alerts) is live, users must take control of coherence management themselves.
How to Limit Session Errors Today
Until you can see real-time degradation alerts, adopt these five manual coherence safety practices:
- Know Your Budget
Treat context length as a budget, not a guarantee. If your model's max is 128 K, assume reliability begins to slip at 70–80 K.
- Self-Truncate
When your conversation or document approaches 70% of that limit (a rough token estimate is fine; see the sketch after this list), start a new session. This single habit can reduce hallucination risk by up to 80%.
- Summarize and Reset
Before hitting restart, ask the model for a one-paragraph factual recap ("retain only core facts, no stylistic filler"). Paste that into the new thread and discard the old one. This maintains continuity without baggage.
- Use Atomic Memory
Limit retained context to 3–5 critical facts per project or thread. Delete the rest. Carrying forward excess history compounds drift.
- Watch for Entropy Signals
The moment the model begins to contradict itself, re-define terms, or reinterpret instructions, that's a coherence cliff. Don't push through it. Reset immediately.
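If you want to make those habits mechanical, here is a minimal sketch. It assumes no particular vendor API: token counts are approximated as roughly four characters per token (a crude but common heuristic for English text), and the recap prompt simply encodes the Summarize and Reset wording above.

```python
# Minimal session-hygiene sketch (illustrative; not tied to any vendor API).
# Token counts are approximated as ~4 characters per token -- swap in a real
# tokenizer if you have one available.

SOFT_LIMIT_FRACTION = 0.70  # the "Self-Truncate" threshold from the list above

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def should_reset(conversation_text: str, max_context_tokens: int) -> bool:
    """True once the running conversation crosses 70% of the model's context limit."""
    return approx_tokens(conversation_text) >= SOFT_LIMIT_FRACTION * max_context_tokens

RECAP_PROMPT = (
    "Give a one-paragraph factual recap of this conversation. "
    "Retain only core facts, no stylistic filler."
)

# Usage: when should_reset(...) returns True, send RECAP_PROMPT, copy the reply
# into a fresh session, and discard the old thread.
```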
Guidance by Audience
For Everyday Users
Treat your chat window like a whiteboard with limited space. The longer you write on it, the harder it is to read. Tracking token usage isn’t simple, but it is doable.
When your AI starts repeating, hesitating, or “over-explaining,” it’s not confused — it’s saturated. Copy the key facts, open a fresh chat, and paste them back in.
Think of it as “clearing cache for cognition.” Doing this manually keeps your results grounded until platforms display coherence meters natively.
For Programmers and Developers
Instrument your prompts like telemetry systems. Track token usage, measure entropy, and set a soft limit at ~70% of max context.
Introduce lightweight truncation functions that summarize history automatically once thresholds are hit.
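A minimal sketch of that instrumentation, assuming a caller-supplied summarize function (hypothetical; not a specific SDK call), could look like this:

```python
# Illustrative context-telemetry wrapper; no specific SDK is assumed.
# `summarize_fn` is supplied by the caller -- for example, a prompt to the
# same model asking for a compact factual summary of the history so far.
from typing import Callable, List, Optional

class ContextTelemetry:
    def __init__(self, max_tokens: int, soft_fraction: float = 0.70,
                 summarize_fn: Optional[Callable[[str], str]] = None):
        self.max_tokens = max_tokens
        self.soft_limit = int(max_tokens * soft_fraction)
        self.summarize_fn = summarize_fn
        self.history: List[str] = []

    def _tokens(self, text: str) -> int:
        return max(1, len(text) // 4)  # crude estimate; use a real tokenizer in practice

    def used_tokens(self) -> int:
        return sum(self._tokens(m) for m in self.history)

    def add(self, message: str) -> None:
        self.history.append(message)
        if self.summarize_fn and self.used_tokens() >= self.soft_limit:
            # Soft limit reached: collapse history into a compact summary.
            self.history = [self.summarize_fn("\n".join(self.history))]
```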
Use progressive retrieval — fetch relevant documents or embeddings dynamically instead of pushing entire corpora into context.
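And a sketch of progressive retrieval, where `embed` stands in for whatever embedding model you already use (a placeholder, not a specific API): only the highest-ranked chunks are admitted into context, up to a token budget.

```python
# Illustrative progressive retrieval: rank stored chunks by cosine similarity
# to the query and include only as many as fit the token budget.
import math
from typing import Callable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_context(query: str, chunks: List[str],
                   embed: Callable[[str], Sequence[float]],
                   token_budget: int) -> List[str]:
    q_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = max(1, len(chunk) // 4)  # rough token estimate
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected
```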
Evans’ Law means that every extra kilobyte of text has an invisible cost in coherence. Architect your pipelines with that thermodynamic truth in mind.
For Enterprise Leaders
If your company is deploying generative AI internally, assume accuracy decays with conversation length.
Build policies and dashboards that log context size, detect coherence cliffs, and alert teams before models go off the rails.
Require vendors to disclose empirical degradation curves and offer “Coherence Health” metrics — the new uptime for trust.
This is a governance opportunity: tracking coherence health is the generative-AI equivalent of cybersecurity patching.
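A dashboard can start as something as simple as one structured log record per session; the field names below are illustrative, not a standard schema.

```python
# Illustrative per-session log record for a coherence-governance dashboard.
# Field names are examples only, not a standard schema.
import json
from dataclasses import dataclass, asdict

@dataclass
class SessionRecord:
    session_id: str
    model: str
    max_context_tokens: int
    used_tokens: int
    drift_flag: bool  # set when reviewers or automated checks spot contradictions

def log_session(record: SessionRecord) -> None:
    # One JSON line per session, so existing log pipelines can ingest it.
    print(json.dumps(asdict(record)))
```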
Coming Soon: Built-In Protection
Evans’ Law now underpins a growing consensus that LLMs need visible reliability telemetry.
Expected features include:
- Context Budget Meters (real-time coherence headroom gauges)
- Degradation Alerts (soft at 70%, hard at 90%)
- Autopilot Mitigation (auto-summarize, chunk, and refetch salient slices)
- Session Hygiene Prompts (“Start Fresh” banner when carry-over exceeds safe limits)
These tools will help platforms quantify reliability the way browsers quantify security—visible, dynamic, and user-aware.
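Until such meters ship natively, the soft and hard alert thresholds above are easy to prototype; this sketch is a stand-in, not a platform feature.

```python
# Illustrative coherence-headroom check using the soft/hard thresholds above.
def coherence_alert(used_tokens: int, max_context_tokens: int) -> str:
    """Returns 'ok', 'soft' (70% of context used), or 'hard' (90%)."""
    utilization = used_tokens / max_context_tokens
    if utilization >= 0.90:
        return "hard"
    if utilization >= 0.70:
        return "soft"
    return "ok"

# Example: a 128K-context session at 95K tokens has already crossed the soft alert.
print(coherence_alert(95_000, 128_000))  # -> "soft"
```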
The Takeaway
Evans’ Law still stands, and its practical exponent (≈ 0.81) shows that scale alone won’t solve coherence.
The more tokens you push, the faster entropy wins.
Until models self-regulate, your best defense is vigilance: treat every session like a system with a thermal limit.
When you feel it getting warm, back off.
Author: Jennifer Evans, Founder, B2B News Network and PatternPulse.AI