Artificial intelligence has delivered extraordinary breakthroughs over the past five years, but it has also generated an equally extraordinary problem: organizations are deploying systems they cannot reliably measure, benchmark, or predict. This gap between capability and comprehension has left enterprises and public institutions navigating AI adoption through a haze of marketing claims, vendor abstractions, and guesswork. While the technology advances at an exponential pace, the practical ability to understand how it behaves under real-world workloads has lagged far behind.
This is where Evans’ Law and Evans’ Ratio enter the conversation. Developed through empirical testing across multiple model families, they offer a language and structure for understanding one of the quietest but most consequential challenges facing AI today: coherence collapse, or the point at which a model begins to lose its ability to maintain accuracy, self-consistency, and task continuity over extended interactions.
These frameworks do not compete with machine-learning research. But they do fill a gap that users, enterprises, and policymakers feel every day: the absence of reliable operational indicators for when a system is likely to fail, how it fails, and how people and organizations can manage and plan around those limits. In that sense, Evans’ Law and Evans’ Ratio represent something AI desperately needs: practical tools for grounding risk, expectations, and governance in observable reality.
The Problem AI Vendors Haven’t Solved: Measuring Reliability Over Time
Most AI evaluations still rely on short-context benchmarks: tests that measure performance on isolated questions, brief interactions, or fixed datasets. They are useful for comparing models at a point in time, but they do not reflect real enterprise use, where models operate across:
• multi-step reasoning
• multi-hour conversations
• chained tool calls
• persistent state
• variable context windows
• compounding decision paths
In these environments, AI systems do not simply make errors; they undergo structural degradation. Fluency remains intact, but internal consistency slowly unravels. Contradictions appear. Memory drifts. Plans collapse. Tools are misused. Corrections disappear. The output remains articulate, but the reasoning quality erodes.
This phenomenon remained largely undocumented, partly because vendors rarely publish long-context failure profiles, and partly because organizations assumed failures were anecdotal rather than systemic. Evans’ Law challenged that assumption by demonstrating that coherence collapse is predictable, quantifiable, and dependent on model scale.
What Evans’ Law Says, and Why It Matters
Evans’ Law, in its distilled form, suggests that there is a measurable, scale-dependent limit on how many tokens (words, instructions, reasoning steps) a model can handle before coherence reliably deteriorates. Rather than being a fringe anomaly, this collapse appears consistently across:
• text-only LLMs
• multimodal models
• agentic scaffolds
• vendor ecosystems
The law does not claim to diagnose the internal cause—whether it is attention mechanism overload, intermediate representation drift, or a downstream effect of self-conditioning. Instead, it establishes a practical truth: every model has a predictable horizon beyond which risk rises sharply.
For enterprises, this matters enormously. Long-context interactions power the workflows that organizations increasingly rely on: customer service agents, legal drafting assistants, decision support systems, research tools, planning agents, and operational copilots.
Evans’ Law gives enterprises something they’ve never had before: a planning metric.
It allows teams to ask sensible questions:
• How far can we push this model before reliability diminishes?
• When should handoffs, checkpoints, summaries, or resets be inserted?
• How long can an agent loop run before it becomes unsafe?
• Why is a system failing after 14 steps when the vendor claims support for 100k tokens?
By turning what looked like random breakdown patterns into an observable curve, Evans’ Law helps companies move from trial-and-error deployment to structured governance.
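Because the law frames this horizon as a number, it can be wired directly into tooling. The sketch below is a minimal illustration of that idea, assuming a per-model horizon calibrated through a team’s own long-context testing; the class, the 80% checkpoint margin, and the 60,000-token figure are hypothetical placeholders for the example, not values taken from the research.

```python
# Minimal sketch: treat a calibrated coherence horizon as a token budget
# and schedule a checkpoint (summary, handoff, or reset) before reaching it.
# All numbers below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class CoherenceBudget:
    """Track tokens consumed against an empirically calibrated horizon."""
    horizon_tokens: int             # point where degradation risk rises sharply
    checkpoint_margin: float = 0.8  # recommend a checkpoint at 80% of the horizon
    consumed: int = 0

    def record(self, tokens: int) -> None:
        self.consumed += tokens

    def needs_checkpoint(self) -> bool:
        return self.consumed >= self.checkpoint_margin * self.horizon_tokens


def summarize_and_reset(budget: CoherenceBudget) -> None:
    # Stand-in for a real handoff: summarize state, then start a fresh context.
    print(f"checkpoint at {budget.consumed:,} tokens: summarize and reset")
    budget.consumed = 0


budget = CoherenceBudget(horizon_tokens=60_000)   # illustrative per-model calibration
step_token_counts = [7_500, 12_000, 9_300, 15_200, 11_400, 8_800]  # simulated steps

for tokens in step_token_counts:
    budget.record(tokens)
    if budget.needs_checkpoint():
        summarize_and_reset(budget)
```

The design point is simple: rather than waiting for visible contradictions, the workflow spends its token budget deliberately and schedules summaries or handoffs before entering the high-risk zone.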
The Evans’ Ratio: A Missing Measurement for Agentic AI
As enterprises explore “agentic AI” (systems that take sequential actions), another problem appears: autonomy is extremely hard to quantify. Vendors use the term loosely, implying that models “act,” “plan,” or “execute,” when in fact most agent behaviors are deterministic scaffolds built around a language model that remains fundamentally reactive.
Evans’ Ratio provides a way to quantify this gap by comparing:
A. the number of steps an agent completes autonomously, versus
B. the number of steps it requires external scaffolding to complete.
This ratio exposes something critical: today’s agentic systems are far less autonomous than marketing implies. A model that can answer a question fluently may fail catastrophically when required to sustain structured, multi-step behavior with memory, correction, and tool use.
More importantly, Evans’ Ratio provides a traceable indicator of improvement or regression across model iterations. Because the ratio can decrease (not just increase) with new model releases, it offers enterprises a more honest metric than vendor-provided benchmark scores, which frequently obscure operational regressions.
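As a back-of-the-envelope illustration, the snippet below computes a ratio in this spirit from a simulated agent run log. The log format, the step labels, and the literal A-over-B reading of the ratio are assumptions made for the example, not the papers’ formal definition.

```python
# Rough sketch of an Evans'-Ratio-style metric computed from agent run logs.
# Assumption: retries, forced tool calls, and human corrections all count
# as scaffolded steps; the formula used here is a literal A / B reading.

from typing import List, Tuple

# Each entry: (step_name, completed_autonomously)
RunLog = List[Tuple[str, bool]]


def autonomy_ratio(log: RunLog) -> float:
    autonomous = sum(1 for _, ok in log if ok)      # A: steps done without help
    scaffolded = sum(1 for _, ok in log if not ok)  # B: steps needing scaffolding
    if scaffolded == 0:
        return float("inf")  # fully autonomous run
    return autonomous / scaffolded


run = [("plan", True), ("search", True), ("extract", False), ("draft", True),
       ("verify", False), ("revise", True), ("cite", False), ("format", True),
       ("check", True), ("submit", False)]

print(f"autonomous-to-scaffolded ratio: {autonomy_ratio(run):.2f}")  # prints 1.50
```

Logged per model release, a single number like this makes operational regressions visible even when headline benchmark scores improve.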
Why These Frameworks Are Gaining Attention
Evans’ Law and Evans’ Ratio are resonating with the AI community (more than 1,800 downloads of the papers since the first was published a month ago) not because they offer mathematical perfection, but because they provide operational clarity. They acknowledge something enterprises already know instinctively: current AI systems are extraordinarily capable in short bursts but remain fragile over long horizons.
By offering a structure for measurement, the frameworks enable:
• Better governance: Organizations can define safe operational ranges and avoid overextending models into failure zones.
• Better procurement: Buyers can evaluate models based not only on benchmark performance but on the stability of long-form behavior.
• Better deployment: Teams can choose architectures and workflows that compensate for known structural limits.
• Better risk management: Enterprises can document predictable failure modes, improving transparency and reducing liability exposure.
• Better vendor accountability: When reliability collapses can be measured, they can no longer be dismissed as user error or isolated anomalies.
A Contribution to a Growing Field
Long-context reliability is rapidly becoming one of the most important questions in AI. As conversational systems evolve into operational agents, their ability to maintain consistency, correctness, and alignment over time will define both the ceiling of enterprise adoption and the boundaries of regulatory frameworks.
Evans’ Law and Evans’ Ratio represent early but meaningful contributions to this emerging discipline. They do not claim to be the final word, and they are not replacements for academic research or mechanistic interpretability. They are important metrics intended to shift and shape the conversation: applied tools, developed in the real world, observed across multiple vendors, and validated through repeated empirical testing.
The frameworks matter because they shift AI risk from anecdotal to measurable, from mysterious to explainable, from unpredictable to manageable.
In a domain filled with hype that even the most experienced find challenging to navigate, Evans’ Law and Evans’ Ratio offer practical ways to understand what today’s AI systems can—and cannot—reliably do.
One month ago I published the first piece of this research, and after eight papers and nearly 2,000 downloads, the question that has come up more than any other is why these frameworks carry my name. The answer is simple: independent researchers who are outsiders to the industry, and especially women working outside major institutions, are often erased from the narratives of technical progress. In a field dominated by well-funded labs and male researchers with institutional power, attribution is uneven and easily lost. Naming the work after myself was not a gesture of ego but a safeguard: an explicit claim of authorship to prevent the disappearance of contributions developed outside traditional research channels. Clear naming is often the only defense against being overwritten by larger players once ideas gain traction.