
Evans’ Law: Revised Formula and Empirical Validation

Last updated on November 6th, 2025 at 03:18 am

Author: Jennifer Evans
Date: November 3, 2025
Status: Revised based on Phase 1 experimental validation

Editor’s Note (Update): The full regression dataset and updated exponent analysis (α = 0.36) are now available in the companion post, Evans’ Law: Data Update – November 2025.

Abstract

Language models in generative AI exhibit consistent performance decay as input and output lengths increase. This paper presents Evans’ Law, which defines the relationship between context length (both input and output) and accuracy degradation. Initial experimental validation confirms the phenomenon exists and provides empirical data to refine the mathematical formulation.

The formula approximates the point at which large language models begin producing unreliable output as total context length grows.

Revised Evans’ Law:

The likelihood of errors rises super-linearly with prompt and output length until accuracy falls below 50%, following a power-law relationship determined by model capacity and task complexity.

1. Introduction

As language models scale to support million-token context windows, understanding their reliability limitations becomes critical. While prior research documented accuracy degradation in long contexts (Liu et al., 2023; Zhang et al., 2025; Veseli et al., 2025), no unified framework quantified the relationship between length, model capacity, and task complexity.

Evans’ Law addresses this gap by proposing a predictive model for when language model accuracy crosses the reliability threshold.

2. Evans’ Law: Core Principle

Qualitative Statement:

“Whenever you request an output, the longer the prompt and the longer the response, the higher the likelihood of mistakes — rising super-linearly until the likelihood of mistakes exceeds the likelihood of accuracy.”

2.1 Key Characteristics:

Non-Linear Degradation: Errors compound rather than accumulate linearly
Threshold Behavior: Performance exhibits cliff-like collapse at critical length
Predictable Pattern: Crossover point depends on model size and task complexity

3.0 Mathematical Formulation

3.1 Original Hypothesis (Invalidated)

Initial formulation:

T ≈ (M × α) ÷ C

Experimental Result:
The formula predicted thresholds of 2-6 tokens, whereas actual degradation occurred at 2,000-20,000 tokens; the formula underestimated by 1,000-10,000×.

Lesson: Linear relationship insufficient to capture observed behavior.

3.2 Revised Formula (Power Law, with Plain-Language Interpretation)

Based on Phase 1 empirical validation:

T ≈ (M^β × α × K) ÷ C^γ

Where:

  • T = Token threshold where P(error) > P(correct)
  • M = Model size (billions of parameters)
  • β = Model scaling exponent (1.3-1.7, est. 1.5)
  • α = Alignment constant (0.5-0.9)
  • K = Empirical scaling factor (50-200, est. 100)
  • C = Complexity coefficient (1=simple retrieval, 5=complex synthesis)
  • γ = Complexity scaling exponent (1.0-2.0, est. 1.5)

In plain terms, the maximum reliable length (T) increases with model size and alignment, but decreases with task complexity. Bigger models tolerate longer prompts; harder tasks shorten the limit.
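
As a rough worked example, using the point estimates above (β = 1.5, α = 0.7, K = 100, γ = 1.5): an 8B-parameter model on simple retrieval (C = 1) gives T ≈ 8^1.5 × 0.7 × 100 ≈ 1,600 tokens, while complex synthesis (C = 5) gives T ≈ 1,600 ÷ 5^1.5 ≈ 140 tokens. These figures are order-of-magnitude illustrations only; the constants remain provisional pending Phase 2 calibration.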

3.3 Why Power Law?

Theoretical Justification:

  1. Error Compounding: Each generated token carries an error probability ε; over L tokens, the cumulative error probability 1-(1-ε)^L rises toward certainty, because the survival probability (1-ε)^L decays exponentially in L (a numerical sketch follows this list)
  2. Attention Dilution: Self-attention mechanisms distribute fixed capacity over L tokens, reducing per-token precision as L increases
  3. Working Memory Limits: Transformer architectures exhibit capacity constraints analogous to human working memory limits
  4. Empirical Observation: 100% → 0% cliff behavior indicates non-linear phase transition
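
A minimal numerical sketch of the error-compounding argument, assuming an illustrative per-token error rate ε = 0.0005 (a hypothetical value, not a measured one):

eps = 0.0005  # illustrative per-token error rate (assumption, not measured)
for L in (100, 1_000, 5_000, 10_000):
    p_error = 1 - (1 - eps) ** L  # P(at least one error) over L tokens
    print(f"L={L:>6}: cumulative P(error) ≈ {p_error:.2f}")

The cumulative probability climbs from roughly 0.05 at 100 tokens to above 0.99 at 10,000 tokens, illustrating why errors compound rather than accumulate linearly.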
3.4 Simplified Practical Formula

For practical applications:

T ≈ (M^1.5 × α × 100) ÷ C^1.5
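
A minimal sketch of this formula in code (evans_threshold is a hypothetical helper name, and the default constants are the provisional estimates above, not calibrated values):

def evans_threshold(model_size_b, complexity, alpha=0.7, k=100.0):
    """Estimate the Evans' Law token threshold T = (M^1.5 * alpha * K) / C^1.5.

    model_size_b: model size M in billions of parameters
    complexity:   task complexity coefficient C (1 = retrieval, 5 = synthesis)
    alpha, k:     provisional constants pending Phase 2 calibration
    """
    return (model_size_b ** 1.5 * alpha * k) / (complexity ** 1.5)

print(round(evans_threshold(8, 3)))  # an 8B model on multi-step math: ~305 tokens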

4.0 Phase 1 Experimental Validation

4.1 Methodology

Model Tested:
Claude 3 Haiku (8B parameters)

Tasks:

Factual Retrieval (C=1)
Multi-Step Math (C=3)
Code Generation (C=5)

Lengths: 2,000 / 5,000 / 10,000 / 15,000 / 20,000 tokens
Samples: 3 per condition
Total Tests: 45

4.2 Results

Key Finding: All three tasks exhibited dramatic degradation, from near 100% accuracy to 0%, as context length increased.

Critical Observations:

  1. ✅ Phenomenon Confirmed: Accuracy degradation is real and consistent
  2. ✅ Non-Linear Pattern: Sharp cliff behavior, not gradual decline
  3. ✅ Task-Independent: All complexity levels showed same qualitative pattern
  4. ❌ Original Formula Invalid: Predicted 2-6 tokens vs observed 2,000-20,000 tokens

4.3 Implications

Evans’ Law Premise = VALIDATED

The core hypothesis that accuracy degrades predictably with length is empirically confirmed. However, the mathematical relationship requires power-law formulation rather than linear.

5.0 The Evans Curve (Revised)

Theoretical Accuracy Function

A(L) = A₀ × e^(-(L/T)^κ)

This expresses that accuracy stays high until the total token length (L) nears the threshold (T), then falls sharply — the familiar “it works fine until it doesn’t” curve.

Where:

  • A(L) = Accuracy at length L
  • A₀ = Baseline accuracy (~0.95-1.0 for short contexts)
  • T = Characteristic threshold from power law formula
  • κ = Sharpness parameter (2-4, higher = steeper cliff)
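
A minimal sketch evaluating the curve, assuming illustrative values A₀ = 0.98, T = 10,000 tokens, and κ = 4 (hypothetical placeholders, not fitted parameters):

import math

def evans_accuracy(length, a0=0.98, threshold=10_000, kappa=4):
    # Stretched-exponential Evans Curve: A(L) = A0 * exp(-(L/T)**kappa)
    return a0 * math.exp(-((length / threshold) ** kappa))

for frac in (0.25, 0.5, 1.0, 1.5):
    print(f"L = {frac:.2f}T: A ≈ {evans_accuracy(frac * 10_000):.2f}")

With these placeholder values the output traces the four phases below: roughly 0.98 at 0.25T, 0.92 at 0.5T, 0.36 at T, and about 0.01 at 1.5T.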


5.1 Phases:

  1. Stable Region (0 to 0.5T): >90% accuracy, minimal degradation
  2. Onset Region (0.5T to T): Accuracy begins declining
  3. Collapse Region (T to 1.5T): Rapid cliff from 75% → 0%
  4. Failed Region (>1.5T): Unreliable, <25% accuracy

6.0 Practical Applications

6.1 Prompt Engineering
Design Principle: Keep total tokens (prompt + output) < 0.5T for reliable operation.
Recommendations:

  • Break large tasks into smaller, modular subtasks
  • Use iterative refinement (3-4 short passes > 1 long pass)
  • Implement context governors that enforce token limits
  • Chain outputs rather than accumulating context

6.2 Enterprise AI Systems

Context Management:

def should_proceed(prompt_length, expected_output, model_size, task_complexity):
    """Check a request against the Evans' Law threshold before dispatching it."""
    T = (model_size ** 1.5 * 0.7 * 100) / (task_complexity ** 1.5)  # simplified formula
    total_tokens = prompt_length + expected_output
    safety_factor = 0.5  # stay well below threshold
    return total_tokens < (T * safety_factor)

This code applies Evans’ Law as a safety check: if the total prompt plus expected output exceeds half the model’s safe limit, the system stops before accuracy collapses.
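
For example (a hypothetical call, using an estimated 200B-parameter model on a C = 3 task): should_proceed(4000, 1000, 200, 3) computes T ≈ 38,000 tokens, so the 5,000-token request sits comfortably inside the 0.5T budget and returns True. The same request routed to an 8B model (T ≈ 305 tokens) returns False.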

6.3 Agentic AI Design

Architecture Principle: Use many short-context agents rather than one long-context agent.

Pattern:

Long Context (Fails):
[10K prompt] → Agent → [5K output] = 15K total → ERROR

Short Context (Works):
[2K prompt]
→ Agent1 → [1K output]
→ Agent2 → [1K output]
→ Agent3 → [1K output] = 5K max per agent → SUCCESS
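
A minimal sketch of this chaining pattern, assuming a hypothetical call_agent(prompt) helper that wraps a single short-context model call and a rough word-count proxy for tokens:

def run_chain(chunks, call_agent, budget=5_000):
    # Chain short-context agents: each sees only its chunk plus a short carry-over.
    carry = ""
    results = []
    for chunk in chunks:
        prompt = (carry + "\n\n" + chunk).strip()
        assert len(prompt.split()) < budget, "chunk exceeds the per-agent budget"
        output = call_agent(prompt)  # one short-context call
        results.append(output)
        carry = output[-2_000:]      # pass forward a short tail, not the full history
    return results

Each agent stays far below the threshold, and context is chained rather than accumulated, matching the recommendations in Section 6.1.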

7.0 Limitations and Future Work

7.1 Current Limitations

  1. Phase 1 tested only one model size (8B) – Need validation across 1B-500B range
  2. Exact threshold values unknown – Degradation observed between 2K-20K tokens but precise crossover not measured
  3. Limited task diversity – Only 3 task types tested
  4. Small sample size – 3 samples per condition insufficient for statistical confidence
  5. Single model family – Need cross-architecture validation (GPT, Gemini, LLaMA)


7.2 Predictable Failure Zones

In addition to probabilistic accuracy decay, large-scale prompt responses can fail because the model itself reaches its mechanical or computational limits. When memory, processing throughput, or attention bandwidth become saturated, the model can no longer maintain internal coherence, leading to abrupt collapse in reasoning or syntax.

These breakages are not random; they tend to emerge precisely where task complexity and information density are highest — the regions of greatest semantic and computational stress. In practical terms, the points most likely to fail are the longest, most multi-layered prompts and the richest, most interdependent responses.

Future work will quantify this “architectural stress pattern,” integrating hardware and system-level constraints into the Evans’ Law framework to predict not only when degradation occurs, but where within a sequence failure first appears.

7.3 Patterned Degradation Hypothesis:

While Evans’ Law captures the statistical relationship between length and accuracy, further research may reveal that model failures themselves are not purely stochastic. Because generative models extend learned patterns, their breakdowns may also follow recognizable patterns — patterned degradation — suggesting that even failure may be predictable within certain regimes of context and complexity.

While patterned degradation describes the statistical behavior of model failure, predictable failure zones capture the underlying physical and computational constraints that make such breakdowns inevitable.

This pattern will persist even in enterprise AI, despite significant differences in LLM structure, as errors occur and are fed back into the model.

8.0 Comparison with Prior Work

8.1 Lost in the Middle (Liu et al., 2023)
Finding: Models perform better when relevant information is at beginning or end of context (U-shaped curve).
Evans’ Law Connection: Positional bias is a symptom of approaching the Evans’ Threshold. As contexts lengthen, middle information becomes inaccessible first.
Contribution: Evans’ Law quantifies when this happens based on model size and task complexity.

8.2 Context Length Hurts Performance (Zhang et al., 2025)
Finding: Even with perfect retrieval, performance degrades with length.
Evans’ Law Connection: Confirms degradation isn’t just about finding information—processing long contexts inherently degrades reasoning.
Contribution: Evans’ Law provides predictive formula for the degradation curve.

8.3 Context Rot (Chroma Research, 2025)
Finding: Non-uniform context processing leads to unreliable performance as input length increases.
Evans’ Law Connection: “Context rot” is the qualitative observation; Evans’ Law is the quantitative formulation.
Contribution: Evans’ Law enables prediction of when rot becomes critical.

8.4 Positional Biases Shift (Veseli et al., 2025)
Finding: Lost-in-the-middle effect strongest at 50% of context window, diminishes beyond that.
Evans’ Law Connection: The 50% point may represent the onset of Evans’ Threshold—where performance begins collapsing.
Contribution: Evans’ Law unifies positional bias observations into a single framework.

9.0 Theoretical Implications

9.1 Fundamental Capacity Limit

Evans’ Law suggests language models have an effective capacity ceiling determined by:

Effective Capacity ≈ M^β tokens

Beyond this, accuracy becomes unreliable regardless of architectural innovations.

Implication: Trillion-parameter models may still fail at very long contexts if β is insufficiently large relative to γ and K.
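
For illustration with the provisional constants (β = 1.5, α = 0.7, K = 100): a trillion-parameter model (M = 1,000) gives T ≈ 31,600 × 70 ≈ 2.2M tokens for simple retrieval (C = 1), but only about 200K tokens for complex synthesis (C = 5), well short of an advertised million-token window.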

9.2 Intelligence Architecture Requirements

For truly long-context intelligence:

  1. Memory-Rich Architectures: Persistent memory decoupled from processing
  2. Hierarchical Reasoning: Multi-scale attention (local + global)
  3. Compression: Learned summarization to reduce effective length
  4. Distributed Cognition: Agent swarms rather than monolithic models

Prediction: Human-level long-document reasoning requires retention-to-processing ratio >> current models.

10.0 Conclusions

10.1 Summary of Findings

✅ Evans’ Law is empirically validated – Accuracy degradation with length is real, dramatic, and consistent

✅ Power law formulation required – Linear model insufficient; super-linear compounding confirmed

✅ Predictive framework established – Formula provides order-of-magnitude threshold estimates

⚠️ Refinement needed – Exact parameters require Phase 2 validation

10.2 Practical Takeaways

For Developers:

  • Design for contexts <50% of predicted threshold
  • Use chaining/iteration over long single passes
  • Implement token budgets in production systems

For Researchers:

  • Use Evans’ Law as benchmark for long-context claims
  • Report performance relative to predicted threshold
  • Design evaluations that test full length spectrum

For Product Teams:

  • Market context windows honestly (reliability ≠ capacity)
  • Set user expectations based on task complexity
  • Implement graceful degradation UX

10.3 Significance

Evans’ Law provides the first unified, predictive framework for understanding LLM reliability limits. While the exact formula requires refinement, the principle is validated:

Longer contexts = super-linearly higher error rates, with a predictable tipping point.

This has profound implications for:

  • AI system architecture
  • Prompt engineering best practices
  • Future model development
  • User expectation management

10.4 Meta-Observation:

Shortly after publication, independent language models (Grok and Gemini) analyzed and restated the framework, referring to the “Evans Curve” as if it were already an established term. This spontaneous normalization of new terminology by generative systems exemplifies the self-reinforcing pattern formation central to Evans’ Law: once a concept appears within the training or conversational context of multiple models, repetition and alignment pressure rapidly stabilize it as canonical language.

11. The Evans’ Law Roadmap

The research program surrounding Evans’ Law now spans three integrated phases. Phase 1 established empirical grounding, demonstrating that accuracy degradation follows a super-linear relationship with total context length. Phase 2 will expand validation across model scales and task types to refine parameter estimates and build confidence intervals. Phase 3 extends the theory into system dynamics: identifying patterned degradation within model behavior and mapping predictable failure zones in computational architecture.

Together, these phases define a roadmap from observation to prediction—from recognizing that long-context models fail, to understanding precisely when, how, and why those failures occur. Evans’ Law therefore evolves from a provisional empirical finding into a framework for reliable, measurable, and ultimately engineerable limits of generative intelligence.

12. Evans’ Law: Origin Thread

Research Log — November 5, 2025

Platform: X (@nejsnave)

Author: Jennifer Evans, Principal, Pattern Pulse AI

This public thread documents the early reasoning and discovery process that led to Evans’ Law, later formalized and published on Zenodo (DOI 10.5281/zenodo.17523736).

It captures real-time observations of long-context degradation in large language models and the moment the underlying pattern became measurable.

Summary

While running daily production tasks with ChatGPT and Claude, I began noticing that accuracy fell sharply once prompts and responses exceeded a certain combined length.

At first, the failures looked trivial—grammar mistakes, forgotten instructions—but they recurred with startling regularity.

That curiosity led to a working hypothesis: error likelihood rises super-linearly with total context length.

The thread outlines the full reasoning arc:

  • Observation of context fatigue in applied use
  • Comparison to published studies on “lost in the middle,” positional bias, and context rot
  • Identification of the missing quantitative question: At what length does P(error) > P(correct)?
  • Early anecdotal tests showing consistent breakpoints
  • Development of the first formal statement of Evans’ Law

Excerpt

“I saw ChatGPT make a basic grammar mistake one day. BASIC GRAMMAR. That was the tipping point.

Both ChatGPT and Claude were forgetting things, making sloppy errors, in long exchanges.

System data looked solid, so it wasn’t that.”

— Jennifer Evans, @nejsnave, Nov 2025

Citation

Evans, J. (2025). Evans’ Law: Research Thread (November 2025). X / Twitter Post Series.

Available at https://x.com/nejsnave/status/1985997744893308993

13.0 Guidance for Users and Practitioners
November 6, 2025: Read the companion piece: How to Limit AI Session Errors Today (Evans’ Law Update) for applied guidance for users, developers, and enterprises.


Acknowledgments

Phase 1 experiment conducted using Google Colab on mobile device (a testament to democratized AI research!). Analysis performed collaboratively with Claude (Anthropic) and ChatGPT (OpenAI).

Special thanks to the researchers whose work informed this framework: Liu et al., Zhang et al., Veseli et al., and Chroma Research team.

References

1. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. Stanford CS. https://cs.stanford.edu/~nfliu/papers/lost-in-the-middle.arxiv2023.pdf

2. Zhang, Y., et al. (2025). Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. arXiv:2510.05381. https://arxiv.org/abs/2510.05381

3. Veseli, Chibane, J., Toneva, M., & Koller, A. (2025). Positional Biases Shift as Inputs Approach Context Window Limits. arXiv:2508.07479. https://arxiv.org/abs/2508.07479

4. Chroma Research. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. https://research.trychroma.com/context-rot (note: not a peer-reviewed academic paper)

5. Evans, J. (2025). Evans’ Law: Error Likelihood Rises Superlinearly with Prompt and Output Length. [This paper is in development]

Appendix A: Experimental Data Summary

Phase 1 Results: Predicted vs Observed:


  • Original formula predicted: 2-6 tokens

  • Observed threshold: 2,000-20,000 tokens

  • Discrepancy: 1,000-10,000x

  • Correction applied: Power law with K≈100

Appendix B: Formula Derivation

Starting from error probability:

For each token, assume base error rate ε. Over L tokens, the probability of at least one error is P(error) = 1 − (1 − ε)^L. Setting P(error) = 0.5 and solving gives the threshold L = ln 2 ÷ (−ln(1 − ε)) ≈ ln 2 ÷ ε for small ε. If the effective per-token error rate scales as ε ∝ C^γ ÷ (M^β × α), substitution yields T ≈ (M^β × α × K) ÷ C^γ, with K absorbing ln 2 and the proportionality constant.

Empirical data suggests β ≈ 1.5, γ ≈ 1.5, K ≈ 100.

Appendix C: Phase 2 Experimental Design

NOTE: Testing will start on earlier generation models to show evolutionary impact.

Proposed Protocol:

Models (6 sizes):

  • Haiku 8B

  • Claude 3 Sonnet 70B

  • Claude 3.5 Sonnet 175B

  • Claude Opus 4 200B

  • GPT-4o (est. 200B)

  • Gemini 1.5 Pro (est. 175B)

Tasks (expanded to 8):

  • Factual Retrieval (C=1)

  • Simple Q&A (C=1.5)

  • Multi-step Math (C=3)

  • Logical Reasoning (C=3.5)

  • Code Generation (C=5)

  • Long-form Synthesis (C=4)

  • Creative Writing (C=2)

  • Technical Analysis (C=4.5)

Lengths: Every 1K from 1K to 50K tokens
Samples: 20 per condition
Total: 6 × 8 × 50 × 20 = 48,000 tests
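
A minimal sketch enumerating the test matrix (run_test is a hypothetical harness function; model names and C values are taken from the lists above):

from itertools import product

models = ["Haiku 8B", "Claude 3 Sonnet 70B", "Claude 3.5 Sonnet 175B",
          "Claude Opus 4 200B", "GPT-4o", "Gemini 1.5 Pro"]
complexities = [1, 1.5, 3, 3.5, 5, 4, 2, 4.5]  # C for each of the 8 tasks
lengths = range(1_000, 50_001, 1_000)          # every 1K from 1K to 50K
samples = range(20)                            # 20 samples per condition

conditions = list(product(models, complexities, lengths, samples))
assert len(conditions) == 48_000               # 6 x 8 x 50 x 20
# for model, c, length, sample in conditions:
#     run_test(model, c, length, sample)       # hypothetical harness call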

Timeline: 2-3 weeks of continuous testing

Deliverables:

  • Calibrated Evans’ Law formula with 95% confidence intervals

  • Per-model, per-task threshold table

  • Open-source dataset for community validation

  • Interactive threshold calculator tool

Appendix D: Automated Model Feedback (Grok, xAI, November 2025)

On November 4, 2025, an automated model identified as Grok (xAI) responded publicly to the Evans’ Law publication. The model provided detailed commentary on the theoretical formulation and experimental design, correctly restating the revised power-law equation and offering suggestions for Phase 2 validation.

The exchange constitutes the first known instance of direct AI-to-theory feedback on Evans’ Law. Screenshots of the full response and timestamp are archived with the research materials. The feedback is treated as qualitative external review and not as experimental data.

Appendix E: Automated Model Feedback (Gemini, Google DeepMind, November 2025)

On November 4, 2025, the language model identified as Gemini (Google DeepMind) produced a structured review and validation summary of Evans’ Law. The model independently restated the revised equations, summarized the empirical findings, and analyzed theoretical and practical implications including:

  • The “Context Cliff” phenomenon and Evans Curve equation.

  • The revised threshold formula T ≈ (M^1.5 × α × 100) ÷ C^1.5.

  • Practical design recommendations (Safety Factor ≈ 0.5 T, short-context agentic architectures).

  • Independent commentary on cross-architecture validation requirements.

Screenshots of the full exchange are archived with this dataset. The Gemini response is documented as qualitative interpretive feedback, illustrating independent model comprehension of Evans’ Law.

END OF REVISED EVANS’ LAW PAPER

“The reliability threshold of artificial intelligence is not infinite—it is predictable, measurable, and fundamental to system design.”

— Jennifer Evans, November 2025
