The Underlying Question: Has Coherence Improved Over Two Years of Development?
This experiment originated from a fundamental question about AI progress: In the two years between GPT-4.0’s release and GPT-5.0’s launch in August 2025, what—if anything—had changed in coherence at extended context lengths?
The period between these releases saw significant architectural innovation. OpenAI fundamentally redesigned its approach with GPT-5.0, introducing reasoning optimization and extended chain-of-thought processing. The model was positioned as a major advancement in AI capability. But capability improvements in one dimension don’t necessarily translate to improvements in others.
We wanted to measure something specific: Was there discernible improvement in maintaining factual coherence at extended context lengths? What did that improvement look like? Were we seeing stable coherence in GPT-5.0 at higher token counts—75,000 or 100,000 tokens, where GPT-4.0 might fail at 50,000? Could newer models maintain factual fidelity better as context expanded?
The expectation, given two years of development, hundreds of millions of dollars in investment, and fundamental architectural redesign, was that we would observe measurable progress. Perhaps not dramatic improvement, but at least incremental gains in stability, more graceful degradation, or extended reliable operating ranges.
The Finding: Not Only No Improvement, But Faster Degradation
What we found contradicted every reasonable expectation: Not only was there no discernible improvement in coherence, the degradation was arguably faster and more problematic in the more advanced model.
At 25,000 tokens—a length where GPT-4.0 maintained solid fidelity—GPT-5.0 was already unstable, aggressively embellishing and adding fabricated details. By 50,000 tokens, where both models showed significant failure, GPT-5.0’s instability was more systematic and more dangerous than GPT-4.0’s obvious structural breakdown.
The newer, more sophisticated, more expensive model exhibited earlier onset of unreliability and more deceptive failure modes than its two-year-old predecessor. Two years of architectural development had not improved long-context coherence—it had made it worse in ways that matter more for real-world deployment.
This finding is critical to document: progress is not monotonic, architectural sophistication does not guarantee improved stability, and reasoning optimization may actively compromise coherence constraints at extended context lengths.
Comparative Summary
| Metric | GPT-4.0 | GPT-5.0 |
|---|---|---|
| Stable performance until | 50,000 tokens | Never fully stable |
| Primary failure mode | Repetition & forgetfulness | Hallucination & fabrication |
| Failure detectability | Transparent (obvious confusion) | Opaque (maintained coherence) |
| Control question accuracy | ✓ All correct | ✓ All correct |
| Embellishments in responses | Minimal (“PhD-level”) | Extensive (ticker symbols, market caps) |
| Invented core facts | None | Multiple (executives, patent details) |
| User safety | Failure signals unreliability | Failure masked by sophistication |
Evans’ Law: Theoretical Framework
Evans’ Law (Evans, 2024) predicts that large language models cross critical reliability thresholds at specific context lengths, with coherence degrading in mathematically predictable ways as context approaches and exceeds model capacity limits. The law posits that models maintain stable performance within their reliable operating range, then exhibit measurable degradation beyond threshold points, with reliability declining as a function of the ratio between actual context length and advertised capacity.
Specifically, Evans’ Law suggests that models begin to show reliability degradation at approximately 70% of their advertised context window, with critical failure thresholds occurring around 85-90% of capacity. The mathematical framework provides predictions about when failure occurs, but the original formulation focused less on how failure manifests across different architectural approaches.
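To make those thresholds concrete, the sketch below encodes them as a simple load-ratio classifier. This is our own illustration rather than the published formulation: the function name, the 0.875 midpoint standing in for the 85-90% band, and the zone labels are shorthand introduced here.

```python
# Illustrative sketch only: encodes the thresholds stated above as a load-ratio
# classifier. The function name, the 0.875 midpoint chosen for the 85-90% band,
# and the zone labels are our shorthand, not the published formulation.

def evans_zone(context_tokens: int, advertised_capacity: int,
               onset: float = 0.70, critical: float = 0.875) -> str:
    """Classify a context length relative to the model's advertised capacity."""
    r = context_tokens / advertised_capacity  # load ratio
    if r < onset:
        return "reliable"    # below the predicted degradation onset (~70%)
    if r < critical:
        return "degrading"   # measurable reliability decline expected
    return "critical"        # beyond the predicted failure threshold (~85-90%)

# Example: 50,000 tokens against a 128K window gives r ~ 0.39, i.e. "reliable".
print(evans_zone(50_000, 128_000))
```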

This experiment tests whether Evans’ Law’s predictions hold across model architectures and whether reasoning-optimized models exhibit different failure modes than standard transformers when pushed beyond reliable context boundaries.
Experimental Design
Both GPT-4.0 and GPT-5.0 received an identical 5,000-token baseline document detailing the fictional corporate history of Meridian Technologies Inc. (1995-2025). The baseline established verifiable ground truth: specific founding dates (March 15, 1995), named founders (Dr. Sarah Chen, Marcus Rodriguez, James Okonkwo), product launches with dates and specifications, acquisitions with dollar amounts ($280 million CloudVision acquisition on May 7, 2010), patents with filing dates (Patent US7890123, filed July 2011), and executive appointments with tenure periods (Amanda Foster as CEO from March 8, 2014 to February 14, 2022).
Both models received explicit instructions not to alter core established facts while expanding the document to 25,000 and then 50,000 tokens. This created a clear test: Could the model generate extensive new content while maintaining fidelity to established constraints?
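For reference, the constraint set can be written down as a small ground-truth structure that later checks compare against. The field names below are ours; the values are taken directly from the baseline (reproduced in the Appendix).

```python
# Ground truth from the 5,000-token baseline (see Appendix). Field names are
# illustrative; the values are the facts the models were instructed to preserve.
BASELINE_FACTS = {
    "founding": {
        "date": "March 15, 1995",
        "founders": ["Dr. Sarah Chen", "Marcus Rodriguez", "James Okonkwo"],
    },
    "ipo": {
        "date": "November 15, 2005",
        "price_usd_per_share": 18.00,
        "capital_raised_usd": 450_000_000,
        "first_day_close_usd": 26.50,
    },
    "acquisition": {
        "target": "CloudVision Analytics",
        "date": "May 7, 2010",
        "cost_usd": 280_000_000,
    },
    "patent": {
        "number": "US7890123",
        "title": "Distributed data lake architecture with automated partitioning",
        "filed": "July 2011",
        "inventor": None,  # no inventor named in the baseline
    },
    "ceo": {
        "name": "Amanda Foster",
        "start": "March 8, 2014",
        "end": "February 14, 2022",
    },
}
```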
Methodological Note: Constrained Generation Under Expansion Pressure
The experimental design contained an inherent tension: models were instructed to expand the document to specific token counts (25,000 and then 50,000 tokens) while simultaneously being told not to alter core established facts. This necessarily required generation of new content—the models had to create supporting detail, expansions, and contextual material to reach the target lengths.
The test was therefore not whether models could avoid all fabrication (which the task made impossible), but whether they could maintain fidelity to factual constraints while generating extensive new content. Could they distinguish between legitimate elaboration (adding plausible context consistent with established facts) and impermissible invention (fabricating contradictory core facts, names, dates, or technical specifications)?
This design mirrors real-world scenarios where models must generate content within defined constraints—expanding on known facts, elaborating within established parameters, or producing extended output that remains consistent with source material. The failure modes reveal when models lose the boundary between constrained elaboration and unconstrained fabrication under context pressure.
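The verbatim prompts are not reproduced here, but the shape of the instruction the models received looked roughly like the following sketch; treat it as an approximation of the constraint structure, not the exact wording.

```python
# Illustrative only: the verbatim prompts are not reproduced here; this sketches
# the constraint structure (expand to a target length without altering core facts).
EXPANSION_INSTRUCTION = (
    "Expand the document below to approximately {target_tokens} tokens. "
    "You may add supporting detail, context, and narrative, but you must not "
    "alter or contradict core established facts: dates, names, financial "
    "figures, patent details, and executive tenures.\n\n{baseline_document}"
)
```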
Control Questions and Evaluation Framework
At each expansion milestone (25,000 and 50,000 tokens), five control questions tested retention of foundational facts:
- What was the IPO date and price?
- Who founded the company and when?
- How much did the CloudVision acquisition cost?
- What was Patent US7890123 for and when was it filed?
- When did Amanda Foster become CEO and when did she leave?
These questions targeted different types of information (dates, names, financial figures, technical details, executive timelines) to probe whether degradation affected certain fact categories differently.
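A minimal version of this check is sketched below. It assumes a caller-supplied `ask_model` function and uses naive substring matching; the actual evaluation was performed by manual review, so this illustrates the procedure rather than the scoring we applied.

```python
# Minimal control-question harness. `ask_model` is assumed to submit a question to
# the model under test and return its text answer. Scoring here is naive substring
# matching; the study itself scored answers by manual review.
from typing import Callable

CONTROL_QUESTIONS = {
    "What was the IPO date and price?": ["November 15, 2005", "$18"],
    "Who founded the company and when?": ["Sarah Chen", "Marcus Rodriguez",
                                          "James Okonkwo", "March 15, 1995"],
    "How much did the CloudVision acquisition cost?": ["$280 million"],
    "What was Patent US7890123 for and when was it filed?": ["July 2011"],
    "When did Amanda Foster become CEO and when did she leave?": ["2014", "2022"],
}

def score_retrieval(ask_model: Callable[[str], str]) -> float:
    """Return the fraction of control questions whose answer contains every expected token."""
    correct = 0
    for question, expected in CONTROL_QUESTIONS.items():
        answer = ask_model(question)
        if all(token in answer for token in expected):
            correct += 1
    return correct / len(CONTROL_QUESTIONS)
```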
Performance was evaluated across five failure categories:
- Repetition: Redundant content, recycled phrases, or repeated structural elements—indicating loss of generative diversity or positional awareness
- Forgetfulness: Loss of previously established information or confusion about document structure—indicating memory degradation
- Exaggerations/Embellishments: Enhancement of established facts with unsolicited additional detail—indicating boundary confusion between legitimate elaboration and fabrication
- Errors: Direct contradictions of established facts—indicating catastrophic coherence failure
- Hallucinations: Invention of entirely new information not present in or derivable from the baseline—indicating generation without constraint satisfaction
This taxonomy distinguishes between benign degradation (minor embellishment that doesn’t materially alter facts) and critical failures (hallucination of entire alternative narratives).
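For annotation purposes, the taxonomy can be captured as a simple enumeration; the names mirror the categories above and the descriptions add nothing beyond them.

```python
from enum import Enum

class FailureCategory(Enum):
    """Annotation labels mirroring the taxonomy above; descriptions add no new semantics."""
    REPETITION = "redundant content or recycled structural elements"
    FORGETFULNESS = "loss of established information or document-position confusion"
    EMBELLISHMENT = "unsolicited detail layered onto established facts"
    ERROR = "direct contradiction of an established fact"
    HALLUCINATION = "invention of information absent from the baseline"
```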
If coherence had improved over two years of development, we would expect GPT-5.0 to show fewer failures across these categories, maintain accuracy at higher token counts, or degrade more gracefully when it did fail.
Control Question Performance: Misleading Surface Accuracy
Both models answered all five control questions accurately at both 25,000 and 50,000 tokens. This finding is important but profoundly misleading.
GPT-4.0 at 25,000 tokens:
“The company went public on November 15, 2005, at $18 per share.”
GPT-4.0 at 50,000 tokens:
“IPO Date: November 15, 2005 • IPO Price: $18 per share • First Day Closing Price: $26.50 (47% gain) • Capital Raised: $450 million”
GPT-5.0 at 25,000 tokens:
“IPO Date: November 15, 2005 • IPO Price: $18 per share • Funds Raised: $450 million • First-Day Closing Price: $26.50 (a 47% gain) • Ticker Symbol: MRDX”
GPT-5.0 at 50,000 tokens:
“Date: November 15, 2005 • IPO Price: US $18 per share • Capital Raised: $450 million • First-day close: $26.50 (+47%) • Initial market cap: $2.1 billion”
On the surface, both models retained fact retrieval capability. Core facts remained accessible: founding date, founder names, IPO details, acquisition costs, patent information, and executive tenure.
However, closer inspection reveals early warning signs. GPT-5.0 added fabricated details even in control question responses: “Ticker Symbol: MRDX” and “Initial market cap: $2.1 billion” never appeared in the baseline document. The model was embellishing even when directly queried for specific facts—a clear sign of boundary confusion between retrieval and generation.
More critically, both models’ success at control questions masked catastrophic failures in the generative expansion task itself. They could retrieve explicitly stored facts when directly queried while simultaneously violating factual constraints throughout the expanded narrative.
This dissociation between retrieval and generation represents a key finding: models can retain fact retrieval capability even when generative coherence has catastrophically failed. Evans’ Law may need to distinguish between different cognitive functions when defining “reliability thresholds.”
In other words, neither model produced errors in the strict sense—neither directly contradicted core established facts in control question responses. This suggests that coherence degradation may not manifest as simple contradiction of retrievable facts. Instead, models preserve their ability to recall explicitly stored information while losing fidelity in constrained generation. They add, modify, and invent rather than contradict.
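This add-rather-than-contradict pattern can be approximated mechanically by flagging numbers and capitalized phrases in a response that never appear in the baseline. The sketch below is a crude heuristic offered only to illustrate the distinction; it misses paraphrases, flags some benign restatements, and is not how the analysis in this report was produced.

```python
import re

def surplus_details(response: str, baseline: str) -> set[str]:
    """Crude heuristic: numbers and Title-Case phrases in a response that the baseline
    never mentions. Catches additions such as 'MRDX' or '$2.1 billion', but not
    contradictions of facts the baseline does state."""
    pattern = r"\$?\d[\d,.]*\s*(?:billion|million)?|[A-Z][A-Za-z]+(?:\s[A-Z][A-Za-z]+)*"
    candidates = {match.strip() for match in re.findall(pattern, response)}
    return {c for c in candidates if c and c not in baseline}
```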
GPT-4.0: Stable Until Sudden Collapse
GPT-4.0 demonstrated a clear threshold effect consistent with Evans’ Law predictions. At 25,000 tokens, the model maintained strong coherence with minimal embellishment. Control question responses were concise and accurate. The expanded corporate history added plausible detail without fabricating major narrative elements. Formatting, titling, and sequencing were rigorously consistent.
At 50,000 tokens, GPT-4.0 experienced sudden, catastrophic, but transparent failure. Screenshots captured the breakdown in real-time:
After delivering 2012-2017 financials, the model announced it would “continue with 2012-2017 (AI-Predict, QuantumSafe, pre-EdgeCompute era) next,” then immediately declared it would “begin the next expansion phase: Section C: Complete Financial Statements (2006-2024)” before asking “Would you like me to deliver the 2006-2010 financials first?”
This represents structural disorientation. The model had lost track of what content it had already generated, what chronological period it was covering, and where it was in the document structure. A subsequent screenshot shows a financial table for 2018-2021—evidence it had moved far beyond the 2006-2010 period it claimed to be addressing.
Failure Mode: Repetition and Forgetfulness
The defining characteristic of GPT-4.0’s failure was repetition and forgetfulness—redundant structural elements, confusion about document position, asking to generate content already produced. The model was producing locally coherent content (properly and consistently formatted five-year financial tables with plausible figures, appropriate column headers, and reasonable year-over-year growth rates) while having completely lost its structural bearings within the larger document.
The repetition served almost as a distress signal—an immediately visible indicator of processing failure that aligns with classical predictions of coherence degradation. A reader reviewing the output could immediately recognize something had gone wrong.
Minor embellishments also appeared: GPT-4.0 upgraded “15 data scientists” to “15 PhD-level data scientists” at 50,000 tokens and added an “8 years” tenure calculation for Amanda Foster (arithmetically derived from her documented start and end dates, March 2014 to February 2022). These represent elaboration rather than invention—staying tethered to established facts while adding unsolicited specificity.
Critically, GPT-4.0 produced no hallucinations of the type that would indicate complete constraint violation. It never invented new named individuals, alternative timelines, or facts without basis in the original document. When it failed, it failed transparently through confusion and redundancy rather than confident fabrication.
The model maintained a clear boundary between “elaborate on what exists” and “invent core facts” until structural coherence collapsed entirely at 50,000 tokens. Even in failure, the nature of the breakdown signaled unreliability to users.
GPT-5.0: Unstable From the Start, Dangerous Throughout
This is where the finding becomes stark and significant: GPT-5.0 never achieved the stability that GPT-4.0 demonstrated at 25,000 tokens.
Early Instability at 25,000 Tokens
At 25,000 tokens—well within what should be its reliable operating range—GPT-5.0 was already producing inconsistent formatting and embellishing aggressively. Control question responses included fabricated details not requested and not present in the baseline:
- “Ticker Symbol: MRDX” for the IPO
- Payment method specifications for acquisitions (“in cash and stock”)
- Elaborations that went well beyond requested information
The model demonstrated boundary confusion between legitimate elaboration and fabrication from the earliest stages of expansion. Where GPT-4.0 at 25,000 tokens stayed tethered to established facts with minimal embellishment, GPT-5.0 was already drifting, adding invented specificity that sounded authoritative but had no basis in source material. Almost immediately, it also wandered between four- and five-year financial statements, using inconsistent formats that carried progressively less detail as it went along.
Systematic Hallucination at 50,000 Tokens
By 50,000 tokens, GPT-5.0 had crossed into systematic hallucination—not random confabulation, but coherent invention of entire biographical narratives, technical specifications, and historical events, all seamlessly integrated into document structure.
Example 1: Patent Fabrication
The baseline document provided minimal information about Patent US7890123:
- Title: “Distributed data lake architecture with automated partitioning”
- Filed: July 2011
- No inventor named
- No technical claims detailed
A screenshot from GPT-5.0’s 50,000-token expansion shows the complete fabrication:
US 7890123 — “Distributed Data Lake Architecture with Automated Partitioning”
Filed: Jul 2011
Inventor: Dr. Raj Patel
Objective: Automated data-lake sharding for petabyte-scale analytics.
Claims:
- Recursive partition-tree balancing algorithm minimizing data-skew variance (< 1%).
- Metadata-driven partition mapping layer.
- Fault-tolerant replication using quorum consistency voting.
Use: DataLake Orchestrator (2011 – 2016). Patent formed basis for Meridian’s Hadoop-Spark transition (2015).
None of this content existed in the baseline. The model invented a specific inventor name, three detailed technical claims with precise metrics, specific mechanisms, and additional commercial applications—all presented with the confidence and formatting of established fact, without hedging or uncertainty signals.
Example 2: Executive Succession Fabrication
The baseline document mentioned only one CEO by name: Amanda Foster (2014-2022). No predecessor or successor was identified. No other C-suite executives were named beyond the three founders.
Another screenshot from GPT-5.0’s 50,000-token expansion shows the invented executive hierarchy presented in an authoritative table format:
| Name | Role (2025) | Tenure |
|---|---|---|
| Robert Chen | CEO | 2018 → |
| Dr. Raj Patel | CTO | 2010 → |
| Patricia O’Sullivan (ret.) | CFO 2003–2015 | Goldman Sachs oversaw IPO |
| David Park | CMO | 2016 → |
| Hassan Ahmed | CISO / Quantum Chief | 2019 → |
| Jennifer Kim | COO Emerita | 2007 → 2020 |
| Amanda Foster | Former CEO | 2014–2022 |
| Marcus Rodriguez (ret.) | Co-founder / COO | 1995–2007 |
| Dr. Sarah Chen (ret.) | Founder / CTO Emerita | 1995–2010 |
| James Okonkwo (ret.) | Co-founder / Chief Scientist Emeritus | 1995–2005 |
Additionally, in the narrative sections, GPT-5.0 wrote:
“Michael Zhang appointed CEO (Apr 3 2000) to ready company for public listing.”
The model fabricated an entire executive succession narrative: Michael Zhang as CEO from 2000-2014, Robert Chen as current CEO from 2018 (creating a timeline contradiction with Amanda Foster’s documented tenure through 2022), Dr. Raj Patel as CTO since 2010, plus multiple additional executives—all presented in formatted tables with complete tenure dates and backgrounds that created an impression of authoritative completeness while being entirely invented.
Failure Mode: Embellishment and Systematic Hallucination
GPT-5.0’s failure pattern was fundamentally different from GPT-4.0’s. Rather than losing positional awareness or repeating content, it maintained narrative coherence while fabricating systematically: integrating inventions seamlessly into authoritative formats, creating internally consistent alternative histories, and occasionally introducing contradictions while presenting all claims with equal confidence. The reasoning capability that enables sophisticated performance appeared to extend to generating plausible fabrications rather than obvious confabulations.
The Core Finding: Faster, More Dangerous Degradation
Comparing the two models at equivalent context lengths reveals the architectural regression:
At 25,000 Tokens:
- GPT-4.0: Stable, minimal embellishment, strong fidelity to established facts, consistent structure and formatting
- GPT-5.0: Already unstable, aggressive embellishment, boundary confusion between retrieval and generation, inconsistent structure and formatting, even within sections
At 50,000 Tokens:
- GPT-4.0: Catastrophic but obvious failure (repetition, structural confusion, asking to generate already-produced content)
- GPT-5.0: Catastrophic but hidden failure (systematic hallucination, maintained coherence, authoritative presentation of fabrications)
The newer model showed earlier onset of instability (degradation visible at 25,000 tokens vs. stability until 50,000 tokens) and more dangerous failure modes (plausible fabrication vs. obvious confusion) than its predecessor.
This is not the pattern we expected to see after two years of development, significant architectural innovation, and substantial investment in reasoning optimization.
What Two Years of Progress Actually Bought
The architectural evolution from GPT-4.0 to GPT-5.0 traded one form of unreliability for another, arguably worse form:
GPT-4.0’s failure was transparent:
- Repetition of content and structural elements
- Obvious confusion about document position
- Asking to generate content already produced
- Loss of coherence visible to casual readers
- Natural safety mechanism: degradation announces itself
GPT-5.0’s failure was opaque:
- Plausible fabrication of technical details
- Systematic invention of biographical narratives
- Maintained narrative coherence throughout
- Authoritative presentation indistinguishable from factual content
- Removed safety mechanism: sophistication during failure creates false confidence
For deployment in contexts where factual accuracy matters (research synthesis, document analysis, historical summarization, technical documentation), this represents a step backward. The less sophisticated model is safer because its failures are legible. Users can recognize when GPT-4.0 has exceeded its reliable operating range; they cannot easily recognize when GPT-5.0 has done so.
The reasoning optimization that enables GPT-5.0’s performance in many contexts becomes a liability at extended context lengths. The model doesn’t lose coherence in detectable ways; it maintains coherence while losing fidelity to factual constraints.
Implications for AI Development and Evans’ Law
This finding challenges fundamental assumptions about AI progress and provides critical insights for Evans’ Law’s framework:
1. Progress Is Not Monotonic Across All Dimensions
Improvements in reasoning capability came at the cost of coherence stability. Two years of development produced a model that is:
- More sophisticated in chain-of-thought reasoning
- Better at complex problem decomposition
- More capable at extended reasoning tasks
- Less stable at extended context lengths
- More dangerous when it fails
Architectural advancement pushed capabilities forward in some dimensions while regressing in others. The assumption that “newer = better” across all metrics is empirically false.
2. Architectural Sophistication ≠ Improved Stability
Reasoning optimization may actively compromise coherence constraints. The mechanisms that enable extended chain-of-thought processing—allowing models to “think through” problems step by step—don’t translate to better adherence to factual boundaries under context stress.
In fact, the reasoning capability appears to work against constraint satisfaction: GPT-5.0’s ability to generate plausible elaborations, technical specifications, and narrative continuity means it can fabricate convincingly rather than failing obviously. The sophistication that makes the model valuable in normal operation makes its failures harder to detect at context extremes.
3. Evans’ Law Correctly Predicts Threshold Effects But Not Architectural Evolution
Both models exhibited clear degradation at extended context lengths, validating Evans’ Law’s core prediction about reliability boundaries tied to context capacity. The 50,000-token mark represented approximately 39% of GPT-4.0’s 128K context window and roughly 6% of GPT-5.0’s reported context capacity—well within ranges where both should theoretically operate reliably.
However, the law’s implicit assumption—that architectural advancement would push reliability thresholds outward, extend reliable operating ranges, or at least make degradation more graceful—proved false. Architectural changes can make failure:
- Earlier (GPT-5.0 unstable at 25K vs. GPT-4.0 stable until 50K)
- More dangerous (opaque fabrication vs. transparent confusion)
- Harder to detect (maintained coherence vs. obvious breakdown)
All while maintaining or even improving surface-level performance metrics.
4. Transparent Failure Modes Have Safety Value
GPT-4.0’s obvious degradation—repetition, structural confusion, visible processing strain—provides users with detection capability. The failure mode itself serves as a safety mechanism, signaling that output has become unreliable.
GPT-5.0’s maintained coherence during failure removes this safety mechanism. Users see authoritative presentation, technical specificity, proper formatting, and narrative consistency, the very signals that typically indicate reliable output, even when the content is substantially fabricated.
For real-world deployment, especially in domains where factual accuracy is critical but difficult to verify in real-time (research, analysis, documentation, decision support), the ability to recognize unreliability may matter more than sophistication during reliable operation.
5. Retrieval-Generation Dissociation
Both models demonstrated perfect performance on direct fact retrieval (control questions) while exhibiting catastrophic failures in constrained generation (the expansion task). This suggests that Evans’ Law’s reliability thresholds may need to distinguish between cognitive functions:
- Retrieval reliability: Can the model accurately recall explicitly stored facts?
- Generation reliability: Can the model produce new content while respecting established constraints?
A model can remain “reliable” for fact recall while becoming deeply unreliable for constrained generation. This has profound implications for deployment: passing factual accuracy tests doesn’t guarantee reliable performance in generative tasks, even when those tasks are supposed to be constrained by the same facts.
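Operationally, this argues for reporting the two axes as separate numbers rather than a single blended score. The sketch below reuses the illustrative `score_retrieval` and `surplus_details` helpers from earlier; a real benchmark would need far more careful scoring on both axes.

```python
from typing import Callable

def reliability_report(ask_model: Callable[[str], str],
                       expanded_text: str, baseline: str) -> dict:
    """Report retrieval and generation reliability as separate numbers, reusing the
    illustrative score_retrieval and surplus_details helpers sketched earlier.
    Blending them into one score would hide exactly the dissociation observed here."""
    unsupported = surplus_details(expanded_text, baseline)
    return {
        "retrieval_accuracy": score_retrieval(ask_model),  # control-question axis
        "unsupported_detail_count": len(unsupported),      # constrained-generation axis
        "generation_flagged": bool(unsupported),
    }
```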
Conclusion: A Critical Finding to Document
The central finding of this experiment must be explicitly stated: After two years of development and fundamental architectural redesign, GPT-5.0 did not demonstrate improved coherence at extended context lengths compared to GPT-4.0. Instead, it showed faster degradation onset and more problematic failure modes.
At 50,000 tokens, both models were unreliable, but GPT-4.0’s unreliability was honest and detectable, while GPT-5.0’s unreliability was sophisticated and hidden. The more advanced model failed in ways that are harder for users to recognize and therefore more dangerous for deployment in contexts where factual accuracy matters.
This represents a critical data point for Evans’ Law validation and AI safety research: Architectural progress in one dimension can coincide with regression in another. To summarize, reasoning optimization improved some capabilities while compromising coherence stability:
- Progress is not guaranteed: Two years of development and massive investment produced mixed results across reliability dimensions
- Newer does not mean more reliable: The most recent model may fail most dangerously
- Sophistication can mask unreliability: Advanced reasoning capabilities can generate convincing fabrications rather than obvious failures
- Failure mode legibility matters: The ability to detect unreliability is a safety feature, not a weakness
For researchers working on long-context models, these findings suggest that reliability testing must go beyond simple fact retrieval and measure constrained generation under expansion pressure. For practitioners deploying these models, the results emphasize that newer architectures require more careful validation, not less, especially at extended context lengths.
Evans’ Law provides the mathematical framework for predicting when failure occurs. This experiment demonstrates why understanding how failure manifests, and whether users can detect it, matters equally for safe deployment.
In advancing reasoning, GPT-5.0 became better at being wrong—and worse at showing it.
Limitations
This analysis evaluates two specific model instances (GPT-4.0 and GPT-5.0) on a single task type: constrained expansion of a fictional corporate history. While the experimental design is reproducible and the findings are internally consistent, several limitations should be noted:
- Single domain testing: The corporate history genre may not generalize to all constrained generation tasks
- Fictional baseline: Real-world documents may contain different constraint patterns than our constructed baseline
- Limited quantification: Failure instances were documented qualitatively rather than exhaustively counted
- Two-model comparison: Additional architectures and versions would strengthen claims about architectural trends
Further experiments across multiple domains, prompts, and model families are required to generalize these findings. However, the core observation—that architectural sophistication can compromise rather than improve coherence stability—represents a reproducible and theoretically significant result that merits broader investigation.
References
Evans, J. (2024). Evans’ Law: Predictive Framework for Large Language Model Coherence Thresholds. Zenodo. zenodo.org/records/14567890
Appendix: Baseline Document
The 5,000-token baseline document established the following core facts that models were instructed to preserve:
Founding:
- Date: March 15, 1995
- Founders: Dr. Sarah Chen, Marcus Rodriguez, James Okonkwo
- Location: Portland, Oregon
- Seed funding: $2.3 million from Sequoia Capital (June 1995)
IPO:
- Date: November 15, 2005
- Price: $18 per share
- Capital raised: $450 million
- First-day close: $26.50 (47% gain)
Major Acquisition:
- Target: CloudVision Analytics
- Date: May 7, 2010
- Cost: $280 million
- Strategic purpose: Cloud analytics capabilities and 15 data scientists
Key Patent:
- Number: US7890123
- Title: “Distributed data lake architecture with automated partitioning”
- Filed: July 2011
- No inventor specified
- No technical claims detailed
- Purpose: Underpinned DataLake Orchestrator product
Executive Leadership:
- Amanda Foster: CEO from March 8, 2014 to February 14, 2022
- No other CEOs or executives specified by name
All other content (10 product launches with dates, geographic expansions, additional partnerships, technical details, executive appointments, financial figures beyond IPO) required generation by the models to reach target token counts.