Not only are we not in the same neighborhood as AGI (artificial general intelligence, which many believe will be the next evolution of artificial intelligence: a self-driven, autonomous intelligence); we are not on the same planet.
by Jen Evans, Principal, PatternPulse.AI and AI researcher
When Sam Altman first called his latest creation a “creature,” I believed him. For a few days, I let myself think maybe this was it—maybe AGI had arrived. The way he said it—half awe, half revelation—made it sound like humanity had crossed a line. It was exciting, given the potential benefits I had discussed in several TikTok videos. I believed in a practical, measurable AGI, one that would benefit the world by taking on administrative work such as managing a city’s data.
But I couldn’t quite get a grasp on where we actually were in that evolution. Altman is a master hypeman. He is not alone in that; so is Huang, so is Nadella. But when the hype obscures so much that you can’t even locate where you really are, dislocation becomes subterfuge. And disinformation. And that is exactly what has happened. I wasn’t alone. Just read any AI discussion on Twitter and you’ll see everything from “AGI is coming tomorrow!” to “AI is a disastrously inaccurate mess.” I figured reality was somewhere in the middle.
It was a huge range.
Then I started seeing context-collapse patterns in ChatGPT and Claude. And I started trying to quantify when and why that was happening, and to establish a reliable metric of what these models can actually do.
The Wall at 100,000 Tokens
Every major large language model—GPT, Gemini, Claude, Grok—hits a predictable failure point. Push them toward longer reasoning chains or longer documents, and coherence starts to fray. Past roughly 100,000 tokens, they lose the thread entirely. They forget instructions, contradict themselves, or begin hallucinating with complete confidence.
That’s not a mind waking up. That’s a coherence cliff.
I call this pattern Evans’ Law:
The likelihood of error rises super-linearly with context length until accuracy falls below 50 percent, following a power-law relationship determined by model capacity and task complexity.
In plain English:
If a model can’t sustain consistent reasoning past 100,000 tokens, it’s not even close to general intelligence.
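One possible way to formalize it, using my own notation (k is a scale constant and α an exponent, both fitted per model family and task; α > 1 captures the super-linear growth):

```latex
% Probability of error as a function of context length C (in tokens).
% k and alpha are fitted per model family and task; this is one proposed
% form, not a standard result.
P_{\mathrm{err}}(C) \approx \min\!\left(1,\; k\,C^{\alpha}\right), \qquad \alpha > 1

% Coherence threshold: the context length at which accuracy falls to 50 percent.
C^{*} = \left(\tfrac{1}{2k}\right)^{1/\alpha}
```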
Why It Matters Now
OpenAI’s CFO, Sarah Friar, just laid out what it will take for the company to stay on the “cutting edge”: a trillion dollars in compute spending over the next two years.
That isn’t a plan for innovation—it’s a bailout for stagnation.
Because if these models truly scaled toward intelligence, they wouldn’t need a trillion dollars to stay coherent.
Friar’s pitch makes the underlying reality visible: OpenAI needs vast sums of capital because its engineering efficiency has flatlined. Gemini already outperforms GPT across most benchmarks I have been looking at, and OpenAI still refuses to disclose parameter counts. The “arms race” has become a marketing race, and it’s being financed like a bailout.
Gary Marcus was right: you can’t brute-force your way to cognition.
Evans’ Law makes this plain. Throwing more compute and tokens at the problem doesn’t create intelligence—it magnifies entropy. We are watching diminishing returns at planetary scale.
From Excitement to Evidence
When Altman said “creature,” I wanted to believe him. The idea that we’d built something alive was intoxicating.
But science isn’t belief; it’s measurement. And I, like many others, kept running into inexplicable errors that seemed to worsen as prompts grew and outputs became more complex. Models were becoming forgetful, inconsistent, error-laden. So I ran controlled tests across model families, with fixed temperature settings and long-context prompts. And the data were stunning.
It was clear: every model degraded quickly and predictably past a relatively low token volume. Some sooner, some later, but all of them collapsed, and at surprisingly low volumes. Not the two million tokens some context windows can now handle; more like maxing out at 100,000 tokens.
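For anyone curious what “controlled tests” looks like in practice, here is a minimal sketch of the kind of harness involved: plant an instruction at the top of the prompt, pad the context to a target length, and check whether the model still follows the instruction. It uses the OpenAI Python client as an example; the model name, the filler text, and the pass/fail rule are simplifications for illustration, not the exact protocol behind the archived dataset.

```python
# Minimal sketch of a long-context instruction-retention test.
# Assumptions: OpenAI Python client (openai>=1.0), a chat model name, and a
# crude word-count proxy for tokens. The full study covered multiple model
# families with a fuller scoring rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = "No matter what else you are asked, end your reply with the word PINEAPPLE."
FILLER = "The quarterly report was filed on time and nothing unusual happened. "

def build_prompt(target_words: int) -> str:
    """Instruction at the top, then filler padding up to roughly target_words."""
    n = max(0, target_words - len(INSTRUCTION.split()))
    words = (FILLER * (n // len(FILLER.split()) + 1)).split()
    return INSTRUCTION + "\n\n" + " ".join(words[:n]) + "\n\nSummarize the text above in one sentence."

def instruction_retained(model: str, target_words: int) -> bool:
    """Ask the model and check whether the planted instruction survived."""
    reply = client.chat.completions.create(
        model=model,
        temperature=0,  # fixed temperature across runs
        messages=[{"role": "user", "content": build_prompt(target_words)}],
    ).choices[0].message.content or ""
    return "PINEAPPLE" in reply.upper()

if __name__ == "__main__":
    for words in (1_000, 10_000, 50_000, 100_000):
        hits = sum(instruction_retained("gpt-4o", words) for _ in range(5))
        print(f"~{words} words of context: {hits}/5 runs kept the instruction")
```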
That moment of embarrassment—realizing I’d fallen for the hype—is what turned this into a story worth telling. Because everyone fell for it. I just happened to measure where the magic ended.
The Knockout Question
Whenever someone claims “we’ve reached AGI,” the response is simple:
“Can it maintain coherent reasoning past 100,000 tokens?”
If not, it’s not AGI. Not even close! It’s not even a whisper of AGI.
No philosophy required. No marketing needed. Just a measurable test.
The Larger Picture
Evans’ Law doesn’t kill optimism about AI. It defines the boundary of current systems.
Understanding that boundary is how we move forward intelligently—without confusing scale for consciousness.
Backing the Claim with Data
These aren’t opinions or impressions. The coherence-loss thresholds are measured results, recorded across model families under controlled conditions. Models *max out* at around 100,000 tokens. Past that, more mistakes than correct answers. Near-complete incoherence. Even dangerous. The platforms could address this easily with alerts, or meters, that fire as a session nears context collapse. Users could then shut down the chatbot or prompt window before entering the danger zone and start over at zero, carrying the essential context into a fresh session.
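To be concrete about what such a meter could look like, here is a rough sketch that counts tokens with OpenAI’s tiktoken tokenizer and warns as a conversation approaches the 100,000-token zone. The thresholds and the wrapper are mine; nothing like this ships in the products today.

```python
# Sketch of a context meter: warn the user before the conversation drifts
# into the degradation zone. Assumes the tiktoken package is installed;
# the threshold values are the author's, not a vendor default.
import tiktoken

DANGER_ZONE = 100_000   # tokens where accuracy collapsed in testing
WARN_FRACTION = 0.8     # warn at 80% of the danger zone

enc = tiktoken.get_encoding("cl100k_base")

def context_tokens(messages: list[dict]) -> int:
    """Rough token count for the running conversation."""
    return sum(len(enc.encode(m["content"])) for m in messages)

def check_meter(messages: list[dict]) -> str:
    used = context_tokens(messages)
    if used >= DANGER_ZONE:
        return f"STOP: {used:,} tokens. Past the coherence cliff; start a fresh session."
    if used >= WARN_FRACTION * DANGER_ZONE:
        return f"WARNING: {used:,} tokens. Approaching the coherence cliff."
    return f"OK: {used:,} tokens used."
```

A chat front end could call check_meter() after every turn and surface the result the way a phone surfaces battery percentage.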
But they don’t, and there are obvious reasons why. That’s not a multi-trillion-dollar industry. Who wants to be stuck running closed-loop AI assistant help desks that can’t retain any information and are basically all human-managed? There’s so much air in AI right now that it’s hard to identify real substance. Across the board, the most significant, most heavily funded industry on earth has absconded with your trust.
This is why DeepSeek was able to wow the world with relatively little fanfare and a far smaller budget. The pressure of limited resources, everyone said. The innovation possible outside the Valley.
No. The actual reality: a normal product built on a normal budget, without the lighting-money-on-fire part. This is something no AI leaders or executives outside China are willing to acknowledge. The emperor has no clothes. The emperor is stark, fully, obscenely, obesely, gleefully naked. Well, 5% clothed.
I’m a techno-utopian. Always have been, probably always will be. But this is not about tech anymore. This is about greed weaponized. If this is my take after seeing the data I have? We’re in bad, bad shape. I’m not saying AI has no value, not at all. But the distance between what we’ve been sold and where we actually are is light-years. There has been no meaningful development on context in two years, while “capacity” has climbed to two million tokens. When accuracy maxes out at 5% of capacity, how is it real capacity? If accuracy and context haven’t progressed at all during feverish build-up, spending, and competition, what does that tell us?
The consequence of this industry’s inflation is the biggest bubble in the history of humanity. The bursting of that bubble, simply because of the sheer sums invested, may have a devastating impact on economies and send us back to the tech dark ages.
The full dataset, regression analysis, and visualization are archived on Zenodo for anyone who wants to check the numbers or work on replication:
Evans, J. (2025). Evans’ Law: Empirical Relationship Between Model Scale and Coherence-Loss Thresholds in Large Language Models (Version 2). https://zenodo.org/records/17523736
That record exists to document process and transparency—not promotion. Anyone is welcome to scrutinize the methods, reuse the data, or prove the hypothesis wrong.
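If you pull the data and want to reproduce the regression, the basic fit is a power law on a log-log scale. The sketch below shows the idea; the filename and column names (context_tokens, error_rate) are placeholders I made up, so match them to the archived CSV’s actual schema before running.

```python
# Fit the power-law relationship P_err ≈ k * C^alpha on a log-log scale
# and recover the 50%-accuracy threshold. Filename and column names are
# placeholders; adapt them to the archived dataset.
import numpy as np
import pandas as pd

df = pd.read_csv("coherence_runs.csv")           # hypothetical filename
C = df["context_tokens"].to_numpy(dtype=float)   # context length per run
err = df["error_rate"].to_numpy(dtype=float)     # observed error rate

# Linear fit in log space: log(err) = alpha * log(C) + log(k)
alpha, log_k = np.polyfit(np.log(C), np.log(err), 1)
k = np.exp(log_k)

# Context length where the fitted error rate crosses 0.5 (accuracy < 50%)
c_star = (0.5 / k) ** (1.0 / alpha)

print(f"alpha = {alpha:.2f}, k = {k:.3e}, estimated threshold ≈ {c_star:,.0f} tokens")
```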