In November 2025, an arXiv paper quietly set the agenda for what is now being presented as the next generation of agentic AI. The paper, *Solving a Million-Step LLM Task with Zero Errors*, did not claim a breakthrough in intelligence. Instead, it proposed a productive way around intelligence’s limits.
The authors demonstrated that a large language model could successfully complete a task requiring roughly one million sequential steps, provided the task was broken down into thousands of tightly scoped, independent sub-tasks, each executed in isolation, with external coordination, verification, and error correction layered on top. The model itself never reasoned across the full task. It never held a global plan. It never maintained a persistent understanding of what it was doing.
The process worked because it avoided those requirements.
The paper described this as a massively decomposed agentic process. In plainer terms, it was an argument for atomized execution: break a single process into individually manageable, highly discrete sessions; keep each session small enough that error rates remain negligible; and enforce coherence externally through orchestration, voting, tests, and human oversight. The “agent” is not an integrated intelligence. It is a coordinated swarm of stateless executions.
Not that there is anything wrong with that. But it is not agency.
What mattered most in the paper was not the model’s reasoning ability, but the architecture surrounding it. Almost all of the intelligence lived in scaffolding and coordination. Since then, nearly every major agentic system presented as a leap toward autonomy has followed this same pattern.
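That scaffolding is easy to sketch. Below is a minimal illustration of the loop, assuming a hypothetical `llm_execute_step` call and an externally supplied `verify` function; the paper’s actual decomposition, voting scheme, and verifiers are more elaborate:

```python
from collections import Counter

def llm_execute_step(step: str, context: str) -> str:
    """Hypothetical stateless model call: one tiny, scoped sub-task."""
    raise NotImplementedError  # stand-in for any LLM API

def run_atomized(steps, verify, samples=3, max_retries=5):
    """Run a long task as a chain of stateless micro-executions.

    Coherence lives in this loop, not in the model: each step is
    sampled several times, the majority answer wins, an external
    verifier gates progress, and failure triggers a retry rather
    than any reconsideration by the model.
    """
    state = ""  # the system's entire memory; the model holds none
    for step in steps:
        for _ in range(max_retries):
            votes = Counter(llm_execute_step(step, state) for _ in range(samples))
            candidate, _ = votes.most_common(1)[0]
            if verify(step, state, candidate):  # external check, not model judgment
                state = candidate
                break
        else:
            raise RuntimeError(f"step failed after {max_retries} retries: {step}")
    return state
```

Nothing in this loop asks the model to plan. Every call could be served by a fresh process that knows nothing about the others; the `state` variable is the entire memory of the system.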
At Google, agentic coding systems integrated with GitHub repositories do not ask a model to understand an entire codebase or project trajectory. Instead, they decompose work into narrowly defined diffs, testable units, and reviewable changes. The model proposes small edits; external systems run tests, flag failures, and determine whether outputs advance or regress the project state. Planning, prioritization, and rollback live outside the model.
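A sketch of that control flow, with `propose_diff` standing in for the model call and `pytest` for whatever suite gates the repository (all names here are illustrative, not Google’s actual implementation):

```python
import subprocess

def propose_diff(issue: str) -> str:
    """Hypothetical model call: returns a small patch for one scoped issue."""
    raise NotImplementedError

def tests_pass() -> bool:
    return subprocess.run(["pytest", "-q"]).returncode == 0

def handle_issue(issue: str, attempts: int = 3) -> bool:
    """The model proposes; the test suite decides; git provides rollback."""
    for _ in range(attempts):
        patch = propose_diff(issue)
        subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)
        if tests_pass():
            subprocess.run(["git", "commit", "-am", f"bot: {issue}"], check=True)
            return True
        subprocess.run(["git", "checkout", "--", "."], check=True)  # discard the regression
    return False  # escalate to a human; the model never learns it failed
```

The model sees one issue at a time; whether the project is actually advancing is decided entirely by the test suite and the commit history.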
The same structure appears in Shopify’s Google-integrated commerce agents. These systems do not autonomously “run a business.” They operate across sharply bounded workflows: catalog updates, pricing adjustments, campaign generation, fulfillment logic. Each action is locally optimized, validated against predefined constraints, and coordinated by orchestration layers that decide what happens next. The agent executes. The system decides.
Anthropic’s recent compiler project makes the pattern explicit. Sixteen Claude instances were run in parallel across roughly two thousand sessions to produce a large Rust-based C compiler. No single model instance ever held the compiler in its entirety. Global coherence was enforced by humans, tests, version control, and orchestration code. The achievement was real, but it was architectural, not cognitive.
Across these systems, the same design philosophy repeats:
| System | What Is Decomposed | Where Coherence Lives | What the Model Does |
|---|---|---|---|
| Million-step arXiv model | Symbolic task steps | External verification & voting | Executes micro-steps |
| Google GitHub agents | Code changes | CI, tests, repo state | Proposes bounded edits |
| Shopify commerce agents | Business workflows | Orchestration & constraints | Executes scoped actions |
| Anthropic compiler project | Compiler components | Humans, tests, VCS | Generates local code |
In every case, reliability is achieved not by intelligence, but by containment. Error rates stay low because tasks are small. Progress is real because coordination compensates for the model’s lack of persistence. What looks like collaboration is, in fact, controlled fragmentation.
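The arithmetic makes the containment argument concrete. If a single model call fails a step with probability ε, a chain of N steps succeeds with probability (1 − ε)^N, which collapses for any realistic ε; redundancy and voting are what claw it back. An illustrative calculation, with assumed numbers rather than the paper’s measured rates:

```python
# Illustrative numbers, not the paper's measured rates.
eps = 0.01         # assumed per-step error rate of a single model call
N = 1_000_000      # steps in the full task

single = (1 - eps) ** N
print(f"one sample per step:  {single:.3e}")  # underflows to ~0: certain failure

# Majority-of-3 voting: a step fails only if at least 2 of 3 samples err.
step_fail = 3 * eps**2 * (1 - eps) + eps**3
voted = (1 - step_fail) ** N
print(f"majority of 3:        {voted:.3e}")   # ~1e-130: still hopeless

# A million clean steps requires per-step failure well below 1/N,
# hence ever-smaller steps and ever-more-aggressive voting.
print(f"required per-step failure: well under {1/N:.0e}")
```

Shrinking each step until its error rate is tiny, then voting that rate down further, is the whole trick. No improvement in reasoning is involved.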
This matters because these systems are increasingly described as evidence that AI has crossed a threshold toward autonomy or general intelligence. That framing is misleading.
Even OpenCLAW, formerly known as ClawdBot, emulates this model, although it does so explicitly and transparently. OpenCLAW does not pretend that intelligence has emerged inside the model. It embraces decomposition as a design choice, openly externalizing memory, coordination, retries, and control into tooling and orchestration layers. The system is clever precisely because it makes no claim to integrated cognition: it treats the model as a powerful but stateless executor and builds reliability around that reality. In contrast to many commercial agentic systems, where atomization is presented as autonomy and scaffolding is mistaken for intelligence, OpenCLAW is honest about where the work is actually being done. It is not sleight of hand; it is an engineering response to known model limits.
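The pattern reduces to a wrapper like the one below. Every name is hypothetical, and this is a caricature of the approach rather than OpenCLAW’s actual code: memory is a file, retries are a loop, and the model call in the middle is stateless:

```python
import json
from pathlib import Path

MEMORY = Path("agent_memory.json")  # memory lives on disk, not in the model

def llm_call(prompt: str) -> str:
    """Hypothetical stateless model call."""
    raise NotImplementedError

def run_task(task: str, retries: int = 3) -> str:
    """A stateless executor wrapped in external memory, retries, and control."""
    memory = json.loads(MEMORY.read_text()) if MEMORY.exists() else []
    prompt = f"Context so far: {json.dumps(memory[-10:])}\nTask: {task}"
    for _ in range(retries):
        try:
            result = llm_call(prompt)
        except Exception:
            continue  # the retry policy belongs to the wrapper, not the model
        memory.append({"task": task, "result": result})
        MEMORY.write_text(json.dumps(memory))  # persistence is the tooling's job
        return result
    raise RuntimeError(f"task failed after {retries} attempts: {task}")
```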
Today’s announcements from OpenAI (Frontier) and Anthropic about multi-agent coordination are both variations on the same architectural move. Tasks are decomposed into bounded units, routed across multiple model instances, and stitched together by external systems that manage state, retries, validation, and sequencing. The models themselves remain stateless and short-horizon; coherence is enforced outside the model.
What’s new in today’s platform announcements is not intelligence, but industrialization: better tooling for dispatch, monitoring, rollback, and evaluation. Frontier emphasizes enterprise governance, safety rails, and integration; Anthropic emphasizes parallelism, task routing, and reliability under load. In both cases, the “agent” is a managed execution node inside a larger control plane.
What these architectures demonstrate is not that models can reason over long horizons, but that long horizons can be simulated by stitching together short ones. The model never needs to notice when an assumption breaks, when priorities shift, or when the task itself should be reframed. Those judgments are made elsewhere, or not at all. This is not intelligence, not in the human sense. It is pure orchestration.
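Stitching, in code, is nothing more exotic than this (a deliberately bare sketch; `llm_call` is again a hypothetical stateless model invocation):

```python
def llm_call(prompt: str) -> str:
    """Hypothetical stateless model call with a short context window."""
    raise NotImplementedError

def long_horizon(subtasks):
    """Simulate a long horizon by stitching short ones together.

    Each call sees only a compressed summary of everything before it.
    If an assumption from step 3 becomes invalid at step 300, nothing
    here notices: reframing the task is simply not represented.
    """
    summary = ""
    for sub in subtasks:
        result = llm_call(f"Work so far: {summary}\nNext sub-task: {sub}")
        summary = llm_call(f"Compress into a short summary:\n{summary}\n{result}")
    return summary
```

The loop produces something that looks like sustained attention, but noticing a broken assumption or abandoning a bad plan has no line of code here, and adding one would mean placing the judgment somewhere other than the model.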
There is nothing inherently wrong with this. Atomized systems are powerful, economically disruptive, and genuinely useful. They compress labor, increase throughput, and allow fewer humans to supervise vastly more output. But calling this progress toward artificial general intelligence (which some are doing) obscures what is actually happening.
The danger is not that these systems are too intelligent. It is that their limitations are hidden by scaffolding. As long as tasks remain static, goals remain fixed, and correctness is externally verifiable, atomized agents perform well. When context shifts, significance changes, or objectives conflict, the illusion breaks.
What we are witnessing, then, is not the emergence of general intelligence, but a sophisticated form of AGI theatre: systems that look agentic, collaborative, and autonomous precisely because the hard parts of intelligence have been engineered out of the model and absorbed by infrastructure.
The million-step paper did not prove that models can think longer. It proved that thinking can be avoided, and documented exactly how to avoid it. And every major agentic system since has followed its lead, out of sheer necessity, without explicit acknowledgement.

