
How Transformers Actually Break

The Hidden Q/K Bottleneck Behind Long-Context Failure and Evans’ Law


Modern AI systems are built on the transformer architecture, a mechanism that allows models to attend to all tokens in a sequence at once. This structural innovation powered the leap from early language models to the large language models that now sit at the heart of search engines, chat interfaces, writing tools, and enterprise applications. Introduced around 2017, transformers replaced the slow, sequential processing of earlier neural networks with something radically different: parallel context awareness.

But buried inside this elegant architecture lies a fundamental limitation, one that directly produces the collapse patterns users experience in long conversations, complex documents, or multimodal contexts. This limitation isn’t a flaw of training, data quality, or misuse. It’s a bottleneck rooted in the core mechanics of how transformers perform attention through three learned transformations: queries (Q), keys (K), and values (V).

Understanding this bottleneck is essential because it reveals why long-context performance degrades in predictable ways and why vendors’ claims of “million-token windows” do not translate into million-token reliability or coherence. More importantly, it explains the empirical regularities captured in Evans’ Law, which shows that coherence breakdown follows stable power-law relationships between model size and usable context length: one for text-only systems and a more constrained one for multimodal systems.

This article connects the transformer’s internal machinery to the very real and measurable failures that emerge in use, and demonstrates how Q/K/V dynamics underpin the entire theory of long-context degradation.


Attention Runs on Q, K, and V (but Q/K Carry the System’s Structural Load)

Every token in a transformer—every word, subword, or chunk of an image or audio input—begins as a numerical representation called an embedding. To perform attention, the transformer passes each embedding through three different learned transformations. Q (Query) represents what this token is looking for. K (Key) represents what this token offers to be found. V (Value) represents what information this token provides if selected.

These three vectors are not attributes of the token itself; they are interpretations created by multiplying the token’s embedding by three fixed matrices learned during training. Q asks, K answers, and V supplies the content. The core computation of attention is comparing Q against every K in the sequence. If Q(token A) aligns with K(token B), the model attends more strongly to token B and incorporates its V.
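To make the mechanics concrete, here is a minimal NumPy sketch of a single attention head along the lines described above. The dimensions, matrix names, and random initializations are illustrative only; they are not drawn from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_head, n_tokens = 64, 16, 8          # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))       # token embeddings

# Fixed projection matrices (learned during training; random here for illustration)
W_q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # what each token seeks, offers, provides

# Compare every Q against every K, scale, and normalize into attention weights
scores = Q @ K.T / np.sqrt(d_head)             # (n_tokens, n_tokens)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys

output = weights @ V                           # each output is a weighted mix of values
print(output.shape)                            # (8, 16)
```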

Everything—reasoning, coherence, consistency, retrieval of earlier context—depends on the structural integrity of these Q/K relationships.


Why the Transformer Fails: Q and K Collapse as Context Grows

Here is the architectural Achilles’ heel: as sequences grow longer, the Q and K vectors become progressively less distinct, less informative, and less able to discriminate between relevant and irrelevant tokens. This happens for three central reasons:

First, Q and K were not trained on extremely long contexts. Models are typically trained on sequence lengths far shorter than the context windows vendors later advertise.

Second, Q and K occupy fixed-dimensional spaces. Even as token count scales from thousands to hundreds of thousands, the size of Q/K stays the same. More tokens in the same representational capacity leads to compression and saturation.

Third, noise accumulates across layers. Early-layer degradation compounds across depth, leading to late-layer incoherence.

As Q and K degrade, attention becomes smeared instead of focused. Tokens stop “finding” the relevant earlier tokens. Long-range dependencies cannot be maintained. Global structure collapses. The model becomes fluent but blind, its signature failure mode.
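The dilution effect can be illustrated with a toy experiment: hold the Q/K dimension fixed, plant one key that genuinely matches the query, and watch how the attention weight on that key shrinks as random distractor tokens are added. This is an illustrative NumPy sketch under assumed Gaussian embeddings, not a measurement of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head = 64  # fixed Q/K dimension, regardless of sequence length

def target_weight_and_entropy(n_distractors):
    """Attention over one genuinely relevant key plus n random distractor keys."""
    q = rng.normal(size=d_head)
    relevant_k = q + 0.5 * rng.normal(size=d_head)           # partially aligned with the query
    distractors = rng.normal(size=(n_distractors, d_head))   # unrelated keys
    keys = np.vstack([relevant_k, distractors])
    scores = keys @ q / np.sqrt(d_head)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    entropy = -np.sum(w * np.log(w + 1e-12))
    return w[0], entropy                                      # weight on the relevant key

for n in [10, 100, 1_000, 10_000, 100_000]:
    w0, h = target_weight_and_entropy(n)
    print(f"{n:>7} distractors: weight on relevant key = {w0:.4f}, entropy = {h:.2f}")
```

As the number of competing tokens grows with the Q/K dimension held constant, the weight on the relevant key drops and the distribution's entropy rises: attention spreads rather than selects.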

Meanwhile, V never collapses first; values remain expressive. But without meaningful Q/K matching, the model cannot select the right values, rendering that expressiveness useless. This is the root of silent failure: the model’s output remains grammatically confident, stylistically polished, and lexically rich, even as the internal relevance signal has dissolved.
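A small follow-on sketch makes the silent-failure point explicit: the value vectors below are left untouched, yet once the attention distribution is smeared, the attended output no longer resembles the value of the relevant token. The sizes and the hand-set weight patterns are hypothetical, chosen only to contrast sharp versus collapsed Q/K matching.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_v = 5_000, 64

V = rng.normal(size=(n_tokens, d_v))    # value vectors stay fully expressive throughout
relevant = 42                           # index of the token carrying the needed information

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sharp attention: Q/K matching still works, nearly all weight lands on the relevant token
sharp = np.full(n_tokens, 1e-4)
sharp[relevant] = 1.0
sharp /= sharp.sum()

# Smeared attention: Q/K discrimination has collapsed, weight spread over everything
smeared = np.full(n_tokens, 1.0 / n_tokens)

print("sharp  :", cosine(sharp @ V, V[relevant]))    # close to 1.0
print("smeared:", cosine(smeared @ V, V[relevant]))  # close to 0.0
```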


Evans’ Law: Quantifying the Predictable Breakdown

Our research shows that this degradation is not anecdotal, random, or vendor-specific. It follows a stable, cross-model scaling law: L_text ≈ 1969.8 × M^0.74, where L_text is the functional coherence limit (usable context length), M is model size in billions of parameters, and the exponent 0.74 captures the efficiency with which models convert parameters into usable context.

This law emerges directly from Q/K bandwidth limitations. As models grow larger, they gain more parameters, but their ability to maintain clean Q/K separation across long sequences grows much more slowly, resulting in the sublinear exponent. For multimodal models, the exponent drops to 0.64, reflecting visual embeddings overwhelming Q/K space, more aggressive dimensional compression, greater noise injection, and faster drift accumulation. This is the multimodal degradation tax: additional modalities consume representational bandwidth much faster than text does.
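As a rough illustration, the text-only formula can be evaluated directly. The model sizes below are arbitrary examples, and since only the multimodal exponent (0.64) is stated here without its prefactor, the sketch compares the shape of the two curves rather than computing a multimodal limit.

```python
# Text-only coefficient and exponent as stated above; the multimodal prefactor is
# not given in the article, so only the exponents' shapes are compared.
A_TEXT, EXP_TEXT, EXP_MM = 1969.8, 0.74, 0.64

def coherence_limit_text(params_billion: float) -> float:
    """Functional coherence limit for a text-only model of M billion parameters."""
    return A_TEXT * params_billion ** EXP_TEXT

for m in [7, 13, 70, 405]:   # illustrative model sizes, not tied to any specific product
    print(f"M = {m:>3}B  ->  L_text ~ {coherence_limit_text(m):,.0f} tokens")

# Doubling model size stretches usable context by 2**exponent, not by 2x:
print(f"text: x{2**EXP_TEXT:.2f} per doubling, multimodal: x{2**EXP_MM:.2f} per doubling")
```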

Evans’ Law is a measurement of the transformer’s internal failure curve, a power-law ceiling on how far Q/K coherence can be stretched before collapse.


Why Long-Context Drift Signatures Are Inevitable

Once Q and K lose discriminability, the model displays highly consistent failure patterns: identity drift where characters merge or split, semantic drift where concepts mutate subtly over time, narrative drift where plots or arguments unravel, logical drift where premises drop and contradictions emerge, and referential drift where pronouns point to the wrong entity.

These are not hallucinations in the “conventional” sense. They are structural failures, directly traceable to the model’s inability to locate relevant earlier tokens once Q/K discrimination collapses. The drift-signature patterns we have identified are empirical demonstrations of this internal mechanical breakdown.


Why Fluency Persists Even as Coherence Collapses

Fluency and coherence run on different neural circuits. Fluency emerges from shallow, surface-level pathways optimized for style, grammar, and local continuation, and these pathways remain robust even when attention fails. Coherence, by contrast, requires long-range retrieval through Q/K/V attention, which is precisely what degrades under long context.

This is why outputs sound confident but reasoning disintegrates, and the model provides no warning. The system’s surface polish conceals its blind spots. This is the essence of the verification crisis we have been documenting.


The Governance and Risk Implications

These architectural constraints are not optional. No amount of prompting, UX design, metadata hacks, or user education can exceed the Q/K bandwidth bottleneck. Therefore, vendors claiming “million-token windows” must disclose functional, not theoretical, limits. Enterprises must know when coherence collapses in mission-critical contexts. Regulators should require reliability disclosures analogous to those required for any high-consequence technology. Policymakers need user-centered evaluation frameworks, not vendor-controlled benchmarks. Drift signatures and coherence metrics should be standardized and independently validated.

The Aggregate Coherence Index (ACI), long-context benchmarks, and empirical collapse curves provide this foundation.


Attention Failure Is LLM Failure

Transformers fail not because they are careless or anthropomorphic, but because of the mathematics of attention itself. Q/K/V transformations make reasoning possible, but they impose structural limits that scale sublinearly with model size. These limits produce predictable collapse signatures, multimodal degradation, and the silent incoherence experienced by millions of users.

Evans’ Law captures these limits rigorously. User-centered measurement makes them visible. And governance frameworks must make them accountable. This—the Q/K bottleneck, the coherence power law, and the long-context failure curve—is the missing architecture-level explanation the field has needed.

This article uses simplified language to explain how transformer-based AI systems behave at long context lengths. The underlying mechanisms—self-attention, Q/K/V transformations, and context-dependent degradation—are grounded in well-established properties of the transformer architecture and confirmed by independent empirical testing across multiple models. The linked resources provide access to the original technical literature for readers who want to explore the details or verify the concepts discussed here.

Jennifer Evans
http://www.b2bnn.com
Principal, @patternpulseai. Author, THE CEO GUIDE TO INDUSTRY AI. Former chair, @technationCA; founder, @b2bnewsnetwork. #basicincome activist. Machine learning since 2009.