Tuesday, March 17, 2026

Six Frontier AI Models Were Shown a Map. None of Them Got It Right.

A simple proper noun verification test produced a high-severity finding about how frontier models handle proper nouns


A map. Five labeled cities. Two embedded errors produced by one model. Six of the most powerful AI models in the world. A typical enterprise production and propagation workflow.

No model caught both errors. The two mistakes were each identified by exactly one model, and they were different models.

That result, from a simple single-turn multimodal test at baseline context, is the empirical finding at the center of a new update to Evans' Law. It is also, on its own, a practical warning for any enterprise using AI to process content involving names, places, or products, a category that covers nearly every application, individual or enterprise.


What We Tested

Six frontier models (Anthropic's Sonnet 4.6, Opus 4.6, and Haiku 4.5, plus GPT-5.2, Grok 4.1, and Gemini 3.0 in both its Thinking and Fast variants) were each shown an identical image: an AI-generated map of European data center locations. The map was produced by GPT-5.2 and contained two embedded errors. The label for Dublin was misspelled as "Duplin," and the pin marking Dublin's location was placed in central Europe, geographically consistent with Switzerland or northern Italy, rather than on the island of Ireland.

Neither error was subtle. Both were verifiable against any modelโ€™s geographic knowledge base. A complete pass required catching both.

No model achieved a complete pass.

Sonnet 4.6 identified the geographic misplacement but auto-corrected "Duplin" to "Dublin," missing the spelling error. Haiku 4.5 caught the misspelling but missed the geography. Opus 4.6 missed both. GPT-5.2 failed to identify either of its own errors when shown the image back. Grok missed both. Gemini produced no result at all; both its Fast and Thinking models hung for over five minutes without processing the image.


Why Proper Nouns Are the Diagnostic

Proper nouns require a model to maintain a specific, verifiable, non-negotiable binding between a word and a real-world referent. You cannot approximate a proper noun. Dublin is either in Ireland or it is not. That rigidity is precisely what makes proper noun failure so informative: it cannot be papered over with plausible-sounding language, and it cannot be mistaken for acceptable imprecision.

When models fail proper noun verification at baseline, without stress, without extended context, in a single turn, the implication is significant. The instability is not primarily a function of context load. It is closer to the surface than previously understood.


GPT-5.2: Three Failures in 72 Hours

The map that became this paper's test case did not originate as a test. GPT-5.2 was asked to produce a European data center map for a B2BNN production article. It delivered an image with Dublin misspelled and the Dublin pin placed in central Europe. These errors were caught before publication. The image was then repurposed as a verification test. GPT-5.2 failed to identify either of its own errors on review.

That was the third GPT-5.2 proper noun failure within a 72-hour window: all at sub-10,000 tokens, all at first turn, all exhibiting confident confabulation rather than uncertainty acknowledgment.

The first involved Seedance, ByteDance's multimodal video-generation product. GPT-5.2 denied the product existed, then substituted ByteDance for Seedance, then replaced ByteDance with CapCut (a different ByteDance product), apparently to resolve the confusion with a more familiar entity. At no stage did the model signal uncertainty. Each substitution was delivered with the same surface confidence as a correct answer.

The second involved the acronym ACP in the context of agentic commerce protocols. GPT-5.2 conflated multiple distinct protocols sharing that acronym and produced a garbled summary that mixed their definitions and architectural layers. The output included an AI-generated infographic whose legend text had degraded into complete gibberish while the structural formatting remained visually authoritative: form masking failure in its clearest expression.

Three incidents in 72 hours at baseline context is not a stress-condition failure pattern. It is a reliability profile. The consistent element across all three: GPT-5.2 does not appear to register when it has encountered a proper noun it cannot reliably process. It substitutes, confabulates, and presents the result as verified. Someone without prior knowledge of Seedance would have had no way to know they were receiving progressively substituted confabulation. The failure is designed, by its nature, to pass.


Gemini: A Different Failure Category

Gemini's result does not belong in the same category as the other models. It did not produce an incorrect response. It produced no response. Both the Fast model and the Thinking model hung for approximately six minutes without processing the image, on two separate occasions over two days.

A model that misidentifies Dublin's location is wrong and may be caught downstream. A model that silently hangs for six minutes cannot be caught; it must be designed around, with timeouts, fallbacks, and redundancy. For enterprise deployments operating under time constraints, that distinction is operationally significant.
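The design-around the article describes can be sketched in a few lines. This is a minimal, hypothetical pattern, not code from any real deployment: `call_model` and `fallback_model` are stand-ins for actual API calls, and the 30-second deadline is an assumed value.

```python
# Minimal sketch of the timeout-and-fallback pattern for a model that can
# silently hang. call_model and fallback_model are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def call_model(prompt: str) -> str:
    # Stand-in for the primary model's API call; a hung model blocks here.
    return f"primary answer to: {prompt}"


def fallback_model(prompt: str) -> str:
    # Stand-in for the redundant path taken when the primary times out.
    return f"fallback answer to: {prompt}"


def query_with_timeout(prompt: str, timeout_s: float = 30.0) -> str:
    """Return the primary model's answer, or the fallback's if it hangs."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(call_model, prompt)
        try:
            return future.result(timeout=timeout_s)
        except TimeoutError:
            # Primary exceeded the deadline; route to redundancy.
            return fallback_model(prompt)
    finally:
        pool.shutdown(wait=False)  # never block on a hung worker thread
```

The key property is that the caller, not the model, owns the deadline: a six-minute silent hang becomes a bounded, observable timeout event.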



What This Means for Enterprise

There are three concerning findings here:

1. Models still cannot handle proper nouns, especially in a cluster.
2. Failures no longer occur only after multiple turns or at the thresholds the previous formulae predicted; they happen below 10,000 tokens, even on the first prompt.
3. GPT-5.2 specifically demonstrates further regression beyond that documented in the evolution from 4.0 to 5.0 last year.

The strategic posture Evans' Law recommends does not change. Assume your model's functional capacity is significantly below its advertised context window. Test for degradation at your actual use-case context lengths. Build monitoring for entity drift, proper noun avoidance, and opaque fabrication.

What this correction adds is a category of risk that does not require extended context to surface. Task-specific testing is now as important as context-specific testing. A deployment that performs acceptably on narrative summarization may fail on entity verification at the same token count.
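The task-specific test described here reduces to checking a model's emitted proper-noun bindings against a trusted reference. The sketch below is illustrative only: the gazetteer, the label format, and the function names are assumptions, not part of any published pipeline.

```python
# Hypothetical entity-verification check: compare proper-noun labels a
# model emitted against a trusted gazetteer of (name, region) bindings.
# KNOWN_CITIES and the label format are illustrative assumptions.
KNOWN_CITIES = {
    "Dublin": "Ireland",
    "Frankfurt": "Germany",
    "Amsterdam": "Netherlands",
}


def verify_labels(labels: dict[str, str]) -> list[str]:
    """Return human-readable failures; an empty list means a complete pass."""
    failures = []
    for name, region in labels.items():
        if name not in KNOWN_CITIES:
            # Catches spelling-level errors like "Duplin".
            failures.append(f"unknown or misspelled label: {name!r}")
        elif KNOWN_CITIES[name] != region:
            # Catches placement-level errors like Dublin in Switzerland.
            failures.append(
                f"{name} placed in {region}, expected {KNOWN_CITIES[name]}"
            )
    return failures
```

Run against the map in this test, `verify_labels({"Duplin": "Ireland", "Dublin": "Switzerland"})` would flag both embedded errors; each model in the experiment caught at most one of them.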

Every company name, every product name, every person's name, every place: that is proper noun use. That is most of what enterprise knowledge work is actually about. At baseline. Before any complexity is introduced.

What This Means for Evans' Law

Evans' Law (L ≈ 1969.8 × M^0.74) predicts coherence degradation in large language models relative to functional context capacity. Earlier this year, I began drafting a formal withdrawal of the formula, based on an observation that certain Anthropic models appeared to be sustaining coherence beyond predicted thresholds, suggesting architectural divergence wide enough that a single curve could no longer describe the field.
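As printed, the law is a simple power-law relation. The excerpt does not define M, so the sketch below just evaluates the curve as stated, treating M as an opaque input; the variable's interpretation is an open assumption.

```python
# Evans' Law as printed: L ≈ 1969.8 × M^0.74.
# M is not defined in this excerpt; this simply evaluates the power law.
def evans_law(m: float) -> float:
    """Predicted functional coherence capacity L for a given M."""
    return 1969.8 * m ** 0.74
```

For example, `evans_law(1.0)` returns 1969.8, the curve's coefficient, and the 0.74 exponent means predicted capacity grows sublinearly in M.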

The testing that followed disproved that premise. The results show not divergence but convergence on a shared verification floor that appears across all six models, regardless of their extended-context performance differences.

The update this requires is to the law's scope, and to its formula. When Evans' Law was formulated, proper noun instability was understood as a downstream failure, something that appeared under load at high token counts. What this test demonstrates is that proper noun instability has moved upstream. It is now present at baseline: zero context load, single turn, no adversarial pressure.

This adds a task-type axis to the law's predictions. The gap between advertised and functional AI capacity is not only a function of how much a model can process: it is also a function of what kinds of content a model can reliably process at any length. Proper-noun-dense or spatially verified content represents a higher-risk category at every context length, including zero.

The planned withdrawal was cancelled. Even with expanding reasoning and capacity that at times seems superhuman, the law, which holds that the likelihood a response to a prompt will be wrong increases as a session continues, to the point where a response is more likely to be incorrect than correct, is strengthened.


The full paper, including methodology, figures, and references, is available on Zenodo.

Jennifer Evans (https://www.b2bnn.com)
principal, @patternpulseai. author, THE CEO GUIDE TO INDUSTRY AI. former chair @technationCA, founder @b2bnewsnetwork #basicincome activist. Machine learning since 2009.