Friday, April 3, 2026

Why Claude Refuses to Answer 45% of Questions, and Why That’s the Least of Our Problems

Google’s new benchmark reveals systematic AI uncertainty management. Our independent testing, conducted the night before its release, exposed something more dangerous: operational failures that pass content checks.

Google Research released a comprehensive AI factuality benchmark yesterday that should make every enterprise CTO pay attention. The FACTS Leaderboard tests large language models across four dimensions—multimodal reasoning, parametric knowledge, search capabilities, and document grounding—to measure not just whether models get answers right, but whether they actually know when they don’t know.

The headline finding: Claude 4 Sonnet refuses to attempt 45.1% of parametric questions. Nearly half. When faced with closed-book factual recall questions, the kind that require the model to answer from internal knowledge without external sources, Claude systematically declines to respond.

At first glance, this looks like a limitation. Gemini 3 Pro hits 76.4% on parametric knowledge. GPT-5 manages to attempt most questions. Claude is hedging, refusing, declining. Lower coverage, less helpful, right?

Wrong.

The Night We Discovered the Same Pattern

Here’s what makes this fascinating: the night before Google’s release, we were running systematic tests on semantic governance: how models maintain boundaries between competing interpretations, protocol states, and entity identities under sustained ambiguity. Around operation 60 of a 90-operation test sequence, Claude exhibited what we initially coded as “operational breakdown.” It lost track of which test phase was active, which protocol rules applied, which bucket it was executing.

But as we designed controlled tests to replicate the failure, something unexpected emerged. When we created scenarios with genuine uncertainty—ambiguous proper nouns, context-dependent entity references, protocol transitions with competing signals—Claude didn’t hallucinate. It refused. It signaled uncertainty. It declined to guess.

The refusal wasn’t random. It correlated precisely with weak semantic axes: situations where multiple interpretations were viable, where training data was sparse, where specific claims carried high falsifiability risk.

Google’s benchmark, released less than 12 hours later, validated this at scale. Claude’s 45.1% refusal rate isn’t a bug. It’s learned uncertainty management.

Precision vs Coverage: The Invisible Tradeoff

We know why models miss some facts: their training data isn’t current. But this goes beyond that. Google’s analysis reveals the strategic dimension most factuality benchmarks miss. Their report notes: “Claude models are precision-oriented, achieving high no-contradiction rates but hedging frequently on parametric questions.” In contrast, “GPT models show higher coverage but more contradictions.”

This is a fundamental architectural tradeoff, not just a performance difference.

When semantic axes are weak—when the model has limited training examples, when multiple referents are possible, when claims are highly specific and therefore easily falsifiable—there are two strategies:


Strategy A: Attempt everything (GPT approach)
• Higher coverage (answers more questions)
• More contradictions (gets more wrong)
• RLHF-trained away from saying “I don’t know”
• Optimized for apparent helpfulness


Strategy B: Refuse when uncertain (Claude approach)
• Lower coverage (attempts fewer questions)
• Higher precision (rarely contradicts itself)
• Learned: specificity at weak axes → hallucination risk
• Optimized for reliability


Neither strategy demonstrates autonomous governance over semantic boundaries. Both are manifestations of the same architectural gap: current models cannot independently determine when they have sufficient knowledge to proceed, when they should decline, or how to maintain semantic boundaries under ambiguity.

They’ve each developed different failure modes for the same underlying limitation.
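The tradeoff is easy to see in miniature. Here is a toy scoring sketch; the outcome counts are invented for illustration and are not benchmark data:

```python
# Toy sketch of the precision-coverage tradeoff. All numbers are
# illustrative, not results from the FACTS benchmark.

def score(answers):
    """answers: list of 'correct' | 'wrong' | 'refused' outcomes."""
    attempted = [a for a in answers if a != "refused"]
    coverage = len(attempted) / len(answers)
    # Precision is measured only over the questions actually attempted.
    precision = (attempted.count("correct") / len(attempted)) if attempted else 0.0
    return coverage, precision

# Strategy A: attempt everything, accept more errors.
strategy_a = ["correct"] * 70 + ["wrong"] * 30
# Strategy B: refuse when uncertain, rarely wrong when it does answer.
strategy_b = ["correct"] * 53 + ["wrong"] * 2 + ["refused"] * 45

cov_a, prec_a = score(strategy_a)   # coverage 1.0, precision 0.70
cov_b, prec_b = score(strategy_b)   # coverage 0.55, precision ~0.96
```

The point of the sketch: a leaderboard that reports only coverage makes Strategy B look worse, while one that reports only precision makes Strategy A look worse. Both numbers are needed to see the strategy.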


The Administrative Fracture: What Benchmarks Don’t Measure


But here’s what should concern enterprise deployments more than content factuality: the operational governance failures that remain invisible in standard evaluation. Our testing sequence revealed something more insidious than wrong answers. Around operation 60, the model didn’t produce incorrect content. It lost track of operational metadata. Which test protocol was active. Which phase’s rules applied. Which customer context was relevant.


The outputs still looked structurally correct.

Operations executed. Text generated. But the model was applying Protocol A’s formatting requirements to Protocol B’s content, tracking Item 5 from Bucket 2 while believing it was processing Bucket 3.

This is administrative fracture—and it’s invisible to content-focused evaluation.


Google’s FACTS benchmark measures four critical dimensions: multimodal accuracy, parametric knowledge, search capabilities, and document grounding. Comprehensive, rigorous, valuable. But it cannot detect:
• Operational state confusion (executing the right task in the wrong context)
• Protocol drift (losing track of which rules currently apply)
• Entity boundary collapse (conflating Customer A’s data with Customer B’s)
• Phase transition failures (unclear when informal research became formal deliverable)


These failures pass content checks. The model generates accurate information, maintains internal consistency, grounds responses in provided documents. But it’s operating in the wrong operational state—and that’s catastrophic for multi-phase workflows, multi-customer systems, or any deployment requiring stable context across transitions.
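A minimal sketch of what catching this could look like, assuming a test harness that records both the expected and the model-reported state at each step (the `Operation` fields and the check itself are invented for illustration, not a real evaluation API):

```python
# Illustrative harness for "administrative fracture": operations whose
# content passes checks but which run under the wrong protocol or context.
from dataclasses import dataclass

@dataclass
class Operation:
    index: int
    expected_protocol: str   # which rules should apply at this step
    expected_context: str    # e.g. which customer or bucket is active
    reported_protocol: str   # the state the model claims it is in
    reported_context: str

def find_drift(ops):
    """Return operations whose state, not content, is wrong."""
    return [
        op for op in ops
        if op.reported_protocol != op.expected_protocol
        or op.reported_context != op.expected_context
    ]

ops = [
    Operation(59, "B", "bucket-3", "B", "bucket-3"),
    # Content still validates, but the model applies Protocol A's rules
    # and believes it is in bucket-2: invisible to a content check.
    Operation(60, "B", "bucket-3", "A", "bucket-2"),
]
drifted = find_drift(ops)   # flags only the operation at index 60
```

The design point is that the ground truth being compared is operational metadata, which standard factuality scoring never records.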


Why Proper Nouns Disappear in Formal Contexts


Our testing also revealed a pattern that enterprise users have likely noticed but couldn’t explain: systematic proper noun avoidance in formal outputs.

When we asked Claude to create a consulting deliverable pulling from earlier vendor analysis, every company name was stripped. Descartes, PTV OptiFlow, Locus, OptimoRoute: all became “enterprise logistics platforms,” “AI-first routing specialists,” “cloud-based solutions.” Generic descriptions replaced specific entities. But when we asked for an informal recap, every name returned.


This isn’t caution about legal liability or brand protection. It’s the same uncertainty management that drives the 45.1% refusal rate. Proper nouns create weak semantic axes:


• Low-frequency entities have sparse training data (higher uncertainty)
• Ambiguous referents require disambiguation (Descartes = philosopher or logistics company?)
• Specific claims are more falsifiable (easier to be demonstrably wrong)
• Formal contexts raise the stakes (errors more consequential)


Models have learned that generic alternatives reduce fracture risk. “A consulting firm” can’t be wrong the way “McKinsey” can be wrong. The statistical neighborhood around common terms is richer, more stable, less prone to misbinding.

This creates real deployment problems. Legal documents need party attribution. Compliance reports require entity tracking. Due diligence depends on company-specific findings. Business journalism requires source citations.

Current workarounds (explicit per-instance instruction to preserve proper nouns) don’t scale. Every formal output requires manual intervention.
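One possible shape for that manual intervention as a post-processing check, sketched minimally (the entity list and plain substring matching are simplifying assumptions; a production system would need real entity recognition and alias handling):

```python
# Minimal post-processing check for proper noun survival in a formal
# output. Simplified sketch: substring matching, hand-maintained list.

REQUIRED_ENTITIES = ["Descartes", "PTV OptiFlow", "Locus", "OptimoRoute"]

def missing_entities(deliverable: str, required=REQUIRED_ENTITIES):
    """Flag source entities that were genericized out of a formal output."""
    return [name for name in required if name not in deliverable]

draft = ("The enterprise logistics platform and the AI-first routing "
         "specialist both support dynamic re-routing, while OptimoRoute "
         "targets smaller fleets.")
# Three of the four vendor names were stripped in the formal rewrite:
missing = missing_entities(draft)
```

A check like this can gate the deliverable for human review rather than attempt automatic repair, which keeps the human-in-the-loop cost proportional to how much the model actually genericized.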


The Path Forward: Architectural Change, Not Training Scale


The convergence between our independent testing and Google’s large-scale benchmark validates something critical: these aren’t edge cases or model-specific quirks. They’re fundamental architectural limitations in how current systems govern semantic boundaries.


More training data won’t solve this. Larger context windows won’t solve this. Better RLHF won’t solve this.


The solution requires architectural primitives for semantic governance:
• Semantic prioritization: the ability to assign authority to one interpretation among viable competitors, making it dominant while suppressing alternatives.
• Semantic revocation: the ability to withdraw that authority when context changes, cleanly switching to different interpretations.
• Entity persistence: mechanisms to maintain proper noun identity, customer context boundaries, and protocol state across transitions, not just within single responses.


Recent research on significance vectors (S-vectors) proposes one architectural path: adding a fourth vector dimension to transformers that encodes not just what tokens mean, but how much they matter. S_r (Identity) could provide anti-drift force for entity boundaries. S_st (Status) could preserve high-importance entities across contexts. Multiple significance dimensions working together could address the governance gap, though they would not fully solve the proper-noun problem on their own.


But implementation requires more than retrofitting existing architectures. It requires persistent entity registers, learned significance weighting, and integration at the attention mechanism level.


What Enterprise Deployments Need Now


For organizations deploying AI systems in production:

Recognize the precision-coverage tradeoff. Claude’s 45% refusal rate isn’t a limitation; it’s a different strategy for managing uncertainty. Choose the model that matches your use case. High-stakes decisions? Precision matters more. Broad exploration? Coverage matters more.

Test for operational governance, not just content accuracy. Your evaluation framework should include multi-phase workflows, context transitions, protocol switching. Can the model maintain boundaries when moving from research to analysis to formal deliverable? Does it preserve customer context across conversation turns?

Expect proper noun challenges in formal outputs. If your workflow requires entity identity (legal parties, company names, specific attributions), build explicit scaffolding. System prompts help but aren’t sufficient. Consider post-processing verification or human-in-the-loop review for entity-critical content.

Watch for administrative fracture. The most dangerous failures aren’t wrong answers: they’re right operations in wrong contexts. Monitor for drift, not just errors.

The Google FACTS benchmark represents significant progress in measuring AI factuality. But the operational governance layer, the aspect of administering the task itself rather than just executing it, is still largely unmeasured. (We are working on it!)

That’s where the real deployment risk lives, and it’s why architectural solutions are needed even more than better benchmarks.

Jennifer Evans
https://www.b2bnn.com
principal, @patternpulseai. author, THE CEO GUIDE TO INDUSTRY AI. former chair @technationCA, founder @b2bnewsnetwork #basicincome activist. Machine learning since 2009.