
Visual Coherence Collapse in Diffusion Models: The Image/Video Probabilistic Problem

It’s not inherently wrong to experiment with new tools. Even so, few things raise the Internet’s ire like AI video, often with demonstrably good reason. Terrible, exploitative things have been produced with AI, both images and video. There’s an argument to be made that this is not the fault of the tools but of the people using them to create these monstrosities; every era has its moral panic, and this one is a reflection of humanity. There are also people using these tools to create legitimately good, production-value narratives, primarily in ads. Kelly Boesch is a tremendously gifted artist building beautiful music video narratives that are truly unique and vision-driven. Darren Aronofsky announced today that he intends to produce a series on the Revolutionary War using AI as his medium of choice. The announcement was swiftly met with nearly global derision (his Wikipedia entry was even altered, or a doctored version of it circulated, to read “former filmmaker Darren Aronofsky”).

For good or for ill, the medium is here. I would posit that the primary problem is that there is almost no regulation of how the tools are used. Meanwhile, the question getting lost in this discussion remains largely unanswered: do the tools operate the way they are marketed and the way they are intended to?

Recent experiments with state-of-the-art AI video generation tools would say no. They reveal a fundamental limitation that closely mirrors coherence collapse in large language models. When asked to sustain character identity and narrative continuity across multiple generations, diffusion-based video models fail in a way that is structurally identical to how LLMs drift over extended conversations: they lack persistent internal representations across iterative outputs.

The Consistency Problem

Initial testing began with static image generation using Midjourney and Adobe. The task was deliberately simple: generate the same ice cream container for a Ben & Jerry’s campaign without typographical errors. Each iteration produced a visually distinct container. Colors shifted, branding changed, proportions warped. The model was not refining a single object but regenerating a new approximation each time. Even if the first attempt yields a near-perfect image requiring only minor changes, the second attempt will produce a demonstrably different version. Unless you are willing to invest serious time in revisions and prompt refinement, the lesson is blunt: either resign yourself to something substandard, or pay someone with skill to make the fixes in Photoshop.

The same behavior appeared when testing Sora, X Imagine, and CapCut with human subjects. Despite identical prompts and carefully constrained descriptions, each generation produced a different person. Hair color changed. Facial structure shifted. Clothing details transformed. The model was reconstructing the character from scratch on every attempt, with no persistent visual memory of prior outputs. Despite being told four times that the office belonged to the techie woman and not the male police officer, X Imagine had him sitting at the desk every time.

The Architectural Parallel

This failure is not a tooling issue or a prompt-engineering problem. It is architectural.

In large language models, coherence collapse occurs because meaning is reconstructed token by token from a limited context window. As conversations extend, semantic anchors weaken, errors compound, and the model drifts. The system does not “remember” earlier outputs in any persistent sense; it infers continuity probabilistically from context.

Diffusion models operate the same way in the visual domain. Each image or clip is sampled from a learned distribution conditioned on the prompt. There is no enduring character state, no stable identity embedding carried forward across generations. Visual continuity is inferred, not preserved. Drift is therefore inevitable.
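To make that statelessness concrete, here is a deliberately toy Python sketch. It is not a real diffusion sampler; the `generate` function and its attribute lists are invented for illustration. Each call draws character attributes independently from a prompt-conditioned distribution, so two “shots” of the same character are simply two unrelated samples.

```python
import random

def generate(prompt: str, seed: int) -> dict:
    """Toy stand-in for a diffusion sampler: attributes are drawn
    independently per call, conditioned only on the prompt text and
    a noise seed. Nothing persists between generations."""
    rng = random.Random(f"{prompt}|{seed}")
    return {
        "hair": rng.choice(["brown", "black", "blonde"]),
        "face": rng.choice(["clean-shaven", "goatee", "stubble"]),
        "scarf": rng.choice(["plaid", "striped", "solid"]),
    }

prompt = "a police officer buying coffee in winter"
shot_1 = generate(prompt, seed=1)
shot_2 = generate(prompt, seed=2)  # "the same officer" -- but the sampler
                                   # has no memory of shot_1 to consult
print(shot_1)
print(shot_2)
```

Nothing in the second call can see the first call’s output, which is exactly the property the experiments above keep running into.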

This suggests that Evans’ Law, which predicts coherence collapse in LLMs at approximately L ≈ 1969.8 × M^0.74 tokens, may have a visual analogue. For diffusion models, coherence does not collapse over conversational length but over sequential generations. Beyond a certain number of iterations, maintaining consistent character identity becomes statistically improbable.
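As a worked example, the horizon is easy to evaluate. A minimal Python sketch follows, with the caveat that the units of M (for instance, parameter count in billions) are an assumption here rather than something stated in this piece:

```python
def coherence_horizon(m: float) -> float:
    """Evans' Law: predicted coherence-collapse length in tokens,
    L = 1969.8 * M**0.74. The units of M are assumed, not given."""
    return 1969.8 * m ** 0.74

# Hypothetical model sizes, assuming M is parameters in billions.
for m in (7, 70, 405):
    print(f"M = {m:>3}: collapse predicted near {coherence_horizon(m):,.0f} tokens")
```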

The UX Layer: When Architectural Limits Become User-Management Failures

Runway’s user experience exposes how architectural limitations can cascade into failures of user management and expectation-setting. Before upgrading, the system surfaced warnings in the chat interface about an inability to handle refunds, a disclosure that, in retrospect, functioned as an implicit admission of instability rather than a transactional footnote. Despite this, the product flow encouraged continued use and monetization. After upgrading, a 55-second source video was reduced to a five-second output, not because of an explicit user choice or parameter constraint, but due to opaque system limitations revealed only post-purchase.

This is not merely poor UX; it reflects a deeper mismatch between model capability, interface promises, and user behavior. When systems with known architectural constraints fail to clearly bound user expectations, they effectively offload the cost of experimentation onto the user. In doing so, they amplify frustration, erode trust, and obscure the true source of failure, which lies not in user misuse, but in the gap between generative ambition and technical reality.

From Images to Motion: Video Makes the Failure Obvious

Recent experiments with AI video tools make this limitation impossible to ignore.

Testing with Runway Gen-3 and Sora 2 demonstrated that while individual clips can be visually impressive, coherence deteriorates rapidly across shots. Runway struggled almost immediately with emotional calibration and temporal continuity. Even with extensive conditioning, including uploaded reference images, outputs collapsed into brief, inconsistent clips that bore little relationship to the intended sequence.

Sora 2 performed significantly better on single shots. A prompt describing “a woman and a police officer buying coffee, discussing a case, and walking through winter snow in downtown Ottawa” produced 37 seconds of genuinely cinematic footage. Lighting, movement, and interaction were convincing.

But generating a second shot with the same prompt exposed the underlying problem. The police officer, previously clean-shaven, now had a goatee. The woman’s coat color remained consistent, but her facial features, scarf pattern, and accessories changed. These were not continuity errors. They were different people wearing similar clothing.

The goatee isn’t a mistake. It’s proof the model has no persistent memory. Each generation samples probabilistically from learned distributions. The model cannot recreate what it made before because it didn’t retain what it made before.

Why This Matters for Narrative Production

AI video generation currently excels at isolated outputs: atmospheric shots, abstract sequences, environments, visual effects elements. A single generation can reach broadcast quality.

Narrative filmmaking, however, requires stable identity across time. Characters must persist across shots, scenes, and emotional states. Current diffusion models cannot do this without extensive external intervention.

Practitioners working seriously with AI video have adapted accordingly. They generate hundreds of variations and manually select consistent frames. They use AI primarily for environments and composite real actors into generated scenes. They embrace inconsistency as an experimental aesthetic. Or they rely on heavy VFX correction to repair drift frame by frame.
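The generate-and-select workflow lends itself to simple automation. A minimal sketch, assuming frames have already been run through an image encoder (CLIP or similar) to produce embedding vectors; the `most_consistent` helper and the encoder step are illustrative, not any tool’s actual API:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_consistent(candidates: list[np.ndarray], reference: np.ndarray) -> int:
    """Index of the candidate frame whose embedding sits closest to
    the reference (an approved frame of the character)."""
    return int(np.argmax([cosine(c, reference) for c in candidates]))

# Usage (hypothetical encoder):
#   reference = encode(approved_frame)
#   candidates = [encode(f) for f in generated_frames]
#   keep = generated_frames[most_consistent(candidates, reference)]
```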

What they are not doing, despite marketing claims, is producing long-form narrative film using AI generation alone. The technology cannot sustain coherent visual identity across the temporal and generative distance storytelling requires.

The Visual Footprint of Generative Models

Another consequence of this architecture is the emergence of a recognizable synthetic aesthetic. Just as LLMs produce linguistically fluent but statistically flattened prose, diffusion models generate imagery that is visually coherent yet identity-unstable. The result is a distinct “AI footprint”: polished, plausible, and immediately legible as synthetic under sustained viewing.

This footprint becomes more pronounced in video, where even minor inconsistencies accumulate rapidly. Motion does not conceal drift; it amplifies it. The failure modes in video are very similar to the failure modes in agentic AI, as are the solutions: a lot of scaffolding and very tightly sequenced probability models.

Beyond Film: A Broader Limitation

This is not merely a production concern. Any application requiring visual consistency across time (surveillance analysis, medical imaging sequences, scientific visualization) faces the same constraint. Diffusion models treat each frame or clip as an independent sampling event, blind to their own prior outputs except through explicit conditioning.

The parallel failure modes across text and image generation point to a deeper limitation in current foundation models. Both reconstruct outputs from learned distributions without maintaining persistent internal state. Both drift. Both fail beyond their effective coherence horizons.

Proposed solutions for LLM coherence collapse, including significance-aware architectures that encode importance alongside content, may have visual equivalents. A diffusion system that maintains persistent character embeddings updated across generations, rather than regenerated each time, could potentially address this problem. Diffusion models generate images by learning statistical patterns from training data. When prompted to create ‘a police officer,’ the model samples from its learned distribution of what police officers look like. It does not create an individual character and remember them. It creates a statistically plausible instance matching the prompt.

When asked to generate a second shot of ‘the same police officer,’ the model has no internal representation of the first officer to reference. It cannot retrieve what it made before. It can only sample again from the same distribution, which will produce a different, equally plausible police officer.

This is identical to how LLMs operate. Each token is generated probabilistically from learned patterns, not retrieved from memory of what was written before. Both architectures are stateless. Both reconstruct rather than remember. Both drift.
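What might the persistent-embedding alternative proposed above look like? A speculative sketch, assuming character identity can be summarized as a fixed-size vector carried forward and refined by exponential moving average after each accepted shot. The `CharacterState` class is hypothetical; no shipping diffusion system exposes anything like it today:

```python
import numpy as np

class CharacterState:
    """Hypothetical persistent identity embedding. Instead of
    resampling identity from scratch, each new shot would be
    conditioned on this vector, which is updated (not regenerated)
    after every accepted generation."""

    def __init__(self, dim: int = 512, alpha: float = 0.9):
        self.embedding = np.zeros(dim)
        self.alpha = alpha  # weight given to the accumulated identity
        self.initialized = False

    def update(self, observed: np.ndarray) -> None:
        """Fold the embedding of an accepted shot into the identity."""
        if not self.initialized:
            self.embedding, self.initialized = observed.copy(), True
        else:
            self.embedding = (self.alpha * self.embedding
                              + (1 - self.alpha) * observed)
```

The design choice is the point: conditioning on a persistent, updated state is remembering; conditioning on a prompt alone is reconstructing.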

Until such architectures exist, AI video generation remains best suited for snippets, experiments, and ideation. It can create beautiful moments. It cannot yet sustain stories.

Understanding why these models fail at narrative coherence is not just a critique of current tools. It is a window into the core limitations of contemporary generative architectures, and a roadmap for what must change next. Like so many other areas of AI, this one desperately needs regulation. Regulation of a new technology is needed most as it emerges and its capabilities become apparent, because that is the period during which it is most easily exploited. Too many people don’t understand what these models are capable of; many don’t even understand what AI is. Unleashing these systems on an unsuspecting public is not only immoral, it is extremely dangerous and exploitative.

A formal paper documenting these patterns with systematic testing across model families is forthcoming.
