Chaos and Order

Introduction

As image generation reached maturity, the next frontier shifted toward video generation. Video is not a simple extension of images. Adding the new dimension of a time axis creates two fundamental challenges: consistency between frames and an explosion in compute. Rather than asserting the detailed specs of any specific product, this article focuses on the principles of the spatiotemporal diffusion transformer that recent video generation models share.

This field moves quickly, and the internals of commercial models are mostly undisclosed. The content below draws on publicly known concepts and architecture families. Rankings and specific numbers vary by benchmark and version, so read it with that in mind.

The Two Challenges of Video Generation

Temporal Consistency

Video is a sequence of many frames. If each frame is generated independently, a person's face shifts subtly from frame to frame, or background objects suddenly disappear and colors flicker. A good video model has to preserve identity and scene structure across time.

Bad case (per-frame independent generation):

Frame 1: blue shirt --> Frame 2: navy shirt --> Frame 3: purple shirt (flicker/instability)

Good case (temporally consistent):

Frame 1: blue shirt --> Frame 2: blue shirt --> Frame 3: blue shirt (stable persistence)
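The contrast between the two cases can be made concrete with a toy flicker score. This is only a sketch: it assumes each frame's shirt is summarized by a single RGB color, and the color values and the `flicker_score` helper are illustrative inventions, not a metric used by any specific model.

```python
# Toy flicker score: mean color change between consecutive frames.
# The RGB values are illustrative assumptions, not measurements.

def flicker_score(colors):
    """Average per-channel absolute change between neighbouring frames."""
    if len(colors) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(colors, colors[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur)) / len(prev)
    return total / (len(colors) - 1)

BLUE, NAVY, PURPLE = (0, 0, 255), (0, 0, 128), (128, 0, 128)

bad = [BLUE, NAVY, PURPLE]   # per-frame independent generation
good = [BLUE, BLUE, BLUE]    # temporally consistent generation

print("bad case :", flicker_score(bad))
print("good case:", flicker_score(good))
```

A consistent clip scores zero, while the drifting clip scores well above zero. Real evaluations work on full frames or learned features rather than one color per frame.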

Compute Cost

Instead of a single image, the model has to produce dozens of frames per second over several seconds. Data volume grows in proportion to the number of frames, and once attention also runs along the time axis, compute grows much faster than linearly, because full self-attention scales with the square of the token count. As a result, video models need far more aggressive compression and efficiency measures than image models.
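A back-of-the-envelope calculation shows how quickly the cost grows. The frame count, resolution, and patch size below are hypothetical values chosen only to illustrate the scaling, and the numbers are computed before any latent compression.

```python
# Rough cost of treating a clip as one long token sequence.
# All sizes are hypothetical and chosen only to show the scaling.

def num_tokens(frames, height, width, patch_t=1, patch_h=16, patch_w=16):
    """Number of patches when a clip is cut into patch_t x patch_h x patch_w blocks."""
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)

def attention_pairs(tokens):
    """Full self-attention compares every token with every token."""
    return tokens * tokens

image_tokens = num_tokens(frames=1, height=256, width=256)    # 256
video_tokens = num_tokens(frames=48, height=256, width=256)   # 12288 (2 s at 24 fps)

print("tokens:", image_tokens, "->", video_tokens)
print("attention pairs grow by a factor of",
      attention_pairs(video_tokens) // attention_pairs(image_tokens))  # 2304
```

The token count grows 48x with the frame count, but the number of attention pairs grows by 48 squared, which is why compression comes first.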

Spatiotemporal Latent Patches

The core idea is to handle video not as raw pixels but as a compressed spatiotemporal latent. Where image latent diffusion compresses images spatially, video models compress space and time together.

First, a 3D autoencoder encodes video into a spatiotemporal latent tensor. Then this latent tensor is cut into spatiotemporal patches to form a token sequence. Each patch corresponds to a "short slice of time, small slice of space."

[Source video: time x height x width x channels]

[3D autoencoder for spatiotemporal compression]

[Spatiotemporal latent tensor] --patch splitting--> [Spatiotemporal token sequence]

(each token = time slice x space slice)
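The shape bookkeeping behind this diagram can be sketched as follows. The compression factors (4x in time, 8x in space) and the patch size are assumptions for illustration. Real models use their own, often undisclosed, values, and `Shape3D`, `encode_shape`, `token_grid`, and `token_coords` are hypothetical helper names.

```python
# Sketch: from source video shape to token sequence length.
# Compression factors and patch sizes are hypothetical; real models differ.
from dataclasses import dataclass

@dataclass(frozen=True)
class Shape3D:
    t: int  # frames, or latent time steps
    h: int  # height
    w: int  # width

def encode_shape(video, t_factor=4, s_factor=8):
    """Latent shape after a 3D autoencoder that downsamples time and space."""
    return Shape3D(video.t // t_factor, video.h // s_factor, video.w // s_factor)

def token_grid(latent, pt=1, ph=2, pw=2):
    """Grid of spatiotemporal patches cut from the latent."""
    return Shape3D(latent.t // pt, latent.h // ph, latent.w // pw)

def token_coords(index, grid):
    """Map a flat token index back to its (time, row, col) patch position."""
    t, rest = divmod(index, grid.h * grid.w)
    row, col = divmod(rest, grid.w)
    return t, row, col

video = Shape3D(t=64, h=256, w=256)
latent = encode_shape(video)          # Shape3D(t=16, h=32, w=32)
grid = token_grid(latent)             # Shape3D(t=16, h=16, w=16)
seq_len = grid.t * grid.h * grid.w    # 4096 tokens

print(latent, grid, seq_len)
print(token_coords(256, grid))        # first token of the second time slice
```

Changing the clip length or resolution only changes `seq_len`, which is the sense in which different formats share one token representation.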

The "spatiotemporal patch" concept is widely credited to Sora for its popularization. Its public description emphasizes that videos of various resolutions, lengths, and aspect ratios can be unified into a single token representation, which makes it easier to train jointly on data in different formats.

DiT-Based Spatiotemporal Diffusion

Once tokenization is done, a diffusion transformer (DiT) runs on top of this token sequence. The idea is the same as an image DiT, but the difference is that attention spans not only space but also the time axis.

[Spatiotemporal token sequence] + [text condition]

[Transformer blocks x N]

- Spatial attention (positions within the same timestep)

- Temporal attention (multiple timesteps at the same position)

- or unified spatiotemporal attention

[Noise/velocity prediction] --> Denoising iterations

[Spatiotemporal latent reconstruction] --> [3D decoder] --> [Video]

How to split attention is a design choice. Separating space and time and processing them alternately (factorized) reduces compute, while binding space and time together and processing them as one (full) yields richer interaction but is expensive. Many models compromise between the two, trading off efficiency and quality.
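The trade-off can be estimated by counting query-key pairs per attention layer. This is a simplified count under an assumed 16 x 16 x 16 token grid. It ignores heads, hidden size, and implementation details such as windowing.

```python
# Count query-key pairs for full vs. factorized attention on a token grid.
# The 16 x 16 x 16 grid is a hypothetical example size.

def full_attention_pairs(t, h, w):
    """Every token attends to every other token across space and time."""
    n = t * h * w
    return n * n

def factorized_attention_pairs(t, h, w):
    """Spatial attention within each time step, then temporal attention per position."""
    spatial = t * (h * w) ** 2     # each time step attends within itself
    temporal = (h * w) * t ** 2    # each spatial position attends across time
    return spatial + temporal

t, h, w = 16, 16, 16
full = full_attention_pairs(t, h, w)
factorized = factorized_attention_pairs(t, h, w)

print("full       :", full)
print("factorized :", factorized)
print("ratio      :", round(full / factorized, 2))
```

Under these assumptions the factorized variant needs roughly an order of magnitude fewer pairs, at the price of no direct interaction between tokens that differ in both position and time.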

As on the image side, the model is trained to predict either the added noise or a velocity field (flow matching / rectified flow). The broad framework of running diffusion in a latent space is the same as in image latent diffusion.
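The two training targets can be sketched on a tiny vector standing in for a latent. The convention here (t = 0 is noise, t = 1 is data, linear interpolation) is one common choice and an assumption of this sketch. Papers differ on the time direction and on the noise schedule.

```python
# Sketch of training pairs for velocity prediction and noise prediction.
# Convention assumed here: t = 0 is pure noise, t = 1 is clean data.

def rectified_flow_pair(data, noise, t):
    """Linear interpolation between noise and data, plus the velocity target."""
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    velocity = [d - n for d, n in zip(data, noise)]   # constant along the straight path
    return x_t, velocity

def noise_prediction_pair(data, noise, alpha_bar):
    """DDPM-style noisy sample; the regression target is the noise itself."""
    a, s = alpha_bar ** 0.5, (1.0 - alpha_bar) ** 0.5
    x_t = [a * d + s * n for d, n in zip(data, noise)]
    return x_t, list(noise)

data = [1.0, -2.0, 0.5]     # stands in for a clean spatiotemporal latent
noise = [0.0, 1.0, -1.0]    # stands in for a Gaussian sample

x_t, v = rectified_flow_pair(data, noise, t=0.25)
# Following the velocity for the remaining time recovers the data.
recovered = [x + (1.0 - 0.25) * vi for x, vi in zip(x_t, v)]
print(x_t, v, recovered)
```

In both cases the network sees the noisy sample and the time or noise level, and is trained with a regression loss against the returned target.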

Conditioning, Length, and Resolution

Text Conditioning

As with image models, the prompt is embedded with a text encoder (CLIP or T5 family) and injected via cross-attention or joint attention. Other conditioning signals are layered on top, such as image-to-video, which supplies a first-frame image, and upscaling, which supplies a low-resolution video.

Variable Length and Resolution

The advantage of the spatiotemporal patch representation is flexibility. By adjusting the number of tokens, different lengths, resolutions, and aspect ratios can be handled by the same model. That said, longer video means more tokens and larger compute, so in practice strategies such as cascades (generate low resolution then upscale) or chunk-by-chunk generation are used together.

[Generate short low-resolution video]

[Temporal interpolation / frame extension]

[Spatial upscaling (super-resolution)]

[Final high-resolution video]
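The cascade above can be traced as a sequence of shapes. The base size, the 2x temporal interpolation, and the 4x spatial upscaling are hypothetical factors picked for illustration, and `cascade_shapes` is an invented helper name.

```python
# Trace clip shapes through a base -> interpolation -> upscaling cascade.
# All factors and sizes are hypothetical.

def cascade_shapes(frames, height, width, interp=2, upscale=4):
    """Return (stage name, frames, height, width) for each cascade stage."""
    stages = [("base generation", frames, height, width)]
    # Insert (interp - 1) new frames between each pair of neighbouring frames.
    frames = (frames - 1) * interp + 1
    stages.append(("temporal interpolation", frames, height, width))
    height, width = height * upscale, width * upscale
    stages.append(("spatial upscaling", frames, height, width))
    return stages

for name, f, h, w in cascade_shapes(frames=16, height=90, width=160):
    print(f"{name:24s} {f:3d} frames at {w}x{h}")
```

Only the first stage has to model the whole scene from noise. The later stages work from an existing clip, which is why they can be cheaper per output pixel.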

Successor Model Families (Concept-Focused)

After Sora brought spatiotemporal latent patches and large-scale diffusion transformers to public attention, several commercial and research models followed. Veo, Kling, Runway, and Pika are among the publicly known examples, and they appear to use different training data and recipes. Since most of them do not disclose their internal structure, only the common architectural direction is covered here.

The commonly observed directions are: (1) spatiotemporal latent compression, (2) a diffusion transformer backbone, (3) text and image conditioning, and (4) resolution and length extension via cascades or upscaling. Detailed performance and rankings vary greatly by benchmark, version, and prompt, so we avoid definitive claims.

For reference, there are reports that Sora was shut down (service discontinued) in 2026. This is based on reports only, so check the official announcement for the precise facts and timing. Regardless of the fate of any specific product, the spatiotemporal diffusion transformer family remains the common foundation of this field.

Comparison Table: Contrast with Image Generation

| Axis | Image Generation | Video Generation |
| --- | --- | --- |
| Compression | Spatial (VAE) | Spatiotemporal (3D autoencoder) |
| Tokens | Spatial patches | Spatiotemporal patches |
| Attention | Space-centric | Space + time |
| Core challenge | Composition and detail | Temporal consistency + compute |
| Conditioning | Text | Text + first frame, etc. |
| Output extension | Super-resolution | Temporal interpolation + super-resolution |

The entries describe general tendencies of each family and may differ from a specific model's configuration.

Limits of Physical Consistency

Video models are often likened to "world simulators," but in reality they do not explicitly compute physics. They merely learn statistical patterns from data. As a result, failures such as the following appear.

- **Causal and physical violations**: broken glass reassembling, liquid volume not being conserved, or objects appearing and disappearing without basis.

- **Long-range consistency collapse**: the longer the video, the more object identity and count waver. Scenes differ when the camera returns to them.

- **Contact and rigid-body interaction**: fine interactions are still difficult. For example, the moment a hand grabs an object often looks unnatural.

These limits follow from the fact that the model does not "understand" physical laws but generates plausible pixel motion. Results have improved considerably, but full physical consistency remains an open problem.

Evaluation

Evaluating video generation is harder than for images. Perceptual quality, temporal consistency, prompt fidelity, and the naturalness of motion must all be considered together.

- **Automatic metrics**: metrics that look at frame quality and temporal consistency together are used, for example the FVD (Fréchet Video Distance) family and multi-faceted evaluation suites such as VBench, but they do not fully match human perception.

- **Human evaluation**: in practice, human preference comparison is trusted most. However, it is costly and subjective.

- **Caveats**: rankings vary greatly by prompt set, resolution, length, and evaluation method. Rather than asserting "what is best," comparisons that specify the conditions are needed.
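The point about specifying conditions can be illustrated with pairwise preference votes grouped by condition. The votes below are made-up toy data for two hypothetical models A and B, not results from any real comparison.

```python
# Win rate of model A over model B, reported per evaluation condition.
# The votes are toy data invented for this sketch.
from collections import defaultdict

def win_rates(votes):
    """votes: iterable of (condition, winner) with winner 'A' or 'B'."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for condition, winner in votes:
        totals[condition] += 1
        if winner == "A":
            wins[condition] += 1
    return {c: wins[c] / totals[c] for c in totals}

votes = [
    ("short clip", "A"), ("short clip", "A"), ("short clip", "B"), ("short clip", "A"),
    ("long clip", "B"), ("long clip", "B"), ("long clip", "A"), ("long clip", "B"),
]

rates = win_rates(votes)
overall = sum(1 for _, w in votes if w == "A") / len(votes)
print(rates)     # per-condition win rate of A
print(overall)   # pooled win rate hides the difference
```

In this toy data the pooled win rate is an even split, while the per-condition rates point in opposite directions, which is why a single ranking without stated conditions says little.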

Full Pipeline Diagram

[Prompt text] --> [Text encoder] --> [Condition embedding] (option: first-frame image)

[Pure noise (spatiotemporal latent)] + [Condition embedding] --> [Spatiotemporal DiT backbone]

[Denoising iterations: sampler + CFG] --> [Spatiotemporal latent tensor] --> [3D decoder]

[Low-res video] --> [Interpolation/super-resolution]

[Final video]
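The "sampler + CFG" step in the diagram can be sketched with a toy model. Classifier-free guidance (CFG) combines a conditional and an unconditional prediction. Here `toy_model` is a hypothetical stand-in for the DiT backbone, and the Euler loop and guidance scale are illustrative choices rather than any product's settings.

```python
# Classifier-free guidance inside a simple Euler sampling loop.
# toy_model is a hypothetical stand-in for the spatiotemporal DiT.

def cfg_combine(uncond, cond, scale):
    """Move from the unconditional prediction toward the conditional one, scaled."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

def toy_model(x, t, cond):
    """Toy velocity: points at 1.0 with the prompt, at 0.0 without it."""
    target = 1.0 if cond else 0.0
    return [target - xi for xi in x]

def sample(x, steps=10, scale=3.0):
    """Integrate the guided velocity from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_uncond = toy_model(x, t, cond=False)
        v_cond = toy_model(x, t, cond=True)
        v = cfg_combine(v_uncond, v_cond, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))   # [2.0, 5.0]
print(sample([0.0, 0.0]))
```

A scale of 1 reproduces the conditional prediction, and larger scales push the sample further in the direction the prompt suggests. Each step calls the model twice, once with and once without the condition, which adds to the inference cost.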

Strengths

- **Unified representation**: thanks to spatiotemporal patches, various lengths, resolutions, and aspect ratios are handled by one model.

- **Scalability**: the transformer backbone benefits from scaling up model size, data, and compute.

- **Conditioning flexibility**: multiple conditions such as text, first frame, and low-res input can be combined.

- **Rapid quality improvement**: resolution, consistency, and motion quality have improved greatly in a short time.

Limits and Open Problems

- **Compute cost**: the longer and higher-resolution the video, the more training and inference costs surge.

- **Long-range consistency**: as clips extend from a few seconds to tens of seconds, maintaining identity and scene structure becomes difficult.

- **Physics and causality**: the physical violations discussed earlier remain.

- **Controllability**: precise control such as camera motion, fine timing, and specific-object control is still developing.

- **Evaluation and copyright**: the absence of reliable standard metrics and the issue of training data provenance are major points of contention, as with images.

Practical Implications

- Starting from short, clear scenes tends to be more stable. Long, complex scenes are prone to breaking consistency.

- If precise control is needed, combine the text prompt with image-to-video or structural conditions.

- Rather than depending on the fate or rankings of a specific product, it is safer to understand the properties of the architecture family and compare directly for the intended use.

Closing

The common foundation of state-of-the-art video generation can be summarized as "spatiotemporal latent compression + diffusion transformer + text and image conditioning." The spatiotemporal patch concept that Sora popularized has become the de facto shared vocabulary of successor models. The fate and rankings of individual products change quickly, but if you understand these architectural principles, you can quickly grasp the structure of a new model when one appears.

References

- [Scalable Diffusion Models with Transformers, DiT (arXiv 2212.09748)](https://arxiv.org/abs/2212.09748)

- [High-Resolution Image Synthesis with Latent Diffusion Models (arXiv 2112.10752)](https://arxiv.org/abs/2112.10752)

- [Video Diffusion Models (arXiv 2204.03458)](https://arxiv.org/abs/2204.03458)

- [Denoising Diffusion Probabilistic Models (arXiv 2006.11239)](https://arxiv.org/abs/2006.11239)

- [Flow Matching for Generative Modeling (arXiv 2210.02747)](https://arxiv.org/abs/2210.02747)

- [VBench: Comprehensive Benchmark Suite for Video Generative Models (arXiv 2311.17982)](https://arxiv.org/abs/2311.17982)

- [OpenAI Sora Introduction Page](https://openai.com/sora)

- [Runway Research](https://runwayml.com/research)

- [Hugging Face Diffusers Documentation](https://huggingface.co/docs/diffusers)