- 1. Introduction: The Current State of AI Video Generation and the Rise of Open Source
- 2. HunyuanVideo Overview
- 3. HunyuanVideo Architecture Deep Dive
- 4. HunyuanVideo Training Data and Methodology
- 5. HunyuanVideo Model Specifications and Performance
- 6. HunyuanVideo Practical Usage
- 7. LTX-Video Overview
- 8. LTX-2 Architecture Analysis
- 9. LTX-2 Key Features
- 10. LTX-2 Practical Usage
- 11. HunyuanVideo vs LTX-2 Detailed Comparison
- 12. Open-Source Video Generation Model Ecosystem Comparison
- 13. Prompt Engineering Tips
- 14. Future Outlook
- 15. References
1. Introduction: The Current State of AI Video Generation and the Rise of Open Source
2024-2025 marked the era when AI Video Generation technology entered the commercialization stage. Commercial services such as OpenAI Sora, Google Veo, Runway Gen-3, and Kling were released in succession, making the concept of "creating video from text" a reality. However, these commercial models come with limitations such as API costs, usage restrictions, and data privacy concerns.
In this context, open-source video generation models have grown rapidly, beginning to achieve quality on par with commercial models. In particular, Tencent's HunyuanVideo and Lightricks' LTX-2 form the two pillars of open-source video generation, each with different philosophies and strengths.
[AI Video Generation Model Development Timeline]
- 2024 Q1-Q2: Sora Preview, Runway Gen-3, Pika 1.0
- 2024 Q3-Q4: HunyuanVideo, CogVideoX, Kling 1.0, Mochi Preview
- 2025 Q1-Q2: Wan 2.1, LTX-Video 1.0, Mochi 1
- 2025 Q3-Q4: HunyuanVideo 1.5, Wan 2.2 (MoE), LTX-2 Preview, Sora 2
- 2026 Q1: LTX-2 Open, Wan 2.6, Veo 3.1
[Open Source vs Commercial Model Competition]
Commercial:  Sora --> Sora 2 --> Veo 3.1 --> Kling 3.5
Open Source: CogVideoX --> HunyuanVideo --> Wan 2.1 --> LTX-2
Quality Gap: commercial advantage --> gap narrows (rapid catch-up) --> on par (benchmark parity) --> surpassing (speed/access advantage)
This article provides an in-depth, paper-based analysis of the HunyuanVideo and LTX-2 architectures, compares their benchmark performance, surveys the broader open-source ecosystem (Wan 2.1, CogVideoX, Mochi, and more), and closes with prompt engineering tips and a practical usage guide.
2. HunyuanVideo Overview
2.1 Tencent Research Team
HunyuanVideo is a large-scale video generation model developed by Tencent's Hunyuan AI research team. The Tencent Hunyuan team has experience developing various generative AI models including HunyuanDiT (image generation) and Hunyuan3D (3D generation), and leveraged this technical expertise to enter the video generation domain.
Key Contributions of the Tencent Hunyuan Team:
| Model | Domain | Key Features |
|---|---|---|
| HunyuanDiT | Text-to-Image | Bilingual (Chinese/English), DiT arch |
| Hunyuan3D | 3D Generation | 3D model generation from text/image |
| HunyuanVideo | Text/Image-to-Video | 13B parameters, largest open-source at release |
| HunyuanVideo 1.5 | Text/Image-to-Video | 8.3B, consumer GPU support |
2.2 Largest Open-Source Video Generation Model
Released in December 2024, HunyuanVideo has 13B (13 billion) parameters, making it the largest open-source video generation model at the time of release. This significantly exceeds competing models such as CogVideoX (5B-10B) and Mochi (10B).
HunyuanVideo Core Specs:
| Item | HunyuanVideo | HunyuanVideo 1.5 |
|---|---|---|
| Parameters | 13B | 8.3B |
| Release Date | December 2024 | November 2025 |
| Architecture | Dual-to-Single Stream DiT | Improved DiT |
| Text Encoder | MLLM (Decoder-Only) | Improved MLLM |
| VAE | 3D Causal VAE | 3D Causal VAE (improved) |
| Training | Flow Matching | Flow Matching |
| Max Resolution | 720p (1280x720) | 720p |
| Max Frames | 129 frames | 129 frames |
| License | Tencent Hunyuan Community | Tencent Hunyuan Community |
2.3 Text-to-Video and Image-to-Video Support
HunyuanVideo supports two core capabilities:
Text-to-Video (T2V): Generates high-quality video from text prompts alone. Describe scenes, actions, and atmosphere in natural language and it creates matching video.
Image-to-Video (I2V): Takes a static image as input and transforms it into video with natural motion added. The HunyuanVideo-I2V model released separately in March 2025 handles this functionality.
[HunyuanVideo Input/Output Pipeline]
Text-to-Video:  "A golden retriever running through a sunlit meadow"
                    --> HunyuanVideo Pipeline --> Video Output (MP4)
Image-to-Video: [Input Image] + Prompt
                    --> HunyuanVideo Pipeline --> Video Output (MP4)

Pipeline internals: MLLM Encoder -> DiT Denoiser -> 3D VAE
3. HunyuanVideo Architecture Deep Dive
HunyuanVideo's architecture consists of three core components: (1) MLLM Text Encoder, (2) 3D Causal VAE, (3) Dual-Stream to Single-Stream DiT.
[HunyuanVideo Full Architecture Diagram]
Text Prompt
    |
    v
MLLM Text Encoder (Decoder-Only LLM)
    |
    v
Text Tokens (with Bidirectional Token Refiner)
    |
    v
Gaussian Noise + Text Tokens
    |
    v
Dual-Stream to Single-Stream DiT
    [Dual phase]   video tokens and text tokens processed
                   in independent blocks
    [Single phase] tokens concatenated and fused
    |
    v
Denoised Latent
    |
    v
3D VAE Decoder
    |
    v
Final Video
3.1 Dual-Stream to Single-Stream DiT Design
The most distinctive architectural element of HunyuanVideo is its "Dual-Stream to Single-Stream" Diffusion Transformer (DiT) design. This is the core design philosophy that differentiates it from existing DiT models.
Dual-Stream Phase (Early Layers):
In the Dual-Stream phase, video tokens and text tokens are processed through independent Transformer blocks. Each modality can learn its own appropriate modulation mechanisms without interfering with the other.
# Dual-Stream Phase Pseudocode
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.video_attn = MultiHeadAttention(dim, num_heads)
        self.text_attn = MultiHeadAttention(dim, num_heads)
        self.video_ffn = FeedForward(dim)
        self.text_ffn = FeedForward(dim)
        self.video_norm = AdaLayerNorm(dim)
        self.text_norm = AdaLayerNorm(dim)

    def forward(self, video_tokens, text_tokens, timestep):
        # Independent video token processing (pre-norm residual blocks)
        video_tokens = video_tokens + self.video_attn(
            self.video_norm(video_tokens, timestep))
        video_tokens = video_tokens + self.video_ffn(video_tokens)
        # Independent text token processing
        text_tokens = text_tokens + self.text_attn(
            self.text_norm(text_tokens, timestep))
        text_tokens = text_tokens + self.text_ffn(text_tokens)
        return video_tokens, text_tokens
Single-Stream Phase (Later Layers):
In the Single-Stream phase, video tokens and text tokens are concatenated and processed together in a single Transformer block. This enables effective multimodal information fusion.
# Single-Stream Phase Pseudocode
import torch
import torch.nn as nn

class SingleStreamBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ffn = FeedForward(dim)
        self.norm = AdaLayerNorm(dim)

    def forward(self, video_tokens, text_tokens, timestep):
        num_video = video_tokens.shape[1]
        # Concatenate video + text tokens along the sequence dimension
        combined = torch.cat([video_tokens, text_tokens], dim=1)
        # Unified processing (full attention across both modalities)
        combined = combined + self.attn(self.norm(combined, timestep))
        combined = combined + self.ffn(combined)
        # Split back into per-modality streams
        return combined[:, :num_video], combined[:, num_video:]
Advantages of the Dual-to-Single Design:
| Characteristic | Dual-Stream Only | Single-Stream Only | Dual-to-Single (HunyuanVideo) |
|---|---|---|---|
| Per-modality learning | Excellent | Limited | Excellent (early phase) |
| Cross-modal fusion | Weak | Strong | Strong (later phase) |
| Computational efficiency | High | Moderate | High |
| Text-video alignment | Low | High | High |
| Model flexibility | High | Low | Very high |
3.2 3D VAE (Causal VAE) - Spatiotemporal Compression
HunyuanVideo uses a 3D Causal VAE to compress pixel-space video into a compact latent space. This VAE is built on CausalConv3D and efficiently compresses both temporal and spatial information.
Compression Ratios:
| Dimension | Ratio | Description |
|---|---|---|
| Temporal | 4x | 129 frames to 33 latent frames |
| Spatial | 8x x 8x | 720x1280 to 90x160 |
| Channel | 3ch to 16ch | RGB 3ch to latent 16ch |
Overall Compression Effect:
Input Video: 720 x 1280 x 129 frames x 3 channels
= ~356M values
Latent: 90 x 160 x 33 x 16 channels
= ~7.6M values
Compression: ~47:1 (by element count)
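The element counts behind the ~47:1 figure can be checked directly (all sizes taken from the tables above):

```python
# Sanity-check the element counts behind the ~47:1 compression figure.
# Sizes are from the tables above: 720p, 129 frames, 16-channel latent.

pixel_elements = 720 * 1280 * 129 * 3    # H x W x frames x RGB channels
latent_elements = 90 * 160 * 33 * 16     # h x w x latent frames x channels

ratio = pixel_elements / latent_elements
print(f"pixels: {pixel_elements:,}")     # 356,659,200 (~356M)
print(f"latent: {latent_elements:,}")    # 7,603,200 (~7.6M)
print(f"ratio:  {ratio:.1f}:1")          # 46.9:1
```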
Causal VAE Characteristics:
The Causal VAE maintains temporal causality in its design, meaning each frame is encoded referencing only information from previous frames. This allows images and videos to be processed by the same VAE. The first frame is treated as an image without temporal compression, while subsequent frames have temporal compression applied considering their relationship to previous frames.
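The causal constraint can be illustrated with a minimal 1D temporal convolution. This is a toy stand-in for CausalConv3D (temporal axis only, not the actual implementation): padding is applied only on the past side, so each output frame depends solely on the current and earlier frames.

```python
def causal_temporal_conv(frames, kernel):
    """Toy 1D causal convolution over a frame sequence.

    Output t depends only on frames[max(0, t-k+1) .. t]: the sequence is
    padded on the *past* side only, mirroring CausalConv3D's temporal axis.
    """
    k = len(kernel)
    padded = [frames[0]] * (k - 1) + list(frames)  # replicate-pad the past
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]

frames = [1.0, 2.0, 3.0, 4.0]
out = causal_temporal_conv(frames, kernel=[0.5, 0.5])
print(out)  # [1.0, 1.5, 2.5, 3.5] -- frame t never sees frame t+1
```

Because no output position reads future frames, the same operator handles a single image (one frame) and a full video identically, which is the property the unified image-video design relies on.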
3.3 MLLM Text Encoder
Another innovation of HunyuanVideo is its adoption of a Multimodal Large Language Model (MLLM) as the text encoder. This contrasts with existing video/image generation models that primarily use CLIP or T5 as text encoders.
Comparison with Existing Text Encoders:
| Characteristic | CLIP | T5-XXL | MLLM (HunyuanVideo) |
|---|---|---|---|
| Architecture | Encoder-Only | Encoder-Decoder | Decoder-Only |
| Parameters | ~400M | ~4.7B | Tens of billions |
| Image-text alignment | Excellent | Moderate | Very excellent |
| Detail understanding | Limited | Excellent | Very excellent |
| Complex reasoning | Weak | Moderate | Strong |
| Zero-shot ability | Limited | Moderate | Excellent |
| Attention type | Causal | Bidirectional | Causal + Refiner |
Bidirectional Token Refiner:
MLLMs inherently use causal attention due to their Decoder-Only structure, but bidirectional attention is more effective as text conditioning for diffusion models. To solve this, HunyuanVideo introduces an additional Bidirectional Token Refiner.
[Text Encoding Pipeline]
Text Prompt
|
v
+----------+ +--------------+ +------------------+
| MLLM | --> | Bidirectional| --> | Final Text |
| (Causal | | Token | | Embedding |
| Attn) | | Refiner | | (DiT condition) |
+----------+ +--------------+ +------------------+
  rich semantics    bidirectional context    diffusion-optimized text representation
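The gap the refiner bridges comes down to the attention mask. A minimal illustration of the two mask shapes (mask construction only, not the refiner itself):

```python
def causal_mask(n):
    # Token i may attend only to tokens 0..i (lower-triangular mask),
    # as in the decoder-only MLLM.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Every token attends to every other token, as in the refiner.
    return [[1] * n for _ in range(n)]

print(causal_mask(4))         # later prompt words invisible to earlier positions
print(bidirectional_mask(4))  # all prompt tokens see each other
```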
3.4 Flow Matching Training Method
HunyuanVideo adopts Flow Matching instead of traditional DDPM (Denoising Diffusion Probabilistic Model). Flow Matching learns the optimal transport path between data and noise distributions.
DDPM vs Flow Matching:
| Characteristic | DDPM | Flow Matching |
|---|---|---|
| Noise schedule | Must be predefined | Flexible design |
| Training target | Noise prediction | Vector field pred. |
| Convergence | Slow | Fast |
| Inference path | Curved | Straight (efficient) |
| Sampling steps | Many (20-50) | Fewer (20-30) |
# Flow Matching Training Pseudocode
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x_0, text_cond):
    """
    x_0: original video latent, shape [B, C, T, H, W]
    text_cond: text condition embedding
    """
    # Random timestep per sample in [0, 1)
    t = torch.rand(x_0.shape[0], device=x_0.device)
    # Broadcast t over the channel/frame/spatial dims
    t_ = t.view(-1, 1, 1, 1, 1)
    # Noise sampling
    noise = torch.randn_like(x_0)
    # Linear interpolation: x_0 at t=0, pure noise at t=1
    x_t = (1 - t_) * x_0 + t_ * noise
    # Target vector field: constant velocity from data toward noise
    target = noise - x_0
    # Model's vector field prediction
    predicted = model(x_t, t, text_cond)
    # Regression loss on the predicted velocity
    loss = F.mse_loss(predicted, target)
    return loss
3.5 Unified Image-Video Training Strategy
HunyuanVideo trains on images and videos within a unified framework. Images are treated as single-frame videos and processed by the same model architecture.
3.6 Full Attention Mechanism
HunyuanVideo applies Full Attention across both temporal and spatial dimensions. This contrasts with many video generation models that separate spatial and temporal attention to reduce computation.
| Attention Type | Description | Example Model |
|---|---|---|
| Spatial-Only | Spatial dimension only | Early video models |
| Temporal-Only | Temporal dimension only | AnimateDiff |
| Spatial + Temporal (split) | Applied alternately | CogVideoX |
| Full 3D Attention | Full spatiotemporal attn | HunyuanVideo |
Full Attention allows every token in the video to interact spatiotemporally with every other token, yielding more coherent motion and higher visual quality, at the cost of significantly increased computation.
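The cost gap can be quantified by counting pairwise attention interactions. The figures below assume a 2x2 spatial patchify of the 33x90x160 latent grid (the exact patch size is an assumption here, used only for illustration):

```python
# Count pairwise attention interactions: full 3D vs. split attention.
# Latent grid: 33 frames x 45 x 80 tokens, assuming a 2x2 spatial
# patchify of the 33 x 90 x 160 latent (illustrative, not official).
T, S = 33, 45 * 80              # temporal length, spatial tokens per frame
N = T * S                       # total tokens seen by full 3D attention

full_pairs = N ** 2                   # every token attends to every token
split_pairs = T * S**2 + S * T**2     # spatial-only + temporal-only passes

print(f"tokens: {N:,}")                                        # 118,800
print(f"full / split cost: {full_pairs / split_pairs:.1f}x")   # 32.7x
```

This ratio (roughly T*S / (S + T)) is why full 3D attention is so much more expensive than split attention at video-scale token counts.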
4. HunyuanVideo Training Data and Methodology
4.1 Large-Scale Data Curation Pipeline
HunyuanVideo's training data is prepared through a systematic curation pipeline involving multiple stages of filtering and evaluation from raw data to final training data.
4.2 Multi-Stage Training Strategy
HunyuanVideo employs a Progressive Training strategy, starting from low resolution and gradually increasing resolution.
Training Stage Settings:
| Stage | Resolution | Frames | Batch Size | Primary Goal |
|---|---|---|---|---|
| Stage 1 | 256x256 | 17 | Large | Basic visual concepts |
| Stage 2 | 512x512 | 33 | Medium | Detail learning |
| Stage 3 | 960x544 / 544x960 | 65 | Small | High-resolution adapt. |
| Stage 4 | 1280x720 / 720x1280 | 129 | Very small | Final quality fine-tune |
5. HunyuanVideo Model Specifications and Performance
5.1 Supported Resolutions and Frames
| Resolution | Aspect Ratio | Use Case |
|---|---|---|
| 1280 x 720 | 16:9 | Landscape HD |
| 720 x 1280 | 9:16 | Portrait (mobile) |
| 960 x 544 | ~16:9 | Medium resolution |
| 544 x 960 | ~9:16 | Medium portrait |
| 720 x 720 | 1:1 | Square |
Frame Settings:
| Setting | Value | Notes |
|---|---|---|
| Max frames | 129 frames | 33 latent frames after 4x VAE compress |
| FPS | 24 fps | Standard cinema framerate |
| Video length | ~5.4 sec | 129 / 24 = 5.375 seconds |
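The relationships in the table follow directly from the VAE's 4x temporal compression with the first frame kept uncompressed, which is also why frame counts take the 4k+1 form:

```python
def latent_frames(num_frames):
    # First frame is uncompressed; the rest are compressed 4x temporally,
    # which is why num_frames must have the form 4k + 1.
    assert (num_frames - 1) % 4 == 0, "num_frames must be 4k + 1"
    return 1 + (num_frames - 1) // 4

print(latent_frames(129))   # 33 latent frames
print(latent_frames(65))    # 17 latent frames (Stage 3 training setting)
print(129 / 24)             # 5.375 -> ~5.4 seconds at 24 fps
```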
5.2 Benchmark Comparison
VBench Evaluation Results:
| Model | Overall | Visual Quality | Text Alignment | Motion Quality | Human Fidelity |
|---|---|---|---|---|---|
| HunyuanVideo | Top tier | 96.4% | 68.5% | 64.5% | Excellent |
| Sora | Top tier | Excellent | Moderate | Excellent | Very excellent |
| CogVideoX-1.5 | Upper | Excellent | Strong | Moderate | Weak |
| Kling 1.6 | Top tier | Excellent | Excellent | Excellent | Excellent |
HunyuanVideo shows particularly strong results in Human Fidelity and Motion Rationality dimensions.
5.3 Competitive Model Comparison
| Comparison | HunyuanVideo | Sora 2 | Runway Gen-3 | Kling 3.5 |
|---|---|---|---|---|
| Access | Open source | Commercial | Commercial | Commercial |
| Parameters | 13B | Undisclosed | Undisclosed | Undisclosed |
| Max Resolution | 720p | 1080p | 1080p | 1080p |
| Max Length | ~5 sec | Up to 20s | Up to 10s | Up to 10s |
| Local Run | Possible | Not possible | Not possible | Not possible |
| Customization | LoRA supported | Not possible | Limited | Not possible |
| Cost | Free (GPU req.) | API billing | Subscription | API billing |
6. HunyuanVideo Practical Usage
6.1 HuggingFace Model Download
# HunyuanVideo original model (13B)
pip install huggingface_hub
huggingface-cli download tencent/HunyuanVideo --local-dir ./HunyuanVideo
# HunyuanVideo 1.5 (8.3B, lighter version)
huggingface-cli download tencent/HunyuanVideo-1.5 --local-dir ./HunyuanVideo-1.5
# Image-to-Video model
huggingface-cli download tencent/HunyuanVideo-I2V --local-dir ./HunyuanVideo-I2V
6.2 Diffusers Library Inference Code
Basic Text-to-Video Inference:
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
# Load model
model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
)
pipe = HunyuanVideoPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()
pipe.to("cuda")
# Generate video
output = pipe(
prompt="A cat walks on the grass, realistic style, natural lighting",
height=720,
width=1280,
num_frames=129,
num_inference_steps=30,
guidance_scale=6.0,
).frames[0]
# Save video
export_to_video(output, "hunyuan_output.mp4", fps=24)
4-bit Quantization for VRAM Savings:
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
from diffusers import BitsAndBytesConfig
# INT4 quantization config
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
)
# Load quantized transformer
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
"tencent/HunyuanVideo",
subfolder="transformer",
quantization_config=quant_config,
)
pipe = HunyuanVideoPipeline.from_pretrained(
"tencent/HunyuanVideo",
transformer=transformer,
torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()
# CPU offload for additional VRAM savings
pipe.enable_model_cpu_offload()
output = pipe(
prompt="A beautiful sunset over the ocean, cinematic",
height=544,
width=960,
num_frames=65,
num_inference_steps=30,
guidance_scale=6.0,
).frames[0]
export_to_video(output, "quantized_output.mp4", fps=24)
6.3 Key Parameter Guide
| Parameter | Default | Range | Description |
|---|---|---|---|
| guidance_scale | 6.0 | 1.0-15.0 | Prompt fidelity (higher = more faithful to prompt) |
| num_inference_steps | 30 | 20-50 | Denoising steps (higher = better quality, slower) |
| height | 720 | 256-720 | Video height (multiple of 8) |
| width | 1280 | 256-1280 | Video width (multiple of 8) |
| num_frames | 129 | 17-129 | Total frames (4k+1 form recommended) |
| seed | Random | Integer | Seed for reproducibility |
6.4 GPU VRAM Requirements
| Configuration | Required VRAM | Resolution | Notes |
|---|---|---|---|
| FP32 (original) | 80GB+ | 720p 129f | A100/H100 required |
| BF16/FP16 | ~40GB | 720p 129f | A100 40GB |
| FP8 quantization | ~24-30GB | 720p 129f | RTX 4090 capable |
| INT4 quant + CPU Offload | ~14-16GB | 544p 65f | RTX 4080 capable |
| HunyuanVideo 1.5 (FP8) | ~14GB | 480p | Consumer GPU |
6.5 LoRA Fine-tuning
HunyuanVideo supports LoRA fine-tuning to learn specific styles, characters, or motion patterns.
Key LoRA Training Tools:
| Tool | Features | Min VRAM |
|---|---|---|
| Musubi Tuner (kohya-ss) | Most popular LoRA training tool | 24GB |
| ai-toolkit (ostris) | Multi-model support | 24GB |
| diffusion-pipe (tdrussell) | Pipeline-based training | 24GB |
| FineTrainers (HuggingFace) | Official Diffusers-based tool | 24GB |
| fal.ai LoRA Training | Cloud-based, no setup needed | Cloud |
7. LTX-Video Overview
7.1 Lightricks Company
Lightricks is an AI-based creative technology company headquartered in Jerusalem, Israel. Founded in 2013, it is widely known for consumer photo/video editing apps such as Facetune, Videoleap, and Photoleap. Leveraging experience in mobile creative tools, it entered the AI video generation space.
7.2 Evolution from LTX-Video 1.0 to LTX-2
| Version | Release | Params | Key Features |
|---|---|---|---|
| LTX-Video 0.9 | Nov 2024 | ~2B | First open-source, real-time |
| LTX-Video 0.9.8 (13B) | Mid 2025 | 13B | Distilled variants, improved quality |
| LTX-2 | Oct 2025 | 19B | Audio+video simultaneous gen. |
| LTX-2 (open source) | Jan 2026 | 19B | Full weights/code released |
7.3 Near Real-Time Video Generation Speed
The biggest differentiator of the LTX-Video series is its faster-than-real-time video generation speed. LTX-Video was among the first DiT-based video generation models to achieve real-time generation.
[Generation Speed Comparison (5-second video, H100 GPU, 768x512 resolution)]
| Model | Gen Time | vs Real-time |
|---|---|---|
| LTX-Video 1.0 | ~2 sec | 2.5x faster |
| LTX-2 | ~3-5 sec | ~real-time |
| HunyuanVideo | ~2-5 min | 60x slower |
| CogVideoX | ~3-8 min | 100x slower |
| Mochi | ~5-10 min | 120x slower |
7.4 Text-to-Video and Image-to-Video Support
LTX-2 offers simultaneous Audio-Video generation in addition to Text-to-Video and Image-to-Video.
| Feature | LTX-Video 1.0 | LTX-2 |
|---|---|---|
| Text-to-Video | Supported | Supported |
| Image-to-Video | Supported | Supported |
| Audio generation | Not supported | Synchronized audio co-gen |
| 4K resolution | Not supported | Native 4K (3840x2160) |
| 50fps | Not supported | Supported |
| Keyframe Conditioning | Limited | Full support |
8. LTX-2 Architecture Analysis
8.1 Overall Architecture
LTX-2 consists of three core components: (1) Modality-specific VAE, (2) Text embedding pipeline, (3) Asymmetric Dual Stream DiT.
[LTX-2 Full Architecture]
Text Prompt
|
v
+-------------+
| Text Encoder | (Gemma-based)
| + Prompt |
| Enhancer |
+------+------+
|
v
+------------------------------------------+
| Asymmetric Dual Stream DiT |
| |
| +------------------+ +-------------+ |
| | Video Stream | | Audio Stream| |
| | (wide channels, | | (narrow, | |
| | high capacity) | | lightweight)| |
| +--------+---------+ +------+------+ |
| | Cross-Attention | |
| +----------+-----------+ |
+------------------------------------------+
| |
v v
+-------------+ +-------------+
| Video VAE | | Audio VAE |
| Decoder | | Decoder |
| (3D spatio- | | (1D temporal)|
| temporal) | | |
+------+------+ +------+------+
| |
v v
Video Output Audio Output
| |
+--------+-----------+
|
v
Final AV Output (MP4)
8.2 Video VAE (High Compression Ratio - 1:192)
LTX-2's Video VAE achieves a very high compression ratio of 1:192. This is approximately 4x higher than HunyuanVideo's ~47:1 ratio.
VAE Compression Comparison:
| Model | Spatial | Temporal | Latent Ch | Overall Ratio |
|---|---|---|---|---|
| LTX-2 | 32x32 | 8x | 128ch | 1:192 |
| HunyuanVideo | 8x8 | 4x | 16ch | ~1:47 |
| CogVideoX | 8x8 | 4x | 16ch | ~1:47 |
| Wan 2.1 | 8x8 | 4x | 16ch | ~1:47 |
High compression provides:
- Fewer latent tokens: Greatly reduces tokens the DiT must process, improving inference speed
- Memory efficiency: Enables high-resolution video processing with less VRAM
- Faster training: Reduces computation needed during training
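The headline ratios in the comparison table can be derived from each model's per-axis factors (RGB input, per compression block):

```python
def vae_ratio(spatial, temporal, latent_ch, in_ch=3):
    """Pixels-to-latent element ratio for one VAE compression block."""
    return (spatial * spatial * temporal * in_ch) / latent_ch

print(vae_ratio(32, 8, 128))   # LTX-2:        192.0 -> "1:192"
print(vae_ratio(8, 4, 16))     # HunyuanVideo:  48.0 -> "~1:47"
```

HunyuanVideo's per-block figure is 48; the ~47:1 quoted earlier reflects the uncompressed first frame in an actual 129-frame video.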
8.3 Asymmetric Dual Stream DiT
LTX-2's DiT adopts an asymmetric dual stream structure, reflecting the characteristic differences between video and audio modalities.
Rationale for Asymmetric Design:
| Characteristic | Video Stream | Audio Stream |
|---|---|---|
| Dimension | 3D (spatial + temporal) | 1D (temporal) |
| Complexity | High (spatiotemporal) | Medium (temporal) |
| Channel width | Wide (high capacity) | Narrow (lightweight) |
| Positional Embedding | 3D positional | 1D temporal |
| Data characteristics | Pixel-based visual | Frequency-based audio |
8.4 Text Encoder
LTX-2 uses a Gemma-based text encoder. The enhance_prompt feature can automatically expand simple user prompts for better results.
8.5 Speed Optimization Techniques
| Optimization | Description | Speedup |
|---|---|---|
| High VAE compression | Greatly reduces latent token count | Key factor |
| Distilled inference | 8-step distilled model available | 5-10x |
| FP8 Transformer | Quantized weights | ~2x |
| Two-Stage Pipeline | Stage 1 (gen) + Stage 2 (upscale) | Efficient |
| Gradient Estimation | Reduce steps from 40 to 20-30 | ~1.5x |
9. LTX-2 Key Features
9.1 Real-Time Generation Speed
| Resolution | Frames | Length | Gen Time (H100) | vs Real-time |
|---|---|---|---|---|
| 768x512 | 121 | 5 sec | ~2 sec | 2.5x faster |
| 1216x704 | 121 | 5 sec | ~5 sec | ~real-time |
| 1920x1080 | 121 | 5 sec | ~15 sec | 3x slower |
| 3840x2160 | 121 | 5 sec | ~60 sec | 12x slower |
9.2 High Resolution and Various Output Options
Supported Resolutions:
| Resolution | Aspect | Use Case | VRAM Required |
|---|---|---|---|
| 768 x 512 | 3:2 | Rapid prototyping | ~8-12GB |
| 1216 x 704 | ~16:9 | Standard prod. | ~16GB |
| 1920 x 1080 | 16:9 | Full HD | ~24GB |
| 3840 x 2160 | 16:9 | 4K UHD | 48GB+ |
9.3 Synchronized Audio-Video Generation
One of LTX-2's innovative features is generating audio and video simultaneously. Sound matching the video content is automatically generated without a separate audio generation model.
9.4 Keyframe Conditioning
LTX-2 supports Keyframe Conditioning, allowing you to specify certain frames and naturally fill in between them.
9.5 LoRA Support
LTX-2 officially supports LoRA training and inference, with training code included in the GitHub repository.
10. LTX-2 Practical Usage
10.1 Installation and Environment Setup
# 1. Python environment (3.10+ recommended)
conda create -n ltx2 python=3.10
conda activate ltx2
# 2. Install official LTX-2 package
pip install ltx-pipelines
# 3. Or install from source
git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2
pip install -e "packages/ltx-pipelines[all]"
pip install -e "packages/ltx-core[all]"
# 4. Download model weights
huggingface-cli download Lightricks/LTX-2 --local-dir ./models/ltx2
10.2 Python Inference Code Examples
Text-to-Video Basic Example:
from ltx_pipelines import TI2VidTwoStagesPipeline
# Initialize pipeline
pipe = TI2VidTwoStagesPipeline.from_pretrained(
"Lightricks/LTX-2",
device_map="auto",
enable_fp8=True, # Save VRAM with FP8
)
# Generate video
result = pipe(
prompt="A serene mountain lake at sunrise, mist rising from the water, "
"birds flying overhead, cinematic quality",
negative_prompt="blurry, low quality, distorted",
height=704,
width=1216,
num_frames=121,
frame_rate=24,
num_inference_steps=30,
cfg_guidance_scale=7.5,
seed=42,
enhance_prompt=True, # Auto prompt enhancement
)
# Save
result.save("ltx2_output.mp4")
Image-to-Video Example:
from ltx_pipelines import TI2VidTwoStagesPipeline
from PIL import Image
pipe = TI2VidTwoStagesPipeline.from_pretrained(
"Lightricks/LTX-2",
device_map="auto",
enable_fp8=True,
)
# Load input image
input_image = Image.open("input_photo.jpg")
# I2V generation
result = pipe(
prompt="The scene comes alive with gentle wind blowing through the trees",
images=[input_image],
height=704,
width=1216,
num_frames=121,
frame_rate=24,
num_inference_steps=30,
cfg_guidance_scale=7.5,
seed=42,
)
result.save("ltx2_i2v_output.mp4")
10.3 Key Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| prompt | Required | String | Video description |
| negative_prompt | None | String | Elements to exclude |
| height | 704 | Multiple of 32 | Video height |
| width | 1216 | Multiple of 32 | Video width |
| num_frames | 121 | 8k+1 form | Total frame count |
| frame_rate | 24 | 24/30/50 | Frames per second |
| num_inference_steps | 30 | 8-50 | Denoising steps |
| cfg_guidance_scale | 7.5 | 1.0-15.0 | Prompt fidelity |
| seed | Random | Integer | Reproducibility seed |
| enhance_prompt | False | True/False | Auto prompt enhancement |
| enable_fp8 | False | True/False | Use FP8 quantization |
10.4 GPU Requirements
| GPU | VRAM | Recommended Res. | Notes |
|---|---|---|---|
| RTX 3060/4060 | 8-12GB | 540p, 4 sec | FP8 required, basic |
| RTX 3080/4070 Ti | 12-16GB | 768x512, 5 sec | FP8 recommended |
| RTX 4090 | 24GB | 1080p, 5 sec | Standard use |
| A100 | 40-80GB | 4K, 10 sec | Production |
| H100 | 80GB | 4K, 10 sec | Optimal performance |
11. HunyuanVideo vs LTX-2 Detailed Comparison
11.1 Architecture Comparison
| Item | HunyuanVideo | LTX-2 |
|---|---|---|
| Parameters | 13B (v1) / 8.3B (v1.5) | 19B |
| DiT Structure | Dual-to-Single Stream | Asymmetric Dual Stream |
| VAE Structure | 3D Causal VAE | Video VAE + Audio VAE |
| VAE Ratio | ~1:47 | 1:192 |
| Spatial Comp. | 8x8 | 32x32 |
| Temporal Comp. | 4x | 8x |
| Latent Ch | 16 | 128 |
| Text Encoder | MLLM (Decoder-Only) | Gemma |
| Training | Flow Matching | Diffusion (Flow-based) |
| Attention | Full 3D Attention | Bidirectional Cross-Attn |
11.2 Performance and Quality Comparison
| Comparison | HunyuanVideo | LTX-2 | Winner |
|---|---|---|---|
| Visual quality | Very high | High | HunyuanVideo |
| Motion natural. | Very high | High | HunyuanVideo |
| Text alignment | High | High | Tie |
| Human generation | Excellent | Good | HunyuanVideo |
| Max resolution | 720p | 4K | LTX-2 |
| Audio generation | Not supported | Synced gen | LTX-2 |
| Frame rate | 24fps | Up to 50fps | LTX-2 |
11.3 Speed Comparison
| Condition | HunyuanVideo | LTX-2 | Difference |
|---|---|---|---|
| 768x512, 5s (H100) | ~120 sec | ~3 sec | LTX-2 ~40x faster |
| 1280x720, 5s (H100) | ~300 sec | ~10 sec | LTX-2 ~30x faster |
| 1280x720, 5s (RTX 4090) | ~600 sec | ~30 sec | LTX-2 ~20x faster |
11.4 Use Case Model Selection Guide
[Recommended Model by Scenario]
"I need the highest quality video"
--> HunyuanVideo (v1, 13B)
Reason: Full Attention + 13B for best visual quality
"I want to run locally on a consumer GPU"
--> LTX-2 (FP8) or HunyuanVideo 1.5
Reason: LTX-2 runs on 12GB, HV 1.5 on 14GB
"I need fast iterative work"
--> LTX-2 (Distilled)
Reason: Near real-time generation speed
"I need video with audio"
--> LTX-2
Reason: Only model with simultaneous AV generation
"I need specific character/style training"
--> HunyuanVideo + LoRA
Reason: Rich LoRA ecosystem
"I need 4K high resolution"
--> LTX-2
Reason: Native 4K support
"Human/face generation is important"
--> HunyuanVideo
Reason: Excellent Human Fidelity benchmark
12. Open-Source Video Generation Model Ecosystem Comparison
12.1 Comprehensive Model Comparison
| Item | HunyuanVideo | LTX-2 | Wan 2.1 | CogVideoX | Mochi 1 |
|---|---|---|---|---|---|
| Developer | Tencent | Lightricks | Alibaba | Zhipu/Tsinghua | Genmo |
| Parameters | 13B | 19B | 1.3B / 14B | 5B / 10B | 10B |
| Max Res. | 720p | 4K | 720p | 720p | 480p |
| Max Length | ~5 sec | ~10 sec | ~5 sec | ~6 sec | ~5.4 sec |
| Max FPS | 24 | 50 | 24 | 30 | 30 |
| VAE Ratio | 1:47 | 1:192 | 1:47 | 1:47 | 1:12 |
| Audio | Not supported | Supported | V2A separate | Not supported | Not supported |
| Min VRAM | 14GB (v1.5) | 8-12GB | 8GB (1.3B) | 4.4GB (INT8) | 20GB (ComfyUI) |
| Speed | Slow | Very fast | Moderate | Moderate | Slow |
| I2V Support | Separate model | Integrated | Integrated | Supported | Not supported |
| LoRA | Supported | Supported | Supported | Supported | Limited |
13. Prompt Engineering Tips
13.1 Effective Video Prompt Writing
Prompt Structure (SAEC Framework):
[Subject] + [Action] + [Environment] + [Camera/Cinematography]
S (Subject): Subject - what/who is the main focus
A (Action): Action - what is happening
E (Environment): Environment - where, what atmosphere
C (Camera): Camera - how it is filmed
Good vs Bad Prompts:
| Type | Prompt | Issue/Strength |
|---|---|---|
| Bad | "A nice video of nature" | Too vague |
| Average | "A dog running in a park" | Not specific enough |
| Good | "A golden retriever running through a sunlit meadow, wildflowers swaying, warm golden hour lighting" | Specific + environment |
| Excellent | "Medium tracking shot of a golden retriever running joyfully through a sunlit meadow, wildflowers swaying gently in the breeze, warm golden hour lighting, shallow depth of field, 35mm cinematic lens, natural color grading" | Full SAEC application |
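The SAEC structure lends itself to a small helper (the function name and signature are illustrative, not part of any model's API):

```python
def saec_prompt(subject, action, environment, camera):
    """Assemble a video prompt from the SAEC components described above."""
    parts = (subject, action, environment, camera)
    return ", ".join(p.strip() for p in parts if p and p.strip())

prompt = saec_prompt(
    subject="a golden retriever",
    action="running joyfully through a sunlit meadow",
    environment="wildflowers swaying, warm golden hour lighting",
    camera="medium tracking shot, 35mm cinematic lens",
)
print(prompt)
```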
13.2 Cinematography Terminology
Camera Movements:
| Term | Description | Example |
|---|---|---|
| Pan | Horizontal turn | "Slow pan across the landscape" |
| Tilt | Vertical turn | "Tilt up to reveal the building" |
| Dolly | Forward/back | "Dolly in on the subject's face" |
| Tracking Shot | Following shot | "Tracking shot following the car" |
| Crane Shot | Crane | "Crane shot rising above the city" |
| Static | Fixed | "Static shot of the waterfall" |
| Handheld | Handheld | "Handheld camera, documentary style" |
13.3 Negative Prompt Usage
Universal Negative Prompt Template:
# Basic quality control
"blurry, low quality, distorted, deformed, ugly, bad anatomy,
watermark, text overlay, logo, grainy, noisy"
# Additional for human generation
"extra fingers, mutated hands, poorly drawn hands, poorly drawn face,
mutation, deformed, extra limbs, missing limbs"
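For human generation the two templates above are typically concatenated; a small helper that merges them and drops duplicate terms ("deformed" appears in both) might look like this (illustrative, not a library function):

```python
BASE_NEGATIVE = ("blurry, low quality, distorted, deformed, ugly, bad anatomy, "
                 "watermark, text overlay, logo, grainy, noisy")
HUMAN_NEGATIVE = ("extra fingers, mutated hands, poorly drawn hands, "
                  "poorly drawn face, mutation, deformed, extra limbs, missing limbs")

def merge_negatives(*templates):
    """Join comma-separated negative-prompt templates, deduplicating terms."""
    seen, terms = set(), []
    for template in templates:
        for term in (t.strip() for t in template.split(",")):
            if term and term not in seen:
                seen.add(term)
                terms.append(term)
    return ", ".join(terms)

print(merge_negatives(BASE_NEGATIVE, HUMAN_NEGATIVE))
```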
Negative Prompt Support by Model:
| Model | Negative Prompt | Recommendation |
|---|---|---|
| HunyuanVideo | Not officially supported | Use guidance_scale instead |
| LTX-2 | Supported | Actively recommended |
| Wan 2.1 | Supported | Actively recommended |
| CogVideoX | Supported | Actively recommended |
14. Future Outlook
14.1 Video Generation Model Development Directions
Key Development Directions:
| Direction | Current State | Expected Development |
|---|---|---|
| Video length | 5-10 sec | Expanding to minutes |
| Resolution | 720p-4K | 8K, HDR support |
| Physics accuracy | Basic | Precise physics simulation |
| Character consist. | Limited | Multi-shot narratives |
| Gen speed | Real-time to min | Real-time streaming |
| Multimodal | AV beginning | AV + subtitles + voice |
| Editing | Basic | AI-based auto editing |
| Interaction | None | Real-time interactive |
14.2 Notable Technology Trends
- MoE Architecture: Introduced in Wan 2.2, greatly improving model efficiency
- Distillation Techniques: Transferring large model knowledge to small models for speed
- Multimodal Integration: Complete integrated generation of video + audio + text
- LoRA Ecosystem Growth: Explosive growth of community-driven specialized models
- Edge Device Deployment: Possibility of video generation on mobile/edge devices
15. References
Papers
| Paper | Authors | Link |
|---|---|---|
| HunyuanVideo: A Systematic Framework For Large Video Generative Models | Tencent Hunyuan Team | arXiv:2412.03603 |
| HunyuanVideo 1.5 Technical Report | Tencent Hunyuan Team | arXiv:2511.18870 |
| LTX-Video: Realtime Video Latent Diffusion | Lightricks Research | arXiv:2501.00103 |
| LTX-2: Efficient Joint Audio-Visual Foundation Model | Lightricks Research | arXiv:2601.03233 |
GitHub Repositories
| Repository | Description | Link |
|---|---|---|
| Tencent-Hunyuan/HunyuanVideo | HunyuanVideo official repo | GitHub |
| Tencent-Hunyuan/HunyuanVideo-1.5 | HunyuanVideo 1.5 official repo | GitHub |
| Lightricks/LTX-2 | LTX-2 official repo | GitHub |
| Lightricks/ComfyUI-LTXVideo | LTX ComfyUI integration | GitHub |
| kohya-ss/musubi-tuner | HunyuanVideo LoRA training tool | GitHub |
| Wan-Video/Wan2.1 | Wan 2.1 official repo | GitHub |
| zai-org/CogVideo | CogVideoX repo | GitHub |
| genmoai/mochi | Mochi 1 repo | GitHub |
HuggingFace Model Pages
| Model | Link |
|---|---|
| tencent/HunyuanVideo | HuggingFace |
| tencent/HunyuanVideo-1.5 | HuggingFace |
| tencent/HunyuanVideo-I2V | HuggingFace |
| Lightricks/LTX-2 | HuggingFace |
| Lightricks/LTX-Video | HuggingFace |
| Wan-AI/Wan2.1-T2V-14B | HuggingFace |
Diffusers Documentation
| Document | Link |
|---|---|
| HunyuanVideo Pipeline | Diffusers Docs |
| HunyuanVideo 1.5 Pipeline | Diffusers Docs |
| LTX-Video Pipeline | Diffusers Docs |