- 1. Introduction: The Current State of AI Video Generation and the Rise of Open Source
- 2. HunyuanVideo Overview
- 3. HunyuanVideo Architecture Deep Dive
- 4. HunyuanVideo Training Data and Methodology
- 5. HunyuanVideo Model Specifications and Performance
- 6. HunyuanVideo Practical Usage
- 7. LTX-Video Overview
- 8. LTX-2 Architecture Analysis
- 9. LTX-2 Key Features
- 10. LTX-2 Practical Usage
- 11. HunyuanVideo vs LTX-2 Detailed Comparison
- 12. Open-Source Video Generation Model Ecosystem Comparison
- 13. Prompt Engineering Tips
- 14. Future Outlook
- 15. References
1. Introduction: The Current State of AI Video Generation and the Rise of Open Source
2024-2025 marked the era when AI Video Generation technology entered the commercialization stage. Commercial services such as OpenAI Sora, Google Veo, Runway Gen-3, and Kling were released in succession, making the concept of "creating video from text" a reality. However, these commercial models come with limitations such as API costs, usage restrictions, and data privacy concerns.
In this context, open-source video generation models have grown rapidly, beginning to achieve quality on par with commercial models. In particular, Tencent's HunyuanVideo and Lightricks' LTX-2 form the two pillars of open-source video generation, each with different philosophies and strengths.
[AI Video Generation Model Development Timeline]
- 2024 Q1-Q2: Sora Preview, Runway Gen-3, Pika 1.0
- 2024 Q3-Q4: HunyuanVideo, CogVideoX, Kling 1.0, Mochi Preview
- 2025 Q1-Q2: Wan 2.1, LTX-Video 1.0, Mochi 1
- 2025 Q3-Q4: HunyuanVideo 1.5, Wan 2.2 (MoE), LTX-2 Preview, Sora 2
- 2026 Q1: LTX-2 Open, Wan 2.6, Veo 3.1
[Open Source vs Commercial Model Competition]
Commercial:  Sora --> Sora 2 --> Veo 3.1 --> Kling 3.5
Open Source: CogVideoX --> HunyuanVideo --> Wan 2.1 --> LTX-2
Quality Gap: commercial advantage --> gap narrows (rapid catch-up) --> on par (benchmark parity) --> surpassing (speed/access advantage)
This article provides an in-depth, paper-based analysis of the HunyuanVideo and LTX-2 architectures, compares their benchmark performance, surveys the broader open-source ecosystem (Wan 2.1, CogVideoX, Mochi, and more), and closes with prompt engineering tips and a practical usage guide.
2. HunyuanVideo Overview
2.1 Tencent Research Team
HunyuanVideo is a large-scale video generation model developed by Tencent's Hunyuan AI research team. The Tencent Hunyuan team has experience developing various generative AI models including HunyuanDiT (image generation) and Hunyuan3D (3D generation), and leveraged this technical expertise to enter the video generation domain.
Key Contributions of the Tencent Hunyuan Team:
| Model | Domain | Key Features |
|---|---|---|
| HunyuanDiT | Text-to-Image | Bilingual (Chinese/English), DiT arch |
| Hunyuan3D | 3D Generation | 3D model generation from text/image |
| HunyuanVideo | Text/Image-to-Video | 13B parameters, largest open-source at release |
| HunyuanVideo 1.5 | Text/Image-to-Video | 8.3B, consumer GPU support |
2.2 Largest Open-Source Video Generation Model
Released in December 2024, HunyuanVideo has 13B (13 billion) parameters, making it the largest open-source video generation model at the time of release. This significantly exceeds competing models such as CogVideoX (5B-10B) and Mochi (10B).
HunyuanVideo Core Specs:
| Item | HunyuanVideo | HunyuanVideo 1.5 |
|---|---|---|
| Parameters | 13B | 8.3B |
| Release Date | December 2024 | November 2025 |
| Architecture | Dual-to-Single Stream DiT | Improved DiT |
| Text Encoder | MLLM (Decoder-Only) | Improved MLLM |
| VAE | 3D Causal VAE | 3D Causal VAE (improved) |
| Training | Flow Matching | Flow Matching |
| Max Resolution | 720p (1280x720) | 720p |
| Max Frames | 129 frames | 129 frames |
| License | Tencent Hunyuan Community | Tencent Hunyuan Community |
2.3 Text-to-Video and Image-to-Video Support
HunyuanVideo supports two core capabilities:
Text-to-Video (T2V): Generates high-quality video from text prompts alone. Describe scenes, actions, and atmosphere in natural language and it creates matching video.
Image-to-Video (I2V): Takes a static image as input and transforms it into video with natural motion added. The HunyuanVideo-I2V model released separately in March 2025 handles this functionality.
[HunyuanVideo Input/Output Pipeline]
Text-to-Video:  "A golden retriever running through a sunlit meadow"
                    --> HunyuanVideo Pipeline --> Video Output (MP4)
Image-to-Video: [Input Image] + Prompt
                    --> HunyuanVideo Pipeline --> Video Output (MP4)

Pipeline internals: MLLM Encoder -> DiT Denoiser -> 3D VAE
3. HunyuanVideo Architecture Deep Dive
HunyuanVideo's architecture consists of three core components: (1) MLLM Text Encoder, (2) 3D Causal VAE, (3) Dual-Stream to Single-Stream DiT.
[HunyuanVideo Full Architecture Diagram]
Text Prompt
    |
    v
MLLM Text Encoder (Decoder-Only LLM)
    |
    v
Text Tokens (with Bidirectional Token Refiner)
    |
    v
Gaussian Noise + Text Tokens
    |
    v
Dual-Stream to Single-Stream DiT
    [Dual phase]   video tokens and text tokens processed
                   in independent blocks
    [Single phase] tokens concatenated and fused
    |
    v
Denoised Latent
    |
    v
3D VAE Decoder
    |
    v
Final Video
3.1 Dual-Stream to Single-Stream DiT Design
The most distinctive architectural element of HunyuanVideo is its "Dual-Stream to Single-Stream" Diffusion Transformer (DiT) design. This is the core design philosophy that differentiates it from existing DiT models.
Dual-Stream Phase (Early Layers):
In the Dual-Stream phase, video tokens and text tokens are processed through independent Transformer blocks. Each modality can learn its own appropriate modulation mechanisms without interfering with the other.
# Dual-Stream Phase Pseudocode
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.video_attn = MultiHeadAttention(dim, num_heads)
        self.text_attn = MultiHeadAttention(dim, num_heads)
        self.video_ffn = FeedForward(dim)
        self.text_ffn = FeedForward(dim)
        self.video_norm = AdaLayerNorm(dim)
        self.text_norm = AdaLayerNorm(dim)

    def forward(self, video_tokens, text_tokens, timestep):
        # Independent video token processing (pre-norm residual blocks)
        video_tokens = video_tokens + self.video_attn(
            self.video_norm(video_tokens, timestep))
        video_tokens = video_tokens + self.video_ffn(video_tokens)
        # Independent text token processing
        text_tokens = text_tokens + self.text_attn(
            self.text_norm(text_tokens, timestep))
        text_tokens = text_tokens + self.text_ffn(text_tokens)
        return video_tokens, text_tokens
Single-Stream Phase (Later Layers):
In the Single-Stream phase, video tokens and text tokens are concatenated and processed together in a single Transformer block. This enables effective multimodal information fusion.
# Single-Stream Phase Pseudocode
import torch
import torch.nn as nn

class SingleStreamBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ffn = FeedForward(dim)
        self.norm = AdaLayerNorm(dim)

    def forward(self, video_tokens, text_tokens, timestep):
        num_video = video_tokens.shape[1]
        # Concatenate video + text tokens along the sequence dimension
        combined = torch.cat([video_tokens, text_tokens], dim=1)
        # Unified processing (full attention across both modalities)
        combined = combined + self.attn(self.norm(combined, timestep))
        combined = combined + self.ffn(combined)
        # Split back into per-modality streams
        return combined[:, :num_video], combined[:, num_video:]
Advantages of the Dual-to-Single Design:
| Characteristic | Dual-Stream Only | Single-Stream Only | Dual-to-Single (HunyuanVideo) |
|---|---|---|---|
| Per-modality learning | Excellent | Limited | Excellent (early phase) |
| Cross-modal fusion | Weak | Strong | Strong (later phase) |
| Computational efficiency | High | Moderate | High |
| Text-video alignment | Low | High | High |
| Model flexibility | High | Low | Very high |
3.2 3D VAE (Causal VAE) - Spatiotemporal Compression
HunyuanVideo uses a 3D Causal VAE to compress pixel-space video into a compact latent space. This VAE is built on CausalConv3D and efficiently compresses both temporal and spatial information.
Compression Ratios:
| Dimension | Ratio | Description |
|---|---|---|
| Temporal | 4x | 129 frames to 33 latent frames |
| Spatial | 8x x 8x | 720x1280 to 90x160 |
| Channel | 3ch to 16ch | RGB 3ch to latent 16ch |
Overall Compression Effect:
Input Video: 720 x 1280 x 129 frames x 3 channels
= ~356M values
Latent: 90 x 160 x 33 x 16 channels
= ~7.6M values
Compression: ~47:1 (by element count)
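The element counts behind the ~47:1 figure can be checked directly (all sizes taken from the tables above):

```python
# Sanity-check the element counts behind the ~47:1 compression figure.
# Sizes are from the tables above: 720p, 129 frames, 16-channel latent.

pixel_elements = 720 * 1280 * 129 * 3    # H x W x frames x RGB channels
latent_elements = 90 * 160 * 33 * 16     # h x w x latent frames x channels

ratio = pixel_elements / latent_elements
print(f"pixels: {pixel_elements:,}")     # 356,659,200 (~356M)
print(f"latent: {latent_elements:,}")    # 7,603,200 (~7.6M)
print(f"ratio:  {ratio:.1f}:1")          # 46.9:1
```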
Causal VAE Characteristics:
The Causal VAE maintains temporal causality in its design, meaning each frame is encoded referencing only information from previous frames. This allows images and videos to be processed by the same VAE. The first frame is treated as an image without temporal compression, while subsequent frames have temporal compression applied considering their relationship to previous frames.
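The causal constraint can be illustrated with a minimal 1D temporal convolution. This is a toy stand-in for CausalConv3D (temporal axis only, not the actual implementation): padding is applied only on the past side, so each output frame depends solely on the current and earlier frames.

```python
def causal_temporal_conv(frames, kernel):
    """Toy 1D causal convolution over a frame sequence.

    Output t depends only on frames[max(0, t-k+1) .. t]: the sequence is
    padded on the *past* side only, mirroring CausalConv3D's temporal axis.
    """
    k = len(kernel)
    padded = [frames[0]] * (k - 1) + list(frames)  # replicate-pad the past
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]

frames = [1.0, 2.0, 3.0, 4.0]
out = causal_temporal_conv(frames, kernel=[0.5, 0.5])
print(out)  # [1.0, 1.5, 2.5, 3.5] -- frame t never sees frame t+1
```

Because no output position reads future frames, the same operator handles a single image (one frame) and a full video identically, which is the property the unified image-video design relies on.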
3.3 MLLM Text Encoder
Another innovation of HunyuanVideo is its adoption of a Multimodal Large Language Model (MLLM) as the text encoder. This contrasts with existing video/image generation models that primarily use CLIP or T5 as text encoders.
Comparison with Existing Text Encoders:
| Characteristic | CLIP | T5-XXL | MLLM (HunyuanVideo) |
|---|---|---|---|
| Architecture | Encoder-Only | Encoder-Decoder | Decoder-Only |
| Parameters | ~400M | ~4.7B | Tens of billions |
| Image-text alignment | Excellent | Moderate | Very excellent |
| Detail understanding | Limited | Excellent | Very excellent |
| Complex reasoning | Weak | Moderate | Strong |
| Zero-shot ability | Limited | Moderate | Excellent |
| Attention type | Causal | Bidirectional | Causal + Refiner |
Bidirectional Token Refiner:
MLLMs inherently use causal attention due to their Decoder-Only structure, but bidirectional attention is more effective as text conditioning for diffusion models. To solve this, HunyuanVideo introduces an additional Bidirectional Token Refiner.
[Text Encoding Pipeline]
Text Prompt
|
v
+----------+ +--------------+ +------------------+
| MLLM | --> | Bidirectional| --> | Final Text |
| (Causal | | Token | | Embedding |
| Attn) | | Refiner | | (DiT condition) |
+----------+ +--------------+ +------------------+
  rich semantics    bidirectional context    diffusion-optimized text representation
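The gap the refiner bridges comes down to the attention mask. A minimal illustration of the two mask shapes (mask construction only, not the refiner itself):

```python
def causal_mask(n):
    # Token i may attend only to tokens 0..i (lower-triangular mask),
    # as in the decoder-only MLLM.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Every token attends to every other token, as in the refiner.
    return [[1] * n for _ in range(n)]

print(causal_mask(4))         # later prompt words invisible to earlier positions
print(bidirectional_mask(4))  # all prompt tokens see each other
```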
3.4 Flow Matching Training Method
HunyuanVideo adopts Flow Matching instead of traditional DDPM (Denoising Diffusion Probabilistic Model). Flow Matching learns the optimal transport path between data and noise distributions.
DDPM vs Flow Matching:
| Characteristic | DDPM | Flow Matching |
|---|---|---|
| Noise schedule | Must be predefined | Flexible design |
| Training target | Noise prediction | Vector field pred. |
| Convergence | Slow | Fast |
| Inference path | Curved | Straight (efficient) |
| Sampling steps | Many (20-50) | Fewer (20-30) |
# Flow Matching Training Pseudocode
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x_0, text_cond):
    """
    x_0: original video latent, shape [B, C, T, H, W]
    text_cond: text condition embedding
    """
    # Random timestep per sample in [0, 1)
    t = torch.rand(x_0.shape[0], device=x_0.device)
    # Broadcast t over the channel/frame/spatial dims
    t_ = t.view(-1, 1, 1, 1, 1)
    # Noise sampling
    noise = torch.randn_like(x_0)
    # Linear interpolation: x_0 at t=0, pure noise at t=1
    x_t = (1 - t_) * x_0 + t_ * noise
    # Target vector field: constant velocity from data toward noise
    target = noise - x_0
    # Model's vector field prediction
    predicted = model(x_t, t, text_cond)
    # Regression loss on the predicted velocity
    loss = F.mse_loss(predicted, target)
    return loss
3.5 Unified Image-Video Training Strategy
HunyuanVideo trains on images and videos within a unified framework. Images are treated as single-frame videos and processed by the same model architecture.
3.6 Full Attention Mechanism
HunyuanVideo applies Full Attention across both temporal and spatial dimensions. This contrasts with many video generation models that separate spatial and temporal attention to reduce computation.
| Attention Type | Description | Example Model |
|---|---|---|
| Spatial-Only | Spatial dimension only | Early video models |
| Temporal-Only | Temporal dimension only | AnimateDiff |
| Spatial + Temporal (split) | Applied alternately | CogVideoX |
| Full 3D Attention | Full spatiotemporal attn | HunyuanVideo |
Full Attention allows every token in the video to interact spatiotemporally with every other token, yielding more coherent motion and higher visual quality, at the cost of significantly increased computation.
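The cost gap can be quantified by counting pairwise attention interactions. The figures below assume a 2x2 spatial patchify of the 33x90x160 latent grid (the exact patch size is an assumption here, used only for illustration):

```python
# Count pairwise attention interactions: full 3D vs. split attention.
# Latent grid: 33 frames x 45 x 80 tokens, assuming a 2x2 spatial
# patchify of the 33 x 90 x 160 latent (illustrative, not official).
T, S = 33, 45 * 80              # temporal length, spatial tokens per frame
N = T * S                       # total tokens seen by full 3D attention

full_pairs = N ** 2                   # every token attends to every token
split_pairs = T * S**2 + S * T**2     # spatial-only + temporal-only passes

print(f"tokens: {N:,}")                                        # 118,800
print(f"full / split cost: {full_pairs / split_pairs:.1f}x")   # 32.7x
```

This ratio (roughly T*S / (S + T)) is why full 3D attention is so much more expensive than split attention at video-scale token counts.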
4. HunyuanVideo Training Data and Methodology
4.1 Large-Scale Data Curation Pipeline
HunyuanVideo's training data is prepared through a systematic curation pipeline involving multiple stages of filtering and evaluation from raw data to final training data.
4.2 Multi-Stage Training Strategy
HunyuanVideo employs a Progressive Training strategy, starting from low resolution and gradually increasing resolution.
Training Stage Settings:
| Stage | Resolution | Frames | Batch Size | Primary Goal |
|---|---|---|---|---|
| Stage 1 | 256x256 | 17 | Large | Basic visual concepts |
| Stage 2 | 512x512 | 33 | Medium | Detail learning |
| Stage 3 | 960x544 / 544x960 | 65 | Small | High-resolution adapt. |
| Stage 4 | 1280x720 / 720x1280 | 129 | Very small | Final quality fine-tune |
5. HunyuanVideo Model Specifications and Performance
5.1 Supported Resolutions and Frames
| Resolution | Aspect Ratio | Use Case |
|---|---|---|
| 1280 x 720 | 16:9 | Landscape HD |
| 720 x 1280 | 9:16 | Portrait (mobile) |
| 960 x 544 | ~16:9 | Medium resolution |
| 544 x 960 | ~9:16 | Medium portrait |
| 720 x 720 | 1:1 | Square |
Frame Settings:
| Setting | Value | Notes |
|---|---|---|
| Max frames | 129 frames | 33 latent frames after 4x VAE compress |
| FPS | 24 fps | Standard cinema framerate |
| Video length | ~5.4 sec | 129 / 24 = 5.375 seconds |
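The relationships in the table follow directly from the VAE's 4x temporal compression with the first frame kept uncompressed, which is also why frame counts take the 4k+1 form:

```python
def latent_frames(num_frames):
    # First frame is uncompressed; the rest are compressed 4x temporally,
    # which is why num_frames must have the form 4k + 1.
    assert (num_frames - 1) % 4 == 0, "num_frames must be 4k + 1"
    return 1 + (num_frames - 1) // 4

print(latent_frames(129))   # 33 latent frames
print(latent_frames(65))    # 17 latent frames (Stage 3 training setting)
print(129 / 24)             # 5.375 -> ~5.4 seconds at 24 fps
```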
5.2 Benchmark Comparison
VBench Evaluation Results:
| Model | Overall | Visual Quality | Text Alignment | Motion Quality | Human Fidelity |
|---|---|---|---|---|---|
| HunyuanVideo | Top tier | 96.4% | 68.5% | 64.5% | Excellent |
| Sora | Top tier | Excellent | Moderate | Excellent | Very excellent |
| CogVideoX-1.5 | Upper | Excellent | Strong | Moderate | Weak |
| Kling 1.6 | Top tier | Excellent | Excellent | Excellent | Excellent |
HunyuanVideo shows particularly strong results in Human Fidelity and Motion Rationality dimensions.
5.3 Competitive Model Comparison
| Comparison | HunyuanVideo | Sora 2 | Runway Gen-3 | Kling 3.5 |
|---|---|---|---|---|
| Access | Open source | Commercial | Commercial | Commercial |
| Parameters | 13B | Undisclosed | Undisclosed | Undisclosed |
| Max Resolution | 720p | 1080p | 1080p | 1080p |
| Max Length | ~5 sec | Up to 20s | Up to 10s | Up to 10s |
| Local Run | Possible | Not possible | Not possible | Not possible |
| Customization | LoRA supported | Not possible | Limited | Not possible |
| Cost | Free (GPU req.) | API billing | Subscription | API billing |
6. HunyuanVideo Practical Usage
6.1 HuggingFace Model Download
# HunyuanVideo original model (13B)
pip install huggingface_hub
huggingface-cli download tencent/HunyuanVideo --local-dir ./HunyuanVideo
# HunyuanVideo 1.5 (8.3B, lighter version)
huggingface-cli download tencent/HunyuanVideo-1.5 --local-dir ./HunyuanVideo-1.5
# Image-to-Video model
huggingface-cli download tencent/HunyuanVideo-I2V --local-dir ./HunyuanVideo-I2V
6.2 Diffusers Library Inference Code
Basic Text-to-Video Inference:
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
# Load model
model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
)
pipe = HunyuanVideoPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()
pipe.to("cuda")
# Generate video
output = pipe(
prompt="A cat walks on the grass, realistic style, natural lighting",
height=720,
width=1280,
num_frames=129,
num_inference_steps=30,
guidance_scale=6.0,
).frames[0]
# Save video
export_to_video(output, "hunyuan_output.mp4", fps=24)
4-bit Quantization for VRAM Savings:
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video
from diffusers import BitsAndBytesConfig
# INT4 quantization config
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
)
# Load quantized transformer
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
"tencent/HunyuanVideo",
subfolder="transformer",
quantization_config=quant_config,
)
pipe = HunyuanVideoPipeline.from_pretrained(
"tencent/HunyuanVideo",
transformer=transformer,
torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()
# CPU offload for additional VRAM savings
pipe.enable_model_cpu_offload()
output = pipe(
prompt="A beautiful sunset over the ocean, cinematic",
height=544,
width=960,
num_frames=65,
num_inference_steps=30,
guidance_scale=6.0,
).frames[0]
export_to_video(output, "quantized_output.mp4", fps=24)
6.3 Key Parameter Guide
| Parameter | Default | Range | Description |
|---|---|---|---|
| guidance_scale | 6.0 | 1.0-15.0 | Prompt fidelity (higher = more faithful to prompt) |
| num_inference_steps | 30 | 20-50 | Denoising steps (higher = better quality, slower) |
| height | 720 | 256-720 | Video height (multiple of 8) |
| width | 1280 | 256-1280 | Video width (multiple of 8) |
| num_frames | 129 | 17-129 | Total frames (4k+1 form recommended) |
| seed | Random | Integer | Seed for reproducibility |
6.4 GPU VRAM Requirements
| Configuration | Required VRAM | Resolution | Notes |
|---|---|---|---|
| FP32 (original) | 80GB+ | 720p 129f | A100/H100 required |
| BF16/FP16 | ~40GB | 720p 129f | A100 40GB |
| FP8 quantization | ~24-30GB | 720p 129f | RTX 4090 capable |
| INT4 quant + CPU Offload | ~14-16GB | 544p 65f | RTX 4080 capable |
| HunyuanVideo 1.5 (FP8) | ~14GB | 480p | Consumer GPU |
6.5 LoRA Fine-tuning
HunyuanVideo supports LoRA fine-tuning to learn specific styles, characters, or motion patterns.
Key LoRA Training Tools:
| Tool | Features | Min VRAM |
|---|---|---|
| Musubi Tuner (kohya-ss) | Most popular LoRA training tool | 24GB |
| ai-toolkit (ostris) | Multi-model support | 24GB |
| diffusion-pipe (tdrussell) | Pipeline-based training | 24GB |
| FineTrainers (HuggingFace) | Official Diffusers-based tool | 24GB |
| fal.ai LoRA Training | Cloud-based, no setup needed | Cloud |
7. LTX-Video Overview
7.1 Lightricks Company
Lightricks is an AI-based creative technology company headquartered in Jerusalem, Israel. Founded in 2013, it is widely known for consumer photo/video editing apps such as Facetune, Videoleap, and Photoleap. Leveraging experience in mobile creative tools, it entered the AI video generation space.
7.2 Evolution from LTX-Video 1.0 to LTX-2
| Version | Release | Params | Key Features |
|---|---|---|---|
| LTX-Video 0.9 | Nov 2024 | ~2B | First open-source, real-time |
| LTX-Video 0.9.8 (13B) | Mid 2025 | 13B | Distilled variants, improved quality |
| LTX-2 | Oct 2025 | 19B | Audio+video simultaneous gen. |
| LTX-2 (open source) | Jan 2026 | 19B | Full weights/code released |
7.3 Near Real-Time Video Generation Speed
The biggest differentiator of the LTX-Video series is its faster-than-real-time video generation speed. LTX-Video was among the first DiT-based video generation models to achieve real-time generation.
[Generation Speed Comparison (5-second video, H100 GPU, 768x512 resolution)]
| Model | Gen Time | vs Real-time |
|---|---|---|
| LTX-Video 1.0 | ~2 sec | 2.5x faster |
| LTX-2 | ~3-5 sec | ~real-time |
| HunyuanVideo | ~2-5 min | 60x slower |
| CogVideoX | ~3-8 min | 100x slower |
| Mochi | ~5-10 min | 120x slower |
7.4 Text-to-Video and Image-to-Video Support
LTX-2 offers simultaneous Audio-Video generation in addition to Text-to-Video and Image-to-Video.
| Feature | LTX-Video 1.0 | LTX-2 |
|---|---|---|
| Text-to-Video | Supported | Supported |
| Image-to-Video | Supported | Supported |
| Audio generation | Not supported | Synchronized audio co-gen |
| 4K resolution | Not supported | Native 4K (3840x2160) |
| 50fps | Not supported | Supported |
| Keyframe Conditioning | Limited | Full support |
8. LTX-2 Architecture Analysis
8.1 Overall Architecture
LTX-2 consists of three core components: (1) Modality-specific VAE, (2) Text embedding pipeline, (3) Asymmetric Dual Stream DiT.
[LTX-2 Full Architecture]
Text Prompt
|
v
+-------------+
| Text Encoder | (Gemma-based)
| + Prompt |
| Enhancer |
+------+------+
|
v
+------------------------------------------+
| Asymmetric Dual Stream DiT |
| |
| +------------------+ +-------------+ |
| | Video Stream | | Audio Stream| |
| | (wide channels, | | (narrow, | |
| | high capacity) | | lightweight)| |
| +--------+---------+ +------+------+ |
| | Cross-Attention | |
| +----------+-----------+ |
+------------------------------------------+
| |
v v
+-------------+ +-------------+
| Video VAE | | Audio VAE |
| Decoder | | Decoder |
| (3D spatio- | | (1D temporal)|
| temporal) | | |
+------+------+ +------+------+
| |
v v
Video Output Audio Output
| |
+--------+-----------+
|
v
Final AV Output (MP4)
8.2 Video VAE (High Compression Ratio - 1:192)
LTX-2's Video VAE achieves a very high compression ratio of 1:192. This is approximately 4x higher than HunyuanVideo's ~47:1 ratio.
VAE Compression Comparison:
| Model | Spatial | Temporal | Latent Ch | Overall Ratio |
|---|---|---|---|---|
| LTX-2 | 32x32 | 8x | 128ch | 1:192 |
| HunyuanVideo | 8x8 | 4x | 16ch | ~1:47 |
| CogVideoX | 8x8 | 4x | 16ch | ~1:47 |
| Wan 2.1 | 8x8 | 4x | 16ch | ~1:47 |
High compression provides:
- Fewer latent tokens: Greatly reduces tokens the DiT must process, improving inference speed
- Memory efficiency: Enables high-resolution video processing with less VRAM
- Faster training: Reduces computation needed during training
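The headline ratios in the comparison table can be derived from each model's per-axis factors (RGB input, per compression block):

```python
def vae_ratio(spatial, temporal, latent_ch, in_ch=3):
    """Pixels-to-latent element ratio for one VAE compression block."""
    return (spatial * spatial * temporal * in_ch) / latent_ch

print(vae_ratio(32, 8, 128))   # LTX-2:        192.0 -> "1:192"
print(vae_ratio(8, 4, 16))     # HunyuanVideo:  48.0 -> "~1:47"
```

HunyuanVideo's per-block figure is 48; the ~47:1 quoted earlier reflects the uncompressed first frame in an actual 129-frame video.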
8.3 Asymmetric Dual Stream DiT
LTX-2's DiT adopts an asymmetric dual stream structure, reflecting the characteristic differences between video and audio modalities.
Rationale for Asymmetric Design:
| Characteristic | Video Stream | Audio Stream |
|---|---|---|
| Dimension | 3D (spatial + temporal) | 1D (temporal) |
| Complexity | High (spatiotemporal) | Medium (temporal) |
| Channel width | Wide (high capacity) | Narrow (lightweight) |
| Positional Embedding | 3D positional | 1D temporal |
| Data characteristics | Pixel-based visual | Frequency-based audio |
8.4 Text Encoder
LTX-2 uses a Gemma-based text encoder. The enhance_prompt feature can automatically expand simple user prompts for better results.
8.5 Speed Optimization Techniques
| Optimization | Description | Speedup |
|---|---|---|
| High VAE compression | Greatly reduces latent token count | Key factor |
| Distilled inference | 8-step distilled model available | 5-10x |
| FP8 Transformer | Quantized weights | ~2x |
| Two-Stage Pipeline | Stage 1 (gen) + Stage 2 (upscale) | Efficient |
| Gradient Estimation | Reduce steps from 40 to 20-30 | ~1.5x |
9. LTX-2 Key Features
9.1 Real-Time Generation Speed
| Resolution | Frames | Length | Gen Time (H100) | vs Real-time |
|---|---|---|---|---|
| 768x512 | 121 | 5 sec | ~2 sec | 2.5x faster |
| 1216x704 | 121 | 5 sec | ~5 sec | ~real-time |
| 1920x1080 | 121 | 5 sec | ~15 sec | 3x slower |
| 3840x2160 | 121 | 5 sec | ~60 sec | 12x slower |
9.2 High Resolution and Various Output Options
Supported Resolutions:
| Resolution | Aspect | Use Case | VRAM Required |
|---|---|---|---|
| 768 x 512 | 3:2 | Rapid prototyping | ~8-12GB |
| 1216 x 704 | ~16:9 | Standard prod. | ~16GB |
| 1920 x 1080 | 16:9 | Full HD | ~24GB |
| 3840 x 2160 | 16:9 | 4K UHD | 48GB+ |
9.3 Synchronized Audio-Video Generation
One of LTX-2's innovative features is generating audio and video simultaneously. Sound matching the video content is automatically generated without a separate audio generation model.
9.4 Keyframe Conditioning
LTX-2 supports Keyframe Conditioning, allowing you to specify certain frames and naturally fill in between them.
9.5 LoRA Support
LTX-2 officially supports LoRA training and inference, with training code included in the GitHub repository.
10. LTX-2 Practical Usage
10.1 Installation and Environment Setup
# 1. Python environment (3.10+ recommended)
conda create -n ltx2 python=3.10
conda activate ltx2
# 2. Install official LTX-2 package
pip install ltx-pipelines
# 3. Or install from source
git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2
pip install -e "packages/ltx-pipelines[all]"
pip install -e "packages/ltx-core[all]"
# 4. Download model weights
huggingface-cli download Lightricks/LTX-2 --local-dir ./models/ltx2
10.2 Python Inference Code Examples
Text-to-Video Basic Example:
from ltx_pipelines import TI2VidTwoStagesPipeline
# Initialize pipeline
pipe = TI2VidTwoStagesPipeline.from_pretrained(
"Lightricks/LTX-2",
device_map="auto",
enable_fp8=True, # Save VRAM with FP8
)
# Generate video
result = pipe(
prompt="A serene mountain lake at sunrise, mist rising from the water, "
"birds flying overhead, cinematic quality",
negative_prompt="blurry, low quality, distorted",
height=704,
width=1216,
num_frames=121,
frame_rate=24,
num_inference_steps=30,
cfg_guidance_scale=7.5,
seed=42,
enhance_prompt=True, # Auto prompt enhancement
)
# Save
result.save("ltx2_output.mp4")
Image-to-Video Example:
from ltx_pipelines import TI2VidTwoStagesPipeline
from PIL import Image
pipe = TI2VidTwoStagesPipeline.from_pretrained(
"Lightricks/LTX-2",
device_map="auto",
enable_fp8=True,
)
# Load input image
input_image = Image.open("input_photo.jpg")
# I2V generation
result = pipe(
prompt="The scene comes alive with gentle wind blowing through the trees",
images=[input_image],
height=704,
width=1216,
num_frames=121,
frame_rate=24,
num_inference_steps=30,
cfg_guidance_scale=7.5,
seed=42,
)
result.save("ltx2_i2v_output.mp4")
10.3 Key Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| prompt | Required | String | Video description |
| negative_prompt | None | String | Elements to exclude |
| height | 704 | Multiple of 32 | Video height |
| width | 1216 | Multiple of 32 | Video width |
| num_frames | 121 | 8k+1 form | Total frame count |
| frame_rate | 24 | 24/30/50 | Frames per second |
| num_inference_steps | 30 | 8-50 | Denoising steps |
| cfg_guidance_scale | 7.5 | 1.0-15.0 | Prompt fidelity |
| seed | Random | Integer | Reproducibility seed |
| enhance_prompt | False | True/False | Auto prompt enhancement |
| enable_fp8 | False | True/False | Use FP8 quantization |
10.4 GPU Requirements
| GPU | VRAM | Recommended Res. | Notes |
|---|---|---|---|
| RTX 3060/4060 | 8-12GB | 540p, 4 sec | FP8 required, basic |
| RTX 3080/4070 Ti | 12-16GB | 768x512, 5 sec | FP8 recommended |
| RTX 4090 | 24GB | 1080p, 5 sec | Standard use |
| A100 | 40-80GB | 4K, 10 sec | Production |
| H100 | 80GB | 4K, 10 sec | Optimal performance |
11. HunyuanVideo vs LTX-2 Detailed Comparison
11.1 Architecture Comparison
| Item | HunyuanVideo | LTX-2 |
|---|---|---|
| Parameters | 13B (v1) / 8.3B (v1.5) | 19B |
| DiT Structure | Dual-to-Single Stream | Asymmetric Dual Stream |
| VAE Structure | 3D Causal VAE | Video VAE + Audio VAE |
| VAE Ratio | ~1:47 | 1:192 |
| Spatial Comp. | 8x8 | 32x32 |
| Temporal Comp. | 4x | 8x |
| Latent Ch | 16 | 128 |
| Text Encoder | MLLM (Decoder-Only) | Gemma |
| Training | Flow Matching | Diffusion (Flow-based) |
| Attention | Full 3D Attention | Bidirectional Cross-Attn |
11.2 Performance and Quality Comparison
| Comparison | HunyuanVideo | LTX-2 | Winner |
|---|---|---|---|
| Visual quality | Very high | High | HunyuanVideo |
| Motion natural. | Very high | High | HunyuanVideo |
| Text alignment | High | High | Tie |
| Human generation | Excellent | Good | HunyuanVideo |
| Max resolution | 720p | 4K | LTX-2 |
| Audio generation | Not supported | Synced gen | LTX-2 |
| Frame rate | 24fps | Up to 50fps | LTX-2 |
11.3 Speed Comparison
| Condition | HunyuanVideo | LTX-2 | Difference |
|---|---|---|---|
| 768x512, 5s (H100) | ~120 sec | ~3 sec | LTX-2 ~40x faster |
| 1280x720, 5s (H100) | ~300 sec | ~10 sec | LTX-2 ~30x faster |
| 1280x720, 5s (RTX 4090) | ~600 sec | ~30 sec | LTX-2 ~20x faster |
11.4 Use Case Model Selection Guide
[Recommended Model by Scenario]
"I need the highest quality video"
--> HunyuanVideo (v1, 13B)
Reason: Full Attention + 13B for best visual quality
"I want to run locally on a consumer GPU"
--> LTX-2 (FP8) or HunyuanVideo 1.5
Reason: LTX-2 runs on 12GB, HV 1.5 on 14GB
"I need fast iterative work"
--> LTX-2 (Distilled)
Reason: Near real-time generation speed
"I need video with audio"
--> LTX-2
Reason: Only model with simultaneous AV generation
"I need specific character/style training"
--> HunyuanVideo + LoRA
Reason: Rich LoRA ecosystem
"I need 4K high resolution"
--> LTX-2
Reason: Native 4K support
"Human/face generation is important"
--> HunyuanVideo
Reason: Excellent Human Fidelity benchmark
12. Open-Source Video Generation Model Ecosystem Comparison
12.1 Comprehensive Model Comparison
| Item | HunyuanVideo | LTX-2 | Wan 2.1 | CogVideoX | Mochi 1 |
|---|---|---|---|---|---|
| Developer | Tencent | Lightricks | Alibaba | Zhipu/Tsinghua | Genmo |
| Parameters | 13B | 19B | 1.3B / 14B | 5B / 10B | 10B |
| Max Res. | 720p | 4K | 720p | 720p | 480p |
| Max Length | ~5 sec | ~10 sec | ~5 sec | ~6 sec | ~5.4 sec |
| Max FPS | 24 | 50 | 24 | 30 | 30 |
| VAE Ratio | 1:47 | 1:192 | 1:47 | 1:47 | 1:12 |
| Audio | Not supported | Supported | V2A separate | Not supported | Not supported |
| Min VRAM | 14GB (v1.5) | 8-12GB | 8GB (1.3B) | 4.4GB (INT8) | 20GB (ComfyUI) |
| Speed | Slow | Very fast | Moderate | Moderate | Slow |
| I2V Support | Separate model | Integrated | Integrated | Supported | Not supported |
| LoRA | Supported | Supported | Supported | Supported | Limited |
13. Prompt Engineering Tips
13.1 Effective Video Prompt Writing
Prompt Structure (SAEC Framework):
[Subject] + [Action] + [Environment] + [Camera/Cinematography]
S (Subject): Subject - what/who is the main focus
A (Action): Action - what is happening
E (Environment): Environment - where, what atmosphere
C (Camera): Camera - how it is filmed
Good vs Bad Prompts:
| Type | Prompt | Issue/Strength |
|---|---|---|
| Bad | "A nice video of nature" | Too vague |
| Average | "A dog running in a park" | Not specific enough |
| Good | "A golden retriever running through a sunlit meadow, wildflowers swaying, warm golden hour lighting" | Specific + environment |
| Excellent | "Medium tracking shot of a golden retriever running joyfully through a sunlit meadow, wildflowers swaying gently in the breeze, warm golden hour lighting, shallow depth of field, 35mm cinematic lens, natural color grading" | Full SAEC application |
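The SAEC structure lends itself to a small helper (the function name and signature are illustrative, not part of any model's API):

```python
def saec_prompt(subject, action, environment, camera):
    """Assemble a video prompt from the SAEC components described above."""
    parts = (subject, action, environment, camera)
    return ", ".join(p.strip() for p in parts if p and p.strip())

prompt = saec_prompt(
    subject="a golden retriever",
    action="running joyfully through a sunlit meadow",
    environment="wildflowers swaying, warm golden hour lighting",
    camera="medium tracking shot, 35mm cinematic lens",
)
print(prompt)
```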
13.2 Cinematography Terminology
Camera Movements:
| Term | Description | Example |
|---|---|---|
| Pan | Horizontal turn | "Slow pan across the landscape" |
| Tilt | Vertical turn | "Tilt up to reveal the building" |
| Dolly | Forward/back | "Dolly in on the subject's face" |
| Tracking Shot | Following shot | "Tracking shot following the car" |
| Crane Shot | Crane | "Crane shot rising above the city" |
| Static | Fixed | "Static shot of the waterfall" |
| Handheld | Handheld | "Handheld camera, documentary style" |
13.3 Negative Prompt Usage
Universal Negative Prompt Template:
# Basic quality control
"blurry, low quality, distorted, deformed, ugly, bad anatomy,
watermark, text overlay, logo, grainy, noisy"
# Additional for human generation
"extra fingers, mutated hands, poorly drawn hands, poorly drawn face,
mutation, deformed, extra limbs, missing limbs"
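For human generation the two templates above are typically concatenated; a small helper that merges them and drops duplicate terms ("deformed" appears in both) might look like this (illustrative, not a library function):

```python
BASE_NEGATIVE = ("blurry, low quality, distorted, deformed, ugly, bad anatomy, "
                 "watermark, text overlay, logo, grainy, noisy")
HUMAN_NEGATIVE = ("extra fingers, mutated hands, poorly drawn hands, "
                  "poorly drawn face, mutation, deformed, extra limbs, missing limbs")

def merge_negatives(*templates):
    """Join comma-separated negative-prompt templates, deduplicating terms."""
    seen, terms = set(), []
    for template in templates:
        for term in (t.strip() for t in template.split(",")):
            if term and term not in seen:
                seen.add(term)
                terms.append(term)
    return ", ".join(terms)

print(merge_negatives(BASE_NEGATIVE, HUMAN_NEGATIVE))
```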
Negative Prompt Support by Model:
| Model | Negative Prompt | Recommendation |
|---|---|---|
| HunyuanVideo | Not officially supported | Use guidance_scale instead |
| LTX-2 | Supported | Actively recommended |
| Wan 2.1 | Supported | Actively recommended |
| CogVideoX | Supported | Actively recommended |
14. Future Outlook
14.1 Video Generation Model Development Directions
Key Development Directions:
| Direction | Current State | Expected Development |
|---|---|---|
| Video length | 5-10 sec | Expanding to minutes |
| Resolution | 720p-4K | 8K, HDR support |
| Physics accuracy | Basic | Precise physics simulation |
| Character consist. | Limited | Multi-shot narratives |
| Gen speed | Real-time to min | Real-time streaming |
| Multimodal | AV beginning | AV + subtitles + voice |
| Editing | Basic | AI-based auto editing |
| Interaction | None | Real-time interactive |
14.2 Notable Technology Trends
- MoE Architecture: Introduced in Wan 2.2, greatly improving model efficiency
- Distillation Techniques: Transferring large model knowledge to small models for speed
- Multimodal Integration: Complete integrated generation of video + audio + text
- LoRA Ecosystem Growth: Explosive growth of community-driven specialized models
- Edge Device Deployment: Possibility of video generation on mobile/edge devices
15. References
Papers
| Paper | Authors | Link |
|---|---|---|
| HunyuanVideo: A Systematic Framework For Large Video Generative Models | Tencent Hunyuan Team | arXiv:2412.03603 |
| HunyuanVideo 1.5 Technical Report | Tencent Hunyuan Team | arXiv:2511.18870 |
| LTX-Video: Realtime Video Latent Diffusion | Lightricks Research | arXiv:2501.00103 |
| LTX-2: Efficient Joint Audio-Visual Foundation Model | Lightricks Research | arXiv:2601.03233 |
GitHub Repositories
| Repository | Description | Link |
|---|---|---|
| Tencent-Hunyuan/HunyuanVideo | HunyuanVideo official repo | GitHub |
| Tencent-Hunyuan/HunyuanVideo-1.5 | HunyuanVideo 1.5 official repo | GitHub |
| Lightricks/LTX-2 | LTX-2 official repo | GitHub |
| Lightricks/ComfyUI-LTXVideo | LTX ComfyUI integration | GitHub |
| kohya-ss/musubi-tuner | HunyuanVideo LoRA training tool | GitHub |
| Wan-Video/Wan2.1 | Wan 2.1 official repo | GitHub |
| zai-org/CogVideo | CogVideoX repo | GitHub |
| genmoai/mochi | Mochi 1 repo | GitHub |
HuggingFace Model Pages
| Model | Link |
|---|---|
| tencent/HunyuanVideo | HuggingFace |
| tencent/HunyuanVideo-1.5 | HuggingFace |
| tencent/HunyuanVideo-I2V | HuggingFace |
| Lightricks/LTX-2 | HuggingFace |
| Lightricks/LTX-Video | HuggingFace |
| Wan-AI/Wan2.1-T2V-14B | HuggingFace |
Diffusers Documentation
| Document | Link |
|---|---|
| HunyuanVideo Pipeline | Diffusers Docs |
| HunyuanVideo 1.5 Pipeline | Diffusers Docs |
| LTX-Video Pipeline | Diffusers Docs |