HunyuanVideo and LTX-2 Complete Analysis: Architecture, Performance, and Practical Guide to Open-Source Video Generation Models

1. Introduction: The Current State of AI Video Generation and the Rise of Open Source

The years 2024-2025 marked AI video generation's entry into commercialization. Commercial services such as OpenAI's Sora, Google's Veo, Runway Gen-3, and Kling launched in quick succession, making the concept of "creating video from text" a reality. However, these commercial models come with limitations: API costs, usage restrictions, and data privacy concerns.

In this context, open-source video generation models have grown rapidly, beginning to achieve quality on par with commercial models. In particular, Tencent's HunyuanVideo and Lightricks' LTX-2 form the two pillars of open-source video generation, each with different philosophies and strengths.

[AI Video Generation Model Development Timeline]

2024 Q1-Q2          2024 Q3-Q4          2025 Q1-Q2          2025 Q3-Q4          2026 Q1
    |                   |                   |                   |                   |
    v                   v                   v                   v                   v
 Sora Preview       HunyuanVideo        Wan 2.1             HunyuanVideo 1.5    LTX-2 Open
 Runway Gen-3       CogVideoX           LTX-Video 1.0       Wan 2.2 (MoE)      Wan 2.6
 Pika 1.0           Kling 1.0           Mochi 1             LTX-2 Preview      Veo 3.1
                    Mochi Preview                            Sora 2

[Open Source vs Commercial Model Competition]

Commercial:   Sora --> Sora 2 --> Veo 3.1 --> Kling 3.5
                 \       \           \           \
                  \       \           \           v
Open Source:  CogVideoX -> HunyuanVideo -> Wan 2.1 -> LTX-2
                 \           \           \           \
                  v           v           v           v
             Quality Gap:  Gap Narrows:  On Par:     Surpassing:
             Commercial    Rapid         Benchmark   Speed/Access
             Advantage     Catch-up      Parity      Advantage

This article provides an in-depth, paper-based analysis of the HunyuanVideo and LTX-2 architectures, compares their benchmark performance, surveys the broader open-source ecosystem (Wan 2.1, CogVideoX, Mochi, and more), and closes with prompt engineering tips and a practical usage guide.


2. HunyuanVideo Overview

2.1 Tencent Research Team

HunyuanVideo is a large-scale video generation model developed by Tencent's Hunyuan AI research team. The Tencent Hunyuan team has experience developing various generative AI models including HunyuanDiT (image generation) and Hunyuan3D (3D generation), and leveraged this technical expertise to enter the video generation domain.

Key Contributions of the Tencent Hunyuan Team:

| Model | Domain | Key Features |
|---|---|---|
| HunyuanDiT | Text-to-Image | Bilingual (Chinese/English), DiT arch |
| Hunyuan3D | 3D Generation | 3D model generation from text/image |
| HunyuanVideo | Text/Image-to-Video | 13B parameters, largest open-source |
| HunyuanVideo 1.5 | Text/Image-to-Video | 8.3B, consumer GPU support |

2.2 Largest Open-Source Video Generation Model

Released in December 2024, HunyuanVideo has 13B (13 billion) parameters, making it the largest open-source video generation model at the time of release. This significantly exceeds competing models such as CogVideoX (5B-10B) and Mochi (10B).

HunyuanVideo Core Specs:

| Item | HunyuanVideo | HunyuanVideo 1.5 |
|---|---|---|
| Parameters | 13B | 8.3B |
| Release Date | December 2024 | November 2025 |
| Architecture | Dual-to-Single Stream DiT | Improved DiT |
| Text Encoder | MLLM (Decoder-Only) | Improved MLLM |
| VAE | 3D Causal VAE | 3D Causal VAE (improved) |
| Training | Flow Matching | Flow Matching |
| Max Resolution | 720p (1280x720) | 720p |
| Max Frames | 129 frames | 129 frames |
| License | Tencent Hunyuan Community | Tencent Hunyuan Community |

2.3 Text-to-Video and Image-to-Video Support

HunyuanVideo supports two core capabilities:

Text-to-Video (T2V): Generates high-quality video from text prompts alone. Describe scenes, actions, and atmosphere in natural language and it creates matching video.

Image-to-Video (I2V): Takes a static image as input and transforms it into video with natural motion added. The HunyuanVideo-I2V model released separately in March 2025 handles this functionality.

[HunyuanVideo Input/Output Pipeline]

Text-to-Video:
  "A golden retriever running       +----------+     +--------+
   through a sunlit meadow"  -----> | Hunyuan  | --> | Video  |
                                    | Video    |     | Output |
Image-to-Video:                     | Pipeline |     | (MP4)  |
  [Input Image] + Prompt    -----> |          | --> |        |
                                    +----------+     +--------+
                                         |
                                    MLLM Encoder
                                    3D VAE
                                    DiT Denoiser

3. HunyuanVideo Architecture Deep Dive

HunyuanVideo's architecture consists of three core components: (1) MLLM Text Encoder, (2) 3D Causal VAE, (3) Dual-Stream to Single-Stream DiT.

[HunyuanVideo Full Architecture Diagram]

                    Text Prompt
                         |
                         v
                  +-------------+
                  | MLLM Text   |
                  |   Encoder   |
                  | (Decoder-   |
                  |  Only LLM)  |
                  +------+------+
                         |
                  Text Tokens (with bidirectional refiner)
                         |
                         v
+----------+    +---------------------+    +----------+
| Gaussian | -> | Dual-Stream to      | -> | Denoised |
| Noise    |    | Single-Stream DiT   |    | Result   |
+----------+    |                     |    +----+-----+
                | [Dual Phase]        |         |
                |  - Video Tokens     |         v
                |  - Text Tokens      |    +----------+
                |  (independent)      |    | 3D VAE   |
                |                     |    | Decoder  |
                | [Single Phase]      |    +----+-----+
                |  - Concat & Fuse    |         |
                +---------------------+         v
                                           Final Video

3.1 Dual-Stream to Single-Stream DiT Design

The most distinctive architectural element of HunyuanVideo is its "Dual-Stream to Single-Stream" Diffusion Transformer (DiT) design. This is the core design philosophy that differentiates it from existing DiT models.

Dual-Stream Phase (Early Layers):

In the Dual-Stream phase, video tokens and text tokens are processed through independent Transformer blocks. Each modality can learn its own appropriate modulation mechanisms without interfering with the other.

# Dual-Stream Phase Pseudocode
# (MultiHeadAttention, FeedForward, and AdaLayerNorm stand in for the
#  paper's modulated blocks; torch.nn is assumed imported as nn)
class DualStreamBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.video_attn = MultiHeadAttention(dim, num_heads)
        self.text_attn = MultiHeadAttention(dim, num_heads)
        self.video_ffn = FeedForward(dim)
        self.text_ffn = FeedForward(dim)
        self.video_norm = AdaLayerNorm(dim)
        self.text_norm = AdaLayerNorm(dim)

    def forward(self, video_tokens, text_tokens, timestep):
        # Independent video token processing
        video_tokens = self.video_norm(video_tokens, timestep)
        video_tokens = video_tokens + self.video_attn(video_tokens)
        video_tokens = video_tokens + self.video_ffn(video_tokens)

        # Independent text token processing
        text_tokens = self.text_norm(text_tokens, timestep)
        text_tokens = text_tokens + self.text_attn(text_tokens)
        text_tokens = text_tokens + self.text_ffn(text_tokens)

        return video_tokens, text_tokens

Single-Stream Phase (Later Layers):

In the Single-Stream phase, video tokens and text tokens are concatenated and processed together in a single Transformer block. This enables effective multimodal information fusion.

# Single-Stream Phase Pseudocode
class SingleStreamBlock(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ffn = FeedForward(dim)
        self.norm = AdaLayerNorm(dim)

    def forward(self, video_tokens, text_tokens, timestep):
        # Concatenate video + text tokens
        combined = torch.cat([video_tokens, text_tokens], dim=1)

        # Unified processing (Full Attention)
        combined = self.norm(combined, timestep)
        combined = combined + self.attn(combined)
        combined = combined + self.ffn(combined)

        # Split and return
        video_out = combined[:, :video_tokens.shape[1]]
        text_out = combined[:, video_tokens.shape[1]:]

        return video_out, text_out

Advantages of the Dual-to-Single Design:

| Characteristic | Dual-Stream Only | Single-Stream Only | Dual-to-Single (HunyuanVideo) |
|---|---|---|---|
| Per-modality learning | Excellent | Limited | Excellent (early phase) |
| Cross-modal fusion | Weak | Strong | Strong (later phase) |
| Computational efficiency | High | Moderate | High |
| Text-video alignment | Low | High | High |
| Model flexibility | High | Low | Very high |
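The dual and single phases above can be combined into a runnable end-to-end sketch. This is a simplified stand-in rather than Tencent's implementation: standard PyTorch layers replace the paper's modulated attention/FFN blocks, and timestep conditioning is omitted for brevity.

```python
import torch
import torch.nn as nn

class SimpleBlock(nn.Module):
    """Minimal pre-norm transformer block standing in for the modulated blocks."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))

class DualToSingleDiT(nn.Module):
    def __init__(self, dim=64, num_heads=4, num_dual=2, num_single=2):
        super().__init__()
        # Dual phase: a separate stack per modality
        self.video_blocks = nn.ModuleList(SimpleBlock(dim, num_heads) for _ in range(num_dual))
        self.text_blocks = nn.ModuleList(SimpleBlock(dim, num_heads) for _ in range(num_dual))
        # Single phase: one shared stack over the concatenated sequence
        self.joint_blocks = nn.ModuleList(SimpleBlock(dim, num_heads) for _ in range(num_single))

    def forward(self, video_tokens, text_tokens):
        # Dual phase: each modality flows through its own blocks
        for vb, tb in zip(self.video_blocks, self.text_blocks):
            video_tokens, text_tokens = vb(video_tokens), tb(text_tokens)
        # Single phase: concatenate and fuse with shared blocks
        combined = torch.cat([video_tokens, text_tokens], dim=1)
        for jb in self.joint_blocks:
            combined = jb(combined)
        n_video = video_tokens.shape[1]
        return combined[:, :n_video], combined[:, n_video:]

video_out, text_out = DualToSingleDiT()(torch.randn(2, 16, 64), torch.randn(2, 8, 64))
```

The split at the end mirrors the `SingleStreamBlock` pseudocode: after fusion, video and text tokens are separated again by their original sequence lengths.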

3.2 3D VAE (Causal VAE) - Spatiotemporal Compression

HunyuanVideo uses a 3D Causal VAE to compress pixel-space video into a compact latent space. This VAE is built on CausalConv3D and efficiently compresses both temporal and spatial information.

Compression Ratios:

| Dimension | Ratio | Description |
|---|---|---|
| Temporal | 4x | 129 frames to 33 latent frames |
| Spatial | 8x x 8x | 720x1280 to 90x160 |
| Channel | 3ch to 16ch | RGB 3ch to latent 16ch |

Overall Compression Effect:

Input Video:    720 x 1280 x 129 frames x 3 channels
                = ~356M values

Latent:         90 x 160 x 33 x 16 channels
                = ~7.6M values

Compression:    ~47:1 (by element count)

Causal VAE Characteristics:

The Causal VAE maintains temporal causality in its design, meaning each frame is encoded referencing only information from previous frames. This allows images and videos to be processed by the same VAE. The first frame is treated as an image without temporal compression, while subsequent frames have temporal compression applied considering their relationship to previous frames.
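The compression figures above follow directly from the VAE factors; a few lines verify them (the `(frames - 1) // 4 + 1` formula encodes the uncompressed first frame described above):

```python
def hunyuan_latent_shape(frames, height, width, rgb_ch=3, latent_ch=16):
    """Latent dimensions after the 3D Causal VAE: first frame kept as-is in time,
    then 4x temporal and 8x8 spatial compression into 16 latent channels."""
    lat_frames = (frames - 1) // 4 + 1
    lat_h, lat_w = height // 8, width // 8
    pixels = frames * height * width * rgb_ch
    latents = lat_frames * lat_h * lat_w * latent_ch
    return (lat_frames, lat_h, lat_w, latent_ch), pixels / latents

shape, ratio = hunyuan_latent_shape(129, 720, 1280)
print(shape)          # (33, 90, 160, 16)
print(round(ratio))   # 47, i.e. the ~47:1 ratio by element count
```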

3.3 MLLM Text Encoder

Another innovation of HunyuanVideo is its adoption of a Multimodal Large Language Model (MLLM) as the text encoder. This contrasts with existing video/image generation models that primarily use CLIP or T5 as text encoders.

Comparison with Existing Text Encoders:

| Characteristic | CLIP | T5-XXL | MLLM (HunyuanVideo) |
|---|---|---|---|
| Architecture | Encoder-Only | Encoder-Decoder | Decoder-Only |
| Parameters | ~400M | ~4.7B | Tens of billions |
| Image-text alignment | Excellent | Moderate | Outstanding |
| Detail understanding | Limited | Excellent | Outstanding |
| Complex reasoning | Weak | Moderate | Strong |
| Zero-shot ability | Limited | Moderate | Excellent |
| Attention type | Causal | Bidirectional | Causal + Refiner |

Bidirectional Token Refiner:

MLLMs inherently use causal attention due to their Decoder-Only structure, but bidirectional attention is more effective as text conditioning for diffusion models. To solve this, HunyuanVideo introduces an additional Bidirectional Token Refiner.

[Text Encoding Pipeline]

Text Prompt
     |
     v
+----------+     +--------------+     +------------------+
| MLLM     | --> | Bidirectional| --> | Final Text       |
| (Causal  |     | Token        |     | Embedding        |
|  Attn)   |     | Refiner      |     | (DiT condition)  |
+----------+     +--------------+     +------------------+
  Rich            Bidirectional        Diffusion-optimized
  semantics       context boost        text representation
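As a rough sketch of this idea (layer sizes here are illustrative, not the paper's values), the refiner can be modeled as a small unmasked transformer encoder applied on top of the causal MLLM's final hidden states:

```python
import torch
import torch.nn as nn

class BidirectionalTokenRefiner(nn.Module):
    """Refines causal MLLM hidden states with full (unmasked) self-attention."""
    def __init__(self, dim=768, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, mllm_hidden_states, padding_mask=None):
        # No causal mask is applied: every token attends to every other token,
        # recovering bidirectional context for diffusion conditioning.
        return self.encoder(mllm_hidden_states, src_key_padding_mask=padding_mask)

refiner = BidirectionalTokenRefiner()
text_embeds = refiner(torch.randn(1, 77, 768))  # (batch, tokens, dim)
```

The key point is simply the absence of a causal mask, which is what turns the decoder-only representations into bidirectional conditioning signals.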

3.4 Flow Matching Training Method

HunyuanVideo adopts Flow Matching instead of traditional DDPM (Denoising Diffusion Probabilistic Model). Flow Matching learns the optimal transport path between data and noise distributions.

DDPM vs Flow Matching:

| Characteristic | DDPM | Flow Matching |
|---|---|---|
| Noise schedule | Must be predefined | Flexible design |
| Training target | Noise prediction | Vector field pred. |
| Convergence | Slow | Fast |
| Inference path | Curved | Straight (efficient) |
| Sampling steps | Many (20-50) | Fewer (20-30) |

# Flow Matching Training Pseudocode
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x_0, text_cond):
    """
    x_0: original video latent, shape (B, C, T, H, W)
    text_cond: text condition
    """
    # Random timestep sampling, one per batch element
    t = torch.rand(x_0.shape[0], device=x_0.device)
    # Reshape for broadcasting against (B, C, T, H, W)
    t_b = t.view(-1, *([1] * (x_0.dim() - 1)))

    # Noise sampling
    noise = torch.randn_like(x_0)

    # Linear interpolation for intermediate state
    x_t = (1 - t_b) * x_0 + t_b * noise

    # Target vector field: direction from data to noise
    target = noise - x_0

    # Model's vector field prediction
    predicted = model(x_t, t, text_cond)

    # Loss computation
    loss = F.mse_loss(predicted, target)
    return loss
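Because the learned path is (near-)straight, sampling reduces to simple ODE integration. Below is a minimal Euler sampler consistent with the interpolation above, integrating from t = 1 (pure noise) down to t = 0 (data); it is a sketch, not HunyuanVideo's actual sampler:

```python
import torch

@torch.no_grad()
def flow_matching_sample(model, shape, text_cond, steps=30, device="cpu"):
    """Euler integration of dx/dt = v(x, t) from t = 1 (noise) to t = 0 (data)."""
    x = torch.randn(shape, device=device)  # start from pure noise at t = 1
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        v = model(x, t, text_cond)  # predicted vector field (approx. noise - x_0)
        x = x - v * dt              # step toward the data end of the path
    return x

# With a dummy zero vector field, sampling just returns the initial noise:
dummy = lambda x, t, cond: torch.zeros_like(x)
latent = flow_matching_sample(dummy, (1, 16, 33, 90, 160), text_cond=None, steps=4)
```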

3.5 Unified Image-Video Training Strategy

HunyuanVideo trains on images and videos within a unified framework. Images are treated as single-frame videos and processed by the same model architecture.
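In practice this unification is largely a shape convention: a still image gains a singleton temporal axis so the same (B, C, T, H, W) pipeline serves both. A minimal illustration:

```python
import torch

video = torch.randn(2, 3, 17, 256, 256)   # (B, C, T, H, W) video clip
image = torch.randn(2, 3, 256, 256)       # (B, C, H, W) still image

# A still image becomes a single-frame video by inserting a T=1 axis
image_as_video = image.unsqueeze(2)        # -> (2, 3, 1, 256, 256)

# Both tensors now share the 5D layout the VAE and DiT expect
assert video.dim() == image_as_video.dim() == 5
```

This dovetails with the causal VAE above, which already treats the first frame as an image.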

3.6 Full Attention Mechanism

HunyuanVideo applies Full Attention across both temporal and spatial dimensions. This contrasts with many video generation models that separate spatial and temporal attention to reduce computation.

| Attention Type | Description | Example Model |
|---|---|---|
| Spatial-Only | Spatial dimension only | Early video models |
| Temporal-Only | Temporal dimension only | AnimateDiff |
| Spatial + Temporal (split) | Applied alternately | CogVideoX |
| Full 3D Attention | Full spatiotemporal attn | HunyuanVideo |

Full Attention allows every token in the video to interact spatiotemporally with all other tokens, achieving more coherent motion and higher visual quality, but at the tradeoff of significantly increased computational cost.
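The tradeoff is easy to quantify by counting pairwise token interactions. Using HunyuanVideo's 720p latent grid (33 x 90 x 160) for illustration:

```python
def attention_pairs(t, h, w):
    """Token-pair counts for full 3D vs factorized spatiotemporal attention."""
    n = t * h * w
    full_3d = n * n                 # every token attends to every token
    spatial = t * (h * w) ** 2      # per-frame spatial attention
    temporal = (h * w) * t * t      # per-position temporal attention
    return full_3d, spatial + temporal

full, split = attention_pairs(t=33, h=90, w=160)
print(f"full 3D has {full / split:.0f}x the pairwise interactions")  # ~33x
```

Even at this modest latent size, full 3D attention computes roughly 33x more token pairs than the factorized alternative, which is exactly the cost HunyuanVideo pays for its coherence.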


4. HunyuanVideo Training Data and Methodology

4.1 Large-Scale Data Curation Pipeline

HunyuanVideo's training data is prepared through a systematic curation pipeline involving multiple stages of filtering and evaluation from raw data to final training data.

4.2 Multi-Stage Training Strategy

HunyuanVideo employs a Progressive Training strategy, starting from low resolution and gradually increasing resolution.

Training Stage Settings:

| Stage | Resolution | Frames | Batch Size | Primary Goal |
|---|---|---|---|---|
| Stage 1 | 256x256 | 17 | Large | Basic visual concepts |
| Stage 2 | 512x512 | 33 | Medium | Detail learning |
| Stage 3 | 960x544 / 544x960 | 65 | Small | High-resolution adapt. |
| Stage 4 | 1280x720 / 720x1280 | 129 | Very small | Final quality fine-tune |

5. HunyuanVideo Model Specifications and Performance

5.1 Supported Resolutions and Frames

| Resolution | Aspect Ratio | Use Case |
|---|---|---|
| 1280 x 720 | 16:9 | Landscape HD |
| 720 x 1280 | 9:16 | Portrait (mobile) |
| 960 x 544 | ~16:9 | Medium resolution |
| 544 x 960 | ~9:16 | Medium portrait |
| 720 x 720 | 1:1 | Square |

Frame Settings:

| Setting | Value | Notes |
|---|---|---|
| Max frames | 129 frames | 33 latent frames after 4x VAE compression |
| FPS | 24 fps | Standard cinema framerate |
| Video length | ~5.4 sec | 129 / 24 = 5.375 seconds |

5.2 Benchmark Comparison

VBench Evaluation Results:

| Model | Overall | Visual Quality | Text Alignment | Motion Quality | Human Fidelity |
|---|---|---|---|---|---|
| HunyuanVideo | Top tier | 96.4% | 68.5% | 64.5% | Excellent |
| Sora | Top tier | Excellent | Moderate | Excellent | Outstanding |
| CogVideoX-1.5 | Upper | Excellent | Strong | Moderate | Weak |
| Kling 1.6 | Top tier | Excellent | Excellent | Excellent | Excellent |

HunyuanVideo shows particularly strong results in Human Fidelity and Motion Rationality dimensions.

5.3 Competitive Model Comparison

| Comparison | HunyuanVideo | Sora 2 | Runway Gen-3 | Kling 3.5 |
|---|---|---|---|---|
| Access | Open source | Commercial | Commercial | Commercial |
| Parameters | 13B | Undisclosed | Undisclosed | Undisclosed |
| Max Resolution | 720p | 1080p | 1080p | 1080p |
| Max Length | ~5 sec | Up to 20s | Up to 10s | Up to 10s |
| Local Run | Possible | Not possible | Not possible | Not possible |
| Customization | LoRA supported | Not possible | Limited | Not possible |
| Cost | Free (GPU req.) | API billing | Subscription | API billing |

6. HunyuanVideo Practical Usage

6.1 HuggingFace Model Download

# HunyuanVideo original model (13B)
pip install huggingface_hub
huggingface-cli download tencent/HunyuanVideo --local-dir ./HunyuanVideo

# HunyuanVideo 1.5 (8.3B, lighter version)
huggingface-cli download tencent/HunyuanVideo-1.5 --local-dir ./HunyuanVideo-1.5

# Image-to-Video model
huggingface-cli download tencent/HunyuanVideo-I2V --local-dir ./HunyuanVideo-I2V

6.2 Diffusers Library Inference Code

Basic Text-to-Video Inference:

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Load model
model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()
pipe.to("cuda")

# Generate video
output = pipe(
    prompt="A cat walks on the grass, realistic style, natural lighting",
    height=720,
    width=1280,
    num_frames=129,
    num_inference_steps=30,
    guidance_scale=6.0,
).frames[0]

# Save video
export_to_video(output, "hunyuan_output.mp4", fps=24)

4-bit Quantization for VRAM Savings:

import torch
from diffusers import BitsAndBytesConfig, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# INT4 quantization config
# (use diffusers' BitsAndBytesConfig for diffusers models, not the transformers one)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

# Load quantized transformer
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "tencent/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipe = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo",
    transformer=transformer,
    torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()

# CPU offload for additional VRAM savings
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A beautiful sunset over the ocean, cinematic",
    height=544,
    width=960,
    num_frames=65,
    num_inference_steps=30,
    guidance_scale=6.0,
).frames[0]

export_to_video(output, "quantized_output.mp4", fps=24)

6.3 Key Parameter Guide

| Parameter | Default | Range | Description |
|---|---|---|---|
| guidance_scale | 6.0 | 1.0-15.0 | Prompt fidelity (higher = more faithful to prompt) |
| num_inference_steps | 30 | 20-50 | Denoising steps (higher = better quality, slower) |
| height | 720 | 256-720 | Video height (multiple of 8) |
| width | 1280 | 256-1280 | Video width (multiple of 8) |
| num_frames | 129 | 17-129 | Total frames (4k+1 format recommended) |
| seed | Random | Integer | Seed for reproducibility |
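The 4k+1 constraint mirrors the VAE's temporal compression (one uncompressed first frame plus groups of four). A small helper (the function name is ours, not part of Diffusers) to snap a requested length to a valid value:

```python
def nearest_valid_num_frames(requested, min_frames=17, max_frames=129):
    """Snap a requested frame count to the nearest 4k+1 value in range."""
    n = 4 * round((requested - 1) / 4) + 1
    return max(min_frames, min(max_frames, n))

print(nearest_valid_num_frames(100))  # -> 101
print(nearest_valid_num_frames(24))   # -> 25
print(nearest_valid_num_frames(300))  # -> 129 (clamped to the maximum)
```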

6.4 GPU VRAM Requirements

| Configuration | Required VRAM | Resolution | Notes |
|---|---|---|---|
| FP32 (original) | 80GB+ | 720p 129f | A100/H100 required |
| BF16/FP16 | ~40GB | 720p 129f | A100 40GB |
| FP8 quantization | ~24-30GB | 720p 129f | RTX 4090 capable |
| INT4 quant + CPU Offload | ~14-16GB | 544p 65f | RTX 4080 capable |
| HunyuanVideo 1.5 (FP8) | ~14GB | 480p | Consumer GPU |

6.5 LoRA Fine-tuning

HunyuanVideo supports LoRA fine-tuning to learn specific styles, characters, or motion patterns.

Key LoRA Training Tools:

| Tool | Features | Min VRAM |
|---|---|---|
| Musubi Tuner (kohya-ss) | Most popular LoRA training tool | 24GB |
| ai-toolkit (ostris) | Multi-model support | 24GB |
| diffusion-pipe (tdrussell) | Pipeline-based training | 24GB |
| FineTrainers (HuggingFace) | Official Diffusers-based tool | 24GB |
| fal.ai LoRA Training | Cloud-based, no setup needed | Cloud |

7. LTX-Video Overview

7.1 Lightricks Company

Lightricks is an AI-based creative technology company headquartered in Jerusalem, Israel. Founded in 2013, it is widely known for consumer photo/video editing apps such as Facetune, Videoleap, and Photoleap. Leveraging experience in mobile creative tools, it entered the AI video generation space.

7.2 Evolution from LTX-Video 1.0 to LTX-2

| Version | Release | Params | Key Features |
|---|---|---|---|
| LTX-Video 0.9 | Nov 2024 | ~2B | First open-source, real-time |
| LTX-Video 0.9.8 (13B) | Mid 2025 | 13B | Distilled version, quality up |
| LTX-2 | Oct 2025 | 19B | Audio+video simultaneous gen. |
| LTX-2 (open source) | Jan 2026 | 19B | Full weights/code released |

7.3 Near Real-Time Video Generation Speed

The biggest differentiator of the LTX-Video series is its faster-than-real-time video generation speed. LTX-Video was among the first DiT-based video generation models to achieve real-time generation.

[Generation Speed Comparison (5-second video)]

Model              Gen Time    vs Real-time
LTX-Video 1.0:    ~2 sec       2.5x faster
LTX-2:            ~3-5 sec     ~real-time
HunyuanVideo:     ~2-5 min     60x slower
CogVideoX:        ~3-8 min     100x slower
Mochi:            ~5-10 min    120x slower

(H100 GPU, 768x512 resolution)

7.4 Text-to-Video and Image-to-Video Support

LTX-2 offers simultaneous Audio-Video generation in addition to Text-to-Video and Image-to-Video.

| Feature | LTX-Video 1.0 | LTX-2 |
|---|---|---|
| Text-to-Video | Supported | Supported |
| Image-to-Video | Supported | Supported |
| Audio generation | Not supported | Synchronized audio co-gen |
| 4K resolution | Not supported | Native 4K (3840x2160) |
| 50fps | Not supported | Supported |
| Keyframe Conditioning | Limited | Full support |

8. LTX-2 Architecture Analysis

8.1 Overall Architecture

LTX-2 consists of three core components: (1) Modality-specific VAE, (2) Text embedding pipeline, (3) Asymmetric Dual Stream DiT.

[LTX-2 Full Architecture]

Text Prompt
     |
     v
+--------------+
| Text Encoder |  (Gemma-based)
| + Prompt     |
|   Enhancer   |
+------+-------+
       |
       v
+------------------------------------------+
|        Asymmetric Dual Stream DiT        |
|                                          |
|  +-----------------+   +--------------+  |
|  | Video Stream    |   | Audio Stream |  |
|  | (wide channels, |   | (narrow,     |  |
|  |  high capacity) |   |  lightweight)|  |
|  +--------+--------+   +------+-------+  |
|           |  Cross-Attention  |          |
|           +---------+---------+          |
+------------------------------------------+
       |                    |
       v                    v
+-------------+      +--------------+
| Video VAE   |      | Audio VAE    |
| Decoder     |      | Decoder      |
| (3D spatio- |      | (1D temporal)|
|  temporal)  |      |              |
+------+------+      +------+-------+
       |                    |
       v                    v
   Video Output        Audio Output
       |                    |
       +--------+-----------+
                |
                v
         Final AV Output (MP4)

8.2 Video VAE (High Compression Ratio - 1:192)

LTX-2's Video VAE achieves a very high compression ratio of 1:192. This is approximately 4x higher than HunyuanVideo's ~47:1 ratio.

VAE Compression Comparison:

| Model | Spatial | Temporal | Latent Ch | Overall Ratio |
|---|---|---|---|---|
| LTX-2 | 32x32 | 8x | 128ch | 1:192 |
| HunyuanVideo | 8x8 | 4x | 16ch | ~1:47 |
| CogVideoX | 8x8 | 4x | 16ch | ~1:47 |
| Wan 2.1 | 8x8 | 4x | 16ch | ~1:47 |

High compression provides:

  1. Fewer latent tokens: Greatly reduces tokens the DiT must process, improving inference speed
  2. Memory efficiency: Enables high-resolution video processing with less VRAM
  3. Faster training: Reduces computation needed during training
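The effect on sequence length can be computed directly from the table's compression factors. Assuming the same first-frame convention for both models (an approximation for illustration):

```python
def latent_tokens(frames, height, width, t_comp, s_comp):
    """Number of latent tokens the DiT must attend over."""
    lat_frames = (frames - 1) // t_comp + 1
    return lat_frames * (height // s_comp) * (width // s_comp)

# 121-frame clip at 768x512
ltx2 = latent_tokens(121, 512, 768, t_comp=8, s_comp=32)    # 8x temporal, 32x32 spatial
hunyuan = latent_tokens(121, 512, 768, t_comp=4, s_comp=8)  # 4x temporal, 8x8 spatial

print(ltx2, hunyuan, hunyuan // ltx2)  # 6144 vs 190464 tokens, a 31x gap
```

Since attention cost grows quadratically with sequence length, this ~31x token reduction compounds into a far larger gap in attention compute, which is the main reason LTX-2 generates so much faster.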

8.3 Asymmetric Dual Stream DiT

LTX-2's DiT adopts an asymmetric dual stream structure, reflecting the characteristic differences between video and audio modalities.

Rationale for Asymmetric Design:

| Characteristic | Video Stream | Audio Stream |
|---|---|---|
| Dimension | 3D (spatial + temporal) | 1D (temporal) |
| Complexity | High (spatiotemporal) | Medium (temporal) |
| Channel width | Wide (high capacity) | Narrow (lightweight) |
| Positional Embedding | 3D positional | 1D temporal |
| Data characteristics | Pixel-based visual | Frequency-based audio |

8.4 Text Encoder

LTX-2 uses a Gemma-based text encoder. The enhance_prompt feature can automatically expand simple user prompts for better results.

8.5 Speed Optimization Techniques

| Optimization | Description | Speedup |
|---|---|---|
| High VAE compression | Greatly reduces latent token count | Key factor |
| Distilled inference | 8-step distilled model available | 5-10x |
| FP8 Transformer | Quantized weights | ~2x |
| Two-Stage Pipeline | Stage 1 (gen) + Stage 2 (upscale) | Efficient |
| Gradient Estimation | Reduce steps from 40 to 20-30 | ~1.5x |

9. LTX-2 Key Features

9.1 Real-Time Generation Speed

| Resolution | Frames | Length | Gen Time (H100) | vs Real-time |
|---|---|---|---|---|
| 768x512 | 121 | 5 sec | ~2 sec | 2.5x faster |
| 1216x704 | 121 | 5 sec | ~5 sec | ~real-time |
| 1920x1080 | 121 | 5 sec | ~15 sec | 3x slower |
| 3840x2160 | 121 | 5 sec | ~60 sec | 12x slower |

9.2 High Resolution and Various Output Options

Supported Resolutions:

| Resolution | Aspect | Use Case | VRAM Required |
|---|---|---|---|
| 768 x 512 | 3:2 | Rapid prototyping | ~8-12GB |
| 1216 x 704 | ~16:9 | Standard prod. | ~16GB |
| 1920 x 1080 | 16:9 | Full HD | ~24GB |
| 3840 x 2160 | 16:9 | 4K UHD | 48GB+ |

9.3 Synchronized Audio-Video Generation

One of LTX-2's innovative features is generating audio and video simultaneously. Sound matching the video content is automatically generated without a separate audio generation model.

9.4 Keyframe Conditioning

LTX-2 supports Keyframe Conditioning, allowing you to specify certain frames and naturally fill in between them.

9.5 LoRA Support

LTX-2 officially supports LoRA training and inference, with training code included in the GitHub repository.


10. LTX-2 Practical Usage

10.1 Installation and Environment Setup

# 1. Python environment (3.10+ recommended)
conda create -n ltx2 python=3.10
conda activate ltx2

# 2. Install official LTX-2 package
pip install ltx-pipelines

# 3. Or install from source
git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2
pip install -e "packages/ltx-pipelines[all]"
pip install -e "packages/ltx-core[all]"

# 4. Download model weights
huggingface-cli download Lightricks/LTX-2 --local-dir ./models/ltx2

10.2 Python Inference Code Examples

Text-to-Video Basic Example:

from ltx_pipelines import TI2VidTwoStagesPipeline

# Initialize pipeline
pipe = TI2VidTwoStagesPipeline.from_pretrained(
    "Lightricks/LTX-2",
    device_map="auto",
    enable_fp8=True,  # Save VRAM with FP8
)

# Generate video
result = pipe(
    prompt="A serene mountain lake at sunrise, mist rising from the water, "
           "birds flying overhead, cinematic quality",
    negative_prompt="blurry, low quality, distorted",
    height=704,
    width=1216,
    num_frames=121,
    frame_rate=24,
    num_inference_steps=30,
    cfg_guidance_scale=7.5,
    seed=42,
    enhance_prompt=True,  # Auto prompt enhancement
)

# Save
result.save("ltx2_output.mp4")

Image-to-Video Example:

from ltx_pipelines import TI2VidTwoStagesPipeline
from PIL import Image

pipe = TI2VidTwoStagesPipeline.from_pretrained(
    "Lightricks/LTX-2",
    device_map="auto",
    enable_fp8=True,
)

# Load input image
input_image = Image.open("input_photo.jpg")

# I2V generation
result = pipe(
    prompt="The scene comes alive with gentle wind blowing through the trees",
    images=[input_image],
    height=704,
    width=1216,
    num_frames=121,
    frame_rate=24,
    num_inference_steps=30,
    cfg_guidance_scale=7.5,
    seed=42,
)

result.save("ltx2_i2v_output.mp4")

10.3 Key Parameters

| Parameter | Default | Range | Description |
|---|---|---|---|
| prompt | Required | String | Video description |
| negative_prompt | None | String | Elements to exclude |
| height | 704 | Multiple of 32 | Video height |
| width | 1216 | Multiple of 32 | Video width |
| num_frames | 121 | 8k+1 form | Total frame count |
| frame_rate | 24 | 24/30/50 | Frames per second |
| num_inference_steps | 30 | 8-50 | Denoising steps |
| cfg_guidance_scale | 7.5 | 1.0-15.0 | Prompt fidelity |
| seed | Random | Integer | Reproducibility seed |
| enhance_prompt | False | True/False | Auto prompt enhancement |
| enable_fp8 | False | True/False | Use FP8 quantization |

10.4 GPU Requirements

| GPU | VRAM | Recommended Res. | Notes |
|---|---|---|---|
| RTX 3060/4060 | 8-12GB | 540p, 4 sec | FP8 required, basic |
| RTX 3080/4070 Ti | 12-16GB | 768x512, 5 sec | FP8 recommended |
| RTX 4090 | 24GB | 1080p, 5 sec | Standard use |
| A100 | 40-80GB | 4K, 10 sec | Production |
| H100 | 80GB | 4K, 10 sec | Optimal performance |

11. HunyuanVideo vs LTX-2 Detailed Comparison

11.1 Architecture Comparison

| Item | HunyuanVideo | LTX-2 |
|---|---|---|
| Parameters | 13B (v1) / 8.3B (v1.5) | 19B |
| DiT Structure | Dual-to-Single Stream | Asymmetric Dual Stream |
| VAE Structure | 3D Causal VAE | Video VAE + Audio VAE |
| VAE Ratio | ~1:47 | 1:192 |
| Spatial Comp. | 8x8 | 32x32 |
| Temporal Comp. | 4x | 8x |
| Latent Ch | 16 | 128 |
| Text Encoder | MLLM (Decoder-Only) | Gemma |
| Training | Flow Matching | Diffusion (Flow-based) |
| Attention | Full 3D Attention | Bidirectional Cross-Attn |

11.2 Performance and Quality Comparison

| Comparison | HunyuanVideo | LTX-2 | Winner |
|---|---|---|---|
| Visual quality | Very high | High | HunyuanVideo |
| Motion naturalness | Very high | High | HunyuanVideo |
| Text alignment | High | High | Tie |
| Human generation | Excellent | Good | HunyuanVideo |
| Max resolution | 720p | 4K | LTX-2 |
| Audio generation | Not supported | Synced gen | LTX-2 |
| Frame rate | 24fps | Up to 50fps | LTX-2 |

11.3 Speed Comparison

| Condition | HunyuanVideo | LTX-2 | Difference |
|---|---|---|---|
| 768x512, 5s (H100) | ~120 sec | ~3 sec | LTX-2 ~40x faster |
| 1280x720, 5s (H100) | ~300 sec | ~10 sec | LTX-2 ~30x faster |
| 1280x720, 5s (RTX 4090) | ~600 sec | ~30 sec | LTX-2 ~20x faster |

11.4 Use Case Model Selection Guide

[Recommended Model by Scenario]

"I need the highest quality video"
  --> HunyuanVideo (v1, 13B)
  Reason: Full Attention + 13B for best visual quality

"I want to run locally on a consumer GPU"
  --> LTX-2 (FP8) or HunyuanVideo 1.5
  Reason: LTX-2 runs on 12GB, HV 1.5 on 14GB

"I need fast iterative work"
  --> LTX-2 (Distilled)
  Reason: Near real-time generation speed

"I need video with audio"
  --> LTX-2
  Reason: Only model with simultaneous AV generation

"I need specific character/style training"
  --> HunyuanVideo + LoRA
  Reason: Rich LoRA ecosystem

"I need 4K high resolution"
  --> LTX-2
  Reason: Native 4K support

"Human/face generation is important"
  --> HunyuanVideo
  Reason: Excellent Human Fidelity benchmark

12. Open-Source Video Generation Model Ecosystem Comparison

12.1 Comprehensive Model Comparison

| Item | HunyuanVideo | LTX-2 | Wan 2.1 | CogVideoX | Mochi 1 |
|---|---|---|---|---|---|
| Developer | Tencent | Lightricks | Alibaba | Zhipu/Tsinghua | Genmo |
| Parameters | 13B | 19B | 1.3B / 14B | 5B / 10B | 10B |
| Max Res. | 720p | 4K | 720p | 720p | 480p |
| Max Length | ~5 sec | ~10 sec | ~5 sec | ~6 sec | ~5.4 sec |
| Max FPS | 24 | 50 | 24 | 30 | 30 |
| VAE Ratio | 1:47 | 1:192 | 1:47 | 1:47 | 1:12 |
| Audio | Not supported | Supported | V2A separate | Not supported | Not supported |
| Min VRAM | 14GB (v1.5) | 8-12GB | 8GB (1.3B) | 4.4GB (INT8) | 20GB (ComfyUI) |
| Speed | Slow | Very fast | Moderate | Moderate | Slow |
| I2V Support | Separate model | Integrated | Integrated | Supported | Not supported |
| LoRA | Supported | Supported | Supported | Supported | Limited |

13. Prompt Engineering Tips

13.1 Effective Video Prompt Writing

Prompt Structure (SAEC Framework):

[Subject] + [Action] + [Environment] + [Camera/Cinematography]

S (Subject):      Subject - what/who is the main focus
A (Action):       Action - what is happening
E (Environment):  Environment - where, what atmosphere
C (Camera):       Camera - how it is filmed

Good vs Bad Prompts:

| Type | Prompt | Issue/Strength |
|---|---|---|
| Bad | "A nice video of nature" | Too vague |
| Average | "A dog running in a park" | Not specific enough |
| Good | "A golden retriever running through a sunlit meadow, wildflowers swaying, warm golden hour lighting" | Specific + environment |
| Excellent | "Medium tracking shot of a golden retriever running joyfully through a sunlit meadow, wildflowers swaying gently in the breeze, warm golden hour lighting, shallow depth of field, 35mm cinematic lens, natural color grading" | Full SAEC application |
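The SAEC structure also lends itself to a tiny prompt builder; the helper below is our illustration, not any model's API:

```python
def build_saec_prompt(subject, action, environment, camera, extras=()):
    """Compose a video prompt from SAEC parts plus optional style tags."""
    parts = [f"{camera} of {subject} {action}", environment, *extras]
    return ", ".join(p.strip() for p in parts if p and p.strip())

prompt = build_saec_prompt(
    subject="a golden retriever",
    action="running joyfully through a sunlit meadow",
    environment="wildflowers swaying gently in the breeze, warm golden hour lighting",
    camera="Medium tracking shot",
    extras=("shallow depth of field", "35mm cinematic lens", "natural color grading"),
)
print(prompt)  # reproduces the "Excellent" example above
```

Keeping the four slots separate makes it easy to sweep camera or lighting variations while holding the subject and action fixed.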

13.2 Cinematography Terminology

Camera Movements:

| Term | Description | Example |
|---|---|---|
| Pan | Horizontal camera turn | "Slow pan across the landscape" |
| Tilt | Vertical camera turn | "Tilt up to reveal the building" |
| Dolly | Camera moves forward/back | "Dolly in on the subject's face" |
| Tracking Shot | Camera follows the subject | "Tracking shot following the car" |
| Crane Shot | Rising or sweeping crane movement | "Crane shot rising above the city" |
| Static | Fixed camera position | "Static shot of the waterfall" |
| Handheld | Unstabilized, documentary feel | "Handheld camera, documentary style" |

13.3 Negative Prompt Usage

Universal Negative Prompt Template:

# Basic quality control
"blurry, low quality, distorted, deformed, ugly, bad anatomy,
watermark, text overlay, logo, grainy, noisy"

# Additional for human generation
"extra fingers, mutated hands, poorly drawn hands, poorly drawn face,
mutation, deformed, extra limbs, missing limbs"

Negative Prompt Support by Model:

| Model | Negative Prompt | Recommendation |
|---|---|---|
| HunyuanVideo | Not officially supported | Use guidance_scale instead |
| LTX-2 | Supported | Actively recommended |
| Wan 2.1 | Supported | Actively recommended |
| CogVideoX | Supported | Actively recommended |

14. Future Outlook

14.1 Video Generation Model Development Directions

Key Development Directions:

| Direction | Current State | Expected Development |
|---|---|---|
| Video length | 5-10 sec | Expanding to minutes |
| Resolution | 720p-4K | 8K, HDR support |
| Physics accuracy | Basic | Precise physics simulation |
| Character consistency | Limited | Multi-shot narratives |
| Generation speed | Real-time to minutes | Real-time streaming |
| Multimodal | AV beginning | AV + subtitles + voice |
| Editing | Basic | AI-based auto editing |
| Interaction | None | Real-time interactive |

Key technology trends:
  1. MoE Architecture: Introduced in Wan 2.2, greatly improving model efficiency
  2. Distillation Techniques: Transferring large model knowledge to small models for speed
  3. Multimodal Integration: Complete integrated generation of video + audio + text
  4. LoRA Ecosystem Growth: Explosive growth of community-driven specialized models
  5. Edge Device Deployment: Possibility of video generation on mobile/edge devices

15. References

Papers

| Paper | Authors | Link |
|---|---|---|
| HunyuanVideo: A Systematic Framework For Large Video Generative Models | Tencent Hunyuan Team | arXiv:2412.03603 |
| HunyuanVideo 1.5 Technical Report | Tencent Hunyuan Team | arXiv:2511.18870 |
| LTX-Video: Realtime Video Latent Diffusion | Lightricks Research | arXiv:2501.00103 |
| LTX-2: Efficient Joint Audio-Visual Foundation Model | Lightricks Research | arXiv:2601.03233 |

GitHub Repositories

| Repository | Description | Link |
|---|---|---|
| Tencent-Hunyuan/HunyuanVideo | HunyuanVideo official repo | GitHub |
| Tencent-Hunyuan/HunyuanVideo-1.5 | HunyuanVideo 1.5 official repo | GitHub |
| Lightricks/LTX-2 | LTX-2 official repo | GitHub |
| Lightricks/ComfyUI-LTXVideo | LTX ComfyUI integration | GitHub |
| kohya-ss/musubi-tuner | HunyuanVideo LoRA training tool | GitHub |
| Wan-Video/Wan2.1 | Wan 2.1 official repo | GitHub |
| zai-org/CogVideo | CogVideoX repo | GitHub |
| genmoai/mochi | Mochi 1 repo | GitHub |

HuggingFace Model Pages

| Model | Link |
|---|---|
| tencent/HunyuanVideo | HuggingFace |
| tencent/HunyuanVideo-1.5 | HuggingFace |
| tencent/HunyuanVideo-I2V | HuggingFace |
| Lightricks/LTX-2 | HuggingFace |
| Lightricks/LTX-Video | HuggingFace |
| Wan-AI/Wan2.1-T2V-14B | HuggingFace |

Diffusers Documentation

| Document | Link |
|---|---|
| HunyuanVideo Pipeline | Diffusers Docs |
| HunyuanVideo 1.5 Pipeline | Diffusers Docs |
| LTX-Video Pipeline | Diffusers Docs |

Additional Resources

| Resource | Description | Link |
|---|---|---|
| VBench | Video gen benchmark | GitHub |
| VBench-2.0 Paper | Extended benchmark | arXiv:2503.21755 |
| ComfyUI HunyuanVideo Tutorial | ComfyUI usage guide | Docs |
| ComfyUI LTX-2 Guide | LTX-2 ComfyUI guide | Docs |
| LTX-2 System Requirements | Official HW guide | Docs |
| NVIDIA LTX-2 Guide | RTX GPU guide | NVIDIA |