
AI Video Generation 2026 Complete Guide — Sora 2 · Veo 3 · Runway Gen-4 · Pika · Luma Dream Machine · Kling · Hailuo · Hunyuan Video Deep Dive

Prologue — From Sora preview to Sora app, two years compressed

On 2024-02-15, when OpenAI unveiled the Sora preview, the video industry froze. Sixty seconds of 1080p from a prompt alone was a shock. But that first Sora stayed locked behind a research preview for the better part of a year.

Then on 2025-09-30, OpenAI announced Sora 2 and simultaneously launched an iOS-only Sora app. 4K, 25-second clips, native audio, Cameos (registering your own face/voice for the model), a social feed. ChatGPT Pro subscribers got a dedicated allowance. In the months leading up to that, Google had rolled Veo 2 (2024-12) and Veo 3 (2025-05 I/O) into the Gemini app and Vertex AI, and Runway had moved from Gen-3 Alpha to Gen-4 (2025-03), pushing deeper into film workflows.

In parallel, the Chinese side — Kuaishou Kling, MiniMax Hailuo, Tencent Hunyuan Video, Alibaba Wan 2.1 — closed the gap rapidly. And the open-source side — Genmo Mochi 1 (2024-10, Apache 2.0), Lightricks LTX-Video (2024-11, real-time 2B), CogVideoX (Tsinghua) — landed on top of ComfyUI workflows, making cinematic clips possible on a single RTX 4090.

This guide compresses those two years — closed and open, pricing and licensing, plus Korea and Japan — into one arc.


Chapter 1 · Text, image, and video — three input branches

The first question when picking an AI video model is "what does it take as input?" There are three branches.

  • Text-to-video (T2V) — A prompt alone creates a new clip. Sora 2, Veo 3, Runway Gen-4, Kling, Hailuo, Hunyuan, Wan, Mochi, CogVideoX all support it. Most universal, hardest to control.
  • Image-to-video (I2V) — Takes a still image (storyboard, character sheet, product photo) as the first frame and animates it. Runway Gen-4, Luma Dream Machine, Pika, Kling, Hailuo, LTX-Video are strong here. Character consistency and brand asset preservation are the key.
  • Video-to-video (V2V) — Takes existing footage and changes style, motion, or viewpoint. Runway's Gen-3 Video-to-Video, Pika's Pikaffects, and ComfyUI's AnimateDiff workflows belong here.

Most pro workflows mix all three. T2V for the first pass, I2V to lock a character, V2V to unify style, lip-sync tools for mouth alignment. Then Premiere/DaVinci to cut and assemble.
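
As a rough illustration of the three branches, the sketch below picks a branch from the assets you already have. The `Branch` enum, the `pick_branch` helper, and the rule that existing footage takes priority over a still image are all assumptions made for this example, not part of any tool's API.

```python
from enum import Enum


class Branch(Enum):
    T2V = "text-to-video"
    I2V = "image-to-video"
    V2V = "video-to-video"


def pick_branch(has_image: bool = False, has_video: bool = False) -> Branch:
    """Pick an input branch from the assets on hand (hypothetical helper)."""
    if has_video:
        # Existing footage to restyle or re-time: assumed to win over a still.
        return Branch.V2V
    if has_image:
        # A storyboard frame or character sheet becomes the first frame.
        return Branch.I2V
    # Nothing but a prompt: generate from scratch.
    return Branch.T2V


if __name__ == "__main__":
    print(pick_branch().value)
    print(pick_branch(has_image=True).value)
    print(pick_branch(has_video=True).value)
```

In a mixed pro workflow, each stage would make this choice separately: the first pass has only a prompt, the character-lock pass has a still, and the style pass has footage.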


Chapter 2 · OpenAI Sora 2 — 4K, 25 seconds, Cameos, iOS app

On 2025-09-30, OpenAI released two things at once: the model and the app.

  • Model side — 4K output, up to 25-second clips, native synchronized audio (dialogue, effects, ambience generated with the video), and major improvements in physics, gravity, and contact consistency over the Sora 1 preview.
  • App side — An iOS-only Sora app launched alongside. TikTok-style vertical feed, a "Cameos" feature for registering your own face (via live selfie for security), and collaborative inserts of friends' cameos into your videos.
  • Pricing — ChatGPT Pro ($200/month) includes an allowance. Additional usage runs on credits. ChatGPT Plus ($20/month) also gets limited Sora 2 access.
  • Watermarking — All outputs carry a visible Sora logo watermark plus C2PA metadata. Only Pro can remove the visible watermark, but the metadata always stays.
  • API — A Sora 2 API entered beta in 2025-11, gated to partners.

Sora 2's real differentiators are two-fold. First, while other models output silent video and have you bolt on sound via ElevenLabs/Suno separately, Sora 2 generates synchronized audio natively. Second, the Cameos feature effectively standardized a "consent model for deepfakes" — only an explicitly registered face is usable, and only when the owner grants sharing rights to a friend.


Chapter 3 · Google Veo 2 · Veo 3 — Two channels via Gemini and Vertex AI

Google's video models consolidated into the Veo line.

  • Veo 2 — Launched 2024-12 inside Vertex AI Studio and VideoFX (public beta). 4K, up to 2 minutes, with natural-language cinematic camera commands (dolly, crane, zoom).
  • Veo 3 — Announced at Google I/O 2025-05. Solved Veo 2's silent output: native dialogue, sound effects, and ambience are generated together. It landed about four months before Sora 2 and moved in the same direction.
  • Channels — Gemini App (for Advanced/Ultra subscribers), Vertex AI (enterprise), and Flow (a film-making tool from Google).
  • Flow — Also announced at I/O 2025-05. Wrapped scene-level consistency, character continuity, and camera control into a film-maker UI.

Veo 3's strengths are Google infrastructure — native sound powered by DeepMind audio models — and the enterprise channel via Vertex AI. If Sora 2 is social-feed-first, Veo 3 sits closer to the production pipeline.


Chapter 4 · Runway Gen-4 — Penetrating film production workflows

Runway's path was clear from day one: "a film-editing company making an AI video tool."

  • Gen-1 (2023-02) — Video-to-Video only, style transfer.
  • Gen-2 (2023-06) — Extended to Text-to-Video and Image-to-Video.
  • Gen-3 Alpha (2024-06) — Quality stepped up to genuinely cinematic.
  • Gen-3 Alpha Turbo (2024-07) — 7x faster inference, half the price.
  • Gen-4 (2025-03) — References (reference images) and multi-shot consistency are the core. Keep one character across many shots, carry the same look and lighting through a series.

Gen-4's References is the feature film-makers most wanted. Feed in character sheets, costume references, and environment moodboards, and you can produce multiple shots that hold that consistency.

  • Pricing — Credit-based. Standard ($15/month, 625 credits), Pro ($35/month), Unlimited ($95/month). Gen-4 is typically more expensive per clip.
  • Act-One (2024-10) — Maps captured facial performance onto a character, transferring an actor's performance to a digital figure.

Chapter 5 · Pika 2.2 · 2.5 — Pikadditions, Pikaffects, Pikaframes

Pika's strategy is to make feature names memorable.

  • Pika 1.0 (2023-12) — First GA, mostly short clips.
  • Pika 1.5 (2024-10) — Introduced Pikaffects (unreal effects like exploding, melting, crushing) and Pika Scenes (multi-character composition).
  • Pika 2.0 (2024-12) — Reliable character and object composition.
  • Pika 2.2 (2025-02) — Pikaframes (transition mode that fills between a first and last frame) plus 10-second clips.
  • Pika 2.5 (late 2025) — Pikadditions (insert new objects into existing footage), plus quality improvements.

Pika's appeal sits less in cinematic continuity and more in "effects you can explain in a line." Pikaffects is extremely powerful for ad and social creators.

  • Pricing — Basic (free, watermarked), Standard ($8/month), Pro ($28/month), Fancy ($58/month).

Chapter 6 · Luma Dream Machine · Ray 2 — Fast and loopable

Luma AI's Dream Machine took the "fast and everyday" position.

  • Dream Machine 1.0 (2024-06) — Text-to-Video, Image-to-Video, around 5-second clips.
  • Ray 2 (2025-01) — Bigger model, longer clips, more accurate motion.
  • Ray 2 Flash (mid-2025) — A smaller, faster variant.

Luma's strengths are two-fold. First, Image-to-Video quality is very good — it starts from a still and produces natural motion. Second, the Loop feature (seamlessly repeating clips) is potent for social GIFs and background loops.

  • The API was among the earliest broadly available; developer integration is easy.
  • Pricing — Free (limited), Standard ($9.99/month), Plus ($29.99/month), Unlimited ($94.99/month).

Chapter 7 · Kling 1.6 · 2.0 — Kuaishou's global push

Kuaishou (the Chinese rival to TikTok) launched Kling in 2024-06, and it rapidly built a global user base.

  • Kling 1.0 (2024-06) — First release, 1080p, up to 10 seconds.
  • Kling 1.5 (2024-09) — Motion Brush (assigns motion only inside a region), Camera Control.
  • Kling 1.6 (2024-12) — Quality bump, stronger English prompts.
  • Kling 2.0 (2025) — Longer clips, more accurate physics.

Kling's differentiator is Motion Brush — you can paint a region of the frame and direct motion only there. Example: make only this character's hair flow in the wind.

  • Pricing — Buy credits on the global klingai.com. Roughly $10/100 credits. About 100 credits per 5-second clip.

Chapter 8 · MiniMax Hailuo — Fast text-to-video

MiniMax's Hailuo launched 2024-09. Initially free, later monetized.

  • Hailuo Video 01 (2024-09) — Text-to-Video, started at 6-second 720p.
  • Hailuo I2V-01 (2024-11) — Separate Image-to-Video model.
  • Hailuo MiniMax-01 (2025) — Larger multimodal model including video.

Hailuo is very strong on English prompts, with fast inference (short clips in roughly 30 seconds to 1 minute) as a key advantage. The catch: clips are short next to Sora 2's 25 seconds.

  • An API is also available.

Chapter 9 · Tencent Hunyuan Video — The 13B open-source watershed

On 2024-12-03, Tencent released Hunyuan Video — 13B parameters, an effectively open license (commercial use allowed with some caveats). It reshaped the open-source video landscape.

  • Model size — 13B. Text-to-video, 5-second clips, 720p baseline.
  • Architecture — Diffusion Transformer (DiT). Text encoder is MLLM-based.
  • License — Tencent Hunyuan Community License. Free commercial use under 100M MAU, separate terms above that.
  • Hardware requirements — Full inference at 720p 5s needs roughly 60GB VRAM. H100 80GB or H200 141GB recommended. On RTX 4090 (24GB), quantization plus offloading (GGUF Q4/Q8 variants appeared fast) make it possible.
  • ComfyUI integration — A wrapper node landed within a week. Drop-in usable.

Hunyuan Video pulled the open-source video camp into "practical" territory. Until then open models were demo-grade.
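
To see why a 24GB card needs quantization plus offloading, it helps to estimate the memory taken by the weights alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: the bytes-per-parameter values are nominal (real GGUF Q4/Q8 files carry some extra overhead), and the `weight_gb` helper is a name made up for this example.

```python
# Nominal storage cost per parameter for each precision (assumed values).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "q8": 1.0, "q4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in ("bf16", "q8", "q4"):
        print(f"13B @ {precision}: {weight_gb(13, precision):.1f} GB")
```

At bf16 the 13B weights alone come to 26 GB, already over the RTX 4090's 24GB. The text encoder, VAE, and activations come on top, which is presumably where the rest of the roughly 60GB figure goes. At nominal Q8 or Q4 the weights shrink to about 13 GB or 6.5 GB, leaving room for the rest once offloading is added.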


Chapter 10 · Alibaba Wan 2.1 — 14B with open licensing

In 2025-01, Alibaba released Wan 2.1.

  • Wan 2.1 T2V-14B — 14B parameters, text-to-video, 720p · 5s.
  • Wan 2.1 I2V-14B — Same-size image-to-video variant.
  • Wan 2.1 T2V-1.3B — Small variant that runs on a single RTX 4090.
  • License — Apache 2.0 (Wan 2.1 1.3B) and Tongyi Qianwen License (14B).

Wan 2.1's real charm is the 1.3B variant. Pure Apache 2.0, runs on a single consumer GPU. Quality is below the 14B and Hunyuan, of course.


Chapter 11 · Genmo Mochi 1 — Apache 2.0 at 10B

In 2024-10, Genmo released Mochi 1 under Apache 2.0.

  • Model size — 10B parameters (AsymmDiT architecture).
  • Output — 480p, around 5.4 seconds.
  • License — Apache 2.0. Fully permissive.
  • Hardware requirements — Full inference recommended on 4x H100. With quantization/offloading, a single H100 80GB or RTX 4090 can work.

Mochi 1 was the first to fill the "fully free open video model" slot. License-wise, cleaner than Hunyuan.


Chapter 12 · Lightricks LTX-Video — A real-time 2B model

In 2024-11, Lightricks (the company behind Facetune and Videoleap) released LTX-Video.

  • Model size — 2B parameters. Very small.
  • Speed — Generates a 4-second 720p clip in roughly 4 seconds on H100. Effectively real-time.
  • License — RAIL-S (free for research and personal use, commercial use is restricted but possible).
  • Workflow — ComfyUI nodes appeared quickly. Roughly 10x faster than Wan/Hunyuan.

LTX-Video moved the quality-vs-speed balance toward speed. Strong for rapid prototyping and iteration.


Chapter 13 · CogVideoX 5B — Tsinghua's open base

In 2024-09, Tsinghua KEG Lab and ZhipuAI released CogVideoX.

  • CogVideoX-2B / CogVideoX-5B — Two sizes.
  • License — CogVideoX License (Apache-flavored with some constraints).
  • Quality — Slightly behind Mochi 1 as of late 2024, but a low barrier to entry made it popular for research and education.

CogVideoX is on ModelScope and Hugging Face and was quickly wired into ComfyUI workflows.


Chapter 14 · Stable Video Diffusion · the prehistory

The "prehistory" of video models, in brief.

  • Stable Video Diffusion (2023-11, Stability AI) — The first serious open video model. About 2–4 seconds, 576x1024. Demo-grade by today's standards, but ComfyUI/AUTOMATIC1111 workflows first took root here.
  • AnimateDiff (2023-07) — A method to attach motion modules to a Stable Diffusion image model and produce short animations. Still the default for V2V workflows in ComfyUI.
  • VideoCrafter / ModelScope T2V — Contemporaries from the same era.

Without these, neither the ComfyUI ecosystem nor today's open-source video models would have taken root.


Chapter 15 · ComfyUI workflows — Wan, Hunyuan, Mochi in one place

ComfyUI is a node-based workflow editor and has become the standard interface for open video models.

Representative node packages:

  • ComfyUI-HunyuanVideoWrapper — Hunyuan Video integration.
  • ComfyUI-WanVideoWrapper — Wan 2.1 integration.
  • ComfyUI-MochiWrapper — Mochi 1 integration.
  • ComfyUI-LTXVideo — LTX-Video integration.
  • ComfyUI-CogVideoXWrapper — CogVideoX integration.

A typical workflow flows like this:

[Text Prompt]
   |
   v
[CLIP/T5 Text Encoder] ---+
                          |
[Empty Latent Video] -----+--> [Diffusion Model (Hunyuan/Wan/Mochi)] --> [Latent Video]
                          |                                                    |
[Negative Prompt] --------+                                                    v
                                                                         [VAE Decode]
                                                                               |
                                                                               v
                                                                        [Video Output]

I2V workflows add an Image Encoder node and a Conditioning node. V2V re-encodes the input video into latent space as the starting point.
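
The same flow can be written down as a dependency graph. The sketch below is only a structural illustration: the node names are invented for this example and are not real ComfyUI class names, and the ordering comes from Python's standard `graphlib` module (Python 3.9+) rather than from ComfyUI itself.

```python
from graphlib import TopologicalSorter

# Each node maps to the nodes it depends on. Names are illustrative only.
T2V_GRAPH = {
    "text_prompt": [],
    "negative_prompt": [],
    "empty_latent_video": [],
    "text_encoder": ["text_prompt"],
    "diffusion_model": ["text_encoder", "negative_prompt", "empty_latent_video"],
    "vae_decode": ["diffusion_model"],
    "video_output": ["vae_decode"],
}


def with_image_conditioning(graph: dict) -> dict:
    """Return an I2V variant: add an image encoder and a conditioning node."""
    i2v = {name: list(deps) for name, deps in graph.items()}
    i2v["input_image"] = []
    i2v["image_encoder"] = ["input_image"]
    i2v["conditioning"] = ["text_encoder", "image_encoder"]
    # The diffusion model now reads the combined conditioning, not raw text.
    i2v["diffusion_model"] = [
        "conditioning" if dep == "text_encoder" else dep
        for dep in i2v["diffusion_model"]
    ]
    return i2v


def execution_order(graph: dict) -> list:
    """One valid order in which a graph runner could evaluate the nodes."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(T2V_GRAPH))
    print(execution_order(with_image_conditioning(T2V_GRAPH)))
```

Whatever order the sorter picks among independent inputs, the diffusion model always runs after its conditioning and latent inputs, and VAE decode always runs last before output.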

ComfyUI's real strength is the ability to drop LoRA, ControlNet, IPAdapter, and upscalers in at the node level. Fine-grained control that closed models simply don't expose.


Chapter 16 · Lip sync — HeyGen, Synthesia, D-ID, Hedra

Video generation and lip sync are different problems. Lip-sync tools form a separate category.

  • HeyGen — De facto standard for avatar video plus lip sync. Register your own face/voice or pick a library avatar. From $24/month.
  • Synthesia — Enterprise training and marketing videos. 140+ languages. Starter from $22/month.
  • D-ID — Animates a still image into a talking face. Strong API. Studio plan from $5.9/month.
  • Hedra Character-1 (2024-06) — Generates both expression and lip motion for an AI character. From $10/month.
  • Sync.so (Wav2Lip successor) — A lip-sync model that grew out of the open-source Wav2Lip line.

Sora 2 and Veo 3 generate video and audio together, but swapping a new voice into existing footage still belongs to the tools above.


Chapter 17 · Storyboarding and longform — LTX Studio, Showrunner, Wonder

Tools that string 5–25 second clips into longer pieces form their own category.

  • LTX Studio (Lightricks) — Storyboarding, character consistency, and scene management as one integrated tool. Sells the workflow rather than a single model.
  • Showrunner (Fable Simulation) — Generates TV-series episodes. Famous for South Park style simulations.
  • Wonder Dynamics (acquired by Autodesk) — Automatically composites CG characters into live-action footage. Slots into VFX pipelines.
  • Krea AI — A creative tool bundling image, video, and 3D.

These tools make "sequences, not single clips," and many pull Sora 2, Veo 3, and Runway Gen-4 via API to do so.


Chapter 18 · Watermarks and C2PA — a new standard for provenance

The standard that took root fastest in 2024–2025 is C2PA (Coalition for Content Provenance and Authenticity).

  • C2PA embeds cryptographically signed metadata that records origin and edit history.
  • Adobe, Microsoft, OpenAI, Google, BBC, and Meta participate.
  • The signed manifest is embedded in the file itself: in JUMBF boxes for images such as JPEG, and in a dedicated container box for MP4 video.
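
The core idea of signed provenance metadata can be shown with a toy example. This is not the C2PA format or any C2PA library: it uses an HMAC with a shared demo key as a stand-in for C2PA's certificate-based signatures, and the `make_manifest` and `verify` helpers and the manifest fields are invented for illustration.

```python
import hashlib
import hmac
import json

# Stand-in secret. Real C2PA signs with certificate-backed keys instead.
DEMO_KEY = b"demo-signing-key"


def make_manifest(media: bytes, generator: str, edits: list) -> dict:
    """Build a toy provenance record bound to the media bytes by hash."""
    claim = {
        "generator": generator,
        "edits": edits,
        "media_sha256": hashlib.sha256(media).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(DEMO_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify(media: bytes, manifest: dict) -> bool:
    """Check the signature, then check the media still matches its hash."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode()
    expected = hmac.new(DEMO_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False
    return manifest["claim"]["media_sha256"] == hashlib.sha256(media).hexdigest()


if __name__ == "__main__":
    clip = b"fake video bytes"
    manifest = make_manifest(clip, "example-model", ["generated"])
    print(verify(clip, manifest))
    print(verify(clip + b"!", manifest))
```

Changing either the media bytes or the recorded claim makes verification fail, which is the property that lets a viewer trust the stated origin and edit history.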

State of play in 2026:

  • OpenAI Sora 2 — Visible watermark plus C2PA metadata. Only Pro removes the visible mark; C2PA stays.
  • Google Veo 3 — SynthID (DeepMind invisible watermark) plus C2PA.
  • Meta — Facebook and Instagram auto-label AI-generated content.
  • EU AI Act — Labeling of generative AI content becomes a legal requirement from 2026.

Watermarking is the last line of defense for content trust. But video generated with open-source models typically carries no C2PA manifest, so the standard operates mostly inside the closed ecosystem.


Chapter 19 · Korea — VARCO, HyperCLOVA X video

Korea's situation in video is one beat behind text and image, but catching up fast.

  • NCsoft VARCO Vision — A multimodal branch of the VARCO family. Image/video understanding (VLM) first; full generation is still pending.
  • Naver HyperCLOVA X — Text is the main line; a video line is being prepared separately.
  • Kakao Karlo — An image-generation model exists, but no public video model.
  • Local workflows — Many Korean creators run Hunyuan, Wan, and LTX through ComfyUI with Korean prompts (via translator). Ad production houses are adopting fast.

The Korean market's particularity is K-content IP. Creators are rapidly experimenting with workflows that preserve character continuity for K-drama, K-pop, and webtoon characters (LoRA training + Runway References + lip sync).


Chapter 20 · Japan — NTT Tsuzumi, Pikalmer, Sakana

Japan also has few direct video models, but adjacent fields are active.

  • NTT Tsuzumi — NTT's LLM line. Strong in Japanese. No separate video line yet.
  • Sony Pikalmer (placeholder name, internal projects) — Sony's media AI attempts.
  • Sakana AI — Known for evolutionary model merging. Not a direct video developer, but the merging technique is applicable on the LoRA layer.
  • Stability AI Japan — Active around Japanese variants of Stable Diffusion.
  • AI animation — Japanese animation studios are piloting Runway Gen-4 and Pika 2.5 inside some production pipelines. Union dynamics keep full adoption cautious.

Japan leans toward controllable open-source workflows over closed models, driven by IP continuity and union concerns.


Chapter 21 · Cost — the real price of one clip

Comparable pricing in one summary.

  • Sora 2 — ChatGPT Pro $200/month includes an allowance. Beyond that, credits.
  • Veo 3 — On Vertex AI, roughly $0.35-0.75/sec (beta, subject to change). Some Gemini Advanced/Ultra subscriptions include an allowance.
  • Runway Gen-4 — Standard $15/month (625 credits, around 41 seconds of footage), Pro $35/month.
  • Pika 2.5 — Standard $8/month, Pro $28/month.
  • Luma Dream Machine / Ray 2 — Standard $9.99/month, Unlimited $94.99/month.
  • Kling — $10/100 credits. About 100 credits per 5-second clip.
  • Hailuo — Credit-based, from $10.
  • HeyGen — Creator from $24/month.
  • Local GPU (Hunyuan/Wan/Mochi) — At cloud H100 around $2-3/hour, one 5-second clip costs $0.5-1. Buying an RTX 4090 (around $1,800) gives unlimited generation (electricity aside).

Two paths to the cheapest price. One: open-source models on your own GPU. Two: low-tier Pika/Luma subscription with controlled volumes.
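
The figures above can be put on a common footing with a little arithmetic. The sketch below uses only numbers quoted in this chapter, with one exception: the minutes of GPU time per clip are assumed values, back-solved so that a $2-3/hour H100 lands in the quoted $0.5-1 range. The function names are made up for this example.

```python
def cost_per_second(price_usd: float, seconds: float) -> float:
    """Effective USD per second of generated footage."""
    return price_usd / seconds


def rented_gpu_clip_cost(hourly_usd: float, minutes_per_clip: float) -> float:
    """Cost of one clip on a GPU billed by the hour."""
    return hourly_usd * minutes_per_clip / 60


def break_even_clips(hardware_usd: float, rented_cost_per_clip: float) -> float:
    """Clips after which buying the GPU beats renting (electricity ignored)."""
    return hardware_usd / rented_cost_per_clip


if __name__ == "__main__":
    # Kling: $10 buys 100 credits; a 5-second clip uses about 100 credits.
    print(f"Kling: ${cost_per_second(10, 5):.2f}/sec")
    # Runway Standard: $15/month for around 41 seconds of footage.
    print(f"Runway Standard: ${cost_per_second(15, 41):.2f}/sec")
    # Cloud H100: 15 and 20 minutes per clip are assumptions, not benchmarks.
    low = rented_gpu_clip_cost(hourly_usd=2, minutes_per_clip=15)
    high = rented_gpu_clip_cost(hourly_usd=3, minutes_per_clip=20)
    print(f"Cloud H100: ${low:.2f} to ${high:.2f} per clip")
    print(f"RTX 4090 break-even: {break_even_clips(1800, high):.0f} "
          f"to {break_even_clips(1800, low):.0f} clips")
```

On these numbers, Kling works out to about $2.00 per second, and an $1,800 RTX 4090 pays for itself somewhere between 1,800 and 3,600 clips compared with renting. That ignores electricity and the fact that a 4090 generates more slowly than an H100.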


Chapter 22 · Limits — motion coherence, physics, text

Video models in 2026 are strong, but the weaknesses are also clear.

  • Scene consistency — Holding one character identical across several 5-second clips is still hard. Runway Gen-4's References and ComfyUI's LoRA mitigate this.
  • Physics simulation — Accurate motion of liquids, cloth, and joints is still weak. Sora 2 is best but not perfect.
  • Text rendering — Letters inside the video (signs, book covers) often break. Veo 3 and Sora 2 are the most accurate.
  • Coherence beyond 5 seconds — Even Sora 2 at 25 seconds shows awkwardness later in clips.
  • Copyright and face usage — Closed models refuse faces that fall outside a consent model like Cameos. Open-source models, with weaker guardrails, push the ethical and legal burden onto the user.

These limits recede at different rates from one generation to the next. Text rendering improved fast; physics is improving slowly.


Chapter 23 · Use cases — ads, social, storyboards, R&D

Four of the most active use cases in 2026:

  • Ads/marketing — 30-second social ads. Pipelines that combine Pika's Pikaffects, Runway Gen-4's References, and HeyGen avatars. Cost roughly 1/10 of traditional production.
  • Social content — TikTok, Reels, Shorts. Sora App, Luma, and Kling are strong here. Frighteningly powerful for short attention-grabbing clips.
  • Film pre-visualization and storyboarding — Runway Gen-4 and LTX Studio are penetrating production-house workflows. See scene flow before the real shoot.
  • R&D and simulation — NVIDIA and autonomous-driving companies are starting to use video models to generate synthetic training data. Endless road scenarios.

Full feature-film and drama production isn't there yet, but short films, music videos, ads, and trailers already use these tools.


Chapter 24 · Decision tree — which model to use

Last, a one-page situational guide.

  • Short social clips, fast iteration → Pika 2.5, Luma Ray 2, Kling.
  • Cinematic tone, character consistency → Runway Gen-4 plus References. Expensive but the most controllable.
  • Native synced audio, dialogue → Sora 2 or Veo 3.
  • Enterprise integration (Vertex AI, GCP data governance) → Veo 3.
  • Training videos, multilingual lip sync → HeyGen, Synthesia.
  • Low-cost iteration on open-source workflows → Hunyuan Video, Wan 2.1, Mochi 1, LTX-Video in ComfyUI.
  • Starting with a single personal GPU → Wan 2.1 1.3B or LTX-Video.
  • 100 percent clean commercial license → Mochi 1 (Apache 2.0).

This tree is likely to be revised within six months. AI video is still one of the fastest-moving fields.
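
For anyone who wants the tree in a script, it can be encoded as a simple lookup. The sketch below copies the recommendations from the list above. The need keys and the `recommend` helper are hypothetical names, and treating several needs as an intersection is an assumption of this example.

```python
RECOMMENDATIONS = {
    "social_clips": ["Pika 2.5", "Luma Ray 2", "Kling"],
    "cinematic_consistency": ["Runway Gen-4"],
    "native_audio": ["Sora 2", "Veo 3"],
    "enterprise_gcp": ["Veo 3"],
    "multilingual_lip_sync": ["HeyGen", "Synthesia"],
    "open_source_iteration": ["Hunyuan Video", "Wan 2.1", "Mochi 1", "LTX-Video"],
    "single_consumer_gpu": ["Wan 2.1 1.3B", "LTX-Video"],
    "clean_commercial_license": ["Mochi 1"],
}


def recommend(*needs: str) -> list:
    """Return the models listed under every given need, in list order."""
    if not needs:
        raise ValueError("name at least one need")
    candidates = list(RECOMMENDATIONS[needs[0]])
    for need in needs[1:]:
        allowed = set(RECOMMENDATIONS[need])
        # Keep only models that also appear under this need.
        candidates = [model for model in candidates if model in allowed]
    return candidates


if __name__ == "__main__":
    print(recommend("social_clips"))
    print(recommend("native_audio", "enterprise_gcp"))
    print(recommend("open_source_iteration", "clean_commercial_license"))
```

An empty result means no single model in the tree covers every requirement, which in practice points to a multi-tool pipeline rather than one model.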


Epilogue — Questions for the next year

In two years we went from 60-second 1080p to 25-second 4K with synced audio. What does the 2027 model need to solve?

  • Long-form coherence — Sequences longer than a minute without breaks.
  • Interactive video — Branching with user intervention mid-clip.
  • Real-time generation — Game-engine-style instantaneous response.
  • 3D coherence — The world not collapsing while the camera roams freely.
  • Copyright and consent frameworks — How to standardize explicit consent for face, voice, and style.

No one has the answers yet. But at the pace of 2024–2026, those answers are likely within another two years.

