AI Video Generation 2026 — Sora 2 / Veo 3 / Kling 2 / Hailuo / Runway Gen-4 / Luma Ray 2 / HunyuanVideo Deep Dive

Two years after Sora 1 turned heads in February 2024, the AI video generation market has settled into three camps: closed SOTA (OpenAI / Google / Kuaishou / MiniMax), video-industry standards (Runway / Luma / Pika), and a real open-weight tier (Hunyuan / LTX / Wan / Open-Sora). This guide maps that landscape.

Prologue — What changed in two years
1. The 2026 map — closed / industry tools / open weights
2. Sora 2 (OpenAI) — generation 1 to 2
3. Veo 3 (Google) — synthesized audio + dialogue sync
4. Kling 2 (Kuaishou) — the strongest model from China
5. Hailuo (MiniMax) — China's other heavyweight
6. Runway Gen-4 — the video industry standard
7. Luma Ray 2 — Dream Machine's successor
8. Pika 2 — pivot to image-to-video
9. HunyuanVideo (Tencent, open) — the first real open challenger
10. LTX-Video / Wan 2.1 / Open-Sora — the open camp
11. Diffusion Transformer (DiT) — technical background
12. Audio + music integration — Lyria 2 / Suno / Udio / ElevenLabs SFX
13. Korea and Japan — Sakana AI, KAIST, Naver
14. Who should pick what — recommendations by workload
15. Wrap-up — the big picture for 2026
References

Prologue — What changed in two years

On February 15, 2024, OpenAI released the Sora 1 demo: a woman walking through Tokyo, Earth from space, a paper plane gliding over a jungle. All 1080p, up to 60 seconds, all from a single text prompt. The public release in December 2024 still left rough edges around fidelity, length, and physical consistency.

Two years later, the picture is completely different.

Sora 2 (OpenAI, Oct 2025) — 4K, 120s, consistent characters, precise camera control, directly available inside ChatGPT Plus.
Veo 3 (Google DeepMind, May 2025) — synthesized audio, dialogue and music synced to the video. The first major model that generates "lip-synced talking characters" in one pass.
Kling 2 (Kuaishou, Apr 2025) — China's strongest video model. Kling 1.0 arrived in June 2024 and matched the quality of the Sora 1 demo just four months after that demo appeared.
Hailuo (MiniMax, 2024–) — China's other heavyweight. The generous free daily quota gave them a huge consumer base.
HunyuanVideo (Tencent, Dec 2024) — the first true open-weight competitor. 13B parameters, released under the Tencent Hunyuan Community License, which permits commercial use with restrictions.
Runway Gen-4 (2025) — the de facto standard for film and advertising. Deepest integration with Adobe Creative tools and After Effects.
Luma Ray 2 (2025) — Dream Machine's successor. Emphasizes camera motion and physical consistency.
Pika 2 — pivoted to image-to-video. The "make my photo come alive" market.
LTX-Video (Lightricks, Nov 2024) — faster-than-realtime generation, open weights, runs on consumer GPUs.
Wan-2.1 (Alibaba, Feb 2025) — another strong open-weight contender.
Open-Sora (HPC-AI Tech) — academic open source, Sora-style architecture re-implemented in the open.

Audio joined the stack too. Google Lyria 2, Suno v4, and Udio handle music; ElevenLabs SFX covers sound effects; HeyGen and Synthesia handle lip-sync. The fragmented 2024 pipeline (video here, music there, lip-sync elsewhere) has collapsed into a single workflow in 2026.

This guide walks the map in 15 chapters, ending with a who-should-pick-what guide and a wrap-up.

1. The 2026 map — closed / industry tools / open weights

1.1 Three camps

As of May 2026, the AI video market splits into three groups.

Camp	Examples	Strengths	Weaknesses
Closed SOTA	Sora 2, Veo 3, Kling 2, Hailuo	Best quality, length, consistency	Pricing, restrictions, watermarks
Industry-standard tools	Runway Gen-4, Luma Ray 2, Pika 2	Workflow integration, fine control	Slightly behind SOTA on raw quality
Open weights	HunyuanVideo, LTX, Wan 2.1, Open-Sora	Self-hosting, fine-tuning	Quality and length gap remains

This mirrors the LLM split (GPT-4 / Claude / Gemini vs Anthropic-API-compatible OSS vs Llama / Qwen). Video is just one to two years behind LLMs in following the same pattern.

1.2 The four evaluation axes

Quality — resolution, detail, texture consistency
Temporal coherence — characters and props stay consistent across frames
Physics — gravity, collisions, liquids, cloth all behave plausibly
Control — beyond prompts: camera, character, style, scene-level steering

No model wins all four. Which axis you prioritize depends on whether the workload is an ad insert, a short-film previs, or social content.

1.3 One-line specs

Model	Max res	Max length	Audio sync	License
Sora 2 (OpenAI)	4K	120s	Separate	Closed, API
Veo 3 (Google)	4K	60s	Co-generated audio	Closed, Vertex AI
Kling 2 (Kuaishou)	1080p	30s	None	Closed, web
Hailuo (MiniMax)	1080p	10s	None	Closed, API
Runway Gen-4	1080p	16s	None	Closed, SaaS
Luma Ray 2	1080p	10s	None	Closed, API
Pika 2	720p–1080p	10s	None	Closed, API
HunyuanVideo (Tencent)	720p	5s	None	Open, 13B
LTX-Video (Lightricks)	720p	5s	None	Open, 2B
Wan 2.1 (Alibaba)	720p	5s	None	Open, 14B
Open-Sora	720p	16s	None	Open, Apache-2.0

The short lengths refer to a single generation. Stitching multiple generations into longer pieces is a separate workflow.
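
To make the stitching point concrete, here is a minimal sketch that estimates how many separate generations a target runtime would need. The caps are copied from the table above; the function name is only illustrative, and real stitching also needs overlap and continuity handling that this ignores.

```python
import math

# Single-generation caps in seconds, copied from the spec table above.
MAX_LENGTH_S = {
    "Sora 2": 120, "Veo 3": 60, "Kling 2": 30, "Hailuo": 10,
    "Runway Gen-4": 16, "Luma Ray 2": 10, "Pika 2": 10,
    "HunyuanVideo": 5, "LTX-Video": 5, "Wan 2.1": 5, "Open-Sora": 16,
}

def generations_needed(model: str, target_seconds: float) -> int:
    """Minimum number of separate generations needed to cover target_seconds."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return math.ceil(target_seconds / MAX_LENGTH_S[model])

# A 45-second piece: one Sora 2 generation, but many open-weight clips.
for model in ("Sora 2", "Kling 2", "HunyuanVideo"):
    print(model, generations_needed(model, 45))
```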

2. Sora 2 (OpenAI) — generation 1 to 2

2.1 What changed from Sora 1

Sora 1 went public in December 2024, after the February 2024 demo. The original strengths were length (up to 60s) and resolution (1080p). The weaknesses were the now-familiar diffusion-video pitfalls: twisted fingers, inconsistent costumes between shots, awkward gait.

Sora 2 (October 2025) shipped:

Max length 120s
4K option
Character consistency — the same character keeps a stable appearance across a single prompt. Marketed as "character memory."
Camera control — explicit camera-motion tokens (zoom in, dolly out, orbit left).
Physics — better liquid, collision, and gravity handling.

OpenAI integrated Sora 2 directly into ChatGPT Plus / Team / Enterprise. The API is a separate signup tier.

2.2 Pricing and speed

As of May 2026:

ChatGPT Plus ($20/mo): up to 12s at standard resolution included; beyond that uses credits.
API: roughly $0.30–$0.50 per second of output (varies by resolution and length).
Wall-clock: 1–3 minutes for a 12-second clip.

Video generation takes roughly 100× or more the compute of text generation, and the per-second pricing reflects that directly.
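
As a quick worked example of the API pricing above, the sketch below turns the quoted per-second range into a cost range for one clip. The rates are the rough figures from this section, not an official price list, and the function name is hypothetical.

```python
def api_cost_range(seconds: float, low_per_s: float = 0.30, high_per_s: float = 0.50):
    """Return (low, high) USD cost for a clip at the quoted per-second rates."""
    return seconds * low_per_s, seconds * high_per_s

low, high = api_cost_range(12)
print(f"12-second clip: ${low:.2f} to ${high:.2f}")
```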

2.3 Prompt example

A close-up of a Korean street food vendor flipping hotteok on a hot grill,
steam rising, the camera slowly dollies in from the left.
Time of day: golden hour. Style: cinematic, shallow depth of field.
Duration: 8 seconds. Aspect ratio: 16:9.

Sora 2 picks up camera motion, time of day, style, and duration / aspect ratio as explicit metadata.
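
If you generate many prompts, it can help to keep those fields structured and render the text at the end. The sketch below is one way to do that; the `ShotPrompt` class and its field names are illustrative assumptions, not part of any Sora API.

```python
from dataclasses import dataclass

@dataclass
class ShotPrompt:
    """Hypothetical container for the fields the prompt spells out explicitly."""
    subject: str
    camera: str
    time_of_day: str
    style: str
    duration_s: int
    aspect_ratio: str = "16:9"

    def render(self) -> str:
        # Mirror the layout of the prompt example above.
        return (
            f"{self.subject}, {self.camera}.\n"
            f"Time of day: {self.time_of_day}. Style: {self.style}.\n"
            f"Duration: {self.duration_s} seconds. Aspect ratio: {self.aspect_ratio}."
        )

prompt = ShotPrompt(
    subject="A close-up of a Korean street food vendor flipping hotteok on a hot grill, steam rising",
    camera="the camera slowly dollies in from the left",
    time_of_day="golden hour",
    style="cinematic, shallow depth of field",
    duration_s=8,
)
print(prompt.render())
```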

2.4 Character memory

One of the bigger Sora 2 evolutions. A character that appears in one shot keeps the same appearance in the next, given the same prompt scaffolding. Very useful for ads or short narratives:

[Shot 1] A woman in a red coat walks into a Tokyo subway station at night.
[Shot 2] (Same woman, same coat) She buys a ticket from the machine.
[Shot 3] (Same woman) The train arrives, she steps in.

Industry feedback: "we can storyboard with this." Pre-vis costs for ad sequences have reportedly dropped by an order of magnitude.
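
The shot scaffolding above is easy to generate programmatically. This is a minimal sketch that repeats one character descriptor in every shot; the helper name and the exact bracket format are assumptions made for illustration, since the article only shows the pattern informally.

```python
def build_shots(character: str, actions: list) -> list:
    """Prefix every shot with the same character descriptor to keep it stable."""
    shots = []
    for index, action in enumerate(actions, start=1):
        shots.append(f"[Shot {index}] ({character}) {action}")
    return shots

shots = build_shots(
    "Same woman, red coat",
    [
        "She walks into a Tokyo subway station at night.",
        "She buys a ticket from the machine.",
        "The train arrives, she steps in.",
    ],
)
print("\n".join(shots))
```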

2.5 Weaknesses

Korean text rendered into the scene still falls apart; even English text wobbles occasionally.
Fast action (sports, combat) still produces stretched limbs.
A visible watermark is on by default (it can be turned off on the API tier).
C2PA content credentials embedded on every output.

3. Veo 3 (Google) — synthesized audio + dialogue sync

3.1 Veo 1 to 2 to 3

Google DeepMind announced Veo 1 at Google I/O 2024. Veo 2 followed in December 2024, and Veo 3 launched at Google I/O in May 2025. The big leap in Veo 3 is co-generated audio.

Veo 3 produces four things in one pass, all time-aligned:

Video
Ambient audio — street noise, rain, wind
Dialogue — voice synced to the character's lip movement
Music — BGM from a Lyria 2 integration

This matters because, up to 2024, "AI video" meant a silent clip. Users had to layer BGM, SFX, narration, and lip-sync themselves. Veo 3 generates all of it together from a single prompt.

3.2 Prompt example

A barista in a Seoul cafe pours coffee while explaining the beans to a customer.
She says in Korean: "This is an Ethiopian Yirgacheffe, very floral."
The customer nods. Background: light jazz, gentle espresso machine sounds.

Veo 3 renders the barista pouring, speaks the Korean line with credible lip-sync, and lays down the jazz plus espresso-machine ambience — all aligned. Non-English (Korean, Japanese, Chinese, Spanish, …) works well.

3.3 Pricing and access

Available through Google Vertex AI.
Free quota inside Google AI Studio (aistudio.google.com).
Roughly $0.50–$1.00 for an 8-second clip with audio.
Direct integration into Google Workspace Business / Enterprise.

3.4 Strengths and weaknesses

Strengths

Audio + video in one step. Workflow collapses from many tools to one.
Multilingual dialogue (English, Korean, Japanese, Chinese, Spanish) sounds natural.
Plugs into Workspace — drops directly into Slides / Docs.

Weaknesses

60-second cap (half of Sora 2).
Camera control less granular than Sora 2.
Region availability has lagged outside the US in earlier rollout phases.

4. Kling 2 (Kuaishou) — the strongest model from China

4.1 Kling 1 to 2 — the four-month shock

In June 2024, four months after the Sora 1 demo, the Chinese short-video company Kuaishou (快手) opened Kling 1.0 to the public. Two things were shocking:

Quality was roughly on par with the Sora demo, while everyone else was still figuring out how to even approach that bar.
It was free to use, while Sora was still locked behind a closed waitlist. This let Kuaishou build a massive user base.

Kling iterated fast: 1.5, 1.6, 2.0 (April 2025). As of May 2026, Kling 2 offers:

1080p, 30s
Camera-motion control — explicit tokens like Sora 2.
Image-to-Video — both first and last frames can be supplied.
Multi-shot — automatic cut composition inside a single prompt.

4.2 Why Kling iterates so fast

Kuaishou is a TikTok competitor in China, so it owns an enormous internal video dataset (reportedly tens of billions of hours). That is a real training-data advantage.

The other factor: Chinese AI labs have shown an extremely fast iteration cycle in video, the same way they have in LLMs. Between June 2024 and April 2025, Kling went 1.0 → 1.5 → 1.6 → 2.0; in the same window OpenAI shipped only the public release of Sora 1, and Sora 2 did not arrive until October 2025.

4.3 Pricing and access

klingai.com (global) / kling.kuaishou.com (China).
Free daily credits, paid plans $10–$60 per month.
Global signup works with any major credit card.

4.4 Weaknesses

Content moderation for politically sensitive material (a China-specific constraint).
Korean and Japanese text rendered into scenes breaks down.
No C2PA credentials — provenance tracking is limited.
Pricing and free quotas shift fairly often.

5. Hailuo (MiniMax) — China's other heavyweight

5.1 Who is MiniMax

MiniMax is a Shanghai-headquartered Chinese AI company. They've been working on LLMs and on video / voice models in parallel since 2023. Hailuo (海螺) is their video brand.

When Hailuo launched in August 2024, it positioned itself as "the alternative for everyone who can't get into Sora." Quality wasn't quite at Kling levels, but the free quota was extremely generous.

As of May 2026, Hailuo offers:

1080p, 10s
First-frame and last-frame image-to-video controls.
Generous daily free credits.
Director Mode — camera-motion tokens.

5.2 Strengths

Most generous free quota — great for students and hobbyists.
Easy global signup.
Fast generation — a 6-second clip in roughly 30 seconds.
Image-to-video quality — particularly strong for "turn this portrait into a video."

5.3 Weaknesses

Max length only 10 seconds.
Character consistency is weaker than Sora 2 or Kling 2.
Same content-policy / TOS posture as other Chinese providers.

5.4 Kling vs Hailuo

Axis	Kling 2	Hailuo
Max length	30s	10s
Resolution	1080p	1080p
Camera control	Strong	Medium
Free tier	Decent	Generous
Global access	Easy	Easy
Pricing	$10–$60/mo	$5–$30/mo

Within the Chinese tier, Kling is the SOTA pick; Hailuo is the value pick. Both are evolving rapidly.

6. Runway Gen-4 — the video industry standard

6.1 Where Runway sits

Runway, founded in 2018, is a video + ML tooling company. Its researchers co-authored the latent diffusion paper that Stable Diffusion is built on, and in 2023 the company essentially opened up AI video commercially with Gen-1 and Gen-2.

Gen-3 Alpha shipped in June 2024, then Gen-4 in 2025. Runway's real strength is workflow, not raw model quality.

Frames — reference-image-based control for character, style, and location consistency.
Director Mode — fine camera-motion control.
Video-to-Video — restyle existing footage.
Motion Brush — mask which areas should move.
After Effects plugin — drops directly into compositing pipelines.

6.2 Gen-4 character consistency

The biggest Gen-4 jump is reference-image-driven character consistency. Workflows like this become trivial:

[Reference image] character.png (face)
[Prompt] Same character walking through Times Square at night, neon lights,
camera tracks behind.

That single change is decisive for ads, music videos, and short films — the cost of keeping a character consistent across cuts collapses.

6.3 Pricing

Standard $15/mo — 625 credits
Pro $35/mo — 2,250 credits
Unlimited $95/mo
Enterprise — custom

A 10-second clip burns roughly 50 credits (varies). For ad and media houses, the economics work.
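
To see what those plans buy in practice, here is a small arithmetic sketch. It uses the plan prices and credit counts listed above and the rough 50-credits-per-clip figure from the text; actual credit usage varies, so treat the output as an estimate.

```python
CREDITS_PER_10S_CLIP = 50  # rough figure from the text; real usage varies

# (monthly price in USD, monthly credits), from the plan list above
PLANS = {"Standard": (15, 625), "Pro": (35, 2250)}

def clips_per_month(credits: int) -> int:
    """Whole 10-second clips that fit in a monthly credit allowance."""
    return credits // CREDITS_PER_10S_CLIP

def cost_per_clip(price_usd: float, credits: int) -> float:
    """Effective USD cost of one 10-second clip on a plan."""
    return price_usd / clips_per_month(credits)

for name, (price, credits) in PLANS.items():
    print(name, clips_per_month(credits), f"${cost_per_clip(price, credits):.2f} per clip")
```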

6.4 Who uses it

Ad agencies (Ogilvy, Wieden+Kennedy have published case studies)
Music video directors
Short films and documentaries
After Effects-heavy users — the plugin integration is deep

Sora 2 is "a brilliant clip from a single prompt"; Runway is "fits into a professional video workflow."

7. Luma Ray 2 — Dream Machine's successor

7.1 Luma's roots

Luma AI grew out of NeRF (Neural Radiance Fields) research. From 2022 through 2023 they were known for NeRF-based 3D capture apps.

In June 2024 they entered the AI video market with Dream Machine, almost the same week Kling went public. Ray 1 followed, and Ray 2 shipped in January 2025.

7.2 What Ray 2 emphasizes

Physical consistency — Luma deliberately leans into camera motion and physics quality; the NeRF lineage shows.
Keyframes — pin first, middle, and final frames.
Cinematic camera tokens — orbit, dolly, zoom and other film-style moves.
Solid API — developers can integrate it cleanly into their own apps.

7.3 Pricing

Free — 30 credits per day
Standard $9.99/mo
Pro $29.99/mo
Premier $94.99/mo
API — roughly $0.50 for a 5-second clip

7.4 Runway vs Luma

Axis	Runway Gen-4	Luma Ray 2
Quality	Comparable	Comparable
Camera control	Strong (Director Mode)	Strong (cinematic tokens)
Character consistency	Strong (Frames, references)	Average
Workflow integration	After Effects, native editor	API-first
Pricing	Slightly higher	Generally cheaper

Runway is the industry-standard pick; Luma is the API-integration / physics-coherence pick. Both are reasonable choices.

8. Pika 2 — pivot to image-to-video

8.1 Pika's evolution

Pika Labs started in 2023 as a Discord bot. Together with Runway, they essentially opened the consumer AI video market.

When Pika 2 shipped in late 2024, the strategy shifted. Rather than fight Sora / Veo / Kling head-on in t2v, Pika positioned itself for images, characters, and short social content.

Pika 2 highlights:

Pikaffects — special-effects animations applied to a still image ("melt," "explode," "crush").
Pikascenes — drop a portrait into a fictional scene naturally.
Lip-sync — make a still photo speak.
Fast image-to-video — 8-second clip in roughly 30 seconds.

8.2 Who uses it

Social-media creators (TikTok, Instagram Reels)
Meme makers
Casual users — "make my photo move"

Not the "70-second short film" market that Sora 2 chases; Pika owns the 8-second social market.

8.3 Pricing

Free — daily quota
Standard $10/mo
Pro $35/mo
Fancy $95/mo

Excellent value for influencer / social-marketing workflows.

9. HunyuanVideo (Tencent, open) — the first real open challenger

9.1 Why it mattered

On December 3, 2024, Tencent released HunyuanVideo. The event was big because:

13B parameters — by far the largest open video model at the time.
Quality comparable to Runway Gen-3 / Luma Dream Machine — the first open-weight model to approach closed SOTA.
Tencent Hunyuan Community License — commercial use allowed, subject to restrictions such as territory limits and a monthly-active-user threshold.

This was the video-generation equivalent of Llama 2's moment for LLMs.

9.2 Architecture

HunyuanVideo combines DiT (Diffusion Transformer) with latent diffusion.

3D VAE — compresses video into a latent space.
DiT backbone — the denoising Transformer that runs diffusion in the latent space.
MLLM text encoder — a multimodal LLM serves as the text encoder, with richer representations than CLIP.
Flow matching — more efficient noise-to-video mapping during training.

The technical report is public and is now widely cited in the literature.

9.3 How to run it

git clone https://github.com/Tencent/HunyuanVideo
cd HunyuanVideo

# Recommended: H100 or A100 80GB GPU
python sample_video.py \
  --prompt "A cat playing piano in a jazz bar, warm light" \
  --video-length 65 \
  --infer-steps 50 \
  --save-path ./outputs

A 7B variant exists too — it will fit on an RTX 4090. Quality is noticeably better on 13B.

9.4 ComfyUI

ComfyUI (the node-based AI workflow tool) officially supports HunyuanVideo nodes.

[Load HunyuanVideo Model] - [CLIP Text Encode] - [HunyuanVideo Sampler] - [Video Combine]

Video creators adopted it into their pipelines almost immediately: compared with closed-model pricing, the only cost is GPU time.

9.5 Weaknesses

5-second cap on a single generation.
Korean / Japanese text rendered into the scene still breaks.
Roughly 60GB+ VRAM for the full model. Quantization and CPU offloading help.
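
A back-of-the-envelope sketch shows why quantization matters for the VRAM figure. It counts model weights only, assuming the 13B parameter count from this section; activations, the text encoder, and the VAE come on top, which is presumably where the rest of the 60GB goes.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

# 13B parameters at different precisions (weights only).
for label, nbytes in (("fp32", 4), ("bf16", 2), ("fp8", 1)):
    print(label, weight_memory_gb(13e9, nbytes), "GB")
```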

10. LTX-Video / Wan 2.1 / Open-Sora — the open camp

10.1 LTX-Video (Lightricks, Nov 2024)

Lightricks is an Israeli mobile video editing company (Facetune, Videoleap). In November 2024 they released LTX-Video.

2B parameters — small.
Fast — Lightricks reported about 4 seconds for a 5-second clip on an H100, faster than realtime; consumer cards such as an RTX 4090 are slower but still practical.
Open weights — self-host freely.
Commercial-use license.

LTX's significance is "an AI video model that runs on consumer GPUs." While HunyuanVideo demands H100-class hardware, LTX fits on a single 4090.

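import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

# Load the open weights in bfloat16 and move the pipeline to the GPU.
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

# 121 frames is about 5 seconds at 24 fps.
video = pipe(
    prompt="A woman walking in the rain at night, neon city",
    num_frames=121,
    guidance_scale=3.0,
).frames[0]

# Write the generated frames to disk as an MP4.
export_to_video(video, "output.mp4", fps=24)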

10.2 Wan 2.1 (Alibaba, Feb 2025)

Alibaba runs Qwen for LLMs and Wan for video. Wan 2.1 shipped in February 2025.

14B parameters
Text-to-video and image-to-video both supported.
Flow matching based.
Multilingual prompts — Chinese and English both work well.

Quality is broadly comparable to HunyuanVideo; the two models are often benchmarked head-to-head.

10.3 Open-Sora (HPC-AI Tech)

Open-Sora is an academic open-source project led by NUS / HPC-AI Tech in Singapore. It started right after the Sora 1 demo, aimed at re-implementing the Sora architecture in the open.

Apache-2.0 license.
Full training pipeline + data tooling published.
Quality is a bit behind HunyuanVideo / Wan.
Excellent for research and teaching.

If you want to understand how a video model is actually trained, this is the codebase to read.

10.4 Open camp comparison

Model	Params	Max length	Min GPU	License	Notes
HunyuanVideo (13B)	13B	5s	60GB	Tencent Hunyuan Community License	Best quality
HunyuanVideo (7B)	7B	5s	24GB	Tencent Hunyuan Community License	Compromise
LTX-Video	2B	5s	12GB	Commercial OK	Fast, small
Wan 2.1	14B	5s	60GB	Commercial OK	Hunyuan rival
Open-Sora v2	11B	16s	40GB	Apache-2.0	Academic, 16s

The open camp had its breakout in 2025; in 2026 it is narrowing the gap to closed SOTA, though character consistency and multi-shot control still lag.
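
For a quick hardware check, the comparison table can be turned into a lookup. The sketch below copies the "Min GPU" column as written; those numbers are the article's rough figures, and quantized or offloaded setups may fit in less.

```python
# (model, minimum VRAM in GB), copied from the comparison table above
OPEN_MODELS = [
    ("HunyuanVideo (13B)", 60),
    ("HunyuanVideo (7B)", 24),
    ("LTX-Video", 12),
    ("Wan 2.1", 60),
    ("Open-Sora v2", 40),
]

def models_that_fit(vram_gb: int) -> list:
    """Names of the open models whose listed minimum fits in vram_gb."""
    return [name for name, min_gb in OPEN_MODELS if min_gb <= vram_gb]

print(models_that_fit(24))  # a single RTX 4090
print(models_that_fit(80))  # an 80GB A100 or H100
```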

11. Diffusion Transformer (DiT) — technical background

11.1 Why DiT

Since GANs arrived in 2014, video generation has bounced between GAN, VAE, and Diffusion. For images, Stable Diffusion's 2022 latent-diffusion approach settled the question. Video took longer.

The pivotal move was the DiT (Diffusion Transformer, 2023) paper by William Peebles and Saining Xie, which replaced the UNet backbone of diffusion with a Vision Transformer.

11.2 UNet vs Transformer

Axis	UNet diffusion	DiT
Backbone	CNN-based UNet	Vision Transformer
Scaling	Hard (structural limits)	Easy (LLM-style scaling laws)
Video support	Time axis is awkward	Native sequence handling
Training stability	Mature	New but stable

Video is fundamentally a 3D tensor over (height, width, time). Bolting time onto a UNet is awkward; for a Transformer, time is just another sequence dimension.

After DiT, essentially every major video model moved to a DiT (or DiT-derived) backbone. Sora, Veo, Kling, HunyuanVideo, and Open-Sora are all DiT-family models.
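
The "time is just another sequence dimension" point can be illustrated by counting tokens. In this sketch a latent of shape (frames, height, width) is cut into patches and each patch becomes one Transformer token. The patch size (1, 2, 2) and the latent shapes are illustrative assumptions, not the settings of any specific model.

```python
def dit_token_count(latent_shape, patch_size=(1, 2, 2)):
    """Number of tokens a DiT sees for a (frames, height, width) latent."""
    frames, height, width = latent_shape
    pt, ph, pw = patch_size
    if frames % pt or height % ph or width % pw:
        raise ValueError("latent shape must be divisible by the patch size")
    return (frames // pt) * (height // ph) * (width // pw)

image_tokens = dit_token_count((1, 64, 64))    # a single latent frame
video_tokens = dit_token_count((16, 64, 64))   # 16 latent frames
print(image_tokens, video_tokens)

# Full self-attention cost grows with the square of the sequence length.
print((video_tokens / image_tokens) ** 2)
```

The same backbone handles both cases; video just produces a longer sequence, which is why attention cost becomes the main scaling concern.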

11.3 Why latent diffusion still matters

A 1024×1024 frame is 1M pixels. One second at 24fps is 24M pixels. Running diffusion in raw pixel space is impractical.

Latent diffusion compresses the video with a VAE into latent space (e.g., 128×128×8 ≈ 130K dimensions) and runs diffusion there. The compute budget drops by 100× or more.

Every modern video model starts with a 3D VAE (often causal). HunyuanVideo, Wan, and Open-Sora all train custom 3D VAEs.
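
The size reduction can be checked with the example numbers above. This sketch compares one second of 1024×1024 video at 24 fps with the 128×128×8 latent from the text, counting positions only and ignoring channels to match the article's counting.

```python
from math import prod

def compression_ratio(raw_shape, latent_shape) -> float:
    """How many raw positions map to one latent position."""
    return prod(raw_shape) / prod(latent_shape)

raw = (24, 1024, 1024)    # frames, height, width for one second at 24 fps
latent = (8, 128, 128)    # the example latent size from the text
print(prod(raw), prod(latent), compression_ratio(raw, latent))
```

With these numbers the latent has 192 times fewer positions than the raw clip, which is consistent with the "100× or more" figure.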

11.4 Flow matching — the new training recipe

Between 2022 and 2023, flow matching rose as a diffusion alternative.

Diffusion: learn the noise-to-video path via an SDE.
Flow matching: learn the noise-to-video path via an ODE. Training is more stable and inference is faster.

HunyuanVideo, Wan 2.1, and Stable Diffusion 3 all adopted flow matching. It's the de facto standard by 2026.
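
To make the ODE view concrete, here is a toy sketch of flow matching with a straight-line path, using the convention that t=0 is noise and t=1 is data (some papers flip this). It uses plain Python lists instead of tensors, and the "oracle" velocity function stands in for a trained network, so this illustrates the training target and the sampler rather than a real model.

```python
import random

def fm_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus its target velocity."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

random.seed(0)
data = [1.0, -2.0, 0.5]                      # stand-in for a latent video
noise = [random.gauss(0, 1) for _ in data]   # starting noise sample
x_t, v_target = fm_training_pair(noise, data, t=0.3)

# An oracle that already knows the true velocity recovers the data exactly.
sample = euler_sample(lambda x, t: v_target, noise)
print(sample)
```

In training, a network would predict the velocity from `x_t` and `t`, and `fm_loss` would be minimized over many noise and data pairs.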

11.5 From CLIP to LLM text encoders

Text encoding for video models traditionally used CLIP. Between 2024 and 2025 that changed.

Stable Diffusion 3 — added T5-XXL as a text encoder.
HunyuanVideo — uses an MLLM (multimodal LLM) as the text encoder.
Veo 3 — reportedly uses a Gemini-based text encoder.

For long prompts, complex scene descriptions, and multilingual inputs, LLM-based encoders dominate. Even just breaking past CLIP's 77-token limit is a major gain.

12. Audio + music integration — Lyria 2 / Suno / Udio / ElevenLabs SFX

12.1 Video is not silent anymore

Through 2024, AI video output was almost universally silent. Sora 1, Kling 1, Runway Gen-3 all emitted video tracks only — users had to layer BGM, SFX, narration, and lip-sync separately.

Starting in 2025, that broke.

12.2 Lyria 2 (Google DeepMind, 2025)

Lyria is Google DeepMind's music generation model; version 2 was released in 2025.

Text-to-music generation.
Integrated into YouTube Shorts Dream Track and similar products.
Co-generated with Veo 3 — when Veo 3 generates video, Lyria provides synced BGM.

12.3 Suno v4 / Udio

Suno (Cambridge, MA) and Udio (founded by ex-Google-DeepMind team members) are the two strongest music generation companies.

Suno v4 — lyrics + melody together. Full-length tracks up to 4 minutes.
Udio — comparable quality, more granular control.

For video creators who need BGM, it is almost always one of these two. Free quotas are generous.

12.4 ElevenLabs Sound Effects

ElevenLabs' core business is TTS, but they added an SFX (sound effects) model in 2024.

Text-to-sound — "footsteps in snow," "thunder rumble," "espresso machine."
Clips up to 22 seconds long.
Sufficient free quota.

Useful for niche effects that aren't in standard SFX libraries.

12.5 HeyGen / Synthesia — lip-sync specialists

HeyGen and Synthesia dominate the "AI avatar + lip-sync" market.

Upload a video of yourself → generate an AI avatar.
Type text → the avatar speaks it naturally (multilingual).
Used heavily for internal training, customer support, sales demos.

In the enterprise segment, HeyGen and Synthesia are effectively the default.

12.6 A unified workflow

A typical 2026 video creation pipeline:

[Sora 2 or Kling 2] main 8-second clip
  |
[Suno v4] 30s BGM (slightly longer than the video)
  |
[ElevenLabs SFX] effects (footsteps, ambience)
  |
[ElevenLabs TTS] narration
  |
[CapCut / DaVinci / Premiere] composite

Or skip the assembly entirely with Veo 3 (video + audio in one pass).
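
For a scripted alternative to the editor step, the final composite can be sketched as an ffmpeg call. Using ffmpeg here is a substitution of mine, not something the pipeline above prescribes, and the file names are placeholders. The function only builds the argument list: it mixes the audio tracks with the `amix` filter, copies the video stream unchanged, and trims the result to the shortest stream with `-shortest`.

```python
def build_mux_command(video: str, audio_tracks: list, output: str) -> list:
    """Build an ffmpeg argument list that mixes audio tracks under one video."""
    if not audio_tracks:
        raise ValueError("at least one audio track is required")
    cmd = ["ffmpeg", "-y", "-i", video]
    for track in audio_tracks:
        cmd += ["-i", track]
    # Inputs 1..N are the audio files; input 0 is the video.
    labels = "".join(f"[{i}:a]" for i in range(1, len(audio_tracks) + 1))
    mix = f"{labels}amix=inputs={len(audio_tracks)}:duration=longest[mix]"
    cmd += [
        "-filter_complex", mix,
        "-map", "0:v", "-map", "[mix]",
        "-c:v", "copy",
        "-shortest",  # stop at the shortest stream, here the 8-second clip
        output,
    ]
    return cmd

cmd = build_mux_command(
    "clip.mp4", ["bgm.mp3", "sfx.wav", "narration.wav"], "final.mp4"
)
print(" ".join(cmd))
```

Pass the list to `subprocess.run(cmd, check=True)` to execute it. `amix` rescales input volumes by default, so levels usually need adjusting afterwards.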

13. Korea and Japan — Sakana AI, KAIST, Naver

13.1 Korea — KAIST, Naver, and generative-video startups

Korean academia and industry contribute along these lines:

KAIST — theoretical work on diffusion and flow matching (e.g., research from Jong Chul Ye's group).
Naver AI Lab — extending HyperCLOVA X to multimodal, both video understanding (VLM) and generation.
Kakao Brain — Karlo (image gen), Sketch2Video.
Startups — Lablup (model infrastructure), Snowmind, Twelve Labs (video search).

Twelve Labs in particular earned international recognition for "AI that understands video" — they lean toward understanding rather than generation. They have many published collaborations with NVIDIA.

13.2 Japan — Sakana AI

Sakana AI is a Tokyo-based company founded by ex-Google-Brain / DeepMind researchers David Ha and Llion Jones (a Transformer paper co-author).

Evolutionary Model Merging — automatically combining models to produce new ones.
DiffusionPipe / Sakana AI Scientist — automated diffusion-model design.
Collaborates with Japanese government and corporates on Japanese-language multimodal models.

They don't ship a consumer video product, but they build core techniques other companies use.

13.3 Japan's animation + AI niche

Japan's video AI scene is closely tied to the animation industry.

Stability AI Japan — Japanese Stable Diffusion, anime-style specialized models.
AniPortrait / EMO — portrait + audio to lip-synced animation.
VOICEVOX-style integrations — voice synthesis combined with video pipelines.

Japan also has the most developed domain expertise for "character consistency" — a long-running concern in anime production.

13.4 Training data and copyright policy

Country	Training data policy	Output copyright
US	Ongoing fair-use debate	Only the human-authored portion
EU	AI Act, explicit opt-out signals	Similar
Japan	Training explicitly allowed (Copyright Act, Article 30-4)	Some recognition in special cases
Korea	Legal framework still being formalized	Similar
China	Heavy content moderation, output liability stated	Some recognition in special cases

Japan's training-data policy is the most permissive, which is why Japan is considered training-friendly for AI image and video models.

14. Who should pick what — recommendations by workload

14.1 Advertising and brand inserts

Recommendation: Sora 2 or Veo 3

Sora 2: character memory, 4K, 120s — short ad sequences end-to-end.
Veo 3: co-generated audio — significant post-production savings.
Budget: a single ad spot generation runs $50–$500.

Ad agencies often pair these with Runway Gen-4 — Sora / Veo for pre-vis, Runway plus After Effects for final compositing.

14.2 Film and series pre-visualization

Recommendation: Sora 2 + Runway Gen-4

Sora 2 character memory drives the storyboard / pre-vis pass.
Runway Gen-4 reference-image control keeps the character on-model.
Integrates directly with director / VFX-supervisor workflows.

Film industry cases: previs costs for a short film dropping from around $30,000 to $3,000 are widely reported.

14.3 Social content

Recommendation: Pika 2 + Hailuo + Suno

Pika 2 for effects and lip-sync.
Hailuo for its generous free quota.
Suno for BGM.
Budget: a full workflow under $20–$50 per month.

14.4 Learning and education

Recommendation: HeyGen + ElevenLabs

HeyGen avatar + ElevenLabs TTS.
Internal training, online courses, tutorials.
Auto-multilingual subtitles and dubbing.

14.5 Games and interactive

Recommendation: LTX-Video + self-hosting

Speed is decisive when content is generated dynamically in-game.
No per-generation license fees; open weights.
Runs on a single RTX 4090.

14.6 Research and academia

Recommendation: HunyuanVideo + Open-Sora

Full training pipeline available.
Models can be fine-tuned on custom datasets.
Reproducibility for paper-grade work.

14.7 Monthly budget table

Use case	Recommended stack	Monthly (USD)
Hobby / experimentation	Kling / Hailuo free + Pika	$0
Single creator	Pika Pro + Suno	$30–$50
Social marketing	Kling + Hailuo + Suno + ElevenLabs	$50–$150
Ad agency	Sora 2 API + Runway Pro + Veo 3	$500–$5,000
Film pre-vis	Sora 2 + Runway Unlimited + Luma	$1,000–$10,000
Self-hosted (open)	HunyuanVideo / LTX + rented GPUs	GPU cost only

14.8 Decision tree

              [Need synced audio?]
              /              \
           Yes               No
            |                  \
        [Veo 3]         [Character consistency critical?]
                          /              \
                        Yes               No
                         |                  \
                     [Length 30s+?]    [Short social clip?]
                      /        \         /         \
                    Yes        No      Yes          No
                     |          \       |            \
                 [Sora 2]   [Runway Gen-4]  [Pika 2]  [Kling/Hailuo]
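
The same tree can be written as a small function, which is handy if you want to embed the recommendation logic in a tool. This is a direct transcription of the branches above; the function and argument names are illustrative.

```python
def pick_model(
    needs_synced_audio: bool,
    character_consistency_critical: bool = False,
    length_30s_plus: bool = False,
    short_social_clip: bool = False,
) -> str:
    """Walk the decision tree above and return the recommended model."""
    if needs_synced_audio:
        return "Veo 3"
    if character_consistency_critical:
        return "Sora 2" if length_30s_plus else "Runway Gen-4"
    return "Pika 2" if short_social_clip else "Kling / Hailuo"

print(pick_model(needs_synced_audio=True))
print(pick_model(False, character_consistency_critical=True, length_30s_plus=True))
print(pick_model(False, short_social_clip=True))
```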

15. Wrap-up — the big picture for 2026

Three themes.

First, video, audio, and lip-sync have merged into one workflow. Veo 3 was the inflection point; Sora 3 (or a successor to Sora 2) is expected to follow. The 2024 era of "stitch together three separate tools" is essentially over.

Second, open weights are closing the gap to closed SOTA. HunyuanVideo, Wan 2.1, and LTX made self-hosting and fine-tuning real options, mirroring how Llama 3 caught up with GPT-4 in the text world. The remaining gap is in higher-level control (character consistency, multi-shot composition), where closed SOTA is still about a year ahead.

Third, video generation moved from "cool demo" to "production workflow." Advertising, film pre-vis, social content, internal training: there are case studies in every category. The 2024 "is this real?" reaction has been replaced in 2026 with "we're hitting deadlines with this."

The interesting questions for the next 12–24 months: (1) will Sora 3 finally solve character consistency end-to-end, (2) will another HunyuanVideo-grade open model arrive within a year, (3) will video + audio + lip-sync truly consolidate into one model, and (4) will C2PA and watermark standards converge.

"AI made this" is no longer the headline. "How well it was made" is the actual game now.

References

OpenAI Sora — https://openai.com/sora
Sora technical report (Feb 2024) — https://openai.com/research/video-generation-models-as-world-simulators
Google DeepMind Veo — https://deepmind.google/technologies/veo/
Google Vertex AI Veo — https://cloud.google.com/vertex-ai/generative-ai/docs/video/overview
Kling AI — https://klingai.com
Kuaishou Kling announcement — https://kling.kuaishou.com
MiniMax Hailuo — https://hailuoai.video
Runway Gen-4 — https://runwayml.com/research/introducing-runway-gen-4
Luma AI Dream Machine / Ray — https://lumalabs.ai/dream-machine
Pika Labs — https://pika.art
Tencent HunyuanVideo (GitHub) — https://github.com/Tencent/HunyuanVideo
HunyuanVideo technical report — https://arxiv.org/abs/2412.03603
Lightricks LTX-Video — https://github.com/Lightricks/LTX-Video
Alibaba Wan-2.1 — https://github.com/Wan-Video/Wan2.1
Open-Sora (HPC-AI Tech) — https://github.com/hpcaitech/Open-Sora
DiT paper (Peebles and Xie, 2023) — https://arxiv.org/abs/2212.09748
Latent Diffusion (Rombach et al.) — https://arxiv.org/abs/2112.10752
Flow Matching paper — https://arxiv.org/abs/2210.02747
Google Lyria — https://deepmind.google/discover/blog/transforming-music-creation-with-ai-and-human-creativity/
Suno AI — https://suno.com
Udio — https://udio.com
ElevenLabs Sound Effects — https://elevenlabs.io/sound-effects
HeyGen — https://heygen.com
Synthesia — https://synthesia.io
ComfyUI — https://github.com/comfyanonymous/ComfyUI
Sakana AI — https://sakana.ai
Twelve Labs — https://twelvelabs.io
Naver AI Lab — https://clova.ai
C2PA Content Credentials — https://c2pa.org
AniPortrait — https://github.com/Zejun-Yang/AniPortrait
EMO (Alibaba) — https://humanaigc.github.io/emote-portrait-alive/
KAIST AI — https://gsai.kaist.ac.kr