AI Video Generation 2026 — Sora 2 / Veo 3 / Kling 2 / Hailuo / Runway Gen-4 / Luma Ray 2 / HunyuanVideo Deep Dive
- Author: Youngju Kim (@fjvbn20031)
AI Video Generation 2026 — Two years after Sora 1 turned heads in February 2024, the AI video generation market has settled into three camps: closed SOTA (OpenAI / Google / Kuaishou / MiniMax), video-industry standards (Runway / Luma / Pika), and a real open-weight tier (Hunyuan / LTX / Wan / Open-Sora). This guide maps that landscape.
- Prologue — What changed in two years
- 1. The 2026 map — closed / industry tools / open weights
- 2. Sora 2 (OpenAI) — generation 1 to 2
- 3. Veo 3 (Google) — synthesized audio + dialogue sync
- 4. Kling 2 (Kuaishou) — the strongest model from China
- 5. Hailuo (MiniMax) — China's other heavyweight
- 6. Runway Gen-4 — the video industry standard
- 7. Luma Ray 2 — Dream Machine's successor
- 8. Pika 2 — pivot to image-to-video
- 9. HunyuanVideo (Tencent, open) — the first real open challenger
- 10. LTX-Video / Wan 2.1 / Open-Sora — the open camp
- 11. Diffusion Transformer (DiT) — technical background
- 12. Audio + music integration — Lyria 2 / Suno / Udio / ElevenLabs SFX
- 13. Korea and Japan — Sakana AI, KAIST, Naver
- 14. Who should pick what — recommendations by workload
- 15. Wrap-up — the big picture for 2026
- References
Prologue — What changed in two years
On February 15, 2024, OpenAI released the Sora 1 demo: a woman walking through Tokyo, Earth from space, a paper plane gliding over a jungle. All 1080p, up to 60 seconds, all from a single text prompt. The public release in December 2024 still left rough edges around fidelity, length, and physical consistency.
Two years later, the picture is completely different.
- Sora 2 (OpenAI, Oct 2025) — 4K, 120s, consistent characters, precise camera control, directly available inside ChatGPT Plus.
- Veo 3 (Google DeepMind, May 2025): synthesized audio, with dialogue and music synced to the video. The first major model that generates "lip-synced talking characters" in one pass.
- Kling 2 (Kuaishou, Apr 2025) — China's strongest video model. Kling 1.0 in June 2024 matched the Sora 1 demo in quality within four months of seeing it.
- Hailuo (MiniMax, 2024–) — China's other heavyweight. The generous free daily quota gave them a huge consumer base.
- HunyuanVideo (Tencent, Dec 2024): the first true open-weight competitor. 13B parameters, released under Tencent's own community license (commercial use allowed with restrictions).
- Runway Gen-4 (2025) — the de facto standard for film and advertising. Deepest integration with Adobe Creative tools and After Effects.
- Luma Ray 2 (2025) — Dream Machine's successor. Emphasizes camera motion and physical consistency.
- Pika 2 — pivoted to image-to-video. The "make my photo come alive" market.
- LTX-Video (Lightricks, Nov 2024): faster-than-realtime generation, open weights, runs on consumer GPUs.
- Wan-2.1 (Alibaba, Feb 2025) — another strong open-weight contender.
- Open-Sora (HPC-AI Tech) — academic open source, Sora-style architecture re-implemented in the open.
Audio joined the stack too. Google Lyria 2, Suno v4, and Udio handle music; ElevenLabs SFX covers sound effects; HeyGen and Synthesia handle lip-sync. The fragmented 2024 pipeline (video here, music there, lip-sync elsewhere) has collapsed into a single workflow in 2026.
This guide walks the map in 15 chapters, closing with a who-should-pick-what guide and a wrap-up.
1. The 2026 map — closed / industry tools / open weights
1.1 Three camps
As of May 2026, the AI video market splits into three groups.
| Camp | Examples | Strengths | Weaknesses |
|---|---|---|---|
| Closed SOTA | Sora 2, Veo 3, Kling 2, Hailuo | Best quality, length, consistency | Pricing, restrictions, watermarks |
| Industry-standard tools | Runway Gen-4, Luma Ray 2, Pika 2 | Workflow integration, fine control | Slightly behind SOTA on raw quality |
| Open weights | HunyuanVideo, LTX, Wan 2.1, Open-Sora | Self-hosting, fine-tuning | Quality and length gap remains |
This mirrors the LLM split: closed frontier models (GPT-4 / Claude / Gemini), products and tooling built on top of their APIs, and open weights (Llama / Qwen). Video is just one to two years behind LLMs in following the same pattern.
1.2 The four evaluation axes
- Quality — resolution, detail, texture consistency
- Temporal coherence — characters and props stay consistent across frames
- Physics — gravity, collisions, liquids, cloth all behave plausibly
- Control — beyond prompts: camera, character, style, scene-level steering
No model wins all four. Which axis you prioritize depends on whether the workload is an ad insert, a short-film previs, or social content.
1.3 One-line specs
| Model | Max res | Max length | Audio sync | License |
|---|---|---|---|---|
| Sora 2 (OpenAI) | 4K | 120s | Separate | Closed, API |
| Veo 3 (Google) | 4K | 60s | Co-generated audio | Closed, Vertex AI |
| Kling 2 (Kuaishou) | 1080p | 30s | None | Closed, web |
| Hailuo (MiniMax) | 1080p | 10s | None | Closed, API |
| Runway Gen-4 | 1080p | 16s | None | Closed, SaaS |
| Luma Ray 2 | 1080p | 10s | None | Closed, API |
| Pika 2 | 720p–1080p | 10s | None | Closed, API |
| HunyuanVideo (Tencent) | 720p | 5s | None | Open, 13B |
| LTX-Video (Lightricks) | 720p | 5s | None | Open, 2B |
| Wan 2.1 (Alibaba) | 720p | 5s | None | Open, 14B |
| Open-Sora | 720p | 16s | None | Open, Apache-2.0 |
The short lengths refer to a single generation. Stitching multiple generations into longer pieces is a separate workflow.
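To make the stitching point concrete, here is a minimal sketch that counts how many single generations a target runtime needs. The caps are copied from the table above; the helper name `clips_needed` is made up for illustration, and the count ignores overlap or transition frames that a real edit would need.

```python
import math

# Single-generation caps in seconds, copied from the spec table above.
MAX_CLIP_SECONDS = {
    "Sora 2": 120, "Veo 3": 60, "Kling 2": 30, "Hailuo": 10,
    "Runway Gen-4": 16, "Luma Ray 2": 10, "Pika 2": 10,
    "HunyuanVideo": 5, "LTX-Video": 5, "Wan 2.1": 5, "Open-Sora": 16,
}


def clips_needed(model: str, target_seconds: float) -> int:
    """Minimum number of back-to-back generations that cover target_seconds."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return math.ceil(target_seconds / MAX_CLIP_SECONDS[model])


if __name__ == "__main__":
    for name in ("Sora 2", "Kling 2", "Hailuo", "HunyuanVideo"):
        print(f"{name}: {clips_needed(name, 30)} generation(s) for a 30s spot")
```

A 30-second spot is one generation on Sora 2 or Kling 2, three on Hailuo, and six on HunyuanVideo, which is where the consistency problems between stitched clips start to matter.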
2. Sora 2 (OpenAI) — generation 1 to 2
2.1 What changed from Sora 1
Sora 1 went public in December 2024, after the February 2024 demo. The original strengths were length (up to 60s) and resolution (1080p). The weaknesses were the now-familiar diffusion-video pitfalls: twisted fingers, inconsistent costumes between shots, awkward gait.
Sora 2 (October 2025) shipped:
- Max length 120s
- 4K option
- Character consistency — the same character keeps a stable appearance across a single prompt. Marketed as "character memory."
- Camera control — explicit camera-motion tokens (zoom in, dolly out, orbit left).
- Physics — better liquid, collision, and gravity handling.
OpenAI integrated Sora 2 directly into ChatGPT Plus / Team / Enterprise. The API is a separate signup tier.
2.2 Pricing and speed
As of May 2026:
- ChatGPT Plus ($20/mo): up to 12s at standard resolution included; beyond that uses credits.
- API: roughly $0.30–$0.50 per second of output (varies by resolution and length).
- Wall-clock: 1–3 minutes for a 12-second clip.
Video generation costs at least 100× more than text generation, and the prices reflect it directly.
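As a rough worked example, the per-second API range quoted above can be turned into a clip budget. This is only arithmetic on the article's figures; the function name and the default rates are placeholders, and actual billing may depend on resolution and other factors.

```python
def api_cost_range(seconds: float,
                   low_per_s: float = 0.30,
                   high_per_s: float = 0.50) -> tuple:
    """Return (low, high) USD estimates for one clip at the quoted per-second rates."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return (round(seconds * low_per_s, 2), round(seconds * high_per_s, 2))


if __name__ == "__main__":
    for length in (12, 60, 120):
        low, high = api_cost_range(length)
        print(f"{length:>3}s clip: ${low:.2f} to ${high:.2f}")
```

At these rates a 12-second clip lands between $3.60 and $6.00, and a full 120-second generation between $36 and $60, before any retries.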
2.3 Prompt example
A close-up of a Korean street food vendor flipping hotteok on a hot grill,
steam rising, the camera slowly dollies in from the left.
Time of day: golden hour. Style: cinematic, shallow depth of field.
Duration: 8 seconds. Aspect ratio: 16:9.
Sora 2 picks up camera motion, time of day, style, and duration / aspect ratio as explicit metadata.
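If you generate many clips, it can help to keep those fields separate and render the prompt text from them. The sketch below does that with a small dataclass. The field names and the layout simply mirror the example prompt above; they are not an official Sora 2 schema.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """Structured fields for one shot; names are illustrative, not an API schema."""
    scene: str
    camera: str
    time_of_day: str
    style: str
    duration_s: int
    aspect_ratio: str = "16:9"

    def render(self) -> str:
        # Same three-line layout as the hand-written prompt above.
        return (
            f"{self.scene}, {self.camera}.\n"
            f"Time of day: {self.time_of_day}. Style: {self.style}.\n"
            f"Duration: {self.duration_s} seconds. Aspect ratio: {self.aspect_ratio}."
        )


if __name__ == "__main__":
    shot = ShotPrompt(
        scene="A close-up of a Korean street food vendor flipping hotteok on a hot grill, steam rising",
        camera="the camera slowly dollies in from the left",
        time_of_day="golden hour",
        style="cinematic, shallow depth of field",
        duration_s=8,
    )
    print(shot.render())
```

Keeping camera, style, and duration as separate fields makes it easy to vary one axis at a time when comparing outputs.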
2.4 Character memory
One of the bigger Sora 2 evolutions. A character that appears in one shot keeps the same appearance in the next, given the same prompt scaffolding. Very useful for ads or short narratives:
[Shot 1] A woman in a red coat walks into a Tokyo subway station at night.
[Shot 2] (Same woman, same coat) She buys a ticket from the machine.
[Shot 3] (Same woman) The train arrives, she steps in.
Industry feedback: "we can storyboard with this." Pre-vis costs for ad sequences have reportedly dropped by an order of magnitude.
2.5 Weaknesses
- Korean text rendered into the scene still falls apart; even English text wobbles occasionally.
- Fast action (sports, combat) still produces stretched limbs.
- Watermark is on by default; it can be turned off only on the API tier.
- C2PA content credentials embedded on every output.
3. Veo 3 (Google) — synthesized audio + dialogue sync
3.1 Veo 1 to 2 to 3
Google DeepMind announced Veo 1 at Google I/O 2024. Veo 2 followed in December 2024, and Veo 3 launched at Google I/O in May 2025. The big leap in Veo 3 is co-generated audio.
Veo 3 produces four things in one pass, all time-aligned:
- Video
- Ambient audio — street noise, rain, wind
- Dialogue — voice synced to the character's lip movement
- Music — BGM from a Lyria 2 integration
This matters because, up to 2024, "AI video" meant a silent clip. Users had to layer BGM, SFX, narration, and lip-sync themselves. Veo 3 generates all of it together from a single prompt.
3.2 Prompt example
A barista in a Seoul cafe pours coffee while explaining the beans to a customer.
She says in Korean: "This is an Ethiopian Yirgacheffe, very floral."
The customer nods. Background: light jazz, gentle espresso machine sounds.
Veo 3 renders the barista pouring, speaks the Korean line with credible lip-sync, and lays down the jazz plus espresso-machine ambience — all aligned. Non-English (Korean, Japanese, Chinese, Spanish, …) works well.
3.3 Pricing and access
- Available through Google Vertex AI.
- Free quota inside Google AI Studio (aistudio.google.com).
- Roughly $0.50–$1.00 for an 8-second clip with audio.
- Direct integration into Google Workspace Business / Enterprise.
3.4 Strengths and weaknesses
Strengths
- Audio + video in one step. Workflow collapses from many tools to one.
- Multilingual dialogue (English, Korean, Japanese, Chinese, Spanish) sounds natural.
- Plugs into Workspace — drops directly into Slides / Docs.
Weaknesses
- 60-second cap (half of Sora 2).
- Camera control less granular than Sora 2.
- Region availability has lagged outside the US in earlier rollout phases.
4. Kling 2 (Kuaishou) — the strongest model from China
4.1 Kling 1 to 2 — the four-month shock
In June 2024, four months after the Sora 1 demo, the Chinese short-video company Kuaishou (快手) opened Kling 1.0 to the public. Two things were shocking:
- Quality was roughly on par with the Sora demo, while everyone else was still figuring out how to even approach that bar.
- It was free to use, while Sora was still locked behind a closed waitlist. This let Kuaishou build a massive user base.
Kling iterated fast: 1.5, 1.6, 2.0 (April 2025). As of May 2026, Kling 2 offers:
- 1080p, 30s
- Camera-motion control — explicit tokens like Sora 2.
- Image-to-Video — both first and last frames can be supplied.
- Multi-shot — automatic cut composition inside a single prompt.
4.2 Why Kling iterates so fast
Kuaishou competes with Douyin (TikTok's sister app) in China, so it owns an enormous internal video dataset (reportedly tens of billions of hours). That's a real training-data advantage.
The other factor: Chinese AI labs have shown an extremely fast iteration cycle in video, the same way they have in LLMs. Between June 2024 and April 2025, Kling went 1.0 → 1.5 → 1.6 → 2.0; in the same window OpenAI shipped only the public release of Sora 1, and Sora 2 did not follow until October 2025.
4.3 Pricing and access
- klingai.com (global) / kling.kuaishou.com (China).
- Free daily credits; paid plans $10–$60 per month.
- Global signup works with any major credit card.
4.4 Weaknesses
- Content moderation for politically sensitive material (a China-specific constraint).
- Korean and Japanese text rendered into scenes breaks down.
- No C2PA credentials — provenance tracking is limited.
- Pricing and free quotas shift fairly often.
5. Hailuo (MiniMax) — China's other heavyweight
5.1 Who is MiniMax
MiniMax is a Shanghai-headquartered Chinese AI company. They've been working on LLMs and on video / voice models in parallel since 2023. Hailuo (海螺) is their video brand.
When Hailuo launched in August 2024, it positioned itself as "the alternative for everyone who can't get into Sora." Quality wasn't quite at Kling levels, but the free quota was extremely generous.
As of May 2026, Hailuo offers:
- 1080p, 10s
- First-frame and last-frame image-to-video controls.
- Generous daily free credits.
- Director Mode — camera-motion tokens.
5.2 Strengths
- Most generous free quota — great for students and hobbyists.
- Easy global signup.
- Fast generation — a 6-second clip in roughly 30 seconds.
- Image-to-video quality — particularly strong for "turn this portrait into a video."
5.3 Weaknesses
- Max length only 10 seconds.
- Character consistency is weaker than Sora 2 or Kling 2.
- Same content-policy / TOS posture as other Chinese providers.
5.4 Kling vs Hailuo
| Axis | Kling 2 | Hailuo |
|---|---|---|
| Max length | 30s | 10s |
| Resolution | 1080p | 1080p |
| Camera control | Strong | Medium |
| Free tier | Decent | Generous |
| Global access | Easy | Easy |
| Pricing | $10–$60/mo | $5–$30/mo |
Within the Chinese tier, Kling is the SOTA pick; Hailuo is the value pick. Both are evolving rapidly.
6. Runway Gen-4 — the video industry standard
6.1 Where Runway sits
Runway, founded in 2018, is a video + ML tooling company. They were co-credited on the original Stable Diffusion paper, and in 2023 they essentially opened up AI video commercially with Gen-1 and Gen-2.
Gen-3 Alpha shipped in June 2024, then Gen-4 in 2025. Runway's real strength is workflow, not raw model quality.
- Frames — reference-image-based control for character, style, and location consistency.
- Director Mode — fine camera-motion control.
- Video-to-Video — restyle existing footage.
- Motion Brush — mask which areas should move.
- After Effects plugin — drops directly into compositing pipelines.
6.2 Gen-4 character consistency
The biggest Gen-4 jump is reference-image-driven character consistency. Workflows like this become trivial:
[Reference image] character.png (face)
[Prompt] Same character walking through Times Square at night, neon lights,
camera tracks behind.
That single change is decisive for ads, music videos, and short films — the cost of keeping a character consistent across cuts collapses.
6.3 Pricing
- Standard ($15/mo): 625 credits
- Pro ($35/mo): 2,250 credits
- Unlimited ($95/mo)
- Enterprise: custom
A 10-second clip burns roughly 50 credits (varies). For ad and media houses, the economics work.
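A quick back-of-the-envelope check of those economics, using only the Standard and Pro figures listed above and the rough 50-credits-per-clip estimate. Actual credit burn varies by settings, so treat the output as an order-of-magnitude guide.

```python
# (monthly price in USD, monthly credits), from the plan list above.
PLANS = {"Standard": (15, 625), "Pro": (35, 2250)}
CREDITS_PER_10S_CLIP = 50  # rough figure from the text; varies in practice


def clips_per_month(plan: str) -> int:
    """Whole 10-second clips that fit in the plan's monthly credits."""
    return PLANS[plan][1] // CREDITS_PER_10S_CLIP


def cost_per_clip(plan: str) -> float:
    """Effective USD per 10-second clip if every credit is used."""
    return round(PLANS[plan][0] / clips_per_month(plan), 2)


if __name__ == "__main__":
    for plan in PLANS:
        print(f"{plan}: {clips_per_month(plan)} clips/month, "
              f"about ${cost_per_clip(plan):.2f} per clip")
```

Standard works out to about 12 clips a month at $1.25 each, and Pro to 45 clips at roughly $0.78 each.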
6.4 Who uses it
- Ad agencies (Ogilvy, Wieden+Kennedy have published case studies)
- Music video directors
- Short films and documentaries
- After Effects-heavy users — the plugin integration is deep
Sora 2 is "a brilliant clip from a single prompt"; Runway is "fits into a professional video workflow."
7. Luma Ray 2 — Dream Machine's successor
7.1 Luma's roots
Luma AI grew out of NeRF (Neural Radiance Fields) research. From 2022 through 2023 they were known for NeRF-based 3D capture apps.
In June 2024 they entered the AI video market with Dream Machine, almost the same week Kling went public. Ray 1 followed, and Ray 2 shipped in January 2025.
7.2 What Ray 2 emphasizes
- Physical consistency — Luma deliberately leans into camera motion and physics quality; the NeRF lineage shows.
- Keyframes — pin first, middle, and final frames.
- Cinematic camera tokens — orbit, dolly, zoom and other film-style moves.
- Solid API — developers can integrate it cleanly into their own apps.
7.3 Pricing
- Free — 30 credits per day
- Standard ($9.99/mo)
- Pro ($29.99/mo)
- Premier ($94.99/mo)
- API: roughly $0.50 for a 5-second clip
7.4 Runway vs Luma
| Axis | Runway Gen-4 | Luma Ray 2 |
|---|---|---|
| Quality | Comparable | Comparable |
| Camera control | Strong (Director Mode) | Strong (cinematic tokens) |
| Character consistency | Strong (Frames, references) | Average |
| Workflow integration | After Effects, native editor | API-first |
| Pricing | Slightly higher | Generally cheaper |
Runway is the industry-standard pick; Luma is the API-integration / physics-coherence pick. Both are reasonable choices.
8. Pika 2 — pivot to image-to-video
8.1 Pika's evolution
Pika Labs started in 2023 as a Discord bot. Together with Runway, they essentially opened the consumer AI video market.
When Pika 2 shipped in late 2024, the strategy shifted. Rather than fight Sora / Veo / Kling head-on in t2v, Pika positioned itself for images, characters, and short social content.
Pika 2 highlights:
- Pikaffects — special-effects animations applied to a still image ("melt," "explode," "crush").
- Pikascenes — drop a portrait into a fictional scene naturally.
- Lip-sync — make a still photo speak.
- Fast image-to-video — 8-second clip in roughly 30 seconds.
8.2 Who uses it
- Social-media creators (TikTok, Instagram Reels)
- Meme makers
- Casual users — "make my photo move"
Not the "70-second short film" market that Sora 2 chases; Pika owns the 8-second social market.
8.3 Pricing
- Free — daily quota
- Standard ($10/mo)
- Pro ($35/mo)
- Fancy ($95/mo)
Excellent value for influencer / social-marketing workflows.
9. HunyuanVideo (Tencent, open) — the first real open challenger
9.1 Why it mattered
On December 3, 2024, Tencent released HunyuanVideo. The event was big because:
- 13B parameters — by far the largest open video model at the time.
- Quality comparable to Runway Gen-3 / Luma Dream Machine — the first open-weight model to approach closed SOTA.
- Tencent Hunyuan Community License: commercial use is allowed, subject to the license's restrictions.
This was the video-generation equivalent of Llama 2's moment for LLMs.
9.2 Architecture
HunyuanVideo combines DiT (Diffusion Transformer) with latent diffusion.
- 3D VAE — compresses video into a latent space.
- DiT backbone: the transformer that performs the denoising in latent space.
- MLLM text encoder — a multimodal LLM serves as the text encoder, with richer representations than CLIP.
- Flow matching — more efficient noise-to-video mapping during training.
The technical report is public and is now widely cited in the literature.
9.3 How to run it
```bash
git clone https://github.com/Tencent/HunyuanVideo
cd HunyuanVideo
# Install the dependencies and download the weights per the repo README first.
# Recommended: H100 or A100 80GB GPU
python sample_video.py \
    --prompt "A cat playing piano in a jazz bar, warm light" \
    --video-length 65 \
    --infer-steps 50 \
    --save-path ./outputs
```
A 7B variant exists too — it will fit on an RTX 4090. Quality is noticeably better on 13B.
9.4 ComfyUI
ComfyUI (the node-based AI workflow tool) officially supports HunyuanVideo nodes.
[Load HunyuanVideo Model] - [CLIP Text Encode] - [HunyuanVideo Sampler] - [Video Combine]
Video creators adopted it quickly: compared with closed-model pricing, self-hosting means paying only for GPU time.
9.5 Weaknesses
- 5-second cap on a single generation.
- Korean / Japanese text rendered into the scene still breaks.
- Roughly 60GB+ VRAM for the full model. Quantization and CPU offloading reduce the footprint.
10. LTX-Video / Wan 2.1 / Open-Sora — the open camp
10.1 LTX-Video (Lightricks, Nov 2024)
Lightricks is an Israeli mobile video editing company (Facetune, Videoleap). In November 2024 they released LTX-Video.
- 2B parameters — small.
- Fast: Lightricks reports about 4 seconds for a 5-second clip on an H100, faster than realtime.
- Open weights — self-host freely.
- Commercial-use license.
LTX's significance is "an AI video model that runs on consumer GPUs." While HunyuanVideo demands H100-class hardware, LTX fits on a single 4090.
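```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

# Load the open weights in bfloat16 and move the pipeline to the GPU.
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

# num_frames follows the 8k + 1 pattern LTX-Video expects (121 = 8 * 15 + 1).
video = pipe(
    prompt="A woman walking in the rain at night, neon city",
    num_frames=121,
    guidance_scale=3.0,
).frames[0]

# Write the frames to disk; 24 fps is an assumption, adjust to taste.
export_to_video(video, "output.mp4", fps=24)
```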
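The `num_frames=121` value is not arbitrary: as far as I can tell from the LTX-Video documentation, the model expects frame counts of the form 8k + 1. The helper below snaps a requested duration to the nearest valid count. The function names are hypothetical, and the default of 24 fps is an assumption about playback rate rather than something the pipeline enforces.

```python
def snap_num_frames(seconds: float, fps: int = 24, stride: int = 8) -> int:
    """Nearest frame count of the form stride * k + 1 for the requested duration."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    k = max(1, round(seconds * fps / stride))
    return stride * k + 1


def clip_seconds(num_frames: int, fps: int = 24) -> float:
    """Playback duration of a clip with num_frames frames at the given fps."""
    return num_frames / fps


if __name__ == "__main__":
    n = snap_num_frames(5)                      # 5 seconds at 24 fps
    print(n, f"{clip_seconds(n):.2f}s")
    print(snap_num_frames(2.7, stride=4))       # a 4k + 1 variant
```

The `--video-length 65` used in the HunyuanVideo command earlier fits a 4k + 1 pattern, which is why `stride` is a parameter here; treat that reading as an assumption and check each model's README for its own constraint.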
10.2 Wan 2.1 (Alibaba, Feb 2025)
Alibaba runs Qwen for LLMs and Wan for video. Wan 2.1 shipped in February 2025.
- 14B parameters
- Text-to-video and image-to-video both supported.
- Flow matching based.
- Multilingual prompts — Chinese and English both work well.
Quality is broadly comparable to HunyuanVideo; the two models are often benchmarked head-to-head.
10.3 Open-Sora (HPC-AI Tech)
Open-Sora is an academic open-source project led by NUS / HPC-AI Tech in Singapore. It started right after the Sora 1 demo, aimed at re-implementing the Sora architecture in the open.
- Apache-2.0 license.
- Full training pipeline + data tooling published.
- Quality is a bit behind HunyuanVideo / Wan.
- Excellent for research and teaching.
If you want to understand how a video model is actually trained, this is the codebase to read.
10.4 Open camp comparison
| Model | Params | Max length | Min GPU | License | Notes |
|---|---|---|---|---|---|
| HunyuanVideo (13B) | 13B | 5s | 60GB | Tencent community license | Best quality |
| HunyuanVideo (7B) | 7B | 5s | 24GB | Tencent community license | Compromise |
| LTX-Video | 2B | 5s | 12GB | Commercial OK | Fast, small |
| Wan 2.1 | 14B | 5s | 60GB | Commercial OK | Hunyuan rival |
| Open-Sora v2 | 11B | 16s | 40GB | Apache-2.0 | Academic, 16s |
The open camp had its breakout in 2025; in 2026 it is narrowing the gap to closed SOTA, though character consistency and multi-shot control still lag.
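For a quick hardware check, the "Min GPU" column above can be turned into a lookup. This sketch only filters the table's own numbers; real memory use depends on resolution, frame count, precision, and offloading, so the thresholds are a starting point rather than a guarantee.

```python
# (model, minimum VRAM in GB), copied from the comparison table above.
OPEN_MODELS = [
    ("HunyuanVideo (13B)", 60),
    ("HunyuanVideo (7B)", 24),
    ("LTX-Video", 12),
    ("Wan 2.1", 60),
    ("Open-Sora v2", 40),
]


def models_that_fit(vram_gb: float) -> list:
    """Names of the models whose listed minimum fits in vram_gb."""
    return [name for name, need in OPEN_MODELS if need <= vram_gb]


if __name__ == "__main__":
    for card, vram in (("RTX 4090", 24), ("A100 40GB", 40), ("H100 80GB", 80)):
        print(f"{card}: {', '.join(models_that_fit(vram)) or 'none'}")
```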
11. Diffusion Transformer (DiT) — technical background
11.1 Why DiT
Since GANs arrived in 2014, video generation has bounced between GAN, VAE, and Diffusion. For images, Stable Diffusion's 2022 latent-diffusion approach settled the question. Video took longer.
The pivotal move was the DiT (Diffusion Transformer, 2023) paper by William Peebles and Saining Xie, which replaced the UNet backbone of diffusion with a Vision Transformer.
11.2 UNet vs Transformer
| Axis | UNet diffusion | DiT |
|---|---|---|
| Backbone | CNN-based UNet | Vision Transformer |
| Scaling | Hard (structural limits) | Easy (LLM-style scaling laws) |
| Video support | Time axis is awkward | Native sequence handling |
| Training stability | Mature | New but stable |
Video is fundamentally a 3D tensor over (height, width, time). Bolting time onto a UNet is awkward; for a Transformer, time is just another sequence dimension.
After DiT, essentially every major video model moved to a DiT (or DiT-derived) backbone. Sora, Veo, Kling, HunyuanVideo, and Open-Sora are all DiT-family models.
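To see what "time is just another sequence dimension" means in practice, the sketch below counts the tokens a transformer would see after cutting a video latent into spacetime patches. The latent shape and the 1×2×2 patch size are illustrative assumptions, not the configuration of any specific model named here.

```python
def num_tokens(latent_thw: tuple, patch_thw: tuple) -> int:
    """Sequence length after splitting a (T, H, W) latent into (pt, ph, pw) patches."""
    count = 1
    for size, patch in zip(latent_thw, patch_thw):
        if size % patch != 0:
            raise ValueError("each latent dimension must be divisible by its patch size")
        count *= size // patch
    return count


def attention_pairs(tokens: int) -> int:
    """Token pairs scored by full self-attention (quadratic in sequence length)."""
    return tokens * tokens


if __name__ == "__main__":
    short = num_tokens((8, 128, 128), (1, 2, 2))
    long = num_tokens((16, 128, 128), (1, 2, 2))
    print(short, attention_pairs(short))
    print(long, attention_pairs(long))
```

Doubling the number of latent frames doubles the token count and quadruples the attention pairs, which is one reason single-generation length is expensive to extend.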
11.3 Why latent diffusion still matters
A 1024×1024 frame is 1M pixels. One second at 24fps is 24M pixels. Running diffusion in raw pixel space is impractical.
Latent diffusion compresses the video with a VAE into latent space (e.g., 128×128×8 ≈ 130K dimensions) and runs diffusion there. The compute budget drops by 100× or more.
Every modern video model starts with a 3D VAE (often causal). HunyuanVideo, Wan, and Open-Sora all train custom 3D VAEs.
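The arithmetic behind that compression claim is easy to check. The sketch below uses the frame size and the example latent shape from this section and counts spatial-temporal positions only; it ignores channel counts on both sides, so it is a rough ratio rather than an exact compute estimate.

```python
def positions(height: int, width: int, frames: int) -> int:
    """Number of (x, y, t) positions in a video or latent volume."""
    return height * width * frames


pixel_positions = positions(1024, 1024, 24)   # one second of 1024x1024 at 24 fps
latent_positions = positions(128, 128, 8)     # the example latent from the text
ratio = pixel_positions // latent_positions

if __name__ == "__main__":
    print(f"pixel space : {pixel_positions:,}")
    print(f"latent space: {latent_positions:,}")
    print(f"reduction   : {ratio}x")
```

With these numbers the reduction is 192×, consistent with the "100× or more" figure above.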
11.4 Flow matching — the new training recipe
Between 2022 and 2023, flow matching rose as a diffusion alternative.
- Diffusion: learn the noise-to-video path via an SDE.
- Flow matching: learn the noise-to-video path via an ODE. Training is more stable and inference is faster.
HunyuanVideo, Wan 2.1, and Stable Diffusion 3 all adopted flow matching. It's the de facto standard by 2026.
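A toy version of the idea, in plain Python: pair a noise sample with a data sample, interpolate on a straight line, and train a network to predict the constant velocity along that line. At inference, integrate the ODE with a few Euler steps. This is a minimal sketch under one common convention (t = 0 is noise, t = 1 is data); real models differ in time parameterization and loss weighting, and the "oracle" below stands in for a trained network.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network along that path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    random.seed(0)
    data = [1.0, -2.0, 0.5]
    noise = [random.gauss(0, 1) for _ in data]
    oracle = lambda x, t: target_velocity(noise, data)  # perfect "network"
    print(euler_sample(noise, oracle, steps=8))  # lands on `data`
```

Because the ideal path is straight, a well-trained velocity field needs few integration steps, which is the intuition behind the faster-inference claim.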
11.5 From CLIP to LLM text encoders
Text encoding for video models traditionally used CLIP. Between 2024 and 2025 that changed.
- Stable Diffusion 3 — added T5-XXL as a text encoder.
- HunyuanVideo — uses an MLLM (multimodal LLM) as the text encoder.
- Veo 3 — uses Gemini's text encoder.
For long prompts, complex scene descriptions, and multilingual inputs, LLM-based encoders dominate. Even just breaking past CLIP's 77-token limit is a major gain.
12. Audio + music integration — Lyria 2 / Suno / Udio / ElevenLabs SFX
12.1 Video is not silent anymore
Through 2024, AI video output was almost universally silent. Sora 1, Kling 1, Runway Gen-3 all emitted video tracks only — users had to layer BGM, SFX, narration, and lip-sync separately.
Starting in 2025, that broke.
12.2 Lyria 2 (Google DeepMind, 2025)
Lyria is Google DeepMind's music generation model; version 2 was released in 2025.
- Text-to-music generation.
- Integrated into YouTube Shorts Dream Track and similar products.
- Co-generated with Veo 3 — when Veo 3 generates video, Lyria provides synced BGM.
12.3 Suno v4 / Udio
Suno (Cambridge, MA) and Udio (founded by ex-Google-DeepMind team members) are the two strongest music generation companies.
- Suno v4 — lyrics + melody together. Full-length tracks up to 4 minutes.
- Udio — comparable quality, more granular control.
For video creators who need BGM, it is almost always one of these two. Free quotas are generous.
12.4 ElevenLabs Sound Effects
ElevenLabs' core business is TTS, but they added an SFX (sound effects) model in 2024.
- Text-to-sound — "footsteps in snow," "thunder rumble," "espresso machine."
- Clips up to 22 seconds long.
- Sufficient free quota.
Useful for niche effects that aren't in standard SFX libraries.
12.5 HeyGen / Synthesia — lip-sync specialists
HeyGen and Synthesia dominate the "AI avatar + lip-sync" market.
- Upload a video of yourself → generate an AI avatar.
- Type text → the avatar speaks it naturally (multilingual).
- Used heavily for internal training, customer support, sales demos.
In the enterprise segment, HeyGen and Synthesia are effectively the default.
12.6 A unified workflow
A typical 2026 video creation pipeline:
[Sora 2 or Kling 2] main 8-second clip
|
[Suno v4] 30s BGM (slightly longer than the video)
|
[ElevenLabs SFX] effects (footsteps, ambience)
|
[ElevenLabs TTS] narration
|
[CapCut / DaVinci / Premiere] composite
Or skip the assembly entirely with Veo 3 (video + audio in one pass).
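For the final composite step, a scriptable alternative to a GUI editor is ffmpeg. That is a substitution on my part, since the pipeline above names CapCut, DaVinci, and Premiere. The sketch below only builds the command line for laying a longer music track under a silent clip; the file names are hypothetical, and ffmpeg must be installed separately to run it.

```python
from typing import List


def mux_command(video: str, bgm: str, out: str) -> List[str]:
    """Build an ffmpeg call that lays a music track under a silent clip."""
    return [
        "ffmpeg",
        "-i", video,       # input 0: the generated clip
        "-i", bgm,         # input 1: the music track
        "-map", "0:v:0",   # take the video stream from the clip
        "-map", "1:a:0",   # take the audio stream from the music file
        "-c:v", "copy",    # keep the video stream as-is, no re-encode
        "-c:a", "aac",     # encode audio for an MP4 container
        "-shortest",       # stop when the shorter stream (the clip) ends
        out,
    ]


if __name__ == "__main__":
    cmd = mux_command("clip_8s.mp4", "bgm_30s.mp3", "final.mp4")
    print(" ".join(cmd))
    # To execute: import subprocess; subprocess.run(cmd, check=True)
```

The `-shortest` flag is what makes "BGM slightly longer than the video" convenient: the music is trimmed to the clip length automatically.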
13. Korea and Japan — Sakana AI, KAIST, Naver
13.1 Korea — KAIST, Naver, and generative-video startups
Korean academia and industry contribute along these lines:
- KAIST — theoretical work on diffusion and flow matching (e.g., research from Jong Chul Ye's group).
- Naver AI Lab — extending HyperCLOVA X to multimodal, both video understanding (VLM) and generation.
- Kakao Brain — Karlo (image gen), Sketch2Video.
- Startups — Lablup (model infrastructure), Snowmind, Twelve Labs (video search).
Twelve Labs in particular earned international recognition for "AI that understands video" — they lean toward understanding rather than generation. They have many published collaborations with NVIDIA.
13.2 Japan — Sakana AI
Sakana AI is a Tokyo-based company founded by ex-Google-Brain / DeepMind researchers David Ha and Llion Jones (a Transformer paper co-author).
- Evolutionary Model Merging — automatically combining models to produce new ones.
- DiffusionPipe / Sakana AI Scientist — automated diffusion-model design.
- Collaborates with Japanese government and corporates on Japanese-language multimodal models.
They don't ship a consumer video product, but they build core techniques other companies use.
13.3 Japan's animation + AI niche
Japan's video AI scene is closely tied to the animation industry.
- Stability AI Japan — Japanese Stable Diffusion, anime-style specialized models.
- AniPortrait / EMO — portrait + audio to lip-synced animation.
- VOICEVOX-style integrations — voice synthesis combined with video pipelines.
Japan also has the most developed domain expertise for "character consistency" — a long-running concern in anime production.
13.4 Training data and copyright policy
| Country | Training data policy | Output copyright |
|---|---|---|
| US | Ongoing fair-use debate | Only the human-authored portion |
| EU | AI Act, explicit opt-out signals | Similar |
| Japan | Training explicitly allowed (Copyright Act, Article 30-4) | Some recognition in special cases |
| Korea | Legal framework still being formalized | Similar |
| China | Heavy content moderation, output liability stated | Some recognition in special cases |
Japan's Article 30-4 exception is the most permissive rule in this table, which is why Japan is considered training-friendly for AI image and video models.
14. Who should pick what — recommendations by workload
14.1 Advertising and brand inserts
Recommendation: Sora 2 or Veo 3
- Sora 2: character memory, 4K, 120s — short ad sequences end-to-end.
- Veo 3: co-generated audio — significant post-production savings.
- Budget: a single ad spot generation runs $50–$500.
Ad agencies often pair these with Runway Gen-4 — Sora / Veo for pre-vis, Runway plus After Effects for final compositing.
14.2 Film and series pre-visualization
Recommendation: Sora 2 + Runway Gen-4
- Sora 2 character memory drives the storyboard / pre-vis pass.
- Runway Gen-4 reference-image control keeps the character on-model.
- Integrates directly with director / VFX-supervisor workflows.
Reported film-industry cases put the previs cost of a short film dropping from around $30,000 to around $3,000.
14.3 Social content (TikTok / Reels / Shorts)
Recommendation: Pika 2 + Hailuo + Suno
- Pika 2 for effects and lip-sync.
- Hailuo for its generous free quota.
- Suno for BGM.
- Budget: a full workflow runs roughly $20–$50 per month.
14.4 Learning and education
Recommendation: HeyGen + ElevenLabs
- HeyGen avatar + ElevenLabs TTS.
- Internal training, online courses, tutorials.
- Auto-multilingual subtitles and dubbing.
14.5 Games and interactive
Recommendation: LTX-Video + self-hosting
- Speed is decisive when content is generated dynamically in-game.
- Open weights under a license that permits commercial use.
- Runs on a single RTX 4090.
14.6 Research and academia
Recommendation: HunyuanVideo + Open-Sora
- Full training pipeline available.
- Models can be fine-tuned on custom datasets.
- Reproducibility for paper-grade work.
14.7 Monthly budget table
| Use case | Recommended stack | Monthly (USD) |
|---|---|---|
| Hobby / experimentation | Kling / Hailuo free + Pika | $0 |
| Single creator | Pika Pro + Suno | $30–$50 |
| Social marketing | Kling + Hailuo + Suno + ElevenLabs | $50–$150 |
| Ad agency | Sora 2 API + Runway Pro + Veo 3 | $500–$5,000 |
| Film pre-vis | Sora 2 + Runway Unlimited + Luma | $1,000–$10,000 |
| Self-hosted (open) | HunyuanVideo / LTX + rented GPUs | GPU cost only |
14.8 Decision tree
[Need synced audio?]
/ \
Yes No
| \
[Veo 3] [Character consistency critical?]
/ \
Yes No
| \
[Length 30s+?] [Short social clip?]
/ \ / \
Yes No Yes No
| \ | \
[Sora 2] [Runway Gen-4] [Pika 2] [Kling/Hailuo]
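The same tree can be written as a small function, which is handy if the choice is made inside a tool rather than by hand. The function and parameter names are made up for illustration; the branches follow the diagram above one-to-one.

```python
def pick_model(needs_synced_audio: bool,
               character_consistency_critical: bool,
               length_over_30s: bool = False,
               short_social_clip: bool = False) -> str:
    """Walk the decision tree above and return the recommended model."""
    if needs_synced_audio:
        return "Veo 3"
    if character_consistency_critical:
        return "Sora 2" if length_over_30s else "Runway Gen-4"
    return "Pika 2" if short_social_clip else "Kling / Hailuo"


if __name__ == "__main__":
    print(pick_model(needs_synced_audio=True, character_consistency_critical=False))
    print(pick_model(False, True, length_over_30s=True))
    print(pick_model(False, False, short_social_clip=True))
```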
15. Wrap-up — the big picture for 2026
Three themes.
First, video, audio, and lip-sync have merged into one workflow. Veo 3 was the inflection point; Sora 3 (or a successor to Sora 2) is expected to follow. The 2024 era of "stitch together three separate tools" is essentially over.
Second, open weights are closing the gap to closed SOTA, lagging by roughly a year. HunyuanVideo, Wan 2.1, and LTX made self-hosting and fine-tuning real options, mirroring how Llama 3 caught up with GPT-4 in the text world. The remaining gap is in higher-level control (character consistency, multi-shot composition) where closed SOTA is still about a year ahead.
Third, video generation moved from "cool demo" to "production workflow." Advertising, film pre-vis, social content, internal training: there are case studies in every category. The 2024 "is this real?" reaction has been replaced in 2026 with "we're hitting deadlines with this."
The interesting questions for the next 12–24 months: (1) will Sora 3 finally solve character consistency end-to-end, (2) will another HunyuanVideo-grade open model arrive within a year, (3) will video + audio + lip-sync truly consolidate into one model, and (4) will C2PA and watermark standards converge.
"AI made this" is no longer the headline. "How well it was made" is the actual game now.
References
- OpenAI Sora — https://openai.com/sora
- Sora technical report (Feb 2024) - https://openai.com/research/video-generation-models-as-world-simulators
- Google DeepMind Veo — https://deepmind.google/technologies/veo/
- Google Vertex AI Veo — https://cloud.google.com/vertex-ai/generative-ai/docs/video/overview
- Kling AI — https://klingai.com
- Kuaishou Kling announcement — https://kling.kuaishou.com
- MiniMax Hailuo — https://hailuoai.video
- Runway Gen-4 — https://runwayml.com/research/introducing-runway-gen-4
- Luma AI Dream Machine / Ray — https://lumalabs.ai/dream-machine
- Pika Labs — https://pika.art
- Tencent HunyuanVideo (GitHub) — https://github.com/Tencent/HunyuanVideo
- HunyuanVideo technical report — https://arxiv.org/abs/2412.03603
- Lightricks LTX-Video — https://github.com/Lightricks/LTX-Video
- Alibaba Wan-2.1 — https://github.com/Wan-Video/Wan2.1
- Open-Sora (HPC-AI Tech) — https://github.com/hpcaitech/Open-Sora
- DiT paper (Peebles and Xie, 2023) — https://arxiv.org/abs/2212.09748
- Latent Diffusion (Rombach et al.) — https://arxiv.org/abs/2112.10752
- Flow Matching paper — https://arxiv.org/abs/2210.02747
- Google Lyria — https://deepmind.google/discover/blog/transforming-music-creation-with-ai-and-human-creativity/
- Suno AI — https://suno.com
- Udio — https://udio.com
- ElevenLabs Sound Effects — https://elevenlabs.io/sound-effects
- HeyGen — https://heygen.com
- Synthesia — https://synthesia.io
- ComfyUI — https://github.com/comfyanonymous/ComfyUI
- Sakana AI — https://sakana.ai
- Twelve Labs — https://twelvelabs.io
- Naver AI Lab — https://clova.ai
- C2PA Content Credentials — https://c2pa.org
- AniPortrait — https://github.com/Zejun-Yang/AniPortrait
- EMO (Alibaba) — https://humanaigc.github.io/emote-portrait-alive/
- KAIST AI — https://gsai.kaist.ac.kr