Voice AI in 2026 — ElevenLabs / Cartesia / Sesame / Whisper Turbo / Deepgram / Parakeet Deep Dive

Voice AI 2026 series — Since Whisper Turbo arrived in October 2024, voice AI has been moving as fast as text LLMs. This post is the May 2026 map of TTS, STT, and realtime voice agents.

Prologue — Why Voice, Now Again
Chapter 1 · The 2026 Voice AI Map — TTS / STT / Voice Agents
Chapter 2 · Whisper Large v3 Turbo (Oct 2024) — 8x Faster Multilingual STT
Chapter 3 · Deepgram Nova-3 / AssemblyAI Universal-2 — Commercial STT Battle
Chapter 4 · NVIDIA Parakeet 1.1 — The Open-Source SOTA
Chapter 5 · ElevenLabs — The TTS Standard
Chapter 6 · Cartesia (the Mamba Authors) — Sonic 2 + Ultra-Low Latency
Chapter 7 · Sesame (Iribe, March 2025) — "Voice Presence"
Chapter 8 · ChatTTS / F5-TTS / XTTS-v2 — Open-Source TTS
Chapter 9 · Realtime API — OpenAI / Google / ElevenLabs Conversational
Chapter 10 · Voice Agents — Vapi / Retell / Bland / Synthflow
Chapter 11 · Korea — Naver CLOVA, Kakao KOTTS, SKT NUGU
Chapter 12 · Japan — VOICEVOX (Open Source), Coeiroink, GPT-SoVITS, Bert-VITS2
Chapter 13 · Who Should Choose What — Call Center / Game NPC / Audiobook / Interpretation
Chapter 14 · Wrap-Up — The Big Picture of Voice AI in 2026
References

Prologue — Why Voice, Now Again

The 2022–2023 LLM boom was text-first. ChatGPT web chat, GitHub Copilot, RAG bots — all keyboard input. Voice was "yeah, someday maybe."

By 2026 three events changed the picture.

Whisper Large v3 Turbo (Oct 2024) — OpenAI open-sourced a turbo variant that's 8x faster than v3 large. Realtime STT on a single A100 became practical.
Cartesia Sonic 2 (2024) — From a startup whose founding team includes Mamba co-author Albert Gu. Sub-90ms TTS, fast enough that pairing it with a GPT-4-class LLM can keep users from realizing they're talking to a bot.
Sesame (March 2025) — Oculus co-founder Brendan Iribe's "voice presence" demo. A 30-second clip on social media convinced everyone "this is different."

Layer in ElevenLabs V3, Deepgram Nova-3, AssemblyAI Universal-2, NVIDIA Parakeet 1.1, OpenAI Realtime API, plus voice-agent platforms like Vapi and Retell — and by May 2026, "AI call center" is no longer a PoC. It ships.

This post maps that landscape across 14 chapters.

Chapter 1 · The 2026 Voice AI Map — TTS / STT / Voice Agents

1.1 The Three Axes

A voice AI system almost always splits into three components.

Stage	Role	Representative models/services
STT (Speech-to-Text)	human speech to text	Whisper Turbo, Deepgram Nova-3, AssemblyAI Universal-2, Parakeet
LLM	text input to text response	GPT-4o, Claude 3.5, Gemini 2
TTS (Text-to-Speech)	text to speech	ElevenLabs, Cartesia Sonic 2, Sesame, OpenAI TTS, VOICEVOX

Then there are unified systems that handle all of it in one model: OpenAI Realtime API, Google Live API, ElevenLabs Conversational v2. Unified is more natural but pricing, constraints, and debuggability differ.
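The three-stage split in the table above can be sketched as plain function composition. This is a minimal illustration, not any vendor's SDK: the `Stt`, `Llm`, and `Tts` aliases and the `fake_*` stages are hypothetical placeholders, and real services stream audio and text incrementally rather than passing whole buffers.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only).
Stt = Callable[[bytes], str]   # audio in, transcript out
Llm = Callable[[str], str]     # transcript in, reply text out
Tts = Callable[[str], bytes]   # reply text in, audio out


def run_turn(audio: bytes, stt: Stt, llm: Llm, tts: Tts) -> bytes:
    """Run one conversational turn through the split pipeline."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


# Stub stages so the sketch runs without any network calls.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


print(run_turn(b"hello", fake_stt, fake_llm, fake_tts))
```

Because each stage is just a callable, any one of them can be swapped (for example a different TTS vendor) without touching the other two, which is the main practical argument for the split design.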

1.2 Evaluation Axes

In 2026, voice AI is evaluated on four axes.

Latency — from when the user stops speaking to when the AI starts. Under 200ms is the natural threshold.
Quality — naturalness, emotion, multilingual accuracy.
Cost — per minute or per 1M characters.
Control — voice cloning, emotion tags, SSML, speed control.

No single model wins all four, so the answer depends on the workload. A call center, a game NPC, and an audiobook each have different priorities.

1.3 Open Source vs Commercial

Axis	Open source	Commercial
TTS quality	F5-TTS, XTTS-v2, ChatTTS — got better, not at the top	ElevenLabs, Cartesia, Sesame — dominant
STT accuracy	Whisper, Parakeet — basically caught up	Deepgram, AssemblyAI — slight edge, domain tuning
Latency	self-host can reach 100ms	200–500ms (network)
Cost	GPU only	$0.01–$0.30 per minute

Open source has basically caught commercial STT. The TTS gap is still real. That's the big picture in 2026.

Chapter 2 · Whisper Large v3 Turbo (Oct 2024) — 8x Faster Multilingual STT

2.1 v3 to v3 turbo

When OpenAI open-sourced Whisper in September 2022, that was one of the biggest single events in voice AI. 99 languages, multilingual, free, commercial-grade STT accuracy.

v3 turbo, released in October 2024, prunes v3 large's decoder from 32 layers to 4 and then fine-tunes the pruned model. Result:

Speed: about 8x faster than v3
Model size: 1.5B to 809M parameters
Accuracy: 1–2% loss on major languages (English/Korean/Japanese) versus v3 — practically equivalent
Language coverage: 99 to a slightly smaller set (a few rare languages dropped)

import whisper

model = whisper.load_model("turbo")  # large-v3-turbo
result = model.transcribe("interview.mp3", language="ko")
print(result["text"])

2.2 Why 8x Matters

v3 large took about 3 minutes on an A100 to transcribe one hour of audio. That is fine for batch jobs, but the heavy 32-layer decoder kept per-chunk latency too high for comfortable realtime use (streaming was a separate concern).

Turbo finishes the same audio in 22 seconds. Consequences:

Realtime captions: 200–400ms chunks can keep up
Batch cost down 8x: cloud GPU hours fall
Edge devices: realtime on an M2 MacBook Air
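The arithmetic behind those figures is easy to check. The sketch below only restates the post's approximate numbers (3 minutes and 22 seconds per audio-hour on an A100); they are not new measurements, and `speed_factor` is a helper name chosen for illustration.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds


ONE_HOUR = 3600.0
v3_large = speed_factor(ONE_HOUR, 3 * 60)  # about 3 minutes per audio-hour
turbo = speed_factor(ONE_HOUR, 22)         # about 22 seconds per audio-hour
speedup = turbo / v3_large

print(f"v3 large: {v3_large:.0f}x realtime")
print(f"turbo:    {turbo:.0f}x realtime")
print(f"speedup:  {speedup:.1f}x")
```

The ratio comes out slightly above 8, which matches the headline "8x" claim.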

2.3 Limitations

Diarization: Whisper doesn't know who spoke. WhisperX or similar covers this.
True streaming: Whisper is 30-second chunk based. faster-whisper or whisper-streaming wraps it.
Domain adaptation: medical/legal/finance jargon needs fine-tune. Deepgram and AssemblyAI offer domain models.
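To make the 30-second constraint concrete, here is one simple way a wrapper could slice long audio into overlapping windows. It is a sketch under stated assumptions: the `chunk_spans` helper and the 5-second overlap are hypothetical, and this is not the exact strategy used by faster-whisper or whisper-streaming.

```python
def chunk_spans(total_s: float, window_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) spans in seconds that cover [0, total_s]."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be in [0, window_s)")
    step = window_s - overlap_s
    spans = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break  # the last window already reaches the end of the audio
        start += step
    return spans


for span in chunk_spans(70.0):
    print(span)
```

The overlap gives each window some shared context with its neighbor, so a word cut at a boundary can be recovered from the adjacent chunk when the transcripts are merged.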

2.4 Variants — faster-whisper / WhisperX / Distil-Whisper

Tool	Core	Use case
OpenAI Whisper (official)	Reference PyTorch	Research/eval
faster-whisper	CTranslate2 backend, up to 4x faster than the reference	Production batch
WhisperX	+ diarization + word timestamps	Media captioning
Distil-Whisper	Smaller distilled variant	Mobile/edge

In production most teams run faster-whisper or WhisperX. The official OpenAI implementation is for research/eval.
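The "diarization + word timestamps" combination boils down to matching each timed word to the speaker turn it overlaps most. The sketch below shows that idea in isolation; it is a simplified illustration with hypothetical names (`assign_speakers`, the tuple layouts), not WhisperX's actual code.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it most.

    words: list of (text, start, end) from the STT model
    turns: list of (speaker, start, end) from a diarization model
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            ov = overlap(w_start, w_end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append((best_speaker, text))
    return labeled


words = [("hi", 0.0, 0.4), ("there", 0.5, 0.9), ("yes", 1.2, 1.5)]
turns = [("A", 0.0, 1.0), ("B", 1.0, 2.0)]
print(assign_speakers(words, turns))
```

Words that overlap no turn at all come back with a speaker of `None`, which is a useful signal that the diarization output has gaps.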

Chapter 3 · Deepgram Nova-3 / AssemblyAI Universal-2 — Commercial STT Battle

3.1 Deepgram Nova-3 — The Latency King

Deepgram's edge is latency. Nova-3 has:

First-word latency under 100ms — partial transcripts start almost immediately
End-to-end in-house training — not bolted on top of an external ASR
Domain custom — medical, call center, media-specific models
Pricing — around $0.0043/min (batch) to $0.0145/min (streaming)

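from deepgram import DeepgramClient, PrerecordedOptions

# Read the audio into memory; "call.wav" is a placeholder path.
with open("call.wav", "rb") as f:
    audio_buffer = f.read()

deepgram = DeepgramClient(api_key="...")
options = PrerecordedOptions(model="nova-3", smart_format=True, diarize=True)
response = deepgram.listen.prerecorded.v("1").transcribe_file(
    {"buffer": audio_buffer}, options
)
print(response.to_json(indent=4))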

For call-center bots and live captions where 100ms decides the UX, Deepgram is basically the default.

3.2 AssemblyAI Universal-2 — The Full-Stack Player

AssemblyAI competes on "transcript plus post-processing." Universal-2:

Word accuracy — English WER under 5% (on par with or slightly better than Whisper v3 large)
Auto-chapters, summarization, PII redaction, sentiment — all in one API
Language detection — auto-detects across 99 languages
Pricing — about $0.0065/min for the Best model, plus per-feature add-ons

Especially strong when you need not just transcript but chapters, summaries, sentiment — media and podcast use cases.

3.3 Speechmatics — Accent Champion

UK-based, strong on a wide range of English accents (Indian, Australian, Caribbean, Scottish). Wins on global call centers where accent diversity is high.

3.4 NVIDIA Riva — Self-Host Champion

NVIDIA Riva is a self-hosted speech SDK. Used by government, finance, and healthcare where data cannot leave the cluster. Common pattern: serve Parakeet on Riva.

3.5 AWS Transcribe / Azure Speech / Google STT

The three hyperscalers all have STT. Accuracy is slightly behind Deepgram/AssemblyAI, but the advantage is integration with the rest of the same cloud.

3.6 Comparison

Service	English WER	Korean WER	Latency	Per-min (USD)	Strength
Whisper v3 turbo (self)	~5%	~8%	~1–3s	GPU only	Free, multilingual
Deepgram Nova-3	~4%	~9%	<100ms	0.004–0.015	Low latency
AssemblyAI Universal-2	~4%	~10%	~300ms	0.0065+	Post-processing
Parakeet 1.1 (self)	~5%	N/A	~200ms	GPU only	Open source SOTA
Speechmatics	~5%	~9%	~200ms	0.007+	Accents
AWS Transcribe	~7%	~12%	~500ms	0.024	AWS integration

Numbers are approximate from public benchmarks. Real numbers vary heavily by domain and audio quality.
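Since the table leans on WER, it helps to see how the metric is computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal version that skips the text normalization (casing, punctuation, number formatting) that real benchmarks apply first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% on very noisy output, which is one reason small differences between vendors in a table like this should be read loosely.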

Chapter 4 · NVIDIA Parakeet 1.1 — The Open-Source SOTA

4.1 What Parakeet Is

A family of open-source STT models trained by NVIDIA with the NeMo framework. When Parakeet 1.1 dropped in late 2024, the verdict was "open source caught commercial."

Sizes: 110M to 1.1B parameter variants
Architecture: FastConformer encoder + CTC/Transducer hybrid
Speed: 2x+ faster than Whisper turbo on the same GPU
Accuracy: top of HuggingFace OpenASR English leaderboard

4.2 Why It's Fast

Whisper uses a Transformer encoder + decoder. It autoregressively generates tokens over a 30-second audio chunk. Parakeet uses a FastConformer encoder + CTC (or RNN-T) decoder. CTC is not autoregressive — it's a sequence alignment — and it's much faster.

The tradeoff: multilingual coverage is weaker than Whisper. Parakeet 1.1 English is English-specialized. A separate multilingual variant (Canary) exists.
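The speed argument for CTC is easier to see in code. With greedy CTC decoding, each audio frame gets its most likely label independently, and the transcript falls out of two mechanical steps: collapse repeated labels, then drop the blank symbol. The sketch below assumes the per-frame labels have already been picked; the `_` blank marker is an arbitrary choice here, and this is not NeMo's implementation.

```python
BLANK = "_"  # placeholder for the CTC blank label (assumption for this sketch)


def ctc_greedy_decode(frame_labels) -> str:
    """Collapse repeated frame labels, then remove blanks."""
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)


print(ctc_greedy_decode(list("hh_e_ll_llo_")))
```

There is no loop that feeds previous output tokens back into the model, so all frames can be scored in one forward pass. The blank between the two `l` runs is what lets a genuine double letter survive the collapse step.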

4.3 Self-Hosting with NeMo

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "nvidia/parakeet-tdt-1.1b"
)
transcripts = asr_model.transcribe(["audio.wav"])
print(transcripts[0])

A single GPU can handle hundreds of audio-hours per minute. License is CC-BY-4.0, commercial-friendly.

4.4 Multilingual Variant — Canary

NVIDIA released a separate multilingual ASR called Canary. Supports English/Spanish/German/French and a few more. Korean and Japanese coverage is limited — Whisper still wins those.

Chapter 5 · ElevenLabs — The TTS Standard

5.1 Why ElevenLabs Won

Since 2023 ElevenLabs has been the de facto TTS standard. Why:

Naturalness — the first model where you stop thinking "AI voice" and start hearing "that person"
Multilingual — same voice in 30 languages with accent preserved
Cloning — voice clone from 1 minute of audio, "Professional Voice Clone" with 30+ minutes
Both API and UX are good — developers integrate in 5 minutes, non-developers use the web app directly

5.2 Model Lineup

Multilingual v2 (2023) — classic, high quality, stable. ~400ms latency.
Flash v2.5 (2024) — low-latency variant, under 75ms. Slightly lower quality than v2.
V3 alpha (2025) — emotion tags, dialogue, audio tags ([whispers], [laughs]).
Conversational v2 (2025) — TTS + STT + LLM bundled as a voice agent.

5.3 V3 Emotion Tags

V3 lets you sprinkle inline tags into the text to mark emotion.

[excited] Welcome back!
[whispers] I have a secret.
[laughs] That's hilarious.
[sighs] Okay, let's start over.

This is a bigger change than it looks. Previously you had to write SSML to tune prosody. V3 just takes natural-language tags.
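Because the tags are plain text, ordinary string tooling is enough to inspect them on your side, for example to log which emotions a script uses or to strip tags before showing captions. The helper below is a hypothetical sketch for lines in the `[tag] text` shape shown above; the set of tags the service actually accepts is defined by ElevenLabs, not by this regex.

```python
import re

# Matches a leading "[tag]" made of lowercase letters and spaces.
TAG_LINE = re.compile(r"^\[(?P<tag>[a-z ]+)\]\s*(?P<text>.*)$")


def parse_tagged_line(line: str):
    """Split '[tag] text' into (tag, text); tag is None when no tag is present."""
    stripped = line.strip()
    match = TAG_LINE.match(stripped)
    if match is None:
        return None, stripped
    return match.group("tag"), match.group("text")


script = [
    "[excited] Welcome back!",
    "[whispers] I have a secret.",
    "No tag on this line.",
]
for line in script:
    print(parse_tagged_line(line))
```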

5.4 Pricing

Starter: $5/month for 30K chars
Creator: $22/month for 100K chars + voice cloning
Pro/Scale/Business: usage-based
API rate: roughly $0.18/1K chars (Flash), $0.30/1K chars (V2)

More expensive than most alternatives but the quality difference shows up in your workflow, so for games, video, and audiobooks it's the default.

5.5 Limitations

Korean naturalness lags English (still better than most global TTS)
Japanese has occasional awkward prosody
2–5x the price of alternatives

Chapter 6 · Cartesia (the Mamba Authors) — Sonic 2 + Ultra-Low Latency

6.1 Who Built It

Cartesia was founded in 2023 by a team that includes Albert Gu, who co-authored the Mamba state-space model paper with Tri Dao. Mamba was promoted as a Transformer alternative whose memory and compute scale linearly with sequence length. That fits audio well.

6.2 Sonic / Sonic 2 — 90ms TTS

Cartesia's first model Sonic shipped sub-90ms TTS and got noticed. Sonic 2 (late 2024) added:

First-byte latency under 75ms — in the same range as ElevenLabs Flash v2.5
Quality — comparable to ElevenLabs Multilingual v2
Multilingual — English/Spanish/French/German/Japanese/Chinese/Korean and more
Voice cloning — instant clone from 3-second samples

from cartesia import Cartesia

client = Cartesia(api_key="...")
audio = client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, glad to meet you.",
    voice={"mode": "id", "id": "your_voice_id"},
    output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
)

6.3 Why It's Fast

Mamba-style state-space models handle token-to-token dependency in O(n) time, unlike Transformer attention. For long sequences like audio, that's a big win.

Cartesia also built streaming as a first-class concern. The first byte goes out as soon as the first input chunk arrives.
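Both points (linear cost and streaming-friendliness) show up in even the simplest state-space recurrence. The sketch below is a scalar linear SSM with fixed parameters, written as a generator. It is a deliberately reduced illustration: Mamba uses input-dependent (selective) parameters and vector-valued states, and nothing here reflects Cartesia's actual model.

```python
def ssm_scan(xs, a: float = 0.9, b: float = 0.1, c: float = 1.0):
    """Scalar linear state-space recurrence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0  # the entire history is summarized in this one state value
    for x in xs:
        h = a * h + b * x
        yield c * h  # output is available as soon as its input arrives


print(list(ssm_scan([1.0, 0.0], a=0.5, b=1.0, c=2.0)))
```

Each step does a constant amount of work and touches only the running state, so total cost grows linearly with sequence length, whereas full self-attention compares every position with every other one.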

6.4 Where It's Used

Realtime voice agents (Vapi/Retell offer Cartesia as a default TTS option)
Game NPC dynamic dialogue
Live interpretation

If ElevenLabs is "best quality," Cartesia is "best balance of latency and quality."

Chapter 7 · Sesame (Iribe, March 2025) — "Voice Presence"

7.1 Brendan Iribe and Sesame

Brendan Iribe co-founded Oculus VR. He stayed on after the Facebook acquisition until 2018, then later founded Sesame in 2024 and demoed publicly in March 2025.

The Sesame concept is "voice presence" — not just natural-sounding, but a voice that makes you feel someone is there. Breath, hesitation, "uhh", backchannels ("yeah", "mhm"), interruptions handled naturally.

7.2 Why the Demo Went Viral

The 30-second demo in March 2025 spread fast on social. Why:

A ~0.3s "thinking" breath before answers begin
Backchannels like "oh really?" inserted while the user speaks
Sentence endings fade naturally — the trademark AI "hard stop" disappears
Emotion in the voice tracks the meaning of the text

If ElevenLabs/Cartesia made "natural voice," Sesame made "someone is here."

7.3 What's Different Under the Hood

Sesame partially published a paper. The key ideas:

Single backbone modeling text, audio, and prosody jointly — not a separate TTS, more a voice LLM
Interruption handling — when the user cuts in, the model pauses smoothly and replies
Non-verbal sounds — sighs, laughs, throat-clears are in training data

7.4 Caveats

As of May 2026 it's not GA, only limited beta. Pricing/SLA unpublished.
English-only for now, no Korean or Japanese.
Whether Sesame can actually produce "voice presence" at production cost is unproven.

Even so, the direction (treating voice as presence rather than as TTS output) is one that ElevenLabs and Cartesia are likely to follow.

Chapter 8 · ChatTTS / F5-TTS / XTTS-v2 — Open-Source TTS

8.1 ChatTTS — A Chinese Team's Natural English TTS

ChatTTS is an open-source TTS released in 2024 by a Chinese team. Notable for:

English naturalness near ElevenLabs Multilingual v2 (top of open source)
Conversational style — same text reads "like a chat"
Free weights on HuggingFace
Korean and Japanese are weak

8.2 F5-TTS — Trending #1 on HuggingFace

F5-TTS dropped in late 2024 and hit #1 trending on HuggingFace. Also a hot topic in the Korean developer community.

Flow matching based (a diffusion variant) — training is more stable
Voice cloning — zero-shot clone from 15-second samples
Multilingual — English/Chinese focus, others need fine-tune
License — non-commercial (commercial restrictions apply, double-check)

8.3 XTTS-v2 (Coqui) — The Cloning Classic

Coqui was an active open-source TTS company in 2023–2024. The company itself shut down but XTTS-v2 weights remain on HuggingFace.

17 languages
Voice clone from 6-second samples
Naturalness below ElevenLabs but free
Korean and Japanese supported

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Hello there.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)

8.4 Tortoise TTS — Slow but Quality

Tortoise is a relatively old open-source TTS from 2022. Inference is very slow (minutes) but quality was good enough that it was the de facto open-source TTS for a while. Today ChatTTS and F5-TTS have taken that spot.

8.5 Open-Source TTS Cheat Sheet

Model	English quality	Multilingual	Inference speed	License
ChatTTS	Excellent	Weak	Fast	Non-commercial concerns
F5-TTS	Good	English/Chinese	Medium	Non-commercial
XTTS-v2	Good	17 languages	Medium	CPML (Coqui Public Model License, non-commercial)
Tortoise	Excellent	English	Very slow	Apache 2.0

License check is mandatory for commercial use. F5-TTS has an explicit non-commercial clause and is off-limits for commercial products.

Chapter 9 · Realtime API — OpenAI / Google / ElevenLabs Conversational

9.1 What "Realtime" Means

Traditional voice pipelines are STT then LLM then TTS, in series. Each stage adds latency, distortion, and wait. Realtime APIs collapse that into one model: voice in, voice out.

Upsides:

Shorter latency (200–500ms vs 1–2s)
Natural interruption
Non-verbal info (laughter, sighs, tone) preserved

Downsides:

More expensive (in the range of $0.06/min input + $0.24/min output)
Harder to debug (no intermediate text — logs are audio)
Tool calling integration is more involved
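The latency claim for the split pipeline is just addition: the user hears nothing until the STT has a transcript, the LLM has produced its first tokens, and the TTS has produced its first audio bytes. The sketch below uses the vendor figures quoted earlier in this post for STT (about 100ms) and TTS (about 75ms); the 400ms LLM first-token time and 100ms of network overhead are hypothetical placeholders, not measurements.

```python
def time_to_first_audio_ms(
    stt_ms: float,
    llm_first_token_ms: float,
    tts_first_byte_ms: float,
    network_ms: float = 0.0,
) -> float:
    """Serial pipeline: each stage waits for the previous one to start producing."""
    return stt_ms + llm_first_token_ms + tts_first_byte_ms + network_ms


budget = time_to_first_audio_ms(
    stt_ms=100,              # first-word latency quoted for Deepgram Nova-3
    llm_first_token_ms=400,  # hypothetical LLM time to first token
    tts_first_byte_ms=75,    # first-byte latency quoted for low-latency TTS
    network_ms=100,          # hypothetical round-trip overhead
)
print(f"time to first audio: {budget} ms")
```

Under these assumptions the LLM dominates the budget, which is why shaving tens of milliseconds off STT or TTS matters less than picking a fast model and streaming its output into the TTS.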

9.2 OpenAI Realtime API (gpt-4o-realtime)

Launched late 2024. WebSocket-based.

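import WebSocket from "ws" // Node.js "ws" package; the browser WebSocket has no .on()

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    // Headers used by the beta Realtime API
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
)
ws.on("open", () => {
  // Configure the session once the socket is open
  ws.send(JSON.stringify({
    type: "session.update",
    session: { voice: "alloy", instructions: "You are a helpful Korean assistant." }
  }))
})
ws.on("message", (data) => {
  const event = JSON.parse(data.toString())
  if (event.type === "response.audio.delta") {
    // event.delta is a base64-encoded PCM chunk
  }
})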

GPT-4o's voice model takes audio in and produces audio out directly. Korean and Japanese supported.
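On the receiving side, each `response.audio.delta` event carries a base64 string that has to be turned back into audio samples. The Python sketch below shows that decoding step in isolation. It assumes the payload is 16-bit little-endian mono PCM, which should be checked against the session's configured output format; `decode_pcm16` is a hypothetical helper name.

```python
import base64
import struct


def decode_pcm16(delta_b64: str) -> list[int]:
    """Decode a base64 chunk of 16-bit little-endian PCM into integer samples."""
    raw = base64.b64decode(delta_b64)
    if len(raw) % 2:
        raise ValueError("PCM16 payload must have an even number of bytes")
    count = len(raw) // 2
    return list(struct.unpack(f"<{count}h", raw))


# Round-trip a tiny synthetic chunk to show the shape of the data.
payload = base64.b64encode(struct.pack("<3h", 0, 1000, -1000)).decode("ascii")
print(decode_pcm16(payload))
```

In a real client the decoded samples would be appended to a playback buffer as events arrive, which is also the natural place to drop queued audio when the user interrupts.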

9.3 Google Live API (Gemini 2)

Gemini 2's Live API. Similar WebSocket interface. Strength is integration with the Google ecosystem (Search, Maps, Calendar).

9.4 ElevenLabs Conversational v2

ElevenLabs expanded from pure TTS into a voice-agent platform. STT is in-house plus Deepgram option, LLM is your pick (OpenAI/Anthropic/Google), TTS is ElevenLabs voices. Plug-and-play.

9.5 When to Use Realtime vs Split

Realtime is good for:

Simple chat bots, FAQ responders
Scenarios where human-like naturalness is the KPI (fitting, coaching)
Interruption/backchannel heavy interactions

Split is good for:

Complex workflows (multi-step tool calls, context branching)
Brand-voice TTS retention
Strict logging/audit domains (finance, healthcare)

Chapter 10 · Voice Agents — Vapi / Retell / Bland / Synthflow

10.1 What a Voice-Agent Platform Is

To ship a call-center bot you need STT/LLM/TTS wired together, phone-network integration (SIP/Twilio), and turn-taking/interruption/call-routing logic. Platforms that do all of that exploded in 2024–2025.
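Turn-taking starts with deciding when the caller has finished speaking. A very small version of that logic is sketched below: it watches per-frame audio energy and declares the turn over after a run of quiet frames. This is an illustration only; the threshold, frame size, and silence window are hypothetical values, and production platforms typically rely on trained voice-activity models rather than a raw energy cutoff.

```python
def find_end_of_speech(frame_energies, threshold=0.01, frame_ms=20, silence_ms=400):
    """Return the index of the frame where the turn is judged finished, else None."""
    needed = silence_ms // frame_ms  # consecutive quiet frames required
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# 100 ms of leading silence, 200 ms of speech, then 600 ms of silence.
frames = [0.0] * 5 + [0.5] * 10 + [0.0] * 30
print(find_end_of_speech(frames))
```

The silence window is the knob that trades responsiveness against cutting people off: a shorter window answers faster but interrupts callers who pause mid-sentence.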

10.2 Vapi

SF startup, YC-backed. Notable for:

Pluggable TTS/STT/LLM (ElevenLabs, Cartesia, Deepgram, AssemblyAI, etc.)
PSTN via Twilio/Vonage
Webhook-based external API calls (bookings, CRM updates)
Pricing — about $0.05–$0.15 per min + model cost
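"Plus model cost" is where the real bill comes from, so it is worth adding the pieces up per minute. The sketch below combines the approximate prices quoted in this post (platform fee, Deepgram streaming, ElevenLabs Flash per 1K characters); the 400 characters spoken per minute and the zero LLM cost are hypothetical inputs to replace with your own numbers.

```python
def cost_per_minute(
    platform: float,
    stt: float,
    tts_per_1k_chars: float,
    chars_spoken_per_min: float,
    llm: float = 0.0,
) -> float:
    """Total USD per call minute for a platform plus pluggable models."""
    tts = tts_per_1k_chars * chars_spoken_per_min / 1000
    return platform + stt + tts + llm


estimate = cost_per_minute(
    platform=0.05,             # low end of the platform range above
    stt=0.0145,                # streaming STT rate quoted earlier
    tts_per_1k_chars=0.18,     # Flash-tier TTS rate quoted earlier
    chars_spoken_per_min=400,  # hypothetical amount of bot speech per minute
)
print(f"${estimate:.4f} per minute")
```

With these inputs the TTS line is the largest single item, larger than the platform fee itself, so the choice of voice model can matter more to unit economics than the choice of platform.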

10.3 Retell AI

Vapi's most direct competitor. Slicker UI, easier-to-read live call transcripts. Pricing is similar.

10.4 Bland AI

Sales-call specialist. Strong for high-volume outbound (e.g. real-estate cold calls). Per-call pricing is cheap.

10.5 Synthflow

EU no-code voice-agent builder. GUI-driven flow editor. Non-developers in ops can use it.

10.6 Comparison

Platform	Strength	Weakness	Per-min (USD)
Vapi	Flexible, good API	UI is plain	0.05–0.15 + model
Retell AI	Clean UI, good transcripts	Similar pricing	0.07–0.15 + model
Bland AI	High-volume outbound	Weaker for inbound/complex	~0.09 per call
Synthflow	No-code, EU data	API flexibility weaker	0.13+ + model

10.7 Build vs Buy

Below 10K calls per minute, a platform is almost always cheaper. Above that — or if data can't leave the cluster — build with LiveKit + Deepgram + Cartesia.

Chapter 11 · Korea — Naver CLOVA, Kakao KOTTS, SKT NUGU

11.1 Naver CLOVA Voice / CLOVA Studio

Naver has CLOVA Voice (TTS), CLOVA Speech (STT), and HyperCLOVA X (LLM) — a full voice stack. Korean naturalness is ahead of ElevenLabs. Billed per minute or per character.

11.2 Kakao KOTTS

Kakao Enterprise's Korean TTS. Aimed at B2B (call center, announcements). Integrates with the Kakao Talk bot builder.

11.3 SKT NUGU

SK Telecom's voice assistant platform. NUGU speakers, TMAP voice navigation, NUGU Candy — strong in the consumer market.

11.4 Coway Sonatts and Others

A handful of Korean enterprises have built proprietary Korean TTS. Limited public exposure.

11.5 Korean STT — CLOVA vs Deepgram vs Whisper

Model	Korean WER	Strength	Weakness
Naver CLOVA Speech	~5–7%	Korean domain tuning, Korean proper nouns	Weak global integration
Deepgram (Korean)	~9%	Low latency, global	Weak domain tuning
Whisper v3 turbo	~8%	Free, multilingual	Diarization separate
Parakeet	N/A (English-centric)	-	-

For a Korean company serving Korean users, CLOVA is the top pick. For global plus Korean, Whisper turbo or Deepgram.

11.6 Korean Voice-Agent Cases

Banking/card IVR — KB, Shinhan, KakaoBank partial rollouts
Food-delivery voice ordering — limited pilots
Game NPCs — NCsoft examples

Korean voice agents lag global by 2–3 years but are catching up fast.

Chapter 12 · Japan — VOICEVOX (Open Source), Coeiroink, GPT-SoVITS, Bert-VITS2

12.1 VOICEVOX — Japan's De Facto Open-Source TTS

VOICEVOX has overwhelming mindshare in Japan. Notable for:

Free, commercial-permitted under conditions — per-character terms must be checked
Many character voices — Shikoku Metan and Zundamon are internet memes
Runs locally without a GPU — realtime on CPU
A large share of Japanese narrated videos on YouTube/Niconico use VOICEVOX

12.2 Coeiroink

Similar to VOICEVOX, slightly more permissive on character licensing. Preferred by some creators.

12.3 GPT-SoVITS

Zero-shot voice cloning TTS popular in the Japanese and Chinese communities. Cloning from under one minute. De facto standard for Japanese voice content creators.

12.4 Bert-VITS2

Another open-source favorite. BERT-based text encoder + VITS decoder. Strong in Japanese and Chinese.

12.5 Japanese Commercial TTS

ElevenLabs Multilingual v2 — supports Japanese, above-average naturalness
Azure Neural TTS — rich Japanese voice catalog
Google WaveNet — stable Japanese
AWS Polly — many Japanese voices

Commercial is dominated by the global vendors, but in the Japanese content market (VTubers, video, games) VOICEVOX and GPT-SoVITS dominate.

12.6 Japanese STT

Model	Japanese WER	Notes
Whisper v3 turbo	~8%	Most popular
AssemblyAI	~9%	Post-processing strength
Google STT	~7%	Japanese domain tuning
Azure Speech	~7%	Many Japanese voices
Deepgram	~11%	Weak in Japanese

For Japanese, Deepgram is surprisingly weak; Google and Azure often win.

Chapter 13 · Who Should Choose What — Call Center / Game NPC / Audiobook / Interpretation

13.1 Inbound Call-Center Bot

Goal: fast response + natural Korean/English + interruption handling + tool calls

Recommended:

STT: Deepgram Nova-3 (English) or Naver CLOVA (Korean)
LLM: GPT-4o or Claude 3.5
TTS: Cartesia Sonic 2 (English) or CLOVA Voice (Korean)
Platform: Vapi or Retell AI

Alternative: OpenAI Realtime API alone (enough for a simple bot, more expensive)

13.2 Game NPC Voiceover

Goal: character-voice consistency + emotion + multilingual

Recommended:

TTS: ElevenLabs Professional Voice Clone + V3 emotion tags
Or: Cartesia voice cloning (when low latency is needed for dynamic lines)
Open-source option: GPT-SoVITS (clone the character voice)

13.3 Audiobook / Podcast

Goal: natural long-form pacing, emotion, accurate pronunciation

Recommended:

ElevenLabs Multilingual v2 + Voice Lab
For short Korean: Naver CLOVA
Multi-speaker: ElevenLabs Projects mode

13.4 Live Interpretation

Goal: ultra-low-latency STT + instant translation + natural TTS

Recommended:

STT: Deepgram Nova-3 or AssemblyAI
Translation: GPT-4o or Claude
TTS: Cartesia Sonic 2 (low latency is critical)
Or: OpenAI Realtime API (simplest, smoothest)

13.5 Video Captioning / Content Post-Processing

Goal: accuracy + diarization + chapters/summary

Recommended:

AssemblyAI Universal-2 (most complete)
Or: WhisperX (full open source)

13.6 Cost-Sensitive + Private Data

Goal: data cannot leave the cluster, GPU-only operation

Recommended:

STT: Parakeet 1.1 or Whisper v3 turbo (NeMo or faster-whisper)
TTS: XTTS-v2 or F5-TTS (mind the license)
LLM: Llama 3 70B or Qwen 2.5
Infra: NVIDIA Riva or self-built vLLM/Triton

13.7 One-Line Matrix

Scenario	STT	TTS	Notes
Korean call center	CLOVA Speech	CLOVA Voice	Domain tuning
English call center	Deepgram	Cartesia Sonic 2	Low latency
Game NPC	(n/a)	ElevenLabs V3	Emotion tags
Audiobook	(n/a)	ElevenLabs v2	Long form
Live interpretation	Deepgram	Cartesia	Or OpenAI Realtime
Media captioning	AssemblyAI	(n/a)	Chapters/summary
Private on-prem	Parakeet	XTTS-v2	NVIDIA Riva
Japanese content	Whisper	VOICEVOX	Character voices
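If you want the matrix in a form a service can consult, it maps naturally onto a lookup table. The sketch below transcribes the rows above as a Python dictionary, with `None` standing in for "(n/a)"; the `recommend` function and the lowercase keys are illustrative choices, and the entries are this post's opinions rather than benchmark results.

```python
RECOMMENDED_STACK = {
    "korean call center": {"stt": "CLOVA Speech", "tts": "CLOVA Voice"},
    "english call center": {"stt": "Deepgram", "tts": "Cartesia Sonic 2"},
    "game npc": {"stt": None, "tts": "ElevenLabs V3"},
    "audiobook": {"stt": None, "tts": "ElevenLabs v2"},
    "live interpretation": {"stt": "Deepgram", "tts": "Cartesia"},
    "media captioning": {"stt": "AssemblyAI", "tts": None},
    "private on-prem": {"stt": "Parakeet", "tts": "XTTS-v2"},
    "japanese content": {"stt": "Whisper", "tts": "VOICEVOX"},
}


def recommend(scenario: str) -> dict:
    """Look up the suggested STT/TTS pair for a scenario name (case-insensitive)."""
    key = scenario.strip().lower()
    if key not in RECOMMENDED_STACK:
        raise KeyError(f"no recommendation for scenario: {scenario!r}")
    return RECOMMENDED_STACK[key]


print(recommend("Game NPC"))
```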

Chapter 14 · Wrap-Up — The Big Picture of Voice AI in 2026

Three big currents.

First, STT is basically solved. Whisper turbo, Deepgram Nova-3, and Parakeet 1.1 have made sub-5% English WER routine. What remains is domain adaptation (medical/legal terms), multilingual accuracy (especially low-resource languages), and side info like diarization and emotion metadata.

Second, TTS is moving from "natural voice" to "voice presence." ElevenLabs and Cartesia have nearly solved naturalness; Sesame redefined the bar with "someone is here." In late 2026 to 2027 ElevenLabs and Cartesia will probably catch up to that territory.

Third, unified (Realtime API) is eating the split pipeline. Simple bots can ship with OpenAI Realtime API alone. The split pipeline survives where (a) brand voice matters, (b) complex tool chains are needed, (c) audio data must be separately audited.

Voice AI is no longer a fun demo. In 2026 it ships in call centers, automotive infotainment, games, education, and healthcare. Things to watch in the next 1–2 years: (1) whether Sesame can actually ship at scale, (2) whether open-source TTS narrows the ElevenLabs gap, (3) whether Whisper turbo gets another jump.

References

OpenAI Whisper v3 turbo release — https://github.com/openai/whisper/discussions/2363
OpenAI Whisper paper — https://arxiv.org/abs/2212.04356
Deepgram Nova-3 — https://deepgram.com/learn/introducing-nova-3
AssemblyAI Universal-2 — https://www.assemblyai.com/blog/universal-2
NVIDIA Parakeet — https://huggingface.co/nvidia/parakeet-tdt-1.1b
NVIDIA NeMo — https://github.com/NVIDIA/NeMo
HuggingFace OpenASR Leaderboard — https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
ElevenLabs API docs — https://elevenlabs.io/docs
Cartesia — https://cartesia.ai
Mamba paper (Albert Gu, Tri Dao) — https://arxiv.org/abs/2312.00752
Sesame (Brendan Iribe) — https://www.sesame.com
ChatTTS GitHub — https://github.com/2noise/ChatTTS
F5-TTS — https://github.com/SWivid/F5-TTS
Coqui XTTS-v2 — https://huggingface.co/coqui/XTTS-v2
Tortoise TTS — https://github.com/neonbjb/tortoise-tts
OpenAI Realtime API — https://platform.openai.com/docs/guides/realtime
Google Gemini Live API — https://ai.google.dev/gemini-api/docs/live
ElevenLabs Conversational AI — https://elevenlabs.io/conversational-ai
Vapi — https://vapi.ai
Retell AI — https://retellai.com
Bland AI — https://bland.ai
Synthflow — https://synthflow.ai
Naver CLOVA Voice — https://www.ncloud.com/product/aiService/clovaVoice
Kakao KOTTS — https://www.kakaocorp.com
VOICEVOX — https://voicevox.hiroshiba.jp
Coeiroink — https://coeiroink.com
GPT-SoVITS — https://github.com/RVC-Boss/GPT-SoVITS
Bert-VITS2 — https://github.com/fishaudio/Bert-VITS2
LiveKit Agents — https://docs.livekit.io/agents
faster-whisper — https://github.com/SYSTRAN/faster-whisper
WhisperX — https://github.com/m-bain/whisperX
