Voice AI in 2026 — ElevenLabs / Cartesia / Sesame / Whisper Turbo / Deepgram / Parakeet Deep Dive

Voice AI 2026 series — Since Whisper Turbo arrived in October 2024, voice AI has been moving as fast as text LLMs. This post is the May 2026 map of TTS, STT, and realtime voice agents.

Prologue — Why Voice, Now Again

The 2022–2023 LLM boom was text-first. ChatGPT web chat, GitHub Copilot, RAG bots — all keyboard input. Voice was "yeah, someday maybe."

By 2026 three events changed the picture.

  1. Whisper Large v3 Turbo (Oct 2024): OpenAI open-sourced a turbo variant that is about 8x faster than large-v3. Realtime STT on a single A100 became practical.
  2. Cartesia Sonic 2 (2024): from a startup co-founded by Albert Gu, who co-authored the Mamba state-space model paper with Tri Dao. Its sub-90ms TTS is fast enough that, paired with a GPT-4-class LLM, users may not realize they are talking to a bot.
  3. Sesame (March 2025): the "voice presence" demo from Oculus co-founder Brendan Iribe's company. A 30-second clip on social media convinced everyone "this is different."

Layer in ElevenLabs V3, Deepgram Nova-3, AssemblyAI Universal-2, NVIDIA Parakeet 1.1, OpenAI Realtime API, plus voice-agent platforms like Vapi and Retell — and by May 2026, "AI call center" is no longer a PoC. It ships.

This post maps that landscape across 14 chapters.


Chapter 1 · The 2026 Voice AI Map — TTS / STT / Voice Agents

1.1 The Three Axes

A voice AI system almost always splits into three components.

| Stage | Role | Representative models/services |
| --- | --- | --- |
| STT (Speech-to-Text) | human speech to text | Whisper Turbo, Deepgram Nova-3, AssemblyAI Universal-2, Parakeet |
| LLM | text input to text response | GPT-4o, Claude 3.5, Gemini 2 |
| TTS (Text-to-Speech) | text to speech | ElevenLabs, Cartesia Sonic 2, Sesame, OpenAI TTS, VOICEVOX |

Then there are unified systems that handle all of it in one model: OpenAI Realtime API, Google Live API, ElevenLabs Conversational v2. Unified is more natural but pricing, constraints, and debuggability differ.

1.2 Evaluation Axes

In 2026, voice AI is evaluated on four axes.

  • Latency — from when the user stops speaking to when the AI starts. Under 200ms is the natural threshold.
  • Quality — naturalness, emotion, multilingual accuracy.
  • Cost — per minute or per 1M characters.
  • Control — voice cloning, emotion tags, SSML, speed control.

No single model wins all four. So the answer depends on the workload. Call center vs game NPC vs audiobook all have different priorities.

1.3 Open Source vs Commercial

| Axis | Open source | Commercial |
| --- | --- | --- |
| TTS quality | F5-TTS, XTTS-v2, ChatTTS: improved, but not at the top | ElevenLabs, Cartesia, Sesame: dominant |
| STT accuracy | Whisper, Parakeet: basically caught up | Deepgram, AssemblyAI: slight edge, domain tuning |
| Latency | self-host can reach 100ms | 200–500ms (network) |
| Cost | GPU only | $0.01–$0.30 per minute |

Open source has basically caught commercial STT. The TTS gap is still real. That's the big picture in 2026.


Chapter 2 · Whisper Large v3 Turbo (Oct 2024) — 8x Faster Multilingual STT

2.1 v3 to v3 turbo

When OpenAI open-sourced Whisper in September 2022, it was one of the biggest single events in voice AI: free weights, 99 languages, and commercial-grade STT accuracy.

v3 turbo, released in October 2024, cuts large-v3's decoder from 32 layers to 4 and fine-tunes the pruned model. Result:

  • Speed: about 8x faster than v3
  • Model size: 1.5B to 809M parameters
  • Accuracy: 1–2% loss on major languages (English/Korean/Japanese) versus v3 — practically equivalent
  • Language coverage: the same 99-language set, with larger accuracy drops on a few languages (Thai and Cantonese are the commonly cited ones)

import whisper

model = whisper.load_model("turbo")  # large-v3-turbo
result = model.transcribe("interview.mp3", language="ko")
print(result["text"])

2.2 Why 8x Matters

v3 large took about 3 minutes on an A100 to transcribe one hour of audio. Realtime was out of reach (streaming was a separate concern).

Turbo finishes the same audio in 22 seconds. Consequences:

  • Realtime captions: 200–400ms chunks can keep up
  • Batch cost down 8x: cloud GPU hours fall
  • Edge devices: realtime on an M2 MacBook Air
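
The 8x figure is easiest to reason about as a real-time factor (RTF): processing time divided by audio duration. The sketch below only re-does the arithmetic from the numbers quoted above. The function name is illustrative, and the figures are the post's rough A100 numbers, not measurements.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than realtime."""
    return processing_seconds / audio_seconds


AUDIO_SECONDS = 3600.0                # one hour of audio
V3_LARGE_SECONDS = 3 * 60.0           # "about 3 minutes" quoted above
TURBO_SECONDS = V3_LARGE_SECONDS / 8  # the quoted 8x speedup

rtf_large = real_time_factor(V3_LARGE_SECONDS, AUDIO_SECONDS)
rtf_turbo = real_time_factor(TURBO_SECONDS, AUDIO_SECONDS)

print(f"v3 large: {V3_LARGE_SECONDS:.1f}s per audio hour, RTF {rtf_large:.5f}")
print(f"turbo:    {TURBO_SECONDS:.1f}s per audio hour, RTF {rtf_turbo:.5f}")
```

Batch RTF is a throughput number. Per-chunk latency in a live caption pipeline also includes fixed overheads, so it will not shrink by exactly the same factor.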

2.3 Limitations

  • Diarization: Whisper doesn't know who spoke. WhisperX or similar covers this.
  • True streaming: Whisper is 30-second chunk based. faster-whisper or whisper-streaming wraps it.
  • Domain adaptation: medical/legal/finance jargon needs fine-tune. Deepgram and AssemblyAI offer domain models.
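
The 30-second chunking is what streaming wrappers have to work around. Below is a minimal sketch of cutting a long recording into overlapping 30-second windows. The 5-second overlap and the function name are assumptions for illustration; this is not how faster-whisper or whisper-streaming work internally. Sample indices assume 16 kHz audio, which is what Whisper expects.

```python
def window_bounds(total_samples, sample_rate=16000, window_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping fixed-size windows."""
    window = int(window_s * sample_rate)
    step = int((window_s - overlap_s) * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")

    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + window, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the last window already reaches the end of the audio
        start += step
    return bounds


# A 70-second recording at 16 kHz
for start, end in window_bounds(70 * 16000):
    print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

The overlap gives a downstream merger a region where consecutive transcripts can be aligned, so a word cut at a window edge is not lost.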

2.4 Variants — faster-whisper / WhisperX / Distil-Whisper

| Tool | Core | Use case |
| --- | --- | --- |
| OpenAI Whisper (official) | Reference PyTorch | Research/eval |
| faster-whisper | CTranslate2 backend, up to 4x faster than the reference | Production batch |
| WhisperX | + diarization + word timestamps | Media captioning |
| Distil-Whisper | Smaller distilled variant | Mobile/edge |

In production most teams run faster-whisper or WhisperX. The official OpenAI implementation is for research/eval.


Chapter 3 · Deepgram Nova-3 / AssemblyAI Universal-2 — Commercial STT Battle

3.1 Deepgram Nova-3 — The Latency King

Deepgram's edge is latency. Nova-3 has:

  • First-word latency under 100ms — partial transcripts start almost immediately
  • End-to-end in-house training — not bolted on top of an external ASR
  • Domain custom — medical, call center, media-specific models
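  • Pricing: around $0.0043/min (batch) to $0.0145/min (streaming)

from deepgram import DeepgramClient, PrerecordedOptions

# "call.wav" is a placeholder path; substitute your own recording
with open("call.wav", "rb") as f:
    audio_buffer = f.read()

deepgram = DeepgramClient(api_key="...")
options = PrerecordedOptions(model="nova-3", smart_format=True, diarize=True)
response = deepgram.listen.prerecorded.v("1").transcribe_file(
    {"buffer": audio_buffer}, options
)
# Print the transcript of the first channel's top alternative
print(response.results.channels[0].alternatives[0].transcript)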

For call-center bots and live captions where 100ms decides the UX, Deepgram is basically the default.

3.2 AssemblyAI Universal-2 — The Full-Stack Player

AssemblyAI competes on "transcript plus post-processing." Universal-2:

  • Word accuracy: English WER under 5% (on par with or slightly better than Whisper v3 large)
  • Auto-chapters, summarization, PII redaction, sentiment — all in one API
  • Language detection — auto-detects across 99 languages
  • Pricing — about $0.0065/min for the Best model, plus per-feature add-ons

Especially strong when you need not just transcript but chapters, summaries, sentiment — media and podcast use cases.

3.3 Speechmatics — Accent Champion

UK-based, strong on a wide range of English accents (Indian, Australian, Caribbean, Scottish). Wins on global call centers where accent diversity is high.

3.4 NVIDIA Riva — Self-Host Champion

NVIDIA Riva is a self-hosted speech SDK. Used by government, finance, and healthcare where data cannot leave the cluster. Common pattern: serve Parakeet on Riva.

3.5 AWS Transcribe / Azure Speech / Google STT

The three hyperscalers all have STT. Accuracy is slightly behind Deepgram/AssemblyAI, but the advantage is integration with the rest of the same cloud.

3.6 Comparison

| Service | English WER | Korean WER | Latency | Per-min (USD) | Strength |
| --- | --- | --- | --- | --- | --- |
| Whisper v3 turbo (self) | ~5% | ~8% | ~1–3s | GPU only | Free, multilingual |
| Deepgram Nova-3 | ~4% | ~9% | <100ms | 0.004–0.015 | Low latency |
| AssemblyAI Universal-2 | ~4% | ~10% | ~300ms | 0.0065+ | Post-processing |
| Parakeet 1.1 (self) | ~5% | N/A | ~200ms | GPU only | Open source SOTA |
| Speechmatics | ~5% | ~9% | ~200ms | 0.007+ | Accents |
| AWS Transcribe | ~7% | ~12% | ~500ms | 0.024 | AWS integration |

Numbers are approximate from public benchmarks. Real numbers vary heavily by domain and audio quality.
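
Since the numbers vary so much by domain, it is worth measuring WER on your own audio. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Below is a minimal sketch. Real evaluations usually normalize case, punctuation, and numerals first, which is omitted here, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # a reference word is missing from the hypothesis
            insertion = curr[j - 1] + 1  # the hypothesis has an extra word
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "please transfer fifty dollars to my savings account"
hypothesis = "please transfer fifteen dollars to savings account"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

The example has one substitution and one deletion against eight reference words. For Korean or Japanese, character error rate is often reported instead, because word boundaries are less clear-cut.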


Chapter 4 · NVIDIA Parakeet 1.1 — The Open-Source SOTA

4.1 What Parakeet Is

A family of open-source STT models trained by NVIDIA with the NeMo framework. When Parakeet 1.1 dropped in late 2024, the verdict was "open source caught commercial."

  • Sizes: 110M to 1.1B parameter variants
  • Architecture: FastConformer encoder + CTC/Transducer hybrid
  • Speed: 2x+ faster than Whisper turbo on the same GPU
  • Accuracy: top of HuggingFace OpenASR English leaderboard

4.2 Why It's Fast

Whisper uses a Transformer encoder + decoder. It autoregressively generates tokens over a 30-second audio chunk. Parakeet uses a FastConformer encoder + CTC (or RNN-T) decoder. CTC is not autoregressive — it's a sequence alignment — and it's much faster.

The tradeoff: multilingual coverage is weaker than Whisper. Parakeet 1.1 English is English-specialized. A separate multilingual variant (Canary) exists.
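
To make the "not autoregressive" point concrete: a CTC model emits one label per audio frame (including a special blank), and all frames can be predicted in one parallel pass. Decoding is then a cheap post-processing step that collapses repeats and drops blanks. The sketch below shows greedy CTC decoding over already-chosen frame labels. It uses characters and `_` as the blank for readability, whereas real models use subword tokens and take the argmax per frame.

```python
BLANK = "_"  # stand-in for the CTC blank label


def ctc_greedy_decode(frame_labels):
    """Collapse consecutive repeats, then drop blanks."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# 13 frames; the blank between the two "ll" runs keeps the double letter
frames = list("hh_e_ll_ll_oo")
print(ctc_greedy_decode(frames))
```

No step depends on a previously generated token, which is why the decoder cost is small compared with Whisper's token-by-token generation.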

4.3 Self-Hosting with NeMo

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "nvidia/parakeet-tdt-1.1b"
)
transcripts = asr_model.transcribe(["audio.wav"])
print(transcripts[0])

A single GPU can handle hundreds of audio-hours per minute. License is CC-BY-4.0, commercial-friendly.

4.4 Multilingual Variant — Canary

NVIDIA released a separate multilingual ASR called Canary. Supports English/Spanish/German/French and a few more. Korean and Japanese coverage is limited — Whisper still wins those.


Chapter 5 · ElevenLabs — The TTS Standard

5.1 Why ElevenLabs Won

Since 2023 ElevenLabs has been the de facto TTS standard. Why:

  1. Naturalness — the first model where you stop thinking "AI voice" and start hearing "that person"
  2. Multilingual — same voice in 30 languages with accent preserved
  3. Cloning — voice clone from 1 minute of audio, "Professional Voice Clone" with 30+ minutes
  4. Both API and UX are good — developers integrate in 5 minutes, non-developers use the web app directly

5.2 Model Lineup

  • Multilingual v2 (2023) — classic, high quality, stable. ~400ms latency.
  • Flash v2.5 (2024) — low-latency variant, under 75ms. Slightly lower quality than v2.
  • V3 alpha (2025) — emotion tags, dialogue, audio tags ([whispers], [laughs]).
  • Conversational v2 (2025) — TTS + STT + LLM bundled as a voice agent.

5.3 V3 Emotion Tags

V3 lets you sprinkle inline tags into the text to mark emotion.

[excited] Welcome back!
[whispers] I have a secret.
[laughs] That's hilarious.
[sighs] Okay, let's start over.

This is a bigger change than it looks. Previously you had to write SSML to tune prosody. V3 just takes natural-language tags.
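
Because the tags are plain text, a script can be checked before it is sent for synthesis. The helper below is a hypothetical pre-processing step, not part of any ElevenLabs SDK. It splits a script into (tag, text) segments and rejects tags outside an allowlist, which here contains only the four tags shown above.

```python
import re

TAG_PATTERN = re.compile(r"\[(\w+)\]\s*([^\[]*)")
KNOWN_TAGS = {"excited", "whispers", "laughs", "sighs"}  # only the tags shown in this post


def split_tagged_text(script):
    """Split '[tag] text [tag] text' into (tag, text) pairs, validating each tag.

    Text that appears before the first tag is ignored by this sketch.
    """
    segments = []
    for tag, text in TAG_PATTERN.findall(script):
        if tag not in KNOWN_TAGS:
            raise ValueError(f"unknown tag: [{tag}]")
        segments.append((tag, text.strip()))
    return segments


script = "[excited] Welcome back! [whispers] I have a secret."
for tag, text in split_tagged_text(script):
    print(f"{tag:>9}: {text}")
```

A check like this catches typos such as `[whisper]` before they are read aloud as literal text or silently ignored.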

5.4 Pricing

  • Starter: $5/month for 30K chars
  • Creator: $22/month for 100K chars + voice cloning
  • Pro/Scale/Business: usage-based
  • API rate: roughly $0.18/1K chars (Flash), $0.30/1K chars (V2)

More expensive than most alternatives but the quality difference shows up in your workflow, so for games, video, and audiobooks it's the default.
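
To see what those rates mean for a real project, here is a small cost sketch using the per-1K-character API rates listed above. The 500,000-character book length is an assumption for illustration, and the rates are this post's rough figures, not a quote.

```python
RATE_PER_1K_CHARS_USD = {"flash": 0.18, "v2": 0.30}  # approximate rates from the list above


def tts_cost_usd(num_chars: int, model: str) -> float:
    """Character count times the per-1K-character rate for the chosen model."""
    return num_chars / 1000 * RATE_PER_1K_CHARS_USD[model]


BOOK_CHARS = 500_000  # assumed length of one audiobook manuscript

for model in RATE_PER_1K_CHARS_USD:
    print(f"{model}: ${tts_cost_usd(BOOK_CHARS, model):,.2f}")
```

Retakes and re-generated passages are billed too, so a production budget should be larger than the manuscript length alone suggests.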

5.5 Limitations

  • Korean naturalness lags English (still better than most global TTS)
  • Japanese has occasional awkward prosody
  • 2–5x the price of alternatives

Chapter 6 · Cartesia (the Mamba Authors) — Sonic 2 + Ultra-Low Latency

6.1 Who Built It

Cartesia was founded in 2023 by Albert Gu, Karan Goel, and other researchers from the Stanford lab behind the S4 line of state-space models. Gu co-authored the Mamba paper with Tri Dao. Mamba was promoted as a Transformer alternative whose memory and compute scale linearly with sequence length, which fits long audio sequences well.

6.2 Sonic / Sonic 2 — 90ms TTS

Cartesia's first model Sonic shipped sub-90ms TTS and got noticed. Sonic 2 (late 2024) added:

  • First-byte latency: under 90ms, in the same class as ElevenLabs Flash v2.5 (under 75ms)
  • Quality — comparable to ElevenLabs Multilingual v2
  • Multilingual — English/Spanish/French/German/Japanese/Chinese/Korean and more
  • Voice cloning: instant clone from 3-second samples

from cartesia import Cartesia

client = Cartesia(api_key="...")
audio = client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, glad to meet you.",
    voice={"mode": "id", "id": "your_voice_id"},
    output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
)

6.3 Why It's Fast

Mamba-style state-space models process a sequence in O(n) time in its length n, whereas Transformer self-attention scales as O(n²). For long sequences like audio, that's a big win.

Cartesia also built streaming as a first-class concern. The first byte goes out as soon as the first input chunk arrives.
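
Both properties (linear cost and streaming) fall out of the recurrent form of a state-space model. The sketch below is a toy scalar linear recurrence, h_t = a·h_(t-1) + b·x_t with output y_t = c·h_t. It is not Mamba, which uses input-dependent parameters and much larger states, and it says nothing about Cartesia's actual implementation. The coefficients are arbitrary.

```python
def ssm_stream(inputs, a=0.5, b=1.0, c=1.0):
    """Toy scalar state-space recurrence.

    Each step does a constant amount of work and only needs the previous
    state, so n inputs cost O(n) and every output is available as soon as
    its input arrives.
    """
    h = 0.0
    for x in inputs:
        h = a * h + b * x  # update the hidden state from the previous state
        yield c * h        # emit the output for this step immediately


outputs = list(ssm_stream([1.0, 0.0, 0.0, 2.0]))
print(outputs)
```

Attention, by contrast, compares each new position with every earlier one, so per-step work grows with the sequence length.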

6.4 Where It's Used

  • Realtime voice agents (Vapi/Retell offer Cartesia as a default TTS option)
  • Game NPC dynamic dialogue
  • Live interpretation

If ElevenLabs is "best quality," Cartesia is "best balance of latency and quality."


Chapter 7 · Sesame (Iribe, March 2025) — "Voice Presence"

7.1 Brendan Iribe and Sesame

Brendan Iribe co-founded Oculus VR. After the Facebook acquisition he stayed on until 2018, then founded Sesame in 2024 and demoed publicly in March 2025.

The Sesame concept is "voice presence" — not just natural-sounding, but a voice that makes you feel someone is there. Breath, hesitation, "uhh", backchannels ("yeah", "mhm"), interruptions handled naturally.

7.2 Why the Demo Went Viral

The 30-second demo in March 2025 spread fast on social. Why:

  • A ~0.3s "thinking" breath before answers begin
  • Backchannels like "oh really?" inserted while the user speaks
  • Sentence endings fade naturally — the trademark AI "hard stop" disappears
  • Emotion in the voice tracks the meaning of the text

If ElevenLabs/Cartesia made "natural voice," Sesame made "someone is here."

7.3 What's Different Under the Hood

Sesame has published part of its approach. The key ideas:

  • Single backbone modeling text, audio, and prosody jointly — not a separate TTS, more a voice LLM
  • Interruption handling — when the user cuts in, the model pauses smoothly and replies
  • Non-verbal sounds — sighs, laughs, throat-clears are in training data

7.4 Caveats

  • As of May 2026 it's not GA, only limited beta. Pricing/SLA unpublished.
  • English-only for now, no Korean or Japanese.
  • Whether Sesame can actually produce "voice presence" at production cost is unproven.

Even so, the direction (treating voice as presence rather than as TTS output) is where ElevenLabs and Cartesia will soon follow.


Chapter 8 · ChatTTS / F5-TTS / XTTS-v2 — Open-Source TTS

8.1 ChatTTS — A Chinese Team's Natural English TTS

ChatTTS is an open-source TTS released in 2024 by a Chinese team. Notable for:

  • English naturalness near ElevenLabs Multilingual v2 (top of open source)
  • Conversational style — same text reads "like a chat"
  • Free weights on HuggingFace
  • Korean and Japanese are weak

8.2 F5-TTS

F5-TTS dropped in late 2024 and hit #1 trending on HuggingFace. Also a hot topic in the Korean developer community.

  • Flow matching based (a diffusion variant) — training is more stable
  • Voice cloning — zero-shot clone from 15-second samples
  • Multilingual — English/Chinese focus, others need fine-tune
  • License: the released pretrained weights are non-commercial (CC-BY-NC), so double-check before any commercial use

8.3 XTTS-v2 (Coqui) — The Cloning Classic

Coqui was an active open-source TTS company in 2023–2024. The company itself shut down but XTTS-v2 weights remain on HuggingFace.

  • 17 languages
  • Voice clone from 6-second samples
  • Naturalness below ElevenLabs but free
  • Korean and Japanese supported

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Hello there.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)

8.4 Tortoise TTS — Slow but Quality

Tortoise is a relatively old open-source TTS from 2022. Inference is very slow (minutes) but quality was good enough that it was the de facto open-source TTS for a while. Today ChatTTS and F5-TTS have taken that spot.

8.5 Open-Source TTS Cheat Sheet

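| Model | English quality | Multilingual | Inference speed | License |
| --- | --- | --- | --- | --- |
| ChatTTS | Excellent | Weak | Fast | Non-commercial concerns |
| F5-TTS | Good | English/Chinese | Medium | Non-commercial (weights) |
| XTTS-v2 | Good | 17 languages | Medium | CPML (Coqui Public Model License; non-commercial unless separately licensed) |
| Tortoise | Excellent | English | Very slow | Apache 2.0 |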

License check is mandatory for commercial use. F5-TTS's released weights carry an explicit non-commercial clause, so they are off-limits for commercial products.


Chapter 9 · Realtime API — OpenAI / Google / ElevenLabs Conversational

9.1 What "Realtime" Means

Traditional voice pipelines are STT then LLM then TTS, in series. Each stage adds latency, distortion, and wait. Realtime APIs collapse that into one model: voice in, voice out.

Upsides:

  • Shorter latency (200–500ms vs 1–2s)
  • Natural interruption
  • Non-verbal info (laughter, sighs, tone) preserved

Downsides:

  • More expensive ($0.06/min input + $0.24/min output range)
  • Harder to debug (no intermediate text — logs are audio)
  • Tool calling integration is more involved
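
The 1–2s figure for the split pipeline is just the stages added in series. Below is a rough budget sketch. The STT and TTS entries reuse this post's sub-100ms and sub-90ms figures, while the endpointing wait, LLM time-to-first-token, and network overhead are assumptions chosen only to illustrate the arithmetic.

```python
SPLIT_PIPELINE_MS = {
    "endpointing": 500,      # assumption: silence wait before deciding the user has finished
    "stt": 100,              # post's sub-100ms first-word figure, used as a rough stand-in
    "llm_first_token": 400,  # assumption: time to first token from the LLM
    "tts_first_byte": 90,    # post's sub-90ms TTS figure
    "network": 150,          # assumption: combined hops between the services
}


def time_to_first_audio_ms(stages):
    """Stages run in series, so their latencies simply add up."""
    return sum(stages.values())


def slowest_stage(stages):
    """Name of the stage that contributes the most latency."""
    return max(stages, key=stages.get)


total = time_to_first_audio_ms(SPLIT_PIPELINE_MS)
print(f"time to first audio: {total} ms, slowest stage: {slowest_stage(SPLIT_PIPELINE_MS)}")
```

With these assumed numbers the model calls are not the largest line item, which is why a faster STT or TTS alone rarely gets a split pipeline under the 200–500ms range.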

9.2 OpenAI Realtime API (gpt-4o-realtime)

Launched late 2024. WebSocket-based.

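import WebSocket from "ws"  // Node.js "ws" package; the ws.on(...) handlers below rely on it

const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview", {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,  // API key read from the environment
    "OpenAI-Beta": "realtime=v1",
  },
})
ws.on("open", () => {
  // Configure the session once the socket is open
  ws.send(JSON.stringify({
    type: "session.update",
    session: { voice: "alloy", instructions: "You are a helpful Korean assistant." }
  }))
})
ws.on("message", (data) => {
  const event = JSON.parse(data.toString())
  if (event.type === "response.audio.delta") {
    // event.delta holds a base64-encoded PCM chunk
  }
})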

GPT-4o's voice model takes audio in and produces audio out directly. Korean and Japanese supported.
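
On the receiving side, each audio delta has to be turned back into samples before playback or logging. Below is a minimal Python sketch of that step. It assumes the chunk is base64-encoded 16-bit little-endian mono PCM at 24 kHz; check the session's configured output format before relying on that. The function names are illustrative.

```python
import base64
import struct


def decode_pcm16_delta(delta_b64):
    """Decode a base64 chunk of 16-bit little-endian PCM into integer samples."""
    raw = base64.b64decode(delta_b64)
    count = len(raw) // 2  # two bytes per sample; a trailing odd byte is ignored
    return list(struct.unpack(f"<{count}h", raw[: count * 2]))


def chunk_duration_ms(samples, sample_rate=24000):
    """Playback duration of a mono chunk at the assumed sample rate."""
    return len(samples) * 1000 / sample_rate


# Round-trip check with four hand-made samples standing in for a real delta
payload = base64.b64encode(struct.pack("<4h", 0, 1000, -1000, 32767)).decode("ascii")
samples = decode_pcm16_delta(payload)
print(samples, f"{chunk_duration_ms(samples):.3f} ms")
```

Keeping the decoded chunks, or at least their durations and timestamps, partly offsets the debugging downside noted earlier, since there is no intermediate text to log.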

9.3 Google Live API (Gemini 2)

Gemini 2's Live API. Similar WebSocket interface. Strength is integration with the Google ecosystem (Search, Maps, Calendar).

9.4 ElevenLabs Conversational v2

ElevenLabs expanded from pure TTS into a voice-agent platform. STT is in-house plus Deepgram option, LLM is your pick (OpenAI/Anthropic/Google), TTS is ElevenLabs voices. Plug-and-play.

9.5 When to Use Realtime vs Split

Realtime is good for:

  • Simple chat bots, FAQ responders
  • Scenarios where human-like naturalness is the KPI (fitting, coaching)
  • Interruption/backchannel heavy interactions

Split is good for:

  • Complex workflows (multi-step tool calls, context branching)
  • Brand-voice TTS retention
  • Strict logging/audit domains (finance, healthcare)

Chapter 10 · Voice Agents — Vapi / Retell / Bland / Synthflow

10.1 What a Voice-Agent Platform Is

To ship a call-center bot you need STT/LLM/TTS wired together, phone-network integration (SIP/Twilio), and turn-taking/interruption/call-routing logic. Platforms that do all of that exploded in 2024–2025.
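
The turn-taking part is less obvious than the wiring. At its simplest, a platform has to decide when the caller has stopped talking. The sketch below is a bare-bones silence-based endpointer over per-frame voice-activity flags. The 20 ms frame size, the 300 ms threshold, and the function name are assumptions, and real platforms layer more signals on top of silence alone.

```python
def end_of_turn_frame(voiced_frames, frame_ms=20, silence_ms=300):
    """Return the index of the frame where the turn is judged finished, or None.

    voiced_frames holds one boolean per audio frame, for example from a
    voice-activity detector. The turn ends once speech has been heard and
    is followed by silence_ms of continuous silence.
    """
    frames_needed = silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for index, voiced in enumerate(voiced_frames):
        if voiced:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None


# 100 ms of leading silence, 200 ms of speech, then 400 ms of silence
frames = [False] * 5 + [True] * 10 + [False] * 20
print(end_of_turn_frame(frames))
```

The threshold is a direct trade-off: a shorter wait feels snappier but cuts off callers who pause mid-sentence, and the wait itself counts toward response latency.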

10.2 Vapi

SF startup, YC-backed. Notable for:

  • Pluggable TTS/STT/LLM (ElevenLabs, Cartesia, Deepgram, AssemblyAI, etc.)
  • PSTN via Twilio/Vonage
  • Webhook-based external API calls (bookings, CRM updates)
  • Pricing: about $0.05–$0.15 per min + model cost

10.3 Retell AI

Vapi's most direct competitor. Slicker UI, easier-to-read live call transcripts. Pricing is similar.

10.4 Bland AI

Sales-call specialist. Strong for high-volume outbound (e.g. real-estate cold calls). Per-call pricing is cheap.

10.5 Synthflow

EU no-code voice-agent builder. GUI-driven flow editor. Non-developers in ops can use it.

10.6 Comparison

| Platform | Strength | Weakness | Per-min (USD) |
| --- | --- | --- | --- |
| Vapi | Flexible, good API | UI is plain | 0.05–0.15 + model |
| Retell AI | Clean UI, good transcripts | Similar pricing | 0.07–0.15 + model |
| Bland AI | High-volume outbound | Weaker for inbound/complex | ~0.09 per call |
| Synthflow | No-code, EU data | API flexibility weaker | 0.13+ + model |

10.7 Build vs Buy

Below 10K calls per minute, a platform is almost always cheaper. Above that — or if data can't leave the cluster — build with LiveKit + Deepgram + Cartesia.
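
Where the crossover sits for a given team comes down to simple arithmetic: the fixed monthly cost of a self-built stack divided by the per-minute saving over the platform fee. Every number in the sketch below is a placeholder assumption. Only the platform fee is taken from the range in the comparison table.

```python
def break_even_minutes(fixed_monthly_usd, platform_fee_per_min, self_host_cost_per_min):
    """Monthly call minutes above which self-hosting is cheaper, or None if it never is."""
    saving_per_min = platform_fee_per_min - self_host_cost_per_min
    if saving_per_min <= 0:
        return None  # self-hosting costs at least as much per minute as the platform
    return fixed_monthly_usd / saving_per_min


minutes = break_even_minutes(
    fixed_monthly_usd=6000.0,     # assumption: GPUs plus on-call engineering per month
    platform_fee_per_min=0.10,    # within the 0.05–0.15 range quoted for platforms
    self_host_cost_per_min=0.02,  # assumption: marginal compute and telephony per minute
)
print(f"break-even at about {minutes:,.0f} call minutes per month")
```

Engineering time usually dominates the fixed term, so teams often underestimate it. The calculation also ignores the compliance case, where building is required regardless of cost.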


Chapter 11 · Korea — Naver CLOVA, Kakao KOTTS, SKT NUGU

11.1 Naver CLOVA Voice / CLOVA Studio

Naver has CLOVA Voice (TTS), CLOVA Speech (STT), and HyperCLOVA X (LLM) — a full voice stack. Korean naturalness is ahead of ElevenLabs. Billed per minute or per character.

11.2 Kakao KOTTS

Kakao Enterprise's Korean TTS. Aimed at B2B (call center, announcements). Integrates with the Kakao Talk bot builder.

11.3 SKT NUGU

SK Telecom's voice assistant platform. NUGU speakers, TMAP voice navigation, NUGU Candy — strong in the consumer market.

11.4 Coway Sonatts and Others

A handful of Korean enterprises have built proprietary Korean TTS. Limited public exposure.

11.5 Korean STT — CLOVA vs Deepgram vs Whisper

| Model | Korean WER | Strength | Weakness |
| --- | --- | --- | --- |
| Naver CLOVA Speech | ~5–7% | Korean domain tuning, Korean proper nouns | Weak global integration |
| Deepgram (Korean) | ~9% | Low latency, global | Weak domain tuning |
| Whisper v3 turbo | ~8% | Free, multilingual | Diarization separate |
| Parakeet | N/A (English-centric) | - | - |

For a Korean company serving Korean users, CLOVA is the top pick. For global plus Korean, Whisper turbo or Deepgram.

11.6 Korean Voice-Agent Cases

  • Banking/card IVR — KB, Shinhan, KakaoBank partial rollouts
  • Food-delivery voice ordering — limited pilots
  • Game NPCs — NCsoft examples

Korean voice agents lag global by 2–3 years but are catching up fast.


Chapter 12 · Japan — VOICEVOX (Open Source), Coeiroink, GPT-SoVITS, Bert-VITS2

12.1 VOICEVOX — Japan's De Facto Open-Source TTS

VOICEVOX has overwhelming mindshare in Japan. Notable for:

  • Free, commercial-permitted under conditions — per-character terms must be checked
  • Many character voices: Shikoku Metan and Zundamon are internet memes
  • Runs locally without a GPU — realtime on CPU
  • A large share of narrated Japanese videos on YouTube and Niconico use VOICEVOX

12.2 Coeiroink

Similar to VOICEVOX, slightly more permissive on character licensing. Preferred by some creators.

12.3 GPT-SoVITS

Zero-shot voice cloning TTS popular in the Japanese and Chinese communities. Cloning from under one minute. De facto standard for Japanese voice content creators.

12.4 Bert-VITS2

Another open-source favorite. BERT-based text encoder + VITS decoder. Strong in Japanese and Chinese.

12.5 Japanese Commercial TTS

  • ElevenLabs Multilingual v2 — supports Japanese, above-average naturalness
  • Azure Neural TTS — rich Japanese voice catalog
  • Google WaveNet — stable Japanese
  • AWS Polly — many Japanese voices

Commercial TTS is dominated by the global vendors above, but in the Japanese content market (VTubers, video, games) VOICEVOX and GPT-SoVITS dominate.

12.6 Japanese STT

| Model | Japanese WER | Notes |
| --- | --- | --- |
| Whisper v3 turbo | ~8% | Most popular |
| AssemblyAI | ~9% | Post-processing strength |
| Google STT | ~7% | Japanese domain tuning |
| Azure Speech | ~7% | Many Japanese voices |
| Deepgram | ~11% | Weak in Japanese |

For Japanese, Deepgram is surprisingly weak; Google and Azure often win.


Chapter 13 · Who Should Choose What — Call Center / Game NPC / Audiobook / Interpretation

13.1 Inbound Call-Center Bot

Goal: fast response + natural Korean/English + interruption handling + tool calls

Recommended:

  • STT: Deepgram Nova-3 (English) or Naver CLOVA (Korean)
  • LLM: GPT-4o or Claude 3.5
  • TTS: Cartesia Sonic 2 (English) or CLOVA Voice (Korean)
  • Platform: Vapi or Retell AI

Alternative: OpenAI Realtime API alone (enough for a simple bot, more expensive)

13.2 Game NPC Voiceover

Goal: character-voice consistency + emotion + multilingual

Recommended:

  • TTS: ElevenLabs Professional Voice Clone + V3 emotion tags
  • Or: Cartesia voice cloning (when low latency is needed for dynamic lines)
  • Open-source option: GPT-SoVITS (clone the character voice)

13.3 Audiobook / Podcast

Goal: natural long-form pacing, emotion, accurate pronunciation

Recommended:

  • ElevenLabs Multilingual v2 + Voice Lab
  • For short Korean: Naver CLOVA
  • Multi-speaker: ElevenLabs Projects mode

13.4 Live Interpretation

Goal: ultra-low-latency STT + instant translation + natural TTS

Recommended:

  • STT: Deepgram Nova-3 or AssemblyAI
  • Translation: GPT-4o or Claude
  • TTS: Cartesia Sonic 2 (low latency is critical)
  • Or: OpenAI Realtime API (simplest, smoothest)

13.5 Video Captioning / Content Post-Processing

Goal: accuracy + diarization + chapters/summary

Recommended:

  • AssemblyAI Universal-2 (most complete)
  • Or: WhisperX (full open source)

13.6 Cost-Sensitive + Private Data

Goal: data cannot leave the cluster, GPU-only operation

Recommended:

  • STT: Parakeet 1.1 or Whisper v3 turbo (NeMo or faster-whisper)
  • TTS: XTTS-v2 or F5-TTS (mind the license)
  • LLM: Llama 3 70B or Qwen 2.5
  • Infra: NVIDIA Riva or self-built vLLM/Triton

13.7 One-Line Matrix

| Scenario | STT | TTS | Notes |
| --- | --- | --- | --- |
| Korean call center | CLOVA Speech | CLOVA Voice | Domain tuning |
| English call center | Deepgram | Cartesia Sonic 2 | Low latency |
| Game NPC | (n/a) | ElevenLabs V3 | Emotion tags |
| Audiobook | (n/a) | ElevenLabs v2 | Long form |
| Live interpretation | Deepgram | Cartesia | Or OpenAI Realtime |
| Media captioning | AssemblyAI | (n/a) | Chapters/summary |
| Private on-prem | Parakeet | XTTS-v2 | NVIDIA Riva |
| Japanese content | Whisper | VOICEVOX | Character voices |

Chapter 14 · Wrap-Up — The Big Picture of Voice AI in 2026

Three big currents.

First, STT is basically solved. Whisper turbo, Deepgram Nova-3, and Parakeet 1.1 have made English WER of around 5% or lower routine. What remains is domain adaptation (medical/legal terms), multilingual accuracy (especially low-resource languages), and side info like diarization and emotion metadata.

Second, TTS is moving from "natural voice" to "voice presence." ElevenLabs and Cartesia have nearly solved naturalness; Sesame redefined the bar with "someone is here." From late 2026 into 2027, ElevenLabs and Cartesia will probably move into that territory.

Third, unified (Realtime API) is eating the split pipeline. Simple bots can ship with OpenAI Realtime API alone. The split pipeline survives where (a) brand voice matters, (b) complex tool chains are needed, (c) audio data must be separately audited.

Voice AI is no longer a fun demo. In 2026 it ships in call centers, automotive infotainment, games, education, and healthcare. Things to watch in the next 1–2 years: (1) whether Sesame can actually ship at scale, (2) whether open-source TTS narrows the ElevenLabs gap, (3) whether Whisper turbo gets another jump.

