✍️ Transcription Mode: AI Voice 2026 — ElevenLabs, OpenAI Realtime, Cartesia, Vapi, Sesame, Deepgram, and the State of the Voice Agent Stack
Prologue — The Final Piece of the Generative-Media Quartet
Over the past several weeks we've gone through generative media one category at a time. Music (Suno, Udio, Lyria, ElevenMusic). Images (FLUX, Imagen, Midjourney, Ideogram, Recraft, Firefly). Video (Sora, Veo, Runway, Pika, Kling, Luma, Hailuo). The pattern was the same every time — the stunning 2024 demos, the rough 2025 betas, the mature 2026 tools, and the hard problems that still won't go away.
Today is the last piece — voice. And voice differs from the other three on two decisive points.
First, voice is bidirectional. Music is fire-and-forget, an image is fire-and-forget, a video is fire-and-forget. But voice means listening to a person speak (STT), figuring out what to say back (LLM), and returning it as natural speech (TTS). Those three stages get bundled together as the unit of conversation. So the voice category is not a TTS-model bake-off — it's the whole voice-agent stack.
Second, latency is absolute in voice. With music you wait 30 seconds; with an image, 10 seconds; with a video, a minute or more. But in human-to-human conversation, silence longer than 800 milliseconds starts to feel awkward, and beyond 1.5 seconds the other person assumes you've stopped. So a voice agent has to return its first audio byte within roughly 300 milliseconds of the user finishing their turn. That's a dimension nobody had to worry about in music, images, or video.
Those two differences make the 2026 voice category interesting. Model quality alone isn't enough. You also have to design the transport layer (typically WebRTC), turn detection, interruption handling, endpointing, cache warming, and warm pools — the entire system layer in lockstep with the models.
The 2026 lineup, as of May.
- ElevenLabs has cemented its position as the category leader in consumer TTS and B2B voice cloning, and is now climbing up the stack with Conversational AI as a voice-agent product.
- OpenAI Realtime API delivers genuine voice-in voice-out over WebRTC on top of GPT-Realtime, and reshaped the category by doing so.
- Cartesia's Sonic-2 holds the title of the fastest TTS at 75ms time-to-first-byte (vendor figure, May 2026).
- Vapi owns the orchestration layer for voice agents that lets you mix and match STT/LLM/TTS, and raised a $64M Series B last June.
- Sesame's CSM (Conversational Speech Model) opened a new axis — "human-like personality."
- On the STT side, Deepgram Nova-3 and AssemblyAI Universal-2 are the two leaders, with Whisper Large V3 Turbo and WhisperX as the open-source baselines.
- Hume EVI 4 focuses on emotional recognition and generation, Bland specializes in phone-call automation, and Retell is another B2B voice-agent platform in the same neighborhood as Vapi.
This piece sorts that landscape. Who fits which job, how a voice-agent stack actually composes, how you hit the sub-300ms first-byte target, where the build/buy line sits, and what voice-cloning consent looks like in practice — without the breathless "AI is replacing call centers" or "AI voice is dangerous" framing on either side.
The one-liner: 2026 AI voice isn't a story of "TTS got better." It's the story of "the whole stack can now run end-to-end under 300ms." Understand that, and tool selection gets easy.
1. How the Category Was Born — What Happened in 2023~2024
1.1 Three Lineages of Speech Synthesis
AI voice synthesis is actually a 30-year-old field. Early on it was concatenative TTS (gluing recorded fragments), then parametric TTS (predicting acoustic parameters with statistical models), and from 2017 onward, neural TTS (WaveNet, Tacotron). The direct ancestors of what we use today are two threads from 2020 onward.
Thread 1: multi-speaker neural TTS. Take text and a speaker embedding, synthesize in any target voice. ElevenLabs started in this lineage when it was founded in November 2022.
Thread 2: autoregressive codec models. Apply text-LLM ideas to audio directly. Neural audio codecs (EnCodec, SoundStream) compress audio into tokens; a transformer then learns the sequence. Microsoft's VALL-E (January 2023), Meta's Voicebox (June 2023), and OpenAI's Whisper (STT, September 2022) all live in this lineage.
By late 2023 and early 2024 the two threads started fusing. ElevenLabs went hybrid autoregressive plus diffusion. Microsoft shipped VALL-E 2. OpenAI dropped audio tokens directly inside a multimodal LLM (GPT-4o).
1.2 The Inflection Point — The May GPT-4o Demo
In May 2024, OpenAI unveiled GPT-4o with a voice-in voice-out demo. The user spoke, the model heard, the same model answered with speech. Interruptions worked naturally, emotion came through, the thing even sang. The entire category got redrawn that day.
But shipping took longer than the demo suggested — a limited rollout in July, the Realtime API beta in October 2024, then the August 2025 GA of the GPT-Realtime model. That interval gave Anthropic, Google, and Cartesia time to ship their own answers.
1.3 The Voice-Cloning Bombshell — Sky and Scarlett Johansson
On May 14, 2024, OpenAI launched "Sky" as one of the GPT-4o voices. Actress Scarlett Johansson had previously declined OpenAI's offer to use her voice and went public when Sky landed sounding uncannily like her. OpenAI pulled Sky immediately.
The signal to the whole industry was loud and clear. Voice-cloning consent isn't a checkbox in a ToS — it's the legal and ethical foundation of the whole product. Every major voice model since has required some kind of verification that you actually have the right to clone the voice you're cloning.
1.4 Why Did Things Suddenly Get Good
The same three variables as in every other generative-media category.
- Data. Licensed multi-speaker datasets (LibriTTS, GigaSpeech, Common Voice) got richer, and the major labs license tens of thousands of hours of speech data on top.
- Compute. H100/H200 clusters made it feasible to train multi-billion-parameter audio models in reasonable wallclock time.
- Architecture. Neural audio codec plus transformer plus multi-speaker embedding plus diffusion decoder is now the standard recipe.
What really mattered in 2024~2025 was that low-latency streaming became table stakes. Previously you sent the full text and got back 30 seconds of audio in one batch. Now you stream text tokens in and audio chunks out. That single change is what made voice agents real.
2. TTS Leaders — ElevenLabs, Cartesia, OpenAI, Sesame
2.1 ElevenLabs — The Category Leader
As of May 2026, the text-to-speech product with the most users is ElevenLabs. Founded November 2022, Series B in January 2024 ($180M at a $3.3B valuation), and through 2026 it has been expanding into a multimodal voice company.
The product lines.
- TTS API. Multilingual v2 is the baseline, Turbo v2.5 is the low-latency tier, and Flash v2.5 is the fastest, lowest-cost tier. The v3 family rolled out in beta in May 2026.
- Voice Design v2. Design a new voice from a text prompt ("warm, mid-30s female narrator, slight British accent"). v2 update landed January 2026.
- Voice Cloning. Instant (30-second sample, fast clone) and Professional (30+ minute sample, high-quality clone).
- Conversational AI. Beta in November 2024, GA in January 2025. STT/LLM/TTS bundled into a voice-agent builder. The product line that took ElevenLabs up the stack.
- ElevenMusic. Music side (covered in the previous post).
- ElevenStudio. Dubbing and translation — carrying a video's voices over into another language.
Quality? Thirty-two languages including English, Japanese, Korean, Spanish, French, German. Korean voice quality got visibly better through 2025 — but fine emotional control in Korean (sarcastic tones, restrained sadness) is still weaker than in English.
Pricing (May 2026).
- Free: 10,000 credits/month
- Starter: $5/month, 30,000 credits
- Creator: $22/month, 100,000 credits, commercial use
- Pro: $99/month, 500,000 credits
- Scale: $330/month and up
- Enterprise: custom
2.2 Cartesia — The Low-Latency Champion
Cartesia was founded in February 2024. Co-founders Karan Goel and Albert Gu did state-space-model research at Stanford (Gu co-authored Mamba). $64M Series A in March 2025 (at a $300M valuation), and a follow-on Series B in January 2026.
The flagship is the Sonic family — Sonic-1 (2024) and Sonic-2 (September 2025). Sonic-2's time-to-first-byte is 75ms (vendor figure, May 2026), the lowest on the market. This is the model that made the sub-300ms first-byte target for voice agents realistic for the first time.
Quality is competitive with ElevenLabs in subtle ways. On plain English sentences they're roughly equal. On expressive voices (dramatic narration), ElevenLabs edges ahead. On low-latency voice-agent scenarios, Cartesia is decisively ahead.
Pricing (May 2026).
- Free: 50,000 chars/month
- Creator: $5/month, 100,000 chars/month
- Pro: $49/month, 1,000,000 chars/month
- Scale: $299/month
- Enterprise: custom
2.3 OpenAI Realtime — The Move That Reshaped the Category
OpenAI's Realtime API launched in beta in October 2024 and stabilized in August 2025 alongside the GPT-Realtime model's GA. By adopting WebRTC as a standard transport, it changed what "voice agent" meant.
Key properties.
- Voice-in voice-out. Not a three-stage STT/LLM/TTS pipeline — a single multimodal model handling all three. Lower theoretical latency.
- WebRTC. One line of browser code to connect. UDP-based, so it's much more tolerant of packet loss than WebSocket.
- Function calling (tool use). The model can invoke external functions mid-conversation. A baseline requirement for voice agents.
- VAD (Voice Activity Detection). The model itself decides whether the user has finished speaking. Server-side semantic VAD is the default.
- Interruption. If the user starts speaking while the model is speaking, the model stops immediately.
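The properties above are delivered as a stream of typed JSON events over the connection. A sketch of the server-side event handling, with no network attached: the event names follow the general shape of the Realtime API's published event types ("input_audio_buffer.speech_started", "response.audio.delta", "response.done"), but treat them as assumptions and check the current API reference.

```python
import base64
import json

class RealtimeSession:
    """Minimal dispatcher for Realtime-style session events (mock, no socket)."""

    def __init__(self):
        self.audio_out = bytearray()   # accumulated response audio for playback
        self.interrupted = False

    def handle(self, raw: str) -> str:
        event = json.loads(raw)
        etype = event.get("type", "")
        if etype == "input_audio_buffer.speech_started":
            # Barge-in: the user started talking over the model,
            # so drop the in-flight response audio immediately.
            self.interrupted = True
            self.audio_out.clear()
        elif etype == "response.audio.delta" and not self.interrupted:
            # Audio arrives as base64 chunks; append to the playback buffer.
            self.audio_out += base64.b64decode(event["delta"])
        elif etype == "response.done":
            self.interrupted = False
        return etype
```

In a real integration the same dispatch loop sits behind a WebRTC data channel or WebSocket; the point is that interruption handling is just another event branch.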
Pricing (May 2026, GPT-Realtime).
- Audio input: $40 per 1M tokens
- Audio output: $80 per 1M tokens
- Cached input: $2.50 per 1M tokens
The catch with OpenAI Realtime is that you have almost no model choice — you're locked to GPT-Realtime. If you want to run Claude or Gemini, you fall back to the traditional STT plus text-LLM plus TTS pipeline.
2.4 Sesame — A Conversational Model With Personality
Sesame AI is a newer faction that surfaced publicly in early 2025. Founder Brendan Iribe was a co-founder and CEO of Oculus VR. That background gives them a "voice and device fused together" vision that feels very specific.
The product is CSM (Conversational Speech Model). When the demo went public in February 2025, the internet genuinely shook — the most natural, most personality-laden, most human-feeling voice anyone had tried. The model lands a joke, hesitates briefly, switches tone abruptly — the small human details are there.
The technology under CSM.
- Speech generated by an end-to-end multimodal LLM. Unlike conventional TTS, the LLM emits audio tokens directly.
- Personality-based training. Started with two characters ("Maya" and "Miles"), each with its own speech style baked into training data.
- Beta as of May 2026. Open API access is still limited; mostly demos and selective partner integrations.
The implication is big — voice is now competing on "personality and expressiveness," not just "technical fidelity."
2.5 The Rest
- Azure Speech. Microsoft's enterprise TTS. Widest voice catalog (140+ languages, 600+ voices) and battle-tested reliability. Naturalness is half a step behind ElevenLabs/Cartesia.
- Google Cloud TTS. Vertex AI integration. Chirp 3 HD voices closed the quality gap meaningfully.
- AWS Polly. Amazon's classic TTS, now with Generative voice options. Pricing and SLA are attractive.
- Play.ht. Consumer side, strong with podcasters and YouTubers.
- Resemble AI. Voice-cloning specialist, B2B.
- Coqui XTTS. Open-source TTS. The company itself shut down in 2024, but the weights live on GitHub.
2.6 TTS Comparison
| Tool | Time-to-First-Byte | Naturalness | Voice Variety | Korean | Price Tier | Primary Use |
|---|---|---|---|---|---|---|
| ElevenLabs v3 | about 200~400ms | very high | very wide | good | mid-high | content, B2B agents |
| Cartesia Sonic-2 | about 75ms | high | wide | fair | mid | low-latency agents |
| OpenAI Realtime | about 300~500ms E2E | high | limited | good | high | multimodal agents |
| Sesame CSM | not disclosed | very high (personality) | character-bound | unrated | beta | next-gen conversation |
| Azure Speech | about 200~300ms | fair to high | very wide | good | mid | enterprise |
| Google TTS Chirp 3 | about 200~400ms | high | wide | good | mid | GCP-integrated |
| AWS Polly Generative | about 300~500ms | fair to high | wide | fair | low to mid | AWS-integrated |
3. STT Leaders — Deepgram, AssemblyAI, Whisper
3.1 Deepgram Nova-3
Deepgram, founded 2015, is one of the oldest pure-play STT shops. Series C in June 2024 ($100M), additional round in January 2026.
The current flagship is Nova-3 (GA June 2025). Versus Nova-2 it gained ground on accuracy, latency, and price simultaneously.
- WER (Word Error Rate). English 7.7% (Nova-2: 8.4%), multilingual average 12.3% (Nova-2: 15.1%). Measured on 2026 standard benchmarks (CommonVoice, Earnings-22).
- Latency. Streaming first-word about 250ms; batch processes a one-hour file in roughly 30 seconds.
- Multilingual. 30+ languages including Korean, with code-switching handling (two languages in one utterance).
- Diarization. Speaker separation noticeably better than Nova-2.
- Smart Format. Auto-formats numbers, currency, emails, phone numbers.
Pricing (May 2026).
- Pre-recorded: $0.26/hour
- Streaming: $0.0058/min
- Enhanced (higher-tier models): additional cost
Deepgram's strength is the low-latency streaming + price + B2B reliability triangle. Vapi, Retell, Bland and similar platforms default to Deepgram for STT.
3.2 AssemblyAI Universal-2
AssemblyAI was founded in 2017, a Y Combinator alum. Deepgram's most direct competitor.
The flagship is Universal-2 (GA in late 2025). Visibly more accurate than Universal-1, and notably strong on "formatting and readability."
- WER. English 6.6%, multilingual average 11.8%. Slightly more accurate than Deepgram Nova-3 on some benchmarks.
- Timestamps. Word-level timestamps and speaker diarization are extremely precise.
- Language detection plus code-switching. Automatic.
- Speaker diarization. One of the most accurate in the market.
- Extras. Sentiment analysis, entity detection, topic detection, summarization, PII redaction all in the same API.
Pricing (May 2026).
- Best model: $0.37/hour (batch)
- Universal-2: $0.27/hour
- Streaming: $0.47/hour
AssemblyAI's edge is post-processing integration (summaries, sentiment, entities). Call-center analytics and meeting notes are the sweet spots.
3.3 Whisper and WhisperX — The Open-Source Baselines
OpenAI Whisper landed as open source in September 2022 — multilingual STT, MIT licensed. It's still the standard for "self-host to save money" or "don't send data out."
Whisper Large V3 Turbo (October 2024) — roughly 8x faster than V3 at similar quality. The strong open-source baseline.
WhisperX (2023~2025) — adds forced alignment, voice activity detection, and speaker diarization on top of Whisper. The de facto standard when you need precise word-level timestamps.
Faster-Whisper — CTranslate2-backed optimization. About 4x faster than vanilla Whisper on GPU.
Performance (English LibriSpeech test-clean).
- Whisper Large V3 Turbo: WER about 3.1%
- Faster-Whisper Large V3: WER about 3.4%
- WhisperX (timestamp accuracy): very high
Open-source Whisper's limits are (a) no true real-time streaming (chunked workarounds only), (b) speaker diarization requires a separate model, and (c) you carry the operational burden yourself.
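Limit (a) is usually worked around by sliding a fixed window with some overlap over the incoming audio and stitching the per-window transcripts. The window math looks like this; the 5-second chunks and 1-second overlap are illustrative defaults, not Whisper requirements.

```python
def chunk_windows(total_samples: int, sr: int = 16_000,
                  chunk_s: float = 5.0, overlap_s: float = 1.0):
    """Return (start, end) sample ranges for overlapping pseudo-streaming
    windows. Overlap lets the stitcher deduplicate words cut at a boundary."""
    step = int((chunk_s - overlap_s) * sr)   # how far each window advances
    size = int(chunk_s * sr)                 # samples per inference call
    windows, start = [], 0
    while start < total_samples:
        windows.append((start, min(start + size, total_samples)))
        if start + size >= total_samples:    # last window reached the end
            break
        start += step
    return windows
```

Each window then goes through Whisper as an ordinary batch call; per-window latency is bounded by the window length plus inference time, which is why this is "pseudo" streaming rather than the word-at-a-time streaming the SaaS vendors offer.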
3.4 STT Comparison
| Model | WER (English) | WER (multilingual) | Latency (streaming) | Price ($/hour) | License | Korean |
|---|---|---|---|---|---|---|
| Deepgram Nova-3 | 7.7% | 12.3% | about 250ms | 0.26 | commercial SaaS | good |
| AssemblyAI Universal-2 | 6.6% | 11.8% | about 400ms | 0.27 | commercial SaaS | good |
| Whisper Large V3 Turbo | 3.1% | 7~12% (varies) | not supported (chunked workaround) | $0 self-host | MIT | good |
| WhisperX | 3.1% (Whisper base) | same | not supported | $0 | BSD-4 | good |
| Faster-Whisper | 3.4% | same | not supported | $0 | MIT | good |
| Azure Speech STT | about 8% | about 13% | about 300ms | 1.0 | enterprise | good |
| Google STT Chirp 3 | about 7% | about 12% | about 300ms | about 0.4 | enterprise | good |
Caveat: WER numbers are extremely sensitive to benchmark and domain. On noisy call-center audio, Whisper's WER might be far worse than the SaaS leaders' — or far better. Measure on your own domain data.
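A minimal WER implementation is enough to run that measurement yourself: the standard word-level edit distance over reference and hypothesis transcripts. No text normalization is applied here, and casing/punctuation handling will move the numbers, so normalize consistently before comparing vendors.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / len(ref)
```

Run it over a few hundred utterances of your own recorded audio per vendor and compare distributions, not single numbers.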
4. Voice-Agent Platforms — Vapi, Retell, Bland, Hume
4.1 Vapi — The Platform Layer
Vapi was founded in 2023. $64M Series B in June 2025 (at a $600M valuation). One of the fastest-growing companies in the voice-agent category.
Vapi's positioning is "the orchestration layer for STT/LLM/TTS." They don't build the models — they let you compose voice agents from the best of each (Deepgram, OpenAI, ElevenLabs, Cartesia, etc.).
Key features.
- Modular stack. STT (Deepgram/AssemblyAI), LLM (OpenAI/Anthropic/Google), TTS (ElevenLabs/Cartesia/PlayHT), all swappable.
- Turn detection. Semantic-VAD-based decision about whether the user has finished speaking.
- Interruption handling. When the user starts talking mid-response, the model stops immediately.
- Function calling. Outbound API calls during a conversation (booking systems, CRM lookups).
- Phone integration. Twilio/Vonage/Telnyx for actual PSTN numbers.
- Recording plus analytics. All calls recorded; dashboard with search, filtering, analysis.
Pricing (May 2026).
- Free tier: 10 minutes/month
- Pay-as-you-go: $0.05~$0.20/min depending on stack choice
- Enterprise: custom
Vapi's selling point is "fast to build plus no model lock-in." You can spin up an MVP in a weekend and change models with a config flag.
4.2 Retell — Vapi's Closest Rival
Also founded 2023, also a B2B voice-agent platform. Very similar positioning to Vapi, but more emphasis on enterprise call reliability.
- High-quality call infrastructure. Deeper Twilio integration, stronger call-stability SLAs.
- Agent Studio. A more polished no-code/low-code builder.
- Analytics. Auto-classification of call outcomes, per-call analysis.
Pricing is in the same neighborhood as Vapi ($0.07~$0.18/min).
4.3 Bland — Phone-Call Automation Specialist
Bland AI focuses on a specific use case — "an AI that talks to people on the phone." More specialized for inbound and outbound call-center automation than general voice agents.
- High concurrency. Thousands of simultaneous calls.
- Workflow builder. Branching logic, variable extraction, CRM integration.
- Voice cloning. Clone a voice that matches the sales tone of the company.
- Compliance. TCPA (U.S. telemarketing regulation) tooling.
Target markets: sales callbacks, appointment setting, customer surveys, collections.
4.4 Hume EVI — Emotional Voice
Hume AI sits in a different camp. They start from "voice carries emotion" as a thesis. EVI (Empathic Voice Interface) is specifically designed to recognize the emotional tone in a user's voice and to put emotion into the response.
- EVI 4 (early 2026). Improved tone-classification accuracy and response-emotion precision.
- Use cases. Mental health bots, coaching, care calls.
- Limits. Whether the model's emotion classification matches lived user experience is still being validated in the wild.
4.5 Voice-Agent Platform Comparison
| Platform | Positioning | Primary Use | Model Choice | Price ($/min) | Differentiator |
|---|---|---|---|---|---|
| Vapi | orchestration layer | any voice agent | very wide (every major) | 0.05~0.20 | fast build, no lock-in |
| Retell | enterprise calls | call center, B2B sales | wide | 0.07~0.18 | call stability, Studio |
| Bland | phone automation | sales, scheduling, surveys | own plus some | 0.10~0.15 | high concurrency |
| ElevenLabs Conversational AI | integrated stack | content/B2B agents | ElevenLabs-first | session-based | bundled voices |
| OpenAI Realtime | direct API | bring-your-own build | GPT-Realtime locked | token-based | shortest E2E latency |
| Hume EVI | emotion-aware | healthcare, care | EVI models | custom | tone analysis |
5. The Voice-Agent Stack — How One Call Actually Flows
5.1 The Traditional Three-Stage Pipeline
Most voice agents chain three models.
user speech audio
│
▼
[STT] Speech-to-Text
(e.g., Deepgram Nova-3 streaming)
│
▼ text tokens
[LLM] Large Language Model
(e.g., GPT-5, Claude Opus 4.7, Gemini 2.5)
│
▼ response text
[TTS] Text-to-Speech
(e.g., Cartesia Sonic-2 streaming)
│
▼
model response audio
The big win of independent stages is interchangeability — swap any model without touching the others. Whisper for STT, Claude for LLM, ElevenLabs for TTS, in any combination. Vapi/Retell exist to manage that combinatorial space.
The big downside is cumulative latency. Even 100ms per stage adds up to 300ms before network RTT, and 400~500ms total is easy to hit.
5.2 End-to-End Multimodal Models
OpenAI Realtime and some next-gen models (Sesame CSM, GPT-4o's voice mode) work differently. A single model takes speech in and emits speech directly.
user speech audio
│
▼
[E2E Multimodal LLM]
- speech tokens in
- text/speech tokens out
- streamed over WebRTC
│
▼
model response audio
Pros — potentially shorter latency (no intermediate conversions), more natural emotion and intonation (STT doesn't throw away tone). Cons — no model choice, higher pricing, harder to fine-tune.
5.3 The Supporting Components
A working voice agent isn't STT/LLM/TTS in isolation. These extras are mandatory.
VAD (Voice Activity Detection). Is the user speaking or silent right now? Silero VAD and WebRTC VAD are the open-source standards. A more sophisticated form is semantic VAD — "has the user finished speaking?" decided semantically (did the question end, is the user still thinking out loud).
Turn detection. Is it the model's turn to speak now? Starts at simple VAD (silence for 300ms) and evolves into more nuanced models. OpenAI Realtime offers server-side semantic VAD as an option.
Endpointing. Find the precise end of an utterance. Pauses in the middle of "uh... so..." must not be mistaken for the end of the turn.
Interruption handling. When the user starts speaking mid-response, (a) stop the current TTS immediately, (b) reprocess the new user utterance, and (c) reflect "the user interrupted" in conversation state.
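The VAD/turn-detection/endpointing/interruption quartet above can be sketched as a tiny frame-level state machine. This is the naive silence-timeout variant — real products use Silero VAD or semantic endpointing — and the 500ms timeout is an assumed default, not anyone's shipped value.

```python
from dataclasses import dataclass

@dataclass
class TurnDetector:
    """Naive frame-based endpointing: end a turn after sustained silence.
    The "speech_started" event doubles as the barge-in signal when the
    agent is mid-response."""
    silence_threshold_ms: int = 500   # assumed timeout; tune per product
    frame_ms: int = 20                # typical VAD frame size
    _silence_ms: int = 0
    _in_utterance: bool = False

    def feed(self, is_speech: bool):
        """Feed one frame's VAD verdict; return an event name or None."""
        if is_speech:
            self._silence_ms = 0
            if not self._in_utterance:
                self._in_utterance = True
                return "speech_started"       # interrupt TTS if agent talking
        elif self._in_utterance:
            self._silence_ms += self.frame_ms
            if self._silence_ms >= self.silence_threshold_ms:
                self._in_utterance = False
                self._silence_ms = 0
                return "end_of_turn"          # hand transcript to the LLM
        return None
```

The weakness the text calls out is visible here: a mid-sentence "uh... so..." pause longer than the timeout fires a false end-of-turn, which is exactly why semantic VAD exists.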
Conversation state management. Past turns, user-made promises, model-made promises, variables (customer name, order number) — all tracked. The LLM's context window plus external memory.
Tool use / function calling. Outbound API calls during the conversation. "Move my appointment to 12:30" should trigger updateAppointment(id, newTime).
Monitoring and analytics. Call recording, transcription, sentiment analysis, outcome categorization, dashboards. The operational backbone.
5.4 The Real System Diagram
┌─────────────────────────┐
[phone ─── PSTN ─── Twilio]───────▶│ Voice Agent Platform │
│ (Vapi / Retell / etc) │
└────────────┬─────────────┘
│
┌──────────────────────────────────┼──────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ STT │ │ LLM │ │ TTS │
│ Deepgram Nova-3 │──text tokens─▶│ Claude / GPT │─response text▶│ Cartesia Sonic-2 │
│ (streaming WSS) │ │ (streaming SSE) │ │ (streaming WSS) │
└────────▲─────────┘ └────────▲─────────┘ └────────┬─────────┘
│ │ │
│ audio chunks │ context │ audio chunks
│ │ │
┌────────┴─────────────────────────────────┴─────────────────────────────────┴────────┐
│ Conversation Orchestrator │
│ - VAD (Silero / server-side semantic VAD) │
│ - Turn detection │
│ - Endpointing │
│ - Interruption handling │
│ - State management (past turns plus variables) │
│ - Tool-use router (booking system / CRM / DB) │
└────────────────────────────┬──────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌──────────────┐
│ Recording │ │ Analytics │ │ Compliance │
│ Storage │ │ Dashboard │ │ PII Redact │
└─────────────┘ └─────────────┘ └──────────────┘
What this shows — three models, but many more system components. That's why Vapi and Retell create value. Building all of this from scratch is a six-month project.
6. Latency as the Absolute Metric — The Sub-300ms Target
6.1 Why 300ms
In natural human conversation, the gap between turns averages 200~300 milliseconds. Beyond that it starts to feel like awkward silence; beyond 700~800ms the listener wonders if you heard them.
For a voice agent to feel natural, time-to-first-byte (TTFB) — user finishes speaking to model's first audio byte — has to be under 300ms. 350~500ms is "a little awkward but acceptable," and beyond 500ms people start describing the experience as weird.
6.2 The Latency Budget
To hit TTFB 300ms, you have to budget each stage like this.
| Stage | Budget | Notes |
|---|---|---|
| Network RTT (round trip) | 50~100ms | depends on user location |
| Endpointing (end-of-utterance detection) | 30~80ms | semantic VAD is fastest |
| STT final transcript | 50~150ms | streaming partials arrive earlier |
| LLM time-to-first-token | 100~300ms | strongly dependent on model and prompt size |
| TTS first audio chunk | 50~200ms | Cartesia's 75ms is the market floor |
| Total | about 300~800ms | floors sum to about 300ms; typical totals exceed 500ms |
The takeaway — even at the floor of every stage, 300ms is tight. So you have to (a) collapse stages with an E2E model, (b) crush each stage to its floor, or (c) start responding speculatively before the user finishes.
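The budget table can be sanity-checked in a few lines; the figures mirror the table above and are planning numbers, not measurements.

```python
# Per-stage latency budget: stage -> (floor_ms, typical_ms).
BUDGET_MS = {
    "network_rtt":     (50, 100),
    "endpointing":     (30, 80),
    "stt_final":       (50, 150),
    "llm_ttft":        (100, 300),
    "tts_first_chunk": (50, 200),
}

floor_ms = sum(lo for lo, _ in BUDGET_MS.values())
typical_ms = sum(hi for _, hi in BUDGET_MS.values())
# floor_ms comes out at 280: even best-case stages leave almost no
# slack under a 300ms target, while typical values land past 800ms.
```

This is the arithmetic behind "300ms is tight": the target is essentially the sum of every stage's best case.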
6.3 Optimization Tricks
1. Speculative response. The LLM starts drafting before the user finishes. When the user does finish, you either emit what's drafted or quickly correct it. Risk: if the user adds more, the draft becomes awkward.
2. Stream everything. STT emits partial transcripts; LLM streams tokens; TTS makes audio chunks as text chunks arrive. Batch in any one stage means batch end-to-end.
3. Short prompts. LLM TTFT scales almost linearly with prompt length. Keep system prompts tight and rely on prompt caching for context.
4. Caches and warm pools. Pre-spin voice-agent instances and keep them warm. Avoid the cold start on the first call.
5. Geographic proximity. Inference servers must be close to the user. Multi-region deployment is non-negotiable.
6. End-to-end models. OpenAI Realtime collapses stages and eliminates intermediate transformation delays.
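Trick 2 is the structural one, and it can be illustrated with a toy generator chain: each stage yields as soon as it has output, so the first audio chunk exists before the LLM has finished its sentence. The stages here are stand-ins, not real STT/LLM/TTS clients.

```python
def llm_tokens(prompt: str):
    # Pretend LLM: streams a canned reply token by token.
    for word in ["Sure,", "your", "order", "shipped", "today."]:
        yield word

def tts_chunks(tokens):
    # Pretend TTS: flushes audio at clause boundaries instead of
    # waiting for the full response text.
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith((",", ".", "!", "?")):
            yield ("audio:" + " ".join(buf)).encode()
            buf = []
    if buf:
        yield ("audio:" + " ".join(buf)).encode()

first_chunk = next(tts_chunks(llm_tokens("where is my order?")))
# first_chunk covers just "Sure," — playback can start immediately.
```

If either generator were replaced with a list comprehension (i.e., batched), the first chunk would wait for the slowest stage — which is the "batch in any one stage means batch end-to-end" point.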
6.4 Measurement and SLAs
Latency is a distribution, not an average. p50 of 250ms with p99 of 2 seconds means 1% of turns feel awkward. Calls have dozens to hundreds of turns, so p99 awkwardness shows up multiple times per call.
Common operational SLAs.
- p50 TTFB < 300ms
- p95 TTFB < 600ms
- p99 TTFB < 1000ms
- Interruption responsiveness < 200ms
These metrics need to be measured per turn, not per call, to be meaningful.
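Checking those SLAs against per-turn samples is a one-function job; this uses the nearest-rank percentile to stay dependency-free, and the sample TTFB values are made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Per-turn TTFB samples in ms (one entry per turn, across all calls).
ttfb = [220, 240, 250, 260, 270, 280, 290, 310, 550, 1400]

sla_ok = (percentile(ttfb, 50) < 300 and
          percentile(ttfb, 95) < 600 and
          percentile(ttfb, 99) < 1000)
# The two tail samples blow the p95 and p99 budgets even though the
# median looks perfectly healthy — the distribution point in the text.
```

In production you would compute this over a sliding window per region and alert on the tail percentiles, not the mean.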
7. Use Cases — Where AI Voice Actually Works
7.1 First-Line Call-Center Triage
The use case that landed fastest. The reason is simple — high-volume repetitive calls, defined workflows, and almost every call spends its first 30 seconds on the same questions.
A typical workflow.
- Inbound. Customer calls → AI agent answers → "What can I help you with?" → intent classification (order status / shipping / refund / other) → context load → resolution or handoff to a human.
- Outbound. AI agent calls → "Hi, this is XYZ Apparel calling about your shipment" → simple update or appointment setting.
Field results.
- Self-resolution rate 30~60% (varies by industry and question type)
- 30~50% reduction in average call duration
- 70~90% cost reduction versus human agents
- CSAT: usually flat or slightly down (fine emotional handling still favors humans)
Stack: Vapi/Retell + Deepgram + Claude/GPT + ElevenLabs/Cartesia.
7.2 Appointment Scheduling
Dental offices, salons, small clinics. The typical workflow is "what day/time works for you?" → check the scheduling system → present options → confirm → SMS confirmation.
This is the best use case for tool use — the model calls getAvailableSlots(date), then bookSlot(slotId, customerInfo).
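A sketch of that tool-use round trip, using the getAvailableSlots/bookSlot names from the text as illustrative function names and an in-memory store in place of a real scheduling system.

```python
# Hypothetical scheduling backend (in-memory stand-in).
SLOTS = {"s1": "12:00", "s2": "12:30"}
BOOKED = {}

def get_available_slots(date: str):
    return [sid for sid in SLOTS if sid not in BOOKED]

def book_slot(slot_id: str, customer: str) -> dict:
    if slot_id in BOOKED:
        return {"ok": False, "error": "slot taken"}
    BOOKED[slot_id] = customer
    return {"ok": True, "time": SLOTS[slot_id]}

def dispatch(call: dict):
    """What the orchestrator does when the LLM emits a tool call."""
    tools = {"getAvailableSlots": lambda a: get_available_slots(a["date"]),
             "bookSlot": lambda a: book_slot(a["slotId"], a["customer"])}
    return tools[call["name"]](call["arguments"])

result = dispatch({"name": "bookSlot",
                   "arguments": {"slotId": "s2", "customer": "Kim"}})
```

The dispatch result is serialized back into the conversation as the tool's output, and the TTS then speaks the model's confirmation ("You're booked for 12:30").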
7.3 Podcasts and Audiobook Narration
Long-form content generation. ElevenLabs is strongest here.
The workflow.
- Write the script
- Pick a voice or clone your own
- Synthesize the whole script via the ElevenLabs API
- Post-process (music, SFX, mastering)
Cost: a one-hour audiobook fits comfortably in one month's ElevenLabs Pro plan ($99). Against human narration rates ($200~$500/hour), the savings are dramatic.
Quality: humans still win on fine emotional moments (a grieving scene), but by late 2025 most listeners couldn't tell the difference in mainstream content.
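Step 3 in practice means splitting the script first, since TTS APIs cap characters per request. A sentence-boundary chunker handles that; the 2,500-character limit here is an assumption for illustration — check your provider's actual limit.

```python
import re

def chunk_script(script: str, max_chars: int = 2500):
    """Split a long script into TTS-sized chunks at sentence boundaries,
    so no synthesis request cuts a sentence in half."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)    # flush the full chunk
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes out as one synthesis request, and the audio files are concatenated in post-processing.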
7.4 Accessibility
Screen readers for blind users, real-time captions for deaf users. AI voice has been here for a long time, but quality has improved usability dramatically.
- VoiceOver (macOS/iOS) and TalkBack (Android) are gradually adopting ElevenLabs/Cartesia-grade voices.
- Live Caption (Pixel phones), Otter.ai, and similar live-captioning products lean heavily on Whisper/Deepgram.
7.5 Voice Cloning — Authentication and Memory
Preserving your own voice, or recreating a family member's voice (a deceased relative, for instance). Technically a 30-second sample is enough — but this is also the area with the thickest ethical and legal gray zone.
- The person is alive and consenting → clearly OK
- The person is deceased, with family consent → legally ambiguous (depends on jurisdiction's rights of the deceased)
- The person is alive but didn't consent → obviously unlawful (the deepfake zone)
ElevenLabs requires "Voice Verification" — the person whose voice is being cloned must record a verification phrase directly with ElevenLabs.
7.6 Where It Doesn't Work
Honestly.
- Complex call-center complaint handling. Calming down an angry customer still favors humans.
- Legal or medical advice. Accuracy and liability rule out unsupervised AI voice.
- Creative collaboration (like a voice director with an actor). Fine direction is still very human.
- Low-resource languages. English, Spanish, Chinese are great. Languages with thin training data (Vietnamese, Swahili) lag noticeably.
- Real-time interpretation. Useful but still behind on both latency and accuracy.
8. Build vs. Buy — An Honest Decision Frame
8.1 Three Paths
When you set out to build a voice agent, you have three paths.
Path A: Pure SaaS. Use ElevenLabs Conversational AI, Air AI, or just the no-code builders inside Vapi/Retell. Build time: days. Cost: roughly $0.20~$0.30/min. Control: low.
Path B: Platform plus custom. Vapi or Retell as a base; you write function calls and workflow logic. Build time: 1~4 weeks. Cost: $0.05~$0.20/min plus engineering time. Control: medium-high.
Path C: Full build. Compose STT/LLM/TTS yourself and write VAD, endpointing, and state management from scratch. Build time: 3~6 months. Cost: API bills plus 2~3 full-time engineers. Control: very high.
8.2 Decision Tree
start
│
├─ Call volume below 1,000 min/month?
│ └─ yes → Path A or Path B. Path C is never justified here.
│
├─ Industry-specific compliance needed? (HIPAA, PCI, SOC2)
│ ├─ yes → Path B (Vapi enterprise tier plus compliance add-ons) or
│ │ Path C (full self-host)
│ └─ no ↓
│
├─ Call volume above 100,000 min/month?
│ └─ yes → Run the cost math. SaaS unit cost times volume vs. self-host.
│ Usually a Path B enterprise contract is optimal.
│
├─ Does model choice matter? (e.g., a specific LLM is required)
│ ├─ yes → Path B (Vapi's modular models)
│ └─ no → Path A (fastest start)
│
├─ Is fine UX control absolutely necessary? (response tone, interruption policy)
│ ├─ yes → Path C is worth considering
│ └─ no → Path B
8.3 Cost Comparison
Rough monthly cost by volume (average stack pricing).
| Monthly minutes | Path A ($0.20/min) | Path B ($0.10/min) | Path C (self-host) |
|---|---|---|---|
| 1,000 | $200 | $100 | thousands in salary alone |
| 10,000 | $2,000 | $1,000 | salary plus about $300 infra |
| 100,000 | $20,000 | $10,000 | salary plus about $2,000 infra |
| 1,000,000 | $200,000 | $100,000 | salary plus about $20,000 infra |
The implication — Path C only starts to make pricing sense above about 1M minutes/month (12M minutes/year). Below that, SaaS almost always wins once you account for the operational burden you avoid.
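The break-even point is easy to reproduce from the table. A rough sketch, using the table's per-minute rates and its ~$0.02/min infra figure for Path C; the $50k/month salary line for 2~3 engineers is an assumption, not a number from the table:

```python
def monthly_cost(minutes: int) -> dict[str, float]:
    """Rough monthly cost per path in USD, from the table's figures.
    Path C salary is an ASSUMED $50k/month for 2~3 engineers."""
    return {
        "A": minutes * 0.20,
        "B": minutes * 0.10,
        "C": 50_000 + minutes * 0.02,  # fixed salary + ~$0.02/min infra
    }

for m in (10_000, 100_000, 1_000_000):
    costs = monthly_cost(m)
    cheapest = min(costs, key=costs.get)
    print(f"{m:>9,} min/month -> cheapest: Path {cheapest}  {costs}")
```

Under these assumptions Path C undercuts Path B somewhere between 100k and 1M minutes/month (solve 50,000 + 0.02m < 0.10m, i.e. m > 625k), which is consistent with the "about 1M minutes" rule of thumb once you add any margin for operational risk.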
8.4 Industry Patterns
- A voice feature in a B2B SaaS. Path A or Path B. Speed-to-launch dominates.
- Call-center replacement. Path B enterprise contracts. Call reliability and compliance dominate.
- Companies where the voice IP is itself an asset (advertising, media). Path C. Self-host the cloning model, keep data internal.
- Voice features in consumer apps. Path A or Path B. OpenAI Realtime or Vapi.
- Healthcare or finance compliance contexts. Path B compliance tier or Path C.
Epilogue — Checklist, Anti-Patterns, What's Next
AI voice went from the "wow, that's natural" GPT-4o demo shock of May 2024 to the "sub-300ms TTFB voice agents actually run" maturity of May 2026. Same pattern as music, images, and video — but the additional constraints of bidirectionality and absolute latency made the category richer.
The May 2026 takeaway is simple. For TTS quality alone, any major model is good enough. The real differentiators are (a) first-byte latency, (b) overall stack stability, (c) compliance and consent, and (d) the price-versus-volume balance. You need to see the stack, not just the model.
Tool-Selection Checklist
- TTS only, or a voice agent? — TTS only → ElevenLabs/Cartesia. Agent → Vapi/Retell or OpenAI Realtime.
- Is first-byte latency absolute? — Cartesia Sonic-2, or OpenAI Realtime with caching and a warm pool.
- Do you need model choice? — Vapi is the most flexible. ElevenLabs Conversational AI favors its own voices.
- Language other than English/Japanese/Korean? — Validate per-tool language quality on your domain.
- What's the monthly call volume? — Under 1M minutes, SaaS almost always wins.
- Compliance required? — HIPAA/PCI/SOC2 means enterprise contracts or self-hosting.
- Voice cloning needed? — ElevenLabs Voice Cloning or Resemble AI, with mandatory consent verification.
- STT accuracy critical? — Compare Deepgram vs. AssemblyAI vs. Whisper on your domain data.
- Tool use required? — Vapi, OpenAI Realtime, and ElevenLabs Conversational AI all support it.
- Analytics/recording/dashboards required? — Vapi/Retell give you these for free. DIY is heavy.
Anti-Patterns
| Anti-pattern | Why it's bad | Instead |
|---|---|---|
| Choosing the tool from model quality alone | End-to-end stack latency decides the experience | Evaluate first-byte latency and reliability too |
| Locking in on your first model choice | Models leapfrog each other every six months | A platform with modular models (Vapi) |
| Building on batch APIs first | No streaming means no voice agent | Streaming from day one |
| Naive silence-only VAD | Confuses mid-utterance pauses for the end of turn | Semantic VAD or proper endpointing |
| No interruption handling | Awkward when user talks over the model | Immediate TTS stop plus state update |
| Full context every turn | LLM TTFT balloons, latency collapses | Short system prompts, prompt caching |
| Skipping consent verification | Legal and reputational risk | Mandatory consent flow |
| Sending everything to one place | PII exposure risk | Self-host option or PII redaction |
| Average-only latency SLA | p99 awkwardness shows up multiple times per call | Measure p50/p95/p99 |
| Going Path C too early | Operational burden usually exceeds build cost | Stay on SaaS below 1M minutes/month |
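On the latency-SLA row: tracking p50/p95/p99 instead of the mean is trivial once you log per-turn time-to-first-byte. A minimal sketch with the standard library (the sample values are made up, but they show how a couple of slow turns hide behind a healthy-looking average):

```python
import statistics

# Per-turn time-to-first-audio-byte samples in ms (illustrative numbers).
ttfb_ms = [210, 240, 190, 260, 230, 1450, 220, 205, 250, 980]

q = statistics.quantiles(ttfb_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = q[49], q[94], q[98]
mean = statistics.mean(ttfb_ms)

# The mean looks fine; the tail is where the awkward pauses live.
print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

In a real deployment you would feed these from call logs per model/region and alert on p95, since with multi-minute calls a "rare" p99 stall still shows up several times per conversation.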
What's Next
The generative-media quartet closes here — music, images, video, voice. The next post pulls them together into a unified generative-media workflow. One prompt that produces music plus images plus video plus voice in a single pipeline. Whether to let Runway Gen-4, Veo 3, or Sora 3 generate voice natively, or to assemble a separate voice pipeline. The new standard for AI content creation, and how to fold each stage's model choice into a single matrix — this will be the synthesis post for the quartet.
References
- ElevenLabs
- ElevenLabs Conversational AI
- ElevenLabs Voice Design v2
- ElevenLabs Voice Cloning
- ElevenLabs Series C — TechCrunch
- Cartesia
- Cartesia Sonic-2 announcement
- Cartesia Series A announcement
- OpenAI Realtime API docs
- OpenAI Realtime API launch — TechCrunch
- GPT-Realtime GA — OpenAI
- Scarlett Johansson Sky controversy — NPR
- Sesame AI
- Sesame CSM launch — VentureBeat
- Deepgram
- Deepgram Nova-3 launch
- Deepgram Series D — TechCrunch
- AssemblyAI
- AssemblyAI Universal-2 launch
- OpenAI Whisper GitHub
- Whisper Large V3 Turbo discussion
- WhisperX GitHub
- Faster-Whisper GitHub
- Vapi
- Vapi Series B — TechCrunch
- Retell AI
- Bland AI
- Hume AI
- Hume EVI 4
- Microsoft VALL-E
- Meta Voicebox
- Mamba paper
- Silero VAD GitHub
- WebRTC for Voice AI — Cartesia guide
- Voice Agent Latency Best Practices — Vapi docs
- Twilio Voice AI integration
- Azure Speech Service
- Google Cloud TTS Chirp 3
- AWS Polly Generative Voices