✍️ Transcription Mode: AI Voice 2026 — ElevenLabs, OpenAI Realtime, Cartesia, Vapi, Sesame, Deepgram, and the State of the Voice Agent Stack
Prologue — The Final Piece of the Generative-Media Quartet
Over the past several weeks we've gone through generative media one category at a time. Music (Suno, Udio, Lyria, ElevenMusic). Images (FLUX, Imagen, Midjourney, Ideogram, Recraft, Firefly). Video (Sora, Veo, Runway, Pika, Kling, Luma, Hailuo). The pattern was the same every time — the stunning 2024 demos, the rough 2025 betas, the mature 2026 tools, and the hard problems that still won't go away.
Today is the last piece — voice. And voice differs from the other three on two decisive points.
First, voice is bidirectional. Music is fire-and-forget, an image is fire-and-forget, a video is fire-and-forget. But voice means listening to a person speak (STT), figuring out what to say back (LLM), and returning it as natural speech (TTS). Those three stages get bundled together as the unit of conversation. So the voice category is not a TTS-model bake-off — it's the whole voice-agent stack.
Second, latency is absolute in voice. With music you wait 30 seconds; with an image, 10 seconds; with a video, a minute or more. But in human-to-human conversation, silence longer than 800 milliseconds starts to feel awkward, and beyond 1.5 seconds the other person assumes you've stopped. So a voice agent has to return its first audio byte within roughly 300 milliseconds of the user finishing their turn. That's a dimension nobody had to worry about in music, images, or video.
Those two differences make the 2026 voice category interesting. Model quality alone isn't enough. You also have to design the transport layer (typically WebRTC), turn detection, interruption handling, endpointing, cache warming, and warm pools — the entire system layer in lockstep with the models.
The 2026 lineup, as of May.
- ElevenLabs has cemented its position as the category leader in consumer TTS and B2B voice cloning, and is now climbing up the stack with Conversational AI as a voice-agent product.
- OpenAI Realtime API delivers genuine voice-in voice-out over WebRTC on top of GPT-Realtime, and reshaped the category by doing so.
- Cartesia's Sonic-2 holds the title of the fastest TTS at 75ms time-to-first-byte (vendor figure, May 2026).
- Vapi owns the orchestration layer for voice agents that lets you mix and match STT/LLM/TTS, and raised a $64M Series B last June.
- Sesame's CSM (Conversational Speech Model) opened a new axis — "human-like personality."
- On the STT side, Deepgram Nova-3 and AssemblyAI Universal-2 are the two leaders, with Whisper Large V3 Turbo and WhisperX as the open-source baselines.
- Hume EVI 4 focuses on emotional recognition and generation, Bland specializes in phone-call automation, and Retell is another B2B voice-agent platform in the same neighborhood as Vapi.
This piece sorts that landscape. Who fits which job, how a voice-agent stack actually composes, how you hit the sub-300ms first-byte target, where the build/buy line sits, and what voice-cloning consent looks like in practice — without the breathless "AI is replacing call centers" or "AI voice is dangerous" framing on either side.
The one-liner: 2026 AI voice isn't a story of "TTS got better." It's the story of "the whole stack can now run end-to-end under 300ms." Understand that, and tool selection gets easy.
1. How the Category Was Born — What Happened in 2023~2024
1.1 Three Lineages of Speech Synthesis
AI voice synthesis is actually a 30-year-old field. Early on it was concatenative TTS (gluing recorded fragments), then parametric TTS (predicting acoustic parameters with statistical models), and from 2017 onward, neural TTS (WaveNet, Tacotron). The direct ancestors of what we use today are two threads from 2020 onward.
Thread 1: multi-speaker neural TTS. Take text and a speaker embedding, synthesize in any target voice. ElevenLabs started in this lineage when it was founded in November 2022.
Thread 2: autoregressive codec models. Apply text-LLM ideas to audio directly. Neural audio codecs (EnCodec, SoundStream) compress audio into tokens; a transformer then learns the sequence. Microsoft's VALL-E (January 2023), Meta's Voicebox (June 2023), and OpenAI's Whisper (STT, September 2022) all live in this lineage.
By late 2023 and early 2024 the two threads started fusing. ElevenLabs went hybrid autoregressive plus diffusion. Microsoft shipped VALL-E 2. OpenAI dropped audio tokens directly inside a multimodal LLM (GPT-4o).
1.2 The Inflection Point — The May GPT-4o Demo
In May 2024, OpenAI unveiled GPT-4o with a voice-in voice-out demo. The user spoke, the model heard, the same model answered with speech. Interruptions worked naturally, emotion came through, the thing even sang. The entire category got redrawn that day.
But shipping took longer than the demo suggested — a limited rollout in July, the Realtime API beta in October 2024, then the August 2025 GA of the GPT-Realtime model. That interval gave Anthropic, Google, and Cartesia time to ship their own answers.
1.3 The Voice-Cloning Bombshell — Sky and Scarlett Johansson
On May 14, 2024, OpenAI launched "Sky" as one of the GPT-4o voices. Actress Scarlett Johansson had previously declined OpenAI's offer to use her voice and went public when Sky landed sounding uncannily like her. OpenAI pulled Sky immediately.
The signal to the whole industry was loud and clear. Voice-cloning consent isn't a checkbox in a ToS — it's the legal and ethical foundation of the whole product. Every major voice model since has required some kind of verification that you actually have the right to clone the voice you're cloning.
1.4 Why Did Things Suddenly Get Good
The same three variables as in every other generative-media category.
- Data. Licensed multi-speaker datasets (LibriTTS, GigaSpeech, Common Voice) got richer, and the major labs license tens of thousands of hours of speech data on top.
- Compute. H100/H200 clusters made it feasible to train multi-billion-parameter audio models in reasonable wallclock time.
- Architecture. Neural audio codec plus transformer plus multi-speaker embedding plus diffusion decoder is now the standard recipe.
What really mattered in 2024~2025 was that low-latency streaming became table stakes. Previously you sent the full text and got back 30 seconds of audio in one batch. Now you stream text tokens in and audio chunks out. That single change is what made voice agents real.
2. TTS Leaders — ElevenLabs, Cartesia, OpenAI, Sesame
2.1 ElevenLabs — The Category Leader
As of May 2026, the text-to-speech product with the most users is ElevenLabs. Founded November 2022, Series B in January 2024 ($180M at a $3.3B valuation), and through 2026 it has been expanding into a multimodal voice company.
The product lines.
- TTS API. Multilingual v2 is the baseline, Turbo v2.5 is the low-latency tier, and Flash v2.5 is the fastest, lowest-cost tier. The v3 family rolled out in beta in May 2026.
- Voice Design v2. Design a new voice from a text prompt ("warm, mid-30s female narrator, slight British accent"). v2 update landed January 2026.
- Voice Cloning. Instant (30-second sample, fast clone) and Professional (30+ minute sample, high-quality clone).
- Conversational AI. Beta in November 2024, GA in January 2025. STT/LLM/TTS bundled into a voice-agent builder. The product line that took ElevenLabs up the stack.
- ElevenMusic. Music side (covered in the previous post).
- ElevenStudio. Dubbing and translation — carrying a video's voices over into another language.
Quality? Thirty-two languages including English, Japanese, Korean, Spanish, French, German. Korean voice quality got visibly better through 2025 — but fine emotional control in Korean (sarcastic tones, restrained sadness) is still weaker than in English.
Pricing (May 2026).
- Free: 10,000 credits/month
- Starter: $5/month, 30,000 credits
- Creator: $22/month, 100,000 credits, commercial use
- Pro: $99/month, 500,000 credits
- Scale: $330/month and up
- Enterprise: custom
2.2 Cartesia — The Low-Latency Champion
Cartesia was founded in February 2024. Co-founders Karan Goel and Albert Gu did state-space-model research at Stanford (Gu co-authored Mamba). $64M Series A in March 2025 (at a $300M valuation), and a follow-on Series B in January 2026.
The flagship is the Sonic family — Sonic-1 (2024) and Sonic-2 (September 2025). Sonic-2's time-to-first-byte is 75ms (vendor figure, May 2026), the lowest on the market. This is the model that made the sub-300ms first-byte target for voice agents realistic for the first time.
Quality is competitive with ElevenLabs in subtle ways. On plain English sentences they're roughly equal. On expressive voices (dramatic narration), ElevenLabs edges ahead. On low-latency voice-agent scenarios, Cartesia is decisively ahead.
Pricing (May 2026).
- Free: 50,000 chars/month
- Creator: $5/month, 100,000 chars/month
- Pro: $49/month, 1,000,000 chars/month
- Scale: $299/month
- Enterprise: custom
2.3 OpenAI Realtime — The Move That Reshaped the Category
OpenAI's Realtime API launched in beta in October 2024 and stabilized in August 2025 alongside the GPT-Realtime model's GA. By adopting WebRTC as a standard transport, it changed what "voice agent" meant.
Key properties.
- Voice-in voice-out. Not a three-stage STT/LLM/TTS pipeline — a single multimodal model handling all three. Lower theoretical latency.
- WebRTC. One line of browser code to connect. UDP-based, so it's much more tolerant of packet loss than WebSocket.
- Function calling (tool use). The model can invoke external functions mid-conversation. A baseline requirement for voice agents.
- VAD (Voice Activity Detection). The model itself decides whether the user has finished speaking. Server-side semantic VAD is the default.
- Interruption. If the user starts speaking while the model is speaking, the model stops immediately.
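The properties above are delivered as a stream of typed JSON events over the connection. A sketch of the server-side event handling, with no network attached: the event names follow the general shape of the Realtime API's published event types ("input_audio_buffer.speech_started", "response.audio.delta", "response.done"), but treat them as assumptions and check the current API reference.

```python
import base64
import json

class RealtimeSession:
    """Minimal dispatcher for Realtime-style session events (mock, no socket)."""

    def __init__(self):
        self.audio_out = bytearray()   # accumulated response audio for playback
        self.interrupted = False

    def handle(self, raw: str) -> str:
        event = json.loads(raw)
        etype = event.get("type", "")
        if etype == "input_audio_buffer.speech_started":
            # Barge-in: the user started talking over the model,
            # so drop the in-flight response audio immediately.
            self.interrupted = True
            self.audio_out.clear()
        elif etype == "response.audio.delta" and not self.interrupted:
            # Audio arrives as base64 chunks; append to the playback buffer.
            self.audio_out += base64.b64decode(event["delta"])
        elif etype == "response.done":
            self.interrupted = False
        return etype
```

In a real integration the same dispatch loop sits behind a WebRTC data channel or WebSocket; the point is that interruption handling is just another event branch.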
Pricing (May 2026, GPT-Realtime).
- Audio input: $40 per 1M tokens
- Audio output: $80 per 1M tokens
- Cached input: $2.50 per 1M tokens
The catch with OpenAI Realtime is that you have almost no model choice — you're locked to GPT-Realtime. If you want to run Claude or Gemini, you fall back to the traditional STT plus text-LLM plus TTS pipeline.
2.4 Sesame — A Conversational Model With Personality
Sesame AI is a newer faction that surfaced publicly in early 2025. Founder Brendan Iribe was a co-founder and CEO of Oculus VR. That background gives them a "voice and device fused together" vision that feels very specific.
The product is CSM (Conversational Speech Model). When the demo went public in February 2025, the internet genuinely shook — the most natural, most personality-laden, most human-feeling voice anyone had tried. The model lands a joke, hesitates briefly, switches tone abruptly — the small human details are there.
The technology under CSM.
- Speech generated by an end-to-end multimodal LLM. Unlike conventional TTS, the LLM emits audio tokens directly.
- Personality-based training. Started with two characters ("Maya" and "Miles"), each with its own speech style baked into training data.
- Beta as of May 2026. Open API access is still limited; mostly demos and selective partner integrations.
The implication is big — voice is now competing on "personality and expressiveness," not just "technical fidelity."
2.5 The Rest
- Azure Speech. Microsoft's enterprise TTS. Widest voice catalog (140+ languages, 600+ voices) and battle-tested reliability. Naturalness is half a step behind ElevenLabs/Cartesia.
- Google Cloud TTS. Vertex AI integration. Chirp 3 HD voices closed the quality gap meaningfully.
- AWS Polly. Amazon's classic TTS, now with Generative voice options. Pricing and SLA are attractive.
- Play.ht. Consumer side, strong with podcasters and YouTubers.
- Resemble AI. Voice-cloning specialist, B2B.
- Coqui XTTS. Open-source TTS. The company itself shut down in 2024, but the weights live on GitHub.
2.6 TTS Comparison
| Tool | Time-to-First-Byte | Naturalness | Voice Variety | Korean | Price Tier | Primary Use |
|---|---|---|---|---|---|---|
| ElevenLabs v3 | about 200~400ms | very high | very wide | good | mid-high | content, B2B agents |
| Cartesia Sonic-2 | about 75ms | high | wide | fair | mid | low-latency agents |
| OpenAI Realtime | about 300~500ms E2E | high | limited | good | high | multimodal agents |
| Sesame CSM | not disclosed | very high (personality) | character-bound | unrated | beta | next-gen conversation |
| Azure Speech | about 200~300ms | fair to high | very wide | good | mid | enterprise |
| Google TTS Chirp 3 | about 200~400ms | high | wide | good | mid | GCP-integrated |
| AWS Polly Generative | about 300~500ms | fair to high | wide | fair | low to mid | AWS-integrated |
3. STT Leaders — Deepgram, AssemblyAI, Whisper
3.1 Deepgram Nova-3
Deepgram, founded 2015, is one of the oldest pure-play STT shops. Series C in June 2024 ($100M), additional round in January 2026.
The current flagship is Nova-3 (GA June 2025). Versus Nova-2 it gained ground on accuracy, latency, and price simultaneously.
- WER (Word Error Rate). English 7.7% (Nova-2: 8.4%), multilingual average 12.3% (Nova-2: 15.1%). Measured on 2026 standard benchmarks (CommonVoice, Earnings-22).
- Latency. Streaming first-word about 250ms; batch processes a one-hour file in roughly 30 seconds.
- Multilingual. 30+ languages including Korean, with code-switching handling (two languages in one utterance).
- Diarization. Speaker separation noticeably better than Nova-2.
- Smart Format. Auto-formats numbers, currency, emails, phone numbers.
Pricing (May 2026).
- Pre-recorded: $0.26/hour
- Streaming: $0.0058/min
- Enhanced (higher-tier models): additional cost
Deepgram's strength is the low-latency streaming + price + B2B reliability triangle. Vapi, Retell, Bland and similar platforms default to Deepgram for STT.
3.2 AssemblyAI Universal-2
AssemblyAI was founded in 2017, a Y Combinator alum. Deepgram's most direct competitor.
The flagship is Universal-2 (GA in late 2025). Visibly more accurate than Universal-1, and notably strong on "formatting and readability."
- WER. English 6.6%, multilingual average 11.8%. Slightly more accurate than Deepgram Nova-3 on some benchmarks.
- Timestamps. Word-level timestamps and speaker diarization are extremely precise.
- Language detection plus code-switching. Automatic.
- Speaker diarization. One of the most accurate in the market.
- Extras. Sentiment analysis, entity detection, topic detection, summarization, PII redaction all in the same API.
Pricing (May 2026).
- Best model: $0.37/hour (batch)
- Universal-2: $0.27/hour
- Streaming: $0.47/hour
AssemblyAI's edge is post-processing integration (summaries, sentiment, entities). Call-center analytics and meeting notes are the sweet spots.
3.3 Whisper and WhisperX — The Open-Source Baselines
OpenAI Whisper landed as open source in September 2022 — multilingual STT, MIT licensed. It's still the standard for "self-host to save money" or "don't send data out."
Whisper Large V3 Turbo (October 2024) — roughly 8x faster than V3 at similar quality. The strong open-source baseline.
WhisperX (2023~2025) — adds forced alignment, voice activity detection, and speaker diarization on top of Whisper. The de facto standard when you need precise word-level timestamps.
Faster-Whisper — CTranslate2-backed optimization. About 4x faster than vanilla Whisper on GPU.
Performance (English LibriSpeech test-clean).
- Whisper Large V3 Turbo: WER about 3.1%
- Faster-Whisper Large V3: WER about 3.4%
- WhisperX (timestamp accuracy): very high
Open-source Whisper's limits are (a) no true real-time streaming (chunked workarounds only), (b) speaker diarization requires a separate model, and (c) you carry the operational burden yourself.
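Limit (a) is usually worked around by sliding a fixed window with some overlap over the incoming audio and stitching the per-window transcripts. The window math looks like this; the 5-second chunks and 1-second overlap are illustrative defaults, not Whisper requirements.

```python
def chunk_windows(total_samples: int, sr: int = 16_000,
                  chunk_s: float = 5.0, overlap_s: float = 1.0):
    """Return (start, end) sample ranges for overlapping pseudo-streaming
    windows. Overlap lets the stitcher deduplicate words cut at a boundary."""
    step = int((chunk_s - overlap_s) * sr)   # how far each window advances
    size = int(chunk_s * sr)                 # samples per inference call
    windows, start = [], 0
    while start < total_samples:
        windows.append((start, min(start + size, total_samples)))
        if start + size >= total_samples:    # last window reached the end
            break
        start += step
    return windows
```

Each window then goes through Whisper as an ordinary batch call; per-window latency is bounded by the window length plus inference time, which is why this is "pseudo" streaming rather than the word-at-a-time streaming the SaaS vendors offer.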
3.4 STT Comparison
| Model | WER (English) | WER (multilingual) | Latency (streaming) | Price ($/hour) | License | Korean |
|---|---|---|---|---|---|---|
| Deepgram Nova-3 | 7.7% | 12.3% | about 250ms | 0.26 | commercial SaaS | good |
| AssemblyAI Universal-2 | 6.6% | 11.8% | about 400ms | 0.27 | commercial SaaS | good |
| Whisper Large V3 Turbo | 3.1% | 7~12% (varies) | not supported (chunked workaround) | $0 self-host | MIT | good |
| WhisperX | 3.1% (Whisper base) | same | not supported | $0 | BSD-4 | good |
| Faster-Whisper | 3.4% | same | not supported | $0 | MIT | good |
| Azure Speech STT | about 8% | about 13% | about 300ms | 1.0 | enterprise | good |
| Google STT Chirp 3 | about 7% | about 12% | about 300ms | about 0.4 | enterprise | good |
Caveat: WER numbers are extremely sensitive to benchmark and domain. On noisy call-center audio, Whisper's WER might be far worse than the SaaS leaders' — or far better. Measure on your own domain data.
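A minimal WER implementation is enough to run that measurement yourself: the standard word-level edit distance over reference and hypothesis transcripts. No text normalization is applied here, and casing/punctuation handling will move the numbers, so normalize consistently before comparing vendors.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / len(ref)
```

Run it over a few hundred utterances of your own recorded audio per vendor and compare distributions, not single numbers.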
4. Voice-Agent Platforms — Vapi, Retell, Bland, Hume
4.1 Vapi — The Platform Layer
Vapi was founded in 2023. $64M Series B in June 2025 (at a $600M valuation). One of the fastest-growing companies in the voice-agent category.
Vapi's positioning is "the orchestration layer for STT/LLM/TTS." They don't build the models — they let you compose voice agents from the best of each (Deepgram, OpenAI, ElevenLabs, Cartesia, etc.).
Key features.
- Modular stack. STT (Deepgram/AssemblyAI), LLM (OpenAI/Anthropic/Google), TTS (ElevenLabs/Cartesia/PlayHT), all swappable.
- Turn detection. Semantic-VAD-based decision about whether the user has finished speaking.
- Interruption handling. When the user starts talking mid-response, the model stops immediately.
- Function calling. Outbound API calls during a conversation (booking systems, CRM lookups).
- Phone integration. Twilio/Vonage/Telnyx for actual PSTN numbers.
- Recording plus analytics. All calls recorded; dashboard with search, filtering, analysis.
Pricing (May 2026).
- Free tier: 10 minutes/month
- Pay-as-you-go: $0.05~$0.20/min depending on stack choice
- Enterprise: custom
Vapi's selling point is "fast to build plus no model lock-in." You can spin up an MVP in a weekend and change models with a config flag.
4.2 Retell — Vapi's Closest Rival
Also founded 2023, also a B2B voice-agent platform. Very similar positioning to Vapi, but more emphasis on enterprise call reliability.
- High-quality call infrastructure. Deeper Twilio integration, stronger call-stability SLAs.
- Agent Studio. A more polished no-code/low-code builder.
- Analytics. Auto-classification of call outcomes, per-call analysis.
Pricing is in the same neighborhood as Vapi ($0.07~$0.18/min).
4.3 Bland — Phone-Call Automation Specialist
Bland AI focuses on a specific use case — "an AI that talks to people on the phone." More specialized for inbound and outbound call-center automation than general voice agents.
- High concurrency. Thousands of simultaneous calls.
- Workflow builder. Branching logic, variable extraction, CRM integration.
- Voice cloning. Clone a voice that matches the sales tone of the company.
- Compliance. TCPA (U.S. telemarketing regulation) tooling.
Target markets: sales callbacks, appointment setting, customer surveys, collections.
4.4 Hume EVI — Emotional Voice
Hume AI sits in a different camp. They start from "voice carries emotion" as a thesis. EVI (Empathic Voice Interface) is specifically designed to recognize the emotional tone in a user's voice and to put emotion into the response.
- EVI 4 (early 2026). Improved tone-classification accuracy and response-emotion precision.
- Use cases. Mental health bots, coaching, care calls.
- Limits. Whether the model's emotion classification matches lived user experience is still being validated in the wild.
4.5 Voice-Agent Platform Comparison
| Platform | Positioning | Primary Use | Model Choice | Price ($/min) | Differentiator |
|---|---|---|---|---|---|
| Vapi | orchestration layer | any voice agent | very wide (every major) | 0.05~0.20 | fast build, no lock-in |
| Retell | enterprise calls | call center, B2B sales | wide | 0.07~0.18 | call stability, Studio |
| Bland | phone automation | sales, scheduling, surveys | own plus some | 0.10~0.15 | high concurrency |
| ElevenLabs Conversational AI | integrated stack | content/B2B agents | ElevenLabs-first | session-based | bundled voices |
| OpenAI Realtime | direct API | bring-your-own build | GPT-Realtime locked | token-based | shortest E2E latency |
| Hume EVI | emotion-aware | healthcare, care | EVI models | custom | tone analysis |
5. The Voice-Agent Stack — How One Call Actually Flows
5.1 The Traditional Three-Stage Pipeline
Most voice agents chain three models.
user speech audio
│
▼
[STT] Speech-to-Text
(e.g., Deepgram Nova-3 streaming)
│
▼ text tokens
[LLM] Large Language Model
(e.g., GPT-5, Claude Opus 4.7, Gemini 2.5)
│
▼ response text
[TTS] Text-to-Speech
(e.g., Cartesia Sonic-2 streaming)
│
▼
model response audio
The big win of independent stages is interchangeability — swap any model without touching the others. Whisper for STT, Claude for LLM, ElevenLabs for TTS, in any combination. Vapi/Retell exist to manage that combinatorial space.
The big downside is cumulative latency. Even 100ms per stage adds up to 300ms before network RTT, and 400~500ms total is easy to hit.
5.2 End-to-End Multimodal Models
OpenAI Realtime and some next-gen models (Sesame CSM, GPT-4o's voice mode) work differently. A single model takes speech in and emits speech directly.
user speech audio
│
▼
[E2E Multimodal LLM]
- speech tokens in
- text/speech tokens out
- streamed over WebRTC
│
▼
model response audio
Pros — potentially shorter latency (no intermediate conversions), more natural emotion and intonation (STT doesn't throw away tone). Cons — no model choice, higher pricing, harder to fine-tune.
5.3 The Supporting Components
A working voice agent isn't STT/LLM/TTS in isolation. These extras are mandatory.
VAD (Voice Activity Detection). Is the user speaking or silent right now? Silero VAD and WebRTC VAD are the open-source standards. A more sophisticated form is semantic VAD — "has the user finished speaking?" decided semantically (did the question end, is the user still thinking out loud).
Turn detection. Is it the model's turn to speak now? Starts at simple VAD (silence for 300ms) and evolves into more nuanced models. OpenAI Realtime offers server-side semantic VAD as an option.
Endpointing. Find the precise end of an utterance. Pauses in the middle of "uh... so..." must not be mistaken for the end of the turn.
Interruption handling. When the user starts speaking mid-response, (a) stop the current TTS immediately, (b) reprocess the new user utterance, and (c) reflect "the user interrupted" in conversation state.
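The VAD/turn-detection/endpointing/interruption quartet above can be sketched as a tiny frame-level state machine. This is the naive silence-timeout variant — real products use Silero VAD or semantic endpointing — and the 500ms timeout is an assumed default, not anyone's shipped value.

```python
from dataclasses import dataclass

@dataclass
class TurnDetector:
    """Naive frame-based endpointing: end a turn after sustained silence.
    The "speech_started" event doubles as the barge-in signal when the
    agent is mid-response."""
    silence_threshold_ms: int = 500   # assumed timeout; tune per product
    frame_ms: int = 20                # typical VAD frame size
    _silence_ms: int = 0
    _in_utterance: bool = False

    def feed(self, is_speech: bool):
        """Feed one frame's VAD verdict; return an event name or None."""
        if is_speech:
            self._silence_ms = 0
            if not self._in_utterance:
                self._in_utterance = True
                return "speech_started"       # interrupt TTS if agent talking
        elif self._in_utterance:
            self._silence_ms += self.frame_ms
            if self._silence_ms >= self.silence_threshold_ms:
                self._in_utterance = False
                self._silence_ms = 0
                return "end_of_turn"          # hand transcript to the LLM
        return None
```

The weakness the text calls out is visible here: a mid-sentence "uh... so..." pause longer than the timeout fires a false end-of-turn, which is exactly why semantic VAD exists.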
Conversation state management. Past turns, user-made promises, model-made promises, variables (customer name, order number) — all tracked. The LLM's context window plus external memory.
Tool use / function calling. Outbound API calls during the conversation. "Move my appointment to 12:30" should trigger updateAppointment(id, newTime).
Monitoring and analytics. Call recording, transcription, sentiment analysis, outcome categorization, dashboards. The operational backbone.
5.4 The Real System Diagram
┌─────────────────────────┐
[phone ─── PSTN ─── Twilio]───────▶│ Voice Agent Platform │
│ (Vapi / Retell / etc) │
└────────────┬─────────────┘
│
┌──────────────────────────────────┼──────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ STT │ │ LLM │ │ TTS │
│ Deepgram Nova-3 │──text tokens─▶│ Claude / GPT │─response text▶│ Cartesia Sonic-2 │
│ (streaming WSS) │ │ (streaming SSE) │ │ (streaming WSS) │
└────────▲─────────┘ └────────▲─────────┘ └────────┬─────────┘
│ │ │
│ audio chunks │ context │ audio chunks
│ │ │
┌────────┴─────────────────────────────────┴─────────────────────────────────┴────────┐
│ Conversation Orchestrator │
│ - VAD (Silero / server-side semantic VAD) │
│ - Turn detection │
│ - Endpointing │
│ - Interruption handling │
│ - State management (past turns plus variables) │
│ - Tool-use router (booking system / CRM / DB) │
└────────────────────────────┬──────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌──────────────┐
│ Recording │ │ Analytics │ │ Compliance │
│ Storage │ │ Dashboard │ │ PII Redact │
└─────────────┘ └─────────────┘ └──────────────┘
What this shows — three models, but many more system components. That's why Vapi and Retell create value. Building all of this from scratch is a six-month project.
6. Latency as the Absolute Metric — The Sub-300ms Target
6.1 Why 300ms
In natural human conversation, the gap between turns averages 200~300 milliseconds. Beyond that it starts to feel like awkward silence; beyond 700~800ms the listener wonders if you heard them.
For a voice agent to feel natural, time-to-first-byte (TTFB) — user finishes speaking to model's first audio byte — has to be under 300ms. 350~500ms is "a little awkward but acceptable," and beyond 500ms people start describing the experience as weird.
6.2 The Latency Budget
To hit TTFB 300ms, you have to budget each stage like this.
| Stage | Budget | Notes |
|---|---|---|
| Network RTT (round trip) | 50~100ms | depends on user location |
| Endpointing (end-of-utterance detection) | 30~80ms | semantic VAD is fastest |
| STT final transcript | 50~150ms | streaming partials arrive earlier |
| LLM time-to-first-token | 100~300ms | strongly dependent on model and prompt size |
| TTS first audio chunk | 50~200ms | Cartesia's 75ms is the market floor |
| Total | about 300~800ms | floors sum to about 300ms; typical totals exceed 500ms |
The takeaway — even at the floor of every stage, 300ms is tight. So you have to (a) collapse stages with an E2E model, (b) crush each stage to its floor, or (c) start responding speculatively before the user finishes.
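The budget table can be sanity-checked in a few lines; the figures mirror the table above and are planning numbers, not measurements.

```python
# Per-stage latency budget: stage -> (floor_ms, typical_ms).
BUDGET_MS = {
    "network_rtt":     (50, 100),
    "endpointing":     (30, 80),
    "stt_final":       (50, 150),
    "llm_ttft":        (100, 300),
    "tts_first_chunk": (50, 200),
}

floor_ms = sum(lo for lo, _ in BUDGET_MS.values())
typical_ms = sum(hi for _, hi in BUDGET_MS.values())
# floor_ms comes out at 280: even best-case stages leave almost no
# slack under a 300ms target, while typical values land past 800ms.
```

This is the arithmetic behind "300ms is tight": the target is essentially the sum of every stage's best case.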
6.3 Optimization Tricks
1. Speculative response. The LLM starts drafting before the user finishes. When the user does finish, you either emit what's drafted or quickly correct it. Risk: if the user adds more, the draft becomes awkward.
2. Stream everything. STT emits partial transcripts; LLM streams tokens; TTS makes audio chunks as text chunks arrive. Batch in any one stage means batch end-to-end.
3. Short prompts. LLM TTFT scales almost linearly with prompt length. Keep system prompts tight and rely on prompt caching for context.
4. Caches and warm pools. Pre-spin voice-agent instances and keep them warm. Avoid the cold start on the first call.
5. Geographic proximity. Inference servers must be close to the user. Multi-region deployment is non-negotiable.
6. End-to-end models. OpenAI Realtime collapses stages and eliminates intermediate transformation delays.
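Trick 2 is the structural one, and it can be illustrated with a toy generator chain: each stage yields as soon as it has output, so the first audio chunk exists before the LLM has finished its sentence. The stages here are stand-ins, not real STT/LLM/TTS clients.

```python
def llm_tokens(prompt: str):
    # Pretend LLM: streams a canned reply token by token.
    for word in ["Sure,", "your", "order", "shipped", "today."]:
        yield word

def tts_chunks(tokens):
    # Pretend TTS: flushes audio at clause boundaries instead of
    # waiting for the full response text.
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith((",", ".", "!", "?")):
            yield ("audio:" + " ".join(buf)).encode()
            buf = []
    if buf:
        yield ("audio:" + " ".join(buf)).encode()

first_chunk = next(tts_chunks(llm_tokens("where is my order?")))
# first_chunk covers just "Sure," — playback can start immediately.
```

If either generator were replaced with a list comprehension (i.e., batched), the first chunk would wait for the slowest stage — which is the "batch in any one stage means batch end-to-end" point.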
6.4 Measurement and SLAs
Latency is a distribution, not an average. p50 of 250ms with p99 of 2 seconds means 1% of turns feel awkward. Calls have dozens to hundreds of turns, so p99 awkwardness shows up multiple times per call.
Common operational SLAs.
- p50 TTFB < 300ms
- p95 TTFB < 600ms
- p99 TTFB < 1000ms
- Interruption responsiveness < 200ms
These metrics need to be measured per turn, not per call, to be meaningful.
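Checking those SLAs against per-turn samples is a one-function job; this uses the nearest-rank percentile to stay dependency-free, and the sample TTFB values are made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Per-turn TTFB samples in ms (one entry per turn, across all calls).
ttfb = [220, 240, 250, 260, 270, 280, 290, 310, 550, 1400]

sla_ok = (percentile(ttfb, 50) < 300 and
          percentile(ttfb, 95) < 600 and
          percentile(ttfb, 99) < 1000)
# The two tail samples blow the p95 and p99 budgets even though the
# median looks perfectly healthy — the distribution point in the text.
```

In production you would compute this over a sliding window per region and alert on the tail percentiles, not the mean.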
7. Use Cases — Where AI Voice Actually Works
7.1 First-Line Call-Center Triage
The use case that landed fastest. The reason is simple — high-volume repetitive calls, defined workflows, and almost every call spends its first 30 seconds on the same questions.
A typical workflow.
- Inbound. Customer calls → AI agent answers → "What can I help you with?" → intent classification (order status / shipping / refund / other) → context load → resolution or handoff to a human.
- Outbound. AI agent calls → "Hi, this is XYZ Apparel calling about your shipment" → simple update or appointment setting.
Field results.
- Self-resolution rate 30~60% (varies by industry and question type)
- 30~50% reduction in average call duration
- 70~90% cost reduction versus human agents
- CSAT: usually flat or slightly down (fine emotional handling still favors humans)
Stack: Vapi/Retell + Deepgram + Claude/GPT + ElevenLabs/Cartesia.
7.2 Appointment Scheduling
Dental offices, salons, small clinics. The typical workflow is "what day/time works for you?" → check the scheduling system → present options → confirm → SMS confirmation.
This is the best use case for tool use — the model calls getAvailableSlots(date), then bookSlot(slotId, customerInfo).
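A sketch of that tool-use round trip, using the getAvailableSlots/bookSlot names from the text as illustrative function names and an in-memory store in place of a real scheduling system.

```python
# Hypothetical scheduling backend (in-memory stand-in).
SLOTS = {"s1": "12:00", "s2": "12:30"}
BOOKED = {}

def get_available_slots(date: str):
    return [sid for sid in SLOTS if sid not in BOOKED]

def book_slot(slot_id: str, customer: str) -> dict:
    if slot_id in BOOKED:
        return {"ok": False, "error": "slot taken"}
    BOOKED[slot_id] = customer
    return {"ok": True, "time": SLOTS[slot_id]}

def dispatch(call: dict):
    """What the orchestrator does when the LLM emits a tool call."""
    tools = {"getAvailableSlots": lambda a: get_available_slots(a["date"]),
             "bookSlot": lambda a: book_slot(a["slotId"], a["customer"])}
    return tools[call["name"]](call["arguments"])

result = dispatch({"name": "bookSlot",
                   "arguments": {"slotId": "s2", "customer": "Kim"}})
```

The dispatch result is serialized back into the conversation as the tool's output, and the TTS then speaks the model's confirmation ("You're booked for 12:30").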
7.3 Podcasts and Audiobook Narration
Long-form content generation. ElevenLabs is strongest here.
The workflow.
- Write the script
- Pick a voice or clone your own
- Synthesize the whole script via the ElevenLabs API
- Post-process (music, SFX, mastering)
Cost: a one-hour audiobook fits comfortably in one month's ElevenLabs Pro plan ($99). Against human narration rates ($200~$500/hour), the savings are dramatic.
Quality: humans still win on fine emotional moments (a grieving scene), but by late 2025 most listeners couldn't tell the difference in mainstream content.
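Step 3 in practice means splitting the script first, since TTS APIs cap characters per request. A sentence-boundary chunker handles that; the 2,500-character limit here is an assumption for illustration — check your provider's actual limit.

```python
import re

def chunk_script(script: str, max_chars: int = 2500):
    """Split a long script into TTS-sized chunks at sentence boundaries,
    so no synthesis request cuts a sentence in half."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)    # flush the full chunk
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes out as one synthesis request, and the audio files are concatenated in post-processing.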
7.4 Accessibility
Screen readers for blind users, real-time captions for deaf users. AI voice has been here for a long time, but quality has improved usability dramatically.
- VoiceOver (macOS/iOS) and TalkBack (Android) are gradually adopting ElevenLabs/Cartesia-grade voices.
- Live Caption (Pixel phones), Otter.ai, and similar live-captioning products lean heavily on Whisper/Deepgram.
7.5 Voice Cloning — Authentication and Memory
Preserving your own voice, or recreating a family member's voice (a deceased relative, for instance). Technically a 30-second sample is enough — but this is also the area with the thickest ethical and legal gray zone.
- The person is alive and consenting → clearly OK
- The person is deceased, with family consent → legally ambiguous (depends on jurisdiction's rights of the deceased)
- The person is alive but didn't consent → obviously unlawful (the deepfake zone)
ElevenLabs requires "Voice Verification" — the person whose voice is being cloned must record a verification phrase directly with ElevenLabs.
7.6 Where It Doesn't Work
Honestly.
- Complex call-center complaint handling. Calming down an angry customer still favors humans.
- Legal or medical advice. Accuracy and liability rule out unsupervised AI voice.
- Creative collaboration (like a voice director with an actor). Fine direction is still very human.
- Low-resource languages. English, Spanish, Chinese are great. Languages with thin training data (Vietnamese, Swahili) lag noticeably.
- Real-time interpretation. Useful but still behind on both latency and accuracy.
8. Build vs. Buy — An Honest Decision Frame
8.1 Three Paths
When you set out to build a voice agent, you have three paths.
Path A: Pure SaaS. Use ElevenLabs Conversational AI, Air AI, or just the no-code builders inside Vapi/Retell. Build time: days. Cost: roughly $0.20~$0.30/min. Control: low.
Path B: Platform plus custom. Vapi or Retell as a base; you write function calls and workflow logic. Build time: 1~4 weeks. Cost: $0.05~$0.20/min plus engineering time. Control: medium-high.
Path C: Full build. Compose STT/LLM/TTS yourself and write VAD, endpointing, and state management from scratch. Build time: 3~6 months. Cost: API bills plus 2~3 full-time engineers. Control: very high.
8.2 Decision Tree
start
│
├─ Call volume below 1,000 min/month?
│ └─ yes → Path A or Path B. Path C is never justified here.
│
├─ Industry-specific compliance needed? (HIPAA, PCI, SOC2)
│ ├─ yes → Path B (Vapi enterprise tier plus compliance add-ons) or
│ │ Path C (full self-host)
│ └─ no ↓
│
├─ Call volume above 100,000 min/month?
│ └─ yes → Run the cost math. SaaS unit cost times volume vs. self-host.
│ Usually a Path B enterprise contract is optimal.
│
├─ Does model choice matter? (e.g., a specific LLM is required)
│ ├─ yes → Path B (Vapi's modular models)
│ └─ no → Path A (fastest start)
│
├─ Is fine UX control absolutely necessary? (response tone, interruption policy)
│ ├─ yes → Path C is worth considering
│ └─ no → Path B
8.3 Cost Comparison
Rough monthly cost by volume (average stack pricing).
| Monthly minutes | Path A ($0.20/min) | Path B ($0.10/min) | Path C (self-host) |
|---|---|---|---|
| 1,000 | $200 | $100 | thousands in salary alone |
| 10,000 | $2,000 | $1,000 | salary plus about $300 infra |
| 100,000 | $20,000 | $10,000 | salary plus about $2,000 infra |
| 1,000,000 | $200,000 | $100,000 | salary plus about $20,000 infra |
The implication — Path C only starts to make pricing sense above about 1M minutes/month (12M minutes/year). Below that, SaaS almost always wins once you account for the operational burden you avoid.
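The break-even point is easy to reproduce from the table. A rough sketch, using the table's per-minute rates and its ~$0.02/min infra figure for Path C; the $50k/month salary line for 2~3 engineers is an assumption, not a number from the table:

```python
def monthly_cost(minutes: int) -> dict[str, float]:
    """Rough monthly cost per path in USD, from the table's figures.
    Path C salary is an ASSUMED $50k/month for 2~3 engineers."""
    return {
        "A": minutes * 0.20,
        "B": minutes * 0.10,
        "C": 50_000 + minutes * 0.02,  # fixed salary + ~$0.02/min infra
    }

for m in (10_000, 100_000, 1_000_000):
    costs = monthly_cost(m)
    cheapest = min(costs, key=costs.get)
    print(f"{m:>9,} min/month -> cheapest: Path {cheapest}  {costs}")
```

Under these assumptions Path C undercuts Path B somewhere between 100k and 1M minutes/month (solve 50,000 + 0.02m < 0.10m, i.e. m > 625k), which is consistent with the "about 1M minutes" rule of thumb once you add any margin for operational risk.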
8.4 Industry Patterns
- A voice feature in a B2B SaaS. Path A or Path B. Speed-to-launch dominates.
- Call-center replacement. Path B enterprise contracts. Call reliability and compliance dominate.
- Companies where the voice IP is itself an asset (advertising, media). Path C. Self-host the cloning model, keep data internal.
- Voice features in consumer apps. Path A or Path B. OpenAI Realtime or Vapi.
- Healthcare or finance compliance contexts. Path B compliance tier or Path C.
Epilogue — Checklist, Anti-Patterns, What's Next
AI voice went from the "wow, that's natural" GPT-4o demo shock of May 2024 to the "sub-300ms TTFB voice agents actually run" maturity of May 2026. Same pattern as music, images, and video — but the additional constraints of bidirectionality and absolute latency made the category richer.
The May 2026 takeaway is simple. For TTS quality alone, any major model is good enough. The real differentiators are (a) first-byte latency, (b) overall stack stability, (c) compliance and consent, and (d) the price-versus-volume balance. You need to see the stack, not just the model.
Tool-Selection Checklist
- TTS only, or a voice agent? — TTS only → ElevenLabs/Cartesia. Agent → Vapi/Retell or OpenAI Realtime.
- Is first-byte latency absolute? — Cartesia Sonic-2, or OpenAI Realtime with caching and a warm pool.
- Do you need model choice? — Vapi is the most flexible. ElevenLabs Conversational AI favors its own voices.
- Language other than English/Japanese/Korean? — Validate per-tool language quality on your domain.
- What's the monthly call volume? — Under 1M minutes, SaaS almost always wins.
- Compliance required? — HIPAA/PCI/SOC2 means enterprise contracts or self-hosting.
- Voice cloning needed? — ElevenLabs Voice Cloning or Resemble AI, with mandatory consent verification.
- STT accuracy critical? — Compare Deepgram vs. AssemblyAI vs. Whisper on your domain data.
- Tool use required? — Vapi, OpenAI Realtime, and ElevenLabs Conversational AI all support it.
- Analytics/recording/dashboards required? — Vapi/Retell give you these for free. DIY is heavy.
Anti-Patterns
| Anti-pattern | Why it's bad | Instead |
|---|---|---|
| Choosing the tool from model quality alone | End-to-end stack latency decides the experience | Evaluate first-byte latency and reliability too |
| Locking in on your first model choice | Models leapfrog each other every six months | A platform with modular models (Vapi) |
| Building on batch APIs first | No streaming means no voice agent | Streaming from day one |
| Naive silence-only VAD | Confuses mid-utterance pauses for the end of turn | Semantic VAD or proper endpointing |
| No interruption handling | Awkward when user talks over the model | Immediate TTS stop plus state update |
| Full context every turn | LLM TTFT balloons, latency collapses | Short system prompts, prompt caching |
| Skipping consent verification | Legal and reputational risk | Mandatory consent flow |
| Sending everything to one place | PII exposure risk | Self-host option or PII redaction |
| Average-only latency SLA | p99 awkwardness shows up multiple times per call | Measure p50/p95/p99 |
| Going Path C too early | Operational burden usually exceeds build cost | Stay on SaaS below 1M minutes/month |
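On the latency-SLA row: tracking p50/p95/p99 instead of the mean is trivial once you log per-turn time-to-first-byte. A minimal sketch with the standard library (the sample values are made up, but they show how a couple of slow turns hide behind a healthy-looking average):

```python
import statistics

# Per-turn time-to-first-audio-byte samples in ms (illustrative numbers).
ttfb_ms = [210, 240, 190, 260, 230, 1450, 220, 205, 250, 980]

q = statistics.quantiles(ttfb_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = q[49], q[94], q[98]
mean = statistics.mean(ttfb_ms)

# The mean looks fine; the tail is where the awkward pauses live.
print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

In a real deployment you would feed these from call logs per model/region and alert on p95, since with multi-minute calls a "rare" p99 stall still shows up several times per conversation.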
What's Next
The generative-media quartet closes here — music, images, video, voice. The next post pulls them together into a unified generative-media workflow. One prompt that produces music plus images plus video plus voice in a single pipeline. Whether to let Runway Gen-4, Veo 3, or Sora 3 generate voice natively, or to assemble a separate voice pipeline. The new standard for AI content creation, and how to fold each stage's model choice into a single matrix — this will be the synthesis post for the quartet.
References
- ElevenLabs
- ElevenLabs Conversational AI
- ElevenLabs Voice Design v2
- ElevenLabs Voice Cloning
- ElevenLabs Series C — TechCrunch
- Cartesia
- Cartesia Sonic-2 announcement
- Cartesia Series A announcement
- OpenAI Realtime API docs
- OpenAI Realtime API launch — TechCrunch
- GPT-Realtime GA — OpenAI
- Scarlett Johansson Sky controversy — NPR
- Sesame AI
- Sesame CSM launch — VentureBeat
- Deepgram
- Deepgram Nova-3 launch
- Deepgram Series D — TechCrunch
- AssemblyAI
- AssemblyAI Universal-2 launch
- OpenAI Whisper GitHub
- Whisper Large V3 Turbo discussion
- WhisperX GitHub
- Faster-Whisper GitHub
- Vapi
- Vapi Series B — TechCrunch
- Retell AI
- Bland AI
- Hume AI
- Hume EVI 4
- Microsoft VALL-E
- Meta Voicebox
- Mamba paper
- Silero VAD GitHub
- WebRTC for Voice AI — Cartesia guide
- Voice Agent Latency Best Practices — Vapi docs
- Twilio Voice AI integration
- Azure Speech Service
- Google Cloud TTS Chirp 3
- AWS Polly Generative Voices