Voice AI & TTS 2026 Deep Dive - ElevenLabs · Cartesia Sonic · OpenAI Voice · Play.HT · Hume · Sesame · Fish Audio · Deepgram Aura

Prologue — The year voice became the mouth and ears of the LLM

As of May 2026, the phrase "voice AI" carries entirely different weight than it did five years ago.

  • ElevenLabs v3 supports 32 languages, emotion tags, and 5-second cloning. It is the de-facto English-speaking TTS standard.
  • Cartesia Sonic is the fastest commercial TTS with 75ms TTFW (Time To First Word) and is the default TTS in LiveKit Agents.
  • OpenAI Realtime API has popularized the full-duplex model where STT, LLM, and TTS run over a single WebSocket.
  • Google Gemini Live and Anthropic Claude voice mode have established LLM-native voice.
  • Hume EVI 2 and Sesame's Maya/Miles demos (March 2025) redefined the limits of emotional and natural-sounding speech.
  • Fish Audio, CosyVoice 2, and F5-TTS rapidly took share in the open-source and Chinese-speaking world.
  • Deepgram Nova-3 pushed STT latency below 50ms; AssemblyAI Universal-2 and OpenAI GPT-4o transcribe compete on accuracy.
  • Orchestration tools like LiveKit Agents, Pipecat, Vapi, Retell AI, and Bland AI made the voice-agent stack standard.
  • Tennessee's ELVIS Act and the EU AI Act drew the first legal lines around voice cloning ethics.
  • Korea is dominated by Typecast (Neosapience) and Naver Clova Dubbing; Japan by CoeFont and VOICEVOX.

This article maps the whole field — which tool owns which slot, which metrics actually matter, and what to choose for a new 2026 project.


1. The 2026 voice stack — a four-tier pipeline

Today's voice AI breaks into four layers.

[ Tier 1 ] Input          - Microphone / WebRTC / SIP / Telephony
[ Tier 2 ] STT (ASR)      - Deepgram Nova-3 / AssemblyAI / GPT-4o transcribe / Whisper v3 turbo
[ Tier 3 ] LLM            - GPT-5 / Claude 4.5 / Gemini 2.5 Pro / Llama 4
[ Tier 4 ] TTS            - ElevenLabs / Cartesia / OpenAI / Play.HT / Hume / Sesame
[ Sidecar ] Orchestration - LiveKit Agents / Pipecat / Vapi / Retell / Bland
[ Sidecar ] Interruption  - VAD / barge-in / turn detection / endpointing

The classic STT → LLM → TTS pipeline is still the most common, but full-duplex LLM-native voice, proven in production by OpenAI Realtime (released October 2024) and Gemini Live, is taking ground fast.

| Stage | Key metric |
| --- | --- |
| STT | WER (Word Error Rate), first-partial latency, multilingual |
| LLM | TTFT (time to first token), TPS (tokens per second) |
| TTS | TTFW (time to first word), audio MOS, voice diversity |
| Full-duplex | End-to-end latency, interruption naturalness |

The conversational latency target is consistent: under 300ms to first audio.
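Vendor latency figures are only comparable if they are measured the same way. A minimal sketch of a TTFW probe follows, assuming the TTS client exposes its output as an iterator of audio byte chunks. The `fake_tts_stream` helper is hypothetical and only stands in for a real vendor stream, and the probe measures time to the first audio bytes, which approximates rather than equals time to first word.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfw(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = time.perf_counter()
    first_audio_s = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and first_audio_s is None:
            first_audio_s = time.perf_counter() - start
        total_bytes += len(chunk)
    if first_audio_s is None:
        raise ValueError("stream produced no audio")
    return first_audio_s, total_bytes


def fake_tts_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor stream: wait, then emit silent PCM chunks."""
    time.sleep(delay_s)
    for _ in range(n_chunks):
        yield b"\x00" * 960  # 20 ms of 16-bit mono audio at 24 kHz


ttfw_s, total = measure_ttfw(fake_tts_stream())
print(f"TTFW: {ttfw_s * 1000:.0f} ms, {total} bytes")
```

Wrapping a real streaming call in `measure_ttfw` from the same machine and region as production gives a number that includes network time, which is what the user actually hears.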


2. The metrics — latency, latency, latency

The most often ignored yet most important number in voice AI is the human perception threshold.

  • Under 200ms: feels like natural human conversation.
  • 200-500ms: slightly off but tolerable.
  • 500ms-1s: visibly slow.
  • Over 1s: sounds like an answering machine.

Traditional pipelines accumulate latency like this:

Mic -> VAD -> STT partial -> Endpoint -> LLM TTFT -> TTS TTFW -> Speaker
10ms  30ms     80ms          200ms       400ms       150ms      30ms
                              Total: ~900ms
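The budget above can be checked with a few lines of arithmetic. The sketch below uses the stage figures from the diagram and the perception bands from the list; the stage keys and the exact boundary handling (for example, whether 500ms counts as tolerable or slow) are assumptions made for illustration.

```python
# Stage latencies in milliseconds, taken from the pipeline diagram above.
BUDGET_MS = {
    "mic": 10,
    "vad": 30,
    "stt_partial": 80,
    "endpoint": 200,
    "llm_ttft": 400,
    "tts_ttfw": 150,
    "speaker": 30,
}


def classify(total_ms: float) -> str:
    """Map a response latency onto the perception bands listed above."""
    if total_ms < 200:
        return "natural"
    if total_ms < 500:
        return "tolerable"
    if total_ms <= 1000:
        return "visibly slow"
    return "answering machine"


total_ms = sum(BUDGET_MS.values())
# Stages sorted from most to least expensive: where optimization pays off first.
worst_first = sorted(BUDGET_MS, key=BUDGET_MS.get, reverse=True)

print(total_ms, classify(total_ms))  # 900 visibly slow
print(worst_first[:3])               # ['llm_ttft', 'endpoint', 'tts_ttfw']
```

Three stages (LLM first token, endpointing, TTS first word) account for 750 of the 900 ms, which is why the fixes that follow all target those stages.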

Cutting this to under 300ms requires three tricks:

  1. Streaming STT — Send partial results to the LLM without waiting for endpoint.
  2. Streaming LLM — Stream the first token into the TTS immediately.
  3. Streaming TTS — Emit audio at the word level.

OpenAI Realtime and Gemini Live fuse these steps inside the model itself and reach 200-400ms.
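In a classic pipeline, the glue between a streaming LLM and a streaming TTS is usually a small buffer that forwards text at phrase boundaries instead of waiting for the full reply. The following is a minimal sketch of that idea, not any vendor's implementation; the boundary characters and the `min_chars` threshold are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator

BOUNDARIES = ".!?;:,"  # assumed phrase delimiters; tune per language


def phrases(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into phrases that can be sent to TTS early."""
    buf = ""
    for tok in tokens:
        buf += tok
        text = buf.strip()
        # Flush once the buffer ends on a boundary and is long enough
        # to synthesize with reasonable prosody.
        if text and text[-1] in BOUNDARIES and len(text) >= min_chars:
            yield text
            buf = ""
    # Flush whatever is left when the LLM stream ends.
    if buf.strip():
        yield buf.strip()


llm_tokens = ["Hello", ",", " this", " is", " a", " streaming", " demo", ".", " Bye", "!"]
for phrase in phrases(llm_tokens):
    print(phrase)
# Hello, this is a streaming demo.
# Bye!
```

A smaller `min_chars` lowers time to first audio but gives the TTS less context for intonation. A production version would also need to avoid splitting on decimals and abbreviations.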


3. ElevenLabs v3 — The English-language TTS throne

ElevenLabs, founded in 2022, captured the TTS market faster than any prior player. v3 ships with:

  • 32 languages, 60-second cloning, 5-second Instant Voice Clone (IVC)
  • Emotion tags: anger, sadness, excitement, whisper, etc.
  • ElevenLabs Conversational AI — single SDK with STT + LLM + TTS
  • ElevenLabs Studio — long-form dubbing and audiobooks
  • Voice Library — 50,000+ public voices
  • ElevenLabs Reader — app for visually impaired and heavy readers

Python SDK example:

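from elevenlabs.client import ElevenLabs
from elevenlabs import play

client = ElevenLabs(api_key="...")

# convert() returns the synthesized audio as an iterator of byte chunks
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_v3",  # model ID of the v3 model in the ElevenLabs API
    text="Hello. This is the 2026 voice AI guide.",
)

play(audio)  # plays locally; requires ffmpeg to be installed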

Streaming:

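# stream() yields audio chunks as they are generated
# (older 1.x SDK releases exposed this as convert_as_stream())
stream = client.text_to_speech.stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # low-latency model
    text="Low-latency streaming example.",
)

# `speaker` is a placeholder for your own audio sink (sound device, socket, file)
for chunk in stream:
    speaker.write(chunk)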

Pricing (May 2026):

| Plan | Chars/mo | Price |
| --- | --- | --- |
| Free | 10k | Free |
| Starter | 30k | 5 USD |
| Creator | 100k | 22 USD |
| Pro | 500k | 99 USD |
| Scale | 2M | 330 USD |
| Enterprise | Custom | Contact |

Metered API runs about 180 USD per 1M characters for the v3 model (eleven_v3), and roughly half that for eleven_flash_v2_5.
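To compare the plans with the metered rate, it helps to normalize each plan to USD per 1M characters. The sketch below uses only the plan figures from the table above and assumes the full monthly quota is consumed; the dictionary and function names are illustrative.

```python
# (characters per month, monthly price in USD), copied from the plan table.
PLANS = {
    "Starter": (30_000, 5),
    "Creator": (100_000, 22),
    "Pro": (500_000, 99),
    "Scale": (2_000_000, 330),
}


def usd_per_million_chars(chars: int, price_usd: float) -> float:
    """Effective rate if the whole monthly quota is used."""
    return price_usd * 1_000_000 / chars


rates = {name: round(usd_per_million_chars(c, p), 2) for name, (c, p) in PLANS.items()}
print(rates)
# {'Starter': 166.67, 'Creator': 220.0, 'Pro': 198.0, 'Scale': 165.0}
```

Taken at face value, only Starter and Scale come in under the roughly 180 USD metered rate, and any unused quota pushes the effective rate higher.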

Strengths: audio quality, language coverage, Voice Library scale, integrated Conversational AI. Weaknesses: price, some languages (Korean / Japanese) still less polished than English.


4. Cartesia Sonic — The fastest TTS

Cartesia was founded in 2023 by state space model researchers from Stanford, including Albert Gu (co-author of Mamba) and Karan Goel. Their SSM (State Space Model) based Sonic TTS is famous for:

  • 75ms TTFW — by far the fastest commercial TTS
  • Sonic-2 (2025) / Sonic-3 (2026) — multilingual, emotion, singing
  • Default TTS for LiveKit Agents
  • Voice cloning from 3-second samples

Python SDK call:

from cartesia import Cartesia

client = Cartesia(api_key="...")

# streaming synthesis over a WebSocket
ws = client.tts.websocket()

for output in ws.send(
    model_id="sonic-3",
    transcript="Low-latency voice demo.",
    # recent SDK releases take a voice object rather than a bare voice_id
    voice={"id": "694f9389-aac1-45b6-b726-9d9369183238"},
    stream=True,
    output_format={
        "container": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 24000,
    },
):
    # `speaker` is a placeholder for your own audio sink
    speaker.write(output.audio)

ws.close()

Pricing runs about 65 USD per 1M characters, well under half of ElevenLabs. The trade-off is that Korean and Japanese quality lags ElevenLabs by a notch.

Pick Cartesia when latency is non-negotiable, ElevenLabs when multilingual quality wins.


5. Play.HT 3 — Multilingual + Realtime

Play.HT is an LA-based company founded in 2016 that supports 30+ languages. The 3.0 highlights:

  • PlayDialog — synthesizes conversations between two or more speakers
  • Realtime API — 200ms TTFW
  • 142 voices plus cloning
  • LangChain and LlamaIndex integration

Python call:

from pyht import Client, TTSOptions, Format

client = Client(user_id="...", api_key="...")

options = TTSOptions(
    voice="s3://voice-cloning-zero-shot/...",
    sample_rate=24000,
    format=Format.FORMAT_WAV,
)

for chunk in client.tts("Play.HT 3 demo.", options=options):
    speaker.write(chunk)

Pricing starts at 39 USD for 100k characters — a midpoint between ElevenLabs and Cartesia.

Highlight: PlayDialog is the strongest for two-person dialog naturalness. Popular for automated podcast generation.


6. OpenAI Voice — tts-1, gpt-4o-mini-tts, Realtime API

OpenAI launched tts-1 in November 2023 and filled out the stack through 2025-2026.

| Model | Use case | Notes |
| --- | --- | --- |
| tts-1 | Standard TTS | Fast, decent quality, 6 voices |
| tts-1-hd | High-quality TTS | Pricier but better audio |
| gpt-4o-mini-tts | Next-gen TTS | Instructable, emotion control |
| Realtime API (gpt-4o-realtime-preview) | Full-duplex voice | Unified STT+LLM+TTS |

Realtime API example:

import WebSocket from 'ws'

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
  {
    headers: {
      Authorization: 'Bearer YOUR_KEY',
      'OpenAI-Beta': 'realtime=v1',
    },
  }
)

ws.on('open', () => {
  ws.send(
    JSON.stringify({
      type: 'session.update',
      session: {
        modalities: ['text', 'audio'],
        voice: 'alloy',
        instructions: 'Be friendly and concise.',
        turn_detection: { type: 'server_vad' },
      },
    })
  )
})

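ws.on('message', (data) => {
  // ws delivers frames as Buffers, so decode to a string before parsing
  const evt = JSON.parse(data.toString())
  if (evt.type === 'response.audio.delta') {
    // evt.delta is base64-encoded audio; `speaker` is a placeholder for your audio sink
    speaker.write(Buffer.from(evt.delta, 'base64'))
  }
})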

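Pricing (Realtime API): 100 USD per 1M input audio tokens and 200 USD per 1M output audio tokens. At those rates, OpenAI's launch guidance worked out to roughly 0.06 USD per minute of audio input and 0.24 USD per minute of audio output, so a minute of conversation lands between the two depending on who talks more. tts-1 costs 15 USD per 1M chars and gpt-4o-mini-tts only 12 USD, the cheapest of the bunch.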
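A rough way to turn the per-token prices above into per-minute figures is sketched below. The token rates (about one input token per 100 ms of audio and one output token per 50 ms) are assumptions based on OpenAI's published guidance for the original Realtime models and may not hold for later snapshots. The sketch also ignores text tokens and the re-billing of accumulated context.

```python
INPUT_USD_PER_1M_TOKENS = 100   # from the pricing above
OUTPUT_USD_PER_1M_TOKENS = 200  # from the pricing above

# Assumed token rates per minute of audio (see lead-in).
INPUT_TOKENS_PER_MIN = 600    # 1 token per 100 ms
OUTPUT_TOKENS_PER_MIN = 1200  # 1 token per 50 ms


def usd_per_minute(tokens_per_min: int, usd_per_1m_tokens: float) -> float:
    return tokens_per_min * usd_per_1m_tokens / 1_000_000


input_cost = usd_per_minute(INPUT_TOKENS_PER_MIN, INPUT_USD_PER_1M_TOKENS)
output_cost = usd_per_minute(OUTPUT_TOKENS_PER_MIN, OUTPUT_USD_PER_1M_TOKENS)


def conversation_cost_per_minute(agent_talk_share: float) -> float:
    """Blend input and output cost by the fraction of time the agent is speaking."""
    return (1 - agent_talk_share) * input_cost + agent_talk_share * output_cost


print(f"input {input_cost:.2f}, output {output_cost:.2f} USD/min")
print(f"50/50 conversation: {conversation_cost_per_minute(0.5):.2f} USD/min")
```

Under these assumptions the agent's share of talk time dominates the bill, so concise responses are a cost lever as well as a latency one.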

Strengths: price, integration, direct line to GPT models. Weaknesses: smaller voice catalog than ElevenLabs or Cartesia.


7. Hume AI EVI 2 — Emotional voice interface

Hume AI treats emotion as a first-class ML target. EVI 2 (Empathic Voice Interface 2) does:

  • Measures the speaker's emotion on 28 emotion dimensions
  • Automatically modulates the response voice to match
  • Full-duplex voice with ~700ms TTFW
  • Tunes response tone to match the user's tone

Demos are striking but daily-conversation naturalness still trails OpenAI Realtime. Hume shines in emotion-sensitive verticals — medical consults, mental health, companionship chatbots.

Pricing is about 0.072 USD per minute.


8. Sesame — The Maya / Miles shockwave

Sesame is the company built by Oculus co-founder Brendan Iribe after acquiring Maven AI in 2024. The Maya and Miles voice demos released in March 2025 turned Twitter upside down.

  • Natural breathing, hesitation, laughter
  • Tone modulation that follows the user's emotion
  • Persona consistency across long conversations
  • Conversational Speech Model (CSM) 1B open-sourced for research

Demo: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice

As of May 2026, the commercial API is still in limited preview, but the naturalness shown is something ElevenLabs, Hume, and OpenAI cannot yet match. Trade-off: English-centric for now.


9. Fish Audio Speech 1.5 — The Chinese-language leader

Fish Audio, based in China, has grown fast since 2024. Speech 1.5 strengths:

  • #1 in Chinese naturalness — including regional dialects
  • 30-second voice cloning
  • 9 languages supported
  • ~12 USD per 1M characters — very cheap
  • Open-source Fish Speech v1.4

Choice criteria:

  • Mandarin speakers, Chinese market → Fish Audio
  • Korean / Japanese priority → Typecast / Clova / CoeFont

The follow-up OpenAudio S1 model is also public.


10. Deepgram Aura — TTS from an STT company

Deepgram has been an STT specialist since 2015. They launched their first TTS, Aura, in 2024.

  • ~200ms TTFW
  • ~15 USD per 1M characters — comparable to OpenAI tts-1
  • 12 voices (English-centric)
  • Bundles its STT plus Aura TTS as a full voice-agent SDK

Highlight: pairing STT and TTS with the same vendor simplifies invoicing, SLA, and security posture. The TTS quality alone is a notch below ElevenLabs or Cartesia.


11. Other TTS — Resemble, WellSaid, Coqui, F5-TTS

| Tool | Notes |
| --- | --- |
| Resemble AI | Cloning + security focus. Government / defense market |
| WellSaid Labs | US enterprise audience |
| Coqui TTS | Open source. Company shut down in 2024, community keeps it alive |
| F5-TTS (SJTU, 2024) | 5-second cloning, explosively popular open source |
| MaskGCT | Amphion team (CUHK-Shenzhen), 2024 open source |
| CosyVoice 2 | Alibaba 2025, strong Chinese + English |
| GPT-SoVITS | Indie project, popular in Japan / China communities |
| OpenVoice v2 | MyShell.ai, cloning + multilingual |
| Bark, Vall-E-X, XTTS v2 | 2023-2024 legacy open models |

For open-source-first stacks in 2026, F5-TTS or CosyVoice 2 is the top pick. F5-TTS surprises with 5-second clones; CosyVoice 2 has Alibaba's stable backing.


12. The cloud big three — Polly, Google TTS, Azure Speech

| Vendor | Notes | Pricing |
| --- | --- | --- |
| Amazon Polly | Neural + Generative voices, 90+ voices | 4 USD per 1M chars (Standard) |
| Google Cloud TTS | Studio, Neural2, Wavenet | 16 USD per 1M chars (Neural2 / Wavenet), 160 USD (Studio) |
| Azure Speech | Custom Neural Voice, strong multilingual | 16-30 USD per 1M chars |

These remain the default for enterprise, government, and regulated industries. Voice freshness and naturalness lag ElevenLabs and Cartesia by a generation, but AWS / GCP / Azure integration and SLA win the decision.

Microsoft Research's NaturalSpeech 3 is academically top-tier but not GA. Google DeepMind's Lyria 2 is music-focused yet overlaps with TTS for vocal synthesis.


13. STT — Deepgram Nova-3, AssemblyAI Universal-2, OpenAI

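| Tool | Latency | WER (English) | Languages |
| --- | --- | --- | --- |
| Deepgram Nova-3 | <50ms | 6.8% | 36 |
| AssemblyAI Universal-2 | 200ms | 5.7% | 70+ |
| OpenAI Whisper v3 turbo | Batch | 7.5% | 99 |
| OpenAI gpt-4o-transcribe | Streaming | 5.2% | 99+ |
| Gladia | 300ms | 6.5% | 100+ |
| Speechmatics | 250ms | 6.0% | 50+ |
| Rev AI | 300ms | 7.0% | 36 |
| Soniox | 80ms | 5.9% | 60+ |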

When latency is non-negotiable, Nova-3 or Soniox. When WER matters most, GPT-4o transcribe or AssemblyAI.
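WER figures like those in the table are only meaningful on your own audio, so it is worth being able to compute the metric yourself. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words). The lowercase-and-split normalization is a simplifying assumption; published benchmarks normalize punctuation and numerals more carefully.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


print(wer("the quick brown fox jumps", "the quik brown jumps"))  # 0.4
```

In the example, one substitution ("quik") and one deletion ("fox") against five reference words give a WER of 40%.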

Open source: Whisper, WhisperX, Distil-Whisper, Vosk, Moonshine (Useful Sensors), Owl ASR. Moonshine is rising as the mobile / edge-friendly pick.

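# Deepgram Nova-3 streaming STT example
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

dg = DeepgramClient(api_key="...")
connection = dg.listen.live.v("1")

def on_message(_, result, **kwargs):
    # Fires for interim and final results; skip empty transcripts
    transcript = result.channel.alternatives[0].transcript
    if transcript:
        print(transcript)

connection.on(LiveTranscriptionEvents.Transcript, on_message)
connection.start(LiveOptions(model="nova-3", language="en", interim_results=True))

# mic_stream() is a placeholder for your own generator of raw audio chunks
for chunk in mic_stream():
    connection.send(chunk)

connection.finish()  # close the WebSocket once the audio is exhausted
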

14. Full-duplex voice agents — LiveKit, Pipecat, Vapi

Voice agents go beyond plain TTS / STT. They handle turn management, interruption, VAD, and tool calling together.

LiveKit Agents

LiveKit Agents is a Python full-stack voice-agent framework on top of the WebRTC backbone. Cartesia is the default TTS.

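from livekit import agents
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions
from livekit.plugins import openai, cartesia, deepgram, silero

class Assistant(Agent):
    def __init__(self):
        # Agent requires instructions (the system prompt for the LLM)
        super().__init__(instructions="You are a friendly, concise voice assistant.")

    async def on_enter(self):
        # Called when this agent becomes active in the session
        await self.session.say("Hello, how can I help you?")

async def entrypoint(ctx: JobContext):
    await ctx.connect()  # join the LiveKit room before starting the session
    session = AgentSession(
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(voice="..."),
        vad=silero.VAD.load(),
    )
    await session.start(agent=Assistant(), room=ctx.room)

if __name__ == "__main__":
    agents.cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))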

Pipecat

Pipecat is a Python voice-agent framework sponsored by Daily.co. More modular than LiveKit and strong in vision + audio multimodal.

Vapi · Retell AI · Bland AI

Three companies offering voice-agent SaaS.

  • Vapi — fastest-growing, both no-code and API
  • Retell AI — Y Combinator alum, strong telephony integration
  • Bland AI — US contact-center focus, 0.09 USD per minute

SaaS gets you running fast with SIP / Twilio integration done, but the monthly bill grows faster than a direct LiveKit + Cartesia stack.


15. Full-duplex LLM — Realtime API, Gemini Live, Claude Voice

The second path to replacing the classic pipeline is LLM-native voice.

| Model | Released | Notes |
| --- | --- | --- |
| OpenAI Realtime API (gpt-4o-realtime) | 2024-10 | WebSocket, 8 voices |
| Google Gemini 2.5 Live | 2025 | Multimodal video included |
| Anthropic Claude voice mode | 2025 | Mobile app, Sonnet-based |
| Mistral Voxtral | 2025 | Open 3B/24B voice model |

The upside of LLM-native voice is natural emotion, interruption, and backchannel ("uh-huh", "right") handling. The downside is locked-in TTS and a smaller voice catalog.


16. Interruption / VAD / barge-in — The invisible essentials

Ninety percent of why voice agents feel awkward comes from interruption handling. Humans cut off unfinished sentences, drop in backchannel words, and start the next turn instantly. The relevant stack:

  • VAD (Voice Activity Detection) — Silero VAD is the de-facto standard. 30-50ms detection of speech start / end.
  • Turn Detection — More than silence detection; it decides whether the user's turn is actually "done". Examples are the LiveKit turn detector (2026) and the turn detection built into OpenAI Realtime.
  • Barge-in — Cuts AI TTS the moment the user speaks and flips to listening mode.
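  • Endpointing — Decides when the user's utterance has ended so the transcript can be finalized and handed to the LLM. A shorter window cuts latency but risks cutting the user off mid-sentence.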
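How these pieces interact is easiest to see as a small state machine. The following is a minimal sketch, not the design of any particular framework; the state names, event names, and the `cancelled_responses` counter (a stand-in for actually stopping TTS playback and cancelling the in-flight LLM request) are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    THINKING = "thinking"   # LLM is generating, no audio yet
    SPEAKING = "speaking"   # TTS audio is playing


class TurnManager:
    def __init__(self) -> None:
        self.state = State.LISTENING
        self.cancelled_responses = 0

    def on_user_speech_start(self) -> None:
        # Barge-in: the user talks over the agent, so drop the current response.
        if self.state in (State.THINKING, State.SPEAKING):
            self.cancelled_responses += 1
        self.state = State.LISTENING

    def on_user_turn_end(self) -> None:
        # Fired by endpointing / turn detection.
        if self.state is State.LISTENING:
            self.state = State.THINKING

    def on_agent_audio_start(self) -> None:
        if self.state is State.THINKING:
            self.state = State.SPEAKING

    def on_agent_audio_end(self) -> None:
        if self.state is State.SPEAKING:
            self.state = State.LISTENING


tm = TurnManager()
tm.on_user_turn_end()       # user finished speaking
tm.on_agent_audio_start()   # agent begins its reply
tm.on_user_speech_start()   # user interrupts
print(tm.state, tm.cancelled_responses)  # State.LISTENING 1
```

In a real agent the VAD drives `on_user_speech_start`, and the hard part is doing the cancellation fast enough that the agent's voice stops within a few tens of milliseconds.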

Silero VAD usage:

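import torch

# Downloads the model and helper functions from the silero-vad repo on first use
vad, utils = torch.hub.load(
    "snakers4/silero-vad", "silero_vad", trust_repo=True
)

# utils is a tuple of helpers; only two are needed here
(get_speech_timestamps, _, read_audio, *_) = utils

audio = read_audio("test.wav", sampling_rate=16000)
# Returns a list of {"start": ..., "end": ...} dicts, measured in samples
ts = get_speech_timestamps(audio, vad, sampling_rate=16000)
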
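On a live stream, the same VAD output feeds the endpointing decision. Below is a minimal silence-timeout sketch that takes per-frame speech flags, such as a VAD like Silero could supply, and reports the frame at which the turn would be declared finished. The 30 ms frame size and both thresholds are illustrative assumptions, not recommended values.

```python
from typing import Iterable, Optional


def find_endpoint(
    frames: Iterable[bool],
    frame_ms: int = 30,
    silence_ms: int = 300,
    min_speech_ms: int = 90,
) -> Optional[int]:
    """Return the index of the frame where end-of-turn fires, or None."""
    speech_total = 0
    trailing_silence = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            speech_total += frame_ms
            trailing_silence = 0  # any speech resets the silence timer
        elif speech_total >= min_speech_ms:
            # Only count silence once the user has actually said something.
            trailing_silence += frame_ms
            if trailing_silence >= silence_ms:
                return i
    return None


# 150 ms lead-in silence, speech, a short pause, more speech, then silence.
flags = [False] * 5 + [True] * 10 + [False] * 3 + [True] * 4 + [False] * 12
print(find_endpoint(flags))  # 31
```

The 90 ms pause in the middle does not end the turn, but the final silence does after 300 ms. That `silence_ms` value is paid on every turn, which is why learned turn detectors that can fire earlier matter for latency.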

17. Cloning ethics — ELVIS Act, EU AI Act, SynthID

Voice cloning entered the public conversation after fake Biden audio surfaced in the 2024 New Hampshire primary. Legislation followed.

  • Tennessee ELVIS Act (effective July 2024) — First US law criminalizing unauthorized AI cloning of voice and likeness.
  • EU AI Act (entered into force August 2024) — Imposes transparency obligations on synthetic audio: AI-generated or cloned voices must be disclosed as such.
  • California AB 2839 (2024) — Bans election-period deepfakes.
  • US FCC (February 2024) — Ruled that AI-generated voices in robocalls count as "artificial" under the TCPA, making such calls illegal without prior consent.

Counter-tech:

  • SynthID Audio (Google DeepMind) — Sub-audible watermark.
  • Resemble Detect — Resemble AI's fake-voice detector.
  • AntiFake (Washington University in St. Louis) — Adds adversarial perturbation to speech so that cloning models cannot reproduce the voice.

Commercial TTS vendors now require recorded-consent flows. ElevenLabs asks the speaker to record "I have the right to clone this voice".


18. Korea — Typecast, Clova, Kakao, HyperCLOVA X Voice

The Korean market is firmly held by domestic players.

  • Typecast (Neosapience) — #1 in Korea. Content creators, advertising, audiobooks. Very strong on video-to-voice consistency.
  • Naver Clova Voice / Clova Dubbing — 50+ Korean voices; Clova Dubbing auto-dubs video subtitles.
  • HyperCLOVA X Voice — Voice-agent SDK pairing Naver's LLM with TTS.
  • Kakao TTS / Kakao i Voice — Integrated with KakaoTalk chatbots and Kakao i.
  • AI Tester (NFly) — Focused on advertising voices.

Note: foreign TTS still sounds awkward on Korean prosody and loanword pronunciation. Typecast and Clova are overwhelmingly the most natural.

Typecast API call:

import requests

resp = requests.post(
    "https://typecast.ai/api/speak",
    headers={"Authorization": "Bearer ..."},
    json={
        "actor_id": "5c3b3...",
        "text": "Typecast TTS synthesis demo.",
        "lang": "ko",
        "tempo": 1.0,
    },
)

Pricing: Typecast about 1.5 KRW per 100 characters, Clova about 4 KRW per 200 characters.


19. Japan — CoeFont, VOICEVOX, Synthesizer V

Japan plays a different game. The model is character voices + marketplace.

  • CoeFont — Marketplace of 10,000+ voices. Voice actors register and sell their own voice.
  • Rinna Japanese TTS — Japanese open TTS from the ex-Microsoft Rinna team.
  • VOICEROID / VOICEVOX — VOICEVOX is free; characters like Zundamon and Shikoku Metan are the standard on YouTube and Niconico.
  • Synthesizer V — Singing synthesis across Japanese, Chinese, and Korean vocals.
  • AI Voice Project (AIVoice) — Licensed reproduction of professional Japanese voice actors.

Note: Japan requires checking commercial-use terms per character. Even within VOICEVOX, terms differ by character.

Choice criteria:

  • Business / contact center → CoeFont, Rinna
  • YouTube / games / doujin content → VOICEVOX
  • Singing → Synthesizer V

20. Pricing — Per 1M characters / per minute

Price gaps span an order of magnitude. May 2026 snapshot.

| Tool | Per 1M chars | Full-duplex per minute |
| --- | --- | --- |
| ElevenLabs Multilingual v3 | 180 USD | 0.30 USD |
| ElevenLabs Flash v2.5 | 90 USD | 0.15 USD |
| Cartesia Sonic 3 | 65 USD | 0.11 USD |
| Play.HT 3 | 120 USD | 0.20 USD |
| OpenAI tts-1 | 15 USD | 0.06 USD |
| OpenAI gpt-4o-mini-tts | 12 USD | 0.05 USD |
| OpenAI Realtime API | - | 0.06 USD |
| Hume EVI 2 | - | 0.072 USD |
| Fish Audio 1.5 | 12 USD | 0.04 USD |
| Deepgram Aura | 15 USD | 0.05 USD |
| Amazon Polly Generative | 30 USD | 0.08 USD |
| Google Cloud TTS Studio | 160 USD | 0.27 USD |
| Azure Custom Neural | 24 USD | 0.07 USD |
| Typecast | ~15 USD | - |
| Naver Clova Voice | ~20 USD | - |
| CoeFont | ~30 USD | - |
| Vapi (full agent) | - | 0.08 USD |
| Retell AI | - | 0.075 USD |
| Bland AI | - | 0.09 USD |

For startups, OpenAI tts-1 / Fish Audio / Cartesia offer the best cost performance. For enterprise quality, go ElevenLabs / Typecast / Clova.
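To see what the spread means for a budget, the per-1M-character prices can be applied to an expected monthly volume. The sketch below copies a subset of the prices from the table above; the 5M characters per month workload is a hypothetical example, and real bills will differ with plan tiers and volume discounts.

```python
# USD per 1M characters, copied from the pricing table above.
PRICE_PER_1M_CHARS_USD = {
    "ElevenLabs Multilingual v3": 180,
    "ElevenLabs Flash v2.5": 90,
    "Cartesia Sonic 3": 65,
    "Play.HT 3": 120,
    "OpenAI tts-1": 15,
    "OpenAI gpt-4o-mini-tts": 12,
    "Fish Audio 1.5": 12,
    "Deepgram Aura": 15,
}


def monthly_costs(chars_per_month: int) -> dict:
    """Monthly TTS spend in USD per tool, cheapest first."""
    costs = {
        tool: price * chars_per_month / 1_000_000
        for tool, price in PRICE_PER_1M_CHARS_USD.items()
    }
    return dict(sorted(costs.items(), key=lambda item: item[1]))


costs = monthly_costs(5_000_000)  # hypothetical workload
for tool, usd in costs.items():
    print(f"{tool:28s} {usd:8.2f} USD")

spread = max(costs.values()) / min(costs.values())
print(f"most expensive / cheapest: {spread:.0f}x")  # 15x
```

At this volume the gap between the cheapest and most expensive listed options is 60 USD versus 900 USD per month, which is the order-of-magnitude spread the table shows.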


21. Who should pick what

A picking matrix.

| Goal | Recommendation |
| --- | --- |
| English console full-duplex | OpenAI Realtime API |
| Multilingual voice agent | LiveKit Agents + Cartesia |
| English audiobooks / dubbing | ElevenLabs Studio |
| Emotional companion chatbot | Hume EVI 2 |
| Showcase naturalness demo | Sesame Maya / Miles |
| Chinese-language content | Fish Audio |
| Korean-language content | Typecast, Naver Clova |
| Japanese character voice | VOICEVOX |
| Japanese marketplace | CoeFont |
| Open source / self-hosted | F5-TTS, CosyVoice 2 |
| Contact-center SaaS | Vapi, Retell AI, Bland AI |
| Mobile / edge STT | Moonshine, Distil-Whisper |
| Fast STT | Deepgram Nova-3 |
| Accurate STT | OpenAI gpt-4o-transcribe |
| Enterprise default | Polly, Google TTS, Azure Speech |

Three decision axes:

  1. Latency vs quality — Cartesia / Realtime are fast; ElevenLabs / Sesame are rich.
  2. API vs self-hosted — APIs are fast to ship; open models keep data sovereignty.
  3. Global vs native — Korean and Japanese have a naturalness gap that foreign TTS cannot yet close.

22. Use cases — What is actually paying right now

Where voice AI is generating revenue in 2026.

  • Contact-center automation — Retell and Bland deployed at US real-estate and healthcare firms. Saving 5-15 USD per call.
  • Audiobook / podcast dubbing — ElevenLabs Studio contracted with publishers. Cost down ten-fold per hour.
  • Game NPC voicing — Sony, EA, Ubisoft all partnered with ElevenLabs or Resemble.
  • Language learning — Duolingo Max and Speak use OpenAI Realtime.
  • Accessibility — Apple and Microsoft integrate TTS at the OS level.
  • Ad / marketing dubbing — Video dubbing is the largest single market.
  • Personal companion chatbots — Character.AI and Replika use ElevenLabs / Cartesia.

The clearest revenue line is contact-center automation, followed by content dubbing.


23. Closing — The year voice became an interface

Five years ago voice AI was answering-machine grade. The 2026 version is different.

  • 75ms-TTFW Cartesia, 32-language ElevenLabs, and unified-model OpenAI Realtime coexist.
  • LiveKit Agents, Pipecat, Vapi, Retell, and Bland built the orchestration layer.
  • Sesame and Hume reset the bar for emotion and naturalness.
  • Deepgram Nova-3 pushed STT below 50ms latency.
  • Typecast / Clova hold Korea, CoeFont / VOICEVOX hold Japan.
  • ELVIS Act and EU AI Act drew the legal lines on cloning.

What remains is choosing what voice interface to build. May this article be that starting line.

