
Voice AI & TTS 2026 Deep Dive - ElevenLabs · Cartesia Sonic · OpenAI Voice · Play.HT · Hume · Sesame · Fish Audio · Deepgram Aura


Prologue — The year voice became the mouth and ears of the LLM

As of May 2026, the phrase "voice AI" carries entirely different weight than it did five years ago.

- ElevenLabs v3 supports 32 languages, emotion tags, and 5-second cloning. It is the de-facto English-speaking TTS standard.

- Cartesia Sonic is the fastest commercial TTS with 75ms TTFW (Time To First Word) and is the default TTS in LiveKit Agents.

- OpenAI Realtime API has popularized the full-duplex model where STT, LLM, and TTS run over a single WebSocket.

- Google Gemini Live and Anthropic Claude voice mode have established LLM-native voice.

- Hume EVI 2 and Sesame's Maya/Miles demos (March 2025) redefined the limits of emotional and natural-sounding speech.

- Fish Audio, CosyVoice 2, and F5-TTS rapidly took share in the open-source and Chinese-speaking world.

- Deepgram Nova-3 pushed STT latency below 50ms; AssemblyAI Universal-2 and OpenAI GPT-4o transcribe compete on accuracy.

- Orchestration tools like LiveKit Agents, Pipecat, Vapi, Retell AI, and Bland AI made the voice-agent stack standard.

- Tennessee's ELVIS Act and the EU AI Act drew the first legal lines around voice cloning ethics.

- Korea is dominated by Typecast (Neosapience) and Naver Clova Dubbing; Japan by CoeFont and VOICEVOX.

This article maps the whole field — which tool owns which slot, which metrics actually matter, and what to choose for a new 2026 project.

1. The 2026 voice stack — a four-tier pipeline

Today's voice AI breaks into four layers.

[ Tier 1 ] Input - Microphone / WebRTC / SIP / Telephony

[ Tier 2 ] STT (ASR) - Deepgram Nova-3 / AssemblyAI / GPT-4o transcribe / Whisper v3 turbo

[ Tier 3 ] LLM - GPT-5 / Claude 4.5 / Gemini 2.5 Pro / Llama 4

[ Tier 4 ] TTS - ElevenLabs / Cartesia / OpenAI / Play.HT / Hume / Sesame

[ Sidecar ] Orchestration - LiveKit Agents / Pipecat / Vapi / Retell / Bland

[ Sidecar ] Interruption - VAD / barge-in / turn detection / endpointing

The classic STT → LLM → TTS pipeline is still the most common, but full-duplex LLM-native voice — proven by OpenAI Realtime and Gemini Live since 2025 — is taking ground fast.

| Stage | Key metric |
| --- | --- |
| STT | WER (Word Error Rate), first-partial latency, multilingual |
| LLM | TTFT (time to first token), TPS (tokens per second) |
| TTS | TTFW (time to first word), audio MOS, voice diversity |
| Full-duplex | End-to-end latency, interruption naturalness |

The conversational latency target is consistent: **under 300ms to first audio**.

2. The metrics — latency, latency, latency

The most often ignored yet most important number in voice AI is **the human perception threshold**.

- Under 200ms: feels like natural human conversation.

- 200-500ms: slightly off but tolerable.

- 500ms-1s: noticeably slow.

- Over 1s: sounds like an answering machine.

Traditional pipelines accumulate latency like this:

    Mic -> VAD -> STT partial -> Endpoint -> LLM TTFT -> TTS TTFW -> Speaker
    10ms   30ms   80ms           200ms       400ms       150ms       30ms

    Total: ~900ms
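
The budget above is plain addition, so it is easy to check and to see which stages dominate. A minimal sketch using the stage figures from the breakdown; the dictionary keys and helper names are illustrative, not part of any vendor API.

```python
# Stage latencies in milliseconds, copied from the breakdown above.
BUDGET_MS = {
    "mic": 10,
    "vad": 30,
    "stt_partial": 80,
    "endpoint": 200,
    "llm_ttft": 400,
    "tts_ttfw": 150,
    "speaker": 30,
}
TARGET_MS = 300  # conversational target named earlier in the article


def total_latency(budget: dict[str, int]) -> int:
    """Sum of all stage latencies when stages run strictly one after another."""
    return sum(budget.values())


def biggest_stages(budget: dict[str, int], n: int = 2) -> list[str]:
    """Names of the n slowest stages, slowest first."""
    return sorted(budget, key=budget.get, reverse=True)[:n]


total = total_latency(BUDGET_MS)
print(f"total: {total} ms, over target by {total - TARGET_MS} ms")
print("largest contributors:", biggest_stages(BUDGET_MS))
```

The two largest contributors are the LLM's time to first token and the endpointing wait, which is why the streaming tricks target those stages first.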

Cutting this to under 300ms requires three tricks:

1. **Streaming STT** — Send partial results to the LLM without waiting for endpoint.

2. **Streaming LLM** — Stream the first token into the TTS immediately.

3. **Streaming TTS** — Emit audio at the word level.

OpenAI Realtime and Gemini Live fuse these steps inside the model itself and reach 200-400ms.
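
The second and third tricks meet at one practical question: how much LLM output to buffer before handing it to the TTS engine. A minimal sketch of one common approach, cutting at punctuation; the boundary set, the `min_chars` threshold, and the function name are assumptions for illustration, not any vendor's implementation.

```python
from typing import Iterable, Iterator

# Punctuation that ends a speakable chunk; an assumption, tune per language.
BOUNDARIES = ".!?;:"


def speakable_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into short chunks a TTS engine can start on."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        if stripped and stripped[-1] in BOUNDARIES and len(stripped) >= min_chars:
            yield stripped
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the LLM stops


# Hand-written token stream standing in for real LLM output.
llm_tokens = ["Sure", ",", " I", " can", " help", ".",
              " What", " city", " are", " you", " in", "?"]
chunks = list(speakable_chunks(llm_tokens))
print(chunks)
```

Smaller chunks reach the speaker sooner but give the TTS engine less context for prosody, so the threshold is a latency versus naturalness trade-off.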

3. ElevenLabs v3 — The English-language TTS throne

ElevenLabs, founded in 2022, captured the TTS market faster than any prior player. v3 ships with:

- 32 languages, 60-second cloning, 5-second Instant Voice Clone (IVC)

- Emotion tags: anger, sadness, excitement, whisper, etc.

- ElevenLabs Conversational AI — single SDK with STT + LLM + TTS

- ElevenLabs Studio — long-form dubbing and audiobooks

- Voice Library — 50,000+ public voices

- ElevenLabs Reader — app for visually impaired and heavy readers

Python SDK example:

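    from elevenlabs.client import ElevenLabs
    from elevenlabs import play

    client = ElevenLabs(api_key="...")

    # Synthesize the full clip, then play it locally.
    # Model ID as given in this article; confirm it against the model list for your account.
    audio = client.text_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_multilingual_v3",
        text="Hello. This is the 2026 voice AI guide.",
    )
    play(audio)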

Streaming:

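    # Reuses `client` from the previous example.
    # Newer SDK releases may expose this call as text_to_speech.stream(...).
    stream = client.text_to_speech.convert_as_stream(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_flash_v2_5",  # low-latency model
        text="Low-latency streaming example.",
    )

    # `speaker` is a placeholder for your audio sink (sound device, socket, file).
    for chunk in stream:
        speaker.write(chunk)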
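
Since latency is the headline metric, it helps to measure it on your own network rather than trust a spec sheet. The sketch below wraps any chunk iterator and records time to first chunk, which is only a proxy for TTFW because the first chunk may hold silence or several words. `timed_stream` and `fake_tts_stream` are hypothetical helpers; the fake stream stands in for a vendor call so the example runs offline.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[bytes], stats: dict) -> Iterator[bytes]:
    """Pass chunks through unchanged while recording time to first chunk."""
    start = time.perf_counter()  # starts when the consumer begins iterating
    for index, chunk in enumerate(chunks):
        if index == 0:
            stats["first_chunk_ms"] = (time.perf_counter() - start) * 1000
        stats["bytes"] = stats.get("bytes", 0) + len(chunk)
        yield chunk
    stats["total_ms"] = (time.perf_counter() - start) * 1000


def fake_tts_stream() -> Iterator[bytes]:
    """Stand-in for a vendor stream: waits 20 ms, then yields three chunks."""
    time.sleep(0.02)
    for _ in range(3):
        yield b"\x00" * 480


stats: dict = {}
received = list(timed_stream(fake_tts_stream(), stats))
print(stats)
```

In real use you would wrap the vendor stream, for example `timed_stream(stream, stats)`, and feed the yielded chunks to the audio sink as before.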

Pricing (May 2026):

| Plan | Chars/mo | Price |
| --- | --- | --- |
| Free | 10k | Free |
| Starter | 30k | 5 USD |
| Creator | 100k | 22 USD |
| Pro | 500k | 99 USD |
| Scale | 2M | 330 USD |
| Enterprise | Custom | Contact |

Metered API runs about 180 USD per 1M characters for eleven_multilingual_v3, and roughly half that for eleven_flash_v2_5.

Strengths: audio quality, language coverage, Voice Library scale, integrated Conversational AI.

Weaknesses: price, some languages (Korean / Japanese) still less polished than English.

4. Cartesia Sonic — The fastest TTS

Cartesia was founded in 2023 by state space model researchers from Stanford, including Albert Gu (co-author of Mamba and S4) and Karan Goel (co-author of S4). Their SSM (State Space Model) based Sonic TTS is famous for:

- **75ms TTFW** — by far the fastest commercial TTS

- Sonic-2 (2025) / Sonic-3 (2026) — multilingual, emotion, singing

- Default TTS for LiveKit Agents

- Voice cloning from 3-second samples

Python SDK call:

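    from cartesia import Cartesia

    client = Cartesia(api_key="...")

    # Streaming synthesis over a WebSocket.
    # Parameter names follow this article; check your installed SDK version,
    # since some releases may expect voice={"id": ...} and stream=True.
    ws = client.tts.websocket()
    for output in ws.send(
        model_id="sonic-3",
        transcript="Low-latency voice demo.",
        voice_id="694f9389-aac1-45b6-b726-9d9369183238",
        output_format={
            "container": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": 24000,
        },
    ):
        # `speaker` is a placeholder for your audio sink.
        speaker.write(output.audio)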
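
The `raw` container with `pcm_s16le` encoding means the chunks are headerless 16-bit samples, so ordinary players cannot open them directly. A minimal sketch, using only the standard library, that computes the duration and wraps the bytes in a WAV header; mono output is an assumption, and the helper names are illustrative.

```python
import io
import wave

SAMPLE_RATE = 24000   # matches "sample_rate" in the request above
SAMPLE_WIDTH = 2      # pcm_s16le = 16-bit samples = 2 bytes each
CHANNELS = 1          # assumption: mono output


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration implied by the byte count at the configured format."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap headerless PCM in a WAV container so ordinary players can open it."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return out.getvalue()


silence = b"\x00\x00" * SAMPLE_RATE  # one second of 16-bit silence
wav_bytes = pcm_to_wav(silence)
print(pcm_duration_seconds(silence), len(wav_bytes))
```

In practice you would concatenate the streamed chunks into one `bytes` object and pass that to `pcm_to_wav` for debugging or archiving.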

Pricing runs about 65 USD per 1M characters, well under half of ElevenLabs. The trade-off is that Korean and Japanese quality lags ElevenLabs by a notch.

Pick Cartesia when latency is non-negotiable, ElevenLabs when multilingual quality wins.

5. Play.HT 3 — Multilingual + Realtime

Play.HT is an LA-based company founded in 2016 that supports 30+ languages. The 3.0 highlights:

- PlayDialog — synthesizes conversations between two or more speakers

- Realtime API — 200ms TTFW

- 142 voices plus cloning

- LangChain and LlamaIndex integration

Python call:

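    from pyht import Client, TTSOptions, Format

    client = Client(user_id="...", api_key="...")

    # The voice URI is elided here; use the ID of a stock or cloned voice from your account.
    options = TTSOptions(
        voice="s3://voice-cloning-zero-shot/...",
        sample_rate=24000,
        format=Format.FORMAT_WAV,
    )

    # client.tts() yields audio chunks as they are generated.
    # `speaker` is a placeholder for your audio sink.
    for chunk in client.tts("Play.HT 3 demo.", options=options):
        speaker.write(chunk)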

Plans start at 39 USD for 100k characters. The metered rate of about 120 USD per 1M characters (see the pricing table in section 20) sits between Cartesia and ElevenLabs.

Highlight: PlayDialog is the strongest for two-person dialog naturalness. Popular for automated podcast generation.

6. OpenAI Voice — tts-1, gpt-4o-mini-tts, Realtime API

OpenAI started with tts-1 in late 2023 and filled out the stack through 2025-2026.

| Model | Use case | Notes |
| --- | --- | --- |
| tts-1 | Standard TTS | Fast, decent quality, 6 voices |
| tts-1-hd | High-quality TTS | Pricier but better audio |
| gpt-4o-mini-tts | Next-gen TTS | Instructable, emotion control |
| Realtime API (gpt-4o-realtime-preview) | Full-duplex voice | Unified STT+LLM+TTS |

Realtime API example:

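    // Node.js, using the `ws` package (npm install ws).
    import WebSocket from 'ws'

    // Model snapshot name as given in this article; check the current model list.
    const ws = new WebSocket(
      'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2026',
      {
        headers: {
          Authorization: 'Bearer YOUR_KEY',
          'OpenAI-Beta': 'realtime=v1',
        },
      }
    )

    // Configure the session once the socket is open.
    ws.on('open', () => {
      ws.send(
        JSON.stringify({
          type: 'session.update',
          session: {
            modalities: ['text', 'audio'],
            voice: 'alloy',
            instructions: 'Be friendly and concise.',
            turn_detection: { type: 'server_vad' },
          },
        })
      )
    })

    // Audio arrives as base64 chunks; `speaker` is a placeholder for your audio sink.
    ws.on('message', (data) => {
      const evt = JSON.parse(data.toString())
      if (evt.type === 'response.audio.delta') {
        speaker.write(Buffer.from(evt.delta, 'base64'))
      }
    })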
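
For readers working in Python, the message-handling half of the example can be sketched without any network code. It assumes, as in the JavaScript above, that audio arrives in events of type `response.audio.delta` with a base64 `delta` field; the sample messages are hand-made and `collect_audio` is a hypothetical helper.

```python
import base64
import json


def collect_audio(raw_messages: list[str]) -> bytes:
    """Concatenate the audio bytes carried by response.audio.delta events."""
    audio = bytearray()
    for raw in raw_messages:
        event = json.loads(raw)
        if event.get("type") == "response.audio.delta":
            audio.extend(base64.b64decode(event["delta"]))
    return bytes(audio)


# Hand-made messages standing in for what the socket would deliver.
messages = [
    json.dumps({"type": "session.updated"}),
    json.dumps({"type": "response.audio.delta",
                "delta": base64.b64encode(b"\x01\x02").decode()}),
    json.dumps({"type": "response.audio.delta",
                "delta": base64.b64encode(b"\x03\x04").decode()}),
]
pcm = collect_audio(messages)
print(len(pcm), "bytes of audio")
```

A live client would decode each delta as it arrives and write it to the audio sink immediately rather than collecting everything first.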

Pricing (Realtime API): 100 USD per 1M input audio tokens and 200 USD per 1M output audio tokens, which OpenAI's launch pricing put at roughly 0.06 USD per minute of audio input and 0.24 USD per minute of audio output. tts-1 costs 15 USD per 1M characters; gpt-4o-mini-tts is listed at 12 USD, at the bottom of the price range.
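
Because the Realtime API bills by audio token rather than by character, a session's cost follows from its token counts. A minimal sketch using the two rates quoted above; the token counts in the example are made up, and how you obtain real counts depends on your client.

```python
INPUT_USD_PER_M = 100.0   # per 1M input audio tokens, as quoted above
OUTPUT_USD_PER_M = 200.0  # per 1M output audio tokens, as quoted above


def realtime_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Audio cost of one session; text tokens are ignored in this sketch."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical session: token counts are invented for the example.
cost = realtime_cost_usd(input_tokens=6_000, output_tokens=3_000)
print(f"{cost:.2f} USD")
```

Output tokens cost twice as much as input tokens, so an agent that talks more than it listens is disproportionately expensive.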

Strengths: price, integration, direct line to GPT models.

Weaknesses: smaller voice catalog than ElevenLabs or Cartesia.

7. Hume AI EVI 2 — Emotional voice interface

Hume AI treats emotion as a first-class ML target. EVI 2 (Empathic Voice Interface 2) does:

- Measures the speaker's emotion on 28 emotion dimensions

- Automatically modulates the response voice to match

- Full-duplex voice with ~700ms TTFW

- Tunes response tone to match the user's tone

Demos are striking but daily-conversation naturalness still trails OpenAI Realtime. Hume shines in emotion-sensitive verticals — medical consults, mental health, companionship chatbots.

Pricing is about 0.072 USD per minute.

8. Sesame — The Maya / Miles shockwave

Sesame is the company built by Oculus co-founder Brendan Iribe after acquiring Maven AI in 2024. The Maya and Miles voice demos released in March 2025 turned Twitter upside down.

- Natural breathing, hesitation, laughter

- Tone modulation that follows the user's emotion

- Persona consistency across long conversations

- Conversational Speech Model (CSM) 1B open-sourced for research

Demo: [https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice)

As of May 2026, the commercial API is still in limited preview, but the naturalness shown is something ElevenLabs, Hume, and OpenAI cannot yet match. Trade-off: English-centric for now.

9. Fish Audio Speech 1.5 — The Chinese-language leader

Fish Audio, based in China, has grown fast since 2024. Speech 1.5 strengths:

- #1 in Chinese naturalness — including regional dialects

- 30-second voice cloning

- 9 languages supported

- ~12 USD per 1M characters — very cheap

- Open-source Fish Speech v1.4

Choice criteria:

- Mandarin speakers, Chinese market → Fish Audio

- Korean / Japanese priority → Typecast / Clova / CoeFont

The follow-up OpenAudio S1 model is also public.

10. Deepgram Aura — TTS from an STT company

Deepgram has been an STT specialist since 2015. They launched their first TTS, Aura, in 2024.

- ~200ms TTFW

- ~15 USD per 1M characters — comparable to OpenAI tts-1

- 12 voices (English-centric)

- Bundles its STT plus Aura TTS as a full voice-agent SDK

Highlight: pairing STT and TTS with the same vendor simplifies invoicing, SLA, and security posture. The TTS quality alone is a notch below ElevenLabs or Cartesia.

11. Other TTS — Resemble, WellSaid, Coqui, F5-TTS

| Tool | Notes |
| --- | --- |
| Resemble AI | Cloning + security focus. Government / defense market |
| WellSaid Labs | US enterprise audience |
| Coqui TTS | Open source. Company shut down in 2024, community keeps it alive |
| F5-TTS (SJTU, 2024) | 5-second cloning, explosively popular open source |
| MaskGCT | CUHK-Shenzhen (Amphion toolkit), 2024 open source |
| CosyVoice 2 | Alibaba, released December 2024; strong Chinese + English |
| GPT-SoVITS | Indie project, popular in Japan / China communities |
| OpenVoice v2 | MyShell.ai, cloning + multilingual |
| Bark, Vall-E-X, XTTS v2 | 2023-2024 legacy open models |

For open-source-first stacks in 2026, **F5-TTS** or **CosyVoice 2** is the top pick. F5-TTS surprises with 5-second clones; CosyVoice 2 has Alibaba's stable backing.

12. The cloud big three — Polly, Google TTS, Azure Speech

| Vendor | Notes | Pricing |
| --- | --- | --- |
| Amazon Polly | Neural + Generative voices, 90+ voices | 4 USD per 1M chars (Standard) |
| Google Cloud TTS | Studio, Neural2, WaveNet | 16 USD per 1M chars (Neural2 / WaveNet); 160 USD (Studio) |
| Azure Speech | Custom Neural Voice, strong multilingual | 16-30 USD per 1M chars |

These remain the default for enterprise, government, and regulated industries. Voice freshness and naturalness lag ElevenLabs and Cartesia by a generation, but AWS / GCP / Azure integration and SLA win the decision.

Microsoft Research's **NaturalSpeech 3** is academically top-tier but not GA. Google DeepMind's **Lyria 2** is music-focused yet overlaps with TTS for vocal synthesis.

13. STT — Deepgram Nova-3, AssemblyAI Universal-2, OpenAI

| Tool | Streaming latency | WER (English) | Languages |
| --- | --- | --- | --- |
| Deepgram Nova-3 | `<50ms` | 6.8% | 36 |
| AssemblyAI Universal-2 | 200ms | 5.7% | 70+ |
| OpenAI Whisper v3 turbo | Batch only | 7.5% | 99 |
| OpenAI gpt-4o-transcribe | Streaming (no figure given) | 5.2% | 99+ |
| Gladia | 300ms | 6.5% | 100+ |
| Speechmatics | 250ms | 6.0% | 50+ |
| Rev AI | 300ms | 7.0% | 36 |
| Soniox | 80ms | 5.9% | 60+ |

When latency is non-negotiable, Nova-3 or Soniox. When WER matters most, GPT-4o transcribe or AssemblyAI.
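
WER figures like those in the table are only comparable when computed the same way, so it is worth knowing the calculation. The sketch below is the standard word-level edit distance divided by the reference length; lowercasing and whitespace splitting are simplifying assumptions, and real benchmarks apply heavier text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic edit-distance table, computed over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
print(f"WER: {wer:.0%}")
```

Here one word is dropped and one is substituted against a five-word reference, giving two errors out of five.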

Open source: **Whisper, WhisperX, Distil-Whisper, Vosk, Moonshine (Useful Sensors), Owl ASR**. Moonshine is rising as the mobile / edge-friendly pick.

Deepgram Nova-3 streaming STT example:

    from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

    dg = DeepgramClient(api_key="...")
    connection = dg.listen.live.v("1")

    # Called for every interim and final transcript the service sends back.
    def on_message(_, result, **kwargs):
        print(result.channel.alternatives[0].transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_message)
    connection.start(LiveOptions(model="nova-3", language="en", interim_results=True))

    # `mic_stream()` is a placeholder for your own generator of raw audio chunks.
    for chunk in mic_stream():
        connection.send(chunk)

    connection.finish()  # close the stream once the audio source is exhausted
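
With `interim_results=True` the callback receives revisable partial transcripts as well as final ones, and printing every result produces repeated text. A minimal sketch of how the two kinds might be merged, assuming each result carries a boolean final flag; `TranscriptAssembler` is a hypothetical helper and is not part of the Deepgram SDK.

```python
class TranscriptAssembler:
    """Keep finalized text separate from the still-changing interim tail."""

    def __init__(self) -> None:
        self.final_parts: list[str] = []
        self.interim = ""

    def update(self, text: str, is_final: bool) -> str:
        if is_final:
            if text:
                self.final_parts.append(text)
            self.interim = ""
        else:
            self.interim = text  # interim results replace each other
        return self.current()

    def current(self) -> str:
        tail = [self.interim] if self.interim else []
        return " ".join(self.final_parts + tail)


assembler = TranscriptAssembler()
assembler.update("what's the", is_final=False)
assembler.update("what's the weather", is_final=False)
assembler.update("what's the weather in Seoul", is_final=True)
live_view = assembler.update("and tomorrow", is_final=False)
print(live_view)
```

The interim tail is what a low-latency pipeline can forward to the LLM early, at the risk of the text changing before it is finalized.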

14. Full-duplex voice agents — LiveKit, Pipecat, Vapi

Voice agents go beyond plain TTS / STT. They handle **turn management, interruption, VAD, and tool calling** together.

LiveKit Agents

LiveKit Agents is a Python full-stack voice-agent framework on top of the WebRTC backbone. Cartesia is the default TTS.

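    from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
    from livekit.plugins import openai, cartesia, deepgram, silero

    class Assistant(Agent):
        def __init__(self):
            # Agent needs a system prompt; this one is a placeholder.
            super().__init__(instructions="You are a helpful voice assistant.")

        async def on_enter(self):
            # Greet the caller as soon as the agent joins the session.
            await self.session.say("Hello, how can I help you?")

    async def entrypoint(ctx: JobContext):
        await ctx.connect()
        session = AgentSession(
            stt=deepgram.STT(model="nova-3"),
            llm=openai.LLM(model="gpt-4o"),
            tts=cartesia.TTS(voice="..."),
            vad=silero.VAD.load(),
        )
        await session.start(agent=Assistant(), room=ctx.room)

    if __name__ == "__main__":
        cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))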

Pipecat

Pipecat is a Python voice-agent framework sponsored by Daily.co. More modular than LiveKit and strong in vision + audio multimodal.

Vapi · Retell AI · Bland AI

Three companies offering voice-agent SaaS.

- **Vapi** — fastest-growing, both no-code and API

- **Retell AI** — Y Combinator alum, strong telephony integration

- **Bland AI** — US contact-center focus, 0.09 USD per minute

SaaS gets you running fast with SIP / Twilio integration done, but the monthly bill grows faster than a direct LiveKit + Cartesia stack.

15. Full-duplex LLM — Realtime API, Gemini Live, Claude Voice

The second path to replacing the classic pipeline is LLM-native voice.

| Model | Released | Notes |
| --- | --- | --- |
| OpenAI Realtime API (gpt-4o-realtime) | 2024-10 | WebSocket, 8 voices |
| Google Gemini 2.5 Live | 2025 | Multimodal video included |
| Anthropic Claude voice mode | 2025 | Mobile app, Sonnet-based |
| Mistral Voxtral | 2025 | Open 3B/24B voice model |

The upside of LLM-native voice is natural **emotion, interruption, and backchannel ("uh-huh", "right")** handling. The downside is locked-in TTS and a smaller voice catalog.

16. Interruption / VAD / barge-in — The invisible essentials

Much of why voice agents feel awkward comes down to **interruption handling**. Humans cut off unfinished sentences, drop in backchannel words, and start the next turn instantly. The relevant stack:

- **VAD (Voice Activity Detection)**: Silero VAD is the de-facto standard. 30-50ms detection of speech start / end.

- **Turn Detection**: More than silence detection; decides whether a turn is "done". Examples are the LiveKit Turn Detector (2026) and the turn detection built into OpenAI Realtime.

- **Barge-in**: Cuts AI TTS the moment the user speaks and flips to listening mode.

- **Endpointing**: Decides when the user's utterance has ended so the LLM can be triggered. Low-latency setups fire on partial STT output instead of waiting for a final transcript.
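
The four pieces above fit together as a small state machine. The sketch below is one hypothetical way to wire them, with VAD and endpointing events as inputs and barge-in as the rule that user speech always wins over agent playback. The class and method names are illustrative and are not taken from LiveKit, Pipecat, or any other framework.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


class TurnController:
    """Tiny state machine: user speech always wins over agent playback."""

    def __init__(self) -> None:
        self.state = State.LISTENING
        self.cancelled_playbacks = 0

    def on_user_speech_start(self) -> None:
        # Fired by the VAD. If the agent is talking, this is a barge-in.
        if self.state is State.SPEAKING:
            self.cancelled_playbacks += 1  # real code would stop TTS audio here
        self.state = State.LISTENING

    def on_user_turn_end(self) -> None:
        # Fired by endpointing or turn detection.
        if self.state is State.LISTENING:
            self.state = State.THINKING

    def on_agent_audio_start(self) -> None:
        if self.state is State.THINKING:
            self.state = State.SPEAKING

    def on_agent_audio_end(self) -> None:
        if self.state is State.SPEAKING:
            self.state = State.LISTENING


ctrl = TurnController()
ctrl.on_user_turn_end()        # endpointing fired
ctrl.on_agent_audio_start()    # first TTS audio reaches the speaker
ctrl.on_user_speech_start()    # VAD hears the user mid-reply: barge-in
print(ctrl.state, ctrl.cancelled_playbacks)
```

A production controller would also discard queued TTS audio and cancel the in-flight LLM request on barge-in; this sketch only tracks the state transitions.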

Silero VAD usage:

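    import torch

    # Download the model and helper functions from the Silero VAD repository.
    vad, utils = torch.hub.load(
        "snakers4/silero-vad", "silero_vad", trust_repo=True
    )
    (get_speech_timestamps, _, read_audio, *_) = utils

    # Returns a list of {"start": ..., "end": ...} dicts measured in samples.
    audio = read_audio("test.wav", sampling_rate=16000)
    ts = get_speech_timestamps(audio, vad, sampling_rate=16000)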
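
The timestamps come back as sample offsets, which is enough to build a naive endpointing rule: the turn is over once trailing silence exceeds a threshold. A minimal sketch, assuming the list-of-dicts shape with `start` and `end` keys; the 500 ms threshold and the helper names are assumptions, and the sample data is hand-written so the code runs without torch.

```python
SAMPLE_RATE = 16000


def to_seconds(timestamps: list[dict],
               sample_rate: int = SAMPLE_RATE) -> list[tuple[float, float]]:
    """Convert sample offsets to (start, end) pairs in seconds."""
    return [(t["start"] / sample_rate, t["end"] / sample_rate) for t in timestamps]


def trailing_silence_ms(timestamps: list[dict], total_samples: int,
                        sample_rate: int = SAMPLE_RATE) -> float:
    """Silence between the last speech segment and the end of the buffer."""
    if not timestamps:
        return total_samples / sample_rate * 1000
    return (total_samples - timestamps[-1]["end"]) / sample_rate * 1000


def turn_is_over(timestamps: list[dict], total_samples: int,
                 threshold_ms: float = 500) -> bool:
    """True once speech was heard and enough silence has followed it."""
    return bool(timestamps) and trailing_silence_ms(timestamps, total_samples) >= threshold_ms


# Hand-written timestamps in the same shape as the VAD output.
sample_ts = [{"start": 8000, "end": 24000}, {"start": 32000, "end": 40000}]
print(to_seconds(sample_ts), turn_is_over(sample_ts, total_samples=48000))
```

A fixed silence threshold is the simplest rule and also the reason pipelines spend about 200 ms on endpointing; the turn detectors mentioned earlier try to do better by looking at what was said.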

17. Cloning ethics — ELVIS Act, EU AI Act, SynthID

Voice cloning entered the public conversation after fake Biden audio surfaced in the 2024 New Hampshire primary. Legislation followed.

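- **Tennessee ELVIS Act** (effective July 2024): First US law criminalizing unauthorized AI cloning of voice and likeness.

- **EU AI Act** (in force since August 2024, with obligations phased in afterwards): Synthetic audio such as cloned voices falls under transparency obligations, meaning it must be disclosed as AI-generated.

- **California AB 2839** (2024): Bans election-period deepfakes.

- **US FCC** (February 2024): Ruled that AI-generated voices in robocalls count as "artificial" under the TCPA, making such calls illegal without prior consent.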

Counter-tech:

- **SynthID Audio** (Google DeepMind) — Sub-audible watermark.

- **Resemble Detect** — Resemble AI's fake-voice detector.

- **AntiFake** (Washington University) — Speech perturbation that resists TTS training.

Commercial TTS vendors now require recorded-consent flows. ElevenLabs asks the speaker to record "I have the right to clone this voice".

18. Korea — Typecast, Clova, Kakao, HyperCLOVA X Voice

The Korean market is firmly held by domestic players.

- **Typecast (Neosapience)** — #1 in Korea. Content creators, advertising, audiobooks. Very strong on video-to-voice consistency.

- **Naver Clova Voice / Clova Dubbing** — 50+ Korean voices; Clova Dubbing auto-dubs video subtitles.

- **HyperCLOVA X Voice** — Voice-agent SDK pairing Naver's LLM with TTS.

- **Kakao TTS / Kakao i Voice** — Integrated with KakaoTalk chatbots and Kakao i.

- **AI Tester (NFly)** — Focused on advertising voices.

Note: foreign TTS still sounds awkward on Korean prosody and loanword pronunciation. Typecast and Clova are overwhelmingly the most natural.

Typecast API call:

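    import requests

    # Endpoint and field names as given in this article; check Typecast's current API reference.
    resp = requests.post(
        "https://typecast.ai/api/speak",
        headers={"Authorization": "Bearer ..."},
        json={
            "actor_id": "5c3b3...",
            "text": "Typecast TTS synthesis demo.",
            "lang": "ko",
            "tempo": 1.0,
        },
        timeout=30,
    )
    resp.raise_for_status()  # fail loudly on HTTP errors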

Pricing: Typecast about 1.5 KRW per 100 characters, Clova about 4 KRW per 200 characters.
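
The two prices are quoted per different character counts, so comparing them needs a common unit. A small worked calculation that scales both to 1M characters; the exchange rate is a round-number assumption for illustration only, not a current quote.

```python
def per_million(price: float, per_chars: int) -> float:
    """Scale 'price per N characters' to 'price per 1,000,000 characters'."""
    return price * 1_000_000 / per_chars


def krw_to_usd(krw: float, krw_per_usd: float) -> float:
    return krw / krw_per_usd


typecast_krw = per_million(1.5, 100)  # 1.5 KRW per 100 characters
clova_krw = per_million(4, 200)       # 4 KRW per 200 characters

# Round-number rate chosen for easy arithmetic; substitute the current one.
ASSUMED_KRW_PER_USD = 1_000
print(typecast_krw, krw_to_usd(typecast_krw, ASSUMED_KRW_PER_USD))
print(clova_krw, krw_to_usd(clova_krw, ASSUMED_KRW_PER_USD))
```

At that assumed rate the results line up with the approximate USD figures for Typecast and Clova in the section 20 pricing table.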

19. Japan — CoeFont, VOICEVOX, Synthesizer V

Japan plays a different game. The model is **character voices + marketplace**.

- **CoeFont** — Marketplace of 10,000+ voices. Voice actors register and sell their own voice.

- **Rinna Japanese TTS** — Japanese open TTS from the ex-Microsoft Rinna team.

- **VOICEROID / VOICEVOX** — VOICEVOX is free; characters like Zundamon and Shikoku Metan are the standard on YouTube and Niconico.

- **Synthesizer V** — Singing synthesis across Japanese, Chinese, and Korean vocals.

- **AI Voice Project (AIVoice)** — Licensed reproduction of professional Japanese voice actors.

Note: Japan requires checking **commercial-use terms per character**. Even within VOICEVOX, terms differ by character.

Choice criteria:

- Business / contact center → CoeFont, Rinna

- YouTube / games / doujin content → VOICEVOX

- Singing → Synthesizer V

20. Pricing — Per 1M characters / per minute

Price gaps span an order of magnitude. May 2026 snapshot.

| Tool | Per 1M chars | Full-duplex per minute |
| --- | --- | --- |
| ElevenLabs Multilingual v3 | 180 USD | 0.30 USD |
| ElevenLabs Flash v2.5 | 90 USD | 0.15 USD |
| Cartesia Sonic 3 | 65 USD | 0.11 USD |
| Play.HT 3 | 120 USD | 0.20 USD |
| OpenAI tts-1 | 15 USD | 0.06 USD |
| OpenAI gpt-4o-mini-tts | 12 USD | 0.05 USD |
| OpenAI Realtime API | - | 0.06 USD |
| Hume EVI 2 | - | 0.072 USD |
| Fish Audio 1.5 | 12 USD | 0.04 USD |
| Deepgram Aura | 15 USD | 0.05 USD |
| Amazon Polly Generative | 30 USD | 0.08 USD |
| Google Cloud TTS Studio | 160 USD | 0.27 USD |
| Azure Custom Neural | 24 USD | 0.07 USD |
| Typecast | ~15 USD | - |
| Naver Clova Voice | ~20 USD | - |
| CoeFont | ~30 USD | - |
| Vapi (full agent) | - | 0.08 USD |
| Retell AI | - | 0.075 USD |
| Bland AI | - | 0.09 USD |

For startups, OpenAI tts-1 / Fish Audio / Cartesia offer the best cost performance. For enterprise quality, go ElevenLabs / Typecast / Clova.
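
To turn the per-character rates into a monthly bill, multiply by expected volume. A minimal sketch using a few rows copied from the table above; the 5M-character workload is a made-up example, and the function names are illustrative.

```python
# USD per 1M characters, copied from the table above (May 2026 snapshot).
USD_PER_M_CHARS = {
    "ElevenLabs Multilingual v3": 180,
    "ElevenLabs Flash v2.5": 90,
    "Cartesia Sonic 3": 65,
    "OpenAI tts-1": 15,
    "Fish Audio 1.5": 12,
}


def monthly_cost(chars_per_month: int, usd_per_m: float) -> float:
    """Metered cost in USD for one month at the given character volume."""
    return chars_per_month / 1_000_000 * usd_per_m


def rank_by_cost(chars_per_month: int) -> list[tuple[str, float]]:
    """Vendors sorted from cheapest to most expensive for this volume."""
    costs = {name: monthly_cost(chars_per_month, rate)
             for name, rate in USD_PER_M_CHARS.items()}
    return sorted(costs.items(), key=lambda item: item[1])


# Hypothetical workload: 5M characters a month.
ranking = rank_by_cost(5_000_000)
for name, cost in ranking:
    print(f"{name}: {cost:.0f} USD")
```

The ranking ignores quality, latency, and language coverage, so treat it as one input to the decision rather than the answer.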

21. Who should pick what

A picking matrix.

| Goal | Recommendation |
| --- | --- |
| English full-duplex conversation | OpenAI Realtime API |
| Multilingual voice agent | LiveKit Agents + Cartesia |
| English audiobooks / dubbing | ElevenLabs Studio |
| Emotional companion chatbot | Hume EVI 2 |
| Showcase naturalness demo | Sesame Maya / Miles |
| Chinese-language content | Fish Audio |
| Korean-language content | Typecast, Naver Clova |
| Japanese character voice | VOICEVOX |
| Japanese marketplace | CoeFont |
| Open source / self-hosted | F5-TTS, CosyVoice 2 |
| Contact-center SaaS | Vapi, Retell AI, Bland AI |
| Mobile / edge STT | Moonshine, Distil-Whisper |
| Fast STT | Deepgram Nova-3 |
| Accurate STT | OpenAI gpt-4o-transcribe |
| Enterprise default | Polly, Google TTS, Azure Speech |

Three decision axes:

1. **Latency vs quality** — Cartesia / Realtime are fast; ElevenLabs / Sesame are rich.

2. **API vs self-hosted** — APIs are fast to ship; open models keep data sovereignty.

3. **Global vs native** — Korean and Japanese have a naturalness gap that foreign TTS cannot yet close.

22. Use cases — What is actually paying right now

Where voice AI is generating revenue in 2026.

- **Contact-center automation** — Retell and Bland deployed at US real-estate and healthcare firms. Saving 5-15 USD per call.

- **Audiobook / podcast dubbing** — ElevenLabs Studio contracted with publishers. Cost down ten-fold per hour.

- **Game NPC voicing** — Sony, EA, Ubisoft all partnered with ElevenLabs or Resemble.

- **Language learning** — Duolingo Max and Speak use OpenAI Realtime.

- **Accessibility** — Apple and Microsoft integrate TTS at the OS level.

- **Ad / marketing dubbing** — Video dubbing is the largest single market.

- **Personal companion chatbots** — Character.AI and Replika use ElevenLabs / Cartesia.

The clearest revenue line is contact-center automation, followed by content dubbing.

23. Closing — The year voice became an interface

Five years ago voice AI was answering-machine grade. The 2026 version is different.

- 75ms-TTFW Cartesia, 32-language ElevenLabs, and unified-model OpenAI Realtime coexist.

- LiveKit Agents, Pipecat, Vapi, Retell, and Bland built the orchestration layer.

- Sesame and Hume reset the bar for emotion and naturalness.

- Deepgram Nova-3 pushed STT below 50ms latency.

- Typecast / Clova hold Korea, CoeFont / VOICEVOX hold Japan.

- ELVIS Act and EU AI Act drew the legal lines on cloning.

What remains is choosing what voice interface to build. May this article be that starting line.

