Voice AI & TTS 2026 Deep Dive - ElevenLabs · Cartesia Sonic · OpenAI Voice · Play.HT · Hume · Sesame · Fish Audio · Deepgram Aura
Prologue: The year voice became the mouth and ears of the LLM
As of May 2026, the phrase "voice AI" carries entirely different weight than it did five years ago.
- ElevenLabs v3 supports 32 languages, emotion tags, and 5-second cloning. It is the de-facto English-speaking TTS standard.
- Cartesia Sonic is the fastest commercial TTS with 75ms TTFW (Time To First Word) and is the default TTS in LiveKit Agents.
- OpenAI Realtime API has popularized the full-duplex model where STT, LLM, and TTS run over a single WebSocket.
- Google Gemini Live and Anthropic Claude voice mode have established LLM-native voice.
- Hume EVI 2 and Sesame's Maya/Miles demos (March 2025) redefined the limits of emotional and natural-sounding speech.
- Fish Audio, CosyVoice 2, and F5-TTS rapidly took share in the open-source and Chinese-speaking world.
- Deepgram Nova-3 pushed STT latency below 50ms; AssemblyAI Universal-2 and OpenAI GPT-4o transcribe compete on accuracy.
- Orchestration tools like LiveKit Agents, Pipecat, Vapi, Retell AI, and Bland AI made the voice-agent stack standard.
- Tennessee's ELVIS Act and the EU AI Act drew the first legal lines around voice cloning ethics.
- Korea is dominated by Typecast (Neosapience) and Naver Clova Dubbing; Japan by CoeFont and VOICEVOX.
This article maps the whole field — which tool owns which slot, which metrics actually matter, and what to choose for a new 2026 project.
1. The 2026 voice stack — a four-tier pipeline
Today's voice AI breaks into four layers.
[ Tier 1 ] Input - Microphone / WebRTC / SIP / Telephony
[ Tier 2 ] STT (ASR) - Deepgram Nova-3 / AssemblyAI / GPT-4o transcribe / Whisper v3 turbo
[ Tier 3 ] LLM - GPT-5 / Claude 4.5 / Gemini 2.5 Pro / Llama 4
[ Tier 4 ] TTS - ElevenLabs / Cartesia / OpenAI / Play.HT / Hume / Sesame
[ Sidecar ] Orchestration - LiveKit Agents / Pipecat / Vapi / Retell / Bland
[ Sidecar ] Interruption - VAD / barge-in / turn detection / endpointing
The classic STT → LLM → TTS pipeline is still the most common, but full-duplex LLM-native voice — proven by OpenAI Realtime and Gemini Live since 2025 — is taking ground fast.
| Stage | Key metric |
| --- | --- |
| STT | WER (Word Error Rate), first-partial latency, multilingual |
| LLM | TTFT (time to first token), TPS (tokens per second) |
| TTS | TTFW (time to first word), audio MOS, voice diversity |
| Full-duplex | End-to-end latency, interruption naturalness |
The conversational latency target is consistent: **under 300ms to first audio**.
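The "time to first ..." and throughput metrics in the table can all be measured the same way on any stream, whether it yields LLM tokens or TTS audio chunks. The following is a minimal sketch, not a vendor API: `measure_stream` and its injectable `clock` parameter are hypothetical names introduced here for illustration.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    stream: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a stream and return (time_to_first_item_s, items_per_s, count).

    For an LLM token stream the first value approximates TTFT and the second
    an overall tokens-per-second figure; for a TTS chunk stream the first
    value approximates time to first audio.
    """
    start = clock()
    first_item_delay = None
    count = 0
    for _ in stream:
        now = clock()
        if first_item_delay is None:
            first_item_delay = now - start
        count += 1
    total = clock() - start
    if count == 0:
        raise ValueError("stream produced no items")
    rate = count / total if total > 0 else float("inf")
    return first_item_delay, rate, count
```

Passing a fake `clock` makes the helper deterministic in unit tests; in production the default `time.perf_counter` is usually sufficient.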
2. The metrics — latency, latency, latency
The most often ignored yet most important number in voice AI is **the human perception threshold**.
- Under 200ms: feels like natural human conversation.
- 200-500ms: slightly off but tolerable.
- 500ms-1s: visibly slow.
- Over 1s: sounds like an answering machine.
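These bands are easy to encode as a helper for dashboards or alerting. A minimal sketch follows; the function name and band labels are illustrative, and the handling of the exact boundaries (500ms and 1s are assigned to the lower band) is an assumption, since the list above does not say which side they belong to.

```python
def perceived_latency(ms: float) -> str:
    """Map a response latency in milliseconds to the perception bands above."""
    if ms < 0:
        raise ValueError("latency cannot be negative")
    if ms < 200:
        return "natural"
    if ms <= 500:
        return "tolerable"
    if ms <= 1000:
        return "visibly slow"
    return "answering machine"
```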
Traditional pipelines accumulate latency like this:
| Stage | Latency |
| --- | --- |
| Mic | 10ms |
| VAD | 30ms |
| STT partial | 80ms |
| Endpoint | 200ms |
| LLM TTFT | 400ms |
| TTS TTFW | 150ms |
| Speaker | 30ms |
| Total | ~900ms |
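The budget can be checked with a few lines of arithmetic, which also shows which stages are worth attacking first. This is a minimal sketch: the dictionary keys are illustrative names, the numbers are the per-stage figures listed above, and the 300ms default is the first-audio target stated earlier.

```python
# Per-stage latencies in milliseconds, taken from the breakdown above.
BUDGET_MS = {
    "mic": 10,
    "vad": 30,
    "stt_partial": 80,
    "endpoint": 200,
    "llm_ttft": 400,
    "tts_ttfw": 150,
    "speaker": 30,
}


def total_latency(budget: dict) -> int:
    """Sum of all stage latencies for a strictly sequential pipeline."""
    return sum(budget.values())


def overshoot(budget: dict, target_ms: int = 300) -> int:
    """How many milliseconds the pipeline exceeds the target by."""
    return total_latency(budget) - target_ms


def slowest_stages(budget: dict, n: int = 2) -> list:
    """Names of the n stages that contribute the most latency."""
    return sorted(budget, key=budget.get, reverse=True)[:n]
```

With these numbers the LLM's time to first token and the endpointing wait account for two thirds of the total, which is why the streaming tricks target exactly those stages.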
Cutting this to under 300ms requires three tricks:
1. **Streaming STT** — Send partial results to the LLM without waiting for endpoint.
2. **Streaming LLM** — Stream the first token into the TTS immediately.
3. **Streaming TTS** — Emit audio at the word level.
OpenAI Realtime and Gemini Live fuse these steps inside the model itself and reach 200-400ms.
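In a classic pipeline, the glue between a streaming LLM and a streaming TTS is usually a small buffer that forwards text as soon as it reaches a speakable boundary instead of waiting for the full reply. The sketch below is one possible way to do that, not any vendor's method; `speakable_chunks`, the boundary set, and the `min_chars` threshold are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator

# Punctuation at which a partial reply is considered safe to synthesize.
BOUNDARIES = (".", "!", "?", ",", ";", ":")


def speakable_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into chunks that can be sent to a TTS engine.

    A chunk is emitted once the buffered text ends at a boundary and is at
    least min_chars long; whatever remains is flushed when the stream ends.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if len(text) >= min_chars and text.endswith(BOUNDARIES):
            yield text
            buffer = ""
    tail = buffer.strip()
    if tail:
        yield tail
```

A lower `min_chars` gets audio out sooner but gives the TTS less context for prosody, so the threshold is a latency versus naturalness trade-off.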
3. ElevenLabs v3 — The English-language TTS throne
ElevenLabs, founded in 2022, captured the TTS market faster than any prior player. v3 ships with:
- 32 languages, 60-second cloning, 5-second Instant Voice Clone (IVC)
- Emotion tags: anger, sadness, excitement, whisper, etc.
- ElevenLabs Conversational AI — single SDK with STT + LLM + TTS
- ElevenLabs Studio — long-form dubbing and audiobooks
- Voice Library — 50,000+ public voices
- ElevenLabs Reader — app for visually impaired and heavy readers
Python SDK example:
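```python
from elevenlabs.client import ElevenLabs
from elevenlabs import play

client = ElevenLabs(api_key="...")

# Synthesize the whole utterance, then play it locally.
# The model ID is the one quoted in this article; check the current model list before use.
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v3",
    text="Hello. This is the 2026 voice AI guide.",
)
play(audio)
```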
Streaming:
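```python
# `client` is an ElevenLabs client as in the previous example; `speaker` is a
# placeholder for any audio sink with a write(bytes) method.
# Depending on the SDK version, the streaming method may be named `stream`
# instead of `convert_as_stream`.
stream = client.text_to_speech.convert_as_stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # low-latency model
    text="Low-latency streaming example.",
)
for chunk in stream:
    speaker.write(chunk)
```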
Pricing (May 2026):
| Plan | Chars/mo | Price |
| --- | --- | --- |
| Free | 10k | Free |
| Starter | 30k | 5 USD |
| Creator | 100k | 22 USD |
| Pro | 500k | 99 USD |
| Scale | 2M | 330 USD |
| Enterprise | Custom | Contact |
Metered API runs about 180 USD per 1M characters for eleven_multilingual_v3, and roughly half that for eleven_flash_v2_5.
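To compare a subscription against metered usage for a given monthly volume, the table and the metered rate can be turned into a small lookup. This is a sketch using only the figures quoted above; `cheapest_plan` and `metered_cost` are hypothetical helper names, and Enterprise is omitted because its price is not public.

```python
from typing import Optional, Tuple

# (plan name, characters per month, USD per month) from the pricing table.
PLANS = [
    ("Free", 10_000, 0),
    ("Starter", 30_000, 5),
    ("Creator", 100_000, 22),
    ("Pro", 500_000, 99),
    ("Scale", 2_000_000, 330),
]


def cheapest_plan(chars_per_month: int) -> Optional[Tuple[str, int]]:
    """Return (name, price) of the cheapest plan covering the volume, or None."""
    for name, quota, price in sorted(PLANS, key=lambda p: p[2]):
        if quota >= chars_per_month:
            return name, price
    return None  # beyond Scale: Enterprise, price on request


def metered_cost(chars: int, usd_per_million: float = 180.0) -> float:
    """Metered API cost in USD at a given rate per 1M characters."""
    return chars / 1_000_000 * usd_per_million
```

For example, `cheapest_plan(25_000)` selects Starter, while `metered_cost(500_000, 90.0)` prices the same volume on the lower-cost flash rate.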
Strengths: audio quality, language coverage, Voice Library scale, integrated Conversational AI.
Weaknesses: price, some languages (Korean / Japanese) still less polished than English.
4. Cartesia Sonic — The fastest TTS
Cartesia was founded in 2023 by researchers behind the state space model line of work (S4, Mamba), including Albert Gu and Karan Goel. Their SSM (State Space Model) based Sonic TTS is famous for:
- **75ms TTFW** — by far the fastest commercial TTS
- Sonic-2 (2025) / Sonic-3 (2026) — multilingual, emotion, singing
- Default TTS for LiveKit Agents
- Voice cloning from 3-second samples
Python SDK call:
```python
from cartesia import Cartesia

client = Cartesia(api_key="...")

# streaming synthesis
ws = client.tts.websocket()
for output in ws.send(
    model_id="sonic-3",
    transcript="Low-latency voice demo.",
    voice_id="694f9389-aac1-45b6-b726-9d9369183238",
    output_format={
        "container": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 24000,
    },
):
    # `speaker` is a placeholder for any audio sink with a write(bytes) method.
    speaker.write(output.audio)
```
Pricing runs about 65 USD per 1M characters, well under half of ElevenLabs. The trade-off is that Korean and Japanese quality lags ElevenLabs by a notch.
Pick Cartesia when latency is non-negotiable, ElevenLabs when multilingual quality wins.
5. Play.HT 3 — Multilingual + Realtime
Play.HT is an LA-based company founded in 2016 that supports 30+ languages. The 3.0 highlights:
- PlayDialog — synthesizes conversations between two or more speakers
- Realtime API — 200ms TTFW
- 142 voices plus cloning
- LangChain and LlamaIndex integration
Python call:
```python
from pyht import Client, TTSOptions, Format

client = Client(user_id="...", api_key="...")
options = TTSOptions(
    voice="s3://voice-cloning-zero-shot/...",
    sample_rate=24000,
    format=Format.FORMAT_WAV,
)
# client.tts() yields audio chunks as they are generated;
# `speaker` is a placeholder audio sink.
for chunk in client.tts("Play.HT 3 demo.", options=options):
    speaker.write(chunk)
```
Pricing starts at 39 USD for 100k characters — a midpoint between ElevenLabs and Cartesia.
Highlight: PlayDialog is the strongest for two-person dialog naturalness. Popular for automated podcast generation.
6. OpenAI Voice — tts-1, gpt-4o-mini-tts, Realtime API
OpenAI started with tts-1 in late 2023 and filled out the stack through 2025-2026.
| Model | Use case | Notes |
| --- | --- | --- |
| tts-1 | Standard TTS | Fast, decent quality, 6 voices |
| tts-1-hd | High-quality TTS | Pricier but better audio |
| gpt-4o-mini-tts | Next-gen TTS | Instructable, emotion control |
| Realtime API (gpt-4o-realtime-preview) | Full-duplex voice | Unified STT+LLM+TTS |
Realtime API example:
```javascript
// Node.js, using the `ws` package (the browser WebSocket cannot send custom headers).
import WebSocket from 'ws'

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2026',
  {
    headers: {
      Authorization: 'Bearer YOUR_KEY',
      'OpenAI-Beta': 'realtime=v1',
    },
  }
)

// Configure the session once the socket is open.
ws.on('open', () => {
  ws.send(
    JSON.stringify({
      type: 'session.update',
      session: {
        modalities: ['text', 'audio'],
        voice: 'alloy',
        instructions: 'Be friendly and concise.',
        turn_detection: { type: 'server_vad' },
      },
    })
  )
})

// Audio arrives as base64-encoded deltas; `speaker` is a placeholder audio sink.
ws.on('message', (data) => {
  const evt = JSON.parse(data.toString())
  if (evt.type === 'response.audio.delta') {
    speaker.write(Buffer.from(evt.delta, 'base64'))
  }
})
```
Pricing (Realtime API): 100 USD per 1M input audio tokens and 200 USD per 1M output audio tokens. At launch OpenAI equated this to roughly 0.06 USD per minute of audio input and 0.24 USD per minute of audio output. tts-1 costs 15 USD per 1M characters and gpt-4o-mini-tts about 12 USD, the cheapest of the bunch.
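Per-minute figures for the Realtime API follow from the per-token prices once a token rate is fixed. The sketch below works through that arithmetic. The per-token prices are the ones quoted above; the token rates (600 input and 1,200 output audio tokens per minute) are assumptions introduced here, not figures from this article, and should be checked against current documentation.

```python
# Prices quoted in the text, in USD per 1M audio tokens.
INPUT_USD_PER_M = 100.0
OUTPUT_USD_PER_M = 200.0

# Assumed token rates per minute of audio (hypothetical, verify before use).
INPUT_TOKENS_PER_MIN = 600
OUTPUT_TOKENS_PER_MIN = 1200


def audio_cost_per_minute(tokens_per_min: int, usd_per_million: float) -> float:
    """USD cost of one minute of audio at a given token rate and price."""
    return tokens_per_min * usd_per_million / 1_000_000


def conversation_cost(user_minutes: float, agent_minutes: float) -> float:
    """USD cost of a call, billing user speech as input and agent speech as output."""
    return (
        user_minutes * audio_cost_per_minute(INPUT_TOKENS_PER_MIN, INPUT_USD_PER_M)
        + agent_minutes * audio_cost_per_minute(OUTPUT_TOKENS_PER_MIN, OUTPUT_USD_PER_M)
    )
```

Under these assumptions the agent's speech dominates the bill, so keeping responses concise lowers cost as well as latency.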
Strengths: price, integration, direct line to GPT models.
Weaknesses: smaller voice catalog than ElevenLabs or Cartesia.
7. Hume AI EVI 2 — Emotional voice interface
Hume AI treats emotion as a first-class ML target. EVI 2 (Empathic Voice Interface 2) does:
- Measures the speaker's emotion on 28 emotion dimensions
- Automatically modulates the response voice to match
- Full-duplex voice with ~700ms TTFW
- Tunes response tone to match the user's tone
Demos are striking but daily-conversation naturalness still trails OpenAI Realtime. Hume shines in emotion-sensitive verticals — medical consults, mental health, companionship chatbots.
Pricing is about 0.072 USD per minute.
8. Sesame — The Maya / Miles shockwave
Sesame is the company built by Oculus co-founder Brendan Iribe after acquiring Maven AI in 2024. The Maya and Miles voice demos released in March 2025 turned Twitter upside down.
- Natural breathing, hesitation, laughter
- Tone modulation that follows the user's emotion
- Persona consistency across long conversations
- Conversational Speech Model (CSM) 1B open-sourced for research
Demo: [https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice)
As of May 2026, the commercial API is still in limited preview, but the naturalness shown is something ElevenLabs, Hume, and OpenAI cannot yet match. Trade-off: English-centric for now.
9. Fish Audio Speech 1.5 — The Chinese-language leader
Fish Audio, based in China, has grown fast since 2024. Speech 1.5 strengths:
- #1 in Chinese naturalness — including regional dialects
- 30-second voice cloning
- 9 languages supported
- ~12 USD per 1M characters — very cheap
- Open-source Fish Speech v1.4
Choice criteria:
- Mandarin speakers, Chinese market → Fish Audio
- Korean / Japanese priority → Typecast / Clova / CoeFont
The follow-up OpenAudio S1 model is also public.
10. Deepgram Aura — TTS from an STT company
Deepgram has been an STT specialist since 2015. They launched their first TTS, Aura, in 2024.
- ~200ms TTFW
- ~15 USD per 1M characters — comparable to OpenAI tts-1
- 12 voices (English-centric)
- Bundles its STT plus Aura TTS as a full voice-agent SDK
Highlight: pairing STT and TTS with the same vendor simplifies invoicing, SLA, and security posture. The TTS quality alone is a notch below ElevenLabs or Cartesia.
11. Other TTS — Resemble, WellSaid, Coqui, F5-TTS
| Tool | Notes |
| --- | --- |
| Resemble AI | Cloning + security focus. Government / defense market |
| WellSaid Labs | US enterprise audience |
| Coqui TTS | Open source. Company shut down in 2024, community keeps it alive |
| F5-TTS (Shanghai Jiao Tong University, 2024) | 5-second cloning, explosively popular open source |
| MaskGCT | CUHK-Shenzhen (Amphion) + Guangzhou Quwan, 2024 open source |
| CosyVoice 2 | Alibaba 2025 — strong Chinese + English |
| GPT-SoVITS | Indie project, popular in Japan / China communities |
| OpenVoice v2 | MyShell.ai — cloning + multilingual |
| Bark, Vall-E-X, XTTS v2 | 2023-2024 legacy open models |
For open-source-first stacks in 2026, **F5-TTS** or **CosyVoice 2** is the top pick. F5-TTS surprises with 5-second clones; CosyVoice 2 has Alibaba's stable backing.
12. The cloud big three — Polly, Google TTS, Azure Speech
| Vendor | Notes | Pricing |
| --- | --- | --- |
| Amazon Polly | Neural + Generative voices, 90+ voices | 4 USD per 1M chars (Standard) |
| Google Cloud TTS | Studio, Neural2, Wavenet | 16 USD per 1M chars (Neural2 / Wavenet), 160 USD (Studio) |
| Azure Speech | Custom Neural Voice, strong multilingual | 16-30 USD per 1M chars |
These remain the default for enterprise, government, and regulated industries. Voice freshness and naturalness lag ElevenLabs and Cartesia by a generation, but AWS / GCP / Azure integration and SLA win the decision.
Microsoft Research's **NaturalSpeech 3** is academically top-tier but not GA. Google DeepMind's **Lyria 2** is music-focused yet overlaps with TTS for vocal synthesis.
13. STT — Deepgram Nova-3, AssemblyAI Universal-2, OpenAI
| Tool | Streaming latency | WER (English) | Languages |
| --- | --- | --- | --- |
| Deepgram Nova-3 | `<50ms` | 6.8% | 36 |
| AssemblyAI Universal-2 | 200ms | 5.7% | 70+ |
| OpenAI Whisper v3 turbo | Batch | 7.5% | 99 |
| OpenAI gpt-4o-transcribe | Streaming | 5.2% | 99+ |
| Gladia | 300ms | 6.5% | 100+ |
| Speechmatics | 250ms | 6.0% | 50+ |
| Rev AI | 300ms | 7.0% | 36 |
| Soniox | 80ms | 5.9% | 60+ |
When latency is non-negotiable, Nova-3 or Soniox. When WER matters most, GPT-4o transcribe or AssemblyAI.
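WER figures like those in the table are only comparable when computed the same way, so it helps to know what the number means: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Below is a minimal sketch of that standard definition. The lowercase and whitespace-split normalization is an assumption; real benchmarks apply more elaborate text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.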
Open source: **Whisper, WhisperX, Distil-Whisper, Vosk, Moonshine (Useful Sensors), Owl ASR**. Moonshine is rising as the mobile / edge-friendly pick.
Deepgram Nova-3 streaming STT example:

```python
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

dg = DeepgramClient(api_key="...")
connection = dg.listen.live.v("1")


def on_message(_, result, **kwargs):
    # Called for every interim and final transcript.
    print(result.channel.alternatives[0].transcript)


connection.on(LiveTranscriptionEvents.Transcript, on_message)
connection.start(LiveOptions(model="nova-3", language="en", interim_results=True))

# mic_stream() is a placeholder for any generator yielding audio bytes.
for chunk in mic_stream():
    connection.send(chunk)
connection.finish()
```
14. Full-duplex voice agents — LiveKit, Pipecat, Vapi
Voice agents go beyond plain TTS / STT. They handle **turn management, interruption, VAD, and tool calling** together.
LiveKit Agents
LiveKit Agents is a Python full-stack voice-agent framework on top of the WebRTC backbone. Cartesia is the default TTS.
```python
from livekit.agents import Agent, AgentSession, JobContext
from livekit.plugins import openai, cartesia, deepgram, silero


class Assistant(Agent):
    def __init__(self) -> None:
        # Agent requires instructions (the system prompt).
        super().__init__(instructions="You are a helpful voice assistant.")

    async def on_enter(self):
        # Greet the user as soon as the agent joins the session.
        await self.session.say("Hello, how can I help you?")


async def entrypoint(ctx: JobContext):
    session = AgentSession(
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(voice="..."),
        vad=silero.VAD.load(),
    )
    await session.start(agent=Assistant(), room=ctx.room)
```
Pipecat
Pipecat is a Python voice-agent framework sponsored by Daily.co. More modular than LiveKit and strong in vision + audio multimodal.
Vapi · Retell AI · Bland AI
Three companies offering voice-agent SaaS.
- **Vapi** — fastest-growing, both no-code and API
- **Retell AI** — Y Combinator alum, strong telephony integration
- **Bland AI** — US contact-center focus, 0.09 USD per minute
SaaS gets you running fast with SIP / Twilio integration done, but the monthly bill grows faster than a direct LiveKit + Cartesia stack.
15. Full-duplex LLM — Realtime API, Gemini Live, Claude Voice
The second path to replacing the classic pipeline is LLM-native voice.
| Model | Released | Notes |
| --- | --- | --- |
| OpenAI Realtime API (gpt-4o-realtime) | 2024-10 | WebSocket, 8 voices |
| Google Gemini 2.5 Live | 2025 | Multimodal video included |
| Anthropic Claude voice mode | 2025 | Mobile app, Sonnet-based |
| Mistral Voxtral | 2025 | Open 3B/24B voice model |
The upside of LLM-native voice is natural **emotion, interruption, and backchannel ("uh-huh", "right")** handling. The downside is locked-in TTS and a smaller voice catalog.
16. Interruption / VAD / barge-in — The invisible essentials
Ninety percent of why voice agents feel awkward comes from **interruption handling**. Humans cut off unfinished sentences, drop in backchannel words, and start the next turn instantly. The relevant stack:
- **VAD (Voice Activity Detection)**: Silero VAD is the de-facto standard. It detects speech start and end within 30-50ms.
- **Turn detection**: more than silence detection; it decides whether the speaker's turn is actually "done". Examples are the LiveKit Turn Detector (2026) and the turn detection built into OpenAI Realtime.
- **Barge-in**: cuts the AI's TTS the moment the user speaks and flips to listening mode.
- **Endpointing**: decides when the user's utterance has ended so the transcript can be finalized and the LLM triggered. Aggressive setups trigger the LLM on partial STT output.
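Barge-in in particular reduces to a small state machine driven by per-frame VAD decisions. The following is a minimal sketch of that idea, not the implementation used by any framework named here; `BargeInController` and its `min_speech_frames` debounce parameter are hypothetical.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


class BargeInController:
    """Decide when user speech should cut off the agent's TTS playback."""

    def __init__(self, min_speech_frames: int = 3):
        # Require a few consecutive speech frames so coughs or clicks
        # do not interrupt the agent.
        self.min_speech_frames = min_speech_frames
        self.state = State.LISTENING
        self._speech_run = 0

    def start_speaking(self) -> None:
        """Call when TTS playback begins."""
        self.state = State.SPEAKING
        self._speech_run = 0

    def on_vad_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True if TTS should be stopped now."""
        if self.state is not State.SPEAKING:
            return False
        self._speech_run = self._speech_run + 1 if is_speech else 0
        if self._speech_run >= self.min_speech_frames:
            self.state = State.LISTENING
            self._speech_run = 0
            return True
        return False
```

When `on_vad_frame` returns True, the caller would stop audio playback, cancel the pending LLM and TTS work, and hand the microphone stream back to the STT.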
Silero VAD usage:
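```python
import torch

# Load the pretrained Silero VAD model and its helper functions from torch.hub.
vad, utils = torch.hub.load(
    "snakers4/silero-vad", "silero_vad", trust_repo=True
)
(get_speech_timestamps, _, read_audio, *_) = utils

audio = read_audio("test.wav", sampling_rate=16000)
# Each entry is a dict with the "start" and "end" sample offsets of a speech segment.
ts = get_speech_timestamps(audio, vad, sampling_rate=16000)
```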
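Given per-frame speech decisions from a VAD, the simplest form of endpointing is a silence timer: the turn is declared finished once enough consecutive silent frames follow some speech. The sketch below illustrates only that baseline. `find_endpoint`, the 30ms frame size, and the 500ms silence threshold are assumptions for illustration, and model-based turn detectors go well beyond this.

```python
from typing import Iterable, Optional


def find_endpoint(
    frames: Iterable[bool],
    frame_ms: int = 30,
    silence_ms: int = 500,
) -> Optional[int]:
    """Return the index of the frame at which the turn ends, or None.

    frames holds one boolean per audio frame: True for speech, False for
    silence. Leading silence before any speech is ignored.
    """
    needed = -(-silence_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return index
    return None
```

A shorter `silence_ms` lowers response latency but cuts off speakers who pause mid-sentence, which is the gap that dedicated turn-detection models try to close.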
17. Cloning ethics — ELVIS Act, EU AI Act, SynthID
Voice cloning entered the public conversation after fake Biden audio surfaced in the 2024 New Hampshire primary. Legislation followed.
- **Tennessee ELVIS Act** (effective July 2024): first US law criminalizing unauthorized AI cloning of voice and likeness.
- **EU AI Act** (entered into force August 2024): AI-generated or manipulated audio, including cloned voices, falls under transparency obligations and must be disclosed as artificial.
- **California AB 2839** (2024): bans election-period deepfakes.
- **US FCC** (February 2024): ruled that AI-generated voices in robocalls count as "artificial" under the TCPA, making such calls illegal without prior consent.
Counter-tech:
- **SynthID Audio** (Google DeepMind) — Sub-audible watermark.
- **Resemble Detect** — Resemble AI's fake-voice detector.
- **AntiFake** (Washington University) — Speech perturbation that resists TTS training.
Commercial TTS vendors now require recorded-consent flows. ElevenLabs asks the speaker to record "I have the right to clone this voice".
18. Korea — Typecast, Clova, Kakao, HyperCLOVA X Voice
The Korean market is firmly held by domestic players.
- **Typecast (Neosapience)** — #1 in Korea. Content creators, advertising, audiobooks. Very strong on video-to-voice consistency.
- **Naver Clova Voice / Clova Dubbing** — 50+ Korean voices; Clova Dubbing auto-dubs video subtitles.
- **HyperCLOVA X Voice** — Voice-agent SDK pairing Naver's LLM with TTS.
- **Kakao TTS / Kakao i Voice** — Integrated with KakaoTalk chatbots and Kakao i.
- **AI Tester (NFly)** — Focused on advertising voices.
Note: foreign TTS still sounds awkward on Korean prosody and loanword pronunciation. Typecast and Clova are overwhelmingly the most natural.
Typecast API call:
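```python
import requests

# Endpoint and fields are as given in this article; the actor_id is truncated in the source.
resp = requests.post(
    "https://typecast.ai/api/speak",
    headers={"Authorization": "Bearer ..."},
    json={
        "actor_id": "5c3b3...",
        "text": "Typecast TTS synthesis demo.",
        "lang": "ko",
        "tempo": 1.0,
    },
)
```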
Pricing: Typecast about 1.5 KRW per 100 characters, Clova about 4 KRW per 200 characters.
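Because the two vendors quote prices per different character counts, normalizing to a common unit makes them comparable. A small arithmetic sketch follows, using only the KRW figures above; the helper names are illustrative, and the exchange rate is deliberately a required parameter because no rate is given in this article.

```python
def price_per_million_chars(price: float, per_chars: int) -> float:
    """Scale a price quoted per `per_chars` characters to a price per 1M characters."""
    return price * 1_000_000 / per_chars


def krw_to_usd(amount_krw: float, krw_per_usd: float) -> float:
    """Convert KRW to USD at a caller-supplied exchange rate."""
    return amount_krw / krw_per_usd


# Figures from the text: Typecast 1.5 KRW per 100 chars, Clova 4 KRW per 200 chars.
TYPECAST_KRW_PER_M = price_per_million_chars(1.5, 100)
CLOVA_KRW_PER_M = price_per_million_chars(4, 200)
```

This gives 15,000 KRW per 1M characters for Typecast and 20,000 KRW for Clova, so at these list prices Typecast is the cheaper of the two.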
19. Japan — CoeFont, VOICEVOX, Synthesizer V
Japan plays a different game. The model is **character voices + marketplace**.
- **CoeFont** — Marketplace of 10,000+ voices. Voice actors register and sell their own voice.
- **Rinna Japanese TTS** — Japanese open TTS from the ex-Microsoft Rinna team.
- **VOICEROID / VOICEVOX** — VOICEVOX is free; characters like Zundamon and Shikoku Metan are the standard on YouTube and Niconico.
- **Synthesizer V** — Singing synthesis across Japanese, Chinese, and Korean vocals.
- **AI Voice Project (AIVoice)** — Licensed reproduction of professional Japanese voice actors.
Note: Japan requires checking **commercial-use terms per character**. Even within VOICEVOX, terms differ by character.
Choice criteria:
- Business / contact center → CoeFont, Rinna
- YouTube / games / doujin content → VOICEVOX
- Singing → Synthesizer V
20. Pricing — Per 1M characters / per minute
Price gaps span an order of magnitude. May 2026 snapshot.
| Tool | Per 1M chars | Full-duplex per minute |
| --- | --- | --- |
| ElevenLabs Multilingual v3 | 180 USD | 0.30 USD |
| ElevenLabs Flash v2.5 | 90 USD | 0.15 USD |
| Cartesia Sonic 3 | 65 USD | 0.11 USD |
| Play.HT 3 | 120 USD | 0.20 USD |
| OpenAI tts-1 | 15 USD | 0.06 USD |
| OpenAI gpt-4o-mini-tts | 12 USD | 0.05 USD |
| OpenAI Realtime API | - | 0.06 USD |
| Hume EVI 2 | - | 0.072 USD |
| Fish Audio 1.5 | 12 USD | 0.04 USD |
| Deepgram Aura | 15 USD | 0.05 USD |
| Amazon Polly Generative | 30 USD | 0.08 USD |
| Google Cloud TTS Studio | 160 USD | 0.27 USD |
| Azure Custom Neural | 24 USD | 0.07 USD |
| Typecast | ~15 USD | - |
| Naver Clova Voice | ~20 USD | - |
| CoeFont | ~30 USD | - |
| Vapi (full agent) | - | 0.08 USD |
| Retell AI | - | 0.075 USD |
| Bland AI | - | 0.09 USD |
For startups, OpenAI tts-1 / Fish Audio / Cartesia offer the best cost performance. For enterprise quality, go ElevenLabs / Typecast / Clova.
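The two price columns are linked by an assumed speaking rate in characters per minute, and making that rate explicit helps when comparing rows. The sketch below shows the conversion in both directions; the helper names are hypothetical. As a worked example from the table, the ElevenLabs Multilingual v3 row (180 USD per 1M characters, 0.30 USD per minute) implies roughly 1,667 characters per minute, and not every row implies the same rate.

```python
def cost_per_minute(usd_per_million_chars: float, chars_per_minute: float) -> float:
    """Per-minute TTS cost for a given character price and speaking rate."""
    return usd_per_million_chars * chars_per_minute / 1_000_000


def implied_chars_per_minute(usd_per_million_chars: float, usd_per_minute: float) -> float:
    """Speaking rate that reconciles a per-character price with a per-minute price."""
    return usd_per_minute * 1_000_000 / usd_per_million_chars
```

Plugging in a measured characters-per-minute figure for the target language and voice gives a more reliable per-minute estimate than reading it off the table.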
21. Who should pick what
A picking matrix.
| Goal | Recommendation |
| --- | --- |
| English conversational full-duplex | OpenAI Realtime API |
| Multilingual voice agent | LiveKit Agents + Cartesia |
| English audiobooks / dubbing | ElevenLabs Studio |
| Emotional companion chatbot | Hume EVI 2 |
| Showcase naturalness demo | Sesame Maya / Miles |
| Chinese-language content | Fish Audio |
| Korean-language content | Typecast, Naver Clova |
| Japanese character voice | VOICEVOX |
| Japanese marketplace | CoeFont |
| Open source / self-hosted | F5-TTS, CosyVoice 2 |
| Contact-center SaaS | Vapi, Retell AI, Bland AI |
| Mobile / edge STT | Moonshine, Distil-Whisper |
| Fast STT | Deepgram Nova-3 |
| Accurate STT | OpenAI gpt-4o-transcribe |
| Enterprise default | Polly, Google TTS, Azure Speech |
Three decision axes:
1. **Latency vs quality** — Cartesia / Realtime are fast; ElevenLabs / Sesame are rich.
2. **API vs self-hosted** — APIs are fast to ship; open models keep data sovereignty.
3. **Global vs native** — Korean and Japanese have a naturalness gap that foreign TTS cannot yet close.
22. Use cases — What is actually paying right now
Where voice AI is generating revenue in 2026.
- **Contact-center automation** — Retell and Bland deployed at US real-estate and healthcare firms. Saving 5-15 USD per call.
- **Audiobook / podcast dubbing** — ElevenLabs Studio contracted with publishers. Cost down ten-fold per hour.
- **Game NPC voicing** — Sony, EA, Ubisoft all partnered with ElevenLabs or Resemble.
- **Language learning** — Duolingo Max and Speak use OpenAI Realtime.
- **Accessibility** — Apple and Microsoft integrate TTS at the OS level.
- **Ad / marketing dubbing** — Video dubbing is the largest single market.
- **Personal companion chatbots** — Character.AI and Replika use ElevenLabs / Cartesia.
The clearest revenue line is contact-center automation, followed by content dubbing.
23. Closing — The year voice became an interface
Five years ago voice AI was answering-machine grade. The 2026 version is different.
- 75ms-TTFW Cartesia, 32-language ElevenLabs, and unified-model OpenAI Realtime coexist.
- LiveKit Agents, Pipecat, Vapi, Retell, and Bland built the orchestration layer.
- Sesame and Hume reset the bar for emotion and naturalness.
- Deepgram Nova-3 pushed STT below 50ms latency.
- Typecast / Clova hold Korea, CoeFont / VOICEVOX hold Japan.
- ELVIS Act and EU AI Act drew the legal lines on cloning.
What remains is choosing what voice interface to build. May this article be that starting line.
References
- ElevenLabs — [https://elevenlabs.io/](https://elevenlabs.io/)
- ElevenLabs Conversational AI — [https://elevenlabs.io/conversational-ai](https://elevenlabs.io/conversational-ai)
- Cartesia — [https://cartesia.ai/](https://cartesia.ai/)
- Cartesia Sonic — [https://cartesia.ai/sonic](https://cartesia.ai/sonic)
- Play.HT — [https://play.ht/](https://play.ht/)
- OpenAI Realtime API — [https://platform.openai.com/docs/guides/realtime](https://platform.openai.com/docs/guides/realtime)
- OpenAI TTS — [https://platform.openai.com/docs/guides/text-to-speech](https://platform.openai.com/docs/guides/text-to-speech)
- Hume AI EVI — [https://hume.ai/products/empathic-voice-interface](https://hume.ai/products/empathic-voice-interface)
- Sesame Research — [https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice)
- Fish Audio — [https://fish.audio/](https://fish.audio/)
- Deepgram Aura — [https://deepgram.com/product/text-to-speech](https://deepgram.com/product/text-to-speech)
- Deepgram Nova-3 — [https://deepgram.com/learn/introducing-nova-3](https://deepgram.com/learn/introducing-nova-3)
- AssemblyAI Universal-2 — [https://www.assemblyai.com/blog/universal-2/](https://www.assemblyai.com/blog/universal-2/)
- OpenAI Whisper — [https://openai.com/research/whisper](https://openai.com/research/whisper)
- LiveKit Agents — [https://docs.livekit.io/agents/](https://docs.livekit.io/agents/)
- Pipecat — [https://www.pipecat.ai/](https://www.pipecat.ai/)
- Vapi — [https://vapi.ai/](https://vapi.ai/)
- Retell AI — [https://www.retellai.com/](https://www.retellai.com/)
- Bland AI — [https://www.bland.ai/](https://www.bland.ai/)
- Silero VAD — [https://github.com/snakers4/silero-vad](https://github.com/snakers4/silero-vad)
- Resemble AI — [https://www.resemble.ai/](https://www.resemble.ai/)
- WellSaid Labs — [https://wellsaidlabs.com/](https://wellsaidlabs.com/)
- Coqui TTS — [https://github.com/coqui-ai/TTS](https://github.com/coqui-ai/TTS)
- F5-TTS — [https://github.com/SWivid/F5-TTS](https://github.com/SWivid/F5-TTS)
- CosyVoice — [https://github.com/FunAudioLLM/CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- MaskGCT — [https://github.com/open-mmlab/Amphion](https://github.com/open-mmlab/Amphion)
- OpenVoice — [https://github.com/myshell-ai/OpenVoice](https://github.com/myshell-ai/OpenVoice)
- Moonshine — [https://github.com/usefulsensors/moonshine](https://github.com/usefulsensors/moonshine)
- Distil-Whisper — [https://github.com/huggingface/distil-whisper](https://github.com/huggingface/distil-whisper)
- Tennessee ELVIS Act — [https://www.capitol.tn.gov/Bills/113/Bill/HB2091.pdf](https://www.capitol.tn.gov/Bills/113/Bill/HB2091.pdf)
- EU AI Act — [https://artificialintelligenceact.eu/](https://artificialintelligenceact.eu/)
- SynthID — [https://deepmind.google/technologies/synthid/](https://deepmind.google/technologies/synthid/)
- Typecast — [https://typecast.ai/](https://typecast.ai/)
- Naver Clova Voice — [https://www.ncloud.com/product/aiService/css](https://www.ncloud.com/product/aiService/css)
- CoeFont — [https://coefont.cloud/](https://coefont.cloud/)
- VOICEVOX — [https://voicevox.hiroshiba.jp/](https://voicevox.hiroshiba.jp/)