Voice AI & TTS 2026 Deep Dive - ElevenLabs · Cartesia Sonic · OpenAI Voice · Play.HT · Hume · Sesame · Fish Audio · Deepgram Aura

Prologue — The year voice became the mouth and ears of the LLM

As of May 2026, the phrase "voice AI" carries entirely different weight than it did five years ago.

ElevenLabs v3 supports 32 languages, emotion tags, and 5-second cloning. It is the de facto standard for English-language TTS.
Cartesia Sonic is the fastest commercial TTS with 75ms TTFW (Time To First Word) and is the default TTS in LiveKit Agents.
OpenAI Realtime API has popularized the full-duplex model where STT, LLM, and TTS run over a single WebSocket.
Google Gemini Live and Anthropic Claude voice mode have established LLM-native voice.
Hume EVI 2 and Sesame's Maya/Miles demos (March 2025) redefined the limits of emotional and natural-sounding speech.
Fish Audio, CosyVoice 2, and F5-TTS rapidly took share in the open-source and Chinese-speaking world.
Deepgram Nova-3 pushed STT latency below 50ms; AssemblyAI Universal-2 and OpenAI GPT-4o transcribe compete on accuracy.
Orchestration tools like LiveKit Agents, Pipecat, Vapi, Retell AI, and Bland AI made the voice-agent stack standard.
Tennessee's ELVIS Act and the EU AI Act drew the first legal lines around voice cloning ethics.
Korea is dominated by Typecast (Neosapience) and Naver Clova Dubbing; Japan by CoeFont and VOICEVOX.

This article maps the whole field — which tool owns which slot, which metrics actually matter, and what to choose for a new 2026 project.

1. The 2026 voice stack — a four-tier pipeline

Today's voice AI breaks into four layers.

[ Tier 1 ] Input          - Microphone / WebRTC / SIP / Telephony
[ Tier 2 ] STT (ASR)      - Deepgram Nova-3 / AssemblyAI / GPT-4o transcribe / Whisper v3 turbo
[ Tier 3 ] LLM            - GPT-5 / Claude 4.5 / Gemini 2.5 Pro / Llama 4
[ Tier 4 ] TTS            - ElevenLabs / Cartesia / OpenAI / Play.HT / Hume / Sesame
[ Sidecar ] Orchestration - LiveKit Agents / Pipecat / Vapi / Retell / Bland
[ Sidecar ] Interruption  - VAD / barge-in / turn detection / endpointing

The classic STT → LLM → TTS pipeline is still the most common, but full-duplex LLM-native voice — proven by OpenAI Realtime and Gemini Live since 2025 — is taking ground fast.

Stage	Key metric
STT	WER (Word Error Rate), first-partial latency, multilingual
LLM	TTFT (time to first token), TPS (tokens per second)
TTS	TTFW (time to first word), audio MOS, voice diversity
Full-duplex	End-to-end latency, interruption naturalness

The conversational latency target is consistent: under 300ms to first audio.

2. The metrics — latency, latency, latency

The most often ignored yet most important number in voice AI is the human perception threshold.

Under 200ms: feels like natural human conversation.
200-500ms: slightly off but tolerable.
500ms-1s: visibly slow.
Over 1s: sounds like an answering machine.

Traditional pipelines accumulate latency like this:

Mic -> VAD -> STT partial -> Endpoint -> LLM TTFT -> TTS TTFW -> Speaker
10ms  30ms     80ms          200ms       400ms       150ms      30ms
                              Total: ~900ms
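
The budget above can be checked with a few lines of arithmetic. This is a minimal sketch: the stage values are copied from the diagram, the band boundaries come from the perception thresholds listed earlier, and the names `BUDGET_MS` and `classify` are illustrative only.

```python
# Per-stage latencies in milliseconds, copied from the budget above.
BUDGET_MS = {
    "mic": 10,
    "vad": 30,
    "stt_partial": 80,
    "endpoint": 200,
    "llm_ttft": 400,
    "tts_ttfw": 150,
    "speaker": 30,
}


def classify(total_ms: float) -> str:
    """Map a total latency onto the perception bands listed earlier."""
    if total_ms < 200:
        return "natural"
    if total_ms <= 500:
        return "tolerable"
    if total_ms <= 1000:
        return "visibly slow"
    return "answering machine"


total = sum(BUDGET_MS.values())
worst = max(BUDGET_MS, key=BUDGET_MS.get)  # stage with the largest share
print(total, classify(total), worst)
```

The largest single contributor is the LLM's time to first token, followed by endpointing, which is why the streaming tricks target those two stages first.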

Cutting this to under 300ms requires three tricks:

Streaming STT — Send partial results to the LLM without waiting for endpoint.
Streaming LLM — Stream the first token into the TTS immediately.
Streaming TTS — Emit audio at the word level.
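
The second and third tricks meet at the hand-off between the LLM and the TTS engine. One common approach, sketched here under the assumption that the TTS engine accepts text in clause-sized pieces, is to buffer streamed tokens and flush at punctuation. `chunk_for_tts`, `CLAUSE_END`, and `min_chars` are illustrative names, not part of any vendor SDK.

```python
from typing import Iterable, Iterator

CLAUSE_END = (".", "!", "?", ",", ";", ":")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into clause-sized chunks for a TTS engine."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush once the buffer ends on clause punctuation and is long enough
        # to give the synthesizer some prosodic context.
        if buf.rstrip().endswith(CLAUSE_END) and len(buf.strip()) >= min_chars:
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever is left when the stream ends


tokens = ["Sure", ",", " I", " can", " help", ".",
          " What", " city", " are", " you", " in", "?"]
chunks = list(chunk_for_tts(tokens))
print(chunks)
```

Flushing per clause rather than per token costs a few tokens of delay but tends to produce steadier intonation. Lowering `min_chars` trades prosody for speed.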

OpenAI Realtime and Gemini Live fuse these steps inside the model itself and reach 200-400ms.

3. ElevenLabs v3 — The English-language TTS throne

ElevenLabs, founded in 2022, captured the TTS market faster than any prior player. v3 ships with:

32 languages, 60-second cloning, 5-second Instant Voice Clone (IVC)
Emotion tags: anger, sadness, excitement, whisper, etc.
ElevenLabs Conversational AI — single SDK with STT + LLM + TTS
ElevenLabs Studio — long-form dubbing and audiobooks
Voice Library — 50,000+ public voices
ElevenLabs Reader — app for visually impaired and heavy readers

Python SDK example:

from elevenlabs.client import ElevenLabs
from elevenlabs import play

client = ElevenLabs(api_key="...")

audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_v3",  # v3 model ID; confirm against the current model list
    text="Hello. This is the 2026 voice AI guide.",
)

play(audio)

Streaming:

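# `client` is the ElevenLabs client created above.
# Recent SDK releases expose streaming as `stream`; older ones used `convert_as_stream`.
stream = client.text_to_speech.stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # low-latency model
    text="Low-latency streaming example.",
)

for chunk in stream:
    speaker.write(chunk)  # `speaker` is a placeholder for your audio output sink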
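
Since TTFW is the headline metric for streaming TTS, it helps to measure it on your own network rather than rely on published numbers. The sketch below wraps any chunk iterator and records when each chunk arrives. `timed_stream` is a hypothetical helper and `fake_tts_stream` is a stand-in for a real vendor stream. Time to first chunk is only a proxy for time to first word.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_stream(chunks: Iterable[bytes]) -> Iterator[Tuple[float, bytes]]:
    """Yield (milliseconds since first iteration, chunk) for each chunk."""
    start = time.perf_counter()
    for chunk in chunks:
        yield (time.perf_counter() - start) * 1000.0, chunk


def fake_tts_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor stream: three chunks, 20 ms apart."""
    for _ in range(3):
        time.sleep(0.02)
        yield b"\x00" * 320


timings = [ms for ms, _ in timed_stream(fake_tts_stream())]
first_chunk_ms = timings[0]
print(f"first chunk after {first_chunk_ms:.1f} ms")
```

The clock starts when iteration begins, so create the stream immediately before the loop. Otherwise request setup time may fall outside the measurement.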

Pricing (May 2026):

Plan	Chars/mo	Price
Free	10k	Free
Starter	30k	5 USD
Creator	100k	22 USD
Pro	500k	99 USD
Scale	2M	330 USD
Enterprise	Custom	Contact

Metered API usage runs about 180 USD per 1M characters for the v3 model, and roughly half that for eleven_flash_v2_5.
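
To compare the subscription plans with the metered rate, it can help to normalize each plan to an effective price per 1M characters. The sketch below uses only the figures from the plan table and assumes the full monthly quota is consumed. `PLANS` and `usd_per_million` are illustrative names.

```python
# (characters per month, USD per month) copied from the plan table above.
PLANS = {
    "Starter": (30_000, 5),
    "Creator": (100_000, 22),
    "Pro": (500_000, 99),
    "Scale": (2_000_000, 330),
}


def usd_per_million(chars: int, usd: float) -> float:
    """Effective price per 1M characters if the whole quota is used."""
    return usd * 1_000_000 / chars


effective = {
    name: round(usd_per_million(chars, usd), 2)
    for name, (chars, usd) in PLANS.items()
}
for name, price in effective.items():
    print(f"{name}: {price} USD per 1M characters")
```

On these numbers the plans do not get steadily cheaper per character as they grow, so it is worth running the arithmetic against your expected volume before committing.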

Strengths: audio quality, language coverage, Voice Library scale, integrated Conversational AI. Weaknesses: price, some languages (Korean / Japanese) still less polished than English.

4. Cartesia Sonic — The fastest TTS

Cartesia was founded in 2023 by state space model researchers, including Albert Gu (co-author of Mamba) and Karan Goel (co-author of S4). Their SSM (State Space Model) based Sonic TTS is famous for:

75ms TTFW — by far the fastest commercial TTS
Sonic-2 (2025) / Sonic-3 (2026) — multilingual, emotion, singing
Default TTS for LiveKit Agents
Voice cloning from 3-second samples

Python SDK call:

from cartesia import Cartesia

client = Cartesia(api_key="...")

# streaming synthesis
ws = client.tts.websocket()

for output in ws.send(
    model_id="sonic-3",
    transcript="Low-latency voice demo.",
    voice_id="694f9389-aac1-45b6-b726-9d9369183238",
    output_format={
        "container": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 24000,
    },
):
    speaker.write(output.audio)

Pricing runs about 65 USD per 1M characters, well under half of the roughly 180 USD that ElevenLabs charges for its v3 model. The trade-off is that Korean and Japanese quality lags ElevenLabs by a notch.

Pick Cartesia when latency is non-negotiable, ElevenLabs when multilingual quality wins.

5. Play.HT 3 — Multilingual + Realtime

Play.HT is an LA-based company founded in 2016 that supports 30+ languages. The 3.0 highlights:

PlayDialog — synthesizes conversations between two or more speakers
Realtime API — 200ms TTFW
142 voices plus cloning
LangChain and LlamaIndex integration

Python call:

from pyht import Client, TTSOptions, Format

client = Client(user_id="...", api_key="...")

options = TTSOptions(
    voice="s3://voice-cloning-zero-shot/...",
    sample_rate=24000,
    format=Format.FORMAT_WAV,
)

for chunk in client.tts("Play.HT 3 demo.", options=options):
    speaker.write(chunk)

Plan pricing starts at 39 USD for 100k characters. The metered rate of about 120 USD per 1M characters listed in section 20 falls between Cartesia and ElevenLabs Multilingual v3.

Highlight: PlayDialog is the strongest for two-person dialog naturalness. Popular for automated podcast generation.

6. OpenAI Voice — tts-1, gpt-4o-mini-tts, Realtime API

OpenAI started with tts-1 in November 2023 and filled out the stack through 2025-2026.

Model	Use case	Notes
tts-1	Standard TTS	Fast, decent quality, 6 voices
tts-1-hd	High-quality TTS	Pricier but better audio
gpt-4o-mini-tts	Next-gen TTS	Instructable, emotion control
Realtime API (gpt-4o-realtime-preview)	Full-duplex voice	Unified STT+LLM+TTS

Realtime API example:

import WebSocket from 'ws'

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
  {
    headers: {
      Authorization: 'Bearer YOUR_KEY',
      'OpenAI-Beta': 'realtime=v1',
    },
  }
)

ws.on('open', () => {
  ws.send(
    JSON.stringify({
      type: 'session.update',
      session: {
        modalities: ['text', 'audio'],
        voice: 'alloy',
        instructions: 'Be friendly and concise.',
        turn_detection: { type: 'server_vad' },
      },
    })
  )
})

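ws.on('message', (data) => {
  // `data` arrives as a Buffer; decode it before parsing the JSON event.
  const evt = JSON.parse(data.toString())
  if (evt.type === 'response.audio.delta') {
    // `speaker` is a placeholder for your audio output stream.
    speaker.write(Buffer.from(evt.delta, 'base64'))
  }
})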

Pricing (Realtime API): 100 USD per 1M input audio tokens and 200 USD per 1M output audio tokens, which works out to about 0.06 USD per minute of audio input, with audio output billed at the higher token rate. tts-1 costs 15 USD per 1M characters and gpt-4o-mini-tts about 12 USD, among the cheapest options covered here.
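
Token prices only become a per-minute figure once a tokens-per-minute rate is fixed. The sketch below uses the two token prices quoted above. The token rates (about one token per 100 ms of user audio and one per 50 ms of model audio) are assumptions that this article does not state, so treat the output as an estimate and check it against your own usage logs.

```python
INPUT_USD_PER_M = 100.0   # per 1M input audio tokens (from the text)
OUTPUT_USD_PER_M = 200.0  # per 1M output audio tokens (from the text)

# Assumed token rates, not stated in this article:
INPUT_TOKENS_PER_MIN = 600    # ~1 token per 100 ms of user audio
OUTPUT_TOKENS_PER_MIN = 1200  # ~1 token per 50 ms of model audio


def audio_cost_usd(user_minutes: float, model_minutes: float) -> float:
    """Estimate audio cost for a call from minutes spoken by each side."""
    cost_in = user_minutes * INPUT_TOKENS_PER_MIN * INPUT_USD_PER_M / 1_000_000
    cost_out = model_minutes * OUTPUT_TOKENS_PER_MIN * OUTPUT_USD_PER_M / 1_000_000
    return cost_in + cost_out


# A five-minute call in which each side speaks for half the time.
estimate = audio_cost_usd(user_minutes=2.5, model_minutes=2.5)
print(f"{estimate:.2f} USD")
```

Under these assumptions a minute of model speech costs several times more than a minute of user speech, so talkative agents are noticeably more expensive than the input-side figure suggests.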

Strengths: price, integration, direct line to GPT models. Weaknesses: smaller voice catalog than ElevenLabs or Cartesia.

7. Hume AI EVI 2 — Emotional voice interface

Hume AI treats emotion as a first-class ML target. EVI 2 (Empathic Voice Interface 2) does:

Measures the speaker's emotion on 28 emotion dimensions
Automatically modulates the response voice to match
Full-duplex voice with ~700ms TTFW
Tunes response tone to match the user's tone

Demos are striking but daily-conversation naturalness still trails OpenAI Realtime. Hume shines in emotion-sensitive verticals — medical consults, mental health, companionship chatbots.

Pricing is about 0.072 USD per minute.

8. Sesame — The Maya / Miles shockwave

Sesame is the company built by Oculus co-founder Brendan Iribe after acquiring Maven AI in 2024. The Maya and Miles voice demos released in March 2025 turned Twitter upside down.

Natural breathing, hesitation, laughter
Tone modulation that follows the user's emotion
Persona consistency across long conversations
Conversational Speech Model (CSM) 1B open-sourced for research

Demo: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice

As of May 2026, the commercial API is still in limited preview, but the naturalness shown is something ElevenLabs, Hume, and OpenAI cannot yet match. Trade-off: English-centric for now.

9. Fish Audio Speech 1.5 — The Chinese-language leader

Fish Audio, based in China, has grown fast since 2024. Speech 1.5 strengths:

#1 in Chinese naturalness — including regional dialects
30-second voice cloning
9 languages supported
~12 USD per 1M characters — very cheap
Open-source Fish Speech v1.4

Choice criteria:

Mandarin speakers, Chinese market → Fish Audio
Korean / Japanese priority → Typecast / Clova / CoeFont

The follow-up OpenAudio S1 model is also public.

10. Deepgram Aura — TTS from an STT company

Deepgram has been an STT specialist since 2015. They launched their first TTS, Aura, in 2024.

~200ms TTFW
~15 USD per 1M characters — comparable to OpenAI tts-1
12 voices (English-centric)
Bundles its STT plus Aura TTS as a full voice-agent SDK

Highlight: pairing STT and TTS with the same vendor simplifies invoicing, SLA, and security posture. The TTS quality alone is a notch below ElevenLabs or Cartesia.

11. Other TTS — Resemble, WellSaid, Coqui, F5-TTS

Tool	Notes
Resemble AI	Cloning + security focus. Government / defense market
WellSaid Labs	US enterprise audience
Coqui TTS	Open source. Company shut down in 2024, community keeps it alive
F5-TTS (SJTU, 2024)	5-second cloning, explosively popular open source
MaskGCT	CUHK-Shenzhen (Amphion toolkit), 2024 open source
CosyVoice 2	Alibaba 2025 — strong Chinese + English
GPT-SoVITS	Indie project, popular in Japan / China communities
OpenVoice v2	MyShell.ai — cloning + multilingual
Bark, Vall-E-X, XTTS v2	2023-2024 legacy open models

For open-source-first stacks in 2026, F5-TTS or CosyVoice 2 is the top pick. F5-TTS surprises with 5-second clones; CosyVoice 2 has Alibaba's stable backing.

12. The cloud big three — Polly, Google TTS, Azure Speech

Vendor	Notes	Pricing
Amazon Polly	Neural + Generative voices, 90+ voices	4 USD per 1M chars (Standard)
Google Cloud TTS	Studio, Neural2, Wavenet	16 USD per 1M chars (Neural2 / WaveNet), 160 USD (Studio)
Azure Speech	Custom Neural Voice, strong multilingual	16-30 USD per 1M chars

These remain the default for enterprise, government, and regulated industries. Voice freshness and naturalness lag ElevenLabs and Cartesia by a generation, but AWS / GCP / Azure integration and SLA win the decision.

Microsoft Research's NaturalSpeech 3 is academically top-tier but not GA. Google DeepMind's Lyria 2 is music-focused yet overlaps with TTS for vocal synthesis.

13. STT — Deepgram Nova-3, AssemblyAI Universal-2, OpenAI

Tool	Latency	WER (English)	Languages
Deepgram Nova-3	<50ms	6.8%	36
AssemblyAI Universal-2	200ms	5.7%	70+
OpenAI Whisper v3 turbo	Batch	7.5%	99
OpenAI gpt-4o-transcribe	Streaming	5.2%	99+
Gladia	300ms	6.5%	100+
Speechmatics	250ms	6.0%	50+
Rev AI	300ms	7.0%	36
Soniox	80ms	5.9%	60+

When latency is non-negotiable, Nova-3 or Soniox. When WER matters most, GPT-4o transcribe or AssemblyAI.
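
Because the table ranks vendors by WER, it is worth being able to compute it on your own test set. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a plain dynamic-programming version with no text normalization. Real benchmarks usually lowercase and strip punctuation first, which this example leaves out.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


score = wer("turn on the kitchen lights", "turn on kitchen light")
print(score)
```

In the example one word is dropped and one is substituted out of five reference words. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.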

Open source: Whisper, WhisperX, Distil-Whisper, Vosk, Moonshine (Useful Sensors), Owl ASR. Moonshine is rising as the mobile / edge-friendly pick.

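# Deepgram Nova-3 streaming STT example
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

dg = DeepgramClient(api_key="...")
connection = dg.listen.live.v("1")

def on_message(_, result, **kwargs):
    # Fires for both interim and final results; skip empty transcripts.
    transcript = result.channel.alternatives[0].transcript
    if transcript:
        print(transcript)

connection.on(LiveTranscriptionEvents.Transcript, on_message)
connection.start(LiveOptions(model="nova-3", language="en", interim_results=True))

# mic_stream() is a placeholder for your own generator of audio chunks.
for chunk in mic_stream():
    connection.send(chunk)

connection.finish()  # close the connection once the audio source is exhausted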
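
With interim results enabled, the same stretch of audio is reported several times as the recognizer revises its guess, so downstream code has to separate revisable partials from committed text. The sketch below shows one way to do that. It assumes each result carries a transcript string plus a flag marking it final (to my understanding Deepgram's live results expose this as `is_final`). `TranscriptBuffer` is a hypothetical helper, not part of the Deepgram SDK.

```python
class TranscriptBuffer:
    """Keep finalized text separate from the latest revisable partial."""

    def __init__(self) -> None:
        self.final_parts: list[str] = []
        self.partial = ""

    def update(self, transcript: str, is_final: bool) -> str:
        if is_final:
            if transcript:
                self.final_parts.append(transcript)
            self.partial = ""  # the partial has been superseded
        else:
            self.partial = transcript  # overwrite, never append, partials
        return self.text

    @property
    def text(self) -> str:
        parts = self.final_parts + ([self.partial] if self.partial else [])
        return " ".join(parts)


events = [
    ("book a", False),
    ("book a table", False),
    ("book a table for two", True),
    ("at", False),
    ("at seven", True),
]
buf = TranscriptBuffer()
for transcript, is_final in events:
    buf.update(transcript, is_final)
print(buf.text)
```

Sending `buf.text` to the LLM on every update supports the streaming-STT trick from section 2, at the cost of occasionally acting on a partial that is later revised.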

14. Full-duplex voice agents — LiveKit, Pipecat, Vapi

Voice agents go beyond plain TTS / STT. They handle turn management, interruption, VAD, and tool calling together.

LiveKit Agents

LiveKit Agents is a Python full-stack voice-agent framework on top of the WebRTC backbone. Cartesia is the default TTS.

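from livekit import agents
from livekit.agents import Agent, AgentSession, JobContext
from livekit.plugins import openai, cartesia, deepgram, silero

class Assistant(Agent):
    def __init__(self):
        # Agent requires instructions (the system prompt) at construction time.
        super().__init__(instructions="You are a helpful voice assistant.")

    async def on_enter(self):
        await self.session.say("Hello, how can I help you?")

async def entrypoint(ctx: JobContext):
    session = AgentSession(
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(voice="..."),
        vad=silero.VAD.load(),
    )
    await session.start(agent=Assistant(), room=ctx.room)

if __name__ == "__main__":
    # Run as a worker that joins rooms and invokes the entrypoint per job.
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))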

Pipecat

Pipecat is a Python voice-agent framework sponsored by Daily.co. More modular than LiveKit and strong in vision + audio multimodal.

Vapi · Retell AI · Bland AI

Three companies offering voice-agent SaaS.

Vapi — fastest-growing, both no-code and API
Retell AI — Y Combinator alum, strong telephony integration
Bland AI — US contact-center focus, 0.09 USD per minute

SaaS gets you running fast with SIP / Twilio integration done, but the monthly bill grows faster than a direct LiveKit + Cartesia stack.

15. Full-duplex LLM — Realtime API, Gemini Live, Claude Voice

The second path to replacing the classic pipeline is LLM-native voice.

Model	Released	Notes
OpenAI Realtime API (gpt-4o-realtime)	2024-10	WebSocket, 8 voices
Google Gemini 2.5 Live	2025	Multimodal video included
Anthropic Claude voice mode	2025	Mobile app, Sonnet-based
Mistral Voxtral	2025	Open 3B/24B voice model

The upside of LLM-native voice is natural emotion, interruption, and backchannel ("uh-huh", "right") handling. The downside is locked-in TTS and a smaller voice catalog.

16. Interruption / VAD / barge-in — The invisible essentials

Most of what makes voice agents feel awkward comes down to interruption handling. Humans cut off unfinished sentences, drop in backchannel words, and start the next turn instantly. The relevant stack:

VAD (Voice Activity Detection): Silero VAD is the de facto standard, detecting speech start and end within 30-50ms.
Turn Detection: More than silence detection; it decides whether a turn is actually "done". Examples are the LiveKit Turn Detector (2026) and the turn detection built into OpenAI Realtime.
Barge-in: Cuts the AI's TTS the moment the user speaks and flips to listening mode.
Endpointing: Decides when the user's utterance has ended so the transcript can be handed to the LLM. Aggressive setups trigger the LLM on partial STT output.
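
The four pieces above can be tied together in a small state machine. This is a minimal sketch assuming a VAD that emits one speech / non-speech decision per 30 ms frame and a fixed silence threshold for endpointing. Production turn detectors use learned models instead of a fixed timer. `TurnManager`, `State`, and the event names are hypothetical.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


class TurnManager:
    """Tracks whose turn it is, driven by per-frame VAD decisions."""

    def __init__(self, endpoint_silence_ms: int = 500) -> None:
        self.state = State.LISTENING
        self.endpoint_silence_ms = endpoint_silence_ms
        self.silence_ms = 0
        self.heard_speech = False
        self.events: list[str] = []  # actions the surrounding app should take

    def on_vad_frame(self, is_speech: bool, frame_ms: int = 30) -> None:
        if is_speech:
            self.silence_ms = 0
            if self.state in (State.THINKING, State.SPEAKING):
                # Barge-in: the user spoke over the agent, so stop and listen.
                self.events.append("cancel_response")
                self.state = State.LISTENING
            self.heard_speech = True
        elif self.state is State.LISTENING and self.heard_speech:
            self.silence_ms += frame_ms
            if self.silence_ms >= self.endpoint_silence_ms:
                # Endpoint: enough trailing silence, hand the turn to the LLM.
                self.events.append("start_llm")
                self.state = State.THINKING
                self.heard_speech = False
                self.silence_ms = 0

    def on_tts_started(self) -> None:
        self.state = State.SPEAKING

    def on_tts_finished(self) -> None:
        if self.state is State.SPEAKING:
            self.state = State.LISTENING


tm = TurnManager(endpoint_silence_ms=90)
for frame in (True, True, False, False, False):  # user talks, then pauses
    tm.on_vad_frame(frame)
tm.on_tts_started()      # agent starts answering
tm.on_vad_frame(True)    # user interrupts mid-answer
print(tm.events, tm.state)
```

The silence threshold is the main tuning knob: shorter values cut latency but clip users who pause mid-sentence, which is the gap that model-based turn detection is meant to close.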

Silero VAD usage:

import torch

vad, utils = torch.hub.load(
    "snakers4/silero-vad", "silero_vad", trust_repo=True
)

(get_speech_timestamps, _, read_audio, *_) = utils

audio = read_audio("test.wav", sampling_rate=16000)
ts = get_speech_timestamps(audio, vad, sampling_rate=16000)
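
To my understanding, `get_speech_timestamps` returns a list of dicts whose `start` and `end` values are sample indices by default. The sketch below converts such a list to seconds and measures trailing silence, which is the quantity an endpointing rule would compare against its threshold. The sample data and helper names are made up for illustration.

```python
from typing import Dict, List, Tuple


def to_seconds(
    timestamps: List[Dict[str, int]], sampling_rate: int = 16000
) -> List[Tuple[float, float]]:
    """Convert sample-index speech segments to (start_s, end_s) pairs."""
    return [
        (t["start"] / sampling_rate, t["end"] / sampling_rate)
        for t in timestamps
    ]


def trailing_silence_s(
    timestamps: List[Dict[str, int]], total_samples: int, sampling_rate: int = 16000
) -> float:
    """Seconds of silence after the last detected speech segment."""
    if not timestamps:
        return total_samples / sampling_rate
    return (total_samples - timestamps[-1]["end"]) / sampling_rate


# Hypothetical output shape for a 3-second clip sampled at 16 kHz.
sample_ts = [{"start": 8000, "end": 24000}, {"start": 32000, "end": 40000}]
segments = to_seconds(sample_ts)
silence = trailing_silence_s(sample_ts, total_samples=48000)
print(segments, silence)
```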

17. Cloning ethics — ELVIS Act, EU AI Act, SynthID

Voice cloning entered the public conversation after fake Biden audio surfaced in the 2024 New Hampshire primary. Legislation followed.

Tennessee ELVIS Act (effective July 2024): First US state law criminalizing unauthorized AI cloning of voice and likeness.
EU AI Act (in force since August 2024): Treats AI-generated or cloned voices as deepfakes subject to transparency obligations, so synthetic audio must be disclosed as such.
California AB 2839 (2024): Bans election-period deepfakes.
US FCC (2024): Declared robocalls that use AI-generated voices illegal.

Counter-tech:

SynthID Audio (Google DeepMind) — Sub-audible watermark.
Resemble Detect — Resemble AI's fake-voice detector.
AntiFake (Washington University) — Speech perturbation that resists TTS training.

Commercial TTS vendors now require recorded-consent flows. ElevenLabs asks the speaker to record "I have the right to clone this voice".

18. Korea — Typecast, Clova, Kakao, HyperCLOVA X Voice

The Korean market is firmly held by domestic players.

Typecast (Neosapience) — #1 in Korea. Content creators, advertising, audiobooks. Very strong on video-to-voice consistency.
Naver Clova Voice / Clova Dubbing — 50+ Korean voices; Clova Dubbing auto-dubs video subtitles.
HyperCLOVA X Voice — Voice-agent SDK pairing Naver's LLM with TTS.
Kakao TTS / Kakao i Voice — Integrated with KakaoTalk chatbots and Kakao i.
AI Tester (NFly) — Focused on advertising voices.

Note: foreign TTS still sounds awkward on Korean prosody and loanword pronunciation. Typecast and Clova are overwhelmingly the most natural.

Typecast API call:

import requests

resp = requests.post(
    "https://typecast.ai/api/speak",
    headers={"Authorization": "Bearer ..."},
    json={
        "actor_id": "5c3b3...",
        "text": "Typecast TTS synthesis demo.",
        "lang": "ko",
        "tempo": 1.0,
    },
)

Pricing: Typecast about 1.5 KRW per 100 characters, Clova about 4 KRW per 200 characters.
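
The two Korean prices are quoted per different character counts, so they are easier to compare after normalizing to 1M characters. The sketch below does that with the figures from the sentence above. The exchange rate is a made-up placeholder that you should replace with a current one.

```python
def krw_per_million(price_krw: float, per_chars: int) -> float:
    """Normalize a 'price per N characters' quote to KRW per 1M characters."""
    return price_krw * 1_000_000 / per_chars


def to_usd(krw: float, krw_per_usd: float) -> float:
    """Convert KRW to USD at a caller-supplied exchange rate."""
    return krw / krw_per_usd


typecast_krw = krw_per_million(1.5, 100)  # "1.5 KRW per 100 characters"
clova_krw = krw_per_million(4, 200)       # "4 KRW per 200 characters"

ASSUMED_KRW_PER_USD = 1400.0  # placeholder rate, not from the article
print(typecast_krw, round(to_usd(typecast_krw, ASSUMED_KRW_PER_USD), 2))
print(clova_krw, round(to_usd(clova_krw, ASSUMED_KRW_PER_USD), 2))
```

Normalizing this way puts both vendors in the same unit as the per-1M-character figures used for the other tools in this article.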

19. Japan — CoeFont, VOICEVOX, Synthesizer V

Japan plays a different game. The model is character voices + marketplace.

CoeFont — Marketplace of 10,000+ voices. Voice actors register and sell their own voice.
Rinna Japanese TTS — Japanese open TTS from the ex-Microsoft Rinna team.
VOICEROID / VOICEVOX — VOICEVOX is free; characters like Zundamon and Shikoku Metan are the standard on YouTube and Niconico.
Synthesizer V — Singing synthesis across Japanese, Chinese, and Korean vocals.
AI Voice Project (AIVoice) — Licensed reproduction of professional Japanese voice actors.

Note: Japan requires checking commercial-use terms per character. Even within VOICEVOX, terms differ by character.

Choice criteria:

Business / contact center → CoeFont, Rinna
YouTube / games / doujin content → VOICEVOX
Singing → Synthesizer V

20. Pricing — Per 1M characters / per minute

Price gaps span an order of magnitude. May 2026 snapshot.

Tool	Per 1M chars	Full-duplex per minute
ElevenLabs Multilingual v3	180 USD	0.30 USD
ElevenLabs Flash v2.5	90 USD	0.15 USD
Cartesia Sonic 3	65 USD	0.11 USD
Play.HT 3	120 USD	0.20 USD
OpenAI tts-1	15 USD	0.06 USD
OpenAI gpt-4o-mini-tts	12 USD	0.05 USD
OpenAI Realtime API	-	0.06 USD
Hume EVI 2	-	0.072 USD
Fish Audio 1.5	12 USD	0.04 USD
Deepgram Aura	15 USD	0.05 USD
Amazon Polly Generative	30 USD	0.08 USD
Google Cloud TTS Studio	160 USD	0.27 USD
Azure Custom Neural	24 USD	0.07 USD
Typecast	~15 USD	-
Naver Clova Voice	~20 USD	-
CoeFont	~30 USD	-
Vapi (full agent)	-	0.08 USD
Retell AI	-	0.075 USD
Bland AI	-	0.09 USD

For startups, OpenAI tts-1 / Fish Audio / Cartesia offer the best cost performance. For enterprise quality, go ElevenLabs / Typecast / Clova.
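
The table does not say how the per-minute column was derived from the per-character prices. The ElevenLabs, Cartesia, and Play.HT rows are consistent with roughly 1,667 characters of synthesized speech per minute (1M characters is about 600 minutes). That rate is inferred here from those rows and is not stated by the source. Other rows, such as the OpenAI ones, do not follow it. A minimal sketch of the conversion:

```python
CHARS_PER_MINUTE = 1_667  # assumption inferred from the rows below


def usd_per_minute(
    usd_per_million_chars: float, chars_per_minute: int = CHARS_PER_MINUTE
) -> float:
    """Convert a per-1M-character price into a per-minute price."""
    return usd_per_million_chars * chars_per_minute / 1_000_000


# Per-1M-character prices copied from the table above.
rows = {
    "ElevenLabs Multilingual v3": 180,
    "ElevenLabs Flash v2.5": 90,
    "Cartesia Sonic 3": 65,
    "Play.HT 3": 120,
}
per_minute = {name: round(usd_per_minute(p), 2) for name, p in rows.items()}
print(per_minute)
```

Swapping in your own characters-per-minute figure, measured from real transcripts, gives a more reliable budget than any published per-minute number.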

21. Who should pick what

A picking matrix.

Goal	Recommendation
English conversational full-duplex	OpenAI Realtime API
Multilingual voice agent	LiveKit Agents + Cartesia
English audiobooks / dubbing	ElevenLabs Studio
Emotional companion chatbot	Hume EVI 2
Showcase naturalness demo	Sesame Maya / Miles
Chinese-language content	Fish Audio
Korean-language content	Typecast, Naver Clova
Japanese character voice	VOICEVOX
Japanese marketplace	CoeFont
Open source / self-hosted	F5-TTS, CosyVoice 2
Contact-center SaaS	Vapi, Retell AI, Bland AI
Mobile / edge STT	Moonshine, Distil-Whisper
Fast STT	Deepgram Nova-3
Accurate STT	OpenAI gpt-4o-transcribe
Enterprise default	Polly, Google TTS, Azure Speech

Three decision axes:

Latency vs quality — Cartesia / Realtime are fast; ElevenLabs / Sesame are rich.
API vs self-hosted — APIs are fast to ship; open models keep data sovereignty.
Global vs native — Korean and Japanese have a naturalness gap that foreign TTS cannot yet close.

22. Use cases — What is actually paying right now

Where voice AI is generating revenue in 2026.

Contact-center automation: Retell and Bland are deployed at US real-estate and healthcare firms, saving 5-15 USD per call.
Audiobook / podcast dubbing: ElevenLabs Studio is contracted with publishers, cutting production cost per finished hour roughly ten-fold.
Game NPC voicing: Sony, EA, and Ubisoft have all partnered with ElevenLabs or Resemble.
Language learning: Duolingo Max and Speak use OpenAI Realtime.
Accessibility: Apple and Microsoft integrate TTS at the OS level.
Ad / marketing dubbing: Video dubbing is the largest single market.
Personal companion chatbots: Character.AI and Replika use ElevenLabs / Cartesia.

The clearest revenue line is contact-center automation, followed by content dubbing.

23. Closing — The year voice became an interface

Five years ago voice AI was answering-machine grade. The 2026 version is different.

75ms-TTFW Cartesia, 32-language ElevenLabs, and unified-model OpenAI Realtime coexist.
LiveKit Agents, Pipecat, Vapi, Retell, and Bland built the orchestration layer.
Sesame and Hume reset the bar for emotion and naturalness.
Deepgram Nova-3 pushed STT below 50ms latency.
Typecast / Clova hold Korea, CoeFont / VOICEVOX hold Japan.
ELVIS Act and EU AI Act drew the legal lines on cloning.

What remains is choosing what voice interface to build. May this article be that starting line.

References

ElevenLabs — https://elevenlabs.io/
ElevenLabs Conversational AI — https://elevenlabs.io/conversational-ai
Cartesia — https://cartesia.ai/
Cartesia Sonic — https://cartesia.ai/sonic
Play.HT — https://play.ht/
OpenAI Realtime API — https://platform.openai.com/docs/guides/realtime
OpenAI TTS — https://platform.openai.com/docs/guides/text-to-speech
Hume AI EVI — https://hume.ai/products/empathic-voice-interface
Sesame Research — https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
Fish Audio — https://fish.audio/
Deepgram Aura — https://deepgram.com/product/text-to-speech
Deepgram Nova-3 — https://deepgram.com/learn/introducing-nova-3
AssemblyAI Universal-2 — https://www.assemblyai.com/blog/universal-2/
OpenAI Whisper — https://openai.com/research/whisper
LiveKit Agents — https://docs.livekit.io/agents/
Pipecat — https://www.pipecat.ai/
Vapi — https://vapi.ai/
Retell AI — https://www.retellai.com/
Bland AI — https://www.bland.ai/
Silero VAD — https://github.com/snakers4/silero-vad
Resemble AI — https://www.resemble.ai/
WellSaid Labs — https://wellsaidlabs.com/
Coqui TTS — https://github.com/coqui-ai/TTS
F5-TTS — https://github.com/SWivid/F5-TTS
CosyVoice — https://github.com/FunAudioLLM/CosyVoice
MaskGCT — https://github.com/open-mmlab/Amphion
OpenVoice — https://github.com/myshell-ai/OpenVoice
Moonshine — https://github.com/usefulsensors/moonshine
Distil-Whisper — https://github.com/huggingface/distil-whisper
Tennessee ELVIS Act — https://www.capitol.tn.gov/Bills/113/Bill/HB2091.pdf
EU AI Act — https://artificialintelligenceact.eu/
SynthID — https://deepmind.google/technologies/synthid/
Typecast — https://typecast.ai/
Naver Clova Voice — https://www.ncloud.com/product/aiService/css
CoeFont — https://coefont.cloud/
VOICEVOX — https://voicevox.hiroshiba.jp/