SOTA Speech Recognition and Synthesis — From Whisper to Codec Language Models
Author: Youngju Kim (@fjvbn20031)
- Introduction
- The Big Picture: The Arc of Progress
- The Lineage of ASR
- Whisper: Large-Scale Weak Supervision
- Streaming vs Offline
- Training Techniques for Robustness
- The Evolution of TTS
- Neural Audio Codecs
- Codec Language Models: The VALL-E Family Concept
- Zero-Shot Voice Cloning and Ethics
- Multilingual Support
- Sound as Features: Mel Spectrograms
- Evaluation Metrics
- An Example Practical Pipeline
- Comparison: Approaches at a Glance
- Speaker Diarization and Auxiliary Tasks
- Speech-to-Speech and Unified Models
- End-to-End Flow Diagram
- On-Device and Model Compression
- Limitations and Caveats
- Closing
- References
Introduction
Speech technology is woven into daily life. Phone dictation, automatic meeting notes, navigation prompts, audiobook narration — all are products of speech recognition (ASR, Automatic Speech Recognition) and speech synthesis (TTS, Text-to-Speech).
In recent years the field has leapt forward through deep learning, especially large-scale pre-training and transformers. On the recognition side, large weakly supervised models such as Whisper emerged; on the synthesis side, we now have synthetic voices that are hard to distinguish from human speech, and codec language models that clone a voice from a short sample.
This article traces the lineage of speech recognition and synthesis while examining core architectures and principles. Since AI SOTA changes fast, we focus on concepts and structures rather than rankings or numbers, and name only models and papers we can cite with confidence.
The Big Picture: The Arc of Progress
Before the details, here is the arc of speech technology at a glance.
[ASR arc]
HMM/GMM -> DNN acoustic model -> CTC/attention end-to-end -> Whisper large weak-sup
[TTS arc]
concat/parametric -> Tacotron+neural vocoder -> neural codec -> codec language model
The overall trend has two threads. First, a shift from hand-assembling many components to learning end-to-end from data. Second, as data scaled and representations became tokenized, both recognition and synthesis came to share tools similar to large language models. Below we examine each stage in turn.
The Lineage of ASR
From Sound to Letters
The goal of ASR is to turn a waveform into text. The essence is aligning a continuous sound signal to a discrete sequence of letters. Sound flows over time and speaking speed varies, so matching which sound spans map to which letters is the central challenge.
[waveform] ~~~~~~/\/\~~~~/\~~~~~
| feature extraction (e.g., mel spectrogram)
v
[acoustic features] per-frame vectors
| acoustic model + alignment
v
[text] "hello"
The HMM Era
Early ASR was led by Hidden Markov Models (HMMs) combined with Gaussian Mixture Models (GMMs). The acoustic model mapped sound features to phone states, a pronunciation lexicon mapped phones to words, and a language model scored word sequences — a combination of many parts.
This was the standard for a long time but was complex with many separately built components. Later, deep neural networks (DNNs) replaced GMMs in the acoustic model, sharply improving accuracy, and the field gradually moved toward end-to-end neural networks.
CTC: A Breakthrough for Alignment
A key turning point for end-to-end ASR is CTC (Connectionist Temporal Classification). CTC lets you train without aligning sound frames to letters one by one.
The core idea is a blank token, plus summing the probabilities of all valid alignment paths to compute the final text probability.
frames: f1 f2 f3 f4 f5 f6 f7 f8
path 1: h h e l _ l o o
path 2: _ h e e l _ l o
(_ is blank; merge repeats, then drop blanks, to get "hello";
the blank between the two l's keeps them from merging into one)
Sum over all valid paths -> probability of "hello"
Thanks to CTC, training became possible from just sound-text pairs, without frame-level alignment labels. However, CTC assumes the outputs at different frames are conditionally independent given the audio, so it cannot model dependencies between output letters by itself; capturing linguistic context often required combining it with a separate language model.
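The collapse rule and the "sum over all valid paths" idea can be sketched in a few lines. This is a minimal illustration, not how production systems compute the loss: the function names and the toy probabilities are assumptions made for the example, and the brute-force enumeration is exponential in the number of frames (real implementations use a dynamic program over the same quantity).

```python
from itertools import product

BLANK = "_"

def ctc_collapse(path):
    """Merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

def ctc_probability(frame_probs, target):
    """Brute-force sum over every frame-level path that collapses to `target`.

    frame_probs: one dict per frame, mapping symbol -> probability.
    Each path's probability is the product of its per-frame probabilities,
    which is exactly the frame-independence assumption CTC makes.
    """
    symbols = sorted(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= frame_probs[t][sym]
            total += p
    return total

# Toy example: 3 frames, symbols {a, b, _}, uniform probabilities per frame.
uniform = [{"a": 1 / 3, "b": 1 / 3, BLANK: 1 / 3} for _ in range(3)]
p_ab = ctc_probability(uniform, "ab")
```

With three frames, five paths collapse to "ab" (`aab`, `abb`, `ab_`, `a_b`, `_ab`), so `p_ab` is 5/27 under the uniform toy distribution.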
Attention-Based Encoder-Decoder
Another line is the attention-based encoder-decoder. The encoder turns sound into representations, and the decoder uses attention to pick relevant sound spans and generate letters one at a time. Transformers made this structure much stronger.
CTC and attention are complementary, so hybrid structures using both were widely adopted: CTC stably handles monotonic alignment while attention flexibly reflects context.
Whisper: Large-Scale Weak Supervision
Whisper's Core
Whisper is a speech recognition model from OpenAI, introduced in "Robust Speech Recognition via Large-Scale Weak Supervision" (arXiv 2212.04356). As the name says, large-scale weak supervision is the core.
Whisper's hallmark is training on a vast amount of web speech-text pairs (680,000 hours, according to the paper). It showed that even imperfectly labeled data, in sufficient volume, can yield a robust model.
[large-scale web speech-text]
many languages, accents, noise, domains
|
v
[transformer encoder-decoder]
log-mel spectrogram input -> encoder
text token output <- decoder (attention)
|
v
[one model, many tasks]
multilingual recognition, translation, LID, timestamps
Multitask Design
Another hallmark is that one model handles many tasks. Special tokens issue instructions like "recognize this language," "translate to English," or "add timestamps," and the same model performs them. This multitask design handles diverse functions in one place without separate components.
Robustness is a strength. Whisper reportedly works reasonably well in noisy environments and across accents. That said, its base design processes audio in fixed 30-second windows, so it does not directly fit real-time streaming, and hallucination or repetition can occur on long audio.
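To make the special-token idea concrete, here is a small sketch that assembles the token prefix the decoder is conditioned on. The token strings follow the naming used in the Whisper paper and repository; the helper `build_decoder_prompt` is a hypothetical function written for this illustration and is not part of the `whisper` package.

```python
def build_decoder_prompt(language, task="transcribe", timestamps=False):
    """Return the special-token prefix that tells the decoder what to do.

    Hypothetical helper for illustration only; it builds strings and does
    not call any model.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Without this token the decoder is expected to emit timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens

# Same model, two different jobs, selected only by the prefix:
korean_transcript = build_decoder_prompt("ko", task="transcribe")
korean_to_english = build_decoder_prompt("ko", task="translate", timestamps=True)
```

The design point is that switching tasks changes only the conditioning prefix, not the weights or the architecture.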
Streaming vs Offline
ASR has two broad usage settings.
[offline (batch) recognition]
recognize after receiving the full audio
- pros: uses full context, better accuracy
- uses: meeting notes, subtitling, podcast transcription
[streaming (real-time) recognition]
recognize as sound arrives
- pros: low latency
- limits: cannot see future context, may revise partials
- uses: live captions, voice assistants, call recognition
Offline encoder-decoder models like Whisper favor accuracy but have high latency. Streaming calls for architectures that use little or no future context (e.g., transducer families). In practice, the two serve different purposes.
Training Techniques for Robustness
Real-world speech is messy with noise, reverberation, and varied microphone characteristics. A model trained only on clean data drops sharply in such conditions. So data augmentation, deliberately diversifying the training data, is widely used.
[common speech data augmentation]
- speed/pitch shift: slightly change speaking rate or pitch
- noise addition: mix in background noise, reverb
- SpecAugment: mask parts of the spectrogram in time/frequency
- volume/gain shift: vary loudness
SpecAugment, a simple technique that masks parts of the time and frequency axes on the mel spectrogram, greatly improved ASR and was widely adopted. Training on large, diverse data like Whisper is itself a way to gain robustness. The more the data captures real-world diversity, the better the model holds up after deployment.
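The masking step can be sketched with plain Python lists. This is a simplified version under stated assumptions: one frequency mask and one time mask per call, masked values set to zero, and parameter names (`max_freq_mask`, `max_time_mask`) chosen for this example. The published method also describes time warping and multiple masks, which are omitted here.

```python
import random

def spec_augment(spec, max_freq_mask=2, max_time_mask=3, rng=None):
    """Zero out one random band of mel bins and one random span of frames.

    spec: list of frames, each a list of mel-bin values (time x frequency).
    Returns a masked copy; the input is left unchanged.
    """
    rng = rng or random.Random()
    n_frames, n_bins = len(spec), len(spec[0])
    out = [row[:] for row in spec]

    # Frequency mask: width f, start f0 chosen so the band fits.
    f = rng.randint(0, min(max_freq_mask, n_bins))
    f0 = rng.randint(0, n_bins - f)
    # Time mask: same idea along the frame axis.
    t = rng.randint(0, min(max_time_mask, n_frames))
    t0 = rng.randint(0, n_frames - t)

    for ti in range(n_frames):
        for fi in range(n_bins):
            if f0 <= fi < f0 + f or t0 <= ti < t0 + t:
                out[ti][fi] = 0.0
    return out

spec = [[1.0] * 8 for _ in range(10)]          # 10 frames x 8 mel bins
masked = spec_augment(spec, rng=random.Random(0))
```

Because a fresh mask is drawn on every call, the model sees a different corrupted view of the same utterance each epoch.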
The Evolution of TTS
Early Days: Concatenative and Parametric
Early TTS included concatenative synthesis, stitching prerecorded speech fragments, and parametric synthesis, generating speech features via statistical models. Concatenative was natural in a specific speaker's voice but inflexible; parametric was flexible but sounded mechanical.
Tacotron: End-to-End Neural TTS
The turning point for deep learning TTS is the Tacotron family. Tacotron takes text and generates a mel spectrogram (a time-frequency representation of sound) via attention-based sequence-to-sequence. Then a vocoder converts the mel spectrogram into an actual waveform.
[text]
| text encoder + attention decoder (Tacotron family)
v
[mel spectrogram] time-frequency representation
| vocoder (waveform generation)
v
[waveform]
This two-stage structure, "text to mel spectrogram to waveform," was long the standard for neural TTS.
Neural Vocoders
The vocoder, which converts the mel spectrogram into a waveform, largely determines final audio quality. The early WaveNet sounded very natural but generated samples one at a time and was very slow. Later parallel vocoders (e.g., flow-based and GAN-based families) preserved quality while enabling much faster synthesis, and GAN-based vocoders in particular became popular for their balance of speed and quality.
Prosody and Style Control
Good synthesis goes beyond clear pronunciation to carry natural prosody. Prosody is intonation, stress, rhythm, and pauses — the elements that give a sentence its "feel." The same sentence can sound like a question, a statement, or surprise depending on prosody.
[examples of prosody control]
"Really"
- flat intonation -> a calm reaction
- rising ending -> a surprised question
- stressed intonation -> a strong exclamation
same words, different prosody -> different meaning/emotion
Early TTS struggled to control prosody finely, but recent models can adjust prosody and speaker style by conditioning on a reference voice, style tokens, or emotion labels. Codec language models' zero-shot cloning also mimics the reference's prosody, helping reproduce a speaker's manner from a short sample.
Neural Audio Codecs
Audio as Tokens
A key ingredient of modern speech synthesis is the neural audio codec. A neural codec compresses audio with a neural network, turning continuous sound into a discrete token sequence.
[continuous waveform]
| encoder
v
[vector quantization (VQ)] -> discrete token sequence (audio "words")
| decoder
v
[reconstructed waveform]
The key is representing audio as multiple layers of discrete codes via techniques like Residual Vector Quantization (RVQ). Now audio, like text, becomes a sequence of tokens. Representative neural codecs include SoundStream and EnCodec families.
Once audio is tokenized, a language model can predict audio tokens just as it predicts text tokens. That is the starting point for the codec language models next.
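Residual vector quantization can be illustrated on a single 2-D vector. The codebooks below are hand-made toy values (an assumption for the example; real codecs learn them jointly with the encoder and decoder), and the function names are chosen for this sketch.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage codes the residual left by the previous ones."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entry from each stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Toy codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]
codes = rvq_encode([1.2, 0.9], codebooks)      # one integer per stage
approx = rvq_decode(codes, codebooks)          # coarse + fine correction
```

Each frame of audio thus becomes a short stack of integers, one per layer, which is what "multiple layers of discrete codes" refers to.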
Codec Language Models: The VALL-E Family Concept
Synthesis as Language Modeling
Tokenizing speech with a neural codec lets us redefine TTS as "predict the next audio token." Given text and a short reference utterance as conditions, a language model generates the audio tokens that continue in that speaker's voice. The VALL-E family is a representative example of this direction.
[text] + [short reference speech (3s sample)]
| tokenize with codec
v
[condition: text tokens + reference speech tokens]
| language model (predict next audio token)
v
[generated audio token sequence]
| codec decoder
v
[speech synthesized in the reference speaker's voice]
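The generation loop in the diagram can be sketched as ordinary autoregressive decoding. Everything model-specific here is a stand-in: `make_toy_model` returns a deterministic placeholder rather than a trained network, `EOS` is a hypothetical end marker, and the sketch ignores the multiple codebook layers a real codec produces.

```python
EOS = -1  # hypothetical end-of-audio marker for this sketch

def generate_audio_tokens(text_tokens, ref_audio_tokens, next_token, max_new=50):
    """Extend the reference audio tokens one token at a time, conditioned on the text.

    next_token(text_tokens, audio_so_far) -> int stands in for the trained model.
    Returns only the newly generated tokens.
    """
    audio = list(ref_audio_tokens)      # the reference sample acts as the prompt
    for _ in range(max_new):
        tok = next_token(text_tokens, audio)
        if tok == EOS:
            break
        audio.append(tok)
    return audio[len(ref_audio_tokens):]

def make_toy_model(prompt_len, vocab_size=1024):
    """Placeholder 'model': two audio tokens per text token, then EOS."""
    def next_token(text_tokens, audio_so_far):
        step = len(audio_so_far) - prompt_len
        if step >= 2 * len(text_tokens):
            return EOS
        return (audio_so_far[-1] + text_tokens[step // 2]) % vocab_size
    return next_token

reference = [100, 101, 102]             # codec tokens of the short reference sample
new_tokens = generate_audio_tokens([5, 7], reference, make_toy_model(len(reference)))
```

The generated tokens would then go through the codec decoder to become a waveform. Because the reference tokens sit in the context, the continuation can inherit their characteristics without any weight update.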
Zero-Shot Voice Cloning
An impressive ability of codec language models is zero-shot voice cloning. Without retraining, from just a few seconds of reference speech, they mimic the person's voice characteristics (timbre, prosody) to synthesize new sentences. This resembles in-context learning, taking the short reference as context and continuing generation.
This can build personalized voices from little data, which is powerful but raises the ethical risks discussed later.
Coexistence with Diffusion Approaches
Alongside autoregressive codec language models, diffusion-based and flow-matching-based speech generation are also active areas. Autoregressive models generate tokens one after another, while diffusion-family models refine noise into speech over several steps. Each has trade-offs, and the two evolve side by side.
Zero-Shot Voice Cloning and Ethics
Zero-shot voice cloning has great value for accessibility (restoring a voice for those with vocal disorders) and content creation. But it also carries serious risks.
- Impersonation and fraud: raises risks of voice phishing and financial scams using someone else's voice.
- Cloning without consent: cloning a voice without consent can violate personality and likeness rights.
- Deepfake speech: fake statements can be made to sound real, misleading the public.
Because of these risks, much research and many services emphasize consent verification, watermarking (embedding an identifying signal in generated audio), and usage-restriction policies. When working with such technology, weigh these safeguards and responsibilities as much as the capabilities.
Multilingual Support
A key challenge is multilingual and low-resource language support. The world has thousands of languages, but only a few have abundant training data.
Models trained on large multilingual data, like Whisper, have the advantage of handling many languages in a single model. Knowledge is shared across languages, so low-data languages can benefit from related ones. Still, extremely low-resource languages underperform, and code-switching (mixing languages within one sentence) and dialect handling remain hard.
Sound as Features: Mel Spectrograms
Let us look a bit more at the mel spectrogram, which appeared several times above. A speech signal is a long waveform of tens of thousands of samples per second. Rather than feeding this raw to a network, preprocessing into features matched to human hearing is widely used.
[waveform] amplitude over time
| split into short windows, apply Fourier transform (STFT)
v
[spectrogram] energy map of time x frequency
| apply mel scale matched to human hearing
v
[mel spectrogram] time x mel frequency
The mel scale reflects that humans are more sensitive to differences at low frequencies and less at high frequencies. So the mel spectrogram is a more perceptually meaningful representation than the raw spectrogram. Both recognition and synthesis often use it as an intermediate bridge. That said, directly handling waveforms or using neural codec tokens (seen earlier) are also advancing.
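The effect of the mel scale is easy to see numerically. The sketch below uses one common conversion formula, mel = 2595 * log10(1 + f / 700); other variants exist, so treat the constants as an assumption of this example rather than the only definition. The function names and the 8000 Hz upper limit are likewise illustrative choices.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common formula; variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_mels, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel axis."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]

centers = mel_center_frequencies(10)
# Spacing in Hz between neighboring filters grows with frequency.
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

Filters that are evenly spaced in mels end up densely packed at low frequencies and widely spaced at high frequencies, which mirrors the sensitivity pattern described above.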
Evaluation Metrics
Recognition: WER
ASR performance is mainly measured by Word Error Rate (WER). The recognized output is aligned with the reference sentence, and the number of substituted, deleted, and inserted words is divided by the number of words in the reference.
WER = (substitutions + deletions + insertions) / reference word count
- lower is better (closer to 0 is more accurate)
- for languages with different spacing conventions (Korean, Japanese),
Character Error Rate (CER) is often reported too
WER is useful but has limits. It weighs a trivial error that leaves the meaning intact the same as a fatal one that changes it, so it does not always match real usability. Depending on the use case, human evaluation or downstream performance (e.g., whether a spoken command is executed correctly) is also considered.
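The WER formula reduces to a word-level edit distance. Here is a minimal sketch, assuming whitespace tokenization and no text normalization (real evaluations usually lowercase and strip punctuation first); the function name is chosen for this example.

```python
def wer(reference, hypothesis):
    """Word error rate: minimum (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,              # deletion
                          curr[j - 1] + 1,          # insertion
                          prev[j - 1] + (r != h))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)

score = wer("the cat sat on the mat", "the cat sit on mat")
```

In this example one word is substituted ("sat" to "sit") and one is deleted ("the"), giving 2 errors over 6 reference words. Note that WER can exceed 1 when the hypothesis contains many insertions.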
Synthesis: MOS
Synthesis naturalness is mainly measured by Mean Opinion Score (MOS). Human raters listen to synthesized speech and rate naturalness from 1 to 5, then average. Automatic metrics that mimic human evaluation are increasingly used, but the final judgment of synthesis quality still relies heavily on human listening tests.
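Computing a MOS is simple averaging, but reports usually attach an uncertainty estimate. The sketch below assumes a normal approximation with z = 1.96 for a roughly 95% interval, which is a simplification chosen for this example; with few raters a t-based interval would be more appropriate.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of an approximate 95% interval)."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.fmean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

mos, ci = mos_with_ci([4, 5, 3, 4, 4])
```

Two systems whose intervals overlap heavily should not be ranked on MOS alone, which is one reason listening tests need enough raters and utterances.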
An Example Practical Pipeline
Here is a typical flow when putting speech technology into a real service.
[speech recognition service flow]
1. audio input (mic/file)
2. preprocessing (resampling, normalization, feature extraction)
3. ASR model inference (Whisper, etc.)
4. postprocessing (punctuation restoration, number normalization, filtering)
5. use results (subtitles, command handling, search, etc.)
[speech synthesis service flow]
1. text input
2. text normalization (numbers/abbreviations/symbols into spoken form)
3. TTS model inference (Tacotron/codec LM, etc.)
4. postprocessing (loudness normalization, silence trimming)
5. audio output/streaming
Pre- and post-processing greatly affect real-world quality. In recognition, punctuation and number normalization drive readability; in synthesis, bad text normalization makes numbers or abbreviations read wrong. Refining this surrounding processing matters as much as the model.
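As an illustration of the text normalization step on the synthesis side, here is a deliberately tiny sketch. The abbreviation table and the 0-99 number range are assumptions made to keep the example short; production normalizers cover dates, currencies, ordinals, decimals, and language-specific rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Hypothetical, minimal replacement table for this example.
SPOKEN_FORMS = {"Dr.": "Doctor", "%": " percent"}

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize_for_tts(text):
    """Expand a few abbreviations/symbols, then spell out standalone 1-2 digit numbers."""
    for written, spoken in SPOKEN_FORMS.items():
        text = text.replace(written, spoken)
    # Longer numbers and decimals are not handled by this sketch.
    return re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)

spoken = normalize_for_tts("Dr. Kim scored 95%")
```

Without a step like this, the TTS model receives raw digits and symbols and has to guess how they are pronounced.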
Comparison: Approaches at a Glance
| Category | Key Concept | Strengths | Notes |
|---|---|---|---|
| ASR HMM/GMM | acoustic+lexicon+LM combo | long-proven, interpretable | many parts, complex |
| ASR CTC | blank alignment, end-to-end | no alignment labels needed | independence assumption |
| ASR attention | encoder-decoder | flexible context | needs tuning for streaming |
| ASR Whisper | large weak-sup multitask | robust, multilingual | streaming/long-audio limits |
| TTS Tacotron+vocoder | text to mel to waveform | natural synthesis | two-stage pipeline |
| TTS codec LM | audio token prediction | zero-shot cloning | high ethical risk |
The table is a conceptual comparison; each approach keeps evolving, so which one is better depends on the situation.
Speaker Diarization and Auxiliary Tasks
Real audio often has multiple people speaking. For meeting notes or call transcription, we need speaker diarization to tell "who spoke when."
[speaker diarization flow]
audio
| voice activity detection (VAD, speech vs silence)
v
speech segments
| extract speaker embeddings + cluster
v
"speaker A: 0-5s, speaker B: 5-9s ..." labels
The key tool here is speaker embeddings. Voice characteristics become vectors, trained so the same person's utterances yield similar vectors. This is the same contrastive learning idea from the embedding article, applied to speech. Beyond diarization, auxiliary tasks like emotion recognition, language identification, and voice activity detection (VAD) also enter practical pipelines.
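The "extract embeddings + cluster" step can be sketched with cosine similarity and a greedy assignment rule. This is a simplified stand-in: the 2-D embeddings and the 0.8 threshold are made-up values for the example, and practical systems typically use more robust clustering (for instance agglomerative or spectral methods) on embeddings from a trained speaker model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def assign_speakers(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar known speaker, or start a new one."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            n = counts[best]
            # Update the running mean of that speaker's embeddings.
            centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(centroids[best], emb)]
            counts[best] += 1
        else:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        labels.append(best)
    return labels

# One toy embedding per speech segment, in time order.
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [1.0, 0.0], [0.0, 0.9]]
labels = assign_speakers(segments)
```

Combining these labels with the segment boundaries from VAD yields the "speaker A: 0-5s, speaker B: 5-9s" style output shown in the diagram.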
Speech-to-Speech and Unified Models
Traditionally, voice assistants were three stages: ASR turns sound to text, an LLM produces an answer, and TTS makes sound again. But this has many stages, so latency is high, and information like emotion or intonation is lost in between.
[staged voice dialogue]
sound -> ASR -> text -> LLM -> text -> TTS -> sound
(emotion/intonation may vanish in between)
[unified speech-to-speech]
sound -> [one model] -> sound
(can keep intonation, laughter, emotion more naturally)
So unified speech-to-speech models that connect sound directly to sound are being researched. The neural codec tokens seen earlier are a key ingredient here too: once audio is handled as tokens, a single model can in principle process text and audio the same way. Still, such unified models are at an early stage, with many open challenges in latency, quality, and controllability.
End-to-End Flow Diagram
Summarizing recognition and synthesis in one picture:
[speech input]
| ASR (Whisper, etc.)
v
[text] <-> [LLM processing (optional)]
| TTS (Tacotron / codec LM, etc.)
v
[speech output]
Voice assistant: chain ASR -> LLM -> TTS to enable dialogue
Connecting ASR and TTS with an LLM yields a voice conversational interface. Recently, speech-to-speech approaches that unify these three stages in one model are also being researched.
On-Device and Model Compression
There is strong demand to process speech directly on the device rather than in the cloud. Latency is low, it works without internet, and speech data never leaves the device, favoring privacy.
[cloud vs on-device]
cloud: large model, high accuracy / latency, cost, privacy burden
on-device: small model, low latency / performance limits, needs optimization
On-device requires model compression. Techniques include quantization (lowering weight precision), distillation (a small model mimicking a large one), and pruning (removing unnecessary connections). Whisper-family small variants are also available, so you can balance accuracy and cost to fit the situation.
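Of the three techniques, quantization is the easiest to show in a few lines. The sketch below assumes symmetric per-tensor quantization to 8-bit integers, which is one simple scheme among several; the function names and sample weights are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)       # store small integers plus one float scale
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

Each weight now needs one byte instead of four (for 32-bit floats), at the cost of a rounding error of at most half the scale per weight. Small weights such as 0.003 can round to zero, which is where the accuracy trade-off comes from.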
Limitations and Caveats
- Accuracy limits: ASR errors grow with jargon, proper nouns, strong accents, and noise. Critical uses need human review.
- Hallucination and repetition: large generative models can invent nonexistent words or repeat phrases.
- Real-time: highly accurate offline models have high latency and may be unfit for real-time.
- Ethics and safety: voice cloning carries impersonation/fraud risks, needing safeguards like consent and watermarking.
- Recency: this field's SOTA changes very fast. These notes are for understanding; verify specifics and rankings in official docs.
- Low-resource languages: performance can drop sharply for low-data languages.
Closing
Speech technology advanced remarkably from the HMM era's many components, through the end-to-end training of CTC and attention, to Whisper's large-scale weak supervision and codec language models' zero-shot cloning.
Three takeaways: first, ASR's fundamental task is aligning continuous sound to discrete letters, which CTC and attention solved end-to-end. Second, neural codecs turning audio into tokens let us redefine synthesis as language modeling. Third, as capabilities grew, so did the ethical responsibility of voice cloning. SOTA moves fast, but these principles and this sense of responsibility endure.
References
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision (arXiv 2212.04356): arxiv.org/abs/2212.04356
- Connectionist Temporal Classification, CTC (ICML 2006 page): dl.acm.org/doi/10.1145/1143844.1143891
- WaveNet: A Generative Model for Raw Audio (arXiv 1609.03499): arxiv.org/abs/1609.03499
- Tacotron 2 (arXiv 1712.05884): arxiv.org/abs/1712.05884
- SoundStream: An End-to-End Neural Audio Codec (arXiv 2107.03312): arxiv.org/abs/2107.03312
- High Fidelity Neural Audio Compression, EnCodec (arXiv 2210.13438): arxiv.org/abs/2210.13438
- Whisper repository (GitHub): github.com/openai/whisper
- Hugging Face Audio docs: huggingface.co/docs/transformers/tasks/asr