SOTA Speech Recognition and Synthesis — From Whisper to Codec Language Models

Introduction

Speech technology is woven into daily life. Phone dictation, automatic meeting notes, navigation prompts, audiobook narration — all are products of speech recognition (ASR, Automatic Speech Recognition) and speech synthesis (TTS, Text-to-Speech).

In recent years the field leaped forward through deep learning, especially large-scale pre-training and transformers. On recognition, large weakly-supervised models like Whisper emerged; on synthesis, we now have natural voices hard to tell from humans, and codec language models that clone a voice from a short sample.

This article traces the lineage of speech recognition and synthesis while examining core architectures and principles. Since AI SOTA changes fast, we focus on concepts and structures rather than rankings or numbers, and cite only models and papers we are confident about.

The Big Picture: The Arc of Progress

Before the details, here is the arc of speech technology at a glance.

[ASR arc]
 HMM/GMM  ->  DNN acoustic model  ->  CTC/attention end-to-end  ->  Whisper large weak-sup

[TTS arc]
 concat/parametric  ->  Tacotron+neural vocoder  ->  neural codec  ->  codec language model

The overall trend has two threads. First, a shift from hand-assembling many components to learning end-to-end from data. Second, as data scaled and representations became tokenized, both recognition and synthesis came to share tools similar to large language models. Below we examine each stage in turn.

The Lineage of ASR

From Sound to Letters

The goal of ASR is to turn a waveform into text. The essence is aligning a continuous sound signal to a discrete sequence of letters. Sound flows over time and speaking speed varies, so matching which sound spans map to which letters is the central challenge.

[waveform]  ~~~~~~/\/\~~~~/\~~~~~
     |  feature extraction (e.g., mel spectrogram)
     v
[acoustic features]  per-frame vectors
     |  acoustic model + alignment
     v
[text]     "hello"

The HMM Era

Early ASR was led by Hidden Markov Models (HMMs) combined with Gaussian Mixture Models (GMMs). The acoustic model mapped sound features to phone states, a pronunciation lexicon mapped phones to words, and a language model scored word sequences — a combination of many parts.

This was the standard for a long time but was complex with many separately built components. Later, deep neural networks (DNNs) replaced GMMs in the acoustic model, sharply improving accuracy, and the field gradually moved toward end-to-end neural networks.

CTC: A Breakthrough for Alignment

A key turning point for end-to-end ASR is CTC (Connectionist Temporal Classification). CTC lets you train without aligning sound frames to letters one by one.

The core idea is a blank token, plus summing the probabilities of all valid alignment paths to compute the final text probability.

frames:  f1  f2  f3  f4  f5  f6  f7  f8
path 1:  h   h   e   l   _   l   o   o
path 2:  _   h   e   e   l   _   l   o
         (_ is blank; merge repeats, then drop blanks, to get "hello";
          the blank between the two l's keeps them from merging)

Sum over all valid paths -> probability of "hello"

Thanks to CTC, training became possible from just sound-text pairs without frame-level alignment labels. However, CTC assumes that the outputs at different frames are conditionally independent given the audio, so it models linguistic context only weakly and was often combined with a separate language model.
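The collapse rule and the sum over paths can be sketched in a few lines. This is a minimal illustration, assuming per-frame symbol probabilities are given as plain dictionaries; the function names are hypothetical, and the brute-force enumeration is only practical for tiny inputs (real systems use a dynamic-programming forward algorithm).

```python
from itertools import product

BLANK = "_"

def ctc_collapse(path):
    """Merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

def ctc_probability(frame_probs, target):
    """Sum the probability of every path that collapses to `target`.

    frame_probs: one dict per frame, mapping symbol -> probability.
    Exponential in the number of frames, so for illustration only.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= frame_probs[t][sym]  # frames are treated as independent
            total += p
    return total

print(ctc_collapse("hhel_loo"))  # hello
print(ctc_collapse("hhello"))    # helo (no blank between the l's, so they merge)
```

Note how the per-path probability is a plain product over frames: that is the independence assumption in code form.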

Attention-Based Encoder-Decoder

Another line is the attention-based encoder-decoder. The encoder turns sound into representations, and the decoder uses attention to pick relevant sound spans and generate letters one at a time. Transformers made this structure much stronger.

CTC and attention are complementary, so hybrid structures using both were widely adopted: CTC stably handles monotonic alignment while attention flexibly reflects context.
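One common way to combine the two is to interpolate their log-probabilities when ranking candidate transcripts. The sketch below assumes each hypothesis already carries a CTC score and an attention-decoder score; the weight of 0.3 and the function names are illustrative assumptions, not values from any specific system.

```python
def joint_score(log_p_ctc, log_p_att, ctc_weight=0.3):
    """Weighted sum of CTC and attention log-probabilities."""
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att

def rescore(hypotheses, ctc_weight=0.3):
    """Pick the best text from (text, log_p_ctc, log_p_att) tuples."""
    best = max(hypotheses, key=lambda h: joint_score(h[1], h[2], ctc_weight))
    return best[0]

candidates = [
    ("hello", -2.0, -1.0),  # attention decoder prefers this one
    ("hallo", -1.0, -3.0),  # CTC prefers this one
]
print(rescore(candidates))                  # hello
print(rescore(candidates, ctc_weight=1.0))  # hallo (CTC only)
```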

Whisper: Large-Scale Weak Supervision

Whisper's Core

Whisper is a speech recognition model from OpenAI, introduced in "Robust Speech Recognition via Large-Scale Weak Supervision" (arXiv 2212.04356). As the name says, large-scale weak supervision is the core.

Whisper's hallmark is training on a vast amount of web speech-text pairs (680,000 hours, according to the paper). It showed that even imperfectly labeled data, in sufficient volume, can yield a robust model.

[large-scale web speech-text]
 many languages, accents, noise, domains
              |
              v
[transformer encoder-decoder]
 log-mel spectrogram input -> encoder
 text token output <- decoder (attention)
              |
              v
[one model, many tasks]
 multilingual recognition, translation, LID, timestamps

Multitask Design

Another hallmark is that one model handles many tasks. Special tokens issue instructions like "recognize this language," "translate to English," or "add timestamps," and the same model performs them. This multitask design handles diverse functions in one place without separate components.
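To make the special-token idea concrete, here is a minimal sketch of how such a task prefix might be assembled before decoding. The token spellings follow the format described in the Whisper paper; the helper `build_decoder_prefix` is hypothetical and is not the library's API.

```python
def build_decoder_prefix(language, task="transcribe", timestamps=False):
    """Return the list of control tokens that tell the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Without this token the decoder is expected to emit timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens

# Same model, different instructions:
print(build_decoder_prefix("ko"))                     # Korean speech -> Korean text
print(build_decoder_prefix("ko", task="translate"))   # Korean speech -> English text
print(build_decoder_prefix("en", timestamps=True))    # English text with timestamps
```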

Robustness is a strength. Whisper reportedly works reasonably well in noisy environments and across accents. That said, its base structure does not directly fit real-time streaming, and hallucination or repetition can occur on long audio.

Streaming vs Offline

ASR has two broad usage settings.

[offline (batch) recognition]
 recognize after receiving the full audio
 - pros: uses full context, better accuracy
 - uses: meeting notes, subtitling, podcast transcription

[streaming (real-time) recognition]
 recognize as sound arrives
 - pros: low latency
 - limits: cannot see future context, may revise partials
 - uses: live captions, voice assistants, call recognition

Offline encoder-decoder models like Whisper favor accuracy but have high latency. Streaming is better served by architectures that use little or no future context (e.g., transducer families). In practice, the two are used for different purposes.
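The latency versus context trade-off can be illustrated with a toy chunker. This is a sketch under simple assumptions: audio is a list of samples, and `lookahead` is a hypothetical parameter for how many future samples the model may peek at. A larger lookahead gives more context but delays each output by that amount.

```python
def stream_chunks(samples, chunk_size, lookahead=0):
    """Yield (chunk, future_context) pairs in arrival order."""
    for start in range(0, len(samples), chunk_size):
        end = start + chunk_size
        yield samples[start:end], samples[end:end + lookahead]

audio = list(range(10))  # stand-in for audio samples
for chunk, future in stream_chunks(audio, chunk_size=4, lookahead=2):
    print(chunk, future)
# Offline recognition corresponds to a single chunk covering the whole input.
```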

Training Techniques for Robustness

Real-world speech is messy, with noise, reverberation, and varied microphone characteristics. A model trained only on clean data degrades sharply in such conditions. So data augmentation, which deliberately diversifies the training data, is widely used.

[common speech data augmentation]
 - speed/pitch shift: slightly change speaking rate or pitch
 - noise addition: mix in background noise, reverb
 - SpecAugment: mask parts of the spectrogram in time/frequency
 - volume/gain shift: vary loudness

SpecAugment, a simple technique that masks spans along the time and frequency axes of the mel spectrogram, improved ASR accuracy and was widely adopted. Training on large, diverse data, as Whisper does, is itself a way to gain robustness. The more the data captures real-world diversity, the better the model holds up after deployment.
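The masking part of SpecAugment is simple enough to sketch directly. The version below assumes the spectrogram is a list of rows indexed as `spec[freq_bin][time_frame]`, applies one frequency mask and one time mask, and omits the time-warping step of the original method. The mask-size defaults are illustrative assumptions.

```python
import random

def spec_augment(spec, max_freq_mask=2, max_time_mask=3, rng=None):
    """Return a copy of `spec` with one frequency band and one time span zeroed."""
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Frequency mask: zero out f consecutive frequency bins.
    f = rng.randint(0, min(max_freq_mask, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: zero out t consecutive frames in every bin.
    t = rng.randint(0, min(max_time_mask, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = 0.0
    return out

spec = [[1.0] * 6 for _ in range(4)]  # 4 mel bins x 6 frames
for row in spec_augment(spec, rng=random.Random(0)):
    print(row)
```

Because the masks are drawn fresh on every call, the model sees a differently occluded version of the same utterance each epoch.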

The Evolution of TTS

Early Days: Concatenative and Parametric

Early TTS included concatenative synthesis, stitching prerecorded speech fragments, and parametric synthesis, generating speech features via statistical models. Concatenative was natural in a specific speaker's voice but inflexible; parametric was flexible but sounded mechanical.

Tacotron: End-to-End Neural TTS

The turning point for deep learning TTS is the Tacotron family. Tacotron takes text and generates a mel spectrogram (a time-frequency representation of sound) via attention-based sequence-to-sequence. Then a vocoder converts the mel spectrogram into an actual waveform.

[text]
   |  text encoder + attention decoder (Tacotron family)
   v
[mel spectrogram]  time-frequency representation
   |  vocoder (waveform generation)
   v
[waveform]

This "text to mel spectrogram to waveform" two-stage structure was long the standard for neural TTS.

Neural Vocoders

The quality of the vocoder, which converts the mel spectrogram into a waveform, largely determines final audio quality. Early WaveNet sounded very natural but generated samples one at a time and was very slow. Later parallel-capable vocoders (e.g., flow-based and GAN-based families) preserved quality while enabling much faster synthesis. GAN-based vocoders in particular became popular for balancing speed and quality.

Prosody and Style Control

Good synthesis goes beyond clear pronunciation to carry natural prosody. Prosody is intonation, stress, rhythm, and pauses — the elements that give a sentence its "feel." The same sentence can sound like a question, a statement, or surprise depending on prosody.

[examples of prosody control]
 "Really"
   - flat intonation -> a calm reaction
   - rising ending -> a surprised question
   - stressed intonation -> a strong exclamation

 same words, different prosody -> different meaning/emotion

Early TTS struggled to control prosody finely, but recent models can adjust prosody and speaker style by conditioning on a reference voice, style tokens, or emotion labels. Codec language models' zero-shot cloning also mimics the reference's prosody, helping reproduce a speaker's manner from a short sample.

Neural Audio Codecs

Audio as Tokens

A key ingredient of modern speech synthesis is the neural audio codec. A neural codec compresses audio with a neural network, turning continuous sound into a discrete token sequence.

[continuous waveform]
   |  encoder
   v
[vector quantization (VQ)]  -> discrete token sequence (audio "words")
   |  decoder
   v
[reconstructed waveform]

The key is representing audio as multiple layers of discrete codes via techniques like Residual Vector Quantization (RVQ). Now audio, like text, becomes a sequence of tokens. Representative neural codecs include SoundStream and EnCodec families.
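Residual vector quantization can be sketched with tiny hand-written codebooks. Each stage quantizes whatever the previous stages failed to capture, so the list of indices (one per stage) is the "multi-layer code" for one frame. The codebook values and function names here are made up for illustration; real codecs learn their codebooks during training.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entries from every stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # stage 1: rough shape
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]]  # stage 2: small corrections
codes = rvq_encode([1.3, 0.9], [coarse, fine])
print(codes)                              # [1, 1]
print(rvq_decode(codes, [coarse, fine]))  # [1.25, 1.0]
```

Adding more stages lowers reconstruction error at the cost of more tokens per frame, which is the bitrate knob of such codecs.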

Once audio is tokenized, a language model can predict audio tokens just as it predicts text tokens. That is the starting point for the codec language models next.

Codec Language Models: The VALL-E Family Concept

Synthesis as Language Modeling

Tokenizing speech with a neural codec lets us redefine TTS as "predict the next audio token." Given text and a short reference speech as conditions, a language model generates continuation audio tokens in that speaker's voice. The VALL-E family is a representative example of this direction.

[text]  +  [short reference speech (3s sample)]
              |  tokenize with codec
              v
[condition: text tokens + reference speech tokens]
              |  language model (predict next audio token)
              v
[generated audio token sequence]
              |  codec decoder
              v
[speech synthesized in the reference speaker's voice]
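The flow above amounts to an ordinary autoregressive decoding loop over audio tokens. The sketch below is deliberately simplified: it treats audio as a single token stream (actual systems handle several codebook layers), and `next_token` stands in for a trained model. All names and the `eos` convention are assumptions for illustration.

```python
def generate_audio_tokens(next_token, text_tokens, ref_audio_tokens,
                          max_new=50, eos=-1):
    """Greedy loop: condition on text + reference audio, then extend token by token."""
    context = list(text_tokens) + list(ref_audio_tokens)
    generated = []
    for _ in range(max_new):
        tok = next_token(context + generated)  # model sees everything so far
        if tok == eos:
            break
        generated.append(tok)
    return generated  # would be passed to the codec decoder to get a waveform

# Toy stand-in for a model: count upward, stop once the sequence has 8 tokens.
def toy_model(seq):
    return seq[-1] + 1 if len(seq) < 8 else -1

print(generate_audio_tokens(toy_model, [100, 101], [7, 8, 9]))  # [10, 11, 12]
```

The reference audio tokens play the same role as a prompt in a text language model, which is why no retraining is needed to switch speakers.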

Zero-Shot Voice Cloning

An impressive ability of codec language models is zero-shot voice cloning. Without retraining, from just a few seconds of reference speech, they mimic the person's voice characteristics (timbre, prosody) to synthesize new sentences. This resembles in-context learning, taking the short reference as context and continuing generation.

This can build personalized voices from little data, which is powerful but raises the ethical risks discussed later.

Coexistence with Diffusion Approaches

Besides autoregressive codec language models, diffusion-based and flow-matching-based speech generation are also active areas. Autoregressive models generate tokens sequentially, while diffusion-family models build speech gradually from noise. Each has trade-offs, and the two evolve side by side.

Zero-Shot Voice Cloning and Ethics

Zero-shot voice cloning has great value for accessibility (restoring a voice for those with vocal disorders) and content creation. But it also carries serious risks.

  • Impersonation and fraud: raises risks of voice phishing and financial scams using someone else's voice.
  • Cloning without consent: cloning a voice without consent can violate personality and likeness rights.
  • Deepfake speech: fake statements can be made to sound real, misleading the public.

Because of these risks, much research and many services emphasize consent verification, watermarking (embedding an identifying signal in generated audio), and usage-restriction policies. When working with such technology, weigh these safeguards and responsibilities as much as the capabilities.

Multilingual Support

A key challenge is multilingual and low-resource language support. The world has thousands of languages, but only a few have abundant training data.

Models trained on large multilingual data, like Whisper, benefit from handling many languages in one model. Cross-lingual knowledge is shared, so low-data languages can benefit from related ones. Still, extremely low-resource languages underperform, and code-switching (mixing languages in one sentence) and dialect handling remain hard.

Sound as Features: Mel Spectrograms

Let us look a bit more at the mel spectrogram, which appeared several times above. A speech signal is a long waveform of tens of thousands of samples per second. Rather than feeding this raw to a network, preprocessing into features matched to human hearing is widely used.

[waveform]  amplitude over time
   |  split into short windows, apply Fourier transform (STFT)
   v
[spectrogram]  energy map of time x frequency
   |  apply mel scale matched to human hearing
   v
[mel spectrogram]  time x mel frequency

The mel scale reflects that humans are more sensitive to frequency differences at low frequencies and less sensitive at high frequencies. So the mel spectrogram is a more perceptually meaningful representation than the raw spectrogram. Both recognition and synthesis often use it as an intermediate bridge. That said, approaches that handle waveforms directly or use neural codec tokens (seen earlier) are also advancing.
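The mel mapping itself is a one-line formula. The sketch below uses one common variant, mel = 2595 * log10(1 + f / 700); other variants exist, and the helper names are hypothetical. Spacing filter centers evenly in mel and converting back to Hz shows the effect: filters are dense at low frequencies and sparse at high ones.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mel (one common formula variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

centers = mel_center_frequencies(8, 0.0, 8000.0)
print([round(c) for c in centers])
# The gap between neighboring centers grows as frequency rises.
```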

Evaluation Metrics

Recognition: WER

ASR performance is mainly measured by Word Error Rate (WER). The recognized output is aligned to the reference sentence using the minimum number of word edits; the substituted, deleted, and inserted words are counted and divided by the reference length.

WER = (substitutions + deletions + insertions) / reference word count

- lower is better (closer to 0 is more accurate)
- for languages with different spacing conventions (Korean, Japanese),
  Character Error Rate (CER) is often reported too

WER is useful but has limits. It weighs a trivial error that leaves the meaning intact the same as a fatal one that changes it, so it does not always match real usability. Depending on the use case, human evaluation or downstream performance (e.g., whether a command recognized from speech is executed correctly) is also considered.
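The WER formula above is a word-level edit distance divided by the reference length. A minimal sketch, assuming whitespace tokenization and no text normalization (real evaluations usually normalize case and punctuation first):

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
# Many insertions can push WER above 1.
print(wer("hello", "hello there big world"))  # 3.0
```

For CER, the same function works if the strings are split into characters instead of words.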

Synthesis: MOS

Synthesis naturalness is mainly measured by Mean Opinion Score (MOS). Human raters listen to synthesized speech and rate naturalness from 1 to 5, then average. Automatic metrics that mimic human evaluation are increasingly used, but the final judgment of synthesis quality still relies heavily on human listening tests.
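MOS is just an average, but it is usually reported with an uncertainty estimate because rater panels are small. The sketch below returns the mean and the half-width of a 95% confidence interval using a normal approximation; that approximation (the 1.96 factor) is an assumption and is rough for very small panels.

```python
import math
import statistics

def mos(ratings):
    """Return (mean score, 95% CI half-width) for ratings on a 1-5 scale."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be non-empty and within 1..5")
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))  # standard error
    return mean, 1.96 * sem

score, half_width = mos([4, 5, 3, 4, 4])
print(f"MOS = {score:.2f} +/- {half_width:.2f}")
```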

An Example Practical Pipeline

Here is a typical flow when putting speech technology into a real service.

[speech recognition service flow]
 1. audio input (mic/file)
 2. preprocessing (resampling, normalization, feature extraction)
 3. ASR model inference (Whisper, etc.)
 4. postprocessing (punctuation restoration, number normalization, filtering)
 5. use results (subtitles, command handling, search, etc.)

[speech synthesis service flow]
 1. text input
 2. text normalization (numbers/abbreviations/symbols into spoken form)
 3. TTS model inference (Tacotron/codec LM, etc.)
 4. postprocessing (loudness normalization, silence trimming)
 5. audio output/streaming

Pre- and post-processing greatly affect real-world quality. In recognition, punctuation and number normalization drive readability; in synthesis, bad text normalization makes numbers or abbreviations read wrong. Refining this surrounding processing matters as much as the model.
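As an example of the text normalization step on the synthesis side, the sketch below expands a few abbreviations and spells out integers before the text reaches the TTS model. The abbreviation table is a hypothetical two-entry example, and numbers of 1000 or more fall back to digit-by-digit reading; production normalizers handle dates, currency, ordinals, and much more.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
ABBREVIATIONS = {"Dr.": "Doctor", "km": "kilometers"}  # hypothetical table

def number_to_words(n):
    """Spell out 0..999; larger numbers are read digit by digit."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    return " ".join(ONES[int(d)] for d in str(n))

def normalize_text(text):
    """Expand known abbreviations, then replace digit runs with words."""
    for abbr, full in ABBREVIATIONS.items():
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, full, text)
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)

print(normalize_text("Dr. Kim ran 42 km"))
# Doctor Kim ran forty-two kilometers
```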

Comparison: Approaches at a Glance

Category             | Key Concept                 | Strengths                  | Notes
---------------------|-----------------------------|----------------------------|----------------------------
ASR HMM/GMM          | acoustic+lexicon+LM combo   | long-proven, interpretable | many parts, complex
ASR CTC              | blank alignment, end-to-end | no alignment labels needed | independence assumption
ASR attention        | encoder-decoder             | flexible context           | needs tuning for streaming
ASR Whisper          | large weak-sup multitask    | robust, multilingual       | streaming/long-audio limits
TTS Tacotron+vocoder | text to mel to waveform     | natural synthesis          | two-stage pipeline
TTS codec LM         | audio token prediction      | zero-shot cloning          | high ethical risk

The table is a conceptual comparison; each approach keeps evolving, so which one is better depends on the situation.

Speaker Diarization and Auxiliary Tasks

Real audio often has multiple people speaking. For meeting notes or call transcription, we need speaker diarization to tell "who spoke when."

[speaker diarization flow]
 audio
   |  voice activity detection (VAD, speech vs silence)
   v
 speech segments
   |  extract speaker embeddings + cluster
   v
 "speaker A: 0-5s, speaker B: 5-9s ..." labels

The key tool here is speaker embeddings. Voice characteristics become vectors, trained so the same person's utterances yield similar vectors. This is the same contrastive learning idea from the embedding article, applied to speech. Beyond diarization, auxiliary tasks like emotion recognition, language identification, and voice activity detection (VAD) also enter practical pipelines.
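The "embed, then cluster" step can be sketched with cosine similarity and a greedy assignment rule. This is a toy stand-in for real clustering: embeddings are assumed to be non-zero vectors already extracted per speech segment, and the 0.8 threshold is an arbitrary illustrative value.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_speakers(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar known speaker or start a new one."""
    centroids = []  # running sum per speaker (cosine ignores the scale)
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for k, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            centroids.append(list(emb))
            labels.append(len(centroids) - 1)
        else:
            centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
            labels.append(best)
    return labels

segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95], [1.0, 0.05]]
print(assign_speakers(segments))  # [0, 0, 1, 1, 0]
```

Combining these labels with the segment boundaries from VAD yields the "speaker A: 0-5s" style output.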

Speech-to-Speech and Unified Models

Traditionally, voice assistants had three stages: ASR turns sound into text, an LLM produces an answer, and TTS turns the answer back into sound. Chaining these stages adds latency, and information like emotion or intonation is lost along the way.

[staged voice dialogue]
 sound -> ASR -> text -> LLM -> text -> TTS -> sound
        (emotion/intonation may vanish in between)

[unified speech-to-speech]
 sound -> [one model] -> sound
        (can keep intonation, laughter, emotion more naturally)

So unified speech-to-speech models that connect sound directly to sound are being researched. The neural codec tokens seen earlier are a key ingredient here too, since handling audio as tokens lets us imagine a unified model processing text and audio the same way. Still, such unified models are evolving, with many challenges in latency, quality, and controllability.

End-to-End Flow Diagram

Summarizing recognition and synthesis in one picture:

[speech input]
   |  ASR (Whisper, etc.)
   v
[text]  <->  [LLM processing (optional)]
   |  TTS (Tacotron / codec LM, etc.)
   v
[speech output]

Voice assistant: chain ASR -> LLM -> TTS to enable dialogue

Connecting ASR and TTS with an LLM yields a voice conversational interface. Recently, speech-to-speech approaches that unify these three stages in one model are also being researched.

On-Device and Model Compression

There is strong demand to process speech directly on the device rather than in the cloud. Latency is low, it works without internet, and speech data never leaves the device, favoring privacy.

[cloud vs on-device]
 cloud: large model, high accuracy / latency, cost, privacy burden
 on-device: small model, low latency / performance limits, needs optimization

On-device requires model compression. Techniques include quantization (lowering weight precision), distillation (a small model mimicking a large one), and pruning (removing unnecessary connections). Whisper-family small variants are also available, so you can balance accuracy and cost to fit the situation.
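Of the three techniques, quantization is the easiest to show in a few lines. The sketch below uses symmetric per-tensor int8 quantization, one simple scheme among several; the function names are hypothetical, and real toolchains add calibration, per-channel scales, and quantized kernels.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
print(q)                     # [50, -127, 0, 100]
print(dequantize(q, scale))  # close to the original weights
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale per weight.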

Limitations and Caveats

  • Accuracy limits: ASR errors grow with jargon, proper nouns, strong accents, and noise. Critical uses need human review.
  • Hallucination and repetition: large generative models can invent nonexistent words or repeat phrases.
  • Real-time: highly accurate offline models have high latency and may be unfit for real-time.
  • Ethics and safety: voice cloning carries impersonation/fraud risks, needing safeguards like consent and watermarking.
  • Recency: this field's SOTA changes very fast. These notes are for understanding; verify specifics and rankings in official docs.
  • Low-resource languages: performance can drop sharply for low-data languages.

Closing

Speech technology advanced remarkably from the HMM era's many components, through the end-to-end training of CTC and attention, to Whisper's large-scale weak supervision and codec language models' zero-shot cloning.

Three takeaways: first, ASR's fundamental task is aligning continuous sound to discrete letters, which CTC and attention solved end-to-end. Second, neural codecs turning audio into tokens let us redefine synthesis as language modeling. Third, as capabilities grew, so did the ethical responsibility of voice cloning. SOTA moves fast, but these principles and this sense of responsibility endure.

References