SOTA Music and Audio Generation — Neural Codecs and Generative Models

Introduction
Audio Representation: What Should We Model
Autoregressive Audio Language Models
- The Idea
- The MusicGen Family
Diffusion-Based Audio
Text-to-Music Conditioning
Commercial and Research (Concept-Focused)
Comparison Table: By Approach
Full Pipeline Diagram
Evaluation
Copyright and Ethics Issues
Strengths
Limitations and Open Problems
Practical Implications
Conclusion
References

Introduction

While text, image, and video generation matured, audio and music generation advanced rapidly as well. Audio poses unique challenges: a recording is a long sequence of tens of thousands of samples per second, and the human ear is sensitive to even minute distortions. This article organizes the common principles of SOTA music and audio generation by lineage, from audio representations to neural codecs, autoregressive audio language models, and diffusion-based audio.

This field changes quickly. The content below is based on widely known concepts, papers, and architecture families, and makes no definitive claims about the detailed specs or rankings of particular commercial models.

Audio Representation: What Should We Model

The first question in audio generation is "what should we predict." The model structure changes greatly depending on the representation.

Waveform

The most basic representation is the waveform, a sequence of amplitude values over time. For 44.1 kHz audio, one second contains 44,100 samples. Predicting the waveform directly keeps the quality ceiling high because nothing is discarded, but the sequence is extremely long, which makes modeling difficult.

Waveform: time -->  ...-0.2, 0.1, 0.4, 0.3, -0.1, -0.5...  (tens of thousands of samples per second)
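To make the sequence-length problem concrete, the following minimal sketch builds a one-second sine tone at the 44.1 kHz rate used above and counts samples. The function name sine_waveform and the 440 Hz tone are illustrative choices, not something taken from a particular model.

```python
import math

SAMPLE_RATE = 44_100  # samples per second, as in the 44.1 kHz example above


def sine_waveform(freq_hz: float, seconds: float, sample_rate: int = SAMPLE_RATE) -> list:
    """Return a mono sine tone as a list of amplitudes in [-1, 1]."""
    n_samples = int(round(seconds * sample_rate))
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


one_second = sine_waveform(440.0, 1.0)
print(len(one_second))    # 44100 samples in one second
print(30 * SAMPLE_RATE)   # 1323000 samples for a 30-second clip
```

A model that predicts raw samples one at a time would therefore need over a million steps for a 30-second clip, which is the motivation for the more compact representations that follow.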

Spectrogram

A spectrogram divides the waveform into short segments and converts each one into its frequency components. Because the result can be treated like a two-dimensional time-frequency image, it is well suited to borrowing image generation techniques. However, it requires an extra step (a vocoder) that converts the spectrogram back into a waveform.

Spectrogram: vertical axis = frequency, horizontal axis = time, value = intensity (treated like an image)
        --> [vocoder] --> waveform
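The conversion described above can be sketched with a naive short-time Fourier transform. This is a deliberately slow, standard-library-only illustration; the frame size, hop, and Hann window are assumptions chosen for readability, and real systems typically use an FFT library and often a mel-scaled variant.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_size=64, hop=32):
    """Naive short-time Fourier transform; returns frames x (frame_size // 2 + 1) magnitudes."""
    # Hann window to reduce leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_size) for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):  # keep non-negative frequencies only
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            bins.append(abs(acc))  # magnitude only; phase is discarded here
        frames.append(bins)
    return frames


# A tone that completes exactly 8 cycles per 64-sample frame should peak at bin 8.
tone = [math.sin(2.0 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)  # 7 33 8
```

Because only magnitudes are kept in this sketch, the phase information is gone, which is one reason a separate vocoder step is needed to get back to a waveform.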

Neural Codec

The core of recent SOTA systems is the neural codec, a neural network that compresses audio into a short sequence of discrete tokens. Representative examples include the SoundStream and EnCodec families.

The key is residual vector quantization (RVQ). Audio is hierarchically quantized through multiple stages of codebooks, packing high sound quality into a short token sequence.

[waveform] --encoder--> [continuous representation] --RVQ quantization--> [discrete token sequence]
                                                     |
[waveform] <--decoder-- [continuous representation] <--dequantization-- [discrete token sequence]

RVQ hierarchy:
 stage 1 codebook --> residual --> stage 2 codebook --> residual --> ... (precision accumulates)
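The RVQ hierarchy above can be sketched in a few lines. The codebooks here are small, hand-written, and hypothetical; in a real codec they are learned jointly with the encoder and decoder, and each vector is one frame of the encoder's continuous output.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entry of every stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse grid first, then progressively finer corrections.
codebooks = [
    [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]],
    [[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]],
    [[-0.25, -0.25], [-0.25, 0.25], [0.25, -0.25], [0.25, 0.25]],
]
frame = [0.8, -0.3]
tokens = rvq_encode(frame, codebooks)
print(tokens, rvq_decode(tokens, codebooks))  # [2, 1, 3] [0.75, -0.25]
```

Each frame becomes one small integer per stage, and decoding with more stages lands closer to the original vector, which is what "precision accumulates" refers to.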

The discrete tokens of a neural codec are well suited to language models, because audio tokens can be predicted just like text tokens. This bridge drove the rise of audio language models.
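A quick back-of-the-envelope calculation shows how much shorter the token sequence is. The frame rate and number of RVQ stages below are assumed, illustrative values rather than the specification of any particular codec.

```python
SAMPLE_RATE = 44_100   # waveform samples per second (the 44.1 kHz example)
FRAME_RATE = 50        # assumed codec frames per second (illustrative)
NUM_CODEBOOKS = 4      # assumed number of RVQ stages (illustrative)


def tokens_per_second(frame_rate: int = FRAME_RATE, num_codebooks: int = NUM_CODEBOOKS) -> int:
    """Each codec frame carries one token per RVQ stage."""
    return frame_rate * num_codebooks


clip_seconds = 30
waveform_len = clip_seconds * SAMPLE_RATE
token_len = clip_seconds * tokens_per_second()
print(waveform_len, token_len, waveform_len / token_len)  # 1323000 6000 220.5
```

Under these assumptions the language model sees a sequence roughly 220 times shorter than the raw waveform, which brings a 30-second clip into the range of context lengths that Transformers already handle for text.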

Autoregressive Audio Language Models

The Idea

Once audio is turned into discrete tokens, you can generate audio through "next-token prediction," exactly like a language model. The AudioLM family proposed this approach. It stitches audio tokens together autoregressively, like a language model, to produce natural sound.
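The next-token loop can be sketched as follows. The bigram table is a hypothetical stand-in for a trained Transformer, and the four-entry vocabulary is a toy size; only the shape of the loop (predict, sample, append, repeat) is the point.

```python
import random

VOCAB_SIZE = 4  # toy vocabulary size (hypothetical)

# Hypothetical bigram table standing in for a trained model:
# row = previous token, column = probability of the next token.
BIGRAM = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.7, 0.1, 0.1, 0.1],
]


def next_token_probs(context):
    """Stand-in for the model; here only the last token matters."""
    return BIGRAM[context[-1]]


def generate(prompt, num_new_tokens, seed=0):
    """Autoregressive loop: sample one token, append it, feed the longer context back in."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        probs = next_token_probs(tokens)
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0])
    return tokens


audio_tokens = generate(prompt=[0], num_new_tokens=12)
print(audio_tokens)  # token ids that would then be passed to the codec decoder
```

A real model conditions on the whole context rather than only the last token, but the generation loop has the same structure.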

The AudioLM family often uses two kinds of tokens together. Semantic tokens carry long-term structure and content, while acoustic tokens carry fine-grained timbre and sound quality. Generation is hierarchical: the model captures the large structure first and then fills in the acoustic detail.

[semantic token prediction] --> the large flow/structure of the piece
        |
[acoustic token prediction] --> fine timbre/texture (neural codec tokens)
        |
   [codec decoder] --> waveform

The MusicGen Family

MusicGen (arXiv 2306.05284) is a representative case that handles text-conditioned music generation with a single Transformer language model. It generates EnCodec codec tokens autoregressively, using an efficient arrangement of the tokens from the multiple RVQ layers (codebook interleaving). It accepts a text description or a melody as the condition for generating music.

[text prompt] --text encoder--> [conditioning embedding]
                                        |
[codec tokens] --autoregressive transformer--> [next codec token prediction]
                                        |
                                  [EnCodec decoder] --> music waveform
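One interleaving arrangement described for this family is a delay pattern, in which the tokens of RVQ stage k are shifted k steps to the right so that a single model step can emit one token per stage. The sketch below assumes that pattern; the PAD value and function names are hypothetical.

```python
PAD = -1  # hypothetical placeholder id for positions that have no token yet


def apply_delay_pattern(codes):
    """codes[k][t] is the stage-k token of frame t. Shift stage k right by k steps."""
    num_stages, num_frames = len(codes), len(codes[0])
    total_steps = num_frames + num_stages - 1
    delayed = [[PAD] * total_steps for _ in range(num_stages)]
    for k in range(num_stages):
        for t in range(num_frames):
            delayed[k][t + k] = codes[k][t]
    return delayed


def undo_delay_pattern(delayed):
    """Inverse of apply_delay_pattern: realign every stage to its original frame."""
    num_stages = len(delayed)
    num_frames = len(delayed[0]) - num_stages + 1
    return [[delayed[k][t + k] for t in range(num_frames)] for k in range(num_stages)]


codes = [
    [10, 11, 12, 13],  # stage 1 (coarsest)
    [20, 21, 22, 23],  # stage 2
    [30, 31, 32, 33],  # stage 3
]
for row in apply_delay_pattern(codes):
    print(row)
# [10, 11, 12, 13, -1, -1]
# [-1, 20, 21, 22, 23, -1]
# [-1, -1, 30, 31, 32, 33]
```

With this layout, 4 frames across 3 stages take 6 decoding steps (one column per step) instead of the 12 steps that fully flattening the tokens would require, while each finer stage can still see the coarser token of the same frame.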

The advantage of the autoregressive approach is that it can reuse language model infrastructure as is. The drawback is that it generates tokens one at a time sequentially, which can be slow for long audio.

Diffusion-Based Audio

Another major strand is the diffusion model. As with image diffusion, training progressively adds noise to audio (mainly spectrograms or latent representations) and teaches the model to reverse that process. Generation then starts from pure noise and denoises step by step.

Spectrogram diffusion: Performs diffusion over the time-frequency representation and recovers the waveform with a vocoder.
Latent audio diffusion: Compresses audio into a latent space and then performs diffusion over it. It is the same idea as image latent diffusion.

[pure noise] --> [diffusion backbone: U-Net or DiT] --repeated denoising--> [audio latent/spectrogram]
                                                                        |
                                                          [decoder/vocoder] --> waveform
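The noising and repeated-denoising loop can be sketched on a tiny vector. This is a minimal illustration under stated assumptions: the linear schedule in alpha_bar is arbitrary, the reverse loop uses a deterministic DDIM-style update, and oracle_noise is a stand-in for the trained U-Net or DiT that cheats by knowing the clean latent, so the loop recovers it exactly.

```python
import math
import random

NUM_STEPS = 50


def alpha_bar(t):
    """Share of signal variance left after t noising steps (1.0 at t=0, 0.001 at t=NUM_STEPS)."""
    return 1.0 - 0.999 * t / NUM_STEPS


def add_noise(x0, t, rng):
    """Forward process: blend the clean latent with Gaussian noise."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


def oracle_noise(x_t, t, x0):
    """Stand-in for the trained network: knows x0, so its noise estimate is exact."""
    a = alpha_bar(t)
    return [(xt - math.sqrt(a) * x) / math.sqrt(1.0 - a) for xt, x in zip(x_t, x0)]


def denoise(x_t, predict_noise):
    """Reverse loop: estimate the clean latent, then step to the next lower noise level."""
    x = list(x_t)
    for t in range(NUM_STEPS, 0, -1):
        a_t, a_prev = alpha_bar(t), alpha_bar(t - 1)
        eps = predict_noise(x, t)
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e for x0, e in zip(x0_hat, eps)]
    return x


rng = random.Random(0)
clean_latent = [0.5, -1.0, 0.25, 0.0]  # hypothetical audio latent
noisy = add_noise(clean_latent, NUM_STEPS, rng)
restored = denoise(noisy, lambda x, t: oracle_noise(x, t, clean_latent))
print(max(abs(r - c) for r, c in zip(restored, clean_latent)))  # tiny floating-point error
```

In a real system the noise predictor is a learned network that also receives the conditioning, and every element of the latent is updated at each step, which is where the parallelism comes from.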

The advantage of the diffusion approach is that it refines the whole sequence in parallel at every step, so it suffers less from the sequential bottleneck of autoregression. Recently, flow matching and rectified flow variants have also started to appear in audio. Autoregression and diffusion are not mutually exclusive; they are combined or chosen depending on the situation.
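The difference in sequential bottleneck can be made concrete by counting dependent model calls. The numbers below (50 codec frames per second, one autoregressive step per frame, 50 denoising steps) are assumptions for illustration only.

```python
def autoregressive_steps(seconds, frame_rate=50):
    """Sequential decoding steps, assuming one model call per codec frame."""
    return seconds * frame_rate


def diffusion_steps(num_denoising_steps=50):
    """Sequential model calls for diffusion; fixed regardless of clip length in this simplified view."""
    return num_denoising_steps


for seconds in (10, 30, 120):
    print(seconds, autoregressive_steps(seconds), diffusion_steps())
# 10 500 50
# 30 1500 50
# 120 6000 50
```

This counts only dependent steps, not wall-clock time: each diffusion step processes the entire clip, so its per-step cost still grows with length, and autoregressive decoding benefits from caching. A direct measurement on the target use case remains the safer comparison.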

Text-to-Music Conditioning

To control music with text, you embed the text description and inject it into the generation process. The principle is the same as for images and video.

Text encoder: Embeds the prompt ("calm lo-fi hip hop, rainy night mood") with something like the T5 family.
Injection method: In autoregressive models, it is prepended as conditioning tokens or via cross-attention; in diffusion models, it is injected via cross-attention.
Additional conditions: Melody, chord progression, rhythm, reference audio, and more can be given as conditions. This greatly increases musical controllability.

[text/melody condition] --> [conditioning embedding]
                             |
[generation backbone (AR or diffusion)] <-- condition injection
                             |
                    [codec/vocoder] --> music
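Cross-attention injection can be sketched as single-head scaled dot-product attention in which audio positions act as queries and text-token embeddings act as keys and values. The learned projection matrices and multiple heads of a real model are omitted here, and all embeddings are hypothetical toy vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (audio position) reads a weighted mix of values (text tokens)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


# Hypothetical embeddings for two prompt tokens, e.g. "calm" and "rainy".
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[1.0, 0.0], [0.0, 1.0]]
audio_queries = [[10.0, 0.0],  # strongly aligned with the first text token
                 [1.0, 1.0]]   # equally aligned with both
attended = cross_attention(audio_queries, text_keys, text_values)
print(attended)
```

The first audio position ends up reading almost entirely from the first text token, while the second receives an even blend of both. The same mechanism applies whether the backbone is autoregressive or diffusion-based.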

Commercial and Research (Concept-Focused)

On the research side, AudioLM, MusicGen, EnCodec, SoundStream, and others laid the foundation of publicly available ideas. On the commercial side, services such as Suno and Udio are reputed to show impressive quality in song generation, including vocals. However, since the internal structure of commercial models is mostly undisclosed, this article covers only the principles of the publicly available architecture families.

The commonly observed directions are as follows. (1) Discretely tokenize audio with a neural codec, (2) generate tokens/latents with autoregression or diffusion, (3) condition on text/melody, (4) recover the waveform with a codec decoder or vocoder. Detailed performance and rankings vary greatly depending on the prompt, genre, and evaluation method, so we avoid definitive claims.

Comparison Table: By Approach

Axis	Autoregressive Audio LM	Diffusion-Based Audio
Representation	Codec discrete tokens	Spectrogram/latent
Generation method	Next-token prediction (sequential)	Iterative denoising (parallel)
Representative family	AudioLM, MusicGen	Spectrogram/latent diffusion
Strength	Reuses language model infrastructure	Eases the sequential bottleneck
Weakness	Can be slow for long audio	Depends on vocoder/decoder quality
Conditioning	Conditioning tokens/cross-attention	Cross-attention

The entries are general tendencies of each family and may differ for a specific model configuration.

Full Pipeline Diagram

[text prompt] (+ melody/reference audio)
        |
 [text encoder]
        |
 [conditioning embedding] ------------------------------+
                                                        |
 [generation backbone] <-- condition injection ---------+
   - autoregressive: sequential codec token prediction
   - or diffusion: latent/spectrogram denoising
        |
 [neural codec decoder / vocoder]
        |
   [final audio waveform]

Evaluation

Audio generation evaluation is highly subjective.

Automatic metrics: Audio quality (e.g., the FAD family) and text-audio alignment (e.g., CLAP-based similarity) are used, but they do not fully capture musical appeal or emotion.
Human evaluation: In practice, listening preference comparison is the most trusted. However, it is costly and taste plays a role.
Caveat: Rankings vary by genre, prompt, length, and evaluation method. Rather than definitive claims about "what is best," comparisons with explicit conditions are needed.
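The two kinds of automatic metric mentioned above can be sketched as follows. Both functions operate on hypothetical embedding vectors; in practice these come from a pretrained audio model (for FAD-style scores) or a joint text-audio model (for CLAP-style similarity). The Fréchet distance here is simplified to diagonal covariances, whereas the usual formulation uses full covariance matrices.

```python
import math


def mean_and_variance(embeddings):
    """Per-dimension mean and population variance of a list of embedding vectors."""
    n, dim = len(embeddings), len(embeddings[0])
    means = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    variances = [sum((e[d] - means[d]) ** 2 for e in embeddings) / n for d in range(dim)]
    return means, variances


def frechet_distance_diagonal(real, generated):
    """Fréchet distance between two Gaussians, assuming diagonal covariances."""
    mu_r, var_r = mean_and_variance(real)
    mu_g, var_g = mean_and_variance(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + cov_term


def cosine_similarity(u, v):
    """Alignment score between, for example, a text embedding and an audio embedding."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


real_set = [[0.0, 0.0], [2.0, 2.0]]        # hypothetical embeddings of reference audio
generated_set = [[1.0, 1.0], [3.0, 3.0]]   # hypothetical embeddings of generated audio
print(frechet_distance_diagonal(real_set, generated_set))  # 2.0
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

A lower Fréchet distance means the generated set is statistically closer to the reference set, and a higher cosine similarity means the audio matches the prompt more closely in the embedding space. Neither number says whether a listener would actually enjoy the result.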

Copyright and Ethics Issues

Copyright and ethics issues are especially acute in music and audio generation.

Training data provenance: Whether copyrighted music was used for training and whether style or voice was imitated are key issues.
Voice/artist imitation: The problem of cloning a specific singer's voice is entangled with likeness and publicity rights.
Plagiarism/similarity: The risk that the generated output is too similar to an existing song must be managed.
Transparency: Discussion is under way on labeling generated audio as such or watermarking it.

Separate from technical performance, these issues are a key constraint on commercialization and a subject of social debate.

Strengths

Accessibility: You can quickly create music, sound effects, and audio with text alone.
Modularity: The codec, generation backbone, and vocoder are separated, making it easy to replace and improve components.
Improved controllability: Melody, chord, and reference audio conditions have made musical control possible.
Efficiency: Thanks to the discrete tokenization of neural codecs, even long audio has become easier to handle.

Limitations and Open Problems

Long-term structure: A coherent composition of the whole piece (intro-development-chorus, etc.) is still difficult.
Fine quality: The human ear is sensitive to minute distortion, so artifacts are easily revealed.
Absence of evaluation standards: There is a lack of reliable metrics to quantify musical appeal.
Copyright/ethics: The data, imitation, and transparency issues discussed earlier remain largely unresolved.
Control precision: Control that precisely specifies a particular instrument, beat, or emotion is still developing.

Practical Implications

These tools are powerful for fast prototyping, but a copyright and license review is essential for commercial use.
If precise control is needed, it is better to also provide structural conditions such as melody and chords.
Autoregression and diffusion have situation-dependent trade-offs, so it is safer to compare them directly for the target use case.

Conclusion

The common foundation of SOTA music and audio generation can be summarized as "neural codec tokenization + autoregressive or diffusion generation + text/melody conditioning." EnCodec/SoundStream laid the bridge of representation, AudioLM/MusicGen opened language-model-style generation, and the diffusion family offered a parallel alternative. The rankings and details of commercial services change quickly, but understanding these principles lets you rapidly grasp the structure of new models.