- 1. Introduction: A Turning Point in AI Music Generation
- 2. ACE-Step v1: In-Depth Architecture Analysis
- 3. REPA: Semantic Representation Alignment Training
- 4. ACE-Step v1 Training Details
- 5. ACE-Step v1.5: Hybrid LM + DiT Evolution
- 6. Performance Evaluation and Benchmarks
- 7. Comparative Analysis of AI Music Generation Models
- 8. Core Foundational Technologies for Music Generation
- 9. Practical Usage Guide
- 10. Ethical Considerations and Legal Issues
- 11. Key Paper References
- 12. Future Outlook
- 13. Conclusion
- References
1. Introduction: A Turning Point in AI Music Generation
The field of AI Music Generation has undergone explosive progress from 2024 to 2025. While Meta's MusicGen, Google's MusicLM, and commercial services like Suno and Udio demonstrated the possibilities of AI composition to the public, few open-source models had achieved quality rivaling commercial models.
In May 2025, the release of ACE-Step, jointly developed by ACE Studio and StepFun, changed the landscape. ACE-Step is a Foundation Model that generates up to 4 minutes of high-quality music from text prompts and lyrics in approximately 20 seconds, running at more than 15x real-time and delivering superior musical coherence at a scale of 3.5B parameters. In January 2026, the follow-up version ACE-Step v1.5 was released, delivering commercial-model-level quality in local environments at remarkable speeds: under 2 seconds on an A100 and under 10 seconds on an RTX 3090.
[AI Music Generation Model Development Timeline]
2023 2024 2025 2026
| | | |
v v v v
+----------+ +--------------+ +---------------+ +------------------+
| MusicGen | | Stable Audio | | ACE-Step v1 | | ACE-Step v1.5 |
| MusicLM | | Suno v3 | | (3.5B, DCAE | | (Hybrid LM+DiT, |
| AudioLDM | | Udio v1 | | + Linear DiT)| | DMD2, under 4GB) |
| Riffusion| | JEN-1 | | DiffRhythm | | Suno v5 |
+----------+ +--------------+ +---------------+ +------------------+
Key transitions:
2023 - Autoregressive, spectrogram-based generation
2024 - Commercialization: text-to-song, vocal + BGM, multilingual lyrics
2025 - Open-source leap: Diffusion + DCAE, Flow Matching, REPA training
2026 - Local deployment era: 4-8 step generation, LoRA personalization, 50+ language support
This article provides an in-depth analysis of ACE-Step's architecture based on the papers, covering the evolution from v1 to v1.5, competitive model comparisons, core foundational technologies, and practical usage guides.
2. ACE-Step v1: In-Depth Architecture Analysis
ACE-Step v1 (arXiv:2506.00045) was designed to overcome the fundamental limitations of existing music generation models. LLM-based models excel at lyric alignment but suffer from slow inference and structural artifacts, while Diffusion models enable fast synthesis but lack long-range structural coherence. ACE-Step adopts a Diffusion + DCAE + Linear Transformer architecture that integrates the strengths of both approaches.
2.1 Overall Architecture Overview
The core components of ACE-Step v1 are as follows:
[ACE-Step v1 Architecture]
+---------------------------------------------+
| Conditioning Encoders |
| |
| +----------+ +----------+ +--------------+ |
| | Text | | Lyric | | Speaker | |
| | Encoder | | Encoder | | Encoder | |
| |(mT5-base)| |(SongGen) | |(PLR-OSNet) | |
| | frozen | |trainable | | pre-trained | |
| | dim=768 | | | | dim=512 | |
| +----+-----+ +----+-----+ +------+-------+ |
+-------+------------+---------------+--------+
| | |
+------+-----+ |
| cross-attention |
v v
+-----------+ +----------------------------------------------+
| | | Linear Diffusion Transformer (DiT) |
| DCAE | | |
| Encoder |--->| +-------------------------------------+ |
| (f8c8) | | | 24 Transformer Blocks | |
| | | | - AdaLN-single (shared params) | |
| mel-spec | | | - Linear Attention | |
| to latent | | | - 1D Conv FeedForward | |
| ~10.77Hz | | | - Cross-Attention (text+lyric) | |
| | | | - REPA at layer 8 | |
+-----------+ | +-------------------------------------+ |
| |
+------------------+---------------------------+
|
v
+----------------------------------------------+
| DCAE Decoder |
| latent to mel-spectrogram to waveform |
| (Fish Audio Vocoder, 32kHz mono) |
+----------------------------------------------+
2.2 Deep Compression AutoEncoder (DCAE)
The first key innovation of ACE-Step is the application of Deep Compression AutoEncoder (DCAE), proposed by Sana (NVIDIA/MIT-HAN Lab), to the music domain. DCAE was originally designed for high-resolution image generation, achieving extremely high spatial compression ratios of 32x to 128x.
In ACE-Step, it takes mel-spectrograms as input and applies 8x compression (f8c8, channel=8):
[DCAE Compression Process]
Input: mel-spectrogram (44.1kHz/32kHz audio to mel conversion)
|
v
+---------------------------------------------+
| DCAE Encoder |
| - Residual Autoencoding |
| - Space-to-Channel Transform |
| - 8x temporal compression |
| |
| Output: latent space (~10.77Hz) |
| 4-minute music to ~2,584 latent tokens |
+---------------------------------------------+
|
v (Generated/transformed by DiT)
|
v
+---------------------------------------------+
| DCAE Decoder + Vocoder |
| - latent to mel-spectrogram reconstruction |
| - Fish Audio Universal Music Vocoder |
| - Output: 32kHz mono waveform |
+---------------------------------------------+
DCAE Training Details:
| Item | Details |
|---|---|
| Compression Config | f8c8 (8x compression, channel=8) |
| Temporal Resolution | ~10.77Hz in latent space |
| Training Hardware | 120 NVIDIA A100 GPUs |
| Training Steps | 140,000 steps |
| Global Batch Size | 480 (4 per GPU) |
| Training Duration | ~5 days |
| Discriminator | Patch-based, StyleGAN Disc2DRes, SwinDisc2D |
| Training Strategy | Phase 1: MSE only / Phase 2: frozen encoder + MSE + adversarial |
| Vocoder | Fish Audio universal music vocoder (32kHz mono) |
| Reconstruction FAD | 0.0224 |
The paper also experimented with 32x compression (f32), but it resulted in unacceptable quality degradation, leading to the adoption of 8x compression. This is because music audio is far more sensitive to temporal detail than images.
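The latent resolution figures above can be sanity-checked with simple arithmetic. The ~86.13 Hz pre-compression mel frame rate below is an assumption inferred from the stated ~10.77 Hz latent rate and the 8x temporal compression, not a value quoted from the paper:

```python
# Back-of-the-envelope check of the DCAE latent resolution.
MEL_FRAME_RATE_HZ = 86.13   # assumed mel frame rate before compression
TEMPORAL_COMPRESSION = 8    # the "f8" in f8c8

latent_rate_hz = MEL_FRAME_RATE_HZ / TEMPORAL_COMPRESSION   # ~10.77 Hz
tokens_4min = round(latent_rate_hz * 4 * 60)                # ~2,584 latent tokens

print(f"{latent_rate_hz:.2f} Hz, {tokens_4min} tokens")
```

This reproduces both numbers used throughout the article: a ~10.77 Hz latent stream and roughly 2,584 tokens for a 4-minute track.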
2.3 Conditioning Encoders: Multi-Condition Encoding
ACE-Step injects diverse conditioning information into the model through three specialized encoders:
2.3.1 Text Encoder (Style/Genre Prompt)
# Text Encoder: Google mT5-base (frozen)
# - Output dimension: 768
# - Max sequence length: 256 tokens
# - Multilingual support (100+ languages)
# - Kept frozen during training
# Prompt example:
prompt = "upbeat K-pop dance track with synth bass, 128 BPM, female vocal, major key"
The choice of mT5-base was driven by the necessity of multilingual support. Style prompts can be entered in various languages including English, Korean, Japanese, and Chinese.
2.3.2 Lyric Encoder (Lyrics Encoding)
[Lyric Encoder Processing Pipeline]
Raw lyrics input (Korean, English, Japanese, etc.)
|
v
Non-Roman scripts to Grapheme-to-Phoneme conversion to phoneme representation
|
v
XTTS VoiceBPE Tokenizer (multilingual support)
|
v
SongGen architecture-based Lyric Encoder (trainable)
|
v
Up to 4,096 tokens of lyric embeddings
The Lyric Encoder is based on the SongGen architecture and, unlike the Text Encoder, its parameters are updated during training. This is because lyric-music alignment is one of the most challenging tasks in music generation. Non-Roman scripts (Hangul, Chinese characters, Hiragana, etc.) are converted to phoneme representations through Grapheme-to-Phoneme (G2P) tools.
2.3.3 Speaker Encoder (Voice/Timbre Encoding)
# Speaker Encoder Configuration
# - Input: 10-second vocal segment with accompaniment removed (separated by demucs)
# - Architecture: PLR-OSNet (originally for face recognition, applied to vocal recognition)
# - Output dimension: 512
# - Training dropout: 50% (to prevent over-reliance on timbre)
# - Full song: average of embeddings from multiple segments
# Voice cloning scenario:
# 1. Input 10-second reference vocal segment
# 2. Separate accompaniment with demucs
# 3. Extract 512-dim embedding with Speaker Encoder
# 4. Inject embedding as condition to DiT during generation
The 50% dropout for the Speaker Encoder is an intentional design decision. By removing speaker information with 50% probability during training, the model is guided to focus sufficiently on musical structure and melody rather than excessively relying on timbre.
2.4 Linear Diffusion Transformer (DiT) Backbone
The core generation model of ACE-Step, the Linear Diffusion Transformer, consists of 24 blocks and uses linear attention instead of standard attention for efficient operation on long sequences.
[DiT Block Structure (x24)]
Input: noisy latent z_t + time embedding t
|
v
+---------------------------------+
| AdaLN-single |
| (Simplified Adaptive LayerNorm)|
| - Parameters shared across |
| all blocks |
| - Conditioned on time step t |
+------------+--------------------+
|
v
+---------------------------------+
| Linear Self-Attention |
| - O(n) complexity (vs O(n^2)) |
| - RoPE Position Encoding |
| - Up to 2,584 mel latent tokens|
+------------+--------------------+
|
v
+---------------------------------+
| Cross-Attention |
| - Text Encoder output (768-dim)|
| - Lyric Encoder output |
| - Speaker Encoder output(512-d)|
| - Concatenate and Attend |
+------------+--------------------+
|
v
+---------------------------------+
| 1D Convolutional FeedForward |
| - Adapted from 2D Conv to 1D |
| - Optimized for temporal audio |
| sequences |
+------------+--------------------+
|
v
Output: denoised prediction
(REPA semantic alignment extracted at Layer 8)
Key Architectural Decisions:
AdaLN-single: Shares Adaptive Layer Normalization parameters across all 24 blocks to maximize parameter efficiency. This technique, introduced by Sana, offers excellent performance efficiency relative to model size.
Linear Attention: Since music requires handling long sequences of up to 4 minutes, O(n) complexity linear attention was adopted instead of O(n^2) standard attention. This enables efficient processing of sequences up to 2,584 tokens.
RoPE (Rotary Position Embedding): Provides robust position information across various music lengths through relative position encoding.
1D Convolutional FeedForward: The original image-targeted 2D Conv was adapted to 1D for temporal audio sequences. This better captures the temporal continuity of audio.
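The O(n) property of linear attention comes from reordering the computation: instead of materializing an n x n attention matrix, a fixed-size summary of keys and values is accumulated once and applied per query. A minimal NumPy sketch (the elu+1 feature map is a common generic choice, not necessarily the one ACE-Step uses):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features strictly positive so the normalizer is valid
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # O(n): accumulate the (d x d_v) key-value summary once, apply per query
    Qf, Kf = feature_map(Q), feature_map(K)   # (n, d)
    kv = Kf.T @ V                             # (d, d_v) summary, size independent of n
    z = Kf.sum(axis=0)                        # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]      # (n, d_v)

rng = np.random.default_rng(0)
n, d = 2584, 64                               # ~4-minute latent sequence length
Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)
```

Each output row is a positively weighted average of value rows, so the result stays within the range of V while memory stays linear in sequence length.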
2.5 Flow Matching Generation Process
ACE-Step adopts Flow Matching instead of score-based diffusion. Flow Matching learns a straight path (linear probability path) from Gaussian noise to data distribution, enabling faster convergence and stable training.
[Flow Matching Training Process]
Time t ~ U[0, 1]
|
v
Noise z ~ N(0, I) Data x_0 (DCAE latent)
| |
+-------- Linear Interpolation --------+
z_t = (1-t)*z + t*x_0
|
v
+------------------+
| DiT(z_t, t, c) | <- conditioning c (text, lyric, speaker)
| |
| Prediction |
| target: |
| v = x_0 - z |
| (constant |
| velocity field)|
+--------+---------+
|
v
L_FM = MSE(v_predicted, v_target)
Inference:
z_0 ~ N(0, I) -> Solve ODE -> z_1 approx x_0 -> DCAE Decoder -> waveform
Loss Function:
L_Total = L_FM + lambda_SSL * L_SSL
Where:
- L_FM: Flow Matching loss (MSE)
- L_SSL: REPA Semantic Alignment loss
- lambda_SSL = 1.0 (during most of training)
-> mHuBERT component reduced to 0.01 (last 100K steps)
3. REPA: Semantic Representation Alignment Training
The second key innovation of ACE-Step is the REPA (Representation Alignment) technique. It directly leverages semantic representations from pre-trained Self-Supervised Learning (SSL) models in DiT training, achieving fast convergence and high semantic fidelity.
3.1 Roles of MERT and mHuBERT
[REPA Training Structure]
+-----------------------+
| DiT Layer 8 Output |
| (intermediate repr.)|
+-----------+-----------+
|
+-----------------+------------------+
| | |
v | v
+------------------+ | +------------------+
| MERT (frozen) | | | mHuBERT (frozen) |
| | | | |
| - Music repr. | | | - Multilingual |
| learning | | | speech repr. |
| - 1024xT_M dim | | | - 768xT_H dim |
| - 75Hz frame | | | - 50Hz frame |
| - Improves style/| | | - Improves lyric/|
| melody accuracy| | | pronunciation |
+--------+---------+ | | alignment |
| | +--------+---------+
v v v
+----------------------------------------------+
| L_SSL = avg(1 - cosine_sim(DiT_repr, SSL)) |
| |
| = 0.5 * L_MERT + 0.5 * L_mHuBERT |
+----------------------------------------------+
| SSL Model | Role | Dimension | Frame Rate | Contribution |
|---|---|---|---|---|
| MERT | Music understanding | 1024 x T_M | 75Hz | Style accuracy, melody coherence |
| mHuBERT-147 | Multilingual speech understanding | 768 x T_H | 50Hz | Lyric alignment, pronunciation naturalness |
MERT (Music Representation Transformer) is a music understanding model pre-trained with large-scale self-supervised learning, capturing high-level semantics such as musical style, melody, and harmony. mHuBERT-147 is a multilingual speech representation model supporting 147 languages, responsible for semantic alignment of lyrics and pronunciation.
By aligning representations from these two models with the DiT's 8th layer output, ACE-Step simultaneously learns musical semantics (MERT) and linguistic semantics (mHuBERT). This is particularly important for music generation with lyrics, as the synchronization (alignment) of melody and lyrics determines the naturalness of the music.
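The L_SSL formula in the diagram is just an averaged cosine distance. A minimal sketch, assuming both streams have already been projected and resampled to a common (frames, dim) shape (the real model uses a learned projection and frame-rate alignment not reproduced here):

```python
import numpy as np

def ssl_align_loss(dit_repr, ssl_repr, eps=1e-8):
    # average (1 - cosine similarity) over time frames
    num = np.sum(dit_repr * ssl_repr, axis=-1)
    den = (np.linalg.norm(dit_repr, axis=-1)
           * np.linalg.norm(ssl_repr, axis=-1) + eps)
    return float(np.mean(1.0 - num / den))

rng = np.random.default_rng(0)
h = rng.standard_normal((100, 64))                      # DiT layer-8 output (toy)
mert_like = h + 0.1 * rng.standard_normal((100, 64))    # well-aligned target
hubert_like = rng.standard_normal((100, 64))            # unrelated target

# equal weighting of the two SSL targets, as in the diagram above
l_ssl = 0.5 * ssl_align_loss(h, mert_like) + 0.5 * ssl_align_loss(h, hubert_like)
```

Identical representations give a loss of zero; orthogonal ones give a loss near one, so each term is bounded in [0, 2].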
3.2 Conditional Dropout Strategy
Dropout is applied to conditioning information during training to enhance model robustness:
| Condition | Dropout Rate | Purpose |
|---|---|---|
| Text prompt | 15% | Support for Classifier-Free Guidance (CFG) |
| Lyric (lyrics) | 15% | Support for instrumental generation without lyrics |
| Speaker (voice) | 50% | Prevent timbre over-reliance, focus on musical structure |
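The dropout scheme in the table can be sketched as independent per-condition coin flips, where a dropped condition is replaced by a null (unconditional) embedding. The dict keys and `None` sentinel are illustrative, not the real training code:

```python
import random

DROPOUT_RATES = {"text": 0.15, "lyric": 0.15, "speaker": 0.50}

def drop_conditions(conditions, rng):
    # each condition is independently nulled at its own rate, so the model
    # also learns the unconditional branches needed for CFG at inference
    return {
        name: None if rng.random() < DROPOUT_RATES[name] else emb
        for name, emb in conditions.items()
    }

rng = random.Random(0)
conds = {"text": "txt_emb", "lyric": "lyr_emb", "speaker": "spk_emb"}
dropped = [drop_conditions(conds, rng) for _ in range(10000)]
spk_rate = sum(d["speaker"] is None for d in dropped) / len(dropped)
txt_rate = sum(d["text"] is None for d in dropped) / len(dropped)
print(f"empirical speaker dropout: {spk_rate:.2f}, text dropout: {txt_rate:.2f}")
```

Over many batches the empirical rates converge to the configured 50% and 15%.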
4. ACE-Step v1 Training Details
4.1 Training Data
ACE-Step v1 was trained on a large-scale music dataset:
| Item | Details |
|---|---|
| Total Data | 1.8M unique tracks (~100,000 hours) |
| Languages | 19 languages (English majority) |
| Quality Filter | Audiobox aesthetics toolkit |
| Excluded | Low-quality recordings, live performances |
Automatic Annotation Pipeline:
[Data Annotation Pipeline]
Raw audio files
|
+-> Qwen-Omni model -> Style/genre caption generation
|
+-> Whisper large-v3 -> Lyric transcription
| +-> LSH-based IPA-to-database mapping for lyric refinement
|
+-> "All-in-one" music understanding model -> Song structure (intro, verse, chorus, etc.)
|
+-> BeatThis -> BPM extraction
|
+-> Essentia -> Key/Scale, style tag extraction
|
+-> Demucs -> Vocal/accompaniment separation (for Speaker Encoder training)
4.2 Training Configuration
Training was conducted in two stages: Pre-training + Fine-tuning:
| Stage | Data | Steps | Notes |
|---|---|---|---|
| Pre-training | Full 100K hours | 460,000 | Foundation training on full dataset |
| Fine-tuning | High-quality 20K hours | 240,000 | Curated high-quality subset |
Hyperparameters:
# Training Environment
Hardware: 15 nodes x 8 NVIDIA A100 (120 GPUs total)
Global Batch Size: 120 (1 per GPU)
Training Duration: ~264 hours (approximately 11 days)
# Optimizer
Optimizer: AdamW
Weight Decay: 1e-2
Betas: (0.8, 0.9)
Learning Rate: 1e-4
LR Schedule: Linear warm-up (4,000 steps)
Gradient Clipping: max norm 0.5
# REPA Weights
lambda_SSL: 1.0 (full training)
mHuBERT lambda: 0.01 (reduced in last 100K steps)
5. ACE-Step v1.5: Hybrid LM + DiT Evolution
ACE-Step v1.5 (arXiv:2602.00744), released in January 2026, fundamentally redesigned the v1 architecture. It introduces a Language Model as a structural planner and dramatically reduces inference steps through Distribution Matching Distillation, among other innovations.
5.1 Hybrid LM + DiT Architecture
[ACE-Step v1.5 Architecture]
User Input (Text prompt + Lyrics)
|
v
+----------------------------------------------------------+
| Composer Agent (Language Model, Qwen-based ~1.7B) |
| |
| Chain-of-Thought reasoning: |
| 1. Metadata generation (BPM, Key, Duration, Structure) |
| 2. Lyrics refinement and structuring |
| 3. Caption/style directive generation |
| 4. YAML-format Song Blueprint output |
| |
| +----------------------------------------+ |
| | bpm: 128 | |
| | key: "C major" | |
| | duration: 210 | |
| | structure: | |
| | - intro: 0-15s | |
| | - verse1: 15-45s | |
| | - chorus1: 45-75s | |
| | - verse2: 75-105s ... | |
| | style: "energetic K-pop with synth" | |
| +----------------------------------------+ |
+---------------------+------------------------------------+
| Song Blueprint
v
+----------------------------------------------------------+
| 1D VAE (Self-Learning Tokenizer) |
| - 48kHz stereo audio processing |
| - 64-dimensional latent space @ 25Hz |
| - 1920x compression ratio |
| - FSQ: 25Hz to 5Hz discrete codes (~64K codebook) |
| - "Source Latent" generation (LM-DiT bridging) |
+---------------------+------------------------------------+
|
v
+----------------------------------------------------------+
| Diffusion Transformer (DiT, ~2B parameters) |
| - Acoustic rendering with Source Latent + Blueprint |
| conditions |
| - DMD2 distillation: 50 steps to 4-8 steps |
| - 200x speedup (240-second track in ~1 second, A100) |
+----------------------------------------------------------+
The most significant change in v1.5 is the separation of structural planning from acoustic rendering. The Language Model first designs the overall blueprint of the music, and the DiT only generates the actual audio according to that blueprint. This makes it possible to maintain a consistent structure even for songs longer than 10 minutes.
5.2 Self-Learning Tokenizer
v1.5 uses a 1D VAE instead of v1's mel-spectrogram-based DCAE to directly process 48kHz stereo audio:
[v1 vs v1.5 Audio Processing Comparison]
ACE-Step v1:
Audio -> mel-spectrogram -> DCAE Encoder -> latent (10.77Hz)
latent -> DCAE Decoder -> mel -> Fish Audio Vocoder -> 32kHz mono
ACE-Step v1.5:
Audio (48kHz stereo) -> 1D VAE Encoder -> latent (25Hz, 64-dim)
latent -> FSQ -> 5Hz discrete codes ("Source Latent")
DiT -> latent -> 1D VAE Decoder -> 48kHz stereo
Improvements:
- 32kHz mono -> 48kHz stereo (improved audio quality)
- Elimination of mel-spectrogram intermediate stage (reduced information loss)
- Near-lossless quality maintained with 1920x compression ratio
The 1D VAE's Finite Scalar Quantization (FSQ) quantizes continuous 25Hz latent into 5Hz discrete codes. These discrete codes serve as Source Latent, bridging the Language Model and DiT. The codebook size is approximately 64K, and this tokenizer is trained simultaneously with the DiT through a self-learning approach.
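FSQ replaces a learned codebook with a fixed per-channel grid: each latent channel is bounded and rounded to a small number of levels, and the implicit codebook size is the product of the level counts. The level configuration below is an assumption chosen so the product matches the ~64K figure (8·8·8·5·5·5 = 64,000); the paper's actual configuration may differ:

```python
import numpy as np

LEVELS = np.array([8, 8, 8, 5, 5, 5])   # assumed per-channel level counts
HALF = (LEVELS - 1) / 2.0
OFFSET = np.where(LEVELS % 2 == 0, 0.5, 0.0)   # even level counts sit off-grid

def fsq_quantize(z):
    # bound each channel to [-1, 1], then snap it to its finite grid
    bounded = np.tanh(z) * HALF
    return (np.round(bounded + OFFSET) - OFFSET) / HALF

codebook_size = int(np.prod(LEVELS))
print(codebook_size)   # 64000

z = np.random.default_rng(0).standard_normal((4, 6))
q = fsq_quantize(z)
```

In training, the rounding is bypassed with a straight-through gradient; the quantized output is what serves as the discrete Source Latent.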
5.3 Distribution Matching Distillation (DMD2)
The key to v1.5's dramatic speed improvement is DMD2 (Distribution Matching Distillation):
[DMD2 Distillation Process]
Teacher Model (50-step DiT)
|
v Knowledge Distillation
Student Model (4-8 step DiT)
|
+-- Dynamic-shift Strategy: {1, 2, 3} step sampling
| -> Exposure to diverse denoising states to prevent overfitting
|
+-- Distribution Matching Loss
| -> Alignment of Teacher distribution with Student distribution
|
+-- Result: 200x speedup
- 50 steps to 4-8 steps
- 240-second music generated in ~1 second on A100
- Dramatic RTF (Real-Time Factor) improvement
5.4 Intrinsic Reinforcement Learning
v1.5 introduces reinforcement learning-based alignment to further improve generation quality:
[RL-Based Alignment Structure]
DiT Alignment:
+-- DiffusionNTF framework
+-- Attention Alignment Score (AAS)
| -> Measurement of cross-attention map consensus
+-- Improvement of acoustic quality and text condition adherence
LM Alignment:
+-- Pointwise Mutual Information (PMI)
| -> Measurement of semantic adherence
+-- Improvement of Song Blueprint accuracy
Final Reward Weights:
- Atmosphere: 50%
- Lyrics: 30%
- Metadata: 20%
5.5 Data and Training Infrastructure
v1.5 uses significantly larger-scale data and more sophisticated training strategies than v1:
RL-Driven Annotation Pipeline:
[v1.5 Data Annotation]
1. "Golden Set" construction (5M samples)
+-- Initial annotation with Gemini 2.5 Pro
2. Fine-tuning
+-- Fine-tune Qwen2.5-Omni with Golden Set
+-- GRPO optimization -> ACE-Captioner, ACE-Transcriber generation
3. Reward Models training
+-- Trained on 4M contrastive pairs
4. Progressive Curriculum (3 stages)
+-- Phase 1: Foundation Pre-training (20M samples)
+-- Phase 2: Omni-task Fine-tuning (17M, including stem-separated tracks)
+-- Phase 3: High-quality SFT (2M curated samples)
The 3-stage progressive curriculum spanning 27M samples in total is designed so the model starts with basic music generation capabilities and gradually learns specialized tasks.
5.6 Omni-Task Framework
Another key innovation of v1.5 is the Omni-Task framework that handles diverse music tasks with a single model:
| Task | Description | Use Scenario |
|---|---|---|
| Text-to-Music | Generate full song from text prompt | Composition, BGM |
| Cover Generation | Style/timbre conversion of existing songs | Cover song production |
| Repainting | Regenerate/modify specific sections | Partial remixing |
| Track Extraction | Separate vocal/accompaniment tracks | Mixing, remastering |
| Layering | Multi-track synthesis | Arrangement, producing |
| Completion | Continue unfinished compositions | Collaborative composition |
| Vocal-to-BGM | Generate accompaniment from vocals | Karaoke production |
All these tasks are implemented through combinations of Source Latent and Mask configurations, handled by a single model without separate model training.
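How a single mask over latent frames can express several of these tasks can be sketched as follows. The 25 Hz latent rate follows the v1.5 tokenizer described earlier; the mask semantics (1 = regenerate, 0 = keep from the Source Latent) are an illustrative assumption, not the documented interface:

```python
import numpy as np

LATENT_RATE_HZ = 25   # v1.5 continuous latent rate

def repaint_mask(total_sec, start_sec, end_sec):
    # 1 = frames the DiT regenerates, 0 = frames kept from the Source Latent
    mask = np.zeros(total_sec * LATENT_RATE_HZ, dtype=np.int8)
    mask[start_sec * LATENT_RATE_HZ:end_sec * LATENT_RATE_HZ] = 1
    return mask

full_gen = repaint_mask(210, 0, 210)        # text-to-music: regenerate everything
chorus_fix = repaint_mask(210, 45, 75)      # repainting: redo chorus1 only
continuation = repaint_mask(210, 120, 210)  # completion: continue from 2:00
print(int(chorus_fix.sum()))                # 750 frames = 30 s at 25 Hz
```

The same model weights serve every task; only the mask and the conditioning latent change.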
6. Performance Evaluation and Benchmarks
6.1 Inference Speed Comparison
The most dramatic advantage of ACE-Step is its inference speed:
| Model | RTF (RTX 4090) | 4-min Song Gen Time | Notes |
|---|---|---|---|
| ACE-Step v1 | 15.63x | ~20s (A100) | 15.63x real-time |
| ACE-Step v1.5 | - | Under 2s (A100) | DMD2 distillation |
| DiffRhythm | 10.03x | ~30s | |
| Yue (LLM-based) | 0.083x | ~48 min | Slower than real-time |
ACE-Step v1 is approximately 188x faster than the LLM-based model Yue, and v1.5 is over 10x faster than v1 through distillation.
v1.5 Performance by Hardware:
| Hardware | Full Song Gen Time | VRAM Required |
|---|---|---|
| NVIDIA A100 | Under 2 seconds | - |
| RTX 3090 | Under 10 seconds | Under 4GB |
| RTX 4090 | Under 5 seconds (est.) | Under 4GB |
| AMD Radeon | Supported (official AMD partnership) | Under 4GB |
| Apple Silicon (Mac) | Supported | Under 4GB |
6.2 Music Quality Evaluation
ACE-Step achieved competitive results across various automatic evaluation metrics and human evaluations:
Automatic Evaluation (v1):
| Metric | ACE-Step v1 | Best Comparison Model | Description |
|---|---|---|---|
| DCAE FAD | 0.0224 | DiffRhythm VAE: 0.0059 | Waveform reconstruction quality |
| Style Alignment | Top tier | Udio v1 (best) | CLAP + Mulan based |
| Lyric Alignment | Strong | Hailuo (best) | Whisper Forced Alignment |
| SongEval Coherence | Competitive | Suno v3 (best) | Musical coherence |
| SongEval Memorability | Strong | - | Memorable melody |
Automatic Evaluation (v1.5):
| Metric | ACE-Step v1.5 | Suno v5 | MinMax 2.0 |
|---|---|---|---|
| AudioBox CU | 8.09 (best) | - | - |
| AudioBox PQ | 8.35 (best) | - | - |
| SongEval Coherence | 4.72 (tied best) | - | - |
| Style Alignment | 39.1 | 46.8 | 43.1 |
| Lyric Alignment | 26.3 | 34.2 | 29.5 |
v1.5 achieved the highest scores in AudioBox CU (8.09) and PQ (8.35), and tied for best in SongEval Coherence (4.72). While it trails Suno v5 in Style and Lyric Alignment, it leads open-source models by a wide margin, and it ranks between Suno v4.5 and v5 in Music Arena human evaluations.
Human Evaluation (v1, 32 participants):
| Evaluation Item | Score (/100) |
|---|---|
| Emotional Expression | ~85 |
| Innovativeness | ~82 |
| Sound Quality | ~80 |
| Musicality | ~78 |
7. Comparative Analysis of AI Music Generation Models
7.1 Major Model Overview
A systematic comparison of major models in the current AI music generation field:
[AI Music Generation Model Classification]
+-------------------------------------------------------------+
| Open-Source Models |
+--------------+--------------+--------------+-----------------+
| ACE-Step | MusicGen | Stable Audio| Riffusion |
| (v1, v1.5) | (Meta) | Open | |
| | | (Stability) | |
| Diffusion | Autoregress | Latent | Image Diffusion|
| + DCAE/VAE | + EnCodec | Diffusion | -> Spectrogram |
| 3.5B params | 1.5B/3.3B | 1.1B | ~1B |
+--------------+--------------+--------------+-----------------+
| Commercial Models |
+--------------+--------------+--------------+-----------------+
| Suno | Udio | ElevenLabs | Google MusicLM |
| (v3->v5) | (v1->v2) | Eleven Music| |
| | | | |
| Full song | Segment-by | Licensed | Experimental/ |
| generation | -segment | commercial | Instrumental |
| pipeline | composition | use OK | focus |
+--------------+--------------+--------------+-----------------+
7.2 Detailed Comparison Table
| Model | Developer | Parameters | Generation Method | Audio Representation | Max Length | Lyric Support | Open Source |
|---|---|---|---|---|---|---|---|
| ACE-Step v1 | ACE Studio + StepFun | 3.5B | Flow Matching + DiT | Mel DCAE latent | 4 min | Yes (multilingual) | Yes |
| ACE-Step v1.5 | ACE Studio + StepFun | ~3.7B (LM+DiT) | Hybrid LM + DiT + DMD2 | 1D VAE latent | 10+ min | Yes (50+ languages) | Yes |
| MusicGen | Meta | 1.5B/3.3B | Autoregressive | EnCodec tokens | ~30s | No | Yes |
| Stable Audio Open | Stability AI | 1.1B | Latent Diffusion | VAE latent | 47s | No | Yes |
| Riffusion | Riffusion | ~1B | Image Diffusion | Spectrogram | A few seconds | No | Yes |
| JEN-1 | Jen Music | - | AR + Non-AR hybrid | Raw waveform | ~30s | No | Partial |
| Suno | Suno Inc. | Undisclosed | Undisclosed | Undisclosed | 4+ min | Yes | No |
| Udio | Udio | Undisclosed | Undisclosed | Undisclosed | Segment-based | Yes | No |
| MusicLM | Google | Undisclosed | AR + SoundStream | SoundStream tokens | ~30s | No | No |
7.3 MusicGen (Meta)
Meta's MusicGen is a pioneer in open-source music generation models. It is an autoregressive transformer model based on the EnCodec tokenizer.
[MusicGen Architecture]
Text prompt -> T5 Encoder -> Conditioning
|
v
+--------------------------+
| Autoregressive Decoder |
| (Transformer LM) |
| |
| EnCodec 4 codebooks |
| 32kHz, 50Hz sampling |
| |
| Simultaneous multi- |
| codebook generation |
| via delay pattern |
+----------+---------------+
|
v
+--------------------------+
| EnCodec Decoder |
| tokens -> waveform |
+--------------------------+
Strengths: Stable instrumental generation, melody conditioning support
Limitations: No lyric support, ~30-second limit, relatively slow autoregressive generation
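The delay pattern in the diagram offsets codebook k by k steps, so at decoding step t the model emits (c0[t], c1[t-1], c2[t-2], c3[t-3]) in parallel instead of making four sequential predictions per frame. A sketch with frame indices standing in for real EnCodec codes:

```python
PAD = -1   # marks positions with no token (before a codebook starts / after it ends)

def delay_pattern(num_frames, num_codebooks=4):
    # codebook k is shifted right by k steps relative to codebook 0
    rows = []
    for k in range(num_codebooks):
        rows.append([PAD] * k
                    + list(range(num_frames))
                    + [PAD] * (num_codebooks - 1 - k))
    return rows

pattern = delay_pattern(5)
for row in pattern:
    print(row)
```

The trade-off is a small, fixed extra latency (num_codebooks - 1 steps) in exchange for collapsing the per-frame codebook loop into a single parallel prediction.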
7.4 Suno vs ACE-Step
Suno is currently the most commercially successful AI music generation platform:
| Comparison Item | ACE-Step v1.5 | Suno v5 |
|---|---|---|
| Accessibility | Local install (open source) | Cloud service |
| VRAM Required | Under 4GB | N/A (server) |
| Song Structure | LM-based Blueprint | End-to-end |
| Customization | LoRA training possible | Prompts only |
| Style Alignment | 39.1 | 46.8 |
| Lyric Alignment | 26.3 | 34.2 |
| Price | Free (local) | Subscription |
| Commercial Use | License check needed | Paid plans |
While Suno v5 still leads in absolute quality, ACE-Step v1.5 is a powerful alternative in terms of local deployment, customization, and cost efficiency.
7.5 Stable Audio Open
Stability AI's Stable Audio Open is a latent diffusion-based open-source model:
| Comparison Item | ACE-Step v1.5 | Stable Audio Open |
|---|---|---|
| Max Length | 10+ min | 47 seconds |
| Lyric Support | Yes (50+ languages) | No |
| Vocal Generation | Yes (including Voice Cloning) | No (instrumental only) |
| Parameters | ~3.7B | 1.1B |
| Audio Quality | 48kHz stereo | 44.1kHz stereo |
ACE-Step shows superiority in nearly all aspects including length, lyrics, and vocals.
8. Core Foundational Technologies for Music Generation
An in-depth analysis of the essential foundational technologies needed to understand AI music generation.
8.1 Audio Tokenization: Converting Audio to Discrete Tokens
The first challenge for music generation models is transforming continuous audio signals into a form the model can process. There are broadly three approaches:
[Audio Representation Method Comparison]
1. Spectrogram-Based
+--------------------------------------------+
| waveform -> STFT -> mel-spectrogram -> image |
| |
| Pros: Easy visualization, can leverage |
| image models |
| Cons: Phase information loss, vocoder needed|
| Used by: Riffusion, ACE-Step v1 (DCAE input)|
+--------------------------------------------+
2. Neural Audio Codec (Discrete Tokens)
+--------------------------------------------+
| waveform -> Encoder -> RVQ -> discrete tokens|
| tokens -> Decoder -> waveform |
| |
| Pros: End-to-end, high compression ratio |
| Cons: Weak long-range dependency |
| (acoustic tokens) |
| Used by: MusicGen (EnCodec), MusicLM |
| (SoundStream) |
+--------------------------------------------+
3. Continuous Latent (VAE)
+--------------------------------------------+
| waveform -> VAE Encoder -> continuous latent |
| latent -> VAE Decoder -> waveform |
| |
| Pros: Natural integration with Diffusion |
| Cons: Compression ratio vs quality tradeoff |
| Used by: ACE-Step v1.5 (1D VAE), |
| Stable Audio |
+--------------------------------------------+
8.2 EnCodec and SoundStream
EnCodec (Meta) and SoundStream (Google) are representative Neural Audio Codec models:
[EnCodec / SoundStream Architecture]
Input: raw waveform (24kHz/48kHz)
|
v
+---------------------------------+
| Encoder (1D Conv + LSTM) |
| -> continuous embeddings |
+------------+--------------------+
|
v
+---------------------------------+
| Residual Vector Quantization |
| (RVQ) |
| |
| Codebook 1 -> Most important |
| information |
| Codebook 2 -> Residual |
| Codebook 3 -> Finer residual |
| ... |
| Codebook N -> Final residual |
| |
| Each codebook: 1024 entries |
| sampling rate: 50Hz/75Hz |
+------------+--------------------+
|
v
+---------------------------------+
| Decoder (1D TransposeConv) |
| -> reconstructed waveform |
+---------------------------------+
Training: Reconstruction Loss + Adversarial Loss
(Multi-scale discriminator)
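The RVQ stage in the diagram is mechanically simple: each codebook quantizes whatever residual the previous stages left behind, so reconstruction error shrinks as codebooks are added. A minimal sketch with random codebooks standing in for learned ones (a zero "skip" entry is included in each codebook so quantization can never increase a residual, making the per-stage error non-increasing by construction):

```python
import numpy as np

def make_codebook(rng, entries=1024, dim=16):
    cb = rng.standard_normal((entries, dim)) * 0.5
    cb[0] = 0.0   # zero entry: a stage may leave a residual untouched
    return cb

def rvq_encode(x, codebooks):
    residual = x.copy()
    codes = []
    for cb in codebooks:                                       # cb: (entries, dim)
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)                                 # nearest entry
        codes.append(idx)
        residual -= cb[idx]                                    # pass residual on
    return codes, x - residual                                 # codes, reconstruction

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16))
books = [make_codebook(rng) for _ in range(4)]
errors = []
for n in range(1, 5):
    _, recon = rvq_encode(x, books[:n])
    errors.append(float(np.mean((x - recon) ** 2)))
print(errors)   # error shrinks as stages are added
```

This is why codebook 1 carries the "most important information" and later codebooks carry progressively finer residuals.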
EnCodec vs SoundStream:
| Item | EnCodec | SoundStream |
|---|---|---|
| Developer | Meta | Google |
| Key Innovation | Multi-scale discriminator, loss balancing | RVQ introduction |
| Sample Rate | 24kHz/48kHz | 24kHz |
| Bitrate | 1.5~24 kbps | 3~18 kbps |
| Used In | MusicGen, AudioGen | AudioLM, MusicLM |
| Open Source | Yes | No |
8.3 Diffusion for Audio
Audio application of Diffusion models is built on the success in the image domain:
[Audio Diffusion Training]
Forward Process (Adding Noise):
x_0 (original audio latent)
-> x_1 -> x_2 -> ... -> x_T (pure Gaussian noise)
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1-alpha_bar_t) * epsilon, epsilon ~ N(0,I)
Reverse Process (Denoising, training target):
x_T (noise) -> x_{T-1} -> ... -> x_0 (generated audio latent)
p_theta(x_{t-1}|x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma^2 I)
Loss: L = E_{t,x_0,epsilon} [||epsilon - epsilon_theta(x_t, t, c)||^2]
(c = conditioning: text, melody, etc.)
ACE-Step v1 uses Flow Matching instead of standard Diffusion, which uses straight paths for convergence with fewer steps and stable training. v1.5 adds DMD2 distillation on top to achieve high-quality generation with only 4-8 steps.
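The forward-process equation above, written out directly; given the sampled epsilon, the clean latent is exactly recoverable, which is why regressing epsilon is a sufficient training target:

```python
import numpy as np

def add_noise(x0, alpha_bar_t, rng):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 16))
x_t, eps = add_noise(x0, alpha_bar_t=0.3, rng=np.random.default_rng(1))

# invert the forward step: epsilon-prediction pins down x_0 exactly
x0_rec = (x_t - np.sqrt(1 - 0.3) * eps) / np.sqrt(0.3)
assert np.allclose(x0_rec, x0)
```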
8.4 Classifier-Free Guidance (CFG)
CFG, a core technique in all conditional generation models, is also used in ACE-Step:
[CFG Application]
epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
Where:
- epsilon_cond: prediction with conditions (text, lyric, speaker)
- epsilon_uncond: prediction without conditions (trained via dropout)
- w: guidance scale (higher = more condition adherence, less diversity)
ACE-Step's 15% text/lyric dropout, 50% speaker dropout
enables unconditional training for this CFG.
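The guidance formula above as a one-liner: with w = 0 it returns the unconditional prediction, with w = 1 the plain conditional one, and with w > 1 it extrapolates past the conditional prediction toward stronger condition adherence:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # toy conditional prediction
eps_u = np.array([0.0, 0.0])   # toy unconditional prediction
print(cfg(eps_c, eps_u, 0.0))  # unconditional
print(cfg(eps_c, eps_u, 1.0))  # conditional
print(cfg(eps_c, eps_u, 7.0))  # extrapolated, stronger adherence
```

The `guidance_scale: 7.0` in the API example later in this article is exactly this w.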
9. Practical Usage Guide
9.1 ACE-Step v1.5 Local Installation
ACE-Step v1.5 offers a remarkably simple installation process:
# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone repository and install dependencies
git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync
# 3. Launch Gradio UI (web interface)
uv run acestep
# -> Access at http://localhost:7860
# 4. Or launch REST API server
uv run acestep-api
# -> Use API at http://localhost:8001
# 5. Environment configuration (optional)
cp .env.example .env
# Customize model paths, ports, GPU settings, etc. in .env file
Supported Hardware:
- NVIDIA GPU (CUDA): RTX 20xx or higher recommended
- AMD GPU (ROCm): Optimized through official AMD partnership
- Intel GPU: Supported
- Apple Silicon (Mac): MPS backend supported
Models are automatically downloaded on first run and operate with under 4GB of VRAM.
9.2 Basic Text-to-Music Usage
# Music generation example via API (conceptual code)
import requests
# Basic text-to-music generation
response = requests.post("http://localhost:8001/generate", json={
"prompt": "Bright and cheerful K-pop dance track, synth bass and electronic beats, "
"128 BPM, female vocal, C major",
"lyrics": """
[Verse 1]
Shining like stars tonight
Let's dance together
In this moment where music flows
We won't stop
[Chorus]
La la la shining night
La la la time together
May this moment last forever
""",
"duration": 180, # 3 minutes
"num_inference_steps": 8, # DMD2 distilled
"guidance_scale": 7.0,
"seed": 42
})
# Save output audio
with open("output.wav", "wb") as f:
f.write(response.content)
9.3 Prompt Writing Guide
Effective prompt writing directly impacts generation quality:
[Effective Prompt Structure]
1. Genre/Style : "indie folk ballad", "aggressive metal", "lo-fi hip-hop"
2. Instrumentation : "acoustic guitar, soft piano, light percussion"
3. Mood/Emotion : "melancholic", "uplifting", "dreamy"
4. Tempo (BPM) : "slow tempo 70 BPM", "fast 140 BPM"
5. Key : "minor key", "E flat major"
6. Vocal Character : "female vocal, breathy", "male baritone, powerful"
7. Production Style : "lo-fi with vinyl crackle", "clean studio production"
[Good Prompt Example]
"Dreamy shoegaze rock with layers of reverbed electric guitars,
ethereal female vocal, 90 BPM, D minor, lo-fi production
with tape saturation and subtle noise"
[Lyrics Format]
- Use [Verse], [Chorus], [Bridge], [Intro], [Outro] tags
- Clearly separate each section
- One phrase per line
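The seven-part structure above can be assembled programmatically. The helper below is a hypothetical convenience function for this guide, not part of the ACE-Step API:

```python
def build_prompt(genre, instruments, mood, bpm=None, key=None,
                 vocal=None, production=None):
    """Assemble a comma-separated prompt following the structure above:
    genre -> instrumentation -> mood -> tempo -> key -> vocal -> production."""
    parts = [genre, instruments, mood]
    if bpm:
        parts.append(f"{bpm} BPM")
    if key:
        parts.append(key)
    if vocal:
        parts.append(vocal)
    if production:
        parts.append(production)
    return ", ".join(p for p in parts if p)

prompt = build_prompt(
    genre="dreamy shoegaze rock",
    instruments="layers of reverbed electric guitars",
    mood="ethereal",
    bpm=90,
    key="D minor",
    vocal="ethereal female vocal",
    production="lo-fi production with tape saturation",
)
```

The resulting string can be passed directly as the `prompt` field of the generation request shown in section 9.2.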
9.4 LoRA Personalization Training
One of ACE-Step v1.5's powerful features is LoRA support that allows training your own style with a small number of songs:
[LoRA Training Process]
1. Data Preparation
+-- Minimum 3-5 reference music tracks
+-- Text prompt (caption) for each track
+-- (Optional) Lyrics files
2. Access LoRA Training tab in Gradio UI
+-- Upload audio files
+-- Enter captions
+-- Configure training parameters
| +-- Learning Rate: ~1e-4
| +-- Epochs: 50-200
| +-- LoRA Rank: 8-64
+-- Start training
3. Apply trained LoRA
+-- Load LoRA weights during generation
+-- Adjust LoRA Scale (0.0~1.0)
+-- Combine with existing prompts to apply style
This allows you to reflect a specific artist's production style, nuances of a specific genre, or your own composition style in the model.
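Why does a LoRA adapter stay small enough to train on a few songs? It freezes the base weight matrix W and learns only a low-rank update, as in this generic numpy sketch (illustrating the LoRA math, not ACE-Step's internals):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 64, 64, 8  # rank 8 is at the low end of the 8-64 range above

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x, scale=1.0):
    """y = W x + scale * B (A x).
    Only A and B are trained: 2 * rank * d parameters
    instead of the d_out * d_in of a full fine-tune."""
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d_in)
# with B zero-initialized, the adapter starts as an exact no-op,
# so training begins from the unmodified base model
```

The "LoRA Scale (0.0~1.0)" slider in the Gradio UI corresponds to the `scale` factor here: 0.0 disables the adapter, 1.0 applies the learned style update at full strength.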
9.5 ComfyUI Integration
ACE-Step 1.5 also supports integration with ComfyUI, enabling visual configuration of music generation in a node-based workflow:
[ComfyUI ACE-Step Workflow Example]
+----------+ +--------------+ +--------------+
| Text |---->| ACE-Step |---->| Audio |
| Prompt | | Generator | | Preview |
+----------+ | | +--------------+
| |
+----------+ | | +--------------+
| Lyrics |---->| |---->| Save WAV |
| Input | | | | Node |
+----------+ +--------------+ +--------------+
10. Ethical Considerations and Legal Issues
10.1 Copyright Status (2025-2026)
Copyright issues in AI music generation are currently one of the hottest legal topics:
Key Rulings and Trends:
| Date | Event | Impact |
|---|---|---|
| Jan 2025 | US Copyright Office: No copyright for 100% AI-generated content | Public domain ruling |
| Mar 2025 | US Appeals Court: Confirms denial of copyright for AI works | Legal precedent established |
| Aug 2025 | ElevenLabs Eleven Music launch | First legally licensed commercial AI music |
| Sep 2025 | Warner Music + Suno settlement | Suno agrees to license-based model transition |
| Nov 2025 | UMG + Udio settlement | Similar license transition agreement |
| Jan 2026 | UMG vs Anthropic ($3B) | Copyright lawsuit over 20,000+ songs in training data |
10.2 "Meaningful Human Authorship" Principle
The US Copyright Office released guidelines stating that copyright may be recognized for AI-assisted works when "meaningful human authorship" is present:
[Copyright Recognition Spectrum for AI Music]
Fully AI-Generated Fully Human-Created
<---------------------------------------->
| | |
No copyright Judgment needed Copyright recognized
|
Human actively:
- Modifying melody
- Writing lyrics
- Arranging structure
- Selecting/editing AI output
-> "Meaningful Human Authorship"
-> Copyright may be recognized
10.3 Ethical Considerations for Open-Source Models
Open-source models like ACE-Step require additional ethical considerations:
Training Data Sources: The papers do not clearly disclose the copyright status of ACE-Step's training data (1.8M songs for v1, 27M samples for v1.5). Users should be aware of the legal risks of using generated music commercially.
Voice Cloning Misuse: The voice cloning capability through the Speaker Encoder could be misused to replicate specific artists' voices without authorization. Cloning without consent from the reference vocal rights holder is both ethically and legally problematic.
Deepfake Music: Deepfake music where AI generates "new songs" by specific artists has already emerged as a social issue. ACE-Step's Cover Generation feature also requires responsible use in this context.
Impact on the Music Industry: The democratization of AI music generation technology can directly affect the livelihoods of professional musicians, composers, and producers. A balance between technological advancement and creator protection is needed.
10.4 Guidelines for Responsible Use
[Responsible Use Principles for AI Music Generation]
1. Transparency: Clearly state when music is AI-generated/assisted
2. Consent: Obtain original artist consent for Voice Cloning
3. Attribution: Clearly distinguish AI tool contributions from human contributions
4. Commercial Use: Comply with relevant regulations and license conditions
5. Education: Use AI tools as supplementary tools for music education/learning
6. Fair Use: Distinguish between style imitation and copying of existing music
11. Key Paper References
A compilation of key papers on ACE-Step and the AI music generation field:
11.1 ACE-Step Related
| Paper | Authors | Year | Key Contribution |
|---|---|---|---|
| ACE-Step: A Step Towards Music Generation Foundation Model | Gong et al. | 2025 | DCAE + Linear DiT + REPA |
| ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation | ACE-Step Team | 2026 | Hybrid LM+DiT, DMD2, RL alignment |
11.2 Foundational Technologies
| Paper | Key Contribution | Usage |
|---|---|---|
| Deep Compression Autoencoder (Chen et al., 2024) | High compression ratio AutoEncoder | ACE-Step DCAE |
| MERT (Li et al., 2024) | Self-supervised music representation learning | ACE-Step REPA |
| mHuBERT-147 (Lee et al., 2024) | Multilingual speech representation | ACE-Step REPA |
| Flow Matching (Lipman et al., 2023) | ODE-based generative model | ACE-Step generation process |
| DMD2 (Yin et al., 2024) | Distribution Matching Distillation | ACE-Step v1.5 speedup |
11.3 Competing Model Papers
| Paper | Author/Org | Year | Key Contribution |
|---|---|---|---|
| MusicGen: Simple and Controllable Music Generation | Copet et al. (Meta) | 2023 | EnCodec + AR Transformer |
| MusicLM: Generating Music from Text | Agostinelli et al. (Google) | 2023 | SoundStream + AR |
| Stable Audio Open | Evans et al. (Stability AI) | 2024 | Latent Diffusion for Audio |
| Riffusion | Forsgren & Martiros | 2022 | Spectrogram Image Diffusion |
| JEN-1: Text-Guided Universal Music Generation | Li et al. | 2023 | AR + Non-AR hybrid |
| DiffRhythm | - | 2025 | 1D VAE + Flow DiT |
| SongGen | - | 2025 | Lyric encoding architecture |
11.4 Audio Tokenization
| Paper | Author/Org | Year | Key Contribution |
|---|---|---|---|
| EnCodec: High Fidelity Neural Audio Compression | Defossez et al. (Meta) | 2022 | RVQ + Multi-scale Disc |
| SoundStream: An End-to-End Neural Audio Codec | Zeghidour et al. (Google) | 2021 | RVQ introduction |
| WavTokenizer | Peng et al. | 2025 | 40/75 tokens/sec SOTA |
| AudioLM: A Language Modeling Approach to Audio | Borsos et al. (Google) | 2023 | Semantic + Acoustic tokens |
12. Future Outlook
12.1 Technology Development Direction
AI music generation technology is expected to evolve in the following directions:
[AI Music Generation Technology Development Roadmap]
2026 Current 2027 Expected 2028+ Long-term
| | |
v v v
+--------------+ +--------------+ +------------------+
| Current State | | Short-term | | Long-term Vision |
| | | Development | | |
| - 4-min song | -> | - Album-level| -> | - Real-time |
| generation | | consistent | | interactive |
| - Text cond. | | generation | | music gen |
| - LoRA | | - Multi-track| | - Emotion-aware |
| personalize| | simultaneous| | adaptive music |
| - Voice Clone| | generation | | - Video-music |
| - 50+ langs | | - Real-time | | synchronization|
| | | streaming | | - Fully automated|
| | | generation | | production |
+--------------+ +--------------+ +------------------+
12.2 ACE-Step's Foundation Model Vision
The ultimate vision of the ACE-Step project is to become the "Stable Diffusion of Music AI." This means not just a simple text-to-music pipeline, but a general-purpose Foundation Model upon which various downstream tasks can be built:
[ACE-Step Foundation Model Ecosystem Vision]
+-------------------------+
| ACE-Step Foundation |
| Model (Base) |
+----------+--------------+
|
+--------------------+--------------------+
| | |
v v v
+--------------+ +--------------+ +------------------+
| Text-to- | | Audio | | Music |
| Music | | Editing | | Understanding |
| Generation | | & Remixing | | & Analysis |
+--------------+ +--------------+ +------------------+
| | |
v v v
+--------------+ +--------------+ +------------------+
| LoRA | | Voice | | Stem |
| Style | | Cloning | | Separation |
| Transfer | | & TTS | | & Transcription |
+--------------+ +--------------+ +------------------+
When this vision is realized, diverse users including music producers, video creators, game developers, and educators will be able to generate and edit commercial-quality music in local environments.
12.3 Industry Impact Outlook
Democratization of Music Production: The ability to generate commercial-quality music with 4GB VRAM means the barrier to entry for music production has been dramatically lowered.
Hybrid Workflows: AI-Human collaborative workflows where AI generates drafts and humans refine them will become standard. ACE-Step's Repainting, Completion, and Track Extraction features are optimized for such workflows.
Personalized Music Experiences: Personalization training through LoRA enables music generation tailored to each user's preferences. This will lead to dynamically generated custom music in games, meditation apps, fitness apps, and more.
Legal Framework Establishment: Through the lawsuits and settlements of 2025-2026, a clear legal framework for AI music generation will gradually be formed. ElevenLabs' license-based approach could serve as one model.
13. Conclusion
ACE-Step is a landmark model that has dramatically narrowed the gap between open-source and commercial models in AI music generation. The v1 DCAE + Linear DiT + REPA architecture achieved 188x faster inference than LLM-based models at 3.5B parameters, and the v1.5 Hybrid LM + DiT + DMD2 architecture realized remarkable efficiency of under 2 seconds on A100 and under 4GB VRAM.
Summarizing the key technical contributions:
- DCAE Application to Music Domain: Achieved high-quality reconstruction while maintaining 10.77Hz temporal resolution with 8x compression
- REPA Training: Fast convergence and high fidelity through musical/linguistic semantic alignment via MERT + mHuBERT
- Hybrid LM + DiT: Support for songs longer than 10 minutes through separation of structural planning and acoustic rendering
- DMD2 Distillation: Compressed 50 steps to 4-8 steps, 200x speed improvement
- Omni-Task Framework: Single model performs diverse tasks including text-to-music, cover, repainting, and track separation
Of course, gaps still exist with top-tier commercial models like Suno v5 in Style/Lyric Alignment. However, the values ACE-Step offers -- open-source, local deployment, and customizability -- are unique advantages that commercial models cannot provide. ACE-Step's journey toward music AI's "Stable Diffusion moment" has only just begun.
References
- Gong, J., Zhao, S., Wang, S., Xu, S., & Guo, J. (2025). ACE-Step: A Step Towards Music Generation Foundation Model. arXiv:2506.00045. https://arxiv.org/abs/2506.00045
- ACE-Step Team. (2026). ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation. arXiv:2602.00744. https://arxiv.org/abs/2602.00744
- Chen, J. et al. (2024). Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. arXiv:2410.10733. https://arxiv.org/abs/2410.10733
- Copet, J. et al. (2023). Simple and Controllable Music Generation. NeurIPS 2023. https://arxiv.org/abs/2306.05284
- Agostinelli, A. et al. (2023). MusicLM: Generating Music From Text. arXiv:2301.11325. https://arxiv.org/abs/2301.11325
- Defossez, A. et al. (2022). High Fidelity Neural Audio Compression. arXiv:2210.13438. https://arxiv.org/abs/2210.13438
- Zeghidour, N. et al. (2021). SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312. https://arxiv.org/abs/2107.03312
- Li, Y. et al. (2024). MERT: Acoustic Music Understanding Model with Large-Scale Self-Supervised Training. ICLR 2024.
- Lee, R. et al. (2024). mHuBERT-147: A Compact Multilingual HuBERT Model. Interspeech 2024.
- Lipman, Y. et al. (2023). Flow Matching for Generative Modeling. ICLR 2023.
- Yin, T. et al. (2024). One-step Diffusion with Distribution Matching Distillation. CVPR 2024.
- Evans, Z. et al. (2024). Stable Audio Open. arXiv:2407.14358.
- Li, P. et al. (2023). JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models. arXiv:2308.04729.
- ACE-Step GitHub (v1): https://github.com/ace-step/ACE-Step
- ACE-Step GitHub (v1.5): https://github.com/ace-step/ACE-Step-1.5
- ACE-Step Hugging Face: https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B