ACE-Step: A New Paradigm in AI Music Generation — Complete Analysis of Architecture, Training Methods, and Practical Applications

1. Introduction: A Turning Point in AI Music Generation

The field of AI Music Generation has undergone explosive progress from 2024 to 2025. While Meta's MusicGen, Google's MusicLM, and commercial services like Suno and Udio demonstrated the possibilities of AI composition to the public, few open-source models had achieved quality rivaling commercial models.

In May 2025, the release of ACE-Step, jointly developed by ACE Studio and StepFun, changed the landscape. ACE-Step is a Foundation Model that generates up to 4 minutes of high-quality music from text prompts and lyrics in approximately 20 seconds, achieving over 15x faster inference speed than LLM-based models and superior musical coherence at a scale of 3.5B parameters. In January 2026, the follow-up version ACE-Step 1.5 was released, delivering commercial-model-level quality in local environments with remarkable speeds of under 2 seconds on A100 and under 10 seconds on RTX 3090.

[AI Music Generation Model Development Timeline]

2023               2024                2025                  2026
  |                  |                   |                     |
  v                  v                   v                     v
+----------+   +--------------+   +---------------+   +------------------+
| MusicGen |   | Stable Audio |   | ACE-Step v1   |   | ACE-Step v1.5    |
| MusicLM  |   | Suno v3      |   | (3.5B, DCAE   |   | (Hybrid LM+DiT,  |
| AudioLDM |   | Udio v1      |   |  + Linear DiT)|   |  DMD2, under 4GB) |
| Riffusion|   | JEN-1        |   | DiffRhythm    |   | Suno v5          |
+----------+   +--------------+   +---------------+   +------------------+

Key Transition:     Commercialization:  Open-source Leap:     Local Deployment Era:
- Autoregressive   - Text-to-Song     - Diffusion + DCAE      - 4-8 step generation
- Spectrogram      - Vocal + BGM      - Flow Matching         - LoRA personalization
  based generation - Multilingual      - REPA training         - 50+ language support
                     lyrics

This article provides an in-depth analysis of ACE-Step's architecture based on the papers, covering the evolution from v1 to v1.5, competitive model comparisons, core foundational technologies, and practical usage guides.


2. ACE-Step v1: In-Depth Architecture Analysis

ACE-Step v1 (arXiv:2506.00045) was designed to overcome the fundamental limitations of existing music generation models. LLM-based models excel at lyric alignment but suffer from slow inference and structural artifacts, while Diffusion models enable fast synthesis but lack long-range structural coherence. ACE-Step adopts a Diffusion + DCAE + Linear Transformer architecture that integrates the strengths of both approaches.

2.1 Overall Architecture Overview

The core components of ACE-Step v1 are as follows:

[ACE-Step v1 Architecture]

                    +---------------------------------------------+
                    |           Conditioning Encoders              |
                    |                                              |
                    |  +----------+ +----------+ +--------------+ |
                    |  |  Text    | |  Lyric   | |   Speaker    | |
                    |  | Encoder  | | Encoder  | |   Encoder    | |
                    |  |(mT5-base)| |(SongGen) | |(PLR-OSNet)   | |
                    |  | frozen   | |trainable | | pre-trained  | |
                    |  | dim=768  | |          | | dim=512      | |
                    |  +----+-----+ +----+-----+ +------+-------+ |
                    +-------+------------+---------------+--------+
                            |            |               |
                            +------+-----+               |
                                   | cross-attention      |
                                   v                      v
+-----------+    +----------------------------------------------+
|           |    |     Linear Diffusion Transformer (DiT)       |
|   DCAE    |    |                                              |
|  Encoder  |--->|  +-------------------------------------+    |
|  (f8c8)   |    |  |  24 Transformer Blocks               |    |
|           |    |  |  - AdaLN-single (shared params)      |    |
| mel-spec  |    |  |  - Linear Attention                  |    |
| to latent |    |  |  - 1D Conv FeedForward               |    |
| ~10.77Hz  |    |  |  - Cross-Attention (text+lyric)      |    |
|           |    |  |  - REPA at layer 8                   |    |
+-----------+    |  +-------------------------------------+    |
                 |                                              |
                 +------------------+---------------------------+
                                    |
                                    v
                 +----------------------------------------------+
                 |              DCAE Decoder                     |
                 |   latent to mel-spectrogram to waveform       |
                 |   (Fish Audio Vocoder, 32kHz mono)            |
                 +----------------------------------------------+

2.2 Deep Compression AutoEncoder (DCAE)

The first key innovation of ACE-Step is the application of Deep Compression AutoEncoder (DCAE), proposed by Sana (NVIDIA/MIT-HAN Lab), to the music domain. DCAE was originally designed for high-resolution image generation, achieving extremely high spatial compression ratios of 32x to 128x.

In ACE-Step, it takes mel-spectrograms as input and applies 8x compression (f8c8, channel=8):

[DCAE Compression Process]

Input: mel-spectrogram (44.1kHz/32kHz audio to mel conversion)
  |
  v
+---------------------------------------------+
|  DCAE Encoder                               |
|  - Residual Autoencoding                    |
|  - Space-to-Channel Transform              |
|  - 8x temporal compression                  |
|                                             |
|  Output: latent space (~10.77Hz)            |
|  4-minute music to ~2,584 latent tokens     |
+---------------------------------------------+
  |
  v (Generated/transformed by DiT)
  |
  v
+---------------------------------------------+
|  DCAE Decoder + Vocoder                     |
|  - latent to mel-spectrogram reconstruction |
|  - Fish Audio Universal Music Vocoder       |
|  - Output: 32kHz mono waveform             |
+---------------------------------------------+

DCAE Training Details:

| Item | Details |
| --- | --- |
| Compression Config | f8c8 (8x compression, channel=8) |
| Temporal Resolution | ~10.77Hz in latent space |
| Training Hardware | 120 NVIDIA A100 GPUs |
| Training Steps | 140,000 steps |
| Global Batch Size | 480 (4 per GPU) |
| Training Duration | ~5 days |
| Discriminator | Patch-based, StyleGAN Disc2DRes, SwinDisc2D |
| Training Strategy | Phase 1: MSE only / Phase 2: frozen encoder + MSE + adversarial |
| Vocoder | Fish Audio universal music vocoder (32kHz mono) |
| Reconstruction FAD | 0.0224 |

The paper also experimented with 32x compression (f32), but it caused unacceptable quality degradation, so 8x compression was adopted: music audio is far more sensitive to temporal detail than images are to spatial detail.
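As a sanity check on these figures, the latent length for a 4-minute song follows directly from the quoted ~10.77 Hz latent rate (a back-of-envelope sketch using only the numbers above, not ACE-Step code):

```python
# Back-of-envelope check of ACE-Step v1's DCAE latent resolution.
LATENT_RATE_HZ = 10.77      # latent frame rate after 8x temporal compression
SONG_SECONDS = 4 * 60       # a 4-minute track

mel_rate_hz = LATENT_RATE_HZ * 8              # mel frame rate before compression
latent_tokens = round(LATENT_RATE_HZ * SONG_SECONDS)

print(mel_rate_hz)      # ~86.2 Hz mel frames going into the encoder
print(latent_tokens)    # ~2,585 latent tokens, matching the ~2,584 cited above
```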

2.3 Conditioning Encoders: Multi-Condition Encoding

ACE-Step injects diverse conditioning information into the model through three specialized encoders:

2.3.1 Text Encoder (Style/Genre Prompt)

# Text Encoder: Google mT5-base (frozen)
# - Output dimension: 768
# - Max sequence length: 256 tokens
# - Multilingual support (100+ languages)
# - Kept frozen during training

# Prompt example:
prompt = "upbeat K-pop dance track with synth bass, 128 BPM, female vocal, major key"

The choice of mT5-base was driven by the necessity of multilingual support. Style prompts can be entered in various languages including English, Korean, Japanese, and Chinese.

2.3.2 Lyric Encoder (Lyrics Encoding)

[Lyric Encoder Processing Pipeline]

Raw lyrics input (Korean, English, Japanese, etc.)
  |
  v
Non-Roman scripts to Grapheme-to-Phoneme conversion to phoneme representation
  |
  v
XTTS VoiceBPE Tokenizer (multilingual support)
  |
  v
SongGen architecture-based Lyric Encoder (trainable)
  |
  v
Up to 4,096 tokens of lyric embeddings

The Lyric Encoder is based on the SongGen architecture and, unlike the Text Encoder, its parameters are updated during training. This is because lyric-music alignment is one of the most challenging tasks in music generation. Non-Roman scripts (Hangul, Chinese characters, Hiragana, etc.) are converted to phoneme representations through Grapheme-to-Phoneme (G2P) tools.

2.3.3 Speaker Encoder (Voice/Timbre Encoding)

# Speaker Encoder Configuration
# - Input: 10-second vocal segment with accompaniment removed (separated by demucs)
# - Architecture: PLR-OSNet (originally for face recognition, applied to vocal recognition)
# - Output dimension: 512
# - Training dropout: 50% (to prevent over-reliance on timbre)
# - Full song: average of embeddings from multiple segments

# Voice cloning scenario:
# 1. Input 10-second reference vocal segment
# 2. Separate accompaniment with demucs
# 3. Extract 512-dim embedding with Speaker Encoder
# 4. Inject embedding as condition to DiT during generation

The 50% dropout for the Speaker Encoder is an intentional design decision. By removing speaker information with 50% probability during training, the model is guided to focus sufficiently on musical structure and melody rather than excessively relying on timbre.
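The 50% dropout amounts to conditionally zeroing the speaker embedding during training. A minimal sketch (the function name and zeroing convention are mine, not from the ACE-Step code):

```python
import random

def apply_speaker_dropout(speaker_emb, p_drop=0.5, rng=random):
    """With probability p_drop, zero out the speaker embedding so the model
    must rely on text/lyric conditions rather than timbre (sketch)."""
    if rng.random() < p_drop:
        return [0.0] * len(speaker_emb)
    return speaker_emb

# At the extremes the behavior is deterministic:
kept = apply_speaker_dropout([0.3, -1.2, 0.8], p_drop=0.0)     # always kept
dropped = apply_speaker_dropout([0.3, -1.2, 0.8], p_drop=1.0)  # always zeroed
```

The same mechanism, with a 15% rate, supplies the unconditional branch needed for classifier-free guidance on the text and lyric conditions.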

2.4 Linear Diffusion Transformer (DiT) Backbone

The core generation model of ACE-Step, the Linear Diffusion Transformer, consists of 24 blocks and uses linear attention instead of standard attention for efficient operation on long sequences.

[DiT Block Structure (x24)]

Input: noisy latent z_t + time embedding t
  |
  v
+---------------------------------+
|  AdaLN-single                   |
|  (Simplified Adaptive LayerNorm)|
|  - Parameters shared across     |
|    all blocks                   |
|  - Conditioned on time step t   |
+------------+--------------------+
             |
             v
+---------------------------------+
|  Linear Self-Attention          |
|  - O(n) complexity (vs O(n^2)) |
|  - RoPE Position Encoding       |
|  - Up to 2,584 mel latent tokens|
+------------+--------------------+
             |
             v
+---------------------------------+
|  Cross-Attention                |
|  - Text Encoder output (768-dim)|
|  - Lyric Encoder output         |
|  - Speaker Encoder output(512-d)|
|  - Concatenate and Attend       |
+------------+--------------------+
             |
             v
+---------------------------------+
|  1D Convolutional FeedForward   |
|  - Adapted from 2D Conv to 1D  |
|  - Optimized for temporal audio |
|    sequences                    |
+------------+--------------------+
             |
             v
Output: denoised prediction
(REPA semantic alignment extracted at Layer 8)

Key Architectural Decisions:

  1. AdaLN-single: Shares Adaptive Layer Normalization parameters across all 24 blocks to maximize parameter efficiency. This technique, introduced by Sana, offers excellent performance efficiency relative to model size.

  2. Linear Attention: Since music requires handling long sequences of up to 4 minutes, O(n) complexity linear attention was adopted instead of O(n^2) standard attention. This enables efficient processing of sequences up to 2,584 tokens.

  3. RoPE (Rotary Position Embedding): Provides robust position information across various music lengths through relative position encoding.

  4. 1D Convolutional FeedForward: The original image-targeted 2D Conv was adapted to 1D for temporal audio sequences. This better captures the temporal continuity of audio.
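To make the O(n) claim concrete, here is a toy kernelized linear attention in plain Python (an illustrative sketch with an ELU+1 feature map; real implementations are vectorized and ACE-Step's exact kernel may differ). The key point: the key/value summary is accumulated once, so cost grows linearly with sequence length.

```python
import math

def phi(vec):
    # ELU(x) + 1: a strictly positive feature map, so per-query weights normalize
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def linear_attention(Q, K, V):
    """O(n) attention sketch: accumulate phi(K)^T V and the phi(K) sums once,
    then each query is a normalized read-out of that fixed-size summary."""
    d, dv = len(K[0]), len(V[0])
    KV = [[0.0] * dv for _ in range(d)]   # d x dv summary, independent of n
    k_sum = [0.0] * d
    for k_row, v_row in zip(K, V):
        fk = phi(k_row)
        for i in range(d):
            k_sum[i] += fk[i]
            for j in range(dv):
                KV[i][j] += fk[i] * v_row[j]
    out = []
    for q_row in Q:
        fq = phi(q_row)
        z = sum(fq[i] * k_sum[i] for i in range(d))
        out.append([sum(fq[i] * KV[i][j] for i in range(d)) / z
                    for j in range(dv)])
    return out
```

Because the feature map is positive, every output row is a convex combination of value rows, as in softmax attention, but the summary `KV` stays the same size whether the sequence has 10 or 2,584 tokens.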

2.5 Flow Matching Generation Process

ACE-Step adopts Flow Matching instead of score-based diffusion. Flow Matching learns a straight path (linear probability path) from Gaussian noise to data distribution, enabling faster convergence and stable training.

[Flow Matching Training Process]

Time t ~ U[0, 1]
  |
  v
Noise z ~ N(0, I)         Data x_0 (DCAE latent)
  |                            |
  +-------- Linear Interpolation --------+
            z_t = (1-t)*z + t*x_0
                  |
                  v
        +------------------+
        |   DiT(z_t, t, c) |  <- conditioning c (text, lyric, speaker)
        |                  |
        |  Prediction      |
        |  target:         |
        |  v = x_0 - z     |
        |  (constant       |
        |   velocity field)|
        +--------+---------+
                 |
                 v
        L_FM = MSE(v_predicted, v_target)

Inference:
  z_0 ~ N(0, I) -> Solve ODE -> z_1 approx x_0 -> DCAE Decoder -> waveform

Loss Function:

L_Total = L_FM + lambda_SSL * L_SSL

Where:
- L_FM: Flow Matching loss (MSE)
- L_SSL: REPA Semantic Alignment loss
- lambda_SSL = 1.0 (during most of training)
         -> mHuBERT component reduced to 0.01 (last 100K steps)
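The training pair and the inference-time ODE solve above can be sketched in a few lines (a 1-D toy; names are my own, and the velocity/path definitions follow the diagram):

```python
import random

def fm_training_pair(x0, t, rng=random):
    """One Flow Matching example on the linear path z_t = (1-t)*z + t*x0,
    with target velocity v = x0 - z (1-D toy)."""
    z = rng.gauss(0.0, 1.0)
    return (1.0 - t) * z + t * x0, x0 - z

def euler_sample(velocity_fn, z0, steps=8):
    """Inference: integrate dz/dt = v(z, t) from t=0 (noise) to t=1 (data)."""
    z, dt = z0, 1.0 / steps
    for i in range(steps):
        z += velocity_fn(z, i * dt) * dt
    return z

# With a perfectly learned constant field v = x0 - z0, any step count lands on x0:
x0, z0 = 2.0, 0.5
result = euler_sample(lambda z, t: x0 - z0, z0, steps=4)
```

The straightness of the path is why few-step sampling, and later DMD2 distillation in v1.5, pairs naturally with this formulation.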

3. REPA: Semantic Representation Alignment Training

The second key innovation of ACE-Step is the REPA (Representation Alignment) technique. It directly leverages semantic representations from pre-trained Self-Supervised Learning (SSL) models in DiT training, achieving fast convergence and high semantic fidelity.

3.1 Roles of MERT and mHuBERT

[REPA Training Structure]

                    +-----------------------+
                    |   DiT Layer 8 Output  |
                    |   (intermediate repr.)|
                    +-----------+-----------+
                                |
              +-----------------+------------------+
              |                 |                   |
              v                 |                   v
+------------------+            |     +------------------+
|   MERT (frozen)  |            |     | mHuBERT (frozen) |
|                  |            |     |                  |
| - Music repr.    |            |     | - Multilingual   |
|   learning       |            |     |   speech repr.   |
| - 1024xT_M dim   |            |     | - 768xT_H dim    |
| - 75Hz frame     |            |     | - 50Hz frame     |
| - Improves style/|            |     | - Improves lyric/|
|   melody accuracy|            |     |   pronunciation  |
+--------+---------+            |     |   alignment      |
         |                      |     +--------+---------+
         v                      v              v
    +----------------------------------------------+
    |  L_SSL = avg(1 - cosine_sim(DiT_repr, SSL))  |
    |                                              |
    |  = 0.5 * L_MERT + 0.5 * L_mHuBERT           |
    +----------------------------------------------+

| SSL Model | Role | Dimension | Frame Rate | Contribution |
| --- | --- | --- | --- | --- |
| MERT | Music understanding | 1024 x T_M | 75Hz | Style accuracy, melody coherence |
| mHuBERT-147 | Multilingual speech understanding | 768 x T_H | 50Hz | Lyric alignment, pronunciation naturalness |

MERT (Music Representation Transformer) is a music understanding model pre-trained with large-scale self-supervised learning, capturing high-level semantics such as musical style, melody, and harmony. mHuBERT-147 is a multilingual speech representation model supporting 147 languages, responsible for semantic alignment of lyrics and pronunciation.

By aligning representations from these two models with the DiT's 8th layer output, ACE-Step simultaneously learns musical semantics (MERT) and linguistic semantics (mHuBERT). This is particularly important for music generation with lyrics, as the synchronization (alignment) of melody and lyrics determines the naturalness of the music.
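The alignment loss in the diagram reduces to averaged cosine distances. A minimal sketch (assuming features have already been projected and temporally interpolated to matching shapes, which in practice needs a learned projector):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def repa_loss(dit_feats, mert_feats, mhubert_feats):
    """L_SSL = 0.5 * L_MERT + 0.5 * L_mHuBERT, where each term averages
    (1 - cosine similarity) over aligned frame pairs (sketch)."""
    def term(preds, targets):
        return sum(1.0 - cosine_sim(p, t)
                   for p, t in zip(preds, targets)) / len(preds)
    return 0.5 * term(dit_feats, mert_feats) + 0.5 * term(dit_feats, mhubert_feats)
```

Perfectly aligned features give a loss of 0; anti-aligned features give the maximum of 2.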

3.2 Conditional Dropout Strategy

Dropout is applied to conditioning information during training to enhance model robustness:

| Condition | Dropout Rate | Purpose |
| --- | --- | --- |
| Text prompt | 15% | Support for Classifier-Free Guidance (CFG) |
| Lyrics | 15% | Support for instrumental generation without lyrics |
| Speaker (voice) | 50% | Prevent timbre over-reliance, focus on musical structure |

4. ACE-Step v1 Training Details

4.1 Training Data

ACE-Step v1 was trained on a large-scale music dataset:

| Item | Details |
| --- | --- |
| Total Data | 1.8M unique tracks (~100,000 hours) |
| Languages | 19 languages (English majority) |
| Quality Filter | Audiobox aesthetics toolkit |
| Excluded | Low-quality recordings, live performances |

Automatic Annotation Pipeline:

[Data Annotation Pipeline]

Raw audio files
  |
  +-> Qwen-Omni model -> Style/genre caption generation
  |
  +-> Whisper large-v3 -> Lyric transcription
  |      +-> LSH-based IPA-to-database mapping for lyric refinement
  |
  +-> "All-in-one" music understanding model -> Song structure (intro, verse, chorus, etc.)
  |
  +-> BeatThis -> BPM extraction
  |
  +-> Essentia -> Key/Scale, style tag extraction
  |
  +-> Demucs -> Vocal/accompaniment separation (for Speaker Encoder training)

4.2 Training Configuration

Training was conducted in two stages: Pre-training + Fine-tuning:

| Stage | Data | Steps | Notes |
| --- | --- | --- | --- |
| Pre-training | Full 100K hours | 460,000 | Foundation training on full dataset |
| Fine-tuning | High-quality 20K hours | 240,000 | Curated high-quality subset |

Hyperparameters:

# Training Environment
Hardware:          15 nodes x 8 NVIDIA A100 (120 GPUs total)
Global Batch Size: 120 (1 per GPU)
Training Duration: ~264 hours (approximately 11 days)

# Optimizer
Optimizer:         AdamW
Weight Decay:      1e-2
Betas:             (0.8, 0.9)
Learning Rate:     1e-4
LR Schedule:       Linear warm-up (4,000 steps)
Gradient Clipping: max norm 0.5

# REPA Weights
lambda_SSL:        1.0 (full training)
mHuBERT lambda:    0.01 (reduced in last 100K steps)
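The linear warm-up above can be written as a one-line schedule (a sketch of the stated hyperparameters; the article does not specify a decay schedule, so this holds the rate constant after warm-up):

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=4000):
    """Linear warm-up to base_lr over warmup_steps, then constant (sketch)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

# Examples: step 0 -> 0.0, step 2000 -> 5e-5, step 460000 -> 1e-4
```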

5. ACE-Step v1.5: Hybrid LM + DiT Evolution

ACE-Step v1.5 (arXiv:2602.00744), released in January 2026, fundamentally redesigned the v1 architecture. It introduces a Language Model as a structural planner and dramatically reduces inference steps through Distribution Matching Distillation, among other innovations.

5.1 Hybrid LM + DiT Architecture

[ACE-Step v1.5 Architecture]

User Input (Text prompt + Lyrics)
  |
  v
+----------------------------------------------------------+
|  Composer Agent (Language Model, Qwen-based ~1.7B)        |
|                                                          |
|  Chain-of-Thought reasoning:                              |
|  1. Metadata generation (BPM, Key, Duration, Structure)  |
|  2. Lyrics refinement and structuring                     |
|  3. Caption/style directive generation                    |
|  4. YAML-format Song Blueprint output                     |
|                                                          |
|  +----------------------------------------+               |
|  | bpm: 128                              |               |
|  | key: "C major"                        |               |
|  | duration: 210                         |               |
|  | structure:                            |               |
|  |   - intro: 0-15s                      |               |
|  |   - verse1: 15-45s                    |               |
|  |   - chorus1: 45-75s                   |               |
|  |   - verse2: 75-105s ...               |               |
|  | style: "energetic K-pop with synth"   |               |
|  +----------------------------------------+               |
+---------------------+------------------------------------+
                      | Song Blueprint
                      v
+----------------------------------------------------------+
|  1D VAE (Self-Learning Tokenizer)                        |
|  - 48kHz stereo audio processing                         |
|  - 64-dimensional latent space @ 25Hz                    |
|  - 1920x compression ratio                               |
|  - FSQ: 25Hz to 5Hz discrete codes (~64K codebook)      |
|  - "Source Latent" generation (LM-DiT bridging)          |
+---------------------+------------------------------------+
                      |
                      v
+----------------------------------------------------------+
|  Diffusion Transformer (DiT, ~2B parameters)             |
|  - Acoustic rendering with Source Latent + Blueprint     |
|    conditions                                            |
|  - DMD2 distillation: 50 steps to 4-8 steps             |
|  - 200x speedup (240-second track in ~1 second, A100)   |
+----------------------------------------------------------+

The most significant change in v1.5 is the separation of structural planning and acoustic rendering. The Language Model first designs the overall blueprint of the music, and the DiT only performs the role of generating actual audio according to this blueprint. This enables maintaining consistent structure for even songs longer than 10 minutes.
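Because the Blueprint is plain structured data, it can be validated before acoustic rendering. A hypothetical check that the section list tiles the declared duration (field names follow the YAML above; the validation logic itself is my illustration, not part of ACE-Step):

```python
# A Blueprint reduced to a Python dict, mirroring the YAML fields above.
blueprint = {
    "bpm": 128,
    "key": "C major",
    "duration": 210,
    "structure": [("intro", 0, 15), ("verse1", 15, 45),
                  ("chorus1", 45, 75), ("verse2", 75, 105)],
}

def sections_contiguous(bp):
    """True if sections start at 0, never overlap or leave gaps,
    and fit within the declared duration (sketch)."""
    prev_end = 0
    for _name, start, end in bp["structure"]:
        if start != prev_end or end <= start:
            return False
        prev_end = end
    return prev_end <= bp["duration"]
```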

5.2 Self-Learning Tokenizer

v1.5 uses a 1D VAE instead of v1's mel-spectrogram-based DCAE to directly process 48kHz stereo audio:

[v1 vs v1.5 Audio Processing Comparison]

ACE-Step v1:
  Audio -> mel-spectrogram -> DCAE Encoder -> latent (10.77Hz)
  latent -> DCAE Decoder -> mel -> Fish Audio Vocoder -> 32kHz mono

ACE-Step v1.5:
  Audio (48kHz stereo) -> 1D VAE Encoder -> latent (25Hz, 64-dim)
  latent -> FSQ -> 5Hz discrete codes ("Source Latent")
  DiT -> latent -> 1D VAE Decoder -> 48kHz stereo

Improvements:
- 32kHz mono -> 48kHz stereo (improved audio quality)
- Elimination of mel-spectrogram intermediate stage (reduced information loss)
- Near-lossless quality maintained with 1920x compression ratio

The 1D VAE's Finite Scalar Quantization (FSQ) quantizes continuous 25Hz latent into 5Hz discrete codes. These discrete codes serve as Source Latent, bridging the Language Model and DiT. The codebook size is approximately 64K, and this tokenizer is trained simultaneously with the DiT through a self-learning approach.
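FSQ itself is conceptually simple: bound each latent dimension and round it to one of a few levels, so the codebook is implicit in the grid. A sketch (the level count is illustrative; e.g. 4 dims at 16 levels gives 16^4 = 65,536 implicit codes, on the order of the ~64K cited):

```python
import math

def fsq_quantize(latent, levels=16):
    """Finite Scalar Quantization sketch: squash each dim into a bounded
    range, snap it to one of `levels` grid points, rescale to [-1, 1]."""
    half = (levels - 1) / 2.0
    out = []
    for x in latent:
        bounded = math.tanh(x) * half   # bound to [-half, half]
        out.append(round(bounded) / half)
    return out

codes = fsq_quantize([0.0, 2.5, -0.7, 10.0])  # every value on the 16-point grid
```

Unlike VQ-VAE codebooks, there is nothing to learn for the quantizer itself, which is one reason FSQ trains stably alongside the DiT.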

5.3 Distribution Matching Distillation (DMD2)

The key to v1.5's dramatic speed improvement is DMD2 (Distribution Matching Distillation):

[DMD2 Distillation Process]

Teacher Model (50-step DiT)
  |
  v Knowledge Distillation
Student Model (4-8 step DiT)
  |
  +-- Dynamic-shift Strategy: {1, 2, 3} step sampling
  |   -> Exposure to diverse denoising states to prevent overfitting
  |
  +-- Distribution Matching Loss
  |   -> Alignment of Teacher distribution with Student distribution
  |
  +-- Result: 200x speedup
      - 50 steps to 4-8 steps
      - 240-second music generated in ~1 second on A100
      - Dramatic RTF (Real-Time Factor) improvement

5.4 Intrinsic Reinforcement Learning

v1.5 introduces reinforcement learning-based alignment to further improve generation quality:

[RL-Based Alignment Structure]

DiT Alignment:
  +-- DiffusionNTF framework
  +-- Attention Alignment Score (AAS)
  |   -> Measurement of cross-attention map consensus
  +-- Improvement of acoustic quality and text condition adherence

LM Alignment:
  +-- Pointwise Mutual Information (PMI)
  |   -> Measurement of semantic adherence
  +-- Improvement of Song Blueprint accuracy

Final Reward Weights:
  - Atmosphere: 50%
  - Lyrics: 30%
  - Metadata: 20%

5.5 Data and Training Infrastructure

v1.5 uses significantly larger-scale data and more sophisticated training strategies than v1:

RL-Driven Annotation Pipeline:

[v1.5 Data Annotation]

1. "Golden Set" construction (5M samples)
   +-- Initial annotation with Gemini 2.5 Pro

2. Fine-tuning
   +-- Fine-tune Qwen2.5-Omni with Golden Set
   +-- GRPO optimization -> ACE-Captioner, ACE-Transcriber generation

3. Reward Models training
   +-- Trained on 4M contrastive pairs

4. Progressive Curriculum (3 stages)
   +-- Phase 1: Foundation Pre-training (20M samples)
   +-- Phase 2: Omni-task Fine-tuning (17M, including stem-separated tracks)
   +-- Phase 3: High-quality SFT (2M curated samples)

The 3-stage progressive curriculum spanning 27M samples in total is designed so the model starts with basic music generation capabilities and gradually learns specialized tasks.

5.6 Omni-Task Framework

Another key innovation of v1.5 is the Omni-Task framework that handles diverse music tasks with a single model:

| Task | Description | Use Scenario |
| --- | --- | --- |
| Text-to-Music | Generate full song from text prompt | Composition, BGM |
| Cover Generation | Style/timbre conversion of existing songs | Cover song production |
| Repainting | Regenerate/modify specific sections | Partial remixing |
| Track Extraction | Separate vocal/accompaniment tracks | Mixing, remastering |
| Layering | Multi-track synthesis | Arrangement, producing |
| Completion | Continue unfinished compositions | Collaborative composition |
| Vocal-to-BGM | Generate accompaniment from vocals | Karaoke production |

All these tasks are implemented through combinations of Source Latent and Mask configurations, handled by a single model without separate model training.
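A sketch of how a single mask convention can express several of these tasks (the frame indices and 0/1 convention are my illustration, not the paper's exact interface):

```python
def task_mask(num_frames, regenerate_ranges):
    """1 = frames the DiT regenerates, 0 = frames pinned to the source latent.
    Text-to-music regenerates everything; repainting only the edited span."""
    mask = [0] * num_frames
    for start, end in regenerate_ranges:
        for i in range(start, end):
            mask[i] = 1
    return mask

text_to_music = task_mask(8, [(0, 8)])   # all ones: generate from scratch
repaint = task_mask(8, [(2, 5)])         # only frames 2-4 are regenerated
```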


6. Performance Evaluation and Benchmarks

6.1 Inference Speed Comparison

The most dramatic advantage of ACE-Step is its inference speed:

| Model | RTF (RTX 4090) | 4-min Song Gen Time | Notes |
| --- | --- | --- | --- |
| ACE-Step v1 | 15.63x | ~20s (A100) | 15.63x real-time |
| ACE-Step v1.5 | - | Under 2s (A100) | DMD2 distillation |
| DiffRhythm | 10.03x | ~30s | - |
| Yue (LLM-based) | 0.083x | ~48 min | Slower than real-time |

ACE-Step v1 is approximately 188x faster than the LLM-based model Yue, and v1.5 is over 10x faster than v1 through distillation.
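The RTF figures relate audio length and generation time directly; a quick check of the quoted numbers (arithmetic only, not a benchmark):

```python
def rtf(audio_seconds, generation_seconds):
    """Real-Time Factor: seconds of audio produced per second of compute."""
    return audio_seconds / generation_seconds

yue = rtf(240, 48 * 60)      # ~0.083x: slower than real time
speedup = 15.63 / 0.083      # v1 vs Yue: ~188x, as stated above
```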

v1.5 Performance by Hardware:

| Hardware | Full Song Gen Time | VRAM Required |
| --- | --- | --- |
| NVIDIA A100 | Under 2 seconds | - |
| RTX 3090 | Under 10 seconds | Under 4GB |
| RTX 4090 | Under 5 seconds (est.) | Under 4GB |
| AMD Radeon | Supported (official AMD partnership) | Under 4GB |
| Apple Silicon (Mac) | Supported | Under 4GB |

6.2 Music Quality Evaluation

ACE-Step achieved competitive results across various automatic evaluation metrics and human evaluations:

Automatic Evaluation (v1):

| Metric | ACE-Step v1 | Best Comparison Model | Description |
| --- | --- | --- | --- |
| DCAE FAD | 0.0224 | DiffRhythm VAE: 0.0059 | Waveform reconstruction quality |
| Style Alignment | Top tier | Udio v1 (best) | CLAP + Mulan based |
| Lyric Alignment | Strong | Hailuo (best) | Whisper Forced Alignment |
| SongEval Coherence | Competitive | Suno v3 (best) | Musical coherence |
| SongEval Memorability | Strong | - | Memorable melody |

Automatic Evaluation (v1.5):

| Metric | ACE-Step v1.5 | Suno v5 | MinMax 2.0 |
| --- | --- | --- | --- |
| AudioBox CU | 8.09 (best) | - | - |
| AudioBox PQ | 8.35 (best) | - | - |
| SongEval Coherence | 4.72 (tied best) | - | - |
| Style Alignment | 39.1 | 46.8 | 43.1 |
| Lyric Alignment | 26.3 | 34.2 | 29.5 |

v1.5 achieved the highest scores in AudioBox CU (8.09) and PQ (8.35), and tied for best in SongEval Coherence (4.72). While it trails Suno v5 on Style and Lyric Alignment, it is the strongest of the open-source models on those metrics, and it ranks between Suno v4.5 and v5 in Music Arena human evaluations.

Human Evaluation (v1, 32 participants):

| Evaluation Item | Score (/100) |
| --- | --- |
| Emotional Expression | ~85 |
| Innovativeness | ~82 |
| Sound Quality | ~80 |
| Musicality | ~78 |

7. Comparative Analysis of AI Music Generation Models

7.1 Major Model Overview

A systematic comparison of major models in the current AI music generation field:

[AI Music Generation Model Classification]

+-------------------------------------------------------------+
|                    Open-Source Models                         |
+--------------+--------------+--------------+-----------------+
|  ACE-Step    |  MusicGen    |  Stable Audio|  Riffusion      |
|  (v1, v1.5)  |  (Meta)      |  Open        |                 |
|              |              |  (Stability) |                 |
|  Diffusion   |  Autoregress |  Latent      |  Image Diffusion|
|  + DCAE/VAE  |  + EnCodec   |  Diffusion   |  -> Spectrogram |
|  3.5B params |  1.5B/3.3B   |  1.1B        |  ~1B            |
+--------------+--------------+--------------+-----------------+
|                    Commercial Models                         |
+--------------+--------------+--------------+-----------------+
|  Suno        |  Udio        |  ElevenLabs  |  Google MusicLM |
|  (v3->v5)    |  (v1->v2)    |  Eleven Music|                 |
|              |              |              |                 |
|  Full song   |  Segment-by  |  Licensed    |  Experimental/  |
|  generation  |  -segment    |  commercial  |  Instrumental   |
|  pipeline    |  composition |  use OK      |  focus          |
+--------------+--------------+--------------+-----------------+

7.2 Detailed Comparison Table

| Model | Developer | Parameters | Generation Method | Audio Representation | Max Length | Lyric Support | Open Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ACE-Step v1 | ACE Studio + StepFun | 3.5B | Flow Matching + DiT | Mel DCAE latent | 4 min | Yes (multilingual) | Yes |
| ACE-Step v1.5 | ACE Studio + StepFun | ~3.7B (LM+DiT) | Hybrid LM + DiT + DMD2 | 1D VAE latent | 10+ min | Yes (50+ languages) | Yes |
| MusicGen | Meta | 1.5B/3.3B | Autoregressive | EnCodec tokens | ~30s | No | Yes |
| Stable Audio Open | Stability AI | 1.1B | Latent Diffusion | VAE latent | 47s | No | Yes |
| Riffusion | Riffusion | ~1B | Image Diffusion | Spectrogram | A few seconds | No | Yes |
| JEN-1 | Jen Music | - | AR + Non-AR hybrid | Raw waveform | ~30s | No | Partial |
| Suno | Suno Inc. | Undisclosed | Undisclosed | Undisclosed | 4+ min | Yes | No |
| Udio | Udio | Undisclosed | Undisclosed | Undisclosed | Segment-based | Yes | No |
| MusicLM | Google | Undisclosed | AR + SoundStream | SoundStream tokens | ~30s | No | No |

7.3 MusicGen (Meta)

Meta's MusicGen is a pioneer in open-source music generation models. It is an autoregressive transformer model based on the EnCodec tokenizer.

[MusicGen Architecture]

Text prompt -> T5 Encoder -> Conditioning
                                    |
                                    v
                    +--------------------------+
                    |  Autoregressive Decoder   |
                    |  (Transformer LM)         |
                    |                          |
                    |  EnCodec 4 codebooks      |
                    |  32kHz, 50Hz sampling     |
                    |                          |
                    |  Simultaneous multi-      |
                    |  codebook generation      |
                    |  via delay pattern        |
                    +----------+---------------+
                               |
                               v
                    +--------------------------+
                    |  EnCodec Decoder          |
                    |  tokens -> waveform       |
                    +--------------------------+

Strengths: Stable instrumental generation, melody conditioning support
Limitations: No lyric support, ~30-second limit, relatively slow autoregressive generation

7.4 Suno vs ACE-Step

Suno is currently the most commercially successful AI music generation platform:

| Comparison Item | ACE-Step v1.5 | Suno v5 |
| --- | --- | --- |
| Accessibility | Local install (open source) | Cloud service |
| VRAM Required | Under 4GB | N/A (server) |
| Song Structure | LM-based Blueprint | End-to-end |
| Customization | LoRA training possible | Prompts only |
| Style Alignment | 39.1 | 46.8 |
| Lyric Alignment | 26.3 | 34.2 |
| Price | Free (local) | Subscription |
| Commercial Use | License check needed | Paid plans |

While Suno v5 still leads in absolute quality, ACE-Step v1.5 is a powerful alternative in terms of local deployment, customization, and cost efficiency.

7.5 Stable Audio Open

Stability AI's Stable Audio Open is a latent diffusion-based open-source model:

| Comparison Item | ACE-Step v1.5 | Stable Audio Open |
| --- | --- | --- |
| Max Length | 10+ min | 47 seconds |
| Lyric Support | Yes (50+ languages) | No |
| Vocal Generation | Yes (including Voice Cloning) | No (instrumental only) |
| Parameters | ~3.7B | 1.1B |
| Audio Quality | 48kHz stereo | 44.1kHz stereo |

ACE-Step shows superiority in nearly all aspects including length, lyrics, and vocals.


8. Core Foundational Technologies for Music Generation

An in-depth analysis of the essential foundational technologies needed to understand AI music generation.

8.1 Audio Tokenization: Converting Audio to Discrete Tokens

The first challenge for music generation models is transforming continuous audio signals into a form the model can process. There are broadly three approaches:

[Audio Representation Method Comparison]

1. Spectrogram-Based
   +--------------------------------------------+
   | waveform -> STFT -> mel-spectrogram -> image |
   |                                            |
   | Pros: Easy visualization, can leverage      |
   |       image models                          |
   | Cons: Phase information loss, vocoder needed|
   | Used by: Riffusion, ACE-Step v1 (DCAE input)|
   +--------------------------------------------+

2. Neural Audio Codec (Discrete Tokens)
   +--------------------------------------------+
   | waveform -> Encoder -> RVQ -> discrete tokens|
   | tokens -> Decoder -> waveform               |
   |                                            |
   | Pros: End-to-end, high compression ratio    |
   | Cons: Weak long-range dependency            |
   |       (acoustic tokens)                     |
   | Used by: MusicGen (EnCodec), MusicLM        |
   |          (SoundStream)                      |
   +--------------------------------------------+

3. Continuous Latent (VAE)
   +--------------------------------------------+
   | waveform -> VAE Encoder -> continuous latent |
   | latent -> VAE Decoder -> waveform           |
   |                                            |
   | Pros: Natural integration with Diffusion    |
   | Cons: Compression ratio vs quality tradeoff |
   | Used by: ACE-Step v1.5 (1D VAE),           |
   |          Stable Audio                       |
   +--------------------------------------------+
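The spectrogram-based path above can be sketched with plain numpy. This is a minimal STFT magnitude computation (mel filtering and vocoding omitted), with illustrative `n_fft` and `hop` values, not any particular model's actual front end:

```python
import numpy as np

def stft_magnitude(wave, n_fft=1024, hop=256):
    """Naive STFT: frame the waveform, apply a Hann window, FFT, keep magnitudes."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only non-negative frequencies: n_fft//2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=-1))

# 1 second of a 440 Hz sine at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (59, 513)
```

The energy concentrates in the frequency bin nearest 440 Hz (bin ≈ 440 · n_fft / sr ≈ 28); a real pipeline would then map these linear bins onto a mel filterbank.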

8.2 EnCodec and SoundStream

EnCodec (Meta) and SoundStream (Google) are representative Neural Audio Codec models:

[EnCodec / SoundStream Architecture]

Input: raw waveform (24kHz/48kHz)
  |
  v
+---------------------------------+
|  Encoder (1D Conv + LSTM)       |
|  -> continuous embeddings        |
+------------+--------------------+
             |
             v
+---------------------------------+
|  Residual Vector Quantization   |
|  (RVQ)                          |
|                                 |
|  Codebook 1 -> Most important   |
|                 information     |
|  Codebook 2 -> Residual         |
|  Codebook 3 -> Finer residual   |
|  ...                            |
|  Codebook N -> Final residual   |
|                                 |
|  Each codebook: 1024 entries    |
|  sampling rate: 50Hz/75Hz       |
+------------+--------------------+
             |
             v
+---------------------------------+
|  Decoder (1D TransposeConv)     |
|  -> reconstructed waveform      |
+---------------------------------+

Training: Reconstruction Loss + Adversarial Loss
          (Multi-scale discriminator)
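The RVQ stage above is easy to demonstrate in numpy: each codebook quantizes the residual left over by the previous ones, so reconstruction error shrinks as codebooks are added. This is an illustrative sketch with random, untrained codebooks, not EnCodec's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual VQ: each codebook quantizes what the previous ones missed."""
    residual = x.copy()
    indices, quantized = [], np.zeros_like(x)
    for cb in codebooks:                      # cb: (num_entries, dim)
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)                # nearest entry per vector
        indices.append(idx)
        quantized += cb[idx]
        residual = x - quantized              # next codebook sees the error
    return indices, quantized

dim, entries = 8, 1024
# Later codebooks are scaled down: residuals get progressively smaller
codebooks = [rng.normal(size=(entries, dim)) * (0.5 ** i) for i in range(4)]
x = rng.normal(size=(16, dim))
idx, q = rvq_encode(x, codebooks)
print(np.linalg.norm(x - q))  # small residual error after 4 codebooks
```

The list of per-codebook index arrays is what gets transmitted or fed to a language model; the decoder side only needs the codebooks to rebuild `q`.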

EnCodec vs SoundStream:

| Item | EnCodec | SoundStream |
| --- | --- | --- |
| Developer | Meta | Google |
| Key Innovation | Multi-scale discriminator, loss balancing | RVQ introduction |
| Sample Rate | 24kHz/48kHz | 24kHz |
| Bitrate | 1.5~24 kbps | 3~18 kbps |
| Used In | MusicGen, AudioGen | AudioLM, MusicLM |
| Open Source | Yes | No |

8.3 Diffusion for Audio

Applying Diffusion models to audio builds on their success in the image domain:

[Audio Diffusion Training]

Forward Process (Adding Noise):
  x_0 (original audio latent)
  -> x_1 -> x_2 -> ... -> x_T (pure Gaussian noise)

  x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1-alpha_bar_t) * epsilon,  epsilon ~ N(0,I)

Reverse Process (Denoising, training target):
  x_T (noise) -> x_{T-1} -> ... -> x_0 (generated audio latent)

  p_theta(x_{t-1}|x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma^2 I)

Loss: L = E_{t,x_0,epsilon} [||epsilon - epsilon_theta(x_t, t, c)||^2]
       (c = conditioning: text, melody, etc.)

ACE-Step v1 uses Flow Matching instead of standard Diffusion: its straight probability paths converge in fewer steps and train more stably. v1.5 adds DMD2 distillation on top to achieve high-quality generation in only 4-8 steps.
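The closed-form forward process above can be checked numerically. The sketch below applies the same x_t formula at an early and a late timestep (the arrays are stand-ins for audio latents, not real model tensors):

```python
import numpy as np

rng = np.random.default_rng(42)

def forward_noise(x0, alpha_bar_t, eps):
    """Closed-form forward process: jump from x_0 to x_t in one step."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

x0 = rng.normal(size=(4, 64))   # stand-in for an audio latent
eps = rng.normal(size=x0.shape)

x_early = forward_noise(x0, alpha_bar_t=0.99, eps=eps)  # mostly signal
x_late  = forward_noise(x0, alpha_bar_t=0.01, eps=eps)  # mostly noise

# The training loss compares eps against the network's prediction:
#   L = ||eps - eps_theta(x_t, t, c)||^2
print(np.linalg.norm(x_early - x0) < np.linalg.norm(x_late - x0))  # True
```

As alpha_bar_t falls toward 0, x_t drifts from the clean latent toward pure Gaussian noise, which is exactly the trajectory the reverse process learns to undo.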

8.4 Classifier-Free Guidance (CFG)

CFG, a core technique in all conditional generation models, is also used in ACE-Step:

[CFG Application]

epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)

Where:
- epsilon_cond: prediction with conditions (text, lyric, speaker)
- epsilon_uncond: prediction without conditions (trained via dropout)
- w: guidance scale (higher = more condition adherence, less diversity)

ACE-Step drops the text/lyric conditions 15% of the time and the speaker condition 50% of the time during training; these dropped-condition passes provide the unconditional predictions that CFG requires.
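The guidance formula above is a one-liner; the sketch below uses toy prediction vectors to show how the guidance scale extrapolates past the conditional prediction:

```python
import numpy as np

def cfg_guidance(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate toward the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_cond = np.array([1.0, 2.0])    # prediction with text/lyric conditions
eps_uncond = np.array([0.0, 0.0])  # prediction with conditions dropped

print(cfg_guidance(eps_cond, eps_uncond, w=1.0))  # [1. 2.] -- equals eps_cond
print(cfg_guidance(eps_cond, eps_uncond, w=7.0))  # [ 7. 14.] -- overshoots
```

At w=1 guidance is a no-op; larger w pushes the sample harder toward the condition at the cost of diversity, which is why `guidance_scale` is a user-facing knob in the generation API.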

9. Practical Usage Guide

9.1 ACE-Step v1.5 Local Installation

ACE-Step v1.5 offers a remarkably simple installation process:

# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone repository and install dependencies
git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync

# 3. Launch Gradio UI (web interface)
uv run acestep
# -> Access at http://localhost:7860

# 4. Or launch REST API server
uv run acestep-api
# -> Use API at http://localhost:8001

# 5. Environment configuration (optional)
cp .env.example .env
# Customize model paths, ports, GPU settings, etc. in .env file

Supported Hardware:

  • NVIDIA GPU (CUDA): RTX 20xx or higher recommended
  • AMD GPU (ROCm): Optimized through official AMD partnership
  • Intel GPU: Supported
  • Apple Silicon (Mac): MPS backend supported

Models are automatically downloaded on first run and operate with under 4GB of VRAM.

9.2 Basic Text-to-Music Usage

# Music generation example via API (conceptual code)
import requests

# Basic text-to-music generation
response = requests.post("http://localhost:8001/generate", json={
    "prompt": "Bright and cheerful K-pop dance track, synth bass and electronic beats, "
              "128 BPM, female vocal, C major",
    "lyrics": """
[Verse 1]
Shining like stars tonight
Let's dance together
In this moment where music flows
We won't stop

[Chorus]
La la la shining night
La la la time together
May this moment last forever
""",
    "duration": 180,          # 3 minutes
    "num_inference_steps": 8,  # DMD2 distilled
    "guidance_scale": 7.0,
    "seed": 42
})

# Save output audio
with open("output.wav", "wb") as f:
    f.write(response.content)

9.3 Prompt Writing Guide

Effective prompt writing directly impacts generation quality:

[Effective Prompt Structure]

1. Genre/Style      : "indie folk ballad", "aggressive metal", "lo-fi hip-hop"
2. Instrumentation  : "acoustic guitar, soft piano, light percussion"
3. Mood/Emotion     : "melancholic", "uplifting", "dreamy"
4. Tempo (BPM)      : "slow tempo 70 BPM", "fast 140 BPM"
5. Key              : "minor key", "E flat major"
6. Vocal Character   : "female vocal, breathy", "male baritone, powerful"
7. Production Style  : "lo-fi with vinyl crackle", "clean studio production"

[Good Prompt Example]
"Dreamy shoegaze rock with layers of reverbed electric guitars,
 ethereal female vocal, 90 BPM, D minor, lo-fi production
 with tape saturation and subtle noise"

[Lyrics Format]
- Use [Verse], [Chorus], [Bridge], [Intro], [Outro] tags
- Clearly separate each section
- One phrase per line
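The seven-field structure above can be kept consistent with a small helper. This is a hypothetical convenience function, not part of ACE-Step's API (the model accepts any free-form string); it merely orders the recommended fields:

```python
def build_prompt(genre, instruments, mood, bpm, key, vocal=None, production=None):
    """Assemble a comma-separated prompt following the recommended field order.

    Hypothetical helper: ACE-Step takes a free-form string; this just keeps
    genre, instrumentation, mood, tempo, key, vocal, and production consistent.
    """
    parts = [genre, instruments, mood, f"{bpm} BPM", key]
    if vocal:
        parts.append(vocal)
    if production:
        parts.append(production)
    return ", ".join(parts)

prompt = build_prompt(
    genre="dreamy shoegaze rock",
    instruments="layers of reverbed electric guitars",
    mood="ethereal",
    bpm=90,
    key="D minor",
    vocal="ethereal female vocal",
    production="lo-fi production with tape saturation",
)
print(prompt)
```

The resulting string matches the "good prompt" example above and can be passed as the `prompt` field of the generation request.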

9.4 LoRA Personalization Training

One of ACE-Step v1.5's powerful features is LoRA support that allows training your own style with a small number of songs:

[LoRA Training Process]

1. Data Preparation
   +-- Minimum 3-5 reference music tracks
   +-- Text prompt (caption) for each track
   +-- (Optional) Lyrics files

2. Access LoRA Training tab in Gradio UI
   +-- Upload audio files
   +-- Enter captions
   +-- Configure training parameters
   |   +-- Learning Rate: ~1e-4
   |   +-- Epochs: 50-200
   |   +-- LoRA Rank: 8-64
   +-- Start training

3. Apply trained LoRA
   +-- Load LoRA weights during generation
   +-- Adjust LoRA Scale (0.0~1.0)
   +-- Combine with existing prompts to apply style

This allows you to reflect a specific artist's production style, nuances of a specific genre, or your own composition style in the model.
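The LoRA mechanism itself is compact enough to sketch: a frozen weight matrix W gets a trainable low-rank update B·A, scaled by alpha/rank. The numpy sketch below is illustrative (toy dimensions, no actual training loop), not ACE-Step's trainer:

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_forward(x, W, A, B, alpha, rank):
    """LoRA: frozen weight W plus a low-rank update scaled by alpha/rank."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

d_in, d_out, rank, alpha = 64, 64, 8, 16
W = rng.normal(size=(d_out, d_in))         # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection, zero init

x = rng.normal(size=(4, d_in))
# With B initialized to zero, the adapted layer equals the base model exactly
print(np.allclose(lora_forward(x, W, A, B, alpha, rank), x @ W.T))  # True
```

Because only A and B are trained, a rank-8 adapter on this 64x64 layer stores 2·8·64 = 1024 trainable values instead of 4096, which is why a handful of reference tracks suffices; the LoRA Scale slider in the UI corresponds to scaling the update term.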

9.5 ComfyUI Integration

ACE-Step 1.5 also supports integration with ComfyUI, enabling visual configuration of music generation in a node-based workflow:

[ComfyUI ACE-Step Workflow Example]

+----------+     +--------------+     +--------------+
|  Text    |---->|  ACE-Step    |---->|  Audio       |
|  Prompt  |     |  Generator   |     |  Preview     |
+----------+     |              |     +--------------+
                 |              |
+----------+     |              |     +--------------+
|  Lyrics  |---->|              |---->|  Save WAV    |
|  Input   |     |              |     |  Node        |
+----------+     +--------------+     +--------------+

10. Copyright and Ethical Considerations

10.1 Legal Landscape of AI-Generated Music

Copyright issues in AI music generation are currently one of the hottest legal topics:

Key Rulings and Trends:

| Date | Event | Impact |
| --- | --- | --- |
| Jan 2025 | US Copyright Office: No copyright for 100% AI-generated content | Public domain ruling |
| Mar 2025 | US Appeals Court: Confirms denial of copyright for AI works | Legal precedent established |
| Aug 2025 | ElevenLabs Eleven Music launch | First legally licensed commercial AI music |
| Sep 2025 | Warner Music + Suno settlement | Suno agrees to license-based model transition |
| Nov 2025 | UMG + Udio settlement | Similar license transition agreement |
| Jan 2026 | UMG vs Anthropic ($3B) | Copyright lawsuit over 20,000+ songs in training data |

10.2 "Meaningful Human Authorship" Principle

The US Copyright Office released guidelines stating that copyright may be recognized for AI-assisted works when "meaningful human authorship" is present:

[Copyright Recognition Spectrum for AI Music]

Fully AI-Generated                              Fully Human-Created
     <---------------------------------------->

     |                  |                  |
  No copyright       Judgment needed     Copyright recognized
                        |
                   Human actively:
                   - Modifying melody
                   - Writing lyrics
                   - Arranging structure
                   - Selecting/editing AI output
                   -> "Meaningful Human Authorship"
                   -> Copyright may be recognized

10.3 Ethical Considerations for Open-Source Models

Open-source models like ACE-Step require additional ethical considerations:

  1. Training Data Sources: The copyright status of ACE-Step's training data of 1.8M songs (v1) / 27M samples (v1.5) is not clearly disclosed in the papers. Users should be aware of legal risks when commercially using generated music.

  2. Voice Cloning Misuse: The voice cloning capability through the Speaker Encoder could be misused to replicate specific artists' voices without authorization. Cloning without consent from the reference vocal rights holder is both ethically and legally problematic.

  3. Deepfake Music: Deepfake music where AI generates "new songs" by specific artists has already emerged as a social issue. ACE-Step's Cover Generation feature also requires responsible use in this context.

  4. Impact on the Music Industry: The democratization of AI music generation technology can directly affect the livelihoods of professional musicians, composers, and producers. A balance between technological advancement and creator protection is needed.

10.4 Guidelines for Responsible Use

[Responsible Use Principles for AI Music Generation]

1. Transparency: Clearly state when music is AI-generated/assisted
2. Consent: Obtain original artist consent for Voice Cloning
3. Attribution: Clearly distinguish AI tool contributions from human contributions
4. Commercial Use: Comply with relevant regulations and license conditions
5. Education: Use AI tools as supplementary tools for music education/learning
6. Fair Use: Distinguish between style imitation and copying of existing music

11. Key Paper References

A compilation of key papers on ACE-Step and the AI music generation field:

11.1 ACE-Step Papers

| Paper | Authors | Year | Key Contribution |
| --- | --- | --- | --- |
| ACE-Step: A Step Towards Music Generation Foundation Model | Gong et al. | 2025 | DCAE + Linear DiT + REPA |
| ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation | ACE-Step Team | 2026 | Hybrid LM+DiT, DMD2, RL alignment |

11.2 Foundational Technologies

| Paper | Key Contribution | Usage |
| --- | --- | --- |
| Deep Compression Autoencoder (Chen et al., 2024) | High compression ratio AutoEncoder | ACE-Step DCAE |
| MERT (Li et al., 2024) | Self-supervised music representation learning | ACE-Step REPA |
| mHuBERT-147 (Lee et al., 2024) | Multilingual speech representation | ACE-Step REPA |
| Flow Matching (Lipman et al., 2023) | ODE-based generative model | ACE-Step generation process |
| DMD2 (Yin et al., 2024) | Distribution Matching Distillation | ACE-Step v1.5 speedup |

11.3 Competing Model Papers

| Paper | Author/Org | Year | Key Contribution |
| --- | --- | --- | --- |
| MusicGen: Simple and Controllable Music Generation | Copet et al. (Meta) | 2023 | EnCodec + AR Transformer |
| MusicLM: Generating Music from Text | Agostinelli et al. (Google) | 2023 | SoundStream + AR |
| Stable Audio Open | Evans et al. (Stability AI) | 2024 | Latent Diffusion for Audio |
| Riffusion | Forsgren & Martiros | 2022 | Spectrogram Image Diffusion |
| JEN-1: Text-Guided Universal Music Generation | Li et al. | 2023 | AR + Non-AR hybrid |
| DiffRhythm | - | 2025 | 1D VAE + Flow DiT |
| SongGen | - | 2025 | Lyric encoding architecture |

11.4 Audio Tokenization

| Paper | Author/Org | Year | Key Contribution |
| --- | --- | --- | --- |
| EnCodec: High Fidelity Neural Audio Compression | Defossez et al. (Meta) | 2022 | RVQ + Multi-scale Disc |
| SoundStream: An End-to-End Neural Audio Codec | Zeghidour et al. (Google) | 2021 | RVQ introduction |
| WavTokenizer | Peng et al. | 2025 | 40/75 tokens/sec SOTA |
| AudioLM: A Language Modeling Approach to Audio | Borsos et al. (Google) | 2023 | Semantic + Acoustic tokens |

12. Future Outlook

12.1 Technology Development Direction

AI music generation technology is expected to evolve in the following directions:

[AI Music Generation Technology Development Roadmap]

2026 Current                 2027 Expected               2028+ Long-term
    |                          |                          |
    v                          v                          v
+--------------+        +--------------+        +------------------+
| Current State |        | Short-term   |        | Long-term Vision |
|              |        | Development  |        |                  |
| - 4-min song |   ->   | - Album-level|   ->   | - Real-time      |
|   generation |        |   consistent |        |   interactive    |
| - Text cond. |        |   generation |        |   music gen      |
| - LoRA       |        | - Multi-track|        | - Emotion-aware  |
|   personalize|        |   simultaneous|       |   adaptive music |
| - Voice Clone|        |   generation |        | - Video-music    |
| - 50+ langs  |        | - Real-time  |        |   synchronization|
|              |        |   streaming  |        | - Fully automated|
|              |        |   generation |        |   production     |
+--------------+        +--------------+        +------------------+

12.2 ACE-Step's Foundation Model Vision

The ultimate vision of the ACE-Step project is to become the "Stable Diffusion of Music AI." This means not just a simple text-to-music pipeline, but a general-purpose Foundation Model upon which various downstream tasks can be built:

[ACE-Step Foundation Model Ecosystem Vision]

                    +-------------------------+
                    |  ACE-Step Foundation     |
                    |  Model (Base)            |
                    +----------+--------------+
                               |
          +--------------------+--------------------+
          |                    |                     |
          v                    v                     v
  +--------------+   +--------------+   +------------------+
  |  Text-to-    |   |  Audio       |   |  Music           |
  |  Music       |   |  Editing     |   |  Understanding   |
  |  Generation  |   |  & Remixing  |   |  & Analysis      |
  +--------------+   +--------------+   +------------------+
          |                    |                     |
          v                    v                     v
  +--------------+   +--------------+   +------------------+
  |  LoRA        |   |  Voice       |   |  Stem            |
  |  Style       |   |  Cloning     |   |  Separation      |
  |  Transfer    |   |  & TTS       |   |  & Transcription |
  +--------------+   +--------------+   +------------------+

When this vision is realized, diverse users including music producers, video creators, game developers, and educators will be able to generate and edit commercial-quality music in local environments.

12.3 Industry Impact Outlook

  1. Democratization of Music Production: The ability to generate commercial-quality music with 4GB VRAM means the barrier to entry for music production has been dramatically lowered.

  2. Hybrid Workflows: AI-Human collaborative workflows where AI generates drafts and humans refine them will become standard. ACE-Step's Repainting, Completion, and Track Extraction features are optimized for such workflows.

  3. Personalized Music Experiences: Personalization training through LoRA enables music generation tailored to each user's preferences. This will lead to dynamically generated custom music in games, meditation apps, fitness apps, and more.

  4. Legal Framework Establishment: Through the lawsuits and settlements of 2025-2026, a clear legal framework for AI music generation will gradually be formed. ElevenLabs' license-based approach could serve as one model.


13. Conclusion

ACE-Step is a landmark model that has dramatically narrowed the gap between open-source and commercial models in AI music generation. The v1 DCAE + Linear DiT + REPA architecture achieved 188x faster inference than LLM-based models at 3.5B parameters, and the v1.5 Hybrid LM + DiT + DMD2 architecture realized remarkable efficiency of under 2 seconds on A100 and under 4GB VRAM.

Summarizing the key technical contributions:

  1. DCAE Application to Music Domain: Achieved high-quality reconstruction while maintaining 10.77Hz temporal resolution with 8x compression
  2. REPA Training: Fast convergence and high fidelity through musical/linguistic semantic alignment via MERT + mHuBERT
  3. Hybrid LM + DiT: Support for songs longer than 10 minutes through separation of structural planning and acoustic rendering
  4. DMD2 Distillation: Compressed 50 steps to 4-8 steps, 200x speed improvement
  5. Omni-Task Framework: Single model performs diverse tasks including text-to-music, cover, repainting, and track separation

Of course, gaps still exist with top-tier commercial models like Suno v5 in Style/Lyric Alignment. However, the values ACE-Step offers -- open-source, local deployment, and customizability -- are unique advantages that commercial models cannot provide. ACE-Step's journey toward music AI's "Stable Diffusion moment" has only just begun.


References