- 1. Introduction: A Turning Point in AI Music Generation
- 2. ACE-Step v1: In-Depth Architecture Analysis
- 3. REPA: Semantic Representation Alignment Training
- 4. ACE-Step v1 Training Details
- 5. ACE-Step v1.5: Hybrid LM + DiT Evolution
- 6. Performance Evaluation and Benchmarks
- 7. Comparative Analysis of AI Music Generation Models
- 8. Core Foundational Technologies for Music Generation
- 9. Practical Usage Guide
- 10. Ethical Considerations and Legal Issues
- 11. Key Paper References
- 12. Future Outlook
- 13. Conclusion
- References
1. Introduction: A Turning Point in AI Music Generation
The field of AI Music Generation has undergone explosive progress from 2024 to 2025. While Meta's MusicGen, Google's MusicLM, and commercial services like Suno and Udio demonstrated the possibilities of AI composition to the public, few open-source models had achieved quality rivaling commercial models.
In May 2025, the release of ACE-Step, jointly developed by ACE Studio and StepFun, changed the landscape. ACE-Step is a Foundation Model that generates up to 4 minutes of high-quality music from text prompts and lyrics in approximately 20 seconds, running at more than 15x real-time and delivering superior musical coherence at a scale of 3.5B parameters. In January 2026, the follow-up version ACE-Step v1.5 was released, delivering commercial-model-level quality in local environments at remarkable speeds: under 2 seconds on an A100 and under 10 seconds on an RTX 3090.
[AI Music Generation Model Development Timeline]
2023 2024 2025 2026
| | | |
v v v v
+----------+ +--------------+ +---------------+ +------------------+
| MusicGen | | Stable Audio | | ACE-Step v1 | | ACE-Step v1.5 |
| MusicLM | | Suno v3 | | (3.5B, DCAE | | (Hybrid LM+DiT, |
| AudioLDM | | Udio v1 | | + Linear DiT)| | DMD2, under 4GB) |
| Riffusion| | JEN-1 | | DiffRhythm | | Suno v5 |
+----------+ +--------------+ +---------------+ +------------------+
Key transitions:
2023 - Autoregressive, spectrogram-based generation
2024 - Commercialization: text-to-song, vocal + BGM, multilingual lyrics
2025 - Open-source leap: Diffusion + DCAE, Flow Matching, REPA training
2026 - Local deployment era: 4-8 step generation, LoRA personalization, 50+ language support
This article provides an in-depth analysis of ACE-Step's architecture based on the papers, covering the evolution from v1 to v1.5, competitive model comparisons, core foundational technologies, and practical usage guides.
2. ACE-Step v1: In-Depth Architecture Analysis
ACE-Step v1 (arXiv:2506.00045) was designed to overcome the fundamental limitations of existing music generation models. LLM-based models excel at lyric alignment but suffer from slow inference and structural artifacts, while Diffusion models enable fast synthesis but lack long-range structural coherence. ACE-Step adopts a Diffusion + DCAE + Linear Transformer architecture that integrates the strengths of both approaches.
2.1 Overall Architecture Overview
The core components of ACE-Step v1 are as follows:
[ACE-Step v1 Architecture]
+---------------------------------------------+
| Conditioning Encoders |
| |
| +----------+ +----------+ +--------------+ |
| | Text | | Lyric | | Speaker | |
| | Encoder | | Encoder | | Encoder | |
| |(mT5-base)| |(SongGen) | |(PLR-OSNet) | |
| | frozen | |trainable | | pre-trained | |
| | dim=768 | | | | dim=512 | |
| +----+-----+ +----+-----+ +------+-------+ |
+-------+------------+---------------+--------+
| | |
+------+-----+ |
| cross-attention |
v v
+-----------+ +----------------------------------------------+
| | | Linear Diffusion Transformer (DiT) |
| DCAE | | |
| Encoder |--->| +-------------------------------------+ |
| (f8c8) | | | 24 Transformer Blocks | |
| | | | - AdaLN-single (shared params) | |
| mel-spec | | | - Linear Attention | |
| to latent | | | - 1D Conv FeedForward | |
| ~10.77Hz | | | - Cross-Attention (text+lyric) | |
| | | | - REPA at layer 8 | |
+-----------+ | +-------------------------------------+ |
| |
+------------------+---------------------------+
|
v
+----------------------------------------------+
| DCAE Decoder |
| latent to mel-spectrogram to waveform |
| (Fish Audio Vocoder, 32kHz mono) |
+----------------------------------------------+
2.2 Deep Compression AutoEncoder (DCAE)
The first key innovation of ACE-Step is the application of Deep Compression AutoEncoder (DCAE), proposed by Sana (NVIDIA/MIT-HAN Lab), to the music domain. DCAE was originally designed for high-resolution image generation, achieving extremely high spatial compression ratios of 32x to 128x.
In ACE-Step, it takes mel-spectrograms as input and applies 8x compression (f8c8, channel=8):
[DCAE Compression Process]
Input: mel-spectrogram (44.1kHz/32kHz audio to mel conversion)
|
v
+---------------------------------------------+
| DCAE Encoder |
| - Residual Autoencoding |
| - Space-to-Channel Transform |
| - 8x temporal compression |
| |
| Output: latent space (~10.77Hz) |
| 4-minute music to ~2,584 latent tokens |
+---------------------------------------------+
|
v (Generated/transformed by DiT)
|
v
+---------------------------------------------+
| DCAE Decoder + Vocoder |
| - latent to mel-spectrogram reconstruction |
| - Fish Audio Universal Music Vocoder |
| - Output: 32kHz mono waveform |
+---------------------------------------------+
DCAE Training Details:
| Item | Details |
|---|---|
| Compression Config | f8c8 (8x compression, channel=8) |
| Temporal Resolution | ~10.77Hz in latent space |
| Training Hardware | 120 NVIDIA A100 GPUs |
| Training Steps | 140,000 steps |
| Global Batch Size | 480 (4 per GPU) |
| Training Duration | ~5 days |
| Discriminator | Patch-based, StyleGAN Disc2DRes, SwinDisc2D |
| Training Strategy | Phase 1: MSE only / Phase 2: frozen encoder + MSE + adversarial |
| Vocoder | Fish Audio universal music vocoder (32kHz mono) |
| Reconstruction FAD | 0.0224 |
The paper also experimented with 32x compression (f32), but it resulted in unacceptable quality degradation, leading to the adoption of 8x compression. This is because music audio is far more sensitive to temporal detail than images.
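The latent resolution figures above can be sanity-checked with simple arithmetic. The ~86.13 Hz pre-compression mel frame rate below is an assumption inferred from the stated ~10.77 Hz latent rate and the 8x temporal compression, not a value quoted from the paper:

```python
# Back-of-the-envelope check of the DCAE latent resolution.
MEL_FRAME_RATE_HZ = 86.13   # assumed mel frame rate before compression
TEMPORAL_COMPRESSION = 8    # the "f8" in f8c8

latent_rate_hz = MEL_FRAME_RATE_HZ / TEMPORAL_COMPRESSION   # ~10.77 Hz
tokens_4min = round(latent_rate_hz * 4 * 60)                # ~2,584 latent tokens

print(f"{latent_rate_hz:.2f} Hz, {tokens_4min} tokens")
```

This reproduces both numbers used throughout the article: a ~10.77 Hz latent stream and roughly 2,584 tokens for a 4-minute track.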
2.3 Conditioning Encoders: Multi-Condition Encoding
ACE-Step injects diverse conditioning information into the model through three specialized encoders:
2.3.1 Text Encoder (Style/Genre Prompt)
# Text Encoder: Google mT5-base (frozen)
# - Output dimension: 768
# - Max sequence length: 256 tokens
# - Multilingual support (100+ languages)
# - Kept frozen during training
# Prompt example:
prompt = "upbeat K-pop dance track with synth bass, 128 BPM, female vocal, major key"
The choice of mT5-base was driven by the necessity of multilingual support. Style prompts can be entered in various languages including English, Korean, Japanese, and Chinese.
2.3.2 Lyric Encoder (Lyrics Encoding)
[Lyric Encoder Processing Pipeline]
Raw lyrics input (Korean, English, Japanese, etc.)
|
v
Non-Roman scripts to Grapheme-to-Phoneme conversion to phoneme representation
|
v
XTTS VoiceBPE Tokenizer (multilingual support)
|
v
SongGen architecture-based Lyric Encoder (trainable)
|
v
Up to 4,096 tokens of lyric embeddings
The Lyric Encoder is based on the SongGen architecture and, unlike the Text Encoder, its parameters are updated during training. This is because lyric-music alignment is one of the most challenging tasks in music generation. Non-Roman scripts (Hangul, Chinese characters, Hiragana, etc.) are converted to phoneme representations through Grapheme-to-Phoneme (G2P) tools.
2.3.3 Speaker Encoder (Voice/Timbre Encoding)
# Speaker Encoder Configuration
# - Input: 10-second vocal segment with accompaniment removed (separated by demucs)
# - Architecture: PLR-OSNet (originally for face recognition, applied to vocal recognition)
# - Output dimension: 512
# - Training dropout: 50% (to prevent over-reliance on timbre)
# - Full song: average of embeddings from multiple segments
# Voice cloning scenario:
# 1. Input 10-second reference vocal segment
# 2. Separate accompaniment with demucs
# 3. Extract 512-dim embedding with Speaker Encoder
# 4. Inject embedding as condition to DiT during generation
The 50% dropout for the Speaker Encoder is an intentional design decision. By removing speaker information with 50% probability during training, the model is guided to focus sufficiently on musical structure and melody rather than excessively relying on timbre.
2.4 Linear Diffusion Transformer (DiT) Backbone
The core generation model of ACE-Step, the Linear Diffusion Transformer, consists of 24 blocks and uses linear attention instead of standard attention for efficient operation on long sequences.
[DiT Block Structure (x24)]
Input: noisy latent z_t + time embedding t
|
v
+---------------------------------+
| AdaLN-single |
| (Simplified Adaptive LayerNorm)|
| - Parameters shared across |
| all blocks |
| - Conditioned on time step t |
+------------+--------------------+
|
v
+---------------------------------+
| Linear Self-Attention |
| - O(n) complexity (vs O(n^2)) |
| - RoPE Position Encoding |
| - Up to 2,584 mel latent tokens|
+------------+--------------------+
|
v
+---------------------------------+
| Cross-Attention |
| - Text Encoder output (768-dim)|
| - Lyric Encoder output |
| - Speaker Encoder output(512-d)|
| - Concatenate and Attend |
+------------+--------------------+
|
v
+---------------------------------+
| 1D Convolutional FeedForward |
| - Adapted from 2D Conv to 1D |
| - Optimized for temporal audio |
| sequences |
+------------+--------------------+
|
v
Output: denoised prediction
(REPA semantic alignment extracted at Layer 8)
Key Architectural Decisions:
AdaLN-single: Shares Adaptive Layer Normalization parameters across all 24 blocks to maximize parameter efficiency. This technique, introduced by Sana, offers excellent performance efficiency relative to model size.
Linear Attention: Since music requires handling long sequences of up to 4 minutes, O(n) complexity linear attention was adopted instead of O(n^2) standard attention. This enables efficient processing of sequences up to 2,584 tokens.
RoPE (Rotary Position Embedding): Provides robust position information across various music lengths through relative position encoding.
1D Convolutional FeedForward: The original image-targeted 2D Conv was adapted to 1D for temporal audio sequences. This better captures the temporal continuity of audio.
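The O(n) property of linear attention comes from reordering the computation: instead of materializing an n x n attention matrix, a fixed-size summary of keys and values is accumulated once and applied per query. A minimal NumPy sketch (the elu+1 feature map is a common generic choice, not necessarily the one ACE-Step uses):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features strictly positive so the normalizer is valid
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # O(n): accumulate the (d x d_v) key-value summary once, apply per query
    Qf, Kf = feature_map(Q), feature_map(K)   # (n, d)
    kv = Kf.T @ V                             # (d, d_v) summary, size independent of n
    z = Kf.sum(axis=0)                        # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]      # (n, d_v)

rng = np.random.default_rng(0)
n, d = 2584, 64                               # ~4-minute latent sequence length
Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)
```

Each output row is a positively weighted average of value rows, so the result stays within the range of V while memory stays linear in sequence length.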
2.5 Flow Matching Generation Process
ACE-Step adopts Flow Matching instead of score-based diffusion. Flow Matching learns a straight path (linear probability path) from Gaussian noise to data distribution, enabling faster convergence and stable training.
[Flow Matching Training Process]
Time t ~ U[0, 1]
|
v
Noise z ~ N(0, I) Data x_0 (DCAE latent)
| |
+-------- Linear Interpolation --------+
z_t = (1-t)*z + t*x_0
|
v
+------------------+
| DiT(z_t, t, c) | <- conditioning c (text, lyric, speaker)
| |
| Prediction |
| target: |
| v = x_0 - z |
| (constant |
| velocity field)|
+--------+---------+
|
v
L_FM = MSE(v_predicted, v_target)
Inference:
z_0 ~ N(0, I) -> Solve ODE -> z_1 approx x_0 -> DCAE Decoder -> waveform
Loss Function:
L_Total = L_FM + lambda_SSL * L_SSL
Where:
- L_FM: Flow Matching loss (MSE)
- L_SSL: REPA Semantic Alignment loss
- lambda_SSL = 1.0 (during most of training)
-> mHuBERT component reduced to 0.01 (last 100K steps)
3. REPA: Semantic Representation Alignment Training
The second key innovation of ACE-Step is the REPA (Representation Alignment) technique. It directly leverages semantic representations from pre-trained Self-Supervised Learning (SSL) models in DiT training, achieving fast convergence and high semantic fidelity.
3.1 Roles of MERT and mHuBERT
[REPA Training Structure]
+-----------------------+
| DiT Layer 8 Output |
| (intermediate repr.)|
+-----------+-----------+
|
+-----------------+------------------+
| | |
v | v
+------------------+ | +------------------+
| MERT (frozen) | | | mHuBERT (frozen) |
| | | | |
| - Music repr. | | | - Multilingual |
| learning | | | speech repr. |
| - 1024xT_M dim | | | - 768xT_H dim |
| - 75Hz frame | | | - 50Hz frame |
| - Improves style/| | | - Improves lyric/|
| melody accuracy| | | pronunciation |
+--------+---------+ | | alignment |
| | +--------+---------+
v v v
+----------------------------------------------+
| L_SSL = avg(1 - cosine_sim(DiT_repr, SSL)) |
| |
| = 0.5 * L_MERT + 0.5 * L_mHuBERT |
+----------------------------------------------+
| SSL Model | Role | Dimension | Frame Rate | Contribution |
|---|---|---|---|---|
| MERT | Music understanding | 1024 x T_M | 75Hz | Style accuracy, melody coherence |
| mHuBERT-147 | Multilingual speech understanding | 768 x T_H | 50Hz | Lyric alignment, pronunciation naturalness |
MERT (Music Representation Transformer) is a music understanding model pre-trained with large-scale self-supervised learning, capturing high-level semantics such as musical style, melody, and harmony. mHuBERT-147 is a multilingual speech representation model supporting 147 languages, responsible for semantic alignment of lyrics and pronunciation.
By aligning representations from these two models with the DiT's 8th layer output, ACE-Step simultaneously learns musical semantics (MERT) and linguistic semantics (mHuBERT). This is particularly important for music generation with lyrics, as the synchronization (alignment) of melody and lyrics determines the naturalness of the music.
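The L_SSL formula in the diagram is just an averaged cosine distance. A minimal sketch, assuming both streams have already been projected and resampled to a common (frames, dim) shape (the real model uses a learned projection and frame-rate alignment not reproduced here):

```python
import numpy as np

def ssl_align_loss(dit_repr, ssl_repr, eps=1e-8):
    # average (1 - cosine similarity) over time frames
    num = np.sum(dit_repr * ssl_repr, axis=-1)
    den = (np.linalg.norm(dit_repr, axis=-1)
           * np.linalg.norm(ssl_repr, axis=-1) + eps)
    return float(np.mean(1.0 - num / den))

rng = np.random.default_rng(0)
h = rng.standard_normal((100, 64))                      # DiT layer-8 output (toy)
mert_like = h + 0.1 * rng.standard_normal((100, 64))    # well-aligned target
hubert_like = rng.standard_normal((100, 64))            # unrelated target

# equal weighting of the two SSL targets, as in the diagram above
l_ssl = 0.5 * ssl_align_loss(h, mert_like) + 0.5 * ssl_align_loss(h, hubert_like)
```

Identical representations give a loss of zero; orthogonal ones give a loss near one, so each term is bounded in [0, 2].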
3.2 Conditional Dropout Strategy
Dropout is applied to conditioning information during training to enhance model robustness:
| Condition | Dropout Rate | Purpose |
|---|---|---|
| Text prompt | 15% | Support for Classifier-Free Guidance (CFG) |
| Lyric (lyrics) | 15% | Support for instrumental generation without lyrics |
| Speaker (voice) | 50% | Prevent timbre over-reliance, focus on musical structure |
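The dropout scheme in the table can be sketched as independent per-condition coin flips, where a dropped condition is replaced by a null (unconditional) embedding. The dict keys and `None` sentinel are illustrative, not the real training code:

```python
import random

DROPOUT_RATES = {"text": 0.15, "lyric": 0.15, "speaker": 0.50}

def drop_conditions(conditions, rng):
    # each condition is independently nulled at its own rate, so the model
    # also learns the unconditional branches needed for CFG at inference
    return {
        name: None if rng.random() < DROPOUT_RATES[name] else emb
        for name, emb in conditions.items()
    }

rng = random.Random(0)
conds = {"text": "txt_emb", "lyric": "lyr_emb", "speaker": "spk_emb"}
dropped = [drop_conditions(conds, rng) for _ in range(10000)]
spk_rate = sum(d["speaker"] is None for d in dropped) / len(dropped)
txt_rate = sum(d["text"] is None for d in dropped) / len(dropped)
print(f"empirical speaker dropout: {spk_rate:.2f}, text dropout: {txt_rate:.2f}")
```

Over many batches the empirical rates converge to the configured 50% and 15%.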
4. ACE-Step v1 Training Details
4.1 Training Data
ACE-Step v1 was trained on a large-scale music dataset:
| Item | Details |
|---|---|
| Total Data | 1.8M unique tracks (~100,000 hours) |
| Languages | 19 languages (English majority) |
| Quality Filter | Audiobox aesthetics toolkit |
| Excluded | Low-quality recordings, live performances |
Automatic Annotation Pipeline:
[Data Annotation Pipeline]
Raw audio files
|
+-> Qwen-Omni model -> Style/genre caption generation
|
+-> Whisper large-v3 -> Lyric transcription
| +-> LSH-based IPA-to-database mapping for lyric refinement
|
+-> "All-in-one" music understanding model -> Song structure (intro, verse, chorus, etc.)
|
+-> BeatThis -> BPM extraction
|
+-> Essentia -> Key/Scale, style tag extraction
|
+-> Demucs -> Vocal/accompaniment separation (for Speaker Encoder training)
4.2 Training Configuration
Training was conducted in two stages: Pre-training + Fine-tuning:
| Stage | Data | Steps | Notes |
|---|---|---|---|
| Pre-training | Full 100K hours | 460,000 | Foundation training on full dataset |
| Fine-tuning | High-quality 20K hours | 240,000 | Curated high-quality subset |
Hyperparameters:
# Training Environment
Hardware: 15 nodes x 8 NVIDIA A100 (120 GPUs total)
Global Batch Size: 120 (1 per GPU)
Training Duration: ~264 hours (approximately 11 days)
# Optimizer
Optimizer: AdamW
Weight Decay: 1e-2
Betas: (0.8, 0.9)
Learning Rate: 1e-4
LR Schedule: Linear warm-up (4,000 steps)
Gradient Clipping: max norm 0.5
# REPA Weights
lambda_SSL: 1.0 (full training)
mHuBERT lambda: 0.01 (reduced in last 100K steps)
5. ACE-Step v1.5: Hybrid LM + DiT Evolution
ACE-Step v1.5 (arXiv:2602.00744), released in January 2026, fundamentally redesigned the v1 architecture. It introduces a Language Model as a structural planner and dramatically reduces inference steps through Distribution Matching Distillation, among other innovations.
5.1 Hybrid LM + DiT Architecture
[ACE-Step v1.5 Architecture]
User Input (Text prompt + Lyrics)
|
v
+----------------------------------------------------------+
| Composer Agent (Language Model, Qwen-based ~1.7B) |
| |
| Chain-of-Thought reasoning: |
| 1. Metadata generation (BPM, Key, Duration, Structure) |
| 2. Lyrics refinement and structuring |
| 3. Caption/style directive generation |
| 4. YAML-format Song Blueprint output |
| |
| +----------------------------------------+ |
| | bpm: 128 | |
| | key: "C major" | |
| | duration: 210 | |
| | structure: | |
| | - intro: 0-15s | |
| | - verse1: 15-45s | |
| | - chorus1: 45-75s | |
| | - verse2: 75-105s ... | |
| | style: "energetic K-pop with synth" | |
| +----------------------------------------+ |
+---------------------+------------------------------------+
| Song Blueprint
v
+----------------------------------------------------------+
| 1D VAE (Self-Learning Tokenizer) |
| - 48kHz stereo audio processing |
| - 64-dimensional latent space @ 25Hz |
| - 1920x compression ratio |
| - FSQ: 25Hz to 5Hz discrete codes (~64K codebook) |
| - "Source Latent" generation (LM-DiT bridging) |
+---------------------+------------------------------------+
|
v
+----------------------------------------------------------+
| Diffusion Transformer (DiT, ~2B parameters) |
| - Acoustic rendering with Source Latent + Blueprint |
| conditions |
| - DMD2 distillation: 50 steps to 4-8 steps |
| - 200x speedup (240-second track in ~1 second, A100) |
+----------------------------------------------------------+
The most significant change in v1.5 is the separation of structural planning from acoustic rendering. The Language Model first designs the overall blueprint of the music, and the DiT only generates the actual audio according to that blueprint. This makes it possible to maintain a consistent structure even for songs longer than 10 minutes.
5.2 Self-Learning Tokenizer
v1.5 uses a 1D VAE instead of v1's mel-spectrogram-based DCAE to directly process 48kHz stereo audio:
[v1 vs v1.5 Audio Processing Comparison]
ACE-Step v1:
Audio -> mel-spectrogram -> DCAE Encoder -> latent (10.77Hz)
latent -> DCAE Decoder -> mel -> Fish Audio Vocoder -> 32kHz mono
ACE-Step v1.5:
Audio (48kHz stereo) -> 1D VAE Encoder -> latent (25Hz, 64-dim)
latent -> FSQ -> 5Hz discrete codes ("Source Latent")
DiT -> latent -> 1D VAE Decoder -> 48kHz stereo
Improvements:
- 32kHz mono -> 48kHz stereo (improved audio quality)
- Elimination of mel-spectrogram intermediate stage (reduced information loss)
- Near-lossless quality maintained with 1920x compression ratio
The 1D VAE's Finite Scalar Quantization (FSQ) quantizes continuous 25Hz latent into 5Hz discrete codes. These discrete codes serve as Source Latent, bridging the Language Model and DiT. The codebook size is approximately 64K, and this tokenizer is trained simultaneously with the DiT through a self-learning approach.
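FSQ replaces a learned codebook with a fixed per-channel grid: each latent channel is bounded and rounded to a small number of levels, and the implicit codebook size is the product of the level counts. The level configuration below is an assumption chosen so the product matches the ~64K figure (8·8·8·5·5·5 = 64,000); the paper's actual configuration may differ:

```python
import numpy as np

LEVELS = np.array([8, 8, 8, 5, 5, 5])   # assumed per-channel level counts
HALF = (LEVELS - 1) / 2.0
OFFSET = np.where(LEVELS % 2 == 0, 0.5, 0.0)   # even level counts sit off-grid

def fsq_quantize(z):
    # bound each channel to [-1, 1], then snap it to its finite grid
    bounded = np.tanh(z) * HALF
    return (np.round(bounded + OFFSET) - OFFSET) / HALF

codebook_size = int(np.prod(LEVELS))
print(codebook_size)   # 64000

z = np.random.default_rng(0).standard_normal((4, 6))
q = fsq_quantize(z)
```

In training, the rounding is bypassed with a straight-through gradient; the quantized output is what serves as the discrete Source Latent.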
5.3 Distribution Matching Distillation (DMD2)
The key to v1.5's dramatic speed improvement is DMD2 (Distribution Matching Distillation):
[DMD2 Distillation Process]
Teacher Model (50-step DiT)
|
v Knowledge Distillation
Student Model (4-8 step DiT)
|
+-- Dynamic-shift Strategy: {1, 2, 3} step sampling
| -> Exposure to diverse denoising states to prevent overfitting
|
+-- Distribution Matching Loss
| -> Alignment of Teacher distribution with Student distribution
|
+-- Result: 200x speedup
- 50 steps to 4-8 steps
- 240-second music generated in ~1 second on A100
- Dramatic RTF (Real-Time Factor) improvement
5.4 Intrinsic Reinforcement Learning
v1.5 introduces reinforcement learning-based alignment to further improve generation quality:
[RL-Based Alignment Structure]
DiT Alignment:
+-- DiffusionNTF framework
+-- Attention Alignment Score (AAS)
| -> Measurement of cross-attention map consensus
+-- Improvement of acoustic quality and text condition adherence
LM Alignment:
+-- Pointwise Mutual Information (PMI)
| -> Measurement of semantic adherence
+-- Improvement of Song Blueprint accuracy
Final Reward Weights:
- Atmosphere: 50%
- Lyrics: 30%
- Metadata: 20%
5.5 Data and Training Infrastructure
v1.5 uses significantly larger-scale data and more sophisticated training strategies than v1:
RL-Driven Annotation Pipeline:
[v1.5 Data Annotation]
1. "Golden Set" construction (5M samples)
+-- Initial annotation with Gemini 2.5 Pro
2. Fine-tuning
+-- Fine-tune Qwen2.5-Omni with Golden Set
+-- GRPO optimization -> ACE-Captioner, ACE-Transcriber generation
3. Reward Models training
+-- Trained on 4M contrastive pairs
4. Progressive Curriculum (3 stages)
+-- Phase 1: Foundation Pre-training (20M samples)
+-- Phase 2: Omni-task Fine-tuning (17M, including stem-separated tracks)
+-- Phase 3: High-quality SFT (2M curated samples)
The 3-stage progressive curriculum spanning 27M samples in total is designed so the model starts with basic music generation capabilities and gradually learns specialized tasks.
5.6 Omni-Task Framework
Another key innovation of v1.5 is the Omni-Task framework that handles diverse music tasks with a single model:
| Task | Description | Use Scenario |
|---|---|---|
| Text-to-Music | Generate full song from text prompt | Composition, BGM |
| Cover Generation | Style/timbre conversion of existing songs | Cover song production |
| Repainting | Regenerate/modify specific sections | Partial remixing |
| Track Extraction | Separate vocal/accompaniment tracks | Mixing, remastering |
| Layering | Multi-track synthesis | Arrangement, producing |
| Completion | Continue unfinished compositions | Collaborative composition |
| Vocal-to-BGM | Generate accompaniment from vocals | Karaoke production |
All these tasks are implemented through combinations of Source Latent and Mask configurations, handled by a single model without separate model training.
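How a single mask over latent frames can express several of these tasks can be sketched as follows. The 25 Hz latent rate follows the v1.5 tokenizer described earlier; the mask semantics (1 = regenerate, 0 = keep from the Source Latent) are an illustrative assumption, not the documented interface:

```python
import numpy as np

LATENT_RATE_HZ = 25   # v1.5 continuous latent rate

def repaint_mask(total_sec, start_sec, end_sec):
    # 1 = frames the DiT regenerates, 0 = frames kept from the Source Latent
    mask = np.zeros(total_sec * LATENT_RATE_HZ, dtype=np.int8)
    mask[start_sec * LATENT_RATE_HZ:end_sec * LATENT_RATE_HZ] = 1
    return mask

full_gen = repaint_mask(210, 0, 210)        # text-to-music: regenerate everything
chorus_fix = repaint_mask(210, 45, 75)      # repainting: redo chorus1 only
continuation = repaint_mask(210, 120, 210)  # completion: continue from 2:00
print(int(chorus_fix.sum()))                # 750 frames = 30 s at 25 Hz
```

The same model weights serve every task; only the mask and the conditioning latent change.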
6. Performance Evaluation and Benchmarks
6.1 Inference Speed Comparison
The most dramatic advantage of ACE-Step is its inference speed:
| Model | RTF (RTX 4090) | 4-min Song Gen Time | Notes |
|---|---|---|---|
| ACE-Step v1 | 15.63x | ~20s (A100) | 15.63x real-time |
| ACE-Step v1.5 | - | Under 2s (A100) | DMD2 distillation |
| DiffRhythm | 10.03x | ~30s | |
| Yue (LLM-based) | 0.083x | ~48 min | Slower than real-time |
ACE-Step v1 is approximately 188x faster than the LLM-based model Yue, and v1.5 is over 10x faster than v1 through distillation.
v1.5 Performance by Hardware:
| Hardware | Full Song Gen Time | VRAM Required |
|---|---|---|
| NVIDIA A100 | Under 2 seconds | - |
| RTX 3090 | Under 10 seconds | Under 4GB |
| RTX 4090 | Under 5 seconds (est.) | Under 4GB |
| AMD Radeon | Supported (official AMD partnership) | Under 4GB |
| Apple Silicon (Mac) | Supported | Under 4GB |
6.2 Music Quality Evaluation
ACE-Step achieved competitive results across various automatic evaluation metrics and human evaluations:
Automatic Evaluation (v1):
| Metric | ACE-Step v1 | Best Comparison Model | Description |
|---|---|---|---|
| DCAE FAD | 0.0224 | DiffRhythm VAE: 0.0059 | Waveform reconstruction quality |
| Style Alignment | Top tier | Udio v1 (best) | CLAP + Mulan based |
| Lyric Alignment | Strong | Hailuo (best) | Whisper Forced Alignment |
| SongEval Coherence | Competitive | Suno v3 (best) | Musical coherence |
| SongEval Memorability | Strong | - | Memorable melody |
Automatic Evaluation (v1.5):
| Metric | ACE-Step v1.5 | Suno v5 | MinMax 2.0 |
|---|---|---|---|
| AudioBox CU | 8.09 (best) | - | - |
| AudioBox PQ | 8.35 (best) | - | - |
| SongEval Coherence | 4.72 (tied best) | - | - |
| Style Alignment | 39.1 | 46.8 | 43.1 |
| Lyric Alignment | 26.3 | 34.2 | 29.5 |
v1.5 achieved the highest scores in AudioBox CU (8.09) and PQ (8.35), and tied for best in SongEval Coherence (4.72). While it trails Suno v5 in Style and Lyric Alignment, it leads open-source models by a wide margin, and it ranks between Suno v4.5 and v5 in Music Arena human evaluations.
Human Evaluation (v1, 32 participants):
| Evaluation Item | Score (/100) |
|---|---|
| Emotional Expression | ~85 |
| Innovativeness | ~82 |
| Sound Quality | ~80 |
| Musicality | ~78 |
7. Comparative Analysis of AI Music Generation Models
7.1 Major Model Overview
A systematic comparison of major models in the current AI music generation field:
[AI Music Generation Model Classification]
+-------------------------------------------------------------+
| Open-Source Models |
+--------------+--------------+--------------+-----------------+
| ACE-Step | MusicGen | Stable Audio| Riffusion |
| (v1, v1.5) | (Meta) | Open | |
| | | (Stability) | |
| Diffusion | Autoregress | Latent | Image Diffusion|
| + DCAE/VAE | + EnCodec | Diffusion | -> Spectrogram |
| 3.5B params | 1.5B/3.3B | 1.1B | ~1B |
+--------------+--------------+--------------+-----------------+
| Commercial Models |
+--------------+--------------+--------------+-----------------+
| Suno | Udio | ElevenLabs | Google MusicLM |
| (v3->v5) | (v1->v2) | Eleven Music| |
| | | | |
| Full song | Segment-by | Licensed | Experimental/ |
| generation | -segment | commercial | Instrumental |
| pipeline | composition | use OK | focus |
+--------------+--------------+--------------+-----------------+
7.2 Detailed Comparison Table
| Model | Developer | Parameters | Generation Method | Audio Representation | Max Length | Lyric Support | Open Source |
|---|---|---|---|---|---|---|---|
| ACE-Step v1 | ACE Studio + StepFun | 3.5B | Flow Matching + DiT | Mel DCAE latent | 4 min | Yes (multilingual) | Yes |
| ACE-Step v1.5 | ACE Studio + StepFun | ~3.7B (LM+DiT) | Hybrid LM + DiT + DMD2 | 1D VAE latent | 10+ min | Yes (50+ languages) | Yes |
| MusicGen | Meta | 1.5B/3.3B | Autoregressive | EnCodec tokens | ~30s | No | Yes |
| Stable Audio Open | Stability AI | 1.1B | Latent Diffusion | VAE latent | 47s | No | Yes |
| Riffusion | Riffusion | ~1B | Image Diffusion | Spectrogram | A few seconds | No | Yes |
| JEN-1 | Jen Music | - | AR + Non-AR hybrid | Raw waveform | ~30s | No | Partial |
| Suno | Suno Inc. | Undisclosed | Undisclosed | Undisclosed | 4+ min | Yes | No |
| Udio | Udio | Undisclosed | Undisclosed | Undisclosed | Segment-based | Yes | No |
| MusicLM | Google | Undisclosed | AR + SoundStream | SoundStream tokens | ~30s | No | No |
7.3 MusicGen (Meta)
Meta's MusicGen is a pioneer in open-source music generation models. It is an autoregressive transformer model based on the EnCodec tokenizer.
[MusicGen Architecture]
Text prompt -> T5 Encoder -> Conditioning
|
v
+--------------------------+
| Autoregressive Decoder |
| (Transformer LM) |
| |
| EnCodec 4 codebooks |
| 32kHz, 50Hz sampling |
| |
| Simultaneous multi- |
| codebook generation |
| via delay pattern |
+----------+---------------+
|
v
+--------------------------+
| EnCodec Decoder |
| tokens -> waveform |
+--------------------------+
Strengths: Stable instrumental generation, melody conditioning support
Limitations: No lyric support, ~30-second limit, relatively slow autoregressive generation
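The delay pattern in the diagram offsets codebook k by k steps, so at decoding step t the model emits (c0[t], c1[t-1], c2[t-2], c3[t-3]) in parallel instead of making four sequential predictions per frame. A sketch with frame indices standing in for real EnCodec codes:

```python
PAD = -1   # marks positions with no token (before a codebook starts / after it ends)

def delay_pattern(num_frames, num_codebooks=4):
    # codebook k is shifted right by k steps relative to codebook 0
    rows = []
    for k in range(num_codebooks):
        rows.append([PAD] * k
                    + list(range(num_frames))
                    + [PAD] * (num_codebooks - 1 - k))
    return rows

pattern = delay_pattern(5)
for row in pattern:
    print(row)
```

The trade-off is a small, fixed extra latency (num_codebooks - 1 steps) in exchange for collapsing the per-frame codebook loop into a single parallel prediction.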
7.4 Suno vs ACE-Step
Suno is currently the most commercially successful AI music generation platform:
| Comparison Item | ACE-Step v1.5 | Suno v5 |
|---|---|---|
| Accessibility | Local install (open source) | Cloud service |
| VRAM Required | Under 4GB | N/A (server) |
| Song Structure | LM-based Blueprint | End-to-end |
| Customization | LoRA training possible | Prompts only |
| Style Alignment | 39.1 | 46.8 |
| Lyric Alignment | 26.3 | 34.2 |
| Price | Free (local) | Subscription |
| Commercial Use | License check needed | Paid plans |
While Suno v5 still leads in absolute quality, ACE-Step v1.5 is a powerful alternative in terms of local deployment, customization, and cost efficiency.
7.5 Stable Audio Open
Stability AI's Stable Audio Open is a latent diffusion-based open-source model:
| Comparison Item | ACE-Step v1.5 | Stable Audio Open |
|---|---|---|
| Max Length | 10+ min | 47 seconds |
| Lyric Support | Yes (50+ languages) | No |
| Vocal Generation | Yes (including Voice Cloning) | No (instrumental only) |
| Parameters | ~3.7B | 1.1B |
| Audio Quality | 48kHz stereo | 44.1kHz stereo |
ACE-Step shows superiority in nearly all aspects including length, lyrics, and vocals.
8. Core Foundational Technologies for Music Generation
An in-depth analysis of the essential foundational technologies needed to understand AI music generation.
8.1 Audio Tokenization: Converting Audio to Discrete Tokens
The first challenge for music generation models is transforming continuous audio signals into a form the model can process. There are broadly three approaches:
[Audio Representation Method Comparison]
1. Spectrogram-Based
+--------------------------------------------+
| waveform -> STFT -> mel-spectrogram -> image |
| |
| Pros: Easy visualization, can leverage |
| image models |
| Cons: Phase information loss, vocoder needed|
| Used by: Riffusion, ACE-Step v1 (DCAE input)|
+--------------------------------------------+
2. Neural Audio Codec (Discrete Tokens)
+--------------------------------------------+
| waveform -> Encoder -> RVQ -> discrete tokens|
| tokens -> Decoder -> waveform |
| |
| Pros: End-to-end, high compression ratio |
| Cons: Weak long-range dependency |
| (acoustic tokens) |
| Used by: MusicGen (EnCodec), MusicLM |
| (SoundStream) |
+--------------------------------------------+
3. Continuous Latent (VAE)
+--------------------------------------------+
| waveform -> VAE Encoder -> continuous latent |
| latent -> VAE Decoder -> waveform |
| |
| Pros: Natural integration with Diffusion |
| Cons: Compression ratio vs quality tradeoff |
| Used by: ACE-Step v1.5 (1D VAE), |
| Stable Audio |
+--------------------------------------------+
8.2 EnCodec and SoundStream
EnCodec (Meta) and SoundStream (Google) are representative Neural Audio Codec models:
[EnCodec / SoundStream Architecture]
Input: raw waveform (24kHz/48kHz)
|
v
+---------------------------------+
| Encoder (1D Conv + LSTM) |
| -> continuous embeddings |
+------------+--------------------+
|
v
+---------------------------------+
| Residual Vector Quantization |
| (RVQ) |
| |
| Codebook 1 -> Most important |
| information |
| Codebook 2 -> Residual |
| Codebook 3 -> Finer residual |
| ... |
| Codebook N -> Final residual |
| |
| Each codebook: 1024 entries |
| sampling rate: 50Hz/75Hz |
+------------+--------------------+
|
v
+---------------------------------+
| Decoder (1D TransposeConv) |
| -> reconstructed waveform |
+---------------------------------+
Training: Reconstruction Loss + Adversarial Loss
(Multi-scale discriminator)
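The RVQ stage in the diagram is mechanically simple: each codebook quantizes whatever residual the previous stages left behind, so reconstruction error shrinks as codebooks are added. A minimal sketch with random codebooks standing in for learned ones (a zero "skip" entry is included in each codebook so quantization can never increase a residual, making the per-stage error non-increasing by construction):

```python
import numpy as np

def make_codebook(rng, entries=1024, dim=16):
    cb = rng.standard_normal((entries, dim)) * 0.5
    cb[0] = 0.0   # zero entry: a stage may leave a residual untouched
    return cb

def rvq_encode(x, codebooks):
    residual = x.copy()
    codes = []
    for cb in codebooks:                                       # cb: (entries, dim)
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)                                 # nearest entry
        codes.append(idx)
        residual -= cb[idx]                                    # pass residual on
    return codes, x - residual                                 # codes, reconstruction

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16))
books = [make_codebook(rng) for _ in range(4)]
errors = []
for n in range(1, 5):
    _, recon = rvq_encode(x, books[:n])
    errors.append(float(np.mean((x - recon) ** 2)))
print(errors)   # error shrinks as stages are added
```

This is why codebook 1 carries the "most important information" and later codebooks carry progressively finer residuals.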
EnCodec vs SoundStream:
| Item | EnCodec | SoundStream |
|---|---|---|
| Developer | Meta | Google |
| Key Innovation | Multi-scale discriminator, loss balancing | RVQ introduction |
| Sample Rate | 24kHz/48kHz | 24kHz |
| Bitrate | 1.5~24 kbps | 3~18 kbps |
| Used In | MusicGen, AudioGen | AudioLM, MusicLM |
| Open Source | Yes | No |
8.3 Diffusion for Audio
Audio application of Diffusion models is built on the success in the image domain:
[Audio Diffusion Training]
Forward Process (Adding Noise):
x_0 (original audio latent)
-> x_1 -> x_2 -> ... -> x_T (pure Gaussian noise)
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1-alpha_bar_t) * epsilon, epsilon ~ N(0,I)
Reverse Process (Denoising, training target):
x_T (noise) -> x_{T-1} -> ... -> x_0 (generated audio latent)
p_theta(x_{t-1}|x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma^2 I)
Loss: L = E_{t,x_0,epsilon} [||epsilon - epsilon_theta(x_t, t, c)||^2]
(c = conditioning: text, melody, etc.)
ACE-Step v1 uses Flow Matching instead of standard Diffusion, which uses straight paths for convergence with fewer steps and stable training. v1.5 adds DMD2 distillation on top to achieve high-quality generation with only 4-8 steps.
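The forward-process equation above, written out directly; given the sampled epsilon, the clean latent is exactly recoverable, which is why regressing epsilon is a sufficient training target:

```python
import numpy as np

def add_noise(x0, alpha_bar_t, rng):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 16))
x_t, eps = add_noise(x0, alpha_bar_t=0.3, rng=np.random.default_rng(1))

# invert the forward step: epsilon-prediction pins down x_0 exactly
x0_rec = (x_t - np.sqrt(1 - 0.3) * eps) / np.sqrt(0.3)
assert np.allclose(x0_rec, x0)
```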
8.4 Classifier-Free Guidance (CFG)
CFG, a core technique in all conditional generation models, is also used in ACE-Step:
[CFG Application]
epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
Where:
- epsilon_cond: prediction with conditions (text, lyric, speaker)
- epsilon_uncond: prediction without conditions (trained via dropout)
- w: guidance scale (higher = more condition adherence, less diversity)
ACE-Step's 15% text/lyric dropout, 50% speaker dropout
enables unconditional training for this CFG.
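The guidance formula above as a one-liner: with w = 0 it returns the unconditional prediction, with w = 1 the plain conditional one, and with w > 1 it extrapolates past the conditional prediction toward stronger condition adherence:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # toy conditional prediction
eps_u = np.array([0.0, 0.0])   # toy unconditional prediction
print(cfg(eps_c, eps_u, 0.0))  # unconditional
print(cfg(eps_c, eps_u, 1.0))  # conditional
print(cfg(eps_c, eps_u, 7.0))  # extrapolated, stronger adherence
```

The `guidance_scale: 7.0` in the API example later in this article is exactly this w.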
9. Practical Usage Guide
9.1 ACE-Step v1.5 Local Installation
ACE-Step v1.5 offers a remarkably simple installation process:
# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone repository and install dependencies
git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync
# 3. Launch Gradio UI (web interface)
uv run acestep
# -> Access at http://localhost:7860
# 4. Or launch REST API server
uv run acestep-api
# -> Use API at http://localhost:8001
# 5. Environment configuration (optional)
cp .env.example .env
# Customize model paths, ports, GPU settings, etc. in .env file
Supported Hardware:
- NVIDIA GPU (CUDA): RTX 20xx or higher recommended
- AMD GPU (ROCm): Optimized through official AMD partnership
- Intel GPU: Supported
- Apple Silicon (Mac): MPS backend supported
Models are automatically downloaded on first run and operate with under 4GB of VRAM.
9.2 Basic Text-to-Music Usage
# Music generation example via API (conceptual code)
import requests
# Basic text-to-music generation
response = requests.post("http://localhost:8001/generate", json={
"prompt": "Bright and cheerful K-pop dance track, synth bass and electronic beats, "
"128 BPM, female vocal, C major",
"lyrics": """
[Verse 1]
Shining like stars tonight
Let's dance together
In this moment where music flows
We won't stop
[Chorus]
La la la shining night
La la la time together
May this moment last forever
""",
"duration": 180, # 3 minutes
"num_inference_steps": 8, # DMD2 distilled
"guidance_scale": 7.0,
"seed": 42
})
# Save output audio
with open("output.wav", "wb") as f:
f.write(response.content)
9.3 Prompt Writing Guide
Effective prompt writing directly impacts generation quality:
[Effective Prompt Structure]
1. Genre/Style : "indie folk ballad", "aggressive metal", "lo-fi hip-hop"
2. Instrumentation : "acoustic guitar, soft piano, light percussion"
3. Mood/Emotion : "melancholic", "uplifting", "dreamy"
4. Tempo (BPM) : "slow tempo 70 BPM", "fast 140 BPM"
5. Key : "minor key", "E flat major"
6. Vocal Character : "female vocal, breathy", "male baritone, powerful"
7. Production Style : "lo-fi with vinyl crackle", "clean studio production"
[Good Prompt Example]
"Dreamy shoegaze rock with layers of reverbed electric guitars,
ethereal female vocal, 90 BPM, D minor, lo-fi production
with tape saturation and subtle noise"
[Lyrics Format]
- Use [Verse], [Chorus], [Bridge], [Intro], [Outro] tags
- Clearly separate each section
- One phrase per line
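The seven-part structure above can be assembled programmatically. The helper below is a hypothetical convenience function for this guide, not part of the ACE-Step API:

```python
def build_prompt(genre, instruments, mood, bpm=None, key=None,
                 vocal=None, production=None):
    """Assemble a comma-separated prompt following the structure above:
    genre -> instrumentation -> mood -> tempo -> key -> vocal -> production."""
    parts = [genre, instruments, mood]
    if bpm:
        parts.append(f"{bpm} BPM")
    if key:
        parts.append(key)
    if vocal:
        parts.append(vocal)
    if production:
        parts.append(production)
    return ", ".join(p for p in parts if p)

prompt = build_prompt(
    genre="dreamy shoegaze rock",
    instruments="layers of reverbed electric guitars",
    mood="ethereal",
    bpm=90,
    key="D minor",
    vocal="ethereal female vocal",
    production="lo-fi production with tape saturation",
)
```

The resulting string can be passed directly as the `prompt` field of the generation request shown in section 9.2.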
9.4 LoRA Personalization Training
One of ACE-Step v1.5's powerful features is LoRA support that allows training your own style with a small number of songs:
[LoRA Training Process]
1. Data Preparation
+-- Minimum 3-5 reference music tracks
+-- Text prompt (caption) for each track
+-- (Optional) Lyrics files
2. Access LoRA Training tab in Gradio UI
+-- Upload audio files
+-- Enter captions
+-- Configure training parameters
| +-- Learning Rate: ~1e-4
| +-- Epochs: 50-200
| +-- LoRA Rank: 8-64
+-- Start training
3. Apply trained LoRA
+-- Load LoRA weights during generation
+-- Adjust LoRA Scale (0.0~1.0)
+-- Combine with existing prompts to apply style
This allows you to reflect a specific artist's production style, nuances of a specific genre, or your own composition style in the model.
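Why does a LoRA adapter stay small enough to train on a few songs? It freezes the base weight matrix W and learns only a low-rank update, as in this generic numpy sketch (illustrating the LoRA math, not ACE-Step's internals):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 64, 64, 8  # rank 8 is at the low end of the 8-64 range above

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x, scale=1.0):
    """y = W x + scale * B (A x).
    Only A and B are trained: 2 * rank * d parameters
    instead of the d_out * d_in of a full fine-tune."""
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d_in)
# with B zero-initialized, the adapter starts as an exact no-op,
# so training begins from the unmodified base model
```

The "LoRA Scale (0.0~1.0)" slider in the Gradio UI corresponds to the `scale` factor here: 0.0 disables the adapter, 1.0 applies the learned style update at full strength.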
9.5 ComfyUI Integration
ACE-Step 1.5 also supports integration with ComfyUI, enabling visual configuration of music generation in a node-based workflow:
[ComfyUI ACE-Step Workflow Example]
+----------+ +--------------+ +--------------+
| Text |---->| ACE-Step |---->| Audio |
| Prompt | | Generator | | Preview |
+----------+ | | +--------------+
| |
+----------+ | | +--------------+
| Lyrics |---->| |---->| Save WAV |
| Input | | | | Node |
+----------+ +--------------+ +--------------+
10. Ethical Considerations and Legal Issues
10.1 Copyright Status (2025-2026)
Copyright issues in AI music generation are currently one of the hottest legal topics:
Key Rulings and Trends:
| Date | Event | Impact |
|---|---|---|
| Jan 2025 | US Copyright Office: No copyright for 100% AI-generated content | Public domain ruling |
| Mar 2025 | US Appeals Court: Confirms denial of copyright for AI works | Legal precedent established |
| Aug 2025 | ElevenLabs Eleven Music launch | First legally licensed commercial AI music |
| Sep 2025 | Warner Music + Suno settlement | Suno agrees to license-based model transition |
| Nov 2025 | UMG + Udio settlement | Similar license transition agreement |
| Jan 2026 | UMG vs Anthropic ($3B) | Copyright lawsuit over 20,000+ songs in training data |
10.2 "Meaningful Human Authorship" Principle
The US Copyright Office released guidelines stating that copyright may be recognized for AI-assisted works when "meaningful human authorship" is present:
[Copyright Recognition Spectrum for AI Music]
Fully AI-Generated Fully Human-Created
<---------------------------------------->
| | |
No copyright Judgment needed Copyright recognized
|
Human actively:
- Modifying melody
- Writing lyrics
- Arranging structure
- Selecting/editing AI output
-> "Meaningful Human Authorship"
-> Copyright may be recognized
10.3 Ethical Considerations for Open-Source Models
Open-source models like ACE-Step require additional ethical considerations:
Training Data Sources: The papers do not clearly disclose the copyright status of ACE-Step's training data (1.8M songs for v1, 27M samples for v1.5). Users should be aware of the legal risks of using generated music commercially.
Voice Cloning Misuse: The voice cloning capability through the Speaker Encoder could be misused to replicate specific artists' voices without authorization. Cloning without consent from the reference vocal rights holder is both ethically and legally problematic.
Deepfake Music: Deepfake music where AI generates "new songs" by specific artists has already emerged as a social issue. ACE-Step's Cover Generation feature also requires responsible use in this context.
Impact on the Music Industry: The democratization of AI music generation technology can directly affect the livelihoods of professional musicians, composers, and producers. A balance between technological advancement and creator protection is needed.
10.4 Guidelines for Responsible Use
[Responsible Use Principles for AI Music Generation]
1. Transparency: Clearly state when music is AI-generated/assisted
2. Consent: Obtain original artist consent for Voice Cloning
3. Attribution: Clearly distinguish AI tool contributions from human contributions
4. Commercial Use: Comply with relevant regulations and license conditions
5. Education: Use AI tools as supplementary tools for music education/learning
6. Fair Use: Distinguish between style imitation and copying of existing music
11. Key Paper References
A compilation of key papers on ACE-Step and the AI music generation field:
11.1 ACE-Step Related
| Paper | Authors | Year | Key Contribution |
|---|---|---|---|
| ACE-Step: A Step Towards Music Generation Foundation Model | Gong et al. | 2025 | DCAE + Linear DiT + REPA |
| ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation | ACE-Step Team | 2026 | Hybrid LM+DiT, DMD2, RL alignment |
11.2 Foundational Technologies
| Paper | Key Contribution | Usage |
|---|---|---|
| Deep Compression Autoencoder (Chen et al., 2024) | High compression ratio AutoEncoder | ACE-Step DCAE |
| MERT (Li et al., 2024) | Self-supervised music representation learning | ACE-Step REPA |
| mHuBERT-147 (Lee et al., 2024) | Multilingual speech representation | ACE-Step REPA |
| Flow Matching (Lipman et al., 2023) | ODE-based generative model | ACE-Step generation process |
| DMD2 (Yin et al., 2024) | Distribution Matching Distillation | ACE-Step v1.5 speedup |
11.3 Competing Model Papers
| Paper | Author/Org | Year | Key Contribution |
|---|---|---|---|
| MusicGen: Simple and Controllable Music Generation | Copet et al. (Meta) | 2023 | EnCodec + AR Transformer |
| MusicLM: Generating Music from Text | Agostinelli et al. (Google) | 2023 | SoundStream + AR |
| Stable Audio Open | Evans et al. (Stability AI) | 2024 | Latent Diffusion for Audio |
| Riffusion | Forsgren & Martiros | 2022 | Spectrogram Image Diffusion |
| JEN-1: Text-Guided Universal Music Generation | Li et al. | 2023 | AR + Non-AR hybrid |
| DiffRhythm | - | 2025 | 1D VAE + Flow DiT |
| SongGen | - | 2025 | Lyric encoding architecture |
11.4 Audio Tokenization
| Paper | Author/Org | Year | Key Contribution |
|---|---|---|---|
| EnCodec: High Fidelity Neural Audio Compression | Defossez et al. (Meta) | 2022 | RVQ + Multi-scale Disc |
| SoundStream: An End-to-End Neural Audio Codec | Zeghidour et al. (Google) | 2021 | RVQ introduction |
| WavTokenizer | Peng et al. | 2025 | 40/75 tokens/sec SOTA |
| AudioLM: A Language Modeling Approach to Audio | Borsos et al. (Google) | 2023 | Semantic + Acoustic tokens |
12. Future Outlook
12.1 Technology Development Direction
AI music generation technology is expected to evolve in the following directions:
[AI Music Generation Technology Development Roadmap]
2026 Current 2027 Expected 2028+ Long-term
| | |
v v v
+--------------+ +--------------+ +------------------+
| Current State | | Short-term | | Long-term Vision |
| | | Development | | |
| - 4-min song | -> | - Album-level| -> | - Real-time |
| generation | | consistent | | interactive |
| - Text cond. | | generation | | music gen |
| - LoRA | | - Multi-track| | - Emotion-aware |
| personalize| | simultaneous| | adaptive music |
| - Voice Clone| | generation | | - Video-music |
| - 50+ langs | | - Real-time | | synchronization|
| | | streaming | | - Fully automated|
| | | generation | | production |
+--------------+ +--------------+ +------------------+
12.2 ACE-Step's Foundation Model Vision
The ultimate vision of the ACE-Step project is to become the "Stable Diffusion of Music AI." This means not just a simple text-to-music pipeline, but a general-purpose Foundation Model upon which various downstream tasks can be built:
[ACE-Step Foundation Model Ecosystem Vision]
+-------------------------+
| ACE-Step Foundation |
| Model (Base) |
+----------+--------------+
|
+--------------------+--------------------+
| | |
v v v
+--------------+ +--------------+ +------------------+
| Text-to- | | Audio | | Music |
| Music | | Editing | | Understanding |
| Generation | | & Remixing | | & Analysis |
+--------------+ +--------------+ +------------------+
| | |
v v v
+--------------+ +--------------+ +------------------+
| LoRA | | Voice | | Stem |
| Style | | Cloning | | Separation |
| Transfer | | & TTS | | & Transcription |
+--------------+ +--------------+ +------------------+
When this vision is realized, diverse users including music producers, video creators, game developers, and educators will be able to generate and edit commercial-quality music in local environments.
12.3 Industry Impact Outlook
Democratization of Music Production: The ability to generate commercial-quality music with 4GB VRAM means the barrier to entry for music production has been dramatically lowered.
Hybrid Workflows: AI-Human collaborative workflows where AI generates drafts and humans refine them will become standard. ACE-Step's Repainting, Completion, and Track Extraction features are optimized for such workflows.
Personalized Music Experiences: Personalization training through LoRA enables music generation tailored to each user's preferences. This will lead to dynamically generated custom music in games, meditation apps, fitness apps, and more.
Legal Framework Establishment: Through the lawsuits and settlements of 2025-2026, a clear legal framework for AI music generation will gradually be formed. ElevenLabs' license-based approach could serve as one model.
13. Conclusion
ACE-Step is a landmark model that has dramatically narrowed the gap between open-source and commercial models in AI music generation. The v1 DCAE + Linear DiT + REPA architecture achieved 188x faster inference than LLM-based models at 3.5B parameters, and the v1.5 Hybrid LM + DiT + DMD2 architecture realized remarkable efficiency of under 2 seconds on A100 and under 4GB VRAM.
Summarizing the key technical contributions:
- DCAE Application to Music Domain: Achieved high-quality reconstruction while maintaining 10.77Hz temporal resolution with 8x compression
- REPA Training: Fast convergence and high fidelity through musical/linguistic semantic alignment via MERT + mHuBERT
- Hybrid LM + DiT: Support for songs longer than 10 minutes through separation of structural planning and acoustic rendering
- DMD2 Distillation: Compressed 50 steps to 4-8 steps, 200x speed improvement
- Omni-Task Framework: Single model performs diverse tasks including text-to-music, cover, repainting, and track separation
Of course, gaps still exist with top-tier commercial models like Suno v5 in Style/Lyric Alignment. However, the values ACE-Step offers -- open-source, local deployment, and customizability -- are unique advantages that commercial models cannot provide. ACE-Step's journey toward music AI's "Stable Diffusion moment" has only just begun.
References
- Gong, J., Zhao, S., Wang, S., Xu, S., & Guo, J. (2025). ACE-Step: A Step Towards Music Generation Foundation Model. arXiv:2506.00045. https://arxiv.org/abs/2506.00045
- ACE-Step Team. (2026). ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation. arXiv:2602.00744. https://arxiv.org/abs/2602.00744
- Chen, J. et al. (2024). Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. arXiv:2410.10733. https://arxiv.org/abs/2410.10733
- Copet, J. et al. (2023). Simple and Controllable Music Generation. NeurIPS 2023. https://arxiv.org/abs/2306.05284
- Agostinelli, A. et al. (2023). MusicLM: Generating Music From Text. arXiv:2301.11325. https://arxiv.org/abs/2301.11325
- Defossez, A. et al. (2022). High Fidelity Neural Audio Compression. arXiv:2210.13438. https://arxiv.org/abs/2210.13438
- Zeghidour, N. et al. (2021). SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312. https://arxiv.org/abs/2107.03312
- Li, Y. et al. (2024). MERT: Acoustic Music Understanding Model with Large-Scale Self-Supervised Training. ICLR 2024.
- Lee, R. et al. (2024). mHuBERT-147: A Compact Multilingual HuBERT Model. Interspeech 2024.
- Lipman, Y. et al. (2023). Flow Matching for Generative Modeling. ICLR 2023.
- Yin, T. et al. (2024). One-step Diffusion with Distribution Matching Distillation. CVPR 2024.
- Evans, Z. et al. (2024). Stable Audio Open. arXiv:2407.14358.
- Li, P. et al. (2023). JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models. arXiv:2308.04729.
- ACE-Step GitHub (v1): https://github.com/ace-step/ACE-Step
- ACE-Step GitHub (v1.5): https://github.com/ace-step/ACE-Step-1.5
- ACE-Step Hugging Face: https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B