AI Music Generation 2026 — Suno, Udio, Stable Audio, MusicGen, Mubert, ElevenLabs, Lyria — Where Are We Really?
By Youngju Kim (@fjvbn20031)
Prologue — What Has Changed in Two Years
Summer 2023: AI-generated music was a toy. One-bar melodies, awkward rhythms, vocals either absent or unintelligible. When Meta open-sourced MusicGen, the reaction was "neat" rather than "I'll write a song with this."
Spring 2024: Suno shipped v3, Udio opened its beta, and the mood shifted. A single text prompt produced a two-minute song with actual vocals. Rough in places, but for the first time people said "wait, this is real." Three months later, in June 2024, the RIAA sued both Suno and Udio for massive copyright infringement. Industry attention had arrived in earnest.
May 2026: the landscape has shifted again. Suno v5.5 clones a user's voice and supports personal fine-tunes. Udio has signed licensing settlements with Universal, Warner, Kobalt, and Merlin in sequence. Google acquired Riffusion's successor ProducerAI and folded it into Lyria 3. ElevenLabs expanded from voice into music. On the open-source side, YuE, ACE-Step, and DiffRhythm offer full-song models with vocals that run on a single RTX 4090.
And yet — vocals are still the hardest part. Korean lyrics still sound less natural than English. Anything past four minutes loses coherence. Models with airtight commercial licensing are still rare. The Suno summary judgment hearing is set for July 2026.
This post tries to map that landscape. Which tool fits which job, why vocals are difficult, where open source stands, how the lawsuits are unfolding, and what real workflows look like for indie game soundtracks, podcast intros, YouTube BGM, and songwriting ideation. This is not "AI is killing music" nor "AI is saving music." It is the middle ground that the actual practitioners live in.
One-line take: 2026 AI music is not about "replacing humans" but about "people who couldn't make music starting to make music." Knowing that boundary makes the tool choice easy.
1 · The Birth of the Category — What Happened in 2023–2024
1.1 Two Technical Lineages
AI music generation is the merger of two technical lineages.
Lineage 1: Autoregressive token models. Like text LLMs, tokenize audio and predict the next token. Meta's MusicGen (2023), Google's MusicLM (2023), and Suno's early versions belong here. Training works by compressing audio through a neural audio codec like EnCodec into tokens, then training a transformer on those token sequences.
Lineage 2: Diffusion-based audio. Apply image-diffusion architectures (Stable Diffusion) to audio. Stability AI's Stable Audio is the canonical example. Riffusion used a clever trick — convert audio to a spectrogram (a frequency image), run image diffusion on it, then convert the result back to audio.
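Riffusion's spectrogram trick is easier to grasp in code. Below is a minimal, illustrative round trip between a waveform and its frequency-image representation in plain NumPy. This is a sketch of the idea only: Riffusion's actual pipeline uses mel-scaled magnitude spectrograms and phase reconstruction (Griffin-Lim), whereas this toy keeps the complex phase so the inverse is exact.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Slice audio into overlapping windows and FFT each one.
    The magnitudes of this array are the 'image' a diffusion model edits."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(spec, n_fft=512, hop=128):
    """Overlap-add inverse: turn the frequency image back into a waveform."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * len(spec) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        s = i * hop
        out[s:s + n_fft] += np.fft.irfft(frame, n=n_fft) * win
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

# One second of a 440 Hz tone at 16 kHz stands in for "audio".
sr = 16_000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

spec = stft(audio)        # audio -> frequency image
recon = istft(spec)       # frequency image -> audio
# Interior samples (full window overlap) reconstruct almost exactly.
err = np.max(np.abs(audio[1024:15000] - recon[1024:15000]))
```

The hard part Riffusion had to solve is what this sketch sidesteps: after diffusion edits the magnitude image, the phase is gone and must be estimated before the inverse transform.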
By 2024 the two lineages cross-pollinated and vocal synthesis was bolted on. The real leap for Suno and Udio was producing a "full song with vocals and lyrics from text" — until then, almost everything was instrumental backing only.
1.2 Why Quality Jumped Suddenly
Three variables moved at once.
- Data. Access to large licensed music catalogs (or — as the lawsuits allege — scraped catalogs) became viable for training. MusicGen alone was trained on roughly 20,000 hours of licensed music.
- Compute. H100/H200 clusters made training multi-billion-parameter audio models feasible in reasonable time.
- Architecture. Neural audio codecs like EnCodec and SoundStream opened the door to handling audio as LLM-style tokens.
With those three in place, the trick that worked for text LLMs — "predict the next plausible token" — started working for music.
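That pipeline can be sketched end to end in a few lines. Everything here is a toy stand-in: real codecs like EnCodec learn their codebooks and use residual vector quantization, and the predictor is a transformer, not a bigram count table. The shape of the idea, audio in, token IDs out, next token predicted, is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 'codec': a fixed codebook of K short waveform patches.
K, frame = 16, 8
codebook = rng.normal(size=(K, frame))

def encode(audio):
    """Audio -> token IDs: nearest codebook entry per frame."""
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    dists = ((frames[:, None, :] - codebook[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)          # one integer token per frame

def decode(tokens):
    """Token IDs -> audio: concatenate the codebook patches back."""
    return codebook[tokens].reshape(-1)

audio = rng.normal(size=256)
tokens = encode(audio)                   # the sequence an LM would model

# 'Predict the next plausible token': a trivial bigram count table
# stands in for the billion-parameter transformer.
bigram = np.zeros((K, K))
for a, b in zip(tokens[:-1], tokens[1:]):
    bigram[a, b] += 1
next_token = bigram[tokens[-1]].argmax()
```

Generation is then just sampling tokens from the predictor and running `decode` on the result, which is, in miniature, what the autoregressive lineage does.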
1.3 The RIAA Bomb — June 2024
On June 24, 2024, the Recording Industry Association of America, representing Universal, Warner, and Sony, filed two copyright infringement suits — against Suno in the District of Massachusetts and Udio in the Southern District of New York. The core claim: "trained on copyrighted recordings without permission." The defense from both companies: "transformative fair use."
This is not an isolated dispute. It will decide the commercial fate of the entire AI music category. If the training data is ruled infringing, model retraining is required and the licensing structure for outputs changes fundamentally. That is why the wave of settlements started arriving in late 2025.
2 · Consumer Tools — Suno, Udio, Lyria, ElevenMusic
2.1 Suno — The Category Leader
As of May 2026, the most-used text-to-song tool is Suno. The progression: v3 (early 2024), v4 (2025), v5 (late 2025), v5.5 (March 26, 2026).
Three pillars in v5.5:
- Voices. Users record about thirty seconds of their own singing voice, register it, and the AI sings in that timbre. Pro and Premier subscribers only. Private by default.
- Custom Models. Upload your own catalog (e.g., songs you have made) to fine-tune v5.5 toward that style. Up to three per account.
- Studio. Receive stems separated by track — vocals, bass, drums, harmony, instrumentation. Drop them into a DAW for post-production.
Quality? For English lyrics in mainstream genres like pop, rock, electronic, or folk, a first-time listener will believe a human made it. Korean and other less-trained languages still struggle with pronunciation and prosody (steadily improving since 2025, still weaker than English). Structurally complex genres like jazz improvisation or full classical orchestration remain weak spots.
Commercial licensing is explicitly granted on Pro and above, though marketing anything as "100% safe" is hard while the RIAA case is pending.
2.2 Udio — A Different Aesthetic
Udio was founded in December 2023 by former Google DeepMind researchers, led by CEO David Ding. The April 2024 seed round of $10M was led by Andreessen Horowitz, with notable participation from Instagram co-founder Mike Krieger, will.i.am, Common, and other music-industry figures.
Udio's output has a subtly different character from Suno's. Where Suno tends toward "polished pop," Udio leans toward "track produced by a producer." It scores especially well in hip-hop, R&B, Latin, and electronic.
On October 29, 2025, Universal Music Group settled with Udio — a payment plus a licensing deal for a joint AI music platform launching in 2026. On November 25, Warner settled too (a multi-million-dollar settlement plus a licensing partnership, with Suno acquiring Songkick from Warner as part of the package). Kobalt and Merlin followed. As of May 2026, Sony is the only major still actively litigating against Udio.
2.3 Lyria 3 (Google DeepMind)
Google moved on two fronts.
Lyria the model. From Lyria 2 (May 2025) to Lyria 3 (February 18, 2026). 48kHz stereo, up to three minutes, working directly on audio tokens rather than spectrograms. SynthID watermarking is mandatory. Access via Vertex AI and the Gemini API.
Riffusion acquisition. On February 24, 2026, Google acquired ProducerAI (formerly Riffusion). ProducerAI was a conversational music-generation agent with a million users. After acquisition it was folded into Lyria 3. The spectrogram-diffusion lineage that Riffusion pioneered now lives inside Lyria 3.
2.4 Lyria RealTime — A Different Usage Model
Lyria RealTime is a separate beast. Not "generate a song" but "control streaming audio in real time." You adjust style, tempo, and mood live while infinite music plays. Primary use cases: live streaming, game BGM, interactive installations. Accessed via the Gemini API.
2.5 ElevenMusic (ElevenLabs)
ElevenLabs, known for voice synthesis, launched Eleven Music on August 5, 2025. On April 1, 2026, it relaunched as ElevenMusic with a standalone iOS app and a full consumer platform.
The differentiator is licensing. ElevenLabs signed training-data deals with Merlin Network, Kobalt Music Group, and SourceAudio in advance. Marketing positions ElevenMusic as "cleared for commercial use." The key signal: it deliberately did not train on the major labels' RIAA-side catalogs.
Functionally, you can control length and lyric presence, and remix existing tracks (genre and tempo shifts). The free tier covers seven songs per day. Combined with ElevenLabs' voice synthesis, finer vocal-character control is possible.
2.6 Comparison — Consumer Tools
| Tool | Vocal Quality | Instrumental | Korean Lyrics | Length | Commercial License | Primary Use |
|---|---|---|---|---|---|---|
| Suno v5.5 | Very high | High | OK | Up to 8 min | Pro and above, explicit | Songwriting, content |
| Udio | High | Very high | OK | 4+ min | Standard and above | Producing, hip-hop/R&B |
| Lyria 3 | Medium (lyric-light) | Very high | Weak | Up to 3 min | Vertex AI terms | Enterprise integration |
| ElevenMusic | High | High | Not benchmarked | Up to 5 min | Explicitly cleared | Content creators |
| Lyria RealTime | None | High | N/A | Infinite stream | API terms | Games, live |
3 · Open Source and Local Options — MusicGen, Stable Audio, YuE, ACE-Step
3.1 Why Open Source
Three reasons.
- Cost. No subscription, unlimited generation. Runs on a single local RTX 4090.
- Privacy. Lyrics and concepts never leave your machine. Crucial for unreleased projects.
- Control. Fine-tuning, fixed seeds, batch generation, and automation pipelines become possible.
The cost — quality lags consumer tools by a half-step, and licensing terms need careful reading.
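The "control" point, fixed seeds plus batch generation, is what an automation pipeline actually looks like. A minimal sketch follows; `generate` here is a hypothetical placeholder, not any real model's API (YuE, ACE-Step, and the rest each have their own CLI or Python entry point). The point is only that pinning seeds makes every take reproducible.

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for a local model call. A real pipeline
    would invoke the model here and write an audio file to disk."""
    random.seed(seed)                      # fixed seed => identical output
    return f"{prompt}-take{random.randint(0, 9999)}"

prompts = ["tense cyberpunk alley, 100 BPM", "melancholy synth pad"]
# Batch: every (prompt, seed) pair, three seeded takes per prompt.
takes = {(p, s): generate(p, s) for p in prompts for s in range(3)}

# Rerunning with the same seed reproduces the exact same take.
assert generate(prompts[0], 0) == takes[(prompts[0], 0)]
```

Consumer tools expose none of this; with a local model the whole loop is scriptable.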
3.2 MusicGen (Meta, 2023)
The starting point of open-source AI music. Released August 2023 as part of the AudioCraft framework. Text-to-instrumental.
- Parameters. Three sizes — 300M, 1.5B, 3.3B. The 3.3B variant wants 16GB+ VRAM.
- Data. About 20,000 hours of music Meta owns or licensed.
- License. Model weights are CC BY-NC 4.0 — non-commercial use only. This is widely misread. Self-hosting does not grant commercial rights.
- 2026 status. No meaningful update since 2024. Quality is visibly behind Suno and Udio. Cannot do vocals.
Still useful for "learning," "offline experiments," "non-commercial projects," and "as a baseline for comparing other models."
3.3 Stable Audio 2.5 / Stable Audio Open
The two Stability AI lines are easy to confuse.
Stable Audio 2.5. Commercial SaaS. Up to three minutes, complex structure (intro, development, outro). Better response to mood prompts like "uplifting" or "lush synthesizers." Strong for sound effects, ad music, and video tracks.
Stable Audio Open. Open source. The base model maxes at 47 seconds. Stable Audio Open Small (341M parameters, built with Arm) generates 11 seconds of audio in under 8 seconds on a smartphone CPU. Licensed under the Stability AI Community License, free for commercial and non-commercial use.
Stable Audio Open is stronger for sound design — short SFX, loops, textures, foley — than for full songs.
3.4 YuE — Open-Source Full-Song Model
YuE arrived in 2025 as an open-source full-song model. It does what MusicGen does not: turn text plus lyrics into a complete song with vocals.
- Hardware. Recommended 24GB VRAM. Quantized versions run in 8–16GB. On a 4090, 30 seconds takes roughly 360 seconds.
- Optimized forks. DeepBeepMeep's GPU-poor branch generates a 1-minute song in about 4 minutes on a 4090.
- License. Apache 2.0 — commercial use allowed. The cleanest license among open-source music models.
Quality does not match Suno v5, but YuE is the first open-source model to combine "open + commercial + vocals."
3.5 ACE-Step 1.5 — Another Local Contender
ACE-Step 1.5 stands out for supporting Mac, AMD, Intel, and CUDA backends. That it runs on M-series Macs matters a lot. Solid generation quality with vocal support makes it the often-recommended "2026 local starting point."
3.6 Comparison — Open Source / Local
| Model | Vocals | License | Min VRAM | Length | Strength |
|---|---|---|---|---|---|
| MusicGen 3.3B | No | CC BY-NC 4.0 (non-commercial) | 16GB | 30 sec | Learning, baseline |
| Stable Audio Open | No | Stability Community | 8GB | 47 sec | Sound design |
| YuE | Yes | Apache 2.0 | 24GB rec. | 1–5 min | Full songs, commercial |
| ACE-Step 1.5 | Yes | Open source | 12–24GB | Full song | Multi-platform |
| DiffRhythm | Yes | Open source | 16GB | Full song | Fast inference |
4 · Where It Actually Works
4.1 Indie Game Soundtracks
One of the strongest fits. The reason is simple — an indie game typically needs 10 to 30 tracks. Commissioning all of them from a composer costs roughly ten to fifty thousand dollars. Filling the gap from royalty-free libraries means the same music turns up in other games.
AI music slots neatly into that gap.
- Volume. Dozens of tracks per hour, keep what you like.
- Uniqueness. Unlike libraries, your track will not appear in another game.
- Variation control. Adjust the seed and prompt to generate similar tracks for the same mood.
- Loop-friendly. Game BGM loops anyway. You do not need a full four-minute song.
A workflow used by actual indie studios.
1. Write a mood sheet for the game: "neon-lit cyberpunk alley, tense but melancholy, 100 BPM"
2. Generate 10 to 20 tracks in Suno or Udio, shortlist favorites
3. Separate stems on the 1 to 2 chosen tracks
4. Adjust BPM and key in a DAW, build loop points
5. Import into Unity or Unreal as .ogg or .wav
6. Configure interactive layers in an adaptive music system like FMOD or Wwise
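Step 4's loop-point construction is the fiddly part, and the core move is a crossfade of the track's tail into its head. A minimal NumPy sketch of that idea, assuming mono float audio (real stems would come from a tool's stem export, and you would still audition the seam by ear):

```python
import numpy as np

def make_loop(audio, sr, fade_s=0.5):
    """Crossfade a track's tail into its head so it loops seamlessly.
    Constant-gain ramps: fade_in + fade_out == 1 at every sample."""
    n = int(sr * fade_s)
    fade_in = np.sin(np.linspace(0, np.pi / 2, n)) ** 2
    fade_out = 1.0 - fade_in
    head, tail, body = audio[:n], audio[-n:], audio[n:-n]
    blended = head * fade_in + tail * fade_out   # tail melts into the head
    return np.concatenate([blended, body])       # loop point = sample 0

# Fake 4-second BGM stem: a decaying 220 Hz tone at 22.05 kHz.
sr = 22_050
t = np.arange(sr * 4) / sr
track = np.sin(2 * np.pi * 220 * t) * np.exp(-t / 8)

loop = make_loop(track, sr)   # shorter by the fade length, loops cleanly
```

The engine side (FMOD/Wwise) then only needs to restart the file; no audible click at the seam.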
A caution: verify the licensing of AI output against your distribution channel (Steam, consoles). Suno Pro and above, or a clean model like ElevenMusic, is the safe choice.
4.2 Podcast Intros and Outros
A 15 to 30-second signature sound. AI music's main weakness — long-term coherence — barely matters here.
Workflow.
- Prompt mood and genre: "upbeat tech podcast intro, synth-driven, 20 seconds, fade-out"
- Generate 10 to 20, pick one
- Polish around the voiceover
- Use the same track on every episode — it becomes "brand sound"
Cost: Suno Pro runs about $10 a month; next to a commissioned custom intro (typically $300 to $1,000), that is negligible.
4.3 YouTube and Short-Form BGM
This is where Mubert shines. Mubert is not text-to-song — it is mood-based infinite track generation. It can produce 25-minute background tracks and 25 variations quickly. The royalty-free license is unambiguous. Musicians upload their sample packs and receive 80 percent of track sales, so the training-data origin is comparatively clean.
For a YouTuber, the appeal is "no Content ID claims." Vocal-bearing Suno tracks rarely trigger claims either, but Mubert is the most clearly safe option.
4.4 Songwriting Ideation
Professional songwriters and composers are surprisingly aggressive users. Two patterns.
Motif generation. Quickly try "what would this chord progression with this vocal melody sound like." They do not use the output directly — they steal the idea and weave it into their own track.
Guide track. Write lyrics first, then make an AI demo. Listen to the demo to judge "this part works, this part needs to change." Then build the real song. The AI music acts as an MVP.
The core mindset: use AI output as a design tool, not a finished product. Masterpieces will not pop out — the right position for AI music is "idea generator."
4.5 Where It Does Not Work
The same honesty applies to limits.
- Advanced classical composition. Four-voice fugues, sonata-form structures — still weak.
- Replacing live performance. Cannot manufacture stage energy.
- Jazz improvisation. No coherent motivic development.
- Big commercial IP. Major film soundtracks and lead ad tracks remain out of reach — not for quality reasons but for legal safety.
- Distinctive vocal character. Suno Voices cloning a user's own voice is roughly the ceiling.
5 · Quality Reality — Vocals Are the Hardest Part
5.1 Why Vocals Are Hard
The two hardest problems in audio generation are (a) long-term coherence and (b) vocals. Vocals are especially hard, for layered reasons.
Phonemes and pronunciation. The human voice changes phonemes roughly every 50 ms. The model has to map lyric text to a sequence of pronounced audio tokens. English has rich training data and works well; Korean, Japanese, Arabic, and similar languages have far less audio data per phoneme.
Prosody (intonation). Singing "I love you" sadly versus joyfully sounds different. The model must combine lyric meaning with song mood to shape the intonation curve.
Pitch stability. Human singers hold pitch within roughly ±10 cents. AI sometimes wavers ±50 cents. The ear hears it as "off."
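Those cent figures translate directly into frequency. Cents are a logarithmic pitch measure (100 cents = one semitone, 1200 = one octave), so the deviation between two frequencies is computable:

```python
import math

def cents(f: float, f_ref: float) -> float:
    """Pitch error in cents between f and a reference frequency."""
    return 1200 * math.log2(f / f_ref)

def detune(f_ref: float, c: float) -> float:
    """Frequency that sits c cents away from f_ref."""
    return f_ref * 2 ** (c / 1200)

a4 = 440.0
human = detune(a4, 10)   # ~442.5 Hz: a good singer's +10-cent wobble
ai = detune(a4, 50)      # ~452.9 Hz: +50 cents, halfway to the next semitone
```

A 10-cent error is about 2.5 Hz at A4, below most listeners' threshold; a 50-cent error is nearly 13 Hz, which is exactly the "off" the ear catches.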
Intelligibility. Listeners need to hear the lyrics. Vocals are not finished when melody is in place — the words must be audible. Hard consonant clusters (like "strengths") often blur in AI output.
5.2 The Extra Penalty for Non-English Lyrics
Korean has roughly one-tenth to one-twentieth the training data of English. Consequences:
- Final consonants (especially ㄹ and ㅇ) sound awkward.
- English-style vocal phrasing is forced onto Korean (consonants run together instead of being articulated).
- Natural prosody of the lyric is missed.
Mitigations: (a) Suno v5.5 is visibly better than v4 on Korean. (b) Explicit style tags like "korean ballad," "k-pop," or "trot" help. (c) When awkwardness remains, generate with English lyrics and re-record the vocal in Korean during post.
5.3 Instrumentals Are Surprisingly Solid
Conversely, instrumentals are near-human-quality from late 2025 onward. Electronic, synth pop, lo-fi, cinematic scores, ambient — telling them apart from human work is nearly impossible. That is why games, podcasts, and YouTube BGM exploded first.
5.4 Length and Coherence
Past three minutes, the model starts losing track of "where this song is going." Specifically:
- Motif forgetting. A hook introduced at one minute disappears by three.
- Structural drift. Verse-chorus-bridge structure erodes as length grows.
- Quality drift. After four minutes, vocals sometimes turn grainy or the mix shifts.
Workarounds: (a) generate short pieces and stitch in a DAW, (b) use Suno's Extend feature in segments, (c) for anything past five minutes, go instrumental.
6 · Lawsuits and the Copyright Debate — Honestly
6.1 What Is at Issue
The RIAA suits have two core issues.
- Training data use. "Trained on copyrighted recordings without permission." Both defendants invoke "transformative fair use."
- Output similarity. Plaintiffs claim Suno and Udio can reproduce specific training songs nearly verbatim.
The legal question reduces to whether AI training passes the four-factor fair-use test (purpose, nature, amount, market effect).
6.2 Status as of May 2026
Suno. Contesting all claims on fair-use grounds against Universal, Warner, and Sony in the District of Massachusetts. Suno filed for summary judgment in March 2026, with the key hearing scheduled for July 2026. Cited precedent: the Second Circuit's 2024 Bartz v. SoundAI ruling, which treated AI training as transformative use.
Udio. Successive licensing settlements with Universal (October 2025), Warner (November 2025), Kobalt, and Merlin. Sony remains the only major actively litigating. The Universal deal includes a joint AI music platform launching in 2026.
Independent artists. In October 2025, separately from the majors, a class of independent musicians sued both Suno and Udio.
6.3 Three Possible Outcomes
Scenario A — Suno wins (fair use upheld). AI training becomes legitimized. Every AI model uses a similar defense. The music industry shifts to a separate licensing market (e.g., the Universal-Udio joint platform). Users get the most freedom.
Scenario B — Suno loses (licensing required). Suno is forced into licensing settlements or model retraining. Costs rise sharply and subscription prices follow. New entrants cannot start without licensing. "Pre-licensed" models like ElevenMusic gain a structural advantage.
Scenario C — Settlement. The most likely scenario. The Universal-Udio template — majors + licensing + revenue sharing — becomes the industry standard. The entire industry aligns to that shape.
6.4 What Users Should Do
Safe to do: subscribe to Suno or Udio at Pro tier or above (plans that explicitly grant commercial usage rights), and avoid explicitly imitating named major artists.
Safer still: models like ElevenMusic with provable pre-licensed training data, or Apache 2.0 open-source models like YuE or ACE-Step run locally.
Avoid: prompts attempting to clone a specific named artist's voice ("in the style of [famous singer]"), then commercially distributing the output. That is the clearest risk.
7 · Decision Framework — What to Pick
7.1 "Situation → Recommended Tool"
| Situation | First choice | Second choice | Note |
|---|---|---|---|
| Songwriting demos | Suno v5.5 | Udio | Vocal quality first |
| Indie game BGM | Suno Pro | Mubert | Stem separation matters |
| Podcast intro | Suno | ElevenMusic | 30 seconds works anywhere |
| YouTube background | Mubert | Stable Audio 2.5 | Mood-based infinite tracks |
| Ad track (commercial) | ElevenMusic | Stable Audio 2.5 | License cleanliness first |
| Live game BGM | Lyria RealTime | (few alternatives) | Real-time control |
| Local / private experiment | YuE | ACE-Step | Data does not leave the box |
| Sound design (short SFX) | Stable Audio Open | (DAW plugins) | 11 to 47 seconds |
| Learning / research | MusicGen | YuE | Non-commercial OK |
| Korean-lyric songs | Suno v5.5 | Udio | Plan for vocal post-processing |
7.2 Decision Tree
Start
│
├─ Need vocals?
│ ├─ No → Mubert / Stable Audio / MusicGen / Lyria RealTime
│ └─ Yes ↓
│
├─ Commercial use?
│ ├─ No (research / learning) → Anything goes, MusicGen included
│ └─ Yes ↓
│
├─ License cleanliness top priority?
│ ├─ Yes → ElevenMusic or YuE / ACE-Step self-hosted
│ └─ No ↓
│
├─ Non-English lyrics?
│ ├─ Yes → Suno v5.5 first, expect post-processing
│ └─ No ↓
│
├─ What aesthetic?
│ ├─ Pop / electronic polish → Suno
│ ├─ Hip-hop / R&B / producer tone → Udio
│ └─ Enterprise / Vertex AI → Lyria 3
7.3 By Budget
| Budget | Recommendation |
|---|---|
| $0 / month | MusicGen + 4090 or cloud GPU. Suno free tier (5 songs / day). |
| $10 / month | Suno Pro alone. Enough for most content creators. |
| $30 / month | Suno Pro + Udio Standard + Mubert. Rich aesthetic choices. |
| $100+ / month | Suno Premier + ElevenMusic + Stable Audio 2.5. Commercial production. |
| $1,000+ | Own 4090 box + YuE self-hosted + subscriptions. Studios, game teams. |
Epilogue — Checklist, Anti-Patterns, What's Next
AI music has gone from 2023's "neat" to 2026's "I'll release this." The pivot is that vocals now sound like vocals, lengths reach actual song duration, and aesthetic differences have settled into genre. At the same time — Korean vocals, coherence past four minutes, and airtight commercial licensing remain unsolved. The Suno summary judgment hearing in July 2026 will likely decide the category's next year.
Tool Selection Checklist
- Do you need vocals? — If not, Mubert or Stable Audio is a much safer pick.
- Are you using it commercially? — Pro tier or higher, explicit license, permanent-rights confirmation.
- Is the language English? — If not, budget for post-processing and vocal re-recording.
- How long is the piece? — Past three minutes, use Extend or stitching, or stay instrumental.
- What genre aesthetic? — Suno (pop), Udio (hip-hop / R&B), Lyria (enterprise).
- Need stem separation? — Suno Studio is one of the few that really delivers.
- Online dependency a burden? — Consider YuE or ACE-Step locally.
- Workflow repetitive? — Use the Mubert API, Suno API, or Lyria RealTime API.
- Copyright safety top priority? — ElevenMusic, or models that document training data.
- Are you ready to treat AI output as a draft, not a final? — The most important question.
Anti-Patterns
| Anti-pattern | Why it's bad | Instead |
|---|---|---|
| Shipping the first generation | Average quality is low | Generate 10 to 20, curate |
| Naming famous artists in prompts | License gray zone, Content ID risk | Abstract descriptions like "late-80s synth-pop" |
| Judging Korean songs by English assumptions | Awkward pronunciation slips through | At least one native-speaker review |
| Releasing commercially on a free tier | License violation | Subscribe at Pro or above |
| Generating a 4-minute song in one shot | Late-track coherence falls apart | Generate short, stitch, or use Extend |
| Using MusicGen output in a commercial ad | CC BY-NC 4.0 violation | YuE / ACE-Step or consumer tools |
| Skipping vocal intelligibility checks | Releasing songs no one can parse | Three external listeners read the lyrics back |
| Treating Lyria 3 like a free tool | Vertex AI pricing not understood | Cost-calculate per minute |
| Crediting AI output as "I composed this" | Disclosure and copyright risk | Mark as "AI-assisted composition" |
| Relying on one model only | Model limits become work limits | Pair 2 to 3 models by aesthetic |
What's Next
The next post is "AI Video Generation 2026 — Sora, Veo, Runway, Pika, Kling — and How They Actually Differ." Same pattern as this one: the category's explosion (2024 Sora demo) and maturation (commercial tools in 2026), the hardest part analogous to vocals (long-term coherence, character identity, fingers), open-source options (Open-Sora, Mochi, Wan), real use cases (ads, short video, concept visuals), and the copyright debate (NYT-OpenAI, Disney's licensing model) at the same depth.
References
- Suno v5.5 announcement
- Suno official site
- Suno v5.5 — Music Business Worldwide
- Udio official site
- Udio Wikipedia
- Udio company profile — Sacra
- Music Ally — Udio launch
- Universal Music and Udio settlement — Billboard
- Udio-Kobalt licensing deal — MBW
- RIAA press release — Suno and Udio lawsuits
- RIAA Suno complaint PDF
- RIAA Udio complaint PDF
- Music Industry AI Lawsuits Tracker — Chartlex
- AI Music Lawsuits Settlements Timeline — Dynamoi
- Lyria 3 — Google DeepMind
- Lyria RealTime — Google DeepMind
- Lyria 2 announcement — DeepMind Blog
- Google acquires ProducerAI/Riffusion — Awesome Agents
- ElevenLabs Music official
- ElevenLabs music app — TechCrunch
- ElevenLabs commercial-licensed music — TechCrunch 2025
- ElevenMusic launch — Billboard
- Meta AudioCraft official
- MusicGen on Hugging Face
- Meta AudioCraft announcement blog
- Stable Audio 2.5 — Stability AI
- Stable Audio Open announcement
- Stable Audio Open 1.0 — Hugging Face
- Stable Audio Open Small + Arm
- YuE GitHub
- YuEGP GPU-poor fork
- ACE-Step 1.5 GitHub
- Riffusion-hobby GitHub
- Riffusion on Hugging Face
- Mubert official site
- Mubert API
- Spheron — open-source music models on GPU cloud
- 10 Best AI Music Generators 2026 — fal.ai
- Billboard — biggest AI music stories of 2025
- AI Music Copyright Legal Risks 2026 — Silverman Sound