AI 음악 생성 2026 — Suno · Udio · Stable Audio · MusicGen · Mubert · ElevenLabs · Lyria, 어디까지 왔나

프롤로그 — 2년 만에 무엇이 바뀌었나

2023년 여름, AI로 만든 음악은 장난감이었다. 한 줄짜리 멜로디, 어색한 박자, 보컬은 없거나 아예 알아들을 수 없었다. Meta의 MusicGen이 오픈소스로 공개됐을 때 사람들은 "재밌네"라고 말했지 "이걸로 곡을 쓰겠다"고 말하진 않았다.

2024년 봄, Suno가 v3를 내고 Udio가 베타를 열었을 때 분위기가 달라졌다. 텍스트 한 줄로 보컬이 있는 2분짜리 곡이 나왔다. 어색한 곳이 있었지만 "어, 이거 진짜네" 소리가 처음 나왔다. 같은 해 6월, RIAA(미국음반산업협회)는 Suno와 Udio를 상대로 대규모 저작권 침해 소송을 제기했다. 산업이 진지하게 본다는 증거였다.

2026년 5월 지금, 풍경이 다시 바뀌었다. Suno는 v5.5에서 사용자의 목소리를 클로닝하고 개인화 모델을 제공한다. Udio는 Universal · Warner · Kobalt · Merlin과 차례로 라이선싱 합의를 했다. Google은 Riffusion의 후신 ProducerAI를 인수해 Lyria 3로 통합했다. ElevenLabs는 음성 대신 음악으로 카테고리를 확장했다. 오픈소스 쪽에서는 YuE, ACE-Step, DiffRhythm처럼 보컬까지 다루는 풀송 모델이 4090 한 장으로 돌아간다.

그런데도 — 보컬은 여전히 가장 어렵다. 한국어 가사는 영어보다 어색하다. 4분 이상은 일관성이 무너진다. 상업적으로 안전한 출력을 보장하는 모델은 아직 손에 꼽는다. 그리고 RIAA의 Suno 소송은 2026년 7월에 약식 판결 심리가 잡혀 있다.

이 글은 그 풍경을 정리한다. 어떤 도구가 어떤 일에 맞는지, 보컬이 왜 어려운지, 오픈소스가 어디까지 왔는지, 소송이 어떻게 흘러가는지, 인디 게임 사운드트랙·팟캐스트 인트로·유튜브 BGM·작곡 아이디에이션에서 실제로 어떻게 쓰는지를 본다. AI가 음악을 망친다는 식의 글도, AI가 음악을 구원한다는 식의 글도 아니다.

핵심 한 줄: 2026년의 AI 음악은 "사람을 대체"가 아니라 "이전엔 못 만들던 사람이 만들기 시작"의 도구다. 그 경계를 알면 선택이 쉬워진다.

1장 · 카테고리의 탄생 — 2023~2024년 사이에 무슨 일이 있었나

1.1 두 갈래의 기원

AI 음악 생성은 두 갈래의 기술 계보가 합쳐진 결과다.

갈래 1: 자기회귀 토큰 모델. 텍스트 LLM처럼 오디오를 토큰화해 다음 토큰을 예측한다. Meta의 MusicGen(2023년), Google의 MusicLM(2023년), Suno의 초기 버전이 이 계열이다. 학습은 EnCodec 같은 뉴럴 오디오 코덱으로 오디오를 압축해 토큰으로 만든 뒤, 트랜스포머가 그 시퀀스를 학습한다.

갈래 2: 디퓨전 기반 오디오. 이미지 디퓨전(Stable Diffusion)의 아키텍처를 오디오에 적용한다. Stable Audio(Stability AI)가 대표적이다. Riffusion은 더 영리한 트릭을 썼다 — 오디오를 스펙트로그램(주파수 이미지)으로 바꾼 뒤 이미지 디퓨전을 돌렸다. 결과 이미지를 다시 오디오로 되돌리면 음악이 나온다.

2024년 들어 두 갈래가 섞이고 보컬 합성이 결합된다. Suno와 Udio의 진짜 도약은 "텍스트에서 보컬과 가사가 있는 풀송"을 만들었다는 점이다. 그 이전까지는 거의 모두 인스트루멘털(반주)이었다.

1.2 왜 갑자기 좋아졌나

세 가지 변수가 동시에 움직였다.

데이터. 라이선싱된 대형 음악 카탈로그(혹은 — 소송이 주장하는 대로 — 스크래핑된 카탈로그)를 학습에 쓸 수 있게 됐다. MusicGen은 약 20,000시간의 라이선싱된 음악으로 학습됐다.
컴퓨트. H100/H200 클러스터로 멀티빌리언 파라미터 오디오 모델을 합리적 시간 안에 학습할 수 있게 됐다.
아키텍처. 뉴럴 오디오 코덱(EnCodec, SoundStream)이 오디오를 LLM처럼 다룰 수 있는 토큰으로 압축하는 길을 열었다.

이 셋이 갖춰지자 텍스트 LLM이 한 일 — "그럴듯한 다음 토큰 예측" — 이 음악에도 가능해졌다.

1.3 RIAA의 폭탄 — 2024년 6월

2024년 6월 24일, 미국음반산업협회(RIAA)는 Universal · Warner · Sony를 대리해 Suno(매사추세츠 연방 지방법원)와 Udio(뉴욕 남부 지방법원)를 상대로 두 건의 저작권 침해 소송을 제기했다. 핵심 주장은 "허가 없이 저작권 보호된 음반을 학습 데이터로 썼다"는 것이다. 양사의 변호는 "변혁적 공정 이용(transformative fair use)"이다.

이 소송은 단순한 분쟁이 아니다. AI 음악 카테고리 전체의 상업적 운명을 결정한다. 학습 데이터가 위법이라는 판결이 나오면 모델 재학습이 필요해지고, 출력의 라이선싱 구조 자체가 바뀐다. 그래서 2025년 말부터 합의의 물결이 시작됐다.

2장 · 컨슈머 도구 — Suno · Udio · Lyria · ElevenMusic

2.1 Suno — 카테고리 리더

2026년 5월 시점에서 가장 많이 쓰이는 텍스트-투-송 도구는 Suno다. v3(2024 초), v4(2025), v5(2025 말), v5.5(2026년 3월 26일)로 빠르게 진화했다.

v5.5의 핵심은 세 가지다.

Voices. 사용자가 자기 목소리를 30초 정도 녹음해 등록하면 AI가 그 음색으로 노래한다. Pro · Premier 구독자 한정. 기본은 비공개.
Custom Models. 자기 카탈로그(예: 본인이 만든 곡들)를 업로드해 v5.5를 그 스타일로 파인튠한다. 최대 3개까지.
Studio. 보컬 · 베이스 · 드럼 · 하모니 · 인스트루멘트의 스템(stem)을 분리 트랙으로 받는다. DAW로 들고 가서 후처리할 수 있다.

품질은? 영어 가사, 팝/록/일렉트로닉/포크 같은 메인스트림 장르라면 처음 듣는 사람은 사람이 만들었다고 믿을 수준이다. 한국어 가사는 발음과 운율이 어색해진다(2025년부터 꾸준히 좋아지지만 영어보단 약하다). 재즈 솔로 즉흥이나 클래식 오케스트레이션처럼 구조가 복잡한 장르는 아직 약하다.

상업적 라이선싱은 Pro 구독 이상에서 명시적으로 허용된다. 다만 RIAA 소송이 진행 중인 이상 "100% 안전"을 광고하긴 어렵다.

2.2 Udio — 사운드의 다른 미학

Udio는 Google DeepMind 출신 연구자들이 2023년 12월에 창업한 회사다. CEO는 David Ding. 시드 라운드($10M, 2024년 4월)에 Andreessen Horowitz가 리드했고, Instagram 공동창업자 Mike Krieger, will.i.am, Common 같은 음악계 인사가 참여했다.

Udio의 결과물은 Suno와 미묘하게 다르다. 일반적으로 Suno가 더 "팝적"이고 매끈하다면, Udio는 좀 더 "프로듀서가 다듬은 트랙" 같은 느낌을 준다. 힙합, R&B, 라틴, 일렉트로닉 분야에서 더 좋은 평가를 받았다.

2025년 10월 29일, Universal Music Group이 Udio와 합의했다. 합의금 + 2026년 출시 예정인 공동 AI 음악 플랫폼 라이선싱 딜이 포함됐다. 11월 25일에는 Warner도 합의했다(수백만 달러 + Songkick의 Suno로의 매각이 포함된 패키지). 이후 Kobalt, Merlin도 차례로 라이선싱 합의를 했다. 2026년 5월 시점에 Udio를 상대로 적극적으로 소송 중인 메이저는 Sony뿐이다.

2.3 Lyria 3 (Google DeepMind)

Google은 두 갈래로 움직였다.

Lyria 자체 모델. Lyria 2(2025년 5월)에서 Lyria 3(2026년 2월 18일)으로 갔다. 48kHz 스테레오, 최대 3분, 스펙트로그램이 아니라 오디오 토큰을 직접 다룬다. SynthID 워터마킹 의무 적용. Vertex AI · Gemini API로 접근한다.

Riffusion 인수. 2026년 2월 24일, Google은 ProducerAI(이전 Riffusion)를 인수했다. ProducerAI는 1백만 사용자를 보유한 대화형 음악 생성 에이전트였다. 인수 후 Lyria 3와 통합됐다. 즉 Riffusion이라는 스펙트로그램 디퓨전 계보는 Lyria 3 안으로 흡수됐다.

2.4 Lyria RealTime — 다른 사용 모델

Lyria RealTime은 별도로 봐야 한다. "한 곡을 생성"이 아니라 "스트리밍 오디오를 라이브로 컨트롤"하는 모델이다. 스타일, 박자, 분위기를 실시간으로 조정하면서 무한 음악을 만든다. 라이브 스트리밍, 게임 BGM, 인터랙티브 인스털레이션이 주 용도다. Gemini API로 접근.

2.5 ElevenMusic (ElevenLabs)

음성 합성으로 알려진 ElevenLabs는 2025년 8월 5일 Eleven Music을 출시했고, 2026년 4월 1일 iOS 앱과 함께 ElevenMusic으로 정식 플랫폼화했다.

차별점은 라이선싱이다. Merlin Network, Kobalt Music Group, SourceAudio와 학습 데이터 라이선스를 사전에 체결했다. 즉 "상업적 사용에 깨끗하다"고 마케팅한다. RIAA 진영의 메이저 카탈로그를 학습에 쓰지 않았다는 점이 핵심이다.

기능적으로는 길이 조절, 가사 유무 선택, 기존 곡 리믹스(장르 · 템포 변경)가 된다. 무료 티어는 하루 7곡까지. ElevenLabs의 보이스 합성과 결합하면 보컬 음색을 더 세밀히 제어할 수 있다.

2.6 비교 — 컨슈머 도구

도구	보컬 품질	인스트루멘털	한국어 가사	길이	상업 라이선스	주 사용처
Suno v5.5	매우 높음	높음	보통	최대 8분	Pro 이상 명시 허용	송라이팅, 콘텐츠
Udio	높음	매우 높음	보통	최대 4분+	Standard 이상 허용	프로듀싱, 힙합/R&B
Lyria 3	중간(가사 비중 낮음)	매우 높음	약함	최대 3분	Vertex AI 약관	엔터프라이즈 통합
ElevenMusic	높음	높음	미평가	최대 5분	명시적 클리어	콘텐츠 크리에이터
Lyria RealTime	미지원	높음	해당 없음	무한 스트리밍	API 약관	게임/라이브

3장 · 오픈소스 / 로컬 옵션 — MusicGen · Stable Audio · YuE · ACE-Step

3.1 왜 오픈소스인가

세 가지 이유다.

비용. 구독료 없이 무제한 생성. 로컬 4090 한 장으로 돌아간다.
프라이버시. 가사나 콘셉트가 외부 서버에 안 올라간다. 미공개 프로젝트에 중요.
통제. 파인튜닝, 시드 고정, 배치 생성, 자동화 파이프라인이 가능하다.

대신 — 품질은 컨슈머 도구보다 한 박자 뒤다. 그리고 라이선스를 잘 봐야 한다.

3.2 MusicGen (Meta, 2023)

오픈소스 AI 음악의 시작점. 2023년 8월에 AudioCraft 프레임워크의 일부로 공개됐다. 텍스트 → 인스트루멘털 음악.

파라미터. 300M, 1.5B, 3.3B의 세 가지 크기. 3.3B는 16GB 이상 VRAM 권장.
데이터. Meta가 소유하거나 라이선싱한 약 20,000시간의 음악.
라이선스. 모델 가중치는 CC BY-NC 4.0 — 비상업적 사용만. 이 점이 자주 오해된다. 자가 호스팅한다고 상업적으로 쓸 수 있는 게 아니다.
2026년 상태. 2024년 이후 의미 있는 업데이트가 없다. 품질이 Suno/Udio에 명확히 뒤진다. 그리고 보컬은 못 만든다.

여전히 가치는 있다. "공부용", "오프라인 실험", "비상업 프로젝트", "다른 모델의 비교 베이스라인"으로 좋다.

3.3 Stable Audio 2.5 / Stable Audio Open

Stability AI의 두 라인을 구별해야 한다.

Stable Audio 2.5. 상업 SaaS. 최대 3분, 복잡한 구조(인트로 · 전개 · 아웃트로) 지원. 무드 프롬프트("uplifting", "lush synthesizers")에 더 잘 반응한다. 사운드 효과, 광고 음악, 영상 트랙에 강점이 있다.

Stable Audio Open. 오픈소스. 일반 버전은 최대 47초. Stable Audio Open Small(341M, Arm과 협업)은 스마트폰 CPU에서 8초 이하로 11초 오디오를 생성한다. 라이선스는 Stability AI Community License — 상업/비상업 모두 허용된다.

Stable Audio Open은 풀송보다는 사운드 디자인(짧은 효과음, 루프, 텍스처, 폴리)에 강하다.

3.4 YuE — 오픈소스 풀송 모델

YuE는 2025년에 등장한 오픈소스 풀송 보컬 모델이다. 라이선스는 Apache 2.0(상업 가능). MusicGen에는 없는 "텍스트와 가사 → 보컬이 있는 풀송"이 된다.

하드웨어. 24GB VRAM 권장. 양자화 버전은 8~16GB도 가능. 4090에서 30초 생성에 약 360초.
최적화 분기. DeepBeepMeep 등의 GPU-poor 분기가 있어 1분 곡을 4090에서 4분에 만든다.
라이선스. Apache 2.0 — 상업적 사용 가능. 오픈소스 음악 모델 중에서 라이선스가 가장 깨끗한 편.

품질은 Suno v5와 어깨를 나란히 하진 않지만, "오픈소스 + 상업 가능 + 보컬"이라는 삼박자가 처음으로 갖춰진 모델이다.

3.5 ACE-Step 1.5 — 또 다른 로컬 강자

ACE-Step 1.5는 Mac, AMD, Intel, CUDA 디바이스를 모두 지원한다는 점이 차별점이다. M-시리즈 Mac에서도 돌아간다는 게 크다. 음악 생성 + 보컬 + 적당한 품질의 균형이 좋아 "2026년 로컬 음악 출발점"으로 자주 추천된다.

3.6 비교 — 오픈소스 / 로컬

모델	보컬	라이선스	최소 VRAM	길이	강점
MusicGen 3.3B	미지원	CC BY-NC 4.0(비상업)	16GB	30초	학습용, 베이스라인
Stable Audio Open	미지원	Stability Community	8GB	47초	사운드 디자인
YuE	지원	Apache 2.0	24GB(권장)	1~5분	풀송, 상업 가능
ACE-Step 1.5	지원	Apache 2.0	12~24GB	풀송	멀티 플랫폼
DiffRhythm	지원	오픈소스	16GB	풀송	빠른 추론

4장 · 사용처 — AI 음악이 실제로 통하는 곳

4.1 인디 게임 사운드트랙

가장 잘 작동하는 분야 중 하나다. 이유는 단순하다. 인디 게임은 보통 10~30곡이 필요한데, 작곡가에게 다 의뢰하면 $10,000~$50,000이고, 라이선싱 라이브러리로 채우면 다른 게임과 음악이 겹친다.

AI 음악의 강점이 정확히 여기에 들어맞는다.

양. 한 시간에 수십 곡 생성, 마음에 드는 것만 골라 쓴다.
고유성. 라이브러리 음악과 달리 다른 게임에 같은 트랙이 안 깔린다.
반복 가능성. 같은 무드의 변주가 필요할 때 시드 · 프롬프트를 살짝 바꿔 비슷한 곡을 더 만든다.
루프 친화. 게임 BGM은 어차피 루프된다. 4분 풀송이 필요 없다.

워크플로우(실제 인디 스튜디오 사례).

1. 게임의 무드 시트 작성: "neon-lit cyberpunk alley, tense but melancholy, 100 BPM"
2. Suno/Udio에서 10~20곡 생성, 후보 추리기
3. 마음에 드는 트랙 1~2개의 스템(stem) 분리
4. DAW에서 BPM/키 맞춤, 루프 포인트 만들기
5. 게임 엔진(Unity/Unreal)에 .ogg/.wav로 임포트
6. 적응형 음악 시스템(FMOD/Wwise)에 인터랙티브 레이어 구성

주의점: AI 출력의 라이선스가 게임 배포(스팀, 콘솔)에 맞는지 반드시 확인한다. Suno Pro 이상이거나 ElevenMusic 같은 클리어 모델을 쓰는 게 안전하다.

4.2 팟캐스트 인트로 · 아웃트로

15초~30초 분량의 시그니처 사운드. AI 음악의 단점(긴 일관성)이 거의 안 드러나는 영역이다.

워크플로우.

프롬프트로 무드와 장르 지정("upbeat tech podcast intro, synth-driven, 20 seconds, fade-out")
10~20개 생성, 1개 선택
보이스오버에 맞춰 다듬기
모든 에피소드에 동일 트랙 사용 — "브랜드 사운드"가 된다

비용: Suno Pro $10/월이면 충분히 커버된다. 작곡가 외주($300~$1,000)와 비교하면 미미한 비용이다.

4.3 YouTube · 숏폼 BGM

여기서는 Mubert가 특히 강하다. Mubert는 텍스트-투-송이 아니라 무드 기반 무한 트랙 생성이다. 25분짜리 백그라운드 음악, 25개 변주 등을 빠르게 만든다. 로열티-프리 라이선스가 명확하다. 음악가가 자기 샘플 팩을 업로드하면 80%를 분배받는 구조라 학습 데이터의 소스도 비교적 깨끗하다.

YouTuber 입장에서 Mubert의 이점은 "Content ID 클레임 안 걸린다"는 점이다. 보컬이 들어간 Suno 트랙도 클레임은 잘 안 걸리지만, Mubert는 그 부분이 가장 분명하다.

4.4 작곡 아이디에이션

프로 작곡가/송라이터가 의외로 가장 적극적이다. 사용법은 두 가지다.

모티프 생성. "이런 코드 진행에 이런 보컬 멜로디가 어떨까"를 빠르게 시도한다. 결과물을 그대로 쓰지 않고, 아이디어만 가져다 자기 곡에 녹인다.

가이드 트랙. 가사를 먼저 쓰고 AI로 데모를 만든다. 그 데모를 들으면서 "이 부분은 좋고 이 부분은 다르게" 같은 판단을 한다. 그러고 나서 진짜 곡으로 다시 만든다. 즉 AI 음악이 MVP처럼 작동한다.

핵심 마인드셋: AI 출력을 최종 결과물이 아니라 디자인 도구로 쓴다. 거장 곡이 안 나오는 게 당연하고, "아이디어 발생기"라는 위치가 정확하다.

4.5 작동하지 않는 영역

같은 정도로 솔직하게.

고급 클래식 작곡. 4성부 푸가, 소나타 형식 같은 구조적 음악은 아직 약하다.
실시간 라이브 공연 대체. 라이브의 에너지를 못 만든다.
재즈 임프로비제이션. 일관된 모티프 발전이 안 된다.
상업적으로 큰 IP. 메이저 영화 사운드트랙, 상업광고 메인 트랙에는 아직 무리(품질이 아니라 법적 안전성 때문).
개성 있는 보컬 캐릭터. 사용자의 목소리를 클로닝하는 Suno Voices 정도가 한계.

5장 · 품질의 현실 — 보컬이 가장 어렵다

5.1 왜 보컬이 어려운가

오디오 생성에서 가장 어려운 두 가지는 (a) 길이 일관성, (b) 보컬이다. 보컬은 특히 어렵다 — 이유는 여러 층에 걸쳐 있다.

음운 · 발음. 사람 목소리는 50ms 단위로 음운(phoneme)이 바뀐다. 모델이 가사 텍스트를 받아 그걸 오디오 토큰의 발음 시퀀스로 매핑해야 한다. 영어는 학습 데이터가 풍부하니 잘 되지만, 한국어, 일본어, 아랍어 같은 언어는 오디오 데이터가 비교적 적다.

프로소디(억양). "사랑해"라는 단어를 슬프게 vs 신나게 부르면 다르다. 모델이 가사 의미와 곡 분위기를 결합해 억양 곡선을 만들어야 한다.

음정 안정성. 사람 가수는 음정을 ±10센트 정도로 안정시킨다. AI는 가끔 ±50센트까지 흔들린다. 듣기에 "어색"하다.

발음 인텔리지빌리티. 가사를 알아들을 수 있어야 한다. 보컬은 멜로디만 만들면 끝이 아니라, 글자가 들려야 한다. 어려운 자음 클러스터(예: "strengths")에서 모델이 자주 흐릿해진다.

5.2 한국어 가사의 추가 문제

한국어는 영어 학습 데이터의 1/10~1/20 수준이다. 결과:

받침 발음이 어색하다(특히 ㄹ, ㅇ).
영어식 보컬 스타일이 한국어에 강제된다(자음을 끊지 않고 흘려보냄).
가사의 자연스러운 운율을 못 살린다.

대응법: (a) Suno의 v5.5에서 한국어 출력이 v4보다 명확히 좋아졌다. (b) "korean ballad", "k-pop", "trot" 같은 명시적 스타일 태그가 도움이 된다. (c) 정 어색하면 영어 가사로 만든 뒤 후처리에서 보컬을 한국어로 다시 녹음한다.

5.3 인스트루멘털은 의외로 잘 된다

반대로 인스트루멘털은 2025년 후반부터 거의 사람 수준이다. 일렉트로닉, 신스 팝, 로파이, 시네마틱 스코어, 앰비언트 — 이 분야는 듣고 구별이 거의 불가능하다. 그래서 게임/팟캐스트/유튜브 BGM에서 가장 먼저 폭발했다.

5.4 길이 일관성

3분이 넘어가면 모델이 "이 곡이 어디로 가는지"를 잃기 시작한다. 정확히는:

모티프 망각. 1분에 등장한 멜로디 후크가 3분에 사라진다.
구조 흐려짐. verse-chorus-bridge 구조가 길이가 늘수록 무너진다.
퀄리티 드리프트. 4분 이후 갑자기 보컬이 거칠어지거나 믹스가 변한다.

대응: (a) 짧게 만들어 DAW에서 이어 붙이기, (b) Suno의 "Extend" 기능으로 부분씩 연장, (c) 5분 이상은 그냥 인스트루멘털로 가기.

6장 · 소송과 저작권 논쟁 — 정직하게

6.1 무엇이 쟁점인가

RIAA 소송의 핵심은 두 가지다.

학습 데이터 사용. "허가 없이 저작권 음반을 학습에 썼다." 양사는 "변혁적 공정 이용"으로 변호.
출력의 유사성. Suno와 Udio가 학습 데이터의 특정 곡을 거의 그대로 재현할 수 있는 사례가 있다는 주장.

법적 쟁점은 결국 "AI 학습이 저작권법의 공정 이용 4요소(이용의 목적과 성격, 저작물의 성질, 이용된 분량, 시장에 미치는 영향)를 통과하는가"다.

6.2 2026년 5월 현재 상태

Suno. Universal · Warner · Sony 모두와 매사추세츠 연방 지방법원에서 공정 이용을 다투고 있다. Suno는 2026년 3월 약식 판결 신청을 냈고, 핵심 심리가 2026년 7월로 예정돼 있다. 인용한 선례는 2025년 캘리포니아 북부 연방 지방법원의 Bartz v. Anthropic 결정(AI 학습을 변혁적 사용으로 인정한 판례)이다.

Udio. Universal(2025년 10월), Warner(2025년 11월), Kobalt, Merlin과 차례로 라이선싱 합의했다. Sony만 적극 소송 중. Universal과는 2026년에 출시할 공동 AI 음악 플랫폼 계약이 포함됐다.

독립 아티스트. 2025년 10월, 메이저와는 별도로 독립 음악가 집단이 Suno와 Udio를 상대로 클래스 액션을 제기했다.

6.3 결과가 어떻게 나오든

세 가지 시나리오를 본다.

시나리오 A — Suno 승소(공정 이용 인정). AI 학습이 합법화된다. 모든 AI 모델이 비슷한 방어를 쓴다. 음악 산업은 별도 라이선싱 시장(예: Universal-Udio 합작 플랫폼)으로 옮겨간다. 사용자 입장에서는 가장 자유롭다.

시나리오 B — Suno 패소(라이선스 필요 판결). Suno는 라이선스 합의를 강제당하거나 모델 재학습이 필요해진다. 비용이 폭증하고 구독료가 오른다. 신규 진입자는 라이선스 없이는 시작 자체가 어려워진다. ElevenMusic 같은 "사전 라이선싱" 모델이 우위에 선다.

시나리오 C — 합의로 끝남. 가장 가능성 높은 시나리오. Universal-Udio 모델처럼 메이저와 합의 + 라이선싱 + 수익 공유 구조가 표준이 된다. 산업 전체가 그 방향으로 정렬된다.

6.4 사용자가 할 일

무엇을 해도 안전한 사용: Suno/Udio Pro 이상 구독, 출력에 대한 상업 사용권 명시된 플랜, 가능하면 메이저 아티스트 스타일을 명시적으로 흉내내지 않기.

더 안전한 사용: ElevenMusic처럼 "사전 라이선싱된 데이터로 학습됐다"는 입증이 있는 모델, 또는 YuE/ACE-Step 같은 Apache 2.0 오픈소스 모델 + 로컬 실행.

피할 것: 특정 아티스트의 목소리를 흉내내려는 프롬프트("in the style of [유명 가수]"), 그 출력을 상업적으로 배포. 이건 명백한 위험.

7장 · 의사결정 프레임 — 무엇을 골라야 하나

7.1 "내 상황 → 추천 도구" 표

상황	1순위	2순위	메모
송라이팅 데모 만들기	Suno v5.5	Udio	보컬 품질 우선
인디 게임 BGM	Suno Pro	Mubert	스템 분리 가능한 곳
팟캐스트 인트로	Suno	ElevenMusic	30초면 어디든 됨
YouTube 백그라운드	Mubert	Stable Audio 2.5	무드 기반 무한 트랙
광고 트랙(상업)	ElevenMusic	Stable Audio 2.5	라이선스 클린 우선
게임 라이브 BGM	Lyria RealTime	(대안 거의 없음)	실시간 컨트롤
로컬/프라이빗 실험	YuE	ACE-Step	데이터 외부 유출 X
사운드 디자인(짧은 효과음)	Stable Audio Open	(DAW 플러그인)	11초~47초
학생 학습/연구	MusicGen	YuE	비상업 OK
한국어 가사 곡	Suno v5.5	Udio	어쩔 수 없이 보컬 후처리

7.2 결정 트리

시작
 │
 ├─ 보컬이 필요한가?
 │   ├─ 아니오 → Mubert / Stable Audio / MusicGen / Lyria RealTime
 │   └─ 예 ↓
 │
 ├─ 상업적으로 쓸 건가?
 │   ├─ 아니오(연구/학습) → 무엇이든 OK, MusicGen 포함
 │   └─ 예 ↓
 │
 ├─ 라이선스 클린함이 최우선인가?
 │   ├─ 예 → ElevenMusic 또는 YuE/ACE-Step 자가 호스팅
 │   └─ 아니오 ↓
 │
 ├─ 한국어/영어 외 가사인가?
 │   ├─ 예 → Suno v5.5 우선, 결과 후처리 예상
 │   └─ 아니오 ↓
 │
 ├─ 어떤 미학을 원하나?
 │   ├─ 팝/일렉트로닉 매끈함 → Suno
 │   ├─ 힙합/R&B/프로듀서 톤 → Udio
 │   └─ 엔터프라이즈/Vertex AI 통합 → Lyria 3

7.3 예산별 가이드

예산	추천
$0/월	MusicGen + 4090 또는 클라우드 GPU. Suno 무료 티어 일 5곡.
$10/월	Suno Pro 단독. 대부분의 콘텐츠 크리에이터에게 충분.
$30/월	Suno Pro + Udio Standard + Mubert. 풍부한 미학 선택.
$100+/월	Suno Premier + ElevenMusic + Stable Audio 2.5. 상업 프로덕션.
$1,000+	자체 4090 박스 + YuE 자가 호스팅 + 구독 조합. 스튜디오/게임팀.

에필로그 — 체크리스트, 안티패턴, 다음 글 예고

AI 음악은 2023년의 "재밌네"에서 2026년의 "이걸로 작품을 만든다"로 갔다. 그 변화의 핵심은 보컬이 보컬처럼 들리기 시작했고, 길이가 노래처럼 길어졌고, 미학이 장르마다 다르게 익숙해졌다는 점이다. 동시에 — 한국어 보컬, 4분 이상 일관성, 상업적 라이선스의 안전성은 아직 풀리지 않은 문제로 남아 있다. 2026년 7월의 Suno 약식 판결 심리가 카테고리 전체의 다음 1년을 결정할 가능성이 크다.

도구 선택 체크리스트

보컬이 필요한가? — 필요 없으면 Mubert/Stable Audio가 훨씬 안전한 선택
상업적으로 쓰는가? — Pro 이상 구독, 명시적 라이선스, 영구권 확인
언어가 영어인가? — 다른 언어면 후처리 단계 + 보컬 재녹음 예산을 잡아둠
길이는 몇 분인가? — 3분 이상은 Extend/조합으로 풀거나 인스트루멘털만
장르 미학이 무엇인가? — Suno(팝), Udio(힙합/R&B), Lyria(엔터프라이즈)
출력에 스템 분리가 필요한가? — Suno Studio가 거의 유일하게 강함
온라인 의존이 부담인가? — YuE/ACE-Step 로컬 실행 검토
워크플로우가 반복적인가? — Mubert API, Suno API, Lyria RealTime API 활용
저작권 안전성 우선인가? — ElevenMusic 또는 자가 학습 데이터 명시 모델
AI 출력을 최종이 아닌 초안으로 쓸 준비가 됐는가? — 가장 본질적인 질문

안티패턴

안티패턴	왜 나쁜가	대신
첫 번째 생성을 그대로 쓰기	평균 품질이 낮음	10~20개 생성 후 큐레이션
유명 아티스트 이름을 프롬프트에 직접	라이선스 회색지대, Content ID 위험	"in the style of late-80s synth-pop" 같은 추상 묘사
한국어 곡을 영어 학습 가정대로 평가	발음 어색함을 모르고 출시	모국어 화자 1명 이상 검수
무료 티어로 상업 출시	라이선스 위반	최소 Pro 구독
4분 풀송을 한 번에 받기	후반 일관성 무너짐	짧게 받아 이어 붙이거나 Extend
MusicGen 출력을 상업 광고에 사용	CC BY-NC 4.0 위반	YuE/ACE-Step 또는 컨슈머 도구
보컬 인텔리지빌리티 안 점검	가사 안 들리는 곡 출시	외부 청자 3명에게 가사 듣게 함
Lyria 3을 무료 도구로 기대	Vertex AI 비용 구조 모름	단가 계산기로 분당 비용 확인
AI 출력에 "내가 작곡"이라고 표기	표시 의무/저작권 논쟁 위험	"AI 보조 작곡" 명시
단일 모델만 의존	한 모델 출력의 한계가 곧 작품의 한계	2~3 모델을 미학별로 분리 사용

다음 글 예고

다음 글은 **"AI 비디오 생성 2026 — Sora · Veo · Runway · Pika · Kling, 그리고 그것들이 실제 어떻게 다른가"**다. 음악과 같은 패턴으로, 카테고리의 폭발(2024 Sora 데모)과 성숙(2026의 상용 도구들), 보컬에 해당하는 가장 어려운 영역(긴 일관성, 캐릭터 동일성, 손가락), 오픈소스 옵션(Open-Sora, Mochi, Wan 등), 사용처(광고, 짧은 영상, 콘셉트 비주얼), 그리고 저작권 논쟁(NYT-OpenAI, Disney 라이선싱 모델)을 같은 깊이로 다룰 예정이다.

AI Music Generation 2026 — Suno, Udio, Stable Audio, MusicGen, Mubert, ElevenLabs, Lyria — Where Are We Really?

Prologue — What Has Changed in Two Years

Summer 2023: AI-generated music was a toy. One-bar melodies, awkward rhythms, vocals either absent or unintelligible. When Meta open-sourced MusicGen, the reaction was "neat" rather than "I'll write a song with this."

Spring 2024: Suno shipped v3, Udio opened its beta, and the mood shifted. A single text prompt produced a two-minute song with actual vocals. Rough in places, but for the first time people said "wait, this is real." Three months later, in June 2024, the RIAA sued both Suno and Udio for massive copyright infringement. Industry attention had arrived in earnest.

May 2026: the landscape has shifted again. Suno v5.5 clones a user's voice and supports personal fine-tunes. Udio has signed licensing settlements with Universal, Warner, Kobalt, and Merlin in sequence. Google acquired Riffusion's successor ProducerAI and folded it into Lyria 3. ElevenLabs expanded from voice into music. On the open-source side, YuE, ACE-Step, and DiffRhythm offer full-song models with vocals that run on a single RTX 4090.

And yet — vocals are still the hardest part. Korean lyrics still sound less natural than English. Anything past four minutes loses coherence. Models with airtight commercial licensing are still rare. The Suno summary judgment hearing is set for July 2026.

This post maps that landscape: which tool fits which job, why vocals are difficult, where open source stands, how the lawsuits are unfolding, and what real workflows look like for indie game soundtracks, podcast intros, YouTube BGM, and songwriting ideation. It is neither an "AI is killing music" piece nor an "AI is saving music" piece. It covers the middle ground where actual practitioners work.

One-line take: 2026 AI music is not about "replacing humans" but about "people who couldn't make music starting to make music." Knowing that boundary makes the tool choice easy.

1 · The Birth of the Category — What Happened in 2023–2024

1.1 Two Technical Lineages

AI music generation is the merger of two technical lineages.

Lineage 1: Autoregressive token models. Like text LLMs, these models tokenize audio and predict the next token. Meta's MusicGen (2023), Google's MusicLM (2023), and Suno's early versions belong here. A neural audio codec such as EnCodec first compresses audio into discrete tokens, and a transformer is then trained on those token sequences.
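To make the tokenization step concrete, here is a rough sketch of how many codec tokens a clip turns into. The numbers are assumptions for illustration: a codec emitting 50 frames per second with 4 codebooks per frame on 32 kHz audio, which is the configuration commonly reported for MusicGen. Other models use different rates, and the function names are hypothetical.

```python
def codec_token_count(duration_s: float, frame_rate_hz: int = 50, codebooks: int = 4) -> int:
    """Discrete tokens a codec emits for a clip (assumed frame rate and codebook count)."""
    frames = round(duration_s * frame_rate_hz)
    return frames * codebooks


def raw_sample_count(duration_s: float, sample_rate_hz: int = 32_000) -> int:
    """Raw PCM samples in the same clip (mono, assumed sample rate)."""
    return round(duration_s * sample_rate_hz)


clip_seconds = 30
tokens = codec_token_count(clip_seconds)
samples = raw_sample_count(clip_seconds)

# The transformer models the short token sequence, not the raw waveform.
print(f"{samples} samples -> {tokens} tokens ({samples // tokens}x shorter)")
```

Under these assumptions the sequence the transformer has to model is about two orders of magnitude shorter than the waveform, which is what makes LLM-style training practical for audio.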

Lineage 2: Diffusion-based audio. These models apply image-diffusion architectures (as in Stable Diffusion) to audio. Stability AI's Stable Audio is the canonical example. Riffusion used a clever trick: convert audio to a spectrogram (an image of frequency content over time), run image diffusion on it, then convert the result back to audio.
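The "audio as an image" idea can be sketched with a tiny magnitude spectrogram. This is a minimal, standard-library illustration, not Riffusion's pipeline: it only shows the forward step (waveform to a 2D grid of magnitudes) and skips both the diffusion model and the inversion back to audio. Frame size, hop, and the test tone are arbitrary assumptions.

```python
import cmath
import math


def stft_magnitudes(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes (a tiny spectrogram)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window reduces leakage between neighbouring frequency bins.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        column = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            column.append(abs(acc))
        frames.append(column)
    return frames


sample_rate = 8000
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = stft_magnitudes(samples)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64

print(f"image size: {len(spec)} frames x {len(spec[0])} bins, brightest row near {peak_hz:.0f} Hz")
```

A pure tone shows up as one bright horizontal line in this grid. An image model that learns to paint plausible grids is, in effect, learning to paint plausible sound.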

By 2024 the two lineages cross-pollinated and vocal synthesis was bolted on. The real leap for Suno and Udio was producing a "full song with vocals and lyrics from text" — until then, almost everything was instrumental backing only.

1.2 Why Quality Jumped Suddenly

Three variables moved at once.

Data. Access to large licensed music catalogs (or — as the lawsuits allege — scraped catalogs) became viable for training. MusicGen alone was trained on roughly 20,000 hours of licensed music.
Compute. H100/H200 clusters made training multi-billion-parameter audio models feasible in reasonable time.
Architecture. Neural audio codecs like EnCodec and SoundStream opened the door to handling audio as LLM-style tokens.

With those three in place, the trick that worked for text LLMs — "predict the next plausible token" — started working for music.
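As a toy illustration of "predict the next plausible token," the sketch below fits a bigram table over a short sequence of made-up token ids and then generates greedily. Real systems use transformers over codec tokens and sample rather than always taking the top choice. The token values and function names here are hypothetical.

```python
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def generate(table, start, length):
    """Greedy decoding: always append the most frequent follower."""
    out = [start]
    for _ in range(length - 1):
        followers = table.get(out[-1])
        if not followers:
            break  # no known continuation for this token
        out.append(followers.most_common(1)[0][0])
    return out


# Made-up "codec token" ids standing in for a short repeating phrase.
training = [3, 7, 2, 3, 7, 2, 3, 7, 5]
table = fit_bigram(training)
melody = generate(table, start=3, length=6)
print(melody)
```

The model reproduces the repeating phrase because that is the statistically likely continuation. Scaling the same idea to billions of parameters and hours of audio tokens is what the three variables above made possible.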

1.3 The RIAA Bomb — June 2024

On June 24, 2024, the Recording Industry Association of America announced two copyright infringement suits brought on behalf of Universal, Warner, and Sony: one against Suno in the District of Massachusetts and one against Udio in the Southern District of New York. The core claim: "trained on copyrighted recordings without permission." The defense from both companies: "transformative fair use."

This is not an isolated dispute. It will decide the commercial fate of the entire AI music category. If the training data is ruled infringing, model retraining is required and the licensing structure for outputs changes fundamentally. That is why the wave of settlements started arriving in late 2025.

2 · Consumer Tools — Suno, Udio, Lyria, ElevenMusic

2.1 Suno — The Category Leader

As of May 2026, the most-used text-to-song tool is Suno. The progression: v3 (early 2024), v4 (2025), v5 (late 2025), v5.5 (March 26, 2026).

Three pillars in v5.5:

Voices. Users record about thirty seconds of their own singing voice, register it, and the AI sings in that timbre. Pro and Premier subscribers only. Private by default.
Custom Models. Upload your own catalog (e.g., songs you have made) to fine-tune v5.5 toward that style. Up to three per account.
Studio. Receive stems separated by track — vocals, bass, drums, harmony, instrumentation. Drop them into a DAW for post-production.

Quality? For English lyrics in mainstream genres like pop, rock, electronic, or folk, a first-time listener will believe a human made it. Korean and other languages with less training data still struggle with pronunciation and prosody (steadily improving since 2025, still weaker than English). Structurally complex genres like jazz improvisation or full classical orchestration remain weak spots.

Commercial licensing is explicitly granted on Pro and above. Still, advertising the output as "100% safe" is hard to justify while the RIAA case is pending.

2.2 Udio — A Different Aesthetic

Udio was founded in December 2023 by former Google DeepMind researchers, led by CEO David Ding. The April 2024 seed round of $10M was led by Andreessen Horowitz, with notable participation from Instagram co-founder Mike Krieger, will.i.am, Common, and other music-industry figures.

Udio's output has a subtly different character from Suno's. Where Suno tends toward "polished pop," Udio leans toward "a track a producer has already worked over." It scores especially well in hip-hop, R&B, Latin, and electronic.

On October 29, 2025, Universal Music Group settled with Udio — a payment plus a licensing deal for a joint AI music platform launching in 2026. On November 25, Warner settled too (a multi-million-dollar settlement plus a licensing partnership, with Suno acquiring Songkick from Warner as part of the package). Kobalt and Merlin followed. As of May 2026, Sony is the only major still actively litigating against Udio.

2.3 Lyria 3 (Google DeepMind)

Google moved on two fronts.

Lyria the model. From Lyria 2 (May 2025) to Lyria 3 (February 18, 2026). 48kHz stereo, up to three minutes, working directly on audio tokens rather than spectrograms. SynthID watermarking is mandatory. Access via Vertex AI and the Gemini API.

Riffusion acquisition. On February 24, 2026, Google acquired ProducerAI (formerly Riffusion). ProducerAI was a conversational music-generation agent with a million users. After acquisition it was folded into Lyria 3. The spectrogram-diffusion lineage that Riffusion pioneered now lives inside Lyria 3.

2.4 Lyria RealTime — A Different Usage Model

Lyria RealTime is a separate beast. Not "generate a song" but "control streaming audio in real time." You adjust style, tempo, and mood live while infinite music plays. Primary use cases: live streaming, game BGM, interactive installations. Accessed via the Gemini API.

2.5 ElevenMusic (ElevenLabs)

ElevenLabs, known for voice synthesis, launched Eleven Music on August 5, 2025. On April 1, 2026, it relaunched as ElevenMusic with a standalone iOS app and a full consumer platform.

The differentiator is licensing. ElevenLabs signed training-data deals with Merlin Network, Kobalt Music Group, and SourceAudio in advance. Marketing positions ElevenMusic as "cleared for commercial use." The key signal: it deliberately did not train on the major labels' RIAA-side catalogs.

Functionally, you can set the length, choose whether to include lyrics, and remix existing tracks (genre and tempo shifts). The free tier covers seven songs per day. Combined with ElevenLabs' voice synthesis, finer control over vocal character is possible.

2.6 Comparison — Consumer Tools

Tool	Vocal Quality	Instrumental	Korean Lyrics	Length	Commercial License	Primary Use
Suno v5.5	Very high	High	OK	Up to 8 min	Pro and above, explicit	Songwriting, content
Udio	High	Very high	OK	Up to 4+ min	Standard and above	Producing, hip-hop/R&B
Lyria 3	Medium (lyric-light)	Very high	Weak	Up to 3 min	Vertex AI terms	Enterprise integration
ElevenMusic	High	High	Not benchmarked	Up to 5 min	Explicitly cleared	Content creators
Lyria RealTime	None	High	N/A	Infinite stream	API terms	Games, live

3 · Open Source and Local Options — MusicGen, Stable Audio, YuE, ACE-Step

3.1 Why Open Source

Three reasons.

Cost. No subscription, unlimited generation. Runs on a single local RTX 4090.
Privacy. Lyrics and concepts never leave your machine. Crucial for unreleased projects.
Control. Fine-tuning, fixed seeds, batch generation, and automation pipelines become possible.

The trade-off: quality lags consumer tools by a half-step, and licensing terms need careful reading.

3.2 MusicGen (Meta, 2023)

The starting point of open-source AI music. Released August 2023 as part of the AudioCraft framework. Text-to-instrumental.

Parameters. Three sizes — 300M, 1.5B, 3.3B. The 3.3B variant wants 16GB+ VRAM.
Data. About 20,000 hours of music Meta owns or licensed.
License. Model weights are CC BY-NC 4.0 — non-commercial use only. This is widely misread. Self-hosting does not grant commercial rights.
2026 status. No meaningful update since 2024. Quality is visibly behind Suno and Udio. Cannot do vocals.

Still useful for "learning," "offline experiments," "non-commercial projects," and "as a baseline for comparing other models."

3.3 Stable Audio 2.5 / Stable Audio Open

The two Stability AI lines are easy to confuse.

Stable Audio 2.5. Commercial SaaS. Up to three minutes, complex structure (intro, development, outro). Better response to mood prompts like "uplifting" or "lush synthesizers." Strong for sound effects, ad music, and video tracks.

Stable Audio Open. Open source. The base model maxes at 47 seconds. Stable Audio Open Small (341M parameters, built with Arm) generates 11 seconds of audio in under 8 seconds on a smartphone CPU. Licensed under the Stability AI Community License, free for commercial and non-commercial use.

Stable Audio Open is stronger for sound design — short SFX, loops, textures, foley — than for full songs.

3.4 YuE — Open-Source Full-Song Model

YuE arrived in 2025 as an open-source full-song model with vocals. Apache 2.0 license (commercial use allowed). It does what MusicGen does not — "text plus lyrics into a full song with vocals."

Hardware. 24GB of VRAM is recommended. Quantized versions run in 8–16GB. On a 4090, generating 30 seconds of audio takes roughly 360 seconds.
Optimized forks. DeepBeepMeep's GPU-poor branch generates a 1-minute song in about 4 minutes on a 4090.
License. Apache 2.0 — commercial use allowed. The cleanest license among open-source music models.
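The hardware figures above are easier to compare as a real-time factor, that is, seconds of compute per second of audio. The sketch below uses only the numbers quoted in this section. The three-minute estimate assumes generation time scales linearly with length, which is an assumption rather than a measured result.

```python
def realtime_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced."""
    return generation_seconds / audio_seconds


baseline = realtime_factor(audio_seconds=30, generation_seconds=360)      # stock YuE on a 4090
optimized = realtime_factor(audio_seconds=60, generation_seconds=4 * 60)  # GPU-poor fork on a 4090
speedup = baseline / optimized

# Rough estimate for a 3-minute song, assuming linear scaling with length.
song_seconds = 180
baseline_minutes = song_seconds * baseline / 60
optimized_minutes = song_seconds * optimized / 60

print(f"baseline {baseline:.0f}x real time, fork {optimized:.0f}x, speedup {speedup:.0f}x")
print(f"3-minute song: ~{baseline_minutes:.0f} min vs ~{optimized_minutes:.0f} min")
```

Either way, local generation is a batch job rather than an interactive one, so it pays to queue several prompts and review the results later.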

Quality does not match Suno v5, but YuE is the first open-source model to combine "open + commercial + vocals."

3.5 ACE-Step 1.5 — Another Local Contender

ACE-Step 1.5 stands out for supporting Mac, AMD, Intel, and CUDA devices. Running on M-series Macs matters a lot. Its balance of music generation, vocals, and decent quality makes it a frequently recommended "2026 local starting point."

3.6 Comparison — Open Source / Local

Model	Vocals	License	Min VRAM	Length	Strength
MusicGen 3.3B	No	CC BY-NC 4.0 (non-commercial)	16GB	30 sec	Learning, baseline
Stable Audio Open	No	Stability Community	8GB	47 sec	Sound design
YuE	Yes	Apache 2.0	24GB rec.	1–5 min	Full songs, commercial
ACE-Step 1.5	Yes	Apache 2.0	12–24GB	Full song	Multi-platform
DiffRhythm	Yes	Open source	16GB	Full song	Fast inference

4 · Where It Actually Works

4.1 Indie Game Soundtracks

One of the strongest fits. The reason is simple — an indie game typically needs 10 to 30 tracks. Commissioning all of them from a composer costs roughly ten to fifty thousand dollars. Filling the gap from royalty-free libraries means the same music turns up in other games.

AI music slots neatly into that gap.

Volume. Dozens of tracks per hour, keep what you like.
Uniqueness. Unlike libraries, your track will not appear in another game.
Variation control. Adjust the seed and prompt to generate similar tracks for the same mood.
Loop-friendly. Game BGM loops anyway. You do not need a full four-minute song.

A workflow used by actual indie studios.

1. Write a mood sheet for the game: "neon-lit cyberpunk alley, tense but melancholy, 100 BPM"
2. Generate 10 to 20 tracks in Suno or Udio, shortlist favorites
3. Separate stems on the 1 to 2 chosen tracks
4. Adjust BPM and key in a DAW, build loop points
5. Import into Unity or Unreal as .ogg or .wav
6. Configure interactive layers in an adaptive music system like FMOD or Wwise
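For step 4, loop points fall on bar boundaries, so the loop length follows directly from tempo. A minimal sketch, assuming 4/4 time and a 48 kHz export (both are assumptions, as are the function names):

```python
def loop_length_seconds(bpm: float, bars: int, beats_per_bar: int = 4) -> float:
    """Duration of a loop that spans a whole number of bars."""
    return bars * beats_per_bar * 60 / bpm


def loop_length_samples(bpm: float, bars: int, beats_per_bar: int = 4,
                        sample_rate_hz: int = 48_000) -> int:
    """Loop end point in samples, measured from the loop start."""
    total_beats = bars * beats_per_bar
    return round(total_beats * 60 * sample_rate_hz / bpm)


# The mood sheet above asks for 100 BPM; an 8-bar loop is a common choice.
seconds = loop_length_seconds(bpm=100, bars=8)
samples = loop_length_samples(bpm=100, bars=8)
print(f"8 bars at 100 BPM = {seconds:.1f} s = {samples} samples")
```

If the generated track drifts from the nominal tempo, measure the actual BPM in the DAW first and feed that value in. Otherwise the loop seam will click or stumble.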

A caution: verify the licensing of AI output against your distribution channel (Steam, consoles). Suno Pro and above, or a clean model like ElevenMusic, is the safe choice.

4.2 Podcast Intros and Outros

A 15 to 30-second signature sound. AI music's main weakness — long-term coherence — barely matters here.

Workflow.

Prompt mood and genre: "upbeat tech podcast intro, synth-driven, 20 seconds, fade-out"
Generate 10 to 20, pick one
Polish around the voiceover
Use the same track on every episode — it becomes "brand sound"

Cost: Suno Pro at $10 a month covers it. Compared to commissioning a composer ($300 to $1,000), it is negligible.
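The comparison is simple arithmetic, sketched here with the prices quoted above. The function name is hypothetical, and the calculation ignores anything beyond the sticker price, such as time spent curating.

```python
def months_covered(one_time_fee_usd: float, monthly_subscription_usd: float) -> float:
    """How many months of subscription one commission fee would pay for."""
    return one_time_fee_usd / monthly_subscription_usd


suno_pro_monthly = 10
low_quote, high_quote = 300, 1_000

print(f"low quote  = {months_covered(low_quote, suno_pro_monthly):.0f} months of Pro")
print(f"high quote = {months_covered(high_quote, suno_pro_monthly):.0f} months of Pro")
```

Since a podcast intro is typically generated once and reused, a single month of subscription is usually all the job needs.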

4.3 YouTube and Short-Form BGM

This is where Mubert shines. Mubert is not text-to-song — it is mood-based infinite track generation. It can produce 25-minute background tracks and 25 variations quickly. The royalty-free license is unambiguous. Musicians upload their sample packs and receive 80 percent of track sales, so the training-data origin is comparatively clean.

For a YouTuber, the appeal is "no Content ID claims." Vocal-bearing Suno tracks rarely trigger claims either, but Mubert is the most clearly safe option.

4.4 Songwriting Ideation

Professional songwriters and composers are surprisingly aggressive users. Two patterns.

Motif generation. Quickly try "what would this chord progression with this vocal melody sound like." They do not use the output directly — they steal the idea and weave it into their own track.

Guide track. Write lyrics first, then make an AI demo. Listen to the demo to judge "this part works, this part needs to change." Then build the real song. The AI music acts as an MVP.

The core mindset: use AI output as a design tool, not a finished product. Masterpieces will not pop out — the right position for AI music is "idea generator."

4.5 Where It Does Not Work

The same honesty applies to limits.

Advanced classical composition. Four-voice fugues, sonata-form structures — still weak.
Replacing live performance. Cannot manufacture stage energy.
Jazz improvisation. No coherent motivic development.
Big commercial IP. Major film soundtracks and lead ad tracks remain out of reach — not for quality reasons but for legal safety.
Distinctive vocal character. Suno Voices cloning a user's own voice is roughly the ceiling.

5 · Quality Reality — Vocals Are the Hardest Part

5.1 Why Vocals Are Hard

The two hardest problems in audio generation are (a) long-term coherence and (b) vocals. Vocals are especially hard, for layered reasons.

Phonemes and pronunciation. The human voice moves to a new phoneme roughly every 50 ms. The model has to map lyric text onto a sequence of audio tokens that pronounce it. English has rich training data and works well. Korean, Japanese, Arabic, and similar languages have far less audio data per phoneme.

Prosody (intonation). Singing "I love you" sadly versus joyfully sounds different. The model must combine lyric meaning with song mood to shape the intonation curve.

Pitch stability. Human singers hold pitch within roughly ±10 cents. AI sometimes wavers ±50 cents. The ear hears it as "off."
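Cents are a logarithmic unit: 100 cents per equal-tempered semitone, 1200 per octave. The sketch below converts between frequency and cents so the ±10 and ±50 figures can be read in hertz. A4 = 440 Hz is used as the reference purely for illustration.

```python
import math


def cents_between(f_ref_hz: float, f_hz: float) -> float:
    """Signed pitch distance from f_ref_hz to f_hz in cents."""
    return 1200 * math.log2(f_hz / f_ref_hz)


def shift_by_cents(f_hz: float, cents: float) -> float:
    """Frequency reached by moving f_hz up (or down) by the given cents."""
    return f_hz * 2 ** (cents / 1200)


a4 = 440.0
for drift in (10, 50):
    sharp = shift_by_cents(a4, drift)
    print(f"+{drift} cents from A4 -> {sharp:.2f} Hz ({sharp - a4:.2f} Hz sharp)")
```

A 50-cent drift is a quarter tone, halfway to the next semitone, which is why listeners hear it as out of tune rather than as expressive vibrato.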

Intelligibility. Listeners need to hear the lyrics. Vocals are not finished when melody is in place — the words must be audible. Hard consonant clusters (like "strengths") often blur in AI output.

5.2 The Extra Penalty for Non-English Lyrics

Korean has roughly one-tenth to one-twentieth the training data of English. Consequences:

Final consonants (especially ㄹ and ㅇ) sound awkward.
An English-style vocal phrasing is forced onto Korean (consonants run together instead of being articulated).
Natural prosody of the lyric is missed.

Mitigations: (a) Suno v5.5 is visibly better than v4 on Korean. (b) Explicit style tags like "korean ballad," "k-pop," or "trot" help. (c) When awkwardness remains, generate with English lyrics and re-record the vocal in Korean during post.

5.3 Instrumentals Are Surprisingly Solid

Conversely, instrumentals have been near human quality since late 2025. Electronic, synth pop, lo-fi, cinematic scores, ambient: telling them apart from human work is nearly impossible. That is why games, podcasts, and YouTube BGM were the first use cases to take off.

5.4 Length and Coherence

Past three minutes, the model starts losing track of "where this song is going." Specifically:

Motif forgetting. A hook introduced at one minute disappears by three.
Structural drift. Verse-chorus-bridge structure erodes as length grows.
Quality drift. After four minutes, vocals sometimes turn grainy or the mix shifts.

Workarounds: (a) generate short pieces and stitch in a DAW, (b) use Suno's Extend feature in segments, (c) for anything past five minutes, go instrumental.
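Workaround (a) usually means overlapping the end of one segment with the start of the next and fading between them. Here is a minimal linear crossfade on plain lists of samples. It is a sketch of the idea, not what any particular DAW does, and many editors default to an equal-power curve instead.

```python
def crossfade_concat(first, second, overlap):
    """Join two sample lists, blending the last `overlap` samples of `first`
    into the first `overlap` samples of `second` with a linear ramp."""
    if not 0 < overlap <= min(len(first), len(second)):
        raise ValueError("overlap must be positive and fit inside both segments")
    head = first[:-overlap]
    tail = second[overlap:]
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # ramps from near 0 to near 1
        a = first[len(first) - overlap + i]
        b = second[i]
        blended.append(a * (1 - weight) + b * weight)
    return head + blended + tail


# Toy segments: a constant 1.0 signal fading into silence.
segment_a = [1.0, 1.0, 1.0, 1.0]
segment_b = [0.0, 0.0, 0.0, 0.0]
joined = crossfade_concat(segment_a, segment_b, overlap=3)
print(joined)
```

In practice the overlap is placed on a bar boundary where both segments share key and tempo. The fade hides small mix differences but cannot fix a mismatched groove.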

6 · Lawsuits and the Copyright Debate — Honestly

6.1 What Is at Issue

The RIAA suits have two core issues.

Training data use. "Trained on copyrighted recordings without permission." Both defendants invoke "transformative fair use."
Output similarity. Plaintiffs claim Suno and Udio can reproduce specific training songs nearly verbatim.

The legal question reduces to whether AI training passes the four-factor fair-use test (purpose, nature, amount, market effect).

6.2 Status as of May 2026

Suno. Contesting all claims on fair-use grounds against Universal, Warner, and Sony in the District of Massachusetts. Suno filed for summary judgment in March 2026, with the key hearing scheduled for July 2026. Cited precedent: the Northern District of California's 2025 Bartz v. Anthropic ruling, which treated AI training as transformative use.

Udio. Successive licensing settlements with Universal (October 2025), Warner (November 2025), Kobalt, and Merlin. Sony remains the only major actively litigating. The Universal deal includes a joint AI music platform launching in 2026.

Independent artists. In October 2025, separately from the majors, a class of independent musicians sued both Suno and Udio.

6.3 Three Possible Outcomes

Scenario A — Suno wins (fair use upheld). AI training becomes legitimized. Every AI model uses a similar defense. The music industry shifts to a separate licensing market (e.g., the Universal-Udio joint platform). Users get the most freedom.

Scenario B — Suno loses (licensing required). Suno is forced into licensing settlements or model retraining. Costs rise sharply and subscription prices follow. New entrants cannot start without licensing. "Pre-licensed" models like ElevenMusic gain a structural advantage.

Scenario C — Settlement. The most likely scenario. The Universal-Udio template (majors, licensing, revenue sharing) becomes the industry standard, and the rest of the industry aligns to it.

6.4 What Users Should Do

Reasonably safe while the litigation is pending: subscribe to Suno or Udio at Pro tier or above (plans that explicitly grant commercial usage rights), and avoid explicitly imitating named major artists.

Safer still: models like ElevenMusic with provable pre-licensed training data, or Apache 2.0 open-source models like YuE or ACE-Step run locally.

Avoid: prompting the model to clone a specific named artist's voice or style ("in the style of [famous singer]") and then distributing the output commercially. That is the clearest risk.

7 · Decision Framework — What to Pick

7.1 "Situation → Recommended Tool"

Situation	First choice	Second choice	Note
Songwriting demos	Suno v5.5	Udio	Vocal quality first
Indie game BGM	Suno Pro	Mubert	Stem separation matters
Podcast intro	Suno	ElevenMusic	30 seconds works anywhere
YouTube background	Mubert	Stable Audio 2.5	Mood-based infinite tracks
Ad track (commercial)	ElevenMusic	Stable Audio 2.5	License cleanliness first
Live game BGM	Lyria RealTime	(few alternatives)	Real-time control
Local / private experiment	YuE	ACE-Step	Data does not leave the box
Sound design (short SFX)	Stable Audio Open	(DAW plugins)	11 to 47 seconds
Learning / research	MusicGen	YuE	Non-commercial OK
Korean-lyric songs	Suno v5.5	Udio	Plan for vocal post-processing

7.2 Decision Tree

Start
 │
 ├─ Need vocals?
 │   ├─ No  → Mubert / Stable Audio / MusicGen / Lyria RealTime
 │   └─ Yes ↓
 │
 ├─ Commercial use?
 │   ├─ No  (research / learning) → Anything goes, MusicGen included
 │   └─ Yes ↓
 │
 ├─ License cleanliness top priority?
 │   ├─ Yes → ElevenMusic or YuE / ACE-Step self-hosted
 │   └─ No  ↓
 │
 ├─ Non-English lyrics?
 │   ├─ Yes → Suno v5.5 first, expect post-processing
 │   └─ No  ↓
 │
 ├─ What aesthetic?
 │   ├─ Pop / electronic polish      → Suno
 │   ├─ Hip-hop / R&B / producer tone → Udio
 │   └─ Enterprise / Vertex AI       → Lyria 3
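
The tree above can be read as a chain of early returns. The sketch below is a literal transcription of it; the function name, parameter names, and aesthetic keys are hypothetical, and the returned strings are simply the branch labels from the tree.

```python
def recommend_tool(needs_vocals, commercial, license_first=False,
                   non_english=False, aesthetic="pop"):
    """Walk the decision tree top to bottom and return the first matching branch."""
    if not needs_vocals:
        return "Mubert / Stable Audio / MusicGen / Lyria RealTime"
    if not commercial:
        return "Anything goes, MusicGen included"
    if license_first:
        return "ElevenMusic or YuE / ACE-Step self-hosted"
    if non_english:
        return "Suno v5.5 first, expect post-processing"
    # Last question in the tree: which aesthetic is wanted.
    by_aesthetic = {
        "pop": "Suno",
        "electronic": "Suno",
        "hip-hop": "Udio",
        "r&b": "Udio",
        "enterprise": "Lyria 3",
    }
    try:
        return by_aesthetic[aesthetic.lower()]
    except KeyError:
        raise ValueError(f"aesthetic not covered by the tree: {aesthetic!r}")

print(recommend_tool(needs_vocals=True, commercial=True, non_english=True))
print(recommend_tool(needs_vocals=True, commercial=True, aesthetic="hip-hop"))
```

Because the questions are ordered, an earlier answer overrides every later one: a license-first project lands on ElevenMusic or a self-hosted model regardless of language or genre.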

7.3 By Budget

Budget	Recommendation
$0 / month	MusicGen + 4090 or cloud GPU. Suno free tier (5 songs / day).
$10 / month	Suno Pro alone. Enough for most content creators.
$30 / month	Suno Pro + Udio Standard + Mubert. Rich aesthetic choices.
$100+ / month	Suno Premier + ElevenMusic + Stable Audio 2.5. Commercial production.
$1,000+	Own 4090 box + YuE self-hosted + subscriptions. Studios, game teams.

Epilogue — Checklist, Anti-Patterns, What's Next

AI music has gone from 2023's "neat" to 2026's "I'll release this." The pivot is that vocals now sound like vocals, lengths reach actual song duration, and aesthetic differences between tools have settled along genre lines. At the same time, Korean vocals, coherence past four minutes, and airtight commercial licensing remain unsolved. The Suno summary judgment hearing in July 2026 will likely shape the category's next year.

Tool Selection Checklist

Do you need vocals? — If not, Mubert or Stable Audio is a much safer pick.
Are you using it commercially? — Pro tier or higher, explicit license, permanent-rights confirmation.
Is the language English? — If not, budget for post-processing and vocal re-recording.
How long is the piece? — Past three minutes, use Extend or stitching, or stay instrumental.
What genre aesthetic? — Suno (pop), Udio (hip-hop / R&B), Lyria (enterprise).
Need stem separation? — Suno Studio is one of the few that really delivers.
Online dependency a burden? — Consider YuE or ACE-Step locally.
Workflow repetitive? — Use the Mubert API, Suno API, or Lyria RealTime API.
Copyright safety top priority? — ElevenMusic, or models that document training data.
Are you ready to treat AI output as a draft, not a final? — The most important question.

Anti-Patterns

Anti-pattern	Why it's bad	Instead
Shipping the first generation	Average quality is low	Generate 10 to 20, curate
Naming famous artists in prompts	License gray zone, Content ID risk	Abstract descriptions like "late-80s synth-pop"
Judging Korean songs by English assumptions	Awkward pronunciation slips through	At least one native-speaker review
Releasing commercially on a free tier	License violation	Subscribe at Pro or above
Generating a 4-minute song in one shot	Late-track coherence falls apart	Generate short, stitch, or use Extend
Using MusicGen output in a commercial ad	CC BY-NC 4.0 violation	YuE / ACE-Step or a paid plan with commercial rights
Skipping vocal intelligibility checks	Releasing songs no one can parse	Three external listeners read the lyrics back
Treating Lyria 3 like a free tool	Vertex AI pricing not understood	Cost-calculate per minute
Crediting AI output as "I composed this"	Disclosure and copyright risk	Mark as "AI-assisted composition"
Relying on one model only	Model limits become work limits	Pair 2 to 3 models by aesthetic
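
The intelligibility check in the table (external listeners reading the lyrics back) can be scored instead of eyeballed. One possible measure is word error rate: the word-level edit distance between the lyric sheet and what a listener wrote down, divided by the number of words in the lyric sheet. The sketch below assumes that metric; the function name and the sample lyric are made up for illustration.

```python
import re

def word_error_rate(lyrics: str, heard: str) -> float:
    """Word-level edit distance between lyrics and a listener's transcript,
    divided by the number of words in the lyrics (0.0 means a perfect match)."""
    ref = re.findall(r"\w+", lyrics.lower())
    hyp = re.findall(r"\w+", heard.lower())
    if not ref:
        raise ValueError("reference lyrics are empty")
    # Levenshtein distance over words, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

lyrics = "I love you more than the strengths of the sea"  # made-up line
listeners = [
    "I love you more than the strengths of the sea",
    "I love you more than the strange of the sea",
    "I love you more than the sea",
]
for transcript in listeners:
    print(word_error_rate(lyrics, transcript))
```

A score that stays high across all three listeners points at the vocal, not at any one listener. The `\w+` tokenizer also matches Hangul, so the same check works for Korean lyrics.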

What's Next

The next post is "AI Video Generation 2026 — Sora, Veo, Runway, Pika, Kling — and How They Actually Differ." Same pattern as this one: the category's explosion (2024 Sora demo) and maturation (commercial tools in 2026), the hardest part analogous to vocals (long-term coherence, character identity, fingers), open-source options (Open-Sora, Mochi, Wan), real use cases (ads, short video, concept visuals), and the copyright debate (NYT-OpenAI, Disney's licensing model) at the same depth.