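The Complete Guide to Speech and Audio AI: ASR, TTS, Whisper, Wav2Vec, and Speech Synthesis

Speech and audio AI provides the most natural interface between humans and machines. From smartphone voice assistants to real-time translation systems and the synthetic voices of virtual influencers, audio AI has become part of everyday life.

This guide starts with the physics of sound and digital signal processing, then covers automatic speech recognition (ASR), text-to-speech (TTS), speaker diarization, and music generation, walking through the whole audio AI ecosystem with practical Python code.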

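1. Fundamentals of Audio Signal Processing

Physical Properties of Sound

Sound is a pressure wave that propagates through air. The following concepts are essential for digital audio processing.

Frequency: The number of oscillations per second, measured in hertz (Hz). The range of human hearing is roughly 20 Hz to 20,000 Hz. Low frequencies are heard as low pitches and high frequencies as high pitches.

Amplitude: The size of the wave, that is, the strength of the sound pressure. It is usually expressed in decibels (dB) relative to a reference level: 0 dB is the reference itself, and negative values are quieter than the reference. (In digital audio the reference is usually full scale, so levels sit at or below 0 dB.)

Phase: The position of the waveform along the time axis. Adding two waves with the same frequency but different phases produces constructive or destructive interference.

Harmonics: Frequency components at integer multiples of the fundamental frequency. They determine the timbre (tonal quality) of an instrument or voice.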
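
To make the phase point concrete, here is a minimal sketch that mixes two 440 Hz sine waves, once in phase and once shifted by half a cycle. The helper names (`sine`, `mix`, `peak`) and the 8 kHz sample rate are illustrative assumptions, not part of any library.

```python
import math

def sine(freq_hz, phase_rad, sample_rate=8000, duration_s=0.01):
    """Return samples of a unit-amplitude sine wave as a list of floats."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate + phase_rad)
            for t in range(n)]

def mix(a, b):
    """Add two signals sample by sample."""
    return [x + y for x, y in zip(a, b)]

def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)

base = sine(440, 0.0)
in_phase = mix(base, sine(440, 0.0))          # same phase: amplitudes add
out_of_phase = mix(base, sine(440, math.pi))  # half-cycle shift: waves cancel

print(f"single wave peak:  {peak(base):.3f}")
print(f"in-phase peak:     {peak(in_phase):.3f}")
print(f"out-of-phase peak: {peak(out_of_phase):.3e}")
```

The in-phase mix has twice the peak of a single wave, while the out-of-phase mix cancels down to floating-point noise.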

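Sampling Rate and Bit Depth

Sampling rate: The number of audio samples captured per second (Hz). According to the Nyquist theorem, a signal can be reconstructed exactly only if it is sampled at more than twice the highest frequency it contains.

CD quality: 44,100 Hz (44.1 kHz)
Video and high-resolution audio: 48,000 Hz (video standard), 96,000 Hz, 192,000 Hz
Telephone quality: 8,000 Hz; wideband telephony: 16,000 Hz
Whisper input: 16,000 Hz

Bit depth: The number of bits per sample, which determines the dynamic range.

16-bit: 65,536 levels, about 96 dB of dynamic range (CD standard)
24-bit: 16,777,216 levels, about 144 dB of dynamic range
32-bit floating point: the standard in deep learning pipelines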
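
The figures above follow from two short formulas. The sketch below computes the theoretical dynamic range of linear PCM and shows where a tone above the Nyquist frequency ends up after sampling. The function names are illustrative, and `alias_hz` assumes an ideal sampler with no anti-aliasing filter.

```python
import math

def dynamic_range_db(bits):
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bits)

def nyquist_hz(sample_rate):
    """Highest frequency that can be represented at this sampling rate."""
    return sample_rate / 2

def alias_hz(freq_hz, sample_rate):
    """Apparent frequency of a pure tone after sampling (folded into 0..sr/2)."""
    f = freq_hz % sample_rate
    return sample_rate - f if f > sample_rate / 2 else f

for bits in (16, 24):
    print(f"{bits}-bit: {2 ** bits:,} levels, {dynamic_range_db(bits):.1f} dB")

print(f"Nyquist frequency at 16 kHz: {nyquist_hz(16000):.0f} Hz")
print(f"A 5 kHz tone sampled at 8 kHz appears at {alias_hz(5000, 8000)} Hz")
```

This is why audio is low-pass filtered before downsampling: anything above half the target rate would otherwise fold back into the audible band.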

Overview of librosa

librosa is a core Python library for audio analysis.

pip install librosa soundfile matplotlib numpy scipy

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf

# Load audio (sr=None keeps the original sampling rate; mono=False keeps all channels)
y, sr = librosa.load('audio.wav', sr=None, mono=False)

# Stereo → mono: (2, N) → (N,); a mono input is returned unchanged
y = librosa.to_mono(y)

print(f"Duration: {len(y)/sr:.2f} s")
print(f"Sample rate: {sr} Hz")
print(f"Samples: {len(y)}")
print(f"dtype: {y.dtype}")

# Resample to 16 kHz
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)

# Save
sf.write('output.wav', y_16k, 16000)

# Plot the waveform
plt.figure(figsize=(14, 4))
librosa.display.waveshow(y, sr=sr, alpha=0.8, color='steelblue')
plt.title('Waveform')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.savefig('waveform.png')
plt.show()

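2. Audio Feature Extraction

Fourier Transform (FFT)

The Fourier transform converts a time-domain signal into the frequency domain, revealing which frequency components are present and how strong each one is.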

import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
import librosa

y, sr = librosa.load('audio.wav', sr=22050)

N  = len(y)
yf = fft(y)
xf = fftfreq(N, 1/sr)

# Keep only the positive frequencies (the spectrum of a real signal is Hermitian-symmetric)
xf_pos = xf[:N//2]
yf_pos = np.abs(yf[:N//2])

plt.figure(figsize=(12, 4))
plt.plot(xf_pos, yf_pos)
plt.title('Frequency Spectrum (FFT)')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Magnitude')
plt.xlim(0, sr//2)
plt.yscale('log')
plt.grid(True)
plt.tight_layout()
plt.show()

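Short-Time Fourier Transform (STFT)

A plain FFT over the whole signal yields a single spectrum for the entire recording, with no information about when each frequency occurs. The STFT applies the FFT to short overlapping windows, producing a time-frequency representation that captures how the spectral content changes over time.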

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_fft      = 2048   # FFT size (sets the frequency resolution)
hop_length = 512    # Hop size (samples between successive frames)
win_length = 2048   # Window size

D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)

magnitude = np.abs(D)
phase     = np.angle(D)
D_db      = librosa.amplitude_to_db(magnitude, ref=np.max)

plt.figure(figsize=(14, 6))
librosa.display.specshow(
    D_db, sr=sr, hop_length=hop_length,
    x_axis='time', y_axis='hz', cmap='magma'
)
plt.colorbar(format='%+2.0f dB')
plt.title('STFT Spectrogram')
plt.tight_layout()
plt.savefig('stft_spectrogram.png')
plt.show()

print(f"STFT shape: {D.shape}")
print(f"Frequency resolution: {sr/n_fft:.2f} Hz")
print(f"Time resolution: {hop_length/sr*1000:.2f} ms")
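
As a quick sanity check on the numbers printed above, the shape and resolution of an STFT follow from simple arithmetic. The sketch below assumes librosa's default `center=True` padding; the helper names are illustrative and not part of librosa.

```python
def stft_shape(n_samples, n_fft, hop_length):
    """(frequency bins, frames) of a centered STFT."""
    return 1 + n_fft // 2, 1 + n_samples // hop_length

def resolution(sr, n_fft, hop_length):
    """Bin spacing in Hz, window length in ms, and frame step in ms."""
    return sr / n_fft, n_fft / sr * 1000, hop_length / sr * 1000

sr = 22050
for n_fft in (512, 2048, 8192):
    hop = n_fft // 4
    bins, frames = stft_shape(sr, n_fft, hop)   # one second of audio
    df, win_ms, hop_ms = resolution(sr, n_fft, hop)
    print(f"n_fft={n_fft:5d}: {bins:4d} bins x {frames:3d} frames/s, "
          f"{df:6.2f} Hz per bin, {win_ms:6.1f} ms window, {hop_ms:5.1f} ms step")
```

A larger `n_fft` narrows the frequency bins but lengthens the analysis window, so frequency resolution and time resolution trade off against each other.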

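Mel Spectrogram

The human auditory system perceives pitch on a roughly logarithmic scale. The mel scale models this perceptual nonlinearity. The mel spectrogram is one of the most widely used input representations for deep learning audio models.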

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_mels     = 128
n_fft      = 2048
hop_length = 512

mel_spec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_mels=n_mels,
    n_fft=n_fft, hop_length=hop_length,
    fmin=0, fmax=sr//2, power=2.0
)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)

plt.figure(figsize=(14, 6))
librosa.display.specshow(
    mel_db, sr=sr, hop_length=hop_length,
    x_axis='time', y_axis='mel',
    fmax=sr//2, cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.savefig('mel_spectrogram.png')
plt.show()

print(f"Mel Spectrogram shape: {mel_spec.shape}")
# (n_mels, time_frames) → ~(128, 43 * duration_seconds), since sr / hop_length = 22050 / 512 ≈ 43 frames per second
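
The mel mapping itself fits in a few lines. The sketch below uses the common HTK-style formula m = 2595 * log10(1 + f / 700); note that librosa's default (`htk=False`) uses the Slaney variant, so its exact values differ slightly. The function names here are illustrative.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels, fmin, fmax):
    """Edges (in Hz) of n_mels triangular filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

edges = mel_band_edges(8, 0, 8000)
print([round(e) for e in edges])
```

The edges are evenly spaced in mel but grow further apart in Hz, which is how the filterbank gives low frequencies finer resolution than high frequencies.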

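MFCC (Mel-Frequency Cepstral Coefficients)

MFCCs are obtained by applying the discrete cosine transform (DCT) to the log mel filterbank outputs. They represent the spectral envelope (timbre) compactly and were the standard speech recognition feature for decades.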

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_mfcc     = 40
hop_length = 512

mfccs       = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
mfcc_delta  = librosa.feature.delta(mfccs)
mfcc_delta2 = librosa.feature.delta(mfccs, order=2)

# Stacked features: 120 dimensions (40 + 40 + 40)
mfcc_combined = np.vstack([mfccs, mfcc_delta, mfcc_delta2])

fig, axes = plt.subplots(3, 1, figsize=(14, 8))
for ax, data, title in zip(
    axes,
    [mfccs, mfcc_delta, mfcc_delta2],
    ['MFCC', 'MFCC Delta (1st derivative)', 'MFCC Delta-Delta (2nd derivative)']
):
    librosa.display.specshow(data, x_axis='time', hop_length=hop_length, sr=sr, ax=ax)
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)

plt.tight_layout()
plt.savefig('mfcc.png')
plt.show()

# Fixed-length feature vector for classification (per-coefficient mean and standard deviation over time)
mfcc_feature = np.concatenate([np.mean(mfccs, axis=1), np.std(mfccs, axis=1)])
print(f"MFCC feature vector dim: {mfcc_feature.shape}")
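
To show what the DCT step does, here is a minimal pure-Python sketch that turns one log-mel frame into cepstral coefficients with an orthonormal DCT-II, the transform type librosa applies by default. The function names and the toy 8-band frame are assumptions for illustration only.

```python
import math

def dct2_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def mfcc_from_log_mel(log_mel_frame, n_mfcc):
    """Keep the first n_mfcc DCT coefficients of one log-mel frame."""
    return dct2_ortho(log_mel_frame)[:n_mfcc]

# Toy log-mel frame (8 bands, in dB): a smooth downward spectral tilt
frame = [-10.0, -12.0, -15.0, -19.0, -24.0, -30.0, -37.0, -45.0]
print([round(c, 2) for c in mfcc_from_log_mel(frame, 4)])
```

Coefficient 0 tracks the overall level and the next few capture the coarse spectral shape, which is why truncating the DCT gives a compact description of the envelope.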

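3. Automatic Speech Recognition (ASR)

Traditional ASR: HMM + GMM

Classical speech recognition combined hidden Markov models (HMMs), which model phoneme sequences, with Gaussian mixture models (GMMs), which model the acoustic features. MFCC features are extracted from the audio, the HMM-GMM system predicts a phoneme sequence, a pronunciation lexicon maps phonemes to words, and a language model scores the candidate word sequences.

CTC (Connectionist Temporal Classification)

CTC enables end-to-end training when the input and output sequences have different lengths, without requiring frame-level forced alignment. A blank token handles repeated characters and silence, so the model can learn directly from audio-text pairs.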
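
The role of the blank token is easiest to see in greedy CTC decoding: collapse consecutive repeats, then drop the blanks. The sketch below works on a list of per-frame labels; the `_` blank symbol and the function name are assumptions for illustration.

```python
from itertools import groupby

BLANK = "_"  # illustrative blank symbol

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)

# Twelve frames of per-frame argmax labels for the word "hello"
frames = list("hh_e_ll_lloo")
print(ctc_greedy_decode(frames))
```

The blank between the two runs of `l` is what lets the double letter survive the collapse; without it, `ll` would merge into a single `l`.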

Wav2Vec 2.0

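Wav2Vec 2.0, from Facebook AI Research, uses self-supervised learning to learn strong acoustic representations from large amounts of unlabeled audio. It can then be fine-tuned with a small amount of labeled data, and it achieved state-of-the-art results when it was published.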

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"
processor  = Wav2Vec2Processor.from_pretrained(model_name)
model      = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def transcribe_wav2vec2(audio_path):
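    # torchaudio returns a (channels, samples) tensor
    speech, sample_rate = torchaudio.load(audio_path)

    # Downmix to mono so a multi-channel file is not treated as a batch
    speech = speech.mean(dim=0)

    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        speech    = resampler(speech)

    speech = speech.numpy()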
    inputs = processor(speech, sampling_rate=16000,
                       return_tensors="pt", padding=True).to(device)

    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription[0].lower()

text = transcribe_wav2vec2('speech.wav')
print(f"Transcription: {text}")

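Whisper (OpenAI)

Whisper is a large-scale multilingual ASR model released by OpenAI in 2022. It was trained on 680,000 hours of multilingual audio, supports 99 languages, and delivers strong accuracy without fine-tuning for most use cases.

Architecture: encoder-decoder Transformer

Encoder: takes a log-mel spectrogram of the audio and processes it with a Transformer encoder
Decoder: autoregressively generates text tokens, including language detection and timestamps

Model sizes:

tiny: 39 million parameters (fastest)
base: 74 million
small: 244 million
medium: 769 million
large-v3: 1.55 billion (highest accuracy)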

import whisper
import numpy as np

# Load the model (downloaded automatically on first run)
model = whisper.load_model("base")  # or "small", "medium", "large-v3"

# Basic transcription
result = model.transcribe("speech.wav")
print("Transcript:", result["text"])
print("Detected language:", result["language"])

# Force a specific language
result_forced = model.transcribe("speech.wav", language="en", task="transcribe")

# With word-level timestamps (each segment also carries a "words" list; only segment times are printed here)
result_ts = model.transcribe("speech.wav", verbose=False, word_timestamps=True)

for segment in result_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")

# Translate non-English speech into English
result_translated = model.transcribe("german_speech.wav", task="translate")
print("English translation:", result_translated["text"])

# Microphone input (5-second clip)
def transcribe_from_microphone(duration=5, sample_rate=16000):
    import sounddevice as sd
    print(f"Recording for {duration} seconds...")
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype=np.float32)
    sd.wait()
    result = model.transcribe(audio.flatten(), language="en")
    print(f"You said: {result['text']}")

transcribe_from_microphone()
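
The segment dictionaries used in the loop above carry `start`, `end`, and `text` keys, which is enough to write a subtitle file. The following is a minimal sketch of an SRT formatter; the helper names are hypothetical and the sample segments are made up for illustration.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments):
    """Turn a list of {'start', 'end', 'text'} dicts into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Made-up segments in the same shape as result["segments"]
example = [
    {"start": 0.0, "end": 1.25, "text": " Hello there."},
    {"start": 1.25, "end": 3.5, "text": " Welcome to the guide."},
]
print(segments_to_srt(example))
```

With a real transcription, `segments_to_srt(result["segments"])` would produce the text to save as a `.srt` file.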

Faster-Whisper

faster-whisper is a reimplementation of Whisper on top of CTranslate2. According to the project, it runs inference up to 4x faster while using less memory.

from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16"  # "float16", "int8", "int8_float16"
)

segments, info = model.transcribe(
    "speech.wav",
    language="en",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

print(f"Language: {info.language}, confidence: {info.language_probability:.2f}")

full_text = ""
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    full_text += segment.text
    if segment.words:
        for word in segment.words:
            print(f"  '{word.word}': {word.start:.2f}s - {word.end:.2f}s")

# Batch processing
def batch_transcribe(audio_files, output_dir):
    import os
    os.makedirs(output_dir, exist_ok=True)
    for audio_path in audio_files:
        name        = os.path.splitext(os.path.basename(audio_path))[0]
        output_path = os.path.join(output_dir, f"{name}.txt")
        segs, _     = model.transcribe(audio_path, language="en")
        with open(output_path, 'w', encoding='utf-8') as f:
            for seg in segs:
                f.write(f"[{seg.start:.2f} - {seg.end:.2f}] {seg.text}\n")
        print(f"Done: {audio_path} -> {output_path}")

batch_transcribe(['interview1.wav', 'lecture.mp3'], './transcripts')

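4. Text-to-Speech (TTS)

Deep Learning TTS Architectures

Tacotron 2: A sequence-to-sequence model that generates mel spectrograms from text and is paired with a WaveNet vocoder. An attention mechanism aligns the text encoder outputs with the audio decoder.

FastSpeech 2: A non-autoregressive model that generates all mel frames in parallel, which makes inference substantially faster than autoregressive models such as Tacotron 2. A duration predictor replaces attention-based alignment, and pitch and energy are predicted directly from the encoded input.

VITS: An end-to-end model that combines variational inference with adversarial training. It merges the acoustic model and the vocoder into a single network and produces natural-sounding speech in a single pass.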
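
The duration-based alignment used by the FastSpeech family can be sketched as a "length regulator": each phoneme's hidden state is repeated for as many mel frames as its predicted duration. The example below uses strings in place of hidden vectors, and the function name and durations are illustrative assumptions.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme state `duration` times to reach frame-level length."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, duration in zip(phoneme_states, durations):
        frames.extend([state] * duration)
    return frames

# Stand-ins for encoder outputs of the phonemes in "hello"
phonemes = ["HH", "AH", "L", "OW"]
predicted_durations = [2, 3, 2, 5]   # mel frames per phoneme (made-up values)

frames = length_regulate(phonemes, predicted_durations)
print(len(frames), frames)
```

Because the output length is fixed before decoding starts, the decoder can produce every frame in parallel instead of one step at a time.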

Edge TTS (Microsoft)

The community-maintained edge-tts Python package gives free access to the high-quality neural voices of Microsoft Edge's online text-to-speech service.

import asyncio
import edge_tts

async def synthesize_with_edge_tts():
    # List the available voices
    voices     = await edge_tts.list_voices()
    en_voices  = [v for v in voices if v['Locale'].startswith('en-')]
    print("English voices:")
    for v in en_voices[:5]:
        print(f"  {v['ShortName']} - {v['FriendlyName']} ({v['Gender']})")

    # Basic synthesis
    text = "Hello! This is a demonstration of Microsoft Edge TTS."
    communicate = edge_tts.Communicate(
        text,
        voice="en-US-AriaNeural",
        rate="+0%",
        volume="+0%",
        pitch="+0Hz"
    )
    await communicate.save("output_edge.mp3")
    print("Saved: output_edge.mp3")

asyncio.run(synthesize_with_edge_tts())

Coqui TTS (open source)

from TTS.api import TTS
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# XTTS v2: multilingual zero-shot TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Basic synthesis
tts.tts_to_file(
    text="This is a demonstration of Coqui XTTS v2.",
    file_path="output_xtts.wav",
    language="en",
    speaker_wav="reference_voice.wav"
)

# Voice cloning
def clone_voice(reference_audio, text, output_path, language="en"):
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_audio,
        language=language,
        split_sentences=True
    )
    print(f"Voice clone saved: {output_path}")

clone_voice(
    "my_voice_sample.wav",
    "This sentence is spoken in a cloned voice.",
    "cloned_output.wav"
)

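5. Speaker Diarization

Speaker diarization answers the question "who spoke when?" It is essential for meeting transcription, interview analysis, and multi-speaker subtitle generation.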

from pyannote.audio import Pipeline
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
pipeline.to(device)

def diarize_audio(audio_path, num_speakers=None):
    kwargs = {"num_speakers": num_speakers} if num_speakers else {}
    diarization = pipeline(audio_path, **kwargs)

    timeline = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        entry = {
            'start':    turn.start,
            'end':      turn.end,
            'speaker':  speaker,
            'duration': turn.end - turn.start
        }
        timeline.append(entry)
        print(f"  [{turn.start:.2f}s - {turn.end:.2f}s] {speaker}")

    return timeline, diarization
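
A common next step is to attach speaker labels to ASR output. The sketch below assigns each transcribed segment to the speaker with the largest total time overlap, using the same dictionary shape as the `timeline` entries built above. The function names and sample data are hypothetical, and real pipelines often align at word level instead.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(asr_segments, timeline):
    """Label each ASR segment with the speaker that overlaps it the most."""
    labeled = []
    for seg in asr_segments:
        totals = {}
        for turn in timeline:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + ov
        speaker = max(totals, key=totals.get) if totals else "UNKNOWN"
        labeled.append({**seg, "speaker": speaker})
    return labeled

# Made-up diarization turns and ASR segments
timeline = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 5.0, "speaker": "SPEAKER_01"},
]
asr_segments = [
    {"start": 0.2, "end": 1.8, "text": "Shall we begin?"},
    {"start": 1.9, "end": 4.0, "text": "Yes, go ahead."},
    {"start": 6.0, "end": 7.0, "text": "(no matching turn)"},
]
for seg in assign_speakers(asr_segments, timeline):
    print(f"{seg['speaker']}: {seg['text']}")
```

Segments that overlap no diarization turn are labeled `UNKNOWN` rather than guessed.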

6. Speech Emotion Recognition

import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
import numpy as np

model_name       = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model            = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()

EMOTIONS = ['angry', 'calm', 'disgust', 'fearful',
            'happy', 'neutral', 'sad', 'surprised']

def predict_emotion(audio_path):
    speech, sr = torchaudio.load(audio_path)
    speech = speech.mean(dim=0)  # downmix (channels, samples) to mono
    if sr != 16000:
        speech = torchaudio.transforms.Resample(sr, 16000)(speech)

    inputs = feature_extractor(
        speech.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)

    top_probs, top_idx = torch.topk(probs, 3)
    print("Top-3 emotions:")
    for idx, prob in zip(top_idx[0], top_probs[0]):
        print(f"  {EMOTIONS[idx.item()]}: {prob.item():.4f}")

    predicted = EMOTIONS[probs.argmax().item()]
    print(f"Predicted: {predicted}")
    return predicted, dict(zip(EMOTIONS, probs[0].tolist()))

emotion, probs = predict_emotion('speech.wav')

7. Music AI

MusicGen (Meta)

Meta AI's MusicGen generates music from text prompts, conditioning on descriptions of genre, instruments, mood, and tempo.

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torch

model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(
    duration=30,
    temperature=1.0,
    top_k=250,
    top_p=0.0,
    cfg_coef=3.0,
)

descriptions = [
    "upbeat electronic dance music with synthesizers and strong bass",
    "peaceful classical piano music with violin, gentle and romantic",
    "intense rock music with electric guitar and drums"
]

print("Generating music...")
wav = model.generate(descriptions)

for idx, (desc, one_wav) in enumerate(zip(descriptions, wav)):
    filename = f'generated_music_{idx}'
    audio_write(
        filename, one_wav.cpu(), model.sample_rate,
        strategy="loudness", loudness_compressor=True
    )
    print(f"Saved: {filename}.wav - '{desc}'")

8. Practical Audio AI Projects

Project 1: Real-Time Subtitle System

import queue
import threading
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
import time

class RealtimeSubtitleSystem:
    def __init__(self, model_size="base", language="en"):
        print(f"Loading Whisper {model_size}...")
        self.model        = WhisperModel(model_size, device="cuda", compute_type="float16")
        self.language     = language
        self.audio_queue  = queue.Queue()
        self.is_running   = False
        self.sample_rate  = 16000
        self.chunk_secs   = 3

    def audio_callback(self, indata, frames, time_info, status):
        if status:
            print(f"Audio status: {status}")
        self.audio_queue.put(indata.copy())

    def transcription_worker(self):
        buffer     = np.array([], dtype=np.float32)
        chunk_size = self.sample_rate * self.chunk_secs

        while self.is_running or not self.audio_queue.empty():
            try:
                chunk  = self.audio_queue.get(timeout=0.1)
                buffer = np.append(buffer, chunk.flatten())

                if len(buffer) >= chunk_size:
                    audio_data = buffer[:chunk_size]
                    buffer     = buffer[chunk_size // 2:]

                    segments, _ = self.model.transcribe(
                        audio_data, language=self.language,
                        vad_filter=True,
                        vad_parameters=dict(min_silence_duration_ms=300)
                    )
                    for seg in segments:
                        text = seg.text.strip()
                        if text:
                            ts = time.strftime("%H:%M:%S")
                            print(f"[{ts}] {text}")

            except queue.Empty:
                continue
            except Exception as e:
                print(f"Transcription error: {e}")

    def start(self):
        self.is_running = True
        t = threading.Thread(target=self.transcription_worker, daemon=True)
        t.start()

        print("Real-time subtitle system running (Ctrl+C to stop)")
        print("-" * 50)

        with sd.InputStream(
            samplerate=self.sample_rate, channels=1,
            dtype=np.float32, callback=self.audio_callback,
            blocksize=int(self.sample_rate * 0.1)
        ):
            try:
                while True:
                    time.sleep(0.1)
            except KeyboardInterrupt:
                print("\nStopping...")
                self.is_running = False

        t.join()
        print("System stopped.")

system = RealtimeSubtitleSystem(model_size="small", language="en")
system.start()
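
The worker above transcribes the first `chunk_size` samples of the buffer and then discards only `chunk_size // 2`, so successive chunks overlap by half. The sketch below shows which sample ranges that scheme covers; the function name is illustrative, and the code works on indices only, with no audio involved.

```python
def sliding_windows(n_samples, window, hop):
    """(start, end) index pairs of full windows taken every `hop` samples."""
    starts = range(0, max(n_samples - window, 0) + 1, hop)
    return [(s, s + window) for s in starts if s + window <= n_samples]

sample_rate = 16000
chunk_secs = 3
window = sample_rate * chunk_secs   # 48,000 samples per chunk
hop = window // 2                   # advance by half a chunk

for start, end in sliding_windows(6 * sample_rate, window, hop):
    print(f"{start / sample_rate:.1f}s - {end / sample_rate:.1f}s")
```

The overlap keeps words that straddle a chunk boundary from being cut in half, at the cost that the same words may be transcribed twice. A production system would likely need to deduplicate the overlapping text.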

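Speech and Audio AI Learning Roadmap

Beginner

Practice audio feature extraction with librosa
Build a transcription prototype with the Whisper API
Create a TTS app with Edge TTS

Intermediate

Fine-tune Wav2Vec 2.0 for custom ASR
Build a speaker diarization pipeline with pyannote.audio
Implement a real-time voice chatbot

Advanced

Train a custom TTS model with VITS or XTTS
Build a music generation app with MusicGen
Combine emotion recognition with conversational AI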

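Quiz

Q1. Why are mel spectrograms so widely used in deep learning audio models?

Answer: The mel scale models the nonlinear way human hearing perceives frequency, so the representation is closer to how people perceive sound. It also preserves both time and frequency information and has a 2D, image-like form that CNNs and other neural networks handle well.

Explanation: Mel spectrograms are preferred over raw waveforms or a plain FFT because of (1) a perceptually meaningful frequency scale, (2) time-frequency localization, and (3) a compact, log-compressed representation that tends to make model training more stable.

Q2. What are the main differences between Whisper and Wav2Vec 2.0?

Answer: Whisper is a supervised model built on an encoder-decoder Transformer and trained on 680,000 hours of multilingual audio. It can translate, detect the language, and generate timestamps. Wav2Vec 2.0 uses self-supervised pretraining: it learns acoustic representations from unlabeled data and is then fine-tuned with a small amount of labeled data.

Explanation: Whisper provides strong ASR out of the box, whereas Wav2Vec 2.0 is especially useful for low-resource languages where labeled data is scarce.

Q3. Name three applications of speaker diarization.

Answer:

Meeting transcription: split a meeting recording by speaker and record who said what and when
Interview analysis: separate the interviewer's and the interviewee's speech in interview audio
Multi-speaker subtitle generation: automatically generate speaker-attributed subtitles for podcasts and videos

Explanation: Speaker diarization is especially powerful when combined with an ASR system such as Whisper: each speaker's speech can be transcribed separately to produce a meaningful conversation log.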

References

librosa documentation: https://librosa.org/doc/latest/
OpenAI Whisper: https://openai.com/research/whisper
Wav2Vec 2.0 (arXiv 2006.11477): https://arxiv.org/abs/2006.11477
Hugging Face Audio Course: https://huggingface.co/learn/audio-course/
Coqui TTS: https://github.com/coqui-ai/TTS
pyannote.audio: https://github.com/pyannote/pyannote-audio
faster-whisper: https://github.com/SYSTRAN/faster-whisper
AudioCraft (MusicGen): https://github.com/facebookresearch/audiocraft