Mastering Speech & Audio AI: Whisper, TTS, Speaker Recognition, and Music Generation

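1. Audio Basics: Understanding Sound as Numbers
2. Speech Recognition (ASR): Speech to Text
3. Speech Synthesis (TTS): Text to Voice
4. Speaker Recognition: Who Is Speaking
5. Music AI: AudioCraft and Creative Sound
6. Real-Time Processing: Streaming ASR Pipeline
7. Practical Application: Automatic Meeting Summarization System
Quiz
Conclusion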

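1. Audio Basics: Understanding Sound as Numbers

Sound Waves and Digital Audio

Sound is the variation of air pressure over time. A microphone converts these pressure changes into an electrical signal, and an ADC (analog-to-digital converter) samples that signal at regular intervals and stores it as an array of numbers.

Key concepts:

Sample rate: the number of samples per second. By the Nyquist theorem, covering the human audible range (20 Hz to 20 kHz) requires a rate above 40 kHz, which is why CDs use 44.1 kHz. Speech AI typically uses 16 kHz.
Bit depth: the precision of each sample. 16-bit = 65,536 levels, 24-bit = 16,777,216 levels.
Channels: mono (1 channel) vs. stereo (2 channels). Most speech recognition systems work on 16 kHz mono audio.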
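
The three parameters above fix how much raw data a recording produces. The following is a minimal sketch of that arithmetic; the helper names (`nyquist_rate`, `quantization_levels`, `pcm_bytes_per_second`) are made up for this illustration and do not come from any library.

```python
def nyquist_rate(max_freq_hz: float) -> float:
    """Lower bound on the sampling rate for content up to max_freq_hz."""
    return 2.0 * max_freq_hz


def quantization_levels(bit_depth: int) -> int:
    """Number of distinct amplitude levels of an integer PCM sample."""
    return 2 ** bit_depth


def pcm_bytes_per_second(sample_rate: int, bit_depth: int, channels: int) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate * (bit_depth // 8) * channels


print(nyquist_rate(20_000))                 # 40000.0
print(quantization_levels(16))              # 65536
print(pcm_bytes_per_second(16_000, 16, 1))  # 32000
```

At 16 kHz, 16-bit, mono, one minute of speech is therefore a little under 2 MB before any compression.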

FFT and Spectrograms

The FFT (Fast Fourier Transform), which converts a time-domain waveform into the frequency domain, is at the core of audio analysis.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load the audio file (16 kHz mono)
y, sr = librosa.load("speech.wav", sr=16000, mono=True)
print(f"Samples: {len(y)}, sample rate: {sr} Hz, duration: {len(y)/sr:.2f} s")

# STFT (Short-Time Fourier Transform)
n_fft = 512        # FFT window size
hop_length = 128   # hop between successive windows

D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hann")
magnitude = np.abs(D)

# Magnitude spectrogram on a dB scale
S_db = librosa.amplitude_to_db(magnitude, ref=np.max)

fig, axes = plt.subplots(3, 1, figsize=(12, 10))

# Waveform
axes[0].plot(np.linspace(0, len(y)/sr, len(y)), y)
axes[0].set_title("Waveform")
axes[0].set_xlabel("Time (s)")

# Linear-frequency spectrogram
librosa.display.specshow(S_db, sr=sr, hop_length=hop_length,
                         x_axis="time", y_axis="linear", ax=axes[1])
axes[1].set_title("Linear Spectrogram")

# Mel spectrogram
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80,
                                           n_fft=n_fft, hop_length=hop_length)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)
librosa.display.specshow(mel_db, sr=sr, hop_length=hop_length,
                         x_axis="time", y_axis="mel", ax=axes[2])
axes[2].set_title("Mel Spectrogram (80 Mel bins)")

plt.tight_layout()
plt.savefig("spectrogram_comparison.png", dpi=150)

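MFCC: A Compact Fingerprint of Speech

MFCCs (Mel-Frequency Cepstral Coefficients) are features modeled on the human auditory system.

Processing steps:

Apply a pre-emphasis filter (emphasizes high-frequency components)
Split into frames and apply a window
FFT → power spectrum
Apply the Mel filterbank (maps the frequency axis to the Mel scale)
Take the logarithm
DCT (Discrete Cosine Transform) → MFCC coefficients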
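
The last step is the one that turns log-Mel energies into cepstral coefficients. The sketch below applies an unnormalized DCT-II to one frame in pure Python. The frame values are invented, and the scaling differs from libraries that use an orthonormal DCT, so treat the numbers as illustrative only.

```python
import math


def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II: c[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n_coeffs)
    ]


# Hypothetical Mel filterbank energies for a single frame (8 channels)
mel_energies = [1.0, 2.0, 4.0, 3.0, 2.0, 1.5, 1.2, 1.0]
log_mel = [math.log(e) for e in mel_energies]

# Keep only the first few coefficients: they describe the smooth spectral envelope
mfcc_frame = dct_ii(log_mel, n_coeffs=4)
print([round(c, 3) for c in mfcc_frame])
```

A perfectly flat log-Mel spectrum puts all of its energy into coefficient 0, which is why the higher coefficients can be read as describing the shape of the envelope.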

import librosa
import numpy as np

def extract_mfcc_features(audio_path, sr=16000, n_mfcc=13, n_mels=40):
    """
    Extract MFCC features.
    Returns: array of shape (39, T) holding mfcc + delta + delta2
    """
    y, sr = librosa.load(audio_path, sr=sr)

    # Pre-emphasis
    y_emphasized = np.append(y[0], y[1:] - 0.97 * y[:-1])

    # MFCC extraction
    mfcc = librosa.feature.mfcc(
        y=y_emphasized, sr=sr,
        n_mfcc=n_mfcc,
        n_mels=n_mels,
        n_fft=512,
        hop_length=160,   # 10 ms at 16 kHz
        win_length=400,   # 25 ms
        window="hann"
    )

    # Delta (first-order difference): dynamic features
    delta = librosa.feature.delta(mfcc)
    # Delta-delta (second-order difference)
    delta2 = librosa.feature.delta(mfcc, order=2)

    # 39-dimensional feature vector (13 + 13 + 13)
    features = np.vstack([mfcc, delta, delta2])

    # CMVN (cepstral mean and variance normalization)
    features = (features - features.mean(axis=1, keepdims=True)) / \
               (features.std(axis=1, keepdims=True) + 1e-8)

    return features  # shape: (39, T)

features = extract_mfcc_features("speech.wav")
print(f"MFCC feature shape: {features.shape}")

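2. Speech Recognition (ASR): Speech to Text

CTC (Connectionist Temporal Classification)

Traditional ASR was a three-part pipeline: acoustic model + language model + pronunciation lexicon. CTC greatly reduced this complexity.

Core ideas of CTC:

The input sequence (speech frames) and the output sequence (text) have different lengths
A special blank token is introduced to solve the alignment problem
Training sums the probabilities of all possible alignments (forward-backward algorithm)
Decoding collapses the output by merging consecutive identical labels and then removing blanks

CTC decoding example: a-a-blank-b → ab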
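
The collapse rule is small enough to write out directly. In this sketch the character `-` stands in for the blank token, which is a choice made for the example and not a convention of any particular toolkit.

```python
from itertools import groupby

BLANK = "-"  # stand-in for the CTC blank token (assumption for this example)


def ctc_collapse(frames, blank=BLANK):
    """Merge consecutive repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frames)]
    return "".join(label for label in merged if label != blank)


print(ctc_collapse(["a", "a", "-", "b"]))       # ab
print(ctc_collapse(["a", "-", "a", "b", "b"]))  # aab
```

The second call shows why the blank is needed: without a blank between the two `a` frames, a genuinely doubled letter could not be told apart from one letter held over several frames.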

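Seq2Seq with Attention

Unlike CTC, attention-based Seq2Seq models learn an implicit language model, because the decoder conditions each output on the tokens it has already emitted.

Encoder: a BiLSTM or Transformer turns the speech frames into a sequence of context vectors
Attention: at each decoding step, decides which part of the encoder output to focus on
Decoder: generates the next token from the previous output tokens plus the attention context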
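
One decoding step of the attention mechanism can be sketched in a few lines. This is plain dot-product attention over toy vectors, assuming the decoder state and the encoder outputs share the same dimension; real models add learned projections and scaling.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_outputs):
    """Return (attention weights, context vector) for one decoder step."""
    scores = [sum(q * h for q, h in zip(query, h_t)) for h_t in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [
        sum(w * h_t[d] for w, h_t in zip(weights, encoder_outputs))
        for d in range(dim)
    ]
    return weights, context


# Toy values: 3 encoder frames with 2-dimensional outputs
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_outputs)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

Frames whose encoder output points in the same direction as the query receive most of the weight, and the context vector is the weighted average the decoder consumes.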

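Whisper Architecture in Detail

OpenAI Whisper is an encoder-decoder Transformer trained on 680,000 hours of multilingual speech.

Architecture details:

Audio encoder: 30-second chunk → 80-channel log-Mel spectrogram (128 channels in large-v3) → two Conv1D layers → Transformer encoder
Text decoder: attends to the audio context through cross-attention → autoregressive generation
Special tokens: language tokens, task tokens (transcribe/translate), timestamp tokens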
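
The encoder front end implies a fixed sequence length, which the arithmetic below makes explicit. The constants (16 kHz input, 10 ms hop, stride-2 second convolution) reflect the published Whisper configuration as far as we know; treat them as assumptions and check them against the version you use.

```python
SAMPLE_RATE = 16_000   # Hz, input audio
CHUNK_SECONDS = 30     # fixed chunk length fed to the encoder
HOP_LENGTH = 160       # samples between Mel frames (10 ms)
CONV_STRIDE = 2        # assumed stride of the second Conv1D layer

n_samples = SAMPLE_RATE * CHUNK_SECONDS              # samples per chunk
n_mel_frames = n_samples // HOP_LENGTH               # log-Mel frames per chunk
n_encoder_positions = n_mel_frames // CONV_STRIDE    # positions seen by the Transformer
ms_per_position = 1000 * CHUNK_SECONDS / n_encoder_positions

print(n_samples, n_mel_frames, n_encoder_positions, ms_per_position)
# 480000 3000 1500 20.0
```

Under these assumptions each encoder position covers 20 ms of audio, which is also the granularity at which timestamp tokens can point into the chunk.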

import whisper
import json

def transcribe_with_timestamps(audio_path, model_size="large-v3", language="ko"):
    """
    Transcribe with Whisper, including timestamps.
    Model sizes: tiny, base, small, medium, large, large-v3
    """
    model = whisper.load_model(model_size)

    result = model.transcribe(
        audio_path,
        language=language,
        task="transcribe",           # switch to "translate" for English translation
        word_timestamps=True,        # per-word timestamps
        condition_on_previous_text=True,
        temperature=0,
        beam_size=5,
        verbose=False
    )

    print(f"Full transcript: {result['text']}")
    print(f"Language: {result['language']}")

    for seg in result["segments"]:
        start = f"{seg['start']:.2f}s"
        end   = f"{seg['end']:.2f}s"
        text  = seg["text"].strip()
        print(f"[{start} -> {end}] {text}")

        if "words" in seg:
            for word in seg["words"]:
                ws = f"{word['start']:.2f}s"
                we = f"{word['end']:.2f}s"
                print(f"  {word['word']}: {ws}~{we}")

    return result

result = transcribe_with_timestamps("meeting.wav", model_size="large-v3", language="ko")

with open("transcript.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)

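Characteristics of Korean and Japanese ASR

Challenges in Korean ASR:

Agglutinative morphology: verb endings vary widely (먹다/먹어/먹었다/먹겠다, all forms of "to eat")
Liaison: "국어" is pronounced [구거]
Tensification and aspiration: complex phonological rules must be handled
Irregular word spacing requires post-processing

Challenges in Japanese ASR:

Three writing systems (hiragana, katakana, and kanji) are mixed
Long and short vowels must be distinguished (okori vs. okorii)
Vowel devoicing (the vowels in す, き, and similar syllables are often left unvoiced)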

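3. Speech Synthesis (TTS): Text to Voice

Evolution: Tacotron 2 → FastSpeech 2 → VITS

Model	Architecture	Characteristics	Inference speed
Tacotron 2	Seq2Seq + Attention	Natural prosody, autoregressive decoding	Slow
FastSpeech 2	Non-autoregressive Transformer	Explicit duration/pitch/energy predictors	Fast
VITS	VAE + Normalizing Flow + GAN	Single end-to-end model, high audio quality	Medium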

Hands-On Synthesis with Coqui TTS (VITS and XTTS-v2)

from TTS.api import TTS
import torch

def synthesize_speech_ko(text, output_path, speed=1.0):
    """
    Korean speech synthesis with Coqui TTS
    Model: tts_models/ko/css10/vits
    (model name as given in the source; confirm it is listed by `tts --list_models`)
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/ko/css10/vits").to(device)

    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speed=speed,
    )
    print(f"Synthesis complete: {output_path}")

# Korean TTS example ("Hello. Speech AI technology has advanced remarkably.")
text_ko = "안녕하세요. 음성 AI 기술이 놀랍도록 발전했습니다."
synthesize_speech_ko(text_ko, "output_ko.wav")

# Multilingual TTS (XTTS-v2)
def synthesize_multilang(text, language, speaker_wav, output_path):
    """
    XTTS-v2: speaker voice cloning + multilingual synthesis
    language: "ko", "ja", "en", "zh-cn", etc.
    speaker_wav: reference recording of about 3 to 6 seconds
    """
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=text,
        language=language,
        speaker_wav=speaker_wav,
        file_path=output_path,
    )

synthesize_multilang(
    # Korean input: "A cloned voice can speak in multiple languages."
    "목소리를 복제하여 다국어로 말할 수 있습니다.",
    language="ko",
    speaker_wav="reference_voice.wav",
    output_path="cloned_ko.wav"
)

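VITS: A Breakthrough in End-to-End Speech Synthesis

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) generates waveforms directly from text.

Core components:

Posterior Encoder: target audio → latent variable z
Prior Encoder: text → prior distribution (including a normalizing flow)
Decoder (HiFi-GAN): latent variable z → waveform
Stochastic Duration Predictor: predicts the duration of each phoneme

The normalizing flow is an invertible mapping between the complex latent distribution of speech and a simple Gaussian. This makes the text-conditioned prior more expressive while keeping the likelihood exactly computable, which helps training. Compared with Tacotron 2, VITS supports parallel inference and needs no external vocoder.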
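
The exact-likelihood property comes from the change-of-variables formula. The sketch below uses a single one-dimensional affine transform as the "flow", which is far simpler than the coupling layers in VITS but follows the same rule: log p(x) = log N(z) + log |dz/dx|.

```python
import math


def std_normal_logpdf(z):
    """Log-density of the standard normal distribution."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, scale, shift):
    """log p(x) when x = scale * z + shift and z ~ N(0, 1)."""
    z = (x - shift) / scale            # inverse transform
    log_det = -math.log(abs(scale))    # log |dz/dx|
    return std_normal_logpdf(z) + log_det


def gaussian_logpdf(x, mean, std):
    """Closed-form reference: log-density of N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


print(affine_flow_logpdf(1.3, scale=2.0, shift=0.5))
print(gaussian_logpdf(1.3, mean=0.5, std=2.0))
```

Both lines print the same value, since an affine transform of a standard normal is just another Gaussian. Stacking many invertible layers keeps the same bookkeeping while allowing much richer distributions.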

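4. Speaker Recognition: Who Is Speaking

x-vector and ECAPA-TDNN

i-vector (traditional method):

Models the total variability space on top of a GMM-UBM
Produces a fixed-length speaker embedding
Shallow features with limited discriminative power

x-vector (deep learning method):

Based on a TDNN (Time Delay Neural Network)
Frame-level features → statistics pooling → segment-level embedding
Scored with PLDA (Probabilistic Linear Discriminant Analysis)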
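
Statistics pooling is what lets utterances of any length map to an embedding of fixed size. The sketch below pools toy frame vectors into a mean-plus-standard-deviation vector and compares two utterances with cosine similarity. Cosine scoring is swapped in here for simplicity; it is not PLDA.

```python
import math


def stats_pooling(frames):
    """Map T frame vectors of size D to one vector of size 2 * D (means, then stds)."""
    t, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / t for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / t)
        for d in range(dim)
    ]
    return means + stds


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy frame-level features: 3 frames and 4 frames, both 2-dimensional
utt_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
utt_b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [3.0, 4.0]]

emb_a, emb_b = stats_pooling(utt_a), stats_pooling(utt_b)
print(len(emb_a), len(emb_b), round(cosine_similarity(emb_a, emb_b), 4))
```

In a real x-vector network the pooled vector passes through further layers before it is used as the speaker embedding.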

ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation):

Multi-scale feature aggregation
Channel attention improves speaker discrimination
Took first place in the 2020 VoxCeleb challenge

Speaker Diarization with pyannote.audio

from pyannote.audio import Pipeline
import torch

def speaker_diarization(audio_path, hf_token, num_speakers=None):
    """
    Speaker diarization: find out who spoke when
    """
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token
    )

    if torch.cuda.is_available():
        pipeline = pipeline.to(torch.device("cuda"))

    params = {}
    if num_speakers:
        params["num_speakers"] = num_speakers
    else:
        params["min_speakers"] = 1
        params["max_speakers"] = 10

    diarization = pipeline(audio_path, **params)

    segments = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        seg = {
            "start": round(turn.start, 3),
            "end": round(turn.end, 3),
            "speaker": speaker
        }
        segments.append(seg)
        print(f"[{seg['start']:.3f}s ~ {seg['end']:.3f}s] {speaker}")

    with open("diarization.rttm", "w") as rttm:
        diarization.write_rttm(rttm)

    speakers_found = len(set(s["speaker"] for s in segments))
    print(f"\nDetected {speakers_found} speaker(s) in total")
    return segments

segments = speaker_diarization("meeting.wav", hf_token="YOUR_HF_TOKEN")

Combining Whisper and pyannote: Per-Speaker Transcription

import whisper
from pyannote.audio import Pipeline
import torch

def diarize_and_transcribe(audio_path, hf_token, language="ko"):
    """Combine speaker diarization with speech recognition"""
    # 1. Speaker diarization
    diarization_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token
    )
    diarization = diarization_pipeline(audio_path)

    # 2. Whisper transcription
    asr_model = whisper.load_model("large-v3")
    asr_result = asr_model.transcribe(audio_path, language=language,
                                       word_timestamps=True)

    # 3. Map word timestamps to speaker turns (by word midpoint)
    words_with_speakers = []
    for seg in asr_result["segments"]:
        for word in seg.get("words", []):
            word_mid = (word["start"] + word["end"]) / 2
            speaker = "UNKNOWN"
            for turn, _, spk in diarization.itertracks(yield_label=True):
                if turn.start <= word_mid <= turn.end:
                    speaker = spk
                    break
            words_with_speakers.append({
                "word": word["word"],
                "start": word["start"],
                "end": word["end"],
                "speaker": speaker
            })

    # 4. Group consecutive words by speaker
    grouped = []
    current_speaker = None
    current_text = []

    for item in words_with_speakers:
        if item["speaker"] != current_speaker:
            if current_text:
                grouped.append({
                    "speaker": current_speaker,
                    "text": "".join(current_text).strip()
                })
            current_speaker = item["speaker"]
            current_text = [item["word"]]
        else:
            current_text.append(item["word"])

    if current_text:
        grouped.append({"speaker": current_speaker,
                        "text": "".join(current_text).strip()})

    for entry in grouped:
        print(f"[{entry['speaker']}]: {entry['text']}")

    return grouped
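
Midpoint matching takes the first turn that contains the word's midpoint, which can pick the wrong speaker when turns overlap. A possible alternative, sketched here with a hypothetical helper `assign_speaker` and plain `(start, end, speaker)` tuples instead of pyannote objects, is to choose the turn with the largest temporal overlap.

```python
def assign_speaker(word_start, word_end, turns, default="UNKNOWN"):
    """Pick the speaker whose turn overlaps the word the most.

    turns: iterable of (start, end, speaker) tuples, times in seconds.
    """
    best_speaker, best_overlap = default, 0.0
    for start, end, speaker in turns:
        overlap = min(word_end, end) - max(word_start, start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    return best_speaker


# Two overlapping turns: the word spans 1.7 s to 2.3 s
turns = [(0.0, 2.0, "SPEAKER_00"), (1.8, 4.0, "SPEAKER_01")]
print(assign_speaker(1.7, 2.3, turns))  # SPEAKER_01
print(assign_speaker(5.0, 6.0, turns))  # UNKNOWN
```

Here the word overlaps the first turn by 0.3 s and the second by 0.5 s, so it goes to `SPEAKER_01`, whereas the midpoint rule (2.0 s) would have matched the first turn.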

5. Music AI: AudioCraft and Creative Sound

Generating Music with MusicGen

Meta's AudioCraft includes MusicGen, which generates music from text descriptions, and AudioGen, which generates general audio.

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torch

def generate_music(descriptions, duration=10, model_size="medium"):
    """
    Generate music from text descriptions
    Model sizes: small (300M), medium (1.5B), large (3.3B), melody
    """
    model = MusicGen.get_pretrained(f"facebook/musicgen-{model_size}")
    model.set_generation_params(
        duration=duration,     # in seconds (up to 30 s)
        temperature=1.0,       # controls sampling randomness
        top_k=250,
        cfg_coef=3.0,          # classifier-free guidance strength
    )

    wav = model.generate(descriptions)
    # shape: (batch, channels, samples)

    for i, (desc, audio) in enumerate(zip(descriptions, wav)):
        audio_write(
            f"music_{i}",
            audio.cpu(),
            model.sample_rate,
            strategy="loudness",
            loudness_compressor=True
        )
        print(f"Generated: music_{i}.wav")
        print(f"  Description: {desc}")

# Generate a variety of styles
descriptions = [
    "Upbeat K-pop with synth leads, 120 BPM, energetic and bright",
    "Calm Korean traditional music with gayageum and daegeum, peaceful",
    "Epic orchestral trailer music with powerful drums and brass",
    "Lo-fi hip hop beats with jazz piano, for studying",
]

generate_music(descriptions, duration=15, model_size="medium")

Generating Sound Effects with AudioGen

from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

def generate_sound_effects(descriptions, duration=5):
    """Generate ambient sounds and sound effects"""
    model = AudioGen.get_pretrained("facebook/audiogen-medium")
    model.set_generation_params(duration=duration, temperature=1.0)

    wav = model.generate(descriptions)

    for i, (desc, audio) in enumerate(zip(descriptions, wav)):
        audio_write(f"sfx_{i}", audio.cpu(), model.sample_rate)
        print(f"Generated: sfx_{i}.wav ({desc})")

sound_descriptions = [
    "Rain falling on a tin roof with distant thunder",
    "Busy city street with cars and people talking",
    "Birds chirping in a forest at dawn",
    "Keyboard typing in a quiet office",
]

generate_sound_effects(sound_descriptions, duration=8)

Separating Music with Demucs

import subprocess
import os

def separate_audio_tracks(audio_path, output_dir="separated", model="htdemucs"):
    """
    Separate a song into vocals/drums/bass/other stems
    Models: htdemucs (4-stem), htdemucs_6s (6-stem)
    """
    os.makedirs(output_dir, exist_ok=True)

    cmd = [
        "python", "-m", "demucs",
        "--name", model,
        "--out", output_dir,
        "--mp3",
        "--mp3-bitrate", "320",
        audio_path
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode == 0:
        base_name = os.path.splitext(os.path.basename(audio_path))[0]
        stems_dir = os.path.join(output_dir, model, base_name)
        stems = os.listdir(stems_dir)
        print(f"Separation succeeded. Stems: {stems}")
        # htdemucs: vocals.mp3, drums.mp3, bass.mp3, other.mp3
        # htdemucs_6s: + guitar.mp3, piano.mp3
    else:
        print(f"Error: {result.stderr}")

    return output_dir

separate_audio_tracks("song.mp3", model="htdemucs_6s")

6. Real-Time Processing: Streaming ASR Pipeline

A look-ahead window creates an important trade-off in streaming ASR: the more future audio the model sees, the higher the accuracy, but latency grows along with it.
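
The trade-off can be made concrete by slicing a signal into chunks that each carry some right-hand context. The generator and the latency helper below are illustrative names, not part of any ASR library, and the latency figure ignores queuing and network delays.

```python
def chunks_with_lookahead(n_samples, chunk, look_ahead):
    """Yield (start, end, context_end) sample indices for each chunk.

    The model would decode samples [start, end) while also seeing
    the look-ahead samples [end, context_end).
    """
    for start in range(0, n_samples, chunk):
        end = min(start + chunk, n_samples)
        context_end = min(end + look_ahead, n_samples)
        yield start, end, context_end


def min_latency_seconds(chunk_s, look_ahead_s, compute_s=0.0):
    """Earliest time after a chunk starts at which its result can be emitted."""
    return chunk_s + look_ahead_s + compute_s


sr = 16_000
windows = list(chunks_with_lookahead(5 * sr, chunk=2 * sr, look_ahead=sr // 2))
print(windows)
print(min_latency_seconds(2.0, 0.5))  # 2.5
```

Doubling the look-ahead gives every chunk more future context but pushes each result back by the same amount, which is the balance the section describes.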

import sounddevice as sd
import numpy as np
import whisper
import queue
import threading
from collections import deque

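class RealTimeASR:
    """
    Real-time streaming speech recognition
    chunk_duration and look_ahead set the accuracy/latency balance
    (in this simplified version, look_ahead only affects the rolling buffer size)
    """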
    def __init__(self, model_size="small", language="ko",
                 chunk_duration=2.0, look_ahead=0.5, sample_rate=16000):
        self.model = whisper.load_model(model_size)
        self.language = language
        self.chunk_duration = chunk_duration
        self.look_ahead = look_ahead
        self.sample_rate = sample_rate
        self.audio_queue = queue.Queue()
        max_buf = int(sample_rate * (chunk_duration + look_ahead) * 3)
        self.buffer = deque(maxlen=max_buf)
        self.is_running = False

    def audio_callback(self, indata, frames, time, status):
        if status:
            print(f"Audio status: {status}")
        self.audio_queue.put(indata.copy())

    def process_audio(self):
        chunk_samples = int(self.sample_rate * self.chunk_duration)

        while self.is_running:
            audio_chunk = []
            while len(audio_chunk) < chunk_samples:
                try:
                    data = self.audio_queue.get(timeout=0.1)
                    audio_chunk.extend(data.flatten())
                except queue.Empty:
                    break

            if len(audio_chunk) < chunk_samples // 2:
                continue

            self.buffer.extend(audio_chunk)
            audio_array = np.array(list(self.buffer), dtype=np.float32)
            audio_array = audio_array / (np.max(np.abs(audio_array)) + 1e-8)

            result = self.model.transcribe(
                audio_array,
                language=self.language,
                temperature=0,
                no_speech_threshold=0.6,
            )

            if result["text"].strip():
                print(f"\rRecognized: {result['text'].strip()}", end="", flush=True)

    def start(self):
        self.is_running = True
        t = threading.Thread(target=self.process_audio, daemon=True)
        t.start()

        print("Microphone active. Start speaking (Ctrl+C to stop)")
        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
            blocksize=int(self.sample_rate * 0.1),
            callback=self.audio_callback
        ):
            try:
                while True:
                    sd.sleep(100)
            except KeyboardInterrupt:
                self.is_running = False
                print("\nRecognition stopped")

# Run
asr = RealTimeASR(model_size="small", language="ko",
                  chunk_duration=2.0, look_ahead=0.5)
asr.start()

7. Practical Application: Automatic Meeting Summarization System

import whisper
from pyannote.audio import Pipeline
from openai import OpenAI

def auto_meeting_summary(audio_path, hf_token, openai_key, language="ko"):
    """
    Automatic meeting summary: speaker diarization + transcription + LLM summarization
    """
    client = OpenAI(api_key=openai_key)

    # 1. Speaker diarization
    print("Running speaker diarization...")
    diarization_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token
    )
    diarization = diarization_pipeline(audio_path)

    # 2. Transcription
    print("Running speech recognition...")
    asr_model = whisper.load_model("large-v3")
    result = asr_model.transcribe(audio_path, language=language,
                                   word_timestamps=True)

    # 3. Map speakers to text and build the dialogue transcript
    lines = []
    current_speaker = None
    current_words = []

    for seg in result["segments"]:
        for word in seg.get("words", []):
            mid = (word["start"] + word["end"]) / 2
            speaker = "UNKNOWN"
            for turn, _, spk in diarization.itertracks(yield_label=True):
                if turn.start <= mid <= turn.end:
                    speaker = spk
                    break

            if speaker != current_speaker:
                if current_words:
                    lines.append(f"[{current_speaker}]: {''.join(current_words).strip()}")
                current_speaker = speaker
                current_words = [word["word"]]
            else:
                current_words.append(word["word"])

    if current_words:
        lines.append(f"[{current_speaker}]: {''.join(current_words).strip()}")

    transcript = "\n".join(lines)

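    # 4. LLM summarization
    # The Korean prompt below asks for: key discussion points, decisions,
    # action items (with owners), and the next meeting date.
    # Note: only the first 4,000 characters of the transcript are sent.
    print("Generating summary...")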
    prompt = f"""다음 회의 전사록을 아래 항목으로 요약해주세요:

1. 주요 논의 사항
2. 결정된 사항
3. Action Items (담당자 포함)
4. 다음 회의 일정

전사록:
{transcript[:4000]}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )

    summary = response.choices[0].message.content
    print("\n=== Meeting Summary ===")
    print(summary)

    return {"transcript": transcript, "summary": summary}

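Quiz

Q1. How does the CTC (Connectionist Temporal Classification) loss make it possible to train on sequences without alignments?

Answer: CTC trains by summing the probabilities of all possible monotonic alignments between the input frame sequence and the output label sequence.

Explanation: When N text labels are produced from T input speech frames, the model emits either a label or the blank token at every frame. The final text is recovered with the collapse rule, which merges consecutive identical labels and then removes blanks. The forward-backward algorithm (dynamic programming) computes the sum over all alignments efficiently. As a result, training needs only text labels, with no phoneme boundaries or alignment annotations.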

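Q2. What is the perceptual reason that a Mel spectrogram suits speech recognition better than a linear spectrogram?

Answer: The human auditory system perceives frequency on a roughly logarithmic scale, and the Mel scale reflects this.

Explanation: The human cochlea resolves frequency finely in the low range and is comparatively insensitive in the high range. A Mel filterbank imitates this by placing filters densely at low frequencies and sparsely at high frequencies. As a result, 80 Mel bins represent speech-relevant information more compactly than the 257 linear bins of a 512-point FFT, and models learn the acoustic characteristics of speech more easily from them.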

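Q3. Why does end-to-end training come more naturally to VITS than to Tacotron 2 (with respect to normalizing flows)?

Answer: Through its normalizing flow, VITS models the complex probability distribution linking text and audio directly, and trains as a single model with no intermediate mel spectrogram stage between separate models.

Explanation: Tacotron 2 is a two-stage pipeline, text → mel spectrogram → waveform, so errors from each stage accumulate. VITS transforms the VAE latent space with a normalizing flow. A normalizing flow is a chain of invertible functions that maps a simple Gaussian to a complex speech distribution exactly, and it allows exact likelihood computation in the reverse direction. The text-conditioned prior and the audio posterior are thereby connected directly through the flow, which makes fully end-to-end training possible.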

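Q4. Why are x-vector speaker embeddings learned better with deep learning than i-vectors?

Answer: x-vectors are trained discriminatively and optimize the decision boundaries between speakers directly, whereas i-vectors are based on generative modeling.

Explanation: The i-vector approach models the speaker space with GMM-UBM statistics and a total variability matrix, and does not explicitly optimize discriminative boundaries for speaker identification. The x-vector approach (TDNN-based) learns speaker classification directly with softmax cross-entropy, so the boundaries between speakers are sharper. The statistics pooling layer of the TDNN converts variable-length utterances into fixed-size embeddings, and the nonlinearity of the network captures complex acoustic patterns of speakers. Given enough training data, x-vectors typically achieve a substantially lower EER (Equal Error Rate) than i-vectors.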

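Q5. In streaming ASR, what trade-off does the look-ahead window create between recognition accuracy and latency?

Answer: A larger look-ahead window improves accuracy by using future context, but it increases output latency by the same amount.

Explanation: Temporal context matters in speech. Whether "배" means stomach, as in "배가 고프다" ("I am hungry"), or boat, as in "배를 타다" ("to board a boat"), depends on the words that follow. With a look-ahead of 0 the model sees only the current chunk, so latency is minimal but recognition errors near chunk boundaries increase. A long look-ahead improves accuracy but delays the output accordingly. For real-time captioning systems, a look-ahead of 200 to 500 ms is a practical balance. Architectures such as CIF (Continuous Integrate-and-Fire) and Emformer improve this trade-off structurally.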

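Conclusion

Speech AI is a broad and deep field in which ASR, TTS, speaker recognition, and music generation are all connected. Multilingual transcription with Whisper, speaker diarization with pyannote, natural speech synthesis with VITS, and music creation with MusicGen: all of the tools introduced here are available as open source. By combining these components you can build a range of practical systems, such as call-center AI, meeting summarization, real-time translation, and accessibility tools.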

Speech & Audio AI Complete Guide: ASR, TTS, Whisper, Wav2Vec to Voice Synthesis

Speech and audio AI creates the most natural interface between humans and machines. From smartphone voice assistants to real-time translation systems and synthetic voices for virtual influencers, audio AI technology has woven itself into our everyday lives.

This guide takes you through the full audio AI ecosystem — from the physics of sound and digital signal processing to automatic speech recognition (ASR), text-to-speech (TTS), speaker diarization, and music generation — with practical Python code throughout.

1. Audio Signal Processing Fundamentals

Physical Properties of Sound

Sound is a pressure wave propagating through air. Understanding the following concepts is essential for digital audio processing.

Frequency: The number of vibrations per second, measured in Hz (hertz). The human hearing range spans roughly 20 Hz to 20,000 Hz. Low frequencies correspond to bass tones; high frequencies to treble.

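Amplitude: The magnitude of the wave, that is, the strength of the sound pressure. It is usually expressed in decibels (dB) relative to a reference level. In digital audio the reference is typically full scale (dBFS), so 0 dB is the loudest representable level and negative values indicate quieter sounds.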

Phase: The position of a waveform along the time axis. Two waves of the same frequency but different phases produce constructive or destructive interference when combined.

Harmonics: Frequency components at integer multiples of the fundamental frequency. They determine the timbre (tone color) of an instrument or voice.

Sampling Rate and Bit Depth

Sampling Rate: The number of audio samples captured per second (Hz). The Nyquist theorem states that to fully reconstruct a signal, you must sample at more than twice the highest frequency present.

CD quality: 44,100 Hz (44.1 kHz)
Video and professional audio: 48,000 Hz; high-resolution audio: 96,000 Hz, 192,000 Hz
Telephone quality: 8,000 Hz; wideband telephony: 16,000 Hz
Whisper default: 16,000 Hz
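
What happens when the Nyquist condition is violated can be shown with a few lines of arithmetic. The helper `alias_frequency` is a made-up name for this sketch; it assumes an ideal sampler with no anti-aliasing filter and a single pure tone.

```python
def alias_frequency(freq_hz: float, sample_rate: float) -> float:
    """Apparent frequency of a pure tone after sampling at sample_rate."""
    folded = freq_hz % sample_rate
    return min(folded, sample_rate - folded)


print(alias_frequency(3_000, 16_000))   # 3000  (below Nyquist: unchanged)
print(alias_frequency(10_000, 16_000))  # 6000  (folds back around 8 kHz)
print(alias_frequency(17_000, 16_000))  # 1000
```

This is why audio is low-pass filtered before being resampled to 16 kHz: content above 8 kHz would otherwise reappear as spurious lower frequencies.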

Bit Depth: The number of bits per sample — determines the dynamic range.

16-bit: 65,536 levels, 96 dB dynamic range (CD standard)
24-bit: 16,777,216 levels, 144 dB dynamic range
32-bit float: standard for deep learning pipelines
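
The dynamic range figures for integer formats follow from the number of quantization levels, roughly 6 dB per bit. A short check, using a helper name chosen for this example:

```python
import math


def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of integer PCM: 20 * log10(2 ** bit_depth)."""
    return 20.0 * math.log10(2 ** bit_depth)


for bits in (16, 24):
    print(bits, round(dynamic_range_db(bits), 1))
# 16 96.3
# 24 144.5
```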

librosa Overview

librosa is the core Python library for audio analysis.

pip install librosa soundfile matplotlib numpy scipy

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf

# Load audio (sr=None preserves original sample rate)
y, sr = librosa.load('audio.wav', sr=None)
print(f"Duration: {len(y)/sr:.2f} s")
print(f"Sample rate: {sr} Hz")
print(f"Samples: {len(y)}")
print(f"dtype: {y.dtype}")

# Resample to 16 kHz
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)

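# Stereo → mono: only needed if the file was loaded with mono=False,
# since librosa.load already downmixes to mono by default
y_mono = librosa.to_mono(y)  # (2, N) → (N,); a 1-D input is returned unchanged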

# Save
sf.write('output.wav', y_16k, 16000)

# Waveform visualization
plt.figure(figsize=(14, 4))
librosa.display.waveshow(y, sr=sr, alpha=0.8, color='steelblue')
plt.title('Waveform')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.savefig('waveform.png')
plt.show()

2. Audio Feature Extraction

Fourier Transform (FFT)

The Fourier transform converts a time-domain signal into the frequency domain, revealing which frequency components are present and how strongly.
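
Before reaching for a library FFT, it can help to see the transform written out. The sketch below is a naive O(N²) DFT in pure Python applied to a synthetic 100 Hz sine; the sample rate and length are toy values chosen so that the tone falls exactly on a frequency bin.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the naive DFT for bins 0 .. N/2."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


sr, n = 800, 80      # toy values: 0.1 s of audio at 800 Hz, bin spacing sr / n = 10 Hz
tone_hz = 100
signal = [math.sin(2 * math.pi * tone_hz * t / sr) for t in range(n)]

mags = dft_magnitudes(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(peak_bin * sr / n)  # 100.0
```

The FFT computes the same values in O(N log N) time, which is what makes it practical for real audio lengths.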

import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
import librosa

y, sr = librosa.load('audio.wav', sr=22050)

N  = len(y)
yf = fft(y)
xf = fftfreq(N, 1/sr)

# Keep only positive frequencies (Hermitian symmetry)
xf_pos = xf[:N//2]
yf_pos = np.abs(yf[:N//2])

plt.figure(figsize=(12, 4))
plt.plot(xf_pos, yf_pos)
plt.title('Frequency Spectrum (FFT)')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Magnitude')
plt.xlim(0, sr//2)
plt.yscale('log')
plt.grid(True)
plt.tight_layout()
plt.show()

Short-Time Fourier Transform (STFT)

The plain FFT gives a global average over the entire signal. The STFT applies FFT to short overlapping windows, producing a time-frequency representation that captures how spectral content evolves over time.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_fft      = 2048   # FFT size (frequency resolution)
hop_length = 512    # hop size
win_length = 2048   # window size

D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)

magnitude = np.abs(D)
phase     = np.angle(D)
D_db      = librosa.amplitude_to_db(magnitude, ref=np.max)

plt.figure(figsize=(14, 6))
librosa.display.specshow(
    D_db, sr=sr, hop_length=hop_length,
    x_axis='time', y_axis='hz', cmap='magma'
)
plt.colorbar(format='%+2.0f dB')
plt.title('STFT Spectrogram')
plt.tight_layout()
plt.savefig('stft_spectrogram.png')
plt.show()

print(f"STFT shape: {D.shape}")
print(f"Frequency resolution: {sr/n_fft:.2f} Hz")
print(f"Time resolution: {hop_length/sr*1000:.2f} ms")

Mel Spectrogram

The human auditory system perceives pitch on a logarithmic scale — more sensitive at low frequencies and less so at high frequencies. The Mel scale models this perceptual non-linearity. Mel spectrograms are the most widely used input representation for deep learning audio models.
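
The warping itself is a simple formula. The sketch below uses the common HTK-style definition, mel = 2595 · log10(1 + f / 700); note that some libraries default to a different variant, so the exact numbers are an assumption of this example.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style Hz → Mel conversion."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, fmin, fmax):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the Mel axis."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_center_frequencies(n_mels=10, fmin=0.0, fmax=8000.0)
print([round(c) for c in centers])
```

The printed centers are packed closely at low frequencies and spread out toward 8 kHz, which is the non-uniform resolution described above.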

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_mels     = 128
n_fft      = 2048
hop_length = 512

mel_spec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_mels=n_mels,
    n_fft=n_fft, hop_length=hop_length,
    fmin=0, fmax=sr//2, power=2.0
)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)

plt.figure(figsize=(14, 6))
librosa.display.specshow(
    mel_db, sr=sr, hop_length=hop_length,
    x_axis='time', y_axis='mel',
    fmax=sr//2, cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.savefig('mel_spectrogram.png')
plt.show()

print(f"Mel Spectrogram shape: {mel_spec.shape}")
# (n_mels, time_frames) → ~(128, 43 * duration_seconds), since 22050 / 512 ≈ 43 frames per second

MFCC (Mel-Frequency Cepstral Coefficients)

MFCCs are obtained by applying the Discrete Cosine Transform (DCT) to the log Mel filterbank outputs. They compactly represent the spectral envelope (timbre) and have been the standard features for speech recognition for decades.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_mfcc     = 40
hop_length = 512

mfccs       = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
mfcc_delta  = librosa.feature.delta(mfccs)
mfcc_delta2 = librosa.feature.delta(mfccs, order=2)

# Concatenated feature: 120-dimensional (40 + 40 + 40)
mfcc_combined = np.vstack([mfccs, mfcc_delta, mfcc_delta2])

fig, axes = plt.subplots(3, 1, figsize=(14, 8))
for ax, data, title in zip(
    axes,
    [mfccs, mfcc_delta, mfcc_delta2],
    ['MFCC', 'MFCC Delta (1st derivative)', 'MFCC Delta-Delta (2nd derivative)']
):
    librosa.display.specshow(data, x_axis='time', hop_length=hop_length, sr=sr, ax=ax)
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)

plt.tight_layout()
plt.savefig('mfcc.png')
plt.show()

# Fixed-length feature vector for classification
mfcc_feature = np.concatenate([np.mean(mfccs, axis=1), np.std(mfccs, axis=1)])
print(f"MFCC feature vector dim: {mfcc_feature.shape}")

Chromagram

A chromagram represents the energy distribution across the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B). It is widely used in music analysis for chord recognition and key detection.
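
The mapping from a frequency to one of the 12 pitch classes can be sketched directly. This assumes equal temperament with A4 = 440 Hz and uses the MIDI note-number convention (A4 = 69); the function name is chosen for this example.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz: float, a4_hz: float = 440.0) -> str:
    """Nearest equal-tempered pitch class for a frequency in Hz."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return PITCH_CLASSES[midi % 12]


print(pitch_class(440.0))   # A
print(pitch_class(261.63))  # C
print(pitch_class(880.0))   # A
```

A chromagram applies the same folding to every frequency bin, so energy from all octaves of a note accumulates in a single row.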

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('music.wav', sr=22050)
hop_length = 512

chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)
chroma_cqt  = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=hop_length)

fig, axes = plt.subplots(3, 1, figsize=(14, 10))
for ax, chroma, title in zip(
    axes,
    [chroma_stft, chroma_cqt, chroma_cens],
    ['Chroma STFT', 'Chroma CQT', 'Chroma CENS']
):
    librosa.display.specshow(
        chroma, y_axis='chroma', x_axis='time',
        hop_length=hop_length, sr=sr, cmap='coolwarm', ax=ax
    )
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)

plt.tight_layout()
plt.savefig('chromagram.png')
plt.show()

3. Automatic Speech Recognition (ASR)

Traditional ASR: HMM + GMM

Classic speech recognition combined Hidden Markov Models (HMMs) for phoneme sequence modeling with Gaussian Mixture Models (GMMs) for acoustic feature modeling. MFCC features are extracted from the audio, phoneme sequences are predicted by the HMM-GMM system, and a language model maps phoneme sequences to word sequences.

CTC (Connectionist Temporal Classification)

CTC enables end-to-end training when input and output sequences have different lengths, without requiring forced alignment. A blank token handles repeated characters and silences, allowing the model to learn directly from audio-text pairs.
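
The quantity CTC maximizes, the total probability of all alignments that collapse to the target, can be computed with a forward recursion. The sketch below implements that recursion in pure Python and checks it against brute-force enumeration on a tiny example. The probability table is invented, and index 0 is assumed to be the blank.

```python
from itertools import groupby, product

BLANK = 0  # assumed index of the blank symbol


def ctc_forward(probs, labels):
    """P(labels | probs), summed over all alignments via the forward algorithm."""
    ext = [BLANK]
    for lab in labels:
        ext += [lab, BLANK]          # blank-interleaved label sequence
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][BLANK]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for frame in probs[1:]:
        prev, alpha = alpha, [0.0] * len(ext)
        for s, sym in enumerate(ext):
            total = prev[s]
            if s >= 1:
                total += prev[s - 1]
            # Skipping a blank is allowed only between two different labels
            if s >= 2 and sym != BLANK and sym != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * frame[sym]
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def ctc_brute_force(probs, labels):
    """Same quantity by enumerating every frame-level path (exponential cost)."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        collapsed = [k for k, _ in groupby(path) if k != BLANK]
        if collapsed == list(labels):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total


# Invented per-frame distributions over {blank, 1, 2} for 4 frames
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.2, 0.3]]
print(ctc_forward(probs, [1, 2]), ctc_brute_force(probs, [1, 2]))
```

The two printed values agree, but the forward recursion needs only O(T · L) work, whereas enumeration grows exponentially with the number of frames.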

Wav2Vec 2.0

Facebook AI Research's Wav2Vec 2.0 uses self-supervised learning to learn powerful acoustic representations from large amounts of unlabeled audio. It can be fine-tuned with a small labeled dataset and achieved state-of-the-art results on standard English benchmarks when it was released.

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"
processor  = Wav2Vec2Processor.from_pretrained(model_name)
model      = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def transcribe_wav2vec2(audio_path):
    speech, sample_rate = torchaudio.load(audio_path)

    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        speech    = resampler(speech)

    speech = speech.mean(dim=0).numpy()  # downmix to mono: (channels, N) → (N,)
    inputs = processor(speech, sampling_rate=16000,
                       return_tensors="pt", padding=True).to(device)

    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription[0].lower()

text = transcribe_wav2vec2('speech.wav')
print(f"Transcription: {text}")

Whisper (OpenAI)

Whisper is OpenAI's large-scale multilingual ASR model released in 2022. Trained on 680,000 hours of multilingual audio, it supports 99 languages and delivers excellent accuracy out of the box — no fine-tuning required for most use cases.

Architecture: Encoder-Decoder Transformer

Encoder: converts audio to a Mel spectrogram and processes it through a transformer encoder
Decoder: autoregressively generates text tokens, including language detection and timestamps
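
The size of the encoder input follows directly from Whisper's audio front end. The sketch below reproduces that arithmetic using the constants from the reference implementation (16 kHz audio, 30-second windows, a 10 ms hop, 80 Mel bins; large-v3 uses 128). The function name is made up for this example.

```python
SAMPLE_RATE = 16_000   # Hz
CHUNK_SECONDS = 30     # Whisper pads or trims audio to 30-second windows
HOP_LENGTH = 160       # samples between frames (10 ms)
N_MELS = 80            # 128 for large-v3

def whisper_input_shape(sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS,
                        hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Return (mel bins, mel frames, encoder positions) for one window."""
    n_samples = sample_rate * chunk_seconds   # 480,000 samples
    n_frames = n_samples // hop_length        # one Mel frame every 10 ms
    encoder_positions = n_frames // 2         # the second conv layer has stride 2
    return n_mels, n_frames, encoder_positions

print(whisper_input_shape())  # (80, 3000, 1500)
```

Each encoder position therefore covers 20 ms of audio, which is also the resolution of the timestamp tokens the decoder can emit.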

Model sizes:

tiny: 39M parameters (fastest)
base: 74M parameters
small: 244M parameters
medium: 769M parameters
large-v3: 1,550M parameters (best accuracy)
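
A rough way to choose a model size is to estimate how much memory the weights alone will take. The sketch below multiplies the parameter counts listed above by the bytes per parameter for a given precision. It is a lower bound under a simplifying assumption: activations, the decoder cache, and framework overhead all come on top.

```python
WHISPER_PARAMS_M = {   # parameter counts in millions, from the list above
    "tiny": 39, "base": 74, "small": 244, "medium": 769, "large-v3": 1550,
}
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_size_mb(model, dtype="float16"):
    """Approximate size of the weights in (decimal) megabytes."""
    # millions of parameters * bytes per parameter = megabytes
    return WHISPER_PARAMS_M[model] * BYTES_PER_PARAM[dtype]

for name in WHISPER_PARAMS_M:
    print(f"{name:>9}: {weight_size_mb(name):>5} MB in float16, "
          f"{weight_size_mb(name, 'int8'):>5} MB in int8")
```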

import whisper
import numpy as np

# Load model (downloads automatically on first run)
model = whisper.load_model("base")  # or "small", "medium", "large-v3"

# Basic transcription
result = model.transcribe("speech.wav")
print("Transcript:", result["text"])
print("Detected language:", result["language"])

# Force a specific language
result_forced = model.transcribe("speech.wav", language="en", task="transcribe")

# With word-level timestamps
result_ts = model.transcribe("speech.wav", verbose=False, word_timestamps=True)

for segment in result_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")

# Translate non-English audio to English
result_translated = model.transcribe("german_speech.wav", task="translate")
print("English translation:", result_translated["text"])

# Microphone input (5-second clip)
def transcribe_from_microphone(duration=5, sample_rate=16000):
    import sounddevice as sd
    print(f"Recording for {duration} seconds...")
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype=np.float32)
    sd.wait()
    result = model.transcribe(audio.flatten(), language="en")
    print(f"You said: {result['text']}")

transcribe_from_microphone()

Faster-Whisper

faster-whisper reimplements Whisper using CTranslate2, achieving up to 4x faster inference with reduced memory usage.

from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16"  # "float16", "int8", "int8_float16"
)

segments, info = model.transcribe(
    "speech.wav",
    language="en",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

print(f"Language: {info.language}, confidence: {info.language_probability:.2f}")

full_text = ""
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
    full_text += segment.text
    if segment.words:
        for word in segment.words:
            print(f"  '{word.word}': {word.start:.2f}s - {word.end:.2f}s")

# Batch processing
def batch_transcribe(audio_files, output_dir):
    import os
    os.makedirs(output_dir, exist_ok=True)
    for audio_path in audio_files:
        name        = os.path.splitext(os.path.basename(audio_path))[0]
        output_path = os.path.join(output_dir, f"{name}.txt")
        segs, _     = model.transcribe(audio_path, language="en")
        with open(output_path, 'w', encoding='utf-8') as f:
            for seg in segs:
                f.write(f"[{seg.start:.2f} - {seg.end:.2f}] {seg.text}\n")
        print(f"Done: {audio_path} → {output_path}")

batch_transcribe(['interview1.wav', 'lecture.mp3'], './transcripts')

4. Text-to-Speech (TTS)

Deep Learning TTS Architectures

Tacotron 2: A sequence-to-sequence model that generates Mel spectrograms from text, coupled with a WaveNet vocoder. An attention mechanism aligns the text encoder output with the audio decoder.

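FastSpeech 2: A non-autoregressive model that generates all Mel frames in parallel, which makes inference much faster than autoregressive models such as Tacotron 2. (The speedups usually quoted come from the FastSpeech papers: the original FastSpeech reported about 38x faster end-to-end synthesis than an autoregressive Transformer TTS, and FastSpeech 2 reported about 3x faster training than FastSpeech.) A duration predictor replaces attention-based alignment, and pitch and energy are predicted explicitly from the text encoding rather than left implicit.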
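
The way a duration predictor removes the need for attention can be shown with the length regulator used in the FastSpeech family: each phoneme encoding is simply repeated for as many Mel frames as its predicted duration. The sketch below uses phoneme labels in place of encoder vectors, and the durations are made-up values.

```python
def length_regulate(phoneme_encodings, durations):
    """Repeat each phoneme encoding `duration` times to reach Mel-frame length."""
    if len(phoneme_encodings) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for encoding, n_frames in zip(phoneme_encodings, durations):
        frames.extend([encoding] * n_frames)
    return frames

phonemes = ["HH", "AH", "L", "OW"]   # stand-ins for encoder output vectors
durations = [2, 3, 1, 4]             # hypothetical predicted frame counts
expanded = length_regulate(phonemes, durations)
print(len(expanded), expanded)
```

Because the output length is known before decoding starts, all Mel frames can be generated in one parallel pass. Scaling the durations up or down also gives direct control over speaking rate.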

VITS: An end-to-end model combining variational inference with adversarial training. It merges the acoustic model and vocoder into a single network, yielding natural-sounding synthesis in one pass.

Edge TTS (Microsoft)

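edge-tts is a community-maintained Python package that calls the online text-to-speech service behind Microsoft Edge's Read Aloud feature. It produces high-quality neural voices and needs no API key, but it requires network access and is not an officially supported Microsoft API.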

import asyncio
import edge_tts

async def synthesize_with_edge_tts():
    # List available voices
    voices     = await edge_tts.list_voices()
    en_voices  = [v for v in voices if v['Locale'].startswith('en-')]
    print("English voices:")
    for v in en_voices[:5]:
        print(f"  {v['ShortName']} - {v['FriendlyName']} ({v['Gender']})")

    # Basic synthesis
    text = "Hello! This is a demonstration of Microsoft Edge TTS."
    communicate = edge_tts.Communicate(
        text,
        voice="en-US-AriaNeural",
        rate="+0%",     # speed: -50% to +100%
        volume="+0%",
        pitch="+0Hz"
    )
    await communicate.save("output_edge.mp3")
    print("Saved: output_edge.mp3")

    # With word-boundary subtitles
    async def synthesize_with_subs(text, voice, audio_out, srt_out):
        communicate = edge_tts.Communicate(text, voice)
        subs = edge_tts.SubMaker()
        with open(audio_out, "wb") as af:
            async for chunk in communicate.stream():
                if chunk["type"] == "audio":
                    af.write(chunk["data"])
                elif chunk["type"] == "WordBoundary":
                    subs.feed(chunk)
        with open(srt_out, "w", encoding="utf-8") as sf:
            sf.write(subs.get_srt())

    await synthesize_with_subs(
        "The quick brown fox jumps over the lazy dog.",
        "en-US-AriaNeural",
        "output_subs.mp3",
        "output_subs.srt"
    )

asyncio.run(synthesize_with_edge_tts())

Coqui TTS (Open Source)

from TTS.api import TTS
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# XTTS v2: multilingual zero-shot TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Basic synthesis
tts.tts_to_file(
    text="This is a demonstration of Coqui XTTS v2.",
    file_path="output_xtts.wav",
    language="en",
    speaker_wav="reference_voice.wav"  # short, clean clip of the target voice (about 6 s is suggested for XTTS v2)
)

# Voice cloning
def clone_voice(reference_audio, text, output_path, language="en"):
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_audio,
        language=language,
        split_sentences=True
    )
    print(f"Voice clone saved: {output_path}")

clone_voice(
    "my_voice_sample.wav",
    "This sentence is spoken in a cloned voice.",
    "cloned_output.wav"
)

# English TTS with Tacotron2
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to(device)
tts_en.tts_to_file(
    text="Hello, this is a text-to-speech demonstration using Tacotron 2.",
    file_path="output_tacotron.wav"
)

OpenVoice (Voice Cloning)

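# Install OpenVoice and MeloTTS from their GitHub repositories (see the OpenVoice README)
# and download the checkpoints_v2 files before running this example.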

from openvoice import se_extractor
from openvoice.api import ToneColorConverter
from melo.api import TTS
import os
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

tone_color_converter = ToneColorConverter(
    'checkpoints_v2/converter/config.json', device=device
)
tone_color_converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth')

# Generate base speech
tts_model  = TTS(language='EN', device=device)
speaker_id = tts_model.hps.data.spk2id['EN-US']
src_path   = 'tmp/output_base.wav'
os.makedirs(os.path.dirname(src_path), exist_ok=True)  # make sure the output directory exists

tts_model.tts_to_file(
    text="Today we will discuss advances in voice cloning technology.",
    speaker_id=speaker_id,
    output_path=src_path,
    speed=1.0
)

# Extract tone color from reference speaker
target_se, _ = se_extractor.get_se(
    'reference.wav', tone_color_converter,
    target_dir='processed', vad=False
)
source_se = torch.load('checkpoints_v2/base_speakers/ses/en-us.pth', map_location=device)

# Apply tone color conversion
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path='output_cloned.wav'
)
print("Voice clone complete: output_cloned.wav")

5. Speaker Diarization

Speaker diarization answers the question "who spoke when". It is essential for meeting transcription, interview analysis, and multi-speaker subtitle generation.

from pyannote.audio import Pipeline
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
pipeline.to(device)

def diarize_audio(audio_path, num_speakers=None):
    kwargs = {"num_speakers": num_speakers} if num_speakers else {}
    diarization = pipeline(audio_path, **kwargs)

    timeline = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        entry = {
            'start':    turn.start,
            'end':      turn.end,
            'speaker':  speaker,
            'duration': turn.end - turn.start
        }
        timeline.append(entry)
        print(f"  [{turn.start:.2f}s - {turn.end:.2f}s] {speaker}")

    return timeline, diarization

# Combined diarization + Whisper transcription
def diarize_and_transcribe(audio_path, hf_token):
    from faster_whisper import WhisperModel

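    use_cuda      = torch.cuda.is_available()
    whisper_model = WhisperModel(
        "medium",
        device="cuda" if use_cuda else "cpu",
        compute_type="float16" if use_cuda else "int8"  # float16 needs a GPU
    )
    segments, _   = whisper_model.transcribe(audio_path, word_timestamps=True)
    segments      = list(segments)  # the generator is lazy; this runs the transcription

    diarize_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    ).to(torch.device("cuda" if use_cuda else "cpu"))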
    diarization = diarize_pipeline(audio_path)

    def get_speaker(start, end):
        best_speaker, best_overlap = "Unknown", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, speaker
        return best_speaker

    result = []
    for segment in segments:
        speaker = get_speaker(segment.start, segment.end)
        entry   = {
            'start':   segment.start,
            'end':     segment.end,
            'speaker': speaker,
            'text':    segment.text.strip()
        }
        result.append(entry)
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {speaker}: {segment.text.strip()}")

    return result

timeline = diarize_and_transcribe("meeting.wav", "YOUR_HF_TOKEN")
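
The list returned by diarize_and_transcribe is usually post-processed before it is shown to a reader. The sketch below, which assumes entries shaped like the dictionaries built above, merges consecutive segments from the same speaker and totals speaking time per speaker. The 0.5-second gap threshold and the sample entries are illustrative choices.

```python
from collections import defaultdict

def merge_turns(entries, max_gap=0.5):
    """Merge consecutive entries from one speaker separated by <= max_gap seconds."""
    merged = []
    for entry in entries:
        last = merged[-1] if merged else None
        if (last and last["speaker"] == entry["speaker"]
                and entry["start"] - last["end"] <= max_gap):
            last["end"] = max(last["end"], entry["end"])
            last["text"] += " " + entry["text"]
        else:
            merged.append(dict(entry))  # copy so the input list is not modified
    return merged

def talk_time(entries):
    """Total speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry["speaker"]] += entry["end"] - entry["start"]
    return dict(totals)

sample = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00", "text": "Good morning."},
    {"start": 2.2, "end": 4.0, "speaker": "SPEAKER_00", "text": "Let's begin."},
    {"start": 4.5, "end": 6.0, "speaker": "SPEAKER_01", "text": "Sounds good."},
]
turns = merge_turns(sample)
for turn in turns:
    print(f"{turn['speaker']} [{turn['start']:.1f}s - {turn['end']:.1f}s]: {turn['text']}")
print(talk_time(sample))
```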

6. Speech Emotion Recognition

import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
import numpy as np

model_name       = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model            = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()

EMOTIONS = ['angry', 'calm', 'disgust', 'fearful',
            'happy', 'neutral', 'sad', 'surprised']

def predict_emotion(audio_path):
    speech, sr = torchaudio.load(audio_path)
    if sr != 16000:
        speech = torchaudio.transforms.Resample(sr, 16000)(speech)

    inputs = feature_extractor(
        speech.mean(dim=0).numpy(),  # downmix to mono; squeeze() would keep stereo as 2-D
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)

    top_probs, top_idx = torch.topk(probs, 3)
    print("Top-3 emotions:")
    for idx, prob in zip(top_idx[0], top_probs[0]):
        print(f"  {EMOTIONS[idx.item()]}: {prob.item():.4f}")

    predicted = EMOTIONS[probs.argmax().item()]
    print(f"Predicted: {predicted}")
    return predicted, dict(zip(EMOTIONS, probs[0].tolist()))

emotion, probs = predict_emotion('speech.wav')

# Feature-based emotion analysis
def analyze_emotion_features(audio_path):
    import librosa
    y, sr = librosa.load(audio_path, sr=16000)

    mfcc               = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    spectral_centroid  = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_rolloff   = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zero_crossing      = librosa.feature.zero_crossing_rate(y=y)
    rms                = librosa.feature.rms(y=y)
    pitch, mag         = librosa.piptrack(y=y, sr=sr)

    features = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.std(spectral_centroid)],
        [np.mean(spectral_rolloff)],
        [np.mean(zero_crossing)],
        [np.mean(rms)],
        [np.std(rms)]
    ])

    print(f"Feature vector dim: {features.shape}")
    return features

7. Music AI

MusicGen (Meta)

Meta AI's MusicGen generates music from text prompts, conditioning on descriptions of genre, instruments, mood, and tempo.
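
Generation time grows with clip length because MusicGen decodes audio tokens autoregressively. According to the AudioCraft README, the model works on a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. The sketch below turns that into a rough budget. It ignores the few extra steps added by codebook interleaving, and the function name is made up for this example.

```python
FRAME_RATE_HZ = 50     # EnCodec frames per second (AudioCraft README)
N_CODEBOOKS = 4        # tokens per frame
SAMPLE_RATE = 32_000   # output audio sample rate in Hz

def musicgen_budget(duration_s):
    """Approximate decoding cost for a clip of duration_s seconds."""
    steps = duration_s * FRAME_RATE_HZ
    return {
        "decoder_steps": steps,             # roughly one step per frame
        "tokens": steps * N_CODEBOOKS,      # total discrete audio tokens
        "samples": duration_s * SAMPLE_RATE,
    }

print(musicgen_budget(30))
```

A 30-second clip therefore needs on the order of 1,500 sequential decoder steps, which is why the duration setting dominates generation time.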

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torch

model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(
    duration=30,
    temperature=1.0,
    top_k=250,
    top_p=0.0,
    cfg_coef=3.0,
)

descriptions = [
    "upbeat electronic dance music with synthesizers and strong bass",
    "peaceful classical piano music with violin, gentle and romantic",
    "intense rock music with electric guitar and drums"
]

print("Generating music...")
wav = model.generate(descriptions)

for idx, (desc, one_wav) in enumerate(zip(descriptions, wav)):
    filename = f'generated_music_{idx}'
    audio_write(
        filename, one_wav.cpu(), model.sample_rate,
        strategy="loudness", loudness_compressor=True
    )
    print(f"Saved: {filename}.wav — '{desc}'")

# Melody-conditioned generation
model_melody = MusicGen.get_pretrained('facebook/musicgen-melody')
model_melody.set_generation_params(duration=15)

import torchaudio
melody, melody_sr = torchaudio.load('humming.wav')

wav_melody = model_melody.generate_with_chroma(
    descriptions=["full orchestral arrangement, epic and cinematic"],
    melody_wavs=melody.unsqueeze(0),
    melody_sample_rate=melody_sr
)
audio_write('melody_based', wav_melody[0].cpu(), model_melody.sample_rate,
            strategy="loudness")
print("Melody-conditioned generation complete.")

Music Genre Classification

import librosa
import numpy as np
import torch
import torch.nn as nn

GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

def extract_music_features(audio_path, sr=22050, duration=30):
    y, _ = librosa.load(audio_path, sr=sr, duration=duration)

    mfcc               = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    spectral_centroid  = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    spectral_rolloff   = librosa.feature.spectral_rolloff(y=y, sr=sr)
    spectral_contrast  = librosa.feature.spectral_contrast(y=y, sr=sr)
    zcr                = librosa.feature.zero_crossing_rate(y)
    chroma             = librosa.feature.chroma_stft(y=y, sr=sr)
    tempo, _           = librosa.beat.beat_track(y=y, sr=sr)

    feature_vec = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.mean(spectral_bandwidth)],
        [np.mean(spectral_rolloff)],
        [np.mean(spectral_contrast)],
        [np.mean(zcr)],
        [np.std(zcr)],
        np.mean(chroma, axis=1),
        [float(np.atleast_1d(tempo)[0])]  # beat_track may return tempo as a 1-element array
    ])
    return feature_vec

class GenreClassifier(nn.Module):
    def __init__(self, input_dim=59, num_classes=10):  # 59 = length of the extract_music_features() vector
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256),       nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128),       nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        return self.net(x)

def classify_genre(audio_path, model):
    # `model` must already be trained; eval() also lets BatchNorm accept a batch of one
    model.eval()
    feat = extract_music_features(audio_path)
    x    = torch.FloatTensor(feat).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    top_prob, top_idx = torch.topk(probs, 3)
    print("Top-3 genres:")
    for prob, idx in zip(top_prob, top_idx):
        print(f"  {GENRES[idx.item()]}: {prob.item():.4f}")
    return GENRES[probs.argmax().item()]
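
The first nn.Linear layer of the classifier must receive exactly as many inputs as extract_music_features produces. A quick tally of the blocks concatenated in that function, assuming librosa's default of 12 chroma bins, gives the number to use for input_dim:

```python
# (block name, number of values it contributes to the feature vector)
FEATURE_BLOCKS = [
    ("mfcc mean", 20),
    ("mfcc std", 20),
    ("spectral centroid mean", 1),
    ("spectral bandwidth mean", 1),
    ("spectral rolloff mean", 1),
    ("spectral contrast mean", 1),
    ("zero-crossing rate mean", 1),
    ("zero-crossing rate std", 1),
    ("chroma mean", 12),   # librosa's default number of chroma bins
    ("tempo", 1),
]

def feature_dim(blocks=FEATURE_BLOCKS):
    """Total length of the concatenated feature vector."""
    return sum(size for _, size in blocks)

print(feature_dim())  # 59
```

Checking `feat.shape[0]` against the model's input size before inference is a cheap guard against this kind of mismatch.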

8. Practical Audio AI Projects

Project 1: Real-time Subtitle System

import queue
import threading
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
import time

class RealtimeSubtitleSystem:
    def __init__(self, model_size="base", language="en"):
        print(f"Loading Whisper {model_size}...")
        self.model        = WhisperModel(model_size, device="cuda", compute_type="float16")
        self.language     = language
        self.audio_queue  = queue.Queue()
        self.is_running   = False
        self.sample_rate  = 16000
        self.chunk_secs   = 3

    def audio_callback(self, indata, frames, time_info, status):
        if status:
            print(f"Audio status: {status}")
        self.audio_queue.put(indata.copy())

    def transcription_worker(self):
        buffer     = np.array([], dtype=np.float32)
        chunk_size = self.sample_rate * self.chunk_secs

        while self.is_running or not self.audio_queue.empty():
            try:
                chunk  = self.audio_queue.get(timeout=0.1)
                buffer = np.append(buffer, chunk.flatten())

                if len(buffer) >= chunk_size:
                    audio_data = buffer[:chunk_size]
                    buffer     = buffer[chunk_size // 2:]   # 50% overlap (overlapping audio is transcribed twice)

                    segments, _ = self.model.transcribe(
                        audio_data, language=self.language,
                        vad_filter=True,
                        vad_parameters=dict(min_silence_duration_ms=300)
                    )
                    for seg in segments:
                        text = seg.text.strip()
                        if text:
                            ts = time.strftime("%H:%M:%S")
                            print(f"[{ts}] {text}")

            except queue.Empty:
                continue
            except Exception as e:
                print(f"Transcription error: {e}")

    def start(self):
        self.is_running = True
        t = threading.Thread(target=self.transcription_worker, daemon=True)
        t.start()

        print("Real-time subtitle system running (Ctrl+C to stop)")
        print("-" * 50)

        with sd.InputStream(
            samplerate=self.sample_rate, channels=1,
            dtype=np.float32, callback=self.audio_callback,
            blocksize=int(self.sample_rate * 0.1)
        ):
            try:
                while True:
                    time.sleep(0.1)
            except KeyboardInterrupt:
                print("\nStopping...")
                self.is_running = False

        t.join()
        print("System stopped.")

system = RealtimeSubtitleSystem(model_size="small", language="en")
system.start()
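
Because consecutive chunks overlap by 50%, the same words can appear at the end of one transcript and the start of the next. One simple way to hide the repetition is to drop the longest run of words that both transcripts share at the boundary. The helper below is a sketch of that idea and is not part of faster-whisper. It compares words case-insensitively and will miss overlaps where the two passes disagree on wording or punctuation.

```python
def new_text_only(previous, current, max_overlap_words=20):
    """Return `current` without the leading words that repeat the end of `previous`."""
    prev_words = previous.split()
    cur_words = current.split()
    limit = min(len(prev_words), len(cur_words), max_overlap_words)
    # Try the longest possible overlap first, then shrink.
    for n in range(limit, 0, -1):
        tail = [w.lower() for w in prev_words[-n:]]
        head = [w.lower() for w in cur_words[:n]]
        if tail == head:
            return " ".join(cur_words[n:])
    return current

print(new_text_only("the quick brown fox jumps",
                    "fox jumps over the lazy dog"))  # over the lazy dog
```

In the transcription worker, this would be applied to each new segment's text against the previously printed text before printing.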

Project 2: Voice Chatbot

import openai
import sounddevice as sd
import numpy as np
import tempfile
import os
import time
from faster_whisper import WhisperModel
import edge_tts
import asyncio
import soundfile as sf

class VoiceChatbot:
    def __init__(self):
        self.client       = openai.OpenAI()
        self.whisper      = WhisperModel("base", device="cpu", compute_type="int8")
        self.sample_rate  = 16000
        self.history      = []
        self.system_prompt = (
            "You are a helpful and knowledgeable AI assistant. "
            "Respond concisely and clearly in English."
        )

    def record_audio(self, duration=5):
        print(f"Recording... ({duration} s)")
        audio = sd.rec(int(duration * self.sample_rate),
                       samplerate=self.sample_rate, channels=1, dtype=np.float32)
        sd.wait()
        print("Done recording.")
        return audio.flatten()

    def speech_to_text(self, audio):
        # Reserve a temp path and close the handle before writing:
        # some platforms (notably Windows) refuse to reopen a file that is still open.
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            tmp = f.name
        try:
            sf.write(tmp, audio, self.sample_rate)
            segs, _ = self.whisper.transcribe(tmp, language="en", vad_filter=True)
            text = " ".join(s.text.strip() for s in segs)
        finally:
            os.unlink(tmp)
        return text.strip()

    def chat(self, user_text):
        self.history.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": self.system_prompt}] + self.history,
            max_tokens=300,
            temperature=0.7
        )
        assistant_text = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_text})
        if len(self.history) > 20:
            self.history = self.history[-20:]
        return assistant_text

    async def text_to_speech(self, text, output_path='response.mp3'):
        communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural", rate="+10%")
        await communicate.save(output_path)
        return output_path

    def play_audio(self, path):
        import subprocess
        if os.name == 'nt':
            os.startfile(path)
        elif hasattr(os, 'uname') and os.uname().sysname == 'Darwin':
            subprocess.run(['afplay', path])
        else:
            subprocess.run(['mpg123', path])

    def run(self):
        print("Voice chatbot started!")
        print("Say 'quit', 'exit', 'stop', or 'bye' to end the session.")
        print("=" * 50)

        while True:
            audio     = self.record_audio(duration=5)
            user_text = self.speech_to_text(audio)

            if not user_text:
                print("Could not understand audio. Please try again.")
                continue

            print(f"You: {user_text}")

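            # Match whole words so that e.g. "nonstop" does not end the session
            words = {w.strip(".,!?") for w in user_text.lower().split()}
            if words & {'quit', 'exit', 'stop', 'bye'}: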
                print("Goodbye!")
                break

            response = self.chat(user_text)
            print(f"AI: {response}")

            audio_path = asyncio.run(self.text_to_speech(response))
            self.play_audio(audio_path)
            time.sleep(0.5)

chatbot = VoiceChatbot()
chatbot.run()

Speech & Audio AI Learning Roadmap

Beginner

Audio feature extraction practice with librosa
Build a transcription prototype with the Whisper API
Create TTS applications with Edge TTS

Intermediate

Fine-tune Wav2Vec 2.0 for custom ASR
Build a speaker diarization pipeline with pyannote.audio
Implement a real-time voice chatbot

Advanced

Train a custom TTS model with VITS or XTTS
Build music generation applications with MusicGen
Combine emotion recognition with conversational AI

References

librosa documentation: https://librosa.org/doc/latest/
OpenAI Whisper: https://openai.com/research/whisper
Wav2Vec 2.0 (arXiv 2006.11477): https://arxiv.org/abs/2006.11477
Hugging Face Audio Course: https://huggingface.co/learn/audio-course/
Coqui TTS: https://github.com/coqui-ai/TTS
pyannote.audio: https://github.com/pyannote/pyannote-audio
faster-whisper: https://github.com/SYSTRAN/faster-whisper
AudioCraft (MusicGen): https://github.com/facebookresearch/audiocraft