Speech & Audio AI Complete Guide: ASR, TTS, Whisper, Wav2Vec to Voice Synthesis

Speech and audio AI creates the most natural interface between humans and machines. From smartphone voice assistants to real-time translation systems and synthetic voices for virtual influencers, audio AI technology has woven itself into our everyday lives.

This guide takes you through the full audio AI ecosystem — from the physics of sound and digital signal processing to automatic speech recognition (ASR), text-to-speech (TTS), speaker diarization, and music generation — with practical Python code throughout.

1. Audio Signal Processing Fundamentals

Physical Properties of Sound

Sound is a pressure wave propagating through air. Understanding the following concepts is essential for digital audio processing.

**Frequency**: The number of vibrations per second, measured in Hz (hertz). The human hearing range spans roughly 20 Hz to 20,000 Hz. Low frequencies correspond to bass tones; high frequencies to treble.

**Amplitude**: The magnitude of the wave, i.e. the strength of the sound pressure. Levels are expressed in decibels (dB) relative to a reference. In digital audio the usual scale is dBFS, where 0 dBFS is the loudest level the format can represent and negative values indicate quieter sounds.

**Phase**: The position of a waveform along the time axis. Two waves of the same frequency but different phases produce constructive or destructive interference when combined.

**Harmonics**: Frequency components at integer multiples of the fundamental frequency. They determine the timbre (tone color) of an instrument or voice.
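The effect of phase described above can be checked numerically. The sketch below is illustrative only: the helper name `peak_of_sum`, the 440 Hz tone, and the 8 kHz sample rate are assumptions chosen for the example. It adds two equal sine waves and reports the peak of the sum.

```python
import math

def peak_of_sum(freq_hz, phase_offset_rad, sample_rate=8000, duration_s=0.01):
    """Peak absolute value of sin(2*pi*f*t) + sin(2*pi*f*t + phase_offset)."""
    n_samples = int(sample_rate * duration_s)
    peak = 0.0
    for i in range(n_samples):
        t = i / sample_rate
        total = (math.sin(2 * math.pi * freq_hz * t)
                 + math.sin(2 * math.pi * freq_hz * t + phase_offset_rad))
        peak = max(peak, abs(total))
    return peak

in_phase = peak_of_sum(440.0, 0.0)          # constructive: amplitudes add
out_of_phase = peak_of_sum(440.0, math.pi)  # destructive: waves cancel

print(f"in phase:     peak = {in_phase:.3f}")
print(f"out of phase: peak = {out_of_phase:.3e}")
```

With no phase offset the peak approaches twice the single-wave amplitude; with an offset of π the two waves cancel almost exactly.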

Sampling Rate and Bit Depth

**Sampling Rate**: The number of audio samples captured per second (Hz). The Nyquist theorem states that to fully reconstruct a signal, you must sample at more than twice the highest frequency present.

- CD quality: 44,100 Hz (44.1 kHz)

- Professional audio and video: 48,000 Hz; high-resolution audio: 96,000 Hz and 192,000 Hz

- Telephone quality: 8,000 Hz; wideband telephony: 16,000 Hz

- Whisper default: 16,000 Hz
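To see why the Nyquist limit matters, the following sketch computes the frequency a pure tone appears to have after sampling. The function name `aliased_frequency` is made up for this illustration, and the model assumes an ideal sampler with no anti-aliasing filter.

```python
def aliased_frequency(f_hz, sample_rate_hz):
    """Apparent frequency of a pure tone at f_hz after sampling at sample_rate_hz."""
    # Sampling cannot distinguish f from f + k * sample_rate
    f = f_hz % sample_rate_hz
    # Frequencies above sample_rate / 2 fold back into [0, sample_rate / 2]
    return min(f, sample_rate_hz - f)

for tone in (3000, 7000, 10000):
    print(f"{tone} Hz sampled at 16 kHz appears as {aliased_frequency(tone, 16000)} Hz")
```

A 10 kHz tone sampled at 16 kHz lies above the 8 kHz Nyquist frequency, so it shows up as a 6 kHz tone. This is why audio is low-pass filtered before it is downsampled to 16 kHz for speech models.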

**Bit Depth**: The number of bits per sample — determines the dynamic range.

- 16-bit: 65,536 levels, 96 dB dynamic range (CD standard)

- 24-bit: 16,777,216 levels, 144 dB dynamic range

- 32-bit float: standard for deep learning pipelines
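The dynamic range figures in the list follow from the number of quantization levels. As a rough sketch, using the common 20·log10(2^bits) approximation and a helper name invented for this example:

```python
import math

def dynamic_range_db(bit_depth):
    """Approximate dynamic range of linear PCM: ratio of full scale to one step, in dB."""
    levels = 2 ** bit_depth
    return 20 * math.log10(levels)

for bits in (16, 24):
    print(f"{bits}-bit: {2 ** bits:,} levels, about {dynamic_range_db(bits):.1f} dB")
```

Each extra bit adds about 6.02 dB, which is where the rounded values of 96 dB and 144 dB come from.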

librosa Overview

librosa is the core Python library for audio analysis.

pip install librosa soundfile matplotlib numpy scipy

import librosa
import librosa.display
import matplotlib.pyplot as plt
import soundfile as sf

# Load audio (sr=None preserves the original sample rate)
y, sr = librosa.load('audio.wav', sr=None)
print(f"Duration: {len(y)/sr:.2f} s")
print(f"Sample rate: {sr} Hz")
print(f"Samples: {len(y)}")
print(f"dtype: {y.dtype}")

# Resample to 16 kHz
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)

# Stereo → mono: (2, N) → (N,). librosa.load already downmixes by default,
# so this only changes audio that was loaded with mono=False.
y_mono = librosa.to_mono(y)

# Save
sf.write('output.wav', y_16k, 16000)

# Waveform visualization
plt.figure(figsize=(14, 4))
librosa.display.waveshow(y, sr=sr, alpha=0.8, color='steelblue')
plt.title('Waveform')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.savefig('waveform.png')
plt.show()

2. Audio Feature Extraction

Fourier Transform (FFT)

The Fourier transform converts a time-domain signal into the frequency domain, revealing which frequency components are present and how strongly.
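Before using an optimized FFT routine, it can help to see the transform written out directly. The sketch below is a naive O(N²) discrete Fourier transform in pure Python, meant only as an illustration; the 50 Hz test tone and the 800 Hz sample rate are arbitrary choices for the example.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each non-negative-frequency DFT bin (naive O(N^2) version)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        magnitudes.append(abs(acc))
    return magnitudes

sample_rate = 800   # Hz
n = 80              # samples, so each bin is sample_rate / n = 10 Hz wide
tone_hz = 50
signal = [math.sin(2 * math.pi * tone_hz * i / sample_rate) for i in range(n)]

mags = dft_magnitudes(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / n
print(f"Strongest component: {peak_hz:.0f} Hz")
```

The largest magnitude falls in the bin that corresponds to the tone's frequency. `scipy.fft.fft` computes the same quantity far more efficiently.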

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from scipy.fft import fft, fftfreq

y, sr = librosa.load('audio.wav', sr=22050)

N = len(y)

yf = fft(y)

xf = fftfreq(N, 1/sr)

# Keep only the positive frequencies (the FFT of a real signal is Hermitian-symmetric)

xf_pos = xf[:N//2]

yf_pos = np.abs(yf[:N//2])

plt.figure(figsize=(12, 4))

plt.plot(xf_pos, yf_pos)

plt.title('Frequency Spectrum (FFT)')

plt.xlabel('Frequency (Hz)')

plt.ylabel('Magnitude')

plt.xlim(0, sr//2)

plt.yscale('log')

plt.grid(True)

plt.tight_layout()

plt.show()

Short-Time Fourier Transform (STFT)

The plain FFT gives a global average over the entire signal. The STFT applies FFT to short overlapping windows, producing a time-frequency representation that captures how spectral content evolves over time.
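The size of an STFT and its resolution follow directly from the window parameters. The sketch below assumes the framing convention that `librosa.stft` uses by default (`center=True`, where the signal is padded so that frames are centered on multiples of the hop length); the helper name `stft_shape` is hypothetical.

```python
def stft_shape(num_samples, n_fft=2048, hop_length=512):
    """(frequency bins, time frames) for a centered STFT of a real signal."""
    n_bins = 1 + n_fft // 2                 # non-negative frequencies only
    n_frames = 1 + num_samples // hop_length
    return n_bins, n_frames

sr = 22050
bins, frames = stft_shape(10 * sr)          # 10 seconds of audio
print(f"shape: ({bins}, {frames})")
print(f"frequency resolution: {sr / 2048:.2f} Hz per bin")
print(f"time step: {512 / sr * 1000:.2f} ms, about {sr / 512:.0f} frames per second")
```

A larger `n_fft` narrows the frequency bins but makes each frame cover more time, which is the basic time-frequency trade-off of the STFT.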

y, sr = librosa.load('speech.wav', sr=22050)

n_fft = 2048 # FFT size (frequency resolution)

hop_length = 512 # hop size

win_length = 2048 # window size

D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)

magnitude = np.abs(D)

phase = np.angle(D)

D_db = librosa.amplitude_to_db(magnitude, ref=np.max)

plt.figure(figsize=(14, 6))

librosa.display.specshow(

D_db, sr=sr, hop_length=hop_length,

x_axis='time', y_axis='hz', cmap='magma'

)

plt.colorbar(format='%+2.0f dB')

plt.title('STFT Spectrogram')

plt.tight_layout()

plt.savefig('stft_spectrogram.png')

plt.show()

print(f"STFT shape: {D.shape}")

print(f"Frequency resolution: {sr/n_fft:.2f} Hz")

print(f"Time resolution: {hop_length/sr*1000:.2f} ms")

Mel Spectrogram

The human auditory system perceives pitch on a logarithmic scale — more sensitive at low frequencies and less so at high frequencies. The Mel scale models this perceptual non-linearity. Mel spectrograms are the most widely used input representation for deep learning audio models.
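As an illustration of the non-linearity, the sketch below uses the HTK form of the Mel formula, m = 2595 · log10(1 + f / 700). This is one common definition and is an assumption here: librosa defaults to the slightly different Slaney variant unless `htk=True` is passed.

```python
import math

def hz_to_mel(f_hz):
    """Hz to Mel, HTK formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

low_step = hz_to_mel(200) - hz_to_mel(100)       # 100 Hz step at low frequency
high_step = hz_to_mel(8100) - hz_to_mel(8000)    # 100 Hz step at high frequency
print(f"1000 Hz is about {hz_to_mel(1000):.0f} mel")
print(f"100 -> 200 Hz spans {low_step:.1f} mel; 8000 -> 8100 Hz spans {high_step:.1f} mel")
```

The same 100 Hz step covers far more of the Mel axis at low frequencies than at high ones, so a Mel filterbank spends most of its filters on the range where hearing is most discriminating.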

y, sr = librosa.load('speech.wav', sr=22050)

n_mels = 128

n_fft = 2048

hop_length = 512

mel_spec = librosa.feature.melspectrogram(

y=y, sr=sr, n_mels=n_mels,

n_fft=n_fft, hop_length=hop_length,

fmin=0, fmax=sr//2, power=2.0

)

mel_db = librosa.power_to_db(mel_spec, ref=np.max)

plt.figure(figsize=(14, 6))

librosa.display.specshow(

mel_db, sr=sr, hop_length=hop_length,

x_axis='time', y_axis='mel',

fmax=sr//2, cmap='viridis'

)

plt.colorbar(format='%+2.0f dB')

plt.title('Mel Spectrogram')

plt.tight_layout()

plt.savefig('mel_spectrogram.png')

plt.show()

print(f"Mel Spectrogram shape: {mel_spec.shape}")

# (n_mels, time_frames) ≈ (128, 43 * duration_seconds), since sr / hop_length = 22050 / 512 ≈ 43 frames per second

MFCC (Mel-Frequency Cepstral Coefficients)

MFCCs are obtained by applying the Discrete Cosine Transform (DCT) to the log Mel filterbank outputs. They compactly represent the spectral envelope (timbre) and have been the standard features for speech recognition for decades.
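The final MFCC step can be sketched without any library. This is a simplified illustration, not librosa's exact computation: it uses an unnormalized DCT-II and the natural logarithm, whereas `librosa.feature.mfcc` works on a dB-scaled Mel spectrogram with an orthonormal DCT. The function names and the 26-band input are assumptions made for the example.

```python
import math

def dct_ii(values, n_coeffs):
    """First n_coeffs coefficients of the (unnormalized) DCT-II of values."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]

def mfcc_from_mel_energies(mel_energies, n_mfcc=13, floor=1e-10):
    """Log-compress one frame of Mel filterbank energies, then apply the DCT."""
    log_mel = [math.log(max(e, floor)) for e in mel_energies]
    return dct_ii(log_mel, n_mfcc)

# A perfectly flat spectrum has no envelope shape: only coefficient 0 is non-zero
flat_frame = [math.e] * 26
coeffs = mfcc_from_mel_energies(flat_frame)
print([round(c, 6) for c in coeffs[:4]])
```

Coefficient 0 tracks overall energy, while the higher coefficients describe progressively finer detail of the spectral envelope. Keeping only the first few is what makes the representation compact.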

y, sr = librosa.load('speech.wav', sr=22050)

n_mfcc = 40

hop_length = 512

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)

mfcc_delta = librosa.feature.delta(mfccs)

mfcc_delta2 = librosa.feature.delta(mfccs, order=2)

# Concatenated feature: 120-dimensional (40 + 40 + 40)
mfcc_combined = np.vstack([mfccs, mfcc_delta, mfcc_delta2])

fig, axes = plt.subplots(3, 1, figsize=(14, 8))
for ax, data, title in zip(
    axes,
    [mfccs, mfcc_delta, mfcc_delta2],
    ['MFCC', 'MFCC Delta (1st derivative)', 'MFCC Delta-Delta (2nd derivative)'],
):
    img = librosa.display.specshow(data, x_axis='time', hop_length=hop_length, sr=sr, ax=ax)
    ax.set_title(title)
    fig.colorbar(img, ax=ax)
plt.tight_layout()
plt.savefig('mfcc.png')
plt.show()

# Fixed-length feature vector for classification: per-coefficient mean and std (80 values)
mfcc_feature = np.concatenate([np.mean(mfccs, axis=1), np.std(mfccs, axis=1)])
print(f"MFCC feature vector dim: {mfcc_feature.shape}")

Chromagram

A chromagram represents the energy distribution across the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B). It is widely used in music analysis for chord recognition and key detection.
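The folding of frequencies into 12 pitch classes can be sketched as follows. The example assumes equal temperament with A4 tuned to 440 Hz, and `pitch_class` is a name invented for this illustration.

```python
import math

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(freq_hz, a4_hz=440.0):
    """Nearest pitch class for a frequency (equal temperament, A4 = a4_hz)."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))  # MIDI note 69 is A4
    return PITCH_CLASSES[midi_note % 12]                      # discard the octave

for f in (261.63, 329.63, 440.0, 880.0):
    print(f"{f:7.2f} Hz -> {pitch_class(f)}")
```

Because the octave is discarded, 440 Hz and 880 Hz land in the same bin. A chromagram sums spectral energy over all frequencies that share a pitch class, which is what makes it robust for chord and key analysis.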

y, sr = librosa.load('music.wav', sr=22050)

hop_length = 512

chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)

chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)

chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=hop_length)

fig, axes = plt.subplots(3, 1, figsize=(14, 10))
for ax, chroma, title in zip(
    axes,
    [chroma_stft, chroma_cqt, chroma_cens],
    ['Chroma STFT', 'Chroma CQT', 'Chroma CENS'],
):
    img = librosa.display.specshow(
        chroma, y_axis='chroma', x_axis='time',
        hop_length=hop_length, sr=sr, cmap='coolwarm', ax=ax
    )
    ax.set_title(title)
    fig.colorbar(img, ax=ax)
plt.tight_layout()
plt.savefig('chromagram.png')
plt.show()

3. Automatic Speech Recognition (ASR)

Traditional ASR: HMM + GMM

Classic speech recognition combined Hidden Markov Models (HMMs) for phoneme sequence modeling with Gaussian Mixture Models (GMMs) for acoustic feature modeling. MFCC features are extracted from the audio, phoneme sequences are predicted by the HMM-GMM system, and a language model maps phoneme sequences to word sequences.

CTC (Connectionist Temporal Classification)

CTC enables end-to-end training when input and output sequences have different lengths, without requiring forced alignment. A blank token handles repeated characters and silences, allowing the model to learn directly from audio-text pairs.
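The role of the blank token is easiest to see in greedy decoding. The sketch below assumes one predicted label per audio frame and uses `_` as the blank symbol; both the symbol and the function name are choices made for this example.

```python
from itertools import groupby

def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse a per-frame label sequence into an output string.

    Step 1: merge runs of identical labels.  Step 2: drop blanks.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)

print(ctc_greedy_decode("hh_e_ll_lll_oo"))  # blank between the l-runs keeps the double letter
print(ctc_greedy_decode("aaa"))             # no blank: the repeats merge into one character
```

Without a blank between the two runs of `l`, "hello" would collapse to "helo". The blank is what lets the model emit genuinely repeated characters.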

Wav2Vec 2.0

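Wav2Vec 2.0, from Facebook AI Research (now Meta AI), uses self-supervised learning to learn acoustic representations from large amounts of unlabeled audio. The pretrained model can then be fine-tuned on a small labeled dataset. The original paper reported state-of-the-art word error rates on LibriSpeech at the time, and usable accuracy with as little as ten minutes of labeled speech.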

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def transcribe_wav2vec2(audio_path):
    speech, sample_rate = torchaudio.load(audio_path)  # (channels, samples)
    speech = speech.mean(dim=0)                        # downmix to mono
    # The model was trained on 16 kHz audio
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        speech = resampler(speech)
    inputs = processor(speech.numpy(), sampling_rate=16000,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Greedy CTC decoding: best token per frame; batch_decode collapses repeats and blanks
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription[0].lower()

text = transcribe_wav2vec2('speech.wav')
print(f"Transcription: {text}")

Whisper (OpenAI)

Whisper is OpenAI's large-scale multilingual ASR model released in 2022. Trained on 680,000 hours of multilingual audio, it supports 99 languages and delivers excellent accuracy out of the box — no fine-tuning required for most use cases.

**Architecture**: Encoder-Decoder Transformer

- Encoder: converts audio to a Mel spectrogram and processes it through a transformer encoder

- Decoder: autoregressively generates text tokens, including language detection and timestamps

**Model sizes**:

- tiny: 39M parameters — fastest

- base: 74M

- small: 244M

- medium: 769M

- large-v3: 1,550M — best accuracy
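The parameter counts above give a quick lower bound on how much memory each checkpoint needs. This is a back-of-the-envelope sketch that counts the weights only; activations, the decoding cache, and framework overhead come on top, so real usage is higher. The dictionary and function names are made up for the example.

```python
# Parameter counts in millions, taken from the list above
WHISPER_PARAMS_M = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large-v3": 1550}
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(model_size, dtype="float16"):
    """Approximate memory taken by the model weights alone, in GB (10^9 bytes)."""
    return WHISPER_PARAMS_M[model_size] * 1e6 * BYTES_PER_PARAM[dtype] / 1e9

for name in WHISPER_PARAMS_M:
    print(f"{name:9s} fp32: {weight_memory_gb(name, 'float32'):5.2f} GB   "
          f"fp16: {weight_memory_gb(name, 'float16'):5.2f} GB")
```

Halving the precision halves the weight footprint, which is one reason quantized runtimes make the larger models practical on modest GPUs.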

# pip install openai-whisper sounddevice
import numpy as np
import sounddevice as sd
import whisper

# Load model (downloads automatically on first run)
model = whisper.load_model("base")  # or "small", "medium", "large-v3"

# Basic transcription
result = model.transcribe("speech.wav")
print("Transcript:", result["text"])
print("Detected language:", result["language"])

# Force a specific language
result_forced = model.transcribe("speech.wav", language="en", task="transcribe")

# Timestamps: every segment has start/end times; word_timestamps=True also
# adds per-word timing under segment["words"]
result_ts = model.transcribe("speech.wav", verbose=False, word_timestamps=True)
for segment in result_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")

# Translate non-English audio to English
result_translated = model.transcribe("german_speech.wav", task="translate")
print("English translation:", result_translated["text"])

# Microphone input (5-second clip)
def transcribe_from_microphone(duration=5, sample_rate=16000):
    print(f"Recording for {duration} seconds...")
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype=np.float32)
    sd.wait()  # block until the recording has finished
    result = model.transcribe(audio.flatten(), language="en")
    print(f"You said: {result['text']}")

transcribe_from_microphone()

Faster-Whisper

faster-whisper reimplements Whisper using CTranslate2, achieving up to 4x faster inference with reduced memory usage.

from faster_whisper import WhisperModel

model = WhisperModel(

"large-v3",

device="cuda",

compute_type="float16" # "float16", "int8", "int8_float16"

)

segments, info = model.transcribe(

"speech.wav",

language="en",

beam_size=5,

word_timestamps=True,

vad_filter=True,

vad_parameters=dict(min_silence_duration_ms=500)

)

print(f"Language: {info.language}, confidence: {info.language_probability:.2f}")

# segments is a lazy generator: transcription runs as it is consumed
full_text = ""
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
    full_text += segment.text
    if segment.words:
        for word in segment.words:
            print(f"  '{word.word}': {word.start:.2f}s - {word.end:.2f}s")

# Batch processing
import os

def batch_transcribe(audio_files, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for audio_path in audio_files:
        name = os.path.splitext(os.path.basename(audio_path))[0]
        output_path = os.path.join(output_dir, f"{name}.txt")
        segs, _ = model.transcribe(audio_path, language="en")
        with open(output_path, 'w', encoding='utf-8') as f:
            for seg in segs:
                f.write(f"[{seg.start:.2f} - {seg.end:.2f}] {seg.text}\n")
        print(f"Done: {audio_path} → {output_path}")

batch_transcribe(['interview1.wav', 'lecture.mp3'], './transcripts')

4. Text-to-Speech (TTS)

Deep Learning TTS Architectures

**Tacotron 2**: A sequence-to-sequence model that generates Mel spectrograms from text, coupled with a WaveNet vocoder. An attention mechanism aligns the text encoder output with the audio decoder.

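**FastSpeech 2**: A non-autoregressive model that generates all Mel frames in parallel, which makes inference much faster than autoregressive models such as Tacotron 2. A duration predictor replaces attention-based alignment, and a variance adaptor additionally predicts pitch and energy. The FastSpeech papers report a 38x end-to-end synthesis speedup over an autoregressive Transformer TTS baseline for the original FastSpeech, and about 3x faster training for FastSpeech 2 compared with FastSpeech.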

**VITS**: An end-to-end model combining variational inference with adversarial training. It merges the acoustic model and vocoder into a single network, yielding natural-sounding synthesis in one pass.
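The duration-based alignment used by FastSpeech-style models can be illustrated with a toy length regulator. This is a conceptual sketch only: strings stand in for the encoder's hidden vectors, the durations are made-up values, and in a real model they come from the trained duration predictor.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme's state for its predicted number of Mel frames."""
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        frames.extend([state] * n_frames)
    return frames

phonemes = ["HH", "AH", "L", "OW"]   # stand-ins for encoder outputs
durations = [2, 3, 1, 4]             # hypothetical predicted frame counts

expanded = length_regulate(phonemes, durations)
print(expanded)
print(f"{len(phonemes)} phonemes -> {len(expanded)} Mel frames")
```

Because the output length is fixed up front by the durations, the decoder can produce every frame in parallel instead of one step at a time. Scaling the durations is also how such models change speaking rate.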

Edge TTS (Microsoft)

The `edge-tts` Python package gives free access to the neural voices of Microsoft Edge's online text-to-speech service, with no API key required.

import asyncio

import edge_tts

async def synthesize_with_subs(text, voice, audio_out, srt_out):
    """Stream audio to a file and turn the boundary events into SRT subtitles."""
    communicate = edge_tts.Communicate(text, voice)
    subs = edge_tts.SubMaker()
    with open(audio_out, "wb") as audio_file:
        async for chunk in communicate.stream():
            if chunk["type"] == "audio":
                audio_file.write(chunk["data"])
            # Depending on the edge-tts version, timing arrives as word or sentence events
            elif chunk["type"] in ("WordBoundary", "SentenceBoundary"):
                subs.feed(chunk)
    with open(srt_out, "w", encoding="utf-8") as srt_file:
        srt_file.write(subs.get_srt())

async def synthesize_with_edge_tts():
    # List available voices
    voices = await edge_tts.list_voices()
    en_voices = [v for v in voices if v['Locale'].startswith('en-')]
    print("English voices:")
    for v in en_voices[:5]:
        print(f"  {v['ShortName']} - {v['FriendlyName']} ({v['Gender']})")

    # Basic synthesis
    text = "Hello! This is a demonstration of Microsoft Edge TTS."
    communicate = edge_tts.Communicate(
        text,
        voice="en-US-AriaNeural",
        rate="+0%",    # speed: -50% to +100%
        volume="+0%",
        pitch="+0Hz"
    )
    await communicate.save("output_edge.mp3")
    print("Saved: output_edge.mp3")

    # With subtitles
    await synthesize_with_subs(
        "The quick brown fox jumps over the lazy dog.",
        "en-US-AriaNeural",
        "output_subs.mp3",
        "output_subs.srt"
    )

asyncio.run(synthesize_with_edge_tts())

Coqui TTS (Open Source)

import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# XTTS v2: multilingual zero-shot TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Basic synthesis
tts.tts_to_file(
    text="This is a demonstration of Coqui XTTS v2.",
    file_path="output_xtts.wav",
    language="en",
    # Short, clean reference clip of the target voice
    # (the XTTS v2 model card suggests about 6 seconds)
    speaker_wav="reference_voice.wav"
)

# Voice cloning
def clone_voice(reference_audio, text, output_path, language="en"):
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_audio,
        language=language,
        split_sentences=True
    )
    print(f"Voice clone saved: {output_path}")

clone_voice(
    "my_voice_sample.wav",
    "This sentence is spoken in a cloned voice.",
    "cloned_output.wav"
)

# English TTS with Tacotron2
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to(device)
tts_en.tts_to_file(
    text="Hello, this is a text-to-speech demonstration using Tacotron 2.",
    file_path="output_tacotron.wav"
)

OpenVoice (Voice Cloning)

pip install openvoice melo-tts

import torch

from openvoice import se_extractor

from openvoice.api import ToneColorConverter

from melo.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

tone_color_converter = ToneColorConverter(

'checkpoints_v2/converter/config.json', device=device

)

tone_color_converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth')

# Generate base speech

tts_model = TTS(language='EN', device=device)

speaker_id = tts_model.hps.data.spk2id['EN-US']

src_path = 'tmp/output_base.wav'

tts_model.tts_to_file(

text="Today we will discuss advances in voice cloning technology.",

speaker_id=speaker_id,

output_path=src_path,

speed=1.0

)

# Extract tone color from reference speaker

target_se, _ = se_extractor.get_se(

'reference.wav', tone_color_converter,

target_dir='processed', vad=False

)

source_se = torch.load('checkpoints_v2/base_speakers/ses/en-us.pth', map_location=device)

# Apply tone color conversion

tone_color_converter.convert(

audio_src_path=src_path,

src_se=source_se,

tgt_se=target_se,

output_path='output_cloned.wav'

)

print("Voice clone complete: output_cloned.wav")

5. Speaker Diarization

Speaker diarization answers the question "who spoke when". It is essential for meeting transcription, interview analysis, and multi-speaker subtitle generation.

import torch
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
pipeline.to(device)

def diarize_audio(audio_path, num_speakers=None):
    kwargs = {"num_speakers": num_speakers} if num_speakers else {}
    diarization = pipeline(audio_path, **kwargs)
    timeline = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        entry = {
            'start': turn.start,
            'end': turn.end,
            'speaker': speaker,
            'duration': turn.end - turn.start
        }
        timeline.append(entry)
        print(f"  [{turn.start:.2f}s - {turn.end:.2f}s] {speaker}")
    return timeline, diarization

# Combined diarization + Whisper transcription
def diarize_and_transcribe(audio_path, hf_token):
    use_cuda = torch.cuda.is_available()
    whisper_model = WhisperModel(
        "medium",
        device="cuda" if use_cuda else "cpu",
        compute_type="float16" if use_cuda else "int8"
    )
    segments, _ = whisper_model.transcribe(audio_path, word_timestamps=True)
    segments = list(segments)  # the generator is lazy; run the transcription now

    diarize_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    ).to(device)
    diarization = diarize_pipeline(audio_path)

    def get_speaker(start, end):
        # Choose the speaker whose turn overlaps this segment the longest
        best_speaker, best_overlap = "Unknown", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, speaker
        return best_speaker

    result = []
    for segment in segments:
        speaker = get_speaker(segment.start, segment.end)
        entry = {
            'start': segment.start,
            'end': segment.end,
            'speaker': speaker,
            'text': segment.text.strip()
        }
        result.append(entry)
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {speaker}: {segment.text.strip()}")
    return result

timeline = diarize_and_transcribe("meeting.wav", "YOUR_HF_TOKEN")

6. Speech Emotion Recognition

import librosa
import numpy as np
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

model_name = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()

EMOTIONS = ['angry', 'calm', 'disgust', 'fearful',
            'happy', 'neutral', 'sad', 'surprised']

def predict_emotion(audio_path):
    speech, sr = torchaudio.load(audio_path)  # (channels, samples)
    speech = speech.mean(dim=0)               # downmix to mono
    if sr != 16000:
        speech = torchaudio.transforms.Resample(sr, 16000)(speech)
    inputs = feature_extractor(
        speech.numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, 3)
    print("Top-3 emotions:")
    for idx, prob in zip(top_idx[0], top_probs[0]):
        print(f"  {EMOTIONS[idx.item()]}: {prob.item():.4f}")
    predicted = EMOTIONS[probs.argmax().item()]
    print(f"Predicted: {predicted}")
    return predicted, dict(zip(EMOTIONS, probs[0].tolist()))

emotion, probs = predict_emotion('speech.wav')

# Feature-based emotion analysis
def analyze_emotion_features(audio_path):
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zero_crossing = librosa.feature.zero_crossing_rate(y=y)
    rms = librosa.feature.rms(y=y)
    pitch, mag = librosa.piptrack(y=y, sr=sr)  # pitch candidates; not added to the vector below
    # 40 MFCC means + 40 MFCC stds + 6 scalar statistics = 86 values
    features = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.std(spectral_centroid)],
        [np.mean(spectral_rolloff)],
        [np.mean(zero_crossing)],
        [np.mean(rms)],
        [np.std(rms)]
    ])
    print(f"Feature vector dim: {features.shape}")
    return features

7. Music AI

MusicGen (Meta)

Meta AI's MusicGen generates music from text prompts, conditioning on descriptions of genre, instruments, mood, and tempo.

import torchaudio
from audiocraft.models import MusicGen

from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-large')

model.set_generation_params(

duration=30,

temperature=1.0,

top_k=250,

top_p=0.0,

cfg_coef=3.0,

)

descriptions = [

"upbeat electronic dance music with synthesizers and strong bass",

"peaceful classical piano music with violin, gentle and romantic",

"intense rock music with electric guitar and drums"

]

print("Generating music...")

wav = model.generate(descriptions)

for idx, (desc, one_wav) in enumerate(zip(descriptions, wav)):
    filename = f'generated_music_{idx}'
    # audio_write appends the file extension and normalizes loudness
    audio_write(
        filename, one_wav.cpu(), model.sample_rate,
        strategy="loudness", loudness_compressor=True
    )
    print(f"Saved: {filename}.wav - '{desc}'")

# Melody-conditioned generation

model_melody = MusicGen.get_pretrained('facebook/musicgen-melody')

model_melody.set_generation_params(duration=15)

melody, melody_sr = torchaudio.load('humming.wav')

wav_melody = model_melody.generate_with_chroma(

descriptions=["full orchestral arrangement, epic and cinematic"],

melody_wavs=melody.unsqueeze(0),

melody_sample_rate=melody_sr

)

audio_write('melody_based', wav_melody[0].cpu(), model_melody.sample_rate,

strategy="loudness")

print("Melody-conditioned generation complete.")

Music Genre Classification

import librosa
import numpy as np
import torch
import torch.nn as nn

GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

def extract_music_features(audio_path, sr=22050, duration=30):
    y, _ = librosa.load(audio_path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    feature_vec = np.concatenate([
        np.mean(mfcc, axis=1),             # 20
        np.std(mfcc, axis=1),              # 20
        [np.mean(spectral_centroid)],      # 1
        [np.mean(spectral_bandwidth)],     # 1
        [np.mean(spectral_rolloff)],       # 1
        [np.mean(spectral_contrast)],      # 1
        [np.mean(zcr)],                    # 1
        [np.std(zcr)],                     # 1
        np.mean(chroma, axis=1),           # 12
        [float(np.atleast_1d(tempo)[0])]   # 1 (tempo may come back as a 1-element array)
    ])
    return feature_vec  # 59 values in total

class GenreClassifier(nn.Module):
    # input_dim must match the length of the feature vector above (59)
    def __init__(self, input_dim=59, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        return self.net(x)

def classify_genre(audio_path, model):
    """Classify one file with an already trained GenreClassifier."""
    model.eval()  # BatchNorm and Dropout must be in inference mode for a single sample
    feat = extract_music_features(audio_path)
    x = torch.FloatTensor(feat).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    top_prob, top_idx = torch.topk(probs, 3)
    print("Top-3 genres:")
    for prob, idx in zip(top_prob, top_idx):
        print(f"  {GENRES[idx.item()]}: {prob.item():.4f}")
    return GENRES[probs.argmax().item()]

8. Practical Audio AI Projects

Project 1: Real-time Subtitle System

import queue
import threading
import time

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

class RealtimeSubtitleSystem:
    def __init__(self, model_size="base", language="en"):
        print(f"Loading Whisper {model_size}...")
        self.model = WhisperModel(model_size, device="cuda", compute_type="float16")
        self.language = language
        self.audio_queue = queue.Queue()
        self.is_running = False
        self.sample_rate = 16000
        self.chunk_secs = 3

    def audio_callback(self, indata, frames, time_info, status):
        # Called by sounddevice on its audio thread for every captured block
        if status:
            print(f"Audio status: {status}")
        self.audio_queue.put(indata.copy())

    def transcription_worker(self):
        buffer = np.array([], dtype=np.float32)
        chunk_size = self.sample_rate * self.chunk_secs
        while self.is_running or not self.audio_queue.empty():
            try:
                chunk = self.audio_queue.get(timeout=0.1)
                buffer = np.append(buffer, chunk.flatten())
                if len(buffer) >= chunk_size:
                    audio_data = buffer[:chunk_size]
                    # 50% overlap: words near a chunk boundary may be printed twice
                    buffer = buffer[chunk_size // 2:]
                    segments, _ = self.model.transcribe(
                        audio_data, language=self.language,
                        vad_filter=True,
                        vad_parameters=dict(min_silence_duration_ms=300)
                    )
                    for seg in segments:
                        text = seg.text.strip()
                        if text:
                            ts = time.strftime("%H:%M:%S")
                            print(f"[{ts}] {text}")
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Transcription error: {e}")

    def start(self):
        self.is_running = True
        t = threading.Thread(target=self.transcription_worker, daemon=True)
        t.start()
        print("Real-time subtitle system running (Ctrl+C to stop)")
        print("-" * 50)
        with sd.InputStream(
            samplerate=self.sample_rate, channels=1,
            dtype=np.float32, callback=self.audio_callback,
            blocksize=int(self.sample_rate * 0.1)
        ):
            try:
                while True:
                    time.sleep(0.1)
            except KeyboardInterrupt:
                print("\nStopping...")
        # The stream is closed here; let the worker drain the queue and exit
        self.is_running = False
        t.join()
        print("System stopped.")

system = RealtimeSubtitleSystem(model_size="small", language="en")
system.start()

Project 2: Voice Chatbot

import asyncio
import os
import subprocess
import tempfile
import time

import edge_tts
import numpy as np
import openai
import sounddevice as sd
import soundfile as sf
from faster_whisper import WhisperModel

class VoiceChatbot:
    def __init__(self):
        self.client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
        self.whisper = WhisperModel("base", device="cpu", compute_type="int8")
        self.sample_rate = 16000
        self.history = []
        self.system_prompt = (
            "You are a helpful and knowledgeable AI assistant. "
            "Respond concisely and clearly in English."
        )

    def record_audio(self, duration=5):
        print(f"Recording... ({duration} s)")
        audio = sd.rec(int(duration * self.sample_rate),
                       samplerate=self.sample_rate, channels=1, dtype=np.float32)
        sd.wait()
        print("Done recording.")
        return audio.flatten()

    def speech_to_text(self, audio):
        # Reserve a temp file name, close it, then write the WAV data to it
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            tmp = f.name
        try:
            sf.write(tmp, audio, self.sample_rate)
            segs, _ = self.whisper.transcribe(tmp, language="en", vad_filter=True)
            text = " ".join(s.text.strip() for s in segs)
        finally:
            os.unlink(tmp)
        return text.strip()

    def chat(self, user_text):
        self.history.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": self.system_prompt}] + self.history,
            max_tokens=300,
            temperature=0.7
        )
        assistant_text = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_text})
        # Keep only the 20 most recent messages to bound the prompt size
        if len(self.history) > 20:
            self.history = self.history[-20:]
        return assistant_text

    async def text_to_speech(self, text, output_path='response.mp3'):
        communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural", rate="+10%")
        await communicate.save(output_path)
        return output_path

    def play_audio(self, path):
        if os.name == 'nt':
            os.startfile(path)
        elif hasattr(os, 'uname') and os.uname().sysname == 'Darwin':
            subprocess.run(['afplay', path])
        else:
            subprocess.run(['mpg123', path])

    def run(self):
        print("Voice chatbot started!")
        print("Say 'quit', 'exit', 'stop', or 'bye' to end the session.")
        print("=" * 50)
        while True:
            audio = self.record_audio(duration=5)
            user_text = self.speech_to_text(audio)
            if not user_text:
                print("Could not understand audio. Please try again.")
                continue
            print(f"You: {user_text}")
            if any(w in user_text.lower() for w in ['quit', 'exit', 'stop', 'bye']):
                print("Goodbye!")
                break
            response = self.chat(user_text)
            print(f"AI: {response}")
            audio_path = asyncio.run(self.text_to_speech(response))
            self.play_audio(audio_path)
            time.sleep(0.5)

chatbot = VoiceChatbot()
chatbot.run()

Speech & Audio AI Learning Roadmap

**Beginner**

1. Audio feature extraction practice with librosa

2. Build a transcription prototype with the Whisper API

3. Create TTS applications with Edge TTS

**Intermediate**

1. Fine-tune Wav2Vec 2.0 for custom ASR

2. Build a speaker diarization pipeline with pyannote.audio

3. Implement a real-time voice chatbot

**Advanced**

1. Train a custom TTS model with VITS or XTTS

2. Build music generation applications with MusicGen

3. Combine emotion recognition with conversational AI

References

- librosa documentation: https://librosa.org/doc/latest/

- OpenAI Whisper: https://openai.com/research/whisper

- Wav2Vec 2.0 (arXiv 2006.11477): https://arxiv.org/abs/2006.11477

- Hugging Face Audio Course: https://huggingface.co/learn/audio-course/

- Coqui TTS: https://github.com/coqui-ai/TTS

- pyannote.audio: https://github.com/pyannote/pyannote-audio

- faster-whisper: https://github.com/SYSTRAN/faster-whisper

- AudioCraft (MusicGen): https://github.com/facebookresearch/audiocraft