- Published on
Speech & Audio AI Complete Guide: ASR, TTS, Whisper, Wav2Vec to Voice Synthesis
- Authors

- Name
- Youngju Kim
- @fjvbn20031
Speech and audio AI creates the most natural interface between humans and machines. From smartphone voice assistants to real-time translation systems and synthetic voices for virtual influencers, audio AI technology has woven itself into our everyday lives.
This guide takes you through the full audio AI ecosystem — from the physics of sound and digital signal processing to automatic speech recognition (ASR), text-to-speech (TTS), speaker diarization, and music generation — with practical Python code throughout.
1. Audio Signal Processing Fundamentals
Physical Properties of Sound
Sound is a pressure wave propagating through air. Understanding the following concepts is essential for digital audio processing.
Frequency: The number of vibrations per second, measured in Hz (hertz). The human hearing range spans roughly 20 Hz to 20,000 Hz. Low frequencies correspond to bass tones; high frequencies to treble.
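Amplitude: The magnitude of the wave, that is, the strength of the sound pressure. It is expressed in decibels (dB) relative to a reference level. In digital audio the usual reference is full scale (dBFS): 0 dB is the loudest representable value, and more negative values indicate quieter sounds.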
Phase: The position of a waveform along the time axis. Two waves of the same frequency but different phases produce constructive or destructive interference when combined.
Harmonics: Frequency components at integer multiples of the fundamental frequency. They determine the timbre (tone color) of an instrument or voice.
Sampling Rate and Bit Depth
Sampling Rate: The number of audio samples captured per second (Hz). The Nyquist theorem states that to fully reconstruct a signal, you must sample at more than twice the highest frequency present.
- CD quality: 44,100 Hz (44.1 kHz)
- High-resolution audio: 48,000 Hz (video), 96,000 Hz, 192,000 Hz
- Telephone quality: 8,000 Hz; wideband telephony: 16,000 Hz
- Whisper default: 16,000 Hz
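The Nyquist limit can be made concrete with a small calculation. The sketch below assumes an ideal sampler with no anti-aliasing filter; the helper name `alias_frequency` is illustrative and does not come from any library. It shows that a tone above half the sampling rate is folded back to a lower, wrong frequency.

```python
def alias_frequency(f_hz: float, sample_rate: float) -> float:
    """Apparent frequency of a pure tone at f_hz after ideal sampling."""
    # Sampling folds every frequency into the band [0, sample_rate / 2].
    nearest_multiple = round(f_hz / sample_rate) * sample_rate
    return abs(f_hz - nearest_multiple)

# At telephone quality (8 kHz) only tones below 4 kHz survive unchanged.
for tone in (3000, 5000, 7000):
    apparent = alias_frequency(tone, 8000)
    print(f"{tone} Hz sampled at 8000 Hz appears as {apparent} Hz")
```

This is why audio is low-pass filtered before it is downsampled, which `librosa.resample` handles internally.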
Bit Depth: The number of bits per sample — determines the dynamic range.
- 16-bit: 65,536 levels, 96 dB dynamic range (CD standard)
- 24-bit: 16,777,216 levels, 144 dB dynamic range
- 32-bit float: standard for deep learning pipelines
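The dynamic range figures above follow from the number of quantization levels. As a rough sketch (ignoring dither and noise shaping), the theoretical range of linear PCM is 20 * log10(2^bits), or about 6.02 dB per bit. The function name `dynamic_range_db` is made up for this illustration.

```python
import math

def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)

for bits in (16, 24):
    levels = 2 ** bits
    print(f"{bits}-bit: {levels:,} levels, {dynamic_range_db(bits):.1f} dB")
```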
librosa Overview
librosa is the core Python library for audio analysis.
pip install librosa soundfile matplotlib numpy scipy
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf
# Load audio (sr=None preserves original sample rate)
y, sr = librosa.load('audio.wav', sr=None)
print(f"Duration: {len(y)/sr:.2f} s")
print(f"Sample rate: {sr} Hz")
print(f"Samples: {len(y)}")
print(f"dtype: {y.dtype}")
# Resample to 16 kHz
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)
# Stereo → mono (librosa.load already returns mono unless mono=False is passed)
y_mono = librosa.to_mono(y) # (2, N) → (N,); leaves 1-D input unchanged
# Save
sf.write('output.wav', y_16k, 16000)
# Waveform visualization
plt.figure(figsize=(14, 4))
librosa.display.waveshow(y, sr=sr, alpha=0.8, color='steelblue')
plt.title('Waveform')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.savefig('waveform.png')
plt.show()
2. Audio Feature Extraction
Fourier Transform (FFT)
The Fourier transform converts a time-domain signal into the frequency domain, revealing which frequency components are present and how strongly.
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
import librosa
y, sr = librosa.load('audio.wav', sr=22050)
N = len(y)
yf = fft(y)
xf = fftfreq(N, 1/sr)
# Keep only positive frequencies (Hermitian symmetry)
xf_pos = xf[:N//2]
yf_pos = np.abs(yf[:N//2])
plt.figure(figsize=(12, 4))
plt.plot(xf_pos, yf_pos)
plt.title('Frequency Spectrum (FFT)')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Magnitude')
plt.xlim(0, sr//2)
plt.yscale('log')
plt.grid(True)
plt.tight_layout()
plt.show()
Short-Time Fourier Transform (STFT)
A single FFT over the whole signal shows which frequencies are present but not when they occur. The STFT applies the FFT to short overlapping windows, producing a time-frequency representation that captures how spectral content evolves over time.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
y, sr = librosa.load('speech.wav', sr=22050)
n_fft = 2048 # FFT size (frequency resolution)
hop_length = 512 # hop size
win_length = 2048 # window size
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)
magnitude = np.abs(D)
phase = np.angle(D)
D_db = librosa.amplitude_to_db(magnitude, ref=np.max)
plt.figure(figsize=(14, 6))
librosa.display.specshow(
D_db, sr=sr, hop_length=hop_length,
x_axis='time', y_axis='hz', cmap='magma'
)
plt.colorbar(format='%+2.0f dB')
plt.title('STFT Spectrogram')
plt.tight_layout()
plt.savefig('stft_spectrogram.png')
plt.show()
print(f"STFT shape: {D.shape}")
print(f"Frequency resolution: {sr/n_fft:.2f} Hz")
print(f"Time resolution: {hop_length/sr*1000:.2f} ms")
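The shape and resolution printed above can be predicted without loading any audio. The following sketch assumes the default centered framing used by `librosa.stft` (`center=True`); the helper `stft_shape` is a name chosen for this example only.

```python
def stft_shape(n_samples: int, n_fft: int, hop_length: int) -> tuple[int, int]:
    """Shape (frequency_bins, frames) of a centered STFT."""
    n_bins = 1 + n_fft // 2                 # non-negative frequencies only
    n_frames = 1 + n_samples // hop_length  # assumes center=True padding
    return n_bins, n_frames

sr, n_fft, hop_length = 22050, 2048, 512
bins, frames = stft_shape(10 * sr, n_fft, hop_length)  # a 10-second clip
print(f"shape: ({bins}, {frames})")
print(f"bin spacing: {sr / n_fft:.2f} Hz")
print(f"frame step: {1000 * hop_length / sr:.2f} ms")
```

A larger `n_fft` narrows the bin spacing but widens each analysis window, so frequency detail is gained at the cost of time detail.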
Mel Spectrogram
The human auditory system perceives pitch on a logarithmic scale — more sensitive at low frequencies and less so at high frequencies. The Mel scale models this perceptual non-linearity. Mel spectrograms are the most widely used input representation for deep learning audio models.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
y, sr = librosa.load('speech.wav', sr=22050)
n_mels = 128
n_fft = 2048
hop_length = 512
mel_spec = librosa.feature.melspectrogram(
y=y, sr=sr, n_mels=n_mels,
n_fft=n_fft, hop_length=hop_length,
fmin=0, fmax=sr//2, power=2.0
)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)
plt.figure(figsize=(14, 6))
librosa.display.specshow(
mel_db, sr=sr, hop_length=hop_length,
x_axis='time', y_axis='mel',
fmax=sr//2, cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.savefig('mel_spectrogram.png')
plt.show()
print(f"Mel Spectrogram shape: {mel_spec.shape}")
# (n_mels, time_frames) → (128, ~43 * duration_seconds), since sr / hop_length = 22050 / 512 ≈ 43 frames per second
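To illustrate the non-linearity of the Mel scale, here is a minimal sketch using the HTK-style formula mel = 2595 * log10(1 + f / 700). This is one common definition and an assumption of this example: librosa uses a slightly different (Slaney) variant unless `htk=True` is passed. The function names are illustrative.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Four bands of equal width on the mel axis between 0 and 8000 Hz.
top = hz_to_mel(8000)
edges_hz = [mel_to_hz(top * i / 4) for i in range(5)]
widths_hz = [hi - lo for lo, hi in zip(edges_hz, edges_hz[1:])]
print("band edges (Hz):", [round(f) for f in edges_hz])
print("band widths (Hz):", [round(w) for w in widths_hz])
```

The bands are equally wide in mels but grow wider in Hz, which is how a Mel filterbank spends more resolution on low frequencies.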
MFCC (Mel-Frequency Cepstral Coefficients)
MFCCs are obtained by applying the Discrete Cosine Transform (DCT) to the log Mel filterbank outputs. They compactly represent the spectral envelope (timbre) and have been the standard features for speech recognition for decades.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
y, sr = librosa.load('speech.wav', sr=22050)
n_mfcc = 40
hop_length = 512
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
mfcc_delta = librosa.feature.delta(mfccs)
mfcc_delta2 = librosa.feature.delta(mfccs, order=2)
# Concatenated feature: 120-dimensional (40 + 40 + 40)
mfcc_combined = np.vstack([mfccs, mfcc_delta, mfcc_delta2])
fig, axes = plt.subplots(3, 1, figsize=(14, 8))
for ax, data, title in zip(
    axes,
    [mfccs, mfcc_delta, mfcc_delta2],
    ['MFCC', 'MFCC Delta (1st derivative)', 'MFCC Delta-Delta (2nd derivative)']
):
    librosa.display.specshow(data, x_axis='time', hop_length=hop_length, sr=sr, ax=ax)
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)
plt.tight_layout()
plt.savefig('mfcc.png')
plt.show()
# Fixed-length feature vector for classification
mfcc_feature = np.concatenate([np.mean(mfccs, axis=1), np.std(mfccs, axis=1)])
print(f"MFCC feature vector dim: {mfcc_feature.shape}")
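The DCT step that turns log Mel energies into MFCCs can be sketched in a few lines. This is an unnormalized DCT-II written out by hand for clarity, so its scaling differs from the orthonormal DCT that librosa applies; the input frame contains made-up numbers and `dct_ii` is a name invented for this example.

```python
import math

def dct_ii(values: list[float]) -> list[float]:
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi * k * (n + 0.5) / N)."""
    n_total = len(values)
    return [
        sum(x * math.cos(math.pi * k * (n + 0.5) / n_total)
            for n, x in enumerate(values))
        for k in range(n_total)
    ]

# Hypothetical log-mel energies for one frame (8 bands, made-up numbers).
log_mel_frame = [2.0, 2.5, 3.0, 2.8, 2.0, 1.5, 1.2, 1.0]
cepstrum = dct_ii(log_mel_frame)
mfcc = cepstrum[:4]  # keep only the first few coefficients
print([round(c, 3) for c in mfcc])
```

Keeping only the low-order coefficients retains the smooth spectral envelope and discards fine detail, which is why a few dozen MFCCs can stand in for a 128-band Mel frame.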
Chromagram
A chromagram represents the energy distribution across the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B). It is widely used in music analysis for chord recognition and key detection.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
y, sr = librosa.load('music.wav', sr=22050)
hop_length = 512
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)
chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=hop_length)
fig, axes = plt.subplots(3, 1, figsize=(14, 10))
for ax, chroma, title in zip(
    axes,
    [chroma_stft, chroma_cqt, chroma_cens],
    ['Chroma STFT', 'Chroma CQT', 'Chroma CENS']
):
    librosa.display.specshow(
        chroma, y_axis='chroma', x_axis='time',
        hop_length=hop_length, sr=sr, cmap='coolwarm', ax=ax
    )
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)
plt.tight_layout()
plt.savefig('chromagram.png')
plt.show()
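The idea of a pitch class can be shown with a small mapping from frequency to note name. The sketch assumes equal temperament with A4 tuned to 440 Hz; `pitch_class` is a hypothetical helper, not a librosa function.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(f_hz: float, a4_hz: float = 440.0) -> str:
    """Map a frequency to its nearest equal-tempered pitch class."""
    midi_note = round(69 + 12 * math.log2(f_hz / a4_hz))  # A4 = MIDI note 69
    return PITCH_CLASSES[midi_note % 12]

# Octaves of the same note fall into the same chroma bin.
for freq in (220.0, 440.0, 880.0, 261.63):
    print(f"{freq} Hz -> {pitch_class(freq)}")
```

A chromagram applies the same folding to every frequency bin, summing energy from all octaves into 12 rows.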
3. Automatic Speech Recognition (ASR)
Traditional ASR: HMM + GMM
Classic speech recognition combined Hidden Markov Models (HMMs), which model the temporal sequence of phoneme states, with Gaussian Mixture Models (GMMs), which model the distribution of acoustic features in each state. MFCC features are extracted from the audio, the HMM-GMM acoustic model scores candidate phoneme sequences, a pronunciation lexicon maps phonemes to words, and a language model ranks the resulting word sequences.
CTC (Connectionist Temporal Classification)
CTC enables end-to-end training when input and output sequences have different lengths, without requiring forced alignment. A blank token handles repeated characters and silences, allowing the model to learn directly from audio-text pairs.
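The role of the blank token is easiest to see in greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop blanks. The sketch below uses a made-up five-symbol vocabulary and a hand-written frame path; `ctc_greedy_collapse` is an illustrative name.

```python
def ctc_greedy_collapse(frame_ids: list[int], blank: int = 0) -> list[int]:
    """Collapse a per-frame argmax path into a label sequence."""
    labels = []
    previous = None
    for token in frame_ids:
        # A token is emitted only when it differs from the previous frame
        # (merging repeats) and is not the blank symbol.
        if token != previous and token != blank:
            labels.append(token)
        previous = token
    return labels

# Hypothetical vocabulary: 0 = blank, 1 = 'h', 2 = 'e', 3 = 'l', 4 = 'o'
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
path = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4, 0]
decoded = ctc_greedy_collapse(path)
print("".join(vocab[i] for i in decoded))
```

The blank between the two runs of `3` is what lets the decoder output a double "l" instead of merging it into one.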
Wav2Vec 2.0
Facebook AI Research's Wav2Vec 2.0 uses self-supervised learning to learn powerful acoustic representations from large amounts of unlabeled audio. It can then be fine-tuned on a small labeled dataset, and it achieved state-of-the-art word error rates on LibriSpeech when it was published in 2020.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
def transcribe_wav2vec2(audio_path):
    speech, sample_rate = torchaudio.load(audio_path)  # shape: (channels, samples)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        speech = resampler(speech)
    # Average the channels so stereo files become a 1-D mono signal
    speech = speech.mean(dim=0).numpy()
    inputs = processor(speech, sampling_rate=16000,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Greedy CTC decoding: most likely token per frame, collapsed by the processor
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription[0].lower()
text = transcribe_wav2vec2('speech.wav')
print(f"Transcription: {text}")
Whisper (OpenAI)
Whisper is OpenAI's large-scale multilingual ASR model released in 2022. Trained on 680,000 hours of multilingual audio, it supports 99 languages and delivers excellent accuracy out of the box — no fine-tuning required for most use cases.
Architecture: Encoder-Decoder Transformer
- Encoder: converts audio to a Mel spectrogram and processes it through a transformer encoder
- Decoder: autoregressively generates text tokens, including language detection and timestamps
Model sizes:
- tiny: 39M parameters — fastest
- base: 74M
- small: 244M
- medium: 769M
- large-v3: 1,550M — best accuracy
import whisper
import numpy as np
# Load model (downloads automatically on first run)
model = whisper.load_model("base") # or "small", "medium", "large-v3"
# Basic transcription
result = model.transcribe("speech.wav")
print("Transcript:", result["text"])
print("Detected language:", result["language"])
# Force a specific language
result_forced = model.transcribe("speech.wav", language="en", task="transcribe")
# With word-level timestamps
result_ts = model.transcribe("speech.wav", verbose=False, word_timestamps=True)
for segment in result_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
# Translate non-English audio to English
result_translated = model.transcribe("german_speech.wav", task="translate")
print("English translation:", result_translated["text"])
# Microphone input (5-second clip)
def transcribe_from_microphone(duration=5, sample_rate=16000):
    import sounddevice as sd
    print(f"Recording for {duration} seconds...")
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype=np.float32)
    sd.wait()
    result = model.transcribe(audio.flatten(), language="en")
    print(f"You said: {result['text']}")
transcribe_from_microphone()
Faster-Whisper
faster-whisper reimplements Whisper using CTranslate2, achieving up to 4x faster inference with reduced memory usage.
from faster_whisper import WhisperModel
model = WhisperModel(
"large-v3",
device="cuda",
compute_type="float16" # "float16", "int8", "int8_float16"
)
segments, info = model.transcribe(
"speech.wav",
language="en",
beam_size=5,
word_timestamps=True,
vad_filter=True,
vad_parameters=dict(min_silence_duration_ms=500)
)
print(f"Language: {info.language}, confidence: {info.language_probability:.2f}")
full_text = ""
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
    full_text += segment.text
    if segment.words:
        for word in segment.words:
            print(f" '{word.word}': {word.start:.2f}s - {word.end:.2f}s")
# Batch processing
def batch_transcribe(audio_files, output_dir):
    import os
    os.makedirs(output_dir, exist_ok=True)
    for audio_path in audio_files:
        name = os.path.splitext(os.path.basename(audio_path))[0]
        output_path = os.path.join(output_dir, f"{name}.txt")
        segs, _ = model.transcribe(audio_path, language="en")
        with open(output_path, 'w', encoding='utf-8') as f:
            for seg in segs:
                f.write(f"[{seg.start:.2f} - {seg.end:.2f}] {seg.text}\n")
        print(f"Done: {audio_path} → {output_path}")
batch_transcribe(['interview1.wav', 'lecture.mp3'], './transcripts')
4. Text-to-Speech (TTS)
Deep Learning TTS Architectures
Tacotron 2: A sequence-to-sequence model that generates Mel spectrograms from text, coupled with a WaveNet vocoder. An attention mechanism aligns the text encoder output with the audio decoder.
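FastSpeech 2: A non-autoregressive model that generates all Mel frames in parallel, which makes inference much faster than autoregressive models such as Tacotron 2. A duration predictor replaces attention-based alignment, and separate predictors estimate pitch and energy from the encoded text.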
VITS: An end-to-end model combining variational inference with adversarial training. It merges the acoustic model and vocoder into a single network, yielding natural-sounding synthesis in one pass.
Edge TTS (Microsoft)
Edge TTS exposes Microsoft's high-quality online TTS voices through the edge-tts Python package, which is free to use.
import asyncio
import edge_tts
async def synthesize_with_edge_tts():
    # List available voices
    voices = await edge_tts.list_voices()
    en_voices = [v for v in voices if v['Locale'].startswith('en-')]
    print("English voices:")
    for v in en_voices[:5]:
        print(f" {v['ShortName']} - {v['FriendlyName']} ({v['Gender']})")
    # Basic synthesis
    text = "Hello! This is a demonstration of Microsoft Edge TTS."
    communicate = edge_tts.Communicate(
        text,
        voice="en-US-AriaNeural",
        rate="+0%", # speed: -50% to +100%
        volume="+0%",
        pitch="+0Hz"
    )
    await communicate.save("output_edge.mp3")
    print("Saved: output_edge.mp3")

    # With word-boundary subtitles
    async def synthesize_with_subs(text, voice, audio_out, srt_out):
        communicate = edge_tts.Communicate(text, voice)
        subs = edge_tts.SubMaker()
        with open(audio_out, "wb") as af:
            async for chunk in communicate.stream():
                if chunk["type"] == "audio":
                    af.write(chunk["data"])
                elif chunk["type"] == "WordBoundary":
                    subs.feed(chunk)
        with open(srt_out, "w", encoding="utf-8") as srt_file:
            srt_file.write(subs.get_srt())

    await synthesize_with_subs(
        "The quick brown fox jumps over the lazy dog.",
        "en-US-AriaNeural",
        "output_subs.mp3",
        "output_subs.srt"
    )
asyncio.run(synthesize_with_edge_tts())
Coqui TTS (Open Source)
from TTS.api import TTS
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# XTTS v2: multilingual zero-shot TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# Basic synthesis
tts.tts_to_file(
text="This is a demonstration of Coqui XTTS v2.",
file_path="output_xtts.wav",
language="en",
speaker_wav="reference_voice.wav" # reference clip for voice cloning (the XTTS v2 model card suggests about 6 seconds)
)
# Voice cloning
def clone_voice(reference_audio, text, output_path, language="en"):
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_audio,
        language=language,
        split_sentences=True
    )
    print(f"Voice clone saved: {output_path}")
clone_voice(
"my_voice_sample.wav",
"This sentence is spoken in a cloned voice.",
"cloned_output.wav"
)
# English TTS with Tacotron2
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to(device)
tts_en.tts_to_file(
text="Hello, this is a text-to-speech demonstration using Tacotron 2.",
file_path="output_tacotron.wav"
)
OpenVoice (Voice Cloning)
# pip install openvoice melo-tts
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
from melo.api import TTS
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
tone_color_converter = ToneColorConverter(
'checkpoints_v2/converter/config.json', device=device
)
tone_color_converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth')
# Generate base speech
tts_model = TTS(language='EN', device=device)
speaker_id = tts_model.hps.data.spk2id['EN-US']
src_path = 'tmp/output_base.wav'
tts_model.tts_to_file(
text="Today we will discuss advances in voice cloning technology.",
speaker_id=speaker_id,
output_path=src_path,
speed=1.0
)
# Extract tone color from reference speaker
target_se, _ = se_extractor.get_se(
'reference.wav', tone_color_converter,
target_dir='processed', vad=False
)
source_se = torch.load('checkpoints_v2/base_speakers/ses/en-us.pth', map_location=device)
# Apply tone color conversion
tone_color_converter.convert(
audio_src_path=src_path,
src_se=source_se,
tgt_se=target_se,
output_path='output_cloned.wav'
)
print("Voice clone complete: output_cloned.wav")
5. Speaker Diarization
Speaker diarization answers the question "who spoke when". It is essential for meeting transcription, interview analysis, and multi-speaker subtitle generation.
from pyannote.audio import Pipeline
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token="YOUR_HF_TOKEN"
)
pipeline.to(device)
def diarize_audio(audio_path, num_speakers=None):
    kwargs = {"num_speakers": num_speakers} if num_speakers else {}
    diarization = pipeline(audio_path, **kwargs)
    timeline = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        entry = {
            'start': turn.start,
            'end': turn.end,
            'speaker': speaker,
            'duration': turn.end - turn.start
        }
        timeline.append(entry)
        print(f" [{turn.start:.2f}s - {turn.end:.2f}s] {speaker}")
    return timeline, diarization
# Combined diarization + Whisper transcription
def diarize_and_transcribe(audio_path, hf_token):
    from faster_whisper import WhisperModel
    whisper_model = WhisperModel("medium", device="cuda", compute_type="float16")
    segments, _ = whisper_model.transcribe(audio_path, word_timestamps=True)
    segments = list(segments)
    diarize_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    ).to(torch.device("cuda"))
    diarization = diarize_pipeline(audio_path)

    def get_speaker(start, end):
        # Pick the speaker turn with the largest time overlap
        best_speaker, best_overlap = "Unknown", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, speaker
        return best_speaker

    result = []
    for segment in segments:
        speaker = get_speaker(segment.start, segment.end)
        entry = {
            'start': segment.start,
            'end': segment.end,
            'speaker': speaker,
            'text': segment.text.strip()
        }
        result.append(entry)
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {speaker}: {segment.text.strip()}")
    return result
timeline = diarize_and_transcribe("meeting.wav", "YOUR_HF_TOKEN")
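The speaker assignment step is just interval arithmetic, so it can be tried without any model. The sketch below uses hand-written turns in place of real pyannote output; the tuple layout and both function names are assumptions made for this example.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speaker(segment, turns, default="Unknown"):
    """Pick the speaker whose turn overlaps the segment the longest."""
    best_speaker, best_overlap = default, 0.0
    for turn_start, turn_end, speaker in turns:
        overlap = overlap_seconds(segment[0], segment[1], turn_start, turn_end)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    return best_speaker

# Hypothetical diarization turns: (start, end, label)
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
print(assign_speaker((3.0, 6.5), turns))    # straddles both turns
print(assign_speaker((10.0, 11.0), turns))  # outside every turn
```

A transcript segment that straddles two turns goes to whoever spoke for most of it. Assigning speakers per word instead of per segment is a common refinement when turns change mid-sentence.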
6. Speech Emotion Recognition
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
import numpy as np
model_name = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()
EMOTIONS = ['angry', 'calm', 'disgust', 'fearful',
'happy', 'neutral', 'sad', 'surprised']
def predict_emotion(audio_path):
    speech, sr = torchaudio.load(audio_path)  # shape: (channels, samples)
    if sr != 16000:
        speech = torchaudio.transforms.Resample(sr, 16000)(speech)
    inputs = feature_extractor(
        speech.mean(dim=0).numpy(),  # average channels to get 1-D mono audio
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, 3)
    print("Top-3 emotions:")
    for idx, prob in zip(top_idx[0], top_probs[0]):
        print(f" {EMOTIONS[idx.item()]}: {prob.item():.4f}")
    predicted = EMOTIONS[probs.argmax().item()]
    print(f"Predicted: {predicted}")
    return predicted, dict(zip(EMOTIONS, probs[0].tolist()))
emotion, probs = predict_emotion('speech.wav')
# Feature-based emotion analysis
def analyze_emotion_features(audio_path):
    import librosa
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zero_crossing = librosa.feature.zero_crossing_rate(y=y)
    rms = librosa.feature.rms(y=y)
    pitch, mag = librosa.piptrack(y=y, sr=sr)
    features = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.std(spectral_centroid)],
        [np.mean(spectral_rolloff)],
        [np.mean(zero_crossing)],
        [np.mean(rms)],
        [np.std(rms)]
    ])
    print(f"Feature vector dim: {features.shape}")
    return features
7. Music AI
MusicGen (Meta)
Meta AI's MusicGen generates music from text prompts, conditioning on descriptions of genre, instruments, mood, and tempo.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torch
model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(
duration=30,
temperature=1.0,
top_k=250,
top_p=0.0,
cfg_coef=3.0,
)
descriptions = [
"upbeat electronic dance music with synthesizers and strong bass",
"peaceful classical piano music with violin, gentle and romantic",
"intense rock music with electric guitar and drums"
]
print("Generating music...")
wav = model.generate(descriptions)
for idx, (desc, one_wav) in enumerate(zip(descriptions, wav)):
    filename = f'generated_music_{idx}'
    audio_write(
        filename, one_wav.cpu(), model.sample_rate,
        strategy="loudness", loudness_compressor=True
    )
    print(f"Saved: {filename}.wav - '{desc}'")
# Melody-conditioned generation
model_melody = MusicGen.get_pretrained('facebook/musicgen-melody')
model_melody.set_generation_params(duration=15)
import torchaudio
melody, melody_sr = torchaudio.load('humming.wav')
wav_melody = model_melody.generate_with_chroma(
descriptions=["full orchestral arrangement, epic and cinematic"],
melody_wavs=melody.unsqueeze(0),
melody_sample_rate=melody_sr
)
audio_write('melody_based', wav_melody[0].cpu(), model_melody.sample_rate,
strategy="loudness")
print("Melody-conditioned generation complete.")
Music Genre Classification
import librosa
import numpy as np
import torch
import torch.nn as nn
GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
'jazz', 'metal', 'pop', 'reggae', 'rock']
def extract_music_features(audio_path, sr=22050, duration=30):
    y, _ = librosa.load(audio_path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    feature_vec = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.mean(spectral_bandwidth)],
        [np.mean(spectral_rolloff)],
        [np.mean(spectral_contrast)],
        [np.mean(zcr)],
        [np.std(zcr)],
        np.mean(chroma, axis=1),
        # beat_track may return tempo as a scalar or a 1-element array
        [float(np.atleast_1d(tempo)[0])]
    ])
    return feature_vec
class GenreClassifier(nn.Module):
    # input_dim must match extract_music_features:
    # 20 + 20 MFCC stats, 4 spectral means, 2 ZCR stats, 12 chroma, 1 tempo = 59
    def __init__(self, input_dim=59, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        return self.net(x)
def classify_genre(audio_path, model):
    feat = extract_music_features(audio_path)
    x = torch.FloatTensor(feat).unsqueeze(0)
    model.eval()  # BatchNorm cannot run in training mode on a batch of one
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    top_prob, top_idx = torch.topk(probs, 3)
    print("Top-3 genres:")
    for prob, idx in zip(top_prob, top_idx):
        print(f" {GENRES[idx.item()]}: {prob.item():.4f}")
    return GENRES[probs.argmax().item()]
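The classifier's first linear layer has to match the length of the vector returned by `extract_music_features`. As a quick sanity check, the sketch below tallies the contribution of each feature group in that function; the dictionary keys are labels invented for this example.

```python
# Number of values each feature group adds to the final vector.
feature_sizes = {
    "mfcc_mean": 20,              # n_mfcc=20, averaged over time
    "mfcc_std": 20,
    "spectral_centroid_mean": 1,
    "spectral_bandwidth_mean": 1,
    "spectral_rolloff_mean": 1,
    "spectral_contrast_mean": 1,  # np.mean over all bands and frames
    "zcr_mean": 1,
    "zcr_std": 1,
    "chroma_mean": 12,            # one value per pitch class
    "tempo": 1,
}

input_dim = sum(feature_sizes.values())
print(f"Feature vector length: {input_dim}")
```

The total is the value `input_dim` must take. Checking `extract_music_features(path).shape` on a real file before building the model catches a mismatch early.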
8. Practical Audio AI Projects
Project 1: Real-time Subtitle System
import queue
import threading
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
import time
class RealtimeSubtitleSystem:
    def __init__(self, model_size="base", language="en"):
        print(f"Loading Whisper {model_size}...")
        self.model = WhisperModel(model_size, device="cuda", compute_type="float16")
        self.language = language
        self.audio_queue = queue.Queue()
        self.is_running = False
        self.sample_rate = 16000
        self.chunk_secs = 3

    def audio_callback(self, indata, frames, time_info, status):
        if status:
            print(f"Audio status: {status}")
        self.audio_queue.put(indata.copy())

    def transcription_worker(self):
        buffer = np.array([], dtype=np.float32)
        chunk_size = self.sample_rate * self.chunk_secs
        while self.is_running or not self.audio_queue.empty():
            try:
                chunk = self.audio_queue.get(timeout=0.1)
                buffer = np.append(buffer, chunk.flatten())
                if len(buffer) >= chunk_size:
                    audio_data = buffer[:chunk_size]
                    buffer = buffer[chunk_size // 2:] # 50% overlap
                    segments, _ = self.model.transcribe(
                        audio_data, language=self.language,
                        vad_filter=True,
                        vad_parameters=dict(min_silence_duration_ms=300)
                    )
                    for seg in segments:
                        text = seg.text.strip()
                        if text:
                            ts = time.strftime("%H:%M:%S")
                            print(f"[{ts}] {text}")
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Transcription error: {e}")

    def start(self):
        self.is_running = True
        t = threading.Thread(target=self.transcription_worker, daemon=True)
        t.start()
        print("Real-time subtitle system running (Ctrl+C to stop)")
        print("-" * 50)
        with sd.InputStream(
            samplerate=self.sample_rate, channels=1,
            dtype=np.float32, callback=self.audio_callback,
            blocksize=int(self.sample_rate * 0.1)
        ):
            try:
                while True:
                    time.sleep(0.1)
            except KeyboardInterrupt:
                print("\nStopping...")
                self.is_running = False
        t.join()
        print("System stopped.")
system = RealtimeSubtitleSystem(model_size="small", language="en")
system.start()
Project 2: Voice Chatbot
import openai
import sounddevice as sd
import numpy as np
import tempfile
import os
import time
from faster_whisper import WhisperModel
import edge_tts
import asyncio
import soundfile as sf
class VoiceChatbot:
    def __init__(self):
        self.client = openai.OpenAI()
        self.whisper = WhisperModel("base", device="cpu", compute_type="int8")
        self.sample_rate = 16000
        self.history = []
        self.system_prompt = (
            "You are a helpful and knowledgeable AI assistant. "
            "Respond concisely and clearly in English."
        )

    def record_audio(self, duration=5):
        print(f"Recording... ({duration} s)")
        audio = sd.rec(int(duration * self.sample_rate),
                       samplerate=self.sample_rate, channels=1, dtype=np.float32)
        sd.wait()
        print("Done recording.")
        return audio.flatten()

    def speech_to_text(self, audio):
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            sf.write(f.name, audio, self.sample_rate)
            tmp = f.name
        try:
            segs, _ = self.whisper.transcribe(tmp, language="en", vad_filter=True)
            text = " ".join(s.text.strip() for s in segs)
        finally:
            os.unlink(tmp)
        return text.strip()

    def chat(self, user_text):
        self.history.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": self.system_prompt}] + self.history,
            max_tokens=300,
            temperature=0.7
        )
        assistant_text = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_text})
        # Keep only the 20 most recent messages to bound the prompt size
        if len(self.history) > 20:
            self.history = self.history[-20:]
        return assistant_text

    async def text_to_speech(self, text, output_path='response.mp3'):
        communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural", rate="+10%")
        await communicate.save(output_path)
        return output_path

    def play_audio(self, path):
        import subprocess
        if os.name == 'nt':
            os.startfile(path)
        elif hasattr(os, 'uname') and os.uname().sysname == 'Darwin':
            subprocess.run(['afplay', path])
        else:
            subprocess.run(['mpg123', path])

    def run(self):
        print("Voice chatbot started!")
        print("Say 'quit' or 'exit' to stop.")
        print("=" * 50)
        while True:
            audio = self.record_audio(duration=5)
            user_text = self.speech_to_text(audio)
            if not user_text:
                print("Could not understand audio. Please try again.")
                continue
            print(f"You: {user_text}")
            if any(w in user_text.lower() for w in ['quit', 'exit', 'stop', 'bye']):
                print("Goodbye!")
                break
            response = self.chat(user_text)
            print(f"AI: {response}")
            audio_path = asyncio.run(self.text_to_speech(response))
            self.play_audio(audio_path)
            time.sleep(0.5)
chatbot = VoiceChatbot()
chatbot.run()
Speech & Audio AI Learning Roadmap
Beginner
- Audio feature extraction practice with librosa
- Build a transcription prototype with the Whisper API
- Create TTS applications with Edge TTS
Intermediate
- Fine-tune Wav2Vec 2.0 for custom ASR
- Build a speaker diarization pipeline with pyannote.audio
- Implement a real-time voice chatbot
Advanced
- Train a custom TTS model with VITS or XTTS
- Build music generation applications with MusicGen
- Combine emotion recognition with conversational AI
References
- librosa documentation: https://librosa.org/doc/latest/
- OpenAI Whisper: https://openai.com/research/whisper
- Wav2Vec 2.0 (arXiv 2006.11477): https://arxiv.org/abs/2006.11477
- Hugging Face Audio Course: https://huggingface.co/learn/audio-course/
- Coqui TTS: https://github.com/coqui-ai/TTS
- pyannote.audio: https://github.com/pyannote/pyannote-audio
- faster-whisper: https://github.com/SYSTRAN/faster-whisper
- AudioCraft (MusicGen): https://github.com/facebookresearch/audiocraft