Speech & Audio AI Complete Guide: ASR, TTS, Whisper, Wav2Vec to Voice Synthesis

Speech and audio AI creates the most natural interface between humans and machines. From smartphone voice assistants to real-time translation systems and synthetic voices for virtual influencers, audio AI technology has woven itself into our everyday lives.

This guide takes you through the full audio AI ecosystem — from the physics of sound and digital signal processing to automatic speech recognition (ASR), text-to-speech (TTS), speaker diarization, and music generation — with practical Python code throughout.


1. Audio Signal Processing Fundamentals

Physical Properties of Sound

Sound is a pressure wave propagating through air. Understanding the following concepts is essential for digital audio processing.

Frequency: The number of vibrations per second, measured in Hz (hertz). The human hearing range spans roughly 20 Hz to 20,000 Hz. Low frequencies correspond to bass tones; high frequencies to treble.

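Amplitude: The magnitude of the wave, i.e. the strength of the sound pressure. Levels are usually expressed in decibels (dB), a logarithmic ratio to a reference. In dB SPL, 0 dB is the nominal threshold of human hearing; in digital audio (dBFS), 0 dB is full scale, so every representable level is zero or negative and more negative means quieter.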

Phase: The position of a waveform along the time axis. Two waves of the same frequency but different phases produce constructive or destructive interference when combined.

Harmonics: Frequency components at integer multiples of the fundamental frequency. They determine the timbre (tone color) of an instrument or voice.
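The decibel and phase definitions above can be checked numerically. The following is a minimal pure-Python sketch; the helper names `amplitude_to_db` and `peak_of_sum` are illustrative and not part of any library.

```python
import math

def amplitude_to_db(amplitude, reference=1.0):
    """Convert a linear amplitude to decibels relative to `reference`."""
    return 20.0 * math.log10(amplitude / reference)

def peak_of_sum(phase_offset, frequency=5.0, sample_rate=1000):
    """Peak absolute value of two unit sine waves added with a phase offset (radians)."""
    peak = 0.0
    for n in range(sample_rate):  # one second of samples
        t = n / sample_rate
        a = math.sin(2 * math.pi * frequency * t)
        b = math.sin(2 * math.pi * frequency * t + phase_offset)
        peak = max(peak, abs(a + b))
    return peak

# Halving the amplitude lowers the level by about 6 dB
print(f"half amplitude: {amplitude_to_db(0.5):.2f} dB")
# Same phase: constructive interference (peak doubles)
print(f"in-phase peak: {peak_of_sum(0.0):.2f}")
# Opposite phase: destructive interference (waves cancel)
print(f"opposite-phase peak: {peak_of_sum(math.pi):.2f}")
```

With a reference of 1.0 this corresponds to the dBFS convention for floating-point audio normalized to the range [-1, 1].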

Sampling Rate and Bit Depth

Sampling Rate: The number of audio samples captured per second (Hz). The Nyquist theorem states that to fully reconstruct a signal, you must sample at more than twice the highest frequency present.

  • CD quality: 44,100 Hz (44.1 kHz)
  • Professional audio and video: 48,000 Hz; high-resolution audio: 96,000 Hz, 192,000 Hz
  • Telephone quality: 8,000 Hz; wideband telephony: 16,000 Hz
  • Whisper default: 16,000 Hz
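To see why the Nyquist limit matters, the sketch below computes where a pure tone ends up after sampling. The function name `alias_frequency` is hypothetical and the model assumes an ideal sampler with no anti-aliasing filter.

```python
def alias_frequency(frequency_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling at sample_rate_hz.

    Tones below sample_rate_hz / 2 are unchanged; tones above it fold back
    into the 0..sample_rate_hz / 2 band.
    """
    nearest_multiple = round(frequency_hz / sample_rate_hz) * sample_rate_hz
    return abs(frequency_hz - nearest_multiple)

# At telephone quality (8 kHz) the Nyquist limit is 4 kHz
print(alias_frequency(3000, 8000))  # below the limit: preserved
print(alias_frequency(5000, 8000))  # above the limit: folds back to 3000 Hz
```

This is why audio is low-pass filtered before being downsampled, for example from 44.1 kHz to the 16 kHz that Whisper expects.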

Bit Depth: The number of bits per sample — determines the dynamic range.

  • 16-bit: 65,536 levels, 96 dB dynamic range (CD standard)
  • 24-bit: 16,777,216 levels, 144 dB dynamic range
  • 32-bit float: standard for deep learning pipelines
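The dynamic range figures in the list follow from the number of quantization levels. A small sketch, assuming ideal linear PCM and a hypothetical helper name:

```python
import math

def dynamic_range_db(bit_depth):
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20.0 * math.log10(2 ** bit_depth)

for bits in (16, 24):
    levels = 2 ** bits
    print(f"{bits}-bit: {levels:,} levels, {dynamic_range_db(bits):.1f} dB")
```

Each additional bit adds roughly 6 dB, which is where the rounded 96 dB and 144 dB values come from.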

librosa Overview

librosa is the core Python library for audio analysis.

# pip install librosa soundfile matplotlib numpy scipy
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf

# Load audio (sr=None preserves original sample rate)
y, sr = librosa.load('audio.wav', sr=None)
print(f"Duration: {len(y)/sr:.2f} s")
print(f"Sample rate: {sr} Hz")
print(f"Samples: {len(y)}")
print(f"dtype: {y.dtype}")

# Resample to 16 kHz
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)

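# Stereo → mono: librosa.load already downmixes (mono=True by default),
# so this only changes audio loaded with mono=False: (2, N) → (N,)
y_mono = librosa.to_mono(y)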

# Save
sf.write('output.wav', y_16k, 16000)

# Waveform visualization
plt.figure(figsize=(14, 4))
librosa.display.waveshow(y, sr=sr, alpha=0.8, color='steelblue')
plt.title('Waveform')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.savefig('waveform.png')
plt.show()

2. Audio Feature Extraction

Fourier Transform (FFT)

The Fourier transform converts a time-domain signal into the frequency domain, revealing which frequency components are present and how strongly.

import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
import librosa

y, sr = librosa.load('audio.wav', sr=22050)

N  = len(y)
yf = fft(y)
xf = fftfreq(N, 1/sr)

# Keep only positive frequencies (Hermitian symmetry)
xf_pos = xf[:N//2]
yf_pos = np.abs(yf[:N//2])

plt.figure(figsize=(12, 4))
plt.plot(xf_pos, yf_pos)
plt.title('Frequency Spectrum (FFT)')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Magnitude')
plt.xlim(0, sr//2)
plt.yscale('log')
plt.grid(True)
plt.tight_layout()
plt.show()
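To make the transform itself less of a black box, here is a naive discrete Fourier transform in pure Python applied to a synthetic tone. It is a teaching sketch only (O(N²), with a made-up 64 Hz sample rate); real code should use an FFT routine as in the example above.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes of the positive-frequency DFT bins (naive O(N^2) version)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes

# One second of an 8 Hz sine sampled at 64 Hz (illustrative values)
sample_rate = 64
tone_hz = 8
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(sample_rate)]

mags = dft_magnitudes(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / len(signal)  # bin index → frequency in Hz
print(f"Strongest component: {peak_hz} Hz")
```

Bin k corresponds to the frequency k · sr / N, which is the same mapping `fftfreq` returns.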

Short-Time Fourier Transform (STFT)

A single FFT over the whole signal shows which frequencies are present but not when they occur. The STFT applies the FFT to short overlapping windows, producing a time-frequency representation that captures how spectral content evolves over time.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_fft      = 2048   # FFT size (frequency resolution)
hop_length = 512    # hop size
win_length = 2048   # window size

D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)

magnitude = np.abs(D)
phase     = np.angle(D)
D_db      = librosa.amplitude_to_db(magnitude, ref=np.max)

plt.figure(figsize=(14, 6))
librosa.display.specshow(
    D_db, sr=sr, hop_length=hop_length,
    x_axis='time', y_axis='hz', cmap='magma'
)
plt.colorbar(format='%+2.0f dB')
plt.title('STFT Spectrogram')
plt.tight_layout()
plt.savefig('stft_spectrogram.png')
plt.show()

print(f"STFT shape: {D.shape}")
print(f"Frequency resolution: {sr/n_fft:.2f} Hz")
print(f"Time resolution: {hop_length/sr*1000:.2f} ms")
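The shape and resolution printed above can be predicted without running the transform. The sketch below assumes librosa's default `center=True` padding; the helper names are illustrative.

```python
def stft_shape(num_samples, n_fft=2048, hop_length=512):
    """Expected (frequency_bins, frames) of an STFT with centered frames."""
    freq_bins = 1 + n_fft // 2            # only non-negative frequencies are kept
    frames = 1 + num_samples // hop_length
    return freq_bins, frames

def stft_resolution(sample_rate, n_fft=2048, hop_length=512):
    """Return (Hz per frequency bin, milliseconds per frame hop)."""
    return sample_rate / n_fft, 1000.0 * hop_length / sample_rate

# One second of audio at 22,050 Hz
print(stft_shape(22050))
print(stft_resolution(22050))
# A smaller window trades frequency resolution for time resolution
print(stft_resolution(22050, n_fft=512, hop_length=128))
```

Increasing `n_fft` narrows the frequency bins but makes each frame cover more time, so the two resolutions cannot both be improved at once.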

Mel Spectrogram

The human auditory system perceives pitch roughly logarithmically: it resolves small frequency differences far better at low frequencies than at high frequencies. The Mel scale models this perceptual non-linearity. Mel spectrograms are among the most widely used input representations for deep learning audio models.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_mels     = 128
n_fft      = 2048
hop_length = 512

mel_spec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_mels=n_mels,
    n_fft=n_fft, hop_length=hop_length,
    fmin=0, fmax=sr//2, power=2.0
)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)

plt.figure(figsize=(14, 6))
librosa.display.specshow(
    mel_db, sr=sr, hop_length=hop_length,
    x_axis='time', y_axis='mel',
    fmax=sr//2, cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.savefig('mel_spectrogram.png')
plt.show()

print(f"Mel Spectrogram shape: {mel_spec.shape}")
# (n_mels, time_frames) → ~(128, 43 * duration_seconds), since sr / hop_length = 22050 / 512 ≈ 43 frames per second
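The Mel scale itself is a simple formula. The sketch below uses the common HTK-style definition; note that librosa defaults to the slightly different Slaney variant unless `htk=True` is passed, so values will not match its defaults exactly. Function names are illustrative.

```python
import math

def hz_to_mel(frequency_hz):
    """HTK-style Mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz become progressively smaller steps in Mel
for f in (500, 1000, 2000, 4000, 8000):
    print(f"{f:>5} Hz → {hz_to_mel(f):7.1f} mel")
```

A Mel filterbank places its triangular filters at equal spacing on this scale, which gives dense coverage at low frequencies and coarse coverage at high frequencies.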

MFCC (Mel-Frequency Cepstral Coefficients)

MFCCs are obtained by applying the Discrete Cosine Transform (DCT) to the log Mel filterbank outputs. They compactly represent the spectral envelope (timbre) and have been the standard features for speech recognition for decades.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('speech.wav', sr=22050)

n_mfcc     = 40
hop_length = 512

mfccs       = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
mfcc_delta  = librosa.feature.delta(mfccs)
mfcc_delta2 = librosa.feature.delta(mfccs, order=2)

# Concatenated feature: 120-dimensional (40 + 40 + 40)
mfcc_combined = np.vstack([mfccs, mfcc_delta, mfcc_delta2])

fig, axes = plt.subplots(3, 1, figsize=(14, 8))
for ax, data, title in zip(
    axes,
    [mfccs, mfcc_delta, mfcc_delta2],
    ['MFCC', 'MFCC Delta (1st derivative)', 'MFCC Delta-Delta (2nd derivative)']
):
    librosa.display.specshow(data, x_axis='time', hop_length=hop_length, sr=sr, ax=ax)
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)

plt.tight_layout()
plt.savefig('mfcc.png')
plt.show()

# Fixed-length feature vector for classification
mfcc_feature = np.concatenate([np.mean(mfccs, axis=1), np.std(mfccs, axis=1)])
print(f"MFCC feature vector dim: {mfcc_feature.shape}")
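The DCT step that turns log Mel energies into cepstral coefficients can be written out directly. This is an unnormalized DCT-II sketch in pure Python; librosa applies an orthonormal scaling, so absolute values differ, and the input energies here are made up for illustration.

```python
import math

def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II: c_k = sum_n x_n * cos(pi / N * (n + 0.5) * k)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]

# Hypothetical log Mel energies for a single frame (8 bands)
log_mel_frame = [2.0, 2.5, 3.0, 2.8, 2.1, 1.5, 1.0, 0.8]
coeffs = dct_ii(log_mel_frame, num_coeffs=4)
print([round(c, 3) for c in coeffs])

# A perfectly flat spectrum has energy only in coefficient 0
flat = dct_ii([1.0] * 8, num_coeffs=4)
print([round(c, 3) for c in flat])
```

Low-order coefficients describe the broad spectral envelope, which is why keeping only the first few dozen gives a compact timbre descriptor.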

Chromagram

A chromagram represents the energy distribution across the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B). It is widely used in music analysis for chord recognition and key detection.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('music.wav', sr=22050)
hop_length = 512

chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)
chroma_cqt  = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=hop_length)

fig, axes = plt.subplots(3, 1, figsize=(14, 10))
for ax, chroma, title in zip(
    axes,
    [chroma_stft, chroma_cqt, chroma_cens],
    ['Chroma STFT', 'Chroma CQT', 'Chroma CENS']
):
    librosa.display.specshow(
        chroma, y_axis='chroma', x_axis='time',
        hop_length=hop_length, sr=sr, cmap='coolwarm', ax=ax
    )
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)

plt.tight_layout()
plt.savefig('chromagram.png')
plt.show()
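Behind a chromagram is a mapping from frequency to one of the 12 pitch classes. A minimal sketch, assuming equal temperament with A4 = 440 Hz (the function name is illustrative):

```python
import math

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def frequency_to_pitch_class(frequency_hz, a4_hz=440.0):
    """Map a frequency to its nearest pitch class, ignoring the octave."""
    midi_note = round(69 + 12 * math.log2(frequency_hz / a4_hz))  # A4 is MIDI 69
    return PITCH_CLASSES[midi_note % 12]

for f in (261.63, 329.63, 440.0, 880.0):
    print(f"{f:7.2f} Hz → {frequency_to_pitch_class(f)}")
```

Because octaves are folded together, 440 Hz and 880 Hz land in the same chroma bin, which makes the representation robust to changes in register.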

3. Automatic Speech Recognition (ASR)

Traditional ASR: HMM + GMM

Classic speech recognition combined Hidden Markov Models (HMMs), which model the temporal sequence of phoneme states, with Gaussian Mixture Models (GMMs), which model the distribution of acoustic features for each state. MFCC features are extracted from the audio, the HMM-GMM acoustic model scores candidate phoneme sequences, a pronunciation lexicon maps phonemes to words, and a language model ranks the resulting word sequences.

CTC (Connectionist Temporal Classification)

CTC enables end-to-end training when input and output sequences have different lengths, without requiring forced alignment. A blank token handles repeated characters and silences, allowing the model to learn directly from audio-text pairs.
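The role of the blank token is easiest to see in greedy CTC decoding: merge consecutive repeats, then drop blanks. The sketch below uses "_" as a stand-in blank symbol and hand-written frame labels, not real model output.

```python
from itertools import groupby

def ctc_collapse(frame_labels, blank="_"):
    """Greedy CTC decoding: merge consecutive duplicates, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]

# One label per audio frame; the blank between the two "l" runs keeps them distinct
frames = "hh_e_ll_lo"
print("".join(ctc_collapse(frames)))
```

Without the blank between the two runs of "l", the repeated letter would be merged into one and the output would read "helo".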

Wav2Vec 2.0

Wav2Vec 2.0, from Facebook AI Research, uses self-supervised learning to learn acoustic representations from large amounts of unlabeled audio. It can then be fine-tuned on a small labeled dataset; at its release in 2020 it reported state-of-the-art word error rates on LibriSpeech.

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"
processor  = Wav2Vec2Processor.from_pretrained(model_name)
model      = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def transcribe_wav2vec2(audio_path):
    speech, sample_rate = torchaudio.load(audio_path)

    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        speech    = resampler(speech)

    # torchaudio returns (channels, samples); average channels so stereo files work too
    speech = speech.mean(dim=0).numpy()
    inputs = processor(speech, sampling_rate=16000,
                       return_tensors="pt", padding=True).to(device)

    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription[0].lower()

text = transcribe_wav2vec2('speech.wav')
print(f"Transcription: {text}")

Whisper (OpenAI)

Whisper is OpenAI's large-scale multilingual ASR model released in 2022. Trained on 680,000 hours of multilingual audio, it supports 99 languages and delivers excellent accuracy out of the box — no fine-tuning required for most use cases.

Architecture: Encoder-Decoder Transformer

  • Encoder: converts audio to a Mel spectrogram and processes it through a transformer encoder
  • Decoder: autoregressively generates text tokens, including language detection and timestamps
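The size of the encoder input can be worked out from the audio front end. The constants below are assumptions taken from the openai-whisper reference implementation (16 kHz input, a 160-sample hop, 30-second windows); the function name is illustrative.

```python
SAMPLE_RATE = 16_000   # Hz, Whisper's expected input rate
HOP_LENGTH = 160       # samples between Mel frames (10 ms), assumed from the reference code
CHUNK_SECONDS = 30     # audio is padded or trimmed to 30-second windows

def mel_frames(duration_seconds):
    """Number of Mel frames produced for a clip of the given duration."""
    return int(duration_seconds * SAMPLE_RATE) // HOP_LENGTH

samples_per_chunk = SAMPLE_RATE * CHUNK_SECONDS
print(f"Samples per window:    {samples_per_chunk}")
print(f"Mel frames per window: {mel_frames(CHUNK_SECONDS)}")
print(f"Mel frames per second: {mel_frames(1)}")
```

Longer recordings are processed as a sequence of such windows, which is why transcription time grows roughly linearly with audio length.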

Model sizes:

  • tiny: 39M parameters — fastest
  • base: 74M
  • small: 244M
  • medium: 769M
  • large-v3: 1,550M — best accuracy
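The parameter counts above give a quick lower bound on memory. This sketch counts weights only; activations, the decoding cache, and framework overhead come on top, so treat the numbers as rough estimates.

```python
PARAMETERS_MILLIONS = {
    "tiny": 39, "base": 74, "small": 244, "medium": 769, "large-v3": 1550,
}
BYTES_PER_PARAMETER = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(model_size, precision="float16"):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    params = PARAMETERS_MILLIONS[model_size] * 1_000_000
    return params * BYTES_PER_PARAMETER[precision] / 1e9

for size in PARAMETERS_MILLIONS:
    print(f"{size:>8}: {weight_memory_gb(size, 'float32'):.2f} GB fp32, "
          f"{weight_memory_gb(size, 'float16'):.2f} GB fp16")
```

Halving the precision halves the weight footprint, which is the main reason reduced-precision inference is popular for the larger checkpoints.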

# pip install openai-whisper
import whisper
import numpy as np

# Load model (downloads automatically on first run)
model = whisper.load_model("base")  # or "small", "medium", "large-v3"

# Basic transcription
result = model.transcribe("speech.wav")
print("Transcript:", result["text"])
print("Detected language:", result["language"])

# Force a specific language
result_forced = model.transcribe("speech.wav", language="en", task="transcribe")

# With word-level timestamps
result_ts = model.transcribe("speech.wav", verbose=False, word_timestamps=True)

for segment in result_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")

# Translate non-English audio to English
result_translated = model.transcribe("german_speech.wav", task="translate")
print("English translation:", result_translated["text"])

# Microphone input (5-second clip)
def transcribe_from_microphone(duration=5, sample_rate=16000):
    import sounddevice as sd
    print(f"Recording for {duration} seconds...")
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype=np.float32)
    sd.wait()
    result = model.transcribe(audio.flatten(), language="en")
    print(f"You said: {result['text']}")

transcribe_from_microphone()

Faster-Whisper

faster-whisper reimplements Whisper using CTranslate2, achieving up to 4x faster inference with reduced memory usage.

from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16"  # "float16", "int8", "int8_float16"
)

segments, info = model.transcribe(
    "speech.wav",
    language="en",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

print(f"Language: {info.language}, confidence: {info.language_probability:.2f}")

full_text = ""
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
    full_text += segment.text
    if segment.words:
        for word in segment.words:
            print(f"  '{word.word}': {word.start:.2f}s - {word.end:.2f}s")

# Batch processing
def batch_transcribe(audio_files, output_dir):
    import os
    os.makedirs(output_dir, exist_ok=True)
    for audio_path in audio_files:
        name        = os.path.splitext(os.path.basename(audio_path))[0]
        output_path = os.path.join(output_dir, f"{name}.txt")
        segs, _     = model.transcribe(audio_path, language="en")
        with open(output_path, 'w', encoding='utf-8') as f:
            for seg in segs:
                f.write(f"[{seg.start:.2f} - {seg.end:.2f}] {seg.text}\n")
        print(f"Done: {audio_path} → {output_path}")

batch_transcribe(['interview1.wav', 'lecture.mp3'], './transcripts')

4. Text-to-Speech (TTS)

Deep Learning TTS Architectures

Tacotron 2: A sequence-to-sequence model that generates Mel spectrograms from text, coupled with a WaveNet vocoder. An attention mechanism aligns the text encoder output with the audio decoder.

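FastSpeech 2: A non-autoregressive model that generates all Mel frames in parallel, which makes inference much faster than autoregressive models such as Tacotron 2 (the original FastSpeech paper reports a 38x end-to-end synthesis speedup over an autoregressive Transformer TTS baseline, and FastSpeech 2 reports about 3x faster training than FastSpeech). A duration predictor replaces attention-based alignment, and pitch and energy are predicted from the encoded text by a variance adaptor.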

VITS: An end-to-end model combining variational inference with adversarial training. It merges the acoustic model and vocoder into a single network, yielding natural-sounding synthesis in one pass.
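The duration-based alignment used by FastSpeech-style models can be illustrated with a toy length regulator: each phoneme representation is repeated for as many Mel frames as its predicted duration. The phoneme symbols and durations below are invented for the example.

```python
def length_regulate(phoneme_states, durations):
    """Expand per-phoneme states to per-frame states using integer durations."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, duration in zip(phoneme_states, durations):
        frames.extend([state] * duration)  # repeat the state for `duration` frames
    return frames

# Hypothetical phonemes for "hello" with predicted durations in frames
phonemes = ["HH", "AH", "L", "OW"]
durations = [2, 3, 2, 5]
expanded = length_regulate(phonemes, durations)
print(expanded)
print(f"{len(phonemes)} phonemes → {len(expanded)} Mel frames")
```

Because the output length is fixed before decoding starts, all frames can be generated in parallel, and scaling the durations gives direct control over speaking rate.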

Edge TTS (Microsoft)

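edge-tts is a Python package that uses the online text-to-speech service behind Microsoft Edge. It offers high-quality neural voices and is free to use without an API key, but it requires a network connection.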

import asyncio
import edge_tts

async def synthesize_with_edge_tts():
    # List available voices
    voices     = await edge_tts.list_voices()
    en_voices  = [v for v in voices if v['Locale'].startswith('en-')]
    print("English voices:")
    for v in en_voices[:5]:
        print(f"  {v['ShortName']} - {v['FriendlyName']} ({v['Gender']})")

    # Basic synthesis
    text = "Hello! This is a demonstration of Microsoft Edge TTS."
    communicate = edge_tts.Communicate(
        text,
        voice="en-US-AriaNeural",
        rate="+0%",     # speed: -50% to +100%
        volume="+0%",
        pitch="+0Hz"
    )
    await communicate.save("output_edge.mp3")
    print("Saved: output_edge.mp3")

    # With word-boundary subtitles
    async def synthesize_with_subs(text, voice, audio_out, srt_out):
        communicate = edge_tts.Communicate(text, voice)
        subs = edge_tts.SubMaker()
        with open(audio_out, "wb") as af:
            async for chunk in communicate.stream():
                if chunk["type"] == "audio":
                    af.write(chunk["data"])
                elif chunk["type"] == "WordBoundary":
                    subs.feed(chunk)
        with open(srt_out, "w", encoding="utf-8") as sf:
            sf.write(subs.get_srt())

    await synthesize_with_subs(
        "The quick brown fox jumps over the lazy dog.",
        "en-US-AriaNeural",
        "output_subs.mp3",
        "output_subs.srt"
    )

asyncio.run(synthesize_with_edge_tts())

Coqui TTS (Open Source)

from TTS.api import TTS
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# XTTS v2: multilingual zero-shot TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Basic synthesis
tts.tts_to_file(
    text="This is a demonstration of Coqui XTTS v2.",
    file_path="output_xtts.wav",
    language="en",
    speaker_wav="reference_voice.wav"  # 3+ second reference for voice cloning
)

# Voice cloning
def clone_voice(reference_audio, text, output_path, language="en"):
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_audio,
        language=language,
        split_sentences=True
    )
    print(f"Voice clone saved: {output_path}")

clone_voice(
    "my_voice_sample.wav",
    "This sentence is spoken in a cloned voice.",
    "cloned_output.wav"
)

# English TTS with Tacotron2
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to(device)
tts_en.tts_to_file(
    text="Hello, this is a text-to-speech demonstration using Tacotron 2.",
    file_path="output_tacotron.wav"
)

OpenVoice (Voice Cloning)

# pip install openvoice melo-tts

from openvoice import se_extractor
from openvoice.api import ToneColorConverter
from melo.api import TTS
import os
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

tone_color_converter = ToneColorConverter(
    'checkpoints_v2/converter/config.json', device=device
)
tone_color_converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth')

# Generate base speech
tts_model  = TTS(language='EN', device=device)
speaker_id = tts_model.hps.data.spk2id['EN-US']
src_path   = 'tmp/output_base.wav'
os.makedirs(os.path.dirname(src_path), exist_ok=True)  # create tmp/ if it does not exist

tts_model.tts_to_file(
    text="Today we will discuss advances in voice cloning technology.",
    speaker_id=speaker_id,
    output_path=src_path,
    speed=1.0
)

# Extract tone color from reference speaker
target_se, _ = se_extractor.get_se(
    'reference.wav', tone_color_converter,
    target_dir='processed', vad=False
)
source_se = torch.load('checkpoints_v2/base_speakers/ses/en-us.pth', map_location=device)

# Apply tone color conversion
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path='output_cloned.wav'
)
print("Voice clone complete: output_cloned.wav")

5. Speaker Diarization

Speaker diarization answers the question "who spoke when". It is essential for meeting transcription, interview analysis, and multi-speaker subtitle generation.

from pyannote.audio import Pipeline
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
pipeline.to(device)

def diarize_audio(audio_path, num_speakers=None):
    kwargs = {"num_speakers": num_speakers} if num_speakers else {}
    diarization = pipeline(audio_path, **kwargs)

    timeline = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        entry = {
            'start':    turn.start,
            'end':      turn.end,
            'speaker':  speaker,
            'duration': turn.end - turn.start
        }
        timeline.append(entry)
        print(f"  [{turn.start:.2f}s - {turn.end:.2f}s] {speaker}")

    return timeline, diarization

# Combined diarization + Whisper transcription
def diarize_and_transcribe(audio_path, hf_token):
    from faster_whisper import WhisperModel

    whisper_model = WhisperModel("medium", device="cuda", compute_type="float16")
    segments, _   = whisper_model.transcribe(audio_path, word_timestamps=True)
    segments      = list(segments)

    diarize_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    ).to(torch.device("cuda"))
    diarization = diarize_pipeline(audio_path)

    def get_speaker(start, end):
        best_speaker, best_overlap = "Unknown", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, speaker
        return best_speaker

    result = []
    for segment in segments:
        speaker = get_speaker(segment.start, segment.end)
        entry   = {
            'start':   segment.start,
            'end':     segment.end,
            'speaker': speaker,
            'text':    segment.text.strip()
        }
        result.append(entry)
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {speaker}: {segment.text.strip()}")

    return result

timeline = diarize_and_transcribe("meeting.wav", "YOUR_HF_TOKEN")
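Once every segment carries a speaker label, simple meeting statistics fall out directly. The sketch below assumes entries shaped like the dictionaries returned above; the sample data and speaker labels are invented.

```python
from collections import defaultdict

def speaking_time(entries):
    """Total seconds spoken per speaker from a list of labeled segments."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry['speaker']] += entry['end'] - entry['start']
    return dict(totals)

# Hypothetical diarized transcript
sample = [
    {'start': 0.0, 'end': 3.5, 'speaker': 'SPEAKER_00', 'text': 'Welcome everyone.'},
    {'start': 3.5, 'end': 5.0, 'speaker': 'SPEAKER_01', 'text': 'Thanks.'},
    {'start': 5.0, 'end': 9.0, 'speaker': 'SPEAKER_00', 'text': "Let's get started."},
]

totals = speaking_time(sample)
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.1f} s")
```

The same loop can be extended to count turns or to group consecutive segments from one speaker into paragraphs.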

6. Speech Emotion Recognition

import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
import numpy as np

model_name       = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model            = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()

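# Read the label order from the checkpoint config instead of hard-coding it,
# so indices always match the classifier head
# (expected: angry, calm, disgust, fearful, happy, neutral, sad, surprised)
EMOTIONS = [model.config.id2label[i] for i in range(model.config.num_labels)]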

def predict_emotion(audio_path):
    speech, sr = torchaudio.load(audio_path)
    if sr != 16000:
        speech = torchaudio.transforms.Resample(sr, 16000)(speech)

    inputs = feature_extractor(
        speech.mean(dim=0).numpy(),  # average channels: (channels, samples) → (samples,)
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)

    top_probs, top_idx = torch.topk(probs, 3)
    print("Top-3 emotions:")
    for idx, prob in zip(top_idx[0], top_probs[0]):
        print(f"  {EMOTIONS[idx.item()]}: {prob.item():.4f}")

    predicted = EMOTIONS[probs.argmax().item()]
    print(f"Predicted: {predicted}")
    return predicted, dict(zip(EMOTIONS, probs[0].tolist()))

emotion, probs = predict_emotion('speech.wav')

# Feature-based emotion analysis
def analyze_emotion_features(audio_path):
    import librosa
    y, sr = librosa.load(audio_path, sr=16000)

    mfcc               = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    spectral_centroid  = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_rolloff   = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zero_crossing      = librosa.feature.zero_crossing_rate(y=y)
    rms                = librosa.feature.rms(y=y)
    pitch, mag         = librosa.piptrack(y=y, sr=sr)  # pitch track; not included in the feature vector below

    features = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.std(spectral_centroid)],
        [np.mean(spectral_rolloff)],
        [np.mean(zero_crossing)],
        [np.mean(rms)],
        [np.std(rms)]
    ])

    print(f"Feature vector dim: {features.shape}")
    return features

7. Music AI

MusicGen (Meta)

Meta AI's MusicGen generates music from text prompts, conditioning on descriptions of genre, instruments, mood, and tempo.

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torch

model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(
    duration=30,
    temperature=1.0,
    top_k=250,
    top_p=0.0,
    cfg_coef=3.0,
)

descriptions = [
    "upbeat electronic dance music with synthesizers and strong bass",
    "peaceful classical piano music with violin, gentle and romantic",
    "intense rock music with electric guitar and drums"
]

print("Generating music...")
wav = model.generate(descriptions)

for idx, (desc, one_wav) in enumerate(zip(descriptions, wav)):
    filename = f'generated_music_{idx}'
    audio_write(
        filename, one_wav.cpu(), model.sample_rate,
        strategy="loudness", loudness_compressor=True
    )
    print(f"Saved: {filename}.wav — '{desc}'")

# Melody-conditioned generation
model_melody = MusicGen.get_pretrained('facebook/musicgen-melody')
model_melody.set_generation_params(duration=15)

import torchaudio
melody, melody_sr = torchaudio.load('humming.wav')

wav_melody = model_melody.generate_with_chroma(
    descriptions=["full orchestral arrangement, epic and cinematic"],
    melody_wavs=melody.unsqueeze(0),
    melody_sample_rate=melody_sr
)
audio_write('melody_based', wav_melody[0].cpu(), model_melody.sample_rate,
            strategy="loudness")
print("Melody-conditioned generation complete.")

Music Genre Classification

import librosa
import numpy as np
import torch
import torch.nn as nn

GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

def extract_music_features(audio_path, sr=22050, duration=30):
    y, _ = librosa.load(audio_path, sr=sr, duration=duration)

    mfcc               = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    spectral_centroid  = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    spectral_rolloff   = librosa.feature.spectral_rolloff(y=y, sr=sr)
    spectral_contrast  = librosa.feature.spectral_contrast(y=y, sr=sr)
    zcr                = librosa.feature.zero_crossing_rate(y)
    chroma             = librosa.feature.chroma_stft(y=y, sr=sr)
    tempo, _           = librosa.beat.beat_track(y=y, sr=sr)

    feature_vec = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.mean(spectral_bandwidth)],
        [np.mean(spectral_rolloff)],
        [np.mean(spectral_contrast)],
        [np.mean(zcr)],
        [np.std(zcr)],
        np.mean(chroma, axis=1),
        [float(np.atleast_1d(tempo)[0])]  # beat_track may return a 1-element array
    ])
    return feature_vec

class GenreClassifier(nn.Module):
    # 40 MFCC stats + 6 spectral/ZCR stats + 12 chroma + 1 tempo = 59 features
    def __init__(self, input_dim=59, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256),       nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128),       nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        return self.net(x)

def classify_genre(audio_path, model):
    model.eval()  # BatchNorm and Dropout must be in inference mode for a single sample
    feat = extract_music_features(audio_path)
    x    = torch.FloatTensor(feat).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    top_prob, top_idx = torch.topk(probs, 3)
    print("Top-3 genres:")
    for prob, idx in zip(top_prob, top_idx):
        print(f"  {GENRES[idx.item()]}: {prob.item():.4f}")
    return GENRES[probs.argmax().item()]
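The classifier's `input_dim` has to match the length of the vector built by `extract_music_features`. A small bookkeeping sketch, assuming the feature layout used above (the helper name is illustrative):

```python
def genre_feature_dim(n_mfcc=20, n_chroma=12):
    """Length of the feature vector: count each concatenated component."""
    mfcc_stats = 2 * n_mfcc   # mean and std per MFCC coefficient
    scalar_stats = 6          # centroid, bandwidth, rolloff, contrast, ZCR mean, ZCR std
    tempo = 1
    return mfcc_stats + scalar_stats + n_chroma + tempo

print(f"Expected input_dim: {genre_feature_dim()}")
```

Computing the dimension from the components, or simply reading `len(feature_vec)` from one extracted sample, avoids shape mismatches when features are added or removed.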

8. Practical Audio AI Projects

Project 1: Real-time Subtitle System

import queue
import threading
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
import time

class RealtimeSubtitleSystem:
    def __init__(self, model_size="base", language="en"):
        print(f"Loading Whisper {model_size}...")
        self.model        = WhisperModel(model_size, device="cuda", compute_type="float16")
        self.language     = language
        self.audio_queue  = queue.Queue()
        self.is_running   = False
        self.sample_rate  = 16000
        self.chunk_secs   = 3

    def audio_callback(self, indata, frames, time_info, status):
        if status:
            print(f"Audio status: {status}")
        self.audio_queue.put(indata.copy())

    def transcription_worker(self):
        buffer     = np.array([], dtype=np.float32)
        chunk_size = self.sample_rate * self.chunk_secs

        while self.is_running or not self.audio_queue.empty():
            try:
                chunk  = self.audio_queue.get(timeout=0.1)
                buffer = np.append(buffer, chunk.flatten())

                if len(buffer) >= chunk_size:
                    audio_data = buffer[:chunk_size]
                    buffer     = buffer[chunk_size // 2:]   # 50% overlap

                    segments, _ = self.model.transcribe(
                        audio_data, language=self.language,
                        vad_filter=True,
                        vad_parameters=dict(min_silence_duration_ms=300)
                    )
                    for seg in segments:
                        text = seg.text.strip()
                        if text:
                            ts = time.strftime("%H:%M:%S")
                            print(f"[{ts}] {text}")

            except queue.Empty:
                continue
            except Exception as e:
                print(f"Transcription error: {e}")

    def start(self):
        self.is_running = True
        t = threading.Thread(target=self.transcription_worker, daemon=True)
        t.start()

        print("Real-time subtitle system running (Ctrl+C to stop)")
        print("-" * 50)

        with sd.InputStream(
            samplerate=self.sample_rate, channels=1,
            dtype=np.float32, callback=self.audio_callback,
            blocksize=int(self.sample_rate * 0.1)
        ):
            try:
                while True:
                    time.sleep(0.1)
            except KeyboardInterrupt:
                print("\nStopping...")
                self.is_running = False

        t.join()
        print("System stopped.")

system = RealtimeSubtitleSystem(model_size="small", language="en")
system.start()
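The 50% overlap used by the transcription worker is a sliding window over the audio buffer. The sketch below shows the same idea on a plain list so the indices are easy to follow; the function name and the tiny sizes are illustrative.

```python
def sliding_chunks(samples, chunk_size, hop_size):
    """Yield fixed-size windows that advance by hop_size samples each step."""
    for start in range(0, len(samples) - chunk_size + 1, hop_size):
        yield samples[start:start + chunk_size]

# 10 "samples", windows of 4 with a hop of 2 (50% overlap)
samples = list(range(10))
for chunk in sliding_chunks(samples, chunk_size=4, hop_size=2):
    print(chunk)
```

Overlap reduces the chance of cutting a word in half at a window boundary, at the cost of transcribing each stretch of audio twice, so downstream code usually needs to deduplicate repeated text.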

Project 2: Voice Chatbot

import openai
import sounddevice as sd
import numpy as np
import tempfile
import os
import time
from faster_whisper import WhisperModel
import edge_tts
import asyncio
import soundfile as sf

class VoiceChatbot:
    def __init__(self):
        self.client       = openai.OpenAI()
        self.whisper      = WhisperModel("base", device="cpu", compute_type="int8")
        self.sample_rate  = 16000
        self.history      = []
        self.system_prompt = (
            "You are a helpful and knowledgeable AI assistant. "
            "Respond concisely and clearly in English."
        )

    def record_audio(self, duration=5):
        print(f"Recording... ({duration} s)")
        audio = sd.rec(int(duration * self.sample_rate),
                       samplerate=self.sample_rate, channels=1, dtype=np.float32)
        sd.wait()
        print("Done recording.")
        return audio.flatten()

    def speech_to_text(self, audio):
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            sf.write(f.name, audio, self.sample_rate)
            tmp = f.name
        try:
            segs, _ = self.whisper.transcribe(tmp, language="en", vad_filter=True)
            text = " ".join(s.text.strip() for s in segs)
        finally:
            os.unlink(tmp)
        return text.strip()

    def chat(self, user_text):
        self.history.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": self.system_prompt}] + self.history,
            max_tokens=300,
            temperature=0.7
        )
        assistant_text = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_text})
        if len(self.history) > 20:
            self.history = self.history[-20:]
        return assistant_text

    async def text_to_speech(self, text, output_path='response.mp3'):
        communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural", rate="+10%")
        await communicate.save(output_path)
        return output_path

    def play_audio(self, path):
        import subprocess
        if os.name == 'nt':
            os.startfile(path)
        elif hasattr(os, 'uname') and os.uname().sysname == 'Darwin':
            subprocess.run(['afplay', path])
        else:
            subprocess.run(['mpg123', path])

    def run(self):
        print("Voice chatbot started!")
        print("Say 'quit', 'exit', 'stop', or 'bye' to end the session.")
        print("=" * 50)

        while True:
            audio     = self.record_audio(duration=5)
            user_text = self.speech_to_text(audio)

            if not user_text:
                print("Could not understand audio. Please try again.")
                continue

            print(f"You: {user_text}")

            if any(w in user_text.lower() for w in ['quit', 'exit', 'stop', 'bye']):
                print("Goodbye!")
                break

            response = self.chat(user_text)
            print(f"AI: {response}")

            audio_path = asyncio.run(self.text_to_speech(response))
            self.play_audio(audio_path)
            time.sleep(0.5)

chatbot = VoiceChatbot()
chatbot.run()
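The exit check in `run` matches substrings, so a sentence containing "nonstop" would also end the session. One possible refinement, sketched here with a hypothetical helper, is to match whole words only:

```python
import re

EXIT_WORDS = ("quit", "exit", "stop", "bye")
# \b anchors the match to word boundaries so "nonstop" or "stops" do not trigger it
EXIT_PATTERN = re.compile(r"\b(" + "|".join(EXIT_WORDS) + r")\b", re.IGNORECASE)

def is_exit_command(text):
    """Return True if the utterance contains one of the exit words as a whole word."""
    return EXIT_PATTERN.search(text) is not None

for utterance in ("Okay, bye!", "Is there a nonstop flight?", "Please stop."):
    print(f"{utterance!r}: {is_exit_command(utterance)}")
```

Speech recognizers often add punctuation and capitalization, so case-insensitive matching on word boundaries tends to be more reliable than comparing exact strings.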

Speech & Audio AI Learning Roadmap

Beginner

  1. Audio feature extraction practice with librosa
  2. Build a transcription prototype with the Whisper API
  3. Create TTS applications with Edge TTS

Intermediate

  1. Fine-tune Wav2Vec 2.0 for custom ASR
  2. Build a speaker diarization pipeline with pyannote.audio
  3. Implement a real-time voice chatbot

Advanced

  1. Train a custom TTS model with VITS or XTTS
  2. Build music generation applications with MusicGen
  3. Combine emotion recognition with conversational AI
