Speech & Audio AI Complete Guide: ASR, TTS, Whisper, Wav2Vec to Voice Synthesis
Speech and audio AI creates the most natural interface between humans and machines. From smartphone voice assistants to real-time translation systems and synthetic voices for virtual influencers, audio AI technology has woven itself into our everyday lives.
This guide takes you through the full audio AI ecosystem — from the physics of sound and digital signal processing to automatic speech recognition (ASR), text-to-speech (TTS), speaker diarization, and music generation — with practical Python code throughout.
1. Audio Signal Processing Fundamentals
Physical Properties of Sound
Sound is a pressure wave propagating through air. Understanding the following concepts is essential for digital audio processing.
**Frequency**: The number of vibrations per second, measured in Hz (hertz). The human hearing range spans roughly 20 Hz to 20,000 Hz. Low frequencies correspond to bass tones; high frequencies to treble.
**Amplitude**: The magnitude of the wave, i.e. the strength of the sound pressure. Levels are expressed in decibels (dB) relative to a reference. In digital audio the reference is usually full scale (0 dBFS, the loudest representable level), so quieter signals have negative values.
**Phase**: The position of a waveform along the time axis. Two waves of the same frequency but different phases produce constructive or destructive interference when combined.
**Harmonics**: Frequency components at integer multiples of the fundamental frequency. They determine the timbre (tone color) of an instrument or voice.
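To make the decibel scale concrete, here is a minimal sketch that converts a linear amplitude to decibels. It assumes the common digital convention that an amplitude of 1.0 is full scale; the function name `amplitude_to_dbfs` is illustrative and not part of any library.

```python
import numpy as np

def amplitude_to_dbfs(amplitude, eps=1e-12):
    """Convert a linear amplitude (1.0 = full scale) to dB relative to full scale."""
    # eps avoids log10(0) for digital silence
    return 20.0 * np.log10(np.maximum(np.abs(amplitude), eps))

full_scale_db = amplitude_to_dbfs(1.0)  # the reference level itself
half_scale_db = amplitude_to_dbfs(0.5)  # halving the amplitude costs about 6 dB

print(f"1.0 -> {full_scale_db:.2f} dBFS")
print(f"0.5 -> {half_scale_db:.2f} dBFS")
```

Each halving of amplitude lowers the level by roughly 6 dB, which is why the scale is convenient for describing the wide range of loudness found in real recordings.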
Sampling Rate and Bit Depth
**Sampling Rate**: The number of audio samples captured per second (Hz). The Nyquist theorem states that to fully reconstruct a signal, you must sample at more than twice the highest frequency present.
- CD quality: 44,100 Hz (44.1 kHz)
- High-resolution audio: 48,000 Hz (video), 96,000 Hz, 192,000 Hz
- Telephone quality: 8,000 Hz; wideband telephony: 16,000 Hz
- Whisper default: 16,000 Hz
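The Nyquist limit can be checked numerically. The sketch below, using only NumPy, samples a 1 kHz tone and a 7 kHz tone at 8 kHz. The 7 kHz tone lies above the 4 kHz Nyquist frequency, so it shows up at 1 kHz after sampling (aliasing). The helper name `dominant_frequency` is illustrative.

```python
import numpy as np

def dominant_frequency(signal, sr):
    """Return the frequency (Hz) of the strongest bin in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return freqs[np.argmax(spectrum)]

sr = 8000                      # telephone-quality sampling rate
t = np.arange(sr) / sr         # one second of sample times

tone_ok = np.sin(2 * np.pi * 1000 * t)       # below Nyquist (4 kHz)
tone_aliased = np.sin(2 * np.pi * 7000 * t)  # above Nyquist

f_ok = dominant_frequency(tone_ok, sr)
f_alias = dominant_frequency(tone_aliased, sr)
print(f"1 kHz tone measured at {f_ok:.0f} Hz")
print(f"7 kHz tone measured at {f_alias:.0f} Hz")
```

This is why audio is low-pass filtered before it is downsampled: content above half the target rate cannot be represented and would otherwise fold back into the audible band.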
**Bit Depth**: The number of bits per sample — determines the dynamic range.
- 16-bit: 65,536 levels, 96 dB dynamic range (CD standard)
- 24-bit: 16,777,216 levels, 144 dB dynamic range
- 32-bit float: standard for deep learning pipelines
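The dynamic-range figures above follow from the number of quantization levels: each additional bit doubles the level count and adds about 6.02 dB. A minimal sketch of that arithmetic (the function name is illustrative):

```python
import math

def dynamic_range_db(bits):
    """Theoretical dynamic range of linear PCM with the given bit depth."""
    return 20.0 * math.log10(2 ** bits)

levels_16 = 2 ** 16
dr16 = dynamic_range_db(16)
dr24 = dynamic_range_db(24)

print(f"16-bit: {levels_16} levels, {dr16:.2f} dB")
print(f"24-bit: {2 ** 24} levels, {dr24:.2f} dB")
```

The commonly quoted 96 dB and 144 dB values are these results rounded down.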
librosa Overview
librosa is the core Python library for audio analysis.
pip install librosa soundfile matplotlib numpy scipy
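import librosa
import librosa.display
import matplotlib.pyplot as plt
import soundfile as sf
# Load audio (sr=None preserves the original sample rate)
y, sr = librosa.load('audio.wav', sr=None)
print(f"Duration: {len(y)/sr:.2f} s")
print(f"Sample rate: {sr} Hz")
print(f"Samples: {len(y)}")
print(f"dtype: {y.dtype}")
# Resample to 16 kHz
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)
# Stereo → mono: (2, N) → (N,). librosa.load already downmixes by default (mono=True),
# so this is only needed for audio loaded with mono=False.
y_mono = librosa.to_mono(y)
# Save
sf.write('output.wav', y_16k, 16000)
# Waveform visualization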
plt.figure(figsize=(14, 4))
librosa.display.waveshow(y, sr=sr, alpha=0.8, color='steelblue')
plt.title('Waveform')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.savefig('waveform.png')
plt.show()
2. Audio Feature Extraction
Fourier Transform (FFT)
The Fourier transform converts a time-domain signal into the frequency domain, revealing which frequency components are present and how strongly.
import numpy as np
import matplotlib.pyplot as plt
import librosa
from scipy.fft import fft, fftfreq
y, sr = librosa.load('audio.wav', sr=22050)
N = len(y)
yf = fft(y)
xf = fftfreq(N, 1/sr)
# Keep only positive frequencies (the spectrum of a real signal is Hermitian-symmetric)
xf_pos = xf[:N//2]
yf_pos = np.abs(yf[:N//2])
plt.figure(figsize=(12, 4))
plt.plot(xf_pos, yf_pos)
plt.title('Frequency Spectrum (FFT)')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Magnitude')
plt.xlim(0, sr//2)
plt.yscale('log')
plt.grid(True)
plt.tight_layout()
plt.show()
Short-Time Fourier Transform (STFT)
A plain FFT over the whole signal yields a single spectrum: it shows which frequencies are present but not when they occur. The STFT applies the FFT to short overlapping windows, producing a time-frequency representation that captures how spectral content evolves over time.
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
y, sr = librosa.load('speech.wav', sr=22050)
n_fft = 2048       # FFT size (frequency resolution)
hop_length = 512   # hop between successive frames, in samples
win_length = 2048  # window size
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)
magnitude = np.abs(D)
phase = np.angle(D)
D_db = librosa.amplitude_to_db(magnitude, ref=np.max)
plt.figure(figsize=(14, 6))
librosa.display.specshow(
    D_db, sr=sr, hop_length=hop_length,
    x_axis='time', y_axis='hz', cmap='magma'
)
plt.colorbar(format='%+2.0f dB')
plt.title('STFT Spectrogram')
plt.tight_layout()
plt.savefig('stft_spectrogram.png')
plt.show()
print(f"STFT shape: {D.shape}")
print(f"Frequency resolution: {sr/n_fft:.2f} Hz")
print(f"Time resolution: {hop_length/sr*1000:.2f} ms")
Mel Spectrogram
The human auditory system perceives pitch on a logarithmic scale — more sensitive at low frequencies and less so at high frequencies. The Mel scale models this perceptual non-linearity. Mel spectrograms are the most widely used input representation for deep learning audio models.
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
y, sr = librosa.load('speech.wav', sr=22050)
n_mels = 128
n_fft = 2048
hop_length = 512
mel_spec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_mels=n_mels,
    n_fft=n_fft, hop_length=hop_length,
    fmin=0, fmax=sr//2, power=2.0
)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)
plt.figure(figsize=(14, 6))
librosa.display.specshow(
    mel_db, sr=sr, hop_length=hop_length,
    x_axis='time', y_axis='mel',
    fmax=sr//2, cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.savefig('mel_spectrogram.png')
plt.show()
print(f"Mel Spectrogram shape: {mel_spec.shape}")
# (n_mels, time_frames) → ~(128, 43 * duration_seconds), since 22050 / 512 ≈ 43 frames per second
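To illustrate why Mel filters are narrow at low frequencies and wide at high frequencies, the sketch below converts between Hz and Mel using the widely cited HTK-style formula m = 2595 · log10(1 + f / 700). This is one common definition and an assumption here; librosa's default Mel scale uses a slightly different (Slaney) variant, so the exact numbers may not match its filterbank.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style Hz → Mel conversion."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Ten points equally spaced on the Mel axis between 0 Hz and 11,025 Hz
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(11025.0), 10)
hz_points = mel_to_hz(mel_points)
spacing = np.diff(hz_points)  # gap in Hz between neighbouring points

for hz, gap in zip(hz_points[:-1], spacing):
    print(f"{hz:8.1f} Hz  (next point is {gap:7.1f} Hz higher)")
```

Points that are evenly spaced in Mel are increasingly far apart in Hz, which mirrors the coarser pitch resolution of human hearing at high frequencies.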
MFCC (Mel-Frequency Cepstral Coefficients)
MFCCs are obtained by applying the Discrete Cosine Transform (DCT) to the log Mel filterbank outputs. They compactly represent the spectral envelope (timbre) and have been the standard features for speech recognition for decades.
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
y, sr = librosa.load('speech.wav', sr=22050)
n_mfcc = 40
hop_length = 512
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
mfcc_delta = librosa.feature.delta(mfccs)
mfcc_delta2 = librosa.feature.delta(mfccs, order=2)
# Concatenated feature: 120-dimensional (40 + 40 + 40)
mfcc_combined = np.vstack([mfccs, mfcc_delta, mfcc_delta2])
fig, axes = plt.subplots(3, 1, figsize=(14, 8))
for ax, data, title in zip(
    axes,
    [mfccs, mfcc_delta, mfcc_delta2],
    ['MFCC', 'MFCC Delta (1st derivative)', 'MFCC Delta-Delta (2nd derivative)']
):
    librosa.display.specshow(data, x_axis='time', hop_length=hop_length, sr=sr, ax=ax)
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)
plt.tight_layout()
plt.savefig('mfcc.png')
plt.show()
# Fixed-length feature vector for classification (per-coefficient mean and std over time)
mfcc_feature = np.concatenate([np.mean(mfccs, axis=1), np.std(mfccs, axis=1)])
print(f"MFCC feature vector dim: {mfcc_feature.shape}")
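The DCT step that turns log-Mel energies into MFCCs can be sketched with NumPy alone. The example below implements an orthonormal DCT-II and keeps the first 13 coefficients. The random matrix is only a stand-in for a real log-Mel spectrogram, and `dct_ii_ortho` is an illustrative name; library implementations (librosa, to the best of my knowledge, uses an orthonormal type-2 DCT by default) will be faster.

```python
import numpy as np

def dct_ii_ortho(x):
    """Orthonormal DCT-II along axis 0 of a (n_bands, n_frames) array."""
    n = x.shape[0]
    k = np.arange(n)[:, None]          # output coefficient index
    i = np.arange(n)[None, :]          # input band index
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    scale = np.full((n, 1), np.sqrt(2.0 / n))
    scale[0, 0] = np.sqrt(1.0 / n)     # DC term uses a different normalisation
    return (scale * basis) @ x

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(128, 50))   # stand-in for log-Mel energies (bands × frames)

all_coeffs = dct_ii_ortho(log_mel)
mfcc = all_coeffs[:13]                 # low-order coefficients describe the spectral envelope
print(mfcc.shape)
```

Truncating to the low-order coefficients discards fine spectral detail such as pitch harmonics and keeps the smooth envelope, which is why MFCCs are compact.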
Chromagram
A chromagram represents the energy distribution across the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B). It is widely used in music analysis for chord recognition and key detection.
import matplotlib.pyplot as plt
import librosa
import librosa.display
y, sr = librosa.load('music.wav', sr=22050)
hop_length = 512
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)
chroma_cqt = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=hop_length)
fig, axes = plt.subplots(3, 1, figsize=(14, 10))
for ax, chroma, title in zip(
    axes,
    [chroma_stft, chroma_cqt, chroma_cens],
    ['Chroma STFT', 'Chroma CQT', 'Chroma CENS']
):
    librosa.display.specshow(
        chroma, y_axis='chroma', x_axis='time',
        hop_length=hop_length, sr=sr, cmap='coolwarm', ax=ax
    )
    ax.set_title(title)
    plt.colorbar(ax.collections[0], ax=ax)
plt.tight_layout()
plt.savefig('chromagram.png')
plt.show()
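The idea behind a chromagram, folding every frequency onto one of 12 pitch classes regardless of octave, can be shown for a single frequency. This sketch assumes equal temperament with A4 = 440 Hz; `hz_to_pitch_class` is an illustrative helper, not a librosa function.

```python
import numpy as np

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def hz_to_pitch_class(freq_hz, a4=440.0):
    """Map a frequency to its nearest pitch class, ignoring the octave."""
    midi = 69 + 12 * np.log2(freq_hz / a4)   # MIDI note number (A4 = 69)
    return PITCH_CLASSES[int(round(midi)) % 12]

for f in (261.63, 329.63, 440.0, 880.0):
    print(f"{f:7.2f} Hz -> {hz_to_pitch_class(f)}")
```

A chromagram applies the same folding to every spectrogram bin and sums the energy per pitch class, so 440 Hz and 880 Hz both contribute to the "A" row.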
3. Automatic Speech Recognition (ASR)
Traditional ASR: HMM + GMM
Classic speech recognition combined Hidden Markov Models (HMMs), which model the temporal sequence of phoneme states, with Gaussian Mixture Models (GMMs), which model the distribution of acoustic features within each state. MFCC features are extracted from the audio, the HMM-GMM acoustic model scores candidate phoneme sequences, a pronunciation lexicon maps phonemes to words, and a language model ranks the resulting word sequences.
CTC (Connectionist Temporal Classification)
CTC enables end-to-end training when input and output sequences have different lengths, without requiring forced alignment. A blank token handles repeated characters and silences, allowing the model to learn directly from audio-text pairs.
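The role of the blank token is easiest to see in greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop blanks. The sketch below is a minimal illustration with a made-up five-symbol vocabulary (blank id 0); it is not the decoder of any particular library.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame token ids: merge consecutive repeats, then remove blanks."""
    decoded = []
    previous = None
    for token in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded

# Hypothetical vocabulary; "_" is the blank symbol
vocab = {0: "_", 1: "h", 2: "e", 3: "l", 4: "o"}

# Per-frame argmax output for 11 audio frames: h h _ e e l _ l l o _
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
ids = ctc_greedy_collapse(frames)
text = "".join(vocab[i] for i in ids)
print(text)
```

Without the blank between the two runs of "l", the repeated frames would merge into a single "l"; the blank is what lets CTC represent genuinely doubled characters.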
Wav2Vec 2.0
Wav2Vec 2.0, from Facebook AI Research, uses self-supervised learning to learn acoustic representations from large amounts of unlabeled audio. It can then be fine-tuned with CTC on a comparatively small labeled dataset; the original paper reported state-of-the-art word error rates on LibriSpeech at the time of publication.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
def transcribe_wav2vec2(audio_path):
    speech, sample_rate = torchaudio.load(audio_path)  # shape: (channels, samples)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        speech = resampler(speech)
    # Downmix to mono so the processor always receives a 1-D array
    speech = speech.mean(dim=0).numpy()
    inputs = processor(speech, sampling_rate=16000,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Greedy CTC decoding: most likely token per frame, collapsed by the processor
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription[0].lower()
text = transcribe_wav2vec2('speech.wav')
print(f"Transcription: {text}")
Whisper (OpenAI)
Whisper is OpenAI's large-scale multilingual ASR model released in 2022. Trained on 680,000 hours of multilingual audio, it supports 99 languages and delivers excellent accuracy out of the box — no fine-tuning required for most use cases.
**Architecture**: Encoder-Decoder Transformer
- Encoder: converts audio to a Mel spectrogram and processes it through a transformer encoder
- Decoder: autoregressively generates text tokens, including language detection and timestamps
**Model sizes**:
- tiny: 39M parameters — fastest
- base: 74M
- small: 244M
- medium: 769M
- large-v3: 1,550M — best accuracy
import numpy as np
import sounddevice as sd
import whisper
# Load model (downloads automatically on first run)
model = whisper.load_model("base")  # or "small", "medium", "large-v3"
# Basic transcription
result = model.transcribe("speech.wav")
print("Transcript:", result["text"])
print("Detected language:", result["language"])
# Force a specific language
result_forced = model.transcribe("speech.wav", language="en", task="transcribe")
# With timestamps (segment-level printed here; word-level entries are in segment["words"])
result_ts = model.transcribe("speech.wav", verbose=False, word_timestamps=True)
for segment in result_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
# Translate non-English audio to English
result_translated = model.transcribe("german_speech.wav", task="translate")
print("English translation:", result_translated["text"])
# Microphone input (5-second clip)
def transcribe_from_microphone(duration=5, sample_rate=16000):
    print(f"Recording for {duration} seconds...")
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype=np.float32)
    sd.wait()
    result = model.transcribe(audio.flatten(), language="en")
    print(f"You said: {result['text']}")
transcribe_from_microphone()
Faster-Whisper
faster-whisper reimplements Whisper using CTranslate2, achieving up to 4x faster inference with reduced memory usage.
import os
from faster_whisper import WhisperModel
model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16"  # "float16", "int8", "int8_float16"
)
segments, info = model.transcribe(
    "speech.wav",
    language="en",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)
print(f"Language: {info.language}, confidence: {info.language_probability:.2f}")
full_text = ""
# segments is a generator: transcription runs lazily as it is iterated
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
    full_text += segment.text
    if segment.words:
        for word in segment.words:
            print(f"  '{word.word}': {word.start:.2f}s - {word.end:.2f}s")
# Batch processing
def batch_transcribe(audio_files, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for audio_path in audio_files:
        name = os.path.splitext(os.path.basename(audio_path))[0]
        output_path = os.path.join(output_dir, f"{name}.txt")
        segs, _ = model.transcribe(audio_path, language="en")
        with open(output_path, 'w', encoding='utf-8') as f:
            for seg in segs:
                f.write(f"[{seg.start:.2f} - {seg.end:.2f}] {seg.text}\n")
        print(f"Done: {audio_path} → {output_path}")
batch_transcribe(['interview1.wav', 'lecture.mp3'], './transcripts')
4. Text-to-Speech (TTS)
Deep Learning TTS Architectures
**Tacotron 2**: A sequence-to-sequence model that generates Mel spectrograms from text, coupled with a WaveNet vocoder. An attention mechanism aligns the text encoder output with the audio decoder.
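**FastSpeech 2**: A non-autoregressive model that generates all Mel frames in parallel, which makes inference much faster than autoregressive models such as Tacotron 2. The original FastSpeech paper reported roughly 38x faster end-to-end synthesis than an autoregressive Transformer TTS baseline, and FastSpeech 2 reported about 3x faster training than FastSpeech. A duration predictor solves the alignment problem, and pitch and energy are predicted explicitly rather than left implicit in the decoder.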
**VITS**: An end-to-end model combining variational inference with adversarial training. It merges the acoustic model and vocoder into a single network, yielding natural-sounding synthesis in one pass.
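The duration-based alignment used by non-autoregressive models such as FastSpeech 2 can be sketched as a "length regulator": each phoneme's hidden vector is repeated for as many Mel frames as its predicted duration. The example below is a simplified NumPy illustration with made-up values, not code from any TTS library.

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Expand (n_phonemes, dim) hidden states to (n_frames, dim) using per-phoneme frame counts."""
    durations = np.asarray(durations, dtype=int)
    return np.repeat(phoneme_hidden, durations, axis=0)

# Three phonemes with 4-dimensional hidden states (toy values)
hidden = np.arange(12, dtype=np.float32).reshape(3, 4)
durations = [2, 0, 3]   # hypothetical predicted durations, in Mel frames

frames = length_regulate(hidden, durations)
print(frames.shape)
```

Because the output length is fixed by the durations up front, the decoder can produce every frame in parallel instead of relying on attention to discover the alignment step by step.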
Edge TTS (Microsoft)
Microsoft Edge's online neural TTS voices can be used without an API key through the community-maintained `edge-tts` Python package.
import asyncio
import edge_tts
async def synthesize_with_edge_tts():
    # List available voices
    voices = await edge_tts.list_voices()
    en_voices = [v for v in voices if v['Locale'].startswith('en-')]
    print("English voices:")
    for v in en_voices[:5]:
        print(f"  {v['ShortName']} - {v['FriendlyName']} ({v['Gender']})")
    # Basic synthesis
    text = "Hello! This is a demonstration of Microsoft Edge TTS."
    communicate = edge_tts.Communicate(
        text,
        voice="en-US-AriaNeural",
        rate="+0%",  # speed: -50% to +100%
        volume="+0%",
        pitch="+0Hz"
    )
    await communicate.save("output_edge.mp3")
    print("Saved: output_edge.mp3")
    # With word-boundary subtitles (SubMaker.feed / get_srt assume a recent edge-tts release)
    async def synthesize_with_subs(text, voice, audio_out, srt_out):
        communicate = edge_tts.Communicate(text, voice)
        subs = edge_tts.SubMaker()
        with open(audio_out, "wb") as af:
            async for chunk in communicate.stream():
                if chunk["type"] == "audio":
                    af.write(chunk["data"])
                elif chunk["type"] == "WordBoundary":
                    subs.feed(chunk)
        with open(srt_out, "w", encoding="utf-8") as srt_file:
            srt_file.write(subs.get_srt())
    await synthesize_with_subs(
        "The quick brown fox jumps over the lazy dog.",
        "en-US-AriaNeural",
        "output_subs.mp3",
        "output_subs.srt"
    )
asyncio.run(synthesize_with_edge_tts())
Coqui TTS (Open Source)
import torch
from TTS.api import TTS
device = "cuda" if torch.cuda.is_available() else "cpu"
# XTTS v2: multilingual zero-shot TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# Basic synthesis
tts.tts_to_file(
    text="This is a demonstration of Coqui XTTS v2.",
    file_path="output_xtts.wav",
    language="en",
    speaker_wav="reference_voice.wav"  # short reference clip for voice cloning (the XTTS v2 model card suggests about 6 seconds)
)
# Voice cloning
def clone_voice(reference_audio, text, output_path, language="en"):
    tts.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_audio,
        language=language,
        split_sentences=True
    )
    print(f"Voice clone saved: {output_path}")
clone_voice(
    "my_voice_sample.wav",
    "This sentence is spoken in a cloned voice.",
    "cloned_output.wav"
)
# English TTS with Tacotron2
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to(device)
tts_en.tts_to_file(
    text="Hello, this is a text-to-speech demonstration using Tacotron 2.",
    file_path="output_tacotron.wav"
)
OpenVoice (Voice Cloning)
pip install openvoice melo-tts
import os
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
from melo.api import TTS
device = "cuda" if torch.cuda.is_available() else "cpu"
tone_color_converter = ToneColorConverter(
    'checkpoints_v2/converter/config.json', device=device
)
tone_color_converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth')
# Generate base speech
tts_model = TTS(language='EN', device=device)
speaker_id = tts_model.hps.data.spk2id['EN-US']
os.makedirs('tmp', exist_ok=True)  # make sure the output directory exists
src_path = 'tmp/output_base.wav'
tts_model.tts_to_file(
    text="Today we will discuss advances in voice cloning technology.",
    speaker_id=speaker_id,
    output_path=src_path,
    speed=1.0
)
# Extract tone color from reference speaker
target_se, _ = se_extractor.get_se(
    'reference.wav', tone_color_converter,
    target_dir='processed', vad=False
)
source_se = torch.load('checkpoints_v2/base_speakers/ses/en-us.pth', map_location=device)
# Apply tone color conversion
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path='output_cloned.wav'
)
print("Voice clone complete: output_cloned.wav")
5. Speaker Diarization
Speaker diarization answers the question "who spoke when". It is essential for meeting transcription, interview analysis, and multi-speaker subtitle generation.
import torch
from pyannote.audio import Pipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
pipeline.to(device)
def diarize_audio(audio_path, num_speakers=None):
    kwargs = {"num_speakers": num_speakers} if num_speakers else {}
    diarization = pipeline(audio_path, **kwargs)
    timeline = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        entry = {
            'start': turn.start,
            'end': turn.end,
            'speaker': speaker,
            'duration': turn.end - turn.start
        }
        timeline.append(entry)
        print(f"  [{turn.start:.2f}s - {turn.end:.2f}s] {speaker}")
    return timeline, diarization
# Combined diarization + Whisper transcription
def diarize_and_transcribe(audio_path, hf_token):
    from faster_whisper import WhisperModel
    whisper_model = WhisperModel("medium", device="cuda", compute_type="float16")
    segments, _ = whisper_model.transcribe(audio_path, word_timestamps=True)
    segments = list(segments)
    diarize_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarize_pipeline.to(torch.device("cuda"))
    diarization = diarize_pipeline(audio_path)
    def get_speaker(start, end):
        # Pick the speaker whose turn overlaps the segment the most
        best_speaker, best_overlap = "Unknown", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, speaker
        return best_speaker
    result = []
    for segment in segments:
        speaker = get_speaker(segment.start, segment.end)
        entry = {
            'start': segment.start,
            'end': segment.end,
            'speaker': speaker,
            'text': segment.text.strip()
        }
        result.append(entry)
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {speaker}: {segment.text.strip()}")
    return result
timeline = diarize_and_transcribe("meeting.wav", "YOUR_HF_TOKEN")
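The speaker-assignment rule used when merging transcripts with diarization, choosing the speaker turn with the largest temporal overlap, can be tried in isolation without any models. The sketch below uses plain tuples with made-up timestamps; `assign_speaker` is an illustrative name.

```python
def assign_speaker(seg_start, seg_end, turns):
    """Return the speaker whose turn overlaps [seg_start, seg_end] the most.

    turns is a list of (turn_start, turn_end, speaker) tuples.
    """
    best_speaker, best_overlap = "Unknown", 0.0
    for turn_start, turn_end, speaker in turns:
        # Negative values mean the intervals do not intersect
        overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    return best_speaker

# Hypothetical diarization output
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]

print(assign_speaker(0.5, 2.0, turns))    # fully inside the first turn
print(assign_speaker(3.0, 6.0, turns))    # straddles both turns
print(assign_speaker(10.0, 12.0, turns))  # outside every turn
```

A segment that straddles a speaker change is attributed entirely to the majority speaker. Assigning speakers per word, using word-level timestamps, is one way to reduce that error.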
6. Speech Emotion Recognition
import numpy as np
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
model_name = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()
# Label order must match the checkpoint; verify against model.config.id2label
EMOTIONS = ['angry', 'calm', 'disgust', 'fearful',
            'happy', 'neutral', 'sad', 'surprised']
def predict_emotion(audio_path):
    speech, sr = torchaudio.load(audio_path)  # shape: (channels, samples)
    if sr != 16000:
        speech = torchaudio.transforms.Resample(sr, 16000)(speech)
    inputs = feature_extractor(
        speech.mean(dim=0).numpy(),  # downmix to mono, 1-D array
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    top_probs, top_idx = torch.topk(probs, 3)
    print("Top-3 emotions:")
    for idx, prob in zip(top_idx[0], top_probs[0]):
        print(f"  {EMOTIONS[idx.item()]}: {prob.item():.4f}")
    predicted = EMOTIONS[probs.argmax().item()]
    print(f"Predicted: {predicted}")
    return predicted, dict(zip(EMOTIONS, probs[0].tolist()))
emotion, probs = predict_emotion('speech.wav')
# Feature-based emotion analysis
def analyze_emotion_features(audio_path):
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zero_crossing = librosa.feature.zero_crossing_rate(y=y)
    rms = librosa.feature.rms(y=y)
    pitch, mag = librosa.piptrack(y=y, sr=sr)  # computed for inspection; not included in the vector below
    features = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.std(spectral_centroid)],
        [np.mean(spectral_rolloff)],
        [np.mean(zero_crossing)],
        [np.mean(rms)],
        [np.std(rms)]
    ])
    print(f"Feature vector dim: {features.shape}")
    return features
7. Music AI
MusicGen (Meta)
Meta AI's MusicGen generates music from text prompts, conditioning on descriptions of genre, instruments, mood, and tempo.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(
    duration=30,
    temperature=1.0,
    top_k=250,
    top_p=0.0,
    cfg_coef=3.0,
)
descriptions = [
    "upbeat electronic dance music with synthesizers and strong bass",
    "peaceful classical piano music with violin, gentle and romantic",
    "intense rock music with electric guitar and drums"
]
print("Generating music...")
wav = model.generate(descriptions)
for idx, (desc, one_wav) in enumerate(zip(descriptions, wav)):
    filename = f'generated_music_{idx}'
    # audio_write appends the file extension itself
    audio_write(
        filename, one_wav.cpu(), model.sample_rate,
        strategy="loudness", loudness_compressor=True
    )
    print(f"Saved: {filename}.wav: '{desc}'")
# Melody-conditioned generation
model_melody = MusicGen.get_pretrained('facebook/musicgen-melody')
model_melody.set_generation_params(duration=15)
melody, melody_sr = torchaudio.load('humming.wav')
wav_melody = model_melody.generate_with_chroma(
    descriptions=["full orchestral arrangement, epic and cinematic"],
    melody_wavs=melody.unsqueeze(0),
    melody_sample_rate=melody_sr
)
audio_write('melody_based', wav_melody[0].cpu(), model_melody.sample_rate,
            strategy="loudness")
print("Melody-conditioned generation complete.")
Music Genre Classification
import numpy as np
import librosa
import torch
import torch.nn as nn
GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']
def extract_music_features(audio_path, sr=22050, duration=30):
    y, _ = librosa.load(audio_path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    # Total: 20 + 20 + 4 + 2 + 12 + 1 = 59 features
    feature_vec = np.concatenate([
        np.mean(mfcc, axis=1),
        np.std(mfcc, axis=1),
        [np.mean(spectral_centroid)],
        [np.mean(spectral_bandwidth)],
        [np.mean(spectral_rolloff)],
        [np.mean(spectral_contrast)],
        [np.mean(zcr)],
        [np.std(zcr)],
        np.mean(chroma, axis=1),
        [float(np.atleast_1d(tempo)[0])]  # tempo may be a scalar or a 1-element array
    ])
    return feature_vec
class GenreClassifier(nn.Module):
    def __init__(self, input_dim=59, num_classes=10):  # 59 matches extract_music_features
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes)
        )
    def forward(self, x):
        return self.net(x)
def classify_genre(audio_path, model):
    model.eval()  # BatchNorm and Dropout must be in inference mode for a single sample
    feat = extract_music_features(audio_path)
    x = torch.FloatTensor(feat).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    top_prob, top_idx = torch.topk(probs, 3)
    print("Top-3 genres:")
    for prob, idx in zip(top_prob, top_idx):
        print(f"  {GENRES[idx.item()]}: {prob.item():.4f}")
    return GENRES[probs.argmax().item()]
8. Practical Audio AI Projects
Project 1: Real-time Subtitle System
import queue
import threading
import time
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
class RealtimeSubtitleSystem:
    def __init__(self, model_size="base", language="en"):
        print(f"Loading Whisper {model_size}...")
        self.model = WhisperModel(model_size, device="cuda", compute_type="float16")
        self.language = language
        self.audio_queue = queue.Queue()
        self.is_running = False
        self.sample_rate = 16000
        self.chunk_secs = 3
    def audio_callback(self, indata, frames, time_info, status):
        # Called by sounddevice on its audio thread for every captured block
        if status:
            print(f"Audio status: {status}")
        self.audio_queue.put(indata.copy())
    def transcription_worker(self):
        buffer = np.array([], dtype=np.float32)
        chunk_size = self.sample_rate * self.chunk_secs
        while self.is_running or not self.audio_queue.empty():
            try:
                chunk = self.audio_queue.get(timeout=0.1)
                buffer = np.append(buffer, chunk.flatten())
                if len(buffer) >= chunk_size:
                    audio_data = buffer[:chunk_size]
                    buffer = buffer[chunk_size // 2:]  # 50% overlap
                    segments, _ = self.model.transcribe(
                        audio_data, language=self.language,
                        vad_filter=True,
                        vad_parameters=dict(min_silence_duration_ms=300)
                    )
                    for seg in segments:
                        text = seg.text.strip()
                        if text:
                            ts = time.strftime("%H:%M:%S")
                            print(f"[{ts}] {text}")
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Transcription error: {e}")
    def start(self):
        self.is_running = True
        t = threading.Thread(target=self.transcription_worker, daemon=True)
        t.start()
        print("Real-time subtitle system running (Ctrl+C to stop)")
        print("-" * 50)
        with sd.InputStream(
            samplerate=self.sample_rate, channels=1,
            dtype=np.float32, callback=self.audio_callback,
            blocksize=int(self.sample_rate * 0.1)
        ):
            try:
                while True:
                    time.sleep(0.1)
            except KeyboardInterrupt:
                print("\nStopping...")
        # Stream is closed here; let the worker drain the queue and exit
        self.is_running = False
        t.join()
        print("System stopped.")
system = RealtimeSubtitleSystem(model_size="small", language="en")
system.start()
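The 50%-overlap buffering used by the subtitle system can be examined on its own. The sketch below splits a signal into fixed-size chunks whose start points advance by half a chunk. It uses silent audio as a placeholder, and `overlapping_chunks` is an illustrative helper.

```python
import numpy as np

def overlapping_chunks(samples, chunk_size, hop_size):
    """Split a 1-D signal into full chunks of chunk_size, advancing by hop_size each time."""
    chunks = []
    start = 0
    while start + chunk_size <= len(samples):
        chunks.append(samples[start:start + chunk_size])
        start += hop_size
    return chunks

sr = 16000
audio = np.zeros(sr * 6, dtype=np.float32)   # 6 seconds of placeholder audio

# 3-second chunks with a 1.5-second hop (50% overlap)
chunks = overlapping_chunks(audio, chunk_size=3 * sr, hop_size=3 * sr // 2)
starts = [i * 1.5 for i in range(len(chunks))]
print(len(chunks), "chunks starting at", starts, "seconds")
```

Overlap reduces the chance of cutting a word in half at a chunk boundary, at the cost of transcribing some audio twice. A production system would typically de-duplicate the repeated text, for example by comparing timestamps across consecutive chunks.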
Project 2: Voice Chatbot
import asyncio
import os
import subprocess
import tempfile
import time
import edge_tts
import numpy as np
import openai
import sounddevice as sd
import soundfile as sf
from faster_whisper import WhisperModel
class VoiceChatbot:
    def __init__(self):
        self.client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
        self.whisper = WhisperModel("base", device="cpu", compute_type="int8")
        self.sample_rate = 16000
        self.history = []
        self.system_prompt = (
            "You are a helpful and knowledgeable AI assistant. "
            "Respond concisely and clearly in English."
        )
    def record_audio(self, duration=5):
        print(f"Recording... ({duration} s)")
        audio = sd.rec(int(duration * self.sample_rate),
                       samplerate=self.sample_rate, channels=1, dtype=np.float32)
        sd.wait()
        print("Done recording.")
        return audio.flatten()
    def speech_to_text(self, audio):
        # Reserve a temp file name, close it, then write (an open handle blocks writing on Windows)
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            tmp = f.name
        sf.write(tmp, audio, self.sample_rate)
        try:
            segs, _ = self.whisper.transcribe(tmp, language="en", vad_filter=True)
            text = " ".join(s.text.strip() for s in segs)
        finally:
            os.unlink(tmp)
        return text.strip()
    def chat(self, user_text):
        self.history.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": self.system_prompt}] + self.history,
            max_tokens=300,
            temperature=0.7
        )
        assistant_text = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": assistant_text})
        # Keep only the 20 most recent messages to bound the prompt size
        if len(self.history) > 20:
            self.history = self.history[-20:]
        return assistant_text
    async def text_to_speech(self, text, output_path='response.mp3'):
        communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural", rate="+10%")
        await communicate.save(output_path)
        return output_path
    def play_audio(self, path):
        if os.name == 'nt':
            os.startfile(path)
        elif hasattr(os, 'uname') and os.uname().sysname == 'Darwin':
            subprocess.run(['afplay', path])
        else:
            subprocess.run(['mpg123', path])
    def run(self):
        print("Voice chatbot started!")
        print("Say 'quit' or 'exit' to stop.")
        print("=" * 50)
        while True:
            audio = self.record_audio(duration=5)
            user_text = self.speech_to_text(audio)
            if not user_text:
                print("Could not understand audio. Please try again.")
                continue
            print(f"You: {user_text}")
            if any(w in user_text.lower() for w in ['quit', 'exit', 'stop', 'bye']):
                print("Goodbye!")
                break
            response = self.chat(user_text)
            print(f"AI: {response}")
            audio_path = asyncio.run(self.text_to_speech(response))
            self.play_audio(audio_path)
            time.sleep(0.5)
chatbot = VoiceChatbot()
chatbot.run()
Speech & Audio AI Learning Roadmap
**Beginner**
1. Audio feature extraction practice with librosa
2. Build a transcription prototype with the Whisper API
3. Create TTS applications with Edge TTS
**Intermediate**
1. Fine-tune Wav2Vec 2.0 for custom ASR
2. Build a speaker diarization pipeline with pyannote.audio
3. Implement a real-time voice chatbot
**Advanced**
1. Train a custom TTS model with VITS or XTTS
2. Build music generation applications with MusicGen
3. Combine emotion recognition with conversational AI
References
- librosa documentation: https://librosa.org/doc/latest/
- OpenAI Whisper: https://openai.com/research/whisper
- Wav2Vec 2.0 (arXiv 2006.11477): https://arxiv.org/abs/2006.11477
- Hugging Face Audio Course: https://huggingface.co/learn/audio-course/
- Coqui TTS: https://github.com/coqui-ai/TTS
- pyannote.audio: https://github.com/pyannote/pyannote-audio
- faster-whisper: https://github.com/SYSTRAN/faster-whisper
- AudioCraft (MusicGen): https://github.com/facebookresearch/audiocraft