Web Audio API & Browser Audio 2026 — AudioWorklet / Tone.js / Howler.js / Wavesurfer / Peaks.js / Meyda / Faust / Csound / Web Speech Deep Dive

Prologue — The Year the Browser Became an Audio Workstation

As of May 2026, the phrase "audio in the browser" carries a completely different weight than it did five years ago.

  • Web Audio API 1.0 has been a W3C Recommendation since June 2021, and work on the follow-up specification continues.
  • AudioWorklet is GA in every major browser (Chrome, Edge, Firefox, Safari), and ScriptProcessor has effectively been declared dead.
  • Tone.js reached v15, and the project Yotam Mann started has become the de facto standard for music programming.
  • Howler.js is one of the most widely used audio libraries for HTML5 games.
  • Analysis and visualization libraries like WaveSurfer.js v8, Peaks.js (BBC), and Meyda make DAW-class UIs possible.
  • Faust2WebAudio and Csound on WebAssembly brought academic DSP into the browser.
  • WebRTC plus Opus plus WebTransport delivers realtime voice collaboration with end-to-end latency that can drop under 50 ms on a good network.
  • Web Speech API SpeechRecognition and SpeechSynthesis run reliably on both desktop and mobile.
  • Korean speech is served by Naver Clova TTS/STT and Kakao i Voice; Japanese has settled on VOICEVOX and Coeiroink for both local and browser use.

This article draws the whole map in one pass: which tool sits where, what disappeared and what survived, and what you should pick if you start a new project in 2026.


1. The 2026 Web Audio Map — Four Layers

The ecosystem splits cleanly into four layers.

[ Layer 1 ] Specs        — Web Audio API (W3C), Web Speech API
[ Layer 2 ] Low-level    — AudioContext, AudioWorklet, OfflineAudioContext
[ Layer 3 ] Libraries    — Tone.js, Howler.js, Pizzicato.js, WaveSurfer.js, Peaks.js, Meyda
[ Layer 4 ] DSL/Engines  — Faust2WebAudio, Csound on Web, Web MIDI synths
[ Side axis ] Realtime/AI — WebRTC + Opus, WebTransport, Coqui XTTS, Suno API, ElevenLabs API

The layers are ordered by level of abstraction. Layer 1 is the spec and is fixed, Layer 2 ships in every browser, Layer 3 is the library you pick, and Layer 4 is the DSP expert's domain.

| Domain | Recommended tool |
| --- | --- |
| Game audio | Howler.js |
| Music, instruments, sequencers | Tone.js |
| DAW UI waveform | WaveSurfer.js |
| Broadcast segment editor | Peaks.js |
| Feature extraction (BPM, MFCC) | Meyda |
| Academic DSP / synths | Faust2WebAudio, Csound on Web |
| Video conferencing / voice chat | WebRTC + Opus |
| Browser TTS | Web Speech API plus external API |
| Korean TTS | Naver Clova, Kakao i Voice |
| Japanese TTS | VOICEVOX, Coeiroink |

2. Web Audio API (W3C) — GA in Every Browser

The core of the Web Audio API is graph-based audio routing. You create nodes in JavaScript and connect them, and audio processing runs on the browser's separate render thread.

const ctx = new AudioContext()

// Create nodes
const osc = ctx.createOscillator()
const gain = ctx.createGain()

// Wire the graph: osc -> gain -> output
osc.connect(gain)
gain.connect(ctx.destination)

// Set parameters
osc.frequency.value = 440 // A4
gain.gain.value = 0.2

// Play
osc.start()
osc.stop(ctx.currentTime + 1)

Two key 2026 changes:

  1. AudioParam automation curves: setValueCurveAtTime interpolates identically across every browser. No more polyfills.
  2. AudioRenderCapacity: measure the render thread load directly and dynamically adjust your node count (Chrome 121+, Safari 18+).
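
The curve handed to setValueCurveAtTime is just a Float32Array of values that the browser interpolates over the given duration. As a minimal sketch, the helper below builds an equal-power fade-in curve. The function name `equalPowerFadeIn` and the sine shape are choices made for this illustration, not something the spec prescribes.

```javascript
// Build an equal-power fade-in curve for AudioParam.setValueCurveAtTime().
// `points` is the curve resolution; the spec requires at least 2 values.
function equalPowerFadeIn(points) {
  if (!Number.isInteger(points) || points < 2) {
    throw new RangeError('points must be an integer >= 2')
  }
  const curve = new Float32Array(points)
  for (let i = 0; i < points; i++) {
    const t = i / (points - 1) // normalized position, 0 .. 1
    curve[i] = Math.sin((t * Math.PI) / 2) // 0 at the start, 1 at the end
  }
  return curve
}

// Browser usage, assuming an existing AudioContext `ctx` and GainNode `gain`:
// gain.gain.setValueCurveAtTime(equalPowerFadeIn(64), ctx.currentTime, 0.5)
```

A sine-shaped ramp keeps perceived loudness steadier than a linear one during crossfades, which is why it is a common default.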

The spec's surface area is small: about 20 node types in total, and almost any audio task is a combination of those nodes.

| Category | Representative nodes |
| --- | --- |
| Sources | OscillatorNode, AudioBufferSourceNode, MediaElementAudioSourceNode, MediaStreamAudioSourceNode |
| Effects | GainNode, BiquadFilterNode, DelayNode, ConvolverNode, DynamicsCompressorNode, WaveShaperNode |
| Analysis | AnalyserNode |
| Spatial | PannerNode, StereoPannerNode, AudioListener |
| Custom | AudioWorkletNode |
| Output | AudioDestinationNode, MediaStreamAudioDestinationNode |

3. AudioContext + OfflineAudioContext + MediaStream / MediaElement

Three contexts, each for a different use case.

AudioContext — Realtime

The default. Output goes to speakers or headphones. A user gesture (click, key press) is required before the first sound plays. Mobile Safari is especially strict about this.

OfflineAudioContext — Non-realtime Rendering

Renders audio faster than realtime. Used for WAV export, offline analysis, mastering previews.

// 44.1 kHz, stereo, 10-second buffer
const offline = new OfflineAudioContext(2, 44100 * 10, 44100)

const osc = offline.createOscillator()
osc.connect(offline.destination)
osc.start()
osc.stop(10)

const rendered = await offline.startRendering()
// rendered is an AudioBuffer ready to encode as WAV
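
To make the "ready to encode as WAV" step concrete, here is a minimal sketch of a 16-bit PCM WAV encoder. The function name `encodeWav` is made up for this example, and it assumes planar Float32 channel data of equal length, which is what AudioBuffer.getChannelData() returns.

```javascript
// Encode planar Float32 channel data as a 16-bit PCM WAV file (ArrayBuffer).
function encodeWav(channels, sampleRate) {
  const numChannels = channels.length
  const numFrames = channels[0].length
  const blockAlign = numChannels * 2 // bytes per frame at 16 bits per sample
  const dataSize = numFrames * blockAlign
  const buffer = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buffer)

  const writeString = (offset, text) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i))
  }

  writeString(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true) // file size minus the first 8 bytes
  writeString(8, 'WAVE')
  writeString(12, 'fmt ')
  view.setUint32(16, 16, true) // fmt chunk size
  view.setUint16(20, 1, true) // format 1 = uncompressed PCM
  view.setUint16(22, numChannels, true)
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * blockAlign, true) // byte rate
  view.setUint16(32, blockAlign, true)
  view.setUint16(34, 16, true) // bits per sample
  writeString(36, 'data')
  view.setUint32(40, dataSize, true)

  // Interleave the channels and scale [-1, 1] floats to signed 16-bit integers.
  let offset = 44
  for (let frame = 0; frame < numFrames; frame++) {
    for (let ch = 0; ch < numChannels; ch++) {
      const s = Math.max(-1, Math.min(1, channels[ch][frame]))
      view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true)
      offset += 2
    }
  }
  return buffer
}

// Browser usage with the `rendered` AudioBuffer from above:
// const wav = encodeWav(
//   [rendered.getChannelData(0), rendered.getChannelData(1)],
//   rendered.sampleRate
// )
// const url = URL.createObjectURL(new Blob([wav], { type: 'audio/wav' }))
```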

MediaElementAudioSourceNode — Wire an audio or video tag into the graph

Turns an HTML5 audio or video element into a node. Does not work with iframe-embedded videos (CORS, autoplay policy).

const audioEl = document.querySelector('audio')
const ctx = new AudioContext()
const source = ctx.createMediaElementSource(audioEl)
source.connect(ctx.destination)
// Playing audioEl now routes through the ctx graph

MediaStreamAudioSourceNode — getUserMedia Microphone

Puts the mic into the graph. The same node is used for WebRTC input.

const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
const ctx = new AudioContext()
const mic = ctx.createMediaStreamSource(stream)
const analyser = ctx.createAnalyser()
mic.connect(analyser)
// Pull realtime spectrum via analyser.getByteFrequencyData()
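
Once getByteFrequencyData() has filled a Uint8Array, each index is a frequency bin of width sampleRate / fftSize. The sketch below maps the loudest bin back to hertz. `dominantFrequency` is a hypothetical helper name, and the result is only as precise as the bin width.

```javascript
// Map the loudest bin of an AnalyserNode byte spectrum to a frequency in Hz.
// bins.length equals fftSize / 2, and bin i is centered near i * sampleRate / fftSize.
function dominantFrequency(bins, sampleRate) {
  const fftSize = bins.length * 2
  let peak = 0
  for (let i = 1; i < bins.length; i++) {
    if (bins[i] > bins[peak]) peak = i
  }
  return (peak * sampleRate) / fftSize
}

// Browser usage, assuming the `analyser` and `ctx` from above:
// const bins = new Uint8Array(analyser.frequencyBinCount)
// analyser.getByteFrequencyData(bins)
// const hz = dominantFrequency(bins, ctx.sampleRate)
```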

4. AudioWorklet — The Grave of ScriptProcessor

This is the single biggest shift in 2026.

Before: ScriptProcessorNode ran on the main thread. Buffer sizes from 256 to 16384 samples meant high latency, and any GC pause produced clicks (dropouts).

Now: AudioWorklet is a separate global scope that runs on the audio render thread itself. It is decoupled from the main thread and computes in 128-sample blocks (about 2.7 ms at 48 kHz, 2.9 ms at 44.1 kHz).

// worklet.js — runs on the render thread
class WhiteNoiseProcessor extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    const output = outputs[0]
    for (let channel = 0; channel < output.length; channel++) {
      const ch = output[channel]
      for (let i = 0; i < ch.length; i++) {
        ch[i] = Math.random() * 2 - 1
      }
    }
    return true // keep the node alive
  }
}

registerProcessor('white-noise', WhiteNoiseProcessor)

// main.js — register then use from the main thread
const ctx = new AudioContext()
await ctx.audioWorklet.addModule('worklet.js')
const noise = new AudioWorkletNode(ctx, 'white-noise')
noise.connect(ctx.destination)
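
The latency difference between the two APIs is plain arithmetic: one block lasts frames divided by sample rate. A small sketch (the helper name `blockDurationMs` is invented for this example) puts numbers on it. A 128-frame worklet block is roughly 2.7 ms at 48 kHz, while the largest ScriptProcessorNode buffer of 16384 frames is roughly 371.5 ms at 44.1 kHz.

```javascript
// Duration of one processing block in milliseconds.
function blockDurationMs(frames, sampleRate) {
  return (frames / sampleRate) * 1000
}

// Compare the AudioWorklet block with the ScriptProcessorNode buffer range.
const blockComparison = [128, 256, 16384].map((frames) => ({
  frames,
  msAt48k: blockDurationMs(frames, 48000),
  msAt44k: blockDurationMs(frames, 44100),
}))
console.log(blockComparison)
```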

The real power of AudioWorklet is combination with WebAssembly. DSP code written in Rust or C++ can be built to WASM and run inside a worklet for near-native performance. Both Faust2WebAudio and Csound on Web use this pattern internally.

ScriptProcessorNode is stamped DEPRECATED in the spec. Never use it for new code.


5. Tone.js (Yotam Mann) — The De Facto Music Programming Standard

Tone.js started in 2014, created by Yotam Mann, and provides abstractions for the music domain — notes, beats, synths, effect chains.

import * as Tone from 'tone'

// Polyphonic synth plus reverb
const reverb = new Tone.Reverb({ decay: 4 }).toDestination()
const synth = new Tone.PolySynth(Tone.AMSynth).connect(reverb)

// Sequencer: an 8th-note loop (the '8n' subdivision below)
const seq = new Tone.Sequence(
  (time, note) => {
    synth.triggerAttackRelease(note, '8n', time)
  },
  ['C4', 'E4', 'G4', 'B4', 'C5', 'B4', 'G4', 'E4'],
  '8n'
)

await Tone.start() // After a user gesture
Tone.Transport.bpm.value = 120
Tone.Transport.start()
seq.start(0)

Key components:

| Category | Modules |
| --- | --- |
| Sources | Oscillator, Synth, AMSynth, FMSynth, MonoSynth, PolySynth, Sampler, Player |
| Effects | Reverb, Delay, FeedbackDelay, Chorus, Distortion, Phaser, AutoFilter, Compressor |
| Time | Transport (global BPM clock), Sequence, Loop, Pattern, Part |
| Signal | Signal, Param, LFO, Envelope, ScaledEnvelope |

Tone.js accepts pitches as scientific-notation strings like 'C4', and helpers such as Tone.Frequency and Tone.Midi convert between note names, hertz, and MIDI numbers. That makes it one of the most intuitive APIs for handling music theory in code.
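
To show what a note string and a subdivision like '8n' stand for numerically, here is a sketch of the underlying arithmetic. It is not Tone.js's implementation. The helper names are hypothetical, and it assumes equal temperament with A4 = 440 Hz and a 4/4 meter.

```javascript
const SEMITONES_FROM_C = { C: 0, D: 2, E: 4, F: 5, G: 7, A: 9, B: 11 }

// Convert scientific pitch notation ('C4', 'F#3', 'Bb5') to Hz.
function noteToFrequency(note) {
  const match = /^([A-G])(#|b)?(-?\d+)$/.exec(note)
  if (!match) throw new Error(`Unrecognized note: ${note}`)
  const [, letter, accidental, octave] = match
  const shift = accidental === '#' ? 1 : accidental === 'b' ? -1 : 0
  // MIDI convention: C-1 is note 0, so A4 is note 69.
  const midi = (Number(octave) + 1) * 12 + SEMITONES_FROM_C[letter] + shift
  return 440 * 2 ** ((midi - 69) / 12)
}

// Length in seconds of a subdivision (8 = eighth note) at a given tempo.
function subdivisionSeconds(division, bpm) {
  const quarterNote = 60 / bpm
  return quarterNote * (4 / division)
}
```

At 120 BPM an eighth note lasts 0.25 s, so the eight-note sequence above loops every 2 seconds.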


6. Howler.js — The Absolute Standard for Game Audio

Howler.js is the library for people who want to say "I do not care about graphs — I just want to play BGM and sound effects."

import { Howl, Howler } from 'howler'

const sfx = new Howl({
  src: ['shoot.webm', 'shoot.mp3'],
  volume: 0.5,
  rate: 1.2, // 1.2x playback rate; with Web Audio playback this also raises the pitch
  sprite: {
    shoot1: [0, 200],
    shoot2: [300, 200],
  },
})

document.getElementById('fire').addEventListener('click', () => {
  sfx.play('shoot1')
})

Howler.volume(0.8) // Global master volume

Howler.js wins:

  • One object auto-manages multiple instances (firing the same effect rapidly is pooled for you)
  • Fade in/out, seek, sprites, 3D panning, and global mute share a consistent API
  • Codec auto-fallback — falls back from WebM Opus to MP3
  • Mobile autoplay-policy handling baked in
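
A sprite entry is an [offset, duration] pair in milliseconds into one combined file. As a rough sketch of how such a map could be generated when clips are packed back to back, the helper below lays them out with a fixed gap of silence. `buildSpriteMap` and the `{ name, durationMs }` input shape are assumptions for this example, not part of Howler.js.

```javascript
// Lay clips out back to back and return a Howler-style sprite map:
// { name: [offsetMs, durationMs] }. `gapMs` is the silence between clips.
function buildSpriteMap(clips, gapMs = 100) {
  const sprite = {}
  let offset = 0
  for (const { name, durationMs } of clips) {
    sprite[name] = [offset, durationMs]
    offset += durationMs + gapMs
  }
  return sprite
}

// Reproduces the layout used in the Howl example above.
const shootSprites = buildSpriteMap([
  { name: 'shoot1', durationMs: 200 },
  { name: 'shoot2', durationMs: 200 },
])
```

The gap matters in practice because compressed formats can bleed a few milliseconds across clip boundaries.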

Howler.js is commonly paired with HTML5 game frameworks and renderers that ship without a sound layer of their own, such as Pixi.js, or used as a replacement for a built-in one.


7. Pizzicato.js — A Younger, More Declarative Option

Pizzicato makes the audio node graph feel object-oriented. Thinner than Tone.js, more graph-friendly than Howler.js.

import Pizzicato from 'pizzicato'

const sound = new Pizzicato.Sound({
  source: 'wave',
  options: { type: 'square', frequency: 220 },
})

const delay = new Pizzicato.Effects.Delay({
  feedback: 0.5,
  time: 0.3,
  mix: 0.5,
})

sound.addEffect(delay)
sound.play()

Pros: simple API. Cons: smaller community than Tone.js, so resources thin out as you go deep. Great for experiments and teaching.


8. WaveSurfer.js — The Canonical Waveform Visualizer

WaveSurfer.js v8 is the de facto standard for DAW-style waveform displays. Hand it an audio file, it paints a waveform onto a canvas, and plugins provide playhead, region selection, zoom, and minimap.

import WaveSurfer from 'wavesurfer.js'
import RegionsPlugin from 'wavesurfer.js/dist/plugins/regions.esm.js'

const ws = WaveSurfer.create({
  container: '#wave',
  waveColor: '#888',
  progressColor: '#4af',
  height: 120,
  url: '/audio/example.mp3',
})

const regions = ws.registerPlugin(RegionsPlugin.create())

ws.on('ready', () => {
  regions.addRegion({
    start: 4,
    end: 8,
    color: 'rgba(0,255,0,0.2)',
    drag: true,
    resize: true,
  })
})

What changed in v8: rendering moved to OffscreenCanvas plus Web Worker, so hour-long audio renders without freezing the main thread. Smooth even on mobile.


9. Peaks.js (BBC R&D) — The Broadcast Segment Editor

Peaks.js comes from BBC R&D. It handles radio and podcast editing tasks — speaker separation, ad break marking, chapter markers.

What makes it different:

  • Two-tier zoomable view: minimap on top, zoomed detail view below
  • Pre-computed wave data format (.dat): compute waveforms on the server in advance, and the client renders them instantly. No need to decode hour-long audio on the client.
  • Point and segment APIs: drop markers at a time position, attach labels, drag to move

import Peaks from 'peaks.js'

const options = {
  zoomview: { container: document.getElementById('zoomview') },
  overview: { container: document.getElementById('overview') },
  mediaElement: document.querySelector('audio'),
  dataUri: { json: '/audio/example.json' }, // pre-computed waveform data
}

Peaks.init(options, (err, peaks) => {
  if (err) {
    console.error('Failed to initialize Peaks.js', err)
    return
  }
  peaks.segments.add({ startTime: 10, endTime: 20, labelText: 'Ad break' })
  peaks.points.add({ time: 15.5, labelText: 'Chapter 2' })
})
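
Pre-computed waveform data boils down to storing one minimum and one maximum per block of samples, so the client draws a few thousand pairs instead of decoding millions of samples. The sketch below shows that reduction only. `computePeaks` is a made-up name, and the actual .dat and JSON layouts produced by BBC's audiowaveform tool have their own schema that is not reproduced here.

```javascript
// Reduce raw samples to one [min, max] pair per block of `samplesPerPixel` samples.
function computePeaks(samples, samplesPerPixel) {
  if (!Number.isInteger(samplesPerPixel) || samplesPerPixel < 1) {
    throw new RangeError('samplesPerPixel must be a positive integer')
  }
  const peaks = []
  for (let start = 0; start < samples.length; start += samplesPerPixel) {
    const end = Math.min(start + samplesPerPixel, samples.length)
    let min = Infinity
    let max = -Infinity
    for (let i = start; i < end; i++) {
      if (samples[i] < min) min = samples[i]
      if (samples[i] > max) max = samples[i]
    }
    peaks.push([min, max])
  }
  return peaks
}
```

At 512 samples per pixel, an hour of 44.1 kHz mono audio shrinks from about 159 million samples to roughly 310,000 pairs.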

Optimized for broadcaster workflows. Services like BBC Sounds use it internally.


10. Meyda — Audio Feature Extraction

Meyda extracts features from audio in realtime. MFCC, energy, RMS, zero-crossing rate, spectral centroid, chroma, loudness — all there.

import Meyda from 'meyda'

const ctx = new AudioContext()
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
const source = ctx.createMediaStreamSource(stream)

const analyzer = Meyda.createMeydaAnalyzer({
  audioContext: ctx,
  source,
  bufferSize: 1024,
  featureExtractors: ['rms', 'spectralCentroid', 'mfcc', 'chroma'],
  callback: (features) => {
    console.log(features.rms, features.spectralCentroid)
  },
})

analyzer.start()

Use cases:

  • BPM estimation — realtime RMS peak detection
  • Voice activity detection (VAD) — RMS threshold plus spectral centroid
  • Instrument classification — MFCC plus a machine-learning model
  • Visualization — visualize chroma and you can see harmonic progression

Combined with an ML model, Meyda is a common choice for browser-side demos like genre classification or singing-voice recognition.
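
The RMS-threshold idea behind voice activity detection can be sketched without any library. The code below computes RMS as the square root of the mean of squared samples and adds a short hangover so brief pauses do not flip the state. The names `rms` and `createVad`, the 0.02 threshold, and the hangover length are all assumptions for illustration. A production VAD would also use spectral features such as the centroid mentioned above.

```javascript
// Root mean square of one frame of samples in the range [-1, 1].
function rms(frame) {
  let sum = 0
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i]
  return Math.sqrt(sum / frame.length)
}

// Returns a function that reports "speaking" while frames are loud, and for
// `hangoverFrames` quiet frames afterwards.
function createVad(threshold = 0.02, hangoverFrames = 5) {
  let quietFrames = Infinity // no loud frame seen yet
  return (frame) => {
    quietFrames = rms(frame) >= threshold ? 0 : quietFrames + 1
    return quietFrames <= hangoverFrames
  }
}

// With Meyda, the per-frame energy would come from `features.rms` in the
// analyzer callback instead of the local rms() helper.
```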


11. Faust2WebAudio — Faust DSL to Web Audio

Faust is a functional audio DSP language from INRIA / GRAME. You can define a synth, an effect, or a filter in a single line, and the compiler emits C++/Rust/JS/WASM/AudioWorklet output.

Faust code example (low-pass filter):

import("stdfaust.lib");
process = fi.lowpass(3, hslider("Cutoff", 1000, 50, 20000, 1));

Compile that with the faust2webaudiowasm tool and you get a JavaScript node built on AudioWorklet.

import { FaustAudioWorkletNode } from './lowpass.js'

const ctx = new AudioContext()
const node = await FaustAudioWorkletNode.create(ctx, 'lowpass')

// An element can feed only one MediaElementAudioSourceNode, so wire it once
document.querySelector('audio').addEventListener('play', (e) => {
  const src = ctx.createMediaElementSource(e.target)
  src.connect(node).connect(ctx.destination)
}, { once: true })

// Parameter automation
node.setParamValue('/Cutoff', 800)

Pros: expresses DSP algorithms mathematically and cleanly. Cons: steep learning curve. Without a signal-processing background the on-ramp is hard.


12. Csound on Web — WASM Csound

Csound is a computer music language that has been around since 1986. Ported to WebAssembly, it runs in the browser almost as-is.

import { Csound } from '@csound/browser'

const csound = await Csound()
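await csound.compileOrc(`
  nchnls = 2 ; stereo output, as used by outs
  0dbfs = 1  ; full scale is 1.0, so 0.3 is an audible amplitude
  instr 1
    iFreq = p4
    aSig oscili 0.3, iFreq
    outs aSig, aSig
  endin
`)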
await csound.readScore('i1 0 2 440\ni1 2 2 880')
await csound.start()
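
The score string passed to readScore is a list of "i" statements whose fields are instrument number, start time, duration, and then any extra p-fields (here p4, the frequency read by `iFreq = p4`). A small sketch of generating that text from data follows. `toScore` and the note object shape are hypothetical names for this example.

```javascript
// Turn note objects into Csound score "i" statements:
// i<instr> <start seconds> <duration seconds> <p4 frequency in Hz>
function toScore(notes) {
  return notes
    .map(({ instr, start, dur, freq }) => `i${instr} ${start} ${dur} ${freq}`)
    .join('\n')
}

// Produces the same two-note score as the example above.
const twoNoteScore = toScore([
  { instr: 1, start: 0, dur: 2, freq: 440 },
  { instr: 1, start: 2, dur: 2, freq: 880 },
])
```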

The significance of Csound on Web: four decades of academic, research, and experimental music assets run in the browser. Great for putting historic works in a digital museum or for teaching electronic music.

Internally it is AudioWorklet plus WASM. Decoupled from the main thread, so the UI stays smooth under load.


13. WebRTC + Opus + WebTransport — Realtime Audio

The core stack for video meetings, game voice chat, and live collaborative music.

WebRTC + Opus

WebRTC is the standard for P2P media streaming. The Opus codec efficiently encodes both speech and music across 6–510 kbps. On a good network path, end-to-end latency can stay below 50 ms.

const pc = new RTCPeerConnection()
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
stream.getTracks().forEach((t) => pc.addTrack(t, stream))

pc.ontrack = (e) => {
  const audio = new Audio()
  audio.srcObject = e.streams[0]
  audio.play()
}

// SDP offer/answer is exchanged through a separate signaling channel, e.g. WebSocket

WebTransport — The New Option

WebTransport (QUIC streams and datagrams over HTTP/3) is a low-latency channel that can replace WebSocket for latency-sensitive traffic. It is simpler than the WebRTC data channel and lets you explicitly choose reliable delivery (streams) or unreliable delivery (datagrams). It fits cases like game voice, where you accept packet loss in exchange for lower latency.

In 2026 it is GA in Chrome, Edge, and Firefox. Safari supports it from version 26.
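
Choosing unreliable delivery means the application has to notice missing audio frames itself. One common approach, sketched here under assumptions, is to prefix each encoded frame with a sequence number and count gaps on the receiving side. The 4-byte big-endian header and the names `packFrame`, `unpackFrame`, and `createLossTracker` are invented for this example and are not part of WebTransport. Sequence wraparound is not handled.

```javascript
// Prefix an encoded audio frame (Uint8Array) with a 4-byte sequence number.
function packFrame(seq, payload) {
  const out = new Uint8Array(4 + payload.length)
  new DataView(out.buffer).setUint32(0, seq >>> 0) // big-endian by default
  out.set(payload, 4)
  return out
}

// Split a received datagram back into sequence number and payload.
function unpackFrame(datagram) {
  const view = new DataView(datagram.buffer, datagram.byteOffset, datagram.byteLength)
  return { seq: view.getUint32(0), payload: datagram.subarray(4) }
}

// Count frames that never arrived so the receiver can conceal the gap
// (for example by repeating or fading the last frame) instead of waiting.
function createLossTracker() {
  let expected = null
  let lost = 0
  return {
    receive(seq) {
      if (expected !== null && seq > expected) lost += seq - expected
      if (expected === null || seq >= expected) expected = seq + 1
      return lost // late or duplicate frames leave the count unchanged
    },
  }
}
```

Each packed frame would then be written to the transport's datagram stream rather than to a reliable stream.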


14. AI Audio — Coqui XTTS, Suno API, ElevenLabs

In the browser there are two paths to AI voice generation.

Path 1: Call a Server API

ElevenLabs, OpenAI TTS, Suno (music generation) — all REST/streaming APIs.

const res = await fetch('https://api.elevenlabs.io/v1/text-to-speech/VOICE_ID/stream', {
  method: 'POST',
  headers: {
    'xi-api-key': API_KEY,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    text: 'Hello world',
    model_id: 'eleven_multilingual_v2',
  }),
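})
if (!res.ok) throw new Error(`TTS request failed: ${res.status}`)

// blob() buffers the whole response before playback starts
const audio = new Audio()
audio.src = URL.createObjectURL(await res.blob())
audio.play()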

With streaming (MediaSource Extensions plus Opus chunks) time-to-first-syllable is 200–300 ms.

Path 2: Client-side Local Inference

Coqui XTTS-v2 runs in the browser through Hugging Face Transformers.js on WebGPU. The initial model download (~500 MB) is heavy, but once loaded, synthesis happens with no external call.

import { pipeline } from '@huggingface/transformers'

const synth = await pipeline('text-to-speech', 'coqui-ai/xtts-v2-onnx', {
  device: 'webgpu',
})

const audio = await synth('Hello from the browser')
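// Copy the synthesized samples into an AudioBuffer and play it
const ctx = new AudioContext()
const buffer = ctx.createBuffer(1, audio.audio.length, audio.sampling_rate)
buffer.copyToChannel(audio.audio, 0)
const src = ctx.createBufferSource()
src.buffer = buffer
src.connect(ctx.destination)
src.start()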

Pick by: quality and instant response means API; privacy and offline means local.


15. Web Speech API — Built-in Recognition Plus Synthesis

The Web Speech API has two halves.

SpeechRecognition — Speech to Text

const SR = window.SpeechRecognition || window.webkitSpeechRecognition
const rec = new SR()
rec.lang = 'en-US'
rec.continuous = true
rec.interimResults = true

rec.onresult = (e) => {
  for (const res of e.results) {
    console.log(res[0].transcript, res.isFinal)
  }
}

rec.start()

Caveat: Chrome sends audio to Google's cloud service. If privacy matters, use Whisper.cpp (WASM) or Vosk locally.

SpeechSynthesis — Text to Speech

const utter = new SpeechSynthesisUtterance('Hello there')
utter.lang = 'en-US'
utter.rate = 1.0
utter.pitch = 1.0

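// getVoices() can return [] until the 'voiceschanged' event has fired
const voices = speechSynthesis.getVoices()
utter.voice = voices.find((v) => v.lang === 'en-US' && v.name.includes('Samantha')) ?? null // null uses the default voice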

speechSynthesis.speak(utter)

Uses the OS voice engine. macOS/iOS gets Apple's Neural Voice, Windows gets SAPI Voice such as Edge TTS, Android gets Google TTS. Quality varies a lot.
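
Because the available voices differ per OS, hard-coding one voice name is fragile. A hedged sketch of a fallback chain follows: preferred name, then exact language, then any voice sharing the base language. `pickVoice` is a hypothetical helper. It only reads the `name` and `lang` properties that SpeechSynthesisVoice exposes, and the underscore normalization is a defensive assumption for platforms that report tags like en_US.

```javascript
// Pick the best available voice for a language tag, or null for the default.
function pickVoice(voices, lang, preferredName = '') {
  const normalize = (tag) => tag.replace('_', '-').toLowerCase()
  const wanted = normalize(lang)
  const base = wanted.split('-')[0]
  const sameLang = voices.filter((v) => normalize(v.lang) === wanted)
  return (
    sameLang.find((v) => v.name.includes(preferredName)) ??
    sameLang[0] ??
    voices.find((v) => normalize(v.lang).split('-')[0] === base) ??
    null
  )
}

// Browser usage, assuming the `utter` object from above:
// utter.voice = pickVoice(speechSynthesis.getVoices(), 'en-US', 'Samantha')
```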


16. Korea — Clova TTS / Kakao i Voice

The Korean TTS market has two giants.

Clova TTS

Offered through Naver Cloud Platform (NCP). It provides about 50 voice characters (varied age, gender, accent), SSML support, and emotional tone control. Pricing is roughly 3 won per 1000 characters as of 2026; confirm exact pricing on the official page.

const res = await fetch('https://naveropenapi.apigw.ntruss.com/tts-premium/v1/tts', {
  method: 'POST',
  headers: {
    'X-NCP-APIGW-API-KEY-ID': CLIENT_ID,
    'X-NCP-APIGW-API-KEY': CLIENT_SECRET,
    'Content-Type': 'application/x-www-form-urlencoded',
  },
  body: new URLSearchParams({
    speaker: 'nara',
    text: 'Hello, this is Clova.',
    volume: '0',
    speed: '0',
    pitch: '0',
    format: 'mp3',
  }),
})
const blob = await res.blob()
new Audio(URL.createObjectURL(blob)).play()

Kakao i Voice

The voice synthesis side of Kakao i Open Builder. Fewer characters than Clova, but stronger integration with the Kakao ecosystem (KakaoWork, Kakao chatbots).

How to pick:

  • Variety of voice characters matters → Clova
  • Need Kakao chatbot / Kakao i integration → Kakao i Voice
  • Cost — get a usage-based quote either way

Calling these APIs directly from the browser usually fails on CORS and would expose your API secret, so put a backend proxy in front.


17. Japan — VOICEVOX, Coeiroink

Japan's TTS landscape is very different from Korea's: free, open-source engines with character voices dominate.

VOICEVOX

VOICEVOX is a free TTS engine developed by Hiroshiba (Hiho). Character voices like Zundamon and Shikoku Metan are widely used on YouTube, Niconico, and in games.

The main product is a desktop app, but a VOICEVOX Core (C++) plus WASM build exists for the browser.

import { VoicevoxCore } from 'voicevox-wasm'

const core = await VoicevoxCore.create()
await core.loadModel(1) // Zundamon

const audioQuery = await core.audioQuery('Hello', 1)
const wav = await core.synthesis(audioQuery, 1)

const ctx = new AudioContext()
const buf = await ctx.decodeAudioData(wav)
const src = ctx.createBufferSource()
src.buffer = buf
src.connect(ctx.destination)
src.start()

Coeiroink

Coeiroink is another free TTS. Through a system called MYCOEIROINK users can train their own voice. Many individual creators and YouTubers build their own voice with it.

How to pick:

  • Popular character voices (Zundamon, etc.) → VOICEVOX
  • Training and personalizing your own voice → Coeiroink

Each character has its own terms of service (commercial use, credit requirements). Read them.


18. Who Should Pick What

Cheat sheet.

| Goal | Recommended stack |
| --- | --- |
| HTML5 game BGM/SFX | Howler.js |
| Music sequencer / synth | Tone.js |
| DAW-style waveform UI | WaveSurfer.js plus Tone.js |
| Podcast / radio editing | Peaks.js |
| BPM and instrument-classification ML | Meyda plus TensorFlow.js |
| Academic DSP / synth research | Faust2WebAudio |
| Electronic music education | Csound on Web |
| Voice chat | WebRTC plus Opus |
| Low-latency live (game voice) | WebRTC or WebTransport |
| Built-in browser TTS | Web Speech API |
| High-quality English TTS | ElevenLabs API |
| Music generation | Suno API |
| Client-side TTS (offline) | Coqui XTTS via Transformers.js |
| Korean TTS (service) | Clova or Kakao i Voice |
| Japanese TTS (characters) | VOICEVOX |
| Japanese TTS (your own voice) | Coeiroink |

Three big decision axes:

  1. Realtime vs non-realtime — realtime is AudioContext plus AudioWorklet, non-realtime is OfflineAudioContext.
  2. Music vs game vs analysis — music is Tone.js, games are Howler.js, analysis is Meyda.
  3. Cloud API vs local inference — quality and simplicity means API, privacy and offline means a Transformers.js local model.

19. Wrap-up — The Browser Is an Audio Workstation Now

Five years ago doing audio in the browser was a toy-grade activity. The 2026 browser is different.

  • AudioWorklet enables native-level low-latency DSP.
  • WASM lets you run C/C++/Rust DSP algorithms as-is.
  • Tone.js, Howler.js, WaveSurfer.js, Peaks.js, and Meyda became de facto standards in their respective domains.
  • Faust2WebAudio and Csound on Web brought academic DSP into the browser.
  • WebRTC plus Opus plus WebTransport enables sub-50ms realtime collaboration.
  • Web Speech API plus Clova / Kakao / VOICEVOX / Coeiroink delivers natural multilingual voice.
  • Coqui XTTS, ElevenLabs, and Suno make AI audio immediately usable.

What's left is to pick your tools and build the work. May this article be a starting line for that choice.

