Real GPU Compute in the Browser — A Hands-On Guide to WebGPU Compute Shaders and WGSL in 2026

Prologue — The Web Finally Gets to Tell the GPU What to Do

For years we accepted one lie: "browsers can't do GPU compute." WebGL had shaders but they were shaders trapped inside the graphics pipeline. Running arbitrary data through arbitrary workgroups and reading the result back — the thing we'd taken for granted in CUDA — was not possible on the web. So ML inference ran on CPU via WebAssembly, or used trickery like ONNX.js that abused WebGL into pretending to be compute.

That picture began to change when Chrome 113 shipped WebGPU to desktop stable in 2023. Firefox and Safari followed through 2024. And in January 2026, Safari 26 launched it on macOS Tahoe and iOS, putting WebGPU effectively at Baseline. All of Chrome, Edge, Firefox and Safari ship it on by default; global coverage is around 95%.

This post is not about graphics. It is about compute — the same GPU, often sharing the same page, running a very different kind of pipeline.

  • A matrix multiply that takes 200ms on CPU takes 2ms on GPU.
  • WebLLM runs a quantized Llama 3 8B on a user's laptop GPU at 30+ tok/s.
  • Applying a Gaussian blur to a thousand images takes 4s with wasm-SIMD and 0.4s with WebGPU.

All of this runs inside the browser, with zero server cost. That is what compute shaders promise. This post is a single read on how to take them up on it.


1. Why GPU Compute in the Browser

Start with motivation. Three reasons to run GPU compute in the browser.

1.1 Zero Server Cost

LLM inference on cloud GPUs is 5–10 USD per H100 hour. With 100 users it adds up fast. Move the inference to the user's device GPU and your server cost is zero. The weights download once and cache.

1.2 Privacy

Classifying, summarizing, or embedding medical images, personal photos, or private text can happen entirely on the browser's GPU. The data never leaves the device.

1.3 Latency

You eliminate a 50–200ms round trip to a server. Live video filters, interactive ML, real-time simulation — not having that round trip is fundamentally faster.

These reasons are not new. What is new is that the tools are finally ready.


2. WebGPU vs WebGL — Where Compute Lives

In WebGL, up to and including WebGL 2, every shader lives inside the graphics pipeline. Vertex shaders see vertices, fragment shaders see pixels. To process arbitrary data you had to disguise it as a texture and shove it through the fragment shader — the famous "GPGPU hack".

WebGPU is different. Compute shaders are first-class citizens.

| Aspect | WebGL 2 | WebGPU |
| --- | --- | --- |
| Compute shaders | None (hacks only) | First-class |
| Workgroups / shared memory | None | Yes |
| Storage buffer read/write | Texture tricks | Native |
| Shader language | GLSL ES 3.0 | WGSL |
| API style | Global state machine | Explicit command buffers |
| Multi-threaded command encoding | Not possible | Possible |
| fp16 | Extension (fragile) | Extension (when enabled) |

The second row is the heart of it. WebGPU gives you workgroups and shared memory. Without those, half of what we call "parallel compute" — reductions, scans, prefix sums and other cooperative algorithms — can only be faked. They require the fast shared memory inside a workgroup.


3. The Compute Model — Workgroups, Invocations, Dispatch

A GPU has thousands of ALUs. The way you hand work to those thousands is the compute model.

3.1 Three Layers

Dispatch
  |
  +-- Workgroup #0
  |     |
  |     +-- Invocation (0,0,0)
  |     +-- Invocation (1,0,0)
  |     +-- ...
  |     +-- Invocation (63,0,0)
  +-- Workgroup #1
  |     +-- ...
  +-- Workgroup #N

  • Invocation: the smallest unit running shader code. Equivalent to a CUDA thread.
  • Workgroup: a batch of invocations running together on the same SM (or CU). Equivalent to a CUDA block. Invocations within a workgroup can cooperate via fast shared memory and barriers.
  • Dispatch: a 3D grid of workgroups. Equivalent to a CUDA grid.

The workgroup size is baked into the shader at compile time. In WGSL you declare it with the @workgroup_size attribute.

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  // ...
}

3.2 What's a Good Workgroup Size?

  • 64 or 256 are safe defaults. Mobile GPUs handle them well.
  • Too big (over 1024) gets rejected on some devices.
  • Too small (under 32) leaves the GPU starved with low occupancy.

3.3 Global ID

Each invocation knows its own ID. The global_invocation_id is a 3D coordinate inside the whole dispatch. For a 1D array, only gid.x matters.
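For a 1D job, the kernel body usually reduces to indexing by gid.x and guarding the tail. A minimal sketch, assuming a single read_write storage binding (not part of the reduction example later):

@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn double_all(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  // The last workgroup may overrun the array, so guard the tail.
  if (i >= arrayLength(&data)) { return; }
  data[i] = data[i] * 2.0;
}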


4. A Crash Course in WGSL

WGSL is WebGPU's shading language. It picks the good parts of GLSL and HLSL and gives them a Rust-flavoured syntax.

4.1 Types

// Scalars
var x: i32 = 42;
var y: u32 = 42u;
var z: f32 = 3.14;
var b: bool = true;

// Vectors
var v: vec4<f32> = vec4<f32>(1.0, 2.0, 3.0, 4.0);
var i: vec3<i32> = vec3<i32>(1, 2, 3);

// Matrices
var m: mat4x4<f32> = mat4x4<f32>(...);

// Runtime arrays
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

4.2 Storage Buffers and Bind Groups

Inputs and outputs in WGSL are buffers. Unlike graphics shaders there are no vertex attributes or textures by default — just arrays that live in GPU memory.

// read-only input
@group(0) @binding(0) var<storage, read> input: array<f32>;
// read_write output
@group(0) @binding(1) var<storage, read_write> output: array<f32>;
// small constants, fixed across one dispatch
@group(0) @binding(2) var<uniform> params: Params;

The group and binding numbers line up with a BindGroupLayout on the JS side. group 0 / binding 0 attaches to whichever buffer you bind at binding 0 of bind group 0 in JS, and so on.
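For illustration, here is what an explicit layout for those three bindings might look like on the JS side. This is a sketch; the driver code in section 5 uses layout: 'auto' instead, which derives the same information from the shader.

const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    // binding 0: var<storage, read>
    { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'read-only-storage' } },
    // binding 1: var<storage, read_write>
    { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'storage' } },
    // binding 2: var<uniform>
    { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'uniform' } },
  ],
});
const pipelineLayout = device.createPipelineLayout({ bindGroupLayouts: [bindGroupLayout] });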

4.3 Workgroup Shared Memory

var<workgroup> shared_data: array<f32, 64>;

All 64 invocations in a workgroup share the same 64-element array. It's about 100x faster than global (storage) memory, and it is the crucial resource for reductions, scans, blurs and other cooperative algorithms.

4.4 Barriers

Synchronization is required so other invocations see your writes.

workgroupBarrier();   // wait until every invocation in the workgroup reaches here
storageBarrier();     // make storage-buffer writes visible to all invocations

4.5 Built-ins

Common built-in inputs.

  • global_invocation_id: 3D coordinates inside the whole dispatch.
  • local_invocation_id: 3D coordinates inside the workgroup.
  • workgroup_id: 3D coordinates of the workgroup inside the dispatch.
  • local_invocation_index: 1D index inside the workgroup (0..workgroup_size-1).
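Put together, an entry point that declares all four looks like the sketch below; in practice you take only the ones you need.

@compute @workgroup_size(64)
fn main(
  @builtin(global_invocation_id)   gid:  vec3<u32>,
  @builtin(local_invocation_id)    lid:  vec3<u32>,
  @builtin(workgroup_id)           wid:  vec3<u32>,
  @builtin(local_invocation_index) lidx: u32,
) {
  // For this 1D layout: gid.x == wid.x * 64u + lid.x, and lidx == lid.x.
}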

5. Your First Compute Shader — A Parallel Sum

Enough theory; let's actually run something. Take an array of length N and compute the sum of every element on the GPU. The classic "first kernel."

5.1 WGSL — Two-Stage Reduction

A parallel sum is hard to finish in a single dispatch because there is no shared memory between workgroups. So we do it in two stages.

  1. Stage A: every workgroup computes a partial sum over its slice and writes it to partial_sums[wg_id].
  2. Stage B: reduce partial_sums once more (or on the host) to get the final total.

// reduce.wgsl — Stage A
const WG_SIZE: u32 = 64u;

@group(0) @binding(0) var<storage, read>       input:        array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;
@group(0) @binding(2) var<uniform>             params:       Params;

struct Params { n: u32 };

var<workgroup> sdata: array<f32, WG_SIZE>;

@compute @workgroup_size(WG_SIZE)
fn main(
  @builtin(global_invocation_id) gid: vec3<u32>,
  @builtin(local_invocation_id) lid: vec3<u32>,
  @builtin(workgroup_id) wid: vec3<u32>,
) {
  let i = gid.x;
  let tid = lid.x;

  // 1) Load our global element into shared memory (0 if out of range)
  var v: f32 = 0.0;
  if (i < params.n) { v = input[i]; }
  sdata[tid] = v;
  workgroupBarrier();

  // 2) Tree-reduce inside the workgroup
  var stride: u32 = WG_SIZE / 2u;
  loop {
    if (stride == 0u) { break; }
    if (tid < stride) {
      sdata[tid] = sdata[tid] + sdata[tid + stride];
    }
    workgroupBarrier();
    stride = stride / 2u;
  }

  // 3) Only invocation 0 of each workgroup writes the result
  if (tid == 0u) {
    partial_sums[wid.x] = sdata[0];
  }
}

Three key patterns:

  • Load once from global into workgroup memory (sdata).
  • Tree reduction — halve the stride each step, summing pairs.
  • A workgroupBarrier() between steps to keep memory consistent.

5.2 The JS Driver

The JS side of WebGPU looks noisy at first sight, but the parts are simple.

// reduce.ts
async function gpuSum(input: Float32Array): Promise<number> {
  // 1) Adapter and device
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('No WebGPU adapter');
  const device = await adapter.requestDevice();

  const N = input.length;
  const WG = 64;
  const numWorkgroups = Math.ceil(N / WG);

  // 2) Buffers
  const inputBuf = device.createBuffer({
    size: input.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(inputBuf, 0, input);

  const partialBuf = device.createBuffer({
    size: numWorkgroups * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });

  const uniformBuf = device.createBuffer({
    size: 16, // padded to 16
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(uniformBuf, 0, new Uint32Array([N, 0, 0, 0]));

  // 3) Shader and pipeline
  const module = device.createShaderModule({ code: WGSL_SOURCE });
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module, entryPoint: 'main' },
  });

  // 4) Bind group
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: inputBuf } },
      { binding: 1, resource: { buffer: partialBuf } },
      { binding: 2, resource: { buffer: uniformBuf } },
    ],
  });

  // 5) Dispatch
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(numWorkgroups);
  pass.end();

  // 6) Staging buffer to read back to CPU
  const stagingBuf = device.createBuffer({
    size: numWorkgroups * 4,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  encoder.copyBufferToBuffer(partialBuf, 0, stagingBuf, 0, numWorkgroups * 4);
  device.queue.submit([encoder.finish()]);

  // 7) Read
  await stagingBuf.mapAsync(GPUMapMode.READ);
  const partials = new Float32Array(stagingBuf.getMappedRange().slice(0));
  stagingBuf.unmap();

  // 8) Final reduction on CPU (one pass is enough if numWorkgroups is small)
  let total = 0;
  for (let i = 0; i < partials.length; i++) total += partials[i];
  return total;
}

It reads long, but it splits into three parts.

  • Setup (1–4): adapter, device, buffers, pipeline, bind group.
  • Dispatch (5): encode commands, submit to the queue.
  • Readback (6–7): copy to a staging buffer, map async, read.

5.3 Performance — Is It Actually Fast?

Reference measurements on an M2 Pro with Chrome 127, N = 8M floats:

| Implementation | Time (ms) |
| --- | --- |
| Plain JS for loop | 38 |
| Float32Array.reduce | 36 |
| Wasm-SIMD (4-wide) | 11 |
| WebGPU compute | 1.8 |

Twenty times faster. And the gap widens as N grows. At N = 64M the GPU is roughly 100x ahead.


6. Real Applications — Where Compute Actually Runs

6.1 In-Browser LLM Inference — WebLLM

WebLLM from the MLC team is the best-known example. It runs models like Llama 3 8B, Qwen 2.5 and Phi 3.5 — quantized — through WebGPU compute shaders. On an M2 Pro a 4-bit Llama 3 8B does 30+ tok/s. Zero server cost.

Under the hood, MLC-LLM's TVM compiles the model graph into a sequence of WGSL compute shaders. You will rarely write these by hand, but every one of them — attention, matmul, softmax — is a variation of the patterns shown above.

6.2 Browser Stable Diffusion

Web Stable Diffusion runs SD 1.5 and SDXL Turbo on WebGPU, producing 1024x1024 images on a user's laptop in 10–30 seconds. Not server-GPU fast — but zero cost, full privacy, offline-capable.

6.3 GPU-Accelerated Dataframes

Successors to GPU.js and projects like WebDF reimplement pandas/Polars-style dataframe operations as WebGPU compute. A 10-million-row group-by aggregation that costs 400ms on CPU costs 25ms on GPU.

6.4 Parallel Image Filters

The most immediately satisfying application. Gaussian blur, edge detection, colour-space conversion run on every pixel at once. Use a 2D workgroup size like 8x8 or 16x16 for 2D stencil work.

// Texture bindings assumed by this snippet (output format assumed rgba8unorm)
@group(0) @binding(0) var input_tex: texture_2d<f32>;
@group(0) @binding(1) var output_tex: texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(16, 16)
fn blur(@builtin(global_invocation_id) gid: vec3<u32>) {
  let px = gid.xy;
  // 3x3 mean — a real Gaussian only differs by weights
  var sum: vec4<f32> = vec4<f32>(0.0);
  for (var dy: i32 = -1; dy <= 1; dy = dy + 1) {
    for (var dx: i32 = -1; dx <= 1; dx = dx + 1) {
      let p = vec2<i32>(i32(px.x) + dx, i32(px.y) + dy);
      sum = sum + textureLoad(input_tex, p, 0);
    }
  }
  textureStore(output_tex, vec2<i32>(px), sum / 9.0);
}

Filters like this run 10–50x faster than wasm-SIMD.

6.5 Simulation — N-body, Fluids, Cloth

Physics simulation is a classic GPU application. The interactions of 1000–100000 particles need to be solved every frame. WebGPU compute can handle 10000+ particles at 60fps. Babylon.js and PlayCanvas already use it.


7. The Reality of Browser Support in 2026

Theory is fine, demos are pretty — but shipping is different. Where we actually are in May 2026 (caniuse, WebKit release notes and Mozilla trackers combined):

7.1 Status

| Browser | State | Note |
| --- | --- | --- |
| Chrome / Edge | Stable on desktop and Android | 113+ (May 2023) |
| Safari | Stable on macOS and iOS | 26+ (Jan 2026) |
| Firefox | Stable on desktop | 130+ (Oct 2024) |
| Firefox Android | Partial (flagged) | In progress |
| Older mobile GPUs | Rejected | Need adaptive fallback |

Global coverage sits around 95%. The remaining 5% falls back to wasm-SIMD or plain JS.
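Detection itself is cheap. A sketch of picking a backend at startup (the wasm check here is deliberately crude; real SIMD detection validates a small test module):

async function pickBackend(): Promise<'webgpu' | 'wasm-simd' | 'js'> {
  if ('gpu' in navigator) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) return 'webgpu';                            // the ~95% path
  }
  if (typeof WebAssembly === 'object') return 'wasm-simd';   // most of the rest
  return 'js';                                               // last resort
}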

7.2 fp16 — Not Universal Yet

Half-precision floats (f16 in WGSL via the shader-f16 extension) halve the memory footprint of large ML models but are not enabled on every device. Chrome ships them on desktop discrete GPUs; integrated GPUs and many mobiles still reject them.

const adapter = await navigator.gpu.requestAdapter();
const hasF16 = adapter.features.has('shader-f16');
const device = await adapter.requestDevice({
  requiredFeatures: hasF16 ? ['shader-f16'] : [],
});

There are two fallbacks:

  • Drop to f32 — double the memory, similar speed.
  • INT8 quantization — smaller than fp16, slight accuracy loss.
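On the shader side, f16 also needs an enable directive at the top of the WGSL module. A minimal sketch, valid only when the device was created with 'shader-f16' as above:

enable f16;

@group(0) @binding(0) var<storage, read_write> weights: array<f16>;

@compute @workgroup_size(64)
fn scale(@builtin(global_invocation_id) gid: vec3<u32>) {
  if (gid.x >= arrayLength(&weights)) { return; }
  weights[gid.x] = weights[gid.x] * 2.0h;  // the 'h' suffix marks an f16 literal
}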

7.3 Workgroup Limits

Limits like limits.maxComputeWorkgroupSizeX differ by device. Safe defaults:

  • workgroup_size: 64 or 256
  • maxComputeInvocationsPerWorkgroup: 256 (do not exceed)
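The limits are readable from the adapter before you commit to a shader. A sketch of choosing a size at runtime; since @workgroup_size is a compile-time constant, the chosen value has to be spliced into the WGSL source (WGSL_TEMPLATE and the __WG_SIZE__ placeholder are illustrative):

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('No WebGPU adapter');
const maxInv = adapter.limits.maxComputeInvocationsPerWorkgroup;
const wgSize = maxInv >= 256 ? 256 : 64;   // fall back to 64 on tighter devices
const code = WGSL_TEMPLATE.replace('__WG_SIZE__', String(wgSize));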

7.4 Memory Limits

limits.maxStorageBufferBindingSize defaults to 128MB. Big models must be chunked across buffers — the reason WebLLM splits model weights across many of them.
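A sketch of the chunking, assuming a flat Float32Array of weights (chunkWeights is illustrative, not WebLLM's actual code):

function chunkWeights(device: GPUDevice, weights: Float32Array): GPUBuffer[] {
  const maxBytes = device.limits.maxStorageBufferBindingSize; // typically 128MB
  const maxFloats = Math.floor(maxBytes / 4);
  const buffers: GPUBuffer[] = [];
  for (let off = 0; off < weights.length; off += maxFloats) {
    const chunk = weights.subarray(off, off + maxFloats);
    const buf = device.createBuffer({
      size: chunk.byteLength,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });
    device.queue.writeBuffer(buf, 0, chunk);
    buffers.push(buf);
  }
  return buffers;
}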

7.5 Compile Time

WGSL shaders compile on first use. A large shader (a 1000-line attention kernel) can take 500ms to 2s. Bundling a warm-up phase with model loading is the standard pattern.
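One way to do that: start compilation with createComputePipelineAsync and let it overlap the weight download. A sketch (the URL is hypothetical):

const module = device.createShaderModule({ code: WGSL_SOURCE });
const pipelinePromise = device.createComputePipelineAsync({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});
// Download weights while the shader compiles in the background.
const weights = await fetch('/models/llama3-q4.bin').then(r => r.arrayBuffer());
const pipeline = await pipelinePromise;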


8. WebGPU + Wasm-SIMD Hybrid Pipelines

GPUs are not fast at everything. Small arrays and branchy code run better on the CPU. Real pipelines often mix the two.

8.1 GPU vs CPU — When

| Pattern | GPU wins | CPU (wasm-SIMD) wins |
| --- | --- | --- |
| Data size | 100K+ elements | Under 1K |
| Branching | Same path | Diverse paths |
| Memory pattern | Coalesced | Random access |
| First-use cost | After compile | Immediate |
| Frequent readback to CPU | Expensive | Free |

8.2 The Hybrid Pattern

A typical ML pipeline like WebLLM looks like this.

[tokenizer]  <- wasm-SIMD (small code, branchy)
     |
     v
[embed]      <- GPU (big matrix multiplies)
     |
     v
[attention]  <- GPU (matmul + softmax)
     |
     v
[FFN]        <- GPU (big matmuls)
     |
     v
[sampler]    <- CPU or GPU (small tensors, branchy)
     |
     v
[detokenizer] <- wasm-SIMD

Small branchy code like tokenizers and detokenizers stays on CPU. Big uniform code like matmuls goes on GPU. Minimize the number of GPU↔CPU transfers — once data is on the GPU, do as much as possible there.

8.3 Wasm and WebGPU Talking

As of May 2026, sharing memory directly via SharedArrayBuffer requires cross-origin isolation. The alternative is to copy through staging buffers, bounded by PCIe bandwidth (a few GB/s). Staying on the GPU is the answer.
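The copy itself is just writeBuffer from a typed-array view over the wasm heap; wasmMemory, ptr, count and gpuBuf below are stand-ins for whatever your module exports:

// View into wasm linear memory (no copy yet), then upload to the GPU buffer.
const view = new Float32Array(wasmMemory.buffer, ptr, count);
device.queue.writeBuffer(gpuBuf, 0, view);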


9. Decision Matrix

When you're building a new feature, where should it run?

| Condition | Recommended | Why |
| --- | --- | --- |
| Data under 1MB, simple op | Plain JS | Setup cost beats the work itself |
| Data 100MB+, uniform op | WebGPU | The GPU's home turf |
| Small data, complex branching | Wasm-SIMD | CPU does this well |
| ML inference, 50MB+ model | WebGPU (WebLLM etc.) | Matmuls everywhere |
| Real-time image processing | WebGPU | Same op per pixel |
| Dataframe group-by | Depends on size | GPU once you're past 1M rows |
| Device does not support WebGPU | Wasm-SIMD fallback | The 5% safety net |
| First response must arrive within 1s | Start with JS or wasm | GPU compile warm-up |

10. Limits and Pitfalls

It's not all roses. Things to be aware of in May 2026.

10.1 The First Dispatch Is Slow

Shader compile, pipeline creation, buffer allocation all happen on first use. For a single short job, the GPU setup costs more than the job itself. Only worth it for repeated use or large workloads.

10.2 Debugging Is Hard

GPU shaders have no console.log. WebGPU debugging today looks like:

  • Chrome DevTools' WebGPU panel — inspect buffers, bind groups, dispatches.
  • Write values to an output buffer and read them back on CPU ("printf debugging").
  • Tools like wgpu-inspect for WGSL — still rough.

There is no mature debugger comparable to cuda-gdb.
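The second bullet is the workhorse. A sketch, reusing the names from the reduction kernel in section 5 (the extra binding is hypothetical):

// One extra storage buffer reserved for debug output.
@group(0) @binding(3) var<storage, read_write> debug_buf: array<f32>;

// Inside main(), after the tree reduction:
if (lid.x == 0u) {
  debug_buf[wid.x] = sdata[0];  // dump one intermediate value per workgroup
}

Read it back with the same staging-buffer pattern from section 5.2.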

10.3 Non-Determinism

The order of float summations can vary with workgroup scheduling. Do not expect bit-identical outputs for the same input. ML inference may produce different tokens for the same prompt and model — even with sampling temperature 0.

10.4 Mobile Variability

Mobile GPUs have tighter limits than desktops. A shader that runs on one phone can fail on another. Always:

  • Read adapter.limits for workgroup-size and memory limits.
  • Provide a fallback path.
  • Test on real devices (BrowserStack and physical hardware).

10.5 Security

WebGPU is a large attack surface. Several information-disclosure CVEs were reported between 2023 and 2025. Browsers already isolate strongly — but sensitive domains should review the gpu-compute permission carefully.

10.6 Shader Source Is Not Protected

WGSL source ships to the client in plain text. Do not bury algorithmic trade secrets in it.


Epilogue — Get Started with Compute Shaders in Two Hours

Start Checklist

  • Guard for WebGPU support: if (!navigator.gpu).
  • Create adapter and device once and reuse (one per page).
  • Cache shader modules and pipelines — never recompile the same shader.
  • Pool buffers — reuse instead of allocating a new one per call.
  • Staging buffers and mapAsync are async — remember the await.
  • Start with a workgroup size of 64 or 256.
  • The first call is a warm-up — measure from call 5 onward.
  • Guard on the f16 feature before using it.

Anti-Patterns

  • Recreating the device or adapter every call — expensive and pointless.
  • GPU↔CPU ping-pong — don't read back to CPU every iteration. Stay on the GPU.
  • Workgroup size 1024+ — rejected on mobile.
  • Shaders full of branches — kill GPU occupancy. Send branchy code to the CPU.
  • Assuming f16 is available — without a feature guard, pipeline creation fails on devices that lack it.
  • A single massive shader — slow to compile. Split into stages.
  • Calling mapAsync on every result — each readback forces a sync point and stalls the GPU pipeline. Batch readbacks.

Coming Next

  • WGSL Deep Dive — writing an attention kernel from scratch, the workgroup tricks behind flash attention.
  • WebGPU Debugging and Profiling in Practice — Chrome DevTools, timestamp queries, occupancy measurement.
  • Inside WebLLM — anatomy of how TVM compiles a model into WGSL.
