Real GPU Compute in the Browser — A Hands-On Guide to WebGPU Compute Shaders and WGSL in 2026
Prologue — The Web Finally Gets to Tell the GPU What to Do
For years we accepted one lie: "browsers can't do GPU compute." WebGL had shaders, but they were trapped inside the graphics pipeline. Running arbitrary data through arbitrary workgroups and reading the result back — the thing we'd taken for granted in CUDA — was not possible on the web. So ML inference ran on the CPU via WebAssembly, or used trickery like ONNX.js that abused WebGL into pretending to be compute.
That picture began to change when Chrome 113 shipped WebGPU to desktop stable in 2023. Firefox and Safari followed through 2024. And in January 2026, Safari 26 launched it on macOS Tahoe and iOS, putting WebGPU effectively at Baseline. All of Chrome, Edge, Firefox and Safari ship it on by default; global coverage is around 95%.
This post is not about graphics. It is about compute — the same GPU, often sharing the same page, running a very different kind of pipeline.
- A matrix multiply that takes 200ms on CPU takes 2ms on GPU.
- WebLLM runs a quantized Llama 3 8B on a user's laptop GPU at 30+ tok/s.
- Applying a Gaussian blur to a thousand images takes wasm-SIMD 4s; it takes WebGPU 0.4s.
All of this — with zero server cost — inside the browser. That is what compute shaders promise. This post is a guide to taking them up on it.
1. Why GPU Compute in the Browser
Start with motivation. Three reasons to run GPU compute in the browser.
1.1 Zero Server Cost
LLM inference on cloud GPUs costs 5–10 USD per H100 hour. With 100 users it adds up fast. Move the inference to the user's device GPU and your server cost drops to zero. The weights download once and are cached.
1.2 Privacy
Classifying, summarizing, or embedding medical images, personal photos, or private text happens entirely on the browser's GPU. The data never leaves the device.
1.3 Latency
You eliminate a 50–200ms round trip to a server. Live video filters, interactive ML, real-time simulation — not having that round trip is fundamentally faster.
These reasons are not new. What is new is that the tools are finally ready.
2. WebGPU vs WebGL — Where Compute Lives
In WebGL, up to and including version 2, every shader lives inside the graphics pipeline. Vertex shaders see vertices, fragment shaders see pixels. To process arbitrary data you had to disguise it as a texture and shove it through the fragment shader — the famous "GPGPU hack".
WebGPU is different. Compute shaders are first-class citizens.
| Aspect | WebGL 2 | WebGPU |
|---|---|---|
| Compute shaders | None (hacks only) | First-class |
| Workgroups / shared memory | None | Yes |
| Storage buffer read/write | Texture tricks | Native |
| Shader language | GLSL ES 3.0 | WGSL |
| API style | Global state machine | Explicit command buffers |
| Multi-threaded command encoding | Not possible | Possible |
| fp16 | Extension (fragile) | Extension (when enabled) |
The second row is the heart of it. WebGPU gives you workgroups and shared memory. Without those, half of what we call "parallel compute" — reductions, scans, prefix sums and other cooperative algorithms — can only be faked. They require the fast shared memory inside a workgroup.
3. The Compute Model — Workgroups, Invocations, Dispatch
A GPU has thousands of ALUs. The way you hand work to those thousands is the compute model.
3.1 Three Layers
Dispatch
|
+-- Workgroup #0
| |
| +-- Invocation (0,0,0)
| +-- Invocation (1,0,0)
| +-- ...
| +-- Invocation (63,0,0)
+-- Workgroup #1
| +-- ...
+-- Workgroup #N
- Invocation: the smallest unit running shader code. Equivalent to a CUDA thread.
- Workgroup: a batch of invocations running together on the same SM (or CU). Equivalent to a CUDA block. Invocations within a workgroup can cooperate via fast shared memory and barriers.
- Dispatch: a 3D grid of workgroups. Equivalent to a CUDA grid.
The workgroup size is baked into the shader at compile time. In WGSL you declare it like this:
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
// ...
}
3.2 What's a Good Workgroup Size?
- 64 or 256 are safe defaults. Mobile GPUs handle them well.
- Too big (over 1024) gets rejected on some devices.
- Too small (under 32) leaves the GPU starved with low occupancy.
3.3 Global ID
Each invocation knows its own ID. The global_invocation_id is a 3D coordinate inside the whole dispatch. For a 1D array, only gid.x matters.
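As a minimal sketch of how the host maps a 1D problem onto that grid (the values of `WG_SIZE` and `N` here are illustrative), the dispatch count is a ceiling division, and the shader needs a bounds check because the last workgroup is usually only partially full:

```ts
// Sketch: covering N elements with workgroups of 64 invocations each.
const WG_SIZE = 64;                             // must match @workgroup_size(64) in WGSL
const N = 1_000_003;                            // problem size (not a multiple of 64)
const numWorkgroups = Math.ceil(N / WG_SIZE);   // 15_626 workgroups

// pass.dispatchWorkgroups(numWorkgroups) launches numWorkgroups * WG_SIZE
// invocations; any invocation with gid.x >= N must return early in the shader,
// exactly like the `if (i < params.n)` guard in section 5.
```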
4. A Crash Course in WGSL
WGSL is WebGPU's shading language. It picks the good parts of GLSL and HLSL and gives them a Rust-flavoured syntax.
4.1 Types
// Scalars
var x: i32 = 42;
var y: u32 = 42u;
var z: f32 = 3.14;
var b: bool = true;
// Vectors
var v: vec4<f32> = vec4<f32>(1.0, 2.0, 3.0, 4.0);
var i: vec3<i32> = vec3<i32>(1, 2, 3);
// Matrices
var m: mat4x4<f32> = mat4x4<f32>(...);
// Runtime arrays
@group(0) @binding(0) var<storage, read_write> data: array<f32>;
4.2 Storage Buffers and Bind Groups
Inputs and outputs in WGSL are buffers. Unlike graphics shaders there are no vertex attributes or textures by default — just arrays that live in GPU memory.
// read-only input
@group(0) @binding(0) var<storage, read> input: array<f32>;
// read_write output
@group(0) @binding(1) var<storage, read_write> output: array<f32>;
// small constants, fixed across one dispatch
@group(0) @binding(2) var<uniform> params: Params;
The `@group` and `@binding` numbers line up with a BindGroupLayout on the JS side: the bind group entry declared with `binding: 0` in group 0 attaches to the variable marked `@group(0) @binding(0)`, and so on.
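The reduction example in section 5 sidesteps this with `layout: 'auto'`, but an explicit layout makes the correspondence visible. A sketch, assuming a `device` from the usual setup and the three bindings declared above:

```ts
// Sketch: a bind group layout matching
//   @group(0) @binding(0) var<storage, read>       input
//   @group(0) @binding(1) var<storage, read_write> output
//   @group(0) @binding(2) var<uniform>             params
const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'read-only-storage' } },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'storage' } },
    { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'uniform' } },
  ],
});

// Passed to createComputePipeline in place of layout: 'auto'.
const pipelineLayout = device.createPipelineLayout({ bindGroupLayouts: [bindGroupLayout] });
```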
4.3 Workgroup Shared Memory
var<workgroup> shared_data: array<f32, 64>;
All 64 invocations in a workgroup share the same 64-element array. It's about 100x faster than global (storage) memory. The crucial resource for reductions, scans, blurs and other cooperative algorithms.
4.4 Barriers
Synchronization is required so other invocations see your writes.
workgroupBarrier(); // wait until every invocation in the workgroup reaches here
storageBarrier(); // make storage-buffer writes visible within the workgroup
4.5 Built-ins
Common built-in inputs.
- `global_invocation_id`: 3D coordinates inside the whole dispatch.
- `local_invocation_id`: 3D coordinates inside the workgroup.
- `workgroup_id`: 3D coordinates of the workgroup inside the dispatch.
- `local_invocation_index`: 1D index inside the workgroup (0..workgroup_size-1).
5. Your First Compute Shader — A Parallel Sum
Enough theory; let's actually run something. Take an array of length N and compute the sum of every element on the GPU. The classic "first kernel."
5.1 WGSL — Two-Stage Reduction
A parallel sum is hard to finish in a single dispatch because there is no shared memory between workgroups. So we do it in two stages.
- Stage A: every workgroup computes a partial sum over its slice and writes it to `partial_sums[wg_id]`.
- Stage B: reduce `partial_sums` once more (or on the host) to get the final total.
// reduce.wgsl — Stage A
const WG_SIZE: u32 = 64u;
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;
@group(0) @binding(2) var<uniform> params: Params;
struct Params { n: u32 };
var<workgroup> sdata: array<f32, WG_SIZE>;
@compute @workgroup_size(WG_SIZE)
fn main(
@builtin(global_invocation_id) gid: vec3<u32>,
@builtin(local_invocation_id) lid: vec3<u32>,
@builtin(workgroup_id) wid: vec3<u32>,
) {
let i = gid.x;
let tid = lid.x;
// 1) Load our global element into shared memory (0 if out of range)
var v: f32 = 0.0;
if (i < params.n) { v = input[i]; }
sdata[tid] = v;
workgroupBarrier();
// 2) Tree-reduce inside the workgroup
var stride: u32 = WG_SIZE / 2u;
loop {
if (stride == 0u) { break; }
if (tid < stride) {
sdata[tid] = sdata[tid] + sdata[tid + stride];
}
workgroupBarrier();
stride = stride / 2u;
}
// 3) Only invocation 0 of each workgroup writes the result
if (tid == 0u) {
partial_sums[wid.x] = sdata[0];
}
}
Three key patterns:
- Load once from global into workgroup memory (`sdata`).
- Tree reduction: halve the stride each step, summing pairs.
- A `workgroupBarrier()` between steps to keep memory consistent.
5.2 The JS Driver
The JS side of WebGPU looks noisy at first sight, but the parts are simple.
// reduce.ts
async function gpuSum(input: Float32Array): Promise<number> {
// 1) Adapter and device
if (!navigator.gpu) throw new Error('WebGPU not supported');
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('No WebGPU adapter');
const device = await adapter.requestDevice();
const N = input.length;
const WG = 64;
const numWorkgroups = Math.ceil(N / WG);
// 2) Buffers
const inputBuf = device.createBuffer({
size: input.byteLength,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(inputBuf, 0, input);
const partialBuf = device.createBuffer({
size: numWorkgroups * 4,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
const uniformBuf = device.createBuffer({
size: 16, // padded to 16
usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(uniformBuf, 0, new Uint32Array([N, 0, 0, 0]));
// 3) Shader and pipeline
const module = device.createShaderModule({ code: WGSL_SOURCE }); // WGSL_SOURCE holds the reduce.wgsl text from 5.1
const pipeline = device.createComputePipeline({
layout: 'auto',
compute: { module, entryPoint: 'main' },
});
// 4) Bind group
const bindGroup = device.createBindGroup({
layout: pipeline.getBindGroupLayout(0),
entries: [
{ binding: 0, resource: { buffer: inputBuf } },
{ binding: 1, resource: { buffer: partialBuf } },
{ binding: 2, resource: { buffer: uniformBuf } },
],
});
// 5) Dispatch
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(numWorkgroups);
pass.end();
// 6) Staging buffer to read back to CPU
const stagingBuf = device.createBuffer({
size: numWorkgroups * 4,
usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
encoder.copyBufferToBuffer(partialBuf, 0, stagingBuf, 0, numWorkgroups * 4);
device.queue.submit([encoder.finish()]);
// 7) Read
await stagingBuf.mapAsync(GPUMapMode.READ);
const partials = new Float32Array(stagingBuf.getMappedRange().slice(0));
stagingBuf.unmap();
// 8) Final reduction on CPU (one pass is enough if numWorkgroups is small)
let total = 0;
for (let i = 0; i < partials.length; i++) total += partials[i];
return total;
}
It reads long the first time through, but it splits into three parts.
- Setup (1–4): adapter, device, buffers, pipeline, bind group.
- Dispatch (5): encode commands, submit to the queue.
- Readback (6–7): copy to a staging buffer, map async, read.
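Using it is a one-liner, assuming the WGSL from 5.1 is embedded as `WGSL_SOURCE`. A sketch that sanity-checks the GPU result against a CPU sum (exact equality is not guaranteed; see 10.3 on non-determinism):

```ts
// Sketch: call gpuSum and compare against a CPU reference with a tolerance.
const data = new Float32Array(8_000_000).map(() => Math.random());

const gpuTotal = await gpuSum(data);
const cpuTotal = data.reduce((a, b) => a + b, 0);

// Summation order differs between CPU and GPU, so compare with a relative
// tolerance rather than strict equality.
const relError = Math.abs(gpuTotal - cpuTotal) / Math.abs(cpuTotal);
console.log(`gpu=${gpuTotal} cpu=${cpuTotal} relError=${relError.toExponential(2)}`);
```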
5.3 Performance — Is It Actually Fast?
Reference measurements on an M2 Pro with Chrome 127, N = 8M floats:
| Implementation | Time (ms) |
|---|---|
| Plain JS for loop | 38 |
| Float32Array.reduce | 36 |
| Wasm-SIMD (4-wide) | 11 |
| WebGPU compute | 1.8 |
Roughly twenty times faster than plain JS, and about six times faster than wasm-SIMD. The gap widens as N grows: at N = 64M the GPU is roughly 100x ahead.
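One caveat on methodology: numbers like these only make sense if the warm-up dispatch is excluded (see 7.5 and 10.1). A sketch of a measurement helper under that assumption:

```ts
// Sketch: average over several runs, discarding the first call, which pays for
// shader compilation, pipeline creation, and buffer allocation.
async function timeGpu(fn: () => Promise<unknown>, runs = 10): Promise<number> {
  await fn();                                   // warm-up, not measured
  const t0 = performance.now();
  for (let i = 0; i < runs; i++) await fn();
  return (performance.now() - t0) / runs;       // average ms per run
}

// const ms = await timeGpu(() => gpuSum(data));
```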
6. Real Applications — Where Compute Actually Runs
6.1 In-Browser LLM Inference — WebLLM
WebLLM from the MLC team is the best-known example. It runs models like Llama 3 8B, Qwen 2.5 and Phi 3.5 — quantized — through WebGPU compute shaders. On an M2 Pro a 4-bit Llama 3 8B does 30+ tok/s. Zero server cost.
Under the hood, MLC-LLM uses the TVM compiler stack to turn the model graph into a sequence of WGSL compute shaders. You will rarely write these by hand, but every one of them — attention, matmul, softmax — is a variation of the patterns shown above.
6.2 Browser Stable Diffusion
Web Stable Diffusion runs SD 1.5 and SDXL Turbo on WebGPU. These billion-parameter-scale models produce 1024x1024 images on a user's laptop in 10–30 seconds. Not server-GPU fast — but zero cost, full privacy, offline-capable.
6.3 GPU-Accelerated Dataframes
Successors to GPU.js and projects like WebDF reimplement pandas/Polars-style dataframe operations as WebGPU compute. A 10-million-row group-by aggregation that costs 400ms on CPU costs 25ms on GPU.
6.4 Parallel Image Filters
The most immediately satisfying application. Gaussian blur, edge detection, colour-space conversion run on every pixel at once. Use a 2D workgroup size like 8x8 or 16x16 for 2D stencil work.
// texture bindings for the filter (binding numbers and rgba8unorm format are assumed)
@group(0) @binding(0) var input_tex: texture_2d<f32>;
@group(0) @binding(1) var output_tex: texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(16, 16)
fn blur(@builtin(global_invocation_id) gid: vec3<u32>) {
let px = gid.xy;
// 3x3 mean — a real Gaussian only differs by weights
var sum: vec4<f32> = vec4<f32>(0.0);
for (var dy: i32 = -1; dy <= 1; dy = dy + 1) {
for (var dx: i32 = -1; dx <= 1; dx = dx + 1) {
let p = vec2<i32>(i32(px.x) + dx, i32(px.y) + dy);
sum = sum + textureLoad(input_tex, p, 0);
}
}
textureStore(output_tex, vec2<i32>(px), sum / 9.0);
}
Filters like this run 10–50x faster than wasm-SIMD.
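The host side of that blur is the 2D version of the dispatch arithmetic from section 3. A sketch, where `blurPipeline`, `blurBindGroup`, `width`, and `height` are assumed to come from the usual setup:

```ts
// Sketch: one 16x16 workgroup per 16x16 tile of the image.
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(blurPipeline);
pass.setBindGroup(0, blurBindGroup);            // binds input_tex and output_tex
pass.dispatchWorkgroups(
  Math.ceil(width / 16),                        // workgroups along x
  Math.ceil(height / 16),                       // workgroups along y
);
pass.end();
device.queue.submit([encoder.finish()]);
```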
6.5 Simulation — N-body, Fluids, Cloth
Physics simulation is a classic GPU application. The interactions of 1000–100000 particles need to be solved every frame. WebGPU compute can handle 10000+ particles at 60fps. Babylon.js and PlayCanvas already use it.
7. The Reality of Browser Support in 2026
Theory is fine, demos are pretty — but shipping is different. Where we actually are in May 2026 (caniuse, WebKit release notes and Mozilla trackers combined):
7.1 Status
| Browser | State | Note |
|---|---|---|
| Chrome / Edge | Stable on desktop and Android | 113+ (May 2023) |
| Safari | Stable on macOS and iOS | 26+ (Jan 2026) |
| Firefox | Stable on desktop | 130+ (Oct 2024) |
| Firefox Android | Partial (flagged) | In progress |
| Older mobile GPUs | Rejected | Need adaptive fallback |
Global coverage sits around 95%. The remaining 5% falls back to wasm-SIMD or plain JS.
7.2 fp16 — Not Universal Yet
Half-precision floats (f16 in WGSL via the shader-f16 extension) halve the memory footprint of large ML models but are not enabled on every device. Chrome ships them on desktop discrete GPUs; integrated GPUs and many mobiles still reject them.
const adapter = await navigator.gpu.requestAdapter();
const hasF16 = adapter.features.has('shader-f16');
const device = await adapter.requestDevice({
requiredFeatures: hasF16 ? ['shader-f16'] : [],
});
There are two fallbacks:
- Drop to f32 — double the memory, similar speed.
- INT8 quantization — smaller than fp16, slight accuracy loss.
7.3 Workgroup Limits
Limits like `limits.maxComputeWorkgroupSizeX` differ by device, so query them at runtime (see the sketch after this list). Safe defaults:
- workgroup_size: 64 or 256
- maxComputeInvocationsPerWorkgroup: 256 (do not exceed)
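A sketch of that query, assuming two prebuilt variants of the same shader source (`WGSL_SOURCE_64` and `WGSL_SOURCE_256`, hypothetical names) that differ only in `@workgroup_size`:

```ts
// Sketch: pick the larger workgroup size only if the device allows it.
const maxInvocations = device.limits.maxComputeInvocationsPerWorkgroup;
const wgSize = maxInvocations >= 256 ? 256 : 64;

const module = device.createShaderModule({
  code: wgSize === 256 ? WGSL_SOURCE_256 : WGSL_SOURCE_64,
});
// The dispatch count on the JS side must use the same wgSize for its ceil division.
```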
7.4 Memory Limits
limits.maxStorageBufferBindingSize defaults to 128MB. Big models must be chunked across buffers — the reason WebLLM splits model weights across many of them.
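A sketch of that chunking, splitting a flat weight array across storage buffers that each respect the binding-size limit (the helper name and layout are illustrative, not WebLLM's actual scheme):

```ts
// Sketch: upload weights in chunks no larger than maxStorageBufferBindingSize.
function uploadWeightsInChunks(device: GPUDevice, weights: Float32Array): GPUBuffer[] {
  const maxBytes = device.limits.maxStorageBufferBindingSize; // 128 MiB by default
  const buffers: GPUBuffer[] = [];
  for (let offset = 0; offset < weights.byteLength; offset += maxBytes) {
    const bytes = Math.min(maxBytes, weights.byteLength - offset);
    const buf = device.createBuffer({
      size: bytes,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });
    device.queue.writeBuffer(buf, 0, weights.buffer, weights.byteOffset + offset, bytes);
    buffers.push(buf);
  }
  return buffers; // each shader stage binds only the chunk(s) it needs
}
```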
7.5 Compile Time
WGSL shaders compile on first use. A large shader (a 1000-line attention kernel) can take 500ms to 2s. Bundling a warm-up phase with model loading is the standard pattern.
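`createComputePipelineAsync` is the API-level version of that pattern: it compiles off the critical path, so the pipeline can warm up while the weights download. A sketch, assuming `device` and a compiled shader `module` from the earlier setup; the weights URL is a placeholder:

```ts
// Sketch: overlap shader/pipeline compilation with model download.
const pipelinePromise = device.createComputePipelineAsync({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});
const weightsPromise = fetch('/models/weights.bin').then(r => r.arrayBuffer());

// Neither waits on the other; the first real dispatch starts warm.
const [pipeline, weights] = await Promise.all([pipelinePromise, weightsPromise]);
```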
8. WebGPU + Wasm-SIMD Hybrid Pipelines
GPUs are not fast at everything. Small arrays and branchy code run better on the CPU. Real pipelines often mix the two.
8.1 GPU vs CPU — When
| Pattern | GPU wins | CPU (wasm-SIMD) wins |
|---|---|---|
| Data size | 100K+ elements | Under 1K |
| Branching | Same path | Diverse paths |
| Memory pattern | Coalesced | Random access |
| First-use cost | After compile | Immediate |
| Frequent readback to CPU | Expensive | Free |
8.2 The Hybrid Pattern
A typical ML pipeline like WebLLM looks like this.
[tokenizer] <- wasm-SIMD (small code, branchy)
|
v
[embed] <- GPU (big matrix multiplies)
|
v
[attention] <- GPU (matmul + softmax)
|
v
[FFN] <- GPU (big matmuls)
|
v
[sampler] <- CPU or GPU (small tensors, branchy)
|
v
[detokenizer] <- wasm-SIMD
Small branchy code like tokenizers and detokenizers stays on CPU. Big uniform code like matmuls goes on GPU. Minimize the number of GPU↔CPU transfers — once data is on the GPU, do as much as possible there.
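In code, "stay on the GPU" means encoding the GPU stages back-to-back in one command buffer and reading back only at the very end. A sketch with hypothetical pipelines and bind groups for the three GPU stages above:

```ts
// Sketch: chain embed -> attention -> FFN with no CPU readback in between.
// Each stage reads the previous stage's output buffer via its bind group.
const stages = [
  { pipeline: embedPipeline,     bindGroup: embedBindGroup,     workgroups: embedWorkgroups },
  { pipeline: attentionPipeline, bindGroup: attentionBindGroup, workgroups: attentionWorkgroups },
  { pipeline: ffnPipeline,       bindGroup: ffnBindGroup,       workgroups: ffnWorkgroups },
];

const encoder = device.createCommandEncoder();
for (const stage of stages) {
  const pass = encoder.beginComputePass();
  pass.setPipeline(stage.pipeline);
  pass.setBindGroup(0, stage.bindGroup);
  pass.dispatchWorkgroups(stage.workgroups);
  pass.end();
}
device.queue.submit([encoder.finish()]);   // one submit, zero intermediate readbacks
```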
8.3 Wasm and WebGPU Talking
As of May 2026, sharing memory directly via SharedArrayBuffer requires cross-origin isolation. The alternative is to copy through staging buffers, bounded by PCIe bandwidth (a few GB/s). Staying on the GPU is the answer.
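In the copy-through case, moving data from the wasm heap into a GPU buffer is a single `writeBuffer` over a region of the wasm linear memory. A sketch; the helper name and offsets are placeholders:

```ts
// Sketch: copy a region of wasm linear memory into a storage buffer.
// This is a copy, not shared memory; byteLength must be a multiple of 4.
function uploadFromWasm(
  device: GPUDevice,
  wasmMemory: WebAssembly.Memory,
  byteOffset: number,
  byteLength: number,
): GPUBuffer {
  const buf = device.createBuffer({
    size: byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(buf, 0, wasmMemory.buffer, byteOffset, byteLength);
  return buf;
}
```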
9. Decision Matrix
When you're building a new feature, where should it run?
| Condition | Recommended | Why |
|---|---|---|
| Data under 1MB, simple op | Plain JS | Setup cost beats the work itself |
| Data 100MB+, uniform op | WebGPU | The GPU's home turf |
| Small data, complex branching | Wasm-SIMD | CPU does this well |
| ML inference, 50MB+ model | WebGPU (WebLLM etc.) | Matmuls everywhere |
| Real-time image processing | WebGPU | Same op per pixel |
| Dataframe group-by | Depends on size | GPU once you're past 1M rows |
| Device does not support WebGPU | Wasm-SIMD fallback | The 5% safety net |
| First response must arrive within 1s | Start with JS or wasm | GPU compile warm-up |
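The same table, collapsed into a startup check. A sketch only; the thresholds are the rough ones from the rows above, not universal constants:

```ts
// Sketch: pick an execution backend from data size and code shape.
type Backend = 'js' | 'wasm-simd' | 'webgpu';

function pickBackend(elementCount: number, branchy: boolean): Backend {
  if (!('gpu' in navigator)) return 'wasm-simd';     // the ~5% without WebGPU
  if (branchy) return 'wasm-simd';                   // divergent code stays on CPU
  if (elementCount < 100_000) return 'js';           // setup cost beats the work
  return 'webgpu';                                   // big, uniform workloads
}
```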
10. Limits and Pitfalls
It's not all roses. Things to be aware of in May 2026.
10.1 The First Dispatch Is Slow
Shader compile, pipeline creation, buffer allocation all happen on first use. For a single short job, the GPU setup costs more than the job itself. Only worth it for repeated use or large workloads.
10.2 Debugging Is Hard
GPU shaders have no console.log. WebGPU debugging today looks like:
- Chrome DevTools' WebGPU panel — inspect buffers, bind groups, dispatches.
- Write values to an output buffer and read them back on CPU ("printf debugging").
- Tools like wgpu-inspect for WGSL — still rough.
There is no mature debugger comparable to cuda-gdb.
10.3 Non-Determinism
The order of float summations can vary with workgroup scheduling. Do not expect bit-identical outputs for the same input. ML inference may produce different tokens for the same prompt and model — even with sampling temperature 0.
10.4 Mobile Variability
Mobile GPUs have tighter limits than desktops. A shader that runs on one phone can fail on another. Always:
- Read `adapter.limits` for workgroup-size and memory limits.
- Provide a fallback path.
- Test on real devices (BrowserStack and physical hardware).
10.5 Security
WebGPU is a large attack surface. Several information-disclosure CVEs were reported between 2023 and 2025. Browsers already isolate strongly, but sensitive domains should still review how GPU access is exposed to their pages.
10.6 Shader Source Is Not Protected
WGSL source ships to the client in plain text. Do not bury algorithmic trade secrets in it.
Epilogue — Get Started with Compute Shaders in Two Hours
Start Checklist
- Guard for WebGPU support: `if (!navigator.gpu)`.
- Create the adapter and device once and reuse them (one per page).
- Cache shader modules and pipelines; never recompile the same shader.
- Pool buffers; reuse instead of allocating a new one per call.
- Staging buffers and `mapAsync` are async; remember the `await`.
- Start with a workgroup size of 64 or 256.
- The first call is a warm-up; measure from call 5 onward.
- Guard on the f16 feature before using it.
Anti-Patterns
- Recreating the device or adapter every call — expensive and pointless.
- GPU↔CPU ping-pong — don't read back to CPU every iteration. Stay on the GPU.
- Workgroup size 1024+ — rejected on mobile.
- Shaders full of branches — kill GPU occupancy. Send branchy code to the CPU.
- Assuming f16 is available: without a feature guard, the `requestDevice` call simply rejects on devices that lack it.
- A single massive shader — slow to compile. Split into stages.
- Calling mapAsync on every result — async/await stalls the GPU pipeline. Batch it.
Coming Next
- WGSL Deep Dive — writing an attention kernel from scratch, the workgroup tricks behind flash attention.
- WebGPU Debugging and Profiling in Practice — Chrome DevTools, timestamp queries, occupancy measurement.
- Inside WebLLM — anatomy of how TVM compiles a model into WGSL.
References
- WebGPU Standard — W3C
- WGSL Standard — W3C
- MDN — WebGPU API
- WebGPU Fundamentals
- WebGPU Samples — webgpu.github.io
- WebLLM — MLC
- Web Stable Diffusion
- Chrome WebGPU release notes
- Safari 26 release notes — WebKit
- Firefox WebGPU tracker — Bugzilla
- WebGPU Best Practices — Brandon Jones
- Surma — WebGPU compute intro
- Babylon.js Compute Shaders docs