Real GPU Compute in the Browser — A Hands-On Guide to WebGPU Compute Shaders and WGSL in 2026

Prologue — The Web Finally Gets to Tell the GPU What to Do

For years we accepted one lie: "browsers can't do GPU compute." WebGL had shaders but they were shaders trapped inside the graphics pipeline. Running arbitrary data through arbitrary workgroups and reading the result back — the thing we'd taken for granted in CUDA — was not possible on the web. So ML inference ran on CPU via WebAssembly, or used trickery like ONNX.js that abused WebGL into pretending to be compute.

That picture began to change when Chrome 113 shipped WebGPU to desktop stable in 2023. Firefox and Safari took longer: Firefox 141 enabled it on Windows in July 2025, and Safari 26 shipped it on macOS Tahoe and iOS in September 2025, putting WebGPU effectively at Baseline. Chrome, Edge, Firefox and Safari now ship it on by default on their main platforms; global coverage is around 95%.

This post is not about graphics. It is about compute — the same GPU, often sharing the same page, running a very different kind of pipeline.

A matrix multiply that takes 200ms on the CPU can take about 2ms on the GPU.
WebLLM runs a quantized Llama 3 8B on a user's laptop GPU at 30+ tok/s.
Applying a Gaussian blur to a thousand images can take wasm-SIMD 4s and WebGPU 0.4s.

All of this happens inside the browser, with zero server cost. That is what compute shaders promise. This post covers, in one read, how to take them up on it.

1. Why GPU Compute in the Browser

Start with motivation. Three reasons to run GPU compute in the browser.

1.1 Zero Server Cost

LLM inference on cloud GPUs is 5–10 USD per H100 hour. With 100 users it adds up fast. Move the inference to the user's device GPU and your server cost is zero. The weights download once and cache.

1.2 Privacy

Classifying, summarizing, or embedding medical images, personal photos, or private text does not require sending anything to a server. The work happens on the GPU behind the browser, and the data never leaves the device.

1.3 Latency

You eliminate a 50–200ms round trip to a server. Live video filters, interactive ML, real-time simulation — not having that round trip is fundamentally faster.

These reasons are not new. What is new is that the tools are finally ready.

2. WebGPU vs WebGL — Where Compute Lives

Up through WebGL 2, every shader lives inside the graphics pipeline. Vertex shaders see vertices, fragment shaders see pixels. To process arbitrary data you had to disguise it as a texture and push it through the fragment shader: the famous "GPGPU hack".

WebGPU is different. Compute shaders are first-class citizens.

Aspect	WebGL 2	WebGPU
Compute shaders	None (hacks only)	First-class
Workgroups / shared memory	None	Yes
Storage buffer read/write	Texture tricks	Native
Shader language	GLSL ES 3.0	WGSL
API style	Global state machine	Explicit command buffers
Multi-threaded command encoding	Not possible	Possible
fp16	Extension (fragile)	Extension (when enabled)

The second row is the heart of it. WebGPU gives you workgroups and shared memory. Without those, half of what we call "parallel compute" — reductions, scans, prefix sums and other cooperative algorithms — can only be faked. They require the fast shared memory inside a workgroup.

3. The Compute Model — Workgroups, Invocations, Dispatch

A GPU has thousands of ALUs. The way you hand work to those thousands is the compute model.

3.1 Three Layers

Dispatch
  |
  +-- Workgroup #0
  |     |
  |     +-- Invocation (0,0,0)
  |     +-- Invocation (1,0,0)
  |     +-- ...
  |     +-- Invocation (63,0,0)
  +-- Workgroup #1
  |     +-- ...
  +-- Workgroup #N

Invocation: the smallest unit running shader code. Equivalent to a CUDA thread.
Workgroup: a batch of invocations running together on the same SM (or CU). Equivalent to a CUDA block. Invocations within a workgroup can cooperate via fast shared memory and barriers.
Dispatch: a 3D grid of workgroups. Equivalent to a CUDA grid.

The workgroup size is baked into the shader at compile time. In WGSL you write it like this:

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  // ...
}

3.2 What's a Good Workgroup Size?

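64 or 256 are safe defaults. Mobile GPUs handle them well.
Too big (over 256 invocations in total) exceeds WebGPU's default limits and fails validation unless the device was created with higher limits; sizes of 1024 and up are rejected outright on some devices.
Too small (under 32) leaves the GPU starved with low occupancy.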

3.3 Global ID

Each invocation knows its own ID. The global_invocation_id is a 3D coordinate inside the whole dispatch. For a 1D array, only gid.x matters.
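To make the 1D case concrete, here is a small CPU-side sketch in TypeScript of how many workgroups a dispatch needs and how the x component of global_invocation_id comes out. The helper names are illustrative and not part of the WebGPU API.

```typescript
// Illustrative helpers (hypothetical names, not WebGPU API).

/** Number of workgroups needed so that every element gets one invocation. */
function workgroupCount(elementCount: number, workgroupSize: number): number {
  return Math.ceil(elementCount / workgroupSize);
}

/** x component of global_invocation_id for a 1D dispatch. */
function globalIdX(workgroupIdX: number, localIdX: number, workgroupSize: number): number {
  return workgroupIdX * workgroupSize + localIdX;
}

// 1000 elements with 64 invocations per workgroup need 16 workgroups.
// That launches 1024 invocations, so the last 24 must be skipped by a bounds check.
const groups = workgroupCount(1000, 64);
const lastId = globalIdX(groups - 1, 63, 64);
console.log(groups, lastId); // 16 1023
```

Because the dispatch is rounded up, a shader that indexes an array with gid.x generally needs a guard such as comparing gid.x against the element count before reading or writing.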

4. A Crash Course in WGSL

WGSL is WebGPU's shading language. It picks the good parts of GLSL and HLSL and gives them a Rust-flavoured syntax.

4.1 Types

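// Runtime-sized array (storage buffers are declared at module scope)
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

fn type_examples() {
  // Scalars
  var x: i32 = 42;
  var y: u32 = 42u;
  var z: f32 = 3.14;
  var b: bool = true;

  // Vectors
  var v: vec4<f32> = vec4<f32>(1.0, 2.0, 3.0, 4.0);
  var i: vec3<i32> = vec3<i32>(1, 2, 3);

  // Matrices are built from column vectors (identity shown here)
  var m: mat4x4<f32> = mat4x4<f32>(
    vec4<f32>(1.0, 0.0, 0.0, 0.0),
    vec4<f32>(0.0, 1.0, 0.0, 0.0),
    vec4<f32>(0.0, 0.0, 1.0, 0.0),
    vec4<f32>(0.0, 0.0, 0.0, 1.0),
  );
}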

4.2 Storage Buffers and Bind Groups

Inputs and outputs in WGSL are buffers. Unlike graphics shaders there are no vertex attributes or textures by default — just arrays that live in GPU memory.

// read-only input
@group(0) @binding(0) var<storage, read> input: array<f32>;
// read_write output
@group(0) @binding(1) var<storage, read_write> output: array<f32>;
// small constants, fixed across one dispatch
@group(0) @binding(2) var<uniform> params: Params;

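The group and binding numbers line up with a GPUBindGroupLayout on the JS side. A declaration with @group(0) @binding(0) receives whichever resource the bind group set at index 0 lists under binding: 0; matching is by number, not by the order in which buffers were created.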
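As a sketch of what that matching looks like, here is the layout description that would correspond to the three declarations above, written as a plain TypeScript object. The shape follows the entries of a bind group layout descriptor; the local constant COMPUTE_STAGE and the interface name are stand-ins introduced here so the snippet runs outside a browser.

```typescript
// Stand-in for GPUShaderStage.COMPUTE (numeric value 0x4) so this runs without a browser.
const COMPUTE_STAGE = 0x4;

type BufferBindingType = "uniform" | "storage" | "read-only-storage";

interface LayoutEntry {
  binding: number;    // matches @binding(n) in WGSL
  visibility: number; // shader stages that may see the resource
  buffer: { type: BufferBindingType };
}

// Layout for @group(0): one entry per WGSL declaration, matched by binding number.
const group0Layout: { entries: LayoutEntry[] } = {
  entries: [
    { binding: 0, visibility: COMPUTE_STAGE, buffer: { type: "read-only-storage" } }, // var<storage, read>
    { binding: 1, visibility: COMPUTE_STAGE, buffer: { type: "storage" } },           // var<storage, read_write>
    { binding: 2, visibility: COMPUTE_STAGE, buffer: { type: "uniform" } },           // var<uniform>
  ],
};

console.log(group0Layout.entries.map((e) => `${e.binding}:${e.buffer.type}`).join(", "));
// 0:read-only-storage, 1:storage, 2:uniform
```

In a browser, an object of this shape could be handed to device.createBindGroupLayout. Letting the pipeline derive the layout from the shader is the shorter alternative when explicit control is not needed.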

4.3 Workgroup Shared Memory

var<workgroup> shared_data: array<f32, 64>;

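All 64 invocations in a workgroup share the same 64-element array. Workgroup memory sits on-chip and is typically far faster than global (storage) memory; figures around 100x are often quoted, but the real gap depends on the hardware and the access pattern. It is the crucial resource for reductions, scans, blurs and other cooperative algorithms.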

4.4 Barriers

Synchronization is required so other invocations see your writes.

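workgroupBarrier();   // wait until every invocation in the workgroup reaches here; earlier workgroup-memory writes are visible afterwards
storageBarrier();     // same idea for storage-buffer accesses; it still synchronizes only within one workgroup, not across the dispatch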

4.5 Built-ins

Common built-in inputs.

global_invocation_id: 3D coordinates inside the whole dispatch.
local_invocation_id: 3D coordinates inside the workgroup.
workgroup_id: 3D coordinates of the workgroup inside the dispatch.
local_invocation_index: 1D index inside the workgroup (0..workgroup_size-1).
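The 1D index is just the 3D local ID flattened in x-major order. A minimal TypeScript sketch of that relationship, using a hypothetical helper name:

```typescript
type Vec3 = [number, number, number];

/** local_invocation_index as WGSL defines it: x + y * size_x + z * size_x * size_y. */
function localInvocationIndex(localId: Vec3, workgroupSize: Vec3): number {
  const [x, y, z] = localId;
  const [sizeX, sizeY] = workgroupSize;
  return x + y * sizeX + z * sizeX * sizeY;
}

// With @workgroup_size(16, 16), local ID (3, 2, 0) flattens to 3 + 2 * 16 = 35.
console.log(localInvocationIndex([3, 2, 0], [16, 16, 1])); // 35
```

This is handy when a 2D workgroup wants to address a flat workgroup-memory array.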

5. Your First Compute Shader — A Parallel Sum

Enough theory. Let's actually run something: take an array of length N and compute the sum of every element on the GPU. It is the classic "first kernel."

5.1 WGSL — Two-Stage Reduction

A parallel sum is hard to finish in a single dispatch because there is no shared memory between workgroups. So we do it in two stages.

Stage A: every workgroup computes a partial sum over its slice and writes it to partial_sums[wid.x], where wid is the workgroup_id.
Stage B: reduce partial_sums once more (or on the host) to get the final total.

// reduce.wgsl — Stage A
const WG_SIZE: u32 = 64u;

@group(0) @binding(0) var<storage, read>       input:        array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;
@group(0) @binding(2) var<uniform>             params:       Params;

struct Params { n: u32 };

var<workgroup> sdata: array<f32, WG_SIZE>;

@compute @workgroup_size(WG_SIZE)
fn main(
  @builtin(global_invocation_id) gid: vec3<u32>,
  @builtin(local_invocation_id) lid: vec3<u32>,
  @builtin(workgroup_id) wid: vec3<u32>,
) {
  let i = gid.x;
  let tid = lid.x;

  // 1) Load our global element into shared memory (0 if out of range)
  var v: f32 = 0.0;
  if (i < params.n) { v = input[i]; }
  sdata[tid] = v;
  workgroupBarrier();

  // 2) Tree-reduce inside the workgroup
  var stride: u32 = WG_SIZE / 2u;
  loop {
    if (stride == 0u) { break; }
    if (tid < stride) {
      sdata[tid] = sdata[tid] + sdata[tid + stride];
    }
    workgroupBarrier();
    stride = stride / 2u;
  }

  // 3) Only invocation 0 of each workgroup writes the result
  if (tid == 0u) {
    partial_sums[wid.x] = sdata[0];
  }
}

Three key patterns:

Load once from global into workgroup memory (sdata).
Tree reduction — halve the stride each step, summing pairs.
A workgroupBarrier() between steps to keep memory consistent.
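The tree reduction is easier to follow when simulated on the CPU. The sketch below mirrors the shader's loop in TypeScript: the outer loop is one barrier-separated step, and the inner loop stands in for the invocations that run in parallel on the GPU. It is an illustration of the pattern, not a replacement for the shader.

```typescript
/**
 * CPU simulation of the tree reduction one workgroup performs.
 * Each iteration of the outer loop corresponds to one barrier-separated step.
 */
function treeReduce(values: Float32Array): number {
  const n = values.length;
  if (n === 0 || (n & (n - 1)) !== 0) {
    throw new Error("length must be a power of two, like WG_SIZE");
  }
  const sdata = new Float32Array(values); // copy; stands in for var<workgroup> sdata
  for (let stride = n >> 1; stride > 0; stride >>= 1) {
    // On the GPU, every invocation with tid < stride runs this line in parallel.
    // Reads at tid + stride are never written in the same step, so order does not matter.
    for (let tid = 0; tid < stride; tid++) {
      sdata[tid] = sdata[tid] + sdata[tid + stride];
    }
    // workgroupBarrier() would sit here.
  }
  return sdata[0];
}

const demo = new Float32Array(64);
for (let i = 0; i < demo.length; i++) demo[i] = i + 1; // 1..64
console.log(treeReduce(demo)); // 2080
```

With 64 elements the loop runs six steps (strides 32, 16, 8, 4, 2, 1), which is why the workgroup size should be a power of two for this version of the algorithm.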

5.2 The JS Driver

The JS side of WebGPU is noisy on first sight. But the parts are simple.

// reduce.ts
async function gpuSum(input: Float32Array): Promise<number> {
  // 1) Adapter and device
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('No WebGPU adapter');
  const device = await adapter.requestDevice();

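  const N = input.length;
  const WG = 64; // must match WG_SIZE in the shader
  const numWorkgroups = Math.ceil(N / WG);
  // One dispatch dimension is limited (65535 by default), so very large inputs
  // need a 2D dispatch or a strided loop in the shader instead.
  if (numWorkgroups > device.limits.maxComputeWorkgroupsPerDimension) {
    throw new Error('Input too large for a 1D dispatch');
  }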

  // 2) Buffers
  const inputBuf = device.createBuffer({
    size: input.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(inputBuf, 0, input);

  const partialBuf = device.createBuffer({
    size: numWorkgroups * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });

  const uniformBuf = device.createBuffer({
    size: 16, // padded to 16
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(uniformBuf, 0, new Uint32Array([N, 0, 0, 0]));

  // 3) Shader and pipeline (WGSL_SOURCE is the reduce.wgsl text from 5.1, held as a string)
  const module = device.createShaderModule({ code: WGSL_SOURCE });
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module, entryPoint: 'main' },
  });

  // 4) Bind group
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: inputBuf } },
      { binding: 1, resource: { buffer: partialBuf } },
      { binding: 2, resource: { buffer: uniformBuf } },
    ],
  });

  // 5) Dispatch
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(numWorkgroups);
  pass.end();

  // 6) Staging buffer to read back to CPU
  const stagingBuf = device.createBuffer({
    size: numWorkgroups * 4,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  encoder.copyBufferToBuffer(partialBuf, 0, stagingBuf, 0, numWorkgroups * 4);
  device.queue.submit([encoder.finish()]);

  // 7) Read
  await stagingBuf.mapAsync(GPUMapMode.READ);
  const partials = new Float32Array(stagingBuf.getMappedRange().slice(0));
  stagingBuf.unmap();

  // 8) Final reduction on CPU (one pass is enough if numWorkgroups is small)
  let total = 0;
  for (let i = 0; i < partials.length; i++) total += partials[i];
  return total;
}

It looks long on first read, but it splits into three parts, plus the final CPU sum in step 8.

Setup (1–4): adapter, device, buffers, pipeline, bind group.
Dispatch (5): encode commands, submit to the queue.
Readback (6–7): copy to a staging buffer, map async, read.
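One practical wrinkle: a single dispatch dimension is capped by the maxComputeWorkgroupsPerDimension limit, whose default is 65535, and 8M elements at 64 per workgroup already need more than that. A common workaround is to spread the workgroups over a 2D grid. The helper below is a sketch with a hypothetical name; it assumes the shader is changed to rebuild a flat workgroup index from workgroup_id and num_workgroups and to keep its bounds check.

```typescript
/** Default value of the maxComputeWorkgroupsPerDimension limit in WebGPU. */
const DEFAULT_MAX_WORKGROUPS_PER_DIM = 65535;

/**
 * Splits a 1D workgroup count into an (x, y) grid that respects the per-dimension limit.
 * x * y may slightly exceed `total`, so the shader still needs its bounds check.
 */
function splitDispatch(
  total: number,
  maxPerDim: number = DEFAULT_MAX_WORKGROUPS_PER_DIM,
): { x: number; y: number } {
  if (total <= maxPerDim) return { x: total, y: 1 };
  const y = Math.ceil(total / maxPerDim);
  if (y > maxPerDim) throw new Error("workload too large for a 2D dispatch");
  const x = Math.ceil(total / y);
  return { x, y };
}

const elementCount = 8 * 1024 * 1024;                // 8M elements
const totalGroups = Math.ceil(elementCount / 64);    // 131072 workgroups
console.log(splitDispatch(totalGroups));             // { x: 43691, y: 3 }
```

The result would be passed as pass.dispatchWorkgroups(x, y), with the shader computing its flat workgroup index as wid.y * num_workgroups.x + wid.x.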

5.3 Performance — Is It Actually Fast?

Reference measurements on an M2 Pro with Chrome 127, N = 8M floats:

Implementation	Time (ms)
Plain JS `for` loop	38
`Float32Array.reduce`	36
Wasm-SIMD (4-wide)	11
WebGPU compute	1.8

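That is about 20x faster than the plain JS loop and about 6x faster than Wasm-SIMD. The gap widens as N grows: at N = 64M the GPU is roughly 100x ahead. Note that 64M floats occupy 256MB, which is above the default 128MB maxStorageBufferBindingSize, so an input that large needs a raised limit or chunking.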
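Numbers like these depend heavily on how they are measured. The sketch below shows one way to time an async function while discarding warm-up runs; the helper name and the default run counts are assumptions made for illustration. For GPU work, the timed function should resolve only after the result has been read back (or after device.queue.onSubmittedWorkDone() resolves), otherwise only the submission is measured.

```typescript
/**
 * Times an async function, discards warm-up runs, and returns the median in milliseconds.
 * Hypothetical helper for illustration.
 */
async function medianTimeMs(
  fn: () => Promise<unknown>,
  warmup: number = 4,
  runs: number = 9,
): Promise<number> {
  for (let i = 0; i < warmup; i++) await fn();
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Usage with a CPU baseline; a gpuSum-style function could be passed the same way.
const benchData = new Float32Array(1 << 20).fill(1);
medianTimeMs(async () => {
  let total = 0;
  for (let i = 0; i < benchData.length; i++) total += benchData[i];
  return total;
}).then((ms) => console.log(`median: ${ms.toFixed(3)} ms`));
```

Taking the median rather than the mean keeps a single slow run, for example one that triggers shader compilation, from dominating the result.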

6. Real Applications — Where Compute Actually Runs

6.1 In-Browser LLM Inference — WebLLM

WebLLM from the MLC team is the best-known example. It runs models like Llama 3 8B, Qwen 2.5 and Phi 3.5 — quantized — through WebGPU compute shaders. On an M2 Pro a 4-bit Llama 3 8B does 30+ tok/s. Zero server cost.

Under the hood, MLC-LLM's TVM compiles the model graph into a sequence of WGSL compute shaders. You will rarely write these by hand, but every one of them — attention, matmul, softmax — is a variation of the patterns shown above.

6.2 Browser Stable Diffusion

Web Stable Diffusion runs SD 1.5 and SDXL Turbo on WebGPU. Models of this class, roughly 1B parameters for SD 1.5 and about 3.5B for SDXL, produce 1024x1024 images on a user's laptop in 10–30 seconds. That is not server-GPU fast, but it comes with zero server cost, full privacy, and offline capability.

6.3 GPU-Accelerated Dataframes

Successors to GPU.js and projects like WebDF reimplement pandas/Polars-style dataframe operations as WebGPU compute. A 10-million-row group-by aggregation that costs 400ms on CPU costs 25ms on GPU.

6.4 Parallel Image Filters

The most immediately satisfying application. Gaussian blur, edge detection, colour-space conversion run on every pixel at once. Use a 2D workgroup size like 8x8 or 16x16 for 2D stencil work.

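// Bindings are assumptions for this sketch; adjust formats to your textures.
@group(0) @binding(0) var input_tex: texture_2d<f32>;
@group(0) @binding(1) var output_tex: texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(16, 16)
fn blur(@builtin(global_invocation_id) gid: vec3<u32>) {
  let dims = vec2<i32>(textureDimensions(input_tex));
  let px = vec2<i32>(gid.xy);
  // The dispatch is rounded up, so skip invocations outside the image
  if (px.x >= dims.x || px.y >= dims.y) { return; }

  // 3x3 mean; a real Gaussian only differs by weights
  var sum: vec4<f32> = vec4<f32>(0.0);
  for (var dy: i32 = -1; dy <= 1; dy = dy + 1) {
    for (var dx: i32 = -1; dx <= 1; dx = dx + 1) {
      // Clamp neighbours to the image edge
      let p = clamp(px + vec2<i32>(dx, dy), vec2<i32>(0), dims - vec2<i32>(1));
      sum = sum + textureLoad(input_tex, p, 0);
    }
  }
  textureStore(output_tex, px, sum / 9.0);
}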

Filters like this can run 10–50x faster than wasm-SIMD.
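When porting a filter to the GPU it helps to keep a CPU reference around for checking results on small inputs. The TypeScript sketch below is such a reference for a 3x3 mean on a single-channel image. Clamping at the edges is an assumption made here; the reference is only useful if it matches whatever edge handling the shader uses.

```typescript
/**
 * CPU reference for a 3x3 mean filter on a single-channel image,
 * clamping neighbour coordinates at the image edges.
 */
function boxBlur3x3(src: Float32Array, width: number, height: number): Float32Array {
  const dst = new Float32Array(src.length);
  const clamp = (v: number, lo: number, hi: number): number => Math.min(Math.max(v, lo), hi);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let sum = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          const sx = clamp(x + dx, 0, width - 1);
          const sy = clamp(y + dy, 0, height - 1);
          sum += src[sy * width + sx];
        }
      }
      dst[y * width + x] = sum / 9;
    }
  }
  return dst;
}

// In a 3x3 image, every pixel's clamped neighbourhood contains the centre exactly once,
// so a single centre value of 9 becomes 1 everywhere.
const img = new Float32Array([0, 0, 0, 0, 9, 0, 0, 0, 0]);
console.log(Array.from(boxBlur3x3(img, 3, 3))); // [1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Comparing GPU output against a reference like this with a small tolerance is usually more reliable than eyeballing images.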

6.5 Simulation — N-body, Fluids, Cloth

Physics simulation is a classic GPU application. The interactions of 1000–100000 particles need to be solved every frame. WebGPU compute can handle 10000+ particles at 60fps. Babylon.js and PlayCanvas already use it.

7. The Reality of Browser Support in 2026

Theory is fine, demos are pretty — but shipping is different. Where we actually are in May 2026 (caniuse, WebKit release notes and Mozilla trackers combined):

7.1 Status

Browser	State	Note
Chrome / Edge	Stable on desktop and Android	Desktop 113+ (May 2023), Android 121+ (Jan 2024)
Safari	Stable on macOS and iOS	26+ (Sep 2025)
Firefox	Stable on desktop	141+ on Windows (Jul 2025), other desktop platforms later
Firefox Android	Partial (flagged)	In progress
Older mobile GPUs	Rejected	Need adaptive fallback

Global coverage sits around 95%. The remaining 5% falls back to wasm-SIMD or plain JS.

7.2 fp16 — Not Universal Yet

Half-precision floats (f16 in WGSL via the shader-f16 extension) halve the memory footprint of large ML models but are not available on every device. Support depends on the adapter: Chrome exposes the feature on many desktop GPUs, while some integrated GPUs and many mobile GPUs still do not offer it.

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('No WebGPU adapter');
const hasF16 = adapter.features.has('shader-f16');
const device = await adapter.requestDevice({
  requiredFeatures: hasF16 ? ['shader-f16'] : [],
});

There are two fallbacks:

Drop to f32 — double the memory, similar speed.
INT8 quantization — smaller than fp16, slight accuracy loss.
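The feature check and the f32 fallback can be folded into one small decision function. The sketch below is one way to do it; the type and function names are hypothetical, and the argument is anything set-like, such as adapter.features in a browser.

```typescript
type Precision = "f16" | "f32";

interface DevicePlan {
  requiredFeatures: string[]; // pass to adapter.requestDevice({ requiredFeatures })
  precision: Precision;       // which shader variant and weight format to load
  bytesPerWeight: number;
}

/** Chooses f16 when the adapter offers it and falls back to f32 otherwise. */
function planPrecision(features: { has(name: string): boolean }): DevicePlan {
  if (features.has("shader-f16")) {
    // The f16 shader variant must also begin with the WGSL directive `enable f16;`.
    return { requiredFeatures: ["shader-f16"], precision: "f16", bytesPerWeight: 2 };
  }
  return { requiredFeatures: [], precision: "f32", bytesPerWeight: 4 };
}

console.log(planPrecision(new Set(["shader-f16"])).precision); // f16
console.log(planPrecision(new Set<string>()).precision);       // f32
```

Keeping the decision in one place makes it harder to request the feature on the device while accidentally loading the wrong shader variant.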

7.3 Workgroup Limits

Limits like limits.maxComputeWorkgroupSizeX differ by device. Safe defaults:

workgroup_size: 64 or 256
maxComputeInvocationsPerWorkgroup: 256 (the default limit; do not exceed it unless the adapter offers more and you request it through requiredLimits)
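Rather than hard-coding a size, the value can be derived from the limits the device reports. The following is a minimal sketch with a hypothetical helper; the interface lists only the two fields it reads, which carry the same names as the corresponding WebGPU limits.

```typescript
interface ComputeLimits {
  maxComputeWorkgroupSizeX: number;
  maxComputeInvocationsPerWorkgroup: number;
}

/** Largest power-of-two 1D workgroup size that is <= preferred and fits the limits. */
function pickWorkgroupSize(limits: ComputeLimits, preferred: number = 256): number {
  const cap = Math.min(
    preferred,
    limits.maxComputeWorkgroupSizeX,
    limits.maxComputeInvocationsPerWorkgroup,
  );
  let size = 1;
  while (size * 2 <= cap) size *= 2;
  return size;
}

// WebGPU's default limits allow 256; the second object is a hypothetical tighter device.
console.log(pickWorkgroupSize({ maxComputeWorkgroupSizeX: 256, maxComputeInvocationsPerWorkgroup: 256 })); // 256
console.log(pickWorkgroupSize({ maxComputeWorkgroupSizeX: 128, maxComputeInvocationsPerWorkgroup: 128 })); // 128
```

The chosen value could then reach the shader through a pipeline-overridable constant, that is, an override declaration in WGSL set via the constants field of the pipeline's compute stage, instead of being baked into the source.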

7.4 Memory Limits

limits.maxStorageBufferBindingSize defaults to 128MB; an adapter may support more, but only if you request it through requiredLimits. Big models must be chunked across buffers, which is why WebLLM splits model weights across many of them.
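Chunking itself is simple arithmetic. The sketch below plans chunk offsets and sizes for a blob of weights; the helper name is hypothetical, and it says nothing about how WebLLM actually lays out its buffers. The 256-byte alignment matches the default minStorageBufferOffsetAlignment, which matters when chunks are bound as offsets into a larger buffer.

```typescript
/** Default maxStorageBufferBindingSize in WebGPU: 128 MiB. */
const DEFAULT_MAX_BINDING = 128 * 1024 * 1024;

interface Chunk {
  offset: number; // byte offset of the chunk in the original data
  size: number;   // byte length of the chunk
}

/** Splits totalBytes into chunks no larger than maxBinding, each starting on an aligned offset. */
function planChunks(
  totalBytes: number,
  maxBinding: number = DEFAULT_MAX_BINDING,
  align: number = 256,
): Chunk[] {
  const chunkSize = Math.floor(maxBinding / align) * align;
  if (chunkSize <= 0) throw new Error("maxBinding must be at least one alignment unit");
  const chunks: Chunk[] = [];
  for (let offset = 0; offset < totalBytes; offset += chunkSize) {
    chunks.push({ offset, size: Math.min(chunkSize, totalBytes - offset) });
  }
  return chunks;
}

// 300 MiB of weights under the default limit: two full chunks and a 44 MiB remainder.
const MiB = 1024 * 1024;
console.log(planChunks(300 * MiB).map((c) => c.size / MiB)); // [ 128, 128, 44 ]
```

A shader that reads chunked weights then needs to know which chunk an index falls into, which is part of the cost of working within the limit.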

7.5 Compile Time

WGSL shaders compile on first use. A large shader (a 1000-line attention kernel) can take 500ms to 2s. Bundling a warm-up phase with model loading is the standard pattern.

8. WebGPU + Wasm-SIMD Hybrid Pipelines

GPUs are not fast at everything. Small arrays and branchy code run better on the CPU. Real pipelines often mix the two.

8.1 GPU vs CPU — When

Pattern	GPU wins	CPU (wasm-SIMD) wins
Data size	100K+ elements	Under 1K
Branching	Same path	Diverse paths
Memory pattern	Coalesced	Random access
First-use cost	After compile	Immediate
Frequent readback to CPU	Expensive	Free

8.2 The Hybrid Pattern

A typical ML pipeline like WebLLM looks like this.

[tokenizer]  <- wasm-SIMD (small code, branchy)
     |
     v
[embed]      <- GPU (big matrix multiplies)
     |
     v
[attention]  <- GPU (matmul + softmax)
     |
     v
[FFN]        <- GPU (big matmuls)
     |
     v
[sampler]    <- CPU or GPU (small tensors, branchy)
     |
     v
[detokenizer] <- wasm-SIMD

Small branchy code like tokenizers and detokenizers stays on CPU. Big uniform code like matmuls goes on GPU. Minimize the number of GPU↔CPU transfers — once data is on the GPU, do as much as possible there.

8.3 Wasm and WebGPU Talking

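As of May 2026, Wasm memory and GPU buffers cannot be shared directly. Data moves by copying, for example with queue.writeBuffer on the way in and a mapped staging buffer on the way out, and SharedArrayBuffer, which threaded Wasm relies on, additionally requires cross-origin isolation. On discrete GPUs those copies are bounded by PCIe bandwidth, so staying on the GPU as long as possible is the answer.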

9. Decision Matrix

When you're building a new feature, where should it run?

Condition	Recommended	Why
Data under 1MB, simple op	Plain JS	Setup cost beats the work itself
Data 100MB+, uniform op	WebGPU	The GPU's home turf
Small data, complex branching	Wasm-SIMD	CPU does this well
ML inference, 50MB+ model	WebGPU (WebLLM etc.)	Matmuls everywhere
Real-time image processing	WebGPU	Same op per pixel
Dataframe group-by	Depends on size	GPU once you're past 1M rows
Device does not support WebGPU	Wasm-SIMD fallback	The 5% safety net
First response must arrive within 1s	Start with JS or wasm	GPU compile warm-up
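The matrix can be encoded as a small routing function, which makes the implicit priorities explicit. The version below is only one possible reading: the 1 MB threshold comes from the table, while the rule order, the field names, and the treatment of sizes between 1 MB and 100 MB are assumptions.

```typescript
type Backend = "js" | "wasm-simd" | "webgpu";

interface Workload {
  bytes: number;                // size of the data to process
  branchy: boolean;             // control flow differs a lot between elements
  hasWebGPU: boolean;           // e.g. the result of a navigator.gpu check
  needFirstResultFast: boolean; // first response needed within about 1 s
}

const MB = 1024 * 1024;

/** One possible encoding of the decision matrix (rule order is an assumption). */
function chooseBackend(w: Workload): Backend {
  if (!w.hasWebGPU) return w.bytes < MB && !w.branchy ? "js" : "wasm-simd";
  if (w.branchy) return "wasm-simd";
  if (w.bytes < MB) return "js";
  if (w.needFirstResultFast) return "wasm-simd"; // warm the GPU path up in the background
  return "webgpu";
}

console.log(chooseBackend({ bytes: 200 * MB, branchy: false, hasWebGPU: true, needFirstResultFast: false })); // webgpu
console.log(chooseBackend({ bytes: 10_000, branchy: true, hasWebGPU: true, needFirstResultFast: false }));    // wasm-simd
```

In practice the thresholds are worth measuring per application rather than taking from a table.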

10. Limits and Pitfalls

It's not all roses. Things to be aware of in May 2026.

10.1 The First Dispatch Is Slow

Shader compile, pipeline creation, buffer allocation all happen on first use. For a single short job, the GPU setup costs more than the job itself. Only worth it for repeated use or large workloads.

10.2 Debugging Is Hard

GPU shaders have no console.log. WebGPU debugging today looks like:

The WebGPU Inspector browser extension, which adds a panel to DevTools: inspect buffers, bind groups, dispatches.
Write values to an output buffer and read them back on CPU ("printf debugging").
Tools like wgpu-inspect for WGSL — still rough.

There is no mature debugger comparable to cuda-gdb.

10.3 Non-Determinism

The order of float summations can vary with workgroup scheduling. Do not expect bit-identical outputs for the same input. ML inference may produce different tokens for the same prompt and model — even with sampling temperature 0.
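The root cause is that floating-point addition is not associative. The effect can be reproduced on the CPU with Math.fround, which rounds to the nearest 32-bit float; the snippet below is a small demonstration with values chosen so the difference is easy to see.

```typescript
// Math.fround rounds to the nearest 32-bit float, which lets us mimic f32 arithmetic.
const f32add = (x: number, y: number): number =>
  Math.fround(Math.fround(x) + Math.fround(y));

const a = 1e8;
const b = -1e8;
const c = 1;

const leftToRight = f32add(f32add(a, b), c); // (a + b) + c
const rightToLeft = f32add(a, f32add(b, c)); // a + (b + c)

// Near 1e8 adjacent f32 values are 8 apart, so b + c rounds back to -1e8.
console.log(leftToRight, rightToLeft); // 1 0
```

A GPU reduction that combines partial sums in a different order from run to run can hit the same effect, just at a much smaller scale for typical data.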

10.4 Mobile Variability

Mobile GPUs have tighter limits than desktops. A shader that runs on one phone can fail on another. Always:

Read adapter.limits for workgroup-size and memory limits.
Provide a fallback path.
Test on real devices (BrowserStack and physical hardware).

10.5 Security

WebGPU is a large attack surface. Several information-disclosure CVEs were reported between 2023 and 2025. Browsers already isolate strongly, but sensitive domains should still review carefully where, and to which scripts, they expose GPU compute.

10.6 Shader Source Is Not Protected

WGSL source ships to the client in plain text. Do not bury algorithmic trade secrets in it.

Epilogue — Get Started with Compute Shaders in Two Hours

Start Checklist

Guard for WebGPU support: if (!navigator.gpu).
Create adapter and device once and reuse (one per page).
Cache shader modules and pipelines — never recompile the same shader.
Pool buffers — reuse instead of allocating a new one per call.
Staging buffers and mapAsync are async — remember the await.
Start with a workgroup size of 64 or 256.
The first call is a warm-up — measure from call 5 onward.
Guard on the f16 feature before using it.
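The caching item is the easiest to get wrong, so here is a minimal sketch of a cache keyed by shader source. The class name is hypothetical, and the factory is whatever builds the expensive object, for example a function that creates a shader module and a compute pipeline from the source string.

```typescript
/** Minimal cache keyed by shader source text. */
class PipelineCache<T> {
  private readonly entries = new Map<string, T>();
  private readonly create: (source: string) => T;

  constructor(create: (source: string) => T) {
    this.create = create;
  }

  /** Returns the cached object for this source, building it on first use. */
  get(source: string): T {
    let entry = this.entries.get(source);
    if (entry === undefined) {
      entry = this.create(source);
      this.entries.set(source, entry);
    }
    return entry;
  }
}

// Usage with a stand-in factory that counts how often it runs.
const stats = { builds: 0 };
const cache = new PipelineCache((source: string) => {
  stats.builds++;
  return { source }; // stand-in for a GPUComputePipeline
});
const first = cache.get("shader A");
const second = cache.get("shader A");
console.log(first === second); // true
```

Keying by the full source string is the simplest option; a hash or a hand-picked name would work as well if the sources are large.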

Anti-Patterns

Recreating the device or adapter every call — expensive and pointless.
GPU↔CPU ping-pong — don't read back to CPU every iteration. Stay on the GPU.
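Workgroup sizes above 256: over the default limit, so they fail validation unless a higher limit was requested, and 1024+ is rejected on many mobile GPUs.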
Shaders full of branches — kill GPU occupancy. Send branchy code to the CPU.
Assuming f16 is available — without a feature guard you explode.
A single massive shader — slow to compile. Split into stages.
Calling mapAsync on every result — async/await stalls the GPU pipeline. Batch it.

Coming Next

WGSL Deep Dive — writing an attention kernel from scratch, the workgroup tricks behind flash attention.
WebGPU Debugging and Profiling in Practice — Chrome DevTools, timestamp queries, occupancy measurement.
Inside WebLLM — anatomy of how TVM compiles a model into WGSL.