Load/Performance Testing Tools 2026 — Deep Dive on k6, Locust, Vegeta, Gatling, Artillery, JMeter (Beyond JMeter)


Prologue — "We can do 10k RPS" is usually a lie

Some company, the night before launch in 2026.

PM: "We can do 10k RPS, right?" Backend: "I ran it in JMeter. It passed." SRE: "What scenario?" Backend: "Constant 10k RPS for 5 minutes..." SRE: "What about the production traffic pattern? Did you do warm-up? What's p99?" Backend: "..."

This scene is still common in 2026. The tools got better; the measurement culture did not. "It passed" is the phrase, but what passed, under what distribution, how close to production — nobody answers. Then real traffic arrives in bursts, caches start cold, p99 breaks the SLO, and none of it correlates with the test that "passed."

The tools themselves have improved. The JMeter-centric world of the 2010s has shifted to one where k6 (Grafana Labs) is the de facto default, with Locust, Vegeta, Gatling, Artillery, and Bombardier each holding their own ground. wrk, wrk2, and autocannon are alive and well in the micro-benchmark niche, and non-HTTP protocols like gRPC and WebSocket have their own homes in ghz and fortio.

This post maps the 2026 landscape of load/performance testing tools. Where each tool sits, how the same scenario looks across them, what "a good load test" actually means, and how to pick honestly for your team.


1. Four purposes of load testing — name the target first

Before picking a tool, name the purpose. The same tool doesn't fit every purpose.

  1. Micro-benchmark — peak throughput and p99 latency of a single endpoint. "How fast is this hot path?" Good for catching regressions on small changes. wrk, autocannon, Bombardier, Vegeta.
  2. Load test — does the system stay inside SLO at the expected traffic level. Usually a steady RPS for some time. k6, Locust, Gatling, Artillery, JMeter.
  3. Stress test — find the limit. Where does it break, how does it break, can it recover. Same tools plus scenario design.
  4. Spike & soak — sudden surges (spike) and long runs (soak — catches memory or connection leaks). k6 and Locust express these scenarios well.

A fifth axis is chaos testing — injecting failure while traffic flows — usually a combination of a load tool and a chaos tool.

The core insight: "performance testing" is one phrase covering four different things. Pick your tool only after answering "what kind am I doing most often?" Dragging out JMeter for a micro-benchmark is overkill; using wrk for a complex scenario is undershooting.


2. Tool map 2026 — one table

| Tool | Language/script | Strengths | Weaknesses | Typical use |
| --- | --- | --- | --- | --- |
| k6 | JS (ES2015+), Go runtime | Modern default, rich output, cloud option, gRPC/WS/browser | Distributed runs need setup in OSS | The general default in 2026 |
| Locust | Python | Easy distributed mode, full Python code | Single-worker throughput limited | Python teams, complex user models |
| Vegeta | Go (CLI + lib) | One-liner runs, strong result analysis | Simple scenarios only | HTTP micro-bench, quick checks |
| Gatling | Scala/Java DSL | Scenario expressiveness, enterprise reports | Scala learning curve | Large, JVM-friendly orgs |
| Artillery | Node.js, YAML | Fast start, declarative YAML | Single-node limits at high load | Node teams, CI scenarios |
| wrk / wrk2 | C, Lua scripting | Very light, very fast HTTP bench | HTTP only, simple scenarios | Hot-path micro-bench |
| autocannon | Node.js | npm install and go | Mostly fits Node teams | Quick Node API bench |
| Bombardier | Go | Dead-simple, fast CLI | Almost no scenario | One-liner load checks |
| JMeter | Java, GUI/XML | Old library and plugin ecosystem | XML scenarios, dated UX | Enterprise, legacy assets |
| ghz | Go, CLI | gRPC-only, simple | gRPC only | gRPC service benchmarks |
| fortio | Go (Istio) | gRPC + HTTP, distribution analysis | Plain UI | Service mesh validation |

One-liner: in 2026, the "tool you pick up first" is usually k6. Python-friendly teams pick Locust. Vegeta (or wrk) for one-liner micro-benches. JMeter/Gatling for enterprise/JVM orgs and legacy assets. Artillery for Node teams in CI. gRPC goes to ghz or k6's gRPC module.


3. k6 — the 2026 default

Status (May 2026): under Grafana Labs. The k6 OSS binary is free; Grafana Cloud k6 is the paid option for distributed runs and dashboards. v0.5x stable releases ship steadily; the browser module (Playwright-compatible API), gRPC, WebSocket, and the xk6 extension ecosystem continue to grow.

Why it became the default:

  • JS-scriptable — most engineers can read and write it.
  • Single Go binary — easy to install, CPU-efficient (much higher single-worker throughput than Locust).
  • Rich output — avg/med/p90/p95 in the console summary by default (add p99 via summaryTrendStats). Exports to Prometheus, InfluxDB, Datadog.
  • scenario/executor model — constant-vus, ramping-arrival-rate, per-vu-iterations and friends are expressive.
  • Extensible — xk6 adds modules for SQL, Kafka, Redis, etc.
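
For a feel of the xk6 point above (a minimal sketch: xk6-sql is one of the published extensions, and the script name is a placeholder), an extension is compiled into a custom k6 binary rather than loaded at runtime:

# build a k6 binary with the SQL extension baked in, then run with it
xk6 build --with github.com/grafana/xk6-sql
./k6 run script.js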

Base script (we'll compare against it in section 4):

// k6 script: login + ramp 50 → 200 RPS for 5min
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  scenarios: {
    login_ramp: {
      executor: 'ramping-arrival-rate',
      startRate: 50,
      timeUnit: '1s',
      preAllocatedVUs: 50,
      maxVUs: 500,
      stages: [
        { target: 50, duration: '1m' },
        { target: 200, duration: '5m' },
        { target: 200, duration: '2m' },
      ],
    },
  },
  thresholds: {
    http_req_duration: ['p(99)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.post('https://api.example.com/login', JSON.stringify({
    user: 'demo',
    pass: 'pw'
  }), { headers: { 'Content-Type': 'application/json' } });
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}

A rough sense of Grafana Cloud k6 pricing (2026): the free tier offers ~50 VUh/month; Pro plans start around $299/month with more VUh and concurrency. Distributed and regional runs are the convenience play; the alternative is self-hosted distributed (multiple nodes running the same script, results aggregated).

Limits:

  • Distributed runs need setup in OSS — the grafana/k6-operator for Kubernetes exists, but it's operational overhead.
  • The JS runtime is Goja (a JavaScript engine written in Go), not V8. Some modern JS features still need Babel transforms.
  • Heavy data manipulation is safer done outside the scenario script.

4. Same scenario, different tools — POST /login + ramp 50 → 200 RPS

Side-by-side: k6 / Locust / Vegeta for the same shape.

k6

The script in section 3. The key bits: the ramping-arrival-rate executor and thresholds for p99 and error rate as code.

Locust (Python)

# locustfile.py
from locust import HttpUser, task, LoadTestShape, constant_throughput

class LoginUser(HttpUser):
    wait_time = constant_throughput(1)  # 1 req/sec per user

    @task
    def login(self):
        self.client.post(
            "/login",
            json={"user": "demo", "pass": "pw"},
            headers={"Content-Type": "application/json"},
        )

class RampShape(LoadTestShape):
    stages = [
        {"duration": 60, "users": 50, "spawn_rate": 50},
        {"duration": 360, "users": 200, "spawn_rate": 5},
        {"duration": 480, "users": 200, "spawn_rate": 0},
    ]

    def tick(self):
        run_time = self.get_run_time()
        for stage in self.stages:
            if run_time < stage["duration"]:
                return (stage["users"], stage["spawn_rate"])
        return None

Run: locust -f locustfile.py --headless -H https://api.example.com. Distributed via --master and --worker, with well-known Helm charts for Kubernetes.

Vegeta (CLI)

# step 1: 50 RPS for 1m
echo "POST https://api.example.com/login" | \
  vegeta attack -rate=50 -duration=60s -body=body.json \
                -header="Content-Type: application/json" \
  | vegeta report -type='hist[0,100ms,200ms,500ms,1s]'

# step 2: 200 RPS for 5m
echo "POST https://api.example.com/login" | \
  vegeta attack -rate=200 -duration=5m -body=body.json \
                -header="Content-Type: application/json" \
  | tee result.bin \
  | vegeta report
vegeta plot result.bin > plot.html

Vegeta keeps scenarios deliberately simple — a fixed rate for a fixed duration. Ramps usually come from chaining commands or driving from a shell script. That simplicity is its strength — one line measures, vegeta report -type=json and vegeta plot give you the distribution and timeseries.

Side-by-side observations:

  • Expressiveness: Locust (most freedom) ≈ k6 > Vegeta. Vegeta is simple on purpose.
  • CPU efficiency: k6 > Vegeta > Locust (single-worker; Locust compensates via distributed).
  • CI friendliness: k6 (threshold exit codes) ≈ Vegeta (grep reports). Locust runs headless in CI too.
  • Result visualization: k6 (Cloud k6 or own Prometheus) > Locust (web UI) > Vegeta (plot HTML).

5. Locust — the comfortable friend for Python teams

Status (2026): actively maintained, v2.x stable. Locust's strength is behavior modeling in code — represent what users do as classes/methods with @task weights.

When it fits:

  • Python-friendly team that wants to call data and abstraction libraries directly.
  • Complex user behavior (multi-page flows, stateful sessions).
  • Distributed runs needed but no SaaS — --master/--worker is genuinely simple.

Limits:

  • Single-worker throughput is lower than k6. Python's gevent concurrency is fine, but Go is lighter.
  • The built-in web UI is nice but doesn't match a k6 + Grafana stack.

Tip: Locust's killer feature is that distributed mode is easy. If you need 100k+ RPS, spinning up dozens of workers feels natural. Deploy via Helm to Kubernetes, scrape metrics into Prometheus — the pattern is familiar and operationally light.
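
A minimal sketch of that master/worker setup (the hostname and worker count are placeholders; the shape class in locustfile.py drives user counts):

# on the coordinator node
locust -f locustfile.py --master --headless -H https://api.example.com --expect-workers 4
# on each worker node
locust -f locustfile.py --worker --master-host=10.0.0.5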


6. Vegeta — the elegance of one line

Status (2026): maintained by a small set of contributors, stable and simple. New features are rare, and that's the point.

Why Vegeta survives:

  • One-line measurement — echo "GET https://..." | vegeta attack -rate=100 -duration=30s | vegeta report. What takes a file and a runner and a container mount with other tools is a single line here.
  • Accurate distribution analysis — vegeta report shows p50/p90/p95/p99/min/mean/max in one go. Histogram buckets are CLI flags.
  • Result serialization — the .bin format saves results for later reprocessing. CI can archive the bin and analyze it later.

Limits:

  • Scenario expressiveness is minimal — a fixed rate or a chained set of steps. Not for complex user simulations.
  • Distributed is "run the same command on multiple machines and merge bins" — manual.
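
The manual version, as a sketch (hostnames and file names are placeholders): run the same attack on every node, collect the .bin files, and let vegeta report merge them.

# on each load node
echo "POST https://api.example.com/login" | \
  vegeta attack -rate=100 -duration=5m -body=body.json > node-a.bin
# after copying all files to one place
vegeta report node-a.bin node-b.bin node-c.bin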

Typical use: precisely measuring one hot path's latency distribution, micro-bench regression in CI, "is the server alive right now" quick check.


7. Gatling — the JVM heavyweight

Status (2026): Gatling 3.x stable. Both Gatling Enterprise (paid, formerly FrontLine) and OSS are active. Scala DSL is the default; Java and Kotlin DSLs are first-class citizens now.

Why it's still picked:

  • DSL expressiveness — scenarios, chaining, assertions feel natural as code.
  • Reporting — clean HTML reports out of the box; Enterprise adds distributed and team workflows.
  • JVM-friendly orgs — fits Maven/Gradle builds naturally (run example below).
  • Scala barrier dropped — Java DSL is a first-class citizen.
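
For the Maven/Gradle point above, a run is just another build goal (the simulation class name is a placeholder):

# Gatling Maven plugin: run one simulation from the existing build
mvn gatling:test -Dgatling.simulationClass=simulations.LoginRamp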

When it fits:

  • JVM-based backends, want integration with the existing build system.
  • Complex scenarios you want to keep in code, code-reviewed.
  • Need enterprise support.

Limits:

  • Not for lightweight one-liners — JVM startup and SBT/Maven cost.
  • Scala barrier is lower but not zero.

8. Artillery — YAML scenarios, fast start

Status (2026): OSS + Cloud (paid). The appeal is YAML scenarios. v2 strengthened distributed runs via Cloud.

Why it's picked:

  • Declarative YAML — describe URL/payload/flow without writing code.
  • Node.js-based — familiar to Node teams.
  • Quick start — npm install -g artillery and a .yml file is enough to run.

Limits:

  • Single-node throughput is below k6 (Node event-loop limits).
  • Very high load needs Cloud or multiple instances.

Example (the same ramp, expressed as Artillery phases):

config:
  target: 'https://api.example.com'
  plugins:
    expect: {}  # enables the expect assertions used below
  phases:
    - duration: 60
      arrivalRate: 50
    - duration: 300
      arrivalRate: 50
      rampTo: 200
    - duration: 120
      arrivalRate: 200
scenarios:
  - flow:
      - post:
          url: '/login'
          json:
            user: 'demo'
            pass: 'pw'
          expect:
            - statusCode: 200
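
Running it is one command (assuming the config above is saved as login.yml):

artillery run login.yml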

9. JMeter — the old guard

Status (2026): Apache JMeter 5.x maintained. Largest body of learning material, widest plugin ecosystem. GUI-centric but headless runs are officially supported.
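
The headless invocation, for reference (plan and output paths are placeholders):

# non-GUI run with a results log and an HTML report at the end
jmeter -n -t plan.jmx -l results.jtl -e -o report/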

Why it still appears:

  • Legacy assets — many orgs own piles of .jmx files. Migration cost vs. maintenance cost.
  • Plugins — Prometheus listener, custom protocols, BlazeMeter integration.
  • Finance, telecom, regulated — internal standards stuck on JMeter.
  • GUI-friendly — QA teams that don't write code can drive it (both blessing and curse).

Why new teams rarely choose it fresh:

  • XML (.jmx) scenarios — not git-diff-friendly.
  • Modern UX/CLI friendliness lags.
  • CI is possible but heavier than k6/Locust.

Guidance: there's almost no reason a new project would pick JMeter fresh. But if you have the assets, the cost of throwing them away vs. keeping them is a separate decision. The common pattern is "write new tests in k6/Gatling, keep the old in JMeter, migrate as opportunity allows."


10. Micro-benchmarks — wrk / wrk2 / autocannon / Bombardier

These four are specialized for "peak throughput and latency distribution of one endpoint, fast."

wrk

  • Written in C, very fast. Lua scripting for light customization.
  • HTTP only, keep-alive friendly.
  • Caveat: no rate-limit — always at max load. Use for measuring the ceiling.

wrk2

  • A fork of wrk that supports fixed-rate runs — "exactly 1000 RPS for 30 seconds."
  • Best fit for micro-bench + fixed-rate measurement.
  • Stronger on coordinated-omission correction.
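
A typical fixed-rate run (wrk2 usually builds its binary as wrk; the URL is a placeholder):

# exactly 1000 RPS for 30s, 4 threads, 100 connections, full latency distribution
wrk -t4 -c100 -d30s -R1000 --latency https://api.example.com/health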

autocannon

  • Node.js-based. npm install -g autocannon, then autocannon -c 50 -d 30 https://....
  • Suits Node teams wanting fast CI benches.

Bombardier

  • Single Go binary. Dead simple and fast. bombardier -c 125 -n 1000000 https://....
  • Almost no scenario expression — that's the point.
  • Maintenance is alive but mostly in stability mode.

When which:

  • Need precise latency distribution and fixed rate → wrk2.
  • Just want to throw load quickly → wrk or Bombardier.
  • Node team in CI → autocannon.
  • Need even a small scenario → not these four; reach for Vegeta or k6.

11. Non-HTTP protocols — gRPC, WebSocket, real browser

In 2026 load testing isn't just HTTP. gRPC, WebSocket, and headless real browsers are all valid targets.

gRPC

  • ghz — a CLI for gRPC only. ghz --insecure --proto ./svc.proto --call svc.Hello -c 50 -n 10000 .... Simple and precise.
  • k6 gRPC module — import grpc from 'k6/net/grpc' mixes gRPC calls into a scenario.
  • fortio — born inside Istio. Both HTTP and gRPC, strong on p50–p999 distribution analysis.
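
A fortio sketch for both protocols (host, port, and path are placeholders; 8079 is fortio's own gRPC port):

# HTTP: 100 QPS for 30s over 8 connections, percentile table at the end
fortio load -qps 100 -c 8 -t 30s https://api.example.com/health
# gRPC: ping variant against a fortio server
fortio load -grpc -ping -qps 100 grpc-host:8079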

WebSocket

  • k6 WS module — import ws from 'k6/ws'. Connection count, message RPS, session length scenarios all expressible.
  • Artillery — first-class WebSocket support in YAML.
  • Gatling — strong WebSocket scenario expressiveness.

Real browser load

  • k6 browser module — Playwright-compatible API driving a real Chromium. Runs JS, renders, interacts in a real browser. Use cases: frontend regression, page-load SLOs.
  • Cost warning: browser instances are heavy; think in concurrent sessions, not RPS.

12. What "a good load test" means — 2026 checklist

Even with a good tool, a bad scenario gives bad results. The core of a good load test.

1) Production-like data

  • User IDs, payload sizes, token distribution should look like production.
  • "One user hitting the same page 10k times" all fits in cache — production doesn't.
  • A common pattern: sample payloads from production, anonymize, replay.
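
One minimal shape of that sample-and-replay pattern, with a Vegeta targets file (URLs and file paths are placeholders):

# each entry points at an anonymized payload sampled from production
cat > targets.txt <<'EOF'
POST https://api.example.com/login
Content-Type: application/json
@samples/login_0001.json

POST https://api.example.com/login
Content-Type: application/json
@samples/login_0002.json
EOF

vegeta attack -targets=targets.txt -rate=100 -duration=60s | vegeta report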

2) Distribution, not a flat rate

  • "Constant 1000 RPS" is fine for measurement but far from reality.
  • Production has bursts, dips, diurnal cycles.
  • Model with k6's ramping-arrival-rate, Locust's LoadTestShape.

3) Warm-up and ramp-up

  • Caches, DB connection pools, JIT all start cold — measuring during cold start gives pessimistic numbers.
  • Set aside the first 1–2 minutes as warm-up, exclude from measurement.

4) p99 (not the average)

  • "Average latency" is not your SLO's friend — it hides the long tail.
  • p95, p99, p999 (especially p99) connect to SLOs. Check percentiles in every tool's output.
  • A tool can under-report p99 due to coordinated omission — wrk2 was the first to popularize the fix.

5) Separate error rates

  • Computing latency only over successful responses hides dead requests.
  • Count 4xx, 5xx, timeouts, connection refused separately, with thresholds.

6) Multiple measurement points

  • Client-side latency from the load tool plus server-side RED (Rate, Error, Duration) plus downstream dependencies.
  • Load tool says slow, server says fast → network or measurement issue. Vice versa exists too.

7) Regular, not one-shot

  • Not a single pre-launch run; a weekly/monthly regression suite.
  • Track how code changes shift latency distributions over time.
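
The CI mechanics can stay small (script and file names are placeholders): k6 exits non-zero when a threshold fails, and a Vegeta JSON report can be gated with jq.

# k6: the thresholds in the script double as the pass/fail gate for the job
k6 run --quiet login_ramp.js

# vegeta: fail the job when p99 exceeds 500ms (report latencies are in nanoseconds)
vegeta attack -targets=targets.txt -rate=200 -duration=60s \
  | vegeta report -type=json \
  | jq -e '.latencies."99th" < 500000000'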

13. Self-host vs cloud — cost and operations

Once distributed is needed, two paths.

Self-host distributed

  • k6 OSS + grafana/k6-operator (k8s) — pods run the test, results land in Prometheus/InfluxDB (segment sketch after this list).
  • Locust master/worker — simplest. Helm chart with N workers.
  • Same command on N nodes + merge results — vegeta/wrk/Bombardier style.
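
For the k6 OSS option above, a no-operator sketch splits one script across nodes with execution segments; metrics still need a shared backend to be read together (the script name is a placeholder):

# node 1 of 3
k6 run --execution-segment "0:1/3"   --execution-segment-sequence "0,1/3,2/3,1" login_ramp.js
# node 2 of 3
k6 run --execution-segment "1/3:2/3" --execution-segment-sequence "0,1/3,2/3,1" login_ramp.js
# node 3 of 3
k6 run --execution-segment "2/3:1"   --execution-segment-sequence "0,1/3,2/3,1" login_ramp.js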

Pros: cost control, data stays in-house. Cons: operational overhead, regional distribution is hard.

Cloud

  • Grafana Cloud k6 — distributed runs, regions, dashboards. From around $299/month.
  • BlazeMeter — JMeter/Gatling/k6 compatible enterprise SaaS.
  • Artillery Cloud — Artillery-based SaaS.
  • Loader.io, k6 Cloud free tier — small free tiers for occasional measurement.

Pros: regions, instant runs, instant dashboards. Cons: cost adds up (a weekly large test can quickly hit hundreds to thousands of dollars per month).

Rule of thumb: one-off big runs (pre-launch) — cloud is cheap and fast. Regular regression — self-host is cheaper long-term. Many orgs mix — daily CI self-hosted, quarterly big run on cloud.


14. Decision frame — picking honestly

| Situation | Pick |
| --- | --- |
| Starting fresh, generic backend | k6 |
| Python team, complex behavior | Locust |
| One-line micro-bench | Vegeta or wrk2 |
| Node team, fast CI | Artillery or autocannon |
| Enterprise JVM org | Gatling |
| Legacy JMeter assets | Keep JMeter + add k6 for new |
| gRPC | ghz or k6 gRPC |
| Browser load | k6 browser |
| Pure throughput ceiling | wrk or Bombardier |
| Chaos + load | k6/Locust + Toxiproxy/Litmus |

Mixing is common. Within one org: Vegeta for micro-bench, k6 for general load, JMeter for the big legacy scenario. That's the real picture. Don't force a single tool — accepting the right tool per area keeps operations simpler.


15. Anti-patterns and traps

  • Running directly against production — use staging or an isolated production-like environment. Production canaries are a different technique.
  • Bypassing DNS/CDN to hit origin directly — measurement passes; SLOs include CDN. The measurement path must match the user path.
  • Measuring on localhost and concluding — network latency is 0, RPS-only views give false confidence.
  • Constant rate, single pass, declare success — production is a distribution.
  • Looking only at average latency — look at p99.
  • Looking only at the load tool — pair with server metrics.
  • One-off pre-launch run — put it in regression.
  • Measuring scenarios with wrk — wrong tool.
  • Trying 100k RPS from a single node — confusing tool limits with system limits.

Epilogue — measurement is a decision tool

The tools are abundant. JMeter's monolith era is gone; k6 has become the default. Locust is the friendly Python option, Vegeta the elegant one-liner, Gatling the JVM heavyweight, Artillery the quick starter, and wrk/wrk2/autocannon/Bombardier are the micro-bench specialists.

But tools don't make the result. Does the scenario resemble production? Did you look at percentiles? Did you warm up? Is it regression? Those make the result. The work after picking a tool is longer and more important.

One-line summary: "It runs fine" is not measurement. p99, distribution, scenario — those are measurement. Tools are how you get there; the destination is something you have to define.

12-item checklist

  1. Is the purpose clear (micro/load/stress/spike)?
  2. Is the data production-like?
  3. Is the rate a distribution/ramp, not a flat constant?
  4. Did you exclude warm-up from measurement?
  5. Is p99 (or p999) tied to your SLO?
  6. Are error rates separated from latency?
  7. Are client and server metrics looked at together?
  8. Is regression wired into CI?
  9. Did you make a self-host vs cloud decision for distributed?
  10. If non-HTTP, are you using the right tool (gRPC/WS/browser)?
  11. Does the measurement path match the user path (DNS, CDN)?
  12. Do you understand coordinated-omission correction?

Ten anti-patterns

  1. Only looking at the average — look at p99.
  2. One flat-rate run as "passed" — model the distribution.
  3. Measuring on localhost and concluding — no network.
  4. Loading production directly — use staging.
  5. Looking only at the tool — look at server metrics too.
  6. One-shot pre-launch — make it regression.
  7. Doing everything in JMeter — pick per area.
  8. Forcing 100k RPS from one node — tool limit ≠ system limit.
  9. Ignoring warm-up — early data poisons the conclusion.
  10. Dropping error counts — dead requests hide.

Next post teasers

Candidates: production signal vs noise — designing RED/USE/SLO dashboards, chaos engineering 2026 — Litmus, Chaos Mesh, Gremlin, AWS FIS, performance regression CI — from a one-liner to a distribution.

"What can be measured can be improved. And if you don't look at the distribution, you aren't measuring."

— Load/performance testing tools 2026, end.

