Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

Introduction — Why Limit the Rate at All
Fixed Window Counter — The Simplest Approach
Sliding Window Log — Accurate but Heavy
Sliding Window Counter — A Practical Compromise
Token Bucket — Bursts, Elegantly
Leaky Bucket — Smoothing the Flow
Algorithm Comparison
Rate Limiting in a Distributed Setting
429 and Retry-After — Rejecting Gracefully
Client Backoff Etiquette
Wrapping Up
References

Introduction — Why Limit the Rate at All

Rate limiting means "capping the number of requests allowed within a given window of time." At first glance it looks like a device for annoying users, but it is really a safety mechanism that protects the whole system.

There are several reasons you need it. It protects server resources from runaway traffic or misbehaving clients, blocks abuse like brute-force login attempts and scraping, shares resources fairly among many users, and enforces per-plan quotas on a paid API. In short, it is a mechanism for stability and fairness.

The catch is that the simple idea of "count the requests" can be implemented in several different ways. Each carries its own trade-offs in accuracy, memory usage, and how it handles bursts. This post contrasts the main algorithms one at a time and then covers distributed implementation and client etiquette. If you want to experiment with the concepts visually, this site's Message Queue Playground lets you play with throughput and flow control.

Fixed Window Counter — The Simplest Approach

The most intuitive method divides time into fixed windows and counts requests within each. For "100 per minute," you reset a counter to zero at the start of every minute and increment it by one on each request. Once the counter passes 100, you reject until that minute ends.

  100 per minute limit, fixed window
  [00:00 ~ 00:59]  count 0 -> 100 (excess rejected)
  [01:00 ~ 01:59]  count reset, 0 -> 100 again

The upside is clear. It is trivial to implement, and the only state to store is a single counter, so it uses almost no memory. But it has a fatal weakness: a burst can double at the window boundary.

Suppose a user fires 100 requests at 00:00:59 and immediately fires 100 more at 01:00:01. Those two moments are just two seconds apart, but they belong to different windows, so both are allowed. The result is roughly 200 requests passing in about two seconds. Nominally a 100-per-minute limit, but in practice a momentary leak of double that. This boundary problem is exactly why the next approaches exist.

Sliding Window Log — Accurate but Heavy

To eliminate the boundary problem at the root, use "the past one minute from this very instant" as your reference instead of a fixed window. This is the sliding window log. You record the timestamp of every request and, on each new request, count "how many requests came in within one minute of now."

  On each request:
  1) Drop timestamps older than (now - 60s) from the log
  2) Count the remaining log entries
  3) If count < 100, allow and add the current timestamp; otherwise reject

This is perfectly accurate. No matter what instant you pick as the reference, it counts exactly the requests from the past minute. There is no doubling at the boundary.

But it is expensive. You have to store every request's timestamp, so as traffic grows, memory usage grows in proportion to the request count. Keeping such a log per user on an API taking tens of thousands of requests per second is a burden. So while its accuracy is the best, high-scale systems tend to prefer the compromise that follows.

Sliding Window Counter — A Practical Compromise

The sliding window counter is a practical middle ground between the fixed window's lightness and the sliding log's accuracy. The idea is to keep only the current and previous window counters and approximate by weighting the previous window's count by how much it still overlaps.

The calculation goes like this. If you know how much of the current window has elapsed, the previous window overlaps by that much less. So the estimated request count is:

  estimate = current window count
           + previous window count * (fraction of the previous window still overlapping)

  Example: 100 per minute, 25% of the way into the current window
      previous window count = 80, current window count = 30
      estimate = 30 + 80 * (1 - 0.25) = 30 + 60 = 90
      90 < 100, so allow

This approximation leans on the assumption that traffic is roughly evenly distributed within a window. It is not perfect, but in practice the error is small, and it stores only two counters, so it is very light. That is why production rate limiters commonly adopt it. It resolves most of the fixed window's boundary-burst problem while avoiding the sliding log's memory burden.

Token Bucket — Bursts, Elegantly

Where the approaches so far focused on counting "how many came," the token bucket flips the perspective. Instead of counting requests, it hands out tokens — the right to process a request.

Here is how it works. Tokens are added to a bucket at a fixed rate (say, 10 per second). The bucket has a maximum capacity (say, 100). When a request arrives, it takes one token. If a token is available, the request passes; if not, it is rejected.

  Token bucket (capacity 100, refill rate 10/sec)
   tokens drip in at 10/sec, accumulating in the bucket (max 100)
       │
       ▼
   [ bucket: ●●●●●●... up to 100 ]
       │  each request consumes 1 token
       ▼
   token available -> pass, none -> reject

The beauty of the token bucket is that it allows bursts naturally. If there are no requests for a while, tokens accumulate to the bucket's brim (up to 100). Then when requests suddenly surge, you can pass as many at once as there are accumulated tokens. That is, it tolerates a normally quiet client firing off a burst, while capping the long-run average rate at the refill rate (10/sec). The bucket capacity sets the maximum burst, and the refill rate sets the sustainable average.

This flexibility makes the token bucket one of the most widely used algorithms in practice. Many API gateways and network devices use it.

Leaky Bucket — Smoothing the Flow

The leaky bucket is, as the name suggests, a bucket with a hole. Requests fill the bucket (a queue) and leak out the bottom at a constant rate to be processed. When the bucket fills up, overflowing requests are dropped.

  Leaky bucket
   requests pile into the queue from the top
       │
       ▼
   [ queue: requests waiting ]
       │  processed at a constant rate from the bottom (e.g., 10/sec)
       ▼
   when the queue is full, overflow is dropped

Token bucket and leaky bucket are often compared. The key difference is the nature of the output.

Token bucket allows bursts. If tokens have accumulated, many pass at once. The output can be bursty.
Leaky bucket makes the output smooth. No matter how the input surges, it only leaves at the bottom hole's rate, so the processing rate is constant. The catch is that absorbing a momentary burst means waiting in the queue, which can add latency.

So the choice depends on your goal. If you want to "generously allow the occasional burst but hold the average," the token bucket fits. If you want to "keep a constant processing rate to protect downstream," the leaky bucket fits.

Algorithm Comparison

Here is everything so far in one table.

Algorithm	Memory	Accuracy	Burst handling	Notes
Fixed window	very small	low (boundary issue)	doubles at boundary	simplest to build
Sliding log	large (per request)	perfect	none	accurate but heavy
Sliding counter	small (2 counters)	high (approx.)	almost none	practical compromise
Token bucket	small	high	allows up to capacity	burst-tolerant, caps average
Leaky bucket	small (queue size)	high	absorbs and flattens	constant output rate

Rate Limiting in a Distributed Setting

So far we assumed a single server. But real services have many server instances processing requests together behind a load balancer, and that creates a problem. If each instance holds its own counter, the total limit is multiplied by the number of instances. With five instances each allowing 100 per minute, a user effectively gets 500 per minute.

The fix is to keep the counter in shared storage outside the instances. The most widely used choice for this role is Redis — a fast in-memory store that supports atomic operations and key expiration (TTL).

Here is a simple example of a fixed window implemented with Redis.

  key:  ratelimit:{user_id}:{current_minute}
  1) INCR key          -> atomically increment the counter and get the new value
  2) if the value is 1, EXPIRE key 60   -> set a TTL so it auto-deletes when the window passes
  3) if value > 100, reject; otherwise allow

What matters here is atomicity. If "read, compare, and increment" is split into separate steps, concurrent requests can race and exceed the limit. Redis's INCR is atomic on its own, and more complex logic (a sliding window or token bucket) is implemented with a Lua script that bundles several commands into a single atomic unit. Redis runs Lua scripts atomically on a single thread, which avoids the race condition.

Distributed rate limiting has one more consideration. Hitting the shared store on every request adds latency and load, so some designs sacrifice a bit of accuracy by approximating locally and syncing periodically. The trade-off between accuracy and performance repeats here too.

429 and Retry-After — Rejecting Gracefully

When the rate limiter decides to reject a request, it should tell the client clearly. The tool for this is the HTTP status code 429 Too Many Requests. This code conveys, in a standard way, "you have made too many requests."

Sending a 429 is not the end of it. A good rate limiter also tells the client when it can try again. The header for that is Retry-After. Its value can be a wait in seconds or a specific time.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{ "error": "rate limit exceeded", "retry_after": 30 }

On top of this, many APIs provide headers that report the remaining quota — for example RateLimit-Limit (the total limit), RateLimit-Remaining (calls left), and RateLimit-Reset (time until reset). With this information, a client can throttle itself before hitting the wall. If you are curious about the precise meaning of 429 and other status codes, the HTTP Status Code Reference lets you look each one up.

Client Backoff Etiquette

Rate limiting is not the server's job alone. How a client reacts to a 429 has a big effect on the health of the whole system. A client that retries the instant it receives a 429 only torments the server more.

The right reaction is backoff — steadily increasing the interval between retries.

Exponential backoff: double the wait on each retry — 1s, 2s, 4s, 8s. This gives the server time to recover.
Add jitter: if many clients retry at identical intervals, those retries pile up at the same moment and hammer the server again — the "thundering herd" problem. So add a bit of randomness to the wait to spread out the retry moments.
Respect Retry-After: if the server told you the wait via Retry-After, it is polite to honor that value over your own calculation.
Set a ceiling: cap the maximum retry count or maximum wait so you do not retry forever.

  On receiving 429:
   wait = min(base * 2^attempt, ceiling) + random_jitter
   if a Retry-After header is present, prefer its value

To sum up, well-designed rate limiting comes together when the server's rejection and the client's restraint mesh. The server signals clearly with 429 and Retry-After, and the client backs off politely with exponential backoff and jitter.

Wrapping Up

Rate limiting holds the simple goal of "counting requests," but its implementation offers a range of trade-offs across three axes: accuracy, memory, and burst handling. The fixed window is simple but leaks at the boundary; the sliding log is accurate but heavy; the sliding counter is the practical balance between them. The token bucket allows bursts elegantly, and the leaky bucket smooths the output.

On top of that, a distributed setting needs shared storage like Redis with atomic operations, and the client side needs polite backoff around 429 and Retry-After. The point is not "which algorithm is right" but "what this system is trying to protect." Settle that answer and the right tool follows naturally.

References

Cloudflare Learning: What is rate limiting? — https://www.cloudflare.com/learning/bots/what-is-rate-limiting/
Stripe: Rate limiters — https://stripe.com/blog/rate-limiters
MDN: 429 Too Many Requests — https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
MDN: Retry-After header — https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
IETF: RateLimit header fields for HTTP — https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/
AWS Architecture Blog: Exponential backoff and jitter — https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/