필사 모드: Rate Limiting Algorithms, Explained: Fixed Window, Sliding Window, Token Bucket, and Leaky Bucket
English- Introduction — Why Limit the Rate at All
- Fixed Window Counter — The Simplest Approach
- Sliding Window Log — Accurate but Heavy
- Sliding Window Counter — A Practical Compromise
- Token Bucket — Bursts, Elegantly
- Leaky Bucket — Smoothing the Flow
- Algorithm Comparison
- Rate Limiting in a Distributed Setting
- 429 and Retry-After — Rejecting Gracefully
- Client Backoff Etiquette
- Wrapping Up
- References
Introduction — Why Limit the Rate at All
Rate limiting means "capping the number of requests allowed within a given window of time." At first glance it looks like a device for annoying users, but it is really a safety mechanism that protects the whole system.
There are several reasons you need it. It protects server resources from runaway traffic or misbehaving clients, blocks abuse like brute-force login attempts and scraping, shares resources fairly among many users, and enforces per-plan quotas on a paid API. In short, it is a mechanism for stability and fairness.
The catch is that the simple idea of "count the requests" can be implemented in several different ways. Each carries its own trade-offs in accuracy, memory usage, and how it handles bursts. This post contrasts the main algorithms one at a time and then covers distributed implementation and client etiquette. If you want to experiment with the concepts visually, this site's Message Queue Playground lets you play with throughput and flow control.
Fixed Window Counter — The Simplest Approach
The most intuitive method divides time into fixed windows and counts requests within each. For "100 per minute," you reset a counter to zero at the start of every minute and increment it by one on each request. Once the counter passes 100, you reject until that minute ends.
100 per minute limit, fixed window
[00:00 ~ 00:59] count 0 -> 100 (excess rejected)
[01:00 ~ 01:59] count reset, 0 -> 100 again
The upside is clear. It is trivial to implement, and the only state to store is a single counter, so it uses almost no memory. But it has a fatal weakness: a burst can double at the window boundary.
Suppose a user fires 100 requests at 00:00:59 and immediately fires 100 more at 01:00:01. Those two moments are just two seconds apart, but they belong to different windows, so both are allowed. The result is roughly 200 requests passing in about two seconds. Nominally a 100-per-minute limit, but in practice a momentary leak of double that. This boundary problem is exactly why the next approaches exist.
Sliding Window Log — Accurate but Heavy
To eliminate the boundary problem at the root, use "the past one minute from this very instant" as your reference instead of a fixed window. This is the sliding window log. You record the timestamp of every request and, on each new request, count "how many requests came in within one minute of now."
On each request:
1) Drop timestamps older than (now - 60s) from the log
2) Count the remaining log entries
3) If count < 100, allow and add the current timestamp; otherwise reject
This is perfectly accurate. No matter what instant you pick as the reference, it counts exactly the requests from the past minute. There is no doubling at the boundary.
But it is expensive. You have to store every request's timestamp, so as traffic grows, memory usage grows in proportion to the request count. Keeping such a log per user on an API taking tens of thousands of requests per second is a burden. So while its accuracy is the best, high-scale systems tend to prefer the compromise that follows.
Sliding Window Counter — A Practical Compromise
The sliding window counter is a practical middle ground between the fixed window's lightness and the sliding log's accuracy. The idea is to keep only the current and previous window counters and approximate by weighting the previous window's count by how much it still overlaps.
The calculation goes like this. If you know how much of the current window has elapsed, the previous window overlaps by that much less. So the estimated request count is:
estimate = current window count
+ previous window count * (fraction of the previous window still overlapping)
Example: 100 per minute, 25% of the way into the current window
previous window count = 80, current window count = 30
estimate = 30 + 80 * (1 - 0.25) = 30 + 60 = 90
90 < 100, so allow
This approximation leans on the assumption that traffic is roughly evenly distributed within a window. It is not perfect, but in practice the error is small, and it stores only two counters, so it is very light. That is why production rate limiters commonly adopt it. It resolves most of the fixed window's boundary-burst problem while avoiding the sliding log's memory burden.
Token Bucket — Bursts, Elegantly
Where the approaches so far focused on counting "how many came," the token bucket flips the perspective. Instead of counting requests, it hands out tokens — the right to process a request.
Here is how it works. Tokens are added to a bucket at a fixed rate (say, 10 per second). The bucket has a maximum capacity (say, 100). When a request arrives, it takes one token. If a token is available, the request passes; if not, it is rejected.
Token bucket (capacity 100, refill rate 10/sec)
tokens drip in at 10/sec, accumulating in the bucket (max 100)
│
▼
[ bucket: ●●●●●●... up to 100 ]
│ each request consumes 1 token
▼
token available -> pass, none -> reject
The beauty of the token bucket is that it allows bursts naturally. If there are no requests for a while, tokens accumulate to the bucket's brim (up to 100). Then when requests suddenly surge, you can pass as many at once as there are accumulated tokens. That is, it tolerates a normally quiet client firing off a burst, while capping the long-run average rate at the refill rate (10/sec). The bucket capacity sets the maximum burst, and the refill rate sets the sustainable average.
This flexibility makes the token bucket one of the most widely used algorithms in practice. Many API gateways and network devices use it.
Leaky Bucket — Smoothing the Flow
The leaky bucket is, as the name suggests, a bucket with a hole. Requests fill the bucket (a queue) and leak out the bottom at a constant rate to be processed. When the bucket fills up, overflowing requests are dropped.
Leaky bucket
requests pile into the queue from the top
│
▼
[ queue: requests waiting ]
│ processed at a constant rate from the bottom (e.g., 10/sec)
▼
when the queue is full, overflow is dropped
Token bucket and leaky bucket are often compared. The key difference is the nature of the output.
- Token bucket allows bursts. If tokens have accumulated, many pass at once. The output can be bursty.
- Leaky bucket makes the output smooth. No matter how the input surges, it only leaves at the bottom hole's rate, so the processing rate is constant. The catch is that absorbing a momentary burst means waiting in the queue, which can add latency.
So the choice depends on your goal. If you want to "generously allow the occasional burst but hold the average," the token bucket fits. If you want to "keep a constant processing rate to protect downstream," the leaky bucket fits.
Algorithm Comparison
Here is everything so far in one table.
| Algorithm | Memory | Accuracy | Burst handling | Notes |
|---|---|---|---|---|
| Fixed window | very small | low (boundary issue) | doubles at boundary | simplest to build |
| Sliding log | large (per request) | perfect | none | accurate but heavy |
| Sliding counter | small (2 counters) | high (approx.) | almost none | practical compromise |
| Token bucket | small | high | allows up to capacity | burst-tolerant, caps average |
| Leaky bucket | small (queue size) | high | absorbs and flattens | constant output rate |
Rate Limiting in a Distributed Setting
So far we assumed a single server. But real services have many server instances processing requests together behind a load balancer, and that creates a problem. If each instance holds its own counter, the total limit is multiplied by the number of instances. With five instances each allowing 100 per minute, a user effectively gets 500 per minute.
The fix is to keep the counter in shared storage outside the instances. The most widely used choice for this role is Redis — a fast in-memory store that supports atomic operations and key expiration (TTL).
Here is a simple example of a fixed window implemented with Redis.
key: ratelimit:{user_id}:{current_minute}
1) INCR key -> atomically increment the counter and get the new value
2) if the value is 1, EXPIRE key 60 -> set a TTL so it auto-deletes when the window passes
3) if value > 100, reject; otherwise allow
What matters here is atomicity. If "read, compare, and increment" is split into separate steps, concurrent requests can race and exceed the limit. Redis's INCR is atomic on its own, and more complex logic (a sliding window or token bucket) is implemented with a Lua script that bundles several commands into a single atomic unit. Redis runs Lua scripts atomically on a single thread, which avoids the race condition.
Distributed rate limiting has one more consideration. Hitting the shared store on every request adds latency and load, so some designs sacrifice a bit of accuracy by approximating locally and syncing periodically. The trade-off between accuracy and performance repeats here too.
429 and Retry-After — Rejecting Gracefully
When the rate limiter decides to reject a request, it should tell the client clearly. The tool for this is the HTTP status code 429 Too Many Requests. This code conveys, in a standard way, "you have made too many requests."
Sending a 429 is not the end of it. A good rate limiter also tells the client when it can try again. The header for that is Retry-After. Its value can be a wait in seconds or a specific time.
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json
{ "error": "rate limit exceeded", "retry_after": 30 }
On top of this, many APIs provide headers that report the remaining quota — for example RateLimit-Limit (the total limit), RateLimit-Remaining (calls left), and RateLimit-Reset (time until reset). With this information, a client can throttle itself before hitting the wall. If you are curious about the precise meaning of 429 and other status codes, the HTTP Status Code Reference lets you look each one up.
Client Backoff Etiquette
Rate limiting is not the server's job alone. How a client reacts to a 429 has a big effect on the health of the whole system. A client that retries the instant it receives a 429 only torments the server more.
The right reaction is backoff — steadily increasing the interval between retries.
- Exponential backoff: double the wait on each retry — 1s, 2s, 4s, 8s. This gives the server time to recover.
- Add jitter: if many clients retry at identical intervals, those retries pile up at the same moment and hammer the server again — the "thundering herd" problem. So add a bit of randomness to the wait to spread out the retry moments.
- Respect Retry-After: if the server told you the wait via
Retry-After, it is polite to honor that value over your own calculation. - Set a ceiling: cap the maximum retry count or maximum wait so you do not retry forever.
On receiving 429:
wait = min(base * 2^attempt, ceiling) + random_jitter
if a Retry-After header is present, prefer its value
To sum up, well-designed rate limiting comes together when the server's rejection and the client's restraint mesh. The server signals clearly with 429 and Retry-After, and the client backs off politely with exponential backoff and jitter.
Wrapping Up
Rate limiting holds the simple goal of "counting requests," but its implementation offers a range of trade-offs across three axes: accuracy, memory, and burst handling. The fixed window is simple but leaks at the boundary; the sliding log is accurate but heavy; the sliding counter is the practical balance between them. The token bucket allows bursts elegantly, and the leaky bucket smooths the output.
On top of that, a distributed setting needs shared storage like Redis with atomic operations, and the client side needs polite backoff around 429 and Retry-After. The point is not "which algorithm is right" but "what this system is trying to protect." Settle that answer and the right tool follows naturally.
References
- Cloudflare Learning: What is rate limiting? — https://www.cloudflare.com/learning/bots/what-is-rate-limiting/
- Stripe: Rate limiters — https://stripe.com/blog/rate-limiters
- MDN: 429 Too Many Requests — https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
- MDN: Retry-After header — https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
- IETF: RateLimit header fields for HTTP — https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/
- AWS Architecture Blog: Exponential backoff and jitter — https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
현재 단락 (1/92)
Rate limiting means "capping the number of requests allowed within a given window of time." At first...