Author: Youngju Kim (@fjvbn20031)
Overview
When operating APIs, you inevitably face unexpected traffic spikes, DDoS attacks, and excessive calls from individual users. Rate limiting is the most fundamental yet effective defense against these problems. This post covers core rate limiting algorithms, practical implementations with Redis, Nginx, and Kong, LLM API cost control patterns, and distributed rate limiting strategies.
1. What Is Rate Limiting?
1.1 Why Is It Needed?
Rate Limiting is a technique that restricts the number of requests allowed within a given time period.
Primary Purposes:
- DDoS Defense: Protect services from malicious mass requests
- Cost Control: Manage external API call costs (especially LLM APIs)
- Fair Usage: Prevent specific users from monopolizing resources
- Service Stability: Prevent service outages due to overload
- SLA Guarantee: Provide consistent service quality to all users
1.2 Where to Apply Rate Limiting
Client --> [CDN/WAF] --> [Load Balancer] --> [API Gateway] --> [Application]
               |                |                  |                 |
          L3/L4 limits   Connection limits   Request rate limits   Business logic limits
Rate Limiting can be applied at multiple layers of infrastructure, with the API Gateway layer being the most effective location.
2. Rate Limiting Algorithms
2.1 Token Bucket
The most widely used algorithm, adopted by AWS API Gateway among others.
How It Works:
- Tokens are added to the bucket at a constant rate
- Each request consumes one token
- Requests are rejected when no tokens remain
- Burst traffic is allowed up to bucket capacity
Bucket Capacity: 10 tokens
Refill Rate: 2 tokens/second
Time 0s: [##########] 10 tokens -> Request OK (9 left)
Time 0s: [#########.] 9 tokens -> Request OK (8 left)
...
Time 0s: [..........] 0 tokens -> Request REJECTED
Time 1s: [##........] 2 tokens -> Refilled
import time
import threading
class TokenBucket:
def __init__(self, capacity: int, refill_rate: float):
self.capacity = capacity
self.refill_rate = refill_rate # tokens per second
self.tokens = capacity
self.last_refill_time = time.monotonic()
self.lock = threading.Lock()
def _refill(self):
now = time.monotonic()
elapsed = now - self.last_refill_time
new_tokens = elapsed * self.refill_rate
self.tokens = min(self.capacity, self.tokens + new_tokens)
self.last_refill_time = now
def consume(self, tokens: int = 1) -> bool:
with self.lock:
self._refill()
if self.tokens >= tokens:
self.tokens -= tokens
return True
return False
def wait_time(self) -> float:
"""Time to wait for the next token"""
with self.lock:
self._refill()
if self.tokens >= 1:
return 0.0
return (1 - self.tokens) / self.refill_rate
# Usage example
bucket = TokenBucket(capacity=10, refill_rate=2.0)
for i in range(15):
if bucket.consume():
print(f"Request {i+1}: Allowed")
else:
wait = bucket.wait_time()
print(f"Request {i+1}: Rejected (wait {wait:.2f}s)")
Pros: Allows bursts, simple implementation, memory efficient
Cons: Difficult to track exact request counts
2.2 Leaky Bucket
An algorithm that processes requests at a constant rate.
Incoming Requests --> [Bucket/Queue] --> Processing (fixed rate)
                           |
                        Overflow --> Rejected
Processing Rate: 2 requests/second (fixed)
Queue Size: 5
Fast incoming requests queue up; rejected when queue is full
import time
import threading
from collections import deque
class LeakyBucket:
def __init__(self, capacity: int, leak_rate: float):
self.capacity = capacity
self.leak_rate = leak_rate # requests per second
self.queue = deque()
self.lock = threading.Lock()
self.last_leak_time = time.monotonic()
def _leak(self):
now = time.monotonic()
elapsed = now - self.last_leak_time
leaked = int(elapsed * self.leak_rate)
if leaked > 0:
for _ in range(min(leaked, len(self.queue))):
self.queue.popleft()
self.last_leak_time = now
def allow(self) -> bool:
with self.lock:
self._leak()
if len(self.queue) < self.capacity:
self.queue.append(time.monotonic())
return True
return False
Pros: Guarantees a constant processing rate
Cons: Does not allow bursts, lacks flexibility
2.3 Fixed Window Counter
The simplest algorithm that counts requests within a fixed time window.
Window: 1 minute, Limit: 100 requests
|------- Window 1 -------|------- Window 2 -------|
12:00:00 12:00:59 12:01:00 12:01:59
Count: 0...50...100 Count: 0...50...100
(allowed) (allowed)
Problem: Boundary Burst
100 requests at 12:00:30~12:00:59 + 100 requests at 12:01:00~12:01:30
= 200 requests in 60 seconds (2x the limit!)
import time
import threading
class FixedWindowCounter:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.counters = {} # key -> (window_start, count)
self.lock = threading.Lock()
def allow(self, key: str) -> bool:
with self.lock:
now = time.time()
window_start = int(now / self.window_seconds) * self.window_seconds
if key not in self.counters or self.counters[key][0] != window_start:
self.counters[key] = (window_start, 0)
if self.counters[key][1] < self.limit:
self.counters[key] = (window_start, self.counters[key][1] + 1)
return True
return False
Pros: Very simple implementation, minimal memory usage
Cons: Request concentration at window boundaries
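The boundary burst described above is easy to reproduce. The sketch below is a condensed fixed-window counter with an injectable clock (a small test harness of my own, not production code): two full bursts land within roughly 30 seconds of each other, double the intended limit.

```python
class FixedWindow:
    """Minimal fixed-window counter with an injectable clock (demo only)."""
    def __init__(self, limit, window, clock):
        self.limit, self.window, self.clock = limit, window, clock
        self.window_start, self.count = None, 0

    def allow(self):
        window_start = int(self.clock() / self.window) * self.window
        if window_start != self.window_start:
            # New window: the counter resets, forgetting the recent burst
            self.window_start, self.count = window_start, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

now = [0.0]                      # fake clock we can advance by hand
fw = FixedWindow(limit=100, window=60, clock=lambda: now[0])

now[0] = 30                      # 12:00:30 -- end of window 1
burst_1 = sum(fw.allow() for _ in range(100))   # 100 allowed
now[0] = 60                      # 12:01:00 -- window 2 begins, counter resets
burst_2 = sum(fw.allow() for _ in range(100))   # another 100 allowed
print(burst_1 + burst_2)         # 200 requests in ~30 seconds of wall time
```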
2.4 Sliding Window Log
Accurately tracks each request by recording timestamps in a log.
Window: 60 seconds, Limit: 5 requests
Log: [12:00:10, 12:00:25, 12:00:40, 12:00:55, 12:01:05]
New request at 12:01:10:
1. Remove entries older than 60s (before 12:00:10)
-> [12:00:25, 12:00:40, 12:00:55, 12:01:05]
2. Count: 4 < 5 -> Allowed
3. Add to log: [12:00:25, 12:00:40, 12:00:55, 12:01:05, 12:01:10]
import time
import threading
from collections import defaultdict
class SlidingWindowLog:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.logs = defaultdict(list)
self.lock = threading.Lock()
def allow(self, key: str) -> bool:
with self.lock:
now = time.time()
cutoff = now - self.window_seconds
# Remove old entries
self.logs[key] = [
ts for ts in self.logs[key] if ts > cutoff
]
if len(self.logs[key]) < self.limit:
self.logs[key].append(now)
return True
return False
Pros: Most accurate rate limiting
Cons: High memory usage (stores all request timestamps)
2.5 Sliding Window Counter
Combines the advantages of Fixed Window and Sliding Window Log. This is the most commonly used algorithm in production.
Window: 60 seconds, Limit: 100 requests
Previous Window Count: 80
Current Window Count: 30
Current Window Progress: 25% (15 seconds elapsed)
Weighted Count = 80 * (1 - 0.25) + 30 = 80 * 0.75 + 30 = 90
90 < 100 -> Allowed
import time
import threading
class SlidingWindowCounter:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.windows = {} # key -> (prev_count, curr_count, curr_window_start)
self.lock = threading.Lock()
def allow(self, key: str) -> bool:
with self.lock:
now = time.time()
curr_window = int(now / self.window_seconds) * self.window_seconds
window_progress = (now - curr_window) / self.window_seconds
if key not in self.windows:
self.windows[key] = (0, 0, curr_window)
prev_count, curr_count, stored_window = self.windows[key]
# Update if window has changed
if stored_window != curr_window:
if curr_window - stored_window >= self.window_seconds * 2:
prev_count = 0
else:
prev_count = curr_count
curr_count = 0
# Calculate weighted count
weighted = prev_count * (1 - window_progress) + curr_count
if weighted < self.limit:
curr_count += 1
self.windows[key] = (prev_count, curr_count, curr_window)
return True
return False
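The weighted-count arithmetic from the worked example can be checked in isolation (a small helper of my own, not a standard API):

```python
def weighted_count(prev_count: int, curr_count: int, window_progress: float) -> float:
    """Estimate requests in the sliding window: the previous window's count
    is weighted by how much of it still overlaps the sliding window."""
    return prev_count * (1 - window_progress) + curr_count

# The example above: previous window 80, current window 30, 25% elapsed.
print(weighted_count(80, 30, 0.25))  # 90.0 -> under the limit of 100, allowed
```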
2.6 Algorithm Comparison Table
| Algorithm | Accuracy | Memory | Burst Allowed | Complexity | Best For |
|---|---|---|---|---|---|
| Token Bucket | Medium | Low | Yes | Low | General APIs, burst needed |
| Leaky Bucket | Medium | Low | No | Low | Constant rate needed |
| Fixed Window | Low | Very Low | Boundary issue | Very Low | Simple limits |
| Sliding Log | High | High | No | Medium | Precise limiting needed |
| Sliding Counter | High | Low | No | Medium | Production general-purpose |
3. Implementation Methods
3.1 Redis + Lua Script
Combining Redis atomic operations with Lua scripts enables safe rate limiting even in distributed environments.
Sliding Window Log (Redis + Lua)
import redis
import time
r = redis.Redis(host='localhost', port=6379, db=0)
# Lua script: sliding window log over a sorted set of request timestamps
SLIDING_WINDOW_SCRIPT = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local clear_before = now - window
redis.call('ZREMRANGEBYSCORE', key, '-inf', clear_before)
local current_count = redis.call('ZCARD', key)
if current_count < limit then
redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
redis.call('EXPIRE', key, window + 1)
return 1
else
return 0
end
"""
sliding_window_sha = r.register_script(SLIDING_WINDOW_SCRIPT)
def is_rate_limited(key: str, limit: int = 100, window: int = 60) -> bool:
"""
Returns True if the request is rate limited, False if allowed.
"""
now = time.time()
result = sliding_window_sha(
keys=[f"ratelimit:{key}"],
args=[now, window, limit]
)
return result == 0 # 0 = rate limited, 1 = allowed
# Usage example
user_id = "user-123"
for i in range(110):
if is_rate_limited(user_id, limit=100, window=60):
print(f"Request {i+1}: Rate Limited!")
else:
print(f"Request {i+1}: Allowed")
Token Bucket (Redis + Lua)
TOKEN_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
if tokens == nil then
tokens = capacity
last_refill = now
end
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)
local allowed = 0
if tokens >= requested then
tokens = tokens - requested
allowed = 1
end
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
return allowed
"""
token_bucket_sha = r.register_script(TOKEN_BUCKET_SCRIPT)
def token_bucket_check(
key: str,
capacity: int = 100,
refill_rate: float = 10.0,
tokens: int = 1
) -> bool:
now = time.time()
result = token_bucket_sha(
keys=[f"tokenbucket:{key}"],
args=[capacity, refill_rate, now, tokens]
)
return result == 1 # 1 = allowed
3.2 Nginx Rate Limiting
# nginx.conf
http {
# Zone definition: per client IP, 10MB shared memory
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
    # Per-user rate limiting (API key based; fall back to the client IP
    # when the header is missing -- an empty key would disable limiting)
    map $http_x_api_key $api_key_zone {
        ""      $binary_remote_addr;
        default $http_x_api_key;
    }
    limit_req_zone $api_key_zone zone=user_limit:10m rate=100r/m;
server {
listen 80;
# Default API endpoint
location /api/ {
# burst=20: allow up to 20 excess requests (queued)
# nodelay: process immediately (within burst)
limit_req zone=api_limit burst=20 nodelay;
# Response code when rate limit exceeded
limit_req_status 429;
proxy_pass http://backend;
}
# LLM API endpoint (stricter limits)
location /api/v1/completions {
limit_req zone=user_limit burst=5 nodelay;
limit_req_status 429;
proxy_pass http://llm_backend;
}
# Custom error page
error_page 429 = @rate_limited;
location @rate_limited {
default_type application/json;
return 429 '{"error": "Too Many Requests", "retry_after": 60}';
}
}
}
3.3 Kong Rate Limiting Plugin
# kong.yml - Declarative Configuration
_format_version: '3.0'
services:
- name: llm-api
url: http://llm-backend:8000
routes:
- name: llm-route
paths:
- /api/v1/completions
plugins:
# Global Rate Limiting
- name: rate-limiting
config:
minute: 100
hour: 1000
policy: redis
redis_host: redis
redis_port: 6379
redis_database: 0
fault_tolerant: true
hide_client_headers: false
error_code: 429
error_message: 'Rate limit exceeded'
# Per-Consumer Rate Limiting
- name: rate-limiting
consumer: premium-user
config:
minute: 500
hour: 10000
policy: redis
redis_host: redis
3.4 Envoy Rate Limiting
# envoy.yaml
static_resources:
listeners:
- address:
socket_address:
address: 0.0.0.0
port_value: 8080
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
'@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
http_filters:
- name: envoy.filters.http.local_ratelimit
typed_config:
'@type': type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
stat_prefix: http_local_rate_limiter
token_bucket:
max_tokens: 100
tokens_per_fill: 10
fill_interval: 1s
filter_enabled:
runtime_key: local_rate_limit_enabled
default_value:
numerator: 100
denominator: HUNDRED
- name: envoy.filters.http.router
typed_config:
'@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
3.5 Express.js Rate Limiting
const rateLimit = require('express-rate-limit')
const RedisStore = require('rate-limit-redis')
const Redis = require('ioredis')
const redis = new Redis({
host: 'localhost',
port: 6379,
})
// Default Rate Limiter
const apiLimiter = rateLimit({
windowMs: 60 * 1000, // 1 minute
max: 100,
  standardHeaders: true, // send standard RateLimit-* headers
  legacyHeaders: false, // disable legacy X-RateLimit-* headers
store: new RedisStore({
sendCommand: (...args) => redis.call(...args),
}),
message: {
error: 'Too Many Requests',
retryAfter: 60,
},
keyGenerator: (req) => {
return req.headers['x-api-key'] || req.ip
},
})
// Strict Limiter for LLM API
const llmLimiter = rateLimit({
windowMs: 60 * 1000,
max: 20,
store: new RedisStore({
sendCommand: (...args) => redis.call(...args),
prefix: 'rl:llm:',
}),
keyGenerator: (req) => req.headers['x-api-key'],
})
const app = require('express')()
app.use('/api/', apiLimiter)
app.use('/api/v1/completions', llmLimiter)
3.6 Spring Boot Rate Limiting
// Bucket4j + Redis Based Rate Limiting
@Configuration
public class RateLimitConfig {
@Bean
public ProxyManager<String> proxyManager(RedissonClient redissonClient) {
return Bucket4jRedisson.casBasedBuilder(redissonClient)
.build();
}
}
@Component
public class RateLimitInterceptor implements HandlerInterceptor {

    private final ProxyManager<String> proxyManager;

    public RateLimitInterceptor(ProxyManager<String> proxyManager) {
        this.proxyManager = proxyManager;
    }
@Override
public boolean preHandle(
HttpServletRequest request,
HttpServletResponse response,
Object handler
) throws Exception {
String apiKey = request.getHeader("X-API-Key");
if (apiKey == null) {
response.setStatus(401);
return false;
}
BucketConfiguration config = BucketConfiguration.builder()
.addLimit(
Bandwidth.builder()
.capacity(100)
.refillGreedy(100, Duration.ofMinutes(1))
.build()
)
.build();
Bucket bucket = proxyManager.builder()
.build(apiKey, () -> config);
ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);
if (probe.isConsumed()) {
response.setHeader("X-RateLimit-Remaining",
String.valueOf(probe.getRemainingTokens()));
return true;
} else {
long waitSeconds = probe.getNanosToWaitForRefill() / 1_000_000_000;
response.setStatus(429);
response.setHeader("Retry-After", String.valueOf(waitSeconds));
response.getWriter().write(
"{\"error\":\"Rate limit exceeded\"}"
);
return false;
}
}
}
4. Distributed Rate Limiting
4.1 Centralized (Redis)
+--------+ +--------+ +--------+
| Node 1 | | Node 2 | | Node 3 |
+---+----+ +---+----+ +---+----+
| | |
+---------+----+---------+----+
| |
+----+----+ +---+---+
| Redis | | Redis |
| Primary | | Replica|
+---------+ +-------+
Pros: Consistent counts across all nodes Cons: Additional network latency to Redis, Redis can become SPOF
4.2 Local Counter + Synchronization
+--------+ +--------+ +--------+
| Node 1 | | Node 2 | | Node 3 |
| Local: | | Local: | | Local: |
| 30/100 | | 25/100 | | 20/100 |
+---+----+ +---+----+ +---+----+
| | |
+----Periodic Sync (every 5s)-----+
|
+----+----+
| Redis |
| Global: |
| 75/100 |
+---------+
import time
import threading
import redis
class DistributedRateLimiter:
def __init__(
self,
redis_client: redis.Redis,
key_prefix: str,
global_limit: int,
window_seconds: int,
sync_interval: float = 5.0,
local_threshold: float = 0.1,
):
self.redis = redis_client
self.key_prefix = key_prefix
self.global_limit = global_limit
self.window_seconds = window_seconds
self.sync_interval = sync_interval
# Local allowance = 10% of global limit
self.local_limit = int(global_limit * local_threshold)
self.local_count = 0
        # Re-entrant: allow() calls _sync_to_redis() while holding the lock
        self.lock = threading.RLock()
# Start periodic sync thread
self._start_sync_thread()
def _start_sync_thread(self):
def sync_loop():
while True:
time.sleep(self.sync_interval)
self._sync_to_redis()
thread = threading.Thread(target=sync_loop, daemon=True)
thread.start()
def _sync_to_redis(self):
with self.lock:
if self.local_count > 0:
key = f"{self.key_prefix}:global"
self.redis.incrby(key, self.local_count)
self.redis.expire(key, self.window_seconds + 10)
self.local_count = 0
    def allow(self) -> bool:
with self.lock:
# Check local counter
if self.local_count >= self.local_limit:
self._sync_to_redis()
# Check global counter
global_key = f"{self.key_prefix}:global"
global_count = int(self.redis.get(global_key) or 0)
if global_count >= self.global_limit:
return False
self.local_count += 1
return True
4.3 Consistent Hashing for Sharded Rate Limiters
Hash the rate limit key to assign to a specific Redis node
User A (hash=0x3A) --> Redis Node 1 (0x00-0x55)
User B (hash=0x8F) --> Redis Node 2 (0x56-0xAA)
User C (hash=0xC2) --> Redis Node 3 (0xAB-0xFF)
Each user's counter is always managed on the same node
-> Minimized network hops, guaranteed consistency
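A minimal sketch of such a hash ring (the node names and replica count are illustrative assumptions):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: maps rate-limit keys to Redis nodes so a
    given user's counter always lives on the same node."""
    def __init__(self, nodes, replicas=100):
        # Each node appears `replicas` times on the ring for even spread
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's hash
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
# The same user always hashes to the same node:
print(ring.node_for("user-123"))
```

Because only keys near a changed ring segment move when a node is added or removed, rebalancing disturbs far fewer counters than naive modulo sharding.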
5. LLM API Cost Control
5.1 Understanding OpenAI/Anthropic API Rate Limits
OpenAI (GPT-4o reference):
- RPM (Requests Per Minute): Varies by tier
- TPM (Tokens Per Minute): Combined input + output tokens
- RPD (Requests Per Day): Daily limit
Anthropic (Claude reference):
- RPM: Varies by tier
- Input TPM / Output TPM: Managed separately
5.2 Per-User, Per-Model Rate Limiting
import time
import redis
from dataclasses import dataclass
@dataclass
class RateLimitConfig:
rpm: int # Requests per minute
tpm: int # Tokens per minute
rpd: int # Requests per day
daily_budget: float # USD
# Per-tier configuration
TIER_LIMITS = {
"free": RateLimitConfig(rpm=10, tpm=10000, rpd=100, daily_budget=1.0),
"basic": RateLimitConfig(rpm=50, tpm=50000, rpd=1000, daily_budget=10.0),
"premium": RateLimitConfig(rpm=200, tpm=200000, rpd=10000, daily_budget=100.0),
"enterprise": RateLimitConfig(rpm=1000, tpm=1000000, rpd=100000, daily_budget=1000.0),
}
class LLMRateLimiter:
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
def check_rate_limit(
self,
user_id: str,
tier: str,
model: str,
estimated_tokens: int,
) -> dict:
limits = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
now_minute = int(time.time() / 60)
now_day = int(time.time() / 86400)
pipe = self.redis.pipeline()
# RPM check
rpm_key = f"rl:rpm:{user_id}:{now_minute}"
pipe.incr(rpm_key)
pipe.expire(rpm_key, 120)
# TPM check
tpm_key = f"rl:tpm:{user_id}:{now_minute}"
pipe.incrby(tpm_key, estimated_tokens)
pipe.expire(tpm_key, 120)
# RPD check
rpd_key = f"rl:rpd:{user_id}:{now_day}"
pipe.incr(rpd_key)
pipe.expire(rpd_key, 172800)
results = pipe.execute()
current_rpm = results[0]
current_tpm = results[2]
current_rpd = results[4]
# Check limit violations
if current_rpm > limits.rpm:
return {
"allowed": False,
"reason": "RPM limit exceeded",
"limit": limits.rpm,
"current": current_rpm,
"retry_after": 60,
}
if current_tpm > limits.tpm:
return {
"allowed": False,
"reason": "TPM limit exceeded",
"limit": limits.tpm,
"current": current_tpm,
"retry_after": 60,
}
if current_rpd > limits.rpd:
return {
"allowed": False,
"reason": "Daily request limit exceeded",
"limit": limits.rpd,
"current": current_rpd,
"retry_after": 3600,
}
return {"allowed": True}
5.3 Token Counting and Cost Estimation
import tiktoken
# Per-model pricing (USD per 1K tokens, reference)
MODEL_PRICING = {
"gpt-4o": {"input": 0.0025, "output": 0.01},
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
"claude-haiku-35": {"input": 0.0008, "output": 0.004},
}
def estimate_cost(
model: str,
input_text: str,
estimated_output_tokens: int = 500,
) -> dict:
# Calculate input token count
try:
enc = tiktoken.encoding_for_model(model)
input_tokens = len(enc.encode(input_text))
except KeyError:
# Approximate for models not supported by tiktoken
input_tokens = len(input_text) // 4
pricing = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})
input_cost = (input_tokens / 1000) * pricing["input"]
output_cost = (estimated_output_tokens / 1000) * pricing["output"]
return {
"input_tokens": input_tokens,
"estimated_output_tokens": estimated_output_tokens,
"estimated_cost_usd": round(input_cost + output_cost, 6),
}
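Leaving tiktoken aside, the cost arithmetic alone can be sanity-checked with the reference prices from the table above:

```python
# gpt-4o-mini reference prices from the table (USD per 1K tokens)
PRICING = {"input": 0.00015, "output": 0.0006}

input_tokens, output_tokens = 1000, 500
cost = (input_tokens / 1000) * PRICING["input"] \
     + (output_tokens / 1000) * PRICING["output"]
print(round(cost, 6))  # 0.00045 -> roughly $0.45 per thousand such requests
```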
5.4 Budget Enforcement
import time
import redis

class BudgetEnforcer:
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
def check_budget(
self,
user_id: str,
estimated_cost: float,
daily_limit: float,
monthly_limit: float,
) -> dict:
today = time.strftime("%Y-%m-%d")
month = time.strftime("%Y-%m")
daily_key = f"budget:daily:{user_id}:{today}"
monthly_key = f"budget:monthly:{user_id}:{month}"
daily_spent = float(self.redis.get(daily_key) or 0)
monthly_spent = float(self.redis.get(monthly_key) or 0)
if daily_spent + estimated_cost > daily_limit:
return {
"allowed": False,
"reason": "Daily budget exceeded",
"spent": daily_spent,
"limit": daily_limit,
}
if monthly_spent + estimated_cost > monthly_limit:
return {
"allowed": False,
"reason": "Monthly budget exceeded",
"spent": monthly_spent,
"limit": monthly_limit,
}
return {"allowed": True}
def record_spend(self, user_id: str, cost: float):
today = time.strftime("%Y-%m-%d")
month = time.strftime("%Y-%m")
pipe = self.redis.pipeline()
pipe.incrbyfloat(f"budget:daily:{user_id}:{today}", cost)
pipe.expire(f"budget:daily:{user_id}:{today}", 172800)
pipe.incrbyfloat(f"budget:monthly:{user_id}:{month}", cost)
pipe.expire(f"budget:monthly:{user_id}:{month}", 2764800)
pipe.execute()
5.5 Queue + Rate Limiter Combination Pattern
+--------+     +--------------+     +----------------+     +-----------+
| Client +---->| API Gateway  +---->| Rate Limiter   +---->| Queue     |
+--------+     | (check       |     | (Token Bucket  |     | (Redis/   |
               |  budget)     |     |  + Budget)     |     |  RabbitMQ)|
               +--------------+     +----------------+     +-----+-----+
                                                                 |
                                    +----------------+           |
                                    | Worker Pool    |<----------+
                                    | (rate-aware    |
                                    |  LLM caller)   |
                                    +-------+--------+
                                            |
                                    +-------v--------+
                                    | LLM Provider   |
                                    | (OpenAI, etc.) |
                                    +----------------+
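The worker half of the diagram can be sketched roughly as follows. The in-process `queue.Queue` and the `call_llm` stub are stand-ins for Redis/RabbitMQ and the real provider client, and the pacing is a simple fixed-interval spacing rather than a full token bucket:

```python
import queue
import time

def call_llm(prompt: str) -> str:
    """Stand-in for the real provider call (OpenAI, Anthropic, ...)."""
    return f"response to: {prompt}"

class RateAwareWorker:
    """Pulls jobs from a queue and spaces provider calls to stay under
    an RPM ceiling; queued jobs wait instead of being dropped."""
    def __init__(self, jobs: "queue.Queue[str]", rpm: int):
        self.jobs = jobs
        self.min_interval = 60.0 / rpm   # seconds between calls
        self.last_call = 0.0

    def drain(self) -> list[str]:
        results = []
        while True:
            try:
                prompt = self.jobs.get_nowait()
            except queue.Empty:
                return results
            wait = self.min_interval - (time.monotonic() - self.last_call)
            if wait > 0:
                time.sleep(wait)         # pace the call instead of rejecting
            self.last_call = time.monotonic()
            results.append(call_llm(prompt))

jobs: "queue.Queue[str]" = queue.Queue()
for p in ["hello", "world"]:
    jobs.put(p)

worker = RateAwareWorker(jobs, rpm=600)  # 10 calls/second ceiling
print(worker.drain())
```

Because the queue absorbs bursts, the client sees fast enqueues while the provider only ever sees the paced rate.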
6. HTTP Response Headers
6.1 Standard Response
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1679900400
{
"error": {
"code": "rate_limit_exceeded",
"message": "You have exceeded the rate limit. Please retry after 30 seconds.",
"type": "rate_limit_error"
}
}
6.2 Response Header Implementation
from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse
app = FastAPI()
@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
api_key = request.headers.get("X-API-Key", request.client.host)
    result = rate_limiter.check(api_key)  # any limiter from section 3 (e.g. Redis sliding window)
if not result["allowed"]:
return JSONResponse(
status_code=429,
content={
"error": {
"code": "rate_limit_exceeded",
"message": result["reason"],
}
},
headers={
"Retry-After": str(result.get("retry_after", 60)),
"RateLimit-Limit": str(result["limit"]),
"RateLimit-Remaining": "0",
"RateLimit-Reset": str(result.get("reset_at", "")),
}
)
response = await call_next(request)
# Include rate limit info in success responses too
response.headers["RateLimit-Limit"] = str(result.get("limit", ""))
response.headers["RateLimit-Remaining"] = str(result.get("remaining", ""))
response.headers["RateLimit-Reset"] = str(result.get("reset_at", ""))
return response
7. Monitoring
7.1 Prometheus Metrics
from prometheus_client import Counter, Histogram, Gauge
# Rate Limit metrics
rate_limit_total = Counter(
'rate_limit_requests_total',
'Total rate limited requests',
['user_tier', 'endpoint', 'result'] # result: allowed/rejected
)
rate_limit_remaining = Gauge(
'rate_limit_remaining',
'Remaining rate limit tokens',
['user_id', 'limit_type'] # limit_type: rpm/tpm/rpd
)
request_cost = Histogram(
'llm_request_cost_usd',
'Cost of LLM requests in USD',
['model', 'user_tier'],
buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
budget_utilization = Gauge(
'budget_utilization_ratio',
'Budget utilization ratio (0-1)',
['user_id', 'period'] # period: daily/monthly
)
7.2 Grafana Dashboard Query Examples
# Rate Limit rejection rate
rate(rate_limit_requests_total{result="rejected"}[5m])
/
rate(rate_limit_requests_total[5m])
# Daily cost by user tier
sum by (user_tier) (
increase(llm_request_cost_usd_sum[24h])
)
# Top 10 budget consumption users
topk(10, budget_utilization_ratio{period="daily"})
8. Best Practices and Anti-Patterns
8.1 Best Practices
- Layered Rate Limiting: Apply multiple layers - global, per-service, per-user
- Graceful Degradation: Provide cached or simplified responses when rate limits are exceeded
- Header Exposure: Provide RateLimit headers so clients know their remaining quota
- Adaptive Limits: Dynamically adjust rate limits based on server load
- Monitoring and Alerts: Alert when rate limit rejection rates exceed thresholds
- Documentation: Clearly document rate limit policies in API documentation
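The Adaptive Limits point can be sketched as a load-scaled quota. The thresholds and scaling factors below are illustrative assumptions, not prescriptive values:

```python
def adaptive_limit(base_limit: int, load: float) -> int:
    """Scale the per-user limit down as server load (0.0-1.0) rises.
    Thresholds and factors here are illustrative, not prescriptive."""
    if load < 0.7:
        return base_limit                 # normal operation: full quota
    if load < 0.9:
        return int(base_limit * 0.5)      # degraded: halve everyone's quota
    return int(base_limit * 0.1)          # overloaded: emergency throttle

print(adaptive_limit(100, 0.5))   # 100
print(adaptive_limit(100, 0.8))   # 50
print(adaptive_limit(100, 0.95))  # 10
```

In practice the load signal would come from CPU, queue depth, or upstream latency, and the computed limit would feed whichever limiter is in front of the service.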
8.2 Anti-Patterns
1. Rate Limiting by Client IP Only
Problem: Multiple users behind NAT/proxy share one IP
Solution: API Key + IP combination, or auth-based rate limiting
2. Single-Layer Rate Limiting
Problem: Global-only limits affect normal users too
Solution: Per-user, per-endpoint, per-tier multi-layer approach
3. Hardcoded Rate Limit Values
Problem: Requires deployment to change
Solution: Dynamic management via config server or environment variables
4. No Rate Limiting on Internal APIs
Problem: Failure propagation between internal services
Solution: Apply rate limiting to internal APIs (alongside circuit breakers)
5. Clients Without Retry Logic
Problem: Immediate failure on 429 response
Solution: Implement retry logic that respects the Retry-After header
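The fix for anti-pattern 5 might look like the client-side sketch below. The transport (`send`) is an injected stand-in for a real HTTP call so the example stays self-contained:

```python
import time
from typing import Callable, Tuple

def request_with_retry(
    send: Callable[[], Tuple[int, dict]],
    max_retries: int = 3,
) -> Tuple[int, dict]:
    """Call `send` and, on a 429, sleep for the server-provided
    Retry-After (falling back to exponential backoff) before retrying."""
    for attempt in range(max_retries + 1):
        status, headers = send()
        if status != 429 or attempt == max_retries:
            return status, headers
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    return status, headers

# Fake transport: rate limited once, then succeeds.
responses = iter([(429, {"Retry-After": "0"}), (200, {})])
status, _ = request_with_retry(lambda: next(responses))
print(status)  # 200
```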
9. Conclusion
Rate Limiting is not just a simple defense technique but a core infrastructure component that ensures service stability, fairness, and cost efficiency.
Key Takeaways:
- Algorithm Selection: Sliding Window Counter is the most versatile for production use
- Implementation Tools: Redis + Lua is most reliable in distributed environments
- LLM Cost Control: RPM/TPM Rate Limiting + Budget Enforcement combination is essential
- Distributed Environments: Choose between centralized (Redis) or local + sync approaches
- Monitoring: Monitor rate limit rejection rates, cost tracking, and budget consumption
Proper rate limiting not only protects services but also provides users with predictable service experiences.
References
- IETF RateLimit Headers Draft: https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/
- Redis Rate Limiting Patterns: https://redis.io/tutorials/howtos/ratelimiting/
- Nginx Rate Limiting: https://www.nginx.com/blog/rate-limiting-nginx/
- Kong Rate Limiting Plugin: https://docs.konghq.com/hub/kong-inc/rate-limiting/
- Bucket4j: https://github.com/bucket4j/bucket4j