
Overview

When operating APIs, you inevitably face various situations such as unexpected traffic spikes, DDoS attacks, and excessive calls from specific users.

Rate Limiting is the most fundamental yet effective defense mechanism against these issues.

This post covers core rate limiting algorithms, practical implementations with Redis, Nginx, and Kong, LLM API cost control patterns, and distributed rate limiting strategies.

1. What is Rate Limiting

1.1 Why Is It Needed

Rate Limiting is a technique that restricts the number of requests allowed within a given time period.

**Primary Purposes:**

- **DDoS Defense**: Protect services from malicious mass requests

- **Cost Control**: Manage external API call costs (especially LLM APIs)

- **Fair Usage**: Prevent specific users from monopolizing resources

- **Service Stability**: Prevent service outages due to overload

- **SLA Guarantee**: Provide consistent service quality to all users

1.2 Where to Apply Rate Limiting

Client --> [CDN/WAF] --> [Load Balancer] --> [API Gateway] --> [Application]

- CDN/WAF: L3/L4 limits
- Load Balancer: connection limits
- API Gateway: request rate limits
- Application: business logic limits

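Rate limiting can be applied at multiple layers of the infrastructure. The **API Gateway** layer is usually the most practical place for per-client request limits, because it sees every request together with the caller's identity (API key or token) before the request reaches application code.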

2. Rate Limiting Algorithms

2.1 Token Bucket

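One of the most widely used algorithms; AWS API Gateway, for example, uses it for throttling. (Nginx's `limit_req` module, by contrast, is documented as a leaky bucket.)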

**How It Works:**

- Tokens are added to the bucket at a constant rate

- Each request consumes one token

- Requests are rejected when no tokens remain

- Burst traffic is allowed up to bucket capacity

Bucket Capacity: 10 tokens

Refill Rate: 2 tokens/second

Time 0s: [##########] 10 tokens -> Request OK (9 left)

Time 0s: [#########.] 9 tokens -> Request OK (8 left)

...

Time 0s: [..........] 0 tokens -> Request REJECTED

Time 1s: [##........] 2 tokens -> Refilled

    import threading
    import time


    class TokenBucket:
        def __init__(self, capacity: int, refill_rate: float):
            self.capacity = capacity
            self.refill_rate = refill_rate  # tokens per second
            self.tokens = float(capacity)
            self.last_refill_time = time.monotonic()
            self.lock = threading.Lock()

        def _refill(self):
            # Add tokens for the time elapsed since the last refill, capped at capacity
            now = time.monotonic()
            elapsed = now - self.last_refill_time
            new_tokens = elapsed * self.refill_rate
            self.tokens = min(self.capacity, self.tokens + new_tokens)
            self.last_refill_time = now

        def consume(self, tokens: int = 1) -> bool:
            with self.lock:
                self._refill()
                if self.tokens >= tokens:
                    self.tokens -= tokens
                    return True
                return False

        def wait_time(self) -> float:
            """Time to wait for the next token"""
            with self.lock:
                self._refill()
                if self.tokens >= 1:
                    return 0.0
                return (1 - self.tokens) / self.refill_rate


    # Usage example
    bucket = TokenBucket(capacity=10, refill_rate=2.0)

    for i in range(15):
        if bucket.consume():
            print(f"Request {i+1}: Allowed")
        else:
            wait = bucket.wait_time()
            print(f"Request {i+1}: Rejected (wait {wait:.2f}s)")

**Pros:** Allows bursts, simple implementation, memory efficient

**Cons:** Difficult to track exact request counts

2.2 Leaky Bucket

An algorithm that processes requests at a constant rate.

Incoming Requests --> [Bucket/Queue] --> Processing (fixed rate)

Overflow --> Rejected

Processing Rate: 2 requests/second (fixed)

Queue Size: 5

Fast incoming requests queue up; rejected when queue is full

    import threading
    import time
    from collections import deque


    class LeakyBucket:
        def __init__(self, capacity: int, leak_rate: float):
            self.capacity = capacity
            self.leak_rate = leak_rate  # requests per second
            self.queue = deque()
            self.lock = threading.Lock()
            self.last_leak_time = time.monotonic()

        def _leak(self):
            # Drain as many queued requests as the elapsed time allows
            now = time.monotonic()
            elapsed = now - self.last_leak_time
            leaked = int(elapsed * self.leak_rate)
            if leaked > 0:
                for _ in range(min(leaked, len(self.queue))):
                    self.queue.popleft()
                # Advance only by the time actually accounted for, so the
                # fractional remainder carries over to the next call
                self.last_leak_time += leaked / self.leak_rate

        def allow(self) -> bool:
            with self.lock:
                self._leak()
                if len(self.queue) < self.capacity:
                    self.queue.append(time.monotonic())
                    return True
                return False

**Pros:** Guarantees constant processing rate

**Cons:** Does not allow bursts, lacks flexibility

2.3 Fixed Window Counter

The simplest algorithm that counts requests within a fixed time window.

Window: 1 minute, Limit: 100 requests

|------- Window 1 -------|------- Window 2 -------|

12:00:00 12:00:59 12:01:00 12:01:59

Count: 0...50...100 Count: 0...50...100

(allowed) (allowed)

Problem: Boundary Burst

100 requests at 12:00:30~12:00:59 + 100 requests at 12:01:00~12:01:30

= 200 requests in 60 seconds (2x the limit!)

    import threading
    import time


    class FixedWindowCounter:
        def __init__(self, limit: int, window_seconds: int):
            self.limit = limit
            self.window_seconds = window_seconds
            self.counters = {}  # key -> (window_start, count)
            self.lock = threading.Lock()

        def allow(self, key: str) -> bool:
            with self.lock:
                now = time.time()
                # Align the window start to a multiple of window_seconds
                window_start = int(now / self.window_seconds) * self.window_seconds

                # Reset the counter when a new window begins
                if key not in self.counters or self.counters[key][0] != window_start:
                    self.counters[key] = (window_start, 0)

                if self.counters[key][1] < self.limit:
                    self.counters[key] = (window_start, self.counters[key][1] + 1)
                    return True
                return False

**Pros:** Very simple implementation, minimal memory usage

**Cons:** Request concentration at window boundaries
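The boundary burst can be reproduced in a few lines. The sketch below is a minimal single-key fixed-window counter that takes the timestamp as an argument; that choice, along with the names `make_fixed_window` and `allow`, is an assumption made here so the example is deterministic, not part of the implementation above.

```python
def make_fixed_window(limit: int, window_seconds: int):
    """Return an allow(now) function for a single key; `now` is a time in seconds."""
    state = {"window_start": None, "count": 0}

    def allow(now: float) -> bool:
        window_start = int(now // window_seconds) * window_seconds
        if state["window_start"] != window_start:
            # A new window begins: the counter starts again from zero
            state["window_start"] = window_start
            state["count"] = 0
        if state["count"] < limit:
            state["count"] += 1
            return True
        return False

    return allow


allow = make_fixed_window(limit=100, window_seconds=60)

# 100 requests in the last second of window 1 (t = 59.00 .. 59.99)
late = sum(allow(59.0 + i * 0.01) for i in range(100))
# 100 requests in the first second of window 2 (t = 60.00 .. 60.99)
early = sum(allow(60.0 + i * 0.01) for i in range(100))

print(late + early)  # 200
```

All 200 requests are accepted within roughly two seconds, even though the configured limit is 100 per minute.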

2.4 Sliding Window Log

Accurately tracks each request by recording timestamps in a log.

Window: 60 seconds, Limit: 5 requests

Log: [12:00:10, 12:00:25, 12:00:40, 12:00:55, 12:01:05]

New request at 12:01:10:

1. Remove entries at or before the cutoff (12:01:10 - 60s = 12:00:10)

-> [12:00:25, 12:00:40, 12:00:55, 12:01:05]

2. Count: 4 < 5 -> Allowed

3. Add to log: [12:00:25, 12:00:40, 12:00:55, 12:01:05, 12:01:10]

    import threading
    import time
    from collections import defaultdict


    class SlidingWindowLog:
        def __init__(self, limit: int, window_seconds: int):
            self.limit = limit
            self.window_seconds = window_seconds
            self.logs = defaultdict(list)  # key -> list of request timestamps
            self.lock = threading.Lock()

        def allow(self, key: str) -> bool:
            with self.lock:
                now = time.time()
                cutoff = now - self.window_seconds

                # Remove old entries
                self.logs[key] = [
                    ts for ts in self.logs[key] if ts > cutoff
                ]

                if len(self.logs[key]) < self.limit:
                    self.logs[key].append(now)
                    return True
                return False

**Pros:** Most accurate rate limiting

**Cons:** High memory usage (stores all request timestamps)

2.5 Sliding Window Counter

Combines the advantages of Fixed Window and Sliding Window Log.

It is a common choice in production because it smooths out the boundary burst while storing only two counters per key.

Window: 60 seconds, Limit: 100 requests

Previous Window Count: 80

Current Window Count: 30

Current Window Progress: 25% (15 seconds elapsed)

Weighted Count = 80 * (1 - 0.25) + 30 = 80 * 0.75 + 30 = 90

90 < 100 -> Allowed

    import threading
    import time


    class SlidingWindowCounter:
        def __init__(self, limit: int, window_seconds: int):
            self.limit = limit
            self.window_seconds = window_seconds
            self.windows = {}  # key -> (prev_count, curr_count, curr_window_start)
            self.lock = threading.Lock()

        def allow(self, key: str) -> bool:
            with self.lock:
                now = time.time()
                curr_window = int(now / self.window_seconds) * self.window_seconds
                window_progress = (now - curr_window) / self.window_seconds

                if key not in self.windows:
                    self.windows[key] = (0, 0, curr_window)

                prev_count, curr_count, stored_window = self.windows[key]

                # Update if window has changed
                if stored_window != curr_window:
                    if curr_window - stored_window >= self.window_seconds * 2:
                        # More than one full window has passed with no requests
                        prev_count = 0
                    else:
                        prev_count = curr_count
                    curr_count = 0

                # Calculate weighted count
                weighted = prev_count * (1 - window_progress) + curr_count

                if weighted < self.limit:
                    curr_count += 1
                    self.windows[key] = (prev_count, curr_count, curr_window)
                    return True
                return False
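The weighted-count arithmetic from the worked example in this section can be checked on its own. The helper name `weighted_count` is introduced here only for illustration. Note that the formula assumes the previous window's requests were spread evenly across that window, which is why the result is an estimate rather than an exact count.

```python
def weighted_count(prev_count: int, curr_count: int, window_progress: float) -> float:
    """Estimate the number of requests in the sliding window.

    window_progress is the fraction of the current fixed window that has
    elapsed (0.0 to 1.0).
    """
    return prev_count * (1.0 - window_progress) + curr_count


# Previous window: 80 requests, current window: 30 requests, 15s into a 60s window
estimate = weighted_count(prev_count=80, curr_count=30, window_progress=0.25)
print(estimate)        # 90.0
print(estimate < 100)  # True, so the request is allowed
```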

2.6 Algorithm Comparison Table

| Algorithm | Accuracy | Memory | Burst Allowed | Complexity | Best For |
| --------------- | -------- | -------- | -------------- | ---------- | -------------------------- |
| Token Bucket | Medium | Low | Yes | Low | General APIs, burst needed |
| Leaky Bucket | Medium | Low | No | Low | Constant rate needed |

3. Implementation Methods

3.1 Redis + Lua Script

Combining Redis atomic operations with Lua scripts enables safe rate limiting even in distributed environments.

Sliding Window Log (Redis Sorted Set + Lua)

    import time

    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    # Lua script: each request is stored as a sorted-set member scored by its
    # timestamp, so this is the log variant of the sliding window
    SLIDING_WINDOW_SCRIPT = """
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])

    local clear_before = now - window
    redis.call('ZREMRANGEBYSCORE', key, '-inf', clear_before)

    local current_count = redis.call('ZCARD', key)
    if current_count < limit then
        redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
        redis.call('EXPIRE', key, window + 1)
        return 1
    else
        return 0
    end
    """

    # register_script returns a callable Script object
    sliding_window_sha = r.register_script(SLIDING_WINDOW_SCRIPT)


    def is_rate_limited(key: str, limit: int = 100, window: int = 60) -> bool:
        """
        Returns True if the request is rate limited, False if allowed.
        """
        now = time.time()
        result = sliding_window_sha(
            keys=[f"ratelimit:{key}"],
            args=[now, window, limit]
        )
        return result == 0  # 0 = rate limited, 1 = allowed


    # Usage example
    user_id = "user-123"
    for i in range(110):
        if is_rate_limited(user_id, limit=100, window=60):
            print(f"Request {i+1}: Rate Limited!")
        else:
            print(f"Request {i+1}: Allowed")

Token Bucket (Redis + Lua)

    import time

    # Reuses the `r` Redis client created in the previous example
    TOKEN_BUCKET_SCRIPT = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    local requested = tonumber(ARGV[4])

    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1])
    local last_refill = tonumber(bucket[2])

    -- First request for this key: start with a full bucket
    if tokens == nil then
        tokens = capacity
        last_refill = now
    end

    local elapsed = now - last_refill
    local new_tokens = elapsed * refill_rate
    tokens = math.min(capacity, tokens + new_tokens)

    local allowed = 0
    if tokens >= requested then
        tokens = tokens - requested
        allowed = 1
    end

    -- HSET accepts multiple field/value pairs (Redis 4.0+); HMSET is deprecated
    redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)

    return allowed
    """

    token_bucket_sha = r.register_script(TOKEN_BUCKET_SCRIPT)


    def token_bucket_check(
        key: str,
        capacity: int = 100,
        refill_rate: float = 10.0,
        tokens: int = 1
    ) -> bool:
        now = time.time()
        result = token_bucket_sha(
            keys=[f"tokenbucket:{key}"],
            args=[capacity, refill_rate, now, tokens]
        )
        return result == 1  # 1 = allowed

3.2 Nginx Rate Limiting

    # nginx.conf
    http {
        # Zone definition: per client IP, 10MB shared memory
        limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

        # Per-user rate limiting (API Key based)
        # Requests without an X-API-Key header have an empty key and are not
        # counted by this zone
        map $http_x_api_key $api_key_zone {
            default $http_x_api_key;
        }
        limit_req_zone $api_key_zone zone=user_limit:10m rate=100r/m;

        server {
            listen 80;

            # Default API endpoint
            location /api/ {
                # burst=20: allow up to 20 requests above the configured rate
                # nodelay: serve requests within the burst immediately instead of spacing them out
                limit_req zone=api_limit burst=20 nodelay;

                # Response code when rate limit exceeded
                limit_req_status 429;

                proxy_pass http://backend;
            }

            # LLM API endpoint (stricter limits)
            location /api/v1/completions {
                limit_req zone=user_limit burst=5 nodelay;
                limit_req_status 429;

                proxy_pass http://llm_backend;
            }

            # Custom error page
            error_page 429 = @rate_limited;
            location @rate_limited {
                default_type application/json;
                return 429 '{"error": "Too Many Requests", "retry_after": 60}';
            }
        }
    }

3.3 Kong Rate Limiting Plugin

    # kong.yml - Declarative Configuration
    _format_version: '3.0'

    services:
      - name: llm-api
        url: http://llm-backend:8000
        routes:
          - name: llm-route
            paths:
              - /api/v1/completions

    plugins:
      # Global Rate Limiting
      - name: rate-limiting
        config:
          minute: 100
          hour: 1000
          policy: redis
          redis_host: redis
          redis_port: 6379
          redis_database: 0
          fault_tolerant: true
          hide_client_headers: false
          error_code: 429
          error_message: 'Rate limit exceeded'

      # Per-Consumer Rate Limiting
      - name: rate-limiting
        consumer: premium-user
        config:
          minute: 500
          hour: 10000
          policy: redis
          redis_host: redis

3.4 Envoy Rate Limiting

    # envoy.yaml
    # Excerpt: other HttpConnectionManager settings (stat_prefix, route_config) are omitted
    static_resources:
      listeners:
        - address:
            socket_address:
              address: 0.0.0.0
              port_value: 8080
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    http_filters:
                      - name: envoy.filters.http.local_ratelimit
                        typed_config:
                          '@type': type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                          stat_prefix: http_local_rate_limiter
                          token_bucket:
                            max_tokens: 100
                            tokens_per_fill: 10
                            fill_interval: 1s
                          filter_enabled:
                            runtime_key: local_rate_limit_enabled
                            default_value:
                              numerator: 100
                              denominator: HUNDRED
                          # Without filter_enforced the filter only records stats
                          # and does not reject requests
                          filter_enforced:
                            runtime_key: local_rate_limit_enforced
                            default_value:
                              numerator: 100
                              denominator: HUNDRED
                      - name: envoy.filters.http.router
                        typed_config:
                          '@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

3.5 Express.js Rate Limiting

    const rateLimit = require('express-rate-limit')
    const RedisStore = require('rate-limit-redis')
    const Redis = require('ioredis')

    const redis = new Redis({
      host: 'localhost',
      port: 6379,
    })

    // Default Rate Limiter
    const apiLimiter = rateLimit({
      windowMs: 60 * 1000, // 1 minute
      max: 100,
      standardHeaders: true, // RateLimit-* headers
      legacyHeaders: false, // X-RateLimit-* headers
      store: new RedisStore({
        sendCommand: (...args) => redis.call(...args),
      }),
      message: {
        error: 'Too Many Requests',
        retryAfter: 60,
      },
      keyGenerator: (req) => {
        return req.headers['x-api-key'] || req.ip
      },
    })

    // Strict Limiter for LLM API
    const llmLimiter = rateLimit({
      windowMs: 60 * 1000,
      max: 20,
      store: new RedisStore({
        sendCommand: (...args) => redis.call(...args),
        prefix: 'rl:llm:',
      }),
      // Fall back to the client IP so requests without an API key do not all share one bucket
      keyGenerator: (req) => req.headers['x-api-key'] || req.ip,
    })

    const app = require('express')()

    app.use('/api/', apiLimiter)
    app.use('/api/v1/completions', llmLimiter)

3.6 Spring Boot Rate Limiting

    // Bucket4j + Redis Based Rate Limiting
    @Configuration
    public class RateLimitConfig {

        @Bean
        public ProxyManager<String> proxyManager(RedissonClient redissonClient) {
            return Bucket4jRedisson.casBasedBuilder(redissonClient)
                .build();
        }
    }

    @Component
    public class RateLimitInterceptor implements HandlerInterceptor {

        private final ProxyManager<String> proxyManager;

        public RateLimitInterceptor(ProxyManager<String> proxyManager) {
            this.proxyManager = proxyManager;
        }

        @Override
        public boolean preHandle(
            HttpServletRequest request,
            HttpServletResponse response,
            Object handler
        ) throws Exception {
            String apiKey = request.getHeader("X-API-Key");
            if (apiKey == null) {
                response.setStatus(401);
                return false;
            }

            // 100 requests per minute per API key
            BucketConfiguration config = BucketConfiguration.builder()
                .addLimit(
                    Bandwidth.builder()
                        .capacity(100)
                        .refillGreedy(100, Duration.ofMinutes(1))
                        .build()
                )
                .build();

            Bucket bucket = proxyManager.builder()
                .build(apiKey, () -> config);

            ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);

            if (probe.isConsumed()) {
                response.setHeader("X-RateLimit-Remaining",
                    String.valueOf(probe.getRemainingTokens()));
                return true;
            } else {
                // Round up so a sub-second wait is not reported as 0
                long waitSeconds = (probe.getNanosToWaitForRefill() + 999_999_999L) / 1_000_000_000L;
                response.setStatus(429);
                response.setHeader("Retry-After", String.valueOf(waitSeconds));
                response.getWriter().write(
                    "{\"error\":\"Rate limit exceeded\"}"
                );
                return false;
            }
        }
    }

4. Distributed Rate Limiting

4.1 Centralized (Redis)

[Diagram: three application nodes all connect to a shared Redis deployment (Redis Primary and Redis Replica), which holds the counters.]

**Pros:** Consistent counts across all nodes

**Cons:** Additional network latency to Redis, Redis can become SPOF
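Because the central store can fail, it helps to decide explicitly what the limiter does when Redis is unreachable. The sketch below wraps any check function with a fail-open or fail-closed policy; `allow_with_fallback`, `check`, and `broken_check` are hypothetical names used only for illustration. Kong's `fault_tolerant: true` setting shown earlier reflects the same fail-open idea.

```python
from typing import Callable


def allow_with_fallback(check: Callable[[str], bool], key: str, fail_open: bool = True) -> bool:
    """Run a rate-limit check; if the backing store fails, apply a fixed policy.

    `check` is any function that returns True when the request is allowed,
    for example a wrapper around a Redis script call (assumed, not shown).
    """
    try:
        return check(key)
    except Exception:  # in real code, narrow this to the client's connection/timeout errors
        # fail_open=True  -> keep serving traffic without limits during the outage
        # fail_open=False -> reject everything until the store is back
        return fail_open


def broken_check(key: str) -> bool:
    """Stand-in for a limiter whose store is down."""
    raise ConnectionError("limiter store unreachable")


print(allow_with_fallback(broken_check, "user-123"))                   # True
print(allow_with_fallback(broken_check, "user-123", fail_open=False))  # False
```

Failing open favors availability and failing closed favors protection of the backend; which one fits depends on what the limit is guarding (for cost limits on paid APIs, failing closed may be the safer default).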

4.2 Local Counter + Synchronization

[Diagram: three nodes keep local counters (30/100, 25/100, 20/100) and periodically sync (every 5s) to a Redis global counter (75/100).]

    import threading
    import time

    import redis


    class DistributedRateLimiter:
        def __init__(
            self,
            redis_client: redis.Redis,
            key_prefix: str,
            global_limit: int,
            window_seconds: int,
            sync_interval: float = 5.0,
            local_threshold: float = 0.1,
        ):
            self.redis = redis_client
            self.key_prefix = key_prefix
            self.global_limit = global_limit
            self.window_seconds = window_seconds
            self.sync_interval = sync_interval

            # Local allowance = 10% of global limit
            self.local_limit = int(global_limit * local_threshold)
            self.local_count = 0
            # Reentrant lock: allow() calls _sync_to_redis() while holding it
            self.lock = threading.RLock()

            # Start periodic sync thread
            self._start_sync_thread()

        def _start_sync_thread(self):
            def sync_loop():
                while True:
                    time.sleep(self.sync_interval)
                    self._sync_to_redis()

            thread = threading.Thread(target=sync_loop, daemon=True)
            thread.start()

        def _sync_to_redis(self):
            with self.lock:
                if self.local_count > 0:
                    key = f"{self.key_prefix}:global"
                    total = self.redis.incrby(key, self.local_count)
                    if total == self.local_count:
                        # Key was just created: start the window's TTL once,
                        # rather than extending it on every sync
                        self.redis.expire(key, self.window_seconds)
                    self.local_count = 0

        def allow(self, key: str) -> bool:
            # Note: `key` is unused; this limiter tracks one shared counter per key_prefix
            with self.lock:
                # Flush to Redis once the local allowance is used up
                if self.local_count >= self.local_limit:
                    self._sync_to_redis()

                # Check global counter, including requests not yet synced from this node
                global_key = f"{self.key_prefix}:global"
                global_count = int(self.redis.get(global_key) or 0)

                if global_count + self.local_count >= self.global_limit:
                    return False

                self.local_count += 1
                return True

4.3 Consistent Hashing for Sharded Rate Limiters

Hash the rate limit key to assign to a specific Redis node

User A (hash=0x3A) --> Redis Node 1 (0x00-0x55)

User B (hash=0x8F) --> Redis Node 2 (0x56-0xAA)

User C (hash=0xC2) --> Redis Node 3 (0xAB-0xFF)

Each user's counter is always managed on the same node

-> Minimized network hops, guaranteed consistency
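The example above assigns fixed hash ranges to nodes. A common refinement, swapped in here as an assumption rather than taken from the text, is a hash ring with virtual nodes, so that adding or removing a Redis node only moves a fraction of the keys. The class name `HashRing` and the node names are illustrative.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of SHA-256 as an unsigned integer
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around at the end
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["redis-1", "redis-2", "redis-3"])
print(ring.node_for("user-A") == ring.node_for("user-A"))  # True: same key, same node
```

With this layout, a rate limit key such as `ratelimit:user-A` is always routed to the same node, and adding a fourth node only reassigns the keys that now fall on that node's ring positions.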

5. LLM API Cost Control

5.1 Understanding OpenAI/Anthropic API Rate Limits

OpenAI (GPT-4o reference):

- RPM (Requests Per Minute): Varies by tier

- TPM (Tokens Per Minute): Combined input + output tokens

- RPD (Requests Per Day): Daily limit

Anthropic (Claude reference):

- RPM: Varies by tier

- Input TPM / Output TPM: Managed separately

5.2 Per-User, Per-Model Rate Limiting

    import time
    from dataclasses import dataclass

    import redis


    @dataclass
    class RateLimitConfig:
        rpm: int  # Requests per minute
        tpm: int  # Tokens per minute
        rpd: int  # Requests per day
        daily_budget: float  # USD


    # Per-tier configuration
    TIER_LIMITS = {
        "free": RateLimitConfig(rpm=10, tpm=10000, rpd=100, daily_budget=1.0),
        "basic": RateLimitConfig(rpm=50, tpm=50000, rpd=1000, daily_budget=10.0),
        "premium": RateLimitConfig(rpm=200, tpm=200000, rpd=10000, daily_budget=100.0),
        "enterprise": RateLimitConfig(rpm=1000, tpm=1000000, rpd=100000, daily_budget=1000.0),
    }


    class LLMRateLimiter:
        def __init__(self, redis_client: redis.Redis):
            self.redis = redis_client

        def check_rate_limit(
            self,
            user_id: str,
            tier: str,
            model: str,
            estimated_tokens: int,
        ) -> dict:
            # Note: `model` is accepted but not part of the keys below, so the
            # limits are shared across models for a given user
            limits = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
            now = time.time()
            now_minute = int(now / 60)
            now_day = int(now / 86400)

            # Counters are incremented before the check, so rejected requests
            # also count toward the limits
            pipe = self.redis.pipeline()

            # RPM check
            rpm_key = f"rl:rpm:{user_id}:{now_minute}"
            pipe.incr(rpm_key)
            pipe.expire(rpm_key, 120)

            # TPM check
            tpm_key = f"rl:tpm:{user_id}:{now_minute}"
            pipe.incrby(tpm_key, estimated_tokens)
            pipe.expire(tpm_key, 120)

            # RPD check
            rpd_key = f"rl:rpd:{user_id}:{now_day}"
            pipe.incr(rpd_key)
            pipe.expire(rpd_key, 172800)

            # Results alternate: [incr, expire, incrby, expire, incr, expire]
            results = pipe.execute()
            current_rpm = results[0]
            current_tpm = results[2]
            current_rpd = results[4]

            # Check limit violations
            if current_rpm > limits.rpm:
                return {
                    "allowed": False,
                    "reason": "RPM limit exceeded",
                    "limit": limits.rpm,
                    "current": current_rpm,
                    "retry_after": 60,
                }

            if current_tpm > limits.tpm:
                return {
                    "allowed": False,
                    "reason": "TPM limit exceeded",
                    "limit": limits.tpm,
                    "current": current_tpm,
                    "retry_after": 60,
                }

            if current_rpd > limits.rpd:
                return {
                    "allowed": False,
                    "reason": "Daily request limit exceeded",
                    "limit": limits.rpd,
                    "current": current_rpd,
                    "retry_after": 3600,
                }

            return {"allowed": True}

5.3 Token Counting and Cost Estimation

    import tiktoken

    # Per-model pricing (USD per 1K tokens, reference)
    MODEL_PRICING = {
        "gpt-4o": {"input": 0.0025, "output": 0.01},
        "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
        "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
        "claude-haiku-35": {"input": 0.0008, "output": 0.004},
    }


    def estimate_cost(
        model: str,
        input_text: str,
        estimated_output_tokens: int = 500,
    ) -> dict:
        # Calculate input token count
        try:
            enc = tiktoken.encoding_for_model(model)
            input_tokens = len(enc.encode(input_text))
        except KeyError:
            # Approximate for models not supported by tiktoken (about 4 characters per token)
            input_tokens = len(input_text) // 4

        # Unknown models fall back to a conservative default price
        pricing = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})

        input_cost = (input_tokens / 1000) * pricing["input"]
        output_cost = (estimated_output_tokens / 1000) * pricing["output"]

        return {
            "input_tokens": input_tokens,
            "estimated_output_tokens": estimated_output_tokens,
            "estimated_cost_usd": round(input_cost + output_cost, 6),
        }

5.4 Budget Enforcement

    import time

    import redis


    class BudgetEnforcer:
        def __init__(self, redis_client: redis.Redis):
            self.redis = redis_client

        def check_budget(
            self,
            user_id: str,
            estimated_cost: float,
            daily_limit: float,
            monthly_limit: float,
        ) -> dict:
            today = time.strftime("%Y-%m-%d")
            month = time.strftime("%Y-%m")

            daily_key = f"budget:daily:{user_id}:{today}"
            monthly_key = f"budget:monthly:{user_id}:{month}"

            daily_spent = float(self.redis.get(daily_key) or 0)
            monthly_spent = float(self.redis.get(monthly_key) or 0)

            if daily_spent + estimated_cost > daily_limit:
                return {
                    "allowed": False,
                    "reason": "Daily budget exceeded",
                    "spent": daily_spent,
                    "limit": daily_limit,
                }

            if monthly_spent + estimated_cost > monthly_limit:
                return {
                    "allowed": False,
                    "reason": "Monthly budget exceeded",
                    "spent": monthly_spent,
                    "limit": monthly_limit,
                }

            return {"allowed": True}

        def record_spend(self, user_id: str, cost: float):
            today = time.strftime("%Y-%m-%d")
            month = time.strftime("%Y-%m")

            pipe = self.redis.pipeline()
            pipe.incrbyfloat(f"budget:daily:{user_id}:{today}", cost)
            pipe.expire(f"budget:daily:{user_id}:{today}", 172800)  # 2 days
            pipe.incrbyfloat(f"budget:monthly:{user_id}:{month}", cost)
            pipe.expire(f"budget:monthly:{user_id}:{month}", 2764800)  # 32 days
            pipe.execute()

5.5 Queue + Rate Limiter Combination Pattern

[Diagram: incoming requests pass through the upstream components, including a queue and a rate limiter, to a Worker Pool (rate-aware LLM caller), which calls the LLM Provider (OpenAI, etc.).]
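A minimal sketch of the worker side of this pattern, assuming a single worker and an in-process `queue.Queue`: instead of rejecting excess requests, the worker waits so that provider calls are spaced at least `1 / max_per_second` apart. The function name `drain_with_rate_limit`, the `call_llm` callable, and the injectable `clock` and `sleep` parameters are assumptions made for illustration and testability.

```python
import queue
import time
from typing import Callable, List


def drain_with_rate_limit(
    jobs: "queue.Queue[str]",
    call_llm: Callable[[str], str],
    max_per_second: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Process queued jobs, spacing calls at least 1/max_per_second apart."""
    min_interval = 1.0 / max_per_second
    next_allowed = clock()
    results = []
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return results
        delay = next_allowed - clock()
        if delay > 0:
            sleep(delay)  # wait instead of rejecting: the queue absorbs the burst
        results.append(call_llm(job))
        next_allowed = max(next_allowed, clock()) + min_interval


jobs = queue.Queue()
for prompt in ["a", "b", "c"]:
    jobs.put(prompt)

# str.upper is a stand-in for a real provider call
print(drain_with_rate_limit(jobs, call_llm=str.upper, max_per_second=50))  # ['A', 'B', 'C']
```

The trade-off compared with returning 429 is latency: callers wait longer during bursts, but fewer requests fail and the provider's own RPM limit is less likely to be hit.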

6. HTTP Response Headers

6.1 Standard Response

    HTTP/1.1 429 Too Many Requests
    Content-Type: application/json
    Retry-After: 30
    RateLimit-Limit: 100
    RateLimit-Remaining: 0
    RateLimit-Reset: 1679900400

    {
      "error": {
        "code": "rate_limit_exceeded",
        "message": "You have exceeded the rate limit. Please retry after 30 seconds.",
        "type": "rate_limit_error"
      }
    }

6.2 Response Header Implementation

    from fastapi import FastAPI, Request, Response
    from fastapi.responses import JSONResponse

    app = FastAPI()

    # `rate_limiter` is assumed to be defined elsewhere; check() returns a dict with
    # "allowed", "limit", "remaining", "reset_at" and, on rejection, "reason"


    @app.middleware("http")
    async def rate_limit_middleware(request: Request, call_next):
        # Identify the caller by API key, falling back to the client IP
        api_key = request.headers.get("X-API-Key", request.client.host)
        result = rate_limiter.check(api_key)

        if not result["allowed"]:
            return JSONResponse(
                status_code=429,
                content={
                    "error": {
                        "code": "rate_limit_exceeded",
                        "message": result["reason"],
                    }
                },
                headers={
                    "Retry-After": str(result.get("retry_after", 60)),
                    "RateLimit-Limit": str(result["limit"]),
                    "RateLimit-Remaining": "0",
                    "RateLimit-Reset": str(result.get("reset_at", "")),
                }
            )

        response = await call_next(request)

        # Include rate limit info in success responses too
        response.headers["RateLimit-Limit"] = str(result.get("limit", ""))
        response.headers["RateLimit-Remaining"] = str(result.get("remaining", ""))
        response.headers["RateLimit-Reset"] = str(result.get("reset_at", ""))
        return response
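The header values themselves can be derived from the limiter's window state. The helper below is a sketch for a fixed window; the function name `rate_limit_headers` and its parameters are assumptions. It reports `RateLimit-Reset` as seconds remaining, which is how earlier revisions of the IETF draft define the field; APIs that follow the older `X-RateLimit-Reset` convention often send a Unix timestamp instead, so check what your clients expect.

```python
import math


def rate_limit_headers(
    limit: int, used: int, window_start: float, window_seconds: int, now: float
) -> dict:
    """Build RateLimit-* header values for a fixed window."""
    # Seconds until the current window ends, rounded up and never negative
    reset_in = max(0, math.ceil(window_start + window_seconds - now))
    remaining = max(0, limit - used)
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(remaining),
        "RateLimit-Reset": str(reset_in),
    }
    if remaining == 0:
        # Only tell the client to back off once the quota is exhausted
        headers["Retry-After"] = str(reset_in)
    return headers


h = rate_limit_headers(limit=100, used=100, window_start=1200.0, window_seconds=60, now=1230.0)
print(h["RateLimit-Remaining"], h["Retry-After"])  # 0 30
```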

7. Monitoring

7.1 Prometheus Metrics

    from prometheus_client import Counter, Histogram, Gauge

    # Rate Limit metrics
    rate_limit_total = Counter(
        'rate_limit_requests_total',
        'Total rate limited requests',
        ['user_tier', 'endpoint', 'result']  # result: allowed/rejected
    )

    rate_limit_remaining = Gauge(
        'rate_limit_remaining',
        'Remaining rate limit tokens',
        ['user_id', 'limit_type']  # limit_type: rpm/tpm/rpd
    )

    request_cost = Histogram(
        'llm_request_cost_usd',
        'Cost of LLM requests in USD',
        ['model', 'user_tier'],
        buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
    )

    budget_utilization = Gauge(
        'budget_utilization_ratio',
        'Budget utilization ratio (0-1)',
        ['user_id', 'period']  # period: daily/monthly
    )

7.2 Grafana Dashboard Query Examples

    # Rate Limit rejection rate
    # (sum() drops the differing `result` label so the two sides can be divided)
    sum(rate(rate_limit_requests_total{result="rejected"}[5m]))
    /
    sum(rate(rate_limit_requests_total[5m]))

    # Daily cost by user tier
    sum by (user_tier) (
      increase(llm_request_cost_usd_sum[24h])
    )

    # Top 10 budget consumption users
    topk(10, budget_utilization_ratio{period="daily"})

8. Best Practices and Anti-Patterns

8.1 Best Practices

1. **Layered Rate Limiting**: Apply multiple layers - global, per-service, per-user

2. **Graceful Degradation**: Provide cached or simplified responses when rate limits are exceeded

3. **Header Exposure**: Provide RateLimit headers so clients know their remaining quota

4. **Adaptive Limits**: Dynamically adjust rate limits based on server load

5. **Monitoring and Alerts**: Alert when rate limit rejection rates exceed thresholds

6. **Documentation**: Clearly document rate limit policies in API documentation

8.2 Anti-Patterns

1. Rate Limiting by Client IP Only

Problem: Multiple users behind NAT/proxy share one IP

Solution: API Key + IP combination, or auth-based rate limiting

2. Single-Layer Rate Limiting

Problem: Global-only limits affect normal users too

Solution: Per-user, per-endpoint, per-tier multi-layer approach

3. Hardcoded Rate Limit Values

Problem: Requires deployment to change

Solution: Dynamic management via config server or environment variables

4. No Rate Limiting on Internal APIs

Problem: Failure propagation between internal services

Solution: Apply rate limiting to internal APIs (alongside circuit breakers)

5. Clients Without Retry Logic

Problem: Immediate failure on 429 response

Solution: Implement retry logic that respects the Retry-After header
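For the last anti-pattern, a client-side retry loop might look like the sketch below. It honors `Retry-After` when the server sends it as a number of seconds and otherwise falls back to exponential backoff with jitter. The `send` callable and the `(status, headers, body)` tuple are stand-ins for a real HTTP client, and all names are illustrative; `Retry-After` values in HTTP-date form are not handled here.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# (status, headers, body): a stand-in for a real HTTP response object
Response = Tuple[int, Mapping[str, str], str]


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on 429, honoring Retry-After when present."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        # Return on success, on any non-429 error, or when attempts are exhausted
        if status != 429 or attempt == max_attempts - 1:
            return status, headers, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)
        else:
            # Exponential backoff with full jitter
            delay = random.uniform(0, base_delay * 2 ** attempt)
        sleep(min(delay, max_delay))
    raise ValueError("max_attempts must be at least 1")


# Usage with a fake server that rejects twice, then succeeds
_responses = iter([(429, {"Retry-After": "1"}, ""), (429, {}, ""), (200, {}, "ok")])
status, _, body = call_with_retry(lambda: next(_responses), sleep=lambda s: None)
print(status, body)  # 200 ok
```

The jitter matters when many clients are throttled at once: without it, they all retry at the same moment and trigger the limit again.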

9. Conclusion

Rate Limiting is not just a simple defense technique but a core infrastructure component that ensures service stability, fairness, and cost efficiency.

**Key Takeaways:**

1. **Algorithm Selection**: Sliding Window Counter is the most versatile for production use

2. **Implementation Tools**: Redis + Lua is most reliable in distributed environments

3. **LLM Cost Control**: RPM/TPM Rate Limiting + Budget Enforcement combination is essential

4. **Distributed Environments**: Choose between centralized (Redis) or local + sync approaches

5. **Monitoring**: Monitor rate limit rejection rates, cost tracking, and budget consumption

Proper rate limiting not only protects services but also provides users with predictable service experiences.

References

- IETF RateLimit Headers Draft: https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/

- Redis Rate Limiting Patterns: https://redis.io/tutorials/howtos/ratelimiting/

- Nginx Rate Limiting: https://www.nginx.com/blog/rate-limiting-nginx/

- Kong Rate Limiting Plugin: https://docs.konghq.com/hub/kong-inc/rate-limiting/

- Bucket4j: https://github.com/bucket4j/bucket4j