[Architecture] Complete Guide to Rate Limiting and API Throttling

Overview

When operating APIs, you inevitably face situations such as unexpected traffic spikes, DDoS attacks, and excessive calls from specific users. Rate Limiting is the most fundamental yet effective defense against these problems. This post covers everything from core rate limiting algorithms to practical implementations with Redis, Nginx, and Kong, LLM API cost control patterns, and rate limiting in distributed environments.


1. What Is Rate Limiting

1.1 Why It's Needed

Rate Limiting is a technique that restricts the number of requests allowed within a unit of time.

Primary Purposes:

  • DDoS defense: protect the service from malicious mass requests
  • Cost control: manage external API call costs (especially LLM APIs)
  • Fair usage: prevent individual users from monopolizing resources
  • Service stability: prevent outages caused by overload
  • SLA guarantees: provide a consistent level of service quality to all users

1.2 Where Rate Limiting Sits

Client --> [CDN/WAF] --> [Load Balancer] --> [API Gateway] --> [Application]
              |                |                  |                 |
          L3/L4 limits   Connection limits  Request rate limits  Business-logic limits

Rate Limiting can be applied at multiple layers of the infrastructure; the API Gateway layer is usually the most effective place.


2. Rate Limiting Algorithms

2.1 Token Bucket

The most widely used algorithm, adopted by AWS API Gateway and Nginx.

How It Works:

  • Tokens are added to the bucket at a constant rate
  • Each request consumes one token
  • Requests are rejected when no tokens remain
  • Burst traffic is allowed up to bucket capacity

Bucket Capacity: 10 tokens
Refill Rate: 2 tokens/second

Time 0s: [##########] 10 tokens -> Request OK (9 left)
Time 0s: [#########.] 9 tokens  -> Request OK (8 left)
...
Time 0s: [..........] 0 tokens  -> Request REJECTED
Time 1s: [##........] 2 tokens  -> Refilled

import time
import threading

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill_time = time.monotonic()
        self.lock = threading.Lock()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill_time
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill_time = now

    def consume(self, tokens: int = 1) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def wait_time(self) -> float:
        """다음 토큰까지 대기 시간"""
        with self.lock:
            self._refill()
            if self.tokens >= 1:
                return 0.0
            return (1 - self.tokens) / self.refill_rate


# Usage example
bucket = TokenBucket(capacity=10, refill_rate=2.0)

for i in range(15):
    if bucket.consume():
        print(f"Request {i+1}: Allowed")
    else:
        wait = bucket.wait_time()
        print(f"Request {i+1}: Rejected (wait {wait:.2f}s)")

Pros: allows bursts, simple to implement, memory efficient
Cons: hard to track the exact request count

2.2 Leaky Bucket

An algorithm that processes requests at a constant rate.

Incoming Requests --> [Bucket/Queue] --> Processing (fixed rate)
                         |
                      Overflow --> Rejected

Processing Rate: 2 requests/second (fixed)
Queue Size: 5

Requests arriving faster than the leak rate wait in the queue; once the queue is full, new requests are rejected.

import time
import threading
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.lock = threading.Lock()
        self.last_leak_time = time.monotonic()

    def _leak(self):
        now = time.monotonic()
        elapsed = now - self.last_leak_time
        leaked = int(elapsed * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak_time = now

    def allow(self) -> bool:
        with self.lock:
            self._leak()
            if len(self.queue) < self.capacity:
                self.queue.append(time.monotonic())
                return True
            return False

Pros: guarantees a constant processing rate
Cons: no burst allowance, so it lacks flexibility

2.3 Fixed Window Counter

The simplest algorithm: it counts requests within fixed time windows.

Window: 1 minute, Limit: 100 requests

|------- Window 1 -------|------- Window 2 -------|
12:00:00     12:00:59    12:01:00     12:01:59

  Count: 0...50...100       Count: 0...50...100
         (allowed)                 (allowed)

Problem: Boundary Burst
100 requests in 12:00:30~12:00:59 + 100 requests in 12:01:00~12:01:30
= 200 requests processed within 60 seconds (twice the limit!)

import time
import threading

class FixedWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = {}  # key -> (window_start, count)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            window_start = int(now / self.window_seconds) * self.window_seconds

            if key not in self.counters or self.counters[key][0] != window_start:
                self.counters[key] = (window_start, 0)

            if self.counters[key][1] < self.limit:
                self.counters[key] = (window_start, self.counters[key][1] + 1)
                return True
            return False

Pros: very simple implementation, minimal memory usage
Cons: requests pile up at window boundaries
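
The boundary burst described above is easy to reproduce. Below is a sketch using the same counter logic, re-declared with an injectable clock (and the lock omitted) so the snippet runs standalone and deterministically:

```python
import time

class FixedWindowCounter:
    """Same logic as the class above, with an injectable clock
    so the boundary burst can be demonstrated deterministically."""
    def __init__(self, limit: int, window_seconds: int, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = {}  # key -> (window_start, count)
        self.clock = clock

    def allow(self, key: str) -> bool:
        now = self.clock()
        window_start = int(now / self.window_seconds) * self.window_seconds
        start, count = self.counters.get(key, (None, 0))
        if start != window_start:
            start, count = window_start, 0  # new window: reset the counter
        if count < self.limit:
            self.counters[key] = (start, count + 1)
            return True
        return False

# 100 requests at t=59.5s (end of window 1), 100 more at t=60.5s (start of window 2)
t = [59.5]
limiter = FixedWindowCounter(limit=100, window_seconds=60, clock=lambda: t[0])

allowed_late = sum(limiter.allow("user") for _ in range(100))
t[0] = 60.5
allowed_early = sum(limiter.allow("user") for _ in range(100))

# 200 requests accepted within one second, twice the nominal limit
print(allowed_late + allowed_early)  # 200
```

All 200 requests pass because each window's counter starts fresh, which is exactly the weakness the sliding-window variants below address.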

2.4 Sliding Window Log

Tracks requests precisely by keeping a log of each request's timestamp.

Window: 60 seconds, Limit: 5 requests

Log: [12:00:10, 12:00:25, 12:00:40, 12:00:55, 12:01:05]

New request at 12:01:10:
  1. Remove entries older than 60s (before 12:00:10)
     -> [12:00:25, 12:00:40, 12:00:55, 12:01:05]
  2. Count: 4 < 5 -> Allowed
  3. Add to log: [12:00:25, 12:00:40, 12:00:55, 12:01:05, 12:01:10]

import time
import threading
from collections import defaultdict

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(list)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            cutoff = now - self.window_seconds

            # Drop entries outside the window
            self.logs[key] = [
                ts for ts in self.logs[key] if ts > cutoff
            ]

            if len(self.logs[key]) < self.limit:
                self.logs[key].append(now)
                return True
            return False

Pros: the most accurate rate limiting
Cons: high memory usage (stores every request timestamp)

2.5 Sliding Window Counter

Combines the strengths of Fixed Window and Sliding Window Log. This is the most widely used algorithm in practice.

Window: 60 seconds, Limit: 100 requests

Previous Window Count: 80
Current Window Count: 30
Current Window Progress: 25% (15 seconds elapsed)

Weighted Count = 80 * (1 - 0.25) + 30 = 80 * 0.75 + 30 = 90

90 < 100 -> Allowed

import time
import threading

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.windows = {}  # key -> (prev_count, curr_count, curr_window_start)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            curr_window = int(now / self.window_seconds) * self.window_seconds
            window_progress = (now - curr_window) / self.window_seconds

            if key not in self.windows:
                self.windows[key] = (0, 0, curr_window)

            prev_count, curr_count, stored_window = self.windows[key]

            # Window rolled over: shift the counters
            if stored_window != curr_window:
                if curr_window - stored_window >= self.window_seconds * 2:
                    prev_count = 0
                else:
                    prev_count = curr_count
                curr_count = 0

            # Compute the weighted count across the two windows
            weighted = prev_count * (1 - window_progress) + curr_count

            if weighted < self.limit:
                curr_count += 1
                self.windows[key] = (prev_count, curr_count, curr_window)
                return True
            return False
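
The weighted-count arithmetic from the diagram can be checked on its own; here is a tiny helper mirroring the calculation inside allow():

```python
def weighted_count(prev_count: int, curr_count: int, window_progress: float) -> float:
    """Sliding Window Counter estimate: the previous window is weighted
    by the fraction of it still covered by the sliding window."""
    return prev_count * (1 - window_progress) + curr_count

# Worked example from the diagram: prev=80, curr=30, 25% into the current window
print(weighted_count(80, 30, 0.25))  # 90.0 -> below the limit of 100, so allowed
```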

2.6 Algorithm Comparison

Algorithm       | Accuracy | Memory   | Burst Allowed  | Complexity | Best For
Token Bucket    | Low      | Low      | O              | Low        | General APIs needing bursts
Leaky Bucket    | Low      | Low      | X              | Low        | Constant processing rate
Fixed Window    | Low      | Very low | Boundary issue | Very low   | Simple limits
Sliding Log     | High     | High     | X              | Medium     | Exact limits required
Sliding Counter | High     | Low      | X              | Medium     | General production use

3. Implementation Approaches

3.1 Redis + Lua Script

Combining Redis's atomic operations with Lua scripts yields rate limiting that is safe even in distributed environments.

Sliding Window Counter (Redis + Lua)

import redis
import time

r = redis.Redis(host='localhost', port=6379, db=0)

# Lua script: Sliding Window Counter
SLIDING_WINDOW_SCRIPT = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

local clear_before = now - window
redis.call('ZREMRANGEBYSCORE', key, '-inf', clear_before)

local current_count = redis.call('ZCARD', key)
if current_count < limit then
    redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
    redis.call('EXPIRE', key, window + 1)
    return 1
else
    return 0
end
"""

sliding_window_sha = r.register_script(SLIDING_WINDOW_SCRIPT)

def is_rate_limited(key: str, limit: int = 100, window: int = 60) -> bool:
    """
    Returns True if the request is rate limited, False if it is allowed.
    """
    now = time.time()
    result = sliding_window_sha(
        keys=[f"ratelimit:{key}"],
        args=[now, window, limit]
    )
    return result == 0  # 0 = rate limited, 1 = allowed


# Usage example
user_id = "user-123"
for i in range(110):
    if is_rate_limited(user_id, limit=100, window=60):
        print(f"Request {i+1}: Rate Limited!")
    else:
        print(f"Request {i+1}: Allowed")

Token Bucket (Redis + Lua)

TOKEN_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
    tokens = capacity
    last_refill = now
end

local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)

return allowed
"""

token_bucket_sha = r.register_script(TOKEN_BUCKET_SCRIPT)

def token_bucket_check(
    key: str,
    capacity: int = 100,
    refill_rate: float = 10.0,
    tokens: int = 1
) -> bool:
    now = time.time()
    result = token_bucket_sha(
        keys=[f"tokenbucket:{key}"],
        args=[capacity, refill_rate, now, tokens]
    )
    return result == 1  # 1 = allowed

3.2 Nginx Rate Limiting

# nginx.conf

http {
    # Zone definition: keyed by client IP, 10MB shared memory
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    # Per-user rate limiting (API-key based)
    map $http_x_api_key $api_key_zone {
        default $http_x_api_key;
    }
    limit_req_zone $api_key_zone zone=user_limit:10m rate=100r/m;

    server {
        listen 80;

        # Default API endpoint
        location /api/ {
            # burst=20: allow up to 20 excess requests (queued)
            # nodelay: process queued burst requests immediately
            limit_req zone=api_limit burst=20 nodelay;

            # Response code when the rate limit is exceeded
            limit_req_status 429;

            proxy_pass http://backend;
        }

        # LLM API endpoint (stricter limit)
        location /api/v1/completions {
            limit_req zone=user_limit burst=5 nodelay;
            limit_req_status 429;

            proxy_pass http://llm_backend;
        }

        # Customize the error page
        error_page 429 = @rate_limited;
        location @rate_limited {
            default_type application/json;
            return 429 '{"error": "Too Many Requests", "retry_after": 60}';
        }
    }
}

3.3 Kong Rate Limiting Plugin

# kong.yml - Declarative Configuration

_format_version: '3.0'

services:
  - name: llm-api
    url: http://llm-backend:8000
    routes:
      - name: llm-route
        paths:
          - /api/v1/completions

plugins:
  # Global rate limiting
  - name: rate-limiting
    config:
      minute: 100
      hour: 1000
      policy: redis
      redis_host: redis
      redis_port: 6379
      redis_database: 0
      fault_tolerant: true
      hide_client_headers: false
      error_code: 429
      error_message: 'Rate limit exceeded'

  # Per-consumer rate limiting
  - name: rate-limiting
    consumer: premium-user
    config:
      minute: 500
      hour: 10000
      policy: redis
      redis_host: redis

3.4 Envoy Rate Limiting

# envoy.yaml
static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                http_filters:
                  - name: envoy.filters.http.local_ratelimit
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                      stat_prefix: http_local_rate_limiter
                      token_bucket:
                        max_tokens: 100
                        tokens_per_fill: 10
                        fill_interval: 1s
                      filter_enabled:
                        runtime_key: local_rate_limit_enabled
                        default_value:
                          numerator: 100
                          denominator: HUNDRED
                  - name: envoy.filters.http.router
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

3.5 Express.js Rate Limiting

const rateLimit = require('express-rate-limit')
const RedisStore = require('rate-limit-redis')
const Redis = require('ioredis')

const redis = new Redis({
  host: 'localhost',
  port: 6379,
})

// Default rate limiter
const apiLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 100,
  standardHeaders: true, // RateLimit-* headers
  legacyHeaders: false, // X-RateLimit-* headers
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
  }),
  message: {
    error: 'Too Many Requests',
    retryAfter: 60,
  },
  keyGenerator: (req) => {
    return req.headers['x-api-key'] || req.ip
  },
})

// Stricter limiter for LLM APIs
const llmLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 20,
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
    prefix: 'rl:llm:',
  }),
  keyGenerator: (req) => req.headers['x-api-key'],
})

const app = require('express')()
app.use('/api/', apiLimiter)
app.use('/api/v1/completions', llmLimiter)

3.6 Spring Boot Rate Limiting

// Rate limiting based on Bucket4j + Redis

@Configuration
public class RateLimitConfig {

    @Bean
    public ProxyManager<String> proxyManager(RedissonClient redissonClient) {
        return Bucket4jRedisson.casBasedBuilder(redissonClient)
            .build();
    }
}

@Component
public class RateLimitInterceptor implements HandlerInterceptor {

    private final ProxyManager<String> proxyManager;

    @Override
    public boolean preHandle(
        HttpServletRequest request,
        HttpServletResponse response,
        Object handler
    ) throws Exception {
        String apiKey = request.getHeader("X-API-Key");
        if (apiKey == null) {
            response.setStatus(401);
            return false;
        }

        BucketConfiguration config = BucketConfiguration.builder()
            .addLimit(
                Bandwidth.builder()
                    .capacity(100)
                    .refillGreedy(100, Duration.ofMinutes(1))
                    .build()
            )
            .build();

        Bucket bucket = proxyManager.builder()
            .build(apiKey, () -> config);

        ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);

        if (probe.isConsumed()) {
            response.setHeader("X-RateLimit-Remaining",
                String.valueOf(probe.getRemainingTokens()));
            return true;
        } else {
            long waitSeconds = probe.getNanosToWaitForRefill() / 1_000_000_000;
            response.setStatus(429);
            response.setHeader("Retry-After", String.valueOf(waitSeconds));
            response.getWriter().write(
                "{\"error\":\"Rate limit exceeded\"}"
            );
            return false;
        }
    }
}

4. Rate Limiting in Distributed Environments

4.1 Centralized (Redis)

+--------+     +--------+     +--------+
| Node 1 |     | Node 2 |     | Node 3 |
+---+----+     +---+----+     +---+----+
    |              |              |
    +---------+----+---------+----+
              |              |
         +----+----+    +----+----+
         |  Redis  |    |  Redis  |
         | Primary |    | Replica |
         +---------+    +---------+

Pros: consistent counts across all nodes
Cons: adds network latency on every check; Redis can become a single point of failure

4.2 Local Counters + Periodic Sync

+---------+     +---------+     +---------+
| Node 1  |     | Node 2  |     | Node 3  |
| Local:  |     | Local:  |     | Local:  |
|  30/100 |     |  25/100 |     |  20/100 |
+----+----+     +----+----+     +----+----+
     |               |               |
     +---Periodic Sync (every 5s)----+
                     |
                +----+----+
                |  Redis  |
                | Global: |
                |  75/100 |
                +---------+

import time
import threading
import redis

class DistributedRateLimiter:
    def __init__(
        self,
        redis_client: redis.Redis,
        key_prefix: str,
        global_limit: int,
        window_seconds: int,
        sync_interval: float = 5.0,
        local_threshold: float = 0.1,
    ):
        self.redis = redis_client
        self.key_prefix = key_prefix
        self.global_limit = global_limit
        self.window_seconds = window_seconds
        self.sync_interval = sync_interval
        # Local allowance = 10% of the global limit
        self.local_limit = int(global_limit * local_threshold)
        self.local_count = 0
        self.lock = threading.Lock()

        # Periodic sync thread
        self._start_sync_thread()

    def _start_sync_thread(self):
        def sync_loop():
            while True:
                time.sleep(self.sync_interval)
                self._sync_to_redis()

        thread = threading.Thread(target=sync_loop, daemon=True)
        thread.start()

    def _sync_to_redis(self):
        with self.lock:
            if self.local_count > 0:
                key = f"{self.key_prefix}:global"
                self.redis.incrby(key, self.local_count)
                self.redis.expire(key, self.window_seconds + 10)
                self.local_count = 0

    def allow(self) -> bool:
        with self.lock:
            # Check the local counter first
            if self.local_count >= self.local_limit:
                self._sync_to_redis()
                # Then check the shared global counter in Redis
                global_key = f"{self.key_prefix}:global"
                global_count = int(self.redis.get(global_key) or 0)
                if global_count >= self.global_limit:
                    return False

            self.local_count += 1
            return True

4.3 Consistent Hashing for Sharded Rate Limiters

Hash each rate-limit key and assign it to a specific Redis node:

User A (hash=0x3A) --> Redis Node 1 (0x00-0x55)
User B (hash=0x8F) --> Redis Node 2 (0x56-0xAA)
User C (hash=0xC2) --> Redis Node 3 (0xAB-0xFF)

Each user's counter is always managed on the same node
-> minimal network hops, guaranteed consistency
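
A minimal sketch of that key-to-node mapping, using static hash ranges over one byte to mirror the diagram. A production setup would use a real consistent-hash ring with virtual nodes, so that adding or removing a node only remaps a fraction of the keys:

```python
import hashlib

# Upper bound of each node's hash range, as in the diagram above
NODES = [
    (0x55, "redis-node-1"),  # 0x00-0x55
    (0xAA, "redis-node-2"),  # 0x56-0xAA
    (0xFF, "redis-node-3"),  # 0xAB-0xFF
]

def node_for_key(rate_limit_key: str) -> str:
    """Map a rate-limit key to a fixed Redis node so its counter
    always lives on the same instance."""
    h = hashlib.md5(rate_limit_key.encode()).digest()[0]  # first byte: 0..255
    for upper_bound, node in NODES:
        if h <= upper_bound:
            return node
    return NODES[-1][1]

# The same key always lands on the same node
assert node_for_key("user-123") == node_for_key("user-123")
```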

5. LLM API Cost Control

5.1 Understanding OpenAI/Anthropic API Rate Limits

OpenAI (GPT-4o):
  - RPM (Requests Per Minute): varies by tier
  - TPM (Tokens Per Minute): input + output tokens combined
  - RPD (Requests Per Day): daily cap

Anthropic (Claude):
  - RPM: varies by tier
  - Input TPM / Output TPM: tracked separately
5.2 Per-User, Per-Model Rate Limiting

import time

import redis
from dataclasses import dataclass

@dataclass
class RateLimitConfig:
    rpm: int          # Requests per minute
    tpm: int          # Tokens per minute
    rpd: int          # Requests per day
    daily_budget: float  # USD

# Per-tier user limits
TIER_LIMITS = {
    "free": RateLimitConfig(rpm=10, tpm=10000, rpd=100, daily_budget=1.0),
    "basic": RateLimitConfig(rpm=50, tpm=50000, rpd=1000, daily_budget=10.0),
    "premium": RateLimitConfig(rpm=200, tpm=200000, rpd=10000, daily_budget=100.0),
    "enterprise": RateLimitConfig(rpm=1000, tpm=1000000, rpd=100000, daily_budget=1000.0),
}

class LLMRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_rate_limit(
        self,
        user_id: str,
        tier: str,
        model: str,
        estimated_tokens: int,
    ) -> dict:
        limits = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
        now_minute = int(time.time() / 60)
        now_day = int(time.time() / 86400)

        pipe = self.redis.pipeline()

        # RPM check
        rpm_key = f"rl:rpm:{user_id}:{now_minute}"
        pipe.incr(rpm_key)
        pipe.expire(rpm_key, 120)

        # TPM check
        tpm_key = f"rl:tpm:{user_id}:{now_minute}"
        pipe.incrby(tpm_key, estimated_tokens)
        pipe.expire(tpm_key, 120)

        # RPD check
        rpd_key = f"rl:rpd:{user_id}:{now_day}"
        pipe.incr(rpd_key)
        pipe.expire(rpd_key, 172800)

        results = pipe.execute()
        current_rpm = results[0]
        current_tpm = results[2]
        current_rpd = results[4]

        # Check for violations (note: the counters above were already
        # incremented, so rejected requests still consume quota here)
        if current_rpm > limits.rpm:
            return {
                "allowed": False,
                "reason": "RPM limit exceeded",
                "limit": limits.rpm,
                "current": current_rpm,
                "retry_after": 60,
            }

        if current_tpm > limits.tpm:
            return {
                "allowed": False,
                "reason": "TPM limit exceeded",
                "limit": limits.tpm,
                "current": current_tpm,
                "retry_after": 60,
            }

        if current_rpd > limits.rpd:
            return {
                "allowed": False,
                "reason": "Daily request limit exceeded",
                "limit": limits.rpd,
                "current": current_rpd,
                "retry_after": 3600,
            }

        return {"allowed": True}

5.3 Token Counting and Cost Estimation

import tiktoken

# Per-model pricing (USD per 1K tokens, for reference only)
MODEL_PRICING = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
    "claude-haiku-35": {"input": 0.0008, "output": 0.004},
}

def estimate_cost(
    model: str,
    input_text: str,
    estimated_output_tokens: int = 500,
) -> dict:
    # Count the input tokens
    try:
        enc = tiktoken.encoding_for_model(model)
        input_tokens = len(enc.encode(input_text))
    except KeyError:
        # Fall back to a rough approximation for models tiktoken doesn't know
        input_tokens = len(input_text) // 4

    pricing = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})

    input_cost = (input_tokens / 1000) * pricing["input"]
    output_cost = (estimated_output_tokens / 1000) * pricing["output"]

    return {
        "input_tokens": input_tokens,
        "estimated_output_tokens": estimated_output_tokens,
        "estimated_cost_usd": round(input_cost + output_cost, 6),
    }

5.4 Budget Enforcement

class BudgetEnforcer:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_budget(
        self,
        user_id: str,
        estimated_cost: float,
        daily_limit: float,
        monthly_limit: float,
    ) -> dict:
        today = time.strftime("%Y-%m-%d")
        month = time.strftime("%Y-%m")

        daily_key = f"budget:daily:{user_id}:{today}"
        monthly_key = f"budget:monthly:{user_id}:{month}"

        daily_spent = float(self.redis.get(daily_key) or 0)
        monthly_spent = float(self.redis.get(monthly_key) or 0)

        if daily_spent + estimated_cost > daily_limit:
            return {
                "allowed": False,
                "reason": "Daily budget exceeded",
                "spent": daily_spent,
                "limit": daily_limit,
            }

        if monthly_spent + estimated_cost > monthly_limit:
            return {
                "allowed": False,
                "reason": "Monthly budget exceeded",
                "spent": monthly_spent,
                "limit": monthly_limit,
            }

        return {"allowed": True}

    def record_spend(self, user_id: str, cost: float):
        today = time.strftime("%Y-%m-%d")
        month = time.strftime("%Y-%m")

        pipe = self.redis.pipeline()
        pipe.incrbyfloat(f"budget:daily:{user_id}:{today}", cost)
        pipe.expire(f"budget:daily:{user_id}:{today}", 172800)
        pipe.incrbyfloat(f"budget:monthly:{user_id}:{month}", cost)
        pipe.expire(f"budget:monthly:{user_id}:{month}", 2764800)
        pipe.execute()

5.5 Queue + Rate Limiter Combination Pattern

+--------+    +-------------+    +---------------+    +-----------+
| Client +--->| API Gateway +--->| Rate Limiter  +--->| Queue     |
+--------+    | (check      |    | (Token Bucket |    | (Redis/   |
              |  budget)    |    |  + Budget)    |    | RabbitMQ) |
              +-------------+    +-------+-------+    +-----+-----+
                                                            |
                                 +----------------+         |
                                 | Worker Pool    |<--------+
                                 | (rate-aware    |
                                 |  LLM caller)   |
                                 +-------+--------+
                                         |
                                 +-------v--------+
                                 | LLM Provider   |
                                 | (OpenAI, etc.) |
                                 +----------------+

6. HTTP Response Headers

6.1 Standard Response

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1679900400

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded the rate limit. Please retry after 30 seconds.",
    "type": "rate_limit_error"
  }
}

6.2 Implementing Response Headers

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# rate_limiter: any limiter exposing check(key) -> dict with "allowed",
# "reason", "limit", "remaining", "reset_at" and "retry_after" fields,
# e.g. built on the Redis limiters shown earlier
@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    api_key = request.headers.get("X-API-Key", request.client.host)

    result = rate_limiter.check(api_key)

    if not result["allowed"]:
        return JSONResponse(
            status_code=429,
            content={
                "error": {
                    "code": "rate_limit_exceeded",
                    "message": result["reason"],
                }
            },
            headers={
                "Retry-After": str(result.get("retry_after", 60)),
                "RateLimit-Limit": str(result["limit"]),
                "RateLimit-Remaining": "0",
                "RateLimit-Reset": str(result.get("reset_at", "")),
            }
        )

    response = await call_next(request)

    # Include rate limit info on successful responses too
    response.headers["RateLimit-Limit"] = str(result.get("limit", ""))
    response.headers["RateLimit-Remaining"] = str(result.get("remaining", ""))
    response.headers["RateLimit-Reset"] = str(result.get("reset_at", ""))

    return response

7. Monitoring

7.1 Prometheus Metrics

from prometheus_client import Counter, Histogram, Gauge

# Rate limit metrics
rate_limit_total = Counter(
    'rate_limit_requests_total',
    'Total rate limited requests',
    ['user_tier', 'endpoint', 'result']  # result: allowed/rejected
)

rate_limit_remaining = Gauge(
    'rate_limit_remaining',
    'Remaining rate limit tokens',
    ['user_id', 'limit_type']  # limit_type: rpm/tpm/rpd
)

request_cost = Histogram(
    'llm_request_cost_usd',
    'Cost of LLM requests in USD',
    ['model', 'user_tier'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

budget_utilization = Gauge(
    'budget_utilization_ratio',
    'Budget utilization ratio (0-1)',
    ['user_id', 'period']  # period: daily/monthly
)

7.2 Example Grafana Dashboard Queries

# Rate limit rejection ratio
rate(rate_limit_requests_total{result="rejected"}[5m])
/
rate(rate_limit_requests_total[5m])

# Daily cost by user tier
sum by (user_tier) (
  increase(llm_request_cost_usd_sum[24h])
)

# Top 10 users by daily budget utilization
topk(10, budget_utilization_ratio{period="daily"})

8. Best Practices and Anti-Patterns

8.1 Best Practices

  1. Layered rate limiting: apply multiple layers (global, per-service, per-user)
  2. Graceful degradation: serve cached or simplified responses when the limit is exceeded
  3. Expose headers: provide RateLimit headers so clients can see their remaining quota
  4. Adaptive limits: adjust rate limits dynamically based on server load
  5. Monitoring and alerting: alert when the rejection ratio crosses a threshold
  6. Documentation: state the rate limit policy clearly in the API docs
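
Practice 4 (adaptive limits) can start as simply as scaling the configured limit by a load factor. A sketch, where load_factor is a placeholder for a real CPU or latency signal:

```python
def adaptive_limit(base_limit: int, load_factor: float,
                   floor_ratio: float = 0.2) -> int:
    """Scale the rate limit down as server load rises.
    load_factor: 0.0 (idle) .. 1.0 (saturated); in production this
    would be fed by CPU utilization or p99 latency metrics."""
    load_factor = min(max(load_factor, 0.0), 1.0)
    scaled = base_limit * (1.0 - load_factor)
    # Never drop below a floor, so clients keep some throughput
    return max(int(scaled), int(base_limit * floor_ratio))

print(adaptive_limit(100, 0.0))   # idle: full limit -> 100
print(adaptive_limit(100, 0.5))   # moderate load -> 50
print(adaptive_limit(100, 0.95))  # near saturation: clamped to the floor -> 20
```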

8.2 Anti-Patterns

1. Rate limiting by client IP alone
   Problem: many users behind a NAT or proxy share one IP
   Fix: combine API key + IP, or use authentication-based rate limiting

2. Single-layer rate limiting
   Problem: with only a global limit, normal users get caught too
   Fix: multiple layers per user, per endpoint, per tier

3. Hard-coded rate limit values
   Problem: every change requires a deployment
   Fix: manage them dynamically via a config server or environment variables

4. No rate limits on internal APIs
   Problem: failures cascade between internal services
   Fix: rate-limit internal APIs too (alongside circuit breakers)

5. Clients without retry logic
   Problem: immediate failure on a 429 response
   Fix: implement retry logic that honors the Retry-After header
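
The fix for anti-pattern 5 looks like this on the client side; fetch and the canned responses below are stand-ins for a real HTTP call, and sleep is injectable so the example runs instantly:

```python
import time

def call_with_retry(fetch, max_retries: int = 3, sleep=time.sleep):
    """Retry on 429, waiting for the server-provided Retry-After.
    `fetch` is any callable returning (status_code, headers, body)."""
    for attempt in range(max_retries + 1):
        status, headers, body = fetch()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        retry_after = float(headers.get("Retry-After", 1))
        sleep(retry_after)  # honor the server's hint instead of hammering it
    return status, body

# Simulated server: two 429s with Retry-After, then success
responses = iter([
    (429, {"Retry-After": "2"}, None),
    (429, {"Retry-After": "1"}, None),
    (200, {}, "ok"),
])
waits = []
status, body = call_with_retry(lambda: next(responses), sleep=waits.append)
print(status, body, waits)  # 200 ok [2.0, 1.0]
```

A production client would add jitter and an exponential-backoff fallback for responses that omit Retry-After.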

9. Conclusion

Rate Limiting is not just a defensive technique; it is a core infrastructure component that guarantees service stability, fairness, and cost efficiency.

Key Takeaways:

  1. Algorithm choice: Sliding Window Counter is the most versatile option in practice
  2. Implementation: Redis + Lua is the most reliable approach for distributed environments
  3. LLM cost control: combine RPM/TPM rate limiting with budget enforcement
  4. Distributed setups: choose between centralized (Redis) and local counters + periodic sync
  5. Monitoring: track rejection ratios, request costs, and budget utilization

Proper rate limiting not only protects your service, it also gives users a predictable service experience.


참고 자료

[Architecture] Complete Guide to Rate Limiting and API Throttling

Overview

When operating APIs, you inevitably face various situations such as unexpected traffic spikes, DDoS attacks, and excessive calls from specific users. Rate Limiting is the most fundamental yet effective defense mechanism against these issues. This post covers everything from core rate limiting algorithms to practical implementations with Redis, Nginx, and Kong, LLM API cost control patterns, and distributed rate limiting strategies.


1. What is Rate Limiting

1.1 Why Is It Needed

Rate Limiting is a technique that restricts the number of requests allowed within a given time period.

Primary Purposes:

  • DDoS Defense: Protect services from malicious mass requests
  • Cost Control: Manage external API call costs (especially LLM APIs)
  • Fair Usage: Prevent specific users from monopolizing resources
  • Service Stability: Prevent service outages due to overload
  • SLA Guarantee: Provide consistent service quality to all users

1.2 Where to Apply Rate Limiting

Client --> [CDN/WAF] --> [Load Balancer] --> [API Gateway] --> [Application]
              |                |                  |                 |
          L3/L4 limits   Connection limits  Request Rate Limit  Business logic limits

Rate Limiting can be applied at multiple layers of infrastructure, with the API Gateway layer being the most effective location.


2. Rate Limiting Algorithms

2.1 Token Bucket

The most widely used algorithm, adopted by AWS API Gateway and Nginx.

How It Works:

  • Tokens are added to the bucket at a constant rate
  • Each request consumes one token
  • Requests are rejected when no tokens remain
  • Burst traffic is allowed up to bucket capacity
Bucket Capacity: 10 tokens
Refill Rate: 2 tokens/second

Time 0s: [##########] 10 tokens -> Request OK (9 left)
Time 0s: [#########.] 9 tokens  -> Request OK (8 left)
...
Time 0s: [..........] 0 tokens  -> Request REJECTED
Time 1s: [##........] 2 tokens  -> Refilled
import time
import threading

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill_time = time.monotonic()
        self.lock = threading.Lock()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill_time
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill_time = now

    def consume(self, tokens: int = 1) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def wait_time(self) -> float:
        """Time to wait for the next token"""
        with self.lock:
            self._refill()
            if self.tokens >= 1:
                return 0.0
            return (1 - self.tokens) / self.refill_rate


# Usage example
bucket = TokenBucket(capacity=10, refill_rate=2.0)

for i in range(15):
    if bucket.consume():
        print(f"Request {i+1}: Allowed")
    else:
        wait = bucket.wait_time()
        print(f"Request {i+1}: Rejected (wait {wait:.2f}s)")

Pros: Allows bursts, simple implementation, memory efficient
Cons: Difficult to track exact request counts

2.2 Leaky Bucket

An algorithm that processes requests at a constant rate.

Incoming Requests --> [Bucket/Queue] --> Processing (fixed rate)
                         |
                      Overflow --> Rejected

Processing Rate: 2 requests/second (fixed)
Queue Size: 5

Fast incoming requests queue up; rejected when queue is full

import time
import threading
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.lock = threading.Lock()
        self.last_leak_time = time.monotonic()

    def _leak(self):
        now = time.monotonic()
        elapsed = now - self.last_leak_time
        leaked = int(elapsed * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak_time = now

    def allow(self) -> bool:
        with self.lock:
            self._leak()
            if len(self.queue) < self.capacity:
                self.queue.append(time.monotonic())
                return True
            return False

Pros: Guarantees constant processing rate
Cons: Does not allow bursts, lacks flexibility

2.3 Fixed Window Counter

The simplest algorithm that counts requests within a fixed time window.

Window: 1 minute, Limit: 100 requests

|------- Window 1 -------|------- Window 2 -------|
12:00:00     12:00:59    12:01:00     12:01:59

  Count: 0...50...100       Count: 0...50...100
         (allowed)                 (allowed)

Problem: Boundary Burst
100 requests at 12:00:30~12:00:59 + 100 requests at 12:01:00~12:01:30
= 200 requests in 60 seconds (2x the limit!)

import time
import threading

class FixedWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = {}  # key -> (window_start, count)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            window_start = int(now / self.window_seconds) * self.window_seconds

            if key not in self.counters or self.counters[key][0] != window_start:
                self.counters[key] = (window_start, 0)

            if self.counters[key][1] < self.limit:
                self.counters[key] = (window_start, self.counters[key][1] + 1)
                return True
            return False

Pros: Very simple implementation, minimal memory usage
Cons: Request concentration at window boundaries

2.4 Sliding Window Log

Accurately tracks each request by recording timestamps in a log.

Window: 60 seconds, Limit: 5 requests

Log: [12:00:10, 12:00:25, 12:00:40, 12:00:55, 12:01:05]

New request at 12:01:10:
  1. Remove entries older than 60s (before 12:00:10)
     -> [12:00:25, 12:00:40, 12:00:55, 12:01:05]
  2. Count: 4 < 5 -> Allowed
  3. Add to log: [12:00:25, 12:00:40, 12:00:55, 12:01:05, 12:01:10]

import time
import threading
from collections import defaultdict

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(list)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            cutoff = now - self.window_seconds

            # Remove old entries
            self.logs[key] = [
                ts for ts in self.logs[key] if ts > cutoff
            ]

            if len(self.logs[key]) < self.limit:
                self.logs[key].append(now)
                return True
            return False

Pros: Most accurate rate limiting
Cons: High memory usage (stores all request timestamps)

2.5 Sliding Window Counter

Combines the advantages of Fixed Window and Sliding Window Log. This is the most commonly used algorithm in production.

Window: 60 seconds, Limit: 100 requests

Previous Window Count: 80
Current Window Count: 30
Current Window Progress: 25% (15 seconds elapsed)

Weighted Count = 80 * (1 - 0.25) + 30 = 80 * 0.75 + 30 = 90

90 < 100 -> Allowed

import time
import threading

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.windows = {}  # key -> (prev_count, curr_count, curr_window_start)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            curr_window = int(now / self.window_seconds) * self.window_seconds
            window_progress = (now - curr_window) / self.window_seconds

            if key not in self.windows:
                self.windows[key] = (0, 0, curr_window)

            prev_count, curr_count, stored_window = self.windows[key]

            # Update if window has changed
            if stored_window != curr_window:
                if curr_window - stored_window >= self.window_seconds * 2:
                    prev_count = 0
                else:
                    prev_count = curr_count
                curr_count = 0

            # Calculate weighted count
            weighted = prev_count * (1 - window_progress) + curr_count

            if weighted < self.limit:
                curr_count += 1
                self.windows[key] = (prev_count, curr_count, curr_window)
                return True
            return False
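
The weighted-count estimate at the heart of the class above can be sanity-checked in isolation (a minimal sketch of the formula, separate from the class itself):

```python
def weighted_count(prev_count: int, curr_count: int, window_progress: float) -> float:
    """Sliding-window-counter estimate: the previous window's count
    decays linearly as the current window progresses."""
    return prev_count * (1 - window_progress) + curr_count

# Example from the diagram above: prev=80, curr=30, 25% into the window
print(weighted_count(80, 30, 0.25))  # 90.0 -> under the limit of 100, allowed
```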

2.6 Algorithm Comparison Table

Algorithm       | Accuracy | Memory   | Burst Allowed  | Complexity | Best For
----------------|----------|----------|----------------|------------|---------------------------
Token Bucket    | Medium   | Low      | Yes            | Low        | General APIs, burst needed
Leaky Bucket    | Medium   | Low      | No             | Low        | Constant rate needed
Fixed Window    | Low      | Very Low | Boundary issue | Very Low   | Simple limits
Sliding Log     | High     | High     | No             | Medium     | Precise limiting needed
Sliding Counter | High     | Low      | No             | Medium     | Production general-purpose

3. Implementation Methods

3.1 Redis + Lua Script

Combining Redis atomic operations with Lua scripts enables safe rate limiting even in distributed environments.

Sliding Window Counter (Redis + Lua)

import redis
import time

r = redis.Redis(host='localhost', port=6379, db=0)

# Lua script: Sliding Window Counter
SLIDING_WINDOW_SCRIPT = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

local clear_before = now - window
redis.call('ZREMRANGEBYSCORE', key, '-inf', clear_before)

local current_count = redis.call('ZCARD', key)
if current_count < limit then
    redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
    redis.call('EXPIRE', key, window + 1)
    return 1
else
    return 0
end
"""

sliding_window_sha = r.register_script(SLIDING_WINDOW_SCRIPT)

def is_rate_limited(key: str, limit: int = 100, window: int = 60) -> bool:
    """
    Returns True if the request is rate limited, False if allowed.
    """
    now = time.time()
    result = sliding_window_sha(
        keys=[f"ratelimit:{key}"],
        args=[now, window, limit]
    )
    return result == 0  # 0 = rate limited, 1 = allowed


# Usage example
user_id = "user-123"
for i in range(110):
    if is_rate_limited(user_id, limit=100, window=60):
        print(f"Request {i+1}: Rate Limited!")
    else:
        print(f"Request {i+1}: Allowed")

Token Bucket (Redis + Lua)

TOKEN_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
    tokens = capacity
    last_refill = now
end

local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)

return allowed
"""

token_bucket_sha = r.register_script(TOKEN_BUCKET_SCRIPT)

def token_bucket_check(
    key: str,
    capacity: int = 100,
    refill_rate: float = 10.0,
    tokens: int = 1
) -> bool:
    now = time.time()
    result = token_bucket_sha(
        keys=[f"tokenbucket:{key}"],
        args=[capacity, refill_rate, now, tokens]
    )
    return result == 1  # 1 = allowed

3.2 Nginx Rate Limiting

# nginx.conf

http {
    # Zone definition: per client IP, 10MB shared memory
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    # Per-user rate limiting (API Key based)
    map $http_x_api_key $api_key_zone {
        default $http_x_api_key;
    }
    limit_req_zone $api_key_zone zone=user_limit:10m rate=100r/m;

    server {
        listen 80;

        # Default API endpoint
        location /api/ {
            # burst=20: allow up to 20 excess requests (queued)
            # nodelay: process immediately (within burst)
            limit_req zone=api_limit burst=20 nodelay;

            # Response code when rate limit exceeded
            limit_req_status 429;

            proxy_pass http://backend;
        }

        # LLM API endpoint (stricter limits)
        location /api/v1/completions {
            limit_req zone=user_limit burst=5 nodelay;
            limit_req_status 429;

            proxy_pass http://llm_backend;
        }

        # Custom error page
        error_page 429 = @rate_limited;
        location @rate_limited {
            default_type application/json;
            return 429 '{"error": "Too Many Requests", "retry_after": 60}';
        }
    }
}

3.3 Kong Rate Limiting Plugin

# kong.yml - Declarative Configuration

_format_version: '3.0'

services:
  - name: llm-api
    url: http://llm-backend:8000
    routes:
      - name: llm-route
        paths:
          - /api/v1/completions

plugins:
  # Global Rate Limiting
  - name: rate-limiting
    config:
      minute: 100
      hour: 1000
      policy: redis
      redis_host: redis
      redis_port: 6379
      redis_database: 0
      fault_tolerant: true
      hide_client_headers: false
      error_code: 429
      error_message: 'Rate limit exceeded'

  # Per-Consumer Rate Limiting
  - name: rate-limiting
    consumer: premium-user
    config:
      minute: 500
      hour: 10000
      policy: redis
      redis_host: redis

3.4 Envoy Rate Limiting

# envoy.yaml
static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                http_filters:
                  - name: envoy.filters.http.local_ratelimit
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                      stat_prefix: http_local_rate_limiter
                      token_bucket:
                        max_tokens: 100
                        tokens_per_fill: 10
                        fill_interval: 1s
                      filter_enabled:
                        runtime_key: local_rate_limit_enabled
                        default_value:
                          numerator: 100
                          denominator: HUNDRED
                  - name: envoy.filters.http.router
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

3.5 Express.js Rate Limiting

const rateLimit = require('express-rate-limit')
const RedisStore = require('rate-limit-redis')
const Redis = require('ioredis')

const redis = new Redis({
  host: 'localhost',
  port: 6379,
})

// Default Rate Limiter
const apiLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 100,
  standardHeaders: true, // RateLimit-* headers
  legacyHeaders: false, // X-RateLimit-* headers
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
  }),
  message: {
    error: 'Too Many Requests',
    retryAfter: 60,
  },
  keyGenerator: (req) => {
    return req.headers['x-api-key'] || req.ip
  },
})

// Strict Limiter for LLM API
const llmLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 20,
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
    prefix: 'rl:llm:',
  }),
  keyGenerator: (req) => req.headers['x-api-key'],
})

const app = require('express')()
app.use('/api/', apiLimiter)
app.use('/api/v1/completions', llmLimiter)

3.6 Spring Boot Rate Limiting

// Bucket4j + Redis Based Rate Limiting

@Configuration
public class RateLimitConfig {

    @Bean
    public ProxyManager<String> proxyManager(RedissonClient redissonClient) {
        return Bucket4jRedisson.casBasedBuilder(redissonClient)
            .build();
    }
}

@Component
public class RateLimitInterceptor implements HandlerInterceptor {

    private final ProxyManager<String> proxyManager;

    @Override
    public boolean preHandle(
        HttpServletRequest request,
        HttpServletResponse response,
        Object handler
    ) throws Exception {
        String apiKey = request.getHeader("X-API-Key");
        if (apiKey == null) {
            response.setStatus(401);
            return false;
        }

        BucketConfiguration config = BucketConfiguration.builder()
            .addLimit(
                Bandwidth.builder()
                    .capacity(100)
                    .refillGreedy(100, Duration.ofMinutes(1))
                    .build()
            )
            .build();

        Bucket bucket = proxyManager.builder()
            .build(apiKey, () -> config);

        ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);

        if (probe.isConsumed()) {
            response.setHeader("X-RateLimit-Remaining",
                String.valueOf(probe.getRemainingTokens()));
            return true;
        } else {
            long waitSeconds = probe.getNanosToWaitForRefill() / 1_000_000_000;
            response.setStatus(429);
            response.setHeader("Retry-After", String.valueOf(waitSeconds));
            response.getWriter().write(
                "{\"error\":\"Rate limit exceeded\"}"
            );
            return false;
        }
    }
}

4. Distributed Rate Limiting

4.1 Centralized (Redis)

+--------+     +--------+     +--------+
| Node 1 |     | Node 2 |     | Node 3 |
+---+----+     +---+----+     +---+----+
    |              |              |
    +---------+----+---------+----+
              |              |
         +----+----+    +----+----+
         |  Redis  |    |  Redis  |
         | Primary |    | Replica |
         +---------+    +---------+

Pros: Consistent counts across all nodes
Cons: Additional network latency to Redis, Redis can become SPOF

4.2 Local Counter + Synchronization

+---------+    +---------+    +---------+
| Node 1  |    | Node 2  |    | Node 3  |
| Local:  |    | Local:  |    | Local:  |
|  30/100 |    |  25/100 |    |  20/100 |
+----+----+    +----+----+    +----+----+
     |              |              |
     +---Periodic Sync (every 5s)--+
                    |
               +----+----+
               |  Redis  |
               | Global: |
               |  75/100 |
               +---------+

import time
import threading
import redis

class DistributedRateLimiter:
    def __init__(
        self,
        redis_client: redis.Redis,
        key_prefix: str,
        global_limit: int,
        window_seconds: int,
        sync_interval: float = 5.0,
        local_threshold: float = 0.1,
    ):
        self.redis = redis_client
        self.key_prefix = key_prefix
        self.global_limit = global_limit
        self.window_seconds = window_seconds
        self.sync_interval = sync_interval
        # Local allowance = 10% of global limit
        self.local_limit = int(global_limit * local_threshold)
        self.local_count = 0
        self.lock = threading.Lock()

        # Start periodic sync thread
        self._start_sync_thread()

    def _start_sync_thread(self):
        def sync_loop():
            while True:
                time.sleep(self.sync_interval)
                self._sync_to_redis()

        thread = threading.Thread(target=sync_loop, daemon=True)
        thread.start()

    def _sync_to_redis(self):
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        # Caller must already hold self.lock (threading.Lock is not reentrant,
        # so allow() cannot call a method that re-acquires it)
        if self.local_count > 0:
            key = f"{self.key_prefix}:global"
            self.redis.incrby(key, self.local_count)
            self.redis.expire(key, self.window_seconds + 10)
            self.local_count = 0

    def allow(self) -> bool:
        with self.lock:
            # Once the local allowance is spent, flush and consult the global counter
            if self.local_count >= self.local_limit:
                self._flush_locked()
                global_key = f"{self.key_prefix}:global"
                global_count = int(self.redis.get(global_key) or 0)
                if global_count >= self.global_limit:
                    return False

            self.local_count += 1
            return True

4.3 Consistent Hashing for Sharded Rate Limiters

Hash the rate limit key to assign to a specific Redis node

User A (hash=0x3A) --> Redis Node 1 (0x00-0x55)
User B (hash=0x8F) --> Redis Node 2 (0x56-0xAA)
User C (hash=0xC2) --> Redis Node 3 (0xAB-0xFF)

Each user's counter is always managed on the same node
-> Minimized network hops, guaranteed consistency
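
A minimal sketch of such a hash ring (node names and the virtual-node count are illustrative; a production setup would typically use Redis Cluster's own slot hashing):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Maps each rate-limit key to a fixed node so the same user's
    counter always lives on the same Redis instance."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each physical node gets `vnodes` points on the ring for balance
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First point on the ring at or after the key's hash (wraps around)
        idx = bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["redis-1", "redis-2", "redis-3"])
# A given user's counter always lands on the same node
assert ring.node_for("user-123") == ring.node_for("user-123")
```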

5. LLM API Cost Control

5.1 Understanding OpenAI/Anthropic API Rate Limits

OpenAI (GPT-4o reference):
  - RPM (Requests Per Minute): Varies by tier
  - TPM (Tokens Per Minute): Combined input + output tokens
  - RPD (Requests Per Day): Daily limit

Anthropic (Claude reference):
  - RPM: Varies by tier
  - Input TPM / Output TPM: Managed separately
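
On the client side, these provider limits mean a caller should pace itself rather than fire requests and retry on 429. A minimal sketch of a TPM pacer (the class name and budget are illustrative, not a provider SDK API):

```python
import time

class TPMPacer:
    """Sleeps just long enough to keep cumulative token usage under a
    tokens-per-minute budget (client-side only; provider limits still apply)."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # tokens replenished per second
        self.allowance = self.capacity
        self.last = time.monotonic()

    def acquire(self, tokens: int) -> None:
        now = time.monotonic()
        self.allowance = min(self.capacity,
                             self.allowance + (now - self.last) * self.rate)
        self.last = now
        if self.allowance < tokens:
            # Block until enough budget has replenished
            time.sleep((tokens - self.allowance) / self.rate)
            self.allowance = 0.0
        else:
            self.allowance -= tokens

pacer = TPMPacer(tokens_per_minute=60_000)
pacer.acquire(1_500)  # returns immediately while within budget
```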

5.2 Per-User, Per-Model Rate Limiting

import time

import redis
from dataclasses import dataclass

@dataclass
class RateLimitConfig:
    rpm: int          # Requests per minute
    tpm: int          # Tokens per minute
    rpd: int          # Requests per day
    daily_budget: float  # USD

# Per-tier configuration
TIER_LIMITS = {
    "free": RateLimitConfig(rpm=10, tpm=10000, rpd=100, daily_budget=1.0),
    "basic": RateLimitConfig(rpm=50, tpm=50000, rpd=1000, daily_budget=10.0),
    "premium": RateLimitConfig(rpm=200, tpm=200000, rpd=10000, daily_budget=100.0),
    "enterprise": RateLimitConfig(rpm=1000, tpm=1000000, rpd=100000, daily_budget=1000.0),
}

class LLMRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_rate_limit(
        self,
        user_id: str,
        tier: str,
        model: str,
        estimated_tokens: int,
    ) -> dict:
        limits = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
        now_minute = int(time.time() / 60)
        now_day = int(time.time() / 86400)

        pipe = self.redis.pipeline()

        # RPM check
        rpm_key = f"rl:rpm:{user_id}:{now_minute}"
        pipe.incr(rpm_key)
        pipe.expire(rpm_key, 120)

        # TPM check
        tpm_key = f"rl:tpm:{user_id}:{now_minute}"
        pipe.incrby(tpm_key, estimated_tokens)
        pipe.expire(tpm_key, 120)

        # RPD check
        rpd_key = f"rl:rpd:{user_id}:{now_day}"
        pipe.incr(rpd_key)
        pipe.expire(rpd_key, 172800)

        results = pipe.execute()
        current_rpm = results[0]
        current_tpm = results[2]
        current_rpd = results[4]

        # Check limit violations
        if current_rpm > limits.rpm:
            return {
                "allowed": False,
                "reason": "RPM limit exceeded",
                "limit": limits.rpm,
                "current": current_rpm,
                "retry_after": 60,
            }

        if current_tpm > limits.tpm:
            return {
                "allowed": False,
                "reason": "TPM limit exceeded",
                "limit": limits.tpm,
                "current": current_tpm,
                "retry_after": 60,
            }

        if current_rpd > limits.rpd:
            return {
                "allowed": False,
                "reason": "Daily request limit exceeded",
                "limit": limits.rpd,
                "current": current_rpd,
                "retry_after": 3600,
            }

        return {"allowed": True}

5.3 Token Counting and Cost Estimation

import tiktoken

# Per-model pricing (USD per 1K tokens, reference)
MODEL_PRICING = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
    "claude-haiku-35": {"input": 0.0008, "output": 0.004},
}

def estimate_cost(
    model: str,
    input_text: str,
    estimated_output_tokens: int = 500,
) -> dict:
    # Calculate input token count
    try:
        enc = tiktoken.encoding_for_model(model)
        input_tokens = len(enc.encode(input_text))
    except KeyError:
        # Approximate for models not supported by tiktoken
        input_tokens = len(input_text) // 4

    pricing = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})

    input_cost = (input_tokens / 1000) * pricing["input"]
    output_cost = (estimated_output_tokens / 1000) * pricing["output"]

    return {
        "input_tokens": input_tokens,
        "estimated_output_tokens": estimated_output_tokens,
        "estimated_cost_usd": round(input_cost + output_cost, 6),
    }

5.4 Budget Enforcement

class BudgetEnforcer:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_budget(
        self,
        user_id: str,
        estimated_cost: float,
        daily_limit: float,
        monthly_limit: float,
    ) -> dict:
        today = time.strftime("%Y-%m-%d")
        month = time.strftime("%Y-%m")

        daily_key = f"budget:daily:{user_id}:{today}"
        monthly_key = f"budget:monthly:{user_id}:{month}"

        daily_spent = float(self.redis.get(daily_key) or 0)
        monthly_spent = float(self.redis.get(monthly_key) or 0)

        if daily_spent + estimated_cost > daily_limit:
            return {
                "allowed": False,
                "reason": "Daily budget exceeded",
                "spent": daily_spent,
                "limit": daily_limit,
            }

        if monthly_spent + estimated_cost > monthly_limit:
            return {
                "allowed": False,
                "reason": "Monthly budget exceeded",
                "spent": monthly_spent,
                "limit": monthly_limit,
            }

        return {"allowed": True}

    def record_spend(self, user_id: str, cost: float):
        today = time.strftime("%Y-%m-%d")
        month = time.strftime("%Y-%m")

        pipe = self.redis.pipeline()
        pipe.incrbyfloat(f"budget:daily:{user_id}:{today}", cost)
        pipe.expire(f"budget:daily:{user_id}:{today}", 172800)
        pipe.incrbyfloat(f"budget:monthly:{user_id}:{month}", cost)
        pipe.expire(f"budget:monthly:{user_id}:{month}", 2764800)
        pipe.execute()

5.5 Queue + Rate Limiter Combination Pattern

+--------+    +------------+    +---------------+    +----------+
| Client +--->| API Gateway+--->| Rate Limiter  +--->|  Queue   |
+--------+    | (check     |    | (Token Bucket |    | (Redis/  |
              |  budget)   |    |  + Budget)    |    | RabbitMQ)|
              +------------+    +-------+-------+    +----+-----+
                                                         |
                                +----------------+       |
                                | Worker Pool    |<------+
                                | (rate-aware    |
                                |  LLM caller)   |
                                +-------+--------+
                                        |
                                +-------v--------+
                                | LLM Provider   |
                                | (OpenAI, etc.) |
                                +----------------+
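
A minimal sketch of the pattern with Python's standard queue module. The worker here uses simple fixed pacing toward the provider; `call_llm` is a stand-in for the real API call:

```python
import queue
import threading
import time

def rate_aware_worker(jobs: "queue.Queue", rps: float, call_llm) -> None:
    """Drains the queue, calling the provider no faster than `rps` req/s."""
    interval = 1.0 / rps
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut the worker down
            jobs.task_done()
            break
        call_llm(job)
        jobs.task_done()
        time.sleep(interval)     # fixed pacing toward the provider

jobs = queue.Queue()
results = []
threading.Thread(
    target=rate_aware_worker, args=(jobs, 50.0, results.append), daemon=True
).start()

for prompt in ["hello", "world"]:
    jobs.put(prompt)
jobs.put(None)   # ask the worker to stop once the queue is drained
jobs.join()      # blocks until every job (and the sentinel) is processed
```

In production the in-process queue would be replaced by Redis or RabbitMQ so the worker pool survives restarts, but the shape of the flow is the same.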

6. HTTP Response Headers

6.1 Standard Response

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1679900400

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded the rate limit. Please retry after 30 seconds.",
    "type": "rate_limit_error"
  }
}

6.2 Response Header Implementation

from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    api_key = request.headers.get("X-API-Key", request.client.host)

    # `rate_limiter` is any limiter exposing check() -> dict (e.g., Section 5's)
    result = rate_limiter.check(api_key)

    if not result["allowed"]:
        return JSONResponse(
            status_code=429,
            content={
                "error": {
                    "code": "rate_limit_exceeded",
                    "message": result["reason"],
                }
            },
            headers={
                "Retry-After": str(result.get("retry_after", 60)),
                "RateLimit-Limit": str(result["limit"]),
                "RateLimit-Remaining": "0",
                "RateLimit-Reset": str(result.get("reset_at", "")),
            }
        )

    response = await call_next(request)

    # Include rate limit info in success responses too
    response.headers["RateLimit-Limit"] = str(result.get("limit", ""))
    response.headers["RateLimit-Remaining"] = str(result.get("remaining", ""))
    response.headers["RateLimit-Reset"] = str(result.get("reset_at", ""))

    return response

7. Monitoring

7.1 Prometheus Metrics

from prometheus_client import Counter, Histogram, Gauge

# Rate Limit metrics
rate_limit_total = Counter(
    'rate_limit_requests_total',
    'Total rate limited requests',
    ['user_tier', 'endpoint', 'result']  # result: allowed/rejected
)

rate_limit_remaining = Gauge(
    'rate_limit_remaining',
    'Remaining rate limit tokens',
    ['user_id', 'limit_type']  # limit_type: rpm/tpm/rpd
)

request_cost = Histogram(
    'llm_request_cost_usd',
    'Cost of LLM requests in USD',
    ['model', 'user_tier'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

budget_utilization = Gauge(
    'budget_utilization_ratio',
    'Budget utilization ratio (0-1)',
    ['user_id', 'period']  # period: daily/monthly
)

7.2 Grafana Dashboard Query Examples

# Rate Limit rejection rate
rate(rate_limit_requests_total{result="rejected"}[5m])
/
rate(rate_limit_requests_total[5m])

# Daily cost by user tier
sum by (user_tier) (
  increase(llm_request_cost_usd_sum[24h])
)

# Top 10 budget consumption users
topk(10, budget_utilization_ratio{period="daily"})

8. Best Practices and Anti-Patterns

8.1 Best Practices

  1. Layered Rate Limiting: Apply multiple layers - global, per-service, per-user
  2. Graceful Degradation: Provide cached or simplified responses when rate limits are exceeded
  3. Header Exposure: Provide RateLimit headers so clients know their remaining quota
  4. Adaptive Limits: Dynamically adjust rate limits based on server load
  5. Monitoring and Alerts: Alert when rate limit rejection rates exceed thresholds
  6. Documentation: Clearly document rate limit policies in API documentation
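
Adaptive limits (point 4) can be as simple as scaling the base limit by observed load; the thresholds below are illustrative:

```python
def adaptive_limit(base_limit: int, cpu_load: float) -> int:
    """Shrink the per-user limit as server load rises (illustrative thresholds)."""
    if cpu_load < 0.6:
        return base_limit            # normal operation: full limit
    if cpu_load < 0.8:
        return base_limit // 2       # elevated load: halve the limit
    return max(1, base_limit // 10)  # near saturation: keep only a trickle

assert adaptive_limit(100, 0.3) == 100
assert adaptive_limit(100, 0.7) == 50
assert adaptive_limit(100, 0.95) == 10
```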

8.2 Anti-Patterns

1. Rate Limiting by Client IP Only
   Problem: Multiple users behind NAT/proxy share one IP
   Solution: API Key + IP combination, or auth-based rate limiting

2. Single-Layer Rate Limiting
   Problem: Global-only limits affect normal users too
   Solution: Per-user, per-endpoint, per-tier multi-layer approach

3. Hardcoded Rate Limit Values
   Problem: Requires deployment to change
   Solution: Dynamic management via config server or environment variables

4. No Rate Limiting on Internal APIs
   Problem: Failure propagation between internal services
   Solution: Apply rate limiting to internal APIs (alongside circuit breakers)

5. Clients Without Retry Logic
   Problem: Immediate failure on 429 response
   Solution: Implement retry logic that respects the Retry-After header
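
A minimal client-side sketch of point 5: retry on 429 while honoring Retry-After, with jitter. Here `send` is any callable returning status, headers, and body; the signature is illustrative:

```python
import random
import time

def request_with_retry(send, max_retries: int = 5):
    """Retries on HTTP 429, waiting Retry-After seconds (or exponential
    backoff when the header is absent), plus jitter against thundering herd."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay + random.uniform(0, 0.1))
    raise RuntimeError("still rate limited after retries")

# Simulated endpoint: rate limited once, then succeeds
responses = iter([
    (429, {"Retry-After": "0"}, None),
    (200, {}, '{"ok": true}'),
])
status, body = request_with_retry(lambda: next(responses))
assert status == 200
```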

9. Conclusion

Rate Limiting is not just a simple defense technique but a core infrastructure component that ensures service stability, fairness, and cost efficiency.

Key Takeaways:

  1. Algorithm Selection: Sliding Window Counter is the most versatile for production use
  2. Implementation Tools: Redis + Lua is most reliable in distributed environments
  3. LLM Cost Control: RPM/TPM Rate Limiting + Budget Enforcement combination is essential
  4. Distributed Environments: Choose between centralized (Redis) or local + sync approaches
  5. Monitoring: Monitor rate limit rejection rates, cost tracking, and budget consumption

Proper rate limiting not only protects services but also provides users with predictable service experiences.


References