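[Architecture] A Complete Guide to Rate Limiting and API Call Throttling

Overview

When you operate an API, you run into unexpected traffic spikes, DDoS attacks, excessive calls from individual users, and similar situations. Rate limiting is one of the most basic yet effective defenses against these problems. This article covers rate limiting from the core algorithms to practical implementations with Redis, Nginx, Kong, and others, LLM API cost-control patterns, and rate limiting in distributed environments.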

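1. What Is Rate Limiting

1.1 Why It Is Needed

Rate limiting is a technique that caps the number of requests allowed per unit of time.

Main goals:

DDoS defense: protect the service from malicious request floods
Cost control: manage the cost of external API calls (especially LLM APIs)
Fair use: prevent any single user from monopolizing resources
Service stability: prevent outages caused by overload
SLA guarantees: give every user a consistent level of service quality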

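1.2 Where Rate Limiting Lives

Client --> [CDN/WAF] --> [Load Balancer] --> [API Gateway] --> [Application]
              |                |                  |                 |
         L3/L4 limits   Connection limits   Request rate     Business-logic
                                               limits            limits

Rate limiting can be applied at several layers of the infrastructure. For per-client and per-endpoint API limits, the API gateway layer is usually the most practical place.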

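2. Rate Limiting Algorithms

2.1 Token Bucket

One of the most widely used algorithms; AWS API Gateway, for example, uses it for request throttling.

How it works:

Tokens are added to the bucket at a fixed rate
Each request consumes one token
Requests are rejected when no tokens remain
Bursts up to the bucket capacity are allowed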

Bucket Capacity: 10 tokens
Refill Rate: 2 tokens/second

Time 0s: [##########] 10 tokens -> Request OK (9 left)
Time 0s: [#########.] 9 tokens  -> Request OK (8 left)
...
Time 0s: [..........] 0 tokens  -> Request REJECTED
Time 1s: [##........] 2 tokens  -> Refilled

import time
import threading

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill_time = time.monotonic()
        self.lock = threading.Lock()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill_time
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill_time = now

    def consume(self, tokens: int = 1) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def wait_time(self) -> float:
        """Seconds to wait until the next token is available"""
        with self.lock:
            self._refill()
            if self.tokens >= 1:
                return 0.0
            return (1 - self.tokens) / self.refill_rate


# Usage example
bucket = TokenBucket(capacity=10, refill_rate=2.0)

for i in range(15):
    if bucket.consume():
        print(f"Request {i+1}: Allowed")
    else:
        wait = bucket.wait_time()
        print(f"Request {i+1}: Rejected (wait {wait:.2f}s)")

Pros: allows bursts, simple to implement, memory-efficient
Cons: hard to track the exact number of requests

2.2 Leaky Bucket

This algorithm processes requests at a constant rate. Nginx's limit_req module is based on it.

Incoming Requests --> [Bucket/Queue] --> Processing (fixed rate)
                         |
                      Overflow --> Rejected

Processing rate: 2 requests/second (fixed)
Queue size: 5

Requests that arrive quickly wait in the queue; once the queue is full, new requests are rejected

import time
import threading
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests per second
        self.queue = deque()
        self.lock = threading.Lock()
        self.last_leak_time = time.monotonic()

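    def _leak(self):
        now = time.monotonic()
        elapsed = now - self.last_leak_time
        leaked = int(elapsed * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            # Advance only by the time actually consumed so the fractional
            # remainder is not lost between calls
            self.last_leak_time += leaked / self.leak_rate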

    def allow(self) -> bool:
        with self.lock:
            self._leak()
            if len(self.queue) < self.capacity:
                self.queue.append(time.monotonic())
                return True
            return False

Pros: guarantees a constant processing rate
Cons: does not allow bursts, so it is less flexible

2.3 Fixed Window Counter

The simplest algorithm: it counts the requests that fall within a fixed time window.

Window: 1 minute, Limit: 100 requests

|------- Window 1 -------|------- Window 2 -------|
12:00:00     12:00:59    12:01:00     12:01:59

  Count: 0...50...100       Count: 0...50...100
         (allowed)                 (allowed)

Problem: boundary burst
100 requests in 12:00:30-12:00:59 + 100 requests in 12:01:00-12:01:30
= 200 requests processed within 60 seconds (twice the limit!)

import time
import threading

class FixedWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = {}  # key -> (window_start, count)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            window_start = int(now / self.window_seconds) * self.window_seconds

            if key not in self.counters or self.counters[key][0] != window_start:
                self.counters[key] = (window_start, 0)

            if self.counters[key][1] < self.limit:
                self.counters[key] = (window_start, self.counters[key][1] + 1)
                return True
            return False

Pros: very simple to implement, minimal memory usage
Cons: requests can cluster around the window boundary
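
The boundary burst described above can be reproduced with a few lines. The following is a minimal sketch, not part of the original implementation: `FixedWindow` and `count_allowed` are illustrative names, and the clock is passed in explicitly so the timing is deterministic.

```python
class FixedWindow:
    """Minimal fixed-window counter with an injectable clock (illustrative only)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = None
        self.count = 0

    def allow(self, now: float) -> bool:
        # Align the timestamp to the start of its window
        start = int(now // self.window_seconds) * self.window_seconds
        if start != self.window_start:
            self.window_start, self.count = start, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def count_allowed(limiter: FixedWindow, timestamps) -> int:
    """Replay the timestamps in order and count the allowed requests."""
    return sum(1 for ts in timestamps if limiter.allow(ts))


limiter = FixedWindow(limit=100, window_seconds=60)
# 100 requests in the last second of window 1, 100 in the first second of window 2
late = [59.0 + i / 100 for i in range(100)]
early = [60.0 + i / 100 for i in range(100)]
allowed_in_two_seconds = count_allowed(limiter, late + early)
print(allowed_in_two_seconds)  # 200, although the limit is 100 per minute
```

Every request is accepted because each half falls into a different window, even though all 200 arrive within about two seconds.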

2.4 Sliding Window Log

Records the timestamp of every request in a log so that requests can be tracked exactly.

Window: 60 seconds, Limit: 5 requests

Log: [12:00:10, 12:00:25, 12:00:40, 12:00:55, 12:01:05]

New request at 12:01:10:
  1. Remove entries older than 60s (at or before 12:00:10)
     -> [12:00:25, 12:00:40, 12:00:55, 12:01:05]
  2. Count: 4 < 5 -> Allowed
  3. Add to log: [12:00:25, 12:00:40, 12:00:55, 12:01:05, 12:01:10]

import time
import threading
from collections import defaultdict

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(list)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            cutoff = now - self.window_seconds

            # Drop entries that have fallen out of the window
            self.logs[key] = [
                ts for ts in self.logs[key] if ts > cutoff
            ]

            if len(self.logs[key]) < self.limit:
                self.logs[key].append(now)
                return True
            return False

Pros: the most accurate rate limiting
Cons: high memory usage (every request timestamp is stored)

2.5 Sliding Window Counter

This algorithm combines the strengths of Fixed Window and Sliding Window Log. It is widely used in production systems.

Window: 60 seconds, Limit: 100 requests

Previous Window Count: 80
Current Window Count: 30
Current Window Progress: 25% (15 seconds elapsed)

Weighted Count = 80 * (1 - 0.25) + 30 = 80 * 0.75 + 30 = 90

90 < 100 -> Allowed

import time
import threading

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.windows = {}  # key -> (prev_count, curr_count, curr_window_start)
        self.lock = threading.Lock()

    def allow(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            curr_window = int(now / self.window_seconds) * self.window_seconds
            window_progress = (now - curr_window) / self.window_seconds

            if key not in self.windows:
                self.windows[key] = (0, 0, curr_window)

            prev_count, curr_count, stored_window = self.windows[key]

            # Roll the counters over when the window has changed
            if stored_window != curr_window:
                if curr_window - stored_window >= self.window_seconds * 2:
                    prev_count = 0
                else:
                    prev_count = curr_count
                curr_count = 0

            # Compute the weighted count
            weighted = prev_count * (1 - window_progress) + curr_count

            if weighted < self.limit:
                curr_count += 1
                self.windows[key] = (prev_count, curr_count, curr_window)
                return True
            return False
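
As a quick check of the weighted-count arithmetic in this section, here is a small sketch that isolates the formula. The function name `weighted_count` is illustrative and does not appear in the class above.

```python
def weighted_count(prev_count: int, curr_count: int, window_progress: float) -> float:
    """Estimate the requests in the sliding window from two fixed-window counters.

    The previous window contributes only the share that still overlaps
    the sliding window, i.e. (1 - window_progress).
    """
    return prev_count * (1.0 - window_progress) + curr_count


# Values from the worked example: 80 previous, 30 current, 25% into the window
estimate = weighted_count(prev_count=80, curr_count=30, window_progress=0.25)
allowed = estimate < 100
print(estimate, allowed)  # 90.0 True
```

At the very start of a window (`window_progress=0.0`) the same counters would give 110, so the request would be rejected; the previous window's weight shrinks linearly as the current window advances.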

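2.6 Algorithm Comparison

Algorithm	Accuracy	Memory	Burst allowed	Implementation complexity	Best suited for
Token Bucket	Medium	Low	Yes	Low	General APIs that need bursts
Leaky Bucket	Medium	Low	No	Low	Constant processing rate
Fixed Window	Low	Very low	Boundary issue	Very low	Simple limits
Sliding Log	High	High	No	Medium	Exact limits
Sliding Counter	High (approximate)	Low	No	Medium	General-purpose production use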

3. Implementation

3.1 Redis + Lua Script

Combining Redis's atomic operations with Lua scripts gives rate limiting that stays safe in a distributed environment.

Sliding Window Log (Redis + Lua)

import time
import uuid

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Lua script: sliding window log backed by a sorted set
SLIDING_WINDOW_SCRIPT = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local member = ARGV[4]

-- Drop entries that have fallen out of the window
local clear_before = now - window
redis.call('ZREMRANGEBYSCORE', key, '-inf', clear_before)

local current_count = redis.call('ZCARD', key)
if current_count < limit then
    -- The caller supplies a unique member so that requests with the
    -- same timestamp do not overwrite each other
    redis.call('ZADD', key, now, member)
    redis.call('EXPIRE', key, window + 1)
    return 1
else
    return 0
end
"""

sliding_window_script = r.register_script(SLIDING_WINDOW_SCRIPT)

def is_rate_limited(key: str, limit: int = 100, window: int = 60) -> bool:
    """
    Returns True if the request is rate limited, False if it is allowed.
    """
    now = time.time()
    result = sliding_window_script(
        keys=[f"ratelimit:{key}"],
        args=[now, window, limit, uuid.uuid4().hex]
    )
    return result == 0  # 0 = rate limited, 1 = allowed


# Usage example
user_id = "user-123"
for i in range(110):
    if is_rate_limited(user_id, limit=100, window=60):
        print(f"Request {i+1}: Rate Limited!")
    else:
        print(f"Request {i+1}: Allowed")

Token Bucket (Redis + Lua)

TOKEN_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
    tokens = capacity
    last_refill = now
end

local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)

return allowed
"""

token_bucket_sha = r.register_script(TOKEN_BUCKET_SCRIPT)

def token_bucket_check(
    key: str,
    capacity: int = 100,
    refill_rate: float = 10.0,
    tokens: int = 1
) -> bool:
    now = time.time()
    result = token_bucket_sha(
        keys=[f"tokenbucket:{key}"],
        args=[capacity, refill_rate, now, tokens]
    )
    return result == 1  # 1 = allowed

3.2 Nginx Rate Limiting

# nginx.conf

http {
    # Zone definition: keyed by client IP, 10 MB of shared memory
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    # Per-user rate limiting (keyed by API key)
    # Note: requests without an X-API-Key header have an empty key and are not counted
    map $http_x_api_key $api_key_zone {
        default $http_x_api_key;
    }
    limit_req_zone $api_key_zone zone=user_limit:10m rate=100r/m;

    server {
        listen 80;

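        # Default API endpoints
        location /api/ {
            # burst=20: accept up to 20 requests above the configured rate
            # nodelay: serve requests within the burst immediately instead of delaying them
            limit_req zone=api_limit burst=20 nodelay;

            # Response code when the rate limit is exceeded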
            limit_req_status 429;

            proxy_pass http://backend;
        }

        # LLM API endpoint (stricter limit)
        location /api/v1/completions {
            limit_req zone=user_limit burst=5 nodelay;
            limit_req_status 429;

            proxy_pass http://llm_backend;
        }

        # Custom error response
        error_page 429 = @rate_limited;
        location @rate_limited {
            default_type application/json;
            return 429 '{"error": "Too Many Requests", "retry_after": 60}';
        }
    }
}

3.3 Kong Rate Limiting Plugin

# kong.yml - Declarative Configuration

_format_version: '3.0'

services:
  - name: llm-api
    url: http://llm-backend:8000
    routes:
      - name: llm-route
        paths:
          - /api/v1/completions

plugins:
  # Global rate limiting
  - name: rate-limiting
    config:
      minute: 100
      hour: 1000
      policy: redis
      redis_host: redis
      redis_port: 6379
      redis_database: 0
      fault_tolerant: true
      hide_client_headers: false
      error_code: 429
      error_message: 'Rate limit exceeded'

  # Per-consumer rate limiting
  - name: rate-limiting
    consumer: premium-user
    config:
      minute: 500
      hour: 10000
      policy: redis
      redis_host: redis

3.4 Envoy Rate Limiting

# envoy.yaml
static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                http_filters:
                  - name: envoy.filters.http.local_ratelimit
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                      stat_prefix: http_local_rate_limiter
                      token_bucket:
                        max_tokens: 100
                        tokens_per_fill: 10
                        fill_interval: 1s
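                      filter_enabled:
                        runtime_key: local_rate_limit_enabled
                        default_value:
                          numerator: 100
                          denominator: HUNDRED
                      # Without filter_enforced the limit is evaluated but not enforced
                      filter_enforced:
                        runtime_key: local_rate_limit_enforced
                        default_value:
                          numerator: 100
                          denominator: HUNDRED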
                  - name: envoy.filters.http.router
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

3.5 Express.js Rate Limiting

const rateLimit = require('express-rate-limit')
const RedisStore = require('rate-limit-redis')
const Redis = require('ioredis')

const redis = new Redis({
  host: 'localhost',
  port: 6379,
})

// Default rate limiter
const apiLimiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 100,
  standardHeaders: true, // RateLimit-* headers
  legacyHeaders: false, // X-RateLimit-* headers
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
  }),
  message: {
    error: 'Too Many Requests',
    retryAfter: 60,
  },
  keyGenerator: (req) => {
    return req.headers['x-api-key'] || req.ip
  },
})

// Stricter limiter for the LLM API
const llmLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 20,
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
    prefix: 'rl:llm:',
  }),
  // Fall back to the client IP so requests without an API key still get a key
  keyGenerator: (req) => req.headers['x-api-key'] || req.ip,
})

const app = require('express')()
app.use('/api/', apiLimiter)
app.use('/api/v1/completions', llmLimiter)

3.6 Spring Boot Rate Limiting

// Rate limiting with Bucket4j + Redis

@Configuration
public class RateLimitConfig {

    @Bean
    public ProxyManager<String> proxyManager(RedissonClient redissonClient) {
        return Bucket4jRedisson.casBasedBuilder(redissonClient)
            .build();
    }
}

@Component
public class RateLimitInterceptor implements HandlerInterceptor {

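    private final ProxyManager<String> proxyManager;

    // Constructor injection; without it the final field is never initialized
    public RateLimitInterceptor(ProxyManager<String> proxyManager) {
        this.proxyManager = proxyManager;
    }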

    @Override
    public boolean preHandle(
        HttpServletRequest request,
        HttpServletResponse response,
        Object handler
    ) throws Exception {
        String apiKey = request.getHeader("X-API-Key");
        if (apiKey == null) {
            response.setStatus(401);
            return false;
        }

        BucketConfiguration config = BucketConfiguration.builder()
            .addLimit(
                Bandwidth.builder()
                    .capacity(100)
                    .refillGreedy(100, Duration.ofMinutes(1))
                    .build()
            )
            .build();

        Bucket bucket = proxyManager.builder()
            .build(apiKey, () -> config);

        ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);

        if (probe.isConsumed()) {
            response.setHeader("X-RateLimit-Remaining",
                String.valueOf(probe.getRemainingTokens()));
            return true;
        } else {
            // Round up so a sub-second wait is not reported as Retry-After: 0
            long waitSeconds = (probe.getNanosToWaitForRefill() + 999_999_999L) / 1_000_000_000L;
            response.setStatus(429);
            response.setHeader("Retry-After", String.valueOf(waitSeconds));
            response.getWriter().write(
                "{\"error\":\"Rate limit exceeded\"}"
            );
            return false;
        }
    }
}

4. Rate Limiting in Distributed Environments

4.1 Centralized (Redis)

+--------+     +--------+     +--------+
| Node 1 |     | Node 2 |     | Node 3 |
+---+----+     +---+----+     +---+----+
    |              |              |
    +---------+----+---------+----+
              |              |
         +----+----+    +----+----+
         |  Redis  |    |  Redis  |
         | Primary |    | Replica |
         +---------+    +---------+

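Pros: consistent counts across all nodes
Cons: adds network latency for each Redis call, and Redis can become a single point of failure (SPOF)

4.2 Local Counters + Synchronization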

+---------+    +---------+    +---------+
| Node 1  |    | Node 2  |    | Node 3  |
| Local:  |    | Local:  |    | Local:  |
|  30/100 |    |  25/100 |    |  20/100 |
+----+----+    +----+----+    +----+----+
     |              |              |
     +--Periodic Sync (every 5s)---+
                    |
               +----+----+
               |  Redis  |
               | Global: |
               |  75/100 |
               +---------+

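import time
import threading
import redis

class DistributedRateLimiter:
    def __init__(
        self,
        redis_client: redis.Redis,
        key_prefix: str,
        global_limit: int,
        window_seconds: int,
        sync_interval: float = 5.0,
        local_threshold: float = 0.1,
    ):
        self.redis = redis_client
        self.key_prefix = key_prefix
        self.global_limit = global_limit
        self.window_seconds = window_seconds
        self.sync_interval = sync_interval
        # Local allowance = 10% of the global limit by default
        self.local_limit = max(1, int(global_limit * local_threshold))
        self.local_counts = {}   # key -> requests not yet pushed to Redis
        self.blocked = {}        # key -> window id in which the global limit was hit
        self.lock = threading.Lock()

        # Periodic synchronization thread
        self._start_sync_thread()

    def _start_sync_thread(self):
        def sync_loop():
            while True:
                time.sleep(self.sync_interval)
                self._sync_to_redis()

        thread = threading.Thread(target=sync_loop, daemon=True)
        thread.start()

    def _window_id(self) -> int:
        return int(time.time() / self.window_seconds)

    def _flush_locked(self, key: str) -> int:
        """Push the pending local count to Redis and return the global count.

        The caller must already hold self.lock (threading.Lock is not reentrant).
        """
        pending = self.local_counts.pop(key, 0)
        # Include the window id so the global counter resets every window
        global_key = f"{self.key_prefix}:{key}:{self._window_id()}"
        pipe = self.redis.pipeline()
        pipe.incrby(global_key, pending)
        pipe.expire(global_key, self.window_seconds + 10)
        return int(pipe.execute()[0])

    def _sync_to_redis(self):
        with self.lock:
            for key in list(self.local_counts):
                self._flush_locked(key)

    def allow(self, key: str) -> bool:
        with self.lock:
            window_id = self._window_id()
            if self.blocked.get(key) == window_id:
                return False

            # Local allowance used up: sync and check the global counter
            if self.local_counts.get(key, 0) >= self.local_limit:
                if self._flush_locked(key) >= self.global_limit:
                    self.blocked[key] = window_id
                    return False

            self.local_counts[key] = self.local_counts.get(key, 0) + 1
            return True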

4.3 Consistent Hashing for Sharded Rate Limiters

Hash each request's rate limit key and assign it to a specific Redis node

User A (hash=0x3A) --> Redis Node 1 (0x00-0x55)
User B (hash=0x8F) --> Redis Node 2 (0x56-0xAA)
User C (hash=0xC2) --> Redis Node 3 (0xAB-0xFF)

Each user's counter is always managed on the same node
-> fewer network hops, consistent counts
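
The key-to-node mapping can be sketched with a small consistent-hash ring. This is an illustrative sketch rather than a production client: the `HashRing` class, the node names, and the use of 100 virtual nodes per Redis node are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping rate-limit keys to Redis node names."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node is placed on the ring many times (virtual nodes)
        # so that keys spread more evenly across nodes
        self._ring = []
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of SHA-256 as a stable 64-bit integer
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # The first ring position clockwise from the key's hash owns the key
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["redis-1", "redis-2", "redis-3"])
owner = ring.node_for("user-123")
print(owner)
```

Because only the removed node's ring positions disappear when a node leaves, keys owned by the remaining nodes keep their owner, which is what keeps each user's counter on a single node.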

5. LLM API Cost Control

5.1 Understanding OpenAI/Anthropic API Rate Limits

OpenAI (GPT-4o as an example):
  - RPM (Requests Per Minute): varies by tier
  - TPM (Tokens Per Minute): input and output tokens combined
  - RPD (Requests Per Day): daily limit

Anthropic (Claude as an example):
  - RPM: varies by tier
  - Input TPM / Output TPM: tracked separately

5.2 Per-User, Per-Model Rate Limiting

import time
from dataclasses import dataclass

import redis

@dataclass
class RateLimitConfig:
    rpm: int          # Requests per minute
    tpm: int          # Tokens per minute
    rpd: int          # Requests per day
    daily_budget: float  # USD

# Limits per user tier
TIER_LIMITS = {
    "free": RateLimitConfig(rpm=10, tpm=10000, rpd=100, daily_budget=1.0),
    "basic": RateLimitConfig(rpm=50, tpm=50000, rpd=1000, daily_budget=10.0),
    "premium": RateLimitConfig(rpm=200, tpm=200000, rpd=10000, daily_budget=100.0),
    "enterprise": RateLimitConfig(rpm=1000, tpm=1000000, rpd=100000, daily_budget=1000.0),
}

class LLMRateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_rate_limit(
        self,
        user_id: str,
        tier: str,
        model: str,
        estimated_tokens: int,
    ) -> dict:
        limits = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
        now_minute = int(time.time() / 60)
        now_day = int(time.time() / 86400)

        pipe = self.redis.pipeline()

        # RPM check (keys include the model so limits apply per user and per model)
        rpm_key = f"rl:rpm:{user_id}:{model}:{now_minute}"
        pipe.incr(rpm_key)
        pipe.expire(rpm_key, 120)

        # TPM check
        tpm_key = f"rl:tpm:{user_id}:{model}:{now_minute}"
        pipe.incrby(tpm_key, estimated_tokens)
        pipe.expire(tpm_key, 120)

        # RPD check
        rpd_key = f"rl:rpd:{user_id}:{model}:{now_day}"
        pipe.incr(rpd_key)
        pipe.expire(rpd_key, 172800)

        results = pipe.execute()
        current_rpm = results[0]
        current_tpm = results[2]
        current_rpd = results[4]

        # Check whether any limit is exceeded
        if current_rpm > limits.rpm:
            return {
                "allowed": False,
                "reason": "RPM limit exceeded",
                "limit": limits.rpm,
                "current": current_rpm,
                "retry_after": 60,
            }

        if current_tpm > limits.tpm:
            return {
                "allowed": False,
                "reason": "TPM limit exceeded",
                "limit": limits.tpm,
                "current": current_tpm,
                "retry_after": 60,
            }

        if current_rpd > limits.rpd:
            return {
                "allowed": False,
                "reason": "Daily request limit exceeded",
                "limit": limits.rpd,
                "current": current_rpd,
                "retry_after": 3600,
            }

        return {"allowed": True}

5.3 Token Counting and Cost Estimation

import tiktoken

# Per-model pricing (USD per 1K tokens, for reference only)
MODEL_PRICING = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
    "claude-haiku-35": {"input": 0.0008, "output": 0.004},
}

def estimate_cost(
    model: str,
    input_text: str,
    estimated_output_tokens: int = 500,
) -> dict:
    # Count the input tokens
    try:
        enc = tiktoken.encoding_for_model(model)
        input_tokens = len(enc.encode(input_text))
    except KeyError:
        # Fall back to an approximation for models tiktoken does not support
        input_tokens = len(input_text) // 4

    pricing = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})

    input_cost = (input_tokens / 1000) * pricing["input"]
    output_cost = (estimated_output_tokens / 1000) * pricing["output"]

    return {
        "input_tokens": input_tokens,
        "estimated_output_tokens": estimated_output_tokens,
        "estimated_cost_usd": round(input_cost + output_cost, 6),
    }

5.4 Budget Enforcement

import time

import redis

class BudgetEnforcer:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_budget(
        self,
        user_id: str,
        estimated_cost: float,
        daily_limit: float,
        monthly_limit: float,
    ) -> dict:
        today = time.strftime("%Y-%m-%d")
        month = time.strftime("%Y-%m")

        daily_key = f"budget:daily:{user_id}:{today}"
        monthly_key = f"budget:monthly:{user_id}:{month}"

        daily_spent = float(self.redis.get(daily_key) or 0)
        monthly_spent = float(self.redis.get(monthly_key) or 0)

        if daily_spent + estimated_cost > daily_limit:
            return {
                "allowed": False,
                "reason": "Daily budget exceeded",
                "spent": daily_spent,
                "limit": daily_limit,
            }

        if monthly_spent + estimated_cost > monthly_limit:
            return {
                "allowed": False,
                "reason": "Monthly budget exceeded",
                "spent": monthly_spent,
                "limit": monthly_limit,
            }

        return {"allowed": True}

    def record_spend(self, user_id: str, cost: float):
        today = time.strftime("%Y-%m-%d")
        month = time.strftime("%Y-%m")

        pipe = self.redis.pipeline()
        pipe.incrbyfloat(f"budget:daily:{user_id}:{today}", cost)
        pipe.expire(f"budget:daily:{user_id}:{today}", 172800)
        pipe.incrbyfloat(f"budget:monthly:{user_id}:{month}", cost)
        pipe.expire(f"budget:monthly:{user_id}:{month}", 2764800)
        pipe.execute()

5.5 Combining a Queue with a Rate Limiter

+--------+    +------------+    +---------------+    +-----------+
| Client +--->| API Gateway+--->| Rate Limiter  +--->| Queue     |
+--------+    | (check     |    | (Token Bucket |    | (Redis/   |
              |  budget)   |    |  + Budget)    |    | RabbitMQ) |
              +------------+    +-------+-------+    +---+-------+
                                                         |
                                +----------------+       |
                                | Worker Pool    |<------+
                                | (rate-aware    |
                                |  LLM caller)   |
                                +-------+--------+
                                        |
                                +-------v--------+
                                | LLM Provider   |
                                | (OpenAI, etc.) |
                                +----------------+
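
The worker side of this pattern might look like the following in-process sketch. It is only an illustration under several assumptions: `Pacer`, `drain`, and the `call_llm` callable are hypothetical names, Python's `queue.Queue` stands in for Redis or RabbitMQ, and a real worker would block on the queue instead of returning when it is empty.

```python
import queue
import time


class Pacer:
    """Token bucket that blocks until a token is available (illustrative)."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self) -> None:
        while True:
            now = self.clock()
            # Refill according to the elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for one token to accumulate
            self.sleep((1 - self.tokens) / self.rate)


def drain(jobs: queue.Queue, pacer: Pacer, call_llm) -> list:
    """Worker loop: take queued prompts and call the provider at the paced rate."""
    results = []
    while True:
        try:
            prompt = jobs.get_nowait()
        except queue.Empty:
            return results
        pacer.acquire()
        results.append(call_llm(prompt))
        jobs.task_done()


jobs = queue.Queue()
for prompt in ["a", "b", "c"]:
    jobs.put(prompt)
pacer = Pacer(rate_per_sec=2.0, capacity=5)
results = drain(jobs, pacer, call_llm=lambda prompt: f"echo:{prompt}")
print(results)
```

Placing the pacer in the worker rather than in the request path means clients get a fast enqueue response while calls to the provider stay under its RPM limit.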

6. HTTP Response Headers

6.1 Standard Response

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 30

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded the rate limit. Please retry after 30 seconds.",
    "type": "rate_limit_error"
  }
}

6.2 Implementing the Response Headers

from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
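    # Fall back to the client address when no API key is sent
    client_host = request.client.host if request.client else "unknown"
    api_key = request.headers.get("X-API-Key", client_host)

    # `rate_limiter` is assumed to be defined elsewhere and to return a dict
    # with "allowed", "limit", "remaining", "reset_at", and "reason" keys
    result = rate_limiter.check(api_key)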

    if not result["allowed"]:
        return JSONResponse(
            status_code=429,
            content={
                "error": {
                    "code": "rate_limit_exceeded",
                    "message": result["reason"],
                }
            },
            headers={
                "Retry-After": str(result.get("retry_after", 60)),
                "RateLimit-Limit": str(result["limit"]),
                "RateLimit-Remaining": "0",
                "RateLimit-Reset": str(result.get("reset_at", "")),
            }
        )

    response = await call_next(request)

    # Include rate limit information on successful responses as well
    response.headers["RateLimit-Limit"] = str(result.get("limit", ""))
    response.headers["RateLimit-Remaining"] = str(result.get("remaining", ""))
    response.headers["RateLimit-Reset"] = str(result.get("reset_at", ""))

    return response
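
One way to compute the header values for a fixed-window limiter is sketched below. The helper name `rate_limit_headers` and its parameters are assumptions for illustration; it expresses `RateLimit-Reset` as the number of seconds remaining in the window, following the IETF RateLimit headers draft.

```python
import math


def rate_limit_headers(limit: int, used: int, window_seconds: int,
                       now: float, rejected: bool = False) -> dict:
    """Build RateLimit-* headers for a fixed window that contains `now`."""
    # End of the current fixed window, in the same time base as `now`
    window_end = (int(now // window_seconds) + 1) * window_seconds
    reset_in = max(0, math.ceil(window_end - now))
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, limit - used)),
        "RateLimit-Reset": str(reset_in),
    }
    if rejected:
        # On a 429, tell the client when retrying can succeed
        headers["Retry-After"] = str(reset_in)
    return headers


ok_headers = rate_limit_headers(limit=100, used=40, window_seconds=60, now=60.0)
limited_headers = rate_limit_headers(limit=100, used=100, window_seconds=60,
                                     now=90.0, rejected=True)
print(ok_headers)
print(limited_headers)
```

Keeping `Retry-After` and `RateLimit-Reset` derived from the same value avoids responses whose headers disagree with each other.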

7. Monitoring

7.1 Prometheus Metrics

from prometheus_client import Counter, Histogram, Gauge

# Rate limit metrics
rate_limit_total = Counter(
    'rate_limit_requests_total',
    'Total rate limited requests',
    ['user_tier', 'endpoint', 'result']  # result: allowed/rejected
)

rate_limit_remaining = Gauge(
    'rate_limit_remaining',
    'Remaining rate limit tokens',
    ['user_id', 'limit_type']  # limit_type: rpm/tpm/rpd
)

request_cost = Histogram(
    'llm_request_cost_usd',
    'Cost of LLM requests in USD',
    ['model', 'user_tier'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

budget_utilization = Gauge(
    'budget_utilization_ratio',
    'Budget utilization ratio (0-1)',
    ['user_id', 'period']  # period: daily/monthly
)

7.2 Example Grafana Dashboard Queries

# Rate limit rejection ratio (aggregate both sides so the label sets match)
sum(rate(rate_limit_requests_total{result="rejected"}[5m]))
/
sum(rate(rate_limit_requests_total[5m]))

# Daily cost per user tier
sum by (user_tier) (
  increase(llm_request_cost_usd_sum[24h])
)

# Top 10 users by daily budget utilization
topk(10, budget_utilization_ratio{period="daily"})

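8. Best Practices and Anti-Patterns

8.1 Best Practices

Layered rate limiting: apply limits at multiple layers (global, per-service, per-user)
Graceful degradation: when a limit is exceeded, serve a cached or simplified response
Expose headers: provide RateLimit headers so clients can see their remaining quota
Adaptive limits: adjust rate limits dynamically based on server load
Monitoring and alerting: alert when the rejection rate crosses a threshold
Documentation: state the rate limit policy clearly in the API documentation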
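
The first practice, layered rate limiting, can be sketched as a limiter that admits a request only if every layer has room. This is a simplified in-memory illustration: `LayeredLimiter` and the layer names are assumptions, and the counters for old windows are never purged here.

```python
from collections import defaultdict


class LayeredLimiter:
    """Apply several fixed-window limits; a request must pass every layer."""

    def __init__(self, window_seconds: int, limits: dict):
        self.window_seconds = window_seconds
        self.limits = limits              # layer name -> max requests per window
        self.counts = defaultdict(int)    # (layer, key, window_id) -> count

    def allow(self, keys: dict, now: float) -> tuple:
        """Return (allowed, rejecting_layer). `keys` maps layer name -> key."""
        window_id = int(now // self.window_seconds)
        slots = [(layer, keys[layer], window_id) for layer in self.limits]
        # Check every layer first so a rejected request consumes no quota
        for slot in slots:
            if self.counts[slot] >= self.limits[slot[0]]:
                return False, slot[0]
        for slot in slots:
            self.counts[slot] += 1
        return True, None


limiter = LayeredLimiter(window_seconds=60, limits={"global": 5, "user": 2})

def request(user: str, now: float = 0.0) -> tuple:
    return limiter.allow({"global": "all", "user": user}, now)

outcomes = [request("alice"), request("alice"), request("alice")]
print(outcomes)  # the third call is rejected by the "user" layer
```

Returning the name of the rejecting layer makes it easier to report which limit was hit in logs and metrics.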

8.2 Anti-Patterns

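1. Rate limiting by client IP alone
   Problem: many users behind a NAT or proxy share a single IP
   Fix: combine API key + IP, or rate limit on the authenticated identity

2. Single-layer rate limiting
   Problem: with only a global limit, legitimate users are affected too
   Fix: multiple layers (per-user, per-endpoint, per-tier)

3. Hard-coded rate limit values
   Problem: every change requires a deployment
   Fix: manage the values dynamically through a config server or environment variables

4. Internal APIs without rate limits
   Problem: failures propagate between internal services
   Fix: apply rate limiting to internal APIs as well (alongside circuit breakers)

5. Clients without retry logic
   Problem: a 429 response causes an immediate failure
   Fix: implement retry logic that honors the Retry-After header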
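
For the last item, a client-side retry loop could look like the sketch below. It is transport-agnostic by assumption: `send` is a hypothetical callable that returns `(status_code, headers, body)`, and only the delta-seconds form of `Retry-After` is handled (the header may also carry an HTTP date).

```python
import random
import time


def call_with_retry(send, max_attempts: int = 5, base_delay: float = 1.0,
                    sleep=time.sleep):
    """Call send() and retry on HTTP 429, honoring Retry-After when present."""
    body = None
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; do not sleep again
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)
        else:
            # No usable header: exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    return 429, body


# Example with a stubbed transport: two 429 responses, then success
responses = [
    (429, {"Retry-After": "2"}, ""),
    (429, {"Retry-After": "2"}, ""),
    (200, {}, "ok"),
]
delays = []
result = call_with_retry(lambda: responses.pop(0), sleep=delays.append)
print(result, delays)  # (200, 'ok') [2.0, 2.0]
```

Injecting `sleep` keeps the function easy to test; in real use the default `time.sleep` applies.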

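9. Conclusion

Rate limiting is not just a defensive technique; it is a core infrastructure component that underpins a service's stability, fairness, and cost efficiency.

Key takeaways:

Algorithm choice: Sliding Window Counter is the most general-purpose option in practice
Implementation tooling: Redis + Lua is a dependable choice for distributed environments
LLM cost control: combine RPM/TPM rate limiting with budget enforcement
Distributed environments: choose between the centralized (Redis) and local + sync approaches
Monitoring: track the rejection rate, cost, and budget utilization

Well-tuned rate limiting not only protects the service but also gives users a predictable experience.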
