API Gateway Pattern and BFF Design: Hands-On Implementation with Kong, Envoy, and GraphQL Federation

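Introduction
Understanding the API Gateway Pattern
The BFF (Backend for Frontend) Pattern
Gateway Tool Comparison
- Selection Guide
Kong Gateway in Practice
- Installing and Configuring Kong (Kubernetes)
- Configuring Kong Services and Routes
Envoy Proxy Configuration
- Basic Envoy Configuration
GraphQL Federation
Integrating Authentication and Authorization
- JWT Validation Middleware at the Gateway Layer
Rate Limiting and Circuit Breakers
- Multi-Tier Rate Limiting Strategy
- Circuit Breaker Pattern
Observability Integration
- OpenTelemetry Integration Setup
- Key Metrics for Dashboards
Troubleshooting
Production Checklist
Anti-Patterns and Failure Cases
References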

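Introduction

As microservice architectures have become commonplace across the industry, managing the complexity of service-to-service communication has become a core challenge for every engineering organization. The global API management market is estimated at roughly USD 5.1 billion as of 2026, growing at a CAGR of 32.3%. As of 2025, 31% of organizations run more than one API gateway, and 11% of those run three or more side by side.

The API gateway pattern places a single entry point between clients and backend services and centralizes cross-cutting concerns such as routing, authentication, rate limiting, and protocol translation. Combining it with the BFF (Backend for Frontend) pattern gives each frontend type (web, mobile, IoT, and so on) a dedicated backend tuned to its needs.

This article covers the core principles of the API gateway and BFF patterns, then walks through practical configuration of three major tools: Kong Gateway, Envoy Proxy, and GraphQL Federation. It also covers authentication and authorization, rate limiting, circuit breakers, and observability, with working code for the pieces a production deployment needs.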

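Understanding the API Gateway Pattern

The Role of the Gateway

An API gateway is the single entry point for all client requests in a microservice architecture. It hides the complexity of the internal service topology and gives clients a clean, consistent interface. Its core responsibilities are:

Request routing: forwards traffic to the right backend service based on URL path, headers, and HTTP method
Protocol translation: converts between protocols, for example REST to gRPC or HTTP to WebSocket
Response aggregation: combines responses from several services into one reply to the client (API composition)
Cross-cutting concerns: centrally manages authentication, authorization, rate limiting, logging, caching, CORS, and similar
Service discovery: locates service instances dynamically and load-balances across them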
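
The request-routing responsibility can be sketched in a few lines. The following is a minimal illustration, not how Kong or Envoy implement matching: the `Route` shape, the `routes` table, and the `fallback-service` name are assumptions made up for this example. It picks the most specific (longest) path prefix that also accepts the HTTP method.

```typescript
// Hypothetical route table entry; upstream names loosely mirror the topology diagram.
interface Route {
  pathPrefix: string
  methods?: string[] // undefined means "any method"
  upstream: string
}

const routes: Route[] = [
  { pathPrefix: '/api/v1/users', upstream: 'user-service' },
  { pathPrefix: '/api/v1/orders', methods: ['GET', 'POST'], upstream: 'order-service' },
  { pathPrefix: '/api/v1', upstream: 'fallback-service' },
]

// True when `path` equals the prefix or continues it at a segment boundary,
// so '/api/v1/users' matches '/api/v1/users/42' but not '/api/v1/usersX'.
function matchesPrefix(path: string, prefix: string): boolean {
  const withSlash = prefix.endsWith('/') ? prefix : prefix + '/'
  return path === prefix || path.startsWith(withSlash)
}

// Pick the longest matching prefix whose method list accepts the request.
function matchRoute(method: string, path: string, table: Route[]): Route | undefined {
  const candidates = table
    .filter((r) => matchesPrefix(path, r.pathPrefix))
    .filter((r) => !r.methods || r.methods.includes(method.toUpperCase()))
    .sort((a, b) => b.pathPrefix.length - a.pathPrefix.length)
  return candidates[0]
}
```

For example, `matchRoute('GET', '/api/v1/users/42', routes)` selects `user-service`, while a `DELETE` on `/api/v1/orders/1` falls through to the broader `/api/v1` entry because the orders route only lists `GET` and `POST`.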

Gateway Topology Architecture

                          API Gateway Topology

    ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
    │  Web App │  │Mobile App│  │ Partner  │  │IoT Device│
    │ (React)  │  │ (Swift)  │  │  (B2B)   │  │ (MQTT)   │
    └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
         │             │             │              │
         └──────┬──────┴──────┬──────┘              │
                │             │                     │
         ┌──────▼──────┐ ┌───▼──────────┐   ┌──────▼──────┐
         │   Edge LB   │ │  Edge LB     │   │  Edge LB    │
         │  (L4/L7)    │ │  (L4/L7)     │   │  (L4/L7)    │
         └──────┬──────┘ └──────┬───────┘   └──────┬──────┘
                │               │                  │
         ┌──────▼───────────────▼──────────────────▼──────┐
         │              API Gateway Cluster                │
         │  ┌──────────────────────────────────────────┐   │
         │  │  Auth │ Rate Limit │ Logging │ Transform │   │
         │  └──────────────────────────────────────────┘   │
         │  ┌─────────────────────────────────────────┐    │
         │  │        Route Matching & Dispatch         │    │
         │  └─────────────────────────────────────────┘    │
         └───────┬──────────┬──────────┬──────────┬───────┘
                 │          │          │          │
          ┌──────▼───┐┌────▼────┐┌────▼────┐┌───▼──────┐
          │ User Svc ││Order Svc││Pay Svc  ││Notify Svc│
          │ (gRPC)   ││ (REST)  ││ (REST)  ││(WebSocket│
          └──────────┘└─────────┘└─────────┘└──────────┘

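Edge Gateway vs Internal Gateway

In practice, gateways are deployed in layers. The edge gateway is the entry point for external traffic and handles DDoS protection, TLS termination, and aggressive rate limiting. The internal gateway manages service-to-service communication and is responsible for service discovery, mTLS, and fine-grained access control.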

    External traffic
        │
    ┌───▼────────────────────────┐
    │   Edge Gateway (Kong)      │  ← TLS termination, WAF, rate limiting
    │   - DDoS protection        │
    │   - JWT validation         │
    │   - Global rate limiting   │
    └───────────┬────────────────┘
                │
    ┌───────────▼────────────────┐
    │  Internal Gateway (Envoy)  │  ← mTLS, service routing
    │   - Mutual TLS (mTLS)      │
    │   - Service discovery      │
    │   - Fine-grained RBAC      │
    └──┬────────┬────────┬───────┘
       │        │        │
    ┌──▼──┐  ┌──▼──┐  ┌──▼──┐
    │Svc A│  │Svc B│  │Svc C│
    └─────┘  └─────┘  └─────┘

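The BFF (Backend for Frontend) Pattern

Why the BFF Pattern Is Needed

Serving every client through a single API gateway inevitably runs into the "one size does not fit all" problem. A web application wants rich data in a single round trip, while a mobile app needs only a minimal set of fields to save bandwidth. IoT devices expect a binary protocol, and third-party partners expect a stable REST API.

The BFF pattern, described and popularized by Sam Newman, solves this by giving each frontend type its own dedicated backend. A BFF is tightly coupled to a specific user experience, and the team that owns that UI also owns the BFF.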

BFF Architecture

    ┌──────────┐     ┌──────────┐     ┌──────────┐
    │  Web App │     │Mobile App│     │ Admin App│
    │ (React)  │     │ (Flutter)│     │ (Vue.js) │
    └────┬─────┘     └────┬─────┘     └────┬─────┘
         │                │                │
    ┌────▼─────┐     ┌────▼─────┐     ┌────▼─────┐
    │ Web BFF  │     │Mobile BFF│     │Admin BFF │
    │(Node.js) │     │  (Go)    │     │(Node.js) │
    │          │     │          │     │          │
    │- Rich    │     │- Light   │     │- Dash-   │
    │  data    │     │  payload │     │  boards  │
    │- SSR     │     │- Offline │     │- Bulk    │
    │- SEO     │     │  caching │     │  ops     │
    └──┬───┬───┘     └──┬───┬───┘     └──┬───┬───┘
       │   │            │   │            │   │
       │   └────────┬───┘   └────────┬───┘   │
       │            │                │       │
    ┌──▼──┐     ┌───▼──┐         ┌───▼──┐   │
    │User │     │Order │         │Admin │◄──┘
    │ Svc │     │ Svc  │         │ Svc  │
    └─────┘     └──────┘         └──────┘

BFF Implementation Example: Node.js Web BFF

// web-bff/src/server.ts
import express from 'express'
import axios from 'axios'

const app = express()

// Web BFF: data aggregation endpoint for the dashboard
app.get('/api/dashboard', async (req, res) => {
  const userId = encodeURIComponent(String(req.headers['x-user-id'] ?? ''))

  // Collect data from several microservices in parallel.
  // Promise.allSettled keeps the successful responses when some calls fail;
  // Promise.all would reject on the first failure and discard the rest.
  const results = await Promise.allSettled([
    axios.get(`${process.env.USER_SERVICE_URL}/users/${userId}`),
    axios.get(`${process.env.ORDER_SERVICE_URL}/orders?userId=${userId}&limit=10`),
    axios.get(`${process.env.NOTIFICATION_SERVICE_URL}/notifications/${userId}?unread=true`),
    axios.get(`${process.env.ANALYTICS_SERVICE_URL}/users/${userId}/summary`),
  ])
  const failed = results.some((r) => r.status === 'rejected')
  // A failed call yields null for its section of the response
  const [userProfile, recentOrders, notifications, analytics] = results.map((r) =>
    r.status === 'fulfilled' ? r.value.data : null
  )

  // Response shaped for the web frontend.
  // On partial failure, return only the data that is available (graceful degradation).
  res.status(failed ? 207 : 200).json({
    partial: failed,
    user: userProfile && {
      name: userProfile.name,
      email: userProfile.email,
      avatar: userProfile.avatarUrl,
      memberSince: userProfile.createdAt,
    },
    orders: recentOrders && {
      recent: recentOrders.items.map((order: any) => ({
        id: order.id,
        status: order.status,
        total: order.totalAmount,
        date: order.createdAt,
        itemCount: order.items.length,
        // Details only the web client needs
        trackingUrl: order.trackingUrl,
        invoice: order.invoiceUrl,
      })),
      totalCount: recentOrders.totalCount,
    },
    notifications: notifications && {
      unreadCount: notifications.count,
      items: notifications.items.slice(0, 5),
    },
    analytics: analytics && {
      totalSpent: analytics.totalSpent,
      orderFrequency: analytics.orderFrequency,
      favoriteCategories: analytics.topCategories,
    },
  })
})

// Difference from the Mobile BFF: the web BFF returns the full data set for SSR,
// while the Mobile BFF returns only lightweight fields from the same endpoint
app.listen(3001, () => {
  console.log('Web BFF running on port 3001')
})
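
To make the contrast with a Mobile BFF concrete, here is a minimal sketch of the field projection a mobile endpoint might apply to the same upstream order data. The `UpstreamOrder` and `MobileOrderSummary` types, the `toMobileOrders` helper, and the default limit of 3 are assumptions for illustration; the article's Mobile BFF itself is not shown in this section.

```typescript
// Assumed shape of an order as returned by the order service (illustrative only).
interface UpstreamOrder {
  id: string
  status: string
  totalAmount: number
  createdAt: string
  items: unknown[]
  trackingUrl?: string
  invoiceUrl?: string
}

// Lightweight projection for mobile: no URLs, no item details.
interface MobileOrderSummary {
  id: string
  status: string
  total: number
}

// Keep only the first `limit` orders and only the fields a small screen needs.
function toMobileOrders(orders: UpstreamOrder[], limit = 3): MobileOrderSummary[] {
  return orders.slice(0, limit).map((o) => ({
    id: o.id,
    status: o.status,
    total: o.totalAmount,
  }))
}
```

The point of the sketch is that both BFFs can call the same downstream services and differ only in how much of the response they pass on.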

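Core BFF Design Principles

One BFF per frontend: once the web BFF starts handling mobile requirements it becomes a "general-purpose BFF" and you are back to the original problem.
No business logic in the BFF: it only aggregates, transforms, and formats data; business rules belong in the downstream services.
The BFF team is the frontend team: a BFF works best when the frontend developers who consume it also maintain it.
The BFF as a security boundary: particularly for SPAs, the BFF can act as a confidential client and handle the OIDC/OAuth 2.0 token exchange itself, which avoids the security risks of a public client holding tokens in the browser.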

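Gateway Tool Comparison

The table below compares the major API gateway solutions from a production operations perspective.

Item	Kong Gateway	Envoy Proxy	AWS API Gateway	GraphQL Federation
Underlying technology	NGINX + Lua	C++ (purpose-built)	AWS managed	Apollo Router (Rust)
Deployment model	Self-hosted / Konnect	Self-hosted / sidecar	Fully managed	Self-hosted / GraphOS
Protocols	HTTP, gRPC, WebSocket	HTTP/1.1, HTTP/2, gRPC, TCP	REST, WebSocket, HTTP	GraphQL
Throughput (indicative)	~50,000 TPS/node	~100,000+ TPS/node	Optimized internally by AWS	Varies with the number of subgraphs
Extensibility	Plugins (Lua, Go)	Filters (C++, Wasm, Lua)	Lambda integration	Subgraph services
Kubernetes integration	KIC (Gateway API conformant)	Envoy Gateway / Istio	EKS integration	Helm chart
Observability	Prometheus, Datadog	Native OpenTelemetry	CloudWatch	Apollo Studio
Authentication	Plugins (JWT, OAuth)	Filters (JWT, ext_authz)	Cognito, Lambda authorizers	Delegated to subgraphs
Rate limiting	Plugins (local/global)	Filters (local/global)	Built-in throttling	Custom implementation required
License	Apache 2.0 / Enterprise	Apache 2.0	Pay-as-you-go	Elastic License
Learning curve	Medium	High	Low	High
Best fit	General-purpose API management	Service mesh / L7 control	AWS-native workloads	Multi-team graph unification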

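Selection Guide

Kong: when you want a rich plugin ecosystem and a GUI management tool. It ships more than 60 official plugins. Note, however, that key features such as OIDC and advanced analytics require an Enterprise license.
Envoy: when you need a high-performance L7 proxy or integration with a service mesh (Istio, Consul). Dynamic configuration through the xDS APIs is its strength, but configuration complexity is high.
AWS API Gateway: when you are AWS-native and want to minimize operational overhead. Lambda integration is strong; the downsides are AWS lock-in and cold-start latency on Lambda-backed routes.
GraphQL Federation: when many teams contribute to the graph independently and frontends need flexible queries. The up-front investment in schema design is large, but frontend productivity benefits in the long run.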

Kong Gateway in Practice

Installing and Configuring Kong (Kubernetes)

Kong Ingress Controller (KIC) treats the Kubernetes Gateway API as a first-class citizen; according to Kong, it was the first gateway to pass 100% of the Gateway API core conformance tests.

# kong-values.yaml - Helm chart values
# helm install kong kong/ingress -f kong-values.yaml -n kong-system

gateway:
  image:
    repository: kong/kong-gateway
    tag: '3.9'
  env:
    database: 'off' # DB-less mode (declarative configuration)
    proxy_access_log: /dev/stdout
    admin_access_log: /dev/stdout
    proxy_error_log: /dev/stderr
    admin_error_log: /dev/stderr
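    plugins: bundled # 'bundled' already includes prometheus; 'oidc' is not a bundled plugin name (the Enterprise OIDC plugin is openid-connect)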
  proxy:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    tls:
      enabled: true
  admin:
    enabled: true
    type: ClusterIP # reachable only from inside the cluster
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

controller:
  image:
    repository: kong/kubernetes-ingress-controller
    tag: '3.4'
  ingressClass: kong
  resources:
    requests:
      cpu: 250m
      memory: 256Mi

Configuring Kong Services and Routes

# kong-gateway-api.yaml
# Declarative routing with the Kubernetes Gateway API
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kong-gateway
  namespace: kong-system
  annotations:
    konghq.com/gateway-operator: 'true'
spec:
  gatewayClassName: kong
  listeners:
    - name: http
      protocol: HTTP
      port: 80
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: api-tls-cert
            kind: Secret
---
# Per-service routing defined with HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: user-service-route
  namespace: default
  annotations:
    konghq.com/plugins: rate-limiting-global,jwt-auth,cors
spec:
  parentRefs:
    - name: kong-gateway
      namespace: kong-system
  hostnames:
    - 'api.example.com'
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api/v1/users
      backendRefs:
        - name: user-service
          port: 8080
          weight: 100
    - matches:
        - path:
            type: PathPrefix
            value: /api/v1/orders
      backendRefs:
        - name: order-service
          port: 8080
          weight: 90
        - name: order-service-canary
          port: 8080
          weight: 10 # canary release: 10% of traffic
---
# Kong plugin: JWT authentication
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: jwt-auth
  namespace: default
config:
  key_claim_name: iss
  claims_to_verify:
    - exp
  header_names:
    - Authorization
plugin: jwt
---
# Kong plugin: rate limiting
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting-global
  namespace: default
config:
  minute: 100
  hour: 5000
  policy: redis
  redis:
    host: redis-master.redis-system.svc.cluster.local
    port: 6379
    database: 0
    timeout: 2000
  fault_tolerant: true # let requests through if Redis is unavailable
  hide_client_headers: false
plugin: rate-limiting
---
# Kong plugin: CORS
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: cors
  namespace: default
config:
  origins:
    - 'https://app.example.com'
    - 'https://admin.example.com'
  methods:
    - GET
    - POST
    - PUT
    - DELETE
    - OPTIONS
  headers:
    - Authorization
    - Content-Type
    - X-Request-ID
  exposed_headers:
    - X-RateLimit-Remaining
  max_age: 3600
  credentials: true
plugin: cors
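
The `weight` values on `backendRefs` are proportional: each backend receives roughly its weight divided by the sum of all weights, which is how the 90/10 canary split above arises. The sketch below illustrates that arithmetic only; `WeightedBackend` and `pickBackend` are hypothetical names, and a real gateway's selection logic will differ in detail.

```typescript
interface WeightedBackend {
  name: string
  weight: number
}

// Choose a backend with probability weight / totalWeight.
// `random` is injectable so the function can be exercised deterministically.
function pickBackend(backends: WeightedBackend[], random: () => number = Math.random): string {
  const total = backends.reduce((sum, b) => sum + Math.max(0, b.weight), 0)
  if (total <= 0) throw new Error('at least one backend needs a positive weight')

  let point = random() * total // a point on the [0, total) line
  let last = ''
  for (const b of backends) {
    if (b.weight <= 0) continue // zero-weight backends receive no traffic
    last = b.name
    if (point < b.weight) return b.name
    point -= b.weight
  }
  return last // only reached through floating-point edge cases
}

const orderBackends: WeightedBackend[] = [
  { name: 'order-service', weight: 90 },
  { name: 'order-service-canary', weight: 10 },
]
```

With these weights, a random draw below 0.9 lands on `order-service` and anything from 0.9 upward lands on the canary, so shifting more traffic is a matter of editing the two numbers.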

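Envoy Proxy Configuration

Basic Envoy Configuration

Envoy is a high-performance proxy written in C++ with native L7 support for HTTP/2 and gRPC. Its main strength is dynamic configuration updates through the xDS APIs.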

# envoy-config.yaml
# Static Envoy configuration (xDS-based dynamic configuration is recommended in production)
admin:
  address:
    socket_address:
      address: 127.0.0.1 # the admin interface is unauthenticated; keep it off external interfaces
      port_value: 9901

static_resources:
  listeners:
    - name: api_listener
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8443
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                # Access log configuration
                access_log:
                  - name: envoy.access_loggers.file
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: /dev/stdout
                      log_format:
                        json_format:
                          timestamp: '%START_TIME%'
                          method: '%REQ(:METHOD)%'
                          path: '%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%'
                          protocol: '%PROTOCOL%'
                          response_code: '%RESPONSE_CODE%'
                          duration_ms: '%DURATION%'
                          upstream_host: '%UPSTREAM_HOST%'
                          request_id: '%REQ(X-REQUEST-ID)%'
                # HTTP filter chain
                http_filters:
                  # JWT authentication filter
                  - name: envoy.filters.http.jwt_authn
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.jwt_authn.v3.JwtAuthentication
                      providers:
                        auth0_provider:
                          issuer: 'https://your-tenant.auth0.com/'
                          audiences:
                            - 'https://api.example.com'
                          remote_jwks:
                            http_uri:
                              uri: 'https://your-tenant.auth0.com/.well-known/jwks.json'
                              cluster: auth0_jwks
                              timeout: 5s
                            cache_duration: 600s
                          forward: true
                          payload_in_metadata: jwt_payload
                      rules:
                        - match:
                            prefix: /api/
                          requires:
                            provider_name: auth0_provider
                        - match:
                            prefix: /health
                  # Rate limiting filter
                  - name: envoy.filters.http.local_ratelimit
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                      stat_prefix: http_local_rate_limiter
                      token_bucket:
                        max_tokens: 1000
                        tokens_per_fill: 100
                        fill_interval: 1s
                      filter_enabled:
                        runtime_key: local_rate_limit_enabled
                        default_value:
                          numerator: 100
                          denominator: HUNDRED
                      filter_enforced:
                        runtime_key: local_rate_limit_enforced
                        default_value:
                          numerator: 100
                          denominator: HUNDRED
                      response_headers_to_add:
                        - append_action: OVERWRITE_IF_EXISTS_OR_ADD
                          header:
                            key: x-local-rate-limit
                            value: 'true'
                  # Router filter (must be last)
                  - name: envoy.filters.http.router
                    typed_config:
                      '@type': type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                # Route configuration
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: api_service
                      domains: ['api.example.com']
                      routes:
                        - match:
                            prefix: '/api/v1/users'
                          route:
                            cluster: user_service
                            timeout: 10s
                            retry_policy:
                              retry_on: '5xx,connect-failure,reset'
                              num_retries: 3
                              per_try_timeout: 3s
                              retry_back_off:
                                base_interval: 0.1s
                                max_interval: 1s
                        - match:
                            prefix: '/api/v1/orders'
                          route:
                            cluster: order_service
                            timeout: 15s
                            retry_policy:
                              retry_on: '5xx,connect-failure'
                              num_retries: 2
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              '@type': type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain:
                      filename: /etc/envoy/certs/server.crt
                    private_key:
                      filename: /etc/envoy/certs/server.key

  clusters:
    - name: user_service
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: user_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: user-service.default.svc.cluster.local
                      port_value: 8080
      # Circuit breaker settings
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 1024
            max_pending_requests: 1024
            max_requests: 1024
            max_retries: 3
      # Health checks
      health_checks:
        - timeout: 5s
          interval: 10s
          unhealthy_threshold: 3
          healthy_threshold: 2
          http_health_check:
            path: /health
      # Outlier detection
      outlier_detection:
        consecutive_5xx: 5
        interval: 10s
        base_ejection_time: 30s
        max_ejection_percent: 50

    - name: order_service
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST
      load_assignment:
        cluster_name: order_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: order-service.default.svc.cluster.local
                      port_value: 8080
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 512
            max_pending_requests: 512

    - name: auth0_jwks
      type: LOGICAL_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: auth0_jwks
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: your-tenant.auth0.com
                      port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          sni: your-tenant.auth0.com
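
The `token_bucket` block in the local rate limit filter above is easier to reason about with a small model. The class below is a simplified sketch of a token bucket using the same three parameters (`max_tokens`, `tokens_per_fill`, `fill_interval`); it is not Envoy's implementation, and the `TokenBucket` class and its method names are assumptions for illustration.

```typescript
// Simplified token bucket: starts full, refills in whole intervals, one token per request.
class TokenBucket {
  private tokens: number
  private lastFillMs: number
  private readonly maxTokens: number
  private readonly tokensPerFill: number
  private readonly fillIntervalMs: number

  constructor(maxTokens: number, tokensPerFill: number, fillIntervalMs: number, nowMs: number = Date.now()) {
    this.maxTokens = maxTokens
    this.tokensPerFill = tokensPerFill
    this.fillIntervalMs = fillIntervalMs
    this.tokens = maxTokens
    this.lastFillMs = nowMs
  }

  // Returns true and spends a token if one is available, false otherwise.
  tryConsume(nowMs: number = Date.now()): boolean {
    const fills = Math.floor((nowMs - this.lastFillMs) / this.fillIntervalMs)
    if (fills > 0) {
      // Add tokens for every elapsed interval, never exceeding the bucket size
      this.tokens = Math.min(this.maxTokens, this.tokens + fills * this.tokensPerFill)
      this.lastFillMs += fills * this.fillIntervalMs
    }
    if (this.tokens <= 0) return false
    this.tokens -= 1
    return true
  }
}

// Same numbers as the YAML above: bursts of up to 1000, sustained 100 requests per second.
const listenerBucket = new TokenBucket(1000, 100, 1000)
```

Read this way, `max_tokens` bounds the burst size and `tokens_per_fill / fill_interval` sets the sustained rate, which is why the two values are tuned separately.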

GraphQL Federation

Apollo Federation v2 Overview

GraphQL Federation is an architecture in which subgraphs owned by independent teams are composed into a single supergraph. Apollo Federation v2 offers an improved shared-ownership model, better type merging, and cleaner syntax.

    ┌──────────────────────────────────────────────────┐
    │                      Client                      │
    │                (Web / Mobile App)                │
    └────────────────────┬─────────────────────────────┘
                         │ GraphQL query
    ┌────────────────────▼─────────────────────────────┐
    │                  Apollo Router                   │
    │          (supergraph execution engine)           │
    │                                                  │
    │  ┌────────────────────────────────────────────┐  │
    │  │ Query plan                                 │  │
    │  │  1. Fetch User from the users subgraph     │  │
    │  │  2. Fetch Order from the orders subgraph   │  │
    │  │  3. Fetch Review from the reviews subgraph │  │
    │  │  4. Merge results and return to the client │  │
    │  └────────────────────────────────────────────┘  │
    └───────┬───────────────┬───────────────┬──────────┘
            │               │               │
    ┌───────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │ Users        │ │ Orders      │ │ Reviews     │
    │ Subgraph     │ │ Subgraph    │ │ Subgraph    │
    │ (Team A)     │ │ (Team B)    │ │ (Team C)    │
    │              │ │             │ │             │
    │ - User       │ │ - Order     │ │ - Review    │
    │ - Profile    │ │ - LineItem  │ │ - Rating    │
    │              │ │ - extends   │ │ - extends   │
    │              │ │   User      │ │   User      │
    └──────────────┘ └─────────────┘ └─────────────┘

Implementing Federation Subgraphs

// subgraphs/users/src/index.ts
import { ApolloServer } from '@apollo/server'
import { startStandaloneServer } from '@apollo/server/standalone'
import { buildSubgraphSchema } from '@apollo/subgraph'
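import { gql } from 'graphql-tag'
// Assumption: UsersDataSource is a project-local data-access class; the path is hypothetical
import { UsersDataSource } from './datasources/users'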

const typeDefs = gql`
  extend schema
    @link(
      url: "https://specs.apollo.dev/federation/v2.11"
      import: ["@key", "@shareable", "@external", "@provides"]
    )

  type Query {
    user(id: ID!): User
    users(limit: Int = 10, offset: Int = 0): UserConnection!
    me: User
  }

  type User @key(fields: "id") {
    id: ID!
    email: String!
    name: String!
    avatarUrl: String
    role: UserRole!
    createdAt: String!
    updatedAt: String!
  }

  type UserConnection {
    nodes: [User!]!
    totalCount: Int!
    pageInfo: PageInfo!
  }

  type PageInfo @shareable {
    hasNextPage: Boolean!
    hasPreviousPage: Boolean!
  }

  enum UserRole {
    ADMIN
    USER
    MODERATOR
  }
`

const resolvers = {
  Query: {
    user: async (_: any, { id }: { id: string }, context: any) => {
      return context.dataSources.usersAPI.getUser(id)
    },
    users: async (_: any, args: any, context: any) => {
      return context.dataSources.usersAPI.getUsers(args)
    },
    me: async (_: any, __: any, context: any) => {
      if (!context.userId) throw new Error('Not authenticated')
      return context.dataSources.usersAPI.getUser(context.userId)
    },
  },
  User: {
    __resolveReference: async (ref: { id: string }, context: any) => {
      // Called when another subgraph references a User through Federation
      return context.dataSources.usersAPI.getUser(ref.id)
    },
  },
}

const server = new ApolloServer({
  schema: buildSubgraphSchema({ typeDefs, resolvers }),
})

const { url } = await startStandaloneServer(server, {
  listen: { port: 4001 },
  context: async ({ req }) => ({
    userId: req.headers['x-user-id'],
    dataSources: {
      usersAPI: new UsersDataSource(),
    },
  }),
})

console.log(`Users subgraph ready at ${url}`)

// subgraphs/orders/src/index.ts
import { ApolloServer } from '@apollo/server'
import { buildSubgraphSchema } from '@apollo/subgraph'
import { gql } from 'graphql-tag'

const typeDefs = gql`
  extend schema
    @link(
      url: "https://specs.apollo.dev/federation/v2.11"
      import: ["@key", "@external", "@requires"]
    )

  type Query {
    order(id: ID!): Order
    ordersByUser(userId: ID!, status: OrderStatus): [Order!]!
  }

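  # Contribute order fields to the User entity owned by the users subgraph.
  # In Federation 2 the key field does not need @external, and @requires is
  # unnecessary for the key because it is always part of the entity reference.
  type User @key(fields: "id") {
    id: ID!
    orders(status: OrderStatus, limit: Int = 10): [Order!]!
    totalSpent: Float!
  }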

  type Order @key(fields: "id") {
    id: ID!
    userId: ID!
    user: User!
    items: [OrderItem!]!
    status: OrderStatus!
    totalAmount: Float!
    currency: String!
    createdAt: String!
    shippingAddress: Address
  }

  type OrderItem {
    productId: ID!
    productName: String!
    quantity: Int!
    unitPrice: Float!
  }

  type Address {
    street: String!
    city: String!
    country: String!
    zipCode: String!
  }

  enum OrderStatus {
    PENDING
    CONFIRMED
    SHIPPED
    DELIVERED
    CANCELLED
  }
`

const resolvers = {
  Query: {
    order: async (_: any, { id }: { id: string }, ctx: any) => {
      return ctx.dataSources.ordersAPI.getOrder(id)
    },
    ordersByUser: async (_: any, args: any, ctx: any) => {
      return ctx.dataSources.ordersAPI.getOrdersByUser(args.userId, args.status)
    },
  },
  User: {
    orders: async (user: { id: string }, args: any, ctx: any) => {
      return ctx.dataSources.ordersAPI.getOrdersByUser(user.id, args.status)
    },
    totalSpent: async (user: { id: string }, _: any, ctx: any) => {
      return ctx.dataSources.ordersAPI.getTotalSpent(user.id)
    },
  },
  Order: {
    user: (order: { userId: string }) => ({ __typename: 'User', id: order.userId }),
  },
}
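
The `Order.user` resolver above returns only a stub (`__typename` plus the key), and the owning subgraph later fills in the rest through `__resolveReference`. The following is a deliberately simplified, in-memory model of that hand-off, not the Apollo Router's actual algorithm; `usersById`, `referenceResolvers`, and `resolveEntities` are hypothetical names with made-up data.

```typescript
// A reference stub as returned by the Order.user resolver.
interface EntityRef {
  __typename: string
  id: string
}
type Entity = Record<string, unknown>
type ReferenceResolver = (ref: EntityRef) => Entity | undefined

// Hypothetical in-memory data standing in for the users subgraph.
const usersById: Record<string, Entity> = {
  u1: { id: 'u1', name: 'Ada', email: 'ada@example.com' },
}

// One reference resolver per entity type, playing the role of __resolveReference.
const referenceResolvers: Record<string, ReferenceResolver> = {
  User: (ref) => usersById[ref.id],
}

// Resolve each stub with the resolver registered for its __typename and merge the
// owning subgraph's fields over the stub; unknown types or ids resolve to null.
function resolveEntities(refs: EntityRef[]): Array<Entity | null> {
  return refs.map((ref) => {
    const resolver = referenceResolvers[ref.__typename]
    const entity = resolver ? resolver(ref) : undefined
    return entity ? { ...ref, ...entity } : null
  })
}
```

The orders subgraph therefore never needs to know a user's name or email; it only needs to emit a key that the users subgraph can resolve.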

Supergraph Configuration

# supergraph-config.yaml
# rover supergraph compose --config supergraph-config.yaml > supergraph.graphql
federation_version: =2.11.2
subgraphs:
  users:
    routing_url: http://users-subgraph:4001/graphql
    schema:
      subgraph_url: http://users-subgraph:4001/graphql
  orders:
    routing_url: http://orders-subgraph:4002/graphql
    schema:
      subgraph_url: http://orders-subgraph:4002/graphql
  reviews:
    routing_url: http://reviews-subgraph:4003/graphql
    schema:
      subgraph_url: http://reviews-subgraph:4003/graphql

# router-config.yaml (Apollo Router configuration)
supergraph:
  listen: 0.0.0.0:4000
  introspection: false # disable in production

# Header propagation
headers:
  all:
    request:
      - propagate:
          named: 'Authorization'
      - propagate:
          named: 'X-Request-ID'
      - propagate:
          named: 'X-User-ID'

# Per-subgraph settings
traffic_shaping:
  all:
    timeout: 10s
  subgraphs:
    orders:
      timeout: 15s # the orders service gets a longer timeout

# Operation limits such as query depth (DoS protection)
limits:
  max_depth: 15
  max_height: 200
  max_aliases: 30

# Telemetry
telemetry:
  exporters:
    tracing:
      otlp:
        enabled: true
        endpoint: http://otel-collector:4317
        protocol: grpc
    metrics:
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090
        path: /metrics

# Entity caching (preview feature)
preview_entity_cache:
  enabled: true
  subgraph:
    all:
      enabled: true
      ttl: 60s

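Integrating Authentication and Authorization

JWT Validation Middleware at the Gateway Layer

The standard pattern is to validate the JWT at the gateway, convert the verified claims into headers, and pass them to downstream services. Tokens are signed with RS256 or ES256, and the signature is verified against the IdP's public key.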

// gateway/src/middleware/auth.ts
import jwt from 'jsonwebtoken'
import jwksClient from 'jwks-rsa'
import { Request, Response, NextFunction } from 'express'

// JWKS client (with key caching)
const client = jwksClient({
  jwksUri: `${process.env.AUTH0_DOMAIN}/.well-known/jwks.json`,
  cache: true,
  cacheMaxEntries: 5,
  cacheMaxAge: 600000, // 10 minutes
  rateLimit: true,
  jwksRequestsPerMinute: 10,
})

function getSigningKey(header: jwt.JwtHeader): Promise<string> {
  return new Promise((resolve, reject) => {
    client.getSigningKey(header.kid, (err, key) => {
      if (err) return reject(err)
      const signingKey = key?.getPublicKey()
      if (!signingKey) return reject(new Error('No signing key found'))
      resolve(signingKey)
    })
  })
}

// Per-path authentication policies
const AUTH_POLICIES: Record<string, { required: boolean; scopes?: string[] }> = {
  '/api/v1/users/me': { required: true, scopes: ['read:profile'] },
  '/api/v1/orders': { required: true, scopes: ['read:orders'] },
  '/api/v1/admin': { required: true, scopes: ['admin:all'] },
  '/health': { required: false },
  '/api/v1/products': { required: false }, // public API
}

export async function authMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): Promise<void> {
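  // Drop identity headers supplied by the client so they cannot be spoofed,
  // including on public routes that skip token validation
  for (const h of ['x-user-id', 'x-user-email', 'x-user-roles', 'x-auth-time']) {
    delete req.headers[h]
  }

  // Look up the policy for this path
  const policy = Object.entries(AUTH_POLICIES).find(([path]) => req.path.startsWith(path))

  if (policy && !policy[1].required) {
    return next()
  }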

  const authHeader = req.headers.authorization
  if (!authHeader?.startsWith('Bearer ')) {
    res.status(401).json({
      error: 'unauthorized',
      message: 'Missing or invalid Authorization header',
    })
    return
  }

  const token = authHeader.substring(7)

  try {
    // Decode the JWT (inspect the header before verification)
    const decoded = jwt.decode(token, { complete: true })
    if (!decoded || typeof decoded === 'string') {
      throw new Error('Invalid token format')
    }

    // Algorithm allow-list check
    if (!['RS256', 'ES256'].includes(decoded.header.alg)) {
      throw new Error(`Unsupported algorithm: ${decoded.header.alg}`)
    }

    // Fetch the public key from JWKS and verify the signature
    const signingKey = await getSigningKey(decoded.header)
    const payload = jwt.verify(token, signingKey, {
      algorithms: ['RS256', 'ES256'],
      audience: process.env.API_AUDIENCE,
      issuer: `${process.env.AUTH0_DOMAIN}/`,
      clockTolerance: 30, // allow 30 seconds of clock skew
    }) as jwt.JwtPayload

    // Scope check
    if (policy?.[1]?.scopes) {
      const tokenScopes = (payload.scope || '').split(' ')
      const requiredScopes = policy[1].scopes
      const hasAllScopes = requiredScopes.every((s) => tokenScopes.includes(s))

      if (!hasAllScopes) {
        res.status(403).json({
          error: 'insufficient_scope',
          message: `Required scopes: ${requiredScopes.join(', ')}`,
        })
        return
      }
    }

    // Propagate verified claims as headers (for use by downstream services)
    req.headers['x-user-id'] = payload.sub || ''
    req.headers['x-user-email'] = payload.email || ''
    req.headers['x-user-roles'] = JSON.stringify(payload['https://api.example.com/roles'] || [])
    req.headers['x-auth-time'] = String(payload.iat || 0)

    // Remove the original Authorization header (keeps the token away from downstream services)
    delete req.headers.authorization

    next()
  } catch (error: any) {
    const errorMap: Record<string, { status: number; message: string }> = {
      TokenExpiredError: { status: 401, message: 'Token has expired' },
      JsonWebTokenError: { status: 401, message: 'Invalid token' },
      NotBeforeError: { status: 401, message: 'Token not yet valid' },
    }

    const mapped = errorMap[error.name] || { status: 401, message: 'Authentication failed' }
    res.status(mapped.status).json({ error: error.name, message: mapped.message })
  }
}
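
The scope check inside the middleware can be pulled out into a pure function, which makes it easy to test without a token or an HTTP request. This is a minimal sketch under the assumption that the `scope` claim is a space-delimited string, as in the middleware above; `parseScopes` and `hasRequiredScopes` are hypothetical helper names, not part of `jsonwebtoken`.

```typescript
// Parse a space-delimited scope claim into a set; anything that is not a string yields no scopes.
function parseScopes(scopeClaim: unknown): Set<string> {
  if (typeof scopeClaim !== 'string') return new Set<string>()
  return new Set<string>(scopeClaim.split(' ').filter((s) => s.length > 0))
}

// True only when every required scope is present in the token's scope claim.
function hasRequiredScopes(scopeClaim: unknown, required: string[]): boolean {
  const granted = parseScopes(scopeClaim)
  return required.every((s) => granted.has(s))
}
```

For instance, a token carrying `read:profile read:orders` satisfies `['read:orders']` but not `['admin:all']`, and a token with no `scope` claim satisfies only an empty requirement list.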

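Rate Limiting and Circuit Breakers

Multi-Tier Rate Limiting Strategy

Effective rate limiting is layered rather than applied at a single point. At the edge, aggressive IP-based limits defend against DDoS; at the gateway, finer-grained limits are applied per user or API key.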
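
Before bringing Redis into the picture, the layering idea can be modelled entirely in memory. The sketch below is an illustration under stated assumptions, not the gateway's implementation: `SlidingWindowLimiter` and `allowRequest` are hypothetical names, state lives in a single process, and the clock is passed in so the behaviour is deterministic.

```typescript
// In-memory sliding-window log: remembers the timestamps of accepted requests per key.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>()
  private readonly windowMs: number
  private readonly maxRequests: number

  constructor(windowMs: number, maxRequests: number) {
    this.windowMs = windowMs
    this.maxRequests = maxRequests
  }

  check(key: string, nowMs: number = Date.now()): { allowed: boolean; remaining: number } {
    const windowStart = nowMs - this.windowMs
    // Keep only the requests that are still inside the window
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart)
    const allowed = recent.length < this.maxRequests
    if (allowed) recent.push(nowMs) // rejected requests are not recorded
    this.hits.set(key, recent)
    return { allowed, remaining: Math.max(0, this.maxRequests - recent.length) }
  }
}

// Stage 1: coarse per-IP limit. Stage 2: per-user limit, consulted only if stage 1 passes.
function allowRequest(
  ipLimiter: SlidingWindowLimiter,
  userLimiter: SlidingWindowLimiter,
  ip: string,
  userId: string,
  nowMs: number
): boolean {
  return ipLimiter.check(ip, nowMs).allowed && userLimiter.check(userId, nowMs).allowed
}
```

A shared store such as Redis becomes necessary once more than one gateway instance has to enforce the same limits, which is what the implementation that follows addresses.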

// gateway/src/middleware/rateLimiter.ts
import Redis from 'ioredis'

const redis = new Redis({
  host: process.env.REDIS_HOST || 'localhost',
  port: parseInt(process.env.REDIS_PORT || '6379'),
  maxRetriesPerRequest: 3,
  retryStrategy: (times) => Math.min(times * 100, 3000),
  enableReadyCheck: true,
})

interface RateLimitConfig {
  windowMs: number // time window (milliseconds)
  maxRequests: number // maximum number of requests
  keyPrefix: string // Redis key prefix
}

// Sliding-window rate limiter (backed by a Redis sorted set)
async function slidingWindowRateLimit(
  identifier: string,
  config: RateLimitConfig
): Promise<{ allowed: boolean; remaining: number; retryAfter?: number }> {
  const key = `${config.keyPrefix}:${identifier}`
  const now = Date.now()
  const windowStart = now - config.windowMs

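  // Run the commands as a MULTI/EXEC transaction so they apply atomically
  // (a plain pipeline() only batches commands and gives no atomicity guarantee)
  const pipeline = redis.multi()
  pipeline.zremrangebyscore(key, 0, windowStart) // Drop entries that fell out of the window
  pipeline.zadd(key, now.toString(), `${now}:${Math.random()}`) // Record this request (rejected requests are recorded too)
  pipeline.zcard(key) // Count requests inside the current window
  pipeline.expire(key, Math.ceil(config.windowMs / 1000)) // Set a TTL so idle keys expire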

  const results = await pipeline.exec()
  if (!results) throw new Error('Redis pipeline failed')

  const currentCount = results[2]?.[1] as number
  const allowed = currentCount <= config.maxRequests
  const remaining = Math.max(0, config.maxRequests - currentCount)

  if (!allowed) {
    // Compute when the oldest request in the window expires
    const oldestInWindow = await redis.zrange(key, 0, 0, 'WITHSCORES')
    const retryAfter =
      oldestInWindow.length >= 2
        ? Math.ceil((parseInt(oldestInWindow[1]) + config.windowMs - now) / 1000)
        : Math.ceil(config.windowMs / 1000)

    return { allowed: false, remaining: 0, retryAfter }
  }

  return { allowed: true, remaining }
}

// Rate limiting policy per tier
const RATE_LIMIT_TIERS: Record<string, RateLimitConfig> = {
  free: { windowMs: 60_000, maxRequests: 60, keyPrefix: 'rl:free' },
  pro: { windowMs: 60_000, maxRequests: 600, keyPrefix: 'rl:pro' },
  enterprise: { windowMs: 60_000, maxRequests: 6000, keyPrefix: 'rl:enterprise' },
  // Global IP-based limit (DDoS defense)
  global_ip: { windowMs: 1_000, maxRequests: 50, keyPrefix: 'rl:ip' },
}

export async function rateLimitMiddleware(req: any, res: any, next: any) {
  // X-Forwarded-For is client-controlled unless a trusted proxy in front overwrites it
  const clientIp = req.headers['x-forwarded-for']?.split(',')[0]?.trim() || req.ip
  // x-user-id is set by the JWT middleware; x-user-tier must likewise come from a
  // trusted step (for example the auth middleware), never directly from the client
  const userId = req.headers['x-user-id']
  const tier = req.headers['x-user-tier'] || 'free'

  try {
    // Step 1: global IP-based rate limiting
    const ipResult = await slidingWindowRateLimit(clientIp, RATE_LIMIT_TIERS.global_ip)
    if (!ipResult.allowed) {
      res.set('Retry-After', String(ipResult.retryAfter))
      return res.status(429).json({ error: 'Too many requests (IP limit)' })
    }

    // Step 2: per-user, tier-based rate limiting
    if (userId) {
      const config = RATE_LIMIT_TIERS[tier] || RATE_LIMIT_TIERS.free
      const userResult = await slidingWindowRateLimit(userId, config)

      res.set('X-RateLimit-Limit', String(config.maxRequests))
      res.set('X-RateLimit-Remaining', String(userResult.remaining))

      if (!userResult.allowed) {
        res.set('Retry-After', String(userResult.retryAfter))
        return res.status(429).json({ error: 'Rate limit exceeded', tier })
      }
    }

    next()
  } catch (error) {
    // Fail open: if Redis is unavailable, let the request through instead of blocking all traffic
    console.error('Rate limiter error:', error)
    next()
  }
}
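
The sliding-window check itself can be tried without Redis. The following is a minimal in-memory sketch of the same idea, not the Redis implementation above: the class name InMemorySlidingWindow and its check method are hypothetical, timestamps are passed in explicitly so the behavior is deterministic, and unlike the Redis version it does not record rejected requests.

```typescript
// Hypothetical in-memory stand-in for the Redis sorted set, for illustration only.
class InMemorySlidingWindow {
  private hits = new Map<string, number[]>() // identifier -> request timestamps (ms)
  private windowMs: number
  private maxRequests: number

  constructor(windowMs: number, maxRequests: number) {
    this.windowMs = windowMs
    this.maxRequests = maxRequests
  }

  // Returns whether a request arriving at `now` is allowed and how many remain.
  check(identifier: string, now: number): { allowed: boolean; remaining: number } {
    const windowStart = now - this.windowMs
    // Keep only timestamps still inside the window (same role as ZREMRANGEBYSCORE).
    const recent = (this.hits.get(identifier) ?? []).filter((t) => t > windowStart)

    if (recent.length >= this.maxRequests) {
      this.hits.set(identifier, recent)
      return { allowed: false, remaining: 0 }
    }

    recent.push(now) // same role as ZADD
    this.hits.set(identifier, recent)
    return { allowed: true, remaining: this.maxRequests - recent.length }
  }
}

// Usage: 3 requests per 1000 ms window.
const limiter = new InMemorySlidingWindow(1000, 3)
const first = limiter.check('user-1', 0)
const second = limiter.check('user-1', 100)
const third = limiter.check('user-1', 200)
const fourth = limiter.check('user-1', 300) // rejected: the window already holds 3 requests
const later = limiter.check('user-1', 1001) // allowed: the request at t=0 has left the window
```

Because the window slides with each request, capacity is released one request at a time as old entries expire, rather than all at once at a fixed boundary.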

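Circuit Breaker Pattern

When failures exceed a threshold, a circuit breaker opens the circuit and returns fast failures; after a cooldown it enters the Half-Open state and tries to recover gradually.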

// gateway/src/middleware/circuitBreaker.ts
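enum CircuitState {
  CLOSED = 'CLOSED', // Normal: requests pass through
  OPEN = 'OPEN', // Tripped: fail immediately
  HALF_OPEN = 'HALF_OPEN', // Probing: allow a limited number of requests
}

interface CircuitBreakerConfig {
  failureThreshold: number // Consecutive failures before opening (e.g. 5)
  successThreshold: number // Successes needed to declare recovery (e.g. 3)
  timeout: number // How long to stay Open (ms)
  monitorWindow: number // Monitoring window (ms); not used by this simplified implementation
  halfOpenMaxRequests: number // Maximum trial requests allowed in Half-Open
}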

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED
  private failureCount = 0
  private successCount = 0
  private lastFailureTime = 0
  private halfOpenRequests = 0

  constructor(
    private name: string,
    private config: CircuitBreakerConfig
  ) {}

  async execute<T>(fn: () => Promise<T>, fallback?: () => T): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime >= this.config.timeout) {
        this.transitionTo(CircuitState.HALF_OPEN)
      } else {
        console.warn(`[CircuitBreaker:${this.name}] OPEN - rejecting request`)
        if (fallback) return fallback()
        throw new Error(`Service ${this.name} is unavailable (circuit open)`)
      }
    }

    // Track whether this call holds a Half-Open trial slot so it can be released afterwards
    let isTrial = false
    if (this.state === CircuitState.HALF_OPEN) {
      if (this.halfOpenRequests >= this.config.halfOpenMaxRequests) {
        if (fallback) return fallback()
        throw new Error(`Service ${this.name} is testing recovery`)
      }
      this.halfOpenRequests++
      isTrial = true
    }

    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      if (fallback && this.state === CircuitState.OPEN) return fallback()
      throw error
    } finally {
      // Release the trial slot; otherwise the breaker can stay Half-Open forever
      // once every slot has been used without reaching successThreshold
      if (isTrial && this.halfOpenRequests > 0) this.halfOpenRequests--
    }
  }

  private onSuccess() {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++
      if (this.successCount >= this.config.successThreshold) {
        this.transitionTo(CircuitState.CLOSED)
      }
    }
    this.failureCount = 0
  }

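  private onFailure() {
    this.failureCount++
    this.lastFailureTime = Date.now()
    // A failed trial in Half-Open reopens the circuit immediately;
    // in Closed, the circuit opens once the failure threshold is reached
    if (
      this.state === CircuitState.HALF_OPEN ||
      this.failureCount >= this.config.failureThreshold
    ) {
      this.transitionTo(CircuitState.OPEN)
    }
  }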

  private transitionTo(newState: CircuitState) {
    console.log(`[CircuitBreaker:${this.name}] ${this.state} -> ${newState}`)
    this.state = newState
    if (newState === CircuitState.CLOSED) {
      this.failureCount = 0
      this.successCount = 0
    }
    if (newState === CircuitState.HALF_OPEN) {
      this.halfOpenRequests = 0
      this.successCount = 0
    }
  }

  getStatus() {
    return {
      name: this.name,
      state: this.state,
      failureCount: this.failureCount,
      lastFailureTime: this.lastFailureTime ? new Date(this.lastFailureTime).toISOString() : null,
    }
  }
}

// One circuit breaker instance per service
export const circuitBreakers = {
  userService: new CircuitBreaker('user-service', {
    failureThreshold: 5,
    successThreshold: 3,
    timeout: 30_000, // Move to Half-Open after 30 seconds
    monitorWindow: 60_000, // 1-minute monitoring window
    halfOpenMaxRequests: 3,
  }),
  orderService: new CircuitBreaker('order-service', {
    failureThreshold: 3,
    successThreshold: 2,
    timeout: 60_000, // Payment-related, so use a conservative 60 seconds
    monitorWindow: 60_000,
    halfOpenMaxRequests: 1,
  }),
}
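
The state transitions of the pattern can be summarized as a small pure function. This is only a sketch for reasoning about the three states, under the common convention that any failed trial in Half-Open reopens the circuit; the names nextState and BreakerEvent and the event labels are hypothetical and do not appear in the class above.

```typescript
// Hypothetical pure-function summary of the three-state machine, for illustration only.
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN'
type BreakerEvent =
  | 'failure_threshold_reached' // enough failures while Closed
  | 'cooldown_elapsed' // the Open timeout has passed
  | 'enough_trials_succeeded' // successThreshold reached while Half-Open
  | 'trial_failed' // a trial request failed while Half-Open

function nextState(state: State, event: BreakerEvent): State {
  switch (state) {
    case 'CLOSED':
      return event === 'failure_threshold_reached' ? 'OPEN' : 'CLOSED'
    case 'OPEN':
      return event === 'cooldown_elapsed' ? 'HALF_OPEN' : 'OPEN'
    case 'HALF_OPEN':
      if (event === 'enough_trials_succeeded') return 'CLOSED'
      if (event === 'trial_failed') return 'OPEN'
      return 'HALF_OPEN'
  }
}

// Walk through a failed recovery attempt followed by a successful one.
const events: BreakerEvent[] = [
  'failure_threshold_reached',
  'cooldown_elapsed',
  'trial_failed',
  'cooldown_elapsed',
  'enough_trials_succeeded',
]
const path = events.reduce<State[]>(
  (acc, event) => [...acc, nextState(acc[acc.length - 1], event)],
  ['CLOSED']
)
// path: CLOSED, OPEN, HALF_OPEN, OPEN, HALF_OPEN, CLOSED
```

Keeping the transition table separate from timers and counters makes it easy to unit test the breaker's behavior without waiting for real timeouts.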

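Observability Integration

Because all traffic passes through the gateway, it is the natural hub for observability. All three pillars (logs, metrics, and traces) should be collected at the gateway tier.

OpenTelemetry Integration Setup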

# otel-collector-config.yaml
# OpenTelemetry Collector configuration: collecting gateway traffic
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Add gateway metadata
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert
      - key: service.layer
        value: gateway
        action: upsert
  # Tail sampling: keep every error trace and every trace slower than 1s, plus 10% of the rest
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlphttp/jaeger:
    endpoint: http://jaeger:4318
  prometheus:
    endpoint: 0.0.0.0:8889
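  # Note: the labels block below follows an older loki exporter schema and may not be
  # accepted by newer Collector releases; verify it against the version you deploy
  loki: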
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: 'service_name'
        service.layer: 'layer'

service:
  pipelines:
    traces:
      receivers: [otlp]
      # batch goes last so that spans are batched after the sampling decision
      processors: [attributes, tail_sampling, batch]
      exporters: [otlphttp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [loki]
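
How the three tail_sampling policies in this configuration combine can be modeled in a few lines. The sketch below is not Collector code: TraceSummary and keepTrace are hypothetical names, and the roll argument stands in for the probabilistic policy's per-trace decision. It only illustrates that a trace is kept when at least one policy votes to sample it.

```typescript
// Hypothetical model of how the three policies combine; not Collector code.
interface TraceSummary {
  hasError: boolean // at least one span ended with status ERROR
  durationMs: number // end-to-end duration of the trace
}

// `roll` is a number in [0, 100) standing in for the probabilistic policy's decision.
function keepTrace(trace: TraceSummary, roll: number): boolean {
  const errorPolicy = trace.hasError // error-policy
  const latencyPolicy = trace.durationMs > 1000 // latency-policy, threshold_ms: 1000
  const probabilisticPolicy = roll < 10 // probabilistic-policy, sampling_percentage: 10
  // The trace is sampled if any single policy decides to sample it.
  return errorPolicy || latencyPolicy || probabilisticPolicy
}

const fastOkUnlucky = keepTrace({ hasError: false, durationMs: 120 }, 55) // dropped
const fastOkLucky = keepTrace({ hasError: false, durationMs: 120 }, 3) // kept by the 10% sample
const slowOk = keepTrace({ hasError: false, durationMs: 2500 }, 55) // kept by latency
const fastError = keepTrace({ hasError: true, durationMs: 120 }, 55) // kept by error status
```

The practical consequence is that the stored volume is higher than 10% of traffic whenever errors or slow requests are common, which is worth accounting for when sizing the trace backend.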

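Key Metrics for the Dashboard

The following metrics should always be monitored at the gateway.

Request Rate: requests per second. An abnormal spike points to a DDoS attack or a client bug.
Error Rate: ratio of 4xx and 5xx responses, tracked separately per service.
Latency: p50, p95, and p99 percentiles. Alert when p99 exceeds the SLA.
Circuit breaker state: the current state of each service's circuit, shown in real time.
Rate limiting hit ratio: share of requests that hit a limit. If legitimate users hit it often, the thresholds need adjusting.
Upstream connection pool: active connections, queued requests, and timeout counts.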
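
As a rough illustration of how the first three metrics are derived, the sketch below computes request rate, error ratios, and nearest-rank percentiles from raw samples. The RequestSample shape and the function names are hypothetical; a production gateway would normally export histograms and let the metrics backend compute percentiles instead of holding raw samples in memory.

```typescript
// Hypothetical request sample as a gateway access log might record it.
interface RequestSample {
  status: number // HTTP status code
  durationMs: number // total request latency
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0
  const rank = Math.ceil((p / 100) * sortedAsc.length)
  return sortedAsc[Math.max(0, rank - 1)]
}

function summarize(samples: RequestSample[], windowSeconds: number) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b)
  const total = samples.length
  const clientErrors = samples.filter((s) => s.status >= 400 && s.status < 500).length
  const serverErrors = samples.filter((s) => s.status >= 500).length
  return {
    requestRate: total / windowSeconds, // requests per second
    clientErrorRate: total === 0 ? 0 : clientErrors / total, // 4xx ratio
    serverErrorRate: total === 0 ? 0 : serverErrors / total, // 5xx ratio
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  }
}

// Ten requests over a 5-second window: eight 200s, one 404, one 503.
const samples: RequestSample[] = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100].map((d, i) => ({
  durationMs: d,
  status: i === 8 ? 404 : i === 9 ? 503 : 200,
}))
const summary = summarize(samples, 5)
```

Tracking 4xx and 5xx separately matters for alerting: a rise in 4xx usually points at clients, while a rise in 5xx points at the gateway or its upstreams.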

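Troubleshooting

Failure Scenario 1: Gateway Memory Leak

Symptom: Memory usage of the gateway Pod grows steadily until the Pod is OOM-killed.

Cause: Large request or response bodies are buffered and the memory is not released. This typically shows up when proxying file-upload APIs or large JSON responses.

Resolution: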

# Kong: limit request body size (the request-size-limiting plugin does not cap responses)
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: request-size-limiting
config:
  allowed_payload_size: 10 # MB
  size_unit: megabytes
  require_content_length: true
plugin: request-size-limiting

# Envoy: buffer limit settings
http_filters:
  - name: envoy.filters.http.buffer
    typed_config:
      '@type': type.googleapis.com/envoy.extensions.filters.http.buffer.v3.Buffer
      max_request_bytes: 10485760 # 10MB

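Failure Scenario 2: Cascading Failure

Symptom: The failure of a single backend service propagates through the gateway to the whole system.

Cause: No circuit breaker is configured, or timeouts are too long. Requests to the failing service exhaust the gateway's connection pool.

Recovery procedure:

Immediately open the circuit breaker for the failing service by hand
Configure a static fallback response on the routes of that service
Check the connection pool state and, if needed, do a rolling restart of the gateway Pods
Once the service has recovered, restore traffic gradually (canary)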

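Failure Scenario 3: TLS Certificate Expiry

Symptom: All API calls suddenly fail. Clients log SSL/TLS handshake failures, and the gateway may return 503 when the expired certificate is on an upstream connection.

Prevention: Automate certificate renewal with cert-manager and alert 30 days before expiry.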

#!/bin/bash
# Certificate expiry check script
DOMAIN="api.example.com"
EXPIRY=$(echo | openssl s_client -servername "$DOMAIN" -connect "$DOMAIN:443" 2>/dev/null \
  | openssl x509 -noout -enddate 2>/dev/null \
  | cut -d= -f2)

# Stop early if the certificate could not be fetched or parsed
if [ -z "$EXPIRY" ]; then
  echo "ERROR: could not read certificate for $DOMAIN" >&2
  exit 1
fi

# Try GNU date first, then fall back to BSD/macOS date
EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || date -j -f "%b %d %T %Y %Z" "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))

echo "Certificate expires: $EXPIRY ($DAYS_LEFT days remaining)"

if [ "$DAYS_LEFT" -lt 30 ]; then
  echo "WARNING: Certificate expiring soon!"
  # Send a Slack/PagerDuty alert here
fi

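Failure Scenario 4: GraphQL Federation Subgraph Schema Conflict

Symptom: rover supergraph compose fails because two subgraphs define the same field on the same type with different return types.

Resolution: First align the field's return type across the subgraphs, since @shareable alone does not resolve incompatible return types. Then use Federation v2's @shareable directive to declare explicitly that the field is resolved by more than one subgraph, or use the @override directive to move ownership of the field to a single subgraph. Run rover subgraph check in the CI/CD pipeline so that incompatibilities are caught before composition.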
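
A minimal sketch of what the aligned schemas might look like is shown below, with the SDL held in TypeScript strings as is common in subgraph code. The subgraph names (products, inventory), the Product type, and its fields are made up for illustration, and fieldType is a crude hypothetical helper for a local sanity check; it is not a substitute for rover subgraph check.

```typescript
// Hypothetical subgraph schemas; type, field, and subgraph names are illustrative only.
const productsSubgraph = `
  extend schema
    @link(url: "https://specs.apollo.dev/federation/v2.3", import: ["@key", "@shareable"])

  type Product @key(fields: "id") {
    id: ID!
    name: String! @shareable
    inStock: Boolean!
  }
`

const inventorySubgraph = `
  extend schema
    @link(url: "https://specs.apollo.dev/federation/v2.3", import: ["@key", "@shareable", "@override"])

  type Product @key(fields: "id") {
    id: ID!
    name: String! @shareable
    inStock: Boolean! @override(from: "products")
  }
`

// Crude helper: extracts the declared return type of a field from an SDL string.
function fieldType(sdl: string, field: string): string | null {
  const match = sdl.match(new RegExp('\\b' + field + ':\\s*([\\w!\\[\\]]+)'))
  return match ? match[1] : null
}

// Both subgraphs must agree on the return type of every shared field.
const nameTypesMatch = fieldType(productsSubgraph, 'name') === fieldType(inventorySubgraph, 'name')
const inStockTypesMatch =
  fieldType(productsSubgraph, 'inStock') === fieldType(inventorySubgraph, 'inStock')
```

Here name is resolvable from both subgraphs, so both mark it @shareable, while inStock is being migrated: the inventory subgraph takes over resolution from products through @override.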

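Production Checklist

Pre-Deployment Verification

Run the gateway cluster with at least three nodes for HA
Enforce TLS 1.3 and disable weak cipher suites
Configure health check endpoints (/health, /ready)
Set request and response size limits (10MB or less by default)
Set timeouts: connect 5s, read 30s, write 30s
Choose a sensible DNS TTL (too long a TTL delays failure recovery)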

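Security

Use TLS for all external traffic and mTLS for internal traffic
Validate JWTs at the gateway and do not forward the token to downstream services
Apply an allowlist of permitted CORS origins
Rate limiting: apply both IP-based (DDoS) and user-based (fair use) limits
Mask sensitive headers (Authorization, Cookie) in logs
Restrict the Admin API so it is reachable only from the internal network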

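Observability

Distributed tracing: assign an X-Request-ID to every request and propagate it, together with the trace context headers
Metrics: collect RED (Rate, Errors, Duration) metrics and build dashboards on them
Alerting: alert immediately when the error rate exceeds 5%, p99 latency exceeds the SLA, or a circuit breaker opens
Logging: structured JSON logs; log request and response bodies only in debug mode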

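Operations

Canary deployment: validate gateway configuration changes on 5-10% of traffic before rolling them out fully
Configuration versioning: manage all gateway configuration in Git (GitOps)
Rollback procedure: have a procedure that restores the previous configuration within 30 seconds
Load testing: run regular load tests at two or more times the expected peak traffic
Chaos engineering: verify automatic recovery when a single gateway node fails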

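Anti-Patterns and Failure Cases

Anti-Pattern 1: God Gateway

This is the pattern of implementing all business logic in the gateway. When the gateway handles request validation, data transformation, business rules, and response aggregation all at once, it turns into a monolith. The gateway should handle only cross-cutting concerns such as routing, authentication, and rate limiting.

Anti-Pattern 2: Shared BFF

Once a single BFF starts serving both web and mobile, the requirements of the two sides collide and complexity grows rapidly. Conditionals of the form "this field is only needed on mobile" spread throughout the BFF, and the result is worse than having no BFF at all. Keep a dedicated BFF for each frontend type.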

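Anti-Pattern 3: Overusing Data Transformation in the Gateway

Performing heavy response transformation in the gateway (renaming fields, restructuring data, filtering fields based on business logic) drives up the gateway's CPU usage and makes debugging harder. Data transformation belongs in the BFF or in the service itself.

Anti-Pattern 4: Deploying Without Rate Limiting

"Our service has little traffic, so we don't need rate limiting" is the most common mistake. Unexpected crawlers, badly implemented clients retrying forever, and abuse through leaked API keys happen regardless of traffic volume. For a production deployment, rate limiting is mandatory, not optional.

Anti-Pattern 5: Synchronous Call Chains Without Circuit Breakers

In a synchronous chain where the gateway calls service A and service A calls service B, a failure in service B takes down service A and then the gateway if there is no circuit breaker (cascading failure). Apply a circuit breaker at every outbound call site and, where possible, switch to asynchronous communication.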

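Failure Case: The GraphQL Federation N+1 Query Problem

In a Federation setup, an N+1 problem arises when the Router fetches a list of users from subgraph A and the orders of each user are then looked up one by one in subgraph B. Apollo Router's query plan batches requests where it can, but performance degrades sharply if the subgraph does not support batched lookups. Apply the DataLoader pattern so that the subgraph's __resolveReference resolves references in batches.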
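
The batching idea behind the DataLoader pattern can be sketched in a few lines. The class below is a hypothetical, minimal stand-in written for illustration and is not the dataloader package; it has no caching or deduplication, and it assumes the batch function returns values in the same order as the keys it receives.

```typescript
// Hypothetical minimal batch loader illustrating the DataLoader pattern.
type Pending<K, V> = { key: K; resolve: (value: V) => void; reject: (reason: unknown) => void }

class TinyBatchLoader<K, V> {
  private queue: Pending<K, V>[] = []
  private batchFn: (keys: K[]) => Promise<V[]>
  batchCalls = 0 // how many times the backend was actually queried

  constructor(batchFn: (keys: K[]) => Promise<V[]>) {
    this.batchFn = batchFn
  }

  get pendingCount(): number {
    return this.queue.length
  }

  load(key: K): Promise<V> {
    return new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject })
      // The first key queued in a tick schedules one flush for everything queued after it.
      if (this.queue.length === 1) void Promise.resolve().then(() => this.flush())
    })
  }

  private async flush(): Promise<void> {
    const pending = this.queue
    this.queue = []
    this.batchCalls++
    try {
      // Assumption: batchFn returns one value per key, in key order.
      const values = await this.batchFn(pending.map((p) => p.key))
      pending.forEach((p, i) => p.resolve(values[i]))
    } catch (err) {
      pending.forEach((p) => p.reject(err))
    }
  }
}

// Usage: three reference lookups issued in the same tick become a single batch call.
const fetchedBatches: string[][] = []
const orderLoader = new TinyBatchLoader<string, string[]>(async (userIds) => {
  fetchedBatches.push(userIds) // stands in for one "WHERE user_id IN (...)" query
  return userIds.map((id) => [`order-of-${id}`])
})
const loads = ['u1', 'u2', 'u3'].map((id) => orderLoader.load(id))
void Promise.all(loads).then((orders) => console.log(fetchedBatches.length, orders.length)) // logs: 1 3
```

In a subgraph, __resolveReference would call a loader like this instead of querying the database directly, so that N entity references arriving in the same tick produce one backend query rather than N.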

References

Microservices Pattern: API Gateway / Backends for Frontends - Chris Richardson's microservices pattern reference
Sam Newman - Backends For Frontends - Sam Newman's write-up of the BFF pattern
Kong Gateway Kubernetes Documentation - official Kong installation guide for Kubernetes
Envoy Proxy Documentation - official Envoy proxy documentation
Introduction to Apollo Federation - official Apollo GraphQL Federation documentation
The Backend for Frontend Pattern (BFF) - Auth0 - Auth0's guide to the BFF pattern and security
Backends for Frontends Pattern - Azure Architecture Center - Microsoft Azure's BFF pattern reference
Kubernetes Gateway API Implementations - list of Kubernetes Gateway API implementations
API Gateway Patterns for Microservices - Kong vs NGINX vs Envoy - comparative analysis of gateways