Chaos and Order

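This is part 4 of the five-part Debugging in Practice series.

Debugging guide by language

Debugging in practice by framework

Complete guide to debugging by IDE

Language × framework incident casebook ← this article

Remote debugging in practice

Why "language × framework combinations" need their own guide

There are plenty of debugging guides organized by language or by framework. But the incidents that actually blow up in production occur where a language runtime's characteristics and a framework's execution model overlap.

Python's GIL + FastAPI's async event loop = one line of synchronous code stalls every request
Java's thread pool + Spring Boot's auto-configuration = one missed setting exhausts the connection pool
Go's goroutines + Gin's context propagation = goroutines pile up without bound when cancellation is not propagated
Next.js SSR hydration + React's state model = a server/client mismatch breaks the screen

This article groups eight incident cases that recur in practice by combination, and for each one covers everything from symptoms to recurrence prevention in one place. It is written not as theory but at a level you can use on call tomorrow.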

Summary comparison of the 8 cases

#	Combination	Typical symptoms	Key tools
1	Python + FastAPI	Response latency spikes, slow-callback warnings from the event loop	`py-spy`, asyncio debug mode (`loop.slow_callback_duration`)
2	Python + Django	Extremely slow page loads, hundreds of queries	`django-debug-toolbar`, `nplusone`, `EXPLAIN ANALYZE`
3	JS/TS + Next.js	Hydration errors, screen flicker or broken layout	Chrome DevTools, `next build --debug`, `useEffect` analysis
4	JS/TS + NestJS/Express	Steadily growing memory, process OOM kill	`--inspect`, Chrome Memory Profiler, `clinic.js`
5	Go + Gin/Fiber	Unbounded goroutine growth, memory blow-up	`pprof`, `runtime.NumGoroutine()`, Grafana
6	Java + Spring Boot	Request queuing and timeouts, HikariCP warnings	JFR, `async-profiler`, Micrometer/Prometheus
7	Java + Spring Data JPA	Dozens of queries for a single API call, exceptions at flush time	Hibernate SQL log, `p6spy`, `EXPLAIN`
8	React + API Backend	Stale data shown on screen, overwritten after a flicker	React DevTools, Network tab, `AbortController`

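Case 1: Python + FastAPI (Event Loop Blocking & Pydantic Validation Bottleneck)

Case 1-A: Event Loop Blocking

Combination: Python 3.11+ / FastAPI 0.110+ / Uvicorn

Symptom: An API whose p99 response time is normally 50 ms sees response times across the whole server jump to 2-5 seconds after a particular endpoint is called. With asyncio debug mode enabled, the log shows warnings such as Executing <Task ...> took 3.200 seconds.

Cause: FastAPI runs async def handlers on the asyncio event loop. If a handler calls blocking code (file reads, a synchronous DB driver, CPU-heavy computation), the event loop itself is blocked and every other request has to wait.

Reproduction:

# bad_endpoint.py: this code stalls the whole server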
import time
from fastapi import FastAPI

app = FastAPI()

@app.get("/slow")
async def slow_endpoint():
    # A synchronous sleep occupies the event loop for 3 seconds
    time.sleep(3)
    return {"status": "done"}

@app.get("/health")
async def health():
    return {"status": "ok"}

# Terminal 1: start the server
uvicorn bad_endpoint:app --host 0.0.0.0 --port 8000

# Terminal 2: check whether /health also takes about 3 seconds while /slow is in flight
curl http://localhost:8000/slow &
sleep 0.5
time curl http://localhost:8000/health
# If /health takes about 2.5 seconds, event loop blocking is confirmed

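Breakpoints:

uvicorn.protocols.http.httptools_impl:RequestResponseCycle.receive: the point where a request is received
Right after entry into the suspect handler function
Enable asyncio debug mode: PYTHONASYNCIODEBUG=1 uvicorn ...

Profiling:

# Inspect live CPU stacks
py-spy top --pid $(pgrep -f uvicorn)

# Generate a flame graph
py-spy record -o flamegraph.svg --pid $(pgrep -f uvicorn) -d 30

# Adjust the asyncio slow-callback threshold (warn at 0.1 s or more).
# It only takes effect in debug mode and must run on the server's own loop.

import asyncio

async def tune_slow_callback_threshold():  # call from a startup hook
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # 100ms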
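
To see what the debug-mode warning looks like without starting a server, the following is a minimal standard-library sketch. The names blocking_handler, ListHandler and capture_slow_callbacks are made up for this illustration; the only asyncio features used are debug mode and slow_callback_duration, and the 0.2 s sleep is an arbitrary stand-in for a blocking call.

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collects formatted log messages in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


async def blocking_handler():
    # Hypothetical handler: a synchronous sleep blocks the event loop.
    time.sleep(0.2)


async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # warn when one step takes >= 100 ms
    await blocking_handler()


def capture_slow_callbacks():
    """Run main() in debug mode and return the slow-callback warnings."""
    handler = ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(handler)
    try:
        asyncio.run(main(), debug=True)
    finally:
        logger.removeHandler(handler)
    return [m for m in handler.messages if "took" in m]


if __name__ == "__main__":
    for message in capture_slow_callbacks():
        print(message)
```

The warning is emitted on the asyncio logger, so in a real service it ends up wherever that logger is routed.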

Fix:

import asyncio
import time
from fastapi import FastAPI
from functools import partial

app = FastAPI()

def cpu_heavy_task(data: dict) -> dict:
    """Synchronous function standing in for a blocking call."""
    time.sleep(3)  # in practice: sync I/O or a complex computation
    return {"result": "processed"}

@app.get("/slow-fixed")
async def slow_endpoint_fixed():
    # Option 1: hand the call to the default thread pool with run_in_executor.
    # Threads help with blocking I/O; pure-Python CPU-bound work still contends
    # for the GIL, so a ProcessPoolExecutor is the better fit for that.
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, partial(cpu_heavy_task, {}))
    return result

# Option 2: declare the handler with def and FastAPI runs it in a threadpool automatically
@app.get("/slow-fixed-v2")
def slow_endpoint_sync():
    time.sleep(3)
    return {"status": "done"}
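
The effect of offloading can be checked without FastAPI. The sketch below is an illustration under assumptions: heartbeat, blocking_work and run are hypothetical names, and asyncio.to_thread (Python 3.9+) stands in for the run_in_executor call. It counts how often a heartbeat coroutine gets to run while a 0.2 s blocking call is in progress.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    """Stand-in for a synchronous call (sync driver, file read, ...)."""
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks: list, interval: float, stop: asyncio.Event):
    """Records a timestamp every `interval` seconds while the loop is free."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def run(offload: bool) -> int:
    """Returns how many heartbeat ticks happened during the blocking call."""
    ticks = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, 0.01, stop))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    if offload:
        await asyncio.to_thread(blocking_work, 0.2)  # loop stays responsive
    else:
        blocking_work(0.2)  # loop is blocked for the whole call
    stop.set()
    await hb
    return len(ticks)


if __name__ == "__main__":
    print("ticks while blocking the loop:", asyncio.run(run(offload=False)))
    print("ticks while offloading:", asyncio.run(run(offload=True)))
```

When the call runs directly on the loop, the heartbeat only gets its initial tick; when it is offloaded, the heartbeat keeps ticking throughout.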

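Case 1-B: Pydantic Validation Bottleneck

Symptom: A POST API that accepts bulk data (a list of several thousand items) responds abnormally slowly. CPU usage spikes.

Cause: Pydantic v1 validators run at the Python level for every field, so a large list input incurs O(n * fields) validation cost.

Reproduction: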

from pydantic import BaseModel
from fastapi import FastAPI
from typing import List

class Item(BaseModel):
    name: str
    price: float
    category: str
    tags: List[str]

class BulkRequest(BaseModel):
    items: List[Item]  # 5000 items → validation alone takes 2 s+

app = FastAPI()

@app.post("/bulk")
async def bulk_create(req: BulkRequest):
    return {"count": len(req.items)}

# Benchmark with 5000 items
python -c "
import json, requests, time
items = [{'name': f'item-{i}', 'price': 9.99, 'category': 'test', 'tags': ['a','b']} for i in range(5000)]
start = time.time()
r = requests.post('http://localhost:8000/bulk', json={'items': items})
print(f'Time: {time.time()-start:.2f}s, Status: {r.status_code}')
"

Profiling:

py-spy record -o pydantic_profile.svg --pid $(pgrep -f uvicorn) -d 10
# Confirmed if frames under pydantic.validators account for most of the time

Fix:

# 1) Use Pydantic v2 (Rust-based core, 5-50x faster)
# pyproject.toml: pydantic>=2.0

# 2) For bulk input, bypass Pydantic and parse the raw body directly.
#    Note: request.body() still buffers the whole payload; this is not streaming.
from fastapi import Request
import orjson

@app.post("/bulk-stream")
async def bulk_stream(request: Request):
    body = await request.body()
    data = orjson.loads(body)  # skip Pydantic validation and parse directly
    # Manually check only the required fields
    validated = []
    for item in data.get("items", []):
        if "name" in item and "price" in item:
            validated.append(item)
    return {"count": len(validated)}
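
The manual check above only tests that keys are present. If Pydantic is bypassed, a slightly stricter hand-written validator might look like the sketch below. The field names come from the Item model earlier in this case; the REQUIRED table, validate_item, split_valid, and the policy of collecting rejected indexes are assumptions made for illustration.

```python
from typing import Any

# Field names taken from the Item model; the accepted types are an assumption.
REQUIRED = {"name": str, "price": (int, float), "category": str, "tags": list}


def validate_item(item: Any) -> bool:
    """Return True when every required field exists with the expected type."""
    if not isinstance(item, dict):
        return False
    for field, expected in REQUIRED.items():
        value = item.get(field)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return all(isinstance(tag, str) for tag in item["tags"])


def split_valid(items: list) -> tuple:
    """Separate valid items from the indexes of rejected ones."""
    valid, rejected = [], []
    for index, item in enumerate(items):
        if validate_item(item):
            valid.append(item)
        else:
            rejected.append(index)
    return valid, rejected


if __name__ == "__main__":
    sample = [
        {"name": "a", "price": 9.99, "category": "test", "tags": ["x"]},
        {"name": "b", "category": "test", "tags": []},
    ]
    print(split_valid(sample))
```

Returning the rejected indexes lets the endpoint report which rows were dropped instead of silently shrinking the count.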

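Recurrence-prevention checklist:

Check in code review that async def handlers contain no blocking synchronous calls
Keep PYTHONASYNCIODEBUG=1 permanently enabled in the staging environment
Confirm that the Pydantic v2 migration is complete
Add a dedicated benchmark test for bulk APIs that take 1000 or more records
Include py-spy profiling in the performance-test stage of CI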

Case 2: Python + Django (N+1 Queries & Transaction Boundary Problems)

Case 2-A: N+1 Query

Combination: Python 3.10+ / Django 4.2+ / PostgreSQL

Symptom: Loading the order list in the admin page takes more than 10 seconds. DB CPU spikes.

Cause: When iterating over ForeignKey/ManyToMany relations in the Django ORM without select_related/prefetch_related, each object access triggers a separate query.

Reproduction:

# models.py
from django.db import models

class Customer(models.Model):
    name = models.CharField(max_length=100)

class Order(models.Model):
    customer = models.ForeignKey(Customer, on_delete=models.CASCADE)
    total = models.DecimalField(max_digits=10, decimal_places=2)
    created_at = models.DateTimeField(auto_now_add=True)

class OrderItem(models.Model):
    order = models.ForeignKey(Order, on_delete=models.CASCADE, related_name='items')
    product_name = models.CharField(max_length=200)
    quantity = models.IntegerField()

# views.py: code that triggers N+1
from django.http import JsonResponse

from .models import Order

def order_list(request):
    orders = Order.objects.all()[:100]
    data = []
    for order in orders:
        data.append({
            'customer': order.customer.name,        # 1 query per order (N)
            'items': [i.product_name for i in order.items.all()],  # 1 query per order (N)
            'total': order.total,
        })
    # Total queries: 1 + 100 + 100 = 201
    return JsonResponse(data, safe=False)

Breakpoints:

django.db.models.sql.compiler:SQLCompiler.execute_sql: where every query is executed
django.db.backends.utils:CursorDebugWrapper.execute: the query log

Profiling:

# Install django-debug-toolbar (development environment)
pip install django-debug-toolbar

# settings.py
INSTALLED_APPS += ['debug_toolbar']
MIDDLEWARE.insert(0, 'debug_toolbar.middleware.DebugToolbarMiddleware')

# Detect N+1 automatically with the nplusone library
pip install nplusone
# settings.py
INSTALLED_APPS += ['nplusone.ext.django']
MIDDLEWARE.insert(0, 'nplusone.ext.django.NPlusOneMiddleware')
NPLUSONE_RAISE = True  # raise an exception when N+1 is detected

-- Find slow queries in PostgreSQL (requires the pg_stat_statements extension)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Fix:

# views.py: after the fix
from django.http import JsonResponse

from .models import Order

def order_list(request):
    orders = (
        Order.objects
        .select_related('customer')          # fetched via JOIN in the orders query
        .prefetch_related('items')           # one extra IN query
        .all()[:100]
    )
    data = []
    for order in orders:
        data.append({
            'customer': order.customer.name,
            'items': [i.product_name for i in order.items.all()],
            'total': order.total,
        })
    # Total queries: 1 (orders JOIN customers) + 1 (items IN) = 2
    return JsonResponse(data, safe=False)
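
The query-count difference can be observed outside Django as well. The following sketch uses the standard-library sqlite3 module with a trace callback to count SELECT statements. The table layout loosely mirrors the models above, and build_db, count_selects, naive and joined are hypothetical helper names. In a Django test suite, assertNumQueries plays the role of count_selects.

```python
import sqlite3


def build_db(n_orders: int) -> sqlite3.Connection:
    """Create an in-memory database with one customer per order."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        """
    )
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(n_orders)],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(i, i, 10.0) for i in range(n_orders)],
    )
    conn.commit()
    return conn


def count_selects(conn: sqlite3.Connection, work) -> int:
    """Run work(conn) and count the SELECT statements it executes."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        work(conn)
    finally:
        conn.set_trace_callback(None)
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))


def naive(conn):
    # One query for the orders, then one more per order: the N+1 pattern
    orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()
    for _order_id, customer_id in orders:
        conn.execute(
            "SELECT name FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()


def joined(conn):
    # A single JOIN, which is what select_related generates
    conn.execute(
        "SELECT o.id, c.name FROM orders o JOIN customer c ON c.id = o.customer_id"
    ).fetchall()


if __name__ == "__main__":
    db = build_db(100)
    print("naive:", count_selects(db, naive))
    print("joined:", count_selects(db, joined))
```

Counting statements rather than measuring time makes the check deterministic, which is why a query-count assertion works well in CI.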

Case 2-B: Transaction Boundary Problem

Symptom: In the payment API, an order is intermittently created with no payment record. Data consistency is broken.

Cause: With Django's ATOMIC_REQUESTS = False (the default), each DB operation in a view is auto-committed on its own, so an exception midway leaves only part of the work committed.

Reproduction / Fix:

# BAD: no transaction boundary
def create_payment(request):
    order = Order.objects.create(customer_id=1, total=100)
    # If an exception is raised here, the order is already committed
    payment = Payment.objects.create(order=order, amount=100)
    send_notification(order)  # external API call
    return JsonResponse({"order_id": order.id})

# GOOD: explicit transaction
from django.db import transaction

def create_payment(request):
    with transaction.atomic():
        order = Order.objects.create(customer_id=1, total=100)
        payment = Payment.objects.create(order=order, amount=100)
    # Run side effects outside the transaction
    send_notification(order)
    return JsonResponse({"order_id": order.id})
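
The partial-commit behavior can be reproduced with nothing but the standard library. In this sketch sqlite3 stands in for the real database, the payment table's NOT NULL constraint is an invented way to force the second insert to fail, and the function names are hypothetical. The `with conn:` block plays the role of transaction.atomic(): it commits on success and rolls back on an exception.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.execute(
        "CREATE TABLE payment ("
        "id INTEGER PRIMARY KEY, order_id INTEGER NOT NULL, amount REAL NOT NULL)"
    )
    conn.commit()
    return conn


def create_payment_without_boundary(conn, amount):
    """Each statement is committed on its own, like autocommit."""
    cur = conn.execute("INSERT INTO orders (total) VALUES (?)", (100.0,))
    conn.commit()
    conn.execute(
        "INSERT INTO payment (order_id, amount) VALUES (?, ?)",
        (cur.lastrowid, amount),
    )
    conn.commit()


def create_payment_atomic(conn, amount):
    """Both inserts succeed together or are rolled back together."""
    with conn:  # commits on success, rolls back if the block raises
        cur = conn.execute("INSERT INTO orders (total) VALUES (?)", (100.0,))
        conn.execute(
            "INSERT INTO payment (order_id, amount) VALUES (?, ?)",
            (cur.lastrowid, amount),
        )


def order_count(conn) -> int:
    return conn.execute("SELECT count(*) FROM orders").fetchone()[0]


if __name__ == "__main__":
    for create in (create_payment_without_boundary, create_payment_atomic):
        db = make_db()
        try:
            create(db, None)  # None violates NOT NULL on payment.amount
        except sqlite3.IntegrityError:
            db.rollback()
        print(create.__name__, "orders left behind:", order_count(db))
```

Without a boundary the orphaned order survives the failure; inside the atomic block nothing is left behind.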

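Recurrence-prevention checklist:

Check that every list/detail view applies select_related / prefetch_related
Adopt nplusone or the django-auto-prefetch library
Monitor query counts in staging with the django-debug-toolbar SQL panel
Apply transaction.atomic() to views that modify two or more models
Consider setting ATOMIC_REQUESTS = True globally (watch where side effects run)
Add a query-count upper-bound assertion test to CI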

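Case 3: JS/TS + Next.js (Hydration Mismatch & Server/Client Boundary Confusion)

Combination: TypeScript / Next.js 14+ (App Router) / React 18+

Symptom: On page load the console shows "Hydration failed because the initial UI does not match what was rendered on the server". The screen flickers or the layout breaks momentarily. In severe cases React falls back to client rendering for everything up to the nearest Suspense boundary (the entire root if there is none), which throws away the SSR work for that part of the page.

Cause: The HTML rendered on the server does not match the DOM the client produces during hydration. Typical causes:

Using non-deterministic values such as Date.now() or Math.random() during render
Returning different UI on server and client through a typeof window !== 'undefined' branch
Using browser-only APIs (localStorage, navigator) in the initial render

Reproduction:

// app/dashboard/page.tsx: triggers a hydration mismatch
'use client' // without this directive the page is a Server Component and is never hydrated

export default function Dashboard() {
  // Server: "2026-03-07T09:00:00Z", client: "2026-03-07T09:00:03Z"
  const now = new Date().toISOString()

  // The server has no localStorage, so null; the client gets "dark"
  const theme = typeof window !== 'undefined' ? localStorage.getItem('theme') : null

  return (
    <div className={theme === 'dark' ? 'dark' : 'light'}>
      <p>Current time: {now}</p>
      <p>Theme: {theme ?? 'default'}</p>
    </div>
  )
}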

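Breakpoints:

Chrome DevTools > Sources > the hydrateRoot call site in next/dist/client/app-index.js
React DevTools > Components to inspect the props and state of the component named in the hydration error
The server/client HTML diff shown in the Next.js error overlay

Profiling:

# Analyze the Next.js build (assumes @next/bundle-analyzer is wired up in next.config.js)
ANALYZE=true next build

# Inspect the server-side rendering result directly
curl -s http://localhost:3000/dashboard | prettier --parser html > server.html
# Copy document.documentElement.outerHTML from the browser into client.html, then compare
diff server.html client.html

Fix: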

'use client'

import { useState, useEffect } from 'react'

export default function Dashboard() {
  const [now, setNow] = useState<string>('')
  const [theme, setTheme] = useState<string>('default')

  // Set client-only values inside useEffect
  useEffect(() => {
    setNow(new Date().toISOString())
    setTheme(localStorage.getItem('theme') ?? 'default')
  }, [])

  return (
    <div className={theme === 'dark' ? 'dark' : 'light'}>
      {/* Placeholder initial values → hydration matches → updated after useEffect */}
      <p>Current time: {now || 'Loading...'}</p>
      <p>Theme: {theme}</p>
    </div>
  )
}

// Alternative: split out a client-only component with next/dynamic
'use client' // recent Next.js versions reject ssr: false inside Server Components

import dynamic from 'next/dynamic'

const ClientClock = dynamic(() => import('./ClientClock'), {
  ssr: false,
  loading: () => <p>Loading...</p>,
})

export default function Dashboard() {
  return (
    <div>
      <h1>Dashboard</h1>
      <ClientClock />
    </div>
  )
}

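Recurrence-prevention checklist:

Use non-deterministic values such as Date.now() and Math.random() only inside useEffect
Do not use typeof window branches in the rendered output
Identify components that need the 'use client' directive and mark them explicitly
Configure CI to treat next build warnings as errors
Make hydration errors in the console a failure condition in E2E tests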

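Case 4: JS/TS + NestJS/Express (Memory Leak & Swallowed Async Errors)

Case 4-A: Memory Leak

Combination: TypeScript / NestJS 10+ / Node.js 20+

Symptom: After deployment, memory usage increases monotonically over time. A few days later the process is OOM-killed. A restart brings it back to normal temporarily, then it grows again.

Cause: A pattern that keeps accumulating data in module scope or in a singleton service. Typical examples are adding to a cache Map with no TTL or size limit, or registering EventEmitter listeners repeatedly.

Reproduction:

// cache.service.ts: memory-leak pattern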
import { Injectable } from '@nestjs/common'

@Injectable()
export class CacheService {
  // A singleton, so it keeps growing for the lifetime of the app
  private cache = new Map<string, any>()

  set(key: string, value: any) {
    this.cache.set(key, value) // no deletion or expiry logic
  }

  get(key: string) {
    return this.cache.get(key)
  }
}

Breakpoints:

How often CacheService.set() is called and how cache.size grows
Attach Chrome DevTools using the Node.js --inspect flag

Profiling:

# Start in Node.js inspector mode
node --inspect dist/main.js

# Or via the NestJS CLI
nest start --debug

# Automated analysis with clinic.js
npx clinic doctor -- node dist/main.js
npx clinic heapprofiler -- node dist/main.js

In Chrome DevTools:

Memory tab > take a Heap Snapshot (point A)
Apply load (ab -n 10000 -c 50 http://localhost:3000/api/data)
Take a Heap Snapshot (point B)
Use the Comparison view to find the objects that accumulated

Fix:

// cache.service.ts: fixed
import { Injectable } from '@nestjs/common'

interface CacheEntry {
  value: any
  expireAt: number
}

@Injectable()
export class CacheService {
  private cache = new Map<string, CacheEntry>()
  private readonly MAX_SIZE = 10000
  private readonly DEFAULT_TTL = 5 * 60 * 1000 // 5 minutes

  set(key: string, value: any, ttl = this.DEFAULT_TTL) {
    // When a new key would exceed the size limit, evict the oldest entry
    // (a Map iterates in insertion order, so the first key is the oldest)
    if (!this.cache.has(key) && this.cache.size >= this.MAX_SIZE) {
      const firstKey = this.cache.keys().next().value
      if (firstKey !== undefined) this.cache.delete(firstKey)
    }
    this.cache.set(key, { value, expireAt: Date.now() + ttl })
  }

  get(key: string) {
    const entry = this.cache.get(key)
    if (!entry) return null
    if (Date.now() > entry.expireAt) {
      this.cache.delete(key)
      return null
    }
    return entry.value
  }
}

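Case 4-B: Swallowed Async Errors (Error Swallowing)

Symptom: For a specific API call the client never receives a response (hang), or receives a success response while the work behind it silently fails. Nothing useful shows up in the server error log.

Cause: The Promise returned by an async function has no .catch(), or an asynchronous call is not wrapped in try/catch, so the rejection never reaches the framework's error handling. With fire-and-forget calls the failure is simply lost; in a plain Express 4 handler a rejection before the response is sent can leave the request hanging.

Reproduction / Fix:

// BAD: the Promise rejection is swallowed
@Post('/process')
async processData(@Body() data: any) {
  // fire-and-forget: nothing is logged even if an error occurs
  this.heavyService.processInBackground(data);
  return { status: 'accepted' };
}

// GOOD: error handling guaranteed
// Option 1: await so the error propagates
@Post('/process')
async processData(@Body() data: any) {
  try {
    await this.heavyService.processInBackground(data);
  } catch (error) {
    this.logger.error('Background processing failed', error);
    throw new InternalServerErrorException('Processing failed');
  }
  return { status: 'accepted' };
}

// Option 2: if fire-and-forget is required, .catch() is mandatory
// (shown as a separate handler; the '/process-async' route name is illustrative)
@Post('/process-async')
async processDataAsync(@Body() data: any) {
  this.heavyService.processInBackground(data)
    .catch(err => this.logger.error('Background task failed', err));
  return { status: 'accepted' };
}

// main.ts: global watch for unhandled rejections.
// Since Node.js 15 an unhandled rejection terminates the process by default;
// registering this handler replaces that behavior, so make sure it reports loudly.
process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled Rejection at:', promise, 'reason:', reason)
  // Report to Sentry or similar
})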
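
The same rule applies to Python asyncio, where a task created and then forgotten hides its exception in much the same way. Below is a minimal sketch of a fire-and-forget helper that always attaches an error callback. spawn, process_in_background and handler are hypothetical names, and collecting exceptions in a list is only there to make the behavior observable.

```python
import asyncio
import logging

logger = logging.getLogger("background")

# Keep strong references so pending tasks are not garbage-collected.
_background_tasks = set()


def spawn(coro, errors: list) -> asyncio.Task:
    """Start a fire-and-forget task whose failure is always recorded."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)

    def _on_done(finished: asyncio.Task) -> None:
        _background_tasks.discard(finished)
        if finished.cancelled():
            return
        exc = finished.exception()  # retrieving it marks the error as handled
        if exc is not None:
            errors.append(exc)
            logger.error("Background task failed: %r", exc)

    task.add_done_callback(_on_done)
    return task


async def process_in_background(data: dict) -> None:
    await asyncio.sleep(0)
    raise ValueError("boom")  # simulated failure


async def handler(errors: list) -> dict:
    spawn(process_in_background({}), errors)
    return {"status": "accepted"}  # respond without waiting for the task


async def main():
    errors = []
    response = await handler(errors)
    await asyncio.sleep(0.05)  # give the background task time to finish
    return response, errors


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The done callback is the asyncio counterpart of the mandatory .catch() in option 2.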

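Recurrence-prevention checklist:

Apply a size limit and TTL to any Map/Set/array held by a singleton service
Add a CI stage that runs clinic doctor / clinic heapprofiler regularly
Check that every async function call has an await or a .catch()
Enable the ESLint rule @typescript-eslint/no-floating-promises
Set up a process.on('unhandledRejection') handler
Set Node.js --max-old-space-size below the Kubernetes memory limit, leaving headroom for non-heap memory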

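Case 5: Go + Gin/Fiber (Goroutine Leak & Missing Context Cancellation)

Combination: Go 1.22+ / Gin 1.9+ (or Fiber)

Symptom: runtime.NumGoroutine() keeps rising while the service is running. Memory usage climbs with it until the process is OOM-killed. The logs show no notable errors.

Cause: An HTTP handler spawns goroutines to call external APIs without passing the request's context.Context, or the goroutines wait forever with no timeout, so they accumulate.

Reproduction:

// BAD: goroutine leak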
func (h *Handler) FetchData(c *gin.Context) {
    results := make(chan string, 3)

    for _, url := range []string{urlA, urlB, urlC} {
        go func(u string) {
            // Called without a context → the goroutine lives on even after the client disconnects
            resp, err := http.Get(u)
            if err != nil {
                return // returns without sending on the channel → the receiver waits forever
            }
            defer resp.Body.Close()
            body, _ := io.ReadAll(resp.Body)
            results <- string(body)
        }(url)
    }

    // Waits for all 3 results: if even one call fails, this blocks forever
    var data []string
    for i := 0; i < 3; i++ {
        data = append(data, <-results)
    }

    c.JSON(200, gin.H{"data": data})
}

Breakpoints:

Log the value of runtime.NumGoroutine() periodically
Dump every goroutine's stack with the goroutines command in the Delve debugger

# Attach Delve to the running process
dlv attach $(pgrep myservice)
(dlv) goroutines
(dlv) goroutine <id> bt  # inspect the stack of a specific goroutine

Profiling:

// Add the pprof endpoints in main.go
import _ "net/http/pprof"

func main() {
    go func() {
        log.Println(http.ListenAndServe(":6060", nil))
    }()
    // ...
}

# Goroutine profile
go tool pprof http://localhost:6060/debug/pprof/goroutine

# 10-second CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=10

# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

Fix:

// GOOD: context propagation + timeout + errgroup
import (
    "context"
    "io"
    "net/http"
    "time"

    "github.com/gin-gonic/gin"
    "golang.org/x/sync/errgroup"
)

func (h *Handler) FetchData(c *gin.Context) {
    // Add a timeout to the request context
    ctx, cancel := context.WithTimeout(c.Request.Context(), 5*time.Second)
    defer cancel()

    g, ctx := errgroup.WithContext(ctx)
    urls := []string{urlA, urlB, urlC}
    results := make([]string, len(urls))

    for i, url := range urls {
        i, url := i, url // redundant since Go 1.22 (per-iteration loop variables), but harmless
        g.Go(func() error {
            req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
            if err != nil {
                return err
            }
            resp, err := http.DefaultClient.Do(req)
            if err != nil {
                return err
            }
            defer resp.Body.Close()
            body, err := io.ReadAll(resp.Body)
            if err != nil {
                return err
            }
            results[i] = string(body)
            return nil
        })
    }

    if err := g.Wait(); err != nil {
        c.JSON(500, gin.H{"error": err.Error()})
        return
    }

    c.JSON(200, gin.H{"data": results})
}
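
For comparison, the same fan-out-with-deadline idea can be sketched in Python asyncio, where cancellation plays the role of the cancelled context. This is an analogy rather than part of the Go fix: call_upstream, fetch_all and the delays are invented for the example, and asyncio.sleep stands in for the HTTP call.

```python
import asyncio


async def call_upstream(name: str, delay: float, cancelled: list) -> str:
    """Simulated upstream call that records whether it was cancelled."""
    try:
        await asyncio.sleep(delay)  # stands in for an HTTP request
        return name
    except asyncio.CancelledError:
        cancelled.append(name)
        raise


async def fetch_all(delays: dict, timeout: float, cancelled: list):
    """Fan out to every upstream; on deadline, cancel whatever is still running."""
    calls = [call_upstream(name, delay, cancelled) for name, delay in delays.items()]
    try:
        return await asyncio.wait_for(asyncio.gather(*calls), timeout)
    except asyncio.TimeoutError:
        return None  # deadline exceeded; unfinished calls have been cancelled


if __name__ == "__main__":
    cancelled = []
    print(asyncio.run(fetch_all({"a": 0.01, "b": 5.0}, 0.05, cancelled)), cancelled)
```

As with errgroup plus context, no worker outlives the request: the slow call is cancelled at the deadline instead of lingering.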

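Recurrence-prevention checklist:

Check that every spawned goroutine is given a context.Context
Apply context.WithTimeout or context.WithDeadline to external calls
Manage goroutine lifecycles with errgroup or an equivalent pattern
Wire the /debug/pprof/goroutine endpoint into the monitoring dashboard
Alert when the goroutine count exceeds a threshold (e.g., 10,000)
Apply the goleak library in unit tests

// Goroutine leak test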
import "go.uber.org/goleak"

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

Case 6: Java + Spring Boot (Thread Pool Starvation & DB Connection Pool Exhaustion)

Combination: Java 21+ / Spring Boot 3.2+ / HikariCP / Tomcat

Symptom: As traffic grows, response times climb sharply and then "Connection is not available, request timed out after 30000ms" errors appear in large numbers. Tomcat request threads all end up parked (TIMED_WAITING) waiting for a connection.

Cause: With the default HikariCP maximumPoolSize=10, Tomcat threads (200 by default) request DB connections concurrently. Because the thread pool is far larger than the connection pool, most threads fall into a connection wait when traffic increases.

Reproduction:

# application.yml: the default settings that cause the problem
spring:
  datasource:
    hikari:
      maximum-pool-size: 10 # default
      connection-timeout: 30000 # throw after waiting 30 s
server:
  tomcat:
    threads:
      max: 200 # default

# Reproduce with a load test
wrk -t12 -c200 -d30s http://localhost:8080/api/orders

# Or more simply
ab -n 5000 -c 200 http://localhost:8080/api/orders

Breakpoints:

com.zaxxer.hikari.pool.HikariPool:getConnection: where a connection is acquired
org.apache.tomcat.util.threads.ThreadPoolExecutor:execute: thread pool state

Profiling:

# Thread dump
jstack $(pgrep -f 'spring-boot') > thread_dump.txt
# Look for TIMED_WAITING threads with HikariPool.getConnection frames in their stack

# JFR (Java Flight Recorder) recording
jcmd $(pgrep -f 'spring-boot') JFR.start duration=60s filename=recording.jfr

# async-profiler
./asprof -d 30 -f profile.html $(pgrep -f 'spring-boot')

# Enable HikariCP debug logging (includes periodic pool statistics)
logging.level.com.zaxxer.hikari=DEBUG
logging.level.com.zaxxer.hikari.HikariConfig=DEBUG

Check the Micrometer metrics:

curl http://localhost:8080/actuator/metrics/hikaricp.connections.active
curl http://localhost:8080/actuator/metrics/hikaricp.connections.pending

Fix:

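# application.yml: revised
spring:
  datasource:
    hikari:
      maximum-pool-size: 30 # starting point: core count * 2 + spindle count, then tune under load
      minimum-idle: 10
      connection-timeout: 5000 # shortened to 5 s (fail fast)
      leak-detection-threshold: 10000 # warn if a connection is not returned within 10 s
      # Pool metrics (hikaricp.connections.*) are published through Micrometer when
      # Spring Boot Actuator is on the classpath; no Hikari property is needed for that.
server:
  tomcat:
    threads:
      max: 50 # reduced to match the connection pool
    accept-count: 100

// Asynchronous handling to shorten how long Tomcat request threads are held.
// The query still occupies a DB connection, so this does not add pool capacity.
@RestController
public class OrderController {

    private final OrderService orderService; // assumed service type
    private final Executor taskExecutor;

    public OrderController(OrderService orderService, Executor taskExecutor) {
        this.orderService = orderService;
        this.taskExecutor = taskExecutor;
    }

    @GetMapping("/api/orders")
    public CompletableFuture<List<Order>> getOrders() {
        return CompletableFuture.supplyAsync(orderService::findAll, taskExecutor);  // dedicated thread pool
    }
}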

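Recurrence-prevention checklist:

Size HikariCP maximumPoolSize from the formula pool_size = (core_count * 2) + spindle_count as a starting point, and check it against Tomcat threads.max so that request threads do not vastly outnumber connections
Enable leak-detection-threshold
Monitor hikaricp.connections.pending with Micrometer + Prometheus
Alert when pending > 0 persists for 30 seconds or longer
Include load tests in the deployment pipeline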
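
The sizing formula and the starvation effect can be put into numbers with a small back-of-the-envelope sketch. The model is deliberately simplified and is an assumption, not HikariCP behavior: every request arrives at the same instant and holds its connection for a fixed time, so requests are served in batches of pool-size. pool_size and worst_case_wait_ms are hypothetical helper names.

```python
import math


def pool_size(core_count: int, spindle_count: int) -> int:
    """Starting-point formula from the checklist: cores * 2 + spindles."""
    return core_count * 2 + spindle_count


def worst_case_wait_ms(concurrent_requests: int, pool: int, hold_ms: int) -> int:
    """Wait of the last request when requests are served in batches of `pool`.

    Simplified model: all requests arrive together and each holds a
    connection for exactly `hold_ms` milliseconds.
    """
    batches = math.ceil(concurrent_requests / pool)
    return (batches - 1) * hold_ms


if __name__ == "__main__":
    print("pool size for 4 cores, 1 spindle:", pool_size(4, 1))
    # 200 Tomcat threads against 10 connections
    print("100 ms queries:", worst_case_wait_ms(200, 10, 100), "ms")
    print("2 s queries:", worst_case_wait_ms(200, 10, 2000), "ms")
    # 50 threads against 30 connections (the revised settings)
    print("revised, 2 s queries:", worst_case_wait_ms(50, 30, 2000), "ms")
```

Under this model, 200 threads sharing 10 connections with 2-second queries push the last request past the 30-second connection-timeout, while the revised 50/30 ratio keeps the worst wait at a single query duration.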

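Case 7: Java + Spring Data JPA (N+1, Lazy Loading, Flush Timing)

Combination: Java 21+ / Spring Data JPA / Hibernate 6.x

Case 7-A: N+1 & Lazy Loading

Symptom: A single API call produces dozens to hundreds of SELECTs in the Hibernate SQL log. Response time grows from hundreds of milliseconds to several seconds.

Cause: The default fetch strategy for @OneToMany is LAZY, so iterating over entities and touching the associated collection triggers a separate SELECT on each access.

Reproduction: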

@Entity
public class Team {
    @Id @GeneratedValue
    private Long id;
    private String name;

    @OneToMany(mappedBy = "team", fetch = FetchType.LAZY)
    private List<Member> members;
}

@Entity
public class Member {
    @Id @GeneratedValue
    private Long id;
    private String name;

    @ManyToOne(fetch = FetchType.LAZY)
    private Team team;
}

// N+1 occurs here
@GetMapping("/teams")
public List<TeamDto> getTeams() {
    List<Team> teams = teamRepository.findAll(); // 1 query
    return teams.stream().map(team -> new TeamDto(
        team.getName(),
        team.getMembers().size()  // 1 more query per team (N queries)
    )).toList();
}

Profiling:

# application.yml
logging:
  level:
    org.hibernate.SQL: DEBUG
    org.hibernate.orm.jdbc.bind: TRACE
spring:
  jpa:
    properties:
      hibernate:
        generate_statistics: true

# Use p6spy to see the actual SQL including bound parameters
# build.gradle
# implementation 'com.github.gavlyukovskiy:p6spy-spring-boot-starter:1.9.0'

Fix:

// Option 1: JPQL fetch join
// LEFT JOIN keeps teams that have no members; DISTINCT guards against duplicate roots
@Query("SELECT DISTINCT t FROM Team t LEFT JOIN FETCH t.members")
List<Team> findAllWithMembers();

// Option 2: @EntityGraph
@EntityGraph(attributePaths = {"members"})
List<Team> findAll();

// Option 3: projection DTO (the most efficient)
@Query("""
    SELECT new com.example.dto.TeamDto(t.name, SIZE(t.members))
    FROM Team t
    """)
List<TeamDto> findAllTeamSummaries();

Case 7-B: Flush Timing Problem

Symptom: Right after save(), reading through SQL that bypasses the persistence context (a native query, JdbcTemplate) may not see the data that was just saved, depending on the flush mode. Or an unexpected UPDATE runs at the end of a @Transactional method.

Cause: Hibernate uses dirty checking and flushes changes at transaction commit. save() may not INSERT immediately (it only stores the entity in the persistence context). In addition, modifying an entity field causes an automatic UPDATE at the end of the transaction even without a separate save().

Reproduction / Fix:

// BAD: relies on an implicit flush before the native query
@Transactional
public void processOrder(Long orderId) {
    Order order = orderRepository.findById(orderId).orElseThrow();
    order.setStatus("PROCESSED");
    orderRepository.save(order);  // not necessarily flushed to the DB yet

    // SQL that bypasses the persistence context can read the previous state
    long count = ((Number) em.createNativeQuery("SELECT count(*) FROM orders WHERE status = 'PROCESSED'")
                  .getSingleResult()).longValue();
}

// GOOD: explicit flush
@Transactional
public void processOrder(Long orderId) {
    Order order = orderRepository.findById(orderId).orElseThrow();
    order.setStatus("PROCESSED");
    orderRepository.saveAndFlush(order);  // written to the DB immediately

    long count = ((Number) em.createNativeQuery("SELECT count(*) FROM orders WHERE status = 'PROCESSED'")
                  .getSingleResult()).longValue();
}

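Recurrence-prevention checklist:

Check that every findAll()-style method applies a fetch join or @EntityGraph
Keep hibernate.generate_statistics=true permanently enabled in staging
Fail the test when a single API call exceeds a query-count limit (e.g., 10)
Call entityManager.flush() before using a native query
Be aware that entity setters cause automatic UPDATEs through dirty checking
Document a DTO-projection-first policy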

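Case 8: React + API Backend (Race Condition & Stale Response Overwrite)

Combination: React 18+ / any API backend (REST/GraphQL)

Symptom: Typing quickly into a search box leaves an intermediate result on screen instead of the final one. For example, you typed "react" but the results for "rea" remain displayed. Or clicking quickly through items in a list briefly shows the previous item's details.

Cause: When several asynchronous requests are in flight at once, a response to an earlier request that arrives late overwrites the result of a later request (race condition). Network latency is independent of request order.

Reproduction:

// BAD: race condition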
function SearchResults() {
  const [query, setQuery] = useState('')
  const [results, setResults] = useState([])

  useEffect(() => {
    if (!query) return

    // The previous request is never cancelled,
    // so a slow response can overwrite a fast one
    fetch(`/api/search?q=${query}`)
      .then((res) => res.json())
      .then((data) => setResults(data))
  }, [query])

  return (
    <div>
      <input value={query} onChange={(e) => setQuery(e.target.value)} placeholder="Enter a search term" />
      <ul>
        {results.map((r) => (
          <li key={r.id}>{r.title}</li>
        ))}
      </ul>
    </div>
  )
}

Breakpoints:

Chrome DevTools > Network tab: compare the order requests were sent with the order responses arrived
React DevTools > Profiler: check the timing of state updates
A console.log or conditional breakpoint right before the setResults call

Profiling:

// For debugging: log request/response order
useEffect(() => {
  if (!query) return
  const requestId = Date.now()
  console.log(`[REQ ${requestId}] query="${query}"`)

  fetch(`/api/search?q=${query}`)
    .then((res) => res.json())
    .then((data) => {
      console.log(`[RES ${requestId}] query="${query}" results=${data.length}`)
      setResults(data)
    })
}, [query])

// Example output:
// [REQ 1001] query="r"
// [REQ 1002] query="re"
// [REQ 1003] query="rea"
// [RES 1003] query="rea" results=15    ← arrives first
// [RES 1001] query="r" results=100     ← arrives later and overwrites the "rea" results!
// [RES 1002] query="re" results=50

Fix:

// GOOD: cancel the previous request with AbortController
import { useEffect, useState } from 'react'

function SearchResults() {
  const [query, setQuery] = useState('')
  const [results, setResults] = useState([])

  useEffect(() => {
    if (!query) return

    const controller = new AbortController()

    // encodeURIComponent keeps characters such as & or # from corrupting the query string
    fetch(`/api/search?q=${encodeURIComponent(query)}`, { signal: controller.signal })
      .then((res) => res.json())
      .then((data) => setResults(data))
      .catch((err) => {
        if (err.name !== 'AbortError') {
          console.error('Search failed:', err)
        }
      })

    // cleanup: cancel the previous request the next time query changes
    return () => controller.abort()
  }, [query])

  return (
    <div>
      <input value={query} onChange={(e) => setQuery(e.target.value)} placeholder="Enter a search term" />
      <ul>
        {results.map((r) => (
          <li key={r.id}>{r.title}</li>
        ))}
      </ul>
    </div>
  )
}

// Alternative: use React Query (TanStack Query), which handles this automatically
import { useState } from 'react'
import { useQuery } from '@tanstack/react-query'

function SearchResults() {
  const [query, setQuery] = useState('')

  const { data: results = [] } = useQuery({
    queryKey: ['search', query],
    queryFn: ({ signal }) =>
      fetch(`/api/search?q=${encodeURIComponent(query)}`, { signal }).then((r) => r.json()),
    enabled: !!query,
  })

  return (
    <div>
      <input value={query} onChange={(e) => setQuery(e.target.value)} placeholder="Enter a search term" />
      <ul>
        {results.map((r: any) => (
          <li key={r.id}>{r.title}</li>
        ))}
      </ul>
    </div>
  )
}
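
When a request cannot be cancelled, a related technique is to let it finish and ignore its result if a newer request has been issued in the meantime. The sketch below shows that "latest request wins" guard in Python asyncio rather than React. It is a different mechanism from AbortController, and LatestOnly, fake_fetch and the delays are invented for the illustration.

```python
import asyncio


class LatestOnly:
    """Applies a response only if no newer request was started meanwhile."""

    def __init__(self):
        self._seq = 0
        self.results = None

    async def search(self, query: str, fetch) -> bool:
        self._seq += 1
        my_seq = self._seq  # remember which request this is
        data = await fetch(query)
        if my_seq != self._seq:
            return False  # stale response: a newer request exists, drop it
        self.results = data
        return True


async def fake_fetch(query: str) -> str:
    # Simulated latencies: earlier (shorter) queries respond more slowly
    delays = {"r": 0.06, "re": 0.04, "rea": 0.01}
    await asyncio.sleep(delays[query])
    return f"results for {query}"


async def main():
    state = LatestOnly()
    applied = await asyncio.gather(
        *(state.search(q, fake_fetch) for q in ["r", "re", "rea"])
    )
    return state.results, applied


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The stale responses still consume network and server time, which is why cancelling the request is preferable whenever the transport supports it.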

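Recurrence-prevention checklist:

Apply AbortController to every fetch inside useEffect
Confirm that the useEffect cleanup function calls abort()
Consider adopting a data-fetching library such as TanStack Query or SWR
Apply debounce (300ms) to search/autocomplete
Include rapid consecutive input scenarios in E2E tests
Test manually under network throttling (Slow 3G)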

Quick diagnosis guide by symptom

The table below lets you look up, starting from the symptom, which case to suspect first.

Symptom	Suspected case	First command to check
All API responses slow down at once	Case 1 (event loop blocking)	`py-spy top --pid <PID>`
A single API is slow + query explosion	Cases 2, 7 (N+1 query)	Enable `DEBUG` SQL logging
Screen flicker / broken layout	Case 3 (hydration mismatch)	Check errors in the Chrome console
Memory grows monotonically → OOM	Case 4 (memory leak), Case 5 (goroutine leak)	Heap Snapshot / `pprof goroutine`
Intermittent data inconsistency	Case 2-B (transaction boundary)	Check the scope of `transaction.atomic()`
Connection timeouts in large numbers	Case 6 (connection pool exhaustion)	`hikaricp.connections.pending` metric
Search results out of order	Case 8 (race condition)	Check response order in the Network tab
Request hangs with no error log	Case 4-B (async error swallowing)	Add an `unhandledRejection` handler
Data not visible after `save()`	Case 7-B (flush timing)	Switch to `saveAndFlush()` and retry
Goroutine count grows without bound	Case 5 (goroutine leak)	`go tool pprof .../goroutine`

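Consolidated recurrence-prevention checklist

These are the common checks that cut across all the cases. Use them in periodic reviews or when setting up a new project.

Code level

No synchronous blocking calls inside async functions (Python asyncio, Node.js)
Error handling on every asynchronous call (try/catch, .catch())
N+1 prevention patterns applied when using an ORM (select_related, JOIN FETCH, @EntityGraph)
Explicit transaction boundaries: always atomic/transactional when two or more entities change
Pass a context/timeout whenever a goroutine or thread is spawned
AbortController applied to frontend fetch calls
Size limit and TTL configured on singleton caches

Infrastructure/configuration level

Verify the ratio between connection pool size and thread pool size
Configure access control for profiling endpoints (/debug/pprof, /actuator)
Set alerts based on memory, CPU, goroutine count, and connection count
Include load tests in the CI/CD pipeline

Monitoring level

Integrate an APM tool (Datadog, New Relic, Grafana Tempo)
Build dashboards for query counts and response-time distribution (p50/p95/p99)
Set alerts on error rate, OOM kills, and restart counts
Compare periodic profiling results (before and after deployment)

Test level

Check for goroutine leaks in unit tests (goleak)
Assert a query-count upper bound in integration tests
Cover race-condition scenarios in E2E tests (rapid consecutive input, network throttling)
Enable static analysis such as ESLint no-floating-promises and runtime checks such as Python's PYTHONASYNCIODEBUG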

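Wrapping up

Incidents that arise from a language × framework combination are hard to diagnose if you only know one side. Knowing Python well does not help you spot event loop blocking if you do not know FastAPI's async execution model, and knowing Spring Boot well does not help you pin down a data consistency problem if you do not know Hibernate's flush strategy.

There are three core principles:

Reproduce first: debugging by guesswork is a waste of time. Use the reproduction code in the cases above to reproduce the symptom reliably on your machine, then find the cause.
Decide from profiling data: pick the tool that fits the combination (py-spy, pprof, JFR, clinic.js, Chrome DevTools) and confirm with numbers.
Automate recurrence prevention: fold the checklists into CI, set up monitoring alerts, and enforce lint rules. Any defense that depends on human attention will eventually be breached.

I recommend sharing the eight cases and checklists above on your team wiki and using them as the first document to consult while on call.

For basic debugging techniques for each language, framework, and IDE, see the earlier articles in the series. For problems that cannot be reproduced locally, see the remote debugging guide.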