Circuit Breaker and Resilience Patterns Practical Guide: Resilience4j, Istio, Fault Isolation Strategies

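Introduction
1. Circuit Breaker Pattern Principles
- 1.1 Three States: Closed, Open, Half-Open
- 1.2 Sliding Window Types
2. Java/Spring Circuit Breaker Implementation with Resilience4j
3. Istio Service Mesh Level Circuit Breaker
- 3.1 DestinationRule Configuration
- 3.2 Istio vs Application-Level Circuit Breaker
4. Bulkhead Pattern: Fault Isolation Strategy
- 4.1 Semaphore Bulkhead vs ThreadPool Bulkhead
- 4.2 Per-Service Bulkhead Isolation Example
5. Retry + Timeout + Rate Limiter Combination Patterns
- 5.1 Precautions for Pattern Combinations
- 5.2 Programmatic API Composition
6. Fallback Strategy Design
- 6.1 Fallback Strategy Types
- 6.2 Multi-Level Fallback Implementation
7. Migration from Netflix Hystrix to Resilience4j
- 7.1 Resilience4j vs Hystrix vs Istio Comparison
- 7.2 Migration Core Checklist
8. Failure Case Analysis and Recovery Procedures
9. Operational Monitoring and Metrics
10. Troubleshooting
11. Practical Checklist
Conclusion
References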

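Introduction

In a microservices architecture, calls between services are inherently unreliable. Network latency, timeouts, and downstream failures happen routinely, and without proper defenses a single failing service can propagate through the whole system as a cascading failure. A representative example: during the 2024 Black Friday period, slow responses from the product recommendation service of a large e-commerce platform pushed load times for the entire product listing page past 20 seconds.

Resilience patterns emerged to address this problem. Starting with the circuit breaker pattern that Michael Nygard introduced in his 2007 book Release It!, a set of patterns including Bulkhead, Retry, Rate Limiter, Timeout, and Fallback has been systematized. Netflix Hystrix was the first widely adopted implementation, but since Hystrix entered maintenance mode in 2018, Resilience4j has become the de facto standard in the Java/Spring ecosystem, while Istio provides infrastructure-level circuit breaking in service mesh environments.

This article covers how circuit breakers work, practical implementation with Resilience4j and Istio, the design of combined resilience patterns, Hystrix migration, operational monitoring, and failure case analysis.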

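1. Circuit Breaker Pattern Principles

The circuit breaker pattern takes its name from the electrical circuit breaker: it detects failing calls to a remote service and automatically blocks further calls, preventing a cascading failure across the system.

1.1 Three States: Closed, Open, Half-Open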

              Failure rate >= threshold (failureRateThreshold)
    +---------------------------------------------+
    |                                             |
    v                                             |
+----------+                                 +----------+
|   OPEN   |                                 |  CLOSED  |
| (blocked)|                                 | (normal) |
+----------+                                 +----------+
     |                                            ^
     | waitDurationInOpenState elapsed            |
     v                                            |
+--------------+  Trial failure rate < threshold  |
|  HALF-OPEN   | ---------------------------------+
| (trial mode) |
+--------------+
     |
     | Trial failure rate >= threshold
     v
+----------+
|   OPEN   |  (blocked again)
+----------+

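The behavior of each state is as follows.

State	Behavior	Transition condition
CLOSED	Passes every request through and records each outcome in the sliding window	Moves to OPEN when the failure rate reaches or exceeds the threshold
OPEN	Rejects every request immediately with CallNotPermittedException	Moves to HALF-OPEN after waitDurationInOpenState elapses
HALF-OPEN	Allows only a limited number of trial calls	Moves to CLOSED or OPEN depending on the failure rate of the trial calls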
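
The transitions in the table can be sketched as a small state machine. This is a minimal, single-threaded illustration and not Resilience4j's implementation: the class name SimpleCircuitBreaker and its methods are hypothetical, time is passed in explicitly as milliseconds, and the failure rate is computed over all calls since the last transition rather than over a sliding window.

```java
import java.time.Duration;

/** Minimal, single-threaded sketch of the CLOSED / OPEN / HALF_OPEN state machine. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int minimumNumberOfCalls;     // calls needed before the rate is evaluated
    private final double failureRateThreshold;  // percent, e.g. 50.0
    private final long waitInOpenMillis;        // plays the role of waitDurationInOpenState
    private final int permittedHalfOpenCalls;   // trial calls evaluated in HALF_OPEN

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAtMillis;

    SimpleCircuitBreaker(int minimumNumberOfCalls, double failureRateThreshold,
                         Duration waitDurationInOpenState, int permittedHalfOpenCalls) {
        this.minimumNumberOfCalls = minimumNumberOfCalls;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpenMillis = waitDurationInOpenState.toMillis();
        this.permittedHalfOpenCalls = permittedHalfOpenCalls;
    }

    State state() {
        return state;
    }

    /** Returns true if a call may proceed at time nowMillis. */
    boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= waitInOpenMillis) {
            transitionTo(State.HALF_OPEN, nowMillis);
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed through. */
    void record(boolean success, long nowMillis) {
        calls++;
        if (!success) {
            failures++;
        }
        int required = (state == State.HALF_OPEN) ? permittedHalfOpenCalls : minimumNumberOfCalls;
        if (calls < required) {
            return; // not enough data yet to judge
        }
        double failureRate = 100.0 * failures / calls;
        if (failureRate >= failureRateThreshold) {
            transitionTo(State.OPEN, nowMillis);
        } else if (state == State.HALF_OPEN) {
            transitionTo(State.CLOSED, nowMillis);
        }
    }

    private void transitionTo(State next, long nowMillis) {
        state = next;
        calls = 0;
        failures = 0;
        if (next == State.OPEN) {
            openedAtMillis = nowMillis;
        }
    }
}
```

A caller would check tryAcquire before each remote call and pass the outcome to record afterwards; a rejected call corresponds to the CallNotPermittedException case in the table.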

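1.2 Sliding Window Types

Resilience4j supports two sliding window types.

COUNT_BASED: computes the failure rate over the outcomes of the last N calls. Suited to services with steady traffic.
TIME_BASED: computes the failure rate over the calls made in the last N seconds. Suited to services whose traffic fluctuates widely.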
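
As an illustration of the COUNT_BASED idea, the sketch below keeps the last N outcomes in a ring buffer. It is a simplified stand-in, not the library's code: the class CountBasedWindow is hypothetical, and returning -1 while too few calls have been recorded is a convention chosen for this sketch.

```java
/** Sketch of a COUNT_BASED window: a ring buffer holding the last N call outcomes. */
class CountBasedWindow {
    private final boolean[] failed;          // ring buffer, true = failed call
    private final int minimumNumberOfCalls;
    private int recorded;                    // total outcomes recorded so far
    private int failures;                    // failures currently inside the window

    CountBasedWindow(int slidingWindowSize, int minimumNumberOfCalls) {
        this.failed = new boolean[slidingWindowSize];
        this.minimumNumberOfCalls = minimumNumberOfCalls;
    }

    void record(boolean callFailed) {
        int slot = recorded % failed.length;
        if (recorded >= failed.length && failed[slot]) {
            failures--; // the oldest outcome leaves the window
        }
        failed[slot] = callFailed;
        if (callFailed) {
            failures++;
        }
        recorded++;
    }

    /** Failure rate in percent, or -1 while fewer than minimumNumberOfCalls are recorded. */
    double failureRate() {
        int size = Math.min(recorded, failed.length);
        if (size < minimumNumberOfCalls) {
            return -1.0;
        }
        return 100.0 * failures / size;
    }
}
```

With a window of 10 and a minimum of 5 calls, four failures in a row produce no verdict yet; the fifth makes the rate available, and older failures stop counting once ten newer outcomes have pushed them out.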

2. Java/Spring Circuit Breaker Implementation with Resilience4j

2.1 Dependency Setup

// build.gradle (Spring Boot 3.x)
dependencies {
    implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.2.0'
    implementation 'io.github.resilience4j:resilience4j-micrometer:2.2.0'
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    implementation 'org.springframework.boot:spring-boot-starter-aop'
}

2.2 application.yml Configuration

resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 80
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.BusinessException

  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException

  bulkhead:
    instances:
      paymentService:
        maxConcurrentCalls: 20
        maxWaitDuration: 500ms

  timelimiter:
    instances:
      paymentService:
        timeoutDuration: 3s
        cancelRunningFuture: true

  ratelimiter:
    instances:
      paymentService:
        limitRefreshPeriod: 1s
        limitForPeriod: 50
        timeoutDuration: 0s

2.3 Annotation-Based Implementation

@Service
@Slf4j
public class PaymentService {

    private final PaymentGatewayClient paymentGatewayClient;
    private final PaymentCacheService paymentCacheService;

    public PaymentService(PaymentGatewayClient paymentGatewayClient,
                          PaymentCacheService paymentCacheService) {
        this.paymentGatewayClient = paymentGatewayClient;
        this.paymentCacheService = paymentCacheService;
    }

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    @Bulkhead(name = "paymentService")
    @Retry(name = "paymentService")
    @TimeLimiter(name = "paymentService")
    public CompletableFuture<PaymentResponse> processPayment(PaymentRequest request) {
        return CompletableFuture.supplyAsync(() -> {
            log.info("Processing payment: orderId={}", request.getOrderId());
            return paymentGatewayClient.charge(request);
        });
    }

    // Fallback method: called when the circuit is OPEN or an exception occurs
    private CompletableFuture<PaymentResponse> paymentFallback(
            PaymentRequest request, Throwable throwable) {
        log.warn("Payment service fallback triggered: orderId={}, reason={}",
                request.getOrderId(), throwable.getMessage());

        if (throwable instanceof CallNotPermittedException) {
            // Circuit is open: save to a queue for asynchronous processing
            return CompletableFuture.completedFuture(
                PaymentResponse.queued(request.getOrderId(),
                    "Payment service temporarily unavailable. Order has been queued.")
            );
        }

        // Other exceptions: try to return a cached result
        return CompletableFuture.completedFuture(
            paymentCacheService.getCachedResponse(request.getOrderId())
                .orElse(PaymentResponse.error(request.getOrderId(),
                    "An error occurred during payment processing. Please try again later."))
        );
    }
}

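Aspect execution order: Resilience4j annotations are applied in the following nested order.

Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) )

Retry is the outermost layer and is applied last, so when the CircuitBreaker throws an exception, Retry performs the retry. One consequence: a fallbackMethod declared on an inner annotation such as @CircuitBreaker handles the exception before it reaches Retry, so Retry sees a normal result and does not retry. The order can be customized through each module's *AspectOrder property.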
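
The nesting can be made concrete with plain function wrappers. The sketch below does not use Resilience4j at all; NestingOrderDemo and its layer helper are hypothetical names, and each layer only records that it was entered, which is enough to show that the outermost wrapper runs first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Sketch: wraps a supplier in named layers to show which layer is entered first. */
class NestingOrderDemo {
    /** Returns a supplier that records entry into the named layer, then calls the inner one. */
    static <T> Supplier<T> layer(String name, Supplier<T> inner, List<String> trace) {
        return () -> {
            trace.add(name);
            return inner.get();
        };
    }

    /** Builds Retry(CircuitBreaker(RateLimiter(TimeLimiter(Bulkhead(Function))))) and runs it once. */
    static List<String> entryOrder() {
        List<String> trace = new ArrayList<>();
        Supplier<String> decorated = () -> {
            trace.add("Function");
            return "ok";
        };
        // Wrap from the innermost layer outward, so Retry ends up outermost.
        for (String name : List.of("Bulkhead", "TimeLimiter", "RateLimiter", "CircuitBreaker", "Retry")) {
            decorated = layer(name, decorated, trace);
        }
        decorated.get();
        return trace;
    }
}
```

Because Retry is entered first, it is also the last layer to see the result or exception on the way back out, which is why anything an inner layer swallows never reaches it.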

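3. Istio Service Mesh Level Circuit Breaker

Istio applies circuit breaking at the infrastructure level, with no changes to application code. It combines connection pool limits with the Envoy proxy's Outlier Detection feature, which automatically removes unhealthy instances from the load balancing pool.

3.1 DestinationRule Configuration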

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-circuit-breaker
  namespace: production
spec:
  host: payment-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
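        maxConnections: 100 # Maximum TCP connections
        connectTimeout: 3s # TCP connection timeout
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 50 # Max requests queued while waiting for a connection
        http2MaxRequests: 100 # Max active requests to the destination (HTTP/1.1 and HTTP/2)
        maxRequestsPerConnection: 10 # Max requests per connection
        maxRetries: 3 # Max retries outstanding to the destination at any one time
    outlierDetection:
      consecutive5xxErrors: 5 # Eject a host after 5 consecutive 5xx errors
      interval: 10s # Ejection analysis interval
      baseEjectionTime: 30s # Minimum ejection duration
      maxEjectionPercent: 50 # Eject at most 50% of hosts
      minHealthPercent: 30 # Outlier detection applies only while at least 30% of hosts are healthy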

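3.2 Istio vs Application-Level Circuit Breaker

Aspect	Istio (infrastructure level)	Resilience4j (app level)
How it is applied	YAML configuration, no code changes	Annotations or programmatic API
Isolation unit	Ejection per instance (Pod)	Blocking per method or service
Fallback	Not supported (returns 503)	Custom fallback methods
Language independence	Any language or framework	JVM only (Java/Kotlin)
Fine-grained control	Limited	Very fine-grained
Monitoring	Kiali, Grafana integration	Micrometer, Actuator integration
Recommended for	Polyglot environments, baseline protection	Cases that need business logic integration

In practice, using both together is recommended: Istio isolates unhealthy instances at the infrastructure level, while Resilience4j handles fine-grained fallback and retry at the application level.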

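4. Bulkhead Pattern: Fault Isolation Strategy

The Bulkhead pattern takes its name from the bulkheads of a ship, which isolate compartments so that flooding in one does not affect the others.

4.1 Semaphore Bulkhead vs ThreadPool Bulkhead

Aspect	Semaphore Bulkhead	ThreadPool Bulkhead
Isolation	Limits concurrent calls with a semaphore	Executes in a separate thread pool
Calling thread	Runs directly on the request thread	Runs asynchronously on a separate thread
Return type	Both sync and async supported	CompletableFuture only
Overhead	Low	Thread pool management cost
Recommended for	General concurrency limiting	Cases that need full thread isolation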
# ThreadPool Bulkhead configuration
resilience4j:
  thread-pool-bulkhead:
    instances:
      inventoryService:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 100ms
        writableStackTraceEnabled: true

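4.2 Per-Service Bulkhead Isolation Example

@Service
public class OrderOrchestrator {

    // Hypothetical client types; the original snippet did not declare these fields.
    private final PaymentClient paymentClient;
    private final InventoryClient inventoryClient;
    private final NotificationClient notificationClient;

    public OrderOrchestrator(PaymentClient paymentClient,
                             InventoryClient inventoryClient,
                             NotificationClient notificationClient) {
        this.paymentClient = paymentClient;
        this.inventoryClient = inventoryClient;
        this.notificationClient = notificationClient;
    }

    @Bulkhead(name = "paymentService", type = Bulkhead.Type.SEMAPHORE)
    public PaymentResult processPayment(Order order) {
        return paymentClient.charge(order.getPaymentInfo());
    }

    @Bulkhead(name = "inventoryService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<InventoryResult> reserveInventory(Order order) {
        // The THREADPOOL bulkhead already runs this method on its own thread pool, so the
        // result is returned as an already-completed future; wrapping the call in
        // supplyAsync would move the blocking work onto the shared common ForkJoinPool.
        return CompletableFuture.completedFuture(
            inventoryClient.reserve(order.getItems()));
    }

    @Bulkhead(name = "notificationService", type = Bulkhead.Type.SEMAPHORE)
    public void sendNotification(Order order) {
        notificationClient.send(order.getUserId(), "Your order has been received.");
    }
}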

With a separate Bulkhead per service, a slow inventory service does not eat into the concurrent call capacity of the payment service.
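
The semaphore variant can be sketched with java.util.concurrent.Semaphore. This is an illustrative stand-in rather than Resilience4j's Bulkhead: the class SemaphoreBulkhead is hypothetical, and it signals a full bulkhead with IllegalStateException where the library would throw BulkheadFullException.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Sketch of a semaphore bulkhead: caps concurrent calls to one downstream service. */
class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls, true);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs the call on the caller's thread, or fails when no permit frees up in time. */
    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a bulkhead permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead is full");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each downstream service its own instance is what provides the isolation: exhausting the permits of one bulkhead leaves the permits of the others untouched.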

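5. Retry + Timeout + Rate Limiter Combination Patterns

Resilience patterns are most effective in combination rather than alone. A poor combination, however, can make an outage worse, so care is needed.

5.1 Precautions for Pattern Combinations

Retry + CircuitBreaker: Retry on its own adds load to a service that is already failing. Always pair it with a CircuitBreaker so that retries stop once the failure rate crosses the threshold.
Timeout + Retry: worst-case total time = timeout * maxAttempts + the sum of the backoff waits between attempts. With a 3-second timeout and maxAttempts of 3 (the first call plus two retries), the calls alone can take 9 seconds; with the 500 ms base wait and multiplier of 2 configured in section 2.2, the two waits add 0.5 s + 1 s, for about 10.5 seconds in total. Design against the user-facing response time SLA.
Rate Limiter + CircuitBreaker: a two-layer defense in which the Rate Limiter keeps calls within the external API's rate limit while the CircuitBreaker handles failures of the API itself.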
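
The Timeout + Retry arithmetic can be checked with a few lines of code. The helper below is a hypothetical sketch (RetryBudget is not a Resilience4j class); it assumes the timeout applies to each attempt separately and that the wait grows by a fixed multiplier after every retry.

```java
import java.time.Duration;

/** Sketch: worst-case latency of a call wrapped in a per-attempt timeout plus retry. */
class RetryBudget {
    static Duration worstCase(Duration timeout, int maxAttempts,
                              Duration initialWait, double multiplier) {
        // Every attempt may run until its timeout.
        Duration total = timeout.multipliedBy(maxAttempts);
        double waitMillis = initialWait.toMillis();
        // There is one wait between each pair of consecutive attempts.
        for (int retry = 1; retry < maxAttempts; retry++) {
            total = total.plusMillis(Math.round(waitMillis));
            waitMillis *= multiplier;
        }
        return total;
    }
}
```

For the values used in this article (3 s timeout, 3 attempts, 500 ms initial wait, multiplier 2) the result is 9 s of attempts plus 1.5 s of waiting; jitter, if enabled, moves the waiting part up or down.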

5.2 Programmatic API Composition

@Configuration
public class ResilienceConfig {

    @Bean
    public Supplier<String> resilientSupplier(
            ExternalApiClient externalApiClient, // hypothetical client type; not declared in the original snippet
            CircuitBreakerRegistry circuitBreakerRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry,
            RateLimiterRegistry rateLimiterRegistry) {

        CircuitBreaker circuitBreaker = circuitBreakerRegistry
                .circuitBreaker("externalApi");
        Retry retry = retryRegistry.retry("externalApi");
        Bulkhead bulkhead = bulkheadRegistry.bulkhead("externalApi");
        RateLimiter rateLimiter = rateLimiterRegistry
                .rateLimiter("externalApi");

        // Decorator chaining: applied from the inside out
        Supplier<String> decoratedSupplier = Decorators
                .ofSupplier(() -> externalApiClient.call())
                .withBulkhead(bulkhead)           // 1. Limit concurrent calls
                .withRateLimiter(rateLimiter)      // 2. Limit calls per refresh period
                .withCircuitBreaker(circuitBreaker) // 3. Detect failures and block calls
                .withRetry(retry)                   // 4. Retry
                .withFallback(Arrays.asList(
                    CallNotPermittedException.class,
                    BulkheadFullException.class,
                    RequestNotPermitted.class),
                    throwable -> "Fallback Response")
                .decorate();

        return decoratedSupplier;
    }
}

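6. Fallback Strategy Design

A fallback is the alternative response served when the original service fails. The point is not simply to return an error message but to implement graceful degradation that preserves as much of the user experience as possible.

6.1 Fallback Strategy Types

Strategy	Description	Example use
Cache fallback	Return the last successful response from a cache	Product recommendations, exchange rates, weather data
Default-value fallback	Return a predefined default	Configuration service, feature flags
Queue fallback	Store the request in a queue and process it later	Payment processing, order intake
Alternative-service fallback	Route to a secondary service	CDN redundancy, multi-region
Empty-response fallback	Return an empty result instead of an error	Search autocomplete, recommendation widget
Manual-switch fallback	An operator manually enables alternative logic	Critical business logic

6.2 Multi-Level Fallback Implementation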

@Service
@Slf4j
public class ProductRecommendationService {

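    private final RecommendationEngine primaryEngine;
    private final RecommendationEngine secondaryEngine;
    private final RedisTemplate<String, List<Product>> cache;

    // Constructor added so the final fields are initialized; with two beans of the
    // same type, the engines would also need qualifiers in a real application.
    public ProductRecommendationService(RecommendationEngine primaryEngine,
                                        RecommendationEngine secondaryEngine,
                                        RedisTemplate<String, List<Product>> cache) {
        this.primaryEngine = primaryEngine;
        this.secondaryEngine = secondaryEngine;
        this.cache = cache;
    }

    @CircuitBreaker(name = "recommendation",
                    fallbackMethod = "secondaryRecommendation")
    public List<Product> getRecommendations(String userId) {
        return primaryEngine.recommend(userId);
    }

    // Level 1 fallback: use the secondary recommendation engine
    private List<Product> secondaryRecommendation(
            String userId, Throwable t) {
        log.warn("Primary recommendation engine failed, switching to secondary: {}", t.getMessage());
        try {
            return secondaryEngine.recommend(userId);
        } catch (Exception e) {
            return cachedRecommendation(userId, e);
        }
    }

    // Level 2 fallback: return cached recommendations
    private List<Product> cachedRecommendation(
            String userId, Throwable t) {
        log.warn("Secondary recommendation engine also failed, checking cache: {}", t.getMessage());
        List<Product> cached = cache.opsForValue()
                .get("recommendation:" + userId);
        if (cached != null && !cached.isEmpty()) {
            return cached;
        }
        return defaultRecommendation(userId, t);
    }

    // Level 3 fallback: return a default list of popular products
    private List<Product> defaultRecommendation(
            String userId, Throwable t) {
        log.warn("No cache entry either, returning default popular products");
        return List.of(
            Product.popular("BEST-001", "Bestseller Product A"),
            Product.popular("BEST-002", "Bestseller Product B"),
            Product.popular("BEST-003", "Bestseller Product C")
        );
    }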
}

7. Migration from Netflix Hystrix to Resilience4j

Netflix Hystrix entered maintenance mode in 2018, and official support was dropped as of Spring Cloud 2020.0.0. Projects still on Hystrix need to migrate to Resilience4j.

7.1 Resilience4j vs Hystrix vs Istio Comparison

Item	Hystrix	Resilience4j	Istio
Maintenance status	Maintenance mode (since 2018)	Actively maintained	Actively maintained
Design philosophy	OOP (subclass HystrixCommand)	Functional (decorators)	Infrastructure-based (sidecar proxy)
Modules	All-in-one	Pick only the modules you need	Full service mesh
Spring Boot integration	Spring Cloud Netflix	Native Spring Boot starter	Requires a Kubernetes environment
Isolation	Thread pool / semaphore	Semaphore / thread pool	Connection pool / outlier detection
Configuration	Java config / properties	YAML / Java config / annotations	Kubernetes CRD (YAML)
Reactive support	Limited (RxJava 1)	Full (Reactor, RxJava 2/3)	Not applicable
Metrics	Hystrix Dashboard	Micrometer / Prometheus	Prometheus / Kiali
Fallback	HystrixCommand.getFallback()	fallbackMethod annotation attribute	Not supported (returns 503)
Learning curve	Moderate	Low	High (requires understanding the service mesh)

7.2 Migration Core Checklist

Step 1: Replace dependencies

// Remove
// implementation 'org.springframework.cloud:spring-cloud-starter-netflix-hystrix'

// Add
implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.2.0'
implementation 'io.github.resilience4j:resilience4j-micrometer:2.2.0'

Step 2: Code conversion patterns

Hystrix	Resilience4j
`@HystrixCommand(fallbackMethod = "fallback")`	`@CircuitBreaker(name = "svc", fallbackMethod = "fallback")`
`class MyCommand extends HystrixCommand<T>`	`Decorators.ofSupplier(() -> ...).withCircuitBreaker(cb)`
`@HystrixProperty(name = "...")`	`application.yml` settings
`HystrixDashboard`	Micrometer + Grafana

Step 3: Configuration migration

Hystrix's circuitBreaker.requestVolumeThreshold corresponds to Resilience4j's minimumNumberOfCalls, and circuitBreaker.errorThresholdPercentage corresponds to failureRateThreshold. circuitBreaker.sleepWindowInMilliseconds becomes waitDurationInOpenState.
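
The three mappings can be applied mechanically. The sketch below is a hypothetical helper (HystrixPropertyMapper is not part of either library) that renames the keys and appends a millisecond suffix to the sleep window. The mapping is approximate: Hystrix evaluates its thresholds over a rolling time window, so the converted values are a starting point to review, not a guaranteed behavioral equivalent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: translates the three Hystrix circuit breaker properties named above. */
class HystrixPropertyMapper {
    static Map<String, String> toResilience4j(Map<String, String> hystrix) {
        Map<String, String> r4j = new LinkedHashMap<>();
        copy(hystrix, "circuitBreaker.requestVolumeThreshold", r4j, "minimumNumberOfCalls", "");
        copy(hystrix, "circuitBreaker.errorThresholdPercentage", r4j, "failureRateThreshold", "");
        // Hystrix expresses the sleep window in milliseconds; make the unit explicit.
        copy(hystrix, "circuitBreaker.sleepWindowInMilliseconds", r4j, "waitDurationInOpenState", "ms");
        return r4j;
    }

    private static void copy(Map<String, String> from, String fromKey,
                             Map<String, String> to, String toKey, String suffix) {
        String value = from.get(fromKey);
        if (value != null) {
            to.put(toKey, value + suffix); // properties that are not set are simply skipped
        }
    }
}
```

The resulting entries belong under resilience4j.circuitbreaker.instances.<name> in application.yml.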

Step 4: Incremental rollout

Do not replace everything at once; migrate service by service. Resilience4j and Hystrix can coexist in the same project, so it is safer to apply Resilience4j to new services first and convert existing ones in sequence.

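8. Failure Case Analysis and Recovery Procedures

8.1 Case 1: Retry Storm

Situation: When the payment gateway failed, every client retried at the same time and delayed the gateway's recovery.

Cause: Retry was applied without a CircuitBreaker. The retry interval had no jitter (random delay), so the retries were synchronized.

Resolution:

Apply a CircuitBreaker together with Retry so that retries are cut off once the failure rate crosses the threshold
Add jitter to the exponential backoff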

resilience4j:
  retry:
    instances:
      paymentGateway:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true # Enable jitter
        randomizedWaitFactor: 0.5 # Randomize within a 50% range
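
What jitter does to the wait times can be sketched as follows. This is a hypothetical helper, not Resilience4j's interval function: it assumes the randomization is uniform and symmetric around each exponential interval, which is one common reading of a 0.5 randomization factor.

```java
import java.util.Random;

/** Sketch: exponential backoff with randomized jitter around each interval. */
class JitteredBackoff {
    /**
     * Wait before retry number `retry` (1-based): base * multiplier^(retry - 1),
     * then drawn uniformly from [interval * (1 - factor), interval * (1 + factor)].
     */
    static long waitMillis(long baseMillis, double multiplier, double factor,
                           int retry, Random random) {
        double interval = baseMillis * Math.pow(multiplier, retry - 1);
        double delta = factor * interval;
        double min = interval - delta;
        double max = interval + delta;
        return Math.round(min + random.nextDouble() * (max - min));
    }
}
```

With a 1 s base, multiplier 2, and factor 0.5, the first wait falls between 0.5 s and 1.5 s and the second between 1 s and 3 s, so clients that failed at the same instant no longer retry at the same instant.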

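8.2 Case 2: Thread Pool Exhaustion Without a Bulkhead

Situation: A slow inventory check API occupied the entire Tomcat thread pool. Unrelated APIs such as payment and order lookup started timing out too.

Cause: Every external service call ran on the same thread pool.

Resolution:

Apply a per-service ThreadPool Bulkhead to isolate threads
Cap each slow service so it cannot occupy the whole thread pool

8.3 Case 3: Misconfigured Circuit Breaker Thresholds

Situation: The breaker was configured with minimumNumberOfCalls: 1 and failureRateThreshold: 50. A single failure opened the circuit and blocked a healthy service.

Cause: State transitions were triggered by a statistically insignificant number of calls.

Resolution:

Set minimumNumberOfCalls to at least 5 to 10
Set slidingWindowSize large enough (at least 10)
Tune the thresholds after analyzing real traffic patterns in production

8.4 Standardizing the Recovery Procedure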

#!/bin/bash
# circuit-breaker-recovery.sh
# Circuit breaker recovery procedure script

echo "===== Circuit breaker status ====="
# Check circuit breaker state through the Actuator endpoint
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.circuitBreakers'

echo ""
echo "===== Downstream service health checks ====="
curl -s http://payment-service:8080/actuator/health | jq '.status'
curl -s http://inventory-service:8080/actuator/health | jq '.status'

echo ""
echo "===== Force-close the circuit breaker (after confirming downstream recovery) ====="
# Caution: run only after the downstream service has fully recovered
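# Assumption: the Resilience4j Actuator write operation takes the target state in the
# request body; verify the exact request shape against the version in use.
# curl -X POST -H 'Content-Type: application/json' -d '{"updateState":"CLOSE"}' \
#   http://localhost:8080/actuator/circuitbreakers/paymentService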

echo ""
echo "===== Current metrics ====="
curl -s http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.state | jq '.'
curl -s http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.failure.rate | jq '.'

echo ""
echo "===== Istio Outlier Detection configuration ====="
kubectl get destinationrules -n production
kubectl describe destinationrule payment-service-circuit-breaker -n production

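9. Operational Monitoring and Metrics

9.1 Key Monitoring Metrics

The following metrics should always be monitored when operating circuit breakers.

Metric	Description	Alert threshold
`resilience4j.circuitbreaker.state`	Current circuit state, exposed as one gauge per state tag (closed, open, half_open, ...); the gauge for the active state is 1, the others 0	open gauge == 1
`resilience4j.circuitbreaker.failure.rate`	Current failure rate (%)	40% or higher
`resilience4j.circuitbreaker.calls`	Number of successful, failed, and ignored calls (calls rejected by an open circuit are counted in `resilience4j.circuitbreaker.not.permitted.calls`)	Spike in rejected calls
`resilience4j.circuitbreaker.slow.call.rate`	Slow call rate (%)	60% or higher
`resilience4j.bulkhead.available.concurrent.calls`	Concurrent calls still available	Close to 0
`resilience4j.retry.calls`	Retry call counts	Spike
`resilience4j.ratelimiter.available.permissions`	Permissions still available	Close to 0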

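9.2 Prometheus + Grafana Dashboard Setup

Resilience4j publishes its metrics through Micrometer. With the micrometer-registry-prometheus dependency on the classpath and the prometheus Actuator endpoint exposed, they are served in Prometheus format at /actuator/prometheus.

# Prometheus scrape configuration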
scrape_configs:
  - job_name: 'spring-boot-resilience4j'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['payment-service:8080']
        labels:
          application: 'payment-service'

Alert rule example: notify Slack when a circuit transitions to OPEN.

# Alert rules (Prometheus rule-file syntax)
groups:
  - name: circuit-breaker-alerts
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: 'Circuit breaker OPEN - {{ $labels.name }}'
          description: >
            The {{ $labels.name }} circuit breaker of {{ $labels.application }}
            is in the OPEN state.
            Check the downstream service immediately.

      - alert: HighFailureRate
        expr: resilience4j_circuitbreaker_failure_rate > 40
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: 'High failure rate detected - {{ $labels.name }}'
          description: >
            The failure rate of {{ $labels.name }} is {{ $value }}%.
            Find the cause before the circuit opens.

9.3 Istio Monitoring (Kiali + Grafana)

# Inspect the outlier detection settings delivered to the sidecar (JSON output; the default table view omits them)
istioctl proxy-config cluster <pod-name> -n production -o json | grep outlier

# Check Envoy statistics
kubectl exec -it <pod-name> -n production -c istio-proxy -- \
  curl localhost:15000/stats | grep outlier_detection

# Open the Kiali dashboard
istioctl dashboard kiali

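10. Troubleshooting

The circuit breaker does not open

Check minimumNumberOfCalls. If fewer calls than this value have been recorded, the circuit stays closed even at a 100% failure rate.
Check that recordExceptions includes the exception types that actually occur. Once recordExceptions is set, exceptions not on the list are not counted as failures.
Check whether ignoreExceptions unintentionally includes an exception that signals a real failure.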
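
The interplay of the two lists can be sketched as a small classifier. ExceptionClassifier is a hypothetical class written for illustration; it assumes that the ignore list takes precedence, that an empty record list means every exception counts as a failure, and that subclasses of a listed type match.

```java
import java.util.List;

/** Sketch: how recordExceptions and ignoreExceptions classify a thrown exception. */
class ExceptionClassifier {
    enum Outcome { FAILURE, IGNORED, SUCCESS }

    private final List<Class<? extends Throwable>> recordExceptions;
    private final List<Class<? extends Throwable>> ignoreExceptions;

    ExceptionClassifier(List<Class<? extends Throwable>> recordExceptions,
                        List<Class<? extends Throwable>> ignoreExceptions) {
        this.recordExceptions = recordExceptions;
        this.ignoreExceptions = ignoreExceptions;
    }

    Outcome classify(Throwable t) {
        if (matches(ignoreExceptions, t)) {
            return Outcome.IGNORED; // counts as neither success nor failure
        }
        if (recordExceptions.isEmpty() || matches(recordExceptions, t)) {
            return Outcome.FAILURE; // raises the failure rate
        }
        return Outcome.SUCCESS;     // unlisted exception: does not raise the failure rate
    }

    private static boolean matches(List<Class<? extends Throwable>> types, Throwable t) {
        return types.stream().anyMatch(type -> type.isInstance(t));
    }
}
```

This makes the troubleshooting steps above easy to reproduce in a unit test: feed the classifier the exception actually seen in the logs and check which bucket it lands in.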

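The circuit quickly returns from HALF-OPEN to OPEN

If permittedNumberOfCallsInHalfOpenState is too small, the decision is not statistically meaningful. Set it to at least 3 to 5.
This can happen when the downstream service has only partially recovered. Confirm that the downstream has fully recovered.

Bulkhead errors

If BulkheadFullException occurs frequently, raise maxConcurrentCalls or improve the downstream service's response time.
With a ThreadPool Bulkhead, a queueCapacity of 0 means calls are rejected immediately once the thread pool is full.

Istio Outlier Detection does not work

Check that the Istio sidecar proxy is injected into the Pod: kubectl get pod <name> -o jsonpath='{.spec.containers[*].name}'
Check that the host field of the DestinationRule is the exact service FQDN.
If maxEjectionPercent is too low, some unhealthy instances may not be ejected.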

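11. Practical Checklist

Design phase

Have you confirmed the SLA (response time, availability) of each downstream service?
Have you classified the failure impact of each service (Critical / High / Medium / Low)?
Have you defined a fallback strategy for failures (cache, default value, queue, alternative service)?
Have you confirmed that every operation subject to Retry is idempotent?
Have you decided how Retry and CircuitBreaker are combined (no Retry on its own)?
Have you confirmed that the worst-case total time (timeout * maxAttempts plus the backoff waits) stays within the user-facing SLA?

Implementation phase

Have you set the CircuitBreaker's slidingWindowSize and minimumNumberOfCalls large enough (at least 5 to 10)?
Have you registered network and timeout exceptions in recordExceptions?
Have you registered business exceptions (400 Bad Request and the like) in ignoreExceptions?
Have you applied a separate Bulkhead per service?
Have you applied a Rate Limiter to external API calls?
Have you confirmed that the fallback method's parameters match the original method (plus a trailing Throwable)?

Operations phase

Have you exposed the Actuator endpoints (/actuator/circuitbreakers, /actuator/health)?
Have you set up Prometheus metric collection?
Have you set up alerts (Slack, PagerDuty, etc.) for transitions to the OPEN state?
Have you set up alerts on a failure rate warning threshold?
Have you documented the circuit breaker recovery procedure (runbook)?
Do you run Chaos Engineering tests (service fault injection) on a regular basis?
In an Istio environment, have you configured a DestinationRule with Outlier Detection?

Testing phase

Have you tested the state transition scenario (CLOSED -> OPEN -> HALF-OPEN -> CLOSED)?
Have you tested that the fallback methods work correctly?
Have you simulated a full Bulkhead?
Have you tested a complete downstream outage?
Have you tested a slow response (Slow Call) scenario?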

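Conclusion

Resilience patterns are a necessity, not an option, in a microservices architecture. A well-chosen combination of circuit breaker, Bulkhead, Retry, Rate Limiter, and Timeout effectively stops a single service failure from propagating through the whole system.

The core principles are as follows.

No Retry on its own: always pair it with a CircuitBreaker to prevent retry storms.
Per-service isolation: use Bulkheads to isolate the resources consumed by each downstream service.
Multi-level fallback: design a chain of alternative service -> cache -> default value rather than a single fallback.
Two layers of defense, infrastructure plus application: use Istio's Outlier Detection together with Resilience4j.
Monitoring is mandatory: monitor circuit state, failure rate, and slow call rate in real time, and configure alerts.

Migrate from Hystrix to Resilience4j incrementally; adopting Resilience4j in new services first is the realistic path. Above all, run regular Chaos Engineering tests to verify that the configured resilience patterns behave as expected during a real failure.

References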

Circuit Breaker and Resilience Patterns Practical Guide — Resilience4j, Istio, Fault Isolation Strategies

Introduction
1. Circuit Breaker Pattern Principles
- 1.1 Three States: Closed, Open, Half-Open
- 1.2 Sliding Window Types
2. Java/Spring Circuit Breaker Implementation with Resilience4j
3. Istio Service Mesh Level Circuit Breaker
- 3.1 DestinationRule Configuration
- 3.2 Istio vs Application-Level Circuit Breaker
4. Bulkhead Pattern: Fault Isolation Strategy
- 4.1 Semaphore Bulkhead vs ThreadPool Bulkhead
- 4.2 Per-Service Bulkhead Isolation Example
5. Retry + Timeout + Rate Limiter Combination Patterns
- 5.1 Precautions for Pattern Combinations
- 5.2 Programmatic API Composition
6. Fallback Strategy Design
- 6.1 Fallback Strategy Types
- 6.2 Multi-Level Fallback Implementation
7. Migration from Netflix Hystrix to Resilience4j
- 7.1 Resilience4j vs Hystrix vs Istio Comparison
- 7.2 Migration Core Checklist
8. Failure Case Analysis and Recovery Procedures
9. Operational Monitoring and Metrics
10. Troubleshooting
11. Practical Checklist
Conclusion
References

Introduction

In a microservices architecture, inter-service calls are inherently unreliable. Network latency, timeouts, and downstream service failures occur routinely, and without proper defenses a single service failure can propagate through the entire system as a cascading failure. A representative example occurred during the 2024 Black Friday period, when response latency in a large e-commerce platform's product recommendation service caused the entire product listing page to take over 20 seconds to load.

Resilience patterns emerged to solve these problems. Starting with the circuit breaker pattern introduced by Michael Nygard in his 2007 book Release It!, a set of patterns including Bulkhead, Retry, Rate Limiter, Timeout, and Fallback has been systematized. Netflix Hystrix was the first popular implementation, but since Hystrix entered maintenance mode in 2018, Resilience4j has become the de facto standard in the Java/Spring ecosystem, while Istio provides infrastructure-level circuit breaking in service mesh environments.

This article comprehensively covers circuit breaker operating principles, practical implementation with Resilience4j and Istio, composite resilience pattern design, Hystrix migration, operational monitoring, and failure case analysis.

1. Circuit Breaker Pattern Principles

The circuit breaker is a pattern inspired by electrical circuit breakers, detecting remote service call failures and automatically blocking calls to prevent cascading failures across the entire system.

1.1 Three States: Closed, Open, Half-Open

              Failure rate >= threshold (failureRateThreshold)
    +---------------------------------------------+
    |                                             |
    v                                             |
+----------+                                 +----------+
|   OPEN   |                                 |  CLOSED  |
| (blocked)|                                 | (normal) |
+----------+                                 +----------+
     |                                            ^
     | waitDurationInOpenState elapsed            |
     v                                            |
+--------------+  Trial failure rate < threshold  |
|  HALF-OPEN   | ---------------------------------+
| (trial mode) |
+--------------+
     |
     | Trial failure rate >= threshold
     v
+----------+
|   OPEN   |  (blocked again)
+----------+

The behavior of each state is as follows:

State	Behavior	Transition Condition
CLOSED	Passes all requests normally and records results in the sliding window	Transitions to OPEN when the failure rate reaches or exceeds the threshold
OPEN	Immediately rejects all requests, throws CallNotPermittedException	Transitions to HALF-OPEN after waitDuration elapses
HALF-OPEN	Allows only a limited number of trial calls	Transitions to CLOSED or OPEN based on trial call success rate

1.2 Sliding Window Types

Resilience4j supports two types of sliding windows:

COUNT_BASED: Calculates failure rate based on the last N call results. Suitable for services with consistent traffic.
TIME_BASED: Calculates failure rate based on call results within the last N seconds. Suitable for services with variable traffic.

2. Java/Spring Circuit Breaker Implementation with Resilience4j

2.1 Dependency Setup

// build.gradle (Spring Boot 3.x)
dependencies {
    implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.2.0'
    implementation 'io.github.resilience4j:resilience4j-micrometer:2.2.0'
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    implementation 'org.springframework.boot:spring-boot-starter-aop'
}

2.2 application.yml Configuration

resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 80
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.BusinessException

  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException

  bulkhead:
    instances:
      paymentService:
        maxConcurrentCalls: 20
        maxWaitDuration: 500ms

  timelimiter:
    instances:
      paymentService:
        timeoutDuration: 3s
        cancelRunningFuture: true

  ratelimiter:
    instances:
      paymentService:
        limitRefreshPeriod: 1s
        limitForPeriod: 50
        timeoutDuration: 0s

2.3 Annotation-Based Implementation

@Service
@Slf4j
public class PaymentService {

    private final PaymentGatewayClient paymentGatewayClient;
    private final PaymentCacheService paymentCacheService;

    public PaymentService(PaymentGatewayClient paymentGatewayClient,
                          PaymentCacheService paymentCacheService) {
        this.paymentGatewayClient = paymentGatewayClient;
        this.paymentCacheService = paymentCacheService;
    }

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    @Bulkhead(name = "paymentService")
    @Retry(name = "paymentService")
    @TimeLimiter(name = "paymentService")
    public CompletableFuture<PaymentResponse> processPayment(PaymentRequest request) {
        return CompletableFuture.supplyAsync(() -> {
            log.info("Processing payment: orderId={}", request.getOrderId());
            return paymentGatewayClient.charge(request);
        });
    }

    // Fallback method: called when circuit is OPEN or exception occurs
    private CompletableFuture<PaymentResponse> paymentFallback(
            PaymentRequest request, Throwable throwable) {
        log.warn("Payment service fallback triggered: orderId={}, reason={}",
                request.getOrderId(), throwable.getMessage());

        if (throwable instanceof CallNotPermittedException) {
            // Circuit is open - save to queue for async processing
            return CompletableFuture.completedFuture(
                PaymentResponse.queued(request.getOrderId(),
                    "Payment service temporarily unavailable. Order has been queued.")
            );
        }

        // Other exceptions - try returning cached result
        return CompletableFuture.completedFuture(
            paymentCacheService.getCachedResponse(request.getOrderId())
                .orElse(PaymentResponse.error(request.getOrderId(),
                    "An error occurred during payment processing. Please try again later."))
        );
    }
}

Aspect execution order: Resilience4j annotations are applied in the following nested order:

Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) )

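Retry is the outermost layer and is applied last, so when the CircuitBreaker throws an exception, Retry performs the retry. One consequence: a fallbackMethod declared on an inner annotation such as @CircuitBreaker handles the exception before it reaches Retry, so Retry sees a normal result and does not retry. The order can be customized via each module's *AspectOrder property.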

3. Istio Service Mesh Level Circuit Breaker

Istio can apply circuit breakers at the infrastructure level without application code changes. It leverages Envoy proxy's Outlier Detection feature to automatically remove unhealthy instances from the load balancing pool.

3.1 DestinationRule Configuration

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-circuit-breaker
  namespace: production
spec:
  host: payment-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100 # Maximum TCP connections
        connectTimeout: 3s # TCP connection timeout
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 50 # Max requests queued while waiting for a connection
        http2MaxRequests: 100 # Max active requests to the destination (HTTP/1.1 and HTTP/2)
        maxRequestsPerConnection: 10 # Max requests per connection
        maxRetries: 3 # Max retries outstanding to the destination at any one time
    outlierDetection:
      consecutive5xxErrors: 5 # Eject a host after 5 consecutive 5xx errors
      interval: 10s # Ejection analysis interval
      baseEjectionTime: 30s # Minimum ejection duration
      maxEjectionPercent: 50 # Eject at most 50% of hosts
      minHealthPercent: 30 # Outlier detection applies only while at least 30% of hosts are healthy

3.2 Istio vs Application-Level Circuit Breaker

Aspect	Istio (Infrastructure Level)	Resilience4j (App Level)
How it is applied	No code changes, YAML config	Annotations or programmatic API
Isolation unit	Instance (Pod) level ejection	Method/Service level blocking
Fallback	Not supported (returns 503)	Custom fallback methods supported
Language support	All languages/frameworks	JVM languages (Java, Kotlin, etc.)
Fine-grained control	Limited	Very fine-grained
Monitoring	Kiali, Grafana integration	Micrometer, Actuator integration
Recommended for	Multi-language environments	When business logic integration is needed

In practice, using both together is recommended. Istio isolates unhealthy instances at the infrastructure level, while Resilience4j handles fine-grained fallback and retry at the application level.

4. Bulkhead Pattern: Fault Isolation Strategy

The Bulkhead pattern derives from ship bulkheads, which isolate compartments so that flooding in one doesn't affect others.

4.1 Semaphore Bulkhead vs ThreadPool Bulkhead

Aspect	Semaphore Bulkhead	ThreadPool Bulkhead
Isolation	Limits concurrent calls via semaphore	Executes in separate thread pool
Calling thread	Runs in request thread	Runs asynchronously in separate thread
Return type	Both sync/async supported	CompletableFuture only
Overhead	Low	Thread pool management cost
Recommended for	General concurrency limiting	When full thread isolation needed

# ThreadPool Bulkhead configuration
resilience4j:
  thread-pool-bulkhead:
    instances:
      inventoryService:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 100ms
        writableStackTraceEnabled: true

4.2 Per-Service Bulkhead Isolation Example

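import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import java.util.concurrent.CompletableFuture;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class OrderOrchestrator {

    // Application-specific clients, injected through the Lombok-generated constructor
    private final PaymentClient paymentClient;
    private final InventoryClient inventoryClient;
    private final NotificationClient notificationClient;

    @Bulkhead(name = "paymentService", type = Bulkhead.Type.SEMAPHORE)
    public PaymentResult processPayment(Order order) {
        return paymentClient.charge(order.getPaymentInfo());
    }

    // The aspect already runs this method on the bulkhead's own thread pool, so the
    // blocking call is made directly instead of hopping to the common pool via supplyAsync
    @Bulkhead(name = "inventoryService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<InventoryResult> reserveInventory(Order order) {
        return CompletableFuture.completedFuture(
            inventoryClient.reserve(order.getItems()));
    }

    @Bulkhead(name = "notificationService", type = Bulkhead.Type.SEMAPHORE)
    public void sendNotification(Order order) {
        notificationClient.send(order.getUserId(), "Your order has been received.");
    }
}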

By separating Bulkheads per service, even if the inventory service slows down, the payment service's concurrent call capacity remains unaffected.
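The isolation effect can be reproduced with nothing but java.util.concurrent.Semaphore. The following is a minimal sketch, assuming a bulkhead that rejects immediately when no permit is free (roughly what a semaphore bulkhead with no wait time does); the class name SemaphoreBulkheadSketch and the use of IllegalStateException as the rejection signal are illustrative choices, not Resilience4j's API.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Minimal semaphore bulkhead: one permit pool per downstream service. */
final class SemaphoreBulkheadSketch {
    private final Semaphore permits;

    SemaphoreBulkheadSketch(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call on the caller's thread if a permit is free; otherwise rejects at once. */
    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even when the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each downstream service its own instance means that exhausting the inventory permits has no effect on the permits available to payment calls.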

5. Retry + Timeout + Rate Limiter Combination Patterns

Resilience patterns are most effective when combined rather than used individually. However, incorrect combinations can worsen failures, so caution is needed.

5.1 Precautions for Pattern Combinations

Retry + CircuitBreaker: Retry on its own adds load to a service that is already failing. Always pair it with a CircuitBreaker so that retries stop once the failure rate crosses the threshold.
Timeout + Retry: Worst-case latency is roughly timeout * maxAttempts plus the waits between attempts. With a 3-second timeout and maxAttempts of 3 (the first call counts as an attempt), the timeouts alone add up to 9 seconds. Design with the user response time SLA in mind.
Rate Limiter + CircuitBreaker: The Rate Limiter keeps calls within the external API's quota, while the CircuitBreaker handles failures of the API itself, giving two layers of defense.
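The Timeout + Retry arithmetic is easy to get wrong once backoff is involved, so a small calculator helps when checking a configuration against an SLA. This is a minimal sketch, assuming every attempt runs into the timeout, that maxAttempts includes the first call, and that waits grow by a fixed multiplier; the class and method names are hypothetical.

```java
import java.time.Duration;

/** Worst-case latency budget for a timeout combined with retries. */
final class RetryBudgetSketch {

    /**
     * Returns maxAttempts * timeout plus the (maxAttempts - 1) waits between attempts,
     * where each wait is the previous one times the multiplier.
     */
    static Duration worstCase(Duration timeout, int maxAttempts, Duration initialWait, double multiplier) {
        Duration total = timeout.multipliedBy(maxAttempts);
        double waitMillis = initialWait.toMillis();
        for (int i = 1; i < maxAttempts; i++) {
            total = total.plusMillis(Math.round(waitMillis));
            waitMillis *= multiplier;
        }
        return total;
    }

    public static void main(String[] args) {
        // 3 s timeout, 3 attempts, 1 s initial wait doubling each time: 9 s + 1 s + 2 s
        System.out.println(worstCase(Duration.ofSeconds(3), 3, Duration.ofSeconds(1), 2.0));
    }
}
```

With a 3-second timeout, three attempts and a 1-second wait that doubles, the worst case is 12 seconds rather than 9, which is the figure that should be compared against the SLA.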

5.2 Programmatic API Composition

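import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadFullException;
import io.github.resilience4j.bulkhead.BulkheadRegistry;
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterRegistry;
import io.github.resilience4j.ratelimiter.RequestNotPermitted;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryRegistry;
import java.util.Arrays;
import java.util.function.Supplier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ResilienceConfig {

    @Bean
    public Supplier<String> resilientSupplier(
            ExternalApiClient externalApiClient, // application-specific client bean
            CircuitBreakerRegistry circuitBreakerRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry,
            RateLimiterRegistry rateLimiterRegistry) {

        CircuitBreaker circuitBreaker = circuitBreakerRegistry
                .circuitBreaker("externalApi");
        Retry retry = retryRegistry.retry("externalApi");
        Bulkhead bulkhead = bulkheadRegistry.bulkhead("externalApi");
        RateLimiter rateLimiter = rateLimiterRegistry
                .rateLimiter("externalApi");

        // Decorator chaining: the first decorator listed is the innermost layer
        Supplier<String> decoratedSupplier = Decorators
                .ofSupplier(() -> externalApiClient.call())
                .withBulkhead(bulkhead)           // 1. Concurrent call limit (innermost)
                .withRateLimiter(rateLimiter)      // 2. Rate limit
                .withCircuitBreaker(circuitBreaker) // 3. Failure detection/blocking
                .withRetry(retry)                   // 4. Retry
                .withFallback(Arrays.asList(        // 5. Fallback (outermost)
                    CallNotPermittedException.class,
                    BulkheadFullException.class,
                    RequestNotPermitted.class),
                    throwable -> "Fallback Response")
                .decorate();

        return decoratedSupplier;
    }
}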

6. Fallback Strategy Design

Fallback provides alternative responses when the original service fails. The key is implementing graceful degradation that maintains user experience as much as possible, rather than simply returning error messages.

6.1 Fallback Strategy Types

Strategy	Description	Use Case Examples
Cache fallback	Return last successful cached response	Product recommendations, exchange rates, weather
Default value	Return predefined default values	Configuration service, feature flags
Queue fallback	Save request to queue for later processing	Payment processing, order intake
Alternative service	Route to backup service	CDN redundancy, multi-region
Empty response	Return empty result (instead of error)	Search autocomplete, recommendation widgets
Manual switch	Operator manually activates alternative	Critical business logic

6.2 Multi-Level Fallback Implementation

@Service
@Slf4j
@RequiredArgsConstructor // Lombok: generates the constructor that initializes the final fields
public class ProductRecommendationService {

    private final RecommendationEngine primaryEngine;
    private final RecommendationEngine secondaryEngine;
    private final RedisTemplate<String, List<Product>> cache;

    @CircuitBreaker(name = "recommendation",
                    fallbackMethod = "secondaryRecommendation")
    public List<Product> getRecommendations(String userId) {
        return primaryEngine.recommend(userId);
    }

    // 1st fallback: Use secondary recommendation engine
    private List<Product> secondaryRecommendation(
            String userId, Throwable t) {
        log.warn("Primary recommendation engine failure, switching to secondary: {}", t.getMessage());
        try {
            return secondaryEngine.recommend(userId);
        } catch (Exception e) {
            return cachedRecommendation(userId, e);
        }
    }

    // 2nd fallback: Return cached recommendation results
    private List<Product> cachedRecommendation(
            String userId, Throwable t) {
        log.warn("Secondary recommendation engine also failed, checking cache: {}", t.getMessage());
        List<Product> cached = cache.opsForValue()
                .get("recommendation:" + userId);
        if (cached != null && !cached.isEmpty()) {
            return cached;
        }
        return defaultRecommendation(userId, t);
    }

    // 3rd fallback: Return default popular products list
    private List<Product> defaultRecommendation(
            String userId, Throwable t) {
        log.warn("No cache available, returning default popular products");
        return List.of(
            Product.popular("BEST-001", "Bestseller Product A"),
            Product.popular("BEST-002", "Bestseller Product B"),
            Product.popular("BEST-003", "Bestseller Product C")
        );
    }
}
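The same cascade can also be expressed as data rather than as nested method calls, which makes it easier to reorder or extend the levels. The following is a minimal sketch in plain Java, assuming each level is a supplier that may throw or return an empty list; the class name FallbackChainSketch and its method are hypothetical and independent of Resilience4j.

```java
import java.util.List;
import java.util.function.Supplier;

/** Tries fallback levels in order until one yields a usable result. */
final class FallbackChainSketch {

    /**
     * Returns the first non-empty result from the given levels.
     * A level that throws or returns null/empty is skipped; if all fail, lastResort is returned.
     */
    static <T> List<T> firstAvailable(List<Supplier<List<T>>> levels, List<T> lastResort) {
        for (Supplier<List<T>> level : levels) {
            try {
                List<T> result = level.get();
                if (result != null && !result.isEmpty()) {
                    return result;
                }
            } catch (RuntimeException e) {
                // This level failed; continue with the next one
            }
        }
        return lastResort;
    }
}
```

In the recommendation example, the levels would be the secondary engine and the cache lookup, with the static bestseller list as the last resort.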

7. Migration from Netflix Hystrix to Resilience4j

Netflix Hystrix entered maintenance mode in 2018, and official support was dropped starting from Spring Cloud 2020.0.0. Projects using Hystrix need to migrate to Resilience4j.

7.1 Resilience4j vs Hystrix vs Istio Comparison

Item	Hystrix	Resilience4j	Istio
Maintenance status	Maintenance mode (2018~)	Actively maintained	Actively maintained
Design philosophy	OOP (extend HystrixCommand)	Functional programming (decorators)	Infrastructure-based (sidecar proxy)
Module structure	All-in-one	Select only needed modules	Full service mesh
Spring Boot integration	Spring Cloud Netflix	Native Spring Boot starter	Kubernetes environment required
Isolation	Thread Pool / Semaphore	Semaphore / Thread Pool	Connection pool / Outlier Detection
Configuration	Java Config / Properties	YAML / Java Config / Annotations	Kubernetes CRD (YAML)
Reactive support	Limited (RxJava 1)	Full support (Reactor, RxJava 2/3)	N/A
Metrics	Hystrix Dashboard	Micrometer / Prometheus	Prometheus / Kiali
Fallback	HystrixCommand.getFallback()	fallbackMethod annotation	Not supported (returns 503)
Learning curve	Medium	Low	High (service mesh understanding needed)

7.2 Migration Core Checklist

Step 1: Replace Dependencies

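// Remove
// implementation 'org.springframework.cloud:spring-cloud-starter-netflix-hystrix'

// Add
implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.2.0'
implementation 'io.github.resilience4j:resilience4j-micrometer:2.2.0'

// The starter expects these at runtime (annotation support and Actuator endpoints)
implementation 'org.springframework.boot:spring-boot-starter-aop'
implementation 'org.springframework.boot:spring-boot-starter-actuator'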

Step 2: Code Conversion Patterns

Hystrix	Resilience4j
`@HystrixCommand(fallbackMethod = "fallback")`	`@CircuitBreaker(name = "svc", fallbackMethod = "fallback")`
`class MyCommand extends HystrixCommand<T>`	`Decorators.ofSupplier(() -> ...).withCircuitBreaker(cb)`
`@HystrixProperty(name = "...")`	`application.yml` configuration
`HystrixDashboard`	Micrometer + Grafana

Step 3: Configuration Migration

Hystrix's circuitBreaker.requestVolumeThreshold maps to Resilience4j's minimumNumberOfCalls, circuitBreaker.errorThresholdPercentage maps to failureRateThreshold, and circuitBreaker.sleepWindowInMilliseconds converts to waitDurationInOpenState.
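For projects with many commands, the three mappings can be applied mechanically. The following is a minimal sketch of such a conversion in plain Java, assuming the Hystrix properties have already been read into a map; the class name is hypothetical, and any property outside the three mappings is deliberately left out for manual review.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Translates the three Hystrix circuit breaker properties named above. */
final class HystrixPropertyMappingSketch {

    static Map<String, String> toResilience4j(Map<String, String> hystrix) {
        Map<String, String> out = new LinkedHashMap<>();
        hystrix.forEach((key, value) -> {
            switch (key) {
                case "circuitBreaker.requestVolumeThreshold":
                    out.put("minimumNumberOfCalls", value);
                    break;
                case "circuitBreaker.errorThresholdPercentage":
                    out.put("failureRateThreshold", value);
                    break;
                case "circuitBreaker.sleepWindowInMilliseconds":
                    // Hystrix stores plain milliseconds; append the unit for a duration value
                    out.put("waitDurationInOpenState", value + "ms");
                    break;
                default:
                    // No direct equivalent handled here; review manually
                    break;
            }
        });
        return out;
    }
}
```

The names line up, but the two libraries evaluate their windows differently, so the converted values are a starting point to validate under real traffic rather than a guaranteed equivalent.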

Step 4: Gradual Transition

Don't replace everything at once. Migrate service by service. Resilience4j and Hystrix can coexist in the same project, so apply Resilience4j to new services first and convert existing services sequentially.

8. Failure Case Analysis and Recovery Procedures

8.1 Case 1: Retry Storm

Situation: When the payment gateway fails, all clients retry simultaneously, delaying gateway recovery.

Cause: Only Retry applied without CircuitBreaker. No jitter in retry intervals, causing synchronized retries.

Solution:

Apply CircuitBreaker with Retry to block retries beyond a certain failure rate
Add jitter to exponential backoff

resilience4j:
  retry:
    instances:
      paymentGateway:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true # Enable jitter
        randomizedWaitFactor: 0.5 # Randomize within 50% range
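To see what these settings do to the wait times, the following minimal sketch computes a jittered exponential delay in plain Java. It assumes the common formulation in which the base delay is waitDuration * multiplier^(attempt - 1) and the result is drawn uniformly from base plus or minus factor * base; the exact formula inside Resilience4j may differ, and the class name is hypothetical.

```java
import java.util.Random;

/** Exponential backoff with symmetric random jitter. */
final class JitteredBackoffSketch {

    /**
     * Wait before the retry that follows attempt number `attempt` (1-based).
     * The base delay grows exponentially and is then spread over [base - f*base, base + f*base].
     */
    static long waitMillis(int attempt, long initialMillis, double multiplier, double factor, Random random) {
        double base = initialMillis * Math.pow(multiplier, attempt - 1);
        double delta = factor * base;
        double min = base - delta;
        double max = base + delta;
        return Math.round(min + random.nextDouble() * (max - min));
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int attempt = 1; attempt <= 2; attempt++) {
            System.out.println("wait after attempt " + attempt + ": "
                    + waitMillis(attempt, 1000, 2.0, 0.5, random) + " ms");
        }
    }
}
```

With the configuration above, the first wait falls between 0.5 s and 1.5 s and the second between 1 s and 3 s, so clients that failed at the same moment no longer retry in lockstep.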

8.2 Case 2: Thread Pool Exhaustion Due to Missing Bulkhead

Situation: When the inventory check API slows down, its calls occupy the entire Tomcat thread pool, and unrelated APIs such as payment and order queries all time out.

Cause: All external service calls run in the same thread pool.

Solution:

Apply per-service ThreadPool Bulkhead for thread isolation
Prevent slow services from monopolizing the entire thread pool
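The mechanism behind a thread pool bulkhead can be sketched with a bounded ThreadPoolExecutor from the JDK. This is a minimal illustration under the assumption that a full pool plus a full queue should reject new work immediately; the class name ThreadPoolBulkheadSketch is hypothetical and the code does not use Resilience4j.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** One bounded pool per downstream service; overflow is rejected instead of queued forever. */
final class ThreadPoolBulkheadSketch implements AutoCloseable {
    private final ThreadPoolExecutor executor;

    ThreadPoolBulkheadSketch(int coreThreads, int maxThreads, int queueCapacity) {
        this.executor = new ThreadPoolExecutor(
                coreThreads, maxThreads, 100, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded queue, capacity >= 1
                new ThreadPoolExecutor.AbortPolicy());     // throw when pool and queue are full
    }

    /** Runs the call on this bulkhead's threads; throws RejectedExecutionException when saturated. */
    <T> CompletableFuture<T> submit(Supplier<T> call) {
        return CompletableFuture.supplyAsync(call, executor);
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

A slow inventory service can then block only the threads of its own pool; request threads get a fast rejection instead of waiting, and the pool used for payment calls stays available.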

8.3 Case 3: Circuit Breaker Threshold Misconfiguration

Situation: The circuit breaker was configured with minimumNumberOfCalls: 1 and failureRateThreshold: 50. A single failure opened the circuit and blocked calls to an otherwise healthy service.

Cause: State transitions were decided on a statistically insignificant number of calls.

Solution:

Set minimumNumberOfCalls to at least 5-10
Set slidingWindowSize sufficiently large (10 or more)
Adjust thresholds after analyzing actual traffic patterns in production
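The effect of minimumNumberOfCalls is easiest to see in a small model of a count-based sliding window. The following is a minimal sketch in plain Java, assuming the circuit opens when the failure rate over the last N calls is greater than or equal to the threshold and that no decision is made before the minimum number of calls is reached; the class name is hypothetical and the model omits HALF_OPEN handling.

```java
/** Count-based sliding window that decides whether a circuit would open. */
final class CountBasedWindowSketch {
    private final boolean[] failures;          // ring buffer holding the last N outcomes
    private final int minimumNumberOfCalls;
    private final double failureRateThreshold; // percent
    private int recorded;                      // total number of calls recorded so far
    private int failureCount;                  // failures currently inside the window

    CountBasedWindowSketch(int slidingWindowSize, int minimumNumberOfCalls, double failureRateThreshold) {
        this.failures = new boolean[slidingWindowSize];
        this.minimumNumberOfCalls = minimumNumberOfCalls;
        this.failureRateThreshold = failureRateThreshold;
    }

    /** Records one outcome and reports whether the circuit would open after it. */
    boolean recordAndCheckOpen(boolean failed) {
        int slot = recorded % failures.length;
        if (recorded >= failures.length && failures[slot]) {
            failureCount--; // the oldest outcome leaves the window
        }
        failures[slot] = failed;
        if (failed) {
            failureCount++;
        }
        recorded++;
        int callsInWindow = Math.min(recorded, failures.length);
        if (callsInWindow < minimumNumberOfCalls) {
            return false; // not enough data to judge
        }
        double failureRate = 100.0 * failureCount / callsInWindow;
        return failureRate >= failureRateThreshold;
    }
}
```

With minimumNumberOfCalls set to 1, the very first failure yields a 100% failure rate and opens the circuit; with a value of 10, the same failure is only one data point among ten.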

8.4 Standardized Recovery Procedure

#!/bin/bash
# circuit-breaker-recovery.sh
# Circuit breaker failure recovery procedure script

echo "===== Check Circuit Breaker Status ====="
# Check circuit breaker state via Actuator endpoint
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.circuitBreakers'

echo ""
echo "===== Downstream Service Health Check ====="
curl -s http://payment-service:8080/actuator/health | jq '.status'
curl -s http://inventory-service:8080/actuator/health | jq '.status'

echo ""
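echo "===== Force Close Circuit Breaker (after confirming downstream recovery) ====="
# WARNING: Execute only after downstream service is fully recovered
# The Actuator write operation takes the target state in the request body
# curl -X POST http://localhost:8080/actuator/circuitbreakers/paymentService \
#   -H 'Content-Type: application/json' -d '{"updateState": "CLOSE"}'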

echo ""
echo "===== Check Current Metrics ====="
curl -s http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.state | jq '.'
curl -s http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.failure.rate | jq '.'

echo ""
echo "===== Check Istio DestinationRule (Outlier Detection) Configuration ====="
kubectl get destinationrules -n production
kubectl describe destinationrule payment-service-circuit-breaker -n production

9. Operational Monitoring and Metrics

9.1 Key Monitoring Metrics

The following metrics must be monitored for circuit breaker operations:

Metric	Description	Alert Threshold
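`resilience4j.circuitbreaker.state`	One gauge per state tag (closed, open, half_open, ...); 1 for the active state, 0 otherwise	Gauge with state="open" equals 1
`resilience4j.circuitbreaker.failure.rate`	Current failure rate (%); -1 until minimumNumberOfCalls is reached	Above 40%
`resilience4j.circuitbreaker.calls`	Successful/failed/ignored call counts (kind tag); rejected calls are reported in `resilience4j.circuitbreaker.not.permitted.calls`	Spike in failed or rejected calls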
`resilience4j.circuitbreaker.slow.call.rate`	Slow call rate (%)	Above 60%
`resilience4j.bulkhead.available.concurrent.calls`	Available concurrent calls	Near 0
`resilience4j.retry.calls`	Retry count	On spike
`resilience4j.ratelimiter.available.permissions`	Available permissions	Near 0

9.2 Prometheus + Grafana Dashboard Setup

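With a Prometheus registry (micrometer-registry-prometheus) on the classpath and the prometheus Actuator endpoint exposed, Resilience4j metrics are published in Prometheus format through Micrometer.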

# Prometheus scrape configuration
scrape_configs:
  - job_name: 'spring-boot-resilience4j'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['payment-service:8080']
        labels:
          application: 'payment-service'

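Alert rule example: fire an alert when the circuit transitions to the OPEN state and route it to Slack. The rules below use the Prometheus alerting rule format.

# Prometheus alerting rules (route to Slack via Alertmanager or Grafana)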
groups:
  - name: circuit-breaker-alerts
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: 'Circuit Breaker OPEN - {{ $labels.name }}'
          description: >
            {{ $labels.application }}'s {{ $labels.name }}
            circuit breaker is in OPEN state.
            Check downstream service status immediately.

      - alert: HighFailureRate
        expr: resilience4j_circuitbreaker_failure_rate > 40
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: 'High failure rate detected - {{ $labels.name }}'
          description: >
            {{ $labels.name }} failure rate is {{ $value }}%.
            Identify the cause before the circuit opens.

9.3 Istio Monitoring (Kiali + Grafana)

# Check the outlier detection settings pushed to the sidecar (JSON output is needed to see them)
istioctl proxy-config cluster <pod-name> -n production -o json | grep -i -A 6 outlierDetection

# Check Envoy outlier detection statistics (ejection counters)
kubectl exec <pod-name> -n production -c istio-proxy -- \
  pilot-agent request GET stats | grep outlier_detection

# Access Kiali dashboard
istioctl dashboard kiali

10. Troubleshooting

Circuit breaker not opening

Check the minimumNumberOfCalls value. If fewer calls than this value have occurred, the circuit won't open even at 100% failure rate.
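Verify that recordExceptions includes the actual exception types being thrown. When recordExceptions is set, only the listed exceptions and their subclasses count as failures; all other exceptions are recorded as successes unless they appear in ignoreExceptions. When it is left empty, every exception counts as a failure.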
Check if failure exceptions are unintentionally included in ignoreExceptions.

Circuit quickly transitions back to OPEN from HALF-OPEN

If permittedNumberOfCallsInHalfOpenState is too small, statistically meaningful judgments are difficult. Set to at least 3-5.
This can occur when the downstream service is only partially recovered. Confirm complete downstream recovery.

BulkheadFullException occurs frequently

Increase maxConcurrentCalls or improve downstream service response times.
When using ThreadPool Bulkhead with queueCapacity of 0, requests are immediately rejected when the thread pool is full.

Istio Outlier Detection not working

Verify that the Istio sidecar proxy is injected into the Pod: kubectl get pod <name> -o jsonpath='{.spec.containers[*].name}'
Verify that the DestinationRule host field is the correct service FQDN.
If maxEjectionPercent is too low, some unhealthy instances may not be ejected.

11. Practical Checklist

Design Phase

Confirmed SLA (response time, availability) for each downstream service
Classified failure impact per service (Critical / High / Medium / Low)
Defined fallback strategies for failures (cache, default values, queue, alternative service)
Verified that Retry targets guarantee idempotency
Decided on Retry + CircuitBreaker combination usage (standalone Retry prohibited)
Verified that worst-case latency (timeout * maxAttempts plus the waits between retries) is within the user SLA

Implementation Phase

Set CircuitBreaker slidingWindowSize and minimumNumberOfCalls sufficiently large (minimum 5-10)
Registered network/timeout related exceptions in recordExceptions
Registered business exceptions (400 Bad Request, etc.) in ignoreExceptions
Applied per-service Bulkhead isolation
Applied Rate Limiter to external API calls
Verified fallback method parameters match original method (+ Throwable added)

Operations Phase

Exposed Actuator endpoints (/actuator/circuitbreakers, /actuator/health)
Configured Prometheus metric collection
Set up alerts (Slack, PagerDuty, etc.) for circuit OPEN state transitions
Set up failure rate warning threshold alerts
Documented circuit breaker failure recovery procedures (Runbook)
Scheduled periodic Chaos Engineering tests (service failure injection)
Configured DestinationRule and Outlier Detection if in Istio environment

Testing Phase

Tested circuit state transition scenarios (CLOSED -> OPEN -> HALF-OPEN -> CLOSED)
Tested fallback methods work correctly
Simulated Bulkhead full scenarios
Tested complete downstream service failure scenarios
Tested slow response (Slow Call) scenarios

Conclusion

Resilience patterns are not optional but essential in microservices architecture. By properly combining Circuit Breaker, Bulkhead, Retry, Rate Limiter, and Timeout, you can effectively prevent a single service failure from propagating to the entire system.

Key principles summarized:

Standalone Retry is prohibited: Always use with CircuitBreaker to prevent retry storms.
Per-service isolation: Isolate resource usage for each downstream service with Bulkhead.
Multi-level fallback: Design a multi-level structure of alternative service -> cache -> default values, not just a single fallback.
Dual defense at infrastructure + app level: Use Istio Outlier Detection and Resilience4j together.
Monitoring is essential: Monitor circuit state, failure rate, and slow call rate in real-time with alerts configured.

Migrating from Hystrix to Resilience4j should be done gradually, with Resilience4j introduced to new services first. Most importantly, verify through regular Chaos Engineering tests that your configured resilience patterns work as expected in actual failure scenarios.