
Circuit Breaker 패턴과 Resilience4j 실전 구현 가이드: 장애 전파 차단부터 복구까지


들어가며

마이크로서비스 아키텍처에서 서비스 간 네트워크 호출은 본질적으로 불안정하다. 네트워크 지연, 타임아웃, 다운스트림 서비스 장애는 일상적으로 발생하며, 이를 적절히 제어하지 않으면 단일 서비스의 장애가 전체 시스템으로 전파되는 연쇄 장애(Cascading Failure)가 발생한다. 2024년 말 한 대형 이커머스 플랫폼에서 결제 게이트웨이 하나의 응답 지연이 주문 서비스, 재고 서비스, 알림 서비스까지 연쇄적으로 마비시킨 사례가 대표적이다.

Circuit Breaker 패턴은 전기 회로의 차단기에서 착안한 장애 격리 메커니즘이다. Michael Nygard가 2007년 Release It!에서 처음 소개한 이래, Martin Fowler의 블로그 포스트를 거쳐 마이크로서비스 세계의 핵심 패턴으로 자리잡았다. Netflix의 Hystrix가 첫 번째 대중적 구현체였지만, 2018년 유지보수 모드에 진입하면서 Resilience4j가 사실상의 표준으로 부상했다.

이 글에서는 Circuit Breaker 상태 머신의 동작 원리에서 출발해, Resilience4j의 핵심 모듈인 CircuitBreaker, Retry, Bulkhead, RateLimiter를 Spring Boot 3 환경에서 통합 구현하는 방법, Grafana 대시보드를 통한 모니터링, 그리고 실제 장애 시나리오에서의 복구 전략까지 운영 레벨에서 다룬다.

Circuit Breaker 상태 머신

Circuit Breaker의 핵심은 세 가지 상태(CLOSED, OPEN, HALF-OPEN)와 두 가지 특수 상태(DISABLED, FORCED_OPEN) 간의 전이를 관리하는 유한 상태 머신(Finite State Machine)이다.

상태 전이 다이어그램

                     실패율 >= 임계값
          ┌─────────────────────────────────────┐
          │                                     │
          ▼                                     │
     ┌──────────┐                          ┌──────────┐
     │   OPEN   │                          │  CLOSED  │
     │  (차단)   │                          │  (정상)   │
     └──────────┘                          └──────────┘
          │                                     ▲
          │ waitDuration 경과                    │
          ▼                                     │ 시험 호출 성공률 >= 임계값
     ┌──────────────┐                           │
     │  HALF-OPEN   │───────────────────────────┘
     │  (시험 허용)   │
     └──────────────┘
          │ 시험 호출 실패율 >= 임계값
          ▼
     ┌──────────┐
     │   OPEN   │
     │ (다시 차단)│
     └──────────┘

상태별 동작 상세

| 상태 | 요청 처리 | 전이 조건 | 메트릭 수집 |
|------|----------|----------|------------|
| CLOSED | 모든 요청 통과 | 슬라이딩 윈도우 내 실패율이 임계값 이상이면 OPEN 전이 | 성공/실패/느린 호출 기록 |
| OPEN | 모든 요청 즉시 거부 (CallNotPermittedException) | waitDurationInOpenState 경과 후 HALF-OPEN 전이 | 거부된 호출 수 기록 |
| HALF-OPEN | permittedNumberOfCalls 만큼만 허용 | 시험 호출 결과에 따라 CLOSED 또는 OPEN 전이 | 시험 호출 성공/실패 기록 |
| DISABLED | 모든 요청 통과 (서킷 비활성) | 수동 전환만 가능 | 메트릭 수집하지 않음 |
| FORCED_OPEN | 모든 요청 즉시 거부 | 수동 전환만 가능 | 거부된 호출 수 기록 |
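위 표의 핵심 전이(CLOSED → OPEN → HALF-OPEN → CLOSED/OPEN)를 코드로 옮겨 보면 동작이 더 분명해진다. 아래는 Resilience4j의 실제 구현이 아니라 상태 머신의 원리만 보여주는 가정적 최소 스케치다. `MiniCircuitBreaker`라는 클래스명, 그리고 waitDuration 경과를 외부에서 알려주는 방식은 설명용 단순화다.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// 상태 전이 이해를 위한 최소 스케치 (실제 Resilience4j 구현이 아님)
class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold;   // %
    private final int permittedCallsInHalfOpen;
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = 실패
    private State state = State.CLOSED;
    private int halfOpenCalls = 0;
    private int halfOpenFailures = 0;

    MiniCircuitBreaker(int windowSize, double failureRateThreshold, int permittedCallsInHalfOpen) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.permittedCallsInHalfOpen = permittedCallsInHalfOpen;
    }

    State state() { return state; }

    // OPEN 상태에서는 호출 자체가 거부된다 (CallNotPermittedException에 해당)
    boolean tryAcquirePermission() { return state != State.OPEN; }

    // waitDurationInOpenState 경과를 외부에서 알려준다고 가정한 단순화
    void waitDurationElapsed() {
        if (state == State.OPEN) {
            state = State.HALF_OPEN;
            halfOpenCalls = 0;
            halfOpenFailures = 0;
        }
    }

    void record(boolean failed) {
        if (state == State.HALF_OPEN) {
            halfOpenCalls++;
            if (failed) halfOpenFailures++;
            if (halfOpenCalls == permittedCallsInHalfOpen) {
                // 시험 호출 실패율에 따라 CLOSED 복귀 또는 다시 OPEN
                double rate = 100.0 * halfOpenFailures / halfOpenCalls;
                state = (rate >= failureRateThreshold) ? State.OPEN : State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) window.removeFirst();
        if (window.size() == windowSize) {
            long failures = window.stream().filter(f -> f).count();
            if (100.0 * failures / windowSize >= failureRateThreshold) state = State.OPEN;
        }
    }
}
```
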

슬라이딩 윈도우 방식 비교

Resilience4j는 두 가지 슬라이딩 윈도우 방식을 제공한다.

| 구분 | COUNT_BASED | TIME_BASED |
|------|-------------|------------|
| 기준 | 최근 N개 호출 | 최근 N초간 호출 |
| 설정 예시 | slidingWindowSize: 10 | slidingWindowSize: 60 |
| 메모리 사용 | 고정 (N개 결과 배열) | 가변 (N초간의 부분 집계) |
| 적합한 환경 | 호출 빈도가 일정한 서비스 | 호출 빈도가 불규칙한 서비스 |
| 평가 시점 | N번째 호출 이후 | 매 호출 시 시간 윈도우 평가 |

COUNT_BASED는 내부적으로 N 크기의 원형 비트 배열(circular bit array)로 구현되어, 각 호출 결과를 O(1)로 기록하고 실패율을 상수 시간에 계산한다. TIME_BASED는 N개의 부분 집계 버킷(partial aggregation bucket)을 사용하며, 각 버킷이 1초간의 호출 결과를 집계한다.
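COUNT_BASED 방식의 O(1) 기록/계산 원리는 아래 가정적 스케치로 확인할 수 있다. `CountBasedWindow`라는 이름과 boolean 배열 사용은 설명용 단순화이며, 실제 구현은 비트 배열과 스레드 동기화를 포함한다.

```java
// COUNT_BASED 슬라이딩 윈도우의 원형 배열 동작을 보여주는 스케치 (실제 구현 아님)
class CountBasedWindow {
    private final boolean[] failures; // 실제 구현은 비트 배열 사용
    private int index = 0;
    private int count = 0;        // 지금까지 기록된 호출 수 (최대 windowSize)
    private int failureCount = 0; // 실패 수를 증분 유지 -> 실패율을 상수 시간에 계산

    CountBasedWindow(int size) { this.failures = new boolean[size]; }

    void record(boolean failed) {
        // 윈도우가 가득 찼으면 덮어쓸 슬롯의 기존 결과를 집계에서 제거
        if (count == failures.length && failures[index]) failureCount--;
        if (count < failures.length) count++;
        failures[index] = failed;
        if (failed) failureCount++;
        index = (index + 1) % failures.length; // 원형 이동: 기록은 O(1)
    }

    double failureRate() {
        // 윈도우가 차기 전에는 실패율을 평가하지 않음 (-1로 표시)
        if (count < failures.length) return -1.0;
        return 100.0 * failureCount / count;
    }
}
```
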

Resilience4j 아키텍처

Hystrix에서 Resilience4j로의 전환

Netflix Hystrix가 2018년 유지보수 모드에 진입한 이후, Resilience4j가 JVM 생태계의 표준 장애 허용(fault tolerance) 라이브러리로 자리잡았다.

| 비교 항목 | Netflix Hystrix | Resilience4j |
|----------|-----------------|--------------|
| 상태 | 유지보수 모드 (2018년 이후 업데이트 없음) | 활발한 개발 (2025년 2.3.0 릴리스) |
| Java 버전 | Java 8+ | Java 17+ (Spring Boot 3 지원) |
| 의존성 | Archaius, RxJava 등 다수 | 없음 (1.x는 Vavr 한 개, 2.x부터 제거) |
| 아키텍처 | 모놀리식 (모든 기능 포함) | 모듈러 (필요한 모듈만 선택) |
| 스레드 모델 | 별도 스레드 풀 필수 | 세마포어 기반 (스레드 풀 옵션) |
| 설정 방식 | Archaius 필수 | application.yml, 프로그래밍 방식 모두 지원 |
| 리액티브 지원 | RxJava 1 | Reactor, RxJava 2/3 네이티브 지원 |
| 함수형 인터페이스 | 제한적 | 완전 지원 (Supplier, Function, Runnable 등) |
| 모니터링 | Hystrix Dashboard | Micrometer 통합 (Prometheus, Grafana) |

Resilience4j 핵심 모듈

Resilience4j는 다섯 가지 핵심 모듈을 독립적으로 또는 조합하여 사용할 수 있다.

| 모듈 | 역할 | 핵심 설정 |
|------|------|----------|
| CircuitBreaker | 실패율 기반 회로 차단 | failureRateThreshold, slidingWindowSize |
| Retry | 실패 시 재시도 | maxAttempts, waitDuration, backoff |
| Bulkhead | 동시 호출 수 제한 (격벽) | maxConcurrentCalls, maxWaitDuration |
| RateLimiter | 단위 시간당 호출 수 제한 | limitForPeriod, limitRefreshPeriod |
| TimeLimiter | 호출 시간 제한 | timeoutDuration, cancelRunningFuture |

어노테이션 기반으로 조합할 때의 적용 순서는 다음과 같다.

외부(먼저 평가) ──────────────────────────────────> 내부(마지막 평가)
Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead

이 순서는 Resilience4j가 Spring AOP 기반으로 어노테이션을 처리할 때의 기본 우선순위다. resilience4j.circuitbreaker.circuitBreakerAspectOrder 등의 속성으로 순서를 커스터마이징할 수도 있다.
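이 적용 순서는 결국 데코레이터 중첩 순서와 같다. 아래는 각 모듈을 로깅 래퍼로 대체해 평가 순서만 확인하는 가정적 스케치다(`logged`, `DecorationOrderDemo`는 설명용 이름이며 Resilience4j API가 아니다).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// 어노테이션 적용 순서가 곧 데코레이터 중첩 순서임을 보여주는 스케치
class DecorationOrderDemo {
    // inner를 감싸면서 평가 시점에 자신의 이름을 기록하는 래퍼
    static Supplier<String> logged(String name, List<String> trace, Supplier<String> inner) {
        return () -> { trace.add(name); return inner.get(); };
    }

    static List<String> run() {
        List<String> trace = new ArrayList<>();
        Supplier<String> call = () -> { trace.add("실제 호출"); return "ok"; };
        // 안쪽(마지막 평가)부터 바깥으로 감싼다: Bulkhead가 가장 안쪽, Retry가 가장 바깥
        Supplier<String> decorated =
            logged("Retry", trace,
                logged("CircuitBreaker", trace,
                    logged("RateLimiter", trace,
                        logged("TimeLimiter", trace,
                            logged("Bulkhead", trace, call)))));
        decorated.get();
        return trace; // 평가 순서: Retry -> ... -> Bulkhead -> 실제 호출
    }
}
```
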

Spring Boot 3 통합 설정

의존성 설정

// build.gradle.kts (Spring Boot 3.3+ / Resilience4j 2.2+)
plugins {
    id("org.springframework.boot") version "3.3.5"
    id("io.spring.dependency-management") version "1.1.6"
    kotlin("jvm") version "1.9.25"
    kotlin("plugin.spring") version "1.9.25"
}

dependencies {
    // Resilience4j Spring Boot 3 스타터
    implementation("io.github.resilience4j:resilience4j-spring-boot3:2.2.0")

    // 개별 모듈 (스타터에 포함되지만 명시적 선언 권장)
    implementation("io.github.resilience4j:resilience4j-circuitbreaker")
    implementation("io.github.resilience4j:resilience4j-retry")
    implementation("io.github.resilience4j:resilience4j-bulkhead")
    implementation("io.github.resilience4j:resilience4j-ratelimiter")
    implementation("io.github.resilience4j:resilience4j-timelimiter")

    // Micrometer + Prometheus (모니터링)
    implementation("io.github.resilience4j:resilience4j-micrometer")
    implementation("io.micrometer:micrometer-registry-prometheus")

    // Spring Boot Actuator
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("org.springframework.boot:spring-boot-starter-aop")
    implementation("org.springframework.boot:spring-boot-starter-web")

    // Kotlin Coroutines (선택)
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactor")

    testImplementation("org.springframework.boot:spring-boot-starter-test")
}

통합 설정 파일

# application.yml - Resilience4j 통합 설정
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 3s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        failureRateThreshold: 40
        waitDurationInOpenState: 60s
        slidingWindowSize: 20
      inventoryService:
        baseConfig: default
        failureRateThreshold: 60
        slowCallDurationThreshold: 5s
      notificationService:
        baseConfig: default
        failureRateThreshold: 70
        waitDurationInOpenState: 15s

  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2.0
        exponentialMaxWaitDuration: 10s
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        maxAttempts: 2
        waitDuration: 2s
      inventoryService:
        baseConfig: default
        maxAttempts: 4
      notificationService:
        baseConfig: default
        maxAttempts: 5
        waitDuration: 500ms

  bulkhead:
    configs:
      default:
        maxConcurrentCalls: 25
        maxWaitDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        maxConcurrentCalls: 15
      inventoryService:
        baseConfig: default
        maxConcurrentCalls: 30
      notificationService:
        baseConfig: default
        maxConcurrentCalls: 50

  ratelimiter:
    configs:
      default:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
        timeoutDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        limitForPeriod: 50
      inventoryService:
        baseConfig: default
        limitForPeriod: 200

  timelimiter:
    configs:
      default:
        timeoutDuration: 5s
        cancelRunningFuture: true
    instances:
      paymentGateway:
        baseConfig: default
        timeoutDuration: 10s
      inventoryService:
        baseConfig: default
        timeoutDuration: 3s

# Actuator 메트릭 노출
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus,circuitbreakers,retries
  endpoint:
    health:
      show-details: always
  health:
    circuitbreakers:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        resilience4j.circuitbreaker.calls: true
        resilience4j.retry.calls: true
    tags:
      application: order-service

설정에서 주목할 점은 configs.default로 기본 프로필을 정의하고, 각 인스턴스에서 baseConfig: default를 지정하여 공통 설정을 상속받는 구조다. 서비스별 특성에 맞게 임계값만 오버라이드하면 설정 중복을 최소화할 수 있다.

CircuitBreaker 실전 구현

어노테이션 기반 구현 (Kotlin)

// PaymentGatewayClient.kt
@Service
class PaymentGatewayClient(
    private val restClient: RestClient,
    private val paymentRetryQueue: PaymentRetryQueue,
    private val paymentCacheStore: PaymentCacheStore,
) {
    companion object {
        private val log = LoggerFactory.getLogger(PaymentGatewayClient::class.java)
        const val CB_NAME = "paymentGateway"
    }

    @CircuitBreaker(name = CB_NAME, fallbackMethod = "paymentFallback")
    @Retry(name = CB_NAME)
    @Bulkhead(name = CB_NAME)
    fun processPayment(request: PaymentRequest): PaymentResponse {
        log.info("Calling payment gateway for orderId={}", request.orderId)

        val response = restClient.post()
            .uri("https://payment-api.internal/v2/charges")
            .contentType(MediaType.APPLICATION_JSON)
            .body(request)
            .retrieve()
            .body(PaymentResponse::class.java)
            ?: throw PaymentGatewayException("Empty response from payment gateway")

        log.info("Payment processed: orderId={}, txId={}", request.orderId, response.transactionId)
        return response
    }

    /**
     * 폴백 메서드: CircuitBreaker OPEN 또는 예외 발생 시 호출된다.
     * 메서드 시그니처는 원본과 동일 + 마지막 파라미터로 Exception을 받아야 한다.
     */
    private fun paymentFallback(request: PaymentRequest, ex: Exception): PaymentResponse {
        log.warn(
            "Payment fallback activated: orderId={}, reason={}",
            request.orderId, ex.message
        )

        return when (ex) {
            is CallNotPermittedException -> {
                // CircuitBreaker OPEN 상태: 큐에 넣고 비동기 처리
                paymentRetryQueue.enqueue(request)
                PaymentResponse(
                    orderId = request.orderId,
                    status = PaymentStatus.QUEUED,
                    message = "결제가 대기열에 등록되었습니다. 잠시 후 처리됩니다.",
                    transactionId = null,
                )
            }
            is BulkheadFullException -> {
                // Bulkhead 포화: 즉시 재시도 유도
                PaymentResponse(
                    orderId = request.orderId,
                    status = PaymentStatus.RETRY_LATER,
                    message = "현재 결제 요청이 많습니다. 잠시 후 다시 시도해주세요.",
                    transactionId = null,
                )
            }
            else -> {
                // 기타 예외: 캐시된 결제 정보가 있으면 반환
                val cached = paymentCacheStore.getLastSuccess(request.orderId)
                if (cached != null) {
                    log.info("Returning cached payment for orderId={}", request.orderId)
                    cached.copy(status = PaymentStatus.CACHED)
                } else {
                    paymentRetryQueue.enqueue(request)
                    PaymentResponse(
                        orderId = request.orderId,
                        status = PaymentStatus.PENDING,
                        message = "결제 처리 중 오류가 발생했습니다. 자동 재시도됩니다.",
                        transactionId = null,
                    )
                }
            }
        }
    }
}

프로그래밍 방식 구현 (Java)

어노테이션 대신 CircuitBreakerRegistry를 직접 사용하면, 런타임에 동적으로 서킷 브레이커를 생성하거나 설정을 변경할 수 있다.

// InventoryServiceClient.java
@Service
@Slf4j
public class InventoryServiceClient {

    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;
    private final RestClient restClient;

    public InventoryServiceClient(
            CircuitBreakerRegistry cbRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry,
            RestClient.Builder restClientBuilder) {

        this.circuitBreaker = cbRegistry.circuitBreaker("inventoryService");
        this.retry = retryRegistry.retry("inventoryService");
        this.bulkhead = bulkheadRegistry.bulkhead("inventoryService");
        this.restClient = restClientBuilder
                .baseUrl("https://inventory-api.internal")
                .build();

        // 이벤트 리스너 등록
        registerEventListeners();
    }

    public InventoryResponse checkStock(String productId, int quantity) {
        // 데코레이터 체인(바깥->안): Retry -> CircuitBreaker -> Bulkhead -> 실제 호출
        // (나중에 호출한 with*가 더 바깥에서 감싼다)
        Supplier<InventoryResponse> decorated = Decorators
                .ofSupplier(() -> doCheckStock(productId, quantity))
                .withBulkhead(bulkhead)
                .withCircuitBreaker(circuitBreaker)
                .withRetry(retry)
                .withFallback(
                    List.of(
                        CallNotPermittedException.class,
                        BulkheadFullException.class,
                        IOException.class
                    ),
                    ex -> stockFallback(productId, quantity, ex)
                )
                .decorate();

        return decorated.get();
    }

    private InventoryResponse doCheckStock(String productId, int quantity) {
        return restClient.get()
                .uri("/v1/stock/{productId}?qty={qty}", productId, quantity)
                .retrieve()
                .body(InventoryResponse.class);
    }

    private InventoryResponse stockFallback(
            String productId, int quantity, Throwable ex) {
        log.warn("Inventory fallback: productId={}, reason={}", productId, ex.getMessage());
        // 재고가 불확실할 때는 주문을 수락하되 비동기 검증 예약
        return InventoryResponse.builder()
                .productId(productId)
                .available(true)
                .reservationStatus(ReservationStatus.TENTATIVE)
                .message("재고 확인 지연: 잠정 승인 후 비동기 검증 예정")
                .build();
    }

    private void registerEventListeners() {
        circuitBreaker.getEventPublisher()
            .onStateTransition(event -> {
                log.warn("[CircuitBreaker] {} state: {} -> {}",
                    event.getCircuitBreakerName(),
                    event.getStateTransition().getFromState(),
                    event.getStateTransition().getToState());
            })
            .onError(event ->
                log.error("[CircuitBreaker] {} error: {} ({}ms)",
                    event.getCircuitBreakerName(),
                    event.getThrowable().getMessage(),
                    event.getElapsedDuration().toMillis())
            )
            .onSuccess(event ->
                log.debug("[CircuitBreaker] {} success ({}ms)",
                    event.getCircuitBreakerName(),
                    event.getElapsedDuration().toMillis())
            )
            .onCallNotPermitted(event ->
                log.warn("[CircuitBreaker] {} call not permitted (OPEN state)",
                    event.getCircuitBreakerName())
            );

        retry.getEventPublisher()
            .onRetry(event ->
                log.info("[Retry] {} attempt #{} (wait: {}ms)",
                    event.getName(),
                    event.getNumberOfRetryAttempts(),
                    event.getWaitInterval().toMillis())
            );
    }
}

Retry, Bulkhead, RateLimiter 조합

Retry와 Exponential Backoff

재시도 전략에서 가장 중요한 것은 지수 백오프(exponential backoff)와 지터(jitter)의 조합이다. 고정 간격 재시도는 다수의 클라이언트가 동시에 재시도하여 서버에 부하를 집중시키는 thundering herd 문제를 일으킨다.

// RetryConfig를 프로그래밍 방식으로 커스터마이징
@Configuration
class ResilienceConfig {

    @Bean
    fun customRetryConfig(): RetryConfig {
        return RetryConfig.custom<RetryConfig>()
            .maxAttempts(4)
            .intervalFunction(
                // 지수 백오프 + 지터: 1s, 2s(+jitter), 4s(+jitter), 8s(+jitter)
                IntervalFunction.ofExponentialRandomBackoff(
                    Duration.ofSeconds(1),   // 초기 대기 시간
                    2.0,                     // 배수
                    Duration.ofSeconds(15)   // 최대 대기 시간
                )
            )
            .retryOnException { ex ->
                // 재시도 대상 예외 판별
                when (ex) {
                    is IOException -> true
                    is TimeoutException -> true
                    is HttpServerErrorException -> true
                    is ConnectException -> true
                    else -> false
                }
            }
            .ignoreExceptions(
                BusinessValidationException::class.java,
                IllegalArgumentException::class.java
            )
            .failAfterMaxAttempts(true) // 최대 재시도 후 MaxRetriesExceededException 발생
            .build()
    }

    @Bean
    fun retryRegistry(customRetryConfig: RetryConfig): RetryRegistry {
        return RetryRegistry.of(customRetryConfig)
    }
}
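지수 백오프가 만드는 대기 시간은 직접 계산해 볼 수 있다. 아래 `BackoffSketch`는 `IntervalFunction.ofExponentialRandomBackoff`의 동작을 단순화한 가정적 구현이다. 지터는 난수이므로 테스트에서는 범위만 확인한다.

```java
import java.util.concurrent.ThreadLocalRandom;

// 지수 백오프 + 지터 대기 시간 계산 스케치 (Resilience4j IntervalFunction의 단순화 버전)
class BackoffSketch {
    // attempt번째 재시도(0부터)의 기본 대기 시간(ms): initial * multiplier^attempt, 상한 적용
    static long exponential(long initialMs, double multiplier, long capMs, int attempt) {
        double interval = initialMs * Math.pow(multiplier, attempt);
        return (long) Math.min(interval, capMs);
    }

    // full jitter: [0, 지수 대기 시간] 범위의 난수 -> 동시 재시도를 시간축으로 분산
    static long withFullJitter(long initialMs, double multiplier, long capMs, int attempt) {
        long base = exponential(initialMs, multiplier, capMs, attempt);
        return ThreadLocalRandom.current().nextLong(base + 1);
    }
}
```

초기 1초, 배수 2.0, 상한 15초 설정이면 기본 대기 시간은 1s, 2s, 4s, 8s, 15s(상한)로 증가하고, 지터를 더하면 각 클라이언트의 재시도 시점이 흩어져 thundering herd를 완화한다.
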

Bulkhead: 세마포어 vs 스레드 풀

Bulkhead는 선박의 격벽에서 착안한 패턴으로, 하나의 서비스 호출이 모든 리소스를 독점하지 못하도록 격리한다. Resilience4j는 두 가지 Bulkhead 구현을 제공한다.

| 구분 | SemaphoreBulkhead | ThreadPoolBulkhead |
|------|-------------------|--------------------|
| 격리 수준 | 동시 호출 수 제한 | 별도 스레드 풀에서 실행 |
| 호출 스레드 | 호출자 스레드 그대로 사용 | 전용 스레드 풀의 스레드 사용 |
| 반환 타입 | 동기 반환 | CompletionStage 반환 |
| 오버헤드 | 낮음 | 스레드 컨텍스트 전환 비용 |
| 적합한 환경 | 대부분의 HTTP 호출 | CPU 집약적 작업, 완전 격리 필요 시 |
| 설정 | maxConcurrentCalls, maxWaitDuration | maxThreadPoolSize, coreThreadPoolSize, queueCapacity |

# ThreadPoolBulkhead 설정 예시
resilience4j:
  thread-pool-bulkhead:
    instances:
      heavyProcessing:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 100ms
        writableStackTraceEnabled: true
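SemaphoreBulkhead의 핵심은 `java.util.concurrent.Semaphore`와 같은 동시 호출 수 제한이다. 아래는 그 동작을 단순화한 가정적 스케치다(`MiniBulkhead`는 설명용 이름이며, 실제 구현은 `IllegalStateException` 대신 `BulkheadFullException`을 던진다).

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// SemaphoreBulkhead의 핵심 동작(동시 호출 수 제한)을 보여주는 스케치
class MiniBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    MiniBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            // maxWaitDuration 안에 허가를 얻지 못하면 거부 (BulkheadFullException에 해당)
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for permit", e);
        }
        if (!acquired) throw new IllegalStateException("bulkhead full");
        try {
            return call.get(); // 호출자 스레드에서 그대로 실행 (별도 스레드 풀 없음)
        } finally {
            permits.release();
        }
    }

    int availablePermits() { return permits.availablePermits(); }
}
```
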

RateLimiter 설정과 적용

RateLimiter는 단위 시간당 허용되는 호출 수를 제한하여, 외부 API의 rate limit 초과를 방지하거나 내부 서비스를 과부하로부터 보호한다.
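limitForPeriod와 limitRefreshPeriod의 관계는 다음 가정적 스케치로 요약할 수 있다. 테스트 가능하도록 현재 시각을 인자로 받았으며, 실제 Resilience4j 구현은 나노초 기반 사이클 계산과 timeoutDuration 동안의 대기를 포함한다(`MiniRateLimiter`는 설명용 이름).

```java
// RateLimiter의 limitForPeriod/limitRefreshPeriod 동작을 단순화한 스케치 (실제 구현 아님)
class MiniRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodMillis;
    private long cycleStart;
    private int permitsLeft;

    MiniRateLimiter(int limitForPeriod, long refreshPeriodMillis, long nowMillis) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodMillis = refreshPeriodMillis;
        this.cycleStart = nowMillis;
        this.permitsLeft = limitForPeriod;
    }

    boolean acquirePermission(long nowMillis) {
        if (nowMillis - cycleStart >= refreshPeriodMillis) {
            // 새 주기 시작: 허가 수를 limitForPeriod로 리셋
            cycleStart += ((nowMillis - cycleStart) / refreshPeriodMillis) * refreshPeriodMillis;
            permitsLeft = limitForPeriod;
        }
        if (permitsLeft == 0) return false; // RequestNotPermitted에 해당
        permitsLeft--;
        return true;
    }
}
```
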

// RateLimiter와 CircuitBreaker 조합
@Service
@Slf4j
public class ExternalApiClient {

    private final RestClient restClient;

    @CircuitBreaker(name = "externalApi", fallbackMethod = "apiFallback")
    @RateLimiter(name = "externalApi")
    @Retry(name = "externalApi")
    public ApiResponse callExternalApi(ApiRequest request) {
        log.debug("Calling external API: endpoint={}", request.getEndpoint());

        return restClient.post()
                .uri(request.getEndpoint())
                .body(request.getPayload())
                .retrieve()
                .body(ApiResponse.class);
    }

    private ApiResponse apiFallback(ApiRequest request, RequestNotPermitted ex) {
        // RateLimiter에 의해 거부된 경우
        log.warn("Rate limit exceeded for external API: {}", request.getEndpoint());
        return ApiResponse.rateLimited(
                "요청 한도를 초과했습니다. " +
                "limitForPeriod 설정을 확인하거나 잠시 후 다시 시도하세요."
        );
    }

    private ApiResponse apiFallback(ApiRequest request, Exception ex) {
        // 기타 예외 (CircuitBreaker OPEN, 네트워크 오류 등)
        log.warn("External API fallback: endpoint={}, reason={}",
                request.getEndpoint(), ex.getMessage());
        return ApiResponse.error("외부 API 호출에 실패했습니다: " + ex.getMessage());
    }
}

폴백 메서드를 오버로딩할 때 주의할 점은, Resilience4j가 예외 타입을 기준으로 가장 구체적인 폴백을 선택한다는 것이다. RequestNotPermitted(RateLimiter 거부)와 Exception(일반 예외)을 분리하면 예외 원인에 따라 다른 폴백 로직을 실행할 수 있다.

Grafana 모니터링 대시보드

Prometheus 메트릭 수집

Resilience4j는 Micrometer를 통해 자동으로 메트릭을 노출한다. Spring Boot Actuator의 /actuator/prometheus 엔드포인트에서 다음 메트릭을 확인할 수 있다.

# CircuitBreaker 상태 확인 (0=CLOSED, 1=OPEN, 2=HALF_OPEN, 3=DISABLED, 4=FORCED_OPEN)
resilience4j_circuitbreaker_state{name="paymentGateway"}

# 실패율 (%)
resilience4j_circuitbreaker_failure_rate{name="paymentGateway"}

# 느린 호출 비율 (%)
resilience4j_circuitbreaker_slow_call_rate{name="paymentGateway"}

# 호출 통계 (kind: successful, failed, ignored, not_permitted)
rate(resilience4j_circuitbreaker_calls_seconds_count{name="paymentGateway"}[5m])

# 호출 지연 시간 분포 (히스토그램)
histogram_quantile(0.95,
  rate(resilience4j_circuitbreaker_calls_seconds_bucket{name="paymentGateway"}[5m])
)

# Retry 재시도 횟수
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="successful_with_retry"}[1h])
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="failed_with_retry"}[1h])

# Bulkhead 가용 동시 호출 수
resilience4j_bulkhead_available_concurrent_calls{name="paymentGateway"}

# RateLimiter 가용 허용 수
resilience4j_ratelimiter_available_permissions{name="externalApi"}

Grafana 대시보드 JSON 구성

Grafana 대시보드에서 핵심적으로 구성해야 할 패널과 각 패널의 PromQL 쿼리를 정리한다.

패널 1 - CircuitBreaker 상태 게이지

resilience4j_circuitbreaker_state{application="order-service"}

Value mapping으로 0=CLOSED(초록), 1=OPEN(빨강), 2=HALF_OPEN(노랑)을 매핑한다.

패널 2 - 실패율 추이 (Time Series)

resilience4j_circuitbreaker_failure_rate{application="order-service", name=~".*"}

임계값 라인(failureRateThreshold)을 추가하여 서킷이 OPEN으로 전이되는 시점을 시각적으로 확인한다.

패널 3 - 호출 성공/실패 비율 (Stacked Bar)

sum by (name, kind) (
  rate(resilience4j_circuitbreaker_calls_seconds_count{application="order-service"}[5m])
)

패널 4 - P95 응답 시간 (Time Series)

histogram_quantile(0.95,
  sum by (le, name) (
    rate(resilience4j_circuitbreaker_calls_seconds_bucket{application="order-service"}[5m])
  )
)

패널 5 - Bulkhead 동시 호출 현황 (Gauge)

resilience4j_bulkhead_max_allowed_concurrent_calls{application="order-service"}
- resilience4j_bulkhead_available_concurrent_calls{application="order-service"}

알림 규칙 설정

Grafana 또는 Prometheus Alertmanager에 다음 알림 규칙을 등록한다.

# prometheus-alerts.yml
groups:
  - name: resilience4j_alerts
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: 'CircuitBreaker OPEN: {{ $labels.name }}'
          description: >
            서비스 {{ $labels.application }}의
            {{ $labels.name }} 서킷 브레이커가 OPEN 상태입니다.
            다운스트림 서비스 장애를 확인하세요.

      - alert: HighFailureRate
        expr: resilience4j_circuitbreaker_failure_rate > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'High failure rate: {{ $labels.name }} ({{ $value }}%)'
          description: >
            {{ $labels.name }}의 실패율이 {{ $value }}%로
            경고 임계값(30%)을 초과했습니다.

      - alert: BulkheadSaturation
        expr: >
          (resilience4j_bulkhead_max_allowed_concurrent_calls
          - resilience4j_bulkhead_available_concurrent_calls)
          / resilience4j_bulkhead_max_allowed_concurrent_calls > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Bulkhead 80% saturated: {{ $labels.name }}'

      - alert: ExcessiveRetries
        expr: >
          rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m])
          / rate(resilience4j_retry_calls_total[5m]) > 0.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: 'Retry 실패율 50% 초과: {{ $labels.name }}'

트러블슈팅 가이드

문제 1: CircuitBreaker가 OPEN으로 전이되지 않는다

증상: 분명히 실패가 발생하고 있는데 서킷이 CLOSED 상태를 유지한다.

원인 분석:

  • minimumNumberOfCalls에 도달하지 않았다. 기본값 100이므로, 호출 빈도가 낮은 서비스에서는 슬라이딩 윈도우가 채워지기 전에 장애가 해소될 수 있다.
  • 예외가 ignoreExceptions에 포함되어 있다. 비즈니스 예외뿐 아니라 의도치 않은 예외까지 ignore 목록에 있는지 확인한다.
  • 예외가 recordExceptions에 포함되지 않았다. recordExceptions를 명시하면 해당 목록에 없는 예외는 실패로 기록하지 않는다.

해결: minimumNumberOfCalls를 서비스 호출 빈도에 맞게 조정하고, recordExceptions와 ignoreExceptions 목록을 점검한다.

문제 2: Retry와 CircuitBreaker 조합 시 예상보다 많은 호출 발생

증상: maxAttempts=3으로 설정했는데 다운스트림 서비스에 5번 이상의 호출이 기록된다.

원인 분석: 어노테이션 적용 순서에서 Retry가 CircuitBreaker 바깥에 위치한다. 따라서 CircuitBreaker가 실패를 기록한 후, Retry가 다시 CircuitBreaker를 통해 호출을 시도한다. CircuitBreaker의 HALF-OPEN 상태에서 시험 호출이 추가되면 총 호출 수가 예상을 초과할 수 있다.

해결: Retry의 maxAttempts를 보수적으로 설정하고, CircuitBreaker의 slidingWindowSize와 Retry의 maxAttempts 조합이 만드는 최대 호출 수를 계산하여 다운스트림 부하를 예측한다.
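서킷이 계속 CLOSED인 최악의 경우를 가정하면, 다운스트림 호출 수 상한은 간단한 곱셈으로 추정할 수 있다. 아래는 그 계산을 담은 가정적 스케치다(`CallAmplification`은 설명용 이름).

```java
// Retry가 CircuitBreaker 바깥에 있을 때 다운스트림 최대 호출 수를 추정하는 스케치
class CallAmplification {
    // 서킷이 계속 CLOSED라고 가정한 최악의 경우:
    // 인스턴스 수 x 인스턴스당 초당 요청 수 x maxAttempts
    static long worstCaseCallsPerSecond(int instances, int requestsPerSecond, int maxAttempts) {
        return (long) instances * requestsPerSecond * maxAttempts;
    }
}
```

예를 들어 인스턴스 20대, 인스턴스당 초당 100건, maxAttempts=5면 다운스트림에는 초당 최대 10,000건이 도달할 수 있다.
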

문제 3: 폴백 메서드가 호출되지 않는다

증상: CircuitBreaker가 OPEN인데 CallNotPermittedException이 클라이언트에 직접 전파된다.

원인 분석: 폴백 메서드의 시그니처가 원본 메서드와 정확히 일치하지 않는다. 폴백 메서드는 원본 메서드의 모든 파라미터를 동일한 순서와 타입으로 받고, 마지막 파라미터로 Exception(또는 특정 예외 타입)을 추가해야 한다.

해결: 폴백 메서드 시그니처를 점검한다. 반환 타입도 원본과 정확히 동일해야 한다. 아래는 올바른 예시다.

// 원본 메서드
@CircuitBreaker(name = "svc", fallbackMethod = "fallback")
public OrderResponse getOrder(String orderId, boolean includeDetails) { ... }

// 올바른 폴백 (파라미터 동일 + Exception 추가)
private OrderResponse fallback(String orderId, boolean includeDetails, Exception ex) { ... }

// 잘못된 폴백 - 컴파일은 되지만 런타임에 매칭 실패
private OrderResponse fallback(String orderId, Exception ex) { ... }  // 파라미터 누락
private void fallback(String orderId, boolean includeDetails, Exception ex) { ... }  // 반환타입 불일치

문제 4: TIME_BASED 윈도우에서 메모리 사용량이 증가

증상: TIME_BASED 슬라이딩 윈도우를 사용하는데 힙 메모리 사용량이 점진적으로 증가한다.

원인 분석: slidingWindowSize가 지나치게 크게 설정되어 있다. 예를 들어 slidingWindowSize=600(10분)으로 설정하면 600개의 부분 집계 버킷이 유지된다. 트래픽이 높으면 각 버킷의 호출 기록이 누적되어 메모리를 소비한다.

해결: TIME_BASED에서는 slidingWindowSize를 60초 이하로 설정하고, 장기간 추이는 Prometheus 메트릭으로 관찰한다. 메모리 민감한 환경에서는 COUNT_BASED를 우선 사용한다.

운영 체크리스트

운영 환경에 Circuit Breaker를 배포하기 전에 반드시 확인해야 할 항목을 정리한다.

설정 검증

  • slidingWindowSize와 minimumNumberOfCalls의 비율이 적절한가 (minimumNumberOfCalls는 slidingWindowSize의 50% 이하 권장)
  • failureRateThreshold가 서비스 특성에 맞게 설정되었는가 (결제: 30-40%, 알림: 60-70%)
  • waitDurationInOpenState가 다운스트림 서비스의 평균 복구 시간과 맞는가
  • slowCallDurationThreshold가 정상 응답 시간의 P99 이상으로 설정되었는가
  • recordExceptions와 ignoreExceptions가 정확히 분류되었는가

모니터링 확인

  • Prometheus에서 resilience4j 메트릭이 정상 수집되는가
  • Grafana 대시보드에 CircuitBreaker 상태, 실패율, 호출 통계가 표시되는가
  • CircuitBreaker OPEN 알림이 Slack, PagerDuty 등으로 전달되는가
  • OPEN 상태 지속 시간을 추적하는 메트릭이 있는가

폴백 전략 검증

  • 모든 CircuitBreaker에 폴백 메서드가 연결되었는가
  • 폴백 메서드가 의미 있는 응답을 반환하는가 (단순 null 반환 금지)
  • 폴백 메서드 자체에서 예외가 발생하면 어떻게 처리되는가
  • 캐시 폴백 사용 시 캐시 만료 정책이 설정되었는가
  • 대체 서비스 폴백 사용 시 해당 서비스의 CircuitBreaker도 설정되었는가

테스트 검증

  • 단위 테스트에서 CircuitBreaker 상태 전이(CLOSED, OPEN, HALF-OPEN)를 검증했는가
  • 통합 테스트에서 실제 타임아웃, 네트워크 오류 시나리오를 재현했는가
  • 카오스 엔지니어링 도구(Chaos Monkey, Litmus)로 장애 주입 테스트를 수행했는가
  • 부하 테스트에서 Bulkhead 포화 시의 동작을 확인했는가

배포 전략

  • 새로운 CircuitBreaker 설정은 카나리 배포로 일부 트래픽에만 먼저 적용한다
  • 설정 변경 시 Config Server(Spring Cloud Config)나 환경 변수를 통해 무중단으로 반영할 수 있는가
  • CircuitBreaker 설정의 Git 이력 관리가 되고 있는가
  • 롤백 계획이 수립되어 있는가

실패 사례와 복구

사례 1: Retry 폭풍으로 인한 다운스트림 과부하

상황: 결제 서비스의 응답 시간이 증가하기 시작했다. 주문 서비스에서 Retry가 maxAttempts=5, 고정 간격 1초로 설정되어 있었다. 주문 서비스 인스턴스가 20대이고, 인스턴스당 초당 100건의 주문이 발생하는 상황에서 결제 서비스에 초당 최대 10,000건(100 x 20 x 5)의 요청이 쏟아졌다.

원인: 지수 백오프(exponential backoff)와 지터(jitter) 없이 고정 간격 재시도를 사용했다. 또한 CircuitBreaker 없이 Retry만 단독 사용하여, 실패 시에도 계속 재시도가 발생했다.

복구 절차:

  1. 즉시 Retry를 비활성화하거나 maxAttempts를 1로 설정하여 재시도를 중단한다
  2. 결제 서비스의 부하가 안정되면, 지수 백오프 + 지터가 적용된 Retry 설정으로 교체한다
  3. CircuitBreaker를 Retry 안쪽에 배치하여, 서킷이 OPEN이면 재시도 자체를 차단한다

재발 방지: Retry는 반드시 CircuitBreaker와 함께 사용하고, 지수 백오프 + 랜덤 지터를 기본으로 적용한다. 고정 간격 재시도는 금지 정책으로 지정한다.

사례 2: 잘못된 예외 분류로 서킷이 영구 OPEN

상황: 재고 서비스에 신규 기능 배포 후, 특정 상품 조회 시 400 Bad Request가 반환되기 시작했다. 이 400 응답이 HttpClientErrorException으로 잡히면서 실패율 집계에 포함되었고, CircuitBreaker가 OPEN으로 전이되어 모든 재고 조회가 차단되었다. 정상 상품 조회까지 불가능해졌다.

원인: recordExceptions에 HttpClientErrorException(4xx)이 포함되어 있었다. 4xx 오류는 클라이언트 측 문제이므로 서킷 브레이커가 개입할 사안이 아니다. 서킷 브레이커는 서버 측 장애(5xx, 타임아웃, 연결 실패)에만 반응해야 한다.

복구 절차:

  1. CircuitBreaker를 DISABLED 상태로 수동 전환하여 즉시 정상 트래픽을 복원한다

// 상태 강제 전환 (DISABLED: 서킷을 비활성화하여 모든 호출 허용)
circuitBreakerRegistry.circuitBreaker("inventoryService")
    .transitionToDisabledState();  // 또는 transitionToClosedState()

  2. recordExceptions에서 HttpClientErrorException을 제거하고 ignoreExceptions에 추가한다
  3. 설정 반영 후 DISABLED 상태를 해제하여 정상 CircuitBreaker 동작으로 복귀한다

재발 방지: 예외 분류 원칙을 문서화한다. 4xx(클라이언트 오류)는 ignoreExceptions, 5xx(서버 오류)는 recordExceptions, 비즈니스 검증 예외는 ignoreExceptions에 분류한다.

사례 3: HALF-OPEN 병목으로 인한 트래픽 손실

상황: 결제 서비스가 복구된 후에도 주문 처리량이 회복되지 않았다. 트래픽 분석 결과, CircuitBreaker가 HALF-OPEN 상태에서 permittedNumberOfCallsInHalfOpenState=1로 설정되어 단 1건의 시험 호출만 허용하고 있었다. 이 시험 호출이 간헐적으로 실패하면서 OPEN과 HALF-OPEN을 반복하는 플래핑(flapping)이 발생했다.

원인: permittedNumberOfCallsInHalfOpenState 값이 지나치게 낮았다. 시험 호출이 1건이면 단일 실패로도 다시 OPEN으로 돌아가므로, 다운스트림 서비스가 간헐적으로만 응답하는 상황에서 CLOSED로 복귀하기 어렵다.

복구 절차:

  1. permittedNumberOfCallsInHalfOpenState를 5-10으로 상향 조정한다
  2. automaticTransitionFromOpenToHalfOpenEnabled: true를 확인하여 수동 개입 없이 자동 전이되게 한다
  3. waitDurationInOpenState를 다운스트림 서비스의 평균 복구 시간에 맞게 조정한다

재발 방지: HALF-OPEN 시험 호출 수는 최소 3건 이상으로 설정하고, 실패율 임계값과 조합하여 통계적으로 유의미한 판단이 가능하도록 한다. OPEN-HALF_OPEN 플래핑을 모니터링 알림에 추가한다.
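시험 호출 수가 플래핑 확률에 미치는 영향은 이항 분포로 추정해 볼 수 있다. 아래는 각 시험 호출이 독립적으로 성공 확률 p를 가진다고 가정한 계산 스케치다(`HalfOpenFlapping`은 설명용 이름이며, 실제 장애 상황에서 호출 결과가 독립적이라는 가정은 단순화다).

```java
// HALF-OPEN 시험 호출 수에 따라 다시 OPEN으로 돌아갈 확률을 추정하는 스케치
class HalfOpenFlapping {
    // 이항 계수 C(n, k)를 곱셈식으로 계산
    static double binom(int n, int k) {
        double r = 1.0;
        for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
        return r;
    }

    // n건의 시험 호출 중 실패율이 threshold(%) 이상일 확률 = 다시 OPEN으로 전이할 확률
    static double reopenProbability(int n, double successProb, double failureRateThreshold) {
        int minFailures = (int) Math.ceil(n * failureRateThreshold / 100.0);
        double p = 0.0;
        for (int k = minFailures; k <= n; k++) {
            p += binom(n, k) * Math.pow(1 - successProb, k) * Math.pow(successProb, n - k);
        }
        return p;
    }
}
```

호출 성공률 90%, 실패율 임계값 50%를 가정하면, 시험 호출 1건일 때는 10% 확률로 다시 OPEN이 되지만 5건이면 그 확률이 1% 아래로 떨어진다. 시험 호출 수를 늘리는 것만으로 플래핑이 크게 줄어드는 이유다.
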

고급 패턴: 커스텀 CircuitBreaker 레지스트리

서비스 수가 많아지면 각 서비스에 대해 개별적으로 CircuitBreaker를 설정하는 것이 비효율적일 수 있다. 동적으로 CircuitBreaker를 생성하고 관리하는 커스텀 레지스트리를 구현할 수 있다.

// DynamicCircuitBreakerFactory.kt
@Component
class DynamicCircuitBreakerFactory(
    private val circuitBreakerRegistry: CircuitBreakerRegistry,
    private val meterRegistry: MeterRegistry,
) {
    private val log = LoggerFactory.getLogger(javaClass)

    /**
     * 서비스 이름 기반으로 CircuitBreaker를 동적 생성한다.
     * 이미 존재하면 기존 인스턴스를 반환한다.
     */
    fun getOrCreate(
        serviceName: String,
        tier: ServiceTier = ServiceTier.STANDARD,
    ): CircuitBreaker {
        return circuitBreakerRegistry.circuitBreaker(serviceName) {
            buildConfigForTier(tier)
        }.also { cb ->
            registerMetrics(cb)
            log.info(
                "CircuitBreaker created/retrieved: name={}, tier={}, state={}",
                serviceName, tier, cb.state
            )
        }
    }

    private fun buildConfigForTier(tier: ServiceTier): CircuitBreakerConfig {
        return when (tier) {
            ServiceTier.CRITICAL -> CircuitBreakerConfig.custom()
                .failureRateThreshold(30f)
                .slowCallRateThreshold(60f)
                .slowCallDurationThreshold(Duration.ofSeconds(2))
                .waitDurationInOpenState(Duration.ofSeconds(60))
                .slidingWindowSize(20)
                .minimumNumberOfCalls(10)
                .permittedNumberOfCallsInHalfOpenState(5)
                .automaticTransitionFromOpenToHalfOpenEnabled(true)
                .build()

            ServiceTier.STANDARD -> CircuitBreakerConfig.custom()
                .failureRateThreshold(50f)
                .slowCallRateThreshold(80f)
                .slowCallDurationThreshold(Duration.ofSeconds(3))
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .slidingWindowSize(10)
                .minimumNumberOfCalls(5)
                .permittedNumberOfCallsInHalfOpenState(3)
                .automaticTransitionFromOpenToHalfOpenEnabled(true)
                .build()

            ServiceTier.BEST_EFFORT -> CircuitBreakerConfig.custom()
                .failureRateThreshold(70f)
                .slowCallRateThreshold(90f)
                .slowCallDurationThreshold(Duration.ofSeconds(5))
                .waitDurationInOpenState(Duration.ofSeconds(15))
                .slidingWindowSize(5)
                .minimumNumberOfCalls(3)
                .permittedNumberOfCallsInHalfOpenState(2)
                .automaticTransitionFromOpenToHalfOpenEnabled(true)
                .build()
        }
    }

    private var metricsBound = false

    private fun registerMetrics(cb: CircuitBreaker) {
        // 레지스트리 단위 바인딩은 한 번이면 충분하므로 중복 바인딩을 방지한다
        if (metricsBound) return
        TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(
            circuitBreakerRegistry
        ).bindTo(meterRegistry)
        metricsBound = true
    }

    enum class ServiceTier {
        CRITICAL,    // 결제, 인증 등 핵심 서비스
        STANDARD,    // 재고, 배송 등 일반 서비스
        BEST_EFFORT  // 알림, 추천 등 비핵심 서비스
    }
}

이 팩토리를 사용하면 서비스 티어에 따라 자동으로 적절한 CircuitBreaker 설정이 적용된다. CRITICAL 서비스는 낮은 실패율 임계값과 긴 대기 시간으로 보수적으로 보호하고, BEST_EFFORT 서비스는 높은 임계값으로 유연하게 운영한다.

테스트 전략

CircuitBreaker 도입 시 반드시 작성해야 할 테스트 케이스를 정리한다.

// CircuitBreakerIntegrationTest.java
@SpringBootTest
@AutoConfigureMockMvc
class CircuitBreakerIntegrationTest {

    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    @Autowired
    private MockMvc mockMvc;

    @MockBean
    private RestClient paymentRestClient;

    @Test
    @DisplayName("Transitions to OPEN when the failure rate exceeds the threshold")
    void shouldTransitionToOpenOnFailureThreshold() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.reset(); // Reset state for test isolation

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);

        // With slidingWindowSize=20 and failureRateThreshold=40 for this instance,
        // the breaker can open once minimumNumberOfCalls have been recorded and
        // the failure rate reaches 40%: 5 failures out of 10 calls = 50% -> OPEN
        for (int i = 0; i < 5; i++) {
            cb.onSuccess(100, TimeUnit.MILLISECONDS);
        }
        for (int i = 0; i < 5; i++) {
            cb.onError(100, TimeUnit.MILLISECONDS, new IOException("connection refused"));
        }

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
        assertThat(cb.getMetrics().getFailureRate()).isGreaterThanOrEqualTo(40f);
    }

    @Test
    @DisplayName("Transitions from OPEN to HALF-OPEN after waitDuration elapses")
    void shouldTransitionToHalfOpenAfterWaitDuration() throws Exception {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.reset();
        cb.transitionToOpenState();

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);

        // Simulate elapse of waitDurationInOpenState
        // (in tests, either configure a short waitDuration or transition directly)
        cb.transitionToHalfOpenState();

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.HALF_OPEN);
    }

    @Test
    @DisplayName("Returns to CLOSED when trial calls succeed in HALF-OPEN")
    void shouldTransitionToClosedOnSuccessfulTrialCalls() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.reset();
        cb.transitionToOpenState();
        cb.transitionToHalfOpenState();

        // Succeed permittedNumberOfCallsInHalfOpenState (5) times
        for (int i = 0; i < 5; i++) {
            cb.onSuccess(50, TimeUnit.MILLISECONDS);
        }

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
    }

    @Test
    @DisplayName("The fallback method is invoked when the CircuitBreaker is OPEN")
    void shouldInvokeFallbackWhenCircuitIsOpen() throws Exception {
        // Force the circuit open (FORCED_OPEN)
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.transitionToForcedOpenState();

        mockMvc.perform(post("/api/v1/orders")
                .contentType(MediaType.APPLICATION_JSON)
                .content("{\"productId\": \"P001\", \"quantity\": 1}"))
            .andExpect(status().isOk())
            .andExpect(jsonPath("$.payment.status").value("QUEUED"))
            .andExpect(jsonPath("$.payment.message").exists());

        // Restore state after the test
        cb.transitionToClosedState();
    }
}

Circuit Breaker Pattern and Resilience4j Practical Implementation Guide: From Failure Isolation to Recovery

Introduction

In microservices architecture, inter-service network calls are inherently unreliable. Network latency, timeouts, and downstream service failures occur routinely, and without proper controls, a single service failure can propagate across the entire system as a cascading failure. A representative example occurred in late 2024, when response latency in a single payment gateway on a large e-commerce platform cascaded to paralyze the order service, inventory service, and notification service.

The Circuit Breaker pattern is a failure isolation mechanism inspired by electrical circuit breakers. Since Michael Nygard first introduced it in Release It! in 2007, and through Martin Fowler's blog post, it has become a core pattern in the microservices world. Netflix's Hystrix was the first widely adopted implementation, but after entering maintenance mode in 2018, Resilience4j emerged as the de facto standard.

This article covers everything from the Circuit Breaker state machine operating principles, to integrating Resilience4j's core modules -- CircuitBreaker, Retry, Bulkhead, and RateLimiter -- in a Spring Boot 3 environment, monitoring with Grafana dashboards, and recovery strategies for real failure scenarios at an operational level.

Circuit Breaker State Machine

The core of the Circuit Breaker is a Finite State Machine that manages transitions between three states (CLOSED, OPEN, HALF-OPEN) and two special states (DISABLED, FORCED_OPEN).

State Transition Diagram

                  failure rate >= threshold
    ┌──────────┐ ───────────────────────────────> ┌──────────┐
    │  CLOSED  │                                  │   OPEN   │
    │ (normal) │                                  │(blocked) │
    └──────────┘                                  └──────────┘
         ▲                                             │
         │ trial call success rate >= threshold        │ waitDuration elapsed
         │                                             ▼
         │                                     ┌────────────────┐
         └──────────────────────────────────── │   HALF-OPEN    │
                                               │(trial allowed) │
                                               └────────────────┘
                                                       │
                                                       │ trial call failure rate >= threshold
                                                       ▼
                                               ┌──────────┐
                                               │   OPEN   │
                                               │ (blocked │
                                               │  again)  │
                                               └──────────┘

Detailed Behavior by State

| State | Request Handling | Transition Condition | Metric Collection |
|---|---|---|---|
| CLOSED | All requests pass through | Transitions to OPEN when the failure rate in the sliding window reaches the threshold | Records success/failure/slow calls |
| OPEN | All requests rejected immediately (CallNotPermittedException) | Transitions to HALF-OPEN after waitDurationInOpenState elapses | Records rejected call count |
| HALF-OPEN | Allows only permittedNumberOfCallsInHalfOpenState calls | Transitions to CLOSED or OPEN based on trial call results | Records trial call success/failure |
| DISABLED | All requests pass through (circuit inactive) | Manual transition only | No metric collection |
| FORCED_OPEN | All requests rejected immediately | Manual transition only | Records rejected call count |
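The transitions in the table can be condensed into a minimal state machine. The following is an illustrative plain-Java sketch (class and method names are invented for this example, and it ignores slow-call tracking, sliding windows, and the special states), not Resilience4j's actual implementation:

```java
// Illustrative three-state circuit breaker. Real implementations also track
// slow calls, use a sliding window, and support DISABLED/FORCED_OPEN.
class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int calls = 0, failures = 0;             // CLOSED-state counters
    private int trialCalls = 0, trialFailures = 0;   // HALF_OPEN-state counters
    private long openedAtMillis;

    private final int minimumNumberOfCalls;
    private final float failureRateThreshold;        // percent
    private final long waitDurationMillis;
    private final int permittedTrialCalls;

    MiniCircuitBreaker(int minCalls, float threshold, long waitMillis, int trials) {
        this.minimumNumberOfCalls = minCalls;
        this.failureRateThreshold = threshold;
        this.waitDurationMillis = waitMillis;
        this.permittedTrialCalls = trials;
    }

    /** Returns false when the call must be rejected (OPEN, wait not elapsed). */
    synchronized boolean tryAcquirePermission(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= waitDurationMillis) {
            state = State.HALF_OPEN;                 // automatic OPEN -> HALF_OPEN
            trialCalls = 0; trialFailures = 0;
        }
        return state != State.OPEN;
    }

    synchronized void record(boolean success, long nowMillis) {
        if (state == State.HALF_OPEN) {
            trialCalls++;
            if (!success) trialFailures++;
            if (trialCalls == permittedTrialCalls) {
                // Trial window complete: close again, or re-open
                float rate = 100f * trialFailures / trialCalls;
                if (rate >= failureRateThreshold) open(nowMillis); else close();
            }
            return;
        }
        calls++;
        if (!success) failures++;
        if (calls >= minimumNumberOfCalls
                && 100f * failures / calls >= failureRateThreshold) {
            open(nowMillis);
        }
    }

    private void open(long now) { state = State.OPEN; openedAtMillis = now; }
    private void close() { state = State.CLOSED; calls = 0; failures = 0; }

    synchronized State getState() { return state; }
}
```

Even at this size, the sketch reproduces the key guarantees: no evaluation before minimumNumberOfCalls, rejection without a downstream call while OPEN, and a bounded trial budget in HALF_OPEN.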

Sliding Window Type Comparison

Resilience4j provides two sliding window types.

| Aspect | COUNT_BASED | TIME_BASED |
|---|---|---|
| Basis | Last N calls | Calls in the last N seconds |
| Config example | slidingWindowSize: 10 | slidingWindowSize: 60 |
| Memory usage | Fixed (array of N results) | Variable (partial aggregations over N seconds) |
| Suited for | Services with consistent call frequency | Services with irregular call frequency |
| Evaluation | After the Nth call | Time window evaluated on each call |

COUNT_BASED is internally implemented as a circular bit array of size N, recording each call result in O(1) and calculating the failure rate in constant time. TIME_BASED uses N partial aggregation buckets, each aggregating call results for one second.
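As a rough sketch of that bookkeeping, a fixed-size ring buffer gives O(1) recording and a constant-time failure-rate query with constant memory (illustrative plain Java; the class name is invented, and the real implementation also aggregates slow-call durations):

```java
// Ring buffer over the last N call outcomes: O(1) record, O(1) failure rate.
class CountBasedWindow {
    private final boolean[] failed;   // ring of the last N outcomes
    private int index = 0;            // next slot to overwrite
    private int recorded = 0;         // filled slots (<= N)
    private int failures = 0;         // running failure count in the window

    CountBasedWindow(int size) { this.failed = new boolean[size]; }

    void record(boolean success) {
        if (recorded == failed.length && failed[index]) {
            failures--;               // evict the outcome being overwritten
        }
        failed[index] = !success;
        if (!success) failures++;
        index = (index + 1) % failed.length;
        if (recorded < failed.length) recorded++;
    }

    /** Failure rate in percent over the calls currently in the window. */
    float failureRate() {
        return recorded == 0 ? 0f : 100f * failures / recorded;
    }
}
```

Because eviction and insertion each touch one slot, the cost per call is independent of window size, which is why COUNT_BASED memory and CPU usage stay flat under load.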

Resilience4j Architecture

Transitioning from Hystrix to Resilience4j

After Netflix Hystrix entered maintenance mode in 2018, Resilience4j established itself as the standard fault tolerance library in the JVM ecosystem.

| Comparison | Netflix Hystrix | Resilience4j |
|---|---|---|
| Status | Maintenance mode (no updates since 2018) | Active development (2.3.0 released in 2025) |
| Java version | Java 8+ | Java 17+ (Spring Boot 3 support) |
| Dependencies | Multiple (Archaius, RxJava, etc.) | None in 2.x (1.x depended only on Vavr) |
| Architecture | Monolithic (all features included) | Modular (pick only the modules you need) |
| Thread model | Separate thread pool required | Semaphore-based (thread pool optional) |
| Configuration | Archaius required | Both application.yml and programmatic |
| Reactive support | RxJava 1 | Native Reactor, RxJava 2/3 support |
| Functional interfaces | Limited | Full support (Supplier, Function, Runnable, etc.) |
| Monitoring | Hystrix Dashboard | Micrometer integration (Prometheus, Grafana) |

Resilience4j Core Modules

Resilience4j provides five core modules that can be used independently or in combination.

| Module | Role | Key Configuration |
|---|---|---|
| CircuitBreaker | Trips the circuit based on failure rate | failureRateThreshold, slidingWindowSize |
| Retry | Retries on failure | maxAttempts, waitDuration, backoff |
| Bulkhead | Limits concurrent calls (isolation) | maxConcurrentCalls, maxWaitDuration |
| RateLimiter | Limits calls per time unit | limitForPeriod, limitRefreshPeriod |
| TimeLimiter | Limits call duration | timeoutDuration, cancelRunningFuture |

When combining via annotations, the application order is as follows:

Outer (evaluated first) ──────────────────────────────────> Inner (evaluated last)
Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead

This order is the default priority when Resilience4j processes annotations via Spring AOP. You can customize the order using properties like resilience4j.circuitbreaker.circuitBreakerAspectOrder.
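The practical effect of that ordering can be demonstrated without the library: wrapping a call-counting layer inside versus outside a hand-rolled retry shows how many times the inner layer is entered per request (illustrative plain Java; the names here are invented, and the retry is deliberately simplistic):

```java
import java.util.function.Supplier;

// Why decorator order matters: with Retry OUTSIDE another layer, each retry
// attempt passes through that layer again; with Retry INSIDE, the outer layer
// sees only one aggregated outcome.
class DecoratorOrderDemo {
    /** Retries the supplier up to maxAttempts times on RuntimeException. */
    static <T> Supplier<T> withRetry(Supplier<T> inner, int maxAttempts) {
        return () -> {
            RuntimeException last = null;
            for (int i = 0; i < maxAttempts; i++) {
                try { return inner.get(); } catch (RuntimeException e) { last = e; }
            }
            throw last;
        };
    }

    /** Counts how many calls actually reach this layer. */
    static <T> Supplier<T> counting(Supplier<T> inner, int[] counter) {
        return () -> { counter[0]++; return inner.get(); };
    }
}
```

In the default order, the CircuitBreaker plays the role of the inner counting layer: every retry attempt is recorded by it, which is exactly the behavior analyzed in the troubleshooting section later.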

Spring Boot 3 Integration Configuration

Dependency Setup

// build.gradle.kts (Spring Boot 3.3+ / Resilience4j 2.2+)
plugins {
    id("org.springframework.boot") version "3.3.5"
    id("io.spring.dependency-management") version "1.1.6"
    kotlin("jvm") version "1.9.25"
    kotlin("plugin.spring") version "1.9.25"
}

dependencies {
    // Resilience4j Spring Boot 3 Starter
    implementation("io.github.resilience4j:resilience4j-spring-boot3:2.2.0")

    // Individual modules (included in starter but explicit declaration recommended)
    implementation("io.github.resilience4j:resilience4j-circuitbreaker")
    implementation("io.github.resilience4j:resilience4j-retry")
    implementation("io.github.resilience4j:resilience4j-bulkhead")
    implementation("io.github.resilience4j:resilience4j-ratelimiter")
    implementation("io.github.resilience4j:resilience4j-timelimiter")

    // Micrometer + Prometheus (monitoring)
    implementation("io.github.resilience4j:resilience4j-micrometer")
    implementation("io.micrometer:micrometer-registry-prometheus")

    // Spring Boot Actuator
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("org.springframework.boot:spring-boot-starter-aop")
    implementation("org.springframework.boot:spring-boot-starter-web")

    // Kotlin Coroutines (optional)
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactor")

    testImplementation("org.springframework.boot:spring-boot-starter-test")
}

Integrated Configuration File

# application.yml - Resilience4j integrated configuration
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 3s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        failureRateThreshold: 40
        waitDurationInOpenState: 60s
        slidingWindowSize: 20
      inventoryService:
        baseConfig: default
        failureRateThreshold: 60
        slowCallDurationThreshold: 5s
      notificationService:
        baseConfig: default
        failureRateThreshold: 70
        waitDurationInOpenState: 15s

  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2.0
        exponentialMaxWaitDuration: 10s
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        maxAttempts: 2
        waitDuration: 2s
      inventoryService:
        baseConfig: default
        maxAttempts: 4
      notificationService:
        baseConfig: default
        maxAttempts: 5
        waitDuration: 500ms

  bulkhead:
    configs:
      default:
        maxConcurrentCalls: 25
        maxWaitDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        maxConcurrentCalls: 15
      inventoryService:
        baseConfig: default
        maxConcurrentCalls: 30
      notificationService:
        baseConfig: default
        maxConcurrentCalls: 50

  ratelimiter:
    configs:
      default:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
        timeoutDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        limitForPeriod: 50
      inventoryService:
        baseConfig: default
        limitForPeriod: 200

  timelimiter:
    configs:
      default:
        timeoutDuration: 5s
        cancelRunningFuture: true
    instances:
      paymentGateway:
        baseConfig: default
        timeoutDuration: 10s
      inventoryService:
        baseConfig: default
        timeoutDuration: 3s

# Actuator metric exposure
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus,circuitbreakers,retries
  endpoint:
    health:
      show-details: always
  health:
    circuitbreakers:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        resilience4j.circuitbreaker.calls: true
        resilience4j.retry.calls: true
    tags:
      application: order-service

A notable point in the configuration is defining a base profile with configs.default and specifying baseConfig: default for each instance to inherit common settings. You only need to override thresholds specific to each service to minimize configuration duplication.

CircuitBreaker Practical Implementation

Annotation-Based Implementation (Kotlin)

// PaymentGatewayClient.kt
@Service
class PaymentGatewayClient(
    private val restClient: RestClient,
    private val paymentRetryQueue: PaymentRetryQueue,
    private val paymentCacheStore: PaymentCacheStore,
) {
    companion object {
        private val log = LoggerFactory.getLogger(PaymentGatewayClient::class.java)
        const val CB_NAME = "paymentGateway"
    }

    @CircuitBreaker(name = CB_NAME, fallbackMethod = "paymentFallback")
    @Retry(name = CB_NAME)
    @Bulkhead(name = CB_NAME)
    fun processPayment(request: PaymentRequest): PaymentResponse {
        log.info("Calling payment gateway for orderId={}", request.orderId)

        val response = restClient.post()
            .uri("https://payment-api.internal/v2/charges")
            .contentType(MediaType.APPLICATION_JSON)
            .body(request)
            .retrieve()
            .body(PaymentResponse::class.java)
            ?: throw PaymentGatewayException("Empty response from payment gateway")

        log.info("Payment processed: orderId={}, txId={}", request.orderId, response.transactionId)
        return response
    }

    /**
     * Fallback method: Called when CircuitBreaker is OPEN or an exception occurs.
     * The method signature must match the original + accept an Exception as the last parameter.
     */
    private fun paymentFallback(request: PaymentRequest, ex: Exception): PaymentResponse {
        log.warn(
            "Payment fallback activated: orderId={}, reason={}",
            request.orderId, ex.message
        )

        return when (ex) {
            is CallNotPermittedException -> {
                // CircuitBreaker OPEN state: enqueue for async processing
                paymentRetryQueue.enqueue(request)
                PaymentResponse(
                    orderId = request.orderId,
                    status = PaymentStatus.QUEUED,
                    message = "Payment has been queued. It will be processed shortly.",
                    transactionId = null,
                )
            }
            is BulkheadFullException -> {
                // Bulkhead saturated: prompt immediate retry
                PaymentResponse(
                    orderId = request.orderId,
                    status = PaymentStatus.RETRY_LATER,
                    message = "Too many payment requests at the moment. Please try again shortly.",
                    transactionId = null,
                )
            }
            else -> {
                // Other exceptions: return cached payment info if available
                val cached = paymentCacheStore.getLastSuccess(request.orderId)
                if (cached != null) {
                    log.info("Returning cached payment for orderId={}", request.orderId)
                    cached.copy(status = PaymentStatus.CACHED)
                } else {
                    paymentRetryQueue.enqueue(request)
                    PaymentResponse(
                        orderId = request.orderId,
                        status = PaymentStatus.PENDING,
                        message = "An error occurred during payment processing. Automatic retry in progress.",
                        transactionId = null,
                    )
                }
            }
        }
    }
}

Programmatic Implementation (Java)

Instead of annotations, you can use the CircuitBreakerRegistry directly to dynamically create circuit breakers or change configurations at runtime.

// InventoryServiceClient.java
@Service
@Slf4j
public class InventoryServiceClient {

    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;
    private final RestClient restClient;

    public InventoryServiceClient(
            CircuitBreakerRegistry cbRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry,
            RestClient.Builder restClientBuilder) {

        this.circuitBreaker = cbRegistry.circuitBreaker("inventoryService");
        this.retry = retryRegistry.retry("inventoryService");
        this.bulkhead = bulkheadRegistry.bulkhead("inventoryService");
        this.restClient = restClientBuilder
                .baseUrl("https://inventory-api.internal")
                .build();

        // Register event listeners
        registerEventListeners();
    }

    public InventoryResponse checkStock(String productId, int quantity) {
        // Decorator chain: Bulkhead -> CircuitBreaker -> Retry -> actual call
        Supplier<InventoryResponse> decorated = Decorators
                .ofSupplier(() -> doCheckStock(productId, quantity))
                .withBulkhead(bulkhead)
                .withCircuitBreaker(circuitBreaker)
                .withRetry(retry)
                .withFallback(
                    List.of(
                        CallNotPermittedException.class,
                        BulkheadFullException.class,
                        IOException.class
                    ),
                    ex -> stockFallback(productId, quantity, ex)
                )
                .decorate();

        return decorated.get();
    }

    private InventoryResponse doCheckStock(String productId, int quantity) {
        return restClient.get()
                .uri("/v1/stock/{productId}?qty={qty}", productId, quantity)
                .retrieve()
                .body(InventoryResponse.class);
    }

    private InventoryResponse stockFallback(
            String productId, int quantity, Throwable ex) {
        log.warn("Inventory fallback: productId={}, reason={}", productId, ex.getMessage());
        // When stock is uncertain, accept the order but schedule async verification
        return InventoryResponse.builder()
                .productId(productId)
                .available(true)
                .reservationStatus(ReservationStatus.TENTATIVE)
                .message("Stock check delayed: tentative approval with async verification scheduled")
                .build();
    }

    private void registerEventListeners() {
        circuitBreaker.getEventPublisher()
            .onStateTransition(event -> {
                log.warn("[CircuitBreaker] {} state: {} -> {}",
                    event.getCircuitBreakerName(),
                    event.getStateTransition().getFromState(),
                    event.getStateTransition().getToState());
            })
            .onError(event ->
                log.error("[CircuitBreaker] {} error: {} ({}ms)",
                    event.getCircuitBreakerName(),
                    event.getThrowable().getMessage(),
                    event.getElapsedDuration().toMillis())
            )
            .onSuccess(event ->
                log.debug("[CircuitBreaker] {} success ({}ms)",
                    event.getCircuitBreakerName(),
                    event.getElapsedDuration().toMillis())
            )
            .onCallNotPermitted(event ->
                log.warn("[CircuitBreaker] {} call not permitted (OPEN state)",
                    event.getCircuitBreakerName())
            );

        retry.getEventPublisher()
            .onRetry(event ->
                log.info("[Retry] {} attempt #{} (wait: {}ms)",
                    event.getName(),
                    event.getNumberOfRetryAttempts(),
                    event.getWaitInterval().toMillis())
            );
    }
}

Retry, Bulkhead, and RateLimiter Combinations

Retry and Exponential Backoff

The most important aspect of retry strategy is combining exponential backoff with jitter. Fixed-interval retries cause a thundering herd problem where multiple clients retry simultaneously, concentrating load on the server.

// Programmatic RetryConfig customization
@Configuration
class ResilienceConfig {

    @Bean
    fun customRetryConfig(): RetryConfig {
        return RetryConfig.custom<RetryConfig>()
            .maxAttempts(4)
            .intervalFunction(
                // Exponential backoff + jitter: 1s, 2s(+jitter), 4s(+jitter), 8s(+jitter)
                IntervalFunction.ofExponentialRandomBackoff(
                    Duration.ofSeconds(1),   // initial wait duration
                    2.0,                     // multiplier
                    Duration.ofSeconds(15)   // max wait duration
                )
            )
            .retryOnException { ex ->
                // Determine retry-eligible exceptions
                when (ex) {
                    is IOException -> true
                    is TimeoutException -> true
                    is HttpServerErrorException -> true
                    is ConnectException -> true
                    else -> false
                }
            }
            .ignoreExceptions(
                BusinessValidationException::class.java,
                IllegalArgumentException::class.java
            )
            .failAfterMaxAttempts(true) // Throw MaxRetriesExceededException after max retries
            .build()
    }

    @Bean
    fun retryRegistry(customRetryConfig: RetryConfig): RetryRegistry {
        return RetryRegistry.of(customRetryConfig)
    }
}
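The wait schedule this configuration produces can be approximated with plain arithmetic. The sketch below is illustrative (class and method names are invented); Resilience4j's IntervalFunction applies the multiplier, cap, and randomization internally, so treat this only as a model of the schedule:

```java
import java.util.Random;

// Exponential backoff with jitter: wait_n = min(initial * multiplier^(n-1), cap),
// then randomized so concurrent clients desynchronize their retries.
class BackoffSchedule {
    static long baseWaitMillis(long initialMillis, double multiplier,
                               long capMillis, int attempt) {
        double wait = initialMillis * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(wait, capMillis);
    }

    static long jitteredWaitMillis(long baseMillis, double jitterFactor, Random rnd) {
        // Spread uniformly in [base*(1-f), base*(1+f)]
        double delta = baseMillis * jitterFactor;
        return (long) (baseMillis - delta + rnd.nextDouble() * 2 * delta);
    }
}
```

With initial=1s, multiplier=2.0, cap=15s, the base schedule is 1s, 2s, 4s, 8s, then 15s capped; jitter then shifts each wait randomly, which is what breaks up the thundering herd.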

Bulkhead: Semaphore vs Thread Pool

Bulkhead is a pattern inspired by ship compartment walls (bulkheads), preventing a single service call from monopolizing all resources. Resilience4j provides two Bulkhead implementations.

| Aspect | SemaphoreBulkhead | ThreadPoolBulkhead |
|---|---|---|
| Isolation level | Limits concurrent calls | Executes in a separate thread pool |
| Calling thread | Runs on the caller's thread | Runs on a dedicated pool thread |
| Return type | Synchronous | CompletionStage |
| Overhead | Low | Thread context-switching cost |
| Suited for | Most HTTP calls | CPU-intensive tasks, full isolation |
| Configuration | maxConcurrentCalls, maxWaitDuration | maxThreadPoolSize, coreThreadPoolSize, queueCapacity |

# ThreadPoolBulkhead configuration example
resilience4j:
  thread-pool-bulkhead:
    instances:
      heavyProcessing:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 100ms
        writableStackTraceEnabled: true
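A SemaphoreBulkhead is, at its core, a bounded permit count with a bounded wait. Here is a minimal illustrative sketch on top of java.util.concurrent.Semaphore (not the library class; the IllegalStateException merely stands in for BulkheadFullException):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Caller threads compete for maxConcurrentCalls permits; a caller that cannot
// get one within maxWaitMillis is rejected instead of queuing indefinitely.
class MiniSemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    MiniSemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls, true);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead full"); // stands in for BulkheadFullException
        }
        try {
            return call.get();     // runs on the caller's own thread
        } finally {
            permits.release();
        }
    }

    int availablePermits() { return permits.availablePermits(); }
}
```

Note that the protected call runs on the caller's thread, which is why the semaphore variant adds almost no overhead but cannot interrupt a hung call the way a dedicated pool plus TimeLimiter can.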

RateLimiter Configuration and Usage

RateLimiter limits the number of calls allowed per time unit, preventing external API rate limit violations or protecting internal services from overload.

// RateLimiter and CircuitBreaker combination
@Service
@Slf4j
public class ExternalApiClient {

    private final RestClient restClient;

    @CircuitBreaker(name = "externalApi", fallbackMethod = "apiFallback")
    @RateLimiter(name = "externalApi")
    @Retry(name = "externalApi")
    public ApiResponse callExternalApi(ApiRequest request) {
        log.debug("Calling external API: endpoint={}", request.getEndpoint());

        return restClient.post()
                .uri(request.getEndpoint())
                .body(request.getPayload())
                .retrieve()
                .body(ApiResponse.class);
    }

    private ApiResponse apiFallback(ApiRequest request, RequestNotPermitted ex) {
        // Rejected by RateLimiter
        log.warn("Rate limit exceeded for external API: {}", request.getEndpoint());
        return ApiResponse.rateLimited(
                "Request limit exceeded. " +
                "Check limitForPeriod settings or try again later."
        );
    }

    private ApiResponse apiFallback(ApiRequest request, Exception ex) {
        // Other exceptions (CircuitBreaker OPEN, network errors, etc.)
        log.warn("External API fallback: endpoint={}, reason={}",
                request.getEndpoint(), ex.getMessage());
        return ApiResponse.error("External API call failed: " + ex.getMessage());
    }
}
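The limitForPeriod/limitRefreshPeriod mechanics can be approximated by a counter that resets every refresh period. This is an illustrative simplification (class name invented); Resilience4j's AtomicRateLimiter instead computes permission timing lock-free and can make callers wait up to timeoutDuration rather than rejecting immediately:

```java
// Grants at most limitForPeriod permits per refresh period; callers beyond
// the limit are rejected immediately in this simplified model.
class MiniRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodMillis;
    private long windowStartMillis;
    private int usedInWindow = 0;

    MiniRateLimiter(int limitForPeriod, long refreshPeriodMillis, long nowMillis) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodMillis = refreshPeriodMillis;
        this.windowStartMillis = nowMillis;
    }

    synchronized boolean acquirePermission(long nowMillis) {
        if (nowMillis - windowStartMillis >= refreshPeriodMillis) {
            windowStartMillis = nowMillis;   // new period: replenish permits
            usedInWindow = 0;
        }
        if (usedInWindow >= limitForPeriod) return false;
        usedInWindow++;
        return true;
    }
}
```

A rejected permission in the real library surfaces as RequestNotPermitted, which is exactly the exception the dedicated fallback overload above handles.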

An important note when overloading fallback methods: Resilience4j selects the most specific fallback based on exception type. By separating RequestNotPermitted (RateLimiter rejection) and Exception (general exceptions), you can execute different fallback logic based on the exception cause.

Grafana Monitoring Dashboard

Prometheus Metric Collection

Resilience4j automatically exposes metrics via Micrometer. The following metrics are available at Spring Boot Actuator's /actuator/prometheus endpoint:

# CircuitBreaker state check (0=CLOSED, 1=OPEN, 2=HALF_OPEN, 3=DISABLED, 4=FORCED_OPEN)
resilience4j_circuitbreaker_state{name="paymentGateway"}

# Failure rate (%)
resilience4j_circuitbreaker_failure_rate{name="paymentGateway"}

# Slow call rate (%)
resilience4j_circuitbreaker_slow_call_rate{name="paymentGateway"}

# Call statistics (kind: successful, failed, ignored, not_permitted)
rate(resilience4j_circuitbreaker_calls_seconds_count{name="paymentGateway"}[5m])

# Call latency distribution (histogram)
histogram_quantile(0.95,
  rate(resilience4j_circuitbreaker_calls_seconds_bucket{name="paymentGateway"}[5m])
)

# Retry count
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="successful_with_retry"}[1h])
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="failed_with_retry"}[1h])

# Bulkhead available concurrent calls
resilience4j_bulkhead_available_concurrent_calls{name="paymentGateway"}

# RateLimiter available permissions
resilience4j_ratelimiter_available_permissions{name="externalApi"}

Grafana Dashboard JSON Configuration

Here are the essential panels and their PromQL queries for the Grafana dashboard.

Panel 1 - CircuitBreaker State Gauge

resilience4j_circuitbreaker_state{application="order-service"}

Use value mapping to map 0=CLOSED (green), 1=OPEN (red), 2=HALF_OPEN (yellow).

Panel 2 - Failure Rate Trend (Time Series)

resilience4j_circuitbreaker_failure_rate{application="order-service", name=~".*"}

Add a threshold line (failureRateThreshold) to visually identify when the circuit transitions to OPEN.

Panel 3 - Call Success/Failure Ratio (Stacked Bar)

sum by (name, kind) (
  rate(resilience4j_circuitbreaker_calls_seconds_count{application="order-service"}[5m])
)

Panel 4 - P95 Response Time (Time Series)

histogram_quantile(0.95,
  sum by (le, name) (
    rate(resilience4j_circuitbreaker_calls_seconds_bucket{application="order-service"}[5m])
  )
)

Panel 5 - Bulkhead Concurrent Call Status (Gauge)

resilience4j_bulkhead_max_allowed_concurrent_calls{application="order-service"}
- resilience4j_bulkhead_available_concurrent_calls{application="order-service"}

Alert Rule Configuration

Register the following alert rules in Grafana or Prometheus Alertmanager.

# prometheus-alerts.yml
groups:
  - name: resilience4j_alerts
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: 'CircuitBreaker OPEN: {{ $labels.name }}'
          description: >
            The {{ $labels.name }} circuit breaker in service
            {{ $labels.application }} is in OPEN state.
            Check downstream service failures.

      - alert: HighFailureRate
        expr: resilience4j_circuitbreaker_failure_rate > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'High failure rate: {{ $labels.name }} ({{ $value }}%)'
          description: >
            The failure rate of {{ $labels.name }} is {{ $value }}%,
            exceeding the warning threshold (30%).

      - alert: BulkheadSaturation
        expr: >
          (resilience4j_bulkhead_max_allowed_concurrent_calls
          - resilience4j_bulkhead_available_concurrent_calls)
          / resilience4j_bulkhead_max_allowed_concurrent_calls > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Bulkhead 80% saturated: {{ $labels.name }}'

      - alert: ExcessiveRetries
        expr: >
          rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m])
          / rate(resilience4j_retry_calls_total[5m]) > 0.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: 'Retry failure rate exceeds 50%: {{ $labels.name }}'

Troubleshooting Guide

Issue 1: CircuitBreaker Does Not Transition to OPEN

Symptoms: Failures are clearly occurring, but the circuit remains in CLOSED state.

Root Cause Analysis:

  • minimumNumberOfCalls has not been reached. The default is 100, so a low-traffic service may never accumulate enough calls in the window for the failure rate to be evaluated at all.
  • The exception is included in ignoreExceptions. Check that not only business exceptions but also unintended exceptions are in the ignore list.
  • The exception is not included in recordExceptions. When recordExceptions is specified, exceptions not in the list are not recorded as failures.

Resolution: Adjust minimumNumberOfCalls to match the service call frequency, and review the recordExceptions and ignoreExceptions lists.

Issue 2: More Calls Than Expected When Combining Retry and CircuitBreaker

Symptoms: maxAttempts is set to 3, but more than 5 calls are recorded on the downstream service.

Root Cause Analysis: In the annotation application order, Retry sits outside CircuitBreaker. Therefore, after the CircuitBreaker records a failure, Retry attempts the call again through the CircuitBreaker. If trial calls are added during the CircuitBreaker's HALF-OPEN state, the total call count can exceed expectations.

Resolution: Set Retry's maxAttempts conservatively, and calculate the maximum number of calls produced by the combination of CircuitBreaker's slidingWindowSize and Retry's maxAttempts to predict downstream load.
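The arithmetic behind that prediction is simple enough to write down (illustrative; it assumes Retry wraps CircuitBreaker per the default aspect order, and that the circuit stays CLOSED for the whole request):

```java
// With Retry around CircuitBreaker, every retry attempt that finds the circuit
// CLOSED reaches the downstream service, so one client request can produce up
// to maxAttempts downstream calls; across a recovery cycle, HALF-OPEN trial
// calls add permittedNumberOfCallsInHalfOpenState on top.
class DownstreamLoadEstimate {
    static int worstCasePerRequest(int retryMaxAttempts) {
        return retryMaxAttempts;   // each attempt is a separate downstream call
    }

    static int worstCasePerRecoveryCycle(int retryMaxAttempts,
                                         int permittedCallsInHalfOpen) {
        return retryMaxAttempts + permittedCallsInHalfOpen;
    }
}
```

For the paymentGateway settings above (maxAttempts=2, permittedNumberOfCallsInHalfOpenState given in its CircuitBreaker config), this bound is what the downstream capacity plan should assume.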

Issue 3: Fallback Method Is Not Being Called

Symptoms: CircuitBreaker is OPEN, but CallNotPermittedException is propagated directly to the client.

Root Cause Analysis: The fallback method's signature does not exactly match the original method. The fallback method must accept all parameters of the original method in the same order and type, plus an Exception (or specific exception type) as the last parameter.

Resolution: Review the fallback method signature. The return type must also exactly match the original. Below are correct examples:

// Original method
@CircuitBreaker(name = "svc", fallbackMethod = "fallback")
public OrderResponse getOrder(String orderId, boolean includeDetails) { ... }

// Correct fallback (same parameters + Exception added)
private OrderResponse fallback(String orderId, boolean includeDetails, Exception ex) { ... }

// Incorrect fallback - compiles but fails to match at runtime
private OrderResponse fallback(String orderId, Exception ex) { ... }  // Missing parameter
private void fallback(String orderId, boolean includeDetails, Exception ex) { ... }  // Return type mismatch

Issue 4: Memory Usage Increases with TIME_BASED Window

Symptoms: Using a TIME_BASED sliding window and heap memory usage gradually increases.

Root Cause Analysis: The slidingWindowSize is set too large. With a TIME_BASED window, Resilience4j maintains one partial aggregation bucket per second, so slidingWindowSize=600 (10 minutes) keeps 600 buckets alive; memory consumption scales with the window size.

Resolution: For TIME_BASED, set slidingWindowSize to 60 seconds or less, and observe long-term trends through Prometheus metrics. In memory-sensitive environments, prefer COUNT_BASED.
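
A bounded TIME_BASED configuration along these lines might look like the sketch below (the 60-second window matches the recommendation above; other values are illustrative):

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class TimeBasedWindowConfig {
    // Sketch: keep the TIME_BASED window small; each second of window
    // size corresponds to one aggregation bucket held in memory.
    public static CircuitBreakerConfig bounded() {
        return CircuitBreakerConfig.custom()
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.TIME_BASED)
            .slidingWindowSize(60) // 60 seconds = 60 buckets
            .slowCallDurationThreshold(Duration.ofSeconds(3))
            .minimumNumberOfCalls(10)
            .build();
    }
}
```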

Operations Checklist

Here are items that must be verified before deploying Circuit Breakers to production.

Configuration Verification

  • Is the ratio of slidingWindowSize to minimumNumberOfCalls appropriate? (minimumNumberOfCalls should be 50% or less of slidingWindowSize)
  • Is failureRateThreshold set according to service characteristics? (Payment: 30-40%, Notification: 60-70%)
  • Does waitDurationInOpenState match the downstream service's average recovery time?
  • Is slowCallDurationThreshold set at or above the normal response time P99?
  • Are recordExceptions and ignoreExceptions properly categorized?

Monitoring Verification

  • Are resilience4j metrics being collected properly in Prometheus?
  • Does the Grafana dashboard display CircuitBreaker state, failure rate, and call statistics?
  • Are CircuitBreaker OPEN alerts being delivered to Slack, PagerDuty, etc.?
  • Are there metrics tracking OPEN state duration?

Fallback Strategy Verification

  • Are fallback methods connected to all CircuitBreakers?
  • Do fallback methods return meaningful responses? (No simple null returns)
  • How are exceptions in the fallback method itself handled?
  • Is a cache expiration policy configured when using cache fallback?
  • When using alternative service fallback, is a CircuitBreaker also configured for that service?

Test Verification

  • Have unit tests verified CircuitBreaker state transitions (CLOSED, OPEN, HALF-OPEN)?
  • Have integration tests reproduced actual timeout and network error scenarios?
  • Has fault injection testing been performed with chaos engineering tools (Chaos Monkey, Litmus)?
  • Has Bulkhead saturation behavior been verified in load tests?

Deployment Strategy

  • Are new CircuitBreaker configurations applied via canary deployment to a subset of traffic first?
  • Can configuration changes be applied without downtime via Config Server (Spring Cloud Config) or environment variables?
  • Is Git history management in place for CircuitBreaker configurations?
  • Is a rollback plan established?

Failure Cases and Recovery

Case 1: Downstream Overload Due to Retry Storm

Situation: Response times from the payment service began increasing. The order service had Retry configured with maxAttempts=5 and a fixed 1-second interval. With 20 order service instances each handling 100 orders per second, up to 10,000 requests per second (100 x 20 x 5) were flooding the payment service.

Cause: Fixed-interval retries were used without exponential backoff and jitter. Also, Retry was used standalone without a CircuitBreaker, so retries continued even on failure.

Recovery Procedure:

  1. Immediately disable Retry or set maxAttempts to 1 to stop retries
  2. Once the payment service load stabilizes, replace with a Retry configuration that includes exponential backoff + jitter
  3. Place the CircuitBreaker inside Retry so that retries are blocked when the circuit is OPEN

Prevention: Always use Retry together with CircuitBreaker, and apply exponential backoff + random jitter by default. Prohibit fixed-interval retries as a policy.
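
Resilience4j provides IntervalFunction.ofExponentialRandomBackoff for exactly this, but the idea is simple enough to sketch library-free. The function below is an illustration of "full jitter" (the actual wait is drawn uniformly from [0, capped exponential delay]), not the library's implementation:

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffWithJitter {

    /**
     * "Full jitter" exponential backoff: the wait before retry attempt
     * {@code attempt} (0-based) is uniform in [0, min(cap, base * 2^attempt)].
     */
    public static long nextDelayMillis(int attempt, long baseMillis, long capMillis) {
        long exponential = baseMillis << Math.min(attempt, 20); // bound the shift to avoid overflow
        long ceiling = Math.min(capMillis, exponential);
        return ThreadLocalRandom.current().nextLong(ceiling + 1); // uniform in [0, ceiling]
    }
}
```

With a 200 ms base and a 10 s cap, attempt 0 waits at most 200 ms and attempt 5 at most 6.4 s; because each instance draws its own random delay, the fleet no longer retries in lockstep against the struggling downstream.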

Case 2: Permanent OPEN Circuit Due to Incorrect Exception Classification

Situation: After deploying a new feature to the inventory service, certain product queries started returning 400 Bad Request. These 400 responses were caught as HttpClientErrorException and included in the failure rate calculation, causing the CircuitBreaker to transition to OPEN and block all inventory queries. Even normal product queries became impossible.

Cause: recordExceptions included HttpClientErrorException (4xx). Since 4xx errors are client-side issues, the circuit breaker should not intervene. Circuit breakers should only respond to server-side failures (5xx, timeouts, connection failures).

Recovery Procedure:

  1. Manually transition the CircuitBreaker to the DISABLED state to immediately restore normal traffic (Resilience4j has no FORCED_CLOSE state; DISABLED lets all calls through without recording metrics)
// Force the state transition programmatically, e.g. from an internal admin endpoint
circuitBreakerRegistry.circuitBreaker("inventoryService")
    .transitionToDisabledState();
  2. Remove HttpClientErrorException from recordExceptions and add it to ignoreExceptions
  3. After the corrected configuration is applied, call transitionToClosedState() to resume normal CircuitBreaker operation

Prevention: Document exception classification principles. 4xx (client errors) go in ignoreExceptions, 5xx (server errors) go in recordExceptions, and business validation exceptions go in ignoreExceptions.
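
One way to encode this policy in code is a recordException predicate rather than class lists; the sketch below assumes Spring's RestTemplate/RestClient exception hierarchy, where HttpServerErrorException covers 5xx and HttpClientErrorException covers 4xx:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import org.springframework.web.client.HttpServerErrorException;

public class ExceptionClassification {
    // Sketch: record only server-side failures; everything else
    // (4xx, business validation errors) stays invisible to the breaker.
    public static CircuitBreakerConfig serverFailuresOnly() {
        return CircuitBreakerConfig.custom()
            .recordException(t ->
                t instanceof IOException                 // connection failures
                    || t instanceof TimeoutException     // timeouts
                    || t instanceof HttpServerErrorException) // 5xx responses
            .build();
    }
}
```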

Case 3: Traffic Loss Due to HALF-OPEN Bottleneck

Situation: Even after the payment service recovered, order processing throughput did not recover. Traffic analysis revealed that the CircuitBreaker was set with permittedNumberOfCallsInHalfOpenState=1 in the HALF-OPEN state, allowing only 1 trial call. This trial call intermittently failed, causing flapping between OPEN and HALF-OPEN states.

Cause: The permittedNumberOfCallsInHalfOpenState value was too low. With only 1 trial call, a single failure returns the circuit to OPEN, making it difficult to return to CLOSED when the downstream service responds only intermittently.

Recovery Procedure:

  1. Increase permittedNumberOfCallsInHalfOpenState to 5-10
  2. Verify that automaticTransitionFromOpenToHalfOpenEnabled is set to true for automatic transitions without manual intervention
  3. Adjust waitDurationInOpenState to match the downstream service's average recovery time

Prevention: Set HALF-OPEN trial calls to at least 3 or more, and combine with the failure rate threshold to enable statistically meaningful decisions. Add OPEN-HALF_OPEN flapping detection to monitoring alerts.
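
A flapping-resistant HALF-OPEN setup along the lines above might look like this sketch (values are illustrative, not tuned recommendations):

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class HalfOpenTuning {
    // Sketch: enough trial calls for a statistically meaningful decision,
    // automatic OPEN -> HALF-OPEN transition, and an upper bound on how
    // long HALF-OPEN waits for its permitted calls to complete.
    public static CircuitBreakerConfig flappingResistant() {
        return CircuitBreakerConfig.custom()
            .permittedNumberOfCallsInHalfOpenState(5)
            .automaticTransitionFromOpenToHalfOpenEnabled(true)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            // If trial calls never complete, fall back to OPEN after this bound
            .maxWaitDurationInHalfOpenState(Duration.ofSeconds(20))
            .build();
    }
}
```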

Advanced Pattern: Custom CircuitBreaker Registry

As the number of services grows, configuring CircuitBreakers individually for each service can become inefficient. You can implement a custom registry that dynamically creates and manages CircuitBreakers.

// DynamicCircuitBreakerFactory.kt
@Component
class DynamicCircuitBreakerFactory(
    private val circuitBreakerRegistry: CircuitBreakerRegistry,
    private val meterRegistry: MeterRegistry,
) {
    private val log = LoggerFactory.getLogger(javaClass)

    /**
     * Dynamically creates a CircuitBreaker based on service name.
     * Returns an existing instance if one already exists.
     */
    fun getOrCreate(
        serviceName: String,
        tier: ServiceTier = ServiceTier.STANDARD,
    ): CircuitBreaker {
        return circuitBreakerRegistry.circuitBreaker(serviceName) {
            buildConfigForTier(tier)
        }.also { cb ->
            registerMetrics(cb)
            log.info(
                "CircuitBreaker created/retrieved: name={}, tier={}, state={}",
                serviceName, tier, cb.state
            )
        }
    }

    private fun buildConfigForTier(tier: ServiceTier): CircuitBreakerConfig {
        return when (tier) {
            ServiceTier.CRITICAL -> CircuitBreakerConfig.custom()
                .failureRateThreshold(30f)
                .slowCallRateThreshold(60f)
                .slowCallDurationThreshold(Duration.ofSeconds(2))
                .waitDurationInOpenState(Duration.ofSeconds(60))
                .slidingWindowSize(20)
                .minimumNumberOfCalls(10)
                .permittedNumberOfCallsInHalfOpenState(5)
                .automaticTransitionFromOpenToHalfOpenEnabled(true)
                .build()

            ServiceTier.STANDARD -> CircuitBreakerConfig.custom()
                .failureRateThreshold(50f)
                .slowCallRateThreshold(80f)
                .slowCallDurationThreshold(Duration.ofSeconds(3))
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .slidingWindowSize(10)
                .minimumNumberOfCalls(5)
                .permittedNumberOfCallsInHalfOpenState(3)
                .automaticTransitionFromOpenToHalfOpenEnabled(true)
                .build()

            ServiceTier.BEST_EFFORT -> CircuitBreakerConfig.custom()
                .failureRateThreshold(70f)
                .slowCallRateThreshold(90f)
                .slowCallDurationThreshold(Duration.ofSeconds(5))
                .waitDurationInOpenState(Duration.ofSeconds(15))
                .slidingWindowSize(5)
                .minimumNumberOfCalls(3)
                .permittedNumberOfCallsInHalfOpenState(2)
                .automaticTransitionFromOpenToHalfOpenEnabled(true)
                .build()
        }
    }

    private val metricsBound = AtomicBoolean(false)

    private fun registerMetrics(cb: CircuitBreaker) {
        // Bind registry-wide metrics only once: the tagged binder tracks
        // CircuitBreakers added to this registry afterwards, and binding it
        // on every creation would register duplicate meters.
        if (metricsBound.compareAndSet(false, true)) {
            TaggedCircuitBreakerMetrics
                .ofCircuitBreakerRegistry(circuitBreakerRegistry)
                .bindTo(meterRegistry)
        }
    }

    enum class ServiceTier {
        CRITICAL,    // Core services like payment, authentication
        STANDARD,    // General services like inventory, shipping
        BEST_EFFORT  // Non-critical services like notifications, recommendations
    }
}

Using this factory, appropriate CircuitBreaker configurations are automatically applied based on the service tier. CRITICAL services are conservatively protected with low failure rate thresholds and long wait durations, while BEST_EFFORT services operate flexibly with high thresholds.

Test Strategy

Here are the essential test cases that must be written when introducing CircuitBreakers.

// CircuitBreakerIntegrationTest.java
@SpringBootTest
@AutoConfigureMockMvc
class CircuitBreakerIntegrationTest {

    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    @Autowired
    private MockMvc mockMvc;

    @MockBean
    private RestClient paymentRestClient;

    @Test
    @DisplayName("CircuitBreaker transitions to OPEN when failure rate threshold is exceeded")
    void shouldTransitionToOpenOnFailureThreshold() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.reset(); // Reset state for test isolation

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);

        // With slidingWindowSize=20, failureRateThreshold=40
        // Must call minimumNumberOfCalls=10 or more, then 40%+ failure -> OPEN
        // 10 calls with 5 failures = 50% failure rate -> OPEN transition
        for (int i = 0; i < 5; i++) {
            cb.onSuccess(100, TimeUnit.MILLISECONDS);
        }
        for (int i = 0; i < 5; i++) {
            cb.onError(100, TimeUnit.MILLISECONDS, new IOException("connection refused"));
        }

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
        assertThat(cb.getMetrics().getFailureRate()).isGreaterThanOrEqualTo(40f);
    }

    @Test
    @DisplayName("Transitions to HALF-OPEN after waitDuration elapses in OPEN state")
    void shouldTransitionToHalfOpenAfterWaitDuration() throws Exception {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.reset();
        cb.transitionToOpenState();

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);

        // Simulate waitDurationInOpenState elapsed
        // (In tests, use a short waitDuration setting or transition directly)
        cb.transitionToHalfOpenState();

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.HALF_OPEN);
    }

    @Test
    @DisplayName("Transitions to CLOSED on successful trial calls in HALF-OPEN")
    void shouldTransitionToClosedOnSuccessfulTrialCalls() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.reset();
        cb.transitionToOpenState();
        cb.transitionToHalfOpenState();

        // permittedNumberOfCallsInHalfOpenState=5 successful calls
        for (int i = 0; i < 5; i++) {
            cb.onSuccess(50, TimeUnit.MILLISECONDS);
        }

        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
    }

    @Test
    @DisplayName("Fallback method is properly called when CircuitBreaker is OPEN")
    void shouldInvokeFallbackWhenCircuitIsOpen() throws Exception {
        // Force circuit to OPEN state
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
        cb.transitionToForcedOpenState();

        mockMvc.perform(post("/api/v1/orders")
                .contentType(MediaType.APPLICATION_JSON)
                .content("{\"productId\": \"P001\", \"quantity\": 1}"))
            .andExpect(status().isOk())
            .andExpect(jsonPath("$.payment.status").value("QUEUED"))
            .andExpect(jsonPath("$.payment.message").exists());

        // Restore state after test
        cb.transitionToClosedState();
    }
}
