Circuit Breaker Pattern and Resilience4j Practical Implementation Guide: From Failure Isolation to Recovery
- Introduction
- Circuit Breaker State Machine
- Resilience4j Architecture
- Spring Boot 3 Integration Configuration
- CircuitBreaker Practical Implementation
- Retry, Bulkhead, and RateLimiter Combinations
- Grafana Monitoring Dashboard
- Troubleshooting Guide
- Operations Checklist
- Failure Cases and Recovery
- Advanced Pattern: Custom CircuitBreaker Registry
- Test Strategy
- References

Introduction
In microservices architecture, inter-service network calls are inherently unreliable. Network latency, timeouts, and downstream service failures occur routinely, and without proper controls, a single service failure can propagate across the entire system as a cascading failure. A representative example occurred in late 2024, when response latency in a single payment gateway on a large e-commerce platform cascaded to paralyze the order service, inventory service, and notification service.
The Circuit Breaker pattern is a failure isolation mechanism inspired by electrical circuit breakers. Since Michael Nygard first introduced it in Release It! in 2007, and through Martin Fowler's blog post, it has become a core pattern in the microservices world. Netflix's Hystrix was the first widely adopted implementation, but after entering maintenance mode in 2018, Resilience4j emerged as the de facto standard.
This article covers everything from the Circuit Breaker state machine operating principles, to integrating Resilience4j's core modules -- CircuitBreaker, Retry, Bulkhead, and RateLimiter -- in a Spring Boot 3 environment, monitoring with Grafana dashboards, and recovery strategies for real failure scenarios at an operational level.
Circuit Breaker State Machine
The core of the Circuit Breaker is a Finite State Machine that manages transitions between three states (CLOSED, OPEN, HALF-OPEN) and two special states (DISABLED, FORCED_OPEN).
State Transition Diagram
              failure rate >= threshold
┌──────────┐ ─────────────────────────────> ┌──────────┐
│  CLOSED  │                                │   OPEN   │
│ (normal) │                                │ (blocked)│
└──────────┘                                └──────────┘
     ▲                                         │     ▲
     │ trial call failure rate                 │     │ trial call failure
     │ below threshold                         │     │ rate >= threshold
     │                  waitDuration elapsed   ▼     │
     │                             ┌──────────────────┐
     └──────────────────────────── │    HALF-OPEN     │
                                   │ (trials allowed) │
                                   └──────────────────┘
Detailed Behavior by State
| State | Request Handling | Transition Condition | Metric Collection |
|---|---|---|---|
| CLOSED | All requests pass through | Transitions to OPEN when failure rate in sliding window exceeds threshold | Records success/failure/slow calls |
| OPEN | All requests immediately rejected (CallNotPermittedException) | Transitions to HALF-OPEN after waitDurationInOpenState elapses | Records rejected call count |
| HALF-OPEN | Allows only permittedNumberOfCallsInHalfOpenState calls | Transitions to CLOSED or OPEN based on trial call results | Records trial call success/failure |
| DISABLED | All requests pass through (circuit inactive) | Manual transition only | No metric collection |
| FORCED_OPEN | All requests immediately rejected | Manual transition only | Records rejected call count |
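The two special states are reachable only through the API, never from metrics. A minimal sketch of the manual transitions (registry and instance name here are illustrative):
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry

val cb = CircuitBreakerRegistry.ofDefaults().circuitBreaker("paymentGateway")
cb.transitionToDisabledState()   // DISABLED: bypass the circuit entirely, no metrics recorded
cb.transitionToForcedOpenState() // FORCED_OPEN: reject every call regardless of metrics
cb.transitionToClosedState()     // back to normal CLOSED operation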
Sliding Window Type Comparison
Resilience4j provides two sliding window types.
| Aspect | COUNT_BASED | TIME_BASED |
|---|---|---|
| Basis | Last N calls | Calls in the last N seconds |
| Config Example | slidingWindowSize: 10 | slidingWindowSize: 60 |
| Memory Usage | Fixed (array of N results) | Variable (partial aggregations over N seconds) |
| Suited For | Services with consistent call frequency | Services with irregular call frequency |
| Evaluation | After the Nth call | Time window evaluated on each call |
COUNT_BASED is internally implemented as a circular bit array of size N, recording each call result in O(1) and calculating the failure rate in constant time. TIME_BASED uses N partial aggregation buckets, each aggregating call results for one second.
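As a minimal sketch of selecting each window type programmatically (values illustrative; the builder methods are from CircuitBreakerConfig):
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType

// COUNT_BASED: failure rate over the last 10 call results
val countBased = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(10)       // last 10 calls
    .minimumNumberOfCalls(5)
    .build()

// TIME_BASED: failure rate over the last 60 seconds (one bucket per second)
val timeBased = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.TIME_BASED)
    .slidingWindowSize(60)       // last 60 seconds
    .minimumNumberOfCalls(5)
    .build()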
Resilience4j Architecture
Transitioning from Hystrix to Resilience4j
After Netflix Hystrix entered maintenance mode in 2018, Resilience4j established itself as the standard fault tolerance library in the JVM ecosystem.
| Comparison | Netflix Hystrix | Resilience4j |
|---|---|---|
| Status | Maintenance mode (no updates since 2018) | Active development (2.3.0 release in 2025) |
| Java Version | Java 8+ | Java 17+ (Spring Boot 3 support) |
| Dependencies | Multiple (Archaius, RxJava, etc.) | None since 2.0 (1.x depended only on Vavr) |
| Architecture | Monolithic (all features included) | Modular (select only needed modules) |
| Thread Model | Separate thread pool required | Semaphore-based (thread pool optional) |
| Configuration | Archaius required | Both application.yml and programmatic |
| Reactive Support | RxJava 1 | Native Reactor, RxJava 2/3 support |
| Functional Interface | Limited | Full support (Supplier, Function, Runnable, etc.) |
| Monitoring | Hystrix Dashboard | Micrometer integration (Prometheus, Grafana) |
Resilience4j Core Modules
Resilience4j provides five core modules that can be used independently or in combination.
| Module | Role | Key Configuration |
|---|---|---|
| CircuitBreaker | Circuit tripping based on failure rate | failureRateThreshold, slidingWindowSize |
| Retry | Retry on failure | maxAttempts, waitDuration, backoff |
| Bulkhead | Limit concurrent calls (isolation) | maxConcurrentCalls, maxWaitDuration |
| RateLimiter | Limit calls per time unit | limitForPeriod, limitRefreshPeriod |
| TimeLimiter | Limit call duration | timeoutDuration, cancelRunningFuture |
When combining via annotations, the application order is as follows:
Outer (evaluated first) ──────────────────────────────────> Inner (evaluated last)
Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead
This order is the default priority when Resilience4j processes annotations via Spring AOP. You can customize the order using properties like resilience4j.circuitbreaker.circuitBreakerAspectOrder.
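For example, a sketch of such an override (values illustrative; the relative values of the *AspectOrder properties determine which aspect wraps which, so verify the resulting nesting against the Resilience4j documentation for your version):
# application.yml - adjust relative aspect orders to change the nesting
resilience4j:
  circuitbreaker:
    circuitBreakerAspectOrder: 2
  retry:
    retryAspectOrder: 1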
Spring Boot 3 Integration Configuration
Dependency Setup
// build.gradle.kts (Spring Boot 3.3+ / Resilience4j 2.2+)
plugins {
id("org.springframework.boot") version "3.3.5"
id("io.spring.dependency-management") version "1.1.6"
kotlin("jvm") version "1.9.25"
kotlin("plugin.spring") version "1.9.25"
}
dependencies {
// Resilience4j Spring Boot 3 Starter
implementation("io.github.resilience4j:resilience4j-spring-boot3:2.2.0")
// Individual modules (included in starter but explicit declaration recommended)
implementation("io.github.resilience4j:resilience4j-circuitbreaker")
implementation("io.github.resilience4j:resilience4j-retry")
implementation("io.github.resilience4j:resilience4j-bulkhead")
implementation("io.github.resilience4j:resilience4j-ratelimiter")
implementation("io.github.resilience4j:resilience4j-timelimiter")
// Micrometer + Prometheus (monitoring)
implementation("io.github.resilience4j:resilience4j-micrometer")
implementation("io.micrometer:micrometer-registry-prometheus")
// Spring Boot Actuator
implementation("org.springframework.boot:spring-boot-starter-actuator")
implementation("org.springframework.boot:spring-boot-starter-aop")
implementation("org.springframework.boot:spring-boot-starter-web")
// Kotlin Coroutines (optional)
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core")
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactor")
testImplementation("org.springframework.boot:spring-boot-starter-test")
}
Integrated Configuration File
# application.yml - Resilience4j integrated configuration
resilience4j:
circuitbreaker:
configs:
default:
registerHealthIndicator: true
slidingWindowType: COUNT_BASED
slidingWindowSize: 10
minimumNumberOfCalls: 5
failureRateThreshold: 50
slowCallRateThreshold: 80
slowCallDurationThreshold: 3s
waitDurationInOpenState: 30s
permittedNumberOfCallsInHalfOpenState: 3
automaticTransitionFromOpenToHalfOpenEnabled: true
recordExceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
- org.springframework.web.client.HttpServerErrorException
ignoreExceptions:
- com.example.order.exception.BusinessValidationException
instances:
paymentGateway:
baseConfig: default
failureRateThreshold: 40
waitDurationInOpenState: 60s
slidingWindowSize: 20
inventoryService:
baseConfig: default
failureRateThreshold: 60
slowCallDurationThreshold: 5s
notificationService:
baseConfig: default
failureRateThreshold: 70
waitDurationInOpenState: 15s
retry:
configs:
default:
maxAttempts: 3
waitDuration: 1s
enableExponentialBackoff: true
exponentialBackoffMultiplier: 2.0
exponentialMaxWaitDuration: 10s
retryExceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
ignoreExceptions:
- com.example.order.exception.BusinessValidationException
instances:
paymentGateway:
baseConfig: default
maxAttempts: 2
waitDuration: 2s
inventoryService:
baseConfig: default
maxAttempts: 4
notificationService:
baseConfig: default
maxAttempts: 5
waitDuration: 500ms
bulkhead:
configs:
default:
maxConcurrentCalls: 25
maxWaitDuration: 500ms
instances:
paymentGateway:
baseConfig: default
maxConcurrentCalls: 15
inventoryService:
baseConfig: default
maxConcurrentCalls: 30
notificationService:
baseConfig: default
maxConcurrentCalls: 50
ratelimiter:
configs:
default:
limitForPeriod: 100
limitRefreshPeriod: 1s
timeoutDuration: 500ms
instances:
paymentGateway:
baseConfig: default
limitForPeriod: 50
inventoryService:
baseConfig: default
limitForPeriod: 200
timelimiter:
configs:
default:
timeoutDuration: 5s
cancelRunningFuture: true
instances:
paymentGateway:
baseConfig: default
timeoutDuration: 10s
inventoryService:
baseConfig: default
timeoutDuration: 3s
# Actuator metric exposure
management:
endpoints:
web:
exposure:
include: health,metrics,prometheus,circuitbreakers,retries
endpoint:
health:
show-details: always
health:
circuitbreakers:
enabled: true
metrics:
distribution:
percentiles-histogram:
resilience4j.circuitbreaker.calls: true
resilience4j.retry.calls: true
tags:
application: order-service
A notable point in the configuration is defining a base profile with configs.default and specifying baseConfig: default for each instance to inherit common settings. You only need to override thresholds specific to each service to minimize configuration duplication.
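To sanity-check what an instance actually inherited at runtime, a minimal sketch using the registry API (bean wiring assumed):
val cb = circuitBreakerRegistry.circuitBreaker("paymentGateway")
// Prints 40.0 (the instance override), not the default 50.0
println(cb.circuitBreakerConfig.failureRateThreshold)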
CircuitBreaker Practical Implementation
Annotation-Based Implementation (Kotlin)
// PaymentGatewayClient.kt
@Service
class PaymentGatewayClient(
private val restClient: RestClient,
private val paymentRetryQueue: PaymentRetryQueue,
private val paymentCacheStore: PaymentCacheStore,
) {
companion object {
private val log = LoggerFactory.getLogger(PaymentGatewayClient::class.java)
const val CB_NAME = "paymentGateway"
}
@CircuitBreaker(name = CB_NAME, fallbackMethod = "paymentFallback")
@Retry(name = CB_NAME)
@Bulkhead(name = CB_NAME)
fun processPayment(request: PaymentRequest): PaymentResponse {
log.info("Calling payment gateway for orderId={}", request.orderId)
val response = restClient.post()
.uri("https://payment-api.internal/v2/charges")
.contentType(MediaType.APPLICATION_JSON)
.body(request)
.retrieve()
.body(PaymentResponse::class.java)
?: throw PaymentGatewayException("Empty response from payment gateway")
log.info("Payment processed: orderId={}, txId={}", request.orderId, response.transactionId)
return response
}
/**
* Fallback method: Called when CircuitBreaker is OPEN or an exception occurs.
* The method signature must match the original + accept an Exception as the last parameter.
*/
private fun paymentFallback(request: PaymentRequest, ex: Exception): PaymentResponse {
log.warn(
"Payment fallback activated: orderId={}, reason={}",
request.orderId, ex.message
)
return when (ex) {
is CallNotPermittedException -> {
// CircuitBreaker OPEN state: enqueue for async processing
paymentRetryQueue.enqueue(request)
PaymentResponse(
orderId = request.orderId,
status = PaymentStatus.QUEUED,
message = "Payment has been queued. It will be processed shortly.",
transactionId = null,
)
}
is BulkheadFullException -> {
// Bulkhead saturated: prompt immediate retry
PaymentResponse(
orderId = request.orderId,
status = PaymentStatus.RETRY_LATER,
message = "Too many payment requests at the moment. Please try again shortly.",
transactionId = null,
)
}
else -> {
// Other exceptions: return cached payment info if available
val cached = paymentCacheStore.getLastSuccess(request.orderId)
if (cached != null) {
log.info("Returning cached payment for orderId={}", request.orderId)
cached.copy(status = PaymentStatus.CACHED)
} else {
paymentRetryQueue.enqueue(request)
PaymentResponse(
orderId = request.orderId,
status = PaymentStatus.PENDING,
message = "An error occurred during payment processing. Automatic retry in progress.",
transactionId = null,
)
}
}
}
}
}
Programmatic Implementation (Java)
Instead of annotations, you can use the CircuitBreakerRegistry directly to dynamically create circuit breakers or change configurations at runtime.
// InventoryServiceClient.java
@Service
@Slf4j
public class InventoryServiceClient {
private final CircuitBreaker circuitBreaker;
private final Retry retry;
private final Bulkhead bulkhead;
private final RestClient restClient;
public InventoryServiceClient(
CircuitBreakerRegistry cbRegistry,
RetryRegistry retryRegistry,
BulkheadRegistry bulkheadRegistry,
RestClient.Builder restClientBuilder) {
this.circuitBreaker = cbRegistry.circuitBreaker("inventoryService");
this.retry = retryRegistry.retry("inventoryService");
this.bulkhead = bulkheadRegistry.bulkhead("inventoryService");
this.restClient = restClientBuilder
.baseUrl("https://inventory-api.internal")
.build();
// Register event listeners
registerEventListeners();
}
public InventoryResponse checkStock(String productId, int quantity) {
// Decorator chain (outermost first): Retry -> CircuitBreaker -> Bulkhead -> actual call
Supplier<InventoryResponse> decorated = Decorators
.ofSupplier(() -> doCheckStock(productId, quantity))
.withBulkhead(bulkhead)
.withCircuitBreaker(circuitBreaker)
.withRetry(retry)
.withFallback(
List.of(
CallNotPermittedException.class,
BulkheadFullException.class,
IOException.class
),
ex -> stockFallback(productId, quantity, ex)
)
.decorate();
return decorated.get();
}
private InventoryResponse doCheckStock(String productId, int quantity) {
return restClient.get()
.uri("/v1/stock/{productId}?qty={qty}", productId, quantity)
.retrieve()
.body(InventoryResponse.class);
}
private InventoryResponse stockFallback(
String productId, int quantity, Throwable ex) {
log.warn("Inventory fallback: productId={}, reason={}", productId, ex.getMessage());
// When stock is uncertain, accept the order but schedule async verification
return InventoryResponse.builder()
.productId(productId)
.available(true)
.reservationStatus(ReservationStatus.TENTATIVE)
.message("Stock check delayed: tentative approval with async verification scheduled")
.build();
}
private void registerEventListeners() {
circuitBreaker.getEventPublisher()
.onStateTransition(event -> {
log.warn("[CircuitBreaker] {} state: {} -> {}",
event.getCircuitBreakerName(),
event.getStateTransition().getFromState(),
event.getStateTransition().getToState());
})
.onError(event ->
log.error("[CircuitBreaker] {} error: {} ({}ms)",
event.getCircuitBreakerName(),
event.getThrowable().getMessage(),
event.getElapsedDuration().toMillis())
)
.onSuccess(event ->
log.debug("[CircuitBreaker] {} success ({}ms)",
event.getCircuitBreakerName(),
event.getElapsedDuration().toMillis())
)
.onCallNotPermitted(event ->
log.warn("[CircuitBreaker] {} call not permitted (OPEN state)",
event.getCircuitBreakerName())
);
retry.getEventPublisher()
.onRetry(event ->
log.info("[Retry] {} attempt #{} (wait: {}ms)",
event.getName(),
event.getNumberOfRetryAttempts(),
event.getWaitInterval().toMillis())
);
}
}
Retry, Bulkhead, and RateLimiter Combinations
Retry and Exponential Backoff
The most important aspect of retry strategy is combining exponential backoff with jitter. Fixed-interval retries cause a thundering herd problem where multiple clients retry simultaneously, concentrating load on the server.
// Programmatic RetryConfig customization
@Configuration
class ResilienceConfig {
@Bean
fun customRetryConfig(): RetryConfig {
return RetryConfig.custom<RetryConfig>()
.maxAttempts(4)
.intervalFunction(
// Exponential backoff + jitter: 1s, 2s(+jitter), 4s(+jitter), 8s(+jitter)
IntervalFunction.ofExponentialRandomBackoff(
Duration.ofSeconds(1), // initial wait duration
2.0, // multiplier
Duration.ofSeconds(15) // max wait duration
)
)
.retryOnException { ex ->
// Determine retry-eligible exceptions
when (ex) {
is IOException -> true
is TimeoutException -> true
is HttpServerErrorException -> true
is ConnectException -> true
else -> false
}
}
.ignoreExceptions(
BusinessValidationException::class.java,
IllegalArgumentException::class.java
)
.failAfterMaxAttempts(true) // Throw MaxRetriesExceededException after max retries
.build()
}
@Bean
fun retryRegistry(customRetryConfig: RetryConfig): RetryRegistry {
return RetryRegistry.of(customRetryConfig)
}
}
Bulkhead: Semaphore vs Thread Pool
Bulkhead is a pattern inspired by ship compartment walls (bulkheads), preventing a single service call from monopolizing all resources. Resilience4j provides two Bulkhead implementations.
| Aspect | SemaphoreBulkhead | ThreadPoolBulkhead |
|---|---|---|
| Isolation Level | Limits concurrent calls | Executes in a separate thread pool |
| Call Thread | Uses the caller's thread directly | Uses threads from a dedicated thread pool |
| Return Type | Synchronous return | CompletionStage return |
| Overhead | Low | Thread context switching cost |
| Suited For | Most HTTP calls | CPU-intensive tasks, when full isolation is needed |
| Configuration | maxConcurrentCalls, maxWaitDuration | maxThreadPoolSize, coreThreadPoolSize, queueCapacity |
# ThreadPoolBulkhead configuration example
resilience4j:
thread-pool-bulkhead:
instances:
heavyProcessing:
maxThreadPoolSize: 10
coreThreadPoolSize: 5
queueCapacity: 20
keepAliveDuration: 100ms
writableStackTraceEnabled: true
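A minimal usage sketch for the heavyProcessing instance above, assuming a hypothetical ReportService; with the thread-pool variant, the annotated method must return a CompletionStage (here a CompletableFuture):
import io.github.resilience4j.bulkhead.annotation.Bulkhead
import org.springframework.stereotype.Service
import java.util.concurrent.CompletableFuture

@Service
class ReportService {
    // Executed on the bulkhead's dedicated thread pool; the caller gets a future immediately
    @Bulkhead(name = "heavyProcessing", type = Bulkhead.Type.THREADPOOL)
    fun generateReport(orderId: String): CompletableFuture<String> =
        CompletableFuture.completedFuture("report-$orderId") // placeholder for the heavy work
}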
RateLimiter Configuration and Usage
RateLimiter limits the number of calls allowed per time unit, preventing external API rate limit violations or protecting internal services from overload.
// RateLimiter and CircuitBreaker combination
@Service
@Slf4j
public class ExternalApiClient {
private final RestClient restClient;
@CircuitBreaker(name = "externalApi", fallbackMethod = "apiFallback")
@RateLimiter(name = "externalApi")
@Retry(name = "externalApi")
public ApiResponse callExternalApi(ApiRequest request) {
log.debug("Calling external API: endpoint={}", request.getEndpoint());
return restClient.post()
.uri(request.getEndpoint())
.body(request.getPayload())
.retrieve()
.body(ApiResponse.class);
}
private ApiResponse apiFallback(ApiRequest request, RequestNotPermitted ex) {
// Rejected by RateLimiter
log.warn("Rate limit exceeded for external API: {}", request.getEndpoint());
return ApiResponse.rateLimited(
"Request limit exceeded. " +
"Check limitForPeriod settings or try again later."
);
}
private ApiResponse apiFallback(ApiRequest request, Exception ex) {
// Other exceptions (CircuitBreaker OPEN, network errors, etc.)
log.warn("External API fallback: endpoint={}, reason={}",
request.getEndpoint(), ex.getMessage());
return ApiResponse.error("External API call failed: " + ex.getMessage());
}
}
An important note when overloading fallback methods: Resilience4j selects the most specific fallback based on exception type. By separating RequestNotPermitted (RateLimiter rejection) and Exception (general exceptions), you can execute different fallback logic based on the exception cause.
Grafana Monitoring Dashboard
Prometheus Metric Collection
Resilience4j automatically exposes metrics via Micrometer. The following metrics are available at Spring Boot Actuator's /actuator/prometheus endpoint:
# CircuitBreaker state check (0=CLOSED, 1=OPEN, 2=HALF_OPEN, 3=DISABLED, 4=FORCED_OPEN)
resilience4j_circuitbreaker_state{name="paymentGateway"}
# Failure rate (%)
resilience4j_circuitbreaker_failure_rate{name="paymentGateway"}
# Slow call rate (%)
resilience4j_circuitbreaker_slow_call_rate{name="paymentGateway"}
# Call statistics (kind: successful, failed, ignored, not_permitted)
rate(resilience4j_circuitbreaker_calls_seconds_count{name="paymentGateway"}[5m])
# Call latency distribution (histogram)
histogram_quantile(0.95,
rate(resilience4j_circuitbreaker_calls_seconds_bucket{name="paymentGateway"}[5m])
)
# Retry count
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="successful_with_retry"}[1h])
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="failed_with_retry"}[1h])
# Bulkhead available concurrent calls
resilience4j_bulkhead_available_concurrent_calls{name="paymentGateway"}
# RateLimiter available permissions
resilience4j_ratelimiter_available_permissions{name="externalApi"}
Grafana Dashboard Panel Configuration
Here are the essential panels and their PromQL queries for the Grafana dashboard.
Panel 1 - CircuitBreaker State Gauge
resilience4j_circuitbreaker_state{application="order-service"}
Use value mapping to map 0=CLOSED (green), 1=OPEN (red), 2=HALF_OPEN (yellow).
Panel 2 - Failure Rate Trend (Time Series)
resilience4j_circuitbreaker_failure_rate{application="order-service", name=~".*"}
Add a threshold line (failureRateThreshold) to visually identify when the circuit transitions to OPEN.
Panel 3 - Call Success/Failure Ratio (Stacked Bar)
sum by (name, kind) (
rate(resilience4j_circuitbreaker_calls_seconds_count{application="order-service"}[5m])
)
Panel 4 - P95 Response Time (Time Series)
histogram_quantile(0.95,
sum by (le, name) (
rate(resilience4j_circuitbreaker_calls_seconds_bucket{application="order-service"}[5m])
)
)
Panel 5 - Bulkhead Concurrent Call Status (Gauge)
resilience4j_bulkhead_max_allowed_concurrent_calls{application="order-service"}
- resilience4j_bulkhead_available_concurrent_calls{application="order-service"}
Alert Rule Configuration
Register the following alert rules in Grafana or Prometheus Alertmanager.
# prometheus-alerts.yml
groups:
- name: resilience4j_alerts
rules:
- alert: CircuitBreakerOpen
expr: resilience4j_circuitbreaker_state == 1
for: 30s
labels:
severity: critical
annotations:
summary: 'CircuitBreaker OPEN: {{ $labels.name }}'
description: >
The {{ $labels.name }} circuit breaker in service
{{ $labels.application }} is in OPEN state.
Check downstream service failures.
- alert: HighFailureRate
expr: resilience4j_circuitbreaker_failure_rate > 30
for: 2m
labels:
severity: warning
annotations:
summary: 'High failure rate: {{ $labels.name }} ({{ $value }}%)'
description: >
The failure rate of {{ $labels.name }} is {{ $value }}%,
exceeding the warning threshold (30%).
- alert: BulkheadSaturation
expr: >
(resilience4j_bulkhead_max_allowed_concurrent_calls
- resilience4j_bulkhead_available_concurrent_calls)
/ resilience4j_bulkhead_max_allowed_concurrent_calls > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: 'Bulkhead 80% saturated: {{ $labels.name }}'
- alert: ExcessiveRetries
expr: >
rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m])
/ rate(resilience4j_retry_calls_total[5m]) > 0.5
for: 3m
labels:
severity: warning
annotations:
summary: 'Retry failure rate exceeds 50%: {{ $labels.name }}'
Troubleshooting Guide
Issue 1: CircuitBreaker Does Not Transition to OPEN
Symptoms: Failures are clearly occurring, but the circuit remains in CLOSED state.
Root Cause Analysis:
- minimumNumberOfCalls has not been reached. The library default is 100, so on a low-traffic service the failure may resolve before enough calls accumulate for the window to be evaluated.
- The exception is listed in ignoreExceptions. Check whether unintended exceptions, not just business exceptions, have ended up in the ignore list.
- The exception is not listed in recordExceptions. Once recordExceptions is specified, any exception outside that list is not counted as a failure.
Resolution: Adjust minimumNumberOfCalls to match the service call frequency, and review the recordExceptions and ignoreExceptions lists.
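For instance, a sketch of that adjustment for a low-traffic service (instance name and values illustrative):
resilience4j:
  circuitbreaker:
    instances:
      lowTrafficService:
        baseConfig: default
        slidingWindowSize: 20
        minimumNumberOfCalls: 10  # library default is 100; lower it so the window can actually be evaluated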
Issue 2: More Calls Than Expected When Combining Retry and CircuitBreaker
Symptoms: maxAttempts is set to 3, but more than 5 calls are recorded on the downstream service.
Root Cause Analysis: In the annotation application order, Retry sits outside CircuitBreaker. Therefore, after the CircuitBreaker records a failure, Retry attempts the call again through the CircuitBreaker. If trial calls are added during the CircuitBreaker's HALF-OPEN state, the total call count can exceed expectations.
Resolution: Set Retry's maxAttempts conservatively, and calculate the maximum number of calls produced by the combination of the CircuitBreaker's slidingWindowSize and Retry's maxAttempts to predict downstream load. For example, with maxAttempts=3 and the circuit still CLOSED, one logical request can reach the downstream up to 3 times, so plan for up to triple the nominal call rate during an outage.
Issue 3: Fallback Method Is Not Being Called
Symptoms: CircuitBreaker is OPEN, but CallNotPermittedException is propagated directly to the client.
Root Cause Analysis: The fallback method's signature does not exactly match the original method. The fallback method must accept all parameters of the original method in the same order and type, plus an Exception (or specific exception type) as the last parameter.
Resolution: Review the fallback method signature. The return type must also exactly match the original. Below are correct examples:
// Original method
@CircuitBreaker(name = "svc", fallbackMethod = "fallback")
public OrderResponse getOrder(String orderId, boolean includeDetails) { ... }
// Correct fallback (same parameters + Exception added)
private OrderResponse fallback(String orderId, boolean includeDetails, Exception ex) { ... }
// Incorrect fallback - compiles but fails to match at runtime
private OrderResponse fallback(String orderId, Exception ex) { ... } // Missing parameter
private void fallback(String orderId, boolean includeDetails, Exception ex) { ... } // Return type mismatch
Issue 4: Memory Usage Increases with TIME_BASED Window
Symptoms: Using a TIME_BASED sliding window and heap memory usage gradually increases.
Root Cause Analysis: The slidingWindowSize is set too large. For example, setting slidingWindowSize=600 (10 minutes) maintains 600 partial aggregation buckets. With high traffic, call records accumulate in each bucket, consuming memory.
Resolution: For TIME_BASED, set slidingWindowSize to 60 seconds or less, and observe long-term trends through Prometheus metrics. In memory-sensitive environments, prefer COUNT_BASED.
Operations Checklist
Here are items that must be verified before deploying Circuit Breakers to production.
Configuration Verification
- Is the ratio of slidingWindowSize to minimumNumberOfCalls appropriate? (minimumNumberOfCalls should be 50% or less of slidingWindowSize)
- Is failureRateThreshold set according to service characteristics? (Payment: 30-40%, Notification: 60-70%)
- Does waitDurationInOpenState match the downstream service's average recovery time?
- Is slowCallDurationThreshold set at or above the normal response time P99?
- Are recordExceptions and ignoreExceptions properly categorized?
Monitoring Verification
- Are resilience4j metrics being collected properly in Prometheus?
- Does the Grafana dashboard display CircuitBreaker state, failure rate, and call statistics?
- Are CircuitBreaker OPEN alerts being delivered to Slack, PagerDuty, etc.?
- Are there metrics tracking OPEN state duration?
Fallback Strategy Verification
- Are fallback methods connected to all CircuitBreakers?
- Do fallback methods return meaningful responses? (No simple null returns)
- How are exceptions in the fallback method itself handled?
- Is a cache expiration policy configured when using cache fallback?
- When using alternative service fallback, is a CircuitBreaker also configured for that service?
Test Verification
- Have unit tests verified CircuitBreaker state transitions (CLOSED, OPEN, HALF-OPEN)?
- Have integration tests reproduced actual timeout and network error scenarios?
- Has fault injection testing been performed with chaos engineering tools (Chaos Monkey, Litmus)?
- Has Bulkhead saturation behavior been verified in load tests?
Deployment Strategy
- Apply new CircuitBreaker configurations via canary deployment to a subset of traffic first
- Can configuration changes be applied without downtime via Config Server (Spring Cloud Config) or environment variables?
- Is Git history management in place for CircuitBreaker configurations?
- Is a rollback plan established?
Failure Cases and Recovery
Case 1: Downstream Overload Due to Retry Storm
Situation: Response times from the payment service began increasing. The order service had Retry configured with maxAttempts=5 and a fixed 1-second interval. With 20 order service instances each handling 100 orders per second, up to 10,000 requests per second (100 x 20 x 5) were flooding the payment service.
Cause: Fixed-interval retries were used without exponential backoff and jitter. Also, Retry was used standalone without a CircuitBreaker, so retries continued even on failure.
Recovery Procedure:
- Immediately disable Retry or set maxAttempts to 1 to stop retries
- Once the payment service load stabilizes, replace with a Retry configuration that includes exponential backoff + jitter
- Place the CircuitBreaker inside Retry so that retries are blocked when the circuit is OPEN
Prevention: Always use Retry together with CircuitBreaker, and apply exponential backoff + random jitter by default (see the sketch below). Prohibit fixed-interval retries as a policy.
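A minimal backoff-with-jitter Retry configuration might look like this; the attempt count, intervals, and instance name are illustrative:
// BackoffRetryConfig.kt
import io.github.resilience4j.core.IntervalFunction
import io.github.resilience4j.retry.Retry
import io.github.resilience4j.retry.RetryConfig
import java.io.IOException

val retryConfig: RetryConfig = RetryConfig.custom<Any>()
    .maxAttempts(3)
    // 500 ms initial interval, doubled on each attempt, randomized by +/-50%
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500L, 2.0, 0.5))
    .retryExceptions(IOException::class.java)
    .build()

val paymentRetry: Retry = Retry.of("paymentGateway", retryConfig)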
Case 2: Permanent OPEN Circuit Due to Incorrect Exception Classification
Situation: After deploying a new feature to the inventory service, certain product queries started returning 400 Bad Request. These 400 responses were caught as HttpClientErrorException and included in the failure rate calculation, causing the CircuitBreaker to transition to OPEN and block all inventory queries. Even normal product queries became impossible.
Cause: recordExceptions included HttpClientErrorException (4xx). Since 4xx errors are client-side issues, the circuit breaker should not intervene. Circuit breakers should only respond to server-side failures (5xx, timeouts, connection failures).
Recovery Procedure:
- Manually transition the CircuitBreaker so traffic flows again. Note that Resilience4j has no FORCED_CLOSE state: DISABLED lets all calls through while bypassing the breaker, and transitionToClosedState() resets it to normal operation
// Programmatic state transition via the registry
circuitBreakerRegistry.circuitBreaker("inventoryService")
    .transitionToDisabledState(); // or transitionToClosedState() to reset
- Remove HttpClientErrorException from recordExceptions and add it to ignoreExceptions
- After the corrected configuration is deployed, call transitionToClosedState() to leave DISABLED and return to normal CircuitBreaker operation
Prevention: Document exception classification principles. 4xx (client errors) go in ignoreExceptions, 5xx (server errors) go in recordExceptions, and business validation exceptions go in ignoreExceptions.
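One way to encode that policy is a recording predicate instead of class lists. This sketch assumes Spring's RestClient/RestTemplate exception hierarchy:
// ServerErrorOnlyConfig.kt
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import org.springframework.web.client.HttpServerErrorException
import org.springframework.web.client.ResourceAccessException

val serverErrorOnly: CircuitBreakerConfig = CircuitBreakerConfig.custom()
    // Count only server-side failures: 5xx responses and I/O-level problems
    // (timeouts, connection refused). 4xx exceptions such as
    // HttpClientErrorException do not match and are not counted as failures.
    .recordException { throwable ->
        throwable is HttpServerErrorException || throwable is ResourceAccessException
    }
    .build()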
Case 3: Traffic Loss Due to HALF-OPEN Bottleneck
Situation: Even after the payment service recovered, order processing throughput did not recover. Traffic analysis revealed that the CircuitBreaker was set with permittedNumberOfCallsInHalfOpenState=1 in the HALF-OPEN state, allowing only 1 trial call. This trial call intermittently failed, causing flapping between OPEN and HALF-OPEN states.
Cause: The permittedNumberOfCallsInHalfOpenState value was too low. With only 1 trial call, a single failure returns the circuit to OPEN, making it difficult to return to CLOSED when the downstream service responds only intermittently.
Recovery Procedure:
- Increase permittedNumberOfCallsInHalfOpenState to 5-10
- Verify that automaticTransitionFromOpenToHalfOpenEnabled is set to true for automatic transitions without manual intervention
- Adjust waitDurationInOpenState to match the downstream service's average recovery time
Prevention: Allow at least 3 trial calls in HALF-OPEN so that, combined with the failure rate threshold, the decision is statistically meaningful. Add OPEN-HALF_OPEN flapping detection to monitoring alerts; the state-transition event listener sketched below is one hook for this.
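Resilience4j publishes every state change as an event, so a listener like this hypothetical sketch can log or count rapid OPEN/HALF_OPEN cycles:
// FlappingWatcher.kt
import io.github.resilience4j.circuitbreaker.CircuitBreaker
import org.slf4j.LoggerFactory

private val log = LoggerFactory.getLogger("FlappingWatcher")

fun watchTransitions(cb: CircuitBreaker) {
    // Each transition event carries its from/to states; counting rapid
    // OPEN -> HALF_OPEN -> OPEN cycles here is one way to surface flapping
    cb.eventPublisher.onStateTransition { event ->
        val transition = event.stateTransition
        log.warn(
            "CircuitBreaker '{}' transitioned {} -> {}",
            event.circuitBreakerName, transition.fromState, transition.toState
        )
    }
}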
Advanced Pattern: Custom CircuitBreaker Registry
As the number of services grows, configuring CircuitBreakers individually for each service can become inefficient. You can implement a custom registry that dynamically creates and manages CircuitBreakers.
// DynamicCircuitBreakerFactory.kt
@Component
class DynamicCircuitBreakerFactory(
private val circuitBreakerRegistry: CircuitBreakerRegistry,
private val meterRegistry: MeterRegistry,
) {
private val log = LoggerFactory.getLogger(javaClass)
/**
* Dynamically creates a CircuitBreaker based on service name.
* Returns an existing instance if one already exists.
*/
fun getOrCreate(
serviceName: String,
tier: ServiceTier = ServiceTier.STANDARD,
): CircuitBreaker {
return circuitBreakerRegistry.circuitBreaker(serviceName) {
buildConfigForTier(tier)
}.also { cb ->
log.info(
"CircuitBreaker created/retrieved: name={}, tier={}, state={}",
serviceName, tier, cb.state
)
}
}
private fun buildConfigForTier(tier: ServiceTier): CircuitBreakerConfig {
return when (tier) {
ServiceTier.CRITICAL -> CircuitBreakerConfig.custom()
.failureRateThreshold(30f)
.slowCallRateThreshold(60f)
.slowCallDurationThreshold(Duration.ofSeconds(2))
.waitDurationInOpenState(Duration.ofSeconds(60))
.slidingWindowSize(20)
.minimumNumberOfCalls(10)
.permittedNumberOfCallsInHalfOpenState(5)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
ServiceTier.STANDARD -> CircuitBreakerConfig.custom()
.failureRateThreshold(50f)
.slowCallRateThreshold(80f)
.slowCallDurationThreshold(Duration.ofSeconds(3))
.waitDurationInOpenState(Duration.ofSeconds(30))
.slidingWindowSize(10)
.minimumNumberOfCalls(5)
.permittedNumberOfCallsInHalfOpenState(3)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
ServiceTier.BEST_EFFORT -> CircuitBreakerConfig.custom()
.failureRateThreshold(70f)
.slowCallRateThreshold(90f)
.slowCallDurationThreshold(Duration.ofSeconds(5))
.waitDurationInOpenState(Duration.ofSeconds(15))
.slidingWindowSize(5)
.minimumNumberOfCalls(3)
.permittedNumberOfCallsInHalfOpenState(2)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
}
}
init {
    // Bind registry-level metrics once at construction; the tagged binder
    // also picks up CircuitBreakers added later, so binding again on every
    // getOrCreate() call would only register duplicate meters.
    TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(circuitBreakerRegistry)
        .bindTo(meterRegistry)
}
enum class ServiceTier {
CRITICAL, // Core services like payment, authentication
STANDARD, // General services like inventory, shipping
BEST_EFFORT // Non-critical services like notifications, recommendations
}
}
Using this factory, appropriate CircuitBreaker configurations are automatically applied based on the service tier. CRITICAL services are conservatively protected with low failure rate thresholds and long wait durations, while BEST_EFFORT services operate flexibly with high thresholds.
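At the call site, usage reduces to fetching the breaker and decorating the call. The service name and downstream call below are illustrative:
// RecommendationClient.kt
import io.github.resilience4j.circuitbreaker.CircuitBreaker

fun fetchRecommendations(factory: DynamicCircuitBreakerFactory): List<String> {
    val cb = factory.getOrCreate(
        "recommendationService",
        DynamicCircuitBreakerFactory.ServiceTier.BEST_EFFORT,
    )
    // CallNotPermittedException is thrown here while the circuit is OPEN
    return CircuitBreaker.decorateSupplier(cb) { callRecommendationApi() }.get()
}

// Hypothetical downstream call, for illustration only
fun callRecommendationApi(): List<String> = TODO("HTTP call to the recommendation API")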
Test Strategy
Here are the essential test cases that must be written when introducing CircuitBreakers.
// CircuitBreakerIntegrationTest.java
@SpringBootTest
@AutoConfigureMockMvc
class CircuitBreakerIntegrationTest {
@Autowired
private CircuitBreakerRegistry circuitBreakerRegistry;
@Autowired
private MockMvc mockMvc;
@MockBean
private RestClient paymentRestClient;
@Test
@DisplayName("CircuitBreaker transitions to OPEN when failure rate threshold is exceeded")
void shouldTransitionToOpenOnFailureThreshold() {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset(); // Reset state for test isolation
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
// With slidingWindowSize=20, failureRateThreshold=40
// Must call minimumNumberOfCalls=10 or more, then 40%+ failure -> OPEN
// 10 calls with 5 failures = 50% failure rate -> OPEN transition
for (int i = 0; i < 5; i++) {
cb.onSuccess(100, TimeUnit.MILLISECONDS);
}
for (int i = 0; i < 5; i++) {
cb.onError(100, TimeUnit.MILLISECONDS, new IOException("connection refused"));
}
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
assertThat(cb.getMetrics().getFailureRate()).isGreaterThanOrEqualTo(40f);
}
@Test
@DisplayName("Transitions to HALF-OPEN after waitDuration elapses in OPEN state")
void shouldTransitionToHalfOpenAfterWaitDuration() throws Exception {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset();
cb.transitionToOpenState();
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
// Simulate waitDurationInOpenState elapsed
// (In tests, use a short waitDuration setting or transition directly)
cb.transitionToHalfOpenState();
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.HALF_OPEN);
}
@Test
@DisplayName("Transitions to CLOSED on successful trial calls in HALF-OPEN")
void shouldTransitionToClosedOnSuccessfulTrialCalls() {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset();
cb.transitionToOpenState();
cb.transitionToHalfOpenState();
// permittedNumberOfCallsInHalfOpenState=5 successful calls
for (int i = 0; i < 5; i++) {
cb.onSuccess(50, TimeUnit.MILLISECONDS);
}
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
}
@Test
@DisplayName("Fallback method is properly called when CircuitBreaker is OPEN")
void shouldInvokeFallbackWhenCircuitIsOpen() throws Exception {
// Force circuit to OPEN state
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.transitionToForcedOpenState();
mockMvc.perform(post("/api/v1/orders")
.contentType(MediaType.APPLICATION_JSON)
.content("{\"productId\": \"P001\", \"quantity\": 1}"))
.andExpect(status().isOk())
.andExpect(jsonPath("$.payment.status").value("QUEUED"))
.andExpect(jsonPath("$.payment.message").exists());
// Restore state after test
cb.transitionToClosedState();
}
}
References
- Resilience4j CircuitBreaker Official Documentation - Detailed configuration and operating principles of the CircuitBreaker module
- Spring Cloud Circuit Breaker Reference - Spring Cloud and Resilience4j integration guide
- Spring Boot Circuit Breaker Pattern with Resilience4j - GeeksforGeeks - Step-by-step implementation tutorial in Spring Boot
- Circuit Breaker Pattern in Microservices - Java Guides - Circuit Breaker design pattern in microservices architecture
- Circuit Breaker Pattern for Resilient Systems - DZone - Practical application of Circuit Breaker for distributed system resilience
- Martin Fowler: CircuitBreaker - The original explanation of the Circuit Breaker pattern
- Resilience4j GitHub Repository - Source code and release notes