Circuit Breaker Pattern and Resilience4j Practical Implementation Guide: From Failure Isolation to Recovery

Introduction

In microservices architecture, inter-service network calls are inherently unreliable. Network latency, timeouts, and downstream service failures occur routinely, and without proper controls, a single service failure can propagate across the entire system as a **cascading failure**. A representative example occurred in late 2024, when response latency in a single payment gateway on a large e-commerce platform cascaded to paralyze the order service, inventory service, and notification service.

The Circuit Breaker pattern is a failure isolation mechanism inspired by electrical circuit breakers. Michael Nygard introduced it in Release It! (2007), and Martin Fowler's blog post later popularized it; it has since become a core pattern in the microservices world. Netflix's Hystrix was the first widely adopted implementation, but after it entered maintenance mode in 2018, Resilience4j emerged as the de facto standard for JVM services.

This article covers everything from the Circuit Breaker state machine operating principles, to integrating Resilience4j's core modules -- CircuitBreaker, Retry, Bulkhead, and RateLimiter -- in a Spring Boot 3 environment, monitoring with Grafana dashboards, and recovery strategies for real failure scenarios at an operational level.

Circuit Breaker State Machine

The core of the Circuit Breaker is a Finite State Machine that manages transitions between three normal states (CLOSED, OPEN, HALF-OPEN) and special states that are entered only manually (DISABLED, FORCED_OPEN, and, in recent Resilience4j versions, METRICS_ONLY).

State Transition Diagram

- CLOSED (normal) -> OPEN (blocked): failure rate >= threshold
- OPEN (blocked) -> HALF-OPEN (trial calls allowed): waitDurationInOpenState elapsed
- HALF-OPEN -> CLOSED: failure rate of the trial calls is below the threshold
- HALF-OPEN -> OPEN (blocked again): failure rate of the trial calls >= threshold

Detailed Behavior by State

| State | Request Handling | Transition Condition | Metric Collection |
| ----------- | ------------------------------------------------------------- | ------------------------------------------------------------------------- | ---------------------------------- |
| CLOSED | All requests pass through | Transitions to OPEN when failure rate in sliding window exceeds threshold | Records success/failure/slow calls |
| OPEN | All requests immediately rejected (CallNotPermittedException) | Transitions to HALF-OPEN after waitDurationInOpenState elapses | Records rejected call count |
| HALF-OPEN | Allows only permittedNumberOfCalls | Transitions to CLOSED or OPEN based on trial call results | Records trial call success/failure |
| DISABLED | All requests pass through (circuit inactive) | Manual transition only | No metric collection |
| FORCED_OPEN | All requests immediately rejected | Manual transition only | Records rejected call count |
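
The three normal states can be sketched as a small state machine. The following is a minimal illustration in plain Java, not Resilience4j's implementation: the class `MiniCircuitBreaker` and its parameter names are hypothetical, and while CLOSED it evaluates fixed batches of calls instead of a true sliding window to keep the example short.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative three-state breaker; not Resilience4j's implementation. */
class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final float failureRateThreshold;       // percent, e.g. 50
    private final int callsPerEvaluation;           // batch size evaluated while CLOSED
    private final int permittedHalfOpenCalls;       // trial calls evaluated while HALF_OPEN
    private final Duration waitDurationInOpenState;

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private Instant openedAt;

    MiniCircuitBreaker(float failureRateThreshold, int callsPerEvaluation,
                       int permittedHalfOpenCalls, Duration waitDurationInOpenState) {
        this.failureRateThreshold = failureRateThreshold;
        this.callsPerEvaluation = callsPerEvaluation;
        this.permittedHalfOpenCalls = permittedHalfOpenCalls;
        this.waitDurationInOpenState = waitDurationInOpenState;
    }

    /** Returns true if a call may proceed at the given time. */
    synchronized boolean tryAcquire(Instant now) {
        if (state == State.OPEN && !now.isBefore(openedAt.plus(waitDurationInOpenState))) {
            transitionTo(State.HALF_OPEN); // wait elapsed: allow trial calls
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a permitted call and evaluates the failure rate. */
    synchronized void record(boolean success, Instant now) {
        if (state == State.OPEN) {
            return; // nothing is executed while OPEN
        }
        calls++;
        if (!success) {
            failures++;
        }
        int required = (state == State.HALF_OPEN) ? permittedHalfOpenCalls : callsPerEvaluation;
        if (calls < required) {
            return; // not enough data yet
        }
        float failureRate = 100f * failures / calls;
        if (failureRate >= failureRateThreshold) {
            openedAt = now;
            transitionTo(State.OPEN);
        } else {
            transitionTo(State.CLOSED);
        }
    }

    synchronized State state() {
        return state;
    }

    private void transitionTo(State next) {
        state = next;
        calls = 0;
        failures = 0;
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic, so the OPEN to HALF-OPEN transition can be exercised without sleeping.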

Sliding Window Type Comparison

Resilience4j provides two sliding window types.

| Aspect | COUNT_BASED | TIME_BASED |
| -------------- | --------------------------------------- | ---------------------------------------------- |
| Basis | Last N calls | Calls in the last N seconds |
| Config Example | slidingWindowSize: 10 (last 10 calls) | slidingWindowSize: 60 (last 60 seconds) |
| Memory Usage | Fixed (array of N results) | Fixed (N per-second partial aggregations) |
| Suited For | Services with consistent call frequency | Services with irregular call frequency |
| Evaluation | On each call, once minimumNumberOfCalls have been recorded | On each call, once minimumNumberOfCalls have been recorded within the window |

COUNT_BASED is implemented as a circular array of the last N call outcomes. Recording a call evicts the oldest outcome and updates a running total, so both recording a result and computing the failure rate take O(1) time. TIME_BASED uses a circular array of N partial aggregation buckets, each summarizing the calls made during one second.
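
To make the count-based window concrete, here is a minimal sketch in plain Java. It is an illustration of the circular-buffer idea only, not the library's code; the class name `CountBasedWindow` and the choice to return -1 as a "not enough data" marker are assumptions of this sketch.

```java
/** Count-based sliding window over the last N call outcomes (illustrative only). */
class CountBasedWindow {
    private final boolean[] failed; // circular buffer: true = the call failed
    private int next;               // index of the slot to overwrite next
    private int size;               // number of outcomes stored, capped at N
    private int failures;           // running count of failures in the buffer

    CountBasedWindow(int n) {
        this.failed = new boolean[n];
    }

    /** Records one outcome in O(1), evicting the oldest once the buffer is full. */
    void record(boolean failure) {
        if (size == failed.length) {
            if (failed[next]) {
                failures--; // the evicted outcome no longer counts
            }
        } else {
            size++;
        }
        failed[next] = failure;
        if (failure) {
            failures++;
        }
        next = (next + 1) % failed.length;
    }

    /** Failure rate in percent, or -1 until minimumNumberOfCalls outcomes exist. */
    float failureRate(int minimumNumberOfCalls) {
        if (size < minimumNumberOfCalls) {
            return -1f;
        }
        return 100f * failures / size;
    }
}
```

Because only the running total changes on each call, the cost of evaluating the threshold does not depend on the window size.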

Resilience4j Architecture

Transitioning from Hystrix to Resilience4j

After Netflix Hystrix entered maintenance mode in 2018, Resilience4j established itself as the standard fault tolerance library in the JVM ecosystem.

| Comparison | Netflix Hystrix | Resilience4j |
| -------------------- | ---------------------------------------- | ------------------------------------------------- |
| Status | Maintenance mode (no updates since 2018) | Active development (2.3.0 release in 2025) |
| Java Version | Java 8+ | Java 17+ (Spring Boot 3 support) |
| Dependencies | Multiple (Archaius, RxJava, etc.) | Minimal (Vavr only in 1.x; removed in 2.x) |
| Architecture | Monolithic (all features included) | Modular (select only needed modules) |
| Thread Model | Separate thread pool required | Semaphore-based (thread pool optional) |
| Configuration | Archaius required | Both application.yml and programmatic |
| Reactive Support | RxJava 1 | Native Reactor, RxJava 2/3 support |
| Functional Interface | Limited | Full support (Supplier, Function, Runnable, etc.) |
| Monitoring | Hystrix Dashboard | Micrometer integration (Prometheus, Grafana) |

Resilience4j Core Modules

Resilience4j provides five core modules that can be used independently or in combination.

| Module | Role | Key Configuration |
| -------------- | -------------------------------------- | --------------------------------------- |
| CircuitBreaker | Circuit tripping based on failure rate | failureRateThreshold, slidingWindowSize |
| Retry | Retry on failure | maxAttempts, waitDuration, backoff |
| Bulkhead | Limit concurrent calls (isolation) | maxConcurrentCalls, maxWaitDuration |
| RateLimiter | Limit calls per time unit | limitForPeriod, limitRefreshPeriod |
| TimeLimiter | Limit call duration | timeoutDuration, cancelRunningFuture |

When combining via annotations, the application order is as follows:

Outer (evaluated first) ──────────────────────────────────> Inner (evaluated last)

Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead

This order is the default priority when Resilience4j processes annotations via Spring AOP. You can customize the order using properties like `resilience4j.circuitbreaker.circuitBreakerAspectOrder`.
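
What "outer" and "inner" mean here can be shown with plain function composition. The sketch below uses no Resilience4j classes: `wrap` is a hypothetical helper that only logs when a layer is entered and left, and the layer names are labels, not real decorators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Shows the nesting order of the default aspect chain with plain wrappers. */
class AspectOrderDemo {
    static final List<String> trace = new ArrayList<>();

    /** Wraps a supplier so that entering and leaving the layer is recorded. */
    static <T> Supplier<T> wrap(String name, Supplier<T> inner) {
        return () -> {
            trace.add("enter " + name);
            try {
                return inner.get();
            } finally {
                trace.add("exit " + name);
            }
        };
    }

    static List<String> run() {
        trace.clear();
        Supplier<String> call = () -> {
            trace.add("call");
            return "ok";
        };
        // Retry is the outermost layer, Bulkhead the innermost.
        Supplier<String> decorated =
            wrap("Retry",
                wrap("CircuitBreaker",
                    wrap("RateLimiter",
                        wrap("TimeLimiter",
                            wrap("Bulkhead", call)))));
        decorated.get();
        return List.copyOf(trace);
    }
}
```

Because Retry is outermost, each retry attempt re-enters every inner layer, which is why a retried call is recorded by the CircuitBreaker once per attempt.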

Spring Boot 3 Integration Configuration

Dependency Setup

// build.gradle.kts (Spring Boot 3.3+ / Resilience4j 2.2+)

plugins {

id("org.springframework.boot") version "3.3.5"

id("io.spring.dependency-management") version "1.1.6"

kotlin("jvm") version "1.9.25"

kotlin("plugin.spring") version "1.9.25"

}

dependencies {

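// Resilience4j BOM: supplies versions for the version-less module declarations below

implementation(platform("io.github.resilience4j:resilience4j-bom:2.2.0"))

// Resilience4j Spring Boot 3 Starter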

implementation("io.github.resilience4j:resilience4j-spring-boot3:2.2.0")

// Individual modules (included in starter but explicit declaration recommended)

implementation("io.github.resilience4j:resilience4j-circuitbreaker")

implementation("io.github.resilience4j:resilience4j-retry")

implementation("io.github.resilience4j:resilience4j-bulkhead")

implementation("io.github.resilience4j:resilience4j-ratelimiter")

implementation("io.github.resilience4j:resilience4j-timelimiter")

// Micrometer + Prometheus (monitoring)

implementation("io.github.resilience4j:resilience4j-micrometer")

implementation("io.micrometer:micrometer-registry-prometheus")

// Spring Boot Actuator

implementation("org.springframework.boot:spring-boot-starter-actuator")

implementation("org.springframework.boot:spring-boot-starter-aop")

implementation("org.springframework.boot:spring-boot-starter-web")

// Kotlin Coroutines (optional)

implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core")

implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactor")

testImplementation("org.springframework.boot:spring-boot-starter-test")

}

Integrated Configuration File

# application.yml - Resilience4j integrated configuration
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 3s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        failureRateThreshold: 40
        waitDurationInOpenState: 60s
        slidingWindowSize: 20
      inventoryService:
        baseConfig: default
        failureRateThreshold: 60
        slowCallDurationThreshold: 5s
      notificationService:
        baseConfig: default
        failureRateThreshold: 70
        waitDurationInOpenState: 15s
  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2.0
        exponentialMaxWaitDuration: 10s
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        maxAttempts: 2
        waitDuration: 2s
      inventoryService:
        baseConfig: default
        maxAttempts: 4
      notificationService:
        baseConfig: default
        maxAttempts: 5
        waitDuration: 500ms
  bulkhead:
    configs:
      default:
        maxConcurrentCalls: 25
        maxWaitDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        maxConcurrentCalls: 15
      inventoryService:
        baseConfig: default
        maxConcurrentCalls: 30
      notificationService:
        baseConfig: default
        maxConcurrentCalls: 50
  ratelimiter:
    configs:
      default:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
        timeoutDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        limitForPeriod: 50
      inventoryService:
        baseConfig: default
        limitForPeriod: 200
  timelimiter:
    configs:
      default:
        timeoutDuration: 5s
        cancelRunningFuture: true
    instances:
      paymentGateway:
        baseConfig: default
        timeoutDuration: 10s
      inventoryService:
        baseConfig: default
        timeoutDuration: 3s

# Actuator metric exposure
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus,circuitbreakers,retries
  endpoint:
    health:
      show-details: always
  health:
    circuitbreakers:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        resilience4j.circuitbreaker.calls: true
        resilience4j.retry.calls: true
    tags:
      application: order-service

A notable point in the configuration is defining a base profile with `configs.default` and specifying `baseConfig: default` for each instance to inherit common settings. You only need to override thresholds specific to each service to minimize configuration duplication.

CircuitBreaker Practical Implementation

Annotation-Based Implementation (Kotlin)

// PaymentGatewayClient.kt

@Service

class PaymentGatewayClient(

private val restClient: RestClient,

private val paymentRetryQueue: PaymentRetryQueue,

private val paymentCacheStore: PaymentCacheStore,

) {

companion object {

private val log = LoggerFactory.getLogger(PaymentGatewayClient::class.java)

const val CB_NAME = "paymentGateway"

}

@CircuitBreaker(name = CB_NAME, fallbackMethod = "paymentFallback")

@Retry(name = CB_NAME)

@Bulkhead(name = CB_NAME)

fun processPayment(request: PaymentRequest): PaymentResponse {

log.info("Calling payment gateway for orderId={}", request.orderId)

val response = restClient.post()

.uri("https://payment-api.internal/v2/charges")

.contentType(MediaType.APPLICATION_JSON)

.body(request)

.retrieve()

.body(PaymentResponse::class.java)

?: throw PaymentGatewayException("Empty response from payment gateway")

log.info("Payment processed: orderId={}, txId={}", request.orderId, response.transactionId)

return response

}

/**

* Fallback method: Called when CircuitBreaker is OPEN or an exception occurs.

* The method signature must match the original + accept an Exception as the last parameter.

*/

private fun paymentFallback(request: PaymentRequest, ex: Exception): PaymentResponse {

log.warn(

"Payment fallback activated: orderId={}, reason={}",

request.orderId, ex.message

)

return when (ex) {

is CallNotPermittedException -> {

// CircuitBreaker OPEN state: enqueue for async processing

paymentRetryQueue.enqueue(request)

PaymentResponse(

orderId = request.orderId,

status = PaymentStatus.QUEUED,

message = "Payment has been queued. It will be processed shortly.",

transactionId = null,

)

}

is BulkheadFullException -> {

// Bulkhead saturated: ask the client to retry shortly

PaymentResponse(

orderId = request.orderId,

status = PaymentStatus.RETRY_LATER,

message = "Too many payment requests at the moment. Please try again shortly.",

transactionId = null,

)

}

else -> {

// Other exceptions: return cached payment info if available

val cached = paymentCacheStore.getLastSuccess(request.orderId)

if (cached != null) {

log.info("Returning cached payment for orderId={}", request.orderId)

cached.copy(status = PaymentStatus.CACHED)

} else {

paymentRetryQueue.enqueue(request)

PaymentResponse(

orderId = request.orderId,

status = PaymentStatus.PENDING,

message = "An error occurred during payment processing. Automatic retry in progress.",

transactionId = null,

)

}

}

}

}

}

Programmatic Implementation (Java)

Instead of annotations, you can use the CircuitBreakerRegistry directly to dynamically create circuit breakers or change configurations at runtime.

// InventoryServiceClient.java

@Service

@Slf4j

public class InventoryServiceClient {

private final CircuitBreaker circuitBreaker;

private final Retry retry;

private final Bulkhead bulkhead;

private final RestClient restClient;

public InventoryServiceClient(

CircuitBreakerRegistry cbRegistry,

RetryRegistry retryRegistry,

BulkheadRegistry bulkheadRegistry,

RestClient.Builder restClientBuilder) {

this.circuitBreaker = cbRegistry.circuitBreaker("inventoryService");

this.retry = retryRegistry.retry("inventoryService");

this.bulkhead = bulkheadRegistry.bulkhead("inventoryService");

this.restClient = restClientBuilder

.baseUrl("https://inventory-api.internal")

.build();

// Register event listeners

registerEventListeners();

}

public InventoryResponse checkStock(String productId, int quantity) {

// Decorators wrap inside-out: Bulkhead is innermost, then CircuitBreaker, then Retry; the fallback is outermost

Supplier<InventoryResponse> decorated = Decorators

.ofSupplier(() -> doCheckStock(productId, quantity))

.withBulkhead(bulkhead)

.withCircuitBreaker(circuitBreaker)

.withRetry(retry)

.withFallback(

List.of(

CallNotPermittedException.class,

BulkheadFullException.class,

IOException.class

),

ex -> stockFallback(productId, quantity, ex)

)

.decorate();

return decorated.get();

}

private InventoryResponse doCheckStock(String productId, int quantity) {

return restClient.get()

.uri("/v1/stock/{productId}?qty={qty}", productId, quantity)

.retrieve()

.body(InventoryResponse.class);

}

private InventoryResponse stockFallback(

String productId, int quantity, Throwable ex) {

log.warn("Inventory fallback: productId={}, reason={}", productId, ex.getMessage());

// When stock is uncertain, accept the order but schedule async verification

return InventoryResponse.builder()

.productId(productId)

.available(true)

.reservationStatus(ReservationStatus.TENTATIVE)

.message("Stock check delayed: tentative approval with async verification scheduled")

.build();

}

private void registerEventListeners() {

circuitBreaker.getEventPublisher()

.onStateTransition(event -> {

log.warn("[CircuitBreaker] {} state: {} -> {}",

event.getCircuitBreakerName(),

event.getStateTransition().getFromState(),

event.getStateTransition().getToState());

})

.onError(event ->

log.error("[CircuitBreaker] {} error: {} ({}ms)",

event.getCircuitBreakerName(),

event.getThrowable().getMessage(),

event.getElapsedDuration().toMillis())

)

.onSuccess(event ->

log.debug("[CircuitBreaker] {} success ({}ms)",

event.getCircuitBreakerName(),

event.getElapsedDuration().toMillis())

)

.onCallNotPermitted(event ->

log.warn("[CircuitBreaker] {} call not permitted (OPEN state)",

event.getCircuitBreakerName())

);

retry.getEventPublisher()

.onRetry(event ->

log.info("[Retry] {} attempt #{} (wait: {}ms)",

event.getName(),

event.getNumberOfRetryAttempts(),

event.getWaitInterval().toMillis())

);

}

}

Retry, Bulkhead, and RateLimiter Combinations

Retry and Exponential Backoff

The most important aspect of retry strategy is combining exponential backoff with jitter. Fixed-interval retries cause a **thundering herd** problem where multiple clients retry simultaneously, concentrating load on the server.

// Programmatic RetryConfig customization

@Configuration

class ResilienceConfig {

@Bean

fun customRetryConfig(): RetryConfig {

return RetryConfig.custom<RetryConfig>()

.maxAttempts(4)

.intervalFunction(

// Exponential backoff + jitter: with maxAttempts = 4 there are up to three waits of about 1s, 2s, and 4s, each randomized

IntervalFunction.ofExponentialRandomBackoff(

Duration.ofSeconds(1), // initial wait duration

2.0, // multiplier

Duration.ofSeconds(15) // max wait duration

)

)

.retryOnException { ex ->

// Determine retry-eligible exceptions

when (ex) {

is IOException -> true

is TimeoutException -> true

is HttpServerErrorException -> true

is ConnectException -> true

else -> false

}

}

.ignoreExceptions(

BusinessValidationException::class.java,

IllegalArgumentException::class.java

)

.failAfterMaxAttempts(true) // With retryOnResult, throw MaxRetriesExceededException once attempts are exhausted; exceptions are rethrown as-is

.build()

}

@Bean

fun retryRegistry(customRetryConfig: RetryConfig): RetryRegistry {

return RetryRegistry.of(customRetryConfig)

}

}
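
The wait times that exponential backoff with jitter produces can be sketched in a few lines of plain Java. This is an illustration of the formula only, not the library's `IntervalFunction`; the class `BackoffSketch`, its parameter names, and the randomization factor passed in are assumptions of this sketch.

```java
import java.util.Random;

/** Computes exponential backoff delays with symmetric random jitter (illustrative). */
class BackoffSketch {
    /**
     * Delay before retry number `attempt` (1-based):
     * initial * multiplier^(attempt - 1), capped at maxMillis,
     * then spread uniformly within +/- randomizationFactor of that value.
     */
    static long delayMillis(int attempt, long initialMillis, double multiplier,
                            long maxMillis, double randomizationFactor, Random rnd) {
        double base = initialMillis * Math.pow(multiplier, attempt - 1);
        double capped = Math.min(base, maxMillis);
        double delta = capped * randomizationFactor;
        double jittered = capped - delta + rnd.nextDouble() * 2 * delta;
        return Math.round(Math.min(jittered, maxMillis));
    }
}
```

With a randomization factor of 0 the delays are exactly 1s, 2s, 4s, 8s and then the cap; with a factor such as 0.5, clients that failed at the same moment spread their retries across a range instead of hitting the server together.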

Bulkhead: Semaphore vs Thread Pool

Bulkhead is a pattern inspired by ship compartment walls (bulkheads), preventing a single service call from monopolizing all resources. Resilience4j provides two Bulkhead implementations.

| Aspect | SemaphoreBulkhead | ThreadPoolBulkhead |
| --------------- | ----------------------------------- | ---------------------------------------------------- |
| Isolation Level | Limits concurrent calls | Executes in a separate thread pool |
| Call Thread | Uses the caller's thread directly | Uses threads from a dedicated thread pool |
| Return Type | Synchronous return | CompletionStage return |
| Overhead | Low | Thread context switching cost |
| Suited For | Most HTTP calls | CPU-intensive tasks, when full isolation is needed |
| Configuration | maxConcurrentCalls, maxWaitDuration | maxThreadPoolSize, coreThreadPoolSize, queueCapacity |
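
The semaphore variant can be sketched with `java.util.concurrent.Semaphore`. The class below is a hypothetical illustration, not Resilience4j's `SemaphoreBulkhead`; it throws a plain `IllegalStateException` where the library would throw `BulkheadFullException`.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Limits concurrent calls on the caller's own thread (illustrative only). */
class SemaphoreBulkheadSketch {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkheadSketch(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls, true);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs the call if a permit becomes available within maxWaitMillis. */
    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead is full");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

The call runs on the caller's thread, so there is no context switch; the trade-off is that a slow call still occupies that thread for its full duration.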

# ThreadPoolBulkhead configuration example
resilience4j:
  thread-pool-bulkhead:
    instances:
      heavyProcessing:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 100ms
        writableStackTraceEnabled: true

RateLimiter Configuration and Usage

RateLimiter limits the number of calls allowed per time unit, preventing external API rate limit violations or protecting internal services from overload.
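
How `limitForPeriod` and `limitRefreshPeriod` interact can be sketched as a fixed-period permit counter. This is a simplified illustration rather than the library's algorithm: the class `FixedPeriodRateLimiter` is hypothetical, and it rejects immediately instead of waiting up to `timeoutDuration` for the next period.

```java
/** Grants at most limitForPeriod permits per refresh period (illustrative only). */
class FixedPeriodRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodNanos;
    private long periodStartNanos;
    private int used;

    FixedPeriodRateLimiter(int limitForPeriod, long refreshPeriodNanos, long startNanos) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
        this.periodStartNanos = startNanos;
    }

    /** Returns true if a permit is available in the period containing nowNanos. */
    synchronized boolean tryAcquire(long nowNanos) {
        long elapsed = nowNanos - periodStartNanos;
        if (elapsed >= refreshPeriodNanos) {
            // Jump to the start of the current period and refresh the permits.
            long periods = elapsed / refreshPeriodNanos;
            periodStartNanos += periods * refreshPeriodNanos;
            used = 0;
        }
        if (used >= limitForPeriod) {
            return false;
        }
        used++;
        return true;
    }
}
```

Unused permits do not carry over to the next period in this sketch, so a burst right after a refresh can still reach the full `limitForPeriod`.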

// RateLimiter and CircuitBreaker combination

@Service

@Slf4j

@RequiredArgsConstructor // Lombok: generates the constructor that injects restClient

public class ExternalApiClient {

private final RestClient restClient;

@CircuitBreaker(name = "externalApi", fallbackMethod = "apiFallback")

@RateLimiter(name = "externalApi")

@Retry(name = "externalApi")

public ApiResponse callExternalApi(ApiRequest request) {

log.debug("Calling external API: endpoint={}", request.getEndpoint());

return restClient.post()

.uri(request.getEndpoint())

.body(request.getPayload())

.retrieve()

.body(ApiResponse.class);

}

private ApiResponse apiFallback(ApiRequest request, RequestNotPermitted ex) {

// Rejected by RateLimiter

log.warn("Rate limit exceeded for external API: {}", request.getEndpoint());

return ApiResponse.rateLimited(

"Request limit exceeded. " +

"Check limitForPeriod settings or try again later."

);

}

private ApiResponse apiFallback(ApiRequest request, Exception ex) {

// Other exceptions (CircuitBreaker OPEN, network errors, etc.)

log.warn("External API fallback: endpoint={}, reason={}",

request.getEndpoint(), ex.getMessage());

return ApiResponse.error("External API call failed: " + ex.getMessage());

}

}

An important note when overloading fallback methods: Resilience4j selects the most specific fallback based on exception type. By separating `RequestNotPermitted` (RateLimiter rejection) and `Exception` (general exceptions), you can execute different fallback logic based on the exception cause.

Grafana Monitoring Dashboard

Prometheus Metric Collection

Resilience4j automatically exposes metrics via Micrometer. The following metrics are available at Spring Boot Actuator's `/actuator/prometheus` endpoint:

# CircuitBreaker state: one series per state label (closed, open, half_open, disabled, forced_open, ...); value 1 = active state, 0 = otherwise
resilience4j_circuitbreaker_state{name="paymentGateway", state="open"}

# Failure rate (%)
resilience4j_circuitbreaker_failure_rate{name="paymentGateway"}

# Slow call rate (%)
resilience4j_circuitbreaker_slow_call_rate{name="paymentGateway"}

# Call statistics (kind: successful, failed, ignored); rejected calls are reported separately as resilience4j_circuitbreaker_not_permitted_calls_total
rate(resilience4j_circuitbreaker_calls_seconds_count{name="paymentGateway"}[5m])

# Call latency distribution (histogram)
histogram_quantile(0.95,
  rate(resilience4j_circuitbreaker_calls_seconds_bucket{name="paymentGateway"}[5m])
)

# Retry count
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="successful_with_retry"}[1h])
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="failed_with_retry"}[1h])

# Bulkhead available concurrent calls
resilience4j_bulkhead_available_concurrent_calls{name="paymentGateway"}

# RateLimiter available permissions
resilience4j_ratelimiter_available_permissions{name="externalApi"}

Grafana Dashboard JSON Configuration

Here are the essential panels and their PromQL queries for the Grafana dashboard.

**Panel 1 - CircuitBreaker State Gauge**

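resilience4j_circuitbreaker_state{application="order-service"} == 1

The metric exposes one series per state (value 1 for the active state, 0 otherwise), so filter on value 1 and display the `state` label. Use value mapping to color closed green, open red, and half_open yellow.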

**Panel 2 - Failure Rate Trend (Time Series)**

resilience4j_circuitbreaker_failure_rate{application="order-service", name=~".*"}

Add a threshold line (failureRateThreshold) to visually identify when the circuit transitions to OPEN.

**Panel 3 - Call Success/Failure Ratio (Stacked Bar)**

sum by (name, kind) (

rate(resilience4j_circuitbreaker_calls_seconds_count{application="order-service"}[5m])

)

**Panel 4 - P95 Response Time (Time Series)**

histogram_quantile(0.95,

sum by (le, name) (

rate(resilience4j_circuitbreaker_calls_seconds_bucket{application="order-service"}[5m])

)

)

**Panel 5 - Bulkhead Concurrent Call Status (Gauge)**

resilience4j_bulkhead_max_allowed_concurrent_calls{application="order-service"}

- resilience4j_bulkhead_available_concurrent_calls{application="order-service"}

Alert Rule Configuration

Register the following alert rules in Grafana or Prometheus Alertmanager.

# prometheus-alerts.yml
groups:
  - name: resilience4j_alerts
    rules:
      - alert: CircuitBreakerOpen
        # The state metric has one series per state; select the "open" series
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: 'CircuitBreaker OPEN: {{ $labels.name }}'
          description: >
            The {{ $labels.name }} circuit breaker in service
            {{ $labels.application }} is in OPEN state.
            Check downstream service failures.

      - alert: HighFailureRate
        expr: resilience4j_circuitbreaker_failure_rate > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'High failure rate: {{ $labels.name }} ({{ $value }}%)'
          description: >
            The failure rate of {{ $labels.name }} is {{ $value }}%,
            exceeding the warning threshold (30%).

      - alert: BulkheadSaturation
        expr: >
          (resilience4j_bulkhead_max_allowed_concurrent_calls
          - resilience4j_bulkhead_available_concurrent_calls)
          / resilience4j_bulkhead_max_allowed_concurrent_calls > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Bulkhead 80% saturated: {{ $labels.name }}'

      - alert: ExcessiveRetries
        # Aggregate away the kind label on both sides so the ratio is failed / all
        expr: >
          sum by (application, name) (rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m]))
          / sum by (application, name) (rate(resilience4j_retry_calls_total[5m])) > 0.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: 'Retry failure rate exceeds 50%: {{ $labels.name }}'

Troubleshooting Guide

Issue 1: CircuitBreaker Does Not Transition to OPEN

**Symptoms**: Failures are clearly occurring, but the circuit remains in CLOSED state.

**Root Cause Analysis**:

- `minimumNumberOfCalls` has not been reached. The default value is 100, so for low-frequency services the failure may be over before enough calls have been recorded for the failure rate to be evaluated.

- The exception is included in `ignoreExceptions`. Check whether the ignore list contains unintended exceptions in addition to business exceptions; subclasses of an ignored exception are ignored as well.

- The exception is not included in `recordExceptions`. When recordExceptions is specified, exceptions not in the list are not recorded as failures.

**Resolution**: Adjust `minimumNumberOfCalls` to match the service call frequency, and review the `recordExceptions` and `ignoreExceptions` lists.

Issue 2: More Calls Than Expected When Combining Retry and CircuitBreaker

**Symptoms**: maxAttempts is set to 3, but more than 5 calls are recorded on the downstream service.

**Root Cause Analysis**: In the annotation application order, Retry sits outside CircuitBreaker, so every retry attempt passes through the CircuitBreaker again and reaches the downstream service as a separate call. If retries are also configured at another layer (for example in the calling service or the HTTP client), the attempts multiply rather than add. Trial calls made during the CircuitBreaker's HALF-OPEN state add to the total as well.

**Resolution**: Set Retry's maxAttempts conservatively, and calculate the maximum number of calls produced by the combination of CircuitBreaker's slidingWindowSize and Retry's maxAttempts to predict downstream load.

Issue 3: Fallback Method Is Not Being Called

**Symptoms**: CircuitBreaker is OPEN, but CallNotPermittedException is propagated directly to the client.

**Root Cause Analysis**: The fallback method's signature does not exactly match the original method. The fallback method must accept all parameters of the original method in the same order and type, plus an Exception (or specific exception type) as the last parameter.

**Resolution**: Review the fallback method signature. The return type must also exactly match the original. Below are correct examples:

// Original method

@CircuitBreaker(name = "svc", fallbackMethod = "fallback")

public OrderResponse getOrder(String orderId, boolean includeDetails) { ... }

// Correct fallback (same parameters + Exception added)

private OrderResponse fallback(String orderId, boolean includeDetails, Exception ex) { ... }

// Incorrect fallback - compiles but fails to match at runtime

private OrderResponse fallback(String orderId, Exception ex) { ... } // Missing parameter

private void fallback(String orderId, boolean includeDetails, Exception ex) { ... } // Return type mismatch

Issue 4: Memory Usage Increases with TIME_BASED Window

**Symptoms**: Using a TIME_BASED sliding window and heap memory usage gradually increases.

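**Root Cause Analysis**: The slidingWindowSize is set too large. For example, slidingWindowSize=600 (10 minutes) keeps 600 partial aggregation buckets per CircuitBreaker instance. Each bucket holds only aggregated counters and durations rather than individual call records, so the footprint grows with the window size and the number of CircuitBreaker instances, not with traffic volume. This can become noticeable when many instances are created dynamically.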

**Resolution**: For TIME_BASED, set slidingWindowSize to 60 seconds or less, and observe long-term trends through Prometheus metrics. In memory-sensitive environments, prefer COUNT_BASED.

Operations Checklist

Here are items that must be verified before deploying Circuit Breakers to production.

**Configuration Verification**

- Is the ratio of slidingWindowSize to minimumNumberOfCalls appropriate? (minimumNumberOfCalls should be 50% or less of slidingWindowSize)

- Is failureRateThreshold set according to service characteristics? (Payment: 30-40%, Notification: 60-70%)

- Does waitDurationInOpenState match the downstream service's average recovery time?

- Is slowCallDurationThreshold set at or above the normal response time P99?

- Are recordExceptions and ignoreExceptions properly categorized?

**Monitoring Verification**

- Are resilience4j metrics being collected properly in Prometheus?

- Does the Grafana dashboard display CircuitBreaker state, failure rate, and call statistics?

- Are CircuitBreaker OPEN alerts being delivered to Slack, PagerDuty, etc.?

- Are there metrics tracking OPEN state duration?

**Fallback Strategy Verification**

- Are fallback methods connected to all CircuitBreakers?

- Do fallback methods return meaningful responses? (No simple null returns)

- How are exceptions in the fallback method itself handled?

- Is a cache expiration policy configured when using cache fallback?

- When using alternative service fallback, is a CircuitBreaker also configured for that service?

**Test Verification**

- Have unit tests verified CircuitBreaker state transitions (CLOSED, OPEN, HALF-OPEN)?

- Have integration tests reproduced actual timeout and network error scenarios?

- Has fault injection testing been performed with chaos engineering tools (Chaos Monkey, Litmus)?

- Has Bulkhead saturation behavior been verified in load tests?

**Deployment Strategy**

- Apply new CircuitBreaker configurations via canary deployment to a subset of traffic first

- Can configuration changes be applied without downtime via Config Server (Spring Cloud Config) or environment variables?

- Is Git history management in place for CircuitBreaker configurations?

- Is a rollback plan established?

Failure Cases and Recovery

Case 1: Downstream Overload Due to Retry Storm

**Situation**: Response times from the payment service began increasing. The order service had Retry configured with maxAttempts=5 and a fixed 1-second interval. With 20 order service instances each handling 100 orders per second, up to 10,000 requests per second (100 x 20 x 5) were flooding the payment service.

**Cause**: Fixed-interval retries were used without exponential backoff and jitter. Also, Retry was used standalone without a CircuitBreaker, so retries continued even on failure.

**Recovery Procedure**:

1. Immediately disable Retry or set maxAttempts to 1 to stop retries

2. Once the payment service load stabilizes, replace with a Retry configuration that includes exponential backoff + jitter

3. Place the CircuitBreaker inside Retry so that retries are blocked when the circuit is OPEN

**Prevention**: Always use Retry together with CircuitBreaker, and apply exponential backoff + random jitter by default. Prohibit fixed-interval retries as a policy.
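
The amplification in this case is simple multiplication, and it is worth running the numbers before choosing maxAttempts. A minimal sketch, assuming every call fails and every instance retries the full number of times (the class and method names are hypothetical):

```java
/** Worst-case downstream load when every call fails and is retried (illustrative). */
class RetryStormEstimate {
    static long worstCaseRequestsPerSecond(long ordersPerSecondPerInstance,
                                           int instances, int maxAttempts) {
        // maxAttempts includes the initial call, so it is the amplification factor.
        return ordersPerSecondPerInstance * instances * maxAttempts;
    }

    public static void main(String[] args) {
        // Figures from the case above: 100 orders/s per instance, 20 instances.
        System.out.println(worstCaseRequestsPerSecond(100, 20, 5)); // 10000
        System.out.println(worstCaseRequestsPerSecond(100, 20, 1)); // 2000
    }
}
```

Setting maxAttempts to 1, as in the first recovery step, brings the worst case back to the baseline of 2,000 requests per second.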

Case 2: Permanent OPEN Circuit Due to Incorrect Exception Classification

**Situation**: After deploying a new feature to the inventory service, certain product queries started returning 400 Bad Request. These 400 responses were caught as HttpClientErrorException and included in the failure rate calculation, causing the CircuitBreaker to transition to OPEN and block all inventory queries. Even normal product queries became impossible.

**Cause**: `recordExceptions` included HttpClientErrorException (4xx). Since 4xx errors are client-side issues, the circuit breaker should not intervene. Circuit breakers should only respond to server-side failures (5xx, timeouts, connection failures).

**Recovery Procedure**:

1. Manually switch the CircuitBreaker to DISABLED so that all calls pass through and normal traffic is restored immediately

// Manual state transition, e.g. triggered from an internal admin endpoint

circuitBreakerRegistry.circuitBreaker("inventoryService")

.transitionToDisabledState(); // all calls pass through until the breaker is re-enabled

2. Remove HttpClientErrorException from recordExceptions and add it to ignoreExceptions

3. After applying the configuration, call transitionToClosedState() to return to normal CircuitBreaker operation

**Prevention**: Document exception classification principles. 4xx (client errors) go in ignoreExceptions, 5xx (server errors) go in recordExceptions, and business validation exceptions go in ignoreExceptions.

Case 3: Traffic Loss Due to HALF-OPEN Bottleneck

**Situation**: Even after the payment service recovered, order processing throughput did not recover. Traffic analysis showed that the CircuitBreaker was configured with permittedNumberOfCallsInHalfOpenState=1, so only one trial call was allowed in the HALF-OPEN state. That trial call failed intermittently, and the circuit flapped between OPEN and HALF-OPEN.

**Cause**: The permittedNumberOfCallsInHalfOpenState value was too low. With only 1 trial call, a single failure returns the circuit to OPEN, making it difficult to return to CLOSED when the downstream service responds only intermittently.
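A rough calculation shows why a single trial call is so fragile. The sketch below assumes a simplified model: trial calls fail independently with a fixed probability, and the circuit closes only if the failure rate over the trial calls stays below the threshold. Under that assumption, with a 20% failure probability and a 50% threshold, one trial call closes the circuit with probability 0.80, while five trial calls close it with probability of about 0.94. `HalfOpenOddsSketch` is a hypothetical name used only for this illustration.

```java
final class HalfOpenOddsSketch {
    private HalfOpenOddsSketch() {}

    /** Probability that exactly k of n independent calls fail, each failing with probability q. */
    static double binomial(int n, int k, double q) {
        double coefficient = 1.0;
        for (int i = 1; i <= k; i++) {
            coefficient = coefficient * (n - k + i) / i;
        }
        return coefficient * Math.pow(q, k) * Math.pow(1.0 - q, n - k);
    }

    /** Probability that the failure rate over the trial calls stays below the threshold. */
    static double probabilityOfClosing(int trialCalls, double failureProbability, double thresholdPercent) {
        double total = 0.0;
        for (int failures = 0; failures <= trialCalls; failures++) {
            double ratePercent = 100.0 * failures / trialCalls;
            if (ratePercent < thresholdPercent) {
                total += binomial(trialCalls, failures, failureProbability);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Compare different numbers of trial calls for a downstream that fails 20% of the time.
        for (int n : new int[] {1, 3, 5, 10}) {
            System.out.printf("trialCalls=%d -> P(close)=%.4f%n", n, probabilityOfClosing(n, 0.2, 50.0));
        }
    }
}
```

Real failures are rarely independent, so the numbers are only indicative, but they show the direction: more trial calls make the close-or-reopen decision less sensitive to one unlucky request.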

**Recovery Procedure**:

1. Increase permittedNumberOfCallsInHalfOpenState to 5-10

2. Verify that automaticTransitionFromOpenToHalfOpenEnabled is set to true so the circuit moves to HALF-OPEN as soon as waitDurationInOpenState elapses, without waiting for the next incoming call

3. Adjust waitDurationInOpenState to match the downstream service's average recovery time

**Prevention**: Set HALF-OPEN trial calls to at least 3 so that, combined with the failure rate threshold, the decision to close or reopen rests on more than a single sample. Add OPEN/HALF-OPEN flapping detection to monitoring alerts.
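Flapping detection can be as simple as counting how often the circuit falls back from HALF_OPEN to OPEN within a time window. The sketch below is one possible approach in plain Java, not a Resilience4j feature: `FlappingDetectorSketch`, the window length, and the reopen limit are all assumptions for illustration. In practice the transitions would typically be fed in from the CircuitBreaker's state transition events, and the result would drive an alert.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FlappingDetectorSketch {
    private final Deque<Long> reopenTimestamps = new ArrayDeque<>();
    private final long windowMillis;
    private final int maxReopens;

    FlappingDetectorSketch(long windowMillis, int maxReopens) {
        this.windowMillis = windowMillis;
        this.maxReopens = maxReopens;
    }

    /**
     * Records a state transition and returns true when the number of
     * HALF_OPEN -> OPEN fallbacks inside the window reaches the limit.
     */
    synchronized boolean onTransition(String from, String to, long nowMillis) {
        if ("HALF_OPEN".equals(from) && "OPEN".equals(to)) {
            reopenTimestamps.addLast(nowMillis);
        }
        // Drop fallbacks that are older than the window.
        while (!reopenTimestamps.isEmpty() && nowMillis - reopenTimestamps.peekFirst() > windowMillis) {
            reopenTimestamps.removeFirst();
        }
        return reopenTimestamps.size() >= maxReopens;
    }

    public static void main(String[] args) {
        FlappingDetectorSketch detector = new FlappingDetectorSketch(60_000, 3);
        long now = System.currentTimeMillis();
        detector.onTransition("HALF_OPEN", "OPEN", now);
        detector.onTransition("HALF_OPEN", "OPEN", now + 20_000);
        System.out.println("flapping: " + detector.onTransition("HALF_OPEN", "OPEN", now + 40_000));
    }
}
```

Passing the timestamp in explicitly keeps the detector deterministic under test; a production version would more likely read a clock and publish a metric instead of returning a boolean.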

Advanced Pattern: Custom CircuitBreaker Registry

As the number of services grows, configuring CircuitBreakers individually for each service can become inefficient. You can implement a custom registry that dynamically creates and manages CircuitBreakers.

// DynamicCircuitBreakerFactory.kt

@Component

class DynamicCircuitBreakerFactory(

private val circuitBreakerRegistry: CircuitBreakerRegistry,

private val meterRegistry: MeterRegistry,

) {

private val log = LoggerFactory.getLogger(javaClass)

/**

* Dynamically creates a CircuitBreaker based on service name.

* Returns an existing instance if one already exists.

*/

fun getOrCreate(

serviceName: String,

tier: ServiceTier = ServiceTier.STANDARD,

): CircuitBreaker {

return circuitBreakerRegistry.circuitBreaker(serviceName) {

buildConfigForTier(tier)

}.also { cb ->

registerMetrics(cb)

log.info(

"CircuitBreaker created/retrieved: name={}, tier={}, state={}",

serviceName, tier, cb.state

)

}

}

private fun buildConfigForTier(tier: ServiceTier): CircuitBreakerConfig {

return when (tier) {

ServiceTier.CRITICAL -> CircuitBreakerConfig.custom()

.failureRateThreshold(30f)

.slowCallRateThreshold(60f)

.slowCallDurationThreshold(Duration.ofSeconds(2))

.waitDurationInOpenState(Duration.ofSeconds(60))

.slidingWindowSize(20)

.minimumNumberOfCalls(10)

.permittedNumberOfCallsInHalfOpenState(5)

.automaticTransitionFromOpenToHalfOpenEnabled(true)

.build()

ServiceTier.STANDARD -> CircuitBreakerConfig.custom()

.failureRateThreshold(50f)

.slowCallRateThreshold(80f)

.slowCallDurationThreshold(Duration.ofSeconds(3))

.waitDurationInOpenState(Duration.ofSeconds(30))

.slidingWindowSize(10)

.minimumNumberOfCalls(5)

.permittedNumberOfCallsInHalfOpenState(3)

.automaticTransitionFromOpenToHalfOpenEnabled(true)

.build()

ServiceTier.BEST_EFFORT -> CircuitBreakerConfig.custom()

.failureRateThreshold(70f)

.slowCallRateThreshold(90f)

.slowCallDurationThreshold(Duration.ofSeconds(5))

.waitDurationInOpenState(Duration.ofSeconds(15))

.slidingWindowSize(5)

.minimumNumberOfCalls(3)

.permittedNumberOfCallsInHalfOpenState(2)

.automaticTransitionFromOpenToHalfOpenEnabled(true)

.build()

}

}

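private val metricsBound = AtomicBoolean(false) // java.util.concurrent.atomic.AtomicBoolean

private fun registerMetrics(cb: CircuitBreaker) {

// Bind once: the registry-level binder also tracks CircuitBreakers added later, so re-binding on every call is redundant.

if (metricsBound.compareAndSet(false, true)) {

TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(

circuitBreakerRegistry

).bindTo(meterRegistry)

}

}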

enum class ServiceTier {

CRITICAL, // Core services like payment, authentication

STANDARD, // General services like inventory, shipping

BEST_EFFORT // Non-critical services like notifications, recommendations

}

}

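Using this factory, the CircuitBreaker configuration is chosen automatically from the service tier. CRITICAL services are protected conservatively, with a low failure rate threshold (30%) and a long wait in OPEN (60 seconds), while BEST_EFFORT services tolerate a higher failure rate (70%) and probe again sooner (15 seconds). Because the registry returns an existing instance when one is already registered under the same name, the tier passed on the first call for a given service name is the one that takes effect.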

Test Strategy

Here are the essential test cases that must be written when introducing CircuitBreakers.

// CircuitBreakerIntegrationTest.java

@SpringBootTest

@AutoConfigureMockMvc

class CircuitBreakerIntegrationTest {

@Autowired

private CircuitBreakerRegistry circuitBreakerRegistry;

@Autowired

private MockMvc mockMvc;

@MockBean

private RestClient paymentRestClient;

@Test

@DisplayName("CircuitBreaker transitions to OPEN when failure rate threshold is exceeded")

void shouldTransitionToOpenOnFailureThreshold() {

CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");

cb.reset(); // Reset state for test isolation

assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);

// With slidingWindowSize=20, failureRateThreshold=40

// Must call minimumNumberOfCalls=10 or more, then 40%+ failure -> OPEN

// 10 calls with 5 failures = 50% failure rate -> OPEN transition

for (int i = 0; i < 5; i++) {

cb.onSuccess(100, TimeUnit.MILLISECONDS);

}

for (int i = 0; i < 5; i++) {

cb.onError(100, TimeUnit.MILLISECONDS, new IOException("connection refused"));

}

assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);

assertThat(cb.getMetrics().getFailureRate()).isGreaterThanOrEqualTo(40f);

}

@Test

@DisplayName("Transitions to HALF-OPEN after waitDuration elapses in OPEN state")

void shouldTransitionToHalfOpenAfterWaitDuration() throws Exception {

CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");

cb.reset();

cb.transitionToOpenState();

assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);

// Simulate waitDurationInOpenState elapsed

// (In tests, use a short waitDuration setting or transition directly)

cb.transitionToHalfOpenState();

assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.HALF_OPEN);

}

@Test

@DisplayName("Transitions to CLOSED on successful trial calls in HALF-OPEN")

void shouldTransitionToClosedOnSuccessfulTrialCalls() {

CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");

cb.reset();

cb.transitionToOpenState();

cb.transitionToHalfOpenState();

// permittedNumberOfCallsInHalfOpenState=5 successful calls

for (int i = 0; i < 5; i++) {

cb.onSuccess(50, TimeUnit.MILLISECONDS);

}

assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);

}

@Test

@DisplayName("Fallback method is properly called when CircuitBreaker is OPEN")

void shouldInvokeFallbackWhenCircuitIsOpen() throws Exception {

// Force circuit to OPEN state

CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");

cb.transitionToForcedOpenState();

try {

mockMvc.perform(post("/api/v1/orders")

.contentType(MediaType.APPLICATION_JSON)

.content("{\"productId\": \"P001\", \"quantity\": 1}"))

.andExpect(status().isOk())

.andExpect(jsonPath("$.payment.status").value("QUEUED"))

.andExpect(jsonPath("$.payment.message").exists());

} finally {

// Restore state even if an assertion fails, so later tests start from CLOSED

cb.transitionToClosedState();

}

}

}

References

- [Resilience4j CircuitBreaker Official Documentation](https://resilience4j.readme.io/docs/circuitbreaker) - Detailed configuration and operating principles of the CircuitBreaker module

- [Spring Cloud Circuit Breaker Reference](https://docs.spring.io/spring-cloud-circuitbreaker/reference/) - Spring Cloud and Resilience4j integration guide

- [Spring Boot Circuit Breaker Pattern with Resilience4j - GeeksforGeeks](https://www.geeksforgeeks.org/advance-java/spring-boot-circuit-breaker-pattern-with-resilience4j/) - Step-by-step implementation tutorial in Spring Boot

- [Circuit Breaker Pattern in Microservices - Java Guides](https://www.javaguides.net/2025/03/circuit-breaker-pattern-in-microservices.html) - Circuit Breaker design pattern in microservices architecture

- [Circuit Breaker Pattern for Resilient Systems - DZone](https://dzone.com/articles/circuit-breaker-pattern-resilient-systems) - Practical application of Circuit Breaker for distributed system resilience

- [Martin Fowler: CircuitBreaker](https://martinfowler.com/bliki/CircuitBreaker.html) - The original explanation of the Circuit Breaker pattern

- [Resilience4j GitHub Repository](https://github.com/resilience4j/resilience4j) - Source code and release notes

Quiz

Q1: What is the main topic covered in "Circuit Breaker Pattern and Resilience4j Practical Implementation Guide: From Failure Isolation to Recovery"?

The guide covers the Circuit Breaker state machine principles, the integrated configuration of the Resilience4j CircuitBreaker, Retry, Bulkhead, and RateLimiter modules, practical implementation in Spring Boot 3, Grafana monitoring, and recovery strategies for various failure scenarios.

