- Introduction
- Circuit Breaker State Machine
- Resilience4j Architecture
- Spring Boot 3 Integration Configuration
- CircuitBreaker Practical Implementation
- Retry, Bulkhead, and RateLimiter Combinations
- Grafana Monitoring Dashboard
- Troubleshooting Guide
- Operations Checklist
- Failure Cases and Recovery
- Advanced Pattern: Custom CircuitBreaker Registry
- Test Strategy
- References

Introduction
In microservices architecture, inter-service network calls are inherently unreliable. Network latency, timeouts, and downstream service failures occur routinely, and without proper controls, a single service failure can propagate across the entire system as a cascading failure. A representative example occurred in late 2024, when response latency in a single payment gateway on a large e-commerce platform cascaded to paralyze the order service, inventory service, and notification service.
The Circuit Breaker pattern is a failure isolation mechanism inspired by electrical circuit breakers. Since Michael Nygard first introduced it in Release It! in 2007, and through Martin Fowler's blog post, it has become a core pattern in the microservices world. Netflix's Hystrix was the first widely adopted implementation, but after entering maintenance mode in 2018, Resilience4j emerged as the de facto standard.
This article covers everything from the Circuit Breaker state machine operating principles, to integrating Resilience4j's core modules -- CircuitBreaker, Retry, Bulkhead, and RateLimiter -- in a Spring Boot 3 environment, monitoring with Grafana dashboards, and recovery strategies for real failure scenarios at an operational level.
Circuit Breaker State Machine
The core of the Circuit Breaker is a Finite State Machine that manages transitions between three states (CLOSED, OPEN, HALF-OPEN) and two special states (DISABLED, FORCED_OPEN).
State Transition Diagram
                  failure rate >= threshold
       ┌────────────────────────────────────────────────┐
       │                                                │
       ▼                                                │
 ┌──────────┐                                     ┌──────────┐
 │   OPEN   │                                     │  CLOSED  │
 │ (blocked)│                                     │ (normal) │
 └──────────┘                                     └──────────┘
       │                                                ▲
       │ waitDurationInOpenState elapsed                │
       ▼                                                │ trial call failure rate < threshold
 ┌───────────────┐                                      │
 │   HALF-OPEN   │ ─────────────────────────────────────┘
 │(trial allowed)│
 └───────────────┘
       │
       │ trial call failure rate >= threshold
       ▼
 ┌──────────┐
 │   OPEN   │  (blocked again)
 └──────────┘
Detailed Behavior by State
| State | Request Handling | Transition Condition | Metric Collection |
|---|---|---|---|
| CLOSED | All requests pass through | Transitions to OPEN when failure rate in sliding window exceeds threshold | Records success/failure/slow calls |
| OPEN | All requests immediately rejected (CallNotPermittedException) | Transitions to HALF-OPEN after waitDurationInOpenState elapses | Records rejected call count |
| HALF-OPEN | Allows only permittedNumberOfCalls | Transitions to CLOSED or OPEN based on trial call results | Records trial call success/failure |
| DISABLED | All requests pass through (circuit inactive) | Manual transition only | No metric collection |
| FORCED_OPEN | All requests immediately rejected | Manual transition only | Records rejected call count |
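The table above can be condensed into a deliberately simplified, framework-free state machine. The sketch below is for intuition only -- the class name MiniCircuitBreaker is hypothetical and this is not Resilience4j's implementation: it models COUNT_BASED evaluation and the HALF-OPEN trial window, but omits slow-call tracking, time-based windows, and the special states.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal circuit-breaker state machine sketch (hypothetical, not Resilience4j internals). */
class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;               // last-N calls considered (COUNT_BASED)
    private final double failureRateThreshold;  // e.g. 0.5 = 50%
    private final int permittedTrialCalls;      // calls allowed while HALF_OPEN
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private int trialCalls, trialFailures;

    MiniCircuitBreaker(int windowSize, double failureRateThreshold, int permittedTrialCalls) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.permittedTrialCalls = permittedTrialCalls;
    }

    State state() { return state; }

    /** OPEN -> HALF_OPEN; in the real library this happens after waitDurationInOpenState. */
    void waitDurationElapsed() {
        if (state == State.OPEN) { state = State.HALF_OPEN; trialCalls = trialFailures = 0; }
    }

    boolean tryAcquirePermission() {
        if (state == State.OPEN) return false;
        return state != State.HALF_OPEN || trialCalls < permittedTrialCalls;
    }

    void record(boolean failure) {
        if (state == State.HALF_OPEN) {
            trialCalls++;
            if (failure) trialFailures++;
            if (trialCalls == permittedTrialCalls) {
                // Decide based on trial results: back to OPEN or recover to CLOSED
                state = (double) trialFailures / trialCalls >= failureRateThreshold
                        ? State.OPEN : State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failure);
        if (window.size() > windowSize) window.removeFirst();
        long failures = window.stream().filter(f -> f).count();
        // Evaluate only once the window is full (stand-in for minimumNumberOfCalls)
        if (window.size() == windowSize && (double) failures / windowSize >= failureRateThreshold) {
            state = State.OPEN;
        }
    }
}
```

Walking a failure burst through this sketch reproduces the CLOSED → OPEN → HALF-OPEN → CLOSED cycle from the diagram.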
Sliding Window Type Comparison
Resilience4j provides two sliding window types.
| Aspect | COUNT_BASED | TIME_BASED |
|---|---|---|
| Basis | Last N calls | Calls in the last N seconds |
| Config Example | slidingWindowSize: 10 | slidingWindowSize: 60 |
| Memory Usage | Fixed (array of N results) | Variable (partial aggregations over N seconds) |
| Suited For | Services with consistent call frequency | Services with irregular call frequency |
| Evaluation | After the Nth call | Time window evaluated on each call |
COUNT_BASED is internally implemented as a circular bit array of size N, recording each call result in O(1) and calculating the failure rate in constant time. TIME_BASED uses N partial aggregation buckets, each aggregating call results for one second.
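The count-based window can be sketched with a plain ring buffer -- again a conceptual sketch, not the actual Resilience4j data structure: recording a call and reading the failure rate are both O(1) because the running failure count is adjusted as the oldest result is evicted.

```java
/** Sketch of a COUNT_BASED sliding window: fixed ring buffer, O(1) record and failure rate. */
class CountBasedWindow {
    private final boolean[] ring;   // true = failed call
    private int index, recorded, failures;

    CountBasedWindow(int size) { this.ring = new boolean[size]; }

    void record(boolean failed) {
        // Once full, the slot being overwritten holds the oldest result: evict it from the count
        if (recorded == ring.length && ring[index]) failures--;
        ring[index] = failed;
        if (failed) failures++;
        index = (index + 1) % ring.length;
        if (recorded < ring.length) recorded++;
    }

    /** Failure rate in percent over the recorded calls (NaN before the first call). */
    double failureRate() { return recorded == 0 ? Double.NaN : 100.0 * failures / recorded; }
}
```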
Resilience4j Architecture
Transitioning from Hystrix to Resilience4j
After Netflix Hystrix entered maintenance mode in 2018, Resilience4j established itself as the standard fault tolerance library in the JVM ecosystem.
| Comparison | Netflix Hystrix | Resilience4j |
|---|---|---|
| Status | Maintenance mode (no updates since 2018) | Active development (2.3.0 release in 2025) |
| Java Version | Java 8+ | Java 17+ (Spring Boot 3 support) |
| Dependencies | Multiple (Archaius, RxJava, etc.) | None in 2.x (Vavr in 1.x only) |
| Architecture | Monolithic (all features included) | Modular (select only needed modules) |
| Thread Model | Separate thread pool required | Semaphore-based (thread pool optional) |
| Configuration | Archaius required | Both application.yml and programmatic |
| Reactive Support | RxJava 1 | Native Reactor, RxJava 2/3 support |
| Functional Interface | Limited | Full support (Supplier, Function, Runnable, etc.) |
| Monitoring | Hystrix Dashboard | Micrometer integration (Prometheus, Grafana) |
Resilience4j Core Modules
Resilience4j provides five core modules that can be used independently or in combination.
| Module | Role | Key Configuration |
|---|---|---|
| CircuitBreaker | Circuit tripping based on failure rate | failureRateThreshold, slidingWindowSize |
| Retry | Retry on failure | maxAttempts, waitDuration, backoff |
| Bulkhead | Limit concurrent calls (isolation) | maxConcurrentCalls, maxWaitDuration |
| RateLimiter | Limit calls per time unit | limitForPeriod, limitRefreshPeriod |
| TimeLimiter | Limit call duration | timeoutDuration, cancelRunningFuture |
When combining via annotations, the application order is as follows:
Outer (evaluated first) ──────────────────────────────────> Inner (evaluated last)
Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead
This order is the default priority when Resilience4j processes annotations via Spring AOP. You can customize the order using properties like resilience4j.circuitbreaker.circuitBreakerAspectOrder.
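The nesting can be illustrated with plain function composition; the labels below are just strings standing in for the real aspects. Decorators are built inside-out (Bulkhead wraps the call first), so at invocation time the outermost one, Retry, runs first:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Illustrates aspect nesting: the outermost decorator runs first, the innermost wraps the call. */
class DecoratorOrder {
    static Supplier<String> wrap(String name, Supplier<String> inner, List<String> trace) {
        return () -> { trace.add(name); return inner.get(); };
    }

    static List<String> invocationOrder() {
        List<String> trace = new ArrayList<>();
        Supplier<String> decorated = () -> { trace.add("remoteCall"); return "ok"; };
        // Wrap inside-out, mirroring Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead
        for (String name : new String[]{"Bulkhead", "TimeLimiter", "RateLimiter", "CircuitBreaker", "Retry"}) {
            decorated = wrap(name, decorated, trace);
        }
        decorated.get();
        return trace;
    }
}
```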
Spring Boot 3 Integration Configuration
Dependency Setup
// build.gradle.kts (Spring Boot 3.3+ / Resilience4j 2.2+)
plugins {
    id("org.springframework.boot") version "3.3.5"
    id("io.spring.dependency-management") version "1.1.6"
    kotlin("jvm") version "1.9.25"
    kotlin("plugin.spring") version "1.9.25"
}

dependencies {
    // Resilience4j Spring Boot 3 Starter
    implementation("io.github.resilience4j:resilience4j-spring-boot3:2.2.0")

    // Individual modules (included in the starter, but explicit declaration is recommended)
    implementation("io.github.resilience4j:resilience4j-circuitbreaker")
    implementation("io.github.resilience4j:resilience4j-retry")
    implementation("io.github.resilience4j:resilience4j-bulkhead")
    implementation("io.github.resilience4j:resilience4j-ratelimiter")
    implementation("io.github.resilience4j:resilience4j-timelimiter")

    // Micrometer + Prometheus (monitoring)
    implementation("io.github.resilience4j:resilience4j-micrometer")
    implementation("io.micrometer:micrometer-registry-prometheus")

    // Spring Boot Actuator
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("org.springframework.boot:spring-boot-starter-aop")
    implementation("org.springframework.boot:spring-boot-starter-web")

    // Kotlin Coroutines (optional)
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactor")

    testImplementation("org.springframework.boot:spring-boot-starter-test")
}
Integrated Configuration File
# application.yml - Resilience4j integrated configuration
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 3s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        failureRateThreshold: 40
        waitDurationInOpenState: 60s
        slidingWindowSize: 20
      inventoryService:
        baseConfig: default
        failureRateThreshold: 60
        slowCallDurationThreshold: 5s
      notificationService:
        baseConfig: default
        failureRateThreshold: 70
        waitDurationInOpenState: 15s
  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2.0
        exponentialMaxWaitDuration: 10s
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        maxAttempts: 2
        waitDuration: 2s
      inventoryService:
        baseConfig: default
        maxAttempts: 4
      notificationService:
        baseConfig: default
        maxAttempts: 5
        waitDuration: 500ms
  bulkhead:
    configs:
      default:
        maxConcurrentCalls: 25
        maxWaitDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        maxConcurrentCalls: 15
      inventoryService:
        baseConfig: default
        maxConcurrentCalls: 30
      notificationService:
        baseConfig: default
        maxConcurrentCalls: 50
  ratelimiter:
    configs:
      default:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
        timeoutDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        limitForPeriod: 50
      inventoryService:
        baseConfig: default
        limitForPeriod: 200
  timelimiter:
    configs:
      default:
        timeoutDuration: 5s
        cancelRunningFuture: true
    instances:
      paymentGateway:
        baseConfig: default
        timeoutDuration: 10s
      inventoryService:
        baseConfig: default
        timeoutDuration: 3s

# Actuator metric exposure
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus,circuitbreakers,retries
  endpoint:
    health:
      show-details: always
  health:
    circuitbreakers:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        resilience4j.circuitbreaker.calls: true
        resilience4j.retry.calls: true
    tags:
      application: order-service
A notable point in the configuration is defining a base profile with configs.default and specifying baseConfig: default for each instance to inherit common settings. You only need to override thresholds specific to each service to minimize configuration duplication.
CircuitBreaker Practical Implementation
Annotation-Based Implementation (Kotlin)
// PaymentGatewayClient.kt
@Service
class PaymentGatewayClient(
    private val restClient: RestClient,
    private val paymentRetryQueue: PaymentRetryQueue,
    private val paymentCacheStore: PaymentCacheStore,
) {
    companion object {
        private val log = LoggerFactory.getLogger(PaymentGatewayClient::class.java)
        const val CB_NAME = "paymentGateway"
    }

    @CircuitBreaker(name = CB_NAME, fallbackMethod = "paymentFallback")
    @Retry(name = CB_NAME)
    @Bulkhead(name = CB_NAME)
    fun processPayment(request: PaymentRequest): PaymentResponse {
        log.info("Calling payment gateway for orderId={}", request.orderId)
        val response = restClient.post()
            .uri("https://payment-api.internal/v2/charges")
            .contentType(MediaType.APPLICATION_JSON)
            .body(request)
            .retrieve()
            .body(PaymentResponse::class.java)
            ?: throw PaymentGatewayException("Empty response from payment gateway")
        log.info("Payment processed: orderId={}, txId={}", request.orderId, response.transactionId)
        return response
    }

    /**
     * Fallback method: called when the CircuitBreaker is OPEN or an exception occurs.
     * The signature must match the original method's, plus an Exception as the last parameter.
     */
    private fun paymentFallback(request: PaymentRequest, ex: Exception): PaymentResponse {
        log.warn(
            "Payment fallback activated: orderId={}, reason={}",
            request.orderId, ex.message
        )
        return when (ex) {
            is CallNotPermittedException -> {
                // CircuitBreaker OPEN state: enqueue for async processing
                paymentRetryQueue.enqueue(request)
                PaymentResponse(
                    orderId = request.orderId,
                    status = PaymentStatus.QUEUED,
                    message = "Payment has been queued. It will be processed shortly.",
                    transactionId = null,
                )
            }
            is BulkheadFullException -> {
                // Bulkhead saturated: ask the client to retry shortly
                PaymentResponse(
                    orderId = request.orderId,
                    status = PaymentStatus.RETRY_LATER,
                    message = "Too many payment requests at the moment. Please try again shortly.",
                    transactionId = null,
                )
            }
            else -> {
                // Other exceptions: return cached payment info if available
                val cached = paymentCacheStore.getLastSuccess(request.orderId)
                if (cached != null) {
                    log.info("Returning cached payment for orderId={}", request.orderId)
                    cached.copy(status = PaymentStatus.CACHED)
                } else {
                    paymentRetryQueue.enqueue(request)
                    PaymentResponse(
                        orderId = request.orderId,
                        status = PaymentStatus.PENDING,
                        message = "An error occurred during payment processing. Automatic retry in progress.",
                        transactionId = null,
                    )
                }
            }
        }
    }
}
Programmatic Implementation (Java)
Instead of annotations, you can use the CircuitBreakerRegistry directly to dynamically create circuit breakers or change configurations at runtime.
// InventoryServiceClient.java
@Service
@Slf4j
public class InventoryServiceClient {

    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;
    private final RestClient restClient;

    public InventoryServiceClient(
            CircuitBreakerRegistry cbRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry,
            RestClient.Builder restClientBuilder) {
        this.circuitBreaker = cbRegistry.circuitBreaker("inventoryService");
        this.retry = retryRegistry.retry("inventoryService");
        this.bulkhead = bulkheadRegistry.bulkhead("inventoryService");
        this.restClient = restClientBuilder
                .baseUrl("https://inventory-api.internal")
                .build();
        // Register event listeners
        registerEventListeners();
    }

    public InventoryResponse checkStock(String productId, int quantity) {
        // Decorated inside-out: Bulkhead wraps the call, CircuitBreaker wraps Bulkhead,
        // Retry is outermost -- so at runtime Retry -> CircuitBreaker -> Bulkhead -> actual call
        Supplier<InventoryResponse> decorated = Decorators
                .ofSupplier(() -> doCheckStock(productId, quantity))
                .withBulkhead(bulkhead)
                .withCircuitBreaker(circuitBreaker)
                .withRetry(retry)
                .withFallback(
                        List.of(
                                CallNotPermittedException.class,
                                BulkheadFullException.class,
                                IOException.class
                        ),
                        ex -> stockFallback(productId, quantity, ex)
                )
                .decorate();
        return decorated.get();
    }

    private InventoryResponse doCheckStock(String productId, int quantity) {
        return restClient.get()
                .uri("/v1/stock/{productId}?qty={qty}", productId, quantity)
                .retrieve()
                .body(InventoryResponse.class);
    }

    private InventoryResponse stockFallback(
            String productId, int quantity, Throwable ex) {
        log.warn("Inventory fallback: productId={}, reason={}", productId, ex.getMessage());
        // When stock is uncertain, accept the order but schedule async verification
        return InventoryResponse.builder()
                .productId(productId)
                .available(true)
                .reservationStatus(ReservationStatus.TENTATIVE)
                .message("Stock check delayed: tentative approval with async verification scheduled")
                .build();
    }

    private void registerEventListeners() {
        circuitBreaker.getEventPublisher()
                .onStateTransition(event ->
                        log.warn("[CircuitBreaker] {} state: {} -> {}",
                                event.getCircuitBreakerName(),
                                event.getStateTransition().getFromState(),
                                event.getStateTransition().getToState())
                )
                .onError(event ->
                        log.error("[CircuitBreaker] {} error: {} ({}ms)",
                                event.getCircuitBreakerName(),
                                event.getThrowable().getMessage(),
                                event.getElapsedDuration().toMillis())
                )
                .onSuccess(event ->
                        log.debug("[CircuitBreaker] {} success ({}ms)",
                                event.getCircuitBreakerName(),
                                event.getElapsedDuration().toMillis())
                )
                .onCallNotPermitted(event ->
                        log.warn("[CircuitBreaker] {} call not permitted (OPEN state)",
                                event.getCircuitBreakerName())
                );

        retry.getEventPublisher()
                .onRetry(event ->
                        log.info("[Retry] {} attempt #{} (wait: {}ms)",
                                event.getName(),
                                event.getNumberOfRetryAttempts(),
                                event.getWaitInterval().toMillis())
                );
    }
}
Retry, Bulkhead, and RateLimiter Combinations
Retry and Exponential Backoff
The most important aspect of retry strategy is combining exponential backoff with jitter. Fixed-interval retries cause a thundering herd problem where multiple clients retry simultaneously, concentrating load on the server.
// Programmatic RetryConfig customization
@Configuration
class ResilienceConfig {

    @Bean
    fun customRetryConfig(): RetryConfig {
        return RetryConfig.custom<Any>()
            .maxAttempts(4)
            .intervalFunction(
                // Exponential backoff + jitter: ~1s, ~2s, ~4s between the 4 attempts, capped at 15s
                IntervalFunction.ofExponentialRandomBackoff(
                    Duration.ofSeconds(1),  // initial wait duration
                    2.0,                    // multiplier
                    Duration.ofSeconds(15)  // max wait duration
                )
            )
            .retryOnException { ex ->
                // Determine retry-eligible exceptions
                when (ex) {
                    is IOException -> true
                    is TimeoutException -> true
                    is HttpServerErrorException -> true
                    is ConnectException -> true
                    else -> false
                }
            }
            .ignoreExceptions(
                BusinessValidationException::class.java,
                IllegalArgumentException::class.java
            )
            .failAfterMaxAttempts(true) // Throw MaxRetriesExceededException after max retries
            .build()
    }

    @Bean
    fun retryRegistry(customRetryConfig: RetryConfig): RetryRegistry {
        return RetryRegistry.of(customRetryConfig)
    }
}
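The effect of jitter is easiest to see in a tiny standalone calculation. The sketch below implements "full jitter" on top of a capped exponential base; this is a hypothetical helper for illustration, not Resilience4j's IntervalFunction (which uses a randomization factor, so its exact distribution differs), but the point is the same: each client's wait lands somewhere in a widening band instead of on the same fixed instant.

```java
import java.util.Random;

/** Sketch of exponential backoff with full random jitter (hypothetical helper, for illustration). */
class BackoffIntervals {
    /** Wait in ms before retry attempt n (1-based): capped exponential base plus jitter in [0, base). */
    static long waitMillis(int attempt, long initialMillis, double multiplier, long capMillis, Random rnd) {
        double base = initialMillis * Math.pow(multiplier, attempt - 1);
        long capped = (long) Math.min(base, capMillis);
        return capped + (long) (rnd.nextDouble() * capped); // jitter spreads simultaneous retries apart
    }
}
```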
Bulkhead: Semaphore vs Thread Pool
Bulkhead is a pattern inspired by ship compartment walls (bulkheads), preventing a single service call from monopolizing all resources. Resilience4j provides two Bulkhead implementations.
| Aspect | SemaphoreBulkhead | ThreadPoolBulkhead |
|---|---|---|
| Isolation Level | Limits concurrent calls | Executes in a separate thread pool |
| Call Thread | Uses the caller's thread directly | Uses threads from a dedicated thread pool |
| Return Type | Synchronous return | CompletionStage return |
| Overhead | Low | Thread context switching cost |
| Suited For | Most HTTP calls | CPU-intensive tasks, when full isolation is needed |
| Configuration | maxConcurrentCalls, maxWaitDuration | maxThreadPoolSize, coreThreadPoolSize, queueCapacity |
# ThreadPoolBulkhead configuration example
resilience4j:
  thread-pool-bulkhead:
    instances:
      heavyProcessing:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 100ms
        writableStackTraceEnabled: true
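Conceptually, a SemaphoreBulkhead is little more than a counting semaphore with a bounded wait. The sketch below (MiniBulkhead is a hypothetical name, not the library class; Resilience4j throws BulkheadFullException where this throws IllegalStateException) shows why its overhead is so low: the caller's own thread does the work, and rejection is just a failed permit acquisition.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Conceptual SemaphoreBulkhead: bound concurrent calls, reject after a bounded wait. */
class MiniBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    MiniBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            // Wait at most maxWaitMillis for a free slot (maxWaitDuration analogue)
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) throw new IllegalStateException("Bulkhead full");
        try {
            return call.get();       // executed on the caller's thread
        } finally {
            permits.release();       // always free the slot
        }
    }
}
```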
RateLimiter Configuration and Usage
RateLimiter limits the number of calls allowed per time unit, preventing external API rate limit violations or protecting internal services from overload.
// RateLimiter and CircuitBreaker combination
@Service
@Slf4j
public class ExternalApiClient {

    private final RestClient restClient;

    @CircuitBreaker(name = "externalApi", fallbackMethod = "apiFallback")
    @RateLimiter(name = "externalApi")
    @Retry(name = "externalApi")
    public ApiResponse callExternalApi(ApiRequest request) {
        log.debug("Calling external API: endpoint={}", request.getEndpoint());
        return restClient.post()
                .uri(request.getEndpoint())
                .body(request.getPayload())
                .retrieve()
                .body(ApiResponse.class);
    }

    private ApiResponse apiFallback(ApiRequest request, RequestNotPermitted ex) {
        // Rejected by RateLimiter
        log.warn("Rate limit exceeded for external API: {}", request.getEndpoint());
        return ApiResponse.rateLimited(
                "Request limit exceeded. " +
                "Check limitForPeriod settings or try again later."
        );
    }

    private ApiResponse apiFallback(ApiRequest request, Exception ex) {
        // Other exceptions (CircuitBreaker OPEN, network errors, etc.)
        log.warn("External API fallback: endpoint={}, reason={}",
                request.getEndpoint(), ex.getMessage());
        return ApiResponse.error("External API call failed: " + ex.getMessage());
    }
}
An important note when overloading fallback methods: Resilience4j selects the most specific fallback based on exception type. By separating RequestNotPermitted (RateLimiter rejection) and Exception (general exceptions), you can execute different fallback logic based on the exception cause.
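The limitForPeriod / limitRefreshPeriod semantics can be approximated with a tiny fixed-window counter. This is a sketch of the idea only (MiniRateLimiter is hypothetical; the real AtomicRateLimiter is lock-free and carries unused permits over differently); time is injected as a parameter so the behavior is deterministic.

```java
/** Sketch of limitForPeriod / limitRefreshPeriod: a fixed window of permits that refreshes each period. */
class MiniRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodNanos;
    private long windowStart;
    private int used;

    MiniRateLimiter(int limitForPeriod, long refreshPeriodNanos, long nowNanos) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
        this.windowStart = nowNanos;
    }

    /** Returns true if a permit is available at the given time (time injected for testability). */
    boolean tryAcquire(long nowNanos) {
        if (nowNanos - windowStart >= refreshPeriodNanos) { // period elapsed: refresh permits
            windowStart = nowNanos;
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false; // Resilience4j would throw RequestNotPermitted after timeoutDuration
    }
}
```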
Grafana Monitoring Dashboard
Prometheus Metric Collection
Resilience4j automatically exposes metrics via Micrometer. The following metrics are available at Spring Boot Actuator's /actuator/prometheus endpoint:
# CircuitBreaker state check (0=CLOSED, 1=OPEN, 2=HALF_OPEN, 3=DISABLED, 4=FORCED_OPEN)
resilience4j_circuitbreaker_state{name="paymentGateway"}
# Failure rate (%)
resilience4j_circuitbreaker_failure_rate{name="paymentGateway"}
# Slow call rate (%)
resilience4j_circuitbreaker_slow_call_rate{name="paymentGateway"}
# Call statistics (kind: successful, failed, ignored, not_permitted)
rate(resilience4j_circuitbreaker_calls_seconds_count{name="paymentGateway"}[5m])
# Call latency distribution (histogram)
histogram_quantile(0.95,
rate(resilience4j_circuitbreaker_calls_seconds_bucket{name="paymentGateway"}[5m])
)
# Retry count
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="successful_with_retry"}[1h])
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="failed_with_retry"}[1h])
# Bulkhead available concurrent calls
resilience4j_bulkhead_available_concurrent_calls{name="paymentGateway"}
# RateLimiter available permissions
resilience4j_ratelimiter_available_permissions{name="externalApi"}
Grafana Dashboard JSON Configuration
Here are the essential panels and their PromQL queries for the Grafana dashboard.
Panel 1 - CircuitBreaker State Gauge
resilience4j_circuitbreaker_state{application="order-service"}
Use value mapping to map 0=CLOSED (green), 1=OPEN (red), 2=HALF_OPEN (yellow).
Panel 2 - Failure Rate Trend (Time Series)
resilience4j_circuitbreaker_failure_rate{application="order-service", name=~".*"}
Add a threshold line (failureRateThreshold) to visually identify when the circuit transitions to OPEN.
Panel 3 - Call Success/Failure Ratio (Stacked Bar)
sum by (name, kind) (
rate(resilience4j_circuitbreaker_calls_seconds_count{application="order-service"}[5m])
)
Panel 4 - P95 Response Time (Time Series)
histogram_quantile(0.95,
sum by (le, name) (
rate(resilience4j_circuitbreaker_calls_seconds_bucket{application="order-service"}[5m])
)
)
Panel 5 - Bulkhead Concurrent Call Status (Gauge)
resilience4j_bulkhead_max_allowed_concurrent_calls{application="order-service"}
- resilience4j_bulkhead_available_concurrent_calls{application="order-service"}
Alert Rule Configuration
Register the following alert rules in Grafana or Prometheus Alertmanager.
# prometheus-alerts.yml
groups:
  - name: resilience4j_alerts
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: 'CircuitBreaker OPEN: {{ $labels.name }}'
          description: >
            The {{ $labels.name }} circuit breaker in service
            {{ $labels.application }} is in OPEN state.
            Check downstream service failures.
      - alert: HighFailureRate
        expr: resilience4j_circuitbreaker_failure_rate > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'High failure rate: {{ $labels.name }} ({{ $value }}%)'
          description: >
            The failure rate of {{ $labels.name }} is {{ $value }}%,
            exceeding the warning threshold (30%).
      - alert: BulkheadSaturation
        expr: >
          (resilience4j_bulkhead_max_allowed_concurrent_calls
            - resilience4j_bulkhead_available_concurrent_calls)
          / resilience4j_bulkhead_max_allowed_concurrent_calls > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Bulkhead 80% saturated: {{ $labels.name }}'
      - alert: ExcessiveRetries
        expr: >
          rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m])
          / rate(resilience4j_retry_calls_total[5m]) > 0.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: 'Retry failure rate exceeds 50%: {{ $labels.name }}'
Troubleshooting Guide
Issue 1: CircuitBreaker Does Not Transition to OPEN
Symptoms: Failures are clearly occurring, but the circuit remains in CLOSED state.
Root Cause Analysis:
- minimumNumberOfCalls has not been reached. The default value is 100, so a low-frequency service may never record enough calls for the failure rate to be evaluated before the failure resolves.
- The exception is included in ignoreExceptions. Check that no unintended exceptions -- not just business exceptions -- have ended up in the ignore list.
- The exception is not included in recordExceptions. When recordExceptions is specified, exceptions absent from the list are not recorded as failures.
Resolution: Adjust minimumNumberOfCalls to match the service call frequency, and review the recordExceptions and ignoreExceptions lists.
Issue 2: More Calls Than Expected When Combining Retry and CircuitBreaker
Symptoms: maxAttempts is set to 3, but more than 5 calls are recorded on the downstream service.
Root Cause Analysis: In the annotation application order, Retry sits outside CircuitBreaker. Therefore, after the CircuitBreaker records a failure, Retry attempts the call again through the CircuitBreaker. If trial calls are added during the CircuitBreaker's HALF-OPEN state, the total call count can exceed expectations.
Resolution: Set Retry's maxAttempts conservatively, and calculate the maximum number of calls produced by the combination of CircuitBreaker's slidingWindowSize and Retry's maxAttempts to predict downstream load.
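The back-of-envelope calculation that Resolution recommends is trivial but worth automating; the sketch below is a hypothetical helper showing the worst case when the circuit stays CLOSED and every logical request exhausts its retries.

```java
/** Worst-case downstream call amplification when Retry wraps each client call. */
class RetryAmplification {
    static long worstCaseCallsPerSecond(long requestsPerSecondPerInstance, int instances, int maxAttempts) {
        // If the circuit never opens, every logical request may be attempted maxAttempts times
        return requestsPerSecondPerInstance * instances * maxAttempts;
    }
}
```

For example, 100 requests/s per instance across 20 instances with maxAttempts=5 can amplify to 10,000 downstream calls per second.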
Issue 3: Fallback Method Is Not Being Called
Symptoms: CircuitBreaker is OPEN, but CallNotPermittedException is propagated directly to the client.
Root Cause Analysis: The fallback method's signature does not exactly match the original method. The fallback method must accept all parameters of the original method in the same order and type, plus an Exception (or specific exception type) as the last parameter.
Resolution: Review the fallback method signature. The return type must also exactly match the original. Below are correct and incorrect examples:
// Original method
@CircuitBreaker(name = "svc", fallbackMethod = "fallback")
public OrderResponse getOrder(String orderId, boolean includeDetails) { ... }
// Correct fallback (same parameters + Exception added)
private OrderResponse fallback(String orderId, boolean includeDetails, Exception ex) { ... }
// Incorrect fallback - compiles but fails to match at runtime
private OrderResponse fallback(String orderId, Exception ex) { ... } // Missing parameter
private void fallback(String orderId, boolean includeDetails, Exception ex) { ... } // Return type mismatch
Issue 4: Memory Usage Increases with TIME_BASED Window
Symptoms: Using a TIME_BASED sliding window and heap memory usage gradually increases.
Root Cause Analysis: The slidingWindowSize is set too large. For example, setting slidingWindowSize=600 (10 minutes) maintains 600 partial aggregation buckets. With high traffic, call records accumulate in each bucket, consuming memory.
Resolution: For TIME_BASED, set slidingWindowSize to 60 seconds or less, and observe long-term trends through Prometheus metrics. In memory-sensitive environments, prefer COUNT_BASED.
Operations Checklist
Here are items that must be verified before deploying Circuit Breakers to production.
Configuration Verification
- Is the ratio of slidingWindowSize to minimumNumberOfCalls appropriate? (minimumNumberOfCalls should be 50% or less of slidingWindowSize)
- Is failureRateThreshold set according to service characteristics? (Payment: 30-40%, Notification: 60-70%)
- Does waitDurationInOpenState match the downstream service's average recovery time?
- Is slowCallDurationThreshold set at or above the normal response time P99?
- Are recordExceptions and ignoreExceptions properly categorized?
Monitoring Verification
- Are resilience4j metrics being collected properly in Prometheus?
- Does the Grafana dashboard display CircuitBreaker state, failure rate, and call statistics?
- Are CircuitBreaker OPEN alerts being delivered to Slack, PagerDuty, etc.?
- Are there metrics tracking OPEN state duration?
Fallback Strategy Verification
- Are fallback methods connected to all CircuitBreakers?
- Do fallback methods return meaningful responses? (No simple null returns)
- How are exceptions in the fallback method itself handled?
- Is a cache expiration policy configured when using cache fallback?
- When using alternative service fallback, is a CircuitBreaker also configured for that service?
Test Verification
- Have unit tests verified CircuitBreaker state transitions (CLOSED, OPEN, HALF-OPEN)?
- Have integration tests reproduced actual timeout and network error scenarios?
- Has fault injection testing been performed with chaos engineering tools (Chaos Monkey, Litmus)?
- Has Bulkhead saturation behavior been verified in load tests?
Deployment Strategy
- Apply new CircuitBreaker configurations via canary deployment to a subset of traffic first
- Can configuration changes be applied without downtime via Config Server (Spring Cloud Config) or environment variables?
- Is Git history management in place for CircuitBreaker configurations?
- Is a rollback plan established?
Failure Cases and Recovery
Case 1: Downstream Overload Due to Retry Storm
Situation: Response times from the payment service began increasing. The order service had Retry configured with maxAttempts=5 and a fixed 1-second interval. With 20 order service instances and 100 orders per second, up to 10,000 requests per second (100 x 20 x 5) were flooding the payment service.
Cause: Fixed-interval retries were used without exponential backoff and jitter. Also, Retry was used standalone without a CircuitBreaker, so retries continued even on failure.
Recovery Procedure:
1. Immediately disable Retry or set maxAttempts to 1 to stop retries
2. Once the payment service load stabilizes, replace with a Retry configuration that includes exponential backoff + jitter
3. Place the CircuitBreaker inside Retry so that retries are blocked when the circuit is OPEN
Prevention: Always use Retry together with CircuitBreaker, and apply exponential backoff + random jitter by default. Prohibit fixed-interval retries as a policy.
Case 2: Permanent OPEN Circuit Due to Incorrect Exception Classification
Situation: After deploying a new feature to the inventory service, certain product queries started returning 400 Bad Request. These 400 responses were caught as HttpClientErrorException and included in the failure rate calculation, causing the CircuitBreaker to transition to OPEN and block all inventory queries. Even normal product queries became impossible.
Cause: recordExceptions included HttpClientErrorException (4xx). Since 4xx errors are client-side issues, the circuit breaker should not intervene. Circuit breakers should only respond to server-side failures (5xx, timeouts, connection failures).
Recovery Procedure:
1. Manually transition the CircuitBreaker to the DISABLED state to immediately restore normal traffic (Resilience4j has no "forced closed" state; DISABLED passes all calls through without recording metrics)
// Force a state transition programmatically, e.g. from an internal admin endpoint
circuitBreakerRegistry.circuitBreaker("inventoryService")
        .transitionToDisabledState(); // all calls pass through while the fix is prepared
2. Remove HttpClientErrorException from recordExceptions and add it to ignoreExceptions
3. After applying the configuration, call transitionToClosedState() to return to normal CircuitBreaker operation
Prevention: Document exception classification principles. 4xx (client errors) go in ignoreExceptions, 5xx (server errors) go in recordExceptions, and business validation exceptions go in ignoreExceptions.
Case 3: Traffic Loss Due to HALF-OPEN Bottleneck
Situation: Even after the payment service recovered, order processing throughput did not recover. Traffic analysis revealed that the CircuitBreaker was set with permittedNumberOfCallsInHalfOpenState=1 in the HALF-OPEN state, allowing only 1 trial call. This trial call intermittently failed, causing flapping between OPEN and HALF-OPEN states.
Cause: The permittedNumberOfCallsInHalfOpenState value was too low. With only 1 trial call, a single failure returns the circuit to OPEN, making it difficult to return to CLOSED when the downstream service responds only intermittently.
Recovery Procedure:
1. Increase permittedNumberOfCallsInHalfOpenState to 5-10
2. Verify that automaticTransitionFromOpenToHalfOpenEnabled is set to true for automatic transitions without manual intervention
3. Adjust waitDurationInOpenState to match the downstream service's average recovery time
Prevention: Set HALF-OPEN trial calls to at least 3 or more, and combine with the failure rate threshold to enable statistically meaningful decisions. Add OPEN-HALF_OPEN flapping detection to monitoring alerts.
Advanced Pattern: Custom CircuitBreaker Registry
As the number of services grows, configuring CircuitBreakers individually for each service can become inefficient. You can implement a custom registry that dynamically creates and manages CircuitBreakers.
// DynamicCircuitBreakerFactory.kt
@Component
class DynamicCircuitBreakerFactory(
private val circuitBreakerRegistry: CircuitBreakerRegistry,
private val meterRegistry: MeterRegistry,
) {
private val log = LoggerFactory.getLogger(javaClass)
    /**
     * Dynamically creates a CircuitBreaker based on service name.
     * Returns the existing instance if one already exists; in that case the
     * tier argument has no effect, because the config supplier is only
     * consulted when the CircuitBreaker is first created.
     */
fun getOrCreate(
serviceName: String,
tier: ServiceTier = ServiceTier.STANDARD,
): CircuitBreaker {
return circuitBreakerRegistry.circuitBreaker(serviceName) {
buildConfigForTier(tier)
}.also { cb ->
registerMetrics(cb)
log.info(
"CircuitBreaker created/retrieved: name={}, tier={}, state={}",
serviceName, tier, cb.state
)
}
}
private fun buildConfigForTier(tier: ServiceTier): CircuitBreakerConfig {
return when (tier) {
ServiceTier.CRITICAL -> CircuitBreakerConfig.custom()
.failureRateThreshold(30f)
.slowCallRateThreshold(60f)
.slowCallDurationThreshold(Duration.ofSeconds(2))
.waitDurationInOpenState(Duration.ofSeconds(60))
.slidingWindowSize(20)
.minimumNumberOfCalls(10)
.permittedNumberOfCallsInHalfOpenState(5)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
ServiceTier.STANDARD -> CircuitBreakerConfig.custom()
.failureRateThreshold(50f)
.slowCallRateThreshold(80f)
.slowCallDurationThreshold(Duration.ofSeconds(3))
.waitDurationInOpenState(Duration.ofSeconds(30))
.slidingWindowSize(10)
.minimumNumberOfCalls(5)
.permittedNumberOfCallsInHalfOpenState(3)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
ServiceTier.BEST_EFFORT -> CircuitBreakerConfig.custom()
.failureRateThreshold(70f)
.slowCallRateThreshold(90f)
.slowCallDurationThreshold(Duration.ofSeconds(5))
.waitDurationInOpenState(Duration.ofSeconds(15))
.slidingWindowSize(5)
.minimumNumberOfCalls(3)
.permittedNumberOfCallsInHalfOpenState(2)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
}
}
    private val metricsBound = java.util.concurrent.atomic.AtomicBoolean(false)

    private fun registerMetrics(cb: CircuitBreaker) {
        // TaggedCircuitBreakerMetrics binds at the registry level and keeps tracking
        // CircuitBreakers added later, so bind it exactly once; re-binding on every
        // getOrCreate call would stack up duplicate event consumers on the registry.
        if (metricsBound.compareAndSet(false, true)) {
            TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(circuitBreakerRegistry)
                .bindTo(meterRegistry)
        }
    }
enum class ServiceTier {
CRITICAL, // Core services like payment, authentication
STANDARD, // General services like inventory, shipping
BEST_EFFORT // Non-critical services like notifications, recommendations
}
}
Using this factory, appropriate CircuitBreaker configurations are automatically applied based on the service tier. CRITICAL services are conservatively protected with low failure rate thresholds and long wait durations, while BEST_EFFORT services operate flexibly with high thresholds.
Test Strategy
Here are the essential test cases that must be written when introducing CircuitBreakers.
// CircuitBreakerIntegrationTest.java
@SpringBootTest
@AutoConfigureMockMvc
class CircuitBreakerIntegrationTest {
@Autowired
private CircuitBreakerRegistry circuitBreakerRegistry;
@Autowired
private MockMvc mockMvc;
@MockBean
private RestClient paymentRestClient;
@Test
@DisplayName("CircuitBreaker transitions to OPEN when failure rate threshold is exceeded")
void shouldTransitionToOpenOnFailureThreshold() {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset(); // Reset state for test isolation
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
// With slidingWindowSize=20, failureRateThreshold=40
// Must call minimumNumberOfCalls=10 or more, then 40%+ failure -> OPEN
// 10 calls with 5 failures = 50% failure rate -> OPEN transition
for (int i = 0; i < 5; i++) {
cb.onSuccess(100, TimeUnit.MILLISECONDS);
}
for (int i = 0; i < 5; i++) {
cb.onError(100, TimeUnit.MILLISECONDS, new IOException("connection refused"));
}
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
assertThat(cb.getMetrics().getFailureRate()).isGreaterThanOrEqualTo(40f);
}
@Test
@DisplayName("Transitions to HALF-OPEN after waitDuration elapses in OPEN state")
void shouldTransitionToHalfOpenAfterWaitDuration() throws Exception {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset();
cb.transitionToOpenState();
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
// Simulate waitDurationInOpenState elapsed
// (In tests, use a short waitDuration setting or transition directly)
cb.transitionToHalfOpenState();
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.HALF_OPEN);
}
@Test
@DisplayName("Transitions to CLOSED on successful trial calls in HALF-OPEN")
void shouldTransitionToClosedOnSuccessfulTrialCalls() {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset();
cb.transitionToOpenState();
cb.transitionToHalfOpenState();
// permittedNumberOfCallsInHalfOpenState=5 successful calls
for (int i = 0; i < 5; i++) {
cb.onSuccess(50, TimeUnit.MILLISECONDS);
}
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
}
@Test
@DisplayName("Fallback method is properly called when CircuitBreaker is OPEN")
void shouldInvokeFallbackWhenCircuitIsOpen() throws Exception {
// Force circuit to OPEN state
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.transitionToForcedOpenState();
mockMvc.perform(post("/api/v1/orders")
.contentType(MediaType.APPLICATION_JSON)
.content("{\"productId\": \"P001\", \"quantity\": 1}"))
.andExpect(status().isOk())
.andExpect(jsonPath("$.payment.status").value("QUEUED"))
.andExpect(jsonPath("$.payment.message").exists());
// Restore state after test
cb.transitionToClosedState();
}
}
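For the waitDuration comment in the second test, one practical approach is a test profile that shortens the OPEN window so the HALF-OPEN transition can be observed without long sleeps. A hypothetical src/test/resources/application-test.yml matching the values assumed in the tests above:

```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentGateway:
        sliding-window-size: 20
        minimum-number-of-calls: 10
        failure-rate-threshold: 40
        permitted-number-of-calls-in-half-open-state: 5
        wait-duration-in-open-state: 100ms   # short window so tests avoid Thread.sleep
        automatic-transition-from-open-to-half-open-enabled: true
```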
References
- Resilience4j CircuitBreaker Official Documentation - Detailed configuration and operating principles of the CircuitBreaker module
- Spring Cloud Circuit Breaker Reference - Spring Cloud and Resilience4j integration guide
- Spring Boot Circuit Breaker Pattern with Resilience4j - GeeksforGeeks - Step-by-step implementation tutorial in Spring Boot
- Circuit Breaker Pattern in Microservices - Java Guides - Circuit Breaker design pattern in microservices architecture
- Circuit Breaker Pattern for Resilient Systems - DZone - Practical application of Circuit Breaker for distributed system resilience
- Martin Fowler: CircuitBreaker - The original explanation of the Circuit Breaker pattern
- Resilience4j GitHub Repository - Source code and release notes