- Introduction
- Circuit Breaker State Machine
- Resilience4j Architecture
- Spring Boot 3 Integration Configuration
- CircuitBreaker Practical Implementation
- Retry, Bulkhead, and RateLimiter Combinations
- Grafana Monitoring Dashboard
- Troubleshooting Guide
- Operations Checklist
- Failure Cases and Recovery
- Advanced Pattern: Custom CircuitBreaker Registry
- Test Strategy
- References

Introduction
In microservices architecture, inter-service network calls are inherently unreliable. Network latency, timeouts, and downstream service failures occur routinely, and without proper controls, a single service failure can propagate across the entire system as a cascading failure. A representative example occurred in late 2024, when response latency in a single payment gateway on a large e-commerce platform cascaded to paralyze the order service, inventory service, and notification service.
The Circuit Breaker pattern is a failure isolation mechanism inspired by electrical circuit breakers. Since Michael Nygard first introduced it in Release It! in 2007, and through Martin Fowler's blog post, it has become a core pattern in the microservices world. Netflix's Hystrix was the first widely adopted implementation, but after entering maintenance mode in 2018, Resilience4j emerged as the de facto standard.
This article covers everything from the Circuit Breaker state machine operating principles, to integrating Resilience4j's core modules -- CircuitBreaker, Retry, Bulkhead, and RateLimiter -- in a Spring Boot 3 environment, monitoring with Grafana dashboards, and recovery strategies for real failure scenarios at an operational level.
Circuit Breaker State Machine
The core of the Circuit Breaker is a Finite State Machine that manages transitions between three states (CLOSED, OPEN, HALF-OPEN) and two special states (DISABLED, FORCED_OPEN).
State Transition Diagram
                  failure rate >= threshold
       ┌────────────────────────────────────────────────┐
       │                                                │
       ▼                                                │
 ┌──────────┐                                     ┌──────────┐
 │   OPEN   │                                     │  CLOSED  │
 │ (blocked)│                                     │ (normal) │
 └──────────┘                                     └──────────┘
       │                                                ▲
       │ waitDurationInOpenState elapsed                │
       ▼                                                │ trial call failure rate < threshold
 ┌───────────────┐                                      │
 │   HALF-OPEN   │ ─────────────────────────────────────┘
 │(trial allowed)│
 └───────────────┘
       │
       │ trial call failure rate >= threshold
       ▼
 ┌──────────┐
 │   OPEN   │  (blocked again)
 └──────────┘
Detailed Behavior by State
| State | Request Handling | Transition Condition | Metric Collection |
|---|---|---|---|
| CLOSED | All requests pass through | Transitions to OPEN when failure rate in sliding window exceeds threshold | Records success/failure/slow calls |
| OPEN | All requests immediately rejected (CallNotPermittedException) | Transitions to HALF-OPEN after waitDurationInOpenState elapses | Records rejected call count |
| HALF-OPEN | Allows only permittedNumberOfCalls | Transitions to CLOSED or OPEN based on trial call results | Records trial call success/failure |
| DISABLED | All requests pass through (circuit inactive) | Manual transition only | No metric collection |
| FORCED_OPEN | All requests immediately rejected | Manual transition only | Records rejected call count |
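The table above can be condensed into a deliberately simplified, framework-free state machine. The sketch below is for intuition only -- the class name MiniCircuitBreaker is hypothetical and this is not Resilience4j's implementation: it models COUNT_BASED evaluation and the HALF-OPEN trial window, but omits slow-call tracking, time-based windows, and the special states.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal circuit-breaker state machine sketch (hypothetical, not Resilience4j internals). */
class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;               // last-N calls considered (COUNT_BASED)
    private final double failureRateThreshold;  // e.g. 0.5 = 50%
    private final int permittedTrialCalls;      // calls allowed while HALF_OPEN
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private int trialCalls, trialFailures;

    MiniCircuitBreaker(int windowSize, double failureRateThreshold, int permittedTrialCalls) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.permittedTrialCalls = permittedTrialCalls;
    }

    State state() { return state; }

    /** OPEN -> HALF_OPEN; in the real library this happens after waitDurationInOpenState. */
    void waitDurationElapsed() {
        if (state == State.OPEN) { state = State.HALF_OPEN; trialCalls = trialFailures = 0; }
    }

    boolean tryAcquirePermission() {
        if (state == State.OPEN) return false;
        return state != State.HALF_OPEN || trialCalls < permittedTrialCalls;
    }

    void record(boolean failure) {
        if (state == State.HALF_OPEN) {
            trialCalls++;
            if (failure) trialFailures++;
            if (trialCalls == permittedTrialCalls) {
                // Decide based on trial results: back to OPEN or recover to CLOSED
                state = (double) trialFailures / trialCalls >= failureRateThreshold
                        ? State.OPEN : State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failure);
        if (window.size() > windowSize) window.removeFirst();
        long failures = window.stream().filter(f -> f).count();
        // Evaluate only once the window is full (stand-in for minimumNumberOfCalls)
        if (window.size() == windowSize && (double) failures / windowSize >= failureRateThreshold) {
            state = State.OPEN;
        }
    }
}
```

Walking a failure burst through this sketch reproduces the CLOSED → OPEN → HALF-OPEN → CLOSED cycle from the diagram.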
Sliding Window Type Comparison
Resilience4j provides two sliding window types.
| Aspect | COUNT_BASED | TIME_BASED |
|---|---|---|
| Basis | Last N calls | Calls in the last N seconds |
| Config Example | slidingWindowSize: 10 | slidingWindowSize: 60 |
| Memory Usage | Fixed (array of N results) | Variable (partial aggregations over N seconds) |
| Suited For | Services with consistent call frequency | Services with irregular call frequency |
| Evaluation | After the Nth call | Time window evaluated on each call |
COUNT_BASED is internally implemented as a circular bit array of size N, recording each call result in O(1) and calculating the failure rate in constant time. TIME_BASED uses N partial aggregation buckets, each aggregating call results for one second.
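The count-based window can be sketched with a plain ring buffer -- again a conceptual sketch, not the actual Resilience4j data structure: recording a call and reading the failure rate are both O(1) because the running failure count is adjusted as the oldest result is evicted.

```java
/** Sketch of a COUNT_BASED sliding window: fixed ring buffer, O(1) record and failure rate. */
class CountBasedWindow {
    private final boolean[] ring;   // true = failed call
    private int index, recorded, failures;

    CountBasedWindow(int size) { this.ring = new boolean[size]; }

    void record(boolean failed) {
        // Once full, the slot being overwritten holds the oldest result: evict it from the count
        if (recorded == ring.length && ring[index]) failures--;
        ring[index] = failed;
        if (failed) failures++;
        index = (index + 1) % ring.length;
        if (recorded < ring.length) recorded++;
    }

    /** Failure rate in percent over the recorded calls (NaN before the first call). */
    double failureRate() { return recorded == 0 ? Double.NaN : 100.0 * failures / recorded; }
}
```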
Resilience4j Architecture
Transitioning from Hystrix to Resilience4j
After Netflix Hystrix entered maintenance mode in 2018, Resilience4j established itself as the standard fault tolerance library in the JVM ecosystem.
| Comparison | Netflix Hystrix | Resilience4j |
|---|---|---|
| Status | Maintenance mode (no updates since 2018) | Active development (2.3.0 release in 2025) |
| Java Version | Java 8+ | Java 17+ (Spring Boot 3 support) |
| Dependencies | Multiple (Archaius, RxJava, etc.) | None in 2.x (Vavr in 1.x only) |
| Architecture | Monolithic (all features included) | Modular (select only needed modules) |
| Thread Model | Separate thread pool required | Semaphore-based (thread pool optional) |
| Configuration | Archaius required | Both application.yml and programmatic |
| Reactive Support | RxJava 1 | Native Reactor, RxJava 2/3 support |
| Functional Interface | Limited | Full support (Supplier, Function, Runnable, etc.) |
| Monitoring | Hystrix Dashboard | Micrometer integration (Prometheus, Grafana) |
Resilience4j Core Modules
Resilience4j provides five core modules that can be used independently or in combination.
| Module | Role | Key Configuration |
|---|---|---|
| CircuitBreaker | Circuit tripping based on failure rate | failureRateThreshold, slidingWindowSize |
| Retry | Retry on failure | maxAttempts, waitDuration, backoff |
| Bulkhead | Limit concurrent calls (isolation) | maxConcurrentCalls, maxWaitDuration |
| RateLimiter | Limit calls per time unit | limitForPeriod, limitRefreshPeriod |
| TimeLimiter | Limit call duration | timeoutDuration, cancelRunningFuture |
When combining via annotations, the application order is as follows:
Outer (evaluated first) ──────────────────────────────────> Inner (evaluated last)
Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead
This order is the default priority when Resilience4j processes annotations via Spring AOP. You can customize the order using properties like resilience4j.circuitbreaker.circuitBreakerAspectOrder.
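The nesting can be illustrated with plain function composition; the labels below are just strings standing in for the real aspects. Decorators are built inside-out (Bulkhead wraps the call first), so at invocation time the outermost one, Retry, runs first:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Illustrates aspect nesting: the outermost decorator runs first, the innermost wraps the call. */
class DecoratorOrder {
    static Supplier<String> wrap(String name, Supplier<String> inner, List<String> trace) {
        return () -> { trace.add(name); return inner.get(); };
    }

    static List<String> invocationOrder() {
        List<String> trace = new ArrayList<>();
        Supplier<String> decorated = () -> { trace.add("remoteCall"); return "ok"; };
        // Wrap inside-out, mirroring Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead
        for (String name : new String[]{"Bulkhead", "TimeLimiter", "RateLimiter", "CircuitBreaker", "Retry"}) {
            decorated = wrap(name, decorated, trace);
        }
        decorated.get();
        return trace;
    }
}
```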
Spring Boot 3 Integration Configuration
Dependency Setup
// build.gradle.kts (Spring Boot 3.3+ / Resilience4j 2.2+)
plugins {
    id("org.springframework.boot") version "3.3.5"
    id("io.spring.dependency-management") version "1.1.6"
    kotlin("jvm") version "1.9.25"
    kotlin("plugin.spring") version "1.9.25"
}

dependencies {
    // Resilience4j Spring Boot 3 Starter
    implementation("io.github.resilience4j:resilience4j-spring-boot3:2.2.0")

    // Individual modules (included in the starter, but explicit declaration is recommended)
    implementation("io.github.resilience4j:resilience4j-circuitbreaker")
    implementation("io.github.resilience4j:resilience4j-retry")
    implementation("io.github.resilience4j:resilience4j-bulkhead")
    implementation("io.github.resilience4j:resilience4j-ratelimiter")
    implementation("io.github.resilience4j:resilience4j-timelimiter")

    // Micrometer + Prometheus (monitoring)
    implementation("io.github.resilience4j:resilience4j-micrometer")
    implementation("io.micrometer:micrometer-registry-prometheus")

    // Spring Boot Actuator
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("org.springframework.boot:spring-boot-starter-aop")
    implementation("org.springframework.boot:spring-boot-starter-web")

    // Kotlin Coroutines (optional)
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactor")

    testImplementation("org.springframework.boot:spring-boot-starter-test")
}
Integrated Configuration File
# application.yml - Resilience4j integrated configuration
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 3s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        failureRateThreshold: 40
        waitDurationInOpenState: 60s
        slidingWindowSize: 20
      inventoryService:
        baseConfig: default
        failureRateThreshold: 60
        slowCallDurationThreshold: 5s
      notificationService:
        baseConfig: default
        failureRateThreshold: 70
        waitDurationInOpenState: 15s
  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2.0
        exponentialMaxWaitDuration: 10s
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.order.exception.BusinessValidationException
    instances:
      paymentGateway:
        baseConfig: default
        maxAttempts: 2
        waitDuration: 2s
      inventoryService:
        baseConfig: default
        maxAttempts: 4
      notificationService:
        baseConfig: default
        maxAttempts: 5
        waitDuration: 500ms
  bulkhead:
    configs:
      default:
        maxConcurrentCalls: 25
        maxWaitDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        maxConcurrentCalls: 15
      inventoryService:
        baseConfig: default
        maxConcurrentCalls: 30
      notificationService:
        baseConfig: default
        maxConcurrentCalls: 50
  ratelimiter:
    configs:
      default:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
        timeoutDuration: 500ms
    instances:
      paymentGateway:
        baseConfig: default
        limitForPeriod: 50
      inventoryService:
        baseConfig: default
        limitForPeriod: 200
  timelimiter:
    configs:
      default:
        timeoutDuration: 5s
        cancelRunningFuture: true
    instances:
      paymentGateway:
        baseConfig: default
        timeoutDuration: 10s
      inventoryService:
        baseConfig: default
        timeoutDuration: 3s

# Actuator metric exposure
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus,circuitbreakers,retries
  endpoint:
    health:
      show-details: always
  health:
    circuitbreakers:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        resilience4j.circuitbreaker.calls: true
        resilience4j.retry.calls: true
    tags:
      application: order-service
A notable point in the configuration is defining a base profile with configs.default and specifying baseConfig: default for each instance to inherit common settings. You only need to override thresholds specific to each service to minimize configuration duplication.
CircuitBreaker Practical Implementation
Annotation-Based Implementation (Kotlin)
// PaymentGatewayClient.kt
@Service
class PaymentGatewayClient(
    private val restClient: RestClient,
    private val paymentRetryQueue: PaymentRetryQueue,
    private val paymentCacheStore: PaymentCacheStore,
) {
    companion object {
        private val log = LoggerFactory.getLogger(PaymentGatewayClient::class.java)
        const val CB_NAME = "paymentGateway"
    }

    @CircuitBreaker(name = CB_NAME, fallbackMethod = "paymentFallback")
    @Retry(name = CB_NAME)
    @Bulkhead(name = CB_NAME)
    fun processPayment(request: PaymentRequest): PaymentResponse {
        log.info("Calling payment gateway for orderId={}", request.orderId)
        val response = restClient.post()
            .uri("https://payment-api.internal/v2/charges")
            .contentType(MediaType.APPLICATION_JSON)
            .body(request)
            .retrieve()
            .body(PaymentResponse::class.java)
            ?: throw PaymentGatewayException("Empty response from payment gateway")
        log.info("Payment processed: orderId={}, txId={}", request.orderId, response.transactionId)
        return response
    }

    /**
     * Fallback method: called when the CircuitBreaker is OPEN or an exception occurs.
     * The signature must match the original method's, plus an Exception as the last parameter.
     */
    private fun paymentFallback(request: PaymentRequest, ex: Exception): PaymentResponse {
        log.warn(
            "Payment fallback activated: orderId={}, reason={}",
            request.orderId, ex.message
        )
        return when (ex) {
            is CallNotPermittedException -> {
                // CircuitBreaker OPEN state: enqueue for async processing
                paymentRetryQueue.enqueue(request)
                PaymentResponse(
                    orderId = request.orderId,
                    status = PaymentStatus.QUEUED,
                    message = "Payment has been queued. It will be processed shortly.",
                    transactionId = null,
                )
            }
            is BulkheadFullException -> {
                // Bulkhead saturated: ask the client to retry shortly
                PaymentResponse(
                    orderId = request.orderId,
                    status = PaymentStatus.RETRY_LATER,
                    message = "Too many payment requests at the moment. Please try again shortly.",
                    transactionId = null,
                )
            }
            else -> {
                // Other exceptions: return cached payment info if available
                val cached = paymentCacheStore.getLastSuccess(request.orderId)
                if (cached != null) {
                    log.info("Returning cached payment for orderId={}", request.orderId)
                    cached.copy(status = PaymentStatus.CACHED)
                } else {
                    paymentRetryQueue.enqueue(request)
                    PaymentResponse(
                        orderId = request.orderId,
                        status = PaymentStatus.PENDING,
                        message = "An error occurred during payment processing. Automatic retry in progress.",
                        transactionId = null,
                    )
                }
            }
        }
    }
}
Programmatic Implementation (Java)
Instead of annotations, you can use the CircuitBreakerRegistry directly to dynamically create circuit breakers or change configurations at runtime.
// InventoryServiceClient.java
@Service
@Slf4j
public class InventoryServiceClient {

    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;
    private final RestClient restClient;

    public InventoryServiceClient(
            CircuitBreakerRegistry cbRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry,
            RestClient.Builder restClientBuilder) {
        this.circuitBreaker = cbRegistry.circuitBreaker("inventoryService");
        this.retry = retryRegistry.retry("inventoryService");
        this.bulkhead = bulkheadRegistry.bulkhead("inventoryService");
        this.restClient = restClientBuilder
                .baseUrl("https://inventory-api.internal")
                .build();
        // Register event listeners
        registerEventListeners();
    }

    public InventoryResponse checkStock(String productId, int quantity) {
        // Decorated inside-out: Bulkhead wraps the call, CircuitBreaker wraps Bulkhead,
        // Retry is outermost -- so at runtime Retry -> CircuitBreaker -> Bulkhead -> actual call
        Supplier<InventoryResponse> decorated = Decorators
                .ofSupplier(() -> doCheckStock(productId, quantity))
                .withBulkhead(bulkhead)
                .withCircuitBreaker(circuitBreaker)
                .withRetry(retry)
                .withFallback(
                        List.of(
                                CallNotPermittedException.class,
                                BulkheadFullException.class,
                                IOException.class
                        ),
                        ex -> stockFallback(productId, quantity, ex)
                )
                .decorate();
        return decorated.get();
    }

    private InventoryResponse doCheckStock(String productId, int quantity) {
        return restClient.get()
                .uri("/v1/stock/{productId}?qty={qty}", productId, quantity)
                .retrieve()
                .body(InventoryResponse.class);
    }

    private InventoryResponse stockFallback(
            String productId, int quantity, Throwable ex) {
        log.warn("Inventory fallback: productId={}, reason={}", productId, ex.getMessage());
        // When stock is uncertain, accept the order but schedule async verification
        return InventoryResponse.builder()
                .productId(productId)
                .available(true)
                .reservationStatus(ReservationStatus.TENTATIVE)
                .message("Stock check delayed: tentative approval with async verification scheduled")
                .build();
    }

    private void registerEventListeners() {
        circuitBreaker.getEventPublisher()
                .onStateTransition(event ->
                        log.warn("[CircuitBreaker] {} state: {} -> {}",
                                event.getCircuitBreakerName(),
                                event.getStateTransition().getFromState(),
                                event.getStateTransition().getToState())
                )
                .onError(event ->
                        log.error("[CircuitBreaker] {} error: {} ({}ms)",
                                event.getCircuitBreakerName(),
                                event.getThrowable().getMessage(),
                                event.getElapsedDuration().toMillis())
                )
                .onSuccess(event ->
                        log.debug("[CircuitBreaker] {} success ({}ms)",
                                event.getCircuitBreakerName(),
                                event.getElapsedDuration().toMillis())
                )
                .onCallNotPermitted(event ->
                        log.warn("[CircuitBreaker] {} call not permitted (OPEN state)",
                                event.getCircuitBreakerName())
                );

        retry.getEventPublisher()
                .onRetry(event ->
                        log.info("[Retry] {} attempt #{} (wait: {}ms)",
                                event.getName(),
                                event.getNumberOfRetryAttempts(),
                                event.getWaitInterval().toMillis())
                );
    }
}
Retry, Bulkhead, and RateLimiter Combinations
Retry and Exponential Backoff
The most important aspect of retry strategy is combining exponential backoff with jitter. Fixed-interval retries cause a thundering herd problem where multiple clients retry simultaneously, concentrating load on the server.
// Programmatic RetryConfig customization
@Configuration
class ResilienceConfig {

    @Bean
    fun customRetryConfig(): RetryConfig {
        return RetryConfig.custom<Any>()
            .maxAttempts(4)
            .intervalFunction(
                // Exponential backoff + jitter: ~1s, ~2s, ~4s between the 4 attempts, capped at 15s
                IntervalFunction.ofExponentialRandomBackoff(
                    Duration.ofSeconds(1),  // initial wait duration
                    2.0,                    // multiplier
                    Duration.ofSeconds(15)  // max wait duration
                )
            )
            .retryOnException { ex ->
                // Determine retry-eligible exceptions
                when (ex) {
                    is IOException -> true
                    is TimeoutException -> true
                    is HttpServerErrorException -> true
                    is ConnectException -> true
                    else -> false
                }
            }
            .ignoreExceptions(
                BusinessValidationException::class.java,
                IllegalArgumentException::class.java
            )
            .failAfterMaxAttempts(true) // Throw MaxRetriesExceededException after max retries
            .build()
    }

    @Bean
    fun retryRegistry(customRetryConfig: RetryConfig): RetryRegistry {
        return RetryRegistry.of(customRetryConfig)
    }
}
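The effect of jitter is easiest to see in a tiny standalone calculation. The sketch below implements "full jitter" on top of a capped exponential base; this is a hypothetical helper for illustration, not Resilience4j's IntervalFunction (which uses a randomization factor, so its exact distribution differs), but the point is the same: each client's wait lands somewhere in a widening band instead of on the same fixed instant.

```java
import java.util.Random;

/** Sketch of exponential backoff with full random jitter (hypothetical helper, for illustration). */
class BackoffIntervals {
    /** Wait in ms before retry attempt n (1-based): capped exponential base plus jitter in [0, base). */
    static long waitMillis(int attempt, long initialMillis, double multiplier, long capMillis, Random rnd) {
        double base = initialMillis * Math.pow(multiplier, attempt - 1);
        long capped = (long) Math.min(base, capMillis);
        return capped + (long) (rnd.nextDouble() * capped); // jitter spreads simultaneous retries apart
    }
}
```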
Bulkhead: Semaphore vs Thread Pool
Bulkhead is a pattern inspired by ship compartment walls (bulkheads), preventing a single service call from monopolizing all resources. Resilience4j provides two Bulkhead implementations.
| Aspect | SemaphoreBulkhead | ThreadPoolBulkhead |
|---|---|---|
| Isolation Level | Limits concurrent calls | Executes in a separate thread pool |
| Call Thread | Uses the caller's thread directly | Uses threads from a dedicated thread pool |
| Return Type | Synchronous return | CompletionStage return |
| Overhead | Low | Thread context switching cost |
| Suited For | Most HTTP calls | CPU-intensive tasks, when full isolation is needed |
| Configuration | maxConcurrentCalls, maxWaitDuration | maxThreadPoolSize, coreThreadPoolSize, queueCapacity |
# ThreadPoolBulkhead configuration example
resilience4j:
  thread-pool-bulkhead:
    instances:
      heavyProcessing:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 100ms
        writableStackTraceEnabled: true
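Conceptually, a SemaphoreBulkhead is little more than a counting semaphore with a bounded wait. The sketch below (MiniBulkhead is a hypothetical name, not the library class; Resilience4j throws BulkheadFullException where this throws IllegalStateException) shows why its overhead is so low: the caller's own thread does the work, and rejection is just a failed permit acquisition.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Conceptual SemaphoreBulkhead: bound concurrent calls, reject after a bounded wait. */
class MiniBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    MiniBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            // Wait at most maxWaitMillis for a free slot (maxWaitDuration analogue)
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) throw new IllegalStateException("Bulkhead full");
        try {
            return call.get();       // executed on the caller's thread
        } finally {
            permits.release();       // always free the slot
        }
    }
}
```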
RateLimiter Configuration and Usage
RateLimiter limits the number of calls allowed per time unit, preventing external API rate limit violations or protecting internal services from overload.
// RateLimiter and CircuitBreaker combination
@Service
@Slf4j
public class ExternalApiClient {

    private final RestClient restClient;

    @CircuitBreaker(name = "externalApi", fallbackMethod = "apiFallback")
    @RateLimiter(name = "externalApi")
    @Retry(name = "externalApi")
    public ApiResponse callExternalApi(ApiRequest request) {
        log.debug("Calling external API: endpoint={}", request.getEndpoint());
        return restClient.post()
                .uri(request.getEndpoint())
                .body(request.getPayload())
                .retrieve()
                .body(ApiResponse.class);
    }

    private ApiResponse apiFallback(ApiRequest request, RequestNotPermitted ex) {
        // Rejected by RateLimiter
        log.warn("Rate limit exceeded for external API: {}", request.getEndpoint());
        return ApiResponse.rateLimited(
                "Request limit exceeded. " +
                "Check limitForPeriod settings or try again later."
        );
    }

    private ApiResponse apiFallback(ApiRequest request, Exception ex) {
        // Other exceptions (CircuitBreaker OPEN, network errors, etc.)
        log.warn("External API fallback: endpoint={}, reason={}",
                request.getEndpoint(), ex.getMessage());
        return ApiResponse.error("External API call failed: " + ex.getMessage());
    }
}
An important note when overloading fallback methods: Resilience4j selects the most specific fallback based on exception type. By separating RequestNotPermitted (RateLimiter rejection) and Exception (general exceptions), you can execute different fallback logic based on the exception cause.
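The limitForPeriod / limitRefreshPeriod semantics can be approximated with a tiny fixed-window counter. This is a sketch of the idea only (MiniRateLimiter is hypothetical; the real AtomicRateLimiter is lock-free and carries unused permits over differently); time is injected as a parameter so the behavior is deterministic.

```java
/** Sketch of limitForPeriod / limitRefreshPeriod: a fixed window of permits that refreshes each period. */
class MiniRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodNanos;
    private long windowStart;
    private int used;

    MiniRateLimiter(int limitForPeriod, long refreshPeriodNanos, long nowNanos) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
        this.windowStart = nowNanos;
    }

    /** Returns true if a permit is available at the given time (time injected for testability). */
    boolean tryAcquire(long nowNanos) {
        if (nowNanos - windowStart >= refreshPeriodNanos) { // period elapsed: refresh permits
            windowStart = nowNanos;
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false; // Resilience4j would throw RequestNotPermitted after timeoutDuration
    }
}
```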
Grafana Monitoring Dashboard
Prometheus Metric Collection
Resilience4j automatically exposes metrics via Micrometer. The following metrics are available at Spring Boot Actuator's /actuator/prometheus endpoint:
# CircuitBreaker state check (0=CLOSED, 1=OPEN, 2=HALF_OPEN, 3=DISABLED, 4=FORCED_OPEN)
resilience4j_circuitbreaker_state{name="paymentGateway"}
# Failure rate (%)
resilience4j_circuitbreaker_failure_rate{name="paymentGateway"}
# Slow call rate (%)
resilience4j_circuitbreaker_slow_call_rate{name="paymentGateway"}
# Call statistics (kind: successful, failed, ignored, not_permitted)
rate(resilience4j_circuitbreaker_calls_seconds_count{name="paymentGateway"}[5m])
# Call latency distribution (histogram)
histogram_quantile(0.95,
rate(resilience4j_circuitbreaker_calls_seconds_bucket{name="paymentGateway"}[5m])
)
# Retry count
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="successful_with_retry"}[1h])
increase(resilience4j_retry_calls_total{name="paymentGateway", kind="failed_with_retry"}[1h])
# Bulkhead available concurrent calls
resilience4j_bulkhead_available_concurrent_calls{name="paymentGateway"}
# RateLimiter available permissions
resilience4j_ratelimiter_available_permissions{name="externalApi"}
Grafana Dashboard JSON Configuration
Here are the essential panels and their PromQL queries for the Grafana dashboard.
Panel 1 - CircuitBreaker State Gauge
resilience4j_circuitbreaker_state{application="order-service"}
Use value mapping to map 0=CLOSED (green), 1=OPEN (red), 2=HALF_OPEN (yellow).
Panel 2 - Failure Rate Trend (Time Series)
resilience4j_circuitbreaker_failure_rate{application="order-service", name=~".*"}
Add a threshold line (failureRateThreshold) to visually identify when the circuit transitions to OPEN.
Panel 3 - Call Success/Failure Ratio (Stacked Bar)
sum by (name, kind) (
rate(resilience4j_circuitbreaker_calls_seconds_count{application="order-service"}[5m])
)
Panel 4 - P95 Response Time (Time Series)
histogram_quantile(0.95,
sum by (le, name) (
rate(resilience4j_circuitbreaker_calls_seconds_bucket{application="order-service"}[5m])
)
)
Panel 5 - Bulkhead Concurrent Call Status (Gauge)
resilience4j_bulkhead_max_allowed_concurrent_calls{application="order-service"}
- resilience4j_bulkhead_available_concurrent_calls{application="order-service"}
Alert Rule Configuration
Register the following alert rules in Grafana or Prometheus Alertmanager.
# prometheus-alerts.yml
groups:
  - name: resilience4j_alerts
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: 'CircuitBreaker OPEN: {{ $labels.name }}'
          description: >
            The {{ $labels.name }} circuit breaker in service
            {{ $labels.application }} is in OPEN state.
            Check downstream service failures.
      - alert: HighFailureRate
        expr: resilience4j_circuitbreaker_failure_rate > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'High failure rate: {{ $labels.name }} ({{ $value }}%)'
          description: >
            The failure rate of {{ $labels.name }} is {{ $value }}%,
            exceeding the warning threshold (30%).
      - alert: BulkheadSaturation
        expr: >
          (resilience4j_bulkhead_max_allowed_concurrent_calls
            - resilience4j_bulkhead_available_concurrent_calls)
          / resilience4j_bulkhead_max_allowed_concurrent_calls > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Bulkhead 80% saturated: {{ $labels.name }}'
      - alert: ExcessiveRetries
        expr: >
          rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m])
          / rate(resilience4j_retry_calls_total[5m]) > 0.5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: 'Retry failure rate exceeds 50%: {{ $labels.name }}'
Troubleshooting Guide
Issue 1: CircuitBreaker Does Not Transition to OPEN
Symptoms: Failures are clearly occurring, but the circuit remains in CLOSED state.
Root Cause Analysis:
- minimumNumberOfCalls has not been reached. The default value is 100, so a low-frequency service may never record enough calls for the failure rate to be evaluated before the failure resolves.
- The exception is included in ignoreExceptions. Check that no unintended exceptions -- not just business exceptions -- have ended up in the ignore list.
- The exception is not included in recordExceptions. When recordExceptions is specified, exceptions absent from the list are not recorded as failures.
Resolution: Adjust minimumNumberOfCalls to match the service call frequency, and review the recordExceptions and ignoreExceptions lists.
Issue 2: More Calls Than Expected When Combining Retry and CircuitBreaker
Symptoms: maxAttempts is set to 3, but more than 5 calls are recorded on the downstream service.
Root Cause Analysis: In the annotation application order, Retry sits outside CircuitBreaker. Therefore, after the CircuitBreaker records a failure, Retry attempts the call again through the CircuitBreaker. If trial calls are added during the CircuitBreaker's HALF-OPEN state, the total call count can exceed expectations.
Resolution: Set Retry's maxAttempts conservatively, and calculate the maximum number of calls produced by the combination of CircuitBreaker's slidingWindowSize and Retry's maxAttempts to predict downstream load.
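The back-of-envelope calculation that Resolution recommends is trivial but worth automating; the sketch below is a hypothetical helper showing the worst case when the circuit stays CLOSED and every logical request exhausts its retries.

```java
/** Worst-case downstream call amplification when Retry wraps each client call. */
class RetryAmplification {
    static long worstCaseCallsPerSecond(long requestsPerSecondPerInstance, int instances, int maxAttempts) {
        // If the circuit never opens, every logical request may be attempted maxAttempts times
        return requestsPerSecondPerInstance * instances * maxAttempts;
    }
}
```

For example, 100 requests/s per instance across 20 instances with maxAttempts=5 can amplify to 10,000 downstream calls per second.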
Issue 3: Fallback Method Is Not Being Called
Symptoms: CircuitBreaker is OPEN, but CallNotPermittedException is propagated directly to the client.
Root Cause Analysis: The fallback method's signature does not exactly match the original method. The fallback method must accept all parameters of the original method in the same order and type, plus an Exception (or specific exception type) as the last parameter.
Resolution: Review the fallback method signature. The return type must also exactly match the original. Below are correct and incorrect examples:
// Original method
@CircuitBreaker(name = "svc", fallbackMethod = "fallback")
public OrderResponse getOrder(String orderId, boolean includeDetails) { ... }
// Correct fallback (same parameters + Exception added)
private OrderResponse fallback(String orderId, boolean includeDetails, Exception ex) { ... }
// Incorrect fallback - compiles but fails to match at runtime
private OrderResponse fallback(String orderId, Exception ex) { ... } // Missing parameter
private void fallback(String orderId, boolean includeDetails, Exception ex) { ... } // Return type mismatch
Issue 4: Memory Usage Increases with TIME_BASED Window
Symptoms: Using a TIME_BASED sliding window and heap memory usage gradually increases.
Root Cause Analysis: The slidingWindowSize is set too large. For example, setting slidingWindowSize=600 (10 minutes) maintains 600 partial aggregation buckets. With high traffic, call records accumulate in each bucket, consuming memory.
Resolution: For TIME_BASED, set slidingWindowSize to 60 seconds or less, and observe long-term trends through Prometheus metrics. In memory-sensitive environments, prefer COUNT_BASED.
Operations Checklist
Here are items that must be verified before deploying Circuit Breakers to production.
Configuration Verification
- Is the ratio of slidingWindowSize to minimumNumberOfCalls appropriate? (minimumNumberOfCalls should be 50% or less of slidingWindowSize)
- Is failureRateThreshold set according to service characteristics? (Payment: 30-40%, Notification: 60-70%)
- Does waitDurationInOpenState match the downstream service's average recovery time?
- Is slowCallDurationThreshold set at or above the normal response time P99?
- Are recordExceptions and ignoreExceptions properly categorized?
Monitoring Verification
- Are resilience4j metrics being collected properly in Prometheus?
- Does the Grafana dashboard display CircuitBreaker state, failure rate, and call statistics?
- Are CircuitBreaker OPEN alerts being delivered to Slack, PagerDuty, etc.?
- Are there metrics tracking OPEN state duration?
Fallback Strategy Verification
- Are fallback methods connected to all CircuitBreakers?
- Do fallback methods return meaningful responses? (No simple null returns)
- How are exceptions in the fallback method itself handled?
- Is a cache expiration policy configured when using cache fallback?
- When using alternative service fallback, is a CircuitBreaker also configured for that service?
Test Verification
- Have unit tests verified CircuitBreaker state transitions (CLOSED, OPEN, HALF-OPEN)?
- Have integration tests reproduced actual timeout and network error scenarios?
- Has fault injection testing been performed with chaos engineering tools (Chaos Monkey, Litmus)?
- Has Bulkhead saturation behavior been verified in load tests?
Deployment Strategy
- Apply new CircuitBreaker configurations via canary deployment to a subset of traffic first
- Can configuration changes be applied without downtime via Config Server (Spring Cloud Config) or environment variables?
- Is Git history management in place for CircuitBreaker configurations?
- Is a rollback plan established?
Failure Cases and Recovery
Case 1: Downstream Overload Due to Retry Storm
Situation: Response times from the payment service began increasing. The order service had Retry configured with maxAttempts=5 and a fixed 1-second interval. With 20 order service instances and 100 orders per second, up to 10,000 requests per second (100 x 20 x 5) were flooding the payment service.
Cause: Fixed-interval retries were used without exponential backoff and jitter. Also, Retry was used standalone without a CircuitBreaker, so retries continued even on failure.
Recovery Procedure:
1. Immediately disable Retry or set maxAttempts to 1 to stop retries
2. Once the payment service load stabilizes, replace with a Retry configuration that includes exponential backoff + jitter
3. Place the CircuitBreaker inside Retry so that retries are blocked when the circuit is OPEN
Prevention: Always use Retry together with CircuitBreaker, and apply exponential backoff + random jitter by default. Prohibit fixed-interval retries as a policy.
Case 2: Permanent OPEN Circuit Due to Incorrect Exception Classification
Situation: After deploying a new feature to the inventory service, certain product queries started returning 400 Bad Request. These 400 responses were caught as HttpClientErrorException and included in the failure rate calculation, causing the CircuitBreaker to transition to OPEN and block all inventory queries. Even normal product queries became impossible.
Cause: recordExceptions included HttpClientErrorException (4xx). Since 4xx errors are client-side issues, the circuit breaker should not intervene. Circuit breakers should only respond to server-side failures (5xx, timeouts, connection failures).
Recovery Procedure:
1. Manually transition the CircuitBreaker to the DISABLED state to immediately restore normal traffic (Resilience4j has no "forced closed" state; DISABLED passes all calls through without recording metrics)
// Force a state transition programmatically, e.g. from an internal admin endpoint
circuitBreakerRegistry.circuitBreaker("inventoryService")
        .transitionToDisabledState(); // all calls pass through while the fix is prepared
2. Remove HttpClientErrorException from recordExceptions and add it to ignoreExceptions
3. After applying the configuration, call transitionToClosedState() to return to normal CircuitBreaker operation
Prevention: Document exception classification principles. 4xx (client errors) go in ignoreExceptions, 5xx (server errors) go in recordExceptions, and business validation exceptions go in ignoreExceptions.
Case 3: Traffic Loss Due to HALF-OPEN Bottleneck
Situation: Even after the payment service recovered, order processing throughput did not recover. Traffic analysis revealed that the CircuitBreaker was set with permittedNumberOfCallsInHalfOpenState=1 in the HALF-OPEN state, allowing only 1 trial call. This trial call intermittently failed, causing flapping between OPEN and HALF-OPEN states.
Cause: The permittedNumberOfCallsInHalfOpenState value was too low. With only 1 trial call, a single failure returns the circuit to OPEN, making it difficult to return to CLOSED when the downstream service responds only intermittently.
Recovery Procedure:
1. Increase permittedNumberOfCallsInHalfOpenState to 5-10
2. Verify that automaticTransitionFromOpenToHalfOpenEnabled is set to true for automatic transitions without manual intervention
3. Adjust waitDurationInOpenState to match the downstream service's average recovery time
Prevention: Set HALF-OPEN trial calls to at least 3 or more, and combine with the failure rate threshold to enable statistically meaningful decisions. Add OPEN-HALF_OPEN flapping detection to monitoring alerts.
Advanced Pattern: Custom CircuitBreaker Registry
As the number of services grows, configuring CircuitBreakers individually for each service can become inefficient. You can implement a custom registry that dynamically creates and manages CircuitBreakers.
// DynamicCircuitBreakerFactory.kt
@Component
class DynamicCircuitBreakerFactory(
private val circuitBreakerRegistry: CircuitBreakerRegistry,
private val meterRegistry: MeterRegistry,
) {
private val log = LoggerFactory.getLogger(javaClass)
    /**
     * Dynamically creates a CircuitBreaker based on service name.
     * Returns the existing instance if one already exists; in that case the
     * tier argument has no effect, because the config supplier is only
     * consulted when the CircuitBreaker is first created.
     */
fun getOrCreate(
serviceName: String,
tier: ServiceTier = ServiceTier.STANDARD,
): CircuitBreaker {
return circuitBreakerRegistry.circuitBreaker(serviceName) {
buildConfigForTier(tier)
}.also { cb ->
registerMetrics(cb)
log.info(
"CircuitBreaker created/retrieved: name={}, tier={}, state={}",
serviceName, tier, cb.state
)
}
}
private fun buildConfigForTier(tier: ServiceTier): CircuitBreakerConfig {
return when (tier) {
ServiceTier.CRITICAL -> CircuitBreakerConfig.custom()
.failureRateThreshold(30f)
.slowCallRateThreshold(60f)
.slowCallDurationThreshold(Duration.ofSeconds(2))
.waitDurationInOpenState(Duration.ofSeconds(60))
.slidingWindowSize(20)
.minimumNumberOfCalls(10)
.permittedNumberOfCallsInHalfOpenState(5)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
ServiceTier.STANDARD -> CircuitBreakerConfig.custom()
.failureRateThreshold(50f)
.slowCallRateThreshold(80f)
.slowCallDurationThreshold(Duration.ofSeconds(3))
.waitDurationInOpenState(Duration.ofSeconds(30))
.slidingWindowSize(10)
.minimumNumberOfCalls(5)
.permittedNumberOfCallsInHalfOpenState(3)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
ServiceTier.BEST_EFFORT -> CircuitBreakerConfig.custom()
.failureRateThreshold(70f)
.slowCallRateThreshold(90f)
.slowCallDurationThreshold(Duration.ofSeconds(5))
.waitDurationInOpenState(Duration.ofSeconds(15))
.slidingWindowSize(5)
.minimumNumberOfCalls(3)
.permittedNumberOfCallsInHalfOpenState(2)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build()
}
}
    private val metricsBound = java.util.concurrent.atomic.AtomicBoolean(false)

    private fun registerMetrics(cb: CircuitBreaker) {
        // TaggedCircuitBreakerMetrics binds at the registry level and keeps tracking
        // CircuitBreakers added later, so bind it exactly once; re-binding on every
        // getOrCreate call would stack up duplicate event consumers on the registry.
        if (metricsBound.compareAndSet(false, true)) {
            TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(circuitBreakerRegistry)
                .bindTo(meterRegistry)
        }
    }
enum class ServiceTier {
CRITICAL, // Core services like payment, authentication
STANDARD, // General services like inventory, shipping
BEST_EFFORT // Non-critical services like notifications, recommendations
}
}
Using this factory, appropriate CircuitBreaker configurations are automatically applied based on the service tier. CRITICAL services are conservatively protected with low failure rate thresholds and long wait durations, while BEST_EFFORT services operate flexibly with high thresholds.
Test Strategy
Here are the essential test cases that must be written when introducing CircuitBreakers.
// CircuitBreakerIntegrationTest.java
@SpringBootTest
@AutoConfigureMockMvc
class CircuitBreakerIntegrationTest {
@Autowired
private CircuitBreakerRegistry circuitBreakerRegistry;
@Autowired
private MockMvc mockMvc;
@MockBean
private RestClient paymentRestClient;
@Test
@DisplayName("CircuitBreaker transitions to OPEN when failure rate threshold is exceeded")
void shouldTransitionToOpenOnFailureThreshold() {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset(); // Reset state for test isolation
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
// With slidingWindowSize=20, failureRateThreshold=40
// Must call minimumNumberOfCalls=10 or more, then 40%+ failure -> OPEN
// 10 calls with 5 failures = 50% failure rate -> OPEN transition
for (int i = 0; i < 5; i++) {
cb.onSuccess(100, TimeUnit.MILLISECONDS);
}
for (int i = 0; i < 5; i++) {
cb.onError(100, TimeUnit.MILLISECONDS, new IOException("connection refused"));
}
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
assertThat(cb.getMetrics().getFailureRate()).isGreaterThanOrEqualTo(40f);
}
@Test
@DisplayName("Transitions to HALF-OPEN after waitDuration elapses in OPEN state")
void shouldTransitionToHalfOpenAfterWaitDuration() throws Exception {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset();
cb.transitionToOpenState();
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
// Simulate waitDurationInOpenState elapsed
// (In tests, use a short waitDuration setting or transition directly)
cb.transitionToHalfOpenState();
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.HALF_OPEN);
}
@Test
@DisplayName("Transitions to CLOSED on successful trial calls in HALF-OPEN")
void shouldTransitionToClosedOnSuccessfulTrialCalls() {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.reset();
cb.transitionToOpenState();
cb.transitionToHalfOpenState();
// permittedNumberOfCallsInHalfOpenState=5 successful calls
for (int i = 0; i < 5; i++) {
cb.onSuccess(50, TimeUnit.MILLISECONDS);
}
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
}
@Test
@DisplayName("Fallback method is properly called when CircuitBreaker is OPEN")
void shouldInvokeFallbackWhenCircuitIsOpen() throws Exception {
// Force circuit to OPEN state
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("paymentGateway");
cb.transitionToForcedOpenState();
mockMvc.perform(post("/api/v1/orders")
.contentType(MediaType.APPLICATION_JSON)
.content("{\"productId\": \"P001\", \"quantity\": 1}"))
.andExpect(status().isOk())
.andExpect(jsonPath("$.payment.status").value("QUEUED"))
.andExpect(jsonPath("$.payment.message").exists());
// Restore state after test
cb.transitionToClosedState();
}
}
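For the waitDuration comment in the second test, one practical approach is a test profile that shortens the OPEN window so the HALF-OPEN transition can be observed without long sleeps. A hypothetical src/test/resources/application-test.yml matching the values assumed in the tests above:

```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentGateway:
        sliding-window-size: 20
        minimum-number-of-calls: 10
        failure-rate-threshold: 40
        permitted-number-of-calls-in-half-open-state: 5
        wait-duration-in-open-state: 100ms   # short window so tests avoid Thread.sleep
        automatic-transition-from-open-to-half-open-enabled: true
```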
References
- Resilience4j CircuitBreaker Official Documentation - Detailed configuration and operating principles of the CircuitBreaker module
- Spring Cloud Circuit Breaker Reference - Spring Cloud and Resilience4j integration guide
- Spring Boot Circuit Breaker Pattern with Resilience4j - GeeksforGeeks - Step-by-step implementation tutorial in Spring Boot
- Circuit Breaker Pattern in Microservices - Java Guides - Circuit Breaker design pattern in microservices architecture
- Circuit Breaker Pattern for Resilient Systems - DZone - Practical application of Circuit Breaker for distributed system resilience
- Martin Fowler: CircuitBreaker - The original explanation of the Circuit Breaker pattern
- Resilience4j GitHub Repository - Source code and release notes