
Circuit Breaker and Resilience Patterns Practical Guide — Resilience4j, Istio, Fault Isolation Strategies


Introduction

In a microservices architecture, inter-service calls are inherently unreliable. Network latency, timeouts, and downstream service failures occur routinely, and without proper defense mechanisms a single service failure can propagate across the entire system, causing a cascading failure. A representative example occurred during the 2024 Black Friday period, when latency in a large e-commerce platform's product recommendation service caused the entire product listing page to take over 20 seconds to load.

Resilience Patterns emerged to solve these problems. Starting with the Circuit Breaker pattern first introduced by Michael Nygard in his 2007 book Release It!, various patterns including Bulkhead, Retry, Rate Limiter, Timeout, and Fallback have been systematized. Netflix Hystrix was the first popular implementation, but after entering maintenance mode in 2018, Resilience4j has become the de facto standard in the Java/Spring ecosystem, while Istio provides infrastructure-level circuit breakers in service mesh environments.

This article comprehensively covers circuit breaker operating principles, practical implementation with Resilience4j and Istio, composite resilience pattern design, Hystrix migration, operational monitoring, and failure case analysis.


1. Circuit Breaker Pattern Principles

The circuit breaker is a pattern inspired by electrical circuit breakers, detecting remote service call failures and automatically blocking calls to prevent cascading failures across the entire system.

1.1 Three States: Closed, Open, Half-Open

              failure rate >= threshold (failureRateThreshold)
+----------+ -----------------------------------------------> +----------+
|  CLOSED  |                                                  |   OPEN   |
| (normal) |                                                  | (blocked)|
+----------+                                                  +----------+
     ^                                                             |
     | trial call success rate >= threshold                        | waitDurationInOpenState
     |                                                             | elapses
     |                                                             v
     |                                                      +--------------+
     +----------------------------------------------------- |  HALF-OPEN   |
                                                            | (trial mode) |
                                                            +--------------+
                                                                   |
                                                                   | trial call failure
                                                                   v
                                                              +----------+
                                                              |   OPEN   |  (blocked again)
                                                              +----------+

The behavior of each state is as follows:

| State     | Behavior                                                                | Transition Condition                                               |
|-----------|-------------------------------------------------------------------------|--------------------------------------------------------------------|
| CLOSED    | Passes all requests normally and records results in the sliding window  | Transitions to OPEN when the failure rate exceeds the threshold    |
| OPEN      | Immediately rejects all requests, throwing CallNotPermittedException    | Transitions to HALF-OPEN after waitDurationInOpenState elapses     |
| HALF-OPEN | Allows only a limited number of trial calls                             | Transitions to CLOSED or OPEN based on the trial call success rate |
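The state machine described above can be sketched in a few dozen lines of plain Java. This is a simplified, count-based illustration with a manual clock for determinism, not Resilience4j's actual implementation (which adds thread safety, metrics, and configurable half-open trial counts):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal CLOSED -> OPEN -> HALF-OPEN state machine. Illustrative only.
public class SimpleCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // slidingWindowSize
    private final int minimumCalls;            // minimumNumberOfCalls
    private final double failureRateThreshold; // e.g. 50.0 (%)
    private final long waitDurationMillis;     // waitDurationInOpenState
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure

    private State state = State.CLOSED;
    private long openedAt;
    private long now; // manual clock to keep the example deterministic

    public SimpleCircuitBreaker(int windowSize, int minimumCalls,
                                double failureRateThreshold, long waitDurationMillis) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureRateThreshold = failureRateThreshold;
        this.waitDurationMillis = waitDurationMillis;
    }

    public void advanceClock(long millis) { now += millis; }

    public boolean tryAcquirePermission() {
        if (state == State.OPEN) {
            if (now - openedAt >= waitDurationMillis) {
                state = State.HALF_OPEN; // allow trial calls
                return true;
            }
            return false; // Resilience4j would throw CallNotPermittedException here
        }
        return true;
    }

    public void recordResult(boolean failure) {
        if (state == State.HALF_OPEN) {
            // Simplified: a single trial call decides the transition.
            if (failure) { state = State.OPEN; openedAt = now; }
            else { state = State.CLOSED; window.clear(); }
            return;
        }
        window.addLast(failure);
        if (window.size() > windowSize) window.removeFirst();
        if (window.size() >= minimumCalls) {
            long failures = window.stream().filter(f -> f).count();
            if (100.0 * failures / window.size() >= failureRateThreshold) {
                state = State.OPEN;
                openedAt = now;
            }
        }
    }

    public State getState() { return state; }
}
```

Note how minimumNumberOfCalls gates the failure-rate check: the circuit cannot open until the window holds a statistically meaningful number of results.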

1.2 Sliding Window Types

Resilience4j supports two types of sliding windows:

  • COUNT_BASED: Calculates failure rate based on the last N call results. Suitable for services with consistent traffic.
  • TIME_BASED: Calculates failure rate based on call results within the last N seconds. Suitable for services with variable traffic.
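A TIME_BASED window can be sketched as a deque of timestamped results where entries older than the window are evicted before the failure rate is computed. This is an illustration of the idea only, not Resilience4j's internal data structure:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Time-based sliding window: failure rate over the last N milliseconds.
public class TimeBasedWindow {
    private record Sample(long atMillis, boolean failure) {}

    private final Deque<Sample> samples = new ArrayDeque<>();
    private final long windowMillis;

    public TimeBasedWindow(long windowMillis) { this.windowMillis = windowMillis; }

    public void record(long nowMillis, boolean failure) {
        samples.addLast(new Sample(nowMillis, failure));
        evict(nowMillis);
    }

    public double failureRatePercent(long nowMillis) {
        evict(nowMillis);
        if (samples.isEmpty()) return 0.0;
        long failures = samples.stream().filter(Sample::failure).count();
        return 100.0 * failures / samples.size();
    }

    // Drop samples that fell out of the time window.
    private void evict(long nowMillis) {
        while (!samples.isEmpty()
                && nowMillis - samples.peekFirst().atMillis() >= windowMillis) {
            samples.removeFirst();
        }
    }
}
```

Because old samples age out on their own, the failure rate naturally tracks recent behavior even when traffic volume fluctuates, which is why TIME_BASED suits variable-traffic services.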

2. Java/Spring Circuit Breaker Implementation with Resilience4j

2.1 Dependency Setup

// build.gradle (Spring Boot 3.x)
dependencies {
    implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.2.0'
    implementation 'io.github.resilience4j:resilience4j-micrometer:2.2.0'
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    implementation 'org.springframework.boot:spring-boot-starter-aop'
}

2.2 application.yml Configuration

resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 80
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - com.example.BusinessException

  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException

  bulkhead:
    instances:
      paymentService:
        maxConcurrentCalls: 20
        maxWaitDuration: 500ms

  timelimiter:
    instances:
      paymentService:
        timeoutDuration: 3s
        cancelRunningFuture: true

  ratelimiter:
    instances:
      paymentService:
        limitRefreshPeriod: 1s
        limitForPeriod: 50
        timeoutDuration: 0s

2.3 Annotation-Based Implementation

@Service
@Slf4j
public class PaymentService {

    private final PaymentGatewayClient paymentGatewayClient;
    private final PaymentCacheService paymentCacheService;

    public PaymentService(PaymentGatewayClient paymentGatewayClient,
                          PaymentCacheService paymentCacheService) {
        this.paymentGatewayClient = paymentGatewayClient;
        this.paymentCacheService = paymentCacheService;
    }

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    @Bulkhead(name = "paymentService")
    @Retry(name = "paymentService")
    @TimeLimiter(name = "paymentService")
    public CompletableFuture<PaymentResponse> processPayment(PaymentRequest request) {
        return CompletableFuture.supplyAsync(() -> {
            log.info("Processing payment: orderId={}", request.getOrderId());
            return paymentGatewayClient.charge(request);
        });
    }

    // Fallback method: called when circuit is OPEN or exception occurs
    private CompletableFuture<PaymentResponse> paymentFallback(
            PaymentRequest request, Throwable throwable) {
        log.warn("Payment service fallback triggered: orderId={}, reason={}",
                request.getOrderId(), throwable.getMessage());

        if (throwable instanceof CallNotPermittedException) {
            // Circuit is open - save to queue for async processing
            return CompletableFuture.completedFuture(
                PaymentResponse.queued(request.getOrderId(),
                    "Payment service temporarily unavailable. Order has been queued.")
            );
        }

        // Other exceptions - try returning cached result
        return CompletableFuture.completedFuture(
            paymentCacheService.getCachedResponse(request.getOrderId())
                .orElse(PaymentResponse.error(request.getOrderId(),
                    "An error occurred during payment processing. Please try again later."))
        );
    }
}

Aspect execution order: Resilience4j annotations are applied in the following nested order:

Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) )

Because Retry sits outermost, it can retry even when the CircuitBreaker rejects a call with an exception. This order can be customized via each module's *AspectOrder property.
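The inside-out wrapping can be demonstrated with plain function composition. The decorator names are labels for illustration only; each `label` stands in for one resilience aspect:

```java
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Toy demonstration: decorators wrap inside-out, so the last one applied
// runs outermost. Mirrors Retry(CircuitBreaker(RateLimiter(TimeLimiter(Bulkhead(fn))))).
public class DecorationOrder {

    static UnaryOperator<Supplier<String>> label(String name) {
        return inner -> () -> name + "(" + inner.get() + ")";
    }

    public static String compose() {
        Supplier<String> decorated = () -> "fn";
        decorated = label("Bulkhead").apply(decorated);
        decorated = label("TimeLimiter").apply(decorated);
        decorated = label("RateLimiter").apply(decorated);
        decorated = label("CircuitBreaker").apply(decorated);
        decorated = label("Retry").apply(decorated); // applied last -> outermost
        return decorated.get();
    }
}
```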


3. Istio Service Mesh Level Circuit Breaker

Istio can apply circuit breakers at the infrastructure level without application code changes. It leverages Envoy proxy's Outlier Detection feature to automatically remove unhealthy instances from the load balancing pool.

3.1 DestinationRule Configuration

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-circuit-breaker
  namespace: production
spec:
  host: payment-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100 # Maximum TCP connections
        connectTimeout: 3s # TCP connection timeout
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 50 # Max pending HTTP requests
        http2MaxRequests: 100 # Max active HTTP/2 requests
        maxRequestsPerConnection: 10 # Max requests per connection
        maxRetries: 3 # Max retries
    outlierDetection:
      consecutive5xxErrors: 5 # Remove after 5 consecutive 5xx errors
      interval: 10s # Analysis interval
      baseEjectionTime: 30s # Minimum ejection time
      maxEjectionPercent: 50 # Max ejection percentage (50%)
      minHealthPercent: 30 # Minimum healthy instance percentage

3.2 Istio vs Application-Level Circuit Breaker

| Aspect           | Istio (Infrastructure Level)  | Resilience4j (App Level)                  |
|------------------|-------------------------------|-------------------------------------------|
| Application      | No code changes, YAML config  | Annotations or programmatic API           |
| Isolation unit   | Instance (Pod) level ejection | Method/Service level blocking             |
| Fallback         | Not supported (returns 503)   | Custom fallback methods supported         |
| Language support | All languages/frameworks      | Java/Kotlin only                          |
| Fine control     | Limited                       | Very fine-grained                         |
| Monitoring       | Kiali, Grafana integration    | Micrometer, Actuator integration          |
| Recommended for  | Multi-language environments   | When business logic integration is needed |

In practice, using both together is recommended. Istio isolates unhealthy instances at the infrastructure level, while Resilience4j handles fine-grained fallback and retry at the application level.


4. Bulkhead Pattern: Fault Isolation Strategy

The Bulkhead pattern derives from ship bulkheads, which isolate compartments so that flooding in one doesn't affect others.

4.1 Semaphore Bulkhead vs ThreadPool Bulkhead

| Aspect          | Semaphore Bulkhead                    | ThreadPool Bulkhead                      |
|-----------------|---------------------------------------|------------------------------------------|
| Isolation       | Limits concurrent calls via semaphore | Executes in a separate thread pool       |
| Calling thread  | Runs in the request thread            | Runs asynchronously in a separate thread |
| Return type     | Both sync/async supported             | CompletableFuture only                   |
| Overhead        | Low                                   | Thread pool management cost              |
| Recommended for | General concurrency limiting          | When full thread isolation is needed     |

# ThreadPool Bulkhead configuration
resilience4j:
  thread-pool-bulkhead:
    instances:
      inventoryService:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 100ms
        writableStackTraceEnabled: true

4.2 Per-Service Bulkhead Isolation Example

@Service
public class OrderOrchestrator {

    private final PaymentClient paymentClient;
    private final InventoryClient inventoryClient;
    private final NotificationClient notificationClient;

    public OrderOrchestrator(PaymentClient paymentClient,
                             InventoryClient inventoryClient,
                             NotificationClient notificationClient) {
        this.paymentClient = paymentClient;
        this.inventoryClient = inventoryClient;
        this.notificationClient = notificationClient;
    }

    @Bulkhead(name = "paymentService", type = Bulkhead.Type.SEMAPHORE)
    public PaymentResult processPayment(Order order) {
        return paymentClient.charge(order.getPaymentInfo());
    }

    @Bulkhead(name = "inventoryService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<InventoryResult> reserveInventory(Order order) {
        return CompletableFuture.supplyAsync(() ->
            inventoryClient.reserve(order.getItems()));
    }

    @Bulkhead(name = "notificationService", type = Bulkhead.Type.SEMAPHORE)
    public void sendNotification(Order order) {
        notificationClient.send(order.getUserId(), "Your order has been received.");
    }
}

By separating Bulkheads per service, even if the inventory service slows down, the payment service's concurrent call capacity remains unaffected.
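The semaphore variant boils down to a bounded permit pool around each downstream call. A minimal sketch with `java.util.concurrent` (an illustration of the mechanism, not Resilience4j's Bulkhead class):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Semaphore bulkhead: cap concurrent calls to one downstream service so a
// slow dependency cannot exhaust shared capacity.
public class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis; // maxWaitDuration equivalent

    public SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    public <T> T execute(Supplier<T> call) {
        try {
            // Wait up to maxWaitDuration for a permit, then reject
            // (BulkheadFullException in Resilience4j terms).
            if (!permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("Bulkhead full");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a permit", e);
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit
        }
    }

    public int availablePermits() { return permits.availablePermits(); }
}
```

One such instance per downstream service reproduces the per-service isolation described above: exhausting the inventory pool leaves the payment pool untouched.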


5. Retry + Timeout + Rate Limiter Combination Patterns

Resilience patterns are most effective when combined rather than used individually. However, incorrect combinations can worsen failures, so caution is needed.

5.1 Precautions for Pattern Combinations

  • Retry + CircuitBreaker: Using Retry alone adds load to failing services. Always use with CircuitBreaker to block retries beyond a certain failure rate.
  • Timeout + Retry: total worst-case time is roughly timeout * maxAttempts, plus any backoff waits between attempts. With a 3-second timeout and 3 attempts, the worst case is at least 9 seconds. Design with the user response time SLA in mind.
  • Rate Limiter + CircuitBreaker: Rate Limiter prevents exceeding external API call limits, while CircuitBreaker handles API failures itself — a dual defense structure.
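The Timeout + Retry budget above is simple arithmetic, but it is worth making explicit when validating a configuration against an SLA. A small helper (illustrative, not part of any library):

```java
// Worst-case latency for Timeout + Retry: every attempt times out,
// and backoff waits accumulate between attempts.
public class LatencyBudget {

    public static long worstCaseMillis(long timeoutMillis, int maxAttempts,
                                       long[] backoffWaitsMillis) {
        long total = timeoutMillis * maxAttempts;
        for (long wait : backoffWaitsMillis) {
            total += wait;
        }
        return total;
    }
}
```

For example, a 3 s timeout with 3 attempts and backoff waits of 500 ms and 1000 ms yields a 10.5 s worst case, which already breaches a typical 10-second user-facing SLA.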

5.2 Programmatic API Composition

@Configuration
public class ResilienceConfig {

    @Bean
    public Supplier<String> resilientSupplier(
            CircuitBreakerRegistry circuitBreakerRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry,
            RateLimiterRegistry rateLimiterRegistry,
            ExternalApiClient externalApiClient) { // inject the client used below

        CircuitBreaker circuitBreaker = circuitBreakerRegistry
                .circuitBreaker("externalApi");
        Retry retry = retryRegistry.retry("externalApi");
        Bulkhead bulkhead = bulkheadRegistry.bulkhead("externalApi");
        RateLimiter rateLimiter = rateLimiterRegistry
                .rateLimiter("externalApi");

        // Decorator chaining: applied from inside to outside
        Supplier<String> decoratedSupplier = Decorators
                .ofSupplier(() -> externalApiClient.call())
                .withBulkhead(bulkhead)           // 1. Concurrent call limit
                .withRateLimiter(rateLimiter)      // 2. Rate limit
                .withCircuitBreaker(circuitBreaker) // 3. Failure detection/blocking
                .withRetry(retry)                   // 4. Retry
                .withFallback(Arrays.asList(
                    CallNotPermittedException.class,
                    BulkheadFullException.class,
                    RequestNotPermitted.class),
                    throwable -> "Fallback Response")
                .decorate();

        return decoratedSupplier;
    }
}

6. Fallback Strategy Design

Fallback provides alternative responses when the original service fails. The key is implementing graceful degradation that maintains user experience as much as possible, rather than simply returning error messages.

6.1 Fallback Strategy Types

| Strategy            | Description                                      | Use Case Examples                                |
|---------------------|--------------------------------------------------|--------------------------------------------------|
| Cache fallback      | Return the last successful cached response       | Product recommendations, exchange rates, weather |
| Default value       | Return predefined default values                 | Configuration service, feature flags             |
| Queue fallback      | Save the request to a queue for later processing | Payment processing, order intake                 |
| Alternative service | Route to a backup service                        | CDN redundancy, multi-region                     |
| Empty response      | Return an empty result (instead of an error)     | Search autocomplete, recommendation widgets      |
| Manual switch       | Operator manually activates the alternative      | Critical business logic                          |
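The cache fallback strategy reduces to "remember the last good response per key." A minimal generic sketch (class and method names are illustrative, not from any library):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache fallback: serve stale-but-usable data when the dependency fails.
public class LastGoodCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    public V callWithFallback(K key, Function<K, V> call, V defaultValue) {
        try {
            V fresh = call.apply(key);
            lastGood.put(key, fresh); // record the last successful response
            return fresh;
        } catch (RuntimeException e) {
            // Fall back to the cached value, then to the default value.
            return Optional.ofNullable(lastGood.get(key)).orElse(defaultValue);
        }
    }
}
```

In production you would add a TTL and bound the cache size, but the two-level degradation (cache, then default) is the same shape as the multi-level fallback shown next.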

6.2 Multi-Level Fallback Implementation

@Service
@Slf4j
@RequiredArgsConstructor // Lombok generates the constructor for the final fields
public class ProductRecommendationService {

    private final RecommendationEngine primaryEngine;
    private final RecommendationEngine secondaryEngine;
    private final RedisTemplate<String, List<Product>> cache;

    @CircuitBreaker(name = "recommendation",
                    fallbackMethod = "secondaryRecommendation")
    public List<Product> getRecommendations(String userId) {
        return primaryEngine.recommend(userId);
    }

    // 1st fallback: Use secondary recommendation engine
    private List<Product> secondaryRecommendation(
            String userId, Throwable t) {
        log.warn("Primary recommendation engine failure, switching to secondary: {}", t.getMessage());
        try {
            return secondaryEngine.recommend(userId);
        } catch (Exception e) {
            return cachedRecommendation(userId, e);
        }
    }

    // 2nd fallback: Return cached recommendation results
    private List<Product> cachedRecommendation(
            String userId, Throwable t) {
        log.warn("Secondary recommendation engine also failed, checking cache: {}", t.getMessage());
        List<Product> cached = cache.opsForValue()
                .get("recommendation:" + userId);
        if (cached != null && !cached.isEmpty()) {
            return cached;
        }
        return defaultRecommendation(userId, t);
    }

    // 3rd fallback: Return default popular products list
    private List<Product> defaultRecommendation(
            String userId, Throwable t) {
        log.warn("No cache available, returning default popular products");
        return List.of(
            Product.popular("BEST-001", "Bestseller Product A"),
            Product.popular("BEST-002", "Bestseller Product B"),
            Product.popular("BEST-003", "Bestseller Product C")
        );
    }
}

7. Migration from Netflix Hystrix to Resilience4j

Netflix Hystrix entered maintenance mode in 2018, and official support was dropped starting from Spring Cloud 2020.0.0. Projects using Hystrix need to migrate to Resilience4j.

7.1 Resilience4j vs Hystrix vs Istio Comparison

| Item                    | Hystrix                      | Resilience4j                     | Istio                                      |
|-------------------------|------------------------------|----------------------------------|--------------------------------------------|
| Maintenance status      | Maintenance mode (2018~)     | Actively maintained              | Actively maintained                        |
| Design philosophy       | OOP (extend HystrixCommand)  | Functional (decorators)          | Infrastructure-based (sidecar proxy)       |
| Module structure        | All-in-one                   | Select only needed modules       | Full service mesh                          |
| Spring Boot integration | Spring Cloud Netflix         | Native Spring Boot starter       | Kubernetes environment required            |
| Isolation               | Thread Pool / Semaphore      | Semaphore / Thread Pool          | Connection pool / Outlier Detection        |
| Configuration           | Java Config / Properties     | YAML / Java Config / Annotations | Kubernetes CRD (YAML)                      |
| Reactive support        | Limited (RxJava 1)           | Full (Reactor, RxJava 2/3)       | N/A                                        |
| Metrics                 | Hystrix Dashboard            | Micrometer / Prometheus          | Prometheus / Kiali                         |
| Fallback                | HystrixCommand.getFallback() | fallbackMethod annotation        | Not supported (returns 503)                |
| Learning curve          | Medium                       | Low                              | High (requires service mesh understanding) |

7.2 Migration Core Checklist

Step 1: Replace Dependencies

// Remove
// implementation 'org.springframework.cloud:spring-cloud-starter-netflix-hystrix'

// Add
implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.2.0'
implementation 'io.github.resilience4j:resilience4j-micrometer:2.2.0'

Step 2: Code Conversion Patterns

| Hystrix                                      | Resilience4j                                               |
|----------------------------------------------|------------------------------------------------------------|
| @HystrixCommand(fallbackMethod = "fallback") | @CircuitBreaker(name = "svc", fallbackMethod = "fallback") |
| class MyCommand extends HystrixCommand       | Decorators.ofSupplier(() -> ...).withCircuitBreaker(cb)    |
| @HystrixProperty(name = "...")               | application.yml configuration                              |
| Hystrix Dashboard                            | Micrometer + Grafana                                       |

Step 3: Configuration Migration

Hystrix's circuitBreaker.requestVolumeThreshold maps to Resilience4j's minimumNumberOfCalls, circuitBreaker.errorThresholdPercentage maps to failureRateThreshold, and circuitBreaker.sleepWindowInMilliseconds converts to waitDurationInOpenState.
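As a sketch, the mapping above might look like this in application.yml. The instance name and values are illustrative; translate your actual Hystrix settings one property at a time:

```yaml
resilience4j:
  circuitbreaker:
    instances:
      legacyService:
        minimumNumberOfCalls: 20      # was circuitBreaker.requestVolumeThreshold: 20
        failureRateThreshold: 50      # was circuitBreaker.errorThresholdPercentage: 50
        waitDurationInOpenState: 5s   # was circuitBreaker.sleepWindowInMilliseconds: 5000
```

Note that the semantics are close but not identical: Hystrix evaluates over a rolling statistical window, while Resilience4j uses the configured sliding window, so verify behavior under load rather than assuming a one-to-one translation.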

Step 4: Gradual Transition

Don't replace everything at once. Migrate service by service. Resilience4j and Hystrix can coexist in the same project, so apply Resilience4j to new services first and convert existing services sequentially.


8. Failure Case Analysis and Recovery Procedures

8.1 Case 1: Retry Storm

Situation: When the payment gateway fails, all clients retry simultaneously, delaying gateway recovery.

Cause: Only Retry applied without CircuitBreaker. No jitter in retry intervals, causing synchronized retries.

Solution:

  • Apply CircuitBreaker with Retry to block retries beyond a certain failure rate
  • Add jitter to exponential backoff

resilience4j:
  retry:
    instances:
      paymentGateway:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true # Enable jitter
        randomizedWaitFactor: 0.5 # Randomize within 50% range
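The effect of jitter is easy to see in code. The sketch below mirrors the waitDuration / exponentialBackoffMultiplier / randomizedWaitFactor settings above; the exact randomization formula is an assumption for illustration, not Resilience4j's implementation:

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff with jitter: randomizing each wait desynchronizes
// clients so they don't all retry at the same instant (the retry storm).
public class BackoffCalculator {

    public static long backoffMillis(long baseMillis, double multiplier,
                                     int attempt, double jitterFactor) {
        // attempt is 1-based: attempt 1 waits baseMillis, attempt 2 waits base*multiplier, ...
        double interval = baseMillis * Math.pow(multiplier, attempt - 1);
        double delta = interval * jitterFactor;
        // Uniformly random wait in [interval - delta, interval + delta].
        double low = interval - delta;
        double high = interval + delta;
        return (long) (low + ThreadLocalRandom.current().nextDouble() * (high - low));
    }
}
```

With base 1 s, multiplier 2, and jitter factor 0.5, the third attempt waits somewhere between 2 s and 6 s instead of exactly 4 s, spreading retries across time.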

8.2 Case 2: Thread Pool Exhaustion Due to Missing Bulkhead

Situation: When inventory check API becomes slow, it occupies the entire Tomcat thread pool. Unrelated APIs like payment and order queries all timeout.

Cause: All external service calls running in the same thread pool.

Solution:

  • Apply per-service ThreadPool Bulkhead for thread isolation
  • Prevent slow services from monopolizing the entire thread pool

8.3 Case 3: Circuit Breaker Threshold Misconfiguration

Situation: Set minimumNumberOfCalls: 1, failureRateThreshold: 50. A single failure opens the circuit, blocking even healthy services.

Cause: State transitions based on statistically insignificant small number of calls.

Solution:

  • Set minimumNumberOfCalls to at least 5-10
  • Set slidingWindowSize sufficiently large (minimum 10 or more)
  • Adjust thresholds after analyzing actual traffic patterns in production

8.4 Standardized Recovery Procedure

#!/bin/bash
# circuit-breaker-recovery.sh
# Circuit breaker failure recovery procedure script

echo "===== Check Circuit Breaker Status ====="
# Check circuit breaker state via Actuator endpoint
curl -s http://localhost:8080/actuator/circuitbreakers | jq '.circuitBreakers'

echo ""
echo "===== Downstream Service Health Check ====="
curl -s http://payment-service:8080/actuator/health | jq '.status'
curl -s http://inventory-service:8080/actuator/health | jq '.status'

echo ""
echo "===== Force Close Circuit Breaker (after confirming downstream recovery) ====="
# WARNING: Execute only after downstream service is fully recovered
# curl -X POST http://localhost:8080/actuator/circuitbreakers/paymentService/close

echo ""
echo "===== Check Current Metrics ====="
curl -s http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.state | jq '.'
curl -s http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.failure.rate | jq '.'

echo ""
echo "===== Check Istio Outlier Detection Status ====="
kubectl get destinationrules -n production
kubectl describe destinationrule payment-service-circuit-breaker -n production

9. Operational Monitoring and Metrics

9.1 Key Monitoring Metrics

The following metrics must be monitored for circuit breaker operations:

| Metric                                           | Description                                                       | Alert Threshold          |
|--------------------------------------------------|-------------------------------------------------------------------|--------------------------|
| resilience4j.circuitbreaker.state                | Circuit state gauge (one series per state tag; 1 = current state) | state="open" series == 1 |
| resilience4j.circuitbreaker.failure.rate         | Current failure rate (%)                                          | Above 40%                |
| resilience4j.circuitbreaker.calls                | Successful/failed/ignored/blocked call counts                     | Spike in blocked calls   |
| resilience4j.circuitbreaker.slow.call.rate       | Slow call rate (%)                                                | Above 60%                |
| resilience4j.bulkhead.available.concurrent.calls | Available concurrent calls                                        | Near 0                   |
| resilience4j.retry.calls                         | Retry count                                                       | On spike                 |
| resilience4j.ratelimiter.available.permissions   | Available rate limiter permits                                    | Near 0                   |

9.2 Prometheus + Grafana Dashboard Setup

Resilience4j automatically exposes Prometheus-format metrics through Micrometer.

# Prometheus scrape configuration
scrape_configs:
  - job_name: 'spring-boot-resilience4j'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['payment-service:8080']
        labels:
          application: 'payment-service'

Alert rule example: configure Slack alerts when the circuit transitions to the OPEN state. The rule below uses Prometheus alerting rule syntax, which can also be provisioned through Grafana alerting.

# Prometheus-style alert rule (provisionable via Grafana)
groups:
  - name: circuit-breaker-alerts
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: 'Circuit Breaker OPEN - {{ $labels.name }}'
          description: >
            {{ $labels.application }}'s {{ $labels.name }}
            circuit breaker is in OPEN state.
            Check downstream service status immediately.

      - alert: HighFailureRate
        expr: resilience4j_circuitbreaker_failure_rate > 40
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: 'High failure rate detected - {{ $labels.name }}'
          description: >
            {{ $labels.name }} failure rate is {{ $value }}%.
            Identify the cause before the circuit opens.

9.3 Istio Monitoring (Kiali + Grafana)

# Check service status within Istio mesh
istioctl proxy-config cluster <pod-name> -n production | grep outlier

# Check Envoy statistics
kubectl exec -it <pod-name> -n production -c istio-proxy -- \
  curl localhost:15000/stats | grep outlier_detection

# Access Kiali dashboard
istioctl dashboard kiali

10. Troubleshooting

Circuit breaker not opening

  • Check the minimumNumberOfCalls value. If fewer calls than this value have occurred, the circuit won't open even at 100% failure rate.
  • Verify that recordExceptions includes the actual exception types being thrown. Unregistered exceptions are not counted as failures.
  • Check if failure exceptions are unintentionally included in ignoreExceptions.

Circuit quickly transitions back to OPEN from HALF-OPEN

  • If permittedNumberOfCallsInHalfOpenState is too small, statistically meaningful judgments are difficult. Set it to at least 3-5.
  • This can occur when the downstream service is only partially recovered. Confirm complete downstream recovery.

Frequent BulkheadFullException

  • Increase maxConcurrentCalls or improve downstream service response times.
  • When using a ThreadPool Bulkhead with a queueCapacity of 0, requests are immediately rejected once the thread pool is full.

Istio Outlier Detection not working

  • Verify that the Istio sidecar proxy is injected into the Pod: kubectl get pod <name> -o jsonpath='{.spec.containers[*].name}'
  • Verify that the DestinationRule host field is the correct service FQDN.
  • If maxEjectionPercent is too low, some unhealthy instances may not be ejected.

11. Practical Checklist

Design Phase

  • Confirmed SLA (response time, availability) for each downstream service
  • Classified failure impact per service (Critical / High / Medium / Low)
  • Defined fallback strategies for failures (cache, default values, queue, alternative service)
  • Verified that Retry targets guarantee idempotency
  • Decided on Retry + CircuitBreaker combination usage (standalone Retry prohibited)
  • Verified total timeout = timeout * maxAttempts is within user SLA

Implementation Phase

  • Set CircuitBreaker slidingWindowSize and minimumNumberOfCalls sufficiently large (minimum 5-10)
  • Registered network/timeout related exceptions in recordExceptions
  • Registered business exceptions (400 Bad Request, etc.) in ignoreExceptions
  • Applied per-service Bulkhead isolation
  • Applied Rate Limiter to external API calls
  • Verified fallback method parameters match original method (+ Throwable added)

Operations Phase

  • Exposed Actuator endpoints (/actuator/circuitbreakers, /actuator/health)
  • Configured Prometheus metric collection
  • Set up alerts (Slack, PagerDuty, etc.) for circuit OPEN state transitions
  • Set up failure rate warning threshold alerts
  • Documented circuit breaker failure recovery procedures (Runbook)
  • Performing periodic Chaos Engineering tests (service failure injection)
  • Configured DestinationRule and Outlier Detection if in Istio environment

Testing Phase

  • Tested circuit state transition scenarios (CLOSED -> OPEN -> HALF-OPEN -> CLOSED)
  • Tested fallback methods work correctly
  • Simulated Bulkhead full scenarios
  • Tested complete downstream service failure scenarios
  • Tested slow response (Slow Call) scenarios

Conclusion

Resilience patterns are not optional but essential in microservices architecture. By properly combining Circuit Breaker, Bulkhead, Retry, Rate Limiter, and Timeout, you can effectively prevent a single service failure from propagating to the entire system.

Key principles summarized:

  1. Standalone Retry is prohibited: Always use with CircuitBreaker to prevent retry storms.
  2. Per-service isolation: Isolate resource usage for each downstream service with Bulkhead.
  3. Multi-level fallback: Design a multi-level structure of alternative service -> cache -> default values, not just a single fallback.
  4. Dual defense at infrastructure + app level: Use Istio Outlier Detection and Resilience4j together.
  5. Monitoring is essential: Monitor circuit state, failure rate, and slow call rate in real-time with alerts configured.

Migrating from Hystrix to Resilience4j should be done gradually, with Resilience4j introduced to new services first. Most importantly, verify through regular Chaos Engineering tests that your configured resilience patterns work as expected in actual failure scenarios.

