Microservices Architecture 2025 Complete Guide: From Monolith to MSA, and Back to Modular Monolith
- Author: Youngju Kim (@fjvbn20031)
- Introduction
- 1. The History of Architectural Evolution
- 2. When MSA Is Overkill: Real-World Lessons
- 3. Architecture Comparison: Monolith vs MSA vs Modular Monolith
- 4. DDD and Service Decomposition
- 5. Communication Patterns: REST vs gRPC vs Event-Driven
- 6. Service Mesh: Istio vs Linkerd
- 7. Distributed Transactions and the Saga Pattern
- 8. Observability: OpenTelemetry
- 9. API Gateway Pattern
- 10. Deployment Strategies
- 11. Migration: The Strangler Fig Pattern
- 12. Anti-Patterns
- 13. Decision Framework
- 14. Interview Prep Q&A (Top 15)
- Q1. Explain five core characteristics of microservices.
- Q2. Explain the CAP theorem and its relationship with microservices.
- Q3. What are the two implementation approaches for the Saga pattern?
- Q4. Explain the Sidecar pattern in Service Mesh.
- Q5. Explain the principles of Distributed Tracing.
- Q6. What are the benefits and caveats of the Strangler Fig Pattern?
- Q7. What is the difference between API Gateway and Service Mesh?
- Q8. Explain the relationship between Event Sourcing and CQRS.
- Q9. How do you ensure data consistency in microservices?
- Q10. When is gRPC more advantageous than REST?
- Q11. Explain the Circuit Breaker pattern.
- Q12. Compare service discovery approaches.
- Q13. When should you transition from Modular Monolith to MSA?
- Q14. What is the Outbox Pattern?
- Q15. Explain the microservices Testing Pyramid.
- 15. Quiz
- References
Introduction
In 2023, Amazon's Prime Video team revealed that they had reverted an internal service from microservices to a monolith, cutting infrastructure costs by 90% -- news that sent shockwaves through the industry. DHH (creator of Ruby on Rails) has declared that "microservices are over-engineering for most teams," and Shopify successfully handles massive traffic with a modular monolith architecture.
That said, microservices are far from dead. Netflix, Uber, and Spotify still operate thousands of microservices, and there are clear cases where MSA is the right choice depending on organizational scale and domain complexity.
This article provides an objective comparison of monolith, MSA, and modular monolith -- from the history of architectural evolution to current 2025 trends -- along with a systematic coverage of patterns and technologies needed in practice.
1. The History of Architectural Evolution
1.1 The Monolith Era (2000s)
The traditional architecture where all functionality is contained in a single deployment unit.
┌─────────────────────────────────────┐
│        Monolith Application         │
│  ┌────────┐  ┌────────┐  ┌───────┐  │
│  │  User  │  │ Order  │  │Payment│  │
│  │ Module │  │ Module │  │Module │  │
│  └───┬────┘  └───┬────┘  └───┬───┘  │
│      └───────────┼───────────┘      │
│             ┌────▼────┐             │
│             │ Shared  │             │
│             │   DB    │             │
│             └─────────┘             │
└─────────────────────────────────────┘
Pros: Simple development/deployment, easy transaction management, convenient debugging
Cons: Codebase bloat, deployment bottlenecks, locked-in tech stack, scalability limits
1.2 The SOA Era (2005-2015)
Service-Oriented Architecture connected services around an Enterprise Service Bus (ESB).
┌─────────┐    ┌─────────┐    ┌─────────┐
│Service A│    │Service B│    │Service C│
└────┬────┘    └────┬────┘    └────┬────┘
     └──────────────┼──────────────┘
              ┌─────▼─────┐
              │    ESB    │
              │ (Message  │
              │   Bus)    │
              └───────────┘
SOA pursued service reusability and standardization, but the ESB became a single point of failure, and the complexity of SOAP/WSDL was problematic.
1.3 The Microservices Era (2014-Present)
Microservices, defined by Martin Fowler and James Lewis, are an evolution of SOA where each service is independently deployable and scalable.
┌─────────┐   ┌─────────┐   ┌─────────┐
│  User   │   │  Order  │   │ Payment │
│ Service │   │ Service │   │ Service │
│  :8080  │   │  :8081  │   │  :8082  │
└────┬────┘   └────┬────┘   └────┬────┘
     │ REST/gRPC   │ Event       │
     ▼             ▼             ▼
┌─────────┐   ┌─────────┐   ┌─────────┐
│ UserDB  │   │ OrderDB │   │  PayDB  │
└─────────┘   └─────────┘   └─────────┘
1.4 2025: The Rise of the Modular Monolith
Timeline:
2000 ──── 2005 ──── 2014 ──── 2020 ──── 2025
 │         │         │         │         │
 ▼         ▼         ▼         ▼         ▼
Monolith  SOA  Microservices Service   Modular
                              Mesh     Monolith
                              Spread  Resurgence
2. When MSA Is Overkill: Real-World Lessons
2.1 The Amazon Prime Video Case
In 2023, the Amazon Prime Video team migrated their audio/video monitoring service from microservices to a monolith, reducing infrastructure costs by 90%.
Problems:
- Explosive growth of AWS Step Functions state transition costs
- S3 intermediate storage costs for inter-service data transfer
- Microservice orchestration overhead
Solution:
- Execute all processing steps within a single process
- In-memory communication instead of network calls
- Eliminate S3 intermediate storage
2.2 DHH's Critique
DHH, the creator of Ruby on Rails, argues that "most teams are perfectly fine with a monolith."
| Argument | Rationale |
|---|---|
| Distributed system complexity underestimated | Network failures, data consistency, debugging difficulty |
| Mismatch with team size | A team of 5-10 running 20 services is inefficient |
| Operational cost explosion | Infrastructure, monitoring, deployment pipeline costs |
| Premature decomposition risk | Poor boundary definition when domain understanding is lacking |
2.3 Shopify's Modular Monolith Success
Shopify processes tens of billions of dollars in transactions annually while using a modular monolith architecture.
# Module boundary definition using Shopify's Packwerk
# packages/checkout/package.yml
enforce_dependencies: true
enforce_privacy: true
dependencies:
  - packages/inventory
  - packages/payment
3. Architecture Comparison: Monolith vs MSA vs Modular Monolith
3.1 Comparison Matrix
| Aspect | Monolith | Modular Monolith | Microservices |
|---|---|---|---|
| Deployment Unit | 1 | 1 (module-level build) | N (per service) |
| Team Size | 1-20 | 5-50 | 20 to hundreds |
| Communication | Method calls | Module interfaces | Network (REST/gRPC) |
| Data Store | Shared DB | Schema per module | DB per service |
| Transactions | ACID | ACID (limited across modules) | Saga/Compensation |
| Scalability | Vertical | Vertical + partial horizontal | Horizontal (per service) |
| Operational Complexity | Low | Medium | High |
| Tech Diversity | Single stack | Single stack | Polyglot |
| Fault Isolation | None | Partial | Full isolation |
| Initial Cost | Low | Medium | High |
3.2 Modular Monolith Architecture in Detail
┌─────────────────────────────────────────────┐
│              Modular Monolith               │
│                                             │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐    │
│  │  User   │   │  Order  │   │ Payment │    │
│  │ Module  │   │ Module  │   │ Module  │    │
│  │         │   │         │   │         │    │
│  │ Public  │   │ Public  │   │ Public  │    │
│  │  API    │◄─►│  API    │◄─►│  API    │    │
│  │         │   │         │   │         │    │
│  │ Private │   │ Private │   │ Private │    │
│  │  Impl   │   │  Impl   │   │  Impl   │    │
│  └────┬────┘   └────┬────┘   └────┬────┘    │
│       │             │             │         │
│  ┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐   │
│  │   user   │  │  order   │  │   pay    │   │
│  │  schema  │  │  schema  │  │  schema  │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       └─────────────┼─────────────┘         │
│              ┌──────▼──────┐                │
│              │  Shared DB  │                │
│              │ (Separated  │                │
│              │  Schemas)   │                │
│              └─────────────┘                │
└─────────────────────────────────────────────┘
// Java modular monolith example (Spring Modulith)
// payment/package-info.java
@ApplicationModule(
    allowedDependencies = {"order", "shared"}
)
package com.example.payment;

// Inter-module communication via events
@Service
public class PaymentService {
    private final PaymentRepository paymentRepository;
    private final ApplicationEventPublisher events;

    public PaymentService(PaymentRepository paymentRepository,
                          ApplicationEventPublisher events) {
        this.paymentRepository = paymentRepository;
        this.events = events;
    }

    @Transactional
    public PaymentResult processPayment(PaymentRequest request) {
        Payment payment = Payment.create(request);
        paymentRepository.save(payment);

        // Publish event to other modules (no direct dependency)
        events.publishEvent(new PaymentCompletedEvent(
            payment.getId(),
            payment.getOrderId(),
            payment.getAmount()
        ));
        return PaymentResult.success(payment);
    }
}
3.3 Decision Flowchart
Start: New Project
│
▼
Team size > 50? ──Yes──▶ Are domains clearly separated?
│ │
No Yes ──▶ Consider MSA
│ │
▼ No ──▶ Modular Monolith
Need independently scalable services?
│
Yes ──▶ Extract only those services (Hybrid)
│
No ──▶ Monolith or Modular Monolith
4. DDD and Service Decomposition
4.1 Identifying Bounded Contexts
In Domain-Driven Design, the Bounded Context serves as the natural boundary for microservices.
┌─────────────── E-Commerce Domain ───────────────┐
│                                                 │
│   ┌──────────────┐        ┌──────────────┐      │
│   │   Catalog    │        │    Order     │      │
│   │   Context    │        │   Context    │      │
│   │              │        │              │      │
│   │  - Product   │        │  - Order     │      │
│   │  - Category  │        │  - OrderItem │      │
│   │  - Price     │        │  - Shipment  │      │
│   └──────────────┘        └──────────────┘      │
│                                                 │
│   ┌──────────────┐        ┌──────────────┐      │
│   │   Identity   │        │   Payment    │      │
│   │   Context    │        │   Context    │      │
│   │              │        │              │      │
│   │  - User      │        │  - Payment   │      │
│   │  - Role      │        │  - Refund    │      │
│   │  - Permission│        │  - Invoice   │      │
│   └──────────────┘        └──────────────┘      │
└─────────────────────────────────────────────────┘
4.2 Context Mapping Patterns
┌────────────┐                      ┌────────────┐
│  Upstream  │                      │ Downstream │
│  Context   │ ──Conformist───────▶ │  Context   │
│            │                      │            │
│  (Order)   │ ──ACL──────────────▶ │ (Payment)  │
│            │                      │            │
│            │ ──OHS/PL───────────▶ │ (Shipping) │
└────────────┘                      └────────────┘
ACL: Anti-Corruption Layer (translation layer)
OHS: Open Host Service (public API)
PL: Published Language (shared schema)
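The ACL above can be sketched in plain Java: the downstream Payment context translates the upstream Order context's model into its own terms at the boundary, so upstream changes stay contained in one class. All type and field names here (`LegacyOrderDto`, `PaymentCharge`) are illustrative assumptions, not from any specific codebase.

```java
import java.math.BigDecimal;

// Anti-Corruption Layer sketch: Payment never consumes the Order
// context's model directly; it translates at the boundary.
public class PaymentAcl {

    // Upstream (Order context) shape, as received by the Payment side.
    record LegacyOrderDto(String orderNo, long amountCents, String currency) {}

    // Downstream (Payment context) model, in Payment's own language.
    record PaymentCharge(String orderId, BigDecimal amount, String currency) {}

    // The translation layer: converts units and field names in one place.
    static PaymentCharge translate(LegacyOrderDto dto) {
        BigDecimal amount = BigDecimal.valueOf(dto.amountCents(), 2); // cents -> decimal
        return new PaymentCharge(dto.orderNo(), amount, dto.currency());
    }

    public static void main(String[] args) {
        PaymentCharge charge = translate(new LegacyOrderDto("ORD-1", 1999, "USD"));
        System.out.println(charge); // amount is 19.99
    }
}
```

If the upstream team later renames `orderNo` or switches currencies, only `translate` changes; the Payment domain model stays untouched.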
4.3 Service Decomposition Strategies
// 1. Decomposition by Aggregate
// Order Aggregate Root
public class Order {
    private OrderId id;
    private CustomerId customerId;
    private List<OrderLine> lines;
    private OrderStatus status;
    private Money totalAmount;

    // Business logic within the aggregate
    public void addItem(ProductId productId, int quantity, Money price) {
        OrderLine line = new OrderLine(productId, quantity, price);
        this.lines.add(line);
        this.totalAmount = calculateTotal();
    }

    public void confirm() {
        if (this.lines.isEmpty()) {
            throw new OrderException("Cannot confirm an empty order");
        }
        this.status = OrderStatus.CONFIRMED;
    }
}
4.4 Decomposition Mistakes to Avoid
- Data-driven decomposition: Splitting services by table leads to excessive inter-service communication
- Technology-driven decomposition: Splitting by frontend/backend/DB layers loses MSA benefits
- Premature decomposition: Splitting before understanding the domain leads to incorrect boundaries
- Over-granularity: Nanoservices only increase operational burden
5. Communication Patterns: REST vs gRPC vs Event-Driven
5.1 Synchronous Communication Comparison
| Aspect | REST (HTTP/JSON) | gRPC (HTTP/2 + Protobuf) |
|---|---|---|
| Serialization | JSON (text) | Protocol Buffers (binary) |
| Performance | Relatively slow | 2-10x faster |
| Streaming | Limited (SSE/WebSocket) | Native bidirectional streaming |
| Code Generation | OpenAPI (optional) | Required (proto files) |
| Browser Support | Native | Requires gRPC-Web |
| Readability | High (JSON) | Low (binary) |
| Use Case | External APIs, simple CRUD | Internal inter-service communication |
5.2 gRPC Service Definition
// order_service.proto
syntax = "proto3";

package order.v1;

service OrderService {
  // Unary RPC
  rpc CreateOrder (CreateOrderRequest) returns (CreateOrderResponse);

  // Server streaming - order status updates
  rpc WatchOrderStatus (WatchRequest) returns (stream OrderStatusUpdate);

  // Client streaming - bulk order registration
  rpc BulkCreateOrders (stream CreateOrderRequest) returns (BulkCreateResponse);

  // Bidirectional streaming - real-time order processing
  rpc ProcessOrders (stream OrderAction) returns (stream OrderResult);
}

message CreateOrderRequest {
  string customer_id = 1;
  repeated OrderItem items = 2;
  ShippingAddress shipping = 3;
}

message OrderItem {
  string product_id = 1;
  int32 quantity = 2;
  int64 price_cents = 3;
}

message CreateOrderResponse {
  string order_id = 1;
  OrderStatus status = 2;
  int64 total_cents = 3;
}

enum OrderStatus {
  ORDER_STATUS_UNSPECIFIED = 0;
  ORDER_STATUS_PENDING = 1;
  ORDER_STATUS_CONFIRMED = 2;
  ORDER_STATUS_SHIPPED = 3;
  ORDER_STATUS_DELIVERED = 4;
}
5.3 Event-Driven Asynchronous Communication
 Producer                        Consumers
┌─────────┐     ┌─────────┐     ┌─────────┐
│  Order  │────▶│  Kafka  │────▶│ Payment │
│ Service │     │ Topic:  │     │ Service │
│         │     │ orders  │     │         │
└─────────┘     └────┬────┘     └─────────┘
                     │
                     ├─────────▶ ┌─────────┐
                     │           │Inventory│
                     │           │ Service │
                     │           └─────────┘
                     │
                     └─────────▶ ┌──────────┐
                                 │Notifica- │
                                 │tion      │
                                 │Service   │
                                 └──────────┘
// Kafka Producer - publish order events
@Service
public class OrderEventPublisher {
    private static final Logger log =
        LoggerFactory.getLogger(OrderEventPublisher.class);

    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;

    public OrderEventPublisher(KafkaTemplate<String, OrderEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publishOrderCreated(Order order) {
        OrderCreatedEvent event = OrderCreatedEvent.builder()
            .orderId(order.getId())
            .customerId(order.getCustomerId())
            .items(order.getItems())
            .totalAmount(order.getTotalAmount())
            .timestamp(Instant.now())
            .build();

        kafkaTemplate.send("orders.created", order.getId(), event)
            .whenComplete((result, ex) -> {
                if (ex != null) {
                    log.error("Event publish failed: orderId={}",
                        order.getId(), ex);
                    // Compensation logic or retry
                }
            });
    }
}

// Kafka Consumer - payment processing
@Service
public class PaymentEventConsumer {
    private static final Logger log =
        LoggerFactory.getLogger(PaymentEventConsumer.class);

    private final PaymentService paymentService;

    public PaymentEventConsumer(PaymentService paymentService) {
        this.paymentService = paymentService;
    }

    @KafkaListener(
        topics = "orders.created",
        groupId = "payment-service",
        containerFactory = "kafkaListenerContainerFactory"
    )
    public void handleOrderCreated(OrderCreatedEvent event) {
        log.info("Order event received: orderId={}", event.getOrderId());
        paymentService.processPayment(event);
    }
}
5.4 Communication Pattern Selection Guide
Need synchronous call?
│
Yes ──▶ For external clients?
│ │
│ Yes ──▶ REST (OpenAPI)
│ │
│ No ──▶ Need high performance? ──Yes──▶ gRPC
│ │
│ No ──▶ REST
│
No ──▶ Need ordering guarantee?
│
Yes ──▶ Kafka (ordering via partition key)
│
No ──▶ Need fan-out?
│
Yes ──▶ SNS + SQS / Kafka topics
│
No ──▶ SQS / RabbitMQ
6. Service Mesh: Istio vs Linkerd
6.1 What is a Service Mesh?
A service mesh is a dedicated infrastructure layer for managing inter-service communication at the infrastructure level.
Without Service Mesh:         With Service Mesh:
┌─────────┐  ┌─────────┐      ┌─────────┐   ┌─────────┐
│Service A│──│Service B│      │Service A│   │Service B│
│         │  │         │      │ ┌─────┐ │   │ ┌─────┐ │
│(retry,  │  │(retry,  │      │ │Proxy│─┼───┼─│Proxy│ │
│ auth,   │  │ auth,   │      │ │(side│ │   │ │(side│ │
│ metrics)│  │ metrics)│      │ │car) │ │   │ │car) │ │
└─────────┘  └─────────┘      │ └─────┘ │   │ └─────┘ │
                              └─────────┘   └─────────┘
Communication logic           Cross-cutting concerns
in every service              delegated to infrastructure
6.2 Key Features
| Feature | Description |
|---|---|
| Traffic Management | Load balancing, routing, canary deployments |
| Security | Automatic mTLS, authentication/authorization |
| Observability | Distributed tracing, metrics collection, logging |
| Resilience | Retries, circuit breakers, timeouts |
6.3 Istio vs Linkerd Comparison
| Aspect | Istio | Linkerd |
|---|---|---|
| Proxy | Envoy (C++) | linkerd2-proxy (Rust) |
| Resource Usage | High (100-200MB/pod) | Low (20-30MB/pod) |
| Feature Scope | Comprehensive (VM support, etc.) | Core features focused |
| Learning Curve | Steep | Gentle |
| mTLS | Manual configuration needed | Enabled by default |
| CNCF Status | Graduated project | Graduated project |
| Recommended Scale | Large (100+ services) | Small-medium (10-100 services) |
6.4 Istio Traffic Management Example
# VirtualService - Canary Deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: order-service
            subset: v2
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10
---
# DestinationRule - Circuit Breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
6.5 When You Don't Need a Service Mesh
- Fewer than 10 services
- Kubernetes built-in features are sufficient
- Your team has no service mesh operational experience
- Systems extremely sensitive to latency (sidecar overhead)
7. Distributed Transactions and the Saga Pattern
7.1 The Problem with Distributed Transactions
In microservices, each service owns its own database, making ACID transactions across multiple services impossible.
Order creation scenario:
1. Order Service: Create order
2. Inventory Service: Reserve stock
3. Payment Service: Process payment
4. Notification Service: Send notification
What if payment fails at step 3?
-> Steps 1 and 2 need to be rolled back!
7.2 Saga Pattern - Choreography
┌─────────┐ OrderCreated  ┌─────────┐ StockReserved ┌─────────┐
│  Order  │──────────────▶│Inventory│──────────────▶│ Payment │
│ Service │               │ Service │               │ Service │
└─────────┘               └─────────┘               └─────────┘
     ▲                         ▲                         │
     │                         │     PaymentFailed       │
     │                         │     (StockRelease,      │
     │                         │      compensate)        │
     │                         └─────────────────────────┘
     │  OrderCancelled (compensate)
     └─────────────────────────┘
Each service subscribes to events and autonomously triggers the next step.
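As a minimal sketch of this choreography, the snippet below uses an in-memory event bus standing in for Kafka; each handler reacts to an event and publishes the next one, with failure events triggering the compensation chain. Event and service names follow the diagram; the single-threaded bus is a deliberate simplification.

```java
import java.util.*;
import java.util.function.Consumer;

// Choreography sketch: no central coordinator; services subscribe to
// events and autonomously perform (or compensate) the next step.
public class ChoreographySaga {
    static final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    static final List<String> published = new ArrayList<>();

    static void subscribe(String event, Consumer<String> handler) {
        subscribers.computeIfAbsent(event, k -> new ArrayList<>()).add(handler);
    }

    static void publish(String event, String orderId) {
        published.add(event);
        for (Consumer<String> h : subscribers.getOrDefault(event, List.of())) {
            h.accept(orderId);
        }
    }

    // Wires up the services and runs one order through the saga.
    static List<String> run(boolean paymentSucceeds) {
        subscribers.clear();
        published.clear();

        subscribe("OrderCreated", id -> publish("StockReserved", id));
        subscribe("StockReserved", id ->
            publish(paymentSucceeds ? "PaymentCompleted" : "PaymentFailed", id));
        // Compensation path: each failure event triggers the reverse step.
        subscribe("PaymentFailed", id -> publish("StockReleased", id));
        subscribe("StockReleased", id -> publish("OrderCancelled", id));

        publish("OrderCreated", "order-1");
        return new ArrayList<>(published);
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // failure path, with compensation
        System.out.println(run(true));  // happy path
    }
}
```

Note how the failure path unwinds itself event by event; this autonomy is exactly what makes the flow hard to trace once the number of services grows.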
7.3 Saga Pattern - Orchestration
            ┌────────────────┐
            │   Order Saga   │
            │  Orchestrator  │
            └───┬──┬──┬───┬──┘
   1. Create    │  │  │   │    4. Notify
      Order ┌───┘  │  │   └───┐
            ▼      │  │       ▼
      ┌─────────┐  │  │   ┌──────────┐
      │  Order  │  │  │   │Notifica- │
      │ Service │  │  │   │tion Svc  │
      └─────────┘  │  │   └──────────┘
    2. Reserve     │  │     3. Payment
       Stock  ┌────┘  └────┐
              ▼            ▼
         ┌─────────┐  ┌─────────┐
         │Inventory│  │ Payment │
         │ Service │  │ Service │
         └─────────┘  └─────────┘
// Saga Orchestrator Implementation
@Service
public class OrderSagaOrchestrator {
    private static final Logger log =
        LoggerFactory.getLogger(OrderSagaOrchestrator.class);

    private final OrderService orderService;
    private final InventoryService inventoryService;
    private final PaymentService paymentService;

    public OrderSagaOrchestrator(OrderService orderService,
                                 InventoryService inventoryService,
                                 PaymentService paymentService) {
        this.orderService = orderService;
        this.inventoryService = inventoryService;
        this.paymentService = paymentService;
    }

    public Mono<OrderResult> createOrder(CreateOrderCommand command) {
        return Mono.just(new SagaState(command))
            // Step 1: Create order
            .flatMap(state -> orderService.createOrder(state.getCommand())
                .map(state::withOrder))
            // Step 2: Reserve stock (cancel the order on failure)
            .flatMap(state -> inventoryService.reserveStock(state.getOrder())
                .map(state::withReservation)
                .onErrorResume(e -> compensateOrder(state, e)))
            // Step 3: Process payment (release stock, then cancel order, on failure)
            .flatMap(state -> paymentService.processPayment(state.getOrder())
                .map(state::withPayment)
                .onErrorResume(e -> compensateReservation(state, e)))
            // Step 4: Complete
            .flatMap(state -> orderService.confirmOrder(state.getOrder().getId())
                .thenReturn(OrderResult.success(state)));
    }

    private Mono<SagaState> compensateReservation(SagaState state, Throwable e) {
        log.warn("Payment failed, starting stock compensation: orderId={}",
            state.getOrder().getId());
        return inventoryService.releaseStock(state.getReservation())
            .then(compensateOrder(state, e));
    }

    private Mono<SagaState> compensateOrder(SagaState state, Throwable e) {
        log.warn("Starting order compensation: orderId={}",
            state.getOrder().getId());
        return orderService.cancelOrder(state.getOrder().getId())
            .then(Mono.error(new SagaFailedException("Saga failed", e)));
    }
}
7.4 Choreography vs Orchestration
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose | Dependent on central coordinator |
| Complexity | Grows with service count | Remains consistent |
| Debugging | Difficult (event tracing) | Relatively easy |
| Single Point of Failure | None | Orchestrator |
| Best For | Simple flows (3-4 steps or fewer) | Complex flows (5+ steps) |
8. Observability: OpenTelemetry
8.1 The Three Pillars of Observability
            ┌───────────────┐
            │ Observability │
            └───────┬───────┘
       ┌────────────┼────────────┐
       ▼            ▼            ▼
  ┌─────────┐  ┌─────────┐  ┌─────────┐
  │  Logs   │  │ Metrics │  │ Traces  │
  │(events) │  │(values) │  │(request │
  │         │  │         │  │  flow)  │
  └─────────┘  └─────────┘  └─────────┘
    What        How much       Where?
  happened?    is happening?
8.2 OpenTelemetry Integrated Architecture
┌─────────────────────────────────────────────────────┐
│                     Application                     │
│ ┌─────────────────────────────────────────────┐     │
│ │              OpenTelemetry SDK              │     │
│ │  ┌──────────┐  ┌──────────┐  ┌──────────┐   │     │
│ │  │  Tracer  │  │  Meter   │  │  Logger  │   │     │
│ │  └──────────┘  └──────────┘  └──────────┘   │     │
│ └────────────────────┬────────────────────────┘     │
└──────────────────────┼─────────────────────────────┘
                       │ OTLP
                       ▼
              ┌─────────────────┐
              │ OTel Collector  │
              │  ┌───────────┐  │
              │  │ Receivers │  │
              │  │ Processors│  │
              │  │ Exporters │  │
              │  └───────────┘  │
              └────┬───┬───┬────┘
                   │   │   │
        ┌──────────┘   │   └──────────┐
        ▼              ▼              ▼
   ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ Jaeger  │    │Promethe-│    │  Loki   │
   │ (Traces)│    │us/Mimir │    │ (Logs)  │
   │         │    │(Metrics)│    │         │
   └────┬────┘    └────┬────┘    └────┬────┘
        │              │              │
        └──────────────┼──────────────┘
                       ▼
                 ┌───────────┐
                 │  Grafana  │
                 │(Dashboard)│
                 └───────────┘
8.3 Distributed Tracing Implementation
// Spring Boot + OpenTelemetry auto-configuration
// build.gradle:
//   implementation 'io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter'

// Custom span creation
@Service
public class OrderService {
    private final Tracer tracer;

    public OrderService(Tracer tracer) {
        this.tracer = tracer;
    }

    public Order createOrder(CreateOrderRequest request) {
        Span span = tracer.spanBuilder("order.create")
            .setAttribute("order.customer_id", request.getCustomerId())
            .setAttribute("order.item_count", request.getItems().size())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            // Order creation logic
            Order order = processOrder(request);
            span.setAttribute("order.id", order.getId());
            span.setAttribute("order.total", order.getTotal().doubleValue());
            span.setStatus(StatusCode.OK);
            return order;
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
8.4 Key Metrics (RED/USE)
RED Method (Service perspective):
┌──────────────────────────────────────┐
│ R - Rate:     Requests per second    │
│ E - Errors:   Error rate (%)         │
│ D - Duration: Response time          │
│               (p50/p95/p99)          │
└──────────────────────────────────────┘
USE Method (Resource perspective):
┌──────────────────────────────────────┐
│ U - Utilization: Resource usage rate │
│ S - Saturation:  Queue length        │
│ E - Errors:      Resource error count│
└──────────────────────────────────────┘
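The RED quantities are simple to compute from raw request records; here is a minimal plain-Java sketch. The nearest-rank percentile definition is an assumption for brevity; real services use a metrics library (Micrometer, a Prometheus client) with histogram buckets instead of sorting raw samples.

```java
import java.util.*;

// RED-method sketch: derive errors and duration percentiles for one
// service from a window of request records. No metrics library.
public class RedMetrics {
    record Request(long durationMillis, boolean error) {}

    static double errorRatePercent(List<Request> reqs) {
        long errors = reqs.stream().filter(Request::error).count();
        return reqs.isEmpty() ? 0 : 100.0 * errors / reqs.size();
    }

    // Nearest-rank percentile, the simplest common definition.
    static long durationPercentile(List<Request> reqs, double p) {
        long[] sorted = reqs.stream().mapToLong(Request::durationMillis).sorted().toArray();
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        List<Request> window = List.of(
            new Request(120, false), new Request(80, false),
            new Request(450, true),  new Request(95, false));
        System.out.println("errors%: " + errorRatePercent(window)); // 25.0
        System.out.println("p95(ms): " + durationPercentile(window, 95)); // 450
    }
}
```

Rate, the R, is just the window size divided by the window's duration; tracking p95/p99 rather than averages is what exposes tail latency.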
9. API Gateway Pattern
9.1 The Role of API Gateway
                ┌─────────────────────┐
                │     API Gateway     │
Client ────────▶│                     │
                │ - Auth/AuthZ        │
                │ - Rate Limiting     │
                │ - Request Routing   │
                │ - Load Balancing    │
                │ - Response Caching  │
                │ - Request/Response  │
                │   Transformation    │
                │ - Circuit Breaker   │
                │ - Logging/Monitoring│
                └──┬──────┬──────┬────┘
                   │      │      │
                   ▼      ▼      ▼
                ┌─────┐┌─────┐┌─────┐
                │Svc A││Svc B││Svc C│
                └─────┘└─────┘└─────┘
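Of the edge features listed above, rate limiting is the easiest to sketch concretely. Below is a minimal token-bucket limiter in plain Java; the injected-clock design is an assumption made for testability, and a real gateway runs one bucket per client key inside the proxy.

```java
// Token-bucket rate limiter sketch: a bucket of `capacity` tokens refills
// continuously; each request consumes one token or is rejected.
public class TokenBucket {
    private final long capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillMillis;

    public TokenBucket(long capacity, double refillPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillMillis = nowMillis;
    }

    // Returns true if the request may pass; call with a monotonic timestamp.
    public synchronized boolean tryAcquire(long nowMillis) {
        double elapsedSec = (nowMillis - lastRefillMillis) / 1000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSecond);
        lastRefillMillis = nowMillis;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(2, 1.0, 0); // burst 2, 1 req/sec
        System.out.println(bucket.tryAcquire(0));    // true
        System.out.println(bucket.tryAcquire(0));    // true
        System.out.println(bucket.tryAcquire(0));    // false: bucket empty
        System.out.println(bucket.tryAcquire(1000)); // true: 1s refilled a token
    }
}
```

The capacity controls burst tolerance while the refill rate controls sustained throughput, which is why the two are configured separately in most gateways.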
9.2 BFF (Backend for Frontend) Pattern
┌────────┐ ┌──────────────┐
│ Web │────▶│ Web BFF │──┐
│ Client │ │(GraphQL) │ │
└────────┘ └──────────────┘ │
│ ┌──────────┐
┌────────┐ ┌──────────────┐ ├───▶│ Order │
│ Mobile │────▶│ Mobile BFF │──┤ │ Service │
│ App │ │(REST, light) │ │ └──────────┘
└────────┘ └──────────────┘ │
│ ┌──────────┐
┌────────┐ ┌──────────────┐ ├───▶│ User │
│ IoT │────▶│ IoT BFF │──┤ │ Service │
│ Device │ │(MQTT bridge) │ │ └──────────┘
└────────┘ └──────────────┘ │
│ ┌──────────┐
└───▶│ Product │
│ Service │
└──────────┘
9.3 API Gateway Comparison
| Aspect | Kong | AWS API Gateway | Envoy Gateway | APISIX |
|---|---|---|---|---|
| Foundation | Nginx + Lua | AWS Managed | Envoy Proxy | Nginx + Lua |
| Deployment | Self-hosted/Cloud | AWS only | Kubernetes | Self-hosted |
| Protocols | REST, gRPC, WS | REST, WS, HTTP | REST, gRPC | REST, gRPC, MQTT |
| Plugins | Extensive | Lambda integration | Extensible | Extensive |
| Pricing | Open-source/Enterprise | Per-request billing | Open-source | Open-source |
10. Deployment Strategies
10.1 Blue-Green Deployment
Phase 1: Blue (current) running
┌─────────────┐ ┌──────────┐
│ Load │────▶│ Blue v1 │ (100% traffic)
│ Balancer │ │ (Active) │
└─────────────┘ └──────────┘
┌──────────┐
│ Green v2 │ (idle)
│ (Idle) │
└──────────┘
Phase 2: Switch to Green
┌─────────────┐ ┌──────────┐
│ Load │ │ Blue v1 │ (idle)
│ Balancer │ │ (Idle) │
└─────────────┘ └──────────┘
│ ┌──────────┐
└───────────▶│ Green v2 │ (100% traffic)
│ (Active) │
└──────────┘
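Conceptually, the Phase 2 cutover is a single atomic pointer swap, which is why blue-green rollback is effectively instant. The class below is an illustrative sketch, not a real load balancer.

```java
import java.util.concurrent.atomic.AtomicReference;

// Blue-green switch sketch: the load balancer holds one pointer to the
// active environment; cutover and rollback are each one atomic swap.
public class BlueGreenRouter {
    private final AtomicReference<String> active = new AtomicReference<>("blue-v1");

    public String route() {                 // 100% of traffic goes here
        return active.get();
    }

    public String switchTo(String target) { // returns the previous target
        return active.getAndSet(target);
    }

    public static void main(String[] args) {
        BlueGreenRouter lb = new BlueGreenRouter();
        System.out.println(lb.route());          // blue-v1
        String previous = lb.switchTo("green-v2");
        System.out.println(lb.route());          // green-v2
        lb.switchTo(previous);                   // instant rollback
        System.out.println(lb.route());          // blue-v1
    }
}
```

The 2x resource cost in the comparison table below comes from keeping both environments fully provisioned so that this swap is always possible.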
10.2 Canary Deployment
# Kubernetes + Argo Rollouts canary deployment
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: order-service
spec:
replicas: 10
strategy:
canary:
steps:
- setWeight: 5
- pause:
duration: 5m
- analysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: order-service
- setWeight: 25
- pause:
duration: 10m
- analysis:
templates:
- templateName: success-rate
- setWeight: 50
- pause:
duration: 15m
- setWeight: 100
selector:
matchLabels:
app: order-service
template:
metadata:
labels:
app: order-service
spec:
containers:
- name: order-service
image: order-service:v2
ports:
- containerPort: 8080
10.3 Deployment Strategy Comparison
| Strategy | Downtime | Rollback Speed | Resource Cost | Risk Level |
|---|---|---|---|---|
| Rolling Update | None | Moderate | Low | Medium |
| Blue-Green | None | Instant | High (2x) | Low |
| Canary | None | Fast | Medium | Low |
| A/B Testing | None | Fast | Medium | Low |
| Recreate | Yes | Slow | Low | High |
11. Migration: The Strangler Fig Pattern
11.1 Pattern Overview
The Strangler Fig Pattern is a strategy for progressively migrating a legacy monolith to microservices. Just as a strangler fig tree wraps around its host tree, new services gradually replace the legacy system.
Phase 1: Place proxy
┌────────┐     ┌──────────┐     ┌────────────┐
│ Client │────▶│  Proxy   │────▶│  Monolith  │
└────────┘     │ (Facade) │     │ (all funcs)│
               └──────────┘     └────────────┘

Phase 2: Extract some features
┌────────┐     ┌──────────┐     ┌────────────┐
│ Client │────▶│  Proxy   │─┬──▶│  Monolith  │
└────────┘     └──────────┘ │   │ (reduced)  │
                            │   └────────────┘
                            │
                            └──▶┌────────────┐
                                │ New Service│
                                │(extracted) │
                                └────────────┘

Phase 3: Most features migrated
┌────────┐     ┌──────────┐     ┌────────────┐
│ Client │────▶│  Proxy   │─┬──▶│  Monolith  │
└────────┘     └──────────┘ │   │ (minimal)  │
                            │   └────────────┘
                            ├──▶┌────────────┐
                            │   │ Service A  │
                            │   └────────────┘
                            ├──▶┌────────────┐
                            │   │ Service B  │
                            │   └────────────┘
                            └──▶┌────────────┐
                                │ Service C  │
                                └────────────┘

Phase 4: Remove monolith
┌────────┐     ┌──────────┐
│ Client │────▶│  API GW  │─┬──▶ Service A
└────────┘     └──────────┘ ├──▶ Service B
                            └──▶ Service C
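The proxy in Phases 1-3 can be sketched as a routing facade: extracted path prefixes go to new services, and everything else falls through to the monolith. The paths and service names below are hypothetical, and a real facade would be an API gateway or reverse-proxy rule set rather than application code.

```java
import java.util.*;

// Strangler facade sketch: route by longest matching path prefix,
// defaulting to the monolith for anything not yet extracted.
public class StranglerFacade {
    // TreeMap so the longest (lexicographically last) prefix wins.
    private final NavigableMap<String, String> extracted = new TreeMap<>();

    public void extract(String prefix, String service) {
        extracted.put(prefix, service);
    }

    public String route(String path) {
        for (var e : extracted.descendingMap().entrySet()) {
            if (path.startsWith(e.getKey())) return e.getValue();
        }
        return "monolith"; // fallback: not yet migrated
    }

    public static void main(String[] args) {
        StranglerFacade proxy = new StranglerFacade();
        proxy.extract("/orders", "order-service");     // Phase 2: first extraction
        proxy.extract("/payments", "payment-service");
        System.out.println(proxy.route("/orders/42")); // order-service
        System.out.println(proxy.route("/users/7"));   // monolith
    }
}
```

Because routing is per prefix, each extraction is a one-line change at the facade, and rolling a service back means deleting that line.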
11.2 Migration Phases
1. Analysis (2-4 weeks)
├── Domain modeling (Event Storming)
├── Dependency analysis (code/data)
└── Priority determination
2. Foundation (4-8 weeks)
├── CI/CD pipeline
├── Container/orchestration environment
├── Observability stack
└── API Gateway placement
3. First service extraction (4-6 weeks)
├── Select most independent domain
├── Implement Anti-Corruption Layer
├── Data migration
└── Dual write / Event bridge
4. Iterative extraction (2-4 weeks per service)
├── Extract next service
├── Remove monolith code
└── Integration testing
5. Monolith removal
├── Migrate remaining features
├── Data cleanup
└── Infrastructure decommission
11.3 Data Migration Strategy
// Dual Write Pattern - write to both during transition
@Service
public class UserServiceMigration {
    private static final Logger log =
        LoggerFactory.getLogger(UserServiceMigration.class);

    private final MonolithUserRepository monolithRepo;
    private final NewUserServiceClient newServiceClient;
    private final MigrationQueue migrationQueue;
    private final FeatureFlag featureFlag;

    public UserServiceMigration(MonolithUserRepository monolithRepo,
                                NewUserServiceClient newServiceClient,
                                MigrationQueue migrationQueue,
                                FeatureFlag featureFlag) {
        this.monolithRepo = monolithRepo;
        this.newServiceClient = newServiceClient;
        this.migrationQueue = migrationQueue;
        this.featureFlag = featureFlag;
    }

    public User createUser(CreateUserRequest request) {
        // Always write to monolith (existing)
        User user = monolithRepo.save(toEntity(request));

        // Also write to new service (migration)
        try {
            newServiceClient.createUser(toNewServiceRequest(user));
        } catch (Exception e) {
            log.warn("New service sync failed, will retry later", e);
            migrationQueue.enqueue(new SyncEvent(user));
        }
        return user;
    }

    public User getUser(String userId) {
        // Switch reads using Feature Flag
        if (featureFlag.isEnabled("read-from-new-service")) {
            try {
                return newServiceClient.getUser(userId);
            } catch (Exception e) {
                log.warn("New service read failed, falling back", e);
                return monolithRepo.findById(userId);
            }
        }
        return monolithRepo.findById(userId);
    }
}
}
12. Anti-Patterns
12.1 Common MSA Anti-Patterns
| Anti-Pattern | Description | Solution |
|---|---|---|
| Distributed Monolith | Services are split but remain tightly coupled | Redefine Bounded Contexts |
| Nanoservice | Services too small (function-level) | Merge services, group related features |
| Shared DB | Multiple services accessing the same DB | DB per Service principle |
| Synchronous Chain | Sequential sync calls: A to B to C to D | Event-driven asynchronous |
| Golden Hammer | Applying MSA to every problem | Evaluate suitability before choosing architecture |
| Version Hell | API versions proliferate | Maintain backward compatibility, contract testing |
12.2 Distributed Monolith Symptoms
Distributed Monolith Checklist:
[ ] Deploying one service requires deploying others too
[ ] Synchronous call chains span 3+ services
[ ] Multiple services read/write the same database
[ ] Shared library versions must be coordinated across services
[ ] One service failure brings down the entire system
[ ] Service boundaries are drawn by tech layer (front/back/DB)
If 3 or more apply, you likely have a distributed monolith.
12.3 The Synchronous Call Chain Problem
Problem: Synchronous chain
Client -> Order -> Inventory -> Payment -> Shipping
(3 sec)
Total response time: 3 sec + each service processing time
If any service fails, the entire chain fails
Solution: Async events + immediate response
Client -> Order (200 OK, order accepted)
|
+-- Event: OrderCreated
| |
| +-- Inventory (async)
| +-- Payment (async)
| +-- Shipping (async)
|
+-- Status query API provided
13. Decision Framework
13.1 Architecture Selection Matrix
Score the following questions (1-5 points).
| Question | 1 point (Low) | 5 points (High) |
|---|---|---|
| Team size | 5 or fewer | 50+ |
| Domain complexity | Simple CRUD | Complex business logic |
| Independent scaling needs | Uniform load | Vastly different load per service |
| Deployment frequency | Monthly | Dozens per day |
| Tech diversity needs | Single stack | Optimal tech per service needed |
| Team autonomy | Centralized | Independent per team |
| Operational maturity | DevOps beginners | Mature platform team |
Score Interpretation:
- 7-15 points: Monolith (or Modular Monolith)
- 16-25 points: Modular Monolith (or Hybrid)
- 26-35 points: Microservices
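The scoring rule above translates directly into code. A minimal sketch, assuming seven answers each scored 1-5 and the interpretation bands listed in this section:

```java
// Architecture selection sketch: sum the seven matrix answers (1-5 each)
// and map the total to the bands from section 13.1.
public class ArchitectureScore {
    public static String recommend(int... answers) {
        if (answers.length != 7) {
            throw new IllegalArgumentException("expected 7 answers");
        }
        int total = java.util.Arrays.stream(answers).sum();
        if (total <= 15) return "Monolith (or Modular Monolith)";
        if (total <= 25) return "Modular Monolith (or Hybrid)";
        return "Microservices";
    }

    public static void main(String[] args) {
        // team, domain, scaling, deploy freq, tech diversity, autonomy, ops maturity
        System.out.println(recommend(2, 2, 1, 2, 1, 2, 1)); // total 11 -> Monolith band
        System.out.println(recommend(4, 5, 4, 4, 3, 4, 4)); // total 28 -> Microservices
    }
}
```

Treat the output as a starting point for discussion, not a verdict; the matrix deliberately weights organizational factors as heavily as technical ones.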
13.2 Recommended Tech Stacks
Monolith:
├── Spring Boot + JPA
├── Django + PostgreSQL
├── Rails + PostgreSQL
└── Next.js Full Stack
Modular Monolith:
├── Spring Modulith
├── .NET Aspire
├── Go (module pattern)
└── Rust (Workspace)
Microservices:
├── Communication: gRPC (internal) + REST (external)
├── Messaging: Kafka / RabbitMQ
├── Orchestration: Kubernetes
├── Service Mesh: Istio / Linkerd
├── Observability: OpenTelemetry + Grafana
├── CI/CD: ArgoCD / GitHub Actions
└── API Gateway: Kong / Envoy Gateway
14. Interview Prep Q&A (Top 15)
Q1. Explain five core characteristics of microservices.
- Single Responsibility: Each service handles one business capability
- Independent Deployment: Deployable independently from other services
- Distributed Data: Each service owns its own data store
- Tech Diversity: Each service can choose its optimal tech stack
- Fault Isolation: One service failure does not propagate to the entire system
Q2. Explain the CAP theorem and its relationship with microservices.
According to the CAP theorem, a distributed system can guarantee only 2 out of 3: Consistency, Availability, and Partition Tolerance. Since network partitions are inevitable in microservices, P must always be chosen, leaving the choice between AP (availability-first) or CP (consistency-first). Most MSA systems choose AP with Eventual Consistency.
Q3. What are the two implementation approaches for the Saga pattern?
Choreography: Each service publishes and subscribes to events, autonomously performing the next step. There is no central coordinator, resulting in loose coupling, but event flow tracking becomes difficult as services increase.
Orchestration: A central orchestrator calls each service in sequence and manages compensation logic. The flow is clear, but the orchestrator can become a single point of failure.
Q4. Explain the Sidecar pattern in Service Mesh.
A proxy container (sidecar) is co-located with each service pod. All inbound/outbound traffic from the service passes through the sidecar, handling cross-cutting concerns like authentication (mTLS), routing, retries, and observability without modifying application code. Istio uses Envoy proxy, while Linkerd uses linkerd2-proxy as sidecars.
Q5. Explain the principles of Distributed Tracing.
When a request enters the system, a unique Trace ID is generated. Each service propagates this Trace ID and records its processing unit as a Span. Parent-child Spans form a complete Trace, enabling visualization of the entire request path and the timing of each segment. Context is carried between services via the W3C Trace Context standard (the `traceparent` header).
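The header mechanics can be sketched directly: a `traceparent` value has the form `version-trace_id-parent_id-flags`; each hop keeps the trace ID and mints a new span ID. This is a minimal sketch of the format only; a real service would delegate this to an OpenTelemetry SDK propagator rather than hand-rolling it.

```python
import secrets

def new_traceparent():
    """Generate a W3C traceparent header: version-trace_id-parent_id-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by the whole trace
    span_id = secrets.token_hex(8)    # 16 hex chars, unique per span
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(incoming):
    """Keep the trace_id from the incoming header, mint a new span_id."""
    version, trace_id, parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()
child = child_traceparent(root)
# Same trace_id on every hop; each hop gets its own span_id.
```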
Q6. What are the benefits and caveats of the Strangler Fig Pattern?
Benefits: Minimized risk through gradual migration, avoids big-bang migration, monolith and new services can coexist, easy rollback
Caveats: Dual-system operational costs during transition, data synchronization complexity, routing rule management burden, difficulty determining transition completion point
Q7. What is the difference between API Gateway and Service Mesh?
API Gateway: Manages external traffic (North-South). Handles edge features like authentication, rate limiting, request transformation, and API aggregation.
Service Mesh: Manages internal inter-service traffic (East-West). Handles mTLS, service discovery, load balancing, circuit breakers, and other inter-service communication concerns.
They are complementary, and large-scale MSA systems use both together.
Q8. Explain the relationship between Event Sourcing and CQRS.
Event Sourcing: Instead of storing state directly, it stores a sequence of state-change events. Current state is reconstructed by replaying events.
CQRS: Separates command (write) and query (read) models. Writes go to the event store; reads come from optimized views (Read Models). Event Sourcing can be used without CQRS, but combining them allows independent optimization of write and read performance.
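Both ideas fit in a short in-memory sketch: state is a fold over an append-only event list, while a projection keeps a denormalized read model current. The account/balance domain here is hypothetical; a real system would use a durable event store and asynchronous projections.

```python
# Event Sourcing: state is derived from events, never stored directly.
events = []                   # append-only event store (in memory)
read_model = {"balance": 0}   # CQRS read model: a denormalized query view

def apply_to_read_model(event):
    """Projection: keep the read side in sync with the event stream."""
    if event["type"] == "Deposited":
        read_model["balance"] += event["amount"]
    elif event["type"] == "Withdrawn":
        read_model["balance"] -= event["amount"]

def append_event(event):
    events.append(event)
    apply_to_read_model(event)

def replay():
    """Reconstruct current state purely by replaying the event history."""
    balance = 0
    for e in events:
        balance += e["amount"] if e["type"] == "Deposited" else -e["amount"]
    return balance

append_event({"type": "Deposited", "amount": 100})
append_event({"type": "Withdrawn", "amount": 30})
# The replayed state and the projected read model agree.
```

Because writes only ever append events, the write model and any number of read models can be optimized independently, which is the point of combining the two patterns.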
Q9. How do you ensure data consistency in microservices?
Instead of strong consistency, we accept Eventual Consistency. Specific patterns include Saga Pattern (distributed transactions), Outbox Pattern (guaranteed event publishing), Change Data Capture (DB change detection), and Idempotency guarantees (safe duplicate processing). When strong data consistency is absolutely required, those capabilities should be placed in the same service.
Q10. When is gRPC more advantageous than REST?
gRPC excels in internal inter-service communication requiring high throughput, bidirectional streaming, strongly-typed contracts (proto files), and reduced payload size through binary serialization. REST is more suitable for browser clients, public APIs for third-party developers, and simple CRUD operations.
Q11. Explain the Circuit Breaker pattern.
Like an electrical circuit breaker, it blocks calls to a failing downstream service to prevent fault propagation. It has three states: Closed (normal calls), Open (calls blocked, immediate failure returned), Half-Open (testing recovery with some requests). When consecutive failures exceed a threshold, it transitions to Open, and after a timeout, tests recovery in Half-Open.
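The three-state machine described above is small enough to sketch directly. The thresholds and the `always_down` service are hypothetical; production systems would use a library such as resilience4j or a service-mesh circuit breaker instead.

```python
import time

class CircuitBreaker:
    """Closed -> Open after N consecutive failures; Open -> Half-Open after a
    timeout; Half-Open -> Closed on success, back to Open on failure."""
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # probe whether downstream recovered
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"
        return result

cb = CircuitBreaker(failure_threshold=2, reset_timeout=60)

def always_down():
    raise RuntimeError("downstream unavailable")

for _ in range(2):
    try:
        cb.call(always_down)
    except RuntimeError:
        pass
# After 2 consecutive failures the breaker is Open and further calls fail fast.
```

Failing fast in the Open state is what stops a slow downstream from exhausting the caller's threads and cascading the outage upstream.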
Q12. Compare service discovery approaches.
Client-Side Discovery: The client directly queries a service registry (Eureka, Consul) and selects target instances. Load balancing logic resides in the client.
Server-Side Discovery: A load balancer references the service registry for routing. Kubernetes Service is a prime example. Simpler as clients do not need to know about discovery logic.
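The client-side variant can be illustrated with a toy registry and round-robin selection; the service name and instance addresses are hypothetical. In practice the registry would be Eureka or Consul and the client would also handle health checks and cache refresh.

```python
import itertools

# Hypothetical registry contents, as a client might fetch them from
# Eureka or Consul.
registry = {
    "order-service": ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"],
}

class ClientSideDiscovery:
    """Client-side discovery: the load-balancing logic lives in the client."""
    def __init__(self, registry):
        self._cycles = {name: itertools.cycle(instances)
                        for name, instances in registry.items()}

    def resolve(self, service_name):
        return next(self._cycles[service_name])  # round-robin selection

lb = ClientSideDiscovery(registry)
picks = [lb.resolve("order-service") for _ in range(4)]
# Round-robin wraps: the 4th pick returns to the 1st instance.
```

In server-side discovery, this entire class disappears from the client; the client just calls a stable virtual address (e.g. a Kubernetes Service) and the platform does the selection.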
Q13. When should you transition from Modular Monolith to MSA?
Consider transitioning when: a specific module's scaling requirements differ significantly from the others, the team grows beyond roughly 50 engineers and needs independent deployments, a specific module requires a different technology stack, or the business mandates fault isolation.
Q14. What is the Outbox Pattern?
Within a local transaction, business data and events are saved together to an Outbox table. A separate process (Polling Publisher or CDC) reads events from the Outbox table and publishes them to a message broker. This ensures atomicity between business logic and event publishing.
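The key property is that the business row and the outbox row commit in the same local transaction. A minimal sketch with SQLite standing in for the service database, a list standing in for the message broker, and hypothetical `orders`/`outbox` schemas:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def place_order(order_id):
    # One ACID transaction: the order and its event commit together or not at all.
    with db:
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'CREATED')",
                   (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderCreated", "order_id": order_id}),))

def poll_and_publish(broker):
    """Polling Publisher: relay unsent outbox rows, then mark them published."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        broker.append(json.loads(payload))  # stand-in for a Kafka/RabbitMQ producer
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

broker = []
place_order(1)
poll_and_publish(broker)
```

If the service crashes after the commit but before publishing, the event is still in the outbox table and the next poll relays it, so nothing is lost; the broker-side consumer must tolerate duplicates (at-least-once delivery).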
Q15. Explain the microservices Testing Pyramid.
It consists of unit tests (internal service logic), integration tests (DB/external integration), contract tests (inter-service API compatibility using tools like Pact), component tests (single service E2E), and E2E tests (full system). Contract tests are especially important as they automatically verify that provider service changes do not break consumers.
15. Quiz
Q1. Explain 3 key advantages and 2 limitations of Modular Monolith compared to MSA.
Advantages: (1) Module-level separation without MSA complexity (network communication, distributed transactions, service mesh), (2) ACID transactions available for easy data consistency, (3) Single deployment unit for simpler operations/deployment.
Limitations: (1) Cannot independently scale specific services (difficult to horizontally scale individual modules), (2) Tech stack is limited to a single language/framework.
Q2. What happens when a compensating transaction fails in the Saga pattern?
Compensating transaction failure can leave the system in an inconsistent state, making it a critical issue. Solutions: (1) Retry policy (exponential backoff) for repeated compensation attempts, (2) Dead Letter Queue stores failed compensation events for manual/automatic recovery, (3) Design compensating transactions to be idempotent for safe retries, (4) Monitoring/alerting to immediately notify the operations team for manual intervention.
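Points (1)–(3) above can be combined in one sketch: an idempotent compensation retried with exponential backoff, falling back to a dead-letter queue. The `refund` handler and its transient-failure behavior are hypothetical, and the delays are shortened for illustration.

```python
import time

def retry_compensation(compensate, event, max_attempts=4, base_delay=0.01, dlq=None):
    """Retry an idempotent compensation with exponential backoff; on exhaustion,
    park the event on a dead-letter queue for manual/automatic recovery."""
    for attempt in range(max_attempts):
        try:
            compensate(event)  # must be idempotent: safe to re-run
            return True
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...
    if dlq is not None:
        dlq.append(event)
    return False

refunded = set()
calls = {"n": 0}

def refund(event):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("payment service timeout")  # transient failure
    refunded.add(event["order_id"])  # set add: repeating it changes nothing

dlq = []
ok = retry_compensation(refund, {"order_id": 42}, dlq=dlq)
# Succeeds on the 3rd attempt; nothing reaches the DLQ.
```

Idempotency is what makes the retries safe: if the first refund actually succeeded but its acknowledgment was lost, re-running it must not refund twice.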
Q3. How can you minimize sidecar proxy overhead when adopting a Service Mesh?
(1) Choose a lightweight proxy (Linkerd's Rust-based proxy uses significantly less memory than Envoy), (2) Set appropriate resource requests/limits for sidecars (e.g., CPU 100m, Memory 50Mi), (3) Enable only the mesh features you need (disable unnecessary telemetry), (4) Consider eBPF-based service mesh (Cilium Service Mesh) which processes at kernel level without sidecars, (5) Exclude latency-critical services from the mesh.
Q4. What criteria should you use to select the first service to extract when applying the Strangler Fig Pattern?
(1) Features with the fewest dependencies on other modules (can operate independently), (2) High business value to motivate the team, (3) Well-defined domain boundaries exist, (4) Areas requiring independent scaling or different technology, (5) Sufficient test coverage for easy behavior verification. Conversely, high-risk areas like core payment or authentication should be extracted later.
Q5. Explain why Context Propagation in OpenTelemetry is important and how it works.
Context Propagation is the mechanism for passing Trace ID and Span ID between services to track a single request across a distributed environment. Its importance: without it, logs/metrics/traces from each service remain isolated, making it impossible to understand request flow. How it works: (1) Trace ID is generated at the first service, (2) Propagated to the next service via HTTP headers (W3C traceparent) or gRPC metadata, (3) In message queue environments, context is included in message headers, (4) Each service creates child Spans based on the received context.
References
- Martin Fowler, "Microservices" - https://martinfowler.com/articles/microservices.html
- Sam Newman, "Building Microservices, 2nd Edition" (O'Reilly, 2021)
- Chris Richardson, "Microservices Patterns" (Manning, 2018)
- Microservices.io - Patterns - https://microservices.io/patterns/index.html
- Amazon Prime Video Tech Blog, "Scaling up the Prime Video monitoring service" (2023)
- DHH, "Even Amazon can't make sense of serverless or microservices" (2023)
- Spring Modulith Documentation - https://docs.spring.io/spring-modulith/reference/
- Shopify Engineering, "Deconstructing the Monolith" - https://shopify.engineering/deconstructing-monolith-designing-software-maximizes-developer-productivity
- Istio Documentation - https://istio.io/latest/docs/
- Linkerd Documentation - https://linkerd.io/2/overview/
- OpenTelemetry Documentation - https://opentelemetry.io/docs/
- gRPC Documentation - https://grpc.io/docs/
- Argo Rollouts - https://argoproj.github.io/rollouts/
- Confluent, "Event-Driven Microservices" - https://developer.confluent.io/patterns/
- Vaughn Vernon, "Implementing Domain-Driven Design" (Addison-Wesley, 2013)
- Eric Evans, "Domain-Driven Design" (Addison-Wesley, 2003)
- CNCF Landscape - Service Mesh - https://landscape.cncf.io/