Event-Driven + Saga Architecture Tradeoffs

The Moment Distributed Transactions Become Necessary

In a monolith, you can handle order creation, inventory deduction, and payment processing all within a single DB transaction. ACID is guaranteed, so if something fails midway, a single rollback handles everything.

But the moment services are separated, this "obvious" capability becomes impossible. When the Order Service, Inventory Service, and Payment Service each have their own databases, you cannot wrap them in a single transaction. 2PC (Two-Phase Commit) is theoretically possible, but it is vulnerable to network partitions, the coordinator becomes a single point of failure, and lock duration increases causing throughput to plummet.

The Saga pattern is the standard alternative: first described in the 1987 "Sagas" paper by Garcia-Molina and Salem, and popularized for microservices by Chris Richardson (microservices.io/patterns/data/saga). The core idea is simple:

Decompose a distributed transaction into multiple local transactions, and if any intermediate step fails, execute compensating transactions to undo the already-completed transactions.

This article covers the real tradeoffs that arise when combining Saga with Event-Driven Architecture (EDA). It honestly analyzes the complexity, costs, and organizational requirements behind the slogan "separating enables scaling."

Two Coordination Approaches for Saga

Choreography: Event-Based Autonomous Coordination

Each service completes its local transaction and publishes a domain event, then the next service subscribes to that event and performs its own work. There is no central orchestrator.

OrderService              InventoryService          PaymentService
    |                        |                        |
    |-- OrderCreated ------->|                        |
    |                        |-- InventoryReserved --->|
    |                        |                        |-- PaymentProcessed
    |<----------- OrderConfirmed ----------------------|

On failure, compensation events flow in reverse:

PaymentService: PaymentFailed
  --> InventoryService: InventoryReleased
    --> OrderService: OrderCancelled

Advantages: No direct dependencies between services; new services can subscribe to existing events without modifying their publishers. Disadvantages: The end-to-end flow is not visible in any one place in the code; when event chains grow complex, tracking where a failure occurred is difficult; and there is a risk of circular event dependencies.
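
The choreographed flow above boils down to one event handler per service. The sketch below shows how InventoryService might react to OrderCreated; the event names, the `inventory_repo`, and the `publisher` interface are illustrative assumptions, not a specific library's API.

```python
def handle_order_created(event: dict, inventory_repo, publisher) -> None:
    """InventoryService's reaction to OrderCreated in a choreographed saga.

    On success it publishes InventoryReserved; on failure it publishes a
    failure event so that upstream services can compensate.
    """
    order = event["data"]
    try:
        reservation_id = inventory_repo.reserve(order["items"])
        publisher.publish("inventory.reserved", {
            "order_id": order["order_id"],
            "reservation_id": reservation_id,
        })
    except Exception as exc:
        # This failure event is what eventually triggers OrderCancelled upstream
        publisher.publish("inventory.reservation_failed", {
            "order_id": order["order_id"],
            "reason": str(exc),
        })
```

Note that the handler itself never knows about OrderService or PaymentService; the coupling lives entirely in the event contracts.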

Orchestration: Central Orchestrator Approach

The Saga Orchestrator manages the entire flow. It determines the next step based on success/failure of each stage and controls the compensation order upon failure.

SagaOrchestrator
    |-- (1) reserve_credit --> PaymentService
    |-- (2) reserve_inventory --> InventoryService
    |-- (3) create_shipment --> ShippingService
    |
    |  On (3) failure:
    |-- compensate (2) release_inventory
    |-- compensate (1) refund_credit

Advantages: The entire flow exists explicitly in the orchestrator code. Debugging and monitoring are easy. Disadvantages: The orchestrator can become a single point of failure. As the number of services grows, the orchestrator's complexity increases.

Saga Orchestrator Implementation

The following is the core of a Saga orchestrator. It is a sketch of the essential control flow; a production version would also persist saga state so that in-flight sagas can be resumed after a crash.

"""
Saga Orchestrator: Manages execution and compensation of distributed transactions.

Each step implements execute() and compensate(),
and the orchestrator executes sequentially, then compensates in reverse order on failure.
"""
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional
import logging
import uuid
import time

logger = logging.getLogger(__name__)


class SagaStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"
    FAILED = "failed"  # When compensation itself fails


class StepStatus(Enum):
    PENDING = "pending"
    EXECUTED = "executed"
    COMPENSATED = "compensated"
    COMPENSATION_FAILED = "compensation_failed"


@dataclass
class SagaStep:
    """Individual step of a Saga.

    execute() and compensate() are implemented in subclasses.
    Retry safety is ensured through idempotency_key.
    """
    name: str
    status: StepStatus = StepStatus.PENDING
    idempotency_key: str = ""
    executed_at: Optional[float] = None
    compensated_at: Optional[float] = None
    error: Optional[str] = None

    def __post_init__(self):
        if not self.idempotency_key:
            self.idempotency_key = str(uuid.uuid4())

    def execute(self, context: dict) -> dict:
        raise NotImplementedError

    def compensate(self, context: dict) -> None:
        raise NotImplementedError


@dataclass
class SagaExecution:
    """Saga execution record."""
    saga_id: str
    status: SagaStatus = SagaStatus.PENDING
    steps: list[SagaStep] = field(default_factory=list)
    context: dict = field(default_factory=dict)
    started_at: Optional[float] = None
    completed_at: Optional[float] = None
    error: Optional[str] = None


class SagaOrchestrator:
    """Orchestrator that manages Saga execution.

    Execution flow:
    1. Execute each step sequentially
    2. If any step fails, compensate already-executed steps in reverse order
    3. If compensation also fails, record as FAILED and send alert
    """

    def __init__(self, steps: list[SagaStep], on_failure: "Callable | None" = None):
        self.steps = steps
        self.on_failure = on_failure  # Callback invoked on compensation failure

    def run(self, initial_context: dict | None = None) -> SagaExecution:
        execution = SagaExecution(
            saga_id=str(uuid.uuid4()),
            steps=self.steps,
            context=initial_context or {},
            started_at=time.time(),
        )
        execution.status = SagaStatus.RUNNING
        completed_steps: list[SagaStep] = []

        logger.info(f"Saga {execution.saga_id} started with {len(self.steps)} steps")

        try:
            for step in self.steps:
                logger.info(f"Saga {execution.saga_id}: executing '{step.name}'")
                try:
                    result = step.execute(execution.context)
                    step.status = StepStatus.EXECUTED
                    step.executed_at = time.time()

                    # Merge step result into context
                    if isinstance(result, dict):
                        execution.context.update(result)

                    completed_steps.append(step)

                except Exception as e:
                    step.error = str(e)
                    logger.error(
                        f"Saga {execution.saga_id}: step '{step.name}' failed: {e}"
                    )
                    # Start compensation
                    self._compensate(execution, completed_steps)
                    return execution

            execution.status = SagaStatus.COMPLETED
            execution.completed_at = time.time()
            logger.info(f"Saga {execution.saga_id} completed successfully")

        except Exception as e:
            execution.status = SagaStatus.FAILED
            execution.error = str(e)
            logger.critical(f"Saga {execution.saga_id} unexpected error: {e}")

        return execution

    def _compensate(self, execution: SagaExecution, completed_steps: list[SagaStep]):
        """Compensate completed steps in reverse order."""
        execution.status = SagaStatus.COMPENSATING
        compensation_failures = []

        for step in reversed(completed_steps):
            logger.info(
                f"Saga {execution.saga_id}: compensating '{step.name}'"
            )
            try:
                step.compensate(execution.context)
                step.status = StepStatus.COMPENSATED
                step.compensated_at = time.time()
            except Exception as e:
                step.status = StepStatus.COMPENSATION_FAILED
                step.error = f"Compensation failed: {e}"
                compensation_failures.append((step.name, str(e)))
                logger.critical(
                    f"Saga {execution.saga_id}: compensation of '{step.name}' FAILED: {e}"
                )

        if compensation_failures:
            execution.status = SagaStatus.FAILED
            execution.error = f"Compensation failures: {compensation_failures}"
            if self.on_failure:
                self.on_failure(execution, compensation_failures)
        else:
            execution.status = SagaStatus.COMPENSATED
            execution.completed_at = time.time()

Here are example implementations of two concrete steps:

class ReserveCreditStep(SagaStep):
    """Reserves credit in the payment service."""

    def __init__(self, payment_client):
        super().__init__(name="reserve_credit")
        self.payment_client = payment_client

    def execute(self, context: dict) -> dict:
        result = self.payment_client.reserve(
            customer_id=context["customer_id"],
            amount=context["order_total"],
            idempotency_key=self.idempotency_key,
        )
        return {"reservation_id": result.reservation_id}

    def compensate(self, context: dict) -> None:
        self.payment_client.release(
            reservation_id=context["reservation_id"],
            idempotency_key=f"{self.idempotency_key}_compensate",
        )


class ReserveInventoryStep(SagaStep):
    """Reserves products in the inventory service."""

    def __init__(self, inventory_client):
        super().__init__(name="reserve_inventory")
        self.inventory_client = inventory_client

    def execute(self, context: dict) -> dict:
        result = self.inventory_client.reserve(
            items=context["order_items"],
            idempotency_key=self.idempotency_key,
        )
        return {"inventory_reservation_id": result.reservation_id}

    def compensate(self, context: dict) -> None:
        self.inventory_client.release(
            reservation_id=context["inventory_reservation_id"],
            idempotency_key=f"{self.idempotency_key}_compensate",
        )
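
Wiring these steps into the orchestrator could then look like the sketch below. `alert_oncall` and the two client objects are hypothetical names; the function assumes the SagaOrchestrator, step classes, and SagaStatus defined above are in scope.

```python
def place_order(payment_client, inventory_client, alert_oncall):
    """Run one order saga end to end (sketch; client objects are assumed)."""
    saga = SagaOrchestrator(
        steps=[
            ReserveCreditStep(payment_client),
            ReserveInventoryStep(inventory_client),
        ],
        # Compensation failure is the worst case: page a human immediately
        on_failure=lambda execution, failures: alert_oncall(execution.saga_id, failures),
    )
    execution = saga.run(initial_context={
        "customer_id": "cust-456",
        "order_total": 50000,
        "order_items": [{"sku": "SKU-1", "quantity": 2}],
    })
    return execution.status == SagaStatus.COMPLETED
```

Because steps carry execution state, a fresh list of step instances should be built per saga run rather than reused across runs.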

Event Contracts and Schema Management

In Event-Driven Architecture, events are contracts between services. Without managing these contracts, a change in one service's event format can silently break consumer services.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://events.example.com/order/created/v2",
  "title": "OrderCreated",
  "description": "Event published when an order is created",
  "type": "object",
  "required": ["event_id", "event_type", "event_version", "timestamp", "data"],
  "properties": {
    "event_id": {
      "type": "string",
      "format": "uuid",
      "description": "Unique event ID (used as idempotency key)"
    },
    "event_type": {
      "type": "string",
      "const": "order.created"
    },
    "event_version": {
      "type": "integer",
      "const": 2,
      "description": "Schema version. Used for compatibility determination"
    },
    "timestamp": {
      "type": "string",
      "format": "date-time"
    },
    "data": {
      "type": "object",
      "required": ["order_id", "customer_id", "items", "total_amount"],
      "properties": {
        "order_id": { "type": "string" },
        "customer_id": { "type": "string" },
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["sku", "quantity", "unit_price"],
            "properties": {
              "sku": { "type": "string" },
              "quantity": { "type": "integer", "minimum": 1 },
              "unit_price": { "type": "number", "minimum": 0 }
            }
          }
        },
        "total_amount": { "type": "number", "minimum": 0 },
        "currency": { "type": "string", "default": "KRW" },
        "shipping_address": {
          "type": "object",
          "description": "Field added in v2. v1 consumers ignore this field."
        }
      }
    }
  }
}
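
A producer can guard this contract before publishing. The sketch below checks only the required fields and a couple of constraints from the schema above, using the stdlib; in practice a library such as jsonschema or a schema-registry client would enforce the full schema.

```python
REQUIRED_TOP_LEVEL = ("event_id", "event_type", "event_version", "timestamp", "data")
REQUIRED_DATA = ("order_id", "customer_id", "items", "total_amount")

def validate_order_created(event: dict) -> list[str]:
    """Return a list of contract violations (an empty list means valid)."""
    errors = [f"missing field: {f}" for f in REQUIRED_TOP_LEVEL if f not in event]
    if event.get("event_type") != "order.created":
        errors.append("event_type must be 'order.created'")
    if event.get("event_version") != 2:
        errors.append("event_version must be 2")
    data = event.get("data") or {}
    errors += [f"missing data field: {f}" for f in REQUIRED_DATA if f not in data]
    for item in data.get("items", []):
        # Mirrors the "minimum": 1 constraint on quantity in the schema
        if item.get("quantity", 0) < 1:
            errors.append(f"quantity must be >= 1: {item}")
    return errors
```

Rejecting a malformed event at the producer is far cheaper than debugging a consumer that silently misinterpreted it.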

For schema compatibility policies, it is helpful to refer to the Confluent Schema Registry compatibility modes.

| Compatibility Mode | Allowed Changes                          | When to Use                                                  |
| ------------------ | ---------------------------------------- | ------------------------------------------------------------ |
| BACKWARD           | Add optional fields, delete fields       | When consumers are deployed first                            |
| FORWARD            | Add fields, delete fields with defaults  | When producers are deployed first                            |
| FULL               | Satisfies both of the above              | When producer/consumer deployment order cannot be guaranteed |
| NONE               | All changes allowed (dangerous)          | Development environments only                                |

Outbox Pattern: Ensuring Atomicity of Event Publishing

One of the most common mistakes is assuming that "save to the DB" and "publish the event" happen atomically. They are two separate operations, and the second can fail after the first has succeeded.

# Dangerous pattern
1. Save order to DB     -- Success
2. Publish event to Kafka -- Failure (network issue)
=> Order is created but no event, so inventory is not deducted and payment is not processed

The Outbox pattern writes the event to an outbox table in the same database transaction as the business data; a separate process then polls the outbox and delivers the messages to the message broker.

-- outbox table schema
CREATE TABLE outbox (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type  VARCHAR(100) NOT NULL,    -- 'Order', 'Payment', etc.
    aggregate_id    VARCHAR(100) NOT NULL,    -- Order ID, Payment ID
    event_type      VARCHAR(200) NOT NULL,    -- 'order.created'
    event_version   INT NOT NULL DEFAULT 1,
    payload         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at    TIMESTAMPTZ,              -- NULL means not yet published
    retry_count     INT NOT NULL DEFAULT 0,
    max_retries     INT NOT NULL DEFAULT 5
);

-- Index for querying unpublished events
CREATE INDEX idx_outbox_unpublished
    ON outbox (created_at)
    WHERE published_at IS NULL AND retry_count < max_retries;

-- Process order creation and event in a single transaction
BEGIN;
    INSERT INTO orders (id, customer_id, total, status)
    VALUES ('ord-123', 'cust-456', 50000, 'created');

    INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
    VALUES (
        'Order',
        'ord-123',
        'order.created',
        '{"order_id": "ord-123", "customer_id": "cust-456", "total": 50000}'::jsonb
    );
COMMIT;

The Outbox publisher runs as a separate process, periodically polling for unpublished events.

"""
Outbox Publisher: Delivers unpublished events from the outbox table to the broker.

Guarantees at-least-once delivery, and consumers must
handle idempotency on their side.
"""
import time
import json
import logging

logger = logging.getLogger(__name__)


class OutboxPublisher:
    def __init__(self, db_pool, kafka_producer, poll_interval_seconds: float = 1.0):
        self.db_pool = db_pool
        self.producer = kafka_producer
        self.poll_interval = poll_interval_seconds

    def run(self):
        """Main loop. Continuously polls and publishes unpublished events."""
        logger.info("OutboxPublisher started")
        while True:
            try:
                published_count = self._poll_and_publish()
                if published_count == 0:
                    time.sleep(self.poll_interval)
            except Exception as e:
                logger.error(f"OutboxPublisher error: {e}")
                time.sleep(self.poll_interval * 5)

    def _poll_and_publish(self, batch_size: int = 100) -> int:
        """Queries a batch of unpublished events and publishes them."""
        with self.db_pool.connection() as conn:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT id, aggregate_type, aggregate_id,
                           event_type, event_version, payload
                    FROM outbox
                    WHERE published_at IS NULL
                      AND retry_count < max_retries
                    ORDER BY created_at
                    LIMIT %s
                    FOR UPDATE SKIP LOCKED
                """, (batch_size,))

                rows = cur.fetchall()
                if not rows:
                    return 0

                published_ids = []
                failed_ids = []

                for row in rows:
                    event_id, agg_type, agg_id, evt_type, evt_ver, payload = row
                    topic = f"{agg_type.lower()}-events"

                    try:
                        # send() is asynchronous in kafka-python-style clients:
                        # block on the returned future so that delivery errors
                        # surface here instead of being silently dropped
                        future = self.producer.send(
                            topic=topic,
                            key=agg_id.encode(),
                            value=json.dumps({
                                "event_id": str(event_id),
                                "event_type": evt_type,
                                "event_version": evt_ver,
                                "data": payload,
                            }).encode(),
                        )
                        future.get(timeout=10)
                        published_ids.append(event_id)
                    except Exception as e:
                        logger.warning(f"Failed to publish {event_id}: {e}")
                        failed_ids.append(event_id)

                # Mark successful events
                if published_ids:
                    cur.execute("""
                        UPDATE outbox SET published_at = NOW()
                        WHERE id = ANY(%s)
                    """, (published_ids,))

                # Increment retry count for failed events
                if failed_ids:
                    cur.execute("""
                        UPDATE outbox SET retry_count = retry_count + 1
                        WHERE id = ANY(%s)
                    """, (failed_ids,))

                conn.commit()
                return len(published_ids)

Idempotency: Handling Duplicate Events

In distributed systems, events are delivered "at-least-once," meaning the same event may arrive multiple times. Consumers must be idempotent.

class IdempotentEventHandler:
    """Handler wrapper that safely processes duplicate events.

    Records processing history using event_id as key,
    and ignores already-processed events.
    """

    def __init__(self, db_pool, handler_fn):
        self.db_pool = db_pool
        self.handler_fn = handler_fn

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]

        with self.db_pool.connection() as conn:
            with conn.cursor() as cur:
                # Check if already processed
                cur.execute(
                    "SELECT 1 FROM processed_events WHERE event_id = %s",
                    (event_id,),
                )
                if cur.fetchone():
                    logger.info(f"Duplicate event {event_id}, skipping")
                    return True

                try:
                    # Execute business logic and record event processing in a single transaction
                    self.handler_fn(event, cur)

                    # The PRIMARY KEY on event_id also closes the race where two
                    # consumers pass the SELECT check concurrently: the second
                    # INSERT raises, this transaction rolls back, and the
                    # redelivered event is then skipped by the check above.
                    cur.execute(
                        "INSERT INTO processed_events (event_id, processed_at) VALUES (%s, NOW())",
                        (event_id,),
                    )
                    conn.commit()
                    return True

                except Exception as e:
                    conn.rollback()
                    logger.error(f"Failed to process event {event_id}: {e}")
                    return False
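
The `processed_events` table that this wrapper assumes can be as small as the sketch below; the primary key is what makes the duplicate check race-safe.

```sql
CREATE TABLE processed_events (
    event_id     UUID PRIMARY KEY,              -- the event's idempotency key
    processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

Rows older than the broker's maximum redelivery window can be purged periodically to keep the table bounded.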

Saga Operations Monitoring

When Sagas run in production, you need to track the status of each saga instance in real-time.

-- Saga status dashboard queries

-- 1. Saga status distribution by time period
SELECT
    date_trunc('hour', started_at) AS hour,
    status,
    COUNT(*) AS count,
    AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) AS avg_duration_seconds
FROM saga_executions
WHERE started_at >= NOW() - INTERVAL '24 hours'
GROUP BY 1, 2
ORDER BY 1 DESC, 2;

-- 2. Sagas with compensation failures (immediate response needed)
SELECT
    e.saga_id,
    e.started_at,
    e.error,
    jsonb_agg(
        jsonb_build_object(
            'step', s.step_name,
            'status', s.step_status,
            'error', s.step_error
        ) ORDER BY s.step_order
    ) AS step_details
FROM saga_executions e
JOIN saga_steps s ON e.saga_id = s.saga_id
WHERE e.status = 'failed'  -- SagaStatus values are stored lowercase
  AND e.started_at >= NOW() - INTERVAL '7 days'
GROUP BY e.saga_id, e.started_at, e.error
ORDER BY e.started_at DESC;

-- 3. Failure rate by step (identifying bottlenecks)
SELECT
    step_name,
    COUNT(*) AS total_executions,
    SUM(CASE WHEN step_status = 'executed' THEN 1 ELSE 0 END) AS successes,
    SUM(CASE WHEN step_status != 'executed' THEN 1 ELSE 0 END) AS failures,
    ROUND(
        100.0 * SUM(CASE WHEN step_status != 'executed' THEN 1 ELSE 0 END) / COUNT(*),
        2
    ) AS failure_rate_pct
FROM saga_steps
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY step_name
ORDER BY failure_rate_pct DESC;

Deciding Whether to Adopt: Not Everything Needs a Saga

Saga + EDA is powerful but complex. Before adoption, you should honestly answer the following questions.

When Saga/EDA is appropriate:

  • When data consistency between services is essential, but eventual consistency is sufficient rather than strong consistency (ACID)
  • When independent deployment and scaling of each service is a business requirement
  • When teams are separated along service ownership boundaries and each team needs its own deployment cycle
  • When transaction volume is high enough that the locking costs of 2PC are unacceptable

When monolith transactions are better:

  • Early-stage products where service boundaries are unclear and change frequently
  • When team size is small (5 or fewer) and the operational burden of distributed systems outweighs the development speed benefits
  • When data consistency requirements are real-time, making eventual consistency business-unacceptable (e.g., bank account transfers)
  • When the team lacks the operational capability for message brokers (Kafka, RabbitMQ)

Practical Troubleshooting

What If Compensating Transactions Fail?

This is the most difficult problem with Sagas. When compensation fails, the system is stuck in a half-completed state: the order is cancelled, but the payment has not been refunded.

Countermeasures:

  1. Dead Letter Queue (DLQ): Send compensation failure events to a DLQ and have the operations team process them manually.
  2. Retry Policy: Retry 3-5 times with exponential backoff. Temporary network issues are resolved this way.
  3. Manual Recovery Runbook: Document step-by-step manual recovery procedures for cases that cannot be resolved even from the DLQ.
  4. Alerting: Set compensation failures as P1 alerts. If you think "I'll deal with it later," data inconsistencies accumulate.
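
Countermeasure 2 can be sketched as a small helper. The injectable `sleep` parameter and the jitter factor are assumptions for testability, not part of any specific library.

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.5,
                       sleep=time.sleep):
    """Retry fn with exponential backoff and jitter.

    Raises the last exception after max_attempts, so the caller can route
    the failed compensation to the DLQ (countermeasure 1) and alert
    (countermeasure 4).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # 0.5s, 1s, 2s, ... plus up to 10% jitter to avoid thundering herds
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random() * 0.1)
            sleep(delay)
```

Only transient failures (timeouts, connection resets) should be retried this way; a compensation that fails with a business error will fail identically on every attempt.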

When Event Ordering Is Not Guaranteed

Kafka guarantees ordering within a partition but not across partitions. Events for the same order going to different partitions can have their order reversed.

Countermeasures:

  1. Use the same partition key for events of the same aggregate (e.g., order_id).
  2. Check the sequence number of events on the consumer side, and if the order is wrong, wait briefly or send to a reprocessing queue.
  3. Include causation_id (the ID of the causing event) in events to track causal relationships.
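
The sequence check from countermeasure 2 can be sketched as below. The in-memory `last_seen` map and the return labels are illustrative assumptions; in production this state would live in the consumer's database, updated in the same transaction as the business logic.

```python
def check_sequence(last_seen: dict, aggregate_id: str, seq: int) -> str:
    """Classify an incoming event by its per-aggregate sequence number.

    Returns 'process', 'duplicate', or 'out_of_order'; out-of-order events
    should wait briefly or go to a reprocessing queue.
    """
    expected = last_seen.get(aggregate_id, 0) + 1
    if seq < expected:
        return "duplicate"      # already processed; safe to skip
    if seq > expected:
        return "out_of_order"   # a gap: an earlier event has not arrived yet
    last_seen[aggregate_id] = seq
    return "process"
```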

Dead Letter Queue Keeps Growing

Symptom: Thousands of events are piling up in the DLQ and nobody is processing them.

Cause: There is no DLQ processing process, or the processing responsibility is unclear.

Countermeasure: (1) Set up alerts on DLQ count (warning at 100+ items, P1 at 1000+ items). (2) Include weekly DLQ review in team routines. (3) Implement triage logic that classifies events into those that can be auto-reprocessed and those requiring manual intervention.
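
The triage logic from countermeasure (3) can be sketched as a pure classification function; the `error_category` and `retry_count` fields on the DLQ event are assumptions for illustration.

```python
def triage_dlq_event(event: dict) -> str:
    """Split DLQ events into auto-retryable vs. manual-review buckets.

    Transient infrastructure errors are safe to reprocess automatically;
    everything else (schema violations, business errors, exhausted retries)
    needs a human.
    """
    transient = {"timeout", "connection_reset", "broker_unavailable"}
    category = event.get("error_category", "unknown")
    if category in transient and event.get("retry_count", 0) < 5:
        return "auto_reprocess"
    return "manual_review"
```

Running this as a scheduled job keeps the DLQ bounded to events that genuinely need human attention.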

Quiz
  1. What is the main reason the Saga pattern is used instead of 2PC (Two-Phase Commit)? Answer: ||2PC makes the coordinator a single point of failure, is vulnerable to network partitions, and causes throughput to plummet due to extended lock duration. Saga processes each step as an independent local transaction and recovers through compensating transactions on failure, resulting in higher availability and throughput.||

  2. What is the biggest operational difference between Choreography and Orchestration? Answer: ||In Choreography, the entire saga flow is distributed across multiple services, making it hard to grasp at a glance and complex to debug. In Orchestration, the entire flow exists explicitly in the orchestrator code, making tracking and debugging easy, but the orchestrator can become a single point of failure.||

  3. What core problem does the Outbox pattern solve? Answer: ||The problem that DB save and event publishing are not atomic. If saved to DB but event publishing fails, data inconsistency occurs. The Outbox pattern saves events to an outbox table within the same DB transaction, and a separate process reads and publishes them, ensuring atomicity.||

  4. Why is consumer idempotency essential? Answer: ||In distributed systems, events are delivered at-least-once, so the same event may arrive multiple times. Without idempotency, the same order could be charged twice or inventory could be double-deducted.||

  5. What are the minimum three things to do when a Saga's compensating transaction fails? Answer: ||Store the failure event in a DLQ, send a P1 alert, and have the operations team intervene following a manual recovery runbook to restore data consistency.||

  6. How do you ensure event ordering for the same order in Kafka? Answer: ||Set the partition key to the same aggregate (e.g., order_id) so that events for the same aggregate go to the same partition. Kafka guarantees ordering within a partition, so events with the same key are processed in order.||

  7. In what situation is the FULL event schema compatibility mode recommended? Answer: ||When the deployment order of producers and consumers cannot be guaranteed. FULL compatibility satisfies both BACKWARD (consumer first) and FORWARD (producer first), making it safe regardless of deployment order.||

  8. What are two representative situations where Saga/EDA adoption should be avoided? Answer: ||(1) In early product stages where service boundaries change frequently, monoliths have lower change costs. (2) When team size is small and the operational burden of distributed systems (Kafka, monitoring, DLQ processing, etc.) outweighs the development speed benefits.||