Durable Workflow Engines 2026 — Temporal, Restate, Hatchet, Inngest, Trigger.dev, Cadence, DBOS, Dapr Workflows Deep Dive

Prologue — "Is it a background job or a workflow?"

A payments design meeting in 2026.

Junior: "After payment we send an email — we just push to a queue and a worker picks it up, right?"

Senior: "What if the worker dies?"

Junior: "We retry."

Senior: "But the email already went out twice? You refunded but inventory is still held? Payment went through but the order is stuck in pending?"

Junior: "..."

Senior: "That's the moment you need a workflow engine."

This tiny dialog captures the core distributed-systems anxiety of 2026. Background jobs that started life as Cron + Celery + Redis have evolved into multi-step asynchronous business processes — payments, orders, subscriptions, onboarding, AI agents — and simple queues just can't carry the weight.

**Durable execution** — you write workflows like code, but every step of that code is persisted, so the process can die and come back and continue from exactly where it was. In 2026, this paradigm has matured into a rich market: Temporal defined the category, and Restate, Hatchet, Inngest, Trigger.dev v3, and DBOS each offer their own answer.

This article maps that whole landscape. What durable execution actually is, the design philosophy of each engine, code examples, patterns like Saga and Outbox, the place these engines occupy in AI agent orchestration, and what Korean and Japanese big tech actually uses.

Chapter 1 · What Is Durable Execution — Event Sourcing and Replay

One-liner definition: durable execution is "an execution model that persists every input/output and non-deterministic choice of a function into an event log, and on restart replays that log to recover exactly the same state."

Three intellectual roots:

1. **Event sourcing** — don't store state, store the events that produced state.

2. **Deterministic replay** — running the same code over the same event log must always yield the same result.

3. **Side-effect isolation** — external calls (HTTP, DB, queues) must execute once, and their result must be persisted so re-execution doesn't duplicate work.
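The three roots above can be sketched in a few lines of plain Python. This is a toy model, not any engine's implementation: `Journal`, `Context`, and the `charge`/`ship` helpers are hypothetical names, and the journal lives in memory where a real engine would use a database.

```python
import json
from typing import Any, Callable


class Journal:
    """Append-only event log; a real engine would keep this in a database."""

    def __init__(self) -> None:
        self.entries: list[str] = []  # JSON strings, to mimic serialization

    def append(self, name: str, result: Any) -> None:
        self.entries.append(json.dumps({"step": name, "result": result}))


class Context:
    """Replays recorded steps first, then executes and records new ones."""

    def __init__(self, journal: Journal) -> None:
        self.journal = journal
        self.cursor = 0  # index of the next journal entry to replay

    def run(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.journal.entries):
            entry = json.loads(self.journal.entries[self.cursor])
            # Deterministic replay: the code must request the same step again.
            if entry["step"] != name:
                raise RuntimeError(f"non-deterministic replay: {entry['step']} != {name}")
            self.cursor += 1
            return entry["result"]  # the side effect is NOT executed again
        result = fn()  # first execution: perform the side effect
        self.journal.append(name, result)
        self.cursor += 1
        return result


calls: list[str] = []  # records which side effects really ran


def charge() -> str:
    calls.append("charge")
    return "pay-1"


def ship(crash: bool) -> str:
    if crash:
        raise ConnectionError("worker died")
    calls.append("ship")
    return "shipped"


def workflow(ctx: Context, crash: bool) -> str:
    payment_id = ctx.run("charge", charge)
    status = ctx.run("ship", lambda: ship(crash))
    return f"{payment_id}:{status}"


journal = Journal()
try:
    workflow(Context(journal), crash=True)  # first attempt dies after charging
except ConnectionError:
    pass
result = workflow(Context(journal), crash=False)  # restart replays the journal
```

On the second run `charge` is answered from the journal, so the card is charged once even though the workflow function ran twice.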

Traditional workflow engines (BPMN, Airflow) keep the DAG definition in an external file, and the engine interprets that DAG to run it. Durable execution is different — **the workflow itself is code**. `if`, `for`, `try/catch`, function calls — all part of the workflow definition, and the engine persists the execution of that code.

A side-by-side table.

| Item | Traditional (Airflow/Argo) | Durable Execution (Temporal/Restate) |
| --- | --- | --- |
| Definition format | YAML/Python DAG declaration | Arbitrary code |
| Branching | Explicit in the DAG | `if/else` in code |
| Long waits | Sensor / poke | A function call like `sleep(30d)` |
| State storage | Task table in external DB | Event log |
| Failure recovery | Retry from failed task | Replay from same point |
| Developer experience | DSL learning | Regular code |

The real value here is that **`sleep(30d)` literally sleeps 30 days and wakes up as if it had been sitting in memory the whole time**. The process dies, the host crashes, a different node restarts the workflow — and event-log replay restores it to the same line.
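One way to picture a durable `sleep` is a timer whose absolute wake-up time is persisted the first time the line runs. The sketch below assumes a hypothetical `TimerStore` standing in for the engine's storage and takes the clock as a parameter so the behavior is easy to follow; real engines schedule timers server-side rather than polling like this.

```python
from dataclasses import dataclass, field


@dataclass
class TimerStore:
    """Stands in for the engine's persistent storage (hypothetical)."""
    fire_at: dict[str, float] = field(default_factory=dict)


def durable_sleep(store: TimerStore, timer_id: str, seconds: float, now: float) -> float:
    """Return how many seconds are still left to wait at time `now`.

    The absolute deadline is persisted on the first call, so a restarted
    worker resumes the same timer instead of starting a fresh one.
    """
    if timer_id not in store.fire_at:
        store.fire_at[timer_id] = now + seconds  # first execution: persist deadline
    return max(0.0, store.fire_at[timer_id] - now)


DAY = 24 * 60 * 60
store = TimerStore()

remaining_at_start = durable_sleep(store, "trial-end", 30 * DAY, now=0)
# The worker crashes; ten days later another node re-runs the same line.
remaining_after_restart = durable_sleep(store, "trial-end", 30 * DAY, now=10 * DAY)
# Once the deadline has passed, the timer is due and the workflow moves on.
remaining_when_due = durable_sleep(store, "trial-end", 30 * DAY, now=31 * DAY)
```

The restart sees 20 days remaining, not 30, because the deadline rather than the duration is what was stored.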

Chapter 2 · Why Workflow Engines Matter Now — Five Problems

In a 2026 microservices environment, here are the five problems workflow engines solve.

**Problem 1 — Distributed sagas.** An order touches four services: payment, inventory, shipping, notification. If step three fails, you have to compensate steps one and two. You can't wrap that in a single DB transaction.

**Problem 2 — Long-running work.** After signup: 7-day trial, charge after 14 days, expire after 30. You can't hold that timeline in memory, and cron + state machines turn into spaghetti.

**Problem 3 — AI pipelines.** RAG indexing -> embedding -> generation -> review -> publish — multi-step LLM pipelines have different failure rates per step, some take minutes, and expensive steps must never be called twice.

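**Problem 4 — Idempotency and retries.** Retrying calls to external APIs (payments, mail, SMS) over a flaky network is dangerous without idempotency keys. Workflow engines help by persisting each step's result and giving every step a stable ID that can serve as that key.

**Problem 5 — Exactly-once.** Queues give you at-least-once delivery and leave deduplication to you. Workflow engines run the workflow logic effectively once and record each completed step, so a finished step is not re-executed on recovery. A step that crashes mid-call can still be retried, which is why idempotent activities remain necessary.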

In short: the problems a 2026 backend engineer faces look less and less like "a single HTTP response" and more like "a process that crosses systems and crosses time."

Chapter 3 · Temporal Cloud + Temporal Server 1.25 — The Category Definer

Temporal is a company founded in 2019 by Maxim Fateev and Samar Abbas (both original authors of Uber Cadence). They effectively defined "durable execution" as a category. A 2025 funding round valued the company at more than $1.5B.

Core architecture.

| Component | Role |
| --- | --- |
| Frontend | gRPC API gateway |
| History | Event log persistence (Cassandra/PostgreSQL/MySQL) |
| Matching | Task queues |
| Worker (Internal) | System workflows (e.g. archive) |
| **Worker (User)** | Runs the workflows/activities you wrote with the SDK |

Big changes in Temporal Server 1.25 (2024):

- **PostgreSQL 16 support + new visibility backend** (reduces ElasticSearch dependence).

- **Update with start** — start a workflow with an input update in a single call.

- **Nexus** — safely signal/query across namespaces (multi-tenant friendly).

- **Workflow memoization cache improvements** — 50% memory reduction.

Official SDKs: Go, Java, Python, .NET, TypeScript, PHP, Ruby. A Rust SDK went beta in 2025. The most beloved are TypeScript and Python.

**TypeScript workflow example.**

```ts
// workflows/order.ts
import { proxyActivities, defineSignal, setHandler, sleep } from '@temporalio/workflow'
// The activities module path is an assumption about the project layout.
import type * as activities from '../activities'

const { chargeCard, reserveInventory, ship, sendEmail, refundCard } =
  proxyActivities<typeof activities>({ startToCloseTimeout: '1 minute' })

export const cancelOrderSignal = defineSignal('cancelOrder')

export async function orderWorkflow(orderId: string, amount: number) {
  let cancelled = false
  setHandler(cancelOrderSignal, () => { cancelled = true })

  await chargeCard(orderId, amount)
  try {
    await reserveInventory(orderId)
    await ship(orderId)
    await sleep('48 hours') // 48h after ship, send satisfaction survey
    if (!cancelled) await sendEmail(orderId, 'satisfaction-survey')
  } catch (err) {
    await refundCard(orderId, amount) // compensating action
    throw err
  }
}
```

The magic is `sleep('48 hours')`. The worker dies, it restarts somewhere else, and 48 hours later it picks up exactly at the next line.

Running a worker.

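```ts
// worker.ts
import { Worker } from '@temporalio/worker'
import * as activities from './activities' // path is an assumption

async function run() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'orders',
  })
  await worker.run() // polls the 'orders' task queue until shut down
}

run().catch((err) => { console.error(err); process.exit(1) })
```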

**Temporal Cloud** is the managed offering. 2026 pricing is action-based plus active executions; for a typical SaaS, Cloud is cheaper than running the cluster yourself.

Chapter 4 · Cadence (Uber) — The Ancestor of Temporal

Cadence is the workflow engine Uber open-sourced in 2017, and it's the direct ancestor of Temporal. Temporal's co-founders led the Cadence team at Uber, then forked it in 2019 to start Temporal.

The two systems in 2026:

- **APIs and SDKs are similar** but not compatible. Moving Cadence Go SDK code to Temporal requires package renames and some semantic adjustments.

- **Cadence is still operated at huge scale inside Uber** (hundreds of thousands of workflow instances per day).

- **Temporal drives the OSS-side innovation** — Nexus, Workflow Updates, Visibility v2 land in Temporal first.

For a new project, choose Temporal. Cadence makes sense only inside the Uber ecosystem or where Cadence assets already exist.

Chapter 5 · Restate — Workflow + RPC, "A Simpler Temporal"

Restate was founded in 2023 by Stephan Ewen (co-founder of Apache Flink). 1.0 shipped in 2024.

Design decisions, compared:

| Item | Temporal | Restate |
| --- | --- | --- |
| Core abstractions | Workflow + Activity | Virtual object + Workflow + Service |
| Persistence | Cassandra/Postgres | Embedded RocksDB + arbitrary external storage |
| Communication | gRPC + proprietary protocol | HTTP/2 + Connect/gRPC |
| Deployment | Cluster (History/Matching/...) | Single binary |
| Determinism | SDK enforces it | Per-handler idempotency |

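Restate's central concept is the **virtual object**: a set of handlers addressed by a key, combined with persistent state stored per key. The handler code itself stays stateless (it is an ordinary HTTP endpoint); the state lives in Restate. It naturally expresses things like "a payment domain object."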
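To make the keyed-state idea concrete, here is a toy runtime in Python. It is only an illustration of "state per key, one call at a time per key"; `VirtualObjectRuntime` and the `capture` handler are hypothetical names, the state is kept in memory rather than persisted, and none of this is Restate's API.

```python
import threading
from collections import defaultdict
from typing import Any, Callable

Handler = Callable[[dict[str, Any], Any], Any]


class VirtualObjectRuntime:
    """Toy runtime: state is stored per key, and calls for one key are serialized."""

    def __init__(self) -> None:
        self._state: dict[str, dict[str, Any]] = defaultdict(dict)
        self._locks: dict[str, threading.Lock] = defaultdict(threading.Lock)
        self._handlers: dict[str, Handler] = {}
        self._registry_lock = threading.Lock()

    def handler(self, name: str) -> Callable[[Handler], Handler]:
        def register(fn: Handler) -> Handler:
            self._handlers[name] = fn
            return fn
        return register

    def call(self, key: str, name: str, arg: Any = None) -> Any:
        with self._registry_lock:  # creating per-key entries must be atomic
            lock = self._locks[key]
            state = self._state[key]
        with lock:  # single writer per key
            return self._handlers[name](state, arg)


payments = VirtualObjectRuntime()


@payments.handler("capture")
def capture(state: dict[str, Any], amount: int) -> int:
    state["captured"] = state.get("captured", 0) + amount
    return state["captured"]


# 50 concurrent captures against one key, plus a single capture on another key.
threads = [
    threading.Thread(target=payments.call, args=("order-1", "capture", 10))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
payments.call("order-2", "capture", 7)

total_order_1 = payments.call("order-1", "capture", 0)
total_order_2 = payments.call("order-2", "capture", 0)
```

Because calls for the same key never interleave, the handler can use a plain read-modify-write without its own locking, and different keys do not block each other.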

**TypeScript example.**

```ts
import * as restate from '@restatedev/restate-sdk'

// chargeAPI, reserveInventory, refundAPI and sendSurveyEmail are application
// helpers assumed to exist elsewhere.
const orderService = restate.service({
  name: 'OrderService',
  handlers: {
    async place(ctx: restate.Context, input: { orderId: string; amount: number }) {
      const paymentId = await ctx.run('charge', () =>
        chargeAPI(input.orderId, input.amount)
      )
      try {
        await ctx.run('reserve', () => reserveInventory(input.orderId))
      } catch (err) {
        await ctx.run('refund', () => refundAPI(paymentId))
        throw err
      }
      await ctx.sleep(48 * 60 * 60 * 1000) // 48h
      await ctx.run('survey', () => sendSurveyEmail(input.orderId))
    },
  },
})

restate.endpoint().bind(orderService).listen(9080)
```

`ctx.run(...)` is the idempotent unit for side effects. The result is persisted on the Restate server and returned verbatim on replay.

Restate's appeal is **operational simplicity**. A single Rust binary, no separate SQL or Cassandra cluster needed for small deployments. AWS/GCP/Azure managed offerings are in beta.

2026 positioning: "**You need Temporal's power but hate its operational complexity.**" Adoption is rising in small startups and single-service apps that want to onboard quickly.

Chapter 6 · Hatchet — Distributed Queue + Workflows on Postgres

Hatchet launched in 2023, with a single differentiating bet: **it runs on Postgres alone**.

Design declaration:

- Single external dependency — Postgres. No Cassandra, no Redis, no NATS, no Kafka.

- gRPC + worker SDKs (Python, TypeScript, Go).

- Built-in multi-tenancy.

- AOR (Acyclic Orchestration Runtime) — define steps like a DAG.

**Python example.**

```python
from hatchet_sdk import Hatchet, Context

hatchet = Hatchet()


# charge_card, reserve_inventory and ship_order are application helpers
# assumed to exist elsewhere.
@hatchet.workflow(on_events=["order:created"])
class OrderWorkflow:
    @hatchet.step(retries=3, timeout="1m")
    def charge(self, ctx: Context):
        order = ctx.workflow_input()
        return charge_card(order["id"], order["amount"])

    @hatchet.step(parents=["charge"])
    def reserve(self, ctx: Context):
        return reserve_inventory(ctx.workflow_input()["id"])

    @hatchet.step(parents=["reserve"])
    def ship(self, ctx: Context):
        return ship_order(ctx.workflow_input()["id"])


worker = hatchet.worker("orders-worker")
worker.register_workflow(OrderWorkflow())
worker.start()
```

Hatchet's pitch is irresistible for **teams already running Postgres**. No new infra; queue, background jobs, and workflows unified in one system. Plus it took hold quickly in Python-heavy areas (data and AI).
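The `parents=[...]` declarations in the example describe a dependency graph, and the engine's job is to turn that graph into an execution order. A minimal sketch of that idea with the standard library's `graphlib` follows; the `notify` step is a hypothetical addition to show two branches running side by side, and this is not how Hatchet itself schedules work.

```python
from graphlib import TopologicalSorter

# Mirror of the `parents=[...]` declarations above, plus a hypothetical
# "notify" step that only depends on "charge".
parents: dict[str, list[str]] = {
    "charge": [],
    "reserve": ["charge"],
    "ship": ["reserve"],
    "notify": ["charge"],
}


def execution_waves(graph: dict[str, list[str]]) -> list[list[str]]:
    """Group steps into waves; every step in a wave has all parents finished."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(parents)
```

Steps inside one wave are independent of each other, so a scheduler is free to hand them to different workers at the same time.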

Hatchet Cloud launched in 2026, adding a managed option.

Chapter 7 · Inngest — Step Functions, Developer-First

Inngest started in 2021 with the slogan "**Step Functions correctness for developers, without AWS lock-in**." Its TypeScript-first SDK is widely considered the most polished in the category.

The core abstraction is `step`. Inside a function you define durable units via `step.run`, `step.sleep`, `step.sendEvent`, `step.waitForEvent`. The Inngest engine persists each step's I/O and reuses cached values on replay.

**TypeScript example.**

```ts
import { Inngest } from 'inngest'

// The app id is a placeholder; sendEmail is an application helper assumed to exist.
const inngest = new Inngest({ id: 'my-app' })

export const onboarding = inngest.createFunction(
  { id: 'user-onboarding' },
  { event: 'user.signed_up' },
  async ({ event, step }) => {
    await step.run('send-welcome-email', () => sendEmail(event.data.email, 'welcome'))
    await step.sleep('wait-2-days', '2 days')
    await step.run('send-tips', () => sendEmail(event.data.email, 'tips'))

    const purchase = await step.waitForEvent('await-purchase', {
      event: 'order.completed',
      timeout: '7d',
      match: 'data.userId',
    })
    if (!purchase) {
      await step.run('reminder', () => sendEmail(event.data.email, 'reminder'))
    }
  }
)
```

This expresses a multi-day onboarding sequence inline: a two-day delay, then up to seven days waiting for a purchase. `step.waitForEvent` reads naturally as "wait for this event to arrive."
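The semantics of "wait for a matching event, or give up after a timeout" can be modeled in a few lines. This is a sketch of the idea only, under the assumption that the dotted `match` path must have the same value in the triggering event and the awaited event; `get_path`, `wait_for_event`, and the `ts` field are hypothetical and not part of Inngest.

```python
from typing import Any, Optional

Event = dict[str, Any]


def get_path(obj: Event, dotted: str) -> Any:
    """Resolve a dotted path such as 'data.userId'; missing keys yield None."""
    current: Any = obj
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def wait_for_event(
    trigger: Event,
    incoming: list[Event],
    name: str,
    match: str,
    started_at: float,
    timeout_s: float,
) -> Optional[Event]:
    """Return the first event inside the wait window that matches, else None."""
    expected = get_path(trigger, match)
    if expected is None:
        return None
    deadline = started_at + timeout_s
    for event in sorted(incoming, key=lambda e: e["ts"]):
        if not (started_at <= event["ts"] <= deadline):
            continue  # arrived before the wait began or after it timed out
        if event["name"] == name and get_path(event, match) == expected:
            return event
    return None  # timed out: the caller sees "no purchase"


DAY = 24 * 60 * 60
signup = {"name": "user.signed_up", "ts": 0, "data": {"userId": "u1"}}
events = [
    {"name": "order.completed", "ts": 1 * DAY, "data": {"userId": "u2"}},  # other user
    {"name": "order.completed", "ts": 3 * DAY, "data": {"userId": "u1"}},  # match
]

# The wait starts after the two-day sleep, as in the onboarding function above.
purchase = wait_for_event(signup, events, "order.completed", "data.userId", 2 * DAY, 7 * DAY)
too_late = wait_for_event(signup, events, "order.completed", "data.userId", 2 * DAY, 0.5 * DAY)
```

A `None` result corresponds to the `if (!purchase)` branch in the TypeScript example.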

Inngest's appeal is **five things in one** — event queue, cron, workflows, function hosting, observability. It replaces the BullMQ/Sidekiq + Temporal combo with one SaaS.

Pricing is per-step (function-step counted). Small startups can start free.

Chapter 8 · Trigger.dev v3 — The Open-Source Inngest Alternative

Trigger.dev started in 2022; v2 was SaaS-centric. **v3, released in 2024, was a major pivot.**

- **Open-sourced** under Apache 2.0 with self-hosting.

- **Container-based execution** — runs your code in Firecracker or Docker for isolation.

- **Checkpointing** — snapshot memory mid-job so long jobs can resume.

- **TypeScript-first.**

**Example.**

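```ts
import { task, wait } from '@trigger.dev/sdk/v3'

// sendEmail is an application helper assumed to exist.
export const onboarding = task({
  id: 'user-onboarding',
  maxDuration: 60 * 60 * 24 * 7, // 7d
  run: async (payload: { userId: string; email: string }) => {
    await sendEmail(payload.email, 'welcome')
    await wait.for({ days: 2 })
    await sendEmail(payload.email, 'tips')

    // NOTE: wait.forEvent is kept as written in the original example;
    // check the SDK version you use for the event-wait API it provides.
    const purchase = await wait.forEvent('order.completed', {
      timeout: '5d',
      filter: (e: any) => e.userId === payload.userId,
    })
    if (!purchase) {
      await sendEmail(payload.email, 'reminder')
    }
  },
})
```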

The differentiator is **checkpointing**. Call any npm library (Puppeteer for PDF generation, say) and the container can be paused and resumed mid-execution. A great fit for AI inference and other GPU-bound tasks.

Inngest vs Trigger.dev:

| Item | Inngest | Trigger.dev v3 |
| --- | --- | --- |
| License | Closed (some OSS core) | Apache 2.0 |
| Execution model | User function called over HTTP | Trigger container runs your code |
| Self-hosting | Limited | Fully supported |
| Pricing | Per step | Per compute time |
| AI / long jobs | Possible, but split into steps | Natural with checkpointing |

Chapter 9 · DBOS — Postgres Is the Runtime

DBOS was founded in 2023 by Mike Stonebraker's MIT group (PostgreSQL, Vertica, VoltDB). One-line summary: "**a runtime that persists program state into Postgres**."

The idea: persist every function call, transaction, and workflow step's inputs and outputs into Postgres. Then functions themselves become **transactional**, and crash recovery is guaranteed by Postgres transaction integrity.

**Python example.**

```python
from dbos import DBOS
from sqlalchemy import text

DBOS()


@DBOS.workflow()
def order_workflow(order_id: str, amount: int):
    payment_id = charge_step(order_id, amount)
    try:
        reserve_step(order_id)
        ship_step(order_id)
    except Exception:
        refund_step(payment_id)  # compensating action
        raise


@DBOS.transaction()
def charge_step(order_id: str, amount: int):
    DBOS.sql_session.execute(
        text("INSERT INTO payments(order_id, amount) VALUES (:o, :a)"),
        {"o": order_id, "a": amount},
    )
    return f"pay-{order_id}"


# reserve_step, ship_step and refund_step are omitted here; they are assumed
# to be defined the same way as charge_step.
```

`@DBOS.workflow()` and `@DBOS.transaction()` mark the durable units. Transactional activities write to Postgres in the same transaction as the workflow's state update — **DB consistency + workflow consistency unified**.
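The "same transaction" point is easier to see in raw SQL. The sketch below uses the standard library's `sqlite3` in place of Postgres, and the `step_outputs` table and function signature are hypothetical, not DBOS's actual schema: the business row and the step's recorded output are committed together, so recovery either finds both or neither.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE payments (order_id TEXT PRIMARY KEY, amount INTEGER);
    CREATE TABLE step_outputs (
        workflow_id TEXT, step_no INTEGER, output TEXT,
        PRIMARY KEY (workflow_id, step_no)
    );
    """
)


def charge_step(workflow_id: str, step_no: int, order_id: str, amount: int) -> str:
    """Run the business write and record the step output in ONE transaction."""
    row = db.execute(
        "SELECT output FROM step_outputs WHERE workflow_id = ? AND step_no = ?",
        (workflow_id, step_no),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # step already committed: reuse its output

    payment_id = f"pay-{order_id}"
    with db:  # commits both inserts together, or rolls both back on error
        db.execute(
            "INSERT INTO payments(order_id, amount) VALUES (?, ?)",
            (order_id, amount),
        )
        db.execute(
            "INSERT INTO step_outputs(workflow_id, step_no, output) VALUES (?, ?, ?)",
            (workflow_id, step_no, json.dumps(payment_id)),
        )
    return payment_id


first = charge_step("wf-1", 0, "o1", 10_000)
second = charge_step("wf-1", 0, "o1", 10_000)  # recovery re-runs the workflow
payment_rows = db.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

There is no window in which the payment exists but the workflow has forgotten it, which is the failure mode that separate "do the work, then record it" writes leave open.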

In 2026, DBOS is popular with teams who want to do everything **on Postgres only**. The downside is Postgres lock-in and the fact that Postgres can become the bottleneck at very high throughput.

Chapter 10 · Dapr Workflows — Kubernetes-Native Sidecar

Dapr is Microsoft's open-source microservices runtime (2019). Its hallmark is the **sidecar pattern**: a Dapr sidecar runs next to your app and abstracts state/queues/events. CNCF graduation came in 2024.

**Dapr Workflows** arrived in Dapr 1.10 (alpha) and became stable in 1.15 (2025). Internally it adopts the **DurableTask** library (the core of Azure Durable Functions).

Properties:

- **Language SDKs:** Go, .NET, Python, Java, JavaScript.

- **Backend abstraction:** state store can be Redis, Postgres, Cosmos DB, etc.

- **K8s-native** — Dapr runs as a sidecar inside Kubernetes.

**Python example.**

```python
from datetime import timedelta

from dapr.ext.workflow import WorkflowRuntime, DaprWorkflowContext, WorkflowActivityContext

wfr = WorkflowRuntime()


@wfr.workflow(name="order")
def order_workflow(ctx: DaprWorkflowContext, order: dict):
    payment = yield ctx.call_activity(charge, input=order)
    try:
        yield ctx.call_activity(reserve, input=order)
        # Timer based on the replay-safe workflow clock, not the wall clock.
        yield ctx.create_timer(fire_at=ctx.current_utc_datetime + timedelta(hours=48))
        yield ctx.call_activity(send_survey, input=order)
    except Exception:
        yield ctx.call_activity(refund, input=payment)  # compensating action
        raise


@wfr.activity(name="charge")
def charge(ctx: WorkflowActivityContext, order: dict):
    return charge_card_api(order["id"], order["amount"])


# The reserve, send_survey and refund activities are omitted; they are assumed
# to be registered the same way as charge.
```

Dapr Workflows shines when **you already run Dapr on K8s** — you can adopt workflows without standing up new infrastructure. Pick a state store, swap backends with the same code.

The weakness is that SDK maturity and observability still trail the dedicated engines.

Chapter 11 · AWS Step Functions — The Managed Standard

AWS Step Functions (2016) is the de-facto standard for managed workflows. **JSON-defined state machines** are the core.

Types:

| Type | Pricing | Use case |
| --- | --- | --- |
| Standard | Per state transition ($0.025/1000) | Long-running, up to 1 year, exactly-once |
| Express | Compute time + memory | Short jobs, up to 5 minutes, at-least-once |

**ASL (Amazon States Language) example.**

```json
{
  "StartAt": "ChargeCard",
  "States": {
    "ChargeCard": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "charge-card", "Payload.$": "$" },
      "Next": "Reserve"
    },
    "Reserve": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "reserve-inventory", "Payload.$": "$" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "Refund" }],
      "Next": "Wait48h"
    },
    "Wait48h": { "Type": "Wait", "Seconds": 172800, "Next": "Survey" },
    "Survey": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "send-survey", "Payload.$": "$" },
      "End": true
    },
    "Refund": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "refund-card", "Payload.$": "$" },
      "End": true
    }
  }
}
```

AWS lock-in is heavy but **operational overhead is essentially zero**. Direct integration with every AWS service (Lambda, DynamoDB, SQS, Bedrock, Athena, Glue) — one line of SDK to invoke.

The cost trap: Standard gets expensive fast for transition-heavy workflows. Millions of executions with dozens of steps can easily exceed several thousand dollars a month. Express + Map state cost optimization is standard.
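A back-of-the-envelope estimate makes the cost trap concrete. The calculation below uses only the Standard price quoted in the table above and deliberately ignores any free tier, Lambda charges, and regional price differences; the workload numbers are made-up examples.

```python
STANDARD_PRICE_PER_1000_TRANSITIONS = 0.025  # USD, from the pricing table above


def standard_monthly_cost(executions: int, transitions_per_execution: int) -> float:
    """Estimated monthly Standard Workflow cost in USD (transitions only)."""
    transitions = executions * transitions_per_execution
    return transitions / 1000 * STANDARD_PRICE_PER_1000_TRANSITIONS


# Hypothetical workloads, for illustration only.
small = standard_monthly_cost(executions=100_000, transitions_per_execution=10)
large = standard_monthly_cost(executions=5_000_000, transitions_per_execution=30)
```

The small workload is 1 million transitions (about $25 a month); the large one is 150 million transitions (about $3,750 a month). Cost scales with the product of executions and steps, so adding states to a hot workflow is what moves the bill.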

Chapter 12 · GCP Workflows + Azure Durable Functions

**Google Cloud Workflows** (2020):

- YAML or JSON definitions.

- HTTP calls + branching + parallelism.

- Native integration with Eventarc, Cloud Run, Cloud Functions.

- Per-step pricing.

- Weakness: state-machine expressiveness is simpler than Step Functions.

**Azure Durable Functions** (2017):

- An extension of Azure Functions.

- Write workflows like code in **C#, JavaScript, Python**.

- The original durable-execution offering — Temporal borrows some ideas from here.

- The OSS DurableTask library powers Dapr Workflows.

**TypeScript Durable Function example.**

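```ts
import * as df from 'durable-functions'

df.app.orchestration('orderWorkflow', function* (ctx) {
  const order = ctx.df.getInput()
  const payment = yield ctx.df.callActivity('chargeCard', order)
  try {
    yield ctx.df.callActivity('reserveInventory', order)
    // Orchestrators must be deterministic: use the replay-safe clock
    // (currentUtcDateTime) instead of Date.now().
    const surveyAt = new Date(ctx.df.currentUtcDateTime.getTime() + 48 * 3600 * 1000)
    yield ctx.df.createTimer(surveyAt)
    yield ctx.df.callActivity('sendSurvey', order)
  } catch {
    yield ctx.df.callActivity('refundCard', payment) // compensating action
    throw new Error('order failed')
  }
})
```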

Each `yield` creates a checkpoint. Cloud lock-in, but mature, and the standard pick if you're on Azure.
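How a `yield` becomes a checkpoint can be shown with a plain Python generator. In this sketch a hypothetical `drive` function plays the role of the engine: it feeds recorded results back into the generator during replay and only executes an activity when the history has no entry for it. The names and the in-memory `history` list are illustrative, not the DurableTask implementation.

```python
from typing import Any, Callable, Generator

Task = tuple[str, Any]  # (activity name, input)


def order_workflow(order: dict) -> Generator[Task, Any, dict]:
    payment = yield ("charge", order)    # checkpoint 1
    reserved = yield ("reserve", order)  # checkpoint 2
    return {"payment": payment, "reserved": reserved}


def drive(
    workflow: Callable[[dict], Generator[Task, Any, dict]],
    order: dict,
    history: list[Any],
    activities: dict[str, Callable[[dict], Any]],
) -> dict:
    """Run the orchestrator, replaying `history` before doing any new work."""
    gen = workflow(order)
    position = 0
    try:
        task = next(gen)  # run until the first yield
        while True:
            if position < len(history):
                result = history[position]  # replay: reuse the checkpointed result
            else:
                name, arg = task
                result = activities[name](arg)  # new work: execute, then checkpoint
                history.append(result)
            position += 1
            task = gen.send(result)  # resume the orchestrator after its yield
    except StopIteration as finished:
        return finished.value


executed: list[str] = []
inventory_service = {"down": True}


def charge(order: dict) -> str:
    executed.append("charge")
    return f"pay-{order['id']}"


def reserve(order: dict) -> bool:
    if inventory_service["down"]:
        raise ConnectionError("inventory service unavailable")
    executed.append("reserve")
    return True


activities = {"charge": charge, "reserve": reserve}
history: list[Any] = []
order = {"id": "o1"}

try:
    drive(order_workflow, order, history, activities)  # dies at the second yield
except ConnectionError:
    pass

inventory_service["down"] = False
result = drive(order_workflow, order, history, activities)  # replays, then continues
```

The second run starts the generator from the top, but `charge` is answered from history, which is why orchestrator code between yields has to be deterministic.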

Chapter 13 · Airflow + Dagster + Prefect — Data Engineering DAGs

The data-pipeline domain has its own DAG engines. Not the focus of this article, but a quick comparison.

| Engine | Started | Strength | Weakness |
| --- | --- | --- | --- |
| Apache Airflow | 2014 (Airbnb) | Most adopted, huge connector ecosystem | Python DAG declarations, dynamic workflows are awkward |
| Dagster | 2018 | Asset-graph centric, type system | Learning curve |
| Prefect | 2018 | Pythonic API, dynamic workflows | Smaller community |
| Argo Workflows | 2017 | K8s-native, strong for CI/CD | Pod overhead |

The decisive difference between data-engineering DAGs and durable execution:

- **DAG engines:** data-asset transformation graphs, time-based triggers at the center.

- **Durable execution:** business processes, mixed event/time/signal-driven code flow.

The 2026 division of labor: "**DAGs for data pipelines, durable execution for business logic.**"

Chapter 14 · Argo Workflows — Kubernetes CI/CD

Argo Workflows is a CNCF-graduated project open-sourced in 2017 by Applatix (acquired by Intuit). **Every step is a K8s Pod.**

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata: { generateName: build-and-deploy- }
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      steps:
        - - name: build
            template: build
        - - name: test
            template: test
        - - name: deploy
            template: deploy
    - name: build
      container: { image: golang:1.23, command: ["go", "build", "./..."] }
    - name: test
      container: { image: golang:1.23, command: ["go", "test", "./..."] }
    - name: deploy
      container: { image: argoproj/argocd:v2.12, command: ["argocd", "app", "sync", "main"] }
```

Argo Workflows is the **K8s-native CI/CD** standard. ML pipelines (Kubeflow Pipelines is Argo-based), build/deploy (paired with Argo CD), data pipelines — it does all of these.

Positioning vs durable execution: Argo is **container-orchestration workflows**, Temporal/Restate is **in-language durable execution**. Many shops use both in different spots.

Chapter 15 · Cloudflare Workflows — Workers + Durable Objects

Cloudflare Workflows shipped as a beta in September 2024. Core idea: **durable workflows on top of Cloudflare Workers**.

```ts
import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from 'cloudflare:workers'

// Env, chargeCard, reserveInventory, sendSurvey and refundCard are assumed
// to be defined elsewhere in the Worker.
export class OrderWorkflow extends WorkflowEntrypoint<Env, { orderId: string; amount: number }> {
  async run(event: WorkflowEvent<{ orderId: string; amount: number }>, step: WorkflowStep) {
    const payment = await step.do('charge', async () =>
      chargeCard(event.payload.orderId, event.payload.amount)
    )
    try {
      await step.do('reserve', async () => reserveInventory(event.payload.orderId))
      await step.sleep('wait 48h', '48 hours')
      await step.do('survey', async () => sendSurvey(event.payload.orderId))
    } catch (err) {
      await step.do('refund', async () => refundCard(payment)) // compensating action
      throw err
    }
  }
}
```

Edge compute + durable execution. Executions run on Cloudflare's global network so latency is low, and Durable Objects hold persistent state. General availability followed in 2025.

Chapter 16 · Conductor (Netflix → Orkes 2022) — JSON DSL

Conductor is the workflow engine Netflix open-sourced in 2016, built on a **JSON DSL**. The core team spun out as Orkes in 2022 and runs a managed offering.

```json
{
  "name": "order_workflow",
  "version": 1,
  "tasks": [
    { "name": "charge_card", "type": "SIMPLE", "taskReferenceName": "charge" },
    { "name": "reserve_inventory", "type": "SIMPLE", "taskReferenceName": "reserve" },
    {
      "name": "wait_48h", "type": "WAIT", "taskReferenceName": "wait48",
      "inputParameters": { "duration": "48h" }
    },
    { "name": "send_survey", "type": "SIMPLE", "taskReferenceName": "survey" }
  ]
}
```

Conductor's strength is **language neutrality**. JSON DSL + workers in any language. Netflix's internal battle-testing across thousands of workflows is real social proof.

The weakness is the separate DSL learning curve. In 2026 Conductor sees use in large enterprises (Netflix plus Walmart, Tesla, etc.), but startups lean toward Temporal/Inngest.

Chapter 17 · Zeebe + Camunda 8 — BPMN-Based

Camunda started in 2008. With Camunda 8 (2022) they adopted **Zeebe** (cloud-native, gRPC) as the primary engine.

Properties:

- **BPMN 2.0**: a visual notation business people can read.

- **DMN**: decision tables.

- **Clustering**: Zeebe broker + Elasticsearch.

BPMN is great for collaborating on workflows with business, marketing, and operations teams. Insurance, banking, and telecom adopt it heavily. If your engineers want to code workflows, Temporal/Restate is natural; if **the business reviews the diagram**, Camunda fits.

Chapter 18 · The Saga Pattern — Distributed Compensating Transactions

The default pattern for durable workflows is the **saga**.

| Step | Action | Compensation |
| --- | --- | --- |
| 1 | Charge | Refund |
| 2 | Reserve inventory | Restore inventory |
| 3 | Ship | Cancel shipment |
| 4 | Notify | (none) |

If step 3 fails, run compensating actions in reverse order: 2 then 1. A workflow engine expresses this clearly as `try/catch + compensation calls`.

**Saga in Temporal.**

```ts
import { proxyActivities } from '@temporalio/workflow'
import type * as activities from './activities' // path is an assumption

const { charge, refund, reserve, restore, ship, cancel, notify } =
  proxyActivities<typeof activities>({ startToCloseTimeout: '1 minute' })

export async function saga(orderId: string, amount: number) {
  // Compensations for the steps that have completed so far, in order.
  const compensations: Array<() => Promise<void>> = []
  try {
    await charge(orderId, amount)
    compensations.push(() => refund(orderId, amount))
    await reserve(orderId)
    compensations.push(() => restore(orderId))
    await ship(orderId)
    compensations.push(() => cancel(orderId))
    await notify(orderId)
  } catch (err) {
    // Undo completed steps in reverse order, then surface the failure.
    for (const compensate of compensations.reverse()) await compensate()
    throw err
  }
}
```

**Compensation idempotency is critical.** `refund(orderId, amount)` called twice must not refund twice.
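A small model shows both halves of that requirement: compensations run in reverse order, and a repeated compensation is harmless. Everything here is hypothetical scaffolding (`PaymentLedger`, `run_saga`, the in-memory log); a real payment provider would enforce the "refund once" rule on its side.

```python
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


class PaymentLedger:
    """Toy payment provider that remembers which orders were already refunded."""

    def __init__(self) -> None:
        self.charged: dict[str, int] = {}
        self.refunded: set[str] = set()

    def charge(self, order_id: str, amount: int) -> None:
        self.charged[order_id] = amount

    def refund(self, order_id: str) -> bool:
        """Refund at most once; returns True only when money actually moved."""
        if order_id not in self.charged or order_id in self.refunded:
            return False
        self.refunded.add(order_id)
        return True


def run_saga(steps: list[Step]) -> bool:
    completed: list[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)  # only completed steps get compensated
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return False
    return True


ledger = PaymentLedger()
log: list[str] = []


def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")


steps: list[Step] = [
    (lambda: ledger.charge("o1", 10_000), lambda: log.append(f"refund:{ledger.refund('o1')}")),
    (lambda: log.append("reserve"), lambda: log.append("restore")),
    (fail_shipping, lambda: log.append("cancel")),
]

succeeded = run_saga(steps)
refunded_again = ledger.refund("o1")  # a retried compensation must be a no-op
```

Shipping never completed, so `cancel` is not called; inventory is restored before the refund, mirroring the "2 then 1" order in the table.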

Chapter 19 · The Outbox Pattern and Idempotency Keys

Two correctness tools workflow engines either ride or solve.

**Outbox pattern** — to atomically tie a DB transaction to message dispatch, write an "events to send" record into an outbox table inside the same transaction as the business change. A separate dispatcher reads the outbox and emits to the queue. DBOS expresses this naturally; in Temporal you'd write to the outbox table inside an activity.

```sql
BEGIN;
UPDATE orders SET status='paid' WHERE id='o1';
INSERT INTO outbox(topic, payload) VALUES ('order.paid', '{"id":"o1"}');
COMMIT;
```
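The dispatcher side of the outbox can be sketched with `sqlite3` standing in for the production database. The `sent` column, the `mark_paid` and `dispatch_pending` functions, and the `publish` callback are assumptions made for this example; a real dispatcher would also need locking so that two dispatchers do not pick up the same row.

```python
import json
import sqlite3
from typing import Any, Callable

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT, payload TEXT, sent INTEGER DEFAULT 0
    );
    INSERT INTO orders VALUES ('o1', 'pending');
    """
)


def mark_paid(order_id: str) -> None:
    """Business change and outbox record commit together, or not at all."""
    with db:
        db.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        db.execute(
            "INSERT INTO outbox(topic, payload) VALUES (?, ?)",
            ("order.paid", json.dumps({"id": order_id})),
        )


def dispatch_pending(publish: Callable[[str, dict[str, Any]], None]) -> int:
    """Publish unsent rows in order; returns how many rows were dispatched."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        # A crash between publish and the UPDATE re-sends this row later:
        # delivery is at-least-once, so consumers must deduplicate.
        publish(topic, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


published: list[tuple[str, dict[str, Any]]] = []
mark_paid("o1")
first_pass = dispatch_pending(lambda topic, payload: published.append((topic, payload)))
second_pass = dispatch_pending(lambda topic, payload: published.append((topic, payload)))
```

The outbox removes the "DB committed but message lost" failure; it does not remove duplicates, which is where idempotency keys come in.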

**Idempotency keys** — pass a key with external API calls (especially payments) so the same request executed twice is processed once. Stripe, Adyen, and PayPal all support this.

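```bash
# STRIPE_SECRET_KEY is assumed to hold your (test) secret key.
curl -X POST https://api.stripe.com/v1/charges \
  -u "$STRIPE_SECRET_KEY:" \
  -H "Idempotency-Key: order-o1-charge" \
  -d amount=10000 -d currency=usd -d source=tok_visa
```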

Temporal, Restate, and Inngest naturally use workflow ID or step ID as the idempotency key. `Idempotency-Key: workflowId-stepName` is a common pattern.
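The `workflowId-stepName` convention can be illustrated against a stand-in for the payment provider. `FakePaymentAPI` is a hypothetical in-memory server that mimics the usual contract (same key, same response, no second charge); the key format simply follows the pattern named above.

```python
from typing import Any


class FakePaymentAPI:
    """In-memory stand-in for a provider that honors idempotency keys."""

    def __init__(self) -> None:
        self._responses: dict[str, dict[str, Any]] = {}
        self.charges_created = 0

    def create_charge(self, amount: int, idempotency_key: str) -> dict[str, Any]:
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # replay the stored response
        self.charges_created += 1
        response = {"charge_id": f"ch_{self.charges_created}", "amount": amount}
        self._responses[idempotency_key] = response
        return response


def idempotency_key(workflow_id: str, step_name: str) -> str:
    """Stable across retries of the same step, distinct across steps."""
    return f"{workflow_id}-{step_name}"


api = FakePaymentAPI()
key = idempotency_key("order-o1", "charge")

first = api.create_charge(10_000, key)
retry = api.create_charge(10_000, key)  # e.g. the activity timed out and was retried
other_step = api.create_charge(500, idempotency_key("order-o1", "gift-wrap"))
```

The key must come from values that survive a restart (workflow ID, step name), never from a fresh random value generated inside the retried call.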

Chapter 20 · Exponential Backoff Retries and Dead Letters

A standard feature in every engine: **exponential backoff retries**. Temporal RetryPolicy example.

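```ts
import { proxyActivities } from '@temporalio/workflow'
import type * as activities from './activities' // path is an assumption

const { charge } = proxyActivities<typeof activities>({
  startToCloseTimeout: '30 seconds',
  retry: {
    initialInterval: '1 second',
    backoffCoefficient: 2,       // each wait doubles
    maximumInterval: '1 minute', // cap on a single wait
    maximumAttempts: 5,
    nonRetryableErrorTypes: ['ValidationError'],
  },
})
```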

After five failed attempts the activity fails and the error propagates to the workflow. Unless the workflow code handles it, the workflow ends in a failed state, and an operator restarts it by hand or moves it to a dead-letter queue. Inngest provides automatic dead-lettering out of the box.

Transient errors (timeouts, network failures) are worth retrying; **permanent errors** (validation, bad input) fail the same way on every attempt. Separating them via `nonRetryableErrorTypes` is standard.
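The schedule such a policy produces is simple to compute: each wait is the initial interval times the coefficient raised to the number of previous retries, capped at the maximum. The sketch below also shows the transient/permanent split; `backoff_delays`, `call_with_retry`, and the local `ValidationError` class are illustrative names, not an engine API, and the `sleep` hook is a no-op so the example runs instantly.

```python
from typing import Any, Callable, Optional


def backoff_delays(initial_s: float, coefficient: float, maximum_s: float, max_attempts: int) -> list[float]:
    """Waits between attempts; N attempts have N - 1 waits between them."""
    return [min(initial_s * coefficient ** k, maximum_s) for k in range(max_attempts - 1)]


class ValidationError(Exception):
    """Permanent failure: retrying cannot help."""


def call_with_retry(
    fn: Callable[[], Any],
    delays: list[float],
    non_retryable: tuple[type[Exception], ...] = (ValidationError,),
    sleep: Callable[[float], None] = lambda seconds: None,
) -> tuple[Any, int]:
    """Return (result, attempts used); re-raise when retries are exhausted."""
    attempts = 0
    schedule: list[Optional[float]] = [*delays, None]  # None marks the last attempt
    for delay in schedule:
        attempts += 1
        try:
            return fn(), attempts
        except non_retryable:
            raise  # permanent error: fail immediately
        except Exception:
            if delay is None:
                raise  # out of attempts
            sleep(delay)
    raise AssertionError("unreachable")


delays = backoff_delays(initial_s=1.0, coefficient=2.0, maximum_s=60.0, max_attempts=5)

failures_left = {"count": 2}


def flaky_charge() -> str:
    if failures_left["count"] > 0:
        failures_left["count"] -= 1
        raise TimeoutError("gateway timeout")
    return "charged"


result, attempts = call_with_retry(flaky_charge, delays)
```

With the policy above the waits are 1, 2, 4, and 8 seconds, so the one-minute cap only starts to matter once `maximumAttempts` is raised to 8 or more.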

Chapter 21 · AI Agent Orchestration — LangGraph, Claude, OpenAI Agents

The fastest-growing workflow use case in 2025 and 2026 is **AI agents**. Multi-step LLM pipelines hit every appeal of durable execution — per-step cost, per-step failures, idempotency.

**LangGraph** (LangChain sister project, 2024+). Graph-based agent workflows. Nodes and edges form a state machine.

```python
from langgraph.graph import StateGraph, END
from typing import TypedDict


class State(TypedDict):
    question: str
    answer: str


# rag() and review() are application functions assumed to exist elsewhere.
graph = StateGraph(State)
graph.add_node("retrieve", lambda s: {"answer": rag(s["question"])})
graph.add_node("review", lambda s: {"answer": review(s["answer"])})
graph.add_edge("retrieve", "review")
graph.add_edge("review", END)
graph.set_entry_point("retrieve")

app = graph.compile()
```

LangGraph itself is "graph definition + in-memory runtime," but **LangGraph Cloud** provides checkpoint storage and resume — moving it closer to durable execution.

**Anthropic Claude Agent SDK** (2025). Tool use + handoffs. Pair it with Temporal/Restate to wrap each tool call as an activity, and you get natural durable agents.

**OpenAI Agents SDK** (2025). Similar handoff model. One agent hands work to another.

Common thread: the SDKs themselves give weak durability guarantees, which is why the pattern of **Temporal/Restate/Inngest as the persistence backend** has become standard.

Chapter 22 · Comparison Matrix — Pick the Right Tool

| Engine | OSS | Operational complexity | SDKs | Strength | Weakness |
| --- | --- | --- | --- | --- | --- |
| Temporal | Yes (MIT) | Medium-high | Go/Java/Py/.NET/TS/PHP/Ruby | Most powerful, battle-tested at scale | Cassandra/PG ops burden, Cloud can be pricey |
| Restate | Yes (BSL) | Low | Go/Java/TS/Py/Rust/Kotlin | Simple ops, virtual objects | Young, smaller ecosystem |
| Hatchet | Yes (MIT) | Low | Py/TS/Go | Postgres-only simplicity | Young, limited scale validation |
| Inngest | Partial | Managed | TS/Py/Go | Best-in-class DX | Self-hosting limited |
| Trigger.dev v3 | Yes (Apache 2.0) | Medium | TS | OSS + checkpointing | Container overhead |
| DBOS | Yes (MIT) | Low | Py/TS | Postgres transaction integration | PG-dependent, throughput ceiling |
| Dapr Workflows | Yes (Apache 2.0) | Medium | Go/.NET/Py/Java/JS | K8s + Dapr ecosystem | SDK maturity behind |
| AWS Step Functions | No | Managed | JSON DSL | AWS integration, zero ops | AWS lock-in, cost |
| Azure Durable | No | Managed | C#/JS/Py/Java | Azure integration | Azure lock-in |
| GCP Workflows | No | Managed | YAML | GCP integration | Lower expressiveness |
| Cadence | Yes (MIT) | Medium-high | Go/Java | Uber-validated | Losing ground to Temporal |
| Conductor | Yes (Apache 2.0) | Medium | Language-neutral workers | JSON DSL, polyglot | DSL learning |
| Camunda 8 / Zeebe | Partial | Medium-high | BPMN + clients | BPMN, business collaboration | Heavier footprint |
| Cloudflare Workflows | No | Managed | TS | Edge + Workers | Young, short-task focus |

**One-line guide:**

- "Battle-tested and most powerful" -> Temporal.

- "Simpler ops, similar expressiveness" -> Restate.

- "Postgres only, please" -> Hatchet or DBOS.

- "TypeScript, managed, fast onboarding" -> Inngest.

- "OSS managed alternative, long/AI jobs" -> Trigger.dev v3.

- "AWS-only, zero ops" -> Step Functions.

- "K8s + Dapr environment" -> Dapr Workflows.

- "BPMN business collaboration" -> Camunda 8.

Chapter 23 · Korean Big-Tech Adoption

**Toss** — built an internal workflow engine for durable financial transactions and shared Temporal adoption experience at a 2024 conference (work going back to 2023). Saga patterns power their payments, settlement, and transfer domains.

**Coupang** — order pipelines mix an internal workflow engine with AWS Step Functions. Newer microservices are evaluating Temporal.

**Naver** — internal Naver Cloud automation uses Argo Workflows + bespoke engines. Search and shopping batches still run on in-house distributed job systems.

**NCsoft** — game state machines and guild events as long-running async work; shared experience evaluating Cadence/Temporal-style engines.

**LINE Plus / LY** — Kafka + bespoke workflow engines for messaging, payments, ads. Some domains have Temporal POCs underway.

**Kakao** — KakaoTalk notification and payment pipelines. Strong in-house async infrastructure; new domains are evaluating Temporal and Inngest.

Common thread: **payments and orders are the first domains to evaluate durable execution**. Transaction consistency + compensation + operational observability all need solving at once.

Chapter 24 · Japanese Big-Tech Adoption

**Mercari** — payments and transfers running on in-house workflow engines with some Temporal POCs. Among the most microservice-mature Japanese tech companies.

**LINE Yahoo (LY Corporation)** — post-merger unified order/payment pipelines use Kafka + workflow engines, mixing in-house and partial Temporal usage.

**Rakuten** — supply chain, subscription, and points systems run on a combination of AWS Step Functions and Apache Airflow, with new domains evaluating Temporal Cloud.

**Recruit** — Indeed, SUUMO, and other subsidiaries vary in choice. Some teams run Inngest, others Airflow + custom DAGs.

**SmartHR / freee** — SaaS HR and accounting actively evaluating Temporal and Inngest to unify async jobs and workflows.

**NTT Data** — high BPMN/Camunda adoption for enterprise integration.

Distinctive trait of the Japanese market: BPMN (Camunda) adoption in enterprises is higher than in Korea or the US. In modern SaaS and startups, Temporal and Inngest are gaining ground quickly.

Chapter 25 · Getting Started — Your First Durable Workflow in 30 Minutes

Fastest start: Temporal CLI + TypeScript example.

1. Start the Temporal Server

brew install temporal

temporal server start-dev

2. New project

npx @temporalio/create@latest hello-temporal

cd hello-temporal

3. Worker + client

npm install

npm run start.watch # worker

npm run workflow # run the workflow

Inngest is even faster.

npx inngest-cli@latest dev

In another terminal

mkdir hello-inngest && cd $_

npm init -y

npm install inngest

Restate.

docker run -d --name restate -p 8080:8080 -p 9070:9070 docker.restate.dev/restatedev/restate:latest

npm install @restatedev/restate-sdk

All three get you to a first workflow within 30 minutes. Then pick one of your gnarliest async flows ("charge -> refund," "signup -> trial expiry," "AI indexing -> review") and port it.

Chapter 26 · Anti-Patterns and Common Mistakes

Common mistakes when using durable workflows.

**1. Making everything a workflow.** Simple background jobs (single email sends) only need BullMQ/Sidekiq/Celery. Workflow engines fit "multi-step + time-crossing + compensation required."

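**2. Breaking determinism.** Calling `Date.now()`, `Math.random()`, or `fetch(...)` directly inside workflow code can break replay, because a re-execution sees different values than the ones the original run acted on. Route non-deterministic work through `proxyActivities` / `ctx.run` / `step.run` so the result is recorded once and read back on replay. (Temporal's TypeScript SDK swaps in deterministic versions of `Date` and `Math.random` inside its workflow sandbox, but network calls still belong in activities.)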
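The determinism rule can be illustrated with a toy record-and-replay context. This is a minimal sketch, not the API of Temporal, Restate, or Inngest: `ReplayContext`, its `run` method, and `discountWorkflow` are hypothetical names, and the journal is a plain in-memory array.

```typescript
// Toy model of record-and-replay. All names here are illustrative assumptions.
type Journal = unknown[];

class ReplayContext {
  private cursor = 0;
  private readonly journal: Journal;

  constructor(journal: Journal) {
    this.journal = journal;
  }

  // First execution: run the side effect and record its result.
  // Replay: return the recorded result without re-running the side effect.
  run<T>(effect: () => T): T {
    if (this.cursor < this.journal.length) {
      return this.journal[this.cursor++] as T;
    }
    const result = effect();
    this.journal.push(result);
    this.cursor++;
    return result;
  }
}

// Workflow code: every non-deterministic value goes through ctx.run.
function discountWorkflow(ctx: ReplayContext): string {
  const roll = ctx.run(() => Math.random());
  const issuedAt = ctx.run(() => Date.now());
  return roll < 0.5 ? `coupon-A@${issuedAt}` : `coupon-B@${issuedAt}`;
}

const journal: Journal = [];
const firstRun = discountWorkflow(new ReplayContext(journal));
// Simulated crash and restart: a fresh context replays the same journal.
const replayed = discountWorkflow(new ReplayContext(journal));
console.log(firstRun === replayed); // true
```

If `discountWorkflow` called `Math.random()` directly, the replayed run could take the other branch and diverge from what was already persisted.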

**3. Passing huge payloads.** Workflow inputs and outputs are persisted in the event log. MB-scale payloads explode the log. Pass S3 keys and read inside the activity.
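A rough sketch of the "pass a key, not the payload" idea follows. The `blobStore` map is a hypothetical in-memory stand-in for S3, and `importWorkflow`, `countLinesActivity`, and the event shapes are assumptions made for illustration only.

```typescript
// Hypothetical in-memory stand-in for S3 or another object store.
const blobStore = new Map<string, string>();

function putBlob(blobKey: string, body: string): string {
  blobStore.set(blobKey, body);
  return blobKey;
}

// Activity: receives only the key and loads the payload itself.
function countLinesActivity(blobKey: string): number {
  const body = blobStore.get(blobKey);
  if (body === undefined) throw new Error(`missing blob: ${blobKey}`);
  return body.split("\n").length;
}

// Workflow: only the short key and the small result enter the event log.
function importWorkflow(eventLog: string[], blobKey: string): number {
  eventLog.push(JSON.stringify({ type: "ActivityScheduled", input: blobKey }));
  const lines = countLinesActivity(blobKey);
  eventLog.push(JSON.stringify({ type: "ActivityCompleted", result: lines }));
  return lines;
}

const bigCsv = Array.from({ length: 100000 }, (_, i) => `row-${i}`).join("\n");
const uploadKey = putBlob("uploads/import-1.csv", bigCsv);
const eventLog: string[] = [];
const lineCount = importWorkflow(eventLog, uploadKey);
const loggedBytes = eventLog.reduce((sum, entry) => sum + entry.length, 0);

console.log(lineCount, bigCsv.length > loggedBytes);
```

The payload is close to a megabyte, while the two log entries together stay around a hundred characters.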

**4. Infinite loops inside workflows.** Don't `while(true)`-poll. Use Temporal's `continueAsNew`, Restate's new sessions, or Inngest's new events to refresh the workflow.
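The refresh idea behind `continueAsNew` can be sketched with a toy driver: each run handles a bounded batch, then hands its state to a fresh run with an empty history. `runOnce`, `execute`, and `MAX_EVENTS_PER_RUN` are hypothetical names and the budget of 100 is an arbitrary assumption, not a limit of any real engine.

```typescript
type RunResult =
  | { kind: "done"; total: number }
  | { kind: "continueAsNew"; cursor: number; total: number };

const MAX_EVENTS_PER_RUN = 100; // assumed history budget, for illustration only

// One workflow run: processes items until the history budget is used up,
// then returns its carried-over state instead of looping forever.
function runOnce(items: number[], cursor: number, total: number, history: string[]): RunResult {
  while (cursor < items.length) {
    if (history.length >= MAX_EVENTS_PER_RUN) {
      return { kind: "continueAsNew", cursor, total };
    }
    total += items[cursor];
    history.push(`processed:${cursor}`);
    cursor++;
  }
  return { kind: "done", total };
}

// Toy engine: starts a new run with a fresh, empty history on each hand-off.
function execute(items: number[]): { total: number; runs: number; longestHistory: number } {
  let cursor = 0;
  let total = 0;
  let runs = 0;
  let longestHistory = 0;
  for (;;) {
    const history: string[] = [];
    const result = runOnce(items, cursor, total, history);
    runs++;
    longestHistory = Math.max(longestHistory, history.length);
    if (result.kind === "done") {
      return { total: result.total, runs, longestHistory };
    }
    cursor = result.cursor;
    total = result.total;
  }
}

const summary = execute(Array.from({ length: 250 }, (_, i) => i + 1));
console.log(summary); // { total: 31375, runs: 3, longestHistory: 100 }
```

No single run ever holds more than the budgeted number of events, which is the property an unbounded `while(true)` loop loses.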

**5. Missing compensations.** The essence of saga is compensation. Always answer "what if the compensation itself fails?"
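One possible answer to "what if the compensation itself fails?" is bounded retry followed by escalation to a human. The sketch below assumes synchronous steps and hypothetical names (`SagaStep`, `runSaga`, `needsManualRepair`); a real engine would persist each step and retry with backoff.

```typescript
interface SagaStep {
  name: string;
  action: () => void;
  compensate: () => void;
}

interface SagaOutcome {
  ok: boolean;
  compensated: string[];
  needsManualRepair: string[];
}

// Retry a compensation a bounded number of times before giving up.
function compensateWithRetry(step: SagaStep, maxAttempts: number): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      step.compensate();
      return true;
    } catch {
      // Swallow the error and try again until attempts run out.
    }
  }
  return false;
}

function runSaga(steps: SagaStep[], maxCompensationAttempts = 3): SagaOutcome {
  const completed: SagaStep[] = [];
  try {
    for (const step of steps) {
      step.action();
      completed.push(step);
    }
    return { ok: true, compensated: [], needsManualRepair: [] };
  } catch {
    const compensated: string[] = [];
    const needsManualRepair: string[] = [];
    // Undo completed steps in reverse order.
    for (const step of [...completed].reverse()) {
      if (compensateWithRetry(step, maxCompensationAttempts)) {
        compensated.push(step.name);
      } else {
        needsManualRepair.push(step.name);
      }
    }
    return { ok: false, compensated, needsManualRepair };
  }
}

let releaseAttempts = 0;
const outcome = runSaga([
  {
    name: "reserveInventory",
    action: () => {},
    compensate: () => {
      releaseAttempts++;
      if (releaseAttempts === 1) throw new Error("transient failure");
    },
  },
  {
    name: "chargeCard",
    action: () => {},
    compensate: () => {
      throw new Error("refund API down");
    },
  },
  {
    name: "shipOrder",
    action: () => {
      throw new Error("carrier rejected");
    },
    compensate: () => {},
  },
]);
console.log(outcome);
```

Here the inventory release succeeds on its second attempt, while the refund exhausts its retries and lands in `needsManualRepair`, which is the list an operator or dead-letter process would pick up.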

**6. Signal storms.** Signals arrive at unpredictable times and in an order you don't control, and each one adds to the event history. Design handlers to be idempotent and to tolerate duplicates and reordering.
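A minimal sketch of a signal handler that tolerates duplicates and reordering, assuming each signal carries a unique `id` and a sender-assigned sequence number `seq`. Both fields and the `OrderedSignalInbox` class are hypothetical; real engines expose signals through their own APIs.

```typescript
interface Signal {
  id: string;
  seq: number;
  amount: number;
}

class OrderedSignalInbox {
  private seen = new Set<string>();
  private pending = new Map<number, Signal>();
  private nextSeq = 1;
  applied: string[] = [];
  balance = 0;

  // Duplicate deliveries are dropped; early arrivals wait for their turn.
  onSignal(signal: Signal): void {
    if (this.seen.has(signal.id)) return;
    this.seen.add(signal.id);
    this.pending.set(signal.seq, signal);

    let next = this.pending.get(this.nextSeq);
    while (next !== undefined) {
      this.balance += next.amount;
      this.applied.push(next.id);
      this.pending.delete(this.nextSeq);
      this.nextSeq++;
      next = this.pending.get(this.nextSeq);
    }
  }
}

const inbox = new OrderedSignalInbox();
inbox.onSignal({ id: "s2", seq: 2, amount: 20 }); // arrives early, buffered
inbox.onSignal({ id: "s1", seq: 1, amount: 10 }); // unblocks s1 then s2
inbox.onSignal({ id: "s1", seq: 1, amount: 10 }); // duplicate, ignored
inbox.onSignal({ id: "s3", seq: 3, amount: 5 });

console.log(inbox.applied, inbox.balance); // [ 's1', 's2', 's3' ] 35
```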

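**7. Ignoring managed cost.** Temporal Cloud bills per action, and Step Functions Standard workflows bill per state transition. A single workflow polling once per second makes about 2.6 million polls a month, and across a fleet of concurrent workflows that can easily run thousands of dollars per month.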
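The polling arithmetic is easy to check before committing to a design. In this sketch the price of 25 USD per million actions, the two actions per poll, and the fleet of 50 workflows are placeholder assumptions, not quoted prices; substitute the numbers from your provider's current pricing page.

```typescript
const SECONDS_PER_DAY = 86400;

// Billable actions generated by a fleet of polling workflows in one month.
function monthlyActions(
  pollsPerSecond: number,
  actionsPerPoll: number,
  workflows: number,
  days = 30,
): number {
  return pollsPerSecond * SECONDS_PER_DAY * days * actionsPerPoll * workflows;
}

// Cost at a flat per-million rate (the rate itself is an assumption).
function monthlyCostUsd(actions: number, usdPerMillionActions: number): number {
  return (actions / 1_000_000) * usdPerMillionActions;
}

// One workflow, one action per poll: the raw poll count for a 30-day month.
const singleWorkflowPolls = monthlyActions(1, 1, 1);

// Hypothetical fleet: 50 workflows, each poll costing 2 billable actions.
const fleetActions = monthlyActions(1, 2, 50);
const fleetCost = monthlyCostUsd(fleetActions, 25);

console.log(singleWorkflowPolls); // 2592000
console.log(fleetActions, Math.round(fleetCost)); // 259200000 6480
```

Replacing the once-per-second poll with a timer or an inbound event removes most of those actions, which is usually the cheaper fix than negotiating price.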

Chapter 27 · Closing — "Workflows As Code"

The 2026 trajectory of workflow engines.

1. **Code-first convergence.** JSON/YAML/BPMN DSLs are receding, and the "write workflows as ordinary code, let the engine persist" paradigm is dominant.

2. **Developer-first.** The rise of Inngest, Trigger.dev, and Hatchet shows that DX is a decisive differentiator.

3. **AI agent coupling.** LangGraph, the Claude Agent SDK, and the OpenAI Agents SDK all need durable execution backends for long-running agents.

4. **Operational simplicity competition.** Restate, DBOS, and Hatchet all sell "simpler to operate than Temporal" as a core pitch.

5. **Cloud managed maturity.** Temporal Cloud, Inngest Cloud, Restate Cloud, Trigger.dev Cloud — managed options proliferate.

For teams adopting for the first time, this is the playbook:

1. Pick your single most painful async business process (often payments or onboarding).

2. Start managed with Temporal Cloud or Inngest.

3. Write workflow definitions as code, keep activities small and idempotent.

4. Design saga + idempotency keys + compensations explicitly.

5. Verify operational observability (workflow state, retries, dead letters).
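Steps 3 and 4 of the playbook hinge on idempotent activities. As a minimal sketch, the activity below derives its idempotency key from the workflow ID and step name so that an engine-driven retry cannot charge twice. `FakePaymentProvider`, `chargeActivity`, and the key format are hypothetical; a real provider such as Stripe accepts the key through its own API.

```typescript
interface Charge {
  chargeId: string;
  amountCents: number;
}

// Hypothetical in-memory provider that honors idempotency keys.
class FakePaymentProvider {
  private byKey = new Map<string, Charge>();
  callCount = 0;

  charge(idempotencyKey: string, amountCents: number): Charge {
    this.callCount++;
    const existing = this.byKey.get(idempotencyKey);
    if (existing !== undefined) return existing; // replay the stored result
    const created: Charge = { chargeId: `ch_${this.byKey.size + 1}`, amountCents };
    this.byKey.set(idempotencyKey, created);
    return created;
  }

  get storedCharges(): number {
    return this.byKey.size;
  }
}

// Activity: the key is stable across retries of the same workflow step.
function chargeActivity(provider: FakePaymentProvider, workflowId: string, amountCents: number): Charge {
  return provider.charge(`${workflowId}:charge`, amountCents);
}

const provider = new FakePaymentProvider();
const firstAttempt = chargeActivity(provider, "order-42", 1999);
// Simulated retry after a worker crash that lost the first response.
const retryAttempt = chargeActivity(provider, "order-42", 1999);

console.log(firstAttempt.chargeId === retryAttempt.chargeId, provider.storedCharges); // true 1
```

The provider was called twice but only one charge exists, which is the property the saga design in step 4 relies on.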

A workflow engine isn't "more infrastructure" — it's a **new abstraction that lets you express business logic better**. Used well, the hardest parts of a distributed system become "just code."

References

1. [Temporal — Open Source Durable Execution Platform](https://temporal.io/)

2. [Temporal Cloud Documentation](https://docs.temporal.io/cloud)

3. [Temporal Server 1.25 Release Notes](https://github.com/temporalio/temporal/releases)

4. [Cadence Workflow — Uber Open Source](https://cadenceworkflow.io/)

5. [Restate — Distributed Application Toolkit](https://restate.dev/)

6. [Hatchet — Postgres-backed Workflow Engine](https://hatchet.run/)

7. [Inngest — Reliable Background Jobs and Workflows](https://www.inngest.com/)

8. [Trigger.dev v3 — Open Source Background Jobs](https://trigger.dev/)

9. [DBOS — Database-Backed Durable Programs](https://www.dbos.dev/)

10. [Dapr Workflows Documentation](https://docs.dapr.io/developing-applications/building-blocks/workflow/)

11. [AWS Step Functions Documentation](https://docs.aws.amazon.com/step-functions/)

12. [Azure Durable Functions Overview](https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-overview)

13. [Google Cloud Workflows](https://cloud.google.com/workflows)

14. [Cloudflare Workflows Documentation](https://developers.cloudflare.com/workflows/)

15. [Conductor / Orkes](https://orkes.io/)

16. [Camunda 8 Documentation](https://docs.camunda.io/)

17. [Apache Airflow](https://airflow.apache.org/)

18. [Dagster](https://dagster.io/)

19. [Prefect](https://www.prefect.io/)

20. [Argo Workflows](https://argoproj.github.io/workflows/)

21. [Saga Pattern — microservices.io](https://microservices.io/patterns/data/saga.html)

22. [Outbox Pattern — microservices.io](https://microservices.io/patterns/data/transactional-outbox.html)

23. [Stripe API — Idempotency](https://docs.stripe.com/api/idempotent_requests)

24. [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)

25. [Anthropic Claude Agent SDK](https://docs.anthropic.com/en/docs/agents)

26. [OpenAI Agents SDK](https://github.com/openai/openai-agents-python)

27. [Maxim Fateev — Designing Temporal (QCon)](https://www.infoq.com/presentations/temporal-design/)

28. [Mercari Engineering Blog — Microservices](https://engineering.mercari.com/en/)
