
Prologue — A Function That Refuses to Die for 8 Hours

Spring 2026, an engineer at an AI startup posts on Slack.

"The agent was processing 50 PR reviews. It hit OpenAI 429 on the 14th. The container OOMed. If we re-run it, it re-does runs 1 through 13. Our LLM bill doubles."

That sentence is the one-liner for the 2026 workflow engine renaissance. **AI agents are long-running.** A single task runs for 8 hours, calls 100 external APIs along the way, and one of them fails. Containers die from OOM, redeploy, or preemption — anytime. But the business demands: "Don't start from step 1, resume from step 14."

The technology that answers this demand is **durable execution.** Even if a function dies, durable storage remembers the state of the last step, and on restart the workflow replays deterministically up to that point and continues from there. Uber's Cadence pioneered the approach in 2016, Temporal forked it in 2019 and popularized it as open source, and from 2021 to 2024 Inngest, Trigger.dev, Hatchet, Restate, and DBOS entered the market, each from a different angle.

This post draws the map of workflow engines as of 2026: durable execution semantics; the architectural differences across Temporal, Inngest, Trigger.dev, Hatchet, Restate, and DBOS; the place of AWS Step Functions, Airflow, Prefect, and Dagster; saga patterns, idempotency, and exactly-once semantics; and real cases from Korea and Japan.

Chapter 1 · What Is Durable Execution — A One-Sentence Definition

First, one sentence. **Durable execution is an execution model where the state of a function's run is persisted to durable external storage, so that even if the process dies, it resumes deterministically from its last progress point.**

Unpacked, three pillars emerge.

1. **Event sourcing** — Every time the function executes a step, the result is appended to an event log. Not the function's code, but the history of the function's execution, is the source of truth.

2. **Deterministic replay** — When the function dies, a new worker re-runs the same code from the beginning. But external calls aren't actually re-invoked — their results are fetched from the event log. State is restored without external side effects.

3. **Idempotency** — Running the same operation twice should produce the same result. External side effects like payments are protected by idempotency keys.

When these three combine, the magic of "the function pretends it didn't die" becomes possible. That's why durable execution is sometimes called "**a function that behaves like a virtual machine**" — the real machine dies, but the virtual machine lives forever.

A decisive constraint follows. **Workflow functions must be deterministic.** Non-deterministic operations like `Math.random()`, `Date.now()`, reading environment variables, or calling external APIs directly must go through deterministic SDK wrappers. Otherwise, replay diverges down a different path.
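To make the constraint concrete, here is a minimal sketch of what such a deterministic wrapper does under the hood. `DeterministicContext`, `Recorded`, and `pickVariant` are hypothetical names invented for illustration, not the API of any real SDK, and the in-memory array stands in for the persisted event log.

```typescript
// Hypothetical sketch, not the API of any real SDK.
type Recorded = { kind: 'random' | 'now'; value: number }

class DeterministicContext {
  private cursor = 0
  private readonly eventLog: Recorded[]

  // `eventLog` stands in for the persisted log; pass the same array to replay.
  constructor(eventLog: Recorded[]) {
    this.eventLog = eventLog
  }

  private recordOrReplay(kind: Recorded['kind'], produce: () => number): number {
    const recorded = this.eventLog[this.cursor]
    this.cursor++
    if (recorded !== undefined) {
      if (recorded.kind !== kind) {
        throw new Error(`non-deterministic replay: log has ${recorded.kind}, code asked for ${kind}`)
      }
      return recorded.value // replay: reuse the value from the first run
    }
    const value = produce() // first run: produce the value and record it
    this.eventLog.push({ kind, value })
    return value
  }

  random(): number {
    return this.recordOrReplay('random', () => Math.random())
  }

  now(): number {
    return this.recordOrReplay('now', () => Date.now())
  }
}

// Workflow code may branch on randomness, but only through the context.
function pickVariant(ctx: DeterministicContext): string {
  return ctx.random() < 0.5 ? 'A' : 'B'
}

const eventLog: Recorded[] = []
const firstRun = pickVariant(new DeterministicContext(eventLog))
const replayed = pickVariant(new DeterministicContext(eventLog))
```

Because the second run reads the recorded value instead of calling `Math.random()` again, `replayed` always takes the same branch as `firstRun`. Calling `Math.random()` directly would skip the log and let the two runs disagree.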

Chapter 2 · Four Major Branches of Workflow Engines — The 2026 Map

The 2026 workflow market splits into roughly four branches.

- **Durable execution SDKs** — Temporal, Cadence, Restate, DBOS. Write workflows as code, persist via deterministic replay. Strong for long-running, complex branching.

- **Event-driven function platforms** — Inngest, Trigger.dev. Events invoke functions; `step.run` inside the function becomes a durable boundary. DX-friendly, strong for AI agents and SaaS workflows.

- **Task queues + workflow** — Hatchet, Celery+orchestration. Queues built atop Postgres or Redis — the evolution of the simple queue. Strengths are simplicity and low operational cost.

- **DAG orchestration (batch ETL)** — Airflow, Prefect, Dagster, Argo Workflows. Data-pipeline-centric, scheduled DAG execution. A different category from durable execution.

Beyond these, BPMN-based engines (Camunda) and microservice orchestration (Netflix Conductor) hold their own seats. AWS Step Functions occupies a unique spot — state-machine DSL based, "durable but JSON, not code."


Chapter 3 · Temporal Architecture — The Evolution of Uber Cadence

Temporal is both the company and the open-source engine started in 2019 by the Uber Cadence team (Maxim Fateev, Samar Abbas). As of 2026, it has effectively become the de facto standard for durable execution.

The Temporal server consists of four core services.

- **Frontend** — The entry point for all client and worker RPCs. A gRPC gateway.

- **History** — Stores and mutates the workflow execution event log. The heaviest component.

- **Matching** — Matches tasks to workers via task queues.

- **Worker** — An internal worker that runs system workflows (scheduling, archival).

These four services share a persistent store like Cassandra, MySQL, or Postgres. Workflow execution events (`WorkflowExecutionStarted`, `ActivityTaskCompleted`, and so on) go into the History service; workers poll Matching for tasks.

```typescript
// Temporal TypeScript SDK: payment saga workflow
import { proxyActivities, defineSignal, setHandler, sleep } from '@temporalio/workflow'
// Assumed module layout: the activity implementations live in ./activities
import type * as activities from './activities'

const { chargeCard, reserveInventory, shipOrder, refundCard, releaseInventory } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '1 minute',
    retry: { maximumAttempts: 3, initialInterval: '1s' },
  })

export const cancelOrderSignal = defineSignal('cancelOrder')

export async function processOrder(orderId: string, amount: number): Promise<string> {
  let cancelled = false
  setHandler(cancelOrderSignal, () => { cancelled = true })

  const chargeId = await chargeCard(orderId, amount)
  try {
    const reservationId = await reserveInventory(orderId)
    await sleep('5 seconds') // human-in-the-loop confirmation window
    // Check after the window so a cancel signal received during the wait is honored
    if (cancelled) throw new Error('cancelled-after-reserve')
    const trackingId = await shipOrder(orderId, reservationId)
    return trackingId
  } catch (err) {
    // Compensating actions (saga pattern)
    await releaseInventory(orderId).catch(() => {})
    await refundCard(chargeId, amount)
    throw err
  }
}
```

The magic of this code is that `sleep('5 seconds')` doesn't actually hold a worker for 5 seconds. The workflow state is persisted to History, and 5 seconds later when the timer fires, a new worker resumes from that point. A payment workflow could wait days without consuming worker memory.
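The idea of a timer that costs no worker time can be sketched as follows. This is a simplified in-memory model, not Temporal's implementation: `TimerStore`, `schedule`, and `due` are hypothetical names, and a real server would keep these rows in its persistent store.

```typescript
// Hypothetical sketch: a timer is a persisted row, not a blocked thread.
type Timer = { workflowId: string; fireAt: number }

class TimerStore {
  private timers: Timer[] = []

  // Called when workflow code reaches sleep(): persist the timer, release the worker.
  schedule(workflowId: string, now: number, delayMs: number): void {
    this.timers.push({ workflowId, fireAt: now + delayMs })
  }

  // Called periodically by the server: returns the workflows that are ready to resume.
  due(now: number): string[] {
    const ready = this.timers.filter((t) => t.fireAt <= now).map((t) => t.workflowId)
    this.timers = this.timers.filter((t) => t.fireAt > now)
    return ready
  }
}

const store = new TimerStore()
store.schedule('order-42', 0, 5000)
const beforeFire = store.due(4999) // nothing is ready yet
const afterFire = store.due(5000)  // 'order-42' is handed to whichever worker is free
```

Between `schedule` and `due`, no worker holds any memory for `order-42`. The only cost of waiting is one stored row, which is why the wait can be five seconds or five days.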

Chapter 4 · Deterministic Replay — How a Function Pretends Not to Die

The core magic of Temporal, Cadence, and Restate is deterministic replay. Here's how it works step by step.

1. The workflow starts. A worker runs the code.

2. A call to `chargeCard(orderId, amount)` happens. The SDK intercepts it and records an `ActivityTaskScheduled` event to History.

3. When the activity completes, `ActivityTaskCompleted` + the result are recorded to History.

4. The SDK returns the result to the workflow code.

5. The next step proceeds. `reserveInventory(...)` works the same way.

6. **The worker dies.** OOM, redeploy, preemption. Whatever.

7. A new worker picks up the same workflow ID. It runs the code from the beginning.

8. The call to `chargeCard(...)` happens again, but the SDK sees the result already exists in History and returns it immediately. The card is not actually charged.

9. `reserveInventory(...)` is replayed the same way.

10. Once it reaches the last incomplete step, real execution resumes from there.

For this scenario to work, **the workflow code itself must be deterministic.** Same input → same branching → same call order. If you use `Math.random()` directly, replay yields a different value, and the branch diverges. That's why Temporal SDKs provide deterministic helpers like `workflow.uuid4()` and `workflow.now()`.
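The ten steps above can be sketched in a few dozen lines. This is a toy model under stated assumptions: `ReplayContext` and `JournalEntry` are hypothetical names, steps are synchronous for brevity, and the journal is an in-memory array where a real engine would use durable storage.

```typescript
// Hypothetical in-memory sketch of replay; real engines persist the journal durably.
type JournalEntry = { step: string; result: unknown }

class ReplayContext {
  private cursor = 0
  private readonly journal: JournalEntry[]

  constructor(journal: JournalEntry[]) {
    this.journal = journal
  }

  // Run `fn` only if this position in the journal has no recorded result yet.
  step<T>(stepName: string, fn: () => T): T {
    const recorded = this.journal[this.cursor]
    this.cursor++
    if (recorded !== undefined) {
      if (recorded.step !== stepName) throw new Error(`non-deterministic replay at ${stepName}`)
      return recorded.result as T // replay: no side effect happens
    }
    const result = fn() // real execution
    this.journal.push({ step: stepName, result })
    return result
  }
}

let charges = 0
let crashOnce = true

function processOrder(ctx: ReplayContext): string {
  const chargeId = ctx.step('chargeCard', () => { charges++; return 'ch_1' })
  const reservationId = ctx.step('reserveInventory', () => {
    if (crashOnce) { crashOnce = false; throw new Error('worker died') }
    return 'res_1'
  })
  return ctx.step('shipOrder', () => `track:${chargeId}:${reservationId}`)
}

const journal: JournalEntry[] = []
let trackingId = ''
try {
  trackingId = processOrder(new ReplayContext(journal)) // first worker dies mid-run
} catch {
  trackingId = processOrder(new ReplayContext(journal)) // new worker starts from line one
}
```

The second run executes `processOrder` from the top, yet `charges` stays at 1 because `chargeCard` is answered from the journal. The step-name check is the toy version of the non-determinism errors that real SDKs raise when replay and history disagree.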

This constraint is the steepest part of the durable-execution learning curve. Most people start out asking "can't I just write regular code?" and only understand why the SDKs are so strict after being burned once by a replay that took a different branch.

Chapter 5 · Inngest — "Function as Event Handler" Model

Inngest (founded 2021 by Tony Holdstock-Brown and Dan Farrelly) attacks durable execution from a different angle. **It abolishes the concept of a "workflow."** Instead, **functions** are first-class citizens, and `step.run`, `step.sleep`, and `step.waitForEvent` inside the function are the durable boundaries.

```typescript
// Inngest: post-payment processing
import { Inngest } from 'inngest'
// db, emails, and loyalty are application modules assumed to exist elsewhere

const inngest = new Inngest({ id: 'commerce' })

export const onPaymentSucceeded = inngest.createFunction(
  { id: 'on-payment-succeeded', retries: 3 },
  { event: 'stripe/payment.succeeded' },
  async ({ event, step }) => {
    const order = await step.run('load-order', async () => {
      return await db.orders.findById(event.data.orderId)
    })

    await step.run('send-receipt', async () => {
      await emails.sendReceipt(order)
    })

    // 24h later, request a review. Inngest persists the function for 24h
    await step.sleep('wait-1-day', '1 day')

    await step.run('send-review-request', async () => {
      await emails.sendReviewRequest(order)
    })

    // Bonus points if the user submits a review within 7 days
    const review = await step.waitForEvent('wait-for-review', {
      event: 'review/submitted',
      match: 'data.orderId',
      timeout: '7 days',
    })

    if (review) {
      await step.run('grant-bonus-points', async () => {
        await loyalty.grantPoints(order.userId, 100)
      })
    }
  }
)
```

The aesthetics of Inngest are twofold. First, **events are the entry point** — instead of explicitly starting a workflow, events automatically wake the function. Second, **DX is friendly** — `step.run` looks like an ordinary async function, and the SDK hides most of the determinism constraints.
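The event matching behind a call like `step.waitForEvent` can be sketched as a registry of paused runs. This is an illustrative model, not Inngest's implementation: `WaitRegistry`, `Waiter`, `AppEvent`, and `readPath` are hypothetical names, and time is passed in explicitly to keep the sketch deterministic.

```typescript
// Hypothetical sketch of wait-for-event matching; not Inngest's implementation.
type AppEvent = { name: string; data: Record<string, unknown> }
type Waiter = {
  runId: string     // the paused function run
  eventName: string // event to wait for, e.g. 'review/submitted'
  matchPath: string // dotted path that must be equal in both events
  expected: unknown // value of matchPath in the event that started the run
  deadline: number  // timestamp after which the wait resolves to null
}

// Resolve a dotted path such as 'data.orderId' against an event.
function readPath(event: AppEvent, path: string): unknown {
  let current: unknown = event
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return current
}

class WaitRegistry {
  private waiters: Waiter[] = []

  register(waiter: Waiter): void {
    this.waiters.push(waiter)
  }

  // An event arrived: return the runs to resume with it.
  deliver(event: AppEvent, now: number): string[] {
    const matches = (w: Waiter) =>
      w.eventName === event.name &&
      now <= w.deadline &&
      readPath(event, w.matchPath) === w.expected
    const resumed = this.waiters.filter(matches).map((w) => w.runId)
    this.waiters = this.waiters.filter((w) => !matches(w))
    return resumed
  }

  // Time passed: return the runs whose wait timed out (they resume with null).
  expire(now: number): string[] {
    const timedOut = this.waiters.filter((w) => now > w.deadline).map((w) => w.runId)
    this.waiters = this.waiters.filter((w) => now <= w.deadline)
    return timedOut
  }
}

const registry = new WaitRegistry()
const sevenDays = 7 * 24 * 60 * 60 * 1000
registry.register({ runId: 'run-1', eventName: 'review/submitted', matchPath: 'data.orderId', expected: 'o-1', deadline: sevenDays })
registry.register({ runId: 'run-2', eventName: 'review/submitted', matchPath: 'data.orderId', expected: 'o-2', deadline: sevenDays })

const resumed = registry.deliver({ name: 'review/submitted', data: { orderId: 'o-1' } }, 1000)
const timedOut = registry.expire(sevenDays + 1)
```

Only `run-1` resumes with the review event, because its order ID matches. `run-2` keeps waiting and is released with a null result once the seven-day deadline passes, which is the `if (review)` branch in the function above.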

But there are tradeoffs. Inngest defaults to managed SaaS (a self-host option was added in 2024), and it feels less natural than Temporal for complex branching, signals, and child workflows. It's strong for SaaS, AI agents, and event-driven systems; Temporal fits better for huge payment systems.

Chapter 6 · Trigger.dev v3 — Vercel-Friendly Durable Execution

Trigger.dev started in 2022 and went through a major redesign in v3 in 2024. Until v2, it sat in the "Vercel Cron replacement" niche; in v3, it evolved into a true durable execution engine. The core is the **task model** — similar to Inngest functions but with a more explicit SDK.

```typescript
// Trigger.dev v3: AI agent task
import { task, logger, wait } from '@trigger.dev/sdk/v3'
import Anthropic from '@anthropic-ai/sdk'
// fetchPRDiff and postReviewComment are application helpers assumed to exist elsewhere

const anthropic = new Anthropic()

export const reviewPullRequestTask = task({
  id: 'review-pull-request',
  maxDuration: 1800, // 30 minutes
  retry: { maxAttempts: 3 },
  run: async (payload: { repo: string; prNumber: number }, { ctx }) => {
    logger.info('Reviewing PR', payload)

    const diff = await fetchPRDiff(payload.repo, payload.prNumber)

    const review = await anthropic.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 4096,
      messages: [
        { role: 'user', content: `Review this diff:\n${diff}` },
      ],
    })

    await wait.for({ seconds: 10 }) // backoff before posting

    await postReviewComment(payload.repo, payload.prNumber, review.content)

    return { tokensUsed: review.usage.input_tokens + review.usage.output_tokens }
  },
})
```

Trigger.dev v3's strength is that **infrastructure comes along** — tasks run isolated inside Docker containers, and long-running AI agents get explicit memory and CPU limits. It pairs naturally with Vercel; the Next.js + Trigger.dev combo is common in 2026 indie SaaS.

The weak side: it doesn't go as deep on distributed-system abstractions as Temporal; self-hosting is possible but operationally complex. It fits best for workloads that are "long but simple branching" — AI agents and media processing.

Chapter 7 · Hatchet — The Rise of the Postgres-Native Queue

Hatchet (founded 2024) is one of the most interesting newcomers in the 2026 market. The message is simple: **"One Postgres box is enough."** No separate infrastructure like Kafka, Cassandra, or Redis — Postgres alone implements the durable queue and workflow.

```python
# Hatchet Python SDK: async task pipeline
from hatchet_sdk import Hatchet, Context
from pydantic import BaseModel

# download_to_s3, whisper_transcribe, and claude_summarize are application
# helpers assumed to exist elsewhere

hatchet = Hatchet()


class ProcessVideoInput(BaseModel):
    video_url: str
    user_id: str


@hatchet.workflow(on_events=["video:uploaded"], input_validator=ProcessVideoInput)
class ProcessVideo:
    @hatchet.step(timeout="5m")
    def download(self, ctx: Context) -> dict:
        payload = ProcessVideoInput.model_validate(ctx.workflow_input())
        path = download_to_s3(payload.video_url)
        return {"path": path}

    @hatchet.step(parents=["download"], timeout="30m")
    def transcribe(self, ctx: Context) -> dict:
        path = ctx.step_output("download")["path"]
        transcript = whisper_transcribe(path)
        return {"transcript": transcript}

    @hatchet.step(parents=["transcribe"], timeout="10m")
    def summarize(self, ctx: Context) -> dict:
        transcript = ctx.step_output("transcribe")["transcript"]
        summary = claude_summarize(transcript)
        return {"summary": summary}
```

Hatchet's appeal is **operational simplicity**: one Postgres plus a few worker containers. Backup, replication, and HA leverage the Postgres ecosystem directly. Since 2024, it has been a flag-bearer for the "**Postgres is the future of all queues**" wave, alongside Postgres-backed queues like Graphile Worker, pg-boss, and River.

The downside: at truly massive scale (tens of thousands of jobs per second), Postgres can become a bottleneck, and complex workflow patterns like Temporal's child workflows, signals, and queries aren't all there yet.

Chapter 8 · Restate — Stateful Invocations

Restate (founded 2023 by Stephan Ewen, Igal Shilman, and others from Apache Flink) brings yet another angle to durable execution. **Stateful invocations** — function calls behave like stateful objects.

```typescript
// Restate: payment processing as a virtual object
import * as restate from '@restatedev/restate-sdk'
// stripe is a payment client assumed to be configured elsewhere

const paymentService = restate.object({
  name: 'payment',
  handlers: {
    process: async (ctx: restate.ObjectContext, amount: number) => {
      // Object state, persisted for this paymentId (the object key)
      const status = (await ctx.get<string>('status')) ?? 'pending'
      if (status === 'completed') return { idempotent: true }

      ctx.set('status', 'processing')
      const chargeId = await ctx.run('charge', () => stripe.charge(amount))
      ctx.set('chargeId', chargeId)
      ctx.set('status', 'completed')
      return { chargeId }
    },

    refund: async (ctx: restate.ObjectContext) => {
      const chargeId = await ctx.get<string>('chargeId')
      if (!chargeId) throw new restate.TerminalError('no-charge-to-refund')
      await ctx.run('refund', () => stripe.refund(chargeId))
      ctx.set('status', 'refunded')
    },
  },
})

restate.endpoint().bind(paymentService).listen()
```

Restate's differentiator is **per-object state** — calls against the same paymentId are automatically serialized and share persistent state. Patterns that Temporal handles via workflow + signal combinations are simply object methods in Restate.
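What "automatically serialized" buys you can be seen in a small sketch. This is not Restate's implementation: `KeyedExecutor`, `processPayment`, and the in-memory `paymentState` map are hypothetical stand-ins, with a promise chain per key playing the role of the per-object queue.

```typescript
// Hypothetical sketch of per-key serialization; not Restate's implementation.
class KeyedExecutor {
  // Tail of the call chain for each key.
  private readonly tails = new Map<string, Promise<unknown>>()

  // Calls sharing a key run one after another; different keys do not wait on each other.
  run<T>(key: string, handler: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve()
    const next = previous.then(handler)
    // Store a tail that never rejects, so one failed call does not block the key forever.
    this.tails.set(key, next.catch(() => undefined))
    return next
  }
}

const paymentState = new Map<string, string>()
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Same shape as the `process` handler above: check state, do slow work, record completion.
async function processPayment(paymentId: string): Promise<string> {
  if (paymentState.get(paymentId) === 'completed') return 'already-completed'
  paymentState.set(paymentId, 'processing')
  await delay(10) // stands in for the call to the payment provider
  paymentState.set(paymentId, 'completed')
  return 'charged'
}

const executor = new KeyedExecutor()
// Two concurrent requests for the same payment, e.g. a double-clicked button.
const outcome = Promise.all([
  executor.run('pay-1', () => processPayment('pay-1')),
  executor.run('pay-1', () => processPayment('pay-1')),
])
```

Without the executor, both calls would read the state before either finished and both would charge. With per-key serialization the second call starts only after the first has completed, so it sees `completed` and returns early.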

Another strength is the **operational model.** The Restate server is a single binary that persists state via its own RocksDB. You don't need to set up Cassandra or MySQL separately, as you would with Temporal. The barrier to adopting durable execution as a small team is low.

Chapter 9 · DBOS — Postgres Is the Compute (Mike Stonebraker)

DBOS (founded 2023, co-founded by database legend Mike Stonebraker and Matei Zaharia) takes the most radical stance. **"The DB should be the OS."** Not only the state of workflows, but the workflow execution itself happens as transactions inside Postgres.

```python
# DBOS Python: transactions fused with workflows
from dbos import DBOS
from sqlalchemy import text

# sendgrid stands for an email client assumed to be configured elsewhere


@DBOS.workflow()
def process_signup(email: str) -> str:
    user_id = create_user(email)
    send_welcome_email(user_id)
    grant_trial_credits(user_id)
    return user_id


@DBOS.transaction()
def create_user(email: str) -> str:
    # This function runs inside a Postgres transaction
    row = DBOS.sql_session.execute(
        text("INSERT INTO users (email) VALUES (:email) RETURNING id"),
        {"email": email},
    ).fetchone()
    return row[0]


@DBOS.step()
def send_welcome_email(user_id: str) -> None:
    # External call: once the step completes, its result is checkpointed and
    # recovery does not run it again; a crash mid-step can still repeat the call
    sendgrid.send_email(user_id, "Welcome!")


@DBOS.transaction()
def grant_trial_credits(user_id: str) -> None:
    DBOS.sql_session.execute(
        text("INSERT INTO credits (user_id, amount) VALUES (:user_id, :amount)"),
        {"user_id": user_id, "amount": 100},
    )
```

DBOS's elegance is that **transactional consistency and workflow durability live in one system.** In Temporal, activities and DB transactions are different systems, which makes exactly-once execution hard; in DBOS, both happen in a single transaction. As one would expect from a system built by a database-academia luminary, the semantics are crisp.

The flip side: it's a new paradigm with a small ecosystem (Python and TypeScript SDKs primarily), and not every workload fits the transactional model. Serious production use cases are growing in 2026, but it hasn't reached Temporal's maturity.

Chapter 10 · AWS Step Functions — The Place of JSON State Machines

Step Functions (released by AWS in 2016) is one of the originators of the durable execution market, but it has a different shape than the others. Workflows are written in **Amazon States Language (ASL)** — a JSON DSL. Not code, but a state-machine definition file.

```yaml
# Step Functions: Standard-mode saga (expressed in YAML)
StartAt: ChargeCard
States:
  ChargeCard:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:chargeCard
    Retry:
      - ErrorEquals: ["States.TaskFailed"]
        IntervalSeconds: 2
        MaxAttempts: 3
    Catch:
      - ErrorEquals: ["States.ALL"]
        Next: FailOrder
    Next: ReserveInventory
  ReserveInventory:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:reserveInventory
    Catch:
      - ErrorEquals: ["States.ALL"]
        Next: RefundCard
    Next: ShipOrder
  ShipOrder:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:shipOrder
    Catch:
      - ErrorEquals: ["States.ALL"]
        Next: RefundAndRelease
    End: true
  RefundAndRelease:
    Type: Parallel
    Branches:
      # State names must be unique across the whole state machine, so the
      # branch state is named differently from the top-level RefundCard
      - StartAt: RefundCardInBranch
        States:
          RefundCardInBranch:
            Type: Task
            Resource: arn:aws:lambda:us-east-1:123456789012:function:refundCard
            End: true
      - StartAt: ReleaseInventory
        States:
          ReleaseInventory:
            Type: Task
            Resource: arn:aws:lambda:us-east-1:123456789012:function:releaseInventory
            End: true
    Next: FailOrder
  RefundCard:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:refundCard
    Next: FailOrder
  FailOrder:
    Type: Fail
    Cause: OrderProcessingFailed
```
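To see what "a state-machine definition, not code" means in practice, here is a toy interpreter for a tiny subset of these semantics. It is a sketch under loud assumptions: `execute`, `Machine`, and the `Run` callbacks are hypothetical, `Catch` is reduced to a single target state, and Retry, Parallel, and input/output handling are left out entirely.

```typescript
// Hypothetical sketch: a tiny subset of state-machine semantics, not the ASL spec.
type TaskState = {
  Type: 'Task'
  Run: () => void // stands in for the Lambda behind Resource
  Next?: string
  End?: boolean
  Catch?: string  // simplified: real ASL uses a list of catchers with ErrorEquals
}
type FailState = { Type: 'Fail'; Cause: string }
type Machine = { StartAt: string; States: Record<string, TaskState | FailState> }
type Outcome = { status: 'SUCCEEDED' | 'FAILED'; visited: string[] }

function execute(machine: Machine): Outcome {
  const visited: string[] = []
  let current: string | undefined = machine.StartAt
  while (current !== undefined) {
    const state = machine.States[current]
    if (state === undefined) throw new Error(`unknown state: ${current}`)
    visited.push(current)
    if (state.Type === 'Fail') return { status: 'FAILED', visited }

    let threw = false
    try {
      state.Run()
    } catch {
      threw = true
    }

    if (threw) {
      if (state.Catch === undefined) return { status: 'FAILED', visited }
      current = state.Catch // error path: jump to the catcher's target
    } else if (state.End === true) {
      return { status: 'SUCCEEDED', visited }
    } else {
      current = state.Next // happy path: follow Next
    }
  }
  throw new Error('state has neither Next nor End')
}

const orderMachine: Machine = {
  StartAt: 'ChargeCard',
  States: {
    ChargeCard: { Type: 'Task', Run: () => {}, Next: 'ReserveInventory', Catch: 'FailOrder' },
    ReserveInventory: {
      Type: 'Task',
      Run: () => { throw new Error('out of stock') },
      Next: 'ShipOrder',
      Catch: 'RefundCard',
    },
    ShipOrder: { Type: 'Task', Run: () => {}, End: true },
    RefundCard: { Type: 'Task', Run: () => {}, Next: 'FailOrder' },
    FailOrder: { Type: 'Fail', Cause: 'OrderProcessingFailed' },
  },
}
const outcome = execute(orderMachine)
```

With the inventory step failing, the run visits `ChargeCard`, `ReserveInventory`, `RefundCard`, and `FailOrder` in that order. All the branching lives in the data, which is what makes the definition easy to visualize and hard to express complex logic in.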

Step Functions has two modes to distinguish.

- **Standard**: durable workflows. Executions can run for up to a year and are billed per state transition. Suited to low-volume, high-value workflows such as payments and approvals.

- **Express**: ephemeral workflows. Executions can run for up to 5 minutes and are billed by execution time and memory. Suited to high-volume, short-lived event processing.

Step Functions' strength is the **depth of AWS integration**: direct integration with around 200 services such as Lambda, DynamoDB, SQS, and EventBridge. Its weaknesses are that a JSON DSL is less expressive than code, debugging is harder, and you are locked in to AWS.

In 2026, outside AWS-only organizations, preference trends toward code-based engines (Temporal, Inngest, Trigger.dev), but for teams that already live entirely on AWS, Step Functions remains a powerful option.

Chapter 11 · Apache Airflow vs Prefect vs Dagster — The Place of DAG Orchestration

On the data-pipeline side, there are engines with a different worldview. **DAG orchestration.**

- **Apache Airflow** (Airbnb, 2014) — industry standard. DAGs in Python. Scheduler + worker architecture. Criticisms: macro environment (Jinja) dependence, complex backfill semantics.

- **Prefect** (2018) — started from criticism of Airflow. "Code-first" DAGs, more imperative model, strong cloud option. Prefect 2/3 introduced flow and task abstractions.

- **Dagster** (Elementl, 2019) — asset-centric model. Not a DAG, but a declaration of "which asset depends on which asset." Deep integration with ecosystems like dbt and Snowflake.

- **Argo Workflows** — Kubernetes-native. DAGs defined as CRDs, executed per pod. Common in ML pipelines.


Key distinction: **this category is not a durable-execution SDK.** They started as batch-job schedulers, and when a job fails, it typically re-runs from the beginning. They fit "100 jobs that move data daily at 03:00" better than "one function that runs for hours like an AI agent."
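The scheduling model these tools share can be sketched as a dependency-ordered run plan. This is a minimal illustration using a layered variant of Kahn's algorithm, not any scheduler's actual code; `runOrder`, `Dag`, and the task names are made up for the example.

```typescript
// Hypothetical sketch of DAG scheduling (layered Kahn's algorithm); task names are made up.
type Dag = Record<string, string[]> // task -> upstream tasks it depends on

function runOrder(dag: Dag): string[] {
  // For each task, track which upstream tasks have not finished yet.
  const pending = new Map<string, Set<string>>()
  for (const task of Object.keys(dag)) {
    pending.set(task, new Set(dag[task]))
  }

  const order: string[] = []
  while (pending.size > 0) {
    // A task is ready once all of its upstream tasks have run.
    const ready = Array.from(pending.keys())
      .filter((task) => pending.get(task)!.size === 0)
      .sort()
    if (ready.length === 0) throw new Error('cycle or missing upstream task')
    for (const task of ready) {
      order.push(task)
      pending.delete(task)
    }
    // Mark the finished tasks as satisfied for everything still waiting.
    pending.forEach((upstream) => ready.forEach((task) => upstream.delete(task)))
  }
  return order
}

const nightly: Dag = {
  extract: [],
  transform: ['extract'],
  load: ['transform'],
  report: ['transform'],
}
const schedule = runOrder(nightly)
```

The unit of recovery here is a whole task such as `transform`, not a step inside a function. That is the practical difference from durable execution, where the journal lets a single long function resume mid-way.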

In 2026, the boundaries are blending — Prefect expands to general workflows, and Dagster's asset model grows powerful in ML pipelines. But OLTP workflows like payment sagas remain Temporal and Inngest territory.

Chapter 12 · Saga Pattern — A Pragmatic Answer to Distributed Transactions

One of the biggest problems of the microservice era: **how to consistently handle a transaction that spans multiple services.** Payment in one service, inventory in another, shipping in a third — you can't wrap this in an ACID transaction.

The answer is the **saga pattern**, introduced by Hector Garcia-Molina and Kenneth Salem in 1987. Break a large transaction into a sequence of small local transactions, and on failure, execute **compensating transactions** in reverse order.

```typescript
// Payment saga, expressed as a Temporal workflow
// (the activities are assumed to be proxied with proxyActivities in the same file)
export async function orderSaga(orderId: string, amount: number) {
  const completed: Array<() => Promise<void>> = []

  try {
    const chargeId = await chargeCard(orderId, amount)
    completed.push(() => refundCard(chargeId, amount))

    const reservationId = await reserveInventory(orderId)
    completed.push(() => releaseInventory(reservationId))

    const trackingId = await shipOrder(orderId, reservationId)
    completed.push(() => cancelShipment(trackingId))

    await sendConfirmation(orderId, trackingId)
    return { chargeId, trackingId }
  } catch (err) {
    // Execute compensating transactions in reverse order
    for (const compensate of completed.reverse()) {
      await compensate().catch((e) => console.error('compensation failed', e))
    }
    throw err
  }
}
```

Two **saga variants** to distinguish.

- **Orchestration saga** — a central coordinator (a workflow) calls each step. Branching lives in one place, easy to trace. The natural model for Temporal and Inngest.

- **Choreography saga** — each service publishes/subscribes to events and proceeds autonomously. No central coordinator. Requires an event bus like Kafka or EventBridge. Loose coupling, but tracing and debugging are hard.

The 2026 consensus: **orchestration saga is the default choice**, because a durable-execution engine runs the coordinator safely. Choreography is worth it only in truly autonomous microservice organizations; for most teams, the tracing burden outweighs the benefit.

Chapter 13 · Idempotency and Exactly-Once Semantics

One of the most dangerous promises in a distributed-systems book is "exactly-once." Strictly speaking, exactly-once delivery in a distributed system is **impossible**: over an unreliable network, the sender can never be certain that the message arrived (the Two Generals Problem), so it must either risk losing the message or risk sending it twice.

The pragmatic answer is **effectively-once.** Even if the operation is tried multiple times, the system state looks as if it happened once. The two tools that guarantee this are **idempotency** and **deduplication.**

```sql
-- Postgres idempotency-key table: prevent duplicate payments
CREATE TABLE payment_attempts (
  idempotency_key TEXT PRIMARY KEY,
  charge_id TEXT,
  amount BIGINT NOT NULL,
  status TEXT NOT NULL CHECK (status IN ('processing', 'succeeded', 'failed')),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Idempotent charge: calling twice with the same key still charges once
CREATE OR REPLACE FUNCTION charge_with_idempotency(
  p_idempotency_key TEXT,
  p_amount BIGINT
) RETURNS TABLE(charge_id TEXT, was_new BOOLEAN) AS $$
DECLARE
  v_existing payment_attempts%ROWTYPE;
  v_charge_id TEXT;
BEGIN
  -- Check whether the key was already processed
  SELECT * INTO v_existing FROM payment_attempts
  WHERE idempotency_key = p_idempotency_key FOR UPDATE;

  IF FOUND THEN
    RETURN QUERY SELECT v_existing.charge_id, FALSE;
    RETURN;
  END IF;

  -- First time: record, then call the external API.
  -- If two callers race past the check above, the PRIMARY KEY makes the
  -- second INSERT fail with a unique violation instead of charging twice.
  INSERT INTO payment_attempts(idempotency_key, amount, status)
  VALUES (p_idempotency_key, p_amount, 'processing');

  -- stripe_charge is a hypothetical function standing in for the external API call
  v_charge_id := stripe_charge(p_amount);

  UPDATE payment_attempts
  SET charge_id = v_charge_id, status = 'succeeded', updated_at = NOW()
  WHERE idempotency_key = p_idempotency_key;

  RETURN QUERY SELECT v_charge_id, TRUE;
END;
$$ LANGUAGE plpgsql;
```

Durable-execution engines deduplicate their own steps automatically. A Temporal activity is identified by its `ActivityId`; if the result is already in History, replay does not re-invoke it. The idempotency of the external calls themselves (Stripe, SendGrid, and so on) is the user's responsibility. Stripe's API, for example, accepts an `Idempotency-Key` header; where a provider offers no such mechanism, you have to keep your own deduplication record. The durable execution engine cannot substitute for external systems' guarantees.

Core principle: **attach an idempotency key to every call with external side effects.** Durable execution + idempotency keys = effectively-once. Either alone is not safe.
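A small sketch shows why the key has to be derived from stable identifiers. `FakePaymentProvider` and `idempotencyKeyFor` are hypothetical names invented for illustration; the fake provider only imitates the general behavior of a key-deduplicating API and is not a model of any specific vendor.

```typescript
// Hypothetical sketch: FakePaymentProvider imitates a provider that deduplicates on a key.
class FakePaymentProvider {
  private readonly byKey = new Map<string, { chargeId: string; amount: number }>()
  chargesMade = 0

  charge(idempotencyKey: string, amount: number): string {
    const existing = this.byKey.get(idempotencyKey)
    if (existing !== undefined) {
      if (existing.amount !== amount) throw new Error('key reused with different parameters')
      return existing.chargeId // retry: return the original result, charge nothing
    }
    this.chargesMade++
    const chargeId = `ch_${this.chargesMade}`
    this.byKey.set(idempotencyKey, { chargeId, amount })
    return chargeId
  }
}

// Build the key from stable identifiers only, never from a clock or a random value,
// so that every retry of the same step produces the same key.
function idempotencyKeyFor(workflowId: string, stepName: string): string {
  return `${workflowId}:${stepName}`
}

const provider = new FakePaymentProvider()
const chargeKey = idempotencyKeyFor('order-42', 'chargeCard')

// Attempt 1 reaches the provider, but the worker dies before the result is recorded.
const firstChargeId = provider.charge(chargeKey, 5000)
// Attempt 2 is the engine's retry of the same step, carrying the same key.
const retryChargeId = provider.charge(chargeKey, 5000)
```

The engine's retry covers the gap where the call succeeded but the result was never recorded, and the key covers the duplicate that the retry would otherwise create. A key built from `Date.now()` or a fresh UUID would differ on the retry and defeat the whole mechanism.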

Chapter 14 · Long-Running AI Agents — Durable Execution's New Killer App

The biggest driver of the 2026 durable-execution renaissance is **long-running AI agents.** Look at the characteristics of LLM agents.

- A single task runs for tens of minutes to days (code review, research, automation).

- Calls external APIs tens to hundreds of times (OpenAI, Anthropic, tools).

- Costs can range from `$0.10` to `$10` per task.

- Waits for human approval mid-flow (human-in-the-loop).

- On failure, restarting from scratch doubles the cost.

All these traits fit durable execution perfectly.

```typescript
// AI agent + durable execution: Trigger.dev v3
import { task, wait } from '@trigger.dev/sdk/v3'
import Anthropic from '@anthropic-ai/sdk'
// searchWebTool, readPageTool, summarizeTool, executeTool, and extractText are
// application helpers assumed to exist elsewhere

const claude = new Anthropic()

export const researchAgent = task({
  id: 'research-agent',
  maxDuration: 7200, // 2 hours
  run: async (payload: { question: string }, { ctx }) => {
    const tools = [searchWebTool, readPageTool, summarizeTool]
    const messages: Anthropic.MessageParam[] = [
      { role: 'user', content: payload.question },
    ]

    for (let iteration = 0; iteration < 50; iteration++) {
      const response = await claude.messages.create({
        model: 'claude-opus-4-7',
        max_tokens: 4096,
        tools,
        messages,
      })
      messages.push({ role: 'assistant', content: response.content })

      if (response.stop_reason === 'end_turn') {
        return { finalAnswer: extractText(response.content), iterations: iteration }
      }

      // Handle tool calls. Each call is meant to be a durable boundary, which
      // assumes executeTool runs it as its own subtask and returns a string.
      // Every result must reference the tool_use block it answers.
      const toolResults = await Promise.all(
        response.content
          .filter((b): b is Anthropic.ToolUseBlock => b.type === 'tool_use')
          .map(async (tool) => ({
            type: 'tool_result' as const,
            tool_use_id: tool.id,
            content: await executeTool(tool.name, tool.input),
          }))
      )
      messages.push({ role: 'user', content: toolResults })

      // Brief rest every 5 iterations, to dodge rate limits
      if (iteration % 5 === 4) await wait.for({ seconds: 30 })
    }

    throw new Error('agent-did-not-converge')
  },
})
```

This pattern is becoming the 2026 standard for AI agent infrastructure. The agent itself is a simple loop: LLM call, tool execution, LLM call again. Durable execution persists each step of that loop, so even if the agent dies on iteration 14, you don't pay for the first 13 iterations a second time.

Anthropic's prompt caching (introduced in 2024), the Vercel AI SDK + Inngest integration, LangGraph's checkpoint model: all point in the same direction. **LLM agents cannot reach production without durable execution.**
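The cost argument can be checked with a toy version of the prologue's incident. This is a sketch under stated assumptions: `CheckpointStore`, `fakeLlmCall`, and `runAgent` are hypothetical, the "LLM" is a stub that rejects the 14th call once, and the checkpoint is an in-memory string where a real engine would use durable storage.

```typescript
// Hypothetical sketch: checkpoint the agent's state after every iteration.
type AgentState = { nextIteration: number; transcript: string[] }

class CheckpointStore {
  private saved: string | undefined // stands in for a row in durable storage

  save(state: AgentState): void {
    this.saved = JSON.stringify(state)
  }

  load(): AgentState {
    if (this.saved === undefined) return { nextIteration: 0, transcript: [] }
    return JSON.parse(this.saved) as AgentState
  }
}

let billedCalls = 0
let rateLimitedOnce = false

// Stand-in for a paid LLM call; the 14th one is rejected once, like the 429 in the prologue.
function fakeLlmCall(iteration: number): string {
  if (iteration === 13 && !rateLimitedOnce) {
    rateLimitedOnce = true
    throw new Error('429 Too Many Requests')
  }
  billedCalls++
  return `reply-${iteration}`
}

function runAgent(checkpoints: CheckpointStore, totalIterations: number): string[] {
  const state = checkpoints.load() // resume from the last checkpoint, if any
  for (let i = state.nextIteration; i < totalIterations; i++) {
    state.transcript.push(fakeLlmCall(i))
    state.nextIteration = i + 1
    checkpoints.save(state) // persist before moving on
  }
  return state.transcript
}

const checkpoints = new CheckpointStore()
let transcript: string[] = []
try {
  transcript = runAgent(checkpoints, 20) // dies on the 14th call
} catch {
  transcript = runAgent(checkpoints, 20) // restart: resumes at the 14th call
}
```

Across both runs the stub bills 20 calls: 13 before the failure and 7 after the restart. Restarting from scratch would have billed 33, which is the doubled invoice from the Slack message in miniature.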

Chapter 15 · Toss Payment Workflow Case

Korean fintech Toss has adopted durable-execution patterns in its payment processing infrastructure since 2018 (based on public materials; whether they directly use Temporal is a separate matter). Payments are intrinsically a domain where durable execution fits.

Look at the characteristics of Toss payments.

- A single payment passes through card issuers, VAN providers, PG, and settlement systems.

- Failures can happen at any step; partial failures are common.

- A payment must not be processed twice (idempotency).

- A refund is the mirror of a payment — a compensating transaction.

- When disputes arise, audit logs of every step are required.

In this domain, saga patterns and event sourcing come naturally. Many Korean fintechs like Toss have built their own workflow engines or assembled Kafka-based sagas; since 2024, Temporal adoption cases have grown as well. The complexity of the payment domain is one of the strongest cases for durable execution engines.

In the same vein, KakaoPay has implemented payment sagas between microservices via in-house frameworks. KakaoPay's tech blog has shared patterns like "solve distributed transactions with sagas and code compensating transactions explicitly." Multi-step workflows like payment, transfer, and payback are canonical cases for durable execution.

Chapter 16 · Japanese Cases — Mercari, Rakuten, Cybozu

Durable-execution adoption is fast in Japan too.

**Mercari** has solved payments and settlements in its microservice environment with saga patterns. Mercari Payments (Merpay) places explicit saga coordinators to avoid distributed-transaction problems and protects each step with idempotency. Since 2024, Temporal adoption has also been announced in some areas.

**Rakuten** has an enormous volume of workflows across broad businesses (e-commerce, finance, telecom, travel). The result is a multi-engine environment: in-house workflow engines, Apache Airflow for data pipelines, AWS Step Functions in some areas, and Camunda for BPMN workflows. At Rakuten's scale, unifying on a single engine is actually harder.

**Cybozu** (developer of kintone) handled message-queue + worker patterns early on for asynchronous processing in its cloud service, and in the 2020s applied durable-execution patterns to some areas. In Japanese SaaS, durable execution fits naturally in places like safe migrations of user data, account cleanup, and subscription changes.

A curious aspect of the Japanese market is the **high share of in-house implementations.** Giant organizations like NTT, SoftBank, and Nintendo often operate their own workflow engines, and adoption of new tools moves more slowly than in the US. But once adopted, usage runs deep.

Chapter 17 · Cadence — The Original, Made at Uber

Cadence (open-sourced by Uber in 2017) is the direct ancestor of Temporal. The core team — Maxim Fateev, Samar Abbas, and others — left Uber in 2019 to found Temporal, and the core architecture is nearly identical. The difference is that the SDKs and APIs have diverged over time.

As of 2026, Cadence is still operated at large scale inside Uber — tens of millions of workflows daily. Some external users remain, but the weight has shifted toward Temporal. For a new project, choosing Temporal is the natural call. Cadence holds the seat of "the operational asset of the pre-Temporal era."

There's an interesting divergence. Cadence and Temporal started from the same root, but their semantics, SDKs, and operational models are pulling apart. Features that exist in one engine and not the other, such as Temporal's Workflow Update, keep accumulating. Compatibility between the two engines is no longer guaranteed.

Chapter 18 · Camunda and Conductor — BPMN and Microservice Orchestration

Alongside code-centric durable execution, there are engines from different traditions.

**Camunda** (Germany, founded 2008) is a workflow engine based on the BPMN (Business Process Model and Notation) standard. A BPMN diagram drawn graphically by a business analyst becomes a runnable workflow. Strong in domains like banking, insurance, and telecom. Camunda 8 (announced in 2022) brought a cloud-native redesign and handles microservice orchestration atop Kubernetes.

**Netflix Conductor** (open-sourced by Netflix in 2016) is a microservice orchestration engine. Workflows are defined in a JSON DSL, and workers in many languages poll for tasks. It powers Netflix's large-scale video processing and content-metadata pipelines. In 2022, Orkes spun off as a company providing commercial support.


Key distinction: **code vs DSL.** If business analysts often define workflows, BPMN/Camunda. If engineers write all workflows, Temporal/Inngest. If AWS lock-in is acceptable, Step Functions.

Chapter 19 · Self-Host vs SaaS — Cost and Ops Tradeoffs

A major workflow-engine decision is **self-host or SaaS.**

The truth of self-hosting: **operating Temporal at production grade is not cheap.** Cassandra/MySQL operations, HA across four services, metrics/monitoring, upgrades. For small teams, SaaS is almost always cheaper.

That said, if you have data-sovereignty, compliance, or VPC-only constraints, self-hosting is mandatory. Korean and Japanese financial institutions almost always self-host. US indie SaaS companies almost always choose the managed offering.

An interesting trend is **the hybrid model.** Inngest and Trigger.dev allow keeping the control plane on SaaS while running the actual execution workers in the user's infrastructure. Code and compute stay inside your network, and the burden of operating the engine shifts to the vendor. A pattern likely to grow more common in 2026.

Chapter 20 · Which Engine for Which Place — A One-Line Guide

A simplified tool selection guide.

- **Payments, fintech sagas, complex branching** → Temporal. Effectively the standard.

- **AI agents, SaaS automation, DX-first** → Inngest or Trigger.dev v3. Either one, by taste.

- **Evolution of a simple queue, only Postgres allowed** → Hatchet. Lightweight ops.

- **Per-object stateful invocations, small ops footprint** → Restate. New, a bet.

- **Transactional consistency matters, Postgres-native** → DBOS. Academic roots.

- **Already all-in on AWS, AWS lock-in OK** → Step Functions.

- **Data ETL pipelines** → Airflow (traditional), Prefect (modern), Dagster (asset-centric).

- **Kubernetes-native ML/CI** → Argo Workflows.

- **BPMN standard, business/IT collaboration** → Camunda.

- **Netflix-style microservice orchestration** → Conductor.

**Mixing is normal.** Temporal for payments, Airflow for data, Inngest for AI — a common setup. Attempts to unify every workflow on a single engine usually fail. Each engine has its area of strength, and that difference is the essence of tool selection.

Chapter 21 · Pitfalls — Five Common Failure Modes When Adopting Durable Execution

1. **Not keeping determinism** — using `Math.random()`, `Date.now()`, or environment variables directly inside workflow code. The first replay branches differently and you fall into debugging hell. Always use the SDK's deterministic helpers.

2. **Huge workflow payloads** — passing 100MB in workflow inputs/outputs. History explodes, performance dies. Put large data in S3 and pass references.

3. **Mixing activity and workflow responsibilities** — running DB queries or HTTP calls directly inside the workflow. All side effects must be isolated inside activities. The workflow is pure orchestration.

4. **Ignoring versioning** — changing code while a workflow runs for days. In-flight executions are incompatible with the new code. You must learn tools like Temporal's `patched()` or Inngest's function versioning.

5. **Forgetting idempotency keys** — activities making external calls (Stripe, SendGrid) without idempotency keys. On activity retry, the customer is charged twice. Durable execution + idempotency keys = effectively-once. Both are mandatory.
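
The first pitfall is easier to see in a toy model than in prose. The sketch below is not any engine's real API: `ReplayContext`, `sideEffect`, `decide`, and `runAttempt` are hypothetical names standing in for an SDK's deterministic helpers. It records the result of a non-deterministic call on the first execution and returns the recorded value on replay, so the workflow takes the same branch both times.

```typescript
// Toy model of replay. ReplayContext and sideEffect are hypothetical names,
// not the API of Temporal, Inngest, or any other engine.
class ReplayContext {
  private cursor = 0;
  private readonly history: unknown[];

  constructor(history: unknown[]) {
    this.history = history;
  }

  // First execution: run fn and append its result to the history.
  // Replay: skip fn and return the value recorded at the same position.
  sideEffect<T>(fn: () => T): T {
    if (this.cursor < this.history.length) {
      const recorded = this.history[this.cursor] as T;
      this.cursor += 1;
      return recorded;
    }
    const value = fn();
    this.history.push(value);
    this.cursor += 1;
    return value;
  }
}

// Workflow logic: the branch depends on a random draw.
function decide(draw: () => number): string {
  return draw() < 0.5 ? "refund" : "charge";
}

// Simulate one execution attempt against a shared, durable history.
function runAttempt(history: unknown[]): string {
  const ctx = new ReplayContext(history);
  return decide(() => ctx.sideEffect(() => Math.random()));
}

const durableHistory: unknown[] = [];
const firstRun = runAttempt(durableHistory);
const afterCrash = runAttempt(durableHistory); // replay reads the recorded draw
console.log(firstRun === afterCrash);
```

Calling `decide(Math.random)` directly would bypass the history, and a replay could then take the other branch.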

These five are the real learning curve of adopting durable execution. It doesn't end with installing the tool: you must internalize determinism, isolation, versioning, and idempotency as code patterns. Expect that to take a team two to three weeks.
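
To make the idempotency pattern concrete, here is a minimal sketch under stated assumptions. `FakePaymentProvider` is a hypothetical in-memory stand-in for an external API that deduplicates on an idempotency key (it is not Stripe's client), and `stepKey` and `chargeActivity` are illustrative helpers. The activity fails after the charge has gone through, the retry sends the same key, and the customer is charged once.

```typescript
// Hypothetical stand-in for an external payment API that honors idempotency keys.
class FakePaymentProvider {
  private readonly resultsByKey: Record<string, string> = {};
  chargeCount = 0;

  charge(idempotencyKey: string, amountCents: number): string {
    const previous = this.resultsByKey[idempotencyKey];
    if (previous !== undefined) {
      return previous; // same key seen before: return the first result, charge nothing
    }
    this.chargeCount += 1;
    const chargeId = `ch_${this.chargeCount}_${amountCents}`;
    this.resultsByKey[idempotencyKey] = chargeId;
    return chargeId;
  }
}

// Build the key from stable workflow identity, never from the clock or a random
// value, so every retry of the same step sends the same key.
function stepKey(workflowId: string, stepName: string): string {
  return `${workflowId}:${stepName}`;
}

// Activity that can fail after the provider has already charged the card.
function chargeActivity(
  provider: FakePaymentProvider,
  workflowId: string,
  amountCents: number,
  crashAfterCharge: boolean
): string {
  const id = provider.charge(stepKey(workflowId, "charge-card"), amountCents);
  if (crashAfterCharge) {
    throw new Error("worker died before the result was recorded");
  }
  return id;
}

const provider = new FakePaymentProvider();
let chargeId = "";
try {
  chargeId = chargeActivity(provider, "order-42", 1999, true);
} catch {
  // What an engine's retry does: run the same activity again.
  chargeId = chargeActivity(provider, "order-42", 1999, false);
}
console.log(provider.chargeCount, chargeId);
```

The design choice that matters is where the key comes from: workflow ID plus step name survives a crash, while a key generated inside the activity would differ on every attempt.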

Chapter 22 · Cost Model — The Real Cost per Workflow

One thing often overlooked when adopting a workflow engine is the **cost model.** SaaS list prices look small, but what matters is the billing unit: per run, per step, per action, or per state transition. The shock comes when the unit turns out to be different from what you assumed.

Rough 2026 price ranges (official prices; actuals differ in negotiation).

- **Temporal Cloud**: starts around `$200`, billed per action (roughly, each billable operation such as a workflow start, activity, timer, or signal). At large traffic, tens of thousands of dollars per month.

- **Inngest**: free 50k executions/month; Pro `$50/month` for 250k executions. Billed per execution.

- **Trigger.dev**: free 10k executions/month; Pro `$50/month` for 100k executions plus time.

- **Hatchet**: free 1k workflows/month; Pro onwards is usage-based.

- **Step Functions Standard**: `$0.025` per 1k state transitions. Looks expensive for small jobs, surprisingly cheap for big workflows.

- **Step Functions Express**: `$1` per 1M requests, plus duration charges based on memory and time.

A sanity check. Suppose an AI agent does an average of 20 steps per workflow, and runs 100k times per month.

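- Inngest Pro: `$50/month` if an execution means one workflow run, since 100k runs fit inside the 250k quota. If each step is billed as an execution, the same workload is 2M executions and no longer fits, so check how the vendor counts.

- Step Functions Standard: 2M state transitions, roughly `$50/month`.

- Temporal Cloud: 2M actions, separately quoted; possibly significantly more expensive.

- Self-hosted Hatchet: one Postgres plus workers, so just infrastructure cost.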

**The unit price per workflow relative to the LLM cost** is the real signal. If a single agent run burns `$1` of LLM, a `$0.001` workflow engine charge is effectively free. Conversely, using an expensive engine for simple background jobs like email sends accumulates cost fast.
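
The sanity check above is plain arithmetic, and it can be reproduced in a few lines. The sketch uses only the figures quoted in this chapter; the one-state-transition-per-step mapping is a simplifying assumption, and `fitsQuota` is a hypothetical helper that shows how much the answer depends on the billing unit.

```typescript
// Workload from the sanity check: 100k runs per month, 20 steps per run.
const runsPerMonth = 100_000;
const stepsPerRun = 20;
const stepsPerMonth = runsPerMonth * stepsPerRun; // 2,000,000

// Step Functions Standard at $0.025 per 1,000 state transitions.
// Assumption: one state transition per step.
const stepFunctionsUsd = (stepsPerMonth / 1_000) * 0.025;

// Engine cost per run, next to an assumed $1 of LLM spend per run.
const engineUsdPerRun = stepFunctionsUsd / runsPerMonth;
const llmUsdPerRun = 1;
const engineShare = engineUsdPerRun / (engineUsdPerRun + llmUsdPerRun);

// Whether an included quota covers the workload depends on what is counted.
function fitsQuota(includedUnits: number, usedUnits: number): boolean {
  return usedUnits <= includedUnits;
}

const fitsIfBilledPerRun = fitsQuota(250_000, runsPerMonth);
const fitsIfBilledPerStep = fitsQuota(250_000, stepsPerMonth);

console.log({ stepFunctionsUsd, engineUsdPerRun, engineShare, fitsIfBilledPerRun, fitsIfBilledPerStep });
```

Under these assumptions the engine accounts for well under a tenth of a percent of the per-run cost, while the same 250k quota either covers the workload or falls short by a factor of eight depending on whether runs or steps are counted.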

Chapter 23 · Migration — Moving from One Engine to Another

Workflow engines are hard to swap once adopted. Workflow code and persistent state are tightly coupled.

Migration strategies.

1. **New workflows on the new engine** — leave existing workflows alone; write only new ones on the new engine. The most common pattern; over time, natural migration.

2. **Drain model** — block new executions on the existing engine and wait for in-flight ones to finish. If workflows run for days, this can take months.

3. **Event replay** — if event-sourced, replay event logs on the new engine. Theoretically possible but difficult due to semantic differences.

4. **Manual state migration** — extract in-flight workflow state and restore as workflows on the new engine. Hardest and riskiest.

Most teams pick option 1. Time is a friend. If the new engine is worth adopting, the share of the old engine naturally shrinks over 6–12 months.
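
Options 1 and 2 boil down to a routing rule plus a drain check. The following is a rough sketch, not a feature of any engine: `MigrationState`, `startWorkflow`, `engineFor`, `completeWorkflow`, and `legacyDrained` are hypothetical names for a thin layer you would own in front of both engines.

```typescript
type Engine = "legacy" | "next";

interface MigrationState {
  // Workflow types whose code has been ported to the new engine.
  portedTypes: string[];
  // Engine that owns the history of each unfinished workflow, by workflow ID.
  inFlight: Record<string, Engine>;
}

// New executions: ported types start on the new engine, everything else stays put.
function startWorkflow(state: MigrationState, workflowId: string, workflowType: string): Engine {
  const engine: Engine = state.portedTypes.indexOf(workflowType) >= 0 ? "next" : "legacy";
  state.inFlight[workflowId] = engine;
  return engine;
}

// Signals and queries must follow an execution to the engine that holds its history.
function engineFor(state: MigrationState, workflowId: string): Engine | undefined {
  return state.inFlight[workflowId];
}

function completeWorkflow(state: MigrationState, workflowId: string): void {
  delete state.inFlight[workflowId];
}

// Drain check: the legacy engine can be retired once nothing in flight lives on it.
function legacyDrained(state: MigrationState): boolean {
  return Object.keys(state.inFlight).every((id) => state.inFlight[id] !== "legacy");
}

const state: MigrationState = {
  portedTypes: ["pr-review"],
  inFlight: { "billing-001": "legacy" }, // started before the migration
};
const engineForNewRun = startWorkflow(state, "review-777", "pr-review");
const engineForOldRun = engineFor(state, "billing-001");
const drainedBefore = legacyDrained(state);
completeWorkflow(state, "billing-001");
const drainedAfter = legacyDrained(state);
console.log({ engineForNewRun, engineForOldRun, drainedBefore, drainedAfter });
```

In-flight executions are never moved; they finish where their history lives, which is why the old engine's share shrinks on its own.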

Chapter 24 · 2026 Trends — Where Is This Going

Finally, a roundup of the big currents in 2026.

- **AI agents forced durable execution**: due to LLM cost, long run times, and reliability problems, production agents without durable execution are effectively impossible. The revenue curves of Temporal/Inngest/Trigger.dev follow this trend.

- **Postgres at the center of infrastructure**: Hatchet, DBOS, Graphile Worker, River, pg-boss. The view that Postgres alone is enough (no Kafka) is spreading. "Postgres for the queue, the workflow engine, and the compute."

- **Hybrid self-host**: control plane on SaaS, execution inside the user VPC. A compromise between data sovereignty and operational convenience.

- **Maturation of workflow versioning**: Temporal's patching and Worker Versioning, Inngest's function versioning. The semantics of code changes for workflows that run for days are getting more refined.

- **Generalization of the event-function model**: the Inngest/Trigger.dev style of "events wake functions" is becoming the standard for SaaS and indie developers. There is no separate workflow definition; the function itself is the workflow.

Another current is **the convergence of CDC and workflows.** A pattern of pulling events via Postgres CDC (Debezium, Sequin) and waking the workflow engine. Database changes become the starting points of workflows. Expected to grow more common in late 2026.
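
As an illustration of that pattern, the sketch below maps a row change to a workflow trigger. `ChangeEvent` is a simplified subset of a Debezium-style envelope (`op`, `before`, `after`, `source`), and `WorkflowTrigger` and `toTrigger` are hypothetical: each engine has its own event shape. The sketch also assumes the source position (`lsn`) is unique per change, which makes it usable as a dedupe key when the CDC pipeline redelivers.

```typescript
// Simplified subset of a Debezium-style change event (assumption: only these fields).
interface ChangeEvent {
  op: "c" | "u" | "d" | "r"; // create, update, delete, snapshot read
  before: Record<string, unknown> | null;
  after: Record<string, unknown> | null;
  source: { table: string; lsn: number };
}

// Hypothetical event shape consumed by an event-driven workflow engine.
interface WorkflowTrigger {
  name: string;
  dedupeKey: string;
  data: Record<string, unknown>;
}

function toTrigger(change: ChangeEvent): WorkflowTrigger | null {
  if (change.op === "r") {
    return null; // snapshot reads are not business events
  }
  // Deletes carry the row in `before`; creates and updates carry it in `after`.
  const row = change.op === "d" ? change.before : change.after;
  if (row === null) {
    return null;
  }
  const verb = change.op === "c" ? "created" : change.op === "u" ? "updated" : "deleted";
  return {
    name: `${change.source.table}.${verb}`,
    // Assumed unique per change, so a redelivered event maps to the same key.
    dedupeKey: `${change.source.table}:${change.source.lsn}`,
    data: row,
  };
}

const trigger = toTrigger({
  op: "c",
  before: null,
  after: { id: 7, status: "paid" },
  source: { table: "orders", lsn: 1024 },
});
const snapshot = toTrigger({
  op: "r",
  before: null,
  after: { id: 7 },
  source: { table: "orders", lsn: 1 },
});
console.log(trigger, snapshot);
```

The dedupe key matters because CDC delivery is typically at-least-once; without it, a redelivered change would start the same workflow twice.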

Epilogue — The Era of Functions That Refuse to Die

A one-sentence summary: **In 2026, workflow engines became a core layer of infrastructure. For two reasons — AI agents forced durable execution, and Postgres is becoming the backbone of workflows.**

If the saga patterns in distributed-systems textbooks felt academic 15 years ago, in 2026 they are the basic vocabulary of every SaaS backend. Payments, transfers, reservations, messaging, AI agents — all variations of a saga. And those sagas ride on top of durable execution engines.

In the next decade of backend engineering, "which workflow engine" becomes as important as "which database." And that choice depends on the essence of the business domain — Temporal for payments, Inngest for AI, Airflow for ETL. One tool cannot solve everything.

A last word: **containers die. But workflows must not.** That is the promise of durable execution.

> "Processes are not eternal. But the meaning of a function can be. That is durable execution."

— Durable Execution & Workflow Engines 2026, end.

References

- [Temporal — Durable execution platform](https://temporal.io/)

- [Temporal Documentation](https://docs.temporal.io/)

- [Cadence — Uber open source](https://cadenceworkflow.io/)

- [Inngest — Durable event-driven functions](https://www.inngest.com/)

- [Inngest Documentation](https://www.inngest.com/docs)

- [Trigger.dev — Background jobs platform](https://trigger.dev/)

- [Trigger.dev v3 Documentation](https://trigger.dev/docs)

- [Hatchet — Postgres-native task queue](https://hatchet.run/)

- [Hatchet GitHub — hatchet-dev/hatchet](https://github.com/hatchet-dev/hatchet)

- [Restate — Durable stateful invocations](https://restate.dev/)

- [DBOS — Postgres-native compute](https://www.dbos.dev/)

- [AWS Step Functions](https://aws.amazon.com/step-functions/)

- [Apache Airflow](https://airflow.apache.org/)

- [Prefect](https://www.prefect.io/)

- [Dagster](https://dagster.io/)

- [Argo Workflows — argoproj](https://argoproj.github.io/argo-workflows/)

- [Camunda — BPMN process automation](https://camunda.com/)

- [Netflix Conductor / Orkes](https://orkes.io/)

- [Sagas, Hector Garcia-Molina and Kenneth Salem, 1987](https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf)

- [Pattern: Saga — microservices.io](https://microservices.io/patterns/data/saga.html)