Durable Execution Engines in 2026 — A Deep Dive Comparison of Temporal, Restate, Inngest, Trigger.dev, DBOS: Escaping Cron and Retry Hell
Author: Youngju Kim (@fjvbn20031)
Prologue — A Way Out of Cron and Retry Hell
The 2019 backend engineer wrote code like this. "Charge the customer, send a receipt 5 minutes later, send a review request after 2 days." The answer usually looked like this.
- Payment — synchronous API.
- Receipt — push to a queue (SQS, RabbitMQ), worker processes after 5 minutes.
- Review email — cron scans nightly, picks "purchases 2 days old," sends mail.
- On failure? Add a `retry_count` column, give up after 7 days.
This pattern worked. Except we rewrote it every time, introduced new bugs every time, and never had observability. To know how far a given order had got, you had to query the DB by hand.
The 2026 answer is different.
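```typescript
// Pseudocode: @workflow stands in for whichever registration API the engine provides.
@workflow
async function postPurchaseFlow(orderId: string) {
  await chargeCustomer(orderId)
  await sleep('5 minutes')
  await sendReceipt(orderId)
  await sleep('2 days')
  await sendReviewRequest(orderId)
}
```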
That is the whole thing. The function itself is the workflow definition. If the server dies, whether 5 minutes in or 2 days in, the function resumes from exactly the line where it stopped. No queue, no cron, no retry_count column.
The name of this magic is Durable Execution (DE). It started getting real traction in 2024, and by 2026 it has become a standard category of backend infrastructure.
- Temporal raised a 300M dollar Series D at a 5B valuation in February 2026. Cumulative 9.1 trillion action executions, with 1.86 trillion from AI-native companies alone.
- AWS announced Lambda Durable Functions at re:Invent 2025.
- Cloudflare Workflows went GA.
- Vercel released Workflow DevKit.
- Inngest, Trigger.dev, Restate, and DBOS each grew share with their own weapons.
Why now? AI agents. An agent loop that calls 30 tools over an hour has to start over from scratch if the server dies even once, and every restart burns about 5 dollars in tokens. DE prevents that.
Here is the flow of this article.
- What Durable Execution actually means — deterministic replay and workflow-as-code
- The 2026 engine landscape at a glance — comparison matrix
- Temporal — how it became the DE standard
- Restate — the lightweight challenger
- Inngest — event-driven DX advantage
- Trigger.dev — DE for the Next.js era
- DBOS — lightweight, needs only Postgres
- The same workflow in three engines — real code comparison
- Where AWS Step Functions, Azure Durable Functions, and Cadence sit
- AI agents and DE — why it exploded
- Decision tree and anti-patterns
By the end, "what fits our team" should take 5 minutes to decide.
Chapter 1 · What Durable Execution Means
1.1 Definition — Separating Intent from Execution
Durable Execution is a paradigm that separates the intent of a function (business logic) from its actual execution (when, where, how many times it runs). The function looks like ordinary code, but the runtime checkpoints progress to persistent storage at each step. If the server dies, another worker resumes from that line.
Core guarantees:
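- Effectively-once steps: a completed step's result is recorded once and the step is not re-executed. A step that crashes mid-flight may be retried, so external side effects still need idempotency keys (see 12.7).
- Long sleeps: `sleep('30 days')` actually means it. It wakes up 30 days later.
- Retries with state: only failed steps retry, results of completed steps are remembered.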
- Deterministic replay: when a worker dies and comes back, history is replayed and the worker is restored to the exact same state.
1.2 Two Mechanisms
DE engines usually fall into one of two camps.
| Mechanism | Description | Examples |
|---|---|---|
| Journal-based replay | Every completed step is recorded in an append-only log. When a worker wakes up, the code re-runs from the top but already-recorded results are used from the cache instead of re-executed. | Temporal, Cadence, Restate |
| Database checkpointing | Each step ends with a DB write that stores its output. On resume, the engine reads the stored outputs, skips the steps that already completed, and continues from the first step without a recorded result. | DBOS, Inngest, Trigger.dev (details vary by engine) |
Journal-based replay forces the code to be deterministic. For example, calling Math.random() or Date.now() directly produces different values on replay and breaks correctness. You must use the engine-provided deterministic API.
DB checkpointing has weaker constraints, but step input-output serialization, DB load, and transaction boundaries become the core operational concerns.
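To make the journal-based mechanism concrete, here is a minimal in-memory sketch. It is an illustration under stated assumptions, not any engine's real implementation: `Journal`, its `run` method, `orderWorkflow`, and `demoReplay` are hypothetical names, and a real engine persists the log durably instead of keeping it in an array.

```typescript
// Hypothetical in-memory journal; real engines persist this log durably.
type JournalEntry = { step: string; result: unknown };

class Journal {
  private cursor = 0;
  constructor(public readonly entries: JournalEntry[] = []) {}

  // Runs `fn` only when no result is recorded at this position;
  // on replay, returns the recorded result instead of re-executing.
  async run<T>(step: string, fn: () => Promise<T>): Promise<T> {
    const recorded = this.entries[this.cursor];
    if (recorded !== undefined) {
      if (recorded.step !== step) {
        // The code took a different path than the recorded history.
        throw new Error(`non-deterministic replay: expected ${recorded.step}, got ${step}`);
      }
      this.cursor++;
      return recorded.result as T;
    }
    const result = await fn();
    this.entries.push({ step, result });
    this.cursor++;
    return result;
  }
}

let chargeCalls = 0;

async function orderWorkflow(journal: Journal, crashAfterCharge: boolean): Promise<string> {
  const paymentId = await journal.run("charge", async () => {
    chargeCalls++;
    return "pay_1";
  });
  if (crashAfterCharge) throw new Error("worker died");
  return journal.run("ship", async () => `shipped:${paymentId}`);
}

// The first attempt crashes after the charge; the second replays the same entries.
async function demoReplay(): Promise<{ outcome: string; chargeCalls: number }> {
  const persisted: JournalEntry[] = [];
  await orderWorkflow(new Journal(persisted), true).catch(() => undefined);
  const outcome = await orderWorkflow(new Journal(persisted), false);
  return { outcome, chargeCalls };
}
```

The second run executes the workflow function from the top, but the charge callback is never invoked again because its result is already in the journal. The step-name check is a simplified version of the non-determinism detection that replay-based engines perform.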
1.3 The Workflow-as-Code Model
Older workflow engines (Airflow, Step Functions) described workflows as DAGs, JSON, or YAML. DE engines just use code: if, for, try, and await all mean what they say.
```typescript
@workflow
async function orderFlow(input: OrderInput) {
  const payment = await ctx.run('charge', () => chargeCard(input))
  try {
    await ctx.run('ship', () => shipItem(input))
  } catch (e) {
    // saga: compensating transaction
    await ctx.run('refund', () => refundCard(payment))
    throw e
  }
  await ctx.sleep('1 day')
  await ctx.run('review-request', () => sendReviewMail(input))
}
```
This is plain JS. But the runtime checkpoints every ctx.run, and ctx.sleep does not occupy worker memory — the workflow resumes on a different worker after the sleep.
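Why a sleep costs no worker memory is easier to see in code. The sketch below assumes the engine stores a wake-up time and a scheduler later hands due workflows to any free worker. `DurableTimers`, `schedule`, and `takeDue` are hypothetical names, and the array stands in for durable storage.

```typescript
// Hypothetical timer store; a real engine keeps these rows in durable storage.
type PendingTimer = { workflowId: string; wakeAtMs: number };

class DurableTimers {
  private pending: PendingTimer[] = [];

  // Called when a workflow hits sleep(): record the wake-up time, then free the worker.
  schedule(workflowId: string, nowMs: number, durationMs: number): void {
    this.pending.push({ workflowId, wakeAtMs: nowMs + durationMs });
  }

  // Called periodically by a scheduler: returns the workflows whose timers have fired
  // and removes them, so any available worker can resume them.
  takeDue(nowMs: number): string[] {
    const due = this.pending.filter((t) => t.wakeAtMs <= nowMs);
    this.pending = this.pending.filter((t) => t.wakeAtMs > nowMs);
    return due.map((t) => t.workflowId);
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const timerStore = new DurableTimers();
timerStore.schedule("order-42", 0, DAY_MS); // ctx.sleep('1 day') at t = 0

const dueAfterOneHour = timerStore.takeDue(60 * 60 * 1000); // nothing is due yet
const dueAfterOneDay = timerStore.takeDue(DAY_MS);          // order-42 is ready to resume
```

Between `schedule` and `takeDue` nothing about the workflow lives in a process, which is why the resuming worker can be a different machine.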
1.4 Patterns DE Unlocks
Workflow shapes where DE shines:
- Long-running approvals — vacation requests where a manager clicks 3 days later.
- Sagas and compensating transactions — charge then ship then email, with compensation wherever it fails.
- Retries with state — external API fails 5 times, retried idempotently with progress preserved.
- Scheduled fan-out — daily push to 10,000 users with per-push outcome tracking.
- AI agent loops — LLM call then tool call then next LLM, surviving crashes with context intact.
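The saga pattern in the list above can be sketched as a small helper. This is an illustration only: `runSaga` and `demoSaga` are hypothetical names, and in a real DE engine each action and each compensation would itself be wrapped in a durable step so that the rollback survives a crash.

```typescript
// Hypothetical saga helper: each completed step registers how to undo itself.
type Compensation = () => Promise<void>;

type SagaStep = {
  label: string;
  action: () => Promise<void>;
  compensate: Compensation;
};

async function runSaga(steps: SagaStep[], log: string[]): Promise<boolean> {
  const completed: Compensation[] = [];
  try {
    for (const step of steps) {
      await step.action();
      log.push(`done:${step.label}`);
      completed.push(step.compensate);
    }
    return true;
  } catch {
    // Undo completed steps in reverse order; the failed step itself is not compensated.
    for (const undo of completed.reverse()) {
      await undo();
    }
    return false;
  }
}

async function demoSaga(): Promise<{ ok: boolean; log: string[] }> {
  const log: string[] = [];
  const ok = await runSaga(
    [
      {
        label: "charge",
        action: async () => {},
        compensate: async () => { log.push("undo:charge"); },
      },
      {
        label: "ship",
        action: async () => { throw new Error("carrier down"); },
        compensate: async () => { log.push("undo:ship"); },
      },
    ],
    log,
  );
  return { ok, log };
}
```

Here the shipment fails, so only the charge is compensated. Running compensations in reverse order mirrors how nested transactions unwind.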
1.5 When You Do Not Need DE
Not every workflow calls for DE.
- Simple request-response: under 100 ms, HTTP or gRPC is enough.
- Read-only pipelines: ETL and batch analytics fit Spark, dbt, Airflow better.
- Fewer than 100 a day: ops burden outweighs benefit. A log line or a DB column is cheaper.
- Team of one: the DE engine's own learning curve exceeds the cost of writing it by hand.
"Workflow longer than 5 minutes, three or more external calls, expensive failures" — then consider DE.
Chapter 2 · 2026 Engine Comparison Matrix
| Item | Temporal | Restate | Inngest | Trigger.dev | DBOS |
|---|---|---|---|---|---|
| Category | DE standard, enterprise | DE plus state, lightweight | DE plus events, serverless | DE for Next.js-friendly stack | DE library, Postgres-native |
| Language SDKs | Go, Java, TS, Python, .NET, PHP, Ruby | TS, Java, Kotlin, Go, Python, Rust | TS, Python, Go, Java | TS, Python | TS, Python, Java, Go |
| Self-hosting | Yes (Apache 2.0) | Yes (single binary, BSL to Apache) | Yes (Apache 2.0, Inngest server) | Yes (Apache 2.0, v4) | Yes (Apache 2.0 library) |
| Infra dependency | Own server plus Cassandra or Postgres | Single binary plus RocksDB | Own server plus Postgres | Own server plus Postgres plus Redis | Postgres only |
| Cloud pricing model | About 0.00025 USD per action and up | Self-serve early-stage SaaS | Usage-based plus events freemium | Concurrency plus monthly tier, freemium | 50 USD per million extra checkpoints |
| Free tier | Self-hosted free | Self-hosted free | 50k executions per month free | 5 USD credit per month | Self-hosted free |
| Workflow model | Deterministic function (journal) | Deterministic function (journal plus state keys) | Step function (event-triggered) | Task function (trigger-based) | Transactional function (Postgres) |
| AI-agent friendliness | Very high (OpenAI Codex uses it) | High (Pydantic AI integration) | Very high (Agent Kit) | Very high (Session, Realtime) | Medium (library integration) |
| Observability | Web UI plus temporal CLI (successor to tctl) plus OTel | Web UI plus OTel | Web UI plus live stream | Web UI plus live logs and triggers | DB queries plus dashboard |
| Strongest at | Scale, multi-language, maturity | Lightweight, single binary | Event-driven, DX, pricing | TS full-stack, no time limits | Simplicity, single Postgres |
| Weakest at | Learning curve, ops burden | New ecosystem | Multi-language weak | TS-centric | Postgres becomes the throughput ceiling |
| Target user | Enterprise, finance, AI infra | Full-stack backend, DB-friendly | TS, Python startups | Next.js, full-stack teams | Postgres-centric backends |
This single table closes about 80 percent of the decision. Pick which row is your team's deal-breaker. It is usually one of three — language, self-hosting requirement, pricing model.
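Treating one row as a deal-breaker is a filter, and it can be written as one. The sketch below copies the language row from the matrix. The type, constant, and function names are hypothetical, and a real evaluation would filter on more than one row.

```typescript
// Data copied from the Language SDKs row of the matrix; names are illustrative.
type EngineRow = { engine: string; sdks: string[] };

const ENGINE_MATRIX: EngineRow[] = [
  { engine: "Temporal", sdks: ["Go", "Java", "TS", "Python", ".NET", "PHP", "Ruby"] },
  { engine: "Restate", sdks: ["TS", "Java", "Kotlin", "Go", "Python", "Rust"] },
  { engine: "Inngest", sdks: ["TS", "Python", "Go", "Java"] },
  { engine: "Trigger.dev", sdks: ["TS", "Python"] },
  { engine: "DBOS", sdks: ["TS", "Python", "Java", "Go"] },
];

// Keeps only the engines whose SDK list covers every language the team requires.
function enginesSupporting(requiredSdks: string[]): string[] {
  return ENGINE_MATRIX
    .filter((row) => requiredSdks.every((lang) => row.sdks.indexOf(lang) !== -1))
    .map((row) => row.engine);
}

const rustTeams = enginesSupporting(["Rust"]);
const goAndJavaTeams = enginesSupporting(["Go", "Java"]);
```

With this data, a Rust requirement leaves only Restate, while a Go plus Java requirement still leaves four candidates, so the next deal-breaker (self-hosting or pricing) has to decide.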
Chapter 3 · Temporal — How It Became the DE Standard
3.1 Funding, Adoption, Scale
Temporal is the successor to Cadence (the open-source workflow engine Maxim Fateev and Samar Abbas built at Uber). They left in 2019 and the same team built the next generation. February 2026 Series D was 300M USD at a 5B valuation. Cumulative 9.1T action executions, 1.86T from AI-native companies. OpenAI Codex uses Temporal to handle millions of daily coding-agent requests, a publicly disclosed deployment. 3,000-plus paying customers including NVIDIA and Netflix.
3.2 Architecture — Journal and Worker Separation
Core components:
- Temporal Server: stores workflow state (the journal, also called Event History). Backed by Cassandra, Postgres, or MySQL.
- Worker: runs user code. Talks to the server via long polling.
- Internally split into Frontend, History, Matching, Worker services for scale.
The worker is stateless. All workflow progress lives on the server. If the worker dies, another worker picks up the next task for the same workflow and replays the journal to resume.
3.3 Deterministic Workflow Code — Constraints and Rewards
The rules are simple but strict. Inside a workflow function:
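- Do not read the wall clock directly. In Python use `workflow.now()`, in Go use `workflow.Now(ctx)`. The TypeScript SDK runs workflows in a sandbox that replaces `Date` with a deterministic version.
- Do not call random directly. In Python use `workflow.random()`. The TypeScript sandbox likewise replaces `Math.random()` with a deterministic version.
- Do not call external services directly. Use `proxyActivities` to invoke an activity.
- No multi-threading, no mutable global variables.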
If you follow these rules, replay is deterministic. Re-running the workflow code produces the same step sequence. The server walks the history and returns cached results for already-completed activities.
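The idea behind the deterministic time and random helpers can be sketched in a few lines: record the value on first execution, return the recorded value on replay. This is not Temporal's API. `RecordedValues` and `pickDiscount` are hypothetical names used only to show the principle.

```typescript
// Hypothetical sketch: record non-deterministic values once, replay them afterwards.
class RecordedValues {
  private cursor = 0;
  constructor(private readonly recorded: number[] = []) {}

  // First execution: calls `produce` and records the value.
  // Replay: returns the recorded value and never calls `produce`.
  private next(produce: () => number): number {
    if (this.cursor < this.recorded.length) {
      return this.recorded[this.cursor++] as number;
    }
    const value = produce();
    this.recorded.push(value);
    this.cursor++;
    return value;
  }

  now(): number { return this.next(() => Date.now()); }
  random(): number { return this.next(() => Math.random()); }
  snapshot(): number[] { return this.recorded.slice(); }
}

// Workflow logic that branches on time and randomness.
function pickDiscount(values: RecordedValues): string {
  const startedAt = values.now();
  const roll = values.random();
  return `${startedAt}:${roll < 0.5 ? "10%" : "none"}`;
}

const firstRun = new RecordedValues();
const firstDecision = pickDiscount(firstRun);
const replayDecision = pickDiscount(new RecordedValues(firstRun.snapshot()));
```

Because the replay reads from the recording, `replayDecision` always equals `firstDecision`, whatever the clock or the random generator would return the second time.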
3.4 Activities — Where You Touch the World
```typescript
// activities.ts
export async function chargeCard(orderId: string): Promise<PaymentReceipt> {
  return await stripe.charge({ orderId })
}

export async function shipItem(orderId: string): Promise<TrackingNumber> {
  return await fedex.createShipment({ orderId })
}

// workflows.ts
import { proxyActivities, sleep } from '@temporalio/workflow'
import type * as activities from './activities'

const { chargeCard, shipItem } = proxyActivities<typeof activities>({
  startToCloseTimeout: '30 seconds',
  retry: { maximumAttempts: 5, initialInterval: '1s', backoffCoefficient: 2 },
})

export async function orderFlow(orderId: string) {
  const receipt = await chargeCard(orderId)
  await sleep('5 minutes')
  const tracking = await shipItem(orderId)
  return { receipt, tracking }
}
```
Activities are plain functions. Retry, timeout, and heartbeat are options layered on top. Idempotency is not an option you can switch on: it remains the activity author's responsibility.
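The retry options in the example translate into a concrete wait schedule. The arithmetic below is a sketch of plain exponential backoff. `backoffScheduleMs` is a hypothetical helper, and the optional `maximumIntervalMs` cap is an assumption added for illustration.

```typescript
// Illustrative arithmetic only: exponential backoff for a retry policy.
function backoffScheduleMs(
  maximumAttempts: number,
  initialIntervalMs: number,
  backoffCoefficient: number,
  maximumIntervalMs: number = Number.POSITIVE_INFINITY,
): number[] {
  const waits: number[] = [];
  // Attempt 1 runs immediately; a wait precedes each of the remaining attempts.
  for (let retry = 0; retry < maximumAttempts - 1; retry++) {
    const wait = initialIntervalMs * Math.pow(backoffCoefficient, retry);
    waits.push(Math.min(wait, maximumIntervalMs));
  }
  return waits;
}

// maximumAttempts: 5, initialInterval: '1s', backoffCoefficient: 2
const orderRetryWaits = backoffScheduleMs(5, 1000, 2); // [1000, 2000, 4000, 8000]
```

Five attempts therefore span about 15 seconds of waiting in total, which matters when you choose `startToCloseTimeout` and decide how long a customer-facing flow may stall.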
3.5 Pricing and Self-Hosting
- Temporal Cloud: about 0.00025 USD per action. Stored actions, active workers, and retention period (7, 30, 90 days) all multiply the bill. Low traffic starts around 200 USD a month.
- Self-hosted: Apache 2.0, free in license. But running Cassandra or Postgres for HA, tuning, and backups is non-trivial. Realistic bill is 2,500 to 4,500 USD a month combined with infra and ops time.
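A back-of-the-envelope estimate using the per-action figure quoted above can look like this. The helper name and the figure of 10 actions per workflow are assumptions. The estimate ignores storage, worker, and retention charges, so it is a lower bound, not a quote.

```typescript
// 0.00025 USD per action, expressed per million to avoid tiny floating-point factors.
const USD_PER_MILLION_ACTIONS = 250;

function estimateActionCostUsd(workflowsPerMonth: number, actionsPerWorkflow: number): number {
  const totalActions = workflowsPerMonth * actionsPerWorkflow;
  return (totalActions / 1_000_000) * USD_PER_MILLION_ACTIONS;
}

// 100,000 workflows a month at an assumed 10 actions each = 1,000,000 actions.
const monthlyEstimateUsd = estimateActionCostUsd(100_000, 10); // 250
```

Counting actions per workflow from a real execution history is the step most teams skip, and it is the number that moves this estimate the most.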
Scale and maturity are its weapons. Steep learning curve is its weakness for small teams.
Chapter 4 · Restate — The Lightweight Challenger
4.1 Positioning
Restate started from "Temporal is too heavy." It runs as a single binary (written in Rust, embedded RocksDB), with no external DB dependency. It sits like a proxy in front of your existing services and captures their call journal.
4.2 Key Differentiators
- Single binary: one Docker container, no separate DB or message queue.
- State and communication together: workflows can hold key-value state (`ctx.set('cart', items)`), no Redis or DB needed.
- HTTP and gRPC handlers: workflows are HTTP services. External callers just call them.
- Deterministic function model: similar constraints to Temporal. SDKs for JS, TS, Java, Kotlin, Go, Python, Rust.
4.3 Code Shape
```typescript
import * as restate from '@restatedev/restate-sdk'

const order = restate.workflow({
  name: 'order',
  handlers: {
    run: async (ctx: restate.WorkflowContext, orderId: string) => {
      const receipt = await ctx.run('charge', () => chargeCard(orderId))
      await ctx.sleep(5 * 60_000)
      const tracking = await ctx.run('ship', () => shipItem(orderId))
      ctx.set('status', 'shipped')
      return { receipt, tracking }
    },
  },
})

restate.endpoint().bind(order).listen(9080)
```
Start this service, start one Restate server, register the service with the server, and you are done.
4.4 Pricing and Self-Hosting
- Restate Cloud: GA in late 2025. Free tier first, paid tiers with stronger SLA are being rolled out.
- Self-hosted: Business Source License (BSL), which converts to Apache 2.0 once the license's change date passes. The single binary is free to run.
Less ops burden than Temporal, more expressive workflow model than DBOS. The ecosystem is still young.
Chapter 5 · Inngest — Event-Driven DX Advantage
5.1 Positioning
Inngest's intuition is "event then function." A function subscribes to events, and every step.run call inside the function is automatically checkpointed. There is no separate workflow definition — the function itself is the workflow.
5.2 Code Shape
```typescript
import { Inngest } from 'inngest'

const inngest = new Inngest({ id: 'shop' })

export const orderFlow = inngest.createFunction(
  { id: 'order-flow' },
  { event: 'order.placed' },
  async ({ event, step }) => {
    const receipt = await step.run('charge', async () => {
      return await chargeCard(event.data.orderId)
    })
    await step.sleep('wait-5m', '5m')
    const tracking = await step.run('ship', async () => {
      return await shipItem(event.data.orderId)
    })
    await step.sleep('wait-2d', '2d')
    await step.run('review-mail', async () => {
      return await sendReviewMail(event.data.orderId)
    })
    return { receipt, tracking }
  }
)
```
Event-then-function fits naturally with domain event publication. AWS EventBridge feel, but self-hostable.
5.3 2026 Highlights
- Checkpointing: cuts inter-step latency to nearly zero, dropping total workflow duration by 50 percent.
- Agent Kit: helpers for building AI agents. Auto-decomposes tool calls and LLM responses into steps.
- Agent Skills: six pre-built skills for Claude Code, Cursor, Windsurf.
- Self-hosted Inngest server: production-grade self-hosting on Postgres.
5.4 Pricing
- Cloud: first 50k executions free per month. Then usage-based pricing plus per-event metering. First million events free.
- Self-hosted: Apache 2.0, free.
Strong fit for TypeScript and Python teams, especially on Vercel or Next.js. Weak on multi-language.
Chapter 6 · Trigger.dev — DE for the Next.js Era
6.1 Positioning
Trigger.dev answers a very real pain — "where do I run the long task that does not fit on Vercel?" v4 brought deterministic replay, session-based bidirectional channels, and real-time log streaming.
6.2 Code Shape
```typescript
import { task, logger, wait } from '@trigger.dev/sdk/v3'

export const orderFlow = task({
  id: 'order-flow',
  retry: { maxAttempts: 5, factor: 2, minTimeoutInMs: 1000 },
  run: async (payload: { orderId: string }, { ctx }) => {
    const receipt = await chargeCard(payload.orderId)
    logger.info('charged', { receipt })
    await wait.for({ minutes: 5 })
    const tracking = await shipItem(payload.orderId)
    logger.info('shipped', { tracking })
    await wait.for({ days: 2 })
    await sendReviewMail(payload.orderId)
    return { receipt, tracking }
  },
})
```
Three strengths:
- No timeouts: hour-long tasks without Lambda, Vercel, or Cloudflare time limits.
- No charge while waiting: when a task enters `wait`, the container is frozen and you do not pay.
- Real-time logs: users see logs stream over SSE while the task runs.
6.3 2026 Highlights
- Session primitive: a bidirectional I/O channel that outlives a single run. Manager role for chat agents.
- Concurrency and queue controls: per-task concurrency caps and queue priorities.
- v4 deterministic mode: opt-in, when you need stronger correctness guarantees.
6.4 Pricing
- Free: 5 USD credit per month, 10 concurrent runs.
- Hobby: 10 USD per month.
- Pro: 50 USD per month, 200-plus concurrent runs.
- Enterprise: custom.
- Self-hosted: v4 under Apache 2.0.
Has become close to the default choice for Next.js and full-stack TS teams.
Chapter 7 · DBOS — Just Postgres, Please
7.1 Positioning
DBOS answers "why bring up more infrastructure when Postgres is already running?" It is a library, full stop. No separate server. Workflow state lives in your own Postgres.
7.2 Code Shape (TypeScript)
```typescript
import { DBOS } from '@dbos-inc/dbos-sdk'

// stripe and fedex are placeholder clients.
class OrderFlow {
  @DBOS.step()
  static async chargeCard(orderId: string) {
    return await stripe.charge({ orderId })
  }

  @DBOS.step()
  static async shipItem(orderId: string) {
    return await fedex.create({ orderId })
  }

  @DBOS.workflow()
  static async run(orderId: string) {
    const receipt = await OrderFlow.chargeCard(orderId)
    await DBOS.sleep(5 * 60_000)
    const tracking = await OrderFlow.shipItem(orderId)
    return { receipt, tracking }
  }
}
```
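Each step's output is checkpointed to Postgres. When a step is itself a database transaction, the business write and the checkpoint can commit in the same Postgres transaction, so for DB operations the database itself guarantees exactly-once. If a workflow fails, a DB row records the failure, and on restart the same workflow ID resumes from where it stopped.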
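The value of putting the checkpoint in the same transaction as the business write is easiest to see in a toy model. The sketch below is not DBOS code: `TinyDb` is a hypothetical in-memory stand-in for Postgres whose `transaction` either commits both maps or neither, and `debitStep` is an invented example step.

```typescript
// Hypothetical in-memory stand-in for Postgres: both tables commit together or not at all.
class TinyDb {
  balances = new Map<string, number>();
  checkpoints = new Map<string, string>(); // key: workflowId + step, value: serialized output

  transaction(work: (tx: TinyDb) => void): void {
    const draft = new TinyDb();
    draft.balances = new Map(this.balances);
    draft.checkpoints = new Map(this.checkpoints);
    work(draft); // if this throws, the draft is discarded (rollback)
    this.balances = draft.balances;
    this.checkpoints = draft.checkpoints;
  }
}

// Debits an account once per (workflowId, step), even if called again on recovery.
function debitStep(db: TinyDb, workflowId: string, account: string, amount: number): number {
  const key = `${workflowId}:debit`;
  const saved = db.checkpoints.get(key);
  if (saved !== undefined) return Number(saved); // already committed: skip the write

  let newBalance = 0;
  db.transaction((tx) => {
    newBalance = (tx.balances.get(account) ?? 0) - amount;
    tx.balances.set(account, newBalance);
    tx.checkpoints.set(key, String(newBalance)); // checkpoint rides in the same transaction
  });
  return newBalance;
}

const shopDb = new TinyDb();
shopDb.balances.set("alice", 100);
const firstDebit = debitStep(shopDb, "wf-1", "alice", 30);     // performs the write
const recoveredDebit = debitStep(shopDb, "wf-1", "alice", 30); // returns the checkpoint
```

There is no window in which the balance has changed but the checkpoint is missing, which is exactly the window that forces idempotency keys on calls to external APIs.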
7.3 Strengths and Weaknesses
- Strengths: one piece of infra (just Postgres). Lowest learning curve. Transactional integrity.
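- Weaknesses: no separate event-history server to inspect or scale independently. Recovery re-invokes the workflow function and skips the steps whose outputs are already stored in Postgres, so the code between steps should still be deterministic. Less expressive for very long workflows with heavy branching.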
- 2026: Java 0.8 ships Spring Boot integration. Self-hosted Conductor (cloud observability) becomes available.
7.4 Pricing
- Self-hosted library: free.
- DBOS Cloud plus Conductor: 50 USD per million extra checkpoints. Custom enterprise.
Lowest friction for Postgres-centric backend teams.
Chapter 8 · The Same Workflow in Three Engines — Real Code Comparison
Scenario: charge a card with up to 4 attempts in total, send a receipt on success, send a review email after 2 days. Same shape across Temporal, Inngest, and Trigger.dev.
8.1 Temporal (TS)
```typescript
// activities.ts
export async function chargeCard(orderId: string) {
  return await stripe.charge({ orderId })
}

export async function sendReceipt(orderId: string, receipt: Receipt) {
  await mailer.send({ to: orderId, body: receipt })
}

export async function sendReviewMail(orderId: string) {
  await mailer.send({ to: orderId, template: 'review' })
}

// workflow.ts
import { proxyActivities, sleep } from '@temporalio/workflow'
import type * as acts from './activities'

const { chargeCard, sendReceipt, sendReviewMail } = proxyActivities<typeof acts>({
  startToCloseTimeout: '60 seconds',
  retry: {
    maximumAttempts: 4,
    initialInterval: '2s',
    backoffCoefficient: 2,
    nonRetryableErrorTypes: ['CardDeclinedError'],
  },
})

export async function postPurchaseFlow(orderId: string) {
  const receipt = await chargeCard(orderId)
  await sendReceipt(orderId, receipt)
  await sleep('2 days')
  await sendReviewMail(orderId)
  return receipt
}
```
Notes: retry, timeout, and exception classification live in activity options. The workflow itself is clean.
8.2 Inngest (TS)
```typescript
import { Inngest, NonRetriableError } from 'inngest'

const inngest = new Inngest({ id: 'shop' })

export const postPurchaseFlow = inngest.createFunction(
  {
    id: 'post-purchase',
    // `retries` counts retries after the first attempt: 3 retries = 4 attempts in total.
    retries: 3,
  },
  { event: 'order.placed' },
  async ({ event, step }) => {
    const receipt = await step.run('charge', async () => {
      try {
        return await stripe.charge({ orderId: event.data.orderId })
      } catch (e) {
        // `e` is `unknown` under strict TypeScript, so narrow before reading `code`.
        if ((e as { code?: string }).code === 'card_declined') {
          throw new NonRetriableError('card declined')
        }
        throw e
      }
    })
    await step.run('send-receipt', () => mailer.send({
      to: event.data.orderId, body: receipt,
    }))
    await step.sleep('wait-2d', '2 days')
    await step.run('review-mail', () => mailer.send({
      to: event.data.orderId, template: 'review',
    }))
    return receipt
  }
)
```
Notes: retry is a function option, classification is a NonRetriableError throw. The event payload is the workflow input.
8.3 Trigger.dev (TS)
```typescript
import { task, wait, AbortTaskRunError } from '@trigger.dev/sdk/v3'

export const postPurchaseFlow = task({
  id: 'post-purchase-flow',
  retry: {
    maxAttempts: 4,
    factor: 2,
    minTimeoutInMs: 2000,
    maxTimeoutInMs: 30000,
  },
  // A retried attempt re-runs this function from the top, so the charge call
  // should carry an idempotency key (see 12.7).
  run: async (payload: { orderId: string }) => {
    let receipt
    try {
      receipt = await stripe.charge({ orderId: payload.orderId })
    } catch (e) {
      // `e` is `unknown` under strict TypeScript, so narrow before reading `code`.
      if ((e as { code?: string }).code === 'card_declined') {
        throw new AbortTaskRunError('card declined')
      }
      throw e
    }
    await mailer.send({ to: payload.orderId, body: receipt })
    await wait.for({ days: 2 })
    await mailer.send({ to: payload.orderId, template: 'review' })
    return receipt
  },
})
```
Notes: retry is a single option on the task. AbortTaskRunError aborts further retries. Smallest surface area.
8.4 The Same Workflow in Go — Temporal Activity
A multi-language example, leaning into Temporal's strength:
```go
// activities.go
package activities

import (
	"context"
	"errors"

	"go.temporal.io/sdk/temporal"
)

type ChargeInput struct {
	OrderID string
}

type Receipt struct {
	PaymentID string
	Amount    int
}

// stripeClient and stripe.ErrCardDeclined are placeholders for your payment client.
func ChargeCard(ctx context.Context, input ChargeInput) (*Receipt, error) {
	r, err := stripeClient.Charge(ctx, input.OrderID)
	if err != nil {
		if errors.Is(err, stripe.ErrCardDeclined) {
			return nil, temporal.NewNonRetryableApplicationError(
				"card declined", "CardDeclinedError", err,
			)
		}
		return nil, err
	}
	return &Receipt{PaymentID: r.ID, Amount: r.Amount}, nil
}

// workflow.go
package workflows

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"

	"example.com/shop/activities" // placeholder module path
)

// SendReviewMail is assumed to be defined alongside ChargeCard.
func PostPurchaseFlow(ctx workflow.Context, orderID string) (*activities.Receipt, error) {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			MaximumAttempts:    4,
			BackoffCoefficient: 2.0,
			InitialInterval:    2 * time.Second,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var receipt activities.Receipt
	if err := workflow.ExecuteActivity(ctx, activities.ChargeCard, activities.ChargeInput{OrderID: orderID}).Get(ctx, &receipt); err != nil {
		return nil, err
	}
	if err := workflow.Sleep(ctx, 48*time.Hour); err != nil {
		return nil, err
	}
	if err := workflow.ExecuteActivity(ctx, activities.SendReviewMail, orderID).Get(ctx, nil); err != nil {
		return nil, err
	}
	return &receipt, nil
}
```
Same meaning. Temporal's strength is one unified model across Go, Java, Python, .NET, and more.
8.5 Summary — Same but Different
All three engines express "retries with state" cleanly. The differences are the surroundings.
- Temporal: retries are an activity-level option, precise and expressive. Workflow code is the most abstracted.
- Inngest: published events trigger functions automatically. Natural fit when domain events are central.
- Trigger.dev: everything in one file. Fits inside a single Next.js repo.
Chapter 9 · Where AWS Step Functions, Azure Durable Functions, and Cadence Sit
9.1 AWS Step Functions
The classic. JSON or YAML-based Amazon States Language. State machines drawn declaratively. Strength: deep AWS service integration (Lambda, SQS, SNS, DynamoDB take roughly one line). Weakness: JSON is not code, large workflows are hard to maintain, and per-state-transition pricing adds up faster than expected.
The big shift at re:Invent 2025 was AWS Lambda Durable Functions. Write deterministic replay workflows directly inside Lambda — Python 3.12-plus, Node.js 22-plus, TypeScript 5-plus supported. It is both an alternative and a complement to Step Functions.
9.2 Azure Durable Functions
Microsoft's orthodox option. C#, JS, Python, F#, PowerShell. Orchestrator functions must be deterministic, activity functions carry side effects. Same paradigm as Temporal but tied to Azure Functions infrastructure. In 2026 it integrates into .NET Microsoft Agent Framework as the standard for agent workflows.
9.3 Cadence
Temporal's parent. Still core inside Uber, still open source. For new projects, picking Temporal is almost always right — Temporal is the successor and is the active one.
9.4 Where Each Sits
- Deeply tied to AWS or Azure: Step Functions, Lambda Durable, Azure Durable feel natural. Accept the cloud lock-in in exchange for zero ops friction.
- Multi-cloud, on-prem mandate, integration with existing code: Temporal, Restate, Inngest self-hosted.
- Existing Uber code: keep Cadence, but consider Temporal for new workflows.
Chapter 10 · Why It Exploded Between 2024 and 2026 — AI Agents Are the Answer
10.1 Five Drivers
- AI agents run for hours: LLM call then tool call then next LLM, repeated 30 times. Hour-long tasks are normal. One crash, and 50 USD evaporates.
- Asymmetric token cost: compute is cheap, LLM calls are expensive. Checkpointing is mandatory to avoid redoing them.
- Big-tech adoption: OpenAI Codex on Temporal, AWS, Cloudflare, and Vercel shipping their own durable solutions. Market signal got loud.
- Frameworks as first-class citizens: LangGraph, Pydantic AI, OpenAI Agents SDK all adopted DE as a standard, not an option.
- Better developer experience: compared to 5 years ago, Inngest, Trigger.dev, Restate, and DBOS all lowered the entry bar to "one function, one file, one piece of infra."
10.2 An AI Agent Loop — Exactly the Shape DE Solves
```typescript
@workflow
async function researchAgent(query: string) {
  let context = []
  for (let i = 0; i < 30; i++) {
    const plan = await ctx.run('llm-plan', () => llm.plan(query, context))
    if (plan.action === 'done') return plan.answer
    const result = await ctx.run(`tool-${i}`, () => tools.run(plan.tool, plan.args))
    context.push({ plan, result })
    // guard: prevent cost runaway
    if (cost(context) > 5) throw new Error('budget exceeded')
  }
}
```
- Every LLM call and tool call is checkpointed — if it crashes at iteration 30, iterations 1 through 29 are not re-invoked.
- The `for` loop is the workflow intent itself.
- Budget guard is a simple `if` statement.
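The loop above calls a `cost` function without defining it. One possible shape is sketched below. The per-token prices are made-up placeholders, not any provider's rates, and `AgentTurn`, `costUsd`, and `turnsWithinBudget` are hypothetical names.

```typescript
// Hypothetical cost accounting for an agent loop; the prices are placeholders.
type AgentTurn = { inputTokens: number; outputTokens: number };

const ASSUMED_USD_PER_1K_INPUT = 0.003;
const ASSUMED_USD_PER_1K_OUTPUT = 0.015;

function costUsd(turns: AgentTurn[]): number {
  return turns.reduce(
    (sum, t) =>
      sum +
      (t.inputTokens / 1000) * ASSUMED_USD_PER_1K_INPUT +
      (t.outputTokens / 1000) * ASSUMED_USD_PER_1K_OUTPUT,
    0,
  );
}

// Returns how many turns fit before the running total would exceed the budget.
function turnsWithinBudget(turns: AgentTurn[], budgetUsd: number): number {
  const accepted: AgentTurn[] = [];
  for (const turn of turns) {
    if (costUsd(accepted.concat([turn])) > budgetUsd) break;
    accepted.push(turn);
  }
  return accepted.length;
}

// Five identical turns at roughly 0.45 USD each under the assumed prices.
const plannedTurns: AgentTurn[] = Array.from({ length: 5 }, () => ({
  inputTokens: 100_000,
  outputTokens: 10_000,
}));
const affordableTurns = turnsWithinBudget(plannedTurns, 1); // 2
```

Because the token counts come from recorded step results, the guard sees the same totals on replay and reaches the same decision.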
Anyone who has built this with queues and cron knows — debugging hell. DE compresses it to one function.
Chapter 11 · Adoption Decision Tree
Does the workflow exceed 5 minutes?
├── No → DE not needed. HTTP or gRPC is enough.
└── Yes
├── Three-plus external calls?
│ ├── No → A queue plus idempotency key may be enough.
│ └── Yes
│ ├── Need multi-language?
│ │ ├── Yes → Temporal or Restate
│ │ └── No (TS-centric)
│ │ ├── Event-driven domain? → Inngest
│ │ ├── Next.js full-stack? → Trigger.dev
│ │ └── Want one Postgres only? → DBOS
│ └── Self-hosting required?
│ ├── Yes → Temporal, Restate, Inngest, Trigger.dev, DBOS self-hosted
│ └── No → All cloud SaaS options OK
└── Is it an AI agent loop?
├── Yes → Temporal, Inngest, Trigger.dev (these three are strongest)
└── No → Use the general matrix
Extra branches:
- Deep in AWS or Azure? → Start with Lambda Durable Functions or Azure Durable Functions.
- Already on Cadence? → Keep Cadence, evaluate Temporal for new workflows.
- Data sovereignty or regulation (GDPR, HIPAA)? → All five support self-host. Restate has the lowest ops burden.
- Fewer than 100 workflows a day? → DE adoption cost outweighs benefit. A DB column and a simple queue is enough.
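The main branch of the tree above can be written down as a function, which makes it easy to argue about in a design review. The field names and the `recommend` function are hypothetical, and the sketch covers only the duration, external-call, language, and style branches.

```typescript
// Hypothetical encoding of the main decision branch; field names are assumptions.
type TeamNeeds = {
  longerThanFiveMinutes: boolean;
  externalCalls: number;
  multiLanguage: boolean;
  style: "event-driven" | "nextjs" | "postgres-only";
};

function recommend(needs: TeamNeeds): string[] {
  if (!needs.longerThanFiveMinutes) return ["No DE: HTTP or gRPC"];
  if (needs.externalCalls < 3) return ["Queue plus idempotency key"];
  if (needs.multiLanguage) return ["Temporal", "Restate"];
  if (needs.style === "event-driven") return ["Inngest"];
  if (needs.style === "nextjs") return ["Trigger.dev"];
  return ["DBOS"];
}

const nextJsShop = recommend({
  longerThanFiveMinutes: true,
  externalCalls: 4,
  multiLanguage: false,
  style: "nextjs",
});
```

The order of the checks is the real content here: the two cheap exits come first, so most candidate workflows never reach an engine choice at all.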
Chapter 12 · Anti-Patterns
12.1 Calling Time, Random, or HTTP Directly Inside a Workflow
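In deterministic models (Temporal, Restate), calling `Date.now()`, `Math.random()`, or `fetch()` directly inside a workflow can produce different values on replay and break correctness. Go through the engine instead: in Restate's TypeScript SDK that means `ctx.date.now()`, `ctx.rand.random()`, and `ctx.run()`. In Temporal it means activities for I/O plus the SDK's deterministic time and random helpers.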
12.2 Splitting Every Function into a Step
Steps (or activities) carry a checkpointing cost, typically a network round trip or a DB write. Splitting 1 ms pure functions into steps can slow the workflow by orders of magnitude. Reserve step boundaries for external side effects, external calls, and retry units.
12.3 Using DE for ETL, Kafka, and DAGs Too
DE is for transactional, long-running, stateful workflows. Data pipelines fit Airflow, dbt, Spark better. Mixing them makes both awkward.
12.4 Self-Hosting Without Operators
A Temporal cluster on EC2 — done. Six months later Cassandra disk is full and there are no backups. Self-hosted DE is not free. Compare ops cost against cloud SaaS.
12.5 Changing Workflow Code Without Versioning
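Changing the code of a running workflow breaks replay: the new code meets a history recorded by the old code and diverges from it. Engines in the deterministic camp provide versioning tools for this (in Temporal, `patched()` in TypeScript, `GetVersion` in Go, and Worker Versioning). Version every change.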
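The mechanism behind patch-style versioning can be sketched without any engine. New executions record a marker and take the new path. Replays take the new path only if the marker is already in their history. `VersionedHistory` and `receiptStep` are hypothetical names, and real engines expose their own APIs for this.

```typescript
// Hypothetical patch-marker sketch; not the API of any specific engine.
class VersionedHistory {
  constructor(private readonly markers: string[], private readonly replaying: boolean) {}

  // New executions record the marker and take the new path.
  // Replays take the new path only if the marker was recorded originally.
  patched(changeId: string): boolean {
    if (this.replaying) return this.markers.indexOf(changeId) !== -1;
    this.markers.push(changeId);
    return true;
  }
}

function receiptStep(h: VersionedHistory): string {
  return h.patched("receipt-v2") ? "send-html-receipt" : "send-text-receipt";
}

const oldRunMarkers: string[] = []; // this run started before the change shipped
const newRunMarkers: string[] = [];

const replayOfOldRun = receiptStep(new VersionedHistory(oldRunMarkers, true));
const freshRun = receiptStep(new VersionedHistory(newRunMarkers, false));
const replayOfFreshRun = receiptStep(new VersionedHistory(newRunMarkers, true));
```

The old run keeps sending text receipts on replay, while runs started after the deploy use the new branch both live and on replay. Once no old runs remain, the old branch can be deleted.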
12.6 Unbounded Tools or Budget for AI Agents
Checkpointing makes recorded LLM and tool results replayable, but an infinite loop is still expensive even when every step is durable. Tool whitelists and budget guards must live inside the workflow.
12.7 External API Calls Without an Idempotency Key
Activity retries mean external APIs can receive the same call twice. Include an idempotency key in the input, and if the external API supports it, pass it as a header (Idempotency-Key). Stripe documents this header. For APIs that do not, deduplicate on your own side before calling.
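A minimal sketch of the idea, assuming an external API that deduplicates on the key it receives. `FakePaymentApi` and `idempotencyKeyFor` are hypothetical, and a real client would send the key in the Idempotency-Key header instead of as a method argument.

```typescript
// Hypothetical payment API that deduplicates on an idempotency key.
class FakePaymentApi {
  private seen = new Map<string, string>();
  chargesCreated = 0;

  charge(idempotencyKey: string, amountCents: number): string {
    const existing = this.seen.get(idempotencyKey);
    if (existing !== undefined) return existing; // retry: return the original charge
    this.chargesCreated++;
    const chargeId = `ch_${this.chargesCreated}_${amountCents}`;
    this.seen.set(idempotencyKey, chargeId);
    return chargeId;
  }
}

// Derive the key from stable workflow identity, never from time or randomness,
// so every retry of the same step sends the same key.
function idempotencyKeyFor(workflowId: string, stepName: string): string {
  return `${workflowId}:${stepName}`;
}

const paymentApi = new FakePaymentApi();
const chargeKey = idempotencyKeyFor("order-42", "charge");
const firstAttempt = paymentApi.charge(chargeKey, 1999);
const retriedAttempt = paymentApi.charge(chargeKey, 1999); // same key, no second charge
```

Deriving the key from the workflow ID and step name is the convention worth standardizing across a team: it needs no extra storage and it survives worker restarts.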
Epilogue — "Pick an Engine" Is the Wrong Frame; "Buy Workflow Thinking" Is the Right One
The real value of a Durable Execution engine is not the tool. It is seeing the workflow as code. The backend you build with queues and cron and the backend you build with one function rely on completely different mental models.
Adoption checklist:
- List 5 workflow candidates longer than 5 minutes.
- Estimate per workflow: external calls, failure frequency, cost.
- Decide on language SDK (Go or Java needed? TS only? Plus Python?).
- Confirm self-hosting requirement (data sovereignty, regulation).
- Simulate the 6-month bill — executions, concurrency, checkpoints.
- Pick one engine for a 2-week PoC, move one real workflow.
- If using a deterministic model, define versioning policy on day one.
- Standardize idempotency key conventions across the team.
- Agree on one observability surface (web UI or OTel dashboard).
- If running AI agents, put cost guards and tool whitelists inside the workflow.
Anti-patterns short list:
- Direct time, random, or HTTP inside a workflow — breaks deterministic models.
- Splitting 1 ms functions into steps — checkpoint cost becomes the bottleneck.
- Using DE for ETL too — data pipelines belong elsewhere.
- Self-hosting without operators — hidden ops cost is heavy.
- Workflow changes without versioning — breaks live executions.
- Unbounded agent permissions — cost and security both blow up.
- External API calls without idempotency keys — retries double side effects.
Next Up
The next article is a Temporal self-hosted production guide — choosing between Cassandra and Postgres, multi-cluster HA, blue-green worker deploys, Worker Versioning and workflow migration, cost and observability. Six months of breakage from one team, distilled. If your Temporal Cloud bill crosses 5,000 USD a month, that article becomes your next decision point.
Durable Execution is not a tool choice. It is the act of treating time itself as a first-class function input. Buy that thinking, and the tool follows.