12 min read · workflow · distributed-systems · reliability

Durable Workflows in 2026: Temporal vs Restate vs Inngest — and When to Skip All Three

Temporal, Restate, and Inngest all promise durable execution. Here's what each costs you in practice, and when a plain state machine with an outbox still wins.

Rahul Gupta
Senior Software Engineer

Durable workflow engines are the most over-adopted piece of backend infrastructure I see in 2026.

Not because they are bad. Temporal, Restate, and Inngest are genuinely good at the narrow thing they do. The problem is that teams keep using them as a substitute for thinking about state, and the tax shows up six months later when nobody wants to touch the workflow code.

Durable execution is a specific solution to a specific problem: multi-step business processes that must survive process crashes, with auditable retries and compensation. That is it. If your problem is not that, you are buying a lot of machinery you will pay for later.

This post is about when those three engines pull their weight, where each one actually hurts, and why a plain state machine backed by an outbox table is still the right call more often than vendor slide decks admit.

1. What “durable execution” actually means

The phrase gets thrown around so loosely that it is worth pinning down.

Durable execution is a runtime model where your workflow code is treated as deterministic, and the engine records every interaction with the outside world as an event in a history log. When the process crashes, the engine replays the history, skips the side effects that already happened, and resumes exactly where it left off.

The mental model is:

  • your workflow is a function
  • every await on an external call is a checkpoint
  • the engine stores the result of each checkpoint
  • on replay, checkpoints return their recorded result instead of re-executing
  • the workflow reaches the same state it was in before the crash

This is the “code as workflow” promise. You write what looks like normal imperative logic, and the engine makes it crash-safe without you manually persisting state at every step.

The catch is that it only works because your code is required to be deterministic. No Math.random(), no new Date() outside engine APIs, no reading env vars mid-flow. Break determinism and replay produces a different path than the recorded history, and the engine refuses to continue.

That constraint is the entire tax. Everything in the rest of this post flows from it.
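The replay loop described above fits in a few lines. This is an illustrative sketch, not any engine's actual implementation: a `Replayer` class (hypothetical name) that records each checkpoint result in a history array and, on replay, returns the recorded result instead of re-executing the side effect.

```typescript
// Minimal sketch of replay-based durable execution (illustrative only,
// not any real engine's implementation).
type History = unknown[];

class Replayer {
  private cursor = 0;
  constructor(private history: History) {}

  // A checkpoint: on replay, return the recorded result; otherwise
  // execute the side effect and record its result in the history.
  async step<T>(fn: () => Promise<T>): Promise<T> {
    if (this.cursor < this.history.length) {
      return this.history[this.cursor++] as T; // replayed: fn is skipped
    }
    const result = await fn();
    this.history.push(result);
    this.cursor++;
    return result;
  }
}

// Demo: run once, simulate a crash, then resume against the same history.
async function demo() {
  const history: History = [];
  let charges = 0;
  const chargeCard = async () => { charges++; return "charge-123"; };

  const run1 = new Replayer(history);
  await run1.step(chargeCard);            // executes, recorded in history

  const run2 = new Replayer(history);     // restarted worker, same history
  const id = await run2.step(chargeCard); // replayed: card NOT charged again

  return { charges, id };
}
```

The entire determinism requirement falls out of this structure: replay only works if the code asks for the same checkpoints in the same order as the recorded run.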

2. Why this problem is real

I do not want to under-sell the need here. There is a category of work that genuinely benefits from a durable engine:

  • onboarding flows with 12 steps across 5 external systems
  • payout pipelines where each step moves money and each retry must be idempotent
  • long-running approvals that wait hours or days for a human signal
  • scheduled jobs that must not double-fire across deploys
  • saga-style compensations across services that do not share a database

The common thread: the process has to survive things your HTTP handler cannot. A deploy rolling pods mid-flight. A consumer crash after the third step. A human who takes four days to click approve.

You can build that yourself. Teams have, for decades. The question is whether the tax of building it yourself is worth avoiding the tax of adopting an engine.

3. Temporal — what you get and what it costs

Temporal is the heavyweight. Descended from Uber’s Cadence, it has the most mature SDK story (TypeScript, Go, Java, Python, .NET, PHP), the best visibility UI, and genuinely impressive primitives: signals, queries, child workflows, continue-as-new, versioning APIs, schedules.

What you actually get:

  • deterministic replay that works reliably across many language runtimes
  • a cluster that stores history in Cassandra, MySQL, or Postgres
  • a web UI that lets you inspect any workflow run, step by step
  • signals and queries for in-flight interaction
  • activity retries with configurable backoff, timeout, and heartbeat

What it costs:

  • cluster operations. Temporal Cluster is not trivial. It has frontend, history, matching, and worker services, plus Elasticsearch for advanced visibility. Temporal Cloud removes this pain at a price. Self-hosting it well is a part-time SRE job.
  • versioning gymnastics. Once a workflow is in production, changing its code is a careful exercise. You use patched() or getVersion() to branch logic based on deploy history, because old in-flight workflows must continue to replay cleanly against their recorded history. Teams that skip this break in-flight workflows on the next deploy.
  • local dev friction. Running a dev cluster, wiring the worker, and remembering to register every workflow and activity file adds real overhead compared to a plain HTTP handler.
  • vendor-shaped thinking. Once you have Temporal, every orchestration problem starts to look like a workflow. That is not always the right frame.

Temporal is the right choice when the workflow is the product surface. Human-in-the-loop approvals, multi-day sagas, anything with child workflows and signals. It is a bad choice when you are using it to orchestrate three HTTP calls with a retry.

4. Restate — the simpler model

Restate is the newer entrant, and its pitch is simpler by design.

Instead of a dedicated cluster and a separate worker registry, Restate runs as a single binary that sits in front of your services and journals invocation state in its own embedded store. Your service exposes durable handlers over HTTP or gRPC, and Restate intercepts calls, stores journaled events, and replays them on failure.

What you get:

  • simpler operational footprint — one binary with its own embedded store
  • HTTP-native. Your durable handler is still an HTTP endpoint
  • virtual objects — a keyed concurrency model that gives you per-key serialization without a lock service
  • a programming model that feels closer to “normal” code than Temporal’s SDK ceremony

What it costs:

  • newer, smaller ecosystem. Fewer battle-tested patterns, fewer blog posts when something weird happens at 2 a.m.
  • feature gaps versus Temporal. Schedule semantics, visibility tooling, and advanced retry policies are still maturing.
  • less mature versioning story. You still have the determinism tax, and the escape hatches are less established.
  • runtime still needs care. “Simpler” does not mean “free”. You still have a stateful system in the critical path.

Restate is attractive when you want durable execution without running a Temporal cluster, and your workflows are short-ish and not complex enough to exercise the edges of the platform. It is less attractive when you need deep visibility, a mature versioning strategy, or multi-language SDK parity.

5. Inngest — the serverless angle

Inngest is a different shape again. It is built for the serverless and edge runtime world: the managed runtime invokes your function over HTTP, memoizes each completed step, and re-invokes the function on the next step, so durability comes from replaying memoized steps rather than from a long-lived worker.

What you get:

  • excellent developer experience. inngest dev is the best local loop of the three
  • first-class serverless fit. Functions deploy with your Next.js, Vercel, Cloudflare, or AWS Lambda app
  • flow control built in: concurrency limits, rate limiting, debounce, throttle, priority, all declarative
  • event-triggered and cron-triggered workflows in one model
  • a web UI that is genuinely pleasant

What it costs:

  • opinionated runtime. Your durable code runs inside Inngest’s step.run, step.sleep, step.waitForEvent primitives. The surface is narrower than Temporal’s.
  • vendor lock-in. The managed service is the product. Self-hosting exists but is second-class.
  • cost curve at scale. Per-step pricing is fine at small volume. It becomes a line item in FinOps reviews when you hit millions of steps per day.
  • less suited to long-running, human-in-the-loop sagas. It can do them, but Temporal still has the edge here.

Inngest is great when the team is small, the runtime is already serverless, and the workflows are mostly event-driven pipelines with fan-out, retries, and scheduled work. It is painful when you outgrow its pricing model or need primitives it does not expose.

6. The shared tax across all three

Before arguing for skipping them, it is worth being clear about what you buy no matter which one you pick.

  • determinism discipline. You cannot read the clock, random numbers, env vars, or make uncontrolled I/O calls inside workflow code. Every team eventually has a junior dev do this by accident and lose a weekend to it.
  • a new mental model for errors. A thrown exception in workflow code is not a bug, it is a retry. Teams have to learn that.
  • versioning pain on every workflow change. In-flight instances outlive your deploys. Changes to step order, step count, or step signatures are breaking changes to the history log.
  • observability split-brain. Your regular APM sees the process. The workflow engine sees the logical run. Correlating them is extra work.
  • infrastructure you now own. Managed or not, you now have a critical system in the path of business-critical flows.

None of this is a dealbreaker. All of it is cost. Be honest about it.
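The determinism bullet is worth making concrete. Engines catch violations because the history records which step ran at each position, and replay verifies the code asks for the same step in the same order. A sketch of that check (illustrative; real engines record richer event types):

```typescript
// Sketch of nondeterminism detection during replay (illustrative only):
// the history stores the step name at each position, and replay verifies
// that the code requests the same step in the same order.
type RecordedEvent = { step: string; result: unknown };

function replayStep(
  history: RecordedEvent[],
  cursor: number,
  step: string,
): unknown {
  const recorded = history[cursor];
  if (recorded === undefined) {
    throw new Error(`no recorded event at position ${cursor}`);
  }
  if (recorded.step !== step) {
    // e.g. a Math.random() branch sent the code down a different path
    // than the recorded run, so the histories can no longer line up
    throw new Error(
      `nondeterminism: history says ${recorded.step}, code ran ${step}`,
    );
  }
  return recorded.result; // replay: return recorded result, skip execution
}
```

This is why a stray `Math.random()` in workflow code does not fail loudly at write time — it fails at replay time, on the next crash or deploy, which is exactly when you least want a surprise.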

7. The durable workflow handler — what the code actually looks like

Here is a simplified Temporal-style TypeScript workflow for an onboarding flow. It is representative of what the “code as workflow” model looks like across all three engines, with engine-specific syntax differences.

TypeScript
// workflows/onboarding.ts
import { proxyActivities, sleep } from "@temporalio/workflow";
import type * as activities from "../activities";
 
const {
  createAccount,
  sendWelcomeEmail,
  provisionWorkspace,
  notifyCRM,
  scheduleFollowUp,
} = proxyActivities<typeof activities>({
  startToCloseTimeout: "1 minute",
  retry: {
    initialInterval: "2s",
    maximumAttempts: 5,
    backoffCoefficient: 2,
  },
});
 
export async function onboardingWorkflow(input: {
  userId: string;
  email: string;
  plan: "free" | "pro";
}) {
  const account = await createAccount(input);
 
  await sendWelcomeEmail({
    email: input.email,
    accountId: account.id,
  });
 
  const workspace = await provisionWorkspace({
    accountId: account.id,
    plan: input.plan,
  });
 
  await notifyCRM({
    accountId: account.id,
    workspaceId: workspace.id,
    plan: input.plan,
  });
 
  // wait three days, durably, then send a nudge
  await sleep("72 hours");
  await scheduleFollowUp({ accountId: account.id });
 
  return { accountId: account.id, workspaceId: workspace.id };
}

The appeal is real. That sleep("72 hours") is not setTimeout. The engine persists the fact that the workflow is asleep, releases the worker, and wakes it up 72 hours later on any healthy worker in the fleet. If the process running this workflow dies mid-flow, a replacement worker picks up the history and resumes from exactly the right step.

Every await is a checkpoint. Every activity call is recorded. Every retry is bounded. No manual state table. No manual retry scheduler. No manual resume logic.

That is what you are paying for.

8. The DIY equivalent — state machine and outbox

Now the same flow, done with Postgres and a worker. No engine. No cluster. No SDK to learn.

Go
// a single table represents the workflow state
// CREATE TABLE onboarding_runs (
//   id           uuid primary key,
//   user_id      text not null,
//   email        text not null,
//   plan         text not null,
//   state        text not null,        -- current step
//   attempt      int  not null default 0,
//   next_run_at  timestamptz not null, -- when to pick this up
//   account_id   text,
//   workspace_id text,
//   last_error   text,
//   created_at   timestamptz not null default now(),
//   updated_at   timestamptz not null default now()
// );
// CREATE INDEX ON onboarding_runs (state, next_run_at);
 
func stepOnboarding(ctx context.Context, db *sql.DB, run Run) error {
    switch run.State {
    case "created":
        acc, err := createAccount(ctx, run)
        if err != nil {
            return markRetry(db, run, err)
        }
        return advance(db, run, "account_created", map[string]any{
            "account_id": acc.ID,
        })
 
    case "account_created":
        if err := sendWelcomeEmail(ctx, run); err != nil {
            return markRetry(db, run, err)
        }
        return advance(db, run, "email_sent", nil)
 
    case "email_sent":
        ws, err := provisionWorkspace(ctx, run)
        if err != nil {
            return markRetry(db, run, err)
        }
        return advance(db, run, "workspace_ready", map[string]any{
            "workspace_id": ws.ID,
        })
 
    case "workspace_ready":
        if err := notifyCRM(ctx, run); err != nil {
            return markRetry(db, run, err)
        }
        // schedule the 72h follow-up by pushing next_run_at forward
        return advanceAt(db, run, "awaiting_followup",
            time.Now().Add(72*time.Hour))
 
    case "awaiting_followup":
        if err := scheduleFollowUp(ctx, run); err != nil {
            return markRetry(db, run, err)
        }
        return advance(db, run, "done", nil)
 
    case "done":
        return nil
    }
    return fmt.Errorf("unknown state: %s", run.State)
}

A single worker loop polls onboarding_runs WHERE state != 'done' AND next_run_at <= now() ORDER BY next_run_at LIMIT N FOR UPDATE SKIP LOCKED, calls stepOnboarding, and commits.

Every side effect is paired with an outbox row inside the same transaction as the state advance, so retries are safe as long as the side effects are idempotent. The engine here is Postgres plus a worker. That is the whole thing.
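The markRetry helper referenced above reduces to one UPDATE plus a backoff computation. Here is a sketch of the schedule math (hypothetical helper names; the constants mirror the retry policy from the Temporal example — 2s initial interval, coefficient 2, 5 attempts):

```typescript
// Hypothetical backoff helper for the DIY markRetry: exponential delay
// with a cap, mirroring the retry policy in the Temporal example.
const INITIAL_MS = 2_000;
const COEFFICIENT = 2;
const MAX_ATTEMPTS = 5;
const CAP_MS = 60_000;

// Delay before the next attempt, or null when the run should be parked
// in a terminal "failed" state for a human to look at.
function retryDelayMs(attempt: number): number | null {
  if (attempt >= MAX_ATTEMPTS) return null;
  return Math.min(INITIAL_MS * COEFFICIENT ** attempt, CAP_MS);
}

// markRetry is then a single statement against the state row:
//   UPDATE onboarding_runs
//      SET attempt = attempt + 1,
//          next_run_at = now() + make_interval(secs => $1),
//          last_error = $2
//    WHERE id = $3;
```

Business-hours-only retries or tenant-aware rate limits are just different arithmetic on `next_run_at`, which is the flexibility argument from earlier in concrete form.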

Is it as elegant as the Temporal version? No.

Is it harder to reason about over time? That depends on how many workflows you have, not how complex any one is.

Does it survive crashes, deploys, and restarts? Yes. The state is in Postgres. The worker is stateless. Pods crash, workers pick up the next row.

9. Where the DIY version actually wins

The DIY state machine beats all three engines in these cases:

  • low workflow count. If you have 3–5 long-running flows, not 50, the engine overhead is larger than the code you are avoiding.
  • SQL is already your source of truth. When the workflow state lives next to the business data it touches, joins, reports, and admin tools are trivial.
  • your ops team already runs Postgres well. Adding another stateful system has a nonzero cost. Not adding one has zero cost.
  • workflows rarely change. If the step graph is stable, the versioning tax of a durable engine is pure overhead you do not need.
  • you need custom scheduling semantics. Business-hours-only retries, tenant-aware rate limits, priority lanes — all trivial in SQL, all awkward in opinionated engines.

Teams underestimate how far a well-structured state machine with an outbox goes. A few hundred lines of Go or TypeScript, one table, one worker, idempotent side effects, and you have most of the durability you actually needed. (Idempotency keys and dedupe tables deserve their own post — treat side effects as retry-safe by default and you avoid most of the pain here.)
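The "retry-safe by default" advice above comes down to keyed deduplication. A minimal sketch, using an in-memory Map as a stand-in for a unique-keyed Postgres table (names hypothetical):

```typescript
// Minimal idempotency sketch: completed results keyed by (runId, step),
// so a retried step re-uses the first successful result instead of
// re-running the side effect. In production this Map is a Postgres
// table with a unique key, written in the same transaction as the
// state advance.
const completed = new Map<string, unknown>();

async function idempotent<T>(key: string, fn: () => Promise<T>): Promise<T> {
  if (completed.has(key)) {
    return completed.get(key) as T; // already done on a previous attempt
  }
  const result = await fn();
  completed.set(key, result);
  return result;
}

// Usage: retrying "run-42/send_email" sends at most one email.
async function demoIdempotency(): Promise<number> {
  let sends = 0;
  const sendEmail = async () => { sends++; return "sent"; };
  await idempotent("run-42/send_email", sendEmail);
  await idempotent("run-42/send_email", sendEmail); // retry: no-op
  return sends;
}
```

With a database table instead of a Map, the unique key also makes the check atomic across concurrent workers, which the in-memory sketch deliberately glosses over.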

10. Where the engines actually win

I am not saying skip them always. The engines are worth it when:

  • the workflow has human-in-the-loop steps measured in hours or days, not seconds. Temporal signals and Inngest waitForEvent shine here.
  • the step graph is large and dynamic. Dozens of steps, child workflows, fan-out and fan-in. Writing this as a state machine is possible but painful.
  • you need first-class visibility UI. Product, ops, and support staring at a run history to debug customer issues is worth real money.
  • versioning is a feature, not a tax. When you need parallel workflow versions serving different cohorts, engines give you primitives. SQL does not.
  • the team is growing. A shared engine enforces discipline that a hand-rolled state machine does not. “Everybody use this pattern” holds up less well than “everybody use this SDK.”

The test I use: if I can describe the workflow as “a function with awaits” on a whiteboard and the person listening nods, an engine is probably worth it. If I find myself describing it as “a few rows that advance through states,” skip the engine.

11. The decision rubric

Here is the question set I walk through before picking anything:

  • Does any single workflow instance need to survive longer than a typical process lifetime?
  • Does it wait for external signals measured in hours or days?
  • Is the step graph larger than what you can draw on half a whiteboard?
  • Do you need to run thousands of concurrent instances, each in its own state?
  • Is the team big enough that shared primitives beat hand-rolled patterns?
  • Are your side effects already idempotent, or do you need the engine to enforce it?

If the answer to most of those is yes, pick an engine:

  • Pick Temporal when maturity, multi-language SDKs, and deep visibility matter more than operational simplicity.
  • Pick Restate when you want durable execution with less cluster overhead and your workflows are not at the edges of the platform.
  • Pick Inngest when your runtime is serverless, your team is small, and event-driven flow-control is a first-class need.

If the answer to most of those is no, a Postgres-backed state machine with an outbox is almost certainly the right call.

12. The anti-pattern to watch for

There is one failure mode I see more than any other: teams adopt a durable engine because Kubernetes is crashing their pods.

The engine does not fix that. It makes it less visible, and more expensive.

If your pods are crashing because of memory leaks, bad health checks, or botched deploys, the right fix is to stop the pods from crashing. Wrapping unstable code in durable-execution retries turns operational chaos into expensive retry storms and hides the actual problem.

Durable engines are for flows that logically span many processes over long time horizons. They are not a bandage for infrastructure instability.

13. A pragmatic takeaway

Durable workflow engines solve a narrow, real problem. Treat them as infrastructure you adopt when the workflow is the product surface, not as a default layer between your API and your database.

Before reaching for Temporal, Restate, or Inngest, ask three questions:

  1. Is this workflow really going to outlive a single process, for reasons deeper than “a pod might restart”?
  2. Would a state table plus a worker plus idempotent side effects handle this cleanly?
  3. Can the team carry the versioning and determinism tax on every code change forever?

If you answer honestly and still land on an engine, pick the one whose tradeoffs match your team's shape and move on. If the workflow does not truly outlive a process, or a state table handles it cleanly, or the team cannot carry the ongoing tax, the smaller tool usually wins.

The best durable-workflow architecture is the one your on-call engineer can still reason about at 3 a.m. during an incident. That is usually closer to “a few states in Postgres” than vendor marketing wants you to believe.
