Technical Whitepaper · v1.0

Agent Orchestrator

Multi-agent workflows with deterministic replay, durable state, and tool budgets: the reliability tier most agent demos lack.

MIT Licence · TypeScript · Postgres + Drizzle · BullMQ · Deterministic Replay · Inspector UI
Durable State
Deterministic Replay
100% Type-checked
Per-tool Budgets

§ Abstract

The gap between an agent demo and an agent product is durability. A demo runs once, on the happy path, on a developer’s machine. A product runs ten thousand times, on every path, across restarts, across deploys, across the slow death of an upstream API. The leap is not a bigger prompt or a better model. The leap is the infrastructure that survives.

Agent Orchestrator is that infrastructure. Workflows are TypeScript functions; runs are sequences of journaled events stored in Postgres; replay re-derives a run from its events without invoking the model again; tool budgets gate spend before the model is called. A BullMQ queue moves work between steps with retries, backoff, and dead-letter handling. An Inspector UI shows the live graph and lets engineers step through any past run.

This whitepaper documents the architectural decisions: why we journal events rather than checkpoint state, why Postgres rather than a workflow service, why we do not try to be Temporal. It also documents what the orchestrator does not do and probably should not.

1 Executive Summary

Agent Orchestrator is a TypeScript runtime that executes workflows composed of agents and tools. A workflow is a function. The runtime invokes that function inside a journaling shell that records every model call, tool call, and decision as an event in Postgres. If the process dies, the next process picks up the same workflow at the same step by replaying the journaled events up to the last recorded step and continuing from there.

Tool calls have budgets, both count and cost, enforced before the model is invoked. Tools that exceed their budget raise a typed error that the workflow can catch and handle gracefully. Workflows can be paused, resumed, cancelled, and replayed. Replay is bit-for-bit deterministic when the model and tools are deterministic; when they are not, replay uses the journaled outputs rather than re-invoking.

The Inspector UI is a Next.js app connected to the orchestrator over tRPC. It shows the workflow graph, the current state of each in-flight run, the message-level timeline of any past run, and offers a replay-with-edits mode for debugging.

2 Background

The agent ecosystem in 2026 has converged on a workflow-as-graph model. LangGraph, CrewAI, AutoGen, and the OpenAI Agents SDK all describe agents handing work to each other through structured messages. The convergence on the description layer is genuine progress.

The execution layer is where the gap remains. Most frameworks assume a single process, a single host, a single happy path. Long-running workflows, paused workflows, retried workflows, replayed workflows: these are exercises for the reader. The teams that ship agent products end up writing their own Postgres journal, their own queue plumbing, their own retry semantics. We did. Three times. This project is the consolidation.

The closest existing tools are Temporal and Inngest. Both are excellent. Both are designed for general-purpose workflows, with the agent layer being something you build on top. Agent Orchestrator inverts this: it is opinionated for agents specifically, with tool budgets and message journaling as first-class concerns. For teams already running Temporal, the orchestrator is not the right answer. For teams that need an agent runtime with these properties out of the box, it is.

3 Problem in detail

Crash recovery is not optional

Agent workflows take minutes, sometimes hours. Process restarts during a deploy or an OS update will happen. Without durable state, every restart restarts the workflow from the beginning. With cost-bearing tool calls and customer-facing latency, that is not acceptable. Durable state is the price of admission.

Replay is the only way to debug a complex workflow

When a workflow with eight agents and forty tool calls produces a wrong answer, walking through it once at production speed teaches you nothing. Walking through the journaled events offline, pausing at each step, optionally editing the prompt and seeing the difference: that is debugging. The orchestrator is built around this loop.

Tool budgets prevent runaway loops

The classic agent failure mode is a loop. A planner asks a researcher; the researcher returns nothing useful; the planner asks again; round and round. Without budgets, the only stop condition is wall clock. With budgets, the loop hits a typed error and the workflow can choose to abort or fall back. Budgets are non-negotiable in any production deployment.
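A minimal sketch of how a pre-call budget check might turn the planner/researcher loop into a typed error. Every name here (`BudgetRow`, `BudgetExceededError`, `chargeBudget`, `planWithFallback`) is hypothetical and not taken from the orchestrator's API; cost is tracked in integer cents purely to keep the arithmetic exact.

```typescript
// Hypothetical sketch: none of these names are the orchestrator's real API.
type BudgetScope = "count" | "cost";

interface BudgetRow {
  tool: string;
  scope: BudgetScope;
  limit: number; // maximum calls, or maximum spend in cents
  used: number;
}

class BudgetExceededError extends Error {
  readonly code = "BUDGET_EXCEEDED";
  readonly tool: string;
  readonly scope: BudgetScope;
  readonly limit: number;
  readonly attempted: number;

  constructor(tool: string, scope: BudgetScope, limit: number, attempted: number) {
    super(`${tool}: ${scope} budget exceeded (${attempted} > ${limit})`);
    this.tool = tool;
    this.scope = scope;
    this.limit = limit;
    this.attempted = attempted;
  }
}

// Checks a tag rather than the prototype chain, so it works on any compile target.
function isBudgetExceeded(err: unknown): err is BudgetExceededError {
  return err instanceof Error && (err as BudgetExceededError).code === "BUDGET_EXCEEDED";
}

// Called before a tool is invoked. A breach throws and leaves usage untouched.
function chargeBudget(rows: BudgetRow[], tool: string, costCents: number): void {
  const charges: Record<BudgetScope, number> = { count: 1, cost: costCents };
  const relevant = rows.filter((row) => row.tool === tool);
  for (const row of relevant) {
    const next = row.used + charges[row.scope];
    if (next > row.limit) {
      throw new BudgetExceededError(tool, row.scope, row.limit, next);
    }
  }
  for (const row of relevant) {
    row.used += charges[row.scope];
  }
}

// A planner that keeps asking a researcher until the budget stops it.
function planWithFallback(rows: BudgetRow[]): string {
  try {
    for (;;) {
      chargeBudget(rows, "researcher", 2);
      // The researcher call would go here. In this sketch it never returns
      // anything useful, so the planner simply asks again.
    }
  } catch (err) {
    if (isBudgetExceeded(err)) {
      return `fallback after ${err.scope} budget breach`;
    }
    throw err;
  }
}
```

With a count limit of 3 on the researcher, the fourth attempt is rejected before any call is made, and the workflow takes its fallback branch instead of looping until the wall clock runs out.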

4 Goals + non-goals

Goals

  • Durable state for every workflow. No data loss across restarts.
  • Deterministic replay from any journaled step, using the journaled tool outputs.
  • Tool budgets enforced before the model is invoked.
  • Inspector UI for engineers, with live graph state and replay.
  • One Postgres database. One Redis. No third service for state.
  • TypeScript end-to-end. Strict types. Zod-validated boundaries.

Non-goals

  • A general-purpose workflow engine. Use Temporal or Inngest if that is what you need.
  • A prompt-management product. The orchestrator does not version prompts.
  • An LLM. The orchestrator calls models via adapters; default adapters cover the common providers.
  • Multi-region replication. The orchestrator runs in one region. If you need active-active, build it in front.

5 Architecture

Three components. The runtime executes workflows. Postgres stores events. Redis with BullMQ moves jobs between steps. The Inspector UI is a separate process.

The runtime starts a workflow as follows: a new run row is inserted into Postgres with the workflow name and inputs; the runtime invokes the workflow function inside a journaling shell. Every await orch.callModel(...) and await orch.callTool(...) first checks the journal: if this exact step has been recorded, return the recorded output (replay path); otherwise execute, record the result, return the result (live path).
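The replay path and live path described above can be sketched as follows. This is an illustration under stated assumptions, not the runtime's code: an in-memory array stands in for the Postgres journal, `JSON.stringify` stands in for the args hash, and `makeShell`, `JournalEvent`, and the extra `runLive` parameter are hypothetical (the real `orch.callModel` and `orch.callTool` signatures are not given in this whitepaper).

```typescript
// Hypothetical sketch: an in-memory array stands in for the Postgres journal.
type StepKind = "model" | "tool";

interface JournalEvent {
  seq: number; // position of the step within the run
  kind: StepKind;
  target: string; // model or tool identifier
  argsKey: string; // stands in for the args_hash column
  output: unknown;
}

type StepRunner = (args: unknown) => Promise<unknown>;

function makeShell(journal: JournalEvent[]) {
  let cursor = 0; // index of the next step in this execution

  async function step(
    kind: StepKind,
    target: string,
    args: unknown,
    runLive: StepRunner,
  ): Promise<unknown> {
    const seq = cursor++;
    const argsKey = JSON.stringify(args);
    const recorded = journal[seq];
    if (recorded !== undefined) {
      // Replay path: the step must match what was journaled.
      if (recorded.kind !== kind || recorded.target !== target || recorded.argsKey !== argsKey) {
        throw new Error(`journal mismatch at step ${seq}`);
      }
      return recorded.output;
    }
    // Live path: execute, record, return.
    const output = await runLive(args);
    journal.push({ seq, kind, target, argsKey, output });
    return output;
  }

  return {
    callModel: (model: string, args: unknown, runLive: StepRunner) =>
      step("model", model, args, runLive),
    callTool: (tool: string, args: unknown, runLive: StepRunner) =>
      step("tool", tool, args, runLive),
  };
}

type Orchestrator = ReturnType<typeof makeShell>;

// Stand-ins for a model adapter and a tool; both count their live invocations.
let liveCalls = 0;
const fakeModel: StepRunner = async (args) => {
  liveCalls++;
  return `plan for ${String(args)}`;
};
const fakeSearch: StepRunner = async (args) => {
  liveCalls++;
  return [`result for ${String(args)}`];
};

async function researchWorkflow(orch: Orchestrator, topic: string): Promise<unknown> {
  const plan = await orch.callModel("planner", topic, fakeModel);
  return orch.callTool("search", plan, fakeSearch);
}
```

Running `researchWorkflow` twice against the same journal array invokes the stand-in model and tool only on the first pass; the second pass is served entirely from recorded outputs.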

Steps that yield to the queue (long-running tools, sub-workflows, agent handoffs) push a BullMQ job and the runtime suspends the workflow. The job is picked up by another worker, which resumes the workflow by replaying the journal up to the suspension and continuing. This is how the runtime survives a process exit: there is no in-memory state to lose; everything lives in Postgres.
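A rough sketch of suspension and resume-by-replay, under heavy simplification: a plain array stands in for the BullMQ queue, another array for the Postgres journal, and everything is synchronous. `Suspended`, `makeContext`, `drive`, and `workOne` are hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch: arrays stand in for the BullMQ queue and the journal.
interface StepRecord { seq: number; label: string; output: string }
interface ResumeJob { runId: string; seq: number; label: string }

// Thrown to unwind the workflow when a step yields to the queue.
class Suspended {
  readonly job: ResumeJob;
  constructor(job: ResumeJob) {
    this.job = job;
  }
}

function makeContext(runId: string, journal: StepRecord[]) {
  let cursor = 0;
  return {
    // Inline step: replay if journaled, otherwise execute and record.
    inline(label: string, live: () => string): string {
      const seq = cursor++;
      const recorded = journal[seq];
      if (recorded !== undefined) return recorded.output;
      const output = live();
      journal.push({ seq, label, output });
      return output;
    },
    // Queued step: replay if journaled, otherwise suspend the whole workflow.
    queued(label: string): string {
      const seq = cursor++;
      const recorded = journal[seq];
      if (recorded !== undefined) return recorded.output;
      throw new Suspended({ runId, seq, label });
    },
  };
}

type Workflow = (ctx: ReturnType<typeof makeContext>) => string;

// Runs (or re-runs) the workflow from the top; returns undefined if it suspended.
function drive(runId: string, journal: StepRecord[], jobs: ResumeJob[], workflow: Workflow): string | undefined {
  try {
    return workflow(makeContext(runId, journal));
  } catch (signal) {
    if (signal instanceof Suspended) {
      jobs.push(signal.job);
      return undefined;
    }
    throw signal;
  }
}

// What a worker might do with one job: run the slow tool, journal its output,
// then resume the workflow by replaying the journal from the start.
function workOne(
  journal: StepRecord[],
  jobs: ResumeJob[],
  tools: Record<string, () => string>,
  workflow: Workflow,
): string | undefined {
  const job = jobs.shift();
  if (job === undefined) return undefined;
  const tool = tools[job.label];
  if (tool === undefined) throw new Error(`no tool registered for ${job.label}`);
  journal.push({ seq: job.seq, label: job.label, output: tool() });
  return drive(job.runId, journal, jobs, workflow);
}

let planRuns = 0;
const reportWorkflow: Workflow = (ctx) => {
  const plan = ctx.inline("plan", () => {
    planRuns++;
    return "outline";
  });
  const pages = ctx.queued("crawl");
  return ctx.inline("write", () => `report(${plan}, ${pages})`);
};
```

The only things that survive between `drive` and `workOne` are the journal and the job, which is the property the paragraph above relies on: a different worker can finish the run without the planner step executing a second time.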

Postgres tables (simplified)
  runs       (id, workflow, status, started_at, finished_at, tenant_id)
  events     (run_id, seq, kind, agent, tool, args_hash, output, ts)
  budgets    (run_id, tool, scope, limit, used)
  workflows  (id, name, version, hash)
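One way to picture how a row in the events table could identify "this exact step" is a typed row plus a stable hash of the arguments. The sketch below is an assumption throughout: the field names follow the simplified table above, but `canonicalJson`, `argsHash`, `findRecorded`, and the FNV-1a-style hash are illustrative choices, and a real journal would more plausibly use a cryptographic hash such as SHA-256.

```typescript
// Hypothetical sketch: the hashing scheme and helper names are assumptions.
interface EventRow {
  runId: string;
  seq: number;
  kind: "model" | "tool" | "decision";
  agent: string | null;
  tool: string | null;
  argsHash: string;
  output: unknown;
  ts: string; // ISO timestamp
}

// Serialise with sorted object keys so equal arguments always produce equal text.
function canonicalJson(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k])).join(",") + "}";
}

// 32-bit FNV-1a-style hash over UTF-16 code units, rendered as 8 hex digits.
function argsHash(args: unknown): string {
  const text = canonicalJson(args);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Look up the recorded event for a step, if any.
function findRecorded(events: EventRow[], runId: string, seq: number, hash: string): EventRow | undefined {
  for (const row of events) {
    if (row.runId === runId && row.seq === seq) {
      // Same position but different arguments suggests the workflow code
      // has diverged from the journal, so replay should stop.
      if (row.argsHash !== hash) {
        throw new Error(`run ${runId} step ${seq}: args differ from journal`);
      }
      return row;
    }
  }
  return undefined;
}
```

Sorting keys before hashing means `{ a: 1, b: 2 }` and `{ b: 2, a: 1 }` resolve to the same journal entry, which matters when arguments are assembled in a different order on replay.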

6 Key technical decisions

Event journal, not state checkpoint

We considered a checkpoint-based design where the runtime serialises the workflow state at each step. We chose an event journal because it composes better with replay-with-edits: an event journal can be re-derived under different conditions. A snapshot can only be loaded as-is.
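To make the contrast concrete, here is a small sketch of re-deriving a run under an edit. It assumes a journal of step outputs held in memory and edits a recorded output rather than a prompt; `runAgainst`, `replayWithEdit`, and `draftWorkflow` are hypothetical names, not the Inspector's actual mechanism.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the real API.
interface RecordedStep { seq: number; label: string; output: string }
type StepFn = (label: string, live: () => string) => string;
type EditableWorkflow = (step: StepFn) => string;

// Run a workflow against a journal: replay recorded steps, execute the rest live.
function runAgainst(journal: RecordedStep[], workflow: EditableWorkflow): string {
  let cursor = 0;
  return workflow((label, live) => {
    const seq = cursor++;
    const recorded = journal[seq];
    if (recorded !== undefined) return recorded.output;
    const output = live();
    journal.push({ seq, label, output });
    return output;
  });
}

// Fork the journal at editSeq: keep earlier events, substitute the edited
// output, and drop later events so they are re-derived from the edit.
function replayWithEdit(
  original: RecordedStep[],
  editSeq: number,
  editedOutput: string,
  workflow: EditableWorkflow,
): { result: string; journal: RecordedStep[] } {
  const target = original[editSeq];
  if (target === undefined) throw new Error(`no event at seq ${editSeq}`);
  const forked = original.slice(0, editSeq).map((e) => ({ ...e }));
  forked.push({ seq: target.seq, label: target.label, output: editedOutput });
  return { result: runAgainst(forked, workflow), journal: forked };
}

const draftWorkflow: EditableWorkflow = (step) => {
  const outline = step("planner", () => "outline A");
  return step("writer", () => `draft from ${outline}`);
};
```

Editing the planner's recorded output forces the writer step to run again against the new outline, while the original journal is left untouched. A serialised snapshot of final state would carry no record of which later steps depended on the edited one.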

Postgres, not a workflow service

The orchestrator is one Postgres database and one Redis. No bespoke service. The schema is small enough to read in twenty minutes. Migrations are Drizzle. This is a deliberate choice: every team running a TypeScript agent product is already running Postgres. Adding the orchestrator costs them no new operational dependency.

BullMQ, not a custom queue

BullMQ is a mature Redis-backed queue with backoff, concurrency, and DLQ semantics that are exactly what we want. Building our own would have meant reinventing it badly.

Drizzle, not Prisma

Drizzle’s SQL-shaped API is honest about what runs in the database. The migrations are plain SQL files. The introspection works in both directions. For a project that needs predictable database performance, Drizzle is the right choice.

tRPC for the Inspector

The Inspector is a Next.js app connected to the orchestrator runtime. tRPC gives end-to-end type safety with no codegen, which fits the “every line is typed” goal.

7 Implementation milestones

Milestone 1 · journaling shell

The first thing built was the orch.callModel and orch.callTool primitives, with synchronous in-memory journaling. The first workflow was a deterministic three-step pipeline: planner, researcher, writer. Once that worked end-to-end and could be replayed, the rest of the project had a foundation.

Milestone 2 · Postgres persistence

The journal moved into Postgres. Drizzle schemas, migrations, and the run/event tables were added. Crash recovery (kill the process mid-workflow, restart, run resumes) was the acceptance test.

Milestone 3 · BullMQ and step suspension

Long-running tools and sub-workflows were lifted onto BullMQ. The runtime learned to suspend a workflow and resume it on a different worker. This is what made the orchestrator horizontally scalable.

Milestone 4 · tool budgets and Inspector

Tool budgets were added with a typed error path. The Inspector UI was built last, against the stable runtime API. SSE for live tailing, tRPC for queries, replay controls.

8 Lessons / honest limits

Lessons

  • The journal is the product. Every other feature flows from having a high-quality event journal. Get that right first.
  • Replay-with-edits beats live-debug ten times out of ten. Engineers stopped using printf-debugging the moment the Inspector replay tab existed.
  • Budgets are usability, not just safety. A budget breach is a clean, typed error. A runaway loop is a 3am page.

Honest limits

  • Single region. No multi-region replication. Run replicas behind a load balancer if you need HA.
  • Postgres is the bottleneck if you push it. Event-heavy workflows write a lot of rows. The schema is partition-friendly; production users with high throughput will want to partition by month.
  • Replay determinism requires deterministic tools. Non-deterministic tools (web search, time) are journaled with their outputs, so replay is reproducible at the workflow level even when the underlying call is not.
  • Not a Temporal replacement. If you need cron, signals, or sagas across half a dozen languages, use Temporal.

9 Conclusion

Agent Orchestrator gives you the production tier of an agent product without writing it three times. Durable state, deterministic replay, tool budgets, and an Inspector that the engineers will actually use. Workflows are TypeScript; the runtime is small enough to read; the schema is small enough to operate. Most teams ship faster with this than with a bigger framework, because the parts that fail in production are the parts already solved.

The repository is MIT licensed. The wiki includes a workflow authoring guide, an adapter authoring guide, a Postgres operations note, and a Temporal-comparison page so you can decide which tool fits your case.

Agent Orchestrator · Built by Sarma Linux · MIT licensed