agent-orchestrator[1] is a multi-agent workflow engine built on Postgres, BullMQ, and TypeScript. It ships deterministic replay for debugging, an Inspector UI for workflow visualisation, and durable execution that survives worker crashes. This post is the design rationale and the lessons from 18 months of using it.
The problem with naive agent pipelines
A naive agent pipeline looks like:
1. Call the LLM
2. Parse the tool call from the response
3. Execute the tool
4. Loop back to 1
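In code, the skeleton is a few lines. This is a deliberately naive sketch; callLLM, parseToolCall, and executeTool are hypothetical stand-ins for your client and tool layer, not anything from agent-orchestrator:

```ts
// Hypothetical stand-ins for the LLM client and tool layer.
declare function callLLM(context: string): Promise<string>;
declare function parseToolCall(response: string): { tool: string; args: unknown } | null;
declare function executeTool(call: { tool: string; args: unknown }): Promise<unknown>;

// The naive loop: no persistence, no recovery, no concurrency story.
async function runAgent(goal: string): Promise<string> {
  let context = goal;
  while (true) {
    const response = await callLLM(context);      // 1. call the LLM
    const toolCall = parseToolCall(response);     // 2. parse the tool call
    if (!toolCall) return response;               // no tool call means a final answer
    const result = await executeTool(toolCall);   // 3. execute the tool
    context += `\n${JSON.stringify(result)}`;     // 4. loop back to 1 with the result
  }
}
```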
This works for demos. It fails in production because:
- If the worker crashes between step 3 and step 4, the tool has been called but the result is not recorded. On restart, the tool gets called again. For idempotent tools this is fine; for "send an email" or "charge a card" it is not.
- Debugging is hard. "The agent did something wrong" — which call? What was the context at the time? What branch of logic was taken?
- Scaling is hard. If the pipeline is a synchronous function, one agent run blocks one worker. To run 100 agents in parallel you need 100 workers, or async plumbing.
Durable workflow engines solve the first two. BullMQ solves the third.
Postgres as the source of truth
Every workflow event is written to Postgres before the action it describes is taken[3]. The write-ahead pattern:
1. Write event to the workflow_events table: { workflow_id, step, type: 'tool_call_start', payload: { tool, args } }
2. Execute the tool call
3. Write event: { type: 'tool_call_complete', payload: { result } }
On worker crash between step 2 and step 3, the recovery path reads the last event for the workflow. It finds a tool_call_start with no corresponding tool_call_complete. It re-executes step 2, but only if the tool is idempotent (flagged in the tool registry). If the tool is not idempotent, it surfaces the ambiguity to the operator rather than guessing whether the side effect happened.
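A minimal sketch of both halves, assuming a db helper over workflow_events and a tool registry that flags idempotency (every name here is illustrative, not the repo's actual API):

```ts
// Hypothetical persistence and registry helpers.
declare const db: {
  insertEvent(e: { workflowId: string; step: number; type: string; payload: unknown }): Promise<void>;
  lastEvent(workflowId: string): Promise<{ type: string; payload: any } | null>;
};
declare const registry: {
  isIdempotent(tool: string): boolean;
  invoke(tool: string, args: unknown): Promise<unknown>;
};

async function runToolStep(workflowId: string, step: number, tool: string, args: unknown) {
  // 1. Record intent before acting: the write-ahead part.
  await db.insertEvent({ workflowId, step, type: 'tool_call_start', payload: { tool, args } });
  // 2. Execute the tool.
  const result = await registry.invoke(tool, args);
  // 3. Record completion.
  await db.insertEvent({ workflowId, step, type: 'tool_call_complete', payload: { result } });
  return result;
}

async function recover(workflowId: string) {
  const last = await db.lastEvent(workflowId);
  if (last?.type === 'tool_call_start') {
    // Crashed between execute and complete: ambiguous unless idempotent.
    if (registry.isIdempotent(last.payload.tool)) {
      return registry.invoke(last.payload.tool, last.payload.args); // safe to redo
    }
    throw new Error('non-idempotent tool in ambiguous state; operator must resolve');
  }
}
```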
This is not novel — it is the write-ahead log pattern from database internals, applied to agent steps.
The schema is simple: workflow_runs, workflow_events, workflow_tasks. Drizzle handles migrations. The whole schema is under 150 lines.
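For flavour, the events table in Drizzle looks roughly like this (column names inferred from the event shape above, not copied from the repo):

```ts
import { pgTable, serial, text, integer, jsonb, timestamp } from 'drizzle-orm/pg-core';

// Rough shape of workflow_events; the real schema lives in the repo.
export const workflowEvents = pgTable('workflow_events', {
  id: serial('id').primaryKey(),
  workflowId: text('workflow_id').notNull(),
  step: integer('step').notNull(),
  type: text('type').notNull(),          // e.g. 'tool_call_start', 'tool_call_complete'
  payload: jsonb('payload'),
  createdAt: timestamp('created_at').defaultNow().notNull(),
});
```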
Deterministic replay
Source: My own records, 14 production bugs since launch
These are my own records, not a controlled study. With replay, debugging the timeout race condition took 22 minutes; without replay, my estimate is around 3 hours, because you cannot reliably reproduce time-dependent race conditions without a recording.
Replay works as follows. Given a workflow_id, the system re-executes the workflow from the beginning, but instead of calling live LLMs or tools, it reads responses from the workflow_events log. The workflow code runs exactly as it ran in production: same branches taken, same tool call args, same LLM outputs. The only differences are that execution is instant (no network calls) and deterministic (same result every time).
You can inject a fault at any step: "replay up to step 7, then substitute this modified tool result and continue." This is how you test a fix for a bug without touching production.
The replay runner is lib/replay/runner.ts. It intercepts the LLM client and tool registry at the module boundary, substituting recorded results.
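The interception idea, sketched (the real runner is lib/replay/runner.ts; the interface here is hypothetical):

```ts
// An LLM client in replay mode: instead of a network call, return the next
// recorded response for this workflow from the event log.
interface LLMClient {
  complete(prompt: string): Promise<string>;
}

function replayLLMClient(recorded: string[]): LLMClient {
  let cursor = 0;
  return {
    async complete(_prompt: string) {
      if (cursor >= recorded.length) throw new Error('replay ran past the end of the recording');
      return recorded[cursor++]; // the same output the production run saw, instantly
    },
  };
}
```

Fault injection then falls out of the same mechanism: substitute one entry in the recorded sequence (or one tool result) before starting the replay, and the workflow continues from the modified state.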
Why BullMQ over Temporal
The main alternative for durable workflows is Temporal[4]. Temporal is excellent and I evaluated it seriously. I chose BullMQ[2] for three reasons:
Operational simplicity. Temporal requires running a server (the Temporal service), a worker, and a client. The Temporal service is a significant operational dependency — it is its own distributed system. BullMQ runs on Redis and Postgres, which most teams already have.
No workflow SDK lock-in. Temporal's programming model is the workflow SDK: workflow.sleep, workflow.executeActivity, etc. Your workflow code is written inside Temporal's mental model. BullMQ is a job queue; the workflow logic is plain TypeScript with no SDK-specific constructs. Migrating off BullMQ later does not require rewriting workflow code.
Scale point. For the concurrency levels agent-orchestrator targets (tens of thousands of tasks per hour, not millions), BullMQ on Redis is sufficient.
The tradeoff: Temporal handles failure modes at the workflow level more elegantly than BullMQ does. BullMQ is a job queue with retry semantics; durability at the workflow level is the layer I built on top. Temporal builds durability into the SDK. If I were building for millions of concurrent workflows, I would revisit Temporal.
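Concretely, the BullMQ side is just a queue and a worker; the logic inside the processor is plain TypeScript. Queue name and payload shape here are illustrative:

```ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Producer: enqueue a workflow step as an ordinary job.
const tasks = new Queue('workflow-tasks', { connection });
await tasks.add(
  'run-step',
  { workflowId: 'wf_123', step: 4 },
  { attempts: 3, backoff: { type: 'exponential', delay: 1000 } },
);

// Consumer: the processor is plain TypeScript, with no workflow SDK
// constructs to migrate off later.
new Worker(
  'workflow-tasks',
  async (job) => {
    const { workflowId, step } = job.data;
    // look up the step definition, run it through the write-ahead wrapper,
    // enqueue the next step when it completes
  },
  { connection, concurrency: 20 },
);
```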
Concurrency and throughput
Source: My own bench — Hetzner CAX31 x3, GPT-4o mini tasks averaging 1.2s each, 2026-03
These throughput numbers are from a bench on three Hetzner CAX31 instances, running GPT-4o mini tasks that average 1.2 seconds per round trip. At c=20 (20 concurrent workers per pod, 60 total), the system completes about 38,600 tasks per hour with an error rate under 1 percent. At c=40 per pod, throughput climbs to 58,000 tasks/hour but the error rate reaches 2 percent, mostly rate-limit responses from the LLM provider.
The practical operating point for most workloads is c=10 to c=20 per pod. Beyond that, rate-limit errors dominate and the throughput gain is marginal.
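When rate limits are the binding constraint, BullMQ's worker-level limiter caps job starts directly. A sketch; the max/duration values are assumptions to tune per provider:

```ts
import { Worker, Job } from 'bullmq';

declare function processStep(job: Job): Promise<void>; // hypothetical processor

// c=20 per pod, plus a queue-wide rate cap so bursts do not trip the
// LLM provider's limits.
new Worker('workflow-tasks', processStep, {
  connection: { host: 'localhost', port: 6379 },
  concurrency: 20,                        // parallel jobs per worker process
  limiter: { max: 300, duration: 1000 },  // at most 300 job starts per second
});
```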
The Inspector UI
The Inspector UI is a Next.js app that renders workflow graphs as trees. Each node is a workflow step; colour indicates status (pending, running, complete, failed). Clicking a node shows the full event log for that step: LLM prompt, response, tool args, tool result.
The UI reads directly from the Postgres workflow_events table. There is no dedicated API; the Next.js route handlers query Postgres. For a monitoring tool that reads mostly historical data, this is sufficient.
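A route handler in this style is a few lines. A sketch, assuming the Drizzle table sketched earlier and a db client; the path and module names are mine, not the repo's:

```ts
// app/api/workflows/[id]/events/route.ts (illustrative path)
import { NextResponse } from 'next/server';
import { asc, eq } from 'drizzle-orm';
import { db } from '@/lib/db';                  // assumed Drizzle client
import { workflowEvents } from '@/lib/schema';  // table sketched earlier

export async function GET(_req: Request, { params }: { params: { id: string } }) {
  // Full event log for one workflow, in step order.
  const events = await db
    .select()
    .from(workflowEvents)
    .where(eq(workflowEvents.workflowId, params.id))
    .orderBy(asc(workflowEvents.step), asc(workflowEvents.id));
  return NextResponse.json(events);
}
```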
What I would do differently
Idempotency keys on every tool call. I added idempotency key support late. If I built it again, every tool call would carry a caller-generated idempotency key from day one. Tools that cannot accept an idempotency key are not suitable for durable workflows.
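The shape is small; a sketch of what "from day one" would look like (names are illustrative):

```ts
import { randomUUID } from 'node:crypto';

// Every tool call carries a caller-generated key; the tool (or a dedupe
// table in front of it) uses the key to make retries safe.
interface ToolCall {
  tool: string;
  args: unknown;
  idempotencyKey: string; // generated once, reused on every retry of this step
}

const call: ToolCall = {
  tool: 'send_email',
  args: { to: 'user@example.com' },
  idempotencyKey: randomUUID(),
};
```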
Event schema versioning. Early events in the log use a slightly different shape from later ones. The replay runner handles both but the branching is ugly. A formal event schema version field from the start would have avoided this.
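Concretely, this is one field and a discriminated union instead of shape-sniffing. The shapes below are hypothetical, not the repo's actual events:

```ts
// With an explicit version, the replay runner dispatches on e.v instead of
// guessing the shape from which fields happen to be present.
type EventV1 = { v: 1; type: string; payload: unknown };
type EventV2 = { v: 2; type: string; payload: unknown; stepId: string };
type WorkflowEvent = EventV1 | EventV2;

function stepIdOf(e: WorkflowEvent): string | undefined {
  return e.v === 2 ? e.stepId : undefined; // v1 events never carried a step id
}
```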
Separate the orchestration from the tool registry. Today the tool registry is embedded in the orchestrator. The cleaner design is an external tool registry that the orchestrator calls over HTTP, so tools can be deployed and versioned independently of the orchestration layer.
Running it
The repo ships with a docker-compose.yml that starts Postgres, Redis, the worker, and the Inspector UI. A sample workflow is included that chains three LLM calls with a tool step in the middle. Total time to first workflow completion from clone: about 10 minutes.
Where it fits
agent-orchestrator is the right tool when you have multi-step agent workflows that need to be reliable, debuggable, and scalable. If your workflow is a single LLM call with one tool, it is overengineering. If your workflow involves branching logic, long-running steps, retries, and human-in-the-loop pauses, the durability model pays for itself.
The durable workflow pattern is the foundation that makes agents trustworthy in production. Without it, agents are demos.
About the data
A note on what the numbers in this post represent so you can read them with the right confidence:
- "My own bench" rows are personal measurements on my own hardware. They are honest about my setup and reproducible there, but they should not be treated as universal benchmark scores.
- Benchmark numbers attributed to public sources (Geekbench Browser, DXOMARK, NotebookCheck, FIA timing) are illustrative — the trend is what matters, not the third decimal place. Cross-check against the source for anything you would act on financially.
- Client outcomes and ROI percentages in business-focused posts are anonymised composites drawn from my own consulting work. Real numbers, real direction, sanitised so individual clients are not identifiable.
- Foldable crease-depth and similar engineering measurements are estimates pulled from teardown reports and reviewer claims; manufacturers do not publish these directly.
- Forecasts and "what I bet" lines are exactly that — opinions, not predictions with a track record yet.
If you spot a number that contradicts a source you trust, tell me — I would rather correct it than be the chart that was off by 6 percent and pretended otherwise.