
SarmaLink-AI failover engine: how zero-downtime LLM rotation actually works

A deep dive into the failover engine inside SarmaLink-AI: provider weighting, latency budgets, cooldown windows after 5xx errors, and how zero-downtime rotation stays transparent to the caller.

3 May 2026 · 14 min read

SarmaLink-AI is a self-hosted OpenAI-compatible gateway that routes across 14 LLM providers with automatic failover[1]. The public README covers the setup. This post covers the engine itself: how routing decisions are made, how failures are detected, how cooldowns work, and how callers see zero disruption when a provider goes down.

The routing model

Every request entering the gateway carries three implicit properties: a latency budget, a quality tier, and a cost class. You do not set these directly — they are inferred from the model alias the caller requests.

When a caller sends model: "fast", the gateway maps that to the provider pool tagged tier=fast. Today that pool contains Groq (primary), Fireworks (secondary), and Together (tertiary). The gateway picks primary first. If primary is in cooldown, it advances to secondary, and so on.

Model aliases are defined in lib/config/model-map.ts. You can override them per deployment.
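
To make the alias-to-pool mapping concrete, here is a minimal sketch of what a tier pool definition could look like. The field names and structure are assumptions for illustration, not the actual shape of model-map.ts.

```typescript
// Hypothetical shape of a tier pool; the real model-map.ts may differ.
interface PoolMember {
  provider: string;                            // key into the provider registry
  weight: number;                              // selection weight when healthy (default 100)
  role: "primary" | "secondary" | "tertiary";  // failover order within the tier
}

// The "fast" alias resolves to this ordered pool.
const modelMap: Record<string, PoolMember[]> = {
  fast: [
    { provider: "groq",      weight: 100, role: "primary" },
    { provider: "fireworks", weight: 100, role: "secondary" },
    { provider: "together",  weight: 100, role: "tertiary" },
  ],
};
```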

Provider weighting

Within a tier pool, providers carry an integer weight (default: 100). Weight affects selection probability when multiple providers are healthy and latency-equivalent.

In practice, weights are used to send a fraction of traffic to a new provider for comparison testing before fully cutting over. Set the weight to 10 on the new provider and 90 on the incumbent: the new provider sees 10 percent of traffic, the logs are directly comparable, and promotion is a config change, not a code change.

Weights are hot-reloaded from the config file. No restart required.
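
As a rough illustration of how weights translate into traffic share, a cumulative-weight draw over the healthy members of a pool is enough. This is a sketch of the technique, not the gateway's actual selection code.

```typescript
// Pick one provider from a healthy pool, proportionally to its weight.
// With weights 10 and 90, the first provider receives roughly 10% of picks.
function pickWeighted<T extends { weight: number }>(pool: T[]): T {
  const total = pool.reduce((sum, p) => sum + p.weight, 0);
  let draw = Math.random() * total;
  for (const p of pool) {
    draw -= p.weight;
    if (draw <= 0) return p;
  }
  return pool[pool.length - 1]; // guard against floating-point rounding
}
```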

Latency budget enforcement

Each tier has a configurable latency budget: the maximum wall-clock time in milliseconds before the gateway aborts the upstream request and tries the next provider.

Chart: Median response latency per provider (ms, first token)
Source: my own bench — Hetzner CAX31, 50-token prompt, 200 requests per provider, April 2026

The numbers above are from my own bench on Hetzner CAX31 in April 2026. Groq leads on first-token latency at around 210 ms p50. OpenAI and Anthropic sit in the 490-560 ms p50 range. These numbers shift with load and model size, so the gateway measures them in real time rather than relying on static config.

The in-process latency tracker maintains a rolling 60-second window of observed first-token times per provider. Every 30 seconds the tracker emits updated p50/p95 estimates. The routing layer reads these estimates when selecting a provider for a new request.

If a provider's observed p50 exceeds twice its configured budget, it is downweighted automatically. The downweight is soft: the provider remains in the pool but receives a lower share of traffic until its numbers recover.
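
A sketch of the idea behind the tracker: keep timestamped first-token samples, evict anything older than 60 seconds, and read percentiles off the sorted window. The class below and the halving factor in the downweight helper are assumptions for illustration; the engine's internals may differ.

```typescript
// Rolling window of first-token latencies for one provider.
class LatencyWindow {
  private samples: { at: number; ms: number }[] = [];
  constructor(private windowMs = 60_000) {}

  record(ms: number): void {
    const now = Date.now();
    this.samples.push({ at: now, ms });
    // Evict samples that have aged out of the window.
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
  }

  percentile(p: number): number | undefined {
    if (this.samples.length === 0) return undefined;
    const sorted = this.samples.map((s) => s.ms).sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[idx];
  }
}

// Soft downweight: if observed p50 exceeds twice the budget, cut the share.
// The halving factor here is illustrative, not the engine's actual value.
function effectiveWeight(base: number, p50: number | undefined, budgetMs: number): number {
  if (p50 !== undefined && p50 > 2 * budgetMs) return Math.max(1, Math.floor(base / 2));
  return base;
}
```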

Failure detection and the 5xx path

A 5xx response from a provider triggers the cooldown path[2]. The logic:

  1. Increment the provider's consecutive-failure counter.
  2. On the first failure, immediately retry the same request on the next provider in the pool. No delay, no dropped request.
  3. If consecutive failures reach the threshold (default: 3), enter cooldown.
  4. During cooldown, zero traffic is sent to the provider.
  5. After the cooldown window (default: 60 seconds), send a single probe request. If the probe succeeds, restore the provider with 20 percent weight and ramp over the next 5 minutes.

Rate-limit responses (429) are handled differently from server errors (500, 502, 503). A 429 increments a separate rate-limit counter and triggers a shorter backoff (default: 15 seconds). The provider is not fully cooled down — it comes back online sooner because rate limits are typically per-minute and the provider itself is healthy.

408 responses and connection timeouts follow the same path as 5xx errors.
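
A condensed sketch of that state handling, using the default thresholds quoted above. The structure and names are illustrative assumptions; the actual engine may be organised quite differently.

```typescript
// Per-provider health state for the failover path (illustrative names).
type Health = "healthy" | "cooldown" | "rateLimited" | "probing";

interface ProviderHealth {
  state: Health;
  consecutiveFailures: number; // 5xx / timeout streak
  rateLimitCount: number;      // separate counter for 429s
  resumeAt: number;            // epoch ms when traffic may resume
}

const FAILURE_THRESHOLD = 3;          // consecutive failures before cooldown
const COOLDOWN_MS = 60_000;           // full cooldown after repeated server errors
const RATE_LIMIT_BACKOFF_MS = 15_000; // shorter backoff for 429s

function onUpstreamError(h: ProviderHealth, status: number): void {
  const now = Date.now();
  if (status === 429) {
    // Rate limit: short backoff on a separate counter, no full cooldown.
    h.rateLimitCount += 1;
    h.state = "rateLimited";
    h.resumeAt = now + RATE_LIMIT_BACKOFF_MS;
    return;
  }
  // 5xx, 408, and connection timeouts all share the cooldown path.
  h.consecutiveFailures += 1;
  if (h.consecutiveFailures >= FAILURE_THRESHOLD) {
    h.state = "cooldown";
    h.resumeAt = now + COOLDOWN_MS;
  }
}

function onUpstreamSuccess(h: ProviderHealth): void {
  h.consecutiveFailures = 0;
  h.state = "healthy";
}
```

The probe after a cooldown (a single request, then restoration at 20 percent weight with a five-minute ramp) sits on top of this state and is not shown here.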

Zero-downtime rotation in practice

Chart: Requests routed to backup providers during a simulated Groq outage (15-minute window)
Source: my own bench — synthetic outage test, April 2026

The chart above shows traffic distribution during a simulated 15-minute Groq outage on my test bench. At T+0 Groq handles all traffic. At T+1 (once the cooldown threshold is met) all traffic shifts to the backup pool. At T+16 Groq comes back, and primary traffic is restored within one minute.

The caller's perspective: no failed requests. One request may see a slightly higher latency (the retry adds one round-trip), but no error surfaces.

Streaming considerations

Streaming makes failover harder. Once the gateway has begun streaming tokens to the caller, it cannot switch providers mid-stream without breaking the response. The approach is to detect failure before the first token arrives, not after.

If a provider connection is established but the first token does not arrive within the latency budget, the gateway aborts the upstream connection, logs the timeout, and retries on the next provider. Because no tokens have been sent downstream yet, the caller sees a clean response from the backup provider.

After the first token arrives, the gateway is committed. A mid-stream failure returns a partial response with an error appended. This is the honest behaviour — better than silently truncating.
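
The first-token guard can be sketched with the standard fetch and AbortController APIs: arm an abort timer, wait for the first chunk, and only then commit. The function name and error handling here are illustrative, not the gateway's actual code.

```typescript
// Abort the upstream request if no bytes arrive before the budget elapses.
// If this throws, nothing has been sent downstream, so the caller can retry
// the same request on the next provider in the pool.
async function firstTokenGuard(
  url: string,
  init: RequestInit,
  budgetMs: number
): Promise<{ firstChunk: Uint8Array; reader: ReadableStreamDefaultReader<Uint8Array> }> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    const res = await fetch(url, { ...init, signal: controller.signal });
    if (!res.ok || !res.body) throw new Error(`upstream ${res.status}`);
    const reader = res.body.getReader();
    const first = await reader.read(); // abort timer is still armed here
    if (first.done || !first.value) throw new Error("upstream closed before first token");
    // First token is in hand: the caller forwards it downstream and is now committed.
    return { firstChunk: first.value, reader };
  } finally {
    clearTimeout(timer);
  }
}
```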

The provider registry

All 14 providers are defined in lib/providers/registry.ts. Each entry carries:

  • Base URL (OpenAI-compatible endpoint)
  • API key env var name
  • Default model name to use when a tier alias resolves here
  • Weight, tier tags
  • Capabilities flags: streaming, functionCalling, vision

Adding a 15th provider is a four-line addition to the registry. No other files change.
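
The bullet list above maps roughly onto an entry shaped like the sketch below. The real registry entry is presumably more compact (hence the four-line addition), and the field names here are assumptions rather than the literal registry.ts contents.

```typescript
// Illustrative registry entry shape; the real registry.ts may name fields differently.
interface ProviderEntry {
  baseUrl: string;      // OpenAI-compatible endpoint
  apiKeyEnv: string;    // name of the env var holding the key
  defaultModel: string; // model used when a tier alias resolves here
  weight: number;
  tiers: string[];
  capabilities: { streaming: boolean; functionCalling: boolean; vision: boolean };
}

// A hypothetical 15th provider would be one new entry along these lines.
const exampleProvider: ProviderEntry = {
  baseUrl: "https://api.example-llm.com/v1",
  apiKeyEnv: "EXAMPLE_LLM_API_KEY",
  defaultModel: "example-large",
  weight: 100,
  tiers: ["fast"],
  capabilities: { streaming: true, functionCalling: false, vision: false },
};
```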

What is not yet automatic

Cost accounting is partially implemented. The gateway logs token usage per request with provider and model. A daily rollup script aggregates cost estimates using published pricing, but there is no real-time cost cap enforcement yet. A request will not be aborted because it is about to breach a monthly budget limit. That is on the roadmap.

Prompt caching is provider-specific. Anthropic supports it natively[3]; other providers do not. The gateway does not currently abstract over prompt caching.

Running it

The repo ships with a docker-compose.yml and a config.example.yml. Point your existing OpenAI client at http://localhost:3000/v1 and it routes through the gateway immediately. No SDK change required.
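
For example, with the official openai Node SDK only the base URL changes; whether an API key is required depends on how your deployment is configured, so the key below is illustrative.

```typescript
import OpenAI from "openai";

// Point the standard SDK at the local gateway instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:3000/v1",
  apiKey: process.env.GATEWAY_API_KEY ?? "unused", // depends on your deployment
});

const reply = await client.chat.completions.create({
  model: "fast", // resolved to the tier pool by the gateway
  messages: [{ role: "user", content: "ping" }],
});
console.log(reply.choices[0].message.content);
```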

Where this is heading

The next meaningful features are: (1) real-time cost cap enforcement per API key, (2) per-request provider pinning via a custom header for callers that need deterministic routing, and (3) a web UI for the provider health dashboard so ops teams can see cooldown state without reading logs.

The failover engine itself is stable. I have run it in production for six months without a caller-visible outage caused by a provider failure. The design is intentionally conservative: prefer correct behaviour over clever behaviour, and prefer observability over opaqueness.

About the data

A note on what the numbers in this post represent so you can read them with the right confidence:

  • "My own bench" rows are personal measurements on my own hardware. They are honest about my setup and reproducible there, but they should not be treated as universal benchmark scores.
  • Benchmark numbers attributed to public sources (Geekbench Browser, DXOMARK, NotebookCheck, FIA timing) are illustrative — the trend is what matters, not the third decimal place. Cross-check against the source for anything you would act on financially.
  • Client outcomes and ROI percentages in business-focused posts are anonymised composites drawn from my own consulting work. Real numbers, real direction, sanitised so individual clients are not identifiable.
  • Foldable crease-depth and similar engineering measurements are estimates pulled from teardown reports and reviewer claims; manufacturers do not publish these directly.
  • Forecasts and "what I bet" lines are exactly that — opinions, not predictions with a track record yet.

If you spot a number that contradicts a source you trust, tell me — I would rather correct it than be the chart that was off by 6 percent and pretended otherwise.

References

  1. SarmaLink-AI source on GitHub: https://github.com/sarmakska/Sarmalink-ai
  2. [2]
  3. [3]