Open Source · MIT · TypeScript · Hono

Route every prompt to the cheapest model that can do the job.

An OpenAI-compatible proxy that classifies every request, applies your declarative YAML policy, and dispatches to local Ollama, hosted SarmaLink-AI or an OpenAI frontier model. Privacy pinning is fail-closed. Latency budgets spill slow local to fast cloud. Rolling A/B promotes from real production traffic.

YAML: Policy as code
Privacy: Pinning · fail-closed
A/B: Rolling routing
OpenAI: Wire-compatible
MIT: Licence

Why this exists

Most teams shipping AI products want to use local models for some traffic and cloud models for the rest. Local for privacy-sensitive prompts, local for high-volume cheap traffic, cloud for the difficult cases. The right answer is per-request, not per-application, and the application is the wrong place to hard-code that routing logic.

Hosted LLM gateways handle the cloud side beautifully but treat local as second class. Local gateways handle Ollama beautifully but do not expose cloud backends through the same interface. Teams keep stitching their own routers from npm packages and regrets.

local-llm-router is the focused middle. An OpenAI-compatible proxy with first-class Ollama support, first-class cloud support, and a YAML policy you can reason about and PR-review. Privacy pinning is a feature. Latency-budget fallback is a feature. Rolling A/B is a feature. The application calls one URL and gets the right answer from the right place.

Request flow

Classify, decide, dispatch, stream. Every step happens in-process; the only network round-trip is to the backend, not to the router.

Request flow: the classifier produces dimensions, the decision engine walks the policy, privacy pin overrides budget, the dispatcher streams from the chosen backend and writes a row to node:sqlite.

What the classifier sees

Six dimensions, deterministic, no model call. The policy file matches on any conjunction of them.

| Dimension | Possible values | How it is set |
| --- | --- | --- |
| task | code · vision · web_search · general | Heuristics over keywords, content parts (image_url), and tool/function hints in the request. |
| complexity | low · medium · high | Length, presence of multi-step reasoning markers, code-block depth. |
| sensitivity | low · medium · high | Regex bank for PII / PHI / secrets, plus explicit metadata.sensitivity if the client sets it. |
| modality | text · image · audio · multi | Inspects content parts and the model name family. |
| family | qwen-coder · gemma · llama · gpt · sonnet | Resolved from the requested model name or chosen by task when model = "auto". |
| tokens | estimated count | Light tokeniser estimate over the prompt and system messages. |
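
A minimal sketch of what a deterministic classifier along these lines could look like, covering three of the six dimensions. The type names, keyword lists, regex patterns and the "stricter value wins" rule are all assumptions made for illustration; they are not the repo's actual heuristics.

```typescript
// Hypothetical sketch: names and heuristics are illustrative, not the repo's code.
type Task = "code" | "vision" | "web_search" | "general";
type Level = "low" | "medium" | "high";

interface ContentPart { type: string; text?: string }
interface ChatMessage { role: string; content: string | ContentPart[] }
interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  metadata?: { sensitivity?: Level };
}
interface Dimensions { task: Task; sensitivity: Level; tokens: number }

// Illustrative patterns only; a real regex bank would be far larger.
const SENSITIVE_PATTERNS: RegExp[] = [
  /\b\d{3}-\d{2}-\d{4}\b/,    // US SSN shape
  /\bsk-[A-Za-z0-9]{20,}\b/,  // API-key-like token
];

const LEVEL_RANK: Record<Level, number> = { low: 0, medium: 1, high: 2 };

function textOf(msg: ChatMessage): string {
  if (typeof msg.content === "string") return msg.content;
  return msg.content.map((p) => p.text ?? "").join(" ");
}

function classify(req: ChatRequest): Dimensions {
  const text = req.messages.map(textOf).join("\n");
  const hasImage = req.messages.some(
    (m) => Array.isArray(m.content) && m.content.some((p) => p.type === "image_url"),
  );

  let task: Task = "general";
  if (hasImage) task = "vision";
  else if (/\bfunction\b|\brefactor\b|\bstack trace\b/i.test(text)) task = "code";
  else if (/\b(latest|today|news)\b/i.test(text)) task = "web_search";

  // Assumption: the stricter of the detected and client-declared levels wins.
  const detected: Level = SENSITIVE_PATTERNS.some((re) => re.test(text)) ? "high" : "low";
  const declared: Level = req.metadata?.sensitivity ?? "low";
  const sensitivity = LEVEL_RANK[declared] > LEVEL_RANK[detected] ? declared : detected;

  // Rough estimate: about four characters per token for English text.
  const tokens = Math.ceil(text.length / 4);
  return { task, sensitivity, tokens };
}
```

Because the function is pure and makes no network call, the same request always yields the same dimensions, which is what lets the policy be reasoned about in review.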

The whole policy.yaml

This is the example policy shipped with the repo. Read it top to bottom: that is the order in which rules fire.

backends:
  local:
    type: ollama
    endpoint: http://localhost:11434
    # The classifier picks a family; the router resolves it to one of
    # these models. Code goes to Qwen 2.5 Coder, vision to Gemma 3,
    # everything else to Llama 4.
    families:
      qwen-coder: qwen2.5-coder:7b
      gemma: gemma3:12b
      llama: llama4:16x17b
    models: [llama4:16x17b, qwen2.5-coder:7b, gemma3:12b]
    p50Ms: 1800

  sarmalink:
    type: sarmalink
    endpoint: https://api.sarmalink.ai/v1
    model: smart
    p50Ms: 600

  frontier:
    type: openai
    endpoint: https://api.openai.com/v1
    model: gpt-4o
    p50Ms: 900

routes:
  # Privacy pin: sensitive requests never leave the machine.
  - match: { sensitivity: high }
    backend: local
    reason: "Privacy: never leave the machine"

  # Short code edits run on local Qwen 2.5 Coder with cloud fallback.
  - match: { task: code, complexity: low }
    backend: local
    fallback: sarmalink
    latencyBudgetMs: 2500

  # Hard code goes straight to the frontier model.
  - match: { task: code, complexity: high }
    backend: frontier
    fallback: sarmalink

  # Image prompts route to local Gemma 3 with vision fallback.
  - match: { task: vision }
    backend: local
    fallback: frontier
    latencyBudgetMs: 3000

  # Live-data questions need cloud tools.
  - match: { task: web_search }
    backend: sarmalink

  # Everything else: local first, spill to hosted when too slow.
  - default: local
    fallback: sarmalink
    latencyBudgetMs: 1200

ab:
  enabled: false
  sampleRate: 0.05
  candidates:
    local: sarmalink

Every field is documented on the Policy DSL wiki page.
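
The top-to-bottom rule order can be sketched as a first-match walk over the routes list. This is an illustration under assumptions: the `Route` shape mirrors the fields visible in the example policy, while `matches`, `pickRoute` and `primaryBackend` are hypothetical helper names rather than the repo's API.

```typescript
// Hypothetical sketch of first-match route selection; helper names are assumptions.
type Dims = Record<string, string>;

interface Route {
  match?: Dims;             // every listed key must equal the request's dimension
  default?: string;         // catch-all backend name
  backend?: string;
  fallback?: string;
  latencyBudgetMs?: number;
}

function matches(route: Route, dims: Dims): boolean {
  if (route.default !== undefined) return true; // default route matches anything
  const wanted = route.match;
  if (!wanted) return false;                    // neither match nor default: never fires
  return Object.keys(wanted).every((k) => dims[k] === wanted[k]);
}

// Rules fire in file order, so the first matching route wins.
function pickRoute(routes: Route[], dims: Dims): Route | undefined {
  return routes.find((r) => matches(r, dims));
}

function primaryBackend(route: Route): string | undefined {
  return route.backend ?? route.default;
}

// A subset of the example policy above, in the same order.
const exampleRoutes: Route[] = [
  { match: { sensitivity: "high" }, backend: "local" },
  { match: { task: "code", complexity: "low" }, backend: "local", fallback: "sarmalink", latencyBudgetMs: 2500 },
  { match: { task: "code", complexity: "high" }, backend: "frontier", fallback: "sarmalink" },
  { default: "local", fallback: "sarmalink", latencyBudgetMs: 1200 },
];
```

With this ordering, a request tagged `task: code, complexity: high, sensitivity: high` stops at the first rule and stays local, which is why the privacy pin sits at the top of the file.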

What is in the box

Twelve features. Every one is implemented in the repo today.

OpenAI Chat Completions API

Drop-in for /v1/chat/completions. Streaming and non-streaming. Any OpenAI client works without code changes: point base_url at the router and set model: "auto".

OpenAI Responses API

Also serves /v1/responses. Accepts input, instructions, and content parts; returns a Responses envelope with output, output_text, and usage. Streaming is the backend's native SSE.
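
As an illustration of the envelope described above, here is one way a Chat Completions-style backend reply could be reshaped into a Responses-style body. The field names follow the public OpenAI wire formats; whether the router performs the translation this way, and the helper name `toResponsesEnvelope`, are assumptions. The sketch passes the backend's id through unchanged.

```typescript
// Hypothetical translation sketch; not taken from the repo.
interface ChatCompletion {
  id: string;
  model: string;
  choices: { message: { role: string; content: string | null } }[];
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

interface ResponsesEnvelope {
  id: string;
  object: "response";
  model: string;
  output: {
    type: "message";
    role: "assistant";
    content: { type: "output_text"; text: string }[];
  }[];
  output_text: string;
  usage?: { input_tokens: number; output_tokens: number; total_tokens: number };
}

function toResponsesEnvelope(chat: ChatCompletion): ResponsesEnvelope {
  // Take the first choice; an empty or missing reply becomes an empty string.
  const text = chat.choices[0]?.message.content ?? "";
  return {
    id: chat.id,
    object: "response",
    model: chat.model,
    output: [
      { type: "message", role: "assistant", content: [{ type: "output_text", text }] },
    ],
    output_text: text,
    // Chat usage counts prompt/completion tokens; Responses counts input/output tokens.
    usage: chat.usage
      ? {
          input_tokens: chat.usage.prompt_tokens,
          output_tokens: chat.usage.completion_tokens,
          total_tokens: chat.usage.total_tokens,
        }
      : undefined,
  };
}
```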

Deterministic classifier

Tags every request with task, complexity, sensitivity, modality, an open-weight model family, and an estimated token count. No extra model call, no extra round trip, no extra cost.

YAML routing policy

A single policy file describes routing decisions. Match by task, complexity, sensitivity, modality, family or model. Reviewed in PRs like the rest of your code.

Privacy pinning, fail-closed

A request tagged or detected as sensitive is pinned to a local backend. If the local backend is unavailable, the request fails closed: it never silently leaves the network.
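
The fail-closed rule can be sketched as a small decision taken after the primary backend has been tried. Everything here is hypothetical: the `RoutePlan` and `NextStep` shapes, the function name and the 502/503 status codes are assumptions chosen for illustration.

```typescript
// Hypothetical sketch of fail-closed handling; names and status codes are assumptions.
type PrimaryOutcome = "success" | "non_success" | "unreachable";

interface RoutePlan { backend: string; fallback?: string; pinned: boolean }

type NextStep =
  | { action: "respond"; backend: string }
  | { action: "retry"; backend: string }
  | { action: "fail"; status: number; reason: string };

function afterPrimary(plan: RoutePlan, outcome: PrimaryOutcome): NextStep {
  if (outcome === "success") return { action: "respond", backend: plan.backend };

  // A pinned request never spills, even if the route names a fallback.
  if (plan.pinned) {
    return {
      action: "fail",
      status: 503,
      reason: `privacy pin: ${plan.backend} was ${outcome}; refusing to fall back`,
    };
  }

  if (plan.fallback) return { action: "retry", backend: plan.fallback };
  return { action: "fail", status: 502, reason: `${plan.backend} was ${outcome}; no fallback configured` };
}
```

The point of the pinned branch coming before the fallback branch is that a misconfigured route cannot turn the pin into a preference.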

Family-to-model resolution

The classifier picks a family (qwen-coder, gemma, llama). The decision engine resolves it to a concrete model on the chosen backend's families map. One policy drives a heterogeneous fleet.
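
A minimal sketch of that resolution step, assuming backends shaped like the ones in the example policy. The `resolveModel` helper and the choice to fall back to a backend's single `model` field are assumptions for illustration.

```typescript
// Hypothetical sketch of family-to-model resolution; not the repo's code.
interface BackendConfig {
  type: string;
  families?: Record<string, string>; // family tag -> concrete model tag
  model?: string;                    // single-model backends
}

function resolveModel(backend: BackendConfig, family: string): string | undefined {
  const families = backend.families ?? {};
  // Prefer an explicit family mapping on the chosen backend.
  if (Object.prototype.hasOwnProperty.call(families, family)) return families[family];
  // Assumption: backends without a families map serve their one configured model.
  return backend.model;
}

// Values copied from the example policy.
const localBackend: BackendConfig = {
  type: "ollama",
  families: { "qwen-coder": "qwen2.5-coder:7b", gemma: "gemma3:12b", llama: "llama4:16x17b" },
};
const frontierBackend: BackendConfig = { type: "openai", model: "gpt-4o" };
```

Here `resolveModel(localBackend, "qwen-coder")` yields `qwen2.5-coder:7b`, while the same family on `frontierBackend` yields `gpt-4o`, so one policy can name families without knowing each backend's model tags.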

Latency-budget fallback

A route can set latencyBudgetMs. When the primary backend's expected latency (from live metrics or p50Ms hint) exceeds the budget and the fallback is faster, the request shifts. Slow local never blows a tight interactive budget.
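
The spill rule above reduces to two comparisons. The sketch below assumes a live p50 figure is available as an optional field next to the static `p50Ms` hint; the `liveP50Ms` field and both function names are hypothetical.

```typescript
// Hypothetical sketch of the latency-budget check; field and function names are assumptions.
interface LatencyInfo {
  p50Ms: number;       // static hint from policy.yaml
  liveP50Ms?: number;  // assumed: rolling p50 from recorded metrics, when present
}

function expectedMs(b: LatencyInfo): number {
  return b.liveP50Ms ?? b.p50Ms;
}

function shouldSpill(
  primary: LatencyInfo,
  fallback: LatencyInfo | undefined,
  budgetMs: number | undefined,
): boolean {
  // Without a budget or a fallback there is nothing to spill to.
  if (budgetMs === undefined || fallback === undefined) return false;
  const primaryMs = expectedMs(primary);
  // Spill only when the primary is over budget AND the fallback is actually faster.
  return primaryMs > budgetMs && expectedMs(fallback) < primaryMs;
}
```

Using the hints from the example policy, the default route (budget 1200 ms, local at 1800 ms, sarmalink at 600 ms) spills, while the short-code route (budget 2500 ms) stays local. Once live metrics show local running at, say, 900 ms, the default route stops spilling.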

Three backends in the box

Ollama for local, SarmaLink-AI for hosted multi-provider, OpenAI for frontier. A registry pattern makes a new backend roughly sixty lines of TypeScript.

node:sqlite metrics

Per-route success, latency and fallback rate persist in node:sqlite. JSON summary at /v1/metrics, Prometheus text at /metrics. No extra dependency, no extra process.
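
To show what the Prometheus text endpoint might emit, here is a sketch that renders per-route counters in the Prometheus text exposition format. The metric names (`llr_requests_total` and so on) and the `RouteStats` shape are invented for illustration; the repo's actual metric names are not given in this document.

```typescript
// Hypothetical sketch: metric names and the stats shape are assumptions.
interface RouteStats {
  route: string;
  requests: number;
  failures: number;
  fallbacks: number;
  latencyMsSum: number;
}

// Label values must escape backslash, double quote and newline.
function escapeLabel(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function toPrometheus(stats: RouteStats[]): string {
  const families: Array<[string, (s: RouteStats) => number]> = [
    ["llr_requests_total", (s) => s.requests],
    ["llr_failures_total", (s) => s.failures],
    ["llr_fallbacks_total", (s) => s.fallbacks],
    ["llr_latency_ms_sum", (s) => s.latencyMsSum],
  ];
  const out: string[] = [];
  for (const [metric, pick] of families) {
    // All samples of one metric family follow its TYPE line.
    out.push(`# TYPE ${metric} counter`);
    for (const s of stats) {
      out.push(`${metric}{route="${escapeLabel(s.route)}"} ${pick(s)}`);
    }
  }
  return out.join("\n") + "\n";
}
```

Fallback rate and mean latency are then derived at query time, for example by dividing the fallback counter by the request counter.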

Rolling A/B routing

Mirror a sample of traffic to a candidate backend in the background. The router records its latency and success. /v1/ab reports which candidates are ready to promote.
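
The two halves of rolling A/B, sampling and the promotion check, can be sketched as follows. The `AbConfig` shape mirrors the `ab:` block in the example policy; the promotion criteria (a minimum sample count, a success rate at least as good, a p50 no worse) are assumptions, since the document does not state the real thresholds.

```typescript
// Hypothetical sketch of A/B sampling and promotion readiness; criteria are assumptions.
interface AbConfig {
  enabled: boolean;
  sampleRate: number;                 // fraction of traffic to mirror, e.g. 0.05
  candidates: Record<string, string>; // primary backend -> candidate backend
}

// Returns the candidate to mirror this request to, or undefined for no mirroring.
// `rand` is injectable so the sampling decision can be tested deterministically.
function mirrorTarget(
  ab: AbConfig,
  primary: string,
  rand: () => number = Math.random,
): string | undefined {
  if (!ab.enabled) return undefined;
  const candidate = ab.candidates[primary];
  if (!candidate) return undefined;
  return rand() < ab.sampleRate ? candidate : undefined;
}

interface Observed { samples: number; successes: number; p50Ms: number }

function readyToPromote(incumbent: Observed, candidate: Observed, minSamples = 200): boolean {
  if (candidate.samples < minSamples || incumbent.samples === 0) return false;
  const rate = (o: Observed) => o.successes / o.samples;
  return rate(candidate) >= rate(incumbent) && candidate.p50Ms <= incumbent.p50Ms;
}
```

The mirrored call runs in the background and its result is discarded; only its latency and success are recorded, so the client always receives the primary backend's answer.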

Hono runtime, edge-ready

Built on Hono. Runs on Bun, Node, Cloudflare Workers and Deno. Sub-millisecond router overhead. The code and the behaviour are the same on each; only the runtime differs.

Typed config, Zod-validated

The policy file is loaded once at startup and validated with Zod. A malformed policy fails fast with a clear message; it never reaches a live request.

Three backends in the box

A registry pattern means a fourth backend is typically sixty lines.

Ollama (local)
type: ollama · ~1800ms (12B on laptop)

On-prem GPU host or laptop. The privacy-pinned destination. Reads families to map qwen-coder, gemma and llama family tags to concrete model tags.

SarmaLink-AI (cloud)
type: sarmalink · ~600ms

Hosted multi-provider gateway with web tools and chat memory. The natural cloud fallback when local cannot keep up.

OpenAI frontier
type: openai · ~900ms

Frontier model on hard code and reasoning. The escape hatch when local and SarmaLink-AI cannot do the job.

Quick start

Clone, install, point an OpenAI client at port 3030, watch the metrics.

01 Clone and install
git clone https://github.com/sarmakska/local-llm-router.git
cd local-llm-router
pnpm install

02 Copy the example policy
cp policy.example.yaml policy.yaml
cp .env.example .env  # OPENAI_API_KEY, SARMALINK_API_KEY, LLR_POLICY, LLR_DB

03 Start the router
pnpm dev   # Hono on :3030

04 Make a request
curl -N http://localhost:3030/v1/chat/completions \
  -H "Authorization: Bearer anything" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Refactor this function..."}],
    "stream": true
  }'

05 Point your OpenAI client at it
from openai import OpenAI
client = OpenAI(base_url="http://localhost:3030/v1", api_key="anything")
client.chat.completions.create(
  model="auto",
  messages=[{"role": "user", "content": "Hi"}]
)

06 Watch the metrics
curl http://localhost:3030/v1/metrics   # JSON
curl http://localhost:3030/metrics     # Prometheus text
curl http://localhost:3030/v1/ab       # A/B candidate report

Environment variables

Five environment variables. Two are paths, two are credentials, one is the port.

| Variable | Purpose | Default |
| --- | --- | --- |
| LLR_PORT | HTTP port for the Hono server. | 3030 |
| LLR_POLICY | Path to the YAML policy file. | ./policy.yaml |
| LLR_DB | Path to the node:sqlite metrics database. | ./metrics.db |
| OPENAI_API_KEY | Frontier backend credential. | unset (disabled) |
| SARMALINK_API_KEY | SarmaLink-AI backend credential. | unset (disabled) |

Decision sequence

What actually happens when a request comes in. Sensitivity pin trumps budget. Budget trumps preference.

Decision sequence: classifier produces dimensions, decision engine picks a backend (with privacy pin and latency budget overrides), dispatcher streams, sqlite logs the outcome.

Use cases

What teams actually run this for.

Privacy-sensitive AI products

Healthcare, legal, finance. Sensitive prompts pin to local Ollama. Non-sensitive prompts can reach cloud frontier models. Fail-closed means a prompt tagged or detected as sensitive never leaves the network by mistake.

Cost-conscious teams

Trivial prompts go to local or cheap hosted; long-context or hard prompts go to frontier. The audit log shows the savings honestly, per route.

Local-first development

Developers run Ollama locally; production uses cloud. The router is the single point of swap. Same client code, same model: "auto", different policy.

Model migration

Migrate from one provider to another with rolling A/B. Watch quality, latency and cost in the audit log. Promote when ready.

Regional egress control

Some workloads are not allowed to leave a region. Privacy pinning extended with regional metadata keeps those workloads on a regional Ollama.

Edge runtime experiments

Hono runs on Cloudflare Workers and Bun and Deno. Deploy the same router to a different runtime and compare cold-start and overhead numbers in the same metrics store.

local-llm-router vs alternatives

How the router compares to the closest tools in the space. Honest scope-by-scope.

| Feature | local-llm-router | LiteLLM | Portkey | OpenRouter | Ollama |
| --- | --- | --- | --- | --- | --- |
| OpenAI-compatible (Chat + Responses) | Both APIs | Chat only | Both | Chat only | Chat only |
| Native local backend (Ollama) | First-class | Supported | Supported | Not in scope | Native |
| YAML policy with classifier | Built-in | Limited | Config UI | Per-request | No |
| Privacy pinning, fail-closed | First-class feature | Manual | Manual | N/A | N/A (local only) |
| Rolling A/B with shadow traffic | Built-in | Bring your own | Paid tier | No | No |
| Latency-budget fallback | Per-route | No | Manual | Per-request | N/A |
| Self-hosted, single binary | Yes, Hono + sqlite | Yes | Hosted SaaS | Hosted only | Yes |
| Licence | MIT | MIT | Commercial | Commercial | MIT |

Tech stack

Small surface. Built-in Node primitives where possible.

TypeScript · Node.js 22 · Hono · node:sqlite · Ollama 0.5+ · Zod · YAML · Vitest · Docker · OpenAI SDK compatible

Frequently asked

The questions that come up most before adoption.

How is this different from LiteLLM?

LiteLLM is a library and an excellent hosted gateway. local-llm-router is a focused self-hosted proxy where local Ollama is a first-class destination, YAML routing policy is the primary interface, and privacy pinning with fail-closed semantics is built in rather than bolted on. If you need a hosted multi-provider gateway with billing, LiteLLM is the right tool. If you need a small self-hosted router that takes local seriously, this is it.

Does the classifier call an LLM?

No. The classifier is deterministic and runs in-process. It uses heuristics over the request body and metadata to produce task, complexity, sensitivity, modality, family and token estimates. Adding a model call would buy a small accuracy gain at a large latency cost; we did not make that trade.

What does fail-closed mean for privacy pinning?

A request tagged sensitive is pinned to a local backend. If the local backend returns a non-success or is unreachable, the request fails with an explicit error rather than silently spilling to a cloud fallback. The pin is a hard wall, not a preference.

How does latency-budget fallback decide to spill?

The decision engine compares the primary backend's expected latency (live metrics if present, the backend's p50Ms hint otherwise) against the route's latencyBudgetMs. If the primary is over budget and the fallback is faster, the request goes to the fallback. The check happens per-request, never per-pod.

Why node:sqlite instead of an external metrics store?

For per-route success, latency and fallback rate, node:sqlite is enough. It is built into Node 22, has no separate process, and serialises writes safely. The Prometheus text endpoint is there if you want to scrape into a real metrics stack.

Can I add a new backend?

Yes. Backends are a registry pattern: implement a small interface, register it under a type, expose it in policy.yaml. New backends are typically sixty lines of TypeScript plus a streaming SSE bridge.
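
A minimal sketch of what such a registry could look like. The `Backend` interface, `registerBackend` and `createBackend` are hypothetical names, and the streaming SSE bridge a real backend needs is omitted; only URL construction is shown to keep the example short.

```typescript
// Hypothetical registry sketch; interface and function names are assumptions.
interface BackendConfig { type: string; endpoint: string; model?: string }

interface Backend {
  readonly name: string;
  chatUrl(): string; // a real backend would also expose a streaming chat method
}

type BackendFactory = (name: string, cfg: BackendConfig) => Backend;

const backendRegistry = new Map<string, BackendFactory>();

function registerBackend(type: string, factory: BackendFactory): void {
  if (backendRegistry.has(type)) throw new Error(`backend type already registered: ${type}`);
  backendRegistry.set(type, factory);
}

// Called once per entry under `backends:` in policy.yaml.
function createBackend(name: string, cfg: BackendConfig): Backend {
  const factory = backendRegistry.get(cfg.type);
  if (!factory) throw new Error(`unknown backend type "${cfg.type}" for backend "${name}"`);
  return factory(name, cfg);
}

// Ollama serves its OpenAI-compatible API under /v1 on the Ollama host.
registerBackend("ollama", (name, cfg) => ({
  name,
  chatUrl: () => `${cfg.endpoint}/v1/chat/completions`,
}));

// The OpenAI endpoint in the example policy already ends in /v1.
registerBackend("openai", (name, cfg) => ({
  name,
  chatUrl: () => `${cfg.endpoint}/chat/completions`,
}));
```

Adding a fourth backend in this sketch means one more `registerBackend` call plus a new `type:` value in the policy; nothing in the decision engine changes.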

Does the router support tools and function calling?

The proxy passes tool definitions through unchanged. Tool execution is the backend's job; the router's job is to route the request to the right backend in the first place.

Is it safe to expose publicly?

It is a router, not a gateway. There is no auth, no rate limiting, no billing. Put it behind your own ingress or an API-gateway layer for those concerns. The wiki has a deployment pattern with Cloudflare Access in front.

Route smart. Keep private private.

Clone the repo, copy the example policy, point an OpenAI client at port 3030, and watch the metrics. MIT licensed.