Open Source · MIT · TypeScript

Route every request to the right model, on the right hardware.

OpenAI-compatible. Ollama on the inside, cloud on the outside, YAML policy in the middle.

Local LLM Router is the OpenAI-compatible proxy that decides, per request, whether the call should hit your local Ollama, a cloud frontier model, or a cost-tier alternative. Routing rules live in a versioned YAML policy. Privacy pinning forces sensitive requests local. Rolling A/B routing lets you migrate traffic between models without changing your application code.

YAML · Policy as code
Privacy · Pinning
A/B · Rolling routing
OpenAI · Wire-compatible
MIT · Licence

Why this exists

Most teams shipping AI products want to use local models for some traffic and cloud models for the rest. Local for privacy-sensitive prompts, local for high-volume cheap traffic, cloud for the difficult cases. The right answer is per-request, not per-application. Hard-coding the routing logic into your application puts that decision in the wrong place.

The hosted LLM gateways (OpenRouter, Portkey, LiteLLM proxy) handle the cloud side beautifully but treat local models as second-class. The Ollama gateways handle local models beautifully but do not interoperate with cloud. Teams keep stitching together their own.

Local LLM Router is the focused middle: an OpenAI-compatible proxy with first-class Ollama support, first-class cloud support, and a YAML routing policy that can be reasoned about, version-controlled, and reviewed in PRs. Privacy pinning is a feature; A/B rollout is a feature; cost optimisation is a feature.

What it does

Every feature below ships in the public repository today. Clone, configure, run.

OpenAI-compatible API

Point any OpenAI client at the router. /v1/chat/completions, /v1/embeddings, /v1/completions. Streaming and non-streaming.

Ollama first

Local Ollama is a first-class destination. Model name resolution maps OpenAI-style model names to your Ollama models.
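
A sketch of what that mapping could look like in policy.yaml (field names are illustrative, not necessarily the shipped schema):

# Illustrative alias map: OpenAI-style names resolve to Ollama models.
models:
  gpt-4o-mini: "llama3.1:8b"
  text-embedding-3-small: "nomic-embed-text"
  smart: "llama3.1:8b"              # the alias used in the quick start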

YAML routing policy

A single policy file describes routing decisions. Match by model, by tag, by request size, by content classifier. Reviewed in PRs like the rest of your code.
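
A minimal sketch of the idea, with illustrative field names rather than the exact shipped schema:

# Illustrative policy: each rule pairs a match clause with a destination.
rules:
  - match: { model: "smart" }           # match by requested model alias
    route: "ollama/llama3.1:8b"
  - match: { tag: "internal-tools" }    # match by request tag
    route: "openrouter/meta-llama/llama-3.1-70b-instruct"
default:
  route: "openai/gpt-4o-mini"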

Privacy pinning

A request tagged or detected as sensitive is pinned to a local model. Failovers respect the pin: if local is down, the request fails closed and never silently leaves the network.
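
Sketched in the same illustrative schema, a pinned rule could look like this:

# Illustrative pinning rule: sensitive traffic resolves to local destinations
# only, and failover never crosses the pin.
rules:
  - match: { classifier: "pii" }
    route: "ollama/llama3.1:8b"
    pin: local             # cloud destinations are never considered
    on_local_down: fail    # fail closed instead of falling back to cloud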

Rolling A/B routing

Migrate ten percent of traffic to a new model, watch the dashboard, ramp up. The application is unchanged.
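
An illustrative weighted split (field names are assumptions, not the exact schema):

# Illustrative rolling A/B: edit the weights to ramp the candidate up.
rules:
  - match: { model: "smart" }
    split:
      - { route: "ollama/llama3.1:8b", weight: 90 }
      - { route: "ollama/llama3.3:70b", weight: 10 }   # candidate model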

Cost optimisation

Cheap models for short prompts, expensive models for long-context requests. Per-request cost is logged and aggregated.
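
Illustrative size-based rules, assuming token-count matchers exist in the policy schema:

# Illustrative cost rules: cheap model for short prompts,
# a long-context model only when the request needs it.
rules:
  - match: { max_prompt_tokens: 2000 }
    route: "groq/llama-3.1-8b-instant"
  - match: { min_prompt_tokens: 2001 }
    route: "anthropic/claude-3-5-sonnet"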

Per-request audit

Every routing decision is logged with request hash, chosen model, latency, cost, and outcome. Stored in SQLite (better-sqlite3), queryable with plain SQL.

Failover chains

Each policy rule can list a failover sequence. If the primary returns 5xx, the next destination fires.
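
An illustrative failover list, again with assumed field names:

# Illustrative failover chain: the dispatcher walks the list on 5xx.
rules:
  - match: { model: "smart" }
    route: "ollama/llama3.1:8b"
    failover:
      - "openrouter/meta-llama/llama-3.1-8b-instruct"
      - "openai/gpt-4o-mini"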

Cloud providers

OpenAI, Anthropic, Google, Mistral, Groq, OpenRouter. Add more in twenty lines.
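
If provider endpoints and keys were declared in configuration, it might look roughly like this; purely illustrative, the repo may wire providers up in code instead:

# Hypothetical provider block: a base URL plus the env var holding the key.
providers:
  openai:    { base_url: "https://api.openai.com/v1", key_env: OPENAI_KEY }
  anthropic: { base_url: "https://api.anthropic.com/v1", key_env: ANTHROPIC_KEY }
  ollama:    { base_url: "http://localhost:11434/v1" }   # or read OLLAMA_HOST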

Edge-ready

Built on Hono. Deploys to Bun, Node, Cloudflare Workers, and Deno. Sub-millisecond router overhead.

Tech stack

TypeScript · Bun (or Node) · Hono · better-sqlite3 · Ollama · YAML · Zod · Vitest · Docker

Architecture, in one diagram

The whole system on a single screen. Every box maps to a real folder in the repo.

┌──────────────────────────┐
│  Client (any OpenAI SDK) │
└─────────────┬────────────┘
              │  /v1/chat/completions
              ▼
┌──────────────────────────────────────────────────┐
│ Local LLM Router (Hono, OpenAI-compatible)       │
│  ┌──────────────────────────────────────────┐    │
│  │ Policy engine                            │    │
│  │   match by model, tag, size, classifier  │    │
│  │   → destination + failover chain         │    │
│  └────────────────────┬─────────────────────┘    │
│                       ▼                          │
│  ┌──────────────────────────────────────────┐    │
│  │ Privacy pin                              │    │
│  │   sensitive request → local only         │    │
│  └────────────────────┬─────────────────────┘    │
│                       ▼                          │
│  ┌──────────────────────────────────────────┐    │
│  │ Dispatcher                               │    │
│  │   try primary, on 5xx try next, journal  │    │
│  └────────────────────┬─────────────────────┘    │
│                       ▼                          │
│  ┌──────────────────────────────────────────┐    │
│  │ better-sqlite3 audit log                 │    │
│  │   request, model, latency, cost, outcome │    │
│  └──────────────────────────────────────────┘    │
└──────────────────────────────────────────────────┘
        │                   │                  │
        ▼                   ▼                  ▼
   Local Ollama       OpenAI / Anthr.    OpenRouter
   (GPU host)         (cloud)            (cloud)

Quick start

From clone to first request in under five minutes.

01
git clone https://github.com/sarmakska/local-llm-router.git
cd local-llm-router
02
bun install
cp policy.example.yaml policy.yaml
cp .env.example .env  # OPENAI_KEY, ANTHROPIC_KEY, OLLAMA_HOST
03
bun run dev   # Hono on :7877
04
curl -N http://localhost:7877/v1/chat/completions \
  -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
  -d '{"model":"smart","messages":[{"role":"user","content":"Hi"}],"stream":true}'

Where it fits

The patterns this repository was built around.

Privacy-sensitive AI products

Healthcare, legal, finance. Sensitive prompts pin to local Ollama. Non-sensitive prompts can use cloud frontier models.

Cost-conscious teams

Route short prompts to cheap models, long-context to expensive ones. Cost logging shows the savings honestly.

Local-first development

Developers run Ollama locally; production uses cloud. The router is the single point of swap.

Model migration

Migrate from one provider to another with rolling A/B. Watch quality and latency in the audit log; ramp up.

Route smart. Keep private private.

Clone the repo, follow the four-step quick start, ship something real.