Open Source · MIT · TypeScript

Route every request to the right model, on the right hardware.

OpenAI-compatible. Ollama on the inside, cloud on the outside, YAML policy in the middle.

Local LLM Router is the OpenAI-compatible proxy that decides, per request, whether the call should hit your local Ollama, a cloud frontier model, or a cost-tier alternative. Routing rules live in a versioned YAML policy. Privacy pinning forces sensitive requests local. Rolling A/B routing lets you migrate traffic between models without changing your application code.

YAML · Policy as code
Privacy · Pinning
A/B · Rolling routing
OpenAI · Wire-compatible
MIT · Licence

Why this exists

Most teams shipping AI products want to use local models for some traffic and cloud models for the rest: local for privacy-sensitive prompts, local for high-volume cheap traffic, cloud for the difficult cases. The right answer is per-request, not per-application, and the application is the wrong place to hard-code that routing logic.

The hosted LLM gateways (OpenRouter, Portkey, LiteLLM proxy) handle the cloud side beautifully but treat local models as second-class. The Ollama gateways handle local models beautifully but do not interoperate with cloud. Teams keep stitching their own.

Local LLM Router is the focused middle: an OpenAI-compatible proxy with first-class Ollama support, first-class cloud support, and a YAML routing policy that can be reasoned about, version-controlled, and reviewed in PRs. Privacy pinning is a feature; A/B rollout is a feature; cost optimisation is a feature.

What it does

Every feature below ships in the public repository today. Clone, configure, run.

OpenAI-compatible API

Point any OpenAI client at the router. It serves /v1/chat/completions, /v1/embeddings, and /v1/completions, in both streaming and non-streaming modes.
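Because the router speaks the OpenAI wire format, switching an existing client over should only require a different base URL. The sketch below uses plain fetch rather than an SDK. The port, the "dev" bearer token, and the "smart" model alias are borrowed from the quick-start example; buildChatRequest and chat are illustrative helper names, not part of the project.

```typescript
// Minimal sketch: call the router the same way you would call the OpenAI API.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Build the URL and fetch options for a non-streaming chat completion.
function buildChatRequest(baseUrl: string, model: string, messages: ChatMessage[]) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Authorization": "Bearer dev", // token from the quick-start example
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages, stream: false }),
    },
  };
}

async function chat(baseUrl: string, prompt: string): Promise<string> {
  const { url, init } = buildChatRequest(baseUrl, "smart", [{ role: "user", content: prompt }]);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`router returned ${res.status}`);
  // Standard OpenAI chat completion shape: the first choice carries the reply.
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  const first = data.choices[0];
  if (first === undefined) throw new Error("router returned no choices");
  return first.message.content;
}
```

With the dev server from the quick start running, chat("http://localhost:7877", "Hi") would send the same request as the curl example, minus streaming.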

Ollama first

Local Ollama is a first-class destination. Model name resolution maps OpenAI-style model names to your Ollama models.

YAML routing policy

A single policy file describes routing decisions. Match by model, by tag, by request size, by content classifier. Reviewed in PRs like the rest of your code.
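To make the matching idea concrete, here is a rough sketch of first-match-wins rule evaluation. The Rule shape, the field names, and the destination strings are all hypothetical and chosen for illustration; the actual schema is whatever policy.example.yaml in the repository defines, and the content classifier is left out.

```typescript
// Hypothetical rule shape; not the project's real policy schema.
interface Rule {
  name: string;
  match: { model?: string; tag?: string; maxPromptChars?: number };
  destination: string;
}

interface RoutedRequest {
  model: string;
  tags: string[];
  promptChars: number;
}

// A rule matches when every condition it declares holds; absent conditions are ignored.
function matches(rule: Rule, req: RoutedRequest): boolean {
  const m = rule.match;
  if (m.model !== undefined && m.model !== req.model) return false;
  if (m.tag !== undefined && !req.tags.includes(m.tag)) return false;
  if (m.maxPromptChars !== undefined && req.promptChars > m.maxPromptChars) return false;
  return true;
}

// First matching rule wins, so rule order is significant in this sketch.
function route(rules: Rule[], req: RoutedRequest): string | undefined {
  return rules.find((r) => matches(r, req))?.destination;
}

// Placeholder destinations, not real model names.
const exampleRules: Rule[] = [
  { name: "sensitive-local", match: { tag: "sensitive" }, destination: "ollama/local-model" },
  { name: "short-cheap", match: { model: "smart", maxPromptChars: 500 }, destination: "cloud/cheap-model" },
  { name: "default", match: {}, destination: "cloud/frontier-model" },
];
```

A catch-all rule with an empty match at the end keeps every request routable, which is easy to check in a PR review.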

Privacy pinning

A request tagged or detected as sensitive is pinned to a local model. Failovers respect the pin: if local is down, the request fails closed and never silently leaves the network.
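The fail-closed behaviour can be sketched as a filter over the failover chain followed by a hard error when nothing local remains. The Destination type, the isHealthy callback, and the function names are assumptions made for this example, not the project's internals.

```typescript
// Hypothetical destination type: "local" marks an on-network Ollama target.
interface Destination {
  id: string;
  location: "local" | "cloud";
}

// Drop every cloud destination from the chain when the request is pinned.
function applyPrivacyPin(chain: Destination[], sensitive: boolean): Destination[] {
  return sensitive ? chain.filter((d) => d.location === "local") : chain;
}

// Fail closed: a pinned request with no healthy local destination is rejected
// rather than forwarded to a cloud provider.
function pickDestination(
  chain: Destination[],
  sensitive: boolean,
  isHealthy: (d: Destination) => boolean,
): Destination {
  const first = applyPrivacyPin(chain, sensitive).find(isHealthy);
  if (first === undefined) {
    throw new Error(
      sensitive ? "pinned request: no local destination available" : "no destination available",
    );
  }
  return first;
}
```

Applying the pin before the health check is the design point: the cloud entries are removed from consideration entirely, so no later failover step can reach them.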

Rolling A/B routing

Migrate ten percent of traffic to a new model, watch the dashboard, ramp up. The application is unchanged.
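One common way to split traffic by percentage is to hash a stable key into a bucket from 0 to 99 and compare it with the rollout percentage. The sketch below takes that approach with an FNV-1a hash; whether the router buckets this way, and what it uses as the key, is not stated here, so treat fnv1a and chooseArm as illustrative.

```typescript
// FNV-1a 32-bit hash, applied to UTF-16 code units (identical to the
// byte-wise version for ASCII keys). Small and dependency-free.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash;
}

// Map the key to a stable bucket in [0, 100). The same key always lands in
// the same arm, and raising the percentage only moves keys from control to candidate.
function chooseArm(key: string, candidatePercent: number): "candidate" | "control" {
  return fnv1a(key) % 100 < candidatePercent ? "candidate" : "control";
}
```

Ramping from ten to fifty percent then means changing one number in the policy; keys already on the candidate model stay there.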

Cost optimisation

Cheap models for short prompts, expensive models for long-context. Per-request cost is logged and aggregated.
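Per-request cost is token counts multiplied by per-token prices, and aggregation is a sum per model. A minimal sketch, assuming prices are quoted in USD per million tokens; the Price type and function names are made up for this example, and any prices you pass in are your own figures.

```typescript
// Hypothetical price record, in USD per million tokens.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost of one request: input and output tokens are priced separately.
function requestCost(price: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
}

// Sum logged costs per model.
function totalByModel(log: { model: string; cost: number }[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const entry of log) {
    totals.set(entry.model, (totals.get(entry.model) ?? 0) + entry.cost);
  }
  return totals;
}
```

A local Ollama model can be given a price of zero, so the aggregate shows cloud spend only.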

Per-request audit

Every routing decision is logged with request hash, chosen model, latency, cost, and outcome. The log is stored in SQLite through better-sqlite3 and can be queried with SQL.

Failover chains

Each policy rule can list a failover sequence. If the primary returns 5xx, the next destination fires.
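The failover loop itself is short. The sketch below walks a chain of destination names and moves on when a response is 5xx. The Send callback stands in for the real provider call, and treating a thrown network error like a 5xx is an assumption of this example rather than documented behaviour.

```typescript
// Hypothetical transport: sends the request to one destination.
type Send = (destination: string) => Promise<{ status: number; body: string }>;

async function dispatch(
  chain: string[],
  send: Send,
): Promise<{ destination: string; status: number; body: string }> {
  let lastError: unknown = new Error("empty failover chain");
  for (const destination of chain) {
    try {
      const res = await send(destination);
      // Only 5xx triggers failover; 2xx and 4xx go back to the caller as-is.
      if (res.status < 500) return { destination, ...res };
      lastError = new Error(`${destination} returned ${res.status}`);
    } catch (err) {
      // Assumption: a network failure also advances to the next destination.
      lastError = err;
    }
  }
  // Every destination failed: surface the last failure.
  throw lastError;
}
```

Returning 4xx responses without failover avoids retrying a malformed request against every provider in the chain.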

Cloud providers

OpenAI, Anthropic, Google, Mistral, Groq, OpenRouter. Add more in twenty lines.

Edge-ready

Built on the Hono web framework, so the same code deploys to Bun, Node, Cloudflare Workers, and Deno. Sub-millisecond router overhead.

Architecture, in one diagram

The whole system on a single screen. Every box maps to a real folder in the repo.

┌──────────────────────────┐
│ Client (any OpenAI SDK)  │
└─────────────┬────────────┘
              │  /v1/chat/completions
              ▼
┌──────────────────────────────────────────────────┐
│ Local LLM Router (Hono, OpenAI-compatible)       │
│  ┌────────────────────────────────────────────┐  │
│  │ Policy engine                              │  │
│  │   match by model, tag, size, classifier    │  │
│  │   → destination + failover chain           │  │
│  └────────────────────┬───────────────────────┘  │
│                       ▼                          │
│  ┌────────────────────────────────────────────┐  │
│  │ Privacy pin                                │  │
│  │   sensitive request → local only           │  │
│  └────────────────────┬───────────────────────┘  │
│                       ▼                          │
│  ┌────────────────────────────────────────────┐  │
│  │ Dispatcher                                 │  │
│  │   try primary, on 5xx try next, journal    │  │
│  └────────────────────┬───────────────────────┘  │
│                       ▼                          │
│  ┌────────────────────────────────────────────┐  │
│  │ better-sqlite3 audit log                   │  │
│  │   request, model, latency, cost, outcome   │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
        │                   │                  │
        ▼                   ▼                  ▼
   Local Ollama       OpenAI / Anthr.    OpenRouter
   (gpu host)         (cloud)            (cloud)

Quick start

From clone to first request in under five minutes.

01

git clone https://github.com/sarmakska/local-llm-router.git
cd local-llm-router

02

bun install
cp policy.example.yaml policy.yaml
cp .env.example .env  # OPENAI_KEY, ANTHROPIC_KEY, OLLAMA_HOST

03

bun run dev   # Hono on :7877

04

curl -N http://localhost:7877/v1/chat/completions \
  -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
  -d '{"model":"smart","messages":[{"role":"user","content":"Hi"}],"stream":true}'

Where it fits

The patterns this repository was built around.

Privacy-sensitive AI products

Healthcare, legal, finance. Sensitive prompts pin to local Ollama. Non-sensitive prompts can use cloud frontier models.

Cost-conscious teams

Route short prompts to cheap models, long-context to expensive ones. Per-request cost logging shows what the routing actually saves.

Local-first development

Developers run Ollama locally; production uses cloud. The router is the single place where that swap happens.

Model migration

Migrate from one provider to another with rolling A/B. Watch quality and latency in the audit log; ramp up.

Related products

The wider Sarma Linux toolkit. Every project ships with the same opinions: open source, MIT, real depth, no marketing fluff.

SarmaLink-AI

Multi-provider AI assistant with sub-50ms failover across 36 engines.

MCP Server Toolkit

Production-ready Model Context Protocol server starter, with plugins.

Voice Agent Starter

Sub-second real-time voice loop with WebRTC, barge-in, and pluggable STT/TTS.

Agent Orchestrator

Deterministic-replay multi-agent workflows with durable state.

AI Eval Runner

Evals as code. Datasets, scorers, traces, regressions, all in one CLI.

StaffPortal

Open-source HR + ops platform built to replace three SaaS subscriptions.

RAG-over-PDF

A minimal, production-shaped RAG starter with cited streaming answers.

Receipt Scanner

Vision-OCR receipt scanning starter with Zod-typed JSON output.

Webhook-to-Email

A tiny, production-grade webhook receiver with HMAC and React Email.

Route smart. Keep private private.