Route every prompt to the cheapest model that can do the job.
An OpenAI-compatible proxy that classifies every request, applies your declarative YAML policy, and dispatches to local Ollama, hosted SarmaLink-AI or an OpenAI frontier model. Privacy pinning is fail-closed. Latency budgets spill slow local to fast cloud. Rolling A/B promotes from real production traffic.
Why this exists
Most teams shipping AI products want to use local models for some traffic and cloud models for the rest: local for privacy-sensitive prompts and for high-volume cheap traffic, cloud for the difficult cases. The right answer is per-request, not per-application, and the application is the wrong place to hard-code that routing logic.
Hosted LLM gateways handle the cloud side beautifully but treat local as second class. Local gateways handle Ollama beautifully but do not expose cloud backends in the same shape. Teams keep stitching their own routers from npm packages and regrets.
local-llm-router is the focused middle. An OpenAI-compatible proxy with first-class Ollama support, first-class cloud support, and a YAML policy you can reason about and PR-review. Privacy pinning is a feature. Latency-budget fallback is a feature. Rolling A/B is a feature. The application calls one URL and gets the right answer from the right place.
Request flow
Classify, decide, dispatch, stream. Every routing step happens in-process, so the only network round-trip is the call to the backend.
What the classifier sees
Six dimensions, deterministic, no model call. The policy file matches on any conjunction of them.
| Dimension | Possible values | How it is set |
|---|---|---|
| task | code · vision · web_search · general | Heuristics over keywords, content parts (image_url), and tool/function hints in the request. |
| complexity | low · medium · high | Length, presence of multi-step reasoning markers, code-block depth. |
| sensitivity | low · medium · high | Regex bank for PII / PHI / secrets, plus explicit metadata.sensitivity if the client sets it. |
| modality | text · image · audio · multi | Inspects content parts and the model name family. |
| family | qwen-coder · gemma · llama · gpt · sonnet | Resolved from the requested model name or chosen by task when model = "auto". |
| tokens | estimated count | Light tokeniser estimate over the prompt and system messages. |
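To make the table concrete, here is a minimal sketch of what a deterministic classifier of this kind could look like. It covers only three of the six dimensions (task, sensitivity, tokens). The type names, keyword lists, regexes, the "strictest of declared and detected" rule and the four-characters-per-token estimate are all illustrative assumptions, not the router's actual rules.

```typescript
// Hypothetical types and heuristics; the real classifier's rules are not shown here.
type Task = "code" | "vision" | "web_search" | "general";
type Level = "low" | "medium" | "high";

interface ContentPart { type: string; text?: string }
interface Message { role: string; content: string | ContentPart[] }
interface ChatRequest {
  model: string;
  messages: Message[];
  metadata?: { sensitivity?: Level };
}
interface Tags { task: Task; sensitivity: Level; tokens: number }

// Illustrative regex bank: an email address, a US-style SSN, an "sk-" style key.
const SENSITIVE: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.]+/,
  /\b\d{3}-\d{2}-\d{4}\b/,
  /\bsk-[A-Za-z0-9]{16,}\b/,
];
const RANK: Record<Level, number> = { low: 0, medium: 1, high: 2 };

// Flatten string and content-part messages into one block of text.
function textOf(req: ChatRequest): string {
  return req.messages
    .map((m) =>
      typeof m.content === "string"
        ? m.content
        : m.content.map((p) => p.text ?? "").join(" "),
    )
    .join("\n");
}

function classify(req: ChatRequest): Tags {
  const text = textOf(req);
  const hasImage = req.messages.some(
    (m) => typeof m.content !== "string" && m.content.some((p) => p.type === "image_url"),
  );

  let task: Task = "general";
  if (hasImage) task = "vision";
  else if (/\b(function|refactor|stack trace|compile)\b/i.test(text)) task = "code";
  else if (/\b(latest|today|news)\b/i.test(text)) task = "web_search";

  // Assumption: take the stricter of the client's declared level and the detected one.
  const detected: Level = SENSITIVE.some((re) => re.test(text)) ? "high" : "low";
  const declared: Level = req.metadata?.sensitivity ?? "low";
  const sensitivity = RANK[declared] >= RANK[detected] ? declared : detected;

  // Rough estimate: about four characters per token.
  const tokens = Math.ceil(text.length / 4);
  return { task, sensitivity, tokens };
}
```

Because nothing here calls a model or the network, the same request always produces the same tags, which is what lets the policy file be reasoned about in review.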
The whole policy.yaml
This is the example policy shipped with the repo. Read it top to bottom: that is the order in which rules fire.
backends:
  local:
    type: ollama
    endpoint: http://localhost:11434
    # The classifier picks a family; the router resolves it to one of
    # these models. Code goes to Qwen 2.5 Coder, vision to Gemma 3,
    # everything else to Llama 4.
    families:
      qwen-coder: qwen2.5-coder:7b
      gemma: gemma3:12b
      llama: llama4:16x17b
    models: [llama4:16x17b, qwen2.5-coder:7b, gemma3:12b]
    p50Ms: 1800
  sarmalink:
    type: sarmalink
    endpoint: https://api.sarmalink.ai/v1
    model: smart
    p50Ms: 600
  frontier:
    type: openai
    endpoint: https://api.openai.com/v1
    model: gpt-4o
    p50Ms: 900

routes:
  # Privacy pin: sensitive requests never leave the machine.
  - match: { sensitivity: high }
    backend: local
    reason: "Privacy: never leave the machine"
  # Short code edits run on local Qwen 2.5 Coder with cloud fallback.
  - match: { task: code, complexity: low }
    backend: local
    fallback: sarmalink
    latencyBudgetMs: 2500
  # Hard code goes straight to the frontier model.
  - match: { task: code, complexity: high }
    backend: frontier
    fallback: sarmalink
  # Image prompts route to local Gemma 3 with vision fallback.
  - match: { task: vision }
    backend: local
    fallback: frontier
    latencyBudgetMs: 3000
  # Live-data questions need cloud tools.
  - match: { task: web_search }
    backend: sarmalink
  # Everything else: local first, spill to hosted when too slow.
  - default: local
    fallback: sarmalink
    latencyBudgetMs: 1200

ab:
  enabled: false
  sampleRate: 0.05
  candidates:
    local: sarmalink

Every field is documented on the Policy DSL wiki page.
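The "rules fire in file order" behaviour can be sketched as a first-match scan over the routes. This is a minimal illustration, assuming that a rule matches when every key in its `match` block equals the corresponding classifier tag, and that the first matching rule wins. The `Route` type and the `selectRoute` helper are hypothetical names, and the `default: local` rule is normalised here to a rule with no `match`.

```typescript
// Hypothetical shapes; the real policy schema is documented on the wiki.
type Tags = Record<string, string>;

interface Route {
  match?: Tags; // absent on the default rule
  backend: string;
  fallback?: string;
  latencyBudgetMs?: number;
}

// A rule matches when every key in `match` equals the request's tag (a conjunction).
function matches(route: Route, tags: Tags): boolean {
  return Object.entries(route.match ?? {}).every(([key, value]) => tags[key] === value);
}

// Rules are tried in file order; the first match wins (assumed semantics).
function selectRoute(routes: Route[], tags: Tags): Route {
  const hit = routes.find((route) => matches(route, tags));
  if (!hit) throw new Error("no route matched and no default rule is present");
  return hit;
}

// A subset of the example policy, in the same order as the YAML.
const routes: Route[] = [
  { match: { sensitivity: "high" }, backend: "local" },
  { match: { task: "code", complexity: "low" }, backend: "local", fallback: "sarmalink", latencyBudgetMs: 2500 },
  { match: { task: "code", complexity: "high" }, backend: "frontier", fallback: "sarmalink" },
  { backend: "local", fallback: "sarmalink", latencyBudgetMs: 1200 }, // `default: local`
];
```

Under these assumptions a hard coding request that also trips the sensitivity check lands on the first rule and stays local, because the privacy pin sits above the frontier rule in the file.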
What is in the box
Twelve features. Every one is implemented in the repo today.
OpenAI Chat Completions API
Drop-in for /v1/chat/completions. Streaming and non-streaming. Any OpenAI client works without code changes: point base_url at the router and set model: "auto".
OpenAI Responses API
Also serves /v1/responses. Accepts input, instructions, and content parts; returns a Responses envelope with output, output_text, and usage. Streaming is the backend's native SSE.
Deterministic classifier
Tags every request with task, complexity, sensitivity, modality, an open-weight model family, and an estimated token count. No extra model call, no extra round trip, no extra cost.
YAML routing policy
A single policy file describes routing decisions. Match by task, complexity, sensitivity, modality, family or model. Reviewed in PRs like the rest of your code.
Privacy pinning, fail-closed
A request tagged or detected as sensitive is pinned to a local backend. If the local backend is unavailable, the request fails closed: it never silently leaves the network.
Family-to-model resolution
The classifier picks a family (qwen-coder, gemma, llama). The decision engine resolves it to a concrete model on the chosen backend's families map. One policy drives a heterogeneous fleet.
Latency-budget fallback
A route can set latencyBudgetMs. When the primary backend's expected latency (from live metrics or p50Ms hint) exceeds the budget and the fallback is faster, the request shifts. Slow local never blows a tight interactive budget.
Three backends in the box
Ollama for local, SarmaLink-AI for hosted multi-provider, OpenAI for frontier. A registry pattern makes a new backend roughly sixty lines of TypeScript.
node:sqlite metrics
Per-route success, latency and fallback rate persist in node:sqlite. JSON summary at /v1/metrics, Prometheus text at /metrics. No extra dependency, no extra process.
Rolling A/B routing
Mirror a sample of traffic to a candidate backend in the background. The router records its latency and success. /v1/ab reports which candidates are ready to promote.
Hono runtime, edge-ready
Built on Hono. Runs on Bun, Node, Cloudflare Workers, Deno. Sub-millisecond router overhead. Same code, same behaviour; only the runtime differs.
Typed config, Zod-validated
The policy file is loaded once at startup and validated with Zod. A malformed policy fails fast with a clear message; it never reaches a live request.
Three backends in the box
A registry pattern means a fourth backend is typically sixty lines.
Ollama (local)
On-prem GPU host or laptop. The privacy-pinned destination. Reads families to map qwen-coder, gemma and llama family tags to concrete model tags.
SarmaLink-AI (cloud)
Hosted multi-provider gateway with web tools and chat memory. The natural cloud fallback when local cannot keep up.
OpenAI frontier
Frontier model on hard code and reasoning. The escape hatch when local and SarmaLink-AI cannot do the job.
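The family-to-model step that the Ollama backend performs can be pictured as a plain lookup. The sketch below is an assumption-laden illustration: the `BackendConfig` shape and `resolveModel` helper are hypothetical, and the precedence (families map first, then the backend's single `model`) is a guess at reasonable behaviour rather than a documented rule.

```typescript
// Hypothetical config shape mirroring the example policy's `backends` block.
interface BackendConfig {
  type: string;
  model?: string; // single-model backends
  families?: Record<string, string>; // family tag -> concrete model tag
}

const backends: Record<string, BackendConfig> = {
  local: {
    type: "ollama",
    families: { "qwen-coder": "qwen2.5-coder:7b", gemma: "gemma3:12b", llama: "llama4:16x17b" },
  },
  sarmalink: { type: "sarmalink", model: "smart" },
  frontier: { type: "openai", model: "gpt-4o" },
};

// Assumed precedence: the families map first, then the backend's single `model`.
function resolveModel(backendName: string, family: string): string {
  const backend = backends[backendName];
  if (!backend) throw new Error(`unknown backend: ${backendName}`);
  const model = backend.families?.[family] ?? backend.model;
  if (!model) throw new Error(`backend ${backendName} has no model for family ${family}`);
  return model;
}
```

With this shape, the same family tag can resolve to different concrete models depending on which backend the route chose.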
Quick start
Clone, install, point an OpenAI client at port 3030, watch the metrics.
git clone https://github.com/sarmakska/local-llm-router.git
cd local-llm-router
pnpm install

cp policy.example.yaml policy.yaml
cp .env.example .env   # OPENAI_API_KEY, SARMALINK_API_KEY, LLR_POLICY, LLR_DB

pnpm dev               # Hono on :3030

curl -N http://localhost:3030/v1/chat/completions \
  -H "Authorization: Bearer anything" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Refactor this function..."}],
    "stream": true
  }'

from openai import OpenAI
client = OpenAI(base_url="http://localhost:3030/v1", api_key="anything")
client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hi"}]
)

curl http://localhost:3030/v1/metrics   # JSON
curl http://localhost:3030/metrics      # Prometheus text
curl http://localhost:3030/v1/ab        # A/B candidate report
Environment variables
Five environment variables. Two are paths, two are credentials, one is the port.
| Variable | Purpose | Default |
|---|---|---|
| LLR_PORT | HTTP port for the Hono server. | 3030 |
| LLR_POLICY | Path to the YAML policy file. | ./policy.yaml |
| LLR_DB | Path to the node:sqlite metrics database. | ./metrics.db |
| OPENAI_API_KEY | Frontier backend credential. | unset (disabled) |
| SARMALINK_API_KEY | SarmaLink-AI backend credential. | unset (disabled) |
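As an illustration of how the defaults in the table could be applied, here is a small sketch that reads the five variables from a plain key/value map. The `RouterEnv` type and `readEnv` helper are hypothetical; only the variable names and defaults come from the table.

```typescript
// Hypothetical settings object; field names are illustrative.
interface RouterEnv {
  port: number;
  policyPath: string;
  dbPath: string;
  openaiKey: string | undefined; // undefined means the frontier backend is disabled
  sarmalinkKey: string | undefined; // undefined means the SarmaLink-AI backend is disabled
}

function readEnv(env: Record<string, string | undefined>): RouterEnv {
  const rawPort = env["LLR_PORT"] ?? "3030";
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`LLR_PORT must be a valid port number, got "${rawPort}"`);
  }
  return {
    port,
    policyPath: env["LLR_POLICY"] ?? "./policy.yaml",
    dbPath: env["LLR_DB"] ?? "./metrics.db",
    // Treat an empty string the same as unset.
    openaiKey: env["OPENAI_API_KEY"] || undefined,
    sarmalinkKey: env["SARMALINK_API_KEY"] || undefined,
  };
}
```

On Node this would be called as `readEnv(process.env)`; taking the map as a parameter keeps the function easy to test.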
Decision sequence
What actually happens when a request comes in. Sensitivity pin trumps budget. Budget trumps preference.
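The precedence stated above (pin over budget, budget over preference) can be sketched as a three-step function. This is an illustrative outline, not the router's decision engine: the `decide` signature, the `Decision` shape and the `localBackend` parameter are assumptions, and the latency hints are copied from the example policy.

```typescript
// Hypothetical shapes for a matched route and the resulting decision.
interface Route { backend: string; fallback?: string; latencyBudgetMs?: number }
interface Decision { backend: string; reason: string; pinned: boolean }

function decide(
  route: Route,
  sensitivity: "low" | "medium" | "high",
  expectedMs: (backend: string) => number,
  localBackend = "local",
): Decision {
  // 1. Sensitivity pin trumps budget: stay local regardless of latency.
  if (sensitivity === "high") {
    return { backend: localBackend, reason: "privacy pin", pinned: true };
  }
  // 2. Budget trumps preference: spill only when the primary is over budget
  //    and the fallback is expected to be faster.
  if (route.latencyBudgetMs !== undefined && route.fallback !== undefined) {
    const primaryMs = expectedMs(route.backend);
    const fallbackMs = expectedMs(route.fallback);
    if (primaryMs > route.latencyBudgetMs && fallbackMs < primaryMs) {
      return { backend: route.fallback, reason: "latency budget", pinned: false };
    }
  }
  // 3. Otherwise honour the route's preferred backend.
  return { backend: route.backend, reason: "route preference", pinned: false };
}

// p50Ms hints from the example policy, standing in for live metrics.
const hints: Record<string, number> = { local: 1800, sarmalink: 600, frontier: 900 };
const expectedMs = (backend: string): number => hints[backend] ?? Infinity;
```

With those hints, the default rule (budget 1200 ms) would spill from local to sarmalink, while the short-code rule (budget 2500 ms) would stay local.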
Use cases
What teams actually run this for.
Privacy-sensitive AI products
Healthcare, legal, finance. Sensitive prompts pin to local Ollama. Non-sensitive prompts can reach cloud frontier models. Fail-closed means PII never leaves the network by mistake.
Cost-conscious teams
Trivial prompts go to local or cheap hosted; long-context or hard prompts go to frontier. The audit log shows the savings honestly, per route.
Local-first development
Developers run Ollama locally; production uses cloud. The router is the single point of swap. Same client code, same model: "auto", different policy.
Model migration
Migrate from one provider to another with rolling A/B. Watch quality, latency and cost in the audit log. Promote when ready.
Regional egress control
Some workloads are not allowed to leave a region. Privacy pinning extended with regional metadata keeps those workloads on a regional Ollama.
Edge runtime experiments
Hono runs on Cloudflare Workers and Bun and Deno. Deploy the same router to a different runtime and compare cold-start and overhead numbers in the same metrics store.
local-llm-router vs alternatives
How the router compares to the closest tools in the space. Honest scope-by-scope.
| Feature | local-llm-router | LiteLLM | Portkey | OpenRouter | Ollama |
|---|---|---|---|---|---|
| OpenAI-compatible (Chat + Responses) | Both APIs | Chat only | Both | Chat only | Chat only |
| Native local backend (Ollama) | First-class | Supported | Supported | Not in scope | Native |
| YAML policy with classifier | Built-in | Limited | Config UI | Per-request | No |
| Privacy pinning, fail-closed | First-class feature | Manual | Manual | N/A | N/A (local only) |
| Rolling A/B with shadow traffic | Built-in | Bring your own | Paid tier | No | No |
| Latency-budget fallback | Per-route | No | Manual | Per-request | N/A |
| Self-hosted, single binary | Yes, Hono + sqlite | Yes | Hosted SaaS | Hosted only | Yes |
| Licence | MIT | MIT | Commercial | Commercial | MIT |
Tech stack
Small surface. Built-in Node primitives where possible.
Documentation & guides
Wiki pages cover the architecture, the DSL, the backends, and the deployment story.
Frequently asked
The questions that come up most before adoption.
How is this different from LiteLLM?
LiteLLM is a library and an excellent multi-provider gateway. local-llm-router is a focused self-hosted proxy where local Ollama is a first-class destination, YAML routing policy is the primary interface, and privacy pinning with fail-closed semantics is built in rather than bolted on. If you need a broad multi-provider gateway with billing, LiteLLM is the right tool. If you need a small self-hosted router that takes local seriously, this is it.
Does the classifier call an LLM?
No. The classifier is deterministic and runs in-process. It uses heuristics over the request body and metadata to produce task, complexity, sensitivity, modality, family and token estimates. Adding a model call would buy a little accuracy at a large latency cost; we did not make that trade.
What does fail-closed mean for privacy pinning?
A request tagged sensitive is pinned to a local backend. If the local backend returns a non-success or is unreachable, the request fails with an explicit error rather than silently spilling to a cloud fallback. The pin is a hard wall, not a preference.
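The "hard wall" behaviour can be sketched as a dispatcher that refuses to try the fallback when the decision is pinned. This is a simplified illustration: `send` is a hypothetical, synchronous stand-in for a backend call (a real dispatcher would be asynchronous and streaming), and the `Decision` and `SendResult` shapes are assumptions.

```typescript
// Hypothetical result and decision shapes.
type SendResult = { ok: true; body: string } | { ok: false; error: string };
interface Decision { backend: string; fallback?: string; pinned: boolean }

// `send` stands in for the real backend call; synchronous only to keep the sketch short.
function dispatch(decision: Decision, send: (backend: string) => SendResult): SendResult {
  const first = send(decision.backend);
  if (first.ok) return first;

  // Fail closed: a pinned request surfaces the error instead of spilling to the cloud.
  if (decision.pinned) {
    return {
      ok: false,
      error: `privacy pin: ${decision.backend} failed (${first.error}); not retrying elsewhere`,
    };
  }

  // Unpinned requests may try the configured fallback once.
  if (decision.fallback !== undefined) return send(decision.fallback);
  return first;
}
```

The important property is that the pinned branch returns before the fallback is ever consulted, so a misconfigured `fallback` on a sensitive route cannot leak the prompt.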
How does latency-budget fallback decide to spill?
The decision engine compares the primary backend's expected latency (live metrics if present, the backend's p50Ms hint otherwise) against the route's latencyBudgetMs. If the primary is over budget and the fallback is faster, the request goes to the fallback. The check happens per-request, never per-pod.
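One way to read "live metrics if present, the p50Ms hint otherwise" is as a median over recent latency samples with the configured hint as a cold-start default. The sketch below assumes exactly that; the `minSamples` threshold and both function names are hypothetical, and the real engine may aggregate its metrics differently.

```typescript
// Median of a list of latency samples in milliseconds.
function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? 0;
  const lower = sorted[mid - 1] ?? upper;
  return sorted.length % 2 === 1 ? upper : (lower + upper) / 2;
}

// Assumption: trust live samples only once there are enough of them;
// until then, fall back to the backend's configured p50Ms hint.
function expectedLatencyMs(
  liveSamples: number[] | undefined,
  p50MsHint: number,
  minSamples = 5,
): number {
  if (liveSamples !== undefined && liveSamples.length >= minSamples) {
    return median(liveSamples);
  }
  return p50MsHint;
}
```

For example, a local backend with a 1800 ms hint but recent samples clustered around 1000 ms would stop spilling on a 1200 ms budget once enough samples accumulate.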
Why node:sqlite instead of an external metrics store?
For per-route success, latency and fallback rate, node:sqlite is enough. It is built into Node 22, has no separate process, and serialises writes safely. The Prometheus text endpoint is there if you want to scrape into a real metrics stack.
Can I add a new backend?
Yes. Backends are a registry pattern: implement a small interface, register it under a type, expose it in policy.yaml. New backends are typically sixty lines of TypeScript plus a streaming SSE bridge.
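A registry of this kind is usually a map from a type string to a factory. The sketch below shows the general pattern only: the `Backend` interface, `registerBackend`, `createBackend` and the `echo` backend are all hypothetical and far smaller than a real backend, which would also need the streaming SSE bridge mentioned above.

```typescript
// Hypothetical minimal interface; a real backend would expose async, streaming calls.
interface BackendConfig { type: string; endpoint: string; model?: string }
interface Backend {
  name: string;
  complete(prompt: string): string;
}
type BackendFactory = (name: string, config: BackendConfig) => Backend;

const registry = new Map<string, BackendFactory>();

// Register a factory under the `type` string used in policy.yaml.
function registerBackend(type: string, factory: BackendFactory): void {
  if (registry.has(type)) throw new Error(`backend type already registered: ${type}`);
  registry.set(type, factory);
}

// Build a backend instance from one entry of the policy's `backends` block.
function createBackend(name: string, config: BackendConfig): Backend {
  const factory = registry.get(config.type);
  if (!factory) throw new Error(`unknown backend type: ${config.type}`);
  return factory(name, config);
}

// A made-up "echo" backend, registered the same way a real one would be.
registerBackend("echo", (name) => ({
  name,
  complete: (prompt) => prompt,
}));
```

Under this pattern, a policy entry with `type: echo` would resolve through `createBackend` without any change to the routing code.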
Does the router support tools and function calling?
The proxy passes tool definitions through unchanged. Tool execution is the backend's job; the router's job is to route the request to the right backend in the first place.
Is it safe to expose publicly?
It is a router, not a gateway. There is no auth, no rate limiting, no billing. Put it behind your own ingress or an API-gateway layer for those concerns. The wiki has a deployment pattern with Cloudflare Access in front.
Route smart. Keep private private.
Clone the repo, copy the example policy, point an OpenAI client at port 3030, and watch the metrics. MIT licensed.
Related projects
Part of a portfolio of production-shaped open-source repos.