Open Source · MIT · TypeScript · Hono

Route every prompt to the cheapest model that can do the job.

An OpenAI-compatible proxy that classifies every request, applies your declarative YAML policy, and dispatches to local Ollama, hosted SarmaLink-AI or an OpenAI frontier model. Privacy pinning is fail-closed. Latency budgets spill slow local to fast cloud. Rolling A/B promotes from real production traffic.

YAML: Policy as code

Privacy: Pinning, fail-closed

A/B: Rolling routing

OpenAI: Wire-compatible

MIT: Licence

Why this exists

Most teams shipping AI products want local models for some traffic and cloud models for the rest: local for privacy-sensitive prompts and for high-volume cheap traffic, cloud for the difficult cases. The right answer is per-request, not per-application, and the application is the wrong place to hard-code that routing logic.

Hosted LLM gateways handle the cloud side beautifully but treat local as second class. Local gateways handle Ollama beautifully but do not interoperate with cloud at the same shape. Teams keep stitching their own routers from npm packages and regrets.

local-llm-router is the focused middle. An OpenAI-compatible proxy with first-class Ollama support, first-class cloud support, and a YAML policy you can reason about and PR-review. Privacy pinning is a feature. Latency-budget fallback is a feature. Rolling A/B is a feature. The application calls one URL and gets the right answer from the right place.

Request flow

Classify, decide, dispatch, stream. Every step happens in-process; the router adds no extra network round trip, so the latency you see is the backend's.

Request flow: the classifier produces dimensions, the decision engine walks the policy, privacy pin overrides budget, the dispatcher streams from the chosen backend and writes a row to node:sqlite.

What the classifier sees

Six dimensions, deterministic, no model call. The policy file matches on any conjunction of them.

Dimension	Possible values	How it is set
task	code · vision · web_search · general	Heuristics over keywords, content parts (image_url), and tool/function hints in the request.
complexity	low · medium · high	Length, presence of multi-step reasoning markers, code-block depth.
sensitivity	low · medium · high	Regex bank for PII / PHI / secrets, plus explicit metadata.sensitivity if the client sets it.
modality	text · image · audio · multi	Inspects content parts and the model name family.
family	qwen-coder · gemma · llama · gpt · sonnet	Resolved from the requested model name or chosen by task when model = "auto".
tokens	estimated count	Light tokeniser estimate over the prompt and system messages.
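
The table above can be read as a pure function from request body to six tags. The following is a minimal sketch under stated assumptions: the keyword lists, regexes, thresholds, the four-characters-per-token estimate, and the names `classify` and `familyFor` are all hypothetical and are not taken from the repo. It also assumes that a client-declared `metadata.sensitivity` can raise the detected level but never lower it.

```typescript
// Illustrative only: regexes, keyword lists and thresholds are assumptions.
type Level = "low" | "medium" | "high";
type Part = { type: string; text?: string };
type Message = { role: string; content: string | Part[] };

interface ChatRequest {
  model: string;
  messages: Message[];
  metadata?: { sensitivity?: Level };
}

interface Dimensions {
  task: "code" | "vision" | "web_search" | "general";
  complexity: Level;
  sensitivity: Level;
  modality: "text" | "image" | "audio" | "multi";
  family: string;
  tokens: number;
}

const SENSITIVE = [/\b\d{3}-\d{2}-\d{4}\b/, /\bsk-[A-Za-z0-9]{20,}/, /\bpassword\s*[:=]/i];
const CODE_HINTS = /```|\b(function|class|refactor|stack trace|compile)\b/i;
const SEARCH_HINTS = /\b(latest|today|news|current price)\b/i;
const RANK: Record<Level, number> = { low: 0, medium: 1, high: 2 };

// Resolve the family from an explicit model name, else pick one by task.
function familyFor(model: string, task: Dimensions["task"]): string {
  if (model !== "auto") {
    for (const f of ["qwen", "gemma", "llama", "gpt", "sonnet"]) {
      if (model.indexOf(f) !== -1) return f === "qwen" ? "qwen-coder" : f;
    }
  }
  return task === "code" ? "qwen-coder" : task === "vision" ? "gemma" : "llama";
}

function classify(req: ChatRequest): Dimensions {
  let text = "";
  let hasImage = false;
  let hasAudio = false;
  for (const m of req.messages) {
    if (typeof m.content === "string") {
      text += m.content + "\n";
      continue;
    }
    for (const p of m.content) {
      if (p.type === "image_url") hasImage = true;
      if (p.type === "input_audio") hasAudio = true;
      if (p.text) text += p.text + "\n";
    }
  }
  const task: Dimensions["task"] = hasImage
    ? "vision"
    : CODE_HINTS.test(text) ? "code" : SEARCH_HINTS.test(text) ? "web_search" : "general";
  const tokens = Math.ceil(text.length / 4); // rough estimate, assumed ratio
  const complexity: Level = tokens > 2000 ? "high" : tokens > 300 ? "medium" : "low";
  const detected: Level = SENSITIVE.some((re) => re.test(text)) ? "high" : "low";
  const declared: Level = req.metadata?.sensitivity ?? "low";
  // Take the stricter of the declared and detected levels (an assumption).
  const sensitivity = RANK[declared] > RANK[detected] ? declared : detected;
  const modality: Dimensions["modality"] =
    hasImage && hasAudio ? "multi" : hasImage ? "image" : hasAudio ? "audio" : "text";
  return { task, complexity, sensitivity, modality, family: familyFor(req.model, task), tokens };
}
```

Because nothing here calls a model or the network, the same request always produces the same tags, which is what lets a policy file match on them predictably.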

The whole policy.yaml

This is the example policy shipped with the repo. Read it top to bottom: that is the order in which rules fire.

backends:
  local:
    type: ollama
    endpoint: http://localhost:11434
    # The classifier picks a family; the router resolves it to one of
    # these models. Code goes to Qwen 2.5 Coder, vision to Gemma 3,
    # everything else to Llama 4.
    families:
      qwen-coder: qwen2.5-coder:7b
      gemma: gemma3:12b
      llama: llama4:16x17b
    models: [llama4:16x17b, qwen2.5-coder:7b, gemma3:12b]
    p50Ms: 1800

  sarmalink:
    type: sarmalink
    endpoint: https://api.sarmalink.ai/v1
    model: smart
    p50Ms: 600

  frontier:
    type: openai
    endpoint: https://api.openai.com/v1
    model: gpt-4o
    p50Ms: 900

routes:
  # Privacy pin: sensitive requests never leave the machine.
  - match: { sensitivity: high }
    backend: local
    reason: "Privacy: never leave the machine"

  # Short code edits run on local Qwen 2.5 Coder with cloud fallback.
  - match: { task: code, complexity: low }
    backend: local
    fallback: sarmalink
    latencyBudgetMs: 2500

  # Hard code goes straight to the frontier model.
  - match: { task: code, complexity: high }
    backend: frontier
    fallback: sarmalink

  # Image prompts route to local Gemma 3 with vision fallback.
  - match: { task: vision }
    backend: local
    fallback: frontier
    latencyBudgetMs: 3000

  # Live-data questions need cloud tools.
  - match: { task: web_search }
    backend: sarmalink

  # Everything else: local first, spill to hosted when too slow.
  - default: local
    fallback: sarmalink
    latencyBudgetMs: 1200

ab:
  enabled: false
  sampleRate: 0.05
  candidates:
    local: sarmalink

Every field is documented: Policy DSL wiki page
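
To make the top-to-bottom ordering concrete, here is a minimal sketch of how a route list like the one above might be walked. It assumes first-match-wins semantics and exact equality on every key in `match`; the names `pickRoute`, `Route` and `Decision` are hypothetical and not taken from the repo.

```typescript
// Sketch only: assumes the first matching route wins and match keys are compared by equality.
type Dims = Record<string, string>;

interface Route {
  match?: Dims;
  default?: string;
  backend?: string;
  fallback?: string;
  latencyBudgetMs?: number;
  reason?: string;
}

interface Decision {
  backend: string;
  fallback?: string;
  latencyBudgetMs?: number;
  reason?: string;
}

function pickRoute(routes: Route[], dims: Dims): Decision {
  for (const r of routes) {
    const isDefault = r.default !== undefined;
    // A match block is a conjunction: every listed key must equal the classified value.
    const matches =
      isDefault ||
      (r.match !== undefined && Object.keys(r.match).every((k) => dims[k] === r.match![k]));
    if (!matches) continue;
    const backend = r.default ?? r.backend;
    if (!backend) throw new Error("route has neither backend nor default");
    return { backend, fallback: r.fallback, latencyBudgetMs: r.latencyBudgetMs, reason: r.reason };
  }
  throw new Error("no route matched and no default route is defined");
}

// The routes from the example policy, transcribed by hand.
const exampleRoutes: Route[] = [
  { match: { sensitivity: "high" }, backend: "local", reason: "Privacy: never leave the machine" },
  { match: { task: "code", complexity: "low" }, backend: "local", fallback: "sarmalink", latencyBudgetMs: 2500 },
  { match: { task: "code", complexity: "high" }, backend: "frontier", fallback: "sarmalink" },
  { match: { task: "vision" }, backend: "local", fallback: "frontier", latencyBudgetMs: 3000 },
  { match: { task: "web_search" }, backend: "sarmalink" },
  { default: "local", fallback: "sarmalink", latencyBudgetMs: 1200 },
];
```

Under these assumptions a hard code request that is also tagged `sensitivity: high` stops at the first rule and stays local with no fallback, which is why the privacy rule sits at the top of the file.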

What is in the box

Twelve features. Every one is implemented in the repo today.

OpenAI Chat Completions API

Drop-in for /v1/chat/completions. Streaming and non-streaming. Any OpenAI client works without code changes: point base_url at the router and set model: "auto".

OpenAI Responses API

Also serves /v1/responses. Accepts input, instructions, and content parts; returns a Responses envelope with output, output_text, and usage. Streaming is the backend's native SSE.

Deterministic classifier

Tags every request with task, complexity, sensitivity, modality, an open-weight model family, and an estimated token count. No extra model call, no extra round trip, no extra cost.

YAML routing policy

A single policy file describes routing decisions. Match by task, complexity, sensitivity, modality, family or model. Reviewed in PRs like the rest of your code.

Privacy pinning, fail-closed

A request tagged or detected as sensitive is pinned to a local backend. If the local backend is unavailable, the request fails closed: it never silently leaves the network.

Family-to-model resolution

The classifier picks a family (qwen-coder, gemma, llama). The decision engine resolves it to a concrete model on the chosen backend's families map. One policy drives a heterogeneous fleet.
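
A minimal sketch of that resolution step, assuming a lookup order of `families` map first, then the backend's single `model`, then the first entry in `models`. The order and the name `resolveModel` are assumptions for illustration; only the field names come from the example policy.

```typescript
// Sketch only: the fallback order (families, then model, then models[0]) is an assumption.
interface BackendConfig {
  type: string;
  model?: string;
  models?: string[];
  families?: Record<string, string>;
}

function resolveModel(backend: BackendConfig, family: string): string {
  const mapped = backend.families ? backend.families[family] : undefined;
  if (mapped) return mapped;
  if (backend.model) return backend.model;
  const first = backend.models ? backend.models[0] : undefined;
  if (first) return first;
  throw new Error(`backend of type "${backend.type}" has no model for family "${family}"`);
}

// Backends transcribed from the example policy.
const localBackend: BackendConfig = {
  type: "ollama",
  families: { "qwen-coder": "qwen2.5-coder:7b", gemma: "gemma3:12b", llama: "llama4:16x17b" },
  models: ["llama4:16x17b", "qwen2.5-coder:7b", "gemma3:12b"],
};
const frontierBackend: BackendConfig = { type: "openai", model: "gpt-4o" };
```

With this shape, the same `qwen-coder` tag becomes `qwen2.5-coder:7b` on the local backend and `gpt-4o` on the frontier backend, so the policy never has to name concrete models in its routes.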

Latency-budget fallback

A route can set latencyBudgetMs. When the primary backend's expected latency (from live metrics or p50Ms hint) exceeds the budget and the fallback is faster, the request shifts. Slow local never blows a tight interactive budget.

Three backends in the box

Ollama for local, SarmaLink-AI for hosted multi-provider, OpenAI for frontier. A registry pattern makes a new backend roughly sixty lines of TypeScript.

node:sqlite metrics

Per-route success, latency and fallback rate persist in node:sqlite. JSON summary at /v1/metrics, Prometheus text at /metrics. No extra dependency, no extra process.
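
For readers who have not written a Prometheus endpoint before, this is roughly what turning per-route rows into the text exposition format looks like. It is a sketch, not the repo's code: the metric names (`llr_requests_total` and friends), the `route` label, and the `RouteStats` shape are all hypothetical.

```typescript
// Sketch only: metric names, labels and the row shape are assumptions.
interface RouteStats {
  route: string;
  backend: string;
  requests: number;
  successes: number;
  fallbacks: number;
  latencyMsSum: number;
}

// Label values must escape backslash, double quote and newline.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function toPrometheus(rows: RouteStats[]): string {
  const metrics: Array<[string, string, (r: RouteStats) => number]> = [
    ["llr_requests_total", "Requests handled per route.", (r) => r.requests],
    ["llr_successes_total", "Successful responses per route.", (r) => r.successes],
    ["llr_fallbacks_total", "Requests served by the fallback backend.", (r) => r.fallbacks],
    ["llr_latency_ms_sum", "Sum of observed latency in milliseconds.", (r) => r.latencyMsSum],
  ];
  const lines: string[] = [];
  for (const [name, help, pick] of metrics) {
    lines.push(`# HELP ${name} ${help}`);
    lines.push(`# TYPE ${name} counter`);
    for (const r of rows) {
      lines.push(`${name}{route="${escapeLabel(r.route)}",backend="${escapeLabel(r.backend)}"} ${pick(r)}`);
    }
  }
  return lines.join("\n") + "\n";
}
```

Fallback rate and mean latency are left for the scraper to derive (for example, fallbacks divided by requests), which keeps every exported series a monotonic counter.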

Rolling A/B routing

Mirror a sample of traffic to a candidate backend in the background. The router records its latency and success. /v1/ab reports which candidates are ready to promote.
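
The sampling half of this is simple enough to sketch; the promotion half depends on criteria the page does not spell out. The example below assumes a candidate is ready when it has at least 200 mirrored samples, a success rate no lower than the primary's, and a mean latency no higher. Those thresholds, and the names `mirrorTarget` and `readyToPromote`, are hypothetical.

```typescript
// Sketch only: promotion criteria and function names are assumptions.
interface AbConfig {
  enabled: boolean;
  sampleRate: number;
  candidates: Record<string, string>; // primary backend -> candidate backend
}

// Returns the candidate to mirror to, or undefined when this request is not sampled.
function mirrorTarget(ab: AbConfig, primary: string, rand: () => number = Math.random): string | undefined {
  if (!ab.enabled) return undefined;
  const candidate = ab.candidates[primary];
  if (!candidate) return undefined;
  return rand() < ab.sampleRate ? candidate : undefined;
}

interface Stats {
  samples: number;
  successes: number;
  totalLatencyMs: number;
}

function readyToPromote(primary: Stats, candidate: Stats, minSamples = 200): boolean {
  if (candidate.samples < minSamples || primary.samples === 0) return false;
  const successRate = (s: Stats) => s.successes / s.samples;
  const meanLatency = (s: Stats) => s.totalLatencyMs / s.samples;
  return successRate(candidate) >= successRate(primary) && meanLatency(candidate) <= meanLatency(primary);
}
```

Passing `rand` as a parameter keeps the sampling decision testable; in production the default `Math.random` applies. The mirrored call itself would run in the background so that the client only ever waits on the primary.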

Hono runtime, edge-ready

Built on Hono. Runs on Bun, Node, Cloudflare Workers, Deno. Sub-millisecond router overhead. Same code, same behaviour; only the runtime differs.

Typed config, Zod-validated

The policy file is loaded once at startup and validated with Zod. A malformed policy fails fast with a clear message; it never reaches a live request.

Three backends in the box

A registry pattern means a fourth backend is typically sixty lines.

type: ollama

Ollama (local)

On-prem GPU host or laptop. The privacy-pinned destination. Reads families to map qwen-coder, gemma and llama family tags to concrete model tags.

~1800ms (12B on laptop)

type: sarmalink

SarmaLink-AI (cloud)

Hosted multi-provider gateway with web tools and chat memory. The natural cloud fallback when local cannot keep up.

~600ms

type: openai

OpenAI frontier

Frontier model on hard code and reasoning. The escape hatch when local and SarmaLink-AI cannot do the job.

~900ms

Quick start

Clone, install, point an OpenAI client at port 3030, watch the metrics.

01 Clone and install

git clone https://github.com/sarmakska/local-llm-router.git
cd local-llm-router
pnpm install

02 Copy the example policy

cp policy.example.yaml policy.yaml
cp .env.example .env  # OPENAI_API_KEY, SARMALINK_API_KEY, LLR_POLICY, LLR_DB

03 Start the router

pnpm dev   # Hono on :3030

04 Make a request

curl -N http://localhost:3030/v1/chat/completions \
  -H "Authorization: Bearer anything" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Refactor this function..."}],
    "stream": true
  }'

05 Point your OpenAI client at it

from openai import OpenAI
client = OpenAI(base_url="http://localhost:3030/v1", api_key="anything")
client.chat.completions.create(
  model="auto",
  messages=[{"role": "user", "content": "Hi"}]
)

06 Watch the metrics

curl http://localhost:3030/v1/metrics   # JSON
curl http://localhost:3030/metrics     # Prometheus text
curl http://localhost:3030/v1/ab       # A/B candidate report

Environment variables

Five environment variables. Two are paths, two are credentials, one is the port.

Variable	Purpose	Default
LLR_PORT	HTTP port for the Hono server.	3030
LLR_POLICY	Path to the YAML policy file.	./policy.yaml
LLR_DB	Path to the node:sqlite metrics database.	./metrics.db
OPENAI_API_KEY	Frontier backend credential.	unset (disabled)
SARMALINK_API_KEY	SarmaLink-AI backend credential.	unset (disabled)
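
The defaults in the table could be applied with a small loader like the one below. This is a sketch under assumptions: `loadConfig`, `RouterConfig` and `enabledBackends` are hypothetical names, and treating an empty credential the same as an unset one is a choice made for the example.

```typescript
// Sketch only: function and type names are hypothetical; defaults come from the table above.
interface RouterConfig {
  port: number;
  policyPath: string;
  dbPath: string;
  openaiApiKey?: string;
  sarmalinkApiKey?: string;
}

function loadConfig(env: Record<string, string | undefined>): RouterConfig {
  const rawPort = env["LLR_PORT"] ?? "3030";
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`LLR_PORT is not a valid port: ${rawPort}`);
  }
  return {
    port,
    policyPath: env["LLR_POLICY"] ?? "./policy.yaml",
    dbPath: env["LLR_DB"] ?? "./metrics.db",
    // Empty strings are treated as unset, so the backend stays disabled.
    openaiApiKey: env["OPENAI_API_KEY"] || undefined,
    sarmalinkApiKey: env["SARMALINK_API_KEY"] || undefined,
  };
}

// A cloud backend is only usable when its credential is present; local needs none.
function enabledBackends(cfg: RouterConfig): string[] {
  const enabled = ["local"];
  if (cfg.sarmalinkApiKey) enabled.push("sarmalink");
  if (cfg.openaiApiKey) enabled.push("frontier");
  return enabled;
}
```

In Node this would be called as `loadConfig(process.env)`. Taking the environment as an argument keeps the function portable across the runtimes listed above, which expose environment variables in different ways.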

Decision sequence

What actually happens when a request comes in. Sensitivity pin trumps budget. Budget trumps preference.

Decision sequence: classifier produces dimensions, decision engine picks a backend (with privacy pin and latency budget overrides), dispatcher streams, sqlite logs the outcome.

Use cases

What teams actually run this for.

Privacy-sensitive AI products

Healthcare, legal, finance. Sensitive prompts pin to local Ollama. Non-sensitive prompts can reach cloud frontier models. Fail-closed means PII never leaves the network by mistake.

Cost-conscious teams

Trivial prompts go to local or cheap hosted; long-context or hard prompts go to frontier. The audit log shows the savings honestly, per route.

Local-first development

Developers run Ollama locally; production uses cloud. The router is the single point of swap. Same client code, same model: "auto", different policy.

Model migration

Migrate from one provider to another with rolling A/B. Watch quality, latency and cost in the audit log. Promote when ready.

Regional egress control

Some workloads are not allowed to leave a region. Privacy pinning extended with regional metadata keeps those workloads on a regional Ollama.

Edge runtime experiments

Hono runs on Cloudflare Workers and Bun and Deno. Deploy the same router to a different runtime and compare cold-start and overhead numbers in the same metrics store.

local-llm-router vs alternatives

How the router compares to the closest tools in the space. Honest scope-by-scope.

Feature	local-llm-router	LiteLLM	Portkey	OpenRouter	Ollama
OpenAI-compatible (Chat + Responses)	Both APIs	Chat only	Both	Chat only	Chat only
Native local backend (Ollama)	First-class	Supported	Supported	Not in scope	Native
YAML policy with classifier	Built-in	Limited	Config UI	Per-request	No
Privacy pinning, fail-closed	First-class feature	Manual	Manual	N/A	N/A (local only)
Rolling A/B with shadow traffic	Built-in	Bring your own	Paid tier	No	No
Latency-budget fallback	Per-route	No	Manual	Per-request	N/A
Self-hosted, single binary	Yes, Hono + sqlite	Yes	Hosted SaaS	Hosted only	Yes
Licence	MIT	MIT	Commercial	Commercial	MIT

Tech stack

Small surface. Built-in Node primitives where possible.

TypeScript · Node.js 22 · Hono · node:sqlite · Ollama 0.5+ · Zod · YAML · Vitest · Docker · OpenAI SDK compatible

Documentation & guides

Wiki pages cover the architecture, the DSL, the backends, and the deployment story.

Architecture

How the classifier, decision engine, dispatcher and metrics store fit together.

Quick-Start

Tokens, policy file, request, metrics. The fastest path from clone to live.

Policy-DSL

Every match key, every backend type, every route option, with examples.

Backends

Ollama, SarmaLink-AI, OpenAI. Adding a fourth backend in sixty lines.

Privacy-Pinning

How fail-closed is implemented, what it covers, and what it does not.

Metrics-and-AB

The SQLite schema, the JSON and Prometheus endpoints, and the A/B promotion logic.

Deployment

Docker, Bun, Node, Cloudflare Workers, Deno. Same code, four runtimes.

Roadmap

Caching, content-classifier model option, regional pinning, more backends.

Frequently asked

The questions that come up most before adoption.

How is this different from LiteLLM?

LiteLLM is a library and an excellent hosted gateway. local-llm-router is a focused self-hosted proxy where local Ollama is a first-class destination, YAML routing policy is the primary interface, and privacy pinning with fail-closed semantics is built in rather than bolted on. If you need a hosted multi-provider gateway with billing, LiteLLM is the right tool. If you need a small self-hosted router that takes local seriously, this is.

Does the classifier call an LLM?

No. The classifier is deterministic and runs in-process. It uses heuristics over the request body and metadata to produce task, complexity, sensitivity, modality, family and token estimates. Adding a model call would buy small accuracy at large latency cost; we did not make that trade.

What does fail-closed mean for privacy pinning?

A request tagged sensitive is pinned to a local backend. If the local backend returns a non-success or is unreachable, the request fails with an explicit error rather than silently spilling to a cloud fallback. The pin is a hard wall, not a preference.
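
The difference between a hard wall and a preference shows up in the failure path. Below is a minimal sketch of a dispatcher that honours it; `dispatch`, `Plan`, `Send` and `PrivacyPinError` are hypothetical names, and the backend call is abstracted as a function so the example stays self-contained.

```typescript
// Sketch only: names and shapes are assumptions, not the router's actual dispatcher.
class PrivacyPinError extends Error {
  constructor(backend: string, cause: string) {
    super(`privacy-pinned request failed on "${backend}" and was not sent elsewhere: ${cause}`);
    this.name = "PrivacyPinError";
  }
}

interface Plan {
  backend: string;
  fallback?: string;
  pinned: boolean;
}

interface BackendReply {
  ok: boolean;
  status: number;
  body: string;
}

type Send = (backend: string) => Promise<BackendReply>;

async function dispatch(plan: Plan, send: Send): Promise<{ servedBy: string; body: string }> {
  let failure = "";
  try {
    const res = await send(plan.backend);
    if (res.ok) return { servedBy: plan.backend, body: res.body };
    failure = `HTTP ${res.status}`; // non-success counts as failure
  } catch (err) {
    failure = err instanceof Error ? err.message : String(err); // unreachable backend
  }
  // Fail closed: a pinned request never tries the fallback, even if one is configured.
  if (plan.pinned) throw new PrivacyPinError(plan.backend, failure);
  if (!plan.fallback) throw new Error(`backend "${plan.backend}" failed: ${failure}`);
  const res = await send(plan.fallback);
  if (!res.ok) throw new Error(`fallback "${plan.fallback}" failed: HTTP ${res.status}`);
  return { servedBy: plan.fallback, body: res.body };
}
```

The pin check sits before any fallback logic, so a misconfigured route that lists a cloud fallback still cannot leak a pinned request.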

How does latency-budget fallback decide to spill?

The decision engine compares the primary backend's expected latency (live metrics if present, the backend's p50Ms hint otherwise) against the route's latencyBudgetMs. If the primary is over budget and the fallback is faster, the request goes to the fallback. The check happens per-request, never per-pod.
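
That comparison fits in a few lines. The sketch below follows the rule as described (over budget and fallback faster) and uses the `p50Ms` hints from the example policy; the function names, the `liveP50Ms` field, and the explicit `pinned` short-circuit are assumptions for illustration.

```typescript
// Sketch only: names and the liveP50Ms field are hypothetical.
interface BackendLatency {
  p50Ms: number;      // static hint from policy.yaml
  liveP50Ms?: number; // live metric, when one has been recorded
}

interface RoutePlan {
  backend: string;
  fallback?: string;
  latencyBudgetMs?: number;
  pinned?: boolean;
}

function expectedLatency(b: BackendLatency): number {
  return b.liveP50Ms ?? b.p50Ms;
}

// Returns the backend that should serve this request.
function applyLatencyBudget(plan: RoutePlan, backends: Record<string, BackendLatency>): string {
  // Sensitivity pin trumps budget: a pinned request never spills.
  if (plan.pinned || plan.latencyBudgetMs === undefined || !plan.fallback) return plan.backend;
  const primary = backends[plan.backend];
  const fallback = backends[plan.fallback];
  if (!primary || !fallback) return plan.backend;
  const overBudget = expectedLatency(primary) > plan.latencyBudgetMs;
  const fallbackFaster = expectedLatency(fallback) < expectedLatency(primary);
  return overBudget && fallbackFaster ? plan.fallback : plan.backend;
}

// p50Ms hints from the example policy.
const hints: Record<string, BackendLatency> = {
  local: { p50Ms: 1800 },
  sarmalink: { p50Ms: 600 },
  frontier: { p50Ms: 900 },
};
```

With only the shipped hints, the default route (budget 1200 ms) would spill from local at 1800 ms to SarmaLink-AI at 600 ms, while the short-code route (budget 2500 ms) stays local. Once live metrics show local answering in under 1200 ms, the default route stays local too.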

Why node:sqlite instead of an external metrics store?

For per-route success, latency and fallback rate, node:sqlite is enough. It is built into Node 22 (the module was added in 22.5.0), has no separate process, and serialises writes safely. The Prometheus text endpoint is there if you want to scrape into a real metrics stack.

Can I add a new backend?

Yes. Backends are a registry pattern: implement a small interface, register it under a type, expose it in policy.yaml. New backends are typically sixty lines of TypeScript plus a streaming SSE bridge.
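
As an illustration of the registry idea, the sketch below registers factories under a `type` string and builds the outbound request for an OpenAI-compatible endpoint. The interface (`Backend`, `buildRequest`, `registerBackend`, `createBackend`) is hypothetical and much smaller than a real backend, which would also need the streaming bridge. The `/v1/chat/completions` path for Ollama relies on Ollama's OpenAI-compatible API; the repo's Ollama backend may talk to a different endpoint.

```typescript
// Sketch only: the interface and function names are assumptions, not the repo's API.
interface BackendOptions {
  endpoint: string;
  apiKey?: string;
}

interface OutboundRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

interface Backend {
  buildRequest(model: string, messages: unknown[], stream: boolean): OutboundRequest;
}

type BackendFactory = (opts: BackendOptions) => Backend;

const registry = new Map<string, BackendFactory>();

function registerBackend(type: string, factory: BackendFactory): void {
  if (registry.has(type)) throw new Error(`backend type "${type}" is already registered`);
  registry.set(type, factory);
}

function createBackend(type: string, opts: BackendOptions): Backend {
  const factory = registry.get(type);
  if (!factory) throw new Error(`unknown backend type "${type}"`);
  return factory(opts);
}

// One factory covers any backend that speaks the Chat Completions wire format.
function openAiCompatible(path: string): BackendFactory {
  return (opts) => ({
    buildRequest(model, messages, stream) {
      const headers: Record<string, string> = { "Content-Type": "application/json" };
      if (opts.apiKey) headers["Authorization"] = `Bearer ${opts.apiKey}`;
      return { url: `${opts.endpoint}${path}`, headers, body: JSON.stringify({ model, messages, stream }) };
    },
  });
}

registerBackend("ollama", openAiCompatible("/v1/chat/completions"));
registerBackend("openai", openAiCompatible("/chat/completions"));
```

The `type` value in policy.yaml becomes the registry key, so adding a backend under this design means one `registerBackend` call plus whatever request shaping that provider needs.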

Does the router support tools and function calling?

The proxy passes tool definitions through unchanged. Tool execution is the backend's job; the router's job is to route the request to the right backend in the first place.

Is it safe to expose publicly?

It is a router, not a gateway. There is no auth, no rate limiting, no billing. Put it behind your own ingress or an API-gateway layer for those concerns. The wiki has a deployment pattern with Cloudflare Access in front.

Route smart. Keep private private.

Clone the repo, copy the example policy, point an OpenAI client at port 3030, and watch the metrics. MIT licensed.

Related projects

Part of a portfolio of production-shaped open-source repos.

SarmaLink-AI

MCP Server Toolkit

Voice Agent Starter

Agent Orchestrator

AI Eval Runner

StaffPortal