Technical Whitepaper · v1.0

Local LLM Router

OpenAI-compatible proxy that routes between local Ollama and cloud LLMs based on a YAML policy. Privacy pinning. Rolling A/B. Cost optimisation.

MIT Licence · TypeScript · Hono · Ollama-first · Privacy Pinning · YAML Policy

< 1 ms router overhead · YAML policy as code · OpenAI wire-compatible · SQLite audit log

§ Abstract

There is no single static answer to “which LLM should this request use”. It depends on the prompt’s sensitivity, its expected length, the target latency, and the team’s cost posture. Encoding the answer once, in a configuration file that lives in version control, lets every team member reason about and review the policy. Encoding it in application code spreads the same logic across every codebase.

Local LLM Router is the focused proxy that does this one job well. It is OpenAI-compatible on the front, supports Ollama and the major cloud providers on the back, and chooses between them per request based on a YAML policy. Privacy pinning forces sensitive requests local. Rolling A/B routing migrates traffic between models gradually. The audit log records every decision in better-sqlite3 for after-the-fact analysis.

This whitepaper documents the policy engine, the privacy pinning design, the A/B routing algorithm, the audit-log schema, and the operational lessons from running this proxy in front of real applications.

1 · Executive Summary

Local LLM Router is a TypeScript proxy built on Hono. It exposes the OpenAI Chat Completions, Embeddings, and Completions endpoints, plus a small admin API for policy reload and audit-log queries. Inbound requests pass through the policy engine, which selects a destination from the policy file. The dispatcher attempts the destination; on 5xx, it falls through to the configured failover; on success, it streams or returns the response and writes the decision to better-sqlite3.
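
The failover step is small enough to sketch in full. The `Destination` shape and `forward` helper below are illustrative, not the project’s actual internals:

Dispatch with failover (TypeScript sketch)
// Illustrative sketch of failover-on-5xx; Destination and forward()
// are assumed shapes, not the project's actual internals.
interface Destination {
  provider: string;   // "ollama", "openai", "anthropic", ...
  model: string;
  baseUrl: string;
}

async function forward(dest: Destination, body: Record<string, unknown>): Promise<Response> {
  return fetch(`${dest.baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ ...body, model: dest.model }),
  });
}

// Try the selected destination first; on a 5xx, fall through to the
// configured failover. Anything below 500 (including 4xx) is returned
// as-is: a client error will not succeed elsewhere.
async function dispatch(dests: Destination[], body: Record<string, unknown>): Promise<Response> {
  let last: Response | undefined;
  for (const dest of dests) {
    const res = await forward(dest, body);
    if (res.status < 500) return res;
    last = res;
  }
  return last ?? new Response("no destination configured", { status: 502 });
}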

Privacy pinning is implemented as a non-overridable rule: if a request is marked sensitive (by header, by tag, or by a content classifier), only local destinations are eligible, and if no local destination is available, the request fails closed. This is the right shape for regulated industries: silent fallback to cloud is the failure mode, not the feature.

A/B routing is implemented as a weighted random selection within a policy rule. A new model can be ramped from 10% to 50% to 100% by editing the policy. The audit log records which model was selected for every request, so the team can compare quality and cost honestly.

2 · Background

The LLM gateway space in 2026 is mature. OpenRouter, Portkey, LiteLLM proxy, and others handle the cloud routing problem with varying degrees of polish. Ollama is the dominant local LLM runtime. The gap is in the join: a proxy that treats both as first-class destinations and lets the team express the routing logic in a reviewable artefact.

Most teams writing this themselves end up with branching logic inside their application. Some have a config object; some have a feature flag service; some have a hand-rolled middleware. The shared property of all these solutions is that the routing logic is hard to review and harder to migrate. Pulling it into a YAML file outside the application is the simplification this project is built around.

3 · Problem in detail

Per-request routing decisions

The right model is not a per-application decision. It is a per-request decision. A short prompt with no PII can go to a fast cheap model. A long prompt about a regulated topic must go local. A development request can A/B between candidates. Static configuration cannot express this. Application logic can but should not.

Privacy must be non-overridable

The classic failure of cloud-leaning gateways is that a transient outage of the local destination triggers a silent fallback to cloud, and a sensitive request leaks. The privacy pinning in Local LLM Router is fail-closed: if a sensitive request cannot reach a local destination, the request returns an error; it never reaches a cloud destination.
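
The shape matters enough to sketch: the pin is a filter over the destination list, and an empty result is an error, never a widened list. Names below are illustrative:

Fail-closed pin check (TypeScript sketch)
// Illustrative names; the real pin check lives in the dispatcher.
interface Destination { provider: string; model: string }

function eligible(dests: Destination[], pin?: "local"): Destination[] {
  if (pin !== "local") return dests;
  return dests.filter((d) => d.provider === "ollama"); // local runtime only
}

// A pinned request with no eligible destination fails closed with a 503.
// The list is never widened back to cloud, healthy or not.
function resolveOrFail(dests: Destination[], pin?: "local"): Destination[] {
  const ok = eligible(dests, pin);
  if (ok.length === 0) {
    throw Object.assign(new Error("pinned local; no local destination available"), { status: 503 });
  }
  return ok;
}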

A/B without application changes

A team that wants to migrate from Provider A to Provider B should not have to deploy code to ramp traffic. The router does this by weighted random selection within a policy rule; the application sees the same OpenAI-compatible endpoint throughout.
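
The selection itself is a few lines; a sketch, with an illustrative `Weighted` shape:

Weighted destination selection (TypeScript sketch)
// The Weighted shape is illustrative.
interface Weighted<T> { value: T; weight: number }

function pickWeighted<T>(options: Weighted<T>[]): T {
  const total = options.reduce((sum, o) => sum + o.weight, 0);
  let roll = Math.random() * total;
  for (const o of options) {
    roll -= o.weight;
    if (roll < 0) return o.value;
  }
  return options[options.length - 1].value; // floating-point guard
}

// Ramping a migration is a policy edit, not a deploy:
// weights move 90/10 → 50/50 → 0/100 across reviewed PRs.
const model = pickWeighted([
  { value: "gpt-4o-mini", weight: 70 },
  { value: "llama-3.3-70b", weight: 30 },
]);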

4 · Goals + non-goals

Goals

  • OpenAI-compatible API. Drop-in replacement for the OpenAI base URL.
  • Ollama as a first-class destination.
  • YAML policy with match-by-model, by-tag, by-size, by-classifier rules.
  • Privacy pinning that fails closed.
  • Rolling A/B routing inside a rule.
  • Audit log of every routing decision.
  • Sub-millisecond router overhead.

Non-goals

  • Hosted gateway. Run it yourself.
  • Prompt management. Prompts live in your application.
  • Caching. There are good caches; this project is not one.
  • Embedding training. Use a tool that is purpose-built.

5 · Architecture

One Hono process. The hot path is: parse request, evaluate policy, check pin, dispatch, stream response, write audit row. Each step is small and individually tested.

Policy schema (simplified YAML)
rules:
  - name: "sensitive → local only"
    when:
      headers.x-sensitive: "true"
    pin: local
    destinations:
      - ollama: "llama3.3:70b"

  - name: "short prompts → fast cloud"
    when:
      input.tokens: { lt: 500 }
    destinations:
      - openai: "gpt-4o-mini"   # 70 % weight
        weight: 70
      - groq: "llama-3.3-70b"   # 30 % weight
        weight: 30

  - name: "long prompts → frontier"
    when:
      input.tokens: { gte: 500 }
    destinations:
      - anthropic: "claude-haiku-4"
      - openrouter: "fallback"

The policy file is reloaded on signal (SIGHUP) without restarting the process. Validation is via Zod; an invalid policy is rejected and the previous policy stays active.
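
A sketch of that reload path, assuming the `yaml` npm package and a Zod `PolicySchema` (a minimal version is sketched under “YAML, not JSON” below):

Policy hot reload (TypeScript sketch)
// Assumes the "yaml" npm package and a Zod PolicySchema.
import { readFileSync } from "node:fs";
import { parse } from "yaml";
import { PolicySchema, type Policy } from "./policy-schema";

let activePolicy: Policy = PolicySchema.parse(parse(readFileSync("./policy.yaml", "utf8")));

process.on("SIGHUP", () => {
  try {
    const doc = parse(readFileSync("./policy.yaml", "utf8"));
    const result = PolicySchema.safeParse(doc);
    if (result.success) {
      activePolicy = result.data; // one assignment: the swap is atomic
      console.log("policy reloaded");
      return;
    }
    console.error("policy rejected, keeping previous:", result.error.message);
  } catch (err) {
    // Unreadable file or broken YAML: same outcome, previous policy stays.
    console.error("policy reload failed, keeping previous:", err);
  }
});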

6 · Key technical decisions

Hono, not Express

Hono is fast, runs everywhere (Bun, Node, Cloudflare Workers, Deno), and has a clean middleware story. The router’s overhead is below one millisecond on Bun.

better-sqlite3, not Postgres

The audit log is write-heavy on a single host. better-sqlite3 with WAL mode handles tens of thousands of writes per second on a laptop. Operationally, the router is one process and one SQLite file. That simplicity is a feature.
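
A sketch of the write path; the column set is illustrative, not the project’s actual schema:

Audit write path (TypeScript sketch)
// The column set is illustrative, not the project's actual schema.
import Database from "better-sqlite3";

const db = new Database("audit.db");
db.pragma("journal_mode = WAL"); // concurrent readers, fast sequential writes

db.exec(`CREATE TABLE IF NOT EXISTS decisions (
  id         INTEGER PRIMARY KEY,
  ts         INTEGER NOT NULL,
  rule       TEXT    NOT NULL,
  provider   TEXT    NOT NULL,
  model      TEXT    NOT NULL,
  pinned     INTEGER NOT NULL,
  status     INTEGER NOT NULL,
  latency_ms REAL    NOT NULL
)`);

const insert = db.prepare(`
  INSERT INTO decisions (ts, rule, provider, model, pinned, status, latency_ms)
  VALUES (?, ?, ?, ?, ?, ?, ?)`);

// One synchronous prepared-statement call per request.
export function logDecision(
  rule: string, provider: string, model: string,
  pinned: boolean, status: number, latencyMs: number,
): void {
  insert.run(Date.now(), rule, provider, model, pinned ? 1 : 0, status, latencyMs);
}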

YAML, not JSON

Policy files are read by humans. YAML’s comment support and lighter syntax make policies more legible. We validate strictly with Zod after parsing; the runtime model is JSON.
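
A minimal sketch of what that validation can look like; the real schema is necessarily richer (classifier rules, size matchers, failover ordering):

Policy validation (TypeScript sketch)
// Minimal shape only; the real schema is richer.
import { z } from "zod";

// One provider key plus an optional numeric weight,
// e.g. { openai: "gpt-4o-mini", weight: 70 }.
const DestinationSchema = z.record(z.string(), z.union([z.string(), z.number()]));

const RuleSchema = z.object({
  name: z.string(),
  when: z.record(z.string(), z.unknown()).optional(),
  pin: z.literal("local").optional(),
  destinations: z.array(DestinationSchema).min(1),
});

export const PolicySchema = z.object({ rules: z.array(RuleSchema).min(1) });
export type Policy = z.infer<typeof PolicySchema>;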

Fail closed on privacy

A pinned request that cannot reach a pinned destination returns an error. Silent cloud fallback would be wrong; we make it impossible.

7 · Implementation milestones

Milestone 1 · OpenAI-compatible passthrough

Hono server, parse OpenAI requests, forward to a single configured destination, stream the response. Acceptance: an existing application that targets the OpenAI API works against the router after nothing more than a base-URL change.
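
A sketch of the passthrough; `OLLAMA_BASE` is an assumed configuration key, and Ollama’s OpenAI-compatible endpoint under /v1 is the single destination:

Single-destination passthrough (TypeScript sketch)
// OLLAMA_BASE is an assumed configuration key.
import { Hono } from "hono";

const UPSTREAM = process.env.OLLAMA_BASE ?? "http://localhost:11434";
const app = new Hono();

app.post("/v1/chat/completions", async (c) => {
  const upstream = await fetch(`${UPSTREAM}/v1/chat/completions`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(await c.req.json()),
  });
  // Hand the upstream body straight back; when the client asked for
  // streaming, the SSE chunks pass through without buffering.
  return new Response(upstream.body, {
    status: upstream.status,
    headers: { "content-type": upstream.headers.get("content-type") ?? "application/json" },
  });
});

export default app; // Bun picks up the default export's fetch handler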

Milestone 2 · policy engine

Zod-validated YAML policy, expression engine for match conditions, weighted destination selection. Acceptance: rules of the form “short prompts to provider A” route correctly, with property-based tests for the matcher.
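
A sketch of the matcher, assuming a condition grammar that mirrors the simplified policy in the architecture section ({ lt, gte } on numbers, exact string match otherwise); the real expression engine is richer:

Match-condition evaluation (TypeScript sketch)
// The condition grammar is an assumption mirroring the policy above.
type Comparator = { lt?: number; gte?: number };
type Condition = string | Comparator;

function matches(actual: unknown, cond: Condition): boolean {
  if (typeof cond === "string") return actual === cond;
  if (cond.lt !== undefined && !(Number(actual) < cond.lt)) return false;
  if (cond.gte !== undefined && !(Number(actual) >= cond.gte)) return false;
  return true;
}

// A rule matches when every `when` entry matches the request context.
function ruleMatches(when: Record<string, Condition>, ctx: Record<string, unknown>): boolean {
  return Object.entries(when).every(([key, cond]) => matches(ctx[key], cond));
}

// ruleMatches({ "input.tokens": { lt: 500 } }, { "input.tokens": 120 }) === true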

Milestone 3 · privacy pinning

Pin field on rules, fail-closed dispatcher, header-driven and content-classifier-driven pinning. Acceptance: a sensitive request never reaches a non-pinned destination, even when the local destination is unhealthy.

Milestone 4 · audit log + viewer

better-sqlite3 schema, write path, admin queries. Small admin UI for log inspection. CSV export for offline analysis.
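
The comparison that motivates the log is a single query. A sketch, reusing the illustrative column names from the write-path sketch in the previous section:

A/B comparison query (TypeScript sketch)
// Column names follow the illustrative write-path schema above.
import Database from "better-sqlite3";

const db = new Database("audit.db", { readonly: true });

// Per-model traffic share, latency, and upstream errors for the last 24 h:
// the honest comparison the A/B ramp exists to enable.
const rows = db.prepare(`
  SELECT model,
         COUNT(*)           AS requests,
         AVG(latency_ms)    AS avg_latency_ms,
         SUM(status >= 500) AS upstream_errors
  FROM decisions
  WHERE ts > ?
  GROUP BY model
  ORDER BY requests DESC
`).all(Date.now() - 24 * 60 * 60 * 1000);

console.table(rows);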

8 · Lessons / honest limits

Lessons

  • Policy as code is the win. The PRs that change routing become legible. Reviewers can disagree before traffic moves.
  • Fail-closed privacy is the only correct default. Silent cloud fallback would be the bug everyone discovers in production.
  • SQLite is a serious database. The audit-log writes are not the bottleneck; the model calls are.

Honest limits

  • Single-instance. The audit log is local. For multi-instance deployments, ship the SQLite file or write to a shared destination via the OTel collector.
  • No caching. Out of scope. Pair with a separate cache.
  • Content classifier is opt-in. The built-in classifier is a small Llama 3.2 model on Ollama; teams that want a stronger one can plug in their own (see the sketch after this list).
  • Embedding endpoints are basic. Routing for embeddings is by model name and request size. Privacy pinning applies.
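
For illustration, a classifier along these lines can be a single call to Ollama’s /api/generate endpoint; the prompt, the YES/NO thresholding, and the fail-closed handling of classifier errors below are assumptions, not the shipped behaviour:

Opt-in sensitivity classifier (TypeScript sketch)
// Calls a small local model through Ollama's /api/generate endpoint.
// Prompt, threshold, and error handling are illustrative.
const OLLAMA = "http://localhost:11434";

export async function isSensitive(text: string): Promise<boolean> {
  const res = await fetch(`${OLLAMA}/api/generate`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2",
      stream: false,
      prompt: "Answer only YES or NO. Does the following text contain " +
              `personal, medical, legal, or financial information?\n\n${text}`,
    }),
  });
  if (!res.ok) return true; // classifier down: treat as sensitive, fail closed
  const { response } = (await res.json()) as { response: string };
  return response.trim().toUpperCase().startsWith("YES");
}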

9 · Conclusion

Local LLM Router is what an LLM gateway looks like when it does one job well. OpenAI-compatible on the front; Ollama and cloud on the back; YAML policy in the middle; SQLite for audit. Every routing decision is reviewable, version-controlled, and recorded. Privacy is non-overridable. A/B is rampable.

The repository is MIT licensed. The wiki contains policy patterns for the common scenarios (privacy pinning, A/B rollout, cost optimisation, model migration), an Ollama hardening note, and a Cloudflare Workers deployment recipe.

Local LLM Router · Built by Sarma Linux · MIT licensed