How Local LLM Router works
A complete tour of the architecture: data flow, subsystems, technology choices, performance, and where the project is heading next.
A small Hono server speaks the OpenAI Chat Completions wire format. Each request runs through a YAML policy engine that picks a destination by rule. Sensitive requests are pinned to local Ollama. A/B rollout is weighted random within a rule. The dispatcher tries the destination, falls through on 5xx, streams the response, and writes the decision to better-sqlite3.
Core data flow
From the moment a request enters the system to the moment a response leaves it.
Inbound: POST /v1/chat/completions
│ body { model, messages, stream, ... }
▼
Parse + validate (Zod)
│
▼
Compute features
├─ token estimate
├─ headers (sensitivity, tenant, request id)
└─ optional content classifier (Ollama Llama 3.2)
│
▼
Policy engine
│ evaluate rules in order
│ first match wins
▼
Rule selected: { pin?, destinations[ {target, weight, fallback} ] }
│
▼
Privacy pin check
├─ pinned + local available → continue
└─ pinned + local unavailable → 503 fail-closed
│
▼
Weighted random destination
│
▼
Dispatcher
├─ HTTP call to Ollama or cloud provider
├─ on 5xx: try next destination in chain
└─ stream response back to client
│
▼
Audit log (better-sqlite3)
│ request_hash, rule, destination, latency, cost, outcome
Each subsystem, deep-dived
Every component in the data flow above, opened up and explained.
Policy engine
YAML in, JSON model out. A policy is an ordered list of rules. Each rule has a when clause, an optional pin, and a list of destinations. The first matching rule wins.
The when clause is a small expression DSL. Supported operators are equality, lt/gt/lte/gte, and a regex match. References allowed: headers.*, input.tokens, input.contains, classifier.label, tags.*. The DSL is intentionally tiny: a Zod-typed AST, no eval, no shell-out. Untrusted policies are still safe.
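To make the shape concrete, here is a minimal sketch of what the parsed policy model could look like as a Zod schema. The field names mirror the description above (when, pin, destinations, target, weight, fallback); the operator names and exact structure are illustrative, not the project's actual schema.

```ts
// Sketch of a Zod-typed policy model; names are illustrative.
import { z } from "zod";

// A when-clause condition: a reference, an operator, and a literal value.
const Condition = z.object({
  ref: z.string(),                                  // e.g. "headers.x-tenant", "input.tokens"
  op: z.enum(["eq", "lt", "gt", "lte", "gte", "matches"]), // "matches" = regex match
  value: z.union([z.string(), z.number(), z.boolean()]),
});

const Destination = z.object({
  target: z.string(),                               // e.g. "ollama/llama3.2" (naming illustrative)
  weight: z.number().positive().default(1),
  fallback: z.boolean().default(false),
});

const Rule = z.object({
  name: z.string(),
  when: z.array(Condition),                         // all conditions must hold
  pin: z.literal("local").optional(),               // pin: local => fail-closed routing
  destinations: z.array(Destination).min(1),
});

export const Policy = z.object({ rules: z.array(Rule) });

// Parsing rejects unknown operators and malformed rules before they go live:
// const policy = Policy.parse(yamlDocumentParsedToJson);
```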
Privacy pinning
A rule with pin: local restricts the destination to local Ollama. The dispatcher refuses to fall through to cloud destinations, even if the local one returns 5xx. The request fails closed with a 503. This is the only safe behaviour for sensitive traffic.
Sensitivity can come from three sources: an x-sensitive: true header from the application, a tags array passed in the request body, or the optional content classifier. The classifier is a small Llama 3.2 model running on the same Ollama instance; running it costs around 80 milliseconds and is gated behind a policy flag for teams that want it.
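A sketch of the fail-closed check, assuming a liveness probe against Ollama's /api/tags endpoint; the helper name, target naming scheme, and error shape are illustrative.

```ts
// Sketch of the fail-closed privacy pin: a pinned rule never falls through to cloud.
type Destination = { target: string; weight: number; fallback?: boolean };
type Rule = { pin?: "local"; destinations: Destination[] };

async function isLocalAvailable(baseUrl = "http://localhost:11434"): Promise<boolean> {
  try {
    const res = await fetch(baseUrl + "/api/tags"); // Ollama liveness probe
    return res.ok;
  } catch {
    return false;
  }
}

async function resolveDestinations(rule: Rule): Promise<Destination[]> {
  if (rule.pin === "local") {
    if (!(await isLocalAvailable())) {
      // Fail closed: sensitive traffic never leaves the host.
      throw Object.assign(new Error("local destination unavailable"), { status: 503 });
    }
    // Target naming is illustrative; only local destinations remain eligible.
    return rule.destinations.filter((d) => d.target.startsWith("ollama/"));
  }
  return rule.destinations;
}
```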
A/B routing
A rule with multiple destinations and per-destination weights distributes traffic randomly with the configured weights. The seed for the random choice is the request id, so identical requests within a single client session are routed consistently when the client cooperates. The audit log records which destination was chosen, so you can compare quality and cost across the A and B groups by SQL query.
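A sketch of the seeded weighted pick. The hash function and seeding scheme are illustrative; the point is that the same request id maps to the same destination.

```ts
// Sketch of seeded weighted selection across a rule's destinations.
type Destination = { target: string; weight: number };

// FNV-1a hash of the request id, mapped into [0, 1).
function seededUnit(requestId: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < requestId.length; i++) {
    h ^= requestId.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 0x100000000;
}

function pickDestination(destinations: Destination[], requestId: string): Destination {
  const total = destinations.reduce((sum, d) => sum + d.weight, 0);
  let point = seededUnit(requestId) * total;
  for (const d of destinations) {
    point -= d.weight;
    if (point <= 0) return d;
  }
  return destinations[destinations.length - 1]; // guard against float drift
}
```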
Dispatcher
The dispatcher takes a destination and a request, produces a response. For Ollama destinations, it calls the local Ollama HTTP API; for cloud destinations, it calls the provider’s OpenAI-compatible (or native) endpoint. Streaming is preserved end-to-end: the upstream SSE chunks are forwarded to the client as they arrive.
On a 5xx response from the destination, the dispatcher tries the next entry in the rule’s destinations list, unless the rule is pinned. On a 4xx, the error is returned to the client as-is; this is a client error, not a routing error.
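A sketch of the fallback chain, assuming OpenAI-compatible upstream endpoints; base URLs and the exact behaviour when a pinned destination returns 5xx are illustrative. Returning the upstream Response object untouched is what keeps streaming end-to-end.

```ts
// Sketch of the dispatcher loop: 4xx returns as-is, 5xx moves to the next destination.
type Destination = { target: string; baseUrl: string; apiKey?: string };

async function dispatch(destinations: Destination[], body: unknown, pinned: boolean): Promise<Response> {
  let last: Response | undefined;
  for (const dest of destinations) {
    const res = await fetch(dest.baseUrl + "/v1/chat/completions", {
      method: "POST",
      headers: {
        "content-type": "application/json",
        ...(dest.apiKey ? { authorization: `Bearer ${dest.apiKey}` } : {}),
      },
      body: JSON.stringify(body),
    });
    if (res.status < 500) return res; // success or client error: forward as-is, stream intact
    last = res;
    if (pinned) break;                // pinned rules never fall through to the next entry
  }
  return last ?? new Response("no destination available", { status: 503 });
}
```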
Audit log
better-sqlite3 with WAL mode. One row per request: id, ts, request_hash, rule, destination, latency_ms, cost, status, tokens_in, tokens_out, tags. The request hash is a fingerprint of model + messages + temperature; it lets you match repeat requests without storing the full prompt.
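A sketch of the table and the fingerprint, following the column list above; the file path and hash recipe details are illustrative.

```ts
// Sketch of the audit log: WAL mode, one prepared insert, a prompt fingerprint.
import Database from "better-sqlite3";
import { createHash } from "node:crypto";

const db = new Database("audit.db");
db.pragma("journal_mode = WAL");

db.exec(`
  CREATE TABLE IF NOT EXISTS audit_log (
    id           TEXT PRIMARY KEY,
    ts           INTEGER NOT NULL,
    request_hash TEXT NOT NULL,
    rule         TEXT,
    destination  TEXT,
    latency_ms   INTEGER,
    cost         REAL,
    status       INTEGER,
    tokens_in    INTEGER,
    tokens_out   INTEGER,
    tags         TEXT
  )
`);

const insert = db.prepare(`
  INSERT INTO audit_log (id, ts, request_hash, rule, destination, latency_ms, cost, status, tokens_in, tokens_out, tags)
  VALUES (@id, @ts, @request_hash, @rule, @destination, @latency_ms, @cost, @status, @tokens_in, @tokens_out, @tags)
`);

// Fingerprint of model + messages + temperature: matches repeat requests without storing the prompt.
const requestHash = (r: { model: string; messages: unknown; temperature?: number }) =>
  createHash("sha256").update(JSON.stringify([r.model, r.messages, r.temperature ?? null])).digest("hex");
```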
The admin API exposes paginated queries against the log, plus a CSV export. A small HTMX UI is included for ops users; engineers can also issue SQL directly against the SQLite file.
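The paginated queries reduce to plain SQL against the same file. A sketch, with an illustrative page size, of the kind of query the admin API runs:

```ts
// Sketch of a paginated read of the audit log, opened read-only.
import Database from "better-sqlite3";

const db = new Database("audit.db", { readonly: true });
const page = db.prepare(`
  SELECT id, ts, rule, destination, latency_ms, cost, status
  FROM audit_log
  ORDER BY ts DESC
  LIMIT ? OFFSET ?
`);
const rows = page.all(50, 0); // first page, 50 rows per page
```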
OpenAI compatibility surface
Implemented endpoints: POST /v1/chat/completions (streaming and non-streaming), POST /v1/embeddings, POST /v1/completions, GET /v1/models. Authentication is a Bearer token; tokens are configured in the policy file with optional per-token routing rules. Rate limiting is per-token, default off.
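Because the surface is OpenAI-compatible, any existing client can be pointed at the router by swapping the base URL. A sketch; the URL, token, and model name are placeholders.

```ts
// Sketch of a client request against the router's Chat Completions endpoint.
const res = await fetch("http://localhost:8787/v1/chat/completions", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    authorization: "Bearer <router-token>",
    "x-sensitive": "true", // pins this request to local Ollama via the policy
  },
  body: JSON.stringify({
    model: "llama3.2",
    messages: [{ role: "user", content: "Summarise this contract clause." }],
    stream: true,
  }),
});
```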
Hot-reload and deployment
The policy file is reloaded on SIGHUP. Invalid policies are rejected and the previous policy stays active. Deployment options: Bun on a VM, Node on a VM, Docker, Cloudflare Workers (HTTP transport only). The audit log on Workers is shipped to D1; the wiki has the recipe.
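A sketch of the reload path, assuming the yaml package for parsing and a stand-in for the real policy schema; validate first, swap only on success, keep the previous policy otherwise.

```ts
// Sketch of SIGHUP hot-reload with fail-safe validation.
import { readFileSync } from "node:fs";
import { parse } from "yaml";
import { z } from "zod";

const Policy = z.object({ rules: z.array(z.unknown()) }); // stand-in for the real schema

let activePolicy = Policy.parse(parse(readFileSync("policy.yaml", "utf8")));

process.on("SIGHUP", () => {
  try {
    const next = Policy.parse(parse(readFileSync("policy.yaml", "utf8")));
    activePolicy = next; // atomic swap: readers see old or new, never a partial policy
  } catch (err) {
    console.error("policy reload rejected, keeping previous policy", err);
  }
});
```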
Why this stack
The road not taken matters as much as the road taken. Here is what was picked, why, and what was rejected and why.
Hono
Fast, runs everywhere we want to deploy, clean middleware story.
Express: slower, ecosystem oriented toward older Node, no Workers story.
Bun (preferred)
Fastest TS runtime in 2026 for this kind of work. Uses better-sqlite3 natively. Boot time near zero.
Node: also supported, slightly slower. Workers: supported, but no SQLite, so the audit log ships to D1.
better-sqlite3 + WAL
Single-file audit log with serious throughput. No server. Simple ops.
Postgres: heavyweight for this audit-log shape on a single host.
YAML policy
Human-readable, comment-friendly, reviewable in PRs. Validated strictly after parsing.
JSON: works, but less legible. TOML: fine, but the team is more familiar with YAML.
Zod
Type-safe runtime validation for the policy file and the request bodies.
Hand-rolled validation: lossy types, easy to drift.
Ollama as first-class destination
Ollama is the dominant local LLM runtime. Treat it like the cloud providers, not as a side feature.
Local-only adapters bolted on: would have signalled second-class support.
Fail-closed privacy
Silent cloud fallback for pinned requests would be a footgun.
Best-effort pinning: the wrong default for regulated industries.
Performance & observability
Router overhead is below one millisecond on Bun for non-classifier rules. Adding the content classifier adds 60–100 milliseconds for the small Llama 3.2 model on a CPU host; on a GPU host it is below 30 milliseconds. The classifier is opt-in per rule.
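For orientation, the classifier is a single short Ollama call per request. A sketch against Ollama's /api/generate endpoint; the prompt, label set, and model name are illustrative.

```ts
// Sketch of the opt-in content classifier: one non-streaming Ollama call.
async function classify(text: string): Promise<"sensitive" | "normal"> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2",
      prompt: `Answer with exactly one word, "sensitive" or "normal":\n\n${text}`,
      stream: false,
    }),
  });
  const { response } = (await res.json()) as { response: string };
  return response.trim().toLowerCase().startsWith("sensitive") ? "sensitive" : "normal";
}
```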
Throughput is bound by the destination. Ollama on a serious GPU host streams as fast as the model produces tokens. Cloud destinations stream at whatever rate they offer. The router does not buffer; the SSE chunks are forwarded as they arrive.
Audit-log writes are async to the response. The SQLite WAL handles tens of thousands of inserts per second on commodity hardware. The audit log is rarely the bottleneck; if it ever is, the writes can be batched.
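If batching ever becomes necessary, it amounts to queueing rows and flushing them in one transaction. A sketch with an abbreviated row shape and an illustrative flush interval.

```ts
// Sketch of batched audit writes: one transaction per flush.
import Database from "better-sqlite3";

const db = new Database("audit.db");
db.pragma("journal_mode = WAL");

type Row = { id: string; ts: number; request_hash: string; status: number };

const insert = db.prepare(
  "INSERT INTO audit_log (id, ts, request_hash, status) VALUES (@id, @ts, @request_hash, @status)"
);
const insertMany = db.transaction((rows: Row[]) => {
  for (const row of rows) insert.run(row);
});

const queue: Row[] = [];
setInterval(() => {
  if (queue.length === 0) return;
  insertMany(queue.splice(0, queue.length)); // drain the queue in a single transaction
}, 250);
```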
Where it is heading
- Embedding-based classifier as an optional alternative to the small-LLM classifier.
- Federated audit log via OTel collector for multi-instance deploys.
- Cost-aware routing that considers per-token price plus expected output length.
- Latency-aware routing that moves on to the next destination if the first does not produce a token within N milliseconds.
- Admin UI improvements: a richer SQL playground, per-tenant filters, scheduled exports.
Read the full whitepaper for the formal technical write-up.