How Local LLM Router works
A complete tour of the architecture: data flow, subsystems, technology choices, performance, and where the project is heading next.
A small Hono server speaks the OpenAI Chat Completions wire format. Each request runs through a YAML policy engine that picks a destination by rule. Sensitive requests are pinned to local Ollama. A/B rollout is weighted random within a rule. The dispatcher tries the destination, falls through on 5xx, streams the response, and writes the decision to better-sqlite3.
Core data flow
From the moment a request enters the system to the moment a response leaves it.
Inbound: POST /v1/chat/completions
│ body { model, messages, stream, ... }
▼
Parse + validate (Zod)
│
▼
Compute features
├─ token estimate
├─ headers (sensitivity, tenant, request id)
└─ optional content classifier (Ollama Llama 3.2)
│
▼
Policy engine
│ evaluate rules in order
│ first match wins
▼
Rule selected: { pin?, destinations[ {target, weight, fallback} ] }
│
▼
Privacy pin check
├─ pinned + local available → continue
└─ pinned + local unavailable → 503 fail-closed
│
▼
Weighted random destination
│
▼
Dispatcher
├─ HTTP call to Ollama or cloud provider
├─ on 5xx: try next destination in chain
└─ stream response back to client
│
▼
Audit log (better-sqlite3)
│ request_hash, rule, destination, latency, cost, outcome
Each subsystem, deep-dived
Every component in the data flow above, opened up and explained.
Policy engine
YAML in, JSON model out. A policy is an ordered list of rules. Each rule has a when clause, an optional pin, and a list of destinations. The first matching rule wins.
The when clause is a small expression DSL. Supported operators are equality, lt/gt/lte/gte, and a regex match. References allowed: headers.*, input.tokens, input.contains, classifier.label, tags.*. The DSL is intentionally tiny: a Zod-typed AST, no eval, no shell-out. Untrusted policies are still safe.
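To make the shape concrete, here is a minimal sketch of what the parsed policy model could look like as a Zod schema. The field names mirror the description above (when, pin, destinations, target, weight, fallback); the operator names and exact structure are illustrative, not the project's actual schema.

```ts
// Sketch of a Zod-typed policy model; names are illustrative.
import { z } from "zod";

// A when-clause condition: a reference, an operator, and a literal value.
const Condition = z.object({
  ref: z.string(),                                  // e.g. "headers.x-tenant", "input.tokens"
  op: z.enum(["eq", "lt", "gt", "lte", "gte", "matches"]), // "matches" = regex match
  value: z.union([z.string(), z.number(), z.boolean()]),
});

const Destination = z.object({
  target: z.string(),                               // e.g. "ollama/llama3.2" (naming illustrative)
  weight: z.number().positive().default(1),
  fallback: z.boolean().default(false),
});

const Rule = z.object({
  name: z.string(),
  when: z.array(Condition),                         // all conditions must hold
  pin: z.literal("local").optional(),               // pin: local => fail-closed routing
  destinations: z.array(Destination).min(1),
});

export const Policy = z.object({ rules: z.array(Rule) });

// Parsing rejects unknown operators and malformed rules before they go live:
// const policy = Policy.parse(yamlDocumentParsedToJson);
```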
Privacy pinning
A rule with pin: local restricts the destination to local Ollama. The dispatcher refuses to fall through to cloud destinations, even if the local one returns 5xx. The request fails closed with a 503. This is the only safe behaviour for sensitive traffic.
Sensitivity can come from three sources: an x-sensitive: true header from the application, a tags array passed in the request body, or the optional content classifier. The classifier is a small Llama 3.2 model running on the same Ollama instance; running it costs around 80 milliseconds and is gated behind a policy flag for teams that want it.
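A sketch of the fail-closed check, assuming a liveness probe against Ollama's /api/tags endpoint; the helper name, target naming scheme, and error shape are illustrative.

```ts
// Sketch of the fail-closed privacy pin: a pinned rule never falls through to cloud.
type Destination = { target: string; weight: number; fallback?: boolean };
type Rule = { pin?: "local"; destinations: Destination[] };

async function isLocalAvailable(baseUrl = "http://localhost:11434"): Promise<boolean> {
  try {
    const res = await fetch(baseUrl + "/api/tags"); // Ollama liveness probe
    return res.ok;
  } catch {
    return false;
  }
}

async function resolveDestinations(rule: Rule): Promise<Destination[]> {
  if (rule.pin === "local") {
    if (!(await isLocalAvailable())) {
      // Fail closed: sensitive traffic never leaves the host.
      throw Object.assign(new Error("local destination unavailable"), { status: 503 });
    }
    // Target naming is illustrative; only local destinations remain eligible.
    return rule.destinations.filter((d) => d.target.startsWith("ollama/"));
  }
  return rule.destinations;
}
```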
A/B routing
A rule with multiple destinations and per-destination weights distributes traffic randomly with the configured weights. The seed for the random choice is the request id, so identical requests within a single client session are routed consistently when the client cooperates. The audit log records which destination was chosen, so you can compare quality and cost across the A and B groups by SQL query.
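A sketch of the seeded weighted pick. The hash function and seeding scheme are illustrative; the point is that the same request id maps to the same destination.

```ts
// Sketch of seeded weighted selection across a rule's destinations.
type Destination = { target: string; weight: number };

// FNV-1a hash of the request id, mapped into [0, 1).
function seededUnit(requestId: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < requestId.length; i++) {
    h ^= requestId.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 0x100000000;
}

function pickDestination(destinations: Destination[], requestId: string): Destination {
  const total = destinations.reduce((sum, d) => sum + d.weight, 0);
  let point = seededUnit(requestId) * total;
  for (const d of destinations) {
    point -= d.weight;
    if (point <= 0) return d;
  }
  return destinations[destinations.length - 1]; // guard against float drift
}
```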
Dispatcher
The dispatcher takes a destination and a request, produces a response. For Ollama destinations, it calls the local Ollama HTTP API; for cloud destinations, it calls the provider’s OpenAI-compatible (or native) endpoint. Streaming is preserved end-to-end: the upstream SSE chunks are forwarded to the client as they arrive.
On a 5xx response from the destination, the dispatcher tries the next entry in the rule’s destinations list, unless the rule is pinned. On a 4xx, the error is returned to the client as-is; this is a client error, not a routing error.
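A sketch of the fallback chain, assuming OpenAI-compatible upstream endpoints; base URLs and the exact behaviour when a pinned destination returns 5xx are illustrative. Returning the upstream Response object untouched is what keeps streaming end-to-end.

```ts
// Sketch of the dispatcher loop: 4xx returns as-is, 5xx moves to the next destination.
type Destination = { target: string; baseUrl: string; apiKey?: string };

async function dispatch(destinations: Destination[], body: unknown, pinned: boolean): Promise<Response> {
  let last: Response | undefined;
  for (const dest of destinations) {
    const res = await fetch(dest.baseUrl + "/v1/chat/completions", {
      method: "POST",
      headers: {
        "content-type": "application/json",
        ...(dest.apiKey ? { authorization: `Bearer ${dest.apiKey}` } : {}),
      },
      body: JSON.stringify(body),
    });
    if (res.status < 500) return res; // success or client error: forward as-is, stream intact
    last = res;
    if (pinned) break;                // pinned rules never fall through to the next entry
  }
  return last ?? new Response("no destination available", { status: 503 });
}
```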
Audit log
better-sqlite3 with WAL mode. One row per request: id, ts, request_hash, rule, destination, latency_ms, cost, status, tokens_in, tokens_out, tags. The request hash is a fingerprint of model + messages + temperature; it lets you match repeat requests without storing the full prompt.
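A sketch of the table and the fingerprint, following the column list above; the file path and hash recipe details are illustrative.

```ts
// Sketch of the audit log: WAL mode, one prepared insert, a prompt fingerprint.
import Database from "better-sqlite3";
import { createHash } from "node:crypto";

const db = new Database("audit.db");
db.pragma("journal_mode = WAL");

db.exec(`
  CREATE TABLE IF NOT EXISTS audit_log (
    id           TEXT PRIMARY KEY,
    ts           INTEGER NOT NULL,
    request_hash TEXT NOT NULL,
    rule         TEXT,
    destination  TEXT,
    latency_ms   INTEGER,
    cost         REAL,
    status       INTEGER,
    tokens_in    INTEGER,
    tokens_out   INTEGER,
    tags         TEXT
  )
`);

const insert = db.prepare(`
  INSERT INTO audit_log (id, ts, request_hash, rule, destination, latency_ms, cost, status, tokens_in, tokens_out, tags)
  VALUES (@id, @ts, @request_hash, @rule, @destination, @latency_ms, @cost, @status, @tokens_in, @tokens_out, @tags)
`);

// Fingerprint of model + messages + temperature: matches repeat requests without storing the prompt.
const requestHash = (r: { model: string; messages: unknown; temperature?: number }) =>
  createHash("sha256").update(JSON.stringify([r.model, r.messages, r.temperature ?? null])).digest("hex");
```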
The admin API exposes paginated queries against the log, plus a CSV export. A small HTMX UI is included for ops users; engineers can also issue SQL directly against the SQLite file.
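The paginated queries reduce to plain SQL against the same file. A sketch, with an illustrative page size, of the kind of query the admin API runs:

```ts
// Sketch of a paginated read of the audit log, opened read-only.
import Database from "better-sqlite3";

const db = new Database("audit.db", { readonly: true });
const page = db.prepare(`
  SELECT id, ts, rule, destination, latency_ms, cost, status
  FROM audit_log
  ORDER BY ts DESC
  LIMIT ? OFFSET ?
`);
const rows = page.all(50, 0); // first page, 50 rows per page
```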
OpenAI compatibility surface
Implemented endpoints: POST /v1/chat/completions (streaming and non-streaming), POST /v1/embeddings, POST /v1/completions, GET /v1/models. Authentication is a Bearer token; tokens are configured in the policy file with optional per-token routing rules. Rate limiting is per-token, default off.
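Because the surface is OpenAI-compatible, any existing client can be pointed at the router by swapping the base URL. A sketch; the URL, token, and model name are placeholders.

```ts
// Sketch of a client request against the router's Chat Completions endpoint.
const res = await fetch("http://localhost:8787/v1/chat/completions", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    authorization: "Bearer <router-token>",
    "x-sensitive": "true", // pins this request to local Ollama via the policy
  },
  body: JSON.stringify({
    model: "llama3.2",
    messages: [{ role: "user", content: "Summarise this contract clause." }],
    stream: true,
  }),
});
```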
Hot-reload and deployment
The policy file is reloaded on SIGHUP. Invalid policies are rejected and the previous policy stays active. Deployment options: Bun on a VM, Node on a VM, Docker, Cloudflare Workers (HTTP transport only). The audit log on Workers is shipped to D1; the wiki has the recipe.
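A sketch of the reload path, assuming the yaml package for parsing and a stand-in for the real policy schema; validate first, swap only on success, keep the previous policy otherwise.

```ts
// Sketch of SIGHUP hot-reload with fail-safe validation.
import { readFileSync } from "node:fs";
import { parse } from "yaml";
import { z } from "zod";

const Policy = z.object({ rules: z.array(z.unknown()) }); // stand-in for the real schema

let activePolicy = Policy.parse(parse(readFileSync("policy.yaml", "utf8")));

process.on("SIGHUP", () => {
  try {
    const next = Policy.parse(parse(readFileSync("policy.yaml", "utf8")));
    activePolicy = next; // atomic swap: readers see old or new, never a partial policy
  } catch (err) {
    console.error("policy reload rejected, keeping previous policy", err);
  }
});
```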
Why this stack
The road not taken matters as much as the road taken. Here is what was picked, why, and what was rejected and why.
Hono
Fast, runs everywhere we want to deploy, clean middleware story.
Express: slower, ecosystem oriented toward older Node, no Workers story.
Bun (preferred)
Fastest TS runtime in 2026 for this kind of work. Uses better-sqlite3 natively. Boot time near zero.
Node: also supported, slightly slower. Workers: supported, but no SQLite, so the audit log ships to D1.
better-sqlite3 + WAL
Single-file audit log with serious throughput. No server. Simple ops.
Postgres: heavyweight for this audit-log shape on a single host.
YAML policy
Human-readable, comment-friendly, reviewable in PRs. Validated strictly after parsing.
JSON: works, but less legible. TOML: fine, but the team is more familiar with YAML.
Zod
Type-safe runtime validation for the policy file and the request bodies.
Hand-rolled validation: lossy types, easy to drift.
Ollama as first-class destination
Ollama is the dominant local LLM runtime. Treat it like the cloud providers, not as a side feature.
Local-only adapters bolted on: would have signalled second-class support.
Fail-closed privacy
Silent cloud fallback for pinned requests would be a footgun.
Best-effort pinning: the wrong default for regulated industries.
Performance & observability
Router overhead is below one millisecond on Bun for non-classifier rules. Adding the content classifier adds 60–100 milliseconds for the small Llama 3.2 model on a CPU host; on a GPU host it is below 30 milliseconds. The classifier is opt-in per rule.
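For orientation, the classifier is a single short Ollama call per request. A sketch against Ollama's /api/generate endpoint; the prompt, label set, and model name are illustrative.

```ts
// Sketch of the opt-in content classifier: one non-streaming Ollama call.
async function classify(text: string): Promise<"sensitive" | "normal"> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2",
      prompt: `Answer with exactly one word, "sensitive" or "normal":\n\n${text}`,
      stream: false,
    }),
  });
  const { response } = (await res.json()) as { response: string };
  return response.trim().toLowerCase().startsWith("sensitive") ? "sensitive" : "normal";
}
```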
Throughput is bound by the destination. Ollama on a serious GPU host streams as fast as the model produces tokens. Cloud destinations stream at whatever rate they offer. The router does not buffer; the SSE chunks are forwarded as they arrive.
Audit-log writes are async to the response. The SQLite WAL handles tens of thousands of inserts per second on commodity hardware. The audit log is rarely the bottleneck; if it ever is, the writes can be batched.
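If batching ever becomes necessary, it amounts to queueing rows and flushing them in one transaction. A sketch with an abbreviated row shape and an illustrative flush interval.

```ts
// Sketch of batched audit writes: one transaction per flush.
import Database from "better-sqlite3";

const db = new Database("audit.db");
db.pragma("journal_mode = WAL");

type Row = { id: string; ts: number; request_hash: string; status: number };

const insert = db.prepare(
  "INSERT INTO audit_log (id, ts, request_hash, status) VALUES (@id, @ts, @request_hash, @status)"
);
const insertMany = db.transaction((rows: Row[]) => {
  for (const row of rows) insert.run(row);
});

const queue: Row[] = [];
setInterval(() => {
  if (queue.length === 0) return;
  insertMany(queue.splice(0, queue.length)); // drain the queue in a single transaction
}, 250);
```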
Where it is heading
- Embedding-based classifier as an optional alternative to the small-LLM classifier.
- Federated audit log via OTel collector for multi-instance deploys.
- Cost-aware routing that considers per-token price plus expected output length.
- Latency-aware routing that moves on to the next destination if the first does not produce a token within N milliseconds.
- Admin UI improvements: a richer SQL playground, per-tenant filters, scheduled exports.
Read the full whitepaper for the formal technical write-up.