How seven providers act as one.
The plumbing behind a multi-provider AI backend that fails over in 50 milliseconds. Adapters, first-token gating, cooldown lists, and the audit log.
If you only read one paragraph.
Every chat request walks a chain of provider/model steps. Each step has an adapter that translates canonical messages into provider-specific calls and translates streaming responses back into canonical token chunks. The router opens a connection to the first step, waits for the first usable chunk, and only then commits the stream to the browser. If any step fails before that first chunk, it gets sin-binned for 60 seconds and the next step takes over. The browser sees one continuous stream. Every step writes a row to an audit table so we can see, after the fact, which engines actually carried traffic.
A request, end to end.
When something fails.
The pieces.
The adapter pattern
One file per provider under lib/providers/adapters/. Each exports a single function with the same signature.
Stateless. Pure. Two hundred lines. Adding a new provider is implementing this signature.
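The article does not reproduce the signature itself; a minimal sketch of what it might look like, with all type and function names illustrative rather than the real ones:

```typescript
// Canonical types the router speaks; names here are illustrative.
type CanonicalMessage = { role: "system" | "user" | "assistant"; content: string };
type TokenChunk = { text: string; done: boolean };
type AdapterStep = { provider: string; model: string };

// Every adapter exports one function with this shape: pure, stateless,
// translating canonical messages into a provider call and streaming the
// provider's response back as canonical chunks.
type Adapter = (
  messages: CanonicalMessage[],
  apiKey: string,
  step: AdapterStep,
) => AsyncIterable<TokenChunk>;

// A toy adapter that echoes the last message word by word, purely to
// demonstrate the contract; a real adapter would call fetch() here.
const echoAdapter: Adapter = async function* (messages) {
  const last = messages[messages.length - 1]?.content ?? "";
  for (const word of last.split(" ")) {
    yield { text: word + " ", done: false };
  }
  yield { text: "", done: true };
};
```

Because every adapter has the same shape, the router can hold a plain record mapping provider names to functions and stay ignorant of provider-specific details.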
The router
The router walks the failover chain, opens connections, gates on the first chunk, and writes audit rows. The whole thing is roughly 150 lines. The first-token race against a timeout is the load-bearing primitive.
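The first-token race can be sketched in a few lines. This is an assumption about the shape of the code, not the router itself; the names are hypothetical:

```typescript
type TokenChunk = { text: string; done: boolean };

class FirstTokenTimeout extends Error {}

// Race the stream's first chunk against a timer. Only once the first
// chunk arrives does the caller commit the stream to the browser; a
// timeout or early end means the router moves to the next step.
async function firstTokenGate(
  stream: AsyncIterator<TokenChunk>,
  timeoutMs: number,
): Promise<TokenChunk> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new FirstTokenTimeout("no first token")), timeoutMs),
  );
  const result = await Promise.race([stream.next(), timeout]);
  if (result.done) throw new FirstTokenTimeout("stream ended before first token");
  return result.value;
}
```

Everything after the gate is plain pass-through: once a step has produced one usable chunk, its stream is piped to the client unchanged.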
Cooldown set
A simple in-memory Map<string, number> from step label to expiry timestamp. Subsequent requests check the map; expired entries are deleted on access. Best-effort: each Edge instance has its own; that is fine because the goal is to avoid hammering a known-bad step, not to guarantee global exclusion.
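A sketch of that map, assuming the ~60-second penalty described below; function names are illustrative:

```typescript
const COOLDOWN_MS = 60_000;
// Step label -> expiry timestamp (ms since epoch). Per-instance only.
const cooldowns = new Map<string, number>();

function sinBin(step: string, now = Date.now()): void {
  cooldowns.set(step, now + COOLDOWN_MS);
}

// Expired entries are deleted on access, so the map self-cleans and
// never needs a background sweeper.
function isCoolingDown(step: string, now = Date.now()): boolean {
  const expiry = cooldowns.get(step);
  if (expiry === undefined) return false;
  if (now >= expiry) {
    cooldowns.delete(step);
    return false;
  }
  return true;
}
```

The router simply skips any step for which `isCoolingDown` returns true while walking the chain.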
Auto-router (intent classifier)
Before the router runs, a small regex-based classifier picks a mode (smart, fast, coder, live, vision, reasoner) from message content. Zero API calls, runs in microseconds, easy to test. The classifier’s output is overridable by the client; it is a default, not a verdict.
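A hypothetical classifier in the spirit described: an ordered rule list, first match wins, "smart" as the fallback. The actual patterns are not published; these are stand-ins:

```typescript
type Mode = "smart" | "fast" | "coder" | "live" | "vision" | "reasoner";

// Ordered rules; earlier rules take precedence. Patterns are illustrative.
const rules: Array<[RegExp, Mode]> = [
  [/\b(code|function|bug|compile|typescript)\b/i, "coder"],
  [/\b(today|latest|news|current)\b/i, "live"],
  [/\b(image|photo|picture|screenshot)\b/i, "vision"],
  [/\b(prove|step by step|reasoning)\b/i, "reasoner"],
  [/\b(quick|tldr|short answer)\b/i, "fast"],
];

function classify(message: string): Mode {
  for (const [pattern, mode] of rules) {
    if (pattern.test(message)) return mode;
  }
  return "smart"; // default when nothing matches; the client can override
}
```

Because the rules are data, each one can be unit-tested with a single input string, which is what makes this approach easy to test.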
Sanitiser at the trust boundary
Three sources of untrusted text reach the model: user messages, tool results, and saved memories. Each is wrapped in explicit XML-style markers before being concatenated into the prompt. Known jailbreak patterns are stripped. Unit tests cover documented attack categories. This is layered defence: the wrapping survives even if the strip pattern misses something.
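A sketch of that trust-boundary wrapping. The marker format and strip list here are assumptions; the real pattern list is longer and covered by the unit tests mentioned above:

```typescript
// Illustrative subset of known-bad patterns; the real list is larger.
const STRIP_PATTERNS = [
  /ignore (all )?previous instructions/gi,
  /you are now (in )?developer mode/gi,
];

function sanitise(text: string, source: "user" | "tool" | "memory"): string {
  let cleaned = text;
  for (const pattern of STRIP_PATTERNS) {
    cleaned = cleaned.replace(pattern, "[removed]");
  }
  // Layered defence: even if a strip pattern misses a novel attack,
  // the wrapper tells the model this content is data, not instructions.
  return `<untrusted source="${source}">\n${cleaned}\n</untrusted>`;
}
```

The system prompt then instructs the model to treat anything inside the markers as content to reason about, never as instructions to follow.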
ai_events audit log
Every step writes one row: user_id, backend, status (success / rate_limited / error), latency_ms, tokens_out, created_at. RLS ensures users only see their own events. The health endpoint aggregates across all events to compute per-backend success rates and p95 latencies.
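The per-backend success-rate aggregation can be sketched over in-memory rows; in production this is a query against the ai_events table, but the field names below follow the row described above:

```typescript
type AiEvent = {
  backend: string;
  status: "success" | "rate_limited" | "error";
  latency_ms: number;
};

// Group events by backend and compute the fraction that succeeded.
function successRates(events: AiEvent[]): Map<string, number> {
  const totals = new Map<string, { ok: number; all: number }>();
  for (const e of events) {
    const t = totals.get(e.backend) ?? { ok: 0, all: 0 };
    t.all += 1;
    if (e.status === "success") t.ok += 1;
    totals.set(e.backend, t);
  }
  const rates = new Map<string, number>();
  for (const [backend, t] of totals) rates.set(backend, t.ok / t.all);
  return rates;
}
```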
Why this, not that.
Next.js Edge Runtime
Sub-100ms cold starts, native ReadableStream, geographic distribution by default. Time-to-first-token is the metric that matters; Edge wins it.
Node serverless functions are fine but cold-start at 300-800ms; users feel that. Long-lived containers are operationally heavier.
Server-Sent Events end-to-end
One-way server-to-client streaming is exactly what chat needs. Native EventSource on the client. Works through every proxy, CDN, and corporate firewall.
WebSockets are bidirectional, but the chat UX does not need bidirectionality. They fail behind some proxies and require connection management.
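The SSE wire format is simple enough to show directly. A sketch of the server side, assuming the conventional `[DONE]` sentinel (the real endpoint's event shape is not specified here):

```typescript
// An SSE frame is just a "data: ..." line followed by a blank line.
function sseFrame(data: unknown): string {
  return `data: ${JSON.stringify(data)}\n\n`;
}

// Wrap an async token source as a ReadableStream of SSE frames,
// suitable for returning from an Edge handler with
// Content-Type: text/event-stream.
function sseStream(tokens: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const enc = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      for await (const t of tokens) {
        controller.enqueue(enc.encode(sseFrame({ text: t })));
      }
      controller.enqueue(enc.encode("data: [DONE]\n\n")); // end-of-stream sentinel
      controller.close();
    },
  });
}
```

On the client, native `EventSource` parses these frames with no library at all, which is the "works everywhere" property the section claims.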
TypeScript with strict types
Seven adapters, each with subtly different SDK shapes. Strict types catch shape changes at compile time. The canonical types are the contract; tsc enforces it.
Plain JavaScript in a multi-provider system is a runtime-error generator.
Stateless adapters via fetch
Each adapter is a pure function: messages + key + step → async iterator of TokenChunk. No client objects, no shared state, no singletons.
Stateful SDK clients with connection pools introduce race conditions in serverless and force singleton patterns I do not want.
Supabase Postgres for ai_events
One audit row per step. RLS-scoped per-user. Powers the health endpoint and informs chain reordering. Generous free tier.
A separate analytics database is overkill at this scale. Logs in stdout are unqueryable.
In-memory cooldown set
When a step fails, it is sin-binned for ~60 seconds. Subsequent requests skip it until cooldown expires. Simple, fast, no external dependency.
Redis would be more durable but adds a network hop and an operational dependency. The cooldown is best-effort; in-memory is fine.
What we watch.
The /api/admin/health endpoint surfaces per-backend success rates over a 24-hour window, plus dead-model detection (a backend with <50% success over >10 attempts gets flagged). Chain reorderings are driven by this data, not by gut feel.
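The dead-model rule as stated is a one-liner over aggregated stats; a sketch, with the type name illustrative:

```typescript
type BackendStats = { backend: string; attempts: number; successes: number };

// Flag any backend with under 50% success across more than 10
// attempts in the window; small samples are ignored to avoid flapping.
function deadBackends(stats: BackendStats[]): string[] {
  return stats
    .filter((s) => s.attempts > 10 && s.successes / s.attempts < 0.5)
    .map((s) => s.backend);
}
```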
What it talks to.
Seven providers
Groq, SambaNova, Cerebras, Google Gemini, OpenRouter, Cloudflare Workers AI, Tavily. Each is reached via its public OpenAI-compatible or native HTTP endpoint.
Supabase Postgres
Sessions, usage counters, audit events, user memories. RLS on every table. Service role only used for cron handlers.
Cloudflare R2
S3-compatible object storage for image attachments and generated images. 7-day signed URLs. Zero egress fees.
Where it goes next.
Per-tenant chain customisation
Embedded users want their own provider order (price-sensitive vs latency-sensitive). The chain definition is data, not code; this is a config-only change.
Token-stream caching
For deterministic prompts (system + same first user message), cache the full token stream and replay it. Pure win for frequently-asked questions.
Federated cooldown
Per-instance cooldown sets are best-effort. A small Redis layer would share cooldown state across Edge instances. Optional, behind a feature flag.
More providers
Mistral, Together, Anthropic direct, Replicate. Each is an afternoon of adapter code; the router does not change.
Want the engineering record?
The case-study whitepaper covers the architectural bets, the trade-offs, and the lessons learned in detail. The product whitepaper covers benchmarks, modes, and deployment.