How it works · Multi-Provider AI Backend

How seven providers act as one.

The plumbing behind a multi-provider AI backend that fails over in about 50 milliseconds. Adapters, first-token gating, cooldown sets, and the audit log.

A complementary product walk-through lives at /products/sarmalink-ai/how-it-works.
The 60-second version

If you only read one paragraph.

Every chat request walks a chain of provider/model steps. Each step has an adapter that translates canonical messages into provider-specific calls and translates streaming responses back into canonical token chunks. The router opens a connection to the first step, waits for the first usable chunk, and only then commits the stream to the browser. If any step fails before that first chunk, it gets sin-binned for 60 seconds and the next step takes over. The browser sees one continuous stream. Every step writes a row to an audit table so we can see, after the fact, which engines actually carried traffic.
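
The paragraph above leans on three canonical types. A minimal sketch of their shapes: the 'finish' and 'error' variants and the step label/model fields appear in the snippets further down this page; any field beyond those is an assumption, not the exact production type.

// Illustrative shapes only. Variants used by the snippets below are
// taken from those snippets; everything else is an assumption.
type CanonicalMessage = {
  role: 'system' | 'user' | 'assistant'
  content: string
}

type TokenChunk =
  | { type: 'token'; value: string }
  | { type: 'finish'; value: string; meta: { backend: string } }
  | { type: 'error' }

type ProviderStep = {
  label: string // the key used by the cooldown set and ai_events, e.g. 'SambaNova V3.2'
  model: string
}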

Core data flow

A request, end to end.

POST /api/chat { messages: [...], mode: 'smart' }
  ↓ Edge route handler
  ↓ wrap user messages with sanitiser (trust boundary)
  ↓ pick failover chain for 'smart' mode
  ↓ enter Router

[step 0] SambaNova · DeepSeek V3.2 685B
  • adapter.toRequest(canonical) → provider HTTP body
  • fetch(endpoint, { stream: true })
  • read first SSE chunk
  • 200 OK · first chunk arrives in 820ms
  • COMMIT stream to browser
  ↓ tokens flow through adapter.toCanonical() → SSE → client

ai_events row written:
  { backend: 'SambaNova V3.2', status: 'success', latency_ms: 820, tokens_out: 380 }

When something fails.

POST /api/chat { messages: [...], mode: 'coder' }
  ↓ enter Router

[step 0] SambaNova · DeepSeek V3.2 685B
  • fetch → 429 Too Many Requests (38ms)
  • cooldown.add('SambaNova V3.2', 60_000)
  • ai_events: status='rate_limited', latency=38ms
  ↓ 47ms later

[step 1] Cerebras · Qwen 3 Coder 480B
  • fetch → first chunk in 94ms
  • COMMIT stream
  • ai_events: status='success', latency=94ms

User saw: one continuous stream, 1.4s total
User missed: a 685B model was busy, a 480B model took over
Subsystems

The pieces.

The adapter pattern

One file per provider under lib/providers/adapters/. Each exports a single function with the same signature.

// lib/providers/adapters/groq.ts
export async function* groq(
  step: ProviderStep,
  apiKey: string,
  messages: CanonicalMessage[],
): AsyncIterable<TokenChunk> {
  // Canonical messages → Groq's request shape.
  const body = toGroqRequest(step.model, messages)

  const res = await fetch(GROQ_URL, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  })
  if (!res.ok) throw new ProviderError(res.status, step.label)

  // Groq SSE deltas → canonical token chunks.
  for await (const event of parseSSE(res.body!)) {
    const chunk = fromGroqDelta(event)
    if (chunk) yield chunk
  }
  yield { type: 'finish', value: '', meta: { backend: step.label } }
}

Stateless. Pure. Two hundred lines. Adding a new provider means implementing this signature.
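
The adapters share the small parseSSE helper used above. A sketch of what it might look like, assuming standard data:-prefixed events and the OpenAI-style [DONE] sentinel; the production parser is not shown here.

// Minimal SSE parser sketch, not the production version. Buffers the
// byte stream, splits on blank lines, and yields each data: payload
// as parsed JSON.
async function* parseSSE(body: ReadableStream<Uint8Array>): AsyncIterable<unknown> {
  const decoder = new TextDecoder()
  let buffer = ''
  // ReadableStream is async-iterable in the Edge runtime; lib.dom's
  // types lag behind, hence the cast.
  for await (const bytes of body as unknown as AsyncIterable<Uint8Array>) {
    buffer += decoder.decode(bytes, { stream: true })
    let sep: number
    while ((sep = buffer.indexOf('\n\n')) !== -1) { // an SSE event ends at a blank line
      const event = buffer.slice(0, sep)
      buffer = buffer.slice(sep + 2)
      for (const line of event.split('\n')) {
        if (!line.startsWith('data:')) continue
        const payload = line.slice(5).trim()
        if (payload === '[DONE]') return
        yield JSON.parse(payload)
      }
    }
  }
}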

The router

The router walks the failover chain, opens connections, gates on the first chunk, and writes audit rows. The whole thing is roughly 150 lines. The first-token race against a timeout is the load-bearing primitive.

async function commitOnFirstChunk(
  stream: AsyncIterable<TokenChunk>,
  label: string,
  timeoutMs: number,
) {
  const reader = stream[Symbol.asyncIterator]()

  // Race the first chunk against a timeout. A timeout resolves to a
  // synthetic error chunk so both arms have the same shape.
  const first = await Promise.race([
    reader.next(),
    new Promise<IteratorResult<TokenChunk>>(resolve =>
      setTimeout(
        () => resolve({ value: { type: 'error' }, done: false }),
        timeoutMs,
      ),
    ),
  ])

  // No usable first chunk → sin-bin the step and let the router move on.
  if (first.done || first.value.type === 'error') {
    cooldown.add(label, 60_000)
    throw new Error('first-chunk failed')
  }

  // Commit: re-attach the first chunk to the rest of the stream.
  return concat(first.value, reader)
}
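
Around that primitive, the chain walk is a short loop. A sketch of its shape; keyFor, adapters, and logEvent are illustrative stand-ins, and the 5-second deadline is an assumed value.

// Chain walk sketch. keyFor(), adapters(), and logEvent() are
// stand-ins; the deadline value is an assumption.
async function route(chain: ProviderStep[], messages: CanonicalMessage[]) {
  for (const step of chain) {
    if (cooldown.has(step.label)) continue // skip sin-binned steps
    const started = Date.now()
    try {
      const stream = adapters(step)(step, keyFor(step), messages)
      // commitOnFirstChunk sin-bins the step and throws unless a usable
      // first chunk arrives inside the deadline.
      const committed = await commitOnFirstChunk(stream, step.label, 5_000)
      // (In production the success row also records tokens_out once the
      // stream finishes.)
      logEvent({ backend: step.label, status: 'success', latency_ms: Date.now() - started })
      return committed
    } catch {
      logEvent({ backend: step.label, status: 'error', latency_ms: Date.now() - started })
      // fall through to the next step
    }
  }
  throw new Error('all steps exhausted') // full chain exhaustion
}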

Cooldown set

A simple in-memory Map<string, number> from step label to expiry timestamp. Subsequent requests check the map; expired entries are deleted on access. Best-effort: each Edge instance has its own; that is fine because the goal is to avoid hammering a known-bad step, not to guarantee global exclusion.
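
The whole subsystem is small enough to sketch in full; a best-effort version matching the behaviour described above.

// Per-instance cooldown set: step label → expiry timestamp.
const expiry = new Map<string, number>()

export const cooldown = {
  add(label: string, ms: number): void {
    expiry.set(label, Date.now() + ms)
  },
  has(label: string): boolean {
    const until = expiry.get(label)
    if (until === undefined) return false
    if (Date.now() >= until) {
      expiry.delete(label) // expired entries are deleted on access
      return false
    }
    return true
  },
}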

Auto-router (intent classifier)

Before the router runs, a small regex-based classifier picks a mode (smart, fast, coder, live, vision, reasoner) from message content. Zero API calls, runs in microseconds, easy to test. The classifier’s output is overridable by the client; it is a default, not a verdict.
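
A sketch of the shape; the patterns here are illustrative, and the production list is longer and covered by tests.

// Regex intent classifier sketch. The patterns are illustrative,
// not the production rules.
type Mode = 'smart' | 'fast' | 'coder' | 'live' | 'vision' | 'reasoner'

const rules: Array<[RegExp, Mode]> = [
  [/\bfunction\b|\bclass\b|stack trace|```/i, 'coder'],
  [/\b(today|latest|news|weather)\b/i, 'live'],
  [/\b(prove|derive|step[- ]by[- ]step)\b/i, 'reasoner'],
]

export function classify(text: string, hasImage: boolean): Mode {
  if (hasImage) return 'vision'
  for (const [pattern, mode] of rules) {
    if (pattern.test(text)) return mode
  }
  return 'smart' // a default, not a verdict: the client can override it
}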

Sanitiser at the trust boundary

Three sources of untrusted text reach the model: user messages, tool results, and saved memories. Each is wrapped in explicit XML-style markers before being concatenated into the prompt. Known jailbreak patterns are stripped. Unit tests cover documented attack categories. This is layered defence: the wrapping survives even if the strip pattern misses something.
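
A sketch of the wrapping, assuming XML-style markers like the below; both the marker name and the strip patterns are illustrative stand-ins.

// Trust-boundary wrapper sketch. Marker name and strip patterns are
// illustrative, not the production values.
const STRIP: RegExp[] = [
  /ignore (all )?previous instructions/gi,
  /you are now\b/gi,
]

export function wrapUntrusted(source: 'user' | 'tool' | 'memory', text: string): string {
  let clean = text
  for (const pattern of STRIP) clean = clean.replace(pattern, '[removed]')
  // Neutralise attempts to close the marker early.
  clean = clean.replace(/<\/?untrusted/gi, '[removed]')
  // Layered defence: even if a strip pattern misses, the content still
  // arrives inside an explicit untrusted marker.
  return `<untrusted source="${source}">\n${clean}\n</untrusted>`
}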

ai_events audit log

Every step writes one row: user_id, backend, status (success / rate_limited / error), latency_ms, tokens_out, created_at. RLS ensures users only see their own events. The health endpoint aggregates across all events to compute per-backend success rates and p95 latencies.

SELECT
  backend,
  COUNT(*) FILTER (WHERE status = 'success')  AS ok,
  COUNT(*) FILTER (WHERE status != 'success') AS fail,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_ms
FROM ai_events
WHERE created_at > now() - interval '24 hours'
GROUP BY backend
ORDER BY p95_ms;
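
The write side is one insert per step. A sketch of the logEvent stand-in from the router sketch above, assuming the Supabase JS client; column names match the schema described here.

// Audit write sketch, assuming the Supabase JS client.
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!)

type AiEvent = {
  user_id?: string // from the request session; elided in the router sketch
  backend: string
  status: 'success' | 'rate_limited' | 'error'
  latency_ms: number
  tokens_out?: number
}

export async function logEvent(event: AiEvent): Promise<void> {
  // Best-effort: an audit failure must never break the chat stream.
  const { error } = await supabase.from('ai_events').insert(event)
  if (error) console.error('ai_events insert failed:', error.message)
}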
Stack and reasoning

Why this, not that.

Next.js Edge Runtime

Why we picked it

Sub-100ms cold starts, native ReadableStream, geographic distribution by default. Time-to-first-token is the metric that matters; Edge wins it.

What we rejected

Node serverless functions are fine, but their cold starts run 300-800ms; users feel that. Long-lived containers are operationally heavier.

Server-Sent Events end-to-end

Why we picked it

One-way server-to-client streaming is exactly what chat needs. Native EventSource on the client. Works through every proxy, CDN, and corporate firewall.

What we rejected

WebSockets are bidirectional, but the chat UX does not need that. They fail behind some proxies and require connection management.

TypeScript with strict types

Why we picked it

Seven adapters, each with subtly different SDK shapes. Strict types catch shape changes at compile time. The canonical types are the contract; tsc enforces it.

What we rejected

Plain JavaScript in a multi-provider system is a runtime-error generator.

Stateless adapters via fetch

Why we picked it

Each adapter is a pure function: messages + key + step → async iterator of TokenChunk. No client objects, no shared state, no singletons.

What we rejected

Stateful SDK clients with connection pools introduce race conditions in serverless and force singleton patterns we do not want.

Supabase Postgres for ai_events

Why we picked it

One audit row per step. RLS-scoped per-user. Powers the health endpoint and informs chain reordering. Generous free tier.

What we rejected

A separate analytics database is overkill at this scale. Logs in stdout are unqueryable.

In-memory cooldown set

Why we picked it

When a step fails, it is sin-binned for ~60 seconds. Subsequent requests skip it until cooldown expires. Simple, fast, no external dependency.

What we rejected

Redis would be more durable but adds a network hop and an operational dependency. The cooldown is best-effort; in-memory is fine.

Performance and observability

What we watch.

~50ms · Failover handoff (median)
41ms · Fastest first-token (Groq Fast mode)
0 · User-visible incidents from a full chain exhaustion

The /api/admin/health endpoint surfaces per-backend success rates over a 24-hour window, plus dead-model detection (a backend with <50% success over >10 attempts gets flagged). Chain reorderings are driven by this data, not by gut feel.
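
Dead-model detection is one more aggregate over the same table. A sketch in the style of the query above; the thresholds match the text, and the 24-hour window is assumed to be the same one.

-- Sketch: flag backends under 50% success across more than 10 attempts.
SELECT backend
FROM ai_events
WHERE created_at > now() - interval '24 hours'
GROUP BY backend
HAVING COUNT(*) > 10
   AND COUNT(*) FILTER (WHERE status = 'success')::float / COUNT(*) < 0.5;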

Integrations

What it talks to.

Seven providers

Groq, SambaNova, Cerebras, Google Gemini, OpenRouter, Cloudflare Workers AI, Tavily. Each via its public OpenAI-compatible or native HTTP endpoint.

Supabase Postgres

Sessions, usage counters, audit events, user memories. RLS on every table. Service role only used for cron handlers.

Cloudflare R2

S3-compatible object storage for image attachments and generated images. 7-day signed URLs. Zero egress fees.

Future direction

Where it goes next.

Per-tenant chain customisation

Embedded users want their own provider order (price-sensitive vs latency-sensitive). The chain definition is data, not code; this is a config-only change.
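
A sketch of what a chain definition might look like as data; the labels echo the flows above, while the model ids and exact shape are assumptions.

// Illustrative chain config. Labels echo the flows above; model ids
// and the production shape are assumptions.
const chains: Partial<Record<Mode, ProviderStep[]>> = {
  coder: [
    { label: 'SambaNova V3.2', model: 'DeepSeek-V3.2' },
    { label: 'Cerebras Qwen3 Coder', model: 'qwen-3-coder-480b' },
  ],
  // smart, fast, live, vision, reasoner elided
}

A per-tenant override is then a different value of the same shape, not a code change.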

Token-stream caching

For deterministic prompts (system + same first user message), cache the full token stream and replay it. Pure win for frequently-asked questions.

Federated cooldown

Per-instance cooldown sets are best-effort. A small Redis layer would share cooldown state across Edge instances. Optional, behind a feature flag.

More providers

Mistral, Together, Anthropic direct, Replicate. Each is an afternoon of adapter code; the router does not change.

Want the engineering record?

The case-study whitepaper covers the architectural bets, the trade-offs, and the lessons learned in detail. The product whitepaper covers benchmarks, modes, and deployment.