How it works · Multi-Provider AI Backend

How seven providers act as one.

The plumbing behind a multi-provider AI backend that fails over in about 50 milliseconds. Adapters, first-token gating, cooldown sets, and the audit log.

A complementary product walk-through lives at /products/sarmalink-ai/how-it-works.
The 60-second version

If you only read one paragraph.

Every chat request walks a chain of provider/model steps. Each step has an adapter that translates canonical messages into provider-specific calls and translates streaming responses back into canonical token chunks. The router opens a connection to the first step, waits for the first usable chunk, and only then commits the stream to the browser. If any step fails before that first chunk, it gets sin-binned for 60 seconds and the next step takes over. The browser sees one continuous stream. Every step writes a row to an audit table so we can see, after the fact, which engines actually carried traffic.
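
The paragraph above leans on three canonical types. A minimal sketch of their shapes: the 'finish' and 'error' variants and the step label/model fields appear in the snippets further down this page; any field beyond those is an assumption, not the exact production type.

// Illustrative shapes only. Variants used by the snippets below are
// taken from those snippets; everything else is an assumption.
type CanonicalMessage = {
  role: 'system' | 'user' | 'assistant'
  content: string
}

type TokenChunk =
  | { type: 'token'; value: string }
  | { type: 'finish'; value: string; meta: { backend: string } }
  | { type: 'error' }

type ProviderStep = {
  label: string // the key used by the cooldown set and ai_events, e.g. 'SambaNova V3.2'
  model: string
}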

Core data flow

A request, end to end.

POST /api/chat { messages: [...], mode: 'smart' }
  ↓ Edge route handler
  ↓ wrap user messages with sanitiser (trust boundary)
  ↓ pick failover chain for 'smart' mode
  ↓ enter Router

[step 0] SambaNova · DeepSeek V3.2 685B
  • adapter.toRequest(canonical) → provider HTTP body
  • fetch(endpoint, { stream: true })
  • read first SSE chunk
  • 200 OK · first chunk arrives in 820ms
  • COMMIT stream to browser
  ↓ tokens flow through adapter.toCanonical() → SSE → client

ai_events row written:
  { backend: 'SambaNova V3.2', status: 'success', latency_ms: 820, tokens_out: 380 }

When something fails.

POST /api/chat { messages: [...], mode: 'coder' }
  ↓ enter Router

[step 0] SambaNova · DeepSeek V3.2 685B
  • fetch → 429 Too Many Requests (38ms)
  • cooldown.add('SambaNova V3.2', 60_000)
  • ai_events: status='rate_limited', latency=38ms
  ↓ 47ms later

[step 1] Cerebras · Qwen 3 Coder 480B
  • fetch → first chunk in 94ms
  • COMMIT stream
  • ai_events: status='success', latency=94ms

User saw: one continuous stream, 1.4s total
User missed: a 685B model was busy, a 480B model took over
Subsystems

The pieces.

The adapter pattern

One file per provider under lib/providers/adapters/. Each exports a single function with the same signature.

// lib/providers/adapters/groq.ts
export async function* groq(
  step: ProviderStep,
  apiKey: string,
  messages: CanonicalMessage[],
): AsyncIterable<TokenChunk> {
  // Canonical messages → Groq's request shape.
  const body = toGroqRequest(step.model, messages)

  const res = await fetch(GROQ_URL, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  })
  if (!res.ok) throw new ProviderError(res.status, step.label)

  // Groq SSE deltas → canonical token chunks.
  for await (const event of parseSSE(res.body!)) {
    const chunk = fromGroqDelta(event)
    if (chunk) yield chunk
  }
  yield { type: 'finish', value: '', meta: { backend: step.label } }
}

Stateless. Pure. Two hundred lines. Adding a new provider means implementing this signature.
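
The adapters share the small parseSSE helper used above. A sketch of what it might look like, assuming standard data:-prefixed events and the OpenAI-style [DONE] sentinel; the production parser is not shown here.

// Minimal SSE parser sketch, not the production version. Buffers the
// byte stream, splits on blank lines, and yields each data: payload
// as parsed JSON.
async function* parseSSE(body: ReadableStream<Uint8Array>): AsyncIterable<unknown> {
  const decoder = new TextDecoder()
  let buffer = ''
  // ReadableStream is async-iterable in the Edge runtime; lib.dom's
  // types lag behind, hence the cast.
  for await (const bytes of body as unknown as AsyncIterable<Uint8Array>) {
    buffer += decoder.decode(bytes, { stream: true })
    let sep: number
    while ((sep = buffer.indexOf('\n\n')) !== -1) { // an SSE event ends at a blank line
      const event = buffer.slice(0, sep)
      buffer = buffer.slice(sep + 2)
      for (const line of event.split('\n')) {
        if (!line.startsWith('data:')) continue
        const payload = line.slice(5).trim()
        if (payload === '[DONE]') return
        yield JSON.parse(payload)
      }
    }
  }
}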

The router

The router walks the failover chain, opens connections, gates on the first chunk, and writes audit rows. The whole thing is roughly 150 lines. The first-token race against a timeout is the load-bearing primitive.

async function commitOnFirstChunk(
  stream: AsyncIterable<TokenChunk>,
  label: string,
  timeoutMs: number,
) {
  const reader = stream[Symbol.asyncIterator]()

  // Race the first chunk against a timeout. A timeout resolves to a
  // synthetic error chunk so both arms have the same shape.
  const first = await Promise.race([
    reader.next(),
    new Promise<IteratorResult<TokenChunk>>(resolve =>
      setTimeout(
        () => resolve({ value: { type: 'error' }, done: false }),
        timeoutMs,
      ),
    ),
  ])

  // No usable first chunk → sin-bin the step and let the router move on.
  if (first.done || first.value.type === 'error') {
    cooldown.add(label, 60_000)
    throw new Error('first-chunk failed')
  }

  // Commit: re-attach the first chunk to the rest of the stream.
  return concat(first.value, reader)
}
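
Around that primitive, the chain walk is a short loop. A sketch of its shape; keyFor, adapters, and logEvent are illustrative stand-ins, and the 5-second deadline is an assumed value.

// Chain walk sketch. keyFor(), adapters(), and logEvent() are
// stand-ins; the deadline value is an assumption.
async function route(chain: ProviderStep[], messages: CanonicalMessage[]) {
  for (const step of chain) {
    if (cooldown.has(step.label)) continue // skip sin-binned steps
    const started = Date.now()
    try {
      const stream = adapters(step)(step, keyFor(step), messages)
      // commitOnFirstChunk sin-bins the step and throws unless a usable
      // first chunk arrives inside the deadline.
      const committed = await commitOnFirstChunk(stream, step.label, 5_000)
      // (In production the success row also records tokens_out once the
      // stream finishes.)
      logEvent({ backend: step.label, status: 'success', latency_ms: Date.now() - started })
      return committed
    } catch {
      logEvent({ backend: step.label, status: 'error', latency_ms: Date.now() - started })
      // fall through to the next step
    }
  }
  throw new Error('all steps exhausted') // full chain exhaustion
}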

Cooldown set

A simple in-memory Map<string, number> from step label to expiry timestamp. Subsequent requests check the map; expired entries are deleted on access. Best-effort: each Edge instance has its own; that is fine because the goal is to avoid hammering a known-bad step, not to guarantee global exclusion.
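
The whole subsystem is small enough to sketch in full; a best-effort version matching the behaviour described above.

// Per-instance cooldown set: step label → expiry timestamp.
const expiry = new Map<string, number>()

export const cooldown = {
  add(label: string, ms: number): void {
    expiry.set(label, Date.now() + ms)
  },
  has(label: string): boolean {
    const until = expiry.get(label)
    if (until === undefined) return false
    if (Date.now() >= until) {
      expiry.delete(label) // expired entries are deleted on access
      return false
    }
    return true
  },
}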

Auto-router (intent classifier)

Before the router runs, a small regex-based classifier picks a mode (smart, fast, coder, live, vision, reasoner) from message content. Zero API calls, runs in microseconds, easy to test. The classifier’s output is overridable by the client; it is a default, not a verdict.
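
A sketch of the shape; the patterns here are illustrative, and the production list is longer and covered by tests.

// Regex intent classifier sketch. The patterns are illustrative,
// not the production rules.
type Mode = 'smart' | 'fast' | 'coder' | 'live' | 'vision' | 'reasoner'

const rules: Array<[RegExp, Mode]> = [
  [/\bfunction\b|\bclass\b|stack trace|```/i, 'coder'],
  [/\b(today|latest|news|weather)\b/i, 'live'],
  [/\b(prove|derive|step[- ]by[- ]step)\b/i, 'reasoner'],
]

export function classify(text: string, hasImage: boolean): Mode {
  if (hasImage) return 'vision'
  for (const [pattern, mode] of rules) {
    if (pattern.test(text)) return mode
  }
  return 'smart' // a default, not a verdict: the client can override it
}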

Sanitiser at the trust boundary

Three sources of untrusted text reach the model: user messages, tool results, and saved memories. Each is wrapped in explicit XML-style markers before being concatenated into the prompt. Known jailbreak patterns are stripped. Unit tests cover documented attack categories. This is layered defence: the wrapping survives even if the strip pattern misses something.
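
A sketch of the wrapping, assuming XML-style markers like the below; both the marker name and the strip patterns are illustrative stand-ins.

// Trust-boundary wrapper sketch. Marker name and strip patterns are
// illustrative, not the production values.
const STRIP: RegExp[] = [
  /ignore (all )?previous instructions/gi,
  /you are now\b/gi,
]

export function wrapUntrusted(source: 'user' | 'tool' | 'memory', text: string): string {
  let clean = text
  for (const pattern of STRIP) clean = clean.replace(pattern, '[removed]')
  // Neutralise attempts to close the marker early.
  clean = clean.replace(/<\/?untrusted/gi, '[removed]')
  // Layered defence: even if a strip pattern misses, the content still
  // arrives inside an explicit untrusted marker.
  return `<untrusted source="${source}">\n${clean}\n</untrusted>`
}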

ai_events audit log

Every step writes one row: user_id, backend, status (success / rate_limited / error), latency_ms, tokens_out, created_at. RLS ensures users only see their own events. The health endpoint aggregates across all events to compute per-backend success rates and p95 latencies.

SELECT
  backend,
  COUNT(*) FILTER (WHERE status = 'success')  AS ok,
  COUNT(*) FILTER (WHERE status != 'success') AS fail,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_ms
FROM ai_events
WHERE created_at > now() - interval '24 hours'
GROUP BY backend
ORDER BY p95_ms;
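
The write side is one insert per step. A sketch of the logEvent stand-in from the router sketch above, assuming the Supabase JS client; column names match the schema described here.

// Audit write sketch, assuming the Supabase JS client.
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!)

type AiEvent = {
  user_id?: string // from the request session; elided in the router sketch
  backend: string
  status: 'success' | 'rate_limited' | 'error'
  latency_ms: number
  tokens_out?: number
}

export async function logEvent(event: AiEvent): Promise<void> {
  // Best-effort: an audit failure must never break the chat stream.
  const { error } = await supabase.from('ai_events').insert(event)
  if (error) console.error('ai_events insert failed:', error.message)
}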
Stack and reasoning

Why this, not that.

Next.js Edge Runtime

Why we picked it

Sub-100ms cold starts, native ReadableStream, geographic distribution by default. Time-to-first-token is the metric that matters; Edge wins it.

What we rejected

Node serverless functions are fine, but their cold starts run 300-800ms; users feel that. Long-lived containers are operationally heavier.

Server-Sent Events end-to-end

Why we picked it

One-way server-to-client streaming is exactly what chat needs. Native EventSource on the client. Works through every proxy, CDN, and corporate firewall.

What we rejected

WebSockets are bidirectional, but the chat UX does not need that. They fail behind some proxies and require connection management.

TypeScript with strict types

Why we picked it

Seven adapters, each with subtly different SDK shapes. Strict types catch shape changes at compile time. The canonical types are the contract; tsc enforces it.

What we rejected

Plain JavaScript in a multi-provider system is a runtime-error generator.

Stateless adapters via fetch

Why we picked it

Each adapter is a pure function: messages + key + step → async iterator of TokenChunk. No client objects, no shared state, no singletons.

What we rejected

Stateful SDK clients with connection pools introduce race conditions in serverless and force singleton patterns we do not want.

Supabase Postgres for ai_events

Why we picked it

One audit row per step. RLS-scoped per-user. Powers the health endpoint and informs chain reordering. Generous free tier.

What we rejected

A separate analytics database is overkill at this scale. Logs in stdout are unqueryable.

In-memory cooldown set

Why we picked it

When a step fails, it is sin-binned for ~60 seconds. Subsequent requests skip it until cooldown expires. Simple, fast, no external dependency.

What we rejected

Redis would be more durable but adds a network hop and an operational dependency. The cooldown is best-effort; in-memory is fine.

Performance and observability

What we watch.

~50ms · Failover handoff (median)
41ms · Fastest first-token (Groq Fast mode)
0 · User-visible incidents from a full chain exhaustion

The /api/admin/health endpoint surfaces per-backend success rates over a 24-hour window, plus dead-model detection (a backend with <50% success over >10 attempts gets flagged). Chain reorderings are driven by this data, not by gut feel.
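
Dead-model detection is one more aggregate over the same table. A sketch in the style of the query above; the thresholds match the text, and the 24-hour window is assumed to be the same one.

-- Sketch: flag backends under 50% success across more than 10 attempts.
SELECT backend
FROM ai_events
WHERE created_at > now() - interval '24 hours'
GROUP BY backend
HAVING COUNT(*) > 10
   AND COUNT(*) FILTER (WHERE status = 'success')::float / COUNT(*) < 0.5;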

Integrations

What it talks to.

Seven providers

Groq, SambaNova, Cerebras, Google Gemini, OpenRouter, Cloudflare Workers AI, Tavily. Each via its public OpenAI-compatible or native HTTP endpoint.

Supabase Postgres

Sessions, usage counters, audit events, user memories. RLS on every table. Service role only used for cron handlers.

Cloudflare R2

S3-compatible object storage for image attachments and generated images. 7-day signed URLs. Zero egress fees.

Future direction

Where it goes next.

Per-tenant chain customisation

Embedded users want their own provider order (price-sensitive vs latency-sensitive). The chain definition is data, not code; this is a config-only change.
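
A sketch of what a chain definition might look like as data; the labels echo the flows above, while the model ids and exact shape are assumptions.

// Illustrative chain config. Labels echo the flows above; model ids
// and the production shape are assumptions.
const chains: Partial<Record<Mode, ProviderStep[]>> = {
  coder: [
    { label: 'SambaNova V3.2', model: 'DeepSeek-V3.2' },
    { label: 'Cerebras Qwen3 Coder', model: 'qwen-3-coder-480b' },
  ],
  // smart, fast, live, vision, reasoner elided
}

A per-tenant override is then a different value of the same shape, not a code change.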

Token-stream caching

For deterministic prompts (system + same first user message), cache the full token stream and replay it. Pure win for frequently-asked questions.

Federated cooldown

Per-instance cooldown sets are best-effort. A small Redis layer would share cooldown state across Edge instances. Optional, behind a feature flag.

More providers

Mistral, Together, Anthropic direct, Replicate. Each is an afternoon of adapter code; the router does not change.

Want the engineering record?

The case-study whitepaper covers the architectural bets, the trade-offs, and the lessons learned in detail. The product whitepaper covers benchmarks, modes, and deployment.