Treating AI providers as a commodity.
The engineering case study for the multi-provider AI backend that powers SarmaLink-AI: why provider lock-in is an operational risk, how the adapter + failover architecture removes it, and the trade-offs that shaped the codebase.
Note. A separate, product-perspective whitepaper for SarmaLink-AI lives at /products/sarmalink-ai/whitepaper and covers benchmarks, deployment, the six modes, and the cost model. This document is the engineering case study: why I built it, what I picked, what I rejected, what I would do differently.
Executive summary
The multi-provider AI backend is a Next.js application running on the Edge Runtime, presenting a single canonical message + streaming-token interface and routing each request through a chain of provider-specific adapters with first-token-gated failover. Seven providers, thirty-six engines, three thousand lines of TypeScript. The architecture removes provider lock-in, survives real outages without user-visible disruption, and sits underneath every AI feature I now build for clients.
01 Background
I started running into the multi-provider problem on three different client engagements in the same quarter. One project depended on a specific OpenAI model that got deprecated with three months’ notice. A second hit a Groq rate-limit cliff during a Friday afternoon load test. A third needed Gemini for image input and Claude for long-context reasoning, and I was writing two completely different SDK calls in two completely different places.
The root cause was the same in all three cases: depending on a single AI provider in production is the same kind of operational risk as depending on a single cloud region. Eventually it bites you. The fix is the same too: redundancy, designed in, before you need it.
02 The problem in detail
Provider lock-in is one symptom. The deeper problem is that AI provider SDKs are not designed to be interchangeable.
- Different message formats. OpenAI uses {role, content}; Anthropic uses {role, content[]} with content blocks; Gemini uses {role, parts[]} (sketched just after this list).
- Different streaming protocols. Some are SSE with deltas, some are SSE with full message snapshots, some are HTTP/1 chunked transfer with provider-specific framing.
- Different error semantics. A 429 from one provider means “wait a bit”; from another it means “you have run out of monthly credit”. A 400 might be a malformed request or a content-policy block.
- Different tool-call formats, system-prompt rules, and JSON-mode quirks. Trying to wrap them naively produces a forest of if-statements.
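To make the divergence concrete, here is the same one-line user turn in each provider's native shape. These are illustrative request fragments, not complete API calls, and reflect the formats at the time of writing.

// The same user turn in three native request shapes (illustrative fragments)
const openaiStyle    = { role: 'user', content: 'Summarise this thread.' }
const anthropicStyle = { role: 'user', content: [{ type: 'text', text: 'Summarise this thread.' }] }
const geminiStyle    = { role: 'user', parts: [{ text: 'Summarise this thread.' }] }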
Existing options were either too heavyweight (full agent frameworks for what should be a streaming chat call) or did not handle failover sensibly (you still wrote your own retry logic). I wanted a single function I could call that would just work, and quietly fail over to another provider when the current one was unavailable.
03 Goals and non-goals
In scope
- One canonical message format and one canonical token-stream format used everywhere in the application
- One stateless adapter per provider, translating to and from canonical at the edge
- A router that walks a chain of provider/model steps and fails over before the first token
- Edge-runtime deployment with sub-100ms cold starts
- Server-sent events all the way to the browser
- Capability metadata per engine (vision, tool use, JSON mode, context window)
- A standalone TypeScript module that can be embedded in other Next.js apps
Explicitly out of scope
- Agent frameworks. No tool-calling orchestration, no DAGs, no autonomous loops. The router emits tokens; agents are a layer on top.
- Local model inference. Cloud providers only. Llama.cpp wrappers are a different problem.
- Per-user billing or quotas. The router is a transport. Quotas live in the application.
- Caching. Token-stream caching is interesting but not in scope; it is a separate concern, layered on top.
- Multi-region deployment. Edge runtime gives geographic distribution by default. No bespoke region routing.
04 Architecture
Browser
  ↓ POST /api/chat { messages, mode }
Vercel Edge Runtime → Next.js Route Handler
  ↓
Router
  • picks the failover chain for the requested mode
  • opens connection to step[0]
  • waits for first token
  • on 429/503/timeout → cooldown step[0], try step[1]
  ↓
Adapter (per provider)
  • canonical messages → provider-specific request
  • provider streaming response → canonical token stream
  ↓
Provider HTTP endpoint
  (OpenAI-compatible, Anthropic, Gemini, OpenRouter, ...)

Tokens flow back through the same path as SSE to the browser.

Side channels:
  • ai_events table (Postgres) — per-step status + latency
  • cooldown set (in-memory) — recently-failed steps skipped

Canonical types
type Role = 'system' | 'user' | 'assistant' | 'tool'
interface CanonicalMessage {
role: Role
content: string
  // multimodal: optional image refs (http(s) URLs or base64 data: URLs)
images?: Array<{ url: string }>
}
interface TokenChunk {
type: 'token' | 'tool_call' | 'finish' | 'error'
value: string
meta?: { engine?: string; backend?: string }
}
interface ProviderStep {
provider: 'groq' | 'sambanova' | 'cerebras' | 'gemini'
| 'openrouter' | 'cloudflare' | 'tavily'
model: string
label: string
cost_band: 'free' | 'cheap' | 'standard' | 'premium'
}
Repo layout
lib/
├── providers/
│   ├── registry.ts          # all engines + endpoints + key collections
│   ├── failover.ts          # the router (tryFailover)
│   ├── adapters/
│   │   ├── groq.ts
│   │   ├── sambanova.ts
│   │   ├── cerebras.ts
│   │   ├── gemini.ts
│   │   └── openrouter.ts
│   └── streams.ts           # SSE parser, canonical token emitter
├── prompts/sanitize.ts      # trust-boundary wrapping
├── intent.ts                # auto-router classifier
└── repositories/            # Supabase typed CRUD
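registry.ts is the catalogue that carries the per-engine capability metadata listed in the goals (vision, tool use, JSON mode, context window). A minimal sketch of what one entry could look like; the field names, model name, and endpoint below are illustrative assumptions, not the real registry.ts schema.

// Illustrative registry entry; field names and values are assumptions, not the actual schema
interface EngineEntry {
  provider: ProviderStep['provider']
  model: string
  endpoint: string
  capabilities: { vision: boolean; tools: boolean; jsonMode: boolean; contextWindow: number }
}

const exampleEntry: EngineEntry = {
  provider: 'groq',
  model: 'llama-3.3-70b-versatile',
  endpoint: 'https://api.groq.com/openai/v1/chat/completions',
  capabilities: { vision: false, tools: true, jsonMode: true, contextWindow: 128_000 },
}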
05 Key technical decisions
Design the canonical format first
The first day was spent on CanonicalMessage and TokenChunk. Every adapter translates to and from these types at its boundary. If the provider format leaks anywhere into application code, the architecture has already failed. Spending a day on the canonical format pays back across every line of code that follows.
Stateless adapters
Each adapter is a pure function: given a step, a key, and a message list, return an async iterator of TokenChunk. No client objects holding state, no shared connection pools, no singletons. Stateless adapters are testable, swappable, and free of the subtle bugs that come from shared mutable state in long-lived processes.
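In that spirit, an adapter's shape is roughly the sketch below. The groqAdapter name, the message mapping, and the parseSse helper (assumed to live in streams.ts) are illustrative; this is not the real adapters/groq.ts, only the pattern it follows against Groq's OpenAI-compatible endpoint.

// Sketch of a stateless adapter: no client object, no shared pool, just an async generator
async function* groqAdapter(
  step: ProviderStep,
  apiKey: string,
  messages: CanonicalMessage[],
): AsyncGenerator<TokenChunk> {
  const res = await fetch('https://api.groq.com/openai/v1/chat/completions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: step.model,
      stream: true,
      messages: messages.map(m => ({ role: m.role, content: m.content })),
    }),
  })
  if (!res.ok || !res.body) {
    yield { type: 'error', value: `HTTP ${res.status}`, meta: { backend: step.label } }
    return
  }
  // parseSse (assumed helper) yields one SSE data payload at a time
  for await (const data of parseSse(res.body)) {
    if (data === '[DONE]') break
    const delta = JSON.parse(data).choices?.[0]?.delta?.content
    if (delta) yield { type: 'token', value: delta, meta: { engine: step.model, backend: step.label } }
  }
  yield { type: 'finish', value: '', meta: { backend: step.label } }
}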
Fail over before the first token
Once you have started streaming to the browser, you cannot un-stream. So the router opens the connection, waits for the first chunk, and only then commits to that provider. If the first chunk fails (429, 503, network error, timeout), the router moves to the next step. The user sees one continuous stream.
async function tryFailover(steps: ProviderStep[], messages: CanonicalMessage[], opts: { firstTokenTimeoutMs?: number }) {
for (const step of steps) {
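    // skip steps that failed recently and are still cooling down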
if (cooldown.has(step.label)) continue
const keys = providerKeys(step.provider)
for (const key of rotateKeys(keys)) {
const stream = await openStream(step, key, messages)
const first = await Promise.race([
readFirstChunk(stream),
timeout(opts.firstTokenTimeoutMs ?? 8000),
])
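      // no usable first chunk (429/503/network error/timeout): cool this step down, move on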
if (first.error) {
cooldown.add(step.label, 60_000)
continue
}
return concatChunkAndStream(first, stream) // commit
}
}
throw new Error('All steps exhausted')
}
Edge runtime is worth the constraints
The constraints are real: no Node-only dependencies, fetch-only HTTP, smaller bundle limits, no filesystem. In return: sub-100ms cold starts, geographic distribution by default, native ReadableStream support. For an AI proxy where time-to-first-token is the dominant UX metric, the trade is heavily favourable.
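For reference, the shape of an Edge route handler that streams the result as SSE. This is a minimal sketch: chainForMode (mode → failover chain) and streamToSse (TokenChunk iterator → SSE ReadableStream) are assumed helpers, not the exact production code.

// app/api/chat/route.ts: minimal sketch of an Edge route handler streaming SSE
export const runtime = 'edge'

export async function POST(req: Request): Promise<Response> {
  const { messages, mode } = await req.json()
  const steps = chainForMode(mode)                  // assumed helper: mode → failover chain
  const chunks = await tryFailover(steps, messages, { firstTokenTimeoutMs: 8000 })
  return new Response(streamToSse(chunks), {        // assumed helper: chunks → SSE ReadableStream
    headers: { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' },
  })
}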
Retry only what is retryable
Retry budgets are for transient errors: 429, 503, network timeouts, sometimes 502. They are not for 400 (malformed request), 401/403 (auth), or content-policy blocks. A 400 from one provider will be a 400 from the next. Retrying user-error responses wastes the user’s time and the provider’s capacity.
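Expressed as a predicate, the rule is roughly the sketch below (illustrative, not the exact production logic).

// Only transient failures earn another step in the chain
function isRetryable(status: number | 'timeout' | 'network'): boolean {
  if (status === 'timeout' || status === 'network') return true
  return status === 429 || status === 503 || status === 502
}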
Cooldown lists, not exponential backoff
When a step fails, it is pushed into a 60-second cooldown set in memory. Subsequent requests skip it until the cooldown expires. This is simpler than per-step exponential backoff and aligns with how providers actually rate-limit (a 429 means “not for the next minute”, not “wait 200ms then 400ms then 800ms”).
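A minimal sketch of that cooldown set, matching the cooldown.has / cooldown.add calls in tryFailover (illustrative; the real implementation may differ).

// In-memory cooldown: label → expiry timestamp, consulted before each attempt
const cooldownUntil = new Map<string, number>()

const cooldown = {
  add(label: string, ms: number) { cooldownUntil.set(label, Date.now() + ms) },
  has(label: string) {
    const until = cooldownUntil.get(label)
    if (until === undefined) return false
    if (Date.now() >= until) { cooldownUntil.delete(label); return false }
    return true
  },
}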
Audit every step
Every adapter call writes a row to ai_events: backend, status, latency, tokens out, request id. This is what makes the failover behaviour observable and tunable. Without it, you have no idea which steps actually carry traffic and which are dead weight.
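The row shape is roughly the following; the field names here are illustrative, not the actual table definition.

// One ai_events row per adapter attempt, written whether the step succeeded or failed over
interface AiEventRow {
  request_id: string
  backend: string              // step label, e.g. provider/model
  status: 'ok' | 'failover' | 'error'
  latency_ms: number
  tokens_out: number
  created_at: string
}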
06 Implementation milestones
| Phase | Deliverable |
|---|---|
| 1 | Canonical types. First adapter (Groq). End-to-end stream from browser to provider. |
| 2 | Second adapter (SambaNova). Router walking 2 steps. Manual failover test. |
| 3 | First-token gating. Cooldown list. ai_events table + writes. |
| 4 | Adapters 3 to 7 (Cerebras, Gemini, OpenRouter, Cloudflare, Tavily for tools). |
| 5 | Mode definitions (smart, reasoner, live, fast, coder, vision) with per-mode chains. |
| 6 | Auto-router based on message intent. Sanitiser at trust boundaries. |
| 7 | Health endpoint with per-backend success rate, p95, and dead-model detection. |
| 8 | Standalone module extraction. Public open-source release. |
07 Results
The architectural bet has paid off in production. Multiple times, a primary provider has hit a rate limit or returned 5xx during real usage. The router failed over to a secondary engine; users noticed, at most, a slightly different writing style. Zero pages, zero incidents, zero tickets attributable to provider failures.
The codebase has stayed small. The complete router plus all seven provider adapters fits in well under three thousand lines of TypeScript. New providers slot in without touching existing code, which is the whole point. The community has added a handful more, each in a single afternoon's PR.
Personally, I now build every AI feature for clients on top of this backend. Not because it is mine, but because the failover model has changed how I think about provider risk. I will not deploy a single-provider integration into anything that needs to stay up.
08 Lessons learned
Canonical-first is non-negotiable
If the provider message format leaks one level above the adapter, the entire architecture is compromised. The discipline is to draw a hard line at the adapter boundary and never cross it.
Stateless adapters scale across engineers
A new contributor can add a provider in an afternoon because there is no shared state to understand. Each adapter is a self-contained function; correctness is local.
First-token gating is the magic trick
Failover after the first byte hits the wire is not failover; it is a broken connection. Holding the commit until the first usable chunk has arrived is what makes the whole thing feel seamless to users.
Audit logs are how you tune
Without ai_events, you have opinions about which providers carry traffic. With it, you have data. Several chain reorderings have come from staring at the per-backend success-rate query.
Edge runtime forces good habits
Fetch-only HTTP, no Node APIs, smaller bundles. The constraints push you towards a leaner codebase. In a year of running on Edge, I have not once wished for the full Node API.
09 Conclusion
The architectural bet of this project is small and load-bearing: design the canonical types first, keep the adapters stateless and small, gate failover on the first token, and instrument every step. None of this is exotic. The discipline is in the no-compromises insistence on those four properties through every line of code. The result is a backend that survives real provider outages, can absorb new engines in an afternoon, and sits underneath every AI feature I now ship.