Multi-engine LLM gateway with failover

How Sarmalink-AI takes a single OpenAI-shaped request and routes it across 14 backends (Groq, OpenRouter, Anthropic, Gemini, Mistral, local Ollama, and more), fails over in under 50 ms when an engine returns a 429 (after a 200 ms pause on a 5xx), deduplicates retries, and never returns a generic 500 to the caller. The architecture, the routing logic, and the cost guard.

Why a gateway at all

No single LLM provider is up 100 percent of the time. Groq has 429 spikes when a new free-tier wave lands, OpenAI has the occasional regional brownout, Anthropic throttles bursty traffic, and any free-tier engine can vanish on a Tuesday. If your product hard-wires one provider, every one of those incidents lands directly on your users.

Pricing is the second reason. Per-token costs move weekly. A model that had the best price-to-quality ratio in January is third-best by April. Without a gateway in the middle you are renegotiating with your own codebase every time you want to switch.

Capability is the third. Some engines do vision, some do tool calling well, some have a 200k context, some do not. A gateway lets you express "I need a vision-capable engine" once, and the routing layer worries about which provider can serve it today.

Vendor lock-in is the fourth. Once your application code is shaped around one provider's SDK quirks, migrating costs weeks. A thin OpenAI-shaped front door makes the provider a config detail, not an architectural decision.

The provider is a config row, not an architectural choice. Build the gateway once, swap the engines forever.

The OpenAI shape as contract

Every backend adapter in Sarmalink-AI translates to and from one canonical shape: OpenAI's chat-completions request and response. It is not because OpenAI is best; it is because it is the shape every client library, every SDK, every Postman collection, and every junior engineer already knows.

A request looks like the usual { model, messages, temperature, stream, tools }. A response looks like the usual { id, choices: [{ message, finish_reason }], usage }. The gateway accepts that, picks an engine, translates outward, and translates the response back. Callers never see Anthropic's content array or Gemini's candidates.

```ts
// lib/gateway/types.ts
export type ChatMessage = {
  role: 'system' | 'user' | 'assistant' | 'tool'
  content: string | Array<{ type: 'text' | 'image_url'; [k: string]: unknown }>
  tool_call_id?: string
}

export type ChatRequest = {
  model: string // logical model id, e.g. 'fast', 'smart', 'vision'
  messages: ChatMessage[]
  temperature?: number
  stream?: boolean
  tools?: unknown[]
  max_tokens?: number
  user?: string // org id, used by the cost guard
}

export type ChatResponse = {
  id: string
  object: 'chat.completion'
  created: number
  model: string // physical model that served the request
  engine: string // which backend served it
  choices: Array<{
    index: number
    message: { role: 'assistant'; content: string; tool_calls?: unknown[] }
    finish_reason: 'stop' | 'length' | 'tool_calls' | 'content_filter'
  }>
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number }
}
```
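
To make the "translates the response back" step concrete, here is a minimal sketch of the inbound half of one adapter. It assumes the layout of Anthropic's Messages API response (a `content` array of typed blocks, a `stop_reason`, and `usage.input_tokens` / `usage.output_tokens`). The names `AnthropicMessage`, `OpenAiShaped`, `FINISH_REASON` and `fromAnthropic` are illustrative and not taken from the Sarmalink-AI source, and tool calls are left out to keep it short.

```typescript
// Hypothetical adapter sketch: Anthropic Messages response -> OpenAI-shaped response.
type AnthropicMessage = {
  id: string
  model: string
  content: Array<{ type: string; text?: string }>
  stop_reason: 'end_turn' | 'max_tokens' | 'stop_sequence' | 'tool_use' | null
  usage: { input_tokens: number; output_tokens: number }
}

type OpenAiShaped = {
  id: string
  object: 'chat.completion'
  created: number
  model: string
  engine: string
  choices: Array<{
    index: number
    message: { role: 'assistant'; content: string }
    finish_reason: 'stop' | 'length' | 'tool_calls' | 'content_filter'
  }>
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number }
}

// Anthropic stop reasons mapped onto OpenAI finish reasons.
const FINISH_REASON: Record<string, OpenAiShaped['choices'][number]['finish_reason']> = {
  end_turn: 'stop',
  stop_sequence: 'stop',
  max_tokens: 'length',
  tool_use: 'tool_calls',
}

function fromAnthropic(msg: AnthropicMessage, nowMs: number = Date.now()): OpenAiShaped {
  // Flatten the content array: keep the text blocks and join them in order.
  const text = msg.content
    .filter(block => block.type === 'text')
    .map(block => block.text ?? '')
    .join('')
  const { input_tokens, output_tokens } = msg.usage
  return {
    id: msg.id,
    object: 'chat.completion',
    created: Math.floor(nowMs / 1000), // OpenAI timestamps are in seconds
    model: msg.model,
    engine: 'anthropic',
    choices: [{
      index: 0,
      message: { role: 'assistant', content: text },
      finish_reason: FINISH_REASON[msg.stop_reason ?? 'end_turn'] ?? 'stop',
    }],
    usage: {
      prompt_tokens: input_tokens,
      completion_tokens: output_tokens,
      total_tokens: input_tokens + output_tokens,
    },
  }
}
```

The caller only ever sees the second type; everything provider-specific stays inside the adapter.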

The engine registry

The registry is a typed list of backends. Each entry carries everything the router needs to make a decision without calling out to the provider. It lives in Postgres so it can be hot-reloaded; a 30-second in-memory cache keeps the hot path fast. The listing here inlines two entries as a constant for readability.

```ts
// lib/gateway/registry.ts
export type Engine = {
  id: string // 'groq', 'openrouter', 'anthropic', 'ollama-local', ...
  baseUrl: string
  apiKey: string | null // null for local engines
  modelMap: Record<string, string> // logical -> physical
  capabilities: Array<'chat' | 'vision' | 'tools' | 'json' | 'long_context'>
  priority: number // lower = preferred
  costPer1k: { input: number; output: number } // GBP
  maxRpm: number
  status: 'healthy' | 'degraded' | 'unhealthy'
}

// Exported because the health worker imports and updates it.
export const REGISTRY: Engine[] = [
  {
    id: 'groq',
    baseUrl: 'https://api.groq.com/openai/v1',
    apiKey: process.env.GROQ_API_KEY ?? null,
    modelMap: { fast: 'llama-3.3-70b-versatile', smart: 'llama-3.3-70b-versatile' },
    capabilities: ['chat', 'tools', 'json'],
    priority: 10,
    costPer1k: { input: 0.00059, output: 0.00079 },
    maxRpm: 30,
    status: 'healthy',
  },
  {
    id: 'anthropic',
    baseUrl: 'https://api.anthropic.com/v1',
    apiKey: process.env.ANTHROPIC_API_KEY ?? null,
    modelMap: { smart: 'claude-3-5-sonnet-latest', vision: 'claude-3-5-sonnet-latest' },
    capabilities: ['chat', 'vision', 'tools', 'json', 'long_context'],
    priority: 20,
    costPer1k: { input: 0.0024, output: 0.012 },
    maxRpm: 50,
    status: 'healthy',
  },
  // ... 12 more
]

const STATUS_RANK = { healthy: 0, degraded: 1, unhealthy: 2 } as const

export function pickEngines(req: { model: string; needs: Engine['capabilities'][number][] }) {
  return REGISTRY
    // Keyless engines are only allowed when they run locally.
    .filter(e => e.apiKey !== null || e.baseUrl.startsWith('http://localhost'))
    .filter(e => req.needs.every(c => e.capabilities.includes(c)))
    .filter(e => e.modelMap[req.model])
    .filter(e => e.status !== 'unhealthy')
    // Healthy engines first, degraded ones only as fallback, then by priority.
    .sort((a, b) => STATUS_RANK[a.status] - STATUS_RANK[b.status] || a.priority - b.priority)
}
```
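
The 30-second cache in front of Postgres is not shown in the listing, so here is one way it could look. This is a sketch under assumptions: `cachedLoader` is a hypothetical helper, the `load` callback stands in for whatever query reads the engines table, and serving a stale copy when the reload fails is a design choice of this sketch rather than something the article specifies.

```typescript
// Hypothetical sketch: a TTL cache in front of a slow registry loader.
type CachedRegistry<T> = { get: () => Promise<T> }

function cachedLoader<T>(
  load: () => Promise<T>,        // e.g. a SELECT against the engines table
  ttlMs: number = 30000,         // the 30-second window from the text
  now: () => number = Date.now,  // injectable clock, handy in tests
): CachedRegistry<T> {
  let value: T | undefined
  let fetchedAt = 0
  let inflight: Promise<T> | null = null

  return {
    async get(): Promise<T> {
      if (value !== undefined && now() - fetchedAt < ttlMs) return value
      // Share one in-flight reload between concurrent callers.
      if (inflight === null) {
        inflight = load().then(
          fresh => { value = fresh; fetchedAt = now(); inflight = null; return fresh },
          err => { inflight = null; throw err },
        )
      }
      try {
        return await inflight
      } catch (err) {
        // Serve the stale copy if the database is briefly unreachable.
        if (value !== undefined) return value
        throw err
      }
    },
  }
}
```

With this shape the hot path is a timestamp comparison, and a registry edit in Postgres reaches every gateway process within one TTL.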

Health checks

A background worker hits each engine every 30 seconds with a one-token completion. A reply in under 800 ms counts as fast, one between 800 and 3000 ms as slow, and anything slower or any non-2xx as a failure. A slow reply or a single failure drops the engine to degraded; three consecutive failures push status to unhealthy and the router stops sending traffic. One success flips an unhealthy engine back to degraded; three consecutive successes, the latest of them fast, flip it to healthy.

The router skips degraded engines for first-pick, but allows them as fallback. That keeps user latency tight without losing the engine entirely when it has a wobble.

```ts
// lib/gateway/health.ts
import { REGISTRY, type Engine } from './registry'

export async function pingEngine(e: Engine): Promise<number> {
  const t0 = Date.now()
  const headers: Record<string, string> = { 'Content-Type': 'application/json' }
  if (e.apiKey) headers.Authorization = `Bearer ${e.apiKey}` // local engines have no key
  const res = await fetch(`${e.baseUrl}/chat/completions`, {
    method: 'POST',
    headers,
    body: JSON.stringify({
      model: Object.values(e.modelMap)[0],
      messages: [{ role: 'user', content: 'ping' }],
      max_tokens: 1,
    }),
    signal: AbortSignal.timeout(4000),
  })
  if (!res.ok) throw new Error(`${e.id} ${res.status}`)
  return Date.now() - t0
}

const consecutiveFailures = new Map<string, number>()
const consecutiveSuccesses = new Map<string, number>()

export async function runHealthSweep() {
  await Promise.all(REGISTRY.map(async (e) => {
    try {
      const ms = await pingEngine(e)
      // Replies slower than 3000 ms count as failures, same as a non-2xx.
      if (ms > 3000) throw new Error(`${e.id} too slow: ${ms} ms`)
      consecutiveFailures.set(e.id, 0)
      const ok = (consecutiveSuccesses.get(e.id) ?? 0) + 1
      consecutiveSuccesses.set(e.id, ok)
      // A healthy engine stays healthy on a fast reply; any other engine
      // needs three successes in a row before it is promoted.
      e.status = ms < 800 && (e.status === 'healthy' || ok >= 3) ? 'healthy' : 'degraded'
    } catch {
      consecutiveSuccesses.set(e.id, 0)
      const fails = (consecutiveFailures.get(e.id) ?? 0) + 1
      consecutiveFailures.set(e.id, fails)
      e.status = fails >= 3 ? 'unhealthy' : 'degraded'
    }
  }))
}
```
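
The promotion and demotion rules are easier to check when they are pulled out of the worker into a pure function. The sketch below is one reading of the rules in this section, not the project's code: `nextHealth`, `HealthState` and `ProbeResult` are hypothetical names, and the 800 ms, 3000 ms and three-in-a-row thresholds are the ones quoted in the text.

```typescript
// Hypothetical pure version of the health state machine.
type EngineStatus = 'healthy' | 'degraded' | 'unhealthy'
type HealthState = { status: EngineStatus; fails: number; oks: number }
type ProbeResult = { ok: boolean; latencyMs: number }

const FAST_MS = 800    // under this, a reply is "fast"
const SLOW_MS = 3000   // over this, a reply counts as a failure
const STREAK = 3       // consecutive results needed to change tier

function nextHealth(prev: HealthState, probe: ProbeResult): HealthState {
  const failed = !probe.ok || probe.latencyMs > SLOW_MS
  if (failed) {
    const fails = prev.fails + 1
    // Under three consecutive failures the engine stays usable as a fallback.
    return { status: fails >= STREAK ? 'unhealthy' : 'degraded', fails, oks: 0 }
  }
  const oks = prev.oks + 1
  const fast = probe.latencyMs < FAST_MS
  // A healthy engine stays healthy on a fast reply; any other engine needs
  // three successes in a row, the latest of them fast, to be promoted.
  const healthy = fast && (prev.status === 'healthy' || oks >= STREAK)
  return { status: healthy ? 'healthy' : 'degraded', fails: 0, oks }
}
```

Keeping the transition pure means the worker only has to store one small state object per engine, and every edge case can be covered by a table-driven test.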

Failover policy

The failover loop is the heart of the gateway. The rules are short, deterministic, and easy to reason about.

On a 429 from the current engine, retry the next engine immediately. The user is waiting and every millisecond counts. On a 5xx, wait 200 ms before retrying; transient infrastructure errors clear within that window often enough that you avoid hammering a downstream that is already crying. On a capability mismatch (the picked engine returned "I cannot do vision"), skip to the next engine that can. On a network timeout (no headers in 8 seconds), abort the request and retry on the next engine.

The budget is four hops. After four engines fail in sequence, return the last upstream error with its status class intact, normalised to your own error codes as described under Pitfalls. Do not invent a generic 500; the caller wants to know whether it was a rate limit or a server error so they can react.

```ts
// lib/gateway/failover.ts
import type { ChatRequest, ChatResponse } from './types'
import { pickEngines } from './registry'
import { callEngine } from './adapters'
// These helpers are referenced but not shown in this playbook;
// the import paths are assumptions.
import { GatewayError, CapabilityMismatchError, TimeoutError } from './errors'
import { inferNeeds } from './needs'
import { logSuccess } from './log'

const MAX_HOPS = 4

export async function chatWithFailover(req: ChatRequest): Promise<ChatResponse> {
  const needs = inferNeeds(req)
  const candidates = pickEngines({ model: req.model, needs })
  if (candidates.length === 0) throw new GatewayError('no_engine_available', 503)

  let lastError: unknown = null
  for (let hop = 0; hop < Math.min(MAX_HOPS, candidates.length); hop++) {
    const engine = candidates[hop]
    try {
      const res = await callEngine(engine, req)
      // req.user carries the org id (see types.ts), so log it as such.
      logSuccess({ orgId: req.user, engine: engine.id, hop, res })
      return res
    } catch (err) {
      lastError = err
      // Rate limited: move to the next engine immediately.
      if (err instanceof GatewayError && err.status === 429) continue
      // Server error: short pause, then the next engine.
      if (err instanceof GatewayError && err.status >= 500) {
        await new Promise(r => setTimeout(r, 200))
        continue
      }
      if (err instanceof CapabilityMismatchError) continue
      if (err instanceof TimeoutError) continue
      throw err // 4xx that is not 429 means caller error, do not retry
    }
  }
  throw lastError
}
```
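
The loop relies on `inferNeeds`, which the article does not show. A minimal sketch of what such a helper might do, assuming it only looks at the request shape from `types.ts`: image parts imply `vision`, a non-empty `tools` array implies `tools`, and everything needs `chat`. The type names here are local stand-ins, and a real implementation would likely also detect JSON mode and long contexts.

```typescript
// Hypothetical sketch of inferNeeds: derive required capabilities from the request.
type Capability = 'chat' | 'vision' | 'tools' | 'json' | 'long_context'
type ContentPart = { type: 'text' | 'image_url'; [k: string]: unknown }
type NeedsMessage = { role: string; content: string | ContentPart[] }
type NeedsRequest = { messages: NeedsMessage[]; tools?: unknown[] }

function inferNeeds(req: NeedsRequest): Capability[] {
  const needs: Capability[] = ['chat']
  // Any image part anywhere in the conversation requires a vision engine.
  const hasImage = req.messages.some(
    m => Array.isArray(m.content) && m.content.some(part => part.type === 'image_url'),
  )
  if (hasImage) needs.push('vision')
  // A non-empty tools array requires tool calling.
  if (req.tools !== undefined && req.tools.length > 0) needs.push('tools')
  return needs
}
```

Because the result feeds straight into `pickEngines`, a text-only request never gets filtered down to the more expensive vision-capable engines.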

Intent routing

Failover answers the question "this engine died, who is next". Intent routing answers a different question: "given what the user is asking, which class of engine should we try first". It is optional, off by default, and gated by a feature flag.

The classifier is deliberately cheap. A keyword pass first (regex on common patterns), then an embedding lookup against a small reference set if the keyword pass is unsure. The output is an engine class, not an engine: code, search, voice, vision, long_context, or general. Failover then operates within that class.

```ts
// lib/gateway/intent.ts
type IntentClass = 'code' | 'search' | 'voice' | 'vision' | 'long_context' | 'general'

// Keyword pass only; the embedding fallback is not shown here.
// Patterns are checked in declaration order and the first match wins.
const KEYWORDS: Record<Exclude<IntentClass, 'general'>, RegExp> = {
  code: /\b(function|class|typescript|python|rust|refactor|stack trace|bug)\b/i,
  search: /\b(latest|today|current|news|price right now|who won)\b/i,
  voice: /\b(speak|say|voice|read out|tts)\b/i,
  vision: /\b(image|photo|screenshot|describe this picture|chart)\b/i,
  long_context: /\b(summarise this document|whole pdf|full transcript)\b/i,
}

const CLASS_TO_PREFERRED: Record<IntentClass, string[]> = {
  code: ['anthropic', 'qwen-coder', 'openrouter'],
  search: ['perplexity', 'openrouter'],
  voice: ['groq', 'openrouter'],
  vision: ['anthropic', 'gemini', 'openrouter'],
  long_context: ['gemini', 'anthropic'],
  general: ['groq', 'openrouter', 'anthropic'],
}

export function classify(text: string): IntentClass {
  for (const [cls, re] of Object.entries(KEYWORDS)) {
    if (re.test(text)) return cls as IntentClass
  }
  return 'general'
}
```
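
"Failover then operates within that class" implies a step that narrows and reorders the candidate list before the failover loop sees it. A minimal sketch of that step, assuming the preferred list is an ordered array of engine ids as in `CLASS_TO_PREFERRED`; `orderByIntent` is a hypothetical name, and falling back to the unfiltered list when no preferred engine is eligible is an assumption of this sketch.

```typescript
// Hypothetical sketch: restrict failover candidates to an intent class.
type Candidate = { id: string; priority: number }

function orderByIntent<T extends Candidate>(candidates: T[], preferred: string[]): T[] {
  // Position in the preferred list becomes the sort key.
  const rank = new Map<string, number>()
  preferred.forEach((id, i) => rank.set(id, i))

  const inClass = candidates
    .filter(c => rank.has(c.id)) // filter() copies, so the input is not mutated
    .sort((a, b) => (rank.get(a.id) as number) - (rank.get(b.id) as number))

  // If no preferred engine is currently eligible, keep the normal order
  // rather than failing the request outright.
  return inClass.length > 0 ? inClass : candidates
}
```

The failover loop itself does not change; it simply receives `orderByIntent(pickEngines(...), CLASS_TO_PREFERRED[cls])` instead of the raw candidate list.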

Streaming pass-through

Streaming is where naive gateways fall over. The contract is OpenAI Server-Sent Events: data: {...}\n\n chunks with a final data: [DONE]\n\n. Each adapter normalises whatever the provider emits (Anthropic event-stream, Gemini chunked JSON, raw OpenAI) into that shape.

The one strict rule: you may switch engines only if no chunks have been sent to the caller yet. Once a single byte reaches the user, you are committed to that engine for the rest of the response. Otherwise you risk shipping half a sentence from one model and half from another, which is worse than any error.

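```ts
// lib/gateway/stream.ts
// Node's ServerResponse has write() and end(); Express's Response extends it.
import type { ServerResponse } from 'node:http'
import type { ChatRequest } from './types'
import { pickEngines } from './registry'
// Referenced but not shown in this playbook; import paths are assumptions.
import { callEngineStream, adaptChunk, isRetryable } from './adapters'
import { GatewayError } from './errors'
import { inferNeeds } from './needs'

const MAX_HOPS = 4 // same hop budget as failover.ts

export async function streamWithFailover(req: ChatRequest, res: ServerResponse) {
  const candidates = pickEngines({ model: req.model, needs: inferNeeds(req) })
  for (let hop = 0; hop < Math.min(MAX_HOPS, candidates.length); hop++) {
    const engine = candidates[hop]
    const upstream = await callEngineStream(engine, req).catch(e => ({ error: e }))
    if ('error' in upstream) {
      if (isRetryable(upstream.error)) continue
      throw upstream.error
    }
    const reader = upstream.body.getReader()
    let firstChunkSent = false
    try {
      while (true) {
        const { value, done } = await reader.read()
        if (done) {
          res.write('data: [DONE]\n\n')
          res.end()
          return
        }
        const normalised = adaptChunk(engine.id, value)
        res.write(`data: ${JSON.stringify(normalised)}\n\n`)
        firstChunkSent = true
      }
    } catch (err) {
      if (firstChunkSent) throw err // committed, cannot switch
      // first byte never reached the user, safe to try the next engine
      continue
    }
  }
  throw new GatewayError('all_engines_failed', 502)
}
```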
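
One detail the listing glosses over is framing: a network read can end in the middle of an SSE event, so an adapter cannot assume that one chunk equals one `data:` payload. The sketch below shows a small buffering parser that only emits complete events. `createSseParser` is a hypothetical helper, and it handles just the subset of the SSE format relevant here (`data:` lines, events separated by a blank line, `\n` or `\r\n` line endings).

```typescript
// Hypothetical helper: turn arbitrary byte chunks into complete SSE data payloads.
function createSseParser(): (chunk: Uint8Array) => string[] {
  const decoder = new TextDecoder()
  let buffer = ''

  return (chunk: Uint8Array): string[] => {
    // stream: true keeps multi-byte characters that straddle two chunks intact.
    buffer = (buffer + decoder.decode(chunk, { stream: true })).replace(/\r\n/g, '\n')

    const payloads: string[] = []
    let sep = buffer.indexOf('\n\n')
    while (sep !== -1) {
      const frame = buffer.slice(0, sep)
      buffer = buffer.slice(sep + 2) // keep the unfinished tail for the next chunk
      const data = frame
        .split('\n')
        .filter(line => line.startsWith('data:'))
        .map(line => line.slice(5).replace(/^ /, '')) // drop one optional leading space
        .join('\n')
      if (data !== '') payloads.push(data)
      sep = buffer.indexOf('\n\n')
    }
    return payloads
  }
}
```

Each adapter would feed raw reads through a parser like this, then `JSON.parse` every payload except the literal `[DONE]` sentinel before normalising it.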

Cost guard

Every request decrements an org-scoped daily token budget in Postgres. The budget is a row keyed by (org_id, date). At 80 percent the response carries a soft warning header. At 100 percent the gateway returns 402 budget_exhausted immediately, before even picking an engine.

The decrement happens after the response, not before. Pre-charging is impossible because you do not know completion length yet. To stop runaway streams, the streaming loop checks remaining budget every 512 tokens and cuts the stream cleanly with a finish_reason: 'length' if it runs out.

```sql
-- supabase/migrations/010_cost_guard.sql
create table if not exists org_budget (
  org_id uuid not null,
  date date not null,
  daily_limit_tokens int not null default 1000000,
  used_tokens int not null default 0,
  primary key (org_id, date)
);

create or replace function consume_tokens(p_org uuid, p_tokens int)
returns table(remaining int, soft_warn bool)
language plpgsql as $$
declare
  v_used int;
  v_limit int;
begin
  -- Create today's row on first use; a no-op if it already exists.
  insert into org_budget(org_id, date)
  values (p_org, current_date)
  on conflict do nothing;

  -- Charge the tokens and read back the new totals in one statement.
  update org_budget
  set used_tokens = used_tokens + p_tokens
  where org_id = p_org and date = current_date
  returning used_tokens, daily_limit_tokens into v_used, v_limit;

  -- soft_warn turns true once usage reaches 80 percent of the limit.
  return query select v_limit - v_used, v_used >= v_limit * 0.8;
end $$;
```
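
The mid-stream check ("every 512 tokens") can be sketched as a small meter that the streaming loop calls once per chunk. This is an illustration under assumptions: `createStreamMeter` is a hypothetical name, `getRemaining` stands in for a query that returns the org's remaining tokens for today, and the token count per chunk is assumed to come from the adapter.

```typescript
// Hypothetical sketch: check the remaining budget every 512 streamed tokens.
const CHECK_EVERY = 512

function createStreamMeter(getRemaining: () => Promise<number>) {
  let emitted = 0              // tokens sent on this stream so far
  let nextCheckAt = CHECK_EVERY

  return async function onTokens(count: number): Promise<'continue' | 'cut'> {
    emitted += count
    if (emitted < nextCheckAt) return 'continue'
    // Skip past every threshold this chunk crossed.
    while (nextCheckAt <= emitted) nextCheckAt += CHECK_EVERY
    // The budget row is only charged after the response, so subtract what
    // this stream has already produced from what the database reports.
    const remaining = (await getRemaining()) - emitted
    return remaining > 0 ? 'continue' : 'cut'
  }
}
```

On `'cut'` the loop would send a final chunk with `finish_reason: 'length'`, then `[DONE]`, and charge the tokens actually emitted.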

Observability

Every completed request writes one row to a gateway_requests table: request_id, org_id, engine, model, tokens_in, tokens_out, latency_ms, cost_gbp, hops, status. That single table answers every operational question I have ever needed to answer: which engine is slow today, which model is burning budget, which org is hammering us.
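
As a rough illustration of how one log row could be assembled, the sketch below derives `cost_gbp` from the registry's `costPer1k` prices. The column names come from the list above; the helper names `costGbp` and `toRow`, the rounding to six decimals, and treating `status` as the HTTP status code are assumptions.

```typescript
// Hypothetical sketch: build one gateway_requests row, including its cost.
type Pricing = { input: number; output: number } // GBP per 1,000 tokens

type GatewayRequestRow = {
  request_id: string
  org_id: string
  engine: string
  model: string
  tokens_in: number
  tokens_out: number
  latency_ms: number
  cost_gbp: number
  hops: number
  status: number // assumed to hold the HTTP status returned to the caller
}

function costGbp(pricing: Pricing, tokensIn: number, tokensOut: number): number {
  const raw = (tokensIn / 1000) * pricing.input + (tokensOut / 1000) * pricing.output
  // Round to six decimal places to keep stored values tidy.
  return Math.round(raw * 1e6) / 1e6
}

function toRow(
  base: Omit<GatewayRequestRow, 'cost_gbp'>,
  pricing: Pricing,
): GatewayRequestRow {
  return { ...base, cost_gbp: costGbp(pricing, base.tokens_in, base.tokens_out) }
}
```

Storing the computed cost on the row, rather than joining against current prices later, keeps historical figures stable when a provider changes its pricing.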

Postgres is fine until you cross a few million rows per day. After that, mirror to ClickHouse through a logical replication slot and run the dashboard off ClickHouse. Grafana speaks both. The admin page in Sarmalink-AI hits Postgres directly through a materialised view refreshed every 60 seconds; that has held up at meaningful traffic without complaint.

Pitfalls

Leaking provider names in error responses

Returning "Anthropic rate limit exceeded" tells the caller which provider you use and gives competitors a free intelligence report. Normalise every error to the OpenAI error shape with your own codes (rate_limited, upstream_error, budget_exhausted).
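
A sketch of what that normalisation could look like, using the OpenAI error envelope (`{ error: { message, type, param, code } }`) and the three codes named above. The function names, the `gateway_error` type string, the message wording and the status mapping (429, 502, 402) are illustrative assumptions.

```typescript
// Hypothetical sketch: map any upstream failure onto a provider-neutral error.
type PublicCode = 'rate_limited' | 'upstream_error' | 'budget_exhausted'
type PublicError = { error: { message: string; type: string; param: null; code: PublicCode } }

const PUBLIC_MESSAGES: Record<PublicCode, string> = {
  rate_limited: 'The service is rate limited right now. Retry shortly.',
  upstream_error: 'The request could not be completed. Retry shortly.',
  budget_exhausted: 'The daily token budget for this organisation is used up.',
}

const PUBLIC_STATUS: Record<PublicCode, number> = {
  rate_limited: 429,
  upstream_error: 502,
  budget_exhausted: 402,
}

function publicError(code: PublicCode): { status: number; body: PublicError } {
  return {
    status: PUBLIC_STATUS[code],
    body: { error: { message: PUBLIC_MESSAGES[code], type: 'gateway_error', param: null, code } },
  }
}

// The upstream message is accepted for logging but never copied into the
// response, because it may name the provider.
function fromUpstream(upstreamStatus: number, _upstreamMessage: string) {
  return publicError(upstreamStatus === 429 ? 'rate_limited' : 'upstream_error')
}
```

The caller can still tell a rate limit from a server error by status and code, which is all it needs in order to decide whether to back off or retry.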

Double-billing on streaming retries

If the first engine returns 200 then dies after 50 tokens, you have already paid for those 50 tokens. Track tokens charged per engine attempt, not per request, or your monthly cost reconciliation will be 15 percent off and you will never find it.
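
In data terms, "per attempt" means the unit of accounting is one call to one engine, and the request total is a sum over attempts. A minimal sketch, with the hypothetical names `Attempt` and `tokensByEngine`:

```typescript
// Hypothetical sketch: account for tokens per engine attempt, not per request.
type Attempt = { engine: string; promptTokens: number; completionTokens: number }

function tokensByEngine(attempts: Attempt[]): Record<string, number> {
  const totals: Record<string, number> = {}
  for (const a of attempts) {
    // An attempt that died mid-stream still counts: the provider bills it.
    totals[a.engine] = (totals[a.engine] ?? 0) + a.promptTokens + a.completionTokens
  }
  return totals
}
```

For a request whose first engine died after 50 completion tokens, the record holds two attempts, and the first engine's prompt and partial completion show up in its own total instead of vanishing.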

Model-name mapping drift

Providers rename models constantly (claude-3-5-sonnet-20240620, claude-3-5-sonnet-latest, claude-3.5-sonnet). Keep the mapping in one place (the registry), version it, and add a CI check that pings each provider with each mapped name once a week.

No idempotency key on streamed retries

Without one, the caller cannot safely retry a stream that died mid-flight; they will get the prefix twice. Generate a request_id, accept it back on retry, and dedupe within a 5-minute window.
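
A minimal in-memory sketch of the dedupe window, assuming a single gateway process; with several instances the map would have to live in a shared store. `createDeduper` is a hypothetical name, and what gets stored per request_id (a finished response, or the chunks already sent) is left generic.

```typescript
// Hypothetical sketch: remember results by request_id for a fixed window.
function createDeduper<T>(
  windowMs: number = 5 * 60 * 1000, // the 5-minute window from the text
  now: () => number = Date.now,     // injectable clock, handy in tests
) {
  const seen = new Map<string, { at: number; result: T }>()

  return {
    // Returns the stored result for a recent request_id, or undefined.
    lookup(requestId: string): T | undefined {
      const hit = seen.get(requestId)
      if (hit === undefined) return undefined
      if (now() - hit.at >= windowMs) {
        seen.delete(requestId) // expired: forget it and treat as new
        return undefined
      }
      return hit.result
    },
    remember(requestId: string, result: T): void {
      seen.set(requestId, { at: now(), result })
    },
  }
}
```

On a retry the gateway calls `lookup` first; a hit means it can replay or resume from what it already holds instead of calling an engine again.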

Eating long-tail latency without a hard timeout

An engine that takes 90 seconds to first byte is functionally dead but technically alive. Set a hard headers-received timeout (8 seconds is generous) and treat exceeding it as a retryable failure for failover purposes, just like a 5xx.
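
One way to enforce a headers-received timeout is to race the request against a timer and abort it when the timer wins. The sketch below is an illustration, not the project's code: `withHeaderTimeout` and `HeaderTimeoutError` are hypothetical names. It relies on the fact that `fetch` resolves as soon as response headers arrive, so the timer stops before the body is streamed.

```typescript
// Hypothetical sketch: fail fast when an engine sends no headers in time.
class HeaderTimeoutError extends Error {
  constructor(ms: number) {
    super(`no response headers within ${ms} ms`)
    this.name = 'HeaderTimeoutError'
  }
}

function withHeaderTimeout<T>(
  start: (signal: AbortSignal) => Promise<T>, // e.g. signal => fetch(url, { signal })
  ms: number = 8000,
): Promise<T> {
  const controller = new AbortController()
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort() // cancel the underlying request
      reject(new HeaderTimeoutError(ms))
    }, ms)
    start(controller.signal).then(
      value => { clearTimeout(timer); resolve(value) }, // headers arrived in time
      err => { clearTimeout(timer); reject(err) },
    )
  })
}
```

Because the timer is cleared once headers arrive, a long but healthy stream is never cut off; only the wait for the first response is bounded.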

Wrap up

A gateway sounds like infrastructure overkill until the first outage you sail through without paging anyone. The full surface is small: a typed registry, a health worker, a 60-line failover loop, an SSE pass-through, a Postgres budget table, and one log row per request. Everything else is policy on top.

Sarmalink-AI is the open-source reference implementation; the source is on GitHub and the patterns above map one-to-one to the code. Fork it, swap the registry for your providers, and ship. The day the next provider has a bad afternoon, you will be the only thing on the internet that does not notice.

Want this done for you?

If you would rather skip the yak shave and have someone who has done this fifty times set it up properly, that is what I do for a living.
