One assistant. Seven providers.
Zero downtime.
SarmaLink-AI is an open-source AI assistant I built to solve a problem I had: depending on a single AI provider is fragile. Rate limits, regional outages, model deprecations, sudden pricing changes — every team using LLMs has been bitten by at least one. SarmaLink-AI routes across 36 engines and 7 providers with automatic failover, streaming, and a single clean abstraction over wildly inconsistent SDKs.
Problem
I started running into the multi-provider problem on client work. One project depended heavily on a specific OpenAI model that got deprecated with three months’ notice. Another saw a Groq rate-limit cliff during a Friday afternoon load test. A third needed Gemini for image input and Claude for long-context reasoning, and I was writing two completely different SDK calls in two completely different places.
The deeper issue is that AI provider SDKs are not designed to be interchangeable. They have different message formats, different streaming protocols, different error semantics, different ways of expressing tool calls, different quirks around system prompts. The naive thing — wrap each one in a provider-specific function — turns into a forest of if-statements and a maintenance nightmare.
I also did not want a heavyweight framework. The existing options either pulled in too much (full agent frameworks for what should be a streaming chat call) or did not handle failover sensibly (you still had to write your own retry logic). I wanted a single function I could call that would just work — and fail over to another provider if the current one was unavailable.
- →Provider lock-in is a real operational risk for any production AI workload
- →Different SDKs, different message formats, different streaming protocols
- →Existing abstractions are either too heavy or do not handle failover
- →Model deprecations happen, and they happen on the provider’s schedule, not yours
Depending on one AI provider in production is the same kind of risk as depending on one cloud region. Eventually, it will bite you.
Approach
I started by writing down the lowest-common-denominator interface I actually wanted to call. A single function: pass in a list of messages, get back a stream of tokens. The provider, the model, the SDK quirks — none of that should appear at the call site.
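In sketch form, the contract looks something like this. The types and names here are illustrative, not the project's actual exports:

```ts
// Hypothetical canonical types — illustrative, not the real module's exports.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// One call site: messages in, an async stream of text tokens out.
// Which provider or model serves the request is the router's business.
declare function chat(
  messages: ChatMessage[],
  options?: { signal?: AbortSignal }
): AsyncIterable<string>;

// Consuming the stream looks the same no matter which provider answered.
async function demo() {
  let reply = "";
  for await (const token of chat([{ role: "user", content: "Hello" }])) {
    reply += token;
  }
  return reply;
}
```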
From there I worked backwards. Each provider got an adapter: a thin module that translates the canonical message format into whatever the provider’s SDK wants, and translates the streaming response back into a canonical token stream. Adapters are pure, stateless, and deliberately small. Adding a new provider is two hundred lines of code, not two thousand.
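A rough sketch of what an adapter reduces to. The endpoint URL and the response parsing below are simplified assumptions to show the shape, not the repository's code:

```ts
// Hypothetical adapter: canonical messages in, canonical tokens out.
// Everything provider-specific lives inside this one pure, stateless function.
interface AdapterInput {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  model: string;
  apiKey: string;
  signal?: AbortSignal;
}

async function* openAICompatibleAdapter(input: AdapterInput): AsyncGenerator<string> {
  // fetch-only, so the adapter stays edge-runtime friendly.
  const res = await fetch("https://api.example.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${input.apiKey}`,
    },
    body: JSON.stringify({ model: input.model, messages: input.messages, stream: true }),
    signal: input.signal,
  });
  if (!res.ok || !res.body) {
    throw new Error(`provider error: ${res.status}`);
  }

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Real SSE parsing has to handle chunk boundaries; this is deliberately naive.
    for (const line of value.split("\n")) {
      if (line.startsWith("data: ") && !line.includes("[DONE]")) {
        const delta = JSON.parse(line.slice(6)).choices?.[0]?.delta?.content;
        if (delta) yield delta;
      }
    }
  }
}
```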
The router sits on top of the adapters. It takes a routing policy — primary engine, fallback chain, retry budget — and walks through the chain on failure. Failure means “429”, “503”, “invalid model”, or “timeout”. The router does not retry on user-error responses (400 with a malformed prompt is your fault, not the provider’s). Failover happens before the first token is emitted to the client, so users see no glitch — just a slightly slower first response on the rare occasions it triggers.
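The policy itself stays a small, declarative object. Something along these lines, with hypothetical field names and engine ids:

```ts
// Hypothetical routing policy: declared per request, interpreted by the router.
interface RoutingPolicy {
  primary: string;      // engine id, e.g. "groq/llama-3.1-70b" (illustrative)
  fallbacks: string[];  // walked in order when the primary fails retryably
  retryBudget: number;  // max failover attempts across the whole chain
  timeoutMs: number;    // per-attempt ceiling before the next engine is tried
}

const policy: RoutingPolicy = {
  primary: "groq/llama-3.1-70b",
  fallbacks: ["cerebras/llama-3.1-70b", "deepseek/deepseek-chat"],
  retryBudget: 2,
  timeoutMs: 8_000,
};
```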
I made the deliberate choice to run the chat endpoint on the Vercel Edge Runtime. The cold-start cost of Node functions was visible in the time-to-first-token. Edge gets you sub-100ms cold starts, and the streaming primitives (ReadableStream, server-sent events) are first-class. The trade-off — no Node-only dependencies — was easy to live with because every adapter only uses fetch.
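In a Next.js App Router handler, opting into the Edge runtime is a one-line export. A minimal sketch, with an illustrative route and a throwaway stream just to show the SSE wiring:

```ts
// app/api/chat/route.ts — runs on the Edge runtime.
export const runtime = "edge";

export async function POST(_req: Request): Promise<Response> {
  // ReadableStream and server-sent events are first-class here.
  const stream = new ReadableStream<Uint8Array>({
    start(controller) {
      controller.enqueue(new TextEncoder().encode("data: hello\n\n"));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}
```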
- →Single canonical message + streaming format, with thin per-provider adapters
- →Stateless adapters: all routing state lives in the request, not the adapter
- →Routing policy declares primary + fallback chain + retry budget per request
- →Failover before first token: user sees one continuous stream
- →Edge runtime for sub-100ms cold starts and native streaming
- →Open source, MIT licensed, deliberately small surface area
Solution
SarmaLink-AI ships as a Next.js application with a complete chat UI plus the underlying routing engine. The current build supports 36 engines across 7 providers, including DeepSeek, Groq, Cerebras, SambaNova, Google Gemini, and OpenAI-compatible endpoints. Each engine is registered with metadata: cost band, latency profile, context window, capability flags (vision, tool use, JSON mode).
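A sketch of the kind of metadata a registry entry might carry; the field names are assumptions:

```ts
// Hypothetical registry entry: metadata the router can select against.
interface EngineEntry {
  id: string;            // e.g. "gemini/gemini-1.5-flash" (illustrative)
  provider: string;
  contextWindow: number; // tokens
  costBand: "low" | "medium" | "high";
  latency: "fast" | "standard";
  capabilities: {
    vision: boolean;
    toolUse: boolean;
    jsonMode: boolean;
  };
}

const engines: EngineEntry[] = [
  {
    id: "gemini/gemini-1.5-flash",
    provider: "google",
    contextWindow: 1_000_000,
    costBand: "low",
    latency: "fast",
    capabilities: { vision: true, toolUse: true, jsonMode: true },
  },
];
```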
The routing engine is a small state machine. Given a request, it picks an engine based on the routing policy, opens a streaming connection through the appropriate adapter, and forwards tokens to the client over server-sent events. If the adapter throws a retryable error, the failed engine is moved to a cooldown list, the next fallback is selected, and the stream continues — invisibly to the consumer of the SSE channel.
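The core of that loop is small enough to sketch. Assuming the policy and adapter shapes sketched earlier, a simplified version might look like this:

```ts
// Simplified failover loop — illustrative, not the project's actual router.
type TokenStream = AsyncIterable<string>;
type OpenStream = (engineId: string) => Promise<TokenStream>;

const cooldown = new Map<string, number>(); // engine id -> epoch ms until usable again
const COOLDOWN_MS = 60_000;

async function route(
  chain: string[],        // primary first, then fallbacks
  openStream: OpenStream
): Promise<TokenStream> {
  let lastError: unknown;
  for (const engineId of chain) {
    const until = cooldown.get(engineId) ?? 0;
    if (Date.now() < until) continue; // still cooling down, skip it
    try {
      // openStream resolves only once the first chunk has arrived (first-token
      // gating), so committing here cannot leave the client a half-broken stream.
      return await openStream(engineId);
    } catch (err) {
      // A fuller version would rethrow non-retryable errors immediately.
      lastError = err;
      cooldown.set(engineId, Date.now() + COOLDOWN_MS); // park the failed engine
    }
  }
  throw lastError ?? new Error("no engine available");
}
```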
The chat UI is deliberately understated. Sidebar of conversations, model picker, message stream, and that is mostly it. Every conversation is stored in Supabase Postgres with full message history. The user can switch models mid-conversation, and because every message is canonical, the conversation continues seamlessly on a different provider.
There is also a developer surface — a single TypeScript module exporting one function — that you can use without the UI at all. Drop it into another Next.js app, point it at your provider keys, and you have failover-aware streaming chat in twenty lines of code. That is the thing I personally use most.
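A hypothetical embedding sketch; the package name, export, and option names below are assumptions, not the published API:

```ts
// app/api/chat/route.ts — hypothetical embedding; names are illustrative.
export const runtime = "edge";

// Assumed export: one function that takes messages + a policy and returns a
// streaming SSE Response, failing over between providers as needed.
import { streamChat } from "sarmalink-ai"; // hypothetical package name

export async function POST(req: Request): Promise<Response> {
  const { messages } = await req.json();
  return streamChat(messages, {
    primary: "groq/llama-3.1-70b",
    fallbacks: ["cerebras/llama-3.1-70b"],
    apiKeys: {
      groq: process.env.GROQ_API_KEY!,
      cerebras: process.env.CEREBRAS_API_KEY!,
    },
  });
}
```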
- →36 engines registered with capability metadata across 7 providers
- →Stateless adapter per provider, ~200 lines each
- →Streaming-first router with cooldown-list-based failover
- →Edge runtime + server-sent events end-to-end
- →Full chat UI with conversation history in Postgres
- →Standalone module for embedding in other Next.js apps
The first time a Groq endpoint went down mid-conversation and the response just kept coming — from a Cerebras model the user had never picked — I knew the architecture was right.
What actually changed.
Live and open source. Used in my own client work and by a small but growing community of contributors. Has survived multiple real provider outages without user-visible disruption.
The architectural bet has paid off in production. Multiple times, a primary provider has hit a rate limit or returned 5xx responses during real usage, and the router has failed over to a secondary engine without the user noticing anything beyond a slightly different writing style. Zero pages, zero incidents, zero tickets.
The codebase has stayed small — the entire router plus all seven provider adapters fits in well under three thousand lines of TypeScript. New providers slot in without touching existing code, which is the whole point. When a community contributor wanted to add a new provider, they did it in an afternoon.
Personally, I now build every AI feature for clients on top of SarmaLink-AI. Not because it is mine, but because the failover model has changed how I think about provider risk. I never deploy a single-provider integration anymore. The architectural pattern is the deliverable; the open-source project is just the reference implementation.
Tools, picked deliberately.
What I would tell someone building this from scratch.
Design the canonical format first.
Every adapter translates to and from one internal message + token format. If you let the provider format leak into your application code, you have already lost. Spend the first day on the canonical format; everything else follows.
Adapters should be stateless and small.
Each adapter is a pure function. No client objects holding state, no shared connection pools, no singleton anything. Stateless adapters are testable, swappable, and free of subtle bugs.
Fail over before the first token.
If you start streaming and then a provider dies, you cannot un-stream. So the router opens the connection, waits for the first chunk, and only then commits to that engine. If the connection fails before the first chunk arrives, it tries the next engine. Users see one stream.
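A sketch of that first-token gate; illustrative, not the actual implementation:

```ts
// Commit to an engine only after its first chunk arrives, so the client
// never sees a stream that dies immediately.
async function openGated(
  start: () => AsyncGenerator<string>
): Promise<AsyncGenerator<string>> {
  const source = start();
  const first = await source.next(); // this is where most failures surface
  if (first.done) throw new Error("stream ended before producing any tokens");

  // Re-yield the first chunk, then pass the rest of the stream through untouched.
  async function* gated() {
    yield first.value;
    yield* source;
  }
  return gated();
}
```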
Edge runtime is worth the constraints.
No Node-only dependencies, fetch-only HTTP, smaller bundle limits. In return: sub-100ms cold starts, instant geographic distribution, native streaming. For an AI proxy, this is the right trade.
Retry only what is retryable.
A 400 from one provider will be a 400 from the next. Retry budgets are for transient errors — 429, 503, network timeouts. Treat user-error responses as terminal and surface them immediately.
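In code, that distinction can be a single predicate. A sketch, with the real classification presumably richer:

```ts
// Illustrative classification: only transient failures earn a failover attempt.
function isRetryable(err: { status?: number }): boolean {
  if (err.status === 429) return true;                            // rate limited
  if (err.status !== undefined && err.status >= 500) return true; // provider-side failure
  if (err.status === undefined) return true;                      // network error or timeout, no HTTP status at all
  return false; // any other 4xx is the caller's problem — surface it immediately
}
```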
Need something like this built?
I take one client at a time. If your problem is real and your timeline is honest, let’s talk.
Go deeper.
The engineering record
Architectural bets, adapter design, first-token gating, edge-runtime trade-offs, and the lessons learned.
Inside the backend
Adapter pattern, the router, cooldown lists, the audit log, and the trust-boundary sanitiser explained.
SarmaLink-AI in detail
Benchmarks, deployment guide, the six modes, the seven providers, security model, and cost analysis.