One gateway. Seven providers.
Sub-50ms failover.
SarmaLink-AI is an open-source AI gateway I built to solve a real operational problem: depending on a single AI provider is fragile. Rate limits, regional outages, model deprecations, sudden pricing changes: any production AI workload gets bitten by one of them eventually. SarmaLink-AI routes across 14 failover engines and 7 providers with automatic sub-50ms handoff, intent-based auto-routing, persistent memory, image generation, live tools, and a cross-repo plugin system that dispatches specialised tasks to sibling projects.
Problem
The multi-provider problem showed up on three different client engagements in the same quarter. One project depended on a specific model that got deprecated with three months' notice. Another hit a Groq rate-limit cliff during a Friday afternoon load test. A third needed Gemini for image input and a different model for long-context reasoning, and I was writing two completely separate SDK integrations that had nothing to do with each other.
The deeper issue is that AI provider SDKs are not designed to be interchangeable. Different message formats, different streaming protocols, different error semantics, different tool-call conventions, different system-prompt quirks. The naive approach (wrap each one in a provider-specific function) becomes a forest of if-statements and a maintenance liability.
Beyond the failover problem, I kept rebuilding the same higher-level capabilities: intent routing to pick the right model for a given message, memory persistence across sessions, image generation, live search. And the gateway needed to dispatch specialised tasks to other tools (voice loops, multi-agent workflows, eval runners), not just forward every request to a single LLM.
- →Provider lock-in is an operational risk: rate limits, deprecations, and outages are a matter of when, not if
- →Different SDKs, different message formats, different streaming protocols per provider
- →No existing abstraction handled failover, intent routing, and plugin dispatch in one package
- →Higher-level capabilities (memory, image gen, live tools) needed a single place to live
Depending on one AI provider in production is the same kind of risk as depending on one cloud region. Eventually, it will bite you.
Approach
The core architecture is a canonical message format plus a thin stateless adapter per provider. Every adapter translates to and from the same internal format. The router walks a failover chain, opens a connection to the first step, waits for the first token, and only then commits the stream to the browser. If the connection errors or the first token never arrives, that step goes into a 60-second cooldown and the next engine in the chain takes over, invisibly to the user.
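A minimal sketch of first-token gating with a cooldown list, to make the mechanism concrete. The `Engine` type, `streamWithFailover`, and the in-memory `coolingUntil` map are hypothetical names chosen for illustration, not SarmaLink-AI's actual API; only the 60-second cooldown figure comes from the description above.

```typescript
// Hypothetical shape: an engine opens a token stream for a prompt.
type Engine = {
  id: string;
  open: (prompt: string) => AsyncIterable<string>;
};

const COOLDOWN_MS = 60_000; // the 60-second cooldown described above
const coolingUntil = new Map<string, number>(); // engine id -> timestamp (ms)

// Opens the stream and waits for the first token. Returns null on any failure.
async function firstToken(
  engine: Engine,
  prompt: string,
): Promise<{ token: string; rest: AsyncIterator<string> } | null> {
  try {
    const rest = engine.open(prompt)[Symbol.asyncIterator]();
    const first = await rest.next();
    return first.done ? null : { token: first.value, rest };
  } catch {
    return null;
  }
}

async function* streamWithFailover(
  chain: Engine[],
  prompt: string,
  now: () => number = Date.now,
): AsyncGenerator<string> {
  for (const engine of chain) {
    // Skip engines that are still cooling down after a recent failure.
    if ((coolingUntil.get(engine.id) ?? 0) > now()) continue;

    // Gate: nothing is yielded to the caller until a first token exists.
    const opened = await firstToken(engine, prompt);
    if (opened === null) {
      coolingUntil.set(engine.id, now() + COOLDOWN_MS);
      continue; // the caller has seen nothing, so the handoff is invisible
    }

    // Committed: from here on, tokens pass straight through.
    yield opened.token;
    while (true) {
      const next = await opened.rest.next();
      if (next.done) return;
      yield next.value;
    }
  }
  throw new Error("every engine in the chain failed or is cooling down");
}
```

This sketch only handles failures that occur before the first token. An error after the stream is committed propagates to the caller, which matches the constraint that a stream cannot be taken back once it has started.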
On top of the failover layer, an intent classifier reads the incoming message and routes it to the right mode, including Coder for code, Reasoner for hard problems, Live for current events, Fast for quick lookups, Vision for images, and Smart for everything else. The classifier runs in microseconds with no API calls. Users can override it; it is a default, not a verdict.
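One way a classifier can run locally with no API calls is a short list of ordered rules. The sketch below assumes regex heuristics; the keyword lists, the eight-word threshold for Fast, and the `classifyIntent` name are illustrative assumptions rather than the gateway's real rules.

```typescript
// Mode names follow the list above; everything else here is hypothetical.
type Mode = "coder" | "reasoner" | "live" | "fast" | "vision" | "smart";

interface ClassifyInput {
  text: string;
  hasImage?: boolean;
  override?: Mode; // an explicit user choice always wins
}

// Ordered rules: the first match decides. Keyword lists are illustrative only.
const RULES: Array<{ mode: Mode; pattern: RegExp }> = [
  { mode: "coder", pattern: /\b(function|class|stack trace|typescript|python|regex|compile)\b/i },
  { mode: "live", pattern: /\b(today|latest|current|news|this week|right now)\b/i },
  { mode: "reasoner", pattern: /\b(prove|derive|step by step|trade-?offs?)\b/i },
];

function classifyIntent({ text, hasImage, override }: ClassifyInput): Mode {
  if (override) return override; // a default, not a verdict
  if (hasImage) return "vision";
  for (const rule of RULES) {
    if (rule.pattern.test(text)) return rule.mode;
  }
  // Short messages with no other signal are treated as quick lookups.
  if (text.trim().split(/\s+/).length <= 8) return "fast";
  return "smart";
}
```

Because the function is synchronous and does no I/O, it adds no network latency in front of the failover chain.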
The plugin system gives the gateway a longer reach. Ten sibling repos register themselves as plugins with the intents they accept. When the gateway sees a voice intent or a multi-agent workflow intent, it can dispatch to the appropriate tool rather than forwarding to an LLM. Manus integration sits in the same layer: tasks can be dispatched to Manus with full lifecycle management (create, poll, cancel, webhook). Both are configured through environment variables, disabled by default, and opt-in per deployment.
- →Single canonical message + streaming format, thin stateless adapter per provider
- →First-token gating: failover completes before any byte reaches the browser
- →Intent classifier routes to the right mode in microseconds, zero API calls
- →Cross-repo plugin system: 10 sibling repos accessible as typed endpoints
- →Manus integration: task lifecycle (create, poll, cancel) plus webhook verification
- →Persistent memory, image generation, and live search built into the gateway layer
Solution
SarmaLink-AI ships as a Next.js application with a complete chat UI and the underlying gateway engine. The current build has 14 failover engines across 7 providers: Groq, Cerebras, SambaNova, Google Gemini, GitHub Models, Cohere, and OpenRouter. Each engine carries metadata: cost band, latency profile, and capability flags for vision, tool use, and JSON mode. The intent classifier maps every incoming message to one of seven modes and picks the appropriate failover chain automatically.
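As an illustration of how per-engine metadata can drive chain selection, the sketch below filters engines by required capability and orders the survivors by latency profile. The field names, the band values, and `buildChain` are assumptions made for the example; the source only states that cost band, latency profile, and capability flags exist.

```typescript
// Hypothetical metadata shape based on the fields named above.
interface EngineMeta {
  id: string;
  provider: string;
  costBand: "free" | "low" | "high";
  latency: "fast" | "medium" | "slow";
  capabilities: { vision: boolean; toolUse: boolean; jsonMode: boolean };
}

type Capability = keyof EngineMeta["capabilities"];

const LATENCY_RANK: Record<EngineMeta["latency"], number> = {
  fast: 0,
  medium: 1,
  slow: 2,
};

// Keep only engines that support every required capability, fastest first.
// filter() returns a new array, so the caller's list is not reordered.
function buildChain(engines: EngineMeta[], required: Capability[]): EngineMeta[] {
  return engines
    .filter((e) => required.every((c) => e.capabilities[c]))
    .sort((a, b) => LATENCY_RANK[a.latency] - LATENCY_RANK[b.latency]);
}
```

A real chain would likely weigh cost band as well; ordering by latency alone keeps the example short.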
The gateway layer goes further than routing. Persistent memory stores facts the user or assistant surfaces during a session and injects them into subsequent context. Image generation is available via a vision-capable chain. Live tools connect to grounded search for current events. The sanitiser at the trust boundary wraps user messages, tool results, and memory in explicit markers before any of it reaches a model.
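The trust-boundary idea can be sketched as a small wrapping function. The marker syntax (`<untrusted-user>` and so on) and the `wrapUntrusted` name are assumptions for illustration; the source does not specify what the markers look like.

```typescript
// Hypothetical marker format. The real gateway's markers are not documented here.
type Source = "user" | "tool" | "memory";

const MARKER = /<\/?untrusted-(?:user|tool|memory)>/gi;

// Remove any marker-like text from the payload so it cannot close or forge a
// wrapper. Repeats until stable, because removing one marker can splice the
// surrounding characters into a new one.
function stripMarkers(text: string): string {
  let current = text;
  let previous: string;
  do {
    previous = current;
    current = current.replace(MARKER, "");
  } while (current !== previous);
  return current;
}

// Wrap untrusted content in explicit markers before it is placed in a prompt.
function wrapUntrusted(source: Source, text: string): string {
  const tag = `untrusted-${source}`;
  return `<${tag}>\n${stripMarkers(text)}\n</${tag}>`;
}
```

Markers do not make injected text harmless on their own; they give the system prompt a stable way to tell the model which spans are data rather than instructions.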
The plugin system exposes ten sibling repos as typed, intent-matched endpoints. A voice request can be dispatched to voice-agent-starter. A multi-agent workflow can be handed to agent-orchestrator. An eval run can be sent to ai-eval-runner. Manus integration sits in the same dispatch layer, with full task lifecycle and webhook signature verification via HMAC-SHA256.
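For reference, HMAC-SHA256 webhook verification in Node.js typically looks like the sketch below. It assumes the sender transmits a hex-encoded signature computed over the raw request body; how Manus actually encodes and delivers its signature is not stated in the source, so the encoding and the function names here are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the hex-encoded HMAC-SHA256 of the raw body with a shared secret.
function signBody(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Verify a received signature against the raw body.
// Assumption: the signature arrives as a hex string (for example in a header).
function verifyWebhook(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(signBody(rawBody, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws if lengths differ, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The signature must be computed over the exact bytes received, before any JSON parsing, and the comparison should be constant-time, which is why `timingSafeEqual` is used instead of `===`.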
The developer surface is a single TypeScript module: pass in messages, get back a stream of tokens. No UI required. Drop it into any Next.js app, set the provider keys as environment variables, and failover-aware streaming chat is twenty lines of code.
- →14 failover engines, 7 providers, sub-50ms handoff, zero user-visible disruption
- →Intent classifier: 7 modes, zero API calls, microsecond routing decision
- →Cross-repo plugin system: 10 sibling repos, intent-matched dispatch
- →Manus integration: full task lifecycle + HMAC webhook verification
- →Persistent memory, image generation, live tools built into the gateway
- →Standalone TypeScript module, no UI, no framework lock-in
The first time a Groq endpoint went down mid-conversation and the response just kept coming, from a Cerebras model the user had never picked, the architecture proved itself.
What actually changed.
Live and open source. Used in client work and by a growing community. Has survived multiple real provider outages without user-visible disruption.
The failover architecture has paid off in production. Multiple times, a primary provider hit a rate limit or returned 5xx during live usage. The router moved to the next engine; users noticed, at most, a slightly different writing style. Zero pages, zero incidents, zero tickets attributable to provider failures.
The plugin and Manus layers have extended the gateway beyond chat. Intent-based routing now dispatches specialised tasks to the right tool automatically, which has changed how I think about building AI-powered products: the gateway is the front door, not an LLM wrapper.
Every AI feature I now build for clients starts here. Not out of habit, but because the failover model, the intent routing, and the plugin dispatch have become the default architectural shape I want for anything that needs to stay up and handle diverse request types.
What I would tell someone building this from scratch.
Design the canonical format first.
Every adapter translates to and from one internal message + token format. If the provider format leaks one level above the adapter, the architecture has already failed. The canonical types are the contract; everything else is detail.
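A sketch of what "canonical types as the contract" can look like. `CanonicalMessage` and the two adapter functions are hypothetical, and the two request shapes are simplified versions of an OpenAI-compatible chat payload and a Gemini-style `contents` payload, reduced to the fields needed to show the translation.

```typescript
// The canonical message type: the only shape the router and UI ever see.
type Role = "system" | "user" | "assistant";

interface CanonicalMessage {
  role: Role;
  content: string;
}

// Simplified OpenAI-compatible request body (assumed shape).
function toOpenAIStyle(messages: CanonicalMessage[], model: string) {
  return {
    model,
    stream: true,
    messages: messages.map(({ role, content }) => ({ role, content })),
  };
}

// Simplified Gemini-style request body (assumed shape).
interface GeminiContent {
  role: "user" | "model";
  parts: { text: string }[];
}

interface GeminiStyleRequest {
  systemInstruction?: { parts: { text: string }[] };
  contents: GeminiContent[];
}

function toGeminiStyle(messages: CanonicalMessage[]): GeminiStyleRequest {
  // System prompts are carried separately, and "assistant" is renamed "model".
  const system = messages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n");

  const request: GeminiStyleRequest = {
    contents: messages
      .filter((m) => m.role !== "system")
      .map((m): GeminiContent => ({
        role: m.role === "assistant" ? "model" : "user",
        parts: [{ text: m.content }],
      })),
  };
  if (system) request.systemInstruction = { parts: [{ text: system }] };
  return request;
}
```

Nothing above the adapters needs to know that one provider renames a role or moves the system prompt; that knowledge stays inside a single function per provider.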
Adapters should be stateless and small.
Each adapter is a stateless function of its inputs. No client objects holding state, no shared connection pools, no singleton anything. Stateless adapters are testable, swappable, and free of the subtle bugs that appear when a serverless instance is reused across many requests.
Fail over before the first token.
Once you start streaming to the browser, you cannot un-stream. The router opens the connection, waits for the first chunk, and only then commits. If the connection errors or the first chunk never arrives, it tries the next engine. Users see one continuous stream.
Intent routing is the highest-leverage feature.
Routing to the right model for the right task, not just the cheapest or fastest, is where most of the quality gain comes from. A microsecond classifier that costs nothing and gets it right most of the time is more valuable than adding another failover step.
Plugins beat monolithic capabilities.
Trying to build voice, multi-agent orchestration, and evals into one codebase is a trap. Registering those as plugin endpoints with intent matching gives the same dispatch surface with far less coupling.
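A minimal sketch of intent-matched plugin dispatch with opt-in flags. The three repo names come from the description of the plugin system earlier in this write-up; the intent strings, endpoint URLs, environment variable names, and function names are hypothetical.

```typescript
interface Plugin {
  name: string;
  intents: string[]; // intents this plugin accepts
  endpoint: string; // hypothetical URL
  enabledFlag: string; // hypothetical environment variable name
}

// Illustrative registrations. Only the repo names are taken from the text.
const KNOWN_PLUGINS: Plugin[] = [
  { name: "voice-agent-starter", intents: ["voice"], endpoint: "https://example.invalid/voice", enabledFlag: "PLUGIN_VOICE" },
  { name: "agent-orchestrator", intents: ["multi-agent"], endpoint: "https://example.invalid/agents", enabledFlag: "PLUGIN_AGENTS" },
  { name: "ai-eval-runner", intents: ["eval"], endpoint: "https://example.invalid/evals", enabledFlag: "PLUGIN_EVALS" },
];

// Build the intent index from whichever plugins this deployment has enabled.
function loadPlugins(
  all: Plugin[],
  env: Record<string, string | undefined>,
): Map<string, Plugin> {
  const byIntent = new Map<string, Plugin>();
  for (const plugin of all) {
    if (env[plugin.enabledFlag] !== "true") continue; // disabled by default
    for (const intent of plugin.intents) byIntent.set(intent, plugin);
  }
  return byIntent;
}

type Dispatch = { target: "plugin"; plugin: Plugin } | { target: "llm" };

// Route to a plugin when one accepts the intent, otherwise fall back to an LLM.
function dispatch(intent: string, plugins: Map<string, Plugin>): Dispatch {
  const plugin = plugins.get(intent);
  return plugin ? { target: "plugin", plugin } : { target: "llm" };
}
```

The gateway only holds names, intents, and endpoints, so a plugin can be rewritten or replaced without touching gateway code.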
Need something like this built?
I take one client at a time. If your problem is real and your timeline is honest, let’s talk.
Go deeper.
SarmaLink-AI
The product overview: modes, providers, plugin system, Manus integration, and getting started.
The engineering record
Architectural bets, adapter design, first-token gating, intent routing, plugin dispatch, and lessons learned.
Inside the gateway
Adapter pattern, the router, cooldown lists, intent classifier, plugin dispatch, and the audit log.
SarmaLink-AI in depth
Benchmarks, deployment guide, the seven modes, providers, security model, and cost analysis.