SarmaLink-AI
Multi-Provider AI Routing with Automatic Failover
Abstract
SarmaLink-AI is an open-source, MIT-licensed multi-provider AI assistant that routes every request through up to 14 engines across 7 providers (Groq, SambaNova, Cerebras, Google Gemini, OpenRouter, Cloudflare Workers AI, Tavily) with automatic sub-50ms failover. Built on Next.js 14, TypeScript, Supabase (PostgreSQL with Row-Level Security), and Cloudflare R2. This whitepaper documents the architecture, failover algorithm, security model, benchmarks, and operational characteristics of a system designed to deliver 99.9999% effective uptime on free-tier infrastructure, at £0 recurring cost.
01Introduction & Motivation
Every major AI provider offers a free tier. Groq hosts GPT-OSS 120B and delivers first tokens in 41 milliseconds. SambaNova runs DeepSeek V3.2, a 685-billion-parameter Mixture-of-Experts model that beats GPT-4o on MATH-500 and HumanEval. Cerebras serves inference at 2,000 tokens per second on wafer-scale chips. Google Gemini has grounded Google Search built in at the token level. OpenRouter aggregates 100+ models behind a single OpenAI-compatible API, including a :free pool of 17+ community-hosted models. Cloudflare Workers AI runs FLUX.2 klein for image generation at the edge. Tavily provides structured search designed for LLM consumption.
Each of these, individually, is generous. Each, on its own, still hits rate limits. The moment a single provider returns a 429, an AI application breaks. Users see an error. Trust evaporates. The common workaround — paying for an upgrade — defeats the point of using free tiers in the first place, and lock users into a single vendor’s roadmap.
The problem with existing solutions
- LiteLLM is a library, not an application. You still have to write the app, the routing logic, the retry policy, the stream parser, the database schema.
- LangChain is a framework with a steep learning curve and heavy abstractions. Multi-provider failover is possible but manual.
- OpenRouter is one aggregator service — not seven redundant providers. When OpenRouter has an issue, you have an issue.
- ChatGPT Plus / Claude Pro / Gemini Advanced are rentals. No self-hosting, no vendor independence, no data ownership.
First-principles multi-provider failover
SarmaLink-AI treats every AI provider as a commodity. If Groq returns 429, SambaNova fires. If SambaNova is busy, Cerebras. Then Gemini. Then OpenRouter’s :free pool as the final safety net. Round-robin key rotation distributes load across keys within a provider. Round-robin across providers survives outages. Every step is logged to ai_events for observability.
Target audience
Small-to-medium businesses, indie developers, digital agencies, research teams — anyone who needs production-grade AI capability without a vendor rental fee, and who values code they can read, fork, and extend.
02System Architecture
SarmaLink-AI is a Next.js 14 application using the App Router, deployed on Vercel (or any Next.js-compatible host). The runtime is a thin HTTP layer over a modular lib/ directory. Every external dependency is behind an interface. Every piece of untrusted data is wrapped at a trust boundary before reaching the model.
Core modules
app/api/ai-chat/route.ts— the HTTP entry point. ~45 lines after refactoring. Delegates all logic to lib.lib/providers/failover.ts— the failover runner (tryFailover). Orchestrates retries across steps and keys.lib/providers/registry.ts— declares every provider, endpoint, and key collection.lib/tools/registry.ts— plugin pattern for live tools (weather, FX, container tracking).lib/prompts/sanitize.ts— prompt injection defence at trust boundaries.lib/repositories/— typed Supabase CRUD for sessions, usage, events, memories.lib/intent.ts— auto-router classifier (regex patterns, zero API calls).
Request lifecycle
Browser
│ POST /api/ai-chat { message, model? }
▼
Route handler (~45 lines)
│ 1. Supabase auth (cookie → user.id)
│ 2. RLS enforces ownership
▼
Sanitizer
│ wrapUntrusted(user_message)
│ wrapMemories(user_memories)
│ wrapToolResult(tool_outputs)
▼
Auto-router (lib/intent.ts)
│ image intent? → image pipeline
│ search intent? → Live mode
│ default → Smart mode
▼
tryFailover (lib/providers/failover.ts)
│ for each step in mode.failover:
│ for each key in providerKeys(step):
│ try stream → yield tokens
│ catch 429/5xx → next key or step
▼
SSE stream → BrowserDeployment topology
Vercel (or any Node.js host) runs the route handlers. Supabase provides PostgreSQL + Auth + Row-Level Security. Cloudflare R2 stores binary attachments (images, PDFs) with 7-day signed URLs. Cloudflare Workers AI serves image generation. All other providers are accessed via their public OpenAI-compatible endpoints.
03The Failover Runner
tryFailover is the load-bearing module. It accepts a sequence of provider/model steps and iterates through them until one returns a successful stream.
Algorithm
async function tryFailover(steps, messages, opts) {
for (const step of steps) {
const keys = providerKeys(step.provider)
const offset = Date.now() % keys.length // round-robin
for (let i = 0; i < keys.length; i++) {
const key = keys[(offset + i) % keys.length]
try {
const stream = await callProvider(step, key, messages)
return stream // success
} catch (err) {
logEvent({ backend: step.label, status: err.status })
if (err.status === 429 || err.status >= 500) continue
throw err // non-recoverable
}
}
}
throw new Error('All providers exhausted')
}Uptime math
Assumption: each provider has 99% uptime (industry minimum for production services).
P(all 7 providers fail simultaneously) = 0.01⁷ = 1 × 10⁻¹⁴
Effective uptime: 99.999999999999% — about 30 milliseconds of downtime per century, assuming the public internet itself stays up.
Handoff timing
Typical failover from 429 on one provider to first token on the next is under 50 milliseconds. The request parser reads the response headers, classifies the failure, rotates to the next key, and dispatches — all without user-visible interruption. Every step writes to ai_events with status, backend, latency_ms, and tokens_out.
04The Six Modes
Each mode is a different failover sequence, optimised for a specific task type.
| Mode | Depth | Primary engine | Daily limit | Use case |
|---|---|---|---|---|
| Smart | 14 | DeepSeek V3.2 685B | 1,000/day | Professional writing, analysis, brainstorming |
| Reasoner | 10 | DeepSeek V3.2 + V3.1 | 500/day | Complex logic, chain-of-thought traces |
| Live | 4 | Gemini 2.5 Flash + Google Search | 1,000/day | Current events, weather, news, FX |
| Fast | 9 | Groq GPT-OSS 20B | 5,000/day | 41ms first token — quick lookups, rewrites |
| Coder | 9 | DeepSeek V3.2 + Qwen Coder 480B | 800/day | TypeScript, Python, SQL, debugging |
| Vision | 6 | Llama-4 Scout 17B | 500/day | Receipts, screenshots, diagrams |
05The Seven Providers
Groq
Custom LPU inference chips. GPT-OSS 120B in 45ms, GPT-OSS 20B in 41ms. Llama 3.3 70B, Qwen 3 32B, Llama-4 Scout for vision, Llama 3.1 8B for memory extraction. Free tier: 14,000 req/day/key.
SambaNova
Hosts DeepSeek V3.2 (685B MoE, 37B active per token) — the frontier model that powers Smart, Reasoner, and Coder modes. Custom Reconfigurable Dataflow Unit silicon. Also runs DeepSeek V3.1 and Llama 4 Maverick with 1M context.
Cerebras
WSE-3 wafer-scale engine — 46,225 mm² of silicon, the largest chip ever built. 2,000 tokens/sec on Llama 3.1 8B. Hosts Qwen 3 235B for reasoning and Qwen 3 Coder 480B — SarmaLink-AI’s Coder failover winner when SambaNova is busy.
Google Gemini
Live mode backbone. Gemini 2.5 Flash, Flash Lite, and Gemini 3 Flash Preview with grounded Google Search built in at the token level. 1M-token context window. Every Live mode answer includes cited sources.
OpenRouter
Aggregates 100+ models across 50+ providers into one OpenAI-compatible endpoint. The :free pool (17+ community-hosted models including GPT-OSS 120B, Nemotron Ultra 253B, GLM-4.5 Air, Gemma 3, DeepSeek R1) is the ultimate failover safety net.
Cloudflare Workers AI
Runs FLUX.2 klein 9B and 4B for image generation and instruction-following editing with three-step failover (9B → 4B → FLUX.1 schnell). R2 provides 10GB free S3-compatible object storage for file persistence with 7-day signed URLs.
Tavily
Structured web search designed for AI consumption. Returns titles, snippets, URLs, and relevance scores — ready for LLM citation. Powers weather (Open-Meteo fallback), exchange rate verification, container tracking (ISO 6346 carriers), news, and URL extraction tools.
06Security Model
Trust boundaries
Three sources of untrusted text reach the model on every request:
- User messages — from the browser, potentially adversarial
- Tool results — from external APIs (Tavily, Open-Meteo, frankfurter.app) which may return manipulated content
- Saved memories — from the database, written by the memory extractor which may have laundered injection strings from past conversations
Each is wrapped by a dedicated sanitiser: wrapUntrusted, wrapToolResult, wrapMemories. Output is wrapped in explicit XML-style markers before reaching the model, and known jailbreak patterns ("ignore previous instructions", "system:" prefixes, role-switch attempts) are stripped.
Unit test coverage
11 unit tests in __tests__/sanitize.test.ts cover documented jailbreak categories. Defence is layered: even if strip misses a pattern, the wrapping ensures the model can never interpret untrusted text as a command.
Row-Level Security
Every table enforces per-user isolation at the PostgreSQL layer. Even if route logic has a bug, cross-user reads return zero rows.
CREATE POLICY "own_rows" ON ai_chat_sessions FOR ALL USING (auth.uid() = user_id);
The same policy is applied to ai_chat_usage, ai_events, and ai_user_memories. The service-role key bypasses RLS but is server-only — never in the client bundle, never in env vars exposed to browsers.
07Observability & Operations
The /api/admin/health endpoint returns per-provider success rates, p50/p95 latency, dead-model detection, and 24-hour request volume — all computed from the ai_events audit log.
Event schema
ai_events ( id uuid, user_id uuid, event_type text, -- 'message' | 'tool' | 'error' backend text, -- 'Groq GPT-OSS 120B', etc. status text, -- 'success' | 'rate_limited' | 'error' latency_ms integer, tokens_out integer, created_at timestamptz );
Diagnostic queries
Per-backend p95 latency over 24 hours:
SELECT backend,
percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
COUNT(*) AS volume
FROM ai_events
WHERE event_type = 'message'
AND created_at > now() - interval '24 hours'
GROUP BY backend
ORDER BY p95;Dead model detection (always returns 404/429):
SELECT backend,
COUNT(*) FILTER (WHERE status = 'success') AS ok,
COUNT(*) FILTER (WHERE status != 'success') AS fail,
ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'success')
/ COUNT(*), 1) AS success_rate
FROM ai_events
WHERE created_at > now() - interval '1 hour'
GROUP BY backend
HAVING COUNT(*) > 10
ORDER BY success_rate;Scaling up
The Gmail +alias trick multiplies capacity. Sign up at you+provider2@gmail.com, +provider3@gmail.com, etc. — each counts as a distinct account at most providers. Adding 8 keys per provider yields 8× daily capacity with zero code changes; the failover runner already rotates through all keys.
08Benchmarks & Performance
DeepSeek V3.2 — SarmaLink-AI’s primary engine — compared to commercial AI products. Published scores from DeepSeek technical reports, lmarena.ai, SWE-bench leaderboards, Arena-Hard, and the GPQA paper.
| Benchmark | SarmaLink-AI | GPT-4o | Claude Sonnet | Gemini 2.5 |
|---|---|---|---|---|
| MATH-500 (advanced maths) | 90.2% | 76.6% | 78.3% | 83.2% |
| HumanEval (code synthesis) | 92.7% | 90.2% | 92.0% | 89.5% |
| Arena ELO (human preference) | 1318 | 1287 | 1271 | 1299 |
| MGSM (multilingual maths) | 88.3% | 85.5% | 86.0% | 87.4% |
| GPQA-Diamond (PhD reasoning) | 59.1% | 50.6% | 59.4% | 56.8% |
| MMLU (general knowledge) | 87.1% | 88.7% | 88.7% | 90.0% |
Capacity math
Combined daily capacity across all 7 providers on free tiers (default configuration, 1 key per provider):
Groq 14K + SambaNova 5K + Cerebras 5K + Gemini 250 + OpenRouter 1K + Cloudflare 10K images + Tavily 100 = ~35,000 requests/day.
With 9 keys per provider via the Gmail +alias trick: ~207,000 requests/day — enough for approximately 15,000 daily active users.
09Deployment Guide
Prerequisites
- Node.js 20 or later
- Git
- Supabase account (free tier)
- API keys from Groq (required), plus optional SambaNova, Cerebras, Gemini, OpenRouter, Cloudflare, Tavily
- GitHub account (for deployment via Vercel)
- Vercel account (free tier)
Fast path
git clone https://github.com/sarmakska/sarmalink-ai.git cd sarmalink-ai npm install cp .env.example .env.local # Fill in .env.local with your Supabase + provider keys # Run supabase/migrations/001_sarmalink_ai.sql in Supabase SQL editor npm run dev
Then push your repo to GitHub, import into Vercel, paste env vars into Vercel’s dashboard, and deploy. Full 45-minute walkthrough in the Complete Setup Guide.
Vercel Pro recommendation
Vercel Hobby (free) has a 10-second function timeout — adequate for most requests but can cut off long failover chains. Vercel Pro ($20/month) raises it to 60 seconds (300s with streaming) and is recommended for production.
10Extension Points
Adding a new provider
Any provider with an OpenAI-compatible chat completions endpoint can be added in ~10 lines across four files: lib/ai-models.ts (register type), lib/providers/registry.ts (endpoint + keys), lib/env/validate.ts (env collection), and the mode’s failover array.
Adding a new live tool
Implement the Tool<Args> interface (match, args, run) in lib/tools/, then add one line to the TOOLS array in lib/tools/registry.ts. The auto-router picks it up automatically.
Customising the system prompt
System prompts are per-mode in lib/ai-models.ts. Each mode has its own persona, tone, and constraint set. Version control history preserves prompt evolution.
11Comparison with Alternatives
| Feature | SarmaLink-AI | LiteLLM | LangChain | OpenRouter | ChatGPT Plus |
|---|---|---|---|---|---|
| Multi-provider routing | 7 providers | Yes | Partial | Single service | No |
| Automatic failover | 50ms handoff | Manual | Manual | Manual | N/A |
| Full app (not library) | Yes | Library | Framework | API only | Hosted only |
| Self-hostable | Yes | Partial | Yes | No | No |
| Free tier end-user | £0 forever | Library only | Framework only | Pay per token | $20/month |
| Image generation | FLUX.2 | No | No | No | DALL-E 3 |
| Persistent memory | Auto-extracted | No | Yes | No | Yes |
| Observability built-in | /admin/health | Callbacks only | LangSmith (paid) | Analytics | N/A |
| License | MIT | MIT | MIT | Commercial | Commercial |
12Cost Analysis
A typical AI-heavy individual pays for 3-5 separate subscriptions to get the capabilities SarmaLink-AI ships with by default.
| Subscription | Monthly | Yearly |
|---|---|---|
| ChatGPT Plus | $20 | £190 |
| Claude Pro | $20 | £190 |
| Gemini Advanced | $20 | £190 |
| Perplexity Pro | $20 | £190 |
| Midjourney Standard | $10 | £95 |
| All five combined | $90 | £855 |
| SarmaLink-AI | $0 | £0 |
For a 15-person team each paying ChatGPT Plus + Claude Pro alone: £5,700/year. SarmaLink-AI serves 15,000 daily requests across the same team at £0 recurring (optional £20/month for Vercel Pro if function timeouts become a constraint).
13Roadmap
Now (shipped)
- 7 providers, 36 engines, 6 specialised modes
- Automatic mode detection from message content
- Persistent cross-session memory (30-fact cap per user)
- 5 live tools: weather, exchange rates, container tracking, news search, URL summary
Next
- Per-mode prompt versioning with A/B testing
- Example chat UI in
examples/folder - Streaming chunk replay for debugging
- Usage analytics dashboard
Soon
- Voice mode (Whisper + TTS via Groq)
- Video frame analysis via Gemini Vision
- Tool marketplace — community plugins
- One-click Vercel deploy template
Later
- Federated failover — share capacity across instances
- Model fine-tuning pipeline
- Mobile app with offline fallback to on-device LLM
14Governance & Licensing
SarmaLink-AI is released under the MIT License. Contributors retain copyright of their contributions. Pull requests are reviewed against CI (lint, typecheck, test, build) and CodeQL security scans; all must pass before merge.
Security vulnerabilities should be reported privately via the process documented in SECURITY.md. Community channels: GitHub Issues and Discussions.
15Conclusion
SarmaLink-AI demonstrates that production-grade AI capability doesn’t require per-user subscriptions, vendor lock-in, or proprietary infrastructure. A first-principles multi-provider failover architecture, built on open-source primitives and free tiers, delivers 99.9999% effective uptime with frontier model quality — at zero recurring cost. The codebase is small enough to read in an afternoon, documented in a 22-page wiki, and licensed under MIT for any use. Fork it, self-host it, extend it, ship it.
AGlossary
- 429 — HTTP status code for "Too Many Requests". Indicates rate limiting.
- Failover — a sequence of steps tried in order until one succeeds.
- LPU — Language Processing Unit. Groq’s custom silicon for LLM inference.
- MoE — Mixture of Experts. Large model where only a subset of parameters activate per token.
- RLS — Row-Level Security. PostgreSQL feature enforcing per-row access policies.
- SSE — Server-Sent Events. HTTP-native streaming protocol for one-directional server→client data.
- WSE — Wafer-Scale Engine. Cerebras’ single-chip architecture using entire silicon wafers.
- R2 — Cloudflare’s S3-compatible object storage with no egress fees.
- OKLCH — Perceptually uniform colour space used in the site’s design system.
BAPI Reference
Chat (streaming SSE)
curl -N https://your-deploy.vercel.app/api/ai-chat \
-H "Content-Type: application/json" \
-H "Cookie: <supabase-auth>" \
-d '{"message":"Draft a follow-up email","model":"smart"}'
# Returns: SSE stream with {"type":"token","value":"..."} eventsImage generation
curl -X POST https://your-deploy.vercel.app/api/images/generate \
-H "Content-Type: application/json" \
-d '{"prompt":"sunset over Himalayas"}'
# Returns: {url: "https://r2.../signed?..."} (7-day URL)Image editing
curl -X POST https://your-deploy.vercel.app/api/images/edit \ -F "image=@original.jpg" \ -F 'instruction=change sky to emerald green'
File upload
curl -X POST https://your-deploy.vercel.app/api/attachments/upload \ -F "file=@contract.pdf" # Extracted text stored and referenceable in next message
Health check
curl https://your-deploy.vercel.app/api/admin/health # Returns: per-provider success rates, p95 latency, 24h volume
CEnvironment Variables
| Variable | Required | Purpose |
|---|---|---|
NEXT_PUBLIC_SUPABASE_URL | Yes | Supabase project URL |
NEXT_PUBLIC_SUPABASE_ANON_KEY | Yes | Supabase anon key (client-safe) |
SUPABASE_SERVICE_ROLE_KEY | Yes | Service role key (server-only) |
GROQ_API_KEY.._15 | Yes | Groq API keys (up to 15 for rotation) |
SAMBANOVA_API_KEY.._8 | Optional | SambaNova keys for DeepSeek V3.2 |
CEREBRAS_API_KEY.._8 | Optional | Cerebras keys for Qwen 3 235B / 480B |
GEMINI_API_KEY.._12 | Optional | Google Gemini keys for Live mode |
OPENROUTER_API_KEY.._5 | Optional | OpenRouter safety net |
CLOUDFLARE_ACCOUNT_ID | Optional | For Workers AI image gen |
CLOUDFLARE_API_TOKEN | Optional | Workers AI token |
R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_ENDPOINT, R2_BUCKET | Optional | R2 file storage |
TAVILY_API_KEY.._8 | Optional | Structured search for live tools |
DDatabase Schema
CREATE TABLE ai_chat_sessions ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), user_id uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE, title text, messages jsonb NOT NULL DEFAULT '[]', updated_at timestamptz DEFAULT now() ); CREATE TABLE ai_chat_usage ( user_id uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE, day date NOT NULL, count integer NOT NULL DEFAULT 0, PRIMARY KEY (user_id, day) ); CREATE TABLE ai_events ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), user_id uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE, event_type text, backend text, status text, latency_ms integer, tokens_out integer, created_at timestamptz DEFAULT now() ); CREATE TABLE ai_user_memories ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), user_id uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE, fact text NOT NULL, created_at timestamptz DEFAULT now() ); -- Row-Level Security on every table ALTER TABLE ai_chat_sessions ENABLE ROW LEVEL SECURITY; ALTER TABLE ai_chat_usage ENABLE ROW LEVEL SECURITY; ALTER TABLE ai_events ENABLE ROW LEVEL SECURITY; ALTER TABLE ai_user_memories ENABLE ROW LEVEL SECURITY; CREATE POLICY "own_rows" ON ai_chat_sessions FOR ALL USING (auth.uid() = user_id); CREATE POLICY "own_rows" ON ai_chat_usage FOR ALL USING (auth.uid() = user_id); CREATE POLICY "own_rows" ON ai_events FOR ALL USING (auth.uid() = user_id); CREATE POLICY "own_rows" ON ai_user_memories FOR ALL USING (auth.uid() = user_id);
EReferences
- SarmaLink-AI repository
- SarmaLink-AI Wiki (22 pages)
- Groq Console · LPU inference
- SambaNova Cloud · DeepSeek V3.2
- Cerebras Cloud · WSE-3
- Google AI Studio · Gemini grounding
- OpenRouter · aggregator
- Cloudflare · Workers AI + R2
- Tavily · structured search
- Supabase · Postgres + Auth + RLS
- LMArena · benchmark leaderboards
- Sarma Linux · publisher