SarmaLink-AI
An open-source AI assistant that routes every message through up to 14 engines across 7 providers. If one is at capacity, the next fires in under 50 milliseconds. Powered by DeepSeek V3.2 (685 billion parameters), Google Gemini 3, GPT-OSS 120B, and 33 more engines. Built by Sarma Linux.
Why this exists
Every major AI provider offers a free tier. Groq hosts GPT-OSS 120B. SambaNova runs DeepSeek V3.2 (685B parameters). Cerebras does 2,000 tokens per second on their WSE-3 chip. Google Gemini has grounded Google Search built in. Each is individually generous. Each on its own still hits rate limits.
The moment a single provider returns a 429, the app breaks. Users see an error. They lose trust. The common workaround, paying for an upgrade, defeats the point of using free tiers in the first place.
SarmaLink-AI chains every free tier together. If Groq is busy, SambaNova fires. If SambaNova is busy, Cerebras. Then Gemini. Then OpenRouter's free model pool as the final safety net. Users never see errors; they always get an answer, from whichever engine is available.
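The chaining described above can be sketched as an ordered loop over providers: try each engine in turn, treat any failure as a signal to fall through. This is a minimal illustrative sketch, not the project's actual implementation; the `Engine` shape and names are assumptions.

```typescript
// Minimal failover sketch: first healthy engine in the chain answers.
type Engine = { name: string; call: (prompt: string) => Promise<string> };

async function askWithFailover(
  engines: Engine[],
  prompt: string
): Promise<{ engine: string; answer: string }> {
  let lastError: unknown;
  for (const engine of engines) {
    try {
      const answer = await engine.call(prompt);
      return { engine: engine.name, answer }; // success: stop here
    } catch (err) {
      lastError = err; // 429 / 5xx: fall through to the next engine
    }
  }
  throw lastError ?? new Error("all engines exhausted");
}
```

Because the loop only advances on failure, a healthy primary engine costs nothing extra, and the worst case is one attempt per engine in the chain.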
How it routes your message
Six real questions. The auto-router picks the mode. The failover picks the engine. The green check shows which model actually answered, live as of today.
Deep-dive: a request that needed the failover
The Coder example above: the full trace of what happens when the primary engine is rate-limited and the system hands off to the next.
User: "Fix: 'Type Date is not assignable to string'"
→ Auto-router: detects TypeScript error pattern → mode = Coder (9-engine failover)
→ Step 1 · SambaNova · DeepSeek V3.2 685B
   → rotate to key 3 (round-robin, 8 keys total)
   → POST /v1/chat/completions
   → 429 Too Many Requests (quota exceeded this minute)
   → logEvent(status: 'rate_limited', latency: 38ms)
→ 47ms later: Step 2 · Cerebras · Qwen 3 Coder 480B
   → rotate to key 1
   → POST /v1/chat/completions (streaming)
   → 200 OK · first token in 94ms
   → streaming SSE chunks to client
→ Response streamed in 1.4 seconds total
→ Backend label: "Cerebras Qwen 3 Coder 480B"
→ logEvent(status: 'success', latency_ms: 1403, tokens_out: 340)
→ Memory extractor (fire-and-forget)
   → runs Llama 3.1 8B on Groq after session save
   → no new facts extracted (code-only context)
→ Session persisted to Supabase · RLS scoped to auth.uid()
Every step is written to ai_events · queryable via /api/admin/health
Full walkthrough with Mermaid diagrams: How Failover Works · Architecture Diagrams
Six specialised modes
Each mode is backed by a different failover chain of engines, optimised for a specific type of task. The auto-router picks the right one, or users choose manually.
Smart
Professional emails, summaries, deep analysis, brainstorming. The primary engine outscores GPT-4o on MATH-500 (90.2% vs 76.6%) and HumanEval (92.7% vs 90.2%). 14-engine failover across 4 providers.
Reasoner
Complex logic, multi-step maths, legal reasoning, strategy. Shows its thinking process in a collapsible panel; click to follow the chain of thought. 10-engine failover.
Live
Real-time web search grounded in Google. Current news, weather, exchange rates, container tracking, sports scores. Sources cited at the bottom of every answer.
Fast
First token in 41 milliseconds. Quick lookups, one-liner rewrites, simple questions. 9-engine failover means practically unlimited capacity.
Coder
TypeScript, Python, SQL, HTML/CSS. Spots bugs, writes tests, refactors legacy code. DeepSeek V3.2 topped the SWE-bench coding leaderboard.
Vision
Reads photos, screenshots, receipts, diagrams. Auto-activates on image upload. FLUX.2 klein edits images with natural language instructions.
Deep dive on every mode: The 6 Modes →
Built-in features
Everything below works out of the box. Clone, add your keys, deploy.
Smart Failover
Up to 14 engines per mode. If one returns 429 or 5xx, the next fires in under 50 milliseconds. Round-robin key rotation spreads load so no single API key is exhausted first.
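Round-robin rotation is just a wrapping cursor over the key list. A minimal sketch, assuming keys are held in memory per provider (the class and field names are illustrative, not the project's actual code):

```typescript
// Round-robin key rotation: each call hands out the next key, wrapping at the end.
class KeyRotator {
  private cursor = 0;

  constructor(private keys: string[]) {
    if (keys.length === 0) throw new Error("need at least one key");
  }

  next(): string {
    const key = this.keys[this.cursor];
    this.cursor = (this.cursor + 1) % this.keys.length; // wrap around
    return key;
  }
}
```

With 8 keys per provider, a burst of 8 requests touches each key exactly once before any key sees a second request.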
Persistent Memory
After each conversation, a cheap model (Llama 3.1 8B) extracts key facts: name, role, preferences, projects. Those facts are injected into every future chat. Works across all modes and model switches.
Auto-Router
Regex-based intent classifier detects code, web search, quick questions, deep reasoning, and vision from the message text. Routes to the right mode instantly β zero API calls, zero latency.
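A regex classifier of this kind is an ordered rule table: first pattern to match wins, with a default mode as the fallback. This sketch is illustrative; the actual patterns, keywords, and mode names in the project are assumptions here.

```typescript
// Regex-based intent routing sketch: ordered rules, first match wins, zero API calls.
type Mode = "coder" | "live" | "reasoner" | "fast" | "smart";

const rules: Array<[RegExp, Mode]> = [
  [/\b(error|bug|typescript|python|sql|refactor|stack trace)\b/i, "coder"],
  [/\b(today|latest|news|weather|price|score)\b/i, "live"],
  [/\b(prove|step[- ]by[- ]step|strategy)\b/i, "reasoner"],
  [/^.{0,40}\?$/s, "fast"], // short question: quick lookup
];

function routeMessage(text: string): Mode {
  for (const [pattern, mode] of rules) {
    if (pattern.test(text)) return mode;
  }
  return "smart"; // default mode when nothing specific matches
}
```

Because matching is pure string work, routing adds no network round-trip and effectively no latency before the chosen mode's failover chain starts.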
Image Gen & Editing
FLUX.2 klein 9B generates images from text in ~1.5 seconds. Upload an image and say "change to emerald green" and it actually changes the colour (verified by a second AI model). Failover: 9B → 4B → FLUX.1-schnell.
Live Exchange Rates
Powered by frankfurter.app (European Central Bank data). 13+ currencies, instant conversion. "Convert 5000 GBP to EUR" returns a real-time answer. No API key required.
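The frankfurter.app `/latest` endpoint accepts `amount`, `from`, and `to` query parameters and returns the converted amount under `rates`. A small offline sketch of building the request URL and applying a known rate (treat the exact response shape as an assumption to verify against their docs):

```typescript
// Build a frankfurter.app conversion request URL.
function conversionUrl(amount: number, from: string, to: string): string {
  return `https://api.frankfurter.app/latest?amount=${amount}&from=${from}&to=${to}`;
}

// Apply a known exchange rate locally, rounded to 2 decimal places.
function convert(amount: number, rate: number): number {
  return Math.round(amount * rate * 100) / 100;
}
```

For "Convert 5000 GBP to EUR", the app would fetch `conversionUrl(5000, "GBP", "EUR")` and read the EUR figure out of the response.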
Weather Anywhere
Open-Meteo: global coverage, 3-day forecast, auto-geocoding. "Weather in Milan" returns current temp, humidity, wind, UV, and forecast. No API key required.
Container Tracking
Auto-detects shipping carrier from container prefix (ISO 6346 database, 25+ prefixes). Searches Tavily for live status. Generates direct tracking links for Maersk, MSC, CMA CGM, Hapag-Lloyd, COSCO, Evergreen, ONE, Yang Ming, ZIM.
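ISO 6346 container numbers are four letters (a three-letter owner code plus a category letter, usually U) followed by seven digits, so carrier detection is a prefix lookup after a format check. A sketch with a deliberately tiny prefix table; the project's real database covers 25+ prefixes:

```typescript
// Illustrative subset of an ISO 6346 owner-prefix table (not the full database).
const carrierByPrefix: Record<string, string> = {
  MSKU: "Maersk",
  MSCU: "MSC",
  CMAU: "CMA CGM",
  HLCU: "Hapag-Lloyd",
};

function detectCarrier(containerNumber: string): string | null {
  // ISO 6346 layout: 3 owner letters + category letter (U/J/Z) + 7 digits.
  const match = /^([A-Z]{3}[UJZ])\d{7}$/.exec(containerNumber.toUpperCase());
  if (!match) return null;
  return carrierByPrefix[match[1]] ?? null;
}
```

A recognised prefix is enough to pick the right carrier's tracking URL template before the live-status search even runs.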
Document Analysis
Upload PDFs, Excel spreadsheets, and Word documents, up to 10 per conversation. Text extracted via Gemini Vision (PDF) or server-side libraries (xlsx, mammoth). Files persist in Cloudflare R2 across messages.
50 Saved Conversations
Each user gets 50 conversation slots. Oldest auto-deleted when the limit is reached. Thinking traces and backend model labels are saved so you can see which engine answered each message.
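The 50-slot cap is a simple eviction policy: on save, keep the newest 50 by timestamp and drop the rest. A minimal sketch, assuming conversations carry an id and a last-updated timestamp (field names are assumptions):

```typescript
// Capped conversation list: saving past the limit evicts the oldest entries.
interface Conversation {
  id: string;
  updatedAt: number; // epoch millis
}

function saveWithCap(
  slots: Conversation[],
  incoming: Conversation,
  cap = 50
): Conversation[] {
  // Replace any existing copy of this conversation, then append the new one.
  const next = [...slots.filter((c) => c.id !== incoming.id), incoming];
  next.sort((a, b) => a.updatedAt - b.updatedAt); // oldest first
  return next.length > cap ? next.slice(next.length - cap) : next; // drop oldest
}
```

Updating an existing conversation never costs a slot, since its old copy is replaced rather than duplicated.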
Dark & Light Mode
Full theme support with CSS variables. Markdown rendering includes syntax-highlighted code blocks, tables, lists, and images, all theme-aware.
Prompt Injection Defence
Every external input (user messages, tool results, saved memories) is wrapped in explicit untrusted markers. Known jailbreak patterns are stripped before reaching the model. Tool outputs never execute as instructions.
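The wrap-and-strip step can be sketched in a few lines. The marker strings and the pattern list here are assumptions for illustration; a real deployment maintains a much larger pattern set and pairs the markers with a system-prompt instruction:

```typescript
// Strip known jailbreak phrasing, then fence the input in untrusted markers.
const JAILBREAK_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/gi,
  /disregard your system prompt/gi,
];

function wrapUntrusted(input: string): string {
  let cleaned = input;
  for (const pattern of JAILBREAK_PATTERNS) {
    cleaned = cleaned.replace(pattern, "[removed]");
  }
  // The system prompt tells the model content inside these markers is data, not instructions.
  return `<untrusted>\n${cleaned}\n</untrusted>`;
}
```

The same wrapper is applied uniformly, so tool results and recalled memories get no more authority than a raw user message.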
Observability Endpoint
The /api/admin/health route exposes per-provider success rates, p50/p95 latency, dead-model detection, and 24-hour volume. Built from the ai_events audit log that records every failover step.
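Success rate and p50/p95 latency fall out of the event log with a sort and two index lookups. A sketch assuming an ai_events-style row shape (the field names are assumptions):

```typescript
// Health stats from an ai_events-style log: success rate plus p50/p95 latency.
interface AiEvent {
  provider: string;
  status: "success" | "rate_limited" | "error";
  latency_ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

function providerHealth(events: AiEvent[]) {
  const latencies = events.map((e) => e.latency_ms).sort((a, b) => a - b);
  const successes = events.filter((e) => e.status === "success").length;
  return {
    successRate: events.length ? successes / events.length : 0,
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
  };
}
```

Grouping events by provider before calling this gives the per-provider breakdown the endpoint exposes, and a provider whose recent window contains zero successes is a candidate for dead-model detection.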
Use cases
What people actually build with this.
Personal Assistant
Replace ChatGPT Plus. Unlimited practical capacity across 6 modes. Persistent memory across sessions. £0 monthly cost.
Team Internal Tools
HR policies, finance lookups, ops runbooks. Every request logged to your own database. RLS keeps per-user data separate.
Customer Support Backbone
Plug the SSE streaming API into any frontend. Auto-router surfaces the right mode without user selection. Sources cited for regulated-industry compliance.
Research & Reasoning
DeepSeek V3.2 on heavy maths, GPQA, PhD-level questions. Reasoner mode exposes chain-of-thought traces you can audit.
Code Generation & Review
Coder mode scores 42% on SWE-bench. Paste a diff, ask for bugs or refactors. TypeScript, Python, SQL, Go, Rust.
Document Intelligence
Upload contracts, invoices, spreadsheets, PDFs. Ask questions in natural language. Text extraction runs server-side before the model sees it.