One assistant. Thirty-six engines underneath.
SarmaLink-AI routes every message through up to fourteen engines across seven providers. If one is at capacity, the next fires in under fifty milliseconds. Powered by DeepSeek V3.2 (685 billion parameters), Google Gemini 3, GPT-OSS 120B, and thirty-three more engines. Built and shipped by Sarma Linux.
Why this exists
Every major AI provider offers a free tier. Groq hosts GPT-OSS 120B. SambaNova runs DeepSeek V3.2 (685B parameters). Cerebras does 2,000 tokens per second on their WSE-3 chip. Google Gemini has grounded Google Search built in. Each is individually generous. Each on its own still hits rate limits.
The moment a single provider returns a 429, the app breaks. Users see an error. They lose trust. The common workaround, paying for an upgrade, defeats the point of using free tiers in the first place.
SarmaLink-AI chains every free tier together. If Groq is busy, SambaNova fires. If SambaNova is busy, Cerebras. Then Gemini. Then OpenRouter's free model pool as the final safety net. Users never see errors , they always get an answer, from whichever engine is available.
How it routes your message
Six real questions. The auto-router picks the mode. The failover picks the engine. The green check shows which model actually answered, live as of today.
Deep-dive: a request that needed the failover
The Coder example above, full trace of what happens when the primary engine is rate-limited and the system hands off to the next.
Full walkthrough with Mermaid diagrams: How Failover Works · Architecture Diagrams
Six specialised modes
Each mode is backed by a different failover of engines, optimised for a specific type of task. The auto-router picks the right one, or users choose manually.
Smart
Professional emails, summaries, deep analysis, brainstorming. Primary engine outscores GPT-4o on MATH-500 (90.2% vs 76.6%) and HumanEval (92.7% vs 90.2%). 14 engines failover across 4 providers.
Reasoner
Complex logic, multi-step maths, legal reasoning, strategy. Shows its thinking process in a collapsible panel, click to follow the chain of thought. 10-engine failover.
Live
Real-time web search grounded in Google. Current news, weather, exchange rates, container tracking, sports scores. Sources cited at the bottom of every answer.
Fast
First token in 41 milliseconds. Quick lookups, one-liner rewrites, simple questions. 9-engine failover means practically unlimited capacity.
Coder
TypeScript, Python, SQL, HTML/CSS. Spots bugs, writes tests, refactors legacy code. DeepSeek V3.2 topped the SWE-bench coding leaderboard.
Vision
Reads photos, screenshots, receipts, diagrams. Auto-activates on image upload. FLUX.2 klein edits images with natural language instructions.
Deep dive on every mode: The 6 Modes →
Built-in features
Everything below works out of the box. Clone, add your keys, deploy.
Smart Failover
Up to 14 engines per mode. If one returns 429 or 5xx, the next fires in under 50 milliseconds. Round-robin key rotation spreads load so no single connection is hit first.
Persistent Memory
After each conversation, a cheap model (Llama 3.1 8B) extracts key facts, name, role, preferences, projects. Those facts are injected into every future chat. Works across all modes and model switches.
Auto-Router
Regex-based intent classifier detects code, web search, quick questions, deep reasoning, and vision from the message text. Routes to the right mode instantly, zero API calls, zero latency.
Image Gen & Editing
FLUX.2 klein 9B generates images from text in ~1.5 seconds. Upload an image and say "change to emerald green", it actually changes the colour (verified by a second AI model). Failover: 9B → 4B → FLUX.1-schnell.
Live Exchange Rates
Powered by frankfurter.app (European Central Bank data). 13+ currencies, instant conversion. "Convert 5000 GBP to EUR" → real-time answer. No API key required.
Weather Anywhere
Open-Meteo, global coverage, 3-day forecast, auto-geocoding. "Weather in Milan" → current temp, humidity, wind, UV, and forecast. No API key required.
Container Tracking
Auto-detects shipping carrier from container prefix (ISO 6346 database, 25+ prefixes). Searches Tavily for live status. Generates direct tracking links for Maersk, MSC, CMA CGM, Hapag-Lloyd, COSCO, Evergreen, ONE, Yang Ming, ZIM.
Document Analysis
Upload PDFs, Excel spreadsheets, Word documents, up to 10 per conversation. Text extracted via Gemini Vision (PDF) or server-side libraries (xlsx, mammoth). Files persist in Cloudflare R2 across messages.
50 Saved Conversations
Each user gets 50 conversation slots. Oldest auto-deleted when the limit is reached. Thinking traces and backend model labels are saved so you can see which engine answered each message.
Dark & Light Mode
Full theme support with CSS variables. Markdown rendering includes syntax-highlighted code blocks, tables, lists, and images, all theme-aware.
Prompt Injection Defence
Every external input, user messages, tool results, saved memories, is wrapped in explicit untrusted markers. Known jailbreak patterns are stripped before reaching the model. Tool outputs never execute as instructions.
Observability Endpoint
The /api/admin/health route exposes per-provider success rates, p50/p95 latency, dead-model detection, and 24-hour volume. Built from the ai_events audit log that records every failover step.
Cross-Repo Plugin System
Ten sibling open-source repos (voice-agent-starter, agent-orchestrator, ai-eval-runner, mcp-server-toolkit, local-llm-router, rag-over-pdf, receipt-scanner, webhook-to-email, k8s-ops-toolkit, terraform-stack) registered as plugins. Each has an intent + endpoint env var + invokePlugin() proxy. Surface them at /api/v1/plugins, dispatch via /api/v1/plugins/invoke.
Manus Integration
Typed Manus client (createTask, getTask, cancelTask, awaitTask) for delegating long-running agentic tasks. Webhook receiver verifies HMAC-SHA256 signatures and persists every task to a manus_tasks Postgres table, pollable by id via /api/v1/manus/tasks/[id].
Intent-Based Auto-Routing
Pre-LLM hook scans incoming chat messages for intent keywords (research, voice, eval, workflow, rag, ocr) and dispatches to the matching plugin instead of going to chat completion. Falls through to the LLM when no plugin matches. Gated by ENABLE_PLUGIN_AUTOROUTE.
White-Label Ready
Full MAKE-IT-YOURS guide with copy-paste v0 prompt that generates a complete branded front end (home, pricing, docs, login, signup, dashboard with usage charts and API key CRUD). Pair with terraform-stack for one-command reproducibility. /docs page lists every plugin with live enabled status.
What v2 added
v2 turns the gateway into an agent runtime, a voice stack, a live-data layer and a quota-aware tool catalog. The pieces below ship in the repository on main and run on the same Supabase project the chat backend already uses.
Intent auto-router
A regex sweep plus a tiny LLM classifier picks the correct mode per message before the failover runner fires. Smart, Reasoner, Coder, Fast, Live and Vision get selected without the user thinking about it.
Multi-step agent runner
POST /api/v1/agent runs a planner, fans out workers and a synthesiser, and streams every step over Server-Sent Events. Long tasks resolve in one round trip instead of an orchestrator round-tripping the client.
MCP-shaped tool catalog
A bearer-protected /api/v1/mcp endpoint exposes list_tools and plugin dispatch in the Model Context Protocol shape. External agents and IDEs can mount the gateway as a tool source without bespoke glue.
TTS cascade
MeloTTS on Cloudflare Workers AI is the primary text-to-speech path. Gemini TTS picks up when the primary returns an error or empty audio. Output is opus or mp3 ready to stream from the browser.
STT route
Speech to text via Groq Whisper first, Cloudflare Workers AI as fallback. POST audio in, get a transcript out, sub-second on a clean clip.
Live-data tools, zero keys
Weather from Open-Meteo, FX from Frankfurter (European Central Bank), news from the Hacker News Algolia index. No tokens to manage, no per-call cost, results cited.
FLUX with key rotation
Image generation across multiple Cloudflare account and token pairs. When one account hits its neuron cap, the next pair is dispatched, no user-visible pause.
Quota tracker
GET /api/v1/quota returns per-user and company-wide usage from a Supabase view. Wire it into a dashboard, or into the chatbot itself for a calm "you have X requests left today".
Smart suggestions
After each reply, an endpoint returns three follow-up prompts grounded in the conversation. Drop them into the UI as chips for one-tap continuation.
Reasoning-leak stripper
Chain-of-thought wrappers, <thinking> blocks and model-internal commentary are scrubbed from streamed output before the client sees them. Users get answers, not internal monologue.
Markdown to PDF, JSON to XLSX
Export an answer as a print-ready PDF via PDFKit, or a structured response as a spreadsheet via ExcelJS. Two endpoints, no extra dependencies on the client.
v2 request lifecycle
A single message now passes through the intent auto-router, optionally hands off to the agent runner, drives the TTS or STT cascade if voice is involved, hits the live-data tools where intent calls for them, and returns three smart follow-up suggestions alongside the answer.
Remembers who you are, across every session
After every conversation, a cheap classifier model (Llama 3.1 8B on Groq, ~80 ms) extracts durable facts about the user and writes them to ai_user_memories. Every future chat injects those facts into the system prompt. Works across modes and across model swaps.
What gets remembered
- IdentityName, role, organisation, location, time zone
- PreferencesBritish English, terse replies, code formatting style
- ProjectsNames, stacks, deadlines, current blockers
- ConstraintsAllergies, hardware limits, regulated industries
- GoalsLong-running objectives mentioned across sessions
How it stays sane
- Fire-and-forget extraction. Runs after the response streams to the user, never on the hot path.
- Embedding dedupe. Near-identical facts collapse into one row.
- Capped at 30 per user. Oldest facts rotate out, recent ones survive.
- Wrapped on retrieval. Memories are injected inside
<user_memory>tags, never as instructions. - User-deletable. One-click wipe via
/api/memory/forget.
-- Schema · supabase/migrations/001_sarmalink_ai.sql create table ai_user_memories ( id uuid primary key default gen_random_uuid(), user_id uuid not null references auth.users(id) on delete cascade, fact text not null, embedding vector(384), created_at timestamptz default now() ); alter table ai_user_memories enable row level security; create policy "own_memories" on ai_user_memories for all using (auth.uid() = user_id); -- Retrieval at request time select fact from ai_user_memories where user_id = auth.uid() order by created_at desc limit 30;
Free, because it has to be
SarmaLink-AI is open source. There is no licence fee, no paid tier, no upsell. The honest table further down shows what the seven providers underneath would charge if you blew past their free tiers.
Self-hosted
Clone the repo, paste your free-tier API keys, deploy to Vercel. The same path most people take.
- ›All 36 engines, all 7 providers
- ›~207,000 combined requests per day
- ›Persistent memory, image gen, vision, live tools
- ›Self-hosted on your Vercel + Supabase
- ›MIT licensed, commercial use allowed
Hosted (community)
A community-run instance at sarmalinux.com. Sign in, start chatting. Daily caps apply.
- ›500 requests/day per signed-in user
- ›Hosted on Sarma Linux infrastructure
- ›Same 36 engines under the hood
- ›No credit card, no payment details
- ›For evaluation and personal use
Bring-your-own-keys
Run on your own keys. The cost row below shows what each provider would charge if you outgrew their free tiers.
- ›Plug in paid OpenAI, Anthropic, or Bedrock keys
- ›Order them at the bottom of the failover chain
- ›Free tier burns first, paid only on overflow
- ›Same routing, observability, and security
- ›Costs flow straight to the provider, never to us
If the free tiers ran out
These are the published list prices for each provider underneath. Most users never hit them, because rotation across seven providers absorbs the load.
| Provider | Free tier | Paid (if needed) |
|---|---|---|
| SambaNova | Generous | $1.20 / M tokens (DeepSeek V3.2) |
| Groq | 14K req/day per key | $0.10 / M tokens (Llama 3.3 70B) |
| Cerebras | 5K req/day per key | $0.60 / M tokens (Llama 70B) |
| 250 req/day per key | $0.30 / M tokens (Gemini 2.5 Flash) | |
| OpenRouter | 1K req/day per key | Pass-through, ~$0.50 / M tokens |
| Cloudflare | 10K img/day | $0.0011 per FLUX.2 image |
| Tavily | 1K searches/month | $0.005 per search |
Indicative figures sourced from public provider pricing pages, March 2026. Always verify before relying on these numbers commercially.
Use cases
What people actually build with this.
Personal Assistant
Replace ChatGPT Plus. Unlimited practical capacity across 6 modes. Persistent memory across sessions. £0 monthly cost.
Team Internal Tools
HR policies, finance lookups, ops runbooks. Every request logged to your own database. RLS keeps per-user data separate.
Customer Support Backbone
Plug the SSE streaming API into any frontend. Auto-router surfaces the right mode without user selection. Sources cited for regulated-industry compliance.
Research & Reasoning
DeepSeek V3.2 on heavy maths, GPQA, PhD-level questions. Reasoner mode exposes chain-of-thought traces you can audit.
Code Generation & Review
Coder mode with SWE-bench 42%. Paste a diff, ask for bugs or refactors. TypeScript, Python, SQL, Go, Rust.
Document Intelligence
Upload contracts, invoices, spreadsheets, PDFs. Ask questions in natural language. Text extraction runs server-side before the model sees it.