Yes, for any reasonable personal or small-team workload. Every provider in the stack offers a free tier. With one key per provider you have practical capacity for 15+ daily users. The Gmail +alias trick lets you multiply keys further. The only paid optional: Vercel Pro ($20/mo) if you need >10s function timeouts.

Can I use my own paid keys (OpenAI, Anthropic)?

Yes. Add them as a failover step, just put them at the bottom of the failover chain so you burn free tier first. OpenAI and Anthropic both expose OpenAI-compatible endpoints (native for OpenAI, via OpenRouter for Anthropic). 10 lines to add.

What about rate limits?

Rate limits are the reason this project exists. Every mode has 6–36 engines. When provider A returns 429, the next key is tried, then the next provider. Users never see errors. If every single engine is exhausted, that is the only case where a user sees a rate-limit message, and by then you have observability data to act on.

How is this different from LiteLLM / LangChain / OpenRouter?

LiteLLM is a library, you still have to write the app. OpenRouter is one provider, not seven. SarmaLink-AI is the full vertical: routing, memory, tools, streaming, database schema, RLS, observability. Fork it and change what you want.

How do I monitor what's happening?

The /api/admin/health endpoint returns per-provider success rates, p50/p95 latency, dead-model detection, and 24-hour volume. Every request also writes to the ai_events table, query it directly in Supabase for any analysis you need.

Does it handle images?

Generation: FLUX.2 klein 9B via Cloudflare Workers AI (~1.5s). Editing: same model with instruction-following, "change to emerald green" actually changes the colour. Vision: Llama-4 Scout reads photos, screenshots, receipts, diagrams.

What database does it use?

PostgreSQL via Supabase. 4 tables: ai_chat_sessions, ai_chat_usage, ai_events, ai_user_memories. Row-level security is on every table, keyed on auth.uid(). See Database Schema in the wiki.

Is the code production-ready?

Yes. CI runs lint/typecheck/test/build on every PR. CodeQL scans weekly. Dependabot keeps dependencies current. 40+ unit tests on routing logic. The route handler was refactored from a 1,369-line god-route into modular lib/ files. Branch protection is on.

Open Source · MIT License · v1.1.0 · 36 engines live

One assistant. Thirty-six engines underneath.

SarmaLink-AI routes every message through up to fourteen engines across seven providers. If one is at capacity, the next fires in under fifty milliseconds. Powered by DeepSeek V3.2 (685 billion parameters), Google Gemini 3, GPT-OSS 120B, and thirty-three more engines. Built and shipped by Sarma Linux.

View on GitHub Whitepaper Setup guide Get help deploying

AI engines

Providers

Max failover

41ms

Fastest token

685B

Primary model

Recurring cost

Why this exists

Every major AI provider offers a free tier. Groq hosts GPT-OSS 120B. SambaNova runs DeepSeek V3.2 (685B parameters). Cerebras does 2,000 tokens per second on their WSE-3 chip. Google Gemini has grounded Google Search built in. Each is individually generous. Each on its own still hits rate limits.

The moment a single provider returns a 429, the app breaks. Users see an error. They lose trust. The common workaround, paying for an upgrade, defeats the point of using free tiers in the first place.

SarmaLink-AI chains every free tier together. If Groq is busy, SambaNova fires. If SambaNova is busy, Cerebras. Then Gemini. Then OpenRouter's free model pool as the final safety net. Users never see errors , they always get an answer, from whichever engine is available.

How it routes your message

Six real questions. The auto-router picks the mode. The failover picks the engine. The green check shows which model actually answered, live as of today.

“Draft a polite rejection email for a late supplier delivery.”

Smart

auto-routed

SambaNova · DeepSeek V3.2 685B

200 OK

820ms first token380 tokens out

Primary engine, 685B MoE frontier model

“Does GDPR Article 17 apply to database backups?”

Reasoner

auto-routed

SambaNova · DeepSeek V3.2 (reasoning)

200 OK

4.2s total · thinking trace shown1,240 tokens out

Collapsible chain-of-thought panel

“What's the weather in Singapore right now?”

Live

auto-routed

tool

Auto-router → weather tool detected

Open-Meteo API

Gemini 2.5 Flash Lite (formatter)

200 OK

680ms end-to-end95 tokens out

Tool runs before model, no LLM round-trip needed for data

“What's the synonym for 'utilise'?”

Fast

auto-routed

Groq · GPT-OSS 20B (LPU)

200 OK

41ms first token12 tokens out

Fastest route, Groq LPU chip

“Fix: 'Type Date is not assignable to string'”

Coder

auto-routed

retry

SambaNova · DeepSeek V3.2

429 rate-limited

Cerebras · Qwen 3 Coder 480B

200 OK

1.4s total (47ms failover)340 tokens out

Failover kicked in, user saw zero error

“[receipt.jpg uploaded] What was the total?”

Vision

auto-routed

Groq · Llama-4 Scout 17B (vision)

200 OK

1.1s end-to-end85 tokens out

Auto-activates on image upload

Deep-dive: a request that needed the failover

The Coder example above, full trace of what happens when the primary engine is rate-limited and the system hands off to the next.

rendering

Coder request, primary engine rate-limited, Cerebras wins in 1.4s. Every step lands in ai_events for /api/admin/health.

Full walkthrough with Mermaid diagrams: How Failover Works · Architecture Diagrams

Six specialised modes

Each mode is backed by a different failover of engines, optimised for a specific type of task. The auto-router picks the right one, or users choose manually.

Smart

1,000/day · 14-engine failover

DeepSeek V3.2 (685B MoE)

Professional emails, summaries, deep analysis, brainstorming. Primary engine outscores GPT-4o on MATH-500 (90.2% vs 76.6%) and HumanEval (92.7% vs 90.2%). 36 engines failover across 4 providers.

Reasoner

500/day · 10-engine failover

DeepSeek V3.2 + V3.1

Complex logic, multi-step maths, legal reasoning, strategy. Shows its thinking process in a collapsible panel, click to follow the chain of thought. 10-engine failover.

Live

1,000/day · 4-engine failover

Gemini 2.5 Flash + Google Search

Real-time web search grounded in Google. Current news, weather, exchange rates, container tracking, sports scores. Sources cited at the bottom of every answer.

Fast

5,000/day · 9-engine failover

Groq GPT-OSS 20B (41ms)

First token in 41 milliseconds. Quick lookups, one-liner rewrites, simple questions. 9-engine failover means practically unlimited capacity.

Coder

800/day · 9-engine failover

DeepSeek V3.2

TypeScript, Python, SQL, HTML/CSS. Spots bugs, writes tests, refactors legacy code. DeepSeek V3.2 topped the SWE-bench coding leaderboard.

Vision

500/day · 6-engine failover

Llama-4 Scout 17B

Reads photos, screenshots, receipts, diagrams. Auto-activates on image upload. FLUX.2 klein edits images with natural language instructions.

Deep dive on every mode: The 6 Modes →

Built-in features

Everything below works out of the box. Clone, add your keys, deploy.

Smart Failover

Up to 36 engines per mode. If one returns 429 or 5xx, the next fires in under 50 milliseconds. Round-robin key rotation spreads load so no single connection is hit first.

Persistent Memory

After each conversation, a cheap model (Llama 3.1 8B) extracts key facts, name, role, preferences, projects. Those facts are injected into every future chat. Works across all modes and model switches.

Auto-Router

Regex-based intent classifier detects code, web search, quick questions, deep reasoning, and vision from the message text. Routes to the right mode instantly, zero API calls, zero latency.

Image generation and editing

FLUX.2 klein 9B generates images from text in ~1.5 seconds. Upload an image and say "change to emerald green", it actually changes the colour (verified by a second AI model). Failover: 9B to 4B to FLUX.1-schnell.

Live Exchange Rates

Powered by frankfurter.app (European Central Bank data). 13+ currencies, instant conversion. "Convert 5000 GBP to EUR" returns a real-time answer. No API key required.

Weather Anywhere

Open-Meteo, global coverage, 3-day forecast, auto-geocoding. "Weather in Milan" returns current temp, humidity, wind, UV, and forecast. No API key required.

Container Tracking

Auto-detects shipping carrier from container prefix (ISO 6346 database, 25+ prefixes). Searches Tavily for live status. Generates direct tracking links for Maersk, MSC, CMA CGM, Hapag-Lloyd, COSCO, Evergreen, ONE, Yang Ming, ZIM.

Document Analysis

Upload PDFs, Excel spreadsheets, Word documents, up to 10 per conversation. Text extracted via Gemini Vision (PDF) or server-side libraries (xlsx, mammoth). Files persist in Cloudflare R2 across messages.

50 Saved Conversations

Each user gets 50 conversation slots. Oldest auto-deleted when the limit is reached. Thinking traces and backend model labels are saved so you can see which engine answered each message.

Dark and light mode

Full theme support with CSS variables. Markdown rendering includes syntax-highlighted code blocks, tables, lists, and images, all theme-aware.

Prompt Injection Defence

Every external input, user messages, tool results, saved memories, is wrapped in explicit untrusted markers. Known jailbreak patterns are stripped before reaching the model. Tool outputs never execute as instructions.

Observability Endpoint

The /api/admin/health route exposes per-provider success rates, p50/p95 latency, dead-model detection, and 24-hour volume. Built from the ai_events audit log that records every failover step.

Cross-Repo Plugin System

Ten sibling open-source repos (voice-agent-starter, agent-orchestrator, ai-eval-runner, mcp-server-toolkit, local-llm-router, rag-over-pdf, receipt-scanner, webhook-to-email, k8s-ops-toolkit, terraform-stack) registered as plugins. Each has an intent and endpoint env var and invokePlugin() proxy. Surface them at /api/v1/plugins, dispatch via /api/v1/plugins/invoke.

Manus Integration

Typed Manus client (createTask, getTask, cancelTask, awaitTask) for delegating long-running agentic tasks. Webhook receiver verifies HMAC-SHA256 signatures and persists every task to a manus_tasks Postgres table, pollable by id via /api/v1/manus/tasks/[id].

Intent-Based Auto-Routing

Pre-LLM hook scans incoming chat messages for intent keywords (research, voice, eval, workflow, rag, ocr) and dispatches to the matching plugin instead of going to chat completion. Falls through to the LLM when no plugin matches. Gated by ENABLE_PLUGIN_AUTOROUTE.

White-Label Ready

Full MAKE-IT-YOURS guide with copy-paste v0 prompt that generates a complete branded front end (home, pricing, docs, login, signup, dashboard with usage charts and API key CRUD). Pair with terraform-stack for one-command reproducibility. /docs page lists every plugin with live enabled status.

v2 release · ten new capabilities

What v2 added

v2 turns the gateway into an agent runtime, a voice stack, a live-data layer and a quota-aware tool catalog. The pieces below ship in the repository on main and run on the same Supabase project the chat backend already uses.

Intent auto-router

A regex sweep plus a tiny LLM classifier picks the correct mode per message before the failover runner fires. Smart, Reasoner, Coder, Fast, Live and Vision get selected without the user thinking about it.

Multi-step agent runner

POST /api/v1/agent runs a planner, fans out workers and a synthesiser, and streams every step over Server-Sent Events. Long tasks resolve in one round trip instead of an orchestrator round-tripping the client.

MCP-shaped tool catalog

A bearer-protected /api/v1/mcp endpoint exposes list_tools and plugin dispatch in the Model Context Protocol shape. External agents and IDEs can mount the gateway as a tool source without bespoke glue.

TTS cascade

MeloTTS on Cloudflare Workers AI is the primary text-to-speech path. Gemini TTS picks up when the primary returns an error or empty audio. Output is opus or mp3 ready to stream from the browser.

STT route

Speech to text via Groq Whisper first, Cloudflare Workers AI as fallback. POST audio in, get a transcript out, sub-second on a clean clip.

Live-data tools, zero keys

Weather from Open-Meteo, FX from Frankfurter (European Central Bank), news from the Hacker News Algolia index. No tokens to manage, no per-call cost, results cited.

FLUX with key rotation

Image generation across multiple Cloudflare account and token pairs. When one account hits its neuron cap, the next pair is dispatched, no user-visible pause.

Quota tracker

GET /api/v1/quota returns per-user and company-wide usage from a Supabase view. Wire it into a dashboard, or into the chatbot itself for a calm "you have X requests left today".

Smart suggestions

After each reply, an endpoint returns three follow-up prompts grounded in the conversation. Drop them into the UI as chips for one-tap continuation.

Reasoning-leak stripper

Chain-of-thought wrappers, <thinking> blocks and model-internal commentary are scrubbed from streamed output before the client sees them. Users get answers, not internal monologue.

Markdown to PDF, JSON to XLSX

Export an answer as a print-ready PDF via PDFKit, or a structured response as a spreadsheet via ExcelJS. Two endpoints, no extra dependencies on the client.

v2 request lifecycle

A single message now passes through the intent auto-router, optionally hands off to the agent runner, drives the TTS or STT cascade if voice is involved, hits the live-data tools where intent calls for them, and returns three smart follow-up suggestions alongside the answer.

rendering

v2 request lifecycle. Auto-router first, agent runner or mode failover next, cascades and tools on the side, leak stripper and quota tracker before the response leaves.

Persistent memory

Remembers who you are, across every session

After every conversation, a cheap classifier model (Llama 3.1 8B on Groq, ~80 ms) extracts durable facts about the user and writes them to ai_user_memories. Every future chat injects those facts into the system prompt. Works across modes and across model swaps.

What gets remembered

IdentityName, role, organisation, location, time zone
PreferencesBritish English, terse replies, code formatting style
ProjectsNames, stacks, deadlines, current blockers
ConstraintsAllergies, hardware limits, regulated industries
GoalsLong-running objectives mentioned across sessions

How it stays sane

Fire-and-forget extraction. Runs after the response streams to the user, never on the hot path.
Embedding dedupe. Near-identical facts collapse into one row.
Capped at 30 per user. Oldest facts rotate out, recent ones survive.
Wrapped on retrieval. Memories are injected inside <user_memory> tags, never as instructions.
User-deletable. One-click wipe via /api/memory/forget.

-- Schema · supabase/migrations/001_sarmalink_ai.sql
create table ai_user_memories (
  id          uuid primary key default gen_random_uuid(),
  user_id     uuid not null references auth.users(id) on delete cascade,
  fact        text not null,
  embedding   vector(384),
  created_at  timestamptz default now()
);

alter table ai_user_memories enable row level security;
create policy "own_memories" on ai_user_memories
  for all using (auth.uid() = user_id);

-- Retrieval at request time
select fact from ai_user_memories
  where user_id = auth.uid()
  order by created_at desc
  limit 30;

Pricing

Free, because it has to be

SarmaLink-AI is open source. There is no licence fee, no paid tier, no upsell. The honest table further down shows what the seven providers underneath would charge if you blew past their free tiers.

Recommended

Self-hosted

£0forever

Clone the repo, paste your free-tier API keys, deploy to Vercel. The same path most people take.

›All 36 engines, all 7 providers
›~207,000 combined requests per day
›Persistent memory, image gen, vision, live tools
›Self-hosted on your Vercel + Supabase
›MIT licensed, commercial use allowed

Hosted (community)

£0beta

A community-run instance at sarmalinux.com. Sign in, start chatting. Daily caps apply.

›500 requests/day per signed-in user
›Hosted on Sarma Linux infrastructure
›Same 36 engines under the hood
›No credit card, no payment details
›For evaluation and personal use

Bring-your-own-keys

£0forever

Run on your own keys. The cost row below shows what each provider would charge if you outgrew their free tiers.

›Plug in paid OpenAI, Anthropic, or Bedrock keys
›Order them at the bottom of the failover chain
›Free tier burns first, paid only on overflow
›Same routing, observability, and security
›Costs flow straight to the provider, never to us

If the free tiers ran out

These are the published list prices for each provider underneath. Most users never hit them, because rotation across seven providers absorbs the load.

Provider	Free tier	Paid (if needed)
SambaNova	Generous	$1.20 / M tokens (DeepSeek V3.2)
Groq	14K req/day per key	$0.10 / M tokens (Llama 3.3 70B)
Cerebras	5K req/day per key	$0.60 / M tokens (Llama 70B)
Google	250 req/day per key	$0.30 / M tokens (Gemini 2.5 Flash)
OpenRouter	1K req/day per key	Pass-through, ~$0.50 / M tokens
Cloudflare	10K img/day	$0.0011 per FLUX.2 image
Tavily	1K searches/month	$0.005 per search

Indicative figures sourced from public provider pricing pages, March 2026. Always verify before relying on these numbers commercially.

Use cases

What people actually build with this.

Personal Assistant

Replace ChatGPT Plus. Unlimited practical capacity across 6 modes. Persistent memory across sessions. £0 monthly cost.

Team Internal Tools

HR policies, finance lookups, ops runbooks. Every request logged to your own database. RLS keeps per-user data separate.

Customer Support Backbone

Plug the SSE streaming API into any frontend. Auto-router surfaces the right mode without user selection. Sources cited for regulated-industry compliance.

Research & Reasoning

DeepSeek V3.2 on heavy maths, GPQA, PhD-level questions. Reasoner mode exposes chain-of-thought traces you can audit.

Code Generation & Review

Coder mode with SWE-bench 42%. Paste a diff, ask for bugs or refactors. TypeScript, Python, SQL, Go, Rust.

Document Intelligence

Upload contracts, invoices, spreadsheets, PDFs. Ask questions in natural language. Text extraction runs server-side before the model sees it.

All open-source projects