Technical Deep Dive

How SarmaLink-AI works

The complete breakdown. How 36 engines across 7 providers deliver 99.9999% uptime at zero cost. Why every technology was chosen. How it compares to everything else.

The Problem

Every AI app has
a single point of failure.

You build an app on top of OpenAI. It works beautifully for 3 months. Then one day GPT-4o returns a 429 error. Your users see a broken page. Your support inbox fills up.

You switch to Anthropic. Works for 2 months. Then Claude goes down for maintenance. Same story.

The problem is not that providers are unreliable. They are remarkably reliable 99% of the time. The problem is that the other 1% is unpredictable, and your users experience all of it.

OpenAI · Rate limit (429) · User sees error
Anthropic · Maintenance window · App down 2 hours
Google · API quota exhausted · Chat stops working
Groq · Capacity spike · Slow or failed requests
This happens to every single-provider app
The Solution

Chain every provider together.
Fail over in 50ms.

Instead of depending on one provider, SarmaLink-AI treats every provider as a commodity. If one is busy, the next fires instantly. Users never see errors.

User sends: "Draft a follow-up email to the supplier"
Auto-router detects: professional writing → Smart mode (36-engine failover)
Step 1 · SambaNova · DeepSeek V3.2 (685B)
  → Key 3 (round-robin rotation across 8 keys)
  → POST /v1/chat/completions
  → 200 OK · First token in 820ms
  → Streaming SSE chunks to browser...
  → Done. 380 tokens in 2.1 seconds.
User sees a polished email. No errors. No delay.
If Step 1 had returned 429, Step 2 would fire in 47ms.
If all 14 steps fail, the user sees a rate-limit message.
That has never happened in production.
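The "round-robin rotation across 8 keys" step can be pictured with a small rotator. This is a minimal sketch, not the project's code: the `KeyRotator` class and the placeholder key strings are hypothetical.

```typescript
// Hypothetical round-robin key rotator; names and keys are illustrative only.
class KeyRotator {
  private readonly keys: readonly string[];
  private next = 0;

  constructor(keys: readonly string[]) {
    if (keys.length === 0) throw new Error("at least one API key is required");
    this.keys = keys;
  }

  // Returns the next key and its 1-based position, wrapping at the end of the list.
  take(): { key: string; position: number } {
    const index = this.next;
    this.next = (this.next + 1) % this.keys.length;
    return { key: this.keys[index], position: index + 1 };
  }
}

const demoRotator = new KeyRotator(["key-a", "key-b", "key-c"]);
const demoPicks = [1, 2, 3, 4].map(() => demoRotator.take().position);
console.log(demoPicks.join(",")); // 1,2,3,1
```

Spreading requests evenly this way keeps any single key from hitting its per-minute cap first; a real rotator would probably also skip keys that were recently rate-limited.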
How Failover Works

What happens when a provider fails

A real trace from production. The primary engine returned 429. The user never knew.

User: "Fix: 'Type Date is not assignable to string'"
Auto-router detects TypeScript error → Coder mode (9-engine failover)
Step 1 · SambaNova · DeepSeek V3.2 685B
  → rotate to key 3/8 (round-robin)
  → POST /v1/chat/completions
  → 429 Too Many Requests (quota exceeded this minute)
  → logEvent(status: 'rate_limited', latency: 38ms)
↓ 47ms later
Step 2 · Cerebras · Qwen 3 Coder 480B
  → rotate to key 1/4
  → POST /v1/chat/completions (streaming)
  → 200 OK · First token in 94ms
  → Streaming...
  → Done. 340 tokens in 1.4 seconds total.
User saw: a correct TypeScript fix in 1.4 seconds.
User experienced: zero errors, zero retries, zero friction.
What actually happened: a 685B model was busy, a 480B model took over.
The handoff took 47 milliseconds.
47 milliseconds to fail over · Faster than a human blink (150ms)
14 maximum failover steps · Smart mode, the deepest chain
0 errors users have ever seen · From a full-chain exhaustion
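The failover behaviour in the trace can be sketched as a loop over an ordered chain. This is an illustration under assumptions: `ChainStep`, `runChain` and the fake steps are hypothetical names, and the sketch treats any non-200 response or thrown error as a reason to move on, which may be broader than what the real gateway does.

```typescript
// Hypothetical shapes; the project's real step and result types are not shown in the text.
type StepResult = { status: number; body?: string };
type ChainStep = { name: string; call: (prompt: string) => Promise<StepResult> };
type ChainOutcome =
  | { ok: true; servedBy: string; body: string; attempts: string[] }
  | { ok: false; attempts: string[] };

// Try each step in order and return the first 200 response.
async function runChain(steps: ChainStep[], prompt: string): Promise<ChainOutcome> {
  const attempts: string[] = [];
  for (const step of steps) {
    attempts.push(step.name);
    try {
      const res = await step.call(prompt);
      if (res.status === 200 && res.body !== undefined) {
        return { ok: true, servedBy: step.name, body: res.body, attempts };
      }
      // Non-200 (for example 429): fall through to the next step.
    } catch {
      // Network failure: treated the same as a failed step.
    }
  }
  // Chain exhausted: the caller would show the rate-limit message.
  return { ok: false, attempts };
}

// Fake steps standing in for real provider calls.
const demoChain: ChainStep[] = [
  { name: "primary", call: async () => ({ status: 429 }) },
  { name: "secondary", call: async () => ({ status: 200, body: "fixed code" }) },
];

runChain(demoChain, "Fix this type error").then((outcome) => {
  console.log(outcome.ok ? `served by ${outcome.servedBy}` : "all steps failed");
});
```

Keeping the provider call behind a function makes the chain easy to test with fakes, and the `attempts` list gives the kind of per-step record the trace logs.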
v2 release · eleven new capabilities

From a failover gateway to a full runtime

v2 widens the gateway from a chat completions failover into an agent runtime, a voice stack, a live-data layer and a quota-aware tool catalog. Every piece ships on the same Next.js app, same Supabase project, same MIT licence.

Intent auto-router

Regex sweep plus a tiny LLM classifier picks the mode per message. Smart, Reasoner, Coder, Fast, Live, Vision, all auto-selected, no UI toggle required.
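The regex half of that router might look something like the sketch below. The mode names come from the text; the patterns, `MODE_RULES` and `sweepForMode` are invented for illustration, and the LLM classifier that handles the no-match case is left out.

```typescript
type Mode = "Smart" | "Reasoner" | "Coder" | "Fast" | "Live" | "Vision";

// Illustrative patterns only; the project's real rules are not listed in the text.
const MODE_RULES: Array<{ mode: Mode; pattern: RegExp }> = [
  { mode: "Coder", pattern: /\b(typescript|python|stack trace|not assignable|compile error)\b/i },
  { mode: "Live", pattern: /\b(weather|exchange rate|headlines)\b/i },
  { mode: "Reasoner", pattern: /\b(prove|step by step|derive)\b/i },
];

// First matching rule wins; null means "no regex hit, hand over to the classifier".
function sweepForMode(message: string): Mode | null {
  for (const rule of MODE_RULES) {
    if (rule.pattern.test(message)) return rule.mode;
  }
  return null;
}

console.log(sweepForMode("Fix: 'Type Date is not assignable to string'")); // Coder
console.log(sweepForMode("Draft a follow-up email to the supplier")); // null
```

Running cheap regexes first means obvious cases never pay for a classifier call; only ambiguous messages do.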

Multi-step agent runner

POST /api/v1/agent runs planner, workers and synthesiser server-side, streams every step over SSE.
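Streaming "every step over SSE" comes down to writing one frame per step. A minimal sketch, assuming a hypothetical `AgentEvent` shape and using the step name as the SSE event name; the real payload format is not given in the text.

```typescript
// Hypothetical event shape for planner / worker / synthesiser steps.
type AgentEvent = { step: "planner" | "worker" | "synthesiser"; text: string };

// One SSE frame: an "event:" line, a "data:" line, then a blank line.
// JSON.stringify escapes newlines, so the payload always fits on one data line.
function toSseFrame(event: AgentEvent): string {
  return `event: ${event.step}\ndata: ${JSON.stringify({ text: event.text })}\n\n`;
}

const demoFrames = [
  toSseFrame({ step: "planner", text: "Split the task into two lookups" }),
  toSseFrame({ step: "synthesiser", text: "Here is the combined answer" }),
];
console.log(demoFrames.join(""));
```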

MCP-shaped catalog

Bearer-protected /api/v1/mcp exposes list_tools and plugins in the Model Context Protocol shape. Any MCP-aware client mounts the gateway with one URL.

TTS cascade

MeloTTS on Cloudflare Workers AI first, Gemini TTS as fallback. Streamable opus or mp3 audio.

STT route

Groq Whisper primary, Cloudflare Workers AI fallback. Sub-second transcription on a clean clip.

Zero-key live tools

Open-Meteo for weather, Frankfurter (ECB) for FX, HN Algolia for news. No tokens, no per-call cost.
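Because none of these services need a token, a live tool is little more than a URL builder. The sketch below uses the public endpoints as commonly documented; the function names are hypothetical, and the exact hosts and query parameters are assumptions to check against each provider's current docs.

```typescript
// Endpoints below are assumptions based on the providers' public docs.

// Open-Meteo: current temperature for a coordinate.
function weatherUrl(latitude: number, longitude: number): string {
  const url = new URL("https://api.open-meteo.com/v1/forecast");
  url.searchParams.set("latitude", String(latitude));
  url.searchParams.set("longitude", String(longitude));
  url.searchParams.set("current", "temperature_2m");
  return url.toString();
}

// Frankfurter: latest ECB reference rate between two currencies.
function fxUrl(from: string, to: string): string {
  const url = new URL("https://api.frankfurter.app/latest");
  url.searchParams.set("from", from);
  url.searchParams.set("to", to);
  return url.toString();
}

// HN Algolia: story search.
function newsUrl(query: string): string {
  const url = new URL("https://hn.algolia.com/api/v1/search");
  url.searchParams.set("query", query);
  url.searchParams.set("tags", "story");
  return url.toString();
}

console.log(weatherUrl(51.5, -0.12));
console.log(fxUrl("GBP", "USD"));
console.log(newsUrl("rust async"));
```

Each URL can be passed straight to `fetch`; no Authorization header is involved.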

FLUX with key rotation

Image generation rotates across Cloudflare account-and-token pairs. When one pair hits its cap, the next pair fires.

Quota tracker

GET /api/v1/quota returns per-user and company-wide counters from a Supabase view.

Smart suggestions

After each reply, three grounded follow-up prompts ready to chip into the UI.

Reasoning-leak stripper

Chain-of-thought wrappers and internal commentary scrubbed from the stream before the client sees them.
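Scrubbing a stream is trickier than scrubbing a string, because a wrapper tag can be split across two chunks. The sketch below assumes the reasoning is wrapped in `<think>...</think>`; that tag choice and the `createLeakStripper` name are assumptions, and the project may handle other wrappers too.

```typescript
// Assumes reasoning arrives wrapped in <think>...</think>; other wrappers need their own tags.
const OPEN_TAG = "<think>";
const CLOSE_TAG = "</think>";

function createLeakStripper(): { push: (chunk: string) => string; flush: () => string } {
  let buffer = "";
  let inside = false;

  // Length of the longest suffix of `text` that is a proper prefix of `tag`.
  const partialTagLength = (text: string, tag: string): number => {
    for (let n = Math.min(text.length, tag.length - 1); n > 0; n--) {
      if (text.endsWith(tag.slice(0, n))) return n;
    }
    return 0;
  };

  // Feed one chunk; returns only the text that is safe to show the client.
  const push = (chunk: string): string => {
    buffer += chunk;
    let out = "";
    for (;;) {
      const tag = inside ? CLOSE_TAG : OPEN_TAG;
      const at = buffer.indexOf(tag);
      if (at === -1) break;
      if (!inside) out += buffer.slice(0, at);
      buffer = buffer.slice(at + tag.length);
      inside = !inside;
    }
    // Hold back a possible half-received tag at the end of the buffer.
    const keep = partialTagLength(buffer, inside ? CLOSE_TAG : OPEN_TAG);
    if (!inside) out += buffer.slice(0, buffer.length - keep);
    buffer = buffer.slice(buffer.length - keep);
    return out;
  };

  // Call at end of stream: releases held-back text unless still inside a wrapper.
  const flush = (): string => {
    const rest = inside ? "" : buffer;
    buffer = "";
    return rest;
  };

  return { push, flush };
}

const demoStripper = createLeakStripper();
const visible = ["Hello <thi", "nk>secret</th", "ink> world"].map(demoStripper.push).join("") + demoStripper.flush();
console.log(JSON.stringify(visible)); // "Hello  world"
```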

Markdown to PDF, JSON to XLSX

PDFKit and ExcelJS endpoints turn any answer into a print-ready PDF or a structured spreadsheet.

v2 architecture. Auto-router → agent or mode failover → cascades and tools on the side → leak stripper, quota tracker, suggestions, export endpoints.
v1.2, Released 2026-05-03

Now it routes to other open-source repos too

The v1.1 router picked an LLM. The v1.2 router can pick an entire sibling repo first: research goes to rag-over-pdf, voice goes to voice-agent-starter, evals go to ai-eval-runner. The LLM is the fallback, not the default.

Cross-repo plugins

lib/plugins/index.ts registers ten sibling repos as routable tools: voice-agent-starter, agent-orchestrator, ai-eval-runner, local-llm-router, mcp-server-toolkit, rag-over-pdf, receipt-scanner, webhook-to-email, k8s-ops-toolkit, terraform-stack.

Each plugin has one env var that holds its endpoint. If you set RAG_OVER_PDF_URL on your deployment, the rag plugin lights up. If you don’t, it stays dormant. List everything live at GET /api/v1/plugins; dispatch via POST /api/v1/plugins/invoke.
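The "one env var per plugin" rule can be sketched as a registry lookup. Only `RAG_OVER_PDF_URL` is named in the text; the other env var names, the `Plugin` type and `listPlugins` are assumptions that follow the same pattern.

```typescript
// Hypothetical registry shape. Only RAG_OVER_PDF_URL appears in the text;
// the other env var names are guesses that follow the same naming pattern.
type Plugin = { slug: string; envVar: string };
type Env = Record<string, string | undefined>;

const PLUGINS: Plugin[] = [
  { slug: "rag-over-pdf", envVar: "RAG_OVER_PDF_URL" },
  { slug: "voice-agent-starter", envVar: "VOICE_AGENT_STARTER_URL" },
  { slug: "ai-eval-runner", envVar: "AI_EVAL_RUNNER_URL" },
];

// A plugin is enabled only when its env var holds a non-empty value.
function listPlugins(env: Env): Array<{ slug: string; enabled: boolean; endpoint: string | null }> {
  return PLUGINS.map((plugin) => {
    const value = env[plugin.envVar]?.trim();
    return { slug: plugin.slug, enabled: Boolean(value), endpoint: value ? value : null };
  });
}

console.log(listPlugins({ RAG_OVER_PDF_URL: "https://rag.example.com" }));
```

In a deployment the `env` argument would be `process.env`; passing it in keeps the function easy to test.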

Intent auto-routing

lib/services/plugin-autorouter.ts runs before the LLM. It scans the user’s message for intent keywords and, when one matches an enabled plugin, dispatches there directly. No model call, no token cost, no failover chain, just the right tool for the job.

Distinct from lib/intent.ts, which picks a mode (Smart / Reasoner / Coder / Live / Fast / Vision). The new router picks a plugin. Both can fire, plugin first, mode second. Gated by ENABLE_PLUGIN_AUTOROUTE (off by default).

User: "find me research on retrieval-augmented generation latency"
Plugin auto-router scans message
  → intent: research
  → match table: /\b(research|cite|literature)\b/i
  → slug: rag-over-pdf
  → registry lookup: RAG_OVER_PDF_URL is set → enabled
  → dispatch POST {RAG_OVER_PDF_URL}/query
  → 200 OK · streamed citations + chunks
  → result streamed straight back to browser
No LLM call. No failover chain. The right tool fired directly.
If RAG_OVER_PDF_URL was unset, the message would fall through
to the standard mode router and hit Smart mode like in v1.1.
Two-stage routing in v1.2. The plugin router gets first refusal; modes still pick up everything it doesn't claim.
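The first stage of that routing can be sketched as a match table plus two gates. The research pattern is the one shown in the trace; the voice rule, `INTENT_RULES` and `pickPlugin` are invented for illustration.

```typescript
type IntentRule = { intent: string; pattern: RegExp; slug: string };

// The research rule mirrors the trace; the voice rule is an invented example.
const INTENT_RULES: IntentRule[] = [
  { intent: "research", pattern: /\b(research|cite|literature)\b/i, slug: "rag-over-pdf" },
  { intent: "voice", pattern: /\b(transcribe|read aloud)\b/i, slug: "voice-agent-starter" },
];

// Returns the plugin slug to dispatch to, or null to fall through to the mode router.
function pickPlugin(message: string, enabledSlugs: ReadonlySet<string>, autorouteOn: boolean): string | null {
  if (!autorouteOn) return null; // mirrors ENABLE_PLUGIN_AUTOROUTE being off by default
  for (const rule of INTENT_RULES) {
    if (rule.pattern.test(message) && enabledSlugs.has(rule.slug)) return rule.slug;
  }
  return null;
}

const enabledDemo = new Set(["rag-over-pdf"]);
console.log(pickPlugin("find me research on RAG latency", enabledDemo, true)); // rag-over-pdf
console.log(pickPlugin("find me research on RAG latency", enabledDemo, false)); // null
```

A `null` result is the signal to carry on to the mode router, so a disabled or unmatched plugin never blocks a reply.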
Manus integration with persistent state

The typed Manus client (createTask, getTask, cancelTask, awaitTask) wraps the upstream task API. The webhook receiver verifies HMAC-SHA256 signatures before accepting any payload; unsigned requests get a 401 and never reach the database. Every state transition is upserted into manus_tasks via the migration at supabase/migrations/002_manus_tasks.sql. Clients poll GET /api/v1/manus/tasks/[id] to read the latest state without burning upstream quota.

// Client kicks off a long-running Manus task
POST /api/v1/manus/tasks
  → { id: "task_abc", status: "queued" }
  → upsert into manus_tasks
// Time passes. Manus pings the webhook with progress updates.
POST /api/v1/manus/webhook (X-Signature: HMAC-SHA256)
  → verify signature against MANUS_WEBHOOK_SECRET
  → valid → upsert(task_abc, status: "running")
  → invalid → 401, drop payload
// Dashboard polls without round-tripping to Manus.
GET /api/v1/manus/tasks/task_abc
  → SELECT latest FROM manus_tasks WHERE id = 'task_abc'
  → { status: "completed", output: {...} }
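The signature check in the webhook step can be sketched with Node's built-in crypto module. This assumes the header carries a hex-encoded HMAC-SHA256 of the raw request body; the encoding, and the helper names `isValidSignature` and `signBody`, are assumptions rather than details from the text.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumes a hex-encoded HMAC-SHA256 over the raw body; the real header format may differ.
function isValidSignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so reject those up front.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Helper for tests and local tooling: produce the signature a sender would attach.
function signBody(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

const demoBody = JSON.stringify({ id: "task_abc", status: "running" });
const demoSignature = signBody(demoBody, "demo-secret");
console.log(isValidSignature(demoBody, demoSignature, "demo-secret")); // true
console.log(isValidSignature(demoBody + " ", demoSignature, "demo-secret")); // false
```

The check has to run on the raw body bytes, before any JSON parsing, and a constant-time comparison avoids leaking how many leading characters of a forged signature were correct.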
/docs page

A server-rendered route lists all ten plugins with live env-var status badges (enabled / not configured), repo links, and a Manus invite call-to-action. The invite code is read from NEXT_PUBLIC_MANUS_INVITE_CODE, so every fork can drop in its own without touching source.

MAKE-IT-YOURS guide

docs/MAKE-IT-YOURS.md is the white-label rebrand guide. Includes a copy-paste v0 prompt that generates a full branded front end, the list of files where logo / colour tokens / copy live, and the full Supabase + Vercel deploy path. A non-engineer founder can clone, run the prompt, change five env vars, and ship.

Technology Choices

Why this, not that

Every technology in the stack was chosen for a reason. Here is the reasoning, and what was rejected.

Next.js 14 (App Router)

Why we use it

Server components keep API keys server-side. App Router gives file-based routing, streaming responses, and edge deployment. The entire backend is API routes, no separate server needed.

Why not the alternative

Express.js: requires a separate server process, with no built-in SSR and no edge deployment. It would need two repos (frontend + backend) instead of one.

TypeScript

Why we use it

Catches provider API shape changes at compile time. When SambaNova changes their response format, TypeScript finds every broken call site before users do. 90 tests + strict types = confidence to ship fast.

Why not the alternative

JavaScript: runtime errors from undefined properties are the #1 cause of "it worked on my machine." In a multi-provider system with 7 different API shapes, that is suicide.

Supabase (PostgreSQL)

Why we use it

Auth, database, and Row-Level Security in one service. RLS means even if my code has a bug, PostgreSQL refuses to serve User A's conversations to User B. The free tier gives 1GB storage and unlimited API calls.

Why not the alternative

Firebase: NoSQL makes cross-user queries (admin health, usage analytics) painful. No row-level security enforcement at the database level. Firestore's pricing model punishes high-read workloads like chat.

Vercel

Why we use it

Zero-config deployment for Next.js. Push to GitHub, live in 60 seconds. Edge functions for the streaming endpoints. Generous free tier (100GB bandwidth). But the app runs on ANY Next.js host, not locked in.

Why not the alternative

AWS / GCP: 50x more configuration for the same result. The choice is ECS, ALB, Route53, ACM and CloudFront, or one git push to Vercel. For a solo dev shipping an open-source project, complexity is the enemy.

SSE (Server-Sent Events)

Why we use it

One-way streaming from server to client, exactly what chat needs. Works through every proxy, CDN, and firewall. Native browser support with EventSource. No library needed on the client.

Why not the alternative

WebSockets: bidirectional, but chat only needs server→client streaming. WebSockets are blocked by many corporate proxies, require connection management, and don't work on Vercel Edge. Overkill.
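Part of the appeal of SSE is how little the client has to do. As an illustration, here is a hand-rolled parser for the wire format, useful when reading a stream with `fetch` rather than `EventSource`. It is a sketch that assumes LF line endings and ignores `event:`, `id:` and `retry:` fields; `createSseParser` is a hypothetical name.

```typescript
// Splits decoded SSE text into complete "data:" payloads, buffering partial frames.
// Assumes "\n" line endings; the SSE format also permits "\r\n".
function createSseParser(): (chunk: string) => string[] {
  let buffer = "";
  return (chunk: string): string[] => {
    buffer += chunk;
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? ""; // the last piece may be an incomplete frame
    const payloads: string[] = [];
    for (const frame of frames) {
      const data = frame
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one optional space
        .join("\n");
      if (data !== "") payloads.push(data);
    }
    return payloads;
  };
}

const parseDemo = createSseParser();
console.log(parseDemo("data: Hel")); // []
console.log(parseDemo("lo\n\ndata: world\n\n")); // [ 'Hello', 'world' ]
```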

Cloudflare R2

Why we use it

S3-compatible object storage with zero egress fees. 10GB free. Stores uploaded PDFs, Excel files, and generated images. Signed URLs expire in 7 days for security.

Why not the alternative

AWS S3: egress fees add up fast when serving images. Vercel Blob: limited free tier and vendor-locked. R2 is S3-compatible, so migration is a one-line endpoint change.

Vitest

Why we use it

Vite-native test runner, 90 tests in 800ms. Same module resolution as the Next.js build. No config file needed. Jest compatibility mode means most patterns transfer directly.

Why not the alternative

Jest: slower startup, requires separate ts-jest config, doesn't share Vite's module graph. For a project with path aliases (@/lib/...) and TypeScript, Vitest is drop-in.

Honest Comparison

SarmaLink-AI vs everything else

No marketing spin. What each option does well, what it doesn’t, and an honest verdict.

ChatGPT Plus

$20/month · Hosted AI chat by OpenAI
Strengths
  • +Best-in-class models (GPT-4o, o1)
  • +Polished UI
  • +Plugins ecosystem
Weaknesses
  • -One provider, if OpenAI is down, you're down
  • -No self-hosting
  • -No data ownership
  • -No failover
  • -Can't customise the system prompt
  • -$240/year per user
Great product. But you're renting it. You can't read the code, can't host it yourself, can't control your data. When OpenAI raises prices or deprecates a feature, you eat it.

LibreChat

Free (open source) · Multi-provider chat UI
Strengths
  • +Open source
  • +Supports multiple providers
  • +Plugin system
  • +Good UI
Weaknesses
  • -Requires Docker to deploy
  • -No automatic failover, if your selected provider fails, the message fails
  • -No built-in live tools
  • -No persistent memory
  • -Complex config
The closest alternative. But no failover means a 429 from OpenAI = user sees an error. Requires Docker knowledge to deploy. No AI-assisted setup.

OpenWebUI

Free (open source) · Self-hosted ChatGPT UI for Ollama/OpenAI
Strengths
  • +Beautiful UI
  • +Ollama integration for local models
  • +Active community
Weaknesses
  • -Designed for local models, cloud provider support is secondary
  • -No multi-provider failover
  • -Requires Docker
  • -Python backend, different ecosystem from web apps
  • -No live tools
Best choice if you want local LLMs on your own GPU. But if you want cloud providers with failover, it's not designed for that.

LobeChat

Free (open source) · Modern chat UI with plugin marketplace
Strengths
  • +Beautiful design
  • +Plugin marketplace
  • +Supports many providers
Weaknesses
  • -No automatic failover
  • -Client-side API calls expose keys in the browser
  • -No server-side auth or RLS
  • -No persistent memory across sessions
Pretty UI, but API keys live in the browser. No server-side security. Fine for personal use, risky for teams.

LiteLLM

Free (library) · Python library for multi-provider LLM calls
Strengths
  • +Supports 100+ providers
  • +Unified API
  • +Good fallback config
Weaknesses
  • -It's a library, not an app, you still have to build everything else
  • -Python only
  • -No UI, no auth, no database, no streaming, no deployment
Excellent library. But it gives you the failover engine and nothing else. No auth, no database, no streaming UI, no memory, no tools. You're building an app from scratch.

SarmaLink-AI

Free (open source) · Full-stack AI assistant with automatic failover
Strengths
  • +36 engines, 7 providers, automatic failover in <50ms
  • +Full app, auth, database, RLS, streaming, memory, tools
  • +AI-assisted setup, non-developers can deploy in 15 min
  • +White-label via env vars, zero code changes
  • +TypeScript + Next.js, the web's most popular stack
  • +MIT license
Weaknesses
  • -Newer project, smaller community than LibreChat/OpenWebUI
  • -No local model support yet (cloud providers only)
  • -Opinionated stack (Next.js + Supabase + Vercel)
The only option that gives you failover + full app + AI setup + white-labeling out of the box.
What No One Else Does

Let AI set up your AI.

Every open-source AI project requires Docker, terminal commands, and 45 minutes of documentation reading. SarmaLink-AI ships with a setup skill that any AI coding tool can use.

Every other open-source AI project
x Read a 200-line README
x Install Docker
x Run docker-compose up
x Debug port conflicts
x Manually create database tables
x Copy 15+ environment variables
x Figure out which keys are required vs optional
x Debug build errors alone
x Google the error messages
x Give up and use ChatGPT instead
SarmaLink-AI
+ Clone the repo
+ Open in Claude Code (or Cursor, Copilot, ChatGPT)
+ Say "help me set up"
+ AI walks you through creating free accounts
+ AI creates your .env.local
+ AI runs the database migration
+ AI tests every API key
+ AI builds and deploys
+ 15 minutes. Zero terminal knowledge.
+ Your AI assistant is live.
The Economics

How it costs $0

Every provider in the stack offers a free tier. Combined, they serve thousands of requests per day. No credit card needed for any of them.

Groq

Free tier: 14,000 req/day per key
Keys: Unlimited keys via Gmail aliases
126,000+ req/day

SambaNova

Free tier: 5,000 req/day per key
Keys: 8 keys
40,000 req/day

Cerebras

Free tier: 5,000 req/day per key
Keys: 4 keys
20,000 req/day

Google Gemini

Free tier: 250 req/day per key
Keys: 12 keys
3,000 req/day

OpenRouter

Free tier: 1,000 req/day (:free)
Keys: 5 keys
5,000 req/day

Cloudflare

Free tier: 10,000 neurons/day
Keys: Workers AI free tier
10,000 images/day
204,000+
combined requests per day across the tiers listed above, enough for ~15,000 daily active users
vs ChatGPT Plus: $20/user/month × 15,000 users = $300,000/month
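The arithmetic behind these numbers is easy to re-run. The sketch below copies the per-key limits and key counts from the tiers above, with two assumptions flagged in the comments: Groq is counted at 9 keys (what 126,000 / 14,000 implies), and the Cloudflare line is taken at face value as 10,000 even though it is quoted as a neuron budget. Summed this way, the listed tiers come to 204,000 requests per day.

```typescript
// Figures copied from the tier list above. Groq's key count (9) is implied by
// 126,000 / 14,000, and Cloudflare's 10,000 is a neuron budget taken at face value.
const tiers: Array<{ provider: string; perKey: number; keys: number }> = [
  { provider: "Groq", perKey: 14_000, keys: 9 },
  { provider: "SambaNova", perKey: 5_000, keys: 8 },
  { provider: "Cerebras", perKey: 5_000, keys: 4 },
  { provider: "Google Gemini", perKey: 250, keys: 12 },
  { provider: "OpenRouter", perKey: 1_000, keys: 5 },
  { provider: "Cloudflare", perKey: 10_000, keys: 1 },
];

const combinedPerDay = tiers.reduce((sum, tier) => sum + tier.perKey * tier.keys, 0);
const dailyActiveUsers = 15_000;
const requestsPerUser = combinedPerDay / dailyActiveUsers;
const chatGptPlusMonthly = 20 * dailyActiveUsers;

console.log(combinedPerDay); // 204000
console.log(requestsPerUser.toFixed(1)); // 13.6
console.log(chatGptPlusMonthly); // 300000
```

At roughly 13.6 requests per user per day, the ~15,000 figure assumes fairly light usage per person; heavier users would lower the headcount the free tiers can carry.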
Who It’s For

Built for builders

Solo developers

Replace ChatGPT Plus with your own instance. Persistent memory, image gen, live tools. $0 monthly.

Startups

White-label it as your product. Change the name with one env var. MIT license, no strings attached.

Agencies

Deploy for clients as a value-add service. Each client gets their own Supabase project, their own data.

Internal tools teams

HR policies, finance lookups, ops runbooks. RLS keeps per-user data separate. Admin health dashboard built in.

Non-technical founders

AI-assisted setup means you can deploy without writing code. Clone, let AI set it up, done.

Open source contributors

TypeScript, Next.js, Supabase, the most popular web stack. 90 tests, clean architecture, well-documented.

Ready to try it?

Clone the repo. Let AI set it up. Deploy your own AI assistant in 15 minutes.

Built by Sarma Linux · MIT License · v2.0.0