Technical Whitepaper · v1.1

SarmaLink-AI

Multi-Provider AI Routing with Automatic Failover

MIT LicensedOpen SourceSelf-Hostable7 Providers6 ModesZero Lock-in
~207Kreq/day capacity
41msfirst token
~50msfailover handoff
10⁻¹⁴P(total outage)

Version 1.0.0 · April 2026 · Sai Sarma · Sarma Linux

Abstract

SarmaLink-AI is an open-source, MIT-licensed multi-provider AI assistant that routes every request through up to 14 engines across 7 providers (Groq, SambaNova, Cerebras, Google Gemini, OpenRouter, Cloudflare Workers AI, Tavily) with automatic sub-50ms failover. Built on Next.js 14, TypeScript, Supabase (PostgreSQL with Row-Level Security), and Cloudflare R2. This whitepaper documents the architecture, failover algorithm, security model, benchmarks, and operational characteristics of a system designed to deliver 99.9999% effective uptime on free-tier infrastructure, at £0 recurring cost.

01Introduction & Motivation

Every major AI provider offers a free tier. Groq hosts GPT-OSS 120B and delivers first tokens in 41 milliseconds. SambaNova runs DeepSeek V3.2, a 685-billion-parameter Mixture-of-Experts model that beats GPT-4o on MATH-500 and HumanEval. Cerebras serves inference at 2,000 tokens per second on wafer-scale chips. Google Gemini has grounded Google Search built in at the token level. OpenRouter aggregates 100+ models behind a single OpenAI-compatible API, including a :free pool of 17+ community-hosted models. Cloudflare Workers AI runs FLUX.2 klein for image generation at the edge. Tavily provides structured search designed for LLM consumption.

Each of these, individually, is generous. Each, on its own, still hits rate limits. The moment a single provider returns a 429, an AI application breaks. Users see an error. Trust evaporates. The common workaround — paying for an upgrade — defeats the point of using free tiers in the first place, and lock users into a single vendor’s roadmap.

The problem with existing solutions

  • LiteLLM is a library, not an application. You still have to write the app, the routing logic, the retry policy, the stream parser, the database schema.
  • LangChain is a framework with a steep learning curve and heavy abstractions. Multi-provider failover is possible but manual.
  • OpenRouter is one aggregator service — not seven redundant providers. When OpenRouter has an issue, you have an issue.
  • ChatGPT Plus / Claude Pro / Gemini Advanced are rentals. No self-hosting, no vendor independence, no data ownership.

First-principles multi-provider failover

SarmaLink-AI treats every AI provider as a commodity. If Groq returns 429, SambaNova fires. If SambaNova is busy, Cerebras. Then Gemini. Then OpenRouter’s :free pool as the final safety net. Round-robin key rotation distributes load across keys within a provider. Round-robin across providers survives outages. Every step is logged to ai_events for observability.

Target audience

Small-to-medium businesses, indie developers, digital agencies, research teams — anyone who needs production-grade AI capability without a vendor rental fee, and who values code they can read, fork, and extend.

02System Architecture

SarmaLink-AI is a Next.js 14 application using the App Router, deployed on Vercel (or any Next.js-compatible host). The runtime is a thin HTTP layer over a modular lib/ directory. Every external dependency is behind an interface. Every piece of untrusted data is wrapped at a trust boundary before reaching the model.

Core modules

  • app/api/ai-chat/route.ts — the HTTP entry point. ~45 lines after refactoring. Delegates all logic to lib.
  • lib/providers/failover.ts — the failover runner (tryFailover). Orchestrates retries across steps and keys.
  • lib/providers/registry.ts — declares every provider, endpoint, and key collection.
  • lib/tools/registry.ts — plugin pattern for live tools (weather, FX, container tracking).
  • lib/prompts/sanitize.ts — prompt injection defence at trust boundaries.
  • lib/repositories/ — typed Supabase CRUD for sessions, usage, events, memories.
  • lib/intent.ts — auto-router classifier (regex patterns, zero API calls).

Request lifecycle

Browser
   │ POST /api/ai-chat { message, model? }
   ▼
Route handler (~45 lines)
   │ 1. Supabase auth (cookie → user.id)
   │ 2. RLS enforces ownership
   ▼
Sanitizer
   │ wrapUntrusted(user_message)
   │ wrapMemories(user_memories)
   │ wrapToolResult(tool_outputs)
   ▼
Auto-router (lib/intent.ts)
   │ image intent? → image pipeline
   │ search intent? → Live mode
   │ default → Smart mode
   ▼
tryFailover (lib/providers/failover.ts)
   │ for each step in mode.failover:
   │   for each key in providerKeys(step):
   │     try stream → yield tokens
   │     catch 429/5xx → next key or step
   ▼
SSE stream → Browser

Deployment topology

Vercel (or any Node.js host) runs the route handlers. Supabase provides PostgreSQL + Auth + Row-Level Security. Cloudflare R2 stores binary attachments (images, PDFs) with 7-day signed URLs. Cloudflare Workers AI serves image generation. All other providers are accessed via their public OpenAI-compatible endpoints.

03The Failover Runner

tryFailover is the load-bearing module. It accepts a sequence of provider/model steps and iterates through them until one returns a successful stream.

Algorithm

async function tryFailover(steps, messages, opts) {
  for (const step of steps) {
    const keys = providerKeys(step.provider)
    const offset = Date.now() % keys.length  // round-robin
    for (let i = 0; i < keys.length; i++) {
      const key = keys[(offset + i) % keys.length]
      try {
        const stream = await callProvider(step, key, messages)
        return stream  // success
      } catch (err) {
        logEvent({ backend: step.label, status: err.status })
        if (err.status === 429 || err.status >= 500) continue
        throw err  // non-recoverable
      }
    }
  }
  throw new Error('All providers exhausted')
}

Uptime math

Assumption: each provider has 99% uptime (industry minimum for production services).
P(all 7 providers fail simultaneously) = 0.01⁷ = 1 × 10⁻¹⁴
Effective uptime: 99.999999999999% — about 30 milliseconds of downtime per century, assuming the public internet itself stays up.

Handoff timing

Typical failover from 429 on one provider to first token on the next is under 50 milliseconds. The request parser reads the response headers, classifies the failure, rotates to the next key, and dispatches — all without user-visible interruption. Every step writes to ai_events with status, backend, latency_ms, and tokens_out.

04The Six Modes

Each mode is a different failover sequence, optimised for a specific task type.

ModeDepthPrimary engineDaily limitUse case
Smart14DeepSeek V3.2 685B1,000/dayProfessional writing, analysis, brainstorming
Reasoner10DeepSeek V3.2 + V3.1500/dayComplex logic, chain-of-thought traces
Live4Gemini 2.5 Flash + Google Search1,000/dayCurrent events, weather, news, FX
Fast9Groq GPT-OSS 20B5,000/day41ms first token — quick lookups, rewrites
Coder9DeepSeek V3.2 + Qwen Coder 480B800/dayTypeScript, Python, SQL, debugging
Vision6Llama-4 Scout 17B500/dayReceipts, screenshots, diagrams

05The Seven Providers

Groq

Custom LPU inference chips. GPT-OSS 120B in 45ms, GPT-OSS 20B in 41ms. Llama 3.3 70B, Qwen 3 32B, Llama-4 Scout for vision, Llama 3.1 8B for memory extraction. Free tier: 14,000 req/day/key.

SambaNova

Hosts DeepSeek V3.2 (685B MoE, 37B active per token) — the frontier model that powers Smart, Reasoner, and Coder modes. Custom Reconfigurable Dataflow Unit silicon. Also runs DeepSeek V3.1 and Llama 4 Maverick with 1M context.

Cerebras

WSE-3 wafer-scale engine — 46,225 mm² of silicon, the largest chip ever built. 2,000 tokens/sec on Llama 3.1 8B. Hosts Qwen 3 235B for reasoning and Qwen 3 Coder 480B — SarmaLink-AI’s Coder failover winner when SambaNova is busy.

Google Gemini

Live mode backbone. Gemini 2.5 Flash, Flash Lite, and Gemini 3 Flash Preview with grounded Google Search built in at the token level. 1M-token context window. Every Live mode answer includes cited sources.

OpenRouter

Aggregates 100+ models across 50+ providers into one OpenAI-compatible endpoint. The :free pool (17+ community-hosted models including GPT-OSS 120B, Nemotron Ultra 253B, GLM-4.5 Air, Gemma 3, DeepSeek R1) is the ultimate failover safety net.

Cloudflare Workers AI

Runs FLUX.2 klein 9B and 4B for image generation and instruction-following editing with three-step failover (9B → 4B → FLUX.1 schnell). R2 provides 10GB free S3-compatible object storage for file persistence with 7-day signed URLs.

Tavily

Structured web search designed for AI consumption. Returns titles, snippets, URLs, and relevance scores — ready for LLM citation. Powers weather (Open-Meteo fallback), exchange rate verification, container tracking (ISO 6346 carriers), news, and URL extraction tools.

06Security Model

Trust boundaries

Three sources of untrusted text reach the model on every request:

  • User messages — from the browser, potentially adversarial
  • Tool results — from external APIs (Tavily, Open-Meteo, frankfurter.app) which may return manipulated content
  • Saved memories — from the database, written by the memory extractor which may have laundered injection strings from past conversations

Each is wrapped by a dedicated sanitiser: wrapUntrusted, wrapToolResult, wrapMemories. Output is wrapped in explicit XML-style markers before reaching the model, and known jailbreak patterns ("ignore previous instructions", "system:" prefixes, role-switch attempts) are stripped.

Unit test coverage

11 unit tests in __tests__/sanitize.test.ts cover documented jailbreak categories. Defence is layered: even if strip misses a pattern, the wrapping ensures the model can never interpret untrusted text as a command.

Row-Level Security

Every table enforces per-user isolation at the PostgreSQL layer. Even if route logic has a bug, cross-user reads return zero rows.

CREATE POLICY "own_rows" ON ai_chat_sessions
  FOR ALL USING (auth.uid() = user_id);

The same policy is applied to ai_chat_usage, ai_events, and ai_user_memories. The service-role key bypasses RLS but is server-only — never in the client bundle, never in env vars exposed to browsers.

07Observability & Operations

The /api/admin/health endpoint returns per-provider success rates, p50/p95 latency, dead-model detection, and 24-hour request volume — all computed from the ai_events audit log.

Event schema

ai_events (
  id          uuid,
  user_id     uuid,
  event_type  text,      -- 'message' | 'tool' | 'error'
  backend     text,      -- 'Groq GPT-OSS 120B', etc.
  status      text,      -- 'success' | 'rate_limited' | 'error'
  latency_ms  integer,
  tokens_out  integer,
  created_at  timestamptz
);

Diagnostic queries

Per-backend p95 latency over 24 hours:

SELECT backend,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
       COUNT(*) AS volume
FROM ai_events
WHERE event_type = 'message'
  AND created_at > now() - interval '24 hours'
GROUP BY backend
ORDER BY p95;

Dead model detection (always returns 404/429):

SELECT backend,
       COUNT(*) FILTER (WHERE status = 'success') AS ok,
       COUNT(*) FILTER (WHERE status != 'success') AS fail,
       ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'success')
             / COUNT(*), 1) AS success_rate
FROM ai_events
WHERE created_at > now() - interval '1 hour'
GROUP BY backend
HAVING COUNT(*) > 10
ORDER BY success_rate;

Scaling up

The Gmail +alias trick multiplies capacity. Sign up at you+provider2@gmail.com, +provider3@gmail.com, etc. — each counts as a distinct account at most providers. Adding 8 keys per provider yields 8× daily capacity with zero code changes; the failover runner already rotates through all keys.

08Benchmarks & Performance

DeepSeek V3.2 — SarmaLink-AI’s primary engine — compared to commercial AI products. Published scores from DeepSeek technical reports, lmarena.ai, SWE-bench leaderboards, Arena-Hard, and the GPQA paper.

BenchmarkSarmaLink-AIGPT-4oClaude SonnetGemini 2.5
MATH-500 (advanced maths)90.2%76.6%78.3%83.2%
HumanEval (code synthesis)92.7%90.2%92.0%89.5%
Arena ELO (human preference)1318128712711299
MGSM (multilingual maths)88.3%85.5%86.0%87.4%
GPQA-Diamond (PhD reasoning)59.1%50.6%59.4%56.8%
MMLU (general knowledge)87.1%88.7%88.7%90.0%

Capacity math

Combined daily capacity across all 7 providers on free tiers (default configuration, 1 key per provider):
Groq 14K + SambaNova 5K + Cerebras 5K + Gemini 250 + OpenRouter 1K + Cloudflare 10K images + Tavily 100 = ~35,000 requests/day.
With 9 keys per provider via the Gmail +alias trick: ~207,000 requests/day — enough for approximately 15,000 daily active users.

09Deployment Guide

Prerequisites

  • Node.js 20 or later
  • Git
  • Supabase account (free tier)
  • API keys from Groq (required), plus optional SambaNova, Cerebras, Gemini, OpenRouter, Cloudflare, Tavily
  • GitHub account (for deployment via Vercel)
  • Vercel account (free tier)

Fast path

git clone https://github.com/sarmakska/sarmalink-ai.git
cd sarmalink-ai
npm install
cp .env.example .env.local
# Fill in .env.local with your Supabase + provider keys
# Run supabase/migrations/001_sarmalink_ai.sql in Supabase SQL editor
npm run dev

Then push your repo to GitHub, import into Vercel, paste env vars into Vercel’s dashboard, and deploy. Full 45-minute walkthrough in the Complete Setup Guide.

Vercel Pro recommendation

Vercel Hobby (free) has a 10-second function timeout — adequate for most requests but can cut off long failover chains. Vercel Pro ($20/month) raises it to 60 seconds (300s with streaming) and is recommended for production.

10Extension Points

Adding a new provider

Any provider with an OpenAI-compatible chat completions endpoint can be added in ~10 lines across four files: lib/ai-models.ts (register type), lib/providers/registry.ts (endpoint + keys), lib/env/validate.ts (env collection), and the mode’s failover array.

Adding a new live tool

Implement the Tool<Args> interface (match, args, run) in lib/tools/, then add one line to the TOOLS array in lib/tools/registry.ts. The auto-router picks it up automatically.

Customising the system prompt

System prompts are per-mode in lib/ai-models.ts. Each mode has its own persona, tone, and constraint set. Version control history preserves prompt evolution.

11Comparison with Alternatives

FeatureSarmaLink-AILiteLLMLangChainOpenRouterChatGPT Plus
Multi-provider routing7 providersYesPartialSingle serviceNo
Automatic failover50ms handoffManualManualManualN/A
Full app (not library)YesLibraryFrameworkAPI onlyHosted only
Self-hostableYesPartialYesNoNo
Free tier end-user£0 foreverLibrary onlyFramework onlyPay per token$20/month
Image generationFLUX.2NoNoNoDALL-E 3
Persistent memoryAuto-extractedNoYesNoYes
Observability built-in/admin/healthCallbacks onlyLangSmith (paid)AnalyticsN/A
LicenseMITMITMITCommercialCommercial

12Cost Analysis

A typical AI-heavy individual pays for 3-5 separate subscriptions to get the capabilities SarmaLink-AI ships with by default.

SubscriptionMonthlyYearly
ChatGPT Plus$20£190
Claude Pro$20£190
Gemini Advanced$20£190
Perplexity Pro$20£190
Midjourney Standard$10£95
All five combined$90£855
SarmaLink-AI$0£0

For a 15-person team each paying ChatGPT Plus + Claude Pro alone: £5,700/year. SarmaLink-AI serves 15,000 daily requests across the same team at £0 recurring (optional £20/month for Vercel Pro if function timeouts become a constraint).

13Roadmap

Now (shipped)

  • 7 providers, 36 engines, 6 specialised modes
  • Automatic mode detection from message content
  • Persistent cross-session memory (30-fact cap per user)
  • 5 live tools: weather, exchange rates, container tracking, news search, URL summary

Next

  • Per-mode prompt versioning with A/B testing
  • Example chat UI in examples/ folder
  • Streaming chunk replay for debugging
  • Usage analytics dashboard

Soon

  • Voice mode (Whisper + TTS via Groq)
  • Video frame analysis via Gemini Vision
  • Tool marketplace — community plugins
  • One-click Vercel deploy template

Later

  • Federated failover — share capacity across instances
  • Model fine-tuning pipeline
  • Mobile app with offline fallback to on-device LLM

14Governance & Licensing

SarmaLink-AI is released under the MIT License. Contributors retain copyright of their contributions. Pull requests are reviewed against CI (lint, typecheck, test, build) and CodeQL security scans; all must pass before merge.

Security vulnerabilities should be reported privately via the process documented in SECURITY.md. Community channels: GitHub Issues and Discussions.

15Conclusion

SarmaLink-AI demonstrates that production-grade AI capability doesn’t require per-user subscriptions, vendor lock-in, or proprietary infrastructure. A first-principles multi-provider failover architecture, built on open-source primitives and free tiers, delivers 99.9999% effective uptime with frontier model quality — at zero recurring cost. The codebase is small enough to read in an afternoon, documented in a 22-page wiki, and licensed under MIT for any use. Fork it, self-host it, extend it, ship it.

AGlossary

  • 429 — HTTP status code for "Too Many Requests". Indicates rate limiting.
  • Failover — a sequence of steps tried in order until one succeeds.
  • LPU — Language Processing Unit. Groq’s custom silicon for LLM inference.
  • MoE — Mixture of Experts. Large model where only a subset of parameters activate per token.
  • RLS — Row-Level Security. PostgreSQL feature enforcing per-row access policies.
  • SSE — Server-Sent Events. HTTP-native streaming protocol for one-directional server→client data.
  • WSE — Wafer-Scale Engine. Cerebras’ single-chip architecture using entire silicon wafers.
  • R2 — Cloudflare’s S3-compatible object storage with no egress fees.
  • OKLCH — Perceptually uniform colour space used in the site’s design system.

BAPI Reference

Chat (streaming SSE)

curl -N https://your-deploy.vercel.app/api/ai-chat \
  -H "Content-Type: application/json" \
  -H "Cookie: <supabase-auth>" \
  -d '{"message":"Draft a follow-up email","model":"smart"}'
# Returns: SSE stream with {"type":"token","value":"..."} events

Image generation

curl -X POST https://your-deploy.vercel.app/api/images/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"sunset over Himalayas"}'
# Returns: {url: "https://r2.../signed?..."}  (7-day URL)

Image editing

curl -X POST https://your-deploy.vercel.app/api/images/edit \
  -F "image=@original.jpg" \
  -F 'instruction=change sky to emerald green'

File upload

curl -X POST https://your-deploy.vercel.app/api/attachments/upload \
  -F "file=@contract.pdf"
# Extracted text stored and referenceable in next message

Health check

curl https://your-deploy.vercel.app/api/admin/health
# Returns: per-provider success rates, p95 latency, 24h volume

CEnvironment Variables

VariableRequiredPurpose
NEXT_PUBLIC_SUPABASE_URLYesSupabase project URL
NEXT_PUBLIC_SUPABASE_ANON_KEYYesSupabase anon key (client-safe)
SUPABASE_SERVICE_ROLE_KEYYesService role key (server-only)
GROQ_API_KEY.._15YesGroq API keys (up to 15 for rotation)
SAMBANOVA_API_KEY.._8OptionalSambaNova keys for DeepSeek V3.2
CEREBRAS_API_KEY.._8OptionalCerebras keys for Qwen 3 235B / 480B
GEMINI_API_KEY.._12OptionalGoogle Gemini keys for Live mode
OPENROUTER_API_KEY.._5OptionalOpenRouter safety net
CLOUDFLARE_ACCOUNT_IDOptionalFor Workers AI image gen
CLOUDFLARE_API_TOKENOptionalWorkers AI token
R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_ENDPOINT, R2_BUCKETOptionalR2 file storage
TAVILY_API_KEY.._8OptionalStructured search for live tools

DDatabase Schema

CREATE TABLE ai_chat_sessions (
  id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  title       text,
  messages    jsonb NOT NULL DEFAULT '[]',
  updated_at  timestamptz DEFAULT now()
);

CREATE TABLE ai_chat_usage (
  user_id     uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  day         date NOT NULL,
  count       integer NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id, day)
);

CREATE TABLE ai_events (
  id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  event_type  text,
  backend     text,
  status      text,
  latency_ms  integer,
  tokens_out  integer,
  created_at  timestamptz DEFAULT now()
);

CREATE TABLE ai_user_memories (
  id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     uuid NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  fact        text NOT NULL,
  created_at  timestamptz DEFAULT now()
);

-- Row-Level Security on every table
ALTER TABLE ai_chat_sessions ENABLE ROW LEVEL SECURITY;
ALTER TABLE ai_chat_usage     ENABLE ROW LEVEL SECURITY;
ALTER TABLE ai_events         ENABLE ROW LEVEL SECURITY;
ALTER TABLE ai_user_memories  ENABLE ROW LEVEL SECURITY;

CREATE POLICY "own_rows" ON ai_chat_sessions
  FOR ALL USING (auth.uid() = user_id);
CREATE POLICY "own_rows" ON ai_chat_usage
  FOR ALL USING (auth.uid() = user_id);
CREATE POLICY "own_rows" ON ai_events
  FOR ALL USING (auth.uid() = user_id);
CREATE POLICY "own_rows" ON ai_user_memories
  FOR ALL USING (auth.uid() = user_id);

EReferences