Capability

AI Engineering

I build AI systems that run in production: multi-provider LLM gateways with automatic failover, RAG pipelines grounded in your data, eval harnesses that measure quality, and voice agents with sub-second latency.

At a glance

Backed by public open-source code, not just a description on a page.
Long-form essays on the same topics, with sources cited.
Production patterns your team can lift straight into its own stack.

About Sarma

Sarma is a UK-based software engineer running Sarmalinux as a one-person studio. He ships nineteen open-source repositories spanning LLM gateways, coding agents, inference, storage engines and consensus, and writes long-form engineering essays at sarmalinux.com/blog. Senior IC, end to end.

Most AI integrations break in one of three ways: they hit a single provider that goes down, they answer confidently with hallucinated facts, or they can't be measured so nobody knows if they are getting worse. I have shipped SarmaLink-AI, a 36-engine, 7-provider gateway with sub-50ms failover, as well as a production MCP server toolkit, a local-LLM router, a WebRTC voice agent, and a DuckDB-backed eval runner. When I take on AI work for a client I bring the same rigour: provider redundancy, grounding, evals, and observable production instrumentation.

What this covers in practice

Multi-provider LLM gateway

Design and deploy an OpenAI-compatible gateway that routes across providers (OpenAI, Groq, Anthropic, Mistral, Ollama, and more) with automatic failover. SarmaLink-AI is the production reference.
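
To make the failover idea concrete, here is a minimal sketch of an ordered provider chain. It is not SarmaLink-AI's actual code: the function names, the `AllProvidersFailed` exception, and the two stub providers are all hypothetical, and a real gateway would add timeouts, health checks, and narrower error handling.

```python
from collections.abc import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed (hypothetical name)."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, completion)."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # sketch only: a real gateway would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real SDK calls (assumptions for illustration).
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def healthy_fallback(prompt: str) -> str:
    return f"echo: {prompt}"


name, text = complete_with_failover(
    "ping", [("primary", flaky_primary), ("fallback", healthy_fallback)]
)
```

In this run the primary raises, so the call is answered by the fallback; the caller sees one successful response plus the name of the provider that served it.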

RAG pipelines

Retrieval-augmented generation grounded in your documents or database. Chunking strategy, embedding model selection, retrieval quality evaluation, re-ranking, and answer generation in one tested pipeline.
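
The chunk-then-retrieve core of such a pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: a bag-of-words cosine score stands in for a real embedding model, there is no re-ranking step, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into word windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


docs = [
    "Refunds are processed within five working days",
    "Our office is closed on public holidays",
    "Shipping is free for orders over fifty pounds",
]
top = retrieve("how long do refunds take", docs, k=1)
```

The retrieved chunks would then be placed in the prompt so the model answers from them rather than from memory.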

Eval harnesses

Evals-as-code with DuckDB persistence and a FastAPI/HTMX viewer. Run regression tests on prompts, compare model outputs across versions, and catch regressions before production.
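
A regression gate of this kind can be sketched as follows. The example keeps everything in memory and leaves out the DuckDB persistence and the viewer; `EvalCase`, `pass_rate`, `has_regressed`, and the two stub models are hypothetical names used only for illustration.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # substring the answer is expected to include


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    passed = sum(
        1 for c in cases if c.must_contain.lower() in model(c.prompt).lower()
    )
    return passed / len(cases)


def has_regressed(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True when the candidate scores below the baseline by more than the tolerance."""
    return candidate < baseline - tolerance


# Stub models standing in for two prompt or model versions (assumptions).
def model_v1(prompt: str) -> str:
    return {"Capital of France?": "Paris", "2 + 2 = ?": "4"}[prompt]


def model_v2(prompt: str) -> str:
    return {"Capital of France?": "Paris", "2 + 2 = ?": "5"}[prompt]


cases = [EvalCase("Capital of France?", "Paris"), EvalCase("2 + 2 = ?", "4")]
baseline = pass_rate(model_v1, cases)
candidate = pass_rate(model_v2, cases)
regressed = has_regressed(baseline, candidate)
```

In CI, a check like `has_regressed` would fail the build so the worse version never reaches production.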

Voice agents

Real-time voice loops with WebRTC + mediasoup, Deepgram STT, LLM turn, and TTS, targeting sub-second end-to-end latency. Based on voice-agent-starter.

MCP servers

Production-grade Model Context Protocol servers built with FastAPI and Python 3.12: typed tool schemas, resource endpoints, auth, and full test coverage.

Agent orchestration

Durable multi-agent workflows with deterministic replay using Postgres + Drizzle + BullMQ. Handles long-running tasks and retries, and includes a visual Inspector UI.

Stack

Python 3.12 / uv, FastAPI, TypeScript, Next.js 16, OpenAI SDK, Groq, Anthropic, Ollama, DuckDB, Postgres + Supabase, BullMQ, mediasoup, Deepgram, Resend, Vercel

Recent work in this lane

Open-source repositories

Related writing

What a hiring team gets

No single-provider lock-in: failover is automatic

Grounded answers, not hallucinated confidence

Evals in CI so quality regressions surface early

Observable: latency, cost, and error rates tracked

Production reference code (open-source) you can read first

Documented handoff, so your team can maintain it

Read the evidence

Open the public repositories, browse past work, then look at the hiring page if a PAYE arrangement fits your team.

Open-source repositories | Past work | Hire me (PAYE only)