A sub-second voice agent loop, end to end.
Speak. The model speaks back. Cut in any time. No per-minute provider fees on the default stack.
Voice Agent Starter is the full-duplex voice loop most teams half-finish. The browser captures microphone audio and streams it over WebSocket. A Fastify server runs a streaming STT, LLM, and TTS pipeline with voice activity detection. Barge-in cancels the in-flight LLM and TTS streams the moment you start speaking, and the LLM can call server-side tools mid-turn through function-call passthrough. Every layer is a pluggable adapter behind a small interface.
Why this exists
Voice agents are easy to demo and brutal to ship. The one-to-one demo on a fast network sounds magical. The first time a real user is on a coach, on patchy 4G, with a five-year-old laptop, every shortcut a demo can take is exposed: negotiations stall, half-duplex audio breaks the moment someone interrupts, and the TTS finishes saying “please hold while I check” thirty seconds after the user has already moved on.
The half-decent reference implementations on GitHub mix transports and assume you have already built the streaming pipeline. The polished commercial offerings handle the hard parts but cost between $600 and $5,000 a month per instance and lock the audio path inside their cloud.
Voice Agent Starter is the open-source middle. A clean state machine, a clean barge-in path, pluggable STT, LLM, and TTS so the provider on the other end is your choice, and a self-hosted default stack that runs on Groq, Whisper.cpp, and OpenTTS with no per-minute fees.
What is in the box
Every feature below ships in the public repository today. Clone, configure, run.
Browser microphone capture
AudioContext resampled to 16 kHz mono, converted to PCM16, base64-encoded, sent as JSON over WebSocket. The orchestrator does not care which transport carried the frames.
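The conversion and framing step described above could look roughly like the sketch below. The function names `floatToPcm16` and `encodeAudioFrame` are hypothetical and not taken from the repository; resampling to 16 kHz is assumed to have happened already.

```typescript
// Sketch only: names and exact rounding are assumptions, not repository code.

/** Convert Web Audio float samples in [-1, 1] to signed 16-bit PCM. */
function floatToPcm16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

/** Wrap one PCM16 frame in the { type, payload } JSON message. */
function encodeAudioFrame(pcm: Int16Array): string {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  // btoa is available in browsers and in recent Node.js versions.
  return JSON.stringify({ type: "audio", payload: btoa(binary) });
}
```

At 16 kHz mono, a 20 ms frame is 320 samples, so each message carries 640 bytes of PCM before base64 encoding.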
Duplex state machine
IDLE to LISTEN to THINK to SPEAK, defined in apps/server/src/pipeline/orchestrator.ts. The orchestrator owns one voice session and never blocks: every provider call is a stream consumed with for await.
Real barge-in
When the VAD detects speech mid-turn, the orchestrator aborts the in-flight LLM and TTS streams through AbortController, resets the STT and TTS adapters, emits a barge-in control message, and drops to LISTEN.
Streaming STT
Whisper.cpp by default with growing-window transcription. Voice frames feed the STT for live partials; trailing silence flushes the adapter for a final transcript that triggers the move to THINK.
Streaming LLM
Groq Llama 4 by default on the LPU stack. The first token flips the machine to SPEAK and is fed straight into TTS, so audio starts before the completion finishes.
Sentence-by-sentence TTS
OpenTTS Coqui XTTS v2 by default. Synthesises sentence by sentence and streams PCM chunks back to the client as base64. First audible response under one second on the self-hosted stack.
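Sentence-by-sentence synthesis depends on cutting the streamed LLM tokens at sentence boundaries. A minimal sketch of that buffering, assuming a simple punctuation rule; the `SentenceBuffer` class is hypothetical, and the repository's own splitter lives in adapters/audio.ts and may differ:

```typescript
// Sketch only: a punctuation-based splitter, not the repository's implementation.
class SentenceBuffer {
  private buffer = "";

  /** Add a streamed token; return any sentences that are now complete. */
  feed(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace. Requiring the
    // whitespace avoids cutting "3." while "14" may still be on its way.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  /** Flush whatever is left when the LLM stream ends. */
  end(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}
```

Each returned sentence can be handed to the TTS adapter immediately, which is what lets audio start before the completion finishes.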
Function-call passthrough
Registered tools are advertised to the LLM on every call. The shared SSE reader assembles tool_calls deltas; the orchestrator runs the matching handler, appends an assistant turn and a tool turn, and re-streams so the model finishes with grounded data.
Pluggable adapter contract
STT, LLM, and TTS each implement a small TypeScript interface. Defaults are self-hosted; alternatives include Deepgram, OpenAI Whisper, OpenAI, SarmaLink-AI, Cartesia, and ElevenLabs.
RMS voice activity detection
A simple RMS-threshold VAD in pipeline/vad.ts drives the state transitions. Clean seam for silero-vad-onnx if a heavier VAD is needed.
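An RMS-threshold check on a PCM16 frame can be sketched as follows. The function names and the default threshold of 0.02 are assumptions for illustration, not values taken from pipeline/vad.ts.

```typescript
// Sketch only: the threshold and names are illustrative assumptions.

/** Root-mean-square energy of a PCM16 frame, normalised to the range 0..1. */
function frameRms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] / 32768;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / frame.length);
}

/** Treat a frame as speech when its energy reaches the threshold. */
function isSpeech(frame: Int16Array, threshold = 0.02): boolean {
  return frameRms(frame) >= threshold;
}
```

A fixed threshold is cheap and predictable but sensitive to background noise, which is the usual reason to move to a model-based VAD later.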
Shared SSE reader
The three OpenAI-compatible LLM adapters share apps/server/src/adapters/llm/sse.ts. Adding a fourth OpenAI-compatible provider is a handful of lines, including streamed tool_calls.
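In the OpenAI-compatible streaming format, a tool call arrives as fragments keyed by `index`, with the `arguments` string split across chunks. The assembly step could be sketched like this; `assembleToolCalls` and its types are hypothetical names, not the reader's real exports:

```typescript
// Sketch only: types and function name are assumptions.
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface AssembledToolCall {
  id: string;
  name: string;
  arguments: string; // JSON text, complete only once the stream ends
}

/** Merge fragmented tool_calls deltas into complete calls, keyed by index. */
function assembleToolCalls(deltas: ToolCallDelta[]): AssembledToolCall[] {
  const calls: AssembledToolCall[] = [];
  for (const delta of deltas) {
    const call = (calls[delta.index] ??= { id: "", name: "", arguments: "" });
    if (delta.id) call.id = delta.id;
    if (delta.function?.name) call.name += delta.function.name;
    if (delta.function?.arguments) call.arguments += delta.function.arguments;
  }
  // Drop any gaps left by indices that never appeared.
  return calls.filter(Boolean);
}
```

The arguments should only be parsed with `JSON.parse` after the final fragment, since intermediate strings are not valid JSON.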
Bounded tool rounds
Tool rounds are bounded by maxToolRounds to guard against loops. Handler errors are returned to the model rather than crashing the session, so the agent recovers in-turn.
No-keys offline mode
The state machine, barge-in, and tool calls all run without provider keys. Set keys or point the self-hosted URLs at running servers to get real transcripts and audio.
Architecture, the duplex loop
One Orchestrator per voice session. Created on the /voice WebSocket open, disposed on close. Every box maps to a real file in apps/server/src.
apps/server/src/index.ts: Fastify server, /health and /voice WebSocket, message dispatch.
apps/server/src/pipeline/orchestrator.ts: Duplex state machine, barge-in, function-call passthrough.
apps/server/src/pipeline/tools.ts: Tool registry and default tools (get_time, add_numbers).
apps/server/src/pipeline/vad.ts: RMS-threshold voice activity detection.
apps/server/src/adapters/audio.ts: PCM and WAV conversion, sentence splitting.
apps/server/src/adapters/llm/sse.ts: Shared OpenAI-compatible SSE reader and wire-format mapping.
apps/server/src/adapters/{stt,llm,tts}/*.ts: Provider adapters and registries selected by env var.
apps/web/app/page.tsx: Next.js browser client with microphone capture.
State machine: the duplex loop
One Orchestrator per voice session, created on the /voice WebSocket open and disposed on close. Implemented in apps/server/src/pipeline/orchestrator.ts.
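The four states and the transitions described on this page can be sketched as a lookup table. The event names are hypothetical, and the return to IDLE after a finished turn is an assumption; the real logic lives in orchestrator.ts.

```typescript
// Sketch only: event names and the SPEAK -> IDLE edge are assumptions.
type VoiceState = "IDLE" | "LISTEN" | "THINK" | "SPEAK";
type VoiceEvent =
  | "speechStart"      // VAD detects speech
  | "finalTranscript"  // trailing silence flushed the STT adapter
  | "firstToken"       // first LLM token arrived
  | "turnEnd"          // TTS finished the reply
  | "bargeIn";         // VAD detects speech mid-turn

const transitions: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  IDLE: { speechStart: "LISTEN" },
  LISTEN: { finalTranscript: "THINK" },
  THINK: { firstToken: "SPEAK", bargeIn: "LISTEN" },
  SPEAK: { turnEnd: "IDLE", bargeIn: "LISTEN" },
};

/** Return the next state, or stay put when the event does not apply. */
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  return transitions[state][event] ?? state;
}
```

Keeping the transitions in one table makes the barge-in edges explicit: both THINK and SPEAK drop back to LISTEN.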
Latency budget
Stage-by-stage targets on the self-hosted default stack. The shape is what matters; absolute numbers depend on hardware and network.
| Stage | P50 target | Notes |
|---|---|---|
| Mic to VAD | 30ms | RMS VAD on PCM frames |
| STT first partial | 250ms | Whisper.cpp growing-window transcription |
| LLM first token | 250ms | Groq Llama 4 on the LPU stack |
| TTS first audio chunk | 250ms | OpenTTS XTTS v2, first sentence |
| Total user-perceived | ~800ms | First audible response, fully self-hosted |
Quick start
Five commands from clone to a microphone-on browser tab. Commands taken straight from the README.
git clone https://github.com/sarmakska/voice-agent-starter.git
cd voice-agent-starter
pnpm install
cp .env.example .env   # add GROQ_API_KEY etc., or leave blank for offline mode
pnpm dev
# http://localhost:3000
# Click Start, grant microphone access.
# The web client connects to :3001 over WebSocket and streams PCM frames.
curl http://localhost:3001/health
# reports the active providers selected by STT_PROVIDER, LLM_PROVIDER, TTS_PROVIDER
Under the hood
The state machine, the adapter contract, the tool-call path, and the audio frame shape. Real snippets from the repo and the architecture notes.
Pluggable adapter contract (TypeScript)
// STT adapters implement: feed, flush, reset, id
// LLM adapters implement: stream, id
// TTS adapters implement: feed, stream, end, reset, id

// Each layer is one TS file behind a small interface,
// selected by an env var through a registry.
// STT_PROVIDER=whispercpp | deepgram | whisper
// LLM_PROVIDER=groq | sarmalink | openai
// TTS_PROVIDER=opentts | cartesia | elevenlabs

// Adding an OpenAI-compatible LLM adapter is a handful of lines
// because all three OpenAI-compatible adapters share
// apps/server/src/adapters/llm/sse.ts
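The notes above name the methods each adapter implements but not their signatures. One plausible shape, with every parameter and return type an assumption, plus a hypothetical `selectAdapter` helper standing in for the registry lookup:

```typescript
// Sketch only: method names come from the notes above; all signatures are guesses.
interface SttAdapter {
  id: string;
  feed(frame: Int16Array): void;
  flush(): Promise<string>;
  reset(): void;
}

interface LlmAdapter {
  id: string;
  stream(prompt: string, signal: AbortSignal): AsyncIterable<string>;
}

interface TtsAdapter {
  id: string;
  feed(text: string): void;
  stream(signal: AbortSignal): AsyncIterable<Uint8Array>;
  end(): void;
  reset(): void;
}

/** Pick an adapter factory by id, using the default when the env var is unset. */
function selectAdapter<T extends { id: string }>(
  registry: Record<string, () => T>,
  requested: string | undefined,
  fallback: string,
): T {
  const key = requested || fallback;
  const factory = registry[key];
  // Fail loudly on a typo rather than silently using the default.
  if (!factory) throw new Error(`Unknown provider "${key}"`);
  return factory();
}
```

With this shape, a call such as `selectAdapter(llmRegistry, process.env.LLM_PROVIDER, "groq")` would be the only place that knows which provider is active.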
Function-call passthrough (TypeScript)
// The orchestrator advertises the registered tool definitions to the
// model on every LLM call. The shared SSE reader assembles fragmented
// tool_calls deltas into complete calls.

// When the model requests a tool:
//   1. orchestrator runs the matching handler from ToolRegistry
//   2. appends an "assistant" turn recording the request
//   3. appends a "tool" turn carrying the result
//   4. re-streams so the model finishes with grounded data

// Bounded by maxToolRounds. Handler errors are returned to the model
// as a tool result rather than crashing the session.
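The four steps and the round limit can be sketched as a loop. To keep the example short the model is a synchronous function, whereas the real one is a stream; `runTurn`, the types, and the fallback text on an exhausted budget are all assumptions.

```typescript
// Sketch only: a synchronous stand-in for the streaming tool loop.
interface Turn {
  role: "user" | "assistant" | "tool";
  content: string;
}

type ModelReply =
  | { kind: "text"; text: string }
  | { kind: "tool"; name: string; args: unknown };

type Model = (history: Turn[]) => ModelReply;
type ToolHandler = (args: unknown) => unknown;

function runTurn(
  model: Model,
  tools: Record<string, ToolHandler>,
  history: Turn[],
  maxToolRounds = 3,
): string {
  for (let round = 0; round <= maxToolRounds; round++) {
    const reply = model(history);
    if (reply.kind === "text") return reply.text;
    if (round === maxToolRounds) break; // budget spent: run no further tools

    // Record the request as an assistant turn.
    history.push({
      role: "assistant",
      content: JSON.stringify({ tool: reply.name, args: reply.args }),
    });

    // Run the handler; an error becomes a tool result, not a crash.
    let result: string;
    try {
      const handler = tools[reply.name];
      if (!handler) throw new Error(`unknown tool ${reply.name}`);
      result = JSON.stringify(handler(reply.args));
    } catch (err) {
      result = JSON.stringify({ error: err instanceof Error ? err.message : String(err) });
    }
    history.push({ role: "tool", content: result });
  }
  return "(tool budget exhausted)"; // assumed fallback, not the repository's behaviour
}
```

Because errors are fed back as tool results, the model gets a chance to apologise or retry within the same turn.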
Audio frame shape (JSON)
// Browser to server, 20ms PCM16 frames, base64 encoded.
{ "type": "audio", "payload": "<base64 PCM16>" }
// Server to browser, sentence-by-sentence TTS chunks.
{ "type": "tts.chunk", "payload": "<base64 PCM16>" }
{ "type": "barge-in" }
{ "type": "turn.end" }
Configuration, one env per layer
Three env vars choose the STT, LLM, and TTS providers. Everything else is keys and URLs for the providers you picked.
| Env var | Purpose | Default |
|---|---|---|
| STT_PROVIDER | whispercpp, deepgram, or whisper | whispercpp |
| LLM_PROVIDER | groq, sarmalink, or openai | groq |
| TTS_PROVIDER | opentts, cartesia, or elevenlabs | opentts |
| GROQ_API_KEY | Key for the Groq Llama 4 LLM adapter | unset |
| WHISPERCPP_URL | whisper-server endpoint for STT | http://localhost:8090 |
| OPENTTS_URL | OpenTTS server endpoint for TTS | http://localhost:5500 |
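Read as code, the table amounts to a small loader with defaults. The `VoiceConfig` shape and `loadConfig` function are hypothetical; only the variable names and default values come from the table.

```typescript
// Sketch only: variable names and defaults are from the table, the rest is assumed.
interface VoiceConfig {
  sttProvider: string;
  llmProvider: string;
  ttsProvider: string;
  groqApiKey?: string;
  whispercppUrl: string;
  openttsUrl: string;
}

/** Resolve provider choices and endpoints, falling back to the self-hosted defaults. */
function loadConfig(env: Record<string, string | undefined>): VoiceConfig {
  return {
    sttProvider: env.STT_PROVIDER || "whispercpp",
    llmProvider: env.LLM_PROVIDER || "groq",
    ttsProvider: env.TTS_PROVIDER || "opentts",
    groqApiKey: env.GROQ_API_KEY || undefined, // unset by default
    whispercppUrl: env.WHISPERCPP_URL || "http://localhost:8090",
    openttsUrl: env.OPENTTS_URL || "http://localhost:5500",
  };
}
```

On the server this would typically be called once at startup as `loadConfig(process.env)`.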
Where it fits
The patterns this repository was built around, and the ones it deliberately is not.
Customer support voice agents
Front-line agents for inbound support. Barge-in is essential the moment a customer wants to redirect mid-turn; without it the agent talks over the caller.
Tutors and coaches
Education and coaching apps where the human does most of the talking. The agent prompts, redirects, and pauses without speaking over the learner.
Hands-busy product UX
Voice for kitchens, garages, field engineers. The Next.js client runs in any normal browser; no native app required.
Internal voice ops
Warehouse and operations teams over WebSocket. Pluggable STT lets you pin a regional model for accent coverage without changing the pipeline.
Vendor A/B without rewrites
Swap Deepgram in for Whisper.cpp, Cartesia for OpenTTS, OpenAI for Groq, by changing one env var. The orchestrator is unchanged.
When NOT to reach for it
A finished consumer product, or a one-shot push-to-talk transcription tool. The full-duplex machinery is overhead you would not need.
Compared to the alternatives
Hosted voice-agent platforms, closed vendor stacks, and rolling your own. Three honest comparisons.
| Feature | Voice Agent Starter | Hosted platform | Closed vendor | DIY |
|---|---|---|---|---|
| Full-duplex with barge-in | Yes | Yes | Yes | You build it |
| Self-hostable end to end | Yes | Hosted only | Hosted only | Yes |
| Pluggable STT / LLM / TTS | Yes, per layer | Partial | Locked stack | You write it |
| Per-minute fees | £0 on self-hosted defaults | Per minute | Per minute | Your provider bills |
| Function-call passthrough | Yes, bounded rounds | Yes | Yes | You write it |
| Source code | MIT, all of it | Closed | Closed | Yours |
Documentation, all in the wiki
A handful of focused pages. Each one answers a single operational question.
Frequently asked
Eight real questions from teams that have shipped this.
Why WebSocket rather than WebRTC?
The orchestrator is transport-agnostic. Terminating over mediasoup or LiveKit is a swap at the edge without touching the pipeline. The starter keeps the transport simple on purpose; the engineering value is in the duplex machinery, not the negotiation layer.
What does barge-in actually cancel?
Both the in-flight LLM and TTS streams, through their respective AbortControllers. The abort signal propagates through the fetch body reader and the for await loops, so there are no orphaned streams talking over the user. STT and TTS adapters are reset, a barge-in control message is emitted, and the machine drops to LISTEN.
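The cancellation pattern can be sketched with two AbortControllers per turn. `TurnStreams` and `drainTokens` are hypothetical names, and the consumer here is a plain loop over an array; in the real pipeline the same signal would be passed to `fetch` and checked inside the `for await` loops.

```typescript
// Sketch only: a synchronous stand-in for the abortable streaming loops.
class TurnStreams {
  readonly llm = new AbortController();
  readonly tts = new AbortController();

  /** Abort both in-flight streams and return the control message to emit. */
  bargeIn(): { type: "barge-in" } {
    this.llm.abort();
    this.tts.abort();
    return { type: "barge-in" };
  }
}

/** Forward tokens until the signal is aborted, then stop without draining the rest. */
function drainTokens(
  tokens: string[],
  signal: AbortSignal,
  onToken: (token: string) => void,
): void {
  for (const token of tokens) {
    if (signal.aborted) return;
    onToken(token);
  }
}
```

Creating fresh controllers for every turn keeps a late abort from an old turn from cancelling the next one.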
Does it really run with no provider keys?
Yes for the state machine, barge-in, and tool calls. To get real transcripts and audio you set GROQ_API_KEY or point WHISPERCPP_URL and OPENTTS_URL at running servers. The test suite drives the full pipeline through fake adapters and a fake socket.
How do I add an OpenAI-compatible LLM provider?
A handful of lines. Implement the LLM interface (stream, id) and reuse apps/server/src/adapters/llm/sse.ts for streaming and tool_calls assembly. Register it and set LLM_PROVIDER to the new id.
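For a sense of what a shared reader has to handle, here is a minimal sketch of parsing OpenAI-compatible SSE text into content deltas. `SseLineParser` is a hypothetical class, not the API of sse.ts, and it leaves tool_calls out.

```typescript
// Sketch only: content deltas only, assuming the OpenAI-compatible chunk shape.
class SseLineParser {
  private buffer = "";

  /** Feed a raw text chunk; return the content deltas it completed. */
  push(chunk: string): { deltas: string[]; done: boolean } {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element may be a partial line; keep it for the next chunk.
    this.buffer = lines.pop() ?? "";

    const deltas: string[] = [];
    let done = false;
    for (const raw of lines) {
      const line = raw.trim();
      if (!line.startsWith("data:")) continue; // skip blank lines and comments
      const data = line.slice(5).trim();
      if (data === "[DONE]") {
        done = true;
        continue;
      }
      const content = JSON.parse(data).choices?.[0]?.delta?.content;
      if (typeof content === "string") deltas.push(content);
    }
    return { deltas, done };
  }
}
```

Buffering the trailing partial line matters because network chunks do not align with SSE event boundaries.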
How do tools get registered?
In apps/server/src/pipeline/tools.ts. The starter ships with get_time and add_numbers as worked examples. Your tool advertises a JSON Schema; the orchestrator advertises every registered tool to the model on every call.
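A registry of this kind might look like the sketch below. The `ToolRegistry` method names, the `ToolDefinition` shape, and the handler bodies are assumptions; only the tool names get_time and add_numbers come from the starter.

```typescript
// Sketch only: the registry API shown here is hypothetical.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the arguments
  handler: (args: any) => unknown;
}

class ToolRegistry {
  private tools = new Map<string, ToolDefinition>();

  register(tool: ToolDefinition): void {
    this.tools.set(tool.name, tool);
  }

  /** Definitions in the OpenAI-compatible "tools" request shape. */
  definitions() {
    return Array.from(this.tools.values()).map((t) => ({
      type: "function" as const,
      function: { name: t.name, description: t.description, parameters: t.parameters },
    }));
  }

  run(name: string, args: unknown): unknown {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`Unknown tool: ${name}`);
    return tool.handler(args);
  }
}

const toolRegistry = new ToolRegistry();
toolRegistry.register({
  name: "get_time",
  description: "Return the current time as an ISO 8601 string.",
  parameters: { type: "object", properties: {} },
  handler: () => new Date().toISOString(),
});
toolRegistry.register({
  name: "add_numbers",
  description: "Add two numbers.",
  parameters: {
    type: "object",
    properties: { a: { type: "number" }, b: { type: "number" } },
    required: ["a", "b"],
  },
  handler: ({ a, b }: { a: number; b: number }) => a + b,
});
```

The JSON Schema is what the model sees, so a precise `description` and `required` list do more for call quality than anything in the handler.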
Why send audio over JSON?
Because the orchestrator is the design surface, not the wire format. A plain {type, payload} message is the simplest thing to swap for a binary frame, a data channel, or a media-server track without changing the pipeline.
Can I run the LLM on SarmaLink-AI?
Yes. Set LLM_PROVIDER=sarmalink and the adapter calls into the SarmaLink-AI failover stack, giving you 36-engine routing under the same voice loop. The shared SSE reader handles the wire format.
What about word-level interim STT?
The default Whisper.cpp adapter surfaces window-level partials, which is enough for the barge-in path. If you need finer granularity, wire the Deepgram streaming SDK and set STT_PROVIDER=deepgram. The interface is the same.
Related products
The rest of the Sarma Linux toolkit. Same opinions throughout: open source, MIT, real depth.
Ship a voice agent that does not feel like a 2018 demo.
Clone the repo, run pnpm dev, talk into your microphone, ship.