Open Source · MIT · Self-hosted defaults

A sub-second voice agent loop, end to end.

Speak. The model speaks back. Cut in any time. No per-minute provider fees on the default stack.

Voice Agent Starter is the full-duplex voice loop most teams half-finish. The browser captures microphone audio over WebSocket. A Fastify server runs a streaming STT, LLM, and TTS pipeline with voice activity detection. Barge-in cancels the in-flight LLM and TTS streams the moment you start speaking, and the LLM can call server-side tools mid-turn through function-call passthrough. Every layer is a pluggable adapter behind a small interface.

View on GitHub How it works Whitepaper Get help shipping

First audible response: ~800ms
Barge-in: Yes
Swappable adapter layers: STT, LLM, TTS
Defaults: Self-hosted
Licence: MIT

Why this exists

Voice agents are easy to demo and brutal to ship. The one-to-one demo on a fast network sounds magical. The first time a real user is on a coach, on patchy 4G, with a five-year-old laptop, every shortcut a demo can take is exposed. Negotiations stall, half-duplex audio breaks the moment someone interrupts, and the TTS finishes saying “please hold while I check” thirty seconds after the user has already moved on.

The half-decent reference implementations on GitHub mix transports and assume you have already built the streaming pipeline. The polished commercial offerings handle the hard parts but cost between $600 and $5,000 a month per instance and lock the audio path inside their cloud.

Voice Agent Starter is the open-source middle. A clean state machine, a clean barge-in path, pluggable STT, LLM, and TTS so the provider on the other end is your choice, and a self-hosted default stack that runs on Groq, Whisper.cpp, and OpenTTS with no per-minute fees.

What is in the box

Every feature below ships in the public repository today. Clone, configure, run.

Browser microphone capture

AudioContext resampled to 16 kHz mono, converted to PCM16, base64-encoded, sent as JSON over WebSocket. The orchestrator does not care which transport carried the frames.
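The capture-side conversion can be sketched roughly as follows, assuming the samples have already been resampled to 16 kHz mono. The function names are illustrative and are not taken from the repository; only the {type, payload} message shape comes from this page.

```typescript
// Hypothetical helpers; names are illustrative, not the repo's API.

/** Convert Float32 samples in [-1, 1] to signed 16-bit PCM. */
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to avoid wrap-around
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

/** Base64-encode the raw bytes of a PCM16 buffer (platform byte order). */
function pcm16ToBase64(pcm: Int16Array): string {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

/** Wrap one frame in the JSON envelope sent over the WebSocket. */
function toAudioMessage(samples: Float32Array): string {
  return JSON.stringify({ type: "audio", payload: pcm16ToBase64(floatToPcm16(samples)) });
}
```

At 16 kHz, a 20 ms frame is 320 samples, or 640 bytes of PCM16 before base64 encoding.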

Duplex state machine

IDLE to LISTEN to THINK to SPEAK, defined in apps/server/src/pipeline/orchestrator.ts. The orchestrator owns one voice session and never blocks: every provider call is a stream consumed with for await.
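One way to picture the four states is as a transition table. This is a sketch, not the code in orchestrator.ts: the event names are assumptions, as is the choice to return to IDLE when a turn ends.

```typescript
type LoopState = "IDLE" | "LISTEN" | "THINK" | "SPEAK";

// Event names are assumptions made for this sketch.
type LoopEvent = "speech_start" | "final_transcript" | "first_token" | "turn_end" | "barge_in";

const transitions: Record<LoopState, Partial<Record<LoopEvent, LoopState>>> = {
  IDLE: { speech_start: "LISTEN" },
  LISTEN: { final_transcript: "THINK" },
  THINK: { first_token: "SPEAK", barge_in: "LISTEN" },
  SPEAK: { turn_end: "IDLE", barge_in: "LISTEN" },
};

/** Return the next state; events that do not apply leave the state unchanged. */
function nextState(state: LoopState, event: LoopEvent): LoopState {
  return transitions[state][event] ?? state;
}
```

Keeping the transitions in data rather than in scattered conditionals makes the barge-in edges easy to audit.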

Real barge-in

When the VAD detects speech mid-turn, the orchestrator aborts the in-flight LLM and TTS streams through AbortController, resets the STT and TTS adapters, emits a barge-in control message, and drops to LISTEN.
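The cancellation pattern can be sketched with a fake provider stream. The Turn class and fakeTokenStream are hypothetical names for illustration; the only parts taken from this page are the use of AbortController, for await, and the barge-in control message.

```typescript
// Hypothetical stand-in for an LLM or TTS provider stream.
async function* fakeTokenStream(signal: AbortSignal): AsyncGenerator<string> {
  for (let i = 0; i < 100; i++) {
    await Promise.resolve(); // async boundary, standing in for network I/O
    if (signal.aborted) return; // stop producing once the turn is cancelled
    yield `token-${i}`;
  }
}

class Turn {
  private readonly controller = new AbortController();
  readonly spoken: string[] = [];

  /** Consume the stream with for await; the loop ends soon after abort. */
  async run(onToken?: (token: string) => void): Promise<void> {
    for await (const token of fakeTokenStream(this.controller.signal)) {
      this.spoken.push(token);
      onToken?.(token);
    }
  }

  /** Called when the VAD detects speech mid-turn. */
  bargeIn(): { type: "barge-in" } {
    this.controller.abort();
    return { type: "barge-in" };
  }
}
```

A real fetch-backed stream rejects with an AbortError when its signal fires, so production code would also need to catch that error rather than rely on a clean return.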

Streaming STT

Whisper.cpp by default with growing-window transcription. Voice frames feed the STT for live partials; trailing silence flushes the adapter for a final transcript that triggers the move to THINK.

Streaming LLM

Groq Llama 4 by default on the LPU stack. The first token flips the machine to SPEAK and is fed straight into TTS, so audio starts before the completion finishes.

Sentence-by-sentence TTS

OpenTTS Coqui XTTS v2 by default. Synthesises sentence by sentence and streams PCM chunks back to the client as base64. First audible response under one second on the self-hosted stack.
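Sentence-by-sentence synthesis needs something that turns a token stream into whole sentences. The class below is a minimal sketch of that idea; SentenceBuffer is a hypothetical name and the boundary rule (terminal punctuation followed by whitespace) is an assumption, not the repo's splitter.

```typescript
/** Buffers streamed tokens and releases complete sentences for synthesis. */
class SentenceBuffer {
  private pending = "";

  /** Add a token; returns any sentences completed by it. */
  push(token: string): string[] {
    this.pending += token;
    const sentences: string[] = [];
    // Assumed rule: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  /** Release whatever remains when the LLM stream ends. */
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest ? [rest] : [];
  }
}
```

A rule this simple will split on abbreviations such as "e.g. ", which is usually acceptable for speech but worth knowing about.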

Function-call passthrough

Registered tools are advertised to the LLM on every call. The shared SSE reader assembles tool_calls deltas; the orchestrator runs the matching handler, appends an assistant and tool turn, and re-streams so the model finishes with grounded data.

Pluggable adapter contract

STT, LLM, and TTS each implement a small TypeScript interface. Defaults are self-hosted; alternatives include Deepgram, OpenAI Whisper, OpenAI, SarmaLink-AI, Cartesia, and ElevenLabs.

RMS voice activity detection

A simple RMS-threshold VAD in pipeline/vad.ts drives the state transitions. Clean seam for silero-vad-onnx if a heavier VAD is needed.
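An RMS-threshold check is only a few lines. The sketch below is not the code in pipeline/vad.ts; the function names and the threshold value are assumptions chosen for illustration.

```typescript
/** Root-mean-square level of a PCM16 frame, normalised to the range 0..1. */
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] / 32768;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Assumed value for illustration; tune it per microphone and room.
const SPEECH_THRESHOLD = 0.02;

/** True when the frame's energy is at or above the threshold. */
function isSpeech(frame: Int16Array, threshold = SPEECH_THRESHOLD): boolean {
  return rms(frame) >= threshold;
}
```

A fixed threshold treats any loud noise as speech, which is the main reason to swap in a model-based VAD later.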

Shared SSE reader

The three OpenAI-compatible LLM adapters share apps/server/src/adapters/llm/sse.ts. Adding a fourth OpenAI-compatible provider is a handful of lines, including streamed tool_calls.
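In the OpenAI-compatible streaming format, a tool call arrives as fragments keyed by index, with the arguments split across chunks as partial JSON text. The following is a sketch of the assembly step under that assumption; it is not the contents of sse.ts, and the type and function names are hypothetical.

```typescript
// One streamed fragment, as found in choices[0].delta.tool_calls.
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface AssembledToolCall {
  id: string;
  name: string;
  arguments: string; // JSON text, complete only once the stream has ended
}

/** Merge fragments by index, concatenating name and argument pieces. */
function assembleToolCalls(deltas: ToolCallDelta[]): AssembledToolCall[] {
  const calls: AssembledToolCall[] = [];
  for (const d of deltas) {
    if (!calls[d.index]) calls[d.index] = { id: "", name: "", arguments: "" };
    const call = calls[d.index];
    if (d.id) call.id = d.id;
    if (d.function?.name) call.name += d.function.name;
    if (d.function?.arguments) call.arguments += d.function.arguments;
  }
  return calls.filter(Boolean); // drop holes left by skipped indices
}
```

The arguments string should only be parsed after the stream finishes, since any intermediate state is likely to be invalid JSON.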

Bounded tool rounds

Tool rounds are bounded by maxToolRounds to guard against loops. Handler errors are returned to the model rather than crashing the session, so the agent recovers in-turn.
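A bounded tool loop might look like the sketch below. Only the maxToolRounds name and the error-as-result behaviour come from this page; the Model signature, the message shapes, and the choice to return an empty string once the budget is spent are all assumptions.

```typescript
type ToolRequest = { name: string; args: unknown };
type Message =
  | { role: "user" | "assistant"; content: string; toolCall?: ToolRequest }
  | { role: "tool"; name: string; content: string };

type Handler = (args: unknown) => unknown;
// Hypothetical model signature: resolves to either text or a tool request.
type Model = (history: Message[]) => Promise<{ text?: string; toolCall?: ToolRequest }>;

async function runTurn(
  model: Model,
  tools: Map<string, Handler>,
  history: Message[],
  maxToolRounds = 3,
): Promise<string> {
  for (let round = 0; round <= maxToolRounds; round++) {
    const reply = await model(history);
    if (!reply.toolCall) return reply.text ?? "";
    if (round === maxToolRounds) break; // budget spent: stop calling tools
    const { name, args } = reply.toolCall;
    let result: string;
    try {
      const handler = tools.get(name);
      if (!handler) throw new Error(`unknown tool: ${name}`);
      result = JSON.stringify(handler(args));
    } catch (err) {
      // Hand the failure back to the model instead of crashing the session.
      result = JSON.stringify({ error: err instanceof Error ? err.message : String(err) });
    }
    history.push({ role: "assistant", content: "", toolCall: reply.toolCall });
    history.push({ role: "tool", name, content: result });
  }
  return ""; // assumed behaviour when the model never stops requesting tools
}
```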

No-keys offline mode

The state machine, barge-in, and tool calls all run without provider keys. Set keys or point the self-hosted URLs at running servers to get real transcripts and audio.

Architecture, the duplex loop

One Orchestrator per voice session. Created on the /voice WebSocket open, disposed on close. Every box maps to a real file in apps/server/src.

Full-duplex voice loop. The orchestrator owns one voice session and never blocks. Coloured nodes are pluggable provider adapters.

apps/server/src/index.ts

Fastify server, /health and /voice WebSocket, message dispatch.

apps/server/src/pipeline/orchestrator.ts

Duplex state machine, barge-in, function-call passthrough.

apps/server/src/pipeline/tools.ts

Tool registry and default tools (get_time, add_numbers).

apps/server/src/pipeline/vad.ts

RMS-threshold voice activity detection.

apps/server/src/adapters/audio.ts

PCM and WAV conversion, sentence splitting.

apps/server/src/adapters/llm/sse.ts

Shared OpenAI-compatible SSE reader and wire-format mapping.

apps/server/src/adapters/{stt,llm,tts}/*.ts

Provider adapters and registries selected by env var.

apps/web/app/page.tsx

Next.js browser client with microphone capture.

State machine: the duplex loop

One Orchestrator per voice session, created on the /voice WebSocket open and disposed on close. Implemented in apps/server/src/pipeline/orchestrator.ts.

IDLE, LISTEN, THINK, SPEAK. Barge-in cancels the in-flight LLM and TTS streams and drops the machine back to LISTEN.

Latency budget

Stage-by-stage targets on the self-hosted default stack. The shape is what matters; absolute numbers depend on hardware and network.

Stage	P50 target	Notes
Mic to VAD	30ms	RMS VAD on PCM frames
STT first partial	250ms	Whisper.cpp growing-window transcription
LLM first token	250ms	Groq Llama 4 on the LPU stack
TTS first audio chunk	250ms	OpenTTS XTTS v2, first sentence
Total user-perceived	~800ms	First audible response, fully self-hosted
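As a quick check on the total, the stage targets can be summed directly, assuming the stages run strictly in sequence. The object keys are illustrative labels for the table rows.

```typescript
// P50 targets from the table above, in milliseconds.
const p50TargetsMs = {
  micToVad: 30,
  sttFirstPartial: 250,
  llmFirstToken: 250,
  ttsFirstAudioChunk: 250,
};

const totalMs = Object.values(p50TargetsMs).reduce((sum, ms) => sum + ms, 0);
console.log(totalMs); // 780, which the table rounds to ~800ms
```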

Quick start

Five steps from clone to a microphone-on browser tab. Commands are taken straight from the README.

Clone the monorepo

git clone https://github.com/sarmakska/voice-agent-starter.git
cd voice-agent-starter

Install with pnpm

pnpm install
cp .env.example .env   # add GROQ_API_KEY etc., or leave blank for offline mode

Run the server + web client

pnpm dev

Open the client, talk

# http://localhost:3000
# Click Start, grant microphone access.
# The web client connects to :3001 over WebSocket and streams PCM frames.

Sanity check

curl http://localhost:3001/health
# reports the active providers selected by STT_PROVIDER, LLM_PROVIDER, TTS_PROVIDER

Under the hood

The state machine, the adapter contract, the tool-call path, and the audio frame shape. Real snippets from the repo and the architecture notes.

Pluggable adapter contract (TypeScript)

// STT adapters implement: feed, flush, reset, id
// LLM adapters implement: stream, id
// TTS adapters implement: feed, stream, end, reset, id

// Each layer is one TS file behind a small interface,
// selected by an env var through a registry.

// STT_PROVIDER=whispercpp | deepgram | whisper
// LLM_PROVIDER=groq       | sarmalink | openai
// TTS_PROVIDER=opentts    | cartesia | elevenlabs

// Adding an OpenAI-compatible LLM adapter is a handful of lines
// because all three OpenAI-compatible adapters share
// apps/server/src/adapters/llm/sse.ts
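Written out as interfaces, the contract could look something like this. The method names follow the comments above; every parameter type, return type, and the Registry class itself are assumptions made for the sketch, not the repository's definitions.

```typescript
// Method names follow the contract above; signatures are assumptions.
interface SttAdapter {
  readonly id: string;
  feed(frame: Int16Array): void;
  flush(): Promise<string>; // assumed to resolve to the final transcript
  reset(): void;
}

interface LlmAdapter {
  readonly id: string;
  stream(prompt: string, signal: AbortSignal): AsyncIterable<string>;
}

interface TtsAdapter {
  readonly id: string;
  feed(text: string): void;
  stream(signal: AbortSignal): AsyncIterable<Uint8Array>;
  end(): void;
  reset(): void;
}

/** Hypothetical registry: maps a provider id to an adapter factory. */
class Registry<T extends { readonly id: string }> {
  private factories = new Map<string, () => T>();

  register(id: string, factory: () => T): this {
    this.factories.set(id, factory);
    return this;
  }

  select(id: string): T {
    const factory = this.factories.get(id);
    if (!factory) throw new Error(`unknown provider "${id}"`);
    return factory();
  }
}

// A fake adapter of the kind a key-free test run could use.
const llmRegistry = new Registry<LlmAdapter>().register("fake", () => ({
  id: "fake",
  async *stream(prompt: string) {
    yield `echo: ${prompt}`;
  },
}));
```

With a registry like this, choosing a provider reduces to a call such as llmRegistry.select(process.env.LLM_PROVIDER ?? "groq") once real adapters are registered.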

Function-call passthrough (TypeScript)

// The orchestrator advertises the registered tool definitions to the
// model on every LLM call. The shared SSE reader assembles fragmented
// tool_calls deltas into complete calls.

// When the model requests a tool:
//   1. orchestrator runs the matching handler from ToolRegistry
//   2. appends an "assistant" turn recording the request
//   3. appends a "tool" turn carrying the result
//   4. re-streams so the model finishes with grounded data

// Bounded by maxToolRounds. Handler errors are returned to the model
// as a tool result rather than crashing the session.

Audio frame shape (JSON)

// Browser to server, 20ms PCM16 frames, base64 encoded.
{ "type": "audio", "payload": "<base64 PCM16>" }

// Server to browser, sentence-by-sentence TTS chunks.
{ "type": "tts.chunk", "payload": "<base64 PCM16>" }
{ "type": "barge-in" }
{ "type": "turn.end" }
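On the server side, the same shapes can be typed as a discriminated union and validated before the audio reaches the pipeline. This is a sketch assuming little-endian PCM16 and a Node.js runtime; decodeAudioMessage is a hypothetical name.

```typescript
type ClientMessage = { type: "audio"; payload: string };
type ServerMessage =
  | { type: "tts.chunk"; payload: string }
  | { type: "barge-in" }
  | { type: "turn.end" };

/** Parse one client frame; returns PCM16 samples, or null if it is not valid audio. */
function decodeAudioMessage(raw: string): Int16Array | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null; // not JSON
  }
  if (typeof msg !== "object" || msg === null) return null;
  const { type, payload } = msg as { type?: unknown; payload?: unknown };
  if (type !== "audio" || typeof payload !== "string") return null;

  const bytes = Buffer.from(payload, "base64");
  // A trailing odd byte cannot form a sample, so it is dropped.
  const samples = new Int16Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < samples.length; i++) samples[i] = bytes.readInt16LE(i * 2);
  return samples;
}
```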

Configuration, one env per layer

Three env vars choose the STT, LLM, and TTS providers. Everything else is keys and URLs for the providers you picked.

Env var	Purpose	Default
STT_PROVIDER	whispercpp, deepgram, or whisper	whispercpp
LLM_PROVIDER	groq, sarmalink, or openai	groq
TTS_PROVIDER	opentts, cartesia, or elevenlabs	opentts
GROQ_API_KEY	Key for the Groq Llama 4 LLM adapter	unset
WHISPERCPP_URL	whisper-server endpoint for STT	http://localhost:8090
OPENTTS_URL	OpenTTS server endpoint for TTS	http://localhost:5500
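Resolving the three provider variables with their defaults can be sketched as below. The option lists and defaults are copied from the table; the resolveProviders function and its choice to throw on an unknown value are assumptions, not documented behaviour.

```typescript
const PROVIDERS = {
  STT_PROVIDER: { options: ["whispercpp", "deepgram", "whisper"], fallback: "whispercpp" },
  LLM_PROVIDER: { options: ["groq", "sarmalink", "openai"], fallback: "groq" },
  TTS_PROVIDER: { options: ["opentts", "cartesia", "elevenlabs"], fallback: "opentts" },
} as const;

type Layer = keyof typeof PROVIDERS;

/** Pick one provider per layer, falling back to the default when unset or empty. */
function resolveProviders(env: Record<string, string | undefined>): Record<Layer, string> {
  const resolved = {} as Record<Layer, string>;
  for (const layer of Object.keys(PROVIDERS) as Layer[]) {
    const { options, fallback } = PROVIDERS[layer];
    const value = env[layer] || fallback;
    if (!(options as readonly string[]).includes(value)) {
      // Assumed behaviour: fail fast rather than silently use the default.
      throw new Error(`${layer}=${value} is not one of ${options.join(", ")}`);
    }
    resolved[layer] = value;
  }
  return resolved;
}
```

In a Node.js process this would typically be called once at startup as resolveProviders(process.env).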

Where it fits

The patterns this repository was built around, and the ones it deliberately is not.

Customer support voice agents

Front-line agents for inbound support. Barge-in is essential the moment a customer wants to redirect mid-turn; without it the agent talks over the caller.

Tutors and coaches

Education and coaching apps where the human does most of the talking. The agent prompts, redirects, and pauses without speaking over the learner.

Hands-busy product UX

Voice for kitchens, garages, field engineers. The Next.js client runs in any normal browser; no native app required.

Internal voice ops

Warehouse and operations teams over WebSocket. Pluggable STT lets you pin a regional model for accent coverage without changing the pipeline.

Vendor A/B without rewrites

Swap Deepgram in for Whisper.cpp, Cartesia for OpenTTS, OpenAI for Groq, by changing one env var. The orchestrator is unchanged.

When NOT to reach for it

A finished consumer product, or a one-shot push-to-talk transcription tool. The full-duplex machinery is overhead you would not need.

Tech stack

TypeScript · Node.js 22 · Next.js 15 · Fastify 5 · WebSocket · PCM16 / 16 kHz · Whisper.cpp · OpenTTS Coqui XTTS v2 · Groq Llama 4 · AbortController · Vitest · pnpm

Compared to the alternatives

Hosted voice-agent platforms, closed vendor stacks, and rolling your own. Three honest comparisons.

Feature	Voice Agent Starter	Hosted platform	Closed vendor	DIY
Full-duplex with barge-in	Yes	Yes	Yes	You build it
Self-hostable end to end	Yes	Hosted only	Hosted only	Yes
Pluggable STT / LLM / TTS	Yes, per layer	Partial	Locked stack	You write it
Per-minute fees	£0 on self-hosted defaults	Per minute	Per minute	Your provider bills
Function-call passthrough	Yes, bounded rounds	Yes	Yes	You write it
Source code	MIT, all of it	Closed	Closed	Yours

Documentation, all in the wiki

A handful of focused pages. Each one answers a single operational question.

Architecture

Run lifecycle, state machine, barge-in, function-call passthrough, components table.

Home

High-level overview, who it is for, the latency budget.

Quick Start

Install, configure providers, run locally, open the client.

Roadmap

Shipped and planned.

ARCHITECTURE.md

In-repo design reference, components, design choices.

CHANGELOG.md

Versioned history of what shipped when.

Frequently asked

Eight real questions from teams that have shipped this.

Why WebSocket rather than WebRTC?

The orchestrator is transport-agnostic. Terminating over mediasoup or LiveKit is a swap at the edge without touching the pipeline. The starter keeps the transport simple on purpose; the engineering value is in the duplex machinery, not the negotiation layer.

What does barge-in actually cancel?

Both the in-flight LLM and TTS streams, through their respective AbortControllers. The abort signal propagates through the fetch body reader and the for await loops, so there are no orphaned streams talking over the user. STT and TTS adapters are reset, a barge-in control message is emitted, and the machine drops to LISTEN.

Does it really run with no provider keys?

Yes for the state machine, barge-in, and tool calls. To get real transcripts and audio you set GROQ_API_KEY or point WHISPERCPP_URL and OPENTTS_URL at running servers. The test suite drives the full pipeline through fake adapters and a fake socket.

How do I add an OpenAI-compatible LLM provider?

A handful of lines. Implement the LLM interface (stream, id) and reuse apps/server/src/adapters/llm/sse.ts for streaming and tool_calls assembly. Register it and set LLM_PROVIDER to the new id.

How do tools get registered?

In apps/server/src/pipeline/tools.ts. The starter ships with get_time and add_numbers as worked examples. Your tool advertises a JSON Schema; the orchestrator advertises every registered tool to the model on every call.
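A registration along these lines is one plausible shape. The tool name add_numbers comes from this page; the ToolDefinition fields, the class body, and the advertise and run method names are assumptions, and the real tools.ts may differ. The advertised format follows the OpenAI-compatible tools request shape.

```typescript
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the arguments
  handler: (args: Record<string, unknown>) => unknown;
}

// Hypothetical body; only the class name appears on this page.
class ToolRegistry {
  private tools = new Map<string, ToolDefinition>();

  register(tool: ToolDefinition): void {
    this.tools.set(tool.name, tool);
  }

  /** Definitions in the OpenAI-compatible "tools" request format. */
  advertise() {
    return Array.from(this.tools.values()).map(({ name, description, parameters }) => ({
      type: "function" as const,
      function: { name, description, parameters },
    }));
  }

  run(name: string, args: Record<string, unknown>): unknown {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    return tool.handler(args);
  }
}

const toolRegistry = new ToolRegistry();
toolRegistry.register({
  name: "add_numbers",
  description: "Add two numbers.",
  parameters: {
    type: "object",
    properties: { a: { type: "number" }, b: { type: "number" } },
    required: ["a", "b"],
  },
  handler: ({ a, b }) => Number(a) + Number(b),
});
```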

Why send audio over JSON?

Because the orchestrator is the design surface, not the wire format. A plain {type, payload} message is the simplest thing to swap for a binary frame, a WebRTC data channel, or a media-server track without changing the pipeline.

Can I run the LLM on SarmaLink-AI?

Yes. Set LLM_PROVIDER=sarmalink and the adapter calls into the SarmaLink-AI failover stack, giving you 36-engine routing under the same voice loop. The shared SSE reader handles the wire format.

What about word-level interim STT?

The default Whisper.cpp adapter surfaces window-level partials, which is enough for the barge-in path. If you need finer granularity, wire the Deepgram streaming SDK and set STT_PROVIDER=deepgram. The interface is the same.

Ship a voice agent that does not feel like a 2018 demo.

Clone the repo, run pnpm dev, talk into your microphone, ship.
