How Voice Agent Starter works
A complete tour of the architecture: data flow, subsystems, technology choices, performance, and where the project is heading next.
A Next.js client opens one WebRTC peer connection to a mediasoup SFU. The SFU forwards audio to a Fastify model worker that runs streaming STT, LLM, and TTS through pluggable adapters. A turn state machine handles barge-in correctly. End-to-end response begins in under 700ms on a healthy network.
Core data flow
From the moment a request enters the system to the moment a response leaves it.
Browser                        SFU                           Model worker
-------                        ---                           ------------
mic onset
  │
  │ Opus over WebRTC
  ├─────────────────────────▶ mediasoup
  │                             │ plain RTP
  │                             ├─────────────────────────▶ VAD + STT adapter
  │                             │                             │
  │                             │                             │ partial transcripts
  │                             │                             │ (logged, not sent)
  │                             │                             │
  │                             │                             │ confident-final
  │                             │                             ▼
  │                             │                           LLM adapter (streaming)
  │                             │                             │ tokens
  │                             │                             ▼
  │                             │                           TTS adapter (streaming)
  │                             │         plain RTP           │ audio chunks
  │                             │ ◀───────────────────────────│
  │    Opus over WebRTC         │
  ◀─────────────────────────────│
speakers play TTS
  │
  │ ── new mic onset detected ──▶ turn manager: state = INTERRUPTED
  │                                cancel TTS, drain buffer, reset LLM

Each subsystem, deep-dived
Every component in the data flow above, opened up and explained.
WebRTC client
The Next.js client is intentionally small. It negotiates a single peer connection with the SFU, attaches the microphone as the outbound track, listens for the inbound TTS track, and renders a UI that exposes the agent’s current state. There is no audio worklet, no JS audio buffer manipulation; the browser handles audio playback through the WebRTC pipeline directly. The mic track has Opus configured at 48 kHz mono with discontinuous transmission disabled (we want continuous audio so VAD on the server is straightforward).
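In code, the wiring is roughly the following. postOffer is a hypothetical signaling helper, because the starter's actual signaling exchange is not shown here; everything else is standard browser WebRTC.

```ts
// Minimal client wiring: one peer connection, mic out, TTS in.
// `postOffer` is a hypothetical signaling helper; the starter's actual
// signaling API is not shown here.
declare function postOffer(
  offer: RTCSessionDescription
): Promise<RTCSessionDescriptionInit>;

async function connect(audioEl: HTMLAudioElement): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Outbound: the mic as a single mono track; Opus runs at 48 kHz over WebRTC.
  const mic = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1 },
  });
  pc.addTrack(mic.getAudioTracks()[0], mic);

  // Inbound: hand the TTS track straight to an <audio> element; the browser
  // owns playback, no worklet or buffer juggling.
  pc.ontrack = (ev) => {
    audioEl.srcObject = ev.streams[0];
  };

  const offer = await pc.createOffer();
  // Opus DTX is off by default in browsers; if a build turns it on, strip it
  // so server-side VAD sees continuous audio.
  offer.sdp = offer.sdp?.replace(/usedtx=1/g, "usedtx=0");
  await pc.setLocalDescription(offer);

  const answer = await postOffer(pc.localDescription!);
  await pc.setRemoteDescription(answer);
  return pc;
}
```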
The UI surfaces the four states from the turn manager: listening, deciding, speaking, interrupted. This is not just for users: during integration testing, watching the state badge tick through the cycle catches more bugs than any console log.
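A sketch of how small that surface can be; the TurnState type and component name are illustrative, and how the state reaches the component is up to the integration:

```tsx
// Sketch only: TurnState mirrors the four turn-manager states.
type TurnState = "listening" | "deciding" | "speaking" | "interrupted";

export function StateBadge({ state }: { state: TurnState }) {
  return <span className={`state-badge state-badge--${state}`}>{state}</span>;
}
```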
mediasoup SFU
The SFU is one mediasoup worker running inside its own process. It accepts the WebRTC peer connection from the browser and creates two plain RTP transports to the model worker, one for inbound audio (mic to model) and one for outbound (TTS to speakers). The plain transports are local to the host; in production we colocate the SFU and the model worker to avoid an extra network hop.
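A sketch of those two legs, assuming mediasoup v3; the IPs and ports are illustrative, not the starter's actual values:

```ts
import { types as mediasoupTypes } from "mediasoup";

// Sketch of the two plain RTP legs, assuming mediasoup v3 and a colocated
// model worker; IPs and ports are illustrative.
async function createModelWorkerTransports(router: mediasoupTypes.Router) {
  // Inbound leg: mic audio forwarded from the SFU to the model worker.
  const toModel = await router.createPlainTransport({
    listenIp: "127.0.0.1", // local to the host: SFU and worker are colocated
    rtcpMux: true,
    comedia: false,
  });
  await toModel.connect({ ip: "127.0.0.1", port: 5004 });

  // Outbound leg: TTS audio from the model worker back toward the browser.
  const fromModel = await router.createPlainTransport({
    listenIp: "127.0.0.1",
    rtcpMux: true,
    comedia: true, // the worker's first RTP packet fixes the remote address
  });

  return { toModel, fromModel };
}
```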
mediasoup configuration is in sfu/config.ts. The two parameters most worth tuning are the worker count (one per CPU on the SFU host) and the RTP listen IP, which must be reachable from the model worker. The wiki has the Fly.io recipe.
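For orientation, a plausible shape for that file; the field names here are illustrative, not the starter's actual exports:

```ts
import os from "node:os";

// Illustrative shape only; the real sfu/config.ts may use different names.
export const config = {
  // One mediasoup worker per CPU on the SFU host.
  numWorkers: Number(process.env.SFU_WORKERS ?? os.cpus().length),

  worker: {
    rtcMinPort: 40000,
    rtcMaxPort: 49999,
  },

  // The RTP listen IP must be reachable from the model worker. On Fly.io
  // this is the machine's private address, not 127.0.0.1.
  plainTransport: {
    listenIp: process.env.SFU_RTP_LISTEN_IP ?? "127.0.0.1",
    rtcpMux: true,
  },
} as const;
```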
Turn state machine
The brain. Four states, four transitions:
- LISTENING: VAD is active, STT is consuming audio, partial transcripts are accumulating. Transition to DECIDING when a confident-final transcript arrives.
- DECIDING: LLM adapter is generating. As soon as tokens start arriving, transition to SPEAKING.
- SPEAKING: TTS is producing audio chunks; the SFU is forwarding them. Transition to LISTENING when TTS completes naturally.
- INTERRUPTED: VAD detected a new mic onset during SPEAKING. Cancel TTS, drain the SFU buffer, reset the LLM context with a “[user interrupted]” marker, transition to LISTENING.
Each transition emits a log line and an OpenTelemetry span. The state machine is the same across every adapter pairing.
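A minimal sketch of that machine, assuming pino and @opentelemetry/api; the class shape and transition table are illustrative, not the starter's actual code:

```ts
import { trace } from "@opentelemetry/api";
import pino from "pino";

type TurnState = "LISTENING" | "DECIDING" | "SPEAKING" | "INTERRUPTED";

const log = pino();
const tracer = trace.getTracer("voice-agent");

// Legal transitions only; anything else is a bug worth crashing on.
const TRANSITIONS: Record<TurnState, TurnState[]> = {
  LISTENING: ["DECIDING"],
  DECIDING: ["SPEAKING"],
  SPEAKING: ["LISTENING", "INTERRUPTED"],
  INTERRUPTED: ["LISTENING"],
};

export class TurnManager {
  state: TurnState = "LISTENING";

  transition(next: TurnState, reason: string): void {
    if (!TRANSITIONS[this.state].includes(next)) {
      throw new Error(`illegal transition ${this.state} -> ${next}`);
    }
    // Every transition emits a log line and a span. (Span naming is
    // simplified here; the starter uses per-stage names like turn.llm.first_token.)
    const span = tracer.startSpan(`turn.${next.toLowerCase()}`);
    log.info({ from: this.state, to: next, reason }, "turn transition");
    this.state = next;
    span.end();
  }
}
```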
STT, LLM, TTS adapters
Each adapter is a TypeScript module exporting one async iterator factory. STT takes audio chunks in, yields transcript events. LLM takes a chat history in, yields token strings. TTS takes a text iterable in, yields audio chunks (Opus packets, 20ms frames).
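Expressed as types, the contract might look like this; the names are ours for illustration, not the starter's exports:

```ts
// Illustrative type names; the starter's actual exports may differ.
export interface TranscriptEvent {
  text: string;
  isFinal: boolean;    // a confident final triggers LISTENING -> DECIDING
  confidence: number;
}

export interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Each adapter is one async iterator factory; AbortSignal carries cancellation.
export type SttAdapter = (
  audio: AsyncIterable<Uint8Array>, // Opus packets from the SFU
  signal: AbortSignal
) => AsyncIterable<TranscriptEvent>;

export type LlmAdapter = (
  history: ChatMessage[],
  signal: AbortSignal
) => AsyncIterable<string>; // token strings

export type TtsAdapter = (
  text: AsyncIterable<string>,
  signal: AbortSignal
) => AsyncIterable<Uint8Array>; // Opus packets, 20ms frames
```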
The defaults: Deepgram for STT (excellent streaming partials, low first-token latency), OpenAI Realtime for LLM (the streaming hot path is well-tuned), Cartesia for TTS (sub-200ms first-audio, natural prosody). All three are network calls. Replacing any one is a single file edit.
An adapter authoring guide in the wiki covers the gotchas: partial-transcript handling, confidence thresholds, end-of-turn detection, and how to handle a TTS provider that emits PCM not Opus (transcode in the adapter, do not break the contract).
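For the PCM case, a wrapper along these lines keeps the contract intact. It assumes @discordjs/opus as the encoder, which is one option among several:

```ts
import { OpusEncoder } from "@discordjs/opus"; // one option; any Opus encoder works

const FRAME_BYTES = 960 * 2; // 20ms at 48 kHz, mono, 16-bit PCM

// Wrap a PCM-emitting TTS stream so the adapter still yields Opus packets.
export async function* pcmToOpus(
  pcm: AsyncIterable<Uint8Array>
): AsyncIterable<Uint8Array> {
  const encoder = new OpusEncoder(48000, 1);
  let pending = Buffer.alloc(0);

  for await (const chunk of pcm) {
    pending = Buffer.concat([pending, Buffer.from(chunk)]);
    // Encode every complete 20ms frame; keep the remainder for the next chunk.
    while (pending.length >= FRAME_BYTES) {
      yield encoder.encode(pending.subarray(0, FRAME_BYTES));
      pending = pending.subarray(FRAME_BYTES);
    }
  }
}
```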
Barge-in cancellation path
The hardest part. When VAD detects user speech during SPEAKING:
- The TTS adapter’s in-flight stream is cancelled via its AbortController. Most providers honour this immediately and stop generating new audio.
- The SFU producer for the TTS track is told to flush. mediasoup exposes producer.pause(); we then immediately resume to ensure no stale frames are sent.
- The LLM adapter context is appended with a marker: the agent’s reply was interrupted, and the next turn should not assume the prior reply was heard.
- State transitions to LISTENING; STT begins consuming the new turn.
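Put together, the path might look like this sketch; the session shape and helper names are illustrative, not the starter's:

```ts
// Illustrative cancellation path; the session shape and names are ours,
// not the starter's.
interface Session {
  ttsAbort: AbortController;
  ttsProducer: { pause(): Promise<void>; resume(): Promise<void> }; // mediasoup Producer subset
  history: { role: string; content: string }[];
  turns: { transition(next: "INTERRUPTED" | "LISTENING", reason: string): void };
}

export async function onBargeIn(session: Session): Promise<void> {
  session.turns.transition("INTERRUPTED", "new mic onset during SPEAKING");

  // 1. Stop the in-flight TTS stream at the provider.
  session.ttsAbort.abort();

  // 2. Flush stale frames: pause the TTS producer, then resume immediately.
  await session.ttsProducer.pause();
  await session.ttsProducer.resume();

  // 3. Tell the LLM the reply was cut off.
  session.history.push({ role: "system", content: "[user interrupted]" });

  // 4. Back to listening; STT consumes the new turn.
  session.turns.transition("LISTENING", "cancellation complete");
}
```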
Each step is tested in isolation and again in integration with timing assertions. The acceptance criterion: the user must hear no more than 80 milliseconds of TTS after their interrupting word begins.
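One shape those assertions could take, with a hypothetical test harness standing in for the starter's fixtures:

```ts
import { expect, it } from "vitest";

// Hypothetical test harness; these helpers stand in for whatever the
// starter's integration fixtures actually provide.
declare function startTestSession(): Promise<{
  speakUntilAgentResponds(): Promise<void>;
  injectMicOnset(): void;              // simulated barge-in
  lastTtsFrameTime(): Promise<number>; // performance.now() of the final frame
}>;

it("stops TTS within 80ms of an interrupting onset", async () => {
  const session = await startTestSession();
  await session.speakUntilAgentResponds();

  const onsetAt = performance.now();
  session.injectMicOnset();
  const lastFrameAt = await session.lastTtsFrameTime();

  expect(lastFrameAt - onsetAt).toBeLessThanOrEqual(80);
});
```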
Configuration and deployment
Local development is pnpm dev:sfu + pnpm dev:worker + pnpm dev:web with a shared .env. Docker Compose is available for one-command boot. Production deployment is documented for Fly.io with two machines: one for the SFU (UDP ports exposed), one for the model worker. The Next.js client deploys anywhere Next.js deploys.
Why this stack
The road not taken matters as much as the road taken. Here is what was picked, why, and what was rejected and why.
WebRTC
The right transport for browser-to-server real-time audio. It handles network adversity that WebSockets do not.
Plain WebSocket with PCM: works in the lab, fails in coach-on-4G conditions. Latency is also worse because browsers do not jitter-buffer raw PCM.
mediasoup
Lean, performant, TypeScript-friendly, no separate service to run alongside. Fits the “clone and understand” goal.
LiveKit: excellent product, but a heavier dependency surface. Janus / Kurento: older, less ergonomic Node bindings.
Fastify
Faster than Express for the HTTP control plane around the SFU and model worker. Plugin model is sane. TypeScript-first.
Express: slower, looser typing, missing SSE primitives.
Next.js
The client needs SSR for SEO and a small audio UI. Next.js gives both with deployment to anywhere.
Vite SPA: would work, but no SSR. Acceptable, just not what we picked.
Streaming everywhere
The latency budget is unforgiving. Anything non-streaming costs hundreds of milliseconds.
Request-response adapters: the simpler shape, but one that invisibly destroys the experience.
Pino for logging
JSON, fast, low allocation. Fits inside the 700ms budget without adding noticeable overhead.
Winston: slower, more allocation, plugins fight ESM.
Performance & observability
The starter targets sub-700ms end-to-end response on a healthy 4G connection. On a wired connection the round trip drops to 400–500ms with the default adapter stack. The biggest single contributor is LLM first-token latency; running OpenAI Realtime in the same region as the model worker is worth around 80ms.
Each stage emits an OpenTelemetry span: turn.listen, turn.stt.partial, turn.stt.final, turn.llm.first_token, turn.tts.first_audio, turn.speak.complete, turn.interrupted. The dashboard examples in the wiki show how to render these into a per-turn flame chart so you can see exactly which stage owns the next millisecond.
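Emitting a stage span is a one-helper affair; a sketch assuming @opentelemetry/api, with the per-turn parent span omitted:

```ts
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("voice-agent");

// Wrap one stage of a turn in a span. The per-turn parent span that groups
// stages into a single flame-chart row is omitted here.
export async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span = tracer.startSpan(name);
  try {
    return await fn();
  } finally {
    span.end();
  }
}

// e.g. const token = await withSpan("turn.llm.first_token", () => firstToken());
```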
On the server side, a single SFU worker handles roughly forty concurrent sessions on a four-core machine. The model worker is dominated by adapter network calls; CPU is rarely the bottleneck. Horizontal scaling is by adding model-worker instances behind a sticky-session router; sessions are stateful inside a worker.
Where it is heading
- Speech-to-speech adapter contract for providers that bypass STT and TTS (e.g. realtime audio models).
- On-device Whisper adapter for self-hosted STT.
- On-device Piper adapter for self-hosted TTS.
- Multi-turn function calling across the LLM adapter, with cancellation on barge-in.
- Worked SIP-to-WebRTC bridge example so PSTN traffic can use the loop.
Read the full whitepaper for the formal technical write-up.