How Voice Agent Starter works
A complete tour of the architecture: data flow, subsystems, technology choices, performance, and where the project is heading next.
A Next.js client opens one WebRTC peer connection to a mediasoup SFU. The SFU forwards audio to a Fastify model worker that runs streaming STT, LLM, and TTS through pluggable adapters. A turn state machine handles barge-in correctly. End-to-end response begins in under 700ms on a healthy network.
Core data flow
From the moment a request enters the system to the moment a response leaves it.
Browser                        SFU                           Model worker
-------                        ---                           ------------
mic onset
  │
  │ Opus over WebRTC
  ├─────────────────────────▶ mediasoup
  │                             │ plain RTP
  │                             ├─────────────────────────▶ VAD + STT adapter
  │                             │                             │
  │                             │                             │ partial transcripts
  │                             │                             │ (logged, not sent)
  │                             │                             │
  │                             │                             │ confident-final
  │                             │                             ▼
  │                             │                           LLM adapter (streaming)
  │                             │                             │ tokens
  │                             │                             ▼
  │                             │                           TTS adapter (streaming)
  │                             │         plain RTP           │ audio chunks
  │                             │ ◀───────────────────────────│
  │    Opus over WebRTC         │
  ◀─────────────────────────────│
speakers play TTS
  │
  │ ── new mic onset detected ──▶ turn manager: state = INTERRUPTED
  │                                cancel TTS, drain buffer, reset LLM

Each subsystem, deep-dived
Every component in the data flow above, opened up and explained.
WebRTC client
The Next.js client is intentionally small. It negotiates a single peer connection with the SFU, attaches the microphone as the outbound track, listens for the inbound TTS track, and renders a UI that exposes the agent’s current state. There is no audio worklet, no JS audio buffer manipulation; the browser handles audio playback through the WebRTC pipeline directly. The mic track has Opus configured at 48 kHz mono with discontinuous transmission disabled (we want continuous audio so VAD on the server is straightforward).
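In code, the wiring is roughly the following. postOffer is a hypothetical signaling helper, because the starter's actual signaling exchange is not shown here; everything else is standard browser WebRTC.

```ts
// Minimal client wiring: one peer connection, mic out, TTS in.
// `postOffer` is a hypothetical signaling helper; the starter's actual
// signaling API is not shown here.
declare function postOffer(
  offer: RTCSessionDescription
): Promise<RTCSessionDescriptionInit>;

async function connect(audioEl: HTMLAudioElement): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Outbound: the mic as a single mono track; Opus runs at 48 kHz over WebRTC.
  const mic = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1 },
  });
  pc.addTrack(mic.getAudioTracks()[0], mic);

  // Inbound: hand the TTS track straight to an <audio> element; the browser
  // owns playback, no worklet or buffer juggling.
  pc.ontrack = (ev) => {
    audioEl.srcObject = ev.streams[0];
  };

  const offer = await pc.createOffer();
  // Opus DTX is off by default in browsers; if a build turns it on, strip it
  // so server-side VAD sees continuous audio.
  offer.sdp = offer.sdp?.replace(/usedtx=1/g, "usedtx=0");
  await pc.setLocalDescription(offer);

  const answer = await postOffer(pc.localDescription!);
  await pc.setRemoteDescription(answer);
  return pc;
}
```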
The UI surfaces the four states from the turn manager: listening, deciding, speaking, interrupted. This is not just for users: during integration testing, watching the state badge tick through the cycle catches more bugs than any console log.
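A sketch of how small that surface can be; the TurnState type and component name are illustrative, and how the state reaches the component is up to the integration:

```tsx
// Sketch only: TurnState mirrors the four turn-manager states.
type TurnState = "listening" | "deciding" | "speaking" | "interrupted";

export function StateBadge({ state }: { state: TurnState }) {
  return <span className={`state-badge state-badge--${state}`}>{state}</span>;
}
```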
mediasoup SFU
The SFU is one mediasoup worker running inside its own process. It accepts the WebRTC peer connection from the browser and creates two plain RTP transports to the model worker, one for inbound audio (mic to model) and one for outbound (TTS to speakers). The plain transports are local to the host; in production we colocate the SFU and the model worker to avoid an extra network hop.
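A sketch of those two legs, assuming mediasoup v3; the IPs and ports are illustrative, not the starter's actual values:

```ts
import { types as mediasoupTypes } from "mediasoup";

// Sketch of the two plain RTP legs, assuming mediasoup v3 and a colocated
// model worker; IPs and ports are illustrative.
async function createModelWorkerTransports(router: mediasoupTypes.Router) {
  // Inbound leg: mic audio forwarded from the SFU to the model worker.
  const toModel = await router.createPlainTransport({
    listenIp: "127.0.0.1", // local to the host: SFU and worker are colocated
    rtcpMux: true,
    comedia: false,
  });
  await toModel.connect({ ip: "127.0.0.1", port: 5004 });

  // Outbound leg: TTS audio from the model worker back toward the browser.
  const fromModel = await router.createPlainTransport({
    listenIp: "127.0.0.1",
    rtcpMux: true,
    comedia: true, // the worker's first RTP packet fixes the remote address
  });

  return { toModel, fromModel };
}
```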
mediasoup configuration is in sfu/config.ts. The two parameters most worth tuning are the worker count (one per CPU on the SFU host) and the RTP listen IP, which must be reachable from the model worker. The wiki has the Fly.io recipe.
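For orientation, a plausible shape for that file; the field names here are illustrative, not the starter's actual exports:

```ts
import os from "node:os";

// Illustrative shape only; the real sfu/config.ts may use different names.
export const config = {
  // One mediasoup worker per CPU on the SFU host.
  numWorkers: Number(process.env.SFU_WORKERS ?? os.cpus().length),

  worker: {
    rtcMinPort: 40000,
    rtcMaxPort: 49999,
  },

  // The RTP listen IP must be reachable from the model worker. On Fly.io
  // this is the machine's private address, not 127.0.0.1.
  plainTransport: {
    listenIp: process.env.SFU_RTP_LISTEN_IP ?? "127.0.0.1",
    rtcpMux: true,
  },
} as const;
```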
Turn state machine
The brain. Four states, four transitions:
- LISTENING: VAD is active, STT is consuming audio, partial transcripts are accumulating. Transition to DECIDING when a confident-final transcript arrives.
- DECIDING: LLM adapter is generating. As soon as tokens start arriving, transition to SPEAKING.
- SPEAKING: TTS is producing audio chunks; the SFU is forwarding them. Transition to LISTENING when TTS completes naturally.
- INTERRUPTED: VAD detected a new mic onset during SPEAKING. Cancel TTS, drain the SFU buffer, reset the LLM context with a “[user interrupted]” marker, transition to LISTENING.
Each transition emits a log line and an OpenTelemetry span. The state machine is the same across every adapter pairing.
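A minimal sketch of that machine, assuming pino and @opentelemetry/api; the class shape and transition table are illustrative, not the starter's actual code:

```ts
import { trace } from "@opentelemetry/api";
import pino from "pino";

type TurnState = "LISTENING" | "DECIDING" | "SPEAKING" | "INTERRUPTED";

const log = pino();
const tracer = trace.getTracer("voice-agent");

// Legal transitions only; anything else is a bug worth crashing on.
const TRANSITIONS: Record<TurnState, TurnState[]> = {
  LISTENING: ["DECIDING"],
  DECIDING: ["SPEAKING"],
  SPEAKING: ["LISTENING", "INTERRUPTED"],
  INTERRUPTED: ["LISTENING"],
};

export class TurnManager {
  state: TurnState = "LISTENING";

  transition(next: TurnState, reason: string): void {
    if (!TRANSITIONS[this.state].includes(next)) {
      throw new Error(`illegal transition ${this.state} -> ${next}`);
    }
    // Every transition emits a log line and a span. (Span naming is
    // simplified here; the starter uses per-stage names like turn.llm.first_token.)
    const span = tracer.startSpan(`turn.${next.toLowerCase()}`);
    log.info({ from: this.state, to: next, reason }, "turn transition");
    this.state = next;
    span.end();
  }
}
```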
STT, LLM, TTS adapters
Each adapter is a TypeScript module exporting one async iterator factory. STT takes audio chunks in, yields transcript events. LLM takes a chat history in, yields token strings. TTS takes a text iterable in, yields audio chunks (Opus packets, 20ms frames).
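Expressed as types, the contract might look like this; the names are ours for illustration, not the starter's exports:

```ts
// Illustrative type names; the starter's actual exports may differ.
export interface TranscriptEvent {
  text: string;
  isFinal: boolean;    // a confident final triggers LISTENING -> DECIDING
  confidence: number;
}

export interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Each adapter is one async iterator factory; AbortSignal carries cancellation.
export type SttAdapter = (
  audio: AsyncIterable<Uint8Array>, // Opus packets from the SFU
  signal: AbortSignal
) => AsyncIterable<TranscriptEvent>;

export type LlmAdapter = (
  history: ChatMessage[],
  signal: AbortSignal
) => AsyncIterable<string>; // token strings

export type TtsAdapter = (
  text: AsyncIterable<string>,
  signal: AbortSignal
) => AsyncIterable<Uint8Array>; // Opus packets, 20ms frames
```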
The defaults: Deepgram for STT (excellent streaming partials, low first-token latency), OpenAI Realtime for LLM (the streaming hot path is well-tuned), Cartesia for TTS (sub-200ms first-audio, natural prosody). All three are network calls. Replacing any one is a single file edit.
An adapter authoring guide in the wiki covers the gotchas: partial-transcript handling, confidence thresholds, end-of-turn detection, and how to handle a TTS provider that emits PCM not Opus (transcode in the adapter, do not break the contract).
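For the PCM case, a wrapper along these lines keeps the contract intact. It assumes @discordjs/opus as the encoder, which is one option among several:

```ts
import { OpusEncoder } from "@discordjs/opus"; // one option; any Opus encoder works

const FRAME_BYTES = 960 * 2; // 20ms at 48 kHz, mono, 16-bit PCM

// Wrap a PCM-emitting TTS stream so the adapter still yields Opus packets.
export async function* pcmToOpus(
  pcm: AsyncIterable<Uint8Array>
): AsyncIterable<Uint8Array> {
  const encoder = new OpusEncoder(48000, 1);
  let pending = Buffer.alloc(0);

  for await (const chunk of pcm) {
    pending = Buffer.concat([pending, Buffer.from(chunk)]);
    // Encode every complete 20ms frame; keep the remainder for the next chunk.
    while (pending.length >= FRAME_BYTES) {
      yield encoder.encode(pending.subarray(0, FRAME_BYTES));
      pending = pending.subarray(FRAME_BYTES);
    }
  }
}
```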
Barge-in cancellation path
The hardest part. When VAD detects user speech during SPEAKING:
- The TTS adapter’s in-flight stream is cancelled via its AbortController. Most providers honour this immediately and stop generating new audio.
- The SFU producer for the TTS track is told to flush. mediasoup exposes producer.pause(); we then immediately resume to ensure no stale frames are sent.
- The LLM adapter context is appended with a marker: the agent’s reply was interrupted, and the next turn should not assume the prior reply was heard.
- State transitions to LISTENING; STT begins consuming the new turn.
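Put together, the path might look like this sketch; the session shape and helper names are illustrative, not the starter's:

```ts
// Illustrative cancellation path; the session shape and names are ours,
// not the starter's.
interface Session {
  ttsAbort: AbortController;
  ttsProducer: { pause(): Promise<void>; resume(): Promise<void> }; // mediasoup Producer subset
  history: { role: string; content: string }[];
  turns: { transition(next: "INTERRUPTED" | "LISTENING", reason: string): void };
}

export async function onBargeIn(session: Session): Promise<void> {
  session.turns.transition("INTERRUPTED", "new mic onset during SPEAKING");

  // 1. Stop the in-flight TTS stream at the provider.
  session.ttsAbort.abort();

  // 2. Flush stale frames: pause the TTS producer, then resume immediately.
  await session.ttsProducer.pause();
  await session.ttsProducer.resume();

  // 3. Tell the LLM the reply was cut off.
  session.history.push({ role: "system", content: "[user interrupted]" });

  // 4. Back to listening; STT consumes the new turn.
  session.turns.transition("LISTENING", "cancellation complete");
}
```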
Each step is tested in isolation and again in integration with timing assertions. The acceptance criterion: the user must hear no more than 80 milliseconds of TTS after their interrupting word begins.
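One shape those assertions could take, with a hypothetical test harness standing in for the starter's fixtures:

```ts
import { expect, it } from "vitest";

// Hypothetical test harness; these helpers stand in for whatever the
// starter's integration fixtures actually provide.
declare function startTestSession(): Promise<{
  speakUntilAgentResponds(): Promise<void>;
  injectMicOnset(): void;              // simulated barge-in
  lastTtsFrameTime(): Promise<number>; // performance.now() of the final frame
}>;

it("stops TTS within 80ms of an interrupting onset", async () => {
  const session = await startTestSession();
  await session.speakUntilAgentResponds();

  const onsetAt = performance.now();
  session.injectMicOnset();
  const lastFrameAt = await session.lastTtsFrameTime();

  expect(lastFrameAt - onsetAt).toBeLessThanOrEqual(80);
});
```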
Configuration and deployment
Local development is pnpm dev:sfu + pnpm dev:worker + pnpm dev:web with a shared .env. Docker Compose is available for one-command boot. Production deployment is documented for Fly.io with two machines: one for the SFU (UDP ports exposed), one for the model worker. The Next.js client deploys anywhere Next.js deploys.
Why this stack
The road not taken matters as much as the road taken. Here is what was picked, why, and what was rejected and why.
WebRTC
The right transport for browser-to-server real-time audio. It handles network adversity that WebSockets do not.
Plain WebSocket with PCM: works in the lab, fails in coach-on-4G conditions. Latency is also worse because browsers do not jitter-buffer raw PCM.
mediasoup
Lean, performant, TypeScript-friendly, no separate service to run alongside. Fits the “clone and understand” goal.
LiveKit: excellent product, but a heavier dependency surface. Janus / Kurento: older, less ergonomic Node bindings.
Fastify
Faster than Express for the HTTP control plane around the SFU and model worker. Plugin model is sane. TypeScript-first.
Express: slower, looser typing, missing SSE primitives.
Next.js
The client needs SSR for SEO and a small audio UI. Next.js gives both with deployment to anywhere.
Vite SPA: would work, but no SSR. Acceptable, just not what we picked.
Streaming everywhere
The latency budget is unforgiving. Anything non-streaming costs hundreds of milliseconds.
Request-response adapters: the simpler shape, but one that invisibly destroys the experience.
Pino for logging
JSON, fast, low allocation. Fits inside the 700ms budget without adding noticeable overhead.
Winston: slower, more allocation, plugins fight ESM.
Performance & observability
The starter targets sub-700ms end-to-end response on a healthy 4G connection. On a wired connection the round trip drops to 400–500ms with the default adapter stack. The biggest single contributor is LLM first-token latency; running OpenAI Realtime in the same region as the model worker is worth around 80ms.
Each stage emits an OpenTelemetry span: turn.listen, turn.stt.partial, turn.stt.final, turn.llm.first_token, turn.tts.first_audio, turn.speak.complete, turn.interrupted. The dashboard examples in the wiki show how to render these into a per-turn flame chart so you can see exactly which stage owns the next millisecond.
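Emitting a stage span is a one-helper affair; a sketch assuming @opentelemetry/api, with the per-turn parent span omitted:

```ts
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("voice-agent");

// Wrap one stage of a turn in a span. The per-turn parent span that groups
// stages into a single flame-chart row is omitted here.
export async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span = tracer.startSpan(name);
  try {
    return await fn();
  } finally {
    span.end();
  }
}

// e.g. const token = await withSpan("turn.llm.first_token", () => firstToken());
```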
On the server side, a single SFU worker handles roughly forty concurrent sessions on a four-core machine. The model worker is dominated by adapter network calls; CPU is rarely the bottleneck. Horizontal scaling is by adding model-worker instances behind a sticky-session router; sessions are stateful inside a worker.
Where it is heading
- Speech-to-speech adapter contract for providers that bypass STT and TTS (e.g. realtime audio models).
- On-device Whisper adapter for self-hosted STT.
- On-device Piper adapter for self-hosted TTS.
- Multi-turn function calling across the LLM adapter, with cancellation on barge-in.
- Worked SIP-to-WebRTC bridge example so PSTN traffic can use the loop.
Read the full whitepaper for the formal technical write-up.