Voice Agent Starter
A real-time voice agent loop with WebRTC, mediasoup, Fastify, and Next.js, with pluggable STT, LLM, and TTS adapters, and barge-in handled correctly.
§ Abstract
Voice agents have a deceptively narrow performance window. A user asking a question over voice expects a response that begins inside one second. Anything later than that and the conversation feels broken; anything later than two seconds and most users have started speaking again. Inside that one-second budget, the system must capture audio from a browser, transcribe it, decide what to say, synthesise speech for that decision, and stream the audio back. Each stage has variance. Each stage has independent failure modes. The integration is where most projects lose months.
Voice Agent Starter is the integrated reference: WebRTC for transport, mediasoup as the SFU, a Fastify worker that runs the STT, LLM, and TTS adapters under a single explicit turn state machine, and a Next.js client. Barge-in is treated as the load-bearing feature it is, with explicit cancellation paths for in-flight TTS and an audio buffer drain that runs before any new turn begins. The adapters are pluggable: the default stack is Deepgram, OpenAI Realtime, and Cartesia, but each is one TypeScript file that any team can replace.
This whitepaper documents the architecture, the latency budget, the turn state machine, the adapter contracts, and the operational notes that turned the loop from a fragile demo into something that survives a real network.
1 · Executive Summary
Voice Agent Starter is a TypeScript codebase split into three deployable units: a Next.js client, a mediasoup SFU worker, and a Fastify model worker. The browser opens a single WebRTC peer connection to the SFU, with one outbound audio track (the microphone) and one inbound audio track (the synthesised TTS). The SFU forwards audio between the browser and the model worker.
Inside the model worker, an explicit turn state machine moves between four states: listening (capturing audio, running streaming STT), deciding (LLM is generating a response), speaking (TTS is producing audio that the SFU is forwarding to the browser), and interrupted (a new mic onset has been detected during speaking; cancel TTS, drain buffers, return to listening). State transitions are observable; every transition emits a structured log line and an OpenTelemetry span.
STT, LLM, and TTS are adapters. Each implements a streaming TypeScript interface. The defaults work out of the box. Swapping a default for an internal model is a one-file change.
2 · Background
The current generation of voice models has narrowed the latency floor that voice agents can target. The realtime APIs from frontier providers offer time-to-first-audio in the low hundreds of milliseconds. Modern streaming TTS models begin emitting audio after a similar wait. The remaining contributors to perceived latency are network round trips, voice activity detection, and the seams between components. The seams are where good voice agents are won and lost.
WebRTC is the right transport for browser-to-server voice. It handles the network adversity that real users live with: variable bandwidth, packet loss, NAT traversal, devices that get plugged in mid-call. The alternative, a WebSocket pumping raw PCM, works in the lab and breaks in the field. Most production voice agents use WebRTC; almost no open reference servers do.
mediasoup is the SFU we chose. It is performant, well-maintained, deployed at scale by serious teams, and has a TypeScript-friendly client API. The alternative SFUs (Janus, Kurento, LiveKit’s server) are all defensible. mediasoup gave us the best Node-native developer ergonomics.
3 · Problem in detail
Latency is cumulative and unforgiving
The user perceives end-to-end latency. The system has a budget of around seven hundred milliseconds before the experience degrades. That budget includes round-trip network delay (~50–150 ms), voice activity detection (~80–150 ms), STT first-token (~150–250 ms after final speech), LLM first-token (~150–400 ms depending on model), TTS first-audio (~150–250 ms after first LLM token), plus jitter buffer time. Anything that adds even fifty milliseconds in the wrong place is visible.
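Summing those ranges is a useful sanity check on the budget. A back-of-envelope sketch using the figures above:

```ts
// Per-stage latency ranges from above, in milliseconds: [floor, ceiling].
const stages: Record<string, [number, number]> = {
  networkRtt:    [50, 150],
  vad:           [80, 150],
  sttFirstToken: [150, 250],
  llmFirstToken: [150, 400],
  ttsFirstAudio: [150, 250],
};

const floor = Object.values(stages).reduce((s, [lo]) => s + lo, 0);   // 580 ms
const ceil  = Object.values(stages).reduce((s, [, hi]) => s + hi, 0); // 1200 ms
// The ~700 ms target is only reachable when most stages sit near their floor
// and overlap via streaming, which is why every adapter contract streams.
```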
Barge-in is the difference between alive and dead
An agent that cannot be interrupted is intolerable. Real conversation interrupts. The system must detect a mic onset during the speaking state, cancel TTS at the model layer (so no more audio is generated), drain the in-flight audio buffer (so the user does not hear three more seconds of stale speech), and reset the LLM context to indicate that the prior response was abandoned mid-sentence. Get any one of these wrong and the agent will say, three seconds later, the second half of a sentence the user already overrode.
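The interrupted path, in order. This is a hedged sketch: `drainProducer` and the `Turn` shape are hypothetical helpers, but the ordering is the point.

```ts
interface Turn {
  llmAbort: AbortController;   // aborts the streaming LLM call
  ttsAbort: AbortController;   // aborts the streaming TTS call
  spokenSoFar: string;         // transcript of audio actually played out
  history: { role: "user" | "assistant"; content: string }[];
}

// Hypothetical helper: discards audio queued between TTS and the browser.
declare function drainProducer(producerId: string): Promise<void>;

async function onBargeIn(turn: Turn, producerId: string): Promise<void> {
  // 1. Stop generation first, so no new audio is produced from here on.
  turn.llmAbort.abort();
  turn.ttsAbort.abort();

  // 2. Drain what is already in flight; cancelling the API call does nothing
  //    for chunks already sitting in the outbound buffer.
  await drainProducer(producerId);

  // 3. Record that the response was abandoned mid-sentence, so the next turn's
  //    LLM call knows the user never heard the rest of it.
  turn.history.push({
    role: "assistant",
    content: `${turn.spokenSoFar} [interrupted by user]`,
  });
}
```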
Adapter coupling is the slow death
Most voice agent implementations couple the STT call directly into the LLM call directly into the TTS call. A change of provider is a rewrite. The starter inverts that: each provider is an adapter behind a streaming interface. The turn manager calls the interface, never the provider.
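The shapes below are a sketch of what that streaming interface typically looks like; the repo's actual type names may differ:

```ts
interface TranscriptEvent {
  text: string;
  isFinal: boolean;
  confidence: number;
}

interface SttAdapter {
  // Consume raw audio frames; yield partial and final transcripts as they arrive.
  transcribe(audio: AsyncIterable<Uint8Array>): AsyncIterable<TranscriptEvent>;
}

interface LlmAdapter {
  // Yield response tokens; the signal lets the turn manager abort on barge-in.
  generate(prompt: string, signal: AbortSignal): AsyncIterable<string>;
}

interface TtsAdapter {
  // Consume tokens as they stream in; yield encoded audio chunks.
  synthesize(tokens: AsyncIterable<string>, signal: AbortSignal): AsyncIterable<Uint8Array>;
}
```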
4 · Goals + non-goals
Goals
- End-to-end voice loop that begins responding within seven hundred milliseconds on a healthy 4G connection.
- Barge-in handled correctly, including the edge cases (false-trigger from background noise, late onset during the last word of a TTS sentence).
- Pluggable STT, LLM, TTS via a small TypeScript interface.
- One WebRTC peer connection per session, two tracks. No protocol gymnastics.
- Local development with one command. Production deployment to Fly.io documented.
Non-goals
- Multi-party calls. The starter is one human and one agent.
- SIP, PSTN, or telephony bridges. The transport is WebRTC. Add a SIP gateway in front if you need a phone number.
- Custom acoustic models. We assume hosted STT and TTS, even when those are self-hosted behind an HTTP endpoint.
- A speech-to-speech model. The architecture supports it (see future direction); the default is the three-stage pipeline.
5 · Architecture
Three processes, communicating over two networks.
```
Browser                Internet               Server side
   │                       │                       │
   ├── WebRTC ─────────────────────────▶ mediasoup SFU
   │                                           │
   │                                           ├── PlainTransport ▶ Fastify worker
   │                                           │
   ◀─────────────── WebRTC ──────────── mediasoup SFU
   │                                           ▲
   │                                           │
   │                                           └── PlainTransport ◀ Fastify worker
```
The browser maintains one peer connection to the SFU. The SFU has two plain RTP transports to the Fastify model worker, one for each direction. From the model worker’s perspective, audio in and audio out are local UDP streams; the SFU does the WebRTC negotiation with the browser. This separation is what allows the model worker to scale on its own dimensions (CPU and memory for STT/TTS adapters) without WebRTC complexity, and the SFU to scale on its own (network throughput).
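The plain RTP legs use mediasoup's `createPlainTransport`. A minimal sketch, with addresses and ports as placeholders for a co-located worker:

```ts
import type { types } from "mediasoup";

// One plain transport per direction between the SFU and the model worker.
async function createWorkerLeg(router: types.Router): Promise<types.PlainTransport> {
  const transport = await router.createPlainTransport({
    listenIp: { ip: "127.0.0.1" }, // SFU and model worker share a host here
    rtcpMux: true,
    comedia: false,                // the worker's address is known up front
  });
  // Point the transport at the model worker's RTP socket (placeholder port).
  await transport.connect({ ip: "127.0.0.1", port: 5004 });
  return transport; // call produce() or consume() on it to move audio
}
```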
The turn state machine in the model worker is the brain. Voice activity detection on the inbound stream sets the listening boundary. STT runs streaming, with partial transcripts. A confident-final transcript triggers the LLM call. LLM tokens stream into the TTS adapter; TTS audio streams to the SFU; SFU forwards to the browser. A new VAD onset during speaking triggers the interrupted path.
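One healthy turn through that pipeline, in sketch form. The adapter shapes come from the interface sketch in section 3; the confidence gate is an illustrative value:

```ts
async function runTurn(
  stt: SttAdapter,
  llm: LlmAdapter,
  tts: TtsAdapter,
  audioIn: AsyncIterable<Uint8Array>,
  playAudio: (chunk: Uint8Array) => void, // forwards to the SFU in practice
): Promise<void> {
  for await (const event of stt.transcribe(audioIn)) {
    // Partials drive the UI; only a confident final transcript triggers the LLM.
    if (!event.isFinal || event.confidence < 0.8) continue;

    const abort = new AbortController(); // fired by the barge-in path
    const tokens = llm.generate(event.text, abort.signal);
    // TTS consumes LLM tokens as they stream; audio goes out chunk by chunk.
    for await (const chunk of tts.synthesize(tokens, abort.signal)) {
      playAudio(chunk);
    }
  }
}
```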
6 · Key technical decisions
One peer connection, two tracks
Two peer connections would have made the cancellation path simpler in some ways, but would have doubled the negotiation cost and invited glare conditions on slow networks. One peer connection with two tracks is the standard. The complexity moves into the turn manager, which is where complexity belongs.
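On the client, the microphone leg with mediasoup-client looks roughly like this. Signalling is elided: `sendTransportParams` and `routerRtpCapabilities` arrive from the SFU over your own channel, and the inbound TTS track mirrors this with a receive transport and `consume`:

```ts
import { Device } from "mediasoup-client";

async function publishMic(routerRtpCapabilities: any, sendTransportParams: any) {
  const device = new Device();
  await device.load({ routerRtpCapabilities });

  const sendTransport = device.createSendTransport(sendTransportParams);
  // sendTransport fires "connect" and "produce" events that must be relayed
  // to the SFU over signalling; handlers omitted here.

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  await sendTransport.produce({ track: stream.getAudioTracks()[0] });
}
```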
mediasoup over LiveKit
LiveKit is excellent. We chose mediasoup because it is the leanest dependency for the job: one library, no separate service to deploy alongside, no cluster bootstrap. For a starter optimised for “clone, run, understand”, mediasoup wins.
Streaming everywhere
Every adapter is streaming. STT yields partial transcripts; LLM yields tokens; TTS yields audio chunks. Non-streaming adapters can be wrapped, but the contract is streaming-first. This is the only way to hit the latency budget.
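A non-streaming provider can be wrapped into the contract with an async generator, at the cost of paying its full latency up front. A trivial sketch against the TtsAdapter shape from section 3:

```ts
// Wrap a one-shot TTS call so it satisfies the streaming TtsAdapter contract.
function wrapOneShotTts(
  synthesizeOnce: (text: string) => Promise<Uint8Array>,
): TtsAdapter {
  return {
    async *synthesize(tokens: AsyncIterable<string>, _signal: AbortSignal) {
      let text = "";
      for await (const token of tokens) text += token; // must buffer everything
      yield await synthesizeOnce(text); // one late chunk: correct, but slow
    },
  };
}
```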
Explicit state machine
An implicit state machine (“the LLM is running, so we are not in TTS”) is impossible to debug. The starter’s turn manager has named states, named transitions, and a log line per transition. The bug fixes that came out of testing on real networks were all enabled by reading those logs.
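A minimal sketch of that shape. The state names match the whitepaper; the transition table and log format are illustrative, not the repo's actual code:

```ts
type TurnState = "listening" | "deciding" | "speaking" | "interrupted";

// Legal transitions; anything else is a bug that should fail loudly.
const transitions: Record<TurnState, TurnState[]> = {
  listening: ["deciding"],
  deciding: ["speaking"],
  speaking: ["listening", "interrupted"], // finished normally, or mic onset
  interrupted: ["listening"],             // after cancel + drain
};

function transition(from: TurnState, to: TurnState): TurnState {
  if (!transitions[from].includes(to)) {
    throw new Error(`illegal turn transition: ${from} -> ${to}`);
  }
  console.log(JSON.stringify({ event: "turn.transition", from, to, at: Date.now() }));
  return to;
}
```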
7 · Implementation milestones
Milestone 1 · audio loop
The first thing built was an echo loop. Browser captures audio, mediasoup forwards it to the model worker, model worker forwards it back. No models, no STT, no TTS. Once that loop was tight (sub-100ms RTT measured), every later addition was measured against that baseline.
Milestone 2 · STT and TTS adapters
Streaming STT was added next, with a Deepgram adapter as the reference. The TTS adapter contract followed, with Cartesia. Audio out from TTS went straight back through the SFU. At this point the demo was a one-shot “press the button, talk, hear an echo of yourself in a different voice”.
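A compressed sketch of what a live-transcription adapter looks like with the Deepgram JS SDK. Option values are illustrative; error handling and reconnection are trimmed:

```ts
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";

function startDeepgram(
  audio: AsyncIterable<Uint8Array>,
  onTranscript: (text: string, isFinal: boolean) => void,
) {
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
  const connection = deepgram.listen.live({
    model: "nova-2",        // illustrative; choose per deployment
    interim_results: true,  // partial transcripts for the turn manager
    encoding: "linear16",
    sample_rate: 48000,
  });

  connection.on(LiveTranscriptionEvents.Open, async () => {
    for await (const chunk of audio) connection.send(chunk);
  });

  connection.on(LiveTranscriptionEvents.Transcript, (data) => {
    const alt = data.channel.alternatives[0];
    if (alt.transcript) onTranscript(alt.transcript, data.is_final);
  });
}
```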
Milestone 3 · turn manager and barge-in
The state machine was added. Barge-in was the hardest part: cancelling a streaming TTS request mid-flight is harder than starting one, and draining the audio buffer in the SFU requires sending a stop on the producer. This milestone shipped with twenty named tests covering the timing edge cases.
Milestone 4 · LLM adapter and observability
The LLM adapter was added last because at that point the integration was already tight. OpenAI Realtime is the default. OTel spans were added across every stage of the loop, so latency regressions are caught in CI.
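The span wiring is plain OpenTelemetry, roughly this shape around each stage. The tracer name and stage labels are illustrative:

```ts
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("voice-agent");

// Wrap one stage of the loop in a span so per-stage latency shows up in traces.
async function traced<T>(stage: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(`turn.${stage}`, async (span) => {
    try {
      return await fn();
    } finally {
      span.end();
    }
  });
}

// Usage, e.g.: const reply = await traced("deciding", () => runLlm(prompt));
```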
8 · Lessons / honest limits
Lessons
- Barge-in cannot be added later. If the rest of the loop assumes it never happens, the integration is wrong. Build for it from the first test.
- The audio buffer is sneaky. Cancelling TTS at the API stops new audio. The audio already in flight on the WebRTC track keeps playing. Drain it explicitly.
- VAD tuning is per-environment. The starter ships with sensible defaults, but the right thresholds for a quiet office are not the right ones for a kitchen; see the sketch below.
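What "sensible defaults" means in practice is a handful of knobs like these. The names and values are illustrative, not the repo's actual config:

```ts
// Illustrative VAD tuning surface; the right values depend on the room.
interface VadConfig {
  energyThresholdDb: number; // onset sensitivity; raise it in noisy rooms
  onsetMs: number;           // speech must persist this long to count as an onset
  hangoverMs: number;        // silence must persist this long to end the utterance
}

const quietOffice: VadConfig  = { energyThresholdDb: -45, onsetMs: 120, hangoverMs: 400 };
const noisyKitchen: VadConfig = { energyThresholdDb: -30, onsetMs: 200, hangoverMs: 600 };
```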
Honest limits
- One human, one agent. Multi-party is out of scope. Adapt with care.
- No telephony. Front it with a SIP gateway if you need a phone number.
- No on-device adapters today. Adapters are network calls. Local Whisper or local TTS is on the roadmap.
- P99 latency is bandwidth-bound. On 3G or saturated Wi-Fi the budget breaks. The starter cannot fix the underlying network.
9 · Conclusion
Voice agents are not particularly hard, in the sense that the individual components are commodity. They are hard in the sense that the integration is unforgiving and the failure modes are subtle. Voice Agent Starter is the integration we wished existed: a single repository where the WebRTC, the SFU, the adapters, and the turn manager all live together, with the boring infrastructure already wired and the failure modes already exercised.
The starter is MIT licensed. The wiki contains a deployment walkthrough for Fly.io, an adapter authoring guide, and a barge-in test recipe you can run against your own deploy.
Voice Agent Starter · Built by Sarma Linux · MIT licensed