A sub-second voice agent loop, end-to-end.
Speak. The model speaks back. Cut in any time.
Voice Agent Starter is the full-duplex voice loop most teams half-finish. The browser captures audio over WebRTC; mediasoup forwards it to a Fastify worker that runs your chosen STT, LLM, and TTS adapters; the TTS audio streams back over the same peer connection on a second track. Barge-in cancels in-flight TTS the moment the user starts talking. Built in TypeScript, with a Next.js client.
Why this exists
Voice agents are easy to demo and brutal to ship. The one-to-one demo on fast Wi-Fi sounds magical. The first time a real user is on a coach, on patchy 4G, with a five-year-old laptop, every shortcut a demo can take is exposed: WebRTC negotiation stalls, half-duplex audio breaks the moment someone interrupts, and the TTS finishes saying “please hold while I check” thirty seconds after the user has already moved on.
The half-decent reference implementations on GitHub are written in five different languages, mix transports, and assume you have already built mediasoup workers. The polished commercial offerings handle the hard parts but cost between $600 and $5,000 a month per instance and lock the audio path inside their cloud.
Voice Agent Starter is the open-source middle. WebRTC and mediasoup so the audio path is a known quantity. Fastify so the server can be reasoned about. Pluggable STT, TTS, and LLM so you choose which provider is on the other end. Barge-in handled correctly so the agent feels alive, not hostage.
What it does
Every feature below ships in the public repository today. Clone, configure, run.
WebRTC capture
Browser captures audio at 48 kHz mono with the right Opus configuration. Voice activity detection drives the barge-in signal.
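The capture settings can be sketched as a constraints object. Field names follow the standard MediaTrackConstraints dictionary; the exact values the repo uses may differ.

```typescript
// Illustrative capture constraints: 48 kHz mono with browser-side
// echo cancellation and noise suppression enabled.
export const captureConstraints = {
  audio: {
    channelCount: 1,        // mono
    sampleRate: 48000,      // matches Opus's native rate
    echoCancellation: true, // stops the agent's own TTS tripping the VAD
    noiseSuppression: true,
  },
  video: false,
};

// In the browser this is handed straight to getUserMedia:
//   const stream = await navigator.mediaDevices.getUserMedia(captureConstraints);
```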
mediasoup SFU
A single mediasoup worker forwards audio to the model worker and TTS audio back to the browser. Same RTC peer connection, two tracks.
Adapter contract
STT, LLM, and TTS each implement a small TypeScript interface. Swap providers in one file. Defaults: Deepgram, OpenAI Realtime, Cartesia.
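The contract can be sketched as three streaming interfaces. Names and exact shapes here are assumptions for illustration, not the repo's actual exports.

```typescript
// Hedged sketch of the adapter contract: audio and tokens flow as async
// iterables so every stage can stream.
export interface SttAdapter {
  // Raw audio frames in, partial/final transcripts out.
  transcribe(audio: AsyncIterable<Uint8Array>): AsyncIterable<{ text: string; final: boolean }>;
}

export interface LlmAdapter {
  // A finalised user turn in, response tokens out as they stream.
  respond(transcript: string): AsyncIterable<string>;
}

export interface TtsAdapter {
  // Response tokens in, encoded audio frames out.
  synthesize(tokens: AsyncIterable<string>): AsyncIterable<Uint8Array>;
}

// A trivial adapter to show the contract in use (illustrative only).
export const echoLlm: LlmAdapter = {
  async *respond(transcript: string) {
    for (const word of transcript.split(" ")) yield word;
  },
};
```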
Streaming STT
Partial transcripts arrive while the user is still talking. The LLM is invoked the moment a confident-final transcript is detected.
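The hand-off works something like this sketch: keep consuming partials for display, and kick off the LLM turn the instant a final segment arrives, without waiting for the STT stream to close. Function names are illustrative.

```typescript
// Invoke the turn callback on a final transcript while STT keeps running.
// Names are assumptions for illustration, not the repo's API.
export async function pumpTranscripts(
  transcripts: AsyncIterable<{ text: string; final: boolean }>,
  onPartial: (text: string) => void,
  startTurn: (text: string) => void,
): Promise<void> {
  for await (const t of transcripts) {
    if (t.final) {
      startTurn(t.text); // LLM starts now; STT keeps listening
    } else {
      onPartial(t.text); // live caption while the user is still talking
    }
  }
}
```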
Streaming TTS
TTS audio streams back over the WebRTC track as it is generated. First audible response well under one second on the default stack.
Correct barge-in
A clean barge-in cancels in-flight TTS, drains the audio buffer, resets the model context, and sends the new turn upstream. Tested for the awkward edge cases.
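The four steps above can be sketched as one cancel routine. The `ActiveTurn` shape is an assumption about the worker's internals, not the repo's actual type.

```typescript
// Illustrative barge-in: cancel TTS, drain the buffer, reset context.
export interface ActiveTurn {
  abort: AbortController;      // wired into the in-flight TTS/LLM requests
  audioBuffer: Uint8Array[];   // frames synthesised but not yet sent
}

export function bargeIn(turn: ActiveTurn, resetContext: () => void): void {
  turn.abort.abort();          // 1. cancel in-flight TTS mid-utterance
  turn.audioBuffer.length = 0; // 2. drain queued audio so nothing stale plays
  resetContext();              // 3. reset the model context for the new turn
  // 4. the caller then sends the new user turn upstream as usual
}
```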
Turn manager
Explicit turn state machine: listening, deciding, speaking, interrupted. Logs are readable. Bugs are findable.
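A minimal version of that state machine, with only the legal transitions. The real one in the repo will carry more context; this is the shape, not the implementation.

```typescript
// Four explicit states and an allow-list of transitions between them.
type TurnState = "listening" | "deciding" | "speaking" | "interrupted";

const transitions: Record<TurnState, TurnState[]> = {
  listening: ["deciding"],
  deciding: ["speaking", "listening"],
  speaking: ["interrupted", "listening"],
  interrupted: ["listening"],
};

export class TurnManager {
  state: TurnState = "listening";

  to(next: TurnState): void {
    if (!transitions[this.state].includes(next)) {
      throw new Error(`illegal transition ${this.state} -> ${next}`);
    }
    this.state = next; // a real implementation would also log a span here
  }
}
```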
Latency telemetry
Every stage of the loop emits a span. P50 and P95 break down per stage, so you know whether STT, LLM, or TTS owns the next millisecond.
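A sketch of what the per-stage breakdown involves: record a duration per span, then read percentiles off the sorted samples. The repo may well use a proper tracing library; the names here are illustrative.

```typescript
// Per-stage duration samples, keyed by stage name ("stt", "llm", "tts", ...).
const spans: Record<string, number[]> = {};

export function recordSpan(stage: string, ms: number): void {
  (spans[stage] ??= []).push(ms);
}

// Nearest-rank percentile over everything recorded for a stage.
export function percentile(stage: string, p: number): number {
  const sorted = [...(spans[stage] ?? [])].sort((a, b) => a - b);
  if (sorted.length === 0) return 0;
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}
```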
Adapter registry
Adapters live in their own packages. Add a new STT in twenty minutes by implementing the streaming-iterator interface.
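The registration side might look like this sketch: each adapter package registers a factory by name, and the worker resolves one from config at startup. The repo's actual registry API is its own.

```typescript
// Illustrative generic registry, one instance per adapter kind.
type AdapterFactory<T> = () => T;

export class AdapterRegistry<T> {
  private factories = new Map<string, AdapterFactory<T>>();

  register(name: string, factory: AdapterFactory<T>): void {
    this.factories.set(name, factory);
  }

  resolve(name: string): T {
    const factory = this.factories.get(name);
    if (!factory) {
      throw new Error(`unknown adapter "${name}" (registered: ${[...this.factories.keys()].join(", ")})`);
    }
    return factory();
  }
}
```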
Containerised
Docker compose for local development. Fly.io machine sizes recommended. mediasoup ports documented.
Architecture, in one diagram
The whole system on a single screen. Every box maps to a real folder in the repo.
┌─────────────────────────┐
│ Browser │
│ Next.js, WebRTC PC │
│ ▲ TTS audio in │
│ ▼ Mic audio out │
└──────┬─────────────┬─────┘
│ │
│ WebRTC │ WebRTC
▼ ▲
┌──────────────────────────┐
│ mediasoup SFU worker │
│ one PC, two tracks │
└──────┬─────────────┬─────┘
▼ ▲
┌──────────────────────────┐
│ Fastify model-worker │
│ ┌──────────┐ ┌──────────┐ │
│ │ STT │ │ TTS │ │
│ │ adapter │ │ adapter │ │
│ └────┬─────┘ └─────▲────┘ │
│ ▼ │ │
│ ┌──────────────────┐ │
│ │ Turn manager │ │
│ │ (state machine) │ │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌────────────────────┐ │
│ │ LLM adapter │ │
│ │ streaming tokens │ │
│ └────────────────────┘ │
└──────────────────────────┘

Quick start
From clone to first request in under five minutes.
git clone https://github.com/sarmakska/voice-agent-starter.git
cd voice-agent-starter
pnpm install
cp .env.example .env    # add DEEPGRAM_KEY, OPENAI_KEY, CARTESIA_KEY
pnpm dev:sfu            # starts mediasoup on :3478
pnpm dev:worker         # starts model worker on :7100
pnpm dev:web            # Next.js client on :3000
open http://localhost:3000 # click "Start", grant mic, talk
Where it fits
The patterns this repository was built around.
Customer support agents
Front-line voice agents for inbound support. The barge-in path is essential when callers are queuing for the operator.
Tutors and coaches
Education and coaching apps where the user is doing most of the talking. The agent pauses, prompts, redirects.
Embodied product UX
In-product voice for hands-busy workflows: kitchens, garages, field engineers. The Next.js client runs in a normal browser.
Internal voice ops
Phone-style internal interfaces over WebRTC for warehouse and ops teams. Pluggable STT lets you pin a regional model for accent coverage.
Related products
The wider Sarma Linux toolkit. Every project ships with the same opinions: open source, MIT, real depth, no marketing fluff.
Ship a voice agent that does not feel like a 2018 demo.
Clone the repo, follow the four-step quick start, ship something real.