Open Source · MIT · Real-time WebRTC

A sub-second voice agent loop, end-to-end.

Speak. The model speaks back. Cut in any time.

Voice Agent Starter is the full-duplex voice loop most teams half-finish. The browser captures audio over WebRTC; mediasoup forwards it to a Fastify worker that runs your chosen STT, LLM, and TTS adapters; the TTS audio streams back through the same WebRTC track. Barge-in cancels in-flight TTS the moment the user starts talking. Built in TypeScript with Next.js for the client.

< 700ms · Round-trip target
Yes · Barge-in
WebRTC · Transport
3x · Pluggable adapters
MIT · Licence

Why this exists

Voice agents are easy to demo and brutal to ship. The one-to-one demo on fast Wi-Fi sounds magical. The first time a real user is on a coach, on patchy 4G, with a five-year-old laptop, every shortcut a demo can take is exposed: WebRTC negotiation stalls, half-duplex audio breaks the moment someone interrupts, and the TTS finishes saying "please hold while I check" thirty seconds after the user has already moved on.

The half-decent reference implementations on GitHub are written in five different languages, mix transports, and assume you have already built mediasoup workers. The polished commercial offerings handle the hard parts but cost between $600 and $5,000 a month per instance and lock the audio path inside their cloud.

Voice Agent Starter is the open-source middle. WebRTC and mediasoup so the audio path is a known quantity. Fastify so the server can be reasoned about. Pluggable STT, TTS, and LLM so you choose which provider is on the other end. Barge-in handled correctly so the agent feels alive, not held hostage.

What it does

Every feature below ships in the public repository today. Clone, configure, run.

WebRTC capture

Browser captures audio at 48 kHz mono with the right Opus configuration. Voice activity detection drives the barge-in signal.
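
A minimal sketch of the capture path using standard browser APIs; the constraint values are illustrative, not the repo's exact configuration:

async function startCapture(pc: RTCPeerConnection): Promise<MediaStreamTrack> {
  // Ask for a mono 48 kHz mic track; WebRTC audio is Opus-encoded by default.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      channelCount: 1,
      sampleRate: 48000,
      echoCancellation: true,
      noiseSuppression: true,
    },
  });
  const [mic] = stream.getAudioTracks();
  pc.addTrack(mic, stream); // mic audio out; the TTS track arrives via pc.ontrack
  return mic;               // voice activity detection watches this track for barge-in
}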

mediasoup SFU

A single mediasoup worker forwards audio to the model worker and TTS audio back to the browser. Same RTC peer connection, two tracks.
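
A rough server-side sketch using the mediasoup v3 API; the codec settings mirror the capture side, and the variable names and announced-IP handling are illustrative:

import * as mediasoup from "mediasoup";

// One worker, one router, Opus only: the audio path stays a known quantity.
const worker = await mediasoup.createWorker();
const router = await worker.createRouter({
  mediaCodecs: [
    { kind: "audio", mimeType: "audio/opus", clockRate: 48000, channels: 2 },
  ],
});

// One WebRTC transport per browser peer connection: the mic track is produced
// on it and the TTS track is consumed from it (one PC, two tracks).
const transport = await router.createWebRtcTransport({
  listenIps: [{ ip: "0.0.0.0", announcedIp: process.env.ANNOUNCED_IP }],
  enableUdp: true,
  enableTcp: true,
  preferUdp: true,
});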

Adapter contract

STT, LLM, and TTS each implement a small TypeScript interface. Swap providers in one file. Defaults: Deepgram, OpenAI Realtime, Cartesia.
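
The real interfaces live in the repo; as a hypothetical sketch of their shape, each adapter is a thin async-iterator wrapper around its provider:

// Hypothetical adapter shapes, not the repo's literal definitions.
export interface SttAdapter {
  // Consumes encoded audio frames, yields partial and final transcripts.
  transcribe(audio: AsyncIterable<Uint8Array>): AsyncIterable<{ text: string; isFinal: boolean }>;
}

export interface LlmAdapter {
  // Yields response tokens as they stream from the provider.
  complete(prompt: string, signal?: AbortSignal): AsyncIterable<string>;
}

export interface TtsAdapter {
  // Consumes text as it is generated, yields encoded audio chunks.
  synthesize(text: AsyncIterable<string>, signal?: AbortSignal): AsyncIterable<Uint8Array>;
}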

Streaming STT

Partial transcripts arrive while the user is still talking. The LLM is invoked the moment a confident-final transcript is detected.
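
A sketch of that trigger, assuming the adapter shape above; the confidence threshold, field names, and startLlmTurn helper are placeholders:

async function onTranscripts(
  parts: AsyncIterable<{ text: string; isFinal: boolean; confidence?: number }>,
) {
  for await (const part of parts) {
    if (!part.isFinal) continue;                // partials only update the live caption
    if ((part.confidence ?? 1) < 0.8) continue; // wait for a confident final
    startLlmTurn(part.text);                    // hand the turn to the LLM immediately
  }
}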

Streaming TTS

TTS audio streams back over the WebRTC track as it is generated. First audible response well under one second on the default stack.
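
A sketch of the forwarding loop, assuming the TtsAdapter shape above and a hypothetical writer that pushes frames onto the outbound WebRTC track:

async function speak(tts: TtsAdapter, tokens: AsyncIterable<string>, signal: AbortSignal) {
  for await (const chunk of tts.synthesize(tokens, signal)) {
    if (signal.aborted) break;        // barge-in: stop forwarding mid-utterance
    await outboundTrack.write(chunk); // hypothetical writer onto the outbound audio track
    // The first chunk is what the user hears, so nothing here buffers the full reply.
  }
}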

Correct barge-in

A clean barge-in cancels in-flight TTS, drains the audio buffer, resets the model context, and sends the new turn upstream. Tested for the awkward edge cases.
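
One way to express the cancellation path is an AbortController per agent turn; the drain and reset helpers are placeholders for the real ones:

let currentTurn: AbortController | null = null;

function onUserSpeech() {             // fired by VAD the moment the user starts talking
  if (currentTurn) {
    currentTurn.abort();              // cancel in-flight LLM tokens and TTS synthesis
    drainOutboundAudio();             // drop TTS frames still queued for the track
    resetModelContext();              // the interrupted reply never becomes "history"
  }
  currentTurn = new AbortController();
  beginListening(currentTurn.signal); // the new turn goes upstream
}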

Turn manager

Explicit turn state machine: listening, deciding, speaking, interrupted. Logs are readable. Bugs are findable.
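
A sketch of that machine as a lookup table; the state names come from the list above, the event names are illustrative:

import pino from "pino";
const log = pino();

type TurnState = "listening" | "deciding" | "speaking" | "interrupted";
type TurnEvent = "final_transcript" | "first_tts_chunk" | "tts_done" | "user_spoke";

const transitions: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  listening:   { final_transcript: "deciding" },
  deciding:    { first_tts_chunk: "speaking", user_spoke: "listening" },
  speaking:    { user_spoke: "interrupted", tts_done: "listening" },
  interrupted: { final_transcript: "deciding" },
};

function next(state: TurnState, event: TurnEvent): TurnState {
  const to = transitions[state][event];
  if (!to) return state;                        // ignore events that do not apply
  log.info({ from: state, event, to }, "turn"); // readable logs, findable bugs
  return to;
}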

Latency telemetry

Every stage of the loop emits a span. P50 and P95 break down per stage, so you know whether STT, LLM, or TTS owns the next millisecond.
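
A sketch of per-stage spans; the stage names match the loop and the percentile is the standard nearest-rank cut:

type Stage = "stt" | "llm" | "tts";
const samples: Record<Stage, number[]> = { stt: [], llm: [], tts: [] };

// Wrap each stage call so its duration lands in the right bucket.
function span<T>(stage: Stage, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  return fn().finally(() => samples[stage].push(performance.now() - start));
}

function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx] ?? 0;
}

// percentile(samples.llm, 95) tells you whether the LLM owns the next millisecond.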

Adapter registry

Adapters live in their own packages. Add a new STT in twenty minutes by implementing the streaming-iterator interface.
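
A sketch of a registry keyed by provider name and wired from the environment; DeepgramStt and STT_PROVIDER are illustrative names:

const sttRegistry = new Map<string, () => SttAdapter>([
  ["deepgram", () => new DeepgramStt(process.env.DEEPGRAM_KEY!)],
  // A new provider is one entry here plus one package implementing SttAdapter.
]);

export function createStt(name = process.env.STT_PROVIDER ?? "deepgram"): SttAdapter {
  const factory = sttRegistry.get(name);
  if (!factory) throw new Error(`Unknown STT adapter: ${name}`);
  return factory();
}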

Containerised

Docker Compose for local development. Recommended Fly.io machine sizes and the mediasoup port range documented.

Tech stack

TypeScript · Next.js · Fastify · mediasoup · WebRTC · Opus · Zod · Vitest · Pino · Docker · Fly.io · pnpm

Architecture, in one diagram

The whole system on a single screen. Every box maps to a real folder in the repo.

┌──────────────────────────┐
│         Browser          │
│    Next.js, WebRTC PC    │
│    ▲ TTS audio in        │
│    ▼ Mic audio out       │
└──────┬────────────┬──────┘
       │            │
       │ WebRTC     │ WebRTC
       ▼            ▲
┌──────────────────────────┐
│   mediasoup SFU worker   │
│    one PC, two tracks    │
└──────┬────────────┬──────┘
       ▼            ▲
┌──────────────────────────┐
│ Fastify model-worker     │
│ ┌─────────┐ ┌─────────┐  │
│ │   STT   │ │   TTS   │  │
│ │ adapter │ │ adapter │  │
│ └────┬────┘ └─────▲───┘  │
│      ▼            │      │
│  ┌────────────────┴──┐   │
│  │   Turn manager    │   │
│  │  (state machine)  │   │
│  └───┬───────────────┘   │
│      ▼                   │
│  ┌───────────────────┐   │
│  │   LLM adapter     │   │
│  │ streaming tokens  │   │
│  └───────────────────┘   │
└──────────────────────────┘

Quick start

From clone to first request in under five minutes.

01
git clone https://github.com/sarmakska/voice-agent-starter.git
cd voice-agent-starter
02
pnpm install
cp .env.example .env  # add DEEPGRAM_KEY, OPENAI_KEY, CARTESIA_KEY
03
pnpm dev:sfu       # starts mediasoup on :3478
pnpm dev:worker    # starts model worker on :7100
pnpm dev:web       # Next.js client on :3000
04
open http://localhost:3000  # click "Start", grant mic, talk

Where it fits

The patterns this repository was built around.

Customer support agents

Front-line voice agents for inbound support. The barge-in path is essential when callers are queuing for the operator.

Tutors and coaches

Education and coaching apps where the user is doing most of the talking. The agent pauses, prompts, redirects.

Embodied product UX

In-product voice for hands-busy workflows: kitchens, garages, field engineers. The Next.js client runs in a normal browser.

Internal voice ops

Phone-style internal interfaces over WebRTC for warehouse and ops teams. Pluggable STT lets you pin a regional model for accent coverage.

Ship a voice agent that does not feel like a 2018 demo.

Clone the repo, follow the four-step quick start, ship something real.