Open Source · MIT · Real-time WebRTC

A sub-second voice agent loop, end-to-end.

Speak. The model speaks back. Cut in any time.

Voice Agent Starter is the full-duplex voice loop most teams half-finish. The browser captures audio and sends it over WebRTC; mediasoup forwards it to a Fastify worker that runs your chosen STT, LLM, and TTS adapters. The TTS audio streams back over the same WebRTC peer connection on a second track. Barge-in cancels in-flight TTS the moment the user starts talking. Built in TypeScript, with Next.js for the client.

Round-trip target: < 700 ms
Barge-in: Yes
Transport: WebRTC
Pluggable adapters: 3 (STT, LLM, TTS)
Licence: MIT

Why this exists

Voice agents are easy to demo and brutal to ship. The 1-to-1 demo on a fast Wi-Fi network sounds magical. The first time a real user is on a coach, on patchy 4G, with a five-year-old laptop, every shortcut the demo took is exposed: WebRTC negotiations stall, half-duplex audio breaks the moment someone interrupts, and the TTS finishes saying “please hold while I check” thirty seconds after the user has moved on.

The half-decent reference implementations on GitHub are written in five different languages, mix transports, and assume you have already built mediasoup workers. The polished commercial offerings handle the hard parts but cost between $600 and $5,000 a month per instance and lock the audio path inside their cloud.

Voice Agent Starter is the open-source middle. WebRTC and mediasoup so the audio path is a known quantity. Fastify so the server can be reasoned about. Pluggable STT, TTS, and LLM so you choose which provider is on the other end. Barge-in handled correctly so the agent feels alive, not hostage.

What it does

Every feature below ships in the public repository today. Clone, configure, run.

WebRTC capture

Browser captures audio at 48 kHz mono with the right Opus configuration. Voice activity detection drives the barge-in signal.
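As a rough sketch of the capture settings and the barge-in signal, the snippet below builds a constraints object of the kind a client could pass to `navigator.mediaDevices.getUserMedia`, plus a very simple energy-based voice activity check. The helper names, the processing flags, and the 0.02 threshold are assumptions for illustration; the repository may use a different configuration and a more robust VAD.

```typescript
// Hypothetical helper. Field names mirror the standard MediaTrackConstraints
// properties, declared locally so the snippet compiles outside a browser.
interface MicConstraints {
  audio: {
    sampleRate: number;
    channelCount: number;
    echoCancellation: boolean;
    noiseSuppression: boolean;
    autoGainControl: boolean;
  };
  video: false;
}

function buildMicConstraints(): MicConstraints {
  return {
    audio: {
      sampleRate: 48000,      // 48 kHz, as described above
      channelCount: 1,        // mono
      echoCancellation: true, // assumption: keeps TTS playback out of the mic
      noiseSuppression: true, // assumption
      autoGainControl: true,  // assumption
    },
    video: false,
  };
}

// Root-mean-square energy of one PCM frame (samples in the range -1..1).
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  const sumOfSquares = frame.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumOfSquares / frame.length);
}

// Simplistic stand-in for a VAD: a frame counts as speech above a fixed energy.
function isSpeech(frame: Float32Array, threshold = 0.02): boolean {
  return rms(frame) > threshold;
}
```

In a browser the constraints would be used as `getUserMedia(buildMicConstraints())`; browsers treat these values as preferences, so the actual track settings are worth checking.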

mediasoup SFU

A single mediasoup worker forwards audio to the model worker and TTS audio back to the browser. Same RTC peer connection, two tracks.

Adapter contract

STT, LLM, and TTS each implement a small TypeScript interface. Swap providers in one file. Defaults: Deepgram, OpenAI Realtime, Cartesia.
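To make the idea of a small interface concrete, here is one possible shape for the three contracts and the single place where providers are wired together. All type and member names (`SttAdapter`, `LlmAdapter`, `TtsAdapter`, `VoiceStack`) are hypothetical, and the stub adapters stand in for real providers; the repository's actual interfaces may look different.

```typescript
// Hypothetical adapter contracts, not the repository's real definitions.
interface Transcript { text: string; isFinal: boolean; confidence: number }

interface SttAdapter {
  transcribe(audio: AsyncIterable<Uint8Array>): AsyncIterable<Transcript>;
}
interface LlmAdapter {
  complete(prompt: string): AsyncIterable<string>;
}
interface TtsAdapter {
  synthesize(text: string): AsyncIterable<Uint8Array>;
}
interface VoiceStack { stt: SttAdapter; llm: LlmAdapter; tts: TtsAdapter }

// Stub STT: reports how many bytes it received as one final transcript.
const stubStt: SttAdapter = {
  async *transcribe(audio) {
    let bytes = 0;
    for await (const chunk of audio) bytes += chunk.length;
    yield { text: `${bytes} bytes heard`, isFinal: true, confidence: 1 };
  },
};

// Stub LLM: streams the prompt back one word at a time.
const echoLlm: LlmAdapter = {
  async *complete(prompt) {
    for (const word of prompt.split(" ")) yield word;
  },
};

// Stub TTS: emits a silent buffer, one byte per character of text.
const stubTts: TtsAdapter = {
  async *synthesize(text) {
    yield new Uint8Array(text.length);
  },
};

// Swapping a provider means changing one property here.
const stack: VoiceStack = { stt: stubStt, llm: echoLlm, tts: stubTts };
```

Modelling each stage as an async iterable keeps streaming explicit: a consumer can start work on the first item without waiting for the rest.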

Streaming STT

Partial transcripts arrive while the user is still talking. The LLM is invoked the moment a confident-final transcript is detected.
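One way to read "confident-final" is a transcript that the STT provider marks as final and scores above some confidence threshold. The sketch below scans a sequence of transcript events and returns the first one that qualifies. The event shape, the function name, and the 0.8 default are assumptions; real providers expose finality and confidence in their own formats.

```typescript
// Hypothetical event shape for streaming STT results.
interface TranscriptEvent {
  text: string;
  isFinal: boolean;
  confidence: number; // 0..1
}

// Returns the text of the first final, sufficiently confident, non-empty
// transcript, or null if the user is still mid-utterance.
function firstConfidentFinal(
  events: TranscriptEvent[],
  minConfidence = 0.8,
): string | null {
  for (const event of events) {
    const text = event.text.trim();
    if (event.isFinal && event.confidence >= minConfidence && text !== "") {
      return text;
    }
  }
  return null;
}

const events: TranscriptEvent[] = [
  { text: "book a", isFinal: false, confidence: 0.61 },
  { text: "book a table", isFinal: false, confidence: 0.74 },
  { text: "book a table for two", isFinal: true, confidence: 0.93 },
];

const promptText = firstConfidentFinal(events); // "book a table for two"
```

Partials are still useful before this point, for example to show live captions, but only the qualifying final would be sent to the LLM in this sketch.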

Streaming TTS

TTS audio streams back over the WebRTC track as it is generated. On the default stack, the first audible response arrives in well under one second; the round-trip target is under 700 ms.

Correct barge-in

A clean barge-in cancels in-flight TTS, drains the audio buffer, resets the model context, and sends the new turn upstream. Tested for the awkward edge cases.
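The first two steps, cancelling in-flight TTS and draining the buffer, can be sketched with the standard `AbortController`. The `SpeakingTurn` class and its method names are hypothetical; resetting the model context and sending the new turn upstream depend on the LLM adapter and are left out here.

```typescript
// Hypothetical holder for one agent utterance that is currently being spoken.
class SpeakingTurn {
  private readonly controller = new AbortController();
  private queue: Uint8Array[] = [];

  // Pass this signal to the TTS request so it can be cancelled mid-stream.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  get queuedChunks(): number {
    return this.queue.length;
  }

  // Buffers a TTS chunk for playback. Chunks that arrive after a barge-in
  // are dropped so stale audio never reaches the listener.
  enqueue(chunk: Uint8Array): boolean {
    if (this.controller.signal.aborted) return false;
    this.queue.push(chunk);
    return true;
  }

  // Called when voice activity is detected while the agent is speaking.
  // Returns how many buffered chunks were discarded.
  bargeIn(): number {
    this.controller.abort(); // step 1: cancel in-flight TTS
    const dropped = this.queue.length;
    this.queue = [];         // step 2: drain the audio buffer
    return dropped;
  }
}
```

Rejecting late chunks in `enqueue` matters because a provider may keep sending audio for a short time after the cancel request.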

Turn manager

Explicit turn state machine: listening, deciding, speaking, interrupted. Logs are readable. Bugs are findable.
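A minimal sketch of such a state machine, using the four states named above. The event names and the exact set of allowed transitions are assumptions made for illustration, not the repository's actual table.

```typescript
type TurnState = "listening" | "deciding" | "speaking" | "interrupted";

// Hypothetical event names.
type TurnEvent =
  | "final_transcript" // STT produced a confident final transcript
  | "tts_started"      // first TTS audio is on its way
  | "tts_finished"     // agent finished speaking
  | "user_speech"      // voice activity while the agent is busy
  | "cleanup_done";    // barge-in cleanup completed

// Allowed transitions; anything missing from the table is illegal.
const transitions: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  listening: { final_transcript: "deciding" },
  deciding: { tts_started: "speaking", user_speech: "interrupted" },
  speaking: { tts_finished: "listening", user_speech: "interrupted" },
  interrupted: { cleanup_done: "listening" },
};

function nextState(state: TurnState, event: TurnEvent): TurnState {
  const target = transitions[state][event];
  if (target === undefined) {
    throw new Error(`Illegal transition: "${event}" while ${state}`);
  }
  return target;
}
```

Keeping the transitions in one table is what makes the logs readable: every state change is a `(state, event)` pair that either appears in the table or fails loudly.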

Latency telemetry

Every stage of the loop emits a span. P50 and P95 break down per stage, so you know whether STT, LLM, or TTS owns the next millisecond.
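For reference, a per-stage P50 and P95 can be computed from recorded span durations roughly as follows. This sketch uses the nearest-rank percentile definition; the `Span` shape and function names are assumptions, and the repository may use a tracing library with a different aggregation.

```typescript
type Stage = "stt" | "llm" | "tts";

// Hypothetical span record: one duration per stage per turn.
interface Span { stage: Stage; ms: number }
interface StageStats { p50: number; p95: number }

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarise(spans: Span[]): Map<Stage, StageStats> {
  const byStage = new Map<Stage, number[]>();
  for (const span of spans) {
    const durations = byStage.get(span.stage) ?? [];
    durations.push(span.ms);
    byStage.set(span.stage, durations);
  }
  const stats = new Map<Stage, StageStats>();
  byStage.forEach((durations, stage) => {
    const sorted = [...durations].sort((a, b) => a - b);
    stats.set(stage, { p50: percentile(sorted, 50), p95: percentile(sorted, 95) });
  });
  return stats;
}

const stats = summarise([
  { stage: "stt", ms: 110 },
  { stage: "stt", ms: 80 },
  { stage: "stt", ms: 400 },
  { stage: "stt", ms: 90 },
  { stage: "stt", ms: 100 },
  { stage: "llm", ms: 300 },
]);
// stats.get("stt") is { p50: 100, p95: 400 }
```

A large gap between P50 and P95 for one stage, as with the STT samples here, points at tail latency in that stage and not at the loop as a whole.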

Adapter registry

Adapters live in their own packages. Add a new STT in twenty minutes by implementing the streaming-iterator interface.
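The registry idea can be illustrated with a small name-to-adapter map. The `AdapterRegistry` class, the `NamedSttAdapter` shape, and the toy `byteCountStt` adapter below are hypothetical and only show the pattern of implementing a streaming iterator and registering it under a name.

```typescript
// Hypothetical streaming STT contract: audio chunks in, transcript text out.
interface NamedSttAdapter {
  name: string;
  transcribe(audio: AsyncIterable<Uint8Array>): AsyncIterable<string>;
}

class AdapterRegistry<T extends { name: string }> {
  private readonly adapters = new Map<string, T>();

  register(adapter: T): void {
    if (this.adapters.has(adapter.name)) {
      throw new Error(`Adapter already registered: ${adapter.name}`);
    }
    this.adapters.set(adapter.name, adapter);
  }

  get(name: string): T {
    const adapter = this.adapters.get(name);
    if (adapter === undefined) throw new Error(`Unknown adapter: ${name}`);
    return adapter;
  }

  names(): string[] {
    return Array.from(this.adapters.keys());
  }
}

// A toy adapter: emits one placeholder transcript per audio chunk.
const byteCountStt: NamedSttAdapter = {
  name: "byte-count",
  async *transcribe(audio) {
    for await (const chunk of audio) {
      yield `[${chunk.length} bytes]`;
    }
  },
};

const sttRegistry = new AdapterRegistry<NamedSttAdapter>();
sttRegistry.register(byteCountStt);
```

A configuration value such as an adapter name can then select the provider at start-up with `sttRegistry.get(name)`.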

Containerised

Docker Compose for local development. Recommended Fly.io machine sizes and the mediasoup ports are documented.

Architecture, in one diagram

The whole system on a single screen. Every box maps to a real folder in the repo.

┌───────────────────────────┐
│          Browser          │
│   Next.js, WebRTC PC      │
│   ▲ TTS audio in          │
│   ▼ Mic audio out         │
└──────┬─────────────┬──────┘
       │             │
       │ WebRTC      │ WebRTC
       ▼             ▲
┌───────────────────────────┐
│   mediasoup SFU worker    │
│   one PC, two tracks      │
└──────┬─────────────┬──────┘
       ▼             ▲
┌───────────────────────────┐
│ Fastify model-worker      │
│ ┌──────────┐ ┌──────────┐ │
│ │   STT    │ │   TTS    │ │
│ │ adapter  │ │ adapter  │ │
│ └────┬─────┘ └─────▲────┘ │
│      ▼             │      │
│   ┌──────────────────┐    │
│   │   Turn manager   │    │
│   │  (state machine) │    │
│   └────────┬─────────┘    │
│            ▼              │
│   ┌────────────────────┐  │
│   │   LLM adapter      │  │
│   │   streaming tokens │  │
│   └────────────────────┘  │
└───────────────────────────┘

Quick start

From clone to first request in under five minutes.

01

git clone https://github.com/sarmakska/voice-agent-starter.git
cd voice-agent-starter

02

pnpm install
cp .env.example .env  # add DEEPGRAM_KEY, OPENAI_KEY, CARTESIA_KEY

03

pnpm dev:sfu       # starts mediasoup on :3478
pnpm dev:worker    # starts model worker on :7100
pnpm dev:web       # Next.js client on :3000

04

open http://localhost:3000  # click "Start", grant mic, talk
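Step 02 relies on three provider keys being present in `.env`. A start-up check along these lines can fail fast when one is missing. The key names come from the step above; the `missingKeys` helper is hypothetical and not necessarily how the repository validates its configuration.

```typescript
// Key names as listed in the quick start.
const REQUIRED_KEYS = ["DEEPGRAM_KEY", "OPENAI_KEY", "CARTESIA_KEY"] as const;

// Returns the required keys that are absent or blank in the given environment.
function missingKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => (env[key] ?? "").trim() === "");
}

// Example with an incomplete environment: one key blank, one absent.
const missing = missingKeys({ DEEPGRAM_KEY: "dg_example", OPENAI_KEY: "  " });
// missing is ["OPENAI_KEY", "CARTESIA_KEY"]
```

In a Node process the same function would be called as `missingKeys(process.env)` before the worker starts accepting connections.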

Where it fits

The patterns this repository was built around.

Customer support agents

Front-line voice agents for inbound support. The barge-in path is essential when callers are queuing for the operator.

Tutors and coaches

Education and coaching apps where the user is doing most of the talking. The agent pauses, prompts, redirects.

Embodied product UX

In-product voice for hands-busy workflows: kitchens, garages, field engineers. The Next.js client runs in a normal browser.

Internal voice ops

Phone-style internal interfaces over WebRTC for warehouse and ops teams. Pluggable STT lets you pin a regional model for accent coverage.

Related products

The wider Sarma Linux toolkit. Every project ships with the same opinions: open source, MIT, real depth, no marketing fluff.

SarmaLink-AI

Multi-provider AI assistant with sub-50ms failover across 36 engines.

MCP Server Toolkit

Production-ready Model Context Protocol server starter, with plugins.

Agent Orchestrator

Deterministic-replay multi-agent workflows with durable state.

AI Eval Runner

Evals as code. Datasets, scorers, traces, regressions, all in one CLI.

Local LLM Router

OpenAI-compatible proxy that routes between local Ollama and cloud LLMs.

StaffPortal

Open-source HR + ops platform built to replace three SaaS subscriptions.

RAG-over-PDF

A minimal, production-shaped RAG starter with cited streaming answers.

Receipt Scanner

Vision-OCR receipt scanning starter with Zod-typed JSON output.

Webhook-to-Email

A tiny, production-grade webhook receiver with HMAC and React Email.

Ship a voice agent that does not feel like a 2018 demo.