Open Source · MIT · Self-hosted defaults

A sub-second voice agent loop, end to end.

Speak. The model speaks back. Cut in any time. No per-minute provider fees on the default stack.

Voice Agent Starter is the full-duplex voice loop most teams half-finish. The browser captures microphone audio over WebSocket. A Fastify server runs a streaming STT, LLM, and TTS pipeline with voice activity detection. Barge-in cancels the in-flight LLM and TTS streams the moment you start speaking, and the LLM can call server-side tools mid-turn through function-call passthrough. Every layer is a pluggable adapter behind a small interface.

First audible response: ~800ms
Barge-in: Yes
Swappable adapter layers: 3x
Defaults: Self-host
Licence: MIT

Why this exists

Voice agents are easy to demo and brutal to ship. The one-to-one demo on a fast network sounds magical. The first time a real user is on a coach, on patchy 4G, with a five-year-old laptop, every shortcut a demo can take is exposed. Negotiations stall, half-duplex audio breaks the moment someone interrupts, and the TTS finishes saying “please hold while I check” thirty seconds after the user has already moved on.

The half-decent reference implementations on GitHub mix transports and assume you have already built the streaming pipeline. The polished commercial offerings handle the hard parts but cost between $600 and $5,000 a month per instance and lock the audio path inside their cloud.

Voice Agent Starter is the open-source middle. A clean state machine, a clean barge-in path, pluggable STT, LLM, and TTS so the provider on the other end is your choice, and a self-hosted default stack that runs on Groq, Whisper.cpp, and OpenTTS with no per-minute fees.

What is in the box

Every feature below ships in the public repository today. Clone, configure, run.

Browser microphone capture

AudioContext resampled to 16 kHz mono, converted to PCM16, base64-encoded, sent as JSON over WebSocket. The orchestrator does not care which transport carried the frames.
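The encoding half of that capture path can be sketched as below. This is a minimal illustration, not the repo's client code: the helper names (floatToPcm16, pcm16ToBase64, encodeAudioFrame) are hypothetical, resampling to 16 kHz is assumed to have happened already, and only the {type, payload} message shape comes from this page.

```typescript
// Hypothetical helpers; only the { type: "audio", payload } shape is from the docs.

/** Convert Float32 samples in [-1, 1] to signed 16-bit PCM (platform byte order). */
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping.
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

/** Base64-encode the raw bytes behind a PCM16 buffer. */
function pcm16ToBase64(pcm: Int16Array): string {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

/** Build the JSON text that would be sent over the WebSocket. */
function encodeAudioFrame(samples: Float32Array): string {
  return JSON.stringify({ type: "audio", payload: pcm16ToBase64(floatToPcm16(samples)) });
}
```

In a browser this would be called once per captured frame, for example socket.send(encodeAudioFrame(frame)).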

Duplex state machine

IDLE to LISTEN to THINK to SPEAK, defined in apps/server/src/pipeline/orchestrator.ts. The orchestrator owns one voice session and never blocks: every provider call is a stream consumed with for await.
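A transition table makes the loop easy to reason about. The sketch below is an assumption-laden illustration: the four state names are from this page, but the event names and the return to IDLE after a finished turn are hypothetical.

```typescript
type LoopState = "IDLE" | "LISTEN" | "THINK" | "SPEAK";

// Event names are hypothetical labels for the triggers described on this page.
type LoopEvent = "speech_start" | "final_transcript" | "first_token" | "turn_end" | "barge_in";

const transitions: Record<LoopState, Partial<Record<LoopEvent, LoopState>>> = {
  IDLE: { speech_start: "LISTEN" },
  LISTEN: { final_transcript: "THINK" },
  THINK: { first_token: "SPEAK", barge_in: "LISTEN" },
  // Assumption: a completed turn returns the machine to IDLE.
  SPEAK: { turn_end: "IDLE", barge_in: "LISTEN" },
};

/** Return the next state, or stay put when the event does not apply. */
function next(state: LoopState, event: LoopEvent): LoopState {
  return transitions[state][event] ?? state;
}
```

Ignoring events that do not apply to the current state keeps late or duplicate signals, such as a stray first token after a barge-in, from corrupting the machine.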

Real barge-in

When the VAD detects speech mid-turn, the orchestrator aborts the in-flight LLM and TTS streams through AbortController, resets the STT and TTS adapters, emits a barge-in control message, and drops to LISTEN.
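The cancellation path might look roughly like this. It is a sketch under stated assumptions: TurnScope and consume are hypothetical names, and the adapter resets and the barge-in control message are left out. AbortController and AbortSignal are the standard web and Node.js APIs.

```typescript
// Hypothetical sketch: one AbortController each for the LLM and TTS streams of a turn.
class TurnScope {
  private llm: AbortController | null = null;
  private tts: AbortController | null = null;

  /** Start a turn and hand out the signals the provider calls should honour. */
  begin(): { llm: AbortSignal; tts: AbortSignal } {
    this.llm = new AbortController();
    this.tts = new AbortController();
    return { llm: this.llm.signal, tts: this.tts.signal };
  }

  /** Called when the VAD hears speech mid-turn. Returns true if anything was cancelled. */
  bargeIn(): boolean {
    if (!this.llm || !this.tts || this.llm.signal.aborted) return false;
    this.llm.abort();
    this.tts.abort();
    return true;
  }
}

/** Consume a provider stream with for await, stopping once the signal has fired. */
async function consume(
  stream: AsyncIterable<string>,
  signal: AbortSignal,
  emit: (chunk: string) => void,
): Promise<void> {
  for await (const chunk of stream) {
    if (signal.aborted) return; // returning from the loop also closes the iterator
    emit(chunk);
  }
}
```

A real adapter would also pass the same signal to fetch so the underlying HTTP request is torn down, not just the loop that reads it.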

Streaming STT

Whisper.cpp by default with growing-window transcription. Voice frames feed the STT for live partials; trailing silence flushes the adapter for a final transcript that triggers the move to THINK.

Streaming LLM

Groq Llama 4 by default on the LPU stack. The first token flips the machine to SPEAK and is fed straight into TTS, so audio starts before the completion finishes.

Sentence-by-sentence TTS

OpenTTS Coqui XTTS v2 by default. Synthesises sentence by sentence and streams PCM chunks back to the client as base64. First audible response under one second on the self-hosted stack.
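Sentence-by-sentence synthesis needs a buffer that turns a token stream into complete sentences. The class below is a deliberately naive sketch with a hypothetical name (SentenceBuffer). It splits on ., ! or ? followed by whitespace, so abbreviations such as "Dr. Smith" would be split early. The repo's own splitter lives in apps/server/src/adapters/audio.ts and may behave differently.

```typescript
// Hypothetical, naive sentence buffer for streamed LLM text.
class SentenceBuffer {
  private buffer = "";

  /** Add streamed text and return any sentences this chunk completed. */
  push(chunk: string): string[] {
    this.buffer += chunk;
    const sentences: string[] = [];
    // Terminal punctuation followed by whitespace marks a boundary;
    // "3.14" has no whitespace after the dot, so it is left alone.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  /** Flush the remainder once the LLM stream ends. */
  end(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}
```

Each returned sentence can be handed to the TTS adapter immediately, which is what lets audio start before the completion finishes.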

Function-call passthrough

Registered tools are advertised to the LLM on every call. The shared SSE reader assembles tool_calls deltas; the orchestrator runs the matching handler, appends an assistant turn and a tool turn, and re-streams so the model finishes with grounded data.
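Assembling tool_calls deltas means stitching fragments back together by index. The sketch below assumes the OpenAI-compatible streaming shape, where each delta carries an index plus optional id, name and argument fragments. ToolCallAssembler is a hypothetical name and not the repo's reader.

```typescript
// Shape of one streamed fragment in an OpenAI-compatible chunk (delta.tool_calls[n]).
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface ToolCall {
  id: string;
  name: string;
  arguments: string; // JSON text, complete only once the stream has finished
}

// Hypothetical assembler: accumulates fragments keyed by their index.
class ToolCallAssembler {
  private calls = new Map<number, ToolCall>();

  add(deltas: ToolCallDelta[]): void {
    for (const d of deltas) {
      const call = this.calls.get(d.index) ?? { id: "", name: "", arguments: "" };
      if (d.id) call.id = d.id;
      if (d.function?.name) call.name += d.function.name;
      if (d.function?.arguments) call.arguments += d.function.arguments;
      this.calls.set(d.index, call);
    }
  }

  /** Completed calls in index order. */
  finish(): ToolCall[] {
    return Array.from(this.calls.entries())
      .sort((a, b) => a[0] - b[0])
      .map(([, call]) => call);
  }
}
```

The arguments string is only safe to JSON.parse after the stream ends, because a provider may cut it at any character.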

Pluggable adapter contract

STT, LLM, and TTS each implement a small TypeScript interface. Defaults are self-hosted; alternatives include Deepgram, OpenAI Whisper, OpenAI, SarmaLink-AI, Cartesia, and ElevenLabs.
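As a rough picture of what such a contract could look like: the method names below (feed, flush, reset, stream, end, id) are the ones this page lists for each layer, but every parameter and return type is an assumption made for illustration, and EchoLlm is a hypothetical fake adapter.

```typescript
// Method names are from the docs; all signatures below are assumptions.
interface SttAdapter {
  readonly id: string;
  feed(frame: Int16Array): void;
  flush(): Promise<string>;
  reset(): void;
}

interface LlmAdapter {
  readonly id: string;
  stream(prompt: string, signal: AbortSignal): AsyncIterable<string>;
}

interface TtsAdapter {
  readonly id: string;
  feed(text: string): void;
  stream(signal: AbortSignal): AsyncIterable<Int16Array>;
  end(): void;
  reset(): void;
}

// A fake LLM adapter that echoes the prompt word by word; needs no provider key.
class EchoLlm implements LlmAdapter {
  readonly id = "echo";

  async *stream(prompt: string, signal: AbortSignal): AsyncIterable<string> {
    for (const word of prompt.split(" ")) {
      if (signal.aborted) return;
      yield word;
    }
  }
}
```

Because the orchestrator would only see the interface, a fake like this can stand in for a real provider in tests.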

RMS voice activity detection

A simple RMS-threshold VAD in pipeline/vad.ts drives the state transitions. Clean seam for silero-vad-onnx if a heavier VAD is needed.
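An RMS-threshold VAD fits in a few lines. The following is a sketch, not the contents of pipeline/vad.ts: the threshold (0.02), the hangover of 25 frames, and the event names are all assumed values chosen for illustration.

```typescript
/** Root-mean-square level of a PCM16 frame, normalised to the range 0..1. */
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] / 32768;
    sum += s * s;
  }
  return Math.sqrt(sum / frame.length);
}

// Hypothetical VAD: speech starts on the first loud frame and ends after
// `hangoverFrames` consecutive quiet frames (the trailing silence).
class RmsVad {
  private speaking = false;
  private silentFrames = 0;
  private readonly threshold: number;
  private readonly hangoverFrames: number;

  constructor(threshold = 0.02, hangoverFrames = 25) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  process(frame: Int16Array): "speech_start" | "speech_end" | null {
    if (rms(frame) >= this.threshold) {
      this.silentFrames = 0;
      if (!this.speaking) {
        this.speaking = true;
        return "speech_start";
      }
      return null;
    }
    if (this.speaking && ++this.silentFrames >= this.hangoverFrames) {
      this.speaking = false;
      this.silentFrames = 0;
      return "speech_end";
    }
    return null;
  }
}
```

With 20ms frames, a hangover of 25 frames corresponds to half a second of trailing silence before the turn is considered finished.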

Shared SSE reader

The three OpenAI-compatible LLM adapters share apps/server/src/adapters/llm/sse.ts. Adding a fourth OpenAI-compatible provider is a handful of lines, including streamed tool_calls.
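For orientation, the core of an OpenAI-compatible SSE reader is a line buffer that picks out data: lines and stops at the [DONE] sentinel. The class below is a simplified sketch with a hypothetical name (SseParser). It is not the code in sse.ts, and it ignores SSE features such as multi-line data fields.

```typescript
// Hypothetical line-buffered parser for OpenAI-style "data: {json}" event streams.
class SseParser {
  private buffer = "";

  /** Feed decoded text; returns the JSON payloads completed by this chunk. */
  push(chunk: string): { events: unknown[]; done: boolean } {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""), so keep it for the next chunk.
    this.buffer = lines.pop() ?? "";

    const events: unknown[] = [];
    let done = false;
    for (const raw of lines) {
      const line = raw.trim(); // also strips a trailing "\r"
      if (!line.startsWith("data:")) continue;
      const data = line.slice(5).trim();
      if (data === "[DONE]") {
        done = true;
        break;
      }
      events.push(JSON.parse(data));
    }
    return { events, done };
  }
}
```

Buffering matters because network chunks can split a JSON payload anywhere; parsing only complete lines avoids half-read events.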

Bounded tool rounds

Tool rounds are bounded by maxToolRounds to guard against loops. Handler errors are returned to the model rather than crashing the session, so the agent recovers in-turn.
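The bounded loop can be sketched as follows. Only maxToolRounds and the assistant and tool turns are named on this page; runTurn, ModelStep and the other types are hypothetical, and the loop is written synchronously to keep the control flow visible, whereas the real pipeline streams.

```typescript
type ToolHandler = (args: Record<string, unknown>) => unknown;
type Turn = { role: "assistant" | "tool"; content: string };

// Hypothetical stand-in for one LLM round: either a tool request or final text.
type ModelStep = (
  history: Turn[],
) => { tool: string; args: Record<string, unknown> } | { text: string };

function runTurn(
  step: ModelStep,
  handlers: Record<string, ToolHandler>,
  maxToolRounds: number,
): { text: string; history: Turn[] } {
  const history: Turn[] = [];
  for (let round = 0; round <= maxToolRounds; round++) {
    const out = step(history);
    if ("text" in out) return { text: out.text, history };
    if (round === maxToolRounds) break; // budget spent: run no further tools

    let result: string;
    try {
      const handler = handlers[out.tool];
      if (!handler) throw new Error(`unknown tool: ${out.tool}`);
      result = JSON.stringify(handler(out.args));
    } catch (err) {
      // Errors go back to the model as a tool result instead of ending the session.
      result = JSON.stringify({ error: err instanceof Error ? err.message : String(err) });
    }
    history.push({ role: "assistant", content: JSON.stringify(out) });
    history.push({ role: "tool", content: result });
  }
  return { text: "", history };
}
```

With maxToolRounds set to 2, a model that keeps asking for tools gets two executions and then the loop stops. What the agent says at that point is left open here.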

No-keys offline mode

The state machine, barge-in, and tool calls all run without provider keys. Set keys or point the self-hosted URLs at running servers to get real transcripts and audio.

Architecture, the duplex loop

One Orchestrator per voice session. Created on the /voice WebSocket open, disposed on close. Every box maps to a real file in apps/server/src.

Full-duplex voice loop. The orchestrator owns one voice session and never blocks. Coloured nodes are pluggable provider adapters.
apps/server/src/index.ts

Fastify server, /health and /voice WebSocket, message dispatch.

apps/server/src/pipeline/orchestrator.ts

Duplex state machine, barge-in, function-call passthrough.

apps/server/src/pipeline/tools.ts

Tool registry and default tools (get_time, add_numbers).

apps/server/src/pipeline/vad.ts

RMS-threshold voice activity detection.

apps/server/src/adapters/audio.ts

PCM and WAV conversion, sentence splitting.

apps/server/src/adapters/llm/sse.ts

Shared OpenAI-compatible SSE reader and wire-format mapping.

apps/server/src/adapters/{stt,llm,tts}/*.ts

Provider adapters and registries selected by env var.

apps/web/app/page.tsx

Next.js browser client with microphone capture.

State machine: the duplex loop

One Orchestrator per voice session, created on the /voice WebSocket open and disposed on close. Implemented in apps/server/src/pipeline/orchestrator.ts.

IDLE, LISTEN, THINK, SPEAK. Barge-in cancels the in-flight LLM and TTS streams and drops the machine back to LISTEN.

Latency budget

Stage-by-stage targets on the self-hosted default stack. The shape is what matters; absolute numbers depend on hardware and network.

| Stage | P50 target | Notes |
| --- | --- | --- |
| Mic to VAD | 30ms | RMS VAD on PCM frames |
| STT first partial | 250ms | Whisper.cpp growing-window transcription |
| LLM first token | 250ms | Groq Llama 4 on the LPU stack |
| TTS first audio chunk | 250ms | OpenTTS XTTS v2, first sentence |
| Total user-perceived | ~800ms | First audible response, fully self-hosted |
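The headline figure is the stage targets added up, assuming the stages run strictly one after another with no overlap. The key names below are hypothetical; the numbers are the targets listed above.

```typescript
// Stage targets in milliseconds, copied from the latency budget.
const budgetMs = {
  micToVad: 30,
  sttFirstPartial: 250,
  llmFirstToken: 250,
  ttsFirstAudioChunk: 250,
};

// Sequential sum: 30 + 250 + 250 + 250 = 780, which the page rounds to ~800ms.
const totalMs = Object.values(budgetMs).reduce((sum, ms) => sum + ms, 0);
```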

Quick start

Five steps from clone to a microphone-on browser tab. Commands taken straight from the README.

01 Clone the monorepo
git clone https://github.com/sarmakska/voice-agent-starter.git
cd voice-agent-starter

02 Install with pnpm
pnpm install
cp .env.example .env   # add GROQ_API_KEY etc., or leave blank for offline mode

03 Run the Fastify server and the web client
pnpm dev

04 Open the client, talk
# http://localhost:3000
# Click Start, grant microphone access.
# The web client connects to :3001 over WebSocket and streams PCM frames.

05 Sanity check
curl http://localhost:3001/health
# reports the active providers selected by STT_PROVIDER, LLM_PROVIDER, TTS_PROVIDER

Under the hood

The state machine, the adapter contract, the tool-call path, and the audio frame shape. Real snippets from the repo and the architecture notes.

Pluggable adapter contract (TypeScript)
// STT adapters implement: feed, flush, reset, id
// LLM adapters implement: stream, id
// TTS adapters implement: feed, stream, end, reset, id

// Each layer is one TS file behind a small interface,
// selected by an env var through a registry.

// STT_PROVIDER=whispercpp | deepgram | whisper
// LLM_PROVIDER=groq       | sarmalink | openai
// TTS_PROVIDER=opentts    | cartesia | elevenlabs

// Adding an OpenAI-compatible LLM adapter is a handful of lines
// because all three OpenAI-compatible adapters share
// apps/server/src/adapters/llm/sse.ts
Function-call passthrough (TypeScript)
// The orchestrator advertises the registered tool definitions to the
// model on every LLM call. The shared SSE reader assembles fragmented
// tool_calls deltas into complete calls.

// When the model requests a tool:
//   1. orchestrator runs the matching handler from ToolRegistry
//   2. appends an "assistant" turn recording the request
//   3. appends a "tool" turn carrying the result
//   4. re-streams so the model finishes with grounded data

// Bounded by maxToolRounds. Handler errors are returned to the model
// as a tool result rather than crashing the session.
Audio frame shape (JSON)
// Browser to server, 20ms PCM16 frames, base64 encoded.
{ "type": "audio", "payload": "<base64 PCM16>" }

// Server to browser, sentence-by-sentence TTS chunks.
{ "type": "tts.chunk", "payload": "<base64 PCM16>" }
{ "type": "barge-in" }
{ "type": "turn.end" }
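On the receiving side, those messages could be typed and decoded roughly as follows. The message shapes are the ones shown above; parseClientMessage and decodePcm16 are hypothetical helper names, and the validation is deliberately minimal. At 16 kHz a 20ms frame is 320 samples, or 640 bytes before base64.

```typescript
type ClientMessage = { type: "audio"; payload: string };

type ServerMessage =
  | { type: "tts.chunk"; payload: string }
  | { type: "barge-in" }
  | { type: "turn.end" };

/** Parse and validate an incoming WebSocket text frame; null means "ignore it". */
function parseClientMessage(raw: string): ClientMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const msg = value as { type?: unknown; payload?: unknown };
  if (msg.type === "audio" && typeof msg.payload === "string") {
    return { type: "audio", payload: msg.payload };
  }
  return null;
}

/** Decode a base64 payload back into PCM16 samples. */
function decodePcm16(payload: string): Int16Array {
  const binary = atob(payload);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  // Drop a trailing odd byte rather than reading past the buffer.
  return new Int16Array(bytes.buffer, 0, Math.floor(bytes.length / 2));
}

// Example of an outgoing control message, checked against the union above.
const bargeInMessage: ServerMessage = { type: "barge-in" };
```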

Configuration, one env per layer

Three env vars choose the STT, LLM, and TTS providers. Everything else is keys and URLs for the providers you picked.

| Env var | Purpose | Default |
| --- | --- | --- |
| STT_PROVIDER | whispercpp, deepgram, or whisper | whispercpp |
| LLM_PROVIDER | groq, sarmalink, or openai | groq |
| TTS_PROVIDER | opentts, cartesia, or elevenlabs | opentts |
| GROQ_API_KEY | Key for the Groq Llama 4 LLM adapter | unset |
| WHISPERCPP_URL | whisper-server endpoint for STT | http://localhost:8090 |
| OPENTTS_URL | OpenTTS server endpoint for TTS | http://localhost:5500 |
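Resolving a provider from the environment could be as small as the sketch below. The variable names, allowed values and defaults are the ones in the table. resolveProvider is a hypothetical helper, and throwing on an unknown value is an assumption about how a registry might behave, not a description of the repo.

```typescript
// Allowed ids and defaults copied from the configuration table.
const PROVIDERS = {
  STT_PROVIDER: { allowed: ["whispercpp", "deepgram", "whisper"], fallback: "whispercpp" },
  LLM_PROVIDER: { allowed: ["groq", "sarmalink", "openai"], fallback: "groq" },
  TTS_PROVIDER: { allowed: ["opentts", "cartesia", "elevenlabs"], fallback: "opentts" },
} as const;

type ProviderVar = keyof typeof PROVIDERS;

/** Pick the provider id for one layer, falling back to the documented default. */
function resolveProvider(name: ProviderVar, env: Record<string, string | undefined>): string {
  const { allowed, fallback } = PROVIDERS[name];
  const value = env[name]?.trim();
  if (!value) return fallback;
  // Assumption: an unrecognised id is a configuration error worth failing on.
  if (!(allowed as readonly string[]).includes(value)) {
    throw new Error(`${name}=${value} is not one of: ${allowed.join(", ")}`);
  }
  return value;
}
```

On the server this would be called with process.env, for example resolveProvider("LLM_PROVIDER", process.env).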

Where it fits

The patterns this repository was built around, and the ones it deliberately is not.

Customer support voice agents

Front-line agents for inbound support. Barge-in is essential the moment a customer wants to redirect mid-turn; without it the agent talks over the caller.

Tutors and coaches

Education and coaching apps where the human does most of the talking. The agent prompts, redirects, and pauses without speaking over the learner.

Hands-busy product UX

Voice for kitchens, garages, field engineers. The Next.js client runs in any normal browser; no native app required.

Internal voice ops

Warehouse and operations teams over WebSocket. Pluggable STT lets you pin a regional model for accent coverage without changing the pipeline.

Vendor A/B without rewrites

Swap Deepgram in for Whisper.cpp, Cartesia for OpenTTS, OpenAI for Groq, by changing one env var. The orchestrator is unchanged.

When NOT to reach for it

A finished consumer product, or a one-shot push-to-talk transcription tool. The full-duplex machinery is overhead you would not need.

Tech stack

TypeScript · Node.js 22 · Next.js 15 · Fastify 5 · WebSocket · PCM16 / 16 kHz · Whisper.cpp · OpenTTS Coqui XTTS v2 · Groq Llama 4 · AbortController · Vitest · pnpm

Compared to the alternatives

Hosted voice-agent platforms, closed vendor stacks, and rolling your own. Three honest comparisons.

| Feature | Voice Agent Starter | Hosted platform | Closed vendor | DIY |
| --- | --- | --- | --- | --- |
| Full-duplex with barge-in | Yes | Yes | Yes | You build it |
| Self-hostable end to end | Yes | Hosted only | Hosted only | Yes |
| Pluggable STT / LLM / TTS | Yes, per layer | Partial | Locked stack | You write it |
| Per-minute fees | £0 on self-hosted defaults | Per minute | Per minute | Your provider bills |
| Function-call passthrough | Yes, bounded rounds | Yes | Yes | You write it |
| Source code | MIT, all of it | Closed | Closed | Yours |

Frequently asked

Eight real questions from teams that have shipped this.

Why WebSocket rather than WebRTC?

The orchestrator is transport-agnostic. Terminating over mediasoup or LiveKit is a swap at the edge without touching the pipeline. The starter keeps the transport simple on purpose; the engineering value is in the duplex machinery, not the negotiation layer.

What does barge-in actually cancel?

Both the in-flight LLM and TTS streams, through their respective AbortControllers. The abort signal propagates through the fetch body reader and the for await loops, so there are no orphaned streams talking over the user. STT and TTS adapters are reset, a barge-in control message is emitted, and the machine drops to LISTEN.

Does it really run with no provider keys?

Yes for the state machine, barge-in, and tool calls. To get real transcripts and audio you set GROQ_API_KEY or point WHISPERCPP_URL and OPENTTS_URL at running servers. The test suite drives the full pipeline through fake adapters and a fake socket.

How do I add an OpenAI-compatible LLM provider?

A handful of lines. Implement the LLM interface (stream, id) and reuse apps/server/src/adapters/llm/sse.ts for streaming and tool_calls assembly. Register it and set LLM_PROVIDER to the new id.

How do tools get registered?

In apps/server/src/pipeline/tools.ts. The starter ships with get_time and add_numbers as worked examples. Your tool advertises a JSON Schema; the orchestrator advertises every registered tool to the model on every call.
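To make that concrete, a registry along these lines would cover registration, advertising and dispatch. ToolRegistry and add_numbers are names from this page. The Tool interface, the method names and the a and b parameters are assumptions, and the advertised shape follows the OpenAI-style tools array.

```typescript
interface Tool {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema describing the arguments
  handler: (args: Record<string, unknown>) => unknown;
}

// Hypothetical registry; the repo's ToolRegistry may differ.
class ToolRegistry {
  private tools = new Map<string, Tool>();

  register(tool: Tool): void {
    if (this.tools.has(tool.name)) throw new Error(`duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  /** Definitions to advertise to the model on each call. */
  definitions() {
    return Array.from(this.tools.values()).map((t) => ({
      type: "function" as const,
      function: { name: t.name, description: t.description, parameters: t.parameters },
    }));
  }

  run(name: string, args: Record<string, unknown>): unknown {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    return tool.handler(args);
  }
}

const registry = new ToolRegistry();
registry.register({
  name: "add_numbers",
  description: "Add two numbers.",
  parameters: {
    type: "object",
    properties: { a: { type: "number" }, b: { type: "number" } },
    required: ["a", "b"],
  },
  handler: (args) => Number(args.a) + Number(args.b),
});
```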

Why send audio over JSON?

Because the orchestrator is the design surface, not the wire format. A plain {type, payload} message is the simplest thing to swap for a binary frame, a WebRTC DataChannel, or a media-server track without changing the pipeline.

Can I run the LLM on SarmaLink-AI?

Yes. Set LLM_PROVIDER=sarmalink and the adapter calls into the SarmaLink-AI failover stack, giving you 36-engine routing under the same voice loop. The shared SSE reader handles the wire format.

What about word-level interim STT?

The default Whisper.cpp adapter surfaces window-level partials, which is enough for the barge-in path. If you need finer granularity, wire the Deepgram streaming SDK and set STT_PROVIDER=deepgram. The interface is the same.

Related products

The rest of the Sarma Linux toolkit. Same opinions throughout: open source, MIT, real depth.

SarmaLink-AI

Multi-provider AI backend with sub-50ms failover across 36 engines.

MCP Server Toolkit

Production-ready Model Context Protocol server starter, with plugins.

Agent Orchestrator

Deterministic-replay multi-agent workflows with durable state.

AI Eval Runner

Evals as code. Datasets, scorers, traces, regressions, all in one CLI.

Local LLM Router

OpenAI-compatible proxy that routes between local Ollama and cloud LLMs.

StaffPortal

Open-source HR + ops platform built to replace three SaaS subscriptions.

RAG-over-PDF

A minimal, production-shaped RAG starter with cited streaming answers.

Receipt Scanner

Vision-OCR receipt scanning starter with Zod-typed JSON output.

Webhook-to-Email

A tiny, production-grade webhook receiver with HMAC and React Email.

k8s-ops-toolkit

Helm chart for Next.js + bootstrap script for the full observability stack.

terraform-stack

Vercel + Supabase + Cloudflare + DigitalOcean as one Terraform repo.

slipstream

Claude Code plugin v1.0: React dashboard with code graph, cross-tab agent bus, ~95% per-read savings, 75 skills.

forge-infer

Minimal LLM inference server with paged KV-cache and speculative decoding.

shipyard

Multi-tenant SaaS starter with isolation, RBAC, billing, audit and rate limits.

lsmdb

Log-structured merge-tree storage engine in Go with WAL and MVCC snapshots.

raftkv

Raft key-value store with a fault-injection harness that proves linearizability.

sandboxd

WebAssembly sandbox for running untrusted code under strict CPU and memory limits.

Ship a voice agent that does not feel like a 2018 demo.

Clone the repo, run pnpm dev, talk into your microphone, ship.
