Voice agent with sub-second round trip

The architecture I run for a natural-feeling voice agent: mediasoup for WebRTC, Whisper for STT, Groq for streamed LLM, OpenTTS for synthesis. Total round trip under 900 ms on a £6 VPS, no Apple/OpenAI accounts required. The full stack lives at github.com/sarmakska/voice-agent-starter.

The latency budget

Voice agents live or die on round trip latency. Humans tolerate roughly a second of silence before a conversation starts to feel broken; cross 1.5 seconds and the caller assumes the line dropped. The target I work to is 900 ms from the last syllable the user spoke to the first syllable they hear back. That number is not a guess. It is the budget you arrive at when you decompose every hop on the path.

Here is how I slice it. STT first partial token, 80 ms. LLM first token, 250 ms. TTS first audio chunk, 250 ms. Network round trip on a UK to Frankfurt path, 150 ms total. Client side decode and jitter buffer, 100 ms. Add 70 ms of slack for orchestration overhead and you land at 900 ms. Miss any one number by more than 50 ms and the whole thing feels sluggish.
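
A budget like this is easy to keep honest in code. The sketch below is illustrative only: the stage names and the `over_budget` helper are hypothetical, while the numbers are the ones listed above. It checks that the slices sum to the target and flags any stage whose measured latency overshoots its allocation by more than 50 ms.

```python
# Per-stage latency budget in milliseconds, as listed in the text.
BUDGET_MS = {
    'stt_first_partial': 80,
    'llm_first_token': 250,
    'tts_first_chunk': 250,
    'network_round_trip': 150,
    'client_decode_and_jitter': 100,
    'orchestration_slack': 70,
}
TARGET_MS = 900


def over_budget(measured_ms: dict, tolerance_ms: float = 50) -> list:
    """Return the stages whose measured latency exceeds budget + tolerance."""
    return [
        stage
        for stage, budget in BUDGET_MS.items()
        if measured_ms.get(stage, 0) - budget > tolerance_ms
    ]


# The slices must add up to the end-to-end target.
assert sum(BUDGET_MS.values()) == TARGET_MS
```

Feeding it per-call measurements from your own tracing gives a quick list of the hops that need attention.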

The single biggest lever is choosing components that stream rather than batch. A 200 ms STT model that returns the full transcript only on silence is worse than an 800 ms model that emits partial tokens every 80 ms, because the LLM can start generating against the partial. Same for the LLM, same for the TTS. Streaming everywhere or nowhere.

Latency is not a metric you optimise after the fact. It is a budget you allocate before you write a line of code, and every component either fits inside it or gets cut.

Architecture in words

Three processes, four wires. The browser opens a WebRTC peer connection to a mediasoup SFU running in Node. The SFU forwards the inbound mic track to a Python orchestrator over a Unix socket, and forwards the orchestrator's outbound TTS track back to the browser. The orchestrator is the brain. It runs whisper.cpp for STT, calls Groq for the LLM, calls OpenTTS for synthesis, and stitches the three streams together.

People ask why an SFU and not raw peer to peer. Three reasons. First, echo cancellation in Chrome only kicks in when the inbound audio arrives over the same peer connection as the outbound, so you need a server endpoint anyway. Second, NAT traversal succeeds for around 85 percent of UK home networks on raw P2P; an SFU with a TURN fallback gets you to 99 percent. Third, the SFU is where you tap the stream for recording, transcripts, and per call audit logs without bolting it on later.

The orchestrator is deliberately a single Python process per call. One event loop, not a fleet of processes. STT inference runs on a background thread feeding a partials queue, the main task consumes partials and drives the LLM, and TTS runs as its own task. A bounded asyncio.Queue between each pair of stages keeps the whole thing back pressured without explicit locks.
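
The orchestrator snippets in this article pass asyncio.Queue objects between stages. As a minimal sketch of the back pressure idea, assuming a bounded queue (the `maxsize` value and the function names here are hypothetical), a fast producer is suspended on `put()` until the slow consumer catches up:

```python
import asyncio


async def producer(out: asyncio.Queue, items, high_water: list):
    for item in items:
        await out.put(item)  # suspends here while the queue is full
        high_water.append(out.qsize())
    await out.put(None)  # sentinel: end of stream


async def consumer(inp: asyncio.Queue, received: list):
    while (item := await inp.get()) is not None:
        await asyncio.sleep(0.01)  # stand-in for slow downstream work
        received.append(item)


async def run_pipeline(items, maxsize: int = 2):
    q = asyncio.Queue(maxsize=maxsize)
    high_water, received = [], []
    await asyncio.gather(producer(q, items, high_water), consumer(q, received))
    return received, max(high_water)
```

An unbounded queue would accept every item immediately and hide a slow stage until memory or latency gives it away; the bound is what turns the queue into a brake.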

WebRTC with mediasoup

mediasoup is a C++ SFU with a Node API. It runs in production behind a good number of serious WebRTC products. Two producers and two consumers per call, mic in and TTS out. The browser code is a hundred lines.

ts
// browser/voice-client.ts
import { Device } from 'mediasoup-client'

const ws = new WebSocket('wss://voice.example.com/rtc')
const device = new Device()

ws.onmessage = async (ev) => {
  const msg = JSON.parse(ev.data)
  if (msg.type === 'routerRtpCapabilities') {
    await device.load({ routerRtpCapabilities: msg.data })
    ws.send(JSON.stringify({ type: 'createSendTransport' }))
  }
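  if (msg.type === 'sendTransportCreated') {
    const transport = device.createSendTransport(msg.data)
    transport.on('connect', ({ dtlsParameters }, ok) => {
      ws.send(JSON.stringify({ type: 'connectTransport', dtlsParameters }))
      ok()
    })
    // mediasoup-client will not resolve produce() without a 'produce' handler
    // that reports the server-side producer id. The 'produce' / 'produced'
    // message names are assumptions about the signalling protocol.
    transport.on('produce', ({ kind, rtpParameters }, ok) => {
      const onReply = (e: MessageEvent) => {
        const reply = JSON.parse(e.data)
        if (reply.type !== 'produced') return
        ws.removeEventListener('message', onReply)
        ok({ id: reply.id })
      }
      ws.addEventListener('message', onReply)
      ws.send(JSON.stringify({ type: 'produce', kind, rtpParameters }))
    })
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
        autoGainControl: true,
        sampleRate: 48000,
      },
    })
    await transport.produce({ track: stream.getAudioTracks()[0] })
  }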
}

On the server, a single Fastify route handles the signalling websocket. The router is created once per worker and reused across calls.

ts
// server/sfu.ts
import * as mediasoup from 'mediasoup'
import Fastify from 'fastify'
import websocket from '@fastify/websocket'

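// Pin media ports to the range opened on the firewall; mediasoup's default is 10000-59999.
const worker = await mediasoup.createWorker({
  logLevel: 'warn',
  rtcMinPort: 40000,
  rtcMaxPort: 49999,
})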
const router = await worker.createRouter({
  mediaCodecs: [
    {
      kind: 'audio',
      mimeType: 'audio/opus',
      clockRate: 48000,
      channels: 2,
      parameters: { useinbandfec: 1, usedtx: 1, minptime: 10 },
    },
  ],
})

const app = Fastify()
await app.register(websocket)

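app.get('/rtc', { websocket: true }, async (conn) => {
  const send = (msg: unknown) => conn.socket.send(JSON.stringify(msg))
  send({ type: 'routerRtpCapabilities', data: router.rtpCapabilities })
  // One send transport per signalling connection, created on request.
  let transport: mediasoup.types.WebRtcTransport | undefined

  conn.socket.on('message', async (raw) => {
    const msg = JSON.parse(raw.toString())
    if (msg.type === 'createSendTransport') {
      transport = await router.createWebRtcTransport({
        listenIps: [{ ip: '0.0.0.0', announcedIp: process.env.PUBLIC_IP }],
        enableUdp: true,
        enableTcp: true,
        preferUdp: true,
      })
      send({ type: 'sendTransportCreated', data: {
        id: transport.id,
        iceParameters: transport.iceParameters,
        iceCandidates: transport.iceCandidates,
        dtlsParameters: transport.dtlsParameters,
      }})
    }
    // Complete the DTLS handshake the browser starts in its 'connect' handler.
    if (msg.type === 'connectTransport' && transport) {
      await transport.connect({ dtlsParameters: msg.dtlsParameters })
    }
    // Create the mic producer; 'produce' / 'produced' are assumed message names.
    if (msg.type === 'produce' && transport) {
      const producer = await transport.produce({
        kind: msg.kind,
        rtpParameters: msg.rtpParameters,
      })
      send({ type: 'produced', id: producer.id })
    }
  })
})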

await app.listen({ port: 3001, host: '0.0.0.0' })

DTX (discontinuous transmission) and FEC on Opus are not optional. DTX kills the upstream during silence, cutting bandwidth by 60 percent on a typical call. FEC recovers the first lost packet in any burst, which on a noisy mobile network is the difference between intelligible and unintelligible.

Streaming STT with whisper.cpp

whisper.cpp is the only STT I trust for low latency self hosted work. It compiles to a single binary, accepts 16 kHz PCM on stdin, and emits JSON segments on stdout. On a £6 VPS without a GPU, the small.en model runs at a real-time factor of roughly 0.3 (one second of audio takes about 0.3 seconds to transcribe), which is fast enough for a single concurrent call. With a 4 GB GPU box at £20 a month, the medium.en model runs about 4x faster than realtime and gives you four concurrent calls.

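The trick to first token latency is voice activity detection. Feed whisper audio only once speech has been detected. WebRTC's native VAD is too aggressive; I use Silero VAD, which is a 2 MB ONNX model that runs on CPU. Current Silero releases score fixed 512 sample windows (32 ms at 16 kHz), so the 20 ms WebRTC frames are re-chunked before they reach the loop below.

python
# orchestrator/stt.py
import asyncio
from collections import deque

import numpy as np
import torch
from silero_vad import load_silero_vad
from whispercpp import Whisper

SAMPLE_RATE = 16000
WINDOW = 512                              # Silero VAD window at 16 kHz
WINDOW_MS = WINDOW * 1000 // SAMPLE_RATE  # 32 ms
PARTIAL_EVERY = 16                        # speech windows per partial, ~500 ms

vad = load_silero_vad()
whisper = Whisper.from_pretrained('small.en')

def transcribe(buffer: deque) -> str:
    # whisper.cpp expects float32 samples in [-1, 1]
    return whisper.transcribe(np.array(buffer, dtype=np.float32))

async def stt_loop(audio_in: asyncio.Queue, partials_out: asyncio.Queue):
    buffer = deque(maxlen=SAMPLE_RATE * 30)  # 30 seconds at 16 kHz
    silence_ms = 0
    speech_windows = 0
    speaking = False

    while True:
        # assumed input: 512 float32 samples (32 ms) as a NumPy array
        chunk = await audio_in.get()
        is_speech = vad(torch.from_numpy(chunk), SAMPLE_RATE).item() > 0.5

        if is_speech:
            buffer.extend(chunk)
            silence_ms = 0
            speaking = True
            speech_windows += 1
            if speech_windows % PARTIAL_EVERY == 0:  # every ~500 ms, emit a partial
                # inference blocks, so run it off the event loop
                text = await asyncio.to_thread(transcribe, buffer)
                await partials_out.put({'text': text, 'final': False})
        elif speaking:
            silence_ms += WINDOW_MS
            if silence_ms > 400:  # finalise on 400 ms of silence
                text = await asyncio.to_thread(transcribe, buffer)
                await partials_out.put({'text': text, 'final': True})
                buffer.clear()
                speaking = False
                silence_ms = 0
                speech_windows = 0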

400 ms of trailing silence is the sweet spot. Drop it to 200 ms and the agent cuts the user off mid sentence on natural pauses. Raise it to 600 ms and the conversation feels laggy. Tune this number per use case; a customer support bot can tolerate 600 ms, a games NPC cannot.
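
To see why the threshold matters, here is a small sketch that replays a sequence of per-frame VAD decisions and reports where an utterance would be finalised. The function name and the synthetic input are hypothetical; the logic mirrors the trailing-silence rule described above, with the frame length passed in as a parameter.

```python
def finalise_points(speech_flags, frame_ms: int, trailing_silence_ms: int) -> list:
    """Return the frame indexes at which an utterance would be finalised."""
    points = []
    silence = 0
    speaking = False
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            speaking = True
            silence = 0
        elif speaking:
            silence += frame_ms
            if silence > trailing_silence_ms:
                points.append(i)
                speaking = False
                silence = 0
    return points


# 500 ms of speech, a 300 ms natural pause, 500 ms of speech, then 1 s of silence,
# expressed as 20 ms frames.
flags = [True] * 25 + [False] * 15 + [True] * 25 + [False] * 50
```

With a 200 ms threshold the 300 ms pause splits the sentence into two utterances; with 400 ms the pause is ridden out and the sentence is finalised once.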

Streaming LLM via Groq

Groq's LPU hardware is the only hosted option I have found that consistently hits sub 300 ms first token. Llama 3.1 70B runs at 600+ tokens per second on their hardware, which is roughly 6x faster than the next best cloud option. For a voice agent, first token latency matters more than total throughput, because TTS starts speaking the first sentence before the LLM has finished the second.

The orchestrator consumes the LLM stream a sentence at a time, not a token at a time. Sentence boundaries are the natural cut point for TTS handoff; smaller chunks cause audible joins, larger chunks delay the first audio.

python
# orchestrator/llm.py
import re
from groq import AsyncGroq

client = AsyncGroq()
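# Require whitespace after the punctuation: matching a bare end-of-buffer would
# split "3.14" or "e.g." whenever a chunk happens to end on the dot. The final
# flush after the stream ends picks up the last sentence.
SENTENCE_END = re.compile(r'[.!?]\s+')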

async def llm_stream(messages, sentences_out, cancel_event):
    stream = await client.chat.completions.create(
        model='llama-3.1-70b-versatile',
        messages=messages,
        stream=True,
        temperature=0.7,
        max_tokens=400,
    )
    buf = ''
    async for chunk in stream:
        if cancel_event.is_set():
            await stream.close()
            return
        delta = chunk.choices[0].delta.content or ''
        buf += delta
        while True:
            match = SENTENCE_END.search(buf)
            if not match:
                break
            sentence = buf[:match.end()].strip()
            buf = buf[match.end():]
            if sentence:
                await sentences_out.put(sentence)
    if buf.strip():
        await sentences_out.put(buf.strip())

The cancel_event is the hook for barge-in. When the user starts speaking over the bot, the main loop sets the event, the LLM stream closes mid generation, and the in flight tokens are discarded. The cost of a cancelled response on Groq is a fraction of a penny, which makes aggressive cancellation cheap.
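
The sentence-at-a-time handoff is easier to reason about without the network in the way. The sketch below is a synchronous stand-in, not the orchestrator code: `split_sentences` is a hypothetical helper that takes token deltas as a plain list and uses a boundary pattern that waits for whitespace after the punctuation, so a delta ending on "3." is not cut before "14" arrives.

```python
import re

# Sentence boundary: terminal punctuation followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r'[.!?]\s+')


def split_sentences(deltas) -> list:
    """Accumulate streamed text deltas and return the completed sentences."""
    sentences = []
    buf = ''
    for delta in deltas:
        buf += delta
        while (match := SENTENCE_BOUNDARY.search(buf)) is not None:
            sentence = buf[:match.end()].strip()
            buf = buf[match.end():]
            if sentence:
                sentences.append(sentence)
    if buf.strip():
        sentences.append(buf.strip())  # flush whatever remains at end of stream
    return sentences


deltas = ['Hel', 'lo there', '. How', ' are you', '? Pi is 3.', '14 today']
print(split_sentences(deltas))
```

The trade-off is that the last sentence of a reply is only released when the stream ends, which costs nothing in practice because the end-of-stream flush follows immediately.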

Streaming TTS with OpenTTS

OpenTTS wraps Coqui, Larynx, MaryTTS, and eSpeak NG behind a single HTTP API. For production I use Coqui with the Glow-TTS model, which produces natural sounding speech at a real-time factor of roughly 0.1 on CPU (a two second sentence synthesises in about 200 ms). For prototypes I drop to eSpeak NG, which is robotic but synthesises in 5 ms per sentence, useful when iterating on the orchestration logic.

Each sentence from the LLM becomes one TTS request. The resulting PCM is pushed into the outbound mediasoup producer as it arrives. Playback on the client begins as soon as the first sentence's audio crosses the wire.

python
# orchestrator/tts.py
import asyncio
import io
import wave

import httpx

OPENTTS = 'http://127.0.0.1:5500'

async def tts_loop(sentences_in: asyncio.Queue, pcm_out: asyncio.Queue, cancel):
    async with httpx.AsyncClient(timeout=10) as http:
        while True:
            sentence = await sentences_in.get()
            if cancel.is_set():
                continue  # drop sentences queued before a barge-in
            r = await http.get(f'{OPENTTS}/api/tts', params={
                'voice': 'coqui-tts:en_ljspeech',
                'text': sentence,
                'cache': 'false',
            })
            r.raise_for_status()
            # OpenTTS replies with a WAV file; strip the header before framing
            with wave.open(io.BytesIO(r.content)) as wav:
                bytes_per_second = (
                    wav.getframerate() * wav.getsampwidth() * wav.getnchannels()
                )
                pcm = wav.readframes(wav.getnframes())  # 22050 Hz mono PCM
            # split into 20 ms frames for mediasoup
            frame_size = bytes_per_second // 50
            for i in range(0, len(pcm), frame_size):
                if cancel.is_set():
                    break
                await pcm_out.put(pcm[i:i + frame_size])

The TTS output is 22050 Hz; mediasoup wants 48000 Hz Opus. Resampling happens in a tiny C extension I bind from Python, because doing it in pure Python costs 30 ms a sentence and that is 30 ms you cannot afford.
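
For reference, this is what the resampling step has to do, written as a deliberately slow pure Python sketch. It is not the C extension mentioned above, and it uses plain linear interpolation rather than a band-limited converter such as libsamplerate; the function name is hypothetical. A 20 ms frame of 441 samples at 22050 Hz becomes 960 samples at 48000 Hz.

```python
from array import array


def resample_linear(samples: array, src_rate: int, dst_rate: int) -> array:
    """Resample 16-bit mono PCM by linear interpolation between neighbours."""
    n_out = len(samples) * dst_rate // src_rate
    out = array('h')
    for i in range(n_out):
        pos = i * src_rate / dst_rate            # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


frame_22k = array('h', [1000] * 441)  # 20 ms at 22050 Hz
frame_48k = resample_linear(frame_22k, 22050, 48000)
```

The per-sample Python loop is exactly the cost the text warns about, which is why the production path does this in native code.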

Barge-in handling

Barge-in is the feature that separates a toy from a product. The user must be able to interrupt the bot mid sentence and have the bot stop talking, throw away its current thought, and listen to the new input. The state machine is a few lines.

python
# orchestrator/state.py
import asyncio

class CallState:
    def __init__(self):
        self.cancel = asyncio.Event()
        self.bot_speaking = False

    def on_user_speech_start(self):
        if self.bot_speaking:
            self.cancel.set()  # kills LLM and TTS in flight
            self.bot_speaking = False

    def on_bot_turn_start(self):
        self.cancel.clear()
        self.bot_speaking = True

The subtle bit is the audio that has already been buffered on the client side. When you cancel TTS server side, the browser still has 100 to 300 ms of audio queued in the Web Audio graph. You need to call audioContext.suspend() on the playback context and drop any queued buffers the moment the user starts speaking, otherwise the bot keeps talking for a third of a second after it should have stopped.
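
The same concern applies to anything still sitting in the server side queues when the cancel fires. The code in this article does not show that step, so treat the following as an assumption about how one might handle it: a hypothetical `drain` helper that empties an asyncio.Queue without blocking, to be called on the sentence and PCM queues before the next bot turn clears the cancel event.

```python
import asyncio


def drain(q: asyncio.Queue) -> int:
    """Discard everything currently queued; return how many items were dropped."""
    dropped = 0
    while True:
        try:
            q.get_nowait()
        except asyncio.QueueEmpty:
            return dropped
        dropped += 1
```

Without something like this, a sentence queued just before the interruption can be spoken at the start of the next turn.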

VPS deployment

One Hetzner CX22 at £4.50 a month handles a single concurrent call comfortably. 2 vCPU, 4 GB RAM, 20 TB bandwidth. Caddy out front for TLS termination, the mediasoup process on port 3001, the orchestrator as a sidecar over a Unix socket, OpenTTS on 5500 loopback only. For two to four concurrent calls, jump to a CX32 at £8.50 a month. Beyond that, scale horizontally: one mediasoup worker process per CPU core, a load balancer routing by call id.

caddy
# /etc/caddy/Caddyfile
voice.example.com {
  encode zstd gzip

  # WebRTC signalling websocket
  @ws path /rtc
  reverse_proxy @ws localhost:3001

  # Health and metrics
  reverse_proxy /metrics localhost:3001
  reverse_proxy /healthz localhost:3001
}

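Open UDP 40000 to 49999 on the firewall for the mediasoup media ports, and make sure the worker is pinned to the same range with rtcMinPort and rtcMaxPort in createWorker; left at its defaults, mediasoup picks ports from 10000 to 59999. This is the one piece every first time deployer forgets, and the symptom is a connection that handshakes fine and then sits there with zero audio.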

Pitfalls

Echo cancellation off in Chrome

If echoCancellation is false on getUserMedia, the user hears the bot through their own mic and the bot transcribes itself in a feedback loop. Always set echoCancellation, noiseSuppression, and autoGainControl to true.

Jitter buffer too small

Default browser jitter buffer is 50 ms, which drops audio on a wobbly network. Set the receiver playoutDelayHint to 0.1 (100 ms) on the consumer. Costs 50 ms of latency, saves audible glitches.

TTS sample rate mismatch

OpenTTS outputs 22050 Hz, mediasoup wants 48000 Hz Opus. If you skip resampling, the bot sounds like a chipmunk: 22050 Hz samples played back at 48000 Hz run at more than twice the speed and pitch. Use libsamplerate via a C binding, not pure Python.

Barge-in race with buffered audio

Server side cancellation does not stop the audio already in the browser playback graph. Call audioContext.suspend() and clear the buffer on the client the moment local VAD fires, not when the server tells you to.

The dreaded first-message warmup

whisper.cpp lazily loads its model on first use, adding 1.5 seconds to the first turn. Force a warmup transcription of one second of silence on call setup, before the user has finished saying hello.
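
A warmup can be as small as the sketch below. It assumes a `transcribe` callable that accepts a sequence of float samples at 16 kHz; that signature is hypothetical and stands in for whatever wrapper the orchestrator uses around whisper.cpp. The helper returns the elapsed time so the cost of the cold start shows up in the logs.

```python
import time
from typing import Callable, Sequence

SAMPLE_RATE = 16000


def warm_up(transcribe: Callable[[Sequence[float]], str], seconds: float = 1.0) -> float:
    """Run one throwaway transcription of silence; return elapsed seconds."""
    silence = [0.0] * int(SAMPLE_RATE * seconds)
    start = time.perf_counter()
    transcribe(silence)  # result is discarded; the point is loading the model
    return time.perf_counter() - start
```

Call it as soon as the signalling websocket opens, so the model is loaded while the WebRTC handshake is still in progress.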

Wrap up

A natural feeling voice agent is not a single big model. It is five small streaming components stitched together by a Python process that respects a 900 ms budget at every hop. The self hosted components are all open source, the bill is single digit pounds a month, and nothing in the stack depends on OpenAI, Apple, or any vendor you cannot replace in an afternoon. The full reference implementation, including the warmup, barge-in client code, and a Docker compose for the lot, lives at github.com/sarmakska/voice-agent-starter.
