<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nick Lackman</title>
    <description>The latest articles on Forem by Nick Lackman (@nick_lackman).</description>
    <link>https://forem.com/nick_lackman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815621%2F442b019c-4264-463a-ad97-f776220e3dbd.png</url>
      <title>Forem: Nick Lackman</title>
      <link>https://forem.com/nick_lackman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nick_lackman"/>
    <language>en</language>
    <item>
      <title>Voice AI: Fast and Dumb or Slow and Smart — Why Not Fast and Smart?</title>
      <dc:creator>Nick Lackman</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:26:29 +0000</pubDate>
      <link>https://forem.com/nick_lackman/voice-ai-fast-and-dumb-or-slow-and-smart-why-not-both-8d5</link>
      <guid>https://forem.com/nick_lackman/voice-ai-fast-and-dumb-or-slow-and-smart-why-not-both-8d5</guid>
      <description>&lt;p&gt;Many voice AI demos connect the browser directly to a real-time audio model API and lets the server decide when you've stopped talking. That's a demo architecture with a built-in latency tax that quickly breaks down in production. Here's the production alternative: a backend-mediated, multi-thinker voice system with local voice activity detection that owns the entire audio pipeline end-to-end.&lt;/p&gt;

&lt;p&gt;I spent the last year and a half building production voice AI systems that handle thousands of calls per day. This post covers the architecture I wish someone had documented when I started: how to make your voice AI product fast and smart, what the Responder-Thinker pattern is, why single-thinker breaks, how to build multi-thinker with your backend in the middle, and why local VAD is the key to making it feel instant.&lt;/p&gt;

&lt;p&gt;The companion repo is fully functional — clone it, run it, talk to it (OpenAI API Key Required): &lt;a href="https://github.com/lackmannicholas/responder-thinker" rel="noopener noreferrer"&gt;github.com/lackmannicholas/responder-thinker&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;The Latency Budget You Can't Meet&lt;/h2&gt;

&lt;p&gt;Before the Realtime API existed, voice AI meant chaining three models in series: speech-to-text, an LLM, then text-to-speech. The math doesn't work.&lt;/p&gt;

&lt;p&gt;STT endpointing and recognition eat 500-1000ms. The LLM's time-to-first-token adds another 500-1500ms. TTS synthesis takes 200-500ms. You're at &lt;strong&gt;1.2-3 seconds minimum&lt;/strong&gt; before the caller hears a single syllable — and conversational turn-taking breaks down around 800ms of silence.&lt;/p&gt;
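
&lt;p&gt;As a sanity check, here is the serial-pipeline budget as plain arithmetic (the per-stage ranges are the ballpark figures quoted above, not new measurements):&lt;/p&gt;

```python
# Serial pipeline: STT, then LLM, then TTS run back to back, so
# time-to-first-audio is the sum of the per-stage latencies.
stages_ms = {
    "stt_endpointing_and_recognition": (500, 1000),
    "llm_time_to_first_token": (500, 1500),
    "tts_first_audio": (200, 500),
}

best_ms = sum(lo for lo, hi in stages_ms.values())
worst_ms = sum(hi for lo, hi in stages_ms.values())

# Turn-taking starts to feel broken past roughly 800 ms of silence,
# so even the best case blows the budget.
print(f"time to first syllable: {best_ms}-{worst_ms} ms")  # 1200-3000 ms
```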

&lt;p&gt;In &lt;a href="https://dev.to/lackmannicholas/dude-wheres-my-response-cutting-700ms-from-every-voice-ai-turn-with-local-vad-41jn"&gt;my previous post&lt;/a&gt;, I showed that server-side voice activity detection alone adds 500ms+ of unnecessary overhead to every turn. But even after fixing that, the serial pipeline architecture is the bottleneck. You can't engineer your way to natural conversation speed with a pipeline. The architecture has to change.&lt;/p&gt;

&lt;h2&gt;The Realtime API: Fast, But Not Smart Enough&lt;/h2&gt;

&lt;p&gt;OpenAI's Realtime API collapses the STT → LLM → TTS pipeline into a single API call. Latency drops to sub-second. The conversation finally feels natural-ish.&lt;/p&gt;

&lt;p&gt;But there's a tradeoff. The realtime model is conversational and fast, but compared to text-based models like GPT-5.4, it struggles with complex multi-step instructions, structured tool use, and domain-specific accuracy. It hallucinates more. Its instruction-following degrades as the system prompt grows.&lt;/p&gt;

&lt;p&gt;A voice agent that responds instantly but gives wrong information is worse than one that takes two seconds and gets it right. &lt;strong&gt;The Realtime API solved the latency problem and created an intelligence problem.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Enter Responder-Thinker&lt;/h2&gt;

&lt;p&gt;The Responder-Thinker pattern resolves this by splitting responsibilities:&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Responder&lt;/strong&gt; (Realtime API) is "always on". It handles conversational flow — greetings, acknowledgments, stalling, turn-taking. It's fast and socially intelligent. When the user asks something that needs real data or complex reasoning, the Responder classifies the intent and hands off to a Thinker.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Thinker&lt;/strong&gt; (text-based model) runs in the background. It has a focused system prompt, domain-specific tools, and the reasoning capability to get the answer right. When it's done, the result is injected back into the Realtime API conversation, and the Responder delivers it naturally.&lt;/p&gt;

&lt;p&gt;The insight: &lt;strong&gt;you don't need your real-time voice to be smart. You need it to be &lt;em&gt;present&lt;/em&gt; while the smart thing works in the background.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This pattern comes from OpenAI — their &lt;a href="https://github.com/openai/openai-realtime-agents" rel="noopener noreferrer"&gt;openai-realtime-agents&lt;/a&gt; repo calls it "Chat-Supervisor." The concept isn't new. Making it production-grade is the hard part.&lt;/p&gt;
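
&lt;p&gt;The division of labor can be sketched in a few lines of asyncio. This is a toy illustration of the pattern, not code from the companion repo; the names and timings are made up:&lt;/p&gt;

```python
import asyncio

# Toy Responder-Thinker split: the responder stays "present" with an
# instant acknowledgment while the thinker does the slow, accurate work,
# and the thinker's answer is delivered once it lands.
async def thinker(query):
    await asyncio.sleep(0.05)  # stands in for a slow text-model call
    return f"Researched answer to: {query}"

async def handle_turn(query, speak):
    speak("Let me check on that for you.")  # responder: instant presence
    result = await thinker(query)           # thinker: the slow, smart part
    speak(result)                           # inject the result when ready

spoken = []
asyncio.run(handle_turn("weather in Austin", spoken.append))
assert spoken[0].startswith("Let me check")
```

In the real system the responder keeps streaming audio while the thinker runs; here the filler line stands in for that presence.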

&lt;h2&gt;Why Single-Thinker Breaks&lt;/h2&gt;

&lt;p&gt;The simplest implementation has one generalist Thinker handling everything — weather, stocks, news, FAQ, escalation. In my experience, this breaks fast.&lt;/p&gt;

&lt;p&gt;The system prompt grows to accommodate every domain, and quality degrades across all of them. A weather lookup and a complex knowledge question go through the same agent with the same overhead. You can't tune one domain without risking regressions in the others. You can't use a cheaper model for simple lookups and a smarter model for hard reasoning — it's one model for everything. You have to vertically scale model capability to match your most complex task, so lighter tasks are "over-provisioned" in terms of model usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-thinker is a monolith. Multi-thinker is microservices.&lt;/strong&gt; The voice AI industry is learning the same architectural lessons backend engineering learned fifteen years ago.&lt;/p&gt;

&lt;p&gt;In a multi-thinker architecture, each Thinker owns a domain with a focused prompt and its own tools. Weather uses &lt;code&gt;gpt-5.4-mini&lt;/code&gt; with a live weather API. News uses &lt;code&gt;gpt-5.4&lt;/code&gt; because summarization requires more reasoning. Each can be tested, cached, and optimized independently.&lt;/p&gt;
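
&lt;p&gt;A multi-thinker registry can be as simple as a dict keyed by domain, pairing each domain with the cheapest model that can handle it and a focused prompt. The entries below are illustrative, not the repo's:&lt;/p&gt;

```python
# Hypothetical multi-thinker registry: each domain owns a focused prompt
# and a model tier, so cheap lookups never pay for heavy reasoning.
THINKERS = {
    "weather":   {"model": "gpt-5.4-mini", "prompt": "Answer weather lookups via the weather tool."},
    "news":      {"model": "gpt-5.4",      "prompt": "Summarize recent news accurately."},
    "knowledge": {"model": "gpt-5.4",      "prompt": "Answer general questions."},
}

def pick_thinker(domain):
    # Unknown domains fall back to the generalist knowledge thinker.
    return THINKERS.get(domain, THINKERS["knowledge"])

assert pick_thinker("weather")["model"] == "gpt-5.4-mini"
assert pick_thinker("sports") is THINKERS["knowledge"]  # fallback
```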

&lt;h2&gt;The Realistic Production Architecture&lt;/h2&gt;

&lt;p&gt;Here's where this implementation diverges from most tutorials you'll find.&lt;/p&gt;

&lt;p&gt;Many demos connect the browser directly to OpenAI's Realtime API via WebRTC. The browser gets an ephemeral token, establishes a peer connection, and audio flows between the user and OpenAI with nothing in between. &lt;strong&gt;That's not how production voice systems work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production — Twilio, SIP trunks, contact centers — audio always flows through your backend. This architecture puts your backend in the middle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser ←—WebRTC—→ Python Backend ←—WebSocket—→ OpenAI Realtime API
                        │
                   Thinker Agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The browser connects to a FastAPI server via WebRTC (using &lt;a href="https://github.com/aiortc/aiortc" rel="noopener noreferrer"&gt;aiortc&lt;/a&gt; for server-side WebRTC). The backend opens a WebSocket to OpenAI's Realtime API and streams audio bidirectionally, resampling between 48kHz (WebRTC) and 24kHz (Realtime API) using &lt;code&gt;libswresample&lt;/code&gt; for proper anti-aliased conversion.&lt;/p&gt;
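
&lt;p&gt;For intuition, here is the naive version of that 48 kHz to 24 kHz conversion: keep every other sample. This shows the mechanics only. Without the low-pass filtering that libswresample applies first, any content above 12 kHz folds back into the signal as aliasing:&lt;/p&gt;

```python
from array import array

# Naive 2x decimation of 16-bit mono PCM: keep every other sample.
# Production code should resample with libswresample (as the repo does)
# so the audio is anti-alias filtered before decimation.
def naive_downsample_2x(pcm16_bytes):
    samples = array("h")
    samples.frombytes(pcm16_bytes)
    return array("h", samples[::2]).tobytes()

frame_48k = bytes(1920)  # 20 ms of silent mono audio at 48 kHz (960 samples)
frame_24k = naive_downsample_2x(frame_48k)
assert len(frame_24k) == 960  # same 20 ms of audio, now at 24 kHz
```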

&lt;p&gt;What this gives you that direct connection doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interception&lt;/strong&gt;: the backend sees every event between the user and the model. Tool calls route to your server-side agents, not browser JavaScript. This is important for conversation aggregation, metrics, and downstream analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State management&lt;/strong&gt;: Redis-backed conversation history, cross-session user memory, per-domain result caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local VAD&lt;/strong&gt;: your backend owns turn detection, not OpenAI's servers. This is where hundreds of milliseconds live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: API keys never touch the browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport flexibility&lt;/strong&gt;: the same backend works for WebRTC browsers and telephony SIP trunks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Local VAD: Owning Turn Detection End-to-End&lt;/h2&gt;

&lt;p&gt;This is the piece that makes the architecture feel instant.&lt;/p&gt;

&lt;p&gt;Most implementations of the OpenAI Realtime API use &lt;code&gt;semantic_vad&lt;/code&gt; or &lt;code&gt;server_vad&lt;/code&gt; in the session config and let OpenAI decide when the user stopped talking. That means every audio frame travels to OpenAI's servers, their VAD processes it, they decide the turn is over, and only &lt;em&gt;then&lt;/em&gt; does the model start generating a response. That round-trip is hundreds of milliseconds you're paying on every single turn.&lt;/p&gt;

&lt;p&gt;My implementation replaces this entirely with &lt;strong&gt;local voice activity detection&lt;/strong&gt;. The backend runs a &lt;a href="https://github.com/AgoraIO/ten_vad" rel="noopener noreferrer"&gt;TEN VAD&lt;/a&gt; model that processes audio locally and makes the turn detection decision on your own hardware, with zero network round-trip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# When local VAD is active, server-side turn detection is completely disabled.
# The backend owns the full pipeline: detect speech end → commit buffer → trigger response.
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_vad_gate&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_vad_gate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm16_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Speech onset: interrupt if audio is still playing
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speech_started&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_active&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;has_queued_audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_interrupt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Speech end: commit and request response immediately
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speech_ended&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_commit_and_respond&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;chunks_to_send&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pcm16_bytes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# fallback: send everything, let OpenAI decide
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The VAD gate uses a three-state machine — SILENCE, SPEECH, and HANGOVER — with a pre-roll buffer that preserves audio from just before speech onset. When speech ends, the backend immediately commits the audio buffer and sends &lt;code&gt;response.create&lt;/code&gt;. No server-side VAD involved. No round-trip. The Realtime API starts generating the instant it receives the committed buffer.&lt;/p&gt;
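
&lt;p&gt;A toy version of that state machine looks like this. It covers the turn-state logic only; in the real gate the per-frame speech decision comes from the TEN VAD model, and the thresholds and frame counts below are illustrative, not the repo's tuned values:&lt;/p&gt;

```python
from collections import deque

# Three-state VAD gate: SILENCE, SPEECH, HANGOVER. The pre-roll buffer
# preserves audio from just before speech onset; HANGOVER tolerates a
# brief pause before declaring the turn over.
class VadGate:
    def __init__(self, preroll_frames=5, hangover_frames=3):
        self.state = "SILENCE"
        self.preroll = deque(maxlen=preroll_frames)
        self.hangover_frames = hangover_frames
        self.hangover_left = 0

    def process(self, frame, is_speech):
        started = ended = False
        if self.state == "SILENCE":
            self.preroll.append(frame)  # keep audio from just before onset
            if is_speech:
                self.state = "SPEECH"
                started = True  # caller flushes preroll upstream
        elif self.state == "SPEECH":
            if not is_speech:
                self.state = "HANGOVER"
                self.hangover_left = self.hangover_frames
        else:  # HANGOVER: a pause, not necessarily the end of the turn
            if is_speech:
                self.state = "SPEECH"
            else:
                self.hangover_left -= 1
                if self.hangover_left == 0:
                    self.state = "SILENCE"
                    ended = True  # caller commits buffer, requests response

        return started, ended

gate = VadGate(hangover_frames=2)
events = [gate.process(b"frame", s) for s in (False, True, True, False, False, False)]
assert events[1] == (True, False)   # speech onset at frame 2
assert events[-1] == (False, True)  # turn end after the hangover expires
```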

&lt;p&gt;The &lt;code&gt;_commit_and_respond&lt;/code&gt; method uses the same &lt;code&gt;_response_create_lock&lt;/code&gt; that protects thinker result injection and idle nudges, because all of them compete for the same &lt;code&gt;response.create&lt;/code&gt; API constraint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_commit_and_respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_realtime_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_audio_buffer.commit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_create_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_running&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_realtime_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result: it feels like the agent starts responding before you've finished talking. It isn't really, but the gap between end-of-speech and first audio byte is so small that it feels that way. This is &lt;a href="https://dev.to/lackmannicholas/dude-wheres-my-response-cutting-700ms-from-every-voice-ai-turn-with-local-vad-41jn"&gt;the same VAD research I published previously&lt;/a&gt; — 689ms improvement measured in controlled testing — now integrated into a full production architecture.&lt;/p&gt;

&lt;h2&gt;Routing: The Dumbest Model Makes the Most Important Decision&lt;/h2&gt;

&lt;p&gt;The Responder classifies intent via a single tool call — &lt;code&gt;route_to_thinker(domain, query)&lt;/code&gt;. The domain is constrained to a fixed enum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ROUTE_TO_THINKER_TOOL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;route_to_thinker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question, rephrased for the specialist.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is architecturally interesting because &lt;strong&gt;your dumbest model is making the most important decision&lt;/strong&gt;. And that's the right tradeoff. Routing needs to be fast — 100ms, not 2 seconds. The Responder already has full conversational context. And "what kind of question is this?" is a dramatically simpler task than "what's the answer?" Constraining routing to a fixed enum makes misclassification rare and fallback trivial: unknown domains go to the Knowledge Thinker.&lt;/p&gt;

&lt;p&gt;The bridge intercepts the tool call and dispatches the Thinker concurrently so the Responder keeps talking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.function_call_arguments.done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Production Failure States and Three Guards Against Them&lt;/h2&gt;

&lt;p&gt;When the Thinker returns a result, you can't just inject it and call &lt;code&gt;response.create&lt;/code&gt;. Three things can go wrong when handling real users:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guard 1: The user interrupted.&lt;/strong&gt; While the Thinker was working, the user barged in with a new question. The Thinker's result is stale. You still submit the tool output (the API requires it), but you don't ask the Responder to speak a stale answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dispatched_turn_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_turn_id&lt;/span&gt;  &lt;span class="c1"&gt;# snapshot before dispatch
&lt;/span&gt;
&lt;span class="c1"&gt;# ... thinker runs ...
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_turn_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;dispatched_turn_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# stale — user moved on
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Guard 2: The Responder is still talking.&lt;/strong&gt; The Realtime API silently drops &lt;code&gt;response.create&lt;/code&gt; while it's already generating a response — like the "let me check on that" filler. This is the primary cause of the "thinker came back but nothing happened" bug. You have to wait, and you have to serialize all callers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_create_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lock serializes every &lt;code&gt;response.create&lt;/code&gt; caller — thinker results, the local VAD commit path, idle nudges, and disconnect goodbyes — because they all compete for the same API constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guard 3: The user interrupted during the wait.&lt;/strong&gt; After Guard 2 releases, check staleness again. The user could have barged in while you were blocked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_turn_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;dispatched_turn_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# stale after wait
&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_realtime_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real callers interrupt, change their minds, and don't wait politely for the AI to finish thinking. Handling all three failure states is what makes the system feel as close to a human conversation as possible.&lt;/p&gt;

&lt;h2&gt;Barge-In Handling&lt;/h2&gt;

&lt;p&gt;When local VAD detects speech onset while the Responder is outputting audio, the bridge does three things:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_interrupt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Invalidate in-flight thinker tasks
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_turn_id&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Cancel the active response
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_realtime_ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.cancel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_response_done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Flush queued audio so the speaker stops immediately
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_track&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Incrementing &lt;code&gt;_turn_id&lt;/code&gt; is the key move. Every in-flight thinker task holds a snapshot of the turn ID from when it was dispatched. When it returns, Guard 1 catches the mismatch and discards the result. No stale answers, no race conditions, no complex cancellation logic.&lt;/p&gt;

&lt;p&gt;With local VAD, barge-in detection is also local — the backend sees speech onset in the VAD state machine before any audio reaches OpenAI. The interrupt fires faster than server-side detection could.&lt;/p&gt;
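
&lt;p&gt;The whole invalidation mechanism fits in a toy demo. The class and method names below are illustrative, not the repo's:&lt;/p&gt;

```python
import asyncio

# A barge-in bumps turn_id, so a thinker dispatched under the old id
# discards its result instead of speaking a stale answer.
class Bridge:
    def __init__(self):
        self.turn_id = 0
        self.spoken = []

    async def dispatch_thinker(self, answer, delay):
        snapshot = self.turn_id      # Guard 1: snapshot before dispatch
        await asyncio.sleep(delay)   # thinker works in the background
        if self.turn_id != snapshot:
            return                   # stale: the user moved on
        self.spoken.append(answer)

    def barge_in(self):
        self.turn_id += 1            # invalidates all in-flight thinkers

async def main():
    bridge = Bridge()
    slow = asyncio.create_task(bridge.dispatch_thinker("old answer", 0.05))
    await asyncio.sleep(0.01)
    bridge.barge_in()                # user interrupts mid-thinking
    fast = asyncio.create_task(bridge.dispatch_thinker("new answer", 0.01))
    await asyncio.gather(slow, fast)
    return bridge.spoken

spoken = asyncio.run(main())
assert spoken == ["new answer"]  # the stale result was silently discarded
```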

&lt;h2&gt;Context Is Not Just Conversation History&lt;/h2&gt;

&lt;p&gt;A caller asking "is a two-bedroom available?" means nothing without property context. "Same unit as last time" means nothing without user context. In production, managing multiple types of structured context beyond raw conversation history is essential, both for giving the conversation a personal feel and for getting better performance out of the models.&lt;/p&gt;

&lt;p&gt;The repo demonstrates this with a typed &lt;code&gt;UserContext&lt;/code&gt; model persisted in Redis — preferences, memory facts, conversation summaries, and behavioral signals — keyed by browser fingerprint for cross-session persistence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;preferences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Preferences&lt;/span&gt;       &lt;span class="c1"&gt;# name, location, temp unit, watched tickers
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt;            &lt;span class="c1"&gt;# inferred facts, deduped, capped at 20
&lt;/span&gt;    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Summary&lt;/span&gt;               &lt;span class="c1"&gt;# rolling LLM-generated conversation summary
&lt;/span&gt;    &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Signals&lt;/span&gt;               &lt;span class="c1"&gt;# topic counts, session count, last active
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
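&lt;p&gt;Persistence itself can be as simple as JSON round-tripping keyed by fingerprint. This sketch substitutes an in-memory dict for Redis (the real repo persists via Redis); the key format and default fields are assumptions for illustration:&lt;/p&gt;

```python
import json

# In-memory stand-in for Redis; keys are browser fingerprints,
# which is what gives the context cross-session persistence.
_store = {}

def save_context(fingerprint, ctx):
    _store["user_ctx:" + fingerprint] = json.dumps(ctx)

def load_context(fingerprint):
    raw = _store.get("user_ctx:" + fingerprint)
    if raw is None:
        # First visit: empty context with the same shape as UserContext
        return {"preferences": {}, "memory": [], "summary": "", "signals": {}}
    return json.loads(raw)
```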



&lt;p&gt;Thinkers return a &lt;code&gt;ThinkResult&lt;/code&gt; that includes an optional &lt;code&gt;ContextUpdate&lt;/code&gt; — a class describing what the thinker learned. The router applies updates after the thinker returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ThinkResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;context_update&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContextUpdate&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Weather Thinker persists the user's location. The Knowledge Thinker picks up on it without being told. Context isn't trapped in a single agent's conversation. It's a shared, typed resource that any thinker can read from and contribute to. When context changes, the Responder's system prompt is updated mid-session via &lt;code&gt;session.update&lt;/code&gt; so it immediately knows what the thinkers learned.&lt;/p&gt;
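&lt;p&gt;A sketch of how the router might apply a &lt;code&gt;ContextUpdate&lt;/code&gt; and then refresh the Responder mid-session. The &lt;code&gt;render_prompt&lt;/code&gt; helper and the update's dict shape are hypothetical; only the &lt;code&gt;session.update&lt;/code&gt; event type comes from the Realtime API:&lt;/p&gt;

```python
import asyncio
import json

def render_prompt(context):
    # Hypothetical prompt template; the real repo builds this from UserContext
    return "You are a voice assistant. Known user context: " + json.dumps(context)

async def apply_context_update(websocket, context, update):
    # Merge what a thinker learned into the shared, typed context...
    context["preferences"].update(update.get("preferences", {}))
    # ...then refresh the Responder's system prompt mid-session
    await websocket.send(json.dumps({
        "type": "session.update",
        "session": {"instructions": render_prompt(context)},
    }))
```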

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The cost of implementing your own turn detection with a local VAD is well worth it.&lt;/strong&gt; The latency improvement isn't incremental — it's the difference between "this feels like talking to a computer" and "this feels like talking to someone." Owning the turn detection pipeline means you control the most latency-sensitive decision in the entire system. If you're building on the Realtime API and not doing local VAD, you're leaving hundreds of milliseconds on the table on every turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The routing decision matters more than the reasoning quality.&lt;/strong&gt; A perfectly accurate Thinker routed to the wrong domain produces a wrong answer. A slightly less accurate Thinker routed correctly produces a useful one. Invest in your routing prompt and your domain enum. Simple, strict rules help the dumb realtime model route well. You could also use a separate LLM call to classify, but with only a handful of potential tool calls, the Realtime API handles routing just fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stalling is a prompt engineering problem, not a code problem.&lt;/strong&gt; The Realtime API naturally acknowledges the user before executing the tool call. Your system prompt just needs to tell it how. The Research Thinker in the repo simulates a 30-second delay specifically to stress-test this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-thinker is worth the complexity.&lt;/strong&gt; Independent prompts, independent model tiers, independent caching TTLs, independent testing. The overhead of managing multiple agents is far less than the quality cost of a bloated single-thinker prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend mediation is not optional for production.&lt;/strong&gt; Direct browser-to-OpenAI works for demos. The moment you need state, security, observability, local VAD, or telephony support, your backend has to be in the middle. The upfront work will save you time in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three guards make it feel alive.&lt;/strong&gt; Guard 2 catches the "thinker returned but nothing happened" bug, a frustrating one to debug in production, and ensures the user is never left hanging. Guards 1 and 3 prevent the stale-result-after-interrupt bug, which only manifested when callers talked fast, and guarantee the delivered answer reflects the fullest context. These are lessons I wish I had learned without the pain of production issues.&lt;/p&gt;




&lt;p&gt;The full implementation — local VAD, multi-thinker routing, typed user context, LangSmith observability, Docker deployment, and a 30-second research thinker for stress-testing stalling behavior — is at &lt;a href="https://github.com/lackmannicholas/responder-thinker" rel="noopener noreferrer"&gt;github.com/lackmannicholas/responder-thinker&lt;/a&gt;. Clone it, run it, talk to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://dev.to/lackmannicholas/dude-wheres-my-response-cutting-700ms-from-every-voice-ai-turn-with-local-vad-41jn"&gt;Cutting 600ms from Every Voice AI Turn with Local VAD&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Coming next: Adding guardrails and voice quality evals to the Responder-Thinker pattern.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>realtimeai</category>
      <category>openai</category>
      <category>python</category>
    </item>
    <item>
      <title>Dude, Where's My Response? Cutting 600ms from Every Voice AI Turn with Local VAD</title>
      <dc:creator>Nick Lackman</dc:creator>
      <pubDate>Sat, 21 Mar 2026 02:45:02 +0000</pubDate>
      <link>https://forem.com/nick_lackman/dude-wheres-my-response-cutting-600ms-from-every-voice-ai-turn-with-local-vad-8j5</link>
      <guid>https://forem.com/nick_lackman/dude-wheres-my-response-cutting-600ms-from-every-voice-ai-turn-with-local-vad-8j5</guid>
      <description>&lt;p&gt;If you're building voice AI on OpenAI's Realtime API, your agent is slower than it needs to be — the main bottleneck is certainly inference but there's additional overhead to cut.&lt;/p&gt;

&lt;p&gt;I spent the past week instrumenting a production telephony voice pipeline, measuring where latency actually lives, and testing whether local voice activity detection (VAD) could meaningfully reduce response time. The answer is yes — by &lt;strong&gt;689ms per turn on substantive responses&lt;/strong&gt; — and the methodology is cleaner than I expected.&lt;/p&gt;

&lt;p&gt;Here's what I found, how I measured it, and why it matters for anyone building conversational AI on the Realtime API.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Latency Tax
&lt;/h2&gt;

&lt;p&gt;When you build a voice agent on OpenAI's Realtime API — whether you're using the OpenAI Agents SDK, a custom WebSocket implementation, or any orchestration framework — the audio pipeline follows the same path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user speaks, and your telephony provider (Twilio, in my case) streams audio frames to your server&lt;/li&gt;
&lt;li&gt;Your server forwards every audio frame to OpenAI's Realtime API via WebSocket (&lt;code&gt;input_audio_buffer.append&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;OpenAI's server-side VAD (&lt;code&gt;semantic_vad&lt;/code&gt;, the default) processes the audio and decides when the user has stopped talking&lt;/li&gt;
&lt;li&gt;Only after the server-side VAD commits the audio buffer does the LLM begin generating a response&lt;/li&gt;
&lt;li&gt;The generated audio streams back to your server and out to the caller&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The problem is step 3. Every VAD decision requires a network round-trip. The audio has to travel to OpenAI's server, get processed by their turn detection model, and the commit decision happens server-side. Your code doesn't even participate — if you look at the OpenAI Agents SDK source, &lt;code&gt;input_audio_buffer.speech_stopped&lt;/code&gt; is handled as an informational notification. The server has already committed and started response generation by the time your code hears about it.&lt;/p&gt;

&lt;p&gt;This adds an irreducible network latency plus server-side model deliberation time on every single turn. And in a conversational AI system, latency after the user stops speaking is the most perceptible kind — it's the moment they're actively waiting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Approach: Local VAD + Manual Turn Control
&lt;/h2&gt;

&lt;p&gt;The Realtime API supports disabling server-side turn detection entirely. When you set &lt;code&gt;turn_detection&lt;/code&gt; to &lt;code&gt;null&lt;/code&gt;, the server stops making autonomous commit decisions, and you take control of when to send &lt;code&gt;input_audio_buffer.commit&lt;/code&gt; and &lt;code&gt;response.create&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This means you can run a VAD model locally on your server, process the same audio frames as they arrive from Twilio — before they're even sent to OpenAI — and commit the turn the moment &lt;em&gt;you&lt;/em&gt; detect silence. The audio is already on your machine. There's no round-trip to wait for.&lt;/p&gt;
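&lt;p&gt;On top of per-frame VAD verdicts, the end-of-turn decision can be a simple hangover counter: commit once N consecutive silent frames have elapsed after speech. A minimal sketch, with the threshold as an illustrative assumption (TEN VAD itself supplies the per-frame speech/no-speech verdict):&lt;/p&gt;

```python
class TurnDetector:
    """End-of-turn state machine: N consecutive silent frames ends the turn."""

    def __init__(self, silence_frames=30):  # e.g. 30 x 10ms hops = 300ms hangover
        self.silence_frames = silence_frames
        self._silent = 0
        self._speaking = False

    def push(self, is_speech):
        """Feed one frame's VAD verdict; returns True when the turn just ended."""
        if is_speech:
            self._speaking = True
            self._silent = 0
            return False
        if self._speaking:
            self._silent += 1
            if self._silent >= self.silence_frames:
                # Enough trailing silence: commit the turn and reset
                self._speaking = False
                self._silent = 0
                return True
        return False
```

&lt;p&gt;When &lt;code&gt;push&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt;, you send &lt;code&gt;input_audio_buffer.commit&lt;/code&gt; and &lt;code&gt;response.create&lt;/code&gt; — no round-trip required.&lt;/p&gt;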

&lt;p&gt;I used &lt;a href="https://huggingface.co/TEN-framework/ten-vad" rel="noopener noreferrer"&gt;TEN VAD&lt;/a&gt; (by Agora) as the local model, running via ONNX Runtime. More on why TEN VAD below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use Silero?
&lt;/h2&gt;

&lt;p&gt;I evaluated three tiers of VAD before settling on TEN VAD:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Energy-based VAD&lt;/strong&gt; (WebRTC VAD, fast-vad) uses signal processing — energy levels, spectral characteristics, zero-crossing rates — to make binary speech/no-speech decisions. Extremely fast, but can't distinguish speech energy from background noise. WebRTC VAD misses roughly 1 out of every 2 speech frames at a 5% false positive rate. Not viable for production turn detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silero VAD&lt;/strong&gt; is the industry-standard ML-based VAD — an LSTM-based architecture trained on 6,000+ languages, available as an ONNX model. Significantly more accurate than energy-based approaches. But it has a meaningful limitation for conversational AI: it suffers from a multi-hundred-millisecond delay when detecting speech-to-silence transitions. The recurrent architecture needs several silence frames to shift its internal state, which translates directly to turn detection delay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TEN VAD&lt;/strong&gt; (by Agora) is purpose-built for real-time conversational AI turn detection. Agora has 10+ years of experience in real-time voice infrastructure, and it shows. In my testing, TEN VAD detected speech-to-silence transitions with a median head start of &lt;strong&gt;722ms&lt;/strong&gt; over OpenAI's server-side VAD, compared to &lt;strong&gt;342ms&lt;/strong&gt; for Silero under the same conditions. It also achieves a 32% lower Real-Time Factor and 86% smaller library footprint than Silero, which matters when you're running VAD alongside everything else in a voice pipeline.&lt;/p&gt;

&lt;p&gt;The key advantage for turn detection is transition speed. TEN VAD operates on 16kHz audio with 10ms frame hops, giving it finer temporal resolution than Silero's minimum 32ms chunks. It correctly identifies short silent durations between adjacent speech segments that Silero misses entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Methodology
&lt;/h2&gt;

&lt;p&gt;Measuring this correctly turned out to be the hardest part. The naive approach — comparing "local VAD detected silence at time X" vs "server sent speech_stopped at time Y" — has a fundamental bias: the server's &lt;code&gt;speech_stopped&lt;/code&gt; event arrives &lt;em&gt;after&lt;/em&gt; the server has already begun processing, so it makes server-side VAD look artificially fast.&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;use local VAD as a passive timestamp observer in both configurations.&lt;/strong&gt; In the server-side VAD test runs, TEN VAD runs locally but doesn't commit or trigger responses — it only records when it detects silence. This gives both configurations the same "true speech end" anchor point.&lt;/p&gt;

&lt;p&gt;The test protocol:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50 turns per configuration&lt;/strong&gt; — local VAD + commit vs server-side &lt;code&gt;semantic_vad&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scripted test calls&lt;/strong&gt; from a cell phone through production Twilio PSTN infrastructure (8kHz µ-law audio)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common measurement anchor&lt;/strong&gt;: both configurations measure perceived latency from the true moment speech ends, as detected by the passive local TEN VAD observer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controlled quiet-room environment&lt;/strong&gt; to isolate the VAD comparison from acoustic variability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perceived latency&lt;/strong&gt; defined as: true speech end → first audio byte emitted to the caller&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Filler Response Segmentation
&lt;/h3&gt;

&lt;p&gt;An important methodological consideration: the LLM non-deterministically generates "filler" responses (e.g., "Let me look that up for you") that arrive in under 1 second. Server-side VAD received 44% fillers vs 32% for local VAD in my test runs, which biases the unsegmented comparison. I present results segmented by response type to control for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Non-Filler Turns (Primary Comparison)
&lt;/h3&gt;

&lt;p&gt;These are substantive AI responses where the LLM performs real inference. LLM latency is closely matched between configurations, isolating the VAD effect.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Local VAD&lt;/th&gt;
&lt;th&gt;Server VAD&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sample size&lt;/td&gt;
&lt;td&gt;34 turns&lt;/td&gt;
&lt;td&gt;28 turns&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perceived latency (median)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,412ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,101ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-689ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perceived latency (mean)&lt;/td&gt;
&lt;td&gt;2,396ms&lt;/td&gt;
&lt;td&gt;3,216ms&lt;/td&gt;
&lt;td&gt;-820ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM latency (median)&lt;/td&gt;
&lt;td&gt;2,183ms&lt;/td&gt;
&lt;td&gt;2,263ms&lt;/td&gt;
&lt;td&gt;~equal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's d&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.04 (large)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Significance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;p &amp;lt; 0.001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t = 3.93&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;22% reduction in perceived latency&lt;/strong&gt; with closely matched LLM latency, confirming the improvement is attributable to the VAD change, not LLM variance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filler Turns (Cleanest Proof of VAD Effect)
&lt;/h3&gt;

&lt;p&gt;Filler turns provide the cleanest isolation because LLM latency is virtually identical — the entire improvement is pure VAD overhead.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Local VAD&lt;/th&gt;
&lt;th&gt;Server VAD&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sample size&lt;/td&gt;
&lt;td&gt;16 turns&lt;/td&gt;
&lt;td&gt;22 turns&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perceived latency (median)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;679ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,134ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-454ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM latency (mean)&lt;/td&gt;
&lt;td&gt;519ms&lt;/td&gt;
&lt;td&gt;517ms&lt;/td&gt;
&lt;td&gt;~equal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohen's d&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.74 (very large)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Significance&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;p &amp;lt; 0.001&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;t = 5.81&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;40% reduction.&lt;/strong&gt; With LLM latency at 519ms vs 517ms (effectively identical), the entire 454ms improvement is pure VAD overhead eliminated. This is the irreducible cost of server-side turn detection made visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Response Time Distribution
&lt;/h3&gt;

&lt;p&gt;The distribution shift tells the most compelling story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Local VAD&lt;/th&gt;
&lt;th&gt;Server VAD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Under 1 second&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Under 1.5 seconds&lt;/td&gt;
&lt;td&gt;42%&lt;/td&gt;
&lt;td&gt;36%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Under 2.5 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;54%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Under 3 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;28% of local VAD turns respond in under 1 second vs 4% for server-side VAD.&lt;/strong&gt; Sub-second response time is a qualitatively different user experience — it's the difference between a conversation that feels like talking to a person versus waiting for a system.&lt;/p&gt;

&lt;p&gt;Over a 10-turn call, the cumulative improvement is approximately &lt;strong&gt;5–7 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Implement This
&lt;/h2&gt;

&lt;p&gt;The Realtime API makes this straightforward. The key is setting &lt;code&gt;turn_detection&lt;/code&gt; to &lt;code&gt;null&lt;/code&gt; in your session configuration, which puts you in manual turn control mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Disable server-side VAD
&lt;/span&gt;&lt;span class="n"&gt;session_update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.update&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;realtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turn_detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Manual turn control
&lt;/span&gt;            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_update&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# When your local VAD detects end of speech:
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_audio_buffer.commit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_modalities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using the OpenAI Agents SDK (Python), the same mechanism works through the session's manual turn control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The approach works identically regardless of your orchestration framework — it's all the same Realtime API WebSocket protocol underneath.&lt;/p&gt;

&lt;p&gt;For the local VAD model, &lt;a href="https://github.com/TEN-framework/ten-vad" rel="noopener noreferrer"&gt;TEN VAD&lt;/a&gt; is available on Hugging Face with ONNX weights and Python bindings. &lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;Silero VAD&lt;/a&gt; is the more established alternative if you want a simpler setup, though you'll see slower transition detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: Speculative Response Generation
&lt;/h2&gt;

&lt;p&gt;With local VAD handling turn detection, the remaining bottleneck is LLM inference (~2.2s median on non-filler turns). The next optimization I'm exploring is &lt;strong&gt;speculative response generation&lt;/strong&gt; — using the local VAD's early silence detection to trigger LLM inference &lt;em&gt;before&lt;/em&gt; we're fully certain the user has finished speaking. This permits a far more aggressive local VAD configuration, one that wouldn't be safe in production without OpenAI's server-side VAD as a confirmation signal.&lt;/p&gt;

&lt;p&gt;The generated audio would be buffered rather than played immediately. If the user continues speaking, we discard the speculative response. If they're done, the response is already generated and plays almost instantly.&lt;/p&gt;

&lt;p&gt;The Realtime API supports a hybrid configuration for this: set &lt;code&gt;turn_detection.create_response = false&lt;/code&gt; and &lt;code&gt;turn_detection.interrupt_response = false&lt;/code&gt;. This keeps semantic_vad running as a signal while leaving response timing under your control — the best of both worlds.&lt;/p&gt;
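&lt;p&gt;Following the shape of the earlier &lt;code&gt;session.update&lt;/code&gt; example, the hybrid configuration might look like this (verify the exact field layout against the current API reference):&lt;/p&gt;

```python
import json

# Hybrid mode: semantic_vad keeps running as a signal, but response
# timing and interruption stay under application control.
session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "audio": {
            "input": {
                "turn_detection": {
                    "type": "semantic_vad",
                    "create_response": False,    # don't auto-respond on commit
                    "interrupt_response": False  # we handle barge-in ourselves
                }
            }
        }
    }
}
payload = json.dumps(session_update)
```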

&lt;p&gt;Early prototyping suggests this could save an additional 200–300ms, potentially bringing total response latency consistently under 2 seconds. But the edge cases are real — still working through the interplay between local VAD and OpenAI's server-side VAD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology Details
&lt;/h2&gt;

&lt;p&gt;For those who want to reproduce this or poke holes in it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perceived latency&lt;/strong&gt; is defined as the interval from true speech end (local TEN VAD detection) to first audio byte emitted to the telephony provider. Both configurations are measured from the same anchor point — this eliminates the measurement bias inherent in using the server's &lt;code&gt;speech_stopped&lt;/code&gt; event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commit latency&lt;/strong&gt; (local VAD mode only): true speech end → server acknowledgment of &lt;code&gt;input_audio_buffer.committed&lt;/code&gt;. Median 122ms — this is the WebSocket round-trip overhead that local VAD adds. A small price for a large gain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM latency&lt;/strong&gt;: server commit acknowledgment → first audio delta from OpenAI. This is the model inference time, independent of VAD choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filler segmentation threshold&lt;/strong&gt;: LLM latency &amp;lt; 1000ms. Filler responses are non-deterministic LLM behavior (e.g., "Let me find that for you") and are not controllable by VAD configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistical tests&lt;/strong&gt;: Welch's two-sample t-test (unequal variances), Cohen's d for effect size. All p-values are two-tailed.&lt;/p&gt;
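&lt;p&gt;For anyone reproducing the analysis, the pooled-SD form of Cohen's d is straightforward with the standard library. The sample data in the usage check is synthetic, purely to illustrate the formula:&lt;/p&gt;

```python
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5
```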

&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt;: Controlled quiet-room conditions. Scripted test calls from cell phone through production Twilio PSTN infrastructure (8kHz µ-law, ~20ms frames). Test dates: March 20–21, 2026.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I build real-time AI voice systems — telephony pipelines, streaming audio, LLM orchestration. If you're working on similar problems, I'd love to hear what latency challenges you're seeing. Reach out on &lt;a href="https://www.linkedin.com/in/lackmannicholas" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>voiceai</category>
      <category>realtimeai</category>
      <category>websockets</category>
    </item>
    <item>
      <title>I Tested Our WebSocket Audio Pipeline with WebRTC. Here's Why I Switched It Back.</title>
      <dc:creator>Nick Lackman</dc:creator>
      <pubDate>Wed, 11 Mar 2026 04:01:01 +0000</pubDate>
      <link>https://forem.com/nick_lackman/i-tested-our-websocket-audio-pipeline-with-webrtc-heres-why-i-switched-it-back-3g1j</link>
      <guid>https://forem.com/nick_lackman/i-tested-our-websocket-audio-pipeline-with-webrtc-heres-why-i-switched-it-back-3g1j</guid>
      <description>&lt;p&gt;There's a prevailing assumption in the voice AI space that WebRTC is inherently better than WebSockets for real-time audio. Better latency, better quality, better everything. I built a full proof-of-concept to test that assumption on an enterprise scale production AI voice system.&lt;/p&gt;

&lt;p&gt;I found a few things surprising.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Our system takes inbound phone calls, pipes the audio through an AI agent (OpenAI Realtime API), and sends the response back to the caller. The current architecture uses Twilio Programmable Voice with WebSocket media streams — G.711 μ-law audio at 8kHz carried over a WebSocket.&lt;/p&gt;

&lt;p&gt;The hypothesis was straightforward: replace the WebSocket media path with WebRTC via LiveKit, and we'd get lower latency (UDP instead of TCP, no WebSocket framing overhead) and better audio quality (Opus codec at 48kHz instead of G.711 at 8kHz).&lt;/p&gt;

&lt;p&gt;I built the full integration — LiveKit Cloud as the media server, Twilio Elastic SIP Trunking for the PSTN connection, a transport abstraction layer so both paths could run side by side, and a real-time audio pacer to handle frame timing. The key was adding this new transport path without changing any of the LLM orchestration, agent configuration, or tools. It should behave exactly like production, except for using LiveKit/SIP/WebRTC rather than Twilio/Programmable Voice/WebSockets.&lt;/p&gt;

&lt;p&gt;Measuring the delta rigorously was essential to drawing any meaningful insight from this proof-of-concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Result
&lt;/h2&gt;

&lt;p&gt;Median response latency (time from when the caller stops speaking to when the AI starts responding):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebSocket path: ~1,920ms&lt;br&gt;
WebRTC path: ~2,060ms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Essentially identical. The theoretical 50–150ms savings from eliminating WebSocket overhead is real, but invisible against 2+ seconds of LLM response time. The transport layer accounts for less than 5% of total conversational latency. The bottleneck is the model, not the pipe. What I found interesting is the conversation around WebSockets vs WebRTC for real-time AI: "WebRTC is always better" is the general consensus. And WebRTC is the superior transport mechanism for real-time communications (it's literally in the name), but its efficiency benefits are hard to see when model inference takes 500ms–4s.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Audio Quality Result
&lt;/h2&gt;

&lt;p&gt;Both paths delivered the same audio quality — because both paths carry the same audio. When a caller dials from a phone, the audio enters the PSTN as G.711 μ-law at 8kHz. That's a hard ceiling imposed by the telephone network. It doesn't matter whether those bytes travel over a WebSocket or a WebRTC connection; the frequency content is identical. You can't recover information that was never captured at the source. Said another way, you can't re-encode low-quality audio with a high-quality codec and expect better-sounding output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spectral Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3z2wnzo225n7nni7ss7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3z2wnzo225n7nni7ss7.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Surprise: WebRTC Sounded Worse at First
&lt;/h2&gt;

&lt;p&gt;The initial WebRTC implementation actually sounded worse than WebSocket — choppy audio, dropped words, audible artifacts. It took real debugging to figure out why.&lt;/p&gt;

&lt;p&gt;WebRTC's jitter buffer is designed for network jitter. It smooths out packets that arrive with variable timing from a remote peer over UDP. It is not designed to handle an application dumping large bursts of AI-generated audio into the WebRTC stack all at once.&lt;/p&gt;

&lt;p&gt;When the LLM generates a response, the audio arrives in variable-sized chunks — sometimes 50ms of audio, sometimes 500ms, delivered as fast as the model can produce it. The OpenAI Realtime API delivers fairly consistent audio chunks, but the timing isn't exact, and it isn't the strict 20ms cadence the PSTN expects. Our WebSocket implementation had a strict real-time pacer that metered these chunks out at exactly one frame per 20ms with prebuffering and underrun detection. Without that same pacer on the WebRTC path, the audio sounded terrible.&lt;/p&gt;

&lt;p&gt;The fix was porting the same pacer architecture to the WebRTC path. Once both paths had identical frame-level timing discipline, the audio quality matched. The lesson: application-level pacing of AI-generated audio is your responsibility regardless of transport. WebRTC handles network timing, not application timing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where WebRTC Actually Wins
&lt;/h2&gt;

&lt;p&gt;I also tested a WebRTC-native path with no PSTN involved — a browser client connecting directly to the AI agent via LiveKit with Opus at 24kHz. The difference was dramatic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99% audio bandwidth: 8,438 Hz (vs. ~3,969 Hz for PSTN paths)&lt;/li&gt;
&lt;li&gt;2x+ frequency content — you can hear breathiness, sibilants, natural voice texture&lt;/li&gt;
&lt;li&gt;Fewest signal artifacts of all three paths&lt;/li&gt;
&lt;li&gt;Same latency as the other paths (still LLM-bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WebRTC is transformatively better when the caller isn't on a phone. The technology delivers on its promise — just not for PSTN calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The right question isn't "should we use WebRTC?" It's "where is the bottleneck?" For PSTN-based AI voice calls today, the telephone network limits quality, and the LLM limits speed. Changing the transport layer between those two bottlenecks doesn't move the needle.&lt;/p&gt;

&lt;p&gt;WebRTC becomes the right answer when one of these changes: callers move to VoIP/browser/app clients (removing the PSTN quality ceiling), LLM response times drop by an order of magnitude (making transport latency a meaningful fraction of total latency), or wideband codecs become available end-to-end on SIP trunks.&lt;/p&gt;

&lt;p&gt;While WebRTC is the de facto real-time communication protocol, we have millions of phone numbers and deeply ingrained Twilio Programmable Voice integrations. Switching would mean setting up new infrastructure, changing the call routing logic, and taking on the overhead of either managing a media server ourselves or paying for a cloud service like LiveKit. SIP/WebRTC needed to be a significant improvement over Twilio/WebSockets to justify the migration, and it was about the same.&lt;/p&gt;

&lt;p&gt;If you are already deeply integrated with Twilio and their Programmable Voice, the boring WebSocket pipeline with a well-tuned audio pacer is the right architecture. Sometimes the best engineering decision is knowing when not to ship.&lt;/p&gt;

</description>
      <category>websockets</category>
      <category>webrtc</category>
      <category>ai</category>
      <category>twilio</category>
    </item>
  </channel>
</rss>
