Forem: Mohamed-Amine BENHIMA

How WebRTC Actually Works, All in one post

Mohamed-Amine BENHIMA — Sun, 19 Apr 2026 21:33:09 +0000

TL;DR: WebRTC is the standard protocol for real-time peer-to-peer communication. Before any audio or video flows, it does a lot of work upfront to find the best candidate pair - the endpoints (an IP address + port combination that a peer can be reached at, e.g. 82.10.4.1:5000) both peers will use to communicate - and this post walks you through exactly how that works, step by step.

Let me ask you something. You're building a voice agent. The user speaks into their browser, your AI pipeline processes it, and the response comes back as audio. Simple, right?

Now - how does that audio actually travel? Not "the internet," I mean specifically. How does your browser find your server? How fast? Through what route?

That's what WebRTC solves. And once you understand it, a lot of real-time communication suddenly makes sense.

First - why not WebSocket or REST?

REST is stateless. Every request is a stranger. The moment you need a continuous, live audio stream, it breaks down - not even an option.

WebSocket is better. It keeps a persistent connection and works great for text, small messages, signaling. But for audio frames and video? It starts to feel sluggish. It wasn't designed for real-time media.

WebRTC was. It's the standard protocol built specifically for low-latency, real-time peer-to-peer communication - browser to browser, or browser to server.

Browser to server: voice agents, AI avatars, streaming pipelines.
Browser to browser: video calls, Google Meet, anything with two humans talking.

The core idea - find the best endpoint before you need it

Here's the interesting part. WebRTC doesn't just start sending data and hope for the best.

Before any audio flows, it figures out the best candidate pair - the endpoints both peers will use to communicate - and locks it in. Think of the internet as a city with a hundred possible doors to knock on. WebRTC picks the best one upfront, then always uses it.

But here's the catch. The two peers don't know each other's addresses yet. How do you find the best endpoint when you don't even know where the other person is?

That's where ICE comes in.

ICE - the algorithm that makes all of this work

ICE stands for Interactive Connectivity Establishment. It's the algorithm both peers run to figure out their own public address and, eventually, each other's.

The problem is NAT. Your browser doesn't sit directly on the internet - it's behind a router. That router has a public IP address, but your device has a private one (like 192.168.1.5). The browser doesn't know its own public address. So before two peers can connect, each one needs to discover how the outside world sees them.

ICE does this by gathering candidates - possible ways to reach you. There are four types.

The 4 ICE candidate types

1. Host candidate

The simplest one. Your local private IP address. If both peers happen to be on the same Wi-Fi network, they don't need the public internet at all - they just connect directly using their local IPs. Fast, zero overhead.

2. STUN candidate (server-reflexive)

When both peers are on the internet, they need to know their public IP. A STUN server solves this - it's a lightweight server that receives your request, sees your public IP and port from the outside, and sends it back to you.

Your browser        Router (NAT)           STUN Server
  |                     |                      |
  |-- request --------->|                      |
  |   (192.168.1.5)     |-- maps to public IP ->|
  |                     |   82.10.4.1:5000      |
  |                     |                      |
  |                     |<-- your public IP ----|
  |<-- 82.10.4.1:5000 --|                      |

Now you know your public address. That becomes your STUN candidate (server-reflexive: srflx).

3. Relay candidate (TURN)

Some peers can't be reached directly - their server is behind a firewall, on a private IP for security reasons, or on a network that blocks inbound connections. In these cases, you need a TURN server.

TURN is the last resort - mainly when either peer is behind Symmetric NAT (a NAT that assigns a different external port for each unique destination, making the STUN-discovered address useless for direct connections), or when a firewall blocks all inbound connections.

TURN is a relay. Instead of connecting directly, both peers connect to the TURN server, and it forwards the data between them.

Peer A                    TURN Server              Peer B
  |                           |                       |
  |-- data (TURN public IP) ->|                       |
  |                           |-- relay (port → B) -->|
  |                           |<------ response -------|
  |<-- relay back ------------|                       |

TURN is slower than a direct connection, but it's highly reliable. It's the last resort that works in almost every network - though extremely locked-down enterprise proxies can still block even TURN/TLS.

💡 STUN and TURN solve different problems. STUN tells you your public IP. TURN relays your traffic when direct connection isn't possible.

4. Peer-reflexive candidate (prflx)

This one isn't gathered upfront - it's discovered during connectivity checks. When Peer A sends a STUN ping to Peer B, that ping passes through Peer A's NAT. Peer B might see a different public address than the ones in Peer A's candidate list - one the NAT assigned on the fly for that specific connection. Peer B sees an address that wasn't in Peer A's candidate list and reflects it back via the STUN Binding Response (XOR-MAPPED-ADDRESS). Peer A reads that response and self-discovers its own peer-reflexive address from it. Peer A records it as a new local prflx candidate. Peer B records Peer A's source address as a new remote prflx candidate. This is uncommon in typical networks but fairly common in Symmetric NAT and corporate or mobile network scenarios - ICE handles it automatically.

How all candidates get gathered

Both peers run this ICE gathering process simultaneously. Each peer collects all candidate types and ends up with a list like:

192.168.1.5:5000 (host)
82.10.4.1:5000 (STUN / public)
45.33.12.8:3478 (TURN relay)

Peer A                  STUN Server           TURN Server
  |                          |                     |
  |-- what's my public IP? ->|                     |
  |<-- 82.10.4.1:5000 -------|                     |
  |                          |                     |
  |-- allocate relay address -------------------- >|
  |<-- relay at 45.33.12.8:3478 ------------------|
  |                          |                     |
  | now has all candidates                       |

The signaling channel - how peers find each other first

Here's the problem. Two peers want to connect, but they don't know each other's addresses yet. They can't send ICE candidates directly to each other - that would require already being connected.

So they use a signaling channel - a WebSocket server you build yourself, just to exchange this initial information. One persistent connection handles everything: offer, answer, and candidates as they trickle in.

The SDP (Session Description Protocol) is a blob of text that describes the session - codecs, media types, and connection info.

Peer A              Signaling Server (WebSocket)             Peer B
  |                           |                                 |
  |-- ws.send(offer) -------->|                                 |
  |   (SDP offer)             |-- forward offer --------------->|
  |                           |<-- SDP answer ------------------|
  |<-- ws.send(answer) -------|                                 |
  |                           |                                 |
  |-- ws.send(candidate) ---->|-- forward candidate ----------->| ← trickle
  |-- ws.send(candidate) ---->|-- forward candidate ----------->| ← trickle
  |-- ws.send(candidate) ---->|-- forward candidate ----------->| ← trickle

WebRTC doesn't care how you implement this server - what matters is that the candidates reach the other peer.

On the receiving end, each arriving candidate is added via addIceCandidate(), which immediately triggers a connectivity check on that candidate.

ICE Trickle - don't wait, start immediately

Gathering all candidates takes time. The host candidate is instant. STUN takes a round trip. TURN takes a bit longer.

Naive approach: wait until ALL candidates are gathered, put them all in the SDP offer, then send. Slower - you're blocking the whole handshake on the slowest candidate (TURN).

ICE Trickle (the default): the SDP offer/answer is sent once via WebSocket at the start. Then, each candidate is sent via WebSocket the moment it's discovered - without waiting for the others. Peer B starts connectivity checks on the first candidate while Peer A is still gathering the rest.

Peer A              Signaling Server              Peer B
  |                       |                          |
  |-- ws.send(offer) ----->|                          |
  |   (SDP offer)          |-- forward offer -------->|
  |<-- ws.send(answer) ----|<-- ws.send(answer) ------|
  |                       |                          |
  |-- ws.send(candidate) ->|-- forward -------------->|
  |                       |            Peer B runs connectivity check ✓ or ✗
  |-- ws.send(candidate) ->|-- forward -------------->|
  |                       |            Peer B runs connectivity check ✓ or ✗
  |-- ws.send(candidate) ->|-- forward -------------->|
  |                       |            Peer B runs connectivity check ✓ or ✗
  |                       |                          |
  |         best working candidate wins              |
  |<============ WebRTC connection opens ============>|

How does it know if a candidate fails? For each candidate received, ICE sends a small STUN "ping" through that path and waits for a response. No response after a timeout - that candidate is marked as failed and ignored. ICE moves on to the next one.

Peer A                                           Peer B
  |                                                 |
  |-- STUN ping (host candidate) ------------------>|
  |   no response (different network)               |  ✗ fail
  |                                                 |
  |-- STUN ping (STUN candidate) ----------------->|
  |<-- response ------------------------------------|  ✓ works
  |                                                 |
  |-- STUN ping (TURN candidate) ----------------->|
  |<-- response ------------------------------------|  ✓ works
  |                                                 |
  |   pick best working pair (Highest score wins)   |
  |<========== connection opens ====================>|

Priority order is: host > peer-reflexive > server-reflexive > relay. the highest-priority pair that passes connectivity checks wins.

ICE Trickle works because browsers fire candidates as they're found by default. But you still have to implement the onicecandidate handler yourself and send each candidate through your signaling channel - without that code, the candidates never reach the other peer.

The JavaScript side of this

In the browser, all of this maps to RTCPeerConnection:

const ws = new WebSocket("wss://your-signaling-server.com");

const peer = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    {
      urls: "turn:turn.yourserver.com:3478?transport=udp",
      username: "user",
      credential: "pass"
    }
  ]
});

// fires once per candidate found  -  each candidate represents YOU (this peer)
// sent via WebSocket to the signaling server
peer.onicecandidate = (event) => {
  if (event.candidate) {
    ws.send(JSON.stringify({ type: "candidate", candidate: event.candidate }));
  }
  // event.candidate is null → gathering is complete
};

// Peer A: create offer → send via WebSocket
const offer = await peer.createOffer();
await peer.setLocalDescription(offer); // triggers ICE gathering
ws.send(JSON.stringify({ type: "offer", offer }));

// handle incoming messages from signaling server
ws.onmessage = async (event) => {
  const msg = event.data;
  const data = JSON.parse(msg);

  if (data.type === "answer") {
    // SDP answer from Peer B
    await peer.setRemoteDescription(data.answer);
  }

  if (data.type === "candidate") {
    // ICE candidate from Peer B  -  triggers connectivity check immediately
    await peer.addIceCandidate(data.candidate);
  }
};

setLocalDescription is the trigger. The moment you call it, ICE gathering starts in the background.

Transport protocols - how the bytes actually travel

When TURN is involved, there's one more decision: what transport protocol does it use to relay the data?

Three options, in order of preference:

UDP - fastest, no connection overhead, ideal for real-time media
TCP - fallback when UDP is blocked by a firewall
TLS - TCP with encryption, last resort, most likely to get through any firewall

ICE prioritizes them in this order: UDP → TCP → TLS. It doesn't wait for one to fully fail before trying the next - it runs checks in parallel based on priority scores. UDP just wins when it works because it has the highest priority.

And here's a situational optimization when connecting to a media server with a stable public IP (like LiveKit): if direct UDP to the server failed, TURN/UDP will likely fail too on the same network. In that case, skipping to TURN/TCP is reasonable. That said, there are edge cases - some firewalls block most UDP ports but leave UDP 443 open, so TURN/UDP can still matter. For general peer-to-peer WebRTC, don't skip TURN/UDP.

(media server scenario)

Direct UDP      → works?  YES → connected ✓
                          NO  ↓
TURN relay UDP  → skip (UDP is blocked at firewall level  -  no point trying)
                          ↓
TURN relay TCP  → works?  YES → connected ✓
                          NO  ↓
TURN relay TLS  → connected ✓ (gets through almost any firewall)

The mesh problem - when 10 people are on a call

So far, everything above assumes two peers. What happens with 10?

In pure WebRTC peer-to-peer (called mesh topology), every peer connects to every other peer directly. With 10 people, each person sends 9 streams and receives 9 streams.

That's 45 unique connections (10×9/2). Every peer maintains all of them, and your device encodes the same audio 9 times. Bandwidth explodes. Latency goes up. This doesn't scale past 4-5 people in practice.

Peer A --- Peer B
  |    \  /    |
  |     \/     |
  |     /\     |
  |    /  \    |
Peer C --- Peer D

Every line = a separate connection.
4 peers = 6 connections. 10 peers = 45 connections.

Every line is a separate connection. Every peer maintains all of them.

SFU - the fix

The solution is a media server running an SFU (Selective Forwarding Unit).

Instead of each peer sending to every other peer, each peer sends their stream once to the media server. The SFU then forwards it to everyone else in the room.

Peer A ──┐
Peer B ──┤──► SFU (Media Server) ──► Peer A
Peer C ──┤                       ──► Peer B
Peer D ──┘                       ──► Peer C
                                  ──► Peer D

Every peer uploads once. SFU handles the rest.

One upload per peer. The SFU handles distribution. It's called "selective" because it doesn't broadcast to the whole internet - it forwards only to the participants in that room.

Participants in a room can be anything: browsers, voice agents, AI pipelines, recording servers, ingress/egress services. They all join the same room and consume from it.

LiveKit is the popular open-source SFU. It's what most WebRTC-based voice agent infrastructure is built on.

ICE gathering with a media server

With an SFU in the picture, ICE gathering works exactly the same way - but now the "other peer" is the media server, not the other participants.

Peer A                     SFU (Media Server)              Peer B
  |                               |                           |
  |-- ws.send(offer) ------------>|                           |
  |<-- ws.send(answer) -----------|                           |
  |-- ws.send(candidate) -------->|                           |
  |<======= Peer A ↔ SFU connected|                           |
  |                               |                           |
  |                               |<-- ws.send(offer) --------|
  |                               |-- ws.send(answer) ------->|
  |                               |<-- ws.send(candidate) ----|
  |                               |======= Peer B ↔ SFU connected
  |                               |                           |
  |    Peer A and Peer B never connect directly               |
  |    all media flows through the SFU                        |

Each peer runs the full ICE process - host, STUN, TURN candidates, offer/answer exchange - but only with the media server. The media server has a stable public IP (or a TURN server configured), so the handshake is simpler and faster than peer-to-peer.

Two different problems, two different servers

It's easy to confuse TURN and SFU. They're not the same thing:

	TURN Server	SFU (Media Server)
Problem it solves	Can't reach public IP	Bandwidth waste in group calls
What it does	Relays packets when direct connection fails	Receives one stream, forwards to all room participants
When you need it	Always (for reliability)	When you have more than 2-3 peers

In production, you configure both. The media server handles routing, and it has a TURN server configured inside it for when peers can't reach it directly.

That's the full picture. Two peers that don't know each other, finding the best endpoint pair to communicate through, negotiating a connection in milliseconds, and scaling to rooms of hundreds - all before a single byte of real audio flows.

WebRTC does a lot of work so you don't have to think about it. But when something breaks - latency spikes, connections drop, TURN relay kicks in unexpectedly - knowing this flow is what tells you exactly where to look.

Is NVIDIA NIM's free tier good enough for a real-time voice agent demo?

Mohamed-Amine BENHIMA — Sun, 08 Mar 2026 00:09:31 +0000

TL;DR: NVIDIA NIM gives you free hosted STT, LLM, and TTS, no credit card, 40 requests/min. Plug it into Pipecat and you have a real-time voice agent with VAD, smart turn detection, and idle reminders in a weekend. Full code on GitHub

I wanted to test NVIDIA's AI models on a real-time voice agent

Most voice agent tutorials start with "add your OpenAI API key." Then you blink and you've burned $20 before validating a single idea.

NVIDIA NIM gives you hosted STT, LLM, and TTS, all under one API key, no credit card required, 40 requests per minute. Enough for a POC, a demo, or a weekend build.

But the free tier wasn't the only reason I tried it. NVIDIA builds the GPUs everyone runs models on. They created TensorRT. So when they host their own models, I had one question: will I find a new hero, better latency, better accuracy, or both?

I used Pipecat to build a full real-time voice agent and put their stack to the test. Here's what I found.

The stack: NVIDIA NIM + Pipecat

For real-time voice agents, your stack choice matters more than people think. Every service in the pipeline adds latency, STT, LLM, TTS, and they compound.

NVIDIA NIM hosts optimized inference endpoints for all three. One API key, no setup, no infrastructure. The free tier gives you 40 RPM which is plenty to iterate fast and show a working demo to stakeholders.

I wired it up with Pipecat, an open-source framework built specifically for real-time voice pipelines. It handles audio transport, streaming, turn detection, and pipeline orchestration, so I could focus on what actually matters: does the stack perform?

The pipeline: WebRTC -> STT -> LLM -> TTS. Audio in, audio out, sub-second round trip is the goal.

Building the agent

Spin up the pipeline — Wire WebRTC transport into Pipecat, connect NVIDIA STT, LLM, and TTS services. The whole pipeline is 7 lines:

pipeline = Pipeline([
    transport.input(),
    stt, user_agg, llm, tts,
    transport.output(),
    assistant_agg,
])

Add VAD — No mic button. Silero VAD runs locally and detects when the user starts and stops speaking automatically.

vad_analyzer=SileroVADAnalyzer()

Add SmartTurn — VAD alone isn't enough. Users say "umm", "eeh", pause mid-sentence, VAD sees silence and triggers the pipeline too early. SmartTurn runs a local model that understands whether the user actually finished speaking or just paused.

stop=[
    TurnAnalyzerUserTurnStopStrategy(
        turn_analyzer=LocalSmartTurnAnalyzerV3(cpu_count=2)
    )
]

Mute the user on bot first speech — In IVR-style flows, you want the bot to finish its greeting before the user can interrupt. FirstSpeechUserMuteStrategy mutes the user's input until the bot finishes its first turn.

user_mute_strategies=[FirstSpeechUserMuteStrategy()]

Add an idle reminder — If the user goes silent for 60 seconds, the bot gently reminds them it's still there. One event hook, no polling.

@pair.user().event_handler("on_user_turn_idle")
async def hook_user(aggregator: LLMUserAggregator):
    await aggregator.push_frame(
        LLMMessagesAppendFrame(messages=[{
            "role": "user",
            "content": "The user has been idle. Gently remind them you're here to help.",
        }], run_llm=True)
    )

What the numbers actually look like

I went in expecting consistent results across all three services. That's not what I got.

STT, split verdict.
The streaming STT service is fast: ~200ms average for English. Accurate enough for a production demo. But it only works for English. I tried French (fr-FR) and it silently failed. After digging, including raw gRPC tests that bypassed Pipecat entirely, I found the root cause: NVIDIA's cloud truncates "fr-FR" to "fr" internally and fails to match a model. Not a Pipecat bug. A cloud infrastructure bug.

The workaround: NvidiaSegmentedSTTService with Whisper large-v3. It works for French, but it's ~1s average. That's a noticeable latency hit in a real conversation.

TTS, the hero.
Multilingual, ~400ms average, good voice quality. This one I'd use in production. Free.

LLM, inconsistent.
Latency varied too much turn to turn. Not reliable enough for a real-time conversation where the user expects a snappy response. I wouldn't recommend it for production yet.

What I'd do differently

Start with English. The streaming STT at ~200ms is a completely different experience than segmented at ~1s. If your demo feels sluggish, that 800ms gap is probably why.

Once the core flow is validated, swap the STT provider or self-host a model for other languages. The NIM free tier does its job, validate fast, then optimize the stack.

Full code on GitHub -> pipecat-demos/nvidia-pipecat