WebRTC Media Infrastructure 2026 — LiveKit·Pion·Daily·100ms·mediasoup·Janus·Cloudflare Realtime and WHIP/WHEP Deep Dive


Prologue — WebRTC is 90% an infrastructure choice

In 2018, WebRTC was the problem of "how do I write a P2P video call in the browser". One STUN/TURN server, 200 lines of RTCPeerConnection code, a demo page with two video boxes. Done.

In 2026, WebRTC is a different problem.

  • How do I run a 100-person room? — Answer: an SFU (Selective Forwarding Unit). The P2P mesh is N-squared and falls over at 5.
  • How do I plug in an AI voice agent? — Pipe LLM output tokens through TTS, microphone audio through STT, all under 200 ms of latency. WebRTC's strength.
  • How do I do live broadcast? — RTMP is stuck on old codecs (H.264 only) with 2–5 second latency. WHIP (WebRTC-HTTP Ingestion Protocol) and WHEP (WebRTC-HTTP Egress Protocol) have risen as the new standard.
  • Do we go managed (Daily, 100ms, Twilio, Chime) or self-host (LiveKit OSS, mediasoup, Janus, Jitsi)?

This piece is not about "200 lines of WebRTC code". It is about "where and how do you run production media infrastructure". That one decision sets your infra cost, latency and operational burden for the next six months.

Summary — the big picture as of May 2026.

  • LiveKit became the de facto standard infrastructure for AI voice agents. The LiveKit Agents framework integrates first-class with OpenAI Realtime API, Deepgram, and ElevenLabs.
  • Pion settled in as the WebRTC engine of the Go camp. LiveKit itself sits on top of Pion.
  • WHIP/WHEP became the live ingestion standard replacing RTMP. OBS 30+ ships WHIP output out of the box.
  • The managed camp (Daily, 100ms, Twilio Video, AWS Chime SDK) is alive but under price pressure. LiveKit Cloud is the strongest challenger in that fight.
  • Cloudflare Realtime threw WebRTC onto the edge as a new category. Calls API plus Realtime SFU.

Let's start.


1. The WebRTC Infrastructure Landscape — The 2026 Map

First, classification. Not everything is in the same place.

Category | Representative product | One-line summary
AI voice/video infra (managed + OSS) | LiveKit Cloud / LiveKit OSS | The 2026 voice-agent standard. Agents framework.
Managed video API | Daily.co, 100ms, Twilio Video, AWS Chime SDK, Vonage | SDK plus managed SFU. Fast time to ship.
Edge WebRTC | Cloudflare Realtime / Calls | Global edge SFU. New category.
Self-host SFU (Node.js) | mediasoup | Library form. You write the signaling yourself.
Self-host SFU (C) | Janus | Plugin architecture. The OG.
Full-stack OSS | Jitsi (Meet / Videobridge) | A meeting solution plus SFU you download and run.
WebRTC engine (Go) | Pion | The library that made WebRTC writable in Go.
WebRTC engine (Rust) | webrtc-rs | The Rust port of Pion. Growing.
Live ingestion standard | WHIP / WHEP | Start a WebRTC session with a single HTTP request. RTMP replacement.

The focus of this piece is the core set — LiveKit, Pion, Daily, 100ms, mediasoup, Janus, Jitsi, Twilio, Chime, Cloudflare Realtime. WHIP/WHEP gets its own chapter.

Why LiveKit became the standard

  • The number-one infrastructure of the AI voice-agent era. The LiveKit Agents framework integrates first-class with OpenAI Realtime, Deepgram STT, ElevenLabs TTS and Cartesia.
  • Runs both OSS and Cloud (LiveKit Cloud) at the same time. You can start self-hosted and move to Cloud later.
  • Built on Pion, so a single Go binary. Kubernetes and Docker deploys are simple.
  • LiveKit's client SDK family (JS, Swift, Kotlin, Flutter, Unity, React Native) is well organized.

2. A 7-Axis Comparison Matrix

Before the deep analysis, the one-glance picture.

Axis | LiveKit | Daily | 100ms | mediasoup | Janus | Jitsi | Twilio Video | AWS Chime SDK | Cloudflare Realtime
Operating model | Managed + OSS | Managed | Managed | OSS library | OSS daemon | OSS full stack | Managed | Managed | Managed (edge)
Language | Go (Pion) | Closed | Closed | Node.js + C++ | C | Java + C (libwebrtc) | Closed | Closed | Closed
AI voice integration | First-class (Agents) | Good | Good | DIY | DIY | DIY | Average | Good (Voice Focus) | DIY
WHIP/WHEP | First-class | Supported | Supported | Plugin | Plugin | External tool | Not supported | Not supported | First-class
Global routing | First-class in Cloud | First-class | First-class | DIY | DIY | DIY | First-class | First-class | First-class (edge)
Price pressure | Strong (OSS + Cloud) | Average | Average | Infra cost only | Infra cost only | Infra cost only | Expensive | Expensive | New
New adoption trend | Very strong | Stable | Strong (India market) | Stable | Decreasing | Stable | Decreasing | Stable | Rapid growth

Don't decide off this table alone. The next chapters pin down what each tool can and cannot do.


3. LiveKit — Why It Became the Standard, and LiveKit Agents

LiveKit, born in 2021, is the WebRTC infrastructure that found its place the fastest. It runs both an OSS (Apache 2.0) and LiveKit Cloud track. Two decisive events happened between 2025 and 2026.

  1. The LiveKit Agents framework — Python and Node SDKs that bundle LLMs, STT, TTS and VAD to build voice agents. OpenAI Realtime API, Deepgram, AssemblyAI, ElevenLabs, Cartesia, and more integrate first-class.
  2. OpenAI officially adopted LiveKit — It came out publicly that the infrastructure behind ChatGPT voice mode runs on LiveKit. A de facto industry stamp.

Why LiveKit is strong

  • A single Go binary, livekit-server. Pion-based. One Docker line and it is up.
  • Client SDK lineup — JS, Swift, Kotlin, Flutter, Unity, React Native, Python. All share the same abstraction (Room, Participant, Track).
  • Egress and Ingress services — recording (MP4, HLS), stream export, and ingest (WHIP, RTMP, SRT) are split into their own modules.
  • @livekit/components-react — Pre-built components that get you a meeting UI in 30 minutes.
  • LiveKit Cloud — When you don't want to run self-hosted, you switch to the same API. SLA and global routing included.

A LiveKit Agents skeleton

The simplest shape of a single voice agent. The code pipes microphone audio into OpenAI Realtime and sends the model's response audio back into the room.

# agent.py — minimal LiveKit Agents skeleton
from livekit import agents
from livekit.agents import AgentSession, Agent
from livekit.plugins import deepgram, elevenlabs, openai, silero

class VoiceAssistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=(
                "You are a friendly voice assistant. "
                "Answer in short, conversational sentences."
            ),
        )

    async def on_enter(self) -> None:
        # Auto-greet the user
        await self.session.say("Hi there, how can I help you?")

async def entrypoint(ctx: agents.JobContext) -> None:
    await ctx.connect()  # join the room

    session = AgentSession(
        # 1) STT — Deepgram nova-3
        stt=deepgram.STT(model="nova-3"),
        # 2) LLM — OpenAI Realtime, or plain chat completions
        llm=openai.LLM(model="gpt-4o-mini"),
        # 3) TTS — ElevenLabs
        tts=elevenlabs.TTS(voice_id="Rachel"),
        # 4) VAD — Silero voice activity detection (turn taking)
        vad=silero.VAD.load(),
    )

    await session.start(
        agent=VoiceAssistant(),
        room=ctx.room,
    )

if __name__ == "__main__":
    agents.cli.run_app(
        agents.WorkerOptions(entrypoint_fnc=entrypoint),
    )

To deploy, run python agent.py dev locally, then python agent.py start in production. When the LiveKit server creates a new room, the Agents worker is matched automatically and joins.

OpenAI Realtime + WebRTC direct mode

OpenAI's Realtime API launched in 2024 with WebSocket only. In 2025 a WebRTC connection mode was added. A client crafts an SDP offer and posts it directly to the OpenAI endpoint, which answers with an SDP and streams audio over the resulting peer connection. Latency drops to 200–400 ms.

LiveKit Agents abstracts both options.

# Using OpenAI Realtime as a direct WebRTC connection
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(
        model="gpt-4o-realtime-preview",
        voice="alloy",
        # WebRTC mode — no separate STT/TTS needed
        modalities=["audio", "text"],
    ),
)

In this mode you don't need separate STT/TTS — the model handles audio in and out itself. Downsides: pricing is higher, and your model choice is locked to the OpenAI Realtime lineup.

LiveKit's weaknesses

  • Client-side analytics (call-quality measurement, session replay) lags the managed players. Daily leads in this area.
  • Going self-hosted means you operate TURN, load balancers, recording storage and everything around it.
  • Routing topology (node clustering) is powerful but has a learning curve.

4. Pion — The WebRTC Engine of Go

Pion is a full WebRTC implementation written in Go. First public in 2018. LiveKit, Galene, and ion-sfu all sit on Pion. Go's single-binary, concurrency, and cross-compilation advantages match media-server operations well.

Why Pion

  • libwebrtc (Google's C++ WebRTC) is famously hard to build and embed. With Pion, go get is it.
  • Go's goroutines map naturally onto managing many peers.
  • Every WebRTC subsystem (SDP, DTLS, SRTP, ICE, SCTP) is provided as a separate library, so you can pull in just the part you need.

The simplest SFU peer fragment in Pion

The minimal shape of receiving one sender's track and forwarding its RTP packets back out. A real SFU adds track routing across peers, simulcast, DataChannel, and reconnection logic.

// sfu_peer.go — a 1-to-1 track-forwarding fragment in Pion
package main

import (
    "fmt"
    "github.com/pion/webrtc/v4"
)

func newPeer() (*webrtc.PeerConnection, error) {
    pc, err := webrtc.NewPeerConnection(webrtc.Configuration{
        ICEServers: []webrtc.ICEServer{
            {URLs: []string{"stun:stun.l.google.com:19302"}},
        },
    })
    if err != nil {
        return nil, err
    }

    // Take the incoming track and loop it back out on the same connection.
    // A real SFU would attach this local track to the other participants'
    // peer connections instead of its own.
    pc.OnTrack(func(remote *webrtc.TrackRemote, _ *webrtc.RTPReceiver) {
        // Create an outgoing track that reuses the remote codec parameters
        local, err := webrtc.NewTrackLocalStaticRTP(
            remote.Codec().RTPCodecCapability,
            "video",
            "pion-sfu",
        )
        if err != nil {
            return
        }
        sender, err := pc.AddTrack(local)
        if err != nil {
            return
        }
        // Drain incoming RTCP so interceptors (NACK, PLI) keep running
        go func() {
            rtcpBuf := make([]byte, 1500)
            for {
                if _, _, rtcpErr := sender.Read(rtcpBuf); rtcpErr != nil {
                    return
                }
            }
        }()

        // Forward RTP packets as-is
        buf := make([]byte, 1500)
        for {
            n, _, readErr := remote.Read(buf)
            if readErr != nil {
                return
            }
            if _, writeErr := local.Write(buf[:n]); writeErr != nil {
                return
            }
        }
    })

    pc.OnICEConnectionStateChange(func(s webrtc.ICEConnectionState) {
        fmt.Println("ICE state:", s.String())
    })

    return pc, nil
}

The limits are clear. It only loops a single track back, doesn't handle simulcast (multi-resolution tracks), and has no signaling. Still, it shows how direct Pion feels.

Projects on Pion

  • LiveKit
  • Galene — a lightweight SFU assuming single-operator use. Great for lectures and small meetings.
  • ion-sfu — a full SFU. Simulcast, recording, routing.
  • Broadcast Box — a Pion-based WHIP/WHEP broadcast server from the Pion maintainers.

5. mediasoup — Node.js's Standard SFU Library

mediasoup is the SFU library for the Node.js world. The important point: it isn't a daemon — it's a library you import into a Node process. The worker is written in C++, and the JS layer orchestrates it.

Why mediasoup

  • Because it is library-shaped, you control signaling, auth, and room management completely.
  • Supports simulcast, SVC, recording, transcoding, and server routing.
  • Large projects like Discord and Microsoft Mesh have been built on top of mediasoup.

mediasoup's downsides

  • You write signaling, room management, auth, and reconnection logic yourself. The opposite of LiveKit / Janus tying things end to end.
  • The steepest learning curve. The abstractions (router, transport, producer, consumer) are multi-layered.
  • Client SDKs outside the Node ecosystem are community-driven.

I only recommend mediasoup when the team includes an engineer who deeply understands media infrastructure. Writing a mediasoup stack to fill a slot that LiveKit or managed would have covered tends to burn about six months.


6. Janus — The OG SFU Written in C

Janus, released in 2014, is a C-based SFU. The OG of WebRTC infrastructure, with a plugin architecture as its trademark.

  • The VideoRoom plugin — a classic multi-party meeting.
  • The Streaming plugin — converts RTP/RTSP into WebRTC.
  • The AudioBridge plugin — multi-party audio mixing.
  • The WHIP plugin — accepts standard WHIP ingest.

Where Janus sits

  • Very old and stable. Small core, low memory footprint, the performance of C.
  • The plugin model lets one daemon serve live streaming, calls, and mixing scenarios at the same time.
  • Downside: its own signaling protocol (the Janus REST/WebSocket API). Client SDKs aren't as rich as the managed players'.
  • New adoption is shifting toward LiveKit and managed, but Janus is alive and active.

7. Jitsi — The Full-Stack OSS Meeting Solution

Jitsi is a full-stack OSS meeting solution. One download and you get Jitsi Meet (web UI) plus Videobridge (SFU) plus Jicofo (signaling) plus Prosody (XMPP), all as one bundle that runs together.

  • 8x8 (which acquired Jitsi) is the main maintainer.
  • The number-one pick for quickly bringing up a self-hosted "internal meeting solution".
  • API integration is weak — Jitsi is a "solution", not a "library".

Alternate use cases

  • Self-host an internal company meeting solution → Jitsi.
  • Embed video into your product → LiveKit, mediasoup, or managed.
  • People mix these up often. Their targets are different.

8. The Managed Camp — Daily, 100ms, Twilio Video, AWS Chime SDK

The big pattern across managed video APIs is similar. SDK + server SDK + managed SFU + recording + analytics. But strengths diverge.

Daily.co

  • The cleanest developer experience. Both the one-line prebuilt embed (the call-frame) and a full SDK are strong.
  • Analytics and session replay run the deepest. Daily best shows you which user's which call broke and how.
  • Pricing: per-minute. Reasonable up to roughly 1,000-user scale.

100ms

  • India-based. RoomKit is the pre-built meeting UI package.
  • Strong share in India and Southeast Asia. There's a separate live-streaming RoomKit, so live broadcast is bundled in.
  • Price competitiveness is the strength.

Twilio Video

  • Was the standard of enterprise camps, but closed new sign-ups in 2024 and is moving toward EOL in 2026. Twilio Voice and SMS are alive.
  • As of now there is almost no reason to pick Twilio Video new.

AWS Chime SDK

  • The first pick when building video calling inside the AWS world. Strong integration with IAM, CloudWatch, S3 recording, and the rest of the AWS ecosystem.
  • Voice Focus (noise suppression), Echo Reduction, and friends are first-class built-ins.
  • Downsides: pricing is expensive and the client SDK is not as polished as LiveKit's or Daily's.

Cloudflare Realtime / Calls

  • Arrived in 2024. The new category of "WebRTC SFU thrown onto Cloudflare's edge".
  • First-class global distribution and strong WHIP/WHEP standard support.
  • The new pricing model makes a big difference depending on traffic shape. Well worth evaluating.
  • Downside: ecosystem is still young. SDKs and docs trail LiveKit and Daily. But the catch-up rate is fast.

9. WHIP/WHEP — The New Standard Replacing RTMP

Live ingestion was long RTMP's seat. RTMP came out of Macromedia (later Adobe) in 2002, in the Flash era, and three weaknesses became decisive.

  • Codec effectively locked to H.264. Can't do AV1, VP9, or HEVC.
  • Latency is 2–5 seconds. WebRTC is 200–500 ms.
  • TCP-based, so packet-loss recovery is sluggish.

WHIP (WebRTC-HTTP Ingestion Protocol) and WHEP (WebRTC-HTTP Egress Protocol) are the answer. WHIP is published as an IETF standard (RFC 9725); WHEP is in the final stages of IETF standardization.

How WHIP works

  1. The publisher posts an SDP offer to the server over HTTP POST.
  2. The server returns an SDP answer with 200 OK.
  3. Once DTLS/SRTP negotiation completes, media starts flowing.

That's the whole thing. Signaling is one HTTP round trip. No need to spin up WebSockets or a separate signaling server.

A WHIP publishing client — fetch-and-done

// whip-publisher.js — publish microphone and camera over WHIP
async function publishWHIP(endpoint, token) {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
  });

  // Add local media
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: { width: 1280, height: 720 },
  });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // Create the SDP offer
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  // Wait for ICE gathering to complete (simplified)
  await new Promise((resolve) => {
    if (pc.iceGatheringState === 'complete') return resolve(null);
    pc.addEventListener('icegatheringstatechange', () => {
      if (pc.iceGatheringState === 'complete') resolve(null);
    });
  });

  // POST the SDP to the WHIP endpoint
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/sdp',
      Authorization: `Bearer ${token}`,
    },
    body: pc.localDescription.sdp,
  });

  if (!response.ok) {
    throw new Error(`WHIP failed: HTTP ${response.status}`);
  }

  // The Location header is the WHIP session URL — DELETE it to stop publishing
  const sessionUrl = response.headers.get('Location');

  // Set the remote description from the server's SDP answer
  const answer = await response.text();
  await pc.setRemoteDescription({ type: 'answer', sdp: answer });

  return { pc, sessionUrl };
}

// Usage
const { pc, sessionUrl } = await publishWHIP(
  'https://ingest.example.com/whip/my-stream',
  'my-token',
);

Where WHIP/WHEP took root

  • OBS Studio 30+ — WHIP output is now a standard option.
  • LiveKit Ingress — accepts WHIP, RTMP, and SRT at the same entry.
  • Cloudflare Realtime — first-class WHIP/WHEP.
  • Twitch and YouTube Live — under trial (as of May 2026).
  • mediasoup and Janus — via plugins.

Why WHIP/WHEP replaces RTMP

  • Codec freedom (VP9, AV1, H.264, Opus, and more — settled by SDP negotiation).
  • Latency drops from single-digit seconds to 0.5 seconds.
  • One HTTPS POST, so firewall traversal is easy (port 443).
  • Sits on WebRTC standards, so all the standard debugging tools just work.

10. SFU vs MCU vs P2P — The Topology Decision

The topology of WebRTC infrastructure is a huge decision axis.

Topology | How it runs | When it fits
P2P mesh | Every participant connects directly to every other | 2–3 people. Very small.
P2P star | One host sends to every other participant | 1-to-N streaming. Host upload bound.
SFU | The server receives incoming streams and forwards them to other participants as-is | Meetings, webinars, live. The de facto standard.
MCU | The server decodes, mixes, and re-encodes all streams into one composite | Phone conferences. Lowest client load.

Why SFU is the standard

  • Server CPU cost is low. No decode/encode — it just routes RTP packets.
  • Simulcast (multi-resolution) support. The receiver-appropriate track is chosen based on bandwidth.
  • SVC (scalable video coding) support. One track contains multiple layers, more efficient.

MCU's place

  • Phone conferencing (PSTN gateways) — everyone has to be mixed into one stream.
  • Very weak clients (embedded, legacy devices) — they only need to receive one track.
  • The final artifact of recording (one composite video).

P2P's place

  • 1-to-1 video calls — if STUN connects them directly, the server bill is zero.
  • Small family/friend apps. Above 3 people, SFU is the right answer.
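The scaling argument above is simple arithmetic. A quick sketch of the connection counts per topology (plain Python, no media involved):

```python
# Connection counts per topology, for n participants all sending and receiving.

def mesh_links(n: int) -> int:
    """P2P mesh: every pair of participants holds one peer connection."""
    return n * (n - 1) // 2

def mesh_streams_per_client(n: int) -> int:
    """Each mesh client encodes/uploads its stream n-1 times and decodes n-1."""
    return n - 1

def sfu_server_streams(n: int) -> tuple:
    """SFU: n uplinks into the server, n*(n-1) forwarded downlinks out of it."""
    return (n, n * (n - 1))

for n in (3, 5, 10, 100):
    print(n, mesh_links(n), mesh_streams_per_client(n), sfu_server_streams(n))
```

At 5 participants a mesh client already encodes and uploads its stream four times, which is exactly where consumer uplinks give out; the SFU moves that fan-out to the server, which only routes packets.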

11. The Codec Landscape — Opus, VP8, VP9, AV1, H.264, H.265

WebRTC's mandatory codecs are Opus for audio and VP8/H.264 for video. The rest is optional.

Audio

  • Opus — effectively the only choice. 6–510 kbps variable. Handles music and speech well. Every WebRTC implementation supports it.
  • G.711 / G.722 — appear at PSTN gateways. Almost never used inside WebRTC itself.

Video

  • VP8 — mandatory since early WebRTC. Highest compatibility.
  • H.264 — mandatory. First-class hardware acceleration on iOS Safari. Preferred in the U.S. market.
  • VP9 — optional. The efficient codec preceding AV1. Chrome and Firefox support.
  • AV1 — optional. The most efficient (30% better than VP9). Encoding cost is high so it doesn't fit every scenario. Desktop hardware acceleration went mainstream in 2026 and adoption is moving fast.
  • H.265 (HEVC) — almost never used inside WebRTC. Licensing burden.

Simulcast and SVC

  • Simulcast — the publisher sends the same video at multiple resolutions/bitrates simultaneously. The SFU picks one per receiver.
  • SVC — one stream contains a base layer plus enhancement layers. The SFU strips layers as needed. Strong in AV1 and VP9.
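The per-receiver choice an SFU makes for simulcast can be sketched in a few lines. The layer names and bitrates below are illustrative assumptions, not any particular server's values:

```python
# Pick the highest simulcast layer whose bitrate fits the receiver's estimate.
# Layers are (name, bitrate_kbps), low to high — illustrative numbers only.
LAYERS = [("q", 150), ("h", 500), ("f", 1500)]  # quarter / half / full resolution

def pick_layer(estimated_kbps: float, headroom: float = 0.8):
    """Spend only `headroom` of the estimate so audio and retransmits fit."""
    budget = estimated_kbps * headroom
    chosen = LAYERS[0]  # always deliver at least the lowest layer
    for layer in LAYERS:
        if layer[1] <= budget:
            chosen = layer
    return chosen

print(pick_layer(2500))  # plenty of bandwidth -> full resolution
print(pick_layer(700))   # mid connection -> half resolution
print(pick_layer(100))   # constrained -> lowest layer, never drop entirely
```

The real selection loop re-runs this whenever the bandwidth estimator updates, which is why disabling simulcast pins every receiver to one compromise bitrate.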

Decisions

  • Start with VP8 + Opus. Compatible everywhere.
  • iOS-first → add H.264 too.
  • Bandwidth-saving and high-quality → add AV1, factoring in encoding cost.
  • Always turn simulcast on. Without per-receiver adaptation, a big room collapses.

12. Managed vs Self-Host Decision Matrix

The one big decision. This matrix is the guide.

Situation | Recommendation | Reason
MVP, ship inside 6 months | Managed (LiveKit Cloud, Daily, 100ms) | Don't spend your time on media infrastructure
Voice-agent-centric (LLM, STT, TTS) | LiveKit (Cloud or OSS) | The Agents framework is waiting
Internal company meeting solution | Jitsi self-hosted | The "solution" lands as-is
WHIP live ingestion | LiveKit Ingress or Cloudflare Realtime | First-class WHIP support
Global distribution, edge routing | Cloudflare Realtime or LiveKit Cloud | SFU on the edge
Minimize infrastructure cost (mid-scale) | LiveKit OSS self-hosted | Avoid managed per-minute pricing
Total control over signaling and routing | mediasoup | Library-shaped, every decision is yours
Deep AWS ecosystem integration | AWS Chime SDK | Syncs with IAM, S3, CloudWatch
Telephony (PSTN) integration | Twilio Voice + LiveKit SIP | Twilio Voice is alive
India and Southeast Asia pricing | 100ms | Price competitiveness

Signals to move from managed to self-hosted

  • Managed bills start exceeding USD 5,000 a month → that's beyond the cost of self-hosted infra plus one DevOps engineer.
  • Data sovereignty, HIPAA, or on-prem requirements → managed becomes hard.
  • You need deep work on the media pipeline (custom transcoding, custom routing) → the managed abstraction blocks you.
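The USD 5,000 signal above is a crossover calculation you should redo with your own numbers. A back-of-envelope sketch — every rate in it (the per-minute price, server and egress costs, the DevOps share) is a hypothetical placeholder:

```python
# Back-of-envelope crossover between managed per-minute billing and self-hosting.
# All rates below are hypothetical placeholders — substitute real quotes.

def managed_monthly_usd(participant_minutes: int, rate: float = 0.004) -> float:
    """Managed bill: participant-minutes times a per-minute rate."""
    return participant_minutes * rate

def selfhost_monthly_usd(servers: int = 3, per_server: float = 250.0,
                         egress: float = 800.0, devops_share: float = 3000.0) -> float:
    """Self-host: media nodes + TURN/egress traffic + a slice of DevOps time."""
    return servers * per_server + egress + devops_share

pm = 1_500_000  # participant-minutes per month at a modest mid-scale load
print(managed_monthly_usd(pm))   # 6000.0
print(selfhost_monthly_usd())    # 4550.0
```

At that (assumed) volume the managed bill has crossed the self-host total; below it, the managed premium is usually cheaper than the operational burden.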

The hidden costs of self-hosting

  • TURN servers — you absorb traffic for the fraction of users behind corporate firewalls. Traffic is expensive.
  • Load balancers and node clustering.
  • Recording storage and transcoding workers.
  • Monitoring and observability (WebRTC stats are the hardest part).
  • Incident response — media servers are trickier than other servers.

13. Client-Side Libraries

The client matters as much as the server. Half the decision.

Plain RTCPeerConnection

  • The browser default API. Lightest, but you write the signaling yourself.
  • Fits small P2P demos, WHIP egress, very light situations.

LiveKit Client SDK

  • Abstractions Room, Participant, Track. Auto-reconnect, simulcast, DataChannel.
  • @livekit/components-react — a meeting UI in minutes.

Daily's call-frame

  • A one-line iframe embed. Simplest. Full SDK available separately.

mediasoup-client

  • Pairs with a mediasoup server. Transport / producer / consumer abstractions.

Janus's JS adapter

  • Pairs with a Janus server. REST/WebSocket signaling.

simple-peer

  • The simplest P2P library. You drive the signaling yourself.

14. Operations — TURN, NAT, Monitoring

Half of operations goes into signaling, NAT, and observability.

STUN and TURN

  • STUN — the server that helps a client learn its public IP/port. One UDP RTT.
  • TURN — a media relay for users behind NAT (symmetric NAT, firewalls) where STUN doesn't work. Traffic flows through TURN, so it is expensive.
  • coturn — the de facto OSS TURN implementation.
  • TURN traffic ratio is usually 5–20%. With many corporate networks, up to 50%.
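Why the relay ratio shows up on the bill — a rough estimate. The 1 Mbps per relayed participant and the USD 0.09/GB egress price are assumptions to plug your own numbers into:

```python
# Estimated monthly TURN egress and cost, given a relay ratio.
# 1 Mbps per relayed participant and $0.09/GB are assumptions, not quotes.

def turn_egress_gb(participant_minutes: int, relay_ratio: float,
                   mbps: float = 1.0) -> float:
    """GB of media that flows through TURN instead of connecting directly."""
    relayed_minutes = participant_minutes * relay_ratio
    megabits = relayed_minutes * 60 * mbps  # minutes -> seconds x Mbps
    return megabits / 8 / 1000              # Mbit -> MB -> GB

def turn_cost_usd(gb: float, per_gb: float = 0.09) -> float:
    return gb * per_gb

gb = turn_egress_gb(1_000_000, relay_ratio=0.15)
print(round(gb), round(turn_cost_usd(gb), 2))  # 1125 101.25
```

Move the ratio from 15% to the 50% a corporate-heavy user base can hit and the TURN line item more than triples, which is why coturn capacity planning starts from the relay ratio, not the user count.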

Observability via getStats()

  • The standard RTCPeerConnection.getStats() gives you RTT, jitter, packet loss, codec, resolution, and more.
  • Collect once a minute and stream it into your data warehouse.
  • Managed (Daily, LiveKit Cloud) gives you a dashboard for this.
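The counters getStats() reports are cumulative, so a loss rate only appears once you diff consecutive snapshots. A minimal sketch of that delta computation — the field names follow the W3C inbound-rtp stats, the snapshot plumbing is yours:

```python
# Packet-loss percentage between two getStats() snapshots of the same
# inbound-rtp entry. The counters are cumulative, so always diff them.

def loss_percent(prev: dict, curr: dict) -> float:
    lost = curr["packetsLost"] - prev["packetsLost"]
    received = curr["packetsReceived"] - prev["packetsReceived"]
    total = lost + received
    return 100.0 * lost / total if total > 0 else 0.0

t0 = {"packetsLost": 120, "packetsReceived": 48_000}
t1 = {"packetsLost": 170, "packetsReceived": 52_950}  # one minute later
print(round(loss_percent(t0, t1), 2))  # 1.0
```

The same diff-then-divide shape applies to bytesReceived (throughput) and framesDropped; only jitter and roundTripTime arrive as instantaneous values.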

Call-quality KPIs

  • MOS (Mean Opinion Score) — a 1–5 call quality score. Estimated from RTT, packet loss, jitter.
  • Join-failure rate — the fraction that couldn't reach the room. The core SLI.
  • Reconnect frequency — number of reconnects per call.
  • Video freeze time — accumulated time the video was stuck.
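Estimating MOS from those same stats is usually done with a simplified E-model. The sketch below uses the common monitoring approximation, not the full ITU-T G.107 computation — treat the constants accordingly:

```python
# Simplified E-model: RTT/jitter/loss -> R-factor -> MOS (1..5 scale).
# A common monitoring approximation, not the full ITU-T G.107 model.

def estimate_mos(rtt_ms: float, jitter_ms: float, loss_percent: float) -> float:
    latency = rtt_ms / 2 + jitter_ms * 2 + 10  # effective one-way latency
    if latency < 160:
        r = 93.2 - latency / 40
    else:
        r = 93.2 - (latency - 120) / 10        # steeper penalty past 160 ms
    r -= 2.5 * loss_percent                    # packet-loss penalty
    r = max(0.0, min(100.0, r))
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

print(round(estimate_mos(40, 5, 0.0), 2))    # clean call
print(round(estimate_mos(300, 40, 3.0), 2))  # congested call
```

Alerting on the estimated MOS dropping below roughly 3.5 is a reasonable first SLO before you tune the thresholds against user complaints.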

15. Live-Streaming Scenarios

Live broadcast has settled into a separate area of WebRTC infrastructure. Per-scenario recommendations.

Scenario | Publishing | Routing | Receiving
One speaker → 10,000 viewers | OBS WHIP output | LiveKit or Cloudflare Realtime SFU | HLS transcode, then HLS viewers
One speaker → 100 interactive | OBS WHIP output | LiveKit SFU | WHEP receive
Multi-party meeting recording → broadcast | LiveKit Room | LiveKit Egress | HLS transcode
Gameplay broadcast → interactive chat | OBS WHIP | Cloudflare Realtime | WHEP or HLS

Where WebRTC live replaces RTMP live, and where it doesn't

  • Very large viewer counts (100K+) → HLS/DASH still own CDN distance. WebRTC fits up to ~50K.
  • Broadcast features like ad insertion and DRM run deeper in HLS.
  • WebRTC live wins where "low latency + two-way interaction" matters.

16. Security, DRM, and E2EE

WebRTC is DTLS-SRTP encrypted by default. Media packets are always encrypted between client and server. But in SFU mode the SFU does see plaintext for routing decisions (codec info, simulcast layer selection, and so on).

E2EE — Insertable Streams

  • For true end-to-end encryption, use Insertable Streams so even the SFU can't see plaintext.
  • Adopted by LiveKit, Jitsi, and Google Meet. Key distribution and rotation is the core problem.
  • Cost: simulcast efficiency drops, and server-side recording/transcoding gets hard.

DRM

  • WebRTC and DRM don't pair well. DRM is about content protection, WebRTC is about real-time interaction.
  • If you need DRM for live broadcast, go HLS + Widevine.

17. Case Study — A Full-Stack AI Voice Agent

What happens when these tools come together? The most common scenario.

[User browser]
    |
    | WebRTC audio in/out
    v
[LiveKit Server]
    |
    | LiveKit Agents worker joins the room automatically
    v
[Agents Worker (Python)]
    |
    +-- Deepgram STT (streaming)
    +-- OpenAI gpt-4o-mini (LLM)
    +-- ElevenLabs TTS (streaming)
    +-- Silero VAD (turn taking)
    |
    | Sends TTS audio back into the LiveKit room
    v
[User browser — immediate playback]

Latency breakdown (target: under 700 ms)

  • User stops speaking → VAD decides the turn: 100–200 ms
  • Final STT result locked in: 50–100 ms
  • LLM first token: 200–400 ms
  • First TTS audio chunk: 100–200 ms
  • WebRTC transport delay: 50–150 ms

Total: 500–1,050 ms. About the limit at which people feel "this is a conversation".
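Summed up, the stage budget above gives that envelope directly:

```python
# Voice-agent turn latency: (stage, min_ms, max_ms), numbers from the list above.
BUDGET = [
    ("VAD end-of-turn decision", 100, 200),
    ("final STT result",          50, 100),
    ("LLM first token",          200, 400),
    ("first TTS audio chunk",    100, 200),
    ("WebRTC transport",          50, 150),
]

lo = sum(stage[1] for stage in BUDGET)
hi = sum(stage[2] for stage in BUDGET)
print(lo, hi)  # 500 1050
```

The LLM first token is the largest and most variable line, which is why streaming the TTS off the first tokens, rather than waiting for the full response, is where most of the budget is won back.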

OpenAI Realtime + WebRTC direct mode

In the picture above Deepgram, OpenAI, and ElevenLabs collapse into one model. Latency drops to 200–400 ms.

[User browser]
    |
    | Direct WebRTC audio
    v
[OpenAI Realtime endpoint]
    |
    | The model handles audio input/output itself
    v
[User browser — response audio]

The trade-off

  • Pros: very low latency. The model directly senses speech nuances (laughter, interruptions).
  • Cons: pricing is high. Model choice is locked to OpenAI. You lose freedom in STT/TTS.

Most production systems run both. Fast conversation uses Realtime, while tool-calling and custom TTS run on a plain LLM plus separate STT/TTS.


18. Common Anti-Patterns

Things I have seen far too often.

  1. Trying to run a 4+ person room on P2P mesh — it collapses at 5. Go SFU from day one.
  2. Shipping without a TURN server — the moment users come in from behind corporate networks, call-failure rate hits 30%.
  3. Disabling simulcast and broadcasting one resolution — in a 10-person room, one person on mobile 4G drops everybody.
  4. Treating getUserMedia permission as "one and done" — permission changes per page and per session. Re-check every time.
  5. Trying to run the SFU inside one WebSocket signaling process — signaling is signaling and media is media. Don't merge them.
  6. Skipping WebRTC stats collection — one user reports "it's choppy" and you can't reproduce. Collect getStats() once a minute.
  7. Cramming recording/transcoding into the SFU process — one encoding session stalls the entire SFU. Split into a separate worker.
  8. Weak VAD in AI voice agents — the model interrupts before the user finishes, or doesn't respond when they do.
  9. Accepting only RTMP among WHIP/RTMP/SRT — in 2026 you can't push AV1 or VP9 over RTMP. Add WHIP as an option.
  10. Self-hosted with monitoring limited to 5 PromQL metrics — media-server observability runs a level deeper than general web servers.

19. The Big Picture — What Became the Standard

The summary as of May 2026.

  • LiveKit is the standard for AI voice-agent infrastructure — Agents framework, OpenAI adoption, both OSS and Cloud.
  • Pion is the WebRTC engine of Go — LiveKit, Galene, and ion-sfu all sit on Pion.
  • WHIP/WHEP is taking RTMP's seat fast — OBS 30+ ships it by default.
  • OpenAI Realtime added WebRTC direct mode — voice-agent latency at human-conversation level.
  • Cloudflare Realtime created the edge-SFU category — first-class global distribution.
  • The managed camp is still strong — Daily, 100ms, AWS Chime SDK. But under price pressure.
  • Twilio Video is essentially gone — new sign-ups closed, EOL in progress.
  • mediasoup, Janus, and Jitsi solidified as the three branches of self-host.

Decision checklist

  • AI voice agents at the core? — LiveKit + LiveKit Agents.
  • Need an internal meeting solution? — Self-host Jitsi.
  • Live broadcast ingestion at the core? — WHIP + LiveKit Ingress or Cloudflare Realtime.
  • Global edge distribution first? — Cloudflare Realtime or LiveKit Cloud.
  • Need first-class managed analytics? — Daily.
  • Need India/SEA pricing competitiveness? — 100ms.
  • Need deep AWS integration? — AWS Chime SDK.
  • Need to dig into a small SFU fragment in Go? — Pion directly.
  • Want full control over signaling/routing in Node.js? — mediasoup.
  • Want a very stable, very old C daemon? — Janus.

Anti-pattern summary

  1. 4+ people on P2P mesh.
  2. Shipping without TURN.
  3. Disabling simulcast.
  4. No WebRTC getStats() collection.
  5. Signaling, SFU, recording, and transcoding in one process.
  6. Weak VAD in AI agents.
  7. Receiving only RTMP (no WHIP).
  8. Forcing MCU in large rooms.
  9. Mixing managed and self-hosted without a decision.
  10. Enabling E2EE while expecting server-side recording to work.

Next post preview

Candidates for the next post:

  • LiveKit Agents deep dive — token streaming, tool calling, interruption handling.
  • A month operating WHIP/WHEP ingestion — comparing OBS, Cloudflare, and LiveKit.
  • 100 WebRTC getStats() metrics — what to watch and what to ignore.

"WebRTC is not one standard but a bundle of standards. The teams that can operate the bundle go the farthest."

— WebRTC Media Infrastructure 2026, end.

