<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mohamed-Amine BENHIMA</title>
    <description>The latest articles on Forem by Mohamed-Amine BENHIMA (@mohamedamine_benhima).</description>
    <link>https://forem.com/mohamedamine_benhima</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3799818%2F9adb8c08-f67b-4418-81e5-d0ccdfb032e6.png</url>
      <title>Forem: Mohamed-Amine BENHIMA</title>
      <link>https://forem.com/mohamedamine_benhima</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mohamedamine_benhima"/>
    <language>en</language>
    <item>
      <title>Is NVIDIA NIM's free tier good enough for a real-time voice agent demo?</title>
      <dc:creator>Mohamed-Amine BENHIMA</dc:creator>
      <pubDate>Sun, 08 Mar 2026 00:09:31 +0000</pubDate>
      <link>https://forem.com/mohamedamine_benhima/is-nvidia-nims-free-tier-good-enough-for-a-real-time-voice-agent-demo-2fa1</link>
      <guid>https://forem.com/mohamedamine_benhima/is-nvidia-nims-free-tier-good-enough-for-a-real-time-voice-agent-demo-2fa1</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; NVIDIA NIM gives you free hosted STT, LLM, and TTS, no credit card, 40 requests/min. Plug it into Pipecat and you have a real-time voice agent with VAD, smart turn detection, and idle reminders in a weekend. &lt;a href="https://github.com/BENHIMA-Mohamed-Amine/pipecat-demos/tree/nvidia-pipecat" rel="noopener noreferrer"&gt;Full code on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  I wanted to test NVIDIA's AI models on a real-time voice agent
&lt;/h2&gt;

&lt;p&gt;Most voice agent tutorials start with "add your OpenAI API key." Then you blink and you've burned $20 before validating a single idea.&lt;/p&gt;

&lt;p&gt;NVIDIA NIM gives you hosted STT, LLM, and TTS, all under one API key, no credit card required, 40 requests per minute. Enough for a POC, a demo, or a weekend build.&lt;/p&gt;

&lt;p&gt;But the free tier wasn't the only reason I tried it. NVIDIA builds the GPUs everyone runs models on. They created TensorRT. So when they host their own models, I had one question: would I find a new hero, with better latency, better accuracy, or both?&lt;/p&gt;

&lt;p&gt;I used Pipecat to build a full real-time voice agent and put their stack to the test. Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  The stack: NVIDIA NIM + Pipecat
&lt;/h2&gt;

&lt;p&gt;For real-time voice agents, your stack choice matters more than people think. Every service in the pipeline adds latency: STT, LLM, and TTS each take their share, and the delays compound.&lt;/p&gt;

&lt;p&gt;NVIDIA NIM hosts optimized inference endpoints for all three. One API key, no setup, no infrastructure. The free tier gives you 40 RPM, which is plenty to iterate fast and show a working demo to stakeholders.&lt;/p&gt;
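&lt;p&gt;To sanity-check the endpoint before wiring anything into Pipecat, a plain HTTP call is enough. NIM speaks the OpenAI-compatible chat protocol, so this sketch needs only the standard library. The model name is illustrative; pick any chat model from the NIM catalog:&lt;/p&gt;

```python
# Minimal smoke test of a NIM endpoint with only the standard library.
# NIM is OpenAI-compatible, so the payload is the familiar chat/completions
# shape. The model name below is illustrative; the API key comes from
# build.nvidia.com (free, no credit card).
import json
import os
import urllib.request

def build_nim_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat request against the NIM cloud."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://integrate.api.nvidia.com/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": "Bearer " + os.environ.get("NVIDIA_API_KEY", ""),
            "Content-Type": "application/json",
        },
    )

req = build_nim_request("meta/llama-3.1-8b-instruct", "Say hello in one sentence.")
# urllib.request.urlopen(req) would actually send it; omitted here to stay offline.
```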

&lt;p&gt;I wired it up with &lt;a href="https://github.com/pipecat-ai/pipecat" rel="noopener noreferrer"&gt;Pipecat&lt;/a&gt;, an open-source framework built specifically for real-time voice pipelines. It handles audio transport, streaming, turn detection, and pipeline orchestration, so I could focus on what actually matters: does the stack perform?&lt;/p&gt;

&lt;p&gt;The pipeline: WebRTC -&amp;gt; STT -&amp;gt; LLM -&amp;gt; TTS. Audio in, audio out; the goal is a sub-second round trip.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the agent
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Spin up the pipeline&lt;/strong&gt; — Wire WebRTC transport into Pipecat, connect NVIDIA STT, LLM, and TTS services. The whole pipeline is six lines:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_agg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;assistant_agg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Add VAD&lt;/strong&gt; — No mic button. Silero VAD runs locally and detects when the user starts and stops speaking automatically.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vad_analyzer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SileroVADAnalyzer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Add SmartTurn&lt;/strong&gt; — VAD alone isn't enough. Users say "umm" or "eeh" and pause mid-sentence; VAD sees silence and triggers the pipeline too early. SmartTurn runs a local model that understands whether the user actually finished speaking or just paused.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;TurnAnalyzerUserTurnStopStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;turn_analyzer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LocalSmartTurnAnalyzerV3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Mute the user on bot first speech&lt;/strong&gt; — In IVR-style flows, you want the bot to finish its greeting before the user can interrupt. &lt;code&gt;FirstSpeechUserMuteStrategy&lt;/code&gt; mutes the user's input until the bot finishes its first turn.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_mute_strategies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;FirstSpeechUserMuteStrategy&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Add an idle reminder&lt;/strong&gt; — If the user goes silent for 60 seconds, the bot gently reminds them it's still there. One event hook, no polling.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pair.user&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;event_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on_user_turn_idle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hook_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LLMUserAggregator&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;aggregator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;LLMMessagesAppendFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The user has been idle. Gently remind them you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re here to help.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}],&lt;/span&gt; &lt;span class="n"&gt;run_llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What the numbers actually look like
&lt;/h2&gt;

&lt;p&gt;I went in expecting consistent results across all three services. That's not what I got.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STT, split verdict.&lt;/strong&gt;&lt;br&gt;
The streaming STT service is fast: ~200ms average for English. Accurate enough for a production demo. But it only works for English. I tried French (&lt;code&gt;fr-FR&lt;/code&gt;) and it silently failed. After digging, including raw gRPC tests that bypassed Pipecat entirely, I found the root cause: NVIDIA's cloud truncates &lt;code&gt;"fr-FR"&lt;/code&gt; to &lt;code&gt;"fr"&lt;/code&gt; internally and fails to match a model. Not a Pipecat bug. A cloud infrastructure bug.&lt;/p&gt;

&lt;p&gt;The workaround: &lt;code&gt;NvidiaSegmentedSTTService&lt;/code&gt; with Whisper large-v3. It works for French, but it's ~1s average. That's a noticeable latency hit in a real conversation.&lt;/p&gt;
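&lt;p&gt;Until the truncation bug is fixed, the routing I'd suggest is simple enough to state in code. This is a dependency-free sketch: the segmented class name is the one above, but the streaming class name is an assumption, so both are returned as strings rather than constructed:&lt;/p&gt;

```python
# Route to the fast streaming STT only for English; everything else falls
# back to the segmented Whisper service. "NvidiaSTTService" is an assumed
# name for the streaming class; returning strings keeps this sketch
# dependency-free.
def pick_stt_service(language_code: str) -> str:
    """Choose an NVIDIA STT path for a BCP-47 code like 'en-US' or 'fr-FR'."""
    # The cloud truncates "fr-FR" to "fr" and then fails to match a model,
    # so only English is safe on the streaming path today.
    if language_code.split("-")[0].lower() == "en":
        return "NvidiaSTTService"           # streaming, ~200ms average
    return "NvidiaSegmentedSTTService"      # Whisper large-v3, ~1s average

print(pick_stt_service("fr-FR"))  # NvidiaSegmentedSTTService
```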

&lt;p&gt;&lt;strong&gt;TTS, the hero.&lt;/strong&gt;&lt;br&gt;
Multilingual, ~400ms average, good voice quality. This one I'd use in production. Free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM, inconsistent.&lt;/strong&gt;&lt;br&gt;
Latency varied too much turn to turn. Not reliable enough for a real-time conversation where the user expects a snappy response. I wouldn't recommend it for production yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Start with English. The streaming STT at ~200ms is a completely different experience than segmented at ~1s. If your demo feels sluggish, that 800ms gap is probably why.&lt;/p&gt;
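&lt;p&gt;The arithmetic behind that gap is blunt but worth writing down. Using the averages measured above (LLM latency varied too much to average, so the 300ms below is a placeholder, not a measurement):&lt;/p&gt;

```python
# Back-of-envelope round-trip budget from the averages measured in this post.
# The LLM figure is a placeholder since its latency was inconsistent.
def round_trip_ms(stt_ms: float, llm_ms: float, tts_ms: float) -> float:
    """Voice pipeline stages run serially, so their latencies simply add up."""
    return stt_ms + llm_ms + tts_ms

streaming = round_trip_ms(stt_ms=200, llm_ms=300, tts_ms=400)   # 900ms total
segmented = round_trip_ms(stt_ms=1000, llm_ms=300, tts_ms=400)  # 1700ms total
print(segmented - streaming)  # the 800ms gap comes entirely from STT
```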

&lt;p&gt;Once the core flow is validated, swap the STT provider or self-host a model for other languages. The NIM free tier does its job: validate fast, then optimize the stack.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Full code on GitHub -&amp;gt; &lt;a href="https://github.com/BENHIMA-Mohamed-Amine/pipecat-demos/tree/nvidia-pipecat" rel="noopener noreferrer"&gt;pipecat-demos/nvidia-pipecat&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>pipecat</category>
      <category>nvidianim</category>
      <category>webrtc</category>
      <category>voiceagents</category>
    </item>
    <item>
      <title>What makes Pipecat different from other voice agent frameworks?</title>
      <dc:creator>Mohamed-Amine BENHIMA</dc:creator>
      <pubDate>Thu, 05 Mar 2026 00:16:33 +0000</pubDate>
      <link>https://forem.com/mohamedamine_benhima/what-makes-pipecat-different-from-other-voice-agent-frameworks-2n0</link>
      <guid>https://forem.com/mohamedamine_benhima/what-makes-pipecat-different-from-other-voice-agent-frameworks-2n0</guid>
      <description>&lt;h2&gt;
  
  
  It's not an LLM problem
&lt;/h2&gt;

&lt;p&gt;I thought building a voice agent was an LLM problem. Turns out, 80% of the work has nothing to do with the model.&lt;/p&gt;

&lt;p&gt;You're actually orchestrating a chain of AI services: VAD, then either an STT + LLM + TTS chain or a speech-to-speech model like the one we use here. On top of that you need audio streaming, turn cancellation, context management, WebRTC transport, observability, and async concurrency. All at once. All low latency.&lt;/p&gt;

&lt;p&gt;Pipecat handles all of that. In a few lines of code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Most frameworks weren't built for this
&lt;/h2&gt;

&lt;p&gt;Frameworks like LangChain are great. But they're built for LLM calls and agentic workflows. Text in, text out. That's not what a real-time voice agent is.&lt;/p&gt;

&lt;p&gt;The first thing that breaks is transport. REST doesn't work here. You can't poll a server for audio. You need to stream the mic directly from the user's browser to your server, and stream the voice response back in real time.&lt;/p&gt;

&lt;p&gt;Most people jump to WebSockets. They work fine in the browser, but they run on TCP. TCP guarantees delivery and order, which sounds good until you realize that in real-time audio, a delayed packet is worse than a lost one. You don't want the protocol retrying. You want speed.&lt;/p&gt;

&lt;p&gt;WebRTC runs on UDP. It was built exactly for this: low latency, browser-to-server, real-time media streaming. That's why it's the right transport for voice agents.&lt;/p&gt;

&lt;p&gt;But transport is just the start. Once audio hits your server you still need to orchestrate VAD to detect when the user is speaking, a speech-to-speech model or a full STT + LLM + TTS chain, audio streaming back to the client, turn cancellation when the user interrupts, and context management across turns.&lt;/p&gt;

&lt;p&gt;Without a framework built around this, you're writing all of it from scratch. And that's months of debugging edge cases that have nothing to do with your actual product.&lt;/p&gt;

&lt;p&gt;That's what Pipecat is built for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building a voice agent pipeline in Pipecat
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: The entry point
&lt;/h3&gt;

&lt;p&gt;Everything starts with a single POST &lt;code&gt;/api/offer&lt;/code&gt; endpoint. The browser sends a WebRTC offer, the server processes it and returns an answer, and the connection is established.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/offer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WebRTCAnswer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;offer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SmallWebRTCRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BackgroundTasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;small_webrtc_handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SmallWebRTCRequestHandler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_handler&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;WebRTCAnswer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;webrtc_connection_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;webrtc_transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SmallWebRTCTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;webrtc_connection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TransportParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;audio_in_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_out_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_out_10ms_chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_bot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;webrtc_transport&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;small_webrtc_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handle_web_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;webrtc_connection_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;webrtc_connection_callback&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;WebRTCAnswer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the connection is ready, &lt;code&gt;run_bot&lt;/code&gt; is called as a background task. FastAPI doesn't block waiting for the bot to finish. Each user gets their own transport instance and their own pipeline running concurrently.&lt;/p&gt;

&lt;p&gt;There's also a PATCH &lt;code&gt;/api/offer&lt;/code&gt; endpoint for ICE candidates. WebRTC uses ICE to negotiate the best network path between browser and server. This endpoint handles those negotiation messages as they come in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Managing WebRTC connections
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;SmallWebRTCRequestHandler&lt;/code&gt; is a singleton, initialized once at startup and shared across all connections. It manages the WebRTC state for every active session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;small_webrtc_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SmallWebRTCRequestHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_handler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SmallWebRTCRequestHandler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;small_webrtc_handler&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On shutdown, the lifespan context manager closes the handler cleanly so no connections are left hanging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lifespan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;small_webrtc_handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For transport you have two options. Daily is the managed, paid solution. It handles WebRTC infrastructure for you and works well if your server sits behind NAT or a load balancer, since peer-to-peer WebRTC connections struggle in that setup without a managed relay.&lt;/p&gt;

&lt;p&gt;We use SmallWebRTC, the open source option. It works perfectly fine as long as your VM has a public IP. No extra cost, no external dependency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: The pipeline
&lt;/h3&gt;

&lt;p&gt;Once the transport is ready, &lt;code&gt;run_bot&lt;/code&gt; builds and runs the pipeline. The core idea in Pipecat is simple. You define a list of processors that handle frames flowing through them in order. Audio in, intelligence, audio out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;user_aggregator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;assistant_aggregator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;transport.input()&lt;/strong&gt; streams raw audio from the user's browser directly to the server over WebRTC. No buffering, no polling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;user_aggregator&lt;/strong&gt; combines two things: Silero VAD to detect when the user starts and stops speaking, and SmartTurn to decide when they actually finished their thought. VAD gives you the audio boundaries. SmartTurn uses a local model to predict if the turn is really complete, not just a pause. Without this, the bot cuts in mid-sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;llm&lt;/strong&gt; here is Gemini Live, a speech-to-speech model. You send it audio, it responds with audio. No STT, no TTS in between. That removes two network hops from your latency budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;transport.output()&lt;/strong&gt; streams the bot's audio response back to the browser in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;assistant_aggregator&lt;/strong&gt; handles context. It keeps track of the conversation history and compresses the context window when it gets too long, so the model doesn't blow past its context limit mid-conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Smart Turn detection
&lt;/h3&gt;

&lt;p&gt;Most voice agents use basic silence detection. Wait 500ms of no audio, assume the user is done, send to the LLM. Simple, but it breaks constantly. People pause mid-sentence. They think out loud. A fixed silence threshold either cuts them off too early or adds noticeable delay.&lt;/p&gt;

&lt;p&gt;SmartTurn solves this with a small local model that runs on every audio chunk. It doesn't just detect silence, it predicts whether the turn is actually complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stop_strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TurnAnalyzerUserTurnStopStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;turn_analyzer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LocalSmartTurnAnalyzerV3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SmartTurnParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stop_secs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;stop_secs=2&lt;/code&gt; is the fallback. If the model is uncertain for 2 seconds, it ends the turn anyway.&lt;/p&gt;
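&lt;p&gt;The semantics of that fallback are easy to get wrong, so here they are as a toy predicate. This is plain Python illustrating the described behavior, not Pipecat internals:&lt;/p&gt;

```python
# Toy model of the stop logic described above: a turn ends when the local
# model judges it complete, or when raw silence exceeds the stop_secs cap.
def turn_is_over(model_says_complete: bool, silence_secs: float,
                 stop_secs: float = 2.0) -> bool:
    """End the turn on a confident prediction OR on the silence fallback."""
    return model_says_complete or silence_secs >= stop_secs

print(turn_is_over(model_says_complete=False, silence_secs=0.6))  # False: mid-thought pause
```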

&lt;p&gt;This is wired into the user aggregator alongside VAD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_aggregator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;assistant_aggregator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMContextAggregatorPair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LLMUserAggregatorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vad_analyzer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SileroVADAnalyzer&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;user_turn_strategies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;UserTurnStrategies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;stop_strategy&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VAD detects speech boundaries. SmartTurn decides when to act on them. Together they handle interruptions and natural pauses gracefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Observability
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PipelineTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;PipelineParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;enable_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;enable_usage_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;observers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;latency_observer&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;enable_metrics=True&lt;/code&gt; gives you TTFB and processing time per service. You see exactly how long each stage takes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;enable_usage_metrics=True&lt;/code&gt; gives you token usage from the LLM and character count from TTS, per interaction.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UserBotLatencyObserver&lt;/code&gt; measures the total end-to-end latency: from the moment the user stops speaking to the moment the bot starts speaking. That's the number that actually matters for how natural the conversation feels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@latency_observer.event_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on_latency_measured&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_latency_measured&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-to-bot latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One callback. You get the full picture.&lt;/p&gt;




&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SmartTurn handles filler words
&lt;/h3&gt;

&lt;p&gt;I expected turn detection to be a silence threshold. Wait long enough, assume the user is done. Simple.&lt;/p&gt;

&lt;p&gt;The problem is people don't speak in clean sentences. They say "umm" and "euh" and pause mid-thought. Normal VAD hears that silence and ends the turn. The bot cuts in. The user feels interrupted. The conversation breaks.&lt;/p&gt;

&lt;p&gt;SmartTurn runs a local model on every audio chunk and predicts whether the turn is actually complete. It hears "umm" followed by silence and knows the user isn't done yet. It waits. That one thing has a bigger impact on conversation quality than almost anything else in the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrency is handled for you
&lt;/h3&gt;

&lt;p&gt;I expected to manage concurrent sessions myself. Thread safety, shared state, making sure one user's pipeline doesn't interfere with another's.&lt;/p&gt;

&lt;p&gt;Pipecat handles this through its frame-based architecture. Each WebRTC connection spins up its own pipeline instance as an async background task. Sessions are fully isolated. You don't write any of that isolation logic yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_bot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;webrtc_transport&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one line is doing a lot. Each call to &lt;code&gt;run_bot&lt;/code&gt; gets its own transport, its own context, its own pipeline. No shared state to worry about.&lt;/p&gt;
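&lt;p&gt;The isolation pattern can be sketched with plain asyncio (the &lt;code&gt;Session&lt;/code&gt; class and this &lt;code&gt;run_bot&lt;/code&gt; stub are illustrative, not Pipecat internals): each connection gets its own task and its own state object, so nothing is shared between users.&lt;/p&gt;

```python
import asyncio

class Session:
    """Per-connection state: its own context, nothing shared (toy stand-in)."""
    def __init__(self, client_id: str):
        self.client_id = client_id
        self.messages: list[str] = []

async def run_bot(session: Session) -> Session:
    # Stand-in for a full pipeline run over one WebRTC connection.
    session.messages.append(f"hello from {session.client_id}")
    await asyncio.sleep(0)  # yield to the event loop, as a real pipeline would
    return session

async def main() -> list[Session]:
    # One background task per connection, mirroring
    # background_tasks.add_task(run_bot, webrtc_transport).
    tasks = [asyncio.create_task(run_bot(Session(cid))) for cid in ("alice", "bob")]
    return await asyncio.gather(*tasks)

sessions = asyncio.run(main())
print([s.messages for s in sessions])  # each session only saw its own traffic
```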

&lt;h3&gt;
  
  
  Observability is a first-class citizen
&lt;/h3&gt;

&lt;p&gt;I planned to wire up my own latency tracking after getting the core working. I assumed it would be a separate logging layer I'd have to build.&lt;/p&gt;

&lt;p&gt;It wasn't. Three lines of config give you TTFB per service, token and character usage per interaction, and full end-to-end latency from the moment the user stops speaking to the moment the bot starts speaking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;PipelineParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;enable_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enable_usage_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;observers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;latency_observer&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most frameworks make observability an afterthought. In Pipecat it's built into the pipeline task itself.&lt;/p&gt;




&lt;p&gt;Full code is on GitHub: &lt;a href="https://github.com/BENHIMA-Mohamed-Amine/pipecat-demos/tree/fastapi-pipecat" rel="noopener noreferrer"&gt;pipecat-demos/fastapi-pipecat&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What's coming next
&lt;/h3&gt;

&lt;p&gt;In the next posts I'll cover:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrating LangChain with Pipecat&lt;/strong&gt; — how to bring agentic workflows into a real-time voice pipeline without killing your latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communicating with the frontend&lt;/strong&gt; — streaming transcription as the user speaks, streaming LLM output word by word, and highlighting the sentence the bot is currently speaking. The stuff that makes a voice agent feel alive, not just functional.&lt;/p&gt;

</description>
      <category>pipecat</category>
      <category>voiceai</category>
      <category>python</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>🎙️ I Built a Real-Time Voice AI Agent in ~90 Lines of Python</title>
      <dc:creator>Mohamed-Amine BENHIMA</dc:creator>
      <pubDate>Sun, 01 Mar 2026 11:33:54 +0000</pubDate>
      <link>https://forem.com/mohamedamine_benhima/i-built-a-real-time-voice-ai-agent-in-90-lines-of-python-2fbk</link>
      <guid>https://forem.com/mohamedamine_benhima/i-built-a-real-time-voice-ai-agent-in-90-lines-of-python-2fbk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Speech in. Speech out. No fluff. Just vibes.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; A real-time voice AI agent — STT, LLM, TTS, WebRTC — in ~90 lines using Pipecat and Groq. No custom streaming logic, no callback hell. Just declare the pipeline and run it. &lt;a href="https://github.com/BENHIMA-Mohamed-Amine/pipecat-demos/tree/quickstart" rel="noopener noreferrer"&gt;Full code on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this is hard — and why it's not anymore
&lt;/h2&gt;

&lt;p&gt;Building a real-time voice agent from scratch used to mean writing your own WebRTC server, manual audio streaming pipeline, multiple SDK integrations, and async concurrency management.&lt;/p&gt;

&lt;p&gt;Pipecat abstracts all of that. You declare the pipeline, it handles the rest.&lt;/p&gt;




&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speech-to-Text&lt;/td&gt;
&lt;td&gt;Groq (Whisper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language Model&lt;/td&gt;
&lt;td&gt;Groq (LLaMA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text-to-Speech&lt;/td&gt;
&lt;td&gt;Groq (PlayAI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transport&lt;/td&gt;
&lt;td&gt;WebRTC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAD&lt;/td&gt;
&lt;td&gt;Silero (local)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;td&gt;Pipecat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How the pipeline works
&lt;/h2&gt;

&lt;p&gt;Think of it as an assembly line for audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Microphone
    └─► WebRTC input
            └─► Groq STT (Whisper)
                    └─► User context aggregator (+ Silero VAD)
                                └─► Groq LLM
                                        └─► Groq TTS
                                                └─► WebRTC output
                                                        └─► Assistant context aggregator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage processes frames — units of audio, text, or control signals — and passes them downstream.&lt;/p&gt;
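&lt;p&gt;The frame model can be sketched in plain Python (conceptual only, not Pipecat's API): each stage is a function from frame to frame, and the pipeline simply chains them in order.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Frame:
    kind: str      # "audio", "text", or "control"
    payload: str

# Each toy stage transforms the frames it cares about and passes the rest through.
def stt(frame: Frame) -> Frame:
    return Frame("text", f"transcript({frame.payload})") if frame.kind == "audio" else frame

def llm(frame: Frame) -> Frame:
    return Frame("text", f"reply-to[{frame.payload}]") if frame.kind == "text" else frame

def tts(frame: Frame) -> Frame:
    return Frame("audio", f"speech({frame.payload})") if frame.kind == "text" else frame

def run_pipeline(stages: list[Callable[[Frame], Frame]], frame: Frame) -> Frame:
    for stage in stages:
        frame = stage(frame)   # pass the frame downstream
    return frame

out = run_pipeline([stt, llm, tts], Frame("audio", "mic-bytes"))
print(out.kind, out.payload)  # audio speech(reply-to[transcript(mic-bytes)])
```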




&lt;h2&gt;
  
  
  The code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Services: plug in Groq
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GroqSTTService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GroqTTSService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GroqLLMService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Context: give the bot memory
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a friendly AI assistant. Respond naturally and keep your answers conversational. Always give short, concise answers — no more than 2-3 sentences.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. VAD: the unsung hero
&lt;/h3&gt;

&lt;p&gt;Voice Activity Detection is what determines when the user is done speaking. Without it, the pipeline either waits indefinitely or cuts the user off mid-sentence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_aggregator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;assistant_aggregator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMContextAggregatorPair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LLMUserAggregatorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vad_analyzer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SileroVADAnalyzer&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Silero VAD runs locally, monitors audio continuously, and fires signals for speech start and stop — triggering the STT stage only after it detects the user has finished speaking.&lt;/p&gt;
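&lt;p&gt;Silero is a neural model, but the start/stop signalling it provides can be mimicked with a toy energy threshold (purely illustrative): track whether speech is in progress and emit an event on each transition.&lt;/p&gt;

```python
def vad_events(energies: list[float], threshold: float = 0.5) -> list[str]:
    """Emit 'start'/'stop' events from per-chunk energy levels (toy VAD)."""
    events, speaking = [], False
    for e in energies:
        if e >= threshold and not speaking:
            events.append("start")   # speech began
            speaking = True
        elif e < threshold and speaking:
            events.append("stop")    # silence: hand the buffered turn to STT
            speaking = False
    return events

print(vad_events([0.1, 0.8, 0.9, 0.2, 0.7, 0.1]))  # ['start', 'stop', 'start', 'stop']
```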

&lt;h3&gt;
  
  
  4. The pipeline: declare the flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;       &lt;span class="c1"&gt;# Audio in
&lt;/span&gt;    &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                     &lt;span class="c1"&gt;# Transcribe
&lt;/span&gt;    &lt;span class="n"&gt;user_aggregator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# Accumulate + VAD
&lt;/span&gt;    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                     &lt;span class="c1"&gt;# Think
&lt;/span&gt;    &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                     &lt;span class="c1"&gt;# Speak
&lt;/span&gt;    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;      &lt;span class="c1"&gt;# Audio out
&lt;/span&gt;    &lt;span class="n"&gt;assistant_aggregator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Save response to context
&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reads like the actual data flow — no callbacks, no nesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Events: connect and disconnect
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@transport.event_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on_client_connected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_client_connected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Say Hello, and briefly introduce yourself.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;queue_frames&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;LLMRunFrame&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

&lt;span class="nd"&gt;@transport.event_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on_client_disconnected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;client_disconnected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On connect, the bot introduces itself. On disconnect, the task cleans up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/BENHIMA-Mohamed-Amine/pipecat-demos.git
&lt;span class="nb"&gt;cd &lt;/span&gt;pipecat-demos

uv &lt;span class="nb"&gt;sync

cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Add your GROQ_API_KEY&lt;/span&gt;

uv run python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the browser URL, click Connect, and try: &lt;em&gt;"Give me some info about Morocco"&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Pipecat handles WebRTC negotiation, audio buffering, frame scheduling, and async coordination. You get to focus on what the bot does, not how audio moves through the system.&lt;/p&gt;

&lt;p&gt;This is what an inflection point looks like for voice AI development.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Full code on GitHub → &lt;a href="https://github.com/BENHIMA-Mohamed-Amine/pipecat-demos/tree/quickstart" rel="noopener noreferrer"&gt;pipecat-demos/quickstart&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>pipecat</category>
      <category>realtimevoiceagent</category>
      <category>webrtc</category>
    </item>
  </channel>
</rss>
