<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Bassey John</title>
    <description>The latest articles on Forem by Bassey John (@ogabassey).</description>
    <link>https://forem.com/ogabassey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3826517%2Ff3dc07c6-6bf0-4b3e-9e48-d33eed5829c4.png</url>
      <title>Forem: Bassey John</title>
      <link>https://forem.com/ogabassey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ogabassey"/>
    <language>en</language>
    <item>
      <title>Building Ekaette — A Multimodal AI Voice Assistant on Gemini Live API and Google Cloud</title>
      <dc:creator>Bassey John</dc:creator>
      <pubDate>Mon, 16 Mar 2026 06:56:08 +0000</pubDate>
      <link>https://forem.com/ogabassey/building-ekaette-a-multimodal-ai-voice-assistant-on-gemini-live-api-and-google-cloud-4e7c</link>
      <guid>https://forem.com/ogabassey/building-ekaette-a-multimodal-ai-voice-assistant-on-gemini-live-api-and-google-cloud-4e7c</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was created for the purposes of entering the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt;. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Ekaette?
&lt;/h2&gt;

&lt;p&gt;Ekaette is a configurable multimodal AI voice and messaging assistant for customer-facing businesses. Customers can call a phone number, speak naturally, send photos or videos on WhatsApp mid-call, and continue the same conversation across channels without repeating themselves.&lt;/p&gt;

&lt;p&gt;It supports 6 industry templates (electronics, hotel, automotive, fashion, telecom, aviation) and is configurable per tenant and company without changing backend code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it live:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📞 Call: +2342017001127 (Africa's Talking SIP)&lt;/li&gt;
&lt;li&gt;💬 WhatsApp: +2348124975729&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ogabasseyy/ekaette" rel="noopener noreferrer"&gt;github.com/ogabasseyy/ekaette&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most customer service lines still rely on static recordings, long hold queues, and rigid call-centre routing. Customers spend significant time waiting to solve a simple request, and urgent needs are delayed behind generic queue systems that do not understand intent or priority.&lt;/p&gt;

&lt;p&gt;We wanted to build an assistant that replaces that experience entirely — one that understands intent in real time on a live call, responds immediately when the task is simple, and continues the journey across voice and messaging without losing context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq23ezmby37t59ou8dv0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq23ezmby37t59ou8dv0j.png" alt="Ekaette Architecture" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ekaette is a split real-time system running on Google Cloud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run (Main HTTP Service)&lt;/strong&gt; — Africa's Talking voice/SMS webhooks, WhatsApp webhooks, admin APIs, callback orchestration, text channel runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run (Live Voice Service)&lt;/strong&gt; — Dedicated long-lived WebSocket sessions for real-time voice streaming via the Gemini Live API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIP Bridge VM (GCE)&lt;/strong&gt; — Converts Africa's Talking RTP/G.711 audio to PCM 16kHz for Gemini, with echo suppression, noise reduction, and VAD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All channels converge on one agent graph built with &lt;strong&gt;Google ADK 1.26.0&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Used Google AI Models
&lt;/h2&gt;

&lt;p&gt;Ekaette uses 8 specialized Gemini models, each chosen for a specific role:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Live voice (all agents)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bidirectional streaming via &lt;code&gt;bidiGenerateContent&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text channels&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-2.5-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WhatsApp/SMS via &lt;code&gt;Runner.run_async()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text fallback&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-2.5-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Automatic fallback when primary is unavailable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision analysis&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-2.5-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Device grading and condition assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live media analysis&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-2.5-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cross-session media analysis during active calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-2.5-flash-tts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WhatsApp voice note replies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image generation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-3.1-flash-image-preview&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Product preview images sent on WhatsApp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image fallback&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-2.5-flash-image&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fallback for image generation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The voice and text pipelines are intentionally separate — text models don't support &lt;code&gt;bidiGenerateContent&lt;/code&gt;, and voice models don't need &lt;code&gt;Runner.run_async()&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Agent Architecture
&lt;/h2&gt;

&lt;p&gt;A root orchestrator delegates to 5 specialized sub-agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified from app/agents/ekaette_router/agent.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_ekaette_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ekaette_router&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;generate_content_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thinking_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;sub_agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nf"&gt;create_vision_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;create_valuation_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;create_booking_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;create_catalog_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;create_support_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;before_agent_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;before_agent_isolation_guard_and_dedup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;after_agent_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;telemetry_after_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;before_model_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;before_model_inject_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_tool_error_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_tool_error_emit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Two singletons — one per pipeline
&lt;/span&gt;&lt;span class="n"&gt;ekaette_router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_ekaette_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIVE_MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# voice
&lt;/span&gt;&lt;span class="n"&gt;text_router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_ekaette_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEXT_MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# WhatsApp/SMS
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Google Cloud Services
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vertex AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini Live API for real-time voice, Memory Bank for cross-session recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Split deployment — main HTTP + dedicated live voice service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Firestore&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Registry (templates, companies), session state, products, booking slots, knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Media uploads (photos, videos for trade-in analysis)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Tasks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Async WhatsApp message processing, silence nudges&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Hardest Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Native Audio Function Calling Regression
&lt;/h3&gt;

&lt;p&gt;The GA &lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt; model has significantly lower function-calling accuracy than the older preview model. It would hallucinate sub-agent names as direct function calls (&lt;code&gt;catalog_agent()&lt;/code&gt; instead of &lt;code&gt;transfer_to_agent(agent_name="catalog_agent")&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;We mitigated this with explicit agent &lt;code&gt;description=&lt;/code&gt; fields, negative instructions, and an &lt;code&gt;on_tool_error_callback&lt;/code&gt; that always returns a &lt;code&gt;dict&lt;/code&gt; — returning &lt;code&gt;None&lt;/code&gt; crashes the entire bidi stream (ADK Bug #4005).&lt;/p&gt;

&lt;h3&gt;
  
  
  Duplicate Agent Transfers (ADK Bug #3395)
&lt;/h3&gt;

&lt;p&gt;After multiple transfers + session resumption, the model can loop, repeatedly transferring to the same sub-agent. We built a dedup callback that fingerprints each transfer by agent name + content hash and suppresses duplicates within a 2-second cooldown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified from app/agents/dedup.py
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dedup_before_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;callback_context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;agent_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callback_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_name&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callback_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;agent_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ekaette_router&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# never suppress root
&lt;/span&gt;
    &lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sha1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;content_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;callback_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp:dedup_last_signature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;last_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp:dedup_last_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;last_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m already working on that.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;

    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp:dedup_last_signature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;signature&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp:dedup_last_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Voice Accent Inconsistency
&lt;/h3&gt;

&lt;p&gt;Without voice cloning (not yet available for Gemini native audio), the assistant's accent changed unpredictably between turns. IPA notation is ignored by the audio model. We solved this by pinning the voice to &lt;code&gt;Aoede&lt;/code&gt; and using phonetic spelling (&lt;code&gt;ehkaitay&lt;/code&gt;) in both the system instruction and greeting trigger.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Run Scaling for Telephony
&lt;/h3&gt;

&lt;p&gt;A single active voice call ties up a Cloud Run instance with a long-lived WebSocket. With &lt;code&gt;min-instances=1&lt;/code&gt;, Africa's Talking webhook callbacks got 429 errors because no instance was free. The error comes from Google Frontend — no application logs are emitted. We had to set &lt;code&gt;min-instances=2&lt;/code&gt; and split into separate services.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Gemini Live API is powerful but young.&lt;/strong&gt; Build assuming the model and SDK will surprise you. Invest in callbacks and guardrails early.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt engineering is not enough for production voice AI.&lt;/strong&gt; Critical workflow decisions must live in the runtime layer, not in prompts. LLMs are strongest when they control expression, not business-critical state transitions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Voice UX is unforgiving.&lt;/strong&gt; A 500ms silence gap feels like an eternity on a live call. We built voice fillers, non-blocking tool execution, and context compression (80k → 40k tokens) to keep conversations natural.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Split your Cloud Run services for telephony.&lt;/strong&gt; Long-lived WebSockets and short HTTP webhooks cannot share instances without starving each other.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Voice cloning when Google releases it for Gemini native audio&lt;/li&gt;
&lt;li&gt;Conversation analytics for quality scoring and conversion tracking&lt;/li&gt;
&lt;li&gt;Deeper industry-specific workflows beyond electronics&lt;/li&gt;
&lt;li&gt;Better memory and customer follow-up across longer time windows&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/ogabasseyy" rel="noopener noreferrer"&gt;Bassey&lt;/a&gt; at Baci Technologies Limited. 641 automated tests. Strict TDD. Real phone calls.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;#GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gemini</category>
      <category>ai</category>
      <category>hackathon</category>
    </item>
  </channel>
</rss>
