<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kumaraswamy Chavvakula</title>
    <description>The latest articles on Forem by Kumaraswamy Chavvakula (@kumaraswamy_chavvakula_4f).</description>
    <link>https://forem.com/kumaraswamy_chavvakula_4f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828074%2F6de28f23-2552-4242-b816-c14c4fbe9819.jpg</url>
      <title>Forem: Kumaraswamy Chavvakula</title>
      <link>https://forem.com/kumaraswamy_chavvakula_4f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kumaraswamy_chavvakula_4f"/>
    <language>en</language>
    <item>
      <title>I built a tool that catches misleading charts using Gemma 4 running locally</title>
      <dc:creator>Kumaraswamy Chavvakula</dc:creator>
      <pubDate>Mon, 25 May 2026 06:57:50 +0000</pubDate>
      <link>https://forem.com/kumaraswamy_chavvakula_4f/i-built-a-tool-that-catches-misleading-charts-using-gemma-4-running-locally-cao</link>
      <guid>https://forem.com/kumaraswamy_chavvakula_4f/i-built-a-tool-that-catches-misleading-charts-using-gemma-4-running-locally-cao</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;You know the charts that look dramatic but are actually showing a 3% change? The Y-axis that conveniently starts at 95 instead of 0. The 3D pie chart whose slices somehow add up to 108%. The stock line that’s “up 59.5%!” — over a five-month window hand-picked from a bad year.&lt;/p&gt;

&lt;p&gt;I see these constantly in news, earnings decks, and social posts, and it bugs me every time. So I built &lt;strong&gt;DataDetective&lt;/strong&gt;: drop in any chart image and Gemma 4 gives you a forensic breakdown covering which manipulation tricks are in play, a trust score from 0–100, what the chart &lt;em&gt;actually&lt;/em&gt; shows vs. what it wants you to think, and how to fix it.&lt;/p&gt;

&lt;p&gt;The whole thing runs &lt;strong&gt;locally through Ollama&lt;/strong&gt;. No API keys, no cloud, nothing leaves your machine — which matters when the thing you’re analyzing is an internal financial chart or a competitor’s deck.&lt;/p&gt;

&lt;p&gt;It ships with three intentionally-misleading sample charts (a truncated bar chart, a cherry-picked line, and that impossible 108% pie) so you can see it work in one click, plus drag-and-drop for your own images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/kumarsparkz/datadetective" rel="noopener noreferrer"&gt;github.com/kumarsparkz/datadetective&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xynazy5AOU4"&gt;
  &lt;/iframe&gt;
&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
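&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# start the Ollama server first (skip this if the Ollama app is already running);&lt;/span&gt;
&lt;span class="c"&gt;# it stays in the foreground, so run the remaining commands in a second terminal&lt;/span&gt;
ollama serve

&lt;span class="c"&gt;# grab Gemma 4 through Ollama (e4b runs comfortably on a laptop)&lt;/span&gt;
ollama pull gemma4:e4b      &lt;span class="c"&gt;# ~9.6 GB; or gemma4:26b if you have the RAM&lt;/span&gt;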

&lt;span class="c"&gt;# serve the app — it's just static files&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; http.server 8080
&lt;span class="c"&gt;# open http://localhost:8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Green dot = Ollama connected and a Gemma 4 model detected. Click a sample or upload a chart.&lt;/p&gt;
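&lt;p&gt;One way a status check like that green dot could work is to ask Ollama which models are installed through its &lt;code&gt;/api/tags&lt;/code&gt; endpoint. The sketch below is an assumption about the approach, not the app's actual code: &lt;code&gt;hasGemma4&lt;/code&gt; and &lt;code&gt;checkOllama&lt;/code&gt; are hypothetical names, and matching on the &lt;code&gt;gemma4&lt;/code&gt; name prefix is a guess at what "detected" means.&lt;/p&gt;

```javascript
// Pure helper: does a parsed /api/tags response list a Gemma 4 model?
// Expected shape: { models: [{ name: 'gemma4:e4b' }, ...] }.
function hasGemma4(tags) {
  const models = Array.isArray(tags?.models) ? tags.models : [];
  return models.some((m) => String(m.name || '').startsWith('gemma4'));
}

// Hypothetical wrapper: resolves to true when the green dot should show.
async function checkOllama(baseUrl = 'http://localhost:11434') {
  try {
    const res = await fetch(baseUrl + '/api/tags');
    return res.ok ? hasGemma4(await res.json()) : false;
  } catch {
    return false; // server not running, or the request was blocked
  }
}
```

&lt;p&gt;Keeping the model check in a pure function makes it easy to test without a running server.&lt;/p&gt;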

&lt;p&gt;Here’s what it actually returns (measured on &lt;code&gt;gemma4:e4b&lt;/code&gt;, not aspirational):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chart&lt;/th&gt;
&lt;th&gt;Trust score&lt;/th&gt;
&lt;th&gt;What Gemma 4 flagged&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;108% pie chart&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35 / 100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[high] Inconsistent totals (sum &amp;gt; 100%)&lt;/code&gt; — &lt;em&gt;“the parts sum to 108%, not 100%”&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cherry-picked stock line&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35 / 100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[high] Cherry-picked time range&lt;/code&gt; + promotional language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;An &lt;strong&gt;honest&lt;/strong&gt; bar chart (control)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95 / 100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;nothing — &lt;em&gt;“a highly effective and honest visualization”&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row is the one I’m proudest of. A tool that flags &lt;em&gt;everything&lt;/em&gt; is useless. The honest chart scoring 95 next to the pie scoring 35 is what makes this feel like it’s &lt;em&gt;reasoning&lt;/em&gt;, not pattern-matching for keywords.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why local, why Gemma 4
&lt;/h3&gt;

&lt;p&gt;Privacy is the real reason. If you’re analyzing your company’s revenue charts or a competitor’s investor deck, shipping those images to a cloud API feels wrong. Local means the data never leaves the machine. Gemma 4’s open weights make that possible, and it handles multimodal input natively: you POST to &lt;code&gt;localhost:11434/api/chat&lt;/code&gt; with the model and your messages, and attach an &lt;code&gt;images: [base64]&lt;/code&gt; array to the user message. No separate vision pipeline to wire up, no plumbing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The thing that actually made it work: let the model think first
&lt;/h3&gt;

&lt;p&gt;Here’s the part worth reading if you build on local models.&lt;/p&gt;

&lt;p&gt;My first version used Ollama’s &lt;code&gt;format: 'json'&lt;/code&gt; flag. It felt great — guaranteed parseable JSON, no regex-ing it out of markdown. But the analysis quality was quietly terrible on the subtle cases. I fed it the classic truncated-axis bar chart (Y-axis starting at $95M so a 5% rise looks enormous) and it returned a trust score of &lt;strong&gt;90–95 and didn’t flag the axis at all&lt;/strong&gt; — three times in a row. It would read the axis labels correctly and then conclude the chart “accurately represents the increase.”&lt;/p&gt;

&lt;p&gt;The problem wasn’t the prompt. It was that &lt;strong&gt;&lt;code&gt;format: 'json'&lt;/code&gt; forces the model to emit the JSON object immediately, with no room to reason first.&lt;/strong&gt; A small model like &lt;code&gt;e4b&lt;/code&gt; needs to &lt;em&gt;work through&lt;/em&gt; “the axis starts at 95, not 0, therefore the bars exaggerate a 5% change” in plain text. JSON mode amputates exactly that step.&lt;/p&gt;
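&lt;p&gt;To make that arithmetic concrete, here is a small sketch of the reasoning step the model has to perform. The values 100 and 105 (in $M) are assumed for illustration, and the function names are hypothetical; only the baseline of 95 and the 5% rise come from the example above.&lt;/p&gt;

```javascript
// How much taller does the second bar look than the first
// when the axis starts at `baseline` instead of 0?
function visualGrowth(first, second, baseline) {
  return (second - baseline) / (first - baseline) - 1;
}

// How much did the underlying value actually change?
function actualGrowth(first, second) {
  return second / first - 1;
}

// Assumed values in $M: a 5% rise drawn on an axis that starts at 95.
const actual = actualGrowth(100, 105);      // about 0.05
const drawn = visualGrowth(100, 105, 95);   // 1, so the bar doubles in height
const exaggeration = drawn / actual;        // about 20 times the real change
```

&lt;p&gt;Under those assumptions the second bar is twice as tall for a 5% increase, which is the gap the model needs room to spell out.&lt;/p&gt;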

&lt;p&gt;So I dropped &lt;code&gt;format: 'json'&lt;/code&gt; and restructured the prompt into an explicit procedure: &lt;em&gt;reason out loud through axis baseline, pie totals, time window, and language, then output the final answer inside a JSON code fence.&lt;/em&gt; Same model, same charts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;108% pie: &lt;strong&gt;caught the bad total as a high-severity flag, score dropped to 35.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Truncated axis: started naming the non-zero baseline instead of waving it through.&lt;/li&gt;
&lt;li&gt;Honest chart: still scored 95 — so the extra scrutiny didn’t make it paranoid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbmvfvfhbm7i9ejjhyik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbmvfvfhbm7i9ejjhyik.png" alt=" " width="800" height="858"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core call now looks like this — note the &lt;em&gt;absence&lt;/em&gt; of &lt;code&gt;format&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemma4:e4b&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;FORENSICS_PROMPT&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;// includes a step-by-step procedure&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Reason step by step, then return JSON...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;base64Data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;num_predict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;// low temp = consistent forensics&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I parse the JSON out of the fenced block — and keep the reasoning text &lt;em&gt;before&lt;/em&gt; it.&lt;/p&gt;
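&lt;p&gt;A minimal sketch of that parsing step, assuming the reply ends with a single fenced JSON block. The function name and the &lt;code&gt;trust_score&lt;/code&gt; field in the usage note are hypothetical, and the real app may handle malformed replies differently.&lt;/p&gt;

```javascript
// Three backticks, built this way so the example can sit inside a code fence.
const FENCE = '`'.repeat(3);

// Split a model reply into the free-text reasoning and the parsed JSON verdict.
// Returns { reasoning, result }; result is null when no valid fenced JSON is found.
function splitReasoningAndJson(reply) {
  const close = reply.lastIndexOf(FENCE);
  const open = close > 0 ? reply.lastIndexOf(FENCE, close - 1) : -1;
  if (open === -1) return { reasoning: reply.trim(), result: null };

  // Drop an optional language tag such as "json" right after the opening fence.
  const body = reply.slice(open + FENCE.length, close).replace(/^\s*json\b/i, '');
  try {
    return { reasoning: reply.slice(0, open).trim(), result: JSON.parse(body) };
  } catch {
    return { reasoning: reply.trim(), result: null };
  }
}
```

&lt;p&gt;For a reply such as reasoning text followed by a fenced &lt;code&gt;{"trust_score": 35}&lt;/code&gt;, this returns the reasoning as a string and the verdict as an object. Falling back to &lt;code&gt;result: null&lt;/code&gt; lets the UI show the raw reasoning even when the JSON is malformed.&lt;/p&gt;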

&lt;h3&gt;
  
  
  Bonus: the reasoning became a feature
&lt;/h3&gt;

&lt;p&gt;Because the model now thinks in plain text before answering, I had its actual forensic reasoning sitting right there. So I surface it in a collapsible &lt;strong&gt;“Gemma 4 Reasoning”&lt;/strong&gt; panel. You can watch it add up the pie slices and catch the 108% itself. That transparency — showing the work, not just a verdict — turned out to be the most compelling thing in the whole app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5n31raeefj8byvudrlfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5n31raeefj8byvudrlfe.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Being honest about the limits
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;e4b&lt;/code&gt; is the small variant, and it shows. It nails cherry-picking and impossible pie totals every time, but the truncated-axis case it catches maybe 3 runs out of 4, and as a &lt;em&gt;medium&lt;/em&gt; issue rather than a high one. &lt;code&gt;gemma4:26b&lt;/code&gt; (26B params, only ~3.8B active per pass thanks to the MoE design) handles it far more decisively — the architecture scales cleanly, you just trade RAM and a few seconds of latency. I built and tuned everything against &lt;code&gt;e4b&lt;/code&gt; specifically to prove the concept works on hardware people actually have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqt780v1fmxsexw4w9lf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqt780v1fmxsexw4w9lf.png" alt=" " width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The frontend
&lt;/h3&gt;

&lt;p&gt;Zero dependencies — HTML, CSS, vanilla JS. Dark glassmorphism theme, an animated SVG trust gauge (stroke-dasharray), staggered result cards, and system/light/dark themes. The three sample misleading charts are drawn with the Canvas API so there are no external image assets.&lt;/p&gt;
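&lt;p&gt;For readers unfamiliar with the &lt;code&gt;stroke-dasharray&lt;/code&gt; trick, here is a rough sketch of how a 0–100 score can be mapped onto a circular SVG gauge. The function name, the radius, and the element id are assumptions for illustration, not taken from the repo.&lt;/p&gt;

```javascript
// Map a 0-100 trust score to a stroke-dasharray value
// for an SVG circle of the given radius.
function gaugeDashArray(score, radius) {
  const circumference = 2 * Math.PI * radius;
  const clamped = Math.min(100, Math.max(0, score));
  const filled = (clamped / 100) * circumference;
  // "dash gap": the gap equals the full circumference, so the dash never repeats.
  return filled.toFixed(2) + ' ' + circumference.toFixed(2);
}

// In the browser (hypothetical element id):
// document.querySelector('#gauge-arc')
//   .setAttribute('stroke-dasharray', gaugeDashArray(35, 54));
```

&lt;p&gt;Animating the gauge then comes down to a CSS transition on &lt;code&gt;stroke-dasharray&lt;/code&gt;.&lt;/p&gt;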

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The headline lesson: &lt;strong&gt;on local models, JSON mode is a trap for any task that needs reasoning.&lt;/strong&gt; Convenience at the parsing layer cost me most of the model’s analytical capacity on the subtle cases. Letting Gemma 4 think out loud first, then parsing the JSON out of the tail, was the difference between a tool that rubber-stamps misleading charts and one that actually catches them. And it handed me a transparency feature for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Team
&lt;/h2&gt;

&lt;p&gt;Solo project — just me and an unreasonable number of misleading charts I’ve been annoyed by over the years.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building LinguaLive: A Real-Time AI Language Tutor with Gemini Live API</title>
      <dc:creator>Kumaraswamy Chavvakula</dc:creator>
      <pubDate>Mon, 16 Mar 2026 20:40:45 +0000</pubDate>
      <link>https://forem.com/kumaraswamy_chavvakula_4f/building-lingualive-a-real-time-ai-language-tutor-with-gemini-live-api-m3m</link>
      <guid>https://forem.com/kumaraswamy_chavvakula_4f/building-lingualive-a-real-time-ai-language-tutor-with-gemini-live-api-m3m</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was created for the purposes of entering the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; hackathon. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Language Learning Feels Disconnected
&lt;/h2&gt;

&lt;p&gt;We've all downloaded Duolingo, done the first week religiously, and then... stopped. Why? Because language learning apps are fundamentally disconnected from real life. You're matching words on a screen when what you actually need is someone patient sitting next to you, pointing at things, and helping you build vocabulary from your own world.&lt;/p&gt;

&lt;p&gt;Immersion is widely regarded as one of the most effective approaches to language acquisition — but it typically requires expensive human tutors or living abroad. What if AI could bring that immersive experience to everyone?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea: Point Your Camera, Learn a Language
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LinguaLive&lt;/strong&gt; is a real-time AI language tutor. Its tutor persona, Luna:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sees&lt;/strong&gt; through your camera and teaches you words for objects in your environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hears&lt;/strong&gt; your pronunciation and gives specific, actionable feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speaks&lt;/strong&gt; back with native-sounding voices via the Gemini Live API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates&lt;/strong&gt; custom visual flashcards using Imagen 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapts&lt;/strong&gt; to the learner's pace — the system prompt instructs Luna to simplify when the learner struggles and increase difficulty when they're doing well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No text boxes. No multiple choice. Just a natural conversation where you point your camera at your kitchen and Luna teaches you cooking vocabulary in Spanish.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;p&gt;Here's what powers LinguaLive:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.0 Flash Live API (bidirectional audio/video streaming)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google ADK (Agent Development Kit) for agent structure and tool registration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Live Streaming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google GenAI SDK (&lt;code&gt;client.aio.live.connect()&lt;/code&gt;) for real-time bidirectional streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Image Generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Imagen 3 on Vertex AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python 3.11, FastAPI, WebSocket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Firestore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Asset Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Run (auto-scaling, session affinity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Build&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vanilla HTML/JS, Web Audio API, MediaDevices API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How It Works: The Multimodal Loop
&lt;/h2&gt;

&lt;p&gt;The core of LinguaLive is a &lt;strong&gt;multimodal streaming loop&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Voice In&lt;/strong&gt; → User speaks in their target language (PCM 16kHz via Web Audio API)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Camera In&lt;/strong&gt; → Browser captures JPEG frames at ~1fps and sends to Gemini&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice Out&lt;/strong&gt; → Gemini responds with native audio (PCM 24kHz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Out&lt;/strong&gt; → Imagen 3 generates flashcard illustrations for key vocabulary on demand&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This happens over a single WebSocket connection. The browser captures audio via AudioWorklet (with a ScriptProcessor fallback) and camera frames via &lt;code&gt;getUserMedia()&lt;/code&gt;. The FastAPI backend bridges these to the Gemini Live API's bidirectional streaming endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Technical Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why WebSocket Instead of REST?
&lt;/h3&gt;

&lt;p&gt;The Gemini Live API uses &lt;code&gt;bidiGenerateContent&lt;/code&gt; — a bidirectional streaming endpoint. REST would add significant latency for real-time conversation. Our WebSocket carries audio chunks (~250ms each) and video frames interleaved, keeping the conversation feeling natural and responsive.&lt;/p&gt;

&lt;h3&gt;
  
  
  ADK + GenAI SDK: Why Both?
&lt;/h3&gt;

&lt;p&gt;We use &lt;strong&gt;ADK&lt;/strong&gt; to define the agent — Luna's persona, system instruction, and 7 registered tools. But for the actual Live API streaming, ADK's standard runner doesn't support real-time bidirectional audio/video, so we use the &lt;strong&gt;GenAI SDK's&lt;/strong&gt; &lt;code&gt;client.aio.live.connect()&lt;/code&gt; directly. This gives us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time PCM audio streaming in both directions&lt;/li&gt;
&lt;li&gt;Live video frame ingestion&lt;/li&gt;
&lt;li&gt;Function calling mid-stream&lt;/li&gt;
&lt;li&gt;Input and output audio transcription&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of the 7 registered tools, 5 are active (declared to the model):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;get_session_progress&lt;/code&gt; — returns real-time learning stats&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_vocabulary_quiz&lt;/code&gt; — generates adaptive quizzes from learned words&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;detect_scene&lt;/code&gt; — identifies environments for themed vocabulary lessons&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;identify_objects_in_view&lt;/code&gt; — processes camera object detection&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generate_flashcard_image&lt;/code&gt; — creates Imagen 3 visual flashcards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Vocabulary and pronunciation tracking happen automatically via output transcription to avoid interrupting the audio stream with tool calls.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Grounding to Reduce Hallucinations
&lt;/h3&gt;

&lt;p&gt;A language tutor that invents translations is worse than no tutor at all. We added explicit grounding rules to Luna's system prompt: only teach words she's confident about, only identify camera objects she can clearly see, and acknowledge uncertainty rather than guessing. This doesn't eliminate hallucination entirely, but it significantly reduces it in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Firestore for Returning Learners
&lt;/h3&gt;

&lt;p&gt;Session data persists to Cloud Firestore, enabling a "welcome back" experience. When a learner returns, Luna knows what words they learned last time and builds on that foundation rather than starting over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping the Audio Stream Smooth
&lt;/h2&gt;

&lt;p&gt;In a real-time voice app, anything that blocks the event loop causes audible stuttering. Two patterns were key to keeping audio smooth:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async-safe Firestore initialization.&lt;/strong&gt; Multiple WebSocket connections can arrive simultaneously at startup. Without protection, each could try to create a Firestore client at the same time. We used &lt;code&gt;asyncio.Lock()&lt;/code&gt; with a double-check pattern inside &lt;code&gt;_init_firestore()&lt;/code&gt; to ensure the client is created exactly once, without blocking the event loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background flashcard generation.&lt;/strong&gt; Imagen 3 takes 3–8 seconds to generate an image. If we awaited that inside the receive loop, audio would freeze. Instead, we respond to Gemini immediately with a &lt;code&gt;"pending"&lt;/code&gt; status and spin up the actual generation as a background task via &lt;code&gt;asyncio.create_task()&lt;/code&gt;. When the image is ready, it's pushed to the client over the WebSocket independently of the audio stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Part: Audio Reliability
&lt;/h2&gt;

&lt;p&gt;Getting real-time audio working reliably across browsers was the biggest challenge. Key issues we solved:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AudioWorklet vs ScriptProcessor&lt;/strong&gt; — AudioWorklet runs off the main thread for better performance. We use it as the primary approach with a ScriptProcessor fallback for broader compatibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample Rate Mismatch&lt;/strong&gt; — Requesting 16kHz from the browser doesn't guarantee it. We added runtime resampling in the AudioWorklet to ensure Gemini always receives 16kHz PCM regardless of the device's native sample rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Barge-in Handling&lt;/strong&gt; — When the user interrupts Luna mid-speech, we immediately stop audio playback, clear the queue, and let the new response stream through. We also suppress mic forwarding while the model is speaking to prevent speaker echo from causing false interruptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Receive Loop Re-entry&lt;/strong&gt; — We discovered that the Live API's &lt;code&gt;receive()&lt;/code&gt; generator completes after each model turn. The fix is to re-enter it in a &lt;code&gt;while True&lt;/code&gt; loop for multi-turn conversations.&lt;/li&gt;
&lt;/ol&gt;
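&lt;p&gt;The sample-rate fix above can be sketched as follows. This is a minimal illustration that assumes plain linear interpolation; the post does not say which resampling method the AudioWorklet uses, and both function names are hypothetical. Linear interpolation applies no low-pass filter, so it can alias when downsampling, which is usually tolerable for speech.&lt;/p&gt;

```javascript
// Linearly resample mono Float32 samples from inputRate to outputRate (default 16 kHz).
function resampleLinear(input, inputRate, outputRate = 16000) {
  if (inputRate === outputRate) return Float32Array.from(input);
  const outLength = Math.floor((input.length * outputRate) / inputRate);
  const step = inputRate / outputRate;
  return Float32Array.from({ length: outLength }, (_, i) => {
    const pos = i * step;                                // position in the input signal
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);  // stay inside the buffer
    const frac = pos - left;
    return input[left] * (1 - frac) + input[right] * frac;
  });
}

// Convert Float32 samples in [-1, 1] to 16-bit signed PCM.
function floatTo16BitPcm(samples) {
  return Int16Array.from(samples, (s) => {
    const clamped = Math.max(-1, Math.min(1, s));
    return Math.round(clamped * 32767);
  });
}
```

&lt;p&gt;Inside a worklet, each input block would go through &lt;code&gt;resampleLinear(block, sampleRate)&lt;/code&gt; and then &lt;code&gt;floatTo16BitPcm&lt;/code&gt; before being posted to the main thread.&lt;/p&gt;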

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Gemini Live API is remarkably capable&lt;/strong&gt; — bidirectional audio + video + function calling in a single streaming session opens up experiences that weren't possible before.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounding matters more for educational AI&lt;/strong&gt; — users trust a tutor implicitly. Teaching a wrong translation erodes that trust fast, so explicit anti-hallucination prompting is essential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Imagen 3 adds a visual dimension&lt;/strong&gt; — generated flashcard illustrations make vocabulary tangible and give learners something to revisit later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run with session affinity&lt;/strong&gt; works well for WebSocket-based apps. An open WebSocket stays on the instance that accepted it, and session affinity steers a reconnecting client back to that same instance on a best-effort basis. One thing to watch: the in-memory session cache relies on that stickiness, so if a client lands on a different instance (or you scale out without affinity), you'd need to handle cache coherence with Firestore.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Live API's &lt;code&gt;receive()&lt;/code&gt; generator ending per turn&lt;/strong&gt; was the most subtle bug — it looked like sessions were dropping after one exchange until we figured out the re-entry pattern.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The code is open source: &lt;a href="https://github.com/kumarsparkz/lingualive" rel="noopener noreferrer"&gt;github.com/kumarsparkz/lingualive&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kumarsparkz/lingualive.git
&lt;span class="nb"&gt;cd &lt;/span&gt;lingualive
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
gcloud auth application-default login
python &lt;span class="nt"&gt;-m&lt;/span&gt; app.main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or deploy to Cloud Run with the automated script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GCP_PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-project-id
./deploy.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Built for the Gemini Live Agent Challenge 2026. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
      <category>googlecloud</category>
      <category>ai</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
