<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: thilak15</title>
    <description>The latest articles on Forem by thilak15 (@thilak15).</description>
    <link>https://forem.com/thilak15</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2114840%2F0f0665c7-bd24-45a8-b1d1-b56c1ca6c852.jpeg</url>
      <title>Forem: thilak15</title>
      <link>https://forem.com/thilak15</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thilak15"/>
    <language>en</language>
    <item>
      <title>Robotics Reinvention: Travis Kalanick's Atoms Targets Industrial Automation</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Sat, 14 Mar 2026 01:33:01 +0000</pubDate>
      <link>https://forem.com/thilak15/robotics-reinvention-travis-kalanicks-atoms-targets-industrial-automation-47ba</link>
      <guid>https://forem.com/thilak15/robotics-reinvention-travis-kalanicks-atoms-targets-industrial-automation-47ba</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Uber founder Travis Kalanick has launched &lt;strong&gt;Atoms&lt;/strong&gt;, a new robotics venture absorbing &lt;strong&gt;CloudKitchens&lt;/strong&gt;. Atoms aims to revolutionize &lt;strong&gt;industrial automation&lt;/strong&gt; in &lt;strong&gt;mining&lt;/strong&gt; and &lt;strong&gt;transport&lt;/strong&gt;, leveraging Kalanick's capital-raising prowess. While technical differentiators are yet to be revealed, this high-risk, high-reward bet could be a major disruptor in a crowded robotics market.&lt;/p&gt;

&lt;p&gt;Travis Kalanick, the controversial but undeniably impactful co-founder of &lt;strong&gt;Uber&lt;/strong&gt; and the visionary behind &lt;strong&gt;CloudKitchens&lt;/strong&gt;, is once again making headlines, this time with a bold new foray into the rapidly expanding world of robotics. His latest venture, &lt;strong&gt;Atoms&lt;/strong&gt;, represents a significant move, absorbing the existing infrastructure and talent of CloudKitchens to pivot towards more ambitious, capital-intensive domains like &lt;strong&gt;mining&lt;/strong&gt; and &lt;strong&gt;transport&lt;/strong&gt;. This pivot by a proven entrepreneur with a track record of disrupting massive industries signals a potent new force in the robotics landscape, promising to accelerate innovation and challenge established players in sectors ripe for automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Market Context
&lt;/h2&gt;

&lt;p&gt;The problem space &lt;strong&gt;Atoms&lt;/strong&gt; aims to address is vast and multifaceted: the &lt;strong&gt;automation&lt;/strong&gt; of hazardous, repetitive, or logistically complex tasks across heavy industries. Sectors like &lt;strong&gt;mining&lt;/strong&gt; and &lt;strong&gt;transport&lt;/strong&gt; are characterized by high operational costs, labor shortages, safety concerns, and often inefficient manual processes. Robotics offers a compelling solution, promising increased safety by removing humans from dangerous environments, enhanced efficiency through continuous operation, and significant cost reductions over time.&lt;/p&gt;

&lt;p&gt;Industry reports project substantial growth in these areas. The global &lt;strong&gt;mining automation&lt;/strong&gt; market, for instance, is anticipated to reach tens of billions of dollars within the next decade, driven by demand for &lt;strong&gt;autonomous haulage&lt;/strong&gt;, drilling, and inspection systems. Similarly, the &lt;strong&gt;autonomous transport&lt;/strong&gt; market, encompassing everything from long-haul trucking to last-mile delivery and specialized industrial vehicles, is projected to see exponential growth, with forecasts often placing its value in the hundreds of billions.&lt;/p&gt;

&lt;p&gt;This problem is solvable and investable now due to several converging factors. Advances in &lt;strong&gt;artificial intelligence&lt;/strong&gt;, particularly in &lt;strong&gt;machine learning&lt;/strong&gt;, &lt;strong&gt;computer vision&lt;/strong&gt;, and &lt;strong&gt;reinforcement learning&lt;/strong&gt;, have made robots more capable of perceiving, understanding, and navigating complex, unstructured environments. Simultaneously, improvements in &lt;strong&gt;sensor technology&lt;/strong&gt; (LiDAR, radar, high-resolution cameras), &lt;strong&gt;processing power&lt;/strong&gt; (edge AI chips), and battery efficiency have made robust, real-world deployments more feasible and cost-effective. Furthermore, the increasing availability of sophisticated open-source robotics frameworks like ROS (&lt;strong&gt;Robot Operating System&lt;/strong&gt;) lowers the barrier to entry for development, while a growing talent pool in AI and robotics fuels innovation.&lt;/p&gt;

&lt;p&gt;The landscape is currently populated by a mix of incumbents and challengers. In mining, traditional heavy equipment manufacturers like &lt;strong&gt;Caterpillar&lt;/strong&gt;, &lt;strong&gt;Komatsu&lt;/strong&gt;, and &lt;strong&gt;Epiroc&lt;/strong&gt; have their own automation divisions, offering &lt;strong&gt;autonomous haulage&lt;/strong&gt; and drilling solutions. In transport, autonomous driving companies like &lt;strong&gt;Waymo&lt;/strong&gt;, &lt;strong&gt;Cruise&lt;/strong&gt;, and specialized trucking firms such as &lt;strong&gt;TuSimple&lt;/strong&gt; (though facing challenges) are pushing innovation.&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>ai</category>
      <category>startup</category>
      <category>automation</category>
    </item>
    <item>
      <title>Brew: I Built a Real-Time Voice AI Drive-Thru Barista with Gemini Live API and Google ADK</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Fri, 13 Mar 2026 23:36:42 +0000</pubDate>
      <link>https://forem.com/thilak15/brew-i-built-a-real-time-voice-ai-drive-thru-barista-with-gemini-live-api-and-google-adk-4di5</link>
      <guid>https://forem.com/thilak15/brew-i-built-a-real-time-voice-ai-drive-thru-barista-with-gemini-live-api-and-google-adk-4di5</guid>
      <description>

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;Brew&lt;/strong&gt; — a real-time, voice-first AI ordering system for coffee shop drive-thrus. Customers talk to an AI barista through their microphone, and it takes their order through natural conversation. No buttons, no typing, just speech. The AI listens, understands complex orders with modifiers, handles interruptions, and updates a live on-screen menu and receipt as the conversation flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/thilak15/Brew" rel="noopener noreferrer"&gt;github.com/thilak15/Brew&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem I Wanted to Solve
&lt;/h2&gt;

&lt;p&gt;Traditional drive-thru ordering is broken. Long wait times, order inaccuracies, and staffing challenges plague the industry. Human operators handle one car at a time, miscommunication leads to wrong orders, and during peak hours, lines stretch around the block.&lt;/p&gt;

&lt;p&gt;I wanted to see if a live voice AI agent could do this better — not a chatbot with text-to-speech bolted on, but a genuinely conversational agent that handles the full complexity of real ordering: sizes, modifiers, corrections, interruptions, and multi-item requests.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Brew Does
&lt;/h2&gt;

&lt;p&gt;Brew replaces the human operator at a drive-thru speaker box with an AI barista. Here's what it handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Natural speech understanding&lt;/strong&gt; — "Can I get a grande iced latte with oat milk and an extra shot?" works exactly as you'd expect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interruptions (barge-in)&lt;/strong&gt; — Change your mind mid-sentence. The AI stops speaking and listens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time UI updates&lt;/strong&gt; — The menu highlights relevant categories and the receipt builds live as items are confirmed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex order management&lt;/strong&gt; — Modifiers (syrups, milk swaps, toppings, ice levels, warming), undo, batch operations, and running totals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual support&lt;/strong&gt; — Speak in Spanish, Hindi, or any language Gemini understands, and the agent mirrors your language automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session persistence&lt;/strong&gt; — Cart state survives Cloud Run instance restarts via Firestore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The menu has 22 items across 3 categories (Drinks, Breakfast, Desserts) with a full modifier system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tech Stack — All Google AI and Cloud
&lt;/h2&gt;

&lt;p&gt;This project is built end-to-end on Google's AI and Cloud platform. Here's every piece:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash Native Audio&lt;/td&gt;
&lt;td&gt;Real-time voice conversation with function calling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Agent Development Kit (ADK)&lt;/td&gt;
&lt;td&gt;Agent orchestration, tool management, live streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python 3.11, FastAPI&lt;/td&gt;
&lt;td&gt;WebSocket server, session management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js 14, React 18, TypeScript&lt;/td&gt;
&lt;td&gt;Dynamic UI with real-time state updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web Audio API (AudioWorklet)&lt;/td&gt;
&lt;td&gt;Low-latency audio capture and playback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transport&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;WebSockets&lt;/td&gt;
&lt;td&gt;Bidirectional PCM audio + JSON state streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud Firestore&lt;/td&gt;
&lt;td&gt;Cart state across instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud Run&lt;/td&gt;
&lt;td&gt;Serverless containers for backend and frontend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Container Registry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Artifact Registry&lt;/td&gt;
&lt;td&gt;Docker image storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub Actions + Workload Identity Federation&lt;/td&gt;
&lt;td&gt;Keyless automated deployment to GCP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How I Built It — Architecture Deep Dive
&lt;/h2&gt;

&lt;p&gt;The system has four layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Browser (Customer Device)
&lt;/h3&gt;

&lt;p&gt;The Next.js frontend captures microphone audio via the Web Audio API using an &lt;code&gt;AudioWorklet&lt;/code&gt; processor. Raw PCM audio at 16kHz streams to the backend over a WebSocket. The frontend receives two things back: audio response bytes (played through another AudioWorklet) and JSON state updates that drive the UI.&lt;/p&gt;

&lt;p&gt;Three main components power the interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SmartMenu&lt;/strong&gt; — A dynamic tabbed menu that auto-switches categories (ordering a "Cake Pop" flips the view to Desserts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveReceipt&lt;/strong&gt; — A real-time order panel showing items, modifiers, and a running price total&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AudioVisualizer&lt;/strong&gt; — Visual feedback during the conversation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Backend Server (Cloud Run)
&lt;/h3&gt;

&lt;p&gt;A Python/FastAPI WebSocket server running on Cloud Run. It manages the bidirectional audio stream between the browser and Gemini. The key responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hosts the ADK &lt;code&gt;Runner&lt;/code&gt; that orchestrates the agent lifecycle&lt;/li&gt;
&lt;li&gt;Implements a &lt;strong&gt;tool gate mechanism&lt;/strong&gt; — blocks user audio while the AI executes tool calls, preventing race conditions where the model hears its own confirmations&lt;/li&gt;
&lt;li&gt;Handles upstream (browser → Gemini) and downstream (Gemini → browser) as concurrent async tasks (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Proactively reconnects sessions before the 10-minute Live API hard limit&lt;/li&gt;
&lt;/ul&gt;
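
&lt;p&gt;Roughly, the bridge looks like this. This is a simplified sketch, not the exact Brew source: &lt;code&gt;live_request_queue&lt;/code&gt;, &lt;code&gt;live_events&lt;/code&gt;, and the event fields are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simplified sketch (placeholder names, not the actual Brew source):
# two concurrent loops bridging the browser WebSocket and the ADK stream.
import asyncio

async def upstream(websocket, live_request_queue, gate):
    # Browser → Gemini: forward PCM frames unless the tool gate is closed.
    while True:
        frame = await websocket.receive_bytes()
        if gate.is_set():  # gate open: safe to forward user audio
            live_request_queue.send_realtime(frame)

async def downstream(websocket, live_events):
    # Gemini → browser: relay audio bytes and JSON state updates.
    async for event in live_events:
        if event.audio:
            await websocket.send_bytes(event.audio)
        if event.state_update:
            await websocket.send_json(event.state_update)

async def bridge(websocket, live_request_queue, live_events, gate):
    # Run both directions concurrently; if either side closes, both stop.
    await asyncio.gather(
        upstream(websocket, live_request_queue, gate),
        downstream(websocket, live_events),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;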

&lt;h3&gt;
  
  
  3. Agent Layer (Google ADK)
&lt;/h3&gt;

&lt;p&gt;The agent is defined using Google's Agent Development Kit with &lt;strong&gt;14 tools&lt;/strong&gt; for order management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brew_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-native-audio-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Drive-thru barista that takes beverage orders with modifiers.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_system_prompt&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;add_item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;remove_item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remove_items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;add_modifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_modifiers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;remove_modifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set_modifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;set_ice_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;undo_last_change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;clear_order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set_menu_view&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;get_order_summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ADK's &lt;code&gt;run_live()&lt;/code&gt; method establishes a persistent bidirectional stream with the Gemini Live API. Tools are plain Python functions with detailed docstrings that the model uses for function calling.&lt;/p&gt;
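
&lt;p&gt;For illustration, a tool can be as small as the sketch below. This is hypothetical code, not the repo's actual signatures; the &lt;code&gt;order_state&lt;/code&gt; object is a stand-in for Brew's real order store. ADK reads the docstring to describe the tool to Gemini, so the docstring doubles as the function-calling schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative ADK tool sketch; names and fields are assumptions.
def add_item(name: str, size: str = "medium", quantity: int = 1) -&gt; dict:
    """Add a menu item to the current order.

    Args:
        name: Exact menu item name, e.g. "Iced Latte".
        size: One of "small", "medium", "large".
        quantity: Number of items to add (default 1).

    Returns:
        The updated order state, including the new item's id.
    """
    item = order_state.add(name=name, size=size, quantity=quantity)
    return {"status": "ok", "item_id": item.id, "total": order_state.total}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;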

&lt;h3&gt;
  
  
  4. AI Model (Gemini Live API)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;gemini-2.5-flash-native-audio-preview-12-2025&lt;/code&gt; model handles everything in a single streaming session: receives raw audio, processes speech, decides when to call tools, and generates spoken responses. The system prompt injects the full menu (items, prices, sizes, modifiers) so the model is grounded in real data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer speaks into mic
  → Browser captures PCM audio via AudioWorklet
  → WebSocket sends binary audio frames to backend
  → Backend forwards audio to Gemini via ADK run_live()
  → Gemini processes speech, decides to call tools or respond
  → If tool call: ADK executes tool → updates OrderState → syncs to Firestore
  → Gemini generates audio response
  → Backend streams audio bytes back over WebSocket
  → Browser plays audio via AudioWorklet
  → Backend sends JSON order state updates
  → Frontend re-renders SmartMenu + LiveReceipt in real time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Google Cloud Services in Detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gemini Live API (via Google GenAI SDK)
&lt;/h3&gt;

&lt;p&gt;This is the core of Brew. The Live API provides native audio streaming — the model receives raw audio and produces audio responses directly, without separate speech-to-text or text-to-speech steps. Combined with function calling, this means the model can hear "add oat milk to both drinks," call the &lt;code&gt;add_modifiers&lt;/code&gt; batch tool, and speak a confirmation — all in one streaming session with sub-second latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Agent Development Kit (ADK)
&lt;/h3&gt;

&lt;p&gt;ADK handles the agent lifecycle. The &lt;code&gt;run_live()&lt;/code&gt; method manages the persistent WebSocket connection to Gemini, routes tool calls to my Python functions, and handles the back-and-forth of a multi-turn conversation. I defined 14 tools with detailed docstrings, and ADK + Gemini handle the rest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Run
&lt;/h3&gt;

&lt;p&gt;Both the backend (FastAPI) and frontend (Next.js) are deployed as separate Cloud Run services. Session affinity is critical for WebSocket connections — without it, requests hit different instances that don't have the session state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Firestore
&lt;/h3&gt;

&lt;p&gt;Cart state is persisted to Firestore after every order change. This means if Cloud Run scales horizontally or an instance restarts, the customer's order survives. I built a custom &lt;code&gt;FirestoreSessionService&lt;/code&gt; that wraps ADK's &lt;code&gt;InMemorySessionService&lt;/code&gt; with Firestore persistence.&lt;/p&gt;
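
&lt;p&gt;Loosely, the wrapper looks like this. The method names on the session service are simplified here; the real class in the repo may differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough sketch of the wrapper idea (simplified method names, not the
# actual ADK interface): in-memory sessions backed by Firestore.
from google.cloud import firestore

class FirestoreSessionService:
    def __init__(self, inner, collection="sessions"):
        self.inner = inner              # ADK InMemorySessionService
        self.col = firestore.Client().collection(collection)

    async def get_session(self, session_id):
        session = await self.inner.get_session(session_id)
        if session is None:
            # Cold instance: rehydrate cart state from Firestore.
            doc = self.col.document(session_id).get()
            if doc.exists:
                session = await self.inner.create_session(
                    session_id, state=doc.to_dict()
                )
        return session

    async def save_session(self, session_id, state: dict):
        await self.inner.save_session(session_id, state)
        # Persist after every change so restarts don't lose the order.
        self.col.document(session_id).set(state)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;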

&lt;h3&gt;
  
  
  Artifact Registry + Workload Identity Federation
&lt;/h3&gt;

&lt;p&gt;Docker images are stored in Artifact Registry. CI/CD uses Workload Identity Federation for keyless authentication from GitHub Actions to GCP — no service account keys stored anywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hard Problems I Solved
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Tool Gate Problem
&lt;/h3&gt;

&lt;p&gt;Without intervention, the model would hear its own tool-call confirmations as user input, creating infinite loops. I implemented a tool gate that blocks user audio forwarding while the AI is executing tools. This was the single most impactful fix for reliability.&lt;/p&gt;
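
&lt;p&gt;Conceptually, the gate is just a flag the audio forwarder checks. A minimal sketch with &lt;code&gt;asyncio.Event&lt;/code&gt; (illustrative, not the exact implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tool-gate sketch: drop mic audio while the model is executing tools,
# so it never hears its own spoken confirmations as user input.
import asyncio

tool_gate = asyncio.Event()
tool_gate.set()  # open by default: user audio flows to the model

def on_tool_call_start():
    tool_gate.clear()   # close the gate while tools run

def on_tool_call_end():
    tool_gate.set()     # reopen once the tool response has been sent

async def forward_audio(frame, live_request_queue):
    if tool_gate.is_set():
        live_request_queue.send_realtime(frame)
    # else: silently drop the frame while tools are running
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;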

&lt;h3&gt;
  
  
  2. The 10-Minute Session Limit
&lt;/h3&gt;

&lt;p&gt;The Gemini Live API has a hard 10-minute session limit. Brew proactively reconnects at 8 minutes, injecting the current order context into the new session so the AI seamlessly continues the conversation without re-greeting the customer.&lt;/p&gt;
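
&lt;p&gt;In outline (the function and object names here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Proactive reconnect sketch: rotate the live session before the
# 10-minute hard limit, carrying the order context forward.
import asyncio

SESSION_BUDGET_S = 8 * 60   # reconnect at 8 minutes, under the limit

async def session_loop(start_session, order_state):
    while True:
        session = await start_session(
            # Inject the current order so the new session resumes
            # mid-conversation instead of re-greeting the customer.
            context=order_state.summary()
        )
        try:
            await asyncio.wait_for(session.run(), timeout=SESSION_BUDGET_S)
        except asyncio.TimeoutError:
            await session.close()   # budget reached: rotate the session
            continue
        break                       # conversation ended normally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;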

&lt;h3&gt;
  
  
  3. Model Hallucinating Tool Arguments
&lt;/h3&gt;

&lt;p&gt;Native audio models sometimes hallucinate tool arguments — inventing item IDs that don't exist. I switched from UUIDs to sequential integer IDs (&lt;code&gt;item_1&lt;/code&gt;, &lt;code&gt;item_2&lt;/code&gt;, ...) which dramatically reduced hallucination. I also added &lt;code&gt;_resolve_item_id()&lt;/code&gt; that handles numeric shorthand and positional references.&lt;/p&gt;
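
&lt;p&gt;A rough sketch of the resolver idea follows; the real &lt;code&gt;_resolve_item_id()&lt;/code&gt; in the repo may handle more cases.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch only: map model-provided references onto real item IDs.
def _resolve_item_id(raw, order_items):
    """Accepts "item_3", bare "3", or positional words like "last"."""
    if raw in order_items:                  # exact id, e.g. "item_3"
        return raw
    if str(raw).isdigit():                  # numeric shorthand: "3"
        candidate = f"item_{raw}"
        if candidate in order_items:
            return candidate
    if raw in ("last", "latest") and order_items:
        return list(order_items)[-1]        # positional reference
    raise KeyError(f"Unknown item reference: {raw!r}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;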

&lt;h3&gt;
  
  
  4. Batch Operations for Latency
&lt;/h3&gt;

&lt;p&gt;Without batch tools, the model makes sequential tool calls with separate confirmations for each item in a multi-item order. I added &lt;code&gt;add_items&lt;/code&gt;, &lt;code&gt;remove_items&lt;/code&gt;, and &lt;code&gt;add_modifiers&lt;/code&gt; batch tools that handle everything in a single call, cutting latency significantly.&lt;/p&gt;
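
&lt;p&gt;The batch shape is simple (illustrative sketch, reusing the hypothetical &lt;code&gt;order_state&lt;/code&gt; from above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Batch tool sketch: one call and one spoken confirmation,
# instead of N sequential add_item calls.
def add_items(items: list[dict]) -&gt; dict:
    """Add several menu items in one call.

    Args:
        items: List of {"name": ..., "size": ..., "quantity": ...} dicts.
    """
    added = [order_state.add(**item) for item in items]
    return {"status": "ok", "added": len(added), "total": order_state.total}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;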

&lt;h3&gt;
  
  
  5. Idempotency Guards
&lt;/h3&gt;

&lt;p&gt;The model sometimes retries tool calls during transient errors. Without idempotency guards, this would add duplicate modifiers. Every &lt;code&gt;add_modifier&lt;/code&gt; call checks for existing identical modifiers before applying.&lt;/p&gt;
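
&lt;p&gt;The guard itself is a few lines (sketch; field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Idempotency sketch: a retried tool call can't double-apply "oat milk".
def add_modifier(item_id: str, modifier: str) -&gt; dict:
    item = order_state.items[item_id]
    if modifier in item.modifiers:
        return {"status": "ok", "note": "modifier already applied"}
    item.modifiers.append(modifier)
    return {"status": "ok", "modifiers": item.modifiers}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;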




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool docstrings are the primary interface.&lt;/strong&gt; Clear, specific docstrings with examples produce dramatically better tool-calling accuracy than vague descriptions. I iterated on these more than any other part of the codebase.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AudioWorklet is non-negotiable.&lt;/strong&gt; The deprecated &lt;code&gt;ScriptProcessorNode&lt;/code&gt; introduces unpredictable latency. AudioWorklet provides consistent low-latency audio processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session affinity on Cloud Run is essential&lt;/strong&gt; for WebSocket connections. Without it, subsequent requests hit different instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native audio models behave differently than text models.&lt;/strong&gt; They're more prone to hallucinating tool arguments, more sensitive to background noise, and need explicit instructions about when NOT to respond (e.g., to background noise or their own echoes).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Firestore for session persistence is a perfect fit&lt;/strong&gt; for serverless deployments. The read/write latency is low enough that it doesn't impact the real-time experience.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;p&gt;Brew is fully open source. You can run it locally with Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/thilak15/Brew.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Brew
&lt;span class="nb"&gt;cp &lt;/span&gt;backend/.env.example backend/.env
&lt;span class="c"&gt;# Add your GOOGLE_API_KEY to backend/.env&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;code&gt;http://localhost:3000&lt;/code&gt; in Chrome, click "Drive Up," allow mic access, and start ordering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/thilak15/Brew" rel="noopener noreferrer"&gt;github.com/thilak15/Brew&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Menu-agnostic deployment.&lt;/strong&gt; The menu loads from a JSON file. Swap it out, and Brew becomes a taco shop, a pizza place, or a pharmacy pickup counter. The next step is a pipeline that takes any restaurant's menu and auto-generates a ready-to-deploy voice ordering agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilingual real-time language switching.&lt;/strong&gt; Gemini's native audio model already understands multiple languages. The goal is automatic language detection mid-conversation — if a customer starts in English and switches to Spanish, the agent follows without any button press.&lt;/p&gt;

&lt;p&gt;The hard part was proving that a live voice agent can handle complex, modifier-heavy ordering with interruptions, corrections, and batch operations — correctly and reliably. That's done. Now it's about making it work for anyone, in any language.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project was created for the &lt;a href="https://googleai.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; hackathon. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/thilak15" rel="noopener noreferrer"&gt;Thilak Daggula&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gemini</category>
      <category>ai</category>
      <category>hackathon</category>
    </item>
    <item>
      <title>Challenging Dogma: Simple Fine-Tuning Enables Continual Learning in VLA Models</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Fri, 13 Mar 2026 18:39:55 +0000</pubDate>
      <link>https://forem.com/thilak15/challenging-dogma-simple-fine-tuning-enables-continual-learning-in-vla-models-1mjj</link>
      <guid>https://forem.com/thilak15/challenging-dogma-simple-fine-tuning-enables-continual-learning-in-vla-models-1mjj</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Simple Sequential Fine-Tuning (Seq. FT) works surprisingly well for Continual Reinforcement Learning (CRL) in large pretrained Vision-Language-Action (VLA) models.&lt;/strong&gt; The paper "Simple Recipe Works" challenges the long-held assumption that complex strategies are always necessary to prevent catastrophic forgetting.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Large pretrained VLAs appear to be natural continual learners.&lt;/strong&gt; Their inherent capabilities, likely stemming from extensive pretraining on diverse data, make them more resilient to forgetting than previously thought when adapting to new tasks.&lt;/li&gt;
&lt;li&gt;  The research systematically evaluated this "simple recipe" across three distinct VLA models and five varied continual learning scenarios, consistently demonstrating its efficacy.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;This discovery simplifies the path toward developing robust, self-improving embodied AI agents.&lt;/strong&gt; Engineers can potentially forgo complex CRL algorithms, focusing instead on foundational VLA pretraining and task design for agents operating in dynamic, open-ended environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Problem: The Persistent Challenge of Catastrophic Forgetting
&lt;/h2&gt;

&lt;p&gt;Embodied AI agents, such as robots or virtual assistants, need to operate effectively in dynamic, open-ended environments. This requires them to continually learn new skills and adapt to novel situations without forgetting previously acquired knowledge. This challenge is known as &lt;strong&gt;Continual Reinforcement Learning (CRL)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core hurdle in CRL is &lt;strong&gt;catastrophic forgetting&lt;/strong&gt;. When an AI model is trained sequentially on a series of tasks, fine-tuning on a new task often causes it to "forget" how to perform older tasks. For example, a robot learning to pick up a new object might suddenly lose its ability to grasp a previously mastered object. This phenomenon has plagued deep learning models, especially in reinforcement learning settings where data distributions change drastically between tasks.&lt;/p&gt;

&lt;p&gt;Historically, addressing catastrophic forgetting has led to the development of highly sophisticated and often complex CRL strategies. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Regularization-based methods:&lt;/strong&gt; Adding penalty terms to the loss function to protect important parameters learned from previous tasks (e.g., Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI)).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Rehearsal/Memory-based methods:&lt;/strong&gt; Storing a small subset of data or experiences from previous tasks and replaying them during training on new tasks (e.g., Experience Replay, Generative Replay).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Architectural methods:&lt;/strong&gt; Dynamically expanding the model's capacity or creating task-specific sub-networks (e.g., Progressive Neural Networks, PackNet).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Knowledge Distillation:&lt;/strong&gt; Using the old model's outputs as "soft targets" to guide the new model's learning on previous tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these methods have shown promise, they introduce significant complexity. They often require careful hyperparameter tuning, increase computational overhead, and can hinder the development of truly adaptive AI systems. This paper, however, presents a compelling argument that for a specific class of models—large, pre-trained &lt;strong&gt;Vision-Language-Action (VLA) models&lt;/strong&gt;—a much simpler approach might be all that's needed.&lt;/p&gt;
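
&lt;p&gt;To make the contrast concrete, the "simple recipe" amounts to the loop below: train on each task in sequence with no replay buffer, regularizer, or task-specific heads. This is a minimal sketch with a placeholder model and data loaders, not the paper's code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sequential fine-tuning (Seq. FT) sketch; model, tasks, and
# loss are placeholders, not the paper's implementation.
import torch

def sequential_fine_tune(model, tasks, epochs=1, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for task_loader in tasks:       # tasks arrive one after another
        for _ in range(epochs):
            for batch in task_loader:
                loss = model(**batch).loss   # standard fine-tuning loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
    return model                    # no EWC penalty, no replay, no new heads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;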


</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>continuallearning</category>
      <category>reinforcementlearning</category>
    </item>
    <item>
      <title>PACED: Unlock Faster, More Affordable LLM Training with Smart Distillation</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Fri, 13 Mar 2026 18:17:16 +0000</pubDate>
      <link>https://forem.com/thilak15/paced-unlock-faster-more-affordable-llm-training-with-smart-distillation-1pk4</link>
      <guid>https://forem.com/thilak15/paced-unlock-faster-more-affordable-llm-training-with-smart-distillation-1pk4</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Targeted Distillation:&lt;/strong&gt; PACED is a novel framework for LLM distillation that focuses training on the 'zone of proximal development' (ZPD) for student models, avoiding computational waste.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Theoretical Basis:&lt;/strong&gt; It's grounded in the observation that gradient signal-to-noise ratio (SNR) vanishes when problems are either too easy (student has mastered) or too hard (beyond current competence).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Computational Efficiency:&lt;/strong&gt; By concentrating compute on the ZPD, PACED promises significant gains in training efficiency, accelerating LLM development and reducing resource consumption.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Improved Learning:&lt;/strong&gt; This focused approach aims to not only make distillation faster but also more effective, preventing the erosion of existing knowledge and fostering better student model quality.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Large Language Models (LLMs) have transformed AI, but their immense size makes deployment expensive and slow. This is where &lt;strong&gt;knowledge distillation&lt;/strong&gt; becomes vital: transferring a large "teacher" model's knowledge to a smaller, more efficient "student" model.&lt;/p&gt;

&lt;p&gt;However, standard LLM distillation methods often suffer from a critical flaw: &lt;strong&gt;computational waste&lt;/strong&gt;. Imagine trying to teach someone by constantly reviewing what they already know or presenting concepts far beyond their grasp. This is precisely what happens in traditional LLM distillation, leading to inefficient training and inflated costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem in Detail:&lt;/strong&gt;&lt;br&gt;
Student models are typically exposed to a uniform curriculum. This means valuable compute cycles are squandered on tasks they've either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Already Mastered:&lt;/strong&gt; Leading to near-zero gradient signals and negligible learning.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Found Too Difficult:&lt;/strong&gt; Producing noisy, incoherent, or even contradictory gradients that can destabilize the model or erode prior knowledge.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This inefficiency not only slows down training and inflates costs but can also degrade the student's existing capabilities, hindering the development of agile, specialized models.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;PACED: Distillation at the Frontier of Student Competence&lt;/strong&gt;, a groundbreaking framework by Yuanda Xu et al. (HuggingFace). PACED addresses this fundamental inefficiency head-on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How PACED Works:&lt;/strong&gt;&lt;br&gt;
The core of PACED lies in a theoretical observation: the &lt;strong&gt;gradient signal-to-noise ratio (SNR)&lt;/strong&gt;, crucial for effective learning, vanishes at both extremes of student competence. PACED dynamically identifies and concentrates distillation efforts on the &lt;strong&gt;'zone of proximal development' (ZPD)&lt;/strong&gt;. These are tasks that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Challenging enough&lt;/strong&gt; to provide a strong, coherent learning signal.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Not so difficult&lt;/strong&gt; as to be unlearnable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This targeted approach prevents compute from being squandered on unhelpful tasks, ensuring every computational cycle contributes meaningfully to learning.&lt;/p&gt;
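
&lt;p&gt;One way to picture the selection step (an illustrative interpretation, not the paper's actual algorithm): estimate the student's success rate per prompt and keep only the intermediate band, where the gradient SNR is non-vanishing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative ZPD filter; `student.solves` is a hypothetical helper
# that attempts a prompt once and reports success.
def in_zpd(prompt, student, n_samples=8, low=0.1, high=0.9):
    """True if the prompt sits at the frontier of student competence."""
    successes = sum(student.solves(prompt) for _ in range(n_samples))
    rate = successes / n_samples
    return low &lt; rate &lt; high   # neither mastered nor hopeless

def build_curriculum(prompts, student):
    # Distill only on the zone of proximal development.
    return [p for p in prompts if in_zpd(p, student)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;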

&lt;p&gt;&lt;strong&gt;Why PACED Matters for Practitioners:&lt;/strong&gt;&lt;br&gt;
While specific quantitative benchmarks are not detailed in the paper, PACED's strong theoretical grounding in gradient SNR promises significant gains in training efficiency. It aims to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Accelerate the distillation process.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Reduce compute costs dramatically.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Prevent the degradation of previously acquired knowledge&lt;/strong&gt; in student LLMs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, PACED means we can train more capable, smaller LLMs faster and more affordably. This framework could unlock a new wave of specialized, deployable models, making advanced AI more accessible and sustainable for a broader range of applications and organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read the Full Paper:&lt;/strong&gt;&lt;br&gt;
For a deep dive into the theoretical underpinnings and methodology, explore the full paper: &lt;a href="https://huggingface.co/papers/2603.11178" rel="noopener noreferrer"&gt;https://huggingface.co/papers/2603.11178&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llms</category>
      <category>ai</category>
      <category>distillation</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Parallel Chains in LangChain</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Wed, 16 Oct 2024 15:48:20 +0000</pubDate>
      <link>https://forem.com/thilak15/parallel-chains-in-langchain-a-practical-guide-3o1j</link>
      <guid>https://forem.com/thilak15/parallel-chains-in-langchain-a-practical-guide-3o1j</guid>
      <description>&lt;p&gt;In this guide, we'll delve into how LangChain facilitates parallel processing using a Meeting Summary Generator as a reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Parallel Chains?&lt;/strong&gt;&lt;br&gt;
Parallel chains allow multiple tasks to run concurrently, reducing overall execution time and improving resource utilization. This is especially beneficial when dealing with tasks that can operate independently, such as extracting different components from a dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RunnableLambda&lt;/strong&gt;: Wraps Python functions so they can be used within LangChain chains.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunnableParallel&lt;/strong&gt;: Enables parallel execution of multiple runnable branches.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;StrOutputParser&lt;/strong&gt;: Parses the string output from the language model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step-by-Step Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initialize the language model using LangChain’s ChatOllama. This model will process the prompts and generate responses.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_ollama import ChatOllama

# Initialize the ChatOllama model
model = ChatOllama(model="llama3.2:1b-instruct-fp16")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Create prompt templates to instruct the model on the specific tasks: extracting key points, decisions, and action items.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.prompts import ChatPromptTemplate

# Prompt to summarize key points from meeting notes
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an expert meeting assistant."),
        ("human", "Summarize the key points of the following meeting notes:\n\n{meeting_notes}"),
    ]
)

# Prompt to extract decisions
def analyze_decisions(key_points):
    decisions_template = ChatPromptTemplate.from_messages(
        [
            ("system", "You are an expert meeting assistant."),
            ("human", "Given these key points: {key_points}, list the decisions made during the meeting."),
        ]
    )
    return decisions_template.format_prompt(key_points=key_points)

# Prompt to extract action items
def analyze_action_items(key_points):
    action_items_template = ChatPromptTemplate.from_messages(
        [
            ("system", "You are an expert meeting assistant."),
            ("human", "Given these key points: {key_points}, list the action items assigned during the meeting, including the responsible person and the deadline if available."),
        ]
    )
    return action_items_template.format_prompt(key_points=key_points)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Utilize RunnableLambda to wrap the analysis functions and RunnableParallel to execute them concurrently.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnableLambda

# Function to combine decisions and action items
def combine_summary(decisions, action_items):
    return f"**Decisions Made:**\n{decisions}\n\n**Action Items:**\n{action_items}"

# Runnable chains for decisions and action items
decisions_branch_chain = (
    RunnableLambda(lambda x: analyze_decisions(x)) | model | StrOutputParser()
)

action_items_branch_chain = (
    RunnableLambda(lambda x: analyze_action_items(x)) | model | StrOutputParser()
)

# Combined parallel chain
chain = (
    prompt_template
    | model
    | StrOutputParser()
    | RunnableParallel(branches={
        "decisions": decisions_branch_chain, 
        "action_items": action_items_branch_chain
    })
    | RunnableLambda(lambda x: combine_summary(
        x["branches"]["decisions"], 
        x["branches"]["action_items"]
    ))
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RunnableLambda wraps the analyze_decisions and analyze_action_items functions, allowing them to be part of the LangChain pipeline.&lt;/li&gt;
&lt;li&gt;RunnableParallel runs the decisions_branch_chain and action_items_branch_chain simultaneously.&lt;/li&gt;
&lt;li&gt;The final RunnableLambda combines the outputs from both branches into a structured summary.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example meeting notes
meeting_notes = """
**Project Kickoff Meeting - April 25, 2024**

- Discussed project timeline and milestones.
- Assigned tasks to team members.
- Reviewed budget allocations.
- Identified potential risks and mitigation strategies.
- Decided to use Agile methodology for project management.
- Scheduled weekly check-in meetings.
- Agreed on communication channels and tools.

**Action Items:**
1. John to set up the project repository by April 26.
2. Sarah to draft the initial project plan by April 28.
3. Mike to research risk mitigation strategies by April 30.
"""

# Run the chain
result = chain.invoke({"meeting_notes": meeting_notes})

# Output the result
print(result)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Sample Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Decisions Made:**
- Decided to use Agile methodology for project management.

**Action Items:**
1. John to set up the project repository by April 26.
2. Sarah to draft the initial project plan by April 28.
3. Mike to research risk mitigation strategies by April 30.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits of Parallel Chains in LangChain&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficiency: Processes multiple tasks simultaneously, reducing total execution time.&lt;/li&gt;
&lt;li&gt;Modularity: Each task is encapsulated, making the workflow easy to manage and extend.&lt;/li&gt;
&lt;li&gt;Scalability: Additional analysis branches can be added without disrupting existing chains.&lt;/li&gt;
&lt;li&gt;Clarity: Organized outputs enhance readability and usability of the results.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>langchain</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
