<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: thilak15</title>
    <description>The latest articles on Forem by thilak15 (@thilak15).</description>
    <link>https://forem.com/thilak15</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2114840%2F0f0665c7-bd24-45a8-b1d1-b56c1ca6c852.jpeg</url>
      <title>Forem: thilak15</title>
      <link>https://forem.com/thilak15</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thilak15"/>
    <language>en</language>
    <item>
      <title>Robotics Reinvention: Travis Kalanick's Atoms Targets Industrial Automation</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Sat, 14 Mar 2026 01:33:01 +0000</pubDate>
      <link>https://forem.com/thilak15/robotics-reinvention-travis-kalanicks-atoms-targets-industrial-automation-47ba</link>
      <guid>https://forem.com/thilak15/robotics-reinvention-travis-kalanicks-atoms-targets-industrial-automation-47ba</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Uber founder Travis Kalanick has launched &lt;strong&gt;Atoms&lt;/strong&gt;, a new robotics venture absorbing &lt;strong&gt;CloudKitchens&lt;/strong&gt;. Atoms aims to revolutionize &lt;strong&gt;industrial automation&lt;/strong&gt; in &lt;strong&gt;mining&lt;/strong&gt; and &lt;strong&gt;transport&lt;/strong&gt;, leveraging Kalanick's capital-raising prowess. While technical differentiators are yet to be revealed, this high-risk, high-reward bet could be a major disruptor in a crowded robotics market.&lt;/p&gt;

&lt;p&gt;Travis Kalanick, the controversial but undeniably impactful co-founder of &lt;strong&gt;Uber&lt;/strong&gt; and the visionary behind &lt;strong&gt;CloudKitchens&lt;/strong&gt;, is once again making headlines, this time with a bold new foray into the rapidly expanding world of robotics. His latest venture, &lt;strong&gt;Atoms&lt;/strong&gt;, represents a significant move, absorbing the existing infrastructure and talent of CloudKitchens to pivot towards more ambitious, capital-intensive domains like &lt;strong&gt;mining&lt;/strong&gt; and &lt;strong&gt;transport&lt;/strong&gt;. This pivot by a proven entrepreneur with a track record of disrupting massive industries signals a potent new force in the robotics landscape, promising to accelerate innovation and challenge established players in sectors ripe for automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Market Context
&lt;/h2&gt;

&lt;p&gt;The problem space &lt;strong&gt;Atoms&lt;/strong&gt; aims to address is vast and multifaceted: the &lt;strong&gt;automation&lt;/strong&gt; of hazardous, repetitive, or logistically complex tasks across heavy industries. Sectors like &lt;strong&gt;mining&lt;/strong&gt; and &lt;strong&gt;transport&lt;/strong&gt; are characterized by high operational costs, labor shortages, safety concerns, and often inefficient manual processes. Robotics offers a compelling solution, promising increased safety by removing humans from dangerous environments, enhanced efficiency through continuous operation, and significant cost reductions over time.&lt;/p&gt;

&lt;p&gt;Industry reports project substantial growth in these areas. The global &lt;strong&gt;mining automation&lt;/strong&gt; market, for instance, is anticipated to reach tens of billions of dollars within the next decade, driven by demand for &lt;strong&gt;autonomous haulage&lt;/strong&gt;, drilling, and inspection systems. Similarly, the &lt;strong&gt;autonomous transport&lt;/strong&gt; market, encompassing everything from long-haul trucking to last-mile delivery and specialized industrial vehicles, is projected to see exponential growth, with forecasts often placing its value in the hundreds of billions.&lt;/p&gt;

&lt;p&gt;This problem is solvable and investable now due to several converging factors. Advances in &lt;strong&gt;artificial intelligence&lt;/strong&gt;, particularly in &lt;strong&gt;machine learning&lt;/strong&gt;, &lt;strong&gt;computer vision&lt;/strong&gt;, and &lt;strong&gt;reinforcement learning&lt;/strong&gt;, have made robots more capable of perceiving, understanding, and navigating complex, unstructured environments. Simultaneously, improvements in &lt;strong&gt;sensor technology&lt;/strong&gt; (LiDAR, radar, high-resolution cameras), &lt;strong&gt;processing power&lt;/strong&gt; (edge AI chips), and battery efficiency have made robust, real-world deployments more feasible and cost-effective. Furthermore, the increasing availability of sophisticated open-source robotics frameworks like ROS (&lt;strong&gt;Robot Operating System&lt;/strong&gt;) lowers the barrier to entry for development, while a growing talent pool in AI and robotics fuels innovation.&lt;/p&gt;

&lt;p&gt;The landscape is currently populated by a mix of incumbents and challengers. In mining, traditional heavy equipment manufacturers like &lt;strong&gt;Caterpillar&lt;/strong&gt;, &lt;strong&gt;Komatsu&lt;/strong&gt;, and &lt;strong&gt;Epiroc&lt;/strong&gt; have their own automation divisions, offering &lt;strong&gt;autonomous haulage&lt;/strong&gt; and drilling solutions. In transport, autonomous driving companies like &lt;strong&gt;Waymo&lt;/strong&gt;, &lt;strong&gt;Cruise&lt;/strong&gt;, and specialized trucking firms such as &lt;strong&gt;TuSimple&lt;/strong&gt; (though facing challenges) are pushing innovation.&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>ai</category>
      <category>startup</category>
      <category>automation</category>
    </item>
    <item>
      <title>Brew: I Built a Real-Time Voice AI Drive-Thru Barista with Gemini Live API and Google ADK</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Fri, 13 Mar 2026 23:36:42 +0000</pubDate>
      <link>https://forem.com/thilak15/brew-i-built-a-real-time-voice-ai-drive-thru-barista-with-gemini-live-api-and-google-adk-4di5</link>
      <guid>https://forem.com/thilak15/brew-i-built-a-real-time-voice-ai-drive-thru-barista-with-gemini-live-api-and-google-adk-4di5</guid>
      <description>

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;Brew&lt;/strong&gt; — a real-time, voice-first AI ordering system for coffee shop drive-thrus. Customers talk to an AI barista through their microphone, and it takes their order through natural conversation. No buttons, no typing, just speech. The AI listens, understands complex orders with modifiers, handles interruptions, and updates a live on-screen menu and receipt as the conversation flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/thilak15/Brew" rel="noopener noreferrer"&gt;github.com/thilak15/Brew&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem I Wanted to Solve
&lt;/h2&gt;

&lt;p&gt;Traditional drive-thru ordering is broken. Long wait times, order inaccuracies, and staffing challenges plague the industry. Human operators handle one car at a time, miscommunication leads to wrong orders, and during peak hours, lines stretch around the block.&lt;/p&gt;

&lt;p&gt;I wanted to see if a live voice AI agent could do this better — not a chatbot with text-to-speech bolted on, but a genuinely conversational agent that handles the full complexity of real ordering: sizes, modifiers, corrections, interruptions, and multi-item requests.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Brew Does
&lt;/h2&gt;

&lt;p&gt;Brew replaces the human operator at a drive-thru speaker box with an AI barista. Here's what it handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Natural speech understanding&lt;/strong&gt; — "Can I get a grande iced latte with oat milk and an extra shot?" works exactly as you'd expect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interruptions (barge-in)&lt;/strong&gt; — Change your mind mid-sentence. The AI stops speaking and listens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time UI updates&lt;/strong&gt; — The menu highlights relevant categories and the receipt builds live as items are confirmed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex order management&lt;/strong&gt; — Modifiers (syrups, milk swaps, toppings, ice levels, warming), undo, batch operations, and running totals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual support&lt;/strong&gt; — Speak in Spanish, Hindi, or any language Gemini understands, and the agent mirrors your language automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session persistence&lt;/strong&gt; — Cart state survives Cloud Run instance restarts via Firestore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The menu has 22 items across 3 categories (Drinks, Breakfast, Desserts) with a full modifier system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tech Stack — All Google AI and Cloud
&lt;/h2&gt;

&lt;p&gt;This project is built end-to-end on Google's AI and Cloud platform. Here's every piece:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash Native Audio&lt;/td&gt;
&lt;td&gt;Real-time voice conversation with function calling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Agent Development Kit (ADK)&lt;/td&gt;
&lt;td&gt;Agent orchestration, tool management, live streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python 3.11, FastAPI&lt;/td&gt;
&lt;td&gt;WebSocket server, session management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js 14, React 18, TypeScript&lt;/td&gt;
&lt;td&gt;Dynamic UI with real-time state updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web Audio API (AudioWorklet)&lt;/td&gt;
&lt;td&gt;Low-latency audio capture and playback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transport&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;WebSockets&lt;/td&gt;
&lt;td&gt;Bidirectional PCM audio + JSON state streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud Firestore&lt;/td&gt;
&lt;td&gt;Cart state across instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud Run&lt;/td&gt;
&lt;td&gt;Serverless containers for backend and frontend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Container Registry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Artifact Registry&lt;/td&gt;
&lt;td&gt;Docker image storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub Actions + Workload Identity Federation&lt;/td&gt;
&lt;td&gt;Keyless automated deployment to GCP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How I Built It — Architecture Deep Dive
&lt;/h2&gt;

&lt;p&gt;The system has four layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Browser (Customer Device)
&lt;/h3&gt;

&lt;p&gt;The Next.js frontend captures microphone audio via the Web Audio API using an &lt;code&gt;AudioWorklet&lt;/code&gt; processor. Raw PCM audio at 16kHz streams to the backend over a WebSocket. The frontend receives two things back: audio response bytes (played through another AudioWorklet) and JSON state updates that drive the UI.&lt;/p&gt;

&lt;p&gt;Three main components power the interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SmartMenu&lt;/strong&gt; — A dynamic tabbed menu that auto-switches categories (ordering a "Cake Pop" flips the view to Desserts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveReceipt&lt;/strong&gt; — A real-time order panel showing items, modifiers, and a running price total&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AudioVisualizer&lt;/strong&gt; — Visual feedback during the conversation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Backend Server (Cloud Run)
&lt;/h3&gt;

&lt;p&gt;A Python/FastAPI WebSocket server running on Cloud Run. It manages the bidirectional audio stream between the browser and Gemini. The key responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hosts the ADK &lt;code&gt;Runner&lt;/code&gt; that orchestrates the agent lifecycle&lt;/li&gt;
&lt;li&gt;Implements a &lt;strong&gt;tool gate mechanism&lt;/strong&gt; — blocks user audio while the AI executes tool calls, preventing race conditions where the model hears its own confirmations&lt;/li&gt;
&lt;li&gt;Handles upstream (browser → Gemini) and downstream (Gemini → browser) as concurrent async tasks (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Proactively reconnects sessions before the 10-minute Live API hard limit&lt;/li&gt;
&lt;/ul&gt;
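
&lt;p&gt;Roughly, the bridge looks like this. This is a simplified sketch, not the exact Brew source: &lt;code&gt;live_request_queue&lt;/code&gt;, &lt;code&gt;live_events&lt;/code&gt;, and the event fields are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simplified sketch (placeholder names, not the actual Brew source):
# two concurrent loops bridging the browser WebSocket and the ADK stream.
import asyncio

async def upstream(websocket, live_request_queue, gate):
    # Browser → Gemini: forward PCM frames unless the tool gate is closed.
    while True:
        frame = await websocket.receive_bytes()
        if gate.is_set():  # gate open: safe to forward user audio
            live_request_queue.send_realtime(frame)

async def downstream(websocket, live_events):
    # Gemini → browser: relay audio bytes and JSON state updates.
    async for event in live_events:
        if event.audio:
            await websocket.send_bytes(event.audio)
        if event.state_update:
            await websocket.send_json(event.state_update)

async def bridge(websocket, live_request_queue, live_events, gate):
    # Run both directions concurrently; if either side closes, both stop.
    await asyncio.gather(
        upstream(websocket, live_request_queue, gate),
        downstream(websocket, live_events),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;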

&lt;h3&gt;
  
  
  3. Agent Layer (Google ADK)
&lt;/h3&gt;

&lt;p&gt;The agent is defined using Google's Agent Development Kit with &lt;strong&gt;14 tools&lt;/strong&gt; for order management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brew_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-native-audio-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Drive-thru barista that takes beverage orders with modifiers.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_system_prompt&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;add_item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;remove_item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remove_items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;add_modifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_modifiers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;remove_modifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set_modifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;set_ice_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;undo_last_change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;clear_order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;set_menu_view&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;get_order_summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ADK's &lt;code&gt;run_live()&lt;/code&gt; method establishes a persistent bidirectional stream with the Gemini Live API. Tools are plain Python functions with detailed docstrings that the model uses for function calling.&lt;/p&gt;
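
&lt;p&gt;For illustration, a tool can be as small as the sketch below. This is hypothetical code, not the repo's actual signatures; the &lt;code&gt;order_state&lt;/code&gt; object is a stand-in for Brew's real order store. ADK reads the docstring to describe the tool to Gemini, so the docstring doubles as the function-calling schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative ADK tool sketch; names and fields are assumptions.
def add_item(name: str, size: str = "medium", quantity: int = 1) -&gt; dict:
    """Add a menu item to the current order.

    Args:
        name: Exact menu item name, e.g. "Iced Latte".
        size: One of "small", "medium", "large".
        quantity: Number of items to add (default 1).

    Returns:
        The updated order state, including the new item's id.
    """
    item = order_state.add(name=name, size=size, quantity=quantity)
    return {"status": "ok", "item_id": item.id, "total": order_state.total}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;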

&lt;h3&gt;
  
  
  4. AI Model (Gemini Live API)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;gemini-2.5-flash-native-audio-preview-12-2025&lt;/code&gt; model handles everything in a single streaming session: receives raw audio, processes speech, decides when to call tools, and generates spoken responses. The system prompt injects the full menu (items, prices, sizes, modifiers) so the model is grounded in real data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer speaks into mic
  → Browser captures PCM audio via AudioWorklet
  → WebSocket sends binary audio frames to backend
  → Backend forwards audio to Gemini via ADK run_live()
  → Gemini processes speech, decides to call tools or respond
  → If tool call: ADK executes tool → updates OrderState → syncs to Firestore
  → Gemini generates audio response
  → Backend streams audio bytes back over WebSocket
  → Browser plays audio via AudioWorklet
  → Backend sends JSON order state updates
  → Frontend re-renders SmartMenu + LiveReceipt in real time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Google Cloud Services in Detail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gemini Live API (via Google GenAI SDK)
&lt;/h3&gt;

&lt;p&gt;This is the core of Brew. The Live API provides native audio streaming — the model receives raw audio and produces audio responses directly, without separate speech-to-text or text-to-speech steps. Combined with function calling, this means the model can hear "add oat milk to both drinks," call the &lt;code&gt;add_modifiers&lt;/code&gt; batch tool, and speak a confirmation — all in one streaming session with sub-second latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Agent Development Kit (ADK)
&lt;/h3&gt;

&lt;p&gt;ADK handles the agent lifecycle. The &lt;code&gt;run_live()&lt;/code&gt; method manages the persistent WebSocket connection to Gemini, routes tool calls to my Python functions, and handles the back-and-forth of a multi-turn conversation. I defined 14 tools with detailed docstrings, and ADK + Gemini handle the rest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Run
&lt;/h3&gt;

&lt;p&gt;Both the backend (FastAPI) and frontend (Next.js) are deployed as separate Cloud Run services. Session affinity is critical for WebSocket connections — without it, requests hit different instances that don't have the session state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Firestore
&lt;/h3&gt;

&lt;p&gt;Cart state is persisted to Firestore after every order change. This means if Cloud Run scales horizontally or an instance restarts, the customer's order survives. I built a custom &lt;code&gt;FirestoreSessionService&lt;/code&gt; that wraps ADK's &lt;code&gt;InMemorySessionService&lt;/code&gt; with Firestore persistence.&lt;/p&gt;
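
&lt;p&gt;Loosely, the wrapper looks like this. The method names on the session service are simplified here; the real class in the repo may differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough sketch of the wrapper idea (simplified method names, not the
# actual ADK interface): in-memory sessions backed by Firestore.
from google.cloud import firestore

class FirestoreSessionService:
    def __init__(self, inner, collection="sessions"):
        self.inner = inner              # ADK InMemorySessionService
        self.col = firestore.Client().collection(collection)

    async def get_session(self, session_id):
        session = await self.inner.get_session(session_id)
        if session is None:
            # Cold instance: rehydrate cart state from Firestore.
            doc = self.col.document(session_id).get()
            if doc.exists:
                session = await self.inner.create_session(
                    session_id, state=doc.to_dict()
                )
        return session

    async def save_session(self, session_id, state: dict):
        await self.inner.save_session(session_id, state)
        # Persist after every change so restarts don't lose the order.
        self.col.document(session_id).set(state)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;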

&lt;h3&gt;
  
  
  Artifact Registry + Workload Identity Federation
&lt;/h3&gt;

&lt;p&gt;Docker images are stored in Artifact Registry. CI/CD uses Workload Identity Federation for keyless authentication from GitHub Actions to GCP — no service account keys stored anywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hard Problems I Solved
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Tool Gate Problem
&lt;/h3&gt;

&lt;p&gt;Without intervention, the model would hear its own tool-call confirmations as user input, creating infinite loops. I implemented a tool gate that blocks user audio forwarding while the AI is executing tools. This was the single most impactful fix for reliability.&lt;/p&gt;
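
&lt;p&gt;Conceptually, the gate is just a flag the audio forwarder checks. A minimal sketch with &lt;code&gt;asyncio.Event&lt;/code&gt; (illustrative, not the exact implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tool-gate sketch: drop mic audio while the model is executing tools,
# so it never hears its own spoken confirmations as user input.
import asyncio

tool_gate = asyncio.Event()
tool_gate.set()  # open by default: user audio flows to the model

def on_tool_call_start():
    tool_gate.clear()   # close the gate while tools run

def on_tool_call_end():
    tool_gate.set()     # reopen once the tool response has been sent

async def forward_audio(frame, live_request_queue):
    if tool_gate.is_set():
        live_request_queue.send_realtime(frame)
    # else: silently drop the frame while tools are running
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;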

&lt;h3&gt;
  
  
  2. The 10-Minute Session Limit
&lt;/h3&gt;

&lt;p&gt;The Gemini Live API has a hard 10-minute session limit. Brew proactively reconnects at 8 minutes, injecting the current order context into the new session so the AI seamlessly continues the conversation without re-greeting the customer.&lt;/p&gt;
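
&lt;p&gt;In outline (the function and object names here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Proactive reconnect sketch: rotate the live session before the
# 10-minute hard limit, carrying the order context forward.
import asyncio

SESSION_BUDGET_S = 8 * 60   # reconnect at 8 minutes, under the limit

async def session_loop(start_session, order_state):
    while True:
        session = await start_session(
            # Inject the current order so the new session resumes
            # mid-conversation instead of re-greeting the customer.
            context=order_state.summary()
        )
        try:
            await asyncio.wait_for(session.run(), timeout=SESSION_BUDGET_S)
        except asyncio.TimeoutError:
            await session.close()   # budget reached: rotate the session
            continue
        break                       # conversation ended normally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;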

&lt;h3&gt;
  
  
  3. Model Hallucinating Tool Arguments
&lt;/h3&gt;

&lt;p&gt;Native audio models sometimes hallucinate tool arguments — inventing item IDs that don't exist. I switched from UUIDs to sequential integer IDs (&lt;code&gt;item_1&lt;/code&gt;, &lt;code&gt;item_2&lt;/code&gt;, ...) which dramatically reduced hallucination. I also added &lt;code&gt;_resolve_item_id()&lt;/code&gt; that handles numeric shorthand and positional references.&lt;/p&gt;
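
&lt;p&gt;A rough sketch of the resolver idea follows; the real &lt;code&gt;_resolve_item_id()&lt;/code&gt; in the repo may handle more cases.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch only: map model-provided references onto real item IDs.
def _resolve_item_id(raw, order_items):
    """Accepts "item_3", bare "3", or positional words like "last"."""
    if raw in order_items:                  # exact id, e.g. "item_3"
        return raw
    if str(raw).isdigit():                  # numeric shorthand: "3"
        candidate = f"item_{raw}"
        if candidate in order_items:
            return candidate
    if raw in ("last", "latest") and order_items:
        return list(order_items)[-1]        # positional reference
    raise KeyError(f"Unknown item reference: {raw!r}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;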

&lt;h3&gt;
  
  
  4. Batch Operations for Latency
&lt;/h3&gt;

&lt;p&gt;Without batch tools, the model makes sequential tool calls with separate confirmations for each item in a multi-item order. I added &lt;code&gt;add_items&lt;/code&gt;, &lt;code&gt;remove_items&lt;/code&gt;, and &lt;code&gt;add_modifiers&lt;/code&gt; batch tools that handle everything in a single call, cutting latency significantly.&lt;/p&gt;
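
&lt;p&gt;The batch shape is simple (illustrative sketch, reusing the hypothetical &lt;code&gt;order_state&lt;/code&gt; from above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Batch tool sketch: one call and one spoken confirmation,
# instead of N sequential add_item calls.
def add_items(items: list[dict]) -&gt; dict:
    """Add several menu items in one call.

    Args:
        items: List of {"name": ..., "size": ..., "quantity": ...} dicts.
    """
    added = [order_state.add(**item) for item in items]
    return {"status": "ok", "added": len(added), "total": order_state.total}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;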

&lt;h3&gt;
  
  
  5. Idempotency Guards
&lt;/h3&gt;

&lt;p&gt;The model sometimes retries tool calls during transient errors. Without idempotency guards, this would add duplicate modifiers. Every &lt;code&gt;add_modifier&lt;/code&gt; call checks for existing identical modifiers before applying.&lt;/p&gt;
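
&lt;p&gt;The guard itself is a few lines (sketch; field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Idempotency sketch: a retried tool call can't double-apply "oat milk".
def add_modifier(item_id: str, modifier: str) -&gt; dict:
    item = order_state.items[item_id]
    if modifier in item.modifiers:
        return {"status": "ok", "note": "modifier already applied"}
    item.modifiers.append(modifier)
    return {"status": "ok", "modifiers": item.modifiers}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;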




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool docstrings are the primary interface.&lt;/strong&gt; Clear, specific docstrings with examples produce dramatically better tool-calling accuracy than vague descriptions. I iterated on these more than any other part of the codebase.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AudioWorklet is non-negotiable.&lt;/strong&gt; The deprecated &lt;code&gt;ScriptProcessorNode&lt;/code&gt; introduces unpredictable latency. AudioWorklet provides consistent low-latency audio processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session affinity on Cloud Run is essential&lt;/strong&gt; for WebSocket connections. Without it, subsequent requests hit different instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native audio models behave differently than text models.&lt;/strong&gt; They're more prone to hallucinating tool arguments, more sensitive to background noise, and need explicit instructions about when NOT to respond (e.g., to background noise or their own echoes).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Firestore for session persistence is a perfect fit&lt;/strong&gt; for serverless deployments. The read/write latency is low enough that it doesn't impact the real-time experience.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;p&gt;Brew is fully open source. You can run it locally with Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/thilak15/Brew.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Brew
&lt;span class="nb"&gt;cp &lt;/span&gt;backend/.env.example backend/.env
&lt;span class="c"&gt;# Add your GOOGLE_API_KEY to backend/.env&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;code&gt;http://localhost:3000&lt;/code&gt; in Chrome, click "Drive Up," allow mic access, and start ordering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/thilak15/Brew" rel="noopener noreferrer"&gt;github.com/thilak15/Brew&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Menu-agnostic deployment.&lt;/strong&gt; The menu loads from a JSON file. Swap it out, and Brew becomes a taco shop, a pizza place, or a pharmacy pickup counter. The next step is a pipeline that takes any restaurant's menu and auto-generates a ready-to-deploy voice ordering agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilingual real-time language switching.&lt;/strong&gt; Gemini's native audio model already understands multiple languages. The goal is automatic language detection mid-conversation — if a customer starts in English and switches to Spanish, the agent follows without any button press.&lt;/p&gt;

&lt;p&gt;The hard part was proving that a live voice agent can handle complex, modifier-heavy ordering with interruptions, corrections, and batch operations — correctly and reliably. That's done. Now it's about making it work for anyone, in any language.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project was created for the &lt;a href="https://googleai.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; hackathon. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/thilak15" rel="noopener noreferrer"&gt;Thilak Daggula&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gemini</category>
      <category>ai</category>
      <category>hackathon</category>
    </item>
    <item>
      <title>Challenging Dogma: Simple Fine-Tuning Enables Continual Learning in VLA Models</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Fri, 13 Mar 2026 18:39:55 +0000</pubDate>
      <link>https://forem.com/thilak15/challenging-dogma-simple-fine-tuning-enables-continual-learning-in-vla-models-1mjj</link>
      <guid>https://forem.com/thilak15/challenging-dogma-simple-fine-tuning-enables-continual-learning-in-vla-models-1mjj</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Simple Sequential Fine-Tuning (Seq. FT) works surprisingly well for Continual Reinforcement Learning (CRL) in large pretrained Vision-Language-Action (VLA) models.&lt;/strong&gt; The paper "Simple Recipe Works" challenges the long-held assumption that complex strategies are always necessary to prevent catastrophic forgetting.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Large pretrained VLAs appear to be natural continual learners.&lt;/strong&gt; Their inherent capabilities, likely stemming from extensive pretraining on diverse data, make them more resilient to forgetting than previously thought when adapting to new tasks.&lt;/li&gt;
&lt;li&gt;  The research systematically evaluated this "simple recipe" across three distinct VLA models and five varied continual learning scenarios, consistently demonstrating its efficacy.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;This discovery simplifies the path toward developing robust, self-improving embodied AI agents.&lt;/strong&gt; Engineers can potentially forgo complex CRL algorithms, focusing instead on foundational VLA pretraining and task design for agents operating in dynamic, open-ended environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Problem: The Persistent Challenge of Catastrophic Forgetting
&lt;/h2&gt;

&lt;p&gt;Embodied AI agents, such as robots or virtual assistants, need to operate effectively in dynamic, open-ended environments. This requires them to continually learn new skills and adapt to novel situations without forgetting previously acquired knowledge. This challenge is known as &lt;strong&gt;Continual Reinforcement Learning (CRL)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core hurdle in CRL is &lt;strong&gt;catastrophic forgetting&lt;/strong&gt;. When an AI model is trained sequentially on a series of tasks, fine-tuning on a new task often causes it to "forget" how to perform older tasks. For example, a robot learning to pick up a new object might suddenly lose its ability to grasp a previously mastered object. This phenomenon has plagued deep learning models, especially in reinforcement learning settings where data distributions change drastically between tasks.&lt;/p&gt;

&lt;p&gt;Historically, addressing catastrophic forgetting has led to the development of highly sophisticated and often complex CRL strategies. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Regularization-based methods:&lt;/strong&gt; Adding penalty terms to the loss function to protect important parameters learned from previous tasks (e.g., Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI)).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Rehearsal/Memory-based methods:&lt;/strong&gt; Storing a small subset of data or experiences from previous tasks and replaying them during training on new tasks (e.g., Experience Replay, Generative Replay).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Architectural methods:&lt;/strong&gt; Dynamically expanding the model's capacity or creating task-specific sub-networks (e.g., Progressive Neural Networks, PackNet).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Knowledge Distillation:&lt;/strong&gt; Using the old model's outputs as "soft targets" to guide the new model's learning on previous tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these methods have shown promise, they introduce significant complexity. They often require careful hyperparameter tuning, increase computational overhead, and can hinder the development of truly adaptive AI systems. This paper, however, presents a compelling argument that for a specific class of models—large, pre-trained &lt;strong&gt;Vision-Language-Action (VLA) models&lt;/strong&gt;—a much simpler approach might be all that's needed.&lt;/p&gt;
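
&lt;p&gt;To make the contrast concrete, the "simple recipe" amounts to the loop below: train on each task in sequence with no replay buffer, regularizer, or task-specific heads. This is a minimal sketch with a placeholder model and data loaders, not the paper's code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sequential fine-tuning (Seq. FT) sketch; model, tasks, and
# loss are placeholders, not the paper's implementation.
import torch

def sequential_fine_tune(model, tasks, epochs=1, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for task_loader in tasks:       # tasks arrive one after another
        for _ in range(epochs):
            for batch in task_loader:
                loss = model(**batch).loss   # standard fine-tuning loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
    return model                    # no EWC penalty, no replay, no new heads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;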


</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>continuallearning</category>
      <category>reinforcementlearning</category>
    </item>
    <item>
      <title>PACED: Unlock Faster, More Affordable LLM Training with Smart Distillation</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Fri, 13 Mar 2026 18:17:16 +0000</pubDate>
      <link>https://forem.com/thilak15/paced-unlock-faster-more-affordable-llm-training-with-smart-distillation-1pk4</link>
      <guid>https://forem.com/thilak15/paced-unlock-faster-more-affordable-llm-training-with-smart-distillation-1pk4</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Targeted Distillation:&lt;/strong&gt; PACED is a novel framework for LLM distillation that focuses training on the 'zone of proximal development' (ZPD) for student models, avoiding computational waste.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Theoretical Basis:&lt;/strong&gt; It's grounded in the observation that gradient signal-to-noise ratio (SNR) vanishes when problems are either too easy (student has mastered) or too hard (beyond current competence).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Computational Efficiency:&lt;/strong&gt; By concentrating compute on the ZPD, PACED promises significant gains in training efficiency, accelerating LLM development and reducing resource consumption.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Improved Learning:&lt;/strong&gt; This focused approach aims to not only make distillation faster but also more effective, preventing the erosion of existing knowledge and fostering better student model quality.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Large Language Models (LLMs) have transformed AI, but their immense size makes deployment expensive and slow. This is where &lt;strong&gt;knowledge distillation&lt;/strong&gt; becomes vital: transferring a large "teacher" model's knowledge to a smaller, more efficient "student" model.&lt;/p&gt;

&lt;p&gt;However, standard LLM distillation methods often suffer from a critical flaw: &lt;strong&gt;computational waste&lt;/strong&gt;. Imagine trying to teach someone by constantly reviewing what they already know or presenting concepts far beyond their grasp. This is precisely what happens in traditional LLM distillation, leading to inefficient training and inflated costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem in Detail:&lt;/strong&gt;&lt;br&gt;
Student models are typically exposed to a uniform curriculum. This means valuable compute cycles are squandered on tasks they've either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Already Mastered:&lt;/strong&gt; Leading to near-zero gradient signals and negligible learning.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Found Too Difficult:&lt;/strong&gt; Producing noisy, incoherent, or even contradictory gradients that can destabilize the model or erode prior knowledge.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This inefficiency not only slows down training and inflates costs but can also degrade the student's existing capabilities, hindering the development of agile, specialized models.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;PACED: Distillation at the Frontier of Student Competence&lt;/strong&gt;, a groundbreaking framework by Yuanda Xu et al. (HuggingFace). PACED addresses this fundamental inefficiency head-on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How PACED Works:&lt;/strong&gt;&lt;br&gt;
The core of PACED lies in a theoretical observation: the &lt;strong&gt;gradient signal-to-noise ratio (SNR)&lt;/strong&gt;, crucial for effective learning, vanishes at both extremes of student competence. PACED dynamically identifies and concentrates distillation efforts on the &lt;strong&gt;'zone of proximal development' (ZPD)&lt;/strong&gt;. These are tasks that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Challenging enough&lt;/strong&gt; to provide a strong, coherent learning signal.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Not so difficult&lt;/strong&gt; as to be unlearnable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This targeted approach prevents compute from being squandered on unhelpful tasks, ensuring every computational cycle contributes meaningfully to learning.&lt;/p&gt;
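
&lt;p&gt;One way to picture the selection step (an illustrative interpretation, not the paper's actual algorithm): estimate the student's success rate per prompt and keep only the intermediate band, where the gradient SNR is non-vanishing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative ZPD filter; `student.solves` is a hypothetical helper
# that attempts a prompt once and reports success.
def in_zpd(prompt, student, n_samples=8, low=0.1, high=0.9):
    """True if the prompt sits at the frontier of student competence."""
    successes = sum(student.solves(prompt) for _ in range(n_samples))
    rate = successes / n_samples
    return low &lt; rate &lt; high   # neither mastered nor hopeless

def build_curriculum(prompts, student):
    # Distill only on the zone of proximal development.
    return [p for p in prompts if in_zpd(p, student)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;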

&lt;p&gt;&lt;strong&gt;Why PACED Matters for Practitioners:&lt;/strong&gt;&lt;br&gt;
While specific quantitative benchmarks are not detailed in the paper, PACED's strong theoretical grounding in gradient SNR promises significant gains in training efficiency. It aims to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Accelerate the distillation process.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Reduce compute costs dramatically.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Prevent the degradation of previously acquired knowledge&lt;/strong&gt; in student LLMs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, PACED means we can train more capable, smaller LLMs faster and more affordably. This framework could unlock a new wave of specialized, deployable models, making advanced AI more accessible and sustainable for a broader range of applications and organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read the Full Paper:&lt;/strong&gt;&lt;br&gt;
For a deep dive into the theoretical underpinnings and methodology, explore the full paper: &lt;a href="https://huggingface.co/papers/2603.11178" rel="noopener noreferrer"&gt;https://huggingface.co/papers/2603.11178&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llms</category>
      <category>ai</category>
      <category>distillation</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Parallel Chains in LangChain</title>
      <dc:creator>thilak15</dc:creator>
      <pubDate>Wed, 16 Oct 2024 15:48:20 +0000</pubDate>
      <link>https://forem.com/thilak15/parallel-chains-in-langchain-a-practical-guide-3o1j</link>
      <guid>https://forem.com/thilak15/parallel-chains-in-langchain-a-practical-guide-3o1j</guid>
      <description>&lt;p&gt;In this guide, we'll delve into how LangChain facilitates parallel processing using a Meeting Summary Generator as a reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Parallel Chains?&lt;/strong&gt;&lt;br&gt;
Parallel chains allow multiple tasks to run concurrently, reducing overall execution time and improving resource utilization. This is especially beneficial when dealing with tasks that can operate independently, such as extracting different components from a dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RunnableLambda&lt;/strong&gt;: Wraps Python functions so they can be used within LangChain chains.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunnableParallel&lt;/strong&gt;: Enables parallel execution of multiple runnable branches.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;StrOutputParser&lt;/strong&gt;: Parses the string output from the language model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step-by-Step Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initialize the language model using LangChain’s ChatOllama. This model will process the prompts and generate responses.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_ollama import ChatOllama

# Initialize the ChatOllama model
model = ChatOllama(model="llama3.2:1b-instruct-fp16")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Create prompt templates to instruct the model on the specific tasks: extracting key points, decisions, and action items.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.prompts import ChatPromptTemplate

# Prompt to summarize key points from meeting notes
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an expert meeting assistant."),
        ("human", "Summarize the key points of the following meeting notes:\n\n{meeting_notes}"),
    ]
)

# Prompt to extract decisions
def analyze_decisions(key_points):
    decisions_template = ChatPromptTemplate.from_messages(
        [
            ("system", "You are an expert meeting assistant."),
            ("human", "Given these key points: {key_points}, list the decisions made during the meeting."),
        ]
    )
    return decisions_template.format_prompt(key_points=key_points)

# Prompt to extract action items
def analyze_action_items(key_points):
    action_items_template = ChatPromptTemplate.from_messages(
        [
            ("system", "You are an expert meeting assistant."),
            ("human", "Given these key points: {key_points}, list the action items assigned during the meeting, including the responsible person and the deadline if available."),
        ]
    )
    return action_items_template.format_prompt(key_points=key_points)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Utilize RunnableLambda to wrap the analysis functions and RunnableParallel to execute them concurrently.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnableLambda

# Function to combine decisions and action items
def combine_summary(decisions, action_items):
    return f"**Decisions Made:**\n{decisions}\n\n**Action Items:**\n{action_items}"

# Runnable chains for decisions and action items
decisions_branch_chain = (
    RunnableLambda(lambda x: analyze_decisions(x)) | model | StrOutputParser()
)

action_items_branch_chain = (
    RunnableLambda(lambda x: analyze_action_items(x)) | model | StrOutputParser()
)

# Combined parallel chain
chain = (
    prompt_template
    | model
    | StrOutputParser()
    | RunnableParallel(branches={
        "decisions": decisions_branch_chain, 
        "action_items": action_items_branch_chain
    })
    | RunnableLambda(lambda x: combine_summary(
        x["branches"]["decisions"], 
        x["branches"]["action_items"]
    ))
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RunnableLambda wraps the analyze_decisions and analyze_action_items functions, allowing them to be part of the LangChain pipeline.&lt;/li&gt;
&lt;li&gt;RunnableParallel runs the decisions_branch_chain and action_items_branch_chain simultaneously.&lt;/li&gt;
&lt;li&gt;The final RunnableLambda combines the outputs from both branches into a structured summary.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example meeting notes
meeting_notes = """
**Project Kickoff Meeting - April 25, 2024**

- Discussed project timeline and milestones.
- Assigned tasks to team members.
- Reviewed budget allocations.
- Identified potential risks and mitigation strategies.
- Decided to use Agile methodology for project management.
- Scheduled weekly check-in meetings.
- Agreed on communication channels and tools.

**Action Items:**
1. John to set up the project repository by April 26.
2. Sarah to draft the initial project plan by April 28.
3. Mike to research risk mitigation strategies by April 30.
"""

# Run the chain
result = chain.invoke({"meeting_notes": meeting_notes})

# Output the result
print(result)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Sample Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Decisions Made:**
- Decided to use Agile methodology for project management.

**Action Items:**
1. John to set up the project repository by April 26.
2. Sarah to draft the initial project plan by April 28.
3. Mike to research risk mitigation strategies by April 30.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits of Parallel Chains in LangChain&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficiency: Processes multiple tasks simultaneously, reducing total execution time.&lt;/li&gt;
&lt;li&gt;Modularity: Each task is encapsulated, making the workflow easy to manage and extend.&lt;/li&gt;
&lt;li&gt;Scalability: Additional analysis branches can be added without disrupting existing chains.&lt;/li&gt;
&lt;li&gt;Clarity: Organized outputs enhance readability and usability of the results.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>langchain</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
