<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yeshwanth TS</title>
    <description>The latest articles on Forem by Yeshwanth TS (@tsyeshwanth).</description>
    <link>https://forem.com/tsyeshwanth</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3814092%2F472a78d3-ece2-448e-bc52-62b2ae86f6d6.jpg</url>
      <title>Forem: Yeshwanth TS</title>
      <link>https://forem.com/tsyeshwanth</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tsyeshwanth"/>
    <language>en</language>
    <item>
      <title>Building Visio — A Real-Time AI Accessibility Agent with Gemini and Google ADK</title>
      <dc:creator>Yeshwanth TS</dc:creator>
      <pubDate>Mon, 09 Mar 2026 07:14:23 +0000</pubDate>
      <link>https://forem.com/tsyeshwanth/building-visio-a-real-time-ai-accessibility-agent-with-gemini-and-google-adk-1568</link>
      <guid>https://forem.com/tsyeshwanth/building-visio-a-real-time-ai-accessibility-agent-with-gemini-and-google-adk-1568</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was created for the purposes of entering the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; hackathon. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;An estimated 285 million people worldwide are visually impaired. Navigating everyday environments — stairs, parked vehicles, approaching people, unmarked curbs — requires constant assistance. Existing solutions are either passive (a white cane only detects obstacles at arm's length) or delayed (photo-based apps require stopping and waiting for a response).&lt;/p&gt;

&lt;p&gt;I wanted to build something that works in real-time — like having a friend walking beside you, continuously watching and speaking.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Visio&lt;/strong&gt; is a real-time AI accessibility agent. Point your phone's rear camera forward while walking, and Visio continuously narrates your surroundings through your headphones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Motorcycle ahead on your right, move left to pass"&lt;/li&gt;
&lt;li&gt;"You're past it. Pole ahead on your left, step around right"&lt;/li&gt;
&lt;li&gt;"Person in a blue jacket walking toward you"&lt;/li&gt;
&lt;li&gt;"Two steps down ahead, slow down"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It has three modes — &lt;strong&gt;Navigation&lt;/strong&gt; (hazard detection), &lt;strong&gt;Reading&lt;/strong&gt; (text/signs), and &lt;strong&gt;Exploration&lt;/strong&gt; (scene descriptions) — plus Emergency SOS with GPS, spatial audio, and haptic feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo&lt;/strong&gt;: &lt;a href="https://visio-agent-kiofaqcoyq-uc.a.run.app" rel="noopener noreferrer"&gt;visio-agent-kiofaqcoyq-uc.a.run.app&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Google ADK + Gemini 2.5 Flash
&lt;/h3&gt;

&lt;p&gt;The core of Visio is &lt;strong&gt;Google ADK&lt;/strong&gt; (Agent Development Kit) running &lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt; with native bidirectional audio streaming. This was the key decision: a traditional request/response loop adds 2-5 seconds of latency, while BIDI streaming lets Visio see and speak simultaneously with sub-second response times.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;run_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;streaming_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;StreamingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BIDI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;proactivity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProactivityConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proactive_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;context_window_compression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ContextWindowCompressionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;trigger_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sliding_window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SlidingWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;proactive_audio=True&lt;/code&gt; flag is what makes Visio speak without being asked — essential for a navigation agent where the user can't see the screen to trigger responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Server-Side Intelligence
&lt;/h3&gt;

&lt;p&gt;The model alone can't maintain reliable proactivity. I built several server-side systems to bridge the gaps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obstacle Memory&lt;/strong&gt; — The server tracks what hazards the model has reported. When the model says "clear" too soon after detecting obstacles, the server injects a &lt;code&gt;[SCAN AHEAD]&lt;/code&gt; prompt forcing it to check what's next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silence Monitor&lt;/strong&gt; — If the model goes quiet for 7+ seconds while the user is walking, the server nudges it with a &lt;code&gt;[HEARTBEAT]&lt;/code&gt; prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn Re-scan&lt;/strong&gt; — Gyroscope data from the phone detects when the user changes direction. The server immediately injects a &lt;code&gt;[DIRECTION CHANGE]&lt;/code&gt; prompt so the model re-scans the new field of view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Walking Updates&lt;/strong&gt; — Every 5 seconds while the user is moving, the server prompts the model to scan for new obstacles since the last report.&lt;/p&gt;
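&lt;p&gt;Under the hood, these monitors are little more than timers plus string injection. Here is a minimal sketch of how the silence monitor and walking updates might be wired up; the class name, thresholds, and prompt wording are illustrative, not Visio's actual code:&lt;/p&gt;

```python
import time

# Illustrative thresholds -- the real values may differ.
SILENCE_LIMIT = 7.0          # seconds of model silence while walking
WALK_UPDATE_INTERVAL = 5.0   # seconds between forced re-scans

class ProactivityMonitor:
    """Tracks model activity and decides when to inject server-side prompts."""

    def __init__(self):
        now = time.monotonic()
        self.last_model_audio = now
        self.last_walk_update = now

    def on_model_audio(self):
        """Call whenever the model emits audio, resetting the silence timer."""
        self.last_model_audio = time.monotonic()

    def pending_prompts(self, user_is_walking):
        """Return any injection prompts that are due right now."""
        now = time.monotonic()
        prompts = []
        if user_is_walking and now - self.last_model_audio >= SILENCE_LIMIT:
            prompts.append("[HEARTBEAT] User is still walking. What do you see ahead?")
            self.last_model_audio = now
        if user_is_walking and now - self.last_walk_update >= WALK_UPDATE_INTERVAL:
            prompts.append("[WALKING UPDATE] Scan for obstacles new since your last report.")
            self.last_walk_update = now
        return prompts
```

&lt;p&gt;A real server would run this check on every event-loop tick and forward the returned strings into the live session as text turns.&lt;/p&gt;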

&lt;h3&gt;
  
  
  Adaptive Frame Rate
&lt;/h3&gt;

&lt;p&gt;Sending camera frames at a constant rate wastes tokens and money. I used the phone's accelerometer for step detection and speed estimation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stationary&lt;/strong&gt;: 0.5 FPS (save tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow walk&lt;/strong&gt;: 1.3 FPS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normal walk&lt;/strong&gt;: 2 FPS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running&lt;/strong&gt;: 2.5 FPS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined with frame-diff analysis that skips unchanged frames, this cut token usage by roughly 60% with no reduction in safety.&lt;/p&gt;
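&lt;p&gt;Both the speed tiering and the frame-diff skip are simple enough to sketch in a few lines. The speed cutoffs and diff threshold below are illustrative guesses, not Visio's tuned values:&lt;/p&gt;

```python
def target_fps(speed_mps):
    """Map estimated walking speed (m/s) to a camera capture rate.

    The tiers mirror the list above; the speed cutoffs are
    illustrative, not the app's actual thresholds.
    """
    if speed_mps < 0.2:
        return 0.5   # stationary: save tokens
    if speed_mps < 0.9:
        return 1.3   # slow walk
    if speed_mps < 1.8:
        return 2.0   # normal walk
    return 2.5       # running

def should_send(prev_frame, frame, threshold=0.02):
    """Skip frames that barely differ from the last one sent.

    Frames are flat grayscale pixel lists in [0, 255]; the mean
    absolute difference is compared against a fraction of full scale.
    """
    if prev_frame is None:
        return True
    diff = sum(abs(a - b) for a, b in zip(prev_frame, frame)) / len(frame)
    return diff / 255.0 > threshold
```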

&lt;h3&gt;
  
  
  Client-Side Features
&lt;/h3&gt;

&lt;p&gt;The browser client (vanilla JS) does more than just capture and send:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proximity detection&lt;/strong&gt; — Edge analysis on the bottom quarter of each frame detects near-ground obstacles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spatial audio&lt;/strong&gt; — A &lt;code&gt;StereoPannerNode&lt;/code&gt; pans the model's voice based on directional keywords ("on your left" plays from the left headphone)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Haptic feedback&lt;/strong&gt; — Different vibration patterns for critical, warning, and info alerts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-focus&lt;/strong&gt; — Switches to near-range focus in reading mode for close-up text&lt;/li&gt;
&lt;/ul&gt;
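&lt;p&gt;The client is vanilla JS, but the spatial-audio logic translates directly: scan the narration for directional phrases and derive a pan value to feed the &lt;code&gt;StereoPannerNode&lt;/code&gt;. Here the idea is sketched in Python, with an illustrative keyword list and pan magnitude:&lt;/p&gt;

```python
def pan_for_narration(text):
    """Return a stereo pan value in [-1, 1] (-1 = full left, 0 = center,
    +1 = full right) from directional keywords in the narration.
    The phrase list and the 0.8 magnitude are illustrative.
    """
    t = text.lower()
    if "on your left" in t or "to your left" in t:
        return -0.8
    if "on your right" in t or "to your right" in t:
        return 0.8
    return 0.0
```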

&lt;h3&gt;
  
  
  Google Cloud Services
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time multimodal AI with native audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google ADK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent framework with BIDI streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grounding for landmarks and brands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless container hosting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automated Docker builds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Logging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured session logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Firestore&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Session analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deployment is a single command: &lt;code&gt;./deploy.sh PROJECT_ID&lt;/code&gt; handles Cloud Build, Container Registry, and Cloud Run deployment automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Biggest Challenge: Obstacle Chaining
&lt;/h2&gt;

&lt;p&gt;The hardest problem was what I call "obstacle amnesia." The model would warn about a parked bike, the user would pass it, and then... silence. The post 3 meters ahead? Not mentioned.&lt;/p&gt;

&lt;p&gt;The solution was two-fold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System prompt architecture&lt;/strong&gt; — A dedicated "Obstacle Chaining" section that explicitly instructs: after clearing ANY obstacle, immediately scan for the NEXT one. Never go silent after "you're past it."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Server-side scan-ahead prompts&lt;/strong&gt; — When the obstacle memory detects the model said "clear" or "you're past it," it injects: &lt;code&gt;[SCAN AHEAD] You just cleared {obstacle}. Scan for the NEXT obstacle. What's ahead NOW?&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
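&lt;p&gt;The server-side half of this reduces to matching clearing phrases against the last obstacle the server remembers. A hypothetical sketch — the phrase list and function name are mine, not the real implementation:&lt;/p&gt;

```python
# Illustrative clearing phrases; the real detector may be richer.
CLEARING_PHRASES = ("clear", "you're past it", "passed it")

def scan_ahead_prompt(model_text, last_obstacle):
    """If the model just declared an obstacle cleared, build the
    [SCAN AHEAD] injection; otherwise return None.
    """
    if last_obstacle is None:
        return None
    t = model_text.lower()
    if any(phrase in t for phrase in CLEARING_PHRASES):
        return (f"[SCAN AHEAD] You just cleared {last_obstacle}. "
                "Scan for the NEXT obstacle. What's ahead NOW?")
    return None
```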

&lt;p&gt;This simple pattern — prompt engineering + server-side reinforcement — made the difference between a demo that works sometimes and an agent that reliably chains obstacle-to-obstacle without gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Proactive audio needs scaffolding.&lt;/strong&gt; Gemini's proactive audio is powerful but designed for conversation. For continuous narration, server-side prompt injection is essential.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLMs don't reliably persist state.&lt;/strong&gt; Obstacle memory, silence monitoring, turn detection — all of this must live on the server, because the model can't be trusted to remember across turns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System prompt architecture &amp;gt; model parameters.&lt;/strong&gt; The 15-module system prompt (obstacle chaining, people awareness, priority tiers, surface hazards) determines quality more than any configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build for the user who can't see the screen.&lt;/strong&gt; Every design decision — spatial audio, haptic patterns, voice commands — had to work without visual feedback.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live&lt;/strong&gt;: &lt;a href="https://visio-agent-kiofaqcoyq-uc.a.run.app" rel="noopener noreferrer"&gt;visio-agent-kiofaqcoyq-uc.a.run.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demo&lt;/strong&gt;: &lt;a href="https://youtube.com/shorts/t_BfBCpFT9A?si=KEQ0ywabeG-DkxhA" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open the live URL on your phone, grant camera and microphone access, put on headphones, and walk around. Visio will start talking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built for the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; — Live Agents category. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>ai</category>
      <category>gemini</category>
      <category>a11y</category>
    </item>
  </channel>
</rss>
