<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vaibhav Shukla</title>
    <description>The latest articles on Forem by Vaibhav Shukla (@vaibhav_shukla_20).</description>
    <link>https://forem.com/vaibhav_shukla_20</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3800008%2F34e17d59-5601-4b5d-a5e1-013b7f8b4cbd.png</url>
      <title>Forem: Vaibhav Shukla</title>
      <link>https://forem.com/vaibhav_shukla_20</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vaibhav_shukla_20"/>
    <language>en</language>
    <item>
      <title>I Found 60 Accessibility Bugs on Google Support That No Tool Catches</title>
      <dc:creator>Vaibhav Shukla</dc:creator>
      <pubDate>Mon, 16 Mar 2026 17:12:59 +0000</pubDate>
      <link>https://forem.com/vaibhav_shukla_20/i-found-60-accessibility-bugs-on-google-support-that-no-tool-catches-bkg</link>
      <guid>https://forem.com/vaibhav_shukla_20/i-found-60-accessibility-bugs-on-google-support-that-no-tool-catches-bkg</guid>
      <description>&lt;p&gt;I broke Google Support last week. Or rather, I proved it was already broken.&lt;/p&gt;

&lt;p&gt;Not in English. In English it works fine. Lighthouse gives it a decent score, axe-core agrees, everyone moves on.&lt;/p&gt;

&lt;p&gt;I checked the Japanese version.&lt;/p&gt;

&lt;p&gt;The HTML tag said lang="en". On a page written entirely in Japanese. Every ARIA label — Search, Close, Main menu — was still in English. The search placeholder said "Describe your issue" in English. The page title said "Google Help" in English.&lt;/p&gt;

&lt;p&gt;Sixty accessibility attributes on the Japanese page were never translated.&lt;/p&gt;

&lt;p&gt;A blind person using a Japanese screen reader on that page would hear English words randomly mixed into Japanese audio. The screen reader sees lang="en", switches to an English voice, and tries to pronounce Japanese characters with English rules. Complete mess.&lt;/p&gt;

&lt;p&gt;And the kicker: Lighthouse says the page is fine.&lt;/p&gt;

&lt;h2&gt;Why Lighthouse misses it&lt;/h2&gt;

&lt;p&gt;Lighthouse checks whether an aria-label exists. It does not check what language it is in. So aria-label="Search" on a Japanese page gets a green checkmark because the attribute is present.&lt;/p&gt;

&lt;p&gt;Same with alt text. Same with placeholders. Same with the lang attribute — Lighthouse checks that lang exists on the html tag, but not whether it matches the actual content language.&lt;/p&gt;
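&lt;p&gt;To make the gap concrete, here is a small illustrative sketch (Python standard library only, not Locali11y's actual Cheerio-based code) of the kind of check Lighthouse skips: does the declared language match the script the content is actually written in?&lt;/p&gt;

```python
import re

def lang_mismatch(declared_lang, text):
    """Flag a page declared as English whose text is mostly Japanese.

    The Hiragana, Katakana, and CJK ideograph Unicode ranges cover
    the bulk of Japanese prose.
    """
    japanese = re.findall(r"[\u3040-\u30ff\u4e00-\u9fff]", text)
    letters = re.findall(r"[^\W\d_]", text)  # any letter, any script
    if not letters:
        return False
    return declared_lang.lower().startswith("en") and len(japanese) / len(letters) > 0.5

print(lang_mismatch("en", "ヘルプが必要ですか？問題を説明してください"))  # True
print(lang_mismatch("en", "Describe your issue"))  # False
```

&lt;p&gt;A real check would compare against the URL locale rather than guess from script ranges, but even this crude version catches the Google Support case above.&lt;/p&gt;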

&lt;p&gt;To find these issues, you have to do something no tool currently does: open the same page in two languages and compare the accessibility attributes side by side.&lt;/p&gt;

&lt;p&gt;That is what I spent the last few days building.&lt;/p&gt;

&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;Locali11y fetches the English and Japanese (or any other language) versions of the same page. It parses both with Cheerio — no browser needed, just raw HTML parsing. Then it runs twenty checks that are specifically designed around what breaks during translation.&lt;/p&gt;

&lt;p&gt;Not generic WCAG checks. Targeted questions like:&lt;/p&gt;

&lt;p&gt;Is the aria-label on this button identical to the English version? If yes, it was probably never translated.&lt;/p&gt;

&lt;p&gt;Does the html lang attribute match the page locale? If the page is in Japanese but lang says "en", something went wrong in the deployment.&lt;/p&gt;

&lt;p&gt;Is the placeholder text the same as the English page? Is the page title? The meta description?&lt;/p&gt;

&lt;p&gt;These are simple string comparisons. The code that does the core detection is about twenty lines. It collects all ARIA labels from the English page, then checks each label on the translated page to see if it matches. If a Japanese page has aria-label="Search" and the English page also has aria-label="Search", that label was never localized.&lt;/p&gt;
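&lt;p&gt;Locali11y itself does this with Cheerio in Node; purely as an illustration of the comparison, here is an equivalent sketch in Python using only the standard library (the names are mine, not the tool's):&lt;/p&gt;

```python
from html.parser import HTMLParser

class AriaLabelCollector(HTMLParser):
    """Collect every aria-label value in a document."""
    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "aria-label" and value:
                self.labels.append(value)

def untranslated_labels(english_html, localized_html):
    """aria-labels that appear verbatim in both versions were likely never translated."""
    en, loc = AriaLabelCollector(), AriaLabelCollector()
    en.feed(english_html)
    loc.feed(localized_html)
    english = set(en.labels)
    return [label for label in loc.labels if label in english]

# "\x3c" is the escape for the less-than sign, keeping this sketch free of raw angle brackets.
en_page = '\x3cbutton aria-label="Search"\x3e\x3c/button\x3e'
ja_page = '\x3cbutton aria-label="Search"\x3e\x3c/button\x3e\x3cinput aria-label="検索"\x3e'
print(untranslated_labels(en_page, ja_page))  # ['Search']
```

&lt;p&gt;One caveat for any such comparison: some short labels ("OK", brand names) are legitimately identical across languages, so matches are signals to review, not automatic failures.&lt;/p&gt;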

&lt;p&gt;No machine learning. No complex parsing. Just a comparison nobody built into a tool before.&lt;/p&gt;

&lt;h2&gt;What I found on real sites&lt;/h2&gt;

&lt;p&gt;Google Support: 60 locale-specific issues on the Japanese page. Score dropped from 84 to 71. The html lang bug is the worst one — it affects how the entire page is announced by screen readers.&lt;/p&gt;

&lt;p&gt;IKEA Japan: carousel buttons saying "See previous items" and "See next items" in English on a Japanese page. Those are navigation controls that blind users rely on.&lt;/p&gt;

&lt;p&gt;The pattern was the same everywhere I tested. Teams translate the visible content — headings, paragraphs, button text. They forget the invisible stuff — ARIA labels, alt attributes, placeholder values, metadata. And no tool in the standard workflow catches the gap because every tool checks one page at a time.&lt;/p&gt;

&lt;h2&gt;The technical choices&lt;/h2&gt;

&lt;p&gt;I used Cheerio instead of Playwright because the attributes I care about exist in the static HTML. Running a full browser for each page would add thirty seconds per locale and create deployment headaches. Cheerio parses HTML in milliseconds.&lt;/p&gt;

&lt;p&gt;The app is built with Next.js and uses Supabase for storing audit results. The scoring groups issues by check type so that one category (like missing alt on fifty product images) cannot tank the entire score alone.&lt;/p&gt;
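&lt;p&gt;The post doesn't show the scoring formula, so treat this as a guess at its shape: a per-category penalty cap keeps one noisy check from dominating.&lt;/p&gt;

```python
def grouped_score(issues, max_penalty_per_group=10, base=100):
    """Cap each check type's penalty so fifty missing alt attributes
    cost no more than the per-group maximum."""
    penalties = {}
    for issue in issues:
        group = issue["check"]
        penalties[group] = min(penalties.get(group, 0) + 1, max_penalty_per_group)
    return max(0, base - sum(penalties.values()))

issues = [{"check": "missing-alt"}] * 50 + [{"check": "lang-mismatch"}]
print(grouped_score(issues))  # 89, not 49: the alt-text group is capped at 10
```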

&lt;p&gt;I used Lingo.dev for the app's own translations. The tool itself works in English, Spanish, Japanese, and Chinese. Building a localization auditor with poor localization would have been embarrassing.&lt;/p&gt;

&lt;h2&gt;What surprised me&lt;/h2&gt;

&lt;p&gt;Browsers do not fix this problem. Chrome's translate feature changes visible text on a page. It does not touch HTML attributes. So even if you translate a page with Chrome, the aria-labels stay English. There was a Chrome bug filed about this in 2019. In 2023 it regressed. As of mid-2025, no major browser reliably translates hidden accessibility attributes.&lt;/p&gt;

&lt;p&gt;The EU Accessibility Act requires accessible digital products across all supported languages. Having lang="en" on a Japanese page is a compliance issue. Most companies have no idea this is happening on their sites because no tool checks for it.&lt;/p&gt;

&lt;h2&gt;What is missing&lt;/h2&gt;

&lt;p&gt;Locali11y currently checks one page per locale. A full site crawler would be more useful but was too complex for the hackathon timeline. The scoring system is simple and approximate. The fix suggestions use AI and would need human review for production use.&lt;/p&gt;

&lt;p&gt;If someone forks this, the most useful next step would be CI integration — running these checks automatically on every deploy and failing the build when locale accessibility drops below a threshold.&lt;/p&gt;
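&lt;p&gt;For anyone taking up that CI idea, the gate itself is trivial; the real work is producing the scores. A hypothetical sketch (the score dict would come from running the audit inside the pipeline, and the threshold of 80 is an arbitrary example):&lt;/p&gt;

```python
def gate(locale_scores, threshold=80):
    """CI gate: return 1 (fail the build) if any locale's accessibility
    score falls below the threshold, else 0."""
    failing = {loc: s for loc, s in locale_scores.items() if threshold > s}
    for loc, s in sorted(failing.items()):
        print(f"FAIL {loc}: locale accessibility score {s} is below {threshold}")
    return 1 if failing else 0

# Hypothetical audit output, shaped like the Google Support numbers above.
print(gate({"en": 84, "ja": 71}))  # prints a FAIL line for "ja", then 1
```

&lt;p&gt;In a pipeline you would pass that return value to &lt;code&gt;sys.exit&lt;/code&gt; so a regression in any locale fails the deploy.&lt;/p&gt;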

&lt;h2&gt;Links&lt;/h2&gt;

&lt;p&gt;Live: &lt;a href="https://locali11y.vercel.app/en" rel="noopener noreferrer"&gt;https://locali11y.vercel.app/en&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/Vaibhav13Shukla/locali11y" rel="noopener noreferrer"&gt;https://github.com/Vaibhav13Shukla/locali11y&lt;/a&gt;&lt;br&gt;
Video: &lt;a href="https://youtu.be/dWck2xBytMs" rel="noopener noreferrer"&gt;https://youtu.be/dWck2xBytMs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built for the Lingo.dev Multilingual Hackathon #3. App localization powered by Lingo.dev.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>lingdotdev</category>
      <category>hackathon</category>
      <category>javascript</category>
    </item>
    <item>
      <title>The AI That Never Forgets — My Vision Agents Hackathon Journey</title>
      <dc:creator>Vaibhav Shukla</dc:creator>
      <pubDate>Sun, 01 Mar 2026 18:44:00 +0000</pubDate>
      <link>https://forem.com/vaibhav_shukla_20/the-ai-that-never-forgets-my-vision-agents-hackathon-journey-34g</link>
      <guid>https://forem.com/vaibhav_shukla_20/the-ai-that-never-forgets-my-vision-agents-hackathon-journey-34g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrrlxnmlj8nb2mw5zjs0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrrlxnmlj8nb2mw5zjs0.jpeg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui86gn5k03nouade73hs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui86gn5k03nouade73hs.jpeg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0qn9a5rfdma6816fqsb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0qn9a5rfdma6816fqsb.jpeg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I solved video AI's biggest blind spot — amnesia — by building a real-time temporal memory engine on top of the Vision Agents SDK by Stream.
&lt;/h2&gt;


&lt;blockquote&gt;
&lt;p&gt;🏆 &lt;strong&gt;Built for the &lt;a href="https://www.wemakedevs.org/hackathons/vision" rel="noopener noreferrer"&gt;Vision Possible: Agent Protocol Hackathon&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
⚡ &lt;strong&gt;Powered by &lt;a href="https://github.com/GetStream/Vision-Agents" rel="noopener noreferrer"&gt;Vision Agents SDK by Stream&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🎥 Watch ARGUS in Action
&lt;/h2&gt;

&lt;p&gt;Before I explain how I built it, you have to see it to believe it. Here is ARGUS detecting objects, tracking them over time, and answering questions about the past in real-time.&lt;/p&gt;


&lt;p&gt;&lt;em&gt;&lt;a href="https://youtu.be/DXfTK2XOzMk" rel="noopener noreferrer"&gt;Watch the ARGUS demo on YouTube&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚨 The Problem: AI Has Amnesia
&lt;/h2&gt;

&lt;p&gt;I realized something frustrating while testing modern Video AI demos. They are brilliant at telling you what is happening &lt;em&gt;right now&lt;/em&gt;, but they are terrible at telling you what happened &lt;em&gt;5 minutes ago&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If I drop my keys and ask a standard AI agent, "Where are my keys?", it looks at the &lt;em&gt;current&lt;/em&gt; frame, sees nothing, and says: "I don't see any keys."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Vision Agents SDK documentation actually highlighted this limitation:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Longer videos can cause the AI to lose context. For instance, if it's watching a soccer match, it will get confused after 30 seconds."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That was my lightbulb moment. 💡&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mission:&lt;/strong&gt; Build &lt;strong&gt;ARGUS&lt;/strong&gt;, a real-time agent that doesn't just "see" video—it &lt;strong&gt;remembers&lt;/strong&gt; it.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 What is ARGUS?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ARGUS&lt;/strong&gt; is a multimodal AI agent that watches live video, tracks objects using computer vision, and maintains a &lt;strong&gt;Temporal Memory Engine&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Unlike standard agents that process &lt;code&gt;Frame → Detect → Forget&lt;/code&gt;, ARGUS uses a stateful pipeline (in Mermaid notation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph LR
A[Camera Feed] --&amp;gt; B(YOLO26 Detection)
B --&amp;gt; C{Temporal Memory Engine}
C --&amp;gt; D[Update Object History]
C --&amp;gt; E[Log Events]
D &amp;amp; E --&amp;gt; F[LLM Context]
F --&amp;gt; G((Voice Response))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key Capabilities
👁️ Real-time Tracking: Uses YOLO26 Nano + ByteTrack to assign persistent IDs to objects.
🕰️ Time Travel: Can answer "What did I hold up 2 minutes ago?"
📍 Spatial Awareness: Converts raw coordinates into human terms like "top-left" or "center."
🗣️ Voice Interaction: Full duplex voice conversation with &amp;lt;1s latency.
💬 Real Conversations With ARGUS
These are actual interactions from my testing sessions:

Terminal Logs showing events
Real-time event logging showing objects appearing and moving.

I Said  ARGUS Responded
"What do you see?"  "Person ID:2 at middle-center, Cup ID:3 at bottom-right"
"What am I holding?"    "You appear to be holding a bottle, ID:7"
"What just moved?"  "Cup moved from bottom-left to bottom-right at 2:05 PM"
"Summarize everything"  "Person appeared at center 30s ago. Cup moved left to right at 2:05"
⚡ Response time: ~1 second
🧠 All answers came from temporal memory — not from re-analyzing the video frame.

🏗️ The Architecture &amp;amp; Tech Stack
I needed a stack that was fast, cheap, and capable of handling real-time video streams without melting my laptop.

Component   Technology  Why I Chose It
Framework   Vision Agents SDK   It handled all the WebRTC/Audio/Video piping for me.
Vision Model    YOLO26 Nano Benchmarked at 130ms/frame on CPU. Fast &amp;amp; Accurate.
Reasoning   Llama 3.3 via OpenRouter    Fast inference with tool-calling capabilities.
Speech  Deepgram (STT) + ElevenLabs (TTS)   The lowest latency combo available.
Transport   Stream Edge Network Kept video latency under 30ms.
🛠️ The Build Journey
1. The "Secret Weapon": Temporal Memory Engine
This is the heart of the project. I wrote a custom Python class that sits between the vision processor and the LLM.

Instead of feeding raw video frames to the LLM (which is slow and expensive), I feed it structured event logs.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
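&lt;p&gt;The post only shows one line of the engine, so what follows is my own minimal reconstruction of the idea, not the actual ARGUS code: a zone mapper plus a rolling event log that renders into LLM context. All names are assumed.&lt;/p&gt;

```python
import time
from collections import deque

def zone_for(cx, cy, width, height):
    """Map an object's center point to a human-readable 3x3 grid zone."""
    cols = ["left", "center", "right"]
    rows = ["top", "middle", "bottom"]
    col = cols[min(int(3 * cx / width), 2)]
    row = rows[min(int(3 * cy / height), 2)]
    return f"{row}-{col}"

class TemporalMemory:
    """Rolling event log that turns detections into LLM-ready context."""

    def __init__(self, max_events=100):
        self.events = deque(maxlen=max_events)  # bounded, so memory can't grow forever
        self.last_seen = {}                     # track_id: (class_name, zone)

    def observe(self, track_id, class_name, cx, cy, width, height):
        zone = zone_for(cx, cy, width, height)
        prev = self.last_seen.get(track_id)
        if prev is None:
            self._log(f"{class_name} (ID:{track_id}) appeared at {zone}")
        elif prev[1] != zone:
            self._log(f"{class_name} (ID:{track_id}) moved from {prev[1]} to {zone}")
        self.last_seen[track_id] = (class_name, zone)

    def _log(self, message):
        self.events.append((time.time(), message))

    def context(self):
        """Structured text injected into the LLM prompt instead of raw frames."""
        lines = ["[ARGUS MEMORY]"]
        for class_name, zone in self.last_seen.values():
            lines.append(f"{class_name}: last seen at {zone}")
        lines.extend(f"Event: {m}" for _, m in self.events)
        return "\n".join(lines)

memory = TemporalMemory()
memory.observe(2, "Cup", 100, 400, 640, 480)  # appears bottom-left
memory.observe(2, "Cup", 600, 400, 640, 480)  # moves to bottom-right
print(memory.context())
```

&lt;p&gt;The payoff of this shape is that the LLM's input grows with the number of &lt;em&gt;events&lt;/em&gt;, not the number of frames.&lt;/p&gt;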



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Core logic: if an object moves zones, log it.
if old_zone != zone:
    self._log("moved", f"{class_name} (ID:{track_id}) moved from {old_zone} to {zone}")&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;When I ask, "Where is the cup?", the LLM receives this context injection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
[ARGUS MEMORY]

Cup (ID:2): Last seen at bottom-right at 12:05 PM.
Person (ID:1): Currently visible at center.
Event: Cup moved from left to right 30 seconds ago.
2. Building the Custom Processor
Using the SDK's VideoProcessorPublisher pattern was intuitive. I could access the raw av.VideoFrame, run my YOLO inference, draw bounding boxes, and push the frame back to the browser.

ARGUS Detection View
ARGUS tracking objects with persistent IDs and Spatial Zones.

3. Solving the Latency Problem
My first prototype had 5-second delays. To fix this, I optimized ruthlessly:

Switched from Gemini (Rate limits) to OpenRouter/Llama.
Switched YOLO11 to YOLO26 Nano (7.7 FPS on CPU).
Used human-readable zones ("top-left") instead of raw coordinates, reducing token usage for the LLM.
🧪 Benchmark Results
I ran a diagnostic script to prove efficiency on a standard laptop (No GPU):

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benchmark Results&lt;/p&gt;

&lt;p&gt;Model   Speed   Max FPS Verdict&lt;br&gt;
YOLO26 Nano 130ms   7.7 ✅ Winner&lt;br&gt;
YOLOv8 Nano 138ms   7.2 Solid&lt;br&gt;
YOLO11 Small    310ms   3.2 Too slow&lt;/p&gt;
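&lt;p&gt;The max-FPS figures follow directly from per-frame latency; a quick sanity check of those numbers:&lt;/p&gt;

```python
def max_fps(ms_per_frame):
    """Upper bound on throughput if inference were the only per-frame cost."""
    return round(1000.0 / ms_per_frame, 1)

for model, ms in [("YOLO26 Nano", 130), ("YOLOv8 Nano", 138), ("YOLO11 Small", 310)]:
    print(model, max_fps(ms), "fps")  # 7.7, 7.2, 3.2
```

&lt;p&gt;In practice detection shares the CPU with encoding and the rest of the pipeline, so real throughput sits below these ceilings.&lt;/p&gt;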



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

The Vision Agents SDK was crucial here. Because it handles the video transport efficiently, I could use all my CPU cycles for the actual detection logic.

💡 The "Aha!" Moment
The magic happened during a test run. I held up a water bottle, put it down, and waited. Then I asked:

Me: "What did I just show you?"

ARGUS: "You were holding a bottle (ID:7) at the center of the screen about 15 seconds ago."

The Aha Moment

It wasn't looking at the bottle now. It remembered. That feeling of interacting with an AI that has object permanence is wild.

🌍 Why This Matters
Hackathons often produce cool demos that don't solve real problems. ARGUS solves the context window problem for video.

By abstracting video into structured temporal data, we can build agents that:

Monitor security feeds for hours and summarize activity.
Help find lost items in a room.
Analyze workflow efficiency in factories.
The Vision Agents SDK made this possible by removing the complexity of WebRTC and audio handling, allowing me to focus entirely on the memory innovation.

🔗 Links &amp;amp; Resources
Code Repository: GitHub - [ARGUS](https://github.com/Vaibhav13Shukla/argus)
Vision Agents SDK: Star the Repo!
Hackathon: Vision Possible
Thanks to Stream and WeMakeDevs for this challenge. It pushed me to build something I didn't think was possible in a weekend!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>hackathon</category>
      <category>computervision</category>
    </item>
  </channel>
</rss>
