<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Max Luong</title>
    <description>The latest articles on Forem by Max Luong (@maxluong).</description>
    <link>https://forem.com/maxluong</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3625373%2F2d29f424-339c-4af5-b5c9-05c91a38487d.jpg</url>
      <title>Forem: Max Luong</title>
      <link>https://forem.com/maxluong</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/maxluong"/>
    <language>en</language>
    <item>
      <title>Why Do AI Agents Fail in Production? (And How to Fix the "Silent Click")</title>
      <dc:creator>Max Luong</dc:creator>
      <pubDate>Thu, 11 Dec 2025 08:50:14 +0000</pubDate>
      <link>https://forem.com/maxluong/why-do-ai-agents-fail-in-production-and-how-to-fix-the-silent-click-nk6</link>
      <guid>https://forem.com/maxluong/why-do-ai-agents-fail-in-production-and-how-to-fix-the-silent-click-nk6</guid>
      <description>&lt;h2&gt;
  
  
  Part 2: Moving from toy scripts to enterprise architecture using Qwen2-VL, Set-of-Mark, and Playwright.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Series: Building AI Web Agents
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;&lt;a href="https://dev.to/duc_luong_10d1d0d0f726c3a/i-tried-to-teach-ai-to-click-buttons-and-it-missed-by-500-pixels-33ln"&gt;Part 1: I Tried to Teach AI to Click Buttons, and It Missed by 500 Pixels&lt;/a&gt;&lt;/strong&gt;, I shared the painful reality of building my first web agent. I fed a screenshot to a standard multimodal model, asked for coordinates, and watched it hallucinate a click into the white void of a webpage.&lt;/p&gt;

&lt;p&gt;It turns out, guessing (X, Y) pixel coordinates is a fragile game.&lt;/p&gt;

&lt;p&gt;If you are just playing around, a 70% success rate feels fine. If you are building an enterprise agent to automate 1,000 tasks, 70% means 300 failed tasks. That is a disaster.&lt;/p&gt;

&lt;p&gt;In this post, I'm breaking down the architecture that &lt;strong&gt;actually works in production&lt;/strong&gt;: The &lt;strong&gt;Generator-Executor Pattern&lt;/strong&gt;, powered by &lt;strong&gt;Qwen2-VL&lt;/strong&gt; and &lt;strong&gt;Set-of-Mark (SoM)&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem: The "Blind" Brain vs. The "Blurry" Eye
&lt;/h2&gt;

&lt;p&gt;Why did my first agent fail? It faced two massive walls that every developer in this space hits eventually:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Pure LLMs are Blind
&lt;/h3&gt;

&lt;p&gt;A text-only model reading HTML code (Llama 3, or GPT-4 without its vision input) fails on modern web apps. It can't see inside &lt;code&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt; elements or deep Shadow DOMs, and it is blind to visual trickery (like a popup advertisement physically covering the "Login" button).&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Standard VLMs are Blurry
&lt;/h3&gt;

&lt;p&gt;Most older vision models squash your beautiful 4K screenshot into a tiny 336 × 336 square. A "Submit" button becomes a smudge. If the model can't clearly see the button, it definitely can't give you accurate coordinates for it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution: The Hybrid "Neuro-Symbolic" Stack
&lt;/h2&gt;

&lt;p&gt;To build a production-grade agent, we stop asking the AI to &lt;strong&gt;act&lt;/strong&gt;. Instead, we ask the AI to &lt;strong&gt;plan&lt;/strong&gt;, and let dumb code do the acting.&lt;/p&gt;

&lt;p&gt;We need three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Eye (Qwen2-VL)&lt;/strong&gt;: chosen specifically because it uses Naive Dynamic Resolution (it doesn't squash images) and M-RoPE (it understands 2D position natively).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Map (Set-of-Mark)&lt;/strong&gt;: We don't ask for pixels; we ask for labels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Hand (Playwright)&lt;/strong&gt;: Deterministic execution code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the &lt;strong&gt;Generator-Executor workflow&lt;/strong&gt; that moves us from "Toy" to "Tool":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    GENERATOR-EXECUTOR FLOW                   │
└─────────────────────────────────────────────────────────────┘

    🌐 Web Page Loaded
         │
         ▼
    📍 STEP 1: SET-OF-MARK INJECTION
         │
         ├─► Inject JavaScript into page
         ├─► Add red numbered badges to all interactive elements
         └─► Build selector map: {id: selector}
         │
         ▼
    📸 Take Screenshot (with numbered labels)
         │
         ▼
    🧠 STEP 2: GENERATOR (Qwen2-VL)
         │
         ├─► AI analyzes screenshot
         ├─► Identifies target element
         └─► Returns: "Target ID = 42"
         │
         ▼
    🎯 STEP 3: EXECUTOR (Playwright)
         │
         ├─► Retrieve selector from window.som_map[42]
         ├─► Snapshot BEFORE state (URL, DOM)
         └─► Execute: page.click(selector)
         │
         ▼
    ✅ STEP 4: VERIFICATION LOOP
         │
         ├─► Did URL change? ──────────────► ✓ SUCCESS
         │
         ├─► Did DOM change significantly? ─► ✓ SUCCESS
         │
         └─► No change detected? ──────────► ⚠️  SILENT FAILURE
                  │
                  ▼
            🤖 AI Visual Judge
                  │
                  ├─► Compare before/after screenshots
                  ├─► Analyze what happened
                  │
                  ▼
            Decision Point:
                  │
                  ├─► Retry ────────────► (loop back to Step 2)
                  │
                  └─► Fail ─────────────► ❌ Report Error &amp;amp; Stop

    ✓ SUCCESS ──► Continue to next task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Visual Grounding (The Setup)
&lt;/h2&gt;

&lt;p&gt;First, we solve the coordinate problem. Instead of asking the AI to guess pixels, we inject JavaScript to label every interactive element with a big red number.&lt;/p&gt;

&lt;p&gt;This is called &lt;strong&gt;Set-of-Mark (SoM)&lt;/strong&gt; prompting.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Injection Script (JavaScript)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// inject_som.js&lt;/span&gt;
&lt;span class="c1"&gt;// This runs inside the browser via Playwright&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;markElements&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Select everything clickable&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button, a, input, [role="button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBoundingClientRect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;height&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Skip invisible stuff&lt;/span&gt;

    &lt;span class="c1"&gt;// Create the visual badge&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;badge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;div&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;absolute&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;px&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;px&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;background&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;red&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;white&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fontSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;12px&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;10000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// CRITICAL: Map ID back to a unique selector for code usage&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;som_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;som_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;som_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getUniqueSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: The Generator (Qwen2-VL)
&lt;/h2&gt;

&lt;p&gt;Now, we take a screenshot of those red numbers. We ask Qwen2-VL a multiple-choice question: &lt;strong&gt;"User wants to Login. Which Number is the button?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This changes the task from &lt;strong&gt;Regression&lt;/strong&gt; (hard math) to &lt;strong&gt;Classification&lt;/strong&gt; (easy reading).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The Python Manager
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen2VLForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;

&lt;span class="c1"&gt;# Load the specialist model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen2VLForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2-VL-2B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2-VL-2B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_target_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_goal&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User Goal: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_goal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Look at the screenshot with red numbered boxes. Return ONLY the number of the element needed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# ... standard Qwen2-VL inference code ...
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predicted_number&lt;/span&gt; &lt;span class="c1"&gt;# e.g., 42
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
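&lt;p&gt;In practice, even with a "Return ONLY the number" instruction, small VLMs occasionally reply with extra words ("The answer is 42."). A tiny parsing helper hardens Step 2. This is my own sketch, not part of the original code: the &lt;code&gt;parse_target_id&lt;/code&gt; name and the idea of validating against the known SoM IDs are assumptions.&lt;/p&gt;

```python
import re

def parse_target_id(reply, valid_ids):
    """Pull the first integer out of the model's reply and check it
    against the IDs we actually painted onto the page."""
    match = re.search(r"\d+", reply)
    if not match:
        return None  # Model refused, or produced prose with no number
    target = int(match.group())
    return target if target in valid_ids else None  # Reject unknown IDs

print(parse_target_id("The answer is 42.", {7, 42, 99}))  # 42
print(parse_target_id("I don't see a login button.", {7, 42}))  # None
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; here should route straight to the retry branch of the verification loop instead of clicking a bogus selector.&lt;/p&gt;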



&lt;h2&gt;
  
  
  Step 3: The Executor &amp;amp; The "Silent Failure" Check
&lt;/h2&gt;

&lt;p&gt;This is where production agents die. You click a button... and nothing happens. Did it fail? Or was the site just slow? Or was the button a dud?&lt;/p&gt;

&lt;p&gt;We can't just &lt;code&gt;click()&lt;/code&gt;. We need a &lt;strong&gt;Predict-Verify Loop&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predict&lt;/strong&gt;: Ask the AI before clicking: "If I click 'Save', what should happen?" (Expectation: Network Request).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt;: Check if that actually happened.
&lt;/li&gt;
&lt;/ul&gt;
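&lt;p&gt;The "Did the DOM change significantly?" half of the Verify step doesn't need AI at all. Here is a coarse, standard-library-only sketch (my own illustration; the 2% threshold is an arbitrary assumption you would tune per site):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def dom_changed_significantly(html_before, html_after, threshold=0.02):
    """True if the serialized DOM differs by more than `threshold`.
    quick_ratio() is an inexpensive upper bound on similarity, so this
    only fires when the page has definitely changed."""
    similarity = SequenceMatcher(None, html_before, html_after).quick_ratio()
    return (1.0 - similarity) > threshold
```

&lt;p&gt;If this returns &lt;code&gt;False&lt;/code&gt; and the URL is unchanged, you are in the "Silent Failure" branch, and it is time to wake up the AI Visual Judge.&lt;/p&gt;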

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;robust_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qwen_model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Retrieve the selector from our JS Map (100% precision)
&lt;/span&gt;    &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window.som_map[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Snapshot state BEFORE action
&lt;/span&gt;    &lt;span class="n"&gt;url_before&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. EXECUTE
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CRITICAL FAIL: Element not clickable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# A networkidle timeout is not a click failure; fall through to Verify
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_load_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. VERIFY (The Judge)
&lt;/span&gt;    &lt;span class="c1"&gt;# Did the URL change?
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;url_before&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SUCCESS: Navigation detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Did the DOM change significantly?
&lt;/span&gt;    &lt;span class="c1"&gt;# If not, we trigger the AI Judge to compare Before/After screenshots
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARNING: Silent Failure - Needs AI Visual Inspection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
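&lt;p&gt;The "Retry → loop back to Step 2" branch of the flow chart fits in a few lines of plain Python. This is my own sketch: &lt;code&gt;generate&lt;/code&gt;, &lt;code&gt;execute&lt;/code&gt;, and &lt;code&gt;visual_judge&lt;/code&gt; are hypothetical stand-ins for the Qwen2-VL call, &lt;code&gt;robust_click&lt;/code&gt;, and the before/after screenshot comparison, injected as callables so the loop itself has no Playwright dependency:&lt;/p&gt;

```python
def run_step(user_goal, generate, execute, visual_judge, max_retries=2):
    """One Generator-Executor step with the verification fallback.

    generate(goal)     -> Set-of-Mark target ID        (Step 2)
    execute(target_id) -> "SUCCESS: ..." / "WARNING: ..." / "CRITICAL ..."
    visual_judge(goal) -> True if screenshots show real progress (Step 4)
    """
    for _attempt in range(max_retries + 1):
        target_id = generate(user_goal)   # Re-generate on every retry:
        outcome = execute(target_id)      # the page may have changed
        if outcome.startswith("SUCCESS"):
            return outcome
        if outcome.startswith("WARNING") and visual_judge(user_goal):
            return "SUCCESS: Confirmed by visual judge"
        # Confirmed silent failure or critical fail: loop back to Step 2
    return "FAIL: Retries exhausted"
```

&lt;p&gt;Keeping the loop free of browser state also makes it trivial to unit-test with stub callables.&lt;/p&gt;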



&lt;h2&gt;
  
  
  The Verdict: Do We Need Reinforcement Learning?
&lt;/h2&gt;

&lt;p&gt;I initially thought I needed Reinforcement Learning (RL) to train a "Super Agent." I was wrong.&lt;/p&gt;

&lt;p&gt;For 95% of use cases, &lt;strong&gt;RL is a trap&lt;/strong&gt;. It's complex, expensive, and hard to debug (if the agent makes a typo, do you punish it?).&lt;/p&gt;

&lt;p&gt;The "State-of-the-Art" right now isn't a smarter brain; it's a &lt;strong&gt;better system&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Set-of-Mark&lt;/strong&gt; fixes the vision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen2-VL&lt;/strong&gt; fixes the reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification Loops&lt;/strong&gt; fix the reliability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By moving to this Generator-Executor pattern, my agent stopped missing by 500 pixels. It now hits the target every single time—because it's not guessing pixels anymore. It's reading map coordinates.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of my journey into Multimodal AI. In Part 3, I'll be deploying this onto a live server to see how much it costs to run 10,000 steps.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Transformer: From O(N^2) to Light Speed – 3 Core Hacks Powering Modern LLMs</title>
      <dc:creator>Max Luong</dc:creator>
      <pubDate>Thu, 11 Dec 2025 08:13:18 +0000</pubDate>
      <link>https://forem.com/maxluong/transformer-from-on2-to-light-speed-3-core-hacks-powering-modern-llms-5akn</link>
      <guid>https://forem.com/maxluong/transformer-from-on2-to-light-speed-3-core-hacks-powering-modern-llms-5akn</guid>
      <description>&lt;h2&gt;
  
  
  🚀 Transformer: From O(N²) to Light Speed – 3 Core Hacks Powering Modern LLMs
&lt;/h2&gt;

&lt;p&gt;If you've played with Llama, Mistral, or Gemini, you know Large Language Models (LLMs) are revolutionary. But underneath the magic of coherent text generation lies a massive bottleneck from the original 2017 Transformer architecture: &lt;strong&gt;the quadratic complexity O(N²) problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This O(N²) wall fundamentally limits how long a conversation (context) your model can handle. In production, it translates directly into high GPU costs and slow service.&lt;/p&gt;

&lt;p&gt;In this post, we'll dive into the heart of this mathematical problem and explore three brilliant modern hacks—two architectural and one positional—that allow LLMs to run faster and scale better today.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Part 1: The O(N²) Bottleneck: The Cost of Global Attention&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The core of the Transformer is the &lt;strong&gt;Self-Attention&lt;/strong&gt; mechanism.&lt;/p&gt;

&lt;p&gt;The cost arises when we calculate the similarity score between every single word (Query) and every other word (Key) in the input sequence.&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Attention(Q,K,V)=Softmax(QKTdk)V
\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Attention&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;The Problematic 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;QKTQK^T &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 Term&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your input sequence has a length of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;NN &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 tokens (words):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Matrix Size:&lt;/strong&gt; The 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;QKTQK^T &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 operation results in an &lt;strong&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;N×NN \times N &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 Attention Score matrix&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost:&lt;/strong&gt; When 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;NN &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 doubles (e.g., from 1,000 to 2,000 tokens), the computational cost and the GPU memory required to store the 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;N×NN \times N &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 score matrix both &lt;strong&gt;quadruple&lt;/strong&gt; (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2N2=4N22N^2 = 4N^2 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;4&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Length (N)&lt;/th&gt;
&lt;th&gt;Computational Cost (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;N2N^2 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
)&lt;/th&gt;
&lt;th&gt;Cost Increase Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;1,048,576&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;1×1\times &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mord"&gt;×&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;16,777,216&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;16×16\times &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;16&lt;/span&gt;&lt;span class="mord"&gt;×&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;td&gt;67,108,864&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;64×64\times &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;64&lt;/span&gt;&lt;span class="mord"&gt;×&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This quadratic scaling is why serving LLMs with long contexts is so expensive: every doubling of the context quadruples both the compute and the memory spent on the score matrix.&lt;/p&gt;
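&lt;p&gt;The table's numbers fall out of a couple of lines of arithmetic. Here is a quick sketch (assuming fp16 scores, i.e. 2 bytes each, and a single head in a single layer):&lt;/p&gt;

```python
def score_matrix_cost(n: int, bytes_per_score: int = 2) -> tuple[int, float]:
    """Number of attention scores and their memory (MiB) for one head, one layer."""
    scores = n * n
    return scores, scores * bytes_per_score / 2**20

for n in (1024, 4096, 8192):
    scores, mib = score_matrix_cost(n)
    print(f"N={n}: {scores:,} scores, {mib:.0f} MiB, {(n // 1024) ** 2}x baseline")
```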




&lt;h2&gt;
  
  
  &lt;strong&gt;Part 2: Optimizing Serving Memory with Grouped-Query Attention (GQA)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When an LLM is serving requests (decoding), it must store the calculated Key (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) and Value (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) vectors from previous tokens in a memory buffer called the &lt;strong&gt;KV Cache&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The MHA Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the original Multi-Head Attention (MHA), if you have 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;HH &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 heads (e.g., 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;H=32H=32 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;H&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;32&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
), you store 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;HH &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 separate pairs of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 per token, at every layer. The KV Cache quickly becomes the single biggest memory bottleneck on the GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The GQA Hack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Grouped-Query Attention (GQA)&lt;/strong&gt; is an architectural tweak to reduce the size of this KV Cache, drastically improving serving efficiency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Instead of having a dedicated 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 head for every Query head, GQA allows &lt;strong&gt;multiple Query heads to share&lt;/strong&gt; the same 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 pair.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Win:&lt;/strong&gt; If 8 Query heads share 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;1K/V1 K/V &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="mord"&gt;/&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 pair, you reduce the KV Cache memory footprint by a factor of 8.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact on Production:&lt;/strong&gt; A smaller KV Cache means the GPU can hold more user contexts simultaneously, leading to higher &lt;strong&gt;Throughput&lt;/strong&gt; (more requests served per second).&lt;/li&gt;
&lt;/ul&gt;
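&lt;p&gt;A back-of-the-envelope sketch makes the saving concrete. The layer/head/dimension numbers below are hypothetical (roughly 7B-class shapes), not taken from any particular model:&lt;/p&gt;

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_el: int = 2) -> int:
    """KV Cache size for one sequence: K and V (the leading 2) per layer, per KV head (fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el

# Hypothetical 32-layer model with 32 query heads, head_dim=128, at 4k context.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)  # MHA: one KV pair per head
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=4,  head_dim=128, seq_len=4096)  # GQA: 8 query heads share each KV pair
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.2f} GiB, saving: {mha // gqa}x")
```

The freed memory translates directly into more concurrent sequences per GPU, which is where the throughput win comes from.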




&lt;h2&gt;
  
  
  &lt;strong&gt;Part 3: Solving the Context Length Wall: Rotary Positional Embedding (RoPE)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The original sinusoidal Positional Encoding (PE) relies on simple &lt;strong&gt;addition&lt;/strong&gt; to combine positional information with the word embedding. This scheme breaks down when the model encounters sequence lengths longer than those it was trained on (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Ninfer&amp;gt;NtrainN_{\text{infer}} &amp;gt; N_{\text{train}} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;infer&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;train&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The RoPE Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rotary Positional Embedding (RoPE)&lt;/strong&gt; is a mathematically elegant solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Instead of simple addition, RoPE applies a &lt;strong&gt;rotation&lt;/strong&gt; to the Query (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;QQ &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) and Key (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) vectors based on their absolute position. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Logic:&lt;/strong&gt; This rotation successfully encodes &lt;strong&gt;relative positional information&lt;/strong&gt; (the distance between two tokens) into the dot-product computation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Win (Extrapolation):&lt;/strong&gt; Because scores depend on relative distance rather than absolute position, RoPE makes it far easier for the model to &lt;strong&gt;extrapolate&lt;/strong&gt; (generalize) to context lengths it never saw during training. This is why RoPE-based models like Llama can be extended well beyond their original training context window.&lt;/li&gt;
&lt;/ul&gt;
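&lt;p&gt;The relative-distance property is easy to check numerically. Below is a minimal, illustrative NumPy sketch of the rotation, using the standard theta_i = 10000^(-2i/d) frequencies; this is a toy for intuition, not a production kernel:&lt;/p&gt;

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) pair of dimensions of x by the angle pos * theta_i."""
    d = x.shape[-1]
    freqs = base ** (-2.0 * np.arange(d // 2) / d)   # theta_i
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The score at absolute positions (105, 100) matches the score at (5, 0):
# only the relative offset of 5 tokens matters.
score_abs = rope_rotate(q, 105) @ rope_rotate(k, 100)
score_rel = rope_rotate(q, 5) @ rope_rotate(k, 0)
print(np.isclose(score_abs, score_rel))   # True
```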




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion: Where We Go Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The journey from the original 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(N2)O(N^2) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 architecture to today's state-of-the-art LLMs is defined by clever engineering and deep mathematical insights.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(N2)O(N^2) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 Challenge:&lt;/strong&gt; The fundamental computational wall.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;GQA:&lt;/strong&gt; The memory hack for higher production throughput.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;RoPE:&lt;/strong&gt; The positional hack for better context scalability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These advances set the stage for the next generation of research, focusing on &lt;strong&gt;optimizing GPU memory access&lt;/strong&gt; (like FlashAttention) and exploring new architectures to finally break the 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(N2)O(N^2) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 barrier for good (e.g., linear attention variants).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are your thoughts?&lt;/strong&gt; Have you experimented with GQA or RoPE in your own models? Share your experiences in the comments below!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I Tried to Teach AI to Click Buttons, and It Missed by 500 Pixels</title>
      <dc:creator>Max Luong</dc:creator>
      <pubDate>Sun, 23 Nov 2025 09:42:07 +0000</pubDate>
      <link>https://forem.com/maxluong/i-tried-to-teach-ai-to-click-buttons-and-it-missed-by-500-pixels-33ln</link>
      <guid>https://forem.com/maxluong/i-tried-to-teach-ai-to-click-buttons-and-it-missed-by-500-pixels-33ln</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR:
&lt;/h1&gt;

&lt;p&gt;I attempted to build a visual web agent using Playwright and &lt;strong&gt;Qwen2-VL-2B&lt;/strong&gt; to detect and click UI elements via raw coordinate prediction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Result&lt;/strong&gt;: Failure. While it works on square test images, production websites on wide monitors (1920x1080) suffer from massive coordinate drift (up to 500px) due to the model's internal aspect ratio squashing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Raw pixel prediction is mathematically unstable for browser automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Ultimate AI Web Agent: A Zero-to-Hero Journey (Part 1)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Dream:&lt;/strong&gt; Imagine telling your computer, "Go to Amazon, find a waterproof Bluetooth speaker under $50, and put it in my cart," and then watching your mouse cursor move on its own, clicking and typing exactly as you would.&lt;/p&gt;

&lt;p&gt;This isn't sci-fi anymore. This is the promise of Multimodal AI Agents.&lt;/p&gt;

&lt;p&gt;But if you think building this is as easy as taking a screenshot and asking ChatGPT to "click the button," I have bad news for you. I tried that. It failed hilariously.&lt;/p&gt;

&lt;p&gt;In this series, &lt;strong&gt;"Building the Ultimate AI Web Agent"&lt;/strong&gt;, I'm going to take you from a naive script that misses buttons to a robust, production-grade web agent.&lt;/p&gt;

&lt;p&gt;Today, in Part 1, we explore how AI "sees," why we tried the easy way (Pixel Prediction), and the mathematical trap that ruined it all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Crash Course: What is a Multimodal LLM?
&lt;/h2&gt;

&lt;p&gt;Before we break the code, let's understand the brain. You likely know LLMs (Large Language Models) like GPT-4 or Claude. They take text in and spit text out.&lt;/p&gt;

&lt;p&gt;MLLMs (Multimodal Large Language Models) add a new sense: &lt;strong&gt;Vision&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does an AI "See"?
&lt;/h3&gt;

&lt;p&gt;When you look at a website, you see a seamless flow of pixels. When an AI looks at a website, it sees a &lt;strong&gt;Mosaic&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Grid&lt;/strong&gt;: The model chops the screenshot into tiny fixed-size squares called "Patches" (e.g., 14 x 14 pixels).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Flattening&lt;/strong&gt;: Each patch is turned into a list of numbers (a vector).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Reasoning&lt;/strong&gt;: The AI analyzes these vectors just like words in a sentence. It understands that the pattern of pixels in the top-right corner "looks like" a login button.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process is called &lt;strong&gt;Visual Grounding&lt;/strong&gt;—mapping a text concept (e.g., "The Microphone Icon") to spatial coordinates on the screen.&lt;/p&gt;
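&lt;p&gt;To get a feel for the scale, here is a tiny sketch of the patch arithmetic (14-pixel patches as above; real pipelines also merge and resize patches, which we ignore here):&lt;/p&gt;

```python
def patch_grid(width: int, height: int, patch: int = 14) -> tuple[int, int, int]:
    """How many whole patches a screenshot is chopped into (ignoring padding/merging)."""
    cols, rows = width // patch, height // patch
    return cols, rows, cols * rows

cols, rows, total = patch_grid(1024, 1024)
print(f"{cols} x {rows} grid = {total} patches")
```

A single 1024x1024 screenshot becomes thousands of visual "tokens" before the model reasons about a single word of your prompt.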

&lt;h2&gt;
  
  
  The Experiment: The "Naive" Approach
&lt;/h2&gt;

&lt;p&gt;For our first attempt, we used &lt;strong&gt;Qwen2-VL-2B&lt;/strong&gt;. Why Qwen? Unlike older models that force every image into a low-res square, Qwen is designed to handle various resolutions and natively supports bounding boxes—it can output &lt;code&gt;[x, y, x, y]&lt;/code&gt; coordinates.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Plan
&lt;/h3&gt;

&lt;p&gt;The logic was simple (or so I thought):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Capture&lt;/strong&gt;: Take a screenshot of Google.com using Playwright.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask&lt;/strong&gt;: Feed the image to Qwen with the prompt: "Find the bounding box of the Search by Voice icon."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click&lt;/strong&gt;: Convert the AI's coordinates into a mouse click.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhwiq9tore22o638zusl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhwiq9tore22o638zusl.png" alt="Screenshot of Google.com" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;

&lt;p&gt;Here is the script we used. It asks the model for coordinates and draws a box on the screen to verify them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
from PIL import Image, ImageDraw
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from playwright.async_api import async_playwright
from qwen_vl_utils import process_vision_info
import torch

# 1. Setup
device = "cuda" if torch.cuda.is_available() else "cpu"
if 'model' not in locals():
    model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", dtype="auto",
                                                            device_map=device)
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")


async def find_and_verify_element(url, element_description):
    image_path = "page.png"

    # --- Capture ---
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        # Set the viewport BEFORE navigating so the page lays out (and is
        # screenshotted) at this size; 1024x1024 roughly matches the square
        # resolutions the model saw in training, which often helps accuracy.
        await page.set_viewport_size({"width": 1024, "height": 1024})
        await page.goto(url)
        await page.screenshot(path=image_path)
        await browser.close()

    # --- Inference ---
    # Ask for the bounding box; we derive a single click point from its center below.
    prompt = f"Find the bounding box of the {element_description}."

    messages = [{
        "role": "user",
        "content": [{"type": "image", "image": image_path}, {"type": "text", "text": prompt}]
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(device)

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"Model said: {output_text}")

    # --- Parse (The 'Chat' Logic: x, y, x, y) ---
    # Matches the model's prose format: "(213, 372) to (788, 430)"
    pattern = r"\((\d+),\s*(\d+)\)\s*to\s*\((\d+),\s*(\d+)\)"
    match = re.search(pattern, output_text)

    if match:
        # Chat mode usually outputs [x_min, y_min, x_max, y_max]
        # If the result looks like a vertical strip, swap these variables.
        norm_x1, norm_y1, norm_x2, norm_y2 = map(int, match.groups())

        # Get Real Dimensions
        img = Image.open(image_path)
        W, H = img.size

        # Convert 0-1000 to Pixels
        x1 = (norm_x1 / 1000) * W
        y1 = (norm_y1 / 1000) * H
        x2 = (norm_x2 / 1000) * W
        y2 = (norm_y2 / 1000) * H

        center_x = (x1 + x2) / 2
        center_y = (y1 + y2) / 2

        print(f"Found Coordinates: Center({center_x:.1f}, {center_y:.1f})")

        # --- Verify Visual (Python Side) ---
        draw = ImageDraw.Draw(img)
        # Draw Box
        draw.rectangle([x1, y1, x2, y2], outline="red", width=4)
        # Draw Center
        r = 10
        draw.ellipse((center_x - r, center_y - r, center_x + r, center_y + r), fill="green", outline="black")

        # Save/Show result
        img.save("verified_result.png")
        return (center_x, center_y)
    else:
        print("No coordinates found.")
        return None


# Run
# Note: If you want the gray button specifically, try prompting: "the gray Google Search button below the text bar"
await find_and_verify_element("https://www.google.com", "Google Search button")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execution Log
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
config.json: 
 1.20k/? [00:00&amp;lt;00:00, 26.7kB/s]
model.safetensors.index.json: 
 56.4k/? [00:00&amp;lt;00:00, 1.16MB/s]
Download complete: 100%
 4.42G/4.42G [01:00&amp;lt;00:00, 66.9MB/s]
Fetching 2 files: 100%
 2/2 [01:00&amp;lt;00:00, 60.64s/it]
Loading weights: 100%
 729/729 [00:06&amp;lt;00:00, 124.67it/s, Materializing param=model.visual.patch_embed.proj.weight]
generation_config.json: 100%
 272/272 [00:00&amp;lt;00:00, 24.9kB/s]
preprocessor_config.json: 100%
 347/347 [00:00&amp;lt;00:00, 37.1kB/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
tokenizer_config.json: 
 4.19k/? [00:00&amp;lt;00:00, 378kB/s]
vocab.json: 
 2.78M/? [00:00&amp;lt;00:00, 42.8MB/s]
merges.txt: 
 1.67M/? [00:00&amp;lt;00:00, 44.4MB/s]
tokenizer.json: 
 7.03M/? [00:00&amp;lt;00:00, 74.5MB/s]
chat_template.json: 
 1.05k/? [00:00&amp;lt;00:00, 51.6kB/s]
Model said: system
You are a helpful assistant.
user
Find the bounding box of the Google Search button.
assistant
The Google Search button is located at the center of the image, just below the Google logo. The bounding box for the Google Search button is approximately from (211, 372) to (788, 430).
Found Coordinates: Center(511.5, 410.6)
(511.488, 410.624)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran this on a simple test page. &lt;strong&gt;It worked!&lt;/strong&gt; The AI found the big "Google Search" button. I felt like a genius.&lt;/p&gt;

&lt;p&gt;Then, I tried to click a small icon on a wide-screen monitor.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjyxrkqorygyo4b0vgfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjyxrkqorygyo4b0vgfp.png" alt="Verify The Reult" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Failure: The "Pixel Trap"
&lt;/h2&gt;

&lt;p&gt;To stress-test our agent, we set up a scenario that mimics real-world usage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Monitor&lt;/strong&gt;: A standard wide screen (1920 × 1080).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Target&lt;/strong&gt;: The small "Search by voice" (Microphone) icon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Constraint&lt;/strong&gt;: Many vision pipelines resize inputs to a fixed square (e.g., 1024 × 1024), so the wide capture gets squashed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We captured the screen and asked the AI to find the icon. Here is exactly what happened in the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Capturing https://www.google.com at 1920x1080...
Ground Truth (DOM): [1163, 377, 1203, 427]

Squashing image to 1024x1024 to trigger Scaling Drift...

Parsing Model Output:
"The bounding box is [341, 342, 658, 396]."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
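&lt;p&gt;The drift is easy to reproduce with plain arithmetic. The sketch below uses the numbers from the logs above (the helper names are mine); it compares the naive interpretation of the model's box with a proper per-axis rescale back to screen space:&lt;/p&gt;

```python
# Minimal sketch of the coordinate remap, using the numbers from the logs.
# The model predicted its box in the squashed 1024x1024 image; the click
# must land on the real 1920x1080 screen.

SCREEN_W, SCREEN_H = 1920, 1080
MODEL_W, MODEL_H = 1024, 1024

pred_box = (341, 342, 658, 396)   # model output (1024-space)
gt_box = (1163, 377, 1203, 427)   # ground truth from the DOM (screen-space)

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Naive: treat the model's pixels as screen pixels (what our first script did).
naive = center(pred_box)

# Correct rescale: undo the squash per axis.
sx, sy = SCREEN_W / MODEL_W, SCREEN_H / MODEL_H
px, py = center(pred_box)
rescaled = (px * sx, py * sy)

gx, gy = center(gt_box)
naive_err = ((gx - naive[0]) ** 2 + (gy - naive[1]) ** 2) ** 0.5
rescaled_err = ((gx - rescaled[0]) ** 2 + (gy - rescaled[1]) ** 2) ** 0.5

print(f"naive miss:    {naive_err:.0f} px")   # -> 684 px
print(f"rescaled miss: {rescaled_err:.0f} px")  # -> 247 px
```

&lt;p&gt;Even the "correct" rescale still lands roughly 250 px from the icon's center: the model's box itself was wrong, not just the coordinate space.&lt;/p&gt;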



&lt;h3&gt;
  
  
  The Math Doesn't Add Up
&lt;/h3&gt;

&lt;p&gt;Do you see the discrepancy?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reality (DOM)&lt;/strong&gt;: The icon starts at pixel &lt;strong&gt;1163&lt;/strong&gt; (Right side).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Prediction&lt;/strong&gt;: The AI guessed &lt;strong&gt;341&lt;/strong&gt; (Left side).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we visualized this, the magnitude of the failure was shocking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3zzzcdhrqmtmkbpmkeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3zzzcdhrqmtmkbpmkeb.png" alt="Failure Case Demonstrated" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Anatomy of a Missed Click
&lt;/h3&gt;

&lt;p&gt;Why did our "smart" AI miss by 500 pixels?&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Aspect Ratio Trap (The Squish)
&lt;/h4&gt;

&lt;p&gt;We captured a wide 16:9 image, but the model processed it as a 1:1 square. This "squashed" the image horizontally.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI placed the microphone's left edge at roughly 33% of the image width (pixel 341 of 1024).&lt;/li&gt;
&lt;li&gt;Mapped back to the 1920 px screen, that is around pixel 640; the real icon starts at 1163. The math drifted massively.&lt;/li&gt;
&lt;/ul&gt;
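&lt;p&gt;A quick back-of-the-envelope check (numbers assumed from the setup above) shows why the squash hurts: the horizontal and vertical scale factors differ, and their ratio is exactly the screen's aspect ratio, so everything gets compressed sideways by about 1.78×:&lt;/p&gt;

```python
# How much does the 16:9 -> 1:1 squash distort shapes?

SCREEN_W, SCREEN_H = 1920, 1080
MODEL_W, MODEL_H = 1024, 1024

sx = SCREEN_W / MODEL_W   # 1.875   (horizontal compression)
sy = SCREEN_H / MODEL_H   # ~1.055  (vertical compression)
distortion = sx / sy      # ~1.778, i.e. exactly 16/9

# A ~40x50 px microphone icon shrinks to ~21x47 px in the squashed image:
icon_w, icon_h = 40, 50
squashed = (icon_w / sx, icon_h / sy)
print(distortion, squashed)
```

&lt;p&gt;A roughly square icon reaches the model more than twice as tall as it is wide, so features learned from undistorted screenshots no longer line up.&lt;/p&gt;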

&lt;h4&gt;
  
  
  2. The Ambiguity Problem
&lt;/h4&gt;

&lt;p&gt;Look at the &lt;span&gt;Red Box&lt;/span&gt; in the image above. The AI didn't just miss the coordinates; it boxed the entire search text input, not the microphone. To a human, the difference is obvious. To an AI looking at a squashed, low-resolution patch, the whole search bar looks like "The Search Thing."&lt;/p&gt;
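&lt;p&gt;One way to quantify this (using Intersection-over-Union, a standard detection metric, not something our pipeline computes yet): even after rescaling the predicted box to screen space, its overlap with the microphone's ground-truth box is tiny, because the box spans the whole search bar:&lt;/p&gt;

```python
# IoU between the (rescaled) predicted box and the DOM ground truth.

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    iy = max(0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = ix * iy
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

sx, sy = 1920 / 1024, 1080 / 1024
pred = tuple(v * s for v, s in zip((341, 342, 658, 396), (sx, sy, sx, sy)))
gt = (1163, 377, 1203, 427)
print(f"IoU: {iou(pred, gt):.3f}")   # comes out around 0.05, far below any usual 0.5 threshold
```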

&lt;h4&gt;
  
  
  The Result
&lt;/h4&gt;

&lt;p&gt;If this were a live agent, it would have clicked the empty white space in the text bar. The script would report "Success," but nothing would happen. This is a &lt;strong&gt;Silent Failure&lt;/strong&gt;—the worst enemy of automation.&lt;/p&gt;
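&lt;p&gt;A cheap defense against this class of bug is to refuse to trust a click unless something observable changed. Here is a framework-agnostic sketch (the helper name and toy snapshot are mine; with Playwright, the snapshot could be &lt;code&gt;page.url&lt;/code&gt; plus a hash of &lt;code&gt;page.content()&lt;/code&gt;):&lt;/p&gt;

```python
# Hypothetical guard against the Silent Failure: a click that changes nothing
# observable is reported as a miss, not a success.

def click_and_verify(click_fn, snapshot_fn):
    """Run click_fn, then compare observable state before and after.

    snapshot_fn should capture whatever the click is expected to change
    (e.g. with Playwright: page.url plus a hash of page.content()).
    Returns True only if the click produced an observable change.
    """
    before = snapshot_fn()
    click_fn()
    after = snapshot_fn()
    return before != after

# Toy demo: a click that updates state vs. one that hits dead whitespace.
state = {"url": "https://www.google.com"}
good = click_and_verify(lambda: state.update(url="https://www.google.com/search"),
                        lambda: dict(state))
bad = click_and_verify(lambda: None, lambda: dict(state))
print(good, bad)  # True False
```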

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Asking for pixels is a trap."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vision is not precise&lt;/strong&gt;: MLLMs see general patterns, not exact pixels. They are probabilistic engines, not rulers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling is the enemy&lt;/strong&gt;: Aspect ratio distortion between your screen and the model's training data causes "Math Drift."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context is ambiguous&lt;/strong&gt;: "Search button" might mean the icon, the text box, or the button below. The AI usually guesses the biggest one.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;We have hit a wall. We cannot rely on the AI to guess where pixels are. We need a way to force the AI to be precise.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, we will abandon raw coordinate guessing. We will implement a technique called &lt;strong&gt;Set-of-Mark (SoM)&lt;/strong&gt;: using JavaScript to inject numbered tags directly into the page's DOM, giving the AI "glasses" to see exactly where elements are, with &lt;strong&gt;higher precision&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>playwright</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
