<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Max Luong</title>
    <description>The latest articles on Forem by Max Luong (@maxluong).</description>
    <link>https://forem.com/maxluong</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3625373%2F2d29f424-339c-4af5-b5c9-05c91a38487d.jpg</url>
      <title>Forem: Max Luong</title>
      <link>https://forem.com/maxluong</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/maxluong"/>
    <language>en</language>
    <item>
      <title>Why Do AI Agents Fail in Production? (And How to Fix the "Silent Click")</title>
      <dc:creator>Max Luong</dc:creator>
      <pubDate>Thu, 11 Dec 2025 08:50:14 +0000</pubDate>
      <link>https://forem.com/maxluong/why-do-ai-agents-fail-in-production-and-how-to-fix-the-silent-click-nk6</link>
      <guid>https://forem.com/maxluong/why-do-ai-agents-fail-in-production-and-how-to-fix-the-silent-click-nk6</guid>
      <description>&lt;h2&gt;
  
  
  Part 2: Moving from toy scripts to enterprise architecture using Qwen2-VL, Set-of-Mark, and Playwright.
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Series: Building AI Web Agents
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;&lt;a href="https://dev.to/duc_luong_10d1d0d0f726c3a/i-tried-to-teach-ai-to-click-buttons-and-it-missed-by-500-pixels-33ln"&gt;Part 1: I Tried to Teach AI to Click Buttons, and It Missed by 500 Pixels&lt;/a&gt;&lt;/strong&gt;, I shared the painful reality of building my first web agent. I fed a screenshot to a standard multimodal model, asked for coordinates, and watched it hallucinate a click into the white void of a webpage.&lt;/p&gt;

&lt;p&gt;It turns out, guessing (X, Y) pixel coordinates is a fragile game.&lt;/p&gt;

&lt;p&gt;If you are just playing around, a 70% success rate feels fine. If you are building an enterprise agent to automate 1,000 tasks, 70% means 300 failed tasks. That is a disaster.&lt;/p&gt;

&lt;p&gt;In this post, I'm breaking down the architecture that &lt;strong&gt;actually works in production&lt;/strong&gt;: The &lt;strong&gt;Generator-Executor Pattern&lt;/strong&gt;, powered by &lt;strong&gt;Qwen2-VL&lt;/strong&gt; and &lt;strong&gt;Set-of-Mark (SoM)&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem: The "Blind" Brain vs. The "Blurry" Eye
&lt;/h2&gt;

&lt;p&gt;Why did my first agent fail? It faced two massive walls that every developer in this space hits eventually:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Pure LLMs are Blind
&lt;/h3&gt;

&lt;p&gt;A text-only model reading HTML code (Llama 3, or GPT-4 without its vision input) fails on modern web apps. It can't see inside &lt;code&gt;&amp;lt;canvas&amp;gt;&lt;/code&gt; elements or deep Shadow DOMs, and it is blind to visual trickery (like a popup advertisement physically covering the "Login" button).&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Standard VLMs are Blurry
&lt;/h3&gt;

&lt;p&gt;Most older vision models squash your beautiful 4K screenshot into a tiny 336 × 336 square. A "Submit" button becomes a smudge. If the model can't clearly see the button, it definitely can't give you accurate coordinates for it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution: The Hybrid "Neuro-Symbolic" Stack
&lt;/h2&gt;

&lt;p&gt;To build a production-grade agent, we stop asking the AI to &lt;strong&gt;act&lt;/strong&gt;. Instead, we ask the AI to &lt;strong&gt;plan&lt;/strong&gt;, and let dumb code do the acting.&lt;/p&gt;

&lt;p&gt;We need three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Eye (Qwen2-VL)&lt;/strong&gt;: chosen specifically because it uses Naive Dynamic Resolution (it doesn't squash images) and M-RoPE (it understands 2D position natively).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Map (Set-of-Mark)&lt;/strong&gt;: We don't ask for pixels; we ask for labels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Hand (Playwright)&lt;/strong&gt;: Deterministic execution code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the &lt;strong&gt;Generator-Executor workflow&lt;/strong&gt; that moves us from "Toy" to "Tool":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    GENERATOR-EXECUTOR FLOW                   │
└─────────────────────────────────────────────────────────────┘

    🌐 Web Page Loaded
         │
         ▼
    📍 STEP 1: SET-OF-MARK INJECTION
         │
         ├─► Inject JavaScript into page
         ├─► Add red numbered badges to all interactive elements
         └─► Build selector map: {id: selector}
         │
         ▼
    📸 Take Screenshot (with numbered labels)
         │
         ▼
    🧠 STEP 2: GENERATOR (Qwen2-VL)
         │
         ├─► AI analyzes screenshot
         ├─► Identifies target element
         └─► Returns: "Target ID = 42"
         │
         ▼
    🎯 STEP 3: EXECUTOR (Playwright)
         │
         ├─► Retrieve selector from window.som_map[42]
         ├─► Snapshot BEFORE state (URL, DOM)
         └─► Execute: page.click(selector)
         │
         ▼
    ✅ STEP 4: VERIFICATION LOOP
         │
         ├─► Did URL change? ──────────────► ✓ SUCCESS
         │
         ├─► Did DOM change significantly? ─► ✓ SUCCESS
         │
         └─► No change detected? ──────────► ⚠️  SILENT FAILURE
                  │
                  ▼
            🤖 AI Visual Judge
                  │
                  ├─► Compare before/after screenshots
                  ├─► Analyze what happened
                  │
                  ▼
            Decision Point:
                  │
                  ├─► Retry ────────────► (loop back to Step 2)
                  │
                  └─► Fail ─────────────► ❌ Report Error &amp;amp; Stop

    ✓ SUCCESS ──► Continue to next task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Visual Grounding (The Setup)
&lt;/h2&gt;

&lt;p&gt;First, we solve the coordinate problem. Instead of asking the AI to guess pixels, we inject JavaScript to label every interactive element with a big red number.&lt;/p&gt;

&lt;p&gt;This is called &lt;strong&gt;Set-of-Mark (SoM)&lt;/strong&gt; prompting.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Injection Script (JavaScript)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// inject_som.js&lt;/span&gt;
&lt;span class="c1"&gt;// This runs inside the browser via Playwright&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;markElements&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Select everything clickable&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button, a, input, [role="button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBoundingClientRect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;height&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Skip invisible stuff&lt;/span&gt;

    &lt;span class="c1"&gt;// Create the visual badge&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;badge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;div&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;absolute&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;px&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;px&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;background&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;red&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;white&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fontSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;12px&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;10000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;badge&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// CRITICAL: Map ID back to a unique selector for code usage&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;som_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;som_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;som_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getUniqueSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: The Generator (Qwen2-VL)
&lt;/h2&gt;

&lt;p&gt;Now, we take a screenshot of those red numbers. We ask Qwen2-VL a multiple-choice question: &lt;strong&gt;"User wants to Login. Which Number is the button?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This changes the task from &lt;strong&gt;Regression&lt;/strong&gt; (hard math) to &lt;strong&gt;Classification&lt;/strong&gt; (easy reading).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The Python Manager
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen2VLForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;

&lt;span class="c1"&gt;# Load the specialist model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen2VLForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2-VL-2B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2-VL-2B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_target_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_goal&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User Goal: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_goal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Look at the screenshot with red numbered boxes. Return ONLY the number of the element needed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# ... standard Qwen2-VL inference code ...
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predicted_number&lt;/span&gt; &lt;span class="c1"&gt;# e.g., 42
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
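&lt;p&gt;In practice, even with a "Return ONLY the number" instruction, small VLMs occasionally reply with extra words ("The answer is 42."). A tiny parsing helper hardens Step 2. This is my own sketch, not part of the original code: the &lt;code&gt;parse_target_id&lt;/code&gt; name and the idea of validating against the known SoM IDs are assumptions.&lt;/p&gt;

```python
import re

def parse_target_id(reply, valid_ids):
    """Pull the first integer out of the model's reply and check it
    against the IDs we actually painted onto the page."""
    match = re.search(r"\d+", reply)
    if not match:
        return None  # Model refused, or produced prose with no number
    target = int(match.group())
    return target if target in valid_ids else None  # Reject unknown IDs

print(parse_target_id("The answer is 42.", {7, 42, 99}))  # 42
print(parse_target_id("I don't see a login button.", {7, 42}))  # None
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; here should route straight to the retry branch of the verification loop instead of clicking a bogus selector.&lt;/p&gt;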



&lt;h2&gt;
  
  
  Step 3: The Executor &amp;amp; The "Silent Failure" Check
&lt;/h2&gt;

&lt;p&gt;This is where production agents die. You click a button... and nothing happens. Did it fail? Or was the site just slow? Or was the button a dud?&lt;/p&gt;

&lt;p&gt;We can't just &lt;code&gt;click()&lt;/code&gt;. We need a &lt;strong&gt;Predict-Verify Loop&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Predict&lt;/strong&gt;: Ask the AI before clicking: "If I click 'Save', what should happen?" (Expectation: Network Request).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt;: Check if that actually happened.
&lt;/li&gt;
&lt;/ul&gt;
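&lt;p&gt;The "Did the DOM change significantly?" half of the Verify step doesn't need AI at all. Here is a coarse, standard-library-only sketch (my own illustration; the 2% threshold is an arbitrary assumption you would tune per site):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def dom_changed_significantly(html_before, html_after, threshold=0.02):
    """True if the serialized DOM differs by more than `threshold`.
    quick_ratio() is an inexpensive upper bound on similarity, so this
    only fires when the page has definitely changed."""
    similarity = SequenceMatcher(None, html_before, html_after).quick_ratio()
    return (1.0 - similarity) > threshold
```

&lt;p&gt;If this returns &lt;code&gt;False&lt;/code&gt; and the URL is unchanged, you are in the "Silent Failure" branch, and it is time to wake up the AI Visual Judge.&lt;/p&gt;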

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;robust_click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qwen_model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Retrieve the selector from our JS Map (100% precision)
&lt;/span&gt;    &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window.som_map[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Snapshot state BEFORE action
&lt;/span&gt;    &lt;span class="n"&gt;url_before&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. EXECUTE
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CRITICAL FAIL: Element not clickable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# A networkidle timeout is not a click failure; fall through to Verify
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_load_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. VERIFY (The Judge)
&lt;/span&gt;    &lt;span class="c1"&gt;# Did the URL change?
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;url_before&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SUCCESS: Navigation detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Did the DOM change significantly?
&lt;/span&gt;    &lt;span class="c1"&gt;# If not, we trigger the AI Judge to compare Before/After screenshots
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARNING: Silent Failure - Needs AI Visual Inspection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
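&lt;p&gt;The "Retry → loop back to Step 2" branch of the flow chart fits in a few lines of plain Python. This is my own sketch: &lt;code&gt;generate&lt;/code&gt;, &lt;code&gt;execute&lt;/code&gt;, and &lt;code&gt;visual_judge&lt;/code&gt; are hypothetical stand-ins for the Qwen2-VL call, &lt;code&gt;robust_click&lt;/code&gt;, and the before/after screenshot comparison, injected as callables so the loop itself has no Playwright dependency:&lt;/p&gt;

```python
def run_step(user_goal, generate, execute, visual_judge, max_retries=2):
    """One Generator-Executor step with the verification fallback.

    generate(goal)     -> Set-of-Mark target ID        (Step 2)
    execute(target_id) -> "SUCCESS: ..." / "WARNING: ..." / "CRITICAL ..."
    visual_judge(goal) -> True if screenshots show real progress (Step 4)
    """
    for _attempt in range(max_retries + 1):
        target_id = generate(user_goal)   # Re-generate on every retry:
        outcome = execute(target_id)      # the page may have changed
        if outcome.startswith("SUCCESS"):
            return outcome
        if outcome.startswith("WARNING") and visual_judge(user_goal):
            return "SUCCESS: Confirmed by visual judge"
        # Confirmed silent failure or critical fail: loop back to Step 2
    return "FAIL: Retries exhausted"
```

&lt;p&gt;Keeping the loop free of browser state also makes it trivial to unit-test with stub callables.&lt;/p&gt;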



&lt;h2&gt;
  
  
  The Verdict: Do We Need Reinforcement Learning?
&lt;/h2&gt;

&lt;p&gt;I initially thought I needed Reinforcement Learning (RL) to train a "Super Agent." I was wrong.&lt;/p&gt;

&lt;p&gt;For 95% of use cases, &lt;strong&gt;RL is a trap&lt;/strong&gt;. It's complex, expensive, and hard to debug (if the agent makes a typo, do you punish it?).&lt;/p&gt;

&lt;p&gt;The "State-of-the-Art" right now isn't a smarter brain; it's a &lt;strong&gt;better system&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Set-of-Mark&lt;/strong&gt; fixes the vision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen2-VL&lt;/strong&gt; fixes the reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification Loops&lt;/strong&gt; fix the reliability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By moving to this Generator-Executor pattern, my agent stopped missing by 500 pixels. It now hits the target every single time—because it's not guessing pixels anymore. It's reading map coordinates.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of my journey into Multimodal AI. In Part 3, I'll be deploying this onto a live server to see how much it costs to run 10,000 steps.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Transformer: From O(N^2) to Light Speed – 3 Core Hacks Powering Modern LLMs</title>
      <dc:creator>Max Luong</dc:creator>
      <pubDate>Thu, 11 Dec 2025 08:13:18 +0000</pubDate>
      <link>https://forem.com/maxluong/transformer-from-on2-to-light-speed-3-core-hacks-powering-modern-llms-5akn</link>
      <guid>https://forem.com/maxluong/transformer-from-on2-to-light-speed-3-core-hacks-powering-modern-llms-5akn</guid>
      <description>&lt;h2&gt;
  
  
  🚀 Transformer: From O(N²) to Light Speed – 3 Core Hacks Powering Modern LLMs
&lt;/h2&gt;

&lt;p&gt;If you've played with Llama, Mistral, or Gemini, you know Large Language Models (LLMs) are revolutionary. But underneath the magic of coherent text generation lies a massive bottleneck from the original 2017 Transformer architecture: &lt;strong&gt;the quadratic complexity O(N²) problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This O(N²) wall fundamentally limits how long a conversation (context) your model can handle. In production, it translates directly into high GPU costs and slow service.&lt;/p&gt;

&lt;p&gt;In this post, we'll dive into the heart of this mathematical problem and explore three brilliant modern hacks—two architectural and one positional—that allow LLMs to run faster and scale better today.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Part 1: The O(N²) Bottleneck: The Cost of Global Attention&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The core of the Transformer is the &lt;strong&gt;Self-Attention&lt;/strong&gt; mechanism.&lt;/p&gt;

&lt;p&gt;The cost arises when we calculate the similarity score between every single word (Query) and every other word (Key) in the input sequence.&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Attention(Q,K,V)=Softmax(QKTdk)V
\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Attention&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;The Problematic 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;QKTQK^T &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 Term&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your input sequence has a length of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;NN &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 tokens (words):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Matrix Size:&lt;/strong&gt; The 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;QKTQK^T &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 operation results in an &lt;strong&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;N×NN \times N &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 Attention Score matrix&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost:&lt;/strong&gt; When 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;NN &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 doubles (e.g., from 1,000 to 2,000 tokens), the computational cost and the GPU memory required to store the 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;N×NN \times N &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 score matrix both &lt;strong&gt;quadruple&lt;/strong&gt; (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;2N2=4N22N^2 = 4N^2 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;4&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Length (N)&lt;/th&gt;
&lt;th&gt;Computational Cost (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;N2N^2 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
)&lt;/th&gt;
&lt;th&gt;Cost Increase Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;1,048,576&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;1×1\times &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mord"&gt;×&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;16,777,216&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;16×16\times &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;16&lt;/span&gt;&lt;span class="mord"&gt;×&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;td&gt;67,108,864&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;64×64\times &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;64&lt;/span&gt;&lt;span class="mord"&gt;×&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This quadratic scaling is why serving LLMs with long contexts is so expensive: every doubling of the context quadruples both the compute and the memory spent on the score matrix.&lt;/p&gt;
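&lt;p&gt;The table's numbers fall out of a couple of lines of arithmetic. Here is a quick sketch (assuming fp16 scores, i.e. 2 bytes each, and a single head in a single layer):&lt;/p&gt;

```python
def score_matrix_cost(n: int, bytes_per_score: int = 2) -> tuple[int, float]:
    """Number of attention scores and their memory (MiB) for one head, one layer."""
    scores = n * n
    return scores, scores * bytes_per_score / 2**20

for n in (1024, 4096, 8192):
    scores, mib = score_matrix_cost(n)
    print(f"N={n}: {scores:,} scores, {mib:.0f} MiB, {(n // 1024) ** 2}x baseline")
```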




&lt;h2&gt;
  
  
  &lt;strong&gt;Part 2: Optimizing Serving Memory with Grouped-Query Attention (GQA)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When an LLM is serving requests (decoding), it must store the calculated Key (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) and Value (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) vectors from previous tokens in a memory buffer called the &lt;strong&gt;KV Cache&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The MHA Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the original Multi-Head Attention (MHA), if you have 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;HH &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 heads (e.g., 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;H=32H=32 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;H&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;32&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
), you store 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;HH &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 separate pairs of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 per token, at every layer. The KV Cache quickly becomes the single biggest memory bottleneck on the GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The GQA Hack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Grouped-Query Attention (GQA)&lt;/strong&gt; is an architectural tweak to reduce the size of this KV Cache, drastically improving serving efficiency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Instead of having a dedicated 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 head for every Query head, GQA allows &lt;strong&gt;multiple Query heads to share&lt;/strong&gt; the same 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 pair.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Win:&lt;/strong&gt; If 8 Query heads share 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;1K/V1 K/V &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="mord"&gt;/&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 pair, you reduce the KV Cache memory footprint by a factor of 8.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact on Production:&lt;/strong&gt; A smaller KV Cache means the GPU can hold more user contexts simultaneously, leading to higher &lt;strong&gt;Throughput&lt;/strong&gt; (more requests served per second).&lt;/li&gt;
&lt;/ul&gt;
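&lt;p&gt;A back-of-the-envelope sketch makes the saving concrete. The layer/head/dimension numbers below are hypothetical (roughly 7B-class shapes), not taken from any particular model:&lt;/p&gt;

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_el: int = 2) -> int:
    """KV Cache size for one sequence: K and V (the leading 2) per layer, per KV head (fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el

# Hypothetical 32-layer model with 32 query heads, head_dim=128, at 4k context.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)  # MHA: one KV pair per head
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=4,  head_dim=128, seq_len=4096)  # GQA: 8 query heads share each KV pair
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.2f} GiB, saving: {mha // gqa}x")
```

The freed memory translates directly into more concurrent sequences per GPU, which is where the throughput win comes from.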




&lt;h2&gt;
  
  
  &lt;strong&gt;Part 3: Solving the Context Length Wall: Rotary Positional Embedding (RoPE)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The original sinusoidal Positional Encoding (PE) relies on simple &lt;strong&gt;addition&lt;/strong&gt; to combine positional information with the word embedding. This scheme breaks down when the model encounters sequence lengths longer than those it was trained on (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Ninfer&amp;gt;NtrainN_{\text{infer}} &amp;gt; N_{\text{train}} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;infer&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;train&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The RoPE Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rotary Positional Embedding (RoPE)&lt;/strong&gt; is a mathematically elegant solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mechanism:&lt;/strong&gt; Instead of simple addition, RoPE applies a &lt;strong&gt;rotation&lt;/strong&gt; to the Query (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;QQ &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) and Key (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;KK &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) vectors based on their absolute position. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Logic:&lt;/strong&gt; This rotation successfully encodes &lt;strong&gt;relative positional information&lt;/strong&gt; (the distance between two tokens) into the dot-product computation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Win (Extrapolation):&lt;/strong&gt; Because scores depend on relative distance rather than absolute position, RoPE makes it far easier for the model to &lt;strong&gt;extrapolate&lt;/strong&gt; (generalize) to context lengths it never saw during training. This is why RoPE-based models like Llama can be extended well beyond their original training context window.&lt;/li&gt;
&lt;/ul&gt;
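&lt;p&gt;The relative-distance property is easy to check numerically. Below is a minimal, illustrative NumPy sketch of the rotation, using the standard theta_i = 10000^(-2i/d) frequencies; this is a toy for intuition, not a production kernel:&lt;/p&gt;

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) pair of dimensions of x by the angle pos * theta_i."""
    d = x.shape[-1]
    freqs = base ** (-2.0 * np.arange(d // 2) / d)   # theta_i
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The score at absolute positions (105, 100) matches the score at (5, 0):
# only the relative offset of 5 tokens matters.
score_abs = rope_rotate(q, 105) @ rope_rotate(k, 100)
score_rel = rope_rotate(q, 5) @ rope_rotate(k, 0)
print(np.isclose(score_abs, score_rel))   # True
```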




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion: Where We Go Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The journey from the original 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(N2)O(N^2) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 architecture to today's state-of-the-art LLMs is defined by clever engineering and deep mathematical insights.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(N2)O(N^2) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 Challenge:&lt;/strong&gt; The fundamental computational wall.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;GQA:&lt;/strong&gt; The memory hack for higher production throughput.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;RoPE:&lt;/strong&gt; The positional hack for better context scalability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These advances set the stage for the next generation of research, focusing on &lt;strong&gt;optimizing GPU memory access&lt;/strong&gt; (like FlashAttention) and exploring new architectures to finally break the 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(N2)O(N^2) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 barrier for good (e.g., linear attention variants).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are your thoughts?&lt;/strong&gt; Have you experimented with GQA or RoPE in your own models? Share your experiences in the comments below!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I Tried to Teach AI to Click Buttons, and It Missed by 500 Pixels</title>
      <dc:creator>Max Luong</dc:creator>
      <pubDate>Sun, 23 Nov 2025 09:42:07 +0000</pubDate>
      <link>https://forem.com/maxluong/i-tried-to-teach-ai-to-click-buttons-and-it-missed-by-500-pixels-33ln</link>
      <guid>https://forem.com/maxluong/i-tried-to-teach-ai-to-click-buttons-and-it-missed-by-500-pixels-33ln</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR:
&lt;/h1&gt;

&lt;p&gt;I attempted to build a visual web agent using Playwright and &lt;strong&gt;Qwen2-VL-2B&lt;/strong&gt; to detect and click UI elements via raw coordinate prediction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Result&lt;/strong&gt;: Failure. While it works on square test images, production websites on wide monitors (1920x1080) suffer from massive coordinate drift (up to 500px) due to the model's internal aspect ratio squashing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Raw pixel prediction is mathematically unstable for browser automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Ultimate AI Web Agent: A Zero-to-Hero Journey (Part 1)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Dream:&lt;/strong&gt; Imagine telling your computer, "Go to Amazon, find a waterproof Bluetooth speaker under $50, and put it in my cart," and then watching your mouse cursor move on its own, clicking and typing exactly as you would.&lt;/p&gt;

&lt;p&gt;This isn't sci-fi anymore. This is the promise of Multimodal AI Agents.&lt;/p&gt;

&lt;p&gt;But if you think building this is as easy as taking a screenshot and asking ChatGPT to "click the button," I have bad news for you. I tried that. It failed hilariously.&lt;/p&gt;

&lt;p&gt;In this series, &lt;strong&gt;"Building the Ultimate AI Web Agent"&lt;/strong&gt;, I'm going to take you from a naive script that misses buttons to a robust, production-grade web agent.&lt;/p&gt;

&lt;p&gt;Today, in Part 1, we explore how AI "sees," why we tried the easy way (Pixel Prediction), and the mathematical trap that ruined it all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Crash Course: What is a Multimodal LLM?
&lt;/h2&gt;

&lt;p&gt;Before we break the code, let's understand the brain. You likely know LLMs (Large Language Models) like GPT-4 or Claude. They take text in and spit text out.&lt;/p&gt;

&lt;p&gt;MLLMs (Multimodal Large Language Models) add a new sense: &lt;strong&gt;Vision&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does an AI "See"?
&lt;/h3&gt;

&lt;p&gt;When you look at a website, you see a seamless flow of pixels. When an AI looks at a website, it sees a &lt;strong&gt;Mosaic&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Grid&lt;/strong&gt;: The model chops the screenshot into tiny fixed-size squares called "Patches" (e.g., 14 x 14 pixels).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Flattening&lt;/strong&gt;: Each patch is turned into a list of numbers (a vector).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Reasoning&lt;/strong&gt;: The AI analyzes these vectors just like words in a sentence. It understands that the pattern of pixels in the top-right corner "looks like" a login button.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process is called &lt;strong&gt;Visual Grounding&lt;/strong&gt;—mapping a text concept (e.g., "The Microphone Icon") to spatial coordinates on the screen.&lt;/p&gt;
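&lt;p&gt;To get a feel for the scale, here is a tiny sketch of the patch arithmetic (14-pixel patches as above; real pipelines also merge and resize patches, which we ignore here):&lt;/p&gt;

```python
def patch_grid(width: int, height: int, patch: int = 14) -> tuple[int, int, int]:
    """How many whole patches a screenshot is chopped into (ignoring padding/merging)."""
    cols, rows = width // patch, height // patch
    return cols, rows, cols * rows

cols, rows, total = patch_grid(1024, 1024)
print(f"{cols} x {rows} grid = {total} patches")
```

A single 1024x1024 screenshot becomes thousands of visual "tokens" before the model reasons about a single word of your prompt.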

&lt;h2&gt;
  
  
  The Experiment: The "Naive" Approach
&lt;/h2&gt;

&lt;p&gt;For our first attempt, we used &lt;strong&gt;Qwen2-VL-2B&lt;/strong&gt;. Why Qwen? Unlike older models that force every image into a low-res square, Qwen is designed to handle various resolutions and natively supports bounding boxes—it can output &lt;code&gt;[x, y, x, y]&lt;/code&gt; coordinates.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Plan
&lt;/h3&gt;

&lt;p&gt;The logic was simple (or so I thought):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Capture&lt;/strong&gt;: Take a screenshot of Google.com using Playwright.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask&lt;/strong&gt;: Feed the image to Qwen with the prompt: "Find the bounding box of the Search by Voice icon."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click&lt;/strong&gt;: Convert the AI's coordinates into a mouse click.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhwiq9tore22o638zusl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhwiq9tore22o638zusl.png" alt="Screenshot of Google.com" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;

&lt;p&gt;Here is the script we used. It asks the model for coordinates and draws a box on the screen to verify them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
from PIL import Image, ImageDraw
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from playwright.async_api import async_playwright
from qwen_vl_utils import process_vision_info
import torch

# 1. Setup
device = "cuda" if torch.cuda.is_available() else "cpu"
if 'model' not in locals():
    model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", dtype="auto",
                                                            device_map=device)
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")


async def find_and_verify_element(url, element_description):
    image_path = "page.png"

    # --- Capture ---
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        # Set the viewport BEFORE navigating so the page lays out (and is
        # screenshotted) at this size; 1024x1024 roughly matches the square
        # resolutions the model saw in training, which often helps accuracy.
        await page.set_viewport_size({"width": 1024, "height": 1024})
        await page.goto(url)
        await page.screenshot(path=image_path)
        await browser.close()

    # --- Inference ---
    # Ask for the bounding box; we derive a single click point from its center below.
    prompt = f"Find the bounding box of the {element_description}."

    messages = [{
        "role": "user",
        "content": [{"type": "image", "image": image_path}, {"type": "text", "text": prompt}]
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(device)

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"Model said: {output_text}")

    # --- Parse (The 'Chat' Logic: x, y, x, y) ---
    # Matches the model's prose format: "(213, 372) to (788, 430)"
    pattern = r"\((\d+),\s*(\d+)\)\s*to\s*\((\d+),\s*(\d+)\)"
    match = re.search(pattern, output_text)

    if match:
        # Chat mode usually outputs [x_min, y_min, x_max, y_max]
        # If the result looks like a vertical strip, swap these variables.
        norm_x1, norm_y1, norm_x2, norm_y2 = map(int, match.groups())

        # Get Real Dimensions
        img = Image.open(image_path)
        W, H = img.size

        # Convert 0-1000 to Pixels
        x1 = (norm_x1 / 1000) * W
        y1 = (norm_y1 / 1000) * H
        x2 = (norm_x2 / 1000) * W
        y2 = (norm_y2 / 1000) * H

        center_x = (x1 + x2) / 2
        center_y = (y1 + y2) / 2

        print(f"Found Coordinates: Center({center_x:.1f}, {center_y:.1f})")

        # --- Verify Visual (Python Side) ---
        draw = ImageDraw.Draw(img)
        # Draw Box
        draw.rectangle([x1, y1, x2, y2], outline="red", width=4)
        # Draw Center
        r = 10
        draw.ellipse((center_x - r, center_y - r, center_x + r, center_y + r), fill="green", outline="black")

        # Save/Show result
        img.save("verified_result.png")
        return (center_x, center_y)
    else:
        print("No coordinates found.")
        return None


# Run
# Note: If you want the gray button specifically, try prompting: "the gray Google Search button below the text bar"
await find_and_verify_element("https://www.google.com", "Google Search button")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execution Log
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
config.json: 
 1.20k/? [00:00&amp;lt;00:00, 26.7kB/s]
model.safetensors.index.json: 
 56.4k/? [00:00&amp;lt;00:00, 1.16MB/s]
Download complete: 100%
 4.42G/4.42G [01:00&amp;lt;00:00, 66.9MB/s]
Fetching 2 files: 100%
 2/2 [01:00&amp;lt;00:00, 60.64s/it]
Loading weights: 100%
 729/729 [00:06&amp;lt;00:00, 124.67it/s, Materializing param=model.visual.patch_embed.proj.weight]
generation_config.json: 100%
 272/272 [00:00&amp;lt;00:00, 24.9kB/s]
preprocessor_config.json: 100%
 347/347 [00:00&amp;lt;00:00, 37.1kB/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
tokenizer_config.json: 
 4.19k/? [00:00&amp;lt;00:00, 378kB/s]
vocab.json: 
 2.78M/? [00:00&amp;lt;00:00, 42.8MB/s]
merges.txt: 
 1.67M/? [00:00&amp;lt;00:00, 44.4MB/s]
tokenizer.json: 
 7.03M/? [00:00&amp;lt;00:00, 74.5MB/s]
chat_template.json: 
 1.05k/? [00:00&amp;lt;00:00, 51.6kB/s]
Model said: system
You are a helpful assistant.
user
Find the bounding box of the Google Search button.
assistant
The Google Search button is located at the center of the image, just below the Google logo. The bounding box for the Google Search button is approximately from (211, 372) to (788, 430).
Found Coordinates: Center(511.5, 410.6)
(511.488, 410.624)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran this on a simple test page. &lt;strong&gt;It worked!&lt;/strong&gt; The AI found the big "Google Search" button. I felt like a genius.&lt;/p&gt;

&lt;p&gt;Then, I tried to click a small icon on a wide-screen monitor.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjyxrkqorygyo4b0vgfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjyxrkqorygyo4b0vgfp.png" alt="Verify The Reult" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Failure: The "Pixel Trap"
&lt;/h2&gt;

&lt;p&gt;To stress-test our agent, we set up a scenario that mimics real-world usage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Monitor&lt;/strong&gt;: A standard wide screen (1920 × 1080).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Target&lt;/strong&gt;: The small "Search by voice" (Microphone) icon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Constraint&lt;/strong&gt;: Many vision pipelines resize inputs to a fixed square (e.g., 1024 × 1024), so the wide capture gets squashed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We captured the screen and asked the AI to find the icon. Here is exactly what happened in the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Capturing https://www.google.com at 1920x1080...
Ground Truth (DOM): [1163, 377, 1203, 427]

Squashing image to 1024x1024 to trigger Scaling Drift...

Parsing Model Output:
"The bounding box is [341, 342, 658, 396]."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
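&lt;p&gt;The drift is easy to reproduce with plain arithmetic. The sketch below uses the numbers from the logs above (the helper names are mine); it compares the naive interpretation of the model's box with a proper per-axis rescale back to screen space:&lt;/p&gt;

```python
# Minimal sketch of the coordinate remap, using the numbers from the logs.
# The model predicted its box in the squashed 1024x1024 image; the click
# must land on the real 1920x1080 screen.

SCREEN_W, SCREEN_H = 1920, 1080
MODEL_W, MODEL_H = 1024, 1024

pred_box = (341, 342, 658, 396)   # model output (1024-space)
gt_box = (1163, 377, 1203, 427)   # ground truth from the DOM (screen-space)

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Naive: treat the model's pixels as screen pixels (what our first script did).
naive = center(pred_box)

# Correct rescale: undo the squash per axis.
sx, sy = SCREEN_W / MODEL_W, SCREEN_H / MODEL_H
px, py = center(pred_box)
rescaled = (px * sx, py * sy)

gx, gy = center(gt_box)
naive_err = ((gx - naive[0]) ** 2 + (gy - naive[1]) ** 2) ** 0.5
rescaled_err = ((gx - rescaled[0]) ** 2 + (gy - rescaled[1]) ** 2) ** 0.5

print(f"naive miss:    {naive_err:.0f} px")   # -> 684 px
print(f"rescaled miss: {rescaled_err:.0f} px")  # -> 247 px
```

&lt;p&gt;Even the "correct" rescale still lands roughly 250 px from the icon's center: the model's box itself was wrong, not just the coordinate space.&lt;/p&gt;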



&lt;h3&gt;
  
  
  The Math Doesn't Add Up
&lt;/h3&gt;

&lt;p&gt;Do you see the discrepancy?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reality (DOM)&lt;/strong&gt;: The icon starts at pixel &lt;strong&gt;1163&lt;/strong&gt; (Right side).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Prediction&lt;/strong&gt;: The AI guessed &lt;strong&gt;341&lt;/strong&gt; (Left side).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we visualized this, the magnitude of the failure was shocking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3zzzcdhrqmtmkbpmkeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3zzzcdhrqmtmkbpmkeb.png" alt="Failure Case Demonstrated" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Anatomy of a Missed Click
&lt;/h3&gt;

&lt;p&gt;Why did our "smart" AI miss by 500 pixels?&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Aspect Ratio Trap (The Squish)
&lt;/h4&gt;

&lt;p&gt;We captured a wide 16:9 image, but the model processed it as a 1:1 square. This "squashed" the image horizontally.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI placed the microphone's left edge at roughly 33% of the image width (pixel 341 of 1024).&lt;/li&gt;
&lt;li&gt;Mapped back to the 1920 px screen, that is around pixel 640; the real icon starts at 1163. The math drifted massively.&lt;/li&gt;
&lt;/ul&gt;
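&lt;p&gt;A quick back-of-the-envelope check (numbers assumed from the setup above) shows why the squash hurts: the horizontal and vertical scale factors differ, and their ratio is exactly the screen's aspect ratio, so everything gets compressed sideways by about 1.78×:&lt;/p&gt;

```python
# How much does the 16:9 -> 1:1 squash distort shapes?

SCREEN_W, SCREEN_H = 1920, 1080
MODEL_W, MODEL_H = 1024, 1024

sx = SCREEN_W / MODEL_W   # 1.875   (horizontal compression)
sy = SCREEN_H / MODEL_H   # ~1.055  (vertical compression)
distortion = sx / sy      # ~1.778, i.e. exactly 16/9

# A ~40x50 px microphone icon shrinks to ~21x47 px in the squashed image:
icon_w, icon_h = 40, 50
squashed = (icon_w / sx, icon_h / sy)
print(distortion, squashed)
```

&lt;p&gt;A roughly square icon reaches the model more than twice as tall as it is wide, so features learned from undistorted screenshots no longer line up.&lt;/p&gt;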

&lt;h4&gt;
  
  
  2. The Ambiguity Problem
&lt;/h4&gt;

&lt;p&gt;Look at the &lt;span&gt;Red Box&lt;/span&gt; in the image above. The AI didn't just miss the coordinates; it boxed the entire search text input, not the microphone. To a human, the difference is obvious. To an AI looking at a squashed, low-resolution patch, the whole search bar looks like "The Search Thing."&lt;/p&gt;
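&lt;p&gt;One way to quantify this (using Intersection-over-Union, a standard detection metric, not something our pipeline computes yet): even after rescaling the predicted box to screen space, its overlap with the microphone's ground-truth box is tiny, because the box spans the whole search bar:&lt;/p&gt;

```python
# IoU between the (rescaled) predicted box and the DOM ground truth.

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    iy = max(0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = ix * iy
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

sx, sy = 1920 / 1024, 1080 / 1024
pred = tuple(v * s for v, s in zip((341, 342, 658, 396), (sx, sy, sx, sy)))
gt = (1163, 377, 1203, 427)
print(f"IoU: {iou(pred, gt):.3f}")   # comes out around 0.05, far below any usual 0.5 threshold
```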

&lt;h4&gt;
  
  
  The Result
&lt;/h4&gt;

&lt;p&gt;If this were a live agent, it would have clicked the empty white space in the text bar. The script would report "Success," but nothing would happen. This is a &lt;strong&gt;Silent Failure&lt;/strong&gt;—the worst enemy of automation.&lt;/p&gt;
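&lt;p&gt;A cheap defense against this class of bug is to refuse to trust a click unless something observable changed. Here is a framework-agnostic sketch (the helper name and toy snapshot are mine; with Playwright, the snapshot could be &lt;code&gt;page.url&lt;/code&gt; plus a hash of &lt;code&gt;page.content()&lt;/code&gt;):&lt;/p&gt;

```python
# Hypothetical guard against the Silent Failure: a click that changes nothing
# observable is reported as a miss, not a success.

def click_and_verify(click_fn, snapshot_fn):
    """Run click_fn, then compare observable state before and after.

    snapshot_fn should capture whatever the click is expected to change
    (e.g. with Playwright: page.url plus a hash of page.content()).
    Returns True only if the click produced an observable change.
    """
    before = snapshot_fn()
    click_fn()
    after = snapshot_fn()
    return before != after

# Toy demo: a click that updates state vs. one that hits dead whitespace.
state = {"url": "https://www.google.com"}
good = click_and_verify(lambda: state.update(url="https://www.google.com/search"),
                        lambda: dict(state))
bad = click_and_verify(lambda: None, lambda: dict(state))
print(good, bad)  # True False
```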

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Asking for pixels is a trap."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vision is not precise&lt;/strong&gt;: MLLMs see general patterns, not exact pixels. They are probabilistic engines, not rulers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling is the enemy&lt;/strong&gt;: Aspect ratio distortion between your screen and the model's training data causes "Math Drift."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context is ambiguous&lt;/strong&gt;: "Search button" might mean the icon, the text box, or the button below. The AI usually guesses the biggest one.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;We have hit a wall. We cannot rely on the AI to guess where pixels are. We need a way to force the AI to be precise.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, we will abandon raw coordinate guessing. We will implement a technique called &lt;strong&gt;Set-of-Mark (SoM)&lt;/strong&gt;: using JavaScript to inject numbered tags directly into the page's DOM, giving the AI "glasses" to see exactly where elements are, with &lt;strong&gt;higher precision&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>playwright</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
