<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dhiraj Das</title>
    <description>The latest articles on Forem by Dhiraj Das (@godhirajcode).</description>
    <link>https://forem.com/godhirajcode</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F371224%2F6d877622-0bd8-4370-bfd1-65a3bdd1a2d4.jpg</url>
      <title>Forem: Dhiraj Das</title>
      <link>https://forem.com/godhirajcode</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/godhirajcode"/>
    <language>en</language>
    <item>
      <title>Starlight Part 3: The Autonomous Era — Headless CI/CD and Mutation Fingerprinting</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Sat, 03 Jan 2026 18:30:05 +0000</pubDate>
      <link>https://forem.com/godhirajcode/starlight-part-3-the-autonomous-era-headless-cicd-and-mutation-fingerprinting-40ib</link>
      <guid>https://forem.com/godhirajcode/starlight-part-3-the-autonomous-era-headless-cicd-and-mutation-fingerprinting-40ib</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/godhirajcode/beyond-the-black-box-visualizing-autonomous-intelligence-with-starlight-mission-control-2kbo"&gt;Part 2: Mission Control&lt;/a&gt;, we explored the visual dashboard that lets you monitor the Starlight constellation in real-time. But in practice, most enterprise automation runs in CI/CD pipelines, headless, in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Starlight v3.0: The Autonomous Era.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've moved beyond visibility. The constellation can now sense its environment and run without human intervention.&lt;/p&gt;

&lt;p&gt;🧬&lt;/p&gt;

&lt;h2&gt;
  
  
  Stability Sensing: Knowing When the Page is Ready
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9o4857f25rw0ixyloae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9o4857f25rw0ixyloae.png" alt="Starlight Mission Control v2.8" width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The upgraded Starlight Mission Control — Visualizing stability in real-time.&lt;/p&gt;

&lt;p&gt;The hardest problem in browser automation isn't clicking a button—it's knowing &lt;em&gt;when&lt;/em&gt; to click. Traditional scripts use arbitrary waits like &lt;code&gt;wait_for_timeout(3000)&lt;/code&gt;, which are either too slow or too fast.&lt;/p&gt;

&lt;p&gt;Starlight v3.0 introduces &lt;strong&gt;Mutation Fingerprinting&lt;/strong&gt; to solve this.&lt;/p&gt;

&lt;p&gt;When you record a test using the built-in recorder, Starlight doesn't just capture your clicks. It also measures how long the page takes to "settle" after each action using the browser's MutationObserver API.&lt;/p&gt;

&lt;p&gt;For example, if a page needs 450ms of DOM silence before it's truly stable, that timing is saved with the action. During playback, the &lt;strong&gt;Pulse Sentinel&lt;/strong&gt; uses this data to wait exactly the right amount of time—no more, no less.&lt;/p&gt;
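&lt;p&gt;The "DOM silence" rule behind Mutation Fingerprinting can be sketched as a pure function. This is a toy model: the function name and signature are illustrative, not Starlight's actual API.&lt;/p&gt;

```python
# Sketch of the "DOM silence" rule behind Mutation Fingerprinting.
# The name and signature are illustrative, not Starlight's real API.

def settle_time(mutation_times_ms, quiet_window_ms=450):
    """Return the timestamp (ms) at which the page counts as stable:
    the first moment after which no mutation occurs for quiet_window_ms."""
    times = sorted(mutation_times_ms)
    if not times:
        return 0  # nothing ever mutated: stable immediately
    settled = times[0] + quiet_window_ms
    for prev, curr in zip(times, times[1:]):
        if curr - prev >= quiet_window_ms:
            return prev + quiet_window_ms  # a quiet gap ended the burst
        settled = curr + quiet_window_ms
    return settled  # stable once the last mutation has gone quiet
```

During recording, the observed quiet window (450ms in the example above) would be stored alongside the action for the Pulse Sentinel to replay.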

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Temporal Intelligence&lt;/strong&gt;: each recorded action carries its own measured settle time, so playback adapts to the page's real rhythm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The result: Tests that are fast when they can be, and patient when they need to be.&lt;/p&gt;

&lt;p&gt;🤖&lt;/p&gt;

&lt;h2&gt;
  
  
  One Command to Run Everything
&lt;/h2&gt;

&lt;p&gt;Running a multi-agent system used to mean opening multiple terminals: one for the Hub, one for each Sentinel, and one for your test. That's a lot of moving pieces.&lt;/p&gt;

&lt;p&gt;With v3.0, we've simplified this to a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node bin/starlight.js test/my_mission.js --headless
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;CLI Orchestrator&lt;/strong&gt; handles the lifecycle automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starts the Hub and waits for it to be ready&lt;/li&gt;
&lt;li&gt;Launches the Sentinels (Pulse, Janitor, etc.)&lt;/li&gt;
&lt;li&gt;Runs your test&lt;/li&gt;
&lt;li&gt;Generates the report and cleans up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes Starlight straightforward to integrate into GitHub Actions, GitLab CI, or any CI/CD pipeline.&lt;/p&gt;
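&lt;p&gt;As a concrete illustration, a GitHub Actions job might wrap the single command like this. The workflow below is a hypothetical sketch: the job layout and step names are assumptions, only the &lt;code&gt;node bin/starlight.js&lt;/code&gt; invocation and the npm/pip install steps come from this series.&lt;/p&gt;

```yaml
# Hypothetical GitHub Actions job; layout and names are assumptions,
# only the starlight.js command and install steps come from the article.
name: starlight-mission
on: [push]
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install
      - run: pip install -r requirements.txt  # Sentinels are Python
      - run: node bin/starlight.js test/my_mission.js --headless
```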

&lt;p&gt;🏷️&lt;/p&gt;

&lt;h2&gt;
  
  
  The No-Code Recorder
&lt;/h2&gt;

&lt;p&gt;The test recorder has been upgraded with an in-browser HUD (Heads-Up Display). When you start a recording from Mission Control, a small floating panel appears on the page.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tag Next Click&lt;/strong&gt;: Give a meaningful name to your next action (e.g., "Login Button" instead of a raw selector)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Checkpoints&lt;/strong&gt;: Insert named markers like "Cart Updated" to track logical milestones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop Recording&lt;/strong&gt;: End the session and save the generated test file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lets you create tests by interacting with your site normally, while adding semantic meaning where it matters.&lt;/p&gt;
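&lt;p&gt;Conceptually, each recorded step bundles the raw event with its semantic tag, checkpoint, and measured settle window. A minimal sketch, with hypothetical field names rather than Starlight's on-disk format:&lt;/p&gt;

```python
# Hypothetical shape of a recorded step; field names are illustrative,
# not Starlight's actual on-disk format.

def record_step(selector, label=None, settle_ms=0, checkpoint=None):
    """Bundle a raw click with its semantic tag and measured settle time."""
    return {
        "cmd": "click",
        "selector": selector,
        "label": label or selector,   # "Login Button" beats a raw selector
        "settle_ms": settle_ms,       # from Mutation Fingerprinting
        "checkpoint": checkpoint,     # e.g. "Cart Updated"
    }

step = record_step("#login", label="Login Button", settle_ms=450)
```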

&lt;p&gt;📋&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in v3.0.1
&lt;/h2&gt;

&lt;p&gt;The latest patch fixed an issue where the checkpoint and stop buttons in the HUD weren't responding. The fix involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replacing browser-native dialogs (which Playwright intercepts) with in-HUD controls&lt;/li&gt;
&lt;li&gt;Ensuring recording functions are available before page navigation&lt;/li&gt;
&lt;li&gt;Using a fresh browser instance for each recording session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the kinds of edge cases you discover when building automation tools—the automation framework was automating away its own UI dialogs.&lt;/p&gt;

&lt;p&gt;🌌&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Starlight
&lt;/h2&gt;

&lt;p&gt;Starlight is designed for complex, dynamic web applications where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pages have frequent DOM changes (animations, lazy loading)&lt;/li&gt;
&lt;li&gt;Unexpected popups and modals appear&lt;/li&gt;
&lt;li&gt;Selectors change due to dynamic IDs or framework updates&lt;/li&gt;
&lt;li&gt;You need detailed reports showing what happened during a test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For simple, static sites, traditional automation tools work fine. Starlight shines when the environment is unpredictable.&lt;/p&gt;

&lt;p&gt;🔮&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next for Starlight
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Observability &amp;amp; Telemetry (Phase 10)&lt;/strong&gt;: OpenTelemetry integration for Datadog/Grafana, Slack/Teams webhooks, and SLA dashboards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural Language Intent (Phase 13)&lt;/strong&gt;: Plain English test writing, Gherkin support, and auto-generated BDD scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentinel Marketplace (Phase 15)&lt;/strong&gt;: Community registry for custom Sentinels and one-command installation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The foundation we've built—the Hub, Sentinels, and the Starlight Protocol—is designed to be extensible. The marketplace will let teams share solutions for common obstacles (cookie banners, CAPTCHA handlers, login flows) rather than solving them from scratch.&lt;/p&gt;

&lt;p&gt;👟&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Clone the &lt;a href="https://github.com/starlight-protocol/starlight" rel="noopener noreferrer"&gt;repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;npm install&lt;/code&gt; and &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start Mission Control: &lt;code&gt;node launcher/server.js&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Open &lt;code&gt;http://localhost:3000&lt;/code&gt; and explore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✨&lt;/p&gt;

&lt;p&gt;Built with ❤️ by &lt;a href="https://www.dhirajdas.dev" rel="noopener noreferrer"&gt;Dhiraj Das&lt;/a&gt;&lt;br&gt;
The mission is autonomous. The value is measurable. The future is visible.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Starlight Part 5: Introducing the Starlight Protocol Specification v1.0.0</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Sat, 03 Jan 2026 18:27:52 +0000</pubDate>
      <link>https://forem.com/godhirajcode/starlight-part-5-introducing-the-starlight-protocol-specification-v100-36n1</link>
      <guid>https://forem.com/godhirajcode/starlight-part-5-introducing-the-starlight-protocol-specification-v100-36n1</guid>
      <description>&lt;p&gt;Today, I'm excited to announce the release of the &lt;strong&gt;Starlight Protocol Specification v1.0.0&lt;/strong&gt;—a formal, open standard for building self-healing browser automation systems. This isn't just another testing library. It's a &lt;strong&gt;protocol&lt;/strong&gt;—a contract that defines how autonomous agents coordinate to handle the chaos of modern web applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhe6o7vde6pla34qldlzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhe6o7vde6pla34qldlzk.png" alt="Starlight Protocol Logo" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The official logo for the Starlight Protocol.&lt;/p&gt;

&lt;p&gt;🚨&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem We're Solving
&lt;/h2&gt;

&lt;p&gt;Every automation engineer knows this pain. The button is still there, your code is the same—but the environment changed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Your test yesterday
await page.click('#submit-btn');  // ✅ Passed

// Your test today
await page.click('#submit-btn');  // ❌ Failed: Element blocked by cookie banner
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traditional frameworks force you to write defensive code—50 if-statements for every possible environmental obstacle. This is madness. Your test should express &lt;strong&gt;intent&lt;/strong&gt;; the environment's chaos should be someone else's problem.&lt;/p&gt;

&lt;p&gt;💡&lt;/p&gt;

&lt;h2&gt;
  
  
  The Starlight Solution: Decoupling Intent from Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pulse Sentinel&lt;/strong&gt;: Monitors DOM/Network stability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Janitor Sentinel&lt;/strong&gt;: Clears popups and modals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision Sentinel&lt;/strong&gt;: AI-powered obstacle detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before every action, the Hub asks ALL Sentinels: 'Is the environment safe?' Only when they ALL agree does the action proceed.&lt;/p&gt;
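&lt;p&gt;The unanimous veto can be sketched as a simple pre-check loop. In the real system the Hub speaks JSON-RPC to separate Sentinel processes; the in-process callables below are stand-ins for that plumbing.&lt;/p&gt;

```python
# Minimal sketch of the Hub's unanimous pre-check. The real Hub speaks
# JSON-RPC to separate processes; these in-process callables are stand-ins.

def pre_check(sentinels, command):
    """Run every Sentinel's check; proceed only if all approve."""
    verdicts = {name: check(command) for name, check in sentinels.items()}
    return all(verdicts.values()), verdicts

sentinels = {
    "pulse":   lambda cmd: True,   # DOM/network is stable
    "janitor": lambda cmd: False,  # a modal is blocking the page
}
ok, verdicts = pre_check(sentinels, {"cmd": "click", "selector": "#submit"})
```

A single dissenting Sentinel (here, the Janitor spotting a modal) is enough to hold the action until the path is clear.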

&lt;p&gt;📜&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Protocol, Not Just a Library?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Single&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extensibility&lt;/td&gt;
&lt;td&gt;Fork/Change&lt;/td&gt;
&lt;td&gt;Add Components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interoperability&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Universal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standardization&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Formal Spec&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By publishing Starlight as a protocol, we enable Hub implementations in any language, a community-built Sentinel ecosystem, and cross-platform compatibility.&lt;/p&gt;

&lt;p&gt;📋&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in the Specification?
&lt;/h2&gt;

&lt;p&gt;The spec defines everything needed to build a compliant implementation: message formats, 12 protocol methods, the handshake lifecycle, and three compliance levels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
                "jsonrpc": "2.0",
                "method": "starlight.pre_check",
                "params": {
                    "command": { "cmd": "click", "selector": "#submit" }
                },
                "id": "msg-001"
            }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
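&lt;p&gt;A conforming client just serializes that envelope. A quick Python sketch; only the obvious JSON-RPC 2.0 framing is shown here, the full spec defines many more constraints:&lt;/p&gt;

```python
import json

# Build the starlight.pre_check envelope shown above. Only the JSON-RPC
# framing is sketched here; the full spec defines more constraints.

def pre_check_request(cmd, selector, msg_id):
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "starlight.pre_check",
        "params": {"command": {"cmd": cmd, "selector": selector}},
        "id": msg_id,
    })

wire = pre_check_request("click", "#submit", "msg-001")
msg = json.loads(wire)
```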



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Requirements&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Level 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All core methods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Level 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+ Context, Entropy, Health&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Level 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+ Semantic Goals, Self-Healing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🎯&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Goals
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero Configuration&lt;/strong&gt;: Works out of the box with sensible defaults&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language Agnostic&lt;/strong&gt;: Hubs and Sentinels can be built in any language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable&lt;/strong&gt;: Add or remove Sentinels without changing your tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read the Specification&lt;/strong&gt;: &lt;a href="https://starlight-protocol.github.io/starlight/" rel="noopener noreferrer"&gt;STARLIGHT_PROTOCOL_SPEC_v1.0.0.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the Reference Implementation&lt;/strong&gt;: &lt;code&gt;git clone https://github.com/starlight-protocol/starlight.git &amp;amp;&amp;amp; npm install &amp;amp;&amp;amp; node src/hub.js&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Your Own Sentinel&lt;/strong&gt;: Extend &lt;code&gt;SentinelBase&lt;/code&gt; and implement your detection logic.&lt;/li&gt;
&lt;/ol&gt;
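&lt;p&gt;A custom Sentinel then reduces to one detection method. In the sketch below, &lt;code&gt;SentinelBase&lt;/code&gt; is a minimal stand-in so the example runs on its own; the real base class (with Hub registration and JSON-RPC plumbing) lives in the repository, and the blocker selectors are invented for illustration.&lt;/p&gt;

```python
# Stand-in base class so this sketch is self-contained; the real
# SentinelBase (Hub registration, JSON-RPC plumbing) is in the repo.
class SentinelBase:
    def check(self, page_state):
        raise NotImplementedError

class BannerSentinel(SentinelBase):
    """Vetoes actions while a promo banner is visible (selectors invented)."""
    BLOCKERS = {"#promo-banner", ".overlay"}

    def check(self, page_state):
        # page_state is assumed to expose the selectors currently visible
        return not (self.BLOCKERS & set(page_state["visible"]))

sentinel = BannerSentinel()
```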

&lt;p&gt;🌌&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the Constellation
&lt;/h2&gt;

&lt;p&gt;The stars in the constellation are many, but the intent is one. Contribute Hub implementations in Rust, Go, or Python. Share your community-built Sentinels. Let's build the future of autonomous browser agents together.&lt;/p&gt;

&lt;p&gt;✨&lt;/p&gt;

&lt;p&gt;Built with ❤️ by &lt;a href="https://www.dhirajdas.dev" rel="noopener noreferrer"&gt;Dhiraj Das&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/starlight-protocol/starlight" rel="noopener noreferrer"&gt;starlight-protocol/starlight&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Starlight Part 4: Democratizing the Constellation — The Visual Sentinel Editor</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Sat, 03 Jan 2026 18:25:57 +0000</pubDate>
      <link>https://forem.com/godhirajcode/starlight-part-4-democratizing-the-constellation-the-visual-sentinel-editor-26j6</link>
      <guid>https://forem.com/godhirajcode/starlight-part-4-democratizing-the-constellation-the-visual-sentinel-editor-26j6</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/godhirajcode/starlight-part-3-the-autonomous-era-headless-cicd-and-mutation-fingerprinting-40ib"&gt;Part 3: The Autonomous Era&lt;/a&gt;, we explored how Starlight v3.0 runs hands-free in CI/CD pipelines. But there was still one barrier: creating custom Sentinels required Python programming skills. Not anymore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starlight v3.0.3 introduces the Visual Sentinel Editor—a no-code UI that lets anyone build a custom Sentinel in under a minute.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🛠️&lt;/p&gt;

&lt;h2&gt;
  
  
  The Visual Sentinel Editor
&lt;/h2&gt;

&lt;p&gt;Imagine you're testing an e-commerce site. Every few months, they change their cookie consent banner. Your tests fail. Your developers grumble. The cycle repeats. With the Visual Sentinel Editor, a QA analyst—with zero Python experience—can solve this in 3 clicks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgpnwiohtjhi2wrnaamx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgpnwiohtjhi2wrnaamx.png" alt="Starlight Visual Sentinel Editor" width="800" height="915"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Starlight Visual Sentinel Editor — Building autonomous agents without code.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open the Editor&lt;/strong&gt;: Click "🛠️ Create Sentinel" from Mission Control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose Template&lt;/strong&gt;: Pre-fills with common selectors and proven logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🚀 Export&lt;/strong&gt;: The editor generates the Python code and saves it to your fleet automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🎯&lt;/p&gt;

&lt;h2&gt;
  
  
  Template-First Design
&lt;/h2&gt;

&lt;p&gt;We studied hundreds of real-world automation failures and distilled them into four core templates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Template&lt;/th&gt;
&lt;th&gt;Solves&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cookie Banner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GDPR consent popups that block interactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Modal Popup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Subscribe to newsletter" overlays&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Login Wall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Please sign in to continue" blockers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate Limiter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CAPTCHAs and "Too many requests" errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
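&lt;p&gt;Under the hood, a template is mostly a curated selector list plus a dismiss preference. A sketch of the Cookie Banner template's matching logic; the selectors here are common examples, not the editor's actual list:&lt;/p&gt;

```python
# Template-style matching: find the first known consent button present
# on the page. The selector list is illustrative, not the editor's own.

COOKIE_ACCEPT_SELECTORS = [
    "#onetrust-accept-btn-handler",
    "button[aria-label='Accept cookies']",
    ".cookie-consent .accept",
]

def pick_dismiss_selector(present_selectors):
    """Return the first template selector that exists on the page, if any."""
    present = set(present_selectors)
    for sel in COOKIE_ACCEPT_SELECTORS:
        if sel in present:
            return sel
    return None
```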

&lt;p&gt;🛰️&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fleet Manager: Control Your Constellation
&lt;/h2&gt;

&lt;p&gt;Mission Control now automatically discovers every Sentinel in your directory. Each card shows the Sentinel's status and allows for granular lifecycle management. Click &lt;strong&gt;"▶️ Start All"&lt;/strong&gt; and the entire constellation launches in a staggered, optimized sequence.&lt;/p&gt;

&lt;p&gt;Our philosophy is simple: &lt;strong&gt;any Sentinel you create becomes a first-class citizen.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔔&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-Time Webhook Alerts
&lt;/h2&gt;

&lt;p&gt;v3.0.2 introduced &lt;strong&gt;Webhook Alerting&lt;/strong&gt;, delivering instant notifications to Slack, Teams, or Discord when a mission succeeds or fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "webhooks": {
        "enabled": true,
        "urls": ["https://hooks.slack.com/services/XXX"],
        "notifyOn": ["failure", "success"]
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
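&lt;p&gt;On the sending side, an alert is just an HTTP POST of a small payload. A sketch of building a Slack-style message body; only the config format above is from the release, the message wording and function name are assumptions:&lt;/p&gt;

```python
import json

# Build a Slack-compatible webhook body for a mission result.
# The wording and function name are illustrative, not Starlight's own.

def mission_alert(mission, status, notify_on):
    if status not in notify_on:
        return None  # respects the "notifyOn" filter from the config
    icon = "✅" if status == "success" else "❌"
    return json.dumps({"text": f"{icon} Mission '{mission}' finished: {status}"})

body = mission_alert("checkout-flow", "failure", ["failure", "success"])
```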



&lt;p&gt;🌌&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vision: A Community of Sentinels
&lt;/h2&gt;

&lt;p&gt;We're building toward a &lt;strong&gt;Sentinel Marketplace&lt;/strong&gt; where community-maintained agents handle everything from Shopify checkouts to dark pattern detection. The constellation grows stronger with each contribution.&lt;/p&gt;

&lt;p&gt;🔮&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Natural Language Intent&lt;/strong&gt;: Write tests in plain English: "Log in and add the first product to cart".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry Integration&lt;/strong&gt;: Export traces to Datadog, Grafana, or your APM of choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Browser and Mobile Support&lt;/strong&gt;: Safari, Firefox, and mobile testing environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility Support&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;✨&lt;/p&gt;

&lt;p&gt;Built with ❤️ by &lt;a href="https://www.dhirajdas.dev" rel="noopener noreferrer"&gt;Dhiraj Das&lt;/a&gt;&lt;br&gt;
The stars are aligned. The constellation is ready.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Beyond the Black Box: Visualizing Autonomous Intelligence with Starlight Mission Control</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Thu, 01 Jan 2026 11:35:28 +0000</pubDate>
      <link>https://forem.com/godhirajcode/beyond-the-black-box-visualizing-autonomous-intelligence-with-starlight-mission-control-2kbo</link>
      <guid>https://forem.com/godhirajcode/beyond-the-black-box-visualizing-autonomous-intelligence-with-starlight-mission-control-2kbo</guid>
      <description>&lt;p&gt;In our &lt;a href="https://dev.to/godhirajcode/beyond-selectors-the-starlight-protocol-v25-and-the-era-of-sovereign-automation-552l"&gt;previous exploration of the Starlight Protocol&lt;/a&gt;, we detailed the "nervous system" of autonomous automation. This is the continuation—the "Mission Control" that makes it all visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  From "Scripts" to "Sovereign Dashboards"
&lt;/h2&gt;

&lt;p&gt;Part 1 explored the inner workings of the Starlight Protocol: how Sentinels coordinate to clear obstacles and how the Hub "learns" from history.&lt;/p&gt;

&lt;p&gt;But there was a missing piece: &lt;strong&gt;How do humans interact with an autonomous fleet?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automation has historically been a "black box." You fire off a script, cross your fingers, and wait for a green or red light in a terminal. If it fails, a human has to dig through thousands of lines of logs to find out why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starlight changes this. We’ve turned "Automation" into "Mission Control."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🛰️&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mission Control Launchpad
&lt;/h2&gt;

&lt;p&gt;We’ve moved beyond the command line. The &lt;strong&gt;Starlight Mission Control&lt;/strong&gt; is a premium dashboard designed for everyone—from the software engineer to the Project Manager. It gives you a real-time, window-seat view into the brain of the constellation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv6c9218h3q538o01wqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv6c9218h3q538o01wqm.png" alt="Starlight Mission Control" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Starlight Mission Control Dashboard (Click to zoom)&lt;/p&gt;

&lt;p&gt;When you hit &lt;strong&gt;"Launch Mission,"&lt;/strong&gt; you aren't just starting a test; you're initiating a sovereign journey. You can watch as the Hub hands off tasks to the Janitor, or as the Pulse Sentinel holds the line during a heavy network jitter.&lt;/p&gt;

&lt;p&gt;📈&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Pass/Fail: The "Autonomous Vitals"
&lt;/h2&gt;

&lt;p&gt;The most significant shift in Starlight v2.8 is how we measure success. In traditional testing, a "Pass" just means nothing broke &lt;em&gt;this time&lt;/em&gt;. In Starlight, we track &lt;strong&gt;Intelligence.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Success Rate&lt;/strong&gt;: Not just a static percentage, but a real-time health indicator of your environment's resilience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saved Effort (ROI)&lt;/strong&gt;: Every time a Sentinel clears a popup, it saves a human about 5 minutes of reproduction and triage work. We quantify this. The Mission Control dashboard ticks up in real-time, showing exactly how many manual "engineering hours" have been reclaimed by the Starlight Protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sovereign MTTR (Mean Time to Recovery)&lt;/strong&gt;: How fast does the automation "heal"? We track the milliseconds it takes for a Sentinel to detect an obstacle, hijack the browser, fix the state, and resume the mission. This is the ultimate metric for a self-healing system.&lt;/li&gt;
&lt;/ol&gt;
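&lt;p&gt;Both Saved Effort and MTTR reduce to simple arithmetic over the intervention log. A sketch; the 5-minutes-per-intervention figure is the article's estimate, and the event field names are hypothetical:&lt;/p&gt;

```python
# Compute Saved Effort and mean MTTR from a list of Sentinel interventions.
# Field names are illustrative; the 5-minute estimate is from the article.

MINUTES_SAVED_PER_FIX = 5

def vitals(interventions):
    saved_min = len(interventions) * MINUTES_SAVED_PER_FIX
    recoveries = [i["resumed_ms"] - i["detected_ms"] for i in interventions]
    mttr_ms = sum(recoveries) / len(recoveries) if recoveries else 0
    return {"saved_minutes": saved_min, "mttr_ms": mttr_ms}

log = [
    {"detected_ms": 1000, "resumed_ms": 1300},  # popup cleared in 300ms
    {"detected_ms": 5000, "resumed_ms": 5500},  # modal cleared in 500ms
]
v = vitals(log)
```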

&lt;p&gt;🌠&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Mission Evidence" Report: Proof for the Boardroom
&lt;/h2&gt;

&lt;p&gt;At the end of every mission, Starlight generates more than just a log file. It generates a comprehensive &lt;strong&gt;Mission Evidence&lt;/strong&gt; report.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b0vprqd3j4nmlzkks2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b0vprqd3j4nmlzkks2u.png" alt="The Mission Evidence Report" width="800" height="1248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detailed ROI and Self-Healing Proof (Click to zoom)&lt;/p&gt;

&lt;p&gt;This report is designed to be shared with stakeholders who don't care about XPaths but care deeply about &lt;strong&gt;reliability.&lt;/strong&gt; It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visual Evidence:&lt;/strong&gt; Before and after screenshots of every obstacle the Sentinels cleared.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Healing Badges:&lt;/strong&gt; Clear proof of the missions that &lt;em&gt;would have failed&lt;/em&gt; in traditional tools but succeeded in Starlight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROI Dashboard:&lt;/strong&gt; A professional summary of time and money saved during the run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🤝&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Trust Model: Visibility = Resilience
&lt;/h2&gt;

&lt;p&gt;The biggest hurdle for AI-driven automation is trust. People are afraid that "Auto-Healing" might hide real bugs.&lt;/p&gt;

&lt;p&gt;Starlight solves this through &lt;strong&gt;High-Fidelity Visibility.&lt;/strong&gt; By making the "Handshake" logs human-readable and the GUI dashboard accessible, we allow teams to &lt;em&gt;trust but verify&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When you see the Janitor Sentinel clear a "Newsletter Popup," you aren't just seeing a test pass; you're seeing a repetitive human task being permanently offloaded to a sovereign agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Era of the Automation Architect
&lt;/h2&gt;

&lt;p&gt;With the release of the Mission Control GUI and the Observability Engine, we’ve lowered the barrier to entry. You don't need to be a Python expert to launch a mission; you just need a goal.&lt;/p&gt;

&lt;p&gt;The stars are no longer just for navigation—they are for everyone to see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mission is autonomous. The value is measurable. The future is visible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✨&lt;/p&gt;

&lt;p&gt;Explore the premium Mission Control UI on our &lt;a href="https://github.com/godhiraj-code/cba" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
Built with ❤️ by &lt;a href="https://www.dhirajdas.dev" rel="noopener noreferrer"&gt;Dhiraj Das&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Beyond Selectors: The Starlight Protocol and the Era of Sovereign Automation</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Tue, 30 Dec 2025 11:36:42 +0000</pubDate>
      <link>https://forem.com/godhirajcode/beyond-selectors-the-starlight-protocol-v25-and-the-era-of-sovereign-automation-552l</link>
      <guid>https://forem.com/godhirajcode/beyond-selectors-the-starlight-protocol-v25-and-the-era-of-sovereign-automation-552l</guid>
      <description>&lt;p&gt;"The ground is chaotic. Navigation requires a higher frame of reference."&lt;/p&gt;

&lt;p&gt;— Inspired by the dung beetle, which navigates using the Milky Way&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Traditional Automation
&lt;/h2&gt;

&lt;p&gt;Every test engineer has experienced the 3 AM page: "Build Failed - Element Not Found."&lt;/p&gt;

&lt;p&gt;Traditional browser automation is &lt;strong&gt;fragile by design&lt;/strong&gt;. We bind our tests to the implementation details of the UI—CSS selectors, XPaths, and dynamic IDs that change with every sprint. When a developer renames a button, our tests break. When a modal appears unexpectedly, our scripts crash.&lt;/p&gt;

&lt;p&gt;The industry's solution? Add more wait statements. More try-catch blocks. More conditional logic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Fundamental Problem&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is treating symptoms, not the disease. &lt;strong&gt;The fundamental problem is that we're looking at the ground when we should be looking at the stars.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Constellation-Based Automation
&lt;/h2&gt;

&lt;p&gt;What if your automation could handle unexpected obstacles the way a human does—not by predicting every possible state, but by &lt;em&gt;adapting&lt;/em&gt; to whatever the environment throws at it?&lt;/p&gt;

&lt;p&gt;This is the core philosophy behind &lt;strong&gt;Constellation-Based Automation (CBA)&lt;/strong&gt; and its communication protocol, &lt;strong&gt;Starlight&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of writing scripts that handle every edge case, CBA introduces a &lt;strong&gt;Sovereign Constellation&lt;/strong&gt; of autonomous agents that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitor the environment&lt;/strong&gt; for obstacles (popups, modals, network jitter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear the path&lt;/strong&gt; before your intent even knows there was a problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn from experience&lt;/strong&gt; to handle similar situations faster next time&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Clean Intent&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your test script stays clean and focused on the business goal. The &lt;em&gt;environment's chaos&lt;/em&gt; becomes someone else's problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: A New Paradigm
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                      INTENT LAYER                           │
│         "Login" • "Submit Form" • "Initiate Mission"        │
└─────────────────────────┬───────────────────────────────────┘
                          │ JSON-RPC
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                        CBA HUB                              │
│              Orchestrator • Semantic Resolver               │
│                   Predictive Memory                         │
└───────────┬─────────────┬─────────────┬─────────────────────┘
            │             │             │
            ▼             ▼             ▼
    ┌───────────┐  ┌───────────┐  ┌───────────┐
    │   PULSE   │  │  JANITOR  │  │  VISION   │
    │ Stability │  │ Heuristic │  │  AI-Based │
    │  Monitor  │  │  Healing  │  │ Detection │
    └───────────┘  └───────────┘  └───────────┘
         │              │              │
         └──────────────┴──────────────┘
                        │
                        ▼
              ┌──────────────────┐
              │     BROWSER      │
              │   (Playwright)   │
              └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Sentinels
&lt;/h2&gt;

&lt;h3&gt;
  1. Pulse Sentinel — The Guardian of Time
&lt;/h3&gt;

&lt;p&gt;Monitors network requests and DOM mutations. Vetoes execution until the environment is stable. &lt;strong&gt;Eliminates the need for &lt;code&gt;setTimeout&lt;/code&gt; or &lt;code&gt;waitForSelector&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;
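&lt;p&gt;The veto boils down to a quiescence check: the page counts as "ready" only after a quiet window passes with no new events. Here is a minimal Python sketch of that idea; the class, quiet-window heuristic, and method names are illustrative, not Pulse's actual implementation:&lt;/p&gt;

```python
import time

class StabilityMonitor:
    """Toy quiescence check (illustrative, not the real Pulse Sentinel):
    the page is considered stable once no network or DOM events have been
    observed for a configurable quiet window."""

    def __init__(self, quiet_window=0.5):
        self.quiet_window = quiet_window
        self.last_event = time.monotonic()

    def record_event(self):
        # Called for every network request or DOM mutation observed.
        self.last_event = time.monotonic()

    def is_stable(self):
        # Stable when the quiet window has elapsed with no new events.
        return (time.monotonic() - self.last_event) >= self.quiet_window

monitor = StabilityMonitor(quiet_window=0.1)
monitor.record_event()
print(monitor.is_stable())  # False: an event just fired
time.sleep(0.15)
print(monitor.is_stable())  # True: the window elapsed quietly
```

&lt;p&gt;In a real Sentinel, an unstable reading would translate into a &lt;code&gt;starlight.wait&lt;/code&gt; veto rather than a boolean.&lt;/p&gt;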

&lt;h3&gt;
  2. Janitor Sentinel — The Heuristic Healer
&lt;/h3&gt;

&lt;p&gt;Detects known obstacle patterns (modals, cookie banners). Clears them automatically using proven selectors. &lt;strong&gt;Learns which actions work and remembers for next time.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  3. Vision Sentinel — The AI Eye
&lt;/h3&gt;

&lt;p&gt;Uses local AI models (Ollama/Moondream) to &lt;em&gt;see&lt;/em&gt; obstacles. Works without selectors—pure visual detection. &lt;strong&gt;Handles encrypted or obfuscated UIs.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Starlight Protocol
&lt;/h2&gt;

&lt;p&gt;Communication between the Hub and Sentinels uses JSON-RPC 2.0 with a set of standardized signals:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;starlight.intent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"I want to click the Login button"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;starlight.pre_check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Everyone check the path before I proceed"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;starlight.clear&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Path is clear, proceed"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;starlight.wait&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Hold on, environment is unstable"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;starlight.hijack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"I need to take over and fix something"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;starlight.resume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Problem fixed, continue the mission"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Consensus-Based Execution&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Hub never executes an action until all relevant Sentinels have cleared the path.&lt;/p&gt;
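&lt;p&gt;On the wire, each signal is an ordinary JSON-RPC 2.0 envelope. A minimal Python sketch follows; the &lt;code&gt;params&lt;/code&gt; payload is an assumption for illustration, not the official Starlight schema:&lt;/p&gt;

```python
import json

def make_signal(method, params, msg_id):
    """Build a Starlight-style JSON-RPC 2.0 message. Field names beyond
    the standard envelope (e.g. "goal") are illustrative."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": method,
        "params": params,
    })

# The Hub broadcasts a pre-check before acting on an intent:
msg = make_signal("starlight.pre_check", {"goal": "click Login"}, 42)
decoded = json.loads(msg)
print(decoded["method"])  # starlight.pre_check
```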

&lt;h2&gt;
  
  
  Predictive Intelligence: The Galaxy Mesh
&lt;/h2&gt;

&lt;p&gt;CBA doesn't just react—it &lt;strong&gt;learns&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  1. Self-Healing Selectors
&lt;/h3&gt;

&lt;p&gt;When a selector fails, the Hub checks its historical memory. If it has seen this goal before with a different selector that worked, it substitutes automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// First run: User clicks "Submit" → selector fails
// Hub learns: "Submit" goal worked with "#submit-btn" in the past
// Second run: Auto-substitutes and succeeds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  2. Aura-Based Throttling
&lt;/h3&gt;

&lt;p&gt;The Hub tracks &lt;em&gt;when&lt;/em&gt; entropy events occur during missions. If the first 5 seconds of a particular page are historically unstable, it proactively slows down before problems occur.&lt;/p&gt;
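&lt;p&gt;A toy version of that heuristic: count how many entropy events past missions recorded near the current point in the page's lifetime, and pause proportionally. The function, window, and delay constants below are assumptions, not the real Aura algorithm:&lt;/p&gt;

```python
def throttle_delay(entropy_history, elapsed, window=5.0, base_delay=0.25):
    """Heuristic sketch of aura-based throttling (not the real Aura
    algorithm): the more entropy events past missions logged near this
    point in the page's lifetime, the longer the proactive pause."""
    # Count historical events that fell inside the surrounding window.
    hits = sum(1 for t in entropy_history if window / 2 >= abs(t - elapsed))
    return base_delay * hits

# Historical entropy clustered in the first five seconds of the page:
history = [0.4, 1.2, 2.9, 14.0]
print(throttle_delay(history, elapsed=1.0))   # 0.75: slow down early on
print(throttle_delay(history, elapsed=20.0))  # 0.0: full speed later
```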

&lt;h3&gt;
  3. Sentinel Memory
&lt;/h3&gt;

&lt;p&gt;Sentinels remember which remediation actions worked. If the Janitor cleared a modal with &lt;code&gt;.modal .close-btn&lt;/code&gt;, it remembers this for next time—skipping the exploration phase entirely.&lt;/p&gt;
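&lt;p&gt;That remembering step can be as simple as a JSON file keyed by obstacle type, in the spirit of the SDK's JSON-based persistent memory. The file layout and method names below are invented for illustration:&lt;/p&gt;

```python
import json
import os
import tempfile

class SentinelMemory:
    """Minimal JSON-backed remediation memory (illustrative sketch;
    the real SDK's persistence format is not reproduced here)."""

    def __init__(self, path):
        self.path = path
        self.remedies = {}
        if os.path.exists(path):
            with open(path) as f:
                self.remedies = json.load(f)

    def remember(self, obstacle, selector):
        # Record which selector successfully cleared this obstacle type.
        self.remedies[obstacle] = selector
        with open(self.path, "w") as f:
            json.dump(self.remedies, f)

    def recall(self, obstacle):
        return self.remedies.get(obstacle)

path = os.path.join(tempfile.mkdtemp(), "janitor_memory.json")
mem = SentinelMemory(path)
mem.remember("cookie-banner", ".modal .close-btn")

# A fresh instance reloads the learned remedy from disk:
print(SentinelMemory(path).recall("cookie-banner"))  # .modal .close-btn
```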

&lt;h2&gt;
  
  
  The ROI Dashboard: Proving Value
&lt;/h2&gt;

&lt;p&gt;Every mission generates a "Hero Story" report that quantifies the business value:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event Type&lt;/th&gt;
&lt;th&gt;Value Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sentinel Intervention&lt;/td&gt;
&lt;td&gt;5 minutes (manual triage avoided)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-Healing Event&lt;/td&gt;
&lt;td&gt;2-3 minutes (debugging avoided)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aura Stabilization&lt;/td&gt;
&lt;td&gt;30 seconds (flake prevention)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;From Cost Center to Value Generator&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This transforms testing from a cost center to a &lt;em&gt;measurable value generator&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Impact
&lt;/h2&gt;

&lt;p&gt;In traditional automation, a single unexpected modal can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Crash the test → 30 seconds wasted&lt;/li&gt;
&lt;li&gt;Trigger manual investigation → 5-10 minutes&lt;/li&gt;
&lt;li&gt;Require code changes → 30-60 minutes&lt;/li&gt;
&lt;li&gt;Wait for PR review → hours to days&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In CBA, the same modal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detected by Janitor Sentinel → 0.1 seconds&lt;/li&gt;
&lt;li&gt;Cleared automatically → 0.5 seconds&lt;/li&gt;
&lt;li&gt;Test continues successfully&lt;/li&gt;
&lt;li&gt;Event logged for dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Zero Human Minutes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Total impact: 0 human minutes required.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Time-Travel Triage: Debugging the Future
&lt;/h2&gt;

&lt;p&gt;When something does go wrong, CBA doesn't leave you guessing. The &lt;strong&gt;Time-Travel Triage&lt;/strong&gt; feature records every handshake, every decision, every DOM state.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;triage.html&lt;/code&gt;, load your mission trace, and &lt;em&gt;rewind&lt;/em&gt; to see exactly what the browser looked like when the failure occurred. No more "works on my machine" debates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone and setup
git clone https://github.com/godhiraj-code/cba
cd cba
npm install
pip install -r requirements.txt
npx playwright install chromium

# Run the constellation
run_cba.bat  # Windows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or build your own Sentinel in minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sdk.starlight_sdk import SentinelBase
import asyncio

class MySentinel(SentinelBase):
    def __init__(self):
        super().__init__(layer_name="MySentinel", priority=10)
        self.capabilities = ["custom-healing"]

    async def on_pre_check(self, params, msg_id):
        # Your healing logic here
        await self.send_clear()

if __name__ == "__main__":
    asyncio.run(MySentinel().start())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ WebSocket connection management&lt;/li&gt;
&lt;li&gt;✅ Auto-reconnect on failure&lt;/li&gt;
&lt;li&gt;✅ Persistent memory (JSON-based)&lt;/li&gt;
&lt;li&gt;✅ Graceful shutdown (Ctrl+C saves state)&lt;/li&gt;
&lt;li&gt;✅ Configuration loading&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Technology Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hub&lt;/td&gt;
&lt;td&gt;Node.js + Playwright&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sentinels&lt;/td&gt;
&lt;td&gt;Python + AsyncIO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;JSON-RPC 2.0 over WebSocket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI Vision&lt;/td&gt;
&lt;td&gt;Ollama + Moondream (local SLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Docker Compose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Privacy First&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All AI processing happens locally—no cloud dependencies, no data leakage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future: Sovereign Security
&lt;/h2&gt;

&lt;p&gt;Phase 9 is on the horizon, bringing enterprise-grade features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shadow DOM Penetration&lt;/strong&gt;: Handle modern web components with encapsulated styles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII Sentinel&lt;/strong&gt;: Detect and redact sensitive data before screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Sovereign&lt;/strong&gt;: Network-level chaos engineering and request mocking&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why "Starlight"?
&lt;/h2&gt;

&lt;p&gt;The dung beetle doesn't navigate by watching the ground. It looks up at the Milky Way—a fixed reference point that transcends the chaos below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Analogy&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Traditional automation is like watching the ground: every rock, every leaf, every obstacle requires explicit handling. CBA is like looking at the stars: we navigate by &lt;strong&gt;intent&lt;/strong&gt;, and the constellation handles the terrain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: A Paradigm Shift
&lt;/h2&gt;

&lt;p&gt;CBA isn't just a framework—it's a philosophical shift in how we think about automation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Old Paradigm&lt;/th&gt;
&lt;th&gt;New Paradigm&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Handle every edge case&lt;/td&gt;
&lt;td&gt;Adapt to any edge case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fragile selectors&lt;/td&gt;
&lt;td&gt;Semantic goals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard-coded waits&lt;/td&gt;
&lt;td&gt;Temporal intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invisible failures&lt;/td&gt;
&lt;td&gt;Quantified ROI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hope it works&lt;/td&gt;
&lt;td&gt;Know it will work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The goal is constant. The path is sovereign. The mission will succeed.&lt;/p&gt;

&lt;p&gt;"The stars are aligned."&lt;/p&gt;

&lt;p&gt;✨&lt;/p&gt;

&lt;p&gt;Built with ❤️ by &lt;a href="https://www.dhirajdas.dev" rel="noopener noreferrer"&gt;Dhiraj Das&lt;/a&gt;&lt;br&gt;
Explore the protocol on &lt;a href="https://github.com/godhiraj-code/cba" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Announcing pytest-mockllm v0.2.1: "True Fidelity"</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Tue, 23 Dec 2025 11:43:04 +0000</pubDate>
      <link>https://forem.com/godhirajcode/announcing-pytest-mockllm-v021-true-fidelity-1chf</link>
      <guid>https://forem.com/godhirajcode/announcing-pytest-mockllm-v021-true-fidelity-1chf</guid>
      <description>&lt;p&gt;🎯&lt;/p&gt;

&lt;h4&gt;
  
  
  What's New in v0.2.1
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;True Async &amp;amp; Await&lt;/strong&gt;: Native coroutines for OpenAI, Anthropic, Gemini, and LangChain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro Tokenizers&lt;/strong&gt;: tiktoken integration for &amp;gt;99% token accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII Redaction&lt;/strong&gt;: Automatic scrubbing of API keys before cassette storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Engineering&lt;/strong&gt;: Simulate rate limits, timeouts, and network jitter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.14 Ready&lt;/strong&gt;: First to officially support and verify the latest Python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are thrilled to announce the release of &lt;strong&gt;pytest-mockllm v0.2.1&lt;/strong&gt;, codenamed &lt;strong&gt;"True Fidelity"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This release is a complete technical overhaul designed to make LLM testing as robust as the systems you're building. For the first time, developers can test complex asynchronous AI workflows with a level of accuracy that mirrors production environments exactly.&lt;/p&gt;

&lt;p&gt;🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge We Solved
&lt;/h2&gt;

&lt;p&gt;When we first released pytest-mockllm, our async support was a "best-effort" wrapper around synchronous mocks. While this worked for simple cases, it failed in production-grade environments where developers used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex coroutine orchestration&lt;/strong&gt;: Real async workflows with multiple awaits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous generators&lt;/strong&gt;: Streaming responses via LangChain's &lt;code&gt;astream&lt;/code&gt; and &lt;code&gt;ainvoke&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict type checking&lt;/strong&gt;: MyPy compatibility requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise security&lt;/strong&gt;: VCR-style recordings risking API key leaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚡&lt;/p&gt;

&lt;h2&gt;
  
  
  True Async &amp;amp; Await
&lt;/h2&gt;

&lt;p&gt;We've rewritten our core mocks from the ground up to support real asynchronous patterns. No more fake awaitables—pytest-mockllm now provides native coroutines and async iterators for OpenAI, Anthropic, Gemini, and LangChain.&lt;/p&gt;

&lt;p&gt;Every provider mock now implements native &lt;code&gt;async def&lt;/code&gt; methods that return real coroutines. This ensures that &lt;code&gt;await&lt;/code&gt; calls behave exactly as they do with real SDKs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pytest
from pytest_mockllm import mock_openai

@pytest.mark.asyncio
async def test_async_completion():
    with mock_openai() as mock:
        mock.set_response("Hello from pytest-mockllm!")

        # Real async/await - no fake wrappers
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": "Hi"}]
        )

        assert response.choices[0].message.content == "Hello from pytest-mockllm!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pro Tokenizers (tiktoken)
&lt;/h2&gt;

&lt;p&gt;Standard character-based token estimation is often off by 20-30%. By integrating &lt;code&gt;tiktoken&lt;/code&gt; (OpenAI) and custom heuristics (Anthropic), we brought our accuracy to &amp;gt;99% for standard models.&lt;/p&gt;

&lt;p&gt;This allows developers to write precise assertions on usage and cost—critical for prompt window testing and budget limits.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Real Accuracy&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Token counts now match exactly what you'd see in your OpenAI dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  ROI Dashboard
&lt;/h2&gt;

&lt;p&gt;Run your tests and see your savings! Every session now ends with a professional terminal summary showing exactly how many tokens you avoided paying for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════
   pytest-mockllm ROI Summary
═══════════════════════════════════════════════════════
   Tests Run:        47
   API Calls Mocked: 312
   Tokens Saved:     847,291
   Estimated Cost:   $12.71 (at GPT-4 pricing)
═══════════════════════════════════════════════════════
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
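&lt;p&gt;The cost line is plain arithmetic over the saved token count. Assuming a blended rate of $15 per million tokens (the pricing constant is an inference from the sample summary, not a documented value), the math works out as:&lt;/p&gt;

```python
def estimated_cost(tokens_saved, usd_per_million=15.0):
    # Cost avoided = tokens * price per token (assumed blended GPT-4 rate).
    return round(tokens_saved * usd_per_million / 1_000_000, 2)

print(estimated_cost(847_291))  # 12.71, matching the summary above
```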



&lt;h2&gt;
  
  
  PII Redaction by Default
&lt;/h2&gt;

&lt;p&gt;Security should never be an afterthought. We implemented a &lt;code&gt;PIIRedactor&lt;/code&gt; that automatically scrubs sensitive data &lt;strong&gt;before&lt;/strong&gt; the cassette is ever written to disk, ensuring zero leak risk.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;api_key&lt;/code&gt; and &lt;code&gt;sk-...&lt;/code&gt; strings&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Authorization: Bearer ...&lt;/code&gt; headers&lt;/li&gt;
&lt;li&gt;Sensitive parameters in request bodies&lt;/li&gt;
&lt;/ul&gt;
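&lt;p&gt;Covering the shapes above takes only a few regex substitutions applied before the cassette is serialized. These patterns are illustrative; the shipped &lt;code&gt;PIIRedactor&lt;/code&gt; rules are not reproduced here:&lt;/p&gt;

```python
import re

# Illustrative patterns only; the real PIIRedactor's rules differ.
PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]+"), "sk-REDACTED"),
    (re.compile(r"(Authorization: Bearer )\S+"), r"\1REDACTED"),
]

def redact(text):
    """Scrub known secret shapes before a cassette hits disk."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Authorization: Bearer abc123"))  # Authorization: Bearer REDACTED
print(redact("key=sk-live4f9x"))               # key=sk-REDACTED
```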

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Enterprise Ready&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Teams can now safely share VCR cassettes across repositories without security risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chaos Engineering for LLMs
&lt;/h2&gt;

&lt;p&gt;The real world is messy. Our new chaos tools allow you to simulate network jitter and random API refusals to ensure your retry logic and fallback systems are bulletproof.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pytest_mockllm import mock_openai, chaos

def test_retry_logic():
    with mock_openai() as mock:
        # Simulate rate limit on first 2 calls, then succeed
        mock.add_chaos(chaos.rate_limit(times=2))
        mock.set_response("Success after retry!")

        # Your retry logic should handle this gracefully
        response = call_with_retry(prompt="Hello")
        assert response == "Success after retry!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The First to Python 3.14
&lt;/h2&gt;

&lt;p&gt;We are proud to be one of the first AI testing tools to officially support and verify compatibility with &lt;strong&gt;Python 3.14&lt;/strong&gt;. We are building for the future, today.&lt;/p&gt;

&lt;p&gt;🎯&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcomes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero Flakiness&lt;/strong&gt;: True async support eliminated &lt;code&gt;TypeError&lt;/code&gt; and "coroutine not awaited" bugs in CI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Ready&lt;/strong&gt;: Secure recording allows teams to share cassettes without security risk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future Proof&lt;/strong&gt;: Full verification against Python 3.14 ensures the library is ready for the next decade of AI development&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -U pytest-mockllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/pytest-mockllm" rel="noopener noreferrer"&gt;pypi.org/project/pytest-mockllm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/godhiraj-code/pytest-mockllm" rel="noopener noreferrer"&gt;github.com/godhiraj-code/pytest-mockllm&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Built by Dhiraj Das&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Automation Architect. Making LLM testing as reliable as the AI systems you're building.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Stop Shipping "Zombie Tests": Introducing Project Vandal v0.2.0</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Sun, 21 Dec 2025 04:16:09 +0000</pubDate>
      <link>https://forem.com/godhirajcode/stop-shipping-zombie-tests-introducing-project-vandal-v020-5fg2</link>
      <guid>https://forem.com/godhirajcode/stop-shipping-zombie-tests-introducing-project-vandal-v020-5fg2</guid>
      <description>&lt;p&gt;🎯&lt;/p&gt;

&lt;h4&gt;
  
  
  What You'll Learn
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Zombie Test Problem&lt;/strong&gt;: Why passing tests can be more dangerous than failing ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime UI Mutation&lt;/strong&gt;: How Vandal sabotages the live DOM instead of rebuilding source code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow DOM Support&lt;/strong&gt;: Penetrate modern web components that hide from standard selectors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kill Ratio Metrics&lt;/strong&gt;: Quantify your test suite's actual resilience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick Integration&lt;/strong&gt;: Drop-in Playwright wrapper with zero test rewrites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Have you ever looked at a 100% green test suite and wondered: &lt;em&gt;"Is this actually testing anything, or is it just passing because the happy path hasn't changed?"&lt;/em&gt; In the world of Test Automation, we often suffer from &lt;strong&gt;test rot&lt;/strong&gt;—tests that remain green even when the application logic is broken. These are &lt;strong&gt;Zombie Tests&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Hidden Danger&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Zombie Tests give you a false sense of security. They are the reason bugs slip into production despite your massive automation suite.&lt;/p&gt;

&lt;p&gt;🎯&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Project Vandal?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vandal&lt;/strong&gt; is a deterministic chaos engineering tool for frontends. Unlike traditional mutation testing that modifies source code (slow and rebuild-heavy), Vandal sabotages the &lt;strong&gt;live DOM&lt;/strong&gt; inside your browser during test execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  That Moment
&lt;/h2&gt;

&lt;p&gt;Traditional tools change your &lt;code&gt;if&lt;/code&gt; statements to &lt;code&gt;else&lt;/code&gt; in React/Vue source. &lt;strong&gt;Vandal&lt;/strong&gt; changes the &lt;strong&gt;browser's reality&lt;/strong&gt;. It strips click listeners, shifts UI elements, and sabotages form state &lt;em&gt;while the test is running&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Vandal v0.2.0: What's New?
&lt;/h2&gt;

&lt;p&gt;We've packed the v0.2.0 release with enterprise-grade features designed for high-scale apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Persistent Chaos (Navigation Survival) ⚓
&lt;/h2&gt;

&lt;p&gt;The biggest challenge with UI mutation is navigation. Traditional scripts disappear on reload. Vandal v0.2.0 uses a combination of &lt;code&gt;add_init_script&lt;/code&gt; and a deep &lt;code&gt;MutationObserver&lt;/code&gt; to ensure your sabotages survive page reloads and transitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Recursive Shadow DOM Support 🕵️‍♂️
&lt;/h2&gt;

&lt;p&gt;Modern apps are built with Web Components. Vandal now recursively penetrates Shadow DOM boundaries, ensuring that even elements hidden inside multiple shadow roots can be targeted and vandalized.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Automatic Revert (Live Healing) 🩹
&lt;/h2&gt;

&lt;p&gt;Want to test a "broken" state and then "fix" it without reloading the page? Vandal v0.2.0 caches the original state of elements, allowing you to restore them on-the-fly with &lt;code&gt;await v.revert_mutation()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;💀&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Vandalism" Playbook
&lt;/h2&gt;

&lt;p&gt;Vandal comes with high-impact strategies designed to mimic real-world regressions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stealth Disable:&lt;/strong&gt; Sets &lt;code&gt;pointer-events: none&lt;/code&gt;. The button looks perfect, but it's "dead" to user interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI Shift:&lt;/strong&gt; Translates elements by 100px. Perfect for testing if your automation relies on hardcoded coordinates or if layout shifts break your assertions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow Load:&lt;/strong&gt; Simulates a 5-second UI hang by hiding elements temporarily. Does your test wait properly, or does it time out?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Sabotage:&lt;/strong&gt; Replaces critical labels and input values with junk data to verify your data-validation logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install project-vandal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Basic Usage
&lt;/h2&gt;

&lt;p&gt;Integrating Vandal into your existing Playwright tests is as simple as using it as an async context manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from vandal import Vandal

async def test_critical_path(page):
    async with Vandal(page) as v:
        # 1. Apply a persistent mutation
        await v.apply_mutation("stealth_disable", "#checkout-btn")

        # 2. Navigate - The mutation survives!
        await page.goto("https://myapp.com/cart")

        # 3. This SHOULD fail if your test is resilient
        try:
            await page.click("#checkout-btn", timeout=2000)
            print("🧟 MUTANT SURVIVED: Test is a Zombie!")
        except Exception:
            print("💀 MUTANT KILLED: Test is Robust.")

    # Generate a beautiful HTML report
    v.save_report("ci_resilience_report.html")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📊&lt;/p&gt;

&lt;h2&gt;
  
  
  Reporting: From Console to HTML
&lt;/h2&gt;

&lt;p&gt;Vandal v0.2.0 now exports structured &lt;strong&gt;JSON&lt;/strong&gt; and beautiful &lt;strong&gt;HTML reports&lt;/strong&gt;. No more digging through console logs. You get a visual scorecard of your test suite's effectiveness, ready for your CI/CD dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-Impact Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Gatekeeping:&lt;/strong&gt; Fail builds where more than 10% of UI mutants survive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow DOM Validation:&lt;/strong&gt; Finally test those elusive Web Components with confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assertion Benchmarking:&lt;/strong&gt; Quantify the "Kill Ratio" of your automation suite.&lt;/li&gt;
&lt;/ul&gt;
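&lt;p&gt;The Kill Ratio gate is a one-liner once you have a structured report. The result schema below is assumed for illustration; Vandal's actual JSON fields may differ:&lt;/p&gt;

```python
def kill_ratio(results):
    """Kill Ratio = killed mutants / total mutants (the "status" field
    is an assumed schema, not Vandal's documented report format)."""
    killed = sum(1 for r in results if r["status"] == "killed")
    return killed / len(results)

results = [
    {"mutation": "stealth_disable", "status": "killed"},
    {"mutation": "ui_shift", "status": "killed"},
    {"mutation": "data_sabotage", "status": "survived"},
    {"mutation": "slow_load", "status": "killed"},
]
ratio = kill_ratio(results)
print(f"Kill Ratio: {ratio:.0%}")  # Kill Ratio: 75%

# CI gate idea: fail the build when survival exceeds 10%.
survival = 1 - ratio
print("CI gate failed:", survival > 0.10)  # True: too many zombies
```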

&lt;p&gt;🤘&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the Vandalism Movement
&lt;/h2&gt;

&lt;p&gt;Stop counting lines of code coverage. Start measuring &lt;strong&gt;assertion effectiveness&lt;/strong&gt;. &lt;strong&gt;Project Vandal&lt;/strong&gt; is the tool that makes "green checkmarks" mean something again.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Project Vandal is an open-source initiative. Check it out on PyPI and start validating your test suite's resilience today.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>SQL for Automation Testers: Understand and Optimize Queries Without Being a DBA</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Wed, 17 Dec 2025 13:29:25 +0000</pubDate>
      <link>https://forem.com/godhirajcode/sql-for-automation-testers-understand-and-optimize-queries-without-being-a-dba-bnd</link>
      <guid>https://forem.com/godhirajcode/sql-for-automation-testers-understand-and-optimize-queries-without-being-a-dba-bnd</guid>
      <description>&lt;p&gt;🎯&lt;/p&gt;

&lt;h4&gt;
  
  
  What You'll Learn
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why SQL matters for testers&lt;/strong&gt;: Database validation is part of modern test automation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The struggle is real&lt;/strong&gt;: Most testers copy-paste SQL without truly understanding it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plain English explanations&lt;/strong&gt;: Our tool translates SQL into human-readable descriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization made simple&lt;/strong&gt;: Get actionable suggestions without studying query plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try it yourself&lt;/strong&gt;: Interactive tool available right here &lt;a href="https://www.dhirajdas.dev/sql-optimizer" rel="noopener noreferrer"&gt;sql optimizer&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You're an automation tester. You write Selenium scripts, API tests, maybe some Appium for mobile. Then one day, your lead says: 'We need to validate the database state after this flow. Here's the query.'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT u.name, COUNT(*) as order_count, SUM(o.total) as total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE UPPER(u.status) = 'ACTIVE' OR u.role = 'admin' OR u.role = 'moderator'
GROUP BY u.name
ORDER BY total_spent DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You stare at it. You've seen SQL before—SELECT, FROM, WHERE—but this has JOIN, GROUP BY, COUNT, SUM, DISTINCT... and what's that UPPER function doing? Is this query even efficient? Will it timeout on production data?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sound Familiar?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You're not alone. Most automation testers learned SQL 'on the job' through copy-pasting and trial-and-error. There's no shame in it—we can't be experts in everything. But there should be a tool that helps us understand what we're working with.&lt;/p&gt;

&lt;p&gt;🎯&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SQL Matters for Automation Testers
&lt;/h2&gt;

&lt;p&gt;Modern test automation isn't just clicking buttons and checking text. Real-world testing often requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test data setup&lt;/strong&gt;: Inserting users, products, or orders before tests run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State verification&lt;/strong&gt;: Confirming database records after API calls or UI actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data cleanup&lt;/strong&gt;: Removing test data to keep environments consistent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance testing&lt;/strong&gt;: Understanding why database operations are slow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging failures&lt;/strong&gt;: Checking what data actually exists when tests fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your tests interact with a database (and most non-trivial applications have one), you WILL encounter SQL. The question is: do you understand what it's doing?&lt;/p&gt;
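&lt;p&gt;All five tasks reduce to a few lines of driver code. Here is a self-contained sketch against an in-memory SQLite database; swap in your real connection, table, and column names:&lt;/p&gt;

```python
import sqlite3

# Toy end-to-end flow: seed test data, then verify and clean DB state.
# Uses an in-memory SQLite DB; a real suite would connect to its test DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total REAL)"
)

# Test data setup
conn.execute("INSERT INTO orders (user_id, total) VALUES (7, 49.99)")
conn.commit()

# State verification after the flow under test
row = conn.execute(
    "SELECT COUNT(*), SUM(total) FROM orders WHERE user_id = ?", (7,)
).fetchone()
assert row == (1, 49.99), f"Unexpected DB state: {row}"

# Data cleanup so the next test starts from a known state
conn.execute("DELETE FROM orders WHERE user_id = ?", (7,))
conn.commit()
print("DB state verified and cleaned up")
```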

&lt;p&gt;😰&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Copy-Paste DBA" Trap
&lt;/h2&gt;

&lt;p&gt;Here's the typical automation tester's SQL journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1&lt;/strong&gt;: Need data? Ask a developer or DBA for the query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2&lt;/strong&gt;: Copy-paste the query into your test framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 3&lt;/strong&gt;: It works! Ship it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 4&lt;/strong&gt;: Query times out in staging (where there's more data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 5&lt;/strong&gt;: Panic. Ask the DBA again. Get a 'fixed' query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 6&lt;/strong&gt;: Repeat forever.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works... until it doesn't. What if you need to modify the query? What if you need to write a new one? What if the DBA is on vacation?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Real Problem&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's not that you can't learn SQL—it's that you don't have time to study query optimization theory. You just need to understand THIS query, right now, so you can do your job.&lt;/p&gt;

&lt;p&gt;💡&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing the SQL Query Optimizer Tool
&lt;/h2&gt;

&lt;p&gt;I built a tool specifically for people like us—automation testers, QA engineers, and developers who work with SQL but aren't database administrators. It answers two simple questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What does this query actually DO?&lt;/strong&gt; — Explained in plain English, not SQL jargon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is there anything wrong with it?&lt;/strong&gt; — Optimization suggestions with clear explanations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📖&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature 1: Plain English Explanations
&lt;/h2&gt;

&lt;p&gt;Let's take that scary query from the beginning and paste it into the tool. Here's what you get:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💬 In Plain English&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This query gets u.name, a count, a total sum from the 'users' table combined with data from 'orders', but only for records that match certain conditions, grouped by u.name, sorted by total_spent (highest first) — removing any duplicates.&lt;/p&gt;

&lt;p&gt;Suddenly it makes sense! It's getting user names with their order statistics, filtering by active status or specific roles, grouping the counts per user, and sorting by who spent the most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step Breakdown
&lt;/h2&gt;

&lt;p&gt;Beyond the summary, the tool breaks down each part of the query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Selecting Specific Data&lt;/strong&gt;: The query calculates statistics (COUNT, SUM) — it's asking 'How many?' and 'What's the total?' rather than listing individual items.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 Data Source&lt;/strong&gt;: Data is pulled from 2 tables: users, orders. The query combines information from both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔗 Connecting to 'orders'&lt;/strong&gt;: Shows ALL records from the main table, even if there's no matching data in 'orders'. Users without orders will still appear, but with empty order info.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔎 Filtering Results&lt;/strong&gt;: The query filters results to only include records that meet certain criteria. It looks for exact matches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📦 Grouping Data&lt;/strong&gt;: Instead of showing individual records, the query combines them into groups based on 'u.name'. It's like summarizing sales by month instead of listing every sale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📈 Sorting Results&lt;/strong&gt;: Results are sorted from highest to lowest. The biggest values appear first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚡&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature 2: Optimization Suggestions
&lt;/h2&gt;

&lt;p&gt;The tool analyzes your query and finds potential problems. For our example query, it catches several issues:&lt;/p&gt;

&lt;h2&gt;
  
  
  🔴 Critical: Function on Column in WHERE
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WHERE UPPER(u.status) = 'ACTIVE'  -- ❌ This is a problem!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool explains: 'Applying functions to columns in WHERE clause prevents index usage. The database must scan every row and apply the function before filtering.' In other words: this query will be SLOW on large tables because the database can't use its shortcuts (indexes).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Fix&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Store the status in a consistent case in the database, or compare against the stored case directly: WHERE u.status = 'ACTIVE'. Either way, the function comes off the column and the database can use its index again.&lt;/p&gt;
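&lt;p&gt;You can see this effect yourself without a production database. A minimal sketch using Python's built-in sqlite3 (schema invented for the demo): EXPLAIN QUERY PLAN reports an index seek for the plain comparison, but a full table scan once the column is wrapped in a function:&lt;/p&gt;

```python
import sqlite3

# Tiny stand-in schema: a users table with an index on status.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE INDEX idx_users_status ON users (status)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the access strategy.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

indexed = plan("SELECT * FROM users WHERE status = 'ACTIVE'")
wrapped = plan("SELECT * FROM users WHERE UPPER(status) = 'ACTIVE'")

assert "SEARCH" in indexed  # index seek: jumps straight to matching rows
assert "SCAN" in wrapped    # full scan: reads every row, then applies UPPER()
```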

&lt;h2&gt;
  
  
  🟡 Warning: OR Conditions → Consider IN Clause
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WHERE ... u.role = 'admin' OR u.role = 'moderator'
-- Better as:
WHERE ... u.role IN ('admin', 'moderator')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multiple OR conditions on the same column are harder to read and sometimes slower. The IN clause is cleaner and often faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  🟡 Warning: DISTINCT May Be Expensive
&lt;/h2&gt;

&lt;p&gt;DISTINCT requires sorting or hashing ALL results to remove duplicates. If your query returns millions of rows, this is memory-intensive. The tool suggests: 'Ensure DISTINCT is truly needed. Consider if proper JOINs or GROUP BY could eliminate duplicate sources.'&lt;/p&gt;

&lt;h2&gt;
  
  
  🟡 Warning: ORDER BY Without LIMIT
&lt;/h2&gt;

&lt;p&gt;Sorting millions of rows is expensive. If you only need the top 10 spenders, add LIMIT 10 and the database can optimize significantly.&lt;/p&gt;

&lt;p&gt;🛠️&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature 3: Optimized Query Output
&lt;/h2&gt;

&lt;p&gt;The tool generates an improved version of your query with suggestions applied. You can copy it directly and use it in your tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT u.name, COUNT(*) as order_count, SUM(o.total) as total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE UPPER(u.status) = 'ACTIVE' OR u.role = 'admin' OR u.role = 'moderator'
GROUP BY u.name
ORDER BY total_spent DESC
LIMIT 1000;  -- Added for safety
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🎮&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The SQL Query Optimizer is free to use in your browser. No signup, no ads, no tracking—just paste your query and get instant explanations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🚀 Use the Tool Now&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.dhirajdas.dev/sql-optimizer" rel="noopener noreferrer"&gt;sql optimizer&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Sample Queries Included
&lt;/h2&gt;

&lt;p&gt;Don't have a query handy? The tool includes sample queries to explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SELECT * Query&lt;/strong&gt;: See why selecting all columns is problematic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JOIN Query&lt;/strong&gt;: Understand how tables are combined&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subquery&lt;/strong&gt;: Learn about nested queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Query&lt;/strong&gt;: The full example from this article&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redshift Query&lt;/strong&gt;: PostgreSQL/Redshift specific patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔧&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works (Under the Hood)
&lt;/h2&gt;

&lt;p&gt;For the curious, here's how the tool analyzes queries without an actual database connection:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Query Parsing
&lt;/h2&gt;

&lt;p&gt;The tool tokenizes your SQL and identifies key components: query type (SELECT/INSERT/UPDATE/DELETE), tables, columns, joins, conditions, grouping, and ordering.&lt;/p&gt;
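&lt;p&gt;The clause-detection step can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the tool's actual parser:&lt;/p&gt;

```python
import re

def parse_sql(sql):
    """A simplified sketch of clause detection -- not a full SQL parser."""
    s = " ".join(sql.split())  # normalize whitespace
    info = {
        "type": s.split()[0].upper(),
        "tables": re.findall(r"\bFROM\s+(\w+)|\bJOIN\s+(\w+)", s, re.I),
        "has_group_by": bool(re.search(r"\bGROUP\s+BY\b", s, re.I)),
        "has_order_by": bool(re.search(r"\bORDER\s+BY\b", s, re.I)),
    }
    # findall returns (from, join) tuple pairs; flatten into a plain table list.
    info["tables"] = [t for pair in info["tables"] for t in pair if t]
    return info

q = "SELECT u.name FROM users u LEFT JOIN orders o ON u.id = o.user_id GROUP BY u.name"
parsed = parse_sql(q)
assert parsed["type"] == "SELECT"
assert parsed["tables"] == ["users", "orders"]
assert parsed["has_group_by"] and not parsed["has_order_by"]
```

A production parser has to handle subqueries, quoted identifiers, and dialect quirks, but the shape of the analysis is the same: tokenize, then classify each clause.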

&lt;h2&gt;
  
  
  2. Pattern Recognition
&lt;/h2&gt;

&lt;p&gt;Using rule-based analysis, it detects common anti-patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SELECT * (almost always wasteful — it fetches columns you don't need)&lt;/li&gt;
&lt;li&gt;Functions on columns in WHERE (non-SARGable)&lt;/li&gt;
&lt;li&gt;Multiple OR conditions (often replaceable with IN)&lt;/li&gt;
&lt;li&gt;DISTINCT without clear necessity&lt;/li&gt;
&lt;li&gt;ORDER BY without LIMIT&lt;/li&gt;
&lt;li&gt;Missing table aliases&lt;/li&gt;
&lt;li&gt;Subqueries that could be JOINs&lt;/li&gt;
&lt;/ul&gt;
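&lt;p&gt;The anti-pattern checks above boil down to rules matched against the normalized query text. A simplified, regex-based sketch of a few of them — not the tool's actual implementation:&lt;/p&gt;

```python
import re

# Rule-based checks mirroring the list above; a minimal sketch, not the tool's code.
RULES = [
    (r"SELECT\s+\*", "Avoid SELECT * -- name only the columns you need"),
    (r"WHERE\s+\w+\s*\(", "Function on a column in WHERE prevents index usage"),
    (r"=\s*'[^']*'\s+OR\s+\w+(\.\w+)?\s*=", "Multiple ORs on one column -- consider IN (...)"),
    (r"ORDER\s+BY(?!.*\bLIMIT\b)", "ORDER BY without LIMIT sorts the full result set"),
]

def lint(sql):
    flat = " ".join(sql.split())  # normalize whitespace before matching
    return [msg for pattern, msg in RULES if re.search(pattern, flat, re.I)]

issues = lint("SELECT * FROM users WHERE UPPER(status) = 'ACTIVE' ORDER BY name")
assert len(issues) == 3  # SELECT *, function in WHERE, ORDER BY without LIMIT
```

Real detection needs a proper parse tree to avoid false positives (e.g. a `*` inside COUNT(*)), but rule tables like this are how most linters start.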

&lt;h2&gt;
  
  
  3. Plain English Generation
&lt;/h2&gt;

&lt;p&gt;The tool constructs human-readable explanations by analyzing what each clause does and translating it into everyday language. No jargon, no assumed knowledge.&lt;/p&gt;

&lt;p&gt;📚&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Concepts Every Tester Should Know
&lt;/h2&gt;

&lt;p&gt;While using the tool, you'll naturally learn these key concepts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Concept     | What It Means                                        |
|-------------|------------------------------------------------------|
| SELECT      | What data you want to retrieve                       |
| FROM        | Which table(s) contains the data                     |
| WHERE       | Filter conditions (like "status = 'active'")         |
| JOIN        | Combining data from multiple tables                  |
| GROUP BY    | Aggregate rows into summaries (with COUNT, SUM, etc.)|
| ORDER BY    | Sort the results                                     |
| LIMIT       | Return only N rows                                   |
| DISTINCT    | Remove duplicate rows                                |
| INDEX       | Database "shortcut" for faster lookups               |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You don't need to become a DBA to work with databases effectively. You just need to understand what your queries are doing and whether they have obvious problems.&lt;/p&gt;

&lt;p&gt;The SQL Query Optimizer tool gives you that understanding in seconds—no database theory required. Paste a query, read the plain English explanation, fix the highlighted issues, and move on with your testing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Next time someone hands you a SQL query and asks 'Does this look right?', you'll actually know. Not because you memorized query optimization theory, but because you have a tool that explains it in language you understand.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;One Final Note&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While this tool is here to help you understand and optimize queries quickly, there's no substitute for actually learning SQL fundamentals. Use the tool as a learning aid, not a crutch. Over time, you'll find yourself needing it less—and that's the goal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try it&lt;/strong&gt;: &lt;a href="https://www.dhirajdas.dev/sql-optimizer" rel="noopener noreferrer"&gt;SQL Query Optimizer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built by&lt;/strong&gt;: Dhiraj Das — Automation Architect who believes testing tools should be accessible to everyone&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Why Your Selenium Tests Are Flaky (And How to Fix Them Forever)</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Mon, 15 Dec 2025 13:08:21 +0000</pubDate>
      <link>https://forem.com/godhirajcode/why-your-selenium-tests-are-flaky-and-how-to-fix-them-forever-55ad</link>
      <guid>https://forem.com/godhirajcode/why-your-selenium-tests-are-flaky-and-how-to-fix-them-forever-55ad</guid>
      <description>&lt;p&gt;🎯&lt;/p&gt;

&lt;h4&gt;
  
  
  What This Article Covers
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Flakiness Problem&lt;/strong&gt;: Why time.sleep() and WebDriverWait aren't enough&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What Causes Flaky Tests&lt;/strong&gt;: Racing against UI state changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Stability Solution&lt;/strong&gt;: Monitoring DOM, network, animations, and layout shifts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-Line Integration&lt;/strong&gt;: Wrap your driver with stabilize() — zero test rewrites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full Diagnostics&lt;/strong&gt;: Know exactly why tests are blocked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've worked with Selenium for more than a week, you've written code like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver.get("https://myapp.com/dashboard")
time.sleep(2)  # Wait for page to load
driver.find_element(By.ID, "submit-btn").click()
time.sleep(1)  # Wait for AJAX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you've felt the shame of knowing it's wrong—but also the relief of "it works." Until it doesn't. Until the CI server is 10% slower than your machine, and suddenly your tests fail 20% of the time.&lt;/p&gt;

&lt;p&gt;This is the story of &lt;strong&gt;flaky tests&lt;/strong&gt;, why they happen, and how I built a library called &lt;strong&gt;waitless&lt;/strong&gt; to eliminate them.&lt;/p&gt;

&lt;p&gt;⚠️&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flakiness Problem
&lt;/h2&gt;

&lt;p&gt;Let me show you a real scenario. You have a React dashboard. User clicks a button. The button triggers an API call. The API returns data. React re-renders the component. A spinner disappears. A table appears.&lt;/p&gt;

&lt;p&gt;This entire sequence takes maybe 400ms. But your test does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;button = driver.find_element(By.ID, "load-data")
button.click()
table = driver.find_element(By.ID, "data-table")  # 💥 BOOM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table doesn't exist yet. React is still fetching. Selenium throws NoSuchElementException.&lt;/p&gt;

&lt;p&gt;So you "fix" it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;button.click()
time.sleep(2)
table = driver.find_element(By.ID, "data-table")  # Works... usually
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Problem with time.sleep()&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Congratulations. You've just made your test: 1) 2 seconds slower than necessary, 2) Still flaky when the API takes 2.5 seconds, 3) Impossible to debug when it fails.&lt;/p&gt;

&lt;p&gt;❌&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Solutions Don't Work
&lt;/h2&gt;

&lt;h2&gt;
  
  
  time.sleep() — The Naive Approach
&lt;/h2&gt;

&lt;p&gt;Sleep for a fixed duration and hope the UI is ready. &lt;strong&gt;Problems:&lt;/strong&gt; Too short → test fails. Too long → test suite takes forever. No feedback on what's actually happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  WebDriverWait — The "Correct" Approach
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "submit-btn"))
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is better. You're waiting for a specific condition. But here's the dirty secret: &lt;strong&gt;it only checks one element&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What about the modal that's still animating into view?&lt;/li&gt;
&lt;li&gt;What about the AJAX request that hasn't finished?&lt;/li&gt;
&lt;li&gt;What about the React re-render that's about to move your button?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WebDriverWait says "the button is clickable." Reality says "there's an invisible overlay from an animation that will intercept your click."&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry Decorators — The Denial Approach
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@retry(tries=3, delay=1)
def test_dashboard():
    driver.find_element(By.ID, "submit-btn").click()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the equivalent of saying "I know my code is broken, but if I run it enough times, it'll eventually work." Retries don't fix flakiness. They hide it. &lt;/p&gt;

&lt;p&gt;🔍&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Causes Flaky Tests?
&lt;/h2&gt;

&lt;p&gt;After debugging hundreds of flaky tests, I found they all come down to &lt;strong&gt;racing against the UI&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| What You Do              | What's Actually Happening           |
|--------------------------|-------------------------------------|
| Click a button           | DOM is being mutated by framework   |
| Assert text content      | AJAX response still in flight       |
| Interact with modal      | CSS transition still animating      |
| Click navigation link    | Layout shift moves element          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Real Question&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The question isn't "is this element clickable?" The question is: &lt;strong&gt;"Is the entire page stable and ready for interaction?"&lt;/strong&gt; That's what I set out to answer with waitless.&lt;/p&gt;

&lt;p&gt;✨&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining "Stability"
&lt;/h2&gt;

&lt;p&gt;What does it mean for a UI to be "stable"? I identified four key signals:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. DOM Stability
&lt;/h2&gt;

&lt;p&gt;The DOM structure has stopped changing. No elements being added, removed, or modified. &lt;strong&gt;How to detect:&lt;/strong&gt; MutationObserver watching the document root. Track time since last mutation.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Network Idle
&lt;/h2&gt;

&lt;p&gt;All AJAX requests have completed. No pending API calls. &lt;strong&gt;How to detect:&lt;/strong&gt; Intercept fetch() and XMLHttpRequest. Count pending requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Animation Complete
&lt;/h2&gt;

&lt;p&gt;All CSS animations and transitions have finished. &lt;strong&gt;How to detect:&lt;/strong&gt; Listen for animationstart, animationend, transitionstart, transitionend events.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Layout Stable
&lt;/h2&gt;

&lt;p&gt;Elements have stopped moving. No more layout shifts. &lt;strong&gt;How to detect:&lt;/strong&gt; Track bounding box positions of interactive elements. Compare over time.&lt;/p&gt;
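&lt;p&gt;The compare-over-time idea can be sketched in a few lines. Here the sampler callable stands in for real getBoundingClientRect() data pulled from the browser; the function name and signature are illustrative, not waitless's API:&lt;/p&gt;

```python
import time

def rects_stable(sample, interval=0.1, samples=2):
    """Layout-stability sketch: bounding boxes from consecutive samples are
    compared; the layout is stable when nothing has moved between samples.
    `sample` is a callable returning {element_id: (x, y, width, height)}."""
    previous = sample()
    for _ in range(samples):
        time.sleep(interval)
        current = sample()
        if current != previous:
            return False  # something shifted between samples
        previous = current
    return True

# A fake sampler standing in for real element geometry: nothing moves.
frames = iter([{"btn": (10, 20, 100, 30)}] * 3)
assert rects_stable(lambda: next(frames), interval=0) is True
```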

&lt;p&gt;🏗️&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Waitless has two parts:&lt;/p&gt;

&lt;h2&gt;
  
  
  JavaScript Instrumentation (runs in browser)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;window.__waitless__ = {
    pendingRequests: 0,
    lastMutationTime: Date.now(),
    activeAnimations: 0,

    isStable() {
        if (this.pendingRequests &amp;gt; 0) return false;
        if (this.activeAnimations &amp;gt; 0) return false;
        if (Date.now() - this.lastMutationTime &amp;lt; 100) return false;
        return true;
    }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script is injected into the page via execute_script(). It monitors everything happening in the browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Engine (evaluates stability)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class StabilizationEngine:
    def wait_for_stability(self):
        """Waits until all stability signals are satisfied."""
        # Checks performed automatically:
        # ✓ DOM mutations have settled
        # ✓ Network requests completed
        # ✓ Animations finished
        # ✓ Layout is stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python engine continuously evaluates browser state until all configured stability signals indicate the page is ready for interaction.&lt;/p&gt;
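&lt;p&gt;That evaluation loop is a straightforward poll-until-deadline. In this sketch the check callable stands in for the execute_script() call into the injected instrumentation; the names are illustrative rather than waitless's actual internals:&lt;/p&gt;

```python
import time

class StabilizationTimeout(Exception):
    """Raised when the page never reaches a stable state in time."""

def wait_for_stability(check, timeout=10.0, poll_interval=0.05):
    """Polls a stability check until it reports True or the timeout expires.
    In the real engine, `check` would be driver.execute_script(
    "return window.__waitless__.isStable()")."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(poll_interval)
    raise StabilizationTimeout(f"Page not stable within {timeout}s")

# Fake browser state: unstable for the first two polls, then stable.
states = iter([False, False, True])
assert wait_for_stability(lambda: next(states), timeout=1.0, poll_interval=0) is True
```

Using time.monotonic() rather than time.time() keeps the deadline immune to wall-clock adjustments mid-test.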

&lt;p&gt;🪄&lt;/p&gt;

&lt;h2&gt;
  
  
  The Magic: One-Line Integration
&lt;/h2&gt;

&lt;p&gt;The key design goal was &lt;strong&gt;zero test modifications&lt;/strong&gt;. Adding stability detection should require changing ONE line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from waitless import stabilize

driver = webdriver.Chrome()
driver = stabilize(driver)  # ← This is the only change

# All your existing tests work as-is
driver.find_element(By.ID, "button").click()  # Now auto-waits!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How does this work? The stabilize() function wraps the driver in a StabilizedWebDriver that intercepts find_element() calls. Retrieved elements are wrapped in StabilizedWebElement. When you call .click(), it first waits for stability, then clicks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class StabilizedWebElement:
    def click(self):
        self._engine.wait_for_stability()  # Auto-wait!
        return self._element.click()  # Then click
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
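&lt;p&gt;The full interception pattern is only a pair of thin wrapper classes. A minimal sketch with fake objects standing in for the real driver and engine — class names follow the article, but the internals here are illustrative:&lt;/p&gt;

```python
class StabilizedWebElement:
    """Wraps an element so every interaction first waits for page stability."""

    def __init__(self, element, engine):
        self._element = element
        self._engine = engine

    def click(self):
        self._engine.wait_for_stability()  # auto-wait before interacting
        return self._element.click()

    def __getattr__(self, name):
        # Anything not overridden passes through to the wrapped element.
        return getattr(self._element, name)


class StabilizedWebDriver:
    """Wraps the driver so retrieved elements come back already wrapped."""

    def __init__(self, driver, engine):
        self._driver = driver
        self._engine = engine

    def find_element(self, *args):
        element = self._driver.find_element(*args)
        return StabilizedWebElement(element, self._engine)

    def __getattr__(self, name):
        return getattr(self._driver, name)


# Fakes standing in for the real engine/driver to show the interception order.
class FakeEngine:
    def __init__(self):
        self.waits = 0
    def wait_for_stability(self):
        self.waits += 1

class FakeElement:
    def click(self):
        return "clicked"

class FakeDriver:
    def find_element(self, *args):
        return FakeElement()

engine = FakeEngine()
driver = StabilizedWebDriver(FakeDriver(), engine)
assert driver.find_element("id", "submit-btn").click() == "clicked"
assert engine.waits == 1  # stability was verified before the click
```

The __getattr__ pass-through is what makes the wrapper transparent: any driver or element API the wrapper doesn't intercept behaves exactly as before.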



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Zero Rewrites Required&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your tests don't know they're waiting. They just... stop failing.&lt;/p&gt;

&lt;p&gt;🔧&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Edge Cases
&lt;/h2&gt;

&lt;p&gt;Real apps aren't simple. Here's how waitless handles the messy reality:&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem: Infinite Animations
&lt;/h2&gt;

&lt;p&gt;Some apps have spinners that rotate forever. Analytics scripts that poll constantly. WebSocket heartbeats that never stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Configurable thresholds&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from waitless import StabilizationConfig

config = StabilizationConfig(
    network_idle_threshold=2,  # Allow 2 pending requests
    animation_detection=False,  # Ignore spinners
    strictness='relaxed'        # Only check DOM mutations
)

driver = stabilize(driver, config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Problem: Navigation Destroys Instrumentation
&lt;/h2&gt;

&lt;p&gt;Single-page apps remake the DOM on route changes. The injected JavaScript disappears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Re-validation before each wait&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def wait_for_stability(self):
    if not self._is_instrumentation_alive():
        self._inject_instrumentation()  # Re-inject if gone
    # Then wait...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;📊&lt;/p&gt;

&lt;h2&gt;
  
  
  Diagnostics: The Secret Weapon
&lt;/h2&gt;

&lt;p&gt;When tests still fail, &lt;strong&gt;understanding why&lt;/strong&gt; is half the battle. Waitless includes a diagnostic system that explains exactly what's blocking stability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╔═════════════════════════════════════════════════════════════╗
║              WAITLESS STABILITY REPORT                      ║
╠═════════════════════════════════════════════════════════════╣
║ Timeout: 10.0s                                              ║
║                                                             ║
║ BLOCKING FACTORS:                                           ║
║   ⚠ NETWORK: 2 request(s) still pending                    ║
║   → GET /api/users (started 2.3s ago)                       ║
║   → POST /analytics (started 1.1s ago)                      ║
║                                                             ║
║   ⚠ ANIMATIONS: 1 active animation(s)                      ║
║   → .spinner { animation: rotate 1s infinite }              ║
║                                                             ║
╠═════════════════════════════════════════════════════════════╣
║ SUGGESTIONS:                                                ║
║   1. /api/users is slow. Consider mocking in tests.         ║
║   2. Spinner has infinite animation. Set                    ║
║      animation_detection=False                              ║
╚═════════════════════════════════════════════════════════════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just "test failed." It's "test failed because your analytics endpoint is slow, and here's exactly how to fix it."&lt;/p&gt;

&lt;p&gt;📈&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Here's what changes when you adopt waitless:&lt;/p&gt;

&lt;h2&gt;
  
  
  Before
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver.get("https://myapp.com")
time.sleep(2)
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "login-btn"))
)
driver.find_element(By.ID, "login-btn").click()
time.sleep(1)
driver.find_element(By.ID, "username").send_keys("user")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  After
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver = stabilize(driver)
driver.get("https://myapp.com")
driver.find_element(By.ID, "login-btn").click()
driver.find_element(By.ID, "username").send_keys("user")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Metric               | Before               | After           |
|----------------------|----------------------|-----------------|
| Lines of wait code   | 4+ per test          | 1 total         |
| Arbitrary delays     | 3+ seconds           | 0               |
| Flaky failures       | Common               | Rare            |
| Debug information    | "Element not found"  | Full stability report |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🎭&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use Playwright?
&lt;/h2&gt;

&lt;p&gt;Playwright has auto-waiting built in. It's great! But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Migration cost&lt;/strong&gt; — You have 10,000 Selenium tests. Rewriting isn't an option.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework lock-in&lt;/strong&gt; — Playwright auto-wait is Playwright-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different approach&lt;/strong&gt; — Playwright waits for element actionability. Waitless waits for page-wide stability.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Best of Both Worlds&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Waitless gives Selenium users the reliability of Playwright without the rewrite.&lt;/p&gt;

&lt;p&gt;⚠️&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Limitations (v0.2.0)
&lt;/h2&gt;

&lt;p&gt;Being honest about what doesn't work yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Selenium only&lt;/strong&gt; — Playwright integration planned for v1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sync only&lt;/strong&gt; — No async/await support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main frame only&lt;/strong&gt; — iframes not monitored&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Shadow DOM&lt;/strong&gt; — MutationObserver can't see shadow roots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome-focused&lt;/strong&gt; — Tested primarily on Chromium&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These will be addressed in future versions — contributions welcome!&lt;/p&gt;

&lt;p&gt;🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install waitless
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from waitless import stabilize

driver = webdriver.Chrome()
driver = stabilize(driver)

# Your tests are now stable
driver.get("https://your-app.com")
driver.find_element(By.ID, "button").click()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. Zero test rewrites. No more flaky failures.&lt;/p&gt;

&lt;p&gt;✅&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Flaky tests are a symptom of racing against UI state. The solution isn't longer sleeps or more retries—it's understanding when the UI is truly stable.&lt;/p&gt;

&lt;p&gt;Waitless monitors DOM mutations, network requests, animations, and layout shifts to answer one question: "Is this page ready for interaction?"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Bottom Line&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your tests should be deterministic. Your CI should be green. And you should never write time.sleep() again.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/waitless" rel="noopener noreferrer"&gt;pypi.org/project/waitless&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/godhiraj-code/waitless" rel="noopener noreferrer"&gt;github.com/godhiraj-code/waitless&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Built by Dhiraj Das&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Automation Architect. Making Selenium tests deterministic, one at a time.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Mastering Prompt Engineering for Automation Testers</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Sun, 14 Dec 2025 14:11:15 +0000</pubDate>
      <link>https://forem.com/godhirajcode/mastering-prompt-engineering-for-automation-testers-1anh</link>
      <guid>https://forem.com/godhirajcode/mastering-prompt-engineering-for-automation-testers-1anh</guid>
      <description>&lt;p&gt;🎯&lt;/p&gt;

&lt;h4&gt;
  
  
  What You'll Master
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CTCO Framework&lt;/strong&gt;: Context, Task, Constraints, Output — the foundation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Patterns&lt;/strong&gt;: Chain-of-Thought, Few-Shot, and Role-Playing prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7+ Practical Examples&lt;/strong&gt;: From locators to debugging to test data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-Patterns&lt;/strong&gt;: Common mistakes that waste tokens and time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 10x Multiplier&lt;/strong&gt;: Why prompt engineering is the defining skill for modern SDETs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the age of AI, the quality of our output is directly proportional to the quality of our input. This concept, often called 'Garbage In, Garbage Out', is the cornerstone of effective interaction with Large Language Models (LLMs). For automation testers, mastering prompt engineering is not just a nice-to-have skill; it's a superpower that can 10x our productivity.&lt;/p&gt;

&lt;p&gt;This isn't about asking ChatGPT to 'write a test'. It's about architecting your prompts so precisely that the AI becomes an extension of your engineering mind — generating production-ready code, uncovering edge cases you missed, and debugging failures faster than you could manually.&lt;/p&gt;

&lt;p&gt;✨&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: The CTCO Framework — Your Foundation
&lt;/h2&gt;

&lt;p&gt;A vague request like 'Write a test' will yield a generic result. To get production-ready code, our prompt needs structure. Think of it as &lt;strong&gt;CTCO&lt;/strong&gt;: Context, Task, Constraints, and Output. This framework is the difference between getting 'something that works' and 'exactly what you need'.&lt;/p&gt;
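&lt;p&gt;Once you define the four parts, assembling them is mechanical. A small helper (illustrative, not a formal spec — the section labels are my own) makes the structure reusable across prompts:&lt;/p&gt;

```python
def build_prompt(context, task, constraints, output):
    """Assemble a CTCO prompt; section labels here are illustrative."""
    return "\n\n".join([
        f"## Context\n{context}",
        f"## Task\n{task}",
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints),
        f"## Output\n{output}",
    ])

prompt = build_prompt(
    context="You are a Senior SDET specializing in Selenium and pytest.",
    task="Write test_login_with_valid_credentials for the /login page.",
    constraints=["Use explicit waits only", "Follow PEP8", "No hard-coded sleeps"],
    output="A single, runnable pytest file with no commentary.",
)
assert "## Constraints" in prompt and "- Follow PEP8" in prompt
```

Templating prompts this way also makes them reviewable and version-controllable, just like test code.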

&lt;h2&gt;
  
  
  C — Context: Set the Stage
&lt;/h2&gt;

&lt;p&gt;Context tells the AI who it should be and what domain expertise it needs. This conditions the model toward the relevant knowledge patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Weak Context
"You are a helpful assistant."

// Strong Context for Automation
"You are a Senior SDET with 8+ years of experience in Python automation.
You specialize in Selenium WebDriver, pytest, and API testing with requests.
You follow PEP8 strictly and believe in clean, maintainable code.
You have worked extensively with e-commerce and banking applications."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro Tip: Domain-Specific Context&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're testing a banking app, mention it! The AI will generate assertions for things like 'account balance should not go negative' or 'transaction IDs should be unique'. Domain context unlocks domain knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  T — Task: Be Surgically Precise
&lt;/h2&gt;

&lt;p&gt;The task is WHAT you need. Ambiguity here leads to hallucinations and unusable output. Use the 'newspaper headline' test: could someone read your task and know exactly what deliverable to expect?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Vague Task
"Write a login test."

// Precise Task
"Write a pytest function 'test_login_with_valid_credentials' that:
1. Navigates to /login
2. Enters username 'standard_user' and password 'secret_sauce'
3. Clicks the login button
4. Asserts that the URL contains '/inventory' after login
5. Asserts that the shopping cart icon is visible"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  C — Constraints: The Guard Rails
&lt;/h2&gt;

&lt;p&gt;Constraints are the most underutilized part of prompt engineering. They tell the AI what NOT to do, which is often more powerful than telling it what to do. Well-defined constraints eliminate most of the 'almost correct but unusable' responses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Constraints for a Selenium Script
"CONSTRAINTS:
- Use WebDriverWait with explicit waits. NEVER use time.sleep() or implicit waits.
- Use CSS selectors as the primary locator strategy. XPath only as fallback.
- All locators must be defined as class constants at the top of the Page Object.
- Do not catch generic exceptions. Handle specific Selenium exceptions.
- All methods must have type hints and docstrings.
- Use the By class from selenium.webdriver.common.by, not string literals."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Power of Negative Constraints&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Saying 'Do NOT use Thread.sleep()' is more effective than 'Use explicit waits'. The AI strongly weights negative instructions. Use this to eliminate anti-patterns from generated code.&lt;/p&gt;

&lt;h2&gt;
  
  
  O — Output: Define the Deliverable
&lt;/h2&gt;

&lt;p&gt;Specify the exact format you need. This prevents the AI from adding unwanted explanations, incomplete snippets, or the wrong structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Output Specification Examples

// For Code Generation
"OUTPUT: Provide only the Python code. No explanations before or after.
Include inline comments for complex logic only."

// For Test Case Documentation
"OUTPUT: Return a markdown table with columns:
Test ID | Test Name | Preconditions | Steps | Expected Result | Priority"

// For Debugging
"OUTPUT: Return a JSON object with keys:
'root_cause', 'affected_components', 'fix_suggestion', 'confidence_score'"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🧠&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: Advanced Prompt Patterns
&lt;/h2&gt;

&lt;p&gt;Once you've mastered CTCO, level up with these advanced patterns that dramatically improve output quality for complex tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1: Chain-of-Thought (CoT) Prompting
&lt;/h2&gt;

&lt;p&gt;For complex reasoning tasks, ask the AI to think step-by-step before generating output. This reduces errors in multi-step logic and makes debugging easier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Before writing the code, think through:
1. What is the user flow being tested?
2. What elements need to be interacted with and in what order?
3. What could go wrong (exceptions to handle)?
4. What assertions prove the test passed?

Then provide the implementation."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pattern 2: Few-Shot Prompting
&lt;/h2&gt;

&lt;p&gt;Show the AI 2-3 examples of your desired input-output pairs. This 'teaches' the model your exact style and format preferences. Critical for consistency across a test suite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Generate a Page Object for the Checkout page following this pattern:

EXAMPLE 1 - Login Page:
class LoginPage:
    URL = '/login'
    USERNAME_INPUT = (By.CSS_SELECTOR, '[data-test="username"]')
    PASSWORD_INPUT = (By.CSS_SELECTOR, '[data-test="password"]')
    LOGIN_BTN = (By.CSS_SELECTOR, '[data-test="login-button"]')

    def __init__(self, driver):
        self.driver = driver
        self.wait = WebDriverWait(driver, 10)

    def login(self, username: str, password: str) -&amp;gt; None:
        self.wait.until(EC.visibility_of_element_located(self.USERNAME_INPUT)).send_keys(username)
        self.driver.find_element(*self.PASSWORD_INPUT).send_keys(password)
        self.driver.find_element(*self.LOGIN_BTN).click()

NOW generate for: Checkout Page with fields for First Name, Last Name, Zip Code, and buttons for Cancel and Continue."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pattern 3: Role-Playing Prompts
&lt;/h2&gt;

&lt;p&gt;Assign the AI a specific role with personality and expertise. This activates different 'modes' in the model. Useful for getting varied perspectives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// For Test Case Discovery
"You are a QA Architect who has broken into production systems for 15 years.
You think like a hacker. Given this login form, generate 10 edge cases
that most testers would miss. Focus on security, input validation,
and race conditions."

// For Code Review
"You are a tech lead reviewing a junior engineer's Selenium code.
Be constructive but thorough. Identify issues in: reliability,
maintainability, performance, and adherence to best practices.
Rate severity as Critical/Major/Minor."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pattern 4: Iterative Refinement
&lt;/h2&gt;

&lt;p&gt;Don't expect perfection in one shot. Design your prompts for conversation. Start broad, then narrow down with follow-up prompts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Round 1: Generate Structure
"Design the class structure for a Page Object pattern for an e-commerce site.
Just the class names and method signatures, no implementation yet."

// Round 2: Implement Core
"Now implement the ProductPage class with full locators and methods."

// Round 3: Add Edge Cases
"Add error handling for the case where a product is out of stock
and the Add to Cart button is disabled."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔧&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: Real-World Examples for Automation Testers
&lt;/h2&gt;

&lt;p&gt;Let's apply these patterns to the actual tasks you face daily. Each example shows a weak prompt, the improved version, and why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 1: Generating Robust Locators
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;❌ Weak Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Give me XPath for the login button.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Effective Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Given this HTML snippet:
&amp;lt;button class='btn btn-primary submit' id='login-btn-7829' data-testid='login-submit'&amp;gt;
  &amp;lt;span&amp;gt;Sign In&amp;lt;/span&amp;gt;
&amp;lt;/button&amp;gt;

Generate 3 locator strategies in order of reliability:
1. CSS Selector (preferred)
2. XPath (as backup)
3. Fallback strategy

CONSTRAINTS:
- Avoid dynamic IDs (like 'login-btn-7829')
- Prefer data-testid attributes
- XPath must be relative, not absolute
- Explain why each locator is resilient to UI changes"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Expected Output Quality&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This prompt yields: &lt;code&gt;[data-testid='login-submit']&lt;/code&gt; as primary, &lt;code&gt;//button[contains(text(), 'Sign In')]&lt;/code&gt; as backup, with explanations of why each survives CSS class changes and ID rotations.&lt;/p&gt;
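&lt;p&gt;In test code, the primary-plus-backup idea translates to trying locators in order of resilience. A framework-agnostic sketch (the &lt;code&gt;find_with_fallback&lt;/code&gt; helper is hypothetical; with real Selenium you would pass &lt;code&gt;By&lt;/code&gt; locators and catch &lt;code&gt;NoSuchElementException&lt;/code&gt; specifically):&lt;/p&gt;

```python
class LocatorError(Exception):
    """Raised when every fallback locator fails to resolve."""

# Descending order of resilience, mirroring what the prompt asks for.
LOGIN_BUTTON_LOCATORS = [
    ("css selector", "[data-testid='login-submit']"),  # survives class/ID churn
    ("xpath", "//button[contains(., 'Sign In')]"),     # relative, text-anchored
]

def find_with_fallback(driver, locators):
    """Return the first element any locator resolves; raise if all fail."""
    last_error = None
    for by, value in locators:
        try:
            return driver.find_element(by, value)
        except Exception as exc:  # real code: catch NoSuchElementException only
            last_error = exc
    raise LocatorError(f"No locator matched: {locators}") from last_error
```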

&lt;h2&gt;
  
  
  Example 2: Generating Complete Page Objects
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;❌ Weak Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write a page object for the cart page.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Effective Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"CONTEXT: Senior SDET writing Selenium Python Page Objects.

TASK: Generate a complete Page Object for a Shopping Cart page with:
- Cart item list (each item has: name, price, quantity, remove button)
- Total price display
- Checkout button
- Continue Shopping link

CONSTRAINTS:
- Use @property decorators for element access
- All waits must be explicit using WebDriverWait
- Include a method to get cart item count
- Include a method to remove item by name
- Include a method to verify total price calculation
- Follow POM best practices: no assertions in Page Object, return self for chaining
- Type hints on all methods

OUTPUT: Python code only. Comments on non-obvious logic."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example 3: Writing Comprehensive Test Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;❌ Weak Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write test cases for the search feature.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Effective Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"CONTEXT: E-commerce website with search functionality that supports:
- Text search
- Category filters
- Price range filters
- Sort by (relevance, price, rating)

TASK: Generate a comprehensive test case matrix covering:
1. Positive scenarios (valid searches)
2. Negative scenarios (empty/invalid inputs)
3. Boundary conditions (min/max values)
4. Edge cases (special characters, SQL injection attempts, XSS payloads)
5. Performance scenarios (response time limits)

OUTPUT: Markdown table with columns:
| TC_ID | Category | Scenario | Input | Expected Result | Priority |

Generate at least 15 test cases across all categories."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example 4: Generating Test Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;❌ Weak Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate some test data for registration.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;✅ Effective Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"CONTEXT: User registration form for a German e-commerce platform.

TASK: Generate 10 user profiles for registration testing.

INCLUDE:
- 3 valid users with realistic German names and addresses
- 2 users with edge case emails (long email, subdomain, plus addressing)
- 2 users designed to fail validation (XSS in name, SQL injection in email)
- 2 users with Unicode characters in names (umlauts, accents)
- 1 user with minimum valid data (only required fields)

OUTPUT: JSON array. Each object must have:
first_name, last_name, email, password, street, city, postal_code, country, phone

Mark each with a 'test_category' field: 'valid', 'edge_case', 'security', 'unicode', 'minimal'"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example 5: Debugging Test Failures
&lt;/h2&gt;

&lt;p&gt;When tests fail, prompt engineering can dramatically speed up root cause analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"CONTEXT: Selenium test failed in CI. I need root cause analysis.

ERROR LOG:
selenium.common.exceptions.StaleElementReferenceException:
Message: stale element reference: element is not attached to the page document
  at test_add_to_cart (test_cart.py:47)

CODE SNIPPET (test_cart.py:40-50):
def test_add_to_cart(self):
    products = self.driver.find_elements(By.CSS_SELECTOR, '.product-card')
    for product in products:
        add_btn = product.find_element(By.CSS_SELECTOR, '.add-to-cart')
        add_btn.click()
        time.sleep(1)

TASK: Analyze this failure and provide:
1. Root cause explanation
2. Why this pattern causes StaleElementReference
3. Corrected code that handles dynamic DOM updates
4. Preventive pattern to avoid this in future tests

Be specific to Selenium WebDriver internals."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
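&lt;p&gt;For reference, the kind of corrected code a good answer to this prompt should produce: re-resolve the element collection on every iteration instead of holding references across DOM re-renders. A sketch (duck-typed &lt;code&gt;driver&lt;/code&gt;; real code would add an explicit wait after each click rather than any sleep):&lt;/p&gt;

```python
def add_all_products_to_cart(driver):
    """Click every add-to-cart button, re-finding elements after each
    click so no reference goes stale when the DOM re-renders."""
    count = len(driver.find_elements("css selector", ".product-card"))
    for index in range(count):
        # Re-query on every pass: the previous click may have replaced the nodes.
        products = driver.find_elements("css selector", ".product-card")
        button = products[index].find_element("css selector", ".add-to-cart")
        button.click()
        # Real code: replace the original time.sleep(1) with an explicit wait,
        # e.g. until the cart badge count increments.
```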



&lt;h2&gt;
  
  
  Example 6: API Test Generation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"CONTEXT: Testing a REST API with pytest and requests library.

API ENDPOINT: POST /api/v1/orders
REQUEST BODY: {
  "user_id": "string",
  "items": [{"product_id": "string", "quantity": int}],
  "shipping_address": {...},
  "payment_method": "credit_card" | "paypal"
}

TASK: Generate a comprehensive pytest test module that covers:
1. Happy path with valid order
2. Invalid user_id (404 expected)
3. Empty items array (400 expected)
4. Quantity = 0 and negative quantity
5. Invalid payment_method
6. Schema validation of response
7. Response time assertion (&amp;lt; 500ms)

CONSTRAINTS:
- Use pytest fixtures for API client setup
- Use pytest.mark.parametrize for data-driven tests
- Include both status code and response body assertions
- Use pydantic or jsonschema for response validation

OUTPUT: Complete pytest module, production-ready."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example 7: Mobile Testing with Appium
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"CONTEXT: Appium Python test for Android app, using pytest.

TASK: Generate a test that:
1. Launches the app
2. Handles the onboarding flow (3 swipeable screens with Skip button)
3. Logs in with test credentials
4. Navigates to Profile and verifies user name is displayed

CONSTRAINTS:
- Use Appium 2.0 with W3C capabilities
- Handle permissions popup if it appears (location, notifications)
- Use the W3C Actions API for swipe gestures (TouchAction is removed in Appium 2.0)
- Implement explicit waits with WebDriverWait
- Make it resilient to slow emulator startup

OUTPUT: Complete pytest test file with fixture for driver setup/teardown."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 4: Anti-Patterns to Avoid
&lt;/h2&gt;

&lt;p&gt;Learning what NOT to do is equally important. These common mistakes waste tokens, produce unusable output, and frustrate the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Pattern 1: The Vague One-Liner
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// DON'T
"Write Selenium test."

// WHY IT FAILS
- What language? What framework?
- What is being tested?
- What page structure?
- What assertions matter?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Anti-Pattern 2: Information Overload
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// DON'T
"Here's my entire 500-line page object, my conftest.py, my pytest.ini,
three other page objects, and the full HTML of the page.
Fix the flaky test."

// WHY IT FAILS
- Exceeds context window / drowns the signal
- AI can't identify what's relevant
- Solution: Extract ONLY the relevant snippet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Anti-Pattern 3: No Constraints
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// DON'T
"Generate test data for user registration."

// WHAT YOU GET
- Hardcoded values that match nothing
- Fake data that fails validation
- No edge cases
- Wrong format (JSON vs CSV vs Python dict)

// ALWAYS SPECIFY constraints and output format
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Anti-Pattern 4: Asking for Everything at Once
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// DON'T
"Give me a complete automation framework with page objects,
API clients, database utilities, reporting, parallel execution,
Docker setup, and CI/CD pipeline."

// WHY IT FAILS
- Too many interconnected decisions
- Output will be superficial on everything
- Solution: Break into 10+ focused prompts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 5: Best Practices for Daily Use
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build a Prompt Library&lt;/strong&gt;: Save your best prompts in a team wiki. Reuse and refine. A good prompt is reusable across projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Your Prompts&lt;/strong&gt;: As the AI evolves, so should your prompts. Track what worked with which model version (GPT-4, Claude 3.5, Gemini).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Window Management&lt;/strong&gt;: Know your model's limit. GPT-4 Turbo: 128K tokens. Claude: 200K. Chunk large codebases intelligently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature Settings&lt;/strong&gt;: For code generation, use temperature 0-0.3 (deterministic). For creative test case brainstorming, 0.7-0.9.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate Everything&lt;/strong&gt;: AI-generated code MUST be reviewed. Treat it as a junior engineer's first draft — helpful, but not production-ready without review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Models for Sensitive Data&lt;/strong&gt;: Use Ollama with Llama 3 for proprietary code. Never send production data to external APIs.&lt;/li&gt;
&lt;/ul&gt;
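&lt;p&gt;The chunking advice can be as simple as splitting source by line count with overlap, so a function straddling a boundary still appears whole in at least one chunk. A rough sketch (lines are a crude proxy for tokens; this is a hypothetical helper, so tune the numbers to your model):&lt;/p&gt;

```python
def chunk_lines(text, max_lines=200, overlap=20):
    """Split text into overlapping, line-based chunks sized for one prompt."""
    lines = text.splitlines()
    step = max(max_lines - overlap, 1)
    chunks = []
    for start in range(0, max(len(lines), 1), step):
        # Each chunk repeats the last `overlap` lines of the previous one.
        chunks.append("\n".join(lines[start:start + max_lines]))
    return chunks
```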

&lt;h2&gt;
  
  
  The Compound Effect
&lt;/h2&gt;

&lt;p&gt;Consider the math: If prompt engineering saves 20 minutes per day on code generation, debugging, and test case design, that's nearly 2 hours per week. Over a year, that's over 85 hours — more than two full work weeks. But the real gain isn't time; it's the quality leap. AI-assisted testers catch more edge cases, write more maintainable code, and debug faster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The 10x Multiplier&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Prompt engineering doesn't make you 10% better. When mastered, it makes you 10x more productive. The gap between SDETs who can prompt effectively and those who can't will only widen as AI tools improve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is the bridge between human intent and machine execution. The CTCO framework (Context, Task, Constraints, Output) is your foundation. Advanced patterns like Chain-of-Thought, Few-Shot, and Role-Playing are your power tools. And the examples in this guide are your starting templates.&lt;/p&gt;

&lt;p&gt;Start refining your prompts today. Save your best ones. Share them with your team. And watch your automation efficiency soar to levels that weren't possible even a year ago. The future of testing isn't just about writing code — it's about writing the right prompts to generate the right code.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Why Your Selenium Tests Fail on AI Chatbots (And How to Fix It)</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Sat, 13 Dec 2025 21:14:49 +0000</pubDate>
      <link>https://forem.com/godhirajcode/why-your-selenium-tests-fail-on-ai-chatbots-and-how-to-fix-it-24nh</link>
      <guid>https://forem.com/godhirajcode/why-your-selenium-tests-fail-on-ai-chatbots-and-how-to-fix-it-24nh</guid>
      <description>&lt;p&gt;🎯&lt;/p&gt;

&lt;h4&gt;
  
  
  What You'll Learn
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Problem&lt;/strong&gt;: Why WebDriverWait fails on streaming responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MutationObserver&lt;/strong&gt;: Zero-polling stream detection in the browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Assertions&lt;/strong&gt;: ML-powered validation for non-deterministic outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTFT Monitoring&lt;/strong&gt;: Measuring Time-To-First-Token for LLM performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You've built an automation suite for your new AI chatbot. The tests run. Then they fail. Randomly. The response was correct—you can see it on the screen—but your assertion says otherwise. Welcome to the nightmare of testing Generative AI interfaces with traditional Selenium.&lt;/p&gt;

&lt;p&gt;🤖&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fundamental Incompatibility
&lt;/h2&gt;

&lt;p&gt;Traditional Selenium WebDriver tests are designed for static web pages where content loads once and stabilizes. AI chatbots break this assumption in two fundamental ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Responses&lt;/strong&gt;: Tokens arrive one-by-one over 2-5 seconds. Your &lt;code&gt;WebDriverWait&lt;/code&gt; triggers on the first token, capturing partial text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Deterministic Output&lt;/strong&gt;: The same question yields different (but equivalent) answers. &lt;code&gt;assertEqual()&lt;/code&gt; fails even when the response is correct.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Hello"
AI Response (Streaming):
  t=0ms:    "H"
  t=50ms:   "Hello"
  t=100ms:  "Hello! How"
  t=200ms:  "Hello! How can I"
  t=500ms:  "Hello! How can I help you today?"  ← FINAL

Standard Selenium captures: "Hello! How can I"  ← PARTIAL (FAIL!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Usual Hacks (And Why They Fail)
&lt;/h2&gt;

&lt;p&gt;Every team tries the same workarounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;time.sleep(5)&lt;/code&gt;&lt;/strong&gt;: Arbitrary. Too short = flaky. Too long = slow CI. Never works reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;text_to_be_present&lt;/code&gt;&lt;/strong&gt;: Triggers on first match, missing the complete response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polling with length checks&lt;/strong&gt;: Race conditions. Text length can plateau mid-stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exact string assertions&lt;/strong&gt;: Fundamentally impossible with non-deterministic AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Real Cost&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Teams spend 30% of their time debugging flaky AI tests instead of improving coverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Browser-Native Stream Detection
&lt;/h2&gt;

&lt;p&gt;The key insight is that the browser already knows when streaming stops—we just need to listen. The &lt;strong&gt;MutationObserver&lt;/strong&gt; API watches for DOM changes in real-time, directly in JavaScript. No Python polling. No arbitrary sleeps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium_chatbot_test import StreamWaiter

# Wait for the AI response to complete streaming
waiter = StreamWaiter(driver, (By.ID, "chat-response"))
response_text = waiter.wait_for_stable_text(
    silence_timeout=500,  # Consider "done" after 500ms of no changes
    overall_timeout=30000  # Maximum wait time
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;StreamWaiter&lt;/code&gt; injects a MutationObserver that resets a timer on every DOM mutation. Only when the timer reaches &lt;code&gt;silence_timeout&lt;/code&gt; without interruption does it return—guaranteeing you capture the complete response.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic Assertions: Testing Meaning, Not Words
&lt;/h2&gt;

&lt;p&gt;Once you have the full response, you face the second problem: AI outputs vary. The solution is &lt;strong&gt;semantic similarity&lt;/strong&gt;—comparing meaning instead of exact strings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium_chatbot_test import SemanticAssert

asserter = SemanticAssert()

# These all mean the same thing—and this assertion passes!
expected = "Hello! How can I help you today?"
actual = "Hi there! What can I assist you with?"

asserter.assert_similar(
    expected,
    actual,
    threshold=0.7  # 70% semantic similarity required
)
# ✅ PASSES - Because they mean the same thing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The library uses &lt;code&gt;sentence-transformers&lt;/code&gt; with the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model to generate embeddings and calculate cosine similarity. The model is lazy-loaded on first use and works on CPU—no GPU required in CI.&lt;/p&gt;
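&lt;p&gt;Stripped of the model, the comparison itself is a few lines of math. A sketch of cosine similarity over plain Python lists (the vectors below are stand-ins; real embeddings come from the sentence-transformers model):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two phrasings of the same meaning land at nearly the same point in
# embedding space, so their cosine similarity clears the 0.7 threshold.
```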

&lt;h2&gt;
  
  
  TTFT: The LLM Performance Metric You're Not Tracking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time-To-First-Token (TTFT)&lt;/strong&gt; is critical for user experience. A chatbot that takes 3 seconds to start responding feels broken, even if the total response time is acceptable. Most teams have zero visibility into this metric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium_chatbot_test import LatencyMonitor

with LatencyMonitor(driver, (By.ID, "chat-response")) as monitor:
    send_button.click()
    # ... wait for response ...

print(f"TTFT: {monitor.metrics.ttft_ms}ms")  # 41.7ms
print(f"Total: {monitor.metrics.total_ms}ms")  # 2434.8ms
print(f"Tokens: {monitor.metrics.token_count}")  # 48 mutations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Real Demo Results&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In testing, the library captured 41.7ms TTFT with 48 DOM mutations over 2.4 seconds, achieving 71% semantic accuracy—automatically.&lt;/p&gt;
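&lt;p&gt;The metric arithmetic is simple once mutation timestamps are captured: TTFT is the gap between the send action and the first mutation, and total latency is the gap to the last. A sketch of how such numbers could be derived (hypothetical helper, not the library's internal API):&lt;/p&gt;

```python
def latency_metrics(click_ms, mutation_timestamps_ms):
    """Derive TTFT and total latency from raw DOM mutation timestamps."""
    if not mutation_timestamps_ms:
        return {"ttft_ms": None, "total_ms": None, "token_count": 0}
    return {
        "ttft_ms": min(mutation_timestamps_ms) - click_ms,   # time to first token
        "total_ms": max(mutation_timestamps_ms) - click_ms,  # time to final token
        "token_count": len(mutation_timestamps_ms),          # mutations as a token proxy
    }
```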

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here's a complete test that would be impossible with traditional Selenium:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium_chatbot_test import StreamWaiter, SemanticAssert, LatencyMonitor

def test_chatbot_greeting():
    driver = webdriver.Chrome()
    driver.get("https://my-chatbot.com")

    # Type a message
    input_box = driver.find_element(By.ID, "chat-input")
    input_box.send_keys("Hello!")

    # Monitor latency while waiting for response
    with LatencyMonitor(driver, (By.ID, "response")) as monitor:
        driver.find_element(By.ID, "send-btn").click()

        # Wait for streaming to complete (no time.sleep!)
        waiter = StreamWaiter(driver, (By.ID, "response"))
        response = waiter.wait_for_stable_text(silence_timeout=500)

    # Assert semantic meaning, not exact words
    asserter = SemanticAssert()
    asserter.assert_similar(
        "Hello! How can I help you today?",
        response,
        threshold=0.7
    )

    # Verify performance SLA
    assert monitor.metrics.ttft_ms &amp;lt; 200, "TTFT exceeded 200ms SLA"

    driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;Stop fighting flaky AI tests. Start testing semantically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium-chatbot-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/selenium-chatbot-test" rel="noopener noreferrer"&gt;pypi.org/project/selenium-chatbot-test&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/godhiraj-code/selenium-chatbot-test" rel="noopener noreferrer"&gt;github.com/godhiraj-code/selenium-chatbot-test&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Built by Dhiraj Das&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Automation Architect. Making GenAI testing deterministic, one MutationObserver at a time.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Selenium Teleport: Skip Login Screens Forever</title>
      <dc:creator>Dhiraj Das</dc:creator>
      <pubDate>Sat, 13 Dec 2025 09:31:38 +0000</pubDate>
      <link>https://forem.com/godhirajcode/selenium-teleport-skip-login-screens-forever-25gd</link>
      <guid>https://forem.com/godhirajcode/selenium-teleport-skip-login-screens-forever-25gd</guid>
      <description>&lt;p&gt;🎯&lt;/p&gt;

&lt;h4&gt;
  
  
  Why This Matters
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full State Capture&lt;/strong&gt;: Cookies + LocalStorage + SessionStorage + IndexedDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same-Origin Handling&lt;/strong&gt;: Automatic pre-flight navigation solves the silent killer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stealth Mode&lt;/strong&gt;: Built-in bot detection bypass for enterprise sites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Login Tax&lt;/strong&gt;: First run takes seconds, every run after is instant&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The "Login Tax" is Killing Your Automation
&lt;/h2&gt;

&lt;p&gt;Every automation architect knows the pain. You build a robust test suite, but 40% of your execution time is spent typing usernames, filling 2FA fields, and waiting for redirects.&lt;/p&gt;

&lt;p&gt;It's not just wasted time—it's risk.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flakiness&lt;/strong&gt;: Every login attempt is a potential failure point (network glitches, CAPTCHAs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bot Detection&lt;/strong&gt;: Logging in 50 times an hour from a CI server? That is the fastest way to get your IP flagged by Cloudflare or Google.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance&lt;/strong&gt;: When the login UI changes, every single test breaks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The industry-standard advice has been to "just save the cookies." But if you have ever tried pickle or a random Stack Overflow snippet, you know the truth: it rarely works for modern applications.&lt;/p&gt;
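&lt;p&gt;For reference, the naive pattern usually looks something like this (a minimal sketch; the function names are illustrative, not from any particular library):&lt;/p&gt;

```python
import pickle

# The classic "just save the cookies" pattern.
def save_cookies(driver, path):
    # Captures ONLY cookies -- localStorage, sessionStorage and
    # IndexedDB are silently left behind.
    with open(path, "wb") as f:
        pickle.dump(driver.get_cookies(), f)

def load_cookies(driver, path):
    with open(path, "rb") as f:
        for cookie in pickle.load(f):
            # Raises InvalidCookieDomainException unless the browser
            # is already parked on the cookie's domain.
            driver.add_cookie(cookie)
```

&lt;p&gt;Even when the injection succeeds, only the cookie layer survives the round trip, which is exactly the gap the next sections explain.&lt;/p&gt;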

&lt;h2&gt;
  
  
  🔐 Why the "Old Way" Fails
&lt;/h2&gt;

&lt;p&gt;Most developers try to save cookies and inject them into a fresh browser. Here is why that approach crashes and burns in 2025:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Cookies Are Not Enough
&lt;/h3&gt;

&lt;p&gt;Modern web apps are complex. That React dashboard? It stores your JWT auth token in localStorage. That checkout flow? It caches the cart ID in sessionStorage. Saving cookies alone captures maybe 30% of the state. If you don't capture the rest, you get logged out immediately.&lt;/p&gt;
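&lt;p&gt;A capture step that covers all three layers can be sketched roughly like this (my own illustration of the general technique, using standard Selenium calls; these helpers are hypothetical, not the selenium-teleport API):&lt;/p&gt;

```python
import json

# JS one-liners that dump web storage as plain dicts.
DUMP_LOCAL = "return Object.assign({}, window.localStorage);"
DUMP_SESSION = "return Object.assign({}, window.sessionStorage);"

def capture_state(driver):
    # Cookies alone miss the JWT in localStorage and the cart ID in
    # sessionStorage -- snapshot all three layers together.
    return {
        "cookies": driver.get_cookies(),
        "local_storage": driver.execute_script(DUMP_LOCAL),
        "session_storage": driver.execute_script(DUMP_SESSION),
    }

def state_to_json(state):
    # Serialize the snapshot so a later run can re-inject it.
    return json.dumps(state, indent=2, sort_keys=True)
```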

&lt;h3&gt;
  
  
  2. The "Same-Origin" Trap
&lt;/h3&gt;

&lt;p&gt;This is the silent killer of automation scripts.&lt;/p&gt;

&lt;p&gt;If you try to inject a cookie for example.com while your fresh browser is sitting on about:blank (the default state), Chrome blocks it instantly due to security policies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver.get("about:blank")
driver.add_cookie({"domain": "example.com", ...})
# 💥 CRASH: InvalidCookieDomainException
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Silent Killer&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most libraries don't handle this. Selenium Teleport does.&lt;/p&gt;
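&lt;p&gt;The pre-flight idea itself is simple (a hedged sketch of the general technique, not the package's internals):&lt;/p&gt;

```python
from urllib.parse import urlunsplit

def preflight_url(cookie_domain, scheme="https"):
    # ".example.com" and "example.com" both map to https://example.com/
    return urlunsplit((scheme, cookie_domain.lstrip("."), "/", "", ""))

def inject_cookies(driver, cookies):
    # Visit each cookie's origin BEFORE calling add_cookie(); injecting
    # from about:blank raises InvalidCookieDomainException.
    for domain in {c["domain"] for c in cookies}:
        driver.get(preflight_url(domain))
        for cookie in (c for c in cookies if c["domain"] == domain):
            driver.add_cookie(cookie)
```

&lt;p&gt;Grouping by domain keeps the pre-flight down to one navigation per origin instead of one per cookie.&lt;/p&gt;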

&lt;h2&gt;
  
  
  ⚡ Enter: Selenium Teleport
&lt;/h2&gt;

&lt;p&gt;I built selenium-teleport to solve these architectural gaps once and for all. It is not just a cookie saver; it is a &lt;strong&gt;Full State Transporter&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It follows a strict "Teleportation Pattern" that guarantees success:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CAPTURE&lt;/strong&gt;: Detailed snapshot of Cookies + LocalStorage + SessionStorage + IndexedDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NAVIGATE&lt;/strong&gt;: Automatically detects the base domain and performs a "Pre-flight" navigation to satisfy the Same-Origin Policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INJECT&lt;/strong&gt;: Surgically inserts the state into the browser's secure context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TELEPORT&lt;/strong&gt;: Instant reload to your target URL—already authenticated.&lt;/li&gt;
&lt;/ul&gt;
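&lt;p&gt;The four steps above can be sketched roughly like this (a simplified illustration under my own assumptions, not the package's actual internals; the real library also handles sessionStorage, IndexedDB, and per-domain pre-flight):&lt;/p&gt;

```python
import json
import os

class TeleportSketch:
    """Simplified CAPTURE -> NAVIGATE -> INJECT -> TELEPORT cycle."""

    def __init__(self, driver, state_file):
        self.driver = driver
        self.state_file = state_file

    def has_state(self):
        return os.path.exists(self.state_file)

    def capture(self):
        # CAPTURE: cookies plus localStorage in one JSON snapshot.
        state = {
            "cookies": self.driver.get_cookies(),
            "local_storage": self.driver.execute_script(
                "return Object.assign({}, window.localStorage);"),
        }
        with open(self.state_file, "w") as f:
            json.dump(state, f)

    def load(self, url):
        with open(self.state_file) as f:
            state = json.load(f)
        # NAVIGATE: land on the target origin to satisfy Same-Origin.
        self.driver.get(url)
        # INJECT: cookies first, then storage keys.
        for cookie in state["cookies"]:
            self.driver.add_cookie(cookie)
        for key, value in state["local_storage"].items():
            self.driver.execute_script(
                "window.localStorage.setItem(arguments[0], arguments[1]);",
                key, value)
        # TELEPORT: reload so the app boots already authenticated.
        self.driver.get(url)
```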

&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The "Standard" Way (Fast &amp;amp; Simple)
&lt;/h3&gt;

&lt;p&gt;Perfect for internal tools, staging environments, and standard SaaS apps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium_teleport import create_driver, Teleport

# 1. The Setup
driver = create_driver(profile_path="my_sessions")

# 2. The Teleport
with Teleport(driver, "hackernews_identity.json") as t:
    if t.has_state():
        # ⚡️ SKIP LOGIN ENTIRELY
        t.load("https://news.ycombinator.com/submit")
    else:
        # First run only: Login manually
        driver.get("https://news.ycombinator.com/login")
        # ... user logs in ...

    # You are now authenticated.
    assert "logout" in driver.page_source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first run takes 10 seconds. Every run after that takes 0.5 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Stealth" Way (Enterprise Grade)
&lt;/h3&gt;

&lt;p&gt;This is where the package shines. If you are testing against sites protected by Cloudflare, DataDome, or Imperva, standard Selenium gets blocked.&lt;/p&gt;

&lt;p&gt;selenium-teleport comes with a Hybrid Driver Factory that integrates with sb-stealth-wrapper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium_teleport import create_driver, Teleport

# This driver mimics a real human user to bypass bot detection
# Requires: pip install selenium-teleport[stealth]
driver = create_driver(use_stealth_wrapper=True)

with Teleport(driver, "protected_app.json") as t:
    t.load("https://tough-security-site.com/dashboard")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📊 What We Accomplished
&lt;/h2&gt;

&lt;p&gt;This package solves the "Day 3" problems of automation that simple scripts miss:&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature Comparison
&lt;/h3&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Naive Script&lt;/th&gt;
&lt;th&gt;Selenium Teleport&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cookie Capture&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LocalStorage / SessionStorage&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same-Origin Policy Handling&lt;/td&gt;
&lt;td&gt;❌ Crashes&lt;/td&gt;
&lt;td&gt;✅ Auto pre-flight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bot Detection Bypass&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Built-in stealth mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IndexedDB Support (for PWAs)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h2&gt;
  
  
  🎯 The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;selenium-teleport isn't just a utility; it's a shift in how we write tests.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: Fewer login attempts mean fewer chances for network timeouts or CAPTCHAs to make your build flaky.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus&lt;/strong&gt;: Your tests should verify your features, not the stability of your login page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Stop paying the "Login Tax" on every single test execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;Stop writing login scripts. Start teleporting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium-teleport
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/godhiraj-code/selenium-teleport" rel="noopener noreferrer"&gt;github.com/godhiraj-code/selenium-teleport&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/selenium-teleport" rel="noopener noreferrer"&gt;pypi.org/project/selenium-teleport&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Built by Dhiraj Das&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Automation Architect. Stop fighting login screens—start building features.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
