<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: RAAZU shanigarapu</title>
    <description>The latest articles on Forem by RAAZU shanigarapu (@raazu_shanigarapu_65af2ba).</description>
    <link>https://forem.com/raazu_shanigarapu_65af2ba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2068686%2Ff9595322-28f3-409e-8d66-ce50bbf30b33.jpg</url>
      <title>Forem: RAAZU shanigarapu</title>
      <link>https://forem.com/raazu_shanigarapu_65af2ba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/raazu_shanigarapu_65af2ba"/>
    <language>en</language>
    <item>
      <title>Self-Healing Tests: My Pragmatic Approach Beyond the Hype</title>
      <dc:creator>RAAZU shanigarapu</dc:creator>
      <pubDate>Mon, 11 May 2026 12:26:45 +0000</pubDate>
      <link>https://forem.com/raazu_shanigarapu_65af2ba/self-healing-tests-my-pragmatic-approach-beyond-the-hype-1g3j</link>
      <guid>https://forem.com/raazu_shanigarapu_65af2ba/self-healing-tests-my-pragmatic-approach-beyond-the-hype-1g3j</guid>
      <description>&lt;p&gt;Self-healing tests are one of the most hyped concepts in QA right now.&lt;/p&gt;

&lt;p&gt;They're also real. I've shipped them. But the version I shipped looks nothing like what most vendors are selling.&lt;/p&gt;

&lt;p&gt;Let me break down the hype from what actually works in production.&lt;/p&gt;

&lt;h2&gt;What "Self-Healing" Actually Means&lt;/h2&gt;

&lt;p&gt;A self-healing test is one that can automatically recover from selector failures without human intervention.&lt;/p&gt;

&lt;p&gt;That's the promise. An element moves, a class name changes, a data-testid gets renamed — and instead of your test suite turning red and blocking your pipeline, the system detects the change, finds the element through alternative means, fixes the locator, and continues.&lt;/p&gt;

&lt;p&gt;Sounds perfect. Here's why it's complicated.&lt;/p&gt;

&lt;h2&gt;The Problem With Most Self-Healing Implementations&lt;/h2&gt;

&lt;p&gt;Most vendor self-healing tools work like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Test fails because &lt;code&gt;[data-testid="submit-btn"]&lt;/code&gt; isn't found&lt;/li&gt;
&lt;li&gt;Tool takes a screenshot + DOM snapshot at failure point&lt;/li&gt;
&lt;li&gt;Tool compares to previous successful run&lt;/li&gt;
&lt;li&gt;Tool finds the "closest" element and retries&lt;/li&gt;
&lt;li&gt;Tool saves the new locator as a "healed" version&lt;/li&gt;
&lt;li&gt;Next run uses the healed locator&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works surprisingly well — for simple, isolated selector changes.&lt;/p&gt;
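
&lt;p&gt;The loop is easy to sketch. Here's a minimal, hypothetical version, with the helpers injected as callables because every vendor implements them differently behind closed doors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A hypothetical sketch of the vendor-style heal loop above. None of
# these helper names belong to a real product's API.
def run_step(find, snapshot, closest_match, save_healed, locator):
    element = find(locator)
    if element is not None:
        return element                               # found: no healing needed

    candidate = closest_match(snapshot(), locator)   # steps 2-4: compare and pick
    if candidate is None:
        raise LookupError(f"no heal candidate for {locator!r}")

    save_healed(locator, candidate)                  # step 5: persisted for next run
    return find(candidate)                           # retry with the "healed" locator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;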

&lt;p&gt;It fails for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic applications with state-dependent elements&lt;/li&gt;
&lt;li&gt;Elements that move position but retain the same selector&lt;/li&gt;
&lt;li&gt;Fundamental layout changes where the "submit" button moved to a different form&lt;/li&gt;
&lt;li&gt;Race conditions and timing issues that look like selector failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deeper issue: self-healing that silently patches locators is self-healing that hides problems. If your test suite healed 50 locators last week, you have 50 silent signals that the UI is changing faster than your team knows.&lt;/p&gt;

&lt;h2&gt;What I Actually Built&lt;/h2&gt;

&lt;p&gt;The self-healing mechanism I implemented at Mendix works differently from the vendor pitch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Locator Strategy Cascade&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of a single selector, every element has a priority-ordered strategy list:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;element_strategies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-testid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;submit-btn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;        &lt;span class="c1"&gt;# Primary — developer-maintained
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aria-label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Submit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;             &lt;span class="c1"&gt;# Secondary — accessibility
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Submit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                   &lt;span class="c1"&gt;# Tertiary — visible text
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;css&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;button[type=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;submit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;     &lt;span class="c1"&gt;# Fallback — structural
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the primary locator fails, the framework tries the next. If a fallback succeeds, it logs the incident, creates an alert, and flags the primary for review.&lt;/p&gt;
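
&lt;p&gt;A condensed sketch of the resolver that walks that list. The &lt;code&gt;page.locator&lt;/code&gt; calls follow Playwright's Python API; &lt;code&gt;flag_for_review&lt;/code&gt; is a hypothetical hook for whatever alerting or ticketing you use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

log = logging.getLogger("locator_cascade")

def resolve(page, strategies):
    """Try each (kind, value) strategy in priority order."""
    for rank, (kind, value) in enumerate(strategies):
        selector = {
            "data-testid": f'[data-testid="{value}"]',
            "aria-label": f'[aria-label="{value}"]',
            "text": f"text={value}",      # Playwright's text selector engine
            "css": value,
        }[kind]
        locator = page.locator(selector)
        if locator.count() &gt; 0:
            if rank &gt; 0:                  # a fallback fired; never silent
                log.warning("primary failed; healed via %s=%r", kind, value)
                flag_for_review(strategies[0])   # hypothetical alert hook
            return locator
    raise LookupError(f"all strategies exhausted: {strategies}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;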

&lt;p&gt;No silent healing. The fix is flagged, tracked, and assigned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Similarity Scoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When all strategies fail, I use a DOM similarity algorithm (not an LLM — a structured comparator) that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes the expected element's attributes and position&lt;/li&gt;
&lt;li&gt;Scans the current DOM for the closest structural match&lt;/li&gt;
&lt;li&gt;Returns a confidence score with a suggested locator update&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If confidence is above 85%, the test continues with the suggested locator and creates a PR-ready fix suggestion. If below, the test fails with a diagnostic report.&lt;/p&gt;
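
&lt;p&gt;The comparator doesn't need to be fancy. A toy version of the scoring, with elements reduced to plain dicts and illustrative weights (not the production values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def similarity(expected, candidate):
    """Score a candidate element against the expected one, 0.0 to 1.0.

    Elements are plain dicts here for illustration:
    {"tag": "button", "attrs": {...}, "x": 120, "y": 480}
    """
    tag_score = 1.0 if expected["tag"] == candidate["tag"] else 0.0

    exp_attrs = expected["attrs"]
    shared = sum(1 for k, v in exp_attrs.items()
                 if candidate["attrs"].get(k) == v)
    attr_score = shared / max(len(exp_attrs), 1)

    # Manhattan distance, normalised so ~200px of drift scores zero.
    drift = abs(expected["x"] - candidate["x"]) + abs(expected["y"] - candidate["y"])
    pos_score = max(0.0, 1.0 - drift / 200)

    return 0.5 * attr_score + 0.3 * tag_score + 0.2 * pos_score

# best = max(scan_dom(page), key=lambda c: similarity(expected, c))
# continue if the best score clears 0.85; otherwise fail with diagnostics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;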

&lt;p&gt;&lt;strong&gt;Layer 3: Failure Intelligence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every failure generates structured metadata:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which locator strategy failed&lt;/li&gt;
&lt;li&gt;Which (if any) fallback succeeded&lt;/li&gt;
&lt;li&gt;The confidence score of any auto-suggestions&lt;/li&gt;
&lt;li&gt;The code diff context from the last 24 hours&lt;/li&gt;
&lt;li&gt;The element's historical stability score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This feeds into a dashboard that shows which UI elements are highest-maintenance. Developers see this. When an element is flagged as high-drift, it becomes a conversation about whether the test or the element is the problem.&lt;/p&gt;
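
&lt;p&gt;In practice that metadata is just a structured record. A sketch of the shape (field names illustrative; Python 3.10+ union syntax):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field

@dataclass
class LocatorFailureReport:
    """Emitted on every locator failure and fed to the dashboard."""
    element: str                       # logical element name
    failed_strategy: str               # e.g. 'data-testid=submit-btn'
    healed_by: str | None = None       # fallback that succeeded, if any
    confidence: float | None = None    # similarity score of any suggestion
    recent_diffs: list[str] = field(default_factory=list)  # last-24h commits
    stability_score: float = 1.0       # historical drift; 1.0 = rock solid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;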

&lt;h2&gt;The Real Value: Not Just Fewer Red Tests&lt;/h2&gt;

&lt;p&gt;The most valuable outcome of self-healing architecture isn't that fewer tests fail.&lt;/p&gt;

&lt;p&gt;It's that the &lt;em&gt;reason&lt;/em&gt; tests fail becomes data.&lt;/p&gt;

&lt;p&gt;Before: "Tests are failing, probably a UI change."&lt;br&gt;
After: "The submit button's primary selector has changed 4 times in 6 weeks. The Payments team is iterating fast. Let's add a &lt;code&gt;data-testid&lt;/code&gt; that's stable."&lt;/p&gt;

&lt;p&gt;That's a different conversation. A useful one.&lt;/p&gt;

&lt;h2&gt;When Self-Healing Is Worth It&lt;/h2&gt;

&lt;p&gt;Self-healing makes sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your application UI iterates faster than your test maintenance cadence&lt;/li&gt;
&lt;li&gt;You have a large legacy test suite with inconsistent locator strategies&lt;/li&gt;
&lt;li&gt;You have clear locator ownership and want to enforce it via automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-healing is not worth it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your core problem is test architecture (fix that first)&lt;/li&gt;
&lt;li&gt;You want it to mask flakiness rather than surface it&lt;/li&gt;
&lt;li&gt;You're hoping to avoid adding &lt;code&gt;data-testid&lt;/code&gt; attributes to your UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The vendors that promise "zero test maintenance" are selling a fantasy. The engineers who implement thoughtful self-healing architecture are solving a real problem.&lt;/p&gt;

&lt;h2&gt;My Verdict&lt;/h2&gt;

&lt;p&gt;Self-healing tests are worth building — not buying.&lt;/p&gt;

&lt;p&gt;The buy-vs-build question matters here more than anywhere. Commercial tools optimize for impressive demos. Custom implementations optimize for your specific app, your specific failure patterns, your specific team's workflow.&lt;/p&gt;

&lt;p&gt;The best self-healing system I've built took 3 weeks to implement and 6 months to refine based on real failure data. It's now the reason our test maintenance load is 40% lower than the industry average.&lt;/p&gt;

&lt;p&gt;That's not magic. That's engineering.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://raju-shanigarapu.vercel.app/blog/self-healing-tests-hype-vs-reality" rel="noopener noreferrer"&gt;https://raju-shanigarapu.vercel.app/blog/self-healing-tests-hype-vs-reality&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>qa</category>
      <category>testing</category>
      <category>automation</category>
      <category>playwright</category>
    </item>
    <item>
      <title>Why Your Automation Framework is Failing (It's the Architecture)</title>
      <dc:creator>RAAZU shanigarapu</dc:creator>
      <pubDate>Fri, 08 May 2026 21:17:43 +0000</pubDate>
      <link>https://forem.com/raazu_shanigarapu_65af2ba/why-your-automation-framework-is-failing-its-the-architecture-489j</link>
      <guid>https://forem.com/raazu_shanigarapu_65af2ba/why-your-automation-framework-is-failing-its-the-architecture-489j</guid>
      <description>&lt;p&gt;I've seen it happen at every company I've joined.&lt;/p&gt;

&lt;p&gt;Someone built an automation framework 2 years ago. It has 800 tests. Half of them fail on any given day. Nobody touches it except to disable the failing ones. The team has silently agreed to stop trusting it.&lt;/p&gt;

&lt;p&gt;This is the most common QA failure mode. And it has almost nothing to do with the tool choice.&lt;/p&gt;

&lt;h2&gt;The Real Reason Frameworks Fail&lt;/h2&gt;

&lt;p&gt;Every post-mortem I've done on a dead automation framework finds the same root causes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tests were written for the happy path, not the system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The team automated what users &lt;em&gt;should&lt;/em&gt; do, not what the system &lt;em&gt;could&lt;/em&gt; encounter. The first time a timeout, a race condition, or an API degradation hit production, the tests were useless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No ownership model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Who fixes a failing test? If the answer is "whoever broke it," the answer is actually "nobody." Automation without explicit ownership is automation in decline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The framework grew faster than its architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Someone wrote &lt;code&gt;test_login.py&lt;/code&gt; and copy-pasted it 300 times. No page object model. No fixtures. No hierarchy. 300 tests that all fail when the login selector changes by one character.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. It wasn't treated as code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tests have linting standards, code reviews, and refactoring cadence — or they don't. Teams that treat test code as second-class code produce second-class automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. No feedback loop with engineering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If developers can merge code without seeing automation results, the automation is decorative. Tests that don't block pipelines don't protect pipelines.&lt;/p&gt;

&lt;h2&gt;What Good Architecture Actually Looks Like&lt;/h2&gt;

&lt;p&gt;I've built automation systems that outlive teams and survive platform migrations. Here's what they have in common:&lt;/p&gt;

&lt;h3&gt;A Single Source of Locator Truth&lt;/h3&gt;

&lt;p&gt;Locators live in one place. Not scattered across test files. Not duplicated across 40 helpers. When the UI changes, you update one layer and every test that touches that element is fixed.&lt;/p&gt;

&lt;p&gt;Page Object Model is table stakes. If you're not using it, stop reading this and implement it today.&lt;/p&gt;
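
&lt;p&gt;For anyone who hasn't seen one, a minimal Playwright-flavoured page object (the selectors are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class LoginPage:
    """These selectors exist here and nowhere else."""
    URL = "/login"
    USERNAME = '[data-testid="login-username"]'
    PASSWORD = '[data-testid="login-password"]'
    SUBMIT = '[data-testid="login-submit"]'

    def __init__(self, page):
        self.page = page

    def login(self, username, password):
        self.page.goto(self.URL)
        self.page.fill(self.USERNAME, username)
        self.page.fill(self.PASSWORD, password)
        self.page.click(self.SUBMIT)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;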

&lt;h3&gt;Test Isolation as a Non-Negotiable&lt;/h3&gt;

&lt;p&gt;Every test must be independently executable. No test should depend on another test having run first. No shared mutable state between tests.&lt;/p&gt;

&lt;p&gt;If you can't run &lt;code&gt;test_checkout_flow.py&lt;/code&gt; in isolation and have it pass, you don't have a test — you have a dependency chain waiting to cascade.&lt;/p&gt;
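
&lt;p&gt;In pytest terms, isolation means every test builds and tears down its own state. A sketch, assuming an &lt;code&gt;api_client&lt;/code&gt; fixture that wraps your test API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pytest

@pytest.fixture
def checkout_user(api_client):
    """Fresh user and cart per test; nothing shared, nothing ordered."""
    user = api_client.create_user()
    api_client.add_to_cart(user, sku="TEST-SKU-1")
    yield user
    api_client.delete_user(user)   # teardown runs even if the test fails

def test_checkout_flow(checkout_user, page):
    # Runnable in isolation: everything this test needs, it created above.
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;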

&lt;h3&gt;Retry Logic That's Honest&lt;/h3&gt;

&lt;p&gt;Retry is not a solution. It's a suppressor. Use it sparingly, with a cap (3 retries max), with logging that exposes every retry. A test that passes on retry 3 is not a passing test — it's a flaky test with a mask on.&lt;/p&gt;

&lt;p&gt;Track your retry rate. If it's above 5%, you have a structural problem.&lt;/p&gt;
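
&lt;p&gt;If you roll your own retry (plugins like &lt;code&gt;pytest-rerunfailures&lt;/code&gt; can also do this), the shape to aim for is a hard cap plus loud logging. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import functools
import logging

log = logging.getLogger("retries")

def honest_retry(max_attempts=3):
    """Retry with a hard cap, and never silently."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except AssertionError as exc:
                    # Every retry is logged, so flakes feed the retry-rate
                    # metric instead of disappearing.
                    log.warning("%s: attempt %d/%d failed: %s",
                                fn.__name__, attempt, max_attempts, exc)
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;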

&lt;h3&gt;Contract-First API Testing&lt;/h3&gt;

&lt;p&gt;Before testing behavior, test the contract. Does the API schema match what you expect? Does it match what downstream services expect?&lt;/p&gt;

&lt;p&gt;API contract tests are the highest ROI automation you can write. They catch breaking changes before any UI test ever runs, and they run in milliseconds.&lt;/p&gt;
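
&lt;p&gt;A minimal contract test using the &lt;code&gt;jsonschema&lt;/code&gt; library; the endpoint, fields, and &lt;code&gt;api_client&lt;/code&gt; wrapper are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from jsonschema import validate  # pip install jsonschema

# The contract downstream consumers rely on.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["id", "status", "total_cents"],
    "properties": {
        "id": {"type": "string"},
        "status": {"type": "string", "enum": ["pending", "paid", "shipped"]},
        "total_cents": {"type": "integer", "minimum": 0},
    },
}

def test_order_contract(api_client):
    response = api_client.get("/orders/123")   # assumed thin HTTP wrapper
    validate(instance=response.json(), schema=ORDER_SCHEMA)  # raises on breakage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;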

&lt;h3&gt;CI/CD Integration From Day One&lt;/h3&gt;

&lt;p&gt;Not after the framework "matures." Day one. Tests that don't run in the pipeline don't matter. Test results that don't block merges don't influence behavior.&lt;/p&gt;

&lt;p&gt;If you can't run your smoke suite in under 5 minutes, fix that before writing more tests.&lt;/p&gt;
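
&lt;p&gt;One way to keep that 5-minute tier honest is a dedicated pytest marker, so the pipeline gates merges on the fast subset first. Illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pytest

@pytest.mark.smoke   # register the marker in pytest.ini to avoid warnings
def test_login_smoke(page):
    ...

# CI then runs the tiers separately:
#   pytest -m smoke            # blocks the PR; target: under 5 minutes
#   pytest -m "not smoke"      # full suite, can run after merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;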

&lt;h2&gt;The Framework Health Checklist&lt;/h2&gt;

&lt;p&gt;Before expanding any automation project, run this checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Can I run any single test in isolation?&lt;/li&gt;
&lt;li&gt;[ ] Does every test have an owner?&lt;/li&gt;
&lt;li&gt;[ ] Are failing tests blocking merges?&lt;/li&gt;
&lt;li&gt;[ ] Is the flaky test rate below 5%?&lt;/li&gt;
&lt;li&gt;[ ] Are locators centralized?&lt;/li&gt;
&lt;li&gt;[ ] Is test code reviewed like production code?&lt;/li&gt;
&lt;li&gt;[ ] Do tests run in CI on every PR?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If more than 2 of those are "no," you're building on sand.&lt;/p&gt;

&lt;h2&gt;The Architecture Decision That Matters Most&lt;/h2&gt;

&lt;p&gt;Here's the one I see teams skip most often:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define your test pyramid before writing test one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How many unit tests? How many integration tests? How many E2E tests? What's the expected execution time for each tier?&lt;/p&gt;
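
&lt;p&gt;Writing that contract down can be as lightweight as a checked-in budget that CI can assert against. The numbers here are illustrative, not universal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The pyramid as an explicit, reviewable artifact.
TEST_PYRAMID = {
    "unit":        {"share": 0.70, "max_runtime_ms": 50},
    "integration": {"share": 0.20, "max_runtime_ms": 2_000},
    "e2e":         {"share": 0.10, "max_runtime_ms": 30_000},
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;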

&lt;p&gt;Without this contract, teams default to writing whatever's easiest. Usually E2E tests. Slow, brittle, expensive E2E tests that replace the fast, reliable tests that should have been written first.&lt;/p&gt;

&lt;p&gt;The pyramid isn't a suggestion. It's load-bearing.&lt;/p&gt;

&lt;h2&gt;My Rule of Thumb&lt;/h2&gt;

&lt;p&gt;If your automation framework requires more than 20% of QA time to maintain, it's not working for you — you're working for it.&lt;/p&gt;

&lt;p&gt;Good automation accelerates. Bad automation accumulates.&lt;/p&gt;

&lt;p&gt;The difference is almost always in the first 30 days of decisions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://raju-shanigarapu.vercel.app/blog/why-automation-frameworks-fail" rel="noopener noreferrer"&gt;https://raju-shanigarapu.vercel.app/blog/why-automation-frameworks-fail&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>qa</category>
      <category>architecture</category>
      <category>testautomation</category>
    </item>
  </channel>
</rss>
