<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Paul Coles</title>
    <description>The latest articles on Forem by Paul Coles (@paul_coles_633f698b10fd6e).</description>
    <link>https://forem.com/paul_coles_633f698b10fd6e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1860071%2F61b8460b-57dc-4b65-9ec4-ba2330f63f3d.jpg</url>
      <title>Forem: Paul Coles</title>
      <link>https://forem.com/paul_coles_633f698b10fd6e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/paul_coles_633f698b10fd6e"/>
    <language>en</language>
    <item>
      <title>The Subtle Art of Herding Cats: How I Turned Chaos Into a Repeatable Test Process (Part 3 of 4)</title>
      <dc:creator>Paul Coles</dc:creator>
      <pubDate>Mon, 11 Aug 2025 07:43:37 +0000</pubDate>
      <link>https://forem.com/paul_coles_633f698b10fd6e/the-subtle-art-of-herding-cats-how-i-turned-chaos-into-a-repeatable-test-process-part-3-of-4-5c2b</link>
      <guid>https://forem.com/paul_coles_633f698b10fd6e/the-subtle-art-of-herding-cats-how-i-turned-chaos-into-a-repeatable-test-process-part-3-of-4-5c2b</guid>
      <description>&lt;h2&gt;
  
  
  Proof of Concept: Does This Actually Work?
&lt;/h2&gt;

&lt;p&gt;In Part 2, I found that gold standards are more effective than rulebooks, and that lazy loading helps prevent Context Rot. Part 3 shows how the approach works in practice, with illustrative (fictionalised) examples and an honest look at what happens when the cats meet reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Universal BDD Vision: Two Car Companies Principle
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Philosophy
&lt;/h3&gt;

&lt;p&gt;If two companies build the same thing, like BMW and Mercedes-Benz with their car configurators, you should be able to take a requirement from either and arrive at the same BDD scenario.&lt;/p&gt;

&lt;p&gt;The scenario shouldn't contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation details&lt;/strong&gt;: REST APIs, microservices, specific databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System names&lt;/strong&gt;: ConfiguratorService v2.1, PricingEngine, ValidationAPI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical artefacts&lt;/strong&gt;: JSON responses, event handlers, component states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, it should focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User intent&lt;/strong&gt;: What does the person want to accomplish?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User actions&lt;/strong&gt;: What do they actually do?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observable results&lt;/strong&gt;: What do they see happen?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code behind them is different, but the human need is identical.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: Implementation-Contaminated Scenarios
&lt;/h3&gt;

&lt;p&gt;Here's what BDD scenarios look like when they're contaminated with implementation details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# BMW's contaminated approach&lt;/span&gt;
&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; BMW iDrive ConfiguratorService Integration [SPEC-BMW-123]
&lt;span class="kn"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nf"&gt;Given &lt;/span&gt;the BMW ConnectedDrive API is initialized
  &lt;span class="nf"&gt;And &lt;/span&gt;the user authenticates via BMW ID OAuth
  &lt;span class="nf"&gt;And &lt;/span&gt;the PricingEngine microservice is available

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; M Sport Package selection triggers pricing recalculation
  &lt;span class="nf"&gt;Given &lt;/span&gt;I have loaded the 3-series configurator via iDrive interface
  &lt;span class="nf"&gt;When &lt;/span&gt;I POST to /api/bmw/packages/m-sport with authentication headers
  &lt;span class="nf"&gt;Then &lt;/span&gt;the PricingCalculatorService should return updated totals
  &lt;span class="nf"&gt;And &lt;/span&gt;the frontend should display BMW-specific pricing components
  &lt;span class="nf"&gt;And &lt;/span&gt;the ConfiguratorState should persist to BMW backend systems

&lt;span class="c"&gt;# Mercedes contaminated approach&lt;/span&gt;
&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Mercedes MBUX Configurator Integration [SPEC-MB-456]
&lt;span class="kn"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nf"&gt;Given &lt;/span&gt;the Mercedes me connect platform is active
  &lt;span class="nf"&gt;And &lt;/span&gt;MBUX infotainment system is responsive
  &lt;span class="nf"&gt;And &lt;/span&gt;the pricing validation service confirms availability

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; AMG package selection updates Mercedes pricing display
  &lt;span class="nf"&gt;Given &lt;/span&gt;I access the C-Class configurator through MBUX interface
  &lt;span class="nf"&gt;When &lt;/span&gt;the system processes AMG package selection via Mercedes API
  &lt;span class="nf"&gt;Then &lt;/span&gt;the integrated pricing module recalculates total cost
  &lt;span class="nf"&gt;And &lt;/span&gt;Mercedes-specific UI components reflect package changes
  &lt;span class="nf"&gt;And &lt;/span&gt;the selection persists in Mercedes customer profile system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: These scenarios test implementation details, not user behaviour. A tester needs different specialist knowledge to understand the BMW scenarios than the Mercedes ones, even though users are performing the same task in a slightly different context.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📌 &lt;strong&gt;Universal Behavior Insight&lt;/strong&gt;: When configuring a BMW 3-series or a Mercedes C-Class, users want to choose packages, check pricing updates, and identify conflicts. The implementation shows significant differences, but the user experience remains fundamentally the same.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Solution: Universal, Human-Focused Scenarios
&lt;/h3&gt;

&lt;p&gt;Here's what the same functionality looks like when focused on universal human behaviour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="c"&gt;# Works for BMW, Mercedes, Audi, or any car configurator&lt;/span&gt;
&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Vehicle Package Configuration [SPEC-123]

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Premium package selection updates pricing
  &lt;span class="nf"&gt;Given &lt;/span&gt;I am on the vehicle configuration page
  &lt;span class="nf"&gt;When &lt;/span&gt;I select the premium package
  &lt;span class="nf"&gt;Then &lt;/span&gt;I should see the updated total price
  &lt;span class="nf"&gt;And &lt;/span&gt;the premium package should be marked as selected

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Package conflict prevention
  &lt;span class="nf"&gt;Given &lt;/span&gt;I have selected a premium package
  &lt;span class="nf"&gt;When &lt;/span&gt;I attempt to select a conflicting economy package
  &lt;span class="nf"&gt;Then &lt;/span&gt;I should see a conflict warning message
  &lt;span class="nf"&gt;And &lt;/span&gt;the economy package should remain unselected
  &lt;span class="nf"&gt;And &lt;/span&gt;my original premium selection should be preserved

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Package removal affects pricing
  &lt;span class="nf"&gt;Given &lt;/span&gt;I have selected multiple packages
  &lt;span class="nf"&gt;When &lt;/span&gt;I remove the premium package
  &lt;span class="nf"&gt;Then &lt;/span&gt;the total price should decrease
  &lt;span class="nf"&gt;And &lt;/span&gt;the premium package should no longer appear selected
  &lt;span class="nf"&gt;And &lt;/span&gt;any dependent options should be automatically removed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Domain Configuration Separation
&lt;/h3&gt;

&lt;p&gt;Behind the scenes, each company uses their specific domain configuration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BMW Domain Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"navigation_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://bmw.com/configurator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"premium_package"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"M Sport Package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"economy_package"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Efficiency Package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"api_endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BMW ConnectedDrive API"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pricing_currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EUR"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mercedes Domain Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"navigation_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://mercedes-benz.com/configurator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"premium_package"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AMG Line Package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"economy_package"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Eco Package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"api_endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mercedes me connect API"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pricing_currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EUR"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same universal scenarios have different domain implementations. Testers understand them easily because they focus on human behaviour, not on technical details.&lt;/p&gt;
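&lt;p&gt;The separation can be sketched as simple placeholder substitution: one universal scenario template plus a per-company domain config yields the concrete, brand-specific steps. This is a minimal illustrative sketch in Python; &lt;code&gt;render_scenario&lt;/code&gt; and the template placeholders are my assumptions, not the actual tooling.&lt;/p&gt;

```python
# Minimal sketch: render a universal BDD scenario against a domain config.
# The config keys mirror the JSON examples above; render_scenario is a
# hypothetical helper, not part of any real BDD framework.

UNIVERSAL_SCENARIO = [
    "Given I am on the vehicle configuration page at {navigation_url}",
    "When I select the {premium_package}",
    "Then I should see the updated total price in {pricing_currency}",
]

BMW_CONFIG = {
    "navigation_url": "https://bmw.com/configurator",
    "premium_package": "M Sport Package",
    "pricing_currency": "EUR",
}

MERCEDES_CONFIG = {
    "navigation_url": "https://mercedes-benz.com/configurator",
    "premium_package": "AMG Line Package",
    "pricing_currency": "EUR",
}

def render_scenario(steps, config):
    """Substitute domain-specific values into universal step templates."""
    return [step.format(**config) for step in steps]

bmw_steps = render_scenario(UNIVERSAL_SCENARIO, BMW_CONFIG)
mercedes_steps = render_scenario(UNIVERSAL_SCENARIO, MERCEDES_CONFIG)
```

&lt;p&gt;Swapping the config is the only change needed to retarget the same universal scenario at another brand.&lt;/p&gt;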

&lt;h2&gt;
  
  
  Garbage In, Garbage Out
&lt;/h2&gt;

&lt;p&gt;Something that became clear was that some of our tickets were not very good. They're nominally written in Given-When-Then format, but buried in tables and bullet points, blending syntaxes. A single ticket might have ten compound results, some of which contradict each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of a "Bad Ticket":&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Given the user is on the config page 
When they add the M Sport Package 
Then
• The price updates
• The UI shows "M Sport"
• The PricingEngine service is called with SKU-123
• A confirmation modal appears (unless they are premium users)
• The total must not exceed credit limit in UserDB
• The economy package is disabled
• Loading spinner shows during calculation
• Error handling for network failures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ticket mixes UI behaviour, API calls, business rules, and error handling, all in one "Then" clause. Some requirements clash, and they test implementation details rather than user behaviour.&lt;/p&gt;

&lt;p&gt;From this, I made the tool check each ticket before letting the AI try to generate scenarios from it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read it&lt;/li&gt;
&lt;li&gt;Assess it&lt;/li&gt;
&lt;li&gt;Apply the rules&lt;/li&gt;
&lt;li&gt;Let the user decide what to do:

&lt;ul&gt;
&lt;li&gt;Accept the badness&lt;/li&gt;
&lt;li&gt;See what the scenarios would look like&lt;/li&gt;
&lt;li&gt;Rewrite them using the single responsibility principle&lt;/li&gt;
&lt;li&gt;Stop&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
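&lt;p&gt;That gate can be sketched as a small pre-check that runs before any AI generation. Everything here is illustrative: the rule heuristics and the decision options model the workflow described above, not the production tool.&lt;/p&gt;

```python
# Illustrative sketch of the ticket pre-check gate. The quality rules and
# the user decision options are assumptions, not the real implementation.
from enum import Enum

class Decision(Enum):
    ACCEPT = "accept the badness"
    PREVIEW = "see what the scenarios would look like"
    REWRITE = "rewrite using the single responsibility principle"
    STOP = "stop"

def assess_ticket(text):
    """Apply simple quality rules; return the problems found."""
    problems = []
    then_clause = text.split("Then", 1)[-1]
    if then_clause.count("\u2022") > 3:  # bullet soup in a single Then
        problems.append("compound Then clause with too many outcomes")
    if any(word in text for word in ("API", "SKU", "UserDB")):
        problems.append("implementation details leaking into the ticket")
    return problems

def process(ticket, choose):
    """Assess first; only clean tickets skip the human decision."""
    problems = assess_ticket(ticket)
    if not problems:
        return Decision.PREVIEW
    return choose(problems)  # the human decides what to do

bad_ticket = ("Given ... When ... Then\n" + "\u2022 outcome\n" * 4
              + "\u2022 calls the PricingEngine API with SKU-123")
```

&lt;p&gt;The key design choice is that &lt;code&gt;process&lt;/code&gt; never silently "fixes" a bad ticket; it only surfaces the problems and hands the choice back to a person.&lt;/p&gt;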

&lt;h2&gt;
  
  
  The Complete Workflow: From Jira Ticket to Executable Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Task 1: Context Extraction in Action
&lt;/h3&gt;

&lt;p&gt;When a Jira ticket arrives, the AI agent (loaded only with analysis rules and domain context) creates a structured conversation log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Requirements Analysis&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; REQ-001: User can select vehicle packages
&lt;span class="p"&gt;-&lt;/span&gt; REQ-002: Package selection updates total pricing
&lt;span class="p"&gt;-&lt;/span&gt; REQ-003: Conflicting packages show warning messages
&lt;span class="p"&gt;-&lt;/span&gt; REQ-004: Package removal updates pricing and dependencies

&lt;span class="gu"&gt;## Positive Test Scenarios Identified&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Premium package selection with pricing update
&lt;span class="p"&gt;-&lt;/span&gt; Multiple package selection and total calculation
&lt;span class="p"&gt;-&lt;/span&gt; Package upgrade scenarios

&lt;span class="gu"&gt;## Negative Test Scenarios Identified&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Conflicting package selection attempts
&lt;span class="p"&gt;-&lt;/span&gt; Invalid package combinations
&lt;span class="p"&gt;-&lt;/span&gt; Network error during selection

&lt;span class="gu"&gt;## Inferred Requirements (Agent Additions)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Loading states during price calculation
&lt;span class="p"&gt;-&lt;/span&gt; Confirmation for expensive package selections
&lt;span class="p"&gt;-&lt;/span&gt; Package dependency validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Task 2: BDD Generation with Pattern-Led Prompting
&lt;/h3&gt;

&lt;p&gt;The agent then loads BDD generation rules (and only those rules) plus the Task 1 output. It uses the gold standards approach to create clear scenarios that focus on business language and user actions.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;The AI already knows BDD structure.&lt;/strong&gt; I didn't need to teach "Given-When-Then." I just needed to steer it toward business-focused language and user-observable outcomes.&lt;/p&gt;
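&lt;p&gt;The lazy loading behind these tasks can be sketched as assembling each prompt from only the rule files that task needs, plus the previous task's output. The file names and the &lt;code&gt;build_prompt&lt;/code&gt; helper are assumptions for illustration, not my real file layout.&lt;/p&gt;

```python
# Sketch of per-task lazy loading: each task gets only its own rule files
# plus the previous task's output, never the whole rulebook.
# The file names are illustrative assumptions.
TASK_RULES = {
    "analysis": ["analysis_rules.md", "domain_context.md"],
    "bdd_generation": ["bdd_rules.md", "gold_standards.md"],
    "assessment": ["assessment_criteria.md"],
    "taf_generation": ["technical_patterns.md"],
}

def build_prompt(task, previous_output, load=lambda name: f"[contents of {name}]"):
    """Concatenate only this task's rules, then the prior task's output."""
    rules = "\n".join(load(name) for name in TASK_RULES[task])
    return rules + "\n\n## Previous task output\n" + previous_output

prompt = build_prompt("bdd_generation", "REQ-001: User can select vehicle packages")
```

&lt;p&gt;Because the analysis rules never enter the BDD-generation prompt, the context stays small and the cat has fewer distractions.&lt;/p&gt;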

&lt;h3&gt;
  
  
  Task 3a: Behavioural Assessment - The Testing Filter
&lt;/h3&gt;

&lt;p&gt;This is where the &lt;strong&gt;Context Smartness&lt;/strong&gt; approach really shines. The agent loads only assessment criteria and applies strict behavioural filters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Include for Automation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step user workflows&lt;/li&gt;
&lt;li&gt;Cross-component integration tests&lt;/li&gt;
&lt;li&gt;Business process validation&lt;/li&gt;
&lt;li&gt;State persistence across actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Exclude from Automation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single component behaviour (unit test territory)&lt;/li&gt;
&lt;li&gt;Subjective UX validation&lt;/li&gt;
&lt;li&gt;Accessibility testing (specialised tools needed)&lt;/li&gt;
&lt;li&gt;Performance without specific metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Golden Rule&lt;/strong&gt;: Only test what you can control. Avoid putting product prices or names in automation since they change. Check that prices show correctly and names appear consistently. Focus on the user experience, not the system's internals. This aligns with "intent-based testing" principles, but I see it as common sense.&lt;/p&gt;
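&lt;p&gt;The include/exclude filter can be sketched as a simple triage. The keyword heuristics below are deliberately simplified assumptions; the real assessment criteria are richer, but the three-way outcome (include, exclude, ask a human) is the point.&lt;/p&gt;

```python
# Simplified sketch of the behavioural assessment filter (Task 3a).
# The keyword heuristics are assumptions; real criteria are richer.
INCLUDE_SIGNALS = ("workflow", "integration", "business process", "persist")
EXCLUDE_SIGNALS = ("unit", "accessibility", "looks good", "performance")

def assess(scenario):
    text = scenario.lower()
    if any(s in text for s in EXCLUDE_SIGNALS):
        return "exclude"            # wrong layer, or needs specialist tools
    if any(s in text for s in INCLUDE_SIGNALS):
        return "include"            # multi-step, user-observable behaviour
    return "needs human review"     # the human is the final arbiter
```

&lt;p&gt;Anything the filter can't place confidently falls through to a person rather than being silently automated.&lt;/p&gt;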

&lt;h3&gt;
  
  
  Task 3b: TAF Generation - From Human to Machine
&lt;/h3&gt;

&lt;p&gt;The final task loads technical patterns and turns approved behaviour scenarios into working automation code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generated Test Automation Framework (TAF) Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Premium package selection updates pricing
  &lt;span class="nf"&gt;Given &lt;/span&gt;I navigate to the package configuration page
  &lt;span class="nf"&gt;When &lt;/span&gt;I select the premium package option
  &lt;span class="nf"&gt;Then &lt;/span&gt;the pricing display should show updated costs
  &lt;span class="nf"&gt;And &lt;/span&gt;the premium package should appear selected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generated Infrastructure Report:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Required Page Objects&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; PackageConfigurationPage
&lt;span class="p"&gt;  -&lt;/span&gt; premiumPackageOption (data-testid="premium-package")
&lt;span class="p"&gt;  -&lt;/span&gt; pricingDisplay (data-testid="pricing-total")
&lt;span class="p"&gt;  -&lt;/span&gt; packageSelectionIndicator (data-testid="selected-indicator")

&lt;span class="gu"&gt;## Missing Step Definitions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; "I select the premium package option"
&lt;span class="p"&gt;-&lt;/span&gt; "the premium package should appear selected"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The agent gets you 80-90% of the way there, then humans add the final details.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using State Diagrams So You and the AI Both Know What It Does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: AI Doesn't Know Your Application States
&lt;/h3&gt;

&lt;p&gt;The LLM only knows what it knows. If you ask it to write API requests from a spec snippet, it will try. But the result often seems fine, even though it's completely wrong. &lt;strong&gt;It doesn't know the states of your application.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Plain English State Description
&lt;/h3&gt;

&lt;p&gt;I began explaining application states and process flows in plain English. I often used Figma designs since they show the actual state changes clearly.&lt;/p&gt;

&lt;p&gt;Then I asked the agent to create Mermaid state diagrams from the scenarios:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    A[Configuration Page] --&amp;gt; B[Premium Selected]
    B --&amp;gt; C[Pricing Updated]
    B --&amp;gt; D[Try Economy Selection]
    D --&amp;gt; E[Conflict Warning Displayed]
    E --&amp;gt;|Return| B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;These diagrams showed missing state transitions&lt;/strong&gt; that weren't clear in Jira stories but were visible in Figma designs. The AI became better at identifying incomplete workflows and suggesting additional test scenarios.&lt;/p&gt;
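&lt;p&gt;The mapping from scenario to diagram can be sketched mechanically: each Given becomes a starting state, each When a transition label, each Then a resulting state. This is a toy sketch under that assumption; the real diagrams were generated by the agent from richer context, not by this function.&lt;/p&gt;

```python
# Toy sketch: derive a Mermaid edge from one Given/When/Then scenario.
# Given -> start state, When -> transition label, Then -> end state.
def scenario_to_mermaid(steps):
    given = next(s[len("Given "):] for s in steps if s.startswith("Given "))
    when = next(s[len("When "):] for s in steps if s.startswith("When "))
    then = next(s[len("Then "):] for s in steps if s.startswith("Then "))
    return f"graph TD\n    A[{given}] -->|{when}| B[{then}]"

diagram = scenario_to_mermaid([
    "Given Configuration Page",
    "When I select the premium package",
    "Then Pricing Updated",
])
```

&lt;p&gt;Even this naive version makes gaps visible: a scenario with no reachable Then state, or two scenarios whose states never connect, stands out immediately in the rendered diagram.&lt;/p&gt;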

&lt;h3&gt;
  
  
  Making the tasks self-documenting
&lt;/h3&gt;

&lt;p&gt;I was much like a university student writing their software plan after the code. I had this amazing system, but only I knew what it did, and I'd only remember this for a while. So, I asked the AI to produce flow charts using Mermaid again.&lt;/p&gt;

&lt;p&gt;This allows others to understand what it does without reading a pile of pseudo code, and it lets me follow the paths through when debugging problems. I quickly realised its value when I spotted that I kept loading the domain twice: I loaded it, made a decision, and then loaded it again for a more thorough check. 😒&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Assessment: What Actually Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  It's all new to me
&lt;/h3&gt;

&lt;p&gt;I jumped into this without much preparation; I'll discuss that more in Part 4. The first version was a mess, but it worked and proved the idea was possible. When we needed to roll it out, though, I realised I couldn't, because it was tied to my own area of work.&lt;/p&gt;

&lt;p&gt;I learned what worked as I went along.&lt;/p&gt;

&lt;p&gt;We launched what I'll call version 2. This version is domain-aware and loads context, but it had some early bugs. One major issue was that ticket assessment would always fail, which meant it wasn't loading the domain context. It was definitely an "it works on my machine" problem.&lt;/p&gt;

&lt;p&gt;I tried to strengthen the wording, but I know this only helps so much. The AI doesn't read like us; it sees everything as one long sentence. It also gives more weight to recent information.&lt;/p&gt;

&lt;p&gt;I needed to change how I executed tasks. In V3, everything is pseudo code. I'm now considering whether it could all be real code, with just a basic markdown file for the AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 1: The Ripple Effect (F1 Car Analogy)
&lt;/h3&gt;

&lt;p&gt;I'm a big fan of F1, more for the design and off-track engineering than the races, which can be pretty dull. What is clear from F1 is that changing the front wing affects every other area of the car.&lt;/p&gt;

&lt;p&gt;Making the domain loading perfect introduced a new issue: the AI treated it as an override for garbage tickets. It would still allow poor tickets through because it believed the extra domain detail improved them. It didn't; it just meant they were nonsense with the correct names.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 2: You Can't Automate Quality Control
&lt;/h3&gt;

&lt;p&gt;The domain stuff is seasoning on your pasta; it enhances the bland scenarios to reflect your specific business, but it cannot remedy fundamentally bad ingredients.&lt;/p&gt;

&lt;p&gt;So, the assessment had to change to pseudo code so that it understood the rules. I did try putting rule 5 before rule 4 (changed the numbers and everything), but the AI ignored it!&lt;/p&gt;

&lt;p&gt;Enforcing this stopped ticket processing dead. In a perfect world that wouldn't happen, because everyone would create perfect scenarios, so I had to make the check optional.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 3: The Human Must Be the Final Arbiter
&lt;/h3&gt;

&lt;p&gt;There's one thing the system has to adhere to: the human makes the decisions. It's why the test cases aren't limited; they're presented in priority order, but the system generates everything. When something goes wrong, people won't tell off the AI.&lt;/p&gt;

&lt;p&gt;So, that's where the options I mentioned earlier came from.&lt;/p&gt;

&lt;p&gt;After this, rather than mess around, we changed all the other tasks to pseudo code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Turning the AI on Itself: Unit Testing the Rules
&lt;/h3&gt;

&lt;p&gt;I tested all these changes manually, which was frustrating. Then, I got the AI to create some unit tests. It took good and bad examples, even tickets outside my domain. The AI generated expectations and made repeatable tests. Now, when the rules change, we can run these tests to check for any issues. We also recreate the mermaid diagrams, so we can see if the flow makes sense.&lt;/p&gt;
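&lt;p&gt;Those rule regression tests can be sketched as plain pinned examples: known good and bad tickets paired with expected verdicts, rerun whenever the rule files change. The &lt;code&gt;rule_verdict&lt;/code&gt; function below is a hypothetical stand-in for the real assessment task, shown only to make the shape of the tests concrete.&lt;/p&gt;

```python
# Sketch of "unit testing the rules": pin known good and bad example
# tickets to expected verdicts, rerun whenever the rules change.
# rule_verdict is a hypothetical stand-in for the real assessment task.
def rule_verdict(ticket):
    # Toy rule: more than three bulleted outcomes in one Then is a fail.
    outcomes = ticket.split("Then", 1)[-1].count("\u2022")
    return "fail" if outcomes > 3 else "pass"

CASES = [
    ("Given ... When ... Then the price updates", "pass"),
    ("Given ... When ... Then\n" + "\u2022 outcome\n" * 8, "fail"),
]

def run_rule_regression():
    for ticket, expected in CASES:
        result = rule_verdict(ticket)
        assert result == expected, f"{ticket!r}: got {result}"

run_rule_regression()
```

&lt;p&gt;The value is the fixture set, not the toy rule: once the examples are pinned, any rule rewrite that silently changes a verdict fails immediately.&lt;/p&gt;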

&lt;h3&gt;
  
  
  The Wins ✅
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Generated scenarios follow the same patterns every time. No more confusion about why one tester says "Given I navigate to" and another says "Given the user accesses."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;: Minutes instead of hours for complex features. What used to take an afternoon of careful scenario writing now happens in the time it takes to make coffee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creativity&lt;/strong&gt;: The agent spots edge cases that humans often overlook. It often detects state changes, error conditions, and user journey differences not included in the original requirements. When you focus on user behaviour instead of technical details, you naturally uncover more realistic test scenarios. This is what intent-based testing advocates have always claimed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: Creates the specifications that were missing. Generated BDD scenarios are often clearer than the original Jira tickets. They become the source of truth for what the feature does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Onboarding&lt;/strong&gt;: New team members understand features faster. Universal, behaviour-focused scenarios are self-documenting in ways that implementation-specific tests aren't.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Ongoing Challenges ⚠️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Domain Drift&lt;/strong&gt;: The AI loves to just go for it. Before you know it, there are domain-specific details creeping into what should be universal patterns. You have to watch what the AI does if you're changing rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge Case Handling&lt;/strong&gt;: Still needs human review for unusual scenarios. The AI excels at common patterns but struggles with genuinely unique business logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Maintenance&lt;/strong&gt;: Domain configurations need regular updates. As products evolve, the mappings between universal patterns and specific implementations require ongoing care.&lt;/p&gt;

&lt;h3&gt;
  
  
  What It Doesn't Fix ❌
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Process Problems&lt;/strong&gt;: Technical solutions don't fix workflow issues. If your requirements are unclear, arrive late, or change constantly, AI won't solve those basic communication problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human Communication&lt;/strong&gt;: Still need clear specs and acceptance criteria. The AI amplifies the quality of your inputs - it doesn't create clarity from chaos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain Expertise&lt;/strong&gt;: Agent can't replace understanding your business. It can apply patterns consistently, but someone still needs to know whether the business logic makes sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Implementation Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Create Your Gold Standards
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Pick your best existing BDD scenarios&lt;/li&gt;
&lt;li&gt;Clean them to perfection&lt;/li&gt;
&lt;li&gt;Document why they're good&lt;/li&gt;
&lt;li&gt;Use these as training examples&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Build Task-Based Rules
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Extract minimal rules from gold standards&lt;/li&gt;
&lt;li&gt;Create focused rule sets per task&lt;/li&gt;
&lt;li&gt;Test with lazy loading approach&lt;/li&gt;
&lt;li&gt;Measure consistency improvements&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Implement the Full Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task 1&lt;/strong&gt;: Context extraction and analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task 2&lt;/strong&gt;: Human-readable BDD generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task 3a&lt;/strong&gt;: Behavioural assessment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task 3b&lt;/strong&gt;: Automation code generation&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Measure and Refine
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Compare generated vs manual scenarios&lt;/li&gt;
&lt;li&gt;Track consistency metrics&lt;/li&gt;
&lt;li&gt;Identify remaining edge cases&lt;/li&gt;
&lt;li&gt;Refine rules based on actual usage&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Coming in Part 4
&lt;/h2&gt;

&lt;p&gt;The framework works, the cats stay in formation, and the scenarios are consistent. But here's what really got me thinking: &lt;strong&gt;I accidentally solved fundamental AI problems that every developer faces.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's annoying talking to an AI when it takes off half-cocked and does something you don't want. Turns out, I wasn't alone in this frustration.&lt;/p&gt;

&lt;p&gt;In Part 4, I'll reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Context Rot discovery&lt;/strong&gt;: How I identified performance degradation months before it was documented&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The market irony&lt;/strong&gt;: Why simple solutions to real problems get overlooked while flashy AI tools get all the funding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What I learned about AI reliability&lt;/strong&gt; that applies to any system trying to get consistent behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;From frustrated tester&lt;/strong&gt; to accidentally solving problems I didn't know had names&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real breakthrough wasn't just herding cats - it was discovering that my specific frustrations were actually universal AI challenges.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Paul Coles is a software tester who proved that universal BDD patterns work across domains when separated from implementation details. In Part 3, he demonstrates the complete framework in action with real examples and honest assessment of what works and what doesn't. His cat now stays mostly in the designated areas.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🐾 Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part 1: Why AI Starts Making Stuff Up&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;The cat has opinions — and your postcode formatting rules aren't one of them.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.tolink-to-part-1"&gt;Read it →&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part 2: Show, Don't Tell: Teaching AI with Better Examples&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Bribing the cat with gold standards and smaller piles of paper.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/paul_coles_633f698b10fd6e/the-subtle-art-of-herding-cats-show-dont-tell-teaching-ai-by-example-part-2-of-4-ing"&gt;Read it →&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part 3: How I Made My AI Stop Guessing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Teaching the cat one trick at a time with task-focused training.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;(you are here)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part 4: The More You Say, the Less It Learns&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;When you talk too much, the cat stops listening — and invents new requirements instead.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Coming soon)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@birteliu?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Birte Liu&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/man-feeding-pigeons-G2p3VWUYG8o?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Subtle Art of Herding Cats: Show, Don’t Tell: Teaching AI by Example (Part 2 of 4)</title>
      <dc:creator>Paul Coles</dc:creator>
      <pubDate>Wed, 06 Aug 2025 20:38:22 +0000</pubDate>
      <link>https://forem.com/paul_coles_633f698b10fd6e/the-subtle-art-of-herding-cats-show-dont-tell-teaching-ai-by-example-part-2-of-4-ing</link>
      <guid>https://forem.com/paul_coles_633f698b10fd6e/the-subtle-art-of-herding-cats-show-dont-tell-teaching-ai-by-example-part-2-of-4-ing</guid>
      <description>&lt;h2&gt;
  
  
  Quick Recap: The Problems We Discovered
&lt;/h2&gt;

&lt;p&gt;In Part 1, I learned the hard way that giving AI 47 rules is like trying to get a cat to do, well, much of anything. In Part 2, I'll show you the approach that finally made it behave: gold standards and lazy loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Show, Don't Tell
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Failed Approach: Death by Documentation
&lt;/h3&gt;

&lt;p&gt;My first idea was to write complete rules. Hundreds of detailed instructions covering every possible scenario, edge case, and formatting requirement. The AI nodded politely and ignored most of it.&lt;/p&gt;

&lt;p&gt;The conversations were exhausting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Me&lt;/strong&gt;: "Why didn't you follow the naming rules?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: "Which ones? There were several different patterns mentioned."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me&lt;/strong&gt;: "The ones in section 4.2.1 about specification references!"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: "I focused on the examples in section 6.3 instead."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sound familiar? I was trying to teach by explanation rather than showing examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Breakthrough Moment
&lt;/h3&gt;

&lt;p&gt;Instead of writing more rules, I tried something different. I created one perfect example of what I wanted:&lt;/p&gt;

&lt;p&gt;It didn't look exactly like this; this is just a tribute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gold Standard BDD Scenario:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Product Configuration [SPEC-123]

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Premium package selection shows correct pricing
  &lt;span class="nf"&gt;Given &lt;/span&gt;I am on the product configuration page
  &lt;span class="nf"&gt;When &lt;/span&gt;I select the premium package
  &lt;span class="nf"&gt;Then &lt;/span&gt;I should see the premium pricing displayed
  &lt;span class="nf"&gt;And &lt;/span&gt;the package should be marked as selected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I asked the agent: &lt;strong&gt;"Look at this gold standard. What rules do you need to reproduce this quality?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent looked through the rules and, instead of 300 lines of documentation, it distilled around ten key principles, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use clear, business-focused language&lt;/li&gt;
&lt;li&gt;Follow Given-When-Then structure
&lt;/li&gt;
&lt;li&gt;Include specification references&lt;/li&gt;
&lt;li&gt;Focus on user-observable outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;📌 &lt;strong&gt;Pattern-Led Prompting Discovery&lt;/strong&gt;: AI agents learn better from perfect examples than from detailed explanations. Show the destination, let them find the path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Domain Separation Breakthrough
&lt;/h3&gt;

&lt;p&gt;The gold standard looked simple, but there was hidden cleverness. Some elements needed to come from domain configuration, not be hardcoded in the pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Universal Pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Vehicle Package Configuration [SPEC-123]

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Premium package selection shows correct pricing
  &lt;span class="err"&gt;Given I am on the [DOMAIN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;configuration_page_url]&lt;/span&gt;
  &lt;span class="err"&gt;When I select the [DOMAIN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;premium_package_name]&lt;/span&gt;  
  &lt;span class="err"&gt;Then I should see the [DOMAIN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;pricing_display_format]&lt;/span&gt; &lt;span class="err"&gt;updated&lt;/span&gt;
  &lt;span class="err"&gt;And the package should show [DOMAIN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;selection_indicator_state]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Domain Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"configuration_page_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vehicle configuration page"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"premium_package_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"M Sport Package"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pricing_display_format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"total pricing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"selection_indicator_state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"as selected"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an online sock store, the same pattern works with different domain values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product categories: "athletic socks"&lt;/li&gt;
&lt;li&gt;Size options: "Size 8-10"
&lt;/li&gt;
&lt;li&gt;Error messages: "Sorry, out of stock in your size"&lt;/li&gt;
&lt;li&gt;UI elements: "add to basket button"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A BDD pattern is universal; your domain makes it real.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These domain mappings improve the scenarios' accuracy and relevance to your specific context, making them immediately useful rather than generic templates. Other domains will have completely different mappings, allowing the AI to recognise universal patterns while adapting to your specific terminology.&lt;/p&gt;

&lt;p&gt;Previously, this context was scattered throughout the rules - a BMW configurator conversation would gradually contaminate generic BDD patterns with "M Sport Package" references. In the new approach, domain specifics live in dedicated files.&lt;/p&gt;
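&lt;p&gt;The placeholder substitution can be sketched in a few lines of Python. This is a hypothetical illustration: the &lt;code&gt;[DOMAIN: key]&lt;/code&gt; syntax follows the pattern above, but the function name and config shape are my own, not the framework's actual implementation.&lt;/p&gt;

```python
import re

# A universal step and a domain config, as in the examples above.
universal_step = "When I select the [DOMAIN: premium_package_name]"

domain_config = {
    "premium_package_name": "M Sport Package",
}

def apply_domain(step, config):
    # Replace each [DOMAIN: key] placeholder with its configured value.
    return re.sub(
        r"\[DOMAIN:\s*(\w+)\]",
        lambda m: config[m.group(1)],
        step,
    )

print(apply_domain(universal_step, domain_config))
# "When I select the M Sport Package"
```

&lt;p&gt;Swap in the sock-store values and the identical pattern produces sock-store scenarios.&lt;/p&gt;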

&lt;h3&gt;
  
  
  "Show, Don't Tell" in Practice
&lt;/h3&gt;

&lt;p&gt;This discovery matched a principle I'd written in my framework: &lt;strong&gt;"Show, don't tell."&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Instead of explaining what makes good BDD, I showed the agent perfect examples and let it find the patterns. The AI understands what BDD is, I don't need to tell it that. What it does need to know is what &lt;em&gt;I&lt;/em&gt; want from it.&lt;/p&gt;

&lt;p&gt;But I learned there's a big difference between &lt;strong&gt;guidelines&lt;/strong&gt; (flexible suggestions) and &lt;strong&gt;rules&lt;/strong&gt; (mandatory requirements). For things that absolutely had to happen, I needed very clear language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lazy Loading Context: The Architecture That Changed Everything
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Context Explosion
&lt;/h3&gt;

&lt;p&gt;Using 25% of Amazon Q's context window was like trying to have a conversation at a concert. Too many distractions, too much noise, too many competing priorities.&lt;/p&gt;

&lt;p&gt;The AI was drowning in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;✓ Generic BDD patterns         ✓ Domain-specific mappings
✓ Assessment criteria          ✓ Implementation details  
✓ Quality gates               ✓ Error handling
✓ Naming conventions          ✓ Edge case handling
✓ Reporting structures        ✓ 38 more categories...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Solution: Task-Based Context Loading
&lt;/h3&gt;

&lt;p&gt;I realised the agent needed &lt;strong&gt;focused context per task&lt;/strong&gt;, not everything at once. Here's the architecture that worked:&lt;/p&gt;

&lt;h4&gt;
  
  
  Base Context (Always Available)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Task execution framework&lt;/li&gt;
&lt;li&gt;Conversation logging patterns&lt;/li&gt;
&lt;li&gt;State management rules&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Dynamic Context (Loaded Per Task)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Task 1: Context Extraction&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Analysis rules + Relevant Domain context only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Extract requirements from Jira
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: Structured conversation log&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocked&lt;/strong&gt;: BDD patterns, automation rules, technical details&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Task 2: BDD Generation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: BDD patterns + Task 1 output only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Create human-readable scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: Feature files for manual testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocked&lt;/strong&gt;: Automation assessment, technical implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Task 3a: Behavioural Assessment&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Assessment criteria + Task 2 output only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Determine automation suitability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: Automation assessment report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocked&lt;/strong&gt;: Code generation patterns, implementation details&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Task 3b: Test Automation Generation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Technical patterns + approved scenarios only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Create executable automation code
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: Test Automation-compatible feature files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocked&lt;/strong&gt;: Analysis rules, BDD guidelines&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context Smartness Principle&lt;/strong&gt;: Each task gets exactly the context it needs - no more, no less. No competing priorities, no overwhelming rule sets.&lt;/p&gt;
&lt;/blockquote&gt;
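&lt;p&gt;As a rough sketch, the loading rules above reduce to a small lookup table. The file names here are invented for illustration; the point is the shape, not the specifics.&lt;/p&gt;

```python
# Each task maps to only the context files it is allowed to see.
TASK_CONTEXTS = {
    "extract":  ["analysis_rules.md", "domain_config.json"],
    "bdd":      ["bdd_patterns.md", "task1_output.md"],
    "assess":   ["assessment_criteria.md", "task2_output.md"],
    "automate": ["technical_patterns.md", "approved_scenarios.md"],
}

# Base context is always available, regardless of task.
BASE_CONTEXT = ["task_framework.md", "logging_patterns.md", "state_rules.md"]

def context_for(task):
    # Each task sees the base context plus only its own files.
    # Everything else stays out of the prompt entirely.
    return BASE_CONTEXT + TASK_CONTEXTS[task]

print(context_for("bdd"))
```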

&lt;h3&gt;
  
  
  The Performance Impact
&lt;/h3&gt;

&lt;p&gt;The difference was dramatic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before&lt;/strong&gt;: 25% context usage, inconsistent results, mysterious failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After&lt;/strong&gt;: &amp;lt;5% context per task, reliable patterns, predictable behaviour&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI went from confused and unreliable to focused and consistent. Each task could concentrate on its specific job without distraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudocode Rules: When You Absolutely Must Be Obeyed
&lt;/h2&gt;

&lt;p&gt;If there is one thing you take from this, it's this: if you &lt;em&gt;really&lt;/em&gt; need the AI to follow rules, use pseudocode, not just words.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Guidelines vs Rules Problem
&lt;/h3&gt;

&lt;p&gt;Even with focused context, the AI would still treat critical requirements as optional suggestions. Natural language left too much room for creative interpretation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (Ignored):&lt;/strong&gt;&lt;br&gt;
"Please assess each scenario carefully, considering automation suitability and technical feasibility."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After (Followed Religiously):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FOR EACH scenario IN bdd_scenarios:
    IF scenario.type == "accessibility":
        EXCLUDE(scenario, reason="specialized_tools_required")
    ELIF scenario.complexity == "single_component": 
        EXCLUDE(scenario, reason="unit_test_territory")
    ELSE:
        ASSESS(scenario, gates=[0,1,2,3])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
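&lt;p&gt;For readers who want to try the idea, here is the same triage logic as runnable Python. The scenario fields and exclusion reasons come from the pseudocode above; the data shape is an assumption.&lt;/p&gt;

```python
def triage(scenarios):
    # Apply the pseudocode rules: exclude accessibility and
    # single-component scenarios, assess everything else.
    excluded, to_assess = [], []
    for s in scenarios:
        if s["type"] == "accessibility":
            excluded.append((s["name"], "specialized_tools_required"))
        elif s["complexity"] == "single_component":
            excluded.append((s["name"], "unit_test_territory"))
        else:
            to_assess.append(s["name"])  # goes through gates 0-3
    return excluded, to_assess

scenarios = [
    {"name": "screen reader flow", "type": "accessibility", "complexity": "multi"},
    {"name": "button colour", "type": "ui", "complexity": "single_component"},
    {"name": "premium pricing", "type": "ui", "complexity": "multi"},
]
print(triage(scenarios))
```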



&lt;h3&gt;
  
  
  Mandatory Language That Works
&lt;/h3&gt;

&lt;p&gt;For absolutely critical processes, I learned to use clear commanding language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;'MANDATORY'&lt;/strong&gt; - not "should" or "please"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;'ZERO TOLERANCE'&lt;/strong&gt; - not "try to avoid" &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;'AUTOMATIC EXCLUSIONS'&lt;/strong&gt; - not "generally not recommended"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI follows pseudocode and explicit commands while treating natural language as flexible guidance. But each instance of commanding language is backed up with pseudocode.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Conversation State Pattern
&lt;/h3&gt;

&lt;p&gt;The breakthrough was having the agent create a structured conversation log in Task 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Requirements Analysis&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; REQ-001: User can select premium packages
&lt;span class="p"&gt;-&lt;/span&gt; REQ-002: Pricing updates when packages change  
&lt;span class="p"&gt;-&lt;/span&gt; REQ-003: Conflicts prevent invalid combinations

&lt;span class="gu"&gt;## Positive Test Scenarios&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Premium package selection
&lt;span class="p"&gt;-&lt;/span&gt; Price calculation accuracy
&lt;span class="p"&gt;-&lt;/span&gt; Package combination validation

&lt;span class="gu"&gt;## Negative Test Scenarios  &lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Invalid package combinations
&lt;span class="p"&gt;-&lt;/span&gt; Error handling for unavailable options

&lt;span class="gu"&gt;## Inferred Requirements (Agent Additions)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Loading states during price calculation
&lt;span class="p"&gt;-&lt;/span&gt; Confirmation dialogs for expensive options
&lt;span class="p"&gt;-&lt;/span&gt; Network error handling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
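&lt;p&gt;A structured log like this is easy to slice programmatically, which is what lets later tasks load only the section they need. This parser is a hypothetical sketch of mine, not part of the framework.&lt;/p&gt;

```python
def parse_log(text):
    # Split a markdown conversation log into named sections,
    # keyed by "## " headings, with one list entry per "- " bullet.
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current:
            sections[current].append(line[2:].strip())
    return sections

log = """## Requirements Analysis
- REQ-001: User can select premium packages

## Inferred Requirements (Agent Additions)
- Network error handling
"""
print(parse_log(log)["Inferred Requirements (Agent Additions)"])
```

&lt;p&gt;Task 2 could then pull just the requirements and scenario lists, leaving the inferred additions for human review.&lt;/p&gt;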



&lt;h2&gt;
  
  
  The "Made Up" Requirements Solution: Embracing AI Creativity
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Unexpected Discovery
&lt;/h3&gt;

&lt;p&gt;Here's something that surprised me: the agent kept adding requirements that weren't in the original spec. My first instinct was to stop this behaviour.&lt;/p&gt;

&lt;p&gt;Instead, I asked it to &lt;strong&gt;share invented requirements in a separate section.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Value of AI Inference
&lt;/h3&gt;

&lt;p&gt;Sometimes these "made up" requirements were brilliant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What happens during price calculation loading?"&lt;/li&gt;
&lt;li&gt;"Should there be confirmation for expensive options?"
&lt;/li&gt;
&lt;li&gt;"How do we handle network errors?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent was thinking like a tester, finding gaps in specifications. I learned to embrace this creativity rather than suppress it - but to keep it clearly labelled so humans could check the suggestions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't Limit the AI (But Do Limit Its Authority)
&lt;/h2&gt;

&lt;p&gt;Something I decided early on: even though the AI can update Jira, I don't want it to. That's taking away too much control and will make people lazy. I need humans to decide whether other humans should do a test or not.&lt;/p&gt;

&lt;p&gt;Put it this way: if you make the AI limit tests to only those that are "important," and something goes wrong, it won't be the AI that gets told off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI's role&lt;/strong&gt;: Prioritise and recommend&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Your role&lt;/strong&gt;: Make the final calls&lt;/p&gt;

&lt;p&gt;Yes, the AI puts a priority on its creations. They're in an order, but &lt;em&gt;you&lt;/em&gt; decide what actually gets done. The AI can be creative with requirements, suggest test scenarios, and even rate automation suitability, but humans retain control over the decisions that matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI won't be the one getting told off&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Wins You Can Implement Today
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Create Your Gold Standard
&lt;/h3&gt;

&lt;p&gt;Find your best existing BDD scenario and clean it to perfection. This becomes your teaching example.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Extract Minimal Rules
&lt;/h3&gt;

&lt;p&gt;Ask your AI: "What rules do you need to reproduce this quality?" You'll get 5-10 essential principles instead of 300 lines of documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Separate Domain from Pattern
&lt;/h3&gt;

&lt;p&gt;Identify what's universal (user actions, observable results) vs domain-specific (product names, URLs, error messages). Put domain details in separate configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Use Pseudocode for Critical Logic
&lt;/h3&gt;

&lt;p&gt;Replace "Please assess carefully" with explicit IF/THEN logic for anything that must happen without exception. You don't have to write the code, just write bullet points and ask it to make the code&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Coming in Part 3
&lt;/h2&gt;

&lt;p&gt;These solutions sound good in theory, but do they actually work in practice? In Part 3, I'll show you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real before/after examples&lt;/strong&gt;: Contaminated scenarios transformed into universal patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The framework in action&lt;/strong&gt;: Complete workflow from Jira ticket to executable tests
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honest assessment&lt;/strong&gt;: What actually works, ongoing challenges, and what it doesn't fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What Not to Automate: Smarter Test Filtering&lt;/strong&gt;: How to decide what should be automated vs tested manually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cats are starting to line up, but the real test is whether they stay in formation when facing real-world complexity.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Paul Coles is a software tester who discovered that AI agents respond better to examples than explanations. In Part 2, he reveals the specific techniques that transformed chaotic AI behavior into reliable, consistent output. His actual cat learned to use the litter tray but still ignores most other commands.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🐾 Series Navigation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part 1: Why AI Starts Making Stuff Up&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;The cat has opinions — and your postcode formatting rules aren't one of them.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/paul_coles_633f698b10fd6e/the-subtle-art-of-herding-cats-why-ai-agents-ignore-your-rules-part-1-of-4-5fhd"&gt;Read it →&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part 2: Show, Don’t Tell: Teaching AI with Better Examples&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Bribing the cat with gold standards and smaller piles of paper.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;← You are here&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part 3: How I Made My AI Stop Guessing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Teaching the cat one trick at a time with task-focused training.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Coming soon)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part 4: The More You Say, the Less It Learns&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;When you talk too much, the cat stops listening — and invents new requirements instead.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;(Coming soon)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@jontyson?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Jon Tyson&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/teal-and-white-graffiti-wall-QgoNPoH1v4c?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Subtle Art of Herding Cats: Why AI Agents Ignore Your Rules (Part 1 of 4)</title>
      <dc:creator>Paul Coles</dc:creator>
      <pubDate>Mon, 28 Jul 2025 14:47:12 +0000</pubDate>
      <link>https://forem.com/paul_coles_633f698b10fd6e/the-subtle-art-of-herding-cats-why-ai-agents-ignore-your-rules-part-1-of-4-5fhd</link>
      <guid>https://forem.com/paul_coles_633f698b10fd6e/the-subtle-art-of-herding-cats-why-ai-agents-ignore-your-rules-part-1-of-4-5fhd</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR My AI Training Hurdles
&lt;/h2&gt;

&lt;p&gt;I spent months training an AI to create BDD tests (and more). I discovered that AIs are like keen cats - they forget instructions when given too many commands. This is Part 1 of my journey from chaos to &lt;strong&gt;Context Smartness&lt;/strong&gt;. Parts 2-4 cover the solutions, framework, and market implications.&lt;/p&gt;

&lt;h2&gt;
  
  
Who This Is For: Test Automation Engineers, Test Leads, and AI Workflow Engineers
&lt;/h2&gt;

&lt;p&gt;Different readers will benefit in different ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Practitioners&lt;/strong&gt; will learn specific patterns for training AI systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leads&lt;/strong&gt; will understand why AI initiatives often fail and what makes them successful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers&lt;/strong&gt; will see systematic approaches to AI reliability and context management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hopes vs. The Reality: When Smart Cats Act Dumb
&lt;/h2&gt;

&lt;p&gt;I started this journey with a simple dream: get AI to read specs for me. I hadn't worked with multi-page specs for years. How hard could it be? The agent would understand context, follow rules, and produce perfect scenarios every time.&lt;/p&gt;

&lt;p&gt;Reality check: AI agents are like cats. They're very clever, but they have their own views on which rules count. They will ignore you when it suits them.&lt;/p&gt;

&lt;h2&gt;
  
  
  From First Attempts to Hard-Earned Lessons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  It's AI right? It's clever
&lt;/h3&gt;

&lt;p&gt;I began treating the AI like an equal - a brilliant colleague who needed proper instruction. I worked with it. I made clear rules, shared important context, and expected steady results.&lt;/p&gt;

&lt;p&gt;The conversations went like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Me&lt;/strong&gt;: "Here are 47 detailed rules for writing BDD scenarios"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: "Got it! I understand perfectly"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: &lt;em&gt;Proceeds to lowercase postcodes for mysterious reasons&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me&lt;/strong&gt;: "Why did you do that? There's nothing in the domain file about lowercase postcodes"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: "There are too many rules. I can't follow everything."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was like chatting with a polite cat. It nods along, but then it still knocks your coffee mug off the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cat Rule Discovered&lt;/strong&gt;: The Cat Rule isn't about counting rules literally; it's about competing instructions. You may have 100 formatting guidelines, but the AI only needs to make about 10 types of decisions at once.&lt;/p&gt;

&lt;p&gt;I had thousands of BDD rules. They asked the AI to manage formatting, domain knowledge, quality checks, and technical implementation all at the same time. It's no surprise it made strange choices!&lt;/p&gt;

&lt;h2&gt;
  
  
  🐱 The Cat Rule: Maximum 10 Instructions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ What I Did Wrong (47 Rules)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Follow BDD patterns&lt;/li&gt;
&lt;li&gt;Use domain mappings
&lt;/li&gt;
&lt;li&gt;Apply quality gates&lt;/li&gt;
&lt;li&gt;Handle errors properly
...and 43 more rules! 🤯&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: AI ignored most rules, performed poorly&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ What Actually Works (8 Rules)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Use clear business language&lt;/li&gt;
&lt;li&gt;Follow Given-When-Then&lt;/li&gt;
&lt;li&gt;Include spec references&lt;/li&gt;
&lt;li&gt;Focus on user outcomes&lt;/li&gt;
&lt;li&gt;Apply MANDATORY rules&lt;/li&gt;
&lt;li&gt;Use domain config&lt;/li&gt;
&lt;li&gt;Check quality gates
&lt;/li&gt;
&lt;li&gt;Generate clean scenarios&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Consistent, reliable AI behavior! 🌈🦄&lt;/p&gt;

&lt;h3&gt;
  
  
  The More You Say, the Less It Hears
&lt;/h3&gt;

&lt;p&gt;As I refined the rules over time, they grew and grew. Before long, I was using 25% of Amazon Q's available context window. The agent was drowning in information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Rules included:
✓ Generic BDD patterns         ✓ Domain-specific mappings
✓ Assessment criteria          ✓ Implementation details
✓ Quality gates               ✓ Error handling
✓ Naming conventions          ✓ Edge case handling
✓ Reporting structures        ✓ ... and 38 more categories
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem became clear: &lt;strong&gt;The agent forgets things when the context is too large.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AIs have extensive knowledge, but their application lacks consistency when overwhelmed. If I'm asking the AI about task 4, it still has to sort through the stuff for tasks 1 to 3 and all the other conventions. It's like asking a cat to follow 47 commands at once. They ignore most, and the noise makes them mess up the few they do hear.&lt;/p&gt;

&lt;p&gt;Context rot identified: Adding more input tokens can hurt AI performance. I noticed this months before I found out it had a name. For more details, visit &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;https://research.trychroma.com/context-rot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In my case, if you were to ask the AI why it did something, it could find the rule it should have applied. It's like the needle in a haystack test. But what it couldn't do was apply this rule at the right time with all the other stuff it needed to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Your AI Starts Making Stuff Up
&lt;/h3&gt;

&lt;p&gt;The breaking point came when I attempted to ensure the agent used domain context consistently. It kept making odd choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lowercase postcodes (nowhere in the rules)&lt;/li&gt;
&lt;li&gt;Technical error messages in human-readable scenarios&lt;/li&gt;
&lt;li&gt;Treating mandatory rules as optional guidelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the real problem wasn't just using the wrong things at the wrong time. I had unknowingly created a perfect storm of conflicting information and massive context load that was literally making the AI dumber.&lt;/p&gt;

&lt;p&gt;Training the AI via conversations about 'why' things happened had contaminated what should be "generic" BDD patterns with car configurator specifics, package bundle terminology, and React SPA assumptions. But worse, the huge amount of context was hurting performance in ways I didn't understand then.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tool was becoming domain-specific instead of universally applicable, AND performing worse as context expanded.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lightbulb Moment: It's More of a Guideline
&lt;/h2&gt;

&lt;p&gt;Then it struck me - the lightbulb turned on. &lt;strong&gt;I was treating the AI as an equal, but it isn't.&lt;/strong&gt; It's not as smart as I thought in the way I thought (my internal monologue about it was less charitable).&lt;/p&gt;

&lt;p&gt;The agent needed different training than a human colleague would. Even with focused context loading, the AI perceived key requirements as optional unless the language was clearly commanding; mandatory rules were treated as mere guidelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern-Led Prompting Principle&lt;/strong&gt;: AI agents respond better to examples than to explanations, and to tables and pseudocode better than to natural language for complex logic.&lt;/p&gt;

&lt;p&gt;This discovery would lead to what I now call &lt;strong&gt;Context Smartness&lt;/strong&gt; - providing exactly the right information, at the right time, in the right amount.&lt;/p&gt;

&lt;p&gt;This Pattern-Led Prompting Principle underpins the 'Show, Don't Tell' method. I will explain this in Part 2, with examples taking the place of lengthy documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Discoveries from Part 1
&lt;/h2&gt;

&lt;p&gt;Through months of frustrating talks with my AI agent, I found several basic principles:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cat Rule
&lt;/h3&gt;

&lt;p&gt;Never give AI more than 10 competing instructions. Beyond this point, performance drops as the agent struggles to work out what matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Rot
&lt;/h3&gt;

&lt;p&gt;Adding more input context actually makes AIs perform worse - not better. I was using 25% of Amazon Q's context window and wondering why my "smart" agent was getting dumber. &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;https://research.trychroma.com/context-rot&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain Contamination
&lt;/h3&gt;

&lt;p&gt;Generic rules slowly pick up domain-specific details through repeated conversations. This makes tools less reusable and adds to context bloat. You have to watch how the rules are formed and make sure they're generic.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Guidelines vs Rules Problem
&lt;/h3&gt;

&lt;p&gt;AI agents treat everything as flexible unless you use very clear commanding language. "Please assess carefully" becomes optional; "MANDATORY: ASSESS(scenario, gates=[0,1,2,3])" gets followed.&lt;/p&gt;
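&lt;p&gt;As a concrete illustration (the wording here is my own, not a fixed syntax), the difference in a rule file might look like this:&lt;/p&gt;

```text
# Guideline phrasing (often treated as optional):
Please assess each scenario carefully before writing steps.

# Rule phrasing (reliably followed):
MANDATORY: ASSESS(scenario, gates=[0,1,2,3])
ON FAILURE: STOP and report the failing gate.
```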

&lt;h2&gt;
  
  
  What's Coming Next
&lt;/h2&gt;

&lt;p&gt;In the remaining parts of this series, I'll show you exactly how I solved these problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: "Show, Don’t Tell: Teaching AI by Example" - The breakthrough solutions that actually worked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: "How I Turned Chaos Into a Repeatable Test Process" - Real examples with BMW vs Mercedes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: "Context Rot and the Billion Dollar Opportunity" - Why these solutions matter for the AI industry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each problem in Part 1 has a matching solution. The cat can be herded, but not the way you think.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation for What's Next
&lt;/h2&gt;

&lt;p&gt;The problems I discovered - Context Rot, domain contamination, the guidelines vs rules confusion - aren't just BDD testing issues. They're fundamental AI reliability challenges that affect any system trying to get consistent behaviour from large language models.&lt;/p&gt;

&lt;p&gt;In Part 2, I'll show you the &lt;strong&gt;Context Smartness&lt;/strong&gt; approach that solved all of these problems: focused examples instead of comprehensive rules, task-based lazy loading, and the magic of "show, don't tell."&lt;/p&gt;

&lt;p&gt;The cats stayed in formation, but it took understanding their psychology first.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Paul Coles is a software tester who accidentally discovered several AI reliability patterns while trying to automate BDD scenario generation. In this 4-part series, he shares the systematic approach that transformed unpredictable AI behaviour into reliable, consistent output. His actual cat still ignores most commands.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>AI-Assisted Testing: A Survival Guide to Implementing MCP with Atlassian Tools</title>
      <dc:creator>Paul Coles</dc:creator>
      <pubDate>Tue, 03 Jun 2025 19:06:40 +0000</pubDate>
      <link>https://forem.com/paul_coles_633f698b10fd6e/ai-assisted-testing-a-survival-guide-to-implementing-mcp-with-atlassian-tools-2gnm</link>
      <guid>https://forem.com/paul_coles_633f698b10fd6e/ai-assisted-testing-a-survival-guide-to-implementing-mcp-with-atlassian-tools-2gnm</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Still battling 'Wagile' processes, a skewed developer-to-tester ratio, and specs that could double as doorstops? You're not alone. This technical deep-dive builds on our previous discussion to show you exactly how to implement AI-assisted testing using Model Context Protocol (MCP) within your existing Atlassian ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Achieve
&lt;/h3&gt;

&lt;p&gt;By the end of this guide, you'll have a working AI assistant that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parse Confluence specifications to identify test scenarios and acceptance criteria&lt;/li&gt;
&lt;li&gt;Read your Jira tickets and extract acceptance criteria&lt;/li&gt;
&lt;li&gt;Generate comprehensive test cases in your team's format&lt;/li&gt;
&lt;li&gt;Identify edge cases you might have missed&lt;/li&gt;
&lt;li&gt;Create test documentation that actually gets used&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why MCP Over Generic AI Tools
&lt;/h3&gt;

&lt;p&gt;You can paste specifications into ChatGPT or use Copilot, but MCP has key benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct Integration&lt;/strong&gt;: No copy-pasting between tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Awareness&lt;/strong&gt;: Understands your specific project structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Hallucination&lt;/strong&gt;: Works with actual data and rules, not assumptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern Learning&lt;/strong&gt;: Adapts to your team's test framework&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What This Guide Covers
&lt;/h3&gt;

&lt;p&gt;This isn't about fixing broken processes—that's a cultural challenge. This is about practical implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up MCP to connect with your Atlassian stack&lt;/li&gt;
&lt;li&gt;Configuring secure connections through corporate networks&lt;/li&gt;
&lt;li&gt;Creating your first AI-generated test suite&lt;/li&gt;
&lt;li&gt;Measuring the impact on your team's velocity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. What You'll Need Before Starting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MCP Atlassian
&lt;/h3&gt;

&lt;p&gt;It's worth reading about the MCP server that will be used: &lt;a href="https://github.com/sooperset/mcp-atlassian" rel="noopener noreferrer"&gt;https://github.com/sooperset/mcp-atlassian&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's distributed as a Docker image, which means the setup is generally very simple. The only wrinkles may be corporate certificates, which we cover later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker
&lt;/h3&gt;

&lt;p&gt;First up, you'll need Docker installed on your machine. Whether you're using Docker Desktop or Docker Engine doesn't matter—both work fine. If you don't have Docker, go to &lt;a href="https://docs.docker.com/get-docker" rel="noopener noreferrer"&gt;docs.docker.com/get-docker&lt;/a&gt;. Then, follow the installation guide for your operating system.&lt;/p&gt;

&lt;p&gt;Why use Docker? It keeps the MCP server running the same way on any setup. Plus, it manages all your dependencies without cluttering your system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Q CLI
&lt;/h3&gt;

&lt;p&gt;Next, you'll need the Amazon Q Command Line Interface. Installation instructions are here: [&lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing.html&lt;/a&gt;]. Once it's installed correctly, &lt;code&gt;q --version&lt;/code&gt; should run without errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Atlassian Access
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For Jira:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Jira account with API access (most regular accounts have this)&lt;/li&gt;
&lt;li&gt;A Personal Access Token (PAT)—not your password!&lt;/li&gt;
&lt;li&gt;The URL of your Jira instance (e.g., &lt;a href="https://yourcompany.atlassian.net" rel="noopener noreferrer"&gt;https://yourcompany.atlassian.net&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Confluence:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Like Jira: account, PAT, and URL&lt;/li&gt;
&lt;li&gt;Make sure you can access the spaces containing your specifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't know how to create a PAT? In Atlassian, go to your account settings, find "Security," and look for "API tokens." Create one and save it somewhere secure—you can't view it again!&lt;/p&gt;

&lt;h3&gt;
  
  
  Corporate Certificates
&lt;/h3&gt;

&lt;p&gt;Here's the fun part—if you're behind a corporate firewall, you'll need your company's SSL certificates. They allow the Docker container to trust your corporate proxy/firewall for outbound connections. You'll usually have set them up when you started, probably in a certs folder in your home directory. They typically come as &lt;code&gt;.crt&lt;/code&gt;, &lt;code&gt;.pem&lt;/code&gt; or &lt;code&gt;.cer&lt;/code&gt; files.&lt;/p&gt;
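&lt;p&gt;Not sure whether a certificate file is current? You can inspect it with OpenSSL before handing it to Docker. The snippet below generates a throwaway self-signed certificate purely so the inspection command has something to run against; point the second command at your real &lt;code&gt;.crt&lt;/code&gt; or &lt;code&gt;.pem&lt;/code&gt; instead.&lt;/p&gt;

```shell
# Create a throwaway self-signed cert so this example is self-contained
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 1 -subj "/CN=corp-proxy-demo" 2>/dev/null

# Inspect subject and expiry -- run this against your corporate certificate
openssl x509 -in /tmp/demo.crt -noout -subject -enddate
```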

&lt;h2&gt;
  
  
  3. Setting Up Your Environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Creating Your Configuration File
&lt;/h3&gt;

&lt;p&gt;In the project directory, you'll find a file called &lt;code&gt;.env-example&lt;/code&gt;. This is your template. Copy it to create your actual configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env-example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding Your Credentials
&lt;/h3&gt;

&lt;p&gt;Open the &lt;code&gt;.env&lt;/code&gt; file in your IDE. You'll see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JIRA_URL=https://yourcompany.atlassian.net
JIRA_PAT=your-jira-personal-access-token-here
CONFLUENCE_URL=https://yourcompany.atlassian.net/wiki
CONFLUENCE_PAT=your-confluence-personal-access-token-here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the placeholder values with your actual credentials. A few tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't include quotes around the values&lt;/li&gt;
&lt;li&gt;Make sure there are no trailing spaces&lt;/li&gt;
&lt;li&gt;Double-check your URLs—they should be the base URLs without any paths&lt;/li&gt;
&lt;/ul&gt;
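&lt;p&gt;A quick, rough check for the two most common &lt;code&gt;.env&lt;/code&gt; mistakes (quoted values and trailing whitespace); the demo file and its contents are purely illustrative:&lt;/p&gt;

```shell
# Build a demo .env containing both mistakes: a quoted value and a trailing space
printf 'JIRA_URL=https://yourcompany.atlassian.net\nJIRA_PAT="abc123"\nCONFLUENCE_URL=https://yourcompany.atlassian.net/wiki \n' > /tmp/demo.env

# Flag quoted values and trailing whitespace; a clean file prints nothing
grep -nE '="|[[:space:]]$' /tmp/demo.env
```

&lt;p&gt;Here lines 2 and 3 are flagged; line 1 passes.&lt;/p&gt;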

&lt;h3&gt;
  
  
  Understanding Configuration Options
&lt;/h3&gt;

&lt;p&gt;You'll also see some SSL-related settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSL_VERIFY=true
CERT_PATH=/path/to/certificates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're behind a corporate firewall, you may need to set &lt;code&gt;SSL_VERIFY=false&lt;/code&gt; for testing. But for production use, keep it true and provide the correct certificate path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Warning&lt;/em&gt;&lt;/strong&gt;: Setting &lt;code&gt;SSL_VERIFY=false&lt;/code&gt; significantly reduces security and should never be used in production or with sensitive data. Use it only for temporary testing to isolate certificate issues. Always aim to provide the correct certificate path and keep &lt;code&gt;SSL_VERIFY=true&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Building and Running with Docker
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Preparing Certificates
&lt;/h3&gt;

&lt;p&gt;If you use corporate certificates such as Zscaler's, you'll need to build the Docker image yourself. Place the certificates in your project root so Docker can access them. If you already have certificates elsewhere (like ~/certs), you can modify the Dockerfile to copy from that location instead.&lt;/p&gt;

&lt;p&gt;The Docker build process will automatically add these certificates. This lets the container make secure connections through your corporate network.&lt;/p&gt;
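&lt;p&gt;A minimal sketch of what the certificate step can look like in a Dockerfile. The base image tag follows the project's published image; the &lt;code&gt;certs/&lt;/code&gt; path and the Debian-style &lt;code&gt;update-ca-certificates&lt;/code&gt; call are assumptions you should adapt to the actual Dockerfile in the repo:&lt;/p&gt;

```dockerfile
# Start from the published MCP Atlassian image (tag assumed)
FROM ghcr.io/sooperset/mcp-atlassian:latest

# Copy corporate certificates from the project root into the system trust store
COPY certs/*.crt /usr/local/share/ca-certificates/
USER root
RUN update-ca-certificates
```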

&lt;h3&gt;
  
  
  Building the Docker Image
&lt;/h3&gt;

&lt;p&gt;With your certificates in place, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; mcp-atlassian-with-zscaler &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command does several things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a Python environment&lt;/li&gt;
&lt;li&gt;Installs all necessary dependencies&lt;/li&gt;
&lt;li&gt;Configures your certificates&lt;/li&gt;
&lt;li&gt;Sets up the MCP server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first build might take a few minutes.&lt;/p&gt;

&lt;p&gt;If the build fails, it's usually because of missing certificates or network issues. Check the error messages—they're surprisingly helpful.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Connecting to Amazon Q
&lt;/h2&gt;

&lt;p&gt;With your Docker container built, it's time to connect everything together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting Your First Chat Session
&lt;/h3&gt;

&lt;p&gt;Open your terminal and run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;q chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run this command, Amazon Q automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads your MCP configuration from ./.amazonq/mcp.json.&lt;/li&gt;
&lt;li&gt;Establishes connections to your configured tools, including the MCP server.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example mcp.json file
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcp_atlassian"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"docker"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-i"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--rm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--env-file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".env"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp-atlassian-with-zscaler"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"disabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"The MCP server runs inside Docker and communicates via stdin/stdout, so no port configuration is needed unless you're customising the setup."&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the Initialisation
&lt;/h3&gt;

&lt;p&gt;You should see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✓ mcp_atlassian loaded in 2.40 s
✓ 2 of 2 mcp servers initialized.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each checkmark means a successful connection. If you see any errors here, it usually means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your credentials are incorrect&lt;/li&gt;
&lt;li&gt;The URLs in your &lt;code&gt;.env&lt;/code&gt; file are wrong&lt;/li&gt;
&lt;li&gt;Network/firewall issues&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  If there are errors
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ask Amazon Q about it&lt;/li&gt;
&lt;li&gt;Open Docker Desktop, find your container, select it and click &lt;code&gt;logs&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Verifying Your Connections
&lt;/h3&gt;

&lt;p&gt;Once connected, try a simple query to verify everything works:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What are my open Jira tickets?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If Q responds with your actual tickets, congratulations! You're connected. If not, check your Jira PAT and URL in the &lt;code&gt;.env&lt;/code&gt; file.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At this point the MCP server will probably ask for permission to run a &lt;code&gt;tool&lt;/code&gt; (a function the server exposes for the AI to call).&lt;/li&gt;
&lt;li&gt;When asked, press &lt;code&gt;t&lt;/code&gt; to trust the tool for this session, or &lt;code&gt;y&lt;/code&gt; to allow just this request.&lt;/li&gt;
&lt;li&gt;You can configure tool permissions in Amazon Q from the chat [&lt;a href="https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-chat-tools.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-chat-tools.html&lt;/a&gt;].&lt;/li&gt;
&lt;/ul&gt;
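&lt;p&gt;From inside the chat you can review and adjust these permissions with the &lt;code&gt;/tools&lt;/code&gt; command (the tool name below is illustrative):&lt;/p&gt;

```text
/tools                        # list available tools and their permission status
/tools trust jira_search      # trust a single tool for the rest of the session
```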

&lt;h2&gt;
  
  
  6. Using AI for Test Generation
&lt;/h2&gt;

&lt;p&gt;Now for the fun part—actually using this setup to generate tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Queries That Work Well
&lt;/h3&gt;

&lt;p&gt;Here are some queries that deliver immediate value:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Jira queries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What are my open assigned tickets, list them in priority order"&lt;/li&gt;
&lt;li&gt;"Show me all tickets in the current sprint for project XYZ"&lt;/li&gt;
&lt;li&gt;"What tickets are blocked?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test generation from tickets:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What tests should I do for &amp;lt;ticket ID&amp;gt;"&lt;/li&gt;
&lt;li&gt;"Generate test cases for the acceptance criteria in PROJ-12345"&lt;/li&gt;
&lt;li&gt;"What edge cases should I consider for ticket &amp;lt;ticket ID&amp;gt;?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Referencing Confluence Specifications
&lt;/h3&gt;

&lt;p&gt;When you have detailed specs in Confluence, you can get specific:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the spec &amp;lt;confluence page id&amp;gt; in section 1.8.5, what test cases and edge cases should I be looking for?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch the specific Confluence page&lt;/li&gt;
&lt;li&gt;Find the exact section you mentioned&lt;/li&gt;
&lt;li&gt;Analyse the requirements&lt;/li&gt;
&lt;li&gt;Generate relevant test cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Combining Jira Tickets with Specifications
&lt;/h3&gt;

&lt;p&gt;This is where the real power shows up. You can cross-reference:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For &amp;lt;ticket ID&amp;gt; and the spec &amp;lt;confluence page id&amp;gt; in section 1.8.5, are there any requirements missing or unclear? What test cases should we write for this?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull the Jira ticket details&lt;/li&gt;
&lt;li&gt;Fetch the Confluence specification&lt;/li&gt;
&lt;li&gt;Compare acceptance criteria with spec requirements&lt;/li&gt;
&lt;li&gt;Identify gaps or inconsistencies&lt;/li&gt;
&lt;li&gt;Suggest comprehensive test coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Finding Gaps in Requirements
&lt;/h3&gt;

&lt;p&gt;One of the most valuable uses is identifying what's missing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Looking at ticket XYZ-123 and its linked specification, what acceptance criteria might be missing?"&lt;/li&gt;
&lt;li&gt;"Based on the UI mockups in CONF-789 and the requirements in JIRA-456, what scenarios aren't covered?"&lt;/li&gt;
&lt;li&gt;"What questions should I ask the product owner about this feature?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7. Working with Chat Sessions
&lt;/h2&gt;

&lt;p&gt;Amazon Q can keep context between sessions. This is very helpful for ongoing testing tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resuming Previous Conversations
&lt;/h3&gt;

&lt;p&gt;When you're in a project directory, Q remembers your previous conversations. To continue where you left off:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;q chat &lt;span class="nt"&gt;--resume&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see a summary like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We discussed your assigned Jira tickets related to configuring cars then examined the rules in your .q folder that define test case formats and bundle selection conventions for your project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Understanding Context Retention
&lt;/h3&gt;

&lt;p&gt;Q maintains context within a folder, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your test patterns are remembered&lt;/li&gt;
&lt;li&gt;Previous queries inform new responses&lt;/li&gt;
&lt;li&gt;You can build on earlier work without re-explaining&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best Practices for Ongoing Projects
&lt;/h3&gt;

&lt;p&gt;To get the most from context retention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use project-specific folders&lt;/strong&gt;: Keep each project's chats separate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start sessions with context&lt;/strong&gt;: "We're testing the car configurator feature"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference previous work&lt;/strong&gt;: "Using the test pattern we established yesterday..."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build incrementally&lt;/strong&gt;: Start with simple tests, then add complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  8. Troubleshooting Common Issues
&lt;/h2&gt;

&lt;p&gt;Things don't always work perfectly. Here's how to fix common problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connection Problems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: MCP servers fail to initialise&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common causes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorrect URLs in &lt;code&gt;.env&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Network firewall blocking connections&lt;/li&gt;
&lt;li&gt;Docker not running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify Docker is running: &lt;code&gt;docker ps&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Try with &lt;code&gt;SSL_VERIFY=false&lt;/code&gt; temporarily to isolate certificate issues&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Authentication Errors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: "401 Unauthorized" or "403 Forbidden" errors&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common causes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expired or incorrect PAT&lt;/li&gt;
&lt;li&gt;Insufficient permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate a new PAT in Atlassian&lt;/li&gt;
&lt;li&gt;Ensure the token has read access to projects/spaces you need&lt;/li&gt;
&lt;li&gt;Update &lt;code&gt;.env&lt;/code&gt; file with new token&lt;/li&gt;
&lt;li&gt;Rebuild Docker image&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Certificate Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: SSL verification errors&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common causes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing corporate certificates&lt;/li&gt;
&lt;li&gt;Certificates in wrong location&lt;/li&gt;
&lt;li&gt;Expired certificates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get latest certificates from IT&lt;/li&gt;
&lt;li&gt;Place in project root directory&lt;/li&gt;
&lt;li&gt;Rebuild Docker image&lt;/li&gt;
&lt;li&gt;Ensure &lt;code&gt;CERT_PATH&lt;/code&gt; in &lt;code&gt;.env&lt;/code&gt; is correct&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  9. Next Steps
&lt;/h2&gt;

&lt;p&gt;You're up and running—now what?&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Wins to Try First
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generate tests for your most complex feature&lt;/strong&gt;: Pick that configurator or rules engine that everyone avoids testing manually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document existing features&lt;/strong&gt;: Use AI to create test documentation for features that were "completed" without proper test cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find test gaps&lt;/strong&gt;: Run your existing test suites through AI analysis to identify missing scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Building Evidence for Process Improvement
&lt;/h3&gt;

&lt;p&gt;Track these metrics from day one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to create test cases (before vs. after)&lt;/li&gt;
&lt;li&gt;Number of edge cases identified&lt;/li&gt;
&lt;li&gt;Defects found using AI-generated tests&lt;/li&gt;
&lt;li&gt;Time saved per sprint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use this data to show management the value of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better developer-to-tester ratios&lt;/li&gt;
&lt;li&gt;Time for exploratory testing&lt;/li&gt;
&lt;li&gt;Investment in test automation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Expanding Usage Across Your Team
&lt;/h3&gt;

&lt;p&gt;Start small and grow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Begin with one complex feature&lt;/li&gt;
&lt;li&gt;Share success stories in team meetings&lt;/li&gt;
&lt;li&gt;Create team-specific prompt templates&lt;/li&gt;
&lt;li&gt;Schedule brown-bag sessions to train others&lt;/li&gt;
&lt;li&gt;Build a library of effective queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember: This tool doesn't fix broken processes, but it gives you breathing room to work on what really matters—building quality into your products from the start.&lt;/p&gt;

&lt;p&gt;The path forward isn't about replacing testers with AI. It's about amplifying what testers do best: thinking critically about quality, understanding user needs, and finding the problems that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Getting More Advanced
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Q, like most LLMs, thrives on rules.&lt;/li&gt;
&lt;li&gt;Rules are markdown files that can be either global or project/repo based.&lt;/li&gt;
&lt;li&gt;Rules are stored in the &lt;code&gt;./.amazonq/rules&lt;/code&gt; folder.&lt;/li&gt;
&lt;li&gt;In a chat, Q will load the rules, but it may sometimes forget some of them, which is why your prompts matter.&lt;/li&gt;
&lt;/ul&gt;
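&lt;p&gt;To make this concrete, here is the shape a small rule file might take. The filename and conventions are illustrative, not a prescribed format:&lt;/p&gt;

```markdown
# test-case-format.md (lives in ./.amazonq/rules/)

- Every scenario follows Given/When/Then, one behaviour per scenario
- Steps reference page objects by name, e.g. the `continue` button on the `configuration` page
- Screenshots are taken at the initial state, after key actions, and at final verification
- Any missing page object element is flagged inline: // NEW ELEMENT NEEDED: [description]
```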

&lt;h3&gt;
  
  
  Building rules
&lt;/h3&gt;

&lt;p&gt;I would start small and describe the interactions of a feature in human-readable terms. We tend to have states and interactions described in Figma. Without these interactions, Q doesn't know what your system looks like, which means it may not write the cases the way you'd like.&lt;/p&gt;

&lt;p&gt;We have a framework that translates BDD steps into automated WebdriverIO UI tests.&lt;/p&gt;

&lt;p&gt;For example, your framework might have a step like: Then the user clicks the &lt;code&gt;continue&lt;/code&gt; button on the configuration page, where &lt;code&gt;continue&lt;/code&gt; refers to a specific UI element (a 'locator') within a 'page object' for the &lt;code&gt;configuration&lt;/code&gt; page. Your rules teach Q these conventions, allowing it to generate BDD steps that translate directly to your existing automation framework.&lt;/p&gt;

&lt;p&gt;With this, Q knows the layout of our page objects and the rules for how we write BDD tests.&lt;/p&gt;
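&lt;p&gt;A toy sketch of that convention (the names and file format are illustrative, not our real framework): a flat map stands in for page-object classes, and a lookup resolves the friendly name in a BDD step to its locator.&lt;/p&gt;

```shell
# Minimal "page object" map: page.element=locator
printf 'configuration.continue=button[data-test="configuration-continue"]\nconfiguration.back=button[data-test="configuration-back"]\n' > /tmp/page-objects.txt

# Resolve the locator behind: the user clicks the `continue` button on the `configuration` page
lookup() {
  grep "^$1[.]$2=" /tmp/page-objects.txt | cut -d= -f2-
}
lookup configuration continue
```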

&lt;p&gt;I taught it some of these rules through trial and error: it would produce something that looked correct but was missing a detail, perhaps in the way it verified a result. Using the chat interface, I would ask why it did something and work through the problem with the LLM. At the end, I would ask it to update the rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the business
&lt;/h3&gt;

&lt;p&gt;One thing to bear in mind is that when the business views your test cases, they don't want to see detailed steps with &lt;code&gt;code&lt;/code&gt; in them; they want something human-readable. That's something the LLM can do using the knowledge of interactions and states you taught it from Figma, and its understanding of how you structure your tests.&lt;/p&gt;

&lt;p&gt;We can now ask the LLM to produce cases that drive automation as well as human-readable test cases. It can even produce both in the same request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;The LLM doesn't always do what you ask, which is why prompts matter. If you've been working with the LLM within a chat, you can probably just say &lt;code&gt;make a test case about putting in a light bulb&lt;/code&gt; and it'll do it. But a well-structured prompt really helps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="err"&gt;Create&lt;/span&gt; &lt;span class="nf"&gt;a &lt;/span&gt;BDD test case for [TICKET_ID/valid car configurations] following our framework conventions.

&lt;span class="kn"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="err"&gt;- Framework&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;WebdriverIO&lt;/span&gt; &lt;span class="err"&gt;with&lt;/span&gt; &lt;span class="err"&gt;BDD&lt;/span&gt; &lt;span class="err"&gt;(Cucumber)&lt;/span&gt;
&lt;span class="err"&gt;- Domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;Bundle&lt;/span&gt; &lt;span class="err"&gt;selection&lt;/span&gt; &lt;span class="err"&gt;testing&lt;/span&gt;
&lt;span class="err"&gt;- Scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;Up&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;and&lt;/span&gt; &lt;span class="err"&gt;including&lt;/span&gt; &lt;span class="err"&gt;finalising&lt;/span&gt; &lt;span class="err"&gt;configuration&lt;/span&gt;

&lt;span class="err"&gt;Requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="err"&gt;1.&lt;/span&gt; &lt;span class="err"&gt;Use&lt;/span&gt; &lt;span class="err"&gt;current&lt;/span&gt; &lt;span class="err"&gt;valid&lt;/span&gt; &lt;span class="err"&gt;car&lt;/span&gt; &lt;span class="err"&gt;configurations&lt;/span&gt; &lt;span class="err"&gt;from&lt;/span&gt; &lt;span class="err"&gt;our&lt;/span&gt; &lt;span class="err"&gt;business&lt;/span&gt; &lt;span class="err"&gt;rules&lt;/span&gt;
&lt;span class="err"&gt;2.&lt;/span&gt; &lt;span class="err"&gt;Follow&lt;/span&gt; &lt;span class="err"&gt;standard&lt;/span&gt; &lt;span class="err"&gt;Given/When/&lt;/span&gt;&lt;span class="nf"&gt;Then &lt;/span&gt;structure with proper indentation
&lt;span class="err"&gt;3.&lt;/span&gt; &lt;span class="err"&gt;Include&lt;/span&gt; &lt;span class="err"&gt;page&lt;/span&gt; &lt;span class="err"&gt;navigation&lt;/span&gt; &lt;span class="err"&gt;verification&lt;/span&gt; &lt;span class="err"&gt;at&lt;/span&gt; &lt;span class="err"&gt;each&lt;/span&gt; &lt;span class="err"&gt;step&lt;/span&gt;
&lt;span class="err"&gt;4. Add strategic screenshot steps for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;Initial&lt;/span&gt; &lt;span class="err"&gt;state&lt;/span&gt;
   &lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;After&lt;/span&gt; &lt;span class="err"&gt;key&lt;/span&gt; &lt;span class="err"&gt;user&lt;/span&gt; &lt;span class="err"&gt;actions&lt;/span&gt;
   &lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;Final&lt;/span&gt; &lt;span class="err"&gt;state&lt;/span&gt; &lt;span class="err"&gt;verification&lt;/span&gt;
&lt;span class="err"&gt;5. Clearly mark any new page object elements needed beneath the test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;// NEW ELEMENT NEEDED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;[description]&lt;/span&gt;

&lt;span class="err"&gt;Mandatory inclusions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;Initial&lt;/span&gt; &lt;span class="err"&gt;test&lt;/span&gt; &lt;span class="err"&gt;setup&lt;/span&gt; &lt;span class="err"&gt;steps&lt;/span&gt;
&lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;Page&lt;/span&gt; &lt;span class="err"&gt;load&lt;/span&gt; &lt;span class="err"&gt;verification&lt;/span&gt;
&lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;Data &lt;/span&gt;validation steps
&lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;Error&lt;/span&gt; &lt;span class="err"&gt;handling&lt;/span&gt; &lt;span class="err"&gt;considerations&lt;/span&gt;

&lt;span class="err"&gt;Output format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; [Feature name]
  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; [Scenario name]
    Given...
    When...
    Then...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>mcp</category>
      <category>testing</category>
      <category>automation</category>
      <category>bdd</category>
    </item>
    <item>
      <title>AI-Assisted Testing: A Lifeline in the Waterfall-Sprint Hybrid Chaos</title>
      <dc:creator>Paul Coles</dc:creator>
      <pubDate>Thu, 29 May 2025 22:32:06 +0000</pubDate>
      <link>https://forem.com/paul_coles_633f698b10fd6e/ai-assisted-testing-a-lifeline-in-the-waterfall-sprint-hybrid-chaos-406k</link>
      <guid>https://forem.com/paul_coles_633f698b10fd6e/ai-assisted-testing-a-lifeline-in-the-waterfall-sprint-hybrid-chaos-406k</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR for Busy Testers
&lt;/h2&gt;

&lt;p&gt;In 30 seconds: We're drowning in a 5:1 developer-to-tester ratio with massive specs and fake sprints. AI-assisted test generation isn't fixing our broken process, but it's buying us time to breathe, test what matters, and build evidence for real change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Test leads drowning in work with insufficient resources&lt;/li&gt;
&lt;li&gt;QA managers trying to justify better processes to leadership&lt;/li&gt;
&lt;li&gt;Teams stuck in "Wagile" looking for practical survival tactics&lt;/li&gt;
&lt;li&gt;Anyone who's been told to "just automate everything"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Terms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt;: Anthropic's system for connecting LLMs to your tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wagile&lt;/strong&gt;: Waterfall pretending to be Agile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Slicing&lt;/strong&gt;: Building all of one layer before moving to the next&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TAF&lt;/strong&gt;: Test Automation Framework (your team's specific patterns)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part 1: The Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Reality Check: When Agile Isn't Really Agile
&lt;/h3&gt;

&lt;p&gt;You're calling it "Agile," but it seems more like waterfall dressed as a sprint. You're familiar with the symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive specifications that could double as doorstops&lt;/li&gt;
&lt;li&gt;Sprint planning that ignores testing capacity&lt;/li&gt;
&lt;li&gt;Developers packing the sprint based on their own bandwidth&lt;/li&gt;
&lt;li&gt;Features arriving in testing two sprints late&lt;/li&gt;
&lt;li&gt;"Just automate it all!" echoing from above&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tick-Tock Death March
&lt;/h3&gt;

&lt;p&gt;Here's how it plays out in this horizontal slice world:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sprint 1:&lt;/strong&gt; Developers implement the package logic. Everything looks great in isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sprint 2:&lt;/strong&gt; The team builds mutual exclusion rules. Still seems fine — packages conflict as expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sprint 4:&lt;/strong&gt; The basket logic finally arrives. Out of the blue, nothing works properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; You can't test the configurator end-to-end because it's not a vertical slice. The packages were stubbed when the basket didn't exist. Now the basket exists, but the stubs are wrong. You're finding integration problems four sprints late. At the same time, you're testing the new features for Sprint 5.&lt;/p&gt;

&lt;p&gt;Meanwhile, the testing tick-tock continues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tick:&lt;/strong&gt; Developers grab work from the massive spec, packing the sprint. (After all, developers have to code; it can't be about flow, it has to be about being busy.) Code flies, and pull requests pile up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tock:&lt;/strong&gt; Testers rush to handle half the tasks from Sprint 1 while also working on Sprint 2. You might be examining the specs in detail for the first time. After all, it's hard to remember 20,000 words from a 60-minute walk-through. You're checking a Figma file: is it the right one? Is it up to date?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tick:&lt;/strong&gt; Developer work increases. The testing from the last sprint isn't done, and new features keep arriving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tock:&lt;/strong&gt; The deadline looms, and you're seeing features for the first time, two sprints late. Is it correct? Is it good? Who knows — there's no time to find out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case Study: The Car Configurator Complexity
&lt;/h3&gt;

&lt;p&gt;Imagine testing a system where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can choose individual options (leather seats, sunroof, premium audio).&lt;/li&gt;
&lt;li&gt;OR you can choose packages (Sport Pack, Luxury Pack, Winter Pack).&lt;/li&gt;
&lt;li&gt;BUT packages have mutual exclusions (can't have Sport and Eco pack).&lt;/li&gt;
&lt;li&gt;AND some individual options conflict with packages.&lt;/li&gt;
&lt;li&gt;PLUS pricing changes based on combinations.&lt;/li&gt;
&lt;/ul&gt;
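&lt;p&gt;The rules above boil down to data plus a conflict check. As a minimal sketch only (the package and option names here are hypothetical, not from any real spec):&lt;/p&gt;

```python
# Minimal model of the configurator case study; all names are hypothetical.
PACKAGE_EXCLUSIONS = {
    ("Sport Pack", "Eco Pack"),
    ("Winter Pack", "Summer Pack"),
}

# Individual options that conflict with packages (illustrative).
OPTION_CONFLICTS = {
    "Performance Tyres": {"Eco Pack"},
}

def conflicts(selected_packages, selected_options):
    """Return human-readable conflicts for a given selection."""
    found = []
    for first, second in PACKAGE_EXCLUSIONS:
        if first in selected_packages and second in selected_packages:
            found.append(f"{first} cannot be combined with {second}")
    for option in selected_options:
        for pack in OPTION_CONFLICTS.get(option, set()):
            if pack in selected_packages:
                found.append(f"{option} conflicts with {pack}")
    return found
```

&lt;p&gt;Every entry in those two tables is a test case waiting to happen, which is exactly why the combinations pile up faster than a tester can write them by hand.&lt;/p&gt;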

&lt;h2&gt;
  
  
  Part 2: Why Traditional Solutions Fail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "Just Automate Everything!"
&lt;/h3&gt;

&lt;p&gt;As testers, we hear management say "automate everything!" But here's what they don't see...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading specifications for the first time as testing starts.&lt;/li&gt;
&lt;li&gt;Checking the implementation to ensure it matches the Figma designs.&lt;/li&gt;
&lt;li&gt;Writing test cases while executing them.&lt;/li&gt;
&lt;li&gt;Dealing with the backlog of "completed" work that we have never tested.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's like being told to build a ladder while you're falling off a cliff.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hidden Multipliers: Why It's Even Worse Than You Think
&lt;/h3&gt;

&lt;p&gt;Beyond the automation dream, we face chaos that worsens the testing crisis:&lt;/p&gt;

&lt;h4&gt;
  
  
  The Revolving Door Problem
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Team members change all the time, but no one tells you who has left or joined.&lt;/li&gt;
&lt;li&gt;New faces pop up in stand-ups without any introductions.&lt;/li&gt;
&lt;li&gt;Knowledge walks away, taking important context with it.&lt;/li&gt;
&lt;li&gt;You’re left figuring out who does what by trial and error.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  "Agile" Theatre
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Detailed specs show up fully formed (hello, waterfall!).&lt;/li&gt;
&lt;li&gt;Stand-ups turn into 15-minute individual status reports.&lt;/li&gt;
&lt;li&gt;The term "collaboration" appears in slides but not in practice.&lt;/li&gt;
&lt;li&gt;You haven’t seen a retrospective in months (do they even happen?).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The Onboarding Black Hole
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;"Here's your laptop. Good luck!"&lt;/li&gt;
&lt;li&gt;No team directory, no architecture overview, no domain-knowledge handover; just videos about something, though nobody ever says what.&lt;/li&gt;
&lt;li&gt;You learn by osmosis and asking the same questions many times.&lt;/li&gt;
&lt;li&gt;Six weeks in, you’re still uncovering critical systems you should test.&lt;/li&gt;
&lt;li&gt;The automation framework has no README, the Word doc doesn't work, and you figure out which packages it needs through trial and error.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Process? What process?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;UI discrepancies pop up during testing (surprise!).&lt;/li&gt;
&lt;li&gt;Requirements live in Confluence, Jira, Slack, and someone's head.&lt;/li&gt;
&lt;li&gt;We design interactions in Figma, but which one, and is it up to date?

&lt;ul&gt;
&lt;li&gt;Every day you ask "Should it have this?"&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feedback loops are so long they feel like feedback spirals.&lt;/li&gt;

&lt;li&gt;"We will improve the process after this release" (spoiler: they won't).&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Then management wonders why adding more testers doesn't help. A week later, the new hires are still catching up. They are reading specs, watching meeting recordings, and trying to grasp the odd "legacy service that sometimes fails."&lt;/p&gt;

&lt;p&gt;We can’t automate our way out – we can’t even communicate our way in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: Enter AI-Assisted Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Not a Silver Bullet, but a Pressure Release Valve
&lt;/h3&gt;

&lt;p&gt;AI-assisted test generation isn't about achieving testing nirvana; it's about survival.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Can't I just use Copilot or a public Large Language Model (LLM) for this?"
&lt;/h3&gt;

&lt;p&gt;You absolutely can. Tools like Copilot or direct ChatGPT interaction are easy to access. They can read open pages or take pasted specifications, which makes them quick for generating simple test cases or brainstorming ideas. But they come with significant drawbacks in a production testing environment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Hallucination Risk:&lt;/strong&gt; General-purpose LLMs don't have specific context about your Jira tickets, Confluence specs, or your team's test patterns. They can generate test cases that sound good but are often wrong, irrelevant, or made up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No Integration:&lt;/strong&gt; They don't work directly with your project management or documentation tools. This leads to a lot of manual copy-pasting of context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No Learning from Your Patterns:&lt;/strong&gt; They won't adapt to your team's test automation framework (TAF) step patterns. This leads to inconsistent tests that are harder to maintain.&lt;/p&gt;

&lt;p&gt;This is the specific area where MCP demonstrates its strengths. It aims to close that gap by offering a more reliable and context-aware solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Spec to Test in Minutes, Not Days
&lt;/h3&gt;

&lt;p&gt;When you check the configurator section in the spec, you won't waste hours writing test cases. Instead, you extract rules from the spec:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Sport Pack includes premium audio."&lt;/li&gt;
&lt;li&gt;"Winter Pack excludes Summer Pack."&lt;/li&gt;
&lt;li&gt;"Electric engine can't have a Sport Pack."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From Figma, you get the process flow and state changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"If you pick wood trim and change to Summer Pack, then the system shows an 'are you sure' prompt."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI creates test permutations. This lets you focus on the key question: Does this look like what's in Figma? Also, does it make sense for users?&lt;/p&gt;
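&lt;p&gt;To make "AI creates test permutations" concrete, here is one mechanical way the combination space can be enumerated against exclusion rules. A hedged sketch only; the package names and exclusion pairs are made up, and in practice the rules would come out of the spec:&lt;/p&gt;

```python
from itertools import combinations

# Illustrative data; real rules would be extracted from the spec.
PACKAGES = ["Sport", "Eco", "Winter", "Summer", "Luxury"]
EXCLUSIONS = [{"Sport", "Eco"}, {"Winter", "Summer"}]

def valid_combinations(max_size=2):
    """Yield package combinations that break no exclusion rule."""
    for size in range(1, max_size + 1):
        for combo in combinations(PACKAGES, size):
            chosen = set(combo)
            if any(rule.issubset(chosen) for rule in EXCLUSIONS):
                continue  # drop combos containing a mutually exclusive pair
            yield combo

cases = list(valid_combinations())
```

&lt;p&gt;Even this toy version yields thirteen valid selections from five packages; scale that to real option lists and the case for generating rather than hand-writing permutations makes itself.&lt;/p&gt;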

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catch-Up is Possible:&lt;/strong&gt; If you only see features two sprints late, AI-generated tests help you cover the backlog and evidence your testing quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation Comes from Chaos:&lt;/strong&gt; The AI-generated tests create the documentation that was missing. New team members (or you, three sprints later) can understand what the feature does.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Some Automation Is Better Than None:&lt;/strong&gt; You can't automate all tasks, but AI can ease the heavy load of configuration testing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part 4: Implementation Reality
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Building Team Capability, Not Replacing It
&lt;/h3&gt;

&lt;p&gt;Janet Gregory and Lisa Crispin emphasise that quality is woven into the fabric of what we do. AI doesn't weave that fabric — teams do. This tool handles the mundane threading so we can focus on the patterns.&lt;/p&gt;

&lt;p&gt;Yes, our process is broken. But at least we have those large specs - we know what we're supposed to be doing. Our stories mostly have acceptance criteria (though when there are 300 changes and 2 ACs, something's wrong).&lt;/p&gt;

&lt;p&gt;The point is: AI gives us breathing room to work on the real problem - the cultural shift toward continuous quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  How MCP Works (In Plain English)
&lt;/h3&gt;

&lt;p&gt;Think of MCP as a translator between your project tools and AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reads your Jira tickets and Confluence specs&lt;/li&gt;
&lt;li&gt;It understands your team's test patterns from examples&lt;/li&gt;
&lt;li&gt;It generates tests that match your style, not generic templates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technical detail: It runs in a Docker container and connects via API tokens. (See the appendix for setup details.)&lt;/p&gt;

&lt;p&gt;Here's a real example. The AI produced a car configurator component. This includes package conflicts and mutual exclusions based on the requirements.&lt;/p&gt;

&lt;p&gt;There's less risk of it hallucinating because the MCP keeps it on track with your specific project context. It may also spot things you haven't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Car Configurator Package Management

  &lt;span class="kn"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;Given &lt;/span&gt;I am on the car configurator page
    &lt;span class="nf"&gt;And &lt;/span&gt;I have selected a &lt;span class="s"&gt;"Hatch Back"&lt;/span&gt; model
    &lt;span class="nf"&gt;And &lt;/span&gt;the configurator displays available packages and options

  &lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Sport Package conflicts with Eco Package selection
    &lt;span class="err"&gt;Given I have selected the "Sport Package" containing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;Option&lt;/span&gt;            &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;Price&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Performance&lt;/span&gt; &lt;span class="n"&gt;Tyres&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;£800&lt;/span&gt;   &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sport&lt;/span&gt; &lt;span class="n"&gt;Suspension&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;£1,200&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sport&lt;/span&gt; &lt;span class="n"&gt;Exhaust&lt;/span&gt;     &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;£600&lt;/span&gt;   &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the total package price is &lt;span class="s"&gt;"£2,600"&lt;/span&gt;
    &lt;span class="nf"&gt;When &lt;/span&gt;I attempt to select the &lt;span class="s"&gt;"Eco Package"&lt;/span&gt;
    &lt;span class="err"&gt;Then I should see a conflict warning stating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;"Eco&lt;/span&gt; &lt;span class="err"&gt;Package&lt;/span&gt; &lt;span class="err"&gt;cannot&lt;/span&gt; &lt;span class="err"&gt;be&lt;/span&gt; &lt;span class="err"&gt;combined&lt;/span&gt; &lt;span class="err"&gt;with&lt;/span&gt; &lt;span class="err"&gt;Sport&lt;/span&gt; &lt;span class="err"&gt;Package."&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the &lt;span class="s"&gt;"Eco Package"&lt;/span&gt; option should be disabled
    &lt;span class="nf"&gt;And &lt;/span&gt;my current selection should remain &lt;span class="s"&gt;"Sport Package"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the basket total should remain &lt;span class="s"&gt;"£2,600"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Survival Guide: Making It Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Start Where You Are
&lt;/h3&gt;

&lt;p&gt;Don't wait for the perfect process. If you're overwhelmed by complexity, let AI create test cases. You can then check the critical paths yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Use It for Comprehension, Not Just Coverage
&lt;/h3&gt;

&lt;p&gt;When you first see that spec section, use AI to quickly generate scenarios. This helps you understand the feature faster — what are all the combinations? What are the edge cases?&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Focus Your Human Effort
&lt;/h3&gt;

&lt;p&gt;With AI handling the combinatorial explosion, you can focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does this match the Figma designs?&lt;/li&gt;
&lt;li&gt;Do the mutual exclusions make sense to users?&lt;/li&gt;
&lt;li&gt;What happens when rules conflict?&lt;/li&gt;
&lt;li&gt;Is this actually valuable to customers?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start for Desperate Teams
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First: Check if you're allowed to use AI tools&lt;/strong&gt; (Really. Ask security. Using unapproved tools with company data is a career-limiting move.)&lt;/li&gt;
&lt;li&gt;Pick your most complex feature with multiple rules.&lt;/li&gt;
&lt;li&gt;Set up MCP with your Jira/Confluence (see setup guide).&lt;/li&gt;
&lt;li&gt;Generate test cases for just that feature.&lt;/li&gt;
&lt;li&gt;Compare time spent vs. manual creation.&lt;/li&gt;
&lt;li&gt;Use the time saved for exploratory testing.

&lt;ul&gt;
&lt;li&gt;Document what you find!&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part 5: The Honest Path Forward
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What This Fixes (and What It Doesn't)
&lt;/h3&gt;

&lt;p&gt;Let's be real — AI won't fix your broken process. You still have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too much WIP (work in progress).&lt;/li&gt;
&lt;li&gt;Artificial sprint boundaries destroying flow.&lt;/li&gt;
&lt;li&gt;A 5:1 developer-to-tester ratio that guarantees bottlenecks.&lt;/li&gt;
&lt;li&gt;Specifications that arrive fully formed rather than iteratively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But AI can make an unsustainable situation slightly more bearable. It buys you time to demonstrate the value of comprehensive testing and build the case for the process changes you really need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build Evidence for Change
&lt;/h3&gt;

&lt;p&gt;Plan to measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test case creation time (target: 70% reduction).&lt;/li&gt;
&lt;li&gt;Edge case coverage (how many combinations would you have missed?).&lt;/li&gt;
&lt;li&gt;Time freed for exploratory testing.&lt;/li&gt;
&lt;li&gt;Critical issues found with that freed time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This evidence will help justify the process improvements we desperately need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Truth
&lt;/h2&gt;

&lt;p&gt;AI-assisted testing is a lifesaver, not the solution itself. It's like using a better bucket on a ship that’s taking on water. You can bail out faster, but it doesn’t solve the main issue.&lt;/p&gt;

&lt;p&gt;AI helps make the case for proper testing. It shows what happens when your testers aren't overwhelmed. And it makes it clear why you need to change up your process.&lt;/p&gt;

&lt;p&gt;You should fix the process first, but that is not how it works. In reality, AI-assisted testing is about staying afloat and keeping your sanity. It frees up time to focus on quality, rather than going through the motions.&lt;/p&gt;

&lt;p&gt;For instance, AI can create a huge test suite in a few minutes – that's a massive help.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 6: Getting Started (Yes, Really)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What You'll Need
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker (or someone who can install it for you)&lt;/li&gt;
&lt;li&gt;Jira/Confluence access with API permissions&lt;/li&gt;
&lt;li&gt;About 30 minutes when no one's pinging you&lt;/li&gt;
&lt;li&gt;A complex feature to test (you've got plenty)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The 10-Minute Version
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Pull the MCP-Atlassian Docker image&lt;/li&gt;
&lt;li&gt;Create API tokens in Atlassian&lt;/li&gt;
&lt;li&gt;Set up your .env file with credentials&lt;/li&gt;
&lt;li&gt;Connect to Amazon Q&lt;/li&gt;
&lt;li&gt;Work with Amazon Q to add some rules to guide it&lt;/li&gt;
&lt;li&gt;Ask it about your most painful feature&lt;/li&gt;
&lt;/ol&gt;
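&lt;p&gt;Steps 1 to 3 look roughly like the following. This is a sketch under assumptions: the image name and environment-variable names are drawn from the open-source mcp-atlassian project and may differ in your version, so confirm everything against the full setup guide linked at the end of this post.&lt;/p&gt;

```shell
# Step 1: pull the image (name is an assumption; check the setup guide)
docker pull ghcr.io/sooperset/mcp-atlassian:latest

# Steps 2-3: put your Atlassian API tokens in a .env file
# (keep it out of version control); variable names are illustrative:
#   JIRA_URL=https://your-company.atlassian.net
#   JIRA_USERNAME=you@example.com
#   JIRA_API_TOKEN=your-token
#   CONFLUENCE_URL=https://your-company.atlassian.net/wiki
#   CONFLUENCE_USERNAME=you@example.com
#   CONFLUENCE_API_TOKEN=your-token

# Run the server with those credentials for your AI client to connect to
docker run --rm -i --env-file .env ghcr.io/sooperset/mcp-atlassian:latest
```

&lt;p&gt;From there, steps 4 to 6 are about wiring your AI client to the running server and teaching it your patterns.&lt;/p&gt;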

&lt;h3&gt;
  
  
  What Success Looks Like
&lt;/h3&gt;

&lt;p&gt;Within an hour, you should be generating test cases that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually match your testing patterns&lt;/li&gt;
&lt;li&gt;Cover edge cases you'd miss at 5 PM on a Friday&lt;/li&gt;
&lt;li&gt;Make sense to other team members (unlike that legacy automation)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common Gotchas
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If you're behind a corporate proxy, you might need to build the Docker image yourself, and make sure to include your certificates.&lt;/li&gt;
&lt;li&gt;API tokens expire. Set a calendar reminder.&lt;/li&gt;
&lt;li&gt;Start small. One feature. Prove the value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full setup guide: &lt;a href="https://dev.to/paul_coles_633f698b10fd6e/ai-assisted-testing-a-survival-guide-to-implementing-mcp-with-atlassian-tools-2gnm"&gt;https://dev.to/paul_coles_633f698b10fd6e/ai-assisted-testing-a-survival-guide-to-implementing-mcp-with-atlassian-tools-2gnm&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>testing</category>
      <category>agile</category>
    </item>
  </channel>
</rss>
