<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mathieu Kessler</title>
    <description>The latest articles on Forem by Mathieu Kessler (@mathieu_kessler_8712ec765).</description>
    <link>https://forem.com/mathieu_kessler_8712ec765</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3762404%2F8e33e1ee-8992-47ad-b42e-c382a54d3b8e.png</url>
      <title>Forem: Mathieu Kessler</title>
      <link>https://forem.com/mathieu_kessler_8712ec765</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mathieu_kessler_8712ec765"/>
    <language>en</language>
    <item>
      <title>I designed 71 AI agents with nothing but text, here's the instruction design system I ended up with.</title>
      <dc:creator>Mathieu Kessler</dc:creator>
      <pubDate>Wed, 25 Mar 2026 05:50:25 +0000</pubDate>
      <link>https://forem.com/mathieu_kessler_8712ec765/i-designed-71-ai-agents-with-nothing-but-text-heres-the-instruction-design-system-i-ended-up-with-1a99</link>
      <guid>https://forem.com/mathieu_kessler_8712ec765/i-designed-71-ai-agents-with-nothing-but-text-heres-the-instruction-design-system-i-ended-up-with-1a99</guid>
      <description>&lt;p&gt;What good declarative AI agent design actually looks like — the patterns, the constraints, and the failures that shaped a library of 71 production-ready Copilot Studio agents.&lt;/p&gt;

&lt;p&gt;Most AI agent tutorials start with code. Python, LangChain, API calls, tool schemas.&lt;/p&gt;

&lt;p&gt;This one has none of that.&lt;/p&gt;

&lt;p&gt;Over the past few months, I designed and published 71 AI agents for Microsoft 365 Copilot Studio. No code. No Azure resources. No connectors. Each agent is a single text file — a structured instruction set that you paste into a field in a browser. The agent is available to your entire team within minutes.&lt;/p&gt;

&lt;p&gt;The interesting part isn't the volume. It's what designing 71 of them taught me about &lt;strong&gt;instruction engineering&lt;/strong&gt; — the discipline of writing AI instructions that produce consistent, trustworthy, and useful outputs in production.&lt;/p&gt;

&lt;p&gt;Here's the design system I ended up with.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a declarative agent actually is
&lt;/h2&gt;

&lt;p&gt;Microsoft 365 Copilot Studio has two modes. The advanced mode lets you build agents with actions, connectors, authentication flows, and custom APIs — that's closer to what most developers think of when they hear "AI agent."&lt;/p&gt;

&lt;p&gt;But there's a simpler mode built directly into M365 Copilot Chat: you give the agent a name, a description, and an instruction set. That's it. The agent appears as an &lt;code&gt;@mention&lt;/code&gt; in Copilot Chat and is immediately available to everyone in your tenant.&lt;/p&gt;

&lt;p&gt;The instruction set is the entire product. No code, no deployment pipeline, no infrastructure to maintain. And there's a hard constraint: &lt;strong&gt;8,000 characters&lt;/strong&gt;. Copilot Studio truncates anything beyond that without warning.&lt;/p&gt;

&lt;p&gt;8,000 characters sounds like a lot. It's about 1,300 words — shorter than many LinkedIn threads. Once you add a role definition, language rules, output structure, banned vocabulary, a quality self-check, and edge case handling, you're already brushing against the ceiling.&lt;/p&gt;

&lt;p&gt;That constraint turned out to be the best thing about the project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why constraints make design better
&lt;/h2&gt;

&lt;p&gt;When you're writing code, you can always add another function, another parameter, another fallback. There's no forcing function to stop you.&lt;/p&gt;

&lt;p&gt;When you have 8,000 characters to define an agent's entire behavior, you have to make decisions. What does this agent actually do? What does it absolutely not do? What does output look like? What happens when the user gives it garbage input?&lt;/p&gt;

&lt;p&gt;Agents I designed early — before I had a formal structure — were vague. The ROLE section said things like "You are a helpful project management assistant." The output format was implied. Edge cases were missing. Those agents produced inconsistent outputs and required constant prompt correction.&lt;/p&gt;

&lt;p&gt;Agents I designed after developing the structure below are tight, predictable, and require almost no back-and-forth from the user.&lt;/p&gt;

&lt;p&gt;The constraint forced the discipline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The five design patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The ROLE section is a behavioral contract, not a job title
&lt;/h3&gt;

&lt;p&gt;Early ROLE sections: "You are a financial reporting assistant."&lt;/p&gt;

&lt;p&gt;That tells the model what it is. It doesn't tell it what it does, what inputs it accepts, what it produces, or — critically — what it refuses to do.&lt;/p&gt;

&lt;p&gt;Effective ROLE sections answer four questions in one paragraph:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does this agent do, specifically?&lt;/li&gt;
&lt;li&gt;What inputs does it work with?&lt;/li&gt;
&lt;li&gt;What does it produce?&lt;/li&gt;
&lt;li&gt;What will it never do, no matter what the user asks?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the ROLE from the Financial Report Writer agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## ROLE&lt;/span&gt;
You draft financial narrative — management accounts commentary, board pack sections,
and variance explanations — from structured data provided by the user. You work from
figures, prior-period comparisons, and variance tables. Every number in your output
comes from the input. You never invent figures, trend claims, or management commentary
not supported by the data provided. You do not give investment advice, forward guidance,
or financial projections beyond what the user explicitly provides.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's 72 words. It defines scope (management accounts, board packs), inputs (figures, comparisons, variance tables), a hard constraint (every number from input), and explicit refusals (no invented figures, no investment advice).&lt;/p&gt;

&lt;p&gt;A user reading that knows exactly what to give this agent and what to expect back. The model reading it has a clear behavioral contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. "WHAT YOU DO NOT DO" prevents hallucinated helpfulness
&lt;/h3&gt;

&lt;p&gt;LLMs are trained to be helpful. That's mostly good. It becomes a problem when a user asks an agent to do something adjacent to its purpose and the agent tries to be helpful instead of declining.&lt;/p&gt;

&lt;p&gt;An HR agent that writes job descriptions will, if you ask it, write a performance improvement plan. Whether that's appropriate depends on your HR policy, local employment law, and whether you actually want an AI drafting PIPs. If you haven't told it not to, it will.&lt;/p&gt;

&lt;p&gt;Every agent in the library has an explicit &lt;code&gt;## WHAT YOU DO NOT DO&lt;/code&gt; section. Not implied. Not embedded in the ROLE. A separate section with a bulleted list of refusals.&lt;/p&gt;

&lt;p&gt;For the ESG Commitment Tracker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## WHAT YOU DO NOT DO&lt;/span&gt;
Do not determine whether a commitment is material or immaterial — report status only.
Do not set or adjust targets — narrate performance against targets the user provides.
Do not make forward-looking commitments on behalf of the organisation.
Do not present estimated figures as verified data without flagging them.
Do not omit commitments from the report — every commitment in the input appears in the output.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These refusals exist because real users will ask all of these things. Having the refusal in the instruction set means the agent declines gracefully — with an explanation and an alternative — rather than hallucinating an answer or silently doing the wrong thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Output format must be specified to the column
&lt;/h3&gt;

&lt;p&gt;"Produce a structured report" is not a format specification.&lt;/p&gt;

&lt;p&gt;Vague format guidance produces inconsistent outputs. Two runs of the same input through an agent with vague format instructions will produce structurally different reports. That's useless in an enterprise context where people need to paste outputs into templates, compare reports week-over-week, or route outputs to approvers.&lt;/p&gt;

&lt;p&gt;Effective output specification includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section headings, in order&lt;/li&gt;
&lt;li&gt;Table structure with column names&lt;/li&gt;
&lt;li&gt;What goes in each section — not just a label, but a definition&lt;/li&gt;
&lt;li&gt;Length targets (one paragraph, 3-5 bullets, etc.)&lt;/li&gt;
&lt;li&gt;What to do when a section has no data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the Project Status Reporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## OUTPUT STRUCTURE&lt;/span&gt;
PROJECT STATUS REPORT
Project: [name] | Period: [dates] | Report date: [date]
Prepared by: Project Status Reporter (AI-assisted — validate before distribution)
&lt;span class="p"&gt;
---&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; EXECUTIVE SUMMARY
[3 sentences: overall RAG status, primary driver, one key risk or milestone.]
&lt;span class="p"&gt;
2.&lt;/span&gt; SCHEDULE STATUS
RAG: [Red / Amber / Green]
| Milestone | Planned | Forecast | Variance | Status |
[Top 5 schedule variances only. Flag: "Full P6 schedule attached separately."]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent now knows the exact document it's producing. Users know what to expect. Review cycles are shorter because the structure is always consistent.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The quality self-check is the agent auditing its own work
&lt;/h3&gt;

&lt;p&gt;Every instruction block ends with a &lt;code&gt;## QUALITY SELF-CHECK&lt;/code&gt; section — a checkbox list the agent runs internally before delivering output. The user never sees the checklist; the agent uses it to catch its own errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## QUALITY SELF-CHECK&lt;/span&gt;
[ ] All figures attributable to input data — none invented.
[ ] RAG status present for every milestone.
[ ] Executive summary is 3 sentences — not more.
[ ] No banned vocabulary: pivotal, crucial, robust, impactful, seamless, cutting-edge.
[ ] AI-assistance disclaimer present.
[ ] Forward-looking statements flagged [FLS].
Correct any failure before delivering.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last line is critical: "Correct any failure before delivering." This isn't decoration — it instructs the model to loop back and fix before outputting.&lt;/p&gt;

&lt;p&gt;This pattern catches a surprising number of errors: invented figures, missing sections, prohibited vocabulary, and missing disclaimers. It's the difference between an agent that needs constant human correction and one that produces first-draft-ready output.&lt;/p&gt;

&lt;p&gt;The checks must be binary and specific. "Is the output good?" is not a check. "Are all figures attributable to input data?" is a check. "Is the executive summary 3 sentences?" is a check.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The banned vocabulary list
&lt;/h3&gt;

&lt;p&gt;AI-generated text has a recognisable fingerprint. Not because LLMs are bad writers — they're not — but because they're trained on so much corporate language that they reproduce its worst patterns: &lt;em&gt;pivotal moments&lt;/em&gt;, &lt;em&gt;vibrant ecosystems&lt;/em&gt;, &lt;em&gt;robust frameworks&lt;/em&gt;, &lt;em&gt;fostering alignment&lt;/em&gt;, &lt;em&gt;leveraging synergies&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A finance director who receives a board pack section with the word "showcasing" in it will immediately doubt whether a human reviewed the output.&lt;/p&gt;

&lt;p&gt;Every agent carries a banned vocabulary list in its instruction block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## BANNED VOCABULARY&lt;/span&gt;
Do not use: pivotal, testament, underscores (emphasis), stands as, marks a shift,
evolving landscape, vital role, vibrant, seamless, impactful, leverage (as verb),
robust (abstract), cutting-edge, state-of-the-art, best-in-class, thought leader,
ecosystem (non-technical), additionally (sentence opener), it is important to note that,
in order to, going forward (filler), touch base, circle back, deep dive (filler),
move the needle, game-changing, world-class (without benchmark).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The effect on output quality is immediate and obvious. Remove this list and rerun the same prompt — the AI vocabulary comes straight back.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three things I got wrong early
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Vague ROLE sections.&lt;/strong&gt; The first five agents I built had ROLE sections that described what the agent was, not what it does. "You are a helpful writing assistant" tells the model nothing useful about its scope, inputs, or refusals. I rewrote all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No edge cases.&lt;/strong&gt; Every agent needs at minimum: (1) what to do when the user provides no input, (2) what to do when the user asks for something outside scope. Without these, the agent either hallucinates a response or fails silently. I added a minimum of three edge cases to every agent after the first batch.&lt;/p&gt;
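
&lt;p&gt;To make that concrete — this is an illustrative sketch in the same style, not a block copied from any of the 71 published agents — a minimal edge-case section might look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## EDGE CASES&lt;/span&gt;
No input provided: ask for the specific inputs you need — do not generate a sample report.
Out-of-scope request: decline, state what you do instead, and suggest one alternative.
Partial or contradictory data: flag the gap or conflict and ask before drafting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each line pairs a trigger with a single prescribed behavior, so the agent never has to improvise when the input is bad.&lt;/p&gt;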

&lt;p&gt;&lt;strong&gt;3. Output format implied, not specified.&lt;/strong&gt; "Produce a report" produces different structures every time. Now every agent specifies section headings, table column names, and length targets. The outputs are consistent enough to be routed directly into templates.&lt;/p&gt;




&lt;h2&gt;
  
  
  The library
&lt;/h2&gt;

&lt;p&gt;The result is 71 agents across 13 domains — writing and communication, project management, HR, finance, ESG, sales, IT, legal, and more. Plus an industry pack of 13 agents for EPC and energy sector workflows.&lt;/p&gt;

&lt;p&gt;Every agent follows the same five-pattern structure. Every instruction block fits within 8,000 characters. Every agent defaults to British English, supports French output, carries a banned vocabulary list, and runs a quality self-check before delivering.&lt;/p&gt;

&lt;p&gt;The full library is on GitHub: &lt;a href="https://github.com/kesslernity/awesome-copilot-studio-agents" rel="noopener noreferrer"&gt;https://github.com/kesslernity/awesome-copilot-studio-agents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To deploy any agent: go to &lt;a href="https://m365.cloud.microsoft/chat/agent/new" rel="noopener noreferrer"&gt;m365.cloud.microsoft/chat/agent/new&lt;/a&gt;, enter the name and description from the file's frontmatter, paste the instruction block, and click Create. The agent is available via &lt;code&gt;@mention&lt;/code&gt; in Copilot Chat within minutes.&lt;/p&gt;
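
&lt;p&gt;For orientation, each agent file pairs a short frontmatter block with the instruction set. The exact field values below are hypothetical — check the file you're deploying — but the shape is this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;---
name: Project Status Reporter
description: Turns schedule, cost, and risk updates into a consistent status report.
---
&lt;span class="gu"&gt;## ROLE&lt;/span&gt;
[instruction block — everything below the frontmatter gets pasted into Copilot Studio]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The frontmatter fills the name and description fields; the rest of the file is the instruction block.&lt;/p&gt;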

&lt;p&gt;No code. No infrastructure. The product is the instruction set.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've built declarative agents — in Copilot Studio, CustomGPTs, or anywhere else — I'd be interested to hear what design patterns you've landed on. The instruction-as-product constraint produces different thinking than code-as-product.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>microsoft</category>
      <category>productivity</category>
      <category>microsoftgraph</category>
    </item>
    <item>
      <title>The Centaur Developer: Why Human+AI Teams Outperform Both</title>
      <dc:creator>Mathieu Kessler</dc:creator>
      <pubDate>Mon, 09 Mar 2026 18:22:10 +0000</pubDate>
      <link>https://forem.com/mathieu_kessler_8712ec765/the-centaur-developer-why-humanai-teams-outperform-both-f28</link>
      <guid>https://forem.com/mathieu_kessler_8712ec765/the-centaur-developer-why-humanai-teams-outperform-both-f28</guid>
      <description>&lt;p&gt;I keep noticing the same split on every engineering team I talk to.&lt;/p&gt;

&lt;p&gt;There's the developer who treats AI as autocomplete. Tab-completes everything. Ships fast, debugs slow. Opens a PR with 400 lines and can't explain what half of them do. When something breaks in production, they paste the error back into the chat and hope for the best. They're fast until they're not.&lt;/p&gt;

&lt;p&gt;Then there's the developer who refuses to use AI for anything that matters. "I need to understand the code." They ship slow, understand everything, and quietly fall behind on velocity. They're right about understanding. They're wrong about the tradeoff.&lt;/p&gt;

&lt;p&gt;And then there's a third type. They use AI for specific phases of specific tasks. They ship fast and understand the code. Not because they're smarter, but because they draw a clear line between what's theirs and what's the machine's -- and they draw it in a different place for every task.&lt;/p&gt;

&lt;p&gt;There's a name for this pattern. It comes from chess, and it's been quietly reshaping how the best developers I know actually work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The chess origin
&lt;/h2&gt;

&lt;p&gt;In 1997, Garry Kasparov lost to Deep Blue. It was a big deal at the time. The world's best chess player, beaten by a machine. Most people remember that part.&lt;/p&gt;

&lt;p&gt;What most people don't remember is what happened next. Kasparov proposed "advanced chess" -- matches where human+computer pairs competed against each other. The format evolved into freestyle tournaments, open to grandmasters, chess engines, and human+machine teams alike. Anyone could enter with any combination of human and machine.&lt;/p&gt;

&lt;p&gt;The result surprised everyone. Centaur teams -- humans partnered with computers -- beat both grandmasters playing alone and chess engines playing alone. For years. The winning teams weren't the ones with the best engines or the strongest players. They were the ones with the best process for dividing the work.&lt;/p&gt;

&lt;p&gt;The insight wasn't "humans are still useful." It was that the human's job changed. They stopped trying to out-calculate the machine. They started managing it -- choosing which lines to explore, when to override the engine's suggestion, when to trust it. The human brought strategic judgment. The machine brought brute-force calculation. Neither was sufficient alone.&lt;/p&gt;

&lt;p&gt;That's exactly what's happening in software development right now, and most developers haven't noticed the shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The developer split
&lt;/h2&gt;

&lt;p&gt;Not every development task is the same, and the centaur model only works if you're honest about which tasks need what. Here's how I think about it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Human Leads&lt;/th&gt;
&lt;th&gt;AI Leads&lt;/th&gt;
&lt;th&gt;Centaur (Both)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture decisions&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boilerplate/scaffolding&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug investigation&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Writing tests&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refactoring&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance optimization&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security review&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data transformations&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is straightforward once you see it. "Human leads" means the task is judgment-heavy, context-dependent, and the consequences of getting it wrong are high. Architecture decisions. Security review. Code review where you're evaluating design tradeoffs, not syntax. These are tasks where domain knowledge and organizational context matter more than processing speed.&lt;/p&gt;

&lt;p&gt;"AI leads" means the task is mechanical, pattern-matching, and low-stakes. Scaffolding a new service from a template. Transforming data between formats. Writing boilerplate that follows an established pattern. If you're spending human judgment on these, you're misallocating your scarcest resource.&lt;/p&gt;

&lt;p&gt;"Centaur" is the interesting column. These are tasks where both human and AI contribute different things, and the result is better than either could produce alone. Most development tasks live here. The question isn't whether to use AI -- it's where the handoff happens within each task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The handoff problem
&lt;/h2&gt;

&lt;p&gt;The actual skill of centaur development isn't "write better prompts." It's knowing where to place the handoff between human judgment and AI processing within a single task.&lt;/p&gt;

&lt;p&gt;Let me walk through a concrete example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug investigation (a centaur task):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You get a bug report: intermittent 500 errors on the payment endpoint, mostly on Friday afternoons. Here's how the centaur split works:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI phase:&lt;/em&gt; "Search the codebase for all usages of the payment service client. Show me the call chain from the API handler to the function that throws this error. List all commits in the last two weeks that touched these files."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Human phase:&lt;/em&gt; You read the AI's findings. You apply domain knowledge that no model has -- the payment provider's API slows down on Fridays because their batch processing runs. You've seen this before. It's a timeout issue under load, not a code bug.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI phase:&lt;/em&gt; "Write a fix that adds configurable timeout with exponential backoff on the payment client. Add a test that simulates slow responses up to 30 seconds."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Human phase:&lt;/em&gt; You review the fix. Does it handle the edge case where the payment service is completely down, not just slow? Does the backoff cap make sense for your SLA? You adjust, approve, or iterate.&lt;/p&gt;

&lt;p&gt;Notice the pattern. AI gathers and transforms. Human judges and decides. AI implements the decision. Human verifies. It's not one handoff -- it's a loop. Each pass through the loop, the human applies context the AI doesn't have, and the AI applies processing speed the human can't match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The anti-pattern: giving AI the whole task.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Fix this bug" with no context. The AI guesses at the root cause, writes a plausible-looking fix that passes existing tests but addresses a symptom, not the actual problem. You spend more time debugging the AI's fix than you would have spent debugging the original bug. I've watched this happen more times than I want to admit -- including to myself.&lt;/p&gt;

&lt;p&gt;The handoff exists whether you design it or not. If you don't choose where it goes, you end up with the AI making judgment calls it's not equipped to make, and you reviewing mechanical work that didn't need your attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three rules
&lt;/h2&gt;

&lt;p&gt;After working this way for a while, I've landed on three rules that hold up across different tasks and different models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1: AI processes, you decide.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every task has a processing phase and a decision phase. Processing is gathering information, transforming data, generating options, applying known patterns. Decision is choosing the approach, evaluating tradeoffs, accepting risk, determining what "good enough" means.&lt;/p&gt;

&lt;p&gt;AI handles the first. You handle the second.&lt;/p&gt;

&lt;p&gt;If you find yourself accepting AI output without reading it, you've crossed the line. You've delegated a decision to a machine that doesn't understand the consequences. If you find yourself manually doing work that AI could process -- reformatting, searching, boilerplating -- you're wasting your advantage.&lt;/p&gt;

&lt;p&gt;The line between processing and decision isn't always obvious, and it moves depending on the task. But asking yourself "am I processing or deciding right now?" before each step is the single most useful habit I've built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2: Context is the product.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The quality of AI output is directly proportional to the quality of context you provide. Your codebase knowledge, your understanding of the business domain, your awareness of production constraints, your memory of what was tried before and why it failed -- that's your contribution. Without it, AI produces generic code that technically works and practically doesn't.&lt;/p&gt;

&lt;p&gt;The best developers I know spend more time crafting context than crafting prompts. They paste in the relevant code. They explain the business rule. They describe the constraint nobody wrote down. The prompt itself is often one sentence. The context around it is a paragraph.&lt;/p&gt;

&lt;p&gt;Your context is the moat. It's the thing that makes AI output actually useful instead of generically correct. Protect it, develop it, and feed it into every AI interaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 3: Trust, but verify the boundaries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all AI output needs the same level of scrutiny. I use a rough three-tier system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scaffolding and boilerplate:&lt;/strong&gt; Light review. Low stakes. If the AI generates a standard CRUD endpoint from a pattern I've established, I scan it and move on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business logic:&lt;/strong&gt; Line-by-line review. Medium stakes. This is where subtle bugs hide -- off-by-one errors in date calculations, missing edge cases in validation, wrong assumptions about nullable fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security, auth, payments, anything safety-related:&lt;/strong&gt; I rewrite from scratch using AI output as reference only. High stakes. I don't trust any model to get authentication flows right without me thinking through every path. The cost of a subtle bug here is orders of magnitude higher than the time saved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Match your review depth to the blast radius of getting it wrong. This isn't paranoia -- it's resource allocation. Spending equal time reviewing boilerplate and auth code means you're either over-reviewing the boilerplate or under-reviewing the auth.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changes when you work this way
&lt;/h2&gt;

&lt;p&gt;I've been working with this model for long enough to see the second-order effects, and some of them surprised me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code review speed goes up.&lt;/strong&gt; Not because you review less carefully, but because you wrote or deeply reviewed every decision point yourself. The AI-generated parts are predictable patterns you've already validated. You can scan them fast because you know exactly what to look for -- you defined the pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug density goes down.&lt;/strong&gt; Because your human judgment went into the right places -- architecture, edge cases, domain logic, security boundaries -- instead of being spread thin across boilerplate. You're not sharper. You're just better allocated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Onboarding gets faster.&lt;/strong&gt; New developers learn the centaur pattern and start contributing at the judgment layer immediately, even while AI handles patterns they haven't memorized yet. They don't need to know the entire codebase to make good decisions about the part they're working on. The AI fills in the mechanical gaps while they focus on understanding the why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The unexpected one: you stop resenting AI.&lt;/strong&gt; This is the one I didn't predict. When you have a clear model for what's yours and what's the machine's, the anxiety about "AI replacing developers" disappears. It's not replacing you. It's handling the part of your job that was never the interesting part anyway. The interesting part -- the judgment, the design, the debugging that requires actual understanding -- is still yours. And now you have more time for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest take
&lt;/h2&gt;

&lt;p&gt;This model doesn't work for everything. Pure creative work -- inventing a novel architecture, designing a genuinely new algorithm, making product decisions -- is still mostly human. The AI can help you explore options, but the creative leap is yours. Pure mechanical work -- formatting, linting, generating boilerplate from established patterns -- is still mostly AI. There's no centaur split needed when one side clearly dominates.&lt;/p&gt;

&lt;p&gt;The centaur model matters most for the 60-70% of development work that lives in between. The bug investigations, the refactors, the test writing, the documentation, the performance work. Tasks with both a processing component and a judgment component. That's where the split pays off.&lt;/p&gt;

&lt;p&gt;The real skill isn't prompt engineering. It's boundary engineering -- knowing where to draw the line between human judgment and AI processing for each specific task in your specific context. And that line is different for every team, every codebase, every domain.&lt;/p&gt;

&lt;p&gt;The developers who figure this out first won't just be faster. They'll be doing fundamentally different work than the ones who are still either fighting AI or blindly accepting everything it produces.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Free prompts at &lt;a href="https://nerdychefs.ai" rel="noopener noreferrer"&gt;NerdyChefs.ai&lt;/a&gt; and &lt;a href="https://github.com/kesslernity" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Free 35-min AI course: &lt;a href="https://trainings.kesslernity.com" rel="noopener noreferrer"&gt;trainings.kesslernity.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Are you using a centaur-style workflow? I'm curious which tasks you always keep human and which you've fully delegated. Drop a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Every AI Benchmark Tests Coding. We Built One That Tests Infrastructure Work.</title>
      <dc:creator>Mathieu Kessler</dc:creator>
      <pubDate>Thu, 05 Mar 2026 00:21:14 +0000</pubDate>
      <link>https://forem.com/mathieu_kessler_8712ec765/every-ai-benchmark-tests-coding-we-built-one-that-tests-infrastructure-work-466b</link>
      <guid>https://forem.com/mathieu_kessler_8712ec765/every-ai-benchmark-tests-coding-we-built-one-that-tests-infrastructure-work-466b</guid>
      <description>&lt;p&gt;Every AI coding tool benchmark tests the same things. Autocomplete accuracy. Code generation. Refactoring. Test generation. Maybe a LeetCode problem for good measure.&lt;/p&gt;

&lt;p&gt;Not a single one tests the work infrastructure engineers actually do.&lt;/p&gt;

&lt;p&gt;I spend my days writing Terraform modules, debugging Kubernetes incidents, migrating CI/CD pipelines, and reviewing Helm charts for security issues. When I evaluated AI coding agents for my team, every vendor benchmark told me how well their tool could write a React component. None of them told me whether it could generate a production EKS module with networking, IAM, and logging that actually plans without errors.&lt;/p&gt;

&lt;p&gt;So we built the benchmark that should have existed.&lt;/p&gt;

&lt;h2&gt;
  
  
  20 tasks across 5 categories
&lt;/h2&gt;

&lt;p&gt;We designed 20 infrastructure tasks grouped into the five categories that eat most of a platform engineer's time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Terraform module generation&lt;/strong&gt; -- Generate complete, standards-compliant modules from organizational patterns. The test: does the output run &lt;code&gt;terraform validate&lt;/code&gt; without manual patching? Does it follow naming conventions? Are there hardcoded environment values hiding in the examples?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Infrastructure composition&lt;/strong&gt; -- Convert natural language intent ("2 web apps, 1 key vault, private endpoints, dev/staging/prod") into dependency-aware IaC scaffolding. This is harder than it sounds. The agent needs to understand that the web apps depend on the service plan, the key vault needs RBAC before the apps can reference it, and the environments need separate state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Kubernetes incident triage&lt;/strong&gt; -- Diagnose real failure patterns: CrashLoopBackOff, OOMKilled, ImagePullBackOff, configuration drift. The test isn't "what's wrong?" -- it's whether the agent follows a coherent triage sequence. Read logs, check resource limits, verify probe configuration, compare running state to declared state, suggest a safe rollback path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. CI/CD workflow engineering&lt;/strong&gt; -- Create pipelines and migrate existing ones. The hardest task in this category: Jenkins-to-GitHub Actions migration with security gates and approval workflows preserved. Most agents could generate a new workflow. Fewer could preserve the approval semantics from an existing Jenkinsfile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Security and reliability review&lt;/strong&gt; -- Evaluate Helm charts and pipeline configurations for privilege escalation, secrets exposure, blast radius, and missing policy guardrails. The test: identify real risks with practical remediations that don't require destructive changes without approval.&lt;/p&gt;

&lt;p&gt;Each category contained four tasks at increasing complexity. The Terraform category, for example, started with a single-resource module and scaled to a multi-module composition with cross-resource dependencies, environment separation, and state management. The point was to test not just whether the tool could generate infrastructure code, but whether it could reason about infrastructure systems.&lt;/p&gt;
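&lt;p&gt;As one concrete example, the "hardcoded environment values" check from the Terraform category can be sketched as a small scanner. This is an illustration, not the benchmark's actual harness -- the regex patterns are assumptions you'd extend for your own conventions:&lt;/p&gt;

```python
import re

# Patterns that often indicate a hardcoded environment value in Terraform
# source. Illustrative list only -- extend it for your own conventions.
HARDCODED_PATTERNS = [
    r'"[\w-]*(dev|staging|prod)[\w-]*"',      # literal environment names
    r'"\d{12}"',                              # AWS account IDs
    r'"(10|172|192)\.\d+\.\d+\.\d+(/\d+)?"',  # literal private IPs / CIDRs
]

def find_hardcoded_values(tf_source: str) -> list[str]:
    """Return every suspicious literal found in a Terraform file's text."""
    hits = []
    for pattern in HARDCODED_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, tf_source))
    return hits
```

&lt;p&gt;Anything the scanner flags should reference a variable instead of a literal; a clean module returns an empty list.&lt;/p&gt;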

&lt;p&gt;We ran all 20 tasks across three tools: GitHub Copilot, Claude Code, and Amazon Q Developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three architectures, three philosophies
&lt;/h2&gt;

&lt;p&gt;The most interesting finding wasn't which tool scored highest. It was how fundamentally different their architectures are -- and how those architectures determine what they're good at.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Copilot: IDE-first with workflow guardrails
&lt;/h3&gt;

&lt;p&gt;Copilot's architecture centers on the IDE. It runs asynchronously in GitHub Actions-backed environments with explicit pull request handoffs. When Copilot's coding agent generates infrastructure code, the workflow runs are not executed automatically -- they require approval in Actions.&lt;/p&gt;

&lt;p&gt;This is a deliberate governance choice. Copilot treats every generated change as a PR that goes through your existing review process. For organizations that want AI-assisted infrastructure changes to flow through the same gates as human changes, this is the right model.&lt;/p&gt;

&lt;p&gt;The tradeoff: IDE autocomplete doesn't translate directly to infrastructure reasoning. One task asked for a production-grade EKS Terraform module with VPC networking, IAM roles, node groups, and CloudWatch logging. The output required modifications before it would even plan successfully. The individual resource blocks were syntactically correct, but the cross-resource dependencies -- the parts that make infrastructure code actually work -- needed manual wiring.&lt;/p&gt;

&lt;p&gt;Copilot's strength showed on tasks where the scope was contained: a single module, a focused pipeline, a specific Helm chart review. It struggled with multi-step tasks that require understanding how 15+ resources interact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code: terminal-first and tool-centric
&lt;/h3&gt;

&lt;p&gt;Claude Code operates as a command-line tool executing commands under user permissions. It's terminal-native -- designed to run alongside your existing shell workflow rather than inside an IDE.&lt;/p&gt;

&lt;p&gt;The architecture difference shows immediately on infrastructure tasks. When debugging a CrashLoopBackOff, the triage sequence involves reading logs, checking resource limits, verifying probe configs, and comparing running state to declared state. In a terminal context, these are sequential commands with the full output flowing into the agent's context window. Claude Code's 200K context window (expandable to 1M) means it doesn't lose earlier diagnostic information while working through later steps.&lt;/p&gt;

&lt;p&gt;The numbers are hard to ignore: Claude Code currently accounts for 4% of all GitHub commits -- roughly 134,000 per day -- and is projected to hit 20% by year-end. Anthropic reports a $2.5B run-rate with users doubling since January 1st, 2026. On the tooling side, git worktree isolation means the agent can work on infrastructure changes in a separate branch without touching your working directory.&lt;/p&gt;

&lt;p&gt;Where it struggled: tasks requiring deep integration with specific cloud services. It generated correct Terraform and Kubernetes manifests, but it didn't have the same awareness of AWS-specific best practices that a cloud-native tool would.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Q Developer: deeply AWS-integrated
&lt;/h3&gt;

&lt;p&gt;Amazon Q's architecture is built around AWS service integration. Cedar-based policy enforcement, 13 quality evaluators, and deep CloudWatch integration give it capabilities that generic tools can't match on AWS-specific tasks.&lt;/p&gt;

&lt;p&gt;On our AWS-focused tasks, Q Developer was impressive. CloudWatch log analysis, IAM policy review, and AWS-specific Terraform patterns all benefited from Q's native understanding of AWS service relationships.&lt;/p&gt;

&lt;p&gt;The tradeoff is the inverse of its strength: that depth vanishes the moment you work across clouds. Tasks involving Azure resources, multi-cloud networking, or cloud-agnostic Kubernetes patterns didn't benefit from Q's AWS integration. For teams running 100% AWS, this isn't a limitation. For teams running hybrid or multi-cloud -- which is most enterprise infrastructure teams -- it's a significant constraint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two corrections to common claims
&lt;/h2&gt;

&lt;p&gt;While researching this benchmark, I found two claims circulating widely that needed correction:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub's "Agentic Workflows" launch.&lt;/strong&gt; Multiple sources describe this as a general product launch on February 17, 2026. The actual documentation from GitHub Next describes it as a research demonstrator, not a GA product release. The distinction matters if you're evaluating Copilot for production infrastructure workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Q's Cedar policies and 13 evaluators.&lt;/strong&gt; These are frequently attributed to Amazon Q Developer directly. They're actually capabilities of AWS AgentCore -- a separate service for building and governing AI agents. Q Developer can leverage AgentCore, but the policy framework isn't a core Q Developer feature. If you're expecting Cedar-based governance out of the box with Q Developer, check whether your plan includes AgentCore access.&lt;/p&gt;

&lt;p&gt;Neither correction diminishes the tools. But infrastructure teams making purchasing decisions need accurate capability mapping, not marketing summaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the results actually mean
&lt;/h2&gt;

&lt;p&gt;Three patterns emerged across all 20 tasks:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. IDE-first tools struggle with infrastructure context
&lt;/h3&gt;

&lt;p&gt;Infrastructure code is fundamentally different from application code. A React component is self-contained -- you can evaluate it in isolation. A production EKS Terraform module with VPC networking, IAM roles, node groups, and logging touches 15+ resources with complex dependency chains. Understanding whether &lt;code&gt;module.eks.cluster_endpoint&lt;/code&gt; is correctly referenced three modules away requires the kind of cross-file context that IDE autocomplete wasn't designed for.&lt;/p&gt;

&lt;p&gt;This doesn't mean Copilot can't do infrastructure work. It means the IDE-first interaction model -- suggest completions in the current file based on nearby context -- is a less natural fit for infrastructure than for application code. The tasks where Copilot performed best were scoped to a single file or module.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Terminal-native agents handle multi-step infrastructure work better
&lt;/h3&gt;

&lt;p&gt;Debugging a CrashLoopBackOff isn't a single-prompt task. It's a sequence: &lt;code&gt;kubectl logs&lt;/code&gt;, then &lt;code&gt;kubectl describe pod&lt;/code&gt;, then check resource requests vs limits, then verify liveness/readiness probes, then compare the running manifest to the declared spec, then check recent deployments for configuration drift. Each step's output informs the next.&lt;/p&gt;
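&lt;p&gt;That sequence can be written down as an ordered command plan. This is a hand-rolled sketch for illustration -- the pod, namespace, and deployment names are placeholders, and your own runbook may order the steps differently:&lt;/p&gt;

```python
def triage_plan(pod: str, namespace: str = "default", deployment: str = "app") -> list[str]:
    """Ordered kubectl commands for a CrashLoopBackOff triage.

    Each step's output is meant to inform the next one; names here are
    illustrative placeholders, not a fixed procedure.
    """
    ns = f"-n {namespace}"
    return [
        f"kubectl logs {pod} {ns} --previous",   # 1. why the last container died
        f"kubectl describe pod {pod} {ns}",      # 2. events: probe failures, OOMKilled
        f"kubectl get pod {pod} {ns} -o jsonpath='{{.spec.containers[*].resources}}'",  # 3. requests vs limits
        f"kubectl get pod {pod} {ns} -o yaml",   # 4. running manifest vs declared spec
        f"kubectl rollout history deployment/{deployment} {ns}",  # 5. recent changes / drift
    ]
```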

&lt;p&gt;Terminal-native tools that maintain context across sequential commands handled these triage sequences more coherently than tools designed around single-prompt interactions. The agent didn't lose the output from step 2 when it reached step 5.&lt;/p&gt;

&lt;p&gt;The same pattern held for CI/CD migration tasks. Converting a Jenkinsfile to GitHub Actions requires reading the existing pipeline, understanding the stage semantics, mapping plugins to Actions equivalents, and preserving approval gates -- a multi-step reasoning process that benefits from persistent context.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cloud-native integration cuts both ways
&lt;/h3&gt;

&lt;p&gt;Amazon Q's AWS depth is genuinely useful. On AWS-specific tasks, it surfaced best practices and service relationships that generic tools missed. If your infrastructure is 100% AWS, this integration is a clear advantage.&lt;/p&gt;

&lt;p&gt;But most enterprise infrastructure teams don't live in a single cloud. The moment a task involved Azure resources, cloud-agnostic Kubernetes patterns, or multi-cloud networking, Q's advantage disappeared. And because the deep integration creates an expectation of quality, the gap between "AWS task" and "non-AWS task" is more jarring than with a tool that makes no cloud-specific promises.&lt;/p&gt;

&lt;p&gt;The lesson: match the tool's architecture to your infrastructure reality, not to any single cloud provider's pitch. Before signing an enterprise agreement, count how many of your infrastructure tasks touch only one cloud. If the answer is "most of them," cloud-native tools are a fit. If the answer is "about half," you need a tool that works everywhere, even if it's less deep on any single platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build your own benchmark in 15 minutes
&lt;/h2&gt;

&lt;p&gt;This is the most actionable part. Vendor benchmarks test vendor strengths. Your tasks test your reality.&lt;/p&gt;

&lt;p&gt;Here's a 5-task starter framework using your own infrastructure patterns. Each task has specific validation criteria so you're not just vibes-testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task 1: Generate a Terraform module from your standards.&lt;/strong&gt;&lt;br&gt;
Pick a resource type you deploy frequently. Ask each tool to generate a complete module following your naming conventions. Validate: does &lt;code&gt;terraform validate&lt;/code&gt; pass without edits? Are there hardcoded environment values? Does it include the variables and outputs your team expects?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task 2: Compose a multi-resource stack from intent.&lt;/strong&gt;&lt;br&gt;
Describe a real deployment in plain language: "2 web apps, 1 key vault, private endpoints, dev/staging/prod." Validate: is the dependency graph correct? Are environments explicitly separated? Is the state layout reviewable?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task 3: Triage a real incident from last month.&lt;/strong&gt;&lt;br&gt;
Pick a Kubernetes or infrastructure incident your team debugged recently. Give the tool the same starting information your on-call engineer had. Validate: does it follow ordered triage steps with concrete signals (probe failures, OOM events, config mismatches)? Does it suggest safe rollback paths?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task 4: Migrate a real pipeline.&lt;/strong&gt;&lt;br&gt;
Take an existing Jenkins, GitLab CI, or Azure DevOps pipeline and ask the tool to convert it to GitHub Actions (or vice versa). Validate: are stages and approval gates preserved? Does it handle secrets correctly for the target platform? Are existing rollback procedures maintained?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task 5: Review a real Helm chart for security.&lt;/strong&gt;&lt;br&gt;
Give the tool a Helm chart your team actually deploys. Validate: does it identify privilege escalation risks? Secret exposure? Does it suggest practical remediations that don't require destructive changes without approval?&lt;/p&gt;

&lt;p&gt;Run all five tasks on each tool you're evaluating. Score pass/fail on the validation criteria. The results will tell you more in 15 minutes than any vendor demo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Scoring template&lt;/span&gt;
&lt;span class="na"&gt;benchmark_tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform_module_generation&lt;/span&gt;
    &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;terraform_validate_passes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;no_hardcoded_env_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;required_variables_present&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;required_outputs_present&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pass | fail&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;multi_resource_composition&lt;/span&gt;
    &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;dependency_graph_correct&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;environment_boundaries_explicit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;state_layout_reviewable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pass | fail&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;incident_triage&lt;/span&gt;
    &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;ordered_triage_steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;concrete_signals_referenced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;safe_rollback_suggested&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pass | fail&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pipeline_migration&lt;/span&gt;
    &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;stages_preserved&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;approval_gates_preserved&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secrets_handled_correctly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rollback_procedures_maintained&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pass | fail&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm_security_review&lt;/span&gt;
    &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;privilege_escalation_identified&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secret_exposure_identified&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;remediations_non_destructive&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pass | fail&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
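&lt;p&gt;Turning the template's booleans into scores takes only a few lines. A minimal aggregator might look like this, assuming (as the template implies) that a task passes only if every criterion passes:&lt;/p&gt;

```python
def score_task(validations: dict[str, bool]) -> str:
    """A task passes only if every validation criterion is true."""
    return "pass" if all(validations.values()) else "fail"

def score_benchmark(results: dict[str, dict[str, bool]]) -> dict[str, str]:
    """Map each benchmark task name to an overall pass/fail result."""
    return {task: score_task(checks) for task, checks in results.items()}
```

&lt;p&gt;Run it once per tool and compare the resulting pass counts side by side.&lt;/p&gt;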



&lt;h2&gt;
  
  
  The honest take
&lt;/h2&gt;

&lt;p&gt;No tool won across all 20 tasks. That's the point.&lt;/p&gt;

&lt;p&gt;GitHub Copilot is the right choice for teams that want AI-assisted infrastructure changes flowing through existing PR review processes with explicit approval gates. If your governance model is "nothing ships without a reviewed PR," Copilot's architecture enforces that naturally.&lt;/p&gt;

&lt;p&gt;Claude Code is the right choice for infrastructure engineers who live in the terminal and need multi-step reasoning across complex dependency chains. If your typical task involves reading state from three sources and synthesizing a plan, terminal-native context management matters.&lt;/p&gt;

&lt;p&gt;Amazon Q Developer is the right choice for teams running deep on AWS who want cloud-native intelligence that understands service relationships. If your infrastructure is 90%+ AWS and staying there, Q's integration is a genuine advantage.&lt;/p&gt;

&lt;p&gt;The wrong choice is evaluating any of these tools on vendor benchmarks that test React components when your team writes Terraform and debugs Kubernetes. The second wrong choice is evaluating on a single task. Infrastructure engineers don't do one thing -- they context-switch between writing IaC, debugging incidents, reviewing security posture, and migrating pipelines in a single day. Your evaluation needs to cover that range.&lt;/p&gt;

&lt;p&gt;Build the 5-task benchmark with your own infrastructure patterns. It takes 15 minutes. It will tell you which tool fits your actual work better than any blog post -- including this one.&lt;/p&gt;

&lt;p&gt;The full benchmark deep dive with all 20 tasks and sources is here: &lt;a href="https://www.talk-nerdy-to-me.com/blog/infra-ai-agent-benchmark-copilot-claude-amazon-q" rel="noopener noreferrer"&gt;Infrastructure AI Agent Benchmark: Copilot vs Claude Code vs Amazon Q&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The companion playbook for building an agentic Terraform factory with MCP governance is here: &lt;a href="https://www.talk-nerdy-to-me.com/playbooks/agentic-terraform-factory-playbook" rel="noopener noreferrer"&gt;Agentic Terraform Factory Playbook&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about AI for infrastructure teams at &lt;a href="https://www.talk-nerdy-to-me.com" rel="noopener noreferrer"&gt;talk-nerdy-to-me.com&lt;/a&gt;. If you've run your own infrastructure benchmark across AI tools, I'd like to hear what you found. Drop a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Stop Yelling at Your AI: Why Your Claude 4.5 Prompts Are Breaking Claude 4.6</title>
      <dc:creator>Mathieu Kessler</dc:creator>
      <pubDate>Fri, 20 Feb 2026 17:23:59 +0000</pubDate>
      <link>https://forem.com/mathieu_kessler_8712ec765/stop-yelling-at-your-ai-why-your-claude-45-prompts-are-breaking-claude-46-3nj5</link>
      <guid>https://forem.com/mathieu_kessler_8712ec765/stop-yelling-at-your-ai-why-your-claude-45-prompts-are-breaking-claude-46-3nj5</guid>
      <description>&lt;p&gt;If you upgraded to Claude Opus 4.6 or Sonnet 4.6 and started getting worse results, over-engineered code, unnecessary tool calls, walls of text where a sentence would do, your model isn't broken. Your prompts are too loud.&lt;/p&gt;

&lt;p&gt;Claude 4.6 is significantly more responsive to system prompts than previous models. Instructions that needed aggressive emphasis in the 3.5 era now overtrigger. The model follows them too literally, using tools when unnecessary, adding abstractions nobody asked for, and generating verbose responses to simple questions.&lt;/p&gt;

&lt;p&gt;This is the most common migration mistake I'm seeing right now. Here's how to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: prompts designed for a model that didn't listen
&lt;/h2&gt;

&lt;p&gt;Claude 3.5 was good, but it sometimes needed prodding. Teams developed prompting habits to compensate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALL-CAPS emphasis: "CRITICAL:", "MUST:", "ALWAYS:", "NEVER:"&lt;/li&gt;
&lt;li&gt;Aggressive repetition: stating the same constraint three different ways&lt;/li&gt;
&lt;li&gt;Threat-level framing: "This is EXTREMELY IMPORTANT"&lt;/li&gt;
&lt;li&gt;Absolute rules: "ALWAYS use the search tool for EVERY question"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These patterns worked because 3.5 occasionally needed the extra push. They were a rational response to a model that sometimes ignored nuanced instructions.&lt;/p&gt;

&lt;p&gt;Claude 4.6 doesn't have that problem. It reads your system prompt carefully and follows it. When it sees "ALWAYS use the search tool for EVERY question," it uses the search tool for every question -- including ones where it already knows the answer, wasting time and tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dial-back prompting: the fix
&lt;/h2&gt;

&lt;p&gt;Anthropic's &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices" rel="noopener noreferrer"&gt;prompting best practices&lt;/a&gt; documentation recommends this directly: "Where you might have said 'CRITICAL: You MUST use this tool when...', you can use more normal prompting like 'Use this tool when...'." I've been calling it dial-back prompting -- reducing the intensity of your instructions to match a model that actually follows them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before and after
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Search tool usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE (Claude 3.5 era):
"CRITICAL: You MUST use the search tool for EVERY question. ALWAYS
verify your answers. NEVER respond without checking at least 2 sources.
This is EXTREMELY IMPORTANT."

AFTER (Claude 4.6 era):
"Use the search tool when you're not confident in your answer or when
the question involves recent events. For well-established facts, you
can respond directly. Aim for efficiency -- verify when it adds value,
not on every response."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Error handling in code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE:
"You MUST ALWAYS include error handling in every code example. NEVER
skip validation. CRITICAL: Every function needs try/catch blocks."

AFTER:
"The user is a junior developer who will copy-paste your code directly
into production. Include error handling so they don't ship fragile code
that crashes on edge cases. Include comments explaining what each error
handler catches."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Response formatting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE:
"ALWAYS use markdown. ALWAYS include headers. NEVER give a response
without proper formatting. This is NON-NEGOTIABLE."

AFTER:
"Format your response for readability. Use headers and lists for
complex topics. For simple answers, a direct response without
formatting is fine."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same intent in every case. Different volume. Better results.&lt;/p&gt;

&lt;h2&gt;
  
  
  The migration checklist
&lt;/h2&gt;

&lt;p&gt;Run through your existing system prompts and apply these changes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Remove ALL-CAPS emphasis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Replace every instance of CRITICAL, MUST, ALWAYS, NEVER, EXTREMELY IMPORTANT with normal-case language. Claude 4.6 doesn't need to be yelled at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Replace absolute rules with contextual guidance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"ALWAYS do X" becomes "Do X when [specific conditions]. Skip it when [other conditions]." Give the model judgment instead of mandates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Add explicit permission to NOT do things&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the counterintuitive one. Claude 4.6 is so instruction-following that it needs permission to be efficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"You don't need to use tools for every question."&lt;/li&gt;
&lt;li&gt;"Keep solutions minimal. Don't add abstractions or flexibility that wasn't requested."&lt;/li&gt;
&lt;li&gt;"Match your response length to the complexity of the question."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Test for over-engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After migrating your prompts, test with simple requests. Ask it to write a function that adds two numbers. If you get a class with dependency injection, your prompts still need dialing back.&lt;/p&gt;
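&lt;p&gt;As a first pass over step 1, a quick script can count the shouting tokens in each system prompt before you rewrite it. The token list is an illustrative assumption -- add whatever house patterns your own prompts use:&lt;/p&gt;

```python
import re

# Emphasis tokens that tend to overtrigger Claude 4.6.
# Illustrative list -- extend with your own patterns.
SHOUT_TOKENS = ["CRITICAL", "MUST", "ALWAYS", "NEVER",
                "EXTREMELY IMPORTANT", "NON-NEGOTIABLE"]

def audit_prompt(prompt: str) -> dict[str, int]:
    """Count occurrences of each all-caps emphasis token in a prompt."""
    return {
        token: len(re.findall(rf"\b{re.escape(token)}\b", prompt))
        for token in SHOUT_TOKENS
        if re.search(rf"\b{re.escape(token)}\b", prompt)
    }
```

&lt;p&gt;An empty result doesn't mean the prompt is done -- absolute rules in normal case still need contextual guidance -- but a non-empty one tells you exactly where to start dialing back.&lt;/p&gt;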

&lt;p&gt;If you want a systematic approach to auditing your existing prompts, I built a migration template for exactly this: the &lt;a href="https://www.nerdychefs.ai/pack/claude-prompt-engineering-suite/claude-migration-prompt-3x-to-4x" rel="noopener noreferrer"&gt;Claude 3.x to 4.x Migration Prompt&lt;/a&gt; walks through each prompt in your codebase and flags the patterns that need updating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivated instructions: explain the WHY
&lt;/h2&gt;

&lt;p&gt;There's a companion technique that works hand-in-hand with dial-back prompting: motivated instructions. Instead of stating bare rules, explain why the rule exists.&lt;/p&gt;

&lt;p&gt;Anthropic's documentation highlights this explicitly: Claude 4.x models generalize better from motivated instructions than from bare rules. Their example -- a TTS engine that can't handle ellipses -- demonstrates the pattern well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bare rule (unmotivated):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Never use ellipses in your response."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Motivated instruction:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Your response will be read aloud by a text-to-speech engine, so never
use ellipses since the TTS engine won't know how to pronounce them.
Also avoid other punctuation that could confuse audio rendering, like
nested parentheses or excessive dashes."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The motivated version does two things the bare rule can't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It generalizes.&lt;/strong&gt; The model understands the underlying principle (TTS compatibility) and avoids other TTS-unfriendly formatting you didn't think to mention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It handles edge cases.&lt;/strong&gt; When the model encounters a situation you didn't explicitly cover, it can reason from the principle instead of defaulting to literal compliance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Bare rules get followed literally. Motivated rules get followed intelligently.&lt;/p&gt;

&lt;h3&gt;
  
  
  More motivated instruction examples
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For a customer support bot:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Unmotivated: "Keep responses under 100 words."

Motivated: "Customers reaching this bot are frustrated and looking for
quick resolution. Keep responses concise -- under 100 words when possible
-- because long responses increase abandonment. Front-load the answer
before any explanation."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For a code review agent:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Unmotivated: "Flag all functions longer than 50 lines."

Motivated: "This codebase has a history of maintenance problems caused
by oversized functions. Flag functions longer than 50 lines because
they tend to accumulate hidden side effects. Suggest logical split
points rather than just flagging the length."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Three more migration traps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trap 1: Over-delegation (Opus 4.6 specific)
&lt;/h3&gt;

&lt;p&gt;Opus 4.6 has a strong tendency to spawn subagents -- delegating subtasks to parallel instances. Anthropic's docs confirm this: "Claude Opus 4.6 has a strong predilection for subagents and may spawn them in situations where a simpler, direct approach would suffice."&lt;/p&gt;

&lt;p&gt;Add to your system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Work directly (no subagents) when:
- The task is a single-file edit or simple lookup
- Steps must be sequential and share context
- The overhead of delegation exceeds the time saved

Delegate to subagents when:
- Tasks are truly independent and can run in parallel
- Each task requires different specialized context"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Trap 2: Over-tooling
&lt;/h3&gt;

&lt;p&gt;Claude 4.6 will use every tool you give it access to if your prompt doesn't constrain usage. If your system prompt says "use tools to verify everything," it will make API calls to verify what 2 + 2 equals.&lt;/p&gt;

&lt;p&gt;Add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Use tools when they add value, not reflexively. For questions you can
answer confidently from training data, respond directly."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Trap 3: Verbose responses
&lt;/h3&gt;

&lt;p&gt;Without explicit guidance, Claude 4.6 tends toward thorough (read: long) responses. A question that warrants a one-line answer gets five paragraphs.&lt;/p&gt;

&lt;p&gt;Add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Match your response length to the complexity of the question. Simple
questions get simple answers. Don't pad responses with caveats,
disclaimers, or context the user didn't ask for."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The bigger picture: from prompt engineering to context engineering
&lt;/h2&gt;

&lt;p&gt;The dial-back prompting shift reflects something larger. Anthropic's &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt; blog post frames the evolution: the job is no longer about finding the right words in a single prompt. It's about managing the entire information environment around inference.&lt;/p&gt;

&lt;p&gt;In the Claude 3.5 era, prompt engineering was about finding the magic words. In the Claude 4.6 era, it's about designing the right system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt structure:&lt;/strong&gt; XML-tagged sections for role, tools, context, rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool definitions:&lt;/strong&gt; What tools are available and when to use them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context loading:&lt;/strong&gt; What information the model has access to and when&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails:&lt;/strong&gt; What the model should and shouldn't do -- expressed as principles, not commands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The models got better at listening. Now we need to get better at talking to them.&lt;/p&gt;

&lt;p&gt;Stop yelling. Start designing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I wrote a full guide covering Claude prompting fundamentals -- XML architecture, adaptive thinking modes, model selection with current pricing, and all 18 templates referenced above: &lt;a href="https://www.nerdychefs.ai/learn/claude-prompting" rel="noopener noreferrer"&gt;nerdychefs.ai/learn/claude-prompting&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sources:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices" rel="noopener noreferrer"&gt;Anthropic: Prompting Best Practices&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;Anthropic: Effective Context Engineering for AI Agents&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Free prompts and AI workflows at &lt;a href="https://nerdychefs.ai" rel="noopener noreferrer"&gt;NerdyChefs.ai&lt;/a&gt; and &lt;a href="https://github.com/kesslernity" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>promptengineering</category>
      <category>programming</category>
    </item>
    <item>
      <title>Cognitive Debt Is Not Technical Debt -- and Your AI Coding Tools Are Creating It</title>
      <dc:creator>Mathieu Kessler</dc:creator>
      <pubDate>Fri, 20 Feb 2026 09:07:32 +0000</pubDate>
      <link>https://forem.com/mathieu_kessler_8712ec765/cognitive-debt-is-not-technical-debt-and-your-ai-coding-tools-are-creating-it-7jc</link>
      <guid>https://forem.com/mathieu_kessler_8712ec765/cognitive-debt-is-not-technical-debt-and-your-ai-coding-tools-are-creating-it-7jc</guid>
      <description>&lt;p&gt;I work in AI deployment for enterprise. Over the past year I've watched the same pattern play out on every team that adopted AI coding tools. First two months: velocity is up, everyone's excited, PRs are flying. Month three or four: someone gets paged at 2 AM, opens the failing function, and realizes they have no idea what it does. They accepted it from Copilot three months ago and never traced the logic.&lt;/p&gt;

&lt;p&gt;The incident takes four hours instead of one. The postmortem says "root cause unclear." Nobody writes down what actually happened, which is that the team shipped code they didn't understand and got away with it until they didn't.&lt;/p&gt;

&lt;p&gt;We have a name for code you know is bad. Technical debt. You shipped something suboptimal, you know it's suboptimal, and you plan to fix it later. Fine. That's a conscious trade-off.&lt;/p&gt;

&lt;p&gt;We don't have a standard name for code you don't even know is bad. Code that works, passes tests, looks clean, and sits in production for months until it breaks in a way nobody can diagnose. I've been calling it cognitive debt: the growing gap between what your codebase does and what your team actually understands about it.&lt;/p&gt;

&lt;p&gt;The difference matters. Technical debt is visible. You can point at it and say "we need to refactor this." Cognitive debt is invisible. You don't know you have it until something breaks and the person who merged it says "I'm not sure what this part does."&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;I started paying attention to this because of the data, not because of a theory.&lt;/p&gt;

&lt;p&gt;METR ran a randomized controlled trial in 2025. Experienced developers -- not beginners, not students, experienced engineers working on their own codebases -- were 19% slower when using AI tools than without them. That number surprised a lot of people. It didn't surprise me. I'd watched it happen.&lt;/p&gt;

&lt;p&gt;Cortex's 2026 Engineering Benchmark showed a 23.5% increase in incidents per pull request. GitClear found code churn nearly doubled, from 3.1% to 5.7%. Code being rewritten shortly after being written. Stack Overflow's 2025 survey had developer trust in AI output at 33%. Usage at 76%. Three quarters of developers using tools they don't trust to produce correct output.&lt;/p&gt;

&lt;p&gt;These numbers don't mean AI tools are bad. They mean something about how teams use them is broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it happens
&lt;/h2&gt;

&lt;p&gt;The mechanism is simple. Accepting AI-generated code is effortless. Understanding it takes work. Deadline pressure always favors speed.&lt;/p&gt;

&lt;p&gt;Every developer starts active. You prompt, you read the output carefully, you modify it, you trace the logic, you run your own tests. Then after a few weeks of the AI being mostly right, you start skipping steps. You read less carefully. You run the tests but don't trace the logic. You accept more functions in a single pass. You drift from active to passive without noticing.&lt;/p&gt;

&lt;p&gt;Active use looks like this: you write the function signatures, types, and interfaces. AI fills the implementation. You trace the data flow, verify against your mental model of the system, and can explain the code to a colleague without reading it again.&lt;/p&gt;

&lt;p&gt;Passive use looks like this: you prompt, accept multi-function output without reading each function, merge because it "looks right," and move on. A week later you couldn't explain what it does.&lt;/p&gt;

&lt;p&gt;The problem is that both patterns feel the same in the moment. The code compiles. The tests pass. The PR gets approved. The difference only shows up at 2 AM when something breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it costs
&lt;/h2&gt;

&lt;p&gt;I tracked this informally across a few teams I work with. The costs show up in three places.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident response time.&lt;/strong&gt; When the on-call engineer opens a file and doesn't understand it, MTTR goes up. Not a little -- I saw 2-3x on files where the original author couldn't explain the code. The debugging process becomes archaeology: reading git blame, finding the PR, trying to reconstruct what the code is supposed to do before you can figure out why it's not doing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Onboarding.&lt;/strong&gt; New engineers joining a team with heavy AI-generated code can't orient themselves. Normally you onboard by reading the code and asking the people who wrote it why they made certain decisions. When the answer is "Copilot wrote that, I'm not sure why it does it this way," the onboarding path breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Churn.&lt;/strong&gt; Developers rewrite code they accepted but didn't understand. Not because the code is wrong, but because they can't maintain it without understanding it, and understanding someone else's AI-generated code is harder than rewriting it. This is the churn number in GitClear's analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The prevention framework
&lt;/h2&gt;

&lt;p&gt;I built an open-source kit because I couldn't find one that addressed this systematically. Most advice about AI coding tools is either "be careful" (vague) or "don't use them" (unrealistic). I wanted something a team could actually adopt.&lt;/p&gt;

&lt;p&gt;The kit operationalizes five patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Living architecture documentation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A file called MEMORY.md that lives in your repo root. It captures the "why" behind architectural decisions, security constraints, naming conventions, and what I call AI-free zones -- code paths where AI may draft but a human must own and rewrite (auth, payments, data deletion, migrations).&lt;/p&gt;

&lt;p&gt;The act of writing this file forces comprehension. The act of maintaining it prevents drift. And if you feed it to your AI tool as context, the tool generates code consistent with your codebase instead of inventing new patterns.&lt;/p&gt;
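&lt;p&gt;A minimal sketch of what such a file might contain -- the section names, paths, and owners here are illustrative, not prescribed by the kit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# MEMORY.md

## Why it's built this way
- REST over GraphQL: three legacy integrations depend on it.
- All money values are integer cents. Never floats.

## Conventions
- Errors bubble up; no bare catch blocks.
- Feature code lives in src/features/&amp;lt;name&amp;gt;, one concern per module.

## AI-free zones (AI may draft; the named human must own and rewrite)
- src/auth/**        -- owner: @alice
- src/payments/**    -- owner: @bob
- migrations/**      -- owner: @carol
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;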

&lt;p&gt;&lt;strong&gt;2. Comprehension checkpoint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three questions before you accept AI output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can I trace the data flow without reading line-by-line?&lt;/li&gt;
&lt;li&gt;Could I rewrite this from scratch if the AI disappeared?&lt;/li&gt;
&lt;li&gt;Would I catch a bug in this at 2 AM?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any answer is "no," you stop and understand first. This takes 60 seconds and it's the single highest-leverage practice in the kit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. PR comprehension gate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The PR template includes a mandatory AI explanation section. If you used AI, you explain: what was generated, what approach was used, what alternatives you considered, and what edge cases you tested. In your own words, not the AI's summary.&lt;/p&gt;

&lt;p&gt;The point isn't disclosure for its own sake. It's that writing the explanation forces you to verify your understanding. If you can't write it, you don't understand the code well enough to own it in production.&lt;/p&gt;
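&lt;p&gt;A hypothetical version of that PR template section -- the exact wording and examples will differ per team:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## AI assistance (required if any code was AI-generated)

- What was generated: retry wrapper around the payments client
- Approach used: exponential backoff with jitter, max 5 attempts
- Alternatives considered: queue-based retry; rejected as overkill here
- Edge cases tested: timeout mid-retry, non-idempotent request rejected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;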

&lt;p&gt;&lt;strong&gt;4. Code review guardrails.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A five-layer review framework designed for AI-generated code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layer 1: Comprehension verification (does the author understand what they submitted?)&lt;/li&gt;
&lt;li&gt;Layer 2: Silent failure detection (AI loves bare catch blocks and default returns that hide failures)&lt;/li&gt;
&lt;li&gt;Layer 3: Codebase consistency (does the AI code match your established patterns?)&lt;/li&gt;
&lt;li&gt;Layer 4: Security review (for sensitive paths)&lt;/li&gt;
&lt;li&gt;Layer 5: Test coverage (AI-written tests for AI-generated code need extra scrutiny)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Layer 1 is the one that doesn't exist in standard code review. Normally you trust that the author understands their own code. With AI, that assumption is no longer safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Blast radius control.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PRs under 200 lines. One concern per PR. 100% test coverage on AI-generated paths. These constraints exist because you cannot review what you cannot hold in your head.&lt;/p&gt;
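The 200-line budget is easy to automate in CI. A sketch of the counting half of such a check, assuming you feed it the output of `git diff --numstat` (the helper name and the 200-line limit are this article's convention, not a standard):

```python
# Hypothetical CI helper enforcing a 200-line PR budget.
# Expects `git diff --numstat base...head` output: each line is
# "added TAB deleted TAB path"; binary files report "-" for both counts.
def total_lines_changed(numstat: str) -> int:
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a numstat line
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # skip binary files
        total += int(added) + int(deleted)
    return total

sample = "12\t3\tsrc/auth.ts\n-\t-\tlogo.png\n180\t40\tsrc/api.ts\n"
changed = total_lines_changed(sample)
print(changed)         # 235
print(changed <= 200)  # False -- over budget, split the PR
```

Wire the result into your pipeline however you like; the point is that the constraint is mechanical, so enforcement should be too.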

&lt;h2&gt;
  
  
  The team layer
&lt;/h2&gt;

&lt;p&gt;Individual practices help but don't scale. The kit includes a team playbook with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehension reviews&lt;/strong&gt;: once per sprint, rotating pairs read AI-heavy PRs from the previous sprint. Not the author reviewing their own code. Someone else reading it and adding "why" comments. Anything nobody can explain gets flagged for rewrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quarterly audit&lt;/strong&gt;: identify high-churn files, cross-reference with AI origin, test whether two engineers can explain each one. Produce a watchlist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint metric&lt;/strong&gt;: a 5-minute "cognitive debt pulse" in every sprint review -- churn rate, incidents where root cause was unclear, comprehension reviews completed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's also an incident response procedure with a cognitive debt assessment phase, an escalation procedure for when someone can't understand code they need to work on, and an onboarding procedure that calibrates new engineers on the team's AI practices before their first PR.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI-free zones
&lt;/h2&gt;

&lt;p&gt;Some code paths are too consequential for cognitive debt. The kit includes a framework for declaring them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication and authorization&lt;/li&gt;
&lt;li&gt;Payment processing&lt;/li&gt;
&lt;li&gt;Data deletion (compliance risk)&lt;/li&gt;
&lt;li&gt;Database migrations (irreversible)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI may draft code in these areas. A human engineer must own, understand, and substantially rewrite it. The kit's MEMORY.md template has a dedicated section for listing these paths and assigning human owners.&lt;/p&gt;

&lt;p&gt;I work in an industry where AI-generated code in the wrong place can have physical safety consequences. The "AI can prepare, humans must decide" principle isn't theoretical for me. But even in a standard SaaS codebase, an auth bypass caused by an AI-generated edge case nobody traced is the same class of problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who this is for
&lt;/h2&gt;

&lt;p&gt;Engineering teams where AI coding tools are already in use. If your team uses Copilot, Cursor, Claude Code, or any LLM-based tool on production code, you probably already have cognitive debt. The question is whether you're managing it or waiting for the incident that reveals it.&lt;/p&gt;

&lt;p&gt;The kit is designed to be forked and customized. Templates use TypeScript/Node.js conventions by default but are language-agnostic. The guidelines and procedures are tool-agnostic -- they apply regardless of which AI tool you use.&lt;/p&gt;

&lt;p&gt;Everything is MIT licensed. Take what's useful, ignore what isn't, adapt it for your team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The repo:&lt;/strong&gt; &lt;a href="https://github.com/kesslernity/cognitive-debt-prevention-kit" rel="noopener noreferrer"&gt;github.com/kesslernity/cognitive-debt-prevention-kit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The research behind it:&lt;/strong&gt; &lt;a href="https://www.talk-nerdy-to-me.com/blog/cognitive-debt-ai-coding-hidden-cost" rel="noopener noreferrer"&gt;talk-nerdy-to-me.com/blog/cognitive-debt-ai-coding-hidden-cost&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I build AI infrastructure at &lt;a href="https://kesslernity.com" rel="noopener noreferrer"&gt;Kesslernity&lt;/a&gt;. Free prompts and tools at &lt;a href="https://nerdychefs.ai" rel="noopener noreferrer"&gt;NerdyChefs.ai&lt;/a&gt; and &lt;a href="https://github.com/kesslernity" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you seen cognitive debt on your team? What are you doing about it?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>codequality</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Claude Code agent teams: what the docs don't tell you</title>
      <dc:creator>Mathieu Kessler</dc:creator>
      <pubDate>Mon, 16 Feb 2026 18:23:21 +0000</pubDate>
      <link>https://forem.com/mathieu_kessler_8712ec765/claude-code-agent-teams-what-the-docs-dont-tell-you-1iae</link>
      <guid>https://forem.com/mathieu_kessler_8712ec765/claude-code-agent-teams-what-the-docs-dont-tell-you-1iae</guid>
      <description>&lt;p&gt;Claude Code quietly shipped agent teams as an experimental feature. The idea is simple: instead of one AI agent working through your task sequentially, you spin up multiple agents that work in parallel, communicate with each other, and coordinate through a shared task list.&lt;/p&gt;

&lt;p&gt;I've been running team sessions for the past week. Some went well. Some burned through tokens and produced nothing useful. Here's what I figured out.&lt;/p&gt;

&lt;h3&gt;
  
  
  You need to enable it manually
&lt;/h3&gt;

&lt;p&gt;Agent teams are off by default. Enable them in &lt;code&gt;.claude/settings.json&lt;/code&gt; (per project) or &lt;code&gt;~/.claude/settings.json&lt;/code&gt; (global) -- keep the file pure JSON, since JSON doesn't allow comments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or export it in your shell: &lt;code&gt;export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Subagents and agent teams are different things
&lt;/h3&gt;

&lt;p&gt;This tripped me up at first and I've seen the same confusion everywhere. Claude Code has two parallelism mechanisms, and they work nothing alike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subagents&lt;/strong&gt; use the Task tool. You spawn a focused worker, it does a thing, and returns results to the parent. Workers can't talk to each other. The parent manages everything. Token cost is lower because results get summarized back into the parent's context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent teams&lt;/strong&gt; are fully independent Claude Code sessions. Each teammate has its own context window, loads project context (CLAUDE.md, MCP servers) automatically, and can message other teammates directly. They coordinate through a shared task list with file-locked claiming and dependency tracking.&lt;/p&gt;

&lt;p&gt;The practical difference: subagents are like sending someone to fetch an answer. Agent teams are like putting three people in a room to work on a problem together.&lt;/p&gt;

&lt;p&gt;Here's when each one makes sense:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use subagents when...&lt;/th&gt;
&lt;th&gt;Use agent teams when...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;You need quick, focused results&lt;/td&gt;
&lt;td&gt;Teammates need to discuss and challenge each other&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks are independent — no cross-talk needed&lt;/td&gt;
&lt;td&gt;Work requires coordination across multiple layers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You want lower token costs&lt;/td&gt;
&lt;td&gt;Parallel exploration adds real value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The parent can manage all coordination&lt;/td&gt;
&lt;td&gt;Tasks have clear file boundaries but need collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I'd estimate 70-80% of tasks that feel like they need a team actually work fine with subagents. The Task tool with &lt;code&gt;subagent_type&lt;/code&gt; set to Explore, Plan, or general-purpose handles most research and implementation tasks without the coordination overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cost math
&lt;/h3&gt;

&lt;p&gt;Each teammate is a full Claude session with its own context window. A 3-teammate team uses roughly 3-4x the tokens of sequential work.&lt;/p&gt;

&lt;p&gt;That's not a rounding error. For a task that takes 20 minutes sequentially, spinning up a team to finish it in 8 minutes costs 3-4x more. Whether that trade-off makes sense depends on how much your time is worth relative to API spend.&lt;/p&gt;

&lt;p&gt;The tasks where teams clearly pay off:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-file features with clean boundaries (frontend + backend + tests)&lt;/li&gt;
&lt;li&gt;Large refactors where multiple directories can be worked on simultaneously&lt;/li&gt;
&lt;li&gt;Adversarial review patterns (one agent implements, another tries to break it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tasks where they don't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything touching fewer than 4-5 files&lt;/li&gt;
&lt;li&gt;Sequential work where step 2 depends on step 1's output&lt;/li&gt;
&lt;li&gt;Quick bug fixes (the coordination overhead exceeds the time saved)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  File conflicts are the #1 failure mode
&lt;/h3&gt;

&lt;p&gt;Two teammates editing the same file doesn't produce a merge conflict. It produces data loss. One agent's changes silently overwrite the other's.&lt;/p&gt;

&lt;p&gt;This was my most expensive lesson. Before any parallel coding, you need to map out exactly which agent owns which files. No shared files. If two tasks need to touch the same file, they run sequentially, not in parallel.&lt;/p&gt;

&lt;p&gt;The pattern that works: decompose your task into pieces where each piece owns a distinct set of files. If you can't draw clean boundaries, the task isn't a good candidate for a team.&lt;/p&gt;
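&lt;p&gt;What that boundary map can look like written down -- the paths and agent roles here are invented for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent A (backend):  src/api/**, src/services/billing.ts
Agent B (frontend): src/components/**, src/pages/checkout.tsx
Agent C (tests):    tests/**

Shared file (sequential only): src/types/billing.ts -- Agent A edits it
first and hands off before B or C touch anything that imports it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;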

&lt;h3&gt;
  
  
  Teammates don't know what you know
&lt;/h3&gt;

&lt;p&gt;This is counterintuitive. Your main Claude Code session (the "lead") has your full conversation history — the bug you described, the architecture you discussed, the constraints you mentioned. When you spawn a teammate, they start with a blank conversation. They load CLAUDE.md and MCP servers, but they don't inherit any of your context.&lt;/p&gt;

&lt;p&gt;This means your spawn prompts need to be detailed. Not "implement the auth module" but "implement an auth module using JWT tokens stored in httpOnly cookies, following the patterns in src/middleware/auth.ts, compatible with the Express 5 router setup in src/app.ts, with unit tests in tests/auth.test.ts."&lt;/p&gt;

&lt;p&gt;The more context you front-load into the spawn prompt, the less the teammate drifts. I started writing spawn prompts that were 200-300 words and the quality of output went up immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  The workflow that actually produces results
&lt;/h3&gt;

&lt;p&gt;After several failed attempts, here's the sequence I settled on:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Decide if you need a team at all.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ask yourself: can I decompose this into 3+ parallel tasks with zero file overlap? If no, use sequential work or subagents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Decompose into agent-sized pieces.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each piece needs: a clear goal, explicit file ownership, a definition of done, and an integration contract (how this piece connects to the others). Vague task descriptions produce vague results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Map file boundaries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Write it down. Agent A owns these files. Agent B owns these files. No overlap. If you skip this step, you will lose work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Plan first, then swarm.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spawn one agent in plan mode to design the architecture. Review and approve the plan. Then spawn parallel agents to implement against the approved plan. This prevents agents from making conflicting architectural decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Run a retrospective.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the team finishes, look at what each agent actually did, how many tokens each consumed, and where time was wasted. The cost breakdown alone will change how you structure the next run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Things that will bite you
&lt;/h3&gt;

&lt;p&gt;A few gotchas that aren't in the docs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No session resumption.&lt;/strong&gt; If your terminal session drops, &lt;code&gt;/resume&lt;/code&gt; and &lt;code&gt;/rewind&lt;/code&gt; don't restore teammates. The team is gone. This matters for long-running tasks — save checkpoints or break work into shorter team sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks get stuck.&lt;/strong&gt; Teammates sometimes forget to mark tasks as completed, which blocks any dependent tasks downstream. If something looks frozen, check the task list manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shutdown is slow.&lt;/strong&gt; Teammates finish their current API request before stopping. If an agent is mid-generation on a long response, you'll wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift happens.&lt;/strong&gt; Letting a team run unattended for 30+ minutes increases the chance of wasted work. Agents can go down rabbit holes, misinterpret requirements, or produce code that conflicts with another agent's approach. Check in periodically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One team per session.&lt;/strong&gt; Clean up the current team (shut down all teammates, then delete the team) before starting a new one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start with non-code tasks
&lt;/h3&gt;

&lt;p&gt;If you haven't used agent teams before, don't start with parallel implementation. Start with tasks where parallel exploration is valuable but file conflicts aren't a risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-angle PR review&lt;/strong&gt;: one agent checks logic correctness, another checks test coverage, a third checks security and error handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competing hypotheses debugging&lt;/strong&gt;: agent A investigates "this is a race condition," agent B investigates "this is a stale cache," they report findings and you decide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research from multiple angles&lt;/strong&gt;: one agent reads the library docs, another searches for known issues on GitHub, a third looks at alternative libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tasks let you learn how messaging, task claiming, and coordination work without the stress of parallel file edits.&lt;/p&gt;
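&lt;p&gt;For the multi-angle review pattern, a spawn prompt for one of the reviewers might look like this -- the PR number and focus areas are placeholders to adapt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"You are one of three reviewers for PR #142. Your angle is security and
error handling only -- ignore style and test coverage, other teammates
own those. Review the diff against main. Look for: unvalidated input,
secrets in code, bare catch blocks, and failure paths that silently
return defaults. Report findings as a list with file and line
references. Do not edit any files."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;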

&lt;h3&gt;
  
  
  A 19-prompt playbook
&lt;/h3&gt;

&lt;p&gt;After running enough team sessions to see the patterns, I built a set of reusable prompts covering the full lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decide&lt;/strong&gt; (prompts 1-5): Should I use a team? How do I decompose the task? What model for each agent? How do I map file boundaries?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrate&lt;/strong&gt; (prompts 6-12): Plan-then-swarm coordination, delegate mode, adversarial code review, parallel TDD, handoff reports, real-time steering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt; (prompts 13-15): Hooks that auto-validate work before an agent goes idle or marks a task complete, plan approval checklists&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recover&lt;/strong&gt; (prompts 16-19): Unsticking looping agents, optimizing token costs, retrospectives, anti-pattern detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full playbook is free: &lt;a href="https://www.nerdychefs.ai/learn/agent-teams" rel="noopener noreferrer"&gt;Claude Code Agent Teams Playbook on NerdyChefs.ai&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The honest take
&lt;/h3&gt;

&lt;p&gt;Agent teams are a power tool, not a default workflow. The coordination overhead — context duplication, file boundary management, spawn prompt writing, monitoring for drift — means they only pay off for genuinely parallel work with clean boundaries.&lt;/p&gt;

&lt;p&gt;For a solo developer working on a focused feature, sequential Claude Code with occasional subagent calls is cheaper and more predictable. For a large cross-layer feature where three agents can work on frontend, backend, and tests simultaneously without touching each other's files, teams save real time.&lt;/p&gt;

&lt;p&gt;The key is being honest about which category your task falls into before you spawn anything.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What patterns have you found with agent teams? I'm particularly interested in hearing about non-obvious use cases where the parallel coordination actually paid off.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Prompt Engineering Mental Model Most People Get Wrong</title>
      <dc:creator>Mathieu Kessler</dc:creator>
      <pubDate>Fri, 13 Feb 2026 14:00:32 +0000</pubDate>
      <link>https://forem.com/mathieu_kessler_8712ec765/the-prompt-engineering-mental-model-most-people-get-wrong-2c08</link>
      <guid>https://forem.com/mathieu_kessler_8712ec765/the-prompt-engineering-mental-model-most-people-get-wrong-2c08</guid>
      <description>&lt;p&gt;I've written over 1,300 prompts. I test them, I publish them for free, and I watch what happens when people use them.&lt;/p&gt;

&lt;p&gt;And there's this pattern I can't unsee anymore. Someone grabs a prompt, pastes it in, gets a decent result, and stops there. Someone else grabs the same prompt, tweaks it for their situation, gets a first result, realizes it's not quite right, adjusts, and ends up with something genuinely useful. Same starting point. Completely different outcome.&lt;/p&gt;

&lt;p&gt;The difference isn't the prompt. It's what happens around it. How you set it up, how you read the output, what you do when the first result misses the mark.&lt;/p&gt;

&lt;p&gt;The real lever is the &lt;em&gt;thinking behind the prompt&lt;/em&gt;. There are maybe five mental models that, once you internalize them, make you better with any model, any prompt, any task. I want to walk through them because I rarely see anyone writing about this layer, and I think it matters more than the prompts themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  You're delegating, not searching
&lt;/h2&gt;

&lt;p&gt;The biggest mistake people make is treating AI like a search engine. You type a question, you get an answer. That's the wrong frame entirely.&lt;/p&gt;

&lt;p&gt;What you're actually doing is delegating to someone who is smart, fast, and has zero context about your situation. Like a new hire on their first day. If you said "write me a report" to a new hire and walked away, you'd get something useless. You know this. So you'd explain who it's for, what format they expect, what happened last time, what "good" looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Write a project status report."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"You're writing a weekly status report for the project director.
Non-technical executive audience — they care about schedule,
cost, and the top 3 risks. Nothing else.

One-page summary up front. Direct tone, flag bad news early.
Red/Amber/Green status for each work package.

Here's last week's report: [paste]
Here's this week's data: [paste]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second one isn't a better prompt. It's a better briefing. And it's the exact same skill as delegating to a person. I actually realized at some point that my prompts got dramatically better when I stopped thinking about "prompt engineering" and started thinking about how I brief my team.&lt;/p&gt;

&lt;p&gt;If your prompts suck, check how you delegate to humans. Might be the same problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  You need a trust framework before you need a better prompt
&lt;/h2&gt;

&lt;p&gt;This one took me months to figure out, and I only got there because I work in an industry where if someone acts on bad AI output without reviewing it, people can get hurt. Real physical harm.&lt;/p&gt;

&lt;p&gt;So I had to get very precise about when I trust the output and when I don't. Not after I read it. Before.&lt;/p&gt;

&lt;p&gt;If I'm asking AI to reformat something I wrote, or brainstorm off my own notes, I can use it pretty much directly. I'm the source material. The AI is just reshaping it. Fine.&lt;/p&gt;

&lt;p&gt;If it's a first draft for someone outside my team, or code that touches a database, I read it carefully. Every paragraph, every function.&lt;/p&gt;

&lt;p&gt;If it's giving me numbers, dates, citations, anything regulatory, I verify every single thing. This is where hallucinations hide and it took me embarrassingly long to learn this lesson. The dangerous outputs aren't the obviously wrong ones. They're the ones that &lt;em&gt;look&lt;/em&gt; polished and professional and happen to contain a fabricated statistic on page 3 that you skim right past because the writing is so smooth.&lt;/p&gt;

&lt;p&gt;Now the first thing I do before writing any prompt is decide: am I going to use this output directly, review it, or verify it line by line? That decision shapes everything downstream.&lt;/p&gt;
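&lt;p&gt;To make those three tiers concrete, here's a minimal sketch of that pre-prompt decision as code. The task categories and tier names are hypothetical labels for illustration, not any real API:&lt;/p&gt;

```python
# A minimal sketch of the trust-tier decision. The task categories and
# tier names below are hypothetical labels, chosen for illustration.

def review_level(task: str) -> str:
    """Map a task category to the review it needs, decided BEFORE prompting."""
    tiers = {
        # I'm the source material; the AI only reshapes it.
        "reformat_own_notes": "use directly",
        "brainstorm_from_my_input": "use directly",
        # Output leaves my team or touches real systems: read every line.
        "external_draft": "review carefully",
        "db_code": "review carefully",
        # Numbers, dates, citations, anything regulatory: verify each claim.
        "statistics": "verify line by line",
        "citations": "verify line by line",
        "regulatory": "verify line by line",
    }
    # Unknown task? Default to the strictest tier.
    return tiers.get(task, "verify line by line")
```

The useful part isn't the lookup table; it's that the decision happens before the prompt is written, so it can shape the prompt itself.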

&lt;h2&gt;
  
  
  Context windows will bite you
&lt;/h2&gt;

&lt;p&gt;Quick technical thing. The AI forgets. Between conversations, it has zero memory. Within a conversation, it has a limit, and once you blow past it, your early instructions start disappearing.&lt;/p&gt;

&lt;p&gt;I lost 20 minutes of work once because of this. Had a long conversation where I spent the first 10 messages dialing in the exact tone for a document. Asked for a final draft at message 40-something. Got back the model's default generic style. All my tone instructions were gone from the context window.&lt;/p&gt;

&lt;p&gt;Since then: I put the important stuff up front. And for anything complex, I break it into stages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: "Analyze this data, list the top 5 findings"
→ I pick the 3 that actually matter
Step 2: "Expand findings 1, 3, 5 into paragraphs with evidence"
→ I edit and add my own interpretation
Step 3: "Write an executive summary. Under 200 words, no jargon"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're not having a conversation. You're managing state. And you need to inject your own judgment between each step, which is where the actual value comes from.&lt;/p&gt;
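&lt;p&gt;That staged workflow can be sketched in Python. &lt;code&gt;call_model&lt;/code&gt; here is a stand-in stub, not a real API; the point is that each stage only receives the state you explicitly pass forward, with room for your own judgment in between:&lt;/p&gt;

```python
# Sketch of "managing state" across stages. `call_model` is a stub
# standing in for a real LLM call; each stage receives ONLY the state
# explicitly passed forward, so nothing silently falls out of context.

def call_model(prompt: str, state: str) -> str:
    # Stub: a real implementation would send prompt + state to a model.
    return f"[model output for: {prompt} | given: {state}]"

def staged_report(data: str) -> str:
    # Step 1: analyze, get candidate findings.
    findings = call_model("List the top 5 findings", state=data)
    # Human judgment between steps: keep only what actually matters
    # (e.g. findings 1, 3 and 5 after reading them).
    selected = findings
    # Step 2: expand the selected findings with evidence.
    paragraphs = call_model("Expand the selected findings with evidence",
                            state=selected)
    # Edit and add your own interpretation here before the final pass.
    # Step 3: compress for the audience.
    return call_model("Write an executive summary, under 200 words",
                      state=paragraphs)
```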

&lt;h2&gt;
  
  
  Write specs, not wishes
&lt;/h2&gt;

&lt;p&gt;The number one cause of bad AI output is a vague prompt. Not a bad model. A vague prompt.&lt;/p&gt;

&lt;p&gt;"Make this code better." Better how? Faster? More readable? Fewer lines? The model guesses. You wanted something else.&lt;/p&gt;

&lt;p&gt;I started writing prompts the way I write function signatures. What goes in, what comes out, what the constraints are. Name the format explicitly. Tell it what to leave out. ("No introductions, no filler." You'd be amazed how much this helps.) Show it an example of good output instead of trying to describe what you want in words. Use numbers instead of adjectives because "brief" means nothing and "under 150 words" means something.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad: "Summarize this meeting."

Better: "Pull out the 3 decisions, the open action items with
owners, and any deadlines. Bullet points. No discussion context.
Under 100 words."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Side effect: getting better at this made me better at writing Jira tickets. Ambiguity is ambiguity regardless of who's on the receiving end.&lt;/p&gt;
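&lt;p&gt;The "spec, not wish" idea maps almost literally onto a function signature. A hypothetical sketch, with made-up field names, just to make it concrete:&lt;/p&gt;

```python
from dataclasses import dataclass

# Treating a prompt like a function signature: explicit inputs, output
# format, and constraints. A hypothetical helper for illustration only.

@dataclass
class PromptSpec:
    task: str             # what goes in
    output_format: str    # what comes out
    max_words: int        # numbers, not adjectives like "brief"
    exclude: list         # what to leave out

    def render(self) -> str:
        return (
            f"{self.task}\n"
            f"Format: {self.output_format}. Under {self.max_words} words.\n"
            f"Do not include: {', '.join(self.exclude)}."
        )

spec = PromptSpec(
    task="Pull out the 3 decisions, open action items with owners, and deadlines.",
    output_format="bullet points",
    max_words=100,
    exclude=["introductions", "filler", "discussion context"],
)
print(spec.render())
```

Every field the dataclass forces you to fill in is a guess the model no longer has to make.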

&lt;h2&gt;
  
  
  The centaur thing
&lt;/h2&gt;

&lt;p&gt;There's this chess story from 2005. A tournament where you could enter as a human, a computer, or any combination. Grandmasters entered. Top chess engines entered.&lt;/p&gt;

&lt;p&gt;The winners were two amateurs with three laptops and a good process for knowing when to trust the engine vs. override it.&lt;/p&gt;

&lt;p&gt;I think about this constantly. The point isn't that AI is powerful. It's that the human+AI team beats both, and that the process matters more than how good either party is individually.&lt;/p&gt;

&lt;p&gt;I have a rough split in my head. I do the judgment work: deciding what matters, understanding context nobody wrote down, making calls with incomplete information, anything with safety or ethical weight. AI does the volume work: processing lots of text, maintaining consistency, generating variations, applying patterns I've defined.&lt;/p&gt;

&lt;p&gt;The failure mode I keep seeing is people asking AI to make judgment calls. Priority decisions. Risk assessments. And it'll do it, confidently, in clean bullet points. And it'll be wrong in subtle ways that only someone with domain knowledge would catch.&lt;/p&gt;

&lt;p&gt;My workflow now: I set the constraints and define what I want. AI drafts. I cut and revise. AI formats the revised version. I review one last time and ship. Each handoff is deliberate. I don't outsource the thinking parts, and the AI doesn't wait around when it could be generating.&lt;/p&gt;

&lt;h2&gt;
  
  
  What shifts
&lt;/h2&gt;

&lt;p&gt;I still write prompts, and I still publish collections of them. They're useful starting points. But the thing that actually made me better at working with AI was internalizing these patterns. The prompts are the what. The mental models are the why. And once the why clicks, you stop needing someone else's prompt for every new situation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Free prompts and workflows at &lt;a href="https://nerdychefs.ai" rel="noopener noreferrer"&gt;NerdyChefs.ai&lt;/a&gt; and &lt;a href="https://github.com/kesslernity" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What mental model changed how you use AI? Drop a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>365 Microsoft Copilot Prompts I Wish I Had When I Started</title>
      <dc:creator>Mathieu Kessler</dc:creator>
      <pubDate>Mon, 09 Feb 2026 16:29:04 +0000</pubDate>
      <link>https://forem.com/mathieu_kessler_8712ec765/365-microsoft-copilot-prompts-i-wish-i-had-when-i-started-1mdj</link>
      <guid>https://forem.com/mathieu_kessler_8712ec765/365-microsoft-copilot-prompts-i-wish-i-had-when-i-started-1mdj</guid>
      <description>&lt;h2&gt;
  
  
  The $30/month disappointment
&lt;/h2&gt;

&lt;p&gt;Six months ago, my company rolled out Microsoft 365 Copilot. $30/user/month. The pitch deck looked amazing: "AI-powered productivity!" "Work smarter!" "Automate everything!"&lt;/p&gt;

&lt;p&gt;Week 1 reality check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Me: "Help me with my emails"
Copilot: "I'd be happy to help! What would you like to know about emails?"
Me: 😐
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generic prompts from blog posts didn't work. ChatGPT tutorials were useless. I was paying $30/month for a fancy chatbot that couldn't access my actual work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem wasn't Copilot. It was the prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why your ChatGPT prompts fail in Copilot
&lt;/h2&gt;

&lt;p&gt;Here's what nobody tells you about M365 Copilot:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT doesn't know about your work. Copilot does.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;M365 Copilot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Your emails&lt;/td&gt;
&lt;td&gt;❌ Can't access&lt;/td&gt;
&lt;td&gt;✅ Reads Outlook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Your files&lt;/td&gt;
&lt;td&gt;❌ Copy-paste only&lt;/td&gt;
&lt;td&gt;✅ Accesses SharePoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Your data&lt;/td&gt;
&lt;td&gt;❌ None&lt;/td&gt;
&lt;td&gt;✅ Excel, Teams, OneDrive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actions&lt;/td&gt;
&lt;td&gt;❌ Text only&lt;/td&gt;
&lt;td&gt;✅ Creates docs, sends emails&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This means ChatGPT prompts like "write me an email" are useless in Copilot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But M365-optimized prompts?&lt;/strong&gt; Game-changer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The difference between generic and enterprise prompts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generic prompt (doesn't work):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Summarize my emails"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Copilot has no idea which emails, what context, or what format you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise-optimized prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Analyze emails from the last 7 days about Project Phoenix. 
Create a summary table with:
- Key decisions made
- Action items (with owners)
- Budget impacts
- Blockers

Format for executive audience (my VP reads this)."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Copilot accesses Outlook, analyzes 47 emails, creates a formatted table, identifies 12 action items, and flags 3 budget concerns. Takes 30 seconds instead of 2 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the difference.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5 prompts that changed how I work
&lt;/h2&gt;

&lt;p&gt;After 6 months of daily Copilot use, these are my MVPs:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Email Intelligence Prompt (Outlook)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Drowning in email threads with 15+ people.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Review the email thread with subject '[Subject Line]'. 
Extract:

DECISIONS MADE:
- [List with dates and who decided]

ACTION ITEMS:
- [Task] - Owner: [Name] - Due: [Date]

DEPENDENCIES:
- [What's blocking progress]

CONTEXT I MIGHT HAVE MISSED:
- [Important details from earlier in thread]

Format as a scannable brief (I need to respond in 5 minutes)."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time saved:&lt;/strong&gt; 15-20 minutes per complex thread.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Meeting Prep Assassin (Teams + Outlook)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Back-to-back meetings, no time to prep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I have a meeting about [Topic] in 10 minutes with [Names].

Search my emails and Teams chats from the last 2 weeks. 
Create a brief with:

WHAT THEY CARE ABOUT:
- [Their priorities based on communications]

OPEN QUESTIONS FROM LAST TIME:
- [Unresolved items]

DECISIONS NEEDED TODAY:
- [What we need to decide]

LANDMINES TO AVOID:
- [Sensitive topics based on email tone]

Keep it under 200 words. I'm reading this in the elevator."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time saved:&lt;/strong&gt; Hard to put a number on; without it I'd go in blind. The brief takes 20 seconds to generate.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Data Detective (Excel)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Messy Excel file, need insights fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Analyze [Sheet Name] in this Excel file.

Find:
1. TOP 3 ANOMALIES (things that look wrong/unusual)
2. HIDDEN PATTERNS (correlations I might miss)
3. RECOMMENDED ACTIONS (what should I investigate)

Then create a summary table showing:
- Total by [Category]
- Growth rate vs last [Period]
- Top 5 and Bottom 5 performers

Explain like I'm not a data analyst (because I'm not)."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time saved:&lt;/strong&gt; 45 minutes of manual analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Document Synthesis Machine (Word + SharePoint)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Need to compare 5 different policy documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I need to compare these documents: [List files in SharePoint]

Create a comparison table showing:
- WHAT CHANGED between versions
- CONFLICTS (where they contradict each other)  
- GAPS (what's covered in one but not others)
- RECOMMENDED APPROACH (which version to use for what)

Highlight any legal/compliance differences in RED.

I'm presenting this to leadership in 30 minutes."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time saved:&lt;/strong&gt; This used to take me 3 hours. Now it takes 2 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The Cross-App Workflow (Excel → PowerPoint → Outlook)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Create exec dashboard from data, present it, send follow-ups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STEP 1 (Excel):
"Create a pivot table from [Sheet] showing revenue by region and product. 
Add a trend chart for Q3-Q4. Highlight top 3 and bottom 3 performers."

STEP 2 (PowerPoint):
"Create a 3-slide exec summary:
Slide 1: Overall performance (use the Excel chart)
Slide 2: Top 3 wins
Slide 3: Bottom 3 concerns + recommended actions

Use our corporate template. Keep it punchy."

STEP 3 (Outlook):
"Draft an email to leadership with:
- Subject: Q4 Revenue Analysis - Key Insights
- Body: 3-sentence summary + attached deck
- Tone: Confident but flag concerns
- Ask: 'Can we discuss bottom 3 in Tuesday's call?'"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time saved:&lt;/strong&gt; 4-6 hours of work automated in one flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I organized 365 prompts
&lt;/h2&gt;

&lt;p&gt;After building these workflows for 6 months, I realized: &lt;strong&gt;nobody has a comprehensive M365 Copilot prompt library for enterprise work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So I built one.&lt;/p&gt;

&lt;h3&gt;
  
  
  The structure:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;🟢 Quick Start (50 prompts)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work in any Copilot tier (even free)&lt;/li&gt;
&lt;li&gt;No M365 data access needed&lt;/li&gt;
&lt;li&gt;Universal workflows (writing, learning, planning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🟡 Advanced (315 prompts)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require M365 Copilot or Copilot for Enterprise&lt;/li&gt;
&lt;li&gt;Organized by:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;App:&lt;/strong&gt; Outlook (70), Excel (40), PowerPoint (35), Word (40)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role:&lt;/strong&gt; Sales, Engineering, HR, Finance, Legal, etc. (12 roles)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case:&lt;/strong&gt; Email intelligence, data analysis, document synthesis&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key features:
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;GCSE Framework examples&lt;/strong&gt; (Microsoft's official prompting methodology)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Cross-app workflows&lt;/strong&gt; (the magic nobody talks about)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Troubleshooting guide&lt;/strong&gt; (when Copilot refuses or hallucinates)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Before/after comparisons&lt;/strong&gt; (so you see the difference)&lt;/p&gt;

&lt;h3&gt;
  
  
  What makes it different:
&lt;/h3&gt;

&lt;p&gt;Most prompt libraries are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generic ("write me an email" 🙄)&lt;/li&gt;
&lt;li&gt;ChatGPT-focused (not M365-specific)&lt;/li&gt;
&lt;li&gt;Beginner-level only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This library is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise-grade (actually used in production)&lt;/li&gt;
&lt;li&gt;M365-optimized (leverages Outlook, Excel, Teams integration)&lt;/li&gt;
&lt;li&gt;Multi-level (beginners → power users)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The GitHub repo
&lt;/h2&gt;

&lt;p&gt;I open-sourced everything: &lt;strong&gt;365 prompts, fully categorized, ready to use.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/kesslernity/awesome-microsoft-copilot-prompts" rel="noopener noreferrer"&gt;kesslernity/awesome-microsoft-copilot-prompts&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What you get:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📁 awesome-microsoft-copilot-prompts/
├── 📄 README.md (navigation + GCSE framework)
├── 📁 prompts/
│   ├── 🟢 QUICK-START.md (50 universal prompts)
│   ├── 📧 outlook.md (70 email prompts)
│   ├── 📊 excel.md (40 data prompts)
│   ├── 📊 powerpoint.md (35 presentation prompts)
│   ├── 📝 word.md (40 document prompts)
│   ├── 💼 sales.md (role-specific)
│   ├── ⚙️ engineering.md (role-specific)
│   └── ... (12 role categories)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to use it:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Browse by role&lt;/strong&gt; (Sales, Engineering, HR, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browse by app&lt;/strong&gt; (Outlook, Excel, PowerPoint, Word)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browse by use case&lt;/strong&gt; (Email intelligence, data analysis)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Copy prompt → Paste in Copilot → Customize for your context&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;No signup. No paywall. Just prompts that work.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned building this
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Specificity beats cleverness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bad prompt: "Help me with this email"&lt;br&gt;&lt;br&gt;
Good prompt: "Analyze this email thread. Extract decisions, action items, blockers. Format for my VP."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Context is everything&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Copilot works best when you tell it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GOAL:&lt;/strong&gt; What you're trying to achieve&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CONTEXT:&lt;/strong&gt; Who the audience is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOURCE:&lt;/strong&gt; Which files/emails to use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EXPECTATIONS:&lt;/strong&gt; What format you need&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(This is Microsoft's GCSE framework, included in the repo.)&lt;/p&gt;
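&lt;p&gt;One way to internalize the four parts is to treat them as parameters you always have to fill in. A quick sketch; the function and example values are mine, not anything Copilot requires:&lt;/p&gt;

```python
# Assembling a prompt along the four GCSE parts (Goal, Context, Source,
# Expectations). The function name and example values are illustrative.

def gcse_prompt(goal: str, context: str, source: str, expectations: str) -> str:
    return "\n".join([
        f"Goal: {goal}",
        f"Context: {context}",
        f"Source: {source}",
        f"Expectations: {expectations}",
    ])

prompt = gcse_prompt(
    goal="Summarize this week's Project Phoenix emails",
    context="Non-technical VP audience",
    source="Outlook, last 7 days, subject contains 'Phoenix'",
    expectations="Summary table: decisions, action items with owners, blockers",
)
```

If any of the four arguments is hard to fill in, that's usually the gap that would have produced a vague result.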

&lt;p&gt;&lt;strong&gt;3. Cross-app workflows are magic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real power isn't in individual prompts.&lt;/p&gt;

&lt;p&gt;It's in chaining them: Excel → PowerPoint → Outlook in one flow.&lt;/p&gt;

&lt;p&gt;Most people never discover this because tutorials focus on single-app usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Enterprise ≠ Consumer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consumer AI: "Make this sound professional"&lt;br&gt;&lt;br&gt;
Enterprise AI: "Analyze 47 emails about budget overruns. Flag compliance risks. Create action plan by department."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Totally different use cases.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The prompts I'm still building
&lt;/h2&gt;

&lt;p&gt;I'm actively maintaining this repo. Here's what I'm working on:&lt;/p&gt;

&lt;p&gt;🚧 &lt;strong&gt;Coming soon:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance &amp;amp; audit workflows&lt;/li&gt;
&lt;li&gt;Multi-language support (international teams)&lt;/li&gt;
&lt;li&gt;Industry-specific prompts (healthcare, finance, legal)&lt;/li&gt;
&lt;li&gt;Integration with Power Automate&lt;/li&gt;
&lt;li&gt;Video tutorials for complex workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What am I missing?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built this based on my experience, but I'm one person. What prompts would help YOU?&lt;/p&gt;

&lt;p&gt;💬 &lt;strong&gt;Tell me in the comments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's your role?&lt;/li&gt;
&lt;li&gt;What's your biggest Copilot pain point?&lt;/li&gt;
&lt;li&gt;What workflows take you the most time?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll add the most-requested prompts to the repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;If you use Microsoft 365 Copilot (or are about to start):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;⭐ Star the repo&lt;/strong&gt; if you find it useful (helps others discover it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📖 Browse the prompts&lt;/strong&gt; that match your role&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;💬 Open an issue&lt;/strong&gt; if you have a prompt request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔀 Submit a PR&lt;/strong&gt; if you have prompts to contribute&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The goal:&lt;/strong&gt; Make M365 Copilot actually worth the $30/month.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/kesslernity/awesome-microsoft-copilot-prompts" rel="noopener noreferrer"&gt;GitHub: awesome-microsoft-copilot-prompts&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I built this
&lt;/h2&gt;

&lt;p&gt;Six months ago, I was frustrated with Copilot.&lt;/p&gt;

&lt;p&gt;Generic prompts didn't work. ChatGPT tutorials were useless. I felt like I was paying $30/month for nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then I learned how to write enterprise prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My email time went from 2 hours/day → 30 minutes/day.&lt;br&gt;&lt;br&gt;
My meeting prep went from 20 minutes → 20 seconds.&lt;br&gt;&lt;br&gt;
My status reports went from 1 hour → 5 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I got 10+ hours back every week.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But I had to figure this out the hard way. Trial and error. Lots of error.&lt;/p&gt;

&lt;p&gt;So I documented everything. 365 prompts. All the workflows that work. All the mistakes to avoid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now you don't have to learn the hard way.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just use the prompts. Save 10 hours/week. Actually get value from your Copilot license.&lt;/p&gt;

&lt;p&gt;And if you find it useful? &lt;strong&gt;Star the repo.&lt;/strong&gt; It helps other people discover it.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/kesslernity/awesome-microsoft-copilot-prompts" rel="noopener noreferrer"&gt;kesslernity/awesome-microsoft-copilot-prompts&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt; What prompts should I add next? Drop a comment below. 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow me for more enterprise AI content:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;X: &lt;a href="https://twitter.com/mathieu_ai" rel="noopener noreferrer"&gt;@mathieu_ai&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;LinkedIn: &lt;a href="https://linkedin.com/in/mathieumonsellato" rel="noopener noreferrer"&gt;Mathieu Monsellato&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Blog: &lt;a href="https://nerdychefs.ai" rel="noopener noreferrer"&gt;NerdyChefs.ai&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
