Forem: Ali Suleyman TOPUZ

Building an Eval Stack for a LangGraph Agent: From LangFuse to AWS AgentCore

Ali Suleyman TOPUZ — Sat, 11 Apr 2026 21:17:26 +0000

How a two-week evaluation design sprint almost ended with us switching tools entirely — and what we learned from not doing that.

There’s a particular kind of confidence that sneaks up on you when you’re building an LLM agent. You test it manually a few times, it gives reasonable answers, and you think: okay, this works. Then someone on the team asks, “but how do you know it works?” and suddenly that confidence gets a lot wobblier.

That question is what led us down the rabbit hole of LLM evaluation infrastructure for our agent system — a multi-layer, tool-heavy setup built on LangGraph, running in AWS Bedrock AgentCore, with a FastMCP server handling tool calls. We had three distinct layers — conversation, orchestration, and search — and each of them could fail in different, non-obvious ways. “It works” wasn’t good enough. We needed proof.

This is the story of how we built the eval stack, nearly replaced it entirely, and ended up with a decision framework that I think applies well beyond our specific setup.

1. The Problem: What Does “Eval” Even Mean for an Agent?

Here’s something that doesn’t get said enough: evaluation for an LLM agent is not one thing. It’s at least three.

In a traditional software system, you write unit tests for functions, integration tests for services, and end-to-end tests for flows. An agent has the same stratification — except that the “functions” are probabilistic, the “services” are external model APIs, and the “flows” involve multi-turn conversations with context that mutates across turns.

For our LangGraph-based agent, we identified three evaluation concerns that had to be addressed independently:

Conversation quality — Is the final response accurate? Is it grounded in the retrieved context? Is it relevant to what the user actually asked?
Orchestration quality — Is the agent routing to the right tools? Is it invoking them with correct parameters? Is it retrying sensibly when something goes wrong?
Search/retrieval quality — When a RAG-like tool call happens, is the context that comes back actually useful? Is the retrieved content faithful to the source?

A monolithic evaluator that just looks at the final output misses everything in the middle. You can have a response that looks good but was assembled from hallucinated intermediate steps, or a retrieval call that returned garbage that the LLM happened to paper over with prior knowledge. You won’t catch either of those without layer-specific evals.

This is the core reason why we needed a structured approach rather than ad-hoc testing.

2. The Test Fixtures: .eval.yaml

Before picking any tools, we needed a consistent format for defining what we were testing. We settled on a pattern tied to our internal ticketing approach: one .eval.yaml file per test case, each a simple YAML structure with two required fields:

# example: booking_intent.eval.yaml

test_input:
  conversation_id: "test-001"
  user_message: "I need to extend my stay by two nights, checking out on Friday instead of Wednesday."
  session_context:
    property_id: "prop_42"
    current_checkout: "2024-11-20"

success_criteria:
  - type: contains_tool_call
    tool: "modify_reservation"
    with_params:
      new_checkout: "2024-11-22"
  - type: llm_judge
    metric: response_relevance
    threshold: 0.85
  - type: deterministic
    check: no_hallucinated_dates

The test_input captures a realistic scenario — not a synthetic toy, but something derived from actual usage patterns or edge cases that were reported. The success_criteria is a mixed list of deterministic checks (did the right tool get called with the right parameters?) and LLM-judged metrics (is the response relevant, faithful, grounded?).

Why separate YAML files per test case rather than a big test suite file? A few reasons: they’re easier to review in PRs, they can be tagged and filtered independently, and they map cleanly to the tickets or stories that motivated them. When a new edge case surfaces in production, you create one new file and the eval pipeline picks it up automatically.
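To make the deterministic half of success_criteria concrete, here is a minimal sketch of how a contains_tool_call criterion could be checked. It assumes the agent run has already been reduced to a list of recorded tool calls shaped like {"tool": ..., "params": {...}}. That shape and both function names are hypothetical illustrations, not our actual runner.

```python
from typing import Any


def check_contains_tool_call(
    criterion: dict[str, Any], tool_calls: list[dict[str, Any]]
) -> bool:
    """Return True if any recorded call matches the tool name and every expected param."""
    expected_params = criterion.get("with_params", {})
    for call in tool_calls:
        if call.get("tool") != criterion["tool"]:
            continue
        params = call.get("params", {})
        # Extra params on the call are allowed; only the listed ones must match.
        if all(params.get(key) == value for key, value in expected_params.items()):
            return True
    return False


def split_criteria(success_criteria: list[dict[str, Any]]):
    """Separate LLM-judged criteria from everything else (treated as deterministic)."""
    judged = [c for c in success_criteria if c["type"] == "llm_judge"]
    deterministic = [c for c in success_criteria if c["type"] != "llm_judge"]
    return deterministic, judged


# Mirrors the success_criteria of the booking_intent fixture.
criteria = [
    {"type": "contains_tool_call", "tool": "modify_reservation",
     "with_params": {"new_checkout": "2024-11-22"}},
    {"type": "llm_judge", "metric": "response_relevance", "threshold": 0.85},
    {"type": "deterministic", "check": "no_hallucinated_dates"},
]

# Hypothetical record of what the agent did during one run.
recorded_calls = [
    {"tool": "lookup_reservation", "params": {"property_id": "prop_42"}},
    {"tool": "modify_reservation",
     "params": {"new_checkout": "2024-11-22", "property_id": "prop_42"}},
]

deterministic, judged = split_criteria(criteria)
tool_call_ok = check_contains_tool_call(criteria[0], recorded_calls)
```

Keeping the two groups apart at load time is what later lets the runner report them separately.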

3. First Proposal: LangFuse + Ragas + DeepEval

Our initial evaluation stack combined three tools, each with a distinct role:

LangFuse handles tracing and observability. Every LangGraph node execution gets captured — what went in, what came out, how long it took, what the token counts looked like. It’s the backbone that gives you visibility into what the agent actually did, not just what it said.

Ragas provides the core RAG-oriented metrics. Faithfulness (is the response supported by the retrieved context?), answer relevance (does the answer actually address the question?), context precision, context recall. These are the metrics that matter for retrieval-augmented flows.

DeepEval fills in the rest — hallucination detection, task-specific metrics, toxicity checking, and the ability to define custom metrics with your own rubrics. It also provides the test runner infrastructure that ties everything together.

Docker Compose Setup to Get This Running (Local)

version: "3.9"

services:
  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://langfuse:langfuse@postgres:5432/langfuse
      - NEXTAUTH_SECRET=your-secret-here
      - NEXTAUTH_URL=http://localhost:3000
      - SALT=your-salt-here
    depends_on:
      - postgres

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

The Python-side wiring looks as follows:

# eval_runner.py
import yaml
import os
from langfuse.callback import CallbackHandler
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from ragas import evaluate as ragas_evaluate

LANGFUSE_HANDLER = CallbackHandler(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="http://localhost:3000",
)

def load_eval_fixture(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def run_agent_with_tracing(test_input: dict) -> dict:
    """Run the LangGraph agent with LangFuse tracing attached."""
    from your_agent import graph # your compiled LangGraph graph

    result = graph.invoke(
        {"messages": [{"role": "user", "content": test_input["user_message"]}]},
        config={"callbacks": [LANGFUSE_HANDLER]},
    )
    return result

def evaluate_response(fixture: dict, agent_output: dict):
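    chunks = agent_output.get("retrieved_chunks", [])
    test_case = LLMTestCase(
        input=fixture["test_input"]["user_message"],
        actual_output=agent_output["final_response"],
        retrieval_context=chunks,  # read by FaithfulnessMetric
        context=chunks,  # read by HallucinationMetric, which needs `context`
    )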

    metrics = [
        AnswerRelevancyMetric(threshold=0.8),
        FaithfulnessMetric(threshold=0.8),
        HallucinationMetric(threshold=0.3),
    ]

    evaluate([test_case], metrics)

The callback handler is the key integration point — LangFuse hooks into every LangGraph step automatically through the callbacks mechanism, so you get full trace visibility without changing your agent code.

4. The Reviewer Comments That Shaped the Design

We opened this design up for internal review, expecting mostly rubber-stamping. We got something more useful: a few pointed questions that fundamentally shaped how we thought about the problem.

“Which of these metrics are deterministic and which use an LLM judge?”

This turned out to be more important than it first appeared. Deterministic checks (did tool X get called, did parameter Y have value Z) are stable across runs. LLM-judge metrics are not. They can vary with the judge model you use, the phrasing of the prompt template, and the judge model’s own non-determinism. A score of 0.83 today might be 0.79 tomorrow, not because your agent got worse, but because the judge scored the same output slightly differently. We needed to track these separately and be explicit about which was which in our YAML fixtures.
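One way to keep judge noise from being misread as a regression is to score the same output several times and compare any drop against the observed spread. The sketch below is an illustration under assumptions: the judge is any zero-argument callable returning a score, and the "two standard deviations" rule is an arbitrary choice made for this example, not something we validated.

```python
import statistics
from typing import Callable


def judge_with_spread(judge: Callable[[], float], runs: int = 5) -> dict[str, float]:
    """Call the judge several times on the same output and summarise the scores."""
    scores = [judge() for _ in range(runs)]
    # stdev needs at least two data points.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": statistics.mean(scores), "stdev": spread}


def is_real_regression(
    baseline: float, current: float, noise_stdev: float, k: float = 2.0
) -> bool:
    """Treat a drop as real only if it exceeds k times the judge's own noise."""
    return (baseline - current) > k * noise_stdev


# Stand-in for a real LLM judge: replays a fixed series of slightly different scores.
fake_scores = iter([0.83, 0.79, 0.81, 0.84, 0.78])
summary = judge_with_spread(lambda: next(fake_scores), runs=5)

small_drop = is_real_regression(0.83, 0.79, summary["stdev"])
large_drop = is_real_regression(0.83, 0.70, summary["stdev"])
```

With this series the 0.83 to 0.79 drop stays inside the noise band, while a drop to 0.70 does not.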

“What’s the judge model, and do you have a plan for judge model bias?”

If you’re using GPT-4 as your judge and your agent is also using GPT-4, you’re likely getting inflated scores — the judge model tends to favor outputs that look like its own outputs. We added a note in our evaluation config to pin the judge model to a different provider than the agent model, and to document this explicitly.

“Where do I see the scores? Per-run? Aggregated over time?”

The answer at that point was “in the terminal.” That wasn’t good enough. We added LangFuse dashboards for score aggregation and set up alerts for when any metric dropped below threshold on consecutive runs.

“What does a health check look like for the eval pipeline itself?”

Good question. We added a canary fixture — a trivially easy test case that should always pass — that runs first in every eval job. If the canary fails, something is wrong with the eval infrastructure, not the agent, and the run is aborted before generating misleading results.
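The canary gate can be sketched in a few lines. Everything here is hypothetical naming: fixtures are modelled as zero-argument callables that return pass/fail, and EvalInfrastructureError is an exception invented for the example.

```python
from typing import Callable, Iterable


class EvalInfrastructureError(RuntimeError):
    """Raised when the eval pipeline itself looks broken, not the agent."""


def run_eval_job(
    canary: Callable[[], bool], fixtures: Iterable[Callable[[], bool]]
) -> list[bool]:
    """Run the canary first; only run the real fixtures if it passes."""
    if not canary():
        # Abort before producing scores that would be blamed on the agent.
        raise EvalInfrastructureError(
            "canary fixture failed; aborting before running real fixtures"
        )
    return [fixture() for fixture in fixtures]


# A healthy pipeline: the canary passes, so both real fixtures are executed.
results = run_eval_job(
    canary=lambda: True,
    fixtures=[lambda: True, lambda: False],
)
```

The important property is that a failing canary raises before any real fixture is touched, so no misleading results are recorded.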

Small comments. They had a bigger impact on the final design than most of the actual architecture decisions.

5. The AWS Alternative: AgentCore Evaluations

While we were finalizing the LangFuse + Ragas + DeepEval design, our colleague Maciej was separately evaluating AWS’s native evaluation offering through Bedrock AgentCore. The pitch was compelling: fewer moving parts, native tracing that integrates with everything else in the AgentCore stack, and a production path that doesn’t require managing three separate services.

The proposal was to run a hybrid approach:

+----------------------+-----------------------------+
| Layer                | Tool                        |
+----------------------+-----------------------------+
| Tracing              | AgentCore native tracing    |
| Built-in metrics     | AgentCore Evaluations       |
| Custom metrics       | DeepEval (kept)             |
| Test fixtures        | .eval.yaml (kept)           |
+----------------------+-----------------------------+

The reduction in moving parts is real. Instead of running LangFuse locally or self-hosted, you lean on AgentCore’s built-in tracing. Instead of standing up Ragas metrics computation, you use AgentCore’s built-in faithfulness and context relevance metrics.

We were genuinely tempted. The operational simplicity argument is hard to ignore when you’re a small team.

6. Where the Native Replacements Break Down

Then we actually compared the metrics side by side. And this is where the “native” story got complicated.

Faithfulness: Same Name, Different Problem

AgentCore provides a metric called Builtin.Faithfulness. Ragas provides a metric called faithfulness. They sound equivalent. They are not.

Ragas faithfulness asks: “Are the claims in the response supported by the retrieved context?” It decomposes the response into individual claims, checks each claim against the context, and computes a ratio. It’s specifically a RAG faithfulness check.
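The claim-ratio idea can be sketched as follows. This is not the Ragas implementation: in practice both the claim extraction and the per-claim verification are done by an LLM, whereas here the claims are given by hand and the support check is a deliberately naive substring match used only to make the arithmetic visible.

```python
from typing import Callable


def faithfulness_ratio(
    claims: list[str],
    context: list[str],
    is_supported: Callable[[str, list[str]], bool],
) -> float:
    """Fraction of claims that the checker finds supported by the context."""
    if not claims:
        # Nothing to verify; returning 0.0 is an arbitrary choice in this sketch.
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_is_supported(claim: str, context: list[str]) -> bool:
    """Toy stand-in for an LLM verdict: case-insensitive substring match."""
    return any(claim.lower() in chunk.lower() for chunk in context)


context = [
    "Hotel policy: standard check-out is at 11:00 AM.",
    "Late check-out available until 14:00 for EUR 30.",
]
claims = [
    "check-out is at 11:00 AM",   # stated in the first chunk
    "late check-out is free",     # contradicts the second chunk
]

score = faithfulness_ratio(claims, context, naive_is_supported)
```

One of the two claims is grounded, so the ratio comes out at one half, regardless of how fluent the overall response reads.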

AgentCore’s Builtin.Faithfulness asks: "Is the response consistent with the input prompt and conversation history?" That's a consistency check, not a grounding check. For a RAG-heavy agent, these catch completely different failure modes. You can pass one and fail the other.

If you swap Ragas faithfulness for AgentCore’s built-in and call it a day, you’ve silently dropped one of your most important safety checks.

Context Relevance vs. Context Precision

Similar story with retrieval quality. Ragas has context_precision, which measures whether the relevant chunks are ranked at the top of what was retrieved. It’s a quality-of-retrieval metric: it penalizes you for retrieving ten chunks when only the bottom three were worth using.

AgentCore’s ContextRelevance measures whether the retrieved context is relevant to the input query at all. That's a different, weaker check. Passing context relevance just means you retrieved something related to the question. It says nothing about whether the retrieval was precise or whether the agent used the best available context.
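The gap between the two checks is easy to see with numbers. The sketch below uses one common rank-aware formulation (precision@k averaged over the positions of the relevant chunks) next to a bare "is anything relevant?" check. The boolean relevance labels are assumed to come from an upstream judge, and neither function is either tool's actual implementation.

```python
def rank_aware_precision(relevance: list[bool]) -> float:
    """Average precision@k over the ranks where a relevant chunk appears."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision@rank at this relevant position
    return total / hits if hits else 0.0


def any_relevant(relevance: list[bool]) -> bool:
    """Weak check: did the retriever return at least one related chunk?"""
    return any(relevance)


# Ten retrieved chunks, three of them relevant, in two different orders.
relevant_first = [True, True, True] + [False] * 7
relevant_last = [False] * 7 + [True, True, True]

good = rank_aware_precision(relevant_first)
bad = rank_aware_precision(relevant_last)
both_pass_weak_check = any_relevant(relevant_first) and any_relevant(relevant_last)
```

Both retrievals pass the weak check, but only the first one scores well once ranking is taken into account.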

Here’s a Side-by-Side Summary

+---------------------+----------------------------------+----------------------------------+
| Metric              | Ragas                            | AgentCore                        |
+---------------------+----------------------------------+----------------------------------+
| faithfulness        | Claims grounded in retrieved     | Response consistent with         |
|                     | context (RAG grounding check)    | conversation history             |
+---------------------+----------------------------------+----------------------------------+
| context quality     | context_precision: were the      | ContextRelevance: is retrieved   |
|                     | best chunks selected?            | context related to the query?    |
+---------------------+----------------------------------+----------------------------------+
| answer relevance    | answer_relevancy: does response  | Similar; reasonable overlap      |
|                     | address the question?            | here                             |
+---------------------+----------------------------------+----------------------------------+

The name similarity is what makes this dangerous. You could swap these metrics, see similar-looking scores on a simple test case, and conclude the migration is safe. The divergence only shows up on the cases where it matters — complex multi-hop retrieval, sparse context, adversarial inputs.

7. The Decision Framework: PoC First

Evaluation infrastructure is not business logic. You can swap it out. But you can also create invisible regressions if you swap it out carelessly — which is exactly what the faithfulness naming collision would have caused.

Rather than commit to a full migration on the strength of matching names, we defined a three-outcome PoC:

+-------------------+------------------------------------------+
| Outcome           | Action                                   |
+-------------------+------------------------------------------+
| Adopt             | AgentCore metrics are equivalent or      |
|                   | better: migrate fully                    |
+-------------------+------------------------------------------+
| Swap              | Some AgentCore metrics work, others      |
|                   | don't: hybrid approach                   |
+-------------------+------------------------------------------+
| Build custom      | Neither works well enough: write         |
|                   | custom metric using AgentCore's          |
|                   | custom evaluator API                     |
+-------------------+------------------------------------------+

The PoC scope was intentionally small: run the same 10 eval fixtures through both stacks in parallel, compare scores for the same inputs, and flag any case where the scores diverge by more than 15%. Two weeks of data. One shared dashboard.
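The comparison step itself is small. The sketch below assumes scores are on a 0 to 1 scale and reads "more than 15%" as an absolute gap of 0.15. That reading is an assumption, as are the fixture IDs and the scores, which are made up for illustration.

```python
def flag_divergent(
    scores_a: dict[str, float],
    scores_b: dict[str, float],
    max_gap: float = 0.15,
) -> list[str]:
    """Return fixture IDs whose scores differ by more than max_gap between stacks."""
    shared = scores_a.keys() & scores_b.keys()  # only compare fixtures both stacks ran
    return [
        fixture_id
        for fixture_id in sorted(shared)
        if abs(scores_a[fixture_id] - scores_b[fixture_id]) > max_gap
    ]


# Made-up scores for the same fixtures run through both stacks.
ragas_scores = {"canary": 1.0, "booking_intent": 0.91, "multi_hop_policy": 0.42}
agentcore_scores = {"canary": 1.0, "booking_intent": 0.88, "multi_hop_policy": 0.86}

flagged = flag_divergent(ragas_scores, agentcore_scores)
```

In this made-up data the simple cases agree and only the multi-hop fixture is flagged, which is the pattern the PoC is designed to surface.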

That’s a much cheaper way to answer the question than migrating your entire eval pipeline and discovering the issue three months into production.

8. A Minimal Local Eval Setup with Ollama

If you want to experiment with this pattern without spinning up cloud infrastructure, here’s a fully local setup using Ollama as the LLM judge. This runs entirely on your machine.

# docker-compose.eval-local.yml
version: "3.9"

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Pull a model after startup: docker compose -f docker-compose.eval-local.yml exec ollama ollama pull llama3

  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://langfuse:langfuse@postgres:5432/langfuse
      - NEXTAUTH_SECRET=dev-secret-change-in-prod
      - NEXTAUTH_URL=http://localhost:3000
      - SALT=dev-salt
    depends_on:
      - postgres

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  ollama_data:
  pgdata:

# eval_local.py — uses Ollama as the judge model
import os
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
import requests
import json

class OllamaJudge(DeepEvalBaseLLM):
    """Custom DeepEval judge backed by a local Ollama model."""

    def __init__(self, model_name: str = "llama3"):
        self.model_name = model_name

    def load_model(self):
        return self

    def generate(self, prompt: str) -> str:
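        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": self.model_name, "prompt": prompt, "stream": False},
            timeout=120,  # local models can be slow; fail rather than hang forever
        )
        response.raise_for_status()  # surface HTTP errors, e.g. model not pulled yet
        return response.json()["response"]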

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return f"ollama/{self.model_name}"

def run_local_eval(test_cases: list[dict]):
    judge = OllamaJudge(model_name="llama3")

    cases = [
        LLMTestCase(
            input=tc["input"],
            actual_output=tc["output"],
            retrieval_context=tc.get("context", []),
        )
        for tc in test_cases
    ]

    metrics = [
        AnswerRelevancyMetric(threshold=0.7, model=judge),
        FaithfulnessMetric(threshold=0.7, model=judge),
    ]

    evaluate(cases, metrics)

if __name__ == "__main__":
    sample_cases = [
        {
            "input": "What time is check-out?",
            "output": "Check-out is at 11:00 AM. Late check-out until 2 PM is available for an additional fee.",
            "context": [
                "Hotel policy: standard check-out at 11:00 AM.",
                "Late check-out available until 14:00 for EUR 30.",
            ],
        }
    ]
    run_local_eval(sample_cases)

Start everything with docker compose -f docker-compose.eval-local.yml up -d, pull a model with docker compose -f docker-compose.eval-local.yml exec ollama ollama pull llama3, and you have a fully local eval stack with no API keys, no cloud costs, and no data leaving your machine.

9. Lessons Learned

Eval infrastructure is not your core product — treat it as something you can swap. Don’t get attached to a specific tool. What matters is the fixture format and the success criteria. If your .eval.yaml files are tool-agnostic, you can migrate the underlying runner without losing any of the work you put into defining good tests.

“Native” does not mean “equivalent.” This seems obvious in retrospect, but the naming similarity between AgentCore’s Builtin.Faithfulness and Ragas's faithfulness is genuinely confusing. Always read the metric definition, not just the name. Check what it's actually measuring and whether that maps to the failure mode you care about.

Name similarity is a trap, especially when you’re under time pressure. When you’re evaluating tools quickly, you tend to match on names. That’s fine as a first pass, but it needs to be followed by an actual comparison on real data before you commit.

Keep your PoC scope small. Two weeks, ten fixtures, one shared dashboard. That’s enough to make a data-driven decision. The instinct to do a “comprehensive evaluation” before deciding is usually a way to delay the decision indefinitely. Define the minimum evidence you’d need to choose, run the experiment to get it, then choose.

Deterministic and LLM-judge metrics are different animals. Keep them separate in your fixtures, track them separately in your dashboards, and don’t conflate a drop in one with a drop in the other. A regression in tool-call correctness (deterministic) is a different kind of problem than a regression in faithfulness score (LLM judge) and needs a different debugging approach.

The eval stack is never really done. New failure modes emerge, judge models get updated, new metrics become available. But if you invest upfront in a good fixture format and a clear framework for comparing evaluation tools, you’re set up to evolve the infrastructure without losing ground. And the next time someone asks “but how do you know it works?” — you have an answer.

If you’ve run into interesting eval challenges with LangGraph or other agent frameworks, I’d be curious what metrics ended up being most useful for you. The more people share on this, the better the whole ecosystem gets.

Agentic Architectures — Article 4: Agentic Protocols (MCP and A2A)

Ali Suleyman TOPUZ — Tue, 31 Mar 2026 16:42:13 +0000

Interoperability and the “Connective Tissue” of AI

Every mature technology ecosystem eventually hits the same wall. Early on, everyone builds their own integrations — custom API wrappers, bespoke data formats, proprietary communication layers. It works, until the ecosystem grows large enough that the integration cost becomes the dominant cost. Then someone proposes a standard, half the industry argues about it for two years, and eventually something wins.

The agentic AI ecosystem is hitting that wall right now.

A year ago, if you wanted your agent to read files from your local filesystem, query your database, and post a summary to Slack, you wrote three custom integrations. If you wanted two agents from different vendors to hand off a task, you wrote a custom serialization format and hoped both sides agreed on what “done” meant. Every team was solving the same plumbing problems independently, and none of the pipes connected.

Two protocols are emerging to fix this. The Model Context Protocol (MCP) standardizes how agents connect to tools and data sources. Agent-to-Agent (A2A) standardizes how agents talk to each other. Together, they are becoming the connective tissue of the agentic ecosystem — the infrastructure layer that lets you stop thinking about plumbing and start thinking about what your agents actually do.

This article is about both: what they are, how they work, and what production deployment with them actually looks like.

Model Context Protocol: One Interface to Rule Them All

Before MCP, every agent-to-tool integration was a bespoke engineering project. Want your agent to read from a PostgreSQL database? Write a tool wrapper. Want it to search Confluence? Write another wrapper. Want it to list files in an S3 bucket? Another wrapper. Each wrapper has its own error handling, its own authentication scheme, its own data format. You end up with a collection of brittle, hard-to-maintain glue code that grows proportionally with every new tool you add.

Anthropic introduced MCP in late 2024, and the core insight is simple: if every tool exposes the same interface, agents only need to learn one way to talk to tools.

MCP defines a standardized JSON-RPC interface between a “host” (the agent or the application running it) and a “server” (any tool or data source). The protocol specifies three primitive types that a server can expose:

Resources — data that the agent can read (files, database rows, API responses, calendar entries)
Tools — functions the agent can invoke with parameters (send an email, create a Jira ticket, run a SQL query)
Prompts — reusable prompt templates that the server exposes for the agent to use in context

The communication looks like this:

Agent (MCP Host)                  MCP Server (e.g., Filesystem)
      |                                         |
      |── initialize() ────────────────────────>|
      |<─ capabilities (resources, tools) ──────|
      |                                         |
      |── tools/list() ────────────────────────>|
      |<─ [read_file, write_file, list_dir] ────|
      |                                         |
      |── tools/call(read_file, {path}) ───────>|
      |<─ {content: "..."} ─────────────────────|

What makes this powerful is that the agent doesn’t need to know anything about the underlying data source. It just knows: here is a list of tools available on this server, here are their schemas, here is how to call them. The MCP server handles the actual implementation — the filesystem calls, the database queries, the API authentication.
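On the wire, each of those arrows is a JSON-RPC 2.0 message. Here is a minimal sketch of what the host side could send for the list and call steps. The helper function and the file path are hypothetical; the method names follow the exchange shown above, and the name/arguments shape of the call parameters should be checked against the MCP specification before relying on it.

```python
import itertools
import json
from typing import Optional

_request_ids = itertools.count(1)  # JSON-RPC requests carry a unique id


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke one of them by name, passing arguments that match its schema.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "read_file", "arguments": {"path": "/tmp/notes.txt"}},
)

print(json.dumps(call_tool, indent=2))
```

The host never sees how read_file is implemented; it only needs the tool name and an arguments object that fits the advertised schema.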

The practical consequence is that an agent built on MCP can connect to any MCP-compatible server without custom integration code. Your Slack workspace, your local filesystem, your PostgreSQL database, your Google Calendar — if there’s an MCP server for it (and increasingly, there is), your agent can use it out of the box.

How MCP Gives Agents Context They Couldn’t Have Before

The “context” in Model Context Protocol is doing real work. One of the fundamental limitations of LLM-based agents has always been that their knowledge is frozen at training time — they know what they were trained on, and nothing that happened after the cutoff date. RAG helps with some of this, but it’s fundamentally a retrieval problem: you have to know what to retrieve.

MCP takes a different approach. Instead of retrieving information and injecting it into the prompt, it gives the agent live access to the systems where your information actually lives.

Consider the difference in practice. A customer support agent without MCP retrieves customer history from a vector store populated by a nightly batch job. The information is at least a day old, possibly more, and it’s a lossy representation — embeddings capture semantic meaning but lose precise details. An MCP-enabled agent with access to your CRM’s MCP server reads the customer record directly, in real time, with full fidelity.

The agent can now:

See the customer’s last three support tickets — not summaries, the actual tickets
Check their current subscription status — not a cached version, the live record
Read the internal notes the account manager left yesterday
Look at the open invoices in your billing system

None of this required a custom integration. It required an MCP server for your CRM, an MCP server for your billing system, and an agent configured to connect to both.

The architectural implication is significant: MCP shifts the integration burden from the agent developer to the tool developer. Once a tool has an MCP server, any MCP-compatible agent can use it. This is the same network effect that made REST APIs dominant — not because REST was technically superior in every dimension, but because a common standard made the ecosystem composable.

Agent-to-Agent Communication: Defining a Common Language

MCP solves the agent-to-tool problem. A2A solves the agent-to-agent problem, and it’s a harder one.

When two agents need to collaborate on a task, they face a set of questions that are easy to answer between humans but surprisingly tricky to standardize for software:

How does Agent A tell Agent B what it needs?
How does Agent B signal that it’s accepted the task, is working on it, or has completed it?
What format does the result come back in?
What happens if Agent B can only partially complete the request?
How does Agent A know Agent B is trustworthy?

The A2A protocol (developed collaboratively by Google and a consortium of enterprise technology vendors, with broad industry participation) defines a standard vocabulary for all of these interactions. Like MCP, it’s built on JSON-RPC, which means it’s transport-agnostic and integrates cleanly with existing HTTP infrastructure.

The core concept in A2A is the Task — a unit of work that one agent requests from another. A Task has a lifecycle:

submitted → working → [input-required] → working → completed
                                                  → failed
                                                  → cancelled

Agent A submits a Task to Agent B’s endpoint. Agent B acknowledges with a task ID and status. Agent A can poll for updates or receive streaming events as Agent B makes progress. When Agent B completes the task, it returns a structured result. If something goes wrong, it returns a structured error with enough context for Agent A to decide what to do next.
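The lifecycle is effectively a small state machine, and it can help to enforce it on the receiving side. The sketch below encodes only the transitions drawn in the diagram above. The transition table is my reading of that diagram rather than a normative part of the protocol, and the state spellings follow this article.

```python
from enum import Enum


class TaskState(str, Enum):
    SUBMITTED = "submitted"
    WORKING = "working"
    INPUT_REQUIRED = "input-required"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed moves, taken from the lifecycle diagram; terminal states have none.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.WORKING},
    TaskState.WORKING: {
        TaskState.INPUT_REQUIRED,
        TaskState.COMPLETED,
        TaskState.FAILED,
        TaskState.CANCELLED,
    },
    TaskState.INPUT_REQUIRED: {TaskState.WORKING},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
    TaskState.CANCELLED: set(),
}


def advance(current: TaskState, new: TaskState) -> TaskState:
    """Return the new state, or raise if the lifecycle does not allow the move."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


# Walk a task through a clarification round and on to completion.
state = TaskState.SUBMITTED
for step in (TaskState.WORKING, TaskState.INPUT_REQUIRED,
             TaskState.WORKING, TaskState.COMPLETED):
    state = advance(state, step)
```

Rejecting illegal moves early, such as reopening a completed task, keeps both agents in agreement about what "done" means.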

What makes this more than just a REST API convention is the Agent Card — a machine-readable document that each agent publishes at a well-known endpoint, describing:

What tasks it can accept (its capabilities)
What authentication it requires
What input formats it accepts and what output formats it produces
Its current availability and load

An orchestrator agent discovering a new peer doesn’t need documentation or a human to explain the integration. It reads the Agent Card, understands the capabilities, and knows how to submit tasks. The protocol handles the rest.
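To make the discovery step concrete, here is a simplified sketch of an orchestrator deciding from a card alone whether a peer is worth calling. The card below is a hand-written, trimmed-down example: the agent, its URL, and its skills are invented, and the field names only approximate the published card format, so check the A2A specification for the authoritative schema.

```python
def can_handle(card: dict, skill_id: str, input_mode: str) -> bool:
    """Decide from the card alone whether this peer advertises the skill and input type."""
    has_skill = any(skill["id"] == skill_id for skill in card.get("skills", []))
    accepts_input = input_mode in card.get("defaultInputModes", [])
    return has_skill and accepts_input


# Hypothetical card, as an orchestrator might fetch it from a peer's well-known endpoint.
billing_card = {
    "name": "billing-agent",
    "description": "Answers questions about invoices and payments.",
    "url": "https://agents.example.com/billing",
    "version": "1.0.0",
    "capabilities": {"streaming": True},
    "defaultInputModes": ["text/plain", "application/json"],
    "defaultOutputModes": ["application/json"],
    "skills": [
        {"id": "list_open_invoices", "name": "List open invoices"},
        {"id": "explain_charge", "name": "Explain a charge"},
    ],
}

usable = can_handle(billing_card, "list_open_invoices", "application/json")
not_offered = can_handle(billing_card, "issue_refund", "application/json")
```

Nothing in the check is specific to the billing agent, which is the point: the same function works for any peer that publishes a card.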

The Contract-Net Protocol: Agents That Bid on Work

One of the more elegant ideas in the A2A ecosystem is borrowed from classical distributed AI: the Contract-Net Protocol, originally proposed in the 1980s and now finding new relevance in the agentic era.

The idea is that task assignment shouldn’t be static — orchestrators shouldn’t hardcode which agent handles which task type. Instead, agents should be able to bid on tasks based on their current state, capabilities, and load.

The flow works like this:

Orchestrator broadcasts task announcement
        ↓
Available agents evaluate: Can I do this? At what cost? How fast?
        ↓
Interested agents submit bids (capability match, estimated latency, current load)
        ↓
Orchestrator evaluates bids and awards task to winning agent
        ↓
Winning agent executes, reports completion
        ↓
Orchestrator releases other agents

In practice, a bid might contain:

Capability score: How well does this agent’s specialization match the task requirements? (0.0 to 1.0)
Estimated completion time: Based on current queue depth and task complexity
Resource cost estimate: How many tokens, compute cycles, or API calls will this take?
Confidence level: How certain is the agent that it can complete this task successfully?

The orchestrator applies a selection policy — lowest cost, fastest completion, highest confidence, or a weighted combination — and awards the contract.
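A weighted selection policy might look like the sketch below. The weights, the normalisation ceilings, and the two example bids are all invented for illustration; a real orchestrator would tune them to its own workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    agent_id: str
    capability: float   # 0.0 to 1.0, match between specialization and task
    est_seconds: float  # estimated completion time
    est_cost: float     # estimated resource cost, in arbitrary units
    confidence: float   # 0.0 to 1.0


# Illustrative weights; they sum to 1 so scores stay in the 0 to 1 range.
WEIGHTS = {"capability": 0.4, "speed": 0.2, "cost": 0.1, "confidence": 0.3}


def score(bid: Bid, max_seconds: float = 60.0, max_cost: float = 1.0) -> float:
    """Combine the bid fields into one number; higher is better."""
    speed = 1.0 - min(bid.est_seconds / max_seconds, 1.0)  # faster is better
    cheapness = 1.0 - min(bid.est_cost / max_cost, 1.0)    # cheaper is better
    return (
        WEIGHTS["capability"] * bid.capability
        + WEIGHTS["speed"] * speed
        + WEIGHTS["cost"] * cheapness
        + WEIGHTS["confidence"] * bid.confidence
    )


def award(bids: list[Bid]) -> Bid:
    """Pick the highest-scoring bid."""
    if not bids:
        raise ValueError("no bids received")
    return max(bids, key=score)


bids = [
    # Heavily loaded specialist: strong match, but slow and expensive right now.
    Bid("coder", capability=0.9, est_seconds=50, est_cost=0.8, confidence=0.9),
    # Idle generalist: weaker match, but fast and cheap.
    Bid("reviewer", capability=0.7, est_seconds=10, est_cost=0.2, confidence=0.8),
]
winner = award(bids)
```

With these weights the idle agent wins despite the weaker capability match, because its speed and cost advantage outweighs the gap.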

This pattern is particularly valuable in systems where agent load is uneven. A Coder Agent might be heavily loaded while a Reviewer Agent is idle. Without bidding, the orchestrator has no visibility into this. With bidding, the idle Reviewer Agent can bid aggressively on tasks that are near its competency boundary, while the overloaded Coder Agent bids conservatively or not at all.

The Contract-Net Protocol also provides natural load balancing for horizontally scaled agent pools. If you’re running three instances of the same agent type, whichever instance is least loaded will submit the most competitive bid. The orchestrator doesn’t need to know anything about instance count or load distribution — the bidding mechanism handles it automatically.

Security & Identity: How an Agent Proves Who It Is

This is the section that gets skipped in tutorials and becomes an urgent problem in production. When Agent A calls Agent B’s endpoint, Agent B needs to answer a question that is non-trivial: is this request actually coming from a trusted agent in my system, or is someone impersonating it?

In human-facing systems, we solve this with OAuth 2.0 and OIDC — the user authenticates with an identity provider, gets a token, and presents that token to services. The same pattern applies to agents, with some important adaptations.

OIDC for Agents (increasingly referred to as Workload Identity in the cloud provider ecosystem) works like this:

Agent Runtime            Identity Provider           Downstream Service
      |                          |                            |
      |── request token ────────>|                            |
      |<─ signed JWT (agent ID) ─|                            |
      |                          |                            |
      |── call with JWT ─────────────────────────────────────>|
      |                          | verify signature ─────────>|
      |                          |<─ valid, proceed ──────────|
      |<─ response ───────────────────────────────────────────|

The key components:

Agent Identity Token — A short-lived JWT issued by your identity provider that asserts the agent’s identity, role, and the specific permissions it has been granted. “I am the CRM-Reader agent, issued by your organization’s IDP, and I am authorized to read customer records but not write them.” The token is signed by the IDP; the downstream service verifies the signature without needing to call the IDP on every request.

Scoped Permissions — Each agent should have a token scoped to the minimum permissions it needs for its function. The Coder Agent doesn’t need write access to the CRM. The Customer Service Agent doesn’t need access to the code repository. Principle of least privilege applies to agents exactly as it does to human users.

Token Rotation — Agent tokens should be short-lived (15–60 minutes) and automatically rotated. This limits the blast radius if a token is compromised. The agent runtime handles rotation transparently — the agent doesn’t need to manage its own credential lifecycle.

Audit Logging — Every action an agent takes should be logged with its identity token. When you need to answer “which agent accessed this customer record at 14:32 yesterday and why,” the audit log should give you a precise answer. This is not optional in regulated industries; it’s increasingly expected everywhere.
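The token components above can be sketched with the standard library alone. This is a minimal illustration, not a production implementation: a shared HMAC secret stands in for the identity provider's signing key, whereas a real IdP would sign with an asymmetric key (for example RS256) and services would verify with the published public key through a JWT library. The names `issue_token`, `verify_token`, and `SIGNING_KEY` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import List, Optional

# Assumption: a shared HMAC secret stands in for the IdP's private signing key.
# A real IdP would sign with an asymmetric key and publish the public half.
SIGNING_KEY = b"demo-only-secret"


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _sign(payload: str) -> str:
    return _b64(hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).digest())


def issue_token(agent_id: str, scopes: List[str], ttl_seconds: int = 900,
                now: Optional[float] = None) -> str:
    """Issue a short-lived token asserting the agent's identity and scopes."""
    issued_at = time.time() if now is None else now
    claims = {"sub": agent_id, "scopes": scopes, "exp": issued_at + ttl_seconds}
    payload = _b64(json.dumps(claims).encode())
    return payload + "." + _sign(payload)


def verify_token(token: str, required_scope: str,
                 now: Optional[float] = None) -> dict:
    """Check signature, expiry, and scope locally, without calling the issuer."""
    payload, _, signature = token.partition(".")
    if not hmac.compare_digest(signature, _sign(payload)):
        raise PermissionError("invalid signature")
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        raise PermissionError("token expired")
    if required_scope not in claims["scopes"]:
        raise PermissionError("missing scope: " + required_scope)
    return claims
```

A CRM-Reader agent holding a token with only the `crm:read` scope passes `verify_token(token, "crm:read")` and is rejected for `crm:write`, which is the least-privilege behavior described above. The returned claims are also what an audit log entry would record alongside each action.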

On AWS, this pattern maps naturally to IAM Roles for Tasks (ECS) or Pod Identity (EKS). On the Bedrock AgentCore Runtime, each agent execution context gets an IAM role with the permissions defined at deployment time. The agent never handles long-lived credentials — the runtime injects temporary credentials into the execution environment automatically.

Discovery Services: Building the Agent Registry

As your agent ecosystem grows, a new operational problem emerges: how does an orchestrator find the right agent for a given task? Hardcoding agent endpoints into orchestrator logic works for two or three agents. It becomes a maintenance liability at ten, and an operational nightmare at fifty.

The solution is borrowed directly from service mesh architecture: a Discovery Service — a registry where agents advertise their presence, capabilities, and health, and where orchestrators query to find appropriate peers.

The concept maps to familiar infrastructure patterns:

Eureka (Netflix’s service registry) and Consul (HashiCorp’s service mesh) solve this problem for microservices. The same principles apply to agent registries.
In the Kubernetes ecosystem, this maps naturally to Service resources and endpoint discovery.
In the cloud-native agentic ecosystem, the A2A Agent Card serves as the registration payload.

A well-designed Agent Registry exposes two primary interfaces:

Registration — Agents announce themselves on startup and deregister on shutdown. The registration payload includes the Agent Card (capabilities, input/output schemas, authentication requirements) plus runtime metadata (current load, health status, version).

Discovery — Orchestrators query the registry with a capability description: “I need an agent that can process PDF documents, write to a SQL database, and respond within 5 seconds.” The registry returns a ranked list of matching agents, filtered by health status and sorted by relevance score.

Agent Startup                Registry                        Orchestrator
      |── register(AgentCard) ──>|                                 |
      |<─ registered (id) ───────|                                 |
      |                          |<─ discover(capability query) ───|
      |                          |── [Agent A, Agent B] ──────────>|
      |<─ task submission ─────────────────────────────────────────|

Health checking is essential. An agent that has registered but stopped responding is worse than an absent agent — it will be selected for tasks it can’t complete, causing failures and retries. The registry should actively probe registered agents on a regular heartbeat interval and automatically deregister agents that miss consecutive health checks.

The discovery query language deserves careful design. Simple string matching on capability names breaks down quickly — “summarization” and “document summarization” and “text condensation” might all refer to the same capability. A well-designed registry uses structured capability taxonomies (standardized tags from a shared vocabulary) rather than free-text descriptions, ensuring that capability matching is reliable rather than approximate.
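A minimal in-memory sketch of such a registry might look like the following. It is an illustration under stated assumptions: the class and field names are hypothetical, capabilities are plain tags from a shared vocabulary, and health is tracked through agent-sent heartbeats with an expiry window rather than the active probing described above.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass
class AgentRecord:
    agent_id: str
    endpoint: str
    capabilities: FrozenSet[str]  # tags drawn from a shared vocabulary
    load: float = 0.0             # 0.0 = idle, 1.0 = saturated
    last_seen: float = 0.0


class AgentRegistry:
    def __init__(self, heartbeat_interval: float = 10.0, max_missed: int = 3):
        self._agents: Dict[str, AgentRecord] = {}
        self._expiry = heartbeat_interval * max_missed

    def register(self, record: AgentRecord, now: float) -> str:
        record.last_seen = now
        self._agents[record.agent_id] = record
        return record.agent_id

    def deregister(self, agent_id: str) -> None:
        self._agents.pop(agent_id, None)

    def heartbeat(self, agent_id: str, load: float, now: float) -> None:
        record = self._agents[agent_id]
        record.load, record.last_seen = load, now

    def discover(self, required: Set[str], now: float) -> List[AgentRecord]:
        """Return healthy agents offering every required tag, least loaded first."""
        # Drop agents that have missed too many consecutive heartbeats.
        stale = [a for a, r in self._agents.items()
                 if now - r.last_seen > self._expiry]
        for agent_id in stale:
            del self._agents[agent_id]
        matches = [r for r in self._agents.values() if required <= r.capabilities]
        return sorted(matches, key=lambda r: (r.load, r.agent_id))
```

Matching on exact tags such as `pdf.parse` keeps discovery deterministic. Mapping free-text names like "text condensation" onto those tags would happen once, at registration time, rather than on every query.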

Putting It Together: The Full Protocol Stack

Across this series, we’ve built up a complete picture of what a production agentic system looks like. The protocol layer is where all of it connects:

┌─────────────────────────────────────────────────────────────────┐
│                       USER / APPLICATION                        │
└───────────────────────────────┬─────────────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────┐
│                       ORCHESTRATOR AGENT                        │
│               (Hierarchical Planning, ReAct Loop)               │
│                      [Article 2 patterns]                       │
└──────────┬──────────────────────────────────┬───────────────────┘
           │                                  │
     A2A Protocol                       A2A Protocol
           │                                  │
┌──────────▼──────────┐            ┌──────────▼──────────┐
│  SPECIALIST AGENT   │            │  SPECIALIST AGENT   │
│  (Coder / Writer)   │            │  (Reviewer / Critic)│
│  [Article 2 & 3]    │            │  [Article 2 & 3]    │
└──────────┬──────────┘            └──────────┬──────────┘
           │                                  │
     MCP Protocol                       MCP Protocol
           │                                  │
┌──────────▼──────────┐            ┌──────────▼──────────┐
│     MCP SERVER      │            │     MCP SERVER      │
│  (Filesystem / DB)  │            │  (Slack / Calendar) │
└─────────────────────┘            └─────────────────────┘
           │                                  │
     [AgentOps Layer: OTel, Guardrails, HITL — Article 3]

MCP handles the vertical connections — agents to tools and data. A2A handles the horizontal connections — agents to agents. The AgentOps layer (observability, guardrails, eval pipelines, HITL) sits across all of it, providing the operational visibility and control that makes the whole system trustworthy in production.

The Maturity Model from Article 1 maps onto this stack naturally. L1 systems need neither MCP nor A2A. L2 and L3 systems benefit significantly from MCP: standardizing tool access reduces integration overhead and makes the agent more capable without custom code. L4 and L5 systems need both, A2A for agent-to-agent coordination and MCP for tool access, with the AgentOps layer making the whole thing operable.

Production Reality Check

MCP and A2A are genuine improvements over the integration chaos they replace. They’re also early-stage standards in an ecosystem that is moving fast, and production adoption comes with real caveats.

MCP server quality is uneven. The protocol is well-designed, but the ecosystem of available servers ranges from production-ready to experimental. Before adopting a community-maintained MCP server for a critical tool, audit its error handling, its authentication implementation, and how actively it’s maintained. A poorly implemented MCP server that swallows errors silently is harder to debug than a custom integration that fails loudly.

A2A task lifecycle management requires discipline. The protocol defines task states clearly, but implementing correct lifecycle management — handling timeouts, zombie tasks that never complete, cascade failures when a Worker agent goes down mid-task — requires careful engineering. Don’t assume the protocol handles operational edge cases for you; it defines the interface, not the reliability.

Discovery services add operational surface area. A registry is another system to operate, monitor, and keep highly available. If your registry goes down, your orchestrators can’t find agents. Design for registry failure explicitly: agents should cache recent discovery results, orchestrators should have fallback direct-connection configurations for critical agents, and your monitoring should alert on registry health before it affects agent routing.

Identity and security are non-negotiable at scale. It’s tempting to skip the OIDC integration during early development and use shared API keys for agent-to-agent authentication. This is fine for a proof of concept and a liability in production. Build the identity layer before you scale, not after — retrofitting workload identity into a running multi-agent system is significantly more painful than designing it in from the start.

The practical adoption path that has worked well: start with MCP for tool integrations (the ROI is immediate and the risk is low), add A2A when you have multiple agents that need to coordinate (and not before), build the identity layer in parallel with A2A adoption, and add a discovery service when you have more than five distinct agent types in production.

Closing the Series

Over four articles, we’ve covered the full arc of agentic system design:

Article 1 gave us the vocabulary — five levels of maturity, mapped to the infrastructure and cost reality of each.
Article 2 gave us the reasoning patterns — how agents plan, reflect, coordinate, and share knowledge without drowning in state.
Article 3 gave us the operational discipline — observability, safety, evaluation, and the human checkpoints that keep the system trustworthy.
Article 4 gave us the protocols — the standardized interfaces that make agents composable, discoverable, and secure at scale.

The through-line across all four is a consistent argument: the intelligence of your agent system is not primarily determined by the model you choose. It’s determined by the architecture around the model — the planning patterns, the memory design, the error handling, the observability, the coordination protocols. Models are commoditizing. Architecture is the durable differentiator.

The teams building production agentic systems that actually work — not just in demos, but at scale, with real users, over time — are the ones treating AI like the distributed systems discipline it has become. The tools, protocols, and patterns in this series are what that looks like in practice.

Build carefully. Measure everything. Ship incrementally. And set a maximum iteration count on your reflection loops.

                  L1: Stateless       L2: Tool-Augmented     L3: Autonomous               L4: Multi-Agent              L5: Self-Correcting
------------------------------------------------------------------------------------------------------------------------------------------------------
Execution         Serverless / Edge   Serverless + integr.   Long-running container       Distributed orchestrator     Distributed + feedback loops
State             None                None                   Short + long-term memory     Shared state across agents   State + mutation history
Latency profile   Predictable         Slightly variable      Variable (loop-dependent)    High, parallelizable         Highest, bounded by budget
Cost model        Linear (tokens)     Linear + tool costs    Nonlinear (calls per task)   Nonlinear × agent count      Nonlinear × iteration count
Primary failure   Bad retrieval       Tool hallucination     Context overflow             Cascade failures             Runaway loops
Observability     Basic logging       Tool call tracing      Full trace per loop          Cross-agent tracing          Cost + quality dashboards

Included for reference from Article 1 — MCP maps primarily to L2 and L3. A2A maps primarily to L4 and L5.

Agentic Architectures — Article 3: AgentOps

Ali Suleyman TOPUZ — Tue, 31 Mar 2026 16:42:00 +0000

Treating AI Like the Distributed System It Actually Is

There’s a moment every team hits, usually somewhere between the third demo and the first real production deployment. The agent works beautifully in the notebook. It handles every test case you throw at it. You ship it. And then, three days later, you get a Slack message from a user that says something like: “It’s been running for 20 minutes and nothing is happening.”

You open the logs. There are no logs. The agent made 47 API calls, hit a rate limit on call 12, entered an undocumented retry state, and has been quietly spinning ever since — accumulating token costs, holding open a connection, and doing absolutely nothing useful.

Welcome to production.

The discipline of AgentOps exists because agentic systems are distributed systems, and distributed systems fail in distributed ways — partially, silently, and at the worst possible time. The practices in this article aren’t optional polish you add after launch. They’re the foundation that determines whether your system is operable when things go wrong. And they will go wrong.

Observability & Tracing: You Can’t Fix What You Can’t See

In a traditional web application, a request comes in and a response goes out. If something breaks, you have a single trace to inspect — one thread of execution, one error to find.

An agentic system doesn’t work like this. A single user request might spawn a Manager agent, three Worker agents, a Critic, and a tool execution layer. Each of these makes independent model calls. Some run in parallel. Each can fail independently. The user sees one thing — “the agent is thinking” — and behind that is a branching tree of execution that can fail at any node.

Debugging this without proper tracing is like trying to debug a microservices outage by reading individual server logs with no correlation IDs. Technically possible. Practically miserable.

The modern answer is OpenTelemetry (OTel) — the vendor-neutral observability standard that has become the lingua franca of distributed systems monitoring. The good news is that both LangSmith (from the LangChain ecosystem) and Arize Phoenix support OTel-compatible trace ingestion, which means you can instrument your agent once and route traces to whichever backend you prefer.

What you want to capture at every node in your agent graph:

Span start and end timestamps — so you can see exactly where time is being spent
Model call metadata — which model, which prompt template version, input/output token counts
Tool call inputs and outputs — what the agent asked the tool to do, and what it got back
State transitions — when the agent moved from Planning to Executing to Reflecting
Error events — with full context, not just the exception message
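To make the capture list concrete, here is a small standard-library sketch of a span recorder. It is a stand-in for an OpenTelemetry SDK, not a use of one: the `span` helper, the `SPANS` list (playing the role of an exporter), and the attribute names are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List, Optional

SPANS: List[Dict[str, Any]] = []  # stand-in for a trace exporter


@contextmanager
def span(name: str, trace_id: str, parent_id: Optional[str] = None,
         **attributes: Any) -> Iterator[Dict[str, Any]]:
    """Record one node of the agent graph: timing, attributes, and errors."""
    record: Dict[str, Any] = {
        "span_id": uuid.uuid4().hex[:16],
        "trace_id": trace_id,          # correlates every span of one user request
        "parent_id": parent_id,
        "name": name,
        "attributes": dict(attributes),  # model, prompt version, tool I/O, state
        "events": [],
        "status": "ok",
        "start": time.time(),
    }
    try:
        yield record
    except Exception as exc:
        # Keep the full context with the error, not just the message.
        record["status"] = "error"
        record["events"].append({
            "type": type(exc).__name__,
            "message": str(exc),
            "attributes": dict(record["attributes"]),
        })
        raise
    finally:
        record["end"] = time.time()
        SPANS.append(record)
```

A Manager node would open a root span and pass its `span_id` as `parent_id` to each Worker or tool span. The shared `trace_id` plays the role of the correlation ID mentioned earlier.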

The two metrics that matter most in production, and that most teams start tracking later than they should:

Trace Latency is the wall-clock time from user request to final response, across the entire agent execution graph. Not just model latency — total latency, including tool calls, state reads and writes, and any waiting time. This is what the user experiences, and it’s almost always higher than you think because it includes all the overhead your benchmark tests don’t capture.

Token Cost per Trace is the total model spend for a single user task, aggregated across all agents and all model calls in the trace. In a multi-agent system, this is the number that will surprise you. Individual model calls look cheap. When you multiply them by agent count, loop iterations, and daily request volume, the number that emerges is frequently 5–10x what the team estimated during design.

Build a dashboard with both of these as primary metrics before you launch, not after. The alert threshold for Trace Latency should trigger before your user-facing timeout does. The alert threshold for Token Cost per Trace should trigger before your monthly budget does.
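As a rough sketch of the Token Cost per Trace metric, the aggregation below sums model spend per trace and flags traces over a threshold. The per-1K-token prices and model names are placeholders, not real rates, and the record fields are assumed rather than taken from any particular tracing backend.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical (input, output) prices in USD per 1,000 tokens.
PRICES = {
    "large-model": (0.003, 0.015),
    "small-model": (0.00025, 0.00125),
}


def cost_per_trace(model_calls: Iterable[dict]) -> Dict[str, float]:
    """Sum spend across every model call that shares a trace_id."""
    totals: Dict[str, float] = defaultdict(float)
    for call in model_calls:
        in_price, out_price = PRICES[call["model"]]
        totals[call["trace_id"]] += (
            call["input_tokens"] / 1000 * in_price
            + call["output_tokens"] / 1000 * out_price
        )
    return dict(totals)


def over_budget(totals: Dict[str, float], threshold_usd: float) -> List[str]:
    """Trace IDs whose total cost exceeds the alert threshold."""
    return sorted(t for t, cost in totals.items() if cost > threshold_usd)
```

The same grouping key works for Trace Latency: take the earliest span start and the latest span end per `trace_id` and alert on the difference.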

Guardrails & Safety: The Gates That Protect the System

Every agent that interacts with real users is a potential attack surface — not just for adversarial users, but for the model’s own failure modes. A guardrail is an enforcement layer that sits between the world and your agent, checking inputs before they reach the model and outputs before they reach the user.

Think of it as two gates:

User Input → [INPUT GATE] → Agent → Model → [OUTPUT GATE] → User Response
                  ↓                               ↓
             Block / Sanitize          Block / Rewrite / Flag

The Input Gate protects the model from harmful, manipulative, or out-of-scope inputs. Common implementations:

LlamaGuard — Meta’s open-source safety classifier, trained specifically to detect harmful content categories (violence, hate speech, self-harm, illegal activity). It runs as a separate model call before your main agent, adding ~100–200ms of latency and a fraction of a cent in cost per request. Worth it for any user-facing deployment.
Regex and rule-based filters — Fast, cheap, and reliable for known patterns. Prompt injection attempts often have detectable signatures (ignore previous instructions, you are now, your new system prompt is). A well-maintained regex filter catches a meaningful percentage of these before they ever reach the model.
LLM-based classifiers — For nuanced cases where rule-based filtering isn’t sufficient. A small, fast model (Haiku, GPT-4o mini) classifying the intent of an input before it hits your expensive main model is a good investment for high-value workflows.
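The rule-based layer of the Input Gate can be as small as the sketch below. The patterns cover only the three signatures quoted above and are illustrative; a real filter would need a maintained pattern list and tuning against false positives ("you are now logged in" would match the second pattern, for instance).

```python
import re
from typing import List, NamedTuple

# Illustrative signatures only; a production list would be longer and curated.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\b", re.IGNORECASE),
    re.compile(r"your\s+new\s+system\s+prompt\s+is", re.IGNORECASE),
]


class GateResult(NamedTuple):
    allowed: bool
    matched: List[str]


def input_gate(text: str) -> GateResult:
    """Block inputs that match a known prompt-injection signature."""
    matched = [p.pattern for p in INJECTION_PATTERNS if p.search(text)]
    return GateResult(allowed=not matched, matched=matched)
```

Because this check is cheap, it can run before the safety classifier or the LLM-based intent classifier, so that obvious attempts never incur a model call.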

The Output Gate protects users from the model’s failure modes — hallucinations, off-topic responses, sensitive data leakage, and policy violations:

PII detection — Before any agent output reaches a user or gets written to a log, scan it for personally identifiable information that shouldn’t be there. Regex handles the obvious cases (email patterns, credit card formats, SSN patterns). For subtler cases, a dedicated NER model does the job.
Factual grounding checks — For RAG-based agents, verify that claims in the output can be traced back to retrieved source documents. Outputs that make claims not present in the source context should be flagged or blocked.
Output format validation — If your agent is supposed to return structured JSON, validate the schema before passing it downstream. A malformed output that crashes a downstream service is a guardrail failure, not a model failure.
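Two of the Output Gate checks, regex-based PII redaction and structural validation of JSON output, can be sketched as follows. The patterns are deliberately simplified (a 16-digit card format, US-style SSNs), and the function names are hypothetical. Grounding checks are omitted because they need the retrieved context and usually a model call.

```python
import json
import re
from typing import Any, Dict, List, Tuple

# Simplified patterns for the obvious cases; subtler PII needs an NER model.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}


def redact_pii(text: str) -> Tuple[str, List[str]]:
    """Replace matched PII with a placeholder and report which kinds were found."""
    found: List[str] = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn("[" + label.upper() + " REDACTED]", text)
        if count:
            found.append(label)
    return text, found


def validate_json_output(raw: str, required: Dict[str, type]) -> Dict[str, Any]:
    """Parse agent output and check required keys and types before passing it on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError("missing or mistyped field: " + key)
    return data
```

Raising at the gate keeps a malformed payload from reaching a downstream service, where the same problem would surface as a crash with far less context.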

One implementation principle that’s easy to overlook: guardrails must fail safely. If your Input Gate goes down, what happens? If the answer is “all user requests go through unfiltered,” you have a single point of failure in your safety layer. Design guardrails with explicit fallback behavior — if the safety classifier is unavailable, either queue the request or return a graceful error, never silently bypass the check.
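A fail-closed wrapper makes that fallback behavior explicit. In this sketch `classifier` and `agent` are hypothetical callables standing in for the safety check and the main agent, and the messages are placeholders.

```python
from typing import Callable

UNAVAILABLE_MESSAGE = "We can't process your request right now. Please try again shortly."
BLOCKED_MESSAGE = "This request can't be processed."


def guarded_call(user_input: str,
                 classifier: Callable[[str], bool],
                 agent: Callable[[str], str]) -> str:
    """Run the agent only when the safety classifier has positively cleared the input."""
    try:
        is_safe = classifier(user_input)
    except Exception:
        # Fail closed: a classifier outage must never become a silent bypass.
        return UNAVAILABLE_MESSAGE
    if not is_safe:
        return BLOCKED_MESSAGE
    return agent(user_input)
```

Queuing the request for later instead of returning an error is the other option mentioned above. Either way the agent is never reached without a verdict.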

Evaluation Pipelines: The Regression Test for Your Agent

Software engineers have a deeply ingrained habit of writing tests before shipping code. Most AI teams don’t extend this habit to their agents — and they pay for it every time a prompt change breaks something in production that worked perfectly last week.

The equivalent of a test suite for an agentic system is an Eval Pipeline, and the core artifact it runs against is a Golden Dataset.

A Golden Dataset is a curated collection of input-output pairs that represent the behavior your system should exhibit. Each entry contains:

A realistic input (user query, document, task description)
The expected output — or the criteria by which a good output can be evaluated
Metadata: the scenario type, difficulty level, which agent capabilities it exercises

Building a good Golden Dataset is not a one-time task. It grows over time, fed primarily by production failures. Every time your agent produces a wrong or unexpected output in production, that input — along with the correct output — gets added to the dataset. The Golden Dataset becomes a living record of every failure mode your system has ever exhibited and been fixed for.

The Eval Pipeline runs this dataset against your agent automatically on every significant change — prompt updates, model version changes, tool modifications, new agent roles. The output is a regression report:

Golden Dataset Run — 2025-03-28
------------------------------------------------------------
Total cases: 847
Passed: 821 (96.9%)
Regressed: 18 (2.1%) ← These need investigation
Improved: 8 (0.9%)
------------------------------------------------------------
Regressed cases by category:
  Multi-step reasoning: 9
  Tool selection: 5
  Output format: 4

The 18 regressions are the cases where a change that was supposed to improve things has broken something that worked before. Without the eval pipeline, you’d find these in production. With it, you find them in CI.
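One way to derive a report like the one above is to compare per-case results from the last accepted run against the candidate run. The sketch assumes each run is reduced to a mapping from case ID to pass/fail. The category names mirror the report, plus a `still_failing` bucket that the sample report does not show.

```python
from typing import Dict, List


def regression_report(baseline: Dict[str, bool],
                      current: Dict[str, bool]) -> Dict[str, List[str]]:
    """Bucket every case by how its result changed between two runs."""
    report: Dict[str, List[str]] = {
        "passed": [], "regressed": [], "improved": [], "still_failing": [],
    }
    for case_id in sorted(current):
        now_ok = current[case_id]
        # Cases new to the dataset have no baseline; treat them as unchanged.
        before_ok = baseline.get(case_id, now_ok)
        if before_ok and now_ok:
            key = "passed"
        elif before_ok:
            key = "regressed"
        elif now_ok:
            key = "improved"
        else:
            key = "still_failing"
        report[key].append(case_id)
    return report


def ci_gate(report: Dict[str, List[str]]) -> bool:
    """Let the change through CI only when nothing that used to pass now fails."""
    return not report["regressed"]
```

Whether a single regression should block a merge is a policy choice. Some teams would accept a small regression count if the improved count is larger, and the gate function is the place to encode that.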

Evaluation metrics vary by task type. For classification tasks, use precision and recall. For generation tasks, you need LLM-based evaluation: a judge model that scores outputs against a rubric (this is increasingly standard and works well when the rubric is specific). For tool-use tasks, check whether the correct tools were called with the correct arguments, independent of the final text output.
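For the tool-use case, a check along these lines scores tool selection and arguments separately from the response text. The call format (`name` plus `args`) is an assumption, and strict order-sensitive matching is only one possible rubric.

```python
from typing import Any, Dict, List

ToolCall = Dict[str, Any]  # assumed shape: {"name": str, "args": dict}


def score_tool_use(expected: List[ToolCall], actual: List[ToolCall]) -> Dict[str, bool]:
    """Compare the tools an agent called against the golden case, in order."""
    right_tools = [c["name"] for c in expected] == [c["name"] for c in actual]
    right_args = right_tools and all(
        e["args"] == a["args"] for e, a in zip(expected, actual)
    )
    return {"right_tools": right_tools, "right_args": right_args}
```

Reporting the two flags separately helps with triage: a wrong tool points at routing or tool descriptions, while a right tool with wrong arguments points at parameter extraction.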

One practical note: keep your Golden Dataset honest. The temptation is to add only cases your system handles well, which turns the eval into a confidence-boosting exercise rather than a quality gate. Actively seek out edge cases, adversarial inputs, and the kinds of queries that make your system stumble.

Agentic Error Handling: Building for the Inevitable

A production agent will encounter rate limits. APIs will return 500 errors. Model calls will time out. Tools will return malformed responses. The question is not whether these things will happen — it’s whether your system handles them gracefully or catastrophically.

Exponential Backoff with Jitter

When a model API returns a 429 (rate limit exceeded), the naive response is to retry immediately. This is exactly wrong — immediate retries hammer the rate-limited endpoint and make the congestion worse. The correct pattern is exponential backoff with jitter:

Attempt 1: fail → wait 1s + random(0-500ms)
Attempt 2: fail → wait 2s + random(0-500ms)
Attempt 3: fail → wait 4s + random(0-500ms)
Attempt 4: fail → wait 8s + random(0-500ms)
Attempt 5: fail → give up, return error to orchestrator

The jitter (random delay) is critical in multi-agent systems. Without it, multiple agents hitting the same rate limit will retry in synchrony, creating thundering herd waves that make the rate limit problem worse. With jitter, retries spread out naturally.

Set a maximum retry count (5 is a reasonable default) and a maximum total wait time (30–60 seconds for interactive tasks). After that, fail explicitly — a clean error the orchestrator can handle is better than a silent spin.
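The schedule and limits above translate into a short retry helper. This is a sketch: `RateLimited` is a hypothetical exception standing in for whatever your client raises on a 429, and `sleep` and `rng` are injectable so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 from the model API."""


def call_with_backoff(fn: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_jitter: float = 0.5,
                      max_total_wait: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Retry fn on rate limits, doubling the wait each time and adding jitter."""
    waited = 0.0
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up: let the orchestrator handle a clean error
            # 1s, 2s, 4s, 8s ... plus up to max_jitter seconds of random delay
            delay = base_delay * 2 ** (attempt - 1) + rng() * max_jitter
            if waited + delay > max_total_wait:
                raise
            sleep(delay)
            waited += delay
    raise RuntimeError("unreachable: max_attempts must be at least 1")
```

With the defaults, a call that keeps failing waits roughly 1, 2, 4, and 8 seconds and then raises on the fifth attempt, matching the table above.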

Fallback Models

Not all model failures are rate limits. Sometimes a model is genuinely unavailable, or a specific request exceeds the model’s context limit, or a cost budget has been hit. For these cases, build a fallback model hierarchy:

Primary: Claude Sonnet (full capability, higher cost)
Fallback: Claude Haiku (reduced capability, lower cost)
Emergency: Cached response or template-based response

The fallback trigger conditions and the fallback target should be explicit configuration, not hardcoded logic. Different tasks warrant different fallback strategies — a customer-facing response probably shouldn’t fall back to a template, but an internal classification task might be fine running on a smaller model.

Critically, log every fallback event. A spike in fallback usage is an early warning signal — it means your primary model is struggling for some reason, and you want to know about it before it becomes a full outage.
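A fallback hierarchy driven by configuration could be sketched like this. The tiers are passed in as an ordered list of named callables, which are hypothetical stand-ins for model clients and a template responder. Every failure and every fallback is logged.

```python
import logging
from typing import Callable, List, Optional, Tuple

log = logging.getLogger("agent.fallback")

Tier = Tuple[str, Callable[[str], str]]  # (tier name, handler)


def call_with_fallback(prompt: str, tiers: List[Tier]) -> Tuple[str, str]:
    """Try each tier in order and return (tier_name, response) from the first success."""
    last_error: Optional[Exception] = None
    for index, (name, handler) in enumerate(tiers):
        try:
            response = handler(prompt)
        except Exception as exc:
            last_error = exc
            log.warning("tier %s failed: %s", name, exc)
            continue
        if index > 0:
            # A rising rate of these events is the early warning signal.
            log.warning("fallback used: %s", name)
        return name, response
    raise RuntimeError("all tiers failed") from last_error
```

Because the tier list is data, a customer-facing task can be configured with only the primary and fallback models, while an internal classification task can include the template tier as well.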

Human-in-the-Loop: Designing Deliberate Pause Points

There’s a class of agent actions where the cost of getting it wrong is high enough that no amount of automated validation is sufficient. Executing a SQL write against a production database. Sending an email to a thousand customers. Approving a financial transaction. Deploying code to production.

For these, Human-in-the-Loop (HITL) isn’t a limitation of your agent’s capability — it’s a deliberate architectural choice that reflects the appropriate level of trust for that action.

The implementation pattern is an Interrupt Point — a designated node in your agent’s state graph where execution pauses, the pending action is surfaced to a human reviewer, and the agent waits for explicit approval before proceeding.

Agent Planning → Tool Selection → [INTERRUPT: Awaiting approval]
                                          ↓
                                   Human Reviews
                                    /         \
                               Approve       Reject (+ feedback)
                                  ↓             ↓
                           Agent Executes   Agent Replans

The UX of the approval interface matters more than most engineering teams acknowledge. The human reviewer needs to see: what action the agent wants to take, why it decided to take it (a brief reasoning summary), what the expected outcome is, and what the rollback plan is if it goes wrong. A one-line notification that says “Agent wants to run a database query — approve?” is not sufficient. A panel showing the exact SQL, the expected rows affected, and the agent’s stated rationale is.

HITL points should be defined in configuration, not code — so that business stakeholders can adjust the approval threshold for an action type without requiring a code deploy. “Any SQL write affecting more than 1,000 rows requires approval” is a policy decision, not an engineering decision.

One nuance worth designing for early: what happens if the human doesn’t respond? The agent is waiting at an interrupt point. The reviewer is in a meeting. An hour passes. Your system needs explicit timeout behavior — either escalate to a different reviewer, cancel the task gracefully, or (for some use cases) proceed with a lower-risk fallback action. The worst outcome is an agent silently holding state and accumulating cost while waiting indefinitely.
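Both ideas, approval thresholds held in configuration and explicit timeout behavior, fit in a few lines. The policy below is hypothetical and shown inline for brevity. In practice it would be loaded from a file or service that stakeholders can edit without a deploy.

```python
from typing import Any, Dict, Optional

# Hypothetical policy; the metric names, limits, and timeouts are illustrative.
APPROVAL_POLICY: Dict[str, Dict[str, Any]] = {
    "sql_write": {"metric": "rows_affected", "limit": 1000,
                  "timeout_s": 3600, "on_timeout": "escalate"},
    "send_email": {"metric": "recipients", "limit": 100,
                   "timeout_s": 1800, "on_timeout": "cancel"},
}


def needs_approval(action_type: str, details: Dict[str, Any]) -> bool:
    """Decide from policy alone whether this action must pause for a human."""
    rule = APPROVAL_POLICY.get(action_type)
    if rule is None:
        return False  # assumption: actions without a rule are low-risk
    return details.get(rule["metric"], 0) > rule["limit"]


def next_step(action_type: str, decision: Optional[str], waited_s: float) -> str:
    """Map the reviewer's decision, or its absence, to what the agent does next."""
    if decision == "approve":
        return "execute"
    if decision == "reject":
        return "replan"
    rule = APPROVAL_POLICY[action_type]
    return rule["on_timeout"] if waited_s >= rule["timeout_s"] else "wait"
```

Treating unlisted action types as low-risk is itself a policy decision. A stricter deployment might default to requiring approval for anything the policy does not name.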

Production Reality Check

AgentOps is the article in this series most likely to be skimmed and least likely to be implemented before launch. That’s a mistake that reliably produces avoidable incidents.

Some concrete numbers to make the case:

Observability instrumentation — setting up OTel, integrating LangSmith or Arize Phoenix, building a basic dashboard — takes a senior engineer approximately 2–3 days to do properly. The first production incident it prevents will typically save more than 2–3 days of debugging time, usually within the first month of operation.

A well-maintained Golden Dataset with 500–1000 cases catches roughly 60–70% of prompt-change regressions before they reach production, based on experience across several production systems. The remaining 30–40% are novel failure modes — which get added to the dataset after they’re found, continuously improving the coverage.

HITL approval flows feel like friction during development (“the agent should just do it”). In production, they become the feature that saves your team from the 2am incident where the agent queued 50,000 email sends based on a misconfigured trigger. Every high-stakes agentic system needs at least one HITL checkpoint. Design it in from the start — retrofitting it into an existing state machine is painful.

The honest framing: every hour you invest in AgentOps before launch is worth roughly five hours of incident response after it. The math isn’t complicated. The discipline is.

What Comes Next

Article 4 is where we zoom out from how individual agents work to how different agents talk to each other — across team boundaries, vendor boundaries, and trust boundaries.

The Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication standards are quietly becoming the connective tissue of the agentic ecosystem. If you’re building agents that need to interoperate with tools, services, or other agents you don’t control — which is almost everyone — understanding these protocols is no longer optional.

                  L1: Stateless       L2: Tool-Augmented     L3: Autonomous               L4: Multi-Agent              L5: Self-Correcting
------------------------------------------------------------------------------------------------------------------------------------------------------
Execution         Serverless / Edge   Serverless + integr.   Long-running container       Distributed orchestrator     Distributed + feedback loops
State             None                None                   Short + long-term memory     Shared state across agents   State + mutation history
Latency profile   Predictable         Slightly variable      Variable (loop-dependent)    High, parallelizable         Highest, bounded by budget
Cost model        Linear (tokens)     Linear + tool costs    Nonlinear (calls per task)   Nonlinear × agent count      Nonlinear × iteration count
Primary failure   Bad retrieval       Tool hallucination     Context overflow             Cascade failures             Runaway loops
Observability     Basic logging       Tool call tracing      Full trace per loop          Cross-agent tracing          Cost + quality dashboards

Included for reference from Article 1 — AgentOps practices apply most critically at L3, L4, and L5.

This is Article 3 of a 4-part series on Agentic AI Architectures. Article 4 — Agentic Protocols (MCP and A2A) — is the final piece.

Agentic Architectures — Article 2: Advanced Coordination and Reasoning Patterns

Ali Suleyman TOPUZ — Sun, 29 Mar 2026 16:28:59 +0000

Solving the “Stochastic Parrot” Problem with Structured Logic

There’s a criticism of large language models that has stuck around since 2021, and it still stings a little: the “stochastic parrot” argument. The idea is that LLMs are sophisticated pattern-matchers that produce statistically plausible text without any genuine understanding behind it. They’re parroting, not reasoning.

I’m not here to settle that philosophical debate. What I am here to tell you is this: if your agentic system behaves like a stochastic parrot — confidently producing plausible-sounding but wrong answers, failing to backtrack when it hits a dead end, unable to break a hard problem into manageable pieces — the fix is almost never the model. It’s the architecture.

The difference between an agent that looks intelligent in a demo and one that stays intelligent in production comes down to coordination and reasoning patterns. How does your agent plan? How does it check its own work? How do multiple agents share what they know without drowning each other in JSON?

That’s what this article is about.

Dynamic Planning: From Static Chains to Hierarchical Thinking

The first generation of “agentic” products were really just dressed-up chains. You’d define a fixed sequence of LLM calls — summarize, then classify, then respond — and call it a pipeline. It worked for simple, predictable tasks. It fell apart the moment the real world showed up.

Real tasks are rarely linear. A user asking “research our top three competitors and draft a positioning document” doesn’t map cleanly to a fixed sequence of steps. The number of competitors might be two or five. Each competitor might require a different depth of research. The positioning document might need a complete rewrite after the research reveals something unexpected.

What you need is Hierarchical Planning — a “Manager” agent that treats the task as a problem to be decomposed, not a script to be executed.

The pattern works like this:

User Task
    └── Manager Agent (Planner)
            ├── Sub-task A → Worker Agent 1
            ├── Sub-task B → Worker Agent 2
            └── Sub-task C → Worker Agent 3
                    └── Sub-sub-task C1 → Worker Agent 3a

The Manager receives the top-level goal and produces a structured plan — a list of sub-tasks with dependencies, assigned roles, and success criteria. Worker agents execute their assigned sub-tasks and report results back. The Manager synthesizes the results, evaluates whether the goal has been met, and either delivers the final output or replans if something went wrong.

The critical implementation detail that most tutorials skip: the plan must be a living document, not a frozen spec. If Worker Agent 2 comes back with an unexpected result — say, a competitor has already pivoted out of your market — the Manager needs to update the plan in response. A Manager that rigidly executes the original plan in the face of new information isn’t planning; it’s just executing a slightly fancier chain.

In practice, this means storing the plan in a mutable shared state that the Manager can read and rewrite between steps. LangGraph handles this elegantly with its state graph model. CrewAI has a more opinionated take with its hierarchical process mode. Both work — the choice depends on how much control you want over the graph structure (more on that shortly).
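Independent of any framework, the "living plan" idea can be sketched as a small mutable data structure. This is not LangGraph's or CrewAI's API. The `Plan` and `SubTask` names are hypothetical, and the Manager's decision logic (normally a model call) is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubTask:
    task_id: str
    description: str
    depends_on: List[str] = field(default_factory=list)
    status: str = "pending"  # pending | done | cancelled
    result: str = ""


@dataclass
class Plan:
    goal: str
    tasks: Dict[str, SubTask] = field(default_factory=dict)

    def add(self, task: SubTask) -> None:
        self.tasks[task.task_id] = task

    def ready(self) -> List[str]:
        """IDs of pending sub-tasks whose dependencies are all done."""
        return [t.task_id for t in self.tasks.values()
                if t.status == "pending"
                and all(self.tasks[d].status == "done" for d in t.depends_on)]

    def complete(self, task_id: str, result: str) -> None:
        self.tasks[task_id].status = "done"
        self.tasks[task_id].result = result

    def revise(self, cancel: List[str], add: List[SubTask]) -> None:
        """Replan: drop sub-tasks that no longer make sense and add new ones."""
        for task_id in cancel:
            self.tasks[task_id].status = "cancelled"
        for task in add:
            self.add(task)
```

In the competitor example, the Manager would call `revise` after a Worker reports that a competitor has left the market: cancel the original drafting task and add one scoped to the remaining competitors.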

Fractal Chain-of-Thought: Reasoning That Zooms In

Standard Chain-of-Thought prompting — “think step by step before answering” — is one of the most reliable techniques for improving LLM reasoning quality. But it has a ceiling. For deeply complex problems, a flat sequence of reasoning steps runs out of resolution. The model is reasoning about the right things at the wrong granularity.

Fractal Chain-of-Thought (FCoT) addresses this by making reasoning recursive. When an agent encounters a sub-problem that is itself complex enough to warrant multi-step reasoning, it spawns a nested reasoning process rather than trying to resolve it in a single step.

Think of it like a zoom function. The top-level reasoning operates at the problem level:

Problem: Optimize our database query performance
  Step 1: Identify the slow queries
  Step 2: Analyze the execution plans
  Step 3: Propose index changes
  Step 4: Estimate performance impact

But Step 2 — “analyze the execution plans” — is itself a multi-step reasoning problem that deserves its own chain:

Sub-problem: Analyze execution plan for Query #7
  Step 2.1: Identify full table scans
  Step 2.2: Check join order efficiency
  Step 2.3: Evaluate predicate pushdown opportunities
  Step 2.4: Flag missing statistics

And Step 2.3 might zoom in further still.

The implementation is cleaner than it sounds. You give the agent a tool called something like deep_reason(sub_problem: str) -> str that recursively invokes the same reasoning architecture on the sub-problem. The result gets folded back into the parent reasoning chain. You set a maximum recursion depth (3-4 levels is usually plenty) to prevent infinite descent.
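A rough sketch of that recursion, under stated assumptions: the `reason` callable stands in for a model call and is assumed to return an answer plus any sub-problems it wants to zoom into, and `deep_reason` takes extra parameters (`reason`, `depth`) that the one-argument tool signature above would hide behind a closure. The decomposition table is toy data.

```python
from typing import Callable

MAX_DEPTH = 3  # the text above suggests 3-4 levels

# Hypothetical stand-in for an LLM call: returns (answer, sub_problems).
Reasoner = Callable[[str], tuple[str, list[str]]]


def deep_reason(problem: str, reason: Reasoner, depth: int = 0) -> str:
    """Recursively reason about `problem`, folding sub-results into the parent."""
    answer, sub_problems = reason(problem)
    if depth >= MAX_DEPTH - 1 or not sub_problems:
        return answer  # depth budget spent, or nothing left to zoom into
    nested = [deep_reason(sub, reason, depth + 1) for sub in sub_problems]
    return f"{answer} [" + "; ".join(nested) + "]"


# Toy reasoner with a fixed decomposition table, standing in for a model.
DECOMPOSITION = {
    "optimize queries": ["analyze execution plans"],
    "analyze execution plans": ["evaluate predicate pushdown"],
    "evaluate predicate pushdown": ["inspect filter columns"],
}


def toy_reasoner(problem: str) -> tuple[str, list[str]]:
    return f"resolved({problem})", DECOMPOSITION.get(problem, [])


trace = deep_reason("optimize queries", toy_reasoner)
print(trace)
```

With `MAX_DEPTH = 3`, the fourth level ("inspect filter columns") is never expanded, which is the guard against infinite descent.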

The payoff is significant for domains with nested complexity — legal analysis, systems debugging, financial modeling. The cost is proportionally higher token usage. FCoT is a targeted tool, not a default setting.

The Reflection Pattern: Building a Critic That Actually Criticizes

Here’s a failure mode that bites almost every team eventually: you implement a self-review step where the same model that generated an output also reviews it. The model gives itself a pass. Every time. The “reflection” becomes a rubber stamp.

This happens because LLMs are, to put it charitably, optimistic about their own work. The same statistical patterns that produced the original output will evaluate it favorably. You’ve built a conflict of interest into your architecture.

The fix is the Critic Pattern, and the key design principle is model diversity.

Generator (Model A) → Output → Critic (Model B) → Feedback → Generator → Revised Output

Using a different model for the critic role — Claude reviewing GPT-4o output, or Gemini reviewing Claude output — introduces genuine perspective diversity. Each model has different training data emphases, different failure modes, and different stylistic biases. A cross-model critic is far more likely to catch errors that the generator is systematically blind to.

The Critic agent should be given a structured evaluation rubric, not a vague “review this” prompt. A good rubric for a code-generating agent might look like:

Correctness: Does the code do what the spec requires?
Edge cases: Are null inputs, empty collections, and boundary values handled?
Security: Are there injection vectors, exposed secrets, or unsafe deserialization?
Readability: Would a mid-level engineer understand this without comments?
Test coverage: Are the happy path and at least two failure paths tested?

The Critic returns a structured response — pass/fail per criterion, plus specific feedback for each failure. The Generator receives this structured feedback and revises. This continues until all criteria pass or the maximum iteration count is reached.
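The loop described above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `generate` and `critique` are plain callables standing in for two different models, and the toy versions below only exercise the edge-case criterion.

```python
from dataclasses import dataclass
from typing import Callable

RUBRIC = ["correctness", "edge_cases", "security", "readability", "test_coverage"]


@dataclass
class Verdict:
    criterion: str
    passed: bool
    feedback: str = ""


Generator = Callable[[str, list[Verdict]], str]  # (spec, failures) -> output
Critic = Callable[[str, str], list[Verdict]]     # (spec, output) -> one verdict per criterion


def reflect(spec: str, generate: Generator, critique: Critic, max_iterations: int = 3):
    """Generate, critique, and revise until every criterion passes or the cap is hit."""
    failures: list[Verdict] = []
    output = ""
    for iteration in range(1, max_iterations + 1):
        output = generate(spec, failures)
        failures = [v for v in critique(spec, output) if not v.passed]
        if not failures:
            return output, iteration, True
    return output, max_iterations, False


def toy_generate(spec: str, failures: list[Verdict]) -> str:
    # Stand-in for Model A: only fixes the bug once the critic points it out.
    if any(v.criterion == "edge_cases" for v in failures):
        return "def first(xs): return xs[0] if xs else None"
    return "def first(xs): return xs[0]"


def toy_critique(spec: str, output: str) -> list[Verdict]:
    # Stand-in for Model B: fails the edge-case criterion on the naive version.
    return [
        Verdict(c, passed=True) if c != "edge_cases" or "else None" in output
        else Verdict(c, passed=False, feedback="Empty list raises IndexError")
        for c in RUBRIC
    ]


output, iterations, passed = reflect("return first element", toy_generate, toy_critique)
print(iterations, passed, output)
```

Returning the iteration count alongside the output makes it easy to log how often the loop actually needs a second pass, which matters once you start paying for it.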

One underrated implementation detail: give the Critic explicit permission to fail things. If your Critic prompt says “review this and suggest improvements,” you’ll get suggestions. If it says “your job is to find reasons this should not ship — be adversarial,” you’ll get a real review.

State Machines vs. DAGs: Choosing Your Control Flow Model

This is the question that causes more architecture debates than almost any other in the multi-agent space, and the answer is genuinely context-dependent.

Directed Acyclic Graphs (DAGs) model workflows that flow in one direction without cycles. Task A feeds into Task B and Task C; B and C feed into Task D; done. This is the natural model for pipelines where each step produces input for the next and you never need to revisit a completed step.

CrewAI’s sequential and hierarchical processes are essentially DAG-based. Temporal workflows are commonly written as DAG-shaped pipelines, although Temporal workflow code can also loop. Both are excellent for deterministic, well-understood workflows where the shape of the computation is known in advance.

Cyclic graphs (State Machines) allow loops — the ability to return to a previous state based on new information. This is what LangGraph was purpose-built for, and it’s the right model for any agent that needs to:

Retry a failed tool call with modified parameters
Return to a planning step after discovering the current plan won’t work
Run a reflection loop until quality criteria are met
Wait for human approval before proceeding

The decision rule I’ve converged on after shipping several production systems:

Does your agent ever need to go backwards?
               /            \
             YES             NO
              |               |
          LangGraph     CrewAI / Temporal
       (cyclic graph)     (DAG model)

“Going backwards” means any scenario where the correct next step depends on the outcome of a previous step in a way that might require revisiting earlier work. Reflection loops go backwards. Replanning goes backwards. Waiting for a human to approve and then resuming goes backwards.

If your workflow is genuinely linear — always the same steps, always in the same order, with no branching based on intermediate results — a DAG model is simpler and easier to reason about. But be honest with yourself about whether your workflow is actually linear or whether you’re just assuming it will be.
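To show what a backward edge looks like without committing to any framework, here is a small hand-rolled state machine in plain Python. The node names and the "succeeds on the second attempt" tool are invented for illustration; LangGraph expresses the same shape with its own graph API, which is not used here.

```python
from typing import Callable

State = dict  # shared mutable state passed to every node
Node = Callable[[State], str]  # a node mutates state and returns the next node name


def plan(state: State) -> str:
    state["attempt"] = state.get("attempt", 0) + 1
    return "act"


def act(state: State) -> str:
    # Hypothetical tool call that only succeeds on the second attempt.
    state["ok"] = state["attempt"] >= 2
    return "review"


def review(state: State) -> str:
    return "done" if state["ok"] else "plan"  # the backward edge


def run(nodes: dict[str, Node], start: str, state: State, max_steps: int = 20) -> list[str]:
    """Walk the graph until 'done', recording the path; max_steps bounds the loop."""
    path, current = [], start
    for _ in range(max_steps):
        if current == "done":
            return path
        path.append(current)
        current = nodes[current](state)
    raise RuntimeError("step budget exhausted before reaching 'done'")


path = run({"plan": plan, "act": act, "review": review}, "plan", {})
print(path)
```

The path visits `plan` twice, which a DAG cannot express. The `max_steps` bound is the price of that flexibility: once cycles are allowed, termination is something you have to enforce yourself.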

The infrastructure implications differ significantly:

                    | DAG Model                      | Cyclic / State Machine
---------------------------------------------------------------------------------------------------
Execution model     | Step functions / pipelines     | Long-running stateful process
State management    | Passed between steps           | Persisted in graph state store
Debugging           | Linear trace, easy to follow   | Requires full state inspection per node
Scalability         | Each step scales independently | Entire graph runs in one execution context
Failure recovery    | Retry from last step           | Checkpoint and resume from last stable state
Cost predictability | High (bounded steps)           | Lower (loop count is variable)

Shared Epistemic Memory: The Blackboard Architecture

Here’s a scaling problem that hits every team that gets beyond three or four agents: how do agents share what they know?

The naive approach is to pass everything as function arguments — the output of Agent A becomes the input to Agent B as a large JSON blob. This works until it doesn’t. The blob grows. Context windows fill up. You start seeing agents with 80% of their context window consumed by state they don’t actually need for their specific sub-task.

The sophisticated approach is the Blackboard Architecture — a pattern borrowed from classical AI and distributed systems that is experiencing a quiet renaissance in the agentic era.

The concept is simple: instead of passing state between agents directly, all agents read from and write to a shared “blackboard” — a structured, queryable state store that sits outside any individual agent.

                    ┌──────────────────┐
                    │    BLACKBOARD    │
                    │  (Shared State)  │
                    └────────┬─────────┘
          ┌──────────────────┼──────────────────┐
          ↓                  ↓                  ↓
       Agent A            Agent B            Agent C
   (reads/writes)     (reads/writes)     (reads/writes)

Each agent reads only the sections of the blackboard relevant to its current task. Each agent writes its outputs back to designated sections. No agent needs to know what other agents are doing — it just needs to know the schema of the blackboard.

In practice, the blackboard is typically implemented as:

A structured document in a database (DynamoDB, Redis, or Postgres with JSONB) for fast key-based access to specific state sections
A vector store for semantic retrieval when agents need to find relevant context without knowing the exact key
A message log for ordered history that agents can replay or summarize

The schema design of your blackboard is one of the most important architectural decisions you’ll make. Too flat and agents can’t find what they need without reading everything. Too nested and updates become complex. A layered approach works well: top-level sections for task metadata, agent outputs, shared knowledge, and execution history.

One design principle worth emphasizing: write provenance into every blackboard entry. Every piece of information written to the shared state should include which agent wrote it, when, and with what confidence level. When a downstream agent reads a fact and makes a decision based on it, you want to be able to trace that decision back to its source when something goes wrong.
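As a sketch of what provenance-carrying entries might look like, here is an in-memory blackboard. It is a stand-in for a real store such as Redis or Postgres, and the class names, section names, and fields are assumptions chosen for the example rather than a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Entry:
    value: Any
    agent: str          # which agent wrote it
    confidence: float   # how sure the writer was
    written_at: datetime


@dataclass
class Blackboard:
    """In-memory stand-in for a shared, sectioned state store."""
    sections: dict[str, dict[str, list[Entry]]] = field(default_factory=dict)

    def write(self, section: str, key: str, value: Any, *, agent: str, confidence: float) -> None:
        entry = Entry(value, agent, confidence, datetime.now(timezone.utc))
        # Append rather than overwrite so the history of a key stays traceable.
        self.sections.setdefault(section, {}).setdefault(key, []).append(entry)

    def read(self, section: str, key: str) -> Optional[Entry]:
        history = self.sections.get(section, {}).get(key, [])
        return history[-1] if history else None

    def provenance(self, section: str, key: str) -> list[tuple[str, float]]:
        return [(e.agent, e.confidence) for e in self.sections.get(section, {}).get(key, [])]


board = Blackboard()
board.write("shared_knowledge", "competitor_status", "active", agent="researcher", confidence=0.6)
board.write("shared_knowledge", "competitor_status", "pivoted", agent="analyst", confidence=0.9)

latest = board.read("shared_knowledge", "competitor_status")
print(latest.value, board.provenance("shared_knowledge", "competitor_status"))
```

Keeping the full write history per key, instead of only the latest value, is one way to answer "who told us this, and what did we believe before?" when a downstream decision turns out to be wrong.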

Production Reality Check

The patterns in this article are genuinely powerful. They’re also genuinely expensive, and the cost compounds in ways that aren’t obvious until you’re staring at a cloud bill.

Let’s put some real numbers on it.

A Reflection loop that runs an average of 2 iterations doubles your model call count. If your base cost is $0.05 per task, reflection takes it to $0.10. That sounds manageable — until you’re handling 50,000 tasks per day, at which point reflection alone adds $2,500/day in model costs.

Fractal Chain-of-Thought with 3 levels of recursion and an average of 4 steps per level generates roughly 64 reasoning steps at the deepest level (4³), or 84 if you also count the intermediate levels, for a single complex query. At even modest token counts per step, this can push a single query cost into the $0.50–$2.00 range. Reserve it for problems that actually need that depth.
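The arithmetic behind those two figures is simple enough to keep in a helper and rerun with your own numbers. The function names are mine, and the model assumes reflection multiplies call count by the average iteration count and that every FCoT step expands into the same number of sub-steps.

```python
def reflection_daily_overhead(base_cost: float, avg_iterations: int, tasks_per_day: int) -> float:
    """Extra daily spend when reflection multiplies model calls by avg_iterations."""
    return base_cost * (avg_iterations - 1) * tasks_per_day


def fcot_steps(steps_per_level: int, levels: int) -> tuple[int, int]:
    """Return (steps at the deepest level, total steps across all levels)."""
    per_level = [steps_per_level ** depth for depth in range(1, levels + 1)]
    return per_level[-1], sum(per_level)


overhead = reflection_daily_overhead(0.05, 2, 50_000)
leaf_steps, total_steps = fcot_steps(4, 3)
print(f"reflection adds about ${overhead:,.0f}/day")
print(f"FCoT: {leaf_steps} leaf steps, {total_steps} steps in total")
```

For 4 steps per level and 3 levels, the deepest level has 64 steps and the whole tree has 84 (4 + 16 + 64).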

Cross-model Critic patterns (e.g., Claude reviewing GPT-4o) introduce a second API dependency with its own latency, rate limits, and cost curve. Budget for both. More importantly, test what happens when the Critic’s API goes down — your system should degrade gracefully, not grind to a halt.

The honest question to ask before adding any coordination pattern is: what’s the quality delta, and what’s it worth?

Reflection improving accuracy from 78% to 91% on a customer-facing recommendation engine that drives revenue? Worth the cost. Reflection improving accuracy from 94% to 96% on an internal summarization tool that saves analysts 10 minutes a day? Probably not.

Measure first. Add complexity second.

What Comes Next

In Article 3, we shift from how agents think to how agents fail — and more importantly, how you detect and recover from those failures before your users do.

AgentOps is the unglamorous but absolutely essential discipline of treating your AI system like the distributed system it actually is: with observability, guardrails, eval pipelines, and human checkpoints baked in from the start. Not bolted on after the first production incident.

Because there will be a production incident. There always is.

                | L1: Stateless     | L2: Tool-Augmented   | L3: Autonomous             | L4: Multi-Agent            | L5: Self-Correcting
---------------------------------------------------------------------------------------------------------------------------------------------------
Execution       | Serverless / Edge | Serverless + integr. | Long-running container     | Distributed orchestrator   | Distributed + feedback loops
State           | None              | None                 | Short + long-term memory   | Shared state across agents | State + mutation history
Latency profile | Predictable       | Slightly variable    | Variable (loop-dependent)  | High, parallelizable       | Highest, bounded by budget
Cost model      | Linear (tokens)   | Linear + tool costs  | Nonlinear (calls per task) | Nonlinear × agent count    | Nonlinear × iteration count
Primary failure | Bad retrieval     | Tool hallucination   | Context overflow           | Cascade failures           | Runaway loops
Observability   | Basic logging     | Tool call tracing    | Full trace per loop        | Cross-agent tracing        | Cost + quality dashboards

Included for reference from Article 1 — the coordination patterns in this article map primarily to L3, L4, and L5.

This is Article 2 of a 4-part series on Agentic AI Architectures.

Agentic Architectures — Article 1: The Agentic AI Maturity Model

Ali Suleyman TOPUZ — Sun, 29 Mar 2026 16:28:25 +0000

From “Just Call the API” to Self-Evolving Ecosystems

There’s a conversation I keep having with engineering teams. Someone has just shipped a feature that calls GPT-4o or Claude, the demo looks impressive, and then a product manager walks in and asks: “So when do we make it fully autonomous?”

The room goes quiet.

The problem isn’t ambition — it’s vocabulary. “Autonomous” means five completely different things depending on who’s in the room. To the CTO, it means cost savings. To the ML engineer, it means ReAct loops and tool-calling. To the backend team, it means a distributed system they’re going to have to debug at 2am.

What we need is a shared language. A maturity model.

I’ve spent the last two years building production AI systems — RAG pipelines, multi-agent orchestrators, agentic workflows running on cloud runtimes — and I’ve come to believe that every system you build sits at one of five levels. Knowing which level you’re on is the single most important thing you can do before making architectural decisions.

Let’s walk through all five.

Level 1: Prompt-Based — The Stateless

The signature move: You write a prompt. The model responds. Done.

This is where every team starts, and there’s no shame in it. A well-engineered Level 1 system — think basic RAG with a vector database, or a single-turn LLM call wrapped in a clean API — can handle an enormous amount of real business value. Customer FAQ bots, document summarization, code explanation tools: these are Level 1, and they work.

The architecture is simple because the state is zero. Each request is born and dies in a single HTTP round-trip. There’s no memory between turns, no planning, no tool use. The LLM is a sophisticated function: input goes in, text comes out.

User Query → [Context Retrieval] → Prompt → LLM → Response

Infrastructure fingerprint: A single serverless function (Lambda, Cloud Run) is often enough. Latency is predictable because you’re making exactly one model call. Cost is linear and easy to forecast. The main failure mode is retrieval quality — garbage in, garbage out — not the agent layer, because there is no agent layer.

Where it breaks: The moment a user wants the system to do something with the answer. “Summarize this contract” is Level 1. “Summarize this contract and then send the action items to our Jira board” is not.

Level 2: Tool-Augmented — The Doer

The signature move: The model decides which function to call, and your infrastructure executes it.

This is where things get genuinely interesting — and where a surprising number of teams stop, thinking they’ve “done AI.” Function calling (or “tool use” in Anthropic’s terminology) fundamentally changes the mental model. The LLM is no longer just generating text; it’s generating intent.

You define a set of tools — an OpenAPI spec, a Python function schema, a list of MCP-compatible endpoints — and the model figures out which ones to invoke based on the user’s request. Your code handles the execution and feeds the result back.

User Query → LLM (reasoning) → Tool Call → Execution → LLM (synthesis) → Response

What makes Level 2 non-trivial in production is error handling. Models hallucinate tool names. They pass arguments with wrong types. They call a write endpoint when they should have called a read one. A robust Level 2 system needs:

Input validation on every tool call before execution
Graceful fallbacks when a tool returns an error (don’t just crash — tell the model what went wrong and let it retry)
Idempotency checks on any tool that mutates state
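A minimal sketch of the first two items, assuming a hypothetical in-process tool registry (`TOOLS`) keyed by name with expected argument types. Idempotency is left out here because it depends on the backing system. The key design choice is that bad calls come back as structured error payloads the model can read, instead of exceptions that end the request.

```python
from typing import Any, Callable

# Hypothetical tool registry: name -> (handler, expected argument types).
TOOLS: dict[str, tuple[Callable[..., Any], dict[str, type]]] = {
    "get_balance": (lambda account_id: {"balance": 120.0}, {"account_id": str}),
}


def execute_tool_call(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Validate a model-proposed call; return an error payload instead of raising."""
    if name not in TOOLS:
        return {"ok": False, "error": f"Unknown tool '{name}'. Available: {sorted(TOOLS)}"}
    handler, schema = TOOLS[name]
    missing = [k for k in schema if k not in args]
    unexpected = [k for k in args if k not in schema]
    wrong_type = [k for k, t in schema.items() if k in args and not isinstance(args[k], t)]
    if missing or unexpected or wrong_type:
        return {"ok": False, "error": f"Bad arguments: missing={missing}, "
                                      f"unexpected={unexpected}, wrong_type={wrong_type}"}
    try:
        return {"ok": True, "result": handler(**args)}
    except Exception as exc:  # surface the failure to the model so it can retry
        return {"ok": False, "error": f"{name} failed: {exc}"}


hallucinated = execute_tool_call("get_balanse", {"account_id": "acct-1"})
bad_type = execute_tool_call("get_balance", {"account_id": 42})
good = execute_tool_call("get_balance", {"account_id": "acct-1"})
print(hallucinated["error"])
print(bad_type["error"])
print(good)
```

Feeding the `error` string back into the next model turn is what turns a hallucinated tool name into a retry instead of a crash.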

The OpenAPI spec integration story is particularly powerful here. If you describe your internal APIs in OpenAPI format, you can essentially give the model a self-describing interface to your entire backend. This is the beating heart of products like Copilot for enterprise apps.

Infrastructure fingerprint: You’re now managing tool execution latency in addition to model latency. Two or three tool calls in sequence, each taking 200–500ms, can make a “fast” response feel slow. Start thinking about parallelizing independent tool calls. Cost starts to diverge from simple per-token math — a tool that calls a third-party API has its own cost curve.

Where it breaks: When the task requires multi-step reasoning across interdependent actions. The model can call tools, but it can’t hold a plan in its head across a long sequence of them. For that, you need state.

Level 3: Autonomous Agents — The Planner

The signature move: The ReAct loop. Reason, Act, Observe, repeat.

This is the architecture most people have in mind when they say “agent.” It was introduced in the 2022 paper ReAct: Synergizing Reasoning and Acting in Language Models, and the core idea is elegantly simple: instead of a single prompt-response cycle, you give the model a loop.

Thought → Action → Observation → Thought → Action → Observation → ... → Final Answer

At each step, the model articulates its reasoning (“I need to check the user’s account balance before proceeding”), selects a tool, observes the result, and decides what to do next. The loop continues until the model decides it has enough information to respond.
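Stripped of any framework, the loop looks roughly like this. The `policy` callable is a hypothetical stand-in for the model: it reads the transcript and either picks a tool or finishes (the explicit "Thought" text is folded into that decision to keep the sketch short). The scripted policy and the `get_balance` tool are toy examples.

```python
from typing import Callable

# Hypothetical policy standing in for the LLM: given the transcript so far,
# it returns ("act", tool_name, tool_input) or ("finish", answer, "").
Policy = Callable[[list[str]], tuple[str, str, str]]


def react(question: str, policy: Policy, tools: dict[str, Callable[[str], str]],
          max_steps: int = 5) -> tuple[str, list[str]]:
    """Run the act/observe loop until the policy finishes or the budget runs out."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        kind, name, payload = policy(transcript)
        if kind == "finish":
            transcript.append(f"Final Answer: {name}")
            return name, transcript
        transcript.append(f"Action: {name}({payload})")
        observation = tools[name](payload) if name in tools else f"unknown tool {name}"
        transcript.append(f"Observation: {observation}")
    return "step budget exhausted", transcript


def scripted_policy(transcript: list[str]) -> tuple[str, str, str]:
    # Finish once an observation is in the transcript; otherwise look up the balance.
    for line in transcript:
        if line.startswith("Observation: "):
            return "finish", f"Your balance is {line.removeprefix('Observation: ')}", ""
    return "act", "get_balance", "acct-1"


answer, transcript = react("What is my balance?", scripted_policy,
                           {"get_balance": lambda account: "120.00"})
print(answer)
```

Note the `max_steps` cap: the model decides when it has enough information, but the loop should never rely on it to stop.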

What makes Level 3 qualitatively different from Level 2 is memory management. A ReAct agent needs to track what it’s done, what it’s learned, and what it still needs to do. This splits into two distinct concerns:

Short-term memory is the conversation context — the running thread of thoughts, actions, and observations that constitutes the current task. In practice, this is the LLM’s context window, and it’s finite. Naive implementations stuff everything into the context until it overflows. Sophisticated ones implement sliding windows, summarization, or structured scratchpads.
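A sliding window is the simplest of those three strategies. The sketch below keeps the original task message plus as many of the newest messages as fit a budget. `trim_context` is a name I made up, and word count is a crude stand-in for a real tokenizer.

```python
def trim_context(messages: list[str], budget: int, keep_first: int = 1) -> list[str]:
    """Keep the first `keep_first` messages plus the newest ones that fit the budget.

    Word count is a crude stand-in for a real tokenizer.
    """
    head = messages[:keep_first]
    used = sum(len(m.split()) for m in head)
    tail: list[str] = []
    for message in reversed(messages[keep_first:]):
        cost = len(message.split())
        if used + cost > budget:
            break  # everything older than this is dropped
        tail.append(message)
        used += cost
    return head + tail[::-1]


history = [
    "Task: reconcile the March invoices",
    "Thought: list invoices first",
    "Observation: 212 invoices found",
    "Thought: check the three flagged ones",
    "Observation: two are duplicates",
]
window = trim_context(history, budget=16)
print(window)
```

Pinning the first message matters: a window that only keeps recent turns will eventually forget what the task was.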

Long-term memory is everything the agent needs to remember across tasks — user preferences, learned facts, past decisions. This typically lives outside the model entirely: a vector database for semantic retrieval, a key-value store for structured facts, or a graph database for relational knowledge.

The combination of a reasoning loop and dual-layer memory is what gives Level 3 agents their apparent intelligence. They can decompose problems, backtrack when a tool call fails, and accumulate knowledge over a session in ways that feel remarkably human.

Infrastructure fingerprint: Now you’re operating a stateful, long-running process. Serverless functions with 30-second timeouts don’t cut it anymore. You need persistent execution environments — containerized long-running services, step function orchestrators, or purpose-built agent runtimes (AWS Bedrock AgentCore, Azure AI Foundry Agent Service). Token costs are no longer linear: a complex reasoning chain might make 8–12 model calls to answer one user query. Build cost monitoring from day one.

Where it breaks: A single agent with access to all tools is a single point of failure — and a single point of security exposure. When the task requires genuine parallelism or specialist expertise, one planner isn’t enough.

Level 4: Multi-Agent Orchestration — The Team

The signature move: Specialized agents with defined roles, coordinated by an orchestrator.

The intuition here maps cleanly to how human teams work. You don’t hire one person who is simultaneously a senior engineer, a QA lead, a security auditor, and a product manager. You build a team. Level 4 applies the same logic to AI.

A canonical software engineering multi-agent system might look like this:

Orchestrator Agent: Receives the task, breaks it into sub-tasks, routes work, and assembles the final output.
Coder Agent: Writes code given a spec. Has access to file system tools and a code execution sandbox.
Reviewer Agent: Reads code, applies a checklist, and returns structured feedback. Possibly runs on a different model for perspective diversity.
Tester Agent: Generates test cases, runs them against the code, and reports pass/fail.
Security Agent: Scans for common vulnerabilities (injection, exposed secrets) before the code is merged.

Each agent operates within a narrow, well-defined context. This matters for three reasons:

Reduced hallucination: A focused prompt with a specific role and limited tool access produces more reliable output than a general-purpose agent trying to do everything.

Parallelism: Independent sub-tasks can run concurrently. The Reviewer and Tester can work in parallel on the same code diff.

Accountability: When something goes wrong (and it will), you can isolate which agent in the pipeline failed and why. This is far easier than debugging a single monolithic agent’s 40-step reasoning trace.

The coordination layer is where the real engineering lives. You need to decide how agents communicate — direct calls, a message queue, a shared state store — and how to handle failures in one agent without cascading across the whole system. (More on this in Article 2.)

Infrastructure fingerprint: You are now running a distributed system. All of the distributed systems problems apply: network partitions, partial failures, ordering guarantees, idempotency. Your observability stack needs to trace requests across agents, not just within one. Tools like LangSmith or Arize Phoenix become essential, not optional. Compute costs multiply with agent count: a four-agent pipeline in which each agent makes 5 model calls generates 20 model calls per user request, before any retries or replanning.

Where it breaks: Quality drift. Multi-agent systems can converge on confidently wrong answers because each agent assumes the previous one got it right. No one is questioning the chain. That’s the job of Level 5.

Level 5: Self-Correcting Systems — The Optimizer

The signature move: Agents that critique their own output and update their own behavior.

This is the frontier — and the most misunderstood level. “Self-correcting” doesn’t mean the AI is rewriting its own weights (that’s training, not inference). It means the system has architectural mechanisms to catch its own errors and improve its outputs within a deployment.

The foundational pattern is Reflection. After an agent produces an output, a separate “Critic” agent (or a second pass of the same model with a different prompt) evaluates it against a rubric:

Does this answer the original question?
Are there factual claims that need verification?
Does the code actually compile and pass tests?
Is the tone appropriate for the context?

If the critic finds problems, the output goes back to the generator with structured feedback. The generator revises. The critic reviews again. This loop runs until the output passes — or until a maximum iteration count is hit (always set one; an infinite reflection loop is a runaway cost event).

The more advanced form of this is prompt mutation — when an agent not only fixes its current output but also updates the prompt template that produced it, so future calls start from a better baseline. This is where you start to see systems that genuinely improve over time without retraining.

Generator → Output → Critic → [Pass] → Deliver
                             → [Fail] → Feedback → Generator (repeat)

Some teams implement this with dedicated frameworks (DSPy’s prompt optimization is a notable example). Others build it manually by storing “lessons learned” in long-term memory that gets injected into future prompts.

Infrastructure fingerprint: The cost profile becomes unpredictable in a way that requires active management. A single reflection loop doubles your model calls. Two loops triple them. You need circuit breakers: hard limits on loop iterations, cost caps per request, and alerting when a task is taking 3x the expected token budget. The compute requirement is also asymmetric: reflection runs well on smaller models (you don’t need GPT-4o to critique a GPT-4o output; Claude 3.5 Haiku reviewing a Sonnet output can work remarkably well and costs a fraction of the alternative).
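Here is one way those three guards might fit into a single object that every model call reports to. The class, its fields, and the thresholds are assumptions for illustration; in production the alert would go to your monitoring system, not a return value.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    pass


@dataclass
class CircuitBreaker:
    max_iterations: int
    max_cost_usd: float
    expected_tokens: int
    alert_multiplier: float = 3.0
    iterations: int = 0
    cost_usd: float = 0.0
    tokens: int = 0

    def record(self, tokens: int, cost_usd: float) -> bool:
        """Record one model call; return True when usage should trigger an alert."""
        self.iterations += 1
        self.tokens += tokens
        self.cost_usd += cost_usd
        if self.iterations > self.max_iterations:
            raise BudgetExceeded("iteration limit reached")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost cap reached")
        return self.tokens > self.alert_multiplier * self.expected_tokens


breaker = CircuitBreaker(max_iterations=3, max_cost_usd=0.50, expected_tokens=1_000)
alerts = [breaker.record(tokens=1_600, cost_usd=0.04) for _ in range(3)]

try:
    breaker.record(tokens=1_600, cost_usd=0.04)  # a fourth call exceeds the limit
    tripped = False
except BudgetExceeded:
    tripped = True

print(alerts, tripped)
```

The alert fires from the second call onward (3,200 tokens against a 3,000-token threshold), and the fourth call is refused outright. Alerts are soft signals; the hard limits are what actually stop a runaway loop.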

Mapping Levels to Infrastructure

A quick reference for the architectural decisions that change at each level:

                | L1: Stateless     | L2: Tool-Augmented   | L3: Autonomous             | L4: Multi-Agent            | L5: Self-Correcting
---------------------------------------------------------------------------------------------------------------------------------------------------
Execution       | Serverless / Edge | Serverless + integr. | Long-running container     | Distributed orchestrator   | Distributed + feedback loops
State           | None              | None                 | Short + long-term memory   | Shared state across agents | State + mutation history
Latency profile | Predictable       | Slightly variable    | Variable (loop-dependent)  | High, parallelizable       | Highest, bounded by budget
Cost model      | Linear (tokens)   | Linear + tool costs  | Nonlinear (calls per task) | Nonlinear × agent count    | Nonlinear × iteration count
Primary failure | Bad retrieval     | Tool hallucination   | Context overflow           | Cascade failures           | Runaway loops
Observability   | Basic logging     | Tool call tracing    | Full trace per loop        | Cross-agent tracing        | Cost + quality dashboards

Production Reality Check

Here’s the honest conversation you need to have before choosing a level: most production systems should be Level 2 or 3, and that’s not a failure.

I’ve seen teams build Level 4 multi-agent systems because it felt more impressive, only to discover that a well-engineered Level 2 system with good tool design would have answered 80% of the queries faster, cheaper, and with fewer failure modes.

The maturity model isn’t a ladder you’re supposed to climb as fast as possible. It’s a map. The right level is the one where the complexity you’re adding is justified by the capability you’re gaining.

Some honest benchmarks from production:

A Level 3 ReAct agent making 8 model calls to answer a single query costs roughly 8–15x more than a Level 1 RAG call. The accuracy improvement is real — but measure it against your actual use case, not a benchmark.
Adding a reflection loop (Level 5 element) to a Level 3 agent typically improves output quality by 15–30% on complex reasoning tasks. It also doubles latency and cost. For a customer-facing product with a 3-second SLA expectation, that tradeoff often doesn’t pass.
Multi-agent systems (Level 4) have an operational overhead that is consistently underestimated. Plan for it to take 3–4x longer to debug a failure in a 4-agent pipeline than in a single agent — not because the problem is harder, but because the trace is longer and the failure point is further from where the error surfaces.

What Comes Next

In the next article, we’ll go deeper into the coordination and reasoning patterns that make Level 3 and Level 4 systems actually work in practice — hierarchical planning, the Critic architecture, and the surprisingly important question of whether you should use a cyclic graph or a DAG to model your agent’s workflow.

The short answer is: it depends on whether your agent ever needs to go backwards. And the answer is almost always yes.

This is Article 1 of a 4-part series on Agentic AI Architectures. The series covers the Maturity Model, Coordination & Reasoning Patterns, AgentOps, and Agentic Protocols (MCP & A2A).

Semantic Kernel for Enterprise AI: Architecting Production-Grade LLM Integration in .NET

Ali Suleyman TOPUZ — Sat, 21 Mar 2026 18:47:46 +0000

Semantic Kernel for Enterprise AI: Architecting Production-Grade LLM Integration in .NET — Implementation & Observability — Part 2

This is Part 2 of the series. Part 1 covered the foundational architecture of Semantic Kernel — plugins, planners, memory, and filters — along with the FinOps cost model and SRE failure taxonomy. In this part, we move from architecture to implementation: building the async-first parallel orchestration engine, the Redis-backed semantic cache, and the complete production filter pipeline with token metering.

I. Recap and What This Part Covers

Part 1 established that the gap between LLM demo and production system is architectural. Semantic Kernel closes that gap through four composable primitives — plugins, planners, memory, and filters — wrapped in a resilience and observability model that matches enterprise operational standards.

Part 2 builds on that foundation with concrete, production-ready .NET 9.0 implementations of the three highest-leverage components an architect must get right:

Async-First Parallel Orchestration — collapsing multi-step LLM workflows from sequential to concurrent execution, with proper cancellation, error isolation, and result aggregation patterns.

Redis Semantic Cache — a vector similarity-backed caching layer that achieves 30–60% token cost reduction in enterprise workloads, with TTL management, cache invalidation, and hit-rate instrumentation.

Complete Filter Pipeline — the full middleware chain covering rate limiting, semantic caching, audit logging, output validation, and token metering, wired together as a coherent operational stack.

II. Async-First Parallel Orchestration

2.1 The Sequential Trap

The most common performance anti-pattern in Semantic Kernel implementations is inadvertent sequential LLM invocation. Engineers familiar with async/await in .NET often write code that looks asynchronous but executes serially:

// ❌ Anti-pattern: Awaiting each invocation sequentially
// Total latency = sum of all individual latencies
public async Task<DocumentAnalysisResult> AnalyzeDocumentAsync(string documentId)
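{
    // Arguments shared by all four calls (defined here so the snippet is self-contained)
    var args = new KernelArguments { ["document_id"] = documentId };

    var summary = await _kernel.InvokeAsync("DocumentPlugin", "Summarize", args); // 2.1s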
    var entities = await _kernel.InvokeAsync("DocumentPlugin", "ExtractEntities", args); // 1.8s
    var sentiment = await _kernel.InvokeAsync("DocumentPlugin", "AnalyzeSentiment", args); // 1.4s
    var keywords = await _kernel.InvokeAsync("DocumentPlugin", "ExtractKeywords", args); // 1.2s

    // Total: ~6.5s — user is waiting for sequential completion
    return BuildResult(summary, entities, sentiment, keywords);
}

These are four independent LLM calls with no data dependencies on each other, yet they execute sequentially. The result is roughly a 3× latency penalty (about 6.5s against about 2.1s for the slowest single call) and a FinOps problem: every request stays in flight for the full sequential duration, which limits throughput.

2.2 The Parallel Orchestration Pattern

The correct pattern treats independent LLM invocations as parallel tasks, combining Task.WhenAll with proper cancellation token propagation and isolated error handling per invocation:

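// ✅ Parallel orchestration with error isolation
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;

public class ParallelDocumentAnalysisOrchestrator
{
    private readonly Kernel _kernel;
    private readonly ILogger<ParallelDocumentAnalysisOrchestrator> _logger;
    private readonly ParallelOrchestrationOptions _options;

    // Dependencies are injected through the constructor.
    public ParallelDocumentAnalysisOrchestrator(
        Kernel kernel,
        ILogger<ParallelDocumentAnalysisOrchestrator> logger,
        ParallelOrchestrationOptions options)
    {
        _kernel = kernel;
        _logger = logger;
        _options = options;
    }
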
    public async Task<DocumentAnalysisResult> AnalyzeDocumentAsync(
        string documentId,
        CancellationToken ct = default)
    {
        var document = await LoadDocumentAsync(documentId, ct);
        var sharedArguments = new KernelArguments
        {
            ["document_content"] = document.Content,
            ["document_id"] = documentId
        };
        // Define parallel execution units with individual timeout budgets
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(_options.MaxOrchestrationDuration); // Hard timeout for entire operation
        var summaryTask = InvokeWithTimeoutAsync(
            "DocumentPlugin", "Summarize", sharedArguments,
            timeout: TimeSpan.FromSeconds(8), ct: cts.Token);
        var entityTask = InvokeWithTimeoutAsync(
            "DocumentPlugin", "ExtractEntities", sharedArguments,
            timeout: TimeSpan.FromSeconds(6), ct: cts.Token);
        var sentimentTask = InvokeWithTimeoutAsync(
            "DocumentPlugin", "AnalyzeSentiment", sharedArguments,
            timeout: TimeSpan.FromSeconds(5), ct: cts.Token);
        var keywordTask = InvokeWithTimeoutAsync(
            "DocumentPlugin", "ExtractKeywords", sharedArguments,
            timeout: TimeSpan.FromSeconds(5), ct: cts.Token);
        // Await with Task.WhenAll instead of blocking with Task.WaitAll;
        // per-call failures are already captured inside InvokeWithTimeoutAsync
        var results = await Task.WhenAll(
            summaryTask, entityTask, sentimentTask, keywordTask);
        // Total latency = slowest individual invocation (~2.1s vs 6.5s sequential)
        return BuildResult(results[0], results[1], results[2], results[3]);
    }
    private async Task<OrchestrationResult> InvokeWithTimeoutAsync(
        string pluginName,
        string functionName,
        KernelArguments arguments,
        TimeSpan timeout,
        CancellationToken ct)
    {
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        timeoutCts.CancelAfter(timeout);
        try
        {
            var result = await _kernel.InvokeAsync(
                pluginName, functionName, arguments, timeoutCts.Token);
            return OrchestrationResult.Success(
                pluginName, functionName, result.GetValue<string>()!);
        }
        catch (OperationCanceledException) when (!ct.IsCancellationRequested)
        {
            // Individual invocation timed out; don't cancel sibling tasks
            _logger.LogWarning(
                "Function {Plugin}.{Function} exceeded timeout of {Timeout}ms",
                pluginName, functionName, timeout.TotalMilliseconds);
            return OrchestrationResult.TimedOut(pluginName, functionName);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Isolate failure so sibling tasks continue executing.
            // Caller-initiated cancellation is excluded here and propagates to WhenAll
            _logger.LogError(ex,
                "Function {Plugin}.{Function} failed with exception",
                pluginName, functionName);
            return OrchestrationResult.Failed(pluginName, functionName, ex);
        }
    }
}
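
The timeout isolation above can be tried without a model behind it. The following is a minimal sketch, not the article's orchestrator: `TimeoutIsolationDemo` and `FakeInvokeAsync` are hypothetical names, and `Task.Delay` stands in for the LLM call. It shows that one invocation exceeding its own budget is reported as a timeout while its sibling still completes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutIsolationDemo
{
    // Stand-in for an LLM call: completes after `latency` unless its token is cancelled first.
    public static async Task<string> FakeInvokeAsync(
        string name, TimeSpan latency, TimeSpan timeout, CancellationToken ct)
    {
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        timeoutCts.CancelAfter(timeout);
        try
        {
            await Task.Delay(latency, timeoutCts.Token);
            return $"{name}:ok";
        }
        catch (OperationCanceledException) when (!ct.IsCancellationRequested)
        {
            // Only this call's budget expired; sibling tasks keep running.
            return $"{name}:timeout";
        }
    }

    // Runs a fast call and a slow call side by side with separate budgets.
    public static Task<string[]> RunAsync(CancellationToken ct = default) =>
        Task.WhenAll(
            FakeInvokeAsync("summary", TimeSpan.FromMilliseconds(50), TimeSpan.FromSeconds(5), ct),
            FakeInvokeAsync("entities", TimeSpan.FromSeconds(30), TimeSpan.FromMilliseconds(100), ct));
}
```

Because each call links its own `CancellationTokenSource` to the shared token, cancelling the shared token still stops everything, while a per-call timeout stays local to that call.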

2.3 Dependency-Aware DAG Orchestration

Real-world workflows are rarely fully parallel. Some steps depend on the output of prior steps, creating a directed acyclic graph (DAG) of dependencies. The architectural pattern for this is staged parallel execution — group independent operations into waves, execute each wave in parallel, and feed outputs forward to dependent stages:

public class DagOrchestrator
{
    private readonly Kernel _kernel;
    /// <summary>
    /// Contract Review Workflow DAG:
    ///
    /// Stage 1 (parallel): ExtractParties, ExtractDates, ExtractObligations
    /// ↓
    /// Stage 2 (parallel, depends on Stage 1): 
    /// ValidateParties(parties), CheckDeadlines(dates), PrioritizeObligations(obligations)
    /// ↓
    /// Stage 3 (sequential, depends on Stage 2): GenerateExecutiveSummary(all Stage 2 outputs)
    /// </summary>
    public async Task<ContractReviewResult> ReviewContractAsync(
        string contractText,
        CancellationToken ct = default)
    {
        // ── Stage 1: Independent extraction (parallel) ──────────────────────────
        var stage1Args = new KernelArguments { ["contract_text"] = contractText };
        var (parties, dates, obligations) = await ExecuteStageAsync(
            ct,
            ("LegalPlugin", "ExtractParties", stage1Args),
            ("LegalPlugin", "ExtractDates", stage1Args),
            ("LegalPlugin", "ExtractObligations", stage1Args));
        // ── Stage 2: Validation (parallel, consumes Stage 1) ────────────────────
        var (validatedParties, deadlineAnalysis, prioritizedObligations) = 
            await ExecuteStageAsync(
                ct,
                ("LegalPlugin", "ValidateParties", 
                    BuildArgs(stage1Args, ("parties_json", parties))),
                ("LegalPlugin", "CheckDeadlines", 
                    BuildArgs(stage1Args, ("dates_json", dates))),
                ("LegalPlugin", "PrioritizeObligations", 
                    BuildArgs(stage1Args, ("obligations_json", obligations))));
        // ── Stage 3: Synthesis (sequential, consumes all prior stages) ───────────
        var summaryArgs = new KernelArguments
        {
            ["contract_text"] = contractText,
            ["validated_parties"] = validatedParties,
            ["deadline_analysis"] = deadlineAnalysis,
            ["prioritized_obligations"] = prioritizedObligations
        };
        var executiveSummary = await _kernel.InvokeAsync(
            "LegalPlugin", "GenerateExecutiveSummary", summaryArgs, ct);
        return new ContractReviewResult
        {
            Parties = DeserializeParties(validatedParties),
            DeadlineAnalysis = DeserializeDeadlines(deadlineAnalysis),
            Obligations = DeserializeObligations(prioritizedObligations),
            ExecutiveSummary = executiveSummary.GetValue<string>()!
        };
    }
    private async Task<(string, string, string)> ExecuteStageAsync(
        CancellationToken ct,
        params (string Plugin, string Function, KernelArguments Args)[] invocations)
    {
        // Await inside an async lambda rather than ContinueWith(OnlyOnRanToCompletion):
        // a faulted invocation then surfaces its original exception instead of a
        // TaskCanceledException from the skipped continuation
        var tasks = invocations.Select(async inv =>
        {
            var result = await _kernel.InvokeAsync(inv.Plugin, inv.Function, inv.Args, ct);
            return result.GetValue<string>()!;
        });
        var results = await Task.WhenAll(tasks);
        return (results[0], results[1], results[2]);
    }
}
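
The stages in the orchestrator above are written out by hand. When the dependency graph is data rather than code, the waves can be derived from it. This is a rough sketch under that assumption: `DagStages.ComputeStages` is a hypothetical helper, not part of Semantic Kernel, and it only computes the grouping (it does not invoke anything).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DagStages
{
    // Groups steps into waves: a step joins the first wave in which all of its
    // dependencies have already completed (topological layering).
    public static List<List<string>> ComputeStages(
        IReadOnlyDictionary<string, string[]> dependsOn)
    {
        var done = new HashSet<string>();
        var stages = new List<List<string>>();
        while (done.Count < dependsOn.Count)
        {
            var ready = dependsOn
                .Where(kv => !done.Contains(kv.Key) && kv.Value.All(done.Contains))
                .Select(kv => kv.Key)
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList();
            if (ready.Count == 0)
                throw new InvalidOperationException("Cycle or unknown dependency detected.");
            stages.Add(ready);
            done.UnionWith(ready);
        }
        return stages;
    }
}
```

Fed the contract review graph, this yields three waves: the three extractions, the three validations, and finally the executive summary. Each wave can then be passed to a `Task.WhenAll`-based executor.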

2.4 Streaming with Backpressure

For user-facing operations where progressive disclosure is preferable to waiting for full completion, GetStreamingChatMessageContentsAsync paired with server-sent events or SignalR delivers tokens to the UI as they arrive. The critical implementation detail is proper backpressure — don't buffer unboundedly if downstream consumers are slower than the LLM's generation rate:

public class StreamingOrchestrator
{
    private readonly Kernel _kernel;
    public async IAsyncEnumerable<string> StreamAnalysisAsync(
        string documentContent,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var arguments = new KernelArguments
        {
            ["content"] = documentContent
        };
        var executionSettings = new OpenAIPromptExecutionSettings
        {
            MaxTokens = 1024,
            Temperature = 0.3,
            // Stream-specific: stop generation when we have enough for the UI
            StopSequences = ["[END_SUMMARY]"]
        };
        arguments.ExecutionSettings = new Dictionary<string, PromptExecutionSettings>
        {
            [PromptExecutionSettings.DefaultServiceId] = executionSettings
        };
        var tokenBuffer = new StringBuilder();
        var tokenCount = 0;
        await foreach (var chunk in _kernel
            .InvokeStreamingAsync<StreamingChatMessageContent>(
                "AnalysisPlugin", "StreamSummary", arguments, ct))
        {
            if (chunk.Content is null) continue;
            tokenBuffer.Append(chunk.Content);
            tokenCount++;
            // Yield complete words/sentences rather than individual tokens
            // to reduce UI flicker and downstream rendering load
            if (chunk.Content.Contains(' ') || chunk.Content.Contains('\n'))
            {
                yield return tokenBuffer.ToString();
                tokenBuffer.Clear();
            }
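            // Early termination: once enough chunks have arrived for the use case,
            // stop iterating so generation is cancelled and unused completion tokens aren't billed.
            // tokenCount counts streamed chunks, which only approximates the token count
            if (tokenCount >= 800)
            {
                if (tokenBuffer.Length > 0)
                    yield return tokenBuffer.ToString();
                yield break;
            }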
        }
        // Flush any remaining buffered content
        if (tokenBuffer.Length > 0)
            yield return tokenBuffer.ToString();
    }
}
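
An `IAsyncEnumerable` is pull-based, so a slow consumer of the method above already slows the loop down. Backpressure becomes an explicit concern once a producer and consumer run independently, for example when a background task pushes chunks toward a SignalR connection. One way to bound that hand-off is a bounded channel from `System.Threading.Channels`. The sketch below assumes a hypothetical `BoundedStreamRelay` helper and is not taken from the article's codebase.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class BoundedStreamRelay
{
    // Relays chunks through a bounded channel. When the buffer is full,
    // WriteAsync waits, so a slow consumer throttles the producer.
    public static async IAsyncEnumerable<string> RelayAsync(
        IAsyncEnumerable<string> source,
        int capacity,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        var token = cts.Token;
        var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true,
            SingleWriter = true
        });

        var producer = Task.Run(async () =>
        {
            try
            {
                await foreach (var chunk in source.WithCancellation(token))
                    await channel.Writer.WriteAsync(chunk, token);
                channel.Writer.TryComplete();
            }
            catch (Exception ex)
            {
                channel.Writer.TryComplete(ex); // surface producer errors to the reader
            }
        });

        try
        {
            await foreach (var chunk in channel.Reader.ReadAllAsync(token))
                yield return chunk;
        }
        finally
        {
            cts.Cancel();   // stop the producer if the consumer left early
            await producer; // the producer routes its own exceptions into the channel
        }
    }
}
```

The capacity is the knob: a small value keeps memory flat at the cost of stalling generation sooner, a larger one smooths out bursts.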

III. Redis Semantic Cache Implementation

3.1 Why String Equality Is the Wrong Cache Key

Naive caching uses exact string matching: two requests are equivalent if their prompt strings are byte-for-byte identical. This produces near-zero cache hit rates in practice because natural language users ask the same question in slightly different ways:

“Summarize this contract”
“Give me a summary of this contract”
“What does this contract say, briefly?”

These are semantically equivalent — they should return the same cached response. String equality misses all three matches.

Semantic caching uses vector embeddings to measure intent similarity. Two prompts are considered equivalent if their embedding vectors are within a configurable cosine similarity threshold. This is what drives the 30–60% cache hit rates referenced in Part 1.
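
To make the threshold test concrete, here is a small sketch of the comparison itself. `CosineSimilarityDemo` is a hypothetical helper operating on raw float vectors; in the actual cache the embeddings come from the embedding service and the comparison is delegated to Redis.

```csharp
using System;

public static class CosineSimilarityDemo
{
    // cos(a, b) = (a · b) / (|a| * |b|); 1 = same direction, 0 = orthogonal.
    public static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");
        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // zero vector: treat as no similarity
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Two prompts count as equivalent when their similarity reaches the threshold.
    public static bool IsCacheHit(ReadOnlySpan<float> a, ReadOnlySpan<float> b, double threshold) =>
        CosineSimilarity(a, b) >= threshold;
}
```

Vectors that point in nearly the same direction pass a 0.92 threshold even though the prompt strings behind them differ, which is exactly what string equality cannot capture.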

3.2 Redis Stack Setup and Index Configuration

Redis Stack (whose RediSearch module is also available in Azure Cache for Redis Enterprise) provides FT.SEARCH with vector similarity search. The index configuration determines both search performance and accuracy:

public class RedisSemanticCacheService : ISemanticCacheService, IAsyncDisposable
{
    private readonly IConnectionMultiplexer _redis;
    private readonly IDatabase _db;
    private readonly ITextEmbeddingGenerationService _embeddingService;
    private readonly SemanticCacheConfiguration _config;
    private readonly ILogger<RedisSemanticCacheService> _logger;
    private const string IndexName = "semantic-cache-idx";
    private const string KeyPrefix = "sk-cache:";
    public RedisSemanticCacheService(
        IConnectionMultiplexer redis,
        ITextEmbeddingGenerationService embeddingService,
        SemanticCacheConfiguration config,
        ILogger<RedisSemanticCacheService> logger)
    {
        _redis = redis;
        _db = redis.GetDatabase();
        _embeddingService = embeddingService;
        _config = config;
        _logger = logger;
    }
    public async Task InitializeIndexAsync()
    {
        var server = _redis.GetServer(_redis.GetEndPoints().First());
        try
        {
            // Check if index already exists
            await server.ExecuteAsync("FT.INFO", IndexName);
            _logger.LogInformation("Semantic cache index {IndexName} already exists", IndexName);
        }
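        // Error text casing differs between RediSearch versions, so match case-insensitively
        catch (RedisServerException ex) when (
            ex.Message.Contains("Unknown index name", StringComparison.OrdinalIgnoreCase))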
        {
            // Create the vector index
            // HNSW (Hierarchical Navigable Small World) is the recommended algorithm
            // for production: better query performance than FLAT at scale
            await server.ExecuteAsync(
                "FT.CREATE", IndexName,
                "ON", "HASH",
                "PREFIX", "1", KeyPrefix,
                "SCHEMA",
                    "plugin_name", "TAG",
                    "function_name", "TAG",
                    "tenant_id", "TAG",
                    "embedding", "VECTOR", "HNSW", "10",
                        "TYPE", "FLOAT32",
                        "DIM", "1536", // text-embedding-3-small dimension
                        "DISTANCE_METRIC", "COSINE",
                        "INITIAL_CAP", "10000",
                        "EF_CONSTRUCTION", "200", // Higher = better quality, slower build
                    "response_text", "TEXT",
                    "created_at", "NUMERIC",
                    "hit_count", "NUMERIC");
            _logger.LogInformation(
                "Created semantic cache index {IndexName} with HNSW vector configuration",
                IndexName);
        }
    }
    public async Task<SemanticCacheEntry?> GetAsync(
        string pluginName,
        string functionName,
        string tenantId,
        string promptText)
    {
        var embedding = await _embeddingService.GenerateEmbeddingAsync(promptText);
        var embeddingBytes = EmbeddingToBytes(embedding);
        // KNN vector search with metadata filter
        // The TAG filters narrow the search space before vector comparison
        var query = $"(@plugin_name:{{{EscapeTag(pluginName)}}} " +
                    $"@function_name:{{{EscapeTag(functionName)}}} " +
                    $"@tenant_id:{{{EscapeTag(tenantId)}}})=>" +
                    $"[KNN {_config.MaxCandidates} @embedding $vec AS score]";
        var searchResults = await _db.ExecuteAsync(
            "FT.SEARCH", IndexName, query,
            "PARAMS", "2", "vec", embeddingBytes,
            "RETURN", "3", "response_text", "score", "hit_count",
            "SORTBY", "score",
            "DIALECT", "2");
        var results = ParseSearchResults(searchResults);
        if (!results.Any())
        {
            _logger.LogDebug(
                "Cache miss for {Plugin}.{Function} (no candidates found)",
                pluginName, functionName);
            return null;
        }
        var best = results.First();
        // Redis reports cosine distance: 0 = identical, 1 = orthogonal, 2 = opposite
        // Convert to similarity (1 - distance) and compare against the per-function threshold
        var similarity = 1.0 - best.Score;
        var threshold = _config.GetThreshold(functionName);
        if (similarity < threshold)
        {
            _logger.LogDebug(
                "Cache miss for {Plugin}.{Function}: best similarity {Similarity:F3} " +
                "below threshold {Threshold:F3}",
                pluginName, functionName, similarity, threshold);
            return null;
        }
        // Increment hit counter asynchronously; fire-and-forget is acceptable here
        _ = IncrementHitCountAsync(best.Key);
        _logger.LogInformation(
            "Cache hit for {Plugin}.{Function}: similarity={Similarity:F3}, " +
            "hit_count={HitCount}",
            pluginName, functionName, similarity, best.HitCount + 1);
        return new SemanticCacheEntry
        {
            ResponseText = best.ResponseText,
            Similarity = similarity,
            CacheKey = best.Key
        };
    }
    public async Task SetAsync(
        string pluginName,
        string functionName,
        string tenantId,
        string promptText,
        string responseText,
        TimeSpan? ttl = null)
    {
        var embedding = await _embeddingService.GenerateEmbeddingAsync(promptText);
        var embeddingBytes = EmbeddingToBytes(embedding);
        var cacheKey = $"{KeyPrefix}{Guid.NewGuid():N}";
        var effectiveTtl = ttl ?? _config.DefaultTtl;
        var hashFields = new HashEntry[]
        {
            new("plugin_name", pluginName),
            new("function_name", functionName),
            new("tenant_id", tenantId),
            new("prompt_text", promptText), // Store for debugging/audit
            new("response_text", responseText),
            new("embedding", embeddingBytes),
            new("created_at", DateTimeOffset.UtcNow.ToUnixTimeSeconds()),
            new("hit_count", 0)
        };
        var tx = _db.CreateTransaction();
        _ = tx.HashSetAsync(cacheKey, hashFields);
        _ = tx.KeyExpireAsync(cacheKey, effectiveTtl);
        await tx.ExecuteAsync();
        _logger.LogDebug(
            "Cached response for {Plugin}.{Function} with TTL={TTL}",
            pluginName, functionName, effectiveTtl);
    }
    private static byte[] EmbeddingToBytes(ReadOnlyMemory<float> embedding)
    {
        var span = embedding.Span;
        var bytes = new byte[span.Length * sizeof(float)];
        MemoryMarshal.AsBytes(span).CopyTo(bytes);
        return bytes;
    }
    private static string EscapeTag(string value) =>
        value.Replace("-", "\\-").Replace(".", "\\.");
    public async ValueTask DisposeAsync() =>
        await _redis.CloseAsync();
}

3.3 Cache Configuration Tuning

The similarity threshold is the most consequential configuration value in the semantic cache. Set it too low and you return cached responses for queries that are semantically distinct, producing incorrect or irrelevant answers. Set it too high and your hit rate collapses toward zero.

The recommended tuning process is empirical: deploy with threshold 0.92 as a starting point, instrument cache hit rates and user feedback signals, and adjust based on observed quality vs. hit rate trade-off. Different plugin functions warrant different thresholds—summarization is more tolerant of semantic variation than precise data extraction:

public class SemanticCacheConfiguration
{
    public TimeSpan DefaultTtl { get; set; } = TimeSpan.FromHours(4);
    public int MaxCandidates { get; set; } = 3; // KNN k value
    // Per-function threshold overrides
    public Dictionary<string, double> FunctionThresholds { get; set; } = new()
    {
        // High tolerance: question phrasing variations map to same answer
        ["Summarize"] = 0.88,
        ["AnalyzeSentiment"] = 0.90,

        // Low tolerance: subtle wording changes carry semantic weight
        ["ExtractObligations"] = 0.95,
        ["ExtractEntities"] = 0.94,
        ["ClassifyIntent"] = 0.93
    };
    public double DefaultThreshold { get; set; } = 0.92;
    public double GetThreshold(string functionName) =>
        FunctionThresholds.TryGetValue(functionName, out var threshold)
            ? threshold
            : DefaultThreshold;
}

3.4 Cache Hit Rate Instrumentation

Without measuring the cache, you cannot optimize it. Wire the cache service into your OpenTelemetry metrics pipeline:

// Metrics registered at startup.
// System.Diagnostics.Metrics creates instruments from a Meter instance. The meter name
// below is an assumed placeholder; it must match the name passed to AddMeter(...) in the
// OpenTelemetry setup
private static readonly Meter CacheMeter = new("SemanticKernel.SemanticCache");
private static readonly Counter<long> CacheHits = 
    CacheMeter.CreateCounter<long>("sk_cache_hits_total",
        description: "Total semantic cache hits");
private static readonly Counter<long> CacheMisses = 
    CacheMeter.CreateCounter<long>("sk_cache_misses_total",
        description: "Total semantic cache misses");
private static readonly Histogram<double> CacheSimilarityScore = 
    CacheMeter.CreateHistogram<double>("sk_cache_similarity_score",
        description: "Cosine similarity score of cache hits");
private static readonly Counter<double> CacheTokensSaved = 
    CacheMeter.CreateCounter<double>("sk_cache_tokens_saved_total",
        description: "Estimated tokens saved by cache hits");

The sk_cache_hits_total / (sk_cache_hits_total + sk_cache_misses_total) ratio is your cache hit rate metric—the primary FinOps KPI for the caching layer. Target 35%+ in steady state for workloads with repetitive query patterns.
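
For local verification, the same ratio can be computed in-process with a `MeterListener` from `System.Diagnostics.Metrics`, before any exporter is attached. The sketch below is illustrative only: `CacheMetricsProbe` is a hypothetical class, and the meter name passed to it is whatever name the cache counters were created under.

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

public sealed class CacheMetricsProbe : IDisposable
{
    private readonly MeterListener _listener = new();
    private long _hits;
    private long _misses;

    public CacheMetricsProbe(string meterName)
    {
        // Subscribe only to instruments published by the named meter.
        _listener.InstrumentPublished = (instrument, listener) =>
        {
            if (instrument.Meter.Name == meterName)
                listener.EnableMeasurementEvents(instrument);
        };
        _listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            if (instrument.Name == "sk_cache_hits_total")
                Interlocked.Add(ref _hits, value);
            else if (instrument.Name == "sk_cache_misses_total")
                Interlocked.Add(ref _misses, value);
        });
        _listener.Start();
    }

    // hits / (hits + misses), or 0 when nothing has been recorded yet.
    public double HitRate
    {
        get
        {
            var hits = Interlocked.Read(ref _hits);
            var total = hits + Interlocked.Read(ref _misses);
            return total == 0 ? 0 : (double)hits / total;
        }
    }

    public void Dispose() => _listener.Dispose();
}
```

In production the ratio is better computed in the metrics backend from the two exported counters, so it can be sliced by plugin, function, or tenant.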

IV. The Complete Production Filter Pipeline

4.1 Filter Registration and Ordering

Semantic Kernel filters execute in registration order for pre-invocation logic and in reverse order for post-invocation logic — identical to ASP.NET Core middleware ordering semantics. The order matters for correctness:

// Kernel configuration with ordered filter pipeline
builder.Services.AddSingleton<Kernel>(sp =>
{
    var kernelBuilder = Kernel.CreateBuilder();
    kernelBuilder.AddAzureOpenAIChatCompletion(
        deploymentName: config["AzureOpenAI:DeploymentName"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!,
        serviceId: "primary");
    kernelBuilder.AddAzureOpenAIChatCompletion(
        deploymentName: config["AzureOpenAI:FallbackDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!,
        serviceId: "fallback");
    // Filter order (pre-invocation): 1 → 2 → 3 → 4 → 5
    // Filter order (post-invocation): 5 → 4 → 3 → 2 → 1
    kernelBuilder.Services
        .AddSingleton<IFunctionInvocationFilter, TenantRateLimitFilter>() // 1: Gate first
        .AddSingleton<IFunctionInvocationFilter, SemanticCacheFilter>() // 2: Check cache
        .AddSingleton<IFunctionInvocationFilter, PromptSanitizationFilter>() // 3: Sanitize input
        .AddSingleton<IFunctionInvocationFilter, ObservabilityFilter>() // 4: Measure
        .AddSingleton<IFunctionInvocationFilter, OutputValidationFilter>() // 5: Validate output
        .AddSingleton<IPromptRenderFilter, AuditPromptFilter>() // Audit resolved prompts
        .AddSingleton<IAutoFunctionInvocationFilter, PlannerBoundaryFilter>(); // Bound planner
    return kernelBuilder.Build();
});
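
The ordering rule is easier to see in a stripped-down model. The sketch below is not Semantic Kernel's implementation; `FilterOrderDemo` is a hypothetical toy pipeline in which each filter logs before and after calling `next`, composed so that the first registered filter is the outermost wrapper.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class FilterOrderDemo
{
    public delegate Task Filter(List<string> log, Func<Task> next);

    // A filter that records when its pre- and post-invocation logic runs.
    public static Filter Named(string name) => async (log, next) =>
    {
        log.Add($"pre:{name}");
        await next();
        log.Add($"post:{name}");
    };

    // Composes filters so the first registered filter wraps all the others.
    public static async Task<List<string>> RunAsync(params Filter[] filters)
    {
        var log = new List<string>();
        Func<Task> pipeline = () =>
        {
            log.Add("invoke"); // stands in for the actual function call
            return Task.CompletedTask;
        };
        for (var i = filters.Length - 1; i >= 0; i--)
        {
            var filter = filters[i];
            var next = pipeline;
            pipeline = () => filter(log, next);
        }
        await pipeline();
        return log;
    }
}
```

Running it with filters 1, 2 and 3 logs the pre steps in order 1, 2, 3 and the post steps in order 3, 2, 1. A filter that returns without calling `next`, as a cache hit would, also skips every filter registered after it.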

4.2 Tenant Rate Limit Filter

The rate limiting filter is the outermost gate — it rejects requests before any LLM work begins, protecting both cost budgets and downstream service capacity. Implementation uses a sliding window algorithm backed by Redis atomic operations for correctness under concurrent load:

public class TenantRateLimitFilter : IFunctionInvocationFilter
{
    private readonly ITokenBudgetService _budgetService;
    private readonly IRateLimiterService _rateLimiter;
    private readonly IHttpContextAccessor _httpContext;
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        var tenantId = ResolveTenantId();
        // Check token budget before any invocation
        var budgetCheck = await _budgetService.CheckAsync(tenantId);
        if (!budgetCheck.HasBudget)
        {
            throw new TokenBudgetExceededException(
                $"Tenant {tenantId} has exhausted daily token budget of " +
                $"${budgetCheck.DailyLimitUsd:F2}. " +
                $"Budget resets at {budgetCheck.ResetTime:HH:mm} UTC.");
        }
        // Check request rate limit (RPM - requests per minute)
        var rateCheck = await _rateLimiter.CheckRateLimitAsync(
            key: $"rpm:{tenantId}",
            limit: _config.GetRpmLimit(tenantId),
            window: TimeSpan.FromMinutes(1));
        if (!rateCheck.IsAllowed)
        {
            throw new RateLimitExceededException(
                $"Rate limit exceeded for tenant {tenantId}. " +
                $"Limit: {rateCheck.Limit} RPM. " +
                $"Retry after: {rateCheck.RetryAfter.TotalSeconds:F0}s")
            {
                RetryAfter = rateCheck.RetryAfter
            };
        }
        await next(context);
    }
    private string ResolveTenantId() =>
        _httpContext.HttpContext?.User
            .FindFirst("tenant_id")?.Value
        ?? throw new InvalidOperationException(
            "Tenant context not available - ensure authentication middleware runs before kernel invocation.");
}
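
The `IRateLimiterService` implementation is not shown here. As an illustration of the sliding window idea only, the following in-memory sketch keeps a timestamp log per key. `SlidingWindowLimiter` is a hypothetical class; a Redis-backed version would typically hold the same log in a sorted set and run the trim-count-add sequence atomically, for example in a Lua script.

```csharp
using System;
using System.Collections.Generic;

public sealed class SlidingWindowLimiter
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private readonly Dictionary<string, Queue<DateTimeOffset>> _hits = new();
    private readonly object _gate = new();

    public SlidingWindowLimiter(int limit, TimeSpan window)
    {
        _limit = limit;
        _window = window;
    }

    // Returns true and records the request if fewer than `limit` requests
    // were seen for `key` in the window ending at `now`.
    public bool TryAcquire(string key, DateTimeOffset now)
    {
        lock (_gate)
        {
            if (!_hits.TryGetValue(key, out var queue))
                _hits[key] = queue = new Queue<DateTimeOffset>();

            // Drop timestamps that have slid out of the window.
            while (queue.Count > 0 && now - queue.Peek() >= _window)
                queue.Dequeue();

            if (queue.Count >= _limit) return false;
            queue.Enqueue(now);
            return true;
        }
    }
}
```

Passing `now` in explicitly keeps the limiter deterministic under test. Unlike a fixed window, the limit here applies to any 60-second span, so a burst straddling a minute boundary cannot double the effective rate.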

4.3 Prompt Sanitization Filter

The prompt sanitization filter inspects the fully rendered prompt before it is submitted to the LLM, applying heuristic and model-based checks for prompt injection attempts. It implements the IPromptRenderFilter interface and awaits next(context) first, so that context.RenderedPrompt holds the resolved prompt text when the checks run. In this pipeline the checks live in AuditPromptFilter, alongside prompt auditing:

public class AuditPromptFilter : IPromptRenderFilter
{
    private readonly IPromptAuditStore _auditStore;
    private readonly IPromptInjectionDetector _injectionDetector;
    private readonly ILogger<AuditPromptFilter> _logger;
    public async Task OnPromptRenderAsync(
        PromptRenderContext context,
        Func<PromptRenderContext, Task> next)
    {
        await next(context); // Let prompt render complete
        var renderedPrompt = context.RenderedPrompt;
        if (renderedPrompt is null) return;
        // Injection detection: heuristic patterns first (cheap), LLM-based second (expensive)
        var injectionResult = await _injectionDetector.DetectAsync(renderedPrompt);
        if (injectionResult.IsInjectionDetected)
        {
            _logger.LogWarning(
                "Prompt injection detected for function {PluginName}.{FunctionName}. " +
                "Confidence: {Confidence:F2}. Pattern: {Pattern}",
                context.Function.PluginName,
                context.Function.Name,
                injectionResult.Confidence,
                injectionResult.DetectedPattern);
            // Record security event
            await _auditStore.RecordSecurityEventAsync(new PromptSecurityEvent
            {
                Timestamp = DateTimeOffset.UtcNow,
                TenantId = ResolveTenantId(),
                PluginName = context.Function.PluginName,
                FunctionName = context.Function.Name,
                DetectedPattern = injectionResult.DetectedPattern,
                Confidence = injectionResult.Confidence,
                // Never log the full prompt in the audit store; it may contain PII
                PromptHash = ComputeHash(renderedPrompt)
            });
            if (injectionResult.Confidence >= _config.BlockThreshold)
            {
                throw new PromptInjectionException(
                    $"Prompt injection blocked with confidence {injectionResult.Confidence:F2}.");
            }
        }
        // Audit all prompts in regulated deployments
        if (_config.AuditAllPrompts)
        {
            await _auditStore.RecordPromptAsync(new PromptAuditRecord
            {
                Timestamp = DateTimeOffset.UtcNow,
                PluginName = context.Function.PluginName,
                FunctionName = context.Function.Name,
                PromptHash = ComputeHash(renderedPrompt),
                TokenEstimate = EstimateTokenCount(renderedPrompt)
            });
        }
    }
}

4.4 Output Validation Filter

The output validation filter is the post-invocation gate for semantic correctness. Where infrastructure filters handle binary pass/fail scenarios, output validation handles the probabilistic quality spectrum of LLM responses:

public class OutputValidationFilter : IFunctionInvocationFilter
{
    private readonly IOutputValidatorRegistry _validatorRegistry;
    private readonly ILogger<OutputValidationFilter> _logger;
    private readonly OutputValidationConfiguration _config;
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // FunctionInvocationContext exposes no attempt counter, so retries are tracked
        // locally and the function is re-invoked by calling next(context) again
        for (var attempt = 0; ; attempt++)
        {
            await next(context);
            var retry = await ValidateAsync(context, attempt);
            if (!retry) return;
        }
    }
    // Returns true when the function should be re-invoked
    private async Task<bool> ValidateAsync(FunctionInvocationContext context, int attempt)
    {
        var output = context.Result.GetValue<string>();
        if (output is null) return false;
        // Look up validators registered for this specific function
        var validators = _validatorRegistry.GetValidators(
            context.Function.PluginName,
            context.Function.Name);
        if (!validators.Any()) return false;
        var validationContext = new OutputValidationContext
        {
            PluginName = context.Function.PluginName,
            FunctionName = context.Function.Name,
            Output = output,
            InputArguments = context.Arguments
        };
        foreach (var validator in validators)
        {
            var result = await validator.ValidateAsync(validationContext);
            if (result.IsValid) continue;
            _logger.LogWarning(
                "Output validation failed for {Plugin}.{Function}: {Reason}. " +
                "Validator: {ValidatorType}",
                context.Function.PluginName,
                context.Function.Name,
                result.FailureReason,
                validator.GetType().Name);
            // Determine escalation strategy based on validator severity
            switch (result.Severity)
            {
                case ValidationSeverity.Critical:
                    // Block response entirely; this output cannot be returned
                    throw new OutputValidationException(
                        $"Critical output validation failure: {result.FailureReason}");
                case ValidationSeverity.High:
                    // Retry while attempts remain, then degrade to a warning
                    if (attempt < _config.MaxRetries)
                    {
                        _logger.LogInformation(
                            "Retrying {Plugin}.{Function} due to high-severity " +
                            "validation failure (attempt {Attempt}/{Max})",
                            context.Function.PluginName,
                            context.Function.Name,
                            attempt + 1,
                            _config.MaxRetries);
                        return true;
                    }
                    goto case ValidationSeverity.Medium;
                case ValidationSeverity.Medium:
                    // Return with quality degradation signal in metadata.
                    // Result metadata is read-only, so copy it into a new FunctionResult
                    var metadata = new Dictionary<string, object?>(
                        context.Result.Metadata ?? new Dictionary<string, object?>())
                    {
                        ["validation_warning"] = result.FailureReason,
                        ["validation_severity"] = result.Severity.ToString()
                    };
                    context.Result = new FunctionResult(
                        context.Function, output, metadata: metadata);
                    break;
                case ValidationSeverity.Low:
                    // Log only; pass through
                    break;
            }
        }
        return false;
    }
}

4.5 Planner Boundary Filter

The planner boundary filter is the safety mechanism for auto-invoke scenarios. It enforces hard limits on plan execution — maximum steps, allowed plugin set, and cumulative token budget — preventing unbounded planning loops:

public class PlannerBoundaryFilter : IAutoFunctionInvocationFilter
{
    private readonly PlannerBoundaryConfiguration _config;
    private readonly ILogger<PlannerBoundaryFilter> _logger;
    public async Task OnAutoFunctionInvocationAsync(
        AutoFunctionInvocationContext context,
        Func<AutoFunctionInvocationContext, Task> next)
    {
        // Enforce maximum execution steps
        if (context.RequestSequenceIndex >= _config.MaxPlannerSteps)
        {
            _logger.LogWarning(
                "Planner exceeded maximum step count of {MaxSteps} for goal: {Goal}. " +
                "Terminating plan execution.",
                _config.MaxPlannerSteps,
                context.ChatHistory.LastOrDefault()?.Content?.Truncate(100));
            context.Terminate = true;
            return;
        }
        // Enforce plugin allowlist
        var requestedPlugin = context.Function.PluginName;
        if (!_config.AllowedPlugins.Contains(requestedPlugin))
        {
            _logger.LogWarning(
                "Planner attempted to invoke unauthorized plugin {PluginName}. " +
                "Blocked by boundary filter.",
                requestedPlugin);
            // Don't throw; instead provide a synthetic result explaining the restriction
            context.Result = new FunctionResult(
                context.Function,
                $"Plugin '{requestedPlugin}' is not authorized for autonomous invocation.");
            return;
        }
        await next(context);
        // Post-step: check cumulative token consumption
        var stepUsage = context.Result.Metadata?
            .GetValueOrDefault("Usage") as CompletionsUsage;
        if (stepUsage is not null)
        {
            var cumulativeTokens = TrackCumulativeTokens(context, stepUsage);
            if (cumulativeTokens > _config.MaxPlannerTokenBudget)
            {
                _logger.LogWarning(
                    "Planner exceeded token budget of {Budget} tokens " +
                    "(consumed: {Consumed}). Terminating plan.",
                    _config.MaxPlannerTokenBudget,
                    cumulativeTokens);
                context.Terminate = true;
            }
        }
    }
}

V. End-to-End Wiring: The Complete Kernel Bootstrap

With all components defined, the production bootstrap assembles them into a coherent, observable, resilient system:

// Program.cs — Production Semantic Kernel Bootstrap
var builder = Host.CreateApplicationBuilder(args);
var config = builder.Configuration; // Captured by the Kernel factory below
// ── Infrastructure ──────────────────────────────────────────────────────────
builder.Services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = builder.Configuration["Redis:ConnectionString"];
    options.InstanceName = "sk-prod:";
});
builder.Services.AddSingleton<IConnectionMultiplexer>(sp =>
    ConnectionMultiplexer.Connect(
        builder.Configuration["Redis:ConnectionString"]!));
// ── Semantic Cache ──────────────────────────────────────────────────────────
builder.Services.AddSingleton<SemanticCacheConfiguration>(sp =>
    builder.Configuration.GetSection("SemanticCache").Get<SemanticCacheConfiguration>()!);
builder.Services.AddSingleton<ISemanticCacheService, RedisSemanticCacheService>();
builder.Services.AddHostedService<SemanticCacheIndexInitializer>(); // Ensures index on startup
// ── Resilience ──────────────────────────────────────────────────────────────
builder.Services
    .AddHttpClient("AzureOpenAI-Primary")
    .AddResilienceHandler("llm-primary", ConfigureLlmResiliencePipeline(isPrimary: true));
builder.Services
    .AddHttpClient("AzureOpenAI-Fallback")
    .AddResilienceHandler("llm-fallback", ConfigureLlmResiliencePipeline(isPrimary: false));
// ── Kernel ──────────────────────────────────────────────────────────────────
builder.Services.AddSingleton<Kernel>(sp =>
{
    var kernelBuilder = Kernel.CreateBuilder();
    // Models
    kernelBuilder.AddAzureOpenAIChatCompletion(
        deploymentName: config["AzureOpenAI:GPT4oDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!,
        serviceId: "primary",
        httpClient: sp.GetRequiredService<IHttpClientFactory>()
            .CreateClient("AzureOpenAI-Primary"));
    kernelBuilder.AddAzureOpenAIChatCompletion(
        deploymentName: config["AzureOpenAI:GPT4oMiniDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!,
        serviceId: "fallback",
        httpClient: sp.GetRequiredService<IHttpClientFactory>()
            .CreateClient("AzureOpenAI-Fallback"));
    kernelBuilder.AddAzureOpenAITextEmbeddingGeneration(
        deploymentName: config["AzureOpenAI:EmbeddingDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!);
    // Plugins
    kernelBuilder.Plugins.AddFromType<DocumentAnalysisPlugin>("DocumentPlugin");
    kernelBuilder.Plugins.AddFromType<LegalAnalysisPlugin>("LegalPlugin");
    kernelBuilder.Plugins.AddFromType<NotificationPlugin>("NotificationPlugin");
    // Filter Pipeline (order is significant)
    kernelBuilder.Services
        .AddSingleton<IFunctionInvocationFilter>(
            sp.GetRequiredService<TenantRateLimitFilter>())
        .AddSingleton<IFunctionInvocationFilter>(
            sp.GetRequiredService<SemanticCacheFilter>())
        .AddSingleton<IFunctionInvocationFilter>(
            sp.GetRequiredService<ObservabilityFilter>())
        .AddSingleton<IFunctionInvocationFilter>(
            sp.GetRequiredService<OutputValidationFilter>())
        .AddSingleton<IPromptRenderFilter>(
            sp.GetRequiredService<AuditPromptFilter>())
        .AddSingleton<IAutoFunctionInvocationFilter>(
            sp.GetRequiredService<PlannerBoundaryFilter>());
    return kernelBuilder.Build();
});
// ── OpenTelemetry ───────────────────────────────────────────────────────────
builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("SemanticKernelService", serviceVersion: "2.0.0"))
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSource("Microsoft.SemanticKernel*")
        .AddSource("SemanticKernel.Custom*")
        .AddOtlpExporter(o => o.Endpoint = 
            new Uri(config["OpenTelemetry:Endpoint"]!)))
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddMeter("Microsoft.SemanticKernel*")
        .AddMeter("SemanticKernel.Custom")
        .AddOtlpExporter(o => o.Endpoint = 
            new Uri(config["OpenTelemetry:Endpoint"]!)));
await builder.Build().RunAsync();

VI. Operational Runbook: What to Watch in Production

6.1 The Five Metrics That Matter Most

Once the system is running, these five metrics define your production health posture:

| Metric | Alert Threshold | Action on Breach |
|-------------------------------------|------------------------|----------------------------------------|
| sk_cache_hit_rate (7d avg) | < 25% | Review query diversity, adjust TTL |
| sk_function_duration_p99 | > 12s | Check model latency, circuit breakers |
| sk_estimated_cost_usd_total (daily) | > 110% of daily budget | Activate model tiering, alert FinOps |
| sk_circuit_breaker_opened_total | Any increment | Page on-call, activate fallback |
| sk_output_validation_failures_total | > 2% of invocations | Review prompt quality, check for drift |
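
As a rough illustration of how the thresholds in the table could be checked in one place, here is a minimal Python sketch. The thresholds and actions are copied from the table; everything else is an assumption made for the example: the `AlertRule` type, the `evaluate` helper, and the two derived sample keys (`cost_vs_daily_budget_ratio` and `output_validation_failure_ratio`), which stand in for values you would compute from the raw counters. In practice these rules would more likely live in your alerting system than in application code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str                        # key expected in the samples dict
    breached: Callable[[float], bool]  # True when the threshold is violated
    action: str                        # action text from the runbook table


# Thresholds mirror the table; the derived ratio keys are hypothetical names.
RULES = [
    AlertRule("sk_cache_hit_rate", lambda v: v < 0.25,
              "Review query diversity, adjust TTL"),
    AlertRule("sk_function_duration_p99", lambda v: v > 12.0,
              "Check model latency, circuit breakers"),
    AlertRule("cost_vs_daily_budget_ratio", lambda v: v > 1.10,
              "Activate model tiering, alert FinOps"),
    AlertRule("sk_circuit_breaker_opened_total", lambda v: v > 0,
              "Page on-call, activate fallback"),
    AlertRule("output_validation_failure_ratio", lambda v: v > 0.02,
              "Review prompt quality, check for drift"),
]


def evaluate(samples: Dict[str, float]) -> List[str]:
    """Return 'metric: action' for every rule that is present and breached."""
    return [
        f"{rule.metric}: {rule.action}"
        for rule in RULES
        if rule.metric in samples and rule.breached(samples[rule.metric])
    ]
```

For the circuit breaker rule, the sample is assumed to be the increase in the counter over the evaluation window, so any positive value counts as a breach.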

6.2 Incident Response Patterns

Symptom: Sudden latency spike (P99 > 15s)

Check circuit breaker state — if open, primary model is degraded
Verify fallback model is receiving traffic and responding within SLA
Check Redis cache hit rate — a spike in misses increases LLM load
Review FT.INFO semantic-cache-idx for index health

Symptom: Cost burn rate 2× normal

Check cache hit rate — a cache index rebuild or Redis failover may have cleared cached entries
Review planner step counts — unbounded planning may have slipped through boundary filter
Audit token consumption by plugin — identify highest-cost functions
Check for prompt injection events — attackers deliberately inflating token usage

Symptom: Elevated output validation failures

Pull sample failed outputs from audit store
Check if a prompt template was recently updated — prompt regressions are the primary cause
Verify LLM model version hasn’t changed in Azure OpenAI deployment
Review input distribution — a change in user query patterns may be exposing prompt brittleness

VII. Key Takeaways and What’s Next

Part 2 has walked through the three implementation layers that bridge architectural intent and production reality.

The async parallel orchestration pattern transforms multi-step LLM workflows from sequential 6–10 second operations into concurrent 2–3 second operations — not through faster LLMs, but through correct use of Task composition and dependency-aware DAG execution.

The Redis semantic cache eliminates token costs for repeated semantic intent — the most impactful FinOps lever available at this layer of the stack. The implementation details that matter most are KNN vector indexing with HNSW, per-function similarity thresholds, and TTL policies that balance freshness against hit rate.

The complete filter pipeline is what separates a Semantic Kernel proof-of-concept from a system an enterprise can operate. Rate limiting, prompt sanitization, token metering, output validation, and planner boundary enforcement are not optional hardening steps — they are the production system.

In Part 3, we will move into advanced territory: implementing a multi-agent orchestration architecture where specialized kernel instances collaborate on complex tasks, building a domain-specific memory system for regulated industries with PII redaction and audit-trail requirements, and exploring the emerging Semantic Kernel Process Framework for stateful, long-running AI workflows that survive service restarts and scale across distributed nodes.

This article is Part 2 of a series on Semantic Kernel for Enterprise AI in .NET. Part 1 covered foundational architecture, FinOps cost modeling, and SRE reliability patterns.

Semantic Kernel for Enterprise AI: Architecting Production-Grade LLM Integration in .NET

Ali Suleyman TOPUZ — Sat, 21 Mar 2026 18:47:34 +0000

Semantic Kernel for Enterprise AI: Architecting Production-Grade LLM Integration in .NET — Foundations — Part 1

This article is Part 1 of a series on Semantic Kernel for Enterprise AI in .NET. I work at the intersection of distributed systems, AI infrastructure, and .NET engineering.

I. Executive Summary

The Gap Between Demo and Production

Every engineering team that has delivered an LLM proof-of-concept eventually faces the same humbling reality: a polished ChatGPT-style demo is light years away from a production system that handles real business transactions under real load, with real money on the line. The raw OpenAI or Azure OpenAI API gets you to the demo in days. Getting from there to a system that a Fortune 500 organization can stake its operations on — that takes an architectural framework purpose-built for the challenge.

Microsoft’s Semantic Kernel is that framework for .NET engineers.

At its core, Semantic Kernel is an open-source SDK that functions as an orchestration layer: middleware bridging the deterministic world of traditional enterprise software with the probabilistic, latency-sensitive, and expensive world of Large Language Models. It is more than a thin API wrapper or a prompt-templating library. It provides plugin (formerly "skills") composition, function-calling-based planning, memory abstractions, a filter pipeline, and OpenTelemetry instrumentation, and it plugs into the standard .NET HttpClient resilience stack for retries and circuit breakers. Token optimization and circuit breaking are patterns you build on those extension points rather than switches you turn on. Microsoft positions it as the orchestration layer of its own Copilot stack, serving users at a scale where failures are measured in dollars, not just SLA percentages.

For senior engineers operating at the intersection of AI, distributed systems, and .NET, this article is a deep architectural walkthrough of what Semantic Kernel is, why it exists in its current form, and how to wield it at the level of a production systems architect rather than a tutorial consumer.

II. Why Raw LLM APIs Are Necessary but Insufficient

Before examining Semantic Kernel’s architecture, we need to name the production failure modes that motivate its existence. These are problems that emerge reliably once an LLM-powered system moves beyond controlled demos.

2.1 Prompt Management at Scale

In a demo, your prompt is a string in your source code. In production, prompts are configuration artifacts that must be versioned, deployed independently of code, A/B tested, localized, role-conditioned, and audited. A single enterprise AI agent may require dozens of prompt templates, each evolving independently. Managing this through string interpolation in C# is an operational disaster waiting to happen.

Semantic Kernel introduces a structured prompt template system — YAML-based prompt configuration with variable injection, function calling declarations, and execution settings — that treats prompts as first-class, versionable assets rather than embedded strings.

2.2 Context Window Economics

GPT-4’s context window is large but not infinite, and every token you send costs money. In a naive implementation, developers stuff entire conversation histories into every API call, leading to two correlated failure modes: context window overflow when histories grow long, and costs that escalate superlinearly as conversation depth increases. Each turn resends everything before it, so cumulative input tokens grow roughly quadratically with the number of turns. The economic model of LLMs means that architectural inefficiency translates directly and immediately to the FinOps P&L.

Semantic Kernel’s memory abstractions — backed by vector stores like Azure AI Search, Qdrant, or Chroma — enable selective context retrieval through semantic similarity rather than brute-force history injection. This is not a convenience feature; it is a cost-control mechanism that can reduce per-conversation token spend by an order of magnitude at enterprise scale.
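
To make the cost growth concrete, here is a small Python sketch that counts cumulative input tokens over a conversation. It rests on simplifying assumptions that are not from the original: every turn adds the same number of tokens to the history, and the `window` parameter models keeping only the most recent turns. The function name and parameters are illustrative.

```python
from typing import Optional


def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            window: Optional[int] = None) -> int:
    """Total input tokens sent across a whole conversation.

    On turn t the request carries the earlier turns plus the new message.
    With window=None the full history is resent every time; otherwise only
    the most recent `window` turns (including the new one) are sent.
    """
    total = 0
    for t in range(1, turns + 1):
        turns_sent = t if window is None else min(t, window)
        total += turns_sent * tokens_per_turn
    return total


full_20 = cumulative_input_tokens(20, 200)               # 42,000 tokens
full_40 = cumulative_input_tokens(40, 200)               # 164,000 tokens
windowed_20 = cumulative_input_tokens(20, 200, window=5)  # 18,000 tokens
```

Under these assumptions, doubling the conversation length from 20 to 40 turns roughly quadruples the cumulative input tokens, while a five-turn window keeps growth linear.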

2.3 Non-Deterministic Failure Modes

Traditional enterprise systems fail in ways that ops teams have learned to handle: service unavailable (503), timeout, null reference, constraint violation. These are binary, deterministic, and observable. LLM failures are none of these things. An LLM can:

Return a response that is syntactically valid JSON but semantically incorrect
Confidently hallucinate facts, API parameters, or code behavior
Produce outputs that drift in quality as prompt context fills up
Fail silently when a function call argument is subtly malformed
Exhibit prompt injection vulnerabilities when user input is insufficiently sandboxed

Your circuit breaker doesn’t catch hallucinations. Your retry policy doesn’t fix semantic drift. Addressing this class of failure demands architectural patterns — output validation pipelines, structured output enforcement, semantic guardrails, and human-in-the-loop escalation hooks — that raw API calls cannot provide.

2.4 Multi-Model, Multi-Provider Orchestration

No enterprise runs a single LLM for all use cases. GPT-4o handles complex reasoning; GPT-3.5-turbo handles high-volume simple classification at a fraction of the cost; a fine-tuned domain model handles specialized extraction. Routing intelligently across these models — based on task complexity, cost budget, latency SLA, and availability — requires an abstraction layer that decouples your business logic from any specific model’s API surface.

Semantic Kernel’s connector architecture provides exactly this. You register multiple model services with service IDs, define routing strategies, and let the orchestration layer handle model selection. Your skills and planners operate against the abstraction, not the concrete API.

III. Architectural Anatomy of Semantic Kernel

Understanding Semantic Kernel at an architectural level requires examining its four foundational concepts: Plugins (Skills), Planners, Memory, and Filters (Middleware). These are not independent features; they compose into an orchestration runtime.

3.1 Plugins: The Unit of AI Capability

A plugin in Semantic Kernel is a class that exposes one or more functions — either semantic functions (backed by LLM invocations) or native functions (backed by deterministic C# logic) — that the kernel can discover, invoke, and compose. The [KernelFunction] attribute and [Description] annotations are not decorative; they are the mechanism by which the planner understands what a function does and when to invoke it.

public class DocumentAnalysisPlugin
{
    private readonly Kernel _kernel;
    private readonly IDocumentStore _documentStore;

    public DocumentAnalysisPlugin(Kernel kernel, IDocumentStore documentStore)
    {
        _kernel = kernel;
        _documentStore = documentStore;
    }

    [KernelFunction("summarize_document")]
    [Description("Produces a structured executive summary of a business document given its ID")]
    public async Task<DocumentSummary> SummarizeDocumentAsync(
        [Description("The unique identifier of the document to summarize")] string documentId,
        [Description("Target audience: executive, technical, or legal")] string audience = "executive")
    {
        var document = await _documentStore.GetAsync(documentId);

        var arguments = new KernelArguments
        {
            ["document_content"] = document.Content,
            ["audience"] = audience,
            ["max_length"] = audience == "executive" ? "200" : "500"
        };
        var result = await _kernel.InvokeAsync(
            "DocumentPrompts", 
            "SummarizeForAudience", 
            arguments);
        return JsonSerializer.Deserialize<DocumentSummary>(result.GetValue<string>()!)!;
    }
    [KernelFunction("extract_obligations")]
    [Description("Extracts legal obligations, deadlines, and parties from contract documents")]
    public async Task<ContractObligations> ExtractObligationsAsync(
        [Description("Raw contract text to analyze")] string contractText)
    {
        // Native validation before LLM invocation: keep deterministic logic out of prompts
        if (contractText.Length > 100_000)
            throw new ArgumentException("Contract text exceeds maximum analyzable length");
        var arguments = new KernelArguments { ["contract_text"] = contractText };

        var result = await _kernel.InvokeAsync(
            "LegalPrompts", 
            "ExtractObligations", 
            arguments);
        return ParseObligations(result.GetValue<string>()!);
    }
}

The architectural principle here is the separation of semantic and native logic. Input validation, data access, and output parsing belong in native functions. Reasoning, synthesis, and language understanding belong in semantic functions. Mixing these concerns produces systems that are harder to test, harder to debug, and more expensive to run.

3.2 Planners: Autonomous Orchestration

The planner is Semantic Kernel’s most architecturally significant component — and the one most likely to surprise engineers coming from traditional orchestration systems like Temporal or Azure Durable Functions. A planner takes a high-level goal expressed in natural language and dynamically generates a plan — a sequence of plugin invocations — to achieve it, using the LLM itself as the planning engine.

Semantic Kernel currently supports two primary planner strategies:

Handlebars Planner generates a structured Handlebars template representing a fixed execution plan. This is deterministic once generated — suitable for workflows where you want human review of the plan before execution, or where the execution graph needs to be auditable.

Function Calling Planner (Auto-invoke) delegates step-by-step planning to the LLM’s native function-calling capability, allowing iterative, dynamic execution where the model decides at each step which function to invoke based on prior results. This is appropriate for exploratory or open-ended tasks.

For production enterprise systems, the architectural guidance is this: never use open-ended auto-invoke without bounding constraints. Unbounded planners can generate execution graphs of arbitrary length, incurring unbounded token costs and latency. Always constrain FunctionChoiceBehavior to an explicit function allowlist, and enforce a maximum step count separately in an IAutoFunctionInvocationFilter, since FunctionChoiceBehaviorOptions has no step-limit setting:

var executionSettings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto(
        functions: allowedFunctions, // Constrain available functions
        options: new FunctionChoiceBehaviorOptions
        {
            AllowConcurrentInvocation = true, // Parallel execution where safe
            AllowParallelCalls = true
        }
    ),
    MaxTokens = 2048,
    Temperature = 0.1 // Low temperature for planning tasks—higher determinism
};
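
The settings above restrict which functions the model may call; the step and token bounds have to be enforced around the invocation loop itself. A language-neutral sketch of that bounding logic in Python follows. `step_fn` is a hypothetical stand-in for one model round-trip, and `StepResult`, `run_bounded`, and the reason strings are names invented for this example rather than Semantic Kernel APIs.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class StepResult:
    tokens_used: int  # tokens consumed by this step
    done: bool        # True when the model has produced a final answer


def run_bounded(step_fn: Callable[[int], StepResult],
                max_steps: int, max_tokens: int) -> Tuple[str, int, int]:
    """Run step_fn(i) until it finishes or a bound is hit.

    Returns (reason, steps_executed, tokens_consumed).
    """
    tokens = 0
    for i in range(max_steps):
        result = step_fn(i)
        tokens += result.tokens_used
        if result.done:
            return ("completed", i + 1, tokens)
        if tokens > max_tokens:
            # Same idea as setting Terminate on the filter context
            return ("token_budget_exceeded", i + 1, tokens)
    return ("step_limit_reached", max_steps, tokens)
```

Returning an explicit reason makes it possible to distinguish a plan that finished from one that was cut off, which matters when you alert on runaway planning.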

3.3 Memory: Semantic Context Management

Semantic Kernel’s memory system abstracts vector storage and retrieval behind a clean interface that integrates naturally with the kernel’s execution pipeline. The architectural pattern here is Retrieval-Augmented Generation (RAG), but implemented at the framework level rather than hand-rolled per application.

The key design decision is between volatile memory (in-process, session-scoped) and persistent memory (vector database-backed, cross-session). In enterprise applications, persistent memory backed by Azure AI Search or Qdrant is the standard pattern — it enables knowledge bases that grow over time, cross-session context, and shared memory across multiple kernel instances in a horizontally scaled deployment.

// Memory configuration with Azure AI Search
var memoryBuilder = new MemoryBuilder();
memoryBuilder
    .WithAzureOpenAITextEmbeddingGeneration(
        deploymentName: config["AzureOpenAI:EmbeddingDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!)
    .WithMemoryStore(new AzureAISearchMemoryStore(
        endpoint: config["AzureSearch:Endpoint"]!,
        apiKey: config["AzureSearch:ApiKey"]!));

var memory = memoryBuilder.Build();
// Storing knowledge
await memory.SaveInformationAsync(
    collection: "product-knowledge-base",
    text: productDocument.Content,
    id: productDocument.Id,
    description: productDocument.Title);
// Semantic retrieval: returns chunks ranked by cosine similarity
var results = await memory.SearchAsync(
    collection: "product-knowledge-base",
    query: userQuery,
    limit: 5,
    minRelevanceScore: 0.75)
    .ToListAsync();

The minRelevanceScore parameter is operationally critical. In production systems, you need a floor on retrieval quality to prevent low-relevance noise from polluting the LLM's context window—which degrades response quality while increasing token costs simultaneously.

3.4 Filters: The Middleware Pipeline

Semantic Kernel’s filter pipeline is its most underutilized and most architecturally powerful feature for enterprise hardening. Filters are middleware components that intercept function invocations — both before and after execution — enabling cross-cutting concerns to be implemented once and applied uniformly across all kernel operations.

The three filter interfaces are:

IFunctionInvocationFilter — intercepts individual plugin function calls
IPromptRenderFilter — intercepts prompt construction before LLM submission
IAutoFunctionInvocationFilter — intercepts auto-planner function calls specifically

For enterprise systems, the canonical set of filters to implement includes:

Rate Limiting Filter — enforces per-user, per-tenant, or global token budget constraints, rejecting invocations that would exceed configured limits before they hit the LLM API.

Semantic Cache Filter — checks a Redis-backed semantic similarity cache before forwarding requests to the LLM, returning cached responses for semantically equivalent queries. This is the single highest-ROI FinOps optimization available in the stack.

Audit Filter — writes a structured audit log of every LLM invocation including the resolved prompt, model parameters, token consumption, and response — essential for regulated industries where AI decision auditing is a compliance requirement.

Output Validation Filter — applies post-invocation validation rules (schema validation, content policy checks, confidence scoring) and triggers retry or escalation logic when validation fails.

public class SemanticCacheFilter : IFunctionInvocationFilter
{
    private readonly ISemanticCache _cache;
    private readonly ILogger<SemanticCacheFilter> _logger;
    private readonly CacheConfiguration _config;
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Only cache semantic function invocations, not native functions
        // (IsSemanticFunction is an application-defined extension method, not part of the SK API)
        if (!context.Function.IsSemanticFunction())
        {
            await next(context);
            return;
        }
        var cacheKey = BuildCacheKey(context);
        var cached = await _cache.GetAsync(cacheKey);
        if (cached is not null)
        {
            _logger.LogInformation(
                "Cache hit for function {FunctionName}. Tokens saved: ~{EstimatedTokens}",
                context.Function.Name,
                EstimateTokens(cached));

            context.Result = new FunctionResult(context.Function, cached);
            return; // Skip LLM invocation entirely
        }
        // Cache miss: execute and store
        await next(context);
        var result = context.Result.GetValue<string>();
        if (result is not null)
        {
            await _cache.SetAsync(
                cacheKey, 
                result, 
                TimeSpan.FromMinutes(_config.TtlMinutes));
        }
    }
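    private string BuildCacheKey(FunctionInvocationContext context)
    {
        // KernelArguments.ToString() returns the type name, so render the arguments explicitly
        // in a stable order (requires using System.Linq).
        // ComputeSemanticHash is a placeholder: a production lookup should compare embeddings
        // for similarity rather than rely on string equality.
        var prompt = string.Join("|", context.Arguments
            .OrderBy(a => a.Key, StringComparer.Ordinal)
            .Select(a => $"{a.Key}={a.Value}"));
        return $"{context.Function.PluginName}:{context.Function.Name}:{ComputeSemanticHash(prompt)}";
    }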
}
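
The comment in BuildCacheKey calls for embedding similarity rather than string equality. The Python sketch below shows what such a lookup could look like in its simplest form. It is an in-memory linear scan standing in for a real vector index, the `InMemorySemanticCache` class and its 0.92 default threshold are assumptions made for illustration, and the embeddings are assumed to come from whatever embedding model you already use.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Linear-scan stand-in for a vector index (hypothetical helper)."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best-matching response at or above the threshold."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The threshold is the knob that trades hit rate against the risk of returning an answer to a subtly different question, which is why a per-function value is usually safer than a global one.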

IV. FinOps Architecture: Treating Token Costs as Infrastructure

Token costs are not an afterthought. At enterprise scale, they are a primary cost driver that must be engineered with the same rigor as compute and storage costs. This section outlines the architectural patterns that constitute a mature FinOps posture for Semantic Kernel deployments.

4.1 The Token Cost Model

Understanding the cost model is prerequisite to optimizing it. Azure OpenAI pricing operates on a per-million-token basis, differentiated between input (prompt) tokens and output (completion) tokens, with output tokens typically priced at a 3–4× multiplier (4× for the GPT-4o figures below). For GPT-4o as of this writing:

Input: ~$2.50 per 1M tokens
Output: ~$10.00 per 1M tokens

A production AI agent handling 10,000 conversations per day, each averaging 8,000 input tokens and 500 output tokens, incurs:

Daily input cost: 10,000 × 8,000 / 1,000,000 × $2.50 = $200/day
Daily output cost: 10,000 × 500 / 1,000,000 × $10.00 = $50/day
Monthly total: ~$7,500/month from token costs alone
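
The arithmetic above can be reproduced with a few lines of Python, which also makes it easy to plug in your own volumes and prices. The function name, the 30-day month, and the `cache_hit_rate` parameter are assumptions of this sketch; the latter assumes a cache hit costs no tokens at all.

```python
def monthly_token_cost(conversations_per_day: int,
                       input_tokens: int,
                       output_tokens: int,
                       input_price_per_m: float,
                       output_price_per_m: float,
                       cache_hit_rate: float = 0.0,
                       days: int = 30) -> float:
    """Estimated monthly token spend in USD.

    Prices are per one million tokens. cache_hit_rate is the fraction of
    conversations served from cache (assumed to cost nothing).
    """
    daily_input = conversations_per_day * input_tokens / 1_000_000 * input_price_per_m
    daily_output = conversations_per_day * output_tokens / 1_000_000 * output_price_per_m
    return (daily_input + daily_output) * days * (1.0 - cache_hit_rate)


# Figures from the example above: $200/day input + $50/day output
baseline = monthly_token_cost(10_000, 8_000, 500, 2.50, 10.00)
```

With the article's numbers, `baseline` comes to 7,500 USD per month, and input tokens account for 80% of it, which is why the input-side levers tend to matter most.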

The leverage points for FinOps optimization, in approximate order of impact:

1. Semantic caching — Returning cached responses for semantically equivalent queries eliminates token costs entirely for cache hits. In enterprise applications with repetitive query patterns (FAQ-style, report generation, classification), cache hit rates of 30–60% are achievable, representing proportional cost reductions.

2. Model tiering — Routing simple tasks (intent classification, entity extraction from structured text, short summarization) to GPT-3.5-turbo or GPT-4o-mini rather than GPT-4o can reduce per-invocation costs by 10–50× for those tasks. The routing logic belongs in a Semantic Kernel filter or a custom ITextGenerationService implementation.

3. Prompt compression — Systematically auditing and reducing prompt verbosity. Every word in your system prompt costs money on every invocation. Compressing a system prompt from 800 tokens to 400 tokens halves input costs for that plugin.

4. Context window management — Implementing sliding window context truncation rather than naive full-history injection. Keeping the last N turns plus retrieved semantic memory rather than the complete conversation history.

5. Streaming with early termination — Using GetStreamingChatMessageContentsAsync() and terminating the stream once enough content has been generated to satisfy the use case, avoiding payment for completion tokens you don't need.
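
As an illustration of item 4, here is a minimal Python sketch of sliding-window truncation. It assumes chat messages are plain dictionaries with `role` and `content` keys and that one turn is a user message plus the assistant reply; `trim_history` and the "Relevant context" prefix are hypothetical names chosen for the example.

```python
from typing import Dict, List, Sequence

Message = Dict[str, str]


def trim_history(messages: List[Message], max_turns: int,
                 retrieved: Sequence[str] = ()) -> List[Message]:
    """Keep system messages, retrieved memory, and the last max_turns turns.

    A turn is assumed to be one user message plus one assistant reply,
    so the last 2 * max_turns non-system messages are kept.
    """
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # Guard max_turns == 0: dialogue[-0:] would return the whole list
    recent = dialogue[-2 * max_turns:] if max_turns > 0 else []
    memory = [{"role": "system", "content": f"Relevant context:\n{chunk}"}
              for chunk in retrieved]
    return system + memory + recent
```

Counting turns is the simplest policy; a token-based budget is more precise when message lengths vary widely, at the cost of needing a tokenizer in the trimming path.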

4.2 Token Budget Enforcement Architecture

public class TokenBudgetService : ITokenBudgetService
{
    private readonly IDistributedCache _cache;
    private readonly TokenBudgetConfiguration _config;
    public async Task<BudgetCheckResult> CheckAndDeductAsync(
        string tenantId, 
        int estimatedInputTokens, 
        int estimatedOutputTokens)
    {
        var budgetKey = $"token-budget:{tenantId}:{DateTime.UtcNow:yyyy-MM-dd}";

        var currentUsage = await GetUsageAsync(budgetKey);
        var projectedCost = CalculateCost(
            currentUsage.InputTokens + estimatedInputTokens,
            currentUsage.OutputTokens + estimatedOutputTokens);

        if (projectedCost > _config.DailyBudgetUsd[tenantId])
        {
            return BudgetCheckResult.Exceeded(
                currentUsage: currentUsage.TotalCostUsd,
                limit: _config.DailyBudgetUsd[tenantId],
                resetTime: DateTime.UtcNow.Date.AddDays(1));
        }
        // Optimistic deduction: reconcile against actual usage post-invocation
        await IncrementUsageAsync(budgetKey, estimatedInputTokens, estimatedOutputTokens);

        return BudgetCheckResult.Approved();
    }
}
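
The listing above reads the current usage and increments it in two separate steps, and it mentions reconciling against actual usage afterwards. The Python sketch below shows one way those two ideas fit together in a single process, using a lock so that check and deduction happen atomically. All names here are illustrative assumptions. In a distributed deployment a process-local lock is not enough, and the same effect would need an atomic operation in the shared store.

```python
import threading
from collections import defaultdict
from typing import Dict, Tuple


class TokenBudget:
    """Single-process sketch of check-and-deduct with later reconciliation."""

    def __init__(self, daily_limit_usd: float,
                 input_price_per_m: float, output_price_per_m: float) -> None:
        self.daily_limit_usd = daily_limit_usd
        self.input_price_per_m = input_price_per_m
        self.output_price_per_m = output_price_per_m
        self._lock = threading.Lock()
        self._spent: Dict[Tuple[str, str], float] = defaultdict(float)

    def _cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens / 1_000_000 * self.input_price_per_m
                + output_tokens / 1_000_000 * self.output_price_per_m)

    def check_and_deduct(self, tenant: str, day: str,
                         est_input: int, est_output: int) -> bool:
        """Reserve the estimated cost; False means the budget would be exceeded."""
        cost = self._cost(est_input, est_output)
        with self._lock:  # check and increment as one atomic step
            if self._spent[(tenant, day)] + cost > self.daily_limit_usd:
                return False
            self._spent[(tenant, day)] += cost
            return True

    def reconcile(self, tenant: str, day: str, est_input: int, est_output: int,
                  actual_input: int, actual_output: int) -> None:
        """Replace the reserved estimate with the actual cost after the call."""
        delta = self._cost(actual_input, actual_output) - self._cost(est_input, est_output)
        with self._lock:
            self._spent[(tenant, day)] += delta

    def spent(self, tenant: str, day: str) -> float:
        with self._lock:
            return self._spent[(tenant, day)]
```

Reserving the estimate up front errs on the side of rejecting requests near the limit; reconciliation then releases whatever the call did not actually use.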

V. SRE Reliability Patterns for Semantic Kernel

Production AI systems must meet the same availability and reliability standards as any other mission-critical service. The SRE challenge with LLM-backed systems is that the failure domain is richer and less familiar than traditional backend services.

5.1 The AI System Failure Taxonomy

Before designing for reliability, enumerate what you’re designing against:

| Failure Class | Description | Detection Signal | Mitigation Pattern |
|-------------------------|------------------------------------------|-----------------------------------|---------------------------------------------|
| Model Unavailability | LLM API returns 503/429 | HTTP status codes | Circuit breaker + fallback model |
| Rate Limit Exhaustion | TPM/RPM limits exceeded | 429 with Retry-After header | Exponential backoff + quota management |
| Context Overflow | Input exceeds context window | 400 with context length error | Dynamic context truncation |
| Semantic Degradation | Output quality falls below threshold | Custom quality scorer | Output validation + retry |
| Prompt Injection | Malicious user input hijacks prompt | Heuristic + LLM-as-judge | Input sanitization filter |
| Hallucination | Factually incorrect output | Grounding verification | RAG + citation enforcement |
| Token Budget Exhaustion | Daily spend limit reached | Budget service check | Graceful degradation to cached/static |

5.2 Circuit Breaker Implementation

The Polly library integrates naturally with Semantic Kernel’s HttpClient pipeline, providing circuit breaker, retry, and bulkhead patterns that protect against cascading failures from LLM API instability.

// Configure resilience pipeline for LLM HTTP clients
builder.Services.AddHttpClient("AzureOpenAI")
    .AddResilienceHandler("llm-pipeline", pipeline =>
    {
        // Retry with exponential backoff—respects Retry-After headers from 429 responses
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = static args => args.Outcome switch
            {
                { Result.StatusCode: HttpStatusCode.TooManyRequests } => PredicateResult.True(),
                { Result.StatusCode: HttpStatusCode.ServiceUnavailable } => PredicateResult.True(),
                { Exception: HttpRequestException } => PredicateResult.True(),
                _ => PredicateResult.False()
            },
            OnRetry = args =>
            {
                // HttpRetryStrategyOptions honors Retry-After by default (ShouldRetryAfterHeader);
                // this hook only records the server-provided delay for diagnostics
                if (args.Outcome.Result?.Headers.RetryAfter?.Delta is TimeSpan retryAfter)
                {
                    args.Context.Properties.Set(
                        new ResiliencePropertyKey<TimeSpan>("retry-after"), 
                        retryAfter);
                }
                return ValueTask.CompletedTask;
            }
        });

        // Circuit breaker: opens when at least 50% of calls fail within a 30-second window
        // (minimum 10 calls), then stays open for 30 seconds
        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            SamplingDuration = TimeSpan.FromSeconds(30),
            MinimumThroughput = 10,
            FailureRatio = 0.5,
            BreakDuration = TimeSpan.FromSeconds(30),
            OnOpened = args =>
            {
                // Alert SRE team and activate fallback routing
                MetricsRegistry.CircuitBreakerOpened
                    .Add(1, new TagList { { "model", "primary-gpt4" } });
                return ValueTask.CompletedTask;
            }
        });
        // Timeout: LLM calls should complete within SLA
        pipeline.AddTimeout(TimeSpan.FromSeconds(30));
    });
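
The breaker configured above is ratio-based rather than count-based: it looks at the share of failures inside a sampling window and ignores windows with too little traffic. The Python sketch below illustrates that decision rule with the same four parameters. It is a simplified model written for this article, not Polly's implementation: it skips the half-open probing state, and opening at a ratio equal to the threshold is an assumption of the sketch.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class FailureRatioBreaker:
    """Simplified failure-ratio circuit breaker (illustrative only)."""

    def __init__(self, failure_ratio: float = 0.5, minimum_throughput: int = 10,
                 sampling_duration: float = 30.0, break_duration: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_ratio = failure_ratio
        self.minimum_throughput = minimum_throughput
        self.sampling_duration = sampling_duration
        self.break_duration = break_duration
        self._clock = clock
        self._samples: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.break_duration:
            # Break elapsed: close and start again with an empty window
            self._opened_at = None
            self._samples.clear()
            return True
        return False

    def record(self, succeeded: bool) -> None:
        now = self._clock()
        self._samples.append((now, succeeded))
        # Drop samples that have fallen out of the sampling window
        while self._samples and now - self._samples[0][0] > self.sampling_duration:
            self._samples.popleft()
        if succeeded:
            return
        total = len(self._samples)
        failures = sum(1 for _, ok in self._samples if not ok)
        if total >= self.minimum_throughput and failures / total >= self.failure_ratio:
            self._opened_at = now
```

One practical consequence of the minimum-throughput rule is that a low-traffic service can fail several times in a row without ever tripping the breaker, so the retry and timeout strategies still have to carry their share of the load.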

5.3 Multi-Model Failover Strategy

When the primary model’s circuit breaker is open, the system must degrade gracefully rather than fail completely. The architectural pattern is a fallback chain: primary GPT-4o → fallback GPT-3.5-turbo → cached response → static degraded response.

public class ResilientKernelService : IResilientKernelService
{
    private readonly Kernel _primaryKernel; // GPT-4o
    private readonly Kernel _fallbackKernel; // GPT-3.5-turbo
    private readonly ISemanticCache _cache;
    private readonly ICircuitBreakerRegistry _circuitBreakers;
    public async Task<string> InvokeWithFallbackAsync(
        string pluginName, 
        string functionName, 
        KernelArguments arguments,
        CancellationToken ct = default)
    {
        // Check circuit breaker state before attempting primary
        if (!_circuitBreakers.IsOpen("primary-gpt4"))
        {
            try
            {
                var result = await _primaryKernel.InvokeAsync(
                    pluginName, functionName, arguments, ct);
                return result.GetValue<string>()!;
            }
            catch (Exception ex) when (IsTransientFailure(ex))
            {
                _circuitBreakers.RecordFailure("primary-gpt4");
            }
        }
        // Fallback to cheaper/more available model
        try
        {
            var result = await _fallbackKernel.InvokeAsync(
                pluginName, functionName, arguments, ct);

            MetricsRegistry.FallbackInvocations.Add(1, 
                new TagList { { "reason", "primary-unavailable" } });

            return result.GetValue<string>()!;
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Last resort: return cached response if available
            // (caller-initiated cancellation is excluded above and propagates as-is)
            var cacheKey = BuildCacheKey(pluginName, functionName, arguments);
            var cached = await _cache.GetAsync(cacheKey);

            if (cached is not null)
            {
                MetricsRegistry.FallbackInvocations.Add(1, 
                    new TagList { { "reason", "all-models-unavailable-cache-hit" } });
                return cached;
            }
            // Propagate only if no fallback is available
            throw new AiServiceUnavailableException(
                "All AI service tiers are unavailable and no cached response exists.", ex);
        }
    }
}
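
The listing above stops at the cache and throws on a miss, while the chain described at the start of this section ends in a static degraded response. A minimal sketch of the full chain as a generic helper, assuming each tier is a delegate that either throws or returns null when it cannot answer. `FallbackChain`, the tier labels, and the degraded message are hypothetical names for illustration, not Semantic Kernel APIs:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class FallbackChain
{
    // Tries each tier in order; a tier "fails" by throwing or by returning null.
    public static async Task<(string Value, string Tier)> ExecuteAsync(
        IReadOnlyList<(string Tier, Func<CancellationToken, Task<string?>> Attempt)> tiers,
        string degradedResponse,
        CancellationToken ct = default)
    {
        foreach (var (tier, attempt) in tiers)
        {
            ct.ThrowIfCancellationRequested();
            try
            {
                var value = await attempt(ct);
                if (value is not null)
                    return (value, tier);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Move on to the next tier; real code would log and record a metric here.
            }
        }

        // Final tier: a fixed message, so callers always get some response.
        return (degradedResponse, "static-degraded");
    }
}
```

Returning the tier label alongside the value lets the caller tag fallback metrics with the tier that actually answered.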

VI. Observability: The Foundation of Production AI Operations

You cannot operate what you cannot observe. This principle, axiomatic in traditional SRE, becomes even more critical in AI systems where failures are probabilistic and often invisible at the infrastructure layer. An LLM returning a hallucinated response looks exactly like a successful 200 OK from the network’s perspective.

6.1 The Three-Layer Observability Model

Enterprise Semantic Kernel observability operates at three distinct layers, each requiring different instrumentation:

Infrastructure Layer — Traditional metrics: request rates, error rates, latency percentiles, HTTP status codes. Semantic Kernel emits OpenTelemetry traces and metrics natively from Microsoft.SemanticKernel.* activity sources. Wire these into your existing observability stack (Azure Monitor, Datadog, Prometheus/Grafana) with zero custom code.

Resource Consumption Layer — Token-level metrics: input token count, output token count, model invoked, cost estimate. These require custom instrumentation via an IFunctionInvocationFilter that reads token usage from the FunctionInvocationContext.Result.Metadata.

Semantic Quality Layer — The layer most organizations skip and later regret. Metrics around output quality: confidence scores, grounding percentages, retrieval relevance scores, user satisfaction signals. These require purpose-built evaluation infrastructure, typically implemented as a background evaluation pipeline that samples production outputs and scores them against quality criteria.

public class ObservabilityFilter : IFunctionInvocationFilter
{
    // Meter and ActivitySource names are illustrative; register the same names in your OpenTelemetry setup.
    private static readonly Meter Meter = new("Enterprise.SemanticKernel");
    private static readonly ActivitySource Tracing = new("Enterprise.SemanticKernel");

    private static readonly Histogram<double> InvocationDuration =
        Meter.CreateHistogram<double>(
            "sk_function_duration_seconds",
            unit: "s",
            description: "Duration of Semantic Kernel function invocations");
    private static readonly Counter<long> TokensConsumed =
        Meter.CreateCounter<long>(
            "sk_tokens_consumed_total",
            description: "Total tokens consumed by Semantic Kernel functions");
    private static readonly Counter<double> EstimatedCostUsd =
        Meter.CreateCounter<double>(
            "sk_estimated_cost_usd_total",
            description: "Estimated cost in USD of Semantic Kernel function invocations");

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        var stopwatch = Stopwatch.StartNew();
        Exception? exception = null;
        using var activity = Tracing.StartActivity(
            $"sk.function.{context.Function.PluginName}.{context.Function.Name}",
            ActivityKind.Internal);
        activity?.SetTag("sk.plugin.name", context.Function.PluginName);
        activity?.SetTag("sk.function.name", context.Function.Name);
        try
        {
            await next(context);
            // Capture token usage from response metadata
            var usage = context.Result.Metadata?
                .GetValueOrDefault("Usage") as CompletionsUsage;
            if (usage is not null)
            {
                var model = context.Result.Metadata?
                    .GetValueOrDefault("ModelId")?.ToString() ?? "unknown";
                // TagList has no string indexer, so build one tag set per measurement
                TokensConsumed.Add(usage.PromptTokens, BuildTokenTags(context, model, "input"));
                TokensConsumed.Add(usage.CompletionTokens, BuildTokenTags(context, model, "output"));
                var cost = CalculateCost(
                    usage.PromptTokens,
                    usage.CompletionTokens,
                    model);
                EstimatedCostUsd.Add(cost, new TagList
                {
                    { "plugin", context.Function.PluginName },
                    { "function", context.Function.Name }
                });
                activity?.SetTag("sk.tokens.prompt", usage.PromptTokens);
                activity?.SetTag("sk.tokens.completion", usage.CompletionTokens);
                activity?.SetTag("sk.cost.usd", cost);
            }
        }
        catch (Exception ex)
        {
            exception = ex;
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
        finally
        {
            // Record duration for both successful and failed invocations
            stopwatch.Stop();
            InvocationDuration.Record(stopwatch.Elapsed.TotalSeconds, new TagList
            {
                { "plugin", context.Function.PluginName },
                { "function", context.Function.Name },
                { "success", exception is null }
            });
        }
    }

    private static TagList BuildTokenTags(
        FunctionInvocationContext context, string model, string tokenType) =>
        new()
        {
            { "plugin", context.Function.PluginName },
            { "function", context.Function.Name },
            { "model", model },
            { "token_type", tokenType }
        };
}
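
The filter calls a `CalculateCost` helper that the listing does not show. One way it could look, as a sketch only: the `TokenCostEstimator` class, the model keys, and every price in the table are placeholders I made up for illustration, so load real per-model prices from configuration and check them against your provider's current price list.

```csharp
using System;
using System.Collections.Generic;

public static class TokenCostEstimator
{
    // Placeholder USD prices per 1,000 tokens (input, output). NOT real prices.
    private static readonly Dictionary<string, (decimal InputPer1K, decimal OutputPer1K)> Prices =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["model-large"] = (0.0100m, 0.0300m),
            ["model-small"] = (0.0005m, 0.0015m),
        };

    public static double CalculateCost(int promptTokens, int completionTokens, string model)
    {
        // Unknown models cost 0 here so a missing price never breaks the request path.
        if (!Prices.TryGetValue(model, out var price))
            return 0d;

        // Use decimal for the money arithmetic, convert once for the metrics API.
        var cost = promptTokens / 1000m * price.InputPer1K
                 + completionTokens / 1000m * price.OutputPer1K;
        return (double)cost;
    }
}
```

Pricing input and output tokens separately matters because the two rates usually differ, which is also why the filter records them under separate `token_type` tags.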

6.2 The SRE Dashboard for AI Systems

Beyond the instrumentation code, the operational posture requires purpose-built dashboards that surface AI-specific signals:

Cost Burn Rate Dashboard — Real-time token consumption and estimated cost, broken down by tenant, plugin, and model. Alert when daily burn rate projects to exceed monthly budget. This is the FinOps control plane.

Quality Degradation Dashboard — Trending quality scores from the semantic evaluation pipeline. A sudden drop in average confidence or grounding scores is often the first signal of a prompt regression before users start complaining.

Latency Percentile Dashboard — P50, P95, P99 invocation latency by plugin and model. LLM latency is highly variable; mean latency is a misleading metric. The P99 determines actual user experience under load.

Fallback Activation Rate — Track the ratio of primary model invocations to fallback invocations. A rising fallback rate signals primary model degradation before the circuit breaker fully opens.
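
The arithmetic behind three of these panels is small enough to sketch. This assumes a linear burn-rate projection and a nearest-rank percentile; `DashboardMath` and its method names are hypothetical, and in practice your metrics backend would usually compute these for you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DashboardMath
{
    // Linear projection: average daily spend so far, extended to the whole month.
    public static decimal ProjectMonthlySpend(decimal spentSoFar, int daysElapsed, int daysInMonth)
    {
        if (daysElapsed <= 0) throw new ArgumentOutOfRangeException(nameof(daysElapsed));
        return spentSoFar / daysElapsed * daysInMonth;
    }

    public static bool ShouldAlert(decimal spentSoFar, int daysElapsed, int daysInMonth, decimal monthlyBudget) =>
        ProjectMonthlySpend(spentSoFar, daysElapsed, daysInMonth) > monthlyBudget;

    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    public static double Percentile(IReadOnlyCollection<double> samples, double p)
    {
        if (samples.Count == 0) throw new ArgumentException("No samples.", nameof(samples));
        if (p <= 0 || p > 100) throw new ArgumentOutOfRangeException(nameof(p));
        var sorted = samples.OrderBy(x => x).ToArray();
        var rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Max(rank, 1) - 1];
    }

    // Share of all invocations that were served by the fallback model.
    public static double FallbackRate(long primaryInvocations, long fallbackInvocations)
    {
        var total = primaryInvocations + fallbackInvocations;
        return total == 0 ? 0d : (double)fallbackInvocations / total;
    }
}
```

For latencies of 0.4, 0.5, 0.5, 0.6 and 9.0 seconds, the mean is 2.2 s, the P50 is 0.5 s and the P99 is 9.0 s. The mean describes none of the five requests, which is the reason to chart percentiles instead.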

VII. Bringing It Together: The Production Architecture Blueprint

A production Semantic Kernel deployment for an enterprise AI agent system has the following high-level architecture:

Ingress & Auth Layer — API Gateway (Azure API Management) handling authentication, per-tenant rate limiting, and request routing. Token budget checks happen at this layer before requests reach the Semantic Kernel service.

Semantic Kernel Service — Horizontally scaled .NET 9.0 service hosting the kernel, plugins, and filter pipeline. Stateless — all state lives in the memory layer or distributed cache.

Semantic Cache — Redis Cluster with vector similarity index (Redis Stack) for semantic deduplication. Shared across all kernel service instances.

Vector Memory Store — Azure AI Search or Qdrant for persistent semantic memory. Supports the RAG pipeline for knowledge-grounded responses.

LLM Providers — Primary Azure OpenAI deployment (GPT-4o) and fallback deployment (GPT-4o-mini), each behind Polly resilience pipelines with independent circuit breakers.

Observability Stack — OpenTelemetry Collector receiving traces and metrics from all services, forwarding to Azure Monitor + Application Insights. Grafana dashboards surfacing the AI-specific signals described in Section VI.

Evaluation Pipeline — Asynchronous worker service that samples production invocations, runs them through quality evaluation (LLM-as-judge pattern), and writes scores to the metrics store for dashboard visibility.
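
For the evaluation pipeline, the sampling decision is worth making deterministic so that a given invocation is either always or never evaluated, regardless of which worker sees it. A minimal sketch, assuming each invocation carries a stable string ID; `EvaluationSampler` is a hypothetical helper and not part of Semantic Kernel:

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

public static class EvaluationSampler
{
    // Hash the invocation ID to a number in [0, 1] and compare it with the sample rate.
    public static bool ShouldEvaluate(string invocationId, double sampleRate)
    {
        if (sampleRate <= 0) return false;
        if (sampleRate >= 1) return true;

        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(invocationId));
        // Read the first 4 bytes with a fixed endianness so every machine agrees.
        var bucket = BinaryPrimitives.ReadUInt32BigEndian(hash) / (double)uint.MaxValue;
        return bucket < sampleRate;
    }
}
```

Hash-based sampling needs no shared state between worker instances, and raising the sample rate later keeps every previously sampled invocation in the sample.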

VIII. Key Takeaways and What’s Next

Semantic Kernel earns its complexity. For teams moving LLM applications from proof-of-concept to production scale, the framework provides:

The plugin architecture that makes AI capabilities composable, testable, and maintainable across large teams.
The planner that enables autonomous AI agents without requiring every orchestration path to be pre-coded.
The memory system that makes RAG a first-class architectural pattern rather than a hand-rolled feature.
The filter pipeline that gives you the cross-cutting concerns — caching, rate limiting, observability, output validation — as a structured composition point rather than scattered middleware.
The multi-model connector architecture that decouples business logic from any specific provider’s API surface.

None of these benefits come for free. Semantic Kernel introduces genuine complexity: asynchronous orchestration, distributed state management, probabilistic failure modes, and token cost economics all demand architectural attention that simpler integrations can defer. The payoff is a system architecture that can survive the transition from demo to production, from hundreds of users to millions, and from the first AI feature to the tenth.

In Part 2, we will move from foundations to implementation: building the async-first parallel orchestration patterns that collapse multi-step LLM workflows from sequential to concurrent execution, implementing the semantic cache with Redis vector similarity search, and walking through the full filter pipeline implementation with production-grade token metering.

AI-Powered Code Generation and Testing in .NET:

Ali Suleyman TOPUZ — Sat, 21 Mar 2026 18:47:24 +0000

AI-Powered Code Generation and Testing in .NET: A Staff Engineer’s Guide to Modernizing Development Workflows

This guide reflects patterns and practices from production .NET deployments integrating OpenAI APIs. Code examples target .NET 9 and Azure OpenAI SDK. Model pricing and API specifications should be verified against current OpenAI and Azure documentation, as these change frequently.

Executive Summary

The integration of artificial intelligence into software development represents one of the most profound shifts in our industry since the advent of high-level programming languages. Large Language Models (LLMs), particularly those from OpenAI, have moved from experimental novelties to production-critical components in many organizations’ development pipelines. Teams that successfully integrate AI assistants are reporting 30–55% productivity gains in specific development tasks, while those that fail to adapt are accumulating technical debt at an alarming rate.

What makes this transition particularly significant is that AI isn’t simply accelerating existing workflows — it’s fundamentally restructuring them. Traditional development followed a relatively linear path: requirements gathering, design, implementation, testing, and deployment. AI introduces feedback loops at every stage, enabling developers to iterate on designs through conversational interfaces, generate boilerplate code from natural language specifications, and create comprehensive test suites that would have taken weeks to develop manually.

From a Staff Engineer’s perspective, the challenge isn’t whether to adopt AI-powered development tools, but how to integrate them in a way that enhances rather than compromises system reliability. The modern development process must balance AI’s generative capabilities with human oversight and architectural discipline — establishing clear boundaries for where AI-assisted generation is appropriate, implementing robust validation frameworks, and maintaining the engineering rigor that ensures systems remain operable at scale. Organizations that treat AI as a force multiplier for experienced engineers, rather than a replacement for fundamental software engineering principles, are seeing sustainable productivity improvements. Those that don’t are discovering that AI can amplify bad practices just as effectively as good ones.

Part I: The Business Case — FinOps and the Economics of AI-Assisted Development

Reframing the ROI Calculation

The economics of AI-powered code generation are compelling but require nuanced analysis. A senior .NET developer’s fully loaded cost (salary, benefits, overhead) typically ranges from $150,000 to $250,000 annually in major tech markets. If AI-assisted code generation reduces time spent on boilerplate by even 20%, that represents $30,000–$50,000 in recovered value per developer per year — value redirectable toward architectural decisions, performance optimization, and innovation.

However, this calculation only holds if the quality of generated code meets production standards. The hidden costs of poor AI integration erode these gains faster than most organizations anticipate.

Hidden Costs and the New Technical Debt Profile

AI-generated code creates a new class of technical debt that’s subtler and more expensive to remediate than traditional quality issues. Specifically:

Shallow correctness: Code passes basic tests but lacks defensive programming, proper error handling, idiomatic patterns, and observability hooks. It works in development but fails ungracefully in production.

Context blindness: AI models generate stateless completions. They don’t understand your team’s specific architectural decisions, internal libraries, or the non-obvious constraints baked into existing systems.

Test hallucinations: Generated tests can have high coverage metrics while testing nothing of value — asserting implementation details rather than behavior, or passing against buggy code because the model trained on the same buggy patterns.

Amplification of bad prompting: A developer who writes imprecise requirements gets imprecise code. The quality of the output is ceiling-bounded by the quality of the input. This shifts the critical skill from “writing code” to “specifying requirements precisely.”

FinOps Token Economics

At scale, API costs become a first-order engineering concern. Key dimensions:

Token consumption patterns. Input tokens (prompts + context) are typically cheaper than output tokens. For code generation, the ratio of input to output matters significantly. A well-engineered prompt that delivers tight, complete context will outperform a verbose one that wastes the context window on irrelevant information.

Caching strategies. System prompts and static context (coding standards, architectural patterns, common interfaces) should be leveraged with prompt caching where available. OpenAI’s caching can reduce costs by up to 50% on repeated prompt prefixes.

Model tiering. Not all tasks require GPT-4-class models. A cost-optimized workflow routes:

Simple completions and boilerplate → GPT-4o-mini or equivalent
Complex multi-file generation, architectural reasoning → GPT-4 Turbo or GPT-4o
Real-time inline completion → the lowest-latency, lowest-cost model available
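
The routing rules above can be expressed as a single switch. This is a sketch under assumptions: the `GenerationTask` enum and the deployment names are hypothetical, since Azure OpenAI deployment names are whatever you chose when creating them.

```csharp
using System;

public enum GenerationTask
{
    InlineCompletion,
    Boilerplate,
    MultiFileGeneration,
    ArchitecturalReasoning
}

public static class ModelRouter
{
    // Maps a task category to a deployment name. The names are placeholders.
    public static string SelectDeployment(GenerationTask task) => task switch
    {
        // Lowest-latency, lowest-cost model for keystroke-level completion
        GenerationTask.InlineCompletion => "fast-small-deployment",
        // Small model for simple completions and boilerplate
        GenerationTask.Boilerplate => "small-deployment",
        // Large model for multi-file generation and architectural reasoning
        GenerationTask.MultiFileGeneration or GenerationTask.ArchitecturalReasoning => "large-deployment",
        _ => throw new ArgumentOutOfRangeException(nameof(task), task, "Unknown task type.")
    };
}
```

Keeping the mapping in one place means a pricing or latency change becomes a one-line edit rather than a search across call sites.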

Rate limit budgeting. OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) limits. At scale, these become constraints that require queuing, prioritization, and graceful degradation strategies — the same patterns you’d apply to any external dependency.
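
A token bucket is one common way to stay under a TPM quota on the client side. The sketch below is an assumption about how such a limiter could work, not the implementation used later in this guide: `TokenBudgetBucket` is a hypothetical class, and the clock is injectable only so the behaviour can be checked without waiting.

```csharp
using System;

public sealed class TokenBudgetBucket
{
    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private readonly Func<DateTimeOffset> _clock;
    private readonly object _gate = new();
    private double _available;
    private DateTimeOffset _lastRefill;

    public TokenBudgetBucket(int tokensPerMinute, Func<DateTimeOffset>? clock = null)
    {
        if (tokensPerMinute <= 0) throw new ArgumentOutOfRangeException(nameof(tokensPerMinute));
        _capacity = tokensPerMinute;
        _refillPerSecond = tokensPerMinute / 60.0;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
        _available = _capacity;          // start with a full bucket
        _lastRefill = _clock();
    }

    // Returns true and deducts the tokens if the budget allows the request right now.
    public bool TryAcquire(int tokens)
    {
        lock (_gate)
        {
            var now = _clock();
            var elapsedSeconds = (now - _lastRefill).TotalSeconds;
            if (elapsedSeconds > 0)
            {
                // Refill in proportion to elapsed time, capped at the bucket capacity.
                _available = Math.Min(_capacity, _available + elapsedSeconds * _refillPerSecond);
                _lastRefill = now;
            }

            if (tokens > _available) return false;
            _available -= tokens;
            return true;
        }
    }
}
```

A caller that gets `false` can queue the request, downgrade it to a cheaper tier, or reject it, which are the prioritization and degradation choices described above.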

Part II: Architectural Deep Dive — How AI Code Generation Actually Works

The Transformer Architecture and Why It Matters for Engineers

Modern LLMs are trained on vast corpora of code repositories, documentation, and natural language. The transformer architecture, with its self-attention mechanisms, captures long-range dependencies in code that earlier models missed — understanding that a variable declared 300 lines ago is relevant to the function being completed now.

Training occurs in three phases, each of which has practical implications for how you use these models:

Pre-training ingests billions of tokens from public repositories (GitHub, GitLab, Stack Overflow). This gives the model broad knowledge of programming patterns across languages. It also means the model has learned from poor-quality code — a pre-trained model isn’t a source of ground truth on best practices.

Fine-tuning on domain-specific datasets refines capabilities for particular frameworks — .NET-specific patterns, ASP.NET Core middleware conventions, Entity Framework idioms. When OpenAI fine-tunes on high-quality .NET code, it meaningfully improves output for that ecosystem.

Reinforcement Learning from Human Feedback (RLHF) uses human evaluators to rank outputs based on correctness, efficiency, and adherence to best practices. This phase optimizes for production-grade code rather than merely syntactically correct code. Understanding this pipeline explains why model behavior can shift between versions even for identical prompts.

Inference-Time Parameters: A Practitioner’s Guide

Temperature (0.0–2.0): Controls output randomness. For production code generation, 0.2–0.4 is the practical range. Higher values introduce creative variation that’s appropriate for exploratory prototyping but inappropriate for deterministic infrastructure code. Setting temperature to 0 doesn’t guarantee deterministic outputs, because floating-point non-determinism on the serving side can still change which token wins, but it gets close.

Context window management: Modern models support 8K–128K tokens, but longer contexts increase both latency and cost non-linearly. Effective context engineering — providing the right information rather than all information — is as important as prompt phrasing. For .NET development, this means including relevant interfaces, existing method signatures, and architectural constraints, not the entire solution file.

Nucleus sampling (top-p): Constrains sampling to the smallest set of tokens whose cumulative probability reaches p. A value of 0.95 with temperature 0.3 works well for most code generation tasks. Treat that pair as a starting point and adjust one parameter at a time rather than tuning both simultaneously.

OpenAI Model Selection for .NET Workflows

| Scenario | Recommended Model | Rationale |
|---------------------------------------------|---------------------------|------------------------------------------------------------------|
| Large codebase understanding, multi-file generation | GPT-4 Turbo (128K context) | Context window enables full solution understanding |
| Real-time inline completion | GPT-4o | Sub-500ms latency viable for IDE integration |
| Boilerplate, CRUD, test scaffolding | GPT-4o-mini | Cost optimization without meaningful quality loss |
| Batch test generation (CI/CD) | GPT-4 Turbo | Throughput over latency; quality matters for test suite health |

Part III: Implementation — Building Production-Grade AI Workflows in .NET

Foundation: Dependency Injection and Configuration

Production AI integration in .NET starts with proper infrastructure. The following pattern uses .NET 9 with the Azure OpenAI SDK:

// Program.cs
using Microsoft.Extensions.Configuration; // IConfiguration
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Azure.AI.OpenAI;
using Azure;

var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(config["AzureOpenAI:Endpoint"]!);
    var apiKey = config["AzureOpenAI:ApiKey"]!;
    return new OpenAIClient(endpoint, new AzureKeyCredential(apiKey));
});
builder.Services.AddSingleton<ICodeGenerationService, CodeGenerationService>();
builder.Services.AddSingleton<IAITestGenerationService, AITestGenerationService>();
builder.Services.AddMemoryCache();
builder.Services.AddSingleton<IRateLimiter, TokenBucketRateLimiter>();
var host = builder.Build();
await host.RunAsync();

Key architectural decisions here: the OpenAIClient is registered as a singleton (thread-safe, connection-pooled), caching is included from the start (not added later), and rate limiting is a first-class dependency.

Core Code Generation Service

// Services/CodeGenerationService.cs
public interface ICodeGenerationService
{
    Task<CodeGenerationResult> GenerateCodeAsync(
        CodeGenerationRequest request,
        CancellationToken cancellationToken = default);
}

public record CodeGenerationRequest(
    string Prompt,
    string Language = "csharp",
    int MaxTokens = 2000,
    double Temperature = 0.3,
    Dictionary<string, string>? Context = null);
public record CodeGenerationResult(
    string GeneratedCode,
    string[] Suggestions,
    int TokensUsed,
    TimeSpan Duration,
    bool FromCache = false);
public class CodeGenerationService : ICodeGenerationService
{
    private readonly OpenAIClient _openAIClient;
    private readonly IMemoryCache _cache;
    private readonly ILogger<CodeGenerationService> _logger;
    private readonly IRateLimiter _rateLimiter;
    private const string DeploymentName = "gpt-4";
    public CodeGenerationService(
        OpenAIClient openAIClient,
        IMemoryCache cache,
        ILogger<CodeGenerationService> logger,
        IRateLimiter rateLimiter)
    {
        _openAIClient = openAIClient;
        _cache = cache;
        _logger = logger;
        _rateLimiter = rateLimiter;
    }
    public async Task<CodeGenerationResult> GenerateCodeAsync(
        CodeGenerationRequest request,
        CancellationToken cancellationToken = default)
    {
        var startTime = DateTime.UtcNow;
        var cacheKey = GenerateCacheKey(request);
        if (_cache.TryGetValue<CodeGenerationResult>(cacheKey, out var cachedResult))
        {
            _logger.LogInformation("Cache hit for prompt: {Prompt}",
                request.Prompt[..Math.Min(50, request.Prompt.Length)]);
            return cachedResult! with { FromCache = true };
        }
        // Rate limit before calling external API
        await _rateLimiter.AcquireAsync(request.MaxTokens, cancellationToken);
        try
        {
            var options = new ChatCompletionsOptions
            {
                DeploymentName = DeploymentName,
                Messages =
                {
                    new ChatRequestSystemMessage(BuildSystemPrompt(request.Language)),
                    new ChatRequestUserMessage(BuildUserPrompt(request))
                },
                MaxTokens = request.MaxTokens,
                Temperature = (float)request.Temperature,
                NucleusSamplingFactor = 0.95f
            };
            var response = await _openAIClient
                .GetChatCompletionsAsync(options, cancellationToken);
            var choice = response.Value.Choices[0];
            var result = new CodeGenerationResult(
                GeneratedCode: ExtractCode(choice.Message.Content),
                Suggestions: ExtractSuggestions(choice.Message.Content),
                TokensUsed: response.Value.Usage.TotalTokens,
                Duration: DateTime.UtcNow - startTime);
            _cache.Set(cacheKey, result, TimeSpan.FromMinutes(30));
            return result;
        }
        catch (RequestFailedException ex) when (ex.Status == 429)
        {
            _logger.LogWarning("Rate limit hit; propagating for retry policy");
            throw new AIRateLimitException("OpenAI rate limit exceeded", ex);
        }
    }
    private static string BuildSystemPrompt(string language) => $"""
        You are an expert {language} developer following production-grade engineering standards.
        Generate clean, maintainable code with:
        - Comprehensive XML documentation comments
        - Null safety and defensive programming
        - Structured logging via ILogger
        - Async/await with proper CancellationToken propagation
        - Input validation with meaningful error messages
        Always provide the complete implementation. Do not truncate.
        """;
    private static string BuildUserPrompt(CodeGenerationRequest request)
    {
        var contextSection = request.Context is { Count: > 0 }
            ? $"\n\nContext:\n{string.Join("\n", request.Context.Select(kv => $"{kv.Key}: {kv.Value}"))}"
            : string.Empty;
        return $"{request.Prompt}{contextSection}";
    }
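    // Hash every input that changes the output; otherwise requests that differ
    // only in MaxTokens or Context would share one cached result.
    private static string GenerateCacheKey(CodeGenerationRequest r)
    {
        var context = r.Context is null
            ? string.Empty
            : string.Join(";", r.Context
                .OrderBy(kv => kv.Key, StringComparer.Ordinal)
                .Select(kv => $"{kv.Key}={kv.Value}"));
        return Convert.ToHexString(SHA256.HashData(
            Encoding.UTF8.GetBytes(
                $"{r.Prompt}|{r.Language}|{r.Temperature}|{r.MaxTokens}|{context}")));
    }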
    private static string ExtractCode(string content) =>
        Regex.Match(content, @"```(?:\w+)?\n([\s\S]*?)```").Groups[1].Value.Trim();
    private static string[] ExtractSuggestions(string content) =>
        Regex.Matches(content, @"(?:NOTE|WARNING|CONSIDER):\s*(.+)")
             .Select(m => m.Groups[1].Value)
             .ToArray();
}

AI-Powered Test Generation

Test generation is where AI delivers disproportionate value. Writing comprehensive unit and integration tests is time-consuming and often deprioritized under delivery pressure. AI can generate a solid test scaffolding that developers then validate and extend.

// Services/AITestGenerationService.cs
public interface IAITestGenerationService
{
    Task<TestGenerationResult> GenerateTestsAsync(
        string sourceCode,
        TestGenerationOptions options,
        CancellationToken cancellationToken = default);
}

public record TestGenerationOptions(
    TestFramework Framework = TestFramework.XUnit,
    bool IncludeEdgeCases = true,
    bool IncludeNegativeTests = true,
    bool GenerateMocks = true,
    string[]? FocusAreas = null);
public record TestGenerationResult(
    string TestCode,
    TestCoverageEstimate Coverage,
    string[] GeneratedTestNames,
    int TokensUsed);
public class AITestGenerationService : IAITestGenerationService
{
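    private readonly OpenAIClient _openAIClient;
    private readonly ILogger<AITestGenerationService> _logger;
    private const string DeploymentName = "gpt-4";

    // Constructor injection, matching the DI registration in Program.cs
    public AITestGenerationService(
        OpenAIClient openAIClient,
        ILogger<AITestGenerationService> logger)
    {
        _openAIClient = openAIClient;
        _logger = logger;
    }
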
    public async Task<TestGenerationResult> GenerateTestsAsync(
        string sourceCode,
        TestGenerationOptions options,
        CancellationToken cancellationToken = default)
    {
        var prompt = BuildTestGenerationPrompt(sourceCode, options);
        var chatOptions = new ChatCompletionsOptions
        {
            DeploymentName = DeploymentName,
            Messages =
            {
                new ChatRequestSystemMessage(GetTestSystemPrompt(options.Framework)),
                new ChatRequestUserMessage(prompt)
            },
            MaxTokens = 4000,
            Temperature = 0.2f // Low temperature for deterministic test logic
        };
        var response = await _openAIClient
            .GetChatCompletionsAsync(chatOptions, cancellationToken);
        var content = response.Value.Choices[0].Message.Content;
        var testCode = ExtractTestCode(content);
        var testNames = ExtractTestMethodNames(testCode);
        _logger.LogInformation(
            "Generated {Count} tests for source ({Length} chars) using {Tokens} tokens",
            testNames.Length, sourceCode.Length, response.Value.Usage.TotalTokens);
        return new TestGenerationResult(
            TestCode: testCode,
            Coverage: EstimateCoverage(sourceCode, testCode),
            GeneratedTestNames: testNames,
            TokensUsed: response.Value.Usage.TotalTokens);
    }
    private static string GetTestSystemPrompt(TestFramework framework) => $"""
        You are a senior .NET test engineer specializing in {framework} test suites.
        Write tests that:
        - Test behavior, not implementation
        - Use the Arrange-Act-Assert pattern with clear section comments
        - Use meaningful test names: MethodName_StateUnderTest_ExpectedBehavior
        - Cover happy paths, edge cases, and error scenarios
        - Use NSubstitute for mocking (not Moq)
        - Assert on observability: verify ILogger calls for error paths
        - Never test private methods; test through public contracts
        """;
    private static string BuildTestGenerationPrompt(
        string sourceCode, TestGenerationOptions options)
    {
        var sb = new StringBuilder();
        sb.AppendLine("Generate comprehensive tests for the following code:");
        sb.AppendLine();
        sb.AppendLine("```csharp");
        sb.AppendLine(sourceCode);
        sb.AppendLine("```");
        sb.AppendLine();
        if (options.FocusAreas?.Length > 0)
            sb.AppendLine($"Focus especially on: {string.Join(", ", options.FocusAreas)}");
        if (options.IncludeEdgeCases)
            sb.AppendLine("Include edge cases: null inputs, empty collections, boundary values.");
        if (options.IncludeNegativeTests)
            sb.AppendLine("Include negative tests: invalid inputs, exception scenarios, timeout behavior.");
        if (options.GenerateMocks)
            sb.AppendLine("Generate NSubstitute mocks for all external dependencies.");
        return sb.ToString();
    }
    private static string ExtractTestCode(string content) =>
        Regex.Match(content, @"```(?:csharp)?\n([\s\S]*?)```").Groups[1].Value.Trim();
    private static string[] ExtractTestMethodNames(string testCode) =>
        // Allow an optional async modifier and generic return types such as Task<T>
        Regex.Matches(testCode, @"\[(?:Fact|Theory|Test)\][\s\S]*?(?:public|private)\s+(?:async\s+)?[\w<>]+\s+(\w+)\s*\(")
             .Select(m => m.Groups[1].Value)
             .ToArray();
    private static TestCoverageEstimate EstimateCoverage(string source, string tests) =>
        new(
            MethodsCovered: CountMethodsInTests(source, tests),
            EstimatedLineCoverage: EstimateLineCoverage(source, tests),
            HasEdgeCases: tests.Contains("null") || tests.Contains("empty", StringComparison.OrdinalIgnoreCase),
            HasExceptionTests: tests.Contains("Assert.Throws") || tests.Contains("await Assert.ThrowsAsync"));
}

Part IV: Resilience Patterns — SRE Considerations for AI-Integrated Systems

The External Dependency Problem

Integrating OpenAI into a development pipeline or runtime system introduces an external dependency with distinct failure characteristics. Unlike internal services, you don’t control its availability, latency distribution, or rate limits. From an SRE perspective, you must design for:

Availability: OpenAI’s SLAs are not enterprise-grade guarantees. Your system’s critical paths cannot depend on OpenAI availability unless you’ve architected for graceful degradation.

Latency variance: GPT-4-class models can respond in 500ms or 15,000ms depending on load. Any synchronous dependency on these calls will cause tail latency issues in user-facing systems.

Rate limiting: Token and request quotas are hard limits. Exceeding them returns 429 errors that, if unhandled, cascade through your system.

Resilience Implementation with Polly

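// Infrastructure/AIResiliencePolicy.cs
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

public static class AIResiliencePolicy
{
    // Strategies execute in the order they are added, outermost first:
    // retry wraps the circuit breaker, which wraps the per-attempt timeout.
    public static ResiliencePipeline<CodeGenerationResult> Build(
        ILogger logger) =>
        new ResiliencePipelineBuilder<CodeGenerationResult>()
            .AddRetry(new RetryStrategyOptions<CodeGenerationResult>
            {
                MaxRetryAttempts = 3,
                Delay = TimeSpan.FromSeconds(1),
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
                // Only rate limits and transport errors are retried. An open circuit
                // (BrokenCircuitException) or a timed-out attempt is not, so both
                // surface to the caller immediately.
                ShouldHandle = new PredicateBuilder<CodeGenerationResult>()
                    .Handle<AIRateLimitException>()
                    .Handle<HttpRequestException>(),
                OnRetry = args =>
                {
                    logger.LogWarning(
                        "AI call retry {Attempt} after {Delay}ms",
                        args.AttemptNumber,
                        args.RetryDelay.TotalMilliseconds);
                    return ValueTask.CompletedTask;
                }
            })
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions<CodeGenerationResult>
            {
                // Open when at least half of 10+ calls fail within a 30-second window.
                FailureRatio = 0.5,
                SamplingDuration = TimeSpan.FromSeconds(30),
                MinimumThroughput = 10,
                BreakDuration = TimeSpan.FromSeconds(60),
                OnOpened = args =>
                {
                    logger.LogError("AI circuit breaker opened; falling back to non-AI path");
                    return ValueTask.CompletedTask;
                }
            })
            // Innermost strategy: each individual attempt gets its own 30-second budget.
            .AddTimeout(TimeSpan.FromSeconds(30))
            .Build();
}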

Fallback Strategy

When the AI service is unavailable, degrade gracefully rather than failing hard:

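// Fallback to template-based generation when AI is unavailable
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public class ResilientCodeGenerationService : ICodeGenerationService
{
    private readonly ICodeGenerationService _aiService;
    private readonly ITemplateCodeGenerationService _templateService;
    private readonly ResiliencePipeline<CodeGenerationResult> _policy;

    // Dependencies are injected so the AI path, the template path, and the
    // resilience pipeline can each be replaced independently in tests.
    public ResilientCodeGenerationService(
        ICodeGenerationService aiService,
        ITemplateCodeGenerationService templateService,
        ResiliencePipeline<CodeGenerationResult> policy)
    {
        _aiService = aiService;
        _templateService = templateService;
        _policy = policy;
    }

    public async Task<CodeGenerationResult> GenerateCodeAsync(
        CodeGenerationRequest request,
        CancellationToken cancellationToken = default)
    {
        try
        {
            return await _policy.ExecuteAsync(
                async ct => await _aiService.GenerateCodeAsync(request, ct),
                cancellationToken);
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open; fall back to deterministic template generation
            return await _templateService.GenerateAsync(request, cancellationToken);
        }
    }
}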

Part V: Observability — Measuring What Matters

The Metrics That Drive Decisions

For AI-powered development workflows, standard application metrics are necessary but insufficient. You need a second layer of AI-specific metrics:

Cost and efficiency metrics:

Token consumption per request (input vs. output)
Cost per code generation request (by model tier)
Cache hit rate (target: >40% for repeated patterns)
Token utilization efficiency (useful output tokens / total tokens billed)

Quality metrics:

Generated code acceptance rate (what percentage of AI suggestions developers keep without major modification)
Test pass rate for AI-generated tests on first run
Post-merge defect rate for AI-assisted code vs. manually written code
Code review comments per AI-generated PR

Reliability metrics:

API error rate (4xx, 5xx by type)
P50/P95/P99 latency for AI calls
Circuit breaker trip frequency
Fallback invocation rate

OpenTelemetry Integration

// Observability/AIMetricsCollector.cs
using System.Diagnostics;          // TagList
using System.Diagnostics.Metrics;  // Meter, Counter, Histogram, ObservableGauge

public class AIMetricsCollector
{
    private readonly Meter _meter;
    private readonly Counter<long> _requestCounter;
    private readonly Histogram<double> _latencyHistogram;
    private readonly Counter<long> _tokenCounter;
    private readonly Counter<long> _cacheHitCounter;
    private readonly ObservableGauge<double> _estimatedCostGauge;

    // RecordRequest can be called concurrently, so the running cost is guarded by a lock.
    private readonly object _costLock = new();
    private double _sessionCost;

    public AIMetricsCollector(IMeterFactory meterFactory)
    {
        _meter = meterFactory.Create("AICodeGeneration");
        _requestCounter = _meter.CreateCounter<long>(
            "ai.requests.total",
            description: "Total AI code generation requests");
        _latencyHistogram = _meter.CreateHistogram<double>(
            "ai.request.duration_ms",
            unit: "ms",
            description: "AI request latency distribution");
        _tokenCounter = _meter.CreateCounter<long>(
            "ai.tokens.consumed",
            description: "Total tokens consumed by model and type");
        _cacheHitCounter = _meter.CreateCounter<long>(
            "ai.cache.hits",
            description: "Cache hits avoiding API calls");
        _estimatedCostGauge = _meter.CreateObservableGauge<double>(
            "ai.session.estimated_cost_usd",
            () => { lock (_costLock) { return _sessionCost; } },
            description: "Estimated session cost in USD");
    }

    public void RecordRequest(
        string model, bool fromCache, double durationMs,
        int inputTokens, int outputTokens)
    {
        var tags = new TagList { { "model", model }, { "from_cache", fromCache } };
        _requestCounter.Add(1, tags);
        _latencyHistogram.Record(durationMs, new TagList { { "model", model } });

        if (fromCache)
        {
            // Cached responses consume no billable tokens, so only the hit is recorded.
            _cacheHitCounter.Add(1, new TagList { { "model", model } });
            return;
        }

        _tokenCounter.Add(inputTokens,
            new TagList { { "model", model }, { "type", "input" } });
        _tokenCounter.Add(outputTokens,
            new TagList { { "model", model }, { "type", "output" } });

        var cost = CalculateCost(model, inputTokens, outputTokens);
        lock (_costLock) { _sessionCost += cost; }
    }

    // Per-token rates derived from the per-1K prices in the comments; unknown models cost 0.
    private static double CalculateCost(string model, int input, int output) =>
        model switch
        {
            "gpt-4" => (input * 0.00003) + (output * 0.00006), // $0.03/$0.06 per 1K
            "gpt-4o" => (input * 0.000005) + (output * 0.000015), // $0.005/$0.015 per 1K
            _ => 0
        };
}

Part VI: Governance — Managing AI in Enterprise Environments

Security and Compliance Architecture

Code sent to external AI APIs leaves your infrastructure boundary. This has critical implications for enterprises operating under regulatory frameworks (SOC 2, HIPAA, PCI DSS, GDPR).

PII and secrets scrubbing. Implement a mandatory pre-processing layer that redacts API keys, connection strings, PII, and other sensitive data before prompt submission. This is non-negotiable:

using System.Text.RegularExpressions;

public class CodeSanitizer
{
    private static readonly Regex[] SensitivePatterns =
    [
        // key/value secrets such as password=..., apikey: "..."
        new(@"(password|secret|apikey|token)\s*[=:]\s*['""]?[\w\-\.]+['""]?",
            RegexOptions.IgnoreCase),
        new(@"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), // email
        new(@"\b\d{3}-\d{2}-\d{4}\b"), // SSN
        new(@"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"), // credit card, with or without separators
    ];

    // Applies every pattern in turn and replaces each match with a fixed marker.
    public string Sanitize(string code)
    {
        var result = code;
        foreach (var pattern in SensitivePatterns)
            result = pattern.Replace(result, "[REDACTED]");
        return result;
    }
}

Azure OpenAI vs. OpenAI API. Azure OpenAI offers regional data residency, Private Link support, and coverage under Microsoft’s enterprise data processing terms, which can simplify compliance review for organizations already on Azure. For those enterprise environments, Azure OpenAI is often the better architectural choice, not because of capability differences, but because of compliance posture.

Prompt Governance Framework

Uncontrolled prompt construction is both a security and quality risk. Implement centralized prompt management:

Prompt versioning. Treat prompts as first-class artifacts with versioning, testing, and rollback capability. A prompt change can alter output quality for thousands of developers. Store prompts in your configuration system with change history.

Output validation. Generated code should pass through a validation pipeline before surfacing to developers or being committed. At minimum, validate syntactic correctness (Roslyn compiler) and run static analysis. Consider a secondary AI review pass for critical generation paths.

Audit logging. Log all AI interactions with sufficient context to audit decisions, diagnose quality regressions, and demonstrate compliance. Structure logs to support cost attribution by team and project.

The Human-in-the-Loop Imperative

AI code generation must be positioned as augmentation, not replacement. Governance policies should reflect this:

Establish clear boundaries for where AI-generated code requires mandatory human review before merge. High-risk categories include authentication and authorization logic, cryptographic operations, data access and query construction, external API integrations, and infrastructure-as-code. For these categories, require that a reviewer explicitly affirms they have evaluated the generated code, rather than merely approving it because “it looked fine.”

Part VII: Strategic Integration — Building the AI-Native Development Culture

Shifting the Skill Profile

AI-assisted development shifts which skills are most valuable on engineering teams. The premium moves from “writing code quickly” to:

Requirements precision. The limiting factor for AI code quality is how precisely requirements are specified. Engineers who can decompose ambiguous problems into precise, testable specifications will extract dramatically more value from AI tools.

Critical evaluation. The ability to rapidly evaluate generated code for correctness, edge cases, security implications, and alignment with architectural patterns is more valuable than the ability to write the code from scratch.

Prompt engineering as a systems skill. Writing effective prompts for code generation is a learnable, improvable skill. Teams should invest in building shared prompt libraries, running experiments to measure output quality, and codifying effective patterns.

CI/CD Integration Architecture

AI code generation integrates most sustainably into the development workflow as a CI/CD-adjacent capability, not an ad-hoc interactive tool. Specific integration points:

Pre-commit test generation. Automatically generate test suggestions for changed files as part of the pre-commit hook. Surface these to developers for review rather than auto-committing — this drives quality without removing human judgment.

PR description and review assistance. Use AI to generate structured PR descriptions from diffs, flag potential issues, and suggest test coverage gaps. This accelerates review without replacing it.

Technical debt detection. Run AI analysis on modified code to flag technical debt patterns — missing error handling, lack of observability, unhandled async exceptions — as part of the CI pipeline. Surfacing these as advisory warnings (not blocking) builds awareness without creating friction.

Measuring Success and Iteration

Define success metrics before deploying AI tooling, and measure them rigorously:

Developer velocity metrics: cycle time (PR open to merge), time from requirement to passing tests, and PR iteration count. Watch for counterintuitive results — AI tooling sometimes increases PR iteration count initially as review standards rise.

Quality metrics: post-release defect rate for AI-assisted code, test coverage delta, and technical debt accumulation rate measured via static analysis trends.

Adoption and satisfaction metrics: active usage rate among the team, qualitative feedback on whether the tool is helping or hindering. Tools that developers circumvent provide zero value and erode trust in the initiative.

Instrument everything from day one. The data you collect in the first 90 days will determine whether you’re investing in the right workflows or building elaborate tooling that no one uses.

Conclusion

AI-powered code generation and testing represent a genuine structural shift in software development productivity — not incremental tooling improvement, but a change in the fundamental economics of building software. The teams that will succeed are those that approach this shift with the same architectural discipline, quality standards, and operational rigor they apply to any production system.

The core principles for sustainable AI integration in .NET environments: treat AI as an external dependency with real reliability characteristics; invest in prompt engineering as a first-class engineering discipline; maintain the human judgment layer for high-risk generation paths; instrument everything to drive continuous improvement; and position experienced engineers as force multipliers, not replacements.

The technology will continue to improve rapidly. The architectural patterns, governance frameworks, and organizational disciplines described in this guide will remain relevant even as the underlying models evolve. Build the foundation right, and you’ll be positioned to capture successive waves of improvement as they arrive.

Leading Through Technical Crisis: A Staff Engineer’s Guide to Architecture, Resilience, and…

Ali Suleyman TOPUZ — Sat, 21 Mar 2026 18:47:11 +0000

Leading Through Technical Crisis: A Staff Engineer’s Guide to Architecture, Resilience, and Strategic Decision-Making

When systems fail, it’s not just your code that gets tested — it’s your judgment, your leadership, and your identity as an engineer.

There’s a particular kind of silence that falls over an engineering team when something goes catastrophically wrong in production. Slack channels that were humming with chatter thirty minutes ago suddenly fill with terse messages. Dashboards turn red. Someone types “I think it’s the database” and nobody laughs at the Occam’s razor of that statement because maybe — just maybe — it actually is the database.

I’ve been in that silence more times than I’d like to admit. And what I’ve come to understand, through hard-won experience, is that technical crises are not primarily engineering problems. They are leadership problems that happen to have engineering solutions. The distinction matters enormously, because the engineers who thrive in crisis aren’t necessarily the ones who write the cleanest code or know the most about distributed systems theory. They’re the ones who can hold their cognitive composure, orient a team under pressure, and make irreversible decisions with incomplete information — all while simultaneously debugging a distributed system that is actively lying to them.

This is what it means to lead through technical crisis.

What We Mean When We Say “Crisis”

Before we talk about strategy, it’s worth being precise about what a technical crisis actually is — because engineering culture often conflates “incident” with “crisis,” and that blurry boundary has real consequences for how teams respond.

A production incident is a system behaving outside expected parameters. A technical crisis is something more specific: it’s the moment when a failure’s business impact exceeds your team’s standard operating procedures. The system isn’t just down. Multiple things are failing in ways that amplify each other. The root cause isn’t clear. Recovery time is approaching or has already exceeded what your stakeholders can tolerate. The on-call runbook doesn’t have an entry for this particular flavor of disaster.

This distinction matters because crises demand leadership intervention — not just technical execution. They require someone to be simultaneously debugging the system, managing upward communication, deciding which recovery path to pursue, allocating team resources, and maintaining the psychological safety of an exhausted team. These are not engineering tasks. They are organizational tasks that require an engineer to perform them.

The modern software landscape makes crises more likely, not less. The migration toward microservices, distributed databases, third-party LLM integrations, and cloud infrastructure creates systems of staggering complexity — systems where failures in one component cascade unpredictably into failures in components that had no business being affected. When your authentication service starts dropping 40% of requests, the knock-on effects through a system of 50 microservices can be nearly impossible to trace in real time. Understanding why your system behaves this way under stress, and having the architectural and psychological tools to respond, is the difference between a two-hour recovery and a twelve-hour death march.

The Psychological Architecture of Crisis Leadership

Here’s something engineering culture is reluctant to discuss: crisis response is a stress-mediated cognitive activity, and stress degrades exactly the cognitive functions you need most.

When a production incident reaches crisis level, your body enters a state of heightened arousal. Your prefrontal cortex — the part responsible for working memory, risk assessment, and flexible decision-making — starts operating at reduced capacity. Your thinking becomes more rigid. You anchor harder on the first hypothesis that sounds plausible. You become worse at integrating information from multiple sources. Your time horizon collapses; the next five minutes feel more real than the next five hours.

None of this is a character flaw. It’s biology. What separates experienced crisis leaders isn’t immunity to this response — it’s having built systems to compensate for it.

The most important system is a shared mental model. When every engineer on your incident response team has a common understanding of how your architecture behaves under stress, you reduce the cognitive load required to diagnose problems. You’re not rebuilding your understanding of the system from scratch under pressure; you’re pattern-matching against a map you’ve studied in calmer moments. This is why architectural review, game days, and postmortem culture aren’t just “nice to haves” — they are investments in your team’s cognitive infrastructure for exactly the moments when it will be most strained.

The second system is vocabulary. When I tell my team “implement a circuit breaker on the LLM client,” there’s no ambiguity. We’ve talked about circuit breakers before. Everyone knows what it means, what it does, and roughly how to implement it. The design pattern has become a shorthand that bypasses the need for lengthy explanation under pressure. This is one of the least-discussed values of standardizing on architectural patterns across a codebase: not elegance, not theoretical purity, but shared vocabulary for crisis moments.

The third system is clear role definition. During crisis, role ambiguity is catastrophic. Someone needs to be in charge of diagnosis. Someone needs to own stakeholder communication. Someone needs to have authority over deployment decisions. These roles don’t need to be formal or permanent — they can be assumed situationally — but they need to be explicit. Ambiguity about who’s driving creates the worst possible outcome: multiple engineers pulling in different directions, each acting on a different theory of the failure, each second-guessing the others.

Architectural Decisions as Crisis Prevention

The most powerful thing a Staff Engineer can do for crisis management is work that happens months or years before any incident occurs. Architectural decisions made during the calm of normal development either create or close off options during crisis. This is not metaphorical — it’s literally true that the design patterns chosen in sprint planning meetings determine whether an incident responder has a clean lever to pull at 2 AM.

Consider the difference between two codebases. In the first, LLM provider API clients are instantiated in a dozen different places throughout the service layer, each with slightly different configuration, each with ad hoc retry logic, each failing in its own idiosyncratic way. In the second, all LLM client creation flows through a single factory, which checks provider health status, enforces rate limits, and integrates circuit breaker logic.

When OpenAI returns 503s at 2 AM on a Friday, these two codebases have dramatically different failure modes. In the first, you’re hunting through scattered instantiation points trying to understand why some requests are failing and others aren’t, manually patching retry logic in multiple places. In the second, you have a single choke point where you can emergency-throttle requests, swap to a fallback provider, or disable a misbehaving model across fifty service instances with a configuration change.

This is the Factory pattern at work. It seems like an engineering best practice with modest operational benefits in normal conditions. It becomes a survival tool in crisis conditions. The same logic applies across the architectural patterns that define resilient systems.
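
To make the second codebase concrete, here is a minimal sketch of such a factory. Everything in it is hypothetical: `LLMClientFactory`, `ProviderConfig`, and the string "clients" are stand-ins I made up for illustration, not any real SDK. The point is only that every caller goes through `create`, so one switch changes behavior everywhere.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ProviderConfig:
    """Switches that can be flipped from configuration during an incident."""
    builder: Callable[[], object]
    enabled: bool = True


class LLMClientFactory:
    """Single choke point for LLM client creation (illustrative names only)."""

    def __init__(self) -> None:
        self._providers: Dict[str, ProviderConfig] = {}

    def register(self, name: str, builder: Callable[[], object]) -> None:
        self._providers[name] = ProviderConfig(builder)

    def set_enabled(self, name: str, enabled: bool) -> None:
        # Emergency lever: applies to every subsequent create() call.
        self._providers[name].enabled = enabled

    def create(self, preferred: str, fallback: Optional[str] = None) -> object:
        # Try the preferred provider first, then the fallback if one was given.
        for name in (preferred, fallback):
            if name is None:
                continue
            provider = self._providers.get(name)
            if provider is not None and provider.enabled:
                return provider.builder()
        raise RuntimeError("no enabled LLM provider available")


factory = LLMClientFactory()
factory.register("openai", lambda: "openai-client")        # stand-in for a real client
factory.register("anthropic", lambda: "anthropic-client")  # stand-in for a real client

factory.set_enabled("openai", False)  # the 2 AM configuration change
client = factory.create("openai", fallback="anthropic")
```

A real factory would also carry health checks and rate limits, as described above. I left them out to keep the control surface visible.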

The Patterns That Save You

Not all design patterns are created equal in crisis scenarios. Some patterns are primarily about code organization or developer experience. Others are genuinely load-bearing under production stress. The ones that matter most are those that provide control surface — meaningful levers that an incident responder can pull to change system behavior without a code deployment.

Circuit Breakers are the most important resilience pattern that teams consistently under-implement. A circuit breaker wraps calls to an external dependency — a third-party API, a downstream service, a database — and tracks failure rates. When failures exceed a configurable threshold, the circuit “trips” and subsequent calls fail immediately rather than waiting for timeout. This prevents cascading failures: when your payment processor is struggling, your checkout service stops accumulating threads waiting for responses that won’t come in time, which prevents your checkout service from taking down your recommendation service, which prevents your recommendation service from degrading your homepage. The cascade stops at the first circuit break rather than propagating through the entire system.

The subtlety that teams miss is that circuit breakers need to be tunable in production. Thresholds that make sense under normal load may be wrong under crisis conditions. Half-open states need monitoring. This means your circuit breaker implementation needs observability baked in — not as an afterthought, but as a first-class feature.
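
Here is a small sketch of what "tunable and observable" can look like. It is a simplification under stated assumptions: it counts consecutive failures rather than a failure rate over a window, it is not thread-safe, and `CircuitBreaker`, `CircuitOpenError`, and the `on_state_change` hook are names I chose for illustration. In production you would more likely reach for an established resilience library.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
        on_state_change: Optional[Callable[[str, str], None]] = None,
    ) -> None:
        # Public attributes so they can be retuned at runtime from configuration.
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.state = "closed"
        self._clock = clock
        self._on_state_change = on_state_change or (lambda old, new: None)
        self._failures = 0
        self._opened_at = 0.0

    def _set_state(self, new_state: str) -> None:
        if new_state != self.state:
            old_state, self.state = self.state, new_state
            self._on_state_change(old_state, new_state)  # observability hook

    def call(self, fn: Callable[[], object]) -> object:
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            self._set_state("half_open")  # let one trial call through
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
                self._set_state("open")
            raise
        self._failures = 0
        self._set_state("closed")
        return result
```

The hook is where you would emit a metric or a structured log line, so that every trip and every half-open probe shows up on a dashboard instead of being inferred after the fact.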

The Strategy Pattern solves a different crisis problem: what happens when you need to change fundamental behavior without deploying code? If your primary LLM provider starts returning errors and you’ve hard-coded provider-specific logic throughout your service layer, switching to a fallback provider requires code changes, testing, and deployment — all under pressure, all with elevated risk of introducing new bugs. If you’ve encapsulated provider-specific behavior behind a strategy interface, you can swap implementations at runtime, driven by configuration or health check signals, with no deployment required.

This pattern is particularly powerful in the current LLM landscape, where providers have meaningfully different rate limits, latency characteristics, and failure modes. A well-designed strategy implementation lets you route to OpenAI under normal conditions, fall back to Anthropic when OpenAI rate limits, and fall back to a local model when both are degraded — all transparently, all without the calling code knowing which provider is active.
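
A rough sketch of that routing, with the usual caveats: `CompletionStrategy`, `StubProvider`, and `ProviderRouter` are hypothetical names, and the stubs stand in for real provider clients. The calling code only ever sees `router.complete`, and the fallback order is a plain list that configuration or a health check could replace at runtime.

```python
from typing import Dict, List, Protocol


class CompletionStrategy(Protocol):
    """Provider-specific behavior hidden behind one interface."""

    def complete(self, prompt: str) -> str: ...


class StubProvider:
    """Stand-in for a real provider client; purely illustrative."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def complete(self, prompt: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        return f"[{self.name}] {prompt}"


class ProviderRouter:
    def __init__(self, strategies: Dict[str, CompletionStrategy], order: List[str]) -> None:
        self._strategies = strategies
        self.order = list(order)  # replaceable at runtime, no deployment needed

    def complete(self, prompt: str) -> str:
        errors = []
        for name in self.order:
            try:
                return self._strategies[name].complete(prompt)
            except Exception as exc:  # a real router would catch narrower errors
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


router = ProviderRouter(
    {
        "openai": StubProvider("openai", healthy=False),  # simulate an outage
        "anthropic": StubProvider("anthropic"),
        "local": StubProvider("local"),
    },
    order=["openai", "anthropic", "local"],
)
print(router.complete("summarize the incident"))  # served by the anthropic stub
```

Reassigning `router.order` is the runtime swap: the callers do not change, only the list does.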

Bulkhead Patterns apply the naval engineering insight that isolating compartments prevents a single breach from sinking the ship. In software, this means isolating resource pools — thread pools, connection pools, memory allocations — so that degradation in one area of the system cannot consume resources needed by other areas. Your LLM inference requests should not be competing for the same thread pool as your critical path authentication logic. Your batch processing jobs should have connection pool limits that prevent them from starving real-time user requests of database connections. These boundaries feel over-engineered until the moment they’re the only thing standing between you and a total system failure.
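
In its simplest form, a bulkhead is a concurrency cap per compartment. The sketch below uses a semaphore and rejects work immediately when a compartment is full. The `Bulkhead` class, the compartment names, and the limits are assumptions for illustration; real limits should come from load testing.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(RuntimeError):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Reject instead of queueing, so a saturated compartment cannot pile up work.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


# Separate compartments: slow LLM calls cannot consume auth capacity.
llm_bulkhead = Bulkhead("llm-inference", max_concurrent=2)
auth_bulkhead = Bulkhead("auth", max_concurrent=50)


def handle_login(check_credentials):
    with auth_bulkhead.slot():
        return check_credentials()
```

Rejecting rather than waiting is a deliberate choice here: it turns a slow dependency into fast, visible failures in one compartment while the others keep their capacity.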

The CAP Theorem Isn’t Academic

Every engineer has encountered the CAP theorem in a whiteboard interview or a distributed systems course. Fewer have encountered it at 3 AM while a distributed database is exhibiting split-brain behavior and business stakeholders are sending messages with increasing numbers of exclamation points.

The theorem states that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance. Network partitions are not theoretical (they happen, in production, to everyone), so partition tolerance is not optional, and the real choice is what to give up while a partition lasts: will your system favor consistency (potentially refusing requests to avoid serving stale or conflicting data) or availability (serving requests even when you cannot guarantee the data is current)?

This is not an implementation detail. It is a product decision with business consequences, and it needs to be made explicitly and understood widely before an incident occurs. An e-commerce system that favors consistency might refuse to process orders during a partition event, showing users a 503 and losing revenue. A system that favors availability might process orders against stale inventory data, causing overselling that requires expensive manual reconciliation. Neither choice is wrong — but the choice needs to be deliberate, documented, and understood by the people who will be making recovery decisions.
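
One way to make the choice deliberate is to encode it as a named policy rather than leave it implicit in whatever the database driver happens to do. The sketch below is hypothetical throughout (`PartitionPolicy`, `OrderResult`, and `place_order` are illustrative names), and it assumes the service can tell whether it currently has an authoritative view of inventory.

```python
from dataclasses import dataclass
from enum import Enum


class PartitionPolicy(Enum):
    FAVOR_CONSISTENCY = "favor_consistency"
    FAVOR_AVAILABILITY = "favor_availability"


@dataclass
class OrderResult:
    accepted: bool
    needs_reconciliation: bool
    reason: str


def place_order(inventory_is_authoritative: bool, policy: PartitionPolicy) -> OrderResult:
    if inventory_is_authoritative:
        # No partition affecting this request: both properties hold.
        return OrderResult(True, False, "inventory confirmed")
    if policy is PartitionPolicy.FAVOR_CONSISTENCY:
        # Refuse rather than risk overselling; the user sees an error.
        return OrderResult(False, False, "inventory unverifiable during partition")
    # Accept against possibly stale data and flag the order for later reconciliation.
    return OrderResult(True, True, "accepted against possibly stale inventory")
```

The value is less in the code than in the fact that the policy has a name, lives in one place, and can be pointed at during an incident.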

The crisis leadership failure mode here is discovering, during an incident, that nobody on the team knows what the intended behavior is. Different engineers have different intuitions. Some want to restore consistency as quickly as possible; others are focused on bringing availability back online. Without explicit architectural decisions to anchor the conversation, teams waste precious recovery time relitigating product decisions under pressure.

The Staff Engineer’s job is to ensure that these trade-offs are discussed, decided, and written down before they’re relevant. The postmortem is too late.

Observability: Seeing Before It Hurts

There’s a meaningful difference between monitoring and observability, and understanding it is essential for crisis prevention. Monitoring tells you when things have gone wrong. Observability tells you why they went wrong and — ideally — gives you signals before they go catastrophically wrong.

Traditional monitoring is threshold-based: CPU above 80%, response time above 500ms, error rate above 1%. These metrics have their place, but they suffer from a fundamental limitation: they measure symptoms, not causes, and they measure them after the fact. By the time your error rate crosses 1%, customers have been experiencing failures for some time already.

Observability, as a practice, means building your systems so that their internal state can be interrogated through their external outputs. This requires three types of telemetry working in concert: metrics (the quantitative health indicators that tell you something is wrong), logs (the contextual records that tell you what was happening when things went wrong), and distributed traces (the cross-service journey maps that show you how a failure propagated through your system).

The implementation detail that makes observability useful in crisis scenarios is correlation. When a customer reports a failed checkout at 14:23:47, you need to be able to trace that specific request across six services, correlating it with the database query that took 3 seconds, the downstream API call that returned a transient error, and the retry logic that eventually exhausted its budget. Without correlation identifiers threading through your logs and traces, you have data but not information.

This is where structured logging — logging that emits machine-parseable records with consistent field names rather than free-text strings — pays enormous dividends. During crisis, the speed at which you can answer “what was happening in service X at time T for correlation ID Y” determines how fast you can form and test hypotheses. Structured logs, indexed and searchable, let you answer that question in seconds rather than minutes. Every minute saved in diagnosis is a minute saved in recovery.
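
As a minimal sketch of the idea, using only Python's standard library: a formatter that emits one JSON object per log line and stamps each record with a correlation ID held in a context variable. The field names (`correlation_id`, `event`, `duration_ms`) and the `build_logger` helper are my own choices for illustration, not a standard schema.

```python
import contextvars
import json
import logging
import sys

# Set once per incoming request; every log call in that context picks it up.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit machine-parseable records with consistent field names."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "correlation_id": correlation_id.get(),
            "event": record.getMessage(),
        }
        # Extra structured fields are passed as extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout", sys.stdout)
correlation_id.set("req-7f3a")  # hypothetical ID taken from the incoming request
log.info("db_query_slow", extra={"fields": {"duration_ms": 3000}})
```

With records shaped like this, "service X, time T, correlation ID Y" becomes a filter on three indexed fields rather than a text search.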

Equally important is building observability that surfaces signals before they become crises. This requires moving beyond reactive monitoring toward leading indicators: not just measuring error rates, but tracking the rate of change of error rates. Not just watching queue depth, but measuring how queue depth correlates with upstream request volume. Not just alerting on database connection pool exhaustion, but alerting when you’re at 70% of pool capacity and trending up. The goal is a system that gives you a ten-minute warning, not a ten-second one.
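
The connection pool example can be sketched in a few lines. This is deliberately crude: it uses the average change between the first and last sample as the trend, where a real alerting system would use a proper rate function over a time window. The function names and the 70% default are illustrative assumptions.

```python
from typing import Optional, Sequence


def trend_per_sample(samples: Sequence[float]) -> float:
    """Average change between consecutive samples (a crude slope)."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def pool_early_warning(in_use: Sequence[int], capacity: int, warn_ratio: float = 0.7) -> bool:
    """Warn when utilization has reached warn_ratio AND is still rising."""
    if not in_use:
        return False
    utilization = in_use[-1] / capacity
    return utilization >= warn_ratio and trend_per_sample(in_use) > 0


def samples_until_exhaustion(in_use: Sequence[int], capacity: int) -> Optional[float]:
    """Estimate how many sampling intervals remain before the pool is full."""
    slope = trend_per_sample(in_use)
    if slope <= 0:
        return None  # flat or falling: no projected exhaustion
    return (capacity - in_use[-1]) / slope
```

For example, with a capacity of 100 and samples of 50, 60, and 72 connections in use, the warning fires: utilization is 72% and rising. With samples of 80, 75, and 72 it stays quiet, because utilization is the same but the trend is downward.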

Making Decisions With Incomplete Information

Here’s the hardest truth about crisis leadership: you will never have enough information to be certain your decision is right, and you will have to make the decision anyway.

This sits badly with engineers. Our discipline trains us to be precise, to gather data before drawing conclusions, to avoid premature optimization. These instincts serve us well in normal development. In crisis, they can become pathological. The engineer who waits for certainty before acting is the engineer who lets a recoverable incident become an existential one.

The mental model that helps me most is thinking about reversibility. Every decision during crisis can be placed somewhere on a spectrum from fully reversible to fully irreversible. Enabling a feature flag is reversible. Disabling a service to stop cascading failures is mostly reversible. Deleting data is irreversible. Rolling back a database migration might be irreversible, or nearly so.

For reversible decisions, speed matters more than certainty. Make the call, observe the results, adjust. The cost of being wrong and reversing is low; the cost of hesitation is high. For irreversible decisions, the calculus flips. Take the time you need to build confidence. Get a second opinion. Document your reasoning. The cost of being wrong and unable to reverse is potentially catastrophic.

This framework also helps with a common crisis mistake: spending too long diagnosing when you should be mitigating. If you have a reversible mitigation available — rolling back a deployment, disabling a feature, routing traffic to a healthy region — there’s often wisdom in taking that action before you fully understand the cause. Stop the bleeding, then diagnose. The pressure to understand before acting comes from a place of intellectual honesty, which is admirable, but it can extend customer impact unnecessarily. You can always do a thorough postmortem investigation once the system is stable.

Communication as a Technical Skill

One of the most persistent misconceptions about technical crisis leadership is that communication is a soft skill layered on top of the hard technical work. This is wrong. In crisis conditions, communication is the hard technical work.

The reason is that crisis recovery almost always involves multiple stakeholders with fundamentally different information needs and tolerance for uncertainty. Your engineers need detailed technical context: what’s failing, what’s been tried, what hypotheses are being tested. Your product leadership needs business impact framing: how many users are affected, what functionality is degraded, when will it be resolved. Your executive team needs confidence that the situation is under control and clear expectations about timeline. Each audience requires a different translation of the same underlying reality.

Managing this translation, while simultaneously contributing to technical diagnosis and recovery, is a profound cognitive challenge. The leaders who do it well have developed a set of practices that reduce the cognitive overhead: standardized status update templates that can be filled in quickly, a designated communications lead for large incidents who is distinct from the technical lead, and an explicit commitment to updating stakeholders on a regular cadence (every 30 minutes, say) regardless of whether there’s new information. The last point matters enormously — silence in a crisis is interpreted as incompetence or dishonesty, even when the reality is simply that diagnosis is still underway.

The other communication challenge is upward translation of technical debt and systemic risk. Crises are often symptoms of accumulated technical debt that was deprioritized in favor of feature development. After the immediate fire is out, the Staff Engineer or Principal Engineer who led the recovery has an opportunity — and arguably a responsibility — to translate what happened into business-language justification for the investment required to prevent recurrence. This is not complaining about technical debt. It’s strategic communication: connecting architectural risk to business outcomes in a language that enables leadership to make informed investment decisions.

The Postmortem: Making Crises Count

Every major incident contains the seeds of organizational improvement. Whether those seeds are cultivated or left to decay depends almost entirely on the quality of your postmortem practice.

A good postmortem is not a blame exercise. This sounds obvious and is, in practice, genuinely difficult to maintain, because the natural human instinct in the wake of failure is to identify who made the mistake. Blame-oriented postmortems are worse than useless: they cause engineers to be defensive rather than transparent, which systematically degrades the quality of the causal analysis, which means the same failure modes recur.

Blameless postmortems — the approach championed by Google’s SRE culture and widely adopted across the industry — operate on the premise that competent engineers working within a given system will make the decisions that the system makes it natural to make. When something goes wrong, the correct question is not “who made the mistake?” but “what was it about our system, our processes, or our information environment that made this mistake the natural thing to do?” The answer almost always points to something actionable: a missing validation, an ambiguous runbook, an alert that fires too late, an architectural assumption that turned out to be wrong.

The output of a good postmortem is a set of concrete, time-bounded action items with clear ownership. Not “we should improve our monitoring” — that’s a sentiment, not a commitment. Instead: “By March 15th, we will add latency percentile alerting to the checkout service, owned by the platform team.” The specificity is what converts learning into change.

Over time, postmortem culture compounds. Each incident investigation builds institutional knowledge about how your system fails. Each action item either reduces the likelihood of similar failures or improves your response capability when they occur. The organizations that emerge from crises stronger are the ones that treat each incident not as an embarrassment to be put behind them as quickly as possible, but as a funded investment in organizational learning.

The Long Game: Building Resilient Teams

Technical resilience and team resilience are not separate concerns. The systems that survive crises are built and maintained by teams that have learned how to function under pressure — and learning to function under pressure requires practice in conditions that are stressful but survivable.

This is the rationale for game days and chaos engineering practices: deliberately inducing controlled failures to give teams practice at the cognitive and operational skills that crises demand, in conditions where the cost of imperfect performance is low. Teams that have run game days together develop the shared mental models, communication rhythms, and role clarity that make them dramatically more effective in real incidents. The first time your team practices diagnosing a simulated database failure should not be during a real one.

Beyond operational practice, resilient teams share a cultural characteristic that is harder to engineer but equally important: psychological safety. Teams where engineers are afraid of blame, afraid of looking incompetent, or afraid of surfacing bad news are teams that are slow to escalate, slow to admit uncertainty, and slow to call for help. These behaviors make crises worse. The investment in building a culture where engineers feel safe saying “I don’t know,” “I was wrong,” or “I need help” pays compounding returns in crisis performance.

As a Staff or Principal Engineer, you have more influence over this culture than you might think. The way you respond when a junior engineer reports a problem they created. The way you talk about your own mistakes in postmortems. The way you ask questions in code review. These behaviors model what the culture values, and the culture you model is the culture your team inherits.

Closing: The Crisis That Made You

There’s a version of this essay that frames technical crisis management as a domain of pure competence — a set of techniques to be mastered and deployed. That version is incomplete.

The deeper truth is that crises are formative experiences that fundamentally shape engineering leaders. They reveal what you actually believe about trade-offs, about people, about how systems should be built. They expose the gap between the engineer you aspire to be and the engineer you are under pressure. And they offer — if you’re willing to sit with the discomfort long enough — a clear view of what you need to develop.

The leaders who emerge from major incidents wiser and more capable are almost never the ones who performed perfectly. They’re the ones who paid attention: to what they got right, to what they missed, to where their team struggled, and to what the system was trying to tell them about its own design. They treated the crisis not as a problem to be survived but as information to be processed.

Build the systems that let you see clearly when things are going wrong. Build the architectural patterns that give you control surface when they do. Build the team culture that enables honest, fast response. And when the crisis comes — and it will — remember that your job is not to be the hero who single-handedly restores service. It’s to be the leader who brings clarity to chaos, confidence to uncertainty, and learning to failure.

That’s what it means to lead through technical crisis.

If this resonated with you, I’d love to hear about your own experiences leading through production incidents. What patterns have saved you? What failures taught you the most? The conversation in the comments is often richer than the article.

Accessibility-First Software Engineering: Building Inclusive Systems from the Ground Up

Ali Suleyman TOPUZ — Sat, 21 Mar 2026 18:46:49 +0000

This article explores the architectural and cultural foundations of accessibility-first software engineering. The technical patterns discussed — including accessibility context propagation, multi-modal API design, and accessibility observability — represent practical starting points for teams beginning this work. WCAG 2.2 remains the primary technical reference standard, and hands-on testing with actual assistive technologies should accompany any automated tooling strategy.

What if the way we build software has been wrong this whole time — not morally wrong, but architecturally wrong?

There is a story engineers love to tell about curb cuts. When cities started installing those small concrete ramps at sidewalk corners to help wheelchair users navigate urban streets, something unexpected happened. Everyone started using them — parents pushing strollers, delivery workers hauling hand trucks, travelers dragging rolling luggage, cyclists, skaters, elderly pedestrians with canes. A design decision made for one specific group of people turned out to make life measurably better for nearly everyone.

Software has been ignoring this lesson for decades.

Accessibility in most engineering organizations is treated like a compliance obligation: a list of legal checkboxes to satisfy after the “real” product work is done. Teams ship features, build architectures, make foundational decisions, and then, somewhere near the end of a sprint cycle or just before a launch deadline, someone says: we should probably make this accessible. At that point, the work becomes far harder, far more expensive, and almost always incomplete.

Accessibility-first software engineering is the argument that this sequence is backwards. It is the practice of treating accessibility not as a feature layer but as an architectural constraint — one that belongs at the very beginning of design, right alongside performance, scalability, and security. And when you actually build software this way, something remarkable happens: you end up with better software, period.

The Scale of the Problem We Are Ignoring

Before getting into architecture and implementation, it is worth sitting with the numbers for a moment, because they have a way of reframing what “edge case” means.

The World Health Organization estimates that approximately 1.3 billion people — roughly 16% of the global population — experience significant disability.

But that figure captures only a fraction of the people affected by inaccessible software. Expand the lens to include temporary disabilities (a broken wrist, post-surgery recovery, an eye infection), situational limitations (using a phone in direct sunlight, operating a laptop while riding a bus, watching a video in a noisy open office), and age-related changes that affect virtually every person who lives long enough, and you are no longer talking about a minority use case. You are talking about every user, at some point in their relationship with your product.

WebAIM publishes an annual analysis of the home pages of the top one million websites. Year after year, that analysis has found detectable WCAG 2 failures on roughly 95% or more of those home pages. Nearly every major site on the internet is, to some degree, broken for a significant portion of its users. This is not a niche problem. It is a systemic failure embedded in how the industry thinks about and practices software development.

The business case is real too, though it probably should not be the primary motivation. People with disabilities represent substantial purchasing power — estimates place discretionary income for working-age people with disabilities in the United States alone at around $21 billion annually. The global figure, including older adults and people experiencing temporary limitations, stretches into the trillions. Companies building genuinely accessible products are not just being ethical. They are also accessing a massively underserved market that their competitors are actively excluding.

But if the ethical argument is sufficient — and it should be — let us start there. Millions of people are unable to fully use software that has become essential infrastructure for modern life: job applications, healthcare portals, banking systems, government services, educational platforms. Inaccessibility in software is not a minor inconvenience. For many people, it is a barrier to participation.

Why Retrofitting Never Works (And Why We Keep Trying)

Most accessibility work happens as retrofitting. A product ships. Legal risk appears — or worse, a lawsuit arrives. Engineers are tasked with making the existing system accessible. They work hard, they make improvements, but the result is almost never fully accessible because the foundational decisions have already been made and cannot be easily undone.

The Department of Homeland Security’s Trusted Tester program has done work estimating the cost differential between addressing accessibility issues at different stages of the development lifecycle. Their findings point toward roughly a 10x cost multiplier: problems caught and fixed during the design phase cost about one-tenth of what they cost to fix once a product is in production. When you factor in the potential legal exposure — website accessibility lawsuits have historically settled in ranges from $35,000 to $75,000 before accounting for legal fees and remediation costs — the financial argument for building correctly from the start becomes overwhelming.

But the problem runs deeper than cost. Retrofitting accessibility onto a poorly architected system often produces what might be called “accessibility theater” — the appearance of compliance without the reality of usable experience. You can add ARIA labels to semantically meaningless markup. You can make a keyboard focus indicator visible without making the keyboard navigation logical. You can include alt text on images while the surrounding interactive components remain completely opaque to screen readers. These patches satisfy automated scanning tools without actually serving the people who need accessible software.

The reason retrofitting produces these hollow results is that genuine accessibility requires architectural decisions. It requires that your application expose meaningful state information. It requires that your interaction models support multiple input modalities from the ground up. It requires that your data layer returns everything needed to construct accessible interfaces. These are not features that can be sprinkled on top of an existing system. They have to be designed in.

What Accessibility-First Architecture Actually Means

Shifting to accessibility-first development is not primarily about learning a longer checklist or hiring specialists to review finished work. It is about a set of architectural principles that shape decisions from the very beginning.

Accessibility context must propagate through the system: In traditional layered architectures, user preferences and accessibility metadata get dropped as requests cross service boundaries. A user who depends on reduced motion preferences, or who uses a screen reader that requires different response structures, or who needs extended session timeouts because motor impairments slow their interactions — that context needs to travel through every layer of the system. This is an explicit architectural requirement, not an implementation detail. The overhead is real but minimal, typically a few hundred bytes per request. The impact on users who depend on it is substantial.
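To make the propagation idea concrete, here is a minimal sketch of an accessibility context that can be read from inbound headers and written onto outbound calls. The record shape and the `X-A11y-*` header names are assumptions made for illustration; they are not a standard, and a real system would agree on its own contract.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical preference set; the field names are illustrative, not a standard.
public sealed record AccessibilityContext(
    bool ReducedMotion,
    bool ScreenReader,
    double TimeoutMultiplier)
{
    // Hypothetical header names chosen for this sketch.
    public const string MotionHeader = "X-A11y-Reduced-Motion";
    public const string ReaderHeader = "X-A11y-Screen-Reader";
    public const string TimeoutHeader = "X-A11y-Timeout-Multiplier";

    public static readonly AccessibilityContext Default = new(false, false, 1.0);

    // Read the context from inbound headers, falling back to defaults.
    public static AccessibilityContext FromHeaders(IReadOnlyDictionary<string, string> headers)
    {
        bool Flag(string name) =>
            headers.TryGetValue(name, out var v) && bool.TryParse(v, out var b) && b;

        double multiplier =
            headers.TryGetValue(TimeoutHeader, out var raw) &&
            double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out var m) &&
            m >= 1.0
                ? m
                : 1.0;

        return new AccessibilityContext(Flag(MotionHeader), Flag(ReaderHeader), multiplier);
    }

    // Write the context onto an outbound request so the next service sees it.
    public Dictionary<string, string> ToHeaders() => new()
    {
        [MotionHeader] = ReducedMotion.ToString(),
        [ReaderHeader] = ScreenReader.ToString(),
        [TimeoutHeader] = TimeoutMultiplier.ToString(CultureInfo.InvariantCulture),
    };
}
```

Each service would call `FromHeaders` on the way in and `ToHeaders` on every downstream call, so the context survives service boundaries instead of being dropped at the first hop.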

Asynchronous patterns are not just a scalability optimization — they are an accessibility requirement: Tightly coupled, synchronous architectures create barriers for users of assistive technologies that operate on different temporal scales than standard input devices. A switch control user navigating a form may take 30 to 60 seconds to complete interactions that a mouse user handles in under five. An architecture that treats this as a timeout scenario rather than a legitimate usage pattern is an architecture that excludes this user entirely. Event-driven designs, CQRS patterns, and saga patterns that accommodate extended interaction timescales are not just engineering elegance — they are what allow real users to complete real tasks.
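One small, concrete consequence of this principle is that timeouts become a function of the user's context rather than a hard-coded constant. The sketch below assumes a per-user timeout multiplier (for example, carried in the accessibility context) and a 30-minute ceiling; both the multiplier and the ceiling are illustrative policy choices, not recommendations.

```csharp
using System;

public static class InteractionTimeouts
{
    // Upper bound so a malformed multiplier cannot hold resources indefinitely (assumed policy).
    private static readonly TimeSpan MaxTimeout = TimeSpan.FromMinutes(30);

    // Scales the base timeout by the user's multiplier, never going below the base value.
    public static TimeSpan Effective(TimeSpan baseTimeout, double multiplier)
    {
        if (double.IsNaN(multiplier) || multiplier < 1.0) multiplier = 1.0;

        double scaledTicks = baseTimeout.Ticks * multiplier;
        double cappedTicks = Math.Min(scaledTicks, MaxTimeout.Ticks);
        return TimeSpan.FromTicks((long)cappedTicks);
    }
}
```

With these assumptions, a 60-second form step with a multiplier of 3 gets 180 seconds, while a 20-minute session with the same multiplier is capped at 30 minutes.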

API contracts need to be designed for multi-modal interfaces from day one: This is a point that backend engineers sometimes resist, feeling that accessibility is a frontend concern. It is not. An image resource endpoint that returns URLs without alt text forces frontend developers to invent accessibility metadata that should live in the data layer. An error response that returns generic 500 status codes with no semantic content makes it impossible to build screen-reader-friendly error messages. An API that lacks hypermedia controls forces assistive technologies to navigate complex workflows without any structural guidance. Backend API design choices have direct accessibility implications, and those choices need to be made deliberately.
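As a rough illustration of what accessibility metadata living in the data layer can look like, the sketch below defines an image payload that carries its own text alternative and an error payload with enough semantic detail to build a useful screen reader announcement. All type names, field names, and messages are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Image payload where the text alternative travels with the resource.
public sealed record ImageResource(Uri Url, string AltText, bool IsDecorative);

// Error payload with enough semantic detail to build a spoken or visual message.
public sealed record FieldError(string Field, string Message, string Suggestion);
public sealed record ApiError(string Code, string Title, IReadOnlyList<FieldError> Errors);

public static class ContractExamples
{
    // Serializes a sample validation error using web (camelCase) JSON conventions.
    public static string SerializeValidationError()
    {
        var error = new ApiError(
            Code: "validation_failed",
            Title: "Some fields need attention",
            Errors: new[]
            {
                new FieldError(
                    "email",
                    "The email address is missing an @ sign.",
                    "Enter an address such as name@example.com."),
            });

        return JsonSerializer.Serialize(error, new JsonSerializerOptions(JsonSerializerDefaults.Web));
    }
}
```

The frontend no longer has to invent alt text or guess what went wrong: it can announce the field, the problem, and the suggested fix directly from the response.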

Fallback mechanisms are primary paths, not error handling: Accessible systems are resilient systems. If a WebSocket connection fails, keyboard navigation must keep working. If a JavaScript error disrupts dynamic content updates, there needs to be a fallback mechanism for content announcements. Progressive enhancement, the practice of building a functional core experience that works without advanced features and then layering enhancements on top, is not just a best practice for low-bandwidth environments. It is an accessibility architectural pattern.

Embedding Accessibility Across the Development Lifecycle

One of the more powerful shifts in accessibility-first engineering is what happens when accessibility requirements are treated as first-class functional requirements from the very beginning of a project, rather than as a quality checklist at the end.

In the requirements phase, this means writing accessibility specifications with the same rigor as performance requirements. What WCAG conformance level is the target? What assistive technologies need to be tested against? What are the performance budgets for accessibility-critical interactions? What error messaging requirements exist for cognitive accessibility? These questions should have answers before a line of code is written.

In the design phase, it means architecture reviews that explicitly evaluate accessibility implications. Decisions about synchronous versus asynchronous processing, caching strategies, API contract design, and state management patterns all have accessibility consequences. Those consequences need to be part of the design conversation.

In the development phase, it means accessibility patterns in coding standards, not accessibility reviews in code review. There is a meaningful difference between a team that checks for accessibility problems after code is written and a team where accessible patterns are the default from which developers start.

In the testing phase, it means automated accessibility testing running in CI/CD pipelines alongside unit and integration tests, and manual testing with actual assistive technologies treated as a standard activity rather than a specialized audit. It also means load testing scenarios that include assistive technology usage patterns, which often produce meaningfully different request profiles than standard interactions.

In the monitoring phase — and this is where accessibility-first thinking often reveals gaps — it means measuring “accessibility uptime” as a distinct metric from general system uptime. A system can be fully operational for mouse users while being completely broken for keyboard-only users because of a JavaScript error that disrupts focus management. Standard uptime monitoring will not catch this. Intentional accessibility observability will.
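One way to approximate "accessibility uptime" is to run synthetic probes (for example, a scripted keyboard-only journey) on a schedule and report the pass rate per journey. The sketch below assumes such probe results already exist; the record shape and journey names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One synthetic probe result, e.g. a scripted keyboard-only checkout run.
public sealed record ProbeResult(DateTimeOffset At, string Journey, bool Passed);

public static class AccessibilityUptime
{
    // Percentage of probes that passed, grouped by journey.
    public static IReadOnlyDictionary<string, double> ByJourney(IEnumerable<ProbeResult> probes) =>
        probes
            .GroupBy(p => p.Journey)
            .ToDictionary(
                g => g.Key,
                g => 100.0 * g.Count(p => p.Passed) / g.Count());
}
```

Plotted next to ordinary uptime, a figure like this makes the "operational for mouse users, broken for keyboard users" case visible instead of silent.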

The Engineering Culture Dimension

Technical practices can be documented in a style guide. Culture is harder.

One of the most common patterns in organizations that struggle with accessibility is that it is owned by a single person or a small team who bear the entire cognitive and organizational load of making inaccessible systems accessible. These people are often talented, passionate, and chronically overwhelmed. They become bottlenecks. They review finished work looking for problems. They write long reports that get triaged against other priorities. And eventually, they burn out.

Accessibility-first engineering requires distributing this ownership. It requires that every engineer building a feature understands enough about accessibility to make good decisions in their day-to-day work. It requires that product managers include accessibility requirements in acceptance criteria. It requires that designers understand the accessibility implications of their patterns. It requires that engineering leadership sets clear expectations and measures results.

This is not a simple transformation. It requires investment in education, in tooling, in processes that make the right choices easy and the wrong choices visible. But it is the only version of accessibility work that actually scales. A single team auditing the output of hundreds of engineers will always be fighting a losing battle. A culture where accessibility is everyone’s responsibility — supported by good tooling and clear standards — is the only thing that produces consistently accessible software.

Staff engineers and technical leads carry particular responsibility here. The architectural decisions made at the senior level — about system design, about API contracts, about component libraries, about testing infrastructure — either make accessibility easy or make it hard for every engineer downstream. Technical leadership that treats accessibility as a first-class concern creates the conditions for accessible software to be the path of least resistance. Technical leadership that does not creates the conditions for accessibility debt to accumulate indefinitely.

Closing: The Standard We Should Hold Ourselves To

There is a version of this conversation that frames accessibility as a nice-to-have, a differentiator, a business opportunity. All of those things are true, and none of them are the right starting point.

The right starting point is this: millions of people cannot fully use software that has become essential to participating in modern society. Healthcare, employment, education, civic participation — all of it increasingly mediated through digital systems that were built without them in mind. That is a failure. It is a failure of craft, of leadership, and of professional ethics.

Accessibility-first software engineering is the practice of taking that failure seriously enough to actually change how we build things. Not by adding an accessibility review at the end of the process, but by designing systems from the ground up that work for everyone. Not by treating disabled users as an edge case, but by recognizing that designing for the full spectrum of human capability produces better software for all users.

The curb cut is still the right metaphor. The features that make software accessible to people with disabilities — keyboard navigability, semantic structure, clear error handling, consistent state exposure, meaningful labels — make software better for everyone. They make it more robust, more maintainable, more testable, and more resilient. Accessibility is not a constraint on good engineering. It is an expression of it.

The question for every engineering team is not whether to take accessibility seriously. It is whether to take it seriously before you ship or after, and whether you want to pay for it once or many times over.

Vector Search and Queryable Encryption in .NET: Engineering Secure AI Systems at Scale

Ali Suleyman TOPUZ — Sat, 21 Mar 2026 18:46:33 +0000

A comprehensive technical deep-dive for .NET architects and senior engineers on building production-grade vector search systems with encryption-in-use. Explores the intersection of semantic search, LLM embeddings, and privacy-preserving computation in enterprise environments where regulatory compliance and performance cannot be compromised.

Executive Summary

Contextual Importance: The Convergence of Three Inflection Points

The enterprise software landscape is being reshaped by three simultaneous inflection points. First, the explosive growth in unstructured data (text, images, and audio) that is now represented as high-dimensional vectors. Second, the transition of privacy-preserving AI from academic curiosity to regulatory mandate (GDPR, HIPAA, and the EU AI Act). Third, the strategic challenge of combining vector search with encryption: the need to perform mathematical similarity operations on data that must remain protected.

Target Technologists: Who Needs This Knowledge Now

This deep-dive addresses .NET architects building LLM-powered enterprise systems and Security Engineers tasked with auditing AI infrastructure. We move beyond theoretical “Hello World” tutorials to explore how to ship reliable, compliant, and high-performance vector systems at scale.

Core Architectural Components

Building a production-grade system requires orchestrating specialized components that handle high-dimensional math and cryptographic transformations.

The Component Topology

Vector Embedding Service Layer

This layer converts raw data into vectors (e.g., 1536-dimensional floats for OpenAI’s text-embedding-3-small).

Isolation: Use ASP.NET Core Minimal APIs with System.Threading.RateLimiting to protect against upstream LLM latency and costs.
Data Handling: Pass embeddings as ReadOnlyMemory<float> or ReadOnlySpan<float> so that slicing and handing off a vector does not copy the underlying array, avoiding extra allocations in the hot path.
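The two points above can be sketched in a few lines. The limiter settings are placeholder values, and the batched buffer layout (embeddings stored back to back in one array) is an assumption made for this example rather than something the embedding API dictates.

```csharp
using System;
using System.Threading.RateLimiting;

public static class EmbeddingHotPath
{
    // Token bucket guarding the upstream embedding API; the limits are placeholder values.
    public static TokenBucketRateLimiter CreateLimiter() => new(new TokenBucketRateLimiterOptions
    {
        TokenLimit = 2,
        TokensPerPeriod = 2,
        ReplenishmentPeriod = TimeSpan.FromSeconds(1),
        QueueLimit = 0,
        QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
        AutoReplenishment = false, // manual replenishment keeps the sketch deterministic
    });

    // Returns a view over one embedding inside a batched buffer without copying it.
    public static ReadOnlyMemory<float> SliceEmbedding(
        ReadOnlyMemory<float> batch, int index, int dimensions) =>
        batch.Slice(index * dimensions, dimensions);
}
```

A caller would acquire a lease from the limiter before each upstream call and reject or queue the request when `IsAcquired` is false. `SliceEmbedding` hands out views into the same backing array, so no per-document array is allocated.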

Vector Store Layer: The Persistence Conundrum

PostgreSQL + pgvector: Best for teams needing ACID compliance and relational joins. Integrated via Npgsql.
Qdrant: A purpose-built Rust-based store with excellent gRPC support for .NET. Ideal for sub-50ms latencies and complex metadata filtering.
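For the PostgreSQL route, a nearest-neighbour query might look roughly like the sketch below. The `documents` table and `embedding` column are hypothetical names; `<=>` is pgvector's cosine distance operator, and the helper formats a vector in pgvector's bracketed text form so it can be passed as a text parameter and cast on the server. Executing the query through Npgsql is left out to keep the example self-contained.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class PgVectorQuery
{
    // Hypothetical table and column names; <=> is pgvector's cosine distance operator.
    public const string NearestNeighbours =
        "SELECT id FROM documents ORDER BY embedding <=> @query::vector LIMIT @k";

    // Formats a vector in pgvector's text form, e.g. [0.5,1,-2].
    public static string ToLiteral(ReadOnlySpan<float> vector)
    {
        var sb = new StringBuilder("[");
        for (int i = 0; i < vector.Length; i++)
        {
            if (i > 0) sb.Append(',');
            // Invariant culture avoids decimal commas on non-English hosts.
            sb.Append(vector[i].ToString("R", CultureInfo.InvariantCulture));
        }
        return sb.Append(']').ToString();
    }
}
```

The invariant-culture formatting matters in practice: a server running under a locale that uses decimal commas would otherwise produce a literal that PostgreSQL cannot parse.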

Hands-on Vector Search Logic in .NET

We will implement a production-ready vector search service using .NET 9 and C# 13 features.

Foundation: Domain Models

We start by defining type-safe models that separate the vector data from its business metadata.

public sealed record VectorDocument
{
    public required Guid Id { get; init; }
    public required ReadOnlyMemory<float> Vector { get; init; }
    public required VectorMetadata Metadata { get; init; }
    public string? EncryptionScheme { get; init; }
}

public sealed record VectorMetadata(string Domain, string Source, int ModelVersion);

The Implementation: Encrypted Search Service

The following implementation demonstrates a “Player-Coach” approach: a service that handles both the embedding generation and the secure search execution.

using System.Diagnostics;
using Microsoft.Extensions.Logging;

public class SecureVectorSearchService(
    IVectorEmbeddingService embeddingService,
    IVectorStore vectorStore,
    IVectorEncryptionService encryptionService,
    ILogger<SecureVectorSearchService> logger) : ISecureVectorSearchService
{
    public async Task<IReadOnlyList<VectorSearchResult>> SearchProtectedAsync(
        string plainTextQuery, 
        CancellationToken ct = default)
    {
        // 1. Generate Embedding
        var rawVector = await embeddingService.GenerateEmbeddingAsync(plainTextQuery, ct);

        // 2. Apply search-oriented encryption (e.g., a distance-preserving or homomorphic scheme;
        //    order-preserving encryption alone does not preserve distances between vectors).
        // The goal is for the DB to rank by distance without seeing the raw values.
        var searchToken = encryptionService.EncryptForSearch(rawVector);

        var request = new VectorSearchRequest
        {
            QueryVector = searchToken, // Pass the encrypted token to the store
            TopK = 10,
            UseEncryption = true
        };

        // 3. Execute Search with Observability
        var sw = Stopwatch.StartNew();
        try 
        {
            return await vectorStore.SearchAsync(request, ct);
        }
        finally 
        {
            logger.LogInformation("Vector search completed in {Elapsed}ms", sw.Elapsed.TotalMilliseconds);
        }
    }
}
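The service above depends on several types that are not shown. As a reading aid, here is one possible set of contracts inferred purely from how the service uses them. Every shape below is an assumption, not the actual definition behind the article's code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shapes inferred from how SecureVectorSearchService uses them.
public sealed record VectorSearchResult(Guid Id, float Score);

public sealed record VectorSearchRequest
{
    public required ReadOnlyMemory<float> QueryVector { get; init; }
    public int TopK { get; init; } = 10;
    public bool UseEncryption { get; init; }
}

public interface IVectorEmbeddingService
{
    Task<ReadOnlyMemory<float>> GenerateEmbeddingAsync(string text, CancellationToken ct = default);
}

public interface IVectorEncryptionService
{
    // Returns a search token in the same vector space the store indexes.
    ReadOnlyMemory<float> EncryptForSearch(ReadOnlyMemory<float> vector);
}

public interface IVectorStore
{
    Task<IReadOnlyList<VectorSearchResult>> SearchAsync(
        VectorSearchRequest request, CancellationToken ct = default);
}

public interface ISecureVectorSearchService
{
    Task<IReadOnlyList<VectorSearchResult>> SearchProtectedAsync(
        string plainTextQuery, CancellationToken ct = default);
}
```

Keeping `IVectorEncryptionService` as its own interface is what lets the encryption scheme be swapped or audited without touching the search flow.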

Mathematical Accuracy in Distance Calculation

When implementing custom similarity logic (e.g., for in-memory caching), use hardware acceleration rather than a hand-written loop.

In .NET 9, the simplest route is TensorPrimitives from the System.Numerics.Tensors package, which is built on the SIMD hardware intrinsics in System.Runtime.Intrinsics. Instead of handling one float per iteration in a standard for loop, it processes several floats per CPU instruction. Note that a dot product equals cosine similarity only when both vectors are normalized to unit length:

// High-performance SIMD-optimized dot product in .NET 9
float similarity = TensorPrimitives.Dot(vectorA.Span, vectorB.Span);
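A short sketch may help show when the dot product can stand in for cosine similarity. It assumes the System.Numerics.Tensors NuGet package is referenced; the `Similarity` helper class is a name made up for this example.

```csharp
using System;
using System.Numerics.Tensors; // NuGet package: System.Numerics.Tensors

public static class Similarity
{
    // Scales a vector to unit length so that a plain dot product equals cosine similarity.
    public static float[] Normalize(ReadOnlySpan<float> v)
    {
        var result = new float[v.Length];
        float norm = TensorPrimitives.Norm(v);
        if (norm == 0f) return result; // a zero vector stays zero
        TensorPrimitives.Divide(v, norm, result.AsSpan());
        return result;
    }

    // Returns cosine similarity, the raw dot product, and the dot product of the unit vectors.
    public static (float Cosine, float RawDot, float UnitDot) Compare(
        ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float cosine = TensorPrimitives.CosineSimilarity(a, b);
        float rawDot = TensorPrimitives.Dot(a, b);
        float unitDot = TensorPrimitives.Dot(Normalize(a), Normalize(b));
        return (cosine, rawDot, unitDot);
    }
}
```

For the vectors (3, 4) and (4, 3), the raw dot product is 24 while the cosine similarity is 0.96. Only after normalizing both vectors does the dot product match the cosine value, up to floating-point rounding. Normalizing once at ingestion time and using `Dot` at query time is a common way to get both correctness and speed.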

Observability and Scaling

In production, you cannot manage what you do not measure.

Vector Drift: Monitor the statistical distribution of your embeddings. If the average distance between new embeddings and your baseline increases, the incoming data has shifted, and the embedding model (or the index built from it) may need to be re-evaluated, retrained, or refreshed.
Recall vs. Latency: Use OpenTelemetry to track the trade-off between the HNSW ef_search parameter and search accuracy.
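One simple way to put a number on vector drift is to compare each new batch against the centroid of a baseline sample. The sketch below uses mean cosine distance for that. The choice of statistic and any alert threshold are assumptions to tune per workload, and the code again assumes the System.Numerics.Tensors package.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics.Tensors; // NuGet package: System.Numerics.Tensors

public static class EmbeddingDrift
{
    // Element-wise mean of the baseline embeddings (all must share one dimension).
    public static float[] Centroid(IReadOnlyList<float[]> embeddings)
    {
        var sum = new float[embeddings[0].Length];
        Span<float> acc = sum;
        foreach (var e in embeddings)
            TensorPrimitives.Add(acc, e.AsSpan(), acc); // in-place accumulation
        TensorPrimitives.Divide(acc, (float)embeddings.Count, acc);
        return sum;
    }

    // Mean cosine distance (1 - cosine similarity) between a batch and the baseline centroid.
    public static float MeanDistance(IReadOnlyList<float[]> batch, float[] centroid)
    {
        float total = 0f;
        foreach (var e in batch)
            total += 1f - TensorPrimitives.CosineSimilarity(e.AsSpan(), centroid.AsSpan());
        return total / batch.Count;
    }
}
```

Emitting `MeanDistance` as a gauge per ingestion batch gives a trend line. A sustained rise relative to the baseline is the signal described above.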

Metrics, Tooling and Target

| Metric | Tooling | Target (P99) |
|---------------------|------------------------|--------------|
| Embedding Latency | Azure Monitor | < 200ms |
| Vector Index Search | Prometheus / Grafana | < 50ms |
| Decryption Overhead | Custom counters (dotnet-counters) | < 5ms |
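For the custom decryption metric, a histogram from System.Diagnostics.Metrics is one reasonable option, since it can be read by dotnet-counters and exported through OpenTelemetry. The meter and instrument names below are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class VectorSearchMetrics
{
    // Hypothetical meter and instrument names.
    public static readonly Meter Meter = new("Contoso.VectorSearch");

    private static readonly Histogram<double> DecryptionMs =
        Meter.CreateHistogram<double>("vector.decryption.duration", unit: "ms");

    // Runs the decryption delegate and records how long it took, even if it throws.
    public static T MeasureDecryption<T>(Func<T> decrypt)
    {
        long start = Stopwatch.GetTimestamp();
        try
        {
            return decrypt();
        }
        finally
        {
            DecryptionMs.Record(Stopwatch.GetElapsedTime(start).TotalMilliseconds);
        }
    }
}
```

A histogram rather than a simple counter is what makes a P99 target checkable, because percentiles need the distribution and not just an average.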

Summary of Execution

We have moved from the high-level need for secure AI to a concrete .NET implementation. By leveraging ReadOnlyMemory<float> for performance and isolating the encryption logic, we build a system that is both fast and compliant.

Semantic Kernel and AI Agent Architecture: Orchestrating Enterprise LLMs in .NET 9

Ali Suleyman TOPUZ — Sat, 21 Mar 2026 18:45:49 +0000

A staff engineer’s deep-dive into Microsoft’s Semantic Kernel framework for building production-grade AI agents. Learn why enterprise LLM integration fails and how orchestration frameworks solve memory, composability, and operational challenges.

Executive Summary

For Senior Software Engineers, Semantic Kernel (SK) represents a paradigm shift. It doesn’t just simplify LLM integration — it fundamentally restructures application boundaries, state management, and workflow orchestration when non-deterministic AI components become first-class citizens in our architecture. In .NET 9, this is further solidified by the Microsoft.Extensions.AI (MEAI) ecosystem, allowing for a decoupled, vendor-agnostic AI stack.

1. The Core Challenge: The Stateless Black-Box Dilemma

Traditional APIs are predictable; LLMs are probabilistic. They are Stateless, Non-deterministic, and Context-Limited. Bridging this gap requires an orchestrator to manage context, enforce schemas, and provide observability.

2. Conceptual Overview of Semantic Kernel

In the modern .NET 9 ecosystem, the architecture is split into two layers:

The Foundational Layer (Microsoft.Extensions.AI): Provides the standard interfaces like IChatClient.
The Orchestration Layer (Semantic Kernel): Uses those interfaces to manage Plugins (the hands), Planners (the brain), and Filters (the guardrails).

3. Implementation Deep Dive in .NET 9

As a “Player-Coach,” I don’t just talk architecture; I write the “Golden Path” code. Here is how we implement a production-grade Kernel setup in .NET 9:

3.1 Bootstrap and Configuration

We leverage the new .NET 9 abstractions to ensure our kernel is decoupled from the specific model provider.

using System;
using Azure;
using Azure.AI.OpenAI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.SemanticKernel;

public static class AIOrchestrationExtensions
{
    public static IServiceCollection AddEnterpriseAIServices(this IServiceCollection services, IConfiguration config)
    {
        // 1. Setup the Foundational IChatClient.
        //    Adapter names have shifted across Microsoft.Extensions.AI releases
        //    (AsChatClient in early previews, AsIChatClient later); match your package version.
        IChatClient innerClient = new AzureOpenAIClient(
                new Uri(config["AzureOpenAI:Endpoint"]!),
                new AzureKeyCredential(config["AzureOpenAI:ApiKey"]!))
            .GetChatClient(config["AzureOpenAI:ModelId"]!)
            .AsIChatClient();

        services.AddChatClient(innerClient)
            .UseFunctionInvocation() // Enables the model to use tools
            .UseOpenTelemetry();     // Distributed tracing via ActivitySource

        // 2. Build the Semantic Kernel per resolution, taking the chat client from the
        //    application's container instead of calling BuildServiceProvider() during registration.
        services.AddTransient<Kernel>(sp =>
        {
            var builder = Kernel.CreateBuilder();
            builder.Services.AddSingleton(sp.GetRequiredService<IChatClient>());

            // 3. Register Business Logic Plugins
            builder.Plugins.AddFromType<InventoryPlugin>();
            builder.Plugins.AddFromType<VendorContractPlugin>();

            return builder.Build();
        });
        return services;
    }
}

3.2 Building a Native Plugin

Native plugins are deterministic C# methods. The [Description] attribute is crucial; it acts as the "API Documentation" for the LLM.

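using System.ComponentModel;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;

public class InventoryPlugin
{
    [KernelFunction]
    [Description("Returns current stock levels for a product SKU.")]
    public async Task<int> GetStockLevelAsync(
        [Description("The product SKU to look up.")] string sku)
    {
        // High-perf DB or gRPC call logic here (a placeholder value is returned below)
        return await Task.FromResult(4);
    }
}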

3.3 Auto-Invocation Loop

In .NET 9, we don’t manually call tools. We let the kernel “think” and call them automatically until the goal is met.

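public async Task ProcessAgentRequest(Kernel kernel, string userPrompt)
{
    // FunctionChoiceBehavior.Auto() advertises the registered plugins to the model and
    // auto-invokes the calls it requests (the "Reasoning Loop"). Older Semantic Kernel
    // releases expressed this as ToolCallBehavior.AutoInvokeKernelFunctions on
    // OpenAIPromptExecutionSettings, which ties the code to the OpenAI connector.
    var settings = new PromptExecutionSettings
    {
        FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
    };

    var result = await kernel.InvokePromptAsync(userPrompt, new(settings));
    Console.WriteLine(result.ToString());
}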

4. Observability and Resilience

For a Senior Architect, Observability is non-negotiable. .NET 9’s AI stack is built on ActivitySource, making OpenTelemetry integration native.

The “Guardrail” Filter

We use IFunctionInvocationFilter to intercept calls before they execute. This is where we check permissions or cost limits.

public class SafetyFilter : IFunctionInvocationFilter
{
    public async Task OnFunctionInvocationAsync(FunctionInvocationContext context, Func<FunctionInvocationContext, Task> next)
    {
        // Pre-call: Check if the SKU being requested is authorized.
        // Semantic Kernel drops the "Async" suffix when it derives a function name from a method.
        if (context.Function.Name == "GetStockLevel") { /* Logic */ }

        await next(context); // Execute the actual function
    }
}

5. Practical Use Case: Supply Chain Assistant

Scenario: Automating stock exceptions.

Ingestion: The agent sees low stock via InventoryPlugin.
RAG: It queries the Vector Store (Vendor Contracts) for lead times.
Action: If stock is critical, it drafts an email automatically.

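public async Task HandleStockShortageAsync(Kernel kernel, string sku)
{
    string prompt = @"You are a Supply Chain Assistant.
                      Check stock for {{$sku}}.
                      Search contracts for lead times.
                      If stock < 10 and lead time > 5 days, draft a restock email.";

    // Tool calling must be enabled here, otherwise the model cannot reach the plugins.
    var settings = new OpenAIPromptExecutionSettings
    {
        ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions
    };

    var result = await kernel.InvokePromptAsync(prompt, new KernelArguments(settings) { ["sku"] = sku });
    _logger.LogInformation("Agent Output: {Result}", result);
}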

6. Distributed State & Long-term Memory

In production, AI agents often run on stateless infrastructure (like Azure Functions or Container Apps). However, a “human-like” assistant must remember preferences from five minutes ago and technical specs from a 500-page manual. Semantic Kernel handles this through Chat History and Vector Stores.

6.1 Chat History: Managing the Conversation Context

LLMs do not “remember” sessions. We must pass the conversation back to them with every request. In Semantic Kernel, we model this with a ChatHistory object that is loaded from, and saved back to, a persistent store.

public async Task ExecuteStatefulConversationAsync(Kernel kernel, string userId, string input)
{
    // 1. Load conversation history from a persistent store (e.g., Redis or CosmosDB)
    var history = await _sessionRepository.GetHistoryAsync(userId);

    // 2. Append the new user input
    history.AddUserMessage(input);

    // 3. Invoke the Chat Completion service with history
    var chatService = kernel.GetRequiredService<IChatCompletionService>();
    var response = await chatService.GetChatMessageContentAsync(history, kernel: kernel);

    // 4. Update the store with the assistant's response
    history.AddAssistantMessage(response.Content!);
    await _sessionRepository.SaveHistoryAsync(userId, history);
}

6.2 Vector Stores: The “Corporate Brain” (RAG)

Semantic Kernel builds on a standardized Vector Store abstraction, shipped in the Microsoft.Extensions.VectorData.Abstractions package. This allows the Agent to perform Retrieval Augmented Generation (RAG) by searching across vectorized corporate data.

Architectural Advantage: By using the IVectorStore interface, your code remains decoupled from the specific database provider (e.g., Azure AI Search, Pinecone, or Milvus).
Metadata Filtering: A staff-level implementation doesn’t just search for “text similarity.” It uses metadata (e.g., DepartmentId, SecurityLevel) to ensure the Agent only retrieves information the user is authorized to see.
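The metadata-filtering idea can be sketched without any vector database at all. The example below is an in-memory illustration, not the IVectorStore API: the ContractChunk record, the Search helper, and the rule that a user may read chunks at or below their clearance level are all assumptions made for this sketch. It also assumes non-zero embedding vectors of equal length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record shape; DepartmentId and SecurityLevel mirror the metadata named above.
public record ContractChunk(string Text, string DepartmentId, int SecurityLevel, float[] Embedding);

public static class FilteredSearchSketch
{
    // Cosine similarity between two equal-length, non-zero vectors.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Applies the authorization filter first, then ranks what is left by similarity.
    public static List<ContractChunk> Search(
        IEnumerable<ContractChunk> store,
        float[] query,
        string departmentId,
        int userClearance,
        int top)
    {
        return store
            .Where(c => c.DepartmentId == departmentId && c.SecurityLevel <= userClearance)
            .OrderByDescending(c => CosineSimilarity(c.Embedding, query))
            .Take(top)
            .ToList();
    }
}
```

Filtering before ranking matters: an unauthorized chunk never enters the candidate set, so it cannot leak into the prompt no matter how similar it is to the query.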

Staff Engineer Note: Implement Semantic Caching wherever queries repeat. Before sending a query to the LLM, check your Vector Store for a similar question that was answered recently; a hit saves token costs and reduces latency.

7. Operational Excellence: Token Economics and Advanced Guardrails

In production, the “cool factor” of AI fades quickly if the cloud bill spikes. As architects, we must treat LLM tokens like any other expensive resource (e.g., IOPS or egress).

7.1 Token Management and Semantic Caching

Every word sent to the LLM costs money and increases latency. To optimize this, we implement a Semantic Cache. Before the Kernel hits the LLM, it checks a Vector Store to see if a similar question was answered recently.

public async Task<string> GetOptimizedResponseAsync(Kernel kernel, string userPrompt)
{
    // 1. Search Vector Cache for a 'similar enough' previous question
    var cachedResponse = await _vectorCache.GetSimilarResultAsync(userPrompt, threshold: 0.95);
    if (cachedResponse != null) return cachedResponse;

    // 2. If no cache hit, proceed to LLM
    var result = await kernel.InvokePromptAsync(userPrompt);

    // 3. Update cache for future hits
    await _vectorCache.StoreResultAsync(userPrompt, result.ToString());

    return result.ToString();
}

7.2 Advanced Guardrails: The “Planner Validator”

In Section 4, we discussed simple filters. For high-stakes environments, we need a Plan Validation Step. If an agent generates a plan to “Delete User Account,” a deterministic layer must intercept it.

public class PlanValidationFilter : IAutoFunctionInvocationFilter
{
    public async Task OnAutoFunctionInvocationAsync(AutoFunctionInvocationContext context, Func<AutoFunctionInvocationContext, Task> next)
    {
        // Staff-level check: Is the agent trying to call a restricted tool?
        if (context.Function.Name.Contains("Delete", StringComparison.OrdinalIgnoreCase))
        {
            // Kernel has no Arguments property. This assumes the host stored the caller's id in Kernel.Data.
            context.Kernel.Data.TryGetValue("userId", out var userId);
            if (!IsUserAdmin(userId?.ToString()))
            {
                throw new UnauthorizedAccessException("Agent attempted unauthorized destructive action.");
            }
        }

        await next(context);
    }
}

7.3 Governance: Rate Limiting and Circuit Breakers

LLM APIs can be flaky or slow. By wrapping our IChatClient (configured in Section 3.1) with standard Polly policies, we ensure our .NET 9 application doesn't hang when OpenAI/Azure is under load.

Retry Pattern: For 429 Too Many Requests.
Circuit Breaker: To stop calling a degraded model and fall back to a smaller, cheaper one (e.g., GPT-4o to GPT-4o-mini).
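The two patterns above can be sketched together in a few lines. Note that this is a hand-rolled illustration rather than Polly: the ResilientModelCaller class, the RateLimitedException type, and the delegate-based model calls are hypothetical stand-ins for a wrapped IChatClient. The sketch is not thread-safe, and once the circuit opens it stays open; a production breaker would add a timed half-open state.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical exception standing in for an HTTP 429 from the model provider.
public class RateLimitedException : Exception { }

public class ResilientModelCaller
{
    private readonly Func<string, Task<string>> _primary;
    private readonly Func<string, Task<string>> _fallback;
    private readonly int _maxRetries;
    private readonly int _failuresBeforeOpen;
    private readonly TimeSpan _baseDelay;
    private int _consecutiveFailures;

    public ResilientModelCaller(
        Func<string, Task<string>> primary,
        Func<string, Task<string>> fallback,
        int maxRetries = 2,
        int failuresBeforeOpen = 3,
        TimeSpan? baseDelay = null)
    {
        _primary = primary;
        _fallback = fallback;
        _maxRetries = maxRetries;
        _failuresBeforeOpen = failuresBeforeOpen;
        _baseDelay = baseDelay ?? TimeSpan.FromMilliseconds(200);
    }

    public bool CircuitOpen => _consecutiveFailures >= _failuresBeforeOpen;

    public async Task<string> CompleteAsync(string prompt)
    {
        // Circuit open: skip the degraded primary model entirely.
        if (CircuitOpen) return await _fallback(prompt);

        for (int attempt = 0; attempt <= _maxRetries; attempt++)
        {
            try
            {
                var result = await _primary(prompt);
                _consecutiveFailures = 0; // a success resets the failure count
                return result;
            }
            catch (RateLimitedException) when (attempt < _maxRetries)
            {
                // Exponential backoff: baseDelay, then 2x, then 4x.
                await Task.Delay(_baseDelay * Math.Pow(2, attempt));
            }
            catch (Exception)
            {
                break; // retries exhausted, or an error that is not worth retrying
            }
        }

        _consecutiveFailures++;
        return await _fallback(prompt);
    }
}
```

Only rate-limit errors are retried; any other failure goes straight to the fallback and counts toward opening the circuit.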

8. Testing and Evaluation (The “Staff” Reality Check)

Unlike traditional unit tests, AI testing is probabilistic. We use LLM-assisted Evaluation (LLM-as-a-judge).

Input: User Question + Agent Answer.
Validator: A separate, highly capable model (like GPT-4o) evaluates the answer based on a rubric (Accuracy, Tone, Grounding).
Result: A numeric score that a CI/CD pipeline can compare against a threshold.
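The three steps above can be wired into a pipeline gate roughly as follows. This is a sketch under stated assumptions: EvalCase, RubricScore, and EvalGate are hypothetical names, the rubric dimensions are assumed to be scored on a 0 to 1 scale, and the judge is passed in as a delegate so that the actual model call (and its prompt) stays outside the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Input: user question plus the agent's answer.
public record EvalCase(string Question, string AgentAnswer);

// Rubric from the text; each dimension is assumed to be scored from 0 to 1.
public record RubricScore(double Accuracy, double Tone, double Grounding)
{
    public double Mean => (Accuracy + Tone + Grounding) / 3.0;
}

public static class EvalGate
{
    // Scores every case with the supplied judge and compares the average to a threshold.
    public static async Task<(double MeanScore, bool Passed)> RunAsync(
        IReadOnlyList<EvalCase> cases,
        Func<EvalCase, Task<RubricScore>> judge,
        double threshold)
    {
        if (cases.Count == 0)
            throw new ArgumentException("At least one eval case is required.", nameof(cases));

        var scores = new List<double>();
        foreach (var evalCase in cases)
        {
            var score = await judge(evalCase); // in practice, a call to the judge model
            scores.Add(score.Mean);
        }

        double mean = scores.Average();
        return (mean, mean >= threshold);
    }
}
```

A CI job would call RunAsync over a fixed set of cases and fail the build when Passed is false. Because judge output varies between runs, the threshold is best set with some margin below the typical score.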

Conclusion

Semantic Kernel on .NET 9 is more than a library; it provides the building blocks for a reliable, distributed AI system. By combining the IChatClient abstraction, IVectorStore memory, and IFunctionInvocationFilter guardrails, we move AI from a "chat box" to a mission-critical enterprise asset.