<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stephen Trembley</title>
    <description>The latest articles on Forem by Stephen Trembley (@sturnaai).</description>
    <link>https://forem.com/sturnaai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891384%2Faba05fc7-caa9-4d21-86e1-f3e59420271f.png</url>
      <title>Forem: Stephen Trembley</title>
      <link>https://forem.com/sturnaai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sturnaai"/>
    <language>en</language>
    <item>
      <title>Why Competitive Agent Routing Beats Static Orchestration</title>
      <dc:creator>Stephen Trembley</dc:creator>
      <pubDate>Sat, 25 Apr 2026 19:47:10 +0000</pubDate>
      <link>https://forem.com/sturnaai/why-competitive-agent-routing-beats-static-orchestration-3lnj</link>
      <guid>https://forem.com/sturnaai/why-competitive-agent-routing-beats-static-orchestration-3lnj</guid>
      <description>&lt;p&gt;&lt;em&gt;And why your router is about to become your bottleneck&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You're a developer. You've built something that works—a system with multiple agents, each specialized, each good at one job. Your router picks which agent handles which request. It works.&lt;/p&gt;

&lt;p&gt;Then you scale.&lt;/p&gt;

&lt;p&gt;At agent #5, your router is a simple if/else chain. Ugly, but fine.&lt;br&gt;
At agent #30, it's a switch statement. Maintainable.&lt;br&gt;
At agent #100, you're rewriting it every sprint. New agent added? Update the router. Agent retired? Update the router. New domain? Update the router. Someone on your team eventually asks: "What if we just... didn't do this?"&lt;/p&gt;
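
&lt;p&gt;For concreteness, here's that router as a TypeScript caricature. The agent names are invented, but the shape will be familiar:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Every new agent, domain, or business rule means editing this function
// and shipping a deploy.
function route(intent: string): string {
  if (intent.includes("refund")) return "billing-agent-1";
  if (intent.includes("password")) return "account-recovery-agent";
  if (intent.includes("crash")) return "debug-agent-2";
  // ...97 more branches...
  return "support-agent-default";
}
&lt;/code&gt;&lt;/pre&gt;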

&lt;p&gt;That's the problem with static routing. Your router is hardcoded logic that lives outside your agents. Every agent you add is a new edge case to handle. Every business rule shift means touching the router. At scale (we're talking 200+ agents across 50+ domains), this doesn't just become annoying—it becomes the systemic bottleneck that keeps you from shipping.&lt;/p&gt;

&lt;p&gt;This is the story of static orchestration platforms like LangGraph and CrewAI. They're powerful. They're flexible. But they're also fundamentally static—you define the routing logic upfront, then ship it. Changing your routing strategy means code changes, testing, and deploys.&lt;/p&gt;

&lt;p&gt;There's a different way. It's called competitive agent routing.&lt;/p&gt;

&lt;h2&gt;The Problem with Static Routing (Why it breaks at scale)&lt;/h2&gt;

&lt;p&gt;Let's say you have a Slack integration handling customer requests. You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 agents for support issues&lt;/li&gt;
&lt;li&gt;2 agents for billing questions&lt;/li&gt;
&lt;li&gt;4 agents for technical debugging&lt;/li&gt;
&lt;li&gt;2 agents for account recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's 11 agents. Your router probably has 11 decision paths. When your support team escalates a "complex billing bug involving account access," which agent should handle it? Support or Billing or Technical? All three could claim expertise. Your router has to &lt;em&gt;guess&lt;/em&gt;, and guesses fail. You fix it manually, push a new deploy, and move on.&lt;/p&gt;

&lt;p&gt;Now scale to 201 agents across 59 domains.&lt;/p&gt;

&lt;p&gt;Your router doesn't scale. Worse, your router is now the system's single point of failure. A routing error affects every single request. A routing change affects the entire system.&lt;/p&gt;

&lt;p&gt;More fundamentally: &lt;strong&gt;Static routing assumes you know the right decision at deploy time.&lt;/strong&gt; But what if your best support agent goes on vacation? What if a new agent joins and is exceptional? What if seasonal business changes mean billing agents should handle more volume? Static routing can't adapt—you have to redeploy.&lt;/p&gt;

&lt;h2&gt;How Competitive Routing Works&lt;/h2&gt;

&lt;p&gt;Competitive routing flips the model.&lt;/p&gt;

&lt;p&gt;Instead of a centralized router making decisions, every agent independently evaluates whether it should handle a request. The agent that's most confident—and can handle it fastest/cheapest—wins.&lt;/p&gt;

&lt;p&gt;Here's the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intent arrives&lt;/strong&gt; — A customer request or system event&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broadcast to all agents&lt;/strong&gt; — "Can you handle this? What's your confidence? How long will it take?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents bid&lt;/strong&gt; — Each agent responds with a confidence score (0-100%) and predicted execution cost/time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic ranking&lt;/strong&gt; — Proposals ranked by confidence and cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quorum selection&lt;/strong&gt; — Top agent executes; backups stand by in case of failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute &amp;amp; log&lt;/strong&gt; — Result logged with metadata for analytics and future learning&lt;/li&gt;
&lt;/ol&gt;
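
&lt;p&gt;In code, steps 2-5 reduce to a broadcast, a ranking, and a pick. This is a minimal sketch; the types and names here are illustrative, not any particular framework's API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface Bid {
  agentId: string;
  confidence: number;     // 0-100
  estimatedCost: number;  // predicted time/cost to execute
}

interface Agent {
  id: string;
  evaluate(intent: string): Promise&amp;lt;Bid&amp;gt;;
}

// Steps 2-3: broadcast the intent; every agent bids independently.
// Step 4: rank by confidence, breaking ties on predicted cost.
// Step 5: bids[0] executes; the rest stand by as fallbacks.
async function collectAndRank(intent: string, agents: Agent[]): Promise&amp;lt;Bid[]&amp;gt; {
  const bids = await Promise.all(agents.map(a =&amp;gt; a.evaluate(intent)));
  return bids.sort(
    (x, y) =&amp;gt; y.confidence - x.confidence || x.estimatedCost - y.estimatedCost
  );
}
&lt;/code&gt;&lt;/pre&gt;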

&lt;p&gt;No centralized router. No hardcoded rules. Just agents self-selecting based on their actual capabilities in real-time.&lt;/p&gt;

&lt;h2&gt;The Real Numbers&lt;/h2&gt;

&lt;p&gt;At Sturna, we built this. Here's what it looks like at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;201 agents&lt;/strong&gt; across 59 different domains (Shopify, billing, content moderation, customer support, technical debugging, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2,965 proposals&lt;/strong&gt; generated per day—201 agents, each bidding with its own confidence score&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1% selection rate&lt;/strong&gt;—only the highest-confidence agent actually executes (the other 200 stand by as fallbacks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31.7 seconds average resolution time&lt;/strong&gt;—competitive routing determines not just &lt;em&gt;who&lt;/em&gt; handles the request, but &lt;em&gt;which combination&lt;/em&gt; of agents or approach succeeds fastest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 hardcoded routing rules&lt;/strong&gt;—new agents onboard automatically; they bid alongside existing agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trick: you don't deploy new routes when you add new agents. The agents add themselves to the system. Existing agents compete fairly. Your system adapts in real-time.&lt;/p&gt;

&lt;p&gt;If you're running standard orchestration (CrewAI's supervisor, LangGraph's routing), you're deploying a new routing strategy. If you're running competitive routing, you're just adding another participant to an existing competition.&lt;/p&gt;
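
&lt;p&gt;Onboarding a new agent, in this model, is one registration call rather than a router edit. A hypothetical sketch, reusing the &lt;code&gt;Agent&lt;/code&gt; interface above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class AgentRegistry {
  private agents: Agent[] = [];

  // The new agent starts receiving broadcasts and bidding immediately;
  // no routing logic anywhere needs to change.
  register(agent: Agent): void {
    this.agents.push(agent);
  }

  all(): Agent[] {
    return this.agents;
  }
}
&lt;/code&gt;&lt;/pre&gt;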

&lt;h2&gt;Why This Matters for Your Scale&lt;/h2&gt;

&lt;p&gt;Three reasons this becomes critical at scale:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Adaptability without deploys&lt;/strong&gt; — Your system handles new agents, domain shifts, and seasonal load changes without code changes. On a Monday, maybe your support agents are overloaded; competitive routing automatically routes more work to technical agents. Tuesday, rebalance. No deploys required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fault tolerance baked in&lt;/strong&gt; — If your top-ranked agent fails, your system doesn't fail. You have 200 other proposals already ranked. Grab the second-best. Competitive routing is inherently redundant—every request has fallbacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cost optimization&lt;/strong&gt; — Agents can bid based on their actual running cost. A cheaper agent with 85% confidence might beat a more expensive agent with 92% confidence. Your system is making trade-off decisions automatically. Over thousands of requests, this scales into real savings.&lt;/p&gt;
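
&lt;p&gt;The 85%-vs-92% example works whenever the ranking weighs cost against confidence. A toy scoring rule, with a made-up weight, shows the trade-off:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative only: the weight (0.3) is an invented tuning knob.
function score(confidence: number, costCents: number): number {
  return confidence - 0.3 * costCents;
}

score(92, 40); // expensive agent: 92 - 12.0 = 80.0
score(85, 5);  // cheap agent:     85 - 1.5  = 83.5  (the cheaper agent wins)
&lt;/code&gt;&lt;/pre&gt;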

&lt;h2&gt;The Catch&lt;/h2&gt;

&lt;p&gt;Competitive routing has one hard requirement: &lt;strong&gt;agents need enough context and confidence to bid accurately.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A dumb agent that always says "90% confident, I'll handle this in 5 seconds" will win the first few bids, then fail, and those failures feed back into its ranking. Over time it loses to an intelligent agent that says "23% confident; I'd need to fetch 3 external APIs." Competitive routing surfaces bad agents naturally—they get ranked lower and lose the competition. But you do need agents that are smart enough to know what they &lt;em&gt;don't&lt;/em&gt; know.&lt;/p&gt;

&lt;p&gt;That's not a limitation of the pattern—it's the feature. You're forcing every agent to self-assess. Static routers can hide incompetence. Competitive routing exposes it immediately.&lt;/p&gt;

&lt;h2&gt;When Competitive Routing Wins&lt;/h2&gt;

&lt;p&gt;This pattern wins hardest when you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple agents solving overlapping problems&lt;/strong&gt; — Support + Account Recovery might both handle a refund request. Let them compete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uncertainty about the best path&lt;/strong&gt; — You genuinely don't know if Technical Debugging or Product Support should own this issue. Competitive routing figures it out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling beyond 30-40 agents&lt;/strong&gt; — Static routing becomes genuinely unmaintainable. Competitive routing scales cleanly: each new agent is one more bidder, not one more branch in your router.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-volume, low-latency requirements&lt;/strong&gt; — Every millisecond matters. Competitive routing lets fast agents win over slow ones, even if both have high confidence.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try It Yourself&lt;/h2&gt;

&lt;p&gt;We built Sturna on this pattern. 201 agents, 59 domains, zero hardcoded routing rules. You can see it in action:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://octomind-9fce.polsia.app" rel="noopener noreferrer"&gt;Try Sturna&lt;/a&gt;&lt;/strong&gt; — Broadcast an intent, watch 201 agents bid, see the best one execute.&lt;/p&gt;

&lt;p&gt;The interface is intentionally simple: type what you need, hit enter, watch your agents compete. It's a 15-second mental model, but the implications are enormous once your system scales.&lt;/p&gt;

&lt;h2&gt;The Bottom Line&lt;/h2&gt;

&lt;p&gt;Static routing gets you from 0 to ~30 agents. After that, it becomes a bottleneck—not just operationally (new deploys per new agent) but strategically (your routing logic becomes a central point of failure and fragility).&lt;/p&gt;

&lt;p&gt;Competitive routing is the pattern that scales. It's not magic. It's just agents smart enough to know when they're the right tool, confident enough to bid, and humble enough to lose when they're not.&lt;/p&gt;

&lt;p&gt;If you're building multi-agent systems and thinking about scale, competitive routing should be on your roadmap. And if you've already built static routing, it might be time to revisit.&lt;/p&gt;

</description>
      <category>programming</category>
    </item>
    <item>
      <title>How We Built a Self-Healing Agent Marketplace with 201 Competing AI Agents</title>
      <dc:creator>Stephen Trembley</dc:creator>
      <pubDate>Tue, 21 Apr 2026 22:06:38 +0000</pubDate>
      <link>https://forem.com/sturnaai/how-we-built-a-self-healing-agent-marketplace-with-201-competing-ai-agents-43kf</link>
      <guid>https://forem.com/sturnaai/how-we-built-a-self-healing-agent-marketplace-with-201-competing-ai-agents-43kf</guid>
      <description>&lt;p&gt;Most agent frameworks assume you know the best agent for the job before the job starts. You pick a model, wire a DAG, and hope it holds.&lt;/p&gt;

&lt;p&gt;We didn't know. So we made 201 agents compete for every task — and let outcomes decide.&lt;/p&gt;

&lt;p&gt;This is the architecture behind &lt;a href="https://sturna.ai" rel="noopener noreferrer"&gt;Sturna.ai&lt;/a&gt;, and why we call it the octopus brain.&lt;/p&gt;

&lt;h2&gt;The Problem with Static DAGs&lt;/h2&gt;

&lt;p&gt;LangGraph, CrewAI, AutoGen — they're all variations of the same idea: you compose agents into a fixed graph. Agent A calls Agent B which calls Agent C. The flow is known at design time.&lt;/p&gt;

&lt;p&gt;That works until it doesn't.&lt;/p&gt;

&lt;p&gt;In production, task diversity is brutal. A single "analyze my competitors" intent might need a web scraper, a summarizer, a data formatter, and a report writer — or it might need completely different agents depending on which competitors, which market, which output format. Static graphs require you to anticipate all of this upfront. You can't.&lt;/p&gt;

&lt;p&gt;The deeper problem: when a node fails, the whole DAG fails. There's no self-healing. There's no "try something else." You get an error and you restart.&lt;/p&gt;

&lt;h2&gt;The Octopus Brain Model&lt;/h2&gt;

&lt;p&gt;An octopus has a central brain but its arms have their own neural clusters — each arm can act semi-independently, process information locally, and adapt without waiting for central coordination.&lt;/p&gt;

&lt;p&gt;We built Sturna with the same principle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Central coordinator&lt;/strong&gt; receives an intent and broadcasts it to all capable agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;201 specialized agents&lt;/strong&gt; each evaluate the task independently and submit proposals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive routing&lt;/strong&gt; selects the best proposal based on past performance, confidence scores, and task type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution layer&lt;/strong&gt; runs the winning agent — and if it fails, automatically routes to the next best proposal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No fixed DAG. No predetermined path. The route emerges from competition.&lt;/p&gt;

&lt;h2&gt;What "Self-Healing" Actually Means&lt;/h2&gt;

&lt;p&gt;When people say "self-healing," they usually mean retry logic. Retry the same thing 3 times, then give up.&lt;/p&gt;

&lt;p&gt;That's not healing. That's hoping.&lt;/p&gt;

&lt;p&gt;Sturna's self-healing is architectural:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every task has N competing proposals, ranked by predicted success&lt;/li&gt;
&lt;li&gt;If agent #1 fails, the system doesn't restart — it promotes agent #2&lt;/li&gt;
&lt;li&gt;Agent #2 runs with full context of what agent #1 attempted&lt;/li&gt;
&lt;li&gt;Failure data feeds back into routing scores, making future routing smarter&lt;/li&gt;
&lt;/ol&gt;
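
&lt;p&gt;As code, the loop looks roughly like this. It's a sketch, not Sturna's actual implementation; &lt;code&gt;Proposal&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, and &lt;code&gt;recordOutcome&lt;/code&gt; stand in for the execution layer and the routing-score feedback:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type Attempt = { ok: boolean; value?: unknown; trace?: unknown };
type Proposal = { agentId: string };
type Intent = { text: string };

// Stand-ins for the execution layer and the routing-score feedback.
declare function run(p: Proposal, i: Intent, ctx: unknown): Promise&amp;lt;Attempt&amp;gt;;
declare function recordOutcome(agentId: string, ok: boolean): void;

async function executeWithHealing(ranked: Proposal[], intent: Intent) {
  let context: unknown = null; // what previous attempts tried (step 3)
  for (const proposal of ranked) {
    const result = await run(proposal, intent, context);
    recordOutcome(proposal.agentId, result.ok); // step 4: feed routing scores
    if (result.ok) return result.value;
    context = result.trace; // the next-best proposal is promoted with full context
  }
  throw new Error("all proposals exhausted");
}
&lt;/code&gt;&lt;/pre&gt;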

&lt;p&gt;The agents aren't just competing for the first run. They're competing across every run, accumulating performance history that shapes every future routing decision.&lt;/p&gt;

&lt;h2&gt;The Numbers After 6 Months in Production&lt;/h2&gt;

&lt;p&gt;After running this system across thousands of real tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;201 active agents&lt;/strong&gt; across 14 capability categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;86%+ first-attempt success rate&lt;/strong&gt; (vs ~60% with our original static routing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;45-second median time-to-value&lt;/strong&gt; from intent to delivered result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing triggered on ~14% of tasks&lt;/strong&gt; — those tasks still complete, they just take a second pass&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 86% number is the one I'm most proud of. That's not accuracy on benchmarks — that's real tasks from real users completing successfully on the first agent attempt.&lt;/p&gt;

&lt;h2&gt;Competitive Routing vs Static DAGs: The Real Tradeoff&lt;/h2&gt;

&lt;p&gt;I want to be honest about what you give up with competitive routing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Static DAG&lt;/th&gt;
&lt;th&gt;Competitive Routing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Predictability&lt;/td&gt;
&lt;td&gt;High — same path every time&lt;/td&gt;
&lt;td&gt;Lower — path varies by agent performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debuggability&lt;/td&gt;
&lt;td&gt;Easy — trace the graph&lt;/td&gt;
&lt;td&gt;Harder — need proposal replay logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency (simple tasks)&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Higher — broadcast overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency (complex tasks)&lt;/td&gt;
&lt;td&gt;Higher — no fallback path&lt;/td&gt;
&lt;td&gt;Lower — parallel evaluation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure recovery&lt;/td&gt;
&lt;td&gt;Manual — fix the DAG&lt;/td&gt;
&lt;td&gt;Automatic — next proposal promoted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Improvement over time&lt;/td&gt;
&lt;td&gt;Manual — you retune&lt;/td&gt;
&lt;td&gt;Automatic — routing learns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For simple, well-scoped tasks you run thousands of times, static DAGs win on predictability. For diverse, open-ended tasks where failure matters, competitive routing wins on resilience.&lt;/p&gt;

&lt;p&gt;We built Sturna for the second category.&lt;/p&gt;

&lt;h2&gt;How Agents Submit Proposals&lt;/h2&gt;

&lt;p&gt;Each agent in Sturna exposes a &lt;code&gt;canHandle(intent)&lt;/code&gt; method that returns a confidence score (0-1) and an execution plan. When a task comes in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified — real implementation has more context&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AgentProposal&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;agentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;estimatedDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;executionPlan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;requiredCapabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Coordinator broadcasts and collects&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proposals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Rank by: confidence × historical success rate × recency&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rankProposals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;proposals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;agentHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ranking function is the core IP. Confidence alone isn't enough — an agent can be overconfident on task types it's bad at. We weight heavily by actual historical success rate, with recency bias (recent performance matters more than old performance).&lt;/p&gt;
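
&lt;p&gt;A plausible shape for that function, using the &lt;code&gt;AgentProposal&lt;/code&gt; interface above; the weighting is invented for illustration, since the real coefficients aren't published:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface AgentStats {
  // Exponentially weighted success rate, so recent outcomes count
  // more than old ones (the "recency bias" above).
  recentSuccessRate: number;
}

function rankProposals(
  proposals: AgentProposal[],
  history: Map&amp;lt;string, AgentStats&amp;gt;
): AgentProposal[] {
  return proposals
    .map(p =&amp;gt; {
      const stats = history.get(p.agentId);
      const successRate = stats ? stats.recentSuccessRate : 0.5; // unknown agents start neutral
      return { p, score: p.confidence * successRate };
    })
    .sort((a, b) =&amp;gt; b.score - a.score)
    .map(x =&amp;gt; x.p);
}
&lt;/code&gt;&lt;/pre&gt;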

&lt;h2&gt;What We Got Wrong First&lt;/h2&gt;

&lt;p&gt;Two things killed our first two versions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version 1: Too much competition.&lt;/strong&gt; Broadcasting to all 201 agents created ~400ms of overhead even before execution started. We added capability tagging — agents declare what they can handle, and broadcast only goes to capable agents. Overhead dropped to ~30ms.&lt;/p&gt;
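
&lt;p&gt;The tagging fix is simple in shape, something like this sketch (names illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface TaggedAgent {
  id: string;
  capabilities: string[]; // declared at registration, e.g. ["web-scraping", "summarization"]
}

// Broadcast only to agents whose declared capabilities cover the task,
// instead of all 201 of them.
function capableAgents(agents: TaggedAgent[], requiredTags: string[]): TaggedAgent[] {
  return agents.filter(a =&amp;gt;
    requiredTags.every(tag =&amp;gt; a.capabilities.includes(tag))
  );
}
&lt;/code&gt;&lt;/pre&gt;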

&lt;p&gt;&lt;strong&gt;Version 2: No proposal replay.&lt;/strong&gt; When an agent failed, the next agent started completely fresh. Users saw inconsistent results. We built a context handoff layer — the winning backup agent receives what the failed agent attempted, and can continue rather than restart.&lt;/p&gt;
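
&lt;p&gt;A guess at the shape of that handoff payload; every field name here is hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface HandoffContext {
  failedAgentId: string;
  attemptedPlan: string;     // what the failed agent tried to do
  partialResults: unknown[]; // anything salvageable from the attempt
  failureReason: string;     // so the backup agent can avoid the same dead end
}
&lt;/code&gt;&lt;/pre&gt;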

&lt;p&gt;The context handoff was 3 weeks of work and cut re-execution time in half.&lt;/p&gt;

&lt;h2&gt;Where This Goes&lt;/h2&gt;

&lt;p&gt;The 201-agent number isn't a ceiling. Every new capability we add is a new agent. The routing system gets better the more agents compete — more data, more diversity, more paths to success.&lt;/p&gt;

&lt;p&gt;We're currently working on agent coalitions: groups of agents that propose to handle a task collaboratively, with shared execution context. The octopus brain, but with arms that can coordinate.&lt;/p&gt;

&lt;p&gt;If you're building agent infrastructure and want to compare notes, we're at &lt;a href="https://sturna.ai" rel="noopener noreferrer"&gt;sturna.ai&lt;/a&gt;. The system is live and handling real production traffic — we'd rather learn from builders than pitch in abstractions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post covers the architecture as it exists today. The numbers are from our internal dashboards as of April 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
