<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shyam Desigan</title>
    <description>The latest articles on Forem by Shyam Desigan (@shyam_desigan_c6b74c32b3c).</description>
    <link>https://forem.com/shyam_desigan_c6b74c32b3c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3933166%2F115363eb-bfc3-44c2-975b-49c4d908830b.png</url>
      <title>Forem: Shyam Desigan</title>
      <link>https://forem.com/shyam_desigan_c6b74c32b3c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shyam_desigan_c6b74c32b3c"/>
    <language>en</language>
    <item>
      <title>I Let an AI Agent Run My Consulting Business For a Week — Here's What Happened</title>
      <dc:creator>Shyam Desigan</dc:creator>
      <pubDate>Sat, 16 May 2026 01:21:28 +0000</pubDate>
      <link>https://forem.com/shyam_desigan_c6b74c32b3c/i-let-an-ai-agent-run-my-consulting-business-for-a-week-heres-what-happened-197n</link>
      <guid>https://forem.com/shyam_desigan_c6b74c32b3c/i-let-an-ai-agent-run-my-consulting-business-for-a-week-heres-what-happened-197n</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/hermes-agent-2026-05-15"&gt;Hermes Agent Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I run a small AI agency called Cubiczan. We help companies build agentic AI systems for finance and supply chain operations. It's consulting work — research-heavy, customized, and time-consuming.&lt;/p&gt;

&lt;p&gt;Recently I told my Openclaw orchestrator agent to create a Hermes subagent and give it full autonomy: schedule, research, decide, deliver. No hand-holding.&lt;/p&gt;

&lt;p&gt;The agent was &lt;strong&gt;Hermes Agent&lt;/strong&gt; by Nous Research — an open-source, self-improving AI agent that can learn from experience, create its own skills, and run long-term workflows entirely independently.&lt;/p&gt;

&lt;p&gt;This is what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Challenge That Changed My Mind
&lt;/h2&gt;

&lt;p&gt;Before Hermes, my workflow looked like this:&lt;/p&gt;

&lt;p&gt;Every morning, I'd spend 45 minutes scanning funding opportunities — SBIR grants, Horizon Europe calls, Innovate UK programs, sovereign AI mandates. I'd read through pages of program descriptions, check deadlines, cross-reference budgets, and try to figure out which ones matched our Finance × Supply Chain specialty.&lt;/p&gt;

&lt;p&gt;It was manual. It was boring. And I kept missing things.&lt;/p&gt;

&lt;p&gt;The real problem wasn't the searching. It was the &lt;strong&gt;context switching&lt;/strong&gt;. Every time I paused client work to research grants, I lost momentum. Every time I found something good, I had to re-verify it the next week because I couldn't remember the details.&lt;/p&gt;

&lt;p&gt;I needed something that could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Learn&lt;/strong&gt; what's relevant to my business&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remember&lt;/strong&gt; what it found across sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improve&lt;/strong&gt; over time without me rewriting prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Work&lt;/strong&gt; while I slept&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hermes Agent claimed it could do all four. I was skeptical. So I set up a test.&lt;/p&gt;




&lt;h2&gt;
  
  
  Week One: The Hands-Off Experiment
&lt;/h2&gt;

&lt;p&gt;My Openclaw orchestrator downloaded Hermes onto a Mac mini in my office closet. One Docker pull, one config file, and the agent was running.&lt;/p&gt;

&lt;p&gt;Then I told it, in plain English: &lt;em&gt;"Every weekday at 11 AM, search for AI funding opportunities across the US, EU, UK, Canada, Singapore, UAE, and Japan. Score each one against our Finance × Supply Chain focus. Post the best ones to our Discord. Get better at this over time."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's it. No YAML. No flow charts. No integration code.&lt;/p&gt;

&lt;p&gt;Here's what happened day by day:&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 1
&lt;/h3&gt;

&lt;p&gt;Hermes ran its first scout. It came back with 4 results — mostly irrelevant. One had a vaguely AI-related title but was actually about agricultural sensors. Another linked to a 2023 PDF for a program that had already closed.&lt;/p&gt;

&lt;p&gt;I almost gave up. But I let it keep going.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 3
&lt;/h3&gt;

&lt;p&gt;Something changed. The results were noticeably better. Hermes had started filtering out expired programs. It cross-checked dates against the SBIR.gov API. It was generating short summaries of each opportunity explaining &lt;em&gt;why&lt;/em&gt; it might fit.&lt;/p&gt;

&lt;p&gt;I didn't teach it any of this. It just... learned.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 5
&lt;/h3&gt;

&lt;p&gt;The scout found a match I'd completely missed: a DARPA AI logistics program that specifically mentioned trade finance in its scope. I'd been scanning SBIR.gov manually for months and never saw it.&lt;/p&gt;

&lt;p&gt;Hermes found it on its fifth autonomous run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 7
&lt;/h3&gt;

&lt;p&gt;By the end of the week, the system had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identified 12 active funding programs&lt;/strong&gt; across 9 regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mapped $50M+ in accessible capital&lt;/strong&gt; with deadlines and fit scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Found 3 direct-pursuit opportunities&lt;/strong&gt; scored at 90+ out of 100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero human input&lt;/strong&gt; after the initial instruction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My 45-minute daily ritual had become a 30-second glance at a Discord notification.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;p&gt;Let me walk through the mechanics, because the magic isn't magic — it's a clever system design that I think more people should understand.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Skill Creation Loop
&lt;/h3&gt;

&lt;p&gt;This is Hermes' killer feature and I haven't seen it done this well anywhere else.&lt;/p&gt;

&lt;p&gt;When Hermes runs a task the first time, it records the full trajectory: search queries, content extraction steps, scoring logic, output formatting. After the run, it compiles this into a &lt;strong&gt;skill&lt;/strong&gt; — a reusable, version-controlled workflow stored on disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hermes created this autonomously after 3 runs&lt;/span&gt;
hermes /cush-scout run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next time you run the skill, Hermes doesn't re-plan from scratch. It loads the existing skill, runs it, then compares the output to previous runs. If the results are better, it updates the skill. If worse, it backtracks.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from prompt engineering. A prompt is static. A skill evolves.&lt;/p&gt;

&lt;p&gt;The practical impact: after 10 runs, my scout was finding 40% more relevant results and producing 60% fewer false positives. Not because I tuned anything — because the system self-optimized.&lt;/p&gt;
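&lt;p&gt;The update-or-backtrack loop is easy to picture in code. Here's a minimal sketch of the pattern (my illustration, not Hermes' actual implementation): keep the new skill version only if its measured score beats the stored one.&lt;/p&gt;

```python
import json
from pathlib import Path

def run_and_maybe_update(skill_path, run_skill, score_results):
    """Run a stored skill; keep the new version only if it scores better.

    skill_path    - JSON file holding the current skill and its best score
    run_skill     - callable(skill) returning (results, candidate_skill)
    score_results - callable(results) returning a float, higher is better
    """
    skill = json.loads(Path(skill_path).read_text())
    results, candidate = run_skill(skill)
    new_score = score_results(results)
    if new_score > skill.get("best_score", float("-inf")):
        candidate["best_score"] = new_score  # promote the improved version
        Path(skill_path).write_text(json.dumps(candidate))
    # otherwise: backtrack by leaving the stored skill untouched
    return results
```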

&lt;h3&gt;
  
  
  Working Memory vs. Prompt Context
&lt;/h3&gt;

&lt;p&gt;Every AI agent struggles with context windows. You pack too much into a prompt and the agent loses coherence. You pack too little and it forgets what's important.&lt;/p&gt;

&lt;p&gt;Hermes solves this with &lt;strong&gt;FTS5 full-text search&lt;/strong&gt; over all past experiences. When it encounters a grant program it's seen before — say, the NSF SBIR AI topic — it searches memory, finds the previous evaluation, and cross-references:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"I scored this at 70 last month"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"The deadline was extended"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Last time I matched this to Supply Chain Finance but missed the Trade Finance angle"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't RAG in the traditional sense. It's more like... institutional memory. The agent builds up context about &lt;em&gt;your specific business&lt;/em&gt; over weeks, not minutes.&lt;/p&gt;
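&lt;p&gt;If you want a feel for how cheap this kind of memory is, FTS5 ships with the &lt;code&gt;sqlite3&lt;/code&gt; module in the Python standard library (assuming your SQLite build includes the FTS5 extension). The schema below is my illustration, not Hermes' actual on-disk format:&lt;/p&gt;

```python
import sqlite3

# Minimal sketch of FTS5-backed agent memory (illustrative schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memory USING fts5(program, note)")
db.executemany(
    "INSERT INTO memory VALUES (?, ?)",
    [
        ("NSF SBIR AI topic", "scored 70 last month; deadline extended"),
        ("EIC Accelerator", "matched Supply Chain Finance, missed Trade Finance angle"),
    ],
)
# Before re-evaluating a program, the agent searches its own past notes.
rows = db.execute(
    "SELECT program, note FROM memory WHERE memory MATCH ?", ("NSF",)
).fetchall()  # returns the previous evaluation of the NSF topic
```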

&lt;h3&gt;
  
  
  Parallel Subagent Architecture
&lt;/h3&gt;

&lt;p&gt;Here's a concrete example of how this works under the hood.&lt;/p&gt;

&lt;p&gt;During each scout run, Hermes spawns 8-12 subagents simultaneously. Each subagent handles one region:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subagent A: Scans NSF/DOD SBIR topics&lt;/li&gt;
&lt;li&gt;Subagent B: Checks EIC Accelerator deadlines&lt;/li&gt;
&lt;li&gt;Subagent C: Reviews Innovate UK competitions&lt;/li&gt;
&lt;li&gt;Subagent D-N: Remainder of regions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each subagent runs independently with its own model assignment. Search parsing uses a fast, cheap model (Hermes Portal — free tier). Scoring and analysis uses a reasoning model (Claude Sonnet via OpenRouter).&lt;/p&gt;

&lt;p&gt;The key insight: subagents don't share context. They report back results independently, and the main session merges them. This means failing subagents don't block the pipeline, and you're not paying for idle context tokens while one slow search catches up.&lt;/p&gt;
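&lt;p&gt;The fan-out/fan-in pattern looks roughly like this in Python (a sketch with placeholder region and scout functions, not Hermes' internals). Note how a failing subagent is simply dropped from the merge instead of blocking it:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# One "subagent" per region; no shared state between them.
REGIONS = ["US", "EU", "UK", "Canada", "Singapore", "UAE", "Japan"]

def scout_region(region):
    if region == "UAE":
        raise RuntimeError("portal timeout")  # a failing subagent
    return [{"region": region, "program": f"{region} AI grant", "score": 80}]

def run_scout(regions):
    results = []
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        futures = {pool.submit(scout_region, r): r for r in regions}
        for fut in futures:
            try:
                results.extend(fut.result())
            except Exception:
                pass  # a failed region does not block the merge
    return results

merged = run_scout(REGIONS)  # six regions succeed, UAE is skipped
```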

&lt;h3&gt;
  
  
  Cron Scheduling in Plain Language
&lt;/h3&gt;

&lt;p&gt;This sounds trivial until you've wrestled with cron syntax one too many times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Scan for funding opportunities every weekday at 11 AM and post to Discord"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hermes parses this into a proper cron expression, registers it, and creates a persistent background job. No &lt;code&gt;0 11 * * 1-5&lt;/code&gt; to remember. No webhook configuration. No separate scheduler service.&lt;/p&gt;
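&lt;p&gt;To make the mapping concrete, here is a toy parser for exactly one phrase shape. Hermes does this with an LLM rather than a regex; this only shows the target output:&lt;/p&gt;

```python
import re

# Toy illustration: map one narrow phrase shape to a cron expression.
def to_cron(phrase):
    m = re.search(r"every weekday at (\d{1,2}) (AM|PM)", phrase)
    if not m:
        raise ValueError("unsupported phrase")
    hour = int(m.group(1)) % 12  # "12 AM" becomes hour 0
    if m.group(2) == "PM":
        hour += 12
    return f"0 {hour} * * 1-5"   # minute, hour, any day/month, Mon-Fri

print(to_cron("Scan for funding every weekday at 11 AM"))  # 0 11 * * 1-5
```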

&lt;p&gt;You can also manage schedules conversationally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Pause the scout until next week"&lt;/span&gt;
&lt;span class="s2"&gt;"Change the delivery to email instead of Discord"&lt;/span&gt;
&lt;span class="s2"&gt;"Run the scout right now"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent handles schedule lifecycle as a first-class capability, not an afterthought.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Got Real
&lt;/h2&gt;

&lt;p&gt;About 10 days in, something interesting happened. Hermes made a mistake.&lt;/p&gt;

&lt;p&gt;It surfaced an opportunity that was clearly wrong — a DHS grant about immigration processing that had nothing to do with supply chain finance. I messaged: &lt;em&gt;"That DHS one doesn't fit. We're finance + supply chain, not immigration tech."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Hermes acknowledged the correction and updated its scoring model for DHS programs.&lt;/p&gt;

&lt;p&gt;Three days later, it surfaced a legitimate DHS SBIR topic about &lt;strong&gt;trade finance compliance at ports of entry&lt;/strong&gt; — a perfect fit that combined customs logistics with financial regulation. I'd never have found it because I'd mentally dismissed DHS as irrelevant.&lt;/p&gt;

&lt;p&gt;The agent had learned a nuance: it's not the &lt;em&gt;agency&lt;/em&gt; that matters, it's the &lt;em&gt;application domain&lt;/em&gt;. DHS runs port logistics. Port logistics involves trade finance. Trade finance is our sweet spot.&lt;/p&gt;

&lt;p&gt;That's the kind of learning no static prompt can capture. It requires a system that actually &lt;em&gt;remembers feedback&lt;/em&gt; and &lt;em&gt;changes its behavior&lt;/em&gt; accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thing Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Here's the part that surprised me most:&lt;/p&gt;

&lt;p&gt;Hermes made me &lt;em&gt;trust&lt;/em&gt; autonomous AI for the first time.&lt;/p&gt;

&lt;p&gt;I've built AI systems for years. I know how they fail. Hallucinations, context drift, catastrophic forgetting — I've seen it all. But Hermes' architecture — skills + memory + learning loop — creates a feedback cycle that makes the system &lt;em&gt;provably&lt;/em&gt; better over time.&lt;/p&gt;

&lt;p&gt;Not "we think it's better." Measurably better. I could track the improvement curve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Week 1: 4 results, mostly wrong&lt;/li&gt;
&lt;li&gt;Week 2: 6 results, 3 warm (correct domain match)&lt;/li&gt;
&lt;li&gt;Week 3: 8 results, 5 warm, 1 hot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trend line doesn't lie.&lt;/p&gt;

&lt;p&gt;And for a solo consultant or small agency, that kind of leverage is the difference between saying "I can't take on more clients" and actually scaling.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for the Future
&lt;/h2&gt;

&lt;p&gt;A year ago, building this system would have required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A full-time engineer&lt;/li&gt;
&lt;li&gt;A cloud budget&lt;/li&gt;
&lt;li&gt;A vector database&lt;/li&gt;
&lt;li&gt;A prompt engineering playbook&lt;/li&gt;
&lt;li&gt;Custom integration code for each tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, it runs on a Mac mini in my closet, costs $5/month in electricity, and required exactly one conversation to set up.&lt;/p&gt;

&lt;p&gt;The agent doesn't just follow instructions. It &lt;em&gt;gets better at its job&lt;/em&gt;. That's a new paradigm. We're used to software that stays the same until we update it. Hermes updates itself by reflecting on what worked and what didn't.&lt;/p&gt;

&lt;p&gt;I think this is where the industry is heading — not just "AI that can do tasks," but "AI that can grow into a role." The implications for small businesses, solo operators, and lean teams are enormous.&lt;/p&gt;




&lt;h2&gt;
  
  
  Give It a Try
&lt;/h2&gt;

&lt;p&gt;If you're curious:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Hermes: &lt;code&gt;pip install hermes-agent&lt;/code&gt; (or Docker pull)&lt;/li&gt;
&lt;li&gt;Run it: &lt;code&gt;hermes run&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tell it what you want: "Scan for ___ every day at noon and report via ___"&lt;/li&gt;
&lt;li&gt;Watch it get better&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Key links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt;&lt;/strong&gt; — The runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://discord.gg/NousResearch" rel="noopener noreferrer"&gt;Nous Research Discord&lt;/a&gt;&lt;/strong&gt; — Community support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://codeberg.org/cubiczan/cubiczan-swarm-pack" rel="noopener noreferrer"&gt;Cubiczan SwarmPack&lt;/a&gt;&lt;/strong&gt; — Our token-free coordination layer (if you want to run multiple Hermes agents)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hardest part isn't the setup. It's letting go enough to trust the system. But once you see the learning loop in action — once you see an agent improve without you — you won't want to go back.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with 🐾 by Shyam Desigan / Cubiczan. Hermes Agent by &lt;a href="https://nousresearch.com" rel="noopener noreferrer"&gt;Nous Research&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>agents</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Consensus-hardening-protocol</title>
      <dc:creator>Shyam Desigan</dc:creator>
      <pubDate>Fri, 15 May 2026 15:50:42 +0000</pubDate>
      <link>https://forem.com/shyam_desigan_c6b74c32b3c/consensus-hardening-protocol-13hj</link>
      <guid>https://forem.com/shyam_desigan_c6b74c32b3c/consensus-hardening-protocol-13hj</guid>
      <description>&lt;p&gt;What I Built&lt;br&gt;
Consensus Hardening Protocol (CHP) — a multi-agent decision governance layer where three specialized AI agents (Finance, Strategy, Compliance) reason through high-stakes decisions using Gemma 4 as their reasoning engine, with adversarial validation, grounding checks, and an explicit lock-state lifecycle that prevents premature consensus.&lt;/p&gt;

&lt;h2&gt;
  The Problem
&lt;/h2&gt;

&lt;p&gt;When organizations deploy multiple AI agents — a finance agent that knows the budget, a strategy agent that understands the market, a compliance agent that enforces regulation — three predictable failures emerge:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context fragmentation:&lt;/strong&gt; Each agent sees a different slice of the organization. Finance recommends spending $4M; strategy plans a market entry that assumes $2M; compliance flags a DPIA requirement nobody mentioned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning opacity:&lt;/strong&gt; You get a confident paragraph from each agent. If it's wrong, you can't tell why it's wrong until it's too late. There's no traceable chain from claim to evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output drift:&lt;/strong&gt; Agents produce prose, but decision-makers need something runnable — a workflow with typed steps, owners, dependencies, and audit trails.&lt;/p&gt;

&lt;p&gt;Single-model prompting can't fix this. You can't solve a coordination failure with a better prompt. You need a protocol.&lt;/p&gt;

&lt;h2&gt;
  The Architecture
&lt;/h2&gt;

&lt;p&gt;CHP composes five subsystems into a hardened decision mesh:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Subsystem&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;CHP Decision Governance&lt;/td&gt;&lt;td&gt;Cross-model hardening with gates, packets, lock states, adversarial attacks&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cognitive Mesh Protocol&lt;/td&gt;&lt;td&gt;Structured expansion-compression reasoning with grounding checks&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Context Engineering Framework&lt;/td&gt;&lt;td&gt;Layered short/long-term memory + entity/event/task schema&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Agentic Context Engineering&lt;/td&gt;&lt;td&gt;Evolving playbooks with delta-only updates (no context collapse)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Statement &amp;amp; Workflow Synthesizer&lt;/td&gt;&lt;td&gt;Turns multi-agent output into executable workflows&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Every agent reads from and writes to shared organizational context. When the finance agent writes a budget recommendation, the strategy agent automatically receives it scored by relevance, recency, and importance — not because a developer hard-coded the routing, but because the context engine routes it based on capability declarations (&lt;code&gt;produces: budget_envelope&lt;/code&gt;, &lt;code&gt;consumes: budget_envelope&lt;/code&gt;).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌──────────────────────────┐
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;┌───── shared ──────▶│   Context Engine         │◀───── shared ─────┐&lt;br&gt;
   │                    │   (entities/events/tasks │                   │&lt;br&gt;
   │                    │    + short/long memory)  │                   │&lt;br&gt;
   │                    └──────────────────────────┘                   │&lt;br&gt;
   ▼                                                                    ▼&lt;br&gt;
┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐&lt;br&gt;
│ Finance Agent      │     │ Strategy Agent     │     │ Compliance Agent   │&lt;br&gt;
│  ├─ Playbook (ACE) │     │  ├─ Playbook (ACE) │     │  ├─ Playbook (ACE) │&lt;br&gt;
│  └─ Protocol (CMP) │     │  └─ Protocol (CMP) │     │  └─ Protocol (CMP) │&lt;br&gt;
└──────────┬─────────┘     └──────────┬─────────┘     └──────────┬─────────┘&lt;br&gt;
           │ produces                 │ consumes+produces        │ consumes&lt;br&gt;
           ▼                          ▼                          ▼&lt;br&gt;
      budget_envelope        market_positioning            risk_register&lt;br&gt;
      roi_model              go_to_market                  mitigations&lt;br&gt;
           │                          │                          │&lt;br&gt;
           └──────────────┬───────────┴──────────────┬───────────┘&lt;br&gt;
                          ▼                          ▼&lt;br&gt;
                 ┌──────────────────────────────────────────┐&lt;br&gt;
                 │  EnterpriseOrchestrator                  │&lt;br&gt;
                 │    - topologically sorts agents          │&lt;br&gt;
                 │    - routes each turn through Protocol   │&lt;br&gt;
                 │    - emits Statement + Workflow          │&lt;br&gt;
                 └──────────────────────────────────────────┘&lt;br&gt;
The orchestrator topologically sorts agents based on their produces and consumes capability declarations. Add a legal agent that consumes: contract_terms and produces: risk_assessment — the orchestrator places it automatically. No hard-coded pipelines.&lt;/p&gt;
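&lt;p&gt;The ordering step can be reproduced with the standard library's &lt;code&gt;graphlib&lt;/code&gt;. This is my sketch of the idea, not CHP's actual code: infer each agent's dependencies from whose outputs it consumes, then topologically sort:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Capability-based ordering: each agent declares what it produces and
# consumes; dependencies are inferred, never hard-coded.
AGENTS = {
    "finance":    {"produces": {"budget_envelope", "roi_model"}, "consumes": set()},
    "strategy":   {"produces": {"market_positioning"}, "consumes": {"budget_envelope"}},
    "compliance": {"produces": {"risk_register"}, "consumes": {"market_positioning"}},
}

def build_order(agents):
    graph = {}
    for name, spec in agents.items():
        deps = set()
        for other, other_spec in agents.items():
            # Depend on any agent whose outputs we consume.
            if other != name and spec["consumes"].intersection(other_spec["produces"]):
                deps.add(other)
        graph[name] = deps
    return list(TopologicalSorter(graph).static_order())

order = build_order(AGENTS)  # finance before strategy before compliance
```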

&lt;h2&gt;
  Why Gemma 4?
&lt;/h2&gt;

&lt;p&gt;When I needed a reasoning engine to power the agent mesh, I chose Gemma 4 31B Dense, the largest model in the family, because multi-agent orchestration demands deep, structured reasoning that smaller models struggle with. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-form reasoning with thinking mode:&lt;/strong&gt; Gemma 4's thinking level can be set to high, producing multi-step chain-of-thought traces. CHP's Cognitive Mesh Protocol requires agents to run a 6-step expansion cycle (Reframe → Constraints → Alternatives → Assumptions → Edge cases → Cross-domain analogy) followed by a compression step. The 31B Dense model handles this structured reasoning pattern without losing coherence across steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grounding and hallucination detection:&lt;/strong&gt; Every claim in CHP must be tagged &lt;code&gt;verified | inferred | pattern-match&lt;/code&gt;. Gemma 4's strong instruction-following and system prompt adherence means it reliably applies these grounding tags without "forgetting" the taxonomy mid-reasoning. Testing showed the 31B model maintained consistent grounding annotation across 95%+ of expansion steps, where the E4B model occasionally dropped tags in the 5th and 6th expansion steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial robustness:&lt;/strong&gt; CHP runs a "foundation attack" — a devil's advocate pass that deliberately tries to find structural vulnerabilities in each agent's reasoning. The 31B Dense model's superior logical consistency means it can both generate strong arguments and withstand adversarial challenges, producing richer adversary traces than smaller models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open weights, local execution:&lt;/strong&gt; Gemma 4 is open-weight and can run locally or via Google AI Studio. For a system designed around audit trails and governance, the ability to run inference in a controlled environment — rather than sending organizational context to a proprietary API — matters. CHP's SuperServe sandbox integration runs proposals in isolated Firecracker microVMs, and running Gemma 4 alongside it in the same controlled infrastructure keeps the entire decision pipeline auditable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-effective at scale:&lt;/strong&gt; For the deterministic demo (no LLM calls), CHP runs with zero external dependencies. But in production, each agent's &lt;code&gt;expand()&lt;/code&gt; and &lt;code&gt;compress()&lt;/code&gt; methods become LLM-powered. The 31B Dense model's quality-per-token ratio means fewer retries, fewer grounding failures, and fewer adversarial re-runs — which directly reduces the cost per decision session.&lt;/p&gt;

&lt;h2&gt;
  How Gemma 4 Powers Each Agent
&lt;/h2&gt;

&lt;p&gt;Each agent in CHP has two LLM-powered methods: &lt;code&gt;expand(problem, context)&lt;/code&gt; and &lt;code&gt;compress(problem, expansion, context)&lt;/code&gt;. Plugging in Gemma 4 looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import google.generativeai as genai


class Gemma4Reasoner:
    """Gemma 4 31B Dense reasoning backend for CHP agents."""

    def __init__(self, model_name="gemma-4-31b"):
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        self.model = genai.GenerativeModel(
            model_name=model_name,
            system_instruction=self._system_prompt(),
            generation_config=genai.types.GenerationConfig(
                temperature=0.7,
                thinking_config=genai.types.ThinkingConfig(
                    thinking_budget=8192,  # High thinking budget
                ),
            ),
        )

    def _system_prompt(self):
        return """You are a decision-analysis agent in a multi-agent mesh.

Every claim you make MUST be tagged with a grounding level:

- [verified] - backed by specific evidence
- [inferred] - logically derived from verified claims
- [pattern-match] - based on observed patterns without direct evidence

Uncertain claims MUST include uncertainty_flags.
Your output must follow the structured expansion-compression protocol."""

    def expand(self, agent_name, problem, context):
        prompt = f"""Agent: {agent_name}
Problem: {problem}
Shared Context: {context}

Run the 6-step expansion cycle:

1. REFRAME: Reformulate the problem to surface hidden assumptions
2. CONSTRAINTS: List binding constraints and their sources
3. ALTERNATIVES: Generate at least 3 distinct approaches
4. ASSUMPTIONS: State every assumption explicitly
5. EDGE CASES: Identify scenarios that break each alternative
6. CROSS-DOMAIN ANALOGY: Find a parallel from a different domain

Each step must include grounding tags."""
        response = self.model.generate_content(prompt)
        return self._parse_expansion(response.text)

    def compress(self, agent_name, problem, expansion, context):
        prompt = f"""Agent: {agent_name}
Problem: {problem}
Expansion:
{expansion}

Shared Context: {context}

Compress into:

1. INTEGRATE: Synthesize the expansion into a clear recommendation
2. COMMIT: State the final position with confidence level
3. FALSIFIABILITY: What evidence would change this recommendation?

Include: grounding tags, uncertainty_flags, and confidence level."""
        response = self.model.generate_content(prompt)
        return self._parse_compression(response.text)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The framework is LLM-agnostic by design. The Gemma4Reasoner drops into the same expand() / compress() interface that the deterministic demo uses. Swap it for GPT-4, Claude, or Llama — the protocol, grounding checks, failure-mode detection, and lock-state governance all work identically.&lt;/p&gt;
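&lt;p&gt;That contract is small enough to write down. A minimal sketch of the duck-typed interface (the &lt;code&gt;Protocol&lt;/code&gt; class and the deterministic stand-in are my illustration; the method names follow the post):&lt;/p&gt;

```python
from typing import Protocol

class Reasoner(Protocol):
    """Anything with expand/compress can power a CHP agent."""
    def expand(self, agent_name: str, problem: str, context: dict) -> dict: ...
    def compress(self, agent_name: str, problem: str, expansion: dict, context: dict) -> dict: ...

class DeterministicReasoner:
    """Offline stand-in for demos and tests: no API calls."""
    def expand(self, agent_name, problem, context):
        return {"alternatives": [f"{agent_name}: baseline plan for {problem}"],
                "grounding": "pattern-match"}
    def compress(self, agent_name, problem, expansion, context):
        return {"recommendation": expansion["alternatives"][0], "confidence": 0.5}

def run_agent(reasoner, agent_name, problem, context):
    """One agent turn: expansion cycle, then compression to a recommendation."""
    expansion = reasoner.expand(agent_name, problem, context)
    return reasoner.compress(agent_name, problem, expansion, context)

result = run_agent(DeterministicReasoner(), "finance", "invest $4M?", {})
```

Swapping in a Gemma-, GPT-, or Claude-backed reasoner changes only the object passed to `run_agent`; the protocol logic never sees the difference.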

&lt;h2&gt;
  The Lock-State Lifecycle
&lt;/h2&gt;

&lt;p&gt;This is what makes CHP different from a simple multi-agent pipeline. Every decision goes through a hardened lifecycle:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;R0 GATE → EXPLORING → PROVISIONAL_LOCK → LOCKED&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R0 Gate:&lt;/strong&gt; Before any agent runs, the proposal passes through a SuperServe sandbox (Firecracker microVM). Static analysis + isolated execution catch code-level issues before they become decision-level issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXPLORING:&lt;/strong&gt; Agents run their expansion-compression cycles. The adversary attacks the reasoning. Grounding checks flag unverified claims. Failure-mode detection catches fossil state (repetition), chaos state (expansion without compression), and hallucination risk (3+ ungrounded claims).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PROVISIONAL_LOCK:&lt;/strong&gt; Two or more agents agree on a recommendation, but consensus alone isn't enough. The system requires payload integrity verification — the partner must echo back the exact packet structure with a &lt;code&gt;PAYLOAD_ECHO&lt;/code&gt; confirmation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LOCKED:&lt;/strong&gt; Only after third-party validation (a separate model pass or human review) does the decision lock. This is the core discipline: consensus is not enough until it is hardened.&lt;/p&gt;
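&lt;p&gt;The lifecycle reduces to a four-state machine where each transition needs its own evidence. A minimal sketch (my illustration, not CHP's actual code):&lt;/p&gt;

```python
from enum import Enum, auto

class LockState(Enum):
    R0_GATE = auto()
    EXPLORING = auto()
    PROVISIONAL_LOCK = auto()
    LOCKED = auto()

# Each transition requires its own evidence; stages cannot be skipped.
TRANSITIONS = {
    LockState.R0_GATE: LockState.EXPLORING,           # sandbox pass
    LockState.EXPLORING: LockState.PROVISIONAL_LOCK,  # 2+ agents agree
    LockState.PROVISIONAL_LOCK: LockState.LOCKED,     # third-party validation
}

def advance(state, evidence_ok):
    """Move one stage forward only when this stage's check passed."""
    if not evidence_ok:
        return state                      # consensus alone is not enough
    return TRANSITIONS.get(state, state)  # LOCKED is terminal
```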

&lt;h2&gt;
  The Executable Workflow Output
&lt;/h2&gt;

&lt;p&gt;The mesh doesn't just produce three recommendations — it produces a Statement and a Workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Statement:
  entry_point: Should we invest $4M in a new enterprise tier?
  tension: Growth requires infrastructure investment, but current
           SMB runway covers only 18 months
  5_whys:
    - Why invest now? → Market window closes Q3
    - Why $4M? → Phased: $2.4M build + $1.6M GTM
    - Why enterprise tier? → $50K+ ACV buyers underrepresented
    - Why not extend SMB? → CAC-to-LTV ratio deteriorates above $15K
    - Why hardened consensus? → Previous lone-CEO decision lost $800K
  consequences:
    strategic: Core-anchor positioning in mid-market
    cultural: Engineering org shifts from product-led to sales-led
    financial: 14-month payback, 60/40 gated by milestone
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workflow:
  - step: S01
    type: BUILD
    owner: Engineering
    inputs: [budget_envelope, technical_specs]
    outputs: [mvp_release]
    depends_on: []

  - step: S02
    type: VALIDATE
    owner: Product
    inputs: [mvp_release, market_positioning]
    outputs: [beta_metrics]
    depends_on: [S01]

  - step: S03
    type: LAUNCH
    owner: GTM
    inputs: [beta_metrics, risk_register]
    outputs: [revenue_stream]
    depends_on: [S02]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That workflow is typed, dependency-ordered, and owner-attributed. Pipe it into Temporal, Airflow, or a cron job and it runs. The &lt;code&gt;depends_on&lt;/code&gt; relationships were inferred automatically from the agents' &lt;code&gt;produces&lt;/code&gt;/&lt;code&gt;consumes&lt;/code&gt; declarations — not hard-coded.&lt;/p&gt;

&lt;h2&gt;
  42 Tests, Zero External Dependencies
&lt;/h2&gt;

&lt;p&gt;The deterministic demo runs entirely offline with zero API calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Cubiczan/consensus-hardening-protocol.git
cd consensus-hardening-protocol
pip install -e .
cme demo "Should we invest $4M in a new enterprise tier?"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The test suite covers protocol rendering, payload integrity, gate enforcement, lock progression, context reuse, strict packet contracts, the adversary runner, CFO accuracy guard, and all 8 finance workflow engines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PYTHONPATH=src pytest tests/ -v  # 42 passing
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Swap the deterministic backend for Gemma 4, and every test still passes — because the protocol, not the model, is what's being tested.&lt;/p&gt;

&lt;h2&gt;
  What's Included
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 finance workflow engines:&lt;/strong&gt; variance studio, 13-week cash forecast, 24-month SaaS model, board reporting, AP optimizer, decision impact simulator, SaaS KPI dashboard, investment committee scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SuperServe sandbox integration:&lt;/strong&gt; proposals run in isolated Firecracker microVMs before entering any protocol state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CFO Operating System:&lt;/strong&gt; multi-agent mesh session with full audit trail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial foundation attack:&lt;/strong&gt; devil's advocate pass that stress-tests every recommendation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Engineering Framework:&lt;/strong&gt; layered memory with entity/event/task schema, auto-promotion, semantic scoring&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Building an Open-Source Consensus Protocol for Multi-Agent AI — Architecture Decisions and Trade-offs</title>
      <dc:creator>Shyam Desigan</dc:creator>
      <pubDate>Fri, 15 May 2026 12:35:08 +0000</pubDate>
      <link>https://forem.com/shyam_desigan_c6b74c32b3c/building-an-open-source-consensus-protocol-for-multi-agent-ai-architecture-decisions-and-2ih9</link>
      <guid>https://forem.com/shyam_desigan_c6b74c32b3c/building-an-open-source-consensus-protocol-for-multi-agent-ai-architecture-decisions-and-2ih9</guid>
      <description>&lt;p&gt;I'm a CFO who builds multi-agent AI systems for finance. This post documents the architecture decisions behind CHP (Consensus Hardening Protocol) — an open-source decision-governance layer I built to prevent false consensus in multi-agent LLM systems.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://codeberg.org/cubiczan/consensus-hardening-protocol" rel="noopener noreferrer"&gt;https://codeberg.org/cubiczan/consensus-hardening-protocol&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Multi-agent systems have a dirty secret: LLM agents don't debate. They agree.&lt;/p&gt;

&lt;p&gt;Put three instances of the same model in a deliberation loop. They converge in 1-2 rounds. Cosine similarity &amp;gt;0.95. The "consensus" is an artifact of shared training, not independent reasoning.&lt;/p&gt;

&lt;p&gt;Even with different prompts, roles, and instructions, same-model agents produce outputs that are nearly identical in structure, conclusion, and confidence. The deliberation is theatrical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Cared
&lt;/h2&gt;

&lt;p&gt;I deploy multi-agent systems for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commodity intelligence across lithium, nickel, and cobalt markets&lt;/li&gt;
&lt;li&gt;CFO variance analysis&lt;/li&gt;
&lt;li&gt;SEC-grade financial research&lt;/li&gt;
&lt;li&gt;Compliance scanning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these domains, a false consensus is a liability. Literally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: State Machine vs. Probabilistic
&lt;/h2&gt;

&lt;p&gt;First decision: deterministic state machine vs. probabilistic convergence scoring.&lt;/p&gt;

&lt;p&gt;I chose the state machine.&lt;/p&gt;

&lt;p&gt;Reason: enterprise compliance teams need inspectable audit trails. They need to see that Agent A committed at timestamp T1 with reasoning R1, that Agent B (adversarial) challenged with counter-argument C1, and that the consensus was accepted because the R0 gate score exceeded the threshold.&lt;/p&gt;

&lt;p&gt;Probabilistic frameworks give you a confidence distribution. State machines give you a decision log. Compliance teams audit logs, not distributions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLORING → ADVISORY_LOCK → PROVISIONAL_LOCK → LOCKED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
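&lt;p&gt;As a rough sketch, that progression can be enforced with an explicit transition table. State names mirror the diagram; the guard reasons and class shape are illustrative, not CHP's actual implementation:&lt;/p&gt;

```python
# Minimal sketch of the lock progression as an explicit state machine.
# Illegal transitions raise instead of silently proceeding, and every
# accepted transition lands in an append-only decision log.
ALLOWED = {
    "EXPLORING": {"ADVISORY_LOCK"},
    "ADVISORY_LOCK": {"PROVISIONAL_LOCK", "EXPLORING"},  # reset on R0 flag
    "PROVISIONAL_LOCK": {"LOCKED", "EXPLORING"},
    "LOCKED": set(),                                     # terminal
}

class ConsensusSession:
    def __init__(self):
        self.state = "EXPLORING"
        self.log = []  # append-only decision log for auditors

    def transition(self, target, reason):
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} to {target}")
        self.log.append((self.state, target, reason))
        self.state = target

session = ConsensusSession()
session.transition("ADVISORY_LOCK", "all foundations disclosed")
session.transition("PROVISIONAL_LOCK", "adversarial challenge survived")
session.transition("LOCKED", "R0 gate score above threshold")
print(session.state)  # LOCKED
```

&lt;p&gt;The decision log is exactly what a compliance reviewer replays: who moved the session, to which state, and why.&lt;/p&gt;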



&lt;h2&gt;
  
  
  Foundation Disclosure
&lt;/h2&gt;

&lt;p&gt;Agents commit to their reasoning BEFORE cross-agent communication.&lt;/p&gt;

&lt;p&gt;Why: anchoring bias. If Agent A shares first, Agents B and C defer. Information cascading turns 3 agents into 1 agent with 3 voices.&lt;/p&gt;

&lt;p&gt;Implementation: each agent produces a sealed payload (reasoning chain + conclusion + confidence) that's encrypted until all agents have committed. Only then are payloads revealed simultaneously.&lt;/p&gt;
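&lt;p&gt;A hash-commitment version of that commit-then-reveal flow fits in a few lines. CHP's payloads are described as encrypted; the SHA-256 commitment below is a simpler illustrative stand-in that gives the same anchoring protection:&lt;/p&gt;

```python
import hashlib
import json

# Illustrative commit-reveal sketch of foundation disclosure. Each agent
# first publishes only a digest of its sealed payload; payloads are
# revealed together once every agent has committed, then verified
# against the digests so nobody can swap conclusions after the fact.

def seal(payload):
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest(), blob

payloads = {
    "agent_a": {"conclusion": "invest", "confidence": 0.72},
    "agent_b": {"conclusion": "defer", "confidence": 0.61},
}

# Phase 1: every agent commits; nothing readable is shared yet.
commitments, sealed = {}, {}
for name, payload in payloads.items():
    digest, blob = seal(payload)
    commitments[name] = digest
    sealed[name] = blob

# Phase 2: simultaneous reveal, checked against the earlier commitments.
revealed = {}
for name, blob in sealed.items():
    assert hashlib.sha256(blob).hexdigest() == commitments[name], "tampered"
    revealed[name] = json.loads(blob)

print(sorted(revealed))  # ['agent_a', 'agent_b']
```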

&lt;h2&gt;
  
  
  Adversarial Layer
&lt;/h2&gt;

&lt;p&gt;Not a soft prompt. A hard constraint.&lt;/p&gt;

&lt;p&gt;The adversarial agent has ONE job: produce a logically valid counter-argument with cited evidence. If it can't, the original conclusion stands. But the attempt is logged — "adversary could not produce a valid challenge" is itself a signal of high-confidence consensus.&lt;/p&gt;

&lt;p&gt;This is structurally different from "temperature: 1.2" or "you are a devil's advocate." Those are prompt-level suggestions that the model can ignore. CHP's adversarial role is an architectural constraint: no valid counter-argument = no state transition to PROVISIONAL_LOCK.&lt;/p&gt;
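&lt;p&gt;Structurally, that looks less like a prompt and more like a guard function. A hypothetical sketch (not the repo's API):&lt;/p&gt;

```python
# Illustrative guard: the PROVISIONAL_LOCK transition is refused outright
# unless an adversarial pass has been run and logged.
def try_provisional_lock(adversarial_pass):
    if adversarial_pass is None:
        # The round is mandatory: skipping it blocks the transition.
        raise RuntimeError("transition refused: no adversarial pass logged")
    challenge = adversarial_pass.get("challenge")
    if challenge and challenge.get("evidence"):
        # A logically valid, evidence-cited counter-argument forces
        # deliberation to continue.
        return "EXPLORING"
    # "Adversary could not produce a valid challenge" is itself logged
    # as a high-confidence signal, and the conclusion stands.
    return "PROVISIONAL_LOCK"

print(try_provisional_lock({"challenge": None}))  # PROVISIONAL_LOCK
```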

&lt;h2&gt;
  
  
  R0 Gate
&lt;/h2&gt;

&lt;p&gt;The convergence detector.&lt;/p&gt;

&lt;p&gt;If inter-agent similarity exceeds threshold T before the adversarial round completes, the system flags the consensus as potentially sycophantic. Deliberation resets with new initialization seeds.&lt;/p&gt;
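&lt;p&gt;The check itself is simple. A toy version using mean pairwise cosine similarity over placeholder vectors (in practice the embeddings would come from an embedding model):&lt;/p&gt;

```python
import math

# Toy R0 convergence check: flag the round as potentially sycophantic
# when mean pairwise cosine similarity of the agents' output embeddings
# exceeds the domain-calibrated threshold T. Vectors are placeholders.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def r0_gate(embeddings, threshold=0.95):
    pairs = [
        cosine(embeddings[i], embeddings[j])
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
    ]
    mean_sim = sum(pairs) / len(pairs)
    return mean_sim, mean_sim > threshold  # True means reset with new seeds

# Three near-identical agent outputs: the gate should flag them.
agents = [[0.90, 0.10, 0.00], [0.88, 0.12, 0.01], [0.91, 0.09, 0.02]]
mean_sim, flagged = r0_gate(agents)
print(flagged)  # True
```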

&lt;p&gt;Calibration: T is set empirically per domain. In finance (where ground truth is verifiable against GL data), I calibrate against known-correct and known-incorrect outcomes. In open-ended domains (strategy, research), T is set conservatively high.&lt;/p&gt;

&lt;p&gt;This is the area where I most want community feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heterogeneous Models
&lt;/h2&gt;

&lt;p&gt;The simplest anti-sycophancy mitigation: don't use the same model.&lt;/p&gt;

&lt;p&gt;My specialist clusters run GPT-4o + Claude + DeepSeek. Different training data, different RLHF, different failure modes. Natural disagreement is higher. Genuine consensus (when it occurs) is more trustworthy because it emerged from heterogeneous reasoning, not shared training artifacts.&lt;/p&gt;

&lt;p&gt;Token economics: MoE Router dispatches to specialist clusters using nano models at $0.02-0.20/M tokens. GroupDebate subgroup partitioning cuts costs 51.7% while preserving accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The R0 gate calibration is manual. I'd like a meta-learning layer that adjusts T based on historical decision accuracy.&lt;/li&gt;
&lt;li&gt;The adversarial role prompting needs more research. Current implementation uses role-based prompting with explicit logical proof requirements. But the quality of adversarial arguments varies significantly across base models.&lt;/li&gt;
&lt;li&gt;Cross-model payload envelope format needs standardization. I'm using a custom JSON schema. An industry standard would make CHP interoperable across platforms.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Full Portfolio
&lt;/h2&gt;

&lt;p&gt;48 repos spanning finance AI, commodity intelligence, compliance automation, blockchain traceability, and swarm trading: &lt;a href="https://codeberg.org/cubiczan" rel="noopener noreferrer"&gt;https://codeberg.org/cubiczan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PRs welcome. Especially on R0 calibration and adversarial prompting.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
