<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shyam Desigan</title>
    <description>The latest articles on Forem by Shyam Desigan (@shyam_desigan_c6b74c32b3c).</description>
    <link>https://forem.com/shyam_desigan_c6b74c32b3c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3933166%2F115363eb-bfc3-44c2-975b-49c4d908830b.png</url>
      <title>Forem: Shyam Desigan</title>
      <link>https://forem.com/shyam_desigan_c6b74c32b3c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shyam_desigan_c6b74c32b3c"/>
    <language>en</language>
    <item>
      <title>I Let an AI Agent Run My Consulting Business For a Week — Here's What Happened</title>
      <dc:creator>Shyam Desigan</dc:creator>
      <pubDate>Sat, 16 May 2026 01:21:28 +0000</pubDate>
      <link>https://forem.com/shyam_desigan_c6b74c32b3c/i-let-an-ai-agent-run-my-consulting-business-for-a-week-heres-what-happened-197n</link>
      <guid>https://forem.com/shyam_desigan_c6b74c32b3c/i-let-an-ai-agent-run-my-consulting-business-for-a-week-heres-what-happened-197n</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/hermes-agent-2026-05-15"&gt;Hermes Agent Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I run a small AI agency called Cubiczan. We help companies build agentic AI systems for finance and supply chain operations. It's consulting work — research-heavy, customized, and time-consuming.&lt;/p&gt;

&lt;p&gt;Recently I told my Openclaw orchestrator agent to create a Hermes subagent and give it full autonomy: schedule, research, decide, deliver. No hand-holding.&lt;/p&gt;

&lt;p&gt;The agent was &lt;strong&gt;Hermes Agent&lt;/strong&gt; by Nous Research — an open-source, self-improving AI agent that can learn from experience, create its own skills, and run long-term workflows entirely independently.&lt;/p&gt;

&lt;p&gt;This is what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Challenge That Changed My Mind
&lt;/h2&gt;

&lt;p&gt;Before Hermes, my workflow looked like this:&lt;/p&gt;

&lt;p&gt;Every morning, I'd spend 45 minutes scanning funding opportunities — SBIR grants, Horizon Europe calls, Innovate UK programs, sovereign AI mandates. I'd read through pages of program descriptions, check deadlines, cross-reference budgets, and try to figure out which ones matched our Finance × Supply Chain specialty.&lt;/p&gt;

&lt;p&gt;It was manual. It was boring. And I kept missing things.&lt;/p&gt;

&lt;p&gt;The real problem wasn't the searching. It was the &lt;strong&gt;context switching&lt;/strong&gt;. Every time I paused client work to research grants, I lost momentum. Every time I found something good, I had to re-verify it the next week because I couldn't remember the details.&lt;/p&gt;

&lt;p&gt;I needed something that could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Learn&lt;/strong&gt; what's relevant to my business&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remember&lt;/strong&gt; what it found across sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improve&lt;/strong&gt; over time without me rewriting prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Work&lt;/strong&gt; while I slept&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hermes Agent claimed it could do all four. I was skeptical. So I set up a test.&lt;/p&gt;




&lt;h2&gt;
  
  
  Week One: The Hands-Off Experiment
&lt;/h2&gt;

&lt;p&gt;My Openclaw orchestrator downloaded Hermes onto a Mac mini in my office closet. One Docker pull, one config file, and the agent was running.&lt;/p&gt;

&lt;p&gt;Then I told it, in plain English: &lt;em&gt;"Every weekday at 11 AM, search for AI funding opportunities across the US, EU, UK, Canada, Singapore, UAE, and Japan. Score each one against our Finance × Supply Chain focus. Post the best ones to our Discord. Get better at this over time."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's it. No YAML. No flow charts. No integration code.&lt;/p&gt;

&lt;p&gt;Here's what happened day by day:&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 1
&lt;/h3&gt;

&lt;p&gt;Hermes ran its first scout. It came back with 4 results — mostly irrelevant. One had a vaguely AI-related title but was actually about agricultural sensors. Another linked to a 2023 PDF for a program that had already closed.&lt;/p&gt;

&lt;p&gt;I almost gave up. But I let it keep going.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 3
&lt;/h3&gt;

&lt;p&gt;Something changed. The results were noticeably better. Hermes had started filtering out expired programs. It cross-checked dates against the SBIR.gov API. It was generating short summaries of each opportunity explaining &lt;em&gt;why&lt;/em&gt; it might fit.&lt;/p&gt;

&lt;p&gt;I didn't teach it any of this. It just... learned.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 5
&lt;/h3&gt;

&lt;p&gt;The scout found a match I'd completely missed: a DARPA AI logistics program that specifically mentioned trade finance in its scope. I'd been scanning SBIR.gov manually for months and never saw it.&lt;/p&gt;

&lt;p&gt;Hermes found it on its fifth autonomous run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 7
&lt;/h3&gt;

&lt;p&gt;By the end of the week, the system had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identified 12 active funding programs&lt;/strong&gt; across 9 regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mapped $50M+ in accessible capital&lt;/strong&gt; with deadlines and fit scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Found 3 direct-pursuit opportunities&lt;/strong&gt; scored at 90+ out of 100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero human input&lt;/strong&gt; after the initial instruction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My 45-minute daily ritual had become a 30-second glance at a Discord notification.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;p&gt;Let me walk through the mechanics, because the magic isn't magic — it's a clever system design that I think more people should understand.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Skill Creation Loop
&lt;/h3&gt;

&lt;p&gt;This is Hermes' killer feature and I haven't seen it done this well anywhere else.&lt;/p&gt;

&lt;p&gt;When Hermes runs a task the first time, it records the full trajectory: search queries, content extraction steps, scoring logic, output formatting. After the run, it compiles this into a &lt;strong&gt;skill&lt;/strong&gt; — a reusable, version-controlled workflow stored on disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hermes created this autonomously after 3 runs&lt;/span&gt;
hermes /cush-scout run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next time you run the skill, Hermes doesn't re-plan from scratch. It loads the existing skill, runs it, then compares the output to previous runs. If the results are better, it updates the skill. If worse, it backtracks.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from prompt engineering. A prompt is static. A skill evolves.&lt;/p&gt;

&lt;p&gt;The practical impact: after 10 runs, my scout was finding 40% more relevant results and producing 60% fewer false positives. Not because I tuned anything — because the system self-optimized.&lt;/p&gt;
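&lt;p&gt;The update-or-backtrack loop is easy to picture in code. Here's a minimal sketch of the pattern (my illustration, not Hermes' actual implementation): keep the new skill version only if its measured score beats the stored one.&lt;/p&gt;

```python
import json
from pathlib import Path

def run_and_maybe_update(skill_path, run_skill, score_results):
    """Run a stored skill; keep the new version only if it scores better.

    skill_path    - JSON file holding the current skill and its best score
    run_skill     - callable(skill) returning (results, candidate_skill)
    score_results - callable(results) returning a float, higher is better
    """
    skill = json.loads(Path(skill_path).read_text())
    results, candidate = run_skill(skill)
    new_score = score_results(results)
    if new_score > skill.get("best_score", float("-inf")):
        candidate["best_score"] = new_score  # promote the improved version
        Path(skill_path).write_text(json.dumps(candidate))
    # otherwise: backtrack by leaving the stored skill untouched
    return results
```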

&lt;h3&gt;
  
  
  Working Memory vs. Prompt Context
&lt;/h3&gt;

&lt;p&gt;Every AI agent struggles with context windows. You pack too much into a prompt and the agent loses coherence. You pack too little and it forgets what's important.&lt;/p&gt;

&lt;p&gt;Hermes solves this with &lt;strong&gt;FTS5 full-text search&lt;/strong&gt; over all past experiences. When it encounters a grant program it's seen before — say, the NSF SBIR AI topic — it searches memory, finds the previous evaluation, and cross-references:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"I scored this at 70 last month"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"The deadline was extended"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Last time I matched this to Supply Chain Finance but missed the Trade Finance angle"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't RAG in the traditional sense. It's more like... institutional memory. The agent builds up context about &lt;em&gt;your specific business&lt;/em&gt; over weeks, not minutes.&lt;/p&gt;
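&lt;p&gt;If you want a feel for how cheap this kind of memory is, FTS5 ships with the &lt;code&gt;sqlite3&lt;/code&gt; module in the Python standard library (assuming your SQLite build includes the FTS5 extension). The schema below is my illustration, not Hermes' actual on-disk format:&lt;/p&gt;

```python
import sqlite3

# Minimal sketch of FTS5-backed agent memory (illustrative schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memory USING fts5(program, note)")
db.executemany(
    "INSERT INTO memory VALUES (?, ?)",
    [
        ("NSF SBIR AI topic", "scored 70 last month; deadline extended"),
        ("EIC Accelerator", "matched Supply Chain Finance, missed Trade Finance angle"),
    ],
)
# Before re-evaluating a program, the agent searches its own past notes.
rows = db.execute(
    "SELECT program, note FROM memory WHERE memory MATCH ?", ("NSF",)
).fetchall()  # returns the previous evaluation of the NSF topic
```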

&lt;h3&gt;
  
  
  Parallel Subagent Architecture
&lt;/h3&gt;

&lt;p&gt;Here's a concrete example of how this works under the hood.&lt;/p&gt;

&lt;p&gt;During each scout run, Hermes spawns 8-12 subagents simultaneously. Each subagent handles one region:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subagent A: Scans NSF/DOD SBIR topics&lt;/li&gt;
&lt;li&gt;Subagent B: Checks EIC Accelerator deadlines&lt;/li&gt;
&lt;li&gt;Subagent C: Reviews Innovate UK competitions&lt;/li&gt;
&lt;li&gt;Subagent D-N: Remainder of regions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each subagent runs independently with its own model assignment. Search parsing uses a fast, cheap model (Hermes Portal — free tier). Scoring and analysis uses a reasoning model (Claude Sonnet via OpenRouter).&lt;/p&gt;

&lt;p&gt;The key insight: subagents don't share context. They report back results independently, and the main session merges them. This means failing subagents don't block the pipeline, and you're not paying for idle context tokens while one slow search catches up.&lt;/p&gt;
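&lt;p&gt;The fan-out/fan-in pattern looks roughly like this in Python (a sketch with placeholder region and scout functions, not Hermes' internals). Note how a failing subagent is simply dropped from the merge instead of blocking it:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# One "subagent" per region; no shared state between them.
REGIONS = ["US", "EU", "UK", "Canada", "Singapore", "UAE", "Japan"]

def scout_region(region):
    if region == "UAE":
        raise RuntimeError("portal timeout")  # a failing subagent
    return [{"region": region, "program": f"{region} AI grant", "score": 80}]

def run_scout(regions):
    results = []
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        futures = {pool.submit(scout_region, r): r for r in regions}
        for fut in futures:
            try:
                results.extend(fut.result())
            except Exception:
                pass  # a failed region does not block the merge
    return results

merged = run_scout(REGIONS)  # six regions succeed, UAE is skipped
```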

&lt;h3&gt;
  
  
  Cron Scheduling in Plain Language
&lt;/h3&gt;

&lt;p&gt;This sounds trivial until you've wrestled with cron syntax one too many times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Scan for funding opportunities every weekday at 11 AM and post to Discord"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hermes parses this into a proper cron expression, registers it, and creates a persistent background job. No &lt;code&gt;0 11 * * 1-5&lt;/code&gt; to remember. No webhook configuration. No separate scheduler service.&lt;/p&gt;
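&lt;p&gt;To make the mapping concrete, here is a toy parser for exactly one phrase shape. Hermes does this with an LLM rather than a regex; this only shows the target output:&lt;/p&gt;

```python
import re

# Toy illustration: map one narrow phrase shape to a cron expression.
def to_cron(phrase):
    m = re.search(r"every weekday at (\d{1,2}) (AM|PM)", phrase)
    if not m:
        raise ValueError("unsupported phrase")
    hour = int(m.group(1)) % 12  # "12 AM" becomes hour 0
    if m.group(2) == "PM":
        hour += 12
    return f"0 {hour} * * 1-5"   # minute, hour, any day/month, Mon-Fri

print(to_cron("Scan for funding every weekday at 11 AM"))  # 0 11 * * 1-5
```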

&lt;p&gt;You can also manage schedules conversationally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Pause the scout until next week"&lt;/span&gt;
&lt;span class="s2"&gt;"Change the delivery to email instead of Discord"&lt;/span&gt;
&lt;span class="s2"&gt;"Run the scout right now"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent handles schedule lifecycle as a first-class capability, not an afterthought.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Got Real
&lt;/h2&gt;

&lt;p&gt;About 10 days in, something interesting happened. Hermes made a mistake.&lt;/p&gt;

&lt;p&gt;It surfaced an opportunity that was clearly wrong — a DHS grant about immigration processing that had nothing to do with supply chain finance. I messaged: &lt;em&gt;"That DHS one doesn't fit. We're finance + supply chain, not immigration tech."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Hermes acknowledged the correction and updated its scoring model for DHS programs.&lt;/p&gt;

&lt;p&gt;Three days later, it surfaced a legitimate DHS SBIR topic about &lt;strong&gt;trade finance compliance at ports of entry&lt;/strong&gt; — a perfect fit that combined customs logistics with financial regulation. I'd never have found it because I'd mentally dismissed DHS as irrelevant.&lt;/p&gt;

&lt;p&gt;The agent had learned a nuance: it's not the &lt;em&gt;agency&lt;/em&gt; that matters, it's the &lt;em&gt;application domain&lt;/em&gt;. DHS runs port logistics. Port logistics involves trade finance. Trade finance is our sweet spot.&lt;/p&gt;

&lt;p&gt;That's the kind of learning no static prompt can capture. It requires a system that actually &lt;em&gt;remembers feedback&lt;/em&gt; and &lt;em&gt;changes its behavior&lt;/em&gt; accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thing Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Here's the part that surprised me most:&lt;/p&gt;

&lt;p&gt;Hermes made me &lt;em&gt;trust&lt;/em&gt; autonomous AI for the first time.&lt;/p&gt;

&lt;p&gt;I've built AI systems for years. I know how they fail. Hallucinations, context drift, catastrophic forgetting — I've seen it all. But Hermes' architecture — skills + memory + learning loop — creates a feedback cycle that makes the system &lt;em&gt;provably&lt;/em&gt; better over time.&lt;/p&gt;

&lt;p&gt;Not "we think it's better." Measurably better. I could track the improvement curve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Week 1: 4 results, mostly wrong&lt;/li&gt;
&lt;li&gt;Week 2: 6 results, 3 warm (correct domain match)&lt;/li&gt;
&lt;li&gt;Week 3: 8 results, 5 warm, 1 hot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trend line doesn't lie.&lt;/p&gt;

&lt;p&gt;And for a solo consultant or small agency, that kind of leverage is the difference between saying "I can't take on more clients" and actually scaling.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for the Future
&lt;/h2&gt;

&lt;p&gt;A year ago, building this system would have required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A full-time engineer&lt;/li&gt;
&lt;li&gt;A cloud budget&lt;/li&gt;
&lt;li&gt;A vector database&lt;/li&gt;
&lt;li&gt;A prompt engineering playbook&lt;/li&gt;
&lt;li&gt;Custom integration code for each tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, it runs on a Mac mini in my closet, costs $5/month in electricity, and required exactly one conversation to set up.&lt;/p&gt;

&lt;p&gt;The agent doesn't just follow instructions. It &lt;em&gt;gets better at its job&lt;/em&gt;. That's a new paradigm. We're used to software that stays the same until we update it. Hermes updates itself by reflecting on what worked and what didn't.&lt;/p&gt;

&lt;p&gt;I think this is where the industry is heading — not just "AI that can do tasks," but "AI that can grow into a role." The implications for small businesses, solo operators, and lean teams are enormous.&lt;/p&gt;




&lt;h2&gt;
  
  
  Give It a Try
&lt;/h2&gt;

&lt;p&gt;If you're curious:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Hermes: &lt;code&gt;pip install hermes-agent&lt;/code&gt; (or Docker pull)&lt;/li&gt;
&lt;li&gt;Run it: &lt;code&gt;hermes run&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tell it what you want: "Scan for ___ every day at noon and report via ___"&lt;/li&gt;
&lt;li&gt;Watch it get better&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Key links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt;&lt;/strong&gt; — The runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://discord.gg/NousResearch" rel="noopener noreferrer"&gt;Nous Research Discord&lt;/a&gt;&lt;/strong&gt; — Community support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://codeberg.org/cubiczan/cubiczan-swarm-pack" rel="noopener noreferrer"&gt;Cubiczan SwarmPack&lt;/a&gt;&lt;/strong&gt; — Our token-free coordination layer (if you want to run multiple Hermes agents)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hardest part isn't the setup. It's letting go enough to trust the system. But once you see the learning loop in action — once you see an agent improve without you — you won't want to go back.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with 🐾 by Shyam Desigan / Cubiczan. Hermes Agent by &lt;a href="https://nousresearch.com" rel="noopener noreferrer"&gt;Nous Research&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>agents</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Consensus-hardening-protocol</title>
      <dc:creator>Shyam Desigan</dc:creator>
      <pubDate>Fri, 15 May 2026 15:50:42 +0000</pubDate>
      <link>https://forem.com/shyam_desigan_c6b74c32b3c/consensus-hardening-protocol-13hj</link>
      <guid>https://forem.com/shyam_desigan_c6b74c32b3c/consensus-hardening-protocol-13hj</guid>
      <description>&lt;p&gt;What I Built&lt;br&gt;
Consensus Hardening Protocol (CHP) — a multi-agent decision governance layer where three specialized AI agents (Finance, Strategy, Compliance) reason through high-stakes decisions using Gemma 4 as their reasoning engine, with adversarial validation, grounding checks, and an explicit lock-state lifecycle that prevents premature consensus.&lt;/p&gt;

&lt;h2&gt;
  The Problem
&lt;/h2&gt;

&lt;p&gt;When organizations deploy multiple AI agents — a finance agent that knows the budget, a strategy agent that understands the market, a compliance agent that enforces regulation — three predictable failures emerge:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context fragmentation:&lt;/strong&gt; Each agent sees a different slice of the organization. Finance recommends spending $4M; strategy plans a market entry that assumes $2M; compliance flags a DPIA requirement nobody mentioned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning opacity:&lt;/strong&gt; You get a confident paragraph from each agent. If it's wrong, you can't tell why it's wrong until it's too late. There's no traceable chain from claim to evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output drift:&lt;/strong&gt; Agents produce prose, but decision-makers need something runnable — a workflow with typed steps, owners, dependencies, and audit trails.&lt;/p&gt;

&lt;p&gt;Single-model prompting can't fix this. You can't solve a coordination failure with a better prompt. You need a protocol.&lt;/p&gt;

&lt;h2&gt;
  The Architecture
&lt;/h2&gt;

&lt;p&gt;CHP composes five subsystems into a hardened decision mesh:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Subsystem&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;CHP Decision Governance&lt;/td&gt;&lt;td&gt;Cross-model hardening with gates, packets, lock states, adversarial attacks&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cognitive Mesh Protocol&lt;/td&gt;&lt;td&gt;Structured expansion-compression reasoning with grounding checks&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Context Engineering Framework&lt;/td&gt;&lt;td&gt;Layered short/long-term memory + entity/event/task schema&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Agentic Context Engineering&lt;/td&gt;&lt;td&gt;Evolving playbooks with delta-only updates (no context collapse)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Statement &amp;amp; Workflow Synthesizer&lt;/td&gt;&lt;td&gt;Turns multi-agent output into executable workflows&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Every agent reads from and writes to shared organizational context. When the finance agent writes a budget recommendation, the strategy agent automatically receives it scored by relevance, recency, and importance — not because a developer hard-coded the routing, but because the context engine routes it based on capability declarations (&lt;code&gt;produces: budget_envelope&lt;/code&gt;, &lt;code&gt;consumes: budget_envelope&lt;/code&gt;).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌──────────────────────────┐
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;┌───── shared ──────▶│   Context Engine         │◀───── shared ─────┐&lt;br&gt;
   │                    │   (entities/events/tasks │                   │&lt;br&gt;
   │                    │    + short/long memory)  │                   │&lt;br&gt;
   │                    └──────────────────────────┘                   │&lt;br&gt;
   ▼                                                                    ▼&lt;br&gt;
┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐&lt;br&gt;
│ Finance Agent      │     │ Strategy Agent     │     │ Compliance Agent   │&lt;br&gt;
│  ├─ Playbook (ACE) │     │  ├─ Playbook (ACE) │     │  ├─ Playbook (ACE) │&lt;br&gt;
│  └─ Protocol (CMP) │     │  └─ Protocol (CMP) │     │  └─ Protocol (CMP) │&lt;br&gt;
└──────────┬─────────┘     └──────────┬─────────┘     └──────────┬─────────┘&lt;br&gt;
           │ produces                 │ consumes+produces        │ consumes&lt;br&gt;
           ▼                          ▼                          ▼&lt;br&gt;
      budget_envelope        market_positioning            risk_register&lt;br&gt;
      roi_model              go_to_market                  mitigations&lt;br&gt;
           │                          │                          │&lt;br&gt;
           └──────────────┬───────────┴──────────────┬───────────┘&lt;br&gt;
                          ▼                          ▼&lt;br&gt;
                 ┌──────────────────────────────────────────┐&lt;br&gt;
                 │  EnterpriseOrchestrator                  │&lt;br&gt;
                 │    - topologically sorts agents          │&lt;br&gt;
                 │    - routes each turn through Protocol   │&lt;br&gt;
                 │    - emits Statement + Workflow          │&lt;br&gt;
                 └──────────────────────────────────────────┘&lt;br&gt;
The orchestrator topologically sorts agents based on their produces and consumes capability declarations. Add a legal agent that consumes: contract_terms and produces: risk_assessment — the orchestrator places it automatically. No hard-coded pipelines.&lt;/p&gt;
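&lt;p&gt;The ordering step can be reproduced with the standard library's &lt;code&gt;graphlib&lt;/code&gt;. This is my sketch of the idea, not CHP's actual code: infer each agent's dependencies from whose outputs it consumes, then topologically sort:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Capability-based ordering: each agent declares what it produces and
# consumes; dependencies are inferred, never hard-coded.
AGENTS = {
    "finance":    {"produces": {"budget_envelope", "roi_model"}, "consumes": set()},
    "strategy":   {"produces": {"market_positioning"}, "consumes": {"budget_envelope"}},
    "compliance": {"produces": {"risk_register"}, "consumes": {"market_positioning"}},
}

def build_order(agents):
    graph = {}
    for name, spec in agents.items():
        deps = set()
        for other, other_spec in agents.items():
            # Depend on any agent whose outputs we consume.
            if other != name and spec["consumes"].intersection(other_spec["produces"]):
                deps.add(other)
        graph[name] = deps
    return list(TopologicalSorter(graph).static_order())

order = build_order(AGENTS)  # finance before strategy before compliance
```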

&lt;h2&gt;
  Why Gemma 4?
&lt;/h2&gt;

&lt;p&gt;When I needed a reasoning engine to power the agent mesh, I chose Gemma 4 31B Dense, the largest model in the family, because multi-agent orchestration demands deep, structured reasoning that smaller models struggle with. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-form reasoning with thinking mode:&lt;/strong&gt; Gemma 4's thinking level can be set to high, producing multi-step chain-of-thought traces. CHP's Cognitive Mesh Protocol requires agents to run a 6-step expansion cycle (Reframe → Constraints → Alternatives → Assumptions → Edge cases → Cross-domain analogy) followed by a compression step. The 31B Dense model handles this structured reasoning pattern without losing coherence across steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grounding and hallucination detection:&lt;/strong&gt; Every claim in CHP must be tagged &lt;code&gt;verified | inferred | pattern-match&lt;/code&gt;. Gemma 4's strong instruction-following and system prompt adherence means it reliably applies these grounding tags without "forgetting" the taxonomy mid-reasoning. Testing showed the 31B model maintained consistent grounding annotation across 95%+ of expansion steps, where the E4B model occasionally dropped tags in the 5th and 6th expansion steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial robustness:&lt;/strong&gt; CHP runs a "foundation attack" — a devil's advocate pass that deliberately tries to find structural vulnerabilities in each agent's reasoning. The 31B Dense model's superior logical consistency means it can both generate strong arguments and withstand adversarial challenges, producing richer adversary traces than smaller models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open weights, local execution:&lt;/strong&gt; Gemma 4 is open-weight and can run locally or via Google AI Studio. For a system designed around audit trails and governance, the ability to run inference in a controlled environment — rather than sending organizational context to a proprietary API — matters. CHP's SuperServe sandbox integration runs proposals in isolated Firecracker microVMs, and running Gemma 4 alongside it in the same controlled infrastructure keeps the entire decision pipeline auditable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-effective at scale:&lt;/strong&gt; For the deterministic demo (no LLM calls), CHP runs with zero external dependencies. But in production, each agent's &lt;code&gt;expand()&lt;/code&gt; and &lt;code&gt;compress()&lt;/code&gt; methods become LLM-powered. The 31B Dense model's quality-per-token ratio means fewer retries, fewer grounding failures, and fewer adversarial re-runs — which directly reduces the cost per decision session.&lt;/p&gt;

&lt;h2&gt;
  How Gemma 4 Powers Each Agent
&lt;/h2&gt;

&lt;p&gt;Each agent in CHP has two LLM-powered methods: &lt;code&gt;expand(problem, context)&lt;/code&gt; and &lt;code&gt;compress(problem, expansion, context)&lt;/code&gt;. Plugging in Gemma 4 looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import google.generativeai as genai


class Gemma4Reasoner:
    """Gemma 4 31B Dense reasoning backend for CHP agents."""

    def __init__(self, model_name="gemma-4-31b"):
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        self.model = genai.GenerativeModel(
            model_name=model_name,
            system_instruction=self._system_prompt(),
            generation_config=genai.types.GenerationConfig(
                temperature=0.7,
                thinking_config=genai.types.ThinkingConfig(
                    thinking_budget=8192,  # High thinking budget
                ),
            ),
        )

    def _system_prompt(self):
        return """You are a decision-analysis agent in a multi-agent mesh.

Every claim you make MUST be tagged with a grounding level:

- [verified] - backed by specific evidence
- [inferred] - logically derived from verified claims
- [pattern-match] - based on observed patterns without direct evidence

Uncertain claims MUST include uncertainty_flags.
Your output must follow the structured expansion-compression protocol."""

    def expand(self, agent_name, problem, context):
        prompt = f"""Agent: {agent_name}
Problem: {problem}
Shared Context: {context}

Run the 6-step expansion cycle:

1. REFRAME: Reformulate the problem to surface hidden assumptions
2. CONSTRAINTS: List binding constraints and their sources
3. ALTERNATIVES: Generate at least 3 distinct approaches
4. ASSUMPTIONS: State every assumption explicitly
5. EDGE CASES: Identify scenarios that break each alternative
6. CROSS-DOMAIN ANALOGY: Find a parallel from a different domain

Each step must include grounding tags."""
        response = self.model.generate_content(prompt)
        return self._parse_expansion(response.text)

    def compress(self, agent_name, problem, expansion, context):
        prompt = f"""Agent: {agent_name}
Problem: {problem}
Expansion:
{expansion}

Shared Context: {context}

Compress into:

1. INTEGRATE: Synthesize the expansion into a clear recommendation
2. COMMIT: State the final position with confidence level
3. FALSIFIABILITY: What evidence would change this recommendation?

Include: grounding tags, uncertainty_flags, and confidence level."""
        response = self.model.generate_content(prompt)
        return self._parse_compression(response.text)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The framework is LLM-agnostic by design. The Gemma4Reasoner drops into the same expand() / compress() interface that the deterministic demo uses. Swap it for GPT-4, Claude, or Llama — the protocol, grounding checks, failure-mode detection, and lock-state governance all work identically.&lt;/p&gt;
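&lt;p&gt;That contract is small enough to write down. A minimal sketch of the duck-typed interface (the &lt;code&gt;Protocol&lt;/code&gt; class and the deterministic stand-in are my illustration; the method names follow the post):&lt;/p&gt;

```python
from typing import Protocol

class Reasoner(Protocol):
    """Anything with expand/compress can power a CHP agent."""
    def expand(self, agent_name: str, problem: str, context: dict) -> dict: ...
    def compress(self, agent_name: str, problem: str, expansion: dict, context: dict) -> dict: ...

class DeterministicReasoner:
    """Offline stand-in for demos and tests: no API calls."""
    def expand(self, agent_name, problem, context):
        return {"alternatives": [f"{agent_name}: baseline plan for {problem}"],
                "grounding": "pattern-match"}
    def compress(self, agent_name, problem, expansion, context):
        return {"recommendation": expansion["alternatives"][0], "confidence": 0.5}

def run_agent(reasoner, agent_name, problem, context):
    """One agent turn: expansion cycle, then compression to a recommendation."""
    expansion = reasoner.expand(agent_name, problem, context)
    return reasoner.compress(agent_name, problem, expansion, context)

result = run_agent(DeterministicReasoner(), "finance", "invest $4M?", {})
```

Swapping in a Gemma-, GPT-, or Claude-backed reasoner changes only the object passed to `run_agent`; the protocol logic never sees the difference.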

&lt;h2&gt;
  The Lock-State Lifecycle
&lt;/h2&gt;

&lt;p&gt;This is what makes CHP different from a simple multi-agent pipeline. Every decision goes through a hardened lifecycle:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;R0 GATE → EXPLORING → PROVISIONAL_LOCK → LOCKED&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R0 Gate:&lt;/strong&gt; Before any agent runs, the proposal passes through a SuperServe sandbox (Firecracker microVM). Static analysis + isolated execution catch code-level issues before they become decision-level issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXPLORING:&lt;/strong&gt; Agents run their expansion-compression cycles. The adversary attacks the reasoning. Grounding checks flag unverified claims. Failure-mode detection catches fossil state (repetition), chaos state (expansion without compression), and hallucination risk (3+ ungrounded claims).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PROVISIONAL_LOCK:&lt;/strong&gt; Two or more agents agree on a recommendation, but consensus alone isn't enough. The system requires payload integrity verification — the partner must echo back the exact packet structure with a &lt;code&gt;PAYLOAD_ECHO&lt;/code&gt; confirmation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LOCKED:&lt;/strong&gt; Only after third-party validation (a separate model pass or human review) does the decision lock. This is the core discipline: consensus is not enough until it is hardened.&lt;/p&gt;
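&lt;p&gt;The lifecycle reduces to a four-state machine where each transition needs its own evidence. A minimal sketch (my illustration, not CHP's actual code):&lt;/p&gt;

```python
from enum import Enum, auto

class LockState(Enum):
    R0_GATE = auto()
    EXPLORING = auto()
    PROVISIONAL_LOCK = auto()
    LOCKED = auto()

# Each transition requires its own evidence; stages cannot be skipped.
TRANSITIONS = {
    LockState.R0_GATE: LockState.EXPLORING,           # sandbox pass
    LockState.EXPLORING: LockState.PROVISIONAL_LOCK,  # 2+ agents agree
    LockState.PROVISIONAL_LOCK: LockState.LOCKED,     # third-party validation
}

def advance(state, evidence_ok):
    """Move one stage forward only when this stage's check passed."""
    if not evidence_ok:
        return state                      # consensus alone is not enough
    return TRANSITIONS.get(state, state)  # LOCKED is terminal
```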

&lt;h2&gt;
  The Executable Workflow Output
&lt;/h2&gt;

&lt;p&gt;The mesh doesn't just produce three recommendations — it produces a Statement and a Workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Statement:
  entry_point: Should we invest $4M in a new enterprise tier?
  tension: Growth requires infrastructure investment, but current
           SMB runway covers only 18 months
  5_whys:
    - Why invest now? → Market window closes Q3
    - Why $4M? → Phased: $2.4M build + $1.6M GTM
    - Why enterprise tier? → $50K+ ACV buyers underrepresented
    - Why not extend SMB? → CAC-to-LTV ratio deteriorates above $15K
    - Why hardened consensus? → Previous lone-CEO decision lost $800K
  consequences:
    strategic: Core-anchor positioning in mid-market
    cultural: Engineering org shifts from product-led to sales-led
    financial: 14-month payback, 60/40 gated by milestone
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workflow:
  - step: S01
    type: BUILD
    owner: Engineering
    inputs: [budget_envelope, technical_specs]
    outputs: [mvp_release]
    depends_on: []

  - step: S02
    type: VALIDATE
    owner: Product
    inputs: [mvp_release, market_positioning]
    outputs: [beta_metrics]
    depends_on: [S01]

  - step: S03
    type: LAUNCH
    owner: GTM
    inputs: [beta_metrics, risk_register]
    outputs: [revenue_stream]
    depends_on: [S02]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That workflow is typed, dependency-ordered, and owner-attributed. Pipe it into Temporal, Airflow, or a cron job and it runs. The &lt;code&gt;depends_on&lt;/code&gt; relationships were inferred automatically from the agents' &lt;code&gt;produces&lt;/code&gt;/&lt;code&gt;consumes&lt;/code&gt; declarations — not hard-coded.&lt;/p&gt;

&lt;h2&gt;
  42 Tests, Zero External Dependencies
&lt;/h2&gt;

&lt;p&gt;The deterministic demo runs entirely offline with zero API calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Cubiczan/consensus-hardening-protocol.git
cd consensus-hardening-protocol
pip install -e .
cme demo "Should we invest $4M in a new enterprise tier?"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The test suite covers protocol rendering, payload integrity, gate enforcement, lock progression, context reuse, strict packet contracts, the adversary runner, CFO accuracy guard, and all 8 finance workflow engines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PYTHONPATH=src pytest tests/ -v  # 42 passing
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Swap the deterministic backend for Gemma 4, and every test still passes — because the protocol, not the model, is what's being tested.&lt;/p&gt;

&lt;h2&gt;
  What's Included
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 finance workflow engines:&lt;/strong&gt; variance studio, 13-week cash forecast, 24-month SaaS model, board reporting, AP optimizer, decision impact simulator, SaaS KPI dashboard, investment committee scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SuperServe sandbox integration:&lt;/strong&gt; proposals run in isolated Firecracker microVMs before entering any protocol state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CFO Operating System:&lt;/strong&gt; multi-agent mesh session with full audit trail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial foundation attack:&lt;/strong&gt; devil's advocate pass that stress-tests every recommendation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Engineering Framework:&lt;/strong&gt; layered memory with entity/event/task schema, auto-promotion, semantic scoring&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Building an Open-Source Consensus Protocol for Multi-Agent AI — Architecture Decisions and Trade-offs</title>
      <dc:creator>Shyam Desigan</dc:creator>
      <pubDate>Fri, 15 May 2026 12:35:08 +0000</pubDate>
      <link>https://forem.com/shyam_desigan_c6b74c32b3c/building-an-open-source-consensus-protocol-for-multi-agent-ai-architecture-decisions-and-2ih9</link>
      <guid>https://forem.com/shyam_desigan_c6b74c32b3c/building-an-open-source-consensus-protocol-for-multi-agent-ai-architecture-decisions-and-2ih9</guid>
      <description>&lt;p&gt;I'm a CFO who builds multi-agent AI systems for finance. This post documents the architecture decisions behind CHP (Consensus Hardening Protocol) — an open-source decision-governance layer I built to prevent false consensus in multi-agent LLM systems.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://codeberg.org/cubiczan/consensus-hardening-protocol" rel="noopener noreferrer"&gt;https://codeberg.org/cubiczan/consensus-hardening-protocol&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Multi-agent systems have a dirty secret: LLM agents don't debate. They agree.&lt;/p&gt;

&lt;p&gt;Put three instances of the same model in a deliberation loop. They converge in 1-2 rounds. Cosine similarity &amp;gt;0.95. The "consensus" is an artifact of shared training, not independent reasoning.&lt;/p&gt;

&lt;p&gt;Even with different prompts, roles, and instructions, same-model agents produce outputs that are nearly identical in structure, conclusion, and confidence. The deliberation is theatrical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Cared
&lt;/h2&gt;

&lt;p&gt;I deploy multi-agent systems for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commodity intelligence across lithium, nickel, and cobalt markets&lt;/li&gt;
&lt;li&gt;CFO variance analysis&lt;/li&gt;
&lt;li&gt;SEC-grade financial research&lt;/li&gt;
&lt;li&gt;Compliance scanning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these domains, a false consensus is a liability. Literally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: State Machine vs. Probabilistic
&lt;/h2&gt;

&lt;p&gt;First decision: deterministic state machine vs. probabilistic convergence scoring.&lt;/p&gt;

&lt;p&gt;I chose the state machine.&lt;/p&gt;

&lt;p&gt;Reason: enterprise compliance teams need inspectable audit trails. They need to see that Agent A committed at timestamp T1 with reasoning R1, that Agent B (adversarial) challenged with counter-argument C1, and that the consensus was accepted because the R0 gate score exceeded the threshold.&lt;/p&gt;

&lt;p&gt;Probabilistic frameworks give you a confidence distribution. State machines give you a decision log. Compliance teams audit logs, not distributions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLORING → ADVISORY_LOCK → PROVISIONAL_LOCK → LOCKED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
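&lt;p&gt;As a rough sketch, that progression can be enforced with an explicit transition table. State names mirror the diagram; the guard reasons and class shape are illustrative, not CHP's actual implementation:&lt;/p&gt;

```python
# Minimal sketch of the lock progression as an explicit state machine.
# Illegal transitions raise instead of silently proceeding, and every
# accepted transition lands in an append-only decision log.
ALLOWED = {
    "EXPLORING": {"ADVISORY_LOCK"},
    "ADVISORY_LOCK": {"PROVISIONAL_LOCK", "EXPLORING"},  # reset on R0 flag
    "PROVISIONAL_LOCK": {"LOCKED", "EXPLORING"},
    "LOCKED": set(),                                     # terminal
}

class ConsensusSession:
    def __init__(self):
        self.state = "EXPLORING"
        self.log = []  # append-only decision log for auditors

    def transition(self, target, reason):
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} to {target}")
        self.log.append((self.state, target, reason))
        self.state = target

session = ConsensusSession()
session.transition("ADVISORY_LOCK", "all foundations disclosed")
session.transition("PROVISIONAL_LOCK", "adversarial challenge survived")
session.transition("LOCKED", "R0 gate score above threshold")
print(session.state)  # LOCKED
```

&lt;p&gt;The decision log is exactly what a compliance reviewer replays: who moved the session, to which state, and why.&lt;/p&gt;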



&lt;h2&gt;
  
  
  Foundation Disclosure
&lt;/h2&gt;

&lt;p&gt;Agents commit to their reasoning BEFORE cross-agent communication.&lt;/p&gt;

&lt;p&gt;Why: anchoring bias. If Agent A shares first, Agents B and C defer. Information cascading turns 3 agents into 1 agent with 3 voices.&lt;/p&gt;

&lt;p&gt;Implementation: each agent produces a sealed payload (reasoning chain + conclusion + confidence) that's encrypted until all agents have committed. Only then are payloads revealed simultaneously.&lt;/p&gt;
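&lt;p&gt;A hash-commitment version of that commit-then-reveal flow fits in a few lines. CHP's payloads are described as encrypted; the SHA-256 commitment below is a simpler illustrative stand-in that gives the same anchoring protection:&lt;/p&gt;

```python
import hashlib
import json

# Illustrative commit-reveal sketch of foundation disclosure. Each agent
# first publishes only a digest of its sealed payload; payloads are
# revealed together once every agent has committed, then verified
# against the digests so nobody can swap conclusions after the fact.

def seal(payload):
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest(), blob

payloads = {
    "agent_a": {"conclusion": "invest", "confidence": 0.72},
    "agent_b": {"conclusion": "defer", "confidence": 0.61},
}

# Phase 1: every agent commits; nothing readable is shared yet.
commitments, sealed = {}, {}
for name, payload in payloads.items():
    digest, blob = seal(payload)
    commitments[name] = digest
    sealed[name] = blob

# Phase 2: simultaneous reveal, checked against the earlier commitments.
revealed = {}
for name, blob in sealed.items():
    assert hashlib.sha256(blob).hexdigest() == commitments[name], "tampered"
    revealed[name] = json.loads(blob)

print(sorted(revealed))  # ['agent_a', 'agent_b']
```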

&lt;h2&gt;
  
  
  Adversarial Layer
&lt;/h2&gt;

&lt;p&gt;Not a soft prompt. A hard constraint.&lt;/p&gt;

&lt;p&gt;The adversarial agent has ONE job: produce a logically valid counter-argument with cited evidence. If it can't, the original conclusion stands. But the attempt is logged — "adversary could not produce a valid challenge" is itself a signal of high-confidence consensus.&lt;/p&gt;

&lt;p&gt;This is structurally different from "temperature: 1.2" or "you are a devil's advocate." Those are prompt-level suggestions that the model can ignore. CHP's adversarial role is an architectural constraint: no valid counter-argument = no state transition to PROVISIONAL_LOCK.&lt;/p&gt;
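&lt;p&gt;Structurally, that looks less like a prompt and more like a guard function. A hypothetical sketch (not the repo's API):&lt;/p&gt;

```python
# Illustrative guard: the PROVISIONAL_LOCK transition is refused outright
# unless an adversarial pass has been run and logged.
def try_provisional_lock(adversarial_pass):
    if adversarial_pass is None:
        # The round is mandatory: skipping it blocks the transition.
        raise RuntimeError("transition refused: no adversarial pass logged")
    challenge = adversarial_pass.get("challenge")
    if challenge and challenge.get("evidence"):
        # A logically valid, evidence-cited counter-argument forces
        # deliberation to continue.
        return "EXPLORING"
    # "Adversary could not produce a valid challenge" is itself logged
    # as a high-confidence signal, and the conclusion stands.
    return "PROVISIONAL_LOCK"

print(try_provisional_lock({"challenge": None}))  # PROVISIONAL_LOCK
```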

&lt;h2&gt;
  
  
  R0 Gate
&lt;/h2&gt;

&lt;p&gt;The convergence detector.&lt;/p&gt;

&lt;p&gt;If inter-agent similarity exceeds threshold T before the adversarial round completes, the system flags the consensus as potentially sycophantic. Deliberation resets with new initialization seeds.&lt;/p&gt;
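&lt;p&gt;The check itself is simple. A toy version using mean pairwise cosine similarity over placeholder vectors (in practice the embeddings would come from an embedding model):&lt;/p&gt;

```python
import math

# Toy R0 convergence check: flag the round as potentially sycophantic
# when mean pairwise cosine similarity of the agents' output embeddings
# exceeds the domain-calibrated threshold T. Vectors are placeholders.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def r0_gate(embeddings, threshold=0.95):
    pairs = [
        cosine(embeddings[i], embeddings[j])
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
    ]
    mean_sim = sum(pairs) / len(pairs)
    return mean_sim, mean_sim > threshold  # True means reset with new seeds

# Three near-identical agent outputs: the gate should flag them.
agents = [[0.90, 0.10, 0.00], [0.88, 0.12, 0.01], [0.91, 0.09, 0.02]]
mean_sim, flagged = r0_gate(agents)
print(flagged)  # True
```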

&lt;p&gt;Calibration: T is set empirically per domain. In finance (where ground truth is verifiable against GL data), I calibrate against known-correct and known-incorrect outcomes. In open-ended domains (strategy, research), T is set conservatively high.&lt;/p&gt;

&lt;p&gt;This is the area where I most want community feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heterogeneous Models
&lt;/h2&gt;

&lt;p&gt;The simplest anti-sycophancy mitigation: don't use the same model.&lt;/p&gt;

&lt;p&gt;My specialist clusters run GPT-4o + Claude + DeepSeek. Different training data, different RLHF, different failure modes. Natural disagreement is higher. Genuine consensus (when it occurs) is more trustworthy because it emerged from heterogeneous reasoning, not shared training artifacts.&lt;/p&gt;

&lt;p&gt;Token economics: MoE Router dispatches to specialist clusters using nano models at $0.02-0.20/M tokens. GroupDebate subgroup partitioning cuts costs 51.7% while preserving accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The R0 gate calibration is manual. I'd like a meta-learning layer that adjusts T based on historical decision accuracy.&lt;/li&gt;
&lt;li&gt;The adversarial role prompting needs more research. Current implementation uses role-based prompting with explicit logical proof requirements. But the quality of adversarial arguments varies significantly across base models.&lt;/li&gt;
&lt;li&gt;Cross-model payload envelope format needs standardization. I'm using a custom JSON schema. An industry standard would make CHP interoperable across platforms.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Full Portfolio
&lt;/h2&gt;

&lt;p&gt;48 repos spanning finance AI, commodity intelligence, compliance automation, blockchain traceability, and swarm trading: &lt;a href="https://codeberg.org/cubiczan" rel="noopener noreferrer"&gt;https://codeberg.org/cubiczan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PRs welcome. Especially on R0 calibration and adversarial prompting.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
