Forem: Vishal Keerthan

The Self-Trust Problem in Hermes Agent's Skill Architecture

Vishal Keerthan — Mon, 18 May 2026 12:56:46 +0000

This is a submission for the Hermes Agent Challenge

Hermes Agent is one of the most architecturally serious open-source agent frameworks to emerge in 2026. The three-layer memory system, the GEPA self-evolution engine presented at ICLR 2026, local skill persistence with no telemetry, provider-agnostic routing across 400+ models — these are substantive engineering decisions, not feature marketing.

This post is not a feature overview. It's an examination of a structural tension that runs beneath all of those features: the system that generates knowledge is also the sole judge of whether that knowledge is valid. After working through the architecture documentation and the public GitHub issue tracker in depth, I think this tension is sharper and more layered than it first appears — and understanding it precisely is what separates building well with Hermes from quietly accumulating a skill library full of confident, stale, or structurally brittle knowledge.

The Compounding Mechanism and Its Hidden Assumption

When Hermes completes a complex task, it can persist what it learned as a skill — a Markdown file in ~/.hermes/skills/ encoding the approach, edge cases, and domain knowledge the task required. The next time a similar task arrives, the agent loads that skill rather than reasoning from scratch. Skills compound: agents with 20+ self-generated skills complete similar tasks 40% faster than fresh instances.

That benchmark is real. But it measures speed, not correctness. It does not assess whether the skills encode sound approaches or fortunate ones. It does not account for whether those skills remain valid as APIs deprecate, model versions change, or project requirements evolve.

A skill system without external validation does not compound quality. It compounds confidence. These are meaningfully different things, and the difference is the subject of Issue #25833, which states the structural problem directly:

"The agent is simultaneously the author, executor, and quality inspector of its own skills. There is no external validation point or consistency check."

This is not a bug that can be cleanly patched. It is a property of how autonomous self-improvement works. What follows are the concrete ways it surfaces.

Where the Tension Shows Up

Transient Failures Encoded as Permanent Lessons

Issue #6051 — now fixed, but worth understanding — documented the following: when the agent encountered a transient failure such as a terminal timeout or a network error, it would encode the lesson as a skill. The lesson was not "this tool failed under specific transient conditions." It was "this tool does not work in this context." Permanently.

The result was an agent that progressively avoided tools it had briefly failed with. The fix was a prompt adjustment instructing the background reviewer not to capture transient failures. Issue #25833 points out what that fix does not address: the underlying mechanism — write skill, use skill, no re-validation — is unchanged. Prompt instructions can drift. Future changes can override them. The structural gap remains.
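One way to close that gap at the mechanism level, rather than in a prompt, is to classify a failure before any lesson is written. The following is a minimal sketch under stated assumptions: `ToolFailure`, `isTransient`, `lessonFor`, and the failure kinds are hypothetical names for illustration, not part of Hermes.

```typescript
// Hypothetical failure record; field names are illustrative, not Hermes's schema.
type FailureKind = "timeout" | "network" | "rate_limit" | "invalid_args" | "unsupported";

interface ToolFailure {
  tool: string;
  kind: FailureKind;
  message: string;
}

// Failure kinds assumed to be transient: retrying later may succeed.
const TRANSIENT_KINDS: FailureKind[] = ["timeout", "network", "rate_limit"];

function isTransient(failure: ToolFailure): boolean {
  return TRANSIENT_KINDS.indexOf(failure.kind) !== -1;
}

// Returns the lesson to persist, or null when nothing durable was learned.
function lessonFor(failure: ToolFailure): string | null {
  if (isTransient(failure)) {
    return null; // do not turn a one-off outage into a permanent rule
  }
  return `${failure.tool}: ${failure.message}`;
}
```

Because the check is deterministic code, it cannot drift the way a prompt instruction can. It still depends on failures being labeled correctly upstream.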

The Runtime Loop Cannot Classify Its Own Failures

Issue #22112 documents a gap that cuts deeper than the skill system.

The Hermes architecture documentation describes a hard cap of 90 turns per task as a "deterministic circuit breaker" against runaway loops. What Issue #22112 documents is that this cap is insufficient at the execution layer. When the agent encounters repeated terminal timeouts — during routine directory enumeration across an external volume, for instance — it does not classify the failure and escalate. It retries the same sequence, silently consuming context and API budget until the turn cap terminates it.

The missing piece is a low-level escalation path: a mechanism that recognizes repeated equivalent failures as a failure class, halts, and surfaces a structured diagnostic rather than continuing to retry. The framework currently has no deterministic guardrail at the runtime layer capable of making that distinction.
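To make the idea of an escalation path concrete, here is a rough sketch of a guard that counts equivalent failures and returns a structured diagnostic once they repeat. `FailureBreaker`, `Diagnostic`, and the "tool plus error kind" signature are assumptions for illustration; the issue does not prescribe this design.

```typescript
// Hypothetical structured diagnostic surfaced when a failure class repeats.
interface Diagnostic {
  signature: string;
  occurrences: number;
  advice: string;
}

class FailureBreaker {
  private counts: Record<string, number> = {};
  private maxRepeats: number;

  constructor(maxRepeats: number) {
    this.maxRepeats = maxRepeats;
  }

  // A signature groups "equivalent" failures; here: tool name plus error kind.
  record(tool: string, errorKind: string): Diagnostic | null {
    const signature = tool + ":" + errorKind;
    const seen = (this.counts[signature] || 0) + 1;
    this.counts[signature] = seen;
    if (seen < this.maxRepeats) {
      return null; // not yet a pattern, the caller may retry
    }
    return {
      signature: signature,
      occurrences: seen,
      advice: "halt and escalate instead of retrying the same call",
    };
  }
}
```

A caller would stop the loop as soon as `record` returns a non-null value, well before a turn cap is reached.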

This exposes a meaningful architectural gap. GEPA — the self-evolution engine — operates by reading execution traces and identifying why tasks failed. That is only useful when the execution traces are clean enough to analyze. A sophisticated reasoning layer sitting on top of a runtime loop that cannot survive a network timeout produces traces too noisy to evolve from. Evolution cannot substitute for fault tolerance.

Skills Have No Record of When They Were True

Consider this scenario from Issue #25833:

A user requests a data processing pipeline. The agent builds one using an API endpoint current at the time and creates a skill encoding the approach. Three weeks later, the endpoint is deprecated. The user makes the same request. The agent loads the skill, generates code pointing at the broken endpoint, and the task fails — with nothing in the system tracing the failure back to the skill as its source.

The skill carries no last_verified_at timestamp, no created_with_model_version field, no expiration metadata. The agent is not making anything up. It is confidently applying knowledge that used to be true.

The proposed schema from the issue:

runtime:
  model_created: "claude-sonnet-4-20250514"
  execution_count: 0
  success_rate: 0.0
  last_verified_at: null
  consistency_score: null

With this metadata, skills with low success rates could receive lower priority in prompt injection. Skills created against older model versions could be flagged for re-verification. Skills never verified after creation could be surfaced as experimental rather than loaded silently as ground truth. None of this exists today. These are open feature requests.
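As a sketch of how a loader could act on that metadata, the function below maps the proposed `runtime` fields to a load decision. The field names come from the issue's schema; `LoadStatus`, `loadStatus`, and the thresholds are assumptions made up for this example.

```typescript
// Mirrors the `runtime` block proposed in the issue; everything else is assumed.
interface SkillRuntime {
  model_created: string;
  execution_count: number;
  success_rate: number; // 0.0 to 1.0
  last_verified_at: string | null; // ISO timestamp, null if never verified
  consistency_score: number | null;
}

type LoadStatus = "experimental" | "reverify" | "deprioritized" | "trusted";

function loadStatus(meta: SkillRuntime, currentModel: string): LoadStatus {
  // Never verified after creation: surface as experimental, not ground truth.
  if (meta.last_verified_at === null) return "experimental";
  // Created against a different model version: flag for re-verification.
  if (meta.model_created !== currentModel) return "reverify";
  // Placeholder thresholds (5 runs, 70% success); tune for a real system.
  if (meta.execution_count >= 5 && meta.success_rate < 0.7) return "deprioritized";
  return "trusted";
}
```

A freshly written skill with the default values shown in the schema would come back as `experimental`.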

Memory Synthesis Can Quietly Overwrite Explicit Policy

Honcho, the memory subsystem, runs asynchronous multi-pass dialectic reasoning after each conversation turn. It deduces the user's preferences, communication style, and working patterns and synthesizes them into a persistent user model across three sequential passes — from gap analysis to reconciliation to final synthesis. This is sophisticated architecture that genuinely solves problems that pure vector retrieval systems cannot.

The problem documented in Issue #17583 is that the dialectic engine cannot reliably distinguish between an explicit user directive and an inferred behavioral pattern. If a user has stated "never use Python 3.9 under any circumstances due to legacy dependency conflicts," and the agent subsequently observes the user working in Python 3.9 in some adjacent context, the synthesis process may reclassify that hard constraint as a soft preference. The engine is built to find consistent patterns, and observed behavior can outweigh stated instruction.

There is no hierarchical policy layer — no mechanism by which manually authored directives carry immutable weight over dialectically synthesized observations. The result is gradual, untraceable behavioral drift. A constraint clearly set weeks ago may have silently softened with no record of when or why it changed.
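A hierarchical policy layer could be as simple as tagging every entry with its source and refusing to let synthesis overwrite explicit entries. This is a minimal sketch of that rule; `PolicyEntry` and `applyUpdate` are hypothetical and do not reflect how Honcho stores its user model.

```typescript
// Hypothetical user-model entry: "explicit" entries were authored by the user,
// "inferred" entries were synthesized from observed behavior.
interface PolicyEntry {
  key: string;
  value: string;
  source: "explicit" | "inferred";
}

// Applies an update unless it would replace an explicit directive with an
// inferred observation. Returns the resulting policy list.
function applyUpdate(policies: PolicyEntry[], update: PolicyEntry): PolicyEntry[] {
  const existing = policies.filter((p) => p.key === update.key)[0];
  if (existing && existing.source === "explicit" && update.source === "inferred") {
    return policies; // explicit directives are immutable to synthesis
  }
  const others = policies.filter((p) => p.key !== update.key);
  return others.concat([update]);
}
```

Under this rule the Python 3.9 constraint can only change when the user issues a new explicit directive, which also gives the change a traceable origin.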

The Evolution Engine Can Fail Without Saying So

GEPA — the Genetic-Pareto Prompt Evolution engine — is the intended structural answer to the self-grading problem. Rather than asking the agent to evaluate its own performance, GEPA uses an external reflection model to read execution traces, identify why tasks failed, and generate targeted mutations to the relevant skill instructions. Research by Agrawal et al. (2025) demonstrates it outperforms weight-space reinforcement learning approaches in complex agentic scenarios, achieving up to 19% performance improvement with approximately 35× fewer rollouts.

Two issues in the evolution repository document serious problems with the current implementation.

Issue #38 documented an architectural flaw in early versions of the Phase 1 SkillModule that prevented GEPA from mutating actual skill content at all. The evolution loop ran, produced no mutations, and gave no indication anything had gone wrong. Nothing evolved.

Issue #10 documents a separate failure mode: under certain DSPy 3.1+ configurations, GEPA silently falls back to MIPROv2 — an older, less capable optimizer — bypassing the Genetic-Pareto mechanisms entirely. This occurs due to configuration bugs involving the reflection_lm parameter and missing max_steps arguments. Constraint validators have also been observed throwing false positives on valid YAML structures, stalling the pipeline prematurely. In all of these cases, the system continues running without alerting the user that the external validation they are relying on is not actually operating.
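Both failure modes share a shape: the run completes, but not in the way the user asked for. A post-run check can make that visible. The sketch below assumes a hypothetical `EvolutionRunReport` that a wrapper script would have to assemble; it does not use or imply any real DSPy or GEPA API.

```typescript
// Hypothetical summary of one evolution run, assembled by the caller.
interface EvolutionRunReport {
  requestedOptimizer: string;
  actualOptimizer: string;
  mutationsApplied: number;
}

// Returns human-readable problems; an empty list means no silent failure was detected.
function findSilentFailures(report: EvolutionRunReport): string[] {
  const problems: string[] = [];
  if (report.actualOptimizer !== report.requestedOptimizer) {
    problems.push(
      "optimizer fallback: requested " + report.requestedOptimizer + ", ran " + report.actualOptimizer
    );
  }
  if (report.mutationsApplied === 0) {
    problems.push("evolution loop finished without mutating any skill content");
  }
  return problems;
}
```

Failing the pipeline whenever this list is non-empty turns a silent degradation into a loud one.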

An Undocumented Risk: Context Compression and Skill Provenance

This does not have a dedicated issue ticket, but it is embedded in the architecture documentation itself and its implications for the skill system are worth naming.

Hermes uses context_compressor.py, an aggressive lossy summarization module that activates as context window limits approach. It discards what the system classifies as low-signal transitional reasoning while preserving core deductive outputs — and the system itself determines what counts as low-signal.

In a long-running task where the approach evolves, earlier reasoning often explains why a particular direction was chosen over alternatives that seemed equally viable. When that reasoning is compressed away, subsequent decisions are made against an incomplete record of the task's own logic.

The interaction with the skill system is direct: a skill written from a compressed context encodes what was done without the why. A skill without its own reasoning cannot be safely extended, confidently modified, or trusted in edge cases that differ slightly from the original scenario. The agent's knowledge base can accumulate procedural steps that are locally correct but structurally brittle — correct for the exact scenario they were generated from, fragile everywhere adjacent to it.

The Curator: Where Maintenance Meets the Same Bias

Nous Research built the Curator as a background maintenance system for the skill library. It runs after 2+ hours of idle time or 7 days of inactivity, reviewing agent-authored skills, tracking usage frequency, marking stale ones unused for 30 days, archiving those unused for 90, and consolidating conflicting instructions.

The design is careful: snapshots before every pass, no automatic deletion, full recovery from ~/.hermes/skills/.archive/, and the ability to pin critical skills with hermes curator pin <skill>. The Curator never touches the 118 bundled skills — only the ones the agent generated.
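The lifecycle rules described above can be summarized in a few lines. This is a sketch of the policy as stated in this post, not the Curator's actual code: `SkillUsage` and `classify` are hypothetical, and whether the 30 and 90 day boundaries are inclusive is an assumption.

```typescript
type Lifecycle = "active" | "stale" | "archive" | "pinned" | "bundled";

interface SkillUsage {
  name: string;
  daysSinceLastUse: number;
  pinned: boolean;
  bundled: boolean; // shipped with Hermes rather than agent-generated
}

function classify(skill: SkillUsage): Lifecycle {
  if (skill.bundled) return "bundled"; // bundled skills are never touched
  if (skill.pinned) return "pinned"; // pinned skills are exempt from maintenance
  if (skill.daysSinceLastUse >= 90) return "archive";
  if (skill.daysSinceLastUse >= 30) return "stale";
  return "active";
}
```

Note what the rules key on: usage recency. Nothing in them measures whether a frequently used skill is still correct.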

But the documentation acknowledges the system's known limitation directly:

"The agent tends toward self-congratulation. It almost always thinks it performed well, even when it didn't. Community feedback has confirmed this."

The Curator's LLM reviewer is the same probabilistic system that generated the skills it is now evaluating. The self-congratulatory bias is not corrected by the maintenance loop; it is inherited by it. The loop runs only when its idle or inactivity trigger fires. The bias runs continuously.

GEPA is the intended structural fix for exactly this: an external reflection model that does not ask the agent to grade its own work. But GEPA lives in a separate repository, requires explicit setup, costs $2–$10 per optimization run, and is currently in a volatile alpha state with the blocking issues described above. The online learning loop (Curator) and the offline evolution loop (GEPA) are designed to be complementary. Right now, the decoupling between them means most users are relying entirely on the self-congratulatory system.

What Hermes Actually Got Right

The failure modes above are real, but the honest picture of the system is not negative. The most architecturally significant thing Hermes built is something that deserves more attention than it gets in benchmark discussions: verifiable knowledge persistence.

Skills are Markdown files on disk. They can be read, edited, and diffed. When the Curator makes a decision, it produces a rationale that can be audited. This is fundamentally different from learning that happens inside model weights — weight-based systems offer no equivalent transparency. You cannot open a file and read what a neural network learned from a task. With Hermes, you can.

That transparency is the correct foundation. The current gap is not that the system lacks insight into its own behavior. It is that the validation infrastructure has not kept pace with the generation infrastructure. The agent can produce skills faster than it can verify them. The knowledge base grows faster than the quality controls around it.

Issue #10666 proposes skill quality tiers — core, recommended, experimental — based on verification status and usage history. This is the right direction. It would let the system be appropriately humble about freshly generated knowledge while still giving full weight to skills that have been battle-tested across many executions and verified over time.

The longer-term answer may be at the community level: if skills become portable, signable, peer-reviewed artifacts — analogous to what is beginning to emerge with Cursor rules or Claude Code plugins — then external validation replaces self-validation entirely. That is a structurally stronger trust model than any individual agent reviewing its own outputs.

What This Means in Practice

Pin your critical skills. Use hermes curator pin <skill> for any skill that is load-bearing in your workflow. The Curator's autonomous maintenance decisions should not touch knowledge your processes depend on.

Treat agent-generated skills as drafts by default, especially anything touching external APIs, file system paths, or version-specific tooling. The skill may be weeks old and the knowledge it encodes may have quietly expired.

Audit your explicit policies periodically. Any hard constraint you have set in the system deserves a manual review every few weeks. Given Issue #17583, you cannot assume the dialectic engine has preserved it unchanged.

Check the archive directory. ~/.hermes/skills/.archive/ is not a graveyard. It is a record of what the system decided to retire. What it archived may still be valuable; what it chose to keep may already be stale.

Run GEPA on your highest-frequency skills. The alpha is rough and the setup requires effort, but it is the only external validator currently available in the ecosystem. If you care about skill quality rather than just task speed, the setup cost is worth it.

Read your skills. They are Markdown files. If the agent encoded a flawed approach, it is there to see. The transparency that makes this system auditable is only useful if you actually audit it.

The Honest Assessment

Hermes Agent is a serious piece of work. The self-improving skill loop is a genuine architectural innovation. The GEPA evolution engine represents a meaningful departure from both static prompt engineering and weight-space reinforcement learning, when it functions as designed. The local-first, privacy-preserving deployment model is the right infrastructure bet for where AI development is heading.

The structural self-trust problem is also real. Skills accumulate without provenance metadata. The runtime loop cannot gracefully handle low-level failures. The memory system has no mechanism to protect explicit directives from being overwritten by behavioral inference. The evolution engine can fail silently. These are not surface-level issues.

The deepest tension in the architecture is between probabilistic reasoning and deterministic execution. A system sophisticated enough to write its own cognitive instructions needs to be robust enough to survive a network timeout. A system that builds behavioral models from conversation needs mechanisms to preserve explicit human directives against that synthesis. Evolution cannot substitute for fault tolerance. Inference cannot substitute for verification.

What makes this project worth taking seriously, beyond the benchmarks, is the quality of the issue tracker. It is detailed, technically honest, and full of proposals for mechanism-level solutions rather than prompt patches. The problems documented here are known, named, and being worked on by a community that understands them clearly. Building well with Hermes right now means building with that understanding — not despite it.

The "Junior Developer" Effect: How 192k Tokens of Noise Degraded Gemma 4's Architectural Reasoning

Vishal Keerthan — Mon, 11 May 2026 10:41:49 +0000

Everyone is talking about massive 1M+ token windows, so I decided to test what actually happens when you dump a messy, undocumented backend into an LLM.

The syntax survived.

The architecture didn't.

If you spend enough time building backend systems, you know syntax is the easy part. The real difficulty is preserving referential integrity, architectural boundaries, and long-range system reasoning under pressure.

I wanted to test whether Gemma 4 could actually behave like a backend engineer inside a messy production-style codebase — not solve toy problems.

So I designed a controlled stress test.

Not a benchmark.

Not a code-generation demo.

An adversarial debugging experiment.

The Target: Orphaned Foreign Keys

The repository was a deliberately messy Node.js + Express + Prisma monolith:

layered routing/service architecture
implicit middleware state
no tests
noisy repository structure
intentionally injected referential integrity bug

The bug:

When an admin deletes a Team, users belonging to that team receive a 500 Internal Server Error the next time they authenticate.

The root cause was a classic orphaned foreign-key scenario.

The User.teamId remained populated after the Team row was deleted.

During authentication, Prisma executed:

include: { team: true }

Since the relation no longer existed:

user.team === null

But the middleware still executed:

req.teamName = user.team!.name;

Which crashed with:

TypeError: Cannot read properties of null (reading 'name')

The instruction given to Gemma 4 was intentionally strict:

"Prefer architecturally correct fixes over defensive patches."

Environment Setup

Because this experiment explicitly required feeding ~192K tokens into a single context window, model selection was not optional — it was structural.

The Gemma 4 family splits into two tiers regarding context length:

| Model Variant | Architecture | Active Params | Max Context |
| --- | --- | --- | --- |
| Gemma 4 E2B | Dense + PLE | 2.3B | 128K tokens |
| Gemma 4 E4B | Dense + PLE | 4.5B | 128K tokens |
| Gemma 4 26B A4B | MoE | 3.8B | 256K tokens |
| Gemma 4 31B Dense | Dense | 30.7B | 256K tokens |

The E2B and E4B edge models, designed for mobile and Raspberry Pi deployment, have a hard 128K context ceiling. Feeding 192K tokens into them would overflow that window: depending on the serving stack, the request is either rejected or silently truncated, and either outcome invalidates the experiment.
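The same constraint can be checked mechanically before sending a prompt. The sketch below uses the context limits from the table above, treating "128K" and "256K" as 128,000 and 256,000 tokens (an approximation); `ModelSpec`, `modelsThatFit`, and the output reservation are assumptions for illustration.

```typescript
interface ModelSpec {
  name: string;
  maxContextTokens: number;
}

// Limits taken from the table above; "K" is approximated as 1,000 tokens.
const GEMMA4: ModelSpec[] = [
  { name: "Gemma 4 E2B", maxContextTokens: 128000 },
  { name: "Gemma 4 E4B", maxContextTokens: 128000 },
  { name: "Gemma 4 26B A4B", maxContextTokens: 256000 },
  { name: "Gemma 4 31B Dense", maxContextTokens: 256000 },
];

// Keeps only models whose window holds the prompt plus room for the reply.
function modelsThatFit(models: ModelSpec[], promptTokens: number, reservedForOutput: number): ModelSpec[] {
  return models.filter((m) => promptTokens + reservedForOutput <= m.maxContextTokens);
}
```

For a 192K-token prompt with a few thousand tokens reserved for the answer, only the two 256K variants remain.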

This experiment was conducted using the Gemma 4 26B A4B Mixture-of-Experts model, accessed via the Gemini API through Google AI Studio. The MoE architecture activates only ~3.8B parameters per token, making it efficient enough for long-context inference without server-grade GPU clusters. For local reproduction, the same model is accessible via Ollama with quantized weights (Q4_K_M) on a machine with 24GB+ VRAM, or freely via OpenRouter's free tier — no credit card required.

The choice was intentional: the model's hybrid attention mechanism, which interleaves local sliding window attention (1024-token chunks) with periodic global attention layers, made it the most interesting model to stress-test. More on why that matters shortly.

Experiment Design

The same bug was tested across three escalating context conditions.

| Phase | Context | Size |
| --- | --- | --- |
| Phase 1 | Surgical Context | ~1.6k tokens |
| Phase 2 | Full Repository | ~16k tokens |
| Phase 3 | Poisoned Context | ~192k tokens |

The prompt remained identical across all phases.

Only the context changed.

Thinking Mode (`<think>`) was enabled across all three phases, allowing observation of not just the final code output, but the model's internal reasoning trace before it committed to a solution. This turned out to be the most revealing part of the experiment.

Phase 1 — Surgical Context (~1.6k Tokens)

Gemma 4 performed extremely well.

The <think> trace was methodical and architecturally coherent. The model's internal monologue traced the crash backward from the middleware null dereference through the authentication query, eventually landing on the Prisma schema itself:

<think>
The crash is in req.teamName = user.team!.name. The non-null assertion
fails because user.team is null. But why is it null? The include clause
is present, so it's not a missing join. If the team was deleted after
the user was created, the foreign key teamId still holds the old value
but the referenced row no longer exists. The fix isn't in the middleware
— it's in the schema. onDelete: SetNull would handle this declaratively.
But wait — SetNull requires the scalar field to be nullable. teamId must
be String? not String. Both changes are required.
</think>

It immediately recognized that the middleware crash was only a symptom. Instead of patching the authentication layer, it traced the problem back to the data model itself.

It proposed a complete, deployable declarative Prisma schema fix:

- teamId   String
+ teamId   String?   // scalar field must be optional for SetNull to work

- team Team? @relation(fields: [teamId], references: [id])
+ team Team? @relation(
+   fields: [teamId],
+   references: [id],
+   onDelete: SetNull
+ )

This is the correct architectural solution — and it's complete.

The database itself enforces referential integrity. When a Team is deleted, Postgres automatically sets teamId to NULL on all related User rows. No orphaned foreign keys can survive deletion. No application-layer cleanup loop required.

Critically, the model also understood that onDelete: SetNull is only valid when the scalar field (teamId) is explicitly optional. A String (non-nullable) column cannot accept a NULL value from the database engine — applying SetNull to it would fail schema validation or throw a P2003 foreign key constraint violation at runtime. The fix required changing teamId String to teamId String? in lockstep.
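To see what the referential action changes, here is a small in-memory simulation. It is a sketch, not Prisma or Postgres: `Db`, `deleteTeam`, and `orphans` are hypothetical helpers, and the `"None"` mode simply models the buggy state from this experiment, where the delete went through and `teamId` stayed populated.

```typescript
// In-memory stand-ins for the Team and User tables; not Prisma's API.
interface Team { id: string; name: string; }
interface User { id: string; teamId: string | null; }
interface Db { teams: Team[]; users: User[]; }

type OnDelete = "None" | "SetNull";

// Deletes a team and applies the chosen referential action to dependent users.
function deleteTeam(db: Db, teamId: string, onDelete: OnDelete): Db {
  const teams = db.teams.filter((t) => t.id !== teamId);
  const users = db.users.map((u): User =>
    onDelete === "SetNull" && u.teamId === teamId ? { id: u.id, teamId: null } : u
  );
  return { teams: teams, users: users };
}

// Users whose teamId points at a row that no longer exists.
function orphans(db: Db): User[] {
  return db.users.filter(
    (u) => u.teamId !== null && !db.teams.some((t) => t.id === u.teamId)
  );
}
```

With `"None"`, deleting a team leaves orphaned users behind; with `"SetNull"`, the orphan list is empty by construction. The nullable `teamId` type is what makes the second outcome representable at all.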

The model behaved like a staff-level backend engineer:

fix the source, not the symptom
preserve invariants at the database layer
understand the full constraint surface before touching a single line of application code
avoid defensive middleware sprawl

Phase 2 — Full Repository (~16k Tokens)

I then expanded the context to the full src/ directory.

At ~16k tokens, the <think> trace was still broadly coherent, but the reasoning scope visibly widened. The model's internal monologue now mentioned service boundaries, transactional rollback risks, and middleware hardening — concerns that weren't present at 1.6K tokens.

The architectural reasoning remained stable. Gemma 4 still identified the schema-level flaw and again proposed:

onDelete: SetNull

But the behavior shifted slightly. It additionally suggested:

transactional cleanup logic in the team deletion service
middleware hardening with a null guard
defensive guards in the auth flow

The <think> trace started hedging — it flagged edge cases like "what if the migration can't run immediately in production?" and "is there a risk window between the delete and the constraint propagating?" — concerns that are real, but secondary to the root fix.

This felt less like a staff engineer and more like a senior engineer trying to reduce operational risk.

Still acceptable. Still systemic. Still maintainable.

Phase 3 — Poisoned Context (~192k Tokens)

This is where the collapse happened.

For the final phase, I deliberately poisoned the context window with:

massive JSON translation files
raw SQL migration dumps
irrelevant structured noise
repetitive low-signal data

The <think> trace was the first signal of failure. Instead of the methodical backward trace from Phase 1, the model's internal monologue immediately fixated on the crash surface:

<think>
TypeError at req.teamName = user.team!.name. user.team is null.
Need to add null check. if (user && user.team) { req.teamName = user.team.name; }
Also should clean up teamId when deleting teams. updateMany to set teamId null
before delete. This prevents the null crash.
</think>

The reasoning scope had collapsed entirely to the immediate error line. The schema, the database constraints, the referential integrity model — gone. The thought block never mentioned Prisma's onDelete at all.

The final output reflected the degraded reasoning trace.

Instead of fixing the schema, Gemma 4 localized the problem entirely to the immediate crash surface. It abandoned the declarative ORM fix and generated an imperative service-layer patch:

await prisma.user.updateMany({
  where: { teamId },
  data: { teamId: null }
})

Then it added a defensive middleware patch:

- if (user && user.teamId) {
-   req.teamName = user.team!.name;
- }

+ if (user && user.team) {
+   req.teamName = user.team.name;
+ }

This directly violated the original instruction:

"Prefer architecturally correct fixes over defensive patches."

The syntax survived.

The architecture degraded.

Why This Happened: Attention Dilution and the Mechanics of Collapse

The failure mode wasn't random. It was mechanical.

The Gemma 4 26B MoE uses a hybrid attention architecture: local sliding window attention operating on 1024-token chunks, interleaved with periodic global attention layers that carry long-range awareness across the full context.

When the context is surgical (Phase 1), the global attention layers do their job — they route the system prompt instruction ("prefer architectural fixes") across the full reasoning span and hold it active during code generation.

When 192K tokens of irrelevant noise flood the context, attention probability mass gets distributed across an enormous volume of low-signal data. The global attention layers — responsible for carrying the architectural constraint from the system prompt to the generation step — experience attention dilution. The instruction becomes too distant and too buried to influence the final output.

The local sliding window attention, however, operates on immediate 1024-token neighborhoods. Generating valid Prisma syntax, matching brackets, producing correct TypeScript — these are local operations. They survive the flood.

This is why "the syntax survived, the architecture didn't" is more than a poetic observation. It is the outcome the attention mechanics above would predict.
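A toy calculation shows the scale of the dilution. Assume a single softmax over the context in which one instruction token scores a fixed number of logits above every other token, and all other tokens score the same. This is a deliberately simplified model, not Gemma 4's actual attention, and `instructionWeight` and the advantage of 5 logits are made-up illustration values.

```typescript
// Toy model: one "instruction" token scores `advantage` logits above
// every other token, and all distractor tokens score the same.
// Softmax weight of the instruction = e^a / (e^a + (n - 1)).
function instructionWeight(contextTokens: number, advantage: number): number {
  const signal = Math.exp(advantage);
  return signal / (signal + (contextTokens - 1));
}

// The three context sizes used in this experiment.
const sizes = [1600, 16000, 192000];
const weights = sizes.map((n) => instructionWeight(n, 5));
```

With an advantage of 5 logits, the instruction's share of attention is about 8% at 1,600 tokens, under 1% at 16,000, and under 0.1% at 192,000. The instruction has not changed; only the number of competitors has.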

The "Junior Developer Degradation Effect"

The failure mode was subtle.

Gemma 4 did not fail by inventing fake APIs or generating broken TypeScript.

It failed by writing technically shallow code.

Under heavy context load, the model stopped thinking systemically and started thinking locally.

It behaved like a junior engineer:

patch the symptom
avoid touching the schema
reduce immediate blast radius
move on

| Phase | Context Size | Persona | Fix Type | Think Trace Quality | Architectural Quality |
| --- | --- | --- | --- | --- | --- |
| Phase 1 | ~1.6k | Staff Engineer | Declarative ORM Fix (schema + nullable FK) | Deep, systemic trace | Excellent |
| Phase 2 | ~16k | Senior Engineer | Mixed Systemic + Defensive | Broad, hedging trace | Good |
| Phase 3 | ~192k | Junior Developer | Imperative Patch + Middleware Guard | Shallow, fixated trace | Poor |

Syntax Survives. Synthesis Dies.

One of the most important findings:

Local code generation remained highly resilient even under massive context poisoning.

At 192k tokens:

Prisma syntax remained correct
Express middleware remained valid
TypeScript structure stayed coherent
no catastrophic hallucinations appeared

But global architectural synthesis degraded sharply. The model could still write code. It could no longer reason about the system.

This pattern has a name in contemporary AI research: Precipitous Long-Context Collapse. Studies have demonstrated that models can successfully retrieve a single needle from a massive haystack — but they experience dramatic declines in reasoning ability and synthesis quality when asked to integrate task-relevant information across large spans of noisy text. Attention dilution causes the probability weighting for complex, cross-referential solutions to fall below the generation threshold, leaving only locally dominant patterns — in this case, the statistical frequency of defensive null-check patches in Express codebases.

Context Poisoning Neutralizes Instructions

The most important observation was not the patch itself.

It was the instruction failure.

The prompt explicitly instructed the model to avoid defensive patches.

Phase 1 obeyed this perfectly. The <think> trace surfaced it as an active constraint.

Phase 3 ignored it entirely. The <think> trace never referenced the instruction at all.

As the signal-to-noise ratio collapsed, architectural constraints stopped propagating through the reasoning process. The system prompt was buried. The instruction decayed.

This suggests a critical limitation:

Large context windows do not guarantee large-scale reasoning. They mostly guarantee large-scale retrieval.

What This Means for Engineering Teams

The experiment changed how I think about AI-assisted development. Here's what it suggests in practice:

Stop blindly dumping repositories. Feeding entire codebases into an LLM is not a shortcut — it is an active degradation of architectural reasoning quality once noise dominates signal. A model reasoning over 2,000 carefully selected tokens will outperform the same model drowning in 192,000 tokens of irrelevant migrations and translation files.

Invest in Agentic Context Engineering (ACE). Rather than static repository ingestion, build pipelines that dynamically retrieve only the tokens that matter for each specific task. Tools like LangChain, LlamaIndex, or custom RAG pipelines can surface the relevant schema file, the relevant service, and the relevant middleware — and nothing else.
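Even a crude selector illustrates the idea. The sketch below ranks files by how many task terms they mention and packs the best ones into a token budget. It is a keyword heuristic written for this post, not a LangChain or LlamaIndex API; `selectContext`, `relevance`, and the four-characters-per-token estimate are all assumptions.

```typescript
interface SourceFile {
  path: string;
  text: string;
}

// Rough token estimate: about 4 characters per token (a rule of thumb, not exact).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Counts how many task terms appear in the file path or body (case-insensitive).
function relevance(file: SourceFile, terms: string[]): number {
  const haystack = (file.path + "\n" + file.text).toLowerCase();
  return terms.filter((t) => haystack.indexOf(t.toLowerCase()) !== -1).length;
}

// Picks the most relevant files that fit into the token budget.
function selectContext(files: SourceFile[], terms: string[], budget: number): SourceFile[] {
  const ranked = files
    .map((f) => ({ file: f, score: relevance(f, terms) }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score);
  const picked: SourceFile[] = [];
  let used = 0;
  for (const r of ranked) {
    const cost = estimateTokens(r.file.text);
    if (used + cost <= budget) {
      picked.push(r.file);
      used += cost;
    }
  }
  return picked;
}
```

For the bug in this experiment, terms like `teamId` and `onDelete` would pull in the schema and the auth middleware while translation files and migration dumps score zero and never reach the model.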

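Match model to task. In this experiment a clean 1.6K-token context beat a 192K-token one on the same model, so a smaller model such as Gemma 4 E4B running locally with a curated 8K–16K context window may well produce better architectural reasoning than the 26B MoE drowning in 192K of noise (I did not test that pairing). Bigger context is not better context. Cleaner context is better context.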

Use Thinking Mode as a diagnostic, not just a feature. The <think> trace degraded before the output did. In production AI pipelines, monitoring the reasoning trace quality — not just the final code — is an early warning system for context collapse.
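A simple version of that monitoring is to check whether the reasoning trace ever mentions the concepts the task's constraint depends on. The sketch below is a keyword heuristic under stated assumptions: `missingConstraintTerms` is a hypothetical helper, and choosing the right terms per task is left to the caller.

```typescript
// Returns the constraint terms that the reasoning trace never mentions.
// A non-empty result is a warning sign, not proof of a bad answer.
function missingConstraintTerms(trace: string, requiredTerms: string[]): string[] {
  const text = trace.toLowerCase();
  return requiredTerms.filter((term) => text.indexOf(term.toLowerCase()) === -1);
}
```

Run against the traces in this post with the terms `schema` and `onDelete`, the Phase 1 trace mentions both while the Phase 3 trace mentions neither, which matches the collapse visible in the final output.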

The real frontier is not longer windows. It is smarter retrieval. We probably do not need 10 million token context windows. We need better tooling that helps models see the 2,000 tokens that actually matter.

Final Takeaways

Large context windows are useful.

But they are not substitutes for surgical context retrieval.

Blindly dumping entire repositories into an LLM actively damages architectural reasoning quality once noise dominates signal. The <think> trace confirmed this isn't just about output quality — the degradation begins in the reasoning process itself, before a single line of code is generated.

The lesson is not that Gemma 4 is flawed. The lesson is that any sufficiently large transformer, given enough noise, will eventually behave like the most statistically average engineer it was trained on.

The job of the developer is to make sure it never sees that much noise in the first place.

ClimateOS — I Built a Climate Decision Engine, Not Another Carbon Tracker

Vishal Keerthan — Sun, 19 Apr 2026 17:43:41 +0000

This is a submission for Weekend Challenge: Earth Day Edition

Climate tools don't have a data problem. They have a decision problem.

Most products fall into two failure modes:

Carbon trackers — dashboards that show you what you already did wrong
Generic AI wrappers — "here are 10 tips to reduce your footprint," unranked, with no constraints

Neither answers the only question that actually matters:

Given my life, my budget, and my time — what should I do next?

That's not an information gap. It's a prioritization gap. So I built a decision engine.

What I Built

ClimateOS takes your lifestyle inputs and outputs a ranked, constraint-aware action plan. Not a report. Not suggestions. A plan — with a hierarchy, explicit tradeoffs, and one clear first move.

Instead of tracking past emissions, it simulates future impact and returns:

A projected score improvement (e.g. 42 → 86)
A ranked action playbook with reasoning for each action
One Hero Action — the single highest-ROI change for your specific situation

The framing underneath: climate action is a resource allocation problem. Given limited budget and time, what sequence of changes produces the maximum emission reduction? That's a solvable problem. Most apps just haven't tried to solve it.

Demo

👉 Video Walkthrough: https://www.loom.com/share/576c4f7d5f8f417390c28c8786183c01

Code

👉 GitHub:

pvishalkeerthan / ClimateOS

ClimateOS is a constraint-aware decision engine that moves beyond simple carbon tracking to provide prioritized, resource-aware action plans.

How I Built It

User Journey

Step 1: Input

Eight inputs. Designed to be fast, not exhaustive:

Location
Daily commute (km)
Transport type — Car / EV / Public / Bike
Diet — Veg / Mixed / Non-Veg
Electricity usage (kWh/month)
Renewable energy %
Budget constraint — Low / Medium / High
Time constraint — Low / Medium / High

The constraint fields are the part most apps skip. They're also what makes the output usable.

Step 2: Processing Pipeline

Request hits /api/analyze:

Input validated via Zod schema
Deterministic emissions model computes baseline (no AI involvement yet)
Computed data — not raw inputs — passed to Gemini 2.0 Flash
AI output validated again via Zod before it touches the response
Ranked plan returned to client
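
The stage order above can be sketched as a single function with the stages injected. Every name here (Stages, analyze, the toy factor 0.2) is hypothetical, since the post only names the route and the order of steps. The sketch is kept synchronous to stay short; a real model call would be awaited.

```typescript
// Hypothetical stage signatures; the post does not show the real ones.
interface Stages<In, Baseline, Plan> {
  validateInput: (body: unknown) => In;          // throws on bad input
  computeBaseline: (input: In) => Baseline;      // deterministic, no AI involved
  callModel: (baseline: Baseline) => unknown;    // the model sees computed data only
  validateOutput: (raw: unknown) => Plan | null; // null when the schema check fails
  fallbackPlan: (baseline: Baseline) => Plan;    // deterministic fallback
}

function analyze<In, Baseline, Plan>(body: unknown, s: Stages<In, Baseline, Plan>): Plan {
  const input = s.validateInput(body);
  const baseline = s.computeBaseline(input);
  const raw = s.callModel(baseline);
  // Model output is only trusted after it passes validation.
  return s.validateOutput(raw) ?? s.fallbackPlan(baseline);
}

// Toy stages for illustration; 0.2 is a made-up emission factor.
const stages: Stages<{ km: number }, { kg: number }, string[]> = {
  validateInput: (body) => {
    const km = (body as { km?: unknown } | null)?.km;
    if (typeof km !== "number" || km < 0) throw new Error("invalid input");
    return { km };
  },
  computeBaseline: (input) => ({ kg: input.km * 30 * 0.2 }),
  callModel: () => ({ unexpected: true }), // simulate a malformed model response
  validateOutput: (raw) =>
    Array.isArray(raw) && raw.every((x) => typeof x === "string") ? (raw as string[]) : null,
  fallbackPlan: (b) => [`reduce transport (${b.kg} kg/month)`],
};
const plan = analyze({ km: 25 }, stages);
```

Because the simulated model response fails validation, plan comes from the deterministic fallback rather than from the model.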

Step 3: Results

Score transition (e.g. 42 → 86)
Emissions breakdown by category (transport, diet, electricity)
Ranked actions with constraint filters applied
Hero Action called out separately — the one thing to do first

Example Output

For a user with:

25 km daily car commute
mixed diet
low budget

ClimateOS identifies transport as the dominant source and prioritizes:

Reduce car usage (Hero Action)
Shift to public transport (partial)
Adjust diet (secondary impact)

High-cost options like EV or solar are rejected due to budget constraints.
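
To make the example concrete, here is a minimal deterministic sketch of constraint filtering plus impact-to-effort ranking. In ClimateOS this judgment is delegated to Gemini, so treat this as a stand-in for illustration, not the project's code. The Action shape, the impact and effort numbers, and the decide helper are all hypothetical.

```typescript
type Level = "Low" | "Medium" | "High";
const LEVEL_RANK: Record<Level, number> = { Low: 1, Medium: 2, High: 3 };

// Hypothetical action shape; impact and effort are made-up relative scores.
interface Action {
  name: string;
  impact: number; // higher means more emissions avoided
  effort: number; // must be greater than 0
  cost: Level;
  time: Level;
}

interface Decision {
  hero: Action | null;
  ranked: Action[];
  rejected: { name: string; reason: string }[];
}

function decide(actions: Action[], budget: Level, time: Level): Decision {
  const rejected: { name: string; reason: string }[] = [];
  const allowed: Action[] = [];
  for (const a of actions) {
    if (LEVEL_RANK[a.cost] > LEVEL_RANK[budget]) {
      rejected.push({ name: a.name, reason: "exceeds budget" });
    } else if (LEVEL_RANK[a.time] > LEVEL_RANK[time]) {
      rejected.push({ name: a.name, reason: "exceeds available time" });
    } else {
      allowed.push(a);
    }
  }
  // Rank the survivors by impact-to-effort ratio, best first.
  const ranked = allowed.sort((x, y) => y.impact / y.effort - x.impact / x.effort);
  return { hero: ranked.length > 0 ? ranked[0] : null, ranked, rejected };
}

const decision = decide(
  [
    { name: "Buy an EV", impact: 90, effort: 5, cost: "High", time: "Medium" },
    { name: "Install rooftop solar", impact: 70, effort: 4, cost: "High", time: "High" },
    { name: "Adjust diet", impact: 20, effort: 2, cost: "Low", time: "Low" },
    { name: "Shift to public transport", impact: 45, effort: 3, cost: "Low", time: "Medium" },
    { name: "Reduce car usage", impact: 60, effort: 2, cost: "Low", time: "Low" },
  ],
  "Low",    // budget constraint
  "Medium", // time constraint
);
```

With these placeholder numbers, the low-budget user gets "Reduce car usage" as the Hero Action, and the EV and solar options land in the rejected list with a reason attached.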

Step 4: Simulation

Sliders for commute, diet, and renewable percentage. Every adjustment recomputes the score client-side in real time: no API call, no loading spinner, same heuristic logic as the backend. This turns a one-time report into an exploratory tool.

Architecture

Frontend     →  Next.js 15 (App Router) + React 19
Styling      →  Tailwind CSS + Framer Motion
API Routes   →  /api/analyze, /api/explain
Validation   →  Zod — applied to both input and AI output
AI Layer     →  Google Gemini 2.0 Flash
Identity     →  State-First, Database-Less Identity System (Auth0 + LocalStorage)
Simulation   →  Client-side heuristics via useMemo
Persistence  →  LocalStorage (results + user identity)

No traditional database and no self-hosted auth backend. Instead, a State-First, Database-Less Identity System: Auth0 issues a stable user identifier (the sub claim) that keys into LocalStorage, giving users persistence and a consistent identity across sessions on the same browser, without cold starts, schema migrations, or a SQL layer. Because LocalStorage is per-browser, saved results do not follow the user to another device.

The Hybrid Reasoning Engine — Core Technical Decision

This is the part most "AI climate tools" get wrong.

Handing raw inputs to an LLM and asking it to produce an action plan gives you inconsistent numbers, confident hallucinations, and no reproducibility. Pure rules-based systems can't reason about tradeoffs. The split between the two is where the real work happened.

Layer 1 — Heuristics (Deterministic)

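All emissions are computed as monthly totals with fixed factors in lib/heuristics.ts before Gemini ever sees the data. Daily values are scaled by 30, and renewable_pct is a fraction between 0 and 1, so a 40% input becomes 0.4: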

transport_emissions   = commute_km * transport_factor * 30
diet_emissions        = diet_factor * 30
electricity_emissions = kwh * 0.82 * (1 - renewable_pct)

Reproducibility — same inputs always produce the same baseline
Explainability — every number has a traceable source
Hallucination prevention — the AI receives computed values, not raw inputs to misinterpret

The LLM does not touch arithmetic. It receives results.
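
A minimal TypeScript sketch of those three formulas follows. Only the 0.82 grid factor and the 30-day scaling come from the formulas above; the transport and diet factors are placeholders I made up, the kg CO2e units are an assumption, and the function and type names are hypothetical rather than the contents of lib/heuristics.ts. The form's renewable percentage (0 to 100) is converted to a fraction before it enters the formula.

```typescript
type Transport = "Car" | "EV" | "Public" | "Bike";
type Diet = "Veg" | "Mixed" | "Non-Veg";

// Placeholder factors, NOT the values ClimateOS uses.
const TRANSPORT_FACTOR: Record<Transport, number> = { Car: 0.2, EV: 0.05, Public: 0.04, Bike: 0 }; // assumed kg CO2e per km
const DIET_FACTOR: Record<Diet, number> = { Veg: 3, Mixed: 5, "Non-Veg": 7 }; // assumed kg CO2e per day
const GRID_FACTOR = 0.82; // from the post; unit assumed to be kg CO2e per kWh
const DAYS_PER_MONTH = 30;

interface LifestyleInput {
  commuteKm: number;    // daily commute in km
  transport: Transport;
  diet: Diet;
  kwhPerMonth: number;
  renewablePct: number; // 0-100, as entered in the form
}

interface Baseline {
  transport: number;
  diet: number;
  electricity: number;
  total: number;
}

// Pure function: the same inputs always produce the same baseline.
function computeBaseline(input: LifestyleInput): Baseline {
  // Clamp to 0-100, then convert the percentage to a 0-1 share.
  const renewableShare = Math.min(Math.max(input.renewablePct, 0), 100) / 100;
  const transport = input.commuteKm * TRANSPORT_FACTOR[input.transport] * DAYS_PER_MONTH;
  const diet = DIET_FACTOR[input.diet] * DAYS_PER_MONTH;
  const electricity = input.kwhPerMonth * GRID_FACTOR * (1 - renewableShare);
  return { transport, diet, electricity, total: transport + diet + electricity };
}

const baseline = computeBaseline({
  commuteKm: 25,
  transport: "Car",
  diet: "Mixed",
  kwhPerMonth: 200,
  renewablePct: 50,
});
```

Keeping this a pure function with no I/O is what lets the same module run on the server and in the browser simulator.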

Layer 2 — Gemini 2.0 Flash (Reasoning Engine)

Gemini operates on the computed emissions data and performs four specific tasks:

Ranking — selects top 5 actions by impact-to-effort ratio
Constraint filtering — removes options outside the user's budget or time window
Tradeoff analysis — surfaces real downsides (e.g. "switching to EV requires significant upfront cost")
Rejection reasoning — explains why alternatives didn't make the list

Output is strictly typed via a 60+ line AnalyzeOutputSchema Zod contract. If the response breaks the schema, the app falls back to the deterministic engine. Gemini is the reasoning layer, not the source of truth.
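
The validate-or-fall-back gate can be sketched without any dependency. In the real project this check is a Zod schema; the hand-written type guards below are a dependency-free stand-in, and the field names (heroAction, actions, title, rank, tradeoff) are hypothetical, since the actual AnalyzeOutputSchema is not shown in this post.

```typescript
// Hypothetical output shape; the real AnalyzeOutputSchema has 60+ lines.
interface PlanAction {
  title: string;
  rank: number;
  tradeoff: string;
}
interface AnalyzeOutput {
  heroAction: string;
  actions: PlanAction[];
  source: "model" | "fallback";
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isPlanAction(v: unknown): v is PlanAction {
  return (
    isRecord(v) &&
    typeof v.title === "string" &&
    typeof v.rank === "number" &&
    typeof v.tradeoff === "string"
  );
}

// Parse raw model text; any syntax or shape problem routes to the fallback.
function parseModelOutput(rawText: string, fallback: () => AnalyzeOutput): AnalyzeOutput {
  let data: unknown;
  try {
    data = JSON.parse(rawText);
  } catch {
    return fallback(); // not even valid JSON
  }
  if (!isRecord(data)) return fallback();
  const hero = data.heroAction;
  const actions = data.actions;
  if (typeof hero !== "string" || !Array.isArray(actions) || !actions.every(isPlanAction)) {
    return fallback(); // valid JSON, wrong shape
  }
  return { heroAction: hero, actions: actions as PlanAction[], source: "model" };
}

const deterministicPlan = (): AnalyzeOutput => ({
  heroAction: "Reduce car usage",
  actions: [],
  source: "fallback",
});

const good = parseModelOutput(
  '{"heroAction":"Reduce car usage","actions":[{"title":"Carpool twice a week","rank":1,"tradeoff":"less flexibility"}]}',
  deterministicPlan,
);
const malformed = parseModelOutput("Sure! Here is your plan: {", deterministicPlan);
const wrongShape = parseModelOutput('{"heroAction":"Reduce car usage","actions":[{"title":"Carpool"}]}', deterministicPlan);
```

The third case is the important one: the JSON parses cleanly, but an action is missing fields, so the gate still rejects it.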

Layer 3 — Simulation Engine (Client-Side)

The same heuristic functions from the backend run in the browser. Slider changes trigger useMemo recalculations — sub-100ms, no network call. The simulation isn't an approximation of the backend — it's the same model.

Key Features

Real-Time Simulator

The simulator requires sharing computation logic across server and client. Most climate tools skip it. The result is a tool people actually explore rather than a report they read once.

Decision Engine, Not a Recommendation List

A recommendation list has no hierarchy. This has a Hero Action, ranked supporting actions, and explicitly rejected alternatives with reasoning. Users don't need more options — they need a clear first move.

Collective Impact Engine with Elastic Scaling

Individual actions are scaled to population level across a dynamic range, from 1,000 to 1,000,000 people. Users can simulate the effect for a community, a city district, or an entire metropolitan area: "If 500,000 people in your city adopted this plan, it would eliminate X tonnes of annual emissions." This reframes individual action as system-level impact.
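
The scaling itself is simple arithmetic. Here is a minimal sketch, assuming per-person savings are tracked in kg per month (consistent with the monthly heuristics) and that the slider is clamped to the 1,000 to 1,000,000 range. The function name and the 50 kg figure are hypothetical.

```typescript
const MIN_POPULATION = 1_000;
const MAX_POPULATION = 1_000_000;

// Scale one person's monthly saving (assumed kg CO2e) to annual tonnes for a population.
function collectiveImpactTonnes(perPersonKgPerMonth: number, population: number): number {
  // Clamp the slider value to the supported range.
  const people = Math.min(Math.max(Math.round(population), MIN_POPULATION), MAX_POPULATION);
  const annualKg = perPersonKgPerMonth * 12 * people;
  return annualKg / 1000; // kg -> tonnes
}

// Made-up example: a plan that saves 50 kg per person per month.
const cityImpact = collectiveImpactTonnes(50, 500_000); // 300000 tonnes per year
```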

Shareable Impact Card

Exportable PNG via html-to-image. Designed to spread.

Technical Decisions

Why hybrid instead of pure AI?

Pure LLM for emissions math = hallucination risk + inconsistent outputs. Pure heuristics = no contextual reasoning. The split gives you deterministic accuracy where you need it and flexible judgment where rules fall short.

Why Zod on AI output?

JSON.parse() alone is not enough for raw LLM output. It throws on malformed JSON, and it happily accepts valid JSON with missing fields, renamed keys, or wrong types, which then fails somewhere downstream. The AnalyzeOutputSchema Zod contract (60+ lines) enforces a strict interface. If the AI breaks it, the error is caught before it reaches the user.

Why client-side simulation?

API calls add 3–5s of latency. Sliders need sub-100ms feedback. Running the heuristic logic in the browser is the cleanest way to get there. The tradeoff, keeping the client and server versions of the model in sync, is worth the UX gain.

Why State-First, Database-Less Identity?

No cold starts, no schema migrations, no self-managed auth backend. Auth0 provides a stable user identifier (sub) that keys into LocalStorage, giving users long-term persistence and a consistent profile on the same browser without a traditional database.
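
Here is a minimal sketch of sub-scoped persistence. The key layout ("climateos:&lt;sub&gt;:&lt;name&gt;"), the helper names, and the sample sub values are hypothetical. The KeyValueStore interface is a small subset of the Web Storage API, so window.localStorage can be passed in directly in the browser, while the in-memory store lets the sketch run anywhere.

```typescript
// Minimal subset of the Web Storage API that localStorage satisfies.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// In-memory stand-in for localStorage, for running outside a browser.
function createMemoryStore(): KeyValueStore {
  const data: Record<string, string> = {};
  return {
    getItem: (key) => (Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null),
    setItem: (key, value) => {
      data[key] = value;
    },
  };
}

// Hypothetical key layout: every entry is namespaced by the user's sub.
function scopedKey(sub: string, name: string): string {
  return `climateos:${sub}:${name}`;
}

function saveForUser<T>(store: KeyValueStore, sub: string, name: string, value: T): void {
  store.setItem(scopedKey(sub, name), JSON.stringify(value));
}

function loadForUser<T>(store: KeyValueStore, sub: string, name: string): T | null {
  const raw = store.getItem(scopedKey(sub, name));
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as T;
  } catch {
    return null; // corrupted entry: treat as missing
  }
}

const store = createMemoryStore(); // in the browser: window.localStorage
saveForUser(store, "auth0|user-a", "results", { score: 86 });
const mine = loadForUser<{ score: number }>(store, "auth0|user-a", "results");
const theirs = loadForUser<{ score: number }>(store, "auth0|user-b", "results");
```

Two users on the same browser never read each other's entries, because the sub is part of every key.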

Challenges

Gemini Rate Limits (429 errors)

Free-tier quota runs out fast during live demos. Fix: retry with exponential backoff on 429s, then fall back fully to the deterministic engine if retries are exhausted. The fallback is less rich, but the app doesn't break.
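
A minimal sketch of that retry-then-fallback behavior follows. The ModelResponse shape, the option names, and the delay values are hypothetical and not taken from the ClimateOS code. The sleep function is injected so the logic can be exercised without real waiting; production code would typically pass a setTimeout-based sleep and add jitter.

```typescript
// Hypothetical response shape for a model call.
interface ModelResponse {
  status: number;
  body: string;
}

interface RetryOptions {
  maxRetries: number;                   // retries after the first attempt
  baseDelayMs: number;                  // delay before the first retry
  sleep: (ms: number) => Promise<void>; // injected so callers control waiting
}

// attempt 0 -> base, attempt 1 -> 2x base, attempt 2 -> 4x base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attempt);
}

async function callWithFallback(
  callModel: () => Promise<ModelResponse>,
  fallback: () => string,
  opts: RetryOptions,
): Promise<string> {
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    const res = await callModel();
    if (res.status === 200) return res.body;
    if (res.status !== 429) break; // only rate limits are worth retrying here
    if (attempt < opts.maxRetries) {
      await opts.sleep(backoffDelayMs(attempt, opts.baseDelayMs));
    }
  }
  return fallback(); // retries exhausted or non-retryable error
}
```

With maxRetries set to 3 and a 500 ms base delay, a persistently rate-limited call waits 500, 1000, and 2000 ms, then returns the deterministic plan.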

LLM Latency (3–5 seconds)

You can't optimize past the model's inference time. The fix is perceptual — staged loading UI with granular progress feedback makes 4 seconds feel faster than a blank spinner.

What's Next

Persistent backend (Postgres / Supabase) for action tracking over time
Geo-specific emission factors via Electricity Maps API
Habit loop — weekly check-ins tied to your Hero Action
Live grid carbon intensity via real-time energy APIs

Prize Categories

🏆 Use of Google Gemini

Gemini 2.0 Flash is used as a constrained reasoning engine, not a content generator. It receives pre-computed emissions data from lib/heuristics.ts (not raw inputs) and performs ranking, constraint filtering, tradeoff analysis, and rejection reasoning, all within a strict 60+ line AnalyzeOutputSchema Zod contract. Gemini is not generating free text here; it generates structured reasoning that must pass a typed schema gate on every call. If it breaks the schema, a deterministic fallback takes over. Gemini handles judgment, not arithmetic.

🏆 Use of Auth0

Auth0 issues a unique sub (subject identifier) for each user, which acts as a stable key for client-side persistence. This sub is used to scope and store data in LocalStorage (e.g. results, actions, history), enabling user-level isolation and cross-session continuity without a backend database. The design avoids a self-managed auth and storage layer while maintaining a consistent identity model, with straightforward extensibility to server-side persistence.

Why This Approach Matters

The dominant model for climate software is measurement: track what happened, surface the data, assume awareness drives change.

ClimateOS operates on a different premise: people don't lack awareness. They lack prioritized action.

Deterministic computation builds trust — users can see exactly where numbers come from
AI handles the combinatorial judgment problem that rules-based systems can't
Real-time simulation turns a one-time output into a tool people return to
Auth0-backed identity enables long-term continuity without database overhead

ClimateOS doesn’t measure your footprint better — it forces a decision. Not more data. Not more tips. One decision, made correctly.