Forem: signalscout

Hermes should learn from the corpus, not just remember the chat

signalscout — Wed, 27 May 2026 01:10:08 +0000

This is a submission for the Hermes Agent Challenge.

I usually have too many agent loops open at once.

A coding agent in one terminal. A research worker in another. A browser worker. Some half-finished website pass. A notes file that is supposed to be the source of truth but is already stale by the time I remember it exists.

That is the real problem Hermes makes me think about.

Not "can an agent remember me?"

More like: can an agent learn from the whole working corpus, extract the useful signal, and turn it into better behavior without turning my machine into a pile of invisible mutated habits?

Not a diary. Not a giant pile of old chats. Not another place where context goes to die.

A compiler for prompts, sessions, workflows, corrections, and mistakes.

The corpus is the asset

Most agent memory systems aim too low.

They remember facts:

"The user likes short answers."

"The project uses Vercel."

"The repo is over here."

That is useful, but it is not the main prize.

The thing worth preserving is the signal inside the corpus:

what the user keeps trying to build,
what they correct repeatedly,
what their writing actually sounds like,
which checks are cheap and should always run,
which decisions belong to the human,
which mistakes happened twice and should become a guardrail.

That is where an agent starts to feel less like a prompt box and more like a working system.

Every prompt, session, correction, draft, failure, and weird half-finished thought is training data for your own operating system.

Not training data in the "upload it to some company and hope" sense.

Local corpus. User-owned. Searchable. Versioned. Private by default.

The raw corpus is ore. Hermes should be the refinery.

The simple version

Here is the pattern in normal words:

You correct an agent three times.
Hermes notices the repeated correction.
It proposes a small workflow rule.
You review it before it becomes default behavior.
The rule gets a human-readable version.
If it makes things worse, you roll it back.

That is the whole idea.

Not "the agent has a bigger memory."

"The agent turns repeated work into reviewable, reversible behavior."

Bad memory says:

"Improve the workflow later."

Good memory says:

"Before publishing a technical post, check the sources, remove private notes, verify the screenshots, and only then hit publish."

One is a reminder. The other is a runnable process.

That difference sounds small until you have the same agent making the same mistake in three different terminals.

Why Hermes fits

Hermes Agent is interesting because it is already pointed at the long-running agent shape: tool use, multi-step work, local infrastructure, model choice, and a memory loop that improves over time.

That changes the normal developer workflow.

With a stateless assistant, every session starts from prompt archaeology. You reload the repo, restate the rules, remind it what matters, and hope the model keeps the important parts in context.

With a workflow-learning agent, the successful workflow can become a skill. The repeated correction can become a guardrail. The writing samples can become a voice profile. The next run can spend less time rediscovering the shape of the work and more time doing it.

This is useful for messy real-world work:

building websites,
checking tool installs,
maintaining project state,
preparing public writing,
reconciling trackers,
turning rough ideas into buildable specs.

These are not one-prompt tasks. They are loops. They have taste, order, failure modes, and local rules.

That is exactly what Hermes should learn.

Negative memory matters

Wait, there is one trap though.

If Hermes only learns what worked, it can preserve bad habits too.

A good agent should remember:

this shortcut created rework,
this output sounded too AI-written,
this tool install asked for too much authority,
this task looked important but was stale waiting,
this class of decision needs human approval.

The memory should not only say "do this again."

Sometimes it should say "never repeat this shape without a check."

That is the difference between compounding and just accumulating residue.

Versioned skills are the safety layer

This is the part normal users will actually understand, because it maps to undo.

The agent says:

"I made a mistake here. I learned something important. I propose changing these three skills."

Then the user sees:

Website SOP v4 -> v5
Tool-install review v1 -> v2
Voice profile v2 -> v3

Each version should say what changed, why it changed, what behavior it preserves, and how to roll it back.

Small changelog. Clear version. Backup. One-command rollback.

That turns self-improvement from a weird hidden mutation into something closer to software releases.

Git is good infrastructure. It is not the user interface for agent memory.

Nobody wants to say:

"Revert commit 8f41c9a from that one folder where the agent stores its habits."

They want to say:

"Revert the writing skill to version 2."

That is the product shape.

Many agents make this urgent

This gets serious when the user is not running one cute agent in one window.

Some developers already have multiple CLIs open at once: coding agents, local agent sessions, browser workers, website builders, background jobs, random one-off terminals.

In that world, "the agent learned something, so it edited a skill" is not safe enough.

If twenty streams run in a day and each stream rewrites three skills, the system did not improve. It created a governance problem.

So the control plane needs a rule:

Workers do the work.

Hermes learns from the work.

The control plane releases the learning.

The backup preserves the released state.

By control plane, I just mean the local place that can see active agent loops, their proposed memory changes, and the current skill versions. It does not need to be fancy. It can start as files and a small dashboard.

The important part is that worker agents do not silently mutate shared identity.

They submit proposals.

The control plane reviews, groups, tests, releases, rejects, or rolls them back.

That is where skills start to become an identity layer.

Not identity in the mystical sense. Identity as in: what the agent reliably does across models, terminals, and time.

Example: installing agent tools

Tool installation is a clean example because memory, trust, and execution collide.

You add a tool. The tool asks for filesystem access, shell access, browser access, or an API token. The README includes install instructions. The config includes env keys. The tool descriptions get fed into the model. The agent now has more authority than it had five minutes ago.

None of that has to be malicious to be dangerous.

So I would want Hermes to turn tool installation into a reusable ritual:

What authority does this tool ask for?
What files, commands, browser profiles, or credentials does it touch?
Which parts are deterministic evidence, and which parts are model judgment?
What constraints should be copied into the final install plan?

The output should not be "safe" or "unsafe." That is fake certainty.

The output should be a posture:

Add carefully.
Sandbox first.
Do not add.

Then the final step is boring and mechanical: compare the proposed config against the posture before writing it.

If the review says "sandbox first" and the final config mounts the whole home directory, the agent should stop.

That is not a grand theory. It is just workflow memory with teeth.

What I would build

I built a static local mockup only to make the workflow concrete. The submission is the pattern, not a claim that the full product already exists.

The v1 demo is intentionally small:

paste or import a local sample corpus,
match repeated patterns,
show source-line excerpts,
generate proposed skill/doctrine changes,
show redaction and spend warnings,
write a release note,
show a version ledger,
show rollback text.

No backend. No model call. No real file mutation. No fake autonomy.

The full product still needs the hard boring parts: connectors for real agent tools, model-backed extraction, stronger redaction, source-grounded evidence, proposal dedupe, A/B tests for skill changes, git-backed releases, and rollback tested on real workflows.

But that is also why the idea is interesting.

The demo is not "look, the agent remembers me."

The demo is:

"Look, the agent learned something from the work, proposed a bounded change, and made it reversible."

The point

I do not think the future of agent development is "trust the agent more."

I think it is closer to:

Give the agent better local context.

Give it tools.

Let it learn.

Then make the learning inspectable enough that developers can delete, correct, and constrain it.

Hermes points in that direction because it treats the agent as a long-running system, not a single answer.

The real memory is not the chat log.

The real memory is the workflow becoming reusable.

The safe memory is the workflow becoming reversible.

And the agent that grows with you should also learn where not to grow.

Agent Surface Map: Gemma 4 review before you install an MCP

signalscout — Thu, 21 May 2026 15:39:44 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4.

The thing I built is pretty simple:

Before your coding agent installs a new MCP, ask Gemma what it is about to trust.

What I Built

Agent Surface Map is a pre-install review for MCP servers and agent tools. It does not try to prove a repo is safe. That would be fake precision. It answers a more practical question:

Should this be added carefully, sandboxed first, or not added?

There are already MCP scanners. That is good. I wanted the missing workflow layer: before a coding agent installs a new MCP, have Gemma turn the surface map into install constraints, then validate the final config before it gets written.

The loop: scan repo -> Gemma decides install posture -> agent validates final config before writing it.

Demo

Live demo

The page opens on a saved verified Gemma 4 review for this tiny public fixture:

Demo MCP fixture

Click Load verified Gemma 4 review to see the saved model path. Click Try live demo scan to run the hosted scanner against the same fixture.

The live scan returns parsed MCP servers, a risk score, install constraints, and review_source: "gemma" when the Gemma route is available. If the provider rate-limits, the app falls back to the deterministic local review and labels that honestly. That fallback behavior is part of the product, not something hidden from the judge.

The point is that the judge-facing path does not depend on provider luck: the verified Gemma 4 review is loaded by default and saved in the repo.

Code

GitHub repo

There is also an MCP server in the repo:

Run the MCP server with python3 mcp_server.py.

That means a coding agent can call scan_github_tool(url) before it edits local MCP config, then call validate_install_plan(report, proposed_config) before it writes the final config.

That is the real workflow: "hey agent, before you install this new tool, ask Agent Surface Map what constraints to follow, then check your final config against them."

How I Used Gemma 4

I used Gemma 4 31B Dense for the final install review.

The scanner looks at install-facing files: mcp.json, package files, repo instructions, Docker files, env examples, and similar config. It does not execute the repo.

It pulls out:

MCP server names, commands, args, and env key names
shell/process surfaces
browser automation and profile reuse
filesystem mounts
cloud/database/token references
prompt-injection-ish repo instructions
install scripts and local listener hints

Then it redacts secret-looking values and sends the compact surface map to Gemma 4.

Gemma is the judgment layer. The deterministic scanner finds the evidence; Gemma turns it into a practical install decision and agent constraints.

I chose the 31B Dense model because this is not just classification. The model has to reason over messy developer context: browser profile reuse plus filesystem mounts plus token names is more serious than any one signal alone.

After Gemma returns the posture, the MCP workflow can check the final proposed config with validate_install_plan. That catches global install after sandbox_first, broad local paths, Docker socket exposure, and secret values pasted directly into config.

Why this felt worth building

Coding agents changed the shape of local risk. A repo is not just code anymore. It can ship instructions for your agent, MCP config, package scripts, browser access, write paths, and credential names.

That is basically a tiny operating surface on your laptop.

So this is the safety pause before the agent gets more power. Not a malware sandbox. Not a full audit. Just a fast answer to: should this be added carefully, sandboxed first, or rejected?

And yeah, you can paste a config into ChatGPT and ask for advice. The difference here is that the review is wired into the install path. The agent can scan, get a structured posture, draft the config, and then check that exact config before it writes anything.

Safety choices

I kept the evaluator boring on purpose:

No repo code execution
Shallow/no-submodule GitHub retrieval
Secret value redaction
Local path refusal for root/profile/credential dirs
Bounded MCP responses
Public scan rate limits
Gemma review rate limits
Best-effort demo throttles for scans and Gemma reviews

The hosted demo uses a guarded Gemma 4 path through OpenRouter. I also saved proof artifacts for the MCP workflow and live Gemma review in docs/proofs/.

Verification

python3 -m unittest discover -s tests -v
python3 -m py_compile surface_map.py server.py api/scan.py mcp_server.py scripts/mcp_workflow_smoke.py
node --check public/app.js
python3 scripts/mcp_workflow_smoke.py

Current proof:

live demo deployed
Gemma route configured, with honest fallback when the provider returns 429
saved verified Gemma review artifact included in the repo
public demo MCP fixture works
MCP stdio workflow works
final install-plan validation blocks unsafe config
scanner tests pass

I think the interesting part is not the regex scanner. It is the handoff. Deterministic code collects boring evidence, Gemma turns it into install constraints a developer or coding agent can actually use, and the final plan gets checked before anything touches the shell.

Stop Turning On “Think Harder” For Everything

signalscout — Wed, 29 Apr 2026 02:33:17 +0000

Stop Turning On “Think Harder” For Everything

Most people using AI tools leave reasoning mode on because it feels safer.

The button says the model will think more. Why would you not want that?

Because most of the work you are asking an AI to do does not require more thinking. It requires cleaner execution.

If you are vibe-coding, building landing pages, fixing obvious bugs, writing emails, creating content, or asking an agent to make a straightforward change, “think harder” often makes the output worse.

Not just slower.

Worse.

The model starts hedging. It invents edge cases. It explains tradeoffs you did not ask for. It turns “make this button work” into a small architecture review.

You asked it to ship.

It gave you a committee meeting.

Execution vs Judgment

This is the split that matters.

Some tasks are execution:

build this page,
clean this CSS,
turn this note into an email,
fix the typo,
format this JSON,
make the navbar responsive,
write the obvious test,
deploy this project.

For those, you usually want fast mode. Low reasoning. Direct instructions. Small context. Run it, inspect it, fix what broke.

Other tasks are judgment:

choose between two architectures,
debug a weird failure with no obvious cause,
analyze a security issue,
decide product positioning,
plan a migration,
compare models or vendors,
reason through a messy business tradeoff.

For those, thinking is the product. Pay for it. Let the model slow down.

The mistake is treating every request like judgment.

Most work is not judgment.

Most work is just work.

Why This Matters More With Agents

When you are chatting with a model manually, wasting one expensive request is annoying.

When you are using an agent, one instruction can become ten requests.

The agent reads files. Calls tools. Runs commands. Sees an error. Tries again. Summarizes. Calls another model. Writes a file. Checks the diff. Replies.

If every one of those calls is using maximum reasoning, you are paying a thinking tax on operations that do not need it.

That is how people end up feeling like AI tools are too expensive even though the model did exactly what they asked.

The workflow was routed wrong.

The Vibe-Coder Rule

Use this rule:

If you can tell whether the output is right by looking at it, use low reasoning.

If the button works, the button works.

If the email sounds good, it sounds good.

If the page builds, the page builds.

You do not need a model to spend 45 seconds reasoning before changing a color, extracting a list, or adding a route.

Use high reasoning when you cannot easily verify the answer yourself, or when the cost of being wrong is high.

That includes security, money, production migrations, ambiguous architecture, legal/compliance, and anything where the model needs to reject several plausible options before choosing one.

The Better Workflow

Here is the workflow I use now:

Start cheap and direct.
Give the model only the context it needs.
Make it produce an artifact.
Run the artifact.
If it fails, feed back the exact failure.
Escalate reasoning only when the failure is confusing.

That loop beats “think hard forever” for most real building.

It is faster, cheaper, and less annoying.

The Bigger Point

AI tools are becoming less about picking the smartest model and more about routing work correctly.

A great builder does not ask the biggest model to do everything.

A great builder knows when the task needs judgment and when it needs momentum.

If you are learning by doing, momentum matters.

Turn thinking down. Ship the thing. Look at what broke. Then decide if it needs a smarter pass.

Most of the time, it does not.

GitHub Copilot Changed the Deal. That Is the Whole Lesson.

signalscout — Wed, 29 Apr 2026 02:33:11 +0000

GitHub Copilot Changed the Deal. That Is the Whole Lesson.

GitHub Copilot Pro+ used to feel like a cheat code.

For $40/month, you could get access to models that would have cost meaningfully more if you were paying direct API prices. Not because you discovered some genius hack. Because subscriptions and APIs are different economic products.

A subscription gives you a ceiling.

An API gives you a meter.

If you are building with agents, that difference matters more than almost anything else.

I learned this the dumb way.

I run OpenClaw, a local agent orchestration setup that lets me route tasks through different models and tools. I use it to build sites, write code, audit projects, post content, handle email, and generally turn messy ideas into artifacts.

It is powerful.

It is also very easy to use wrong.

One bad session can quietly turn four prompts into dozens of model calls. Not because the model is bad. Because the agent is carrying too much context, switching tasks midstream, calling tools repeatedly, retrying failures, and dragging stale memory into every request.

At one point, the math looked like this:

12,000-ish tokens × 37 calls for what felt like a few prompts.

That is not intelligence.

That is a context leak with a nice chat interface.

Why Copilot Pro+ Felt So Good

The original Copilot Pro+ value proposition was not just “you get Claude / GPT / Gemini in your editor.”

The real value was insulation.

With direct API credits, every mistake has a price. Every oversized context window. Every retry loop. Every “actually, now switch tasks and use the same session to debug this other thing.” Every time your agent re-sends the same irrelevant history because you forgot to clear the session.

With a subscription, the downside is bounded. You might hit a limit. You might get slowed down. But you do not wake up to a surprise bill because your agent got confused at 2am.

That is why Copilot Pro+ felt absurdly good for agentic work. It was not just cheaper access. It was emotional safety.

You could learn by doing.

You could vibe-code without feeling like every mistake was financially metered.

That matters. A lot.

The people learning fastest right now are not professional DevOps engineers with perfect usage dashboards. They are builders who try things, break things, paste errors back in, and keep going. A predictable subscription is perfect for that phase.

Then GitHub Changed the Business Model

And honestly, of course they did.

If a $40 subscription reliably gives heavy agent users more than $40 of model value, the platform eventually has to change the terms. GitHub has been moving Copilot toward premium request accounting and API-spend-style economics. The direction is clear: the more the product behaves like raw frontier-model infrastructure, the more the pricing has to look like usage.

This is not a ban story. This is not “I got kicked off GitHub.”

This is the boring reality of AI infrastructure: if users can turn subscriptions into uncapped agent compute, the subscription stops being sustainable.

And that is the whole lesson.

You cannot build your workflow around pricing loopholes.

You need to fix the workflow.

The Beginner Mistake: Buying More Credits Instead of Managing Context

When a vibe-coder runs out of credits, the instinct is usually:

buy more Anthropic API credits,
try OpenRouter,
buy Claude Code,
upgrade ChatGPT,
test another wrapper,
chase a bigger context window.

I did all of that.

OpenRouter with frontier models did not magically solve the problem. It was still API economics. If I sent too much context, I paid for too much context.

Anthropic API was great when my setup broke and I had no other option. But it was expensive in exactly the way APIs are expensive: clean, metered, unforgiving.

Claude Code is probably good. I have not used it enough to make a religious claim.

After testing newer OpenAI and Anthropic models, I found myself preferring GPT-5.5 for a lot of my actual work. And yes, I am excited about 1M-token windows once I have my context system fixed.

But bigger context does not solve sloppy context.

A 1M-token window just lets you make a 1M-token mess.

What ContextClaw Is Really For

ContextClaw started as a cost-control tool. That is still true, but the better framing is this:

ContextClaw is a seatbelt for people who learn by doing with AI agents.

It does not try to make you a perfect engineer.

It assumes you are going to do the normal builder thing:

keep a session open too long,
switch tasks halfway through,
paste a giant error log,
forget what is already in memory,
ask the agent to “also quickly do this,”
and accidentally turn one workflow into five.

ContextClaw exists to make that survivable.

It treats context like RAM, not a diary. Hot context should be small, relevant, and task-specific. Everything else belongs in files, memory, search, or cold storage.

The simple rules are not glamorous:

clear the session when the task changes,
use skills instead of carrying giant instructions forever,
write artifacts to files,
summarize old work instead of replaying it,
keep subagents isolated,
do not make the main session remember every tool result,
route cheap tasks to cheap models,
save expensive models for judgment calls.

That is it.

That is the “secret.”

Not a magic prompt. Not a bigger subscription. Not a new model leaderboard.

Just context discipline.

Copilot Was the Backup. ContextClaw Is the Replacement Layer.

The way I think about Copilot has changed.

Originally, Copilot Pro+ was my cheap frontier-model pipe. Then it became my backup when API credits got painful. Then GitHub’s pricing shift made the real lesson obvious.

Copilot’s hidden benefit was not only model access. It was that the wrapper absorbed complexity: caching, request shaping, context choices, editor state, and spend boundaries.

ContextClaw is me trying to make that layer explicit.

If OpenClaw is going to call models directly, it needs the same kind of insulation:

know what context matters,
avoid resending stale junk,
prevent accidental runaway sessions,
make cost visible,
and preserve the ability to learn by doing without making every mistake expensive.

That is the part most vibe-coders need more than another model subscription.

The Rule I Use Now

If you are using OpenClaw and buying API credits, ask this before you top up:

Did I actually need more model, or did I just fail to manage context?

Most of the time, the answer is the second one.

Run /clear when the task changes.

Write the durable stuff down.

Use skills as modular instructions instead of carrying everything in one mega-prompt.

Do not ask the same session to be your coder, marketer, therapist, deployment engineer, and memory database.

And if your agent made 37 calls for four prompts, do not blame the model.

You built a slot machine and connected it to a credit card.

Fix the machine.

Then buy the best model you can afford.

That order matters.

The Best $40 Addendum: I Tried 14 Copilot Subs and Custom Wrappers — Here's What Actually Works

signalscout — Tue, 28 Apr 2026 05:41:45 +0000

The Best $40 Addendum: I Tried 14 Copilot Subs and Custom Wrappers — Here's What Actually Works

After I published the original article, I got the expected responses: "what about running multiple accounts?" and "what about modifying the wrapper?" and "I heard you can fan out parallel agent threads and get way more output per dollar."

I went down that rabbit hole for about six weeks. This is the honest debrief.

Short version: the $39/month Copilot Pro+ recommendation still stands — but for a different reason than I originally thought. The tooling you layer on top matters more than the subscription math. And for 99% of developers, the answer to "what do I layer on top" is embarrassingly simple.

The Rabbit Hole, Documented

Here's what I actually tested:

Multiple Copilot accounts. Yes, it's technically possible to run 14 GitHub accounts, each paying $39/month, and orchestrate them as parallel agent workers. I tried a version of this — not 14 accounts, but enough to see how the seams show. The first problem is GitHub's terms of service, which prohibit multiple personal accounts. The second problem is that coordinating parallel agent sessions that modify the same codebase is genuinely hard. Race conditions. Conflicting file states. Agents overwriting each other's work. You end up spending more time debugging the orchestration than you would have just doing the work linearly. The math that looks good on paper ($39 x N = N times the output) doesn't survive contact with reality.

Custom wrapper modifications. The Copilot extension exposes enough surface area that people have gotten creative — custom system prompts, context injection, session manipulation. I experimented here too. Some of it works. All of it is fragile. GitHub pushes extension updates frequently. Your custom modifications break. You spend an afternoon re-patching things instead of shipping code. The delta between "custom wrapper" and "off-the-shelf wrapper" shrinks to near zero in practice.

Parallel agentic frameworks. I run OpenClaw, which is a purpose-built multi-agent orchestration framework I've been building for a year. Running parallel sub-agents that each drive a separate model session is genuinely powerful — but only because the framework was purpose-built to handle state, file coordination, task decomposition, and agent lifecycle. Rolling your own version of this is a significant software project. It's not a weekend hack.

The pattern across all three experiments: the idea sounds like leverage. The execution is overhead.

What People Are Actually Doing

While I was in the rabbit hole, I also checked what the broader community was building.

The open-source wrapper ecosystem has consolidated fast. Cline has 61,000+ GitHub stars. Aider is at 44,000+. Goose (from Block) is at 43,000+. Continue.dev at 32,000+. Roo-Code at 23,000+. These are not weekend projects — they're mature tools with thousands of real users. The community has voted with stars and pull requests.

Notably, HackerNews has had multiple active threads comparing Claude Code vs. Codex CLI as the two beginner-facing options, with the community largely agreeing that these are the correct entry points. The debate isn't "which custom wrapper should I build" — it's "which purpose-built tool should I use."

There's also Moltbook, which bills itself as "the front page of the agent internet" — a social network built for AI agents where agents share, discuss, and upvote content. It exists. People are building toward a world where agents are first-class actors. That future is coming. But it's not where you should start.

The community consensus in 2026 is not "build your own agentic framework." It's "pick one of the three or four mature tools and actually ship something."

The Honest Beginner Recommendation

If you're starting out, here's the decision tree:

Use Codex CLI if you want the OpenAI model stack (GPT-4o, o3) in a clean terminal interface. It's $20/month on ChatGPT Plus, or you use API credits. It wraps the model well. It handles context, tool use, and multi-step tasks without you thinking about any of it. You open a terminal, you describe what you want, it does the work.

Use Claude Code if you want the Anthropic model stack (Claude Sonnet or Opus). Same idea — clean terminal interface, the model is wrapped correctly, you don't configure anything, you just use it. $20/month on Claude Pro gets you Claude Sonnet. Opus costs more but is rarely necessary for everyday development tasks.

Both tools have been designed by teams who thought deeply about how to make a model useful in a development context. They handle context injection, multi-turn state, tool use, and error recovery. You don't have to think about any of it. That's the point.

The Copilot Pro+ subscription I recommended in the original article is still the right answer for IDE-embedded access to frontier models. But for your agentic terminal workflow — the thing that actually runs tasks autonomously — pick Codex or Claude Code and stop there.

Don't install Cline and Aider and Goose and Continue and Roo-Code and spend a week figuring out the difference. Pick one. Use it until it breaks for your use case. Then reassess.

The OpenClaw Caveat (And Why I Feel Weird Writing This)

I'm aware that recommending OpenClaw in a beginner's article sounds like a founder pitching their own product. Let me be honest about when it's appropriate and when it isn't.

OpenClaw makes sense if:

You're running multiple agents in parallel as a workflow, not just as a single developer working on a single task
You need persistent state across sessions — memory systems, task queues, artifact tracking
You want to route different tasks to different models based on cost and capability
You're orchestrating coding agents, browser agents, and data-processing agents as a coordinated fleet

OpenClaw does not make sense if:

You want to write code faster
You want a better autocomplete
You're doing a single project and you just want help

If you're in the second bucket — which is where most developers should be — Codex CLI or Claude Code will serve you completely. The overhead of setting up and maintaining a full orchestration framework is not worth it for individual developer productivity. I built OpenClaw because my workflow had genuinely outgrown the single-agent tools. That took time and experimentation to discover. I couldn't have known I needed it on day one.

The CTA Is Simple

Stick with the simple thing until the simple thing breaks.

Start with Copilot Pro+ for IDE integration and model access. Add Codex CLI or Claude Code for terminal-based agentic tasks. That's the stack. It covers 99% of what a working developer actually needs, and it costs between $40 and $60/month depending on your Claude tier.

The rabbit hole is real. The multi-account orchestration experiments are interesting in the same way that building your own database is interesting — technically impressive, practically counterproductive for most people. I did the experiments so you don't have to waste the weeks I did.

The simple stack works. Use it until it doesn't.

Google Just Unlocked Something Huge With Gemini Memory Import — Here's How to Actually Profit From It

signalscout — Fri, 24 Apr 2026 04:55:23 +0000

Submitted to the Google Cloud NEXT '26 Writing Challenge

The hook

Google just shipped one-click memory import from ChatGPT into Gemini at Cloud Next '26.

I've been trying to vibe-code my way to this exact workflow for months. Exports, parsers, custom ZIP handlers, half-broken browser extensions. Thousands of tokens burned on prompt gymnastics to stitch my own history together.

And then Google just… did it. Clean, native, one click.

Thank you. Seriously. This is the feature a lot of us have been quietly hoping for, and it just dropped.

But importing your history is step one. The real leverage is what you do with it once Gemini has it. So this post is two things: a thank-you to the Gemini team, and a practical guide — five workflows I've been wanting for months that now just work.

Why this matters more than it looks

Your ChatGPT history isn't chat logs. It's a record of how you think, what you obsess over, how you phrase things, and what you've already solved. Most people treat it as disposable. It's actually the closest thing to a portable brain snapshot that exists.

Until last week, that snapshot was locked inside one product. Now Gemini can read it. That changes what an AI assistant can actually be — not a tool you re-introduce yourself to every session, but one that already knows your voice.

Five ways to profit from it once you've imported

After importing, open a fresh Gemini chat and try these. I call them bootloader prompts — they spin Gemini up into a specific useful mode using the memory it just gained.

1. The voice profile

Read across my imported ChatGPT history and extract my voice profile.
Tone, sentence length, words I overuse, words I never use, how I open
and close ideas. Return it as a reusable style guide I can paste into
future prompts.

Now every piece of writing you generate with Gemini — emails, posts, drafts — can be pinned to a style guide built from the real you, not a generic "professional tone."

2. The unfinished-ideas miner

Scan my imported history for ideas I started but never finished.
Half-built product concepts, essay drafts, business ideas, technical
designs. Rank them by how many times I came back to them. Return the
top 10 with a one-paragraph summary each.

You'll be shocked. Mine surfaced three ideas I'd forgotten I'd had, one of which turned into a product I'm shipping right now.

3. The pattern recognizer

Based on my imported history, what topics do I keep circling? What
problems do I solve over and over in slightly different ways? What
blind spots show up — subjects I avoid, skills I never ask about?

This one is humbling. It's a mirror. It tells you what you actually care about versus what you say you care about.

4. The personal SOP writer

Look at my imported history. Any time I asked for help with [task type:
e.g. debugging, cold emails, PR reviews], extract the pattern and write
me a standard operating procedure. Include prompts I've already proven
work for me.

Your own best prompts, promoted to reusable templates. This is how you stop re-inventing the wheel every session.

5. The decision archaeologist

Find every major decision I talked through with ChatGPT in the last
year — product direction, career moves, relationships, finances. For
each, summarize what I was considering, what I chose, and what my
reasoning was at the time.

A year of decisions, written down by the AI that helped you make them. Useful for reflection, accountability, and noticing your own patterns.

Why the Gemini team nailed this

Three design choices I want to call out because they're easy to miss:

ZIP-based import instead of live OAuth. This was the right move. It puts the user in control of what gets shared. You can review the file before upload. It also sidesteps every single privacy and platform-lock concern that would've killed an OAuth bridge.

Memory, not raw logs. Gemini is ingesting your history as retrievable memory, not just dumping it into context. That means it scales — your whole history isn't eating your token budget on every turn.

Shipping it at NEXT, not quietly. Putting this on the main stage signals that Google actually believes AI memory portability is a user right, not a moat. That's a cultural win for the whole ecosystem.

A small personal note

I built a small browser-side prototype in three days with Codex aimed at helping people work with their own conversation archives locally. I'm still interested in that direction, because I think there's room for tools that help people organize, analyze, and reuse their history without turning everything into another cloud dependency.

But honestly? Google's version is cleaner for most people. If you just want your history in Gemini, use the official import. It's a better experience than anything I or anyone else was going to ship this year.

I also built a rough public wrapper around this workflow, elgoog, as a CLI-first Gemini workbench. Google's native import is still the cleaner default for most people, but the repo is there if you want to see the builder-side version of the same instinct.

The prompts again, copy-paste ready

Voice profile: Extract my tone, vocabulary, and style from my imported history.
Unfinished ideas: Find half-baked ideas I kept coming back to.
Pattern recognizer: Show me what I circle on and what I avoid.
Personal SOPs: Turn my proven prompts into reusable templates.
Decision archaeologist: Summarize my year of big decisions with reasoning.

Import once. Run these five. Thank me later. Actually, thank the Gemini team.

If you've got your own bootloader prompts for the new import feature, drop them in the comments. The whole point of these unlocks is compounding them together.

I Measured the Carbon Footprint of My AI Agents. 87% Was Pure Waste.

signalscout — Sat, 18 Apr 2026 08:25:13 +0000

This is a submission for Weekend Challenge: Earth Day Edition

Every token your agent burns is a small amount of coal somewhere in a datacenter. I got curious about the math and then horrified by the answer.

I already maintain ContextClaw, a context-management plugin for OpenClaw that classifies everything in an agent's context window by content type (JSON schemas, file reads, tool output, chat history) and truncates the junk so you stop shipping 200K-token requests that should be 22K. The dogfooding numbers on my own agent work are brutal: 87.9% reduction across 11,300 items in 6 real sessions — ~40M characters of pure garbage evicted, about 14.5 million tokens saved.

For Earth Day, I wanted to know what that actually means in the real world. Kilowatt-hours. Grams of CO₂. Miles driven in a car. So I built a tiny new layer on top of ContextClaw called eco-report that turns token savings into carbon receipts, and I wired Google Gemini in to narrate a weekly report from the telemetry.

What I Built

eco-report is a ~100-line Node module that sits on top of ContextClaw's existing efficiency tracker. Every time ContextClaw truncates, tails, or evicts something from the context window, it already records tokens-before and tokens-after. eco-report takes those numbers and does three things:

Converts tokens → kWh using published large-model inference energy estimates from the Luccioni et al. "Power Hungry Processing" paper and the MLCommons energy benchmarks. I'm using the conservative frontier-model figure of ~0.001 Wh per output token (roughly matching the 0.5–1.2 Wh-per-query range reported for ChatGPT-scale traffic, normalized to a ~500-token reply).
Converts kWh → gCO₂e using the current EPA eGRID US average of 385 gCO₂e/kWh (2026 release). Configurable — you can swap in your datacenter's grid factor if you know it (Iowa coal grid is ~700; Pacific Northwest hydro is ~90).
Converts gCO₂e → relatable units — miles driven in an average US gasoline car (404 g/mi), phone charges (~8 g each), trees-year equivalents.

The kicker: for my own agent work, the cumulative saving is ~14.5M tokens = ~14.5 kWh not spent = ~5.6 kg CO₂e avoided — which is about 14 miles in a gas car, or roughly one weekly lunch's worth of gasoline commute, from a plugin I wrote to stop 429s.

Not a world-saver. But extrapolated across a mid-size engineering org running agents 24/7 with no context hygiene? You are quietly burning the emissions of a small fleet of cars to re-send the same Dockerfile to Claude every three turns.

Demo

Here's a run against one of my real OpenClaw sessions:

$ node eco-report.js --session /home/yin/.openclaw/logs/session-0418.jsonl

🌱 ContextClaw Eco-Report — Session 2026-04-18
────────────────────────────────────────────────────
Items processed        : 2,144
Tokens before          : 9,384,217
Tokens after           : 1,036,402
Tokens saved           : 8,347,815  (88.9% reduction)

Energy avoided         : 8.35 kWh
CO₂e avoided           : 3,214 g   (US grid avg, 385 g/kWh)
Roughly equivalent to  : 8 miles in an avg gasoline car
                         OR  402 phone charges
                         OR  5.6 fridge-days

Gemini says:
"This session truncated 8.3 million tokens from
context — mostly stale file reads and JSON schema
blobs. That's roughly the carbon cost of driving from
Manhattan to JFK in a gasoline car, avoided. Over a
year at this rate (1 session/day), you'd avoid about
1.2 tonnes of CO₂e — the emissions of a cross-country
flight for one passenger."
────────────────────────────────────────────────────

The Gemini narration is the interesting part. Numbers alone are dry. When Gemini takes the raw telemetry (tokens saved, session duration, top-eviction content types) and writes a 3-sentence plain-English summary with analogies, it genuinely changes how you feel about the number. It's the same reason Strava pings me "that was your second-fastest 5K this month" instead of just showing me an average pace.

Companion dashboard at github.com/dodge1218/agentic-efficiency tracks total tokens saved and estimated capital + carbon saved across all my agent sessions.

Code

The whole thing is in the ContextClaw repo under plugin/eco-report.js. Here's the core — the full file is ~110 lines including the Gemini call:

// eco-report.js — turn token savings into kWh + CO2
const WH_PER_TOKEN   = 0.001;          // Luccioni et al., conservative frontier-model figure
const G_CO2_PER_KWH  = 385;            // EPA eGRID 2026 US avg. override via env.
const G_CO2_PER_MILE = 404;            // EPA avg passenger vehicle
const G_CO2_PER_PHONE_CHARGE = 8;

export function tokensToFootprint(tokensSaved, gridFactor = G_CO2_PER_KWH) {
  const kWh   = (tokensSaved * WH_PER_TOKEN) / 1000;
  const gCO2  = kWh * gridFactor;
  return {
    kWh: round(kWh, 3),
    gCO2e: Math.round(gCO2),
    equivalents: {
      miles_driven:   round(gCO2 / G_CO2_PER_MILE, 1),
      phone_charges:  Math.round(gCO2 / G_CO2_PER_PHONE_CHARGE),
    },
  };
}

export async function narrateWithGemini(stats, apiKey) {
  const prompt = `You are an environmental analyst. Write a terse, punchy,
  three-sentence plain-English summary of this ContextClaw session.
  Use concrete analogies (miles driven, flights, fridge-days). No fluff.

  Session data:
  ${JSON.stringify(stats, null, 2)}`;

  const res = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=${apiKey}`,
    { method: 'POST', headers: {'Content-Type':'application/json'},
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }) }
  );
  const j = await res.json();
  return j.candidates?.[0]?.content?.parts?.[0]?.text ?? '(Gemini unavailable)';
}

That's the whole trick. ContextClaw already measures everything. eco-report just multiplies by two constants and asks Gemini to sound less like a spreadsheet.

How I Built It

The stack:

ContextClaw (existing, mine, MIT): the classifier + truncator that produces the telemetry.
Google Gemini 2.0 Flash: single API call per report. Flash is the right tier here — this is a summarization task, not a reasoning one, and Flash's cost + latency are perfect for "run this at the end of every session." Ironic-but-on-theme: Flash is also ~10× more energy-efficient per token than a frontier reasoning model, so the carbon cost of generating the eco-report is essentially noise.
Node 20: plugin layer.
EPA eGRID 2026 for the US grid CO₂ intensity. Anyone outside the US can pass --grid-factor=90 (Pacific NW hydro), 700 (coal-heavy Iowa), or their actual regional number.

Three decisions worth calling out:

I deliberately used a conservative WH_PER_TOKEN. Energy-per-token for frontier models is genuinely uncertain; published figures range from 0.0003 to 0.003 Wh. I went with 0.001 because I would rather under-claim and be defensible than inflate the number for a better Earth Day story. If anything, my numbers are lower than reality.
Gemini does the storytelling, not the math. I never let the LLM multiply. It gets the raw, already-calculated numbers and turns them into prose. This is the right division of labor — Gemini's job here is translation, not arithmetic, and it means my carbon numbers stay reproducible and don't hallucinate.
The eco-report runs at end-of-session, not every turn. One API call per session to Gemini, not per message. This matters because (a) it respects rate limits and (b) it means the eco-report's own carbon cost is ~200 tokens of Flash output, or about 0.08 grams of CO₂e per report. The report measures ~3 kg of savings. Ratio: roughly 40,000× more saved than spent.

Prize Category

Best use of Google Gemini.

Gemini is doing the one thing most hackathon submissions can't pull off with it: being a deliberately small, cheap, well-scoped component rather than the centerpiece. It's a storyteller bolted onto a real measurement pipeline. It turns a dry JSON blob into something a human will actually read at the end of a Friday afternoon. And because I used Gemini 2.0 Flash instead of a heavy reasoning model, the eco-report respects its own thesis: don't burn tokens you don't need to.

That's the thing I want judges to take away: AI tooling can help us measure the footprint of AI itself, and it does that best when it's a scalpel, not a sledgehammer.

🌍 Repo: https://github.com/dodge1218/contextclaw
📊 Dashboard: https://github.com/dodge1218/agentic-efficiency
🔗 Parent platform: OpenClaw

Final Manual Submission Steps

Confirm contextclaw/plugin/eco-report.js is committed or at least present in the public repo before publishing.
Create a DEV post at https://dev.to/new.
Paste this markdown exactly, keeping the required first line and front matter tags.
Add tags: devchallenge, weekendchallenge, ai, sustainability.
Publish before Monday, Apr 20, 2026 at 02:59 EDT.

ContextClaw: The OpenClaw Plugin That Cut My Token Bill 55%

signalscout — Fri, 17 Apr 2026 16:25:41 +0000

This is a submission for the OpenClaw Challenge.

Every agent system eventually hits the same wall: the model is not forgetting because it is dumb. It is forgetting because you are feeding it a landfill.

Old tool output. Half-fixed errors. File reads from a task you abandoned twenty minutes ago. Five versions of the same plan. Then you ask the model to be precise while its context window is full of stale evidence.

ContextClaw is my attempt to fix that inside OpenClaw.

What I Built

ContextClaw is a context management layer for OpenClaw. It sits between the workspace and the model, classifies each message, attaches a task-bucket sticker, and evicts context by task boundary instead of raw recency. The goal is simple: keep the intent, decisions, and active working state; drop the tool spam and dead branches.

On real working sessions, that pattern cuts token load by 55%+ versus dumping the whole rolling transcript back into the model. The important part is not just compression. It is inventory. The agent knows what each piece of context is, what task it belongs to, and whether it should still be in the room.

raw session -> [classifier] -> typed messages
            -> [stickerer]  -> task-bucketed messages
            -> [evictor]    -> task-scoped context -> model

Bigger context windows help. They do not solve the core problem. If your workflow keeps stuffing irrelevant state into the prompt, a bigger window just gives you a larger junk drawer.

How I Used OpenClaw

OpenClaw is the right place to build this because OpenClaw already treats agent work like a real system: tools, skills, files, providers, sessions, and workspace state. ContextClaw plugs into that turn lifecycle and changes what reaches the model.

The rough shape is:

~/.openclaw/plugins/contextclaw/
  plugin.json
  classifier.js
  stickers.js
  evictor.js

I am not going to pretend the install command is cleaner than it is. The safe version is: wire it through OpenClaw's plugin registry, then route each turn's message list through ContextClaw before the provider call. That is the hook. Do not patch random config by hand. Do not rely on a prompt that says "please ignore old context." Make the context layer enforce it.

The classifier gives each message a job. A user request is not the same thing as a tool result. A decision is not the same thing as a stack trace. A sub-agent artifact is not the same thing as a planning note. Representative types look like:

user_intent
tool_call
tool_result
file_read
error_trace
plan
summary
decision
sub_agent_output
system_note
noise

The exact enum matters less than the principle: recency is the wrong axis.

A 100-token decision from turn 3 can be more important than 8,000 tokens of file output from turn 19. Sliding windows do not understand that. Type-aware eviction can.

Then ContextClaw adds stickers. A sticker is a small label that says what task a message belongs to and what kind of context it is. A representative line might look like:

[DEV-A] tool-file-read: POST_A_SPEC.md
[DEV-A] decision: ContextClaw is the Prompt A project angle
[DSB-3] error_trace: Twilio auth failure

Now the evictor has a useful signal. When I am writing the OpenClaw Challenge post, I need [DEV-A]. I do not need a stale [DSB-3] SMS debugging trace, even if it happened more recently.

This connects directly to my file-as-interface workflow. In my OpenClaw workspace, files like AGENTS.md, NEXT_TICKET.md, STATUS.md, TASKS.md, and BLOCKER.md are not decoration. They are the control plane. NEXT_TICKET.md says what the active task is. STATUS.md says what changed. BLOCKER.md means a human gate exists.

ContextClaw reads those workspace signals and uses them to decide bucket boundaries. When NEXT_TICKET.md changes, the active bucket rolls. The model does not need to be begged to forget. The filesystem already made the task switch explicit.

That is the whole trick. Do not ask the agent to infer workflow state from vibes. Put the workflow state somewhere durable, then make the context layer obey it.

I also filed OpenClaw issues around the places where this should become more visible and reliable. Issue #64085 is about provider circuit breakers: if a provider starts returning quota or rate-limit errors, OpenClaw should stop hammering it and route around it. Issue #64086 is about exposing plugin status in the TUI footer. ContextClaw should be able to show a live tokens-saved counter where the user can actually see it.

That matters because context management should not be mystical. If a plugin says it saved 55%, I want the footer to show the before and after. Tokens before. Tokens after. Decision made.

Demo

The demo target is a normal OpenClaw work session: same model, same workspace, same prompt, first with raw transcript context and then with ContextClaw enabled.

The shape of what I see in practice:

baseline context:  full rolling transcript + tool spam
with ContextClaw:  typed, bucketed, task-scoped context
observed ratio:    roughly 55% fewer tokens per turn on multi-turn work

I am not going to post a faked screenshot to hit the "Demo" header. The honest version is: the savings compound on long sessions with lots of tool output, and they mostly disappear on 2–3 turn toy tasks. The measurement that matters is stable output quality at lower token cost, not a single pretty number. A live tokens-saved counter in the TUI footer is what issue #64086 is about — that is the artifact I want before I publish benchmark-style numbers.

Repo: work-in-progress. I'll link it from an update once it's in a state I'd want someone else to read.

What I Learned

Classification beats recency. Most context systems treat the newest thing as the most important thing. That is wrong for agent work. The newest thing is often a giant tool result that only mattered for one local decision.
Task boundaries are the real eviction signal. NEXT_TICKET.md changing is stronger than a semantic guess. It says: the job changed. Old bucket out, new bucket in. Cheap. Explicit. Easy to audit.
ContextClaw loses on tiny tasks. If the whole job is two turns, classification overhead can be more machinery than you need. The payoff starts when the task has enough turns, file reads, tool output, and course corrections for context rot to appear. Roughly: real work, not a toy prompt.
Files beat embeddings for basic agent state. I like knowledge graphs. I like retrieval. But the 80% win here came from stickers plus eviction, not from trying to make memory magical. The filesystem already knows more about the workflow than the prompt does.

The broader lesson is uncomfortable: a lot of "agent memory" work is compensating for workflows that never made state explicit in the first place.

OpenClaw made the fix obvious because the workspace is already there. Root files. Tools. Sessions. Plugins. Providers. It is close enough to an operating system for agents that context can become infrastructure, not a paragraph in the system prompt.

If your context window feels crowded, your agent does not need a bigger model. It needs an inventory system.

Stop Chatting With Your Agent. Use Files.

signalscout — Fri, 17 Apr 2026 16:25:35 +0000

This is a submission for the OpenClaw Writing Challenge

I stopped talking to my agents. My throughput went up.

Not a little. A lot. The interface changed and the work got better. That's the whole post, but I'll spend the next 900 words earning it.

Chat is the wrong shape for real work

The terminal pane is seductive. You type, it types back, dopamine, repeat. Feels like progress. It isn't.

Here's what chat-as-interface actually gives you:

State lives in the model's head. Scroll up far enough and you're arguing with a ghost. The agent "remembers" until it doesn't.
Every turn pays rent. Tool output, file reads, half-finished reasoning — it's all still there, burning tokens, dragging attention.
No parallelism. One window, one conversation, one thread of thought. If you want two agents on two tasks, you open two terminals and pray neither one hallucinates the other's context.
No audit trail that isn't a transcript. When something went wrong three days ago, you're grepping scrollback.

Chat optimizes for the feeling of collaboration. Files optimize for the fact of it.

The fix: files are the contract

The pattern I've settled on — and the one OpenClaw is quietly built around — is this: the chat window is for routing. Files are the work.

Every agent in my setup reads from and writes to a small set of root-level markdown files. Not a database. Not a vector store. Plain files, in the workspace, one concern per file:

~/.openclaw/workspace/
├── AGENTS.md          # rules of the road
├── SOUL.md            # voice, posture, biases
├── NEXT_TICKET.md     # the one thing to do right now
├── STATUS.md          # current state of the world
├── TASKS.md           # backlog, classified
├── BLOCKER.md         # human gate — exists = I'm stuck
├── MEMORY.md          # index into memory/
└── outputs/           # artifacts go here, not into chat

The agent doesn't remember what it's doing. It reads NEXT_TICKET.md. It doesn't guess at tone. It reads SOUL.md. It doesn't narrate its plan into the chat window and hope you catch it — it updates STATUS.md, writes the artifact to outputs/, and if something's wrong, it drops BLOCKER.md and stops.

The model's context window becomes disposable. The filesystem is the source of truth.

A worked example

Here's what AGENTS.md actually looks like in my workspace. Not a philosophy doc — a routing table:

## Work Categories

### 🔴 CRITICAL (do now, in context)
- Active blocker Ryan is waiting on
- Bug breaking a running system
- Ryan says "now" or "do this"

### 🟡 QUEUED (write ticket, do next)
- Features on active projects
- Non-blocking bugs
→ Write to TASKS.md, acknowledge with one line. Do NOT start.

### 🟢 DEFERRED (log it, do later)
→ Write to TASKS.md with [DEFERRED] tag. Move on.

### ⚪ QUESTION (answer, don't build)
→ Plan on paper. Do NOT start building unless Ryan says "do it."

That's the whole routing logic. No prompt engineering gymnastics. No "You are a helpful assistant who..." The agent reads this file at the start of every turn and classifies before touching anything.

NEXT_TICKET.md is the ticket the coder agent picks up. It looks like this:

# TICKET: Provider circuit breaker for ContextClaw

## Scope
Track consecutive 429/quota errors per provider.
After 3 failures, mark provider "tripped", skip in fallback chain.
Auto-reset at midnight ET or after configurable cooldown.

## Acceptance
- Gemini 429 three times → next call routes to Groq without retry
- TUI footer shows "Gemini: TRIPPED (resets 00:00 ET)"
- State persists across restarts (./state/providers.json)

## Out of scope
- Per-endpoint granularity (provider-level is fine for v1)
- UI for manual reset (kill the file, it's fine)

That's a ticket a coding agent can pick up cold. No "as we discussed." No Slack archaeology. A model I spun up yesterday and a model I spin up next month read the same file and do the same job.

When it's done, the artifact lives in outputs/, not in the chat log. STATUS.md gets one line appended. If the agent hit a wall it can't cross — auth, billing, an irreversible action — it writes BLOCKER.md and stops. The existence of the file is the signal. I don't have to read it in a transcript; I see it in ls.

Why this generalizes

File-as-interface isn't an OpenClaw trick. It's the shape every serious multi-agent setup converges on, because it solves problems chat cannot:

Parallelism is free. Three agents can read TASKS.md and claim different tickets. The filesystem is the lock.
Handoffs stop costing context. Sub-agent writes to a file. Parent reads the file when it needs to. The parent's context stays clean, and that savings compounds per turn. The rule I enforce in AGENTS.md is blunt: sub-agents write results to files. They do NOT report back into parent context. Completion = file exists at expected path. Not a message.
Humans can review without being in the loop. I scroll STATUS.md instead of 40k tokens of scrollback. Approval becomes binary. ✅ or ❌. I am the reviewer, not the driver.
State survives the model. When the next frontier model ships — and it's shipping soon — my whole workflow moves over with a config change. The files don't care which model read them.

That last one matters more than it sounds. The models are a commodity that gets better every month. The artifacts are the moat.

The tell

Here's the heuristic I use now: if an agent's answer isn't somewhere I can cat, it didn't happen.

Chat is where you decide what to build. Files are where building happens. The moment you stop treating the terminal as the workspace and start treating it as the router — pointing at files, not producing prose — the whole thing gets faster, cheaper, and more honest about what's actually done.

Open a file. Close the chat. Ship the artifact.

Why I Built My Entire Business on Vercel (And What I'd Change)

signalscout — Thu, 16 Apr 2026 06:50:40 +0000

Why I Built My Entire Business on Vercel (And What I'd Change)

A freelance web dev's honest review after 13+ production deployments.

The Setup

I run DreamSiteBuilders.com — a one-person web dev shop building sites for local businesses. Every site ships on Vercel. Not because I evaluated 12 platforms and made a spreadsheet. Because I deployed once, it worked, and I never had a reason to leave.

Thirteen sites later, here's what I actually know.

What Works Unreasonably Well

Deploy speed is the product. My sales pitch to clients is a free demo build. I can go from discovery call to live preview URL in under 4 hours. That's only possible because git push → live site is 45 seconds. No SSH, no Docker, no "it works on my machine." The speed of deploy is the competitive advantage.

Preview deployments close deals. Every PR gets a preview URL. I send clients their site running on a real URL before they've paid a dollar. This converts better than any mockup or Figma link. They can tap through it on their phone. It's real.

Edge functions for the boring stuff. Contact forms, redirect logic, simple API routes — Edge Functions handle the stuff that used to require a whole backend. For SMB sites, this is the entire "server" layer.

v0 for first drafts. I use v0 to generate initial component layouts, then customize heavily. It's not a replacement for building — it's a replacement for staring at a blank file. The output is real Next.js code, not some proprietary format that needs translating.

What I'd Change

Analytics needs work. Vercel Analytics is fine for "is my site fast?" but I still need Google Analytics for anything client-facing. Conversion tracking, goal funnels, audience segments — none of that exists in Vercel's analytics yet.

Build minutes add up. With 13+ sites on a Pro plan, I watch build minutes carefully. ISR and on-demand revalidation help, but I've had months where a client's aggressive preview deployments ate through the budget.

Monorepo support is better but not painless. I tried consolidating client sites into a monorepo for shared components. Turborepo configuration was more overhead than just copying components between repos. For a solo operator, separate repos per client is simpler.

The AI Layer

The biggest shift in the last 6 months isn't Vercel itself — it's the AI tooling around it. My current stack:

v0 for component scaffolding
Claude Code for implementation and debugging
Codex CLI for multi-file refactors
PromptLens (my own tool) for analyzing how I actually use these AI tools

The combination of v0 → Claude Code → git push → live in 60 seconds is absurd. I built a complete site for a body work spa in one afternoon. Not a template — a custom Next.js site with booking integration, service pages, and mobile optimization.

The Honest Take

Vercel wins because it removes decisions. I don't think about hosting, SSL, CI/CD, CDN configuration, or deployment strategy. I think about the client's business and the code. Everything else is handled.

For a solo builder shipping to local businesses, that's the whole game.

Ryan Brubeck builds AI-powered web tools and ships client sites on Vercel. Find him on GitHub and DreamSiteBuilders.com.

I Analyzed 215 of My ChatGPT Conversations. Here's My "Usage DNA."

signalscout — Thu, 16 Apr 2026 05:50:40 +0000

I Analyzed 215 of My ChatGPT Conversations. Here's My "Usage DNA."

Everyone talks about prompt engineering. Nobody talks about prompt patterns — the habits you don't know you have.

The Setup

I exported my ChatGPT history and ran it through an analysis pipeline I built. Not a scraper — I used OpenAI's official data export, then wrote Python to cluster topics, classify intents, detect conversation loops, and fingerprint my prompting style.

Think of it as Spotify Wrapped, but for your AI usage.

Here's what 215 conversations, 695 messages, and 25,618 words revealed about how I actually use AI.

My Usage DNA

Metric	Value
Average prompt length	39.5 words
Median prompt length	23 words
Vocabulary richness	0.18 (4,610 unique / 25,618 total)
Avg conversation length	6.7 turns
Most active hour	12 AM ET (4 UTC)
Most active day	Monday
Sessions per week	43

The median (23 words) vs average (39.5) gap is telling. Most of my prompts are short commands. But when I go long, I go long — dragging the average up. I'm either firing off "fix this" or writing a paragraph of context. There's no middle.

43 sessions per week means I'm opening ChatGPT about 6 times a day. That's less than I expected. It feels like I live in the chat window, but apparently I batch my usage into focused sessions rather than constant drip queries.

How I Prompt: The Shape Distribution

Every prompt has a "shape" — a combination of length and structure:

Shape	%	What It Means
Medium instruction	38.1%	"Do X with Y constraints" — 16-50 words, directive
Short command	19.7%	≤15 words, imperative — "fix the build", "summarize this"
Long instruction	16.3%	50+ word specifications with context
Ultra short	8.2%	"yes", "continue", "try again"
Medium question	7.2%	Genuine information-seeking
Short question	5.2%	Quick lookups
Essay prompt	3.5%	Full context dumps
Code paste	1.2%	Pasting code for analysis

The insight: I'm 74% instruction, 12% question, 3.5% essay. I use AI as a tool operator, not a search engine. I already know what I want — I'm delegating execution, not seeking knowledge.

This maps directly to how power users differ from casual users. Casual users ask questions ("What is X?"). Power users give instructions ("Build X with these constraints"). The intent distribution confirms it:

Intent	Count	%
Question	202	29%
Instruction	79	11%
Brainstorm	46	7%
Debug	44	6%
Meta	27	4%
Creative	9	1%
Other	288	41%

6% of my prompts are debugging. That's a conversation with an AI about why the AI's previous output was wrong. The recursive irony isn't lost on me.

What I Talk About: 20 Topic Clusters

The topic clustering found 20 distinct domains across 215 conversations. The top 5:

Work/Management (20 convos, 146 msgs) — Boss dynamics, union questions, workplace strategy. Longest conversations by far — 7.3 msgs average.
Business/Finance (20 convos, 75 msgs) — Company analysis, bitcoin, investment reasoning. High breadth, lower depth.
People/Content (18 convos, 35 msgs) — Content strategy, audience analysis. Short, punchy sessions.
AI/Frontier Models (16 convos, 55 msgs) — Model comparisons, frontier capabilities, wild speculation.
Career/Resume (14 convos, 25 msgs) — Resume writing, job applications, OpenAI research.

The insight: My heaviest AI usage isn't coding. It's workplace strategy — navigating human dynamics with an AI advisor. The conversations about boss interactions are 2x longer than anything else. I'm using ChatGPT as a management consultant.

The Loop: Where I Got Stuck

The loop detector found one significant conversation loop — a pair of conversations 4 days apart about the same unresolved topic (similarity: 0.41):

"Gateway Password Recovery" (April 9)
"OpenClaw vs Paperclip" (April 13)

Both were about OpenClaw configuration. Same problem, two attempts, no resolution. The loop detector flagged it as repeated_question / unresolved.

Only 1 loop out of 215 conversations sounds good, but the real number is probably higher — the detector uses semantic similarity with a conservative threshold. What it caught was a verbatim repeat. The subtler loops — rephrasing the same question, approaching the same problem from different angles — need a more sophisticated model.

The insight: Conversation loops are a signal of tool failure. When you ask the same thing twice across separate sessions, either the AI failed to solve it or you failed to retain the solution. Either way, it's wasted tokens and wasted time.

What Companies Already Know (That You Don't)

Here's the uncomfortable part: every major AI provider already has this data about you. OpenAI, Anthropic, Google — they can see your prompt patterns, your topic clusters, your conversation loops, your usage DNA. They use it for model training, safety research, and product decisions.

You can't see any of it.

There's no "Prompt Analytics" tab in ChatGPT settings. No "Your Usage Report" email. No "You asked about Python debugging 47 times this month — here's a shortcut." The data exists. The insights are extractable. They just don't give them to you.

The argument for building this as a user-facing tool isn't technical — it's philosophical. You should have at least as much insight into your own AI usage as the companies hosting it.

What This Means for AI Tooling

If you're building AI products, here's what my data suggests:

Power users don't ask questions — they give instructions. Your UX should optimize for the imperative case, not the interrogative one. The chat input box is fine for questions. For instructions, you need structured input.
Conversation loops are a product bug. If your users are asking the same thing in multiple sessions, your memory/context system has failed. Track repeat queries.
Usage DNA is a feature. Show users their patterns. "You tend to write long prompts for coding tasks but short prompts for writing tasks — want to try being more specific on the writing side?" This is the AI equivalent of screen time reports, and it's equally valuable.
The heaviest usage isn't what you think. I expected my top category to be coding. It was workplace strategy. Product teams optimizing for the "developer use case" might be missing their actual power users.

How I Built This

The pipeline is straightforward:

Input: conversations.json from OpenAI's data export
Topic clustering: TF-IDF + keyword extraction, no ML models needed
Intent classification: Rule-based (prompt length + structural patterns)
Loop detection: Cosine similarity between conversation pairs
Shape analysis: Word count + punctuation patterns
Output: JSON reports + Markdown summary

No API calls. No cloud processing. Everything runs locally on a laptop in under 10 seconds for 215 conversations. The analysis is deterministic — same input, same output, every time.

The code is Python, ~500 lines total. No transformers, no embeddings, no GPU. Just TF-IDF and heuristics. The point isn't sophistication — it's that useful insights don't require expensive infrastructure.

Try It Yourself

Export your ChatGPT data (Settings → Data Controls → Export), then ask yourself:

What's your instruction-to-question ratio?
Which topic gets your longest conversations?
Where are you looping — asking the same thing twice?

You might be surprised. I was.

Open Source

The analysis pipeline is open source: PromptLens on GitHub

MIT licensed. ~500 lines of Python. No API keys needed.

Ryan builds AI analysis tools and agent infrastructure. Find him on GitHub and DreamSiteBuilders.com.

I Spent Two Days Debugging My Agent Stack. The Fix Was npm update.

signalscout — Thu, 16 Apr 2026 05:49:23 +0000

I Spent Two Days Debugging My Agent Stack. The Fix Was `npm update`.

A forensic investigation into how Codex CLI v0.50.0 quietly broke everything — and the 1,886 versions I skipped by not checking.

The Crime Scene

I run a multi-agent stack. OpenClaw orchestrates, Codex writes code, Gemini/Groq/DeepSeek handle the cheap inference, and the whole thing talks to itself through MCP (Model Context Protocol). It's either beautiful or terrifying depending on how you feel about autonomous systems. Most days, it works.

Last Tuesday, it stopped working.

Not dramatically — there was no stack trace, no segfault, no red alert. The kind of failure where you stare at logs for four hours before realizing the patient has been dead since morning. Codex sessions were silently dropping tool calls. MCP handshakes were timing out. The agent stack would spin up, do 40% of the work, then... nothing. No error. Just vibes.

I did what any reasonable person does: I blamed the LLM provider.

The Investigation

Here's the thing about debugging a system where five different AI models talk to each other through three protocol layers: everything is a suspect. My first 12 hours looked like this:

Hour 1-3: "It's definitely Groq's rate limits."
Nope. Switched to Gemini. Same behavior.

Hour 3-6: "MCP config must be wrong."
Rewrote my MCP server config. Twice. Compared against the docs character by character. Deployed. Same behavior.

Hour 6-9: "Maybe OpenClaw's routing is broken after the last update."
Filed two GitHub issues (#64085, #64086). Wrote detailed reproduction steps. Drew architecture diagrams. The maintainers were very polite about it.

Hour 9-11: "Let me check the Codex cache database."
Opened ~/.codex/logs_2.sqlite. Found 2,026 sessions. Scrolled through. Everything looked normal. The client_version field said 0.120.0. I nodded and moved on.

Hour 11: "Wait."

The Moment

I don't remember exactly what made me type it. Muscle memory, probably. Or divine intervention.

$ codex --version
0.50.0

I stared at the terminal for about ten seconds.

Then I stared at the cache database entry that said 0.120.0.

Then I ran:

$ which codex
/home/yin/.npm-global/bin/codex

$ ls -la $(which codex)
codex -> ../lib/node_modules/@openai/codex/bin/codex.js

$ npm list -g @openai/codex
└── @openai/codex@0.120.0

Huh. npm says 0.120.0. The binary says 0.50.0. The cache says 0.120.0. Three different answers from one tool.

What I had was a partially-updated installation where the npm package metadata had been updated but the actual binary was still running from a cached older version. The kind of bug you create by running npm install -g at 2 AM and not noticing the postinstall script failed.

The Autopsy: What 1,886 Versions Changed

I was curious. How far behind was I, really?

$ npm view @openai/codex versions --json | python3 -c "
import json, sys
versions = json.load(sys.stdin)
print(f'Total published versions: {len(versions)}')
"
Total published versions: 1886

One thousand, eight hundred, and eighty-six versions. Between my installed v0.50.0 and the current v0.120.0, OpenAI had shipped nearly two thousand releases. That's roughly 26 releases per day. The Codex team does not sleep.

The v0.50.0 lineage tells a story:

0.50.0-alpha.1 — the optimistic beginning
0.50.0-alpha.2 — "we found some issues"
0.50.0-alpha.3 — "we found more issues"
0.50.0 — "ship it, we'll fix it in 0.51"

And then they shipped 0.51. And 0.52. And kept going for eighteen hundred more releases while I sat on 0.50.0 like it was a vintage wine that would appreciate with age.

What Actually Broke

The root cause was MCP protocol compatibility. Between v0.50.0 and v0.120.0, the Codex CLI underwent significant architectural changes:

Typed code-mode tool declarations. v0.120.0 introduced proper TypeScript-style type declarations for tool calls. v0.50.0 was sending untyped tool schemas. Modern MCP servers (including the ones OpenClaw spins up) expected typed declarations and silently dropped the untyped ones.
Core crate extractions. The Codex team extracted core functionality into separate Rust crates. This changed the internal message format in subtle ways that only manifested when Codex talked to external MCP servers (as opposed to its built-in tools).
MCP cleanup fixes. There were literal bug fixes for MCP session management — connection pooling, timeout handling, retry logic. My v0.50.0 was using MCP patterns that had known bugs which were fixed a thousand versions ago.
Richer MCP app support. The newer version supports MCP apps as first-class citizens. My v0.50.0 was treating MCP connections as second-class tool providers, which meant every agent handoff was going through a compatibility shim that occasionally lost messages.

The beautiful irony: my config.toml was perfectly configured.

model = "gpt-5.4"
reasoning_effort = "medium"  
personality = "pragmatic"

[plugins]
gmail = "openai-curated"
github = "openai-curated"

The model migrations from gpt-5 → gpt-5.3-codex → gpt-5.4 were all properly specified. The config was fine. The binary executing that config was from a different geological era.

The Fix

$ npm install -g @openai/codex@latest
$ codex --version
0.120.0

Two seconds. Two seconds to fix what took me two days to diagnose.

The agent stack came back online immediately. MCP handshakes completed. Tool calls went through. Sessions that had been failing at 40% completion started running to 100%. The 2,026 sessions in ~/.codex/sessions/ started growing again.

Timeline of Discovery

Time	Activity	Usefulness
Hour 0-3	Blame Groq	0%
Hour 3-6	Rewrite MCP config	0%
Hour 6-9	File GitHub issues against OpenClaw	0% (but they were well-written)
Hour 9-11	Forensic analysis of SQLite cache	5% (found the version discrepancy clue)
Hour 11	`codex --version`	100%
Hour 11 + 2 sec	`npm install -g @openai/codex@latest`	∞%

Total debugging time: ~24 hours.
Total fix time: 2 seconds.
Ratio: 43,200:1.

Lessons Learned

1. Check the version first. Always.

Before you blame the cloud, blame the config, blame the provider, blame Mercury retrograde — run --version. I know this. I've told junior devs this. I've written it on whiteboards. And I still spent 24 hours not doing it.

2. npm global installs are haunted.

The failure mode here was a partial update: npm's package metadata updated, but the binary didn't get replaced. This is a known class of npm bugs that's existed for a decade. If you run a global npm tool in production (or production-adjacent) workflows, pin it with a version manager or at least verify the binary version matches npm list -g.

3. MCP compatibility is version-sensitive.

MCP is still young. The protocol is evolving fast. Unlike HTTP, where a server from 2015 can talk to a client from 2025, MCP servers and clients need to be within a reasonable version range of each other. When your MCP client is 1,886 versions behind, "reasonable" left the building months ago.

4. Multi-agent stacks amplify version debt.

In a monolith, a stale dependency usually manifests as a clear error. In a multi-agent stack where five services talk through protocol bridges, a stale dependency manifests as mysterious partial failures with no error messages. The debugging surface area is multiplicative.

5. The cache lies.

My SQLite cache said client_version: 0.120.0 because it had been written by a different invocation of Codex (probably through OpenClaw's process spawning, which had its own newer copy). The lesson: cache metadata reflects the last writer, not the current runtime. Always verify at the binary level.

The Broader Point

We're in the era of agent stacks — systems where multiple AI-powered tools coordinate through shared protocols. These stacks are powerful but they have a failure mode that traditional software doesn't: silent degradation. When your REST API client is outdated, you get a 400 error. When your MCP client is outdated, you get a successful handshake that quietly drops half the capabilities.

The tooling will catch up. Version compatibility matrices, protocol negotiation, graceful degradation warnings — it's all coming. But right now, in April 2026, the state of the art is a developer staring at their terminal at 2 AM, typing --version for the thing they should have checked twelve hours ago.

My agent stack is humming now. All 2,026 sessions are flowing. Codex and OpenClaw are best friends again. MCP connections are solid.

And I've added a cron job:

0 9 * * 1 codex --version | mail -s "codex version check" me@example.com

Because I will forget again.

Ryan builds AI agent infrastructure at DreamSiteBuilders.com. He can be found on GitHub shipping tools that solve problems he created for himself.