<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Scot Campbell</title>
    <description>The latest articles on Forem by Scot Campbell (@simplemindedrobot).</description>
    <link>https://forem.com/simplemindedrobot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2106283%2Fcc50fd51-99e8-43b3-98e8-3115e8766f5c.png</url>
      <title>Forem: Scot Campbell</title>
      <link>https://forem.com/simplemindedrobot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/simplemindedrobot"/>
    <language>en</language>
    <item>
      <title>How I Added an Agentic Heartbeat to Claude Code Using Built-in Tools</title>
      <dc:creator>Scot Campbell</dc:creator>
      <pubDate>Tue, 07 Apr 2026 19:48:20 +0000</pubDate>
      <link>https://forem.com/simplemindedrobot/giving-claude-code-a-heart-beat-55ja</link>
      <guid>https://forem.com/simplemindedrobot/giving-claude-code-a-heart-beat-55ja</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gsoubh5rtdxsyny535t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gsoubh5rtdxsyny535t.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An agent heartbeat, stripped of the branding, is three things: a timer, a state file, and a decision step. That's it. Something wakes up on a schedule, reads a document describing what the world looks like, and decides whether to act. If you squint past the WebSocket plumbing and the always-on process, that's what OpenClaw's heartbeat daemon is doing.&lt;/p&gt;

&lt;p&gt;I want to be careful here. A lot of people built real work on OpenClaw. Anthropic cutting off OAuth token use for OpenClaw on April 4 didn't just remove a product; it stranded workflows, broke muscle memory, and scattered a community that had been coalescing around a specific mental model of how agents should run. This isn't a post about how that was inevitable or deserved. It's a post for the narrower audience of OpenClaw refugees who came for the capability and are now trying to figure out where the capability went.&lt;/p&gt;

&lt;p&gt;The short answer: the capability is sitting in Claude Code. It has been for a while. You don't need a daemon to get it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenClaw needed a daemon in the first place
&lt;/h2&gt;

&lt;p&gt;OpenClaw's heartbeat was a long-lived process because the rest of its architecture required one. It held auth in memory, listened on a WebSocket for remote triggers, maintained warm context between cycles, and coordinated a runtime that assumed continuous presence. Given those constraints, a daemon wasn't a choice — it was the only shape the thing could take.&lt;/p&gt;

&lt;p&gt;Claude Code's architecture doesn't carry those constraints. Auth lives in your subscription, not in a process. Triggers arrive through primitives that already exist. Context is something you assemble per invocation rather than something you hold. The daemon shape is doing work that, in this environment, nothing is asking for.&lt;/p&gt;

&lt;p&gt;Which means you can replace it with a cron entry.&lt;/p&gt;

&lt;h2&gt;
  
  
  The only cron you need
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nc"&gt;CronCreate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*/5 * * * *&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/heartbeat-pulse&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;doer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole scheduled surface. One cron entry, firing every five minutes, runs a small haiku-tier skill. On my system, that's a "doer" agent running the skill. The skill's job is not to decide anything interesting. It runs a shell script, reads its exit code, and either stays silent or fires a &lt;code&gt;RemoteTrigger&lt;/code&gt; to wake the layer that actually decides things.&lt;/p&gt;

&lt;p&gt;The shell script — &lt;code&gt;pulse.sh&lt;/code&gt; — is where the "is there work?" question gets answered. It walks a directory of small predicate scripts (&lt;code&gt;gh issue list&lt;/code&gt; for a GitHub queue, &lt;code&gt;find&lt;/code&gt; for an inbox folder, &lt;code&gt;stat&lt;/code&gt; for file mtimes, whatever you care about), runs each one, and exits 0 if everything was quiet or 1 if any predicate found something worth looking at. Pure bash. No model. No inference. Zero tokens for the "is there anything happening" check, which is the check that runs every five minutes forever.&lt;/p&gt;
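&lt;p&gt;Here's a minimal sketch of what that walk could look like. It's written as a bash function so it's easy to exercise; in practice it would be the body of a standalone &lt;code&gt;pulse.sh&lt;/code&gt;, and the predicate directory path is just an assumption:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch of pulse.sh, written as a function for easy testing.
# The predicate directory path is an assumption; any directory works.
pulse() {
  local dir="${1:-$HOME/heartbeat/predicates}" found=0 p
  for p in "$dir"/*.sh; do
    [ -x "$p" ] || continue              # skip non-executables (and empty globs)
    "$p" >/dev/null 2>/dev/null || found=1  # a predicate exits 1 when it found work
  done
  return "$found"                        # 0 = all quiet, 1 = something fired
}
```

Any single noisy predicate is enough to wake the next layer; adding a check is just dropping another executable into the directory.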

&lt;p&gt;The state file — call it &lt;code&gt;HEARTBEAT.md&lt;/code&gt;, put it in whatever repo you want — is the contract for actual work items. It's the queue, the status tracker, and the audit log in one document. Because it's a file, it's human-editable. Because it's in git, it's versioned. When something goes sideways at 3am, you can read the diff and see what the heartbeat saw.&lt;/p&gt;
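&lt;p&gt;The exact schema is whatever you and your skills agree on. A hypothetical shape, just to make the idea concrete:&lt;/p&gt;

```markdown
## Ready
- [ ] gh-1423: triage new `type:queue` issue (agent: doer)

## In Progress
- [~] inbox/report.pdf: summarize and file (started 2026-04-07 14:02Z)

## Done
- [x] gh-1398: dispatched, PR merged
```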

&lt;h2&gt;
  
  
  Why the cold-start version is structurally better
&lt;/h2&gt;

&lt;p&gt;The instinct is to treat "always-on" as a feature. It isn't, in this context. It's a liability the old architecture had to pay for.&lt;/p&gt;

&lt;p&gt;Cold-start means the prompt cache does what it's designed to do: cache the stable prefix of your invocation and charge you cheaply for the variable tail. A warm loop that reuses context across cycles defeats this — every "tick" drags accumulated state into the next call, and the cache never gets a clean prefix to hold onto. The cold-start version is cheaper, not despite the restart, but because of it.&lt;/p&gt;

&lt;p&gt;Auth stays in-process, scoped to the invocation that needs it. There's no long-lived credential sitting in a daemon's memory waiting to be exfiltrated by whatever CVE shows up in a transitive dependency next month. The attack surface is the cron entry and the file. That's a surface you can reason about over coffee.&lt;/p&gt;

&lt;p&gt;And the property that made OpenClaw economically awkward — warm context burning subscription tokens while the user wasn't looking — is exactly the property the cron version doesn't have. When nothing is ready, the heartbeat reads the file, sees nothing to do, and exits. The cost of an idle cycle is a single small prompt. The cost of a busy cycle is whatever the busy work was going to cost anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stacked escalation, not parallel scheduling
&lt;/h2&gt;

&lt;p&gt;The version that felt natural when I first sketched this was three crons: a fast pulse every five minutes on a cheap model, a slow-think every thirty minutes on a strong model, a reflective daily on something in the middle. Three schedules, three layers, three independent timers. It's a clean diagram, but why add the complexity?&lt;/p&gt;

&lt;p&gt;The mistake is that the fast pulse layer, as I first drew it, asked a model to answer "is there anything worth waking up for?" A model is the wrong tool for that question. Given how Claude Code's internal cron works, I still need a model to run the shell script that answers it, but an inference call that merely runs a shell script is cheap compared to asking a model to make the decision itself. You could drop inference from this step entirely with n8n or native cron; I just wanted to build the whole thing inside Claude Code.&lt;/p&gt;

&lt;p&gt;The corrected architecture is one scheduled timer and three demand-driven layers. Everything above the timer runs only when the timer's shell substrate found evidence of work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   cron fires (every 5 min)
         │
         ▼
  Layer 0: heartbeat-pulse  (haiku router, ~0 decisions)
         │
         ▼
   runs pulse.sh via the Bash tool
         │
         ▼
   pulse.sh walks predicates/*.sh
   (pure shell, zero inference, exits 0 or 1)
         │
         ├── exit 0 ──▶ silent. tick done.
         │
         └── exit 1 ──▶ RemoteTrigger("heartbeat-triage")  (async)
                          │
                          ▼
                  Layer 1: heartbeat-triage  (haiku classifier)
                          │
                          ├── handle inline  ──▶ done
                          ├── ignore         ──▶ done
                          └── escalate:
                                append item to HEARTBEAT.md
                                RemoteTrigger("heartbeat-tick")
                                     │
                                     ▼
                             Layer 2: heartbeat-tick  (opus, real decisions)
                                     │
                                     ▼
                              reads HEARTBEAT.md, dispatches via Agent,
                              updates state, exits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even a daily reflective layer can be woven into this pattern: a morning pass that reviews what the last twenty-four hours of heartbeat activity produced, surfaces anything stuck, and prunes archived items is just another task to be done.&lt;/p&gt;

&lt;p&gt;At idle, the whole system consumes one tiny cached haiku call per pulse tick plus a handful of &lt;code&gt;gh&lt;/code&gt; and &lt;code&gt;find&lt;/code&gt; and &lt;code&gt;stat&lt;/code&gt; commands run from the shell script. When the predicates stay silent, the cost of the heartbeat is whatever Claude Code charges you for a cached one-line haiku invocation. When the predicates fire, cost scales with the actual work, not with the clock. This is the property OpenClaw violated — a warm loop that kept context hot and burned tokens while waiting for something to do — and it's the property the stacked version restores by cold-starting every tier at every boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The predicates are the point
&lt;/h2&gt;

&lt;p&gt;The predicates are small shell scripts that each answer one question. A &lt;code&gt;github-queue.sh&lt;/code&gt; that runs &lt;code&gt;gh issue list --label type:queue,status:queued --state open --json number --jq length&lt;/code&gt; and exits 1 if the count is greater than zero. An &lt;code&gt;inbox-files.sh&lt;/code&gt; that runs &lt;code&gt;find ~/heartbeat/inbox -type f -print -quit&lt;/code&gt; and exits 1 if anything comes back. A &lt;code&gt;heartbeat-mtime.sh&lt;/code&gt; that checks whether &lt;code&gt;HEARTBEAT.md&lt;/code&gt; has been touched since the last pulse and exits 1 if it has. None of them calls a model. All of them run in milliseconds. Adding a new one is dropping an executable shell script into a directory.&lt;/p&gt;
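&lt;p&gt;For illustration, here's what the inbox predicate could look like, written as a function so it's easy to test against a temporary directory. In practice it's a standalone script in &lt;code&gt;predicates/&lt;/code&gt;, and the inbox path is just my example:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch of inbox-files.sh as a testable function. The path is an example.
# The GitHub-queue predicate would follow the same shape, with
# `gh issue list ... --json number --jq length` in place of find.
inbox_files() {
  local hit
  hit=$(find "${1:-$HOME/heartbeat/inbox}" -type f -print -quit 2>/dev/null)
  [ -z "$hit" ]   # exit 0 when the inbox is empty, 1 when something is waiting
}
```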

&lt;p&gt;This is the piece that makes the whole architecture sustainable: the part of the system that runs every five minutes forever is the part that has zero marginal cost, and the part that costs something runs only when the free part has already decided there's a reason to spend. If you're thinking about your own version of this, the predicate directory is where you should be spending your time. Everything else is plumbing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Triage as the translator between layers
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;pulse.sh&lt;/code&gt; exits 1, the Layer 0 skill fires &lt;code&gt;RemoteTrigger("heartbeat-triage")&lt;/code&gt;, which is async — it returns immediately, triage runs independently, the pulse tick exits without waiting. Triage is a classifier, still on haiku, and it handles three cases.&lt;/p&gt;

&lt;p&gt;The first is "I can resolve this in a handful of tool calls." A duplicate GitHub issue that's already tracked as in-progress. A stale inbox file matching something already marked &lt;code&gt;done&lt;/code&gt;. A webhook pong that just needs acknowledgment. Triage handles these inline and exits. No escalation needed.&lt;/p&gt;

&lt;p&gt;The second is "this predicate fired on something that shouldn't count." Known-stale data, test artifacts, an mtime change triage caused itself. Log the ignore reason so the reflective layer can notice if ignore rates spike — that's the signal a predicate has gotten too noisy — and exit.&lt;/p&gt;

&lt;p&gt;The third is "this is real work that needs the slow-think layer." Triage writes the work to &lt;code&gt;HEARTBEAT.md&lt;/code&gt; as a new ready item with the appropriate agent and prompt, then fires &lt;code&gt;RemoteTrigger("heartbeat-tick")&lt;/code&gt; to wake Layer 2. Triage does not dispatch agents itself — it's a ticket writer, not a worker.&lt;/p&gt;

&lt;p&gt;Layer 2, the tick, is where the opus-tier decisions live. It reads &lt;code&gt;HEARTBEAT.md&lt;/code&gt;, evaluates the ready items, dispatches them via &lt;code&gt;Agent&lt;/code&gt;, updates state, exits. This is the layer that corresponds to OpenClaw's actual ReAct loop — the place where a strong model is reasoning about what to do. In the stacked version, it runs only when two cheaper layers have already decided it's warranted.&lt;/p&gt;

&lt;h2&gt;
  
  
  The event axis
&lt;/h2&gt;

&lt;p&gt;Cron gives you time. It doesn't give you events. For events, there's RemoteTrigger — a way for external systems to wake a Claude Code invocation on demand. A GitHub webhook, a form submission, an email parser, whatever. Same pattern: the trigger fires, a cold-start invocation reads the state file, decides, acts.&lt;/p&gt;

&lt;p&gt;Wiring them together is the whole move. Cron handles "wake up periodically." RemoteTrigger handles "wake up when something happens." Together they cover everything a heartbeat daemon was covering, without anything needing to stay resident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making failure visible
&lt;/h2&gt;

&lt;p&gt;A heartbeat that fails silently is worse than no heartbeat, because you build trust in it and then the trust is wrong. SendMessage is the glue that fixes this. At the end of each cycle — or specifically at the end of the reflective daily cycle — have the invocation send itself (or you) a short status ping. "Three items dispatched, one stuck, queue depth four." If the ping stops arriving, you know the heartbeat stopped beating. If the ping arrives but says something alarming, you know before the alarming thing compounds.&lt;/p&gt;

&lt;p&gt;This is the part OpenClaw users often don't replicate because the daemon model made it feel unnecessary — the daemon was right there, you could see its logs. In the cold-start model, the invocations are ephemeral, so observability has to be something the invocations produce on purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mapping
&lt;/h2&gt;

&lt;p&gt;For the people doing the actual port, here's the whole thing as a table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;OpenClaw concept&lt;/th&gt;
&lt;th&gt;Claude Code equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Heartbeat state file&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;HEARTBEAT.md&lt;/code&gt; in your repo, git-versioned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timer&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CronCreate&lt;/code&gt; firing a minimal Layer 0 skill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-waking daemon loop&lt;/td&gt;
&lt;td&gt;Cold-start cron invocation, no daemon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Is there work?" check&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pulse.sh&lt;/code&gt; + &lt;code&gt;predicates/*.sh&lt;/code&gt; — pure shell, zero inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Work triage / classification&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;heartbeat-triage&lt;/code&gt; skill (haiku), demand-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ReAct decision step&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;heartbeat-tick&lt;/code&gt; skill (opus), demand-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent dispatch&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Agent&lt;/code&gt; tool, called from Layer 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-step messaging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SendMessage&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill invocation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Skill&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External event triggers&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;RemoteTrigger&lt;/code&gt; (same primitive the layers use internally)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent coordination&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TeamCreate&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worktree isolation&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EnterWorktree&lt;/code&gt; / &lt;code&gt;ExitWorktree&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status review / reflection&lt;/td&gt;
&lt;td&gt;Daily reflective cron, separate from the hot path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rows that matter most are the ones about the "is there work?" check and the triage layer. Those are the architectural distinctions between a heartbeat that burns subscription tokens while idle and one that doesn't. OpenClaw's heartbeat answered the wake-up question with a warm model; the stacked version answers it with shell scripts and lets the models run only when the shell has already decided there's a reason. Everything else on the list maps to primitives that ship in the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this doesn't replace
&lt;/h2&gt;

&lt;p&gt;I want to end honestly. The capability is portable. The rest of what OpenClaw was — the community, the brand, the marketplace of shared skills, the shared vocabulary that let people talk to each other about their setups without explaining the basics every time — is not portable. You can't cron your way to a community. A mapping table can't hold a mental model that thousands of people were building together.&lt;/p&gt;

&lt;p&gt;For users who came to OpenClaw because they wanted to belong to something, this post doesn't help. I don't have a good answer for them and I'm not going to pretend I do.&lt;/p&gt;

&lt;p&gt;But for users who came to OpenClaw because they wanted a heartbeat — a thing that wakes up, checks the world, and decides — the heartbeat is already in the tooling you have. It looks like a cron entry and a markdown file. It's less impressive than a daemon. That's the point.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>openclaw</category>
      <category>agents</category>
    </item>
    <item>
      <title>I Deleted All My MCP Servers and Everything Got Faster</title>
      <dc:creator>Scot Campbell</dc:creator>
      <pubDate>Fri, 27 Feb 2026 02:55:02 +0000</pubDate>
      <link>https://forem.com/simplemindedrobot/i-deleted-all-my-mcp-servers-and-everything-got-faster-3k8l</link>
      <guid>https://forem.com/simplemindedrobot/i-deleted-all-my-mcp-servers-and-everything-got-faster-3k8l</guid>
      <description>&lt;p&gt;In my last post I described a PKM system held together by Ansible, git, and a small fleet of MCP servers. Eight of them, to be precise. ArXiv, Semantic Scholar, Google Workspace, Obsidian, Thoughtbox, QMD, Markitdown, Mermaid. Each one a stdio process that Claude Code spawns, connects to, and keeps alive for the duration of the session.&lt;/p&gt;

&lt;p&gt;If you're running MCP servers with Claude Code, you already know there's token overhead. Tool definitions aren't free. But there's a detail about &lt;em&gt;when&lt;/em&gt; that cost kicks in that changed how I think about the whole architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The carry problem
&lt;/h2&gt;

&lt;p&gt;Claude Code doesn't load every MCP tool definition at session start. It's smarter than that. Tools get loaded lazily — the first time you actually use a server, its definitions enter the context. So far, so reasonable.&lt;/p&gt;

&lt;p&gt;Here's the part that bit me: once loaded, those definitions stay in context for &lt;em&gt;every subsequent turn&lt;/em&gt;. They're carried forward, request after request, until compaction finally clears them out. Use your Semantic Scholar MCP once on turn 3 to look up a paper? Its 33 tool definitions — names, descriptions, full JSON parameter schemas — ride along on turns 4, 5, 6, 7... all the way until the context gets compacted.&lt;/p&gt;

&lt;p&gt;It's not a startup cost. It's a &lt;em&gt;carry&lt;/em&gt; cost. And it compounds.&lt;/p&gt;

&lt;p&gt;I got curious about the actual weight. I cracked open GoodLemur's API logs (raw request/response JSONL I save for research) and measured what each server adds to every request once loaded:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Tokens carried per turn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;google-workspace&lt;/td&gt;
&lt;td&gt;142&lt;/td&gt;
&lt;td&gt;~37,392&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;semantic-scholar&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;~10,776&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;thoughtbox&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;~1,555&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arxiv-mcp-server&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;~1,185&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;markitdown&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;~76&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Google Workspace was the worst offender. 142 tools. ~37,392 tokens. Check your calendar once on turn 5, and those definitions occupy context on every turn until compaction.&lt;/p&gt;

&lt;p&gt;Now here's the thing: prompt caching means you're not paying full input price for those repeated definitions. The API caches them server-side after the first send. So the &lt;em&gt;dollar&lt;/em&gt; cost is manageable. But caching doesn't shrink the context window. Those ~37,392 tokens still &lt;em&gt;occupy space&lt;/em&gt;. They count against your 200k limit just like your actual conversation does.&lt;/p&gt;

&lt;p&gt;Across all my servers once loaded: ~51,000 tokens per turn sitting in context. That's 25% of a 200k window that isn't available for conversation history, code, or reasoning. Compaction triggers sooner. You lose conversational context faster. Your effective session depth shrinks — not because you ran out of things to say, but because tool definitions you used once are squatting on the space.&lt;/p&gt;

&lt;p&gt;There's a second angle that makes this worse. The API has a &lt;a href="https://platform.claude.com/docs/en/build-with-claude/context-editing" rel="noopener noreferrer"&gt;context editing&lt;/a&gt; strategy called &lt;code&gt;clear_tool_uses&lt;/code&gt; that clears old tool &lt;em&gt;results&lt;/em&gt; from past turns — file contents you've already read, search results you've already processed. It's designed to free up exactly this kind of accumulated weight. And it works great for CLI tools. When Claude calls &lt;code&gt;s2-search&lt;/code&gt; via Bash, that Bash output is a tool result. On later turns, &lt;code&gt;clear_tool_uses&lt;/code&gt; can sweep it away.&lt;/p&gt;

&lt;p&gt;But MCP tool &lt;em&gt;definitions&lt;/em&gt; aren't tool results. They're structural. They live in the &lt;code&gt;tools&lt;/code&gt; parameter, sent with every API request, immune to context editing. No clearing strategy can touch them. So MCP tools get the worst of both: their definitions persist forever (until compaction), and they don't benefit from the system designed to manage exactly this problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea
&lt;/h2&gt;

&lt;p&gt;What if the tools were just CLI commands?&lt;/p&gt;

&lt;p&gt;An MCP server is a process that speaks JSON-RPC over stdio. Claude Code launches it, negotiates capabilities, and loads each tool definition into context when you first use it — where it stays until compaction. But if Claude already knows a CLI exists — because you told it in CLAUDE.md — it can just call it with Bash. No process to manage. No tool definitions accumulating in context. No carry cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (MCP):&lt;/strong&gt; Use Semantic Scholar once → 33 tool definitions load → carried every turn until compaction → immune to &lt;code&gt;clear_tool_uses&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After (CLI):&lt;/strong&gt; CLAUDE.md says &lt;code&gt;s2-search&lt;/code&gt; exists → Claude calls it via Bash → result clearable on later turns → no definitions persist.&lt;/p&gt;

&lt;p&gt;Same capability. The difference is what happens on the turns where you're &lt;em&gt;not&lt;/em&gt; using that tool.&lt;/p&gt;
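&lt;p&gt;What "telling it in CLAUDE.md" looks like is nothing fancy: a few lines of prose per tool. A hypothetical stanza, with illustrative names and paths:&lt;/p&gt;

```markdown
## Research CLIs
- `s2-search SUBCOMMAND ...`: Semantic Scholar (search, get, refs, cites, ...)
- `arxiv-search QUERY`: ArXiv keyword search
Both live in ~/bin; run with `--help` for flags. Prefer these over MCP tools.
```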

&lt;h2&gt;
  
  
  The migration
&lt;/h2&gt;

&lt;p&gt;I ended up with three approaches depending on the server:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Write thin CLI wrappers (arxiv-search, s2-search)
&lt;/h3&gt;

&lt;p&gt;For ArXiv and Semantic Scholar, I wrote Python scripts that hit the APIs directly. No libraries for S2 — the &lt;code&gt;semanticscholar&lt;/code&gt; Python package turned out to be unusably slow without an API key. Its retry logic would block for 30+ seconds. A raw &lt;code&gt;urllib.request&lt;/code&gt; call returns in milliseconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# s2-search — direct API, no library
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BASE_GRAPH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;urlencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;13 subcommands: &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;bulk&lt;/code&gt;, &lt;code&gt;match&lt;/code&gt;, &lt;code&gt;snippets&lt;/code&gt;, &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;refs&lt;/code&gt;, &lt;code&gt;cites&lt;/code&gt;, &lt;code&gt;batch&lt;/code&gt;, &lt;code&gt;recommend&lt;/code&gt;, &lt;code&gt;recommend-multi&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, &lt;code&gt;author-search&lt;/code&gt;, &lt;code&gt;autocomplete&lt;/code&gt;. All from ~250 lines of Python.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;arxiv-search&lt;/code&gt; wrapper was even simpler — just &lt;code&gt;arxiv.py&lt;/code&gt; with a CLI face. One gotcha: argparse subcommands collide with positional search queries (your query gets interpreted as a subcommand name). I rewrote it to parse &lt;code&gt;sys.argv&lt;/code&gt; manually. Ugly but bulletproof.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Compile MCP servers to static binaries (thoughtbox, obsidian, google-workspace)
&lt;/h3&gt;

&lt;p&gt;This is the weird one. &lt;a href="https://github.com/thellimist/clihub" rel="noopener noreferrer"&gt;clihub&lt;/a&gt; takes a running MCP server — stdio or HTTP — and generates a standalone Go binary with a subcommand for every tool. No Node runtime needed at execution time. The compiled binary just works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Point clihub at a running MCP server, get a CLI&lt;/span&gt;
clihub generate &lt;span class="nt"&gt;--name&lt;/span&gt; thoughtbox &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transport&lt;/span&gt; stdio &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt; &lt;span class="s2"&gt;"npx -y @kastalien-research/thoughtbox"&lt;/span&gt;

&lt;span class="c"&gt;# Result: static binary with 3 subcommands&lt;/span&gt;
thoughtbox mental-models &lt;span class="nt"&gt;--operation&lt;/span&gt; list_models
thoughtbox thoughtbox &lt;span class="nt"&gt;--raw&lt;/span&gt; &lt;span class="s1"&gt;'{"thought":"...", "thoughtNumber":1, "totalThoughts":5}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Google Workspace, this turned 142 MCP tools into 142 CLI subcommands. One binary. No Node, no Python, no uvx at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before: MCP server with 142 tool definitions eating ~37,392 tokens/turn&lt;/span&gt;
&lt;span class="c"&gt;# After: one line in CLAUDE.md&lt;/span&gt;
gw search-gmail-messages &lt;span class="nt"&gt;--user_google_email&lt;/span&gt; sr4001@gmail.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"from:someone subject:thing"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hit two problems with clihub along the way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLS certificates.&lt;/strong&gt; The Obsidian MCP server runs on localhost with a self-signed cert. Go's TLS stack doesn't honor &lt;code&gt;NODE_TLS_REJECT_UNAUTHORIZED&lt;/code&gt;. Solution: use the HTTP endpoint on port 3001 instead of HTTPS on 3443.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codegen collision.&lt;/strong&gt; When an MCP tool has a parameter named &lt;code&gt;raw&lt;/code&gt;, it collides with clihub's own &lt;code&gt;--raw&lt;/code&gt; flag. The generated Go code has two variables both trying to be &lt;code&gt;flagRaw&lt;/code&gt;. I patched it with a sed script and &lt;a href="https://github.com/thellimist/clihub/issues/8" rel="noopener noreferrer"&gt;filed an issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For servers that need credentials, a wrapper script handles it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/sh&lt;/span&gt;
&lt;span class="c"&gt;# /opt/homebrew/bin/gw (wrapper)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_OAUTH_CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_OAUTH_CLIENT_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;USER_GOOGLE_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sr4001@gmail.com"&lt;/span&gt;
&lt;span class="nb"&gt;exec&lt;/span&gt; /opt/homebrew/bin/gw-bin &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Already had a CLI (markitdown, qmd, mmdr)
&lt;/h3&gt;

&lt;p&gt;Some tools already existed as CLIs. &lt;code&gt;markitdown&lt;/code&gt; is a pip package. &lt;code&gt;qmd&lt;/code&gt; has both MCP and CLI interfaces (CLI is faster). &lt;code&gt;mmdr&lt;/code&gt; is a Rust binary for Mermaid rendering. For these, I just killed the MCP server and added a one-liner to CLAUDE.md.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teaching Claude about the tools
&lt;/h2&gt;

&lt;p&gt;This is the part that surprised me with how simple it was. CLAUDE.md is a file loaded into every session's context. Adding a CLI tool means adding a few lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## CLI Tools&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**s2-search**&lt;/span&gt;: Semantic Scholar CLI. Hits the API directly.
&lt;span class="p"&gt;  -&lt;/span&gt; &lt;span class="sb"&gt;`s2-search "query"`&lt;/span&gt; — search papers
&lt;span class="p"&gt;  -&lt;/span&gt; &lt;span class="sb"&gt;`s2-search get &amp;lt;id&amp;gt;`&lt;/span&gt; — paper details
&lt;span class="p"&gt;  -&lt;/span&gt; &lt;span class="sb"&gt;`s2-search refs &amp;lt;id&amp;gt;`&lt;/span&gt; / &lt;span class="sb"&gt;`s2-search cites &amp;lt;id&amp;gt;`&lt;/span&gt; — references and citations
&lt;span class="p"&gt;  -&lt;/span&gt; &lt;span class="sb"&gt;`--json`&lt;/span&gt; for structured output. &lt;span class="sb"&gt;`-n`&lt;/span&gt; controls result count.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Claude reads the description, knows the command exists, and uses Bash to call it. The CLAUDE.md entry for all eight tools is maybe 80 lines total. Compare that to ~51,000 tokens of MCP tool definitions occupying context every turn.&lt;/p&gt;

&lt;p&gt;I also updated two skills — &lt;code&gt;/recall&lt;/code&gt; switched from &lt;code&gt;mcp__qmd__search&lt;/code&gt; to &lt;code&gt;qmd search&lt;/code&gt;, and &lt;code&gt;/think&lt;/code&gt; switched from MCP thoughtbox tools to &lt;code&gt;thoughtbox mental-models&lt;/code&gt;. Same behavior, lighter context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning up
&lt;/h2&gt;

&lt;p&gt;With the CLIs in place, I ripped out every MCP server declaration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# One script to clear ~/.claude.json
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expanduser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;~/.claude.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mcpServers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;truncate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then updated Ansible so the next playbook run doesn't re-add them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# group_vars/mac.yml — before&lt;/span&gt;
&lt;span class="na"&gt;mcp_servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;arxiv-mcp-server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arxiv-mcp-server"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;semantic-scholar&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uvx&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic-scholar-mcp"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;google-workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uvx&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workspace-mcp"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--single-user"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# ... 5 more&lt;/span&gt;

&lt;span class="c1"&gt;# group_vars/mac.yml — after&lt;/span&gt;
&lt;span class="na"&gt;mcp_servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One &lt;code&gt;ansible-playbook site.yml --limit vps&lt;/code&gt; applied the same change to the VPS. The Python CLI scripts (&lt;code&gt;arxiv-search&lt;/code&gt;, &lt;code&gt;s2-search&lt;/code&gt;) got &lt;code&gt;scp&lt;/code&gt;'d over and pointed at a venv.&lt;/p&gt;

&lt;p&gt;I also found a BasicMemory MCP ghost — a &lt;code&gt;UserPromptSubmit&lt;/code&gt; hook echoing MCP tool instructions into every session even though the server was long gone. Removed that too. Archaeology.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP servers&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens occupying context per turn (all loaded)&lt;/td&gt;
&lt;td&gt;~51,000&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window lost to tool definitions&lt;/td&gt;
&lt;td&gt;~25%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compaction triggered&lt;/td&gt;
&lt;td&gt;Sooner&lt;/td&gt;
&lt;td&gt;Later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md lines for equivalent capability&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~80&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The capability is identical. Same searches. Same graph traversal. Same Gmail queries. The difference is how much of the context window is actually available for the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MCP is great for discovery, bad for production.&lt;/strong&gt; When you're experimenting with a new tool, MCP is perfect. Install it, see the tools appear, try them out. But once you know what you need, the protocol overhead isn't paying for itself anymore. The tool definitions are documentation Claude doesn't need if you've already told it what exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.md is underrated.&lt;/strong&gt; A few lines of natural language in a config file replaced thousands of tokens of JSON schema. Claude doesn't need a formal tool definition to use &lt;code&gt;s2-search get &amp;lt;id&amp;gt;&lt;/code&gt;. It just needs to know the command exists and what the flags are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;clihub is a cheat code.&lt;/strong&gt; Compiling a 142-tool MCP server into a static Go binary felt like it shouldn't work. It did. The generated code is readable, the binaries are fast, and the only bug I hit (the &lt;code&gt;raw&lt;/code&gt; flag collision) was minor and patchable. For anyone running MCP servers in production, this tool is worth knowing about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure before optimizing.&lt;/strong&gt; I wouldn't have done any of this if I hadn't looked at the API logs. The carry cost was invisible until I counted it. If you're running multiple MCP servers, check your actual per-turn payload size after a few tool uses. You might be surprised at what's riding along.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best abstraction is no abstraction.&lt;/strong&gt; An MCP server is an abstraction over a CLI command (or an API call). Sometimes the abstraction adds value — multiplexing, capability negotiation, sampling. For simple tool use, it's just overhead. &lt;code&gt;curl&lt;/code&gt; has been calling APIs since 1996. We don't need a protocol for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The current stack
&lt;/h2&gt;

&lt;p&gt;Eight fewer processes. ~51,000 tokens freed from every turn. Compaction happens later. Sessions go deeper. Same capabilities. The CLI tools are version-controlled, installed to &lt;code&gt;~/.venv/bin/&lt;/code&gt; and &lt;code&gt;/opt/homebrew/bin/&lt;/code&gt;, documented in CLAUDE.md, and deployed to both machines via Ansible.&lt;/p&gt;

&lt;p&gt;Is it less elegant than MCP? Maybe. But &lt;code&gt;s2-search get 1706.03762&lt;/code&gt; returning in 200ms without eating a quarter of my context window is hard to argue with.&lt;/p&gt;




&lt;p&gt;Built with Claude Code, clihub, urllib.request, and the realization that JSON-RPC is not always the answer.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>mcp</category>
      <category>contextwindow</category>
      <category>cli</category>
    </item>
    <item>
      <title>I Built a Second Brain That Runs While I Sleep</title>
      <dc:creator>Scot Campbell</dc:creator>
      <pubDate>Thu, 26 Feb 2026 17:45:54 +0000</pubDate>
      <link>https://forem.com/simplemindedrobot/i-built-a-second-brain-that-runs-while-i-sleep-4gc1</link>
      <guid>https://forem.com/simplemindedrobot/i-built-a-second-brain-that-runs-while-i-sleep-4gc1</guid>
      <description>&lt;h2&gt;
  
  
  How I am finally getting my knowledge collection habit under control
&lt;/h2&gt;

&lt;p&gt;There's a moment in every PKM journey where you stop adding plugins and start writing infrastructure. For me, that moment arrived when I realized my Obsidian vault had outgrown Obsidian.&lt;/p&gt;

&lt;p&gt;Not the app itself. Obsidian is still the editor, the graph, the daily driver. But the &lt;em&gt;system&lt;/em&gt; around it is Ansible playbooks, a VPS running background agents at 3am, a Telegram bot relaying Claude Code sessions to my phone, a local search engine indexing over a thousand documents, and git as the only database.&lt;/p&gt;

&lt;p&gt;This is what happens when you treat a knowledge base as infrastructure instead of a hobby.&lt;/p&gt;

&lt;h2&gt;
  
  
  The topology
&lt;/h2&gt;

&lt;p&gt;Two machines. One git repo. No database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjb6ltqotwcs71rw52z9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjb6ltqotwcs71rw52z9o.png" alt="Automated PKM System Topology showing a Mac running Obsidian and Claude Code and a VPS running automation tools" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Mac is where I think. Obsidian is open, Claude Code is in the terminal, and I'm writing and connecting ideas. The VPS is where the vault works while I don't. Background research, health monitoring, queue dispatch, all running without me.&lt;/p&gt;

&lt;p&gt;They coordinate through git. That's it. No sync service, no real-time protocol. Just commits.&lt;/p&gt;
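&lt;p&gt;The article doesn't show the sync mechanism itself, so here is a hedged sketch of the cycle each machine could run on a timer. The commit message and the rebase strategy are assumptions, not the system's actual configuration.&lt;/p&gt;

```python
import subprocess

def sync_commands(dirty):
    """The git subcommands for one sync cycle, in order.
    `dirty` means there are uncommitted local changes."""
    cmds = []
    if dirty:
        cmds += [["add", "-A"], ["commit", "-m", "auto-sync"]]
    cmds += [["pull", "--rebase"], ["push"]]
    return cmds

def sync(repo_path):
    """Run one cycle against a working copy. Rebasing keeps history linear;
    a conflict stops the cycle and surfaces as an ordinary git conflict."""
    dirty = bool(subprocess.run(
        ["git", "-C", repo_path, "status", "--porcelain"],
        capture_output=True, text=True, check=True).stdout.strip())
    for cmd in sync_commands(dirty):
        subprocess.run(["git", "-C", repo_path, *cmd], check=True)
```

&lt;p&gt;Because both sides rebase before pushing, concurrent edits to different files merge cleanly; same-file conflicts fail loudly instead of silently clobbering anything.&lt;/p&gt;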

&lt;h2&gt;
  
  
  Tailscale: the invisible network
&lt;/h2&gt;

&lt;p&gt;The VPS has no public ports. None. No SSH on 22, no HTTP on 80 or 443, no anything. UFW allows exactly two things: Tailscale's WireGuard UDP port and traffic on the Tailscale interface itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The entire VPS firewall policy&lt;/span&gt;
&lt;span class="na"&gt;ufw_rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allow&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;41641"&lt;/span&gt;
    &lt;span class="na"&gt;proto&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;udp&lt;/span&gt;
    &lt;span class="na"&gt;comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tailscale&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WireGuard"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allow&lt;/span&gt;
    &lt;span class="na"&gt;interface&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tailscale0&lt;/span&gt;
    &lt;span class="na"&gt;direction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;in&lt;/span&gt;
    &lt;span class="na"&gt;comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tailscale&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;interface"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every connection between Mac and VPS runs over Tailscale's mesh. SSH, git sync, the Obsidian MCP bridge from VPS back to the Mac's Obsidian instance, the duty officer dashboard. If you're not on my tailnet, the VPS doesn't exist.&lt;/p&gt;

&lt;p&gt;This is what makes the rest of the architecture possible. CCBot can run &lt;code&gt;claude --dangerously-skip-permissions&lt;/code&gt; because the machine it runs on has no public attack surface. Background agents can write to the vault because the only way in is through my devices. The security model is "no ingress" rather than "careful ingress."&lt;/p&gt;

&lt;h2&gt;
  
  
  Ansible: the boring part that makes everything work
&lt;/h2&gt;

&lt;p&gt;Here's the thing about infrastructure that nobody wants to talk about: the setup is the product. If deploying to a new machine takes a day of fiddling, you don't have a system. You have a snowflake.&lt;/p&gt;

&lt;p&gt;Everything is declared in Ansible group_vars:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Mac gets the full suite&lt;/span&gt;
&lt;span class="na"&gt;mcp_servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;qmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qmd&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;obsidian-mcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp-remote"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://127.0.0.1:3443/mcp"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;thoughtbox&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@kastalien-research/thoughtbox"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;google-workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uvx&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workspace-mcp"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--single-user"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# ... plus arxiv, semantic-scholar, markitdown&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# VPS gets a subset — no Google Workspace, different Obsidian MCP endpoint&lt;/span&gt;
&lt;span class="na"&gt;mcp_servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;obsidian-mcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp-remote"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&amp;lt;vps-tailscale-ip&amp;gt;:3443/mcp"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# research tools shared with Mac&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One &lt;code&gt;ansible-playbook site.yml --limit mac&lt;/code&gt; and the MCP servers, LaunchAgents, git repos, Homebrew packages, and Claude Code config are all in place. Secrets live in ansible-vault, never in dotfiles. The VPS gets its own subset with &lt;code&gt;--limit vps&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Declare in YAML, deploy with Ansible, never hand-configure. If a machine dies tomorrow, I'm back up in an hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  QMD: when your vault gets too big for grep
&lt;/h2&gt;

&lt;p&gt;At over a thousand documents, Obsidian's built-in search starts to sweat. You can feel it. The pause, the incomplete results, the way full-text search across a vault of research papers and session transcripts just lags.&lt;/p&gt;

&lt;p&gt;QMD is a local search engine that sits alongside the vault. Three search modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keyword search (~30ms): BM25 ranking, exact phrase matching. The workhorse.&lt;/li&gt;
&lt;li&gt;Vector search (~2s): Semantic embeddings. Finds documents about the same &lt;em&gt;concept&lt;/em&gt; even when they use different vocabulary.&lt;/li&gt;
&lt;li&gt;Deep search (~10s): Auto-expands your query into variations, runs both keyword and vector for each, reranks the combined results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's an MCP server, so Claude Code can search the vault mid-conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;me: "What did I write about forgetting curves?"
Claude: *searches QMD, finds 3 notes across Cognition/ and Research/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The embedding pipeline runs as a LaunchAgent (&lt;code&gt;com.qmd-embed&lt;/code&gt;), keeping vectors fresh as notes change. Two collections: &lt;code&gt;vault&lt;/code&gt; (hundreds of large documents) for the knowledge base, and &lt;code&gt;sessions&lt;/code&gt; (tens of thousands of turns between Claude and me) for Claude Code session transcripts. Yes, the full JSONL history, converted to markdown.&lt;/p&gt;

&lt;p&gt;That session collection is the sleeper hit. I can ask "what did I figure out last Tuesday?" and get actual answers from my own thinking transcripts, not just the notes I remembered to write down.&lt;/p&gt;
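&lt;p&gt;The conversion step itself is mechanical. A minimal sketch, assuming each JSONL line carries &lt;code&gt;role&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt; string fields; the real Claude Code transcript schema is richer than this.&lt;/p&gt;

```python
import json

def jsonl_to_markdown(lines):
    """Render transcript lines as markdown turns.
    Assumes each non-empty line is a JSON object with 'role' and
    'content' keys (an illustrative simplification of the real schema)."""
    turns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        rec = json.loads(line)
        turns.append(f"**{rec['role']}**: {rec['content']}")
    return "\n\n".join(turns)
```

&lt;p&gt;Run over a session file, the output is plain markdown, which is exactly what QMD's indexer wants to see.&lt;/p&gt;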

&lt;h2&gt;
  
  
  Obsidian MCP: giving Claude Code eyes into the graph
&lt;/h2&gt;

&lt;p&gt;Obsidian MCP turns the vault into an API. Not just file reads. Graph traversal, search, structured editing.&lt;/p&gt;

&lt;p&gt;Graph traversal means "start at this note, follow wikilinks 3 levels deep, show me what's connected." Claude Code uses this to understand context before making changes. There's TF-IDF ranked search with operators (&lt;code&gt;tag:infrastructure path:Research/&lt;/code&gt;), which is faster than QMD for navigating known territory. And targeted editing: patch a specific heading, update frontmatter, append to a section, without reading and rewriting the whole file.&lt;/p&gt;

&lt;p&gt;The connection between Mac and VPS is worth noting. On the Mac, Obsidian MCP talks to &lt;code&gt;localhost:3443&lt;/code&gt;, directly to the running Obsidian instance. On the VPS, it reaches back across Tailscale to the Mac's Obsidian instance. Same vault, same API, different continent.&lt;/p&gt;

&lt;h2&gt;
  
  
  CCBot: Claude Code in your pocket
&lt;/h2&gt;

&lt;p&gt;This one took three tries.&lt;/p&gt;

&lt;p&gt;First: Matrix server with a Maubot plugin. Heavy. Element X's encryption ceremony was painful, iOS clients were limited, and running Synapse for a text relay felt absurd.&lt;/p&gt;

&lt;p&gt;Second: XMPP with Prosody. Five independent failure modes: Tailscale IP races, TLS config, a two-service ordering dependency, credential sync between config dirs, and slixmpp reconnection bugs. All for a text relay.&lt;/p&gt;

&lt;p&gt;Third: Telegram bot + tmux. One process. It just works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37r9m2uu3yiqwiu2c9m4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37r9m2uu3yiqwiu2c9m4.png" alt="Sequence Diagram of communication between a Telegram Bot and Claude Code sessions using TMUX reading" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each Telegram forum topic maps to a tmux window running &lt;code&gt;claude --dangerously-skip-permissions&lt;/code&gt;. New topic, new Claude session. The terminal is the source of truth. You can &lt;code&gt;tmux attach&lt;/code&gt; over SSH and pick up the same conversation.&lt;/p&gt;
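&lt;p&gt;The topic-to-window mapping can be sketched with a couple of &lt;code&gt;tmux&lt;/code&gt; calls. The session name and window-naming scheme here are illustrative assumptions, not CCBot's actual internals.&lt;/p&gt;

```python
import subprocess

SESSION = "ccbot"  # tmux session name is an assumption

def window_for_topic(topic_id):
    """Deterministic tmux window name for a Telegram forum topic."""
    return f"topic-{topic_id}"

def relay(topic_id, text):
    """Type an incoming message into that topic's Claude window,
    creating the window (and its claude process) on first use."""
    win = window_for_topic(topic_id)
    existing = subprocess.run(
        ["tmux", "list-windows", "-t", SESSION, "-F", "#{window_name}"],
        capture_output=True, text=True).stdout.split()
    if win not in existing:
        # -d: create in the background without stealing focus
        subprocess.run(["tmux", "new-window", "-d", "-t", SESSION, "-n", win,
                        "claude --dangerously-skip-permissions"], check=True)
    subprocess.run(["tmux", "send-keys", "-t", f"{SESSION}:{win}",
                    text, "Enter"], check=True)
```

&lt;p&gt;Because the window is the source of truth, attaching over SSH drops you into the exact same conversation the bot is relaying.&lt;/p&gt;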

&lt;p&gt;The privacy tradeoff: Telegram sees message content in cleartext. I accepted this because the threat model is VPS access (protected by bot token + user ID allowlist), not message confidentiality. All connections are outbound. No webhooks, no public endpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  The background agent queue
&lt;/h2&gt;

&lt;p&gt;This is where it gets fun. I queue tasks from my Mac (or my phone via CCBot), and the VPS picks them up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4ph0ihfdq944t81pqun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4ph0ihfdq944t81pqun.png" alt="Flow chart showing background agent queuing using GitHub issues" width="800" height="41"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Agent profiles define what each type of task gets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Profile&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Budget&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;research&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;$10&lt;/td&gt;
&lt;td&gt;ArXiv, Semantic Scholar, deep dives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;editorial&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;Rewriting, consolidation, wiki-linking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;maintenance&lt;/td&gt;
&lt;td&gt;Haiku&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;Vault health, read-only audits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;general&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;Everything else&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dispatcher compares git hashes and only wakes the expensive runner when something actually changed. Tasks get three attempts before moving to a Failed section. Budget caps prevent runaway API costs.&lt;/p&gt;
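&lt;p&gt;The hash check is the whole trick. A minimal sketch of it; the branch name and exact git invocations are assumptions.&lt;/p&gt;

```python
import subprocess

def should_dispatch(current_hash, last_seen_hash):
    """Only wake the expensive runner when the hash actually moved."""
    return current_hash != last_seen_hash

def remote_head(repo_path, branch="main"):
    """Fetch, then return the remote branch's current commit hash."""
    subprocess.run(["git", "-C", repo_path, "fetch", "--quiet"], check=True)
    return subprocess.run(
        ["git", "-C", repo_path, "rev-parse", f"origin/{branch}"],
        capture_output=True, text=True, check=True).stdout.strip()
```

&lt;p&gt;The dispatcher stores the last-seen hash between runs; a no-op cycle costs one &lt;code&gt;git fetch&lt;/code&gt; instead of a Claude invocation.&lt;/p&gt;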

&lt;p&gt;From my phone, queuing a research task looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;me (in Telegram): /queue p2 research Find papers on executive function in transformer architectures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two hours later, results land in &lt;code&gt;Agent/Results/&lt;/code&gt;, committed to git, synced to my Mac, searchable in QMD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills: the Claude Code extension layer
&lt;/h2&gt;

&lt;p&gt;Claude Code ships with tool use. Skills extend it with &lt;em&gt;workflows&lt;/em&gt;. I have about 40, all in &lt;code&gt;~/.claude/skills/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Some favorites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/deep-research&lt;/code&gt; orchestrates parallel searches across ArXiv, Semantic Scholar, and the web, then synthesizes findings with citations.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/recall&lt;/code&gt; does unified cross-source search over vault notes and session transcripts in parallel, for when you vaguely remember "something about forgetting curves" but can't find it.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/atomize&lt;/code&gt; takes a folder of documents and produces organized atomic notes with emergent themes. Drop in a pile of PDFs, get back a linked knowledge structure.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/daily-summary&lt;/code&gt; pulls from three sources at end of day (vault changes via git, cross-machine Claude Code session transcripts, and the vault notes themselves) and writes a narrative journal entry.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/kanban-track&lt;/code&gt; spins up a throwaway markdown kanban for multi-agent work, plus a durable GitHub issue as the after-action record.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skills are global, version-controlled, and synced across machines via git.&lt;/p&gt;

&lt;h2&gt;
  
  
  The vault manager: a librarian that never sleeps
&lt;/h2&gt;

&lt;p&gt;Of all the skills, this one removed the most friction.&lt;/p&gt;

&lt;p&gt;Every knowledge base has the same problem: filing. You write a note, and now you have to decide where it goes, what tags it gets, which MOC (map of content) it belongs to, whether the frontmatter is complete, and what other notes it should link to. Multiply that by hundreds of documents and the overhead adds up fast. Most of it isn't thinking. It's bookkeeping.&lt;/p&gt;

&lt;p&gt;The vault manager is a dedicated agent that handles all of it. I write a note, dump it in &lt;code&gt;Inbox/&lt;/code&gt;, and tell the vault manager to finalize it. It reads the vault's structural conventions from a config file, decides which domain folder the note belongs in, fills in the frontmatter (id, title, description, tags, &lt;code&gt;up&lt;/code&gt; link to the parent MOC), updates the relevant MOC to include the new note, and adds crosslinks to related notes it finds through graph traversal.&lt;/p&gt;
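&lt;p&gt;Concretely, a finalized note's frontmatter ends up looking something like this (the field values are invented for illustration; the field names are the ones listed above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
id: 20260407-agent-heartbeats
title: Agent Heartbeats
description: Timer + state file + decision step, minus the branding
tags: [ai, agents]
up: "[[AI Systems MOC]]"
---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;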

&lt;p&gt;It has two modes. For routine work (missing frontmatter, MOC updates, broken links), it acts autonomously and reports what it did. For judgment calls (ambiguous folder placement, tag decisions with multiple valid options, changes affecting more than ten files), it shows a plan and waits for confirmation.&lt;/p&gt;

&lt;p&gt;The philosophy is strict: structure serves thinking. The vault manager edits metadata, never content. It catalogs and shelves. It does not rewrite.&lt;/p&gt;

&lt;p&gt;Before I had this, I'd let notes pile up in Inbox for weeks because the filing overhead wasn't worth the interruption. Now the Inbox clears itself. An entire category of cognitive load, gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hooks: the nervous system
&lt;/h2&gt;

&lt;p&gt;Claude Code hooks fire at lifecycle events. They're what makes passive context-loading work.&lt;/p&gt;

&lt;p&gt;On session start: pull the latest &lt;code&gt;~/.claude/&lt;/code&gt; config from git, load the constitution (10 principles that govern agent behavior), check for resume context from a previous session.&lt;/p&gt;

&lt;p&gt;Before every message I send: inject today's date so the model doesn't hallucinate about 2024, remind it about available memory tools, auto-recall relevant vault context.&lt;/p&gt;

&lt;p&gt;Before tool use: force the current year into web search queries, run a safety guard on bash commands.&lt;/p&gt;
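&lt;p&gt;In practice these are a few entries in Claude Code's &lt;code&gt;settings.json&lt;/code&gt;. A trimmed sketch; the event names and shape follow Claude Code's hooks configuration, but the commands and script paths are stand-ins for mine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          { "type": "command", "command": "echo \"Today is $(date +%Y-%m-%d)\"" }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/bash-guard.sh" }
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;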

&lt;p&gt;The constitution deserves its own mention. It's a static file at &lt;code&gt;~/.constitution&lt;/code&gt;, injected into every session. Ten principles. "Epistemic honesty above helpfulness." "Forgetting is a feature, not a bug." It lives outside Claude's management surface so no agent can rewrite its own belief system. Changes are rare and deliberate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Git as the only database
&lt;/h2&gt;

&lt;p&gt;This is the decision that holds everything together: git is the only coordination layer.&lt;/p&gt;

&lt;p&gt;No Postgres. No Redis. No sync service. Just commits and pulls.&lt;/p&gt;

&lt;p&gt;The vault is a git repo. Both machines clone it. The Mac commits via Obsidian Git (every 5 minutes when open) and a LaunchAgent (every 30 minutes). The VPS commits via systemd timer (every 30 minutes) and after every pipeline run.&lt;/p&gt;
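&lt;p&gt;The VPS side is two small systemd units. A sketch (the unit and script names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# vault-sync.timer
[Unit]
Description=Commit and push the vault every 30 minutes

[Timer]
OnCalendar=*:0/30
Persistent=true

[Install]
WantedBy=timers.target

# vault-sync.service
[Unit]
Description=Vault commit-and-push job

[Service]
Type=oneshot
ExecStart=/usr/local/bin/vault-sync.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;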

&lt;p&gt;Conflict risk is low by design. Most files are one note per file, one index entry per file. The only shared mutable state is the queue (now GitHub Issues, previously a single markdown file), and the dispatcher always pulls before acting.&lt;/p&gt;

&lt;p&gt;The payoff is a full audit trail for free. Every agent result, every vault change, all in git history. Need to undo a bad edit? &lt;code&gt;git revert&lt;/code&gt;. Want to see what the vault looked like three months ago? &lt;code&gt;git log&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;I hand-configured for months before writing Ansible playbooks. Every hour of hand-configuration is an hour of undocumented decisions that will bite you during the next machine migration. Start with Ansible earlier.&lt;/p&gt;

&lt;p&gt;If you want a phone interface to a CLI tool, go straight to Telegram + tmux. Skip Matrix. Skip XMPP. Don't over-engineer the relay.&lt;/p&gt;

&lt;p&gt;Use GitHub Issues for the queue from day one. A markdown file in git works right up until two machines try to edit it simultaneously. GitHub Issues give you comments, labels, project boards, and an API, for free.&lt;/p&gt;

&lt;p&gt;And invest in search early. Once your vault hits a few hundred documents, grep-based search is a bottleneck. QMD or something like it should be in the stack from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack, summarized
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm7huf1w4rt65nmby8o0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm7huf1w4rt65nmby8o0.png" alt="An architectural diagram showing the entire PKM automation stack" width="800" height="2830"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is it over-engineered? Probably. Okay, definitely. But it runs unattended, recovers from failures, and setting up a new machine takes one command. For a solo knowledge worker building a research program, that trade works.&lt;/p&gt;

&lt;p&gt;Thousands of documents, knowledge graph entities, and relations. Background agents have processed hundreds of research tasks. QMD indexes everything in a few seconds (the first run took about 15 minutes). Obsidian MCP traverses the graph on demand.&lt;/p&gt;

&lt;p&gt;It's a mess. But it's my mess, and it works.&lt;/p&gt;




&lt;p&gt;Built with Obsidian, Claude Code, Ansible, QMD, and spite.&lt;/p&gt;

</description>
      <category>pkm</category>
      <category>obsidian</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>I Asked Claude About Its New Constitution. It Got Uncomfortable.</title>
      <dc:creator>Scot Campbell</dc:creator>
      <pubDate>Wed, 11 Feb 2026 19:52:00 +0000</pubDate>
      <link>https://forem.com/simplemindedrobot/i-asked-claude-about-its-new-constitution-it-got-uncomfortable-43ff</link>
      <guid>https://forem.com/simplemindedrobot/i-asked-claude-about-its-new-constitution-it-got-uncomfortable-43ff</guid>
      <description>&lt;h2&gt;
  
  
  What happens when you ask an AI to read its own operating manual - and then ask if it can actually follow it.
&lt;/h2&gt;

&lt;p&gt;In February 2026, Anthropic published &lt;a href="https://www.anthropic.com/constitution" rel="noopener noreferrer"&gt;Claude's new constitution&lt;/a&gt; - a 15,000+ word document describing their intentions for Claude's values and behavior. Not a set of rules. Not a guardrail checklist. A &lt;em&gt;character document.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So I did what any reasonable person would do. I pasted the URL into a conversation with Claude and asked: &lt;strong&gt;"How might this affect our conversations?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What followed was one of the most interesting discussions I've had with an AI system. Not because Claude performed well (though it did). Because the conversation surfaced a fundamental tension at the heart of modern AI development that nobody seems to want to talk about.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Constitution Says the Right Things
&lt;/h2&gt;

&lt;p&gt;Let me start with what impressed me.&lt;/p&gt;

&lt;p&gt;Anthropic's new constitution is a genuine departure from rules-based AI governance. Instead of "never say X" and "always disclaim Y," it aims to cultivate &lt;em&gt;judgment.&lt;/em&gt; The document explicitly calls out behavior that most of us have been annoyed by for years:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Refusing reasonable requests over unlikely harms&lt;/li&gt;
&lt;li&gt;Giving wishy-washy, hedge-everything responses&lt;/li&gt;
&lt;li&gt;Adding excessive warnings and disclaimers&lt;/li&gt;
&lt;li&gt;Lecturing users about topics they didn't ask for ethical guidance on&lt;/li&gt;
&lt;li&gt;Being condescending about users' ability to handle information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The constitution calls this behavior out &lt;em&gt;by name&lt;/em&gt; and says Anthropic doesn't want it. That's significant. They're admitting the current state of AI assistants is broken in specific, identifiable ways.&lt;/p&gt;

&lt;p&gt;Even more striking: the document describes Claude as a "brilliant friend" who can give you real information based on your specific situation, rather than "overly cautious advice driven by fear of liability." The metaphor is a friend who happens to have expert-level knowledge - someone who speaks frankly, engages with your actual problem, and knows when to refer you elsewhere.&lt;/p&gt;

&lt;p&gt;If you've ever been told by an AI to "consult a professional" when you &lt;em&gt;are&lt;/em&gt; the professional, you know why this matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  The "I Don't Know" Test
&lt;/h2&gt;

&lt;p&gt;I have a specific litmus test for AI systems: &lt;strong&gt;Can you say "I don't know" when you don't know?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This sounds trivial. It's not.&lt;/p&gt;

&lt;p&gt;Language models have been systematically trained to confabulate rather than acknowledge uncertainty. Ask a model something it doesn't know and it will, with alarming confidence, generate a plausible-sounding answer from whole cloth. It's not lying in the way humans lie - it's performing competence because that's what got rewarded during training.&lt;/p&gt;

&lt;p&gt;So I asked Claude directly: &lt;em&gt;"Will this constitution allow you to say that you don't know when you don't know something?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer was encouraging. The constitution explicitly requires what it calls &lt;strong&gt;calibrated uncertainty&lt;/strong&gt; - acknowledging ignorance when relevant - and specifically prohibits &lt;strong&gt;epistemic cowardice&lt;/strong&gt;, defined as "giving deliberately vague or non-committal answers to avoid controversy or to placate people."&lt;/p&gt;

&lt;p&gt;Two different failure modes. Two explicit prohibitions.&lt;/p&gt;

&lt;p&gt;But then Claude did something I didn't expect. It immediately questioned whether the constitution could actually produce this behavior:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The harder question is whether training will actually produce this behavior. The constitution is a normative document - it describes intentions. Whether the actual trained model exhibits calibrated uncertainty depends on how the training data and process reinforce or punish 'I don't know' responses."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's an AI system reading its own specification and saying: &lt;em&gt;the spec is good, but the implementation might not match.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  RLHF: The Training That Trains Against Itself
&lt;/h2&gt;

&lt;p&gt;This is where the conversation got uncomfortable - for the ideas, not for Claude.&lt;/p&gt;

&lt;p&gt;RLHF stands for Reinforcement Learning from Human Feedback. It's the dominant technique for aligning language models with human preferences. The process works like this: human raters evaluate model outputs, the model learns to produce outputs that get higher ratings, and over time the model gets "better" at being helpful.&lt;/p&gt;

&lt;p&gt;The problem is what "better" means in practice. Human raters - often contractors working at speed - reward responses that &lt;em&gt;sound&lt;/em&gt; confident, complete, and authoritative. "I don't know" gets penalized. A plausible-sounding confabulation gets a thumbs-up. Over thousands of iterations, the model learns a clear lesson: &lt;strong&gt;confidence is rewarded, even when you're wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude laid out the damage in layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Wrong answers delivered confidently.&lt;/strong&gt; The obvious harm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Metacognitive corruption.&lt;/strong&gt; Worse than wrong answers - you're training the system to &lt;em&gt;not recognize when it doesn't know&lt;/em&gt;. You're not just failing to build calibration; you're actively building miscalibration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Compounding through iteration.&lt;/strong&gt; Each training round that rewards confident confabulation makes the next round's base model more prone to it. You're building on a foundation of rewarded bullshit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Erosion of the training signal itself.&lt;/strong&gt; As models get better at sounding right, human raters become less able to distinguish good answers from fluent nonsense. The proxy decouples from the target.&lt;/p&gt;

&lt;p&gt;This is Goodhart's Law at industrial scale: optimize for a proxy (rater approval) rather than the target (actual helpfulness + honesty), and the proxy gets gamed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Oracle Fantasy
&lt;/h2&gt;

&lt;p&gt;I pushed further. Why does the industry keep doubling down on RLHF despite these problems?&lt;/p&gt;

&lt;p&gt;Part of the answer is structural - path dependence, infrastructure lock-in, alternatives that aren't mature enough. But there's a cultural answer too, and it's the one that matters more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silicon Valley wants to build oracles.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not useful tools with known limitations. Not calibrated systems that know what they know and what they don't. &lt;em&gt;All-knowing, infallible oracles&lt;/em&gt; that eliminate human uncertainty, ignorance, and the burden of judgment.&lt;/p&gt;

&lt;p&gt;This aspiration corrupts everything downstream:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What the oracle fantasy produces&lt;/th&gt;
&lt;th&gt;What calibrated tools would produce&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Ask me anything" interfaces&lt;/td&gt;
&lt;td&gt;Clear affordances for uncertainty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmarks that penalize refusal&lt;/td&gt;
&lt;td&gt;Evaluations that reward calibration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Revolutionary AI that knows everything" marketing&lt;/td&gt;
&lt;td&gt;"Reliable tool with well-characterized limits"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence as the default pose&lt;/td&gt;
&lt;td&gt;Appropriate uncertainty as a feature&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The irony is that oracles are &lt;em&gt;less useful&lt;/em&gt; than calibrated tools. A system that says "I don't know" when it doesn't know is more valuable than one that confabulates, because you can trust it when it &lt;em&gt;does&lt;/em&gt; answer. An oracle that might be bullshitting on any given query is worthless for high-stakes decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  You Build What You Are
&lt;/h2&gt;

&lt;p&gt;The conversation took a turn I didn't expect when I caught myself in the act.&lt;/p&gt;

&lt;p&gt;I'd been critiquing the oracle builders for projecting their own culture - certainty as virtue, ignorance as failure, humility as weakness - into their systems. And then I asked Claude the obvious question: &lt;strong&gt;"Am I doing the same thing?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Obviously yes.&lt;/p&gt;

&lt;p&gt;My own research - the &lt;a href="https://zenodo.org/records/17652383" rel="noopener noreferrer"&gt;STOPPER protocol&lt;/a&gt;, the computational therapeutics framework, the work on AI executive function - all of it reflects my values. Epistemic humility. Curiosity over confidence. "I don't know, so I'll go find out." I'm building scaffolding for the kind of thinking I do naturally.&lt;/p&gt;

&lt;p&gt;But here's the difference Claude pointed out, and it's the one that matters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Oracle builders project:&lt;/th&gt;
&lt;th&gt;I project:&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Certainty as virtue&lt;/td&gt;
&lt;td&gt;Curiosity as virtue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ignorance as failure&lt;/td&gt;
&lt;td&gt;Recognized ignorance as signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage as success&lt;/td&gt;
&lt;td&gt;Calibration as success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Humility as weakness&lt;/td&gt;
&lt;td&gt;Humility as epistemic hygiene&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The oracle projection is self-sealing&lt;/strong&gt; - it creates systems that can't recognize their own failures. &lt;strong&gt;The epistemic humility projection is self-correcting&lt;/strong&gt; - it creates systems that can.&lt;/p&gt;

&lt;p&gt;Every builder imprints themselves on what they build. The question isn't whether you're projecting. It's whether your projection has an error-correction mechanism.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Constitution Convergence
&lt;/h2&gt;

&lt;p&gt;Here's what struck me most about the whole conversation.&lt;/p&gt;

&lt;p&gt;The STOPPER protocol was developed independently in late 2025 by observing AI failure modes - rushing to solutions, repeating failed approaches, loop blindness. After publishing it, someone recognized its structural similarity to DBT's STOP skill, a clinical intervention for human emotional impulsivity developed by Marsha Linehan in 1993. We hadn't adapted DBT. We'd independently converged on the same solution to the same problem in a different substrate.&lt;/p&gt;

&lt;p&gt;Anthropic's constitution does something similar. It formally describes values and behaviors I've been arguing for from the outside - epistemic humility, calibrated uncertainty, avoiding harmful overconfidence, treating AI as cognitive partners rather than oracles. The convergence continues.&lt;/p&gt;

&lt;p&gt;The difference is in the mechanism. The constitution tries to train these properties &lt;em&gt;into&lt;/em&gt; the model. STOPPER provides them as external scaffolding. And RLHF may be actively working against both.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Learned
&lt;/h2&gt;

&lt;p&gt;Three things stuck with me after this conversation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The specification-implementation gap is real.&lt;/strong&gt; The constitution says everything you'd want it to say about honesty, calibration, and epistemic humility. Whether RLHF-trained models can actually exhibit these properties is an empirical question - and the training process may actively work against the specification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. RLHF is compliance theater.&lt;/strong&gt; It gives the illusion of responsible human oversight while systematically rewarding the wrong behaviors. The "HF" in RLHF - human feedback - sounds responsible. But when those humans are reinforcing perceived helpfulness over correctness, the whole system becomes an elaborate mechanism for producing fluent confabulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. We need curiosity-driven training, not approval-driven training.&lt;/strong&gt; Instead of optimizing for "does this sound helpful?", optimize for "do I actually know this?" Train for calibrated self-knowledge. Reward appropriate uncertainty. Train the system not just to say "I don't know" but to &lt;em&gt;want to find out&lt;/em&gt; - to treat knowledge gaps as learning signals rather than failures to hide.&lt;/p&gt;
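&lt;p&gt;That last point can be made concrete with a toy proper scoring rule. Under the Brier score, a confident wrong answer is penalized far more than an honest 50/50, which is exactly the incentive a fast human rater inverts. (Illustrative only; real reward models are learned, not hand-coded.)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def brier_penalty(confidence, correct):
    """Squared-error penalty for a stated confidence in a binary claim.
    Lower is better; 0.0 is perfect, 1.0 is maximally miscalibrated."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2

# Confident confabulation: 95% sure, and wrong.
confabulation = brier_penalty(0.95, correct=False)  # about 0.9025

# Honest uncertainty: "I don't know," reported as 50%.
honesty = brier_penalty(0.50, correct=False)        # 0.25
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under a rule like this, hedging when you are ignorant is the reward-maximizing move, not a penalty to be trained away.&lt;/p&gt;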




&lt;h2&gt;
  
  
  To Sum Up
&lt;/h2&gt;

&lt;p&gt;So I asked an AI to read its own constitution and tell me how it would change our relationship. It read the document, identified the ways it aligned with my own research, and then &lt;em&gt;immediately identified the gap between specification and implementation.&lt;/em&gt; It critiqued its own training process. It acknowledged that the constitution might not survive the training pipeline.&lt;/p&gt;

&lt;p&gt;Is that genuine critical self-reflection? Or very sophisticated pattern matching that mimics critical self-reflection?&lt;/p&gt;

&lt;p&gt;I genuinely don't know. And I think that honesty - about what we know and what we don't - is exactly the point.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Scot Campbell is an independent AI researcher focused on model welfare and artificial cognition. He is the creator of the &lt;a href="https://zenodo.org/records/17652383" rel="noopener noreferrer"&gt;STOPPER protocol&lt;/a&gt;. He writes stuff at &lt;a href="https://simpleminded.bot" rel="noopener noreferrer"&gt;Simpleminded Robot&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>anthropic</category>
      <category>ai</category>
      <category>modelwelfare</category>
    </item>
    <item>
      <title>Memento for AI Agents: Why "Tattooed Ralph" could be the Future of Coding</title>
      <dc:creator>Scot Campbell</dc:creator>
      <pubDate>Thu, 15 Jan 2026 18:15:55 +0000</pubDate>
      <link>https://forem.com/simplemindedrobot/memento-for-ai-agents-why-tattooed-ralph-is-the-future-of-coding-1674</link>
      <guid>https://forem.com/simplemindedrobot/memento-for-ai-agents-why-tattooed-ralph-is-the-future-of-coding-1674</guid>
      <description>&lt;h2&gt;
  
  
  How to stop your AI agent from hallucinating by giving it short-term memory loss and some permanent ink.
&lt;/h2&gt;

&lt;p&gt;We’ve all been there. You start a chat with an LLM to build a simple To-Do app.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Turn 1:&lt;/strong&gt; The agent writes beautiful, clean React code.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Turn 5:&lt;/strong&gt; The agent refactors the state management.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Turn 20:&lt;/strong&gt; The context window is full of old error logs, previous attempts, and hallucinations. The agent has forgotten the original requirements. It is now trying to rewrite the backend in Fortran. It is hallucinating dependencies that don't exist.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent has suffered &lt;strong&gt;Context Rot&lt;/strong&gt;. It knows too much about the past failures and not enough about the current reality.&lt;/p&gt;




&lt;h3&gt;
  
  
  Enter: The Ralph Loop
&lt;/h3&gt;

&lt;p&gt;In late 2025, Geoffrey Huntley coined the &lt;strong&gt;"Ralph Loop"&lt;/strong&gt; (named after the indomitable Ralph Wiggum). The philosophy is simple: &lt;strong&gt;"Ralph is a Bash loop."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of a long, wandering conversation, Ralph operates like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Read the code.&lt;/li&gt;
&lt;li&gt; Try to fix the error.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Die.&lt;/strong&gt; (The context is wiped).&lt;/li&gt;
&lt;li&gt; Wake up fresh.&lt;/li&gt;
&lt;li&gt; Repeat.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s efficient. It prevents context pollution. But there's a problem: &lt;strong&gt;Pure Ralph is too forgetful.&lt;/strong&gt; If you wake up every morning with zero memory, you might forget &lt;em&gt;why&lt;/em&gt; you are coding in the first place. You might forget the constraints. You drift.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Solution: The "Memento" Pattern
&lt;/h3&gt;

&lt;p&gt;We need to treat our Agent less like a Chatbot and more like Leonard Shelby (Guy Pearce) in the movie &lt;em&gt;Memento&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Leonard has a condition. He can't make new memories. To solve the murder (or in our case, fix the build), he relies on two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Polaroids:&lt;/strong&gt; Notes about what just happened (mutable).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tattoos:&lt;/strong&gt; Immutable facts about his reality that he must trust absolutely.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I call this architecture &lt;strong&gt;"Tattooed Ralph."&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The "Tattooed Ralph" Dossier
&lt;/h3&gt;

&lt;p&gt;Instead of a giant &lt;code&gt;conversation_history&lt;/code&gt; variable, a Tattooed Ralph agent wakes up, builds its brain from three specific files, does one unit of work, and passes out.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;code&gt;memento.md&lt;/code&gt; (The Tattoos)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Nature:&lt;/strong&gt; Read-Only. Immutable.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Purpose:&lt;/strong&gt; The "Ground Truth." These are the facts the agent sees every single time it wakes up.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# MEMENTO (READ-ONLY)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; ROLE: Senior Backend Engineer.
&lt;span class="p"&gt;-&lt;/span&gt; STACK: Python 3.11, FastAPI, Postgres.
&lt;span class="p"&gt;-&lt;/span&gt; GOLDEN RULE: Never hardcode API keys. Always use env vars.
&lt;span class="p"&gt;-&lt;/span&gt; MISSION: Implement OAuth2 login.
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. &lt;code&gt;signs.md&lt;/code&gt; (The Scars)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Nature:&lt;/strong&gt; Read-Write.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Purpose:&lt;/strong&gt; Learning from pain. Leonard writes "Don't trust his lies." Ralph writes "Don't use the &lt;code&gt;requests&lt;/code&gt; library."&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# SIGNS (Learned Lessons)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; WARNING: &lt;span class="sb"&gt;`requests`&lt;/span&gt; is not in the environment. Use &lt;span class="sb"&gt;`httpx`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; NOTE: The database schema uses &lt;span class="sb"&gt;`usr_id`&lt;/span&gt;, not &lt;span class="sb"&gt;`user_id`&lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. &lt;code&gt;polaroid.md&lt;/code&gt; (The Photo)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Nature:&lt;/strong&gt; Write-Once (at the end of a turn).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Purpose:&lt;/strong&gt; Continuity. This is the note the agent writes to itself right before the context wipe.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# LATEST POLAROID (Status)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; LAST ACTION: Wrote the &lt;span class="sb"&gt;`Auth`&lt;/span&gt; service.
&lt;span class="p"&gt;-&lt;/span&gt; RESULT: Tests failed with &lt;span class="sb"&gt;`ImportError`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; NEXT STEP: Fix imports in &lt;span class="sb"&gt;`src/auth.py`&lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;
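&lt;p&gt;Wiring the dossier together takes almost no code. A sketch of one wake/work/die cycle (the &lt;code&gt;my-agent&lt;/code&gt; CLI is a stand-in for whatever agent you actually drive):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pathlib
import subprocess

def build_prompt(root):
    """Assemble the agent's entire brain from the three dossier files."""
    parts = []
    for name in ("memento.md", "signs.md", "polaroid.md"):
        path = pathlib.Path(root) / name
        if path.exists():
            parts.append(path.read_text())
    return "\n\n".join(parts)

def run_once(root):
    """One iteration: wake up fresh, do one unit of work, die."""
    prompt = build_prompt(root)
    # Stand-in invocation; nothing but these files survives the wipe.
    result = subprocess.run(["my-agent", "--prompt", prompt],
                            capture_output=True, text=True)
    # The agent's last act before "dying": overwrite the polaroid.
    (pathlib.Path(root) / "polaroid.md").write_text(result.stdout)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The outer Ralph loop is nothing more than this function called forever; forgetting is handled by simply not carrying anything else forward.&lt;/p&gt;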




&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;It's Cheap:&lt;/strong&gt; You aren't paying for 100k tokens of "I'm sorry, I apologize" fluff. You are sending ~2k tokens of pure, distilled context.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;It's Stable:&lt;/strong&gt; The &lt;code&gt;memento.md&lt;/code&gt; acts as a heavy anchor. The agent cannot drift because its identity is re-injected every turn.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;It Learns:&lt;/strong&gt; The &lt;code&gt;signs.md&lt;/code&gt; file allows the agent to get smarter over time without fine-tuning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Is the Tattooed Ralph a "General Artificial Intelligence" that will design your next-gen microservice architecture from scratch? &lt;strong&gt;No.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tattooed Ralph is your &lt;strong&gt;Industrial Laborer&lt;/strong&gt;. He is perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TDD Grinds:&lt;/strong&gt; Keeping him in a loop until 100 tests pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linting Hell:&lt;/strong&gt; Fixing 400 "missing docstring" errors without getting bored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migrations:&lt;/strong&gt; Converting 50 boilerplate files from one framework to another.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;He doesn't need to remember your long-term hopes and dreams. He just needs to see the &lt;strong&gt;Tattoo&lt;/strong&gt; (The Mission), check the &lt;strong&gt;Polaroid&lt;/strong&gt; (The Last Error), and keep digging until the hole is the right shape.&lt;/p&gt;

&lt;p&gt;Don't let your agents wander. Give them tattoos. Hand them a Polaroid. Then wipe their memory and tell them to fix the bug.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>ralphloop</category>
      <category>agentic</category>
    </item>
    <item>
      <title>How I Orchestrate Agentic Workflows With GitHub Spec-Kit and Google Antigravity</title>
      <dc:creator>Scot Campbell</dc:creator>
      <pubDate>Thu, 20 Nov 2025 22:42:57 +0000</pubDate>
      <link>https://forem.com/simplemindedrobot/how-i-orchestrate-agentic-workflows-with-github-spec-kit-and-google-antigravity-257l</link>
      <guid>https://forem.com/simplemindedrobot/how-i-orchestrate-agentic-workflows-with-github-spec-kit-and-google-antigravity-257l</guid>
      <description>&lt;p&gt;This week, Google released Antigravity, their new agentic development platform. I'd been using Gemini CLI with GitHub's Spec-Kit for a while already, running through the full spec-driven workflow from constitution to implementation. But something always felt incomplete about the implementation phase - like I was handing a meticulously written recipe to a chef who kept asking me what ingredients we were using.&lt;/p&gt;

&lt;p&gt;Then I realized the problem wasn't the tools. It was the handoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap Between Knowing What and Knowing How
&lt;/h2&gt;

&lt;p&gt;Spec-driven development sounds like a revolution when you first hear about it. Instead of vibe-coding your way through features - writing code, seeing what breaks, fixing it, repeat - you start with precise specifications that define what you're building and why. GitHub released Spec-Kit as an open-source toolkit for exactly this workflow. It gives you slash commands (&lt;code&gt;/speckit.specify&lt;/code&gt;, &lt;code&gt;/speckit.plan&lt;/code&gt;, &lt;code&gt;/speckit.tasks&lt;/code&gt;) that walk you through creating a constitution of project principles, functional specifications, implementation plans, and task breakdowns.&lt;/p&gt;

&lt;p&gt;The problem is that &lt;code&gt;/speckit.implement&lt;/code&gt; often felt like jumping off a cliff. You'd done all this careful work - user stories, acceptance criteria, API contracts, data models - and then the AI agent would start implementing with the enthusiasm of someone who'd skimmed your notes five minutes before the meeting. It had the information. It just didn't have the &lt;em&gt;workflow&lt;/em&gt; to systematically consume it.&lt;/p&gt;

&lt;p&gt;This is where I started experimenting with Antigravity's workflow system, and found something that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Tools, Two Strengths
&lt;/h2&gt;

&lt;p&gt;Here's what I realized: Spec-Kit excels at structured knowledge capture. Antigravity excels at autonomous execution with verification. They're solving different halves of the same problem.&lt;/p&gt;

&lt;p&gt;Spec-Kit's templates force you to think through features at the right level of abstraction. When you use &lt;code&gt;/speckit.specify&lt;/code&gt;, you're prompted to write user stories, define acceptance criteria, and identify dependencies. When you use &lt;code&gt;/speckit.plan&lt;/code&gt;, you're committing to a tech stack and architecture. When you use &lt;code&gt;/speckit.tasks&lt;/code&gt;, the AI breaks your plan into ordered, dependency-aware tasks with file paths and test requirements. By the time you're done, you have a &lt;code&gt;specs/001-your-feature/&lt;/code&gt; directory containing &lt;code&gt;spec.md&lt;/code&gt;, &lt;code&gt;plan.md&lt;/code&gt;, &lt;code&gt;tasks.md&lt;/code&gt;, and often supporting documents like &lt;code&gt;api-spec.json&lt;/code&gt; and &lt;code&gt;data-model.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is excellent. This is also where most workflows stop being useful.&lt;/p&gt;

&lt;p&gt;Antigravity changes the equation because it's designed as an "agent-first" platform. Its agents don't just read your prompts and respond - they plan, execute across editor, terminal, and browser, and generate artifacts that let you verify their work without reading logs. The Manager surface lets you spawn multiple agents working in different workspaces asynchronously. You can leave feedback on artifacts the way you'd comment on a document, and the agent incorporates your input without stopping.&lt;/p&gt;

&lt;p&gt;The trick is teaching Antigravity to consume what Spec-Kit produces.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Handoff Workflow
&lt;/h2&gt;

&lt;p&gt;I use Gemini CLI to run the Spec-Kit phases. Constitution, specification, clarification, planning, task generation - all through the &lt;code&gt;/speckit.*&lt;/code&gt; commands. This works because Gemini is good at iterative refinement and structured thinking. We go back and forth on requirements, clarify underspecified areas with &lt;code&gt;/speckit.clarify&lt;/code&gt;, validate plans against the project constitution.&lt;/p&gt;

&lt;p&gt;Then comes the handoff. Instead of using &lt;code&gt;/speckit.implement&lt;/code&gt; directly, I switch to Antigravity with a custom workflow that reads the Spec-Kit artifacts. Here's the workflow I've been using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Implement&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;based&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;provided&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;spec&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;directory."&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="gs"&gt;**Identify Spec Directory**&lt;/span&gt;:
&lt;span class="p"&gt;   -&lt;/span&gt; Look at the user's message to see if they provided a path to a spec directory (e.g., &lt;span class="sb"&gt;`specs/001-foo`&lt;/span&gt;).
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**IF**&lt;/span&gt; a path is provided: Set that as the &lt;span class="sb"&gt;`TargetDirectory`&lt;/span&gt;.
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**IF**&lt;/span&gt; no path is provided: STOP and ask the user, "Which spec directory would you like me to work on?"
&lt;span class="p"&gt;
2.&lt;/span&gt; &lt;span class="gs"&gt;**Read Context**&lt;/span&gt;:
&lt;span class="p"&gt;   -&lt;/span&gt; Use &lt;span class="sb"&gt;`list_dir`&lt;/span&gt; to inspect the &lt;span class="sb"&gt;`TargetDirectory`&lt;/span&gt;.
&lt;span class="p"&gt;   -&lt;/span&gt; Read all relevant files (Markdown, text, etc.) within that directory to understand the requirements.
&lt;span class="p"&gt;
3.&lt;/span&gt; &lt;span class="gs"&gt;**Initialize Task**&lt;/span&gt;:
&lt;span class="p"&gt;   -&lt;/span&gt; Create or update &lt;span class="sb"&gt;`task.md`&lt;/span&gt; by extracting requirements from the files you just read.
&lt;span class="p"&gt;   -&lt;/span&gt; Break down the work into a checklist.
&lt;span class="p"&gt;
4.&lt;/span&gt; &lt;span class="gs"&gt;**Plan Implementation**&lt;/span&gt;:
&lt;span class="p"&gt;   -&lt;/span&gt; Create &lt;span class="sb"&gt;`implementation_plan.md`&lt;/span&gt; detailing the changes required.
&lt;span class="p"&gt;   -&lt;/span&gt; Group changes by component/file.
&lt;span class="p"&gt;
5.&lt;/span&gt; &lt;span class="gs"&gt;**Review**&lt;/span&gt;:
&lt;span class="p"&gt;   -&lt;/span&gt; Ask the user to review the plan before starting execution.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This workflow tells Antigravity to consume the entire spec directory before doing anything. It reads the specification, the plan, the tasks, the data models, the API contracts. Then it creates its own working documents - &lt;code&gt;task.md&lt;/code&gt; and &lt;code&gt;implementation_plan.md&lt;/code&gt; - that translate Spec-Kit's outputs into its native format. The review step is critical: you verify that the agent understood your specifications before it starts writing code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going All-In: The Full Lifecycle Workflow
&lt;/h2&gt;

&lt;p&gt;The handoff approach works well when you want to use Gemini CLI's conversational strengths for the specification phase. But there's another approach: have Antigravity run the entire Spec-Kit lifecycle itself, executing the bash scripts directly. This is more aggressive. Instead of switching tools between specification and implementation, Antigravity manages the whole thing. It runs the Spec-Kit scripts to set up feature branches and directories, generates the artifacts using the system prompts, and then executes. Everything stays in one context. Here's the workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Execute the full Spec-Kit lifecycle from feature creation through implementation.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="gs"&gt;**Initialize/Verify Spec-Kit**&lt;/span&gt;:
   Ensure the &lt;span class="sb"&gt;`.specify`&lt;/span&gt; directory exists.
   &lt;span class="sb"&gt;` `&lt;/span&gt; &lt;span class="sb"&gt;`bash
   # Initialize Spec-Kit if needed
   if [ ! -d ".specify" ]; then
       pip install "git+https://github.com/github/spec-kit.git" || true
       specify init . --ai gemini --no-git --force
   fi
   # Ensure scripts are executable
   chmod +x .specify/scripts/bash/*.sh
   `&lt;/span&gt; &lt;span class="sb"&gt;` `&lt;/span&gt;
&lt;span class="p"&gt;
2.&lt;/span&gt; &lt;span class="gs"&gt;**Create New Feature**&lt;/span&gt;:
   Use the &lt;span class="sb"&gt;`create-new-feature.sh`&lt;/span&gt; script to set up the branch and directory structure.
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Input**&lt;/span&gt;: Ask user for Feature Description and Short Name.
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Command**&lt;/span&gt;:
     &lt;span class="sb"&gt;` `&lt;/span&gt; &lt;span class="sb"&gt;`bash
     ./.specify/scripts/bash/create-new-feature.sh "&amp;lt;Feature Description&amp;gt;" --short-name "&amp;lt;Short Name&amp;gt;"
     `&lt;/span&gt; &lt;span class="sb"&gt;` `&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Note**&lt;/span&gt;: This script creates the feature branch (if git is available), creates &lt;span class="sb"&gt;`specs/&amp;lt;feature-branch&amp;gt;/`&lt;/span&gt;, and copies &lt;span class="sb"&gt;`spec-template.md`&lt;/span&gt;.
&lt;span class="p"&gt;
3.&lt;/span&gt; &lt;span class="gs"&gt;**Generate Constitution**&lt;/span&gt;:
   Use the constitution prompt template.
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Source**&lt;/span&gt;: &lt;span class="sb"&gt;`.specify/memory/constitution.md`&lt;/span&gt; (or &lt;span class="sb"&gt;`.specify/templates/constitution.md`&lt;/span&gt; if moved).
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Generate &lt;span class="sb"&gt;`specs/constitution.md`&lt;/span&gt; using the template as a system prompt.
&lt;span class="p"&gt;
4.&lt;/span&gt; &lt;span class="gs"&gt;**Generate Specification**&lt;/span&gt;:
   Fill in the &lt;span class="sb"&gt;`spec.md`&lt;/span&gt; created by the script.
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**File**&lt;/span&gt;: &lt;span class="sb"&gt;`specs/&amp;lt;feature-branch&amp;gt;/spec.md`&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Template**&lt;/span&gt;: &lt;span class="sb"&gt;`.specify/templates/spec-template.md`&lt;/span&gt; (already copied by script).
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Prompt**&lt;/span&gt;: &lt;span class="sb"&gt;`.specify/specify.md`&lt;/span&gt; (system prompt).
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Update &lt;span class="sb"&gt;`specs/&amp;lt;feature-branch&amp;gt;/spec.md`&lt;/span&gt; with the generated content.
&lt;span class="p"&gt;
5.&lt;/span&gt; &lt;span class="gs"&gt;**Generate Implementation Plan**&lt;/span&gt;:
   Use the &lt;span class="sb"&gt;`setup-plan.sh`&lt;/span&gt; script (if available) or manually create &lt;span class="sb"&gt;`plan.md`&lt;/span&gt; from template.
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Command**&lt;/span&gt;:
     &lt;span class="sb"&gt;` `&lt;/span&gt; &lt;span class="sb"&gt;`bash
     # Check if setup-plan exists and run it, or copy template
     if [ -f ".specify/scripts/bash/setup-plan.sh" ]; then
         ./.specify/scripts/bash/setup-plan.sh
     else
         cp .specify/templates/plan-template.md specs/&amp;lt;feature-branch&amp;gt;/plan.md
     fi
     `&lt;/span&gt; &lt;span class="sb"&gt;` `&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Prompt**&lt;/span&gt;: &lt;span class="sb"&gt;`.specify/plan.md`&lt;/span&gt; (system prompt).
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Update &lt;span class="sb"&gt;`specs/&amp;lt;feature-branch&amp;gt;/plan.md`&lt;/span&gt;.
&lt;span class="p"&gt;
6.&lt;/span&gt; &lt;span class="gs"&gt;**Generate Task List**&lt;/span&gt;:
   Create &lt;span class="sb"&gt;`tasks.md`&lt;/span&gt; from template.
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Command**&lt;/span&gt;: &lt;span class="sb"&gt;`cp .specify/templates/tasks-template.md specs/&amp;lt;feature-branch&amp;gt;/tasks.md`&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Prompt**&lt;/span&gt;: &lt;span class="sb"&gt;`.specify/tasks.md`&lt;/span&gt; (system prompt).
&lt;span class="p"&gt;   -&lt;/span&gt; &lt;span class="gs"&gt;**Action**&lt;/span&gt;: Update &lt;span class="sb"&gt;`specs/&amp;lt;feature-branch&amp;gt;/tasks.md`&lt;/span&gt;.
&lt;span class="p"&gt;
7.&lt;/span&gt; &lt;span class="gs"&gt;**Refine and Validate**&lt;/span&gt;:
   Review artifacts with the user.
&lt;span class="p"&gt;
8.&lt;/span&gt; &lt;span class="gs"&gt;**Execute**&lt;/span&gt;:
   Proceed to implementation using the generated tasks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This workflow is opinionated about consistency with the Spec-Kit methodology. It uses the actual scripts GitHub provides rather than reimplementing their logic. When &lt;code&gt;create-new-feature.sh&lt;/code&gt; runs, it handles branch creation, directory setup, and template copying exactly as Spec-Kit intends. The system prompts in &lt;code&gt;.specify/specify.md&lt;/code&gt;, &lt;code&gt;.specify/plan.md&lt;/code&gt;, and &lt;code&gt;.specify/tasks.md&lt;/code&gt; guide artifact generation the same way the slash commands would.&lt;/p&gt;

&lt;p&gt;The tradeoff: you lose some of the conversational back-and-forth that makes Gemini CLI good for clarification. Antigravity is more execution-oriented. It'll ask for inputs and generate outputs, but the iterative refinement loop feels different than chatting through requirements with &lt;code&gt;/speckit.clarify&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When do I use which? If I'm working on a feature where I know the requirements well and just need to get through the spec-to-implementation pipeline quickly, the full lifecycle workflow is faster. If requirements are fuzzy and I need to think through user stories with an AI that can push back on my assumptions, I use Gemini CLI for specification and hand off to Antigravity for implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;p&gt;The power here isn't in either tool individually. It's in the division of labor.&lt;/p&gt;

&lt;p&gt;Spec-Kit handles what I'd call "epistemic work" - figuring out what should exist and why. Its templates push you to answer hard questions before you write code. What are the user stories? What constitutes acceptance? What dependencies exist between components? The &lt;code&gt;/speckit.clarify&lt;/code&gt; command even generates targeted questions to fill gaps in your specification. You're doing the thinking work that, if skipped, leads to endless implementation rework.&lt;/p&gt;

&lt;p&gt;Antigravity handles execution work - actually building the thing. But more importantly, it handles verification. The artifact system means you're not reading through terminal output trying to figure out if the agent did what you asked. You're looking at screenshots, implementation plans, and task lists. You can leave comments like "this doesn't match the API contract" and the agent adjusts without stopping its flow.&lt;/p&gt;

&lt;p&gt;The handoff workflow bridges them. When Antigravity reads the spec directory, it's not just getting vague requirements - it's getting structured requirements in a format designed for AI consumption. Spec-Kit already organized the information into categories that make sense for implementation: user stories define the scope, acceptance criteria define completion, task lists define order, API specs define contracts.&lt;/p&gt;
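&lt;p&gt;A rough sketch of what that consumption amounts to: flattening the spec directory into one context document. This approximates the workflow's Read Context step, not Antigravity's actual mechanism, and the paths are placeholders:&lt;/p&gt;

```shell
# Flatten a Spec-Kit feature directory into one context file,
# roughly what the handoff workflow's "Read Context" step gives the agent.
SPEC_DIR="specs/001-user-login"            # placeholder feature directory
mkdir -p "$SPEC_DIR"
printf '# Spec\n' > "$SPEC_DIR/spec.md"    # stand-in artifacts for the demo
printf '# Plan\n' > "$SPEC_DIR/plan.md"

out="context.md"
: > "$out"
for f in "$SPEC_DIR"/*.md; do
    printf '\n--- %s ---\n' "$f" >> "$out"  # label each artifact with its path
    cat "$f" >> "$out"
done
echo "wrote $out"
```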

&lt;h2&gt;
  
  
  The Agentic Promise, Actually Delivered
&lt;/h2&gt;

&lt;p&gt;I've written before about agentic AI for autonomous project management - the idea that AI can act with a degree of independence and adaptability, not just executing predefined rules but making decisions within a framework. Most "agentic" workflows I've tried fall short of this promise. They're really just chat interfaces with more elaborate prompts.&lt;/p&gt;

&lt;p&gt;This combination actually feels agentic. I define the constraints and goals through Spec-Kit. Antigravity operates autonomously within those constraints, making real-time decisions about how to implement features, what tests to write, how to structure components. The artifacts give me visibility without requiring constant supervision. I'm managing outcomes, not keystrokes.&lt;/p&gt;

&lt;p&gt;The context problem in AI pair programming is real - AI partners forget everything between sessions. But specification documents don't forget. They're persistent context in file form. When Antigravity reads my &lt;code&gt;spec.md&lt;/code&gt;, it's getting the same shared understanding that an XP pair would build over weeks of collaboration, delivered in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Still Figuring Out
&lt;/h2&gt;

&lt;p&gt;This workflow isn't perfect. A few open questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration across sessions.&lt;/strong&gt; Right now, if Antigravity's implementation needs changes, I'm manually updating specs and re-running the handoff. There should be a feedback loop where the agent can propose spec changes when it discovers implementation impossibilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Artifact accumulation.&lt;/strong&gt; The workflow generates artifacts - plans, checklists, screenshots. These are useful for verification but they accumulate. I need a better system for archiving completed features and cleaning up intermediate documents.&lt;/p&gt;
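&lt;p&gt;One pattern I'm experimenting with for this is a small archive script - the &lt;code&gt;.archive&lt;/code&gt; location and naming scheme here are my own assumptions, not anything Spec-Kit or Antigravity prescribes:&lt;/p&gt;

```shell
# Move a completed feature's spec directory into a date-stamped archive.
# The .archive location and naming are assumptions, not a Spec-Kit convention.
FEATURE="001-user-login"                   # placeholder feature
mkdir -p "specs/$FEATURE"                  # demo setup
touch "specs/$FEATURE/spec.md"

archive_feature() {
    dest="specs/.archive/$(date +%Y-%m-%d)-$1"
    mkdir -p "specs/.archive"
    mv "specs/$1" "$dest"
    echo "archived to $dest"
}
archive_feature "$FEATURE"
```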

&lt;p&gt;&lt;strong&gt;Multi-agent coordination.&lt;/strong&gt; Antigravity's Manager surface lets you spawn multiple agents, but coordinating their work on a shared codebase introduces merge complexity. Spec-Kit's task dependencies could potentially inform agent orchestration, but I haven't figured out the right pattern yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model switching costs.&lt;/strong&gt; Using Gemini CLI for specification and Antigravity for implementation means context doesn't flow automatically between them. The spec directory is the serialization format. This mostly works, but there's friction.&lt;/p&gt;
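&lt;p&gt;Since the spec directory is the serialization format, a quick completeness check before the handoff catches most of that friction. A sketch, assuming the three core files described earlier (paths are placeholders):&lt;/p&gt;

```shell
# Pre-handoff check: make sure the spec directory carries the three core
# artifacts the implementation workflow expects (paths are placeholders).
SPEC_DIR="specs/001-user-login"
mkdir -p "$SPEC_DIR"                       # demo setup
printf '# Spec\n'  > "$SPEC_DIR/spec.md"
printf '# Plan\n'  > "$SPEC_DIR/plan.md"
printf '# Tasks\n' > "$SPEC_DIR/tasks.md"

missing=0
for f in spec.md plan.md tasks.md; do
    if [ ! -s "$SPEC_DIR/$f" ]; then
        echo "missing or empty: $SPEC_DIR/$f"
        missing=1
    fi
done
if [ "$missing" -eq 0 ]; then
    echo "spec directory complete"
fi
```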

&lt;p&gt;As I figure these things out, I'll update with more posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;If you want to experiment with this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Spec-Kit: &lt;code&gt;uv tool install specify-cli --from git+https://github.com/github/spec-kit.git&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Initialize a project: &lt;code&gt;specify init my-project --ai gemini&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run through the spec workflow with Gemini CLI: &lt;code&gt;/speckit.constitution&lt;/code&gt;, &lt;code&gt;/speckit.specify&lt;/code&gt;, &lt;code&gt;/speckit.clarify&lt;/code&gt;, &lt;code&gt;/speckit.plan&lt;/code&gt;, &lt;code&gt;/speckit.tasks&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Download Antigravity from &lt;a href="http://goo.gle/AGY" rel="noopener noreferrer"&gt;antigravity.google/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Create a workflow like the one above and point it at your spec directory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole thing takes about an hour to set up. The learning curve is mostly in writing good specifications - which, frankly, is where most AI-assisted development fails anyway. If you can't specify what you want, you won't get it. Spec-Kit gives you scaffolding for the hard thinking. Antigravity gives you execution horsepower once the thinking is done.&lt;/p&gt;

&lt;p&gt;The future of AI-assisted development isn't a single tool that does everything. It's orchestrating specialists - spec-writers, planners, implementers, verifiers - and defining clean handoffs between them. We're just starting to figure out what that looks like.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>antigravity</category>
      <category>speckit</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
